# **Unsupervised Multilingual Text Classification With Zero-Shot Approach Using Hugging Face Transformers 🤗**
> https://zoumanakeita.medium.com/  

## Useful Libraries

In [None]:
# Install transformers library
!pip install transformers==3.1.0

# Import the Transformers pipeline library
from transformers import pipeline

# Preprocessing and visualization libraries
import plotly.express as px
import pandas as pd 
import numpy as np
import textwrap

In [None]:
wrapper = textwrap.TextWrapper(width=80)

In [None]:
# Load the dataset
data_url = "https://raw.githubusercontent.com/keitazoumana/Zero-Shot-Text-Classification/main/bbc-text.csv"
news_data = pd.read_csv(data_url)
news_data.head()

| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |


In [None]:
# Define the classifier
#sc_classifier = pipeline("zero-shot-classification")

In [None]:
zsmlc_classifier = pipeline("zero-shot-classification", model='joeddav/xlm-roberta-large-xnli')

We first classify a single text to see the format of the result.

In [None]:
# Select the text of the first row.
sequences = news_data.iloc[0]["text"]

# Get all the candidate labels
candidate_labels = list(news_data.category.unique())

# Run the classifier
result = zsmlc_classifier(sequences, candidate_labels, multi_class = True)

# Show the result
result

{'labels': ['tech', 'entertainment', 'business', 'sport', 'politics'],
 'scores': [0.8140193223953247,
  0.802348256111145,
  0.791598916053772,
  0.7419557571411133,
  0.7163865566253662],
 'sequence': 'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky

The output is a dictionary with three keys:
- labels: the candidate labels, sorted by decreasing score.
- scores: the score of each label, in the same order as the labels. With `multi_class = True`, each label is scored independently, so the scores do not sum to 1.
- sequence: the input text that was classified.
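
To make the difference between independent and competing scores concrete, here is a minimal NumPy sketch of the two normalizations, as we understand the zero-shot pipeline to apply them. The logit values are invented for illustration and do not come from the model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Hypothetical NLI logits for three candidate labels.
# Columns: [contradiction, entailment]; the numbers are made up.
logits = np.array([
    [-1.0, 2.0],
    [-0.5, 1.5],
    [0.5, 0.2],
])

# multi_class = True style: each label is scored on its own
# (entailment vs. contradiction), so the scores need not sum to 1.
independent_scores = softmax(logits, axis=1)[:, 1]

# multi_class = False style: the entailment logits compete across labels,
# so the scores sum to 1.
competing_scores = softmax(logits[:, 1], axis=0)

print(independent_scores, independent_scores.sum())
print(competing_scores, competing_scores.sum())
```

This is consistent with the result above, where the five scores add up to well over 1.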

We can convert the final result into a DataFrame after removing the 'sequence' key from the dictionary.


In [None]:
# Delete the sequence key
del result["sequence"]
result_df = pd.DataFrame(result)
result_df

| | labels | scores |
|---|---|---|
| 0 | tech | 0.814019 |
| 1 | entertainment | 0.802348 |
| 2 | business | 0.791599 |
| 3 | sport | 0.741956 |
| 4 | politics | 0.716387 |


In [None]:
# Plot the probability distributions
fig = px.bar(result_df, x='labels', y='scores')
fig.show()

### Run Predictions on a Sample of the Dataset

In [None]:
def make_prediction(clf_result):

  # Get the index of the maximum probability score
  max_index = np.argmax(clf_result["scores"])
  predicted_label = clf_result["labels"][max_index]

  return predicted_label

In [None]:
#print(make_prediction(result))
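
A quick way to sanity-check `make_prediction` without loading the model is to call it on a hand-written dictionary shaped like the classifier output. The function is repeated so the snippet runs on its own, and the labels and scores in `fake_result` are invented for illustration.

```python
import numpy as np

def make_prediction(clf_result):
    # Same logic as the function above: pick the label with the highest score
    max_index = np.argmax(clf_result["scores"])
    return clf_result["labels"][max_index]

# Hypothetical classifier output (not produced by the model)
fake_result = {
    "labels": ["sport", "tech", "politics"],
    "scores": [0.15, 0.80, 0.05],
}

predicted = make_prediction(fake_result)
print(predicted)
```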

In [None]:
def select_subset_data(data, label_column, each_target_size = 2):

  all_targets = list(data[label_column].unique())
  list_dataframes = []

  for label in all_targets:
    subset = data[data[label_column]==str(label)]
    subset = subset.sample(each_target_size)

    list_dataframes.append(subset)

  return pd.concat(list_dataframes)
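
Note that `select_subset_data` draws a different random sample on every run, so the selected rows change between executions. A possible alternative, sketched here on a small made-up DataFrame, uses pandas `groupby(...).sample(...)` with a fixed `random_state` to make the subset reproducible. The name `select_subset_data_seeded` and the `seed` parameter are illustrative and not part of the original notebook, and `GroupBy.sample` assumes pandas 1.1 or later.

```python
import pandas as pd

def select_subset_data_seeded(data, label_column, each_target_size=2, seed=42):
    # Draw the same number of rows from every category, reproducibly
    return data.groupby(label_column).sample(n=each_target_size, random_state=seed)

# Toy data standing in for the news DataFrame
toy_data = pd.DataFrame({
    "category": ["tech", "tech", "tech", "sport", "sport", "sport"],
    "text": ["a", "b", "c", "d", "e", "f"],
})

subset = select_subset_data_seeded(toy_data, "category")
print(subset)
```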

In [None]:
def run_batch_prediction(original_data, label_column, desc_col, my_classifier = zsmlc_classifier):

  # Make a copy of the data
  data_copy = original_data.copy()

  # The candidate labels are the same for every row, so compute them once
  candidate_labels = list(original_data[label_column].unique())

  # The list that will contain the model's predictions
  final_list_labels = []

  for index in range(len(original_data)):
    # Run classification on the text of the current row
    sequences = original_data.iloc[index][desc_col]
    result = my_classifier(sequences, candidate_labels, multi_class = True)

    # Keep the label with the highest score
    final_list_labels.append(make_prediction(result))

  # Create the new column for the predictions
  data_copy["clf_predictions"] = final_list_labels

  return data_copy

In [None]:
# Get the subset of dataframe
subset_news_data = select_subset_data(news_data, "category")

# Run the predictions on the new dataset
pred_res_data = run_batch_prediction(subset_news_data, "category", "text")
pred_res_data

| | category | text | clf_predictions |
|---|---|---|---|
| 1240 | tech | gadget growth fuels eco concerns technology fi... | tech |
| 221 | tech | world tour for top video gamers two uk gamers ... | sport |
| 2193 | business | us consumer confidence up consumers confidenc... | business |
| 1758 | business | india s deccan seals $1.8bn deal air deccan ha... | business |
| 227 | sport | all black magic: new zealand rugby playing col... | sport |
| 471 | sport | parry puts gerrard above money listen to the... | sport |
| 633 | entertainment | gallery unveils interactive tree a christmas t... | entertainment |
| 1175 | entertainment | director nair s vanity project indian film dir... | entertainment |
| 293 | politics | tory stalking horse meyer dies sir anthony m... | politics |
| 2121 | politics | women mps reveal sexist taunts women mps endur... | politics |
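
With the true category and the predicted label side by side, their agreement can be summarized as an accuracy. The sketch below rebuilds the two columns of the table above by hand (the text column is omitted) so that it runs without the model; the variable names are our own.

```python
import pandas as pd

# The category and clf_predictions columns copied from the table above
pred_table = pd.DataFrame({
    "category": ["tech", "tech", "business", "business", "sport",
                 "sport", "entertainment", "entertainment", "politics", "politics"],
    "clf_predictions": ["tech", "sport", "business", "business", "sport",
                        "sport", "entertainment", "entertainment", "politics", "politics"],
}, index=[1240, 221, 2193, 1758, 227, 471, 633, 1175, 293, 2121])

# Fraction of rows where the prediction matches the true category
accuracy = (pred_table["category"] == pred_table["clf_predictions"]).mean()

# Rows where the classifier disagrees with the true category
mismatches = pred_table[pred_table["category"] != pred_table["clf_predictions"]]

print(f"Accuracy: {accuracy:.2f}")
print(mismatches)
```

On this ten-row sample, only row 221 is misclassified. A sample this small says little about performance on the full dataset.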


### Check Row 221

In [None]:
def show_labels_prediction(data, row_of_interest):

  # Select the text of the row of interest.
  sequences = data.iloc[row_of_interest]["text"]

  # Get all the candidate labels
  candidate_labels = list(data.category.unique())

  # Run the classifier
  result = zsmlc_classifier(sequences, candidate_labels, multi_class = True)

  # Wrap the text to the wrapper width so it prints readably
  result['sequence'] = wrapper.fill(result['sequence'])

  # Show the corresponding text
  print(result["sequence"])

  # Delete the sequence key
  del result["sequence"]

  result_df = pd.DataFrame(result)

  # Plot the probability distributions
  fig = px.bar(result_df, x='labels', y='scores')
  fig.show()


In [None]:
show_labels_prediction(news_data, 221)

world tour for top video gamers two uk gamers are about to embark on a world
tour as part of the most lucrative-ever global games tournament.  aaron foster
and david treacy have won the right to take part in a tournament offering $1m in
total prize money. the cash will be handed out over 10 separate competitions in
a continent-hopping contest organised by the cyberathlete professional league.
as part of their prize the pair will have their travel costs paid to ensure they
can get to the different bouts.  the cpl world tour kicks off in mid-february
and the first leg will be in istanbul. all ten bouts of the tournament will be
played throughout 2005  each one in a different country. at each stop $50 000 in
prize money will be up for grabs. the tournament champion for each leg of the
cpl world tour will walk away with a $15 000 prize. the winner of the grand
final will get a prize purse of $150 000 from a total pot of $500 000.  winners
of each stage of the tour automatically get a place

## French Data Analysis 

In [None]:
sequences = "L’éducation inclusive, en donnant une chance à tous les enfants, quels que soient leurs besoins particuliers, permet de construire un monde sans barrières. Pas de mot. Mais des signes pour dessiner avec patriotisme les paroles de l’hymne national. Ici, la force des gestes dit tout aussi haut la fierté d’appartenir à la Côte d’Ivoire. Et pour ce pays, terre d’espérance, tous ses fils comptent. Tous, même ceux qui portent un handicap. Et ces enfants sourds savent qu’ils peuvent aussi compter sur leur pays. En effet, la Côte d’Ivoire a inscrit la protection des droits des personnes en situation de handicap au cœur de sa Constitution."

# Define the candidate labels (negative, positive, neutral)
candidate_labels = ["négatif", "positif", "neutre"]

# Run the classifier
result = zsmlc_classifier(sequences, candidate_labels, multi_class = True)

# Delete the sequence key
del result["sequence"]
result_df = pd.DataFrame(result)

# Plot the probability distributions
fig = px.bar(result_df, x='labels', y='scores')
fig.show()

In [None]:
print(wrapper.fill(sequences))

L’éducation inclusive, en donnant une chance à tous les enfants, quels que
soient leurs besoins particuliers, permet de construire un monde sans barrières.
Pas de mot. Mais des signes pour dessiner avec patriotisme les paroles de
l’hymne national. Ici, la force des gestes dit tout aussi haut la fierté
d’appartenir à la Côte d’Ivoire. Et pour ce pays, terre d’espérance, tous ses
fils comptent. Tous, même ceux qui portent un handicap. Et ces enfants sourds
savent qu’ils peuvent aussi compter sur leur pays. En effet, la Côte d’Ivoire a
inscrit la protection des droits des personnes en situation de handicap au cœur
de sa Constitution.
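
The pipeline turns each candidate label into a hypothesis sentence through its `hypothesis_template` argument, whose default is an English sentence. For French text, it may be worth trying a French template. The sketch below only shows how such a template expands into hypotheses. The template `"Ce texte est {}."` is an assumption chosen for illustration, and we have not checked whether it changes the scores on this example.

```python
# French candidate labels from the cell above
candidate_labels = ["négatif", "positif", "neutre"]

# Hypothetical French template; the pipeline default is an English sentence
hypothesis_template = "Ce texte est {}."

# The hypothesis sentences the NLI model would compare the text against
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]
print(hypotheses)

# With the classifier loaded, the template could be passed like this:
# result = zsmlc_classifier(sequences, candidate_labels,
#                           hypothesis_template=hypothesis_template,
#                           multi_class=True)
```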
