# Omdena France Chapter - Introductory materials for NLP challenge

Run the following line if you encounter an import error while going through the cells:



In [None]:
# !pip install datasets transformers tensorflow torch spacy scikit-learn pandas numpy

We will go through an example using the HuggingFace, Scikit-Learn and SpaCy libraries to load a dataset of movie reviews, pre-process the data, then train and evaluate models.

## Pre-processing text data

First, we load the dataset from the HuggingFace Hub: https://huggingface.co/datasets/imdb

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")

Get some information about the dataset

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

Look at the first row of the train set

In [None]:
dataset['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

Look at the first 4 rows of the test set (the slice `0:4` stops before index 4)

In [None]:
dataset['test'][0:4]

{'text': ['I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as

Convert the dataset to Pandas so that we can first apply classical text transformations with Scikit-Learn.

In [None]:
import pandas as pd

df_test = pd.DataFrame(dataset['test'] )
df_train = pd.DataFrame(dataset['train'] )

In [None]:
print(len(df_train), len(df_test))

25000 25000


In [None]:
df_train.shape

(25000, 2)

In [None]:
df_test.shape

(25000, 2)

### Turn text content into numerical features vectors (Bag Of Words)

In [None]:
df_train

|       | text | label |
|-------|------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |
| ... | ... | ... |
| 24995 | A hit at the time but now better categorised a... | 1 |
| 24996 | I love this movie like no other. Another time ... | 1 |
| 24997 | This film and it's sequel Barry Mckenzie holds... | 1 |
| 24998 | 'The Adventures Of Barry McKenzie' started lif... | 1 |


Here we assign a fixed integer id to each word occurring in any document of the training set. Each document then becomes a row of counts: the value in column j is the number of occurrences, in that document, of the word whose id is j.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X_train_counts = vectorizer.fit_transform(df_train.text)
X_train_counts.shape

(25000, 74849)

In [None]:
X_train_counts

<25000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 3445861 stored elements in Compressed Sparse Row format>

BoW matrices are high-dimensional and sparse: each document uses only a small fraction of the vocabulary, so most entries are zero.
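To make the document-term matrix concrete, here is a minimal sketch on a hypothetical three-sentence corpus (the sentences and the `toy_*` names are invented for illustration and are not part of the IMDB data). It shows the word-to-column mapping, the dense view of the counts, and the share of non-zero cells.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, used only for illustration.
toy_corpus = [
    "a great movie",
    "a boring movie",
    "great acting and a great plot",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each word to its column index (columns are sorted alphabetically).
# The default token pattern ignores single-character tokens such as "a".
print(toy_vectorizer.vocabulary_)

# Dense view of the matrix: one row per document, one column per word.
print(toy_counts.toarray())

# Share of non-zero cells in the matrix.
density = toy_counts.nnz / (toy_counts.shape[0] * toy_counts.shape[1])
print(f"density: {density:.2f}")
```

On this tiny corpus almost half of the cells are filled. On the IMDB training matrix above, 3,445,861 stored elements out of 25,000 x 74,849 cells is a density of roughly 0.18%, which is why a sparse format is used.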

Please see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer for all the available options: for example, you can provide a stop-word list to filter on, document-frequency thresholds (`min_df`, `max_df`) or n-gram ranges.

N-grams are contiguous sequences of n items from a given sample of text. If you want to use bigrams or trigrams, you can do the following:

In [None]:
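# Note: stop_words=['english'] is a custom one-word stop list (it only removes the token "english").
# Use stop_words='english' for scikit-learn's built-in English list; the shapes shown below would then change.
vectorizer_bigrams = CountVectorizer(ngram_range=(1,2), stop_words=['english'])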
vectorizer_trigrams = CountVectorizer(ngram_range=(1,3))

In [None]:
X_train_counts_big = vectorizer_bigrams.fit_transform(df_train.text)
X_train_counts_big.shape

(25000, 1513494)

Look at some of the generated n-grams:

In [None]:
vectorizer_bigrams.get_feature_names_out()[219000:220000]

array(['but faces', 'but fact', 'but facts', 'but fading', 'but fag',
       'but fail', 'but failed', 'but failing', 'but fails',
       'but failure', 'but fainted', 'but fair', 'but fairly', 'but fake',
       'but faked', 'but falco', 'but falk', 'but fall', 'but falling',
       'but fallon', 'but falls', 'but fame', 'but familiar',
       'but family', 'but fanatic', 'but fanatical', 'but fancy',
       'but fannin', 'but fanning', 'but fans', 'but fanshawe', 'but far',
       'but fared', 'but fascinating', 'but fassbinder', 'but fast',
       'but fatal', 'but fatally', 'but fate', 'but father',
       'but favorite', 'but fay', 'but fear', 'but fears', 'but feast',
       'but features', 'but federal', 'but feed', 'but feeding',
       'but feel', 'but feeling', 'but feels', 'but felix', 'but fell',
       'but fellow', 'but fellowes', 'but felt', 'but fess',
       'but feuding', 'but few', 'but fickle', 'but fiction',
       'but fictional', 'but fido', 'but fielding', 'but 

### Depenalize short documents and penalize non informative words (TF-IDF)

Because they contain more words, longer documents have higher average count values than shorter ones when using BoW. To correct this, you can use the TF-IDF technique. Term frequencies (TF) are obtained by dividing the occurrences of each word in a document by the total number of words in that document (scikit-learn's `TfidfTransformer` normalises each row to unit L2 norm by default instead). Inverse document frequencies (IDF) then give less weight to words that occur in many documents of the corpus.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer(use_idf=False).fit(X_train_counts_big)  # use_idf=False: term frequencies only, frequently occurring words are not downscaled
X_train_tf = transformer.transform(X_train_counts_big)
X_train_tf.shape

(25000, 1513494)

In [None]:
X_train_tf

<25000x1513494 sparse matrix of type '<class 'numpy.float64'>'
	with 8763731 stored elements in Compressed Sparse Row format>
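To see what the weighting does numerically, here is a minimal sketch that recomputes TF-IDF by hand on a hypothetical toy corpus and compares it with `TfidfTransformer` using its default settings (smoothed IDF, then L2 normalisation of each row). The corpus and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only for illustration.
docs = ["good movie", "bad movie", "good good plot"]
counts_sparse = CountVectorizer().fit_transform(docs)
counts = counts_sparse.toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed IDF (scikit-learn default)

tfidf_manual = counts * idf                        # weight raw counts by IDF
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)  # L2-normalise each row

tfidf_sklearn = TfidfTransformer().fit_transform(counts_sparse).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

A word like "movie", present in two of the three documents, gets a smaller IDF than "plot", which appears in only one.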

**You can actually use TfidfVectorizer directly on your text data if you don't want to compute BoW first** https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

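# Note: stop_words=['english'] only removes the token "english"; use stop_words='english' for the built-in English list.
vectorizer = TfidfVectorizer(ngram_range=(1,2), stop_words=['english'])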
X = vectorizer.fit_transform(df_train.text)

In [None]:
X.shape

(25000, 1513494)

### Use surrounding words information with word embeddings

Word embeddings are real-valued vectors that encode the meaning of words: words that are close in the vector space are expected to have similar meanings.
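Closeness in the vector space is usually measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors (invented for illustration, real embeddings have hundreds of dimensions) to show how related words score close to 1 and unrelated ones close to 0.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "film":   np.array([0.9, 0.1, 0.0]),
    "movie":  np.array([0.8, 0.2, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}

sim_related = cosine_similarity(toy_embeddings["film"], toy_embeddings["movie"])
sim_unrelated = cosine_similarity(toy_embeddings["film"], toy_embeddings["banana"])
print(f"film vs movie:  {sim_related:.2f}")
print(f"film vs banana: {sim_unrelated:.2f}")
```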

We will use the SpaCy library to load English vectors and use them for preprocessing. This is Transfer Learning: we reuse the knowledge stored in the loaded embeddings instead of training them ourselves.

By the way, you can follow this Gensim tutorial to train Continuous Bag of Words or Skip-gram word embeddings: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html. You'll then be able to load them and use them as input for different tasks.

In [None]:
import spacy

In [None]:
# !python -m spacy download en_core_web_lg

In [None]:
nlp = spacy.load("en_core_web_lg")

Get a vector for one sentence of our corpus

In [None]:
df_train.text[0].split('.')[0]

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967'

In [None]:
sentence_vector = nlp("I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967").vector

In [None]:
sentence_vector

array([-1.7862697 , -0.1537471 , -1.7477717 , -0.84836197,  2.6528325 ,
       -0.11192326,  1.2173842 ,  4.6282744 , -1.4731263 ,  0.2748569 ,
        5.231145  ,  0.8304838 , -2.93644   ,  1.0369856 ,  0.6851696 ,
       -0.18296947,  0.7799292 , -0.09544741, -1.3605273 , -0.52190006,
        0.6274715 ,  1.0204049 , -1.4234464 , -0.95718956, -0.1094735 ,
       -2.088362  , -3.2423391 , -0.31368458, -1.461495  ,  1.7029223 ,
       -0.09860282,  0.14272071, -0.83620113, -0.80203307, -1.7206116 ,
       -1.407673  , -1.2266843 ,  1.8119164 ,  1.402984  ,  0.32291126,
       -0.7243057 ,  0.5358169 ,  0.47586462,  0.3491549 ,  0.28934523,
        0.80980694, -1.67975   , -2.2580495 , -0.9369608 ,  1.4111634 ,
       -1.0360786 ,  1.0836804 , -0.4873039 , -2.703148  , -1.0899733 ,
        0.15592119,  1.4464921 ,  0.46340114,  0.36513934,  1.5527877 ,
        0.9013542 , -0.6744266 ,  0.07842607,  0.8165169 , -0.11249188,
        0.7271983 , -3.0049589 , -1.209582  ,  1.4390137 ,  3.14

You can run the pipeline on the whole corpus to create one vector per document, then stack these vectors to build a feature matrix that can be used as input for a Scikit-Learn model.

In [None]:
import numpy as np

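# One vector per document, reshaped to a row so that the rows can be stacked (slow on the full corpus)
# data_preprocessed = [nlp(doc).vector.reshape(1, -1) for doc in df_train.text]
# feature_matrix = np.concatenate(data_preprocessed)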
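By default, a SpaCy document vector is the average of its token vectors. The sketch below reproduces that idea with a hypothetical lookup table of 4-dimensional vectors (the table, the `document_vector` helper and the whitespace tokenisation are assumptions made for illustration, not SpaCy's implementation) and stacks the results into a feature matrix.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors, invented for illustration only.
toy_vectors = {
    "good":  np.array([1.0, 0.0, 0.5, 0.0]),
    "bad":   np.array([-1.0, 0.0, 0.5, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0, 0.5]),
}

def document_vector(text, vectors, dim=4):
    """Average the vectors of known tokens; return zeros if no token is known."""
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

toy_docs = ["good movie", "bad movie", "unknown words only"]

# One row per document, one column per embedding dimension.
toy_feature_matrix = np.vstack([document_vector(doc, toy_vectors) for doc in toy_docs])
print(toy_feature_matrix.shape)
print(toy_feature_matrix)
```

The resulting array has the `(n_documents, n_dimensions)` layout that Scikit-Learn estimators expect.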

### Gather more context with Transformers-based architectures

You can capture more context with models that use the Transformer architecture and attention mechanisms. If you are not familiar with NLP, I highly recommend reading the following articles before running the rest of the code.

https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

https://jalammar.github.io/illustrated-transformer/

But in short, you have to know that attention allows the model to focus on the relevant parts of the input sequence, and that Transformer-based architectures leverage this capability by combining positional encodings with attention layers that "map" how each element is linked to the others in the sequence, running several attention operations in parallel (multi-head attention).
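For readers who prefer code to prose, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the core of each attention head. It is a simplification under stated assumptions: the token vectors are random, and the same matrix is used as queries, keys and values, whereas a real Transformer first applies learned projection matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """Return the attended values and the attention weights."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)   # similarity of every token with every other token
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
token_vectors = rng.normal(size=(seq_len, d_model))  # hypothetical token representations

attended, attention_weights = scaled_dot_product_attention(token_vectors, token_vectors, token_vectors)
print(attended.shape)
print(attention_weights.sum(axis=1))
```

Row i of `attention_weights` tells how much token i attends to each token in the sequence. Multi-head attention runs several such operations in parallel on different projections and concatenates the results.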

We will now go back to the HuggingFace libraries to process our dataset with the tokenizer that matches BERT, one of the most popular Transformer-based models.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

The tokenizer returns the following model inputs:

In [None]:
tokenizer(dataset['train'][0]["text"])

{'input_ids': [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 383

We will tokenize the whole dataset :

In [None]:
def tokenize(batch):
    # `batch` is a dict of columns; truncation cuts reviews longer than the model's maximum length
    return tokenizer(batch["text"], truncation=True)

tokenized_dataset_train = dataset['train'].map(tokenize, batched=True)
tokenized_dataset_test = dataset['test'].map(tokenize, batched=True)

100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [00:05<00:00,  4.67ba/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [00:05<00:00,  4.27ba/s]


Before training a model, the last step is to set the dataset format according to the Deep Learning framework you will use (either TensorFlow or PyTorch).

In [None]:
# PyTorch (the IMDB label column is named "label")
# tokenized_dataset_train.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])

# TensorFlow

# from transformers import DataCollatorWithPadding

# data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
# tf_dataset = tokenized_dataset_train.to_tf_dataset(
#     columns=["input_ids", "token_type_ids", "attention_mask"],
#     label_cols=["labels"],
#     batch_size=2,
#     collate_fn=data_collator,
#     shuffle=True
# )

**Credits to original tutorials from Scikit-Learn and HuggingFace :**

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

https://huggingface.co/docs/datasets/use_dataset

## Train your sentiment analysis model

### Use a classical ML algorithm

We will first use a Scikit-Learn pipeline to train a classical SVM classifier with grid-search cross-validation. As we already preprocessed the data, we will not run the vectorization step here, but you can see that every step of your workflow can be put into a pipeline, which is more convenient.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

parameters =  {}
# {
#     'vect__ngram_range': [(1, 1), (1, 2)],
# }
pipeline = Pipeline([
#     ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
    ('clf', LinearSVC(C=1000)),
])

In [None]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1)
grid_search.fit(X, df_train.label)

If you provide many parameters to search over, you can look at their specific results with the following code (here we didn't specify any params).

In [None]:
n_candidates = len(grid_search.cv_results_['params'])
for i in range(n_candidates):
    print(i, 'params - %s; mean - %0.2f; std - %0.2f'
             % (grid_search.cv_results_['params'][i],
                grid_search.cv_results_['mean_test_score'][i],
                grid_search.cv_results_['std_test_score'][i]))

0 params - {}; mean - 0.88; std - 0.01
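To show what a non-empty parameter grid looks like, here is a minimal sketch on a hypothetical eight-sentence dataset (the texts, labels and `toy_*` names are invented for illustration). Grid keys follow the `<step name>__<parameter name>` convention, so `vect__ngram_range` targets the `vect` step of the pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical miniature dataset, invented for illustration (1 = positive, 0 = negative).
toy_texts = [
    "great movie", "wonderful acting", "loved the plot", "brilliant and fun",
    "terrible movie", "awful acting", "hated the plot", "boring and dull",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Every combination of these values is evaluated by cross-validation.
toy_parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

toy_search = GridSearchCV(toy_pipeline, toy_parameters, cv=2)
toy_search.fit(toy_texts, toy_labels)

print(len(toy_search.cv_results_["params"]))  # 2 n-gram ranges x 3 values of C
print(toy_search.best_params_)
```

Because the vectorizer is inside the pipeline, it is refitted on each training fold, which avoids leaking vocabulary statistics from the validation fold. Scores on such a tiny dataset are not meaningful; only the mechanics are.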


Let's look at the classification report and the confusion matrix:

To brush up on these topics please see :

https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
X_test = vectorizer.transform(df_test.text)

In [None]:
y_predicted = grid_search.predict(X_test)

In [None]:
print(metrics.classification_report(df_test.label, y_predicted))

              precision    recall  f1-score   support

           0       0.90      0.90      0.90     12500
           1       0.90      0.90      0.90     12500

    accuracy                           0.90     25000
   macro avg       0.90      0.90      0.90     25000
weighted avg       0.90      0.90      0.90     25000



In [None]:
confusion_matrix = metrics.confusion_matrix(df_test.label, y_predicted)
print(confusion_matrix)

[[11298  1202]
 [ 1237 11263]]
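The figures in the classification report can be recomputed from the confusion matrix. In Scikit-Learn's convention, rows are true labels and columns are predicted labels. The sketch below derives accuracy, precision, recall and F1 for the positive class from the matrix printed above; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predicted labels.
cm = np.array([[11298, 1202],
               [1237, 11263]])

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)   # among reviews predicted positive, share that are truly positive
recall_pos = tp / (tp + fn)      # among truly positive reviews, share that were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.3f} precision={precision_pos:.3f} "
      f"recall={recall_pos:.3f} f1={f1_pos:.3f}")
```

All four values round to 0.90, in line with the report.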


### Use BERT model

Here we will use a HuggingFace pipeline for sentiment analysis. If your configuration isn't suitable for fine-tuning a large model, you can run the code up to the training step.

**You might need to add an accelerator to Google's VM by going to Runtime > Change Runtime Type and selecting a GPU if you want to fine-tune the model. The notebook will be reinitialized, so don't forget to re-run the previous cells that use HuggingFace library code.**

In [None]:
from transformers import pipeline, AutoTokenizer

We'll do the fine-tuning (reuse BERT's pretrained layers and adapt the last one to our sentiment analysis problem on our dataset) in PyTorch. The model can be converted to TF later.

You can select a hardware accelerator on Google Colab (Runtime > Change Runtime Type).
Check whether a GPU is available:

In [None]:
import torch

torch.cuda.is_available()

False

If False, do not run the training: it would take too long.

We already tokenized the dataset earlier. Now we will use a data collator, which pads each batch to the length of its longest sequence and returns PyTorch tensors.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
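To illustrate what dynamic padding means, here is a plain-Python sketch of the idea. It is not the HuggingFace implementation: the `pad_batch` helper and the token ids are hypothetical, and a real collator also returns tensors rather than lists.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in the batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two sequences of different lengths.
toy_batch = pad_batch([[101, 2023, 3185, 102], [101, 2307, 102]])
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps batches of short reviews small, which saves computation during training.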

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
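import numpy as np
from datasets import load_metric  # deprecated in recent `datasets` releases; `evaluate.load` from the `evaluate` package replaces it
 
def compute_metrics(eval_pred):
    # Called by the Trainer at evaluation time with (logits, labels)
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # predicted class = index of the highest logit
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}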

In [None]:
from transformers import TrainingArguments, Trainer

repo_name = "omdena_workshop"
training_args = TrainingArguments(
   output_dir=repo_name,
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
   push_to_hub=False,
)
 
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_dataset_train,
   eval_dataset=tokenized_dataset_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)


In [None]:
trainer.train()

In [None]:
trainer.evaluate()

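To "convert" your model to TF format, load the PyTorch checkpoint with `from_pt=True`. This assumes the fine-tuned model was saved to `omdena_workshop/` (for example with `trainer.save_model()`):

In [None]:
from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained("omdena_workshop/", from_pt=True)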

We can use our model in a pipeline

In [None]:
# Sentiment analysis pipeline (a new name avoids overwriting the imported `pipeline` function)
sentiment_pipeline = pipeline("sentiment-analysis", model="omdena_workshop/")

In [None]:
sentiment_pipeline(["This movie was a dream to watch", "This movie really sucks, I left before the end of the projection!"])

Keep in mind that BERT isn't the only convenient model out there. I took it as an example, but you could have used a distilled version (DistilBERT), which is faster, especially if you have inference requirements in mind. There are plenty of Transformer-powered models available on the [HuggingFace Hub](https://) that I encourage you to visit and experiment with.

**Credits go again to HuggingFace for tutorial hints:**

https://huggingface.co/blog/sentiment-analysis-python

Please note that you can also now use HuggingFace generated embeddings in a Scikit-Learn pipeline as stated here :

https://huggingface.co/scikit-learn/sklearn-transformers

https://huggingface.co/scikit-learn/skorch-text-classification