# Intro2AI: Lecture 6 - Introduction to Natural Language Processing


![](https://drive.google.com/uc?export=view&id=1jiB_rcx8777OqnMHcll2jYJQZF-sIaPm)


In this practical session, we will see how to:
- Pre-process data using Spacy
- Build a sentiment analysis system, using Scikit-Learn
- Generate word embeddings, using Gensim
- Advanced exercise: build a neural-based sentiment classifier 


**Advanced exercises** are meant to be done at home, or at the end of the session if you have time.

Our corpus for sentiment classification will be the Pop-corn dataset (https://www.kaggle.com/ymanojkumar023/kumarmanoj-bag-of-words-meets-bags-of-popcorn/code).


A reduced and cleaned version of the dataset is available on Google Drive; the following cells download it. The original dataset is also available, but it is not used during this practical session.


Side note: to see the pictures, you need to allow cookies.

In [None]:
pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9675 sha256=948a7116036f2170a49814b06030be985374e26efc9d3b71ea2e7cd6c55225a6
  Stored in directory: /root/.cache/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c13e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [None]:
# Downloading data
# wiki_ai.txt
#wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1dayJql47Thz8dmOF1txyQj77FoktWtGo' -O wiki_ai.txt

import wget
# wiki_ai.txt (for Part 1)
url = "https://docs.google.com/uc?export=download&id=1dayJql47Thz8dmOF1txyQj77FoktWtGo"
filename = wget.download(url)

# Training data (for Part 2)
url = "https://docs.google.com/uc?export=download&id=1VcPE4bo8ygubyLmxwA-jycckUtVcC6l6"
filename = wget.download(url)

# Test data (for Part 2)
url = "https://docs.google.com/uc?export=download&id=1GQf17s5Tf7rXobjhDON9gek2gOHyJ4-K"
filename = wget.download(url)

# Full dataset (for Part 3)
# tokenized, sentence-segmented, lower-cased, with some normalization
url = "https://docs.google.com/uc?export=download&id=1TokJd_dnYksfHjCqMpKEqrLHDhkwSFyV"
filename = wget.download(url)

# Original dataset (fyi, not used in this Practical Session)
#url = "https://docs.google.com/uc?export=download&id=1HFGLcWDn_vcmze-L_0jzB40ybmnD2_c1"
#filename = wget.download(url)




---


# Part 1: Using Spacy for data preprocessing

![](https://drive.google.com/uc?export=view&id=1aWjN2Fn1g2HYG6_Q9f-9j0JCUjxRJiUM)




---



Within a computer, text is encoded as a string of characters. 
In order to analyze text data within NLP applications, we need to properly preprocess it. 
An NLP preprocessing pipeline generally consists of the following steps:
* sentence segmentation
* tokenisation
* normalization: lower-casing, lemmatization, removing stop-words
* pos-tagging
* named entity recognition
* parsing

The first two steps are necessary, while the others are optional.
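
To make the first steps concrete, here is a deliberately naive sketch of sentence segmentation, tokenisation and normalization using only the Python standard library. The regular expressions and the tiny `STOP_WORDS` set are illustrative assumptions, not what Spacy does internally; a real pipeline handles abbreviations, quotes and footnote marks far more carefully.

```python
import re

# Hypothetical, tiny stop-word list for illustration only
STOP_WORDS = {"the", "a", "an", "is", "by", "and", "of"}

def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenisation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def normalize(tokens):
    """Lower-case the tokens, then drop punctuation and stop-words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t.isalnum() and t not in STOP_WORDS]

text = "Artificial intelligence is demonstrated by machines. It is studied widely!"
for sentence in split_sentences(text):
    print(normalize(tokenize(sentence)))
```

Lemmatization, POS tagging, named entity recognition and parsing need linguistic knowledge that such rules cannot provide, which is why we rely on a trained pipeline in the rest of this part.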

For these exercises, we will use the module **spacy** (already installed on Google Colab).
At the end of this notebook, you can also see how to perform pre-processing using NLTK.

Spacy is a python module that implements an NLP pipeline, in order to carry out tasks such as segmentation, tokenization, lemmatization and pos-tagging. 
We will use it in order to preprocess a document in English.

Doc: https://spacy.io/usage/processing-pipelines


The text comes from Wikipedia: https://www.wikiwand.com/en/Artificial_intelligence




In [None]:
with open( 'wiki_ai.txt') as infile:
  text = infile.read()

print(text)

## 1.1 Tokenisation

Spacy can be used to directly tokenize any text.
To make it work, you need to load a model specific to the target language, here 'en_core_web_sm' for English (there are also some domain-specific models).

```
nlp = spacy.load('en_core_web_sm')
```

This model corresponds to a processing 'pipeline':
  - by default, it includes tokenisation, lemmatization, POS tagging, dependency parsing and Named Entity Recognition
  - the keyword 'entity=True', sometimes seen in older code, is not needed: Named Entity Recognition is part of the default pipeline, and recent Spacy versions (v3) reject unknown keyword arguments

Using spacy (see code below):
- import the spacy module into Python 
- load all the necessary models for English
- open the file 'wiki_ai.txt' for reading
- process it using spacy’s nlp pipeline

In [None]:
import spacy

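# NER is part of the default pipeline, so no extra option is needed
# (Spacy v3 raises an error on the old 'entity=True' keyword)
nlp = spacy.load('en_core_web_sm')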

# Read in string of characters
with open('wiki_ai.txt') as inFile:
    text = inFile.read()

# Preprocess using spacy's pipeline
doc = nlp(text)

print('Preprocessing done')

Preprocessing done


Our preprocessed document is now present as a list of tokens in our doc variable, and we can access its different annotations by looping through it:
- print each individual token, together with its lemmatized form and part of speech tag

In [None]:
# Inspect tokens, lemmas, and pos tags
for token in doc:
  print( token.text, token.lemma_, token.pos_)

Artificial artificial ADJ
intelligence intelligence NOUN
( ( PUNCT
AI AI PROPN
) ) PUNCT
is be AUX
intelligence intelligence NOUN
demonstrated demonstrate VERB
by by ADP
machines machine NOUN
, , PUNCT
unlike unlike ADP
the the DET
natural natural ADJ
intelligence intelligence NOUN
displayed display VERB
by by ADP
humans human NOUN
and and CCONJ
animals animal NOUN
, , PUNCT
which which DET
involves involve VERB
consciousness consciousness NOUN
and and CCONJ
emotionality emotionality NOUN
. . PUNCT
The the DET
distinction distinction NOUN
between between ADP
the the DET
former former ADJ
and and CCONJ
the the DET
latter latter ADJ
categories category NOUN
is be AUX
often often ADV
revealed reveal VERB
by by ADP
the the DET
acronym acronym NOUN
chosen choose VERB
. . PUNCT
' ' PUNCT
Strong Strong PROPN
' ' PUNCT
AI ai NOUN
is be AUX
usually usually ADV
labelled label VERB
as as SCONJ
AGI AGI PROPN
( ( PUNCT
Artificial Artificial PROPN
General General PROPN
Intelligence Intelligence PROP

#### Pandas

You can use Pandas to better visualize the results

In [None]:
# Using pandas for a better visualization 
import pandas as pd

spacy_pos_tagged = [(w, w.tag_, w.pos_) for w in doc]
pd.DataFrame(spacy_pos_tagged,
             columns=['Word', 'POS tag', 'Tag type'])

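|  | Word | POS tag | Tag type |
|---|---|---|---|
| 0 | Artificial | JJ | ADJ |
| 1 | intelligence | NN | NOUN |
| 2 | ( | -LRB- | PUNCT |
| 3 | AI | NNP | PROPN |
| 4 | ) | -RRB- | PUNCT |
| ... | ... | ... | ... |
| 463 | substantially | RB | ADV |
| 464 | be | VB | AUX |
| 465 | solved".[13 | FW | X |
| 466 | ] | -RRB- | PUNCT |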


#### Look at the results:
* lemmatization = base form / remove inflectional part:
  - *agents* -> *agent* / *categories* -> *category*
  - *achieving* -> *achieve* / *began* -> *begin*
  - *been / is / was* -> *be*
  - *(* -> *(* PUNCT
  - *its* -> *-PRON-*
* strange things / not perfect:
  - *Turing* -> *ture* VERB / *Turing* -> *turing* NOUN
  - *thesis.[39* -> *thesis.[39* :  problem with footnote mentions, after a period, that are not well segmented
  - *AI* -> *AI* PROPN vs *AI* -> *ai* NOUN
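
The footnote problem noted above can be reduced by cleaning the raw text before it reaches the pipeline. The sketch below is one possible workaround, not part of the original notebook: it assumes footnote marks always look like a number in square brackets, and the sample sentence is made up for the illustration.

```python
import re

# Matches Wikipedia-style footnote marks such as [39] or [13]
FOOTNOTE = re.compile(r"\[\d+\]")

def strip_footnotes(text):
    """Remove bracketed numeric footnote marks from raw text."""
    return FOOTNOTE.sub("", text)

sample = "Computers can simulate any process of formal reasoning.[39] Along with concurrent discoveries"
print(strip_footnotes(sample))
```

The cleaned string can then be passed to `nlp(...)` as before. Whether removing the marks is acceptable depends on the application, since the reference information is lost.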

#### Notes on POS tags:

* You can use the function 'explain' to get information about an annotation, for example the POS tags; see the code below.
* Here we used a small set of coarse-grained POS tags (vs. e.g. 36 in the PTB: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [None]:
# Inspect POS tags
all_tags = set()
for token in doc:
  all_tags.add(token.pos_)
for tag in all_tags:
  print( tag, spacy.explain(tag)) # explain each label

PUNCT punctuation
NUM numeral
PRON pronoun
PART particle
CCONJ coordinating conjunction
SPACE space
PROPN proper noun
X other
ADP adposition
SCONJ subordinating conjunction
ADJ adjective
NOUN noun
ADV adverb
VERB verb
DET determiner
AUX auxiliary


## 1.2 Segmenting into sentences

Apart from token segmentation, Spacy has also automatically segmented our document into sentences. Print out the different sentences of the document.



In [None]:
# Print the sentences
for i, sent in enumerate( doc.sents ):
  print( i, sent.text.strip() )

0 Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality.
1 The distinction between the former and the latter categories is often revealed by the acronym chosen. '
2 Strong' AI is usually labelled as AGI (Artificial General Intelligence) while attempts to emulate 'natural' intelligence have been called ABI (Artificial Biological Intelligence).
3 The study of mechanical or "formal" reasoning began with philosophers and mathematicians in antiquity.
4 The study of mathematical logic led directly to Alan Turing's theory of computation, which suggested that a machine, by shuffling symbols as simple as "0" and "1", could simulate any conceivable act of mathematical deduction.
5 This insight, that digital computers can simulate any process of formal reasoning, is known as the Church–Turing thesis.[39]
6 Along with concurrent discoveries in neurobiology, information th

#### Note on sentence segmentation 
* the apostrophe before *Strong* is not well segmented: it appears at the end of sentence 1 instead of the beginning of sentence 2.
* Sentence 10: ???
* Sentence 12: should be segmented "and speaking English.[48] By the middle of the 1960s,"
* Sentence 14-15: problem with a '...' that shouldn't be segmented

## 1.3 Named entity recognition

As part of the preprocessing pipeline, Spacy has also carried out named entity recognition.
* print out each named entity, together with the label assigned to it
* what do the labels stand for?

In [None]:
entity_labels = set()
for entity in doc.ents:
  label = entity.label_
  print( entity.text, '\t', label )
  entity_labels.add( label )

print( '\nEntity labels:' )
for l in entity_labels:
  print( l, spacy.explain(l))

Artificial General Intelligence 	 ORG
ABI 	 ORG
Artificial Biological Intelligence 	 ORG
Alan Turing's 	 PERSON
Church 	 ORG
first 	 ORDINAL
McCullouch 	 PERSON
Pitts' 1943 	 ORG
Dartmouth College 	 ORG
1956,[42 	 CARDINAL
John McCarthy 	 PERSON
Norbert Wiener.[43 	 EVENT
Allen Newell 	 PERSON
Herbert Simon 	 PERSON
John McCarthy 	 PERSON
MIT 	 ORG
Marvin Minsky 	 PERSON
MIT 	 ORG
Arthur Samuel 	 PERSON
IBM 	 ORG
1959 	 DATE
algebra 	 GPE
Logic Theorist 	 ORG
1956 	 DATE
the middle of the 1960s 	 DATE
U.S. 	 GPE
the Department of Defense[49] 	 ORG
Herbert Simon 	 PERSON
twenty years 	 DATE
Marvin Minsky 	 PERSON

Entity labels:
EVENT Named hurricanes, battles, wars, sports events, etc.
ORDINAL "first", "second", etc.
PERSON People, including fictional
GPE Countries, cities, states
ORG Companies, agencies, institutions, etc.
DATE Absolute or relative dates or periods
CARDINAL Numerals that do not fall under another type


#### Visualization

A module called 'displacy' can be used to visualize the Named Entities directly in the text.

In [None]:
from spacy import displacy

# Visually
displacy.render(doc, style='ent', jupyter=True)

#### Note on Named Entity Recognition
* "Church" = organization instead of person

## 1.4 Syntactic parsing

Syntactic parsers produce an analysis of the sentences, where the words are connected to each other through syntactic relations.
We can easily parse sentences with Spacy, in order to produce a dependency graph over the sentences. 
The dependency relations can be used as features for other systems, to know who did what, or to know which word is modified by an adjective.

More info: https://spacy.io/usage/linguistic-features#dependency-parse

In [None]:
from spacy import displacy

# The 'en' shortcut is no longer available in Spacy v3: use the full model name
nlp = spacy.load('en_core_web_sm')
example_sentence = "You can make dependency trees."
example_doc = nlp(example_sentence)

# Visualization
displacy.render(example_doc, style="dep", jupyter=True)

In [None]:
# Print the first sentence of our document
sentences = [sent.text for sent in doc.sents]
print(sentences[0])
doc = nlp(sentences[0])

# Visualization
displacy.render(doc, style="dep", jupyter=True)

Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality.


#### Navigating the parse tree

Each element of the tree is associated with attributes: you can use them to inspect the different elements of the tree.

See below a tabular version of the tree, where each token is associated with its head and with the relation between them (e.g. 'amod'). The children of the current token, if any, are also printed.

In [None]:
# Navigating the parse tree
spacy_dep_rel = [(w.text, w.dep_, w.head.text, w.head.pos_, [child.text for child in w.children]) for w in doc]
pd.DataFrame(spacy_dep_rel,
             columns=['Word', 'Dep', 'Head text', 'Head pos', 'children'])


|  | Word | Dep | Head text | Head pos | children |
|---|---|---|---|---|---|
| 0 | Artificial | amod | intelligence | NOUN | [] |
| 1 | intelligence | nsubj | is | AUX | [Artificial, (, AI, )] |
| 2 | ( | punct | intelligence | NOUN | [] |
| 3 | AI | appos | intelligence | NOUN | [] |
| 4 | ) | punct | intelligence | NOUN | [] |
| 5 | is | ROOT | is | AUX | [intelligence, intelligence, ,, unlike, .] |
| 6 | intelligence | attr | is | AUX | [demonstrated] |
| 7 | demonstrated | acl | intelligence | NOUN | [by] |
| 8 | by | agent | demonstrated | VERB | [machines] |
| 9 | machines | pobj | by | ADP | [] |


### Getting some help

Hint: You can either access Spacy's manual on the internet to find out how to access the information, or look at the built-in help by typing *help(doc)*.

https://spacy.io/api/doc

In [None]:
help(doc)



---



# Part 2: Sentiment analysis, "Bag of Words Meets Bags of Popcorn"


![](https://drive.google.com/uc?export=view&id=13nwT3niIwy8jJKEyTF0dRaHeEi1q8Zlv)

In this part, we will run sentiment analysis experiments on movie reviews.
The reviews are either positive (label 1) or negative (label 0).

The data come from: https://www.kaggle.com/ymanojkumar023/kumarmanoj-bag-of-words-meets-bags-of-popcorn/code 

In this part, we will:
- vectorize the data using a bag-of-words representation
- train and evaluate a classifier for sentiment analysis.

To this aim, we will use the scikit-learn library.
It is already installed on Google Colab.


Scikit-Learn website: https://scikit-learn.org/stable/ 

## 2.1 Reading the data

The cell below contains the code to read the data and print the first instances.

The data have already been tokenized and normalized (i.e. lowercased). 
Data are balanced: there is an equal number of positive and negative examples in both the training and test sets.
We have 5000 training instances, and 500 test instances.

In [None]:
import numpy as np

# Read data using pandas
import pandas as pd

def read_data( infile ):
  # quoting=3 (csv.QUOTE_NONE): quote characters are kept as part of the text
  data = pd.read_csv(infile, header=0,
                     delimiter="\t", quoting=3)
  print("Number of examples:", data.shape[0],"\n")

  reviews = data["review"]
  labels = data["sentiment"]
  return data, reviews, labels

print( "\n-- Reading training data ")
train, train_reviews, train_labels = read_data( "popcorn_clean_train_5000.tsv" )

train.head()


-- Reading training data 
Number of examples: 5000 



|  | sentiment | review |
|---|---|---|
| 0 | 0 | "This kind of film has become old hat by now, ... |
| 1 | 0 | "What an appalling piece of rubbish!!! Who ARE... |
| 2 | 0 | "Bloodsuckers has the potential to be a somewh... |
| 3 | 1 | "You do not get more dark or tragic than \"Oth... |
| 4 | 1 | "Last night I finished re-watching \"Jane Eyre... |


## 2.2 Feature extraction

Now, we are going to transform our text data into vectors. 
We'll start with simple bag-of-words features. 

The class **CountVectorizer** implements this transformation:
- It converts a collection of text documents to a matrix of token counts (= raw frequency)

There are many parameters that can be modified, the main ones are:
  - analyzer='word': can be changed to 'char' if you want to use characters as features
  - max_features=N: build a vocabulary that only considers the top N features ordered by term frequency across the corpus.

To transform your data, you need to:
- build a CountVectorizer object with the desired options
```
vectorizer = CountVectorizer( analyzer = 'word', max_features=1000 )
```
- learn the transformation on your input data 
```
vectorizer.fit( train_reviews )
```
- transform your data into the desired output using the learned vectorizer
```
train_features = vectorizer.transform( train_reviews )
```
- the method "fit_transform" automatically learns AND applies the transformation to the input data.
```
train_features = vectorizer.fit_transform( train_reviews )
```

Note that without filtering, this produces **vectors of 39328 dimensions**! 
Here we arbitrarily reduce to 1000 (but other values should be tested).

Many other options are implemented, e.g.:
- stop_words='english': will automatically remove stop-words from a built-in list (but be careful: https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words)
- binary (default False): If True, all non zero counts are set to 1. 
- ngram_range (tuple (min_n, max_n), default=(1, 1)): The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted.

See the doc: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer
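
To see what these options do, here is a small sketch on a two-review toy corpus (the reviews and the `toy_` names are made up for this illustration). It combines `ngram_range=(1, 2)` with `binary=True` and prints the resulting vocabulary and matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for this illustration
toy_reviews = ["good movie good cast", "not a good movie"]

# Unigrams and bigrams, with presence/absence instead of raw counts
toy_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 2), binary=True)
toy_features = toy_vectorizer.fit_transform(toy_reviews)

# vocabulary_ maps each n-gram to its column index
for ngram, index in sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(index, ngram)
print(toy_features.toarray())
```

Two things are worth noticing: 'good' occurs twice in the first review but its value is 1 because of `binary=True`, and the default tokenizer drops single-character tokens such as 'a', so the second review yields the bigram 'not good'.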

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer( analyzer = 'word', max_features=1000 )
train_features = vectorizer.fit_transform( train_reviews )

print( "data vectorized")

data vectorized


#### Look at the vectorization

- Print the shape of the matrix (nb of instances x nb of features)
- Print the vocabulary, i.e. the unique words used as features
- Print the vector representing the first review
- Print the word corresponding to the first non-zero dimension, here dimension 6 (index = 5). Check that it appears once in the first review.

In [None]:
# array of shape (n_samples, n_features)
print( "Shape of the data, ie nb of examples x number of features:", train_features.shape )

# print the vocabulary (= unique words, here 1000)
print( "\nVocabulary:", list( vectorizer.vocabulary_.keys() ) )
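# get_feature_names_out() requires scikit-learn >= 1.0 (older versions: get_feature_names())
vocab = list( vectorizer.get_feature_names_out() )
print(  "Sorted vocabulary:", vocab )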

# print the vector representing the first review (1000 dimensions)
# use toarray() to densify the matrix --> many 0s = sparsity
print( "\nVector representing the first review", train_features[0].toarray())

# sixth dimension, index = 5
# what is the corresponding word?
# - invert the dictionary
index_to_token = {v: k for k, v in vectorizer.vocabulary_.items()}
print( "\nWord corresponding to the 6th dimension (index 5):", index_to_token[5])
print( "First review", train_reviews[0]) # 2nd sentence: "some human drama about what could"

Shape of the data, ie nb of examples x number of features: (5000, 1000)

Vocabulary: ['this', 'kind', 'of', 'film', 'has', 'become', 'old', 'by', 'now', 'it', 'the', 'whole', 'thing', 'is', 'turned', 'in', 'upon', 'itself', 'some', 'br', 'sure', 'sounds', 'like', 'good', 'idea', 'great', 'cast', 'and', 'human', 'drama', 'about', 'what', 'could', 'have', 'might', 'been', 'unfortunately', 'there', 'no', 'that', 'them', 'all', 'together', 'was', 'big', 'one', 'those', 'movies', 'films', 'you', 'end', 'up', 'to', 'see', 'more', 'or', 'two', 'particular', 'people', 'instead', 'getting', 'short', 'takes', 'on', 'everyone', 'not', 'just', 'annoying', 'average', 'script', 'doesn', 'help', 'an', 'piece', 'who', 'are', 'these', 'how', 'yes', 'but', 'enough', 'plot', 'boring', 'reality', 'show', 'so', 'made', 'as', 'characters', 'they', 'don', 'feel', 'for', 'if', 'based', 'real', 'then', 'very', 'sorry', 'violence', 'seems', 'quite', 'being', 'think', 'much', 'him', 'either', 'oh', 'had', 'move'



## 2.3 Preparing test data

We also need to pre-process and vectorize the test set.

The difference is:
- the **vectorization is 'learned' on the training data** only. On the test set, we use only the 'transform' method of the vectorizer (without the 'fit' part): words that do not appear in the training vocabulary are treated as 'unknown' and ignored.
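
The difference between 'fit' and 'transform' can be sketched in a few lines of plain Python. This is a simplified stand-in for the vectorizer (whitespace tokenisation, no frequency filtering), and the function names and toy documents are made up for the illustration.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform(documents, vocabulary):
    """Count vocabulary words in each document; unknown words are skipped."""
    ordered_words = sorted(vocabulary, key=vocabulary.get)
    vectors = []
    for doc in documents:
        counts = Counter(word for word in doc.split() if word in vocabulary)
        vectors.append([counts[word] for word in ordered_words])
    return vectors

train_docs = ["good film", "bad film bad plot"]
test_docs = ["good plot but terrible acting"]

vocabulary = fit_vocabulary(train_docs)            # 'fit' on the training data
train_vectors = transform(train_docs, vocabulary)
test_vectors = transform(test_docs, vocabulary)    # 'transform' only on the test data

print(vocabulary)
print(train_vectors)
print(test_vectors)
```

The words 'but', 'terrible' and 'acting' never occurred in the training documents, so they leave no trace in the test vector: the classifier cannot use them, however informative they may be.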

In [None]:
print( "-- Reading test data ")
test, test_reviews, test_labels = read_data( "popcorn_clean_test_500.tsv" )

test_features = vectorizer.transform( test_reviews )
print( "Vectorized, shape:", test_features.shape )

-- Reading test data 
Number of examples: 500 

Vectorized, shape: (500, 1000)


## 2.4 Classification without neurons: Scikit-Learn

Now we can train a model and use it to make predictions on our test set.
- Choose an algorithm, e.g. LogisticRegression (aka MaxEnt)
- Train on the training set
- Make predictions on the test set
- Report performance by comparing the gold labels from the test set (i.e. 'test_labels') to the predictions


#### First: training step

We make the import corresponding to the classifier we want to use (here, Logistic Regression), and train the classifier on the training data.

In [None]:
from sklearn.linear_model import LogisticRegression
import warnings 
warnings.filterwarnings('ignore')


classifier = LogisticRegression()
classifier.fit( train_features, train_labels )
print( 'Training done')

Training done


#### Second: predictions

Use the model learned to make predictions on the test data:

In [None]:
preds = classifier.predict( test_features )
print( "Predictions done")

Predictions done


#### Finally: scores

Scoring is done by comparing the gold labels (i.e. the ones annotated by a human) to the predicted labels assigned by the model.

Scikit-learn provides a function called "classification_report" that gives an overview of the performance using different metrics.

In [None]:
from sklearn.metrics import classification_report

print( classification_report( test_labels, preds ) )

              precision    recall  f1-score   support

           0       0.82      0.86      0.84       250
           1       0.86      0.82      0.84       250

    accuracy                           0.84       500
   macro avg       0.84      0.84      0.84       500
weighted avg       0.84      0.84      0.84       500



#### Notes on the results

The simplest metric is accuracy: the fraction of examples correctly labeled over the total number of examples.

Other metrics are given, especially per-label metrics: the F1 for each class is an indication of the performance of your model on that class.

For more information about the metrics, look at the scikit-learn documentation.

The performance of this model is rather good: it correctly labels 84% of the test examples.
This suggests that the task is rather simple, since performance is good even with a small training set and a very simple data representation.
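
To make the metrics less abstract, the sketch below recomputes accuracy, precision, recall and F1 by hand for one class. The gold and predicted labels are made up for the illustration, and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(gold, predicted, positive=1):
    """Accuracy, plus precision, recall and F1 for the class `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels, made up for this illustration
gold = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 1]

metrics = binary_metrics(gold, predicted)
for name, value in metrics.items():
    print(name, round(value, 2))
```

Precision answers "among the reviews predicted positive, how many really are?", while recall answers "among the truly positive reviews, how many did we find?"; F1 is their harmonic mean.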

## 2.5 Improving the results

Many parts of a model can be modified to try to improve the performance:
- the data representation
- the values of the hyper-parameters
- the choice of the algorithm

### 2.5.1 Modifying data representation 
Data representation corresponds to the choice of features.
Here, we chose a simple bag-of-words representation (BOW) with raw frequency.

#### 2.5.1.1 TF-IDF normalization

As said during the course, BOW comes in many flavors, and a good option in general is to use TF-IDF weighting instead of raw counts.

With scikit-learn, you can either directly vectorize using TF-IDF (with the class 'TfidfVectorizer') or transform a count-based representation (with the class 'TfidfTransformer').
Here, we use this second option.

We then train and evaluate our model again, with this new representation.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
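
As a rough guide to what the transformer computes, the sketch below reproduces the default weighting in plain Python: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw count and followed by L2 normalization of each row. The count matrix and the column meanings are made up for the illustration.

```python
import math

def tfidf(count_rows):
    """Smoothed TF-IDF with L2-normalized rows, from a dense count matrix."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    weighted_rows = []
    for row in count_rows:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(w * w for w in weighted))
        weighted_rows.append([w / norm if norm else 0.0 for w in weighted])
    return idf, weighted_rows

# Toy counts: columns stand for 'film', 'good', 'bad' (hypothetical vocabulary)
counts = [[1, 1, 0],
          [1, 0, 2]]

idf, weighted_rows = tfidf(counts)
print(idf)
print(weighted_rows)
```

A word such as 'film', which occurs in every document, gets the lowest possible idf, so after weighting the rarer and more discriminative words dominate each vector.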

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
train_features_tfidf = transformer.fit_transform(train_features)
test_features_tfidf = transformer.transform(test_features)

# Training
classifier = LogisticRegression()
classifier.fit( train_features_tfidf, train_labels )
# Predictions
preds = classifier.predict( test_features_tfidf )
# Scores
print( classification_report( test_labels, preds ) )

              precision    recall  f1-score   support

           0       0.85      0.87      0.86       250
           1       0.87      0.84      0.86       250

    accuracy                           0.86       500
   macro avg       0.86      0.86      0.86       500
weighted avg       0.86      0.86      0.86       500



#### 2.5.1.2 Testing n-grams features

As said during the course, BOW doesn't take into account the context of each word, which can be crucial for the task.

Let's try with n-grams. Recall that:
- unigrams: single tokens, same as BOW
- bigrams: sequences of two consecutive words
- trigrams: sequences of three consecutive words

Here we're going to test bi-grams, tri-grams and a concatenation of unigrams, bi-grams and tri-grams.
This is done with the option 'ngram_range'.

The code to test bi-grams is given below; you have to write similar code to test tri-grams and the concatenation.

Note that here we directly use the TF-IDF vectorizer.

Without filtering, this produces very big vectors (e.g. the full concatenation corresponds to 1366006 dimensions!).
Here, we choose to keep 5000 features, more than previously, to take into account the new features.
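
The n-gram features themselves are easy to picture with a few lines of plain Python. The helper names below are hypothetical and the tokenisation is a simple whitespace split; the vectorizer applies its own tokenizer, but the sliding-window idea is the same.

```python
def ngrams(tokens, n):
    """All sequences of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_range(tokens, min_n, max_n):
    """Concatenation of all n-grams for n between min_n and max_n (inclusive)."""
    features = []
    for n in range(min_n, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

tokens = "this is not a good movie".split()
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
print(len(ngram_range(tokens, 1, 3)))
```

A sentence of 6 tokens already yields 6 + 5 + 4 = 15 features with the range (1, 3), which gives an idea of why the unfiltered vocabulary becomes so large.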

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TEST BI-GRAMS
vectorizer = TfidfVectorizer( analyzer = 'word', max_features = 5000, ngram_range=(2,2) )
train_features_tfidf_ngram = vectorizer.fit_transform( train_reviews )
test_features_tfidf_ngram = vectorizer.transform( test_reviews )

print( train_features_tfidf_ngram.shape, test_features_tfidf_ngram.shape)

# Training
classifier = LogisticRegression()
classifier.fit( train_features_tfidf_ngram, train_labels )
# Predictions
preds = classifier.predict( test_features_tfidf_ngram )
# Scores
print( classification_report( test_labels, preds ) )

(5000, 5000) (500, 5000)
              precision    recall  f1-score   support

           0       0.82      0.87      0.84       250
           1       0.86      0.81      0.83       250

    accuracy                           0.84       500
   macro avg       0.84      0.84      0.84       500
weighted avg       0.84      0.84      0.84       500



In [None]:
# TEST TRI-GRAMS

vectorizer = TfidfVectorizer( analyzer = 'word', max_features = 5000, ngram_range=(3,3) )
train_features_tfidf_ngram = vectorizer.fit_transform( train_reviews )
test_features_tfidf_ngram = vectorizer.transform( test_reviews )

print( train_features_tfidf_ngram.shape, test_features_tfidf_ngram.shape)

# Training
classifier = LogisticRegression()
classifier.fit( train_features_tfidf_ngram, train_labels )
# Predictions
preds = classifier.predict( test_features_tfidf_ngram )
# Scores
print( classification_report( test_labels, preds ) )


(5000, 5000) (500, 5000)
              precision    recall  f1-score   support

           0       0.75      0.75      0.75       250
           1       0.75      0.75      0.75       250

    accuracy                           0.75       500
   macro avg       0.75      0.75      0.75       500
weighted avg       0.75      0.75      0.75       500



In [None]:
# TEST UNIGRAMS + BI-GRAMS + TRI-GRAMS
vectorizer = TfidfVectorizer( analyzer = 'word', max_features = 5000, ngram_range=(1,3) )
train_features_tfidf_ngram = vectorizer.fit_transform( train_reviews )
test_features_tfidf_ngram = vectorizer.transform( test_reviews )

print( train_features_tfidf_ngram.shape, test_features_tfidf_ngram.shape)

# Training
classifier = LogisticRegression()
classifier.fit( train_features_tfidf_ngram, train_labels )
# Predictions
preds = classifier.predict( test_features_tfidf_ngram )
# Scores
print( classification_report( test_labels, preds ) )

(5000, 5000) (500, 5000)
              precision    recall  f1-score   support

           0       0.86      0.90      0.88       250
           1       0.89      0.85      0.87       250

    accuracy                           0.87       500
   macro avg       0.87      0.87      0.87       500
weighted avg       0.87      0.87      0.87       500



### 2.5.2 Finding the best model

Usually, we will want to try out different parameters, in order to see what works best for our task. As such, we might experiment with:
- Different features
- Different classification algorithms 
- Different model parameters

However, we have to be careful: we cannot use our test set over and over again, as we’ll be optimizing our parameters for that particular test set, and run the risk of overfitting, which means we are not able to properly generalize to data we haven’t trained on.
We want to build a model that is robust, meaning that it will get good performance on unseen data.
That's why we only use the test set at the end, with the best model. 

For this reason, we need to make use of a validation or development set.
However, our training set is already quite small; creating a separate validation set would give us even less training data. 

Fortunately, there is another option: we can use k-fold cross validation. 
The idea is the following:
- Break up data into k (e.g. 10) parts (folds) 
- For each fold
    - Current fold is used as temporary test set 
    - Use other 9 folds as training data
    - Performance is computed on test fold
- Average performance over 10 runs
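
The procedure above can be sketched in plain Python. This version is a simplification: it does not shuffle or stratify the data, which a library implementation would typically offer, and `dummy_score` is a hypothetical stand-in for "train on the training folds, evaluate on the held-out fold".

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def dummy_score(train, test):
    """Placeholder score: simply the fraction of examples held out."""
    return len(test) / (len(train) + len(test))

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)

scores = [dummy_score(train, test) for train, test in folds]
print("Average over folds:", sum(scores) / len(scores))
```

Every example is used exactly once for evaluation and k - 1 times for training, so no data is wasted on a fixed validation set.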

Scikit-learn provides efficient ways of performing cross-validation.
Below, we will test grid search, which chooses the best values for the hyper-parameters using cross-validation over the training set.

#### 2.5.2.1 Optimizing the hyper-parameters

Each algorithm comes with some "options" called hyper-parameters.
The chosen values can have an important effect on the results.  

For example, Logistic Regression has:
- 'C': the inverse of the regularization strength; smaller values specify stronger regularization.
- 'max_iter' (default=100): the maximum number of iterations allowed for the solver to converge.

Here we use the class 'GridSearchCV', which performs an exhaustive search over specified parameter values for an estimator (i.e. a classifier).
We specify the algorithm we want (here 'LogisticRegression') and the parameter values we want to test (see the dictionary 'parameters').

Then the 'fit' method of the GridSearchCV object performs the search over the parameters, using cross-validation (default: 5-fold CV).

Then, you can print the best set of parameters and the best score (i.e. the mean cross-validated score of the best estimator), and use a pandas dataframe to visualize the results for each set of parameters.


See the doc: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
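
An exhaustive search simply enumerates every combination of the listed values. The sketch below does this enumeration by hand with `itertools.product`, for the grid used in the next cell, to show how many models end up being trained; it does not train anything itself.

```python
from itertools import product

parameters = {"C": [0.01, 0.1, 0.5, 1, 10, 50], "max_iter": [50, 100]}
n_folds = 5  # default number of folds, as stated above

names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # number of models trained during the search
print(candidates[:3])
```

The cost grows multiplicatively with each added hyper-parameter, so grids are usually kept small. After the search, GridSearchCV refits the best candidate on the full training set by default (`refit=True`).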


In [None]:

from sklearn.model_selection import GridSearchCV

parameters = {'C':[0.01, 0.1, 0.5, 1, 10, 50], 'max_iter':[50, 100] }
lr = LogisticRegression()
clf_lr = GridSearchCV(lr, parameters, verbose=1)
clf_lr.fit( train_features_tfidf_ngram, train_labels )

print( "Best parameters found:", clf_lr.best_params_)
print( "Best score found:", clf_lr.best_score_)

# One row per parameter combination, with its mean cross-validated accuracy
pd.concat([pd.DataFrame(clf_lr.cv_results_["params"]),
           pd.DataFrame(clf_lr.cv_results_["mean_test_score"], columns=["Accuracy"])],
          axis=1)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters found: {'C': 10, 'max_iter': 50}
Best score found: 0.859


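|  | C | max_iter | Accuracy |
|---|---|---|---|
| 0 | 0.01 | 50 | 0.7804 |
| 1 | 0.01 | 100 | 0.7804 |
| 2 | 0.1 | 50 | 0.8138 |
| 3 | 0.1 | 100 | 0.8138 |
| 4 | 0.5 | 50 | 0.8456 |
| 5 | 0.5 | 100 | 0.8456 |
| 6 | 1.0 | 50 | 0.8542 |
| 7 | 1.0 | 100 | 0.8542 |
| 8 | 10.0 | 50 | 0.859 |
| 9 | 10.0 | 100 | 0.859 |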


You can then directly use the GridSearchCV object (here called 'clf_lr') to make predictions on your test set: it corresponds to the best model found during the search.

In [None]:
preds = clf_lr.predict( test_features_tfidf_ngram )
print( classification_report( test_labels, preds ) )

              precision    recall  f1-score   support

           0       0.84      0.91      0.87       250
           1       0.90      0.83      0.86       250

    accuracy                           0.87       500
   macro avg       0.87      0.87      0.87       500
weighted avg       0.87      0.87      0.87       500



#### 2.5.2.2 Try other algorithms

Now you can use similar code to test other algorithms (e.g. Naive Bayes, SVM).
For each algorithm you only need to perform the grid search: results on the test set should only be reported for the best algorithm.

Doc for Naive Bayes: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB

Doc for SVM: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

Which one performs the best?

In [None]:
# Testing Naive Bayes

from sklearn.naive_bayes import MultinomialNB

parameters = {'alpha':[0, 0.1, 0.5, 0.8, 1] }


In [None]:
# Testing SVM

from sklearn.svm import LinearSVC

# here, default max_iter = 1000
parameters = {'C':[0.01, 0.1, 0.5, 1, 10, 50], 'max_iter':[100, 1000] }
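
One possible way to organise the comparison is sketched below. It loops over (estimator, grid) pairs and keeps the best cross-validated score of each. The toy corpus, the reduced grids and the names `toy_texts`, `toy_labels`, `searches` and `results` are assumptions made only so that the sketch runs on its own; in the notebook you would fit on `train_features_tfidf_ngram` and `train_labels` with the grids defined above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy data, only here to make the sketch runnable
toy_texts = ["great movie", "wonderful acting", "loved it", "excellent and fun",
             "terrible movie", "awful acting", "hated it", "boring and dreadful"] * 3
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3
toy_features = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(toy_texts)

# One (estimator, parameter grid) pair per algorithm to compare
searches = {
    "MultinomialNB": (MultinomialNB(), {'alpha': [0.1, 0.5, 1.0]}),
    "LinearSVC": (LinearSVC(), {'C': [0.01, 0.1, 1, 10]}),
}

results = {}
for name, (estimator, grid) in searches.items():
    search = GridSearchCV(estimator, grid, cv=3)  # 3 folds because the toy set is tiny
    search.fit(toy_features, toy_labels)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Comparing the `best_score_` values (cross-validated on the training data) tells you which algorithm to keep; only that one is then evaluated on the test set.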


## 2.6 Inspecting the model

The linear classifiers work by learning weights over the features.
Looking at these weights can give some insights on your model.

With LogisticRegression in the binary setting, we have:
- the most positive weights are the best indicators of the positive class (here positive reviews)
- the most negative weights are the best indicators of the negative class (here negative reviews)

The code below will print the 50 most positive and negative features: do the results make sense? 

In [None]:
# Here we look at the best model obtained with grid search, ngrams features and tf idf normalization

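# get_feature_names() was removed in scikit-learn 1.2;
# on scikit-learn < 1.0, use vectorizer.get_feature_names() instead
vocab = vectorizer.get_feature_names_out()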
allCoefficients = [(clf_lr.best_estimator_.coef_[0,i], vocab[i]) for i in range(len(vocab))]
allCoefficients.sort()
allCoefficients.reverse()

print("Top features for positive class:")
print( '\n'.join( [ f+':\t'+str((round(w,3))) for (w,f) in allCoefficients[:50]] ) )

print("\nTop features for negative class:")
print( '\n'.join( [ f+':'+str((round(w,3))) for (w,f) in allCoefficients[-50:]] ) )

Top features for positive class:
great:	8.101
perfect:	6.001
fun:	5.742
excellent:	5.311
this is:	4.991
simple:	4.704
best:	4.499
creative:	4.386
wonderful:	4.377
heart:	4.372
gem:	4.357
enjoy:	4.279
masterpiece:	4.218
beautiful:	4.215
loved:	4.193
perfectly:	4.147
enjoyable:	4.064
is great:	3.996
him:	3.95
dream:	3.941
and:	3.896
incredible:	3.882
awesome:	3.876
job:	3.748
was great:	3.722
each:	3.693
family:	3.655
amazing:	3.653
entertaining:	3.588
well:	3.58
others:	3.575
subtle:	3.569
one of:	3.569
the best:	3.548
noir:	3.546
both:	3.533
highly:	3.452
enjoyed this:	3.431
surprised:	3.413
powerful:	3.377
still:	3.348
fantastic:	3.347
brilliant:	3.338
genre:	3.319
underrated:	3.315
first time:	3.31
necessary:	3.292
enjoyed:	3.282
today:	3.276
superb:	3.244

Top features for negative class:
seems:-3.332
apparently:-3.336
pathetic:-3.353
the only:-3.367
wrote:-3.372
zombies:-3.383
mess:-3.411
dreadful:-3.454
than this:-3.524
unfortunately:-3.551
unless:-3.626
oh:-3.631
any:-3.716
not w

# Part 3: generating word embeddings

![](https://drive.google.com/uc?export=view&id=1eLkKWp8yOP6AJsK2h6Btbr3L95TvDsyD)



As introduced during the course, we can use neural networks to generate vectors representing words.
These vectors, learned on massive amounts of data, make it possible to compute similarity measures between words.

As an introductory exercise, we will generate word embeddings from the sentiment review dataset and take a look at the generated vectors.

Keep in mind that this corpus is "small" compared to what is generally used for generating embeddings: here around 40k words, against millions of words in general!
The resulting vectors will thus not be of very good quality (but the model will train very fast :).

## 3.1 Generating word embeddings

We will use gensim to induce word embeddings from text.
gensim is a vector space modeling and topic modeling toolkit for Python, and contains an efficient implementation of the word2vec algorithms.

word2vec comes in two variants: skip-gram (sg) and continuous bag-of-words (cbow).
The underlying prediction task of the former is to estimate the context words from the target word; the prediction task of the latter is to estimate the target word from the sum (or average) of the context word vectors.
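
The difference between the two prediction tasks can be illustrated with a few lines of plain Python. This is a simplified sketch, not gensim's implementation: it uses a fixed window and ignores subsampling and negative sampling, and the helper name `training_examples` is made up for the illustration.

```python
def training_examples(tokens, window=2):
    """Yield (target, context_words) for each position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "the movie was really great".split()

# skip-gram: the target word is the input, each context word is one output to predict
skipgram_pairs = [(target, c) for target, ctx in training_examples(tokens) for c in ctx]

# CBOW: all context words together are the input, the target word is the output
cbow_examples = [(ctx, target) for target, ctx in training_examples(tokens)]

print(skipgram_pairs[:4])
print(cbow_examples[2])
```

Skip-gram produces one training pair per (target, context word) combination, whereas CBOW produces a single example per position, which is one reason why CBOW trains faster.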

In [None]:
from gensim.models import Word2Vec 

import gzip
import logging

import time

# set up logging for gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# we define a PlainTextCorpus class; this will provide us with an
# iterator over the corpus (so that we don't have to load the corpus
# into memory)
class PlainTextCorpus(object):
    def __init__(self, fileName):
        self.fileName = fileName

    def __iter__(self):
        for line in gzip.open(self.fileName, 'rt', encoding='utf-8'):
            yield  line.split()

# instantiate the corpus class using corpus location
sentences = PlainTextCorpus('raw_reviews.txt.gz')

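# we only take into account words with a frequency of at least 50, and
# we iterate over the corpus only once
# (gensim 3.x API, as in the doc linked in section 3.3; in gensim >= 4.0,
# `iter` is named `epochs` and `size` is named `vector_size`)
model = Word2Vec(sentences, min_count=50, iter=1)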

# finally, save the constructed model to disk
model.save('model_word2vec')

2022-04-14 16:57:30,708 : INFO : collecting all words and their counts
2022-04-14 16:57:30,713 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-04-14 16:57:31,932 : INFO : PROGRESS: at sentence #10000, processed 2354780 words, keeping 163178 word types
2022-04-14 16:57:33,195 : INFO : PROGRESS: at sentence #20000, processed 4686268 words, keeping 251892 word types
2022-04-14 16:57:33,785 : INFO : collected 289705 word types from a corpus of 5844706 raw words and 25000 sentences
2022-04-14 16:57:33,787 : INFO : Loading a fresh vocabulary
2022-04-14 16:57:33,942 : INFO : effective_min_count=50 retains 7900 unique words (2% of original 289705, drops 281805)
2022-04-14 16:57:33,943 : INFO : effective_min_count=50 leaves 4935387 word corpus (84% of original 5844706, drops 909319)
2022-04-14 16:57:33,976 : INFO : deleting the raw counts dictionary of 289705 items
2022-04-14 16:57:33,985 : INFO : sample=0.001 downsamples 47 most-common words
2022-04-14 16:57:33,

## 3.2 Compute word similarity

You can now compute the most similar words (as measured by the cosine similarity between the word vectors) by issuing the following command:

model.wv.most_similar(myword)

Don't hesitate to test with other words, such as "movie", "good", etc.
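
For intuition, cosine similarity and the ranking it induces can be sketched with NumPy as follows. The three-dimensional vectors are invented for the illustration (they are not taken from the trained model), and the function names are hypothetical helpers, not part of gensim.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional "embeddings", made up for illustration
toy_vectors = {
    "actor":   np.array([0.9, 0.1, 0.3]),
    "actress": np.array([0.8, 0.2, 0.4]),
    "zombie":  np.array([-0.5, 0.9, 0.0]),
}

def most_similar(word, vectors):
    """Rank all other words by cosine similarity with `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(most_similar("actor", toy_vectors))
```

Because the measure only depends on the angle between vectors, two words can be very similar even if their vectors have different lengths.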

In [None]:
model.wv.most_similar('actor')

2022-04-14 16:58:43,104 : INFO : precomputing L2-norms of word weight vectors


[('actress', 0.9464222192764282),
 ('performance', 0.8889339566230774),
 ('role', 0.8721189498901367),
 ('actor,', 0.837760329246521),
 ('role,', 0.8222366571426392),
 ('character,', 0.8029254674911499),
 ('role.', 0.7964649200439453),
 ('actor.', 0.7946804761886597),
 ('character', 0.7902258634567261),
 ('actress.', 0.7811508774757385)]

Word embeddings allow us to do analogical reasoning using vector addition and subtraction.
gensim offers the possibility to do so.

Try to perform analogical reasoning, e.g. actor - man + woman = ?

In [None]:
model.wv.most_similar(positive=["actor", "woman"], negative=["man"])

[('actress', 0.948756754398346),
 ('performance', 0.894486129283905),
 ('role', 0.8632146120071411),
 ('actor,', 0.8322696685791016),
 ('role,', 0.8218492865562439),
 ('actress,', 0.8159606456756592),
 ('character,', 0.8123709559440613),
 ('voice', 0.8086908459663391),
 ('actress.', 0.8020336627960205),
 ('actor.', 0.7969970703125)]
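
What such a query computes can be approximated as follows: normalise the vectors, add the "positive" ones, subtract the "negative" ones, and rank the remaining words by cosine similarity with the result. The sketch below is a simplified illustration of that idea, not gensim's code, and the two-dimensional vectors are invented so that the arithmetic is easy to follow.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity with sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    scores = [(word, float(np.dot(unit(query), unit(vec))))
              for word, vec in vectors.items()
              if word not in positive and word not in negative]  # skip the query words
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-dimensional vectors: first axis ~ gender, second axis ~ "acting"
toy_vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "actor": np.array([1.0, 1.0]),
    "actress": np.array([-1.0, 1.0]),
    "performance": np.array([0.0, 1.0]),
}

print(analogy(["actor", "woman"], ["man"], toy_vectors))
```

With real embeddings the analogy rarely lands exactly on a word, which is why the result is a ranked list rather than a single answer.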

## 3.3 Modify the model

By default, the word2vec module creates word embeddings with the following settings:
- algorithm: CBOW
- window: 5
- embeddings size: 100

Try other options, including:
- algorithm: skipgram
- window: try various sizes, from very small to large
- embeddings size: try various sizes, from very small to large

Each time, evaluate the impact on the similarity computation.
What configuration works best?

See doc: https://radimrehurek.com/gensim_3.8.3/models/word2vec.html


In [None]:
# a- MODIFYING THE WINDOW SIZE (here 1)

model_w1 = Word2Vec(sentences, min_count=50, iter=1, window=1)

In [None]:
# a- MODIFYING THE WINDOW SIZE (e.g. 20)

In [None]:
# b- MODIFYING THE EMBEDDINGS SIZE (here 10)

model_s10 = Word2Vec(sentences, min_count=50, iter=1, size=10)

In [None]:
# b- MODIFYING THE EMBEDDINGS SIZE (e.g. 300)



In [None]:
# c- WITH SKIPGRAM

# we only take into account words with a frequency of at least 50, and
# we iterate over the corpus only once
model_sg = Word2Vec(sentences, min_count=50, iter=1, sg=1)

In [None]:
# evaluate the skip-gram model (use model_w1, model_s10, etc. for the other settings)
model_sg.wv.most_similar('actor')

### Note

According to Mikolov:
- Skip-gram: works well with a small amount of training data, represents even rare words or phrases well.
- CBOW: several times faster to train than skip-gram, slightly better accuracy for frequent words.

# Part 4 (Advanced): neural based classification

![](https://drive.google.com/uc?export=view&id=1rFv3huUMTbtEHepb8vlluA61H107ow9R)

Use a neural network to perform sentiment analysis using:
- randomly initialized word representations
- the embeddings built with Gensim
- pre-trained word embeddings (e.g. FastText, Glove etc)

A solution based on TensorFlow will be provided later on the course website.

# Additional notes: coreference resolution

You can also use a model called NeuralCoref to perform coreference resolution.
However, it requires an older version of spaCy (https://github.com/huggingface/neuralcoref/issues/207). Below is a way to make it work within the notebook (if you don't use a notebook, create a virtual environment with conda to work with spaCy 2.1).

You can try it at home, but for now, just take a look at the last cell to inspect the results (presented in the course slides).


Using neuralcoref:
* spacy: https://spacy.io/universe/project/neuralcoref-vizualizer 
* git page: https://github.com/huggingface/neuralcoref


In [None]:
pip install -U spacy==2.1.0

In [None]:
pip install neuralcoref

In [None]:
# Download a model within Google Colab: https://stackoverflow.com/questions/49259404/how-to-install-models-download-packages-on-google-colab
import spacy.cli
spacy.cli.download("en")

In [None]:
# Load your usual SpaCy model (one of SpaCy English models)
import spacy

nlp = spacy.load('en')

# load NeuralCoref and add it to the pipe of SpaCy's model
import neuralcoref


In [None]:
coref = neuralcoref.NeuralCoref(nlp.vocab)
nlp.add_pipe(coref, name='neuralcoref')

In [None]:
# You're done. You can now use NeuralCoref the same way you usually manipulate a SpaCy document and its annotations.
doc = nlp(u'Nina gave Tom the burger. He was hungry. He ate it.')
print(doc._.has_coref)       # True if at least one coreference was resolved
print(doc._.coref_clusters)  # the clusters of coreferring mentions

### Note on coreference

In the previous cell, you can see the results of a coreference system over three sentences.
The system finds two clusters:
- one corresponding to 'Tom', containing the proper noun 'Tom' and two occurrences of the pronoun 'He'.
- one corresponding to 'the burger', containing the noun phrase 'the burger' and one occurrence of the pronoun 'it'.

Here, the system doesn't output singletons, i.e. entities that are mentioned only once (such as 'Nina').
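
To make the notions of cluster and singleton concrete, the result described above can be written down by hand as a plain Python dictionary. This is only an illustration: it is not the data structure returned by NeuralCoref, and the variable names are made up.

```python
# Hand-written illustration of the mentions in
# 'Nina gave Tom the burger. He was hungry. He ate it.'
# (not the objects returned by NeuralCoref)
mentions_by_entity = {
    "Nina": ["Nina"],
    "Tom": ["Tom", "He", "He"],
    "the burger": ["the burger", "it"],
}

# A cluster groups at least two mentions of the same entity
clusters = {entity: mentions for entity, mentions in mentions_by_entity.items()
            if len(mentions) > 1}

# A singleton is an entity that is mentioned only once
singletons = [entity for entity, mentions in mentions_by_entity.items()
              if len(mentions) == 1]

print(clusters)
print(singletons)
```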

See below what happens with a slight change in the sentences:

In [None]:
doc = nlp(u'Nina gave Tom the burger. She was hungry. He ate it.')
doc._.has_coref
doc._.

# If we add a mention to 'Nina' using the corresponding pronoun 'She', then we have three clusters, one for each entity. 

# Additional notes: NLTK

Examples of pre-processing using NLTK, another great library for NLP (though a bit older).

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

In [None]:
text = "This ice-cream is from the U.K.. This isn't a cake. "
sentences = nltk.sent_tokenize(text) # Sentence splitting
print(sentences)

In [None]:
sentence = 'She sells seashells on the seashore.'
tokens = nltk.word_tokenize(sentence) # Tokenization
print(tokens)

In [None]:
text = "This ice-cream is from the U.K.. This isn't a cake. "
for sentence in nltk.sent_tokenize(text):
  tokens = nltk.word_tokenize(sentence)
  tagged_tokens = nltk.pos_tag(tokens) # POS tagging
  print(tagged_tokens)

In [None]:
sentence = "Time flies like an arrow."
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens) # POS tagging
print(tagged_tokens)

In [None]:
# Lemmatizing with NLTK 
sentence = "Mr. Dursley was the director of a firm called Grunnings, which made drills."
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens)
verb_tags = ["VBD", "VBG", "VBN", "VBP", "VBZ"]
verbs = []
for token, tag in tagged_tokens:
    if tag in verb_tags:
        verbs.append(token)
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
verb_lemmas = []
for word_form in verbs:
    lemma = lemmatizer.lemmatize(word_form, "v") 
    verb_lemmas.append((word_form,lemma))
print( verb_lemmas )

Parsing with NLTK is a bit more complicated.

See: 
- bllipparser: https://pypi.org/project/bllipparser/
- Stanford parser: 

In [None]:
pip install --user bllipparser

In [None]:
!python3 -m nltk.downloader bllip_wsj_no_aux

In [None]:
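from bllipparser import RerankingParser
from nltk.data import find
from nltk import Tree
from IPython.display import display

# Option 1: let bllipparser fetch and load a parsing model
rrp = RerankingParser.fetch_and_load('WSJ-PTB3', verbose=True)
print(rrp.simple_parse("It's that easy."))

# Option 2: load the model downloaded with the NLTK downloader
model_dir = find('models/bllip_wsj_no_aux').path
parser = RerankingParser.from_unified_model_dir(model_dir)
best = parser.parse("The old oak tree from India fell down.")

print(best.get_reranker_best())
print(best.get_parser_best())

# Convert the best parse to an NLTK tree so that the notebook can display it
parse_tree = Tree.fromstring(str(best.get_reranker_best().ptb_parse))
display(parse_tree)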