<a href="https://colab.research.google.com/github/iaanimashaun/Strive-School-Assigments/blob/main/4_Embeddings_and_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Amazon, IMDB and Yelp Review Sentiment Classification using SpaCy

In [None]:
# !pip install scikit-learn

In [None]:
# !pip install -U spacy

In [None]:
# !python -m spacy download en

In [None]:
#!python -m spacy download en_core_web_sm

### Data Cleaning Options
- Case Normalization
- Removing Stop Words
- Removing Punctuations or Special Symbols
- Lemmatization or Stemming
- Parts of Speech Tagging
- Entity Detection
- Bag of Words
- TF-IDF 

### Bag of Words - The Simplest Word Embedding Technique

This is one of the simplest methods of embedding words into numerical vectors. It is not often used in practice due to its oversimplification of language, but often the first embedding technique to be taught in the classroom setting.

```
doc1 = "I am high"
doc2 = "Yes I am high"
doc3 = "I am kidding" 

```

![image.png](attachment:image.png)

### Bag of Words and Tf-idf 
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

tf–idf for “Term Frequency times Inverse Document Frequency

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## Prepare the data

In [3]:
import pandas as pd
train_data = pd.read_csv("/content/drive/MyDrive/Strive/Exercises/Module_7_NLP/Week_1/D4/4. Semantics and Embeddings/data/train_data.csv", index_col="Unnamed: 0")
test_data = pd.read_csv("/content/drive/MyDrive/Strive/Exercises/Module_7_NLP/Week_1/D4/4. Semantics and Embeddings/data/data_yelp.csv", index_col="Unnamed: 0")

In [17]:
from sklearn.model_selection import train_test_split

X = train_data['Review']
y = train_data['Sentiment']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
X_test = test_data['Review']
y_test = test_data['Sentiment']

# Let's Get Started

In [18]:
import spacy
from spacy import displacy

In [19]:
nlp = spacy.load('en_core_web_sm')

In [20]:
nlp("Hello")[0].lemma_

'hello'

## Preprocessing

We need to create a function that given the text of a sentence preprocess it.

Some of the operations we can do:
- Case Normalization (automatic with lemmatization)
- Removing Stop Words
- Removing Punctuations or Special Symbols
- Lemmatization or Stemming

To insert it in a pipeline, you have to be sure that you start from a sentence and you get text tokens as output:

In [21]:
def preprocessing(sentence):
    # define your preprocessing pipeline

    return [token.lemma_ for token in nlp(sentence) if not (token.is_punct and token.is_stop)]

### Text Classification 

In [22]:
import pandas as pd
# import the tfidfvectorizer and the count vectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# import the pipeline module from sklearn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import  Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [23]:
# Define an instance of the TfidfVectorizer which receive your 
# preprocessing function

tfidf = TfidfVectorizer(tokenizer=preprocessing)

In [24]:
# Define an instance of the TfidfVectorizer which receive your 
# preprocessing function and it has as n-grams range (1,5)

tfidf1_5 = TfidfVectorizer(tokenizer=preprocessing, ngram_range=(1,5))

In [25]:
# Define an instance of the Countvectorizer which receive your 
# preprocessing function and it has as n-grams range (1,3)

bow = CountVectorizer(tokenizer=preprocessing, ngram_range=(1,3))

In [26]:
# define a word vectorizer that use the spacy's word vectors. Replace the
# two comments with your code. If you don't remember how to access
# the embeddings of a doc use dir(doc) and see if there is anything
# that makes sense

import numpy as np
import spacy
from sklearn.base import BaseEstimator, TransformerMixin

class WordVectorTransformer(TransformerMixin,BaseEstimator):
    def __init__(self, model='en_core_web_sm'):
        self.model = model

    def fit(self,X,y=None):
        return self

    def transform(self,X):
        nlp = spacy.load(self.model)
        return np.concatenate([nlp(doc).vector.reshape(1,-1) for doc in X])



☝️In this case, however, we are not using our preprocessing pipeline and we are consider only sentence embeddings and not single token embeddings. In this way, it is more convenient because we have a single vector for each sentence and we can handle different length sentences. The vector representation for the entire Doc is calculated by averaging the vectors for each Token in the Doc.

This may result in a less meaningful features than the one by using Tf-Idf for example. We will see how to handle this next week!

In [27]:
# import and load a classifier from sklearn. In class, I used 
# from sklearn.svm import LinearSVC
# but feel free to experiment with other models

from sklearn.svm import LinearSVC
classifier = LinearSVC()

In [30]:
train_data.head()

Unnamed: 0,Review,Sentiment
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [31]:
# Build a pipeline that contains only your tfidf and fit_transform it
# on your corpus train_data["Review"]

pipe = Pipeline([
                 ('tfidf', tfidf)
])
pipe.fit_transform(train_data['Review'])

<2748x4440 sparse matrix of type '<class 'numpy.float64'>'
	with 36360 stored elements in Compressed Sparse Row format>

In [33]:
# Build a pipeline that contains your tfidf and the classifier
pipe = Pipeline([
                 ('tfidf', tfidf),
                 ('classifier', classifier)
])

In [34]:
# fit on your data (X_train, y_train)
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function preprocessing at 0x7f7f9ac83950>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_interce

In [35]:
# test on your validation data with the predict method

y_pred = pipe.predict(X_val)

In [36]:
print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.80      0.81       435
           1       0.78      0.80      0.79       390

    accuracy                           0.80       825
   macro avg       0.80      0.80      0.80       825
weighted avg       0.80      0.80      0.80       825



In [37]:
# test on your test data

y_pred_test = pipe.predict(X_test)

In [38]:
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94       500
           1       0.94      0.94      0.94       500

    accuracy                           0.94      1000
   macro avg       0.94      0.94      0.94      1000
weighted avg       0.94      0.94      0.94      1000



In [39]:
confusion_matrix(y_val, y_pred)
confusion_matrix(y_test, y_pred_test)

array([[468,  32],
       [ 31, 469]])

In [40]:
# Test it with your examples
pipe.predict(['Wow, this is amzing lesson'])

array([1])

In [42]:
pipe.predict(['Wow, this sucks'])

array([0])

In [43]:
pipe.predict(['Worth of watching it. Please like it'])

array([1])

In [44]:
pipe.predict(['Loved it. amazing'])

array([1])

Play with the following:

In [3]:
# !pip install whatlies
# !pip install whatlies\[umap\]
# !pip install delayed

In [1]:
from whatlies import EmbeddingSet
from whatlies.language import CountVectorLanguage



In [2]:


lang = CountVectorLanguage(n_components=2, ngram_range=(1, 1), analyzer="word")
words = ['great', 'bad', 'amazing', 'sad', 'awesome', 'good', 'upset', "nice"]

emb = lang[words]
emb.plot_interactive(x_axis='good', y_axis='bad')

In [None]:

from whatlies import EmbeddingSet
from whatlies.language import SpacyLanguage

lang = SpacyLanguage('en_core_web_lg') # lg is more accurate for this than sm
words = ['cat', 'dog', 'fish', 'kitten', 'man', 'woman', 'king', 'queen', 'doctor', 'nurse', "animal", "human"]

emb = lang[words]
emb.plot_interactive(x_axis='animal', y_axis='human')

In [None]:
from whatlies.transformers import Pca, Umap

orig_chart = emb.plot_interactive('man', 'woman')
pca_plot = emb.transform(Pca(2)).plot_interactive()
umap_plot = emb.transform(Umap(2)).plot_interactive()

pca_plot | umap_plot

  warn(


Play with similarities of sentences/tokens

In [None]:
dog = nlp("dog")
cat = nlp("cat")

# Compare the similarity between Tokens 'dog' and 'cat'
dog.similarity(cat)

  dog.similarity(cat)


0.7345952141306641

In [None]:
dog = nlp("dog")
queen = nlp("queen")

# Compare the similarity between Tokens 'dog' and 'cat'
dog.similarity(queen)

  dog.similarity(queen)


0.39438143492304706

In [None]:
king = nlp("king")
man = nlp("man")

# Compare the similarity between Tokens 'dog' and 'cat'
king.similarity(queen)

  king.similarity(queen)


0.8043196533418226

In [None]:
king = nlp("king")
woman = nlp("woman")

# Compare the similarity between Tokens 'dog' and 'cat'
king.similarity(woman)

  king.similarity(woman)


0.8453325361917157