[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/khetansarvesh/NLP/blob/main/unitask_downstream_nlp/Sentence-Level-Classification/Freeze_Learning_Movie_Review_Classification.ipynb)

In [None]:
import numpy as np
import pandas as pd

## **Reading Data**

In [None]:
#downloading the dataset
# use the /raw/ URL: the /blob/ URL returns the GitHub HTML page rather than the CSV file
!wget https://github.com/khetansarvesh/NLP/raw/main/Sentence-Level-Classification/SST_Dataset.csv

In [None]:
# reading the dataset
df = pd.read_csv("SST_Dataset.csv", encoding = "ISO-8859-1" )
df.dropna(inplace=True)
df

Unnamed: 0,review,label
0,bromwell high is a cartoon comedy . it ran at ...,1
1,story of a man who has unnatural feelings for ...,0
2,homelessness or houselessness as george carli...,1
3,airport starts as a brand new luxury pla...,0
4,brilliant over acting by lesley ann warren . ...,1
...,...,...
24995,i saw descent last night at the stockholm fi...,0
24996,a christmas together actually came before my t...,1
24997,some films that you pick up for a pound turn o...,0
24998,working class romantic drama from director ma...,1


The dataset we use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (1) or negative (0).

In [None]:
df["label"].value_counts()/df.shape[0] # each class makes up 50% of the rows, so the dataset is perfectly balanced

1    0.5
0    0.5
Name: label, dtype: float64

## **Data Preprocessing**

### **Cleaning Text Features**


Typical cleaning steps include removing stop words and punctuation, and performing stemming.

In [None]:
import re
import string

from nltk.stem.porter import PorterStemmer
# the old sklearn.feature_extraction.stop_words module is gone in newer scikit-learn versions
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS # or use from nltk.corpus import stopwords

stopwords = ENGLISH_STOP_WORDS
ps = PorterStemmer()

def clean(doc): # doc is a string of text
    doc = re.sub(r"<br\s*/?>", " ", doc) # the text contains a lot of <br /> tags; replace them with " "
    doc = "".join([char for char in doc if char not in string.punctuation and not char.isdigit()]) # remove punctuation and digits
    doc = doc.lower() # lowercase all the characters
    doc = " ".join([ps.stem(token) for token in doc.split() if token not in stopwords]) # remove stop words, then stem
    return doc
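
To make the individual cleaning steps easier to follow, here is a minimal sketch that runs them on a single string using only the standard library. It is an illustration, not the notebook's pipeline: `TOY_STOP_WORDS` is a tiny hypothetical stop word set, the sample sentence is made up, and stemming is left out so that no NLTK install is needed.

```python
import re
import string

# Hypothetical, tiny stop word set used only for this illustration.
TOY_STOP_WORDS = {"is", "a", "it", "at", "the"}

def clean_without_stemming(doc: str) -> str:
    # 1. replace <br>, <br/> and <br /> tags with a space
    doc = re.sub(r"<br\s*/?>", " ", doc)
    # 2. drop punctuation and digits
    doc = "".join(ch for ch in doc if ch not in string.punctuation and not ch.isdigit())
    # 3. lowercase
    doc = doc.lower()
    # 4. drop stop words
    return " ".join(tok for tok in doc.split() if tok not in TOY_STOP_WORDS)

sample = "Bromwell High is a cartoon comedy.<br /><br />It ran at the same time in 1999!"
cleaned = clean_without_stemming(sample)
print(cleaned)  # bromwell high cartoon comedy ran same time in
```

The tag replacement has to happen before punctuation is stripped, otherwise the angle brackets disappear and the letters "br" are left behind in the text.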



In [None]:
df["review"] = df["review"].apply(clean) # putting the cleaned text back into the dataframe

df

Unnamed: 0,review,label
0,bromwel high cartoon comedi ran time program s...,1
1,stori man unnatur feel pig start open scene te...,0
2,homeless houseless georg carlin state issu yea...,1
3,airport start brand new luxuri plane load valu...,0
4,brilliant act lesley ann warren best dramat ho...,1
...,...,...
24995,saw descent night stockholm film festiv huge d...,0
24996,christma actual came time ve rais john denver ...,1
24997,film pick pound turn good rd centuri film rele...,0
24998,work class romant drama director martin ritt u...,1


### **Dependent and independent features split**

In [None]:
X = df.review
Y = df.label

In [None]:
X

0        bromwel high cartoon comedi ran time program s...
1        stori man unnatur feel pig start open scene te...
2        homeless houseless georg carlin state issu yea...
3        airport start brand new luxuri plane load valu...
4        brilliant act lesley ann warren best dramat ho...
                               ...                        
24995    saw descent night stockholm film festiv huge d...
24996    christma actual came time ve rais john denver ...
24997    film pick pound turn good rd centuri film rele...
24998    work class romant drama director martin ritt u...
24999    dumbest film ve seen rip nearli type thriller ...
Name: review, Length: 25000, dtype: object

In [None]:
Y

0        1
1        0
2        1
3        0
4        1
        ..
24995    0
24996    1
24997    0
24998    1
24999    0
Name: label, Length: 25000, dtype: int64

### **Feature Extraction - Converting Text Features into Numeric Features**

#### **M1 - Using Sentence Embedding**

##### M1.1 - Using Bag Of Words Model

In [None]:
# You can use any other vectorizer here (word2vec, GloVe, fastText, or contextual word embeddings); I am using BOW.
# Which is best? Contextual embeddings often win, but this does not generalize, so treat the choice as a hyperparameter and experiment.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(max_features = 10000) # treat max_features as a hyperparameter; after many experiments I found that 10000 gave the best predictions
bow_representation = count_vect.fit_transform(list(X))
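
As a rough illustration of what a bag-of-words matrix contains, the sketch below builds one by hand for three made-up, already-cleaned documents. It uses only the standard library and is not how `CountVectorizer` is implemented; the toy documents are assumptions for the example.

```python
from collections import Counter

# Three made-up documents that already look like cleaned, stemmed reviews.
docs = ["good film good act", "bad film", "good stori bad act"]

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row per document, one column per vocabulary term, cell = term count.
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts[term] for term in vocab])

print(vocab)  # ['act', 'bad', 'film', 'good', 'stori']
for row in bow:
    print(row)
# [1, 0, 1, 2, 0]
# [0, 1, 1, 0, 0]
# [1, 1, 0, 1, 1]
```

With `max_features` set, `CountVectorizer` keeps only the most frequent terms across the corpus, which shrinks the number of columns.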

In [None]:
count_vect.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': 10000,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None}

In [None]:
X = pd.DataFrame(bow_representation.toarray(), columns=count_vect.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
X

Unnamed: 0,aag,aaron,ab,abandon,abbey,abbi,abbot,abbott,abc,abduct,abe,abhay,abhorr,abid,abigail,abil,abl,abli,aboard,abomin,aborigin,abort,abound,abraham,abroad,abrupt,abruptli,absenc,absent,absolut,absorb,abstract,absurd,absurdli,abu,abund,abus,abysm,abyss,academ,...,youngest,youngster,youth,youtub,yr,yuck,yuen,yugoslavia,yuk,yuma,yup,yuppi,yuzna,yvonn,zabriski,zach,zack,zane,zani,zatoichi,zealand,zelah,zelda,zen,zenia,zentropa,zero,zeta,zhang,zip,zizek,zodiac,zoe,zombi,zone,zoo,zoom,zorro,zu,zucker
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
24996,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
24997,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0
24998,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


##### M1.2 Using TF-IDF Model

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer() # optionally set max_features (e.g. 5000); treat it as a hyperparameter
tfidf_representation = tfidf_vect.fit_transform(df.review) # fit on the cleaned text; X may already hold the BOW DataFrame from M1.1

In [None]:
tfidf_vect.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': None,
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'use_idf': True,
 'vocabulary': None}
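
The parameters above (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`) correspond to the weighting tf × (ln((1 + n) / (1 + df)) + 1) followed by L2 normalization of each row. The sketch below reproduces that arithmetic by hand on three made-up documents, using only the standard library; it is meant to show the formula, not to replace `TfidfVectorizer`.

```python
import math
from collections import Counter

# Made-up cleaned documents, for illustration only.
docs = ["good film good act", "bad film", "good stori bad act"]
vocab = sorted({tok for doc in docs for tok in doc.split()})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df_counts = Counter(tok for doc in docs for tok in set(doc.split()))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

tfidf = []
for doc in docs:
    counts = Counter(doc.split())
    row = [counts[t] * idf[t] for t in vocab]   # raw term count times idf
    norm = math.sqrt(sum(v * v for v in row))    # L2 norm of the row
    tfidf.append([v / norm for v in row])

for row in tfidf:
    print([round(v, 3) for v in row])
```

Rare terms such as "stori" (one document) get a larger idf than common terms such as "good" (two documents), so they contribute more per occurrence.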

In [None]:
X = pd.DataFrame(tfidf_representation.toarray(), columns=tfidf_vect.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
X

Unnamed: 0,aa,aaa,aaaaaaah,aaaaah,aaaaatch,aaaahhhhhhh,aaaand,aaaarrgh,aaah,aaargh,aaaugh,aaawwwwnnn,aachen,aada,aadha,aag,aaghh,aah,aahhh,aaip,aaja,aakash,aaker,aakrosh,aaliyah,aam,aamir,aan,aankh,aankhen,aap,aapk,aapkey,aardman,aardvark,aargh,aaron,aarp,aarrrgh,aatish,...,zubeidaa,zucchini,zucco,zucker,zuckerman,zucov,zue,zuf,zugsmith,zukhov,zukor,zukov,zulu,zumhof,zungia,zuni,zuniga,zunz,zurich,zuth,zuzz,zvezda,zvonimir,zvyagvatsev,zwartboek,zwick,zwrite,zx,zy,zyada,zyurang,zz,zzzz,zzzzz,zzzzzzzz,zzzzzzzzzzzz,zzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### **M2 - Using Subword Embedding**

##### M2.1 DistilBERT Model - from [HuggingFace](https://huggingface.co/).

In [None]:
# downloading the huggingface transformers module
!pip install transformers
import transformers as ppb



In [None]:
# Loading pretrained distilBERT model -- a version of BERT that is smaller, but much faster and requiring a lot less memory.
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


###### Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.


1. Tokenization

In [None]:
# 1. Tokenization
"""Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with."""

tokenized = df.review.apply(lambda x: tokenizer.encode(x, add_special_tokens=True)) # use df.review: X may already hold a feature matrix from M1

Token indices sequence length is longer than the specified maximum sequence length for this model (546 > 512). Running this sequence through the model will result in indexing errors
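
The warning above means some reviews produce more than 512 token ids, which is more positions than DistilBERT can index. One common remedy is to truncate each sequence to 512 ids while keeping the closing `[SEP]` token. The sketch below shows that idea on plain Python lists; `truncate_ids` is a hypothetical helper written for this example, and the ids 101 and 102 are the `[CLS]` and `[SEP]` ids of the `distilbert-base-uncased` vocabulary.

```python
MAX_LEN = 512               # maximum number of positions DistilBERT accepts
CLS_ID, SEP_ID = 101, 102   # [CLS] and [SEP] ids in distilbert-base-uncased

def truncate_ids(ids, max_len=MAX_LEN):
    """Hypothetical helper: keep at most max_len ids and end with [SEP]."""
    if len(ids) <= max_len:
        return list(ids)
    return list(ids[: max_len - 1]) + [SEP_ID]

# A fake 546-token sequence, the length reported in the warning.
long_ids = [CLS_ID] + list(range(1000, 1544)) + [SEP_ID]
short_ids = truncate_ids(long_ids)

print(len(long_ids), len(short_ids))  # 546 512
```

Recent versions of `transformers` can also do this inside the tokenizer call, for example `tokenizer.encode(x, add_special_tokens=True, truncation=True, max_length=512)`. With truncation in place, the padded width reported further down would be at most 512 rather than 1879.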


In [None]:
tokenized

0        [101, 22953, 2213, 8545, 2140, 2152, 9476, 227...
1        [101, 2358, 10050, 2158, 4895, 19833, 3126, 25...
2        [101, 11573, 2160, 3238, 12062, 5529, 2378, 21...
3        [101, 3199, 2707, 4435, 2047, 28359, 9496, 494...
4        [101, 8235, 2552, 23920, 5754, 6031, 2190, 368...
                               ...                        
24995    [101, 2387, 6934, 2305, 8947, 2143, 17037, 128...
24996    [101, 4828, 2863, 5025, 2234, 2051, 2310, 1554...
24997    [101, 2143, 4060, 9044, 2735, 2204, 16428, 935...
24998    [101, 2147, 2465, 3142, 2102, 3689, 2472, 3235...
24999    [101, 12873, 4355, 2143, 2310, 2464, 10973, 23...
Name: review, Length: 25000, dtype: object

In [None]:
tokenized.shape
# the shape has no second dimension because each row holds a Python list of a different length

(25000,)

2. Padding

In [None]:
# 2. Padding
"""After tokenization, tokenized is a list of sentences -- each sentence is represented as a list of tokens.
We want BERT to process our examples all at once (as one batch). It's just faster that way. For that reason,
we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list
of lists (of different lengths)"""

max_len = max(len(i) for i in tokenized.values) # length of the longest tokenized review

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values]) # 0 is the [PAD] token id

In [None]:
padded

array([[  101, 22953,  2213, ...,     0,     0,     0],
       [  101,  2358, 10050, ...,     0,     0,     0],
       [  101, 11573,  2160, ...,     0,     0,     0],
       ...,
       [  101,  2143,  4060, ...,     0,     0,     0],
       [  101,  2147,  2465, ...,     0,     0,     0],
       [  101, 12873,  4355, ...,     0,     0,     0]])

In [None]:
""" Our dataset is now in the padded variable, we can view its dimensions below """
np.array(padded).shape

(25000, 1879)

3. Masking

In [None]:
# 3. Masking
""" If we directly send padded to BERT, that would slightly confuse it. We need to create another variable to tell
it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is: """

attention_mask = np.where(padded != 0, 1, 0)

In [None]:
attention_mask

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])

In [None]:
attention_mask.shape

(25000, 1879)
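
On arrays of this size it is hard to see what padding and masking actually do, so here is a small sketch of both steps on three made-up token id lists. It follows the same two expressions used in the cells above; only the toy ids are assumptions.

```python
import numpy as np

# Three made-up token id lists of different lengths (101 = [CLS], 102 = [SEP]).
token_lists = [[101, 2143, 102], [101, 2204, 2143, 4060, 102], [101, 102]]

# Pad every list with 0 (the [PAD] id) up to the longest length.
max_len = max(len(ids) for ids in token_lists)
padded_toy = np.array([ids + [0] * (max_len - len(ids)) for ids in token_lists])

# 1 marks a real token, 0 marks padding that the model should ignore.
mask_toy = np.where(padded_toy != 0, 1, 0)

print(padded_toy)
print(mask_toy)
```

Summing the mask along each row recovers the original sequence lengths (3, 5 and 2 here), which is a quick sanity check on real data as well.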

###### Now that we have our model and inputs ready, let's run our model! The model() function runs our sentences through BERT. The results of the processing will be returned into last_hidden_states.

In [None]:
import torch
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

In [None]:
# Don't run this on Google Colab, since it might crash the notebook; run it only on your personal computer
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)
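
Pushing all 25,000 padded reviews through the model in one call needs a lot of memory, which is probably why the cell above can crash a notebook. A common workaround is to process the rows in mini-batches and concatenate the results. The sketch below shows only the batching logic: `embed_in_batches` and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in for the real model call so that the example runs without downloading any weights.

```python
import numpy as np

def iter_batches(n_rows, batch_size):
    """Yield (start, stop) index pairs that cover range(n_rows)."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

def embed_in_batches(input_ids, attention_mask, encode_batch, batch_size=32):
    """Apply encode_batch to slices of the inputs and stack the outputs."""
    chunks = []
    for start, stop in iter_batches(len(input_ids), batch_size):
        chunks.append(encode_batch(input_ids[start:stop], attention_mask[start:stop]))
    return np.concatenate(chunks, axis=0)

def fake_encode(ids, mask):
    """Stand-in for the model: returns two numbers per row instead of an embedding."""
    return np.stack([ids.sum(axis=1), mask.sum(axis=1)], axis=1).astype(float)

ids = np.arange(40).reshape(10, 4)
mask = np.ones_like(ids)
out = embed_in_batches(ids, mask, fake_encode, batch_size=4)
print(out.shape)  # (10, 2)
```

In the notebook, `encode_batch` would instead wrap the DistilBERT call inside `torch.no_grad()` on tensor slices and return the first-token slice of the output as a NumPy array.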

###### Let's slice only the part of the output that we need: the output corresponding to the first token of each sentence. BERT does sentence classification by adding a token called [CLS] (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

In [None]:
X = last_hidden_states[0][:,0,:].numpy()
X
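
The indexing `[:, 0, :]` is easier to read on a small array. The sketch below uses a random NumPy array in place of the model output, with made-up dimensions (4 sentences, 6 tokens, 8 hidden units; DistilBERT's real hidden size is 768).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the model output: (number of sentences, tokens per sentence, hidden size)
hidden = rng.normal(size=(4, 6, 8))

# Keep every sentence, only token position 0 ([CLS]), and every hidden unit.
cls_embeddings = hidden[:, 0, :]

print(cls_embeddings.shape)  # (4, 8)
```

The result has one fixed-length vector per sentence, which is the shape a scikit-learn classifier expects for its feature matrix.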

#### Now that the text features have been converted into numerical features (cross-sectional data), we can apply the preprocessing steps discussed in the cross-sectional data section, such as feature scaling, dimensionality reduction, and outlier removal.

### **Train Test Split**

In [None]:
# sklearn's train_test_split crashed the notebook for me, so I am doing the split manually
# using the first 75% of the rows as training data and the remaining 25% as testing data (no shuffling)

X_train = X[:int(len(X)*0.75)]
X_test = X[int(len(X)*0.75):]

Y_train = Y[:int(len(Y)*0.75)]
Y_test = Y[int(len(Y)*0.75):]
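
The manual split above takes rows in their original order, which is only safe when the file is not sorted by label. If that is a concern, a shuffled manual split is a small change. The sketch below is one way to do it with NumPy; `shuffled_split` is a hypothetical helper and the toy arrays are made up.

```python
import numpy as np

def shuffled_split(X, Y, train_frac=0.75, seed=0):
    """Hypothetical helper: shuffle row indices once, then cut at train_frac."""
    X, Y = np.asarray(X), np.asarray(Y)
    order = np.random.default_rng(seed).permutation(len(X))
    cut = int(len(X) * train_frac)
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

X_toy = np.arange(20).reshape(10, 2)   # 10 rows, 2 features
Y_toy = np.arange(10)                  # row i has label i, so rows are easy to track
Xtr, Xte, Ytr, Yte = shuffled_split(X_toy, Y_toy)

print(Xtr.shape, Xte.shape, Ytr.shape, Yte.shape)  # (7, 2) (3, 2) (7,) (3,)
```

Because the same index permutation is applied to both arrays, every feature row stays paired with its label.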

In [None]:
print(X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)

(18750, 50326) (6250, 50326) (18750,) (6250,)


## **Training & Predicting**

### Using MultinomialNB Classification Algorithm

In [None]:
# Here I am Using Naive Bayes Classifier... you can use any other classifier
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [None]:
nb.fit(X_train, Y_train)
Y_pred_test = nb.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy: ", accuracy_score(Y_test, Y_pred_test))

from sklearn.metrics import precision_score
print("Precision: ", precision_score(Y_test, Y_pred_test))

from sklearn.metrics import recall_score
print("Recall: ", recall_score(Y_test, Y_pred_test))

Accuracy:  0.78208
Precision:  0.8266024453501297
Recall:  0.71392
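
For reference, the three scores are simple ratios of confusion-matrix counts: accuracy = (TP + TN) / N, precision = TP / (TP + FP), and recall = TP / (TP + FN). The sketch below computes them by hand on a made-up set of eight predictions; `binary_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Hypothetical helper: accuracy, precision and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
acc, prec, rec = binary_metrics(y_true, y_pred)
print(acc, round(prec, 3), rec)  # 0.625 0.667 0.5
```

A precision higher than recall, as in the results above, means the classifier is fairly reliable when it predicts "positive" but misses a noticeable share of the positive reviews.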


Since we know the dataset is balanced, accuracy is a reasonable metric to focus on, rather than recall alone. The accuracy here is modest. One possible reason is that the large feature vector adds a lot of noise in the form of very rarely occurring features that are not useful for learning. To test this, set the vectorizer's max_features parameter to cap the number of features, and experiment with this number until you get the best possible accuracy.

### Using BernoulliNB Classification Algorithm

In [None]:
# Here I am Using Naive Bayes Classifier... you can use any other classifier
from sklearn.naive_bayes import BernoulliNB
nb = BernoulliNB()

In [None]:
nb.fit(X_train, Y_train)
Y_pred_test = nb.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy: ", accuracy_score(Y_test, Y_pred_test))

from sklearn.metrics import precision_score
print("Precision: ", precision_score(Y_test, Y_pred_test))

from sklearn.metrics import recall_score
print("Recall: ", recall_score(Y_test, Y_pred_test))

### Using Logistic Regression Algorithm

In [None]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression(solver='lbfgs', C=1000) # C is the inverse of the regularization strength (larger C = weaker regularization); it is a hyperparameter, and you can use grid search to find a good value
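
As a sketch of the grid search mentioned above, the example below tunes `C` with `GridSearchCV` on a small synthetic dataset. The synthetic data, the candidate values, `cv=3` and `max_iter=1000` are all assumptions chosen to keep the example fast; on the real features you would pass `X_train` and `Y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic binary classification problem, standing in for the review features.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate values for C (inverse regularization strength), spaced on a log scale.
param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000]}

search = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=1000),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

Since the search uses cross-validation on the training split only, the test split stays untouched for the final evaluation.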

In [None]:
logreg.fit(X_train, Y_train)
Y_pred_test = logreg.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy: ", accuracy_score(Y_test, Y_pred_test))

from sklearn.metrics import precision_score
print("Precision: ", precision_score(Y_test, Y_pred_test))

from sklearn.metrics import recall_score
print("Recall: ", recall_score(Y_test, Y_pred_test))

Accuracy:  0.77696
Precision:  0.8264805733685402
Recall:  0.70112


### Using Random Forest Classification Algorithm

In [None]:
from sklearn.ensemble import RandomForestClassifier
randomclassifier = RandomForestClassifier(n_estimators=200,criterion='entropy')

In [None]:
randomclassifier.fit(X_train,Y_train)
Y_pred_test = randomclassifier.predict(X_test)