## TF-IDF

TF-IDF (term frequency-inverse document frequency) is a statistical measure of how relevant a word is to a document within a collection of documents.

It is the product of two metrics: how often the word appears in the document (term frequency), and how rare the word is across the collection (inverse document frequency).


Mathematical Expression

Term Frequency = (number of times the word appears in the sentence) / (number of words in the sentence)

Inverse Document Frequency = log(total number of sentences / number of sentences containing the word)

Multiplying these two quantities for every word and every sentence gives a matrix of TF-IDF features.

In the matrix built in this notebook, each row is a sentence (document) in the data set and each column is a feature (i.e. a distinct word). Each value is the importance of that word in that sentence, i.e. Term Frequency * Inverse Document Frequency.

Because words that occur in almost every sentence are down-weighted, this representation usually captures what a sentence is about better than the raw counts of the bag-of-words approach.
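
To make the two formulas concrete, here is a minimal sketch that applies them by hand to three tiny, made-up documents. The documents and the helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are assumptions for illustration only and are not part of the notebook's pipeline.

```python
import math

# Three tiny, made-up documents (already tokenized), used only for illustration.
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

def term_frequency(term, doc):
    # Occurrences of the term divided by the number of words in the document.
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    # log(total documents / documents containing the term), natural log.
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

vocabulary = sorted({word for doc in docs for word in doc})

# One row per document, one column per vocabulary word.
matrix = [[tf_idf(word, doc, docs) for word in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print([round(value, 3) for value in row])
```

In the first document, "plot" ends up with a higher weight than "good" even though "good" appears twice, because "plot" occurs in only one of the three documents.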


# Data Downloading

In [None]:
!pip install tensorflow-datasets > /dev/null

In [None]:
import tensorflow_datasets as tfds

In [None]:
(ds_train,ds_test),ds_info = tfds.load(
    name="imdb_reviews",
    split=["train","test"],
    shuffle_files=True,
    as_supervised=True,
    with_info=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


# TFIDF implementation

1. EDA and data processing

2. TFIDF vectorization

3. Models: SVM, Random Forest, Naive Bayes, Logistic Regression, Grid Searches

Note: each step is explained in detail in the code comments.

# EDA and Data Processing

Here we preprocess the data that will be fed to the TF-IDF vectorizer:

1. Extract the data.

2. Preprocess it with the basic_clean and tfidf_clean functions.

3. Remove low-frequency words.

Each step is explained in detail in the code comments.


In [None]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords

In [None]:
# Create a pandas DataFrame from each TensorFlow dataset object.
# take() is given a value above 25000 (the size of each split) so that no examples are missed.
ds_train = tfds.as_dataframe(ds_train.take(25100), ds_info)
ds_test = tfds.as_dataframe(ds_test.take(25100), ds_info)

In [None]:
ds_train.head(5)
# The b'...' prefixes show that the text is stored as bytes, so we need to decode it (UTF-8) to str and do some basic cleaning.

Unnamed: 0,label,text
0,0,"b""This was an absolutely terrible movie. Don't..."
1,0,b'I have been known to fall asleep during film...
2,0,b'Mann photographs the Alberta Rocky Mountains...
3,1,b'This is the kind of film for a snowy Sunday ...
4,1,"b'As others have mentioned, all the women that..."


In [None]:
ds_train['label'].value_counts()

1    12500
0    12500
Name: label, dtype: int64

In [None]:
ds_test['label'].value_counts()
# Both splits are evenly divided between 0s and 1s.

1    12500
0    12500
Name: label, dtype: int64

In [None]:
# As we can see, there are some stray characters in the text (punctuation, HTML line breaks), so let's do some very basic cleaning.
def basic_clean(txt):
  txt = txt.decode("utf-8")  # bytes -> str, which removes the b'' prefix
  txt = re.sub(r"[.;:!'?,\"()\[\]]", "", txt.lower())  # lower-case and remove punctuation
  txt = re.sub(r"(<br\s*/><br\s*/>)|(\-)|(\/)", " ", txt)  # replace double <br /> tags, hyphens and slashes with a space
  return txt
ds_train['text'] =  ds_train['text'].apply(basic_clean)
ds_test['text'] =  ds_test['text'].apply(basic_clean)
ds_train.head(5)

Unnamed: 0,label,text
0,0,this was an absolutely terrible movie dont be ...
1,0,i have been known to fall asleep during films ...
2,0,mann photographs the alberta rocky mountains i...
3,1,this is the kind of film for a snowy sunday af...
4,1,as others have mentioned all the women that go...


In [None]:
# Pipeline: clean the raw text, tokenize, lemmatize, remove low-frequency words, fit the TF-IDF vectorizer, then fit the models.

In [None]:
nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK releases may also require nltk.download('punkt_tab')
nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()  # created once instead of on every call
stop_word = set(nltk.corpus.stopwords.words('english'))  # stopword corpus, as a set for fast lookups
def tfidf_clean(txt):
    tkns = nltk.word_tokenize(txt)  # tokenize: split the text into its tokens
    lowr = [word.lower() for word in tkns]  # lower-case for uniformity across the data set
    no_stopwords = [word for word in lowr if word not in stop_word]  # remove stopwords, which add little to the context of a sentence
    noalpha = [word for word in no_stopwords if word.isalpha()]  # keep purely alphabetic tokens, since this is a textual analysis
    lemma_txt = [wnl.lemmatize(word) for word in noalpha]  # lemmatize: reduce each word to its base form
    return lemma_txt  # final token list for TF-IDF

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
# apply the above function to both the train and test parts of the data set
ds_train['text'] =  ds_train['text'].apply(tfidf_clean)
ds_test['text'] =  ds_test['text'].apply(tfidf_clean)

In [None]:
ds_train.head(5)

Unnamed: 0,label,text
0,0,"[absolutely, terrible, movie, dont, lured, chr..."
1,0,"[known, fall, asleep, film, usually, due, comb..."
2,0,"[mann, photograph, alberta, rocky, mountain, s..."
3,1,"[kind, film, snowy, sunday, afternoon, rest, w..."
4,1,"[others, mentioned, woman, go, nude, film, mos..."


In [None]:
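# Next we remove words that occur fewer than 5 times: TF-IDF can give very rare words a very high weight, which adds noise and obscures the context of the data.
# Each row is now a list of tokens, so we map it to a string first.
# Note: map(str) stores the list's repr, e.g. "['kind', 'film']", so the tokens counted below keep their quotes, commas and brackets.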
ds_train['text']= ds_train['text'].map(str)

In [None]:
# find the words that occur fewer than 5 times
frequency_train = pd.Series(' '.join(ds_train['text']).split()).value_counts()
less_five_frequency_train = frequency_train[(frequency_train <5)]
print(less_five_frequency_train)

'janel',          4
'softness',       4
'järegård',       4
'synchronous',    4
'cecils',         4
                 ..
'usury',          1
'flicklots',      1
'miscasted',      1
'extases',        1
'peopleone',      1
Length: 63732, dtype: int64


In [None]:
# we do the same for test as well
ds_test['text']= ds_test['text'].map(str)
frequency_test = pd.Series(' '.join(ds_test['text']).split()).value_counts()
less_five_frequency_test = frequency_test[(frequency_test <5)]
print(less_five_frequency_test)

'miscalculation',            4
'ferber',                    4
['abysmal',                  4
'reunites',                  4
'prostituting',              4
                            ..
'personso',                  1
'aksuan',                    1
'mattlock',                  1
'unbelievableeverything',    1
'stillcouldnt',              1
Length: 62901, dtype: int64


In [None]:
# remove the low-frequency words (`in` on a pandas Series tests its index, which holds the words)
ds_train['text'] = ds_train['text'].apply(lambda x: ' '.join(x for x in x.split() if x not in less_five_frequency_train))
ds_test['text'] = ds_test['text'].apply(lambda x: ' '.join(x for x in x.split() if x not in less_five_frequency_test))
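
The same low-frequency filter can also be run on the token lists themselves, which avoids counting tokens that still carry quotes, commas or brackets from the list's string form. The following is a minimal sketch of that alternative, not the notebook's method: `token_lists` is made-up data standing in for the output of `tfidf_clean`, and `MIN_COUNT` is a hypothetical name for the threshold.

```python
from collections import Counter

# Hypothetical token lists standing in for the output of tfidf_clean.
token_lists = [
    ["great", "movie", "great", "cast"],
    ["terrible", "movie"],
    ["great", "movie", "usury"],
]

MIN_COUNT = 2  # the notebook uses 5; a smaller threshold suits this toy data

# Count every token across the whole corpus.
counts = Counter(token for tokens in token_lists for token in tokens)

# A set gives constant-time membership tests while filtering.
rare = {token for token, count in counts.items() if count < MIN_COUNT}

# Join the surviving tokens into one string per document for the vectorizer.
cleaned = [" ".join(t for t in tokens if t not in rare) for tokens in token_lists]
print(cleaned)
```

Joining with a space (instead of `str()` on the list) yields plain text that a vectorizer can tokenize directly.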

# TF-IDF Vectorization

After preprocessing the data, we do the TF-IDF vectorization.

In [None]:
# TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer  # converts a collection of raw documents to a matrix of TF-IDF features
# ngram_range=(1, 2) uses both unigrams (single words) and bigrams (pairs of adjacent words);
# max_features=20000 keeps only the 20,000 most frequent features (the distinct terms explained above).
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=20000)
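
As a rough illustration of what unigrams and bigrams mean here, the sketch below builds both from a short token list. `word_ngrams` is a hypothetical helper written for this example; it is not scikit-learn's implementation, only the same idea in a few lines.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

features = word_ngrams(["absolutely", "terrible", "movie"])
print(features)
# ['absolutely', 'terrible', 'movie', 'absolutely terrible', 'terrible movie']
```

Bigrams let the model see pairs such as "terrible movie" as a single feature, at the cost of a much larger vocabulary, which is why a cap like `max_features` is useful.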

In [None]:
tfidf_training_features = vectorizer.fit_transform(ds_train['text'])  # learn the vocabulary and IDF weights on train, then convert train into a TF-IDF matrix
tfidf_test_features = vectorizer.transform(ds_test['text'])  # convert test with the vocabulary and IDF weights learned on train (no refitting)

print(tfidf_training_features.shape)
print(tfidf_test_features.shape)

(25000, 20000)
(25000, 20000)
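
Note that the values in this matrix do not follow the plain formula from the introduction exactly. As far as the scikit-learn documentation describes the default settings (`smooth_idf=True`, `norm='l2'`), the term frequency is the raw count, the IDF is smoothed as ln((1 + n) / (1 + df)) + 1, and each row is scaled to unit Euclidean length. The sketch below reproduces that weighting by hand on two made-up documents; the data and helper names are assumptions for illustration.

```python
import math

# Two tiny, made-up documents (already tokenized).
docs = [["good", "movie", "good"], ["bad", "movie"]]
vocabulary = sorted({word for doc in docs for word in doc})  # ['bad', 'good', 'movie']
n_docs = len(docs)

def smoothed_idf(term):
    # ln((1 + n) / (1 + df)) + 1: never zero, even for a term found in every document.
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times smoothed IDF, then L2-normalize the row.
    # Assumes the document contains at least one vocabulary word.
    raw = [doc.count(term) * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(value, 3) for value in row])
```

With the plain log(N / df) formula, "movie" would get a weight of zero because it appears in every document; with smoothing it keeps a small non-zero weight.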


# Models
# SVM

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [None]:
model = SVC(kernel ='linear', C = 1) #SVC model

In [None]:
model.fit(tfidf_training_features, ds_train['label']) #fit the model

SVC(C=1, kernel='linear')

In [None]:
y_pred_tfidf_svm = model.predict(tfidf_test_features) #predict values in tfidf_test_features
acc = accuracy_score(y_pred_tfidf_svm, ds_test['label']) #measure accuracy of test
print('Test accuracy Score svm:', acc*100)

Test accuracy Score svm: 87.392


# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model_lr = LogisticRegression() #logistic regression model

In [None]:
model_lr.fit(tfidf_training_features, ds_train['label']) #fit the model

LogisticRegression()

In [None]:
y_pred_tfidf_lr = model_lr.predict(tfidf_test_features) #predict values in tfidf_test_features
acc = accuracy_score(y_pred_tfidf_lr, ds_test['label']) #measure the accuracy of test
print('Test accuracy Score logistic regression:', acc*100)

Test accuracy Score logistic regression: 87.924


# Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
model_nb = MultinomialNB() #naive Bayes model

In [None]:
model_nb.fit(tfidf_training_features, ds_train['label']) #fit model

MultinomialNB()

In [None]:
y_pred_tfidf_nb = model_nb.predict(tfidf_test_features) #predict values on tfidf_test_features
acc = accuracy_score(y_pred_tfidf_nb, ds_test['label']) #measure accuracy of test
print('Test accuracy Score naive bayes:', acc*100)

Test accuracy Score naive bayes: 84.628


# Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=100, random_state=0) #random forest model

In [None]:
model_rf.fit(tfidf_training_features, ds_train['label']) #fit rf model

RandomForestClassifier(random_state=0)

In [None]:
y_pred_tfidf_rf = model_rf.predict(tfidf_test_features) #predict the fitted model on test features
acc = accuracy_score(y_pred_tfidf_rf, ds_test['label']) #measure accuracy of the test 
print('Test accuracy Score random forest:', acc*100)

Test accuracy Score random forest: 84.64399999999999


# Grid Search: SVM and Random Forest

In [None]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [None]:
tuned_parameters = {"kernel": ["linear"], "C": [0.1, 10, 20, 100]} #search grid

In [None]:
grid = GridSearchCV(SVC(), tuned_parameters, cv = 3, refit = True, verbose = 3) #svm grid search cv

In [None]:
grid.fit(tfidf_training_features, ds_train['label']) #fit the grid

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3] END ..............C=0.1, kernel=linear;, score=0.864 total time= 4.8min
[CV 2/3] END ..............C=0.1, kernel=linear;, score=0.866 total time= 4.7min
[CV 3/3] END ..............C=0.1, kernel=linear;, score=0.863 total time= 4.7min
[CV 1/3] END ...............C=10, kernel=linear;, score=0.862 total time= 7.7min
[CV 2/3] END ...............C=10, kernel=linear;, score=0.862 total time= 7.8min
[CV 3/3] END ...............C=10, kernel=linear;, score=0.855 total time= 9.5min
[CV 1/3] END ...............C=20, kernel=linear;, score=0.861 total time= 7.8min
[CV 2/3] END ...............C=20, kernel=linear;, score=0.862 total time= 7.8min
[CV 3/3] END ...............C=20, kernel=linear;, score=0.854 total time= 8.0min
[CV 1/3] END ..............C=100, kernel=linear;, score=0.861 total time= 8.0min
[CV 2/3] END ..............C=100, kernel=linear;, score=0.862 total time= 8.1min
[CV 3/3] END ..............C=100, kernel=linear;,

GridSearchCV(cv=3, estimator=SVC(),
             param_grid={'C': [0.1, 10, 20, 100], 'kernel': ['linear']},
             verbose=3)

In [None]:
print(grid.best_params_) #best params
print(grid.best_estimator_)


{'C': 0.1, 'kernel': 'linear'}
SVC(C=0.1, kernel='linear')


In [None]:
y_pred_tfidf_grid_svm = grid.predict(tfidf_test_features) #predict with the best estimator, which refit=True retrained on the full training set
acc = accuracy_score(y_pred_tfidf_grid_svm, ds_test['label']) #measure accuracy
print('Test accuracy Score Best SVM:', acc*100)

Test accuracy Score Best SVM: 86.98


In [None]:
tuned_parameters = {"kernel": ["linear"], "C": [2,3,4,7]} #second grid search for SVM with different C values

In [None]:
grid_2 = GridSearchCV(SVC(), tuned_parameters, cv = 3, refit = True, verbose = 3) #define grid search with cv = 3

In [None]:
grid_2.fit(tfidf_training_features, ds_train['label']) #fit the grid search on the data
print(grid_2.best_params_) #best params
print(grid_2.best_estimator_)
y_pred_tfidf_grid2_svm = grid_2.predict(tfidf_test_features) # predict with best params
acc = accuracy_score(y_pred_tfidf_grid2_svm, ds_test['label']) # measure accuracy compared to test
print('Test accuracy Score Best SVM Search:', acc*100)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3] END ................C=2, kernel=linear;, score=0.880 total time= 6.8min
[CV 2/3] END ................C=2, kernel=linear;, score=0.879 total time= 7.2min
[CV 3/3] END ................C=2, kernel=linear;, score=0.873 total time= 7.0min
[CV 1/3] END ................C=3, kernel=linear;, score=0.874 total time= 7.8min
[CV 2/3] END ................C=3, kernel=linear;, score=0.874 total time= 7.9min
[CV 3/3] END ................C=3, kernel=linear;, score=0.868 total time= 7.9min
[CV 1/3] END ................C=4, kernel=linear;, score=0.869 total time= 8.2min
[CV 2/3] END ................C=4, kernel=linear;, score=0.871 total time= 8.5min
[CV 3/3] END ................C=4, kernel=linear;, score=0.863 total time= 8.6min
[CV 1/3] END ................C=7, kernel=linear;, score=0.863 total time= 9.4min
[CV 2/3] END ................C=7, kernel=linear;, score=0.864 total time= 9.5min
[CV 3/3] END ................C=7, kernel=linear;,

In [None]:
# Randomized search for Random Forest
# Note: for RandomForestClassifier, 'auto' means the same as 'sqrt'; 'auto' was deprecated and later removed in newer scikit-learn releases.
tuned_parameters = {"n_estimators": [10, 30, 50, 100, 200], "max_depth": [5, 10, 20,40, 30, 50, 70, 100], "max_features": ['auto', 'sqrt']} #rf search space
# Randomized search with 20 iterations and 3 folds: instead of trying every combination of a large grid, it evaluates 20 randomly sampled combinations.
RandomizedSearch_rf = RandomizedSearchCV(RandomForestClassifier(), tuned_parameters, n_iter = 20, cv = 3, refit = True, verbose = 3)

In [None]:
RandomizedSearch_rf.fit(tfidf_training_features, ds_train['label']) #fit the model of randomized grid search
print(RandomizedSearch_rf.best_params_) #best params
print(RandomizedSearch_rf.best_estimator_)
y_pred_tfidf_grid_rf = RandomizedSearch_rf.predict(tfidf_test_features) #predict with best params
acc = accuracy_score(y_pred_tfidf_grid_rf, ds_test['label']) #measure accuracy of the random forest predictions against the test labels
print('Test accuracy Score Best Random Forest Randomised Search:', acc*100)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END max_depth=40, max_features=auto, n_estimators=100;, score=0.839 total time=  12.2s
[CV 2/3] END max_depth=40, max_features=auto, n_estimators=100;, score=0.847 total time=  12.1s
[CV 3/3] END max_depth=40, max_features=auto, n_estimators=100;, score=0.836 total time=  11.4s
[CV 1/3] END max_depth=20, max_features=auto, n_estimators=100;, score=0.831 total time=   4.6s
[CV 2/3] END max_depth=20, max_features=auto, n_estimators=100;, score=0.840 total time=   4.6s
[CV 3/3] END max_depth=20, max_features=auto, n_estimators=100;, score=0.823 total time=   4.6s
[CV 1/3] END max_depth=20, max_features=auto, n_estimators=200;, score=0.835 total time=   9.2s
[CV 2/3] END max_depth=20, max_features=auto, n_estimators=200;, score=0.843 total time=   9.0s
[CV 3/3] END max_depth=20, max_features=auto, n_estimators=200;, score=0.830 total time=   8.9s
[CV 1/3] END max_depth=20, max_features=sqrt, n_estimators=50;, score=0.824
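
The "candidates" and "fits" counts in the logs above follow directly from the parameter grids. The sketch below, which uses only the standard library and is not how scikit-learn is implemented, enumerates the SVM grid exhaustively and samples 20 combinations from the Random Forest grid; `all_candidates` and `sample_candidates` are hypothetical helper names, and the fixed seed is an assumption for reproducibility.

```python
import itertools
import random

svm_grid = {"kernel": ["linear"], "C": [0.1, 10, 20, 100]}
rf_grid = {
    "n_estimators": [10, 30, 50, 100, 200],
    "max_depth": [5, 10, 20, 40, 30, 50, 70, 100],
    "max_features": ["auto", "sqrt"],
}

def all_candidates(grid):
    # Exhaustive grid search: the Cartesian product of all parameter values.
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(grid[key] for key in keys))]

def sample_candidates(grid, n_iter, seed=0):
    # Randomized search: draw n_iter distinct combinations from the full grid.
    rng = random.Random(seed)
    return rng.sample(all_candidates(grid), n_iter)

cv = 3  # number of cross-validation folds
svm_candidates = all_candidates(svm_grid)
rf_all = all_candidates(rf_grid)
rf_sampled = sample_candidates(rf_grid, n_iter=20)

print(len(svm_candidates), "candidates,", len(svm_candidates) * cv, "fits")
print(len(rf_all), "possible,", len(rf_sampled), "sampled,", len(rf_sampled) * cv, "fits")
```

The full Random Forest grid has 80 combinations, so sampling 20 of them cuts the work from 240 fits to 60 at the risk of missing the best combination.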

# Conclusion

TF-IDF combined with classical statistical models is a good fit for sentiment analysis: the best test accuracy reaches 87.92%.

Logistic Regression was the best model (87.92%), closely followed by the linear SVM (87.39%), while Naive Bayes (84.63%) and Random Forest (84.64%) were roughly three points lower on the test data.