## Sentiment Analysis
Sentiment analysis, the task of assessing people's sentiment toward a particular subject from what they write about it, is an important and varied field. Machine learning and deep learning have driven rapid growth in this area, so we can apply a range of models to get better results.

### Preprocess the dataset
Here we preprocess our dataset so that it is in a form we can train our model on.

In [11]:
## Read every review from the aclImdb folders and label it numerically:
## 1 for a positive review, 0 for a negative one
import pandas as pd
import os
import numpy as np

basepath = 'aclImdb'
labels = {'pos':1,'neg':0}
rows = []
for s in ('test','train'):
    for l in ('pos','neg'):
        path = os.path.join(basepath,s,l)
        for file in os.listdir(path):
            with open(os.path.join(path,file),'r',encoding = 'utf-8') as infile:
                txt = infile.read()
            rows.append([txt,labels[l]])

## DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
df = pd.DataFrame(rows,columns = ['review','sentiment'])
## Shuffle the dataset randomly and store it in a csv file so that it is easy to reload later
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv',index = False,encoding = 'utf-8')
## Reading the data from the csv file
df = pd.read_csv('movie_data.csv',encoding = 'utf-8')

### Bag of Words Model
Here we use the bag-of-words model to turn the reviews into numerical features the model can be trained on. The model builds a vocabulary that assigns each unique word an integer index, then represents each review as a vector that counts how often each vocabulary word occurs in it.
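
As a minimal sketch of the idea, the snippet below builds a bag-of-words representation for three short sentences with scikit-learn's `CountVectorizer`. The `docs` array and the name `bow_vectorizer` are illustrative assumptions and are not part of the movie review dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical documents, used only for illustration
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])

bow_vectorizer = CountVectorizer()
bag = bow_vectorizer.fit_transform(docs)

# Mapping from each lowercased word to its column index
print(bow_vectorizer.vocabulary_)
# One row per document, one column per vocabulary word, entries are raw counts
print(bag.toarray())
```

Each row has one entry per vocabulary word, so word order within a document is discarded and only the counts remain.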

### tf-idf
Term frequency-inverse document frequency (tf-idf) weights each word's count in a review by how rare that word is across the whole dataset. Words that really affect the model are treated as important, while words that appear in almost every review are given a lesser weight.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
count = CountVectorizer()
tfidf = TfidfTransformer(use_idf = True,norm = 'l2',smooth_idf = True)
np.set_printoptions(precision = 2)
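
To see the effect of the weighting, here is a small sketch that applies the same two steps to three illustrative sentences. The `docs` list and the `demo_` names are assumptions made for this example. With `smooth_idf = True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term t.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical documents, used only for illustration
docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

demo_count = CountVectorizer()
demo_tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)

# Raw counts first, then tf-idf weighting with L2 normalisation of each row
weights = demo_tfidf.fit_transform(demo_count.fit_transform(docs)).toarray()
vocab = demo_count.vocabulary_

# 'is' occurs in every document, 'sun' only in two of them,
# so in the first document 'is' receives the smaller weight
print(weights[0, vocab['is']], weights[0, vocab['sun']])
```

Both words occur once in the first sentence, so the difference in their weights comes entirely from the inverse document frequency.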

### Cleaning text data
Here we remove the stray markup and symbols that came in with the raw reviews and are not useful for training the model.

In [15]:
## Clean the remaining text so that it is ready for training the model
## We import the Porter stemmer, which is widely used in NLP tasks:
## it reduces words to their stems so that a large vocabulary shrinks to fewer distinct tokens
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
import re
## strip HTML tags, collect the emoticons, then lowercase the text,
## replace non-word characters with spaces and append the emoticons (without the '-' nose)
def preprocessor(text):
    text = re.sub(r'<[^>]*>','',text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',text)
    text = (re.sub(r'[\W]+',' ',text.lower()) + ' '.join(emoticons).replace('-',''))
    return text
## tokenizer that splits the text into words on whitespace
def tokenizer(text):
    return text.split()
## Porter stemmer tokenizer that reduces each word to its stem
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
## applying the preprocessor step to the dataset
df['review'] = df['review'].apply(preprocessor)
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Training the Model
Here we train our model on the processed dataset and evaluate its accuracy on the held-out test portion.

In [16]:
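## Divide the data into 25,000 training and 25,000 test reviews
## .loc slices include the end label, so the training slice stops at 24999
## to keep row 25000 out of the training set
X_train = df.loc[:24999,'review'].values
y_train = df.loc[:24999,'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test = df.loc[25000:,'sentiment'].values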
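
When splitting by row labels, it helps to remember that `.loc` slices include both endpoints, whereas `.iloc` slices exclude the stop position. The sketch below shows the difference on a hypothetical six-row frame; `toy` and its values are illustrative and not part of the dataset.

```python
import pandas as pd

# Hypothetical six-row frame with the same column names as the dataset
toy = pd.DataFrame({'review': list('abcdef'),
                    'sentiment': [1, 0, 1, 0, 1, 0]})

# Label-based slices: label 3 lands in both halves
loc_train = toy.loc[:3, 'review'].values
loc_test = toy.loc[3:, 'review'].values

# Position-based slices: the stop position is excluded, so the halves are disjoint
iloc_train = toy.iloc[:3]['review'].values
iloc_test = toy.iloc[3:]['review'].values

overlap_loc = set(loc_train) & set(loc_test)
overlap_iloc = set(iloc_train) & set(iloc_test)
print(len(loc_train), overlap_loc)
print(len(iloc_train), overlap_iloc)
```

A shared row between the two halves means one test example has been seen during training, which slightly inflates the test score.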

In [None]:
## Grid search over the vectorizer and logistic regression settings with 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents = None,lowercase = False,preprocessor = None)
param_grid = [{'vect__ngram_range':[(1,1)],
               'vect__stop_words':[stop,None],
               'vect__tokenizer':[tokenizer,tokenizer_porter],
               'clf__penalty':['l1','l2'],
               'clf__C':[1.0,10.0,100.0]},
              {'vect__ngram_range':[(1,1)],
               'vect__stop_words':[stop,None],
               'vect__tokenizer':[tokenizer,tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty':['l1','l2'],
               'clf__C':[1.0,10.0,100.0]}
             ]
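## the default lbfgs solver does not support the 'l1' penalty in the grid,
## so we use liblinear, which handles both 'l1' and 'l2'
lr_tfidf = Pipeline([('vect',tfidf),
                     ('clf',
                      LogisticRegression(random_state = 0,solver = 'liblinear',max_iter=10000))])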
gs_lr_tfidf = GridSearchCV(lr_tfidf,param_grid,scoring = 'accuracy',cv = 5,verbose = 1,n_jobs = -1)
gs_lr_tfidf.fit(X_train,y_train)
## The grid search has now fitted every candidate and refit the best one on the full training set

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


In [None]:
## Now we evaluate our model's accuracy
print('CV accuracy: %.3f' %gs_lr_tfidf.best_score_)
clf = gs_lr_tfidf.best_estimator_
print('Test accuracy: %.3f' % clf.score(X_test,y_test))
## Comparing the two scores shows how well the selected model generalizes to unseen reviews
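
Because the full grid search is slow, the same workflow can be tried first on a toy corpus. The sketch below is a scaled-down illustration rather than the experiment above: `toy_reviews`, `toy_labels`, the reduced parameter grid, and `cv = 2` are all assumptions chosen so that it runs in a moment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical reviews: four positive (1) and four negative (0)
toy_reviews = ['a wonderful and moving film',
               'great acting and a great story',
               'loved every minute of it',
               'a brilliant and funny movie',
               'a dull and boring film',
               'terrible acting and a weak story',
               'hated every minute of it',
               'an awful and tedious movie']
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([('vect', TfidfVectorizer()),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])
# Pipeline parameters are addressed as <step name>__<parameter name>
toy_grid = {'vect__use_idf': [True, False],
            'clf__C': [1.0, 10.0]}

toy_gs = GridSearchCV(toy_pipe, toy_grid, scoring='accuracy', cv=2)
toy_gs.fit(toy_reviews, toy_labels)

print(toy_gs.best_params_)
# best_estimator_ is the whole pipeline, so it accepts raw text
toy_pred = toy_gs.best_estimator_.predict(['a wonderful story'])
print(toy_pred)
```

The scores on eight sentences mean little; the point is that `best_params_` reports the winning combination and `best_estimator_` can be applied directly to new text.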