**Part_1: 2nd**

**Problem statement:** Sometimes a user writes "Good", "Nice App", or other positive text in a review but gives a 1-star rating. Your goal is to identify the reviews where the semantics of the review text do not match the rating.

In [None]:
#importing the libraries
import pandas as pd
import numpy as np
import nltk 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Mount Google Drive** to read the dataset from the drive.

In [None]:
#read the dataset
data=pd.read_csv("/content/drive/MyDrive/Assignment/chrome_reviews.csv")
data.head(3)

Unnamed: 0,ID,Review URL,Text,Star,Thumbs Up,User Name,Developer Reply,Version,Review Date,App ID
0,3886,https://play.google.com/store/apps/details?id=...,This is very helpfull aap.,5,0,INDIAN Knowledge,,83.0.4103.106,2020-12-19,com.android.chrome
1,3887,https://play.google.com/store/apps/details?id=...,Good,3,2,Ijeoma Happiness,,85.0.4183.127,2020-12-19,com.android.chrome
2,3888,https://play.google.com/store/apps/details?id=...,Not able to update. Neither able to uninstall.,1,0,Priti D BtCFs-29,,85.0.4183.127,2020-12-19,com.android.chrome


**Since our goal is to identify reviews where the semantics of the review text do not match the rating, using all the features does not make sense. We therefore select the features that play a major role: "Text" (content) and "Star" (rating).**

In [None]:
data_use=data[['ID','Text','Star']]
data_use.head(3)

Unnamed: 0,ID,Text,Star
0,3886,This is very helpfull aap.,5
1,3887,Good,3
2,3888,Not able to update. Neither able to uninstall.,1


**Creating another feature called "Result" to classify reviews as positive or negative based on the rating. If the rating is >= 2, the review is "positive", represented by 1; otherwise it is "negative", represented by 0.**

In [None]:
data_use['Result'] = data_use['Star'].apply(lambda x: 1 if x >= 2 else 0)
data_use.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,ID,Text,Star,Result
0,3886,This is very helpfull aap.,5,1
1,3887,Good,3,1
2,3888,Not able to update. Neither able to uninstall.,1,0
3,3889,Nice app,4,1
4,3890,Many unwanted ads,1,0
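
The warning in the output above appears because `data_use` is a slice of `data`. A minimal sketch of one way to avoid it, assuming the same column names: take an explicit copy and build the label with a vectorized comparison. The small frame and the names `reviews` and `subset` are made up for illustration, using the rows shown above.

```python
import pandas as pd

# Toy frame mirroring the first five rows shown above.
reviews = pd.DataFrame({
    "ID": [3886, 3887, 3888, 3889, 3890],
    "Text": ["This is very helpfull aap.", "Good",
             "Not able to update. Neither able to uninstall.",
             "Nice app", "Many unwanted ads"],
    "Star": [5, 3, 1, 4, 1],
})

# .copy() makes an independent frame, so adding a column does not warn.
subset = reviews[["ID", "Text", "Star"]].copy()

# Same rule as the notebook: rating >= 2 -> 1 (positive), otherwise 0.
subset["Result"] = (subset["Star"] >= 2).astype(int)
print(subset[["Star", "Result"]])
```

The comparison `subset["Star"] >= 2` works on the whole column at once, so no `apply` with a lambda is needed.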


**Now we can use "Result" as the target variable for our problem statement.**

**Data Cleaning and preprocessing**

Defining a method for cleaning and preprocessing (removing stopwords, unnecessary white space, and characters other than letters, and applying stemming).

In [None]:
# data cleaning and preprocessing
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
# creating object for PorterStemmer
ps=PorterStemmer()

# Method to clean the data
def data_clean_step1(data_set):
  # Loading the stopword list once instead of once per word
  stop_words=set(stopwords.words('english'))
  corpus=[]
  for text in data_set['Text']:
    # Replacing every character other than a letter with a space
    review=re.sub("[^a-zA-Z]"," ",str(text))

    # Converting into lowercase
    review=review.lower()

    # Splitting the review into words
    review=review.split()

    # Removing stopwords and stemming the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]

    # Joining the stemmed words back into one string
    review=' '.join(review)

    # Collecting the processed reviews
    corpus.append(review)

  # Replacing the original text with the processed reviews
  data_set['Text']=corpus

  return data_set

**Calling method to clean the data.**

In [None]:
data_use=data_clean_step1(data_use)
data_use.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,ID,Text,Star,Result
0,3886,helpful aap,5,1
1,3887,good,3,1
2,3888,abl updat neither abl uninstal,1,0
3,3889,nice app,4,1
4,3890,mani unwant ad,1,0
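
The cleaning steps above can also be tried on a single string. The sketch below is a simplified stand-in, not the notebook's function: `clean_text` and `STOP_WORDS` are hypothetical names, the stopword set is a tiny hand-picked sample rather than the NLTK list, and the stemmer is a pluggable argument that defaults to doing nothing.

```python
import re

# Tiny stand-in stopword set, an assumption for this sketch only;
# the notebook uses the full NLTK English list.
STOP_WORDS = {"this", "is", "very", "not", "to", "the", "a", "an", "and"}


def clean_text(text, stem=lambda word: word):
    """Apply the notebook's cleaning steps to one string."""
    # Replace every non-letter character with a space, then lowercase.
    letters_only = re.sub("[^a-zA-Z]", " ", str(text)).lower()
    # Split on whitespace, drop stopwords, and stem what remains.
    words = [stem(w) for w in letters_only.split() if w not in STOP_WORDS]
    return " ".join(words)


print(clean_text("This is very helpfull aap."))
print(clean_text("Many unwanted ads!!", stem=lambda w: w.rstrip("s")))
```

Passing `PorterStemmer().stem` as `stem` and the NLTK stopword list in place of `STOP_WORDS` should bring this closer to the notebook's behaviour.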


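Defining a method to remove NaN values, if any, after cleaning and preprocessing the data. A review that contained only stopwords or non-letter characters becomes an empty string, which is converted to NaN and dropped.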

In [None]:
def data_clean_step2():
  global data_use
  # Replacing empty strings left in 'Text' after cleaning with NaN, then dropping those rows
  nan_value = float("NaN")
  data_use.replace("", nan_value, inplace=True)
  data_use=data_use.dropna()

  # Resetting the index so that it is contiguous again
  data_use.reset_index(inplace=True)
  


**Calling method to remove NAN values from the data**

In [None]:
data_clean_step2()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  method=method,
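
A variant of this step that avoids the `global` statement is sketched below. It is an assumption-laden alternative, not the notebook's method: `drop_empty_text` is a hypothetical helper that only inspects the text column, also treats whitespace-only strings as empty, and returns a new frame instead of modifying one in place.

```python
import pandas as pd


def drop_empty_text(frame, column="Text"):
    """Return a copy without rows whose `column` is empty or whitespace-only."""
    # Treat missing values as empty strings before stripping whitespace.
    text = frame[column].fillna("").astype(str).str.strip()
    kept = frame[text != ""].copy()
    # drop=True discards the old index instead of keeping it as a column.
    return kept.reset_index(drop=True)


toy = pd.DataFrame({"Text": ["good", "", "nice app", " "], "Result": [1, 0, 1, 0]})
cleaned = drop_empty_text(toy)
print(cleaned)
```

Used in the notebook, this would read `data_use = drop_empty_text(data_use)`.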


**Assigning values for independent (X) and dependent (y) features.**

In [None]:
X=data_use['Text'].values
y=data_use['Result'].values

# Splitting the dataset into train set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
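
If the two classes are imbalanced, a stratified split keeps their proportions the same in the train and test sets. The sketch below illustrates the `stratify` argument of `train_test_split` on made-up data; it is an optional variation, not what the notebook ran, and the `_toy` names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 placeholder reviews with a balanced 0/1 label.
X_toy = np.array([f"review {i}" for i in range(20)])
y_toy = np.array([0, 1] * 10)

# stratify=y_toy asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```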

Here, we create **pipelines** that build a **bag of words** with **CountVectorizer**, apply **TfidfTransformer**, and end in an **estimator**. Along with creating the pipelines, we perform **hyperparameter tuning** using **GridSearchCV**.
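
Grid keys for a pipeline are written as the step name, two underscores, and the parameter name. The sketch below uses the step names and the MultinomialNB grid values from this notebook to check that every key is a real pipeline parameter and to count the candidate combinations; nothing is fitted.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clfNB", MultinomialNB()),
])

# Keys follow the pattern <step name>__<parameter name>.
grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf__use_idf": (True, False),
    "clfNB__alpha": [0.01, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
}

# Every grid key must be a parameter the pipeline actually exposes.
assert all(key in pipe.get_params() for key in grid)

n_candidates = len(ParameterGrid(grid))  # 3 * 2 * 11 combinations
print(n_candidates)
```

Each candidate is fitted once per cross-validation fold, so the number of fits is `n_candidates` times the number of folds. This is why the larger grids in this notebook take much longer to search.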

In [None]:
# creating lists to store models, best_scores and best_parameters

models,model_names,best_scores,best_params,test_score=[],[],[],[],[]

Creating MultinomialNB model

In [None]:
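from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score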

# Creating MultinomialNB model
from sklearn.naive_bayes import MultinomialNB
text_clf_NB = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clfNB', MultinomialNB())])

parameters_NB = {'vect__ngram_range': [(1, 1), (1, 2),(1,3)],'tfidf__use_idf': (True, False),'clfNB__alpha': [0.01,0.1,0.15,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]}

MultinomialNB_classifier = GridSearchCV(text_clf_NB, parameters_NB, n_jobs=-1)
MultinomialNB_classifier = MultinomialNB_classifier.fit(X_train,y_train)

NB_model=MultinomialNB_classifier.best_estimator_
y_pred=NB_model.predict(X_test)

model_names.append('MultinomialNB')
models.append(MultinomialNB_classifier)
best_params.append(MultinomialNB_classifier.best_params_)
best_scores.append(MultinomialNB_classifier.best_score_)
test_score.append(accuracy_score(y_test,y_pred))

print("Test Accuracy score : ",accuracy_score(y_test,y_pred))
print(MultinomialNB_classifier.best_score_)
print(MultinomialNB_classifier.best_params_)

Test Accuracy score :  0.8339324227174695
0.838339247739275
{'clfNB__alpha': 0.1, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}
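
The fit, predict, and record steps in the cell above are repeated for every model in this notebook. One possible way to factor them out is sketched below; `fit_and_record` is a hypothetical helper, the toy data is made up, and results are collected as a list of dicts instead of the parallel lists used here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def fit_and_record(name, search, X_train, y_train, X_test, y_test, results):
    """Fit a GridSearchCV object and append its scores to `results`."""
    search.fit(X_train, y_train)
    y_pred = search.best_estimator_.predict(X_test)
    results.append({
        "model": name,
        "cv_score": search.best_score_,
        "best_params": search.best_params_,
        "test_accuracy": accuracy_score(y_test, y_pred),
    })
    return search


# Toy data, made up purely for illustration.
X_train_toy = ["good app", "nice app", "great browser", "love it",
               "bad app", "crash always", "many ads", "not working"]
y_train_toy = [1, 1, 1, 1, 0, 0, 0, 0]
X_test_toy = ["nice browser", "bad ads"]
y_test_toy = [1, 0]

pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clfNB", MultinomialNB())])
# cv=2 only because the toy set is tiny.
search = GridSearchCV(pipe, {"clfNB__alpha": [0.1, 0.5]}, cv=2)

results = []
fit_and_record("MultinomialNB", search, X_train_toy, y_train_toy,
               X_test_toy, y_test_toy, results)
print(results[0]["model"], results[0]["test_accuracy"])
```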


Creating model for LogisticRegression


In [None]:
# Creating model for LogisticRegression
from sklearn.linear_model import LogisticRegression
text_clf_Log = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clfLog', LogisticRegression())])
parameters_Log = {'vect__ngram_range': [(1, 1), (1, 2),(1,3)],'tfidf__use_idf': (True, False),
                  'clfLog__penalty':['l1','l2','elasticnet','none'],'clfLog__C':[1.0,1.5,2.0],
                  'clfLog__solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                  'clfLog__multi_class':['ovr']}
LogisticRegression_classifier = GridSearchCV(text_clf_Log, parameters_Log, n_jobs=-1)
LogisticRegression_classifier = LogisticRegression_classifier.fit(X_train,y_train)

Logistic_model=LogisticRegression_classifier.best_estimator_
y_pred=Logistic_model.predict(X_test)

model_names.append('LogisticRegression')
models.append(LogisticRegression_classifier)
best_params.append(LogisticRegression_classifier.best_params_)
best_scores.append(LogisticRegression_classifier.best_score_)
test_score.append(accuracy_score(y_test,y_pred))

print("Test accuracy score : ",accuracy_score(y_test,y_pred)) 
print(LogisticRegression_classifier.best_score_)
print(LogisticRegression_classifier.best_params_)

Test accuracy score :  0.8396836808051761
0.8401399096356338
{'clfLog__C': 2.0, 'clfLog__multi_class': 'ovr', 'clfLog__penalty': 'l2', 'clfLog__solver': 'newton-cg', 'tfidf__use_idf': False, 'vect__ngram_range': (1, 3)}
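
Some solver and penalty pairs in the grid above, such as `newton-cg` with `l1`, are not supported by `LogisticRegression`, so those candidates fail to fit and contribute nothing to the search. `GridSearchCV` also accepts a list of dicts, which lets each group of solvers be paired only with penalties it accepts. The sketch below is one such grouping, assuming the same step names; it leaves out `elasticnet`, which would additionally need an `l1_ratio` value, and it only enumerates the combinations without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

shared = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf__use_idf": [True, False],
    "clfLog__C": [1.0, 1.5, 2.0],
}

# One dict per group of solvers, listing only penalties those solvers accept.
compatible_grid = [
    {**shared, "clfLog__solver": ["newton-cg", "lbfgs", "sag"], "clfLog__penalty": ["l2"]},
    {**shared, "clfLog__solver": ["liblinear"], "clfLog__penalty": ["l1", "l2"]},
    {**shared, "clfLog__solver": ["saga"], "clfLog__penalty": ["l1", "l2"]},
]

combos = list(ParameterGrid(compatible_grid))
print(len(combos))
```

A list like `compatible_grid` can be passed to `GridSearchCV` in place of a single dict.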


Creating model for PassiveAggressiveClassifier


In [None]:
# Creating model for PassiveAggressiveClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
text_clf_linear = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clfLinear', PassiveAggressiveClassifier())])

parameters_PAL = {'vect__ngram_range': [(1, 1), (1, 2),(1,3),(1,4)],'tfidf__use_idf': (True, False),'clfLinear__loss':['hinge','squared_hinge']}

PassiveAggressive_classifier = GridSearchCV(text_clf_linear, parameters_PAL, n_jobs=-1)

PassiveAggressive_classifier = PassiveAggressive_classifier.fit(X_train,y_train)

PAC_model=PassiveAggressive_classifier.best_estimator_
y_pred=PAC_model.predict(X_test)

model_names.append('PassiveAggressiveClassifier')
models.append(PassiveAggressive_classifier)
best_params.append(PassiveAggressive_classifier.best_params_)
best_scores.append(PassiveAggressive_classifier.best_score_)
test_score.append(accuracy_score(y_test,y_pred))

print("Test accuracy score : ",accuracy_score(y_test,y_pred)) 
print(PassiveAggressive_classifier.best_score_)
print(PassiveAggressive_classifier.best_params_)

Test accuracy score :  0.7929547088425594
0.8192780546452326
{'clfLinear__loss': 'hinge', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 3)}


Creating model for SGDClassifier


In [None]:
# Creating model for SGDClassifier
from sklearn.linear_model import SGDClassifier
text_clf_SGD = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clfSGD', SGDClassifier())])

parameters_SGD = {'vect__ngram_range': [(1, 1), (1, 2),(1,3),(1,4)],'tfidf__use_idf': (True, False),'clfSGD__loss':['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
                 'clfSGD__penalty':['l2', 'l1', 'elasticnet'],'clfSGD__alpha':[0.0001,0.001,0.01,0.1,0.2,0.3,0.4,0.5]}

SGD_classifier = GridSearchCV(text_clf_SGD, parameters_SGD, n_jobs=-1)
SGD_classifier = SGD_classifier.fit(X_train,y_train)

SGD_model=SGD_classifier.best_estimator_
y_pred=SGD_model.predict(X_test)

model_names.append('SGDClassifier')
models.append(SGD_classifier)
best_params.append(SGD_classifier.best_params_)
best_scores.append(SGD_classifier.best_score_)
test_score.append(accuracy_score(y_test,y_pred))

print("Test accuracy score : ",accuracy_score(y_test,y_pred))
print(SGD_classifier.best_score_)
print(SGD_classifier.best_params_)

Test accuracy score :  0.8375269590222861
0.8397803590012088
{'clfSGD__alpha': 0.0001, 'clfSGD__loss': 'log', 'clfSGD__penalty': 'elasticnet', 'tfidf__use_idf': False, 'vect__ngram_range': (1, 3)}


Creating model for SVC


In [None]:
# Creating model for SVC
from sklearn.svm import SVC
text_clf_SVC = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clfSVC', SVC())])

parameters_svc = {'vect__ngram_range': [(1, 1), (1, 2),(1,3),(1,4)],'tfidf__use_idf': (True, False),'clfSVC__C':[0.01,0.1,1,2],
                 'clfSVC__kernel':['linear', 'poly', 'rbf', 'sigmoid', 'precomputed'],'clfSVC__degree':[1,2,3,4],
                 'clfSVC__gamma':['scale','auto']}

SVC_classifier = GridSearchCV(text_clf_SVC, parameters_svc, n_jobs=-1)
SVC_classifier = SVC_classifier.fit(X_train,y_train)

SVC_model=SVC_classifier.best_estimator_
y_pred=SVC_model.predict(X_test)

model_names.append('SVC_Classifier')
models.append(SVC_classifier)
best_params.append(SVC_classifier.best_params_)
best_scores.append(SVC_classifier.best_score_)
test_score.append(accuracy_score(y_test,y_pred))

print("Test accuracy score : ",accuracy_score(y_test,y_pred))
print(SVC_classifier.best_score_)
print(SVC_classifier.best_params_)


Test accuracy score :  0.8353702372393961
0.838161330773656
{'clfSVC__C': 1, 'clfSVC__degree': 1, 'clfSVC__gamma': 'scale', 'clfSVC__kernel': 'poly', 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}


Creating model for RandomForestClassifier


In [None]:
# Creating model for RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

text_clf_RFC = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clfRFC', RandomForestClassifier())])

parameters_RFC = {'vect__ngram_range': [(1, 1), (1, 2),(1,3),(1,4)],'tfidf__use_idf': (True, False),
                 'clfRFC__n_estimators':[10,25,50,100],'clfRFC__criterion':['gini','entropy'],
                 'clfRFC__max_depth':[2,4,5,10],'clfRFC__min_samples_leaf':[2,5,10,25,50],
                  'clfRFC__max_features':['auto','sqrt','log2']
                  }

RandomForest_Classifier = GridSearchCV(text_clf_RFC, parameters_RFC, n_jobs=-1)
RandomForest_Classifier = RandomForest_Classifier.fit(X_train,y_train)

RFC_model=RandomForest_Classifier.best_estimator_
y_pred=RFC_model.predict(X_test)

model_names.append('Random_Forest_Classifier')
models.append(RandomForest_Classifier)
best_params.append(RandomForest_Classifier.best_params_)
best_scores.append(RandomForest_Classifier.best_score_)
test_score.append(accuracy_score(y_test,y_pred))

print("Test accuracy score : ",accuracy_score(y_test,y_pred))
print(RandomForest_Classifier.best_score_)
print(RandomForest_Classifier.best_params_)

Test accuracy score :  0.7361610352264558
0.7543626015629545
{'clfRFC__criterion': 'gini', 'clfRFC__max_depth': 10, 'clfRFC__max_features': 'sqrt', 'clfRFC__min_samples_leaf': 2, 'clfRFC__n_estimators': 10, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}


**Listing the details of each model to choose the best among them.**

In [None]:
for i in range(len(test_score)):
  print(model_names[i])
  print(best_scores[i])
  print(best_params[i])
  print('Test accuracy :',test_score[i])
  print('----------------------------------------------------------------------------------------')

MultinomialNB
0.838339247739275
{'clfNB__alpha': 0.1, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}
Test accuracy : 0.8339324227174695
----------------------------------------------------------------------------------------
LogisticRegression
0.8401399096356338
{'clfLog__C': 2.0, 'clfLog__multi_class': 'ovr', 'clfLog__penalty': 'l2', 'clfLog__solver': 'newton-cg', 'tfidf__use_idf': False, 'vect__ngram_range': (1, 3)}
Test accuracy : 0.8396836808051761
----------------------------------------------------------------------------------------
PassiveAggressiveClassifier
0.8192780546452326
{'clfLinear__loss': 'hinge', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 3)}
Test accuracy : 0.7929547088425594
----------------------------------------------------------------------------------------
SGDClassifier
0.8397803590012088
{'clfSGD__alpha': 0.0001, 'clfSGD__loss': 'log', 'clfSGD__penalty': 'elasticnet', 'tfidf__use_idf': False, 'vect__ngram_range': (1, 3)}
Test accuracy : 0.8375269
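
The same comparison can be laid out as a sorted table. The sketch below copies the scores printed by the individual model cells above, rounded to four decimals, into a DataFrame; the column names are chosen here for illustration.

```python
import pandas as pd

# Scores copied from the cell outputs above, rounded to four decimals.
summary = pd.DataFrame(
    [
        ("MultinomialNB", 0.8383, 0.8339),
        ("LogisticRegression", 0.8401, 0.8397),
        ("PassiveAggressiveClassifier", 0.8193, 0.7930),
        ("SGDClassifier", 0.8398, 0.8375),
        ("SVC_Classifier", 0.8382, 0.8354),
        ("Random_Forest_Classifier", 0.7544, 0.7362),
    ],
    columns=["model", "cv_score", "test_accuracy"],
)

# Highest test accuracy first.
ranked = summary.sort_values("test_accuracy", ascending=False).reset_index(drop=True)
print(ranked)
```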

From the results above, the **logistic regression model** performs best among the models compared, with the highest cross-validation score (0.8401) and the highest test accuracy (0.8397).

**We save that model.**

In [None]:
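# sklearn.externals.joblib was removed from scikit-learn; joblib is imported directly
import joblib
# Picking the fitted LogisticRegression grid search out of the models list
model=models[model_names.index('LogisticRegression')]
joblib.dump(model,'Review_model.mod')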




['Review_model.mod']
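
With a saved model, the reviews named in the problem statement can be found by comparing each prediction with the star-based label. The sketch below shows one way this might look: `flag_mismatches` is a hypothetical helper, and `KeywordModel` is a made-up stand-in so the example runs without the saved file. The text is assumed to have been cleaned the same way as the training data.

```python
import pandas as pd


def flag_mismatches(frame, model, text_col="Text", star_col="Star"):
    """Return rows whose predicted sentiment disagrees with the star-based label."""
    predicted = pd.Series(model.predict(frame[text_col].values), index=frame.index)
    # Same labelling rule as the notebook: rating >= 2 counts as positive.
    rating_label = (frame[star_col] >= 2).astype(int)
    flagged = frame[predicted != rating_label].copy()
    flagged["Predicted"] = predicted[flagged.index]
    return flagged


class KeywordModel:
    """Hypothetical stand-in for the saved model: predicts 1 if 'good' or 'nice' appears."""

    def predict(self, texts):
        return [int(any(w in str(t) for w in ("good", "nice"))) for t in texts]


# With the real model: model = joblib.load('Review_model.mod')
toy = pd.DataFrame({
    "Text": ["good", "nice app", "mani unwant ad", "abl updat"],
    "Star": [1, 4, 1, 5],
})
print(flag_mismatches(toy, KeywordModel()))
```

Filtering the result further to rows with `Predicted == 1` and `Star == 1` isolates the specific case of positive text with a 1-star rating.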