
# Strategy 

* The goal is sentiment analysis: decide whether a review is positive or negative from its review text and summary text.
* Because the prediction is based on text, we can also find the top features, i.e. the words that most influence positive or negative sentiment. Both the review text and the summary text are available.
* This is a binary classification problem.
* We need to build a machine learning model that predicts whether a new review is positive or negative.
* The predicted probability indicates the strength of the prediction.
* We use AUC as the performance metric because the dataset is imbalanced and AUC is computed from the predicted probabilities.
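
To make the choice of AUC concrete, here is a minimal sketch on hypothetical toy data (the labels and scores are invented for illustration). It computes AUC as the probability that a randomly chosen positive review is scored above a randomly chosen negative one, and compares it with accuracy for a model that always predicts "positive".

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 8 positive reviews and 2 negative ones (imbalanced).
labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A "model" that gives every review the same score, i.e. always predicts positive.
constant_scores = [0.9] * 10
accuracy_always_positive = sum(1 for y in labels if y == 1) / len(labels)

# A model whose scores mostly separate the classes, with one positive ranked too low.
informative_scores = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.75, 0.3, 0.2, 0.4]

print("Accuracy of always-positive:", accuracy_always_positive)
print("AUC of always-positive     :", auc(labels, constant_scores))
print("AUC of informative model   :", auc(labels, informative_scores))
```

On imbalanced data the always-positive model looks good on accuracy but gets an AUC of only 0.5, which is why accuracy alone would be misleading here.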

# Dataset preparation and preprocessing

* Load the dataset
* Label the data for classification into positive and negative 
* Preprocessing steps 
* Feature engineering
* Feature transformation

## Load the dataset

In [2]:
import pandas as pd
import sqlite3
con = sqlite3.connect('database.sqlite')

# A rating of 3 is treated as neutral: scores above 3 are positive and scores below 3 are negative.
# Neutral reviews are excluded while loading the dataset.
loaded_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 """, con) 

loaded_data[:5]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [3]:
print(loaded_data.shape)

(525814, 10)


## Data selection

* Remove missing data
* Remove duplicates


### Missing data check

In [4]:
loaded_data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525814 entries, 0 to 525813
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      525814 non-null  int64 
 1   ProductId               525814 non-null  object
 2   UserId                  525814 non-null  object
 3   ProfileName             525814 non-null  object
 4   HelpfulnessNumerator    525814 non-null  int64 
 5   HelpfulnessDenominator  525814 non-null  int64 
 6   Score                   525814 non-null  int64 
 7   Time                    525814 non-null  int64 
 8   Summary                 525814 non-null  object
 9   Text                    525814 non-null  object
dtypes: int64(5), object(5)
memory usage: 40.1+ MB


### Removing duplicates

In [5]:
# Users with more than one review. Rows are grouped by UserId only, so the other columns
# show one review per user and COUNT(*) is that user's number of reviews.

df = pd.read_sql_query("""SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*)
FROM Reviews
WHERE Score!=3
GROUP BY UserId
HAVING COUNT(*)>1""",con)
df.head()

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
0,#oc-R115TNMSPFT9I7,B007Y59HVM,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ET0,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B007Y59HVM,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ET0,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBE1U,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [6]:
# Example of multiple entries
loaded_data[loaded_data['UserId']=='#oc-R115TNMSPFT9I7']

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
76583,83318,B005ZBZLT4,#oc-R115TNMSPFT9I7,Breyton,2,3,2,1331510400,"""Green"" K-cup packaging sacrifices flavor",Overall its just OK when considering the price...
166820,180872,B007Y59HVM,#oc-R115TNMSPFT9I7,Breyton,2,3,2,1331510400,"""Green"" K-cup packaging sacrifices flavor",Overall its just OK when considering the price...


In [7]:
# Sort the data by ProductId in ascending order
sorted_data = loaded_data.sort_values('ProductId', axis=0,
                                      ascending=True,
                                      kind='quicksort', na_position='last')

# Treat rows with the same UserId, ProfileName, Time and Text as duplicates
# and keep only the first one after sorting
final = sorted_data.drop_duplicates(subset=["UserId", "ProfileName",
                                            "Time", "Text"], keep='first')
print("Loaded Data shape :",loaded_data.shape)
print("Final Data shape :",final.shape)
print("% of data remaining :",round(final.shape[0]/loaded_data.shape[0]*100),"%")

Loaded Data shape : (525814, 10)
Final Data shape : (364173, 10)
% of data remaining : 69 %
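
The de-duplication rule used above can be sketched in plain Python. This is only an illustration of the rule (sort by ProductId, then keep the first row per UserId, ProfileName, Time and Text), not a replacement for `drop_duplicates`; the rows and the function name `drop_duplicate_reviews` are hypothetical.

```python
# Hypothetical rows: (ProductId, UserId, ProfileName, Time, Text).
rows = [
    ("B007Y59HVM", "U1", "Breyton", 1331510400, "Overall its just OK"),
    ("B005ZBZLT4", "U1", "Breyton", 1331510400, "Overall its just OK"),
    ("B001E4KFG0", "U2", "delmartian", 1303862400, "Good quality dog food"),
]

def drop_duplicate_reviews(rows):
    """Keep the first row per (UserId, ProfileName, Time, Text) after sorting by ProductId."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r[0]):
        key = row[1:]  # every field except ProductId
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

deduped = drop_duplicate_reviews(rows)
for row in deduped:
    print(row)
```

The same review posted under two product IDs collapses to a single row, which is the situation shown for user `#oc-R115TNMSPFT9I7` above.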


In [8]:
final.Score.value_counts()

5    250966
4     56097
1     36307
2     20803
Name: Score, dtype: int64

### Other rules

A helpfulness numerator should not exceed its helpfulness denominator. Two entries violate this rule, so they are removed.

In [9]:
final[final['HelpfulnessNumerator']>final['HelpfulnessDenominator']]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
59301,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
41159,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [10]:
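# Keep only rows where the numerator does not exceed the denominator.
# .copy() avoids a SettingWithCopyWarning when Score is overwritten later.
final = final[final['HelpfulnessNumerator']<=final['HelpfulnessDenominator']].copy()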

In [11]:
final.shape

(364171, 10)

## Data Labelling

If a review score is greater than 3 we replace it with 1 (positive), and if it is less than 3 we replace it with 0 (negative).

In [12]:
# Check number of not null values
print(len(final[~(pd.isnull(final.Score))]),'/',len(final),' are not null values')

# replace >3 as 1 and <3 as 0
final['Score'] = [0 if x<3 else 1 for x in final['Score']]
print('After replacing Score with 0 and 1 . Count of each would be:')
final['Score'].value_counts()

364171 / 364171  are not null values
After replacing Score with 0 and 1 . Count of each would be:


1    307061
0     57110
Name: Score, dtype: int64

In [13]:
final = final.reset_index(drop=True)

## Data preprocessing

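* Remove URLs and HTML tags
* Expand contractions (e.g. "won't" to "will not")
* Remove words containing digits
* Remove punctuation and special characters
* Convert to lowercase
* Remove stopwords
* Keep only words longer than 2 characters (not applied in the cells below)
* Snowball stemming (not applied in the cells below)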

In [14]:
# Find reviews whose text contains a URL ('http' anywhere after the first character)
final.loc[final.Text.str.find('http')>0,'Text']

85        I was intially introduced to Pro-Treat Beef Li...
89        Dogs love these treats more than any other tre...
239       Why is this $[...] when the same product is av...
248       Of course you already have the DVD.<br />This ...
276       <a href="http://www.amazon.com/gp/product/B001...
                                ...                        
363774    Art Deco in my fridge. Tastes great. Lunch pai...
363983    Honeyville has wisely passed on very reasonabl...
364059    Green Mountain has a reputation for making gre...
364063    Kramer's 3 Pack Wing Rub is the best deal! We ...
364167    You have to try this sauce to believe it! It s...
Name: Text, Length: 7376, dtype: object

In [15]:
# Looking at some of the random reviews 
print(final.Text[85])
print('*'*50)
print(final.Text[89])
print('*'*50)
print(final.Text[239])

I was intially introduced to Pro-Treat Beef Liver Freeze Dried Dog Treats at my veterinarian's office.  Rudy, our Jack Russell Terrior, usually not all that interested in dog treats, responded like I'd never seen before!<br /><br />I bought a couple of cans (they are pretty pricey), but gave them to him sparingly. The tricks he'd do for the Pro-Treats were really entertaining.  Now, Rudy has to take meds. for his glaucoma and thank heavens I found the large size here which is a good value, so he not only takes his medicine, but gives great tricks in return each day! He's a happy camper, despite the loss of sight in one eye.<br /><a href="http://www.amazon.com/gp/product/B0002DGRSY">Pro-Treat Beef Liver, Freeze Dried Dog Treats, 21 Ounce</a><br /><br />Rudy says, "They are the best part of my day!"  We think so too!  ~ CJ
**************************************************
Dogs love these treats more than any other treats I've ever found.  In situations where other treats fail to get our

In [16]:
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
import re
before = final.Text[85]
after = re.sub(r"http\S+", "", final.Text[85])
print("Before")
print(before)
print('*'*50)
print("After")
print(after)

Before
I was intially introduced to Pro-Treat Beef Liver Freeze Dried Dog Treats at my veterinarian's office.  Rudy, our Jack Russell Terrior, usually not all that interested in dog treats, responded like I'd never seen before!<br /><br />I bought a couple of cans (they are pretty pricey), but gave them to him sparingly. The tricks he'd do for the Pro-Treats were really entertaining.  Now, Rudy has to take meds. for his glaucoma and thank heavens I found the large size here which is a good value, so he not only takes his medicine, but gives great tricks in return each day! He's a happy camper, despite the loss of sight in one eye.<br /><a href="http://www.amazon.com/gp/product/B0002DGRSY">Pro-Treat Beef Liver, Freeze Dried Dog Treats, 21 Ounce</a><br /><br />Rudy says, "They are the best part of my day!"  We think so too!  ~ CJ
**************************************************
After
I was intially introduced to Pro-Treat Beef Liver Freeze Dried Dog Treats at my veterinarian's office. 

In [17]:
# remove all tags from the text # https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
from bs4 import BeautifulSoup

soup = BeautifulSoup(before, 'lxml')
after = soup.get_text()
print("Before \n",before,'\n','*'*50,'\n After \n',after)


Before 
 I was intially introduced to Pro-Treat Beef Liver Freeze Dried Dog Treats at my veterinarian's office.  Rudy, our Jack Russell Terrior, usually not all that interested in dog treats, responded like I'd never seen before!<br /><br />I bought a couple of cans (they are pretty pricey), but gave them to him sparingly. The tricks he'd do for the Pro-Treats were really entertaining.  Now, Rudy has to take meds. for his glaucoma and thank heavens I found the large size here which is a good value, so he not only takes his medicine, but gives great tricks in return each day! He's a happy camper, despite the loss of sight in one eye.<br /><a href="http://www.amazon.com/gp/product/B0002DGRSY">Pro-Treat Beef Liver, Freeze Dried Dog Treats, 21 Ounce</a><br /><br />Rudy says, "They are the best part of my day!"  We think so too!  ~ CJ 
 ************************************************** 
 After 
 I was intially introduced to Pro-Treat Beef Liver Freeze Dried Dog Treats at my veterinarian's 

In [18]:
#Expanding English language contractions in Python
#https://stackoverflow.com/a/47091490/8885670

import re
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

after = decontracted(before)
print("Before \n",before,'\n','*'*50,'\n After \n',after)

Before 
 I was intially introduced to Pro-Treat Beef Liver Freeze Dried Dog Treats at my veterinarian's office.  Rudy, our Jack Russell Terrior, usually not all that interested in dog treats, responded like I'd never seen before!<br /><br />I bought a couple of cans (they are pretty pricey), but gave them to him sparingly. The tricks he'd do for the Pro-Treats were really entertaining.  Now, Rudy has to take meds. for his glaucoma and thank heavens I found the large size here which is a good value, so he not only takes his medicine, but gives great tricks in return each day! He's a happy camper, despite the loss of sight in one eye.<br /><a href="http://www.amazon.com/gp/product/B0002DGRSY">Pro-Treat Beef Liver, Freeze Dried Dog Treats, 21 Ounce</a><br /><br />Rudy says, "They are the best part of my day!"  We think so too!  ~ CJ 
 ************************************************** 
 After 
 I was intially introduced to Pro-Treat Beef Liver Freeze Dried Dog Treats at my veterinarian is
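
The output above shows a side effect of the generic `'s` rule: the possessive "veterinarian's" becomes "veterinarian is". One possible refinement, sketched here under the assumption that `'s` means "is" only after a short list of pronoun-like words, is shown below. The function name `expand_apostrophe_s` and the word list are hypothetical and incomplete (for example, "he's" can also mean "he has").

```python
import re

# Words after which "'s" usually means "is" (an illustrative, incomplete list).
IS_HOSTS = r"(?:it|he|she|that|there|here|what|who)"

def expand_apostrophe_s(phrase):
    """Expand "'s" to " is" only after selected words; drop it elsewhere as a possessive."""
    phrase = re.sub(rf"\b({IS_HOSTS})'s\b", r"\1 is", phrase, flags=re.IGNORECASE)
    # Any remaining "'s" is treated as a possessive marker and removed.
    return re.sub(r"'s\b", "", phrase)

example = "He's a happy camper at my veterinarian's office."
expanded = expand_apostrophe_s(example)
print(expanded)
```

Since "is" is a stopword anyway, the main practical difference is that possessives no longer leave a stray token behind before stopword removal.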

In [19]:

# Remove words that contain digits: https://stackoverflow.com/a/18082370/4084039
after = re.sub(r"\S*\d\S*", "", before).strip()
print("Before \n",before,'\n','*'*50,'\n After \n',after)

Before 
 I was intially introduced to Pro-Treat Beef Liver Freeze Dried Dog Treats at my veterinarian's office.  Rudy, our Jack Russell Terrior, usually not all that interested in dog treats, responded like I'd never seen before!<br /><br />I bought a couple of cans (they are pretty pricey), but gave them to him sparingly. The tricks he'd do for the Pro-Treats were really entertaining.  Now, Rudy has to take meds. for his glaucoma and thank heavens I found the large size here which is a good value, so he not only takes his medicine, but gives great tricks in return each day! He's a happy camper, despite the loss of sight in one eye.<br /><a href="http://www.amazon.com/gp/product/B0002DGRSY">Pro-Treat Beef Liver, Freeze Dried Dog Treats, 21 Ounce</a><br /><br />Rudy says, "They are the best part of my day!"  We think so too!  ~ CJ 
 ************************************************** 
 After 
 I was intially introduced to Pro-Treat Beef Liver Freeze Dried Dog Treats at my veterinarian's 

In [20]:
# Remove special characters: https://stackoverflow.com/a/5843547/4084039
after = re.sub('[^A-Za-z0-9]+', ' ', before)

print("Before \n",before,'\n','*'*50,'\n After \n',after)

Before 
 I was intially introduced to Pro-Treat Beef Liver Freeze Dried Dog Treats at my veterinarian's office.  Rudy, our Jack Russell Terrior, usually not all that interested in dog treats, responded like I'd never seen before!<br /><br />I bought a couple of cans (they are pretty pricey), but gave them to him sparingly. The tricks he'd do for the Pro-Treats were really entertaining.  Now, Rudy has to take meds. for his glaucoma and thank heavens I found the large size here which is a good value, so he not only takes his medicine, but gives great tricks in return each day! He's a happy camper, despite the loss of sight in one eye.<br /><a href="http://www.amazon.com/gp/product/B0002DGRSY">Pro-Treat Beef Liver, Freeze Dried Dog Treats, 21 Ounce</a><br /><br />Rudy says, "They are the best part of my day!"  We think so too!  ~ CJ 
 ************************************************** 
 After 
 I was intially introduced to Pro Treat Beef Liver Freeze Dried Dog Treats at my veterinarian s 

In [21]:

# Stop-word list adapted from https://gist.github.com/sebleier/554280
# 'no', 'nor' and 'not' are left out of the list because they carry sentiment.
# After the steps above, "<br /><br />" becomes "br br",
# so 'br' is added to the stop-word list.
# If the tags were written as <br/> instead of <br />, they would already
# have been removed in the first step.

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [22]:
# Combine all the steps above
from tqdm import tqdm
preprocessed_reviews = []
# tqdm prints a progress bar
for sentence in tqdm(final['Text'].values):
    sentence = re.sub(r"http\S+", "", sentence)             # remove URLs
    sentence = BeautifulSoup(sentence, 'lxml').get_text()   # remove HTML tags
    sentence = decontracted(sentence)                       # expand contractions
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()    # remove words containing digits
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)          # keep letters only
    # Lowercase and remove stop words: https://gist.github.com/sebleier/554280
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower()
                        not in stopwords)
    preprocessed_reviews.append(sentence.strip())

100%|██████████| 364171/364171 [01:58<00:00, 3060.71it/s]


In [23]:
# Apply the same steps to the summary text
from tqdm import tqdm
preprocessed_summary = []
# tqdm prints a progress bar
for sentence in tqdm(final['Summary'].values):
    sentence = re.sub(r"http\S+", "", sentence)             # remove URLs
    sentence = BeautifulSoup(sentence, 'lxml').get_text()   # remove HTML tags
    sentence = decontracted(sentence)                       # expand contractions
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()    # remove words containing digits
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)          # keep letters only
    # Lowercase and remove stop words: https://gist.github.com/sebleier/554280
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower()
                        not in stopwords)
    preprocessed_summary.append(sentence.strip())

100%|██████████| 364171/364171 [01:23<00:00, 4346.74it/s]
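
The two loops above differ only in the column they read, so the steps could be collected in one function and reused. The sketch below is self-contained and therefore makes some substitutions that are assumptions, not the notebook's method: a regular expression stands in for BeautifulSoup, a single `n't` rule stands in for `decontracted`, and `DEMO_STOPWORDS` is a tiny stand-in for the full `stopwords` set. The names `clean_text` and `min_len` are hypothetical; `min_len` illustrates the word-length filter from the preprocessing list.

```python
import html
import re

# Tiny illustrative stop-word set; the notebook's full `stopwords` set would be used instead.
DEMO_STOPWORDS = {"br", "the", "i", "a", "is", "this", "it", "and", "at", "my", "to"}

def clean_text(text, stopwords=DEMO_STOPWORDS, min_len=1):
    """Apply the cleaning steps to one string and return the cleaned string."""
    text = re.sub(r"http\S+", "", text)        # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)       # crude tag removal (stand-in for BeautifulSoup)
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"n't\b", " not", text)      # minimal stand-in for decontracted()
    text = re.sub(r"\S*\d\S*", "", text)       # remove words containing digits
    text = re.sub(r"[^A-Za-z]+", " ", text)    # keep letters only
    words = (w.lower() for w in text.split())
    return " ".join(w for w in words if w not in stopwords and len(w) >= min_len)

sample = "This isn't bad, ok?<br />Bought 2 cans at http://example.com today."
print(clean_text(sample))             # same steps as the loops above
print(clean_text(sample, min_len=3))  # additionally drops words shorter than 3 characters
```

With such a function, both lists could be built with a comprehension over `final['Text']` and `final['Summary']`.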


## Dataset Splitting

Split the data into train (80%) and test (20%) sets.

In [24]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(preprocessed_reviews, final['Score'].to_list(), test_size=0.2, random_state=77)


In [25]:
print(X_train[:5])
print(y_train[:5])

['product great evening tea help relax gently eases sleep taking harsh medicines like tylenol pm always woke feeling worse went bed not tea let clarify not knock help relax', 'not recommend flavor least one five cups problem resulted grounds coffee used two machines one less month old outcome maybe bad batch would like know reviewers similar problems always bought timothy hope anomaly two five stars majority brews made came fine update review feel recommend flavor', 'completlety satisfied coconut crystal use protein smoothie every morning taste like mild brown sugar happy buying future', 'love nuts hate removing shells great tasty choice right amount salt nuts plump good size', 'worst tasting coffee ever tasted flavorless bitter acidic taste threw whole bag']
[1, 0, 1, 1, 0]


In [26]:
print(X_test[:5])
print(y_test[:5])

['ronzoni pasta always consistantly good pasta buy pastas quality always consistantly good', 'came first time corner package torn half cocoa powder gone flying shipment box upset called customer service sent new one without sending defected one bothering like taste use drinking hot cocoa rather baking tastes mild mean not bitter cocoa like bitter taste cocoa maybe good point honestly think taste ghirardelli cocoa better taste organic believe must better quality also fair trade anyway enjoying sugar free vanilla syrup perfect way enjoy delicious hot cocoa free bad chemicals high calories', 'salmon not bad meat several questionable additions including bony pieces found big round bone fragments probably fish spinal cord not something want chewing sandwich wo not buy', 'love sauce used get local grocery store stopped selling ordered six pack going last good long people know tried not like find taste sweet kind zippy time trying find substitute months glad found amazon', 'got cause love qua

Further split the training data into train (80%) and cross-validation (20%) sets.

In [27]:
X_train_noncv, X_train_cv, y_train_noncv, y_train_cv = train_test_split(X_train, y_train, test_size=0.2, random_state=77)


In [28]:
print("# of reviews ...")
print("Training = ",len(X_train_noncv))
print("Cross validation = ",len(X_train_cv))
print("Testing = ",len(X_test))

# of reviews ...
Training =  233068
Cross validation =  58268
Testing =  72835
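
The counts printed above can be checked with a little arithmetic. The sketch assumes that, for a fractional `test_size`, the held-out count is rounded up and the remainder goes to training; this assumption reproduces the three numbers exactly. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_total = 364171                               # rows left after cleaning
n_train, n_test = split_sizes(n_total, 0.2)    # first split: train vs test
n_fit, n_cv = split_sizes(n_train, 0.2)        # second split: train vs cross-validation

print("Training         =", n_fit)
print("Cross validation =", n_cv)
print("Testing          =", n_test)
```

Two successive 80/20 splits therefore leave roughly 64% of the data for training, 16% for cross-validation and 20% for testing.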


## Feature Engineering and Data Transformation

We shall use Bag of Words and TF-IDF for feature engineering.

### Bag of Words

In [29]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
count_vect = CountVectorizer(min_df=100)
X_tr_BoW = count_vect.fit_transform(X_train_noncv)
X_cv_BoW = count_vect.transform(X_train_cv)
X_te_BoW = count_vect.transform(X_test)
scalar = StandardScaler(with_mean=False)
X_tr_BoW = scalar.fit_transform(X_tr_BoW)
X_cv_BoW = scalar.transform(X_cv_BoW)
X_te_BoW = scalar.transform(X_te_BoW)
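
To illustrate what the Bag of Words step produces, here is a simplified plain-Python sketch on a hypothetical three-document corpus. It mimics the fit/transform split and the `min_df` filter, but it tokenises on whitespace only, so it is not a drop-in equivalent of `CountVectorizer`. The function names are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs, min_df=1):
    """Return a sorted vocabulary of words that occur in at least `min_df` documents."""
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    return sorted(word for word, df in doc_freq.items() if df >= min_df)

def transform(docs, vocabulary):
    """Turn each document into a list of word counts, one column per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

train_docs = ["great tea great taste", "bad coffee", "great coffee"]
vocab = fit_vocabulary(train_docs, min_df=2)   # analogous to CountVectorizer(min_df=...)
train_rows = transform(train_docs, vocab)
unseen_rows = transform(["awful tea"], vocab)  # words outside the vocabulary are ignored

print(vocab)
print(train_rows)
print(unseen_rows)
```

As in the cell above, the vocabulary is learned from the training split only and then reused unchanged for the cross-validation and test splits.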

### TFIDF

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range = (1,2), min_df = 10,max_features=1000 )
X_tr_tfidf = tf_idf_vect.fit_transform(X_train_noncv)
X_cv_tfidf = tf_idf_vect.transform(X_train_cv)
X_te_tfidf = tf_idf_vect.transform(X_test)
scalar = StandardScaler(with_mean=False)
X_tr_tfidf = scalar.fit_transform(X_tr_tfidf)
X_cv_tfidf = scalar.transform(X_cv_tfidf)
X_te_tfidf = scalar.transform(X_te_tfidf)

In [31]:
X_tr_tfidf.shape

(233068, 1000)
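
The TF-IDF weighting can be sketched in plain Python as well. The sketch assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which to our understanding matches the default settings of `TfidfVectorizer`. It uses unigrams and whitespace tokenisation only, so it does not reproduce `ngram_range=(1,2)` or `max_features`. The corpus and the function name `tfidf_rows` are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Compute L2-normalised TF-IDF vectors with smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({w for words in tokenised for w in words})
    n_docs = len(docs)
    doc_freq = Counter(w for words in tokenised for w in set(words))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for words in tokenised:
        counts = Counter(words)
        weights = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["great tea", "great coffee", "bad coffee"])
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

In the first document "tea" receives a higher weight than "great" because it appears in fewer documents.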

# ML Models

* We start with a Logistic Regression model before trying other ML models.

## Logistic Regression - BoW

In [32]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

In [None]:
# Assumptions: liblinear solver, 10-fold CV, AUC scoring and a small grid over C
grid = {"C": [0.001, 0.1, 1]}
clf = LogisticRegression(solver='liblinear')
log_clf_grid_cv = GridSearchCV(clf, grid, cv=10, verbose=5, scoring='roc_auc')
log_clf_grid_cv.fit(X_tr_BoW, y_train_noncv)

Fitting 10 folds for each of 3 candidates, totalling 30 fits
[CV] C=0.001 .........................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ............................. C=0.001, score=0.942, total=   4.4s
[CV] C=0.001 .........................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.4s remaining:    0.0s


[CV] ............................. C=0.001, score=0.944, total=   3.7s
[CV] C=0.001 .........................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    8.0s remaining:    0.0s


[CV] ............................. C=0.001, score=0.943, total=   4.3s
[CV] C=0.001 .........................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   12.3s remaining:    0.0s


[CV] ............................. C=0.001, score=0.944, total=   4.3s
[CV] C=0.001 .........................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   16.6s remaining:    0.0s


[CV] ............................. C=0.001, score=0.948, total=   3.7s
[CV] C=0.001 .........................................................
[CV] ............................. C=0.001, score=0.946, total=   4.1s
[CV] C=0.001 .........................................................
[CV] ............................. C=0.001, score=0.944, total=   3.8s
[CV] C=0.001 .........................................................
[CV] ............................. C=0.001, score=0.946, total=   3.6s
[CV] C=0.001 .........................................................
[CV] ............................. C=0.001, score=0.946, total=   3.5s
[CV] C=0.001 .........................................................
[CV] ............................. C=0.001, score=0.949, total=   4.1s
[CV] C=0.1 ...........................................................
[CV] ............................... C=0.1, score=0.939, total=   8.6s
[CV] C=0.1 ...........................................................
[CV] .
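
In the grid search above, the vectorizer and scaler were fitted once on the whole training split before cross-validation, so each validation fold has already influenced the vocabulary and the scaling. An alternative, sketched here on a hypothetical toy corpus, is to put the three steps in a scikit-learn `Pipeline` so that they are refitted inside every fold. The step names (`bow`, `scale`, `clf`), the corpus and `cv=3` are illustrative choices, not the notebook's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy corpus: 6 positive and 6 negative cleaned reviews.
docs = [
    "great taste love", "love great coffee", "great tea", "love tea taste",
    "great coffee taste", "love coffee",
    "bad taste awful", "awful coffee bad", "bad tea", "awful tea taste",
    "bad coffee taste", "awful coffee",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("bow", CountVectorizer()),
    ("scale", StandardScaler(with_mean=False)),   # with_mean=False keeps the matrix sparse
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
grid = {"clf__C": [0.001, 0.1, 1]}
search = GridSearchCV(pipe, grid, cv=3, scoring="roc_auc")
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```

With this layout the raw cleaned text is passed to `fit`, and the fitted pipeline can later score new reviews directly with `predict_proba`.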

In [None]:
# Mean cross-validated AUC and mean scoring time (in seconds) per candidate
results_df = pd.DataFrame({
    'params' : log_clf_grid_cv.cv_results_['params'],
    'AUC' : log_clf_grid_cv.cv_results_['mean_test_score'],
    'Score time (s)' : log_clf_grid_cv.cv_results_['mean_score_time']
})
print(results_df)
print('*'*50)
print('Best result is :')
print(log_clf_grid_cv.best_params_)
print(log_clf_grid_cv.best_score_)

In [None]:
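# best_estimator_ is already refit on the full training split (refit=True is the default),
# so the explicit fit below is redundant but harmless
clf = log_clf_grid_cv.best_estimator_
clf = clf.fit(X_tr_BoW,y_train_noncv)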

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
%matplotlib inline

# Predicted probabilities; column 1 is the positive class
pred_train = clf.predict_proba(X_tr_BoW)
pred_cv = clf.predict_proba(X_cv_BoW)
pred_te = clf.predict_proba(X_te_BoW)

fpr1, tpr1, thresholds = roc_curve(y_train_noncv, pred_train[:,1])
fpr2, tpr2, thresholds = roc_curve(y_train_cv, pred_cv[:,1])
fpr3, tpr3, thresholds = roc_curve(y_test, pred_te[:,1])
print("The AUC score for train data is : ",roc_auc_score(y_train_noncv, pred_train[:,1]))
print("The AUC score for CV data is : ",roc_auc_score(y_train_cv, pred_cv[:,1]))
print("The AUC score for test data is : ",roc_auc_score(y_test, pred_te[:,1]))
plt.plot(fpr1,tpr1,'r',label = 'Train data')
plt.plot(fpr2,tpr2,'b',label = 'CV data')
plt.plot(fpr3,tpr3,'g',label = 'Test data')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.grid(True)
plt.legend(loc='upper center', bbox_to_anchor=(1.45, 0.8), shadow=True, ncol=1)
plt.show()

In [None]:

import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
# https://stackoverflow.com/questions/47264597/confusion-matrix-from-probabilities
# argmax over the two probability columns is equivalent to a 0.5 threshold
y_pred = np.argmax(pred_te, axis=1)
conf_mat = confusion_matrix(y_test, y_pred)
# Row-normalised version (per-class rates), kept for reference; the plot shows raw counts
conf_mat_normalized = conf_mat / conf_mat.sum(axis=1)[:, np.newaxis]
sn.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
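
For a binary problem, taking the argmax over the two probability columns is the same as thresholding the positive-class probability at 0.5. The following plain-Python sketch, on hypothetical labels and probabilities, builds the same kind of matrix (rows are true labels, columns are predicted labels) and its row-normalised version. The function names are hypothetical.

```python
def confusion_counts(y_true, p_positive, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] for binary labels and positive-class probabilities."""
    matrix = [[0, 0], [0, 0]]
    for y, p in zip(y_true, p_positive):
        predicted = 1 if p > threshold else 0
        matrix[y][predicted] += 1
    return matrix

def normalise_rows(matrix):
    """Divide each row by its total so that entries become per-class rates."""
    return [[value / sum(row) for value in row] for row in matrix]

# Hypothetical labels and probabilities, imbalanced towards the positive class.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
p_pos = [0.9, 0.8, 0.7, 0.6, 0.95, 0.4, 0.3, 0.7]

counts = confusion_counts(y_true, p_pos)
rates = normalise_rows(counts)
print(counts)
print(rates)
```

With imbalanced classes the row-normalised rates are often easier to read than raw counts, because the small negative class is not visually dominated by the positive one.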

## Logistic Regression - TFIDF

In [None]:
# Assumptions: liblinear solver, 10-fold CV, AUC scoring and a small grid over C
grid = {"C": [0.001, 0.1, 1]}
clf = LogisticRegression(solver='liblinear')
log_clf_grid_cv = GridSearchCV(clf, grid, cv=10, verbose=5, scoring='roc_auc')
log_clf_grid_cv.fit(X_tr_tfidf, y_train_noncv)

In [None]:
# Mean cross-validated AUC and mean scoring time (in seconds) per candidate
results_df = pd.DataFrame({
    'params' : log_clf_grid_cv.cv_results_['params'],
    'AUC' : log_clf_grid_cv.cv_results_['mean_test_score'],
    'Score time (s)' : log_clf_grid_cv.cv_results_['mean_score_time']
})
print(results_df)
print('*'*50)
print('Best result is :')
print(log_clf_grid_cv.best_params_)
print(log_clf_grid_cv.best_score_)

In [None]:
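# best_estimator_ is already refit on the full training split (refit=True is the default),
# so the explicit fit below is redundant but harmless
clf = log_clf_grid_cv.best_estimator_
clf = clf.fit(X_tr_tfidf,y_train_noncv)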

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
%matplotlib inline

# Predicted probabilities; column 1 is the positive class
pred_train = clf.predict_proba(X_tr_tfidf)
pred_cv = clf.predict_proba(X_cv_tfidf)
pred_te = clf.predict_proba(X_te_tfidf)

fpr1, tpr1, thresholds = roc_curve(y_train_noncv, pred_train[:,1])
fpr2, tpr2, thresholds = roc_curve(y_train_cv, pred_cv[:,1])
fpr3, tpr3, thresholds = roc_curve(y_test, pred_te[:,1])
print("The AUC score for train data is : ",roc_auc_score(y_train_noncv, pred_train[:,1]))
print("The AUC score for CV data is : ",roc_auc_score(y_train_cv, pred_cv[:,1]))
print("The AUC score for test data is : ",roc_auc_score(y_test, pred_te[:,1]))
plt.plot(fpr1,tpr1,'r',label = 'Train data')
plt.plot(fpr2,tpr2,'b',label = 'CV data')
plt.plot(fpr3,tpr3,'g',label = 'Test data')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.grid(True)
plt.legend(loc='upper center', bbox_to_anchor=(1.45, 0.8), shadow=True, ncol=1)
plt.show()

In [None]:

import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
# https://stackoverflow.com/questions/47264597/confusion-matrix-from-probabilities
# argmax over the two probability columns is equivalent to a 0.5 threshold
y_pred = np.argmax(pred_te, axis=1)
conf_mat = confusion_matrix(y_test, y_pred)
# Row-normalised version (per-class rates), kept for reference; the plot shows raw counts
conf_mat_normalized = conf_mat / conf_mat.sum(axis=1)[:, np.newaxis]
sn.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

# Conclusion



In [None]:
from prettytable import PrettyTable
x = PrettyTable()
x.field_names = ["Model Name","AUC Train " , "AUC CV ", 
                 "AUC Test "]
x.add_row(['Logistic Regression with BoW','0.96','0.95','0.95'])
x.add_row(['Logistic Regression with TFIDF','0.94','0.94','0.94'])
print(x)

# Observations

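1. AUC was used as the metric because the data is imbalanced (307061 positive vs 57110 negative reviews).
2. Both BoW and TFIDF features give strong results with Logistic Regression: the test AUC is 0.95 for BoW and 0.94 for TFIDF, and the train and CV AUC values are within 0.01 of the test value, so there is little sign of overfitting.
3. The summary text could be used as additional features.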