[Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)
======

## Data Set

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, another 50,000 IMDB reviews are provided without any rating labels.

## File descriptions

labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.

## Data fields

* id - Unique ID of each review
* sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* review - Text of the review
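
As a minimal sketch of how a tab-delimited file with these three fields could be read, the block below parses a tiny in-memory sample. The two sample rows and the variable names are made up for illustration; only the column names come from the field list above.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-delimited layout (id, sentiment, review).
sample_tsv = (
    "id\tsentiment\treview\n"
    "r1\t1\tA wonderful film with a great cast.\n"
    "r2\t0\tDull plot and flat acting.\n"
)

# sep="\t" tells pandas that columns are separated by tabs rather than commas.
reviews = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(reviews.shape)
print(reviews.columns.tolist())
```

For a file on disk, the same call would take the file path in place of the `StringIO` object.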

## Objective
The objective is to predict **sentiment** (positive or negative) from the **review** text, so X is the **review** column and y is the **sentiment** column.

## 1. Load Dataset

Let's first have a look at the data. You can download the file `labeledTrainData.tsv` from the [Kaggle website of the competition](https://www.kaggle.com/c/word2vec-nlp-tutorial/data), or from our [Google Drive](https://drive.google.com/file/d/1a1Lyn7ihikk3klAX26fgO3YsGdWHWoK5/view?usp=sharing).


In [105]:
# Import pandas, numpy
import numpy as np
import pandas as pd

In [106]:
# Read the dataset (a comma-separated file, so the default separator applies)
data = pd.read_csv("../data/sentimentData01.csv")
data



Unnamed: 0.1,Unnamed: 0,domain_hotel_id,text,username,review_id,score,clean_text,Is_Response
0,3,4812066,Central location. Nice small buffet breakfast ...,Russ,3.879692e+09,3.166667,great location,neutral
1,8,2315342,lovely big room (we got a family room despite ...,Olga,4.463291e+09,3.666667,cozy room condition kind staff with quick chec...,good
2,19,3397149,The property is very well situated and very mo...,Rhett,3.414093e+09,4.000000,friendly staff and clean,wonderful
3,29,817067,Great location,Pasel,3.032074e+09,3.500000,very good,neutral
4,31,2073025,"The staff is really friendly, always ready to ...",Rowan,4.501084e+09,4.000000,very accommodating staff,wonderful
...,...,...,...,...,...,...,...,...
292214,3109377,5850824,Brand new house. Very welcome staff. Big and c...,Annar,4.528234e+09,4.000000,,wonderful
292215,3109380,4389800,One of the best places we stayed in Vietnam. I...,Magnus,3.518647e+09,4.000000,,wonderful
292216,3109393,4101656,The location is fantastic. Amazing breakfast. ...,Ctj1688,4.518495e+09,3.200000,,neutral
292217,3109403,1724717,Rooms were modern nice clean and new \r\nStaff...,Poh,4.441717e+09,3.166667,,neutral


In [107]:
data['Is_Response'].value_counts()

wonderful    119237
neutral       97622
good          75360
Name: Is_Response, dtype: int64

In [108]:
data['clean_text'].describe()

count         1606
unique        1564
top       location
freq             8
Name: clean_text, dtype: object

In [109]:
# dropna() returns a new Series; the result is not assigned, so `data` is unchanged
data['clean_text'].dropna()
data

Unnamed: 0.1,Unnamed: 0,domain_hotel_id,text,username,review_id,score,clean_text,Is_Response
0,3,4812066,Central location. Nice small buffet breakfast ...,Russ,3.879692e+09,3.166667,great location,neutral
1,8,2315342,lovely big room (we got a family room despite ...,Olga,4.463291e+09,3.666667,cozy room condition kind staff with quick chec...,good
2,19,3397149,The property is very well situated and very mo...,Rhett,3.414093e+09,4.000000,friendly staff and clean,wonderful
3,29,817067,Great location,Pasel,3.032074e+09,3.500000,very good,neutral
4,31,2073025,"The staff is really friendly, always ready to ...",Rowan,4.501084e+09,4.000000,very accommodating staff,wonderful
...,...,...,...,...,...,...,...,...
292214,3109377,5850824,Brand new house. Very welcome staff. Big and c...,Annar,4.528234e+09,4.000000,,wonderful
292215,3109380,4389800,One of the best places we stayed in Vietnam. I...,Magnus,3.518647e+09,4.000000,,wonderful
292216,3109393,4101656,The location is fantastic. Amazing breakfast. ...,Ctj1688,4.518495e+09,3.200000,,neutral
292217,3109403,1724717,Rooms were modern nice clean and new \r\nStaff...,Poh,4.441717e+09,3.166667,,neutral


In [110]:
# Split the data into one frame per response class
df1 = data[data['Is_Response'] == 'wonderful']
df2 = data[data['Is_Response'] == 'good']
df3 = data[data['Is_Response'] == 'neutral']

In [111]:
# Downsample 'wonderful' to 75,360 rows, the size of the 'good' class
df_wonderful = df1.sample(n=75360)
df_wonderful['clean_text'].value_counts()

location                                                      2
good                                                          2
very friendly staff they make us feel part of their family

In [112]:
df_wonderful['text']

46752     Very friendly and non invasive staff. Perfectl...
203174    Me and my friend were in need of a pool and so...
63451     Staff was reali friendly and helpefull, show a...
100949    Appartment is big and clean. Location is 15min...
43777     Great breakfast and beautiful and clean swimmi...
                                ...                        
211117    Staff are very friendly and helpful. Room had ...
6263      Everything really! Would recommended! Owners w...
230534                                 quiet and cosy place
154682    One of the best hotel during my trip in Vietna...
157653    This place is a real Gem...The staff are the a...
Name: text, Length: 75360, dtype: object

In [113]:
# Sampling 75,360 rows from 'good' keeps the whole class, in random order
df_good = df2.sample(n=75360)
df_good

Unnamed: 0.1,Unnamed: 0,domain_hotel_id,text,username,review_id,score,clean_text,Is_Response
21093,1087075,2682902,"So comfortable! Lovely room, great view, great...",Emily,3.195364e+09,3.600000,,good
20817,1085692,3843978,We stayed here after a long flight. We had a l...,Micaela,4.480861e+09,3.666667,,good
153079,2140831,582961,Comfortable bed and blackout curtains perfect ...,Beth,2.131154e+09,3.833333,,good
56858,1413953,5471507,"Well-designed apartment, fully furnished, surp...",Lombardo,4.537709e+09,3.600000,,good
237816,2602192,3438123,"Mesmerizing location, easy to get to yet very ...",Dorota,3.791653e+09,3.833333,,good
...,...,...,...,...,...,...,...,...
259312,2801326,4852870,- very nice place! and the owner are really mo...,Christian,3.999835e+09,3.833333,,good
100987,1821068,3249175,The hotel is a very nice hotel located in the ...,Tunku,4.507429e+09,3.833333,,good
166592,2203558,779427,Great location and very friendly staff.,Dorothy,2.541111e+09,3.833333,,good
226019,2490279,1759807,Good location.,Euan,3.341200e+09,3.666667,,good


In [114]:
output = pd.concat([df_wonderful,df_good,df3])
output

Unnamed: 0.1,Unnamed: 0,domain_hotel_id,text,username,review_id,score,clean_text,Is_Response
46752,1318599,4386313,Very friendly and non invasive staff. Perfectl...,Bernard,3.574700e+09,4.000000,,wonderful
203174,2373130,2109051,Me and my friend were in need of a pool and so...,Dani,4.437137e+09,4.000000,,wonderful
63451,1477213,435969,"Staff was reali friendly and helpefull, show a...",Joanna,2.570785e+09,4.000000,,wonderful
100949,1820941,2975753,Appartment is big and clean. Location is 15min...,Emeric,3.567327e+09,4.000000,,wonderful
43777,1297008,4792746,Great breakfast and beautiful and clean swimmi...,Dominika,2.527537e+09,4.000000,,wonderful
...,...,...,...,...,...,...,...,...
292208,3109349,1111660,"Clean,comfortable room. Helpful, friendly st...",Gary,4.506231e+09,3.000000,,neutral
292209,3109360,3059893,Staff is helpful,Janet,1.876953e+09,1.666667,,neutral
292210,3109361,3991019,"Nice hotel, spacious rooms. Buffet breakfast w...",Jonathon,2.638549e+09,3.500000,,neutral
292216,3109393,4101656,The location is fantastic. Amazing breakfast. ...,Ctj1688,4.518495e+09,3.200000,,neutral
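
The cells above downsample one class by hand and then concatenate the pieces. As a hedged variant (not what the notebook does, since it keeps the `neutral` class at full size), the sketch below downsamples every class to the size of the smallest one and fixes `random_state` so the sample is reproducible. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the review data.
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "Is_Response": ["wonderful"] * 5 + ["good"] * 3 + ["neutral"] * 2,
})

# Size of the smallest class; every class is reduced to this many rows.
smallest = toy["Is_Response"].value_counts().min()

# Sample each class separately, then stack the samples back together.
balanced = pd.concat(
    [group.sample(n=smallest, random_state=0)
     for _, group in toy.groupby("Is_Response")]
)

print(balanced["Is_Response"].value_counts())
```

Passing `random_state` also makes later cells repeatable, because the same rows are drawn on every run.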


In [115]:
output['text'].describe()

count       248342
unique      234763
top       Location
freq           963
Name: text, dtype: object

In [116]:
from sklearn.utils import shuffle
output = shuffle(output)

In [117]:
output

Unnamed: 0.1,Unnamed: 0,domain_hotel_id,text,username,review_id,score,clean_text,Is_Response
186061,2294169,4303133,"Clean, comfortable",Anonymous,3.218060e+09,2.833333,,neutral
71684,1547311,1320362,"The location was great, close to many of the c...",Jys,3.489397e+09,3.500000,,neutral
186570,2296431,2183217,Walking distance to Hoi An Ancient Town. Frien...,Dan2how,4.500809e+09,3.833333,,good
183514,2282187,4542179,host was nice\r\nfree bicycle rental\r\nnear t...,Anonymous,4.522159e+09,3.600000,,good
36985,1232657,1778707,The location was excelent. \nWe checked in ear...,Jone,2.310974e+09,2.500000,,neutral
...,...,...,...,...,...,...,...,...
12348,1006439,2968256,Very clean hotel. The room I had was spotless ...,Pete,4.520388e+09,2.800000,,neutral
149554,2124532,2735905,The family is lovely and you can join them for...,Margarida,3.677262e+09,4.000000,,wonderful
115562,1952000,3635919,Good place!,Lá»c,2.053224e+09,3.000000,,neutral
184576,2287379,822951,"Location is excellent, staff were great, very ...",Simon,4.440509e+09,3.333333,,neutral


In [119]:
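# cusdata is another name for the same DataFrame as output, not a copy
cusdata = output
# Sequential id for each review, following the shuffled row order
cusdata['id'] = range(len(output.index))
cusdata['score'] = output['score']
# Expose the review text under the column name used in the rest of the notebook
cusdata['review'] = output['text']
# cusdata.dropna(inplace=True)
cusdata.T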

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cusdata['id'] = range(len(output.index))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cusdata['score'] = output['score']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cusdata['review'] = output['text']


Unnamed: 0,186061,71684,186570,183514,36985,265749,90540,23808,209775,273198,...,44573,243396,185294,212208,279968,12348,149554,115562,184576,17170
Unnamed: 0,2294169,1547311,2296431,2282187,1232657,2864797,1724035,1107515,2403183,2931763,...,1300606,2651349,2290704,2414673,2995453,1006439,2124532,1952000,2287379,1053123
domain_hotel_id,4303133,1320362,2183217,4542179,1778707,2669235,256409,1119706,1191662,4850239,...,4507518,1115837,2256105,1830458,5647782,2968256,2735905,3635919,822951,2669521
text,"Clean, comfortable","The location was great, close to many of the c...",Walking distance to Hoi An Ancient Town. Frien...,host was nice\r\nfree bicycle rental\r\nnear t...,The location was excelent. \nWe checked in ear...,The service was excellent. They help us with e...,"Good value, great breakfast",Excellent service and personnel,It was a good location and breakfast was nice.,"Everything was great, had a comfy stay! Great ...",...,Facilities and amenities,Great location. Very friendly staff. Would rec...,"Everything, perfect like all the other times w...",Very nice pool and views at night,Room was amazing with beautiful views of the m...,Very clean hotel. The room I had was spotless ...,The family is lovely and you can join them for...,Good place!,"Location is excellent, staff were great, very ...",clean and spaceous room. very friendly and eff...
username,Anonymous,Jys,Dan2how,Anonymous,Jone,Nicolai,Duncan,Laurence,Anonymous,Sarah,...,Anonymous,Amy,Kristin,Wendy,Hannah,Pete,Margarida,Lá»c,Simon,Allan
review_id,3.21806e+09,3.4894e+09,4.50081e+09,4.52216e+09,2.31097e+09,4.47188e+09,3.45208e+09,4.48185e+09,1.88501e+09,3.90887e+09,...,4.52194e+09,4.48894e+09,3.02633e+09,1.40038e+09,2.82864e+09,4.52039e+09,3.67726e+09,2.05322e+09,4.44051e+09,4.4825e+09
score,2.83333,3.5,3.83333,3.6,2.5,3.83333,3.33333,3,3,4,...,3.2,4,4,2.16667,4,2.8,4,3,3.33333,3
clean_text,,,,,,,,,,,...,,,,,,,,,,
Is_Response,neutral,neutral,good,good,neutral,good,neutral,neutral,neutral,wonderful,...,neutral,wonderful,wonderful,neutral,wonderful,neutral,wonderful,neutral,neutral,neutral
id,0,1,2,3,4,5,6,7,8,9,...,248332,248333,248334,248335,248336,248337,248338,248339,248340,248341
review,"Clean, comfortable","The location was great, close to many of the c...",Walking distance to Hoi An Ancient Town. Frien...,host was nice\r\nfree bicycle rental\r\nnear t...,The location was excelent. \nWe checked in ear...,The service was excellent. They help us with e...,"Good value, great breakfast",Excellent service and personnel,It was a good location and breakfast was nice.,"Everything was great, had a comfy stay! Great ...",...,Facilities and amenities,Great location. Very friendly staff. Would rec...,"Everything, perfect like all the other times w...",Very nice pool and views at night,Room was amazing with beautiful views of the m...,Very clean hotel. The room I had was spotless ...,The family is lovely and you can join them for...,Good place!,"Location is excellent, staff were great, very ...",clean and spaceous room. very friendly and eff...


In [120]:
cusdata['review'].describe()

count       248342
unique      234763
top       Location
freq           963
Name: review, dtype: object

In [121]:
# Drop the mostly empty clean_text column (output and cusdata are the same object)
output.drop('clean_text', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [122]:
cusdata.T

Unnamed: 0,186061,71684,186570,183514,36985,265749,90540,23808,209775,273198,...,44573,243396,185294,212208,279968,12348,149554,115562,184576,17170
Unnamed: 0,2294169,1547311,2296431,2282187,1232657,2864797,1724035,1107515,2403183,2931763,...,1300606,2651349,2290704,2414673,2995453,1006439,2124532,1952000,2287379,1053123
domain_hotel_id,4303133,1320362,2183217,4542179,1778707,2669235,256409,1119706,1191662,4850239,...,4507518,1115837,2256105,1830458,5647782,2968256,2735905,3635919,822951,2669521
text,"Clean, comfortable","The location was great, close to many of the c...",Walking distance to Hoi An Ancient Town. Frien...,host was nice\r\nfree bicycle rental\r\nnear t...,The location was excelent. \nWe checked in ear...,The service was excellent. They help us with e...,"Good value, great breakfast",Excellent service and personnel,It was a good location and breakfast was nice.,"Everything was great, had a comfy stay! Great ...",...,Facilities and amenities,Great location. Very friendly staff. Would rec...,"Everything, perfect like all the other times w...",Very nice pool and views at night,Room was amazing with beautiful views of the m...,Very clean hotel. The room I had was spotless ...,The family is lovely and you can join them for...,Good place!,"Location is excellent, staff were great, very ...",clean and spaceous room. very friendly and eff...
username,Anonymous,Jys,Dan2how,Anonymous,Jone,Nicolai,Duncan,Laurence,Anonymous,Sarah,...,Anonymous,Amy,Kristin,Wendy,Hannah,Pete,Margarida,Lá»c,Simon,Allan
review_id,3.21806e+09,3.4894e+09,4.50081e+09,4.52216e+09,2.31097e+09,4.47188e+09,3.45208e+09,4.48185e+09,1.88501e+09,3.90887e+09,...,4.52194e+09,4.48894e+09,3.02633e+09,1.40038e+09,2.82864e+09,4.52039e+09,3.67726e+09,2.05322e+09,4.44051e+09,4.4825e+09
score,2.83333,3.5,3.83333,3.6,2.5,3.83333,3.33333,3,3,4,...,3.2,4,4,2.16667,4,2.8,4,3,3.33333,3
Is_Response,neutral,neutral,good,good,neutral,good,neutral,neutral,neutral,wonderful,...,neutral,wonderful,wonderful,neutral,wonderful,neutral,wonderful,neutral,neutral,neutral
id,0,1,2,3,4,5,6,7,8,9,...,248332,248333,248334,248335,248336,248337,248338,248339,248340,248341
review,"Clean, comfortable","The location was great, close to many of the c...",Walking distance to Hoi An Ancient Town. Frien...,host was nice\r\nfree bicycle rental\r\nnear t...,The location was excelent. \nWe checked in ear...,The service was excellent. They help us with e...,"Good value, great breakfast",Excellent service and personnel,It was a good location and breakfast was nice.,"Everything was great, had a comfy stay! Great ...",...,Facilities and amenities,Great location. Very friendly staff. Would rec...,"Everything, perfect like all the other times w...",Very nice pool and views at night,Room was amazing with beautiful views of the m...,Very clean hotel. The room I had was spotless ...,The family is lovely and you can join them for...,Good place!,"Location is excellent, staff were great, very ...",clean and spaceous room. very friendly and eff...


In [123]:
# cusdata.to_csv("../data/dfWithIsResponse.csv")

In [126]:
# Map a review score to a binary label: 1 (positive) if score >= 3.8, else 0 (negative)
def response(score):
    if score >= 3.8:
        return 1
    else:
        return 0

cusdata['sentiment'] = cusdata['score'].apply(response)
cusdata['sentiment'].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cusdata['sentiment'] = cusdata['score'].apply(response)


0    136675
1    111667
Name: sentiment, dtype: int64
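
The same thresholding can be written without a Python-level function call per row. The sketch below is an alternative formulation rather than the notebook's own code; the sample scores are illustrative and the 3.8 cutoff is the one used in `response` above.

```python
import pandas as pd

# A few illustrative scores.
scores = pd.Series([2.833333, 3.5, 3.833333, 3.6, 4.0])

# Same cutoff as the response() function above.
threshold = 3.8

# The comparison yields booleans; astype(int) turns True/False into 1/0.
sentiment = (scores >= threshold).astype(int)

print(sentiment.tolist())  # [0, 0, 1, 0, 1]
```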

In [130]:
# cusdata.drop('score', axis=1, inplace=True)
# cusdata.drop('Unnamed: 0',axis = 1, inplace= True)
cusdata.drop('text',axis = 1,inplace = True)
cusdata.T

Unnamed: 0,186061,71684,186570,183514,36985,265749,90540,23808,209775,273198,...,44573,243396,185294,212208,279968,12348,149554,115562,184576,17170
domain_hotel_id,4303133,1320362,2183217,4542179,1778707,2669235,256409,1119706,1191662,4850239,...,4507518,1115837,2256105,1830458,5647782,2968256,2735905,3635919,822951,2669521
username,Anonymous,Jys,Dan2how,Anonymous,Jone,Nicolai,Duncan,Laurence,Anonymous,Sarah,...,Anonymous,Amy,Kristin,Wendy,Hannah,Pete,Margarida,Lá»c,Simon,Allan
review_id,3.21806e+09,3.4894e+09,4.50081e+09,4.52216e+09,2.31097e+09,4.47188e+09,3.45208e+09,4.48185e+09,1.88501e+09,3.90887e+09,...,4.52194e+09,4.48894e+09,3.02633e+09,1.40038e+09,2.82864e+09,4.52039e+09,3.67726e+09,2.05322e+09,4.44051e+09,4.4825e+09
score,2.83333,3.5,3.83333,3.6,2.5,3.83333,3.33333,3,3,4,...,3.2,4,4,2.16667,4,2.8,4,3,3.33333,3
Is_Response,neutral,neutral,good,good,neutral,good,neutral,neutral,neutral,wonderful,...,neutral,wonderful,wonderful,neutral,wonderful,neutral,wonderful,neutral,neutral,neutral
id,0,1,2,3,4,5,6,7,8,9,...,248332,248333,248334,248335,248336,248337,248338,248339,248340,248341
review,"Clean, comfortable","The location was great, close to many of the c...",Walking distance to Hoi An Ancient Town. Frien...,host was nice\r\nfree bicycle rental\r\nnear t...,The location was excelent. \nWe checked in ear...,The service was excellent. They help us with e...,"Good value, great breakfast",Excellent service and personnel,It was a good location and breakfast was nice.,"Everything was great, had a comfy stay! Great ...",...,Facilities and amenities,Great location. Very friendly staff. Would rec...,"Everything, perfect like all the other times w...",Very nice pool and views at night,Room was amazing with beautiful views of the m...,Very clean hotel. The room I had was spotless ...,The family is lovely and you can join them for...,Good place!,"Location is excellent, staff were great, very ...",clean and spaceous room. very friendly and eff...
sentiment,0,0,1,0,0,1,0,0,0,1,...,0,1,1,0,1,0,1,0,0,0


In [131]:
cusdata.to_csv("../data/dfWithIsResponse.csv")

## 2. Preprocessing

In [132]:
# Summary statistics for the id column
cusdata['id'].describe()


count    248342.000000
mean     124170.500000
std       71690.304613
min           0.000000
25%       62085.250000
50%      124170.500000
75%      186255.750000
max      248341.000000
Name: id, dtype: float64

In [133]:
cusdata.groupby('sentiment').describe().transpose()

Unnamed: 0,sentiment,0,1
domain_hotel_id,count,136675.0,111667.0
domain_hotel_id,mean,2349422.0,2615884.0
domain_hotel_id,std,1604163.0,1628710.0
domain_hotel_id,min,27894.0,27894.0
domain_hotel_id,25%,868443.0,1287496.0
domain_hotel_id,50%,2247304.0,2608744.0
domain_hotel_id,75%,3546320.0,3953157.0
domain_hotel_id,max,6704161.0,6675267.0
review_id,count,136675.0,111667.0
review_id,mean,3446929000.0,3323947000.0


In [134]:
cusdata['sentiment'].describe()

count    248342.000000
mean          0.449650
std           0.497459
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max           1.000000
Name: sentiment, dtype: float64

In [135]:
# Character length of each review
cusdata['length'] = cusdata['review'].apply(len)
cusdata.head()

Unnamed: 0,domain_hotel_id,username,review_id,score,Is_Response,id,review,sentiment,length
186061,4303133,Anonymous,3218060000.0,2.833333,neutral,0,"Clean, comfortable",0,18
71684,1320362,Jys,3489397000.0,3.5,neutral,1,"The location was great, close to many of the c...",0,199
186570,2183217,Dan2how,4500809000.0,3.833333,good,2,Walking distance to Hoi An Ancient Town. Frien...,1,82
183514,4542179,Anonymous,4522159000.0,3.6,good,3,host was nice\r\nfree bicycle rental\r\nnear t...,0,74
36985,1778707,Jone,2310974000.0,2.5,neutral,4,The location was excelent. \nWe checked in ear...,0,152


In [136]:
# Remove special characters and other noise
import re

def preprocessor(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons so they can be appended after cleaning
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with a space, then append
    # the emoticons with the nose character '-' removed for standardization
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    return text
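
To see what the function returns, here is a small usage sketch. The sample string is made up, and the function body is repeated (with raw-string patterns) so the block runs on its own.

```python
import re


def preprocessor(text):
    # Remove HTML markup.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word runs to a space, then append the emoticons
    # with the '-' nose removed.
    return (re.sub(r'[\W]+', ' ', text.lower())
            + ' ' + ' '.join(emoticons).replace('-', ''))


# Hypothetical review text containing markup, punctuation and an emoticon.
sample = "<b>Great</b> stay :-) Loved it!"
cleaned = preprocessor(sample)

print(cleaned.split())  # ['great', 'stay', 'loved', 'it', ':)']
```

The emoticon ends up at the end of the string because it is stripped along with the other punctuation and then re-appended.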

In [138]:
# Count how often each space-separated token occurs across all reviews
from collections import Counter

vocab = Counter()
for twit in cusdata.review:
    for word in twit.split(' '):
        vocab[word] += 1

vocab.most_common(20)

[('and', 359610),
 ('the', 313265),
 ('to', 175913),
 ('was', 164953),
 ('a', 153162),
 ('The', 128804),
 ('is', 125679),
 ('very', 107949),
 ('in', 91118),
 ('', 89905),
 ('of', 84496),
 ('for', 82847),
 ('with', 70477),
 ('were', 63939),
 ('staff', 63841),
 ('good', 54834),
 ('room', 50574),
 ('I', 46914),
 ('we', 46801),
 ('are', 46575)]

In [139]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\minha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [140]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

vocab_reduced = Counter()
# Keep only the words from vocab that are not in stop_words;
# the comparison is case-sensitive, so capitalized forms such as 'The' remain
for w, c in vocab.items():
    if w not in stop_words:
        vocab_reduced[w] = c

vocab_reduced.most_common(20)

[('The', 128804),
 ('', 89905),
 ('staff', 63841),
 ('good', 54834),
 ('room', 50574),
 ('I', 46914),
 ('nice', 44048),
 ('friendly', 41892),
 ('location', 39797),
 ('great', 36864),
 ('hotel', 35366),
 ('us', 35022),
 ('really', 33134),
 ('clean', 32543),
 ('breakfast', 32440),
 ('helpful', 31791),
 ('We', 29761),
 ('stay', 26875),
 ('Very', 24396),
 ('place', 21945)]
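
The list above still contains `'The'`, `'I'`, `'We'` and the empty string, because the stop-word check is case-sensitive and `split(' ')` emits empty tokens for repeated spaces. One possible way to avoid both, sketched under the assumption that case carries no meaning for this task, is to lowercase each review and call `split()` without an argument. The mini stop-word set and the sample reviews are hypothetical; the notebook uses NLTK's English list.

```python
from collections import Counter

# Hypothetical mini stop-word set standing in for NLTK's English list.
stop_words = {"the", "was", "and", "i", "we", "a"}

# Made-up reviews; note the double space in the second one.
reviews = [
    "The staff was friendly",
    "the room was clean  and quiet",
    "We loved the staff",
]

vocab = Counter()
for review in reviews:
    # lower() makes the stop-word check case-insensitive;
    # split() with no argument collapses repeated whitespace.
    for word in review.lower().split():
        if word not in stop_words:
            vocab[word] += 1

print(vocab.most_common(3))
```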

In [141]:
# Tokenizing and stemming
# tokenizer: breaks a review into individual words
# stemming: reduces a word to its root
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]


In [143]:
# Split the dataset into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split

X = cusdata['review']
y = cusdata['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=102)
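
The split above is purely random. Since the two sentiment classes are not equally frequent, one option is to pass `stratify` so that both parts keep the same class ratio. This is an addition for illustration, not something the notebook does, and the toy data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 30 negative and 20 positive reviews.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(50)],
    "sentiment": [0] * 30 + [1] * 20,
})

# stratify keeps the 60/40 class ratio in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["review"],
    toy["sentiment"],
    test_size=0.2,
    random_state=102,
    stratify=toy["sentiment"],
)

print(y_tr.mean(), y_te.mean())
```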

## 3. Create Model and Train 

A **Pipeline** chains the **tfidf** step and the **LogisticRegression** step into a single estimator.
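
As a minimal, self-contained sketch of that idea, the block below fits a two-step pipeline on four made-up sentences. It uses the default `TfidfVectorizer` settings rather than the custom preprocessor and tokenizer defined earlier, and it raises `max_iter`, which is one of the remedies suggested by the convergence warning that appears later in this notebook's output.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical miniature training set.
texts = [
    "great location and friendly staff",
    "dirty room and rude staff",
    "lovely breakfast, very clean",
    "noisy, dirty and overpriced",
]
labels = [1, 0, 1, 0]

# Step 1 turns raw text into TF-IDF vectors; step 2 fits the classifier on them.
pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, random_state=0)),
])
pipe.fit(texts, labels)

# Raw strings go in; the pipeline applies the vectorizer before predicting.
proba = pipe.predict_proba(["friendly staff and clean room"])
print(proba.shape)
```

Because the vectorizer lives inside the pipeline, it is fitted only on the data passed to `fit`, which keeps test-set vocabulary out of training.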

In [144]:
# Import Pipeline, LogisticRegression, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer


## 4. Evaluate Model

In [145]:
# Fit the pipeline on the training set; evaluating on the test set with
# classification_report and a confusion matrix is the intended next step
tfidf = TfidfVectorizer(stop_words=stop_words,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)

# A pipeline chains several steps together once the initial exploration is done.
# Some steps transform features (normalizing numeric values, turning text into vectors,
# or filling in missing data); these are transformers. Other steps predict a target by
# fitting an algorithm; these are estimators. A Pipeline chains them so that the whole
# sequence can be fitted on the training data in one call.
clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(random_state=0))])
clf.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Pipeline(steps=[('vect',
                 TfidfVectorizer(preprocessor=<function preprocessor at 0x000001FBA02C4AF0>,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourselves', 'you',
                                             "you're", "you've", "you'll",
                                             "you'd", 'your', 'yours',
                                             'yourself', 'yourselves', 'he',
                                             'him', 'his', 'himself', 'she',
                                             "she's", 'her', 'hers', 'herself',
                                             'it', "it's", 'its', 'itself', ...],
                                 tokenizer=<function tokenizer_porter at 0x000001FB912F08B0>)),
                ('clf', LogisticRegression(random_state=0))])
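
The cell above only fits the model. A hedged sketch of how a held-out set could be scored with `classification_report` and `confusion_matrix` follows. It trains a small stand-in pipeline on made-up sentences so that the block runs on its own; with the notebook's objects, the equivalent calls would use `clf`, `X_test` and `y_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

# Hypothetical miniature train and test sets.
train_texts = [
    "great location and friendly staff",
    "dirty room and rude staff",
    "lovely breakfast, very clean",
    "noisy, dirty and overpriced",
]
train_labels = [1, 0, 1, 0]
test_texts = ["friendly staff, clean room", "rude staff and noisy room"]
test_labels = [1, 0]

model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, random_state=0)),
])
model.fit(train_texts, train_labels)

# Predict on data the model has not seen during fitting.
pred = model.predict(test_texts)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(test_labels, pred)
print(cm)

# Per-class precision, recall and F1; zero_division=0 avoids warnings
# when a class receives no predictions.
print(classification_report(test_labels, pred, zero_division=0))
```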

## 5. Export Model 

In [146]:
# Use pickle to export the trained model
import pickle

with open('output.pkl', 'wb') as f:
    pickle.dump(clf, f)
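
For completeness, here is a sketch of the round trip: saving a fitted pipeline and loading it back. It uses a small stand-in model and a temporary directory, both chosen for illustration, so that it does not overwrite the notebook's `output.pkl`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the trained pipeline.
texts = ["great location", "dirty room", "lovely breakfast", "rude staff"]
labels = [1, 0, 1, 0]
model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, random_state=0)),
]).fit(texts, labels)

# Write to a temporary location; the with-blocks close the file handles.
path = os.path.join(tempfile.mkdtemp(), "output.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

# The restored pipeline should reproduce the original probabilities.
sample = ["clean room and friendly staff"]
same = bool((restored.predict_proba(sample) == model.predict_proba(sample)).all())
print(same)
```

Pickle stores plain functions by reference, so a pipeline that uses custom functions such as `preprocessor` and `tokenizer_porter` can only be loaded where those functions are importable under the same names. Pickle files should also only be loaded from trusted sources.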

In [147]:
twits = [
    "I do not feel not bad", # Phuc +1
    'This model is "so good" :))', # Long -1
    'we are who we are', # Nghi 0
    'its good to be bad sometimes', # PA +1
    'what a wonderful failure! (sarcasm :)))', #Phuc +1
    'People do not like the bad things', # Chi 0
    'We finally have the test result. You are positive', # Long +1
]

preds = clf.predict_proba(twits)

for i in range(len(twits)):
    print(f'{twits[i]} --> Negative, Positive = {preds[i]}')

I do not feel not bad --> Negative, Positive = [0.9233126 0.0766874]
This model is "so good" :)) --> Negative, Positive = [0.66419402 0.33580598]
we are who we are --> Negative, Positive = [0.83291458 0.16708542]
its good to be bad sometimes --> Negative, Positive = [0.96554963 0.03445037]
what a wonderful failure! (sarcasm :))) --> Negative, Positive = [0.62663705 0.37336295]
People do not like the bad things --> Negative, Positive = [0.85668022 0.14331978]
We finally have the test result. You are positive --> Negative, Positive = [0.83842034 0.16157966]


In [148]:
# Predicted class probabilities for every review (column 0 = negative, column 1 = positive)
pred_score = clf.predict_proba(X)

final = cusdata.copy()
final.reset_index(inplace=True)
# Assign each probability column in one vectorized step
final['Negative'] = pred_score[:, 0]
final['Positive'] = pred_score[:, 1]

final.T


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,248332,248333,248334,248335,248336,248337,248338,248339,248340,248341
index,186061,71684,186570,183514,36985,265749,90540,23808,209775,273198,...,44573,243396,185294,212208,279968,12348,149554,115562,184576,17170
domain_hotel_id,4303133,1320362,2183217,4542179,1778707,2669235,256409,1119706,1191662,4850239,...,4507518,1115837,2256105,1830458,5647782,2968256,2735905,3635919,822951,2669521
username,Anonymous,Jys,Dan2how,Anonymous,Jone,Nicolai,Duncan,Laurence,Anonymous,Sarah,...,Anonymous,Amy,Kristin,Wendy,Hannah,Pete,Margarida,Lá»c,Simon,Allan
review_id,3.21806e+09,3.4894e+09,4.50081e+09,4.52216e+09,2.31097e+09,4.47188e+09,3.45208e+09,4.48185e+09,1.88501e+09,3.90887e+09,...,4.52194e+09,4.48894e+09,3.02633e+09,1.40038e+09,2.82864e+09,4.52039e+09,3.67726e+09,2.05322e+09,4.44051e+09,4.4825e+09
score,2.83333,3.5,3.83333,3.6,2.5,3.83333,3.33333,3,3,4,...,3.2,4,4,2.16667,4,2.8,4,3,3.33333,3
Is_Response,neutral,neutral,good,good,neutral,good,neutral,neutral,neutral,wonderful,...,neutral,wonderful,wonderful,neutral,wonderful,neutral,wonderful,neutral,neutral,neutral
id,0,1,2,3,4,5,6,7,8,9,...,248332,248333,248334,248335,248336,248337,248338,248339,248340,248341
review,"Clean, comfortable","The location was great, close to many of the c...",Walking distance to Hoi An Ancient Town. Frien...,host was nice\r\nfree bicycle rental\r\nnear t...,The location was excelent. \nWe checked in ear...,The service was excellent. They help us with e...,"Good value, great breakfast",Excellent service and personnel,It was a good location and breakfast was nice.,"Everything was great, had a comfy stay! Great ...",...,Facilities and amenities,Great location. Very friendly staff. Would rec...,"Everything, perfect like all the other times w...",Very nice pool and views at night,Room was amazing with beautiful views of the m...,Very clean hotel. The room I had was spotless ...,The family is lovely and you can join them for...,Good place!,"Location is excellent, staff were great, very ...",clean and spaceous room. very friendly and eff...
sentiment,0,0,1,0,0,1,0,0,0,1,...,0,1,1,0,1,0,1,0,0,0
length,18,199,82,74,152,115,28,31,46,125,...,24,52,223,33,396,105,223,11,54,58


In [149]:
final.to_csv("dfWithPositiveNegative.csv")