# Project 1 - Task #2

This project uses the Multi-Domain Sentiment Dataset, which is available here:
https://www.cs.jhu.edu/~mdredze/datasets/sentiment/ 

The Multi-Domain Sentiment Dataset contains product reviews from Amazon.com from 4 product types (domains): Kitchen, Books, DVDs, and Electronics. 
For each domain, there are several thousand reviews. Reviews can range from 1 to 5 stars, and can be converted into binary labels if needed.
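
As a rough sketch of the binary conversion mentioned above, the function below maps a star rating to a label. The name `rating_to_label` and the cutoffs (above 3 stars is positive, below 3 is negative, exactly 3 is treated as ambiguous) are assumptions for illustration, not something this project defines.

```python
def rating_to_label(stars):
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (ambiguous).

    The 3-star cutoff is an assumed convention used only for this sketch.
    """
    if not 1 <= stars <= 5:
        raise ValueError("rating must be between 1 and 5")
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None  # 3-star reviews carry no clear polarity and can be dropped


ratings = [5, 1, 3, 4, 2]
labels = [rating_to_label(r) for r in ratings]
print(labels)
```

Reviews that map to `None` would simply be filtered out before training.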
    

Import necessary libraries :

In [1]:
import urllib.request  # Python 3 standard library; six is only needed for Python 2 compatibility
import os
import tarfile
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
nltk.download(['punkt','wordnet','stopwords'])
import numpy as np
from nltk.stem import WordNetLemmatizer 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Download the required file and extract to our working directory :

In [2]:
download_root = "https://www.cs.jhu.edu/~mdredze/datasets/sentiment/"
file_name = "domain_sentiment_data.tar.gz"
sentiment_url =  download_root + file_name

def download_extract(url,location):
    '''
    url: URL where the data resides
    location: directory on the workstation the data needs to be copied to
    '''
    gz_path =  os.path.join(location,file_name)
    # download from the url argument rather than the global sentiment_url
    _ = urllib.request.urlretrieve(url=url,filename=gz_path)
    # the context manager closes the archive even if extraction fails
    with tarfile.open(gz_path) as gz_folder:
        gz_folder.extractall(path=location)

In [3]:
download_extract(url= sentiment_url,location=os.getcwd())

In [4]:
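# note: this rebinds the name `stopwords` from the NLTK corpus module to a plain list,
# so re-running this cell without re-importing raises AttributeError
stopwords = stopwords.words('english')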

Treating the following words as stopwords would be too restrictive, since several of them carry sentiment. They will be removed from the stopword list:

---



In [5]:
not_stopwords= ['best','better','good','great',
'greater','greatest','important','interesting','problem','problems','work',
'worked','working','works','would']

In [6]:
stopwords = [x for x in stopwords if x not in not_stopwords]
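
The filter above only changes the list for words that are actually present in it. A minimal sketch of how to check which words are affected, using a toy list in place of NLTK's (the names `toy_stopwords` and `keep_words` and their contents are made up for illustration):

```python
# Toy stand-in for NLTK's English stopword list (assumed contents, illustration only)
toy_stopwords = ['the', 'a', 'is', 'would', 'not', 'very']
keep_words = ['good', 'great', 'would']

# Words from keep_words that really occur in the stopword list
affected = [w for w in keep_words if w in toy_stopwords]

# A set makes each membership test constant time
keep_set = set(keep_words)
filtered = [w for w in toy_stopwords if w not in keep_set]

print(affected)
print(filtered)
```

Running the same check against the real list shows which entries of `not_stopwords` the filter actually removes.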

The reviews are stored in an XML-like format; the BeautifulSoup library will assist with parsing the files:

In [7]:
# read the file inside a context manager so the handle is closed afterwards
with open('sorted_data_acl/electronics/positive.review') as f:
    positive_reviews = BeautifulSoup(f.read(),'lxml')
positive_reviews = positive_reviews.find_all('review_text')

In [8]:
# read the file inside a context manager so the handle is closed afterwards
with open('sorted_data_acl/electronics/negative.review') as f:
    negative_reviews = BeautifulSoup(f.read(),'lxml')
negative_reviews = negative_reviews.find_all('review_text')

In [9]:
print('Number of positive reviews: {}'.format(len(positive_reviews)))
print('Number of negative reviews: {}'.format(len(negative_reviews)))

Number of positive reviews: 1000
Number of negative reviews: 1000


Notice the reviews are evenly split between the two classes. Now we'll create a function to tokenize the reviews. It converts the text to lowercase, keeps tokens longer than 2 characters, lemmatizes each token to its base (dictionary) form, and finally removes stopwords.

In [10]:
wordnet_lemmatizer = WordNetLemmatizer()

def my_tokenizer(s):
    s = s.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    tokens = [t for t in tokens if len(t) > 2 ]
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if t not in  stopwords]
    return tokens
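
To make the order of these steps concrete, here is a small sketch that runs without NLTK data. The regex tokenizer, `TOY_STOPWORDS`, and `TOY_LEMMAS` are simplified stand-ins (assumptions for illustration) for `word_tokenize`, the NLTK stopword list, and `WordNetLemmatizer`; they do not reproduce NLTK's exact behavior.

```python
import re

# Stand-ins so the sketch runs without NLTK data (assumed, illustration only)
TOY_STOPWORDS = {'the', 'and', 'for', 'this', 'but'}
TOY_LEMMAS = {'cables': 'cable', 'batteries': 'battery'}


def toy_tokenizer(text):
    text = text.lower()                                      # 1. lowercase
    tokens = re.findall(r"[a-z']+", text)                    # 2. split into word-like tokens
    tokens = [t for t in tokens if len(t) > 2]               # 3. drop tokens of length <= 2
    tokens = [TOY_LEMMAS.get(t, t) for t in tokens]          # 4. map each token to a base form
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]   # 5. drop stopwords
    return tokens


print(toy_tokenizer("The cables and batteries for this TV are great, but it is loud"))
```

Note that the length filter runs before the stopword filter, so short tokens such as "tv", "it", and "is" are gone before the stopword list is consulted.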

In this section, we'll create a word-to-index map that assigns each distinct token a column position; this will assist with storing the word-frequency vectors.
We'll also save the tokenized version of each review.

In [11]:
word_index_map = {}
current_index = 0

positive_tokenize = []
nagative_tokenize = []


for review in positive_reviews:
    tokens = my_tokenizer(review.text)
    positive_tokenize.append(tokens)

    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1
            
            
for review in negative_reviews:
    tokens = my_tokenizer(review.text)
    nagative_tokenize.append(tokens)

    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1
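
The same mapping logic can be sketched on a toy corpus. The helper name `build_word_index` is hypothetical, and it uses the current size of the dictionary as the next index instead of a separate counter, which is equivalent here.

```python
def build_word_index(tokenized_docs):
    """Assign each distinct token the next free integer index, in first-seen order."""
    word_index = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in word_index:
                # the dict's current size is always the next unused index
                word_index[token] = len(word_index)
    return word_index


docs = [['great', 'price', 'great'], ['bad', 'price']]
toy_map = build_word_index(docs)
print(toy_map)
```

Repeated tokens such as 'great' and 'price' keep the index they received on first sight, so the map's size equals the vocabulary size.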

In [12]:
positive_tokenize[0]

['purchased',
 'unit',
 'due',
 'frequent',
 'blackout',
 'area',
 'power',
 'supply',
 'going',
 'bad',
 'run',
 'cable',
 'modem',
 'router',
 'lcd',
 'monitor',
 'minute',
 'enough',
 'time',
 'save',
 'work',
 'shut',
 'equally',
 'important',
 'know',
 'electronics',
 'receiving',
 'clean',
 'power',
 'feel',
 'investment',
 'minor',
 'compared',
 'loss',
 'valuable',
 'data',
 'failure',
 'equipment',
 'due',
 'power',
 'spike',
 'irregular',
 'power',
 'supply',
 'always',
 'amazon',
 'business',
 'day']

In this section, we'll create a function that counts how often each word appears in a review and converts those counts into proportions of the review's total token count. The label is stored in the last element of the vector.

In [13]:
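def tokens_to_vector(tokens,label):
    # one slot per vocabulary word, plus a final slot for the label
    x = np.zeros(len(word_index_map)+1)
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    # convert counts to proportions (the label slot is still 0 at this point)
    x = x/ x.sum()
    x[-1] = label
    return x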
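
A small worked example of this vectorization, assuming a three-word vocabulary. The names `toy_index` and `toy_tokens_to_vector` are hypothetical, and the guard against an empty token list is an addition for the sketch rather than part of the function above.

```python
import numpy as np

toy_index = {'great': 0, 'price': 1, 'bad': 2}


def toy_tokens_to_vector(tokens, label, word_index):
    x = np.zeros(len(word_index) + 1)   # last slot is reserved for the label
    for t in tokens:
        x[word_index[t]] += 1
    total = x.sum()
    if total > 0:                        # avoid dividing by zero for an empty review
        x = x / total
    x[-1] = label
    return x


vec = toy_tokens_to_vector(['great', 'price', 'great', 'great'], 1, toy_index)
print(vec)
```

Here 'great' accounts for three of the four tokens, so its slot holds 0.75, 'price' holds 0.25, 'bad' holds 0, and the final slot holds the label 1.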
    

In [14]:
N = len(positive_reviews) + len(negative_reviews)
data = np.zeros((N,len(word_index_map)+1))
i = 0

for token in positive_tokenize:
    xy = tokens_to_vector(token,1)
    data[i,:] = xy
    i += 1
    
for token in nagative_tokenize:
    xy = tokens_to_vector(token,0)
    data[i,:] = xy
    i += 1

Shuffle the data and split it into training and test sets, holding out the last 100 rows for testing:



In [15]:
np.random.seed(567)
np.random.shuffle(data)

X = data[:,:-1]
y = data[:,-1]

In [16]:
X_train = X[:-100,]
y_train = y[:-100]

In [17]:
X_test = X[-100:,]
y_test = y[-100:]
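
The shuffle-then-slice pattern can be checked on a toy array. This sketch uses NumPy's `default_rng` generator instead of the global seed, and the array contents and hold-out size `n_test` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(567)          # seeded generator for a reproducible shuffle
toy_data = np.arange(20).reshape(10, 2)   # 10 rows: column 0 = feature, column 1 = "label"
toy_data = rng.permutation(toy_data)      # shuffles whole rows and returns a copy

n_test = 3                                # assumed hold-out size for this toy example
X_toy, y_toy = toy_data[:, :-1], toy_data[:, -1]
X_tr, y_tr = X_toy[:-n_test], y_toy[:-n_test]
X_te, y_te = X_toy[-n_test:], y_toy[-n_test:]

print(X_tr.shape, X_te.shape)
```

Because features and labels live in the same rows while shuffling, each label stays attached to its feature vector; splitting into `X` and `y` only happens afterwards.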

Logistic Regression : 

In [18]:
model = LogisticRegression()
model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [19]:
print('Accuracy rate {:1.2f} '.format(model.score(X_test,y_test)))

Accuracy rate 0.73 


Random Forest :

In [20]:
rf_model = RandomForestClassifier(n_estimators=200,random_state=893)
rf_model.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=893,
                       verbose=0, warm_start=False)

In [21]:
print('Accuracy rate {:1.2f} '.format(rf_model.score(X_test,y_test)))

Accuracy rate 0.80 
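
Both accuracy figures come from only 100 test reviews, so they carry noticeable sampling noise. The sketch below estimates a rough 95% interval for each, assuming a normal approximation to the binomial; the helper `accuracy_interval` is hypothetical and the approximation is only a back-of-the-envelope check.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n test examples."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)   # binomial standard error
    return accuracy - z * se, accuracy + z * se


for name, acc in [('logistic regression', 0.73), ('random forest', 0.80)]:
    low, high = accuracy_interval(acc, 100)
    print('{}: {:.2f} to {:.2f}'.format(name, low, high))
```

The two intervals overlap, so the gap between 0.73 and 0.80 should be read with caution on a test set of this size.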


Next, review the logistic regression weights for individual words. Only words whose absolute weight exceeds a threshold of 0.5 are printed.

In [22]:
threshold = 0.5

for word, index in word_index_map.items():
    weight = model.coef_[0][index]

    
    if abs(weight) > threshold:
        print(word,weight)

bad -0.5902388550978729
cable 0.5590829396219031
time -0.6739846673734677
used 1.0261903891927109
've 0.6096215416906178
month -0.7101215069513618
problem 0.6486207532079401
need 0.5526452901361916
good 1.9025474013793866
sound 0.9668409393871448
lot 0.6014026624790298
n't -1.9868163530762764
easy 1.3763757238315055
case 0.5237199509002051
get -1.0364265164112205
use 1.6606862471279287
quality 1.1959290852170068
best 1.0122222169801935
item -0.8750066367075127
well 0.9910427956555604
wa -1.2003303428726826
perfect 0.8226793519102719
fast 0.7749987600923057
price 2.292341928212701
great 3.4272060405239837
money -0.9488372084047583
memory 0.796381296331678
would -0.8056180407442339
buy -0.9859340087385279
worked -0.7810140343583012
pretty 0.5339156834280407
could -0.5402350078209843
doe -1.0686840615775937
two -0.5681898738085115
highly 0.8833811788847046
recommend 0.5885752419867472
first -0.628596440921964
customer -0.5288598123431051
support -0.7332023448335508
little 0.65983943318600

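Notice that words with negative weights, such as 'bad' or 'n't', push a review toward the negative class, whereas 'great' has the largest positive weight in this list (about 3.43).
Tokens such as 'wa' and 'doe' appear to be 'was' and 'does' after the lemmatizer stripped a trailing 's'.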
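
Sorting the weights makes the strongest signals easier to spot than scanning the printed list. A minimal sketch, using a handful of (word, weight) pairs copied from the output above and rounded to four decimals; in the notebook the dictionary would instead be built from `word_index_map` and `model.coef_`.

```python
# A few (word, weight) pairs taken from the printed output, rounded to four decimals
weights = {'great': 3.4272, 'price': 2.2923, 'good': 1.9025,
           "n't": -1.9868, 'wa': -1.2003, 'doe': -1.0687}

# Sort ascending by weight: most negative first, most positive last
ranked = sorted(weights.items(), key=lambda kv: kv[1])
most_negative = ranked[:2]
most_positive = ranked[-2:][::-1]

print(most_negative)
print(most_positive)
```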