# Foundations of AI & ML
## Session 07
### CaseStudy 1
### Lab

**Objectives:** Build a product-rating predictor based on non-linear regression.


In [1]:
import pandas as pd
data = pd.read_csv("../Datasets/amazon_reviews.csv")
print(data.describe())
data = data.dropna()
print(data.describe())

         Unnamed: 0        ratings
count  167597.00000  167597.000000
mean    83798.00000       4.356307
std     48381.23087       0.993501
min         0.00000       1.000000
25%     41899.00000       4.000000
50%     83798.00000       5.000000
75%    125697.00000       5.000000
max    167596.00000       5.000000
          Unnamed: 0        ratings
count  167504.000000  167504.000000
mean    83798.019253       4.356427
std     48380.619090       0.993334
min         0.000000       1.000000
25%     41899.750000       4.000000
50%     83795.500000       5.000000
75%    125699.250000       5.000000
max    167596.000000       5.000000


In [2]:
data.head()

Unnamed: 0.1,Unnamed: 0,reviews,ratings
0,0,I like the item pricing. My granddaughter want...,5.0
1,1,Love the magnet easel... great for moving to d...,4.0
2,2,Both sides are magnetic. A real plus when you...,5.0
3,3,Bought one a few years ago for my daughter and...,5.0
4,4,I have a stainless steel refrigerator therefor...,4.0


In [3]:
data.tail()

Unnamed: 0.1,Unnamed: 0,reviews,ratings
167592,167592,This drone is very fun and super duarable. Its...,5.0
167593,167593,This is my brother's most prized toy. It's ext...,5.0
167594,167594,This Panther Drone toy is awesome. I definitel...,5.0
167595,167595,This is my first drone and it has proven to be...,5.0
167596,167596,This is a super fun toy to have around. In our...,4.0


In [2]:
ratings = data['ratings'].values
reviews = data['reviews'].values
lengths = [len(r) for r in reviews]

In [5]:
### TODO DEBUGGING DEBUGGING

# ratings = ratings[:2000]
# reviews = reviews[:2000]
# lengths = [len(r) for r in reviews]

#### We first preprocess the data by removing incorrect rows (those with a missing rating or review) and unwanted columns, removing stopwords, and so on.

In [3]:
import re
only_alnum = re.compile(r"[^a-z0-9]+")
## Replaces one or more occurrences of any character other than a-z and 0-9 with a single space
## This automatically collapses multiple spaces into one

## The try ... except ensures that a malformed (non-string) review is replaced with the word ERROR
def cleanUp(s):
    try:
        return only_alnum.sub(" ", s.lower())
    except AttributeError:  # s is not a string, e.g. NaN
        return "ERROR"

In [4]:
## We make a set for testing whether a word is not useful
## Sets are much faster than lists for membership tests
with open("../Datasets/fluff.txt") as f:
    fluff = set(w.strip() for w in f)
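
The claim about sets versus lists can be checked with a quick timing sketch. The word list here is a hypothetical stand-in generated on the fly (the real one comes from fluff.txt), and the absolute timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for the fluff word list (the real one is read from fluff.txt)
words = ["word%d" % i for i in range(10000)]
as_list = words
as_set = set(words)

# A word that is absent is the worst case for the list: every element is compared
probe = "notaword"

list_time = timeit.timeit(lambda: probe in as_list, number=200)
set_time = timeit.timeit(lambda: probe in as_set, number=200)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

A list membership test scans the elements one by one, while a set uses hashing, so the gap grows with the size of the word list.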

In [8]:
list(fluff)[:10]

['anything',
 'down',
 'perhaps',
 'making',
 'something',
 'year',
 'known',
 'became',
 'present',
 'here']

In [5]:
## Replace words like coooooool with cool, amaaaaaazing with amaazing and so on
def dedup(s):
    return re.sub(r'([a-z])\1+', r'\1\1', s)
print(dedup("cooooool"))
print(dedup("amaaaaaazzzzing"))
print(dedup('cool'))

cool
amaazzing
cool


In [6]:
def get_useful_words(s):
    return [dedup(w) for w in cleanUp(s).split() if len(w) > 2 and w not in fluff]

In [7]:
clean_reviews = [get_useful_words(review) for review in reviews]
for i in range(5):
    print("%4d" %(len(reviews[i])), reviews[i], "\n==>", clean_reviews[i])

 100 I like the item pricing. My granddaughter wanted to mark on it but I wanted it just for the letters. 
==> ['like', 'item', 'pricing', 'granddaughter', 'mark', 'letters']
 121 Love the magnet easel... great for moving to different areas... Wish it had some sort of non skid pad on bottom though... 
==> ['love', 'magnet', 'easel', 'great', 'moving', 'wish', 'sort', 'skid', 'pad', 'bottom']
 420 Both sides are magnetic.  A real plus when you're entertaining more than one child.  The four-year old can find the letters for the words, while the two-year old can find the pictures the words spell.  (I bought letters and magnetic pictures to go with this board).  Both grandkids liked it a lot, which means I like it a lot as well.  Have not even introduced markers, as this will be used strictly as a magnetic board. 
==> ['magnetic', 'real', 'plus', 'entertaining', 'more', 'child', 'letters', 'words', 'pictures', 'words', 'spell', 'bought', 'letters', 'magnetic', 'pictures', 'board', 'grandki

In [8]:
final_reviews = list(zip(clean_reviews, ratings, lengths))
# Look at a random sample of 10 cleaned reviews.
import random
for i in range(10):
    r = random.randrange(0, len(final_reviews))
    print(final_reviews[r])

(['blanket', 'puppet', 'cute', 'little', 'girl', 'love', 'soft', 'cuddly', 'recommend'], 5.0, 132)
(['surprised', 'likes', 'educational', 'toy', 'brought', 'tell', 'home', 'school', 'kids', 'love', 'getting', 'hands', 'squishy', 'organs', 'kit', 'comes', 'plastic', 'body', 'skeleton', 'snaps', 'skeleton', 'probably', 'most', 'delicate', 'model', 'held', 'little', 'protruding', 'pieces', 'plastic', 'snap', 'holes', 'skin', 'suggest', 'hard', 'pieces', 'few', 'times', 'loosen', 'kids', 'own', 'little', 'nubs', 'broke', 'ours', 'luckily', 'wasn', 'piece', 'stays', 'organs', 'similar', 'material', 'thesesticky', 'hands', 'expect', 'squishy', 'pick', 'dirt', 'easily', 'wash', 'soap', 'water', 'squeezablity', 'organs', 'biggest', 'draw', 'toy', 'among', 'kids', 'interested', 'questions', 'human', 'body', 'book', 'included', 'kit', 'takes', 'kids', 'step', 'step', 'dissection', 'process', 'using', 'kit', 'forceps', 'tweezers', 'kids', 'expected', 'apart', 'body', 'organs', 'enclosed', 'chart'

In [None]:
# clean_reviews[:5], lengths[:5]

**Case-Study:** Use the list of substantive words extracted from the review as well as the length of the original review. Decide how you would like to derive a feature set to predict the rating, which is a float (1.0 to 5.0).

Remember to split the data into training, validation, and testing sets.
1. Select 10% of the data for testing and put it away.
2. Select 20% of the data for validation and 70% for training.
3. Vary the above ratio between validation and training (30 - 60, 45 - 45, 60 - 30) and verify the effect, if any, on the prediction accuracy.
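
The three-way split described above could be sketched as follows. This is a minimal sketch on toy data, not the notebook's own implementation: the function name `train_val_test_split`, its parameters, and the fixed seed are assumptions made for illustration.

```python
import numpy as np

def train_val_test_split(features, target, val_ratio=0.2, test_ratio=0.1, seed=0):
    """Shuffle once, set the test slice aside, then cut validation and training."""
    features = np.asarray(features)
    target = np.asarray(target)
    n = len(features)
    order = np.random.default_rng(seed).permutation(n)

    n_test = int(round(test_ratio * n))
    n_val = int(round(val_ratio * n))

    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]

    return ((features[train_idx], target[train_idx]),
            (features[val_idx], target[val_idx]),
            (features[test_idx], target[test_idx]))

# Toy data standing in for the real features and ratings
X = np.arange(100)
y = np.arange(100) % 5 + 1.0

# Keep the test share at 10% and vary the validation share
for val_ratio in (0.2, 0.3, 0.45, 0.6):
    (X_tr, y_tr), (X_va, y_va), (X_te, y_te) = train_val_test_split(X, y, val_ratio=val_ratio)
    print("val=%.2f -> train %d, val %d, test %d" % (val_ratio, len(X_tr), len(X_va), len(X_te)))
```

Fixing the seed keeps the same 10% test slice across all validation/training ratios, so the held-out set stays untouched while the other two vary.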


Some Possibilities:

1. You can use a single feature, namely the difference between the number of positive and negative words.

2. You can also consider predicting the rating from the above difference and the length of the review as two independent variables.

3. You could consider the positive-word count and the negative-word count as two independent variables rather than treating their difference as a single independent variable, giving you more possibilities.
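
As a rough illustration, the three feature sets listed above can be assembled like this. The lexicons, reviews, and lengths are tiny hypothetical stand-ins, and the names `X_diff`, `X_diff_len`, and `X_pos_neg` are chosen only for this sketch.

```python
import numpy as np

# Tiny hypothetical lexicons; the notebook loads the real ones from text files
positive_words = {"love", "great", "cute"}
negative_words = {"broke", "delicate"}

# Toy cleaned reviews paired with the length of the original text
toy_reviews = [["love", "great", "easel"], ["broke", "delicate", "cute"], ["board"]]
toy_lengths = [121, 80, 15]

pos = np.array([sum(w in positive_words for w in r) for r in toy_reviews])
neg = np.array([sum(w in negative_words for w in r) for r in toy_reviews])

# Option 1: a single feature, the positive-minus-negative count
X_diff = (pos - neg).reshape(-1, 1)
# Option 2: the difference plus the review length
X_diff_len = np.column_stack([pos - neg, toy_lengths])
# Option 3: positive and negative counts as separate features
X_pos_neg = np.column_stack([pos, neg])

print(X_diff.shape, X_diff_len.shape, X_pos_neg.shape)
```

Each matrix has one row per review, which is the shape scikit-learn estimators expect for their input.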


In [9]:
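import numpy as np

# Load each lexicon into a set so that membership tests are fast.
# Note: read_csv treats the first line of each file as a header, so that line is not part of the word list.
POSITIVE_WORDS = set(pd.read_csv('../Datasets/positive-words.txt').values.ravel())
NEGATIVE_WORDS = set(pd.read_csv('../Datasets/negative-words.txt').values.ravel())

def get_positive_words(review):
    # Count the words of a cleaned review that appear in the positive lexicon
    return sum(1 for w in review if w in POSITIVE_WORDS)

def get_negative_words(review):
    # Count the words of a cleaned review that appear in the negative lexicon
    return sum(1 for w in review if w in NEGATIVE_WORDS)

positives = np.array([get_positive_words(review) for review in clean_reviews])
negatives = np.array([get_negative_words(review) for review in clean_reviews])
differences = positives - negatives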

In [None]:
# clean_reviews[:5], positives[:5]


positives[:15], negatives[:15], differences[:15]

In [None]:
from sklearn.utils import shuffle

clean_reviews, positives, negatives, differences, lengths, ratings \
        = shuffle(clean_reviews, positives, negatives, differences, lengths, ratings)


In [None]:
differences[:3]

In [None]:
TRAIN_RATIO = .8
VAL_RATIO   = .1
TEST_RATIO  = .1

TOTAL = len(clean_reviews)

In [None]:
# Select train/test sets based on the case study

def get_train_test_split(data, target, train_ratio=.8):
    # The data was shuffled earlier, so a positional cut gives a random split
    split = int(train_ratio * len(data))
    X_train = data[:split]
    y_train = target[:split]
    X_test  = data[split:]
    y_test  = target[split:]

    return X_train, y_train, X_test, y_test


In [None]:
# use SVM?
# MLP ? see caseStudy2

from sklearn.neural_network import MLPRegressor
clf = MLPRegressor(
    hidden_layer_sizes=(10,),  activation='relu', solver='adam', alpha=0.0001, batch_size=4,
    learning_rate='adaptive',learning_rate_init=0.001, power_t=0.5, max_iter=1000, shuffle=False,
    random_state=2, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True,
    early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)


In [None]:
from sklearn.metrics import accuracy_score, mean_squared_error

def train_and_predict(X_train, y_train, X_test, y_test):
    if len(X_train.shape) == 1: # For single feature data
        X_train = X_train.reshape(-1, 1)
    y_train = y_train.ravel()  # 1-D target for single-output regression
    clf.fit(X_train, y_train)
    print(clf.coefs_)

    # Make predictions using the testing set
    if len(X_test.shape) == 1:
        X_test = X_test.reshape(-1, 1)
    y_test = y_test.ravel()
    y_pred = clf.predict(X_test)
    # The mean squared error
    print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
    # Accuracy after rounding predictions to the nearest whole rating, as a rough sanity check
    print("Accuracy (rounded predictions): %.2f" % accuracy_score(y_test, np.round(y_pred)))

    return y_pred
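
To judge the printed MSE, it can help to compare it with a constant baseline that always predicts the mean training rating. The sketch below uses toy ratings skewed toward 5 (loosely mimicking the describe() output earlier); the arrays are hypothetical, not taken from the dataset.

```python
import numpy as np

# Toy ratings skewed toward 5; hypothetical values for illustration only
y_train = np.array([5.0, 5.0, 4.0, 5.0, 3.0, 5.0, 1.0, 4.0])
y_test = np.array([5.0, 4.0, 5.0, 2.0])

# Baseline: predict the mean training rating for every test review
baseline = np.full_like(y_test, y_train.mean())
baseline_mse = np.mean((y_test - baseline) ** 2)

print("Baseline MSE: %.2f" % baseline_mse)
```

A model whose MSE is not clearly below this baseline has learned little beyond the average rating.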

In [None]:
def plot(y_test, y_pred):
    from matplotlib import pyplot

    # plot predictions vs expected
    pyplot.plot(y_test)
    pyplot.plot(y_pred, color='red')
    pyplot.show()

In [None]:
X_train, y_train, X_test, y_test = get_train_test_split(differences, ratings)
y_pred = train_and_predict(X_train, y_train, X_test, y_test)
plot(y_test, y_pred)

In [None]:
X_train, y_train, X_test, y_test = get_train_test_split(differences, ratings, .6)
y_pred = train_and_predict(X_train, y_train, X_test, y_test)
plot(y_test, y_pred)

In [None]:
X_train, y_train, X_test, y_test = get_train_test_split(differences, ratings, .5)
y_pred = train_and_predict(X_train, y_train, X_test, y_test)
plot(y_test, y_pred)

In [None]:
X_train, y_train, X_test, y_test = get_train_test_split(differences, ratings, .9)
y_pred = train_and_predict(X_train, y_train, X_test, y_test)
plot(y_test, y_pred)

Changing the train/test split doesn't make much difference to the MSE. In fact, setting the train ratio to .5
gives invalid predictions above 5!
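
One possible way to handle out-of-range predictions is to clamp them to the valid rating scale before rounding. This is a minimal sketch with hypothetical prediction values, not something the notebook itself does.

```python
import numpy as np

# Hypothetical raw regressor outputs, including values outside the 1-5 scale
y_pred = np.array([4.7, 5.6, 0.4, 3.2])

# Clamp to the valid rating range, then round to whole stars
y_clipped = np.clip(y_pred, 1.0, 5.0)
y_stars = np.round(y_clipped)

print(y_clipped)
print(y_stars)
```

Clipping only affects the reported predictions; the regressor itself is still unconstrained during training.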

In [None]:
np.array(list(zip(positives, negatives)))

In [None]:
# Use positive and negative word count as separate features
X_train, y_train, X_test, y_test = get_train_test_split(np.array(list(zip(positives, negatives))), ratings, .8)
y_pred = train_and_predict(X_train, y_train, X_test, y_test)
plot(y_test, y_pred)

In [None]:
X_train, y_train, X_test, y_test = get_train_test_split(np.array(list(zip(positives, negatives))), ratings, .5)
y_pred = train_and_predict(X_train, y_train, X_test, y_test)
plot(y_test, y_pred)

In [None]:
X_train, y_train, X_test, y_test = get_train_test_split(np.array(list(zip(positives, negatives))), ratings, .6)
y_pred = train_and_predict(X_train, y_train, X_test, y_test)
plot(y_test, y_pred)

In [None]:
X_train, y_train, X_test, y_test = get_train_test_split(np.array(list(zip(positives, negatives))), ratings, .7)
y_pred = train_and_predict(X_train, y_train, X_test, y_test)
plot(y_test, y_pred)

In [None]:
X_train, y_train, X_test, y_test = get_train_test_split(np.array(list(zip(positives, negatives))), ratings, .9)
y_pred = train_and_predict(X_train, y_train, X_test, y_test)
plot(y_test, y_pred)

In [None]:
X_train, y_train, X_test, y_test = get_train_test_split(np.array(list(zip(positives, negatives))), ratings, .95)
y_pred = train_and_predict(X_train, y_train, X_test, y_test)
plot(y_test, y_pred)