# Foundations of AI & ML
## Session 05
### Case Study
### Lab

**Objectives:** Build a linear-regression-based solution that predicts product ratings from reviews.


In [1]:
import pandas as pd
data = pd.read_csv("amazon_reviews.csv")
print(data.describe())
data = data.dropna()
print(data.describe())

         Unnamed: 0        ratings
count  167597.00000  167597.000000
mean    83798.00000       4.356307
std     48381.23087       0.993501
min         0.00000       1.000000
25%     41899.00000       4.000000
50%     83798.00000       5.000000
75%    125697.00000       5.000000
max    167596.00000       5.000000
          Unnamed: 0        ratings
count  167504.000000  167504.000000
mean    83798.019253       4.356427
std     48380.619090       0.993334
min         0.000000       1.000000
25%     41899.750000       4.000000
50%     83795.500000       5.000000
75%    125699.250000       5.000000
max    167596.000000       5.000000


In [2]:
data.head()

,Unnamed: 0,reviews,ratings
0,0,I like the item pricing. My granddaughter want...,5.0
1,1,Love the magnet easel... great for moving to d...,4.0
2,2,Both sides are magnetic. A real plus when you...,5.0
3,3,Bought one a few years ago for my daughter and...,5.0
4,4,I have a stainless steel refrigerator therefor...,4.0


In [3]:
data.tail()

,Unnamed: 0,reviews,ratings
167592,167592,This drone is very fun and super duarable. Its...,5.0
167593,167593,This is my brother's most prized toy. It's ext...,5.0
167594,167594,This Panther Drone toy is awesome. I definitel...,5.0
167595,167595,This is my first drone and it has proven to be...,5.0
167596,167596,This is a super fun toy to have around. In our...,4.0


In [4]:
ratings = data['ratings'].values
reviews = data['reviews'].values
lengths = [len(r) for r in reviews]

#### We first preprocess the data by removing incorrect rows (those with a missing rating or review) and unwanted columns, removing stopwords, and so on.

In [5]:
import re
## Matches one or more characters other than a-z and 0-9
only_alnum = re.compile(r"[^a-z0-9]+")

## Lowercases the review, then replaces each such run with a single space
## This automatically replaces multiple spaces by 1 space
## The try ... except ensures that if a review is malformed (not a string) then the review is replaced with the word ERROR
def cleanUp(s):
    try:
        return only_alnum.sub(" ", s.lower())
    except AttributeError:
        return "ERROR"

In [6]:
## We make a set for testing whether a word is not useful (a stopword)
## Membership tests on a set are much faster than on a list
with open("fluff.txt") as f:
    fluff = {w.strip() for w in f}

FileNotFoundError: [Errno 2] No such file or directory: 'fluff.txt'
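
The cell above failed because `fluff.txt` was not in the working directory, so the later cells that use `fluff` cannot run. A minimal sketch of one way around this, assuming a small hand-picked stopword list as a fallback: `DEFAULT_FLUFF` and `load_fluff` are hypothetical names introduced here, and the real contents of `fluff.txt` are not shown in this notebook.

```python
import os

# Hypothetical fallback list; NOT the contents of the course's fluff.txt
DEFAULT_FLUFF = {"the", "and", "for", "this", "that", "with", "was", "are", "have", "has"}

def load_fluff(path="fluff.txt"):
    """Return the set of stopwords in `path`, or a copy of DEFAULT_FLUFF if the file is missing."""
    if not os.path.exists(path):
        return set(DEFAULT_FLUFF)
    with open(path) as f:
        # One word per line; blank lines are skipped
        return {line.strip() for line in f if line.strip()}

fluff = load_fluff()
```

With the fallback in place, the remaining cells can run, though the cleaned reviews will differ from those produced with the real file.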

In [None]:
## Replace words like coooooool with cool, amaaaaaazing with amaazing and so on
def dedup(s):
    return re.sub(r'([a-z])\1+', r'\1\1', s)
print(dedup("cooooool"))
print(dedup("amaaaaaazzzzing"))
print(dedup('cool'))
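
This cell has no recorded output, so here is the same function traced by hand. The pattern `([a-z])\1+` matches any run of two or more identical letters and replaces it with exactly two, which means legitimate doubles such as the `oo` in `cool` are left alone.

```python
import re

def dedup(s):
    # Collapse any run of two or more identical letters to exactly two
    return re.sub(r'([a-z])\1+', r'\1\1', s)

print(dedup("cooooool"))         # cool
print(dedup("amaaaaaazzzzing"))  # amaazzing
print(dedup("cool"))             # cool
```

Note that each run is handled separately, so the second example keeps both `aa` and `zz`.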

In [None]:
def get_useful_words(s):
    return [dedup(w) for w in cleanUp(s).split() if len(w) > 2 and w not in fluff]

In [None]:
clean_reviews = [get_useful_words(review) for review in reviews]
for i in range(5):
    print("%4d" %(len(reviews[i])), reviews[i], "\n==>", clean_reviews[i])

In [None]:
final_reviews = list(zip(clean_reviews, ratings, lengths))
# Look at a random sample of 10 cleaned reviews.
import random
for i in range(10):
    r = random.randrange(0, len(final_reviews))
    print(final_reviews[r])

**Case Study:** Use the list of substantive words extracted from each review, together with the length of the original review, to derive a feature set of your choice for predicting the rating, which is a float (1.0 to 5.0).

Remember to split the data into training, validation, and testing sets.
1. Select 10% of the data for testing and put it away.
2. Select 20% of the data for validation and 70% for training.
3. Vary the above ratio between validation and training (the remaining 90% of the data): 30 - 60, 45 - 45, 60 - 30, and verify the effect, if any, on the prediction accuracy.
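
The split described above can be sketched as follows. This is one possible approach using only the standard library; `split_data` and its parameters are hypothetical names, and a list of integers stands in for `final_reviews`. The fractions are taken relative to the full dataset, and the fixed seed keeps the 10% test set identical while the validation share varies.

```python
import random

def split_data(rows, test_frac=0.10, val_frac=0.20, seed=0):
    """Shuffle rows and return (train, val, test); fractions are of the full dataset."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded shuffle, so the split is reproducible
    n = len(rows)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    test = rows[:n_test]                 # set aside first and never used for tuning
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# Toy stand-in for final_reviews
toy = list(range(1000))
for val_frac in (0.20, 0.30, 0.45, 0.60):
    train, val, test = split_data(toy, val_frac=val_frac)
    print(val_frac, len(train), len(val), len(test))
```

Because the test rows are always the first slice of the same shuffled order, any change in accuracy across the ratios comes from the training and validation sizes, not from a different test set.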


Some possibilities:

1. You can use a single feature, namely the difference between the number of positive and negative words.

2. You can also consider predicting the rating from the above difference and the length of the review, treated as two independent variables.

3. You could treat the positive word count and the negative word count as two independent variables rather than using their difference as a single independent variable, giving you more possibilities.
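
The three possibilities above can be sketched as feature builders feeding an ordinary least-squares fit. This is only an illustration under stated assumptions: the `POSITIVE` and `NEGATIVE` word sets, the helper names, and the toy data are all hypothetical, and NumPy's `np.linalg.lstsq` is used here as one way to fit the linear model, not necessarily the one intended by the course.

```python
import numpy as np

# Hypothetical sentiment word lists; a real solution needs much larger lexicons
POSITIVE = {"love", "great", "awesome", "fun", "good"}
NEGATIVE = {"bad", "broke", "poor", "waste", "terrible"}

def count_pos_neg(words):
    """Count how many words fall in each lexicon."""
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos, neg

def make_features(sample, option):
    """sample holds (words, rating, length) tuples, shaped like final_reviews."""
    X = []
    for words, _rating, length in sample:
        pos, neg = count_pos_neg(words)
        if option == 1:
            X.append([pos - neg])            # single feature: the difference
        elif option == 2:
            X.append([pos - neg, length])    # difference plus review length
        else:
            X.append([pos, neg])             # the two counts kept separate
    return np.array(X, dtype=float)

def fit_linear(X, y):
    """Least-squares fit of y = b0 + X @ b; returns [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Made-up toy data, for illustration only
toy = [
    (["love", "great", "easel"], 5.0, 120),
    (["fun", "toy"], 4.0, 60),
    (["bad", "broke", "waste"], 1.0, 200),
    (["good", "poor"], 3.0, 90),
    (["terrible"], 2.0, 40),
]
y = np.array([rating for _, rating, _ in toy])
for option in (1, 2, 3):
    X = make_features(toy, option)
    coef = fit_linear(X, y)
    rmse = np.sqrt(np.mean((predict(coef, X) - y) ** 2))
    print(option, coef.round(3), round(float(rmse), 3))
```

In practice the model would be fitted on the training set only, with the error compared on the validation set to choose between the options, and the test set used once at the end.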
