# CAPSTONE PROJECT

The dataset used can be found at:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz

### First Step: Imports



In [27]:
import gzip
from collections import defaultdict
import random
import numpy
import scipy.optimize
import string
from sklearn import linear_model
from nltk.stem.porter import PorterStemmer # Stemming
path='amazon_reviews_us_Musical_Instruments_v1_00.tsv'

# Task 1: Data Processing

### Reading the data and filling dataset

In [13]:
f = open(path, 'rt', encoding="utf8")

header = f.readline()
header = header.strip().split('\t')

dataset = []

for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    d['verified_purchase'] = d['verified_purchase'] == 'Y'
    dataset.append(d)
dataset[0]

{'marketplace': 'US',
 'customer_id': '45610553',
 'review_id': 'RMDCHWD0Y5OZ9',
 'product_id': 'B00HH62VB6',
 'product_parent': '618218723',
 'product_title': 'AGPtek® 10 Isolated Output 9V 12V 18V Guitar Pedal Board Power Supply Effect Pedals with Isolated Short Cricuit / Overcurrent Protection',
 'product_category': 'Musical Instruments',
 'star_rating': 3,
 'helpful_votes': 0,
 'total_votes': 1,
 'vine': 'N',
 'verified_purchase': False,
 'review_headline': 'Three Stars',
 'review_body': 'Works very good, but induces ALOT of noise.',
 'review_date': '2015-08-31'}
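
The download link above points to a gzip-compressed archive, while the cell reads an already decompressed `.tsv`. As a minimal sketch, assuming the `.tsv.gz` file sits in the working directory, the same rows can be parsed without unpacking it first. The helper name `read_reviews` and its `limit` parameter are illustrative and not part of the original notebook.

```python
import gzip

def read_reviews(gz_path, limit=None):
    """Parse a gzip-compressed, tab-separated review file into a list of dicts.

    Hypothetical helper that mirrors the parsing done in the cell above.
    """
    rows = []
    with gzip.open(gz_path, 'rt', encoding='utf8') as f:
        # rstrip('\n') removes only the newline, so empty trailing fields survive
        header = f.readline().rstrip('\n').split('\t')
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            fields = line.rstrip('\n').split('\t')
            d = dict(zip(header, fields))
            d['star_rating'] = int(d['star_rating'])
            d['helpful_votes'] = int(d['helpful_votes'])
            d['total_votes'] = int(d['total_votes'])
            d['verified_purchase'] = d['verified_purchase'] == 'Y'
            rows.append(d)
    return rows
```

Passing a small `limit` is a convenient way to inspect a handful of rows before loading the full file.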

### Splitting the data into a Training and Testing set
The training set is the first 80% of the data and the testing set is the remaining 20%.

In [15]:
N=len(dataset)
trainingSet=dataset[:(N*4)//5]
testSet=dataset[(N*4)//5:]
print(len(trainingSet), len(testSet))

723812 180953
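
The slice above takes the first 80% of rows in file order. If the file happens to be ordered in some way (by date, for example), the two sets could differ systematically. One hedged alternative, which is not what this notebook does, is to shuffle a copy with a fixed seed before slicing. `split_shuffled` and its arguments are hypothetical names used only for this sketch.

```python
import random

def split_shuffled(rows, train_fraction=0.8, seed=0):
    """Return (train, test) after shuffling a copy of rows reproducibly."""
    rows = list(rows)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)      # a fixed seed makes the split repeatable
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, test = split_shuffled(range(10))
print(len(train), len(test))  # 8 2
```

Switching to a shuffled split would change every number reported later in the notebook, so treat it as an experiment rather than a drop-in replacement.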


In [4]:
del dataset

### Extracting Basic Statistics

1. How many entries are in your dataset?
2. Pick a non-trivial attribute (e.g. verified purchases, as in the example). What percentage of your data has this attribute?
3. Pick a second, different non-trivial attribute. What percentage of your data has both attributes?

In [73]:
N1=(len(trainingSet))
print('No. of entries in dataset are ',len(trainingSet))
c=0
for i in trainingSet:
    if i['verified_purchase']==True:
        c=c+1
print('Percentage of data shared by verified purchase attribute is',(c/N1)*100,'%')
c1=0 
for i in trainingSet:
    if (i['star_rating']==5) and (i['verified_purchase']==True):
        c1=c1+1
print('Percentage of data shared by verified purchase and star rating attributes is',(c1/N1*100),'%')

No. of entries in dataset are  723812
Percentage of data shared by verified purchase attribute is 89.73531801075417 %
Percentage of data shared by verified purchase and star rating attributes is 58.287649279094566 %


# Task 2: Classification

Next, use your knowledge of classification to extract features and make predictions based on them. Here you will be using a Logistic Regression model; keep this in mind so you know where to look for help.

### Define the feature function



In [20]:
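# Feature vector: a constant offset term, the star rating, and the review length.
# The final element is the label (verified purchase); it is split off from the
# features before fitting, so it is never used as a model input.

def feature(d):
    feat = [1, d['star_rating'], len(d['review_body']), d['verified_purchase']]
    return feat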

### Fit the model

1. Create your __Feature Vector__ based on your feature function defined above. 
2. Create your __Label Vector__ based on the "verified purchase" column of your training set.
3. Define your model as a __Logistic Regression__ model.
4. Fit your model.

In [76]:
#YOUR CODE HERE
data = [feature(d) for d in trainingSet]
x_train= [values[:-1] for values in data]
y_train= [values[-1] for values in data]
model = linear_model.LogisticRegression()
model.fit(x_train,y_train)
data1 = [feature(d) for d in testSet]
x_test= [values[:-1] for values in data1]
y_test= [values[-1] for values in data1]

### Compute Accuracy of the Model

1. Make __Predictions__ based on your model.
2. Compute the __Accuracy__ of your model.

In [77]:
#YOUR CODE HERE
predictionsTrain = model.predict(x_train)
predictionsTest = model.predict(x_test)

correctPredictionsTrain = predictionsTrain == y_train
correctPredictionsTest = predictionsTest == y_test

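# Note: this reports accuracy on the training set; correctPredictionsTest holds the test-set comparison.
print('Training accuracy for this model is',sum(correctPredictionsTrain) / len(correctPredictionsTrain))

Training accuracy for this model is 0.897169430736158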
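
An accuracy figure is easier to interpret next to a trivial baseline. The accuracy reported above (about 0.897) is very close to the share of verified purchases in the training set (about 89.7%), which suggests checking whether the model does more than predict the majority class. The sketch below scores a predictor that always outputs the most common training label. The helper names `accuracy` and `majority_baseline` are hypothetical and the toy labels are invented for illustration; in the notebook, `y_train` and `y_test` would take their place.

```python
from collections import Counter

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def majority_baseline(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return accuracy([majority] * len(eval_labels), eval_labels)

toy_train = [True, True, True, False]
toy_test = [True, False, True, True, True]
print(majority_baseline(toy_train, toy_test))  # 4 of 5 test labels are True -> 0.8
```

The same `accuracy` helper could be applied to `predictionsTest` and `y_test` to report held-out accuracy alongside the training figure.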


# Task 3: Regression

In this section you will start by working through two examples of altering features to further differentiate. Then you will work through how to evaluate a regularized model.

In [26]:
#CHANGE PATH
path = 'amazon_reviews_us_Musical_Instruments_v1_00.tsv'

#GIVEN
f = open(path, 'rt', encoding="utf8")
header = f.readline()
header = header.strip().split('\t')
reg_dataset = []
for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    reg_dataset.append(d)

### Unique Words in a Sample Set

We are going to work with a new dataset here, so we will take a smaller portion of it and call it the Sample Set, because stemming on the full training set would take a very long time. (Feel free to change sampleSet to reg_dataset if you would like to see the difference for yourself.)

1. Count the number of unique words found within the 'review body' portion of the sample set defined below, making sure to __Ignore Punctuation and Capitalization__.
2. Count the number of unique words found within the 'review body' portion of the sample set defined below, this time with use of __Stemming,__ __Ignoring Punctuation,__ ___and___ __Capitalization__.

In [29]:
#GIVEN for 1.
wordCount = defaultdict(int)
punctuation = set(string.punctuation)

#GIVEN for 2.
wordCountStem = defaultdict(int)
stemmer = PorterStemmer() #use stemmer.stem(stuff)

#SampleSet and y vector given
sampleSet = reg_dataset[:2*len(reg_dataset)//10]
y_reg = [d['star_rating'] for d in sampleSet]

In [30]:
#YOUR CODE HERE
for d in sampleSet:
    r = ''.join([c for c in d['review_body'].lower() if not c in punctuation])
    for w in r.split():
        wordCount[w] += 1

print(len(wordCount))

for d in sampleSet:
    r1 = ''.join([c for c in d['review_body'].lower() if not c in punctuation])
    for w1 in r1.split():
        w1 = stemmer.stem(w1) # with stemming
        wordCountStem[w1] += 1

print(len(wordCountStem))

101381
83875
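
To see what ignoring punctuation and capitalization does to a single review, here is a small standalone sketch that uses only the standard library. The function name `tokenize` is illustrative, and stemming is left out so the example runs without NLTK.

```python
import string

punctuation = set(string.punctuation)

def tokenize(text):
    """Lowercase the text, drop punctuation characters, and split on whitespace."""
    cleaned = ''.join(c for c in text.lower() if c not in punctuation)
    return cleaned.split()

tokens = tokenize('Works very good, but induces ALOT of noise.')
print(tokens)
# ['works', 'very', 'good', 'but', 'induces', 'alot', 'of', 'noise']
```

Applying a stemmer to each token afterwards merges words that share a stem, which is why the second count above is smaller than the first.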


### Evaluating Classifiers

1. Given the feature function and your counts vector, __Define__ your X_reg vector. (This being the X vector, simply labeled for the Regression model)
2. __Fit__ your model using a __Ridge Model__ with (alpha = 1.0, fit_intercept = True).
3. Using your model, __Make your Predictions__.
4. Find the __MSE__ between your predictions and your y_reg vector.

In [32]:
#GIVEN FUNCTIONS
def feature_reg(datum):
    feat = [0]*len(words)
    r = ''.join([c for c in datum['review_body'].lower() if not c in punctuation])
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    return feat

def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)

#GIVEN COUNTS AND SETS
counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()

#Note: increasing the size of the dictionary may require a lot of memory
words = [x[1] for x in counts[:100]]

wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

In [78]:
#YOUR CODE HERE
x_reg = [feature_reg(d) for d in sampleSet]
model = linear_model.Ridge(alpha=1.0, fit_intercept=True)
model.fit(x_reg, y_reg)
predictions = model.predict(x_reg)
print('Mean squared error of the model is',MSE(predictions,y_reg))

Mean squared error of the model is 1.2041184392177315
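
As with accuracy, an MSE is more informative next to a baseline. A predictor that always outputs the mean label has an MSE equal to the variance of the labels, so a useful model should score below that value. The sketch below uses invented ratings (in the notebook, `y_reg` would take their place), and `mean_baseline_mse` is a hypothetical name.

```python
def mean_baseline_mse(labels):
    """MSE of always predicting the mean label, i.e. the variance of the labels."""
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels) / len(labels)

toy_ratings = [5, 5, 4, 1, 5]
# mean is 4; squared errors are 1, 1, 0, 9, 1, which sum to 12
print(mean_baseline_mse(toy_ratings))  # 2.4
```

Note that the MSE reported above is computed on the same sample the model was fitted to, so it is a training error rather than an estimate of performance on unseen reviews.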


# Task 4: Recommendation Systems

For your final task, you will use your knowledge of simple similarity-based recommender systems to calculate the most similar items.

The next cell contains some starter code that you will need for your tasks in this section.
Notice you should be back to using your __trainingSet__.

In [62]:
#GIVEN
reviewsPerUser = defaultdict(set)
reviewsPerItem = defaultdict(set)

###  Fill the Dictionaries

1. For each entry in your training set, fill your default dictionaries (defined above). 

In [63]:
#YOUR CODE HERE
itemNames={}
for d in trainingSet:
    user,item = d['customer_id'], d['product_id']
    reviewsPerUser[user].add(item)
    reviewsPerItem[item].add(user)
    itemNames[item] = d['product_title']
    


In [68]:
#GIVEN
def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    return numer / denom

def mostSimilar(iD, n):
    similarities = []
    users = reviewsPerItem[iD]
    for i2 in reviewsPerItem:
        if i2 == iD: continue
        sim = Jaccard(users, reviewsPerItem[i2])
        similarities.append((sim,i2))
    similarities.sort(reverse=True)
    return similarities[:n]
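
For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. Below is a toy check with invented user IDs. The lowercase `jaccard` is a separate illustrative function so that it does not overwrite the given `Jaccard`, and its guard for an empty union is an addition that the given code does not have.

```python
def jaccard(s1, s2):
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / len(union)

users_a = {'u1', 'u2', 'u3'}
users_b = {'u2', 'u3', 'u4', 'u5'}
print(jaccard(users_a, users_b))  # 2 shared users out of 5 distinct -> 0.4
```

Because `mostSimilar` compares the query item against every other item, its running time grows linearly with the number of items in `reviewsPerItem`.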

### Find the Most Similar Items

1. Calculate the __10__ most similar entries to the __first__ entry in your dataset, using the functions defined above.

In [69]:
#YOUR CODE HERE
query=trainingSet[0]['product_id']
itemNames[query]
mostSimilar(query,10)


[(0.024193548387096774, 'B000Y60NFC'),
 (0.02158273381294964, 'B007L0CH8K'),
 (0.019417475728155338, 'B0002GYX5K'),
 (0.018018018018018018, 'B00DT59450'),
 (0.017241379310344827, 'B0002GO8QY'),
 (0.017094017094017096, 'B00GZ5FCVG'),
 (0.017094017094017096, 'B0052745WK'),
 (0.01694915254237288, 'B001G43G96'),
 (0.01694915254237288, 'B000KITQKM'),
 (0.016666666666666666, 'B001DL6W0W')]

In [82]:
similarprods=[]
for x,y in mostSimilar(query,10):
    similarprods.append(itemNames[y])
print('10 most similar entries to the first entry in the dataset are: \n',similarprods)

10 most similar entries to the first entry in the dataset are: 
 ['Pedaltrain Pro With Soft Case', 'Joyo JF-14 American Sound Effects Pedal Amplifier Simulation with Voice Control', 'Ernie Ball Nickel Plain Single Guitar String .010 6-Pack', 'ISP Technologies Decimator II Noise Reduction Pedal - (New)', 'SKB SKB-FS6 Molded Electric Guitar Case', 'Electro-Harmonix Nano Big Muff Guitar Distortion Effects Pedal', 'Snark SA-2 5 Pedal Daisy Chain', 'K & M Microphone Bar', 'BEHRINGER PREAMP/BOOSTER PB100', 'BBE Supa Charger 8 Output High Performance Power Supply']


## Finished!

Congratulations! You are now ready to submit your work. Once you have submitted make sure to get started on your peer reviews!