# Project Submission

This notebook will be your project submission. All tasks will be listed in the order of the Courses that they appear in. The tasks will be the same as in the Capstone Example Notebook, but in this submission you ___MUST___ use another dataset. Failure to do so will result in a large penalty to your grade in this course.

## Finding your dataset

Take some time to find an interesting dataset! There is a reading that discusses various places where datasets can be found, but any dataset you are able to process is fine to use. Note that some tasks in this project need each entry to have 3+ attributes, so keep that in mind when choosing your dataset. After you have found your dataset, the tasks continue as in the Example Notebook. You will be graded based on the tasks and your results. Best of luck!

### As Reviewer: 
Your job will be to verify the calculations made at each "TODO" labeled throughout the notebook.

### First Step: Imports

In the next cell we will give you all of the imports you should need to do your project. Feel free to add more if you would like, but these should be sufficient.

In [47]:
import gzip
from collections import defaultdict
import random
import numpy
import scipy.optimize
import string
import pandas as pd
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,classification_report
from sklearn.metrics import mean_squared_error
from math import sqrt
from nltk.stem.porter import PorterStemmer # Stemming

# Task 1: Data Processing

### TODO 1: Read the data and Fill your dataset

In [29]:
# The dataset used here is the Amazon US Reviews Gift Card Category acquired from 
# (https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz)

file = 'amazon_reviews_us_Gift_Card_v1_00.tsv'
f = open(file, 'rt')
header = f.readline().strip().split('\t')

dataset = []

for line in f:
    line = line.split('\t') # splitting by tab
    d = dict(zip(header, line)) # creating key-value pairs for each tuple    
    # converting data types to integer
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    d['verified_purchase'] = d['verified_purchase'] == 'Y' 
    dataset.append(d)

In [30]:
dataset[0]

{'marketplace': 'US',
 'customer_id': '24371595',
 'review_id': 'R27ZP1F1CD0C3Y',
 'product_id': 'B004LLIL5A',
 'product_parent': '346014806',
 'product_title': 'Amazon eGift Card - Celebrate',
 'product_category': 'Gift Card',
 'star_rating': 5,
 'helpful_votes': 0,
 'total_votes': 0,
 'vine': 'N',
 'verified_purchase': True,
 'review_headline': 'Five Stars',
 'review_body': 'Great birthday gift for a young adult.',
 'review_date': '2015-08-31\n'}

### TODO 2: Split the data into a Training and Testing set

First shuffle your data, then split your data. Have Training be the first 80%, and testing be the remaining 20%. 

In [31]:
N = len(dataset)

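# establishing an 80/20 split for train/test
# NOTE: the records are split in file order; no shuffle is applied in this cell,
# even though the task above asks for one first
train_set = dataset[:N*8//10]
test_set = dataset[N*8//10:]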
print(len(train_set), len(test_set))

119268 29818
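The task text asks for a shuffle before the split. A minimal sketch of that step is shown here on a toy list; the helper name `shuffle_split`, its `seed` parameter, and the toy data are illustrative assumptions and not part of the course starter code. Applying it to the real data would change every downstream number in this notebook.

```python
import random

def shuffle_split(records, train_fraction=0.8, seed=0):
    """Return (train, test) after shuffling a copy of `records`."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in for the review records (hypothetical data)
toy = list(range(10))
train_part, test_part = shuffle_split(toy)
print(len(train_part), len(test_part))
```

Seeding a private `random.Random` instance keeps the split repeatable without touching the global random state.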


#### Now delete your dataset
You don't want any of your answers to come from your original dataset any longer, but rather from your Training Set. This will help you avoid mistakes later on, especially when referencing the checkpoint solutions.

In [32]:
del dataset

### TODO 3: Extracting Basic Statistics

Next you need to answer some questions through any means (i.e. write a function or just find the answer) all based on the __Training Set:__
1. How many entries are in your dataset?
2. Pick a non-trivial attribute (i.e. verified purchases in the example). What percentage of your data has this attribute?
3. Pick a different non-trivial attribute. What percentage of your data shares both attributes?

In [33]:
# manipulating our data into a pandas DataFrame
train_df = pd.DataFrame(train_set)
print('Number of entries in our training set -> ', (train_df.shape[0]))

# Calculating percentage of verified purchases from total purchases
partTwo = (train_df[train_df['verified_purchase'] == True].shape[0] / train_df.shape[0]) * 100
print('Percentage value of verified purchases -> ', partTwo)

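# Percentage of reviews with more than 2 total votes
# (this filter looks at total_votes only; it does not also require verified_purchase)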
partThree = (train_df[train_df['total_votes']>2].shape[0] / train_df.shape[0]) * 100
print('Percentage value of total votes -> ', partThree)

Number of entries in our training set ->  119268
Percentage value of verified purchases ->  96.52631049401347
Percentage value of total votes ->  0.8233558037361237
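Question 3 asks for the share of entries that have both attributes at once. One possible way to compute that is sketched below on a handful of made-up records; the helper `percent` and the toy reviews are assumptions for illustration, and the threshold of more than 2 total votes is borrowed from the cell above.

```python
def percent(records, predicate):
    """Percentage of records for which predicate(record) is true."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for r in records if predicate(r)) / len(records)

# Hypothetical records with the two attributes used in this notebook
toy_reviews = [
    {'verified_purchase': True,  'total_votes': 5},
    {'verified_purchase': True,  'total_votes': 0},
    {'verified_purchase': False, 'total_votes': 3},
    {'verified_purchase': True,  'total_votes': 4},
]

# Both conditions must hold for the same record
both = percent(toy_reviews,
               lambda r: r['verified_purchase'] and r['total_votes'] > 2)
print(both)  # 50.0
```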


# Task 2: Classification

Next you will use your knowledge of classification to extract features and make predictions based on them. Here you will be using a Logistic Regression model; keep this in mind so you know where to look for help.

### TODO 1: Define the feature function

This implementation will be based on ___any two___ attributes from your dataset. You will be using these two attributes to predict a third. Hint: Remember the offset!

In [37]:
def feature(d):
    feat = [1, d['total_votes'], d['helpful_votes']]
    return feat

### TODO 2: Fit your model

1. Create your __Feature Vector__ based on your feature function defined above. 
2. Create your __Label Vector__ based on the "verified purchase" column of your training set.
3. Define your model as a __Logistic Regression__ model.
4. Fit your model.

In [38]:
X_train = [feature(d) for d in train_set]
X_test = [feature(d) for d in test_set]
y_train = [d['verified_purchase'] for d in train_set]
y_test = [d['verified_purchase'] for d in test_set]

# fit your model
model = linear_model.LogisticRegression()
model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

### TODO 3: Compute Accuracy of Your Model

1. Make __Predictions__ based on your model.
2. Compute the __Accuracy__ of your model.

In [42]:
prediction = model.predict(X_test)
# classification_report lists per-class precision/recall and the overall accuracy
report = classification_report(y_true = y_test, y_pred = prediction)
print(report)

              precision    recall  f1-score   support

       False       0.49      0.01      0.01      8901
        True       0.70      1.00      0.82     20917

    accuracy                           0.70     29818
   macro avg       0.59      0.50      0.42     29818
weighted avg       0.64      0.70      0.58     29818
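To put the 0.70 accuracy in context, it can help to compare it with a majority-class baseline. The sketch below rebuilds only the label counts from the support column of the report (8901 False, 20917 True) and scores a predictor that always answers True; the `accuracy` helper is an illustrative stand-in, not the scikit-learn function.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Label counts taken from the support column of the report above
y_true = [False] * 8901 + [True] * 20917
always_true = [True] * len(y_true)   # baseline: always predict the majority class

baseline = accuracy(y_true, always_true)
print(round(baseline, 4))  # 0.7015
```

A baseline this close to the reported accuracy, together with the 0.01 recall on the False class, suggests the model is predicting True for almost every review.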



# Task 3: Regression

In this section you will start by working through two examples of altering features to further differentiate entries. Then you will work through how to evaluate a Regularized model.

In [43]:
path = "amazon_reviews_us_Gift_Card_v1_00.tsv"

f = open(path, 'rt', encoding="utf8")
header = f.readline()
header = header.strip().split('\t')
reg_dataset = []
for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    reg_dataset.append(d)

### TODO 1: Unique Words in a Sample Set

We are going to work with a new dataset here, so we will take a smaller portion of the set and call it a Sample Set. This is because stemming on the normal training set will take a very long time. (Feel free to change sampleSet -> reg_dataset if you would like to see the difference for yourself.)

1. Count the number of unique words found within the 'review body' portion of the sample set defined below, making sure to __Ignore Punctuation and Capitalization__.
2. Count the number of unique words found within the 'review body' portion of the sample set defined below, this time with use of __Stemming,__ __Ignoring Punctuation,__ ___and___ __Capitalization__.

In [44]:
wordCount = defaultdict(int)
punctuation = set(string.punctuation)

wordCountStem = defaultdict(int)
stemmer = PorterStemmer() #use stemmer.stem(stuff)

#SampleSet and y vector given
sampleSet = reg_dataset[:2*len(reg_dataset)//10]
y_reg = [d['star_rating'] for d in sampleSet]

In [45]:
for d in sampleSet:
    # lowercase the review and drop punctuation characters
    r = "".join([c for c in d["review_body"].lower() if c not in punctuation])
    for w in r.split():
        wordCount[w] += 1                    # raw token
        wordCountStem[stemmer.stem(w)] += 1  # stemmed token

print('No Stemming:', len(wordCount))
print('with Stemming:', len(wordCountStem))

No Stemming: 10765
with Stemming: 8389
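The punctuation and capitalization handling can be illustrated on two short made-up review bodies. This sketch uses `str.translate` as an alternative to the character-by-character join above and leaves stemming out so that it runs with the standard library alone; `tokenize` and `toy_bodies` are hypothetical names.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character
strip_punct = str.maketrans('', '', string.punctuation)

def tokenize(text):
    """Lowercase, remove punctuation, split on whitespace."""
    return text.lower().translate(strip_punct).split()

toy_bodies = ["Great gift!", "great GIFT, great card."]
counts = Counter(w for body in toy_bodies for w in tokenize(body))
print(len(counts), counts['great'])  # 3 3
```

Without the lowercasing and punctuation removal, "Great", "great" and "GIFT," would each be counted as separate words.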


### TODO 2: Evaluating Classifiers

1. Given the feature function and your counts vector, __Define__ your X_reg vector. (This being the X vector, simply labeled for the Regression model)
2. __Fit__ your model using a __Ridge Model__ with (alpha = 1.0, fit_intercept = True).
3. Using your model, __Make your Predictions__.
4. Find the __MSE__ between your predictions and your y_reg vector.

In [46]:
def feature_reg(datum):
    # bag-of-words counts over the most common words (words, wordId, wordSet are defined below)
    feat = [0]*len(words)
    r = ''.join([c for c in datum['review_body'].lower() if c not in punctuation])
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    return feat

def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)

counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()

#Note: increasing the size of the dictionary may require a lot of memory
words = [x[1] for x in counts[:1000]]

wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

In [48]:
X = [feature_reg(d) for d in sampleSet]
y = [d["star_rating"] for d in sampleSet]
model = linear_model.Ridge(alpha=1.0, fit_intercept=True) # alpha is the regularization strength (lambda)
model.fit(X,y)
predictions = model.predict(X) # predictions on the same sample the model was fit on

print ('MSE:', mean_squared_error(y, predictions))

MSE: 0.3648830231415904
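The MSE above is measured on the same sample the model was fit on, so a simple reference point is useful when reading it. A common one is the MSE of a predictor that always outputs the mean rating, sketched here on made-up star ratings; the names `mse` and `toy_ratings` are illustrative assumptions.

```python
def mse(predictions, labels):
    """Mean squared error between two equal-length sequences."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

toy_ratings = [5, 5, 4, 1, 5]                      # hypothetical star ratings
mean_rating = sum(toy_ratings) / len(toy_ratings)  # 4.0

# Baseline: predict the mean rating for every review
baseline_mse = mse([mean_rating] * len(toy_ratings), toy_ratings)
print(baseline_mse)  # 2.4
```

The mean-predictor MSE equals the variance of the labels, so a fitted model is only useful to the extent that its MSE falls below that value.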


# Task 4: Recommendation Systems

For your final task, you will use your knowledge of simple similarity-based recommender systems to calculate the most similar items.

The next cell contains some starter code that you will need for your tasks in this section.
Notice you should be back to using your __trainingSet__.

In [49]:
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)

### TODO 1: Fill your Dictionaries

1. For each entry in your training set, fill your default dictionaries (defined above). 

In [54]:
itemNames = {}

for d in train_set:
    user,item = d['customer_id'], d['product_id']
    usersPerItem[item].add(user)
    itemsPerUser[user].add(item)
    itemNames[item] = d['product_title']

def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    return numer / denom

def mostSimilar(identifier, n):
    similarities = []
    identifier_list = []
    users = usersPerItem[identifier]
    for i2 in usersPerItem:
        if i2 == identifier: continue
        sim = Jaccard(users, usersPerItem[i2])
        similarities.append((sim,i2))
    similarities.sort(reverse=True)
    for i in similarities:
        identifier_list.append(i[1])
    print(identifier_list[:n])
    return similarities[:n]
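As a small worked example of the similarity used above: the Jaccard similarity of two user sets is the size of their intersection divided by the size of their union. The sketch below applies it to three made-up items; the item and user names are hypothetical, and the guard for an empty union is an addition that the notebook's `Jaccard` does not have.

```python
def jaccard(s1, s2):
    """|s1 ∩ s2| / |s1 ∪ s2|, defined here as 0.0 when both sets are empty."""
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / len(union)

# Hypothetical item -> set of users who reviewed it
toy_users_per_item = {
    'itemA': {'u1', 'u2', 'u3'},
    'itemB': {'u2', 'u3', 'u4'},
    'itemC': {'u5'},
}

def most_similar(item, n):
    """Rank every other item by Jaccard similarity of its user set."""
    users = toy_users_per_item[item]
    sims = [(jaccard(users, other_users), other)
            for other, other_users in toy_users_per_item.items()
            if other != item]
    sims.sort(reverse=True)
    return sims[:n]

ranked = most_similar('itemA', 2)
print(ranked)  # [(0.5, 'itemB'), (0.0, 'itemC')]
```

Here itemA and itemB share two of four distinct users, giving 2 / 4 = 0.5, while itemC shares none.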

### TODO 2: Find the Most Similar Entries

1. Calculate the __10__ most similar entries to the __first__ entry in your dataset, using the functions defined above.

In [55]:
# NOTE: index 1 selects the second entry of train_set; train_set[0] would be the first
query = train_set[1]['product_id']

print("Product ID: ", query, "\nProduct title:", itemNames[query])

Product ID:  B004LLIKVU 
Product title: Amazon.com eGift Cards


In [56]:
mostSimilar(query, 10)

['B00IX1I3G6', 'B004KNWWO0', 'B007V6EVY2', 'BT00DDC7CE', 'B004KNWX3U', 'B00A44A3Y0', 'B00G4IWEZG', 'B004W8D102', 'BT00CTOY20', 'BT00CTOYC0']


[(0.01746373781880166, 'B00IX1I3G6'),
 (0.006271339976308271, 'B004KNWWO0'),
 (0.0007555204505649232, 'B007V6EVY2'),
 (0.0005087394164032123, 'BT00DDC7CE'),
 (0.0003592083048960092, 'B004KNWX3U'),
 (0.00031968173906866054, 'B00A44A3Y0'),
 (0.0003191602539097131, 'B00G4IWEZG'),
 (0.000291587694999271, 'B004W8D102'),
 (0.0002904443799012489, 'BT00CTOY20'),
 (0.0002547492539486134, 'BT00CTOYC0')]

## Finished!

Congratulations! You are now ready to submit your work. Once you have submitted, make sure to get started on your peer reviews!