# HackOn(Data)
## Competition Challenge

Welcome to our project. We have decided to do a simple forecast of the salesRank. More precisely:

**The problem:** Given the first set of reviews of a product, forecast its salesRank.

**Hypothesis:** The reviews are an indicator of how well a product is selling and contain enough information for a forecast.

**Assumptions:**

- We ignore any seasonality in the data.
- By product we mean the combination of product and vendor, which is represented by the asin.
- What else?


# Part 0 - Loading the tools

We load the tools we will be using during this project.

In [1]:
import pandas as pd
import numpy as np
import fasttext
import gzip

from sklearn.preprocessing import MultiLabelBinarizer #To be used with the categories feature

%matplotlib inline  

# Part 1 - Cleaning and preparing the data

In this section we load the data, clean it, and prepare it for feature building. The data consists of:

### **Review Data**

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

### **Q&A Data**

This dataset contains Questions and Answers data from Amazon, totaling around 1.4 million answered questions.

This dataset can be combined with Amazon product review data, by matching ASINs in the Q/A dataset with ASINs in the review data. The review data also includes product metadata (product titles etc.).

### **Credits:**

- **R. He, J. McAuley**. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016.
- **J. McAuley, C. Targett, J. Shi, A. van den Hengel**. Image-based recommendations on styles and substitutes. SIGIR, 2015.


We start with the scripts provided at [Julian McAuley's website](http://jmcauley.ucsd.edu/data/amazon/).

In [2]:
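def parse(path):
    # Each line of the gzipped file holds one record written as a dict literal.
    # eval executes whatever the line contains, so only use it on trusted files.
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield eval(l)

def getDF(path):
    # Collect the records in a dict keyed by row number, then build the DataFrame.
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')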

We use these functions to load the data.

In [3]:
df_reviews = getDF('./data/reviews_Video_Games_5.json.gz')
df_meta=getDF('./data/meta_Video_Games.json.gz')
df_qa=getDF('./data/qa_Video_Games.json.gz')
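
Because the provided `parse` function calls `eval` on every line, it will run any code that happens to be in the file. The following is a minimal sketch of a stricter reader, assuming each line is a plain Python dict literal; the name `parse_safe` and the toy file are hypothetical and not part of the provided scripts. Lines written as strict JSON (with `true`, `false`, or `null`) would need `json.loads` instead.

```python
import ast
import gzip
import os
import tempfile

def parse_safe(path):
    """Yield one dict per line, evaluating only Python literals."""
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            line = line.strip()
            if line:  # skip blank lines
                yield ast.literal_eval(line)

# Tiny self-made file standing in for the real data (contents are invented).
tmp_path = os.path.join(tempfile.mkdtemp(), 'toy.json.gz')
with gzip.open(tmp_path, 'wt', encoding='utf-8') as f:
    f.write("{'asin': 'A1', 'salesRank': {'Video Games': 10}}\n")
    f.write("{'asin': 'A2', 'price': 9.99}\n")

records = list(parse_safe(tmp_path))
print(records[0]['salesRank'])
```

`ast.literal_eval` accepts only literals (dicts, lists, strings, numbers, and so on), so a malformed or malicious line raises an error instead of being executed.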

We now study the different features.


## Features from the meta file

### SalesRank

Note that the dataframe containing the salesRank is df_meta. There are two cases in which the stored value is not helpful. The first is a NaN value, which we can detect with the isnull method. The second is a dictionary that does not contain the relevant key. We use a helper function to handle both cases.

In [4]:
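RELEVANT_KEY='Video Games'
def helper(dictionary):
    # Return the rank stored under RELEVANT_KEY, or NaN when it is unavailable.
    try:
        return dictionary[RELEVANT_KEY]
    except (KeyError, TypeError):
        # KeyError: the dict lacks the key; TypeError: the value is NaN, not a dict.
        return float('NaN')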

In [5]:
df_meta['salesRank']=df_meta.salesRank.map(lambda x: helper(x))
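
To make the two unhelpful cases concrete, here is a small sketch on invented values. The function `extract_rank` is a hypothetical variant of the helper that checks the type explicitly instead of catching exceptions; the toy ranks are made up.

```python
import math

RELEVANT_KEY = 'Video Games'

def extract_rank(value, key=RELEVANT_KEY):
    """Return value[key] if value is a dict holding key, else NaN."""
    if isinstance(value, dict) and key in value:
        return value[key]
    return float('nan')

# One usable entry, one dict for another category, one missing value.
toy = [{'Video Games': 120}, {'Toys & Games': 7}, float('nan')]
ranks = [extract_rank(v) for v in toy]
print(ranks)
print([math.isnan(r) for r in ranks])
```

Both unusable entries end up as NaN, so a single `isnull` filter afterwards removes them together.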

We can now count how many entries don't have the desired rank data.

In [6]:
df_meta.salesRank[(df_meta.salesRank.isnull())].shape[0]

5675

and proceed to remove these rows 

In [7]:
df_meta=df_meta[ df_meta.salesRank.notnull()]

We can do a sanity check 

In [8]:
df_meta.salesRank[(df_meta.salesRank.isnull())].shape[0]

0

### imUrl

This feature contains the web address of an image. As we won't be using the images, we drop this column.

In [9]:
df_meta=df_meta.drop('imUrl',axis=1)

### Title and brand

Consider the percentage of products that have a title or a brand:

In [10]:
print("Only %.2f%% of the products have title."%(df_meta.title.count()/df_meta.asin.count()*100))
print("Only %.2f%% of the products have brand."%(df_meta.brand.count()/df_meta.asin.count()*100))

Only 0.22% of the products have title.
Only 0.11% of the products have brand.


Since almost no products have a title or a brand, these features cannot be relevant, and so we drop them as well.

In [11]:
df_meta=df_meta.drop(['title','brand'],axis=1)
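
The same coverage check can be run on every column at once. The sketch below uses an invented toy frame in place of df_meta, and the 50% threshold is a hypothetical cut-off chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_meta (values are invented).
toy_meta = pd.DataFrame({
    'asin': ['A1', 'A2', 'A3', 'A4'],
    'title': ['Some game', np.nan, np.nan, np.nan],
    'price': [9.99, 19.99, np.nan, 4.5],
})

# notnull() gives a boolean frame; its column means are the non-missing fractions.
coverage = toy_meta.notnull().mean() * 100
print(coverage.round(2))

# Columns below a chosen threshold could then be dropped in one call.
THRESHOLD = 50.0  # hypothetical cut-off, in percent
sparse_cols = coverage[coverage < THRESHOLD].index.tolist()
print(sparse_cols)
```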

### Price

The price feature is definitely relevant for the forecasting. Note that most products have one:

In [12]:
print("%.2f%% of the products have a price."%(df_meta.price.count()*100/df_meta.asin.count()))

88.25% of the products have a price.


As this is a relevant feature, we remove the rows where it is missing.

In [13]:
df_meta=df_meta[df_meta.price.notnull()]

Another sanity check.

In [14]:
print("%.2f%% of the products have a price."%(df_meta.price.count()*100/df_meta.asin.count()))

100.00% of the products have a price.


After this cleaning 

In [15]:
print("There are %d products left"%df_meta.asin.count())

There are 39959 products left


### Categories

The categories feature comes as a list of lists. Let's first find out how many different categories there are.

In [16]:
categories=set()
for list_cats in df_meta.categories:
    for list_cat in list_cats:
        categories=categories.union(set(list_cat))
categories=list(categories)
print("There are %d categories"%len(categories))

There are 334 categories


We use these as a categorical variable. First, we create a unique list of categories for each product:

In [17]:
def helper2(list_cats):
    cats=set()
    for list_cat in list_cats:
        cats=cats.union(set(list_cat))
    return list(cats)

In [18]:
df_meta['categories']=df_meta.categories.apply(helper2)
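
With each product now holding a flat list of categories, the plan from Part 2 (keep the 20 most frequent categories and move everything else to 'Other') could be sketched as follows. The function `collapse_rare` and the toy lists are hypothetical, and `top_k=2` is used only to keep the example small.

```python
from collections import Counter

def collapse_rare(cat_lists, top_k=20, other='Other'):
    """Keep the top_k most frequent categories; map the rest to `other`."""
    # Count each category once per product.
    counts = Counter(c for cats in cat_lists for c in set(cats))
    keep = {c for c, _ in counts.most_common(top_k)}
    collapsed = []
    for cats in cat_lists:
        labels = {c if c in keep else other for c in cats}
        collapsed.append(sorted(labels))
    return collapsed

toy = [['Games', 'PC'], ['Games', 'Xbox'], ['Games', 'PC', 'Mac']]
print(collapse_rare(toy, top_k=2))
```

Applied before the encoding step, this would reduce the number of indicator columns from one per category to at most `top_k + 1`.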

We then use the [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) from sklearn to do the encoding.

In [19]:
mlb=MultiLabelBinarizer()
mlb.fit(df_meta.categories)
categories_df=pd.DataFrame(mlb.transform(df_meta.categories),columns=list(mlb.classes_))
df_meta.reset_index(drop=True,inplace=True)
df_meta=df_meta.merge(categories_df,left_index =True,right_index =True)

and finally we drop the categories column.

In [20]:
df_meta=df_meta.drop('categories',axis=1)
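
The encoding step depends on the two frames sharing the same index, which is why the index is reset before merging. The sketch below shows the same idea on an invented toy frame, this time giving the encoded frame the original index instead of resetting it; the frame and its values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame with a non-contiguous index, as left behind by row filtering.
toy = pd.DataFrame(
    {'asin': ['A1', 'A2', 'A3'],
     'categories': [['Games', 'PC'], ['Games'], ['Accessories', 'PC']]},
    index=[0, 5, 9],
)

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(toy['categories']),
    columns=mlb.classes_,
    index=toy.index,  # reuse the original index so rows stay aligned
)
toy_encoded = toy.drop('categories', axis=1).join(encoded)
print(toy_encoded)
```

Each category becomes a 0/1 column, and every product keeps its original row label.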

### related and description

(ARE WE USING THIS?, RELATED MAY GIVE NICE INFO, BUT DESCRIPTION SEEMS USELESS)

We won't be using these either, so we remove them.

In [21]:
df_meta=df_meta.drop(['related','description'],axis=1)

## Features from the reviews file

### reviewText

This feature containing the review will be used for sentiment analysis. We come back to this when building features. For now, we just get rid of the data with empty reviews.

In [22]:
print("There are %d empty reviews."%df_reviews[df_reviews.reviewText==""].shape[0])
df_reviews=df_reviews[df_reviews.reviewText!=""]

There are 44 empty reviews.


### summary

We won't be using this feature, so we drop it

In [23]:
df_reviews=df_reviews.drop('summary',axis=1)

### reviewTime

This feature is redundant since we have unixReviewTime, so we also get rid of this

In [24]:
df_reviews=df_reviews.drop('reviewTime',axis=1)

### reviewerName

This is also redundant since we have the reviewerID, so we drop it.

In [25]:
df_reviews=df_reviews.drop('reviewerName',axis=1)

# Part 2 - Feature engineering

In this part we build our features. This will be done in the following order:


- Use sentiment analysis to give a sentiment score to the reviews. (Reason: a score of 3 may mean different things to two different people; the sentiment will help correct for this.)

- ReviewerID features: will help us gain confidence in the review.

- helpful: Increases the confidence.

- ReviewTime: To create a time series for the (cumulative) number of reviews.

- Categories: Create category 'Other' and move everything that is not in the top 20 categories to the 'Other' category.
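
Two of the features listed above, the helpfulness ratio and the cumulative number of reviews, can be sketched on toy data. This is only an illustration under stated assumptions: the values are invented, the column names `helpful_ratio` and `n_reviews_so_far` are hypothetical, and `helpful` is assumed to hold a pair `[helpful_votes, total_votes]`.

```python
import pandas as pd

# Toy reviews (invented values). Assumes `helpful` holds
# [helpful_votes, total_votes] for each review.
toy_reviews = pd.DataFrame({
    'asin': ['A1', 'A1', 'A2', 'A1'],
    'unixReviewTime': [300, 100, 150, 200],
    'helpful': [[2, 4], [0, 0], [3, 3], [1, 2]],
})

def helpful_ratio(votes):
    """Fraction of votes marked helpful; 0.0 when there are no votes."""
    helpful_votes, total_votes = votes
    return helpful_votes / total_votes if total_votes else 0.0

toy_reviews['helpful_ratio'] = toy_reviews['helpful'].apply(helpful_ratio)

# Sort by time, then number the reviews within each product: the running
# count is the cumulative number of reviews at each review's timestamp.
toy_reviews = toy_reviews.sort_values(['asin', 'unixReviewTime'])
toy_reviews['n_reviews_so_far'] = toy_reviews.groupby('asin').cumcount() + 1
print(toy_reviews)
```

Treating a review without votes as ratio 0.0 is one possible choice; leaving it as NaN would keep "no votes" distinct from "all votes unhelpful".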


### Sentiment Analysis on reviews

We use fastText from Facebook [CITE] to create a score. To do this we first save the data to a text file, since the fastText training method takes the path to a file as its input.

In [27]:
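# Select with a list (not a tuple) and copy, so the assignment below does not touch df_reviews.
df_sentiment=df_reviews.loc[:,['reviewText','overall']].copy()
df_sentiment['overall']=df_sentiment.overall.apply(lambda x: '__label__'+str(int(x)))
# mode='w' overwrites the file, so re-running the cell does not duplicate the training data.
df_sentiment.to_csv(r'./data_/data_for_sentiment.txt', header=None, index=None, sep=' ', mode='w')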
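
The cell above writes the training file with `to_csv`. With a space as separator, the CSV writer will typically wrap any review containing spaces in quotes, and reviews containing line breaks span several lines. A possible alternative, sketched here on invented reviews, is to build each line by hand with the label first; the function `to_fasttext_line` is hypothetical.

```python
def to_fasttext_line(text, rating, prefix='__label__'):
    """Format one review as '<prefix><rating> <text>' on a single line."""
    clean = ' '.join(str(text).split())  # collapse newlines and repeated spaces
    return '%s%d %s' % (prefix, int(rating), clean)

toy = [("Great game,\n would buy again", 5.0), ("Broke  after a day", 1.0)]
lines = [to_fasttext_line(text, rating) for text, rating in toy]
print('\n'.join(lines))
```

The resulting lines could then be written to the training file with a plain `open(...).write(...)` call.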

Now that the data is prepared for processing, we can train the classifier.

In [28]:
classifier = fasttext.supervised('./data_/data_for_sentiment.txt', 'model', label_prefix='__label__')

and do a simple sanity check

In [29]:
texts=["this is an awesome","this sucks"]
labels = classifier.predict(texts)
print(labels)

[['5'], ['1']]


and we can add this feature to the reviews dataframe

In [30]:
df_reviews['sentiment_score']=df_reviews.reviewText.apply(
    lambda x:classifier.predict([x])[0][0])
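
The predicted labels come back as strings such as `'5'`, so a numeric column may be easier to use in later modeling steps. A minimal sketch, with a hypothetical `label_to_score` function and invented predictions standing in for the classifier output:

```python
def label_to_score(label, prefix='__label__'):
    """Turn a predicted label such as '5' or '__label__5' into an int."""
    text = str(label)
    if text.startswith(prefix):
        text = text[len(prefix):]
    return int(text)

# Stand-in for classifier.predict output on two texts (values are invented).
toy_predictions = [['5'], ['__label__1']]
scores = [label_to_score(pred[0]) for pred in toy_predictions]
print(scores)
```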