# Data challenge LinkValue

**Author:** Gabriel Delgado (gabriel.delgado@alumni.polytechnique.org)
<br>
<br>
**Description:** The dataset contains the following information about customer reviews of products on PriceMinister: review ID, review content, review stars, product ID and target (usefulness of the review).
<br>
<br>
**Objective:** Predict whether a customer's review of a product is useful to other customers. Predictions are scored with the area under the ROC curve (AUC).
<br>
<br>
**Methodology:** This is a binary classification problem. For each review, we focus only on the review's content: the review's title gives little additional information, and the number of stars, as will be seen later, has little impact on the target value. To apply machine learning to the texts, we turn each review's content into a numerical feature vector using a bag-of-words representation weighted by tf-idf (term frequency times inverse document frequency). Three approaches are tested to select the features (words):
* Only the terms that appear in at least 1% of the reviews (rarer terms are ignored), in an attempt to limit overfitting;
* All the words extracted from the training dataset;
* The first k components of a latent semantic analysis (truncated SVD).

For each of these choices we train different prediction models (logistic regression, random forest, etc.) and we finally keep the one with the best performance.
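To make the first two feature choices concrete, here is a minimal sketch on a made-up four-review corpus (the corpus and the threshold `min_df=0.5` are illustrative assumptions, not the challenge data). It shows how `min_df` shrinks the vocabulary kept by `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
toy_reviews = [
    "tres bon produit",
    "bon produit conforme",
    "produit decevant",
    "livraison rapide bon vendeur",
]

# All extracted words (analogous to the second approach)
full_vectorizer = TfidfVectorizer()
X_full = full_vectorizer.fit_transform(toy_reviews)

# Only words present in at least half of the reviews (analogous to the first
# approach, with a much larger threshold because the toy corpus is tiny)
frequent_vectorizer = TfidfVectorizer(min_df=0.5)
X_frequent = frequent_vectorizer.fit_transform(toy_reviews)

full_vocab = sorted(full_vectorizer.vocabulary_)
frequent_vocab = sorted(frequent_vectorizer.vocabulary_)
print(len(full_vocab), frequent_vocab)
```

Only "bon" and "produit" appear in at least two of the four toy reviews, so they are the only features that survive the threshold.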

In [1]:
# Main libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Feature extraction
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Prediction models and selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Metrics and cross validation
from sklearn.metrics import roc_auc_score
from sklearn import metrics  

In [2]:
#path = "C:/WinPython-64bit-2.7.10.3/notebooks/Review_Challenge/"
path = "C:/Users/gabriel.delgado-keef/Dropbox/Professionnel/LinkValue/Review_Challenge/"

In [3]:
# Load files
df = pd.read_csv(path+"train.csv",sep=";")
df_test = pd.read_csv(path+"test.csv",sep=";")

In [4]:
# Checking what the data looks like
df.head(3)

| Unnamed: 0 | ID | review_content | review_title | review_stars | product | Target |
|---|---|---|---|---|---|---|
| 0 | 0 | En appelant un acheteur pour demander si l'écr... | La Police s'inscrit en acheteur privé sur Pric... | 5 | 2fbb619e3606f9b7c213e858a109cda771aa2c47ce50d5... | 0 |
| 1 | 1 | Alors, là, on a affaire au plus grand Navet ja... | Chef D'Oeuvre Absolu en vue... | 5 | 7b56d9d378d9e999d293f301ac43d044cd7b4786d09afb... | 1 |
| 2 | 2 | Effet garanti sur la terrase. Ils donnent immé... | Effet garanti sur la terrase. Ils donnent immé... | 3 | 7b37bf5dcb2fafd9229897910318a7dfa11a04ca36893c... | 0 |


In [5]:
# We remark that the content of some reviews is completely useless
print(df["review_content"].iloc[5224])

.........       ..........         .........         ..........       ...........        .............


## 1. Impact of the review stars classes over the target value

In [6]:
# Mean of Target within each review_stars class, i.e. the share of useful reviews per class
target_stars = df.pivot_table(values='Target', index=['review_stars'], aggfunc='mean')
print('\nProbability of a review being useful (Target = 1) for each review stars class:')
print(target_stars)


Probability of a review being useful (Target = 1) for each review stars class:
review_stars
1    0.533504
2    0.457665
3    0.479939
4    0.573273
5    0.565109
Name: Target, dtype: float64


Since the conditional probabilities all stay close to the same value (between 46% and 57%, i.e. around 50%), we conclude that the number of stars of a review has at most a weak effect on the target value, and we treat the two as approximately independent. Comment: if one of the two classes (0 or 1) had been heavily under-represented, we would have had to ask the client for more data or try to retrieve it elsewhere.
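The comparison of conditional means is informal. One possible way to make it quantitative (an addition to the original analysis, not something this notebook ran) is a chi-square test of independence on the stars/target contingency table, using `scipy.stats.chi2_contingency`. The helper name `stars_target_pvalue` and the two toy tables are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def stars_target_pvalue(frame):
    """Return the p-value of a chi-square independence test between stars and target."""
    table = pd.crosstab(frame["review_stars"], frame["Target"])
    chi2, p_value, dof, expected = chi2_contingency(table.values)
    return p_value

# Toy data where the target is perfectly balanced within every star class
independent = pd.DataFrame({
    "review_stars": [s for s in range(1, 6) for _ in range(10)],
    "Target": [0, 1] * 25,
})

# Toy data where the target is fully determined by the stars
dependent = pd.DataFrame({
    "review_stars": [1] * 20 + [5] * 20,
    "Target": [0] * 20 + [1] * 20,
})

p_independent = stars_target_pvalue(independent)
p_dependent = stars_target_pvalue(dependent)
print(p_independent, p_dependent)
```

On a large dataset, a small p-value would only indicate that some dependence exists. The size of the effect still has to be read from the conditional means themselves.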

## 2. Splitting the training dataset into "train" and "test" sets in order to select the best prediction model

In [8]:
# We set aside 20% of the data as test set 
Review_train, Review_test, Target_train, Target_test = train_test_split(df["review_content"], df["Target"],
                                                    test_size=0.2, random_state=123, stratify=df["Target"])
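
As a small illustration of what `stratify` does, the sketch below splits a made-up imbalanced label vector (an assumption for illustration, not the challenge data) and checks that both parts keep the same class proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 30% positives, 70% negatives
labels = np.array([1] * 30 + [0] * 70)
samples = np.arange(len(labels))

s_train, s_test, y_train, y_test = train_test_split(
    samples, labels, test_size=0.2, random_state=123, stratify=labels)

# With stratification, the share of positives is preserved in both parts
train_rate = y_train.mean()
test_rate = y_test.mean()
print(train_rate, test_rate)
```

Keeping the class proportions identical in both parts makes the AUC measured on the held-out 20% more comparable with the one measured on the training part.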

## 3. Extracting features from the training data using a sparse tf-idf vectorizer

**3.1 Ignoring terms that appear in fewer than 1% of the reviews**

In [9]:
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=0.01,max_df=0.5, strip_accents='unicode') 
X_train_tfidf_1 = vectorizer.fit_transform(Review_train.values)
# Extracting features from the test data using the same training vectorizer
X_test_tfidf_1 = vectorizer.transform(Review_test.values)

In [10]:
# On scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
print(vectorizer.get_feature_names())
print('\nNumber of features: %d' % len(vectorizer.get_feature_names()))

[u'10', u'absolument', u'achat', u'achete', u'acheter', u'adore', u'agreable', u'ai', u'aime', u'ainsi', u'aise', u'album', u'alors', u'amateurs', u'annees', u'ans', u'appareil', u'apres', u'article', u'assez', u'attention', u'au', u'aucun', u'aussi', u'autre', u'autres', u'aux', u'avant', u'avec', u'avis', u'avoir', u'beau', u'beaucoup', u'belle', u'belles', u'bien', u'bon', u'bonne', u'bref', u'ca', u'cadeau', u'car', u'carte', u'cartes', u'cd', u'ce', u'cela', u'celui', u'ces', u'cet', u'cette', u'ceux', u'chaque', u'cher', u'chez', u'ci', u'collection', u'collectionneurs', u'comme', u'conforme', u'conseille', u'contre', u'cote', u'couleur', u'couleurs', u'dans', u'decu', u'deja', u'depuis', u'des', u'design', u'deux', u'dire', u'disque', u'dommage', u'donc', u'donne', u'dont', u'du', u'dvd', u'ecran', u'effet', u'efficace', u'elle', u'elles', u'en', u'encore', u'enfants', u'enfin', u'entre', u'envoi', u'epoque', u'est', u'et', u'etait', u'etat', u'ete', u'etre', u'eu', u'excellent'

**3.2 Using the totality of extracted words**

In [11]:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, strip_accents='unicode') 
X_train_tfidf_T = vectorizer.fit_transform(Review_train.values)
X_test_tfidf_T = vectorizer.transform(Review_test.values)

In [12]:
print('\nNumber of features: %d' % len(vectorizer.get_feature_names()))


Number of features: 54995


**3.3 Selecting the first k principal components of a latent semantic analysis**

In [13]:
# In this case we take the first 100 principal components
svd = TruncatedSVD(n_components=100)
X_train_tfidf_svd = svd.fit_transform(X_train_tfidf_T)
X_test_tfidf_svd = svd.transform(X_test_tfidf_T)

In [14]:
# We would need at least 1000 components to explain 50% of the variance. We keep 100 only for illustration purposes. 
print('The explained variance of the SVD %f' % svd.explained_variance_ratio_.sum())

The explained variance of the SVD 0.197306
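
To look for the number of components needed to reach a given share of explained variance, one option is to fit a larger SVD once and read the cumulative sum of `explained_variance_ratio_`. The sketch below does this on a small random matrix. The helper name `components_for_variance`, the toy matrix and the 50% target are assumptions for illustration, not results on the review data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def components_for_variance(X, max_components, target=0.5, seed=0):
    """Return (k, cumulative): the smallest k whose cumulative explained variance
    reaches `target` (None if not reached), and the cumulative curve itself."""
    svd = TruncatedSVD(n_components=max_components, random_state=seed)
    svd.fit(X)
    cumulative = np.cumsum(svd.explained_variance_ratio_)
    reached = np.nonzero(cumulative >= target)[0]
    k = int(reached[0]) + 1 if reached.size else None
    return k, cumulative

# Toy matrix: 200 rows, 50 columns, roughly 10% non-zero entries
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 50)
X_toy[X_toy < 0.9] = 0.0

k, cumulative = components_for_variance(X_toy, max_components=40, target=0.5)
print(k, cumulative[-1])
```

If `k` comes back as `None`, the target was not reached within `max_components` and a larger value has to be tried.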


## 4. Training the prediction models

In [15]:
# General testing function scoring the predictor on the train and test sets via AUC
def classification_model(model, input_train, output_train, input_test, output_test):
    # Fit the model
    model.fit(input_train, output_train)

    # Predict class probabilities on the train and test datasets
    predictions_train = model.predict_proba(input_train)
    predictions_test = model.predict_proba(input_test)

    # Print the AUC, computed from the probability of the positive class (column 1)
    auc = roc_auc_score(output_train, predictions_train[:, 1])
    print("Train data scoring : {0:.3}".format(auc))

    auc = roc_auc_score(output_test, predictions_test[:, 1])
    print("Test data scoring : {0:.3}".format(auc))
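
As a reminder of what the metric measures, the AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on four made-up scores (an assumption for illustration). `pairwise_auc` is a hypothetical helper written for this example, not part of scikit-learn.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(positives, negatives))
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_manual = pairwise_auc(y_true, scores)
auc_sklearn = roc_auc_score(y_true, scores)
print(auc_manual, auc_sklearn)  # both 0.75: 3 of the 4 pairs are ranked correctly
```

Because only the ranking of the scores matters, any monotonic transformation of the predicted probabilities leaves the AUC unchanged.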
 

**4.1 Random forest**

In [16]:
model = RandomForestClassifier(max_depth=30,n_estimators=50)

In [17]:
# 3.1 features selection
classification_model(model, X_train_tfidf_1, Target_train.values, X_test_tfidf_1, Target_test.values)

Train data scoring : 0.973
Test data scoring : 0.708


In [18]:
# 3.2 features selection
classification_model(model, X_train_tfidf_T, Target_train.values, X_test_tfidf_T, Target_test.values)

Train data scoring : 0.846
Test data scoring : 0.711


In [19]:
# 3.3 features selection
classification_model(model, X_train_tfidf_svd, Target_train.values, X_test_tfidf_svd, Target_test.values)

Train data scoring : 0.998
Test data scoring : 0.695


**4.2 Logistic regression**

In [20]:
model = LogisticRegression()

In [21]:
# 3.1 features selection
classification_model(model, X_train_tfidf_1, Target_train.values, X_test_tfidf_1, Target_test.values)

Train data scoring : 0.677
Test data scoring : 0.671


In [22]:
# 3.2 features selection
classification_model(model, X_train_tfidf_T, Target_train.values, X_test_tfidf_T, Target_test.values)

Train data scoring : 0.845
Test data scoring : 0.725


In [23]:
# 3.3 features selection
classification_model(model, X_train_tfidf_svd, Target_train.values, X_test_tfidf_svd, Target_test.values)

Train data scoring : 0.669
Test data scoring : 0.668


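**4.3 Linear classifier trained by stochastic gradient descent (SGD)**

In [24]:
# loss='log' fits a logistic regression by SGD, which provides predict_proba.
# A linear SVM would use loss='hinge'. scikit-learn >= 1.1 names this loss 'log_loss'.
model = SGDClassifier(loss='log',alpha=1e-5)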

In [25]:
# 3.1 features selection
classification_model(model, X_train_tfidf_1, Target_train.values, X_test_tfidf_1, Target_test.values)

Train data scoring : 0.66
Test data scoring : 0.651


In [26]:
# 3.2 features selection
classification_model(model, X_train_tfidf_T, Target_train.values, X_test_tfidf_T, Target_test.values)

Train data scoring : 0.878
Test data scoring : 0.719


In [27]:
# 3.3 features selection
classification_model(model, X_train_tfidf_svd, Target_train.values, X_test_tfidf_svd, Target_test.values)

Train data scoring : 0.664
Test data scoring : 0.661
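
Note that `loss='log'` makes `SGDClassifier` fit a logistic regression. A linear SVM corresponds to `loss='hinge'`, which does not provide `predict_proba`. Since the AUC depends only on the ranking of the scores, one option is to score such a model with `decision_function` instead. This is a sketch on made-up one-dimensional data, not something this notebook ran.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

# Toy, linearly separable data: negative values are class 0, positive values class 1
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]] * 10)
y_toy = np.array([0, 0, 1, 1] * 10)

# Hinge loss gives a linear SVM trained by stochastic gradient descent
svm = SGDClassifier(loss='hinge', alpha=1e-5, random_state=0)
svm.fit(X_toy, y_toy)

# decision_function returns signed distances to the separating hyperplane,
# which roc_auc_score accepts as ranking scores
margins = svm.decision_function(X_toy)
auc = roc_auc_score(y_toy, margins)
print(auc)
```

With this variant, the `classification_model` helper would need to call `decision_function` rather than `predict_proba`.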


**4.4 Naive Bayes**

In [28]:
model = MultinomialNB()

In [29]:
# 3.1 features selection
classification_model(model, X_train_tfidf_1, Target_train.values, X_test_tfidf_1, Target_test.values)

Train data scoring : 0.642
Test data scoring : 0.636


In [30]:
# 3.2 features selection
classification_model(model, X_train_tfidf_T, Target_train.values, X_test_tfidf_T, Target_test.values)

Train data scoring : 0.847
Test data scoring : 0.716


In [58]:
# 3.3 features selection not applicable: MultinomialNB requires non-negative inputs, and the SVD components can be negative

## 5. Conclusion

From the above results we can conclude:

* The best scores are obtained when all the words extracted from the reviews are used as features (section 3.2);
* Logistic regression and the SGD-trained linear classifier of section 4.3 show the best test AUC (0.725 and 0.719, i.e. in the 0.7-0.8 range). These values are fair, but fall short of the good or excellent scores we expected.

We therefore propose the simplest model as candidate predictor (the logistic regression classifier, which is easier to interpret and thus to hand over to a client) and we predict the probabilities on the original test dataset:

In [31]:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, strip_accents='unicode') 
X_train_tfidf = vectorizer.fit_transform(df["review_content"])
X_test_tfidf = vectorizer.transform(df_test["review_content"])
model = LogisticRegression()
model.fit(X_train_tfidf,df["Target"])
predictions = model.predict_proba(X_test_tfidf)

In [37]:
df_candidate_submission = df_test.drop(["review_content","review_title","review_stars","product"]
                                       ,axis=1)
df_candidate_submission["Target"] = predictions[:,1].round(3)
df_candidate_submission.to_csv('candidate_submission.csv',sep=";",index=False)
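
The final fit can also be wrapped in a scikit-learn `Pipeline`, which keeps the vectorizer and the classifier together and makes the most influential words easy to list. This is a minimal sketch on a made-up six-review corpus. The texts, the labels and the step names `tfidf` and `clf` are assumptions for illustration, not the challenge data.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus: three detailed reviews labelled useful, three one-word reviews labelled useless
toy_reviews = [
    "description detaillee et avis argumente sur le produit",
    "avis detaille avec points forts et points faibles",
    "analyse complete du produit apres plusieurs mois",
    "bien",
    "ok",
    "super",
]
toy_target = [1, 1, 1, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True, strip_accents='unicode')),
    ("clf", LogisticRegression()),
])
pipeline.fit(toy_reviews, toy_target)

# Probability of the positive class for new, unseen texts
probabilities = pipeline.predict_proba(["avis argumente et complet", "bof"])[:, 1]

# Words sorted from the most negative to the most positive coefficient
vocabulary = pipeline.named_steps["tfidf"].vocabulary_
coefficients = pipeline.named_steps["clf"].coef_[0]
words_by_weight = sorted(vocabulary, key=lambda w: coefficients[vocabulary[w]])
print(probabilities, words_by_weight[:3], words_by_weight[-3:])
```

Reading the sorted coefficients is what makes the logistic regression easy to explain: each word pushes the predicted probability up or down by an amount proportional to its weight.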