Many feared that the rise of the Internet would kill the practice of writing (as was feared of television before it). Surprisingly, it has instead modernized and popularized it (blogs, comments, reviews, tweets, etc.). Text is consequently a prevalent form of information on the Internet. But how do we extract information from blocks of text in a form that a machine, and a machine learning algorithm, can read? This is the exciting challenge of the project we are now going to tackle!


Nowadays, huge amounts of data often come with poor quality, due for instance to noise or to missing or inconsistent values. It is therefore crucial to explore the data carefully.


## Data Exploration


"Data have quality if they satisfy the requirements of the intended use," says *Data Mining: Concepts and Techniques*.
First, we define our requirements for step 1:
We want to use the review text to predict the category of the product as robustly and accurately as possible.
The target is the category. We have verified that we do not need to encode the category names manually.
The input is the review text.
We check that this field is present for each item, since otherwise we cannot predict anything. The only null values are in the reviewer name, which is of no interest to us.
However, the project description says we can use other fields of the dataset if we find them relevant. This is the case for the summary. A well-written summary is supposed to capture the true sentiment about the product in fewer words. It will definitely be relevant for determining the number of stars in the next two steps, but it appears relevant for step 1 as well. That is why we use as input the field reviewText merged with the field summary.
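The presence check and the merge of the two fields can be sketched as follows. This is a minimal illustration on hypothetical records, not our actual loading code: the sample values are invented, and only the field names reviewText, summary and reviewerName come from the dataset description above.

```python
# Hypothetical records mimicking the review dataset (values are invented).
reviews = [
    {"reviewerName": "Ann", "summary": "Great sound",
     "reviewText": "These headphones are comfortable."},
    {"reviewerName": None, "summary": "Too small",
     "reviewText": "The shirt shrank after one wash."},
]


def build_input(record):
    """Concatenate the summary and the review text into a single string."""
    summary = record.get("summary") or ""
    text = record.get("reviewText") or ""
    return f"{summary} {text}".strip()


# Records without review text cannot be used for prediction.
missing_text = [r for r in reviews if not r.get("reviewText")]

# One merged document per review; a missing reviewerName is simply ignored.
documents = [build_input(r) for r in reviews]
```

A missing reviewer name does not affect the merged document, which matches the observation that this field is of no interest for the prediction.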


## Preprocessing


From the data provided, we need to extract feature vectors suitable for machine learning. But how do we extract features from text?



The unit of a text that can make sense to a machine is the word.
We have therefore decided to use the bag-of-words technique, which consists in transforming a text into the list of the words that compose it.
As a consequence, the number of features is the number of distinct words across all the reviews. On a set of 96'000 reviews, this represents a very large number of features and a large required storage capacity. Fortunately, bag-of-words is a typical case of a "high-dimensional sparse dataset": most entries are zero and need not be stored, which keeps the technique practical.
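To make the idea concrete, here is a minimal bag-of-words sketch in plain Python. The two sentences and the tokenize helper are illustrative assumptions; a library vectorizer would do the same job on the real dataset.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (illustrative rule)."""
    return re.findall(r"[a-z']+", text.lower())


documents = [
    "The battery lasts long",
    "The battery died and the screen cracked",
]

# The vocabulary (one feature per distinct word) grows with the corpus.
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})

# Sparse representation: each document stores only its non-zero counts.
bags = [Counter(tokenize(doc)) for doc in documents]
```

Each document only stores the few words it actually contains, whatever the size of the vocabulary, which is what makes the sparse representation affordable.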

The choice of feature vector has a crucial influence on accuracy. If we use all the words of the texts with equal weight, the model may capture more information than required, i.e. learn from the noise instead of focusing on the key pieces of information. This leads to overfitting, which we want to avoid because it would reduce the accuracy on data the model was not trained on. Furthermore, it greatly increases the computational cost. What can we do, then? Could we think of a "manual" selection of important words, based on a dictionary of synonyms for the words of a given category?

For simplicity, we have not used such lists (see WordNet) but rather a statistical approach. We reason in terms of frequency rather than raw occurrence counts.
Moreover, a word could be used many times in one long review but rarely in the other reviews. To prevent such a word from being overweighted in the analysis compared to others, we have to take the length of the review text into account.
This technique consists in computing the tf-idf, “Term Frequency times Inverse Document Frequency”.
Here we came up with an idea: if we keep only the features whose frequency is above a threshold (using the L1-norm as a criterion, for instance), we can get rid of the rarest words, which would have led to overfitting. It also reduces the number of features (no longer one feature for every single word) and thereby the risk of the curse of dimensionality.
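The weighting and the pruning of rare words can be sketched as below. This is a textbook form of tf-idf written in plain Python under our own assumptions: the term frequency is divided by the document length, the idf is log(N / df), and the threshold is applied on the document frequency through a hypothetical min_df parameter. Library implementations may use smoothed variants, so the exact values can differ.

```python
import math
from collections import Counter


def document_frequency(tokenized_docs):
    """Count in how many documents each term appears."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document."""
    n_docs = len(tokenized_docs)
    df = document_frequency(tokenized_docs)
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights.append({
            # tf is normalized by the document length, idf by rarity.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def prune_rare_terms(tokenized_docs, min_df=2):
    """Drop terms that appear in fewer than min_df documents."""
    df = document_frequency(tokenized_docs)
    return [[t for t in tokens if df[t] >= min_df] for tokens in tokenized_docs]


docs = [
    ["great", "camera", "great", "lens"],
    ["great", "shoes"],
    ["comfortable", "shoes"],
]
weights = tf_idf(docs)
pruned = prune_rare_terms(docs)
```

In the first toy document, "great" appears twice but is common across documents, so it ends up with a lower weight than "camera", which appears once but only in that document.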


With the same objective, we have removed from the bag of words the stop words (e.g. "a", "the", "not"), which do not carry any information for step 1.


Note that we can expect a large variance in the results because the texts are very short. Our goal is to reduce it by extracting information from keywords likely to appear in the description of a given product category.

Based on this, we can start searching for the best classifier! 

Throughout the lectures, we have encountered several methods that perform differently in different situations.
All of them are systematic, as they rest on solid mathematics, but applying them to real data calls for an empirical search for the best model, configuration and parameters, if any.
We have nevertheless tried to be very systematic in this empirical approach, because making experiments and results reproducible is a key principle of *Science*. And the purpose of the lecture, and of this project in particular, is to study the *science* of data, i.e. Data *Science*.

## Tuning the models

Empirical search is not absolute, but relative to the criteria chosen to assess the quality of the results. 
We have investigated different metrics and measures.

As required, we have chosen three approaches: 

- a parametric-based model: kNN
The parameter to select here is k. If it is too small, we risk overfitting, while if it is too large, the prediction would not be accurate at all.
- a similarity-based model: SVM (linear and rbf kernels)
Here we can select 
- an information-based model: Random forest classifier
Parameters to select: max_features, best_clf_tree_depth.
Random forest is not known as the best choice for high-dimensional sparse data, which is exactly the case here with bag-of-words.

The use of GridSearch together with a pipeline is systematic and allows us to find the optimal parameters based on evidence, by optimizing the chosen metric.
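As an illustration of this systematic search, here is a minimal sketch assuming scikit-learn is installed. The toy corpus, the labels, the step names "tfidf" and "clf" and the small parameter grid are all assumptions made for the example; they are not the configuration we actually ran.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: four reviews per category.
texts = [
    "great novel with a gripping plot",
    "the author writes a moving story",
    "boring book with a weak ending",
    "a story I read in one night",
    "the album has a great sound",
    "this song has a catchy melody",
    "the guitar sound is too loud",
    "lovely melody and a clear voice",
]
labels = ["books"] * 4 + ["music"] * 4

# Vectorization and classification are chained so that the grid search
# refits the vectorizer on each training fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", KNeighborsClassifier()),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__n_neighbors": [1, 3],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline matters: the vocabulary is then learned on the training folds only, so the cross-validated score is not flattered by information from the held-out fold.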


## Assessing the results: metrics 

This is a classification problem. We have consequently used the classical metric for this type of problem, namely the accuracy score.
The confusion matrix is a very useful visualization tool to display the results of multi-class classification.
The ROC curve, a technique borrowed from signal processing, is a good visualization tool and shows us graphically (as well as numerically, through the Area Under the Curve) the advantage of our classifier over a dummy one.
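The first two metrics are simple enough to be written out by hand. The sketch below uses invented labels and predictions purely for illustration; in the confusion matrix, rows are the true categories and columns the predicted ones.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Rows: true label, columns: predicted label, in the order of labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Invented example with three categories.
labels = ["books", "music", "toys"]
y_true = ["books", "books", "music", "music", "toys", "toys"]
y_pred = ["books", "music", "music", "music", "toys", "books"]

score = accuracy(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred, labels)
```

The diagonal of the matrix holds the correct predictions, while off-diagonal cells show which categories are confused with each other, something the single accuracy number cannot tell.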

# Sentiment Analysis - Step 2

Intuitively, the number of stars is positively correlated with the presence of positive sentiment in the review. We are more likely to see 5 stars if the review contains words such as "excellent", "wonderful", etc.
But how do we bring human understanding to the machine? By giving it lists of synonyms.
For instance, WordNet is a lexical database [2] that contains the relations between similar words. It covers synonyms, hyponyms and hypernyms. Consequently, we could reduce the number of features (words) and easily measure how close the words of a text are to the features.

Can we use the same bag-of-words technique as for step 1?
The answer is no, because we face a first difficulty here.
Take "not good": "not" is a stop word that would be removed, and "good" is a positive adjective. The sentiment resulting from this analysis is the exact opposite of what the text truly means. We can also have "not very good". Using bigrams, we extract "very good" from this sentence, which would obviously give a false positive. We can find other examples that would require 4-grams, and so on.
But again, there is a trade-off between computational complexity, the curse of dimensionality and accuracy. In any case, we are sure that we can no longer use a bag of words that splits sentences into single words (unigrams).
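The negation problem is easy to reproduce on a single sentence. In this small sketch, the stop word list and the example sentence are illustrative assumptions.

```python
# Illustrative stop word list; note that it contains "not".
STOP_WORDS = {"a", "the", "not", "is", "this"}


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this is not very good".split()

# Unigrams after stop word removal: the negation has disappeared.
unigrams_no_stop = [t for t in tokens if t not in STOP_WORDS]

# Bigrams still contain the misleading "very good".
bigrams = ngrams(tokens, 2)

# Only from trigrams on does "not very good" survive as one feature.
trigrams = ngrams(tokens, 3)
```

Each increase of n keeps more context but also multiplies the number of distinct features, which is the trade-off discussed above.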

Another difficulty, which we do not know how to tackle, is the implicit sentiment sometimes contained in sentences. For instance, "item as described" is very neutral but can be associated with 5 stars. The same goes for ironic reviews. This will of course affect the accuracy of the results, but we assume these types of reviews are marginal.

Finally, our approach implements, as in step 1, the bag-of-words technique, this time with 2-grams, counting the occurrences of positive and negative words based on the lists of positive/negative words provided in the project description. If a review has more positive than negative words, it is tagged as positive. But is this enough to predict whether a review has 5 stars or fewer, which is the purpose of step 2?
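A minimal sketch of such a lexicon-based tagger is given below. It is an illustration under several assumptions rather than our actual implementation: the two word lists are tiny hypothetical excerpts (the real lists come with the project description), the 2-gram is used only to flip the polarity of a word directly preceded by "not", and a tie is tagged as negative.

```python
# Hypothetical excerpts of the positive and negative word lists.
POSITIVE = {"excellent", "wonderful", "good"}
NEGATIVE = {"bad", "poor", "broken"}


def tag_review(text):
    """Tag a review as positive if positive words outnumber negative ones."""
    score = 0
    previous = ""
    for token in text.lower().split():
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        # Assumption: the bigram "not <word>" flips the polarity of <word>.
        if previous == "not":
            polarity = -polarity
        score += polarity
        previous = token
    return "positive" if score > 0 else "negative"


examples = {
    text: tag_review(text)
    for text in ("excellent and wonderful product", "not good", "not very good")
}
```

With these toy lists, "not good" is correctly tagged as negative, while "not very good" is still tagged as positive because the negation is two words away, which is exactly the limit of 2-grams.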



Logistic regression?
A simpler method does not necessarily imply a less accurate prediction. As we have seen, it depends on the situation, the parameters chosen and so on.


"Sentiment categorization is essentially a classification problem, where features that contain opinions or sentiment information should be identified before the classification. For feature selection, Pang and Lee [5] suggested to remove objective sentences by extracting subjective ones."


Neural networks are known to be efficient for this type of problem (Google's Word2vec, for instance), but they are beyond the scope of our study.


# Step 3



# How do we overcome the obstacle of the running time? 

The set used is composed of 96'000 samples.

We have adopted several strategies: 

- Work on a smaller dataset to save time and test methods more efficiently
- Try our best to reduce the number of features
- Test hashing functions instead of counters for the tokenization, taking advantage of sparse structures
- Use Google Cloud Engine for computations [3]
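The hashing strategy from the list above can be sketched in a few lines of plain Python. This is an illustration of the idea only: the function name hashed_features, the choice of MD5 and the tiny number of buckets are assumptions made for the example (scikit-learn offers a HashingVectorizer built on the same principle).

```python
import hashlib


def hashed_features(tokens, n_buckets=16):
    """Map tokens to a fixed number of buckets, storing non-zero counts only."""
    vector = {}
    for token in tokens:
        # A stable hash gives the same bucket on every run and machine.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] = vector.get(index, 0) + 1
    return vector


vector = hashed_features("the battery died and the screen cracked".split())
```

No vocabulary has to be built or kept in memory, and the number of features is fixed in advance. The price is that two different words can fall into the same bucket, and that a bucket index can no longer be mapped back to a word.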





# CONCLUSION

We have been confronted here with the "data deluge". As we have reached an era of extraordinary computing power (compared to what laptops were just a few years ago), we could hardly imagine at the beginning of the project that we would have to let the Google computation engine run for an entire night to come up with the right category based on review text.

The quantity and accessibility of data are not much of a challenge anymore. The challenge lies in the ability to produce meaningful analysis from relevant data in the right context. We actually came up with some ideas and results we are proud of. However, in the allotted time we were able neither to try all our ideas for improving the results nor to provide better interpretations. The next step is to keep learning and experimenting, with Kaggle competitions for instance.

Sources & Resources:
[1]http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
[2]https://www.scss.tcd.ie/Khurshid.Ahmad/Research/Sentiments/K_Teams_Buchraest/mvie%20review%20review.pdf
[3]https://jeffdelaney.me/blog/running-jupyter-notebook-google-cloud-platform/

