<table>
    <tr><td>
         <a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/[Data_Exploration]Word-Clouds.ipynb">
         <img alt="start" src="figures/button_previous.jpg" width= 70% height= 70%></a>
    </td><td>
        <a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/Index.ipynb">
         <img alt="start" src="figures/button_table-of-contents.jpg" width= 70% height= 70%></a>
    </td><td>
         <a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/Vectorization.ipynb">
         <img alt="start" src="figures/button_next.jpg" width= 70% height= 70%></a>
    </td></tr>
</table>

# Basic code structure
This notebook illustrates and explains the form and code structure of the sentiment analysis part, to make the following notebooks easier to understand.


The basic structure consists of the following steps:

- Import data from a JSON file into a pandas dataframe.
- Reduce the number of classes to three.
- Clean the text (corpus).
- Separate the dataset into a training set and a test set.
- Transform the text into numerical form by performing vectorization.
- Perform classification, predict, and print the results/reports.


In [12]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score
from collections import Counter

#[1] Importing dataset

dataset = pd.read_json(r"C:\Users\Panos\Desktop\Dissert\Code\Sample_Video_Games_5.json", lines=True, encoding='latin-1')
dataset = dataset[['reviewText','overall']]

#[2] Reduce number of classes

ratings = []
for entry in dataset['overall']:
    if entry == 1.0 or entry == 2.0:    # 1-2 stars -> negative
        ratings.append(-1)
    elif entry == 3.0:                  # 3 stars -> neutral
        ratings.append(0)
    elif entry == 4.0 or entry == 5.0:  # 4-5 stars -> positive
        ratings.append(1)

                                          reviewText  overall
0  Installing the game was a struggle (because of...        1
1  If you like rally cars get this game you will ...        4
2  1st shipment received a book instead of the ga...        1
3  I got this version instead of the PS3 version,...        3
4  I had Dirt 2 on Xbox 360 and it was an okay ga...        4
________________________________________________________________

ratings:  [-1, 1, -1, 0, 1, 1, 1, -1, 1, -1]


The code above first reads the JSON file and stores the data in a pandas dataframe. The dataframe consists of two columns, "reviewText" and "overall", as shown in the output above. The reviewText column holds the text written by the customer reviewing an item. The overall column contains an integer from 1 to 5, representing the rating the customer gave the corresponding item.

The next step reduces the overall value from five possible values (1 to 5) to three (-1, 0 and 1), saving the new values in the 'ratings' list. In other words, the number of classes is reduced as follows: reviews rated with 1 or 2 stars are negative (-1), reviews rated with 3 stars are neutral (0), and reviews rated with 4 or 5 stars are positive (1), as shown below.

&#11088; &#8594; Negative (-1)

&#11088; &#11088; &#8594; Negative (-1)

&#11088; &#11088; &#11088; &#8594; Neutral (0)

&#11088; &#11088; &#11088; &#11088; &#8594; Positive (1)

&#11088; &#11088; &#11088; &#11088; &#11088; &#8594; Positive (1)
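
The same mapping can also be written as a lookup table. The following is a minimal sketch, not the notebook's own code: `STAR_TO_CLASS` and `reduce_classes` are hypothetical names, and raising an error on an unexpected rating is an added assumption (the loop above silently skips such values, which would leave `ratings` shorter than the dataframe).

```python
# Hypothetical lookup table: star rating -> sentiment class.
STAR_TO_CLASS = {1: -1, 2: -1, 3: 0, 4: 1, 5: 1}

def reduce_classes(stars):
    """Return the sentiment class (-1, 0 or 1) for each star rating."""
    classes = []
    for value in stars:
        key = int(value)
        # Reject fractional or out-of-range ratings instead of skipping them.
        if key != value or key not in STAR_TO_CLASS:
            raise ValueError(f"unexpected rating: {value!r}")
        classes.append(STAR_TO_CLASS[key])
    return classes

# The first five ratings shown in the output above.
print(reduce_classes([1.0, 4.0, 1.0, 3.0, 4.0]))  # [-1, 1, -1, 0, 1]
```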

In [16]:
#[3] Cleaning the text

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(0, len(dataset)):
    # keep letters only; every other character becomes a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['reviewText'][i])
    review = review.lower()
    review = review.split()
    review = [word for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)


Example

Before cleaning:

1st shipment received a book instead of the game.2nd shipment got a FAKE one. Game arrived with a wrong key inside on sealed box. I got in contact with codemasters and send them pictures of the DVD and the content. They said nothing they can do its a fake DVD.Returned it good bye.!

After Cleaning:

st shipment received book instead game nd shipment got fake one game arrived wrong key inside sealed box got contact codemasters send pictures dvd content said nothing fake dvd returned good bye


In this code section, several cleaning/pre-processing techniques are applied to make the text more machine-friendly, remove unwanted tokens, and make identical tokens match.

In this example, every character that is not a letter of the English alphabet (a to z and A to Z) is replaced with a space. All capital letters are then converted to lowercase, and stop-words are removed using the nltk library. Stop-words are the most common words in a language; they are mostly "auxiliary" and usually do not affect the meaning of a sentence.

In the next notebooks, different pre-processing methods are examined in order to achieve a better final accuracy, as this is a critical part of sentiment analysis. Specifically, different stemming methods are tested, and regular expressions (regex) are experimented with in the final section of the project.
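
The cleaning steps described above can be condensed into one function. This is only a sketch: `clean_review` is a hypothetical name, and `STOP_WORDS` is a tiny illustrative set standing in for the much longer English list that the notebook takes from nltk.

```python
import re

# Assumption: a small stand-in for nltk's English stop-word list.
STOP_WORDS = {"a", "the", "of", "on", "with", "and", "i", "in", "it",
              "its", "they", "them", "can", "do"}

def clean_review(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, and drop stop-words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

print(clean_review("1st shipment received a book instead of the game."))
# st shipment received book instead game
```

Note that removing digits turns "1st" into "st", exactly as in the example above; this is one of the side effects that later pre-processing experiments can address.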

In [27]:
#[4] Prepare Train and Test Data sets
            
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(corpus,ratings,test_size=0.3)

In this part, the dataset is split into a training set and a test set, so that the accuracy can later be calculated on the test set to decide whether the model is acceptable. The test set is a part of the dataset that the model is not trained on. In this example the test set makes up 30% of the entire dataset.
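
To make the idea of a 70/30 split concrete, here is a plain-Python sketch of what such a split does. It is not scikit-learn's implementation: `split_train_test` and its `seed` parameter are hypothetical, and the rounding of the test-set size is a simplification. In the notebook itself, passing `random_state` to `train_test_split` serves the same purpose of making the split reproducible.

```python
import random

def split_train_test(texts, labels, test_size=0.3, seed=0):
    """Shuffle the examples and hold out a fraction of them for testing."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = round(len(indices) * test_size)   # size of the held-out part
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [texts[i] for i in train_idx]
    test_x = [texts[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y

reviews = [f"review {i}" for i in range(10)]
classes = [-1, 1, -1, 0, 1, 1, 1, -1, 1, -1]
train_x, test_x, train_y, test_y = split_train_test(reviews, classes)
print(len(train_x), len(test_x))  # 7 3
```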

In [18]:
#[5] Encoding

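Encoder = LabelEncoder()
# learn the label mapping on the training labels only...
Train_Y = Encoder.fit_transform(Train_Y)
# ...and reuse the same mapping for the test labels
Test_Y = Encoder.transform(Test_Y)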
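
LabelEncoder assigns integer codes to the distinct labels in sorted order, so here -1, 0 and 1 become 0, 1 and 2. The plain-Python imitation below (hypothetical helper names, not the scikit-learn code) illustrates why the mapping is best learned once on the training labels and then reused: refitting on a test set that happens to lack a class would produce different codes.

```python
def fit_label_mapping(labels):
    """Assign 0, 1, 2, ... to the distinct labels in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

def encode(labels, mapping):
    """Translate labels to integer codes using a fitted mapping."""
    return [mapping[label] for label in labels]

train_y = [-1, 1, 0, 1, 1, -1]
test_y = [1, 1, 0]  # this tiny test set contains no negative review

mapping = fit_label_mapping(train_y)
print(mapping)                                     # {-1: 0, 0: 1, 1: 2}
print(encode(test_y, mapping))                     # [2, 2, 1]
# Refitting on the test labels alone shifts the codes:
print(encode(test_y, fit_label_mapping(test_y)))   # [1, 1, 0]
```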

In [26]:
#[6] Word Vectorization
        
Tfidf_vect = TfidfVectorizer(max_features=10000)
Tfidf_vect.fit(corpus)
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

#the vocabulary that it has learned from the corpus
print(Tfidf_vect.vocabulary_)

#the vectorized data
print(Train_X_Tfidf)

**A sample of the vocabulary that it has learned from the corpus:**

{'installing': 846, 'game': 663, 'struggle': 1615, 'games': 669, 'windows': 1864, 'live': 957, 'bugs': 179}

**A sample of the vectorized data:**

  (0, 1602)	0.2521459788527613 <br> 
  (0, 1536)	0.44611689308144253 <br> 
  (0, 1499)	0.3630663061451349 <br> 
  (0, 1221)	0.2521459788527613 <br> 
  (0, 971)	0.3097906090161147 <br> 
  (0, 949)	0.203737910991826 <br> 
  (0, 895)	0.44611689308144253 <br> 
  (0, 312)	0.24206473066707476 <br> 
  (0, 101)	0.38329154936001714 <br> 
  (1, 1829)	0.17313823592091127 <br> 
  (1, 1782)	0.2029133806474433 <br> 
  (1, 1702)	0.12345297644789534 <br> 
  (1, 1519)	0.14399237108946988 <br> 
  (1, 1476)	0.21421702520406033 <br> 
  (1, 1147)	0.11048831816230155 <br> 
  
Vectorization represents the text data in a multidimensional space using float values, which the machine can process. In this example the TF-IDF algorithm is used, which works in two steps: it first counts the frequency of each token in a document (term frequency), then calculates the inverse document frequency, and combines the two to produce the final vectorized values.

In this project two vectorization algorithms are tested, the TF-IDF Vectorizer and the Hashing Vectorizer. Both vectorizers are explained in more depth in a later notebook.
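
The two TF-IDF steps can be worked through by hand on a toy corpus. The sketch below is a plain-Python illustration rather than the library code: the function name `tfidf` and the three toy documents are made up, and the formula (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) is assumed to match the default settings of scikit-learn's TfidfVectorizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)                     # step 1: term frequency
        weights = {t: c * idf[t] for t, c in counts.items()}  # step 2: tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["game arrived fake", "game fun", "fun game game"]
for row in tfidf(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

In the first toy document, "fake" receives a higher weight than "game" because "game" occurs in every document and therefore carries little information.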

In [6]:
#[7] Use the Naive Bayes Algorithms to Predict the outcome

# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels of the test set
predictions_NB = Naive.predict(Test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("-----------------------Naive Bayes------------------------\n")
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)
# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Test_Y, predictions_NB)
print("\n",cm,"\n")
# Printing a classification report of different metrics
from sklearn.metrics import classification_report
# LabelEncoder sorts the classes, so 0 = Negative (-1), 1 = Neutral (0), 2 = Positive (1)
my_tags = ['Negative','Neutral','Positive']
print(classification_report(Test_Y, predictions_NB,target_names=my_tags))

# Export reports to files for later visualizations
report_NB = classification_report(Test_Y, predictions_NB,target_names=my_tags, output_dict=True)
report_NB_df = pd.DataFrame(report_NB).transpose()
report_NB_df.to_csv(r'NB_report_TFIDFVect.csv', index = True, float_format="%.3f")

-----------------------Naive Bayes------------------------

Naive Bayes Accuracy Score ->  77.46282394224409

 [[ 1544   120  6973]
 [  174   118  8234]
 [  121    49 52201]] 

              precision    recall  f1-score   support

    Negative       0.84      0.18      0.29      8637
     Neutral       0.41      0.01      0.03      8526
    Positive       0.77      1.00      0.87     52371

    accuracy                           0.77     69534
   macro avg       0.68      0.40      0.40     69534
weighted avg       0.74      0.77      0.70     69534



In [7]:
#[8] Use the Support Vector Machine Algorithms to Predict the outcome

# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# predict the labels of the test set
predictions_SVM = SVM.predict(Test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("-----------------Support Vector Machine CM------------------\n")
print("Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)
# Making the confusion matrix
cm = confusion_matrix(Test_Y, predictions_SVM)
print("\n",cm,"\n")
# Printing a classification report of different metrics
print(classification_report(Test_Y, predictions_SVM,target_names=my_tags))

# Export reports to files for later visualizations
report_SVM = classification_report(Test_Y, predictions_SVM,target_names=my_tags, output_dict=True)
report_SVM_df = pd.DataFrame(report_SVM).transpose()
report_SVM_df.to_csv(r'SVM_report_TFIDFVect.csv', index = True, float_format="%.3f")

-----------------Support Vector Machine CM------------------

Accuracy Score ->  82.27485834268128

 [[ 4993   761  2883]
 [ 1365  1399  5762]
 [  880   674 50817]] 

              precision    recall  f1-score   support

    Negative       0.69      0.58      0.63      8637
     Neutral       0.49      0.16      0.25      8526
    Positive       0.85      0.97      0.91     52371

    accuracy                           0.82     69534
   macro avg       0.68      0.57      0.59     69534
weighted avg       0.79      0.82      0.79     69534



In the final step, classification algorithms are used to predict the outcome. First the classifier is fitted on the training set; then the labels of the test set are predicted and stored in a predictions variable (predictions_NB or predictions_SVM), which is then used to calculate the accuracy, the confusion matrix and the classification report.
Finally, the reports are exported to CSV files, so that they can be used in the Results & Conclusion notebook to visualise the conclusions of the whole project and examine how the accuracy scores develop.
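
The figures in the classification reports follow directly from the confusion matrices. As a worked check, the sketch below recomputes accuracy, precision and recall from the Naive Bayes matrix printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes, and it refers to the classes by their encoded values 0, 1 and 2, which LabelEncoder derives from -1, 0 and 1 in sorted order. The helper name `per_class_metrics` is hypothetical.

```python
# Naive Bayes confusion matrix from the output above
# (rows: true class, columns: predicted class).
cm = [
    [1544, 120, 6973],
    [174, 118, 8234],
    [121, 49, 52201],
]

def per_class_metrics(cm):
    """Return (precision, recall, support) for every class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]                                # correctly predicted as k
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum = support
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        metrics.append((precision, recall, actual))
    return metrics

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy: {accuracy:.4f}")
for k, (p, r, support) in enumerate(per_class_metrics(cm)):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} support={support}")
```

The low recall of classes 0 and 1 shows that the high overall accuracy comes mostly from the majority class 2, which is why the reports are more informative than the accuracy score alone.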

<a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/Vectorization.ipynb">
         <img alt="start" src="figures/button_next.jpg"></a>