# Data Mining Term Project - Board Game Geek Rating Prediction

## Name: Pranav Naresh Medhi
## Student ID: 1001756326

## Introduction 

## In this project I implement a Naive Bayes classifier that predicts the rating attached to a board game review. The dataset comes from BoardGameGeek, the world's largest board game site, and contains users' comments and ratings for its board games. First, I preprocess the data, since preprocessing is one of the most important steps in building any data mining model. Second, I train scikit-learn's Multinomial Naive Bayes on the result. The preprocessing cleans the text (stripping punctuation and lowercasing) and uses CountVectorizer to turn the comments into word-count features that are easier to work with.

## Links

## Web-App deployment Link
http://pranavmedhi.pythonanywhere.com/

## Github Repository Link
https://github.com/pranavmedhi123/datamining_project

## Video Demonstration Link
https://youtu.be/hGA6hu_eBB0

## Kaggle Notebook Link
https://www.kaggle.com/pranavmedhi/notebookaea7200859?scriptVersionId=49270588

#### Important Note:
##### Running this code might take a substantial amount of time, as training the model is slow because of the huge dataset size

## Imports section
### For readers who are new to Jupyter Notebook: a notebook is organised into cells that can each be run on their own. I generally keep the imports in a separate cell at the beginning, so that I save time when I have to import something later in the development process.

In [1]:
import sklearn as sk
import os, re, string
import pandas as pd
import numpy as np
from pandas import DataFrame as df
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import mean_squared_error
import pickle

## Preprocessing stage : Data Loading
### We only need the bgg-15m-reviews.csv file for this project. The dataset contains two other csv files, but we won't be using them, as our only concern is the ratings and reviews.

In [2]:
game_review = pd.read_csv('./boardGameReview/bgg-15m-reviews.csv')

In [3]:
# Preprocessing: strip punctuation and lowercase the text for bag-of-words analysis
def preproc(incomingString):
    return incomingString.translate(str.maketrans('', '', string.punctuation)).lower()
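
As a quick illustration of what this cleaning step does, here is a minimal sketch that applies the same function to two sample strings. The sample strings are made up for the example; note that `string.punctuation` covers ASCII punctuation only.

```python
import string

def preproc(incoming_string):
    """Strip ASCII punctuation and lowercase, as in the cell above."""
    return incoming_string.translate(str.maketrans('', '', string.punctuation)).lower()

# Hypothetical sample comments, not taken from the dataset
sample = "Hands down, my FAVORITE new game!"
cleaned = preproc(sample)
print(cleaned)                     # hands down my favorite new game

# Apostrophes are removed too, so contractions collapse into one token
contraction = preproc("Don't miss it.")
print(contraction)                 # dont miss it
```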

## Creating Pandas Dataframe
### I used a Pandas DataFrame because it makes the dataset a lot easier to work with

In [4]:
# Keep only the comment and rating columns, and drop rows without a comment
review_rating_table = pd.DataFrame({'comment': game_review['comment'], 'rating': game_review['rating']})
review_rating_table = review_rating_table[pd.notna(review_rating_table['comment'])]
review_rating_table = review_rating_table.reset_index(drop=True)
# Note: the cleaned text is kept in r and is not written back to the table;
# CountVectorizer below lowercases and tokenizes the raw comments itself
r = review_rating_table['comment'].apply(preproc)

### The ratings in the dataset are continuous numbers rather than integers, so they are converted to integers here (astype drops the fractional part):

In [5]:
review_rating_table['rating'] = review_rating_table['rating'].astype('int32')
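
The cast above truncates rather than rounds, which matters for ratings such as 7.9. A small sketch of the difference, using made-up rating values:

```python
import pandas as pd

# Hypothetical ratings, not taken from the dataset
ratings = pd.Series([7.9, 6.5, 8.0, 9.25])

# astype('int32') drops the fractional part
truncated = ratings.astype('int32')

# Rounding first maps each rating to the nearest integer
# (exact halves such as 6.5 go to the nearest even integer)
rounded = ratings.round().astype('int32')

print(truncated.tolist())
print(rounded.tolist())
```

Which of the two is preferable depends on how the integer classes are meant to be read; the notebook uses truncation.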

### Splitting the data into roughly 80% for training and 20% held out, using a random mask

In [6]:
msk_train = np.random.rand(len(review_rating_table)) <= 0.8
# msk_test = np.random.rand(len(review_rating_table[~msk_train])) <= 0.1
review_rating_table_train = review_rating_table[msk_train]
review_rating_table_test = review_rating_table[~msk_train]

### Splitting the test data into half for development and half for test

In [7]:
msk_review_rating_table_test = np.random.rand(len(review_rating_table_test)) <= 0.5
review_rating_table_dev = review_rating_table_test[msk_review_rating_table_test]
review_rating_table_test = review_rating_table_test[~msk_review_rating_table_test]
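
Because the masks above are drawn without a fixed seed, every run produces a different split and slightly different scores. The following is a minimal sketch of the same masking approach made reproducible; the toy table, the variable names, and the seed value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for review_rating_table (hypothetical data)
toy = pd.DataFrame({'comment': [f'review {i}' for i in range(1000)],
                    'rating': np.arange(1000) % 10 + 1})

rng = np.random.default_rng(42)  # the seed value is an arbitrary choice

# Roughly 80% train, the remaining rows are held out
train_mask = rng.random(len(toy)) <= 0.8
toy_train = toy[train_mask]
toy_holdout = toy[~train_mask]

# Split the held-out rows roughly in half: development and test
dev_mask = rng.random(len(toy_holdout)) <= 0.5
toy_dev = toy_holdout[dev_mask]
toy_test = toy_holdout[~dev_mask]

print(len(toy_train), len(toy_dev), len(toy_test))
```

The split sizes are only approximately 80/10/10, since each row is assigned independently at random.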

In [8]:
# set the y value for train data
ytrain = review_rating_table_train['rating']

### Tokenizing the text: I used CountVectorizer to map each word to a column index and count its occurrences in every comment. Setting max_features = 10000 keeps only the 10,000 most frequent words, which makes training faster and avoids the issue of dealing with more than 10,000 features

In [9]:
# Tokenizing text
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(min_df = 1, max_features = 10000)
X_train_counts = count_vect.fit_transform(review_rating_table_train['comment'])
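
To make the effect of `max_features` concrete, here is a small sketch on a three-document toy corpus (the documents are invented for the example). With `max_features=3`, the least frequent word is dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus: 'fun' occurs 4 times, 'game' and 'great' twice, 'boring' once
docs = ["great game great fun", "boring game", "fun fun fun"]

vect = CountVectorizer(max_features=3)   # keep only the 3 most frequent words
counts = vect.fit_transform(docs)

kept_words = sorted(vect.vocabulary_)    # column order is alphabetical
dense_counts = counts.toarray()          # one row per document, one column per kept word

print(kept_words)
print(dense_counts)
```

Each row of the matrix is the bag-of-words representation of one comment, and this is the input that the classifier is trained on.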

### Multinomial Naive Bayes Classifier implemented

In [10]:
# Fit multinomial Naive Bayes classifier
ytrain = np.around(ytrain).astype('U')
clf_alpha_1 = MultinomialNB().fit(X_train_counts, ytrain)

In [11]:
clf_alpha_0 = MultinomialNB(alpha=0).fit(X_train_counts, ytrain)
clf_alpha_2 = MultinomialNB(alpha=2).fit(X_train_counts, ytrain)
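
The `alpha` argument is the additive smoothing parameter: it is added to every word count before the per-class word probabilities are estimated. The sketch below checks this on a tiny invented count matrix by comparing scikit-learn's fitted probabilities with the smoothed estimate computed by hand; the data and class names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 3 documents, 3 words
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3]])
y = np.array(['low', 'low', 'high'])

alpha = 1.0
clf = MultinomialNB(alpha=alpha).fit(X, y)

# Smoothed estimate by hand for class 'low':
# (count of word i in class + alpha) / (total count in class + alpha * number of words)
low_counts = X[y == 'low'].sum(axis=0)
manual = (low_counts + alpha) / (low_counts.sum() + alpha * X.shape[1])

low_idx = list(clf.classes_).index('low')
sklearn_probs = np.exp(clf.feature_log_prob_[low_idx])

print(manual)
print(sklearn_probs)
```

With larger `alpha`, words never seen in a class still receive a noticeable probability; with `alpha` close to 0, the estimates follow the raw counts almost exactly.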



## Printing the vocabulary, which maps each word to its column index in the count matrix

In [12]:
# vocabulary for the review-table
count_vect.vocabulary_

{'hands': 4119,
 'down': 2745,
 'my': 5842,
 'favorite': 3418,
 'new': 5956,
 'game': 3791,
 'of': 6129,
 'bgg': 1059,
 'con': 1890,
 '2007': 68,
 'we': 9650,
 'played': 6623,
 'it': 4752,
 'times': 8975,
 'in': 4502,
 'row': 7571,
 'just': 4865,
 'that': 8846,
 'good': 3959,
 'too': 9021,
 'bad': 891,
 'pandemic': 6368,
 'won': 9798,
 'be': 967,
 'stores': 8427,
 'until': 9352,
 'january': 4779,
 '2008': 69,
 'if': 4423,
 'you': 9904,
 'like': 5164,
 'pure': 6991,
 'games': 3800,
 'lord': 5258,
 'the': 8848,
 'rings': 7493,
 'etc': 3150,
 'this': 8889,
 'should': 7947,
 'right': 7489,
 'up': 9359,
 'your': 9909,
 'alley': 474,
 'having': 4170,
 'roles': 7533,
 'to': 8994,
 'choose': 1612,
 'from': 3721,
 'gives': 3916,
 'some': 8167,
 'extra': 3313,
 'variability': 9437,
 'also': 496,
 'once': 6165,
 'get': 3884,
 'can': 1393,
 'ramp': 7102,
 'difficulty': 2565,
 'by': 1359,
 'adding': 320,
 'more': 5778,
 'epidemic': 3092,
 'cards': 1431,
 '10': 13,
 'hey': 4240,
 'finally': 3506,
 ...

### Saving the vocabulary for my Flask web-app 

In [13]:
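import json
file = './vocabulary.json'
with open(file, 'w') as f:
    # Cast each index to a plain int, since json cannot serialize NumPy integer types
    json.dump({term: int(index) for term, index in count_vect.vocabulary_.items()}, f)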

In [14]:
# Build the dev-set count matrix using the vocabulary learned from the training set
count_vect_dev = CountVectorizer(vocabulary=count_vect.vocabulary_)
X_dev_counts = count_vect_dev.fit_transform(review_rating_table_dev['comment'])
ydev = review_rating_table_dev['rating']
ydev = np.around(ydev).astype('U')


## Hyperparameter Tuning and Experimentations

### For this section I used the root mean squared error (RMSE, the square root of the MSE) to measure how far the predictions deviate from the true ratings, comparing Naive Bayes models trained with different alpha values

In [15]:
# RMSE using clf_alpha_0
prediction = clf_alpha_0.predict(X_dev_counts)
accuracy = sum(prediction==ydev)/len(ydev)
# Root mean squared error between predicted and true ratings
rmse = np.sqrt(mean_squared_error(prediction.astype(float), ydev.astype(float)))
print(rmse)

1.7018084156878874


In [16]:
# RMSE using clf_alpha_1
prediction = clf_alpha_1.predict(X_dev_counts)
accuracy = sum(prediction==ydev)/len(ydev)
# Root mean squared error between predicted and true ratings
rmse = np.sqrt(mean_squared_error(prediction.astype(float), ydev.astype(float)))
print(rmse)

1.8034088715675551


In [17]:
# RMSE using clf_alpha_2
prediction = clf_alpha_2.predict(X_dev_counts)
accuracy = sum(prediction==ydev)/len(ydev)
# Root mean squared error between predicted and true ratings
rmse = np.sqrt(mean_squared_error(prediction.astype(float), ydev.astype(float)))
print(rmse)

1.7936268438729144
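
The cells above compute an `accuracy` value but only print the error. As a small sketch of how the two metrics relate, the following uses four invented string labels in the same format as the notebook's predictions and computes the MSE, its square root, and the exact-match accuracy.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted ratings, stored as strings like the notebook's labels
y_true = np.array(['8', '6', '7', '9']).astype(float)
y_pred = np.array(['7', '6', '9', '9']).astype(float)

mse = mean_squared_error(y_true, y_pred)   # mean of the squared errors: (1 + 0 + 4 + 0) / 4
rmse = np.sqrt(mse)                        # same units as the ratings
accuracy = np.mean(y_true == y_pred)       # fraction of exact matches

print(mse, rmse, accuracy)
```

Unlike accuracy, the squared-error metrics give partial credit for near misses: predicting 7 for a true rating of 8 costs less than predicting 3.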


## I picked the classifier with the lowest error on the development set, which was clf_alpha_0. My conclusion was that there was no overfitting, as its error on the test set below (about 1.70) is nearly the same as on the development set

In [18]:
# Pick the classifier with the lowest dev-set error and evaluate it on the test set
# Build the test-set count matrix using the vocabulary learned from the training set
# No overfitting: the error here is close to the dev-set error for clf_alpha_0
count_vect_test = CountVectorizer(vocabulary=count_vect.vocabulary_)
X_test_counts = count_vect_test.fit_transform(review_rating_table_test['comment'])
ytest = review_rating_table_test['rating']
ytest = np.around(ytest).astype('U')
prediction = clf_alpha_0.predict(X_test_counts)
accuracy = sum(prediction==ytest)/len(ytest)
# Root mean squared error between predicted and true ratings
rmse = np.sqrt(mean_squared_error(prediction.astype(float), ytest.astype(float)))
print(rmse)

1.6959179072086468


## Explanation for basic Algorithms used
### I have used the Multinomial Naive Bayes Classifier for this implementation
Naive Bayes classifiers are built on Bayesian classification methods. These rely on Bayes's theorem, an equation describing the relationship between conditional probabilities of statistical quantities. In Bayesian classification, we are interested in finding the probability of a label given some observed features, which we can write as P(L | features). Bayes's theorem expresses this in terms of quantities that can be estimated from training data: P(L | features) is proportional to P(features | L) P(L).
To use this, we need some model by which we can compute P(features | Li) for each label. Such a model is called a generative model because it specifies the hypothetical random process that generates the data. Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier. The general version of such a training step is a very difficult task, but we can make it simpler through the use of some simplifying assumptions about the form of this model.

This is where the "naive" in "naive Bayes" comes in: if we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification. Different types of naive Bayes classifiers rest on different naive assumptions about the data; the multinomial variant used here assumes that the word counts of each review are drawn from a multinomial distribution that depends only on the label.
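
The explanation above can be written out as follows. This is a sketch in the standard multinomial Naive Bayes formulation; the symbols $x_i$ (count of word $i$ in a review), $\theta_{L,i}$ (probability of word $i$ under label $L$), $N_{L,i}$ (count of word $i$ in training reviews with label $L$), $N_L$ (total word count for label $L$), and $n$ (vocabulary size) are notation introduced here, not taken from the notebook.

```latex
% Bayes's theorem for a label L given the observed features
P(L \mid \text{features}) = \frac{P(\text{features} \mid L)\, P(L)}{P(\text{features})}

% Naive assumption: the word counts x_1, ..., x_n depend only on the label
P(\text{features} \mid L) \propto \prod_{i=1}^{n} \theta_{L,i}^{\,x_i}

% Predicted label: the one with the highest posterior (in log form)
\hat{L} = \arg\max_{L} \Big[ \log P(L) + \sum_{i=1}^{n} x_i \log \theta_{L,i} \Big]

% Smoothed estimate of the word probabilities, with smoothing parameter \alpha
\hat{\theta}_{L,i} = \frac{N_{L,i} + \alpha}{N_{L} + \alpha\, n}
```

The last line is where the alpha values compared in the tuning section enter the model.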

In [19]:
# Saving clf_alpha_1 (alpha=1 smoothing) to a file for use in the Flask application project
filename = 'nb_model_final.sav'
with open(filename, 'wb') as f:
    pickle.dump(clf_alpha_1, f)
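
The saved vocabulary and the pickled model are the two files the web app needs in order to predict without the dataset. The following is a minimal sketch of that round trip, assuming a helper named `load_predictor` (hypothetical, not part of the project) and a toy model trained on invented comments so that the example runs on its own.

```python
import json
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training side: toy data standing in for the notebook's tables (hypothetical)
comments = ["great fun game", "boring and slow", "great great fun", "slow boring game"]
ratings = ['9', '3', '9', '3']

vect = CountVectorizer()
clf = MultinomialNB().fit(vect.fit_transform(comments), ratings)

workdir = tempfile.mkdtemp()
vocab_path = os.path.join(workdir, 'vocabulary.json')
model_path = os.path.join(workdir, 'nb_model_final.sav')

with open(vocab_path, 'w') as f:
    json.dump({term: int(index) for term, index in vect.vocabulary_.items()}, f)
with open(model_path, 'wb') as f:
    pickle.dump(clf, f)

# Serving side: rebuild the vectorizer from the vocabulary and load the model
def load_predictor(vocab_path, model_path):
    """Return a function that maps a review text to a predicted rating label."""
    with open(vocab_path) as f:
        vocabulary = json.load(f)
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    vectorizer = CountVectorizer(vocabulary=vocabulary)

    def predict_rating(text):
        return model.predict(vectorizer.transform([text]))[0]

    return predict_rating

predict_rating = load_predictor(vocab_path, model_path)
print(predict_rating("great fun"))
```

A Flask route could call `predict_rating` on the submitted review text; loading both files once at start-up avoids re-reading them on every request.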

## Final Evaluation Score 
### Final evaluation score (RMSE) on the test set for clf_alpha_0, which had the lowest error on the development set:
## 1.7011183367884064

## Contributions
#### Implementing the Multinomial Naive Bayes Classifier, analysing the structure of the dataset, preprocessing, and removing less frequently used words.
#### Creating the Flask web app that deploys the trained model for online testing on PythonAnywhere.com.

## Challenges faced and Conclusion

### I spent a substantial amount of time on this project working out how to overcome the challenges I ran into. During development, I had difficulty defining and training the model with more than 10,000 features. After some time on the issue, I found that scikit-learn provides a class called CountVectorizer, which can be used to make the process faster and more efficient. For deployment, I was not sure how to deploy the trained model on the web app without uploading the dataset. After spending a lot of time on Flask and researching what I needed to know to deploy the web app successfully, I understood that I needed a separate file, my 'main.py', where the serialisation process for the web app is handled. Moreover, the file name has to have the prefix main in order to work; without this the Flask app can be deployed but the trained model won't run.

## References

### 1. Basics of working with text data: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
### 2. Deployment tutorial: https://medium.com/better-programming/deploy-a-react-app-to-google-cloud-platform-using-google-app-engine-3f74fbd537ec
### 3. CountVectorizer basics: https://stackoverflow.com/questions/44083683/countvectorizer-with-pandas-dataframe/44083903
### 4. Multinomial Naive Bayes basics: https://towardsdatascience.com/multinomial-naive-bayes-classifier-for-text-analysis-python-8dd6825ece67
### 5. Flask basics: https://www.tutorialspoint.com/flask/index.htm
### 6. Naive Bayes Classifier basics: https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html