This notebook builds the model used in the final application, where a user can enter a review of Animal Crossing: New Horizons and receive a predicted sentiment for it.

In [1]:
import pickle

In [2]:
import pandas as pd
import numpy as np

from nltk.corpus import stopwords
from nltk import word_tokenize, regexp_tokenize, FreqDist
from nltk.stem import WordNetLemmatizer
import string
import re

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

import warnings
warnings.filterwarnings('ignore')

In [3]:
from sklearn.metrics import plot_confusion_matrix, recall_score, f1_score

For our model to process raw text, I define a function that reuses the tokenizing, stopword removal, and lemmatizing steps from the original notebook, so that any review text goes through the same cleaning and preprocessing before it reaches the model.

In [4]:
def text_processing(user_input):
    # Keep alphabetic tokens, allowing one apostrophe suffix (e.g. "can't")
    pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
    review_text = regexp_tokenize(user_input, pattern)
    review_text = ' '.join(review_text)
    review_text = review_text.lower()

    # Remove English stopwords, punctuation, and the domain words below
    stopwords_list = stopwords.words('english')
    stopwords_list += list(string.punctuation)
    stopwords_list += ['game', 'animal', 'crossing']
    review_text = [w for w in review_text.split() if w not in stopwords_list]

    # Lemmatize each remaining token and rejoin into a single string
    lemmatizer = WordNetLemmatizer()
    review_text = [lemmatizer.lemmatize(w) for w in review_text]
    review_text = ' '.join(review_text)

    return review_text
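
To see what the tokenizing and stopword stages contribute, here is a minimal, dependency-free sketch. It assumes that `re.findall` with the same pattern yields the same tokens as `regexp_tokenize`, it swaps in a tiny hand-written stopword set (`TOY_STOPWORDS`, hypothetical) for NLTK's English list, and it skips lemmatization, so its output is illustrative rather than identical to `text_processing`.

```python
import re

# Same token pattern as text_processing: runs of letters, optionally
# followed by an apostrophe and lowercase letters (e.g. "can't").
PATTERN = r"([a-zA-Z]+(?:'[a-z]+)?)"

# Hypothetical stand-in for NLTK's English stopword list plus the
# domain words removed in the notebook.
TOY_STOPWORDS = {"i", "this", "so", "to", "the", "game", "animal", "crossing"}


def tokenize_and_filter(text):
    """Tokenize, lowercase, and drop stopwords (no lemmatization)."""
    tokens = [t.lower() for t in re.findall(PATTERN, text)]
    return [t for t in tokens if t not in TOY_STOPWORDS]


print(tokenize_and_filter("I can't wait to play this game!"))
```

Punctuation and digits never match the pattern, so they are dropped at the tokenizing stage before the stopword filter runs.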

To recreate the model for pickling and loading into our application, I import the same dataset we originally used and label it the same way as in my initial notebook.

In [5]:
df = pd.read_csv('user_reviews-Copy1.csv')

# Map each grade to a label: 8 and above is positive, 4 and below is
# negative, and anything in between is neutral
def sentiment_labels(row):
    if row['grade'] >= 8:
        val = 'positive'
    elif row['grade'] <= 4:
        val = 'negative'
    else:
        val = 'neutral'
    return val

df['sentiment'] = df.apply(sentiment_labels, axis=1)
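
The thresholds in `sentiment_labels` can be checked at their boundaries with a small standalone sketch. The helper name `label_for_grade` is hypothetical, and the sample grades assume integer scores on a 0 to 10 scale, which the notebook does not state explicitly.

```python
def label_for_grade(grade):
    """Map a numeric review grade to a sentiment label."""
    if grade >= 8:
        return "positive"
    if grade <= 4:
        return "negative"
    return "neutral"


# Probe the values on either side of each threshold.
for grade in (0, 4, 5, 7, 8, 10):
    print(grade, label_for_grade(grade))
```

Grades 5 through 7 form the neutral band; everything else falls into one of the two outer classes.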

In [19]:
df

Unnamed: 0,grade,user_name,text,date,sentiment
0,4,mds27272,My gf started playing before me. No option to ...,2020-03-20,negative
1,5,lolo2178,"While the game itself is great, really relaxin...",2020-03-20,neutral
2,0,Roachant,My wife and I were looking forward to playing ...,2020-03-20,negative
3,0,Houndf,We need equal values and opportunities for all...,2020-03-20,negative
4,0,ProfessorFox,BEWARE! If you have multiple people in your h...,2020-03-20,negative
...,...,...,...,...,...
2994,1,TakezoShinmen,1 Island for console limitation.I cannot play ...,2020-05-03,negative
2995,1,Pikey17,"Per giocare con figli o fidanzate, mogli o per...",2020-05-03,negative
2996,0,Lemmeadem,One island per console is a pathetic limitatio...,2020-05-03,negative
2997,2,TandemTester938,Even though it seems like a great game with ma...,2020-05-03,negative


Since we already have our best model and optimal parameters, we don't need to perform a train-test split on our dataset again. Instead, I assign the `text` and `sentiment` columns to X and y as our feature and target, then apply the `text_processing` function to prepare the `text` column for our model.

In [6]:
X = df.text
y = df.sentiment

In [7]:
model_X = X.apply(text_processing)
model_X

0       gf started playing option create island guy nd...
1       great really relaxing gorgeous can't ignore on...
2       wife looking forward playing released bought l...
3       need equal value opportunity player island wif...
4       beware multiple people house want play account...
                              ...                        
2994    island console limitation cannot play girlfrie...
2995    per giocare con figli fidanzate mogli persone ...
2996    one island per console pathetic limitation end...
2997    even though seems like great many item charact...
2998    fantastic nintendo deciding make one island pe...
Name: text, Length: 2999, dtype: object

Using the optimal parameters found by the grid search on our logistic regression model, we create a pipeline and fit it to the processed text.

In [12]:
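# Note: with penalty='none', scikit-learn ignores C, so C=0.2 has no effect here
model = LogisticRegression(C=0.2, penalty='none')

tfidf_vectorizer = TfidfVectorizer()
smote = SMOTE(sampling_strategy='not majority')

# imblearn's pipeline applies SMOTE during fit only, not at prediction time
pipeline = make_pipeline(tfidf_vectorizer, smote, model)

pipeline.fit(model_X, y)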

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('smote', SMOTE(sampling_strategy='not majority')),
                ('logisticregression',
                 LogisticRegression(C=0.2, penalty='none'))])

In [13]:
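# Note: this scores the raw reviews (X) rather than the cleaned text (model_X)
# used in fit, and on the training data itself, so it is not a held-out score.
y_preds = pipeline.predict(X)
recall_score(y, y_preds, average='micro')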

0.9649883294431477
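
When every review has exactly one label, micro-averaged recall pools true positives and false negatives across all classes, which makes it equal to plain accuracy. The sketch below checks this by hand on made-up labels; the helper names and the toy data are hypothetical, and it does not call scikit-learn.

```python
def micro_recall(y_true, y_pred):
    """Pool true positives and false negatives over all classes."""
    true_pos = 0
    false_neg = 0
    for label in set(y_true):
        for truth, pred in zip(y_true, y_pred):
            if truth == label:
                if pred == label:
                    true_pos += 1
                else:
                    false_neg += 1
    return true_pos / (true_pos + false_neg)


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for illustration only.
y_true = ["positive", "positive", "negative", "negative", "neutral", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive", "negative"]

print(micro_recall(y_true, y_pred), accuracy(y_true, y_pred))
```

Because of this equivalence, `average='macro'` is the variant to reach for when per-class recall on the smaller classes matters.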

Using Pickle, we serialize (or "pickle") the pipeline we fitted on our pre-existing data into a `.pkl` file so that we can use the same model to predict sentiment later in our application. To test the saved model and make sure it works as intended, I reload the same `.pkl` file into this notebook and use it to predict the sentiment of test input strings.

In [14]:
with open('acnh_review_model_test.pkl', 'wb') as file:
    pickle.dump(pipeline, file)

In [15]:
with open('acnh_review_model_test.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

In [16]:
loaded_model.predict(X)

array(['negative', 'neutral', 'negative', ..., 'negative', 'positive',
       'negative'], dtype=object)
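
The save-and-reload round trip can be tried without the real model. This is a minimal sketch that uses a plain dictionary (`toy_model`, hypothetical) in place of the fitted pipeline and a temporary directory in place of the notebook's working folder.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted pipeline; any picklable object works.
toy_model = {"name": "toy", "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")

    # Context managers close the file even if dump or load raises.
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object is an equal copy, not the same object in memory.
print(restored == toy_model, restored is toy_model)
```

Unpickling can run arbitrary code, so the application should only load `.pkl` files it produced itself.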

In [17]:
# Test input #1: positive review
user_input = "I love this game so much!"
print('Input:', user_input)

cleaned_text = text_processing(user_input)
prediction = loaded_model.predict([cleaned_text])
print('Prediction:', prediction)

Input: I love this game so much!
Prediction: ['positive']


In [18]:
# Test input #2: negative review
user_input = "This game was terrible lol New Leaf was better"
print('Input:', user_input)

cleaned_text = text_processing(user_input)
prediction = loaded_model.predict([cleaned_text])
print('Prediction:', prediction)

Input: This game was terrible lol New Leaf was better
Prediction: ['negative']


Both test inputs were labeled as expected, so we're ready to load our pickled model into an external application!
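
Inside the application, the two steps used in the test cells (clean the text, then predict) could be wrapped in a single helper. The sketch below is one possible shape, not the application's actual code: `predict_sentiment` and `KeywordModel` are hypothetical names, and the stub model only exists so the example runs without the pickled pipeline.

```python
def predict_sentiment(review, model, preprocess):
    """Clean one raw review and return the model's single predicted label."""
    cleaned = preprocess(review)
    # predict expects an iterable of documents, so wrap the string in a list.
    return model.predict([cleaned])[0]


class KeywordModel:
    """Hypothetical stub exposing the same predict() shape as the pipeline."""

    def predict(self, docs):
        return ["negative" if "terrible" in d else "positive" for d in docs]


# str.lower stands in for text_processing in this self-contained example.
print(predict_sentiment("This was TERRIBLE", KeywordModel(), str.lower))
```

With the real objects from this notebook, the call would look like `predict_sentiment(user_input, loaded_model, text_processing)`.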