## Workbook 03_Sentiment_Analysis
This workbook contains code for the first two models (VADER, Naive Bayes) attempting to carry out Sentiment Analysis.

The result of this workbook is a .csv file containing 'Input Game' reviews with their respective Topics and Compound Sentiment score. This .csv file is to be used in Tableau to present our results.

VADER model and Naives Bayes model is conducted using topic_modelling2.csv file which contains all the reviews (not filtered by topics)

In [12]:
import pandas as pd
import numpy as np

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

import gensim

#### Importing Dataframe
There are two codes to import different dataframes:
- <b>topic_modelling2.csv</b> comes from our Topic Model (02_Finalizing Topic DF).
- <b>steam_df.csv</b> contains the Top 10 Games and is used to test the accuracy of our models with more data.

In [14]:
# Bigger dataframe to test accuracy of models
# This code takes over a minute to run
steam_df = pd.read_csv('./topic_modelling.csv', index_col=0)

In [15]:
steam_df.head(500000)
steam_df['app_name'].unique()
steam_df.head()
# print(len(steam_df)) --> 103274

Unnamed: 0,app_id,app_name,review_id,language,review,recommended,author.steamid,author.playtime_at_review,review_length,clean_review,clean_review_str,bigram_review,trigram_review
0,292030,The Witcher 3: Wild Hunt,85184605,english,"One of the best RPG's of all time, worthy of a...",1.0,76561199054755373,5524.0,12,"['one', 'rpg', 'time', 'worthy', 'collection']",one rpg time worthy collection,"['one_rpg', 'rpg_time', 'time_worthy', 'worthy...","['one_rpg_time', 'rpg_time_worthy', 'time_wort..."
1,292030,The Witcher 3: Wild Hunt,85184171,english,"good story, good graphics. lots to do.",1.0,76561198170193529,823.0,7,"['story', 'graphic']",story graphic,['story_graphic'],[]
2,292030,The Witcher 3: Wild Hunt,85180436,english,favorite game of all time cant wait for the Ne...,1.0,76561198065591528,23329.0,11,"['favorite', 'time', 'cant', 'wait', 'nexgen',...",favorite time cant wait nexgen versiion,"['favorite_time', 'time_cant', 'cant_wait', 'w...","['favorite_time_cant', 'time_cant_wait', 'cant..."
3,292030,The Witcher 3: Wild Hunt,85179753,english,Why wouldn't you get this,1.0,76561198996835044,8557.0,5,"['would', 'get']",would get,['would_get'],[]
4,292030,The Witcher 3: Wild Hunt,85177892,english,"Very Fun, Would play again!",1.0,76561198040190687,20092.0,5,"['fun', 'would']",fun would,['fun_would'],[]


In [16]:
steam_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 103274 entries, 0 to 103273
Data columns (total 13 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   app_id                     103274 non-null  int64  
 1   app_name                   103274 non-null  object 
 2   review_id                  103274 non-null  int64  
 3   language                   103274 non-null  object 
 4   review                     103274 non-null  object 
 5   recommended                103274 non-null  float64
 6   author.steamid             103274 non-null  int64  
 7   author.playtime_at_review  103274 non-null  float64
 8   review_length              103274 non-null  int64  
 9   clean_review               103274 non-null  object 
 10  clean_review_str           101488 non-null  object 
 11  bigram_review              103274 non-null  object 
 12  trigram_review             103274 non-null  object 
dtypes: float64(2), int64(4), object(7)

## 3. Task A: Topic-based Sentiment Analysis
### Sentiment Analysis
Now that we have extracted the dominant topics from our reviews. We will use Sentiment Analysis to find the sentiment scores for each review (more specifically <b><u>Sentiment Polarity Classification</b></u>). Later combining the scores with the dominant topics we have identified.

<!-- #### Data Preprocessing -->
<!-- In terms of how to analyse the sentiment polarity, we decided to do it document-level (per review), rather than sentence-level. This is because of how we are finidng a single dominant topic to label each review. -->

### Model 1: VADER
Vader (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool that is specifically designed for analyzing social media texts. Vader is a pre-trained sentiment analysis model that provides a sentiment score for a given text.

In [17]:
analyzer = SentimentIntensityAnalyzer()

In [18]:
# sample_text = steam_df['review'][0]

# scores = analyzer.polarity_scores(sample_text)
# print(scores)

In [19]:
steam_df['compound_sentiment'] = steam_df['review'].map(lambda x: analyzer.polarity_scores(x)['compound'])
# steam_df['compound_sentiment'] = steam_df['review_sent'].map(lambda x: analyzer.polarity_scores(x)['compound']) # Using review_sent decreased the accuracy score

In [20]:
steam_df.head()

Unnamed: 0,app_id,app_name,review_id,language,review,recommended,author.steamid,author.playtime_at_review,review_length,clean_review,clean_review_str,bigram_review,trigram_review,compound_sentiment
0,292030,The Witcher 3: Wild Hunt,85184605,english,"One of the best RPG's of all time, worthy of a...",1.0,76561199054755373,5524.0,12,"['one', 'rpg', 'time', 'worthy', 'collection']",one rpg time worthy collection,"['one_rpg', 'rpg_time', 'time_worthy', 'worthy...","['one_rpg_time', 'rpg_time_worthy', 'time_wort...",0.7964
1,292030,The Witcher 3: Wild Hunt,85184171,english,"good story, good graphics. lots to do.",1.0,76561198170193529,823.0,7,"['story', 'graphic']",story graphic,['story_graphic'],[],0.7003
2,292030,The Witcher 3: Wild Hunt,85180436,english,favorite game of all time cant wait for the Ne...,1.0,76561198065591528,23329.0,11,"['favorite', 'time', 'cant', 'wait', 'nexgen',...",favorite time cant wait nexgen versiion,"['favorite_time', 'time_cant', 'cant_wait', 'w...","['favorite_time_cant', 'time_cant_wait', 'cant...",0.4588
3,292030,The Witcher 3: Wild Hunt,85179753,english,Why wouldn't you get this,1.0,76561198996835044,8557.0,5,"['would', 'get']",would get,['would_get'],[],0.0
4,292030,The Witcher 3: Wild Hunt,85177892,english,"Very Fun, Would play again!",1.0,76561198040190687,20092.0,5,"['fun', 'would']",fun would,['fun_would'],[],0.7614


#### Calculating Accuracy of Model
To calculate the accuracy metric of our model, we need to find a ratio of correct predictions. To do so, we'll be using the 'Recommended' data we have. Every Steam review requires you to select 'Recommended' or 'Not Recommended' for your review. An assumption we're making is that a compound sentiment score of at least 0.5 and above, counts as positive/recommended.

In [21]:
# Example of misclassified sentiment
print(steam_df['review'].iloc[2])
print(steam_df['recommended'].iloc[2])
print(steam_df['compound_sentiment'].iloc[2])

# Favourable review, and is Recommended, but the compound score says otherwise.

favorite game of all time cant wait for the NexGen Versiion
1.0
0.4588


In [22]:
def compound_recommened(compound_sentiment):
    if compound_sentiment >= 0.5:
        return 1.0
    elif compound_sentiment <= 0.5:
        return 0.0

In [23]:
steam_df['compound_recommended'] = steam_df['compound_sentiment'].map(lambda x: compound_recommened(x))

In [24]:
steam_df.head()

Unnamed: 0,app_id,app_name,review_id,language,review,recommended,author.steamid,author.playtime_at_review,review_length,clean_review,clean_review_str,bigram_review,trigram_review,compound_sentiment,compound_recommended
0,292030,The Witcher 3: Wild Hunt,85184605,english,"One of the best RPG's of all time, worthy of a...",1.0,76561199054755373,5524.0,12,"['one', 'rpg', 'time', 'worthy', 'collection']",one rpg time worthy collection,"['one_rpg', 'rpg_time', 'time_worthy', 'worthy...","['one_rpg_time', 'rpg_time_worthy', 'time_wort...",0.7964,1.0
1,292030,The Witcher 3: Wild Hunt,85184171,english,"good story, good graphics. lots to do.",1.0,76561198170193529,823.0,7,"['story', 'graphic']",story graphic,['story_graphic'],[],0.7003,1.0
2,292030,The Witcher 3: Wild Hunt,85180436,english,favorite game of all time cant wait for the Ne...,1.0,76561198065591528,23329.0,11,"['favorite', 'time', 'cant', 'wait', 'nexgen',...",favorite time cant wait nexgen versiion,"['favorite_time', 'time_cant', 'cant_wait', 'w...","['favorite_time_cant', 'time_cant_wait', 'cant...",0.4588,0.0
3,292030,The Witcher 3: Wild Hunt,85179753,english,Why wouldn't you get this,1.0,76561198996835044,8557.0,5,"['would', 'get']",would get,['would_get'],[],0.0,0.0
4,292030,The Witcher 3: Wild Hunt,85177892,english,"Very Fun, Would play again!",1.0,76561198040190687,20092.0,5,"['fun', 'would']",fun would,['fun_would'],[],0.7614,1.0


Calculate Accuracy Score to compare with other models

accuracy_score(y_true, y_pred)
<br>
y_true: The true labels from your dataset. y_pred: The predicted labels from your model.
<br>
This function compares both labels and finds the proportion of correct predictions made by the model out of all total cases.
<br>
<br>
Limtations:
- Some reviews might be negative, but the user indicated the review as 'Recommended'. This is a misclassifcation in our truth label.

In [25]:
from sklearn.metrics import accuracy_score

accuracy_score(steam_df['recommended'], steam_df['compound_recommended'])

0.7289056296841412

In [26]:
# Filter by Input Game
input_name = "The Witcher 3: Wild Hunt"

steam_df = steam_df[steam_df['app_name'] == input_name]

In [27]:
steam_df.head()

Unnamed: 0,app_id,app_name,review_id,language,review,recommended,author.steamid,author.playtime_at_review,review_length,clean_review,clean_review_str,bigram_review,trigram_review,compound_sentiment,compound_recommended
0,292030,The Witcher 3: Wild Hunt,85184605,english,"One of the best RPG's of all time, worthy of a...",1.0,76561199054755373,5524.0,12,"['one', 'rpg', 'time', 'worthy', 'collection']",one rpg time worthy collection,"['one_rpg', 'rpg_time', 'time_worthy', 'worthy...","['one_rpg_time', 'rpg_time_worthy', 'time_wort...",0.7964,1.0
1,292030,The Witcher 3: Wild Hunt,85184171,english,"good story, good graphics. lots to do.",1.0,76561198170193529,823.0,7,"['story', 'graphic']",story graphic,['story_graphic'],[],0.7003,1.0
2,292030,The Witcher 3: Wild Hunt,85180436,english,favorite game of all time cant wait for the Ne...,1.0,76561198065591528,23329.0,11,"['favorite', 'time', 'cant', 'wait', 'nexgen',...",favorite time cant wait nexgen versiion,"['favorite_time', 'time_cant', 'cant_wait', 'w...","['favorite_time_cant', 'time_cant_wait', 'cant...",0.4588,0.0
3,292030,The Witcher 3: Wild Hunt,85179753,english,Why wouldn't you get this,1.0,76561198996835044,8557.0,5,"['would', 'get']",would get,['would_get'],[],0.0,0.0
4,292030,The Witcher 3: Wild Hunt,85177892,english,"Very Fun, Would play again!",1.0,76561198040190687,20092.0,5,"['fun', 'would']",fun would,['fun_would'],[],0.7614,1.0


#### Exporting dataframe for use in Tableau

In [28]:
steam_df_4topics.to_csv('./witcher3_analysis.csv')

--------------------------
### Model 2: Naive Bayes Classifier (Supervised Sentiment Polarity Classification)

In [29]:
from sklearn.model_selection import train_test_split

In [30]:
# Convert original reviews to list, and recommended binary to list as well.
review_list = steam_df['review'].tolist()
labels = steam_df['recommended'].tolist()

In [31]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(review_list, labels, test_size=0.3, random_state=42)

In [32]:
# Create train corpus
corpus_train = []

for entry in X_train:
    corpus_train.append(nltk.word_tokenize(entry))

print(corpus_train[0:5])

[['After', 'playing', 'the', 'Elder', 'Scrolls', 'games', 'for', 'years', 'i', 'doubt', 'that', 'i', 'would', 'find', 'a', 'game', 'near', 'as', 'fun', 'as', 'those', '.', 'I', 'WAS', 'WRONG', '.', 'This', 'is', 'such', 'a', 'great', '&', 'amazing', 'game', ',', 'im', 'in', 'love', 'with', 'it', '.', 'The', 'history', ',', 'the', 'combat', ',', 'the', 'minigames', ',', 'the', 'lore', ',', 'the', 'music', ',', 'the', 'choices', ',', 'everything', 'is', 'fantastic', '.', '10/10'], ['One', 'of', 'the', 'coolest', 'games', 'i', 'ever', 'played'], ['Could', "n't", 'recommend', 'it', 'enough', ',', 'my', 'life', 'has', 'been', 'transformed', 'to', 'only', 'talk', 'about', 'this', 'game', 'for', 'three', 'months', 'straight', '.', '10/10'], ['Best', 'RPG', 'I', "'ve", 'ever', 'played', ',', 'I', 'would', 'recommend', 'reading', 'the', 'books', 'if', 'you', 'fell', 'in', 'love', 'with', 'the', 'universe', '.'], ['why', 'I', 'waited', 'so', 'long', 'to', 'get', 'this', 'game', 'is', 'my', 'only

In [33]:
# Create a dictionary from the corpus.
dictionary = gensim.corpora.Dictionary(corpus_train)

# Store the labeled training data in the following list.
labeled_training_data = []
    
# Going through the two lists in parallel to create the labeled data set.
for (l, s) in zip(y_train, corpus_train):

    # Convert the original sentence into a vector.
    vector = dictionary.doc2bow(s)
    
    # Create a dict object to store the document vector (in order to use NLTK's classifier later)
    sent_as_dict = {id:1 for (id, tf) in vector}
    
    # Add the labeled sentence to the labeled data set.
    labeled_training_data.append((sent_as_dict, l))

In [34]:
print(labeled_training_data[:5])

[({0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1, 33: 1, 34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1, 44: 1}, 1.0), ({24: 1, 27: 1, 40: 1, 45: 1, 46: 1, 47: 1, 48: 1, 49: 1}, 1.0), ({1: 1, 2: 1, 3: 1, 21: 1, 23: 1, 31: 1, 50: 1, 51: 1, 52: 1, 53: 1, 54: 1, 55: 1, 56: 1, 57: 1, 58: 1, 59: 1, 60: 1, 61: 1, 62: 1, 63: 1, 64: 1, 65: 1, 66: 1}, 1.0), ({1: 1, 2: 1, 6: 1, 29: 1, 33: 1, 40: 1, 42: 1, 43: 1, 47: 1, 49: 1, 60: 1, 67: 1, 68: 1, 69: 1, 70: 1, 71: 1, 72: 1, 73: 1, 74: 1, 75: 1}, 1.0), ({1: 1, 2: 1, 6: 1, 12: 1, 23: 1, 30: 1, 31: 1, 42: 1, 57: 1, 59: 1, 63: 1, 65: 1, 76: 1, 77: 1, 78: 1, 79: 1, 80: 1, 81: 1, 82: 1, 83: 1, 84: 1, 85: 1}, 1.0)]


In [35]:
# Training a classifier.
# Choose one of the following two classification algorithms to train a classifier.
classifier = nltk.NaiveBayesClassifier.train(labeled_training_data)


# MaxEnt Classifier
#Set iterations and the define classifer
# numIterations = 5
# algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]

#Train the classifier 
# classifier = nltk.MaxentClassifier.train(labeled_training_data, algorithm, max_iter=numIterations)

In [36]:
import nltk

# Example function to convert a review into a feature set
# This should match how you prepared your labeled_training_data
def extract_features_from_review(review):
    # Your feature extraction logic here
    features = {}
    # Example simple feature model: presence of words
    for word in nltk.word_tokenize(review):
        features[f'contains({word})'] = True
    return features

# Example reviews
reviews = review_list[0:31]


# Predict sentiments for each review
predicted_sentiments = []
for review in review_list:
    features = extract_features_from_review(review)  # Convert review to features
    sentiment = classifier.classify(features)  # Use classifier to predict
    predicted_sentiments.append(sentiment)

# Display each review with its predicted sentiment
for review, sentiment in zip(reviews, predicted_sentiments):
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")


Review: One of the best RPG's of all time, worthy of any collection
Predicted Sentiment: 1.0

Review: good story, good graphics. lots to do.
Predicted Sentiment: 1.0

Review: favorite game of all time cant wait for the NexGen Versiion
Predicted Sentiment: 1.0

Review: Why wouldn't you get this
Predicted Sentiment: 1.0

Review: Very Fun, Would play again!
Predicted Sentiment: 1.0

Review: The game is enjoyable enough but...
-Combat has plenty of options but the game will play for you at times, takes away from player achievements.
-The Story is good so far but so much of it, listening to every nagging thought is getting old fast.
-The swords and armor having a level is retarded, how is it that this legendary warrior cannot use the same sword of the 8 guys he just killed. This feature is killing the game for me.
-The repairs and crafting I hate, its not my thing. never is.
Predicted Sentiment: 1.0

Review: The only thing bigger than the world map is ur mom
Predicted Sentiment: 1.0

Review

In [37]:
# Create test corpus
corpus_test = []

for entry in X_test:
    corpus_test.append(nltk.word_tokenize(entry))

print(corpus_test[0:5])

[['Played', 'on', 'switch', 'first', ',', 'but', '100', '%', 'grab', 'this', 'game', 'when', 'it', 'goes', 'on', 'sale', ',', 'which', 'happens', 'what', 'seems', 'like', 'monthly', ',', 'being', 'able', 'to', 'play', 'in', '4k', 'with', 'mods', 'and', 'other', 'addons', 'was', 'very', 'enjoyable', 'and', 'the', 'PC', 'port', 'was', 'amazing', '.'], ['Every', 'minute', 'you', 'spend', 'on', 'this', 'game', 'is', 'amazing', '.'], ['The', 'Game', 'of', 'The', 'Decade', 'is', 'Here', '.'], ['This', 'game', 'is', 'great', 'endless', 'stuff', 'too', 'do', '...', 'the', 'graphics', 'are', 'really', 'good', 'and', 'it', 'was', 'optimized', 'to', 'it', "'s", 'fullest', 'with', 'years', 'of', 'updates', 'after', 'release', '.'], ['Fantastic', 'game', 'both', 'as', 'a', 'standalone', 'title', 'and', 'the', 'last', 'game', 'of', 'a', 'francise', '.', 'Great', 'visuals', ',', 'characters', 'and', 'story', '.', 'CDR', 'is', 'perhaps', 'the', 'only', 'developer', 'that', 'actually', 'expands', 'the'

In [38]:
# Testing the accuracy of the classifier on a test data.
labeled_test_data = []
    
# Going through the two lists in parallel to create the labeled data set.
for (l, s) in zip(y_test, corpus_test):

    # Convert the original sentence into a vector.
    vector = dictionary.doc2bow(s)
    
    # Create a dict object to store the document vector (in order to use NLTK's classifier later)
    sent_as_dict = {id:1 for (id, tf) in vector}
    
    # Add the labeled sentence to the labeled data set.
    labeled_test_data.append((sent_as_dict, l))

In [39]:
# Test the accuracy
print("Accuracy on test data: ", nltk.classify.accuracy(classifier, labeled_test_data))

Accuracy on test data:  0.2861892005293225


---------------------------------------------------------
### Model 3: Logistic Regression (in 03A workbook)

-------------------------------
## 4. Analysis of Results
In this section, we prepare our data one last time to be imported into Tableau.