# Sentiment Analysis of Consumer Game Reviews
## John Le

## Introduction
Consumer satisfaction is one of the most important factors in a product's success: high satisfaction builds a stronger reputation and drives sales. Direct feedback on the product also gives the company data for identifying areas to improve. This is especially important for video games. Every kind of game, from big-budget Triple-A titles to small indie games, will encounter bugs and glitches that ruin the player's experience. These need to be fixed in later patches, and the speed and quality of those fixes affect a game's success. Consumer criticism tells developers what they are doing right and what needs their attention, and the more often an issue appears in reviews, the more urgent it is. Positive reviews posted after an update also increase the chance that potential buyers will purchase the game, because they signal responsive developers.

Consumer reviews come in many mediums, including videos, forum posts, social media, review blogs, and reviews posted directly on store pages. This analysis focuses on store page text reviews because the store page is a central location where users who have purchased and played the game share their experience and feedback with the developer and with prospective buyers. Store pages generally let the reviewer select whether they recommend the game. A classifier that predicts this recommendation from the text alone, however, would make mediums that were not designed for consumer reviews (social media posts, forum posts, videos and video comments, etc.) usable as input as well. It could also serve store pages that allow reviews without a recommendation, where the store may want to determine or suggest automatically whether a review recommends the game.

This notebook will be using NLP and sentiment analysis to classify whether a game review has positive or negative sentiments, while also demonstrating the data science life cycle. The dataset will be reviews from the Steam store, which is a digital hub for video games. There is a dedicated section for each video game store page where consumers can write a review. Steam was chosen because it already provides the recommendation status for each review by the reviewer, which is perfect for supervised learning.



### General and Web Scraping Library Imports

In [111]:
#Web Scraping: requests for GET requests, 
#beautifulsoup for handling scraped data, json for formatting,
#regex for matching patterns from data observations
import requests
from bs4 import BeautifulSoup
import json
import re 

#Computational and Data Handling libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



### Data Collection
The first step is deciding which games to scrape reviews from. The goal is to choose games with a mix of recommending and non-recommending reviews and with a decent number of reviews overall. I decided to scrape data on the top 1000 games ranked by current players from https://steamcharts.com/top. Although ranking by current players may cause some inconsistency, the "Peak Players" values are high enough to suggest a sufficient number of reviews. Games with lower peak or current player counts may help with getting negative reviews, but could also leave too few review observations.

In [178]:
#Send a request to the first page of the site, each page has 25 games
top1000_r = requests.get('https://steamcharts.com/top', headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Mobile Safari/537.36'})
r = BeautifulSoup(top1000_r.text, 'html.parser')  #name the parser explicitly to avoid the "no parser specified" warning
#r

In [117]:
#Extract the table headers: Name, Current Players, Last 30 Days, Peak Players, Hours Played
cols = r.findAll('th')
cols

[<th></th>,
 <th class="left">Name</th>,
 <th>Current Players</th>,
 <th class="period-col" id="topgames-chart-head">Last 30 Days</th>,
 <th class="period-col">Peak Players</th>,
 <th class="period-col">Hours Played</th>]

In [120]:
tables = r.findAll('table')
#print(tables) there is only one table per page
table = tables[0]
#table

In [122]:
#Obtain all of the values row by row, retrieving the Current Players, Peak Players, and Hours Played for each game
totRows = []
for rows in table.findAll('tr'):
  #Each value had a pattern of being surrounded by an ending tag and opening tag marker from <td ...> some_val </td>
  row = re.findall(r'(?<=[>])\d+(?=[<])', str(rows))
  totRows.append(row)

totRows = totRows[1:]     #First row corresponded to headers, removed
totRows = np.array(totRows)
print(totRows)  

[['312928' '756170' '311258956']
 ['312569' '950586' '400799681']
 ['110753' '128825' '61681428']
 ['86111' '195892' '68611536']
 ['77726' '242823' '87305456']
 ['76818' '363839' '110584929']
 ['74367' '154358' '40305999']
 ['72652' '173538' '58675485']
 ['71789' '125911' '52617260']
 ['71365' '215058' '51423073']
 ['55874' '118297' '35614450']
 ['54426' '92234' '31970692']
 ['43226' '173850' '40202041']
 ['40136' '105188' '38121785']
 ['38273' '73945' '32407183']
 ['36082' '57707' '20352144']
 ['33760' '58556' '22598174']
 ['32048' '84258' '30403760']
 ['31637' '83292' '29424405']
 ['30175' '54250' '8662287']
 ['28997' '50975' '22962840']
 ['28614' '79681' '35356068']
 ['27141' '35441' '16962150']
 ['25851' '30095' '7473201']
 ['25568' '70692' '24549362']]
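The lookbehind/lookahead pattern used above can be tried on a single row in isolation. The following is a minimal sketch: `row_html` is a hypothetical row written to resemble the table structure described in the comments, not markup copied from the site.

```python
import re

# Hypothetical row, shaped like the rows described above (not real page markup)
row_html = ('<tr><td>1.</td>'
            '<td class="game-name left"><a href="/app/570">Dota 2</a></td>'
            '<td class="num">312928</td>'
            '<td class="num period-col">756170</td>'
            '<td class="num period-col">311258956</td></tr>')

# Keep only runs of digits that sit directly between '>' and '<'
numeric_cells = re.findall(r'(?<=>)\d+(?=<)', row_html)
print(numeric_cells)  # ['312928', '756170', '311258956']

# The appid is the run of digits at the end of the link target
appid = re.search(r'\d+$', '/app/570').group(0)
print(appid)  # 570
```

The rank cell (`1.`) and the name (`Dota 2`) are skipped because they are not pure digit runs enclosed by tags. Any cell that is a bare number with no surrounding whitespace would be matched as well, which is worth keeping in mind if a row ever yields more than three values.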


In [123]:
#Obtained all of the names and appids columnwise, and stored them in an np array
names = []
appids = []
for a in tables[0].findAll('a'):
  #Each game listed has a corresponding name and appid
  names.append(a.contents[0].strip())
  #The appid was in the pattern <a href="/app/some_appid">
  id = re.search(r'\d+$', str(a['href']))
  appids.append(id.group(0))

print(names)
appids = np.array(appids)
appids = pd.to_numeric(appids)
print(appids)

['Dota 2', 'Counter-Strike: Global Offensive', 'Team Fortress 2', 'New World', 'Apex Legends', 'PUBG: BATTLEGROUNDS', 'Destiny 2', 'Grand Theft Auto V', 'Rust', 'Halo Infinite', 'Warframe', 'ARK: Survival Evolved', 'NARAKA: BLADEPOINT', 'Wallpaper Engine', 'Dead by Daylight', '7 Days to Die', 'Rocket League', 'Terraria', "Tom Clancy's Rainbow Six Siege", 'SUPER PEOPLE CBT', "Sid Meier's Civilization VI", 'Football Manager 2022', 'Stardew Valley', 'Project Zomboid', 'Unturned']
[    570     730     440 1063730 1172470  578080 1085660  271590  252490
 1240440  230410  346110 1203220  431960  381210  251570  252950  105600
  359550 1619990  289070 1569040  413150  108600  304930]


In [124]:
#Created a dataframe using the extracted table columns through concatenation
names = pd.DataFrame(np.array(names), columns=['name'])
appids = pd.DataFrame(np.array(appids), columns=['appid'])
cols = pd.DataFrame(np.array(totRows), 
                    columns=['curr_players', 'peak_players', 'hours_played'])
top1000_games = pd.concat([names, appids ,cols], axis=1)
#top1000_games

In [125]:
#Repeat the same process for pages 2 through 40 (39 more pages), for data on a total of 25 games * 40 pages = 1000 games
page = 2
while page <= 40:
  #The next page's url had an addition /p.#, where # is the page number
  top1000_r = requests.get('https://steamcharts.com/top/p.' + str(page), headers
                           = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Mobile Safari/537.36'})
  r = BeautifulSoup(top1000_r.text, 'html.parser')

  cols = r.findAll('th')
  tables = r.findAll('table')
  table = tables[0]

  totRows = []
  for rows in tables[0].findAll('tr'):
    row = re.findall(r'(?<=[>])\d+(?=[<])', str(rows))
    totRows.append(row)
  totRows = totRows[1:]
  totRows = np.array(totRows)

  names = []
  appids = []
  for a in tables[0].findAll('a'):
    names.append(a.contents[0].strip())
    id = re.search(r'\d+$', str(a['href']))
    appids.append(id.group(0))
  appids = np.array(appids)
  appids = pd.to_numeric(appids)


  names = pd.DataFrame(np.array(names), columns=['name'])
  appids = pd.DataFrame(np.array(appids), columns=['appid'])
  cols = pd.DataFrame(np.array(totRows), columns=['curr_players', 'peak_players', 'hours_played'])
  temp_df = pd.concat([names, appids ,cols], axis=1)
  top1000_games = pd.concat([top1000_games, temp_df])  #DataFrame.append was removed in pandas 2.0

  page += 1
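The page addressing used by the loop can be summarised in a small helper. This is only a sketch: `BASE_URL` and `page_url` are names introduced here for illustration, and the `/p.#` suffix is taken from the comment in the loop above.

```python
BASE_URL = 'https://steamcharts.com/top'

def page_url(page):
    # Page 1 is the bare URL; later pages append /p.<page number>
    return BASE_URL if page == 1 else BASE_URL + '/p.' + str(page)

# 40 pages of 25 games each cover the top 1000 games
urls = [page_url(p) for p in range(1, 41)]
print(len(urls))   # 40
print(urls[0])     # https://steamcharts.com/top
print(urls[-1])    # https://steamcharts.com/top/p.40
```

Building the list of URLs first would let the first page and the remaining 39 pages go through the same parsing code instead of duplicating it.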


In [127]:
#Ensured the dataframe values were not of type object
top1000_games = top1000_games.reset_index(drop = True)
top1000_games['curr_players'] = top1000_games['curr_players'].apply(int)
top1000_games['peak_players'] = top1000_games['peak_players'].apply(int)
top1000_games['hours_played'] = top1000_games['hours_played'].apply(int)
top1000_games['name'] = top1000_games['name'].apply(str)

#Sorted with priority for peak_players
sorted_games = top1000_games.sort_values(by=['peak_players', 'curr_players', 'hours_played'], ascending = False)

sorted_games

Unnamed: 0,name,appid,curr_players,peak_players,hours_played
1,Counter-Strike: Global Offensive,730,312569,950586,400799681
0,Dota 2,570,312928,756170,311258956
5,PUBG: BATTLEGROUNDS,578080,76818,363839,110584929
4,Apex Legends,1172470,77726,242823,87305456
9,Halo Infinite,1240440,71365,215058,51423073
...,...,...,...,...,...
894,DISGAEA RPG,1791620,248,293,99143
973,The Last Spell,1105670,216,290,111492
911,Sonic Adventure™ 2,213610,241,268,109873
991,Freddy Fazbear's Pizzeria Simulator,738060,211,262,69072


The next step is to extract reviews from each game's Steam store page through the Steamworks API, documented at https://partner.steamgames.com/doc/store/getreviews. The request URL has the form 'https://store.steampowered.com/appreviews/' + appid + '?json=1'. The response is already in JSON format. To access more reviews, the requests also have to replicate the store page's scrolling behavior through a cursor, so the first request, res, is made to obtain the cursor element of the JSON dict. The following code walks through extracting up to 200 reviews from one game as an example.

In [128]:
appid = '413150'  #appid for Stardew Valley game
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Mobile Safari/537.36'}
res = requests.get('https://store.steampowered.com/appreviews/' + appid + '?json=1', headers = headers).json()
#res

In [129]:
#For viewing the response json dictionary keys, which reveals 'reviews' and 'cursor' being of interest
for key in res:
  print(key)

success
query_summary
reviews
cursor


100 was the maximum number of reviews allowed per request, so I requested 100 positive and 100 negative reviews per game in the parameters. This leads to a potential maximum of 200,000 reviews (200 * 1000 games). You can find more information and options on the parameters in the Steamworks API link above. 

In [130]:
#Uses cursor information from previous request to be encoded in the parameters of the following requests
cursor_scroll = res['cursor']
#Request for 100 positive reviews
res2 = requests.get('https://store.steampowered.com/appreviews/' + appid + '?json=1', params = {'cursor': cursor_scroll.encode(), 'language' : 'english', 'num_per_page' : 100, 'review_type': 'positive'}, headers = headers).json()
#Request for 100 negative reviews
res3 = requests.get('https://store.steampowered.com/appreviews/' + appid + '?json=1', params = {'cursor': cursor_scroll.encode(), 'language' : 'english', 'num_per_page' : 100, 'review_type': 'negative'}, headers = headers).json() 
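The cursor mechanism generalises to more than one follow-up request. The sketch below is not the code used in this notebook: `collect_reviews` and `fake_fetch` are hypothetical names, and `fake_fetch` is an in-memory stand-in for the HTTP request so the loop can run without network access. It assumes, as the API documentation describes, that `"*"` requests the first batch and that each response carries the cursor for the next one.

```python
def collect_reviews(fetch_page, max_reviews=200):
    """Follow the cursor from each response until enough reviews are
    collected, a page comes back empty, or a cursor repeats."""
    collected = []
    cursor = "*"              # assumed starting cursor for the first batch
    seen_cursors = set()
    while len(collected) < max_reviews and cursor not in seen_cursors:
        seen_cursors.add(cursor)
        page = fetch_page(cursor)
        if not page["reviews"]:
            break
        collected.extend(page["reviews"])
        cursor = page["cursor"]
    return collected[:max_reviews]


def fake_fetch(cursor):
    # Hypothetical stand-in for requests.get(...).json()
    pages = {
        "*": {"reviews": ["r1", "r2"], "cursor": "c1"},
        "c1": {"reviews": ["r3", "r4"], "cursor": "c2"},
        "c2": {"reviews": [], "cursor": "c2"},
    }
    return pages[cursor]


print(collect_reviews(fake_fetch, max_reviews=3))   # ['r1', 'r2', 'r3']
print(collect_reviews(fake_fetch, max_reviews=10))  # ['r1', 'r2', 'r3', 'r4']
```

In a real run, `fetch_page` would wrap the `requests.get` call shown above and pass the cursor in `params`.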

In [132]:
#Create a dataframe of the reviews data
example_df = pd.DataFrame(res2['reviews'])
#Add a column for the corresponding appid and name
example_df['appid'] = appid
example_df['name'] = sorted_games[sorted_games['appid'] == int(appid)]['name'].iloc[0]
example_df

Unnamed: 0,recommendationid,author,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,appid,name
0,105987644,"{'steamid': '76561199083768694', 'num_games_ow...",english,"Excelente juego, pero no me gusta.\n\nTodo bie...",1639763774,1639763774,True,2,0,0.545454561710357666,0,True,False,False,413150,Stardew Valley
1,105669224,"{'steamid': '76561198347020106', 'num_games_ow...",english,"this game is absolutely beautiful, I have not ...",1639263155,1639263155,True,2,0,0.545454561710357666,0,True,False,False,413150,Stardew Valley
2,105427384,"{'steamid': '76561199044232124', 'num_games_ow...",english,Yearly tradition of playing nothing but starde...,1638908646,1638908646,True,2,0,0.545454561710357666,1,True,False,False,413150,Stardew Valley
3,105066170,"{'steamid': '76561198398293327', 'num_games_ow...",english,comfort,1638446323,1638446323,True,2,0,0.545454561710357666,0,True,False,False,413150,Stardew Valley
4,105056683,"{'steamid': '76561199137703460', 'num_games_ow...",english,It's awesome to play when you have time to kil...,1638431809,1638431809,True,2,0,0.545454561710357666,0,True,False,False,413150,Stardew Valley
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,105884898,"{'steamid': '76561199207977976', 'num_games_ow...",english,gud\n,1639596221,1639596221,True,1,0,0.523809552192687988,0,True,False,False,413150,Stardew Valley
96,105884217,"{'steamid': '76561198307017785', 'num_games_ow...",english,This is a good game for those who like a relax...,1639595246,1639595246,True,1,0,0.523809552192687988,0,True,False,False,413150,Stardew Valley
97,105866373,"{'steamid': '76561198831060366', 'num_games_ow...",english,thinking about marrying haley,1639571085,1639571085,True,0,1,0.523809552192687988,0,True,False,False,413150,Stardew Valley
98,105860404,"{'steamid': '76561198820639002', 'num_games_ow...",english,quite the based game not gonna lie,1639560269,1639560269,True,1,0,0.523809552192687988,0,True,False,False,413150,Stardew Valley


That was the process for obtaining reviews for one game; the next code block repeats it for each of the top 1000 games. The main difference is that the positive and negative reviews are concatenated before the name and appid are attached, so that every review row carries the name and appid of the game from the current iteration.

In [133]:
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Mobile Safari/537.36'}
#reviews stores all of the requests's review content
reviews = pd.DataFrame()
#created a requests session to increase speed of repeated requests
session = requests.Session()

for appid in sorted_games.loc[:, 'appid']:
  appid_str = str(appid)
  res = session.get('https://store.steampowered.com/appreviews/' + appid_str + '?json=1', headers = headers).json()

  cursor_scroll = res['cursor']
  #res2: positive reviews 
  res2 = session.get('https://store.steampowered.com/appreviews/' + appid_str + '?json=1', params = {'cursor': cursor_scroll.encode(), 'language' : 'english', 'num_per_page' : 100, 'review_type': 'positive'}, headers = headers).json()
  #res3: negative reviews
  res3 = session.get('https://store.steampowered.com/appreviews/' + appid_str + '?json=1', params = {'cursor': cursor_scroll.encode(), 'language' : 'english', 'num_per_page' : 100, 'review_type': 'negative'}, headers = headers).json()

  #join the dataframes with reviews for current appid/game, then add to overall reviews dataframe
  temp_reviews = pd.concat([pd.DataFrame(res2['reviews']), pd.DataFrame(res3['reviews'])], ignore_index = True)
  temp_reviews['appid'] = appid
  temp_reviews['name'] = sorted_games[sorted_games['appid'] == int(appid)]['name'].iloc[0]
  reviews = pd.concat([reviews, temp_reviews], ignore_index = True)  #DataFrame.append was removed in pandas 2.0

#reviews

In [135]:
#Extract the relevant columns, name and appid for identification, review for the 
#text, and voted_up if observation was a recommendation or not.
reviews_df = reviews.loc[:, ['name', 'appid', 'review', 'voted_up']]

#1 = recommended, -1 = not recommended
reviews_df['voted_up'] = reviews_df['voted_up'].replace(True, 1)
reviews_df['voted_up'] = reviews_df['voted_up'].replace(False, -1)

#made all text lowercase prior to tokenization for less interference with stop_word removal
reviews_df['review'] = reviews_df['review'].apply(lambda x: x.lower())
reviews_df

Unnamed: 0,name,appid,review,voted_up
0,Counter-Strike: Global Offensive,730,great game much greatness would recommend,1
1,Counter-Strike: Global Offensive,730,this is a very great game that is very enjoyab...,1
2,Counter-Strike: Global Offensive,730,hmm,1
3,Counter-Strike: Global Offensive,730,rush b cyka,1
4,Counter-Strike: Global Offensive,730,jeblovina,1
...,...,...,...,...
50834,Sonic Adventure™ 2,213610,"as of december of 2021, this game has still no...",-1
50835,Freddy Fazbear's Pizzeria Simulator,738060,when freddy fazfart comes out of the vent. *em...,1
50836,Freddy Fazbear's Pizzeria Simulator,738060,fun game played since 2014 i have all the gam...,1
50837,Freddy Fazbear's Pizzeria Simulator,738060,ferdy,1


Now we have finished collecting the data. 

In [136]:
#Count the number of positive(1) and negative(-1) reviews
reviews_df.groupby('voted_up').count()

Unnamed: 0_level_0,name,appid,review
voted_up,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
-1,17590,17590,17590
1,33249,33249,33249
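The counts above show that the classes are not balanced, which matters when reading an accuracy figure later. As a rough reference point, the snippet below computes the accuracy of a trivial classifier that always predicts the majority class; the two counts are copied from the output above, and the variable names are introduced only for this sketch.

```python
# Class counts copied from the groupby output above
n_negative = 17590
n_positive = 33249

total = n_negative + n_positive
majority_baseline = max(n_negative, n_positive) / total

print(total)                        # 50839
print(round(majority_baseline, 3))  # 0.654
```

A model trained on this data would need to beat roughly 65% accuracy to be doing better than always answering "recommended".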


## Data Processing

Now that we have what we need from the scraped data, we can begin breaking down the words in preparation for input. I will be using the NLTK library for NLP functions and features, and the sklearn library for a Logistic Regression model.

In [137]:
#install and import the necessary nlp modules
import nltk
#nltk.download('all')
import sklearn
from sklearn.utils import shuffle

The first processing step will be to break the review words down to individual tokens, so that the NLTK functions can manipulate the tokens for further decomposition. 

In [138]:
#map the tokenize function for each list in each row
reviews_df['tokens'] = reviews_df["review"].map(lambda row: nltk.word_tokenize(row))

In [139]:
reviews_df

Unnamed: 0,name,appid,review,voted_up,tokens
0,Counter-Strike: Global Offensive,730,great game much greatness would recommend,1,"[great, game, much, greatness, would, recommend]"
1,Counter-Strike: Global Offensive,730,this is a very great game that is very enjoyab...,1,"[this, is, a, very, great, game, that, is, ver..."
2,Counter-Strike: Global Offensive,730,hmm,1,[hmm]
3,Counter-Strike: Global Offensive,730,rush b cyka,1,"[rush, b, cyka]"
4,Counter-Strike: Global Offensive,730,jeblovina,1,[jeblovina]
...,...,...,...,...,...
50834,Sonic Adventure™ 2,213610,"as of december of 2021, this game has still no...",-1,"[as, of, december, of, 2021, ,, this, game, ha..."
50835,Freddy Fazbear's Pizzeria Simulator,738060,when freddy fazfart comes out of the vent. *em...,1,"[when, freddy, fazfart, comes, out, of, the, v..."
50836,Freddy Fazbear's Pizzeria Simulator,738060,fun game played since 2014 i have all the gam...,1,"[fun, game, played, since, 2014, i, have, all,..."
50837,Freddy Fazbear's Pizzeria Simulator,738060,ferdy,1,[ferdy]


Using the English stopword corpus from the NLTK library, we can remove common and irrelevant words from the token lists that may interfere with the model when classifying. 

In [140]:
import string

In [143]:
#included in stopwords to remove punctuation from input
punc = string.punctuation
#list(punc)

In [144]:
#remove given stop words from each token list
stop_words = set(nltk.corpus.stopwords.words('english'))  #a set makes each membership test fast
stop_words.update(punc)
reviews_df['tokens'] = reviews_df["tokens"].map(lambda row: [w for w in row if w not in stop_words and w.isalpha()])


In [145]:
reviews_df

Unnamed: 0,name,appid,review,voted_up,tokens
0,Counter-Strike: Global Offensive,730,great game much greatness would recommend,1,"[great, game, much, greatness, would, recommend]"
1,Counter-Strike: Global Offensive,730,this is a very great game that is very enjoyab...,1,"[great, game, enjoyable]"
2,Counter-Strike: Global Offensive,730,hmm,1,[hmm]
3,Counter-Strike: Global Offensive,730,rush b cyka,1,"[rush, b, cyka]"
4,Counter-Strike: Global Offensive,730,jeblovina,1,[jeblovina]
...,...,...,...,...,...
50834,Sonic Adventure™ 2,213610,"as of december of 2021, this game has still no...",-1,"[december, game, still, fixed, steam, whenever..."
50835,Freddy Fazbear's Pizzeria Simulator,738060,when freddy fazfart comes out of the vent. *em...,1,"[freddy, fazfart, comes, vent, imposor]"
50836,Freddy Fazbear's Pizzeria Simulator,738060,fun game played since 2014 i have all the gam...,1,"[fun, game, played, since, games, u, totally, ..."
50837,Freddy Fazbear's Pizzeria Simulator,738060,ferdy,1,[ferdy]
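The filter combines two conditions, and it can help to see them on one hand-written example. This is a sketch under stated assumptions: `stop_words` here is a tiny hypothetical stand-in for the NLTK list, and `tokens` is written out by hand in the way a Treebank-style tokenizer typically splits contractions ("can't" into "ca" and "n't").

```python
# Hypothetical, tiny stand-in for nltk.corpus.stopwords.words('english')
stop_words = {"i", "this", "is", "a", "very", "that", "it"}

# Hand-written tokens for: "i can't run this, it is a very buggy game!"
tokens = ["i", "ca", "n't", "run", "this", ",", "it", "is", "a",
          "very", "buggy", "game", "!"]

# Same filter as the cell above: drop stop words and anything non-alphabetic
kept = [w for w in tokens if w not in stop_words and w.isalpha()]
print(kept)  # ['ca', 'run', 'buggy', 'game']
```

Punctuation and "n't" fail `isalpha()` and are dropped even without being listed as stop words, while fragments such as "ca" survive. That is likely why tokens like 'ca' and 'ai' show up in later outputs.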


Afterwards, the stemming process begins. Each token is reduced to its stem, mainly by stripping affixes directly, so the result is not always a dictionary word (for example 'decemb' and 'freddi' in the output below). Different affixes do not always carry distinguishing meanings, but their different morphological forms may confuse the model. After stemming, the tokens are ready for counting, which allows common words to be associated with positive or negative reviews.

In [146]:
stem = nltk.stem.PorterStemmer()
reviews_df['tokens'] = reviews_df["tokens"].map(lambda row: [stem.stem(w) for w in row])

In [147]:
reviews_df

Unnamed: 0,name,appid,review,voted_up,tokens
0,Counter-Strike: Global Offensive,730,great game much greatness would recommend,1,"[great, game, much, great, would, recommend]"
1,Counter-Strike: Global Offensive,730,this is a very great game that is very enjoyab...,1,"[great, game, enjoy]"
2,Counter-Strike: Global Offensive,730,hmm,1,[hmm]
3,Counter-Strike: Global Offensive,730,rush b cyka,1,"[rush, b, cyka]"
4,Counter-Strike: Global Offensive,730,jeblovina,1,[jeblovina]
...,...,...,...,...,...
50834,Sonic Adventure™ 2,213610,"as of december of 2021, this game has still no...",-1,"[decemb, game, still, fix, steam, whenev, open..."
50835,Freddy Fazbear's Pizzeria Simulator,738060,when freddy fazfart comes out of the vent. *em...,1,"[freddi, fazfart, come, vent, imposor]"
50836,Freddy Fazbear's Pizzeria Simulator,738060,fun game played since 2014 i have all the gam...,1,"[fun, game, play, sinc, game, u, total, buy, g..."
50837,Freddy Fazbear's Pizzeria Simulator,738060,ferdy,1,[ferdi]


## Exploratory Data Visualization

The dataset is divided between positive and negative reviews so that the words with high frequency in either review types can be identified.  

In [159]:
#Split the dataframe into only positive review or negative review dataframes
p_reviews = reviews_df[reviews_df['voted_up'] == 1]
n_reviews = reviews_df[reviews_df['voted_up'] == -1]

In [160]:
#Count words that appear in positive reviews
p_fd = nltk.FreqDist()
for row in p_reviews['tokens']:
  for word in row:
    p_fd[word.lower()] += 1


#Count words that appear in negative reviews
n_fd = nltk.FreqDist()
for row in n_reviews['tokens']:
  for word in row:
    n_fd[word.lower()] += 1
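The same per-class counting can be expressed with the standard library's `collections.Counter`, which supports the `most_common` call used further down. This is an alternative sketch, not the notebook's code; `count_tokens` and the sample token lists are introduced here for illustration.

```python
from collections import Counter

def count_tokens(token_lists):
    # Add every token of every review to one running tally
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts

pos_counts = count_tokens([['great', 'game', 'enjoy'], ['fun', 'game']])
print(pos_counts['game'])         # 2
print(pos_counts.most_common(1))  # [('game', 2)]
```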

In [161]:
#p_fd

In [162]:
#n_fd

In [163]:
#randomize positions because they are sorted by game title
p_reviews = shuffle(p_reviews)
n_reviews = shuffle(n_reviews)

#dataset split 75%-25% for training and testing
p_train, p_test = np.split(p_reviews, [int(.75*len(p_reviews))])
n_train, n_test = np.split(n_reviews, [int(.75*len(n_reviews))])
train = pd.concat([p_train, n_train])  #DataFrame.append was removed in pandas 2.0
test = pd.concat([p_test, n_test])

#randomize because sorted by positive/negative reviews
train = shuffle(train)
test = shuffle(test)
trainY = train['voted_up']
testY = test['voted_up']
trainX = train.drop(['voted_up', 'name', 'review', 'appid'], axis=1)
testX = test.drop(['voted_up', 'name', 'review', 'appid'], axis=1)
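Splitting each class separately keeps the positive/negative ratio the same in the training and test sets. The sketch below reproduces that idea on plain index lists with a fixed seed, so the split is repeatable; `stratified_split` is a hypothetical helper, and the label counts are the ones reported earlier in the notebook.

```python
import random

def stratified_split(labels, train_frac=0.75, seed=0):
    """Return (train_idx, test_idx), splitting each class at train_frac."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    # Shuffle again so the classes are interleaved
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx

labels = [1] * 33249 + [-1] * 17590
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 38128 12711
```

The training size of 38128 matches the number of rows in the training bag-of-words table produced later.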

Only the 100 most frequent terms in positive reviews and the 100 most frequent terms in negative reviews go into the final word list tot, so that the Logistic Regression model is not given too many features. The frequent terms in recommending reviews lean clearly towards positive sentiment, with terms like 'fun', 'great', 'love', etc.

In [164]:
pos = p_fd.most_common(100)
pos

[('game', 42047),
 ('play', 13263),
 ('good', 10893),
 ('like', 9351),
 ('get', 7652),
 ('fun', 6724),
 ('time', 6226),
 ('one', 5924),
 ('great', 4997),
 ('realli', 4984),
 ('make', 4727),
 ('stori', 3975),
 ('love', 3840),
 ('would', 3751),
 ('best', 3729),
 ('much', 3703),
 ('lot', 3627),
 ('feel', 3600),
 ('still', 3422),
 ('go', 3398),
 ('even', 3355),
 ('also', 3270),
 ('hour', 3117),
 ('new', 3087),
 ('player', 3049),
 ('thing', 3031),
 ('recommend', 2945),
 ('charact', 2942),
 ('enjoy', 2933),
 ('want', 2889),
 ('well', 2784),
 ('way', 2728),
 ('need', 2665),
 ('better', 2628),
 ('look', 2611),
 ('use', 2585),
 ('first', 2574),
 ('gameplay', 2369),
 ('peopl', 2314),
 ('friend', 2291),
 ('see', 2114),
 ('tri', 2109),
 ('take', 2101),
 ('mani', 2089),
 ('world', 2080),
 ('say', 2055),
 ('think', 2019),
 ('level', 2012),
 ('could', 2004),
 ('pretti', 1990),
 ('everi', 1979),
 ('give', 1969),
 ('amaz', 1952),
 ('experi', 1947),
 ('build', 1945),
 ('differ', 1942),
 ('ever', 1911),


On the other hand, the frequent terms in not-recommended reviews are much more ambiguous: they include features like 'good' and 'fun' alongside words that carry little sentiment. Many of them also rank highly in the recommended reviews, which might confuse the model.

In [165]:
neg = n_fd.most_common(100)
neg

[('game', 42715),
 ('play', 12443),
 ('get', 9907),
 ('like', 9741),
 ('time', 7458),
 ('even', 6032),
 ('one', 5851),
 ('make', 5658),
 ('good', 4684),
 ('realli', 4679),
 ('player', 4541),
 ('would', 4277),
 ('go', 4106),
 ('feel', 4020),
 ('want', 3902),
 ('fun', 3889),
 ('much', 3632),
 ('thing', 3550),
 ('hour', 3458),
 ('way', 3404),
 ('tri', 3385),
 ('use', 3325),
 ('still', 3220),
 ('work', 3043),
 ('buy', 2952),
 ('new', 2947),
 ('also', 2927),
 ('ca', 2897),
 ('first', 2896),
 ('look', 2873),
 ('need', 2858),
 ('everi', 2757),
 ('could', 2746),
 ('lot', 2693),
 ('recommend', 2682),
 ('bad', 2672),
 ('start', 2648),
 ('peopl', 2585),
 ('run', 2537),
 ('better', 2523),
 ('mani', 2443),
 ('fix', 2430),
 ('level', 2427),
 ('back', 2403),
 ('bug', 2338),
 ('see', 2335),
 ('charact', 2307),
 ('take', 2302),
 ('great', 2301),
 ('stori', 2299),
 ('issu', 2267),
 ('year', 2260),
 ('enemi', 2257),
 ('money', 2228),
 ('well', 2191),
 ('say', 2191),
 ('give', 2134),
 ('point', 2121),
 ...

In [168]:
tot = list(set(pos).union(set(neg)))
#tot

[('use', 3325),
 ('start', 2648),
 ('littl', 1461),
 ('sinc', 1305),
 ('say', 2055),
 ('great', 4997),
 ('lot', 2693),
 ('play', 12443),
 ('world', 2080),
 ('nice', 1667),
 ('made', 1648),
 ('amaz', 1952),
 ('enough', 1455),
 ('realli', 4984),
 ('well', 2784),
 ('hard', 1388),
 ('sinc', 1749),
 ('first', 2574),
 ('way', 2728),
 ('much', 3703),
 ('actual', 1460),
 ('could', 2004),
 ('look', 2873),
 ('ye', 1505),
 ('also', 2927),
 ('worth', 1508),
 ('tri', 2109),
 ('day', 1276),
 ('buy', 2952),
 ('lot', 3627),
 ('player', 4541),
 ('never', 1854),
 ('time', 6226),
 ('give', 1969),
 ('give', 2134),
 ('know', 2092),
 ('recommend', 2945),
 ('one', 5851),
 ('complet', 1963),
 ('stori', 3975),
 ('make', 4727),
 ('come', 1691),
 ('come', 1793),
 ('good', 4684),
 ('level', 2427),
 ('say', 2191),
 ('run', 2537),
 ('without', 1581),
 ('still', 3422),
 ('want', 3902),
 ('way', 3404),
 ('would', 4277),
 ('experi', 1511),
 ('pretti', 1617),
 ('worth', 1832),
 ('enjoy', 2933),
 ('know', 1630),
 ('ever
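Because `pos` and `neg` hold `(word, count)` tuples, the set union treats the same word with two different counts as two distinct elements. That is why terms such as 'lot' and 'give' appear twice in the output above and later turn into duplicate columns like `lot.1`. A possible alternative, sketched here with a few pairs copied from the outputs above, is to take the union over the words only; `vocab` is a name introduced for this sketch.

```python
# A few (word, count) pairs copied from the outputs above
pos = [('game', 42047), ('lot', 3627), ('give', 1969), ('love', 3840)]
neg = [('game', 42715), ('lot', 2693), ('give', 2134), ('bug', 2338)]

# Union over tuples: identical words with different counts stay separate
as_tuples = set(pos).union(set(neg))
print(len(as_tuples))  # 8

# Union over words only: each term appears once
vocab = sorted({word for word, _ in pos} | {word for word, _ in neg})
print(vocab)  # ['bug', 'game', 'give', 'lot', 'love']
```

Sorting the vocabulary also gives the feature columns a stable order between runs. Code that indexes `token[0]` would need to use the word directly if this variant were adopted.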

Here, a bag of words is created for each training and testing document: every review becomes a vector with one binary feature per term in tot.

In [169]:
#Creates a vector for each review observation, and adds 1 or 0 to the vector based on token presence
vs = []
for r in trainX['tokens']:
  v = []
  for token in tot:
    if token[0] in r:
      v.append(1)
    else:
      v.append(0)
  vs.append(v)
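The same binary encoding can be written as a small function that works on any list of token lists. This is an illustrative sketch with a made-up four-word vocabulary; `binary_bow` is a hypothetical helper, and it takes the vocabulary as plain words rather than `(word, count)` tuples.

```python
def binary_bow(token_lists, vocab):
    """One row per document, one 0/1 entry per vocabulary word."""
    rows = []
    for tokens in token_lists:
        present = set(tokens)   # set lookup avoids rescanning the list
        rows.append([1 if word in present else 0 for word in vocab])
    return rows

vocab = ['bug', 'fun', 'game', 'great']
docs = [['great', 'game', 'much', 'great'], ['bug', 'game'], ['hmm']]
print(binary_bow(docs, vocab))
# [[0, 0, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
```

Note that a repeated word ('great' in the first document) still produces a 1, and a review with no vocabulary words becomes an all-zero row.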

In [170]:
train_bow = pd.DataFrame(vs)
keys = []
for token in tot:
  keys.append(token[0])
train_bow.columns = keys
train_bow

Unnamed: 0,use,start,littl,sinc,say,great,lot,play,world,nice,made,amaz,enough,realli,well,hard,sinc.1,first,way,much,actual,could,look,ye,also,worth,tri,day,buy,lot.1,player,never,time,give,give.1,know,recommend,one,complet,stori,...,gameplay,grind,like,love,graphic,work,long,take,see,system,fun,better,gameplay.1,though,graphic.1,new,far,around,bug,enough.1,time.1,game,stori.1,make,kill,bad,level,bad.1,someth,hour,peopl,differ,go,think,buy.1,even,friend,much.1,year,mani
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38123,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
38124,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
38125,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
38126,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [171]:
#tf: raw count of each token per review (not divided by the review's token total in 'sum')
lst = []
for r in train.loc[:, 'tokens']:
  words = dict.fromkeys(r, 0)
  for i in r:
    words[i] += 1
  lst.append(words)
train['amt_words'] = lst
train['sum'] = train['tokens'].apply(lambda r: len(r))
train['tf'] = train['amt_words']

train

Unnamed: 0,name,appid,review,voted_up,tokens,amt_words,sum,tf
34196,Black Mesa,362890,black mesa is an amazing version of the origin...,1,"[black, mesa, amaz, version, origin, half, lif...","{'black': 1, 'mesa': 1, 'amaz': 1, 'version': ...",8,"{'black': 1, 'mesa': 1, 'amaz': 1, 'version': ..."
24575,Dragon Age™ Inquisition,1222690,"its so boring and repetitive, a game made by l...",-1,"[bore, repetit, game, made, lasiest, peopl]","{'bore': 1, 'repetit': 1, 'game': 1, 'made': 1...",6,"{'bore': 1, 'repetit': 1, 'game': 1, 'made': 1..."
13967,OUTRIDERS,680420,gears of war combat with looter shooter progre...,1,"[gear, war, combat, looter, shooter, progress,...","{'gear': 1, 'war': 1, 'combat': 1, 'looter': 1...",9,"{'gear': 1, 'war': 1, 'combat': 1, 'looter': 1..."
6066,The Witcher 3: Wild Hunt,292030,better than cyberpunk,1,"[better, cyberpunk]","{'better': 1, 'cyberpunk': 1}",2,"{'better': 1, 'cyberpunk': 1}"
35654,Wargame: Red Dragon,251060,"wow, how did i miss this game all this time? i...",1,"[wow, miss, game, time, fantast]","{'wow': 1, 'miss': 1, 'game': 1, 'time': 1, 'f...",5,"{'wow': 1, 'miss': 1, 'game': 1, 'time': 1, 'f..."
...,...,...,...,...,...,...,...,...
39743,Deadside,895400,"this game is pretty fun, i wish the base build...",1,"[game, pretti, fun, wish, base, build, littl, ...","{'game': 1, 'pretti': 1, 'fun': 1, 'wish': 1, ...",9,"{'game': 1, 'pretti': 1, 'fun': 1, 'wish': 1, ..."
28813,EVGA Precision X1,268850,i burned my house down...\nstupid pentium 4 rig,1,"[burn, hous, stupid, pentium, rig]","{'burn': 1, 'hous': 1, 'stupid': 1, 'pentium':...",5,"{'burn': 1, 'hous': 1, 'stupid': 1, 'pentium':..."
50341,FINAL FANTASY IX,377840,tetra master is too addictive,1,"[tetra, master, addict]","{'tetra': 1, 'master': 1, 'addict': 1}",3,"{'tetra': 1, 'master': 1, 'addict': 1}"
21634,Madden NFL 22,1519350,got it on sale still ain't worth it because it...,-1,"[got, sale, still, ai, worth, pc, dont, get, n...","{'got': 2, 'sale': 1, 'still': 1, 'ai': 1, 'wo...",16,"{'got': 2, 'sale': 1, 'still': 1, 'ai': 1, 'wo..."
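
The tf column above holds raw token counts, identical to amt_words. If a length-normalized term frequency were wanted instead (each count divided by the number of tokens in the review), a minimal sketch could look like the following. The helper name `term_frequency` and the example token list are hypothetical and are not part of the original notebook.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {token: count / len(tokens)} for one tokenized review.

    Hypothetical helper for illustration; the notebook keeps raw counts.
    """
    if not tokens:
        return {}  # avoid dividing by zero for empty reviews
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: n / total for tok, n in counts.items()}

# Made-up token list, not taken from the data set
example = ["got", "sale", "still", "got"]
tf = term_frequency(example)
print(tf)
```

Assuming the tokens column holds lists, this could be applied with `train['tokens'].apply(term_frequency)`.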


In [172]:
# BoW for test data, same process as training BoW:
# one row per review, one binary column per feature token in tot
vs = []
for r in testX['tokens']:
  # 1 if the feature token occurs in the review, else 0
  vs.append([1 if token[0] in r else 0 for token in tot])

In [173]:
test_bow = pd.DataFrame(vs)
# Label the columns with the feature tokens taken from tot
test_bow.columns = [token[0] for token in tot]
test_bow

Unnamed: 0,use,start,littl,sinc,say,great,lot,play,world,nice,made,amaz,enough,realli,well,hard,sinc.1,first,way,much,actual,could,look,ye,also,worth,tri,day,buy,lot.1,player,never,time,give,give.1,know,recommend,one,complet,stori,...,gameplay,grind,like,love,graphic,work,long,take,see,system,fun,better,gameplay.1,though,graphic.1,new,far,around,bug,enough.1,time.1,game,stori.1,make,kill,bad,level,bad.1,someth,hour,peopl,differ,go,think,buy.1,even,friend,much.1,year,mani
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12706,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
12707,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,1,0,0,0
12708,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
12709,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
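
Since the test matrix is built with the same process as the training matrix, the logic could be factored into one function so that both sets share the same column order. This is only a sketch: `binary_bow` is a hypothetical name, and `tot` is assumed to be a list of `(token, score)` pairs, as the `token[0]` lookups suggest. The toy scores and reviews below are made up.

```python
def binary_bow(token_lists, vocab):
    """Build binary bag-of-words rows: 1 if a vocab term is in the review, else 0."""
    rows = []
    for tokens in token_lists:
        present = set(tokens)  # set membership avoids rescanning the list per term
        rows.append([1 if term in present else 0 for term in vocab])
    return rows

# Toy stand-in for tot (assumed shape: (token, score) pairs; values are made up)
tot_example = [("great", 0.9), ("bore", 0.8), ("bug", 0.7)]
vocab = [pair[0] for pair in tot_example]

reviews = [["great", "game"], ["bore", "bug", "bug"], []]
matrix = binary_bow(reviews, vocab)
print(matrix)
```

Calling the same function on the training and test token lists with one shared `vocab` keeps the feature columns aligned between `fit` and `predict`.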


For the machine learning algorithm, logistic regression is used to classify each review as a recommendation (1, positive) or not (-1, negative). This is done with sklearn's LogisticRegression().

In [174]:
model = sklearn.linear_model.LogisticRegression()
model.fit(train_bow, trainY)

LogisticRegression()

In [175]:
pred = model.predict(test_bow)
print(sklearn.metrics.classification_report(testY, pred))

              precision    recall  f1-score   support

          -1       0.74      0.46      0.56      4398
           1       0.76      0.91      0.83      8313

    accuracy                           0.76     12711
   macro avg       0.75      0.68      0.70     12711
weighted avg       0.75      0.76      0.74     12711



The model reaches an accuracy of 76%, which is acceptable. Recall for positive reviews is very high (0.91), which seems to reflect the very direct terms that were common in positive reviews; these features were strongly associated with a recommendation. As predicted above, the negative sentiment features were too ambiguous, giving a low recall (0.46) for negative reviews. This means that more than half of the negative reviews were misclassified as positive. This may be because reviewers, when criticizing a game, frequently use sarcasm or negated positive terms (for example, "not fun"), which can confuse the model.
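
To make the precision and recall figures in the report concrete, the sketch below computes both per class by hand on a small, made-up set of labels (not the project's data). It illustrates how negative reviews predicted as positive pull down the recall of the -1 class while the recall of the 1 class stays high. The function name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: 4 negative and 6 positive reviews
y_true = [-1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
y_pred = [-1, -1, 1, 1, 1, 1, 1, 1, 1, -1]

neg_precision, neg_recall = precision_recall(y_true, y_pred, -1)
pos_precision, pos_recall = precision_recall(y_true, y_pred, 1)
print("negative:", neg_precision, neg_recall)
print("positive:", pos_precision, pos_recall)
```

In this toy case two of the four negative reviews are predicted as positive, so recall for -1 is 0.5 even though most predictions overall are correct.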

It is possible that the accuracy is lower because the features were selected as unigrams, which carry less contextual information than longer n-grams. The precision might also be affected by the stemming process, and using lemmatization instead might reduce the error.
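
As an illustration of the context that unigrams lose, the sketch below builds n-gram features from a token list. The helper `ngrams` and the token list are hypothetical, and the example assumes the negation word survives preprocessing, which may not hold if it is removed as a stop word.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up token list, not taken from the data set
tokens = ["not", "worth", "buy"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

As a unigram, "worth" looks the same in a positive and a negative review, while the bigram ("not", "worth") keeps the negation attached. If sklearn is used for feature extraction, CountVectorizer accepts an ngram_range parameter, for example ngram_range=(1, 2), for the same purpose.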


