# Importing the data

Originally, the file I imported was the ```ign.csv``` file from [here](https://www.kaggle.com/egrinstein/20-years-of-games). The section below demonstrates how I scraped the IGN website for reviews to add more data.

In [1]:
import pandas as pd

In [None]:
df = pd.read_csv('ign.csv')

# Scraping reviews to add in more data

It was kind of the compiler of this data to include the URLs; that made scraping IGN for its reviews a lot easier! I'm using BeautifulSoup to scrape the reviews. Overall, this worked for a good number of entries. A couple of possible reasons why some reviews are missing from the data:

1. Some games have video reviews, and I'm not really at a point where I would like to try to pull data from a video.

2. Maybe a review was not in the ```top-mixed-media``` class of ```div``` tags.

I noticed that some pages have multiple reviews, but to simplify the data-gathering a bit, I wanted to pick one of them. Any one would do.

The ```urllib``` imports below are for ```Python 3.x``` only. In ```Python 2.x```, ```urlopen``` and ```HTTPError``` live in the ```urllib2``` module instead. I imported the ```time.sleep``` function so that IGN doesn't get mad at me for inundating them with HTTP requests.

In [132]:
from urllib.request import urlopen
from urllib.error import HTTPError
from time import sleep

from bs4 import BeautifulSoup
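If the notebook ever needed to run under both interpreters, one common pattern is a guarded import. This is only a sketch of that idea; the rest of the notebook assumes Python 3.

```python
# Try the Python 3 module locations first, then fall back to Python 2's urllib2.
try:
    from urllib.request import urlopen
    from urllib.error import HTTPError
except ImportError:  # Python 2.x
    from urllib2 import urlopen, HTTPError
```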

I've found that sometimes a review page cannot be retrieved, either because no review link was found in the HTML or because the link was there but the page itself was missing. In either case, I handled that by simply returning an empty string for the review.

In [141]:
def get_review(row):
    game_link = row['url']
    game_address = 'http://www.ign.com' + game_link
    
    try:
        main_game_page = urlopen(game_address)
        main_page_soup = BeautifulSoup(main_game_page, 'html.parser')

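        likely_rev_location = main_page_soup.find('div', attrs={'class': 'top-mixed-media'})

        # No 'top-mixed-media' div means there is no review link to follow.
        links = likely_rev_location.find_all('a') if likely_rev_location else []

        link_to_review = None

        for link in links:
            # Some anchors have no href attribute; treat those as empty strings.
            href = link.get('href') or ''
            if 'review' in href.lower():
                link_to_review = href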

        review_text = ''

        if link_to_review:
            sleep(1.25)
            rev_page = urlopen(link_to_review)
            rev_page_soup = BeautifulSoup(rev_page, 'html.parser')

            review_div = rev_page_soup.find('div', attrs={'class': 'article-content'})
            if review_div:
                review_text = ' '.join(review_div.get_text().replace('\n','').replace('Share.','').split())
            else:
                review_text = ''
    except HTTPError:
        review_text = ''
        
    return review_text
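The function above only catches ```HTTPError```. As a hedged sketch, a small wrapper could also treat connection-level failures (```URLError```) as "no review". The names ```fetch_or_empty```, ```opener```, and ```always_missing``` are hypothetical and not part of the original notebook; ```opener``` exists only so the wrapper can be exercised without touching the network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_or_empty(url, opener=urlopen):
    """Return the page body as bytes, or b'' if the request fails.

    Hypothetical helper: `opener` defaults to urlopen and can be swapped
    for a stand-in when testing.
    """
    try:
        response = opener(url)
        return response.read()
    except (HTTPError, URLError):
        # HTTPError covers 404s and similar responses;
        # URLError covers DNS and connection failures.
        return b''


def always_missing(url):
    # Stand-in opener that behaves like a page returning 404.
    raise HTTPError(url, 404, 'Not Found', None, None)


print(fetch_or_empty('http://example.invalid/review', opener=always_missing))
```

An empty result can then be mapped to an empty review string, in the same spirit as the ```except``` branch of ```get_review```.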

In [142]:
df['review'] = df.apply(get_review, axis=1)

I'd rather not rescrape this every single time, so I'll save the file.

In [144]:
df.to_csv('ign_with_reviews.csv', sep=',', index=False)
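One way to make the "scrape once, reuse afterwards" idea explicit is a small caching helper. This is a sketch under assumptions: ```load_or_scrape``` and ```fake_scrape``` are hypothetical names, and the toy frame stands in for the real scraped data.

```python
import os
import tempfile

import pandas as pd


def load_or_scrape(path, scrape):
    """Read `path` if it exists; otherwise call `scrape()` and cache the result.

    Hypothetical helper: `scrape` is any zero-argument callable
    that returns a DataFrame.
    """
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = scrape()
    frame.to_csv(path, index=False)
    return frame


# Toy demonstration with a stand-in for the slow scraping step.
calls = []

def fake_scrape():
    calls.append(1)
    return pd.DataFrame({'title': ['Some Game'], 'review': ['Some text']})

cache_path = os.path.join(tempfile.mkdtemp(), 'cache.csv')
first = load_or_scrape(cache_path, fake_scrape)   # runs fake_scrape and writes the CSV
second = load_or_scrape(cache_path, fake_scrape)  # reads the cached CSV instead
print(len(calls))
```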

In [241]:
df = pd.read_csv('ign_with_reviews.csv').drop(columns=['score_phrase','url']).fillna('')

# Removing duplicates

Some games (e.g. Overwatch, The Legend of Zelda: Twilight Princess) are on multiple platforms. For a recommender system that is based on terms, it's very easy for cosine similarity methods to make the trivial match between entries whose titles are exactly the same. I'll get around this by combining the entries for different platforms into one and simply listing the platforms next to each other in the ```platform``` field (e.g. ```'PC XBox Wii'```).

It's conceivable that there are different reviews for the same game on different platforms (e.g. Twilight Princess on Wii vs. GameCube, or Dark Souls on the PlayStation vs. PC), but even if the platforms do differ, that would affect such a small subset of games that I won't worry about it.

In [242]:
titles = df['title'].unique()

for title in titles:
    platforms = df[df['title'] == title]['platform'].unique()
    platforms_string = ' '.join(platforms).strip()
    
    same_title_indices = df[df['title'] == title].index.values
    
    if len(same_title_indices) > 1:
        # Assign through .loc on the frame itself. The chained form
        # df['platform'].iloc[...] = ... may write to a copy, which is what
        # triggers pandas' "value is trying to be set on a copy" warning.
        df.loc[same_title_indices[0], 'platform'] = platforms_string
        df = df.drop(same_title_indices[1:]).reset_index(drop=True)
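As an alternative to the explicit loop, the same merge can be sketched with a ```groupby```. This is not the method used above, just a compact variant shown on a toy frame; ```toy``` and ```agg_rules``` are illustrative names, and the column order of the result differs from the input.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Game A', 'Game B', 'Game A'],
    'platform': ['PC', 'Wii', 'Xbox 360'],
    'genre': ['Action', 'Puzzle', 'Action'],
})

# Keep the first value of every column except 'platform',
# where the unique platforms are joined with spaces.
agg_rules = {col: 'first' for col in toy.columns if col not in ('title', 'platform')}
agg_rules['platform'] = lambda s: ' '.join(s.unique())

# sort=False keeps titles in order of first appearance.
deduped = toy.groupby('title', as_index=False, sort=False).agg(agg_rules)
print(deduped)
```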


# Building a recommender system

I'll be comparing the use of a vectorizer using absolute term frequency to one using term frequency-inverse document frequency.

The routine for cosine similarity and retrieving recommendations was written with a lot of help from a DataCamp tutorial seen [here](https://www.datacamp.com/community/tutorials/recommender-systems-python). All the data processing, web scraping, and model adjusting is mine, though!

In [256]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity
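To get a feel for how the two vectorizers differ before applying them to the real data, here is a toy sketch on three made-up sentences (the sentences are invented for illustration, not taken from the reviews). With raw counts, words shared by every document contribute as much as distinctive ones; TF-IDF down-weights them.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    'the stealth game has guns',
    'the strategy game has armies',
    'the stealth game has shadows',
]

count_sim = cosine_similarity(CountVectorizer().fit_transform(docs))
tfidf_sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))

# Documents 0 and 2 share 'stealth'; documents 0 and 1 share only
# words that appear in every document ('the', 'game', 'has').
print(count_sim[0, 1], count_sim[0, 2])
print(tfidf_sim[0, 1], tfidf_sim[0, 2])
```

In this toy case the similarity between documents 0 and 1 is lower under TF-IDF than under raw counts, because it rests entirely on words that every document contains.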

This particular recommender system should never recommend terrible games. To be a bit of an elitist here (and to get around memory issues), I will only recommend games that have a score of 8.5 and above.

In [244]:
df_rec = df.drop(columns=['release_year','release_month','release_day'], errors='ignore')
df_rec = df_rec[df_rec['score'] >= 8.5].drop(columns=['score'])

Some functions to aid in combining the metadata and the review into a huge body of text.

In [277]:
def combine_fields(row):
    # Join the metadata and the review into one space-separated string.
    return ' '.join([row['title'], row['platform'], row['genre'], row['review']])

def combine_fields_no_review(row):
    # Same as above, but without the review text.
    return ' '.join([row['title'], row['platform'], row['genre']])

In [278]:
df_rec['combined'] = df_rec.apply(combine_fields, axis=1)
df_rec['combined_no_rev'] = df_rec.apply(combine_fields_no_review, axis=1)

Here's a helper function to return the 10 most cosine-similar games given the game title, platform, genre, and review. This, of course, assumes that the title passed into ```get_recommendations``` exists.

In [250]:
df_rec = df_rec.reset_index()
indices = pd.Series(df_rec.index, index=df_rec['title'])

In [329]:
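def get_recommendations(title, cosine_similarity):
    # Note: the second argument is a precomputed similarity matrix; inside this
    # function it shadows sklearn's cosine_similarity function.
    matches = indices[title]
    # Duplicate titles make indices[title] a Series; take the first match.
    if isinstance(matches, pd.Series):
        index = matches.iloc[0]
    else:
        index = matches
    scores = list(enumerate(cosine_similarity[index]))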
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    scores = scores[1:11]
    
    rec_indices = [i[0] for i in scores]
    
    return df_rec[['title','genre','platform']].iloc[rec_indices]
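The helper above skips the first sorted entry on the assumption that the queried game is always its own best match. A slightly more defensive sketch removes the query index explicitly, which also holds up if another row happens to tie at a similarity of 1. The name ```top_matches``` and the toy matrix are hypothetical.

```python
import numpy as np

# Toy symmetric similarity matrix for three items.
sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
])


def top_matches(sim_matrix, index, k):
    """Return the positions of the k most similar rows, excluding `index` itself."""
    order = np.argsort(sim_matrix[index])[::-1]  # most similar first
    order = order[order != index]                # drop the query row explicitly
    return order[:k].tolist()


print(top_matches(sim, 0, 2))  # [2, 1]
```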

# Making the models

There's a bit of thinking to do here, as poorly chosen parameters may give extremely strange recommendations. Here are a few things that I can think of to improve the model:

#### Exclude infrequent corpus-specific words

To demonstrate this, consider RPGs where you might number your series as 1, 2, 3, etc., or I, II, III, etc. Single-character tokens (e.g. I or 3) are already dropped by scikit-learn's default token pattern, which keeps only tokens of two or more characters, but "words" like "II" and "III" are still treated as words and will contribute to cosine similarity if you're not careful. Excluding them helps to prevent cosine similarity build-up between games like Warcraft III and Diablo III *on the basis of the word "III."* (N.b. there are other good reasons why Warcraft III and Diablo III ought to have a cosine similarity.)
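A quick sketch of what the default tokenizer keeps, using a few titles as a toy corpus. With default settings, ```CountVectorizer``` lowercases the text and keeps tokens of at least two alphanumeric characters, so "I" disappears while "II" and "III" survive unless they are listed as stop words.

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = ['Warcraft III', 'Diablo III', 'Final Fantasy I', 'Torchlight II']

# Default settings: 'i' is dropped, but 'ii' and 'iii' become features.
default_vec = CountVectorizer()
default_vec.fit(titles)
print(sorted(default_vec.vocabulary_))

# Listing the numerals as stop words removes them from the vocabulary.
filtered_vec = CountVectorizer(stop_words=['ii', 'iii'])
filtered_vec.fit(titles)
print(sorted(filtered_vec.vocabulary_))
```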

#### Play around with stemming and/or lemmatizing

In particular, this would be very helpful in reviews, as there tend to be variations of words. One simple example would be the word "gun." If there were no stemming or lemmatizing, "gun" and "guns" would be considered orthogonal features. We (humans) know that those two words are obviously related, so it would be nice if they would be considered to be part of the same vector.
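To make the "gun" vs. "guns" point concrete, here is a pure-Python sketch. ```naive_stem``` is a deliberately crude, hypothetical stand-in that only strips a trailing "s"; a real pipeline would more likely use a proper stemmer or lemmatizer (for example from NLTK or spaCy).

```python
import math
from collections import Counter


def naive_stem(word):
    # Hypothetical, deliberately crude stemmer: strips one trailing 's'
    # from words longer than three characters.
    return word[:-1] if word.endswith('s') and len(word) > 3 else word


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(a), Counter(b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


review_a = 'loud gun'.split()
review_b = 'loud guns'.split()

# Without stemming, only 'loud' overlaps.
print(cosine(review_a, review_b))

# After stemming, 'gun' and 'guns' collapse into the same feature.
print(cosine([naive_stem(w) for w in review_a],
             [naive_stem(w) for w in review_b]))
```

The first similarity is about 0.5 and the second is about 1.0 (up to floating-point rounding), since the two token lists become identical once "guns" is reduced to "gun".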

#### Play around with min/max document frequency

There isn't a really good way (as far as I know) to explore this parameter space except by hand. In general, typical stop words like "a," "or," "the," etc. have a high document frequency, so setting the ```max_df``` parameter to about ```0.7``` will likely filter out most stop words.

When writing reviews, one could imagine that certain words are commonly used. Perhaps these words are not as numerous as the stop words mentioned in the previous paragraph, but they can still constitute a significant portion of the text, which could ultimately result in cosine similarity between otherwise unrelated games. For example, a storyline with engaging characters is a common topic to discuss in a review. Maybe one reviewer considers the characters in one of the Halo games to be super engaging, and another reviewer states the same thing about a character in one of the Dragon Quest games.

Here, we toe a fine line because maybe you would actually like to see other games that have engaging characters, but you also only like RPGs, so if you liked a game from the Dragon Quest series, it would be strange to see a suggestion for one of the Halo games.
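A small sketch of the ```max_df``` effect on four invented review snippets (made up for illustration). With ```max_df=0.7```, any term appearing in more than 70% of the documents is dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    'the story is engaging',
    'the combat is fast',
    'the puzzles are clever',
    'the story drags',
]

# 'the' appears in 4 of 4 documents (100% > 70%), so it is dropped;
# 'story' and 'is' appear in 2 of 4 (50%), so they are kept.
vec = CountVectorizer(max_df=0.7)
vec.fit(reviews)
print(sorted(vec.vocabulary_))
```

Lowering ```max_df``` further would start removing words like "story," which is exactly the trade-off described above.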

(Author's note, mostly to self: this reminds me a lot of creating band pass filters for electrical measurements...)

### Combining all fields together

In [389]:
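# Note: 'vol.' never matches a token, because the default tokenizer drops the period.
stop_words = ['ex','iii','vol.','revolution']

# min_df=1 is the default (no lower bound); newer scikit-learn releases reject an integer 0.
count = CountVectorizer(stop_words=stop_words, min_df=1, max_df=0.1)
count_matrix = count.fit_transform(df_rec['combined'])

tfidf = TfidfVectorizer(stop_words=stop_words, min_df=1, max_df=0.1)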
tfidf_matrix = tfidf.fit_transform(df_rec['combined'])

cos_sim_count = cosine_similarity(count_matrix, count_matrix)
cos_sim_tfidf = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [390]:
get_recommendations('Deus Ex: Mankind Divided', cos_sim_tfidf)

| | title | genre | platform |
|---|---|---|---|
| 2056 | Deus Ex: Human Revolution | Shooter | PC PlayStation 3 Xbox 360 |
| 382 | Deus Ex | RPG | PC Macintosh |
| 929 | Deus Ex: Invisible War | Action, Adventure | PC Xbox |
| 2320 | The Walking Dead: Season Two -- Episode 2: A H... | Adventure | Xbox 360 iPhone PC PlayStation Vita PlayStation 3 |
| 37 | Dishonored | Action | Xbox 360 PC PlayStation 3 |
| 2492 | Quadrilateral Cowboy | Puzzle | PC |
| 2480 | Grim Dawn | Action, RPG | PC |
| 2244 | Stealth Inc.: A Clone in the Dark | Platformer | PlayStation 3 PlayStation Vita |
| 2477 | Enter the Gungeon | Shooter | PC PlayStation 4 |
| 5 | Mark of the Ninja | Action, Adventure | Xbox 360 PC |


In [392]:
get_recommendations('Warcraft III: The Frozen Throne', cos_sim_tfidf)

| | title | genre | platform |
|---|---|---|---|
| 1982 | World of Warcraft: Cataclysm | RPG | PC |
| 2044 | Frozen Synapse | Strategy | PC |
| 696 | Age of Wonders II: The Wizard's Throne | Strategy | PC |
| 701 | Warcraft III: Reign of Chaos | Strategy | PC Macintosh |
| 2375 | World of Warcraft: Warlords of Draenor | RPG | PC |
| 9 | World of Warcraft: Mists of Pandaria | RPG | PC |
| 550 | Baldur's Gate II: Throne of Bhaal | RPG | PC |
| 2427 | Nuclear Throne | Action | PC |
| 2324 | Hearthstone: Heroes of WarCraft | Card, Battle | Macintosh iPad PC |
| 1087 | World of Warcraft | RPG | PC |


In [387]:
get_recommendations('Deus Ex: Mankind Divided', cos_sim_count)

| | title | genre | platform |
|---|---|---|---|
| 37 | Dishonored | Action | Xbox 360 PC PlayStation 3 |
| 2491 | Overwatch | Shooter | Xbox One PC PlayStation 4 |
| 2492 | Quadrilateral Cowboy | Puzzle | PC |
| 2412 | Kerbal Space Program | Simulation | PC PlayStation 4 |
| 2246 | Gunpoint | Puzzle, Action | PC |
| 40 | Borderlands 2 | Shooter, RPG | Xbox 360 PC PlayStation 3 PlayStation Vita |
| 2481 | Total War: Warhammer | Strategy | PC |
| 2452 | Homeworld: Deserts of Kharak | Strategy | PC |
| 2477 | Enter the Gungeon | Shooter | PC PlayStation 4 |
| 2421 | Fallout 4 | RPG | PC PlayStation 4 Xbox One |


In [388]:
get_recommendations('Warcraft III: Reign of Chaos', cos_sim_count)

| | title | genre | platform |
|---|---|---|---|
| 137 | Caesar III | Strategy | PC Macintosh |
| 602 | Stronghold | Strategy | PC Macintosh |
| 859 | Warcraft III: The Frozen Throne | Strategy | PC |
| 596 | Sid Meier's Civilization III | Strategy | PC Macintosh Wireless |
| 679 | Heroes of Might and Magic IV | Strategy | PC Macintosh |
| 2194 | Sid Meier's Civilization V: Gods & Kings | Strategy | PC Macintosh |
| 2324 | Hearthstone: Heroes of WarCraft | Card, Battle | Macintosh iPad PC |
| 272 | Homeworld | Strategy | PC |
| 276 | Pharaoh | Strategy | PC |
| 384 | StarCraft | Strategy | PC |


### Combining everything except for reviews

I'm using the suffix ```_nr``` to denote "no review."

In [305]:
count = CountVectorizer(stop_words='english', min_df = 0.05, max_df = 0.25)
count_matrix_nr = count.fit_transform(df_rec['combined_no_rev'])

tfidf = TfidfVectorizer(stop_words='english', min_df = 0.05, max_df = 0.25)
tfidf_matrix_nr = tfidf.fit_transform(df_rec['combined_no_rev'])

cos_sim_count_nr = cosine_similarity(count_matrix_nr, count_matrix_nr)
cos_sim_tfidf_nr = cosine_similarity(tfidf_matrix_nr, tfidf_matrix_nr)

In [306]:
get_recommendations('Deus Ex: Mankind Divided', cos_sim_tfidf_nr)

| | title | platform |
|---|---|---|
| 6 | Dark Souls (Prepare to Die Edition) | PC |
| 9 | World of Warcraft: Mists of Pandaria | PC |
| 28 | Torchlight II | PC |
| 63 | Ni no Kuni: Wrath of the White Witch | PlayStation 3 |
| 67 | Persona 4 Golden | PlayStation Vita |
| 86 | Blood Omen: Legacy of Kain | PlayStation |
| 88 | Suikoden | PlayStation |
| 106 | Final Fantasy VII | PlayStation PC |
| 117 | Alundra | PlayStation |
| 125 | Xenogears | PlayStation |


In [307]:
get_recommendations('Deus Ex: Mankind Divided', cos_sim_count_nr)

| | title | platform |
|---|---|---|
| 6 | Dark Souls (Prepare to Die Edition) | PC |
| 9 | World of Warcraft: Mists of Pandaria | PC |
| 28 | Torchlight II | PC |
| 63 | Ni no Kuni: Wrath of the White Witch | PlayStation 3 |
| 67 | Persona 4 Golden | PlayStation Vita |
| 86 | Blood Omen: Legacy of Kain | PlayStation |
| 88 | Suikoden | PlayStation |
| 106 | Final Fantasy VII | PlayStation PC |
| 117 | Alundra | PlayStation |
| 125 | Xenogears | PlayStation |


In [308]:
get_recommendations('Warcraft III: Reign of Chaos', cos_sim_tfidf_nr)

| | title | platform |
|---|---|---|
| 95 | Carnage Heart | PlayStation |
| 124 | Kagero: Deception II | PlayStation |
| 133 | The Operational Art of War, Vol. 1 | PC |
| 135 | Railroad Tycoon II | PC PlayStation |
| 137 | Caesar III | PC Macintosh |
| 169 | Sid Meier's Civilization II | PlayStation |
| 171 | Close Combat III: The Russian Front | PC |
| 174 | Populous: The Beginning | PC PlayStation |
| 176 | Gangsters: Organized Crime | PC |
| 177 | Nectaris: Military Madness [1999] | PlayStation |


In [309]:
get_recommendations('Warcraft III: Reign of Chaos', cos_sim_count_nr)

| | title | platform |
|---|---|---|
| 95 | Carnage Heart | PlayStation |
| 124 | Kagero: Deception II | PlayStation |
| 133 | The Operational Art of War, Vol. 1 | PC |
| 135 | Railroad Tycoon II | PC PlayStation |
| 137 | Caesar III | PC Macintosh |
| 169 | Sid Meier's Civilization II | PlayStation |
| 171 | Close Combat III: The Russian Front | PC |
| 174 | Populous: The Beginning | PC PlayStation |
| 176 | Gangsters: Organized Crime | PC |
| 177 | Nectaris: Military Madness [1999] | PlayStation |


# Summary

I had to establish a narrow band of document frequencies to search over because otherwise there would be cosine similarities between games that were only superficially related. For example, Warcraft III could easily be matched with other games that have the Roman numeral III in their titles, like Close Combat III, Caesar III, etc. While some of those matches could be justified for good reasons, a match on the numeral alone is an inaccurate recommendation, because it implies that all games containing III are somehow related.

They're not, for what I think are obvious reasons. Patterns like that are not terribly common, thankfully, so it's a matter of adjusting the ```min_df``` parameter so that only terms shared by enough documents are considered. On the other hand, because reviews tend to use phrases like "this game was exciting" or "the storyline was moving" or something similar, some words exist in a wide variety of documents, and so the document frequency must also be capped from above. Otherwise, if one review states that "Doki Doki Literature Club has an emotionally charged storyline" and another review states that "the storyline of Puzzles & Dragons is non-existent," then there is some component in word space that provides a non-zero cosine similarity. For those familiar with those games, I think it should be obvious why there shouldn't be any cosine similarity. Or, perhaps you disagree with those statements. I don't know. I also just made up those reviews and don't really have an opinion on those statements.

In general, I believe that the recommendations given by the algorithm that uses game reviews are far superior to the ones that do not use the reviews. I think this is easily seen by looking at the recommendations above. Deus Ex: Mankind Divided is an FPS, meaning that it's a game that involves guns and is played from the first-person perspective. You'd expect that games that have similar elements would be recommended, and indeed they have been:

* Halo 3

* Hitman

* Skyrim

* Gunpoint

Gunpoint is actually described as a stealth-based game, which is consistent with the kind of game that Deus Ex is. Skyrim is too, for that matter, if you're into pickpocketing and using the stealth archer meme build. These types of themes would not have been found just by stringing together the title, genre, and platform metadata. Even without looking for specific passages in the reviews, you could guess that guns, stealth, and the first-person perspective made up a non-negligible portion of some or all of the recommended games' reviews.

Without the reviews, Deus Ex was given some pretty strange recommendations. The only one of those games that I would consider even remotely related to Deus Ex is Dark Souls (coincidentally, the first one recommended); everything else seems completely out of the blue. The other recommended games tend to be turn-based RPGs set in more fantastical settings.

Looking at Warcraft III: Reign of Chaos, the recommendations with reviews are, again, far superior to the recommendations without reviews. What I found was that the games recommended tended to be games that were "real-time strategy" or "turn-based strategy." Warcraft III, of course, doesn't have anything turn-based (not counting custom games on Battle.net), but I think the key theme here is "strategy."

Without the reviews, Warcraft III gets recommendations like Railroad Tycoon and Kagero. Railroad Tycoon definitely does not match with the type of game Warcraft III is, and Kagero is a game where you run around and set traps to defeat your enemies. There are other games like Populous and other wargame simulators which are at least a bit more relevant; their matches could probably be attributed to matches in the genre field.