# COGS 108 - EDA Checkpoint

# Names

- Fadi Elhayek
- Ivy Duggan
- Joshua Guiman (lead on EDA checkpoint)
- Yu-Hsuan Chi (lead on EDA checkpoint)

<a id='research_question'></a>
# Research Question

Are there key indicators on a game's store page of whether or not the game is good? (i.e., can this be determined through sentiment analysis of the name/blurb/genres, the mismatch between developer-tagged and user-tagged genres, recent/overall reviews, release date, Metacritic score, maturity rating, etc.?)


# Setup

In [13]:
import pandas as pd
list_converter = lambda x: x.strip("[]").strip("()").lower().split(", ") # need to read lists inside the csv!
df = pd.read_csv(
    'df_full.csv', 
    converters={
        "developer_genres_list": list_converter, 
        'user_tags_list': list_converter,
        'game_features_list': list_converter,
        'maturity_rating_data': list_converter,
        'genre_and_franchise_data': list_converter,
        'critic_data': list_converter,
        'reviews_data': list_converter
    })
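
One thing to watch with `list_converter`: it splits on `", "` but does not remove the quote characters around each item, so a cell such as `['action', 'free to play']` yields the strings `'action'` and `'free to play'` with the quotes still inside them. The sketch below is one possible variant that also strips those quotes. The name `parse_list_cell` is hypothetical, and the raw cell format is assumed from the displayed output rather than confirmed against the CSV.

```python
def parse_list_cell(cell):
    """Split a list-like CSV cell into lowercase items without surrounding quotes."""
    # Same bracket/parenthesis stripping and lowercasing as list_converter.
    inner = cell.strip("[]").strip("()").lower()
    if not inner:
        return []  # an empty "[]" cell becomes an empty list, not ['']
    # Remove whitespace and any single or double quotes around each item.
    return [item.strip().strip("'\"") for item in inner.split(", ")]


print(parse_list_cell("['Action', 'Free to Play']"))
print(parse_list_cell("[None, None]"))
```

Like the original converter, this assumes no item contains the separator `", "` itself.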

In [14]:
df_clean = df.drop_duplicates(subset='app_id', keep="first")
df_clean.head(50)
#print(df_clean['game_name'][0])

Unnamed: 0.1,Unnamed: 0,app_id,game_name,release_date,game_blurb,developer_genres_list,user_tags_list,game_features_list,maturity_rating_data,genre_and_franchise_data,critic_data,reviews_data
0,0,730,Counter-Strike: Global Offensive,"Aug 21, 2012",Counter-Strike: Global Offensive (CS: GO) expa...,"['action', 'free to play']","['fps', 'shooter', 'multiplayer', 'competitive...","['steam achievements', 'full controller suppor...","[none, none]",['free to play games'],"['83', 'metacritic']","['very positive', 'very positive', '6731720']"
1,1,1938090,Call of Duty®: Modern Warfare® II,"Oct 27, 2022",Call of Duty®: Modern Warfare® II drops player...,['action'],"['fps', 'action', 'shooter', 'multiplayer', 'm...","['single-player', 'online pvp', 'online co-op'...","['m', 'esrb']","['action games', 'call of duty franchise']","[none, none]","[none, none, none]"
2,2,236390,War Thunder,"Aug 15, 2013",War Thunder is the most comprehensive free-to-...,"['action', 'free to play', 'massively multipla...","['free to play', 'vehicular combat', 'combat',...","['single-player', 'mmo', 'online pvp', 'online...","['t', 'esrb']",['massively multiplayer games'],"['81', 'metacritic']","['mostly positive', 'mostly positive', '341555']"
3,3,1172470,Apex Legends™,"Nov 4, 2020","Apex Legends is the award-winning, free-to-pla...","['action', 'adventure', 'free to play']","['free to play', 'multiplayer', 'battle royale...","['online pvp', 'online co-op', 'steam achievem...","['t', 'esrb']","['action games', 'apex legends official franch...","['88', 'metacritic']","['very positive', 'mostly positive', '527702']"
4,4,1599340,Lost Ark,"Feb 11, 2022",Embark on an odyssey for the Lost Ark in a vas...,"['action', 'adventure', 'free to play', 'massi...","['mmorpg', 'free to play', 'action rpg', 'rpg'...","['single-player', 'mmo', 'online pvp', 'online...","['m', 'esrb']",['free to play games'],"[none, none]","['mostly positive', 'mostly positive', '174373']"
5,5,294100,RimWorld,"Oct 17, 2018",A sci-fi colony sim driven by an intelligent A...,"['indie', 'simulation', 'strategy']","['colony sim', 'base building', 'survival', 's...","['single-player', 'steam workshop', 'steam clo...","[none, none]",['simulation games'],"['87', 'metacritic']","['overwhelmingly positive', 'overwhelmingly po..."
6,6,1085660,Destiny 2,"Oct 1, 2019",Destiny 2 is an action MMO with a single evolv...,"['action', 'adventure', 'free to play']","['free to play', 'open world', 'fps', 'looter ...","['single-player', 'online pvp', 'online co-op'...","['t', 'esrb']",['action games'],"['83', 'metacritic']","['very positive', 'mostly positive', '501974']"
7,7,1063730,New World,"Sep 28, 2021","Explore a thrilling, open-world MMO filled wit...","['action', 'adventure', 'massively multiplayer...","['massively multiplayer', 'open world', 'mmorp...","['mmo', 'online pvp', 'online co-op', 'steam a...","['t', 'esrb']",['massively multiplayer games'],"['70', 'metacritic']","['mixed', 'very positive', '204606']"
8,8,1687950,Persona 5 Royal,"Oct 20, 2022",Don the mask and join the Phantom Thieves of H...,['rpg'],"['jrpg', 'anime', 'story rich', 'rpg', 'party-...","['single-player', 'steam achievements', 'full ...","['m', 'esrb']","['rpg games', 'atlus franchise']","['95', 'metacritic']","[none, none, none]"
9,9,548430,Deep Rock Galactic,"May 13, 2020",Deep Rock Galactic is a 1-4 player co-op FPS f...,['action'],"['co-op', 'pve', 'fps', 'exploration', 'loot',...","['single-player', 'online co-op', 'steam achie...","[none, none]","['action games', 'coffee stain franchise']","['85', 'metacritic']","['overwhelmingly positive', 'overwhelmingly po..."


# Data Cleaning

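The data cleaning steps are:
1. Remove games that are not released (release date after Nov 19, 2022, or not in a month/day/year format).
2. Remove non-English games (i.e. games whose titles contain no cased letters, such as titles written entirely in Korean, Japanese, or Chinese).
3. Remove symbols such as ™ and ® from titles.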

In [20]:
## Helper Variables and Functions used to clean data.
monthValues = {"Jan":1,"Feb":2,"Mar":3,"Apr":4,"May":5,"Jun":6,"Jul":7,"Aug":8,"Sep":9,"Oct":10,"Nov":11,"Dec":12,
              "January":1, "February":2, "March":3, "April":4, "June":6, "July":7, "August":8, "September":9,
              "October":10, "November":11,"December":12}
currentTime = ['Nov','19','2022']

def isReleased(releaseDate):
    mmddyear = releaseDate.split()
    if len(mmddyear) != 3: # Must have month, day and year to be processed cleanly.
        return False
    if mmddyear[0] not in monthValues.keys(): # if the month has strange format, return false.
        return False
    
    mmddyear[1] = mmddyear[1].replace(',','') # get rid of the comma attached to the day.
    if int(mmddyear[2]) < int(currentTime[2]): # year is before the cutoff year
        return True
    if int(mmddyear[2]) == int(currentTime[2]): # same year: compare month, then day
        if monthValues[mmddyear[0]] < monthValues[currentTime[0]]: #check if month is before
            return True
        if monthValues[mmddyear[0]] == monthValues[currentTime[0]]:
            if int(mmddyear[1]) <= int(currentTime[1]): #check if day is before
                return True
        
    return False
#-------------------------------------------------------------
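def isEnglish(title):
    # True when the title contains at least one cased letter. This filters out titles written
    # only in caseless scripts (Japanese, Chinese, Korean); cased scripts such as Cyrillic still pass.
    return title.lower() != title.upper()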
#-------------------------------------------------------------

def removeSymbols(text):
    removed = ''
    for c in text:
        if c.isalnum() or c == ' ' or c == '-':
            removed = removed + c
    removed = removed.replace('-', ' ') # - is a special case for titles and we need to replace it with space.
    return removed



In [24]:
# Tests for the above helper methods.
assert isReleased("Nov 19, 2022") == True
assert isReleased("Aug 30, 2022") == True
assert isReleased("Jan 1, 2000") == True
assert isReleased("Nov 20, 2022") == False
assert isReleased("Mar 1, 2023") == False

assert isEnglish("Call of Duty®: Modern Warfare® II") == True
assert isEnglish("안녕") == False


assert "Happy123" == removeSymbols("Happy!1,2?3")
assert "Counter Strike Global Offensive" == removeSymbols(df_clean['game_name'][0])
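
As a cross-check on `isReleased`, the same rule can be sketched with the standard library's `datetime` parser. This is an alternative sketch, not the notebook's method: the name `is_released_dt` is hypothetical, the cutoff mirrors `currentTime` (Nov 19, 2022), and month-name parsing with `%b`/`%B` assumes an English locale. Unlike `isReleased`, it requires the comma after the day and rejects impossible dates.

```python
from datetime import datetime

CUTOFF = datetime(2022, 11, 19)  # same cutoff date as currentTime in the helper cell


def is_released_dt(release_date, cutoff=CUTOFF):
    """Return True if release_date parses as 'Mon D, YYYY' and is on or before cutoff."""
    for fmt in ("%b %d, %Y", "%B %d, %Y"):  # abbreviated, then full month names
        try:
            return datetime.strptime(release_date.strip(), fmt) <= cutoff
        except ValueError:
            continue  # this format did not match; try the next one
    return False  # strings that match neither format count as unreleased


print(is_released_dt("Nov 19, 2022"), is_released_dt("Nov 20, 2022"))
```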

In [33]:
# Dedicated to removing rows.

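# 1. Remove rows that contain NaN. Note: dropna() does not remove rows whose list
#    columns hold the placeholder string 'none' (e.g. [none, none]); those rows remain.
df_clean = df_clean.dropna()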

# 2. Remove rows where the game name is not English or the game was not released.
# This commented-out solution works, but is slow.
#for index, row in df_clean.iterrows():
    #if isReleased(row['release_date']) == False:
        #df_clean.drop(index, inplace=True)

#for index, row in df_clean.iterrows():
    #if isEnglish(row['game_name']) == False:
        #df_clean.drop(index, inplace=True)

#df_clean.head(10)
col_isEnglish = df_clean['game_name'].apply(isEnglish)  # True where the game name passes isEnglish.
col_isReleased = df_clean['release_date'].apply(isReleased)  # True where the game is already released.
# Keep only rows that pass both checks; a boolean mask avoids dropping rows one at a time.
df_clean = df_clean[col_isEnglish & col_isReleased]
df_clean.head(5)


Unnamed: 0.1,Unnamed: 0,app_id,game_name,release_date,game_blurb,developer_genres_list,user_tags_list,game_features_list,maturity_rating_data,genre_and_franchise_data,critic_data,reviews_data
0,0,730,Counter-Strike: Global Offensive,"Aug 21, 2012",Counter-Strike: Global Offensive (CS: GO) expa...,"['action', 'free to play']","['fps', 'shooter', 'multiplayer', 'competitive...","['steam achievements', 'full controller suppor...","[none, none]",['free to play games'],"['83', 'metacritic']","['very positive', 'very positive', '6731720']"
1,1,1938090,Call of Duty®: Modern Warfare® II,"Oct 27, 2022",Call of Duty®: Modern Warfare® II drops player...,['action'],"['fps', 'action', 'shooter', 'multiplayer', 'm...","['single-player', 'online pvp', 'online co-op'...","['m', 'esrb']","['action games', 'call of duty franchise']","[none, none]","[none, none, none]"
2,2,236390,War Thunder,"Aug 15, 2013",War Thunder is the most comprehensive free-to-...,"['action', 'free to play', 'massively multipla...","['free to play', 'vehicular combat', 'combat',...","['single-player', 'mmo', 'online pvp', 'online...","['t', 'esrb']",['massively multiplayer games'],"['81', 'metacritic']","['mostly positive', 'mostly positive', '341555']"
3,3,1172470,Apex Legends™,"Nov 4, 2020","Apex Legends is the award-winning, free-to-pla...","['action', 'adventure', 'free to play']","['free to play', 'multiplayer', 'battle royale...","['online pvp', 'online co-op', 'steam achievem...","['t', 'esrb']","['action games', 'apex legends official franch...","['88', 'metacritic']","['very positive', 'mostly positive', '527702']"
4,4,1599340,Lost Ark,"Feb 11, 2022",Embark on an odyssey for the Lost Ark in a vas...,"['action', 'adventure', 'free to play', 'massi...","['mmorpg', 'free to play', 'action rpg', 'rpg'...","['single-player', 'mmo', 'online pvp', 'online...","['m', 'esrb']",['free to play games'],"[none, none]","['mostly positive', 'mostly positive', '174373']"
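
Rows such as Call of Duty®: Modern Warfare® II still carry `[none, none]` in `critic_data` and `[none, none, none]` in `reviews_data`, because those cells are lists of placeholder strings rather than NaN. A minimal sketch of how such cells could be detected follows. The helper name `is_all_none` is hypothetical, and whether to drop or keep these rows is left open.

```python
def is_all_none(items):
    """Return True when every item in a parsed list cell is the placeholder 'none'."""
    # Strip whitespace and any leftover quote characters before comparing.
    cleaned = [item.strip().strip("'\"") for item in items]
    return len(cleaned) > 0 and all(item == "none" for item in cleaned)


print(is_all_none(["none", "none", "none"]))
print(is_all_none(["'83'", "'metacritic'"]))
```

Applied to a column, for example `df_clean['reviews_data'].apply(is_all_none)`, this would give a boolean mask marking the games with no review data.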


# Data Analysis & Results (EDA)

Planned EDA steps:
1. Sentiment analysis (using nltk) on the three list columns: developer_genres_list, user_tags_list, game_features_list.
2. Remove stop words.
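
The stop-word step can be sketched without any downloads as below. The stop-word set here is a deliberately tiny, hypothetical stand-in (NLTK's `stopwords.words("english")` provides a fuller list), and both the function name `remove_stop_words` and the sample sentence are made up for illustration.

```python
import re

# Hypothetical, deliberately small stop-word set for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to", "with"}


def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(remove_stop_words("A colony sim for fans of the genre"))
# -> ['colony', 'sim', 'fans', 'genre']
```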

In [None]:
# Shorthand references to the three list columns.
dgl = df_clean['developer_genres_list']
utl = df_clean['user_tags_list']
gfl = df_clean['game_features_list']

# Remove symbols such as the trademark symbols from the titles.
df_clean = df_clean.assign(game_name=df_clean['game_name'].apply(removeSymbols))
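
The research question mentions the mismatch between developer-tagged and user-tagged genres. One possible way to turn that into a number is one minus the Jaccard similarity of the two tag sets. This is a sketch of that idea, not a method taken from the notebook, and the function name `tag_mismatch` is hypothetical.

```python
def tag_mismatch(dev_genres, user_tags):
    """Return 1 - Jaccard similarity of two tag lists (0 = identical sets, 1 = no overlap)."""
    # Strip whitespace and leftover quote characters so "'action'" and "action" match.
    dev = {tag.strip().strip("'\"") for tag in dev_genres}
    user = {tag.strip().strip("'\"") for tag in user_tags}
    union = dev | user
    if not union:
        return 0.0  # two empty lists are treated as a perfect match
    return 1 - len(dev & user) / len(union)


print(tag_mismatch(["'action'", "'free to play'"], ["'fps'", "'action'"]))
```

Because user tag lists are usually much longer than developer genre lists, the raw score will tend to sit near 1. Comparing only the first few user tags is one adjustment worth considering.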
