## Steam Game Analysis and Prediction

### Introduction

This project performs a predictive analysis on a public dataset of over 80,000 Steam games, each described by many attributes. The primary objective is to develop a model that suggests similar games based on a user's selection. For instance, if a user selects Grand Theft Auto V, the model would return the five games that most closely resemble it, such as Grand Theft Auto: San Andreas.

By leveraging this predictive capability, we seek to enhance game recommendations for Steam users and gain deeper insights into the complex relationships between various game attributes.
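
The notebook does not show the similarity model itself, so the following is only a minimal sketch of one way "most similar games" could be ranked: Jaccard similarity over each game's tag set. The function names `jaccard` and `top_similar`, the toy catalog, and the choice of Jaccard similarity are all assumptions made for illustration, not the project's actual method.

```python
# Hypothetical sketch: rank games by Jaccard similarity of their tag sets.
# The catalog below is toy data, not taken from the Steam dataset.

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_similar(catalog: dict, name: str, k: int = 5) -> list:
    """Return up to k (game, score) pairs most similar to `name`."""
    target = catalog[name]
    scores = [
        (other, jaccard(target, tags))
        for other, tags in catalog.items()
        if other != name
    ]
    # Highest score first; ties are broken alphabetically for determinism
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


catalog = {
    "game a": {"action", "open world", "crime"},
    "game b": {"action", "open world", "crime", "multiplayer"},
    "game c": {"puzzle", "casual"},
    "game d": {"action", "shooter"},
}

ranking = top_similar(catalog, "game a", k=2)
print(ranking)
```

With the real data, the tag sets could come from splitting the `Tags` column on commas; whether tags alone are enough to produce good recommendations is an open question here.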

In [107]:
import pandas as pd
from pathlib import Path


target_path = Path("..")/"data"/"processed"/"cleaned_data.csv"

df = pd.read_csv(target_path)

In [108]:
df

|  | Name | Categories | Genres | Tags |
|---|---|---|---|---|
| 0 | galactic bowling | single-player,multi-player,steam achievements,... | casual,indie,sports | indie,casual,sports,bowling |
| 1 | train bandit | single-player,steam achievements,full controll... | action,indie | indie,action,pixel graphics,2d,retro,arcade,sc... |
| 2 | henosis™ | single-player,full controller support | adventure,casual,indie | 2d platformer,atmospheric,surreal,mystery,puzz... |
| 3 | two weeks in painland | single-player,steam achievements | adventure,indie | indie,adventure,nudity,violent,sexual content,... |
| 4 | wartune reborn | single-player,multi-player,mmo,pvp,online pvp,... | adventure,casual,free to play,massively multip... | turn-based combat,massively multiplayer,multip... |
| ... | ... | ... | ... | ... |
| 62563 | two cubes | multi-player,co-op,online co-op,steam achievem... | adventure,indie | online co-op,adventure,2d platformer,puzzle-pl... |
| 62564 | wisp child | single-player,full controller support | action,adventure | action,adventure,action-adventure,puzzle,2d,to... |
| 62565 | firekrackers | single-player,tracked controller support,vr only | action,casual,indie | vr,arcade,puzzle-platformer,action,destruction... |
| 62566 | nekowater | single-player,steam achievements,remote play t... | adventure,casual,indie | adventure,cats,cute,first-person,exploration,a... |


### Feature Engineering
#### Genre Column

We first clean the dataset by removing entries whose genres are not relevant to games, such as 'documentary'.

Additionally, low-frequency genres that fall below the defined threshold are treated as outliers and removed to improve data quality and model performance.
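
To make the row-removal step concrete, here is a small sketch on toy data. The three-row frame and the names `toy`, `blocked`, and `cleaned` are invented for illustration; the sketch assumes, as in the dataset above, that genres are stored as a comma-separated string.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows)
toy = pd.DataFrame({
    "Name": ["game a", "film b", "game c"],
    "Genres": ["action,indie", "documentary,indie", "casual"],
})
blocked = ["documentary", "movie"]

# One row per (game, genre); the original row label repeats for each genre
exploded = toy["Genres"].str.split(",").explode()

# Row labels of games that have at least one blocked genre
bad_rows = exploded[exploded.isin(blocked)].index.unique()

# Drop those games and renumber the remaining rows from 0
cleaned = toy.drop(bad_rows).reset_index(drop=True)
print(cleaned)
```

A game is dropped as soon as any one of its genres is in the blocked list, even if its other genres are ordinary game genres.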

In [109]:
irrelevant_genres = [
    'design & illustration',
    'animation & modeling',
    'software training',
    'audio production',
    'game development',
    'video production',
    'web publishing',
    'photo editing',
    'documentary',
    'accounting',
    'utilities',
    'education',
    '360 video',
    'episodic',
    'tutorial',
    'movie',
    'short',
    ]

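# Split the comma-separated genres so each genre sits on its own row
# (the row label repeats once per genre of the same game)
df_exploded = df['Genres'].str.split(',').explode()

# Row labels of every game that has at least one irrelevant genre
rows_to_drop = df_exploded[df_exploded.isin(irrelevant_genres)].index.unique()
df.drop(rows_to_drop, inplace=True)

# drop=True discards the old index instead of keeping it as an extra 'index' column
df.reset_index(drop=True, inplace=True)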

In [110]:
min_outliers = [
    'massively multiplayer',
    'sexual content',
    'violent',
    'nudity',
    'gore',
    ]

In [111]:
df_exploded = df['Genres'].str.split(',').explode()

In [112]:
df_exploded = pd.DataFrame(df_exploded)

In [113]:
df_exploded.isin(min_outliers)

|  | Genres |
|---|---|
| 0 | False |
| 0 | False |
| 0 | False |
| 1 | False |
| 1 | False |
| ... | ... |
| 62170 | False |
| 62170 | False |
| 62170 | False |
| 62171 | False |


In [114]:
df_exploded.isin(min_outliers).value_counts()

Genres
False     175065
True        2616
Name: count, dtype: int64

In [117]:
# Filter on the column so matching rows are dropped rather than replaced with NaN
df_exploded[~df_exploded['Genres'].isin(min_outliers)]

|  | Genres |
|---|---|
| 0 | casual |
| 0 | indie |
| 0 | sports |
| 1 | action |
| 1 | indie |
| ... | ... |
| 62170 | adventure |
| 62170 | casual |
| 62170 | indie |
| 62171 | casual |
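
The cells above only inspect the filtered values. One possible way to carry the filter back to one-row-per-game form is to drop the matching genre values and rejoin what remains, sketched here on toy data. The frame `toy`, the list `rare`, and the decision to rejoin with commas are assumptions for illustration; the notebook does not show this step.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows)
toy = pd.DataFrame({
    "Name": ["game a", "game b"],
    "Genres": ["action,gore,indie", "casual,nudity"],
})
rare = ["gore", "nudity"]

# One genre per row, keeping the original row label
exploded = toy["Genres"].str.split(",").explode()

# Keep only genres that are not in the rare list
kept = exploded[~exploded.isin(rare)]

# Rejoin the surviving genres per game; level=0 groups on the row label
toy["Genres"] = kept.groupby(level=0).agg(",".join)
print(toy)
```

Note that a game whose genres are all in the rare list would end up with a missing value in `Genres` after the assignment, so such rows would need separate handling.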


#### Tags Column

In [None]:
replace_tags = {
    'e-sports': 'sports',
    'action rpg': 'action',
    'action rts': 'action',
    'dark comedy': 'comedy',
    '2d platformer': 'platformer',
    '3d platformer': 'platformer',
    'action-adventure': 'adventure',
    'puzzle-platformer': 'puzzle',
    'turn-based combat': 'turn-based',
    'turn-based strategy': 'turn-based',
    'turn-based tactics': 'turn-based',
    'massively multiplayer': 'multiplayer',
    'local multiplayer': 'multiplayer',
    'crpg': 'rpg',
    'jrpg': 'rpg',
    'world war i': 'wargame',
    'world war ii': 'wargame',
    'cold war': 'wargame',
    'traditional roguelike': 'roguelike',
    'action roguelike': 'roguelike',
    'rogue-like': 'roguelike',
    'space sim': 'simulation',
    'farming sim': 'simulation',
    'dating sim': 'simulation',
    'cartoony': 'cartoon',
    'coding': 'programming',
    'hacking': 'programming',
    '2d fighter': 'fighting',
    '3d fighter': 'fighting',
    'immersive sim': 'immersive',
    'political sim': 'politics',
    'political': 'politics',
    'resource management': 'management',
    'time management': 'management',
    'inventory management': 'management',
    'arena shooter': 'shooter',
    'hero shooter': 'shooter',
    'looter shooter': 'shooter',
    'extraction shooter': 'shooter',
    'lore-rich': 'story rich',
    'mini golf': 'golf',
    'dark fantasy': 'fantasy',
    }

df_exploded = df['Tags'].str.split(',').explode()

df_exploded = df_exploded.replace(replace_tags)

In [None]:
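# Regroup the normalized tags into one list per game (level=0 is the original row label)
df_grouped = df_exploded.groupby(level=0).agg(list).reset_index()

# Count each tag once and reuse the result
tag_counts = df_exploded.value_counts()
low_frequency_values = tag_counts[tag_counts < 3000].index

# Occurrences of tags that appear fewer than 3000 times
filtered_df = df_exploded[df_exploded.isin(low_frequency_values)]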
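
As a rough illustration of the tag steps above, the sketch below normalizes tags with a mapping, counts them, and keeps only tags that meet a minimum count before regrouping them per game. The toy rows, the threshold of 2 (the notebook uses 3000 on the full data), and the choice to drop rather than keep the low-frequency tags are assumptions.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows)
toy = pd.DataFrame({
    "Tags": [
        "2d platformer,indie",
        "3d platformer,indie,golf",
        "platformer,mini golf,chess",
    ],
})
mapping = {
    "2d platformer": "platformer",
    "3d platformer": "platformer",
    "mini golf": "golf",
}
min_count = 2  # hypothetical threshold for the toy data

# One tag per row, with near-duplicate tags merged via the mapping
exploded = toy["Tags"].str.split(",").explode().replace(mapping)

# Tags that occur at least min_count times across all games
counts = exploded.value_counts()
frequent = counts[counts >= min_count].index

# Drop rare tags, then collect the remaining tags into one list per game
kept = exploded[exploded.isin(frequent)]
tags_per_game = kept.groupby(level=0).agg(list)
print(tags_per_game)
```

Merging before counting matters: "2d platformer" and "3d platformer" would each fall below the threshold on their own here, but together with "platformer" they pass it.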