# Final Project: Steam Game Recommender System

### Video games have become a popular hobby among college students, offering a blend of entertainment, creativity, and community. From the most famous titles to hidden, "underrated" games, every game generates data. Our team will use game data to analyze trends, patterns, and performance metrics of games on the Steam platform. By analyzing factors such as genre, pricing, user reviews, and player engagement, we aim to provide insights that help the public better understand what influences game popularity and success. Our project will identify which genre attributes correlate with higher user engagement and with overall rating trends.


In [48]:
#Imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Data

#### Our dataset comes from the steam-insights GitHub repository (https://github.com/NewbieIndieGameDev/steam-insights). It contains fields such as titles, genres, release dates, developer/publisher information, tags, reviews, ratings, and price.


# Importing Dataset

In [20]:
games_df = pd.read_csv('games.csv', engine="python",
    sep=",",
    on_bad_lines="skip",
    encoding="utf-8"
)
categories_df = pd.read_csv('categories.csv')
reviews_df = pd.read_csv('reviews.csv',engine="python",
    sep=",",
    on_bad_lines="skip",
    encoding="utf-8"
)
genres_df = pd.read_csv('genres.csv',engine="python",
    sep=",",
    on_bad_lines="skip",
    encoding="utf-8"
)
steamspy_df = pd.read_csv('steamspy_insights.csv', engine="python",
    sep=",",
    on_bad_lines="skip",
    encoding="utf-8"
)

## Cleaning in Order of Importance

In [24]:
games_df.isna().sum()

app_id             0
name              24
release_date      32
is_free           32
price_overview    32
languages         32
type              32
dtype: int64

In [25]:
categories_df.isna().sum()

app_id      0
category    0
dtype: int64

In [26]:
reviews_df.isna().sum()

app_id                       0
review_score                50
review_score_description    58
positive                    61
negative                    61
total                       61
metacritic_score            66
reviews                     68
recommendations             68
steamspy_user_score         68
steamspy_score_rank         68
steamspy_positive           68
steamspy_negative           68
dtype: int64

In [27]:
genres_df.isna().sum()

app_id    0
genre     0
dtype: int64

In [28]:
steamspy_df.isna().sum()

app_id                         0
developer                      3
publisher                     35
owners_range                   0
concurrent_users_yesterday     0
playtime_average_forever       0
playtime_average_2weeks        0
playtime_median_forever        0
playtime_median_2weeks         0
price                          0
initial_price                  0
discount                       0
languages                      0
genres                         0
dtype: int64

Since the number of missing values is small relative to the size of our dataset, we can comfortably remove rows with missing values instead of using a form of imputation.

In [29]:
reviews_df = reviews_df.dropna()
categories_df = categories_df.dropna()
genres_df = genres_df.dropna()
steamspy_df = steamspy_df.dropna()
games_df = games_df.dropna()
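
One way to back up the decision to drop rows is to measure what share of rows contain at least one missing value. The sketch below uses a small made-up frame; `toy_df` and the helper `missing_row_share` are hypothetical names, not part of the project. The same helper could be applied to each frame before calling `dropna()`.

```python
import numpy as np
import pandas as pd

def missing_row_share(df: pd.DataFrame) -> float:
    """Return the fraction of rows that contain at least one missing value."""
    if len(df) == 0:
        return 0.0
    return float(df.isna().any(axis=1).mean())

# Toy data standing in for one of the project frames (values are made up).
toy_df = pd.DataFrame({
    "app_id": [10, 20, 30, 40],
    "name": ["A", None, "C", "D"],
    "price": [9.99, 4.99, np.nan, 0.0],
})

share = missing_row_share(toy_df)
print(f"{share:.1%} of rows would be dropped by dropna()")
```

If the share turns out to be large for a given frame, imputation or dropping only on the columns actually used may be a better fit.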

## Checking Duplicates

In [None]:
reviews_df.duplicated().sum()

np.int64(0)

In [36]:
categories_df.duplicated().sum()

np.int64(0)

In [37]:
genres_df.duplicated().sum()

np.int64(0)

In [38]:
steamspy_df.duplicated().sum()


np.int64(0)

In [39]:
games_df.duplicated().sum()

np.int64(0)
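
Called without arguments, `duplicated()` only flags rows that match in every column. For frames expected to hold one row per game, it may also be worth checking whether an `app_id` repeats. The sketch below illustrates the difference on made-up data (`toy_df` is a hypothetical frame). For `categories_df` and `genres_df`, repeated `app_id` values would be expected whenever a game has several categories or genres.

```python
import pandas as pd

# Toy frame (made-up values): no two rows are fully identical,
# but app_id 20 appears twice with different names.
toy_df = pd.DataFrame({
    "app_id": [10, 20, 20, 30],
    "name": ["A", "B", "B (demo)", "C"],
})

# Rows identical across all columns.
full_row_dupes = int(toy_df.duplicated().sum())

# Rows whose app_id has already appeared earlier in the frame.
key_dupes = int(toy_df.duplicated(subset=["app_id"]).sum())

print(full_row_dupes, key_dupes)
```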

## Standardizing Data

Making sure release date is in datetime format

In [31]:
games_df["release_date"] = pd.to_datetime(games_df["release_date"], errors='coerce')
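
With `errors='coerce'`, values that cannot be parsed become `NaT` silently, so it can help to count how many non-missing values were lost in the conversion. A minimal sketch on made-up dates (`raw_dates` is a hypothetical series, and the ISO date format is an assumption for illustration only):

```python
import pandas as pd

# Made-up raw values: two valid dates, one unparseable string, one missing value.
raw_dates = pd.Series(["2020-01-15", "not a date", None, "2019-11-05"])
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Values that were present in the raw column but could not be parsed.
newly_missing = int((parsed.isna() & raw_dates.notna()).sum())

print(f"{newly_missing} value(s) became NaT during parsing")
```

Running the same comparison on a copy of the raw `release_date` column would show whether the coercion discards a meaningful number of rows.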

Making languages numerical (the number of languages listed for each game)

In [40]:
steamspy_df["languages"] = steamspy_df["languages"].apply(lambda x: len(x.split(',')) if isinstance(x, str) else 0)
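
One detail of counting with `split(',')` is that an empty string still yields one piece, so it counts as one language. The sketch below is a slightly stricter variant on made-up values; `count_languages` and `toy_languages` are hypothetical names, and the comma-separated format is assumed from the cell above.

```python
import pandas as pd

def count_languages(value) -> int:
    """Count comma-separated entries; non-string or blank values count as 0."""
    if not isinstance(value, str):
        return 0
    # Strip whitespace and ignore empty pieces such as those from a trailing comma.
    return sum(1 for part in value.split(",") if part.strip())

# Made-up examples: three languages, one language, an empty string, a missing value.
toy_languages = pd.Series(["English, French, German", "English", "", None])
counts = toy_languages.apply(count_languages)

print(counts.tolist())
```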

In [46]:
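## Merging DataFrames on App ID
# pd.concat(axis=1) aligns rows by index label, not by app_id, so merge on the key instead.
# languages exists in both games_df and steamspy_df, hence the suffixes.
merged_df = games_df.merge(steamspy_df, on="app_id", how="inner", suffixes=("_games", "_steamspy"))
merged_df = merged_df.merge(reviews_df, on="app_id", how="inner")
# categories_df and genres_df may hold several rows per app_id, which repeats a game's row.
merged_df = merged_df.merge(categories_df, on="app_id", how="inner")
merged_df = merged_df.merge(genres_df, on="app_id", how="inner")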
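
If a game has several rows in `genres_df` or `categories_df`, joining those tables on `app_id` produces one output row per combination. One possible way to keep a single row per game is to collapse the one-to-many table first. The sketch below shows this on made-up data; `games`, `genres`, and `genres_per_game` are hypothetical names, and joining the genres into a comma-separated string is just one option.

```python
import pandas as pd

# Made-up stand-ins for games_df and genres_df.
games = pd.DataFrame({"app_id": [10, 20, 30], "name": ["A", "B", "C"]})
genres = pd.DataFrame({
    "app_id": [10, 10, 20],
    "genre": ["Action", "Indie", "RPG"],
})

# Collapse the one-to-many table to one row per app_id before merging.
genres_per_game = (
    genres.groupby("app_id")["genre"]
    .agg(", ".join)
    .reset_index()
)

# validate="one_to_one" raises a MergeError if either side repeats an app_id.
merged = games.merge(genres_per_game, on="app_id", how="inner", validate="one_to_one")

print(merged)
```

With `how="inner"`, games that have no genre rows (app_id 30 here) are dropped; `how="left"` would keep them with a missing `genre`.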

## Saving Dataframe as CSV

In [47]:
merged_df.to_csv('merged_steam_data.csv', index=False)

## Feature Engineering

## Experimenting

#### We will attempt clustering with K-Means to find clusters of numerical patterns, and possibly hierarchical clustering to visualize game similarities
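
As a starting point for this experiment, the sketch below runs K-Means and agglomerative (hierarchical) clustering with scikit-learn on synthetic data from `make_blobs`. The data, the choice of three clusters, the four features, and the Ward linkage are all assumptions for illustration; the real run would use scaled numeric columns from the merged Steam data, and the number of clusters would need to be chosen from the data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric game features (not real Steam data).
X, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

# Put features on a common scale so no single column dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

# K-Means: assigns each point to the nearest of n_clusters centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Hierarchical clustering: Ward linkage merges the pair of clusters
# that least increases the total within-cluster variance.
hier = AgglomerativeClustering(n_clusters=3, linkage="ward")
hier_labels = hier.fit_predict(X_scaled)

# Cluster sizes for each method.
print(np.bincount(kmeans_labels), np.bincount(hier_labels))
```

Scaling comes first because both methods rely on Euclidean distances, so a column with a large range (for example playtime in minutes) would otherwise outweigh a column such as price.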
