***Imports***

In [2]:
import pandas as pd

***Load Datasets***

In [3]:
# Data for Content Filtering
games_content = pd.read_csv('../data/raw-data/content-filtering/games.csv')
genres = pd.read_csv('../data/raw-data/content-filtering/genres.csv')
platforms = pd.read_csv('../data/raw-data/content-filtering/platforms.csv')
scores = pd.read_csv('../data/raw-data/content-filtering/scores.csv')
developers = pd.read_csv('../data/raw-data/content-filtering/developers.csv')

# Data for Collaborative Filtering
games_collab = pd.read_csv('../data/raw-data/collaborative-filtering/steam-200k.csv',
                            names=['userid', 'name', 'behavior', 'hours', '0']) # creating column names

# **Exploring Content Filtering Data**

This dataset was obtained from backloggd.com, a site where users keep track of their video game collections.

The link to the dataset is here: https://www.kaggle.com/datasets/gsimonx37/backloggd?select=games.csv

### **Exploring Games Dataset**

In [59]:
games_content.head()

Unnamed: 0,id,name,date,rating,reviews,plays,playing,backlogs,wishlists,description
0,1000001,Cathode Ray Tube Amusement Device,1947-12-31,3.5,65.0,117.0,1.0,28.0,56.0,The cathode ray tube amusement device is the e...
1,1000002,Bertie the Brain,1950-08-25,2.5,11.0,24.0,0.0,6.0,12.0,Currently considered the first videogame in hi...
2,1000003,Nim,1951-12-31,1.8,2.0,11.0,0.0,2.0,6.0,The Nimrod was a special purpose computer that...
3,1000004,Draughts,1952-08-31,2.4,3.0,17.0,0.0,3.0,7.0,A game of draughts (a.k.a. checkers) written f...
4,1000005,OXO,1952-12-31,3.1,14.0,52.0,1.0,12.0,13.0,OXO was a computer game developed by Alexander...


In [60]:
games_content.shape

(172512, 10)

**Insights:** We can see that this data has **10** features and **172512** rows. These features include:


- **id**
    - Identifier of each game
    - This column will be used to join each dataframe together
- **name**
    - This is the name of each game
    - Will be used to identify each game
- **date**
    - The release date of each game
    - Will be reduced to just the release year
- **rating**
    - Rating of each game
- **reviews**
    - Number of reviews
- **plays**
    - Total number of players
- **playing**
    - Total number of players currently playing
    - This column will be removed due to redundancy
- **backlogs**
    - How many players have put this game in their backlog
    - This column will be removed as it doesn't give significant information about the game in general
- **wishlists**
    - How many people have put this game in their wishlist
    - This column will be removed since it can be shown already with previous columns
- **description**
    - A description of the game
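The column plans listed above (keep only the release year, drop **playing**, **backlogs**, and **wishlists**) could look roughly like the sketch below. It runs on a small hand-made frame rather than the real `games_content`, the third row's missing date is invented to show how nulls behave, and the `year` column name is an assumption.

```python
import pandas as pd

# Hypothetical sample: two rows mirror the head() output, the third has a made-up missing date
sample = pd.DataFrame({
    'id': [1000001, 1000002, 1000003],
    'name': ['Cathode Ray Tube Amusement Device', 'Bertie the Brain', 'Nim'],
    'date': ['1947-12-31', '1950-08-25', None],
    'playing': [1.0, 0.0, 0.0],
    'backlogs': [28.0, 6.0, 2.0],
    'wishlists': [56.0, 12.0, 6.0],
})

# Parse the date strings; missing or unparseable values become NaT
parsed = pd.to_datetime(sample['date'], errors='coerce')

# Keep only the year; the nullable Int64 dtype lets missing years stay as <NA>
sample['year'] = parsed.dt.year.astype('Int64')

# Drop the original date column and the columns marked for removal
trimmed = sample.drop(columns=['date', 'playing', 'backlogs', 'wishlists'])
print(trimmed)
```

Using `errors='coerce'` is one possible choice here: it keeps rows with bad dates instead of raising, so they can be handled together with the other nulls during cleaning.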

In [51]:
games_content.dtypes

id               int64
name            object
date            object
rating         float64
reviews        float64
plays          float64
playing        float64
backlogs       float64
wishlists      float64
description     object
dtype: object

**Insights:** All the data types make sense for the information they hold. The majority of the columns are numerical. Only the **date** feature will need to be converted later into a numerical year.

In [41]:
games_content.isnull().sum()

id                  0
name                0
date            34781
rating         116943
reviews             1
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
dtype: int64

**Insights:** There are lots of nulls in this dataset. These will need to be removed or filled. Because about 68% of the **rating** values are null (116943 of 172512), this column will be removed entirely.
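One way to turn the null counts into a rule is to compute the share of nulls per column and drop any column above a cutoff. The sketch below uses a toy frame, and the 50% cutoff (`max_null_share`) is an assumed value for illustration, not one taken from this project.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for games_content (values are made up)
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'rating': [3.5, np.nan, np.nan, np.nan],
    'plays': [117.0, 24.0, np.nan, 17.0],
})

# Share of null values in each column (0.0 to 1.0)
null_share = toy.isnull().mean()

# Hypothetical cutoff: drop any column that is more than half null
max_null_share = 0.5
too_sparse = null_share[null_share > max_null_share].index

# Drop the sparse columns, then drop rows that still contain nulls
cleaned = toy.drop(columns=too_sparse).dropna()
print(cleaned)
```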

In [47]:
games_content['id'].value_counts()

id
1000001    1
1115012    1
1115004    1
1115005    1
1115006    1
          ..
1057506    1
1057507    1
1057508    1
1057509    1
1172512    1
Name: count, Length: 172512, dtype: int64

**Insights:** Every **id** appears exactly once, so there are no duplicate ids to remove.

In [48]:
games_content['name'].value_counts()

name
Pac-Man                             41
Frogger                             38
Tetris                              32
Space Invaders                      24
Donkey Kong                         24
                                    ..
Arlyeh Center for Heart Diseases     1
Sonic Inflation 2: Battle            1
Fallen from Grace                    1
A Fragment of Her                    1
EXE Clash                            1
Name: count, Length: 136313, dtype: int64

**Insights:** We can see that many games share the same name. This is because there are multiple versions of the same game. Many of these versions will need to be filtered out during cleaning to avoid confusion. The names will also need to be lowercased and stripped of punctuation so they can be matched easily later.
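A minimal sketch of the name normalization described above, using pandas string methods on a few example titles. The regular expressions are one reasonable choice rather than the method used in the cleaning notebook. Note that removing hyphens joins the two halves of a word, so "Pac-Man" becomes "pacman".

```python
import pandas as pd

# Example titles (the second and fourth are made-up variants)
names = pd.Series(['Pac-Man', 'PAC-MAN', 'Donkey Kong', 'Ms. Pac-Man'])

normalized = (
    names.str.lower()                              # lowercase everything
         .str.replace(r'[^\w\s]', '', regex=True)  # remove punctuation
         .str.replace(r'\s+', ' ', regex=True)     # collapse repeated whitespace
         .str.strip()                              # trim leading/trailing spaces
)
print(normalized.tolist())
```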

In [64]:
games_content[['reviews', 'plays', 'wishlists']].describe()

Unnamed: 0,reviews,plays,wishlists
count,172511.0,171818.0,171818.0
mean,7.520668,118.013078,17.30147
std,74.331055,1106.022054,145.769591
min,0.0,-1.0,-1.0
25%,0.0,0.0,0.0
50%,0.0,2.0,0.0
75%,1.0,9.0,3.0
max,5464.0,61444.0,8311.0


**Insights:** At least half of the games have **0** reviews (the median is 0), so the mean of about 7.5 is pulled up by a small number of heavily reviewed games. **plays** and **wishlists** are right-skewed in the same way: their means (about 118 and 17.3) sit far above their medians (2 and 0). Both columns also have a minimum of **-1**, which is not a valid count and will need to be handled during cleaning. We will remove games with very few **plays** to filter out games that almost no one has played.
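Filtering on **plays** could be sketched as below. The threshold `min_plays = 10` is a placeholder chosen for illustration, and the frame is a toy stand-in for `games_content`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for games_content, including an invalid -1 and a null
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'plays': [117.0, 9.0, 0.0, -1.0, np.nan],
})

# Hypothetical threshold for "enough" plays
min_plays = 10

# Comparisons against NaN are False, so this mask also drops nulls and the -1 rows
popular = toy[toy['plays'] >= min_plays]
print(popular)
```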

### **Exploring Genre Dataset**

In [65]:
genres.head()

Unnamed: 0,id,genre
0,1000001,Point-and-Click
1,1000002,Puzzle
2,1000002,Tactical
3,1000003,Pinball
4,1000003,Strategy


In [67]:
genres.shape

(286025, 2)

**Insights:** This dataset has **2** columns with **286025** rows. Here are the features of this dataset:

- **id**
    - Will be used to join dataframes together
- **genre**
    - The genre of each game
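Because a game can have several genre rows, joining `genres` directly onto the games table would repeat each game once per genre. One possible approach, sketched here on rows copied from the `head()` outputs, is to collect the genres into a list per **id** first and then merge. The `genres` column name in the result is an assumption.

```python
import pandas as pd

# Rows copied from the head() outputs of the two datasets
games = pd.DataFrame({
    'id': [1000001, 1000002, 1000003],
    'name': ['Cathode Ray Tube Amusement Device', 'Bertie the Brain', 'Nim'],
})
genre_rows = pd.DataFrame({
    'id': [1000001, 1000002, 1000002, 1000003, 1000003],
    'genre': ['Point-and-Click', 'Puzzle', 'Tactical', 'Pinball', 'Strategy'],
})

# Collapse to one row per game, with all of its genres in a list
genre_lists = genre_rows.groupby('id')['genre'].agg(list).reset_index(name='genres')

# A left join keeps every game, even one that has no genre rows
merged = games.merge(genre_lists, on='id', how='left')
print(merged)
```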

In [70]:
genres.dtypes

id        int64
genre    object
dtype: object

**Insights:** The genre is categorical while the id is an integer. This makes sense and won't need to be changed.

In [68]:
genres.isnull().sum()

id       0
genre    0
dtype: int64

**Insights:** There are no nulls in this data.

In [71]:
genres['genre'].value_counts()

genre
Indie                  50501
Adventure              49653
Simulator              22828
RPG                    22320
Strategy               21701
Shooter                18542
Puzzle                 17496
Arcade                 14872
Platform               14025
Sport                  10407
Visual Novel            7898
Racing                  7270
Fighting                4953
Point-and-Click         3992
Brawler                 3645
Turn Based Strategy     3411
Card & Board Game       3082
Music                   2707
Tactical                2428
Real Time Strategy      2237
Quiz/Trivia             1285
Pinball                  631
MOBA                     141
Name: count, dtype: int64

**Insights:** From this we can see that the biggest genre of games is **Indie** while the smallest genre is **MOBA**.

### **Exploring Platforms Dataset**

In [72]:
platforms.head()

Unnamed: 0,id,platform
0,1000001,Analogue electronics
1,1000002,Arcade
2,1000003,Ferranti Nimrod Computer
3,1000004,Legacy Computer
4,1000005,Windows PC


In [142]:
platforms.shape

(261475, 2)

**Insights:** There are **2** columns in this dataset with **261475** rows. Here are the features:

- **id**
    - Will be used to join dataframes together
- **platform**
    - The platform that the game can be played on

In [80]:
platforms.dtypes

id           int64
platform    object
dtype: object

**Insights:** These data types make sense for the data we are working with.

In [75]:
platforms.isnull().sum()

id          0
platform    0
dtype: int64

**Insights:** There are no nulls to be cleaned in this data.

In [101]:
platforms['platform'].value_counts()

platform
Windows PC                  80883
Mac                         18335
Nintendo Switch             14433
PlayStation 4               12870
Linux                       11209
                            ...  
PDP-1                           1
Donner Model 30                 1
EDSAC                           1
Ferranti Nimrod Computer        1
visionOS                        1
Name: count, Length: 199, dtype: int64

**Insights:** From this we can see there are lots of platforms that have only one game to them. A lot of these platforms are going to be removed as there are not enough games on them.
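Removing rare platforms can be done by counting games per platform and keeping only the rows whose platform meets a minimum. In this sketch the cutoff `min_games = 2` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the platforms dataset
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'platform': ['Windows PC', 'Windows PC', 'Mac', 'Windows PC', 'EDSAC'],
})

# Hypothetical minimum number of games a platform needs to be kept
min_games = 2

# Count games per platform and keep the platforms at or above the cutoff
counts = toy['platform'].value_counts()
common = counts[counts >= min_games].index

filtered = toy[toy['platform'].isin(common)]
print(filtered)
```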

### **Exploring Scores Dataset**

In [77]:
scores.head()

Unnamed: 0,id,score,amount
0,1000001,0.5,10
1,1000001,1.0,5
2,1000001,1.5,1
3,1000001,2.0,3
4,1000001,2.5,9


In [78]:
scores.shape

(1725120, 3)

**Insights:** There are **3** columns in this dataset with **1725120** rows. Here are the features:

- **id**
    - Will be used to join DataFrames together
- **score**
    - The score of each game
- **amount**
    - The number of users who gave the game that score
    - Can use the amount and score to get an average score on each game
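The average score mentioned above is a weighted mean: each score is weighted by the number of users who gave it, so the average is sum(score × amount) / sum(amount) per game. A minimal sketch on made-up rows follows. Games with no ratings at all divide zero by zero and come out as NaN.

```python
import pandas as pd

# Made-up score distribution for two games; game 2 has no ratings at all
toy = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'score': [0.5, 1.0, 1.5, 4.0, 5.0],
    'amount': [10, 5, 1, 0, 0],
})

# Total of score * amount and total amount per game
totals = (
    toy.assign(weighted=toy['score'] * toy['amount'])
       .groupby('id')[['weighted', 'amount']]
       .sum()
)

# Weighted mean per game; 0 / 0 yields NaN for unrated games
avg_score = totals['weighted'] / totals['amount']
print(avg_score)
```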

In [81]:
scores.dtypes

id          int64
score     float64
amount      int64
dtype: object

**Insights:** All the types make sense for what the data is.

In [82]:
scores.isnull().sum()

id        0
score     0
amount    0
dtype: int64

**Insights:** There are no nulls in this data.

In [166]:
scores[['amount']].describe()

Unnamed: 0,amount
count,1725120.0
mean,6.827407
std,97.94893
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,17282.0


**Insights:** The mean amount per row is about **6.83**, but at least 75% of the rows have an amount of **0** and the maximum is **17282**, so the distribution is heavily right-skewed. This makes sense, since some games are much more popular than others.

### **Exploring Developers Dataset**

In [84]:
developers.head()

Unnamed: 0,id,developer
0,1000002,Josef Kates
1,1000004,Christopher Strachey
2,1000005,"Alexander Shafto ""Sandy"" Douglas"
3,1000005,University of Warwick
4,1000007,William Higinbotham


In [86]:
developers.shape

(143454, 2)

**Insights:** There are **2** columns in this data with **143454** rows. Here are the features:

- **id**
    - Will be used to join DataFrames together
- **developer**
    - The developer of each game

In [87]:
developers.dtypes

id            int64
developer    object
dtype: object

**Insights:** The data types make sense for the data we are working with.

In [88]:
developers.isnull().sum()

id           0
developer    1
dtype: int64

**Insights:** There is only one null in this dataset. This null row will be removed.

In [167]:
developers['developer'].value_counts().describe()

count    30502.000000
mean         4.703069
std         26.095290
min          1.000000
25%          1.000000
50%          1.000000
75%          3.000000
max       1926.000000
Name: count, dtype: float64

In [168]:
developers['developer'].value_counts()

developer
Nintendo                 1926
Konami                   1575
Sega                     1476
Electronic Arts          1157
Capcom                   1003
                         ... 
SuperTree                   1
Bruno Oliveira (btco)       1
Librarium Studio            1
Cylinder Studios            1
Keepsake Games AB           1
Name: count, Length: 30502, dtype: int64

**Insights:** From this data we can see that a developer has made about **4.70** games on average, while the median is **1**. This shows that the majority of developers have made only a few games, while a handful (such as Nintendo, Konami, and Sega) have made more than a thousand.

# **Exploring Collaborative Filtering Data**

This dataset is based on users who have purchased and played games on Steam.

The link to this dataset can be found here: https://www.kaggle.com/datasets/tamber/steam-video-games

### **Exploring Games Dataset**

In [150]:
games_collab.head()

Unnamed: 0,userid,name,behavior,hours,0
0,151603712,The Elder Scrolls V Skyrim,purchase,1.0,0
1,151603712,The Elder Scrolls V Skyrim,play,273.0,0
2,151603712,Fallout 4,purchase,1.0,0
3,151603712,Fallout 4,play,87.0,0
4,151603712,Spore,purchase,1.0,0


In [151]:
games_collab.shape

(200000, 5)

**Insights:** There are **5** columns in this dataset with **200000** rows. Here are the features:

- **userid**
    - Will be used to tell users apart
- **name**
    - The name of the game. Will be used to identify each game
- **behavior**
    - Shows whether the person has purchased or played the game
    - Both rows appear if the person has done both. One row will need to be removed so that each game shows either played or purchased
- **hours**
    - The number of hours played. The value is 1 for purchase rows
    - Raw hours are hard to compare because typical playtime varies from game to game
    - This column will be removed since we will only judge whether the user just bought the game or also played it

The final column seems to be an error since the data does not make sense. It will be removed. 
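One way to collapse the purchase and play rows into a single row per user and game is sketched below, using the five rows shown by `head()`. The `played` flag is a hypothetical column name: it is `True` when the pair has any play row and `False` when the game was only purchased.

```python
import pandas as pd

# The five rows shown by games_collab.head()
toy = pd.DataFrame({
    'userid': [151603712] * 5,
    'name': ['The Elder Scrolls V Skyrim', 'The Elder Scrolls V Skyrim',
             'Fallout 4', 'Fallout 4', 'Spore'],
    'behavior': ['purchase', 'play', 'purchase', 'play', 'purchase'],
    'hours': [1.0, 273.0, 1.0, 87.0, 1.0],
    '0': [0] * 5,
})

# Flag play rows, then take the max per (user, game): True if any row is a play
interactions = (
    toy.assign(played=toy['behavior'].eq('play'))
       .groupby(['userid', 'name'], as_index=False)['played']
       .max()
)
print(interactions)
```

The result has one row per user and game, and the **hours** and **0** columns are gone because they are not selected in the aggregation.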

In [152]:
games_collab.dtypes

userid        int64
name         object
behavior     object
hours       float64
0             int64
dtype: object

**Insights:** The data types make sense for the data that is being used. Nothing will need to be changed.

In [153]:
games_collab.isnull().sum()

userid      0
name        0
behavior    0
hours       0
0           0
dtype: int64

**Insights:** There are no nulls in this data.

In [154]:
games_collab['name'].value_counts()

name
Dota 2                             9682
Team Fortress 2                    4646
Counter-Strike Global Offensive    2789
Unturned                           2632
Left 4 Dead 2                      1752
                                   ... 
Putt-Putt Joins the Parade            1
Ducati World Championship             1
Chunk of Change Knight                1
STASIS                                1
Soccertron                            1
Name: count, Length: 5155, dtype: int64

**Insights:** From this data we can see that several games have thousands of purchase/play records, while others appear only once.

In [159]:
games_collab['userid'].value_counts()

userid
62990992     1573
33865373      949
11403772      906
30246419      901
47457723      855
             ... 
89988424        1
283979950       1
121382416       1
209746499       1
198709823       1
Name: count, Length: 12393, dtype: int64

In [163]:
# counts users with more than 2 rows; a game should add at most 2 rows per user
# (one purchase, one play), so these users have bought/played at least 2 games
user_row_counts = games_collab['userid'].value_counts()
(user_row_counts > 2).sum()

6507

**Insights:** From this data we can see that some users have only a single row, meaning one purchased game. At least **6507** users have more than 2 rows, and since a game should contribute at most two rows per user (one purchase and one play), each of them has purchased/played at least 2 games. The true number is likely higher, because a user with exactly 2 rows may have purchased two different games without playing either.
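Counting distinct game names per user avoids the duplicate-row problem altogether. The sketch below uses made-up users to show how this differs from counting rows: user 4 has only 2 rows but 2 different games, so a row-count filter of more than 2 would miss them.

```python
import pandas as pd

# Made-up interaction rows for four users
toy = pd.DataFrame({
    'userid': [1, 1, 1, 2, 2, 3, 4, 4],
    'name': ['Dota 2', 'Dota 2', 'Spore', 'Fallout 4', 'Fallout 4',
             'Unturned', 'Spore', 'Unturned'],
    'behavior': ['purchase', 'play', 'purchase', 'purchase', 'play',
                 'purchase', 'purchase', 'purchase'],
})

# Number of distinct games per user, regardless of purchase/play duplicates
games_per_user = toy.groupby('userid')['name'].nunique()
multi_game_users = int((games_per_user >= 2).sum())

# Row-count approach for comparison: users with more than 2 rows
by_row_count = int((toy['userid'].value_counts() > 2).sum())

print(multi_game_users, by_row_count)
```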

In [164]:
games_collab['behavior'].value_counts()

behavior
purchase    129511
play         70489
Name: count, dtype: int64

**Insights:** There are **129511** purchase rows but only **70489** play rows, so some purchased games were never played. Assuming every play row has a matching purchase row, about **54%** of purchased games were also played.

# **Conclusion**

From all this we can see that a lot of data needs to be cleaned: removing nulls, removing duplicates, changing values, and joining dataframes together. All of this will be done in the next notebook.