***Imports***

In [3]:
import pandas as pd

***Load Datasets***

In [37]:
# Data for Content Filtering
games_content = pd.read_csv('data/raw-data/content-filtering/games.csv')
genres = pd.read_csv('data/raw-data/content-filtering/genres.csv')
platforms = pd.read_csv('data/raw-data/content-filtering/platforms.csv')
scores = pd.read_csv('data/raw-data/content-filtering/scores.csv')
developers = pd.read_csv('data/raw-data/content-filtering/developers.csv')

# Data for Collaborative Filtering
games_collab = pd.read_csv('data/raw-data/collaborative-filtering/games.csv')
recommendations = pd.read_csv('data/raw-data/collaborative-filtering/recommendations.csv')

# **Exploring Content Filtering Data**

This dataset was obtained from backloggd.com, a site where users track their video game collections.

The link to the dataset is here: https://www.kaggle.com/datasets/gsimonx37/backloggd?select=games.csv

### **Exploring Games Dataset**

In [59]:
games_content.head()

,id,name,date,rating,reviews,plays,playing,backlogs,wishlists,description
0,1000001,Cathode Ray Tube Amusement Device,1947-12-31,3.5,65.0,117.0,1.0,28.0,56.0,The cathode ray tube amusement device is the e...
1,1000002,Bertie the Brain,1950-08-25,2.5,11.0,24.0,0.0,6.0,12.0,Currently considered the first videogame in hi...
2,1000003,Nim,1951-12-31,1.8,2.0,11.0,0.0,2.0,6.0,The Nimrod was a special purpose computer that...
3,1000004,Draughts,1952-08-31,2.4,3.0,17.0,0.0,3.0,7.0,A game of draughts (a.k.a. checkers) written f...
4,1000005,OXO,1952-12-31,3.1,14.0,52.0,1.0,12.0,13.0,OXO was a computer game developed by Alexander...


In [60]:
games_content.shape

(172512, 10)

**Insights:** The dataset has **10** features and **172512** rows. The features are:


- **id**
    - Identifier of each game
    - This column will be used to join each dataframe together.
- **name**
    - This is the name of each game
    - Will be used to identify each game.
- **date**
    - The date of when each game was released
    - Will need to be changed into only year.
- **rating**
    - Rating of each game
- **reviews**
    - Number of reviews
- **plays**
    - Total number of players
- **playing**
    - Total number of players currently playing
    - This column will be removed due to redundancy
- **backlogs**
    - How many players have put this game in their backlog
    - This column will be removed as it doesn't give significant information about the game in general
- **wishlists**
    - How many people have put this game in their wishlist
- **description**
    - A description of the game. 
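
The plans in the list above (dropping **playing** and **backlogs**, and joining dataframes on **id**) could look roughly like the following. This is a minimal sketch on toy data, not the notebook's actual cleaning code; the values and the choice of a left join are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for games_content and genres (illustrative values only)
games = pd.DataFrame({
    "id": [1000001, 1000002],
    "name": ["Cathode Ray Tube Amusement Device", "Bertie the Brain"],
    "playing": [1.0, 0.0],
    "backlogs": [28.0, 6.0],
})
genres = pd.DataFrame({
    "id": [1000001, 1000002, 1000002],
    "genre": ["Point-and-Click", "Puzzle", "Tactical"],
})

# Drop the columns flagged for removal
trimmed = games.drop(columns=["playing", "backlogs"])

# Join on the shared id column; a left join keeps games that have no genre row
merged = trimmed.merge(genres, on="id", how="left")
print(merged)
```

Because one game can have several genre rows, the merged frame has more rows than `games`.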

In [51]:
games_content.dtypes

id               int64
name            object
date            object
rating         float64
reviews        float64
plays          float64
playing        float64
backlogs       float64
wishlists      float64
description     object
dtype: object

**Insights:** All the datatypes make sense for the information they hold. The majority of the data is numerical. Only the **date** feature will need to be converted later into a numerical value holding just the year.
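
Converting **date** to a numeric year could be done along these lines. This is a sketch under the assumption that missing or unparseable dates should stay missing; the nullable `Int64` dtype is one way to keep the years as integers despite the gaps.

```python
import pandas as pd

# Toy date column with one missing value (illustrative)
dates = pd.Series(["1947-12-31", "1950-08-25", None])

# errors="coerce" turns missing or unparseable strings into NaT
parsed = pd.to_datetime(dates, errors="coerce")

# Nullable integer dtype keeps years as integers while allowing <NA>
year = parsed.dt.year.astype("Int64")
print(year)
```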

In [41]:
games_content.isnull().sum()

id                  0
name                0
date            34781
rating         116943
reviews             1
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
dtype: int64

**Insights:** There are many nulls in this dataset. These will need to be removed or filled. Because **rating** is null in 116943 of the 172512 rows, this column will be removed entirely.
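
To make the decision about which columns to drop less ad hoc, the null counts can be expressed as a share of rows (for **rating**, 116943 of 172512 is roughly 68%). The sketch below uses toy data, and the 50% cutoff is a hypothetical threshold chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data (illustrative)
df = pd.DataFrame({
    "rating": [3.5, np.nan, np.nan, np.nan],
    "plays": [117.0, 24.0, np.nan, 17.0],
    "name": ["a", "b", "c", "d"],
})

# Fraction of missing values per column
null_share = df.isnull().mean()

THRESHOLD = 0.5  # hypothetical cutoff, not taken from the notebook
to_drop = null_share[null_share > THRESHOLD].index.tolist()

cleaned = df.drop(columns=to_drop)
print(null_share)
print(to_drop)
```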

In [47]:
games_content['id'].value_counts()

id
1000001    1
1115012    1
1115004    1
1115005    1
1115006    1
          ..
1057506    1
1057507    1
1057508    1
1057509    1
1172512    1
Name: count, Length: 172512, dtype: int64

**Insights:** Every **id** appears exactly once, so there are no duplicate rows to remove.
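
The same uniqueness check can be expressed more directly than by reading through `value_counts()`. A small sketch on a toy column:

```python
import pandas as pd

# Toy id column; the real check would run on games_content["id"]
ids = pd.Series([1000001, 1000002, 1000003])

# True when no value repeats
all_unique = ids.is_unique

# Number of rows that repeat an earlier id
n_duplicates = int(ids.duplicated().sum())

print(all_unique, n_duplicates)
```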

In [48]:
games_content['name'].value_counts()

name
Pac-Man                             41
Frogger                             38
Tetris                              32
Space Invaders                      24
Donkey Kong                         24
                                    ..
Arlyeh Center for Heart Diseases     1
Sonic Inflation 2: Battle            1
Fallen from Grace                    1
A Fragment of Her                    1
EXE Clash                            1
Name: count, Length: 136313, dtype: int64

**Insights:** Many games share the same name because there are multiple versions of the same game. Many of these versions will need to be filtered out during cleaning to avoid confusion. The names will also need to be lowercased and stripped of punctuation so they can be matched easily later.
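
The name normalization described above might be sketched as follows. The regular expression is an assumption about what counts as punctuation: it removes every character that is neither a word character nor whitespace, so underscores would be kept.

```python
import pandas as pd

# A few names taken from the output above
names = pd.Series(["Pac-Man", "Sonic Inflation 2: Battle", "OXO"])

normalized = (
    names.str.lower()                          # lowercase everything
    .str.replace(r"[^\w\s]", "", regex=True)   # strip punctuation
    .str.strip()                               # trim stray whitespace
)
print(normalized.tolist())
```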

In [64]:
games_content[['reviews', 'plays', 'wishlists']].describe()

,reviews,plays,wishlists
count,172511.0,171818.0,171818.0
mean,7.520668,118.013078,17.30147
std,74.331055,1106.022054,145.769591
min,0.0,-1.0,-1.0
25%,0.0,0.0,0.0
50%,0.0,2.0,0.0
75%,1.0,9.0,3.0
max,5464.0,61444.0,8311.0


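**Insights:** All three features are heavily right-skewed. The median game has 0 reviews, 2 plays, and 0 wishlists, while the maxima reach 5464, 61444, and 8311. The minimum of -1 for **plays** and **wishlists** is not a valid count, so those rows should be checked during cleaning.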
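
One way to handle negative counts such as the -1 minimum in the summary above is to treat them as missing. This is a sketch on toy data; whether -1 really is a placeholder for "unknown" in the source is an assumption that should be verified first.

```python
import pandas as pd

# Toy count columns containing an invalid -1 (illustrative)
counts = pd.DataFrame({
    "plays": [117.0, -1.0, 2.0],
    "wishlists": [56.0, -1.0, 0.0],
})

# How many negative values each column contains
n_negative = (counts < 0).sum()

# Replace negatives with NaN so they are handled with the other nulls
cleaned = counts.mask(counts < 0)

print(n_negative)
print(cleaned)
```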

### **Exploring Genre Dataset**

In [65]:
genres.head()

,id,genre
0,1000001,Point-and-Click
1,1000002,Puzzle
2,1000002,Tactical
3,1000003,Pinball
4,1000003,Strategy


In [67]:
genres.shape

(286025, 2)

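**Insights:** This dataset has **2** features and **286025** rows. That is more rows than there are games, because a game with several genres appears once per genre.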

In [70]:
genres.dtypes

id        int64
genre    object
dtype: object

**Insights:** Both datatypes make sense: **id** is an integer and **genre** is text.

In [68]:
genres.isnull().sum()

id       0
genre    0
dtype: int64

**Insights:** There are no nulls in this dataset.

In [71]:
genres['genre'].value_counts()

genre
Indie                  50501
Adventure              49653
Simulator              22828
RPG                    22320
Strategy               21701
Shooter                18542
Puzzle                 17496
Arcade                 14872
Platform               14025
Sport                  10407
Visual Novel            7898
Racing                  7270
Fighting                4953
Point-and-Click         3992
Brawler                 3645
Turn Based Strategy     3411
Card & Board Game       3082
Music                   2707
Tactical                2428
Real Time Strategy      2237
Quiz/Trivia             1285
Pinball                  631
MOBA                     141
Name: count, dtype: int64

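**Insights:** There are 23 genres. Indie (50501) and Adventure (49653) are by far the most common, while Pinball (631) and MOBA (141) are the rarest.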
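
Because a game can have several genre rows (id 1000002 has both Puzzle and Tactical in the `head()` output), it may be convenient to collapse the table to one row per game before joining. A minimal sketch, using the first rows shown above as toy data:

```python
import pandas as pd

# First rows of the genres table, as shown in the head() output
genres = pd.DataFrame({
    "id": [1000001, 1000002, 1000002, 1000003, 1000003],
    "genre": ["Point-and-Click", "Puzzle", "Tactical", "Pinball", "Strategy"],
})

# One row per game, with all of its genres collected into a list
genres_per_game = genres.groupby("id")["genre"].agg(list)
print(genres_per_game)
```

The same pattern would apply to the platforms and developers tables, which share the `id` plus one value column layout.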

### **Exploring Platforms Dataset**

In [72]:
platforms.head()

,id,platform
0,1000001,Analogue electronics
1,1000002,Arcade
2,1000003,Ferranti Nimrod Computer
3,1000004,Legacy Computer
4,1000005,Windows PC


In [74]:
platforms.shape

(261475, 2)

In [80]:
platforms.dtypes

id           int64
platform    object
dtype: object

In [75]:
platforms.isnull().sum()

id          0
platform    0
dtype: int64

In [101]:
platforms['platform'].value_counts()

platform
Windows PC                  80883
Mac                         18335
Nintendo Switch             14433
PlayStation 4               12870
Linux                       11209
                            ...  
PDP-1                           1
Donner Model 30                 1
EDSAC                           1
Ferranti Nimrod Computer        1
visionOS                        1
Name: count, Length: 199, dtype: int64

### **Exploring Scores Dataset**

In [77]:
scores.head()

,id,score,amount
0,1000001,0.5,10
1,1000001,1.0,5
2,1000001,1.5,1
3,1000001,2.0,3
4,1000001,2.5,9


In [78]:
scores.shape

(1725120, 3)

In [81]:
scores.dtypes

id          int64
score     float64
amount      int64
dtype: object

In [82]:
scores.isnull().sum()

id        0
score     0
amount    0
dtype: int64

In [83]:
scores[['score', 'amount']].describe()

,score,amount
count,1725120.0,1725120.0
mean,2.75,6.827407
std,1.436141,97.94893
min,0.5,0.0
25%,1.5,0.0
50%,2.75,0.0
75%,4.0,0.0
max,5.0,17282.0
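
The scores table appears to hold a histogram per game: the row count (1725120) is ten times the number of games, and each row gives the **amount** of votes for one **score** value. One possible use, which is an assumption and not something the notebook states, is to compute a vote-weighted mean score per game. A sketch on toy data:

```python
import pandas as pd

# Toy score histogram for two games (illustrative values)
scores = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "score": [0.5, 1.0, 5.0, 2.0, 4.0],
    "amount": [10, 5, 5, 0, 0],
})

# Sum of score * votes per game
weighted = (scores["score"] * scores["amount"]).groupby(scores["id"]).sum()

# Total votes per game; games with no votes become NaN instead of dividing by zero
total = scores.groupby("id")["amount"].sum()
mean_score = weighted / total.where(total > 0)
print(mean_score)
```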


### **Exploring Developers Dataset**

In [84]:
developers.head()

,id,developer
0,1000002,Josef Kates
1,1000004,Christopher Strachey
2,1000005,"Alexander Shafto ""Sandy"" Douglas"
3,1000005,University of Warwick
4,1000007,William Higinbotham


In [86]:
developers.shape

(143454, 2)

In [87]:
developers.dtypes

id            int64
developer    object
dtype: object

In [88]:
developers.isnull().sum()

id           0
developer    1
dtype: int64

In [90]:
developers['developer'].value_counts().describe()

count    30502.000000
mean         4.703069
std         26.095290
min          1.000000
25%          1.000000
50%          1.000000
75%          3.000000
max       1926.000000
Name: count, dtype: float64

# **Exploring Collaborative Filtering Data**

### **Exploring Games Dataset**

In [102]:
games_collab.head()

,app_id,title,release_date,developer,publisher,genres,positive_ratings,negative_ratings,price
0,514520,Sparky's Hunt,2016-08-18,Luke Cripps,Fellowplayer,Indie,6,3,0.99
1,1012710,Endzeit,2019-04-03,RockyDev,RockyDev,Action;Early Access,0,1,7.19
2,279260,Richard & Alice,2014-06-05,Owl Cave,Owl Cave,Adventure;Indie,264,99,4.79
3,220090,The Journey Down: Chapter One,2013-01-09,SkyGoblin,SkyGoblin,Adventure;Indie,902,129,5.99
4,788870,In The Long Run The Game,2018-07-02,Zerstoren Games,Zerstoren Games,Action;Adventure;Indie;Simulation;Strategy;Ear...,6,11,7.19


In [104]:
games_collab.shape

(13538, 9)

In [108]:
games_collab.dtypes

app_id                int64
title                object
release_date         object
developer            object
publisher            object
genres               object
positive_ratings      int64
negative_ratings      int64
price               float64
dtype: object

In [106]:
games_collab.isnull().sum()

app_id              0
title               0
release_date        0
developer           0
publisher           4
genres              0
positive_ratings    0
negative_ratings    0
price               0
dtype: int64

In [109]:
games_collab['app_id'].value_counts()

app_id
514520    1
970390    1
381550    1
846670    1
970900    1
         ..
908840    1
7830      1
673620    1
253710    1
224960    1
Name: count, Length: 13538, dtype: int64

In [111]:
games_collab['title'].value_counts()

title
RUSH                                               2
Space Maze                                         2
Rumpus                                             2
Escape                                             2
Solitaire                                          2
                                                  ..
Pathfinder Adventures                              1
Nevertales: Shattered Image Collector's Edition    1
RetroFighter VR                                    1
Suprapong                                          1
Tomb Raider I                                      1
Name: count, Length: 13531, dtype: int64
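
Unlike the content filtering data, this table stores all genres of a game in a single semicolon-separated string (for example `Adventure;Indie`). If a one-genre-per-row layout is needed, the column could be split as in this sketch, which uses two rows from the `head()` output as toy data:

```python
import pandas as pd

# Two rows from the head() output above
games = pd.DataFrame({
    "app_id": [279260, 1012710],
    "genres": ["Adventure;Indie", "Action;Early Access"],
})

# Split the string into a list, then emit one row per list element
exploded = (
    games.assign(genre=games["genres"].str.split(";"))
    .explode("genre")[["app_id", "genre"]]
    .reset_index(drop=True)
)
print(exploded)
```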

### **Exploring Recommendations Dataset**

In [112]:
recommendations.head()

,app_id,date,is_recommended,user_id,review_id
0,498240,2020-11-02,True,551574,17202162
1,359550,2021-12-07,False,5800984,1331080
2,570940,2022-07-31,True,9401278,17109753
3,230410,2017-01-25,True,8424375,15065522
4,235460,2022-07-23,True,9145094,26539023


In [113]:
recommendations.shape

(1295844, 5)

In [114]:
recommendations.dtypes

app_id             int64
date              object
is_recommended      bool
user_id            int64
review_id          int64
dtype: object

In [115]:
recommendations.isnull().sum()

app_id            0
date              0
is_recommended    0
user_id           0
review_id         0
dtype: int64

In [117]:
recommendations['app_id'].value_counts().describe()

count     7322.000000
mean       176.979514
std       1034.880674
min          1.000000
25%          3.000000
50%          8.000000
75%         32.000000
max      31649.000000
Name: count, dtype: float64

In [121]:
recommendations['user_id'].value_counts().describe()

count    1.088779e+06
mean     1.190181e+00
std      8.796584e-01
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.790000e+02
Name: count, dtype: float64
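
With a median of one review per user, the user-game interaction matrix is very sparse. Its density can be estimated as the number of interactions divided by users times games. The sketch below defines a hypothetical helper, `interaction_density`, and also plugs in the counts reported above, assuming each row is a distinct user-game pair.

```python
import pandas as pd

def interaction_density(df, user_col="user_id", item_col="app_id"):
    """Share of possible user-item pairs that actually occur in df."""
    n_pairs = len(df[[user_col, item_col]].drop_duplicates())
    return n_pairs / (df[user_col].nunique() * df[item_col].nunique())

# Toy interactions: 3 users, 3 games, 4 observed pairs
toy = pd.DataFrame({"user_id": [1, 1, 2, 3], "app_id": [10, 20, 10, 30]})
toy_density = interaction_density(toy)

# Same formula with the counts reported in the outputs above
reported_density = 1295844 / (1088779 * 7322)

print(f"toy: {toy_density:.3f}, reported: {reported_density:.4%}")
```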

In [141]:
recommendations['user_id'].value_counts()

user_id
10837767    179
5216252     153
10734314    149
4956282     105
1244257     101
           ... 
11545145      1
2757046       1
9718766       1
1768325       1
11864905      1
Name: count, Length: 1088779, dtype: int64

In [140]:
# counts how many users have made at least 2 reviews
(recommendations['user_id'].value_counts() > 1).sum()

129768
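
Only 129768 of the 1088779 users (about 12%) have more than one review. If the collaborative filtering step is restricted to those users, which is an assumption about the next step and not something shown in this notebook, the filter could look like this sketch on toy data:

```python
import pandas as pd

# Toy recommendations: user 2 has a single review (illustrative)
recs = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "app_id": [10, 20, 10, 10, 20, 30],
})

# Number of reviews per user, aligned with the original rows
reviews_per_user = recs.groupby("user_id")["app_id"].transform("size")

# Keep only rows belonging to users with at least 2 reviews
active = recs[reviews_per_user >= 2]
print(active)
```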