# Video Game Analysis

## 1. Import and Clean Data

### 1a. Import Libraries and Data

In [39]:
# Import Libraries
import numpy as np
import pandas as pd

In [54]:
# Read in dataset
url = "https://raw.githubusercontent.com/kellyshreeve/Integrated_Project_1/main/moved_games.csv"
vg = pd.read_csv(url)

In [55]:
# View first 15 rows of the dataset 
vg.head(15)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.0,,,
5,Tetris,GB,1989.0,Puzzle,23.2,2.26,4.22,0.58,,,
6,New Super Mario Bros.,DS,2006.0,Platform,11.28,9.14,6.5,2.88,89.0,8.5,E
7,Wii Play,Wii,2006.0,Misc,13.96,9.18,2.93,2.84,58.0,6.6,E
8,New Super Mario Bros. Wii,Wii,2009.0,Platform,14.44,6.94,4.7,2.24,87.0,8.4,E
9,Duck Hunt,NES,1984.0,Shooter,26.93,0.63,0.28,0.47,,,


In [56]:
# Print dataset info
vg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


This dataset has 16,715 entries. There are missing values in Name, Year_of_Release, Genre, Critic_Score, User_Score, and Rating. Year_of_Release is stored as float and should be converted to an integer type, and User_Score is stored as object and should be converted to float. The column names should also be changed to snake case.
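The missing-value counts can also be read off directly instead of being subtracted from the info() output. The following is a minimal sketch on a toy frame: `sample` and its values are made up for illustration and only stand in for `vg`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `vg`; values are illustrative, not taken from the dataset
sample = pd.DataFrame({
    "Name": ["Wii Sports", None, "Tetris"],
    "Year_of_Release": [2006.0, np.nan, 1989.0],
    "User_Score": ["8", "tbd", None],
})

# Count missing values per column and express them as a share of all rows
missing = sample.isna().sum()
missing_share = (missing / len(sample)).round(3)

summary = pd.DataFrame({"missing": missing, "share": missing_share})
print(summary)
```

Note that a placeholder string such as 'tbd' is not counted by `isna()`, so string columns may hold more unusable values than this summary suggests.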

### 1b. Rename Columns and Fix Data Types

In [57]:
# Change variable names to snake case
vg = vg.rename(
    columns={'Name':'name',
             'Platform':'platform',
             'Year_of_Release':'year_of_release',
             'Genre':'genre',
             'NA_sales':'na_sales',
             'EU_sales':'eu_sales',
             'JP_sales':'jp_sales',
             'Other_sales':'other_sales',
             'Critic_Score':'critic_score',
             'User_Score':'user_score',
             'Rating':'rating'    
})

print(vg.columns)

Index(['name', 'platform', 'year_of_release', 'genre', 'na_sales', 'eu_sales',
       'jp_sales', 'other_sales', 'critic_score', 'user_score', 'rating'],
      dtype='object')


In [72]:
# Change year_of_release to int data type

# First, check unique values of 'year_of_release'
print(vg['year_of_release'].unique())

<IntegerArray>
[2006, 1985, 2008, 2009, 1996, 1989, 1984, 2005, 1999, 2007, 2010, 2013, 2004,
 1990, 1988, 2002, 2001, 2011, 1998, 2015, 2012, 2014, 1992, 1997, 1993, 1994,
 1982, 2016, 2003, 1986, 2000, <NA>, 1995, 1991, 1981, 1987, 1980, 1983]
Length: 38, dtype: Int64


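The non-missing values are all whole numbers, so it is safe to convert 'year_of_release' to an integer type. Because the column contains missing values, the nullable 'Int64' dtype is used rather than plain int.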
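The whole-number check can be done in code rather than by reading the list of unique values. This sketch assumes the column still holds floats, as it does right after read_csv; the `years` Series is illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative float years with one missing value, mirroring how pandas reads the column
years = pd.Series([2006.0, 1985.0, np.nan, 2008.0], name="year_of_release")

# True when every non-missing value has no fractional part
non_null = years.dropna()
all_whole = bool((non_null % 1 == 0).all())
print(all_whole)

# The nullable 'Int64' dtype (capital I) stores integers alongside <NA>
if all_whole:
    years = years.astype("Int64")

print(years.dtype)
```

Guarding the cast this way means a stray fractional value would leave the column as float instead of raising an error during conversion.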

In [69]:
# Convert 'year_of_release' to int
vg['year_of_release'] = vg['year_of_release'].astype('Int64')

# Use a name that does not shadow the built-in `type`
year_dtype = vg['year_of_release'].dtype
print(f'The dtype for "year_of_release" now is: {year_dtype}')

The dtype for "year_of_release" now is: Int64


In [46]:
# Show unique values in 'user_score' (the column was renamed to snake case above)
vg['user_score'].unique()

array(['8', nan, '8.3', '8.5', '6.6', '8.4', '8.6', '7.7', '6.3', '7.4',
       '8.2', '9', '7.9', '8.1', '8.7', '7.1', '3.4', '5.3', '4.8', '3.2',
       '8.9', '6.4', '7.8', '7.5', '2.6', '7.2', '9.2', '7', '7.3', '4.3',
       '7.6', '5.7', '5', '9.1', '6.5', 'tbd', '8.8', '6.9', '9.4', '6.8',
       '6.1', '6.7', '5.4', '4', '4.9', '4.5', '9.3', '6.2', '4.2', '6',
       '3.7', '4.1', '5.8', '5.6', '5.5', '4.4', '4.6', '5.9', '3.9',
       '3.1', '2.9', '5.2', '3.3', '4.7', '5.1', '3.5', '2.5', '1.9', '3',
       '2.7', '2.2', '2', '9.5', '2.1', '3.6', '2.8', '1.8', '3.8', '0',
       '1.6', '9.6', '2.4', '1.7', '1.1', '0.3', '1.5', '0.7', '1.2',
       '2.3', '0.5', '1.3', '0.2', '0.6', '1.4', '0.9', '1', '9.7'],
      dtype=object)

In [78]:
# Inspect the first 30 rows where user_score is 'tbd'
display(vg[vg['user_score']=='tbd'].head(30))

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
119,Zumba Fitness,Wii,2010.0,Sports,3.45,2.59,0.0,0.66,,tbd,E
301,Namco Museum: 50th Anniversary,PS2,2005.0,Misc,2.08,1.35,0.0,0.54,61.0,tbd,E10+
520,Zumba Fitness 2,Wii,2011.0,Sports,1.51,1.03,0.0,0.27,,tbd,T
645,uDraw Studio,Wii,2010.0,Misc,1.65,0.57,0.0,0.2,71.0,tbd,E
657,Frogger's Adventures: Temple of the Frog,GBA,,Adventure,2.15,0.18,0.0,0.07,73.0,tbd,E
718,Just Dance Kids,Wii,2010.0,Misc,1.52,0.54,0.0,0.18,,tbd,E
726,Dance Dance Revolution X2,PS2,2009.0,Simulation,1.09,0.85,0.0,0.28,,tbd,E10+
821,The Incredibles,GBA,2004.0,Action,1.15,0.77,0.04,0.1,55.0,tbd,E
881,Who wants to be a millionaire,PC,1999.0,Misc,1.94,0.0,0.0,0.0,,tbd,E
1047,Tetris Worlds,GBA,2001.0,Puzzle,1.25,0.39,0.0,0.06,65.0,tbd,E


The games shown with a 'user_score' of tbd all have jp_sales at or near zero. I will check whether games with near-zero jp_sales ever have a numeric user score, to see whether low Japanese sales could explain the tbd value.

In [82]:
# Check whether rows with jp_sales close to zero all have a user_score of tbd
display(vg[vg['jp_sales'].between(0, 0.05)].head(30))

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
60,Call of Duty: Ghosts,X360,2013.0,Shooter,6.73,2.56,0.04,0.91,73.0,2.6,M
61,Just Dance 3,Wii,2011.0,Misc,5.95,3.11,0.0,1.06,74.0,7.8,E10+
66,Halo 4,X360,2012.0,Shooter,6.65,2.28,0.04,0.74,87.0,7,M
68,Just Dance 2,Wii,2010.0,Misc,5.8,2.85,0.01,0.78,74.0,7.3,E10+
72,Minecraft,X360,2013.0,Misc,5.7,2.65,0.02,0.81,,,
78,Halo 2,XB,2004.0,Shooter,6.82,1.53,0.05,0.08,95.0,8.2,M
85,The Sims 3,PC,2009.0,Simulation,0.99,6.42,0.0,0.6,86.0,7.6,T
89,Pac-Man,2600,1982.0,Puzzle,7.28,0.45,0.0,0.08,,,
99,Call of Duty: Black Ops 3,XOne,2015.0,Shooter,4.59,2.11,0.01,0.68,,,
100,Call of Duty: World at War,X360,2008.0,Shooter,4.81,1.88,0.0,0.69,84.0,7.6,M


Numeric user scores are present for other games with jp_sales close to zero, so low Japanese sales do not appear to explain the tbd value. There also do not appear to be any patterns in year, genre, sales, critic_score, or rating that would explain it. Because no clear pattern explains this value, I will treat tbd as a missing value.
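Reading the first 30 rows gives only a partial view, so a cross-tabulation is one way to make the same comparison over every row. This is a sketch on made-up data: `sample`, the helper columns `score_status` and `jp_sales_band`, and the 0.05 cutoff (borrowed from the filter above) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for `vg`
sample = pd.DataFrame({
    "jp_sales": [0.0, 0.0, 0.04, 3.77, 0.0, 6.5],
    "user_score": ["tbd", "7.8", "2.6", "8", "tbd", "8.5"],
})

# Label each row by score status and by Japanese sales band
sample["score_status"] = np.where(sample["user_score"].eq("tbd"), "tbd", "other")
sample["jp_sales_band"] = np.where(sample["jp_sales"].le(0.05), "<=0.05", ">0.05")

# Rows: sales band; columns: score status; cells: number of games
table = pd.crosstab(sample["jp_sales_band"], sample["score_status"])
print(table)
```

If low Japanese sales explained the tbd value, the "<=0.05" row would contain almost no games under "other".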

In [None]:
# Fill user_score tbd with nan
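One possible way to carry out this step is sketched below on an illustrative Series; in the notebook the same operations would be applied to vg['user_score']. The `scores` name and its values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Illustrative scores stored as strings, including the 'tbd' placeholder
scores = pd.Series(["8", "tbd", np.nan, "8.3"], name="user_score")

# Blank out 'tbd' (mask replaces matching entries with NaN), then cast to float
scores = scores.mask(scores == "tbd").astype(float)

print(scores.dtype)
print(scores.isna().sum())
```

Masking only the exact string 'tbd' before the cast means any other unexpected non-numeric value would raise an error instead of being silently turned into NaN.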


### 1c. Address Missing Values

### 1d. Check for Duplicates

### 1e. Add Additional Features

## 2. Analyze the Data

## 3. Create a User Profile for Each Region

## 4. Test Hypotheses

## 5. Conclusion