This notebook covers my initial exploration of the video game dataset I'll be using to look at trends in video game data. I will not be cleaning or changing the file here, only exploring the raw data to decide which cleaning steps to take next.


Dataset: [Kaggle Video Game Dataset](https://www.kaggle.com/datasets/thedevastator/global-video-game-sales-ratings?resource=download)

### Definitions ###

- **Name**	The name of the video game. (String)
- **Platform**	The platform the game was released on. (String)
- **Year_of_Release**	The year the game was released. (Integer)
- **Genre**	The genre of the game. (String)
- **Publisher**	The publisher of the game. (String)
- **NA_Sales**	The sales of the game in North America. (Float)
- **EU_Sales**	The sales of the game in Europe. (Float)
- **JP_Sales**	The sales of the game in Japan. (Float)
- **Other_Sales**	The sales of the game in other regions. (Float)
- **Global_Sales**	The total sales of the game across all regions. (Float)
- **Critic_Score**	The score given to the game by critics. (Float)
- **Critic_Count**	The number of critics who reviewed the game. (Integer)
- **User_Score**	The score given to the game by users. (Float)
- **User_Count**	The number of users who reviewed the game. (Integer)
- **Developer**	The developer of the game. (String)
- **Rating**	The rating of the game. (String)

---

In [5]:
import pandas as pd

In [4]:
# Create the initial DataFrame from the raw CSV file
vg_df = pd.read_csv('data/raw-data.csv')

Some questions I would like to potentially explore are:

1. Which video game genre is the most popular worldwide? What about region to region?
2. What does genre popularity look like over time for video games, by count of games per genre and/or by sales normalized by total sales?
3. Which publisher has the most global sales? Is that the same from region to region?
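For question 2, one possible way to normalize sales is to express each genre's sales as a share of that year's total. The sketch below is only an illustration on a made-up frame (`sample_df` and its values are hypothetical), and it assumes "total" means total global sales per release year, which is my reading rather than a settled definition.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (hypothetical values).
sample_df = pd.DataFrame({
    "Year_of_Release": [2006, 2006, 2006, 2007, 2007],
    "Genre": ["Sports", "Racing", "Sports", "Sports", "Racing"],
    "Global_Sales": [3.0, 1.0, 1.0, 1.0, 3.0],
})

# Total global sales per (year, genre), pivoted so each genre is a column.
yearly_genre_sales = (
    sample_df.groupby(["Year_of_Release", "Genre"])["Global_Sales"]
    .sum()
    .unstack(fill_value=0)
)

# Divide each row by its yearly total so every year sums to 1.
genre_share = yearly_genre_sales.div(yearly_genre_sales.sum(axis=1), axis=0)
print(genre_share)
```

Working with shares rather than raw totals keeps years with very different overall sales comparable.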

In [10]:
vg_df.columns

Index(['Name', 'Platform', 'Year_of_Release', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales', 'Critic_Score',
       'Critic_Count', 'User_Score', 'User_Count', 'Developer', 'Rating'],
      dtype='object')

### Cleaning Step 1

Given what I would like to explore for my project, my first cleaning step should be to remove any columns I don't need.

**Columns to Keep:** Name, Platform, Year_of_Release, Genre, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales

**Columns to Remove:** Publisher, Critic_Score, Critic_Count, User_Score, User_Count, Developer, Rating
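When I get to the cleaning notebook, dropping columns could look something like the sketch below. It runs on a small made-up frame (`sample_df` is hypothetical and has only a few of the 16 columns), and `cols_to_remove` is a shortened stand-in for the full remove list.

```python
import pandas as pd

# Toy stand-in for vg_df with made-up values and only a few columns.
sample_df = pd.DataFrame({
    "Name": ["Game A", "Game B"],
    "Genre": ["Sports", "Racing"],
    "Global_Sales": [1.5, 0.7],
    "Critic_Score": [80.0, None],
    "Developer": ["Studio X", None],
    "Rating": ["E", None],
})

# Hypothetical, shortened remove list; the real one would name every unneeded column.
cols_to_remove = ["Critic_Score", "Developer", "Rating"]

# drop() returns a new frame, so the original stays untouched.
trimmed_df = sample_df.drop(columns=cols_to_remove)
print(trimmed_df.columns.tolist())
```

Because `drop` returns a copy, the raw DataFrame is still available if I change my mind about a column.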

In [11]:
vg_df.shape

(16719, 16)

In [12]:
vg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB


In [13]:
# Count null values per column, largest first

vg_df.isnull().sum().sort_values(ascending=False)

User_Count         9129
Critic_Score       8582
Critic_Count       8582
Rating             6769
User_Score         6704
Developer          6623
Year_of_Release     269
Publisher            54
Name                  2
Genre                 2
Platform              0
NA_Sales              0
EU_Sales              0
JP_Sales              0
Other_Sales           0
Global_Sales          0
dtype: int64

I have no missing sales data, which is great for the questions I would like to ask. I will need to address the null values in the Year_of_Release column before using that data.
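One option for the missing years, sketched here on a made-up frame (`sample_df` is hypothetical), is to drop those rows only for the analyses that actually need a year. This is an assumption about how I might handle it, not a decision yet.

```python
import pandas as pd

# Made-up rows; the middle one has no release year.
sample_df = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year_of_Release": [2006.0, None, 1996.0],
    "Global_Sales": [1.5, 0.7, 2.1],
})

# How many rows are missing a year?
missing_years = sample_df["Year_of_Release"].isnull().sum()

# Keep only rows with a year, leaving the original frame unchanged
# so sales-only questions can still use every row.
with_year = sample_df.dropna(subset=["Year_of_Release"])
print(missing_years, len(with_year))
```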

To make my analysis more relevant to me, I think I will narrow some questions to just NA_Sales data after answering my initial question of video game popularity by sales.

In [16]:
vg_df['Year_of_Release'].head()

0    2006.0
1    1985.0
2    2008.0
3    2009.0
4    1996.0
Name: Year_of_Release, dtype: float64

**Cleanup Note**: The year is stored as a float (for example `2006.0`), so I'll want to tweak the formatting to drop the decimal.
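Because the column contains nulls, a plain `int` conversion would fail. A minimal sketch of one alternative, assuming pandas' nullable `Int64` dtype is acceptable, on a made-up series:

```python
import pandas as pd

# Made-up years, including one missing value like the real column has.
years = pd.Series([2006.0, 1985.0, None], name="Year_of_Release")

# "Int64" (capital I) is pandas' nullable integer dtype: whole-number
# floats become integers and missing values become <NA>.
years_int = years.astype("Int64")
print(years_int)
```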

## Question 1 ##

Which video game genre is the most popular worldwide, by sales and by number of games? What about region to region?


To answer this, I will first need to group the dataset by genre and total the global sales. The genres with the most sales are the most popular ones. For ease of reporting I may take the top five or ten, but I'd like to see how many unique genres there are first. I'll explore that below.

In [18]:
vg_df['Genre'].unique()

array(['Sports', 'Platform', 'Racing', 'Role-Playing', 'Puzzle', 'Misc',
       'Shooter', 'Simulation', 'Action', 'Fighting', 'Adventure',
       'Strategy', nan], dtype=object)

In [20]:
vg_df['Genre'].value_counts()

Action          3370
Sports          2348
Misc            1750
Role-Playing    1500
Shooter         1323
Adventure       1303
Racing          1249
Platform         888
Simulation       874
Fighting         849
Strategy         683
Puzzle           580
Name: Genre, dtype: int64

There are fewer genres than I expected (12, not counting nulls), so I can report on all of them in my analysis.

Steps needed for this dataset:
1. Because Genre has 2 null values, I'll want to remove those rows from my dataset.
2. Group the cleaned dataset by genre, including a count of games and total global sales.

For the follow-up question, I'll need to:
1. Group the four regional sales columns (NA, EU, JP, Other) by genre as well, and include a count.
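The steps above could be sketched roughly as follows. The frame and its values are made up (`sample_df` is hypothetical), and the output column names `game_count` and `global_sales` are just illustrative choices.

```python
import pandas as pd

# Made-up rows; the last one has a null genre, like the two in the real data.
sample_df = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Genre": ["Sports", "Sports", "Racing", None],
    "NA_Sales": [2.0, 1.0, 0.5, 0.1],
    "EU_Sales": [1.0, 0.5, 0.25, 0.1],
    "JP_Sales": [0.5, 0.25, 0.25, 0.0],
    "Other_Sales": [0.5, 0.25, 0.0, 0.0],
    "Global_Sales": [4.0, 2.0, 1.0, 0.2],
})

# Step 1: remove rows with a null genre.
with_genre = sample_df.dropna(subset=["Genre"])

# Step 2: per genre, count the games and total the global sales.
genre_summary = (
    with_genre.groupby("Genre")
    .agg(game_count=("Name", "size"), global_sales=("Global_Sales", "sum"))
    .sort_values("global_sales", ascending=False)
)

# Follow-up: total each regional sales column per genre, plus a count.
region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
region_summary = with_genre.groupby("Genre")[region_cols].sum()
region_summary["game_count"] = with_genre.groupby("Genre").size()

print(genre_summary)
print(region_summary)
```

Using `"size"` rather than `"count"` counts every row in the group, so a game with a missing name would still be included.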

In [23]:
# Preview the name, genre, and regional sales columns
vg_df[['Name','Genre','NA_Sales','EU_Sales','JP_Sales', 'Other_Sales']].head()

                       Name         Genre  NA_Sales  EU_Sales  JP_Sales  Other_Sales
0                Wii Sports        Sports     41.36     28.96      3.77         8.45
1         Super Mario Bros.      Platform     29.08      3.58      6.81         0.77
2            Mario Kart Wii        Racing     15.68     12.76      3.79         3.29
3         Wii Sports Resort        Sports     15.61     10.93      3.28         2.95
4  Pokemon Red/Pokemon Blue  Role-Playing     11.27      8.89     10.22         1.00


I am curious whether these sales figures reflect sales at the time each game was sold, and whether I need to standardize them in any way to make them comparable for my review in 2023.