# Milestone 2: Exploratory Data Analysis
##### Eduardo Sequeira
---

For this milestone, **exploratory data analysis (EDA)** will be conducted on the selected dataset for the project.

## 1. Introduction

For this EDA, there are three main questions that we seek to answer:
1. Which videogame *genre* tends to *sell better / more*?
2. What is the *correlation* between *game ratings, sales, and genre*?
3. When did ratings begin for videogames?

The first question is a basic, direct comparison. The second question aims to uncover the relationships between the three criteria mentioned. The third question, asked out of curiosity, concerns when ratings were first applied to videogames, or when they started to become expected.

Before carrying out the EDA, we need to first clean the dataset. But before cleaning the dataset, we must first understand it and identify what data we need to answer the above questions.

## 2. Reviewing the Data

To begin reviewing the data, we must first load in the required libraries for the EDA and the raw dataset. Once this is done, we can start analyzing the data and move from selection to cleaning, to then performing the analysis. 

In [189]:
# Loading the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [200]:
# Fetch the raw dataset file from the project repository and load it into a DataFrame
rawdata = pd.read_csv("https://github.com/data301-2020-winter2/course-project-group_1019/blob/main/data/raw/Video_Games_Sales_as_at_22_Dec_2016.csv?raw=true.csv") 

# Preview the basic information of the loaded data 
rawdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB


### Dataset Information

When looking at the information of the dataset, we see that the columns whose values are words are of the object (string) datatype, and most of the columns which contain numerical values are of the float64 datatype, which makes sense. One exception is **User_Score**, which is stored as object even though it holds scores; this suggests that it contains some non-numeric entries. We also see a fair difference in the non-null counts between columns, which means that the number of complete rows (no missing values or NaNs) will be significantly lower than the 16717 named videogames in the dataset.
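The gap between non-null counts can also be quantified directly. The following is a minimal sketch on a hypothetical four-row stand-in for the raw data (the `toy` frame and its values are illustrative assumptions, not taken from the dataset); the same two calls could be applied to `rawdata`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the raw dataset (illustrative values only)
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", None],
    "Genre": ["Sports", "Platform", "Racing", None],
    "Global_Sales": [1.2, 0.5, 0.9, 0.1],
    "Rating": ["E", None, "T", None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows that have no missing value in any column
complete_rows = int(toy.notna().all(axis=1).sum())

print(missing_per_column)
print("Complete rows:", complete_rows, "of", len(toy))
```

Counting complete rows before dropping anything gives an early estimate of how much data will survive the cleaning step.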

In [197]:
# Preview the first 5 lines of the loaded data 
rawdata.head()

| | Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.36 | 28.96 | 3.77 | 8.45 | 82.53 | 76.0 | 51.0 | 8.0 | 322.0 | Nintendo | E |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.68 | 12.76 | 3.79 | 3.29 | 35.52 | 82.0 | 73.0 | 8.3 | 709.0 | Nintendo | E |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.61 | 10.93 | 3.28 | 2.95 | 32.77 | 80.0 | 73.0 | 8.0 | 192.0 | Nintendo | E |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 | NaN | NaN | NaN | NaN | NaN | NaN |


From the above head, which shows the first five rows of the raw data, we see that some columns and rows contain NaNs. When cleaning and manipulating the data, we will want to make sure that no NaNs remain.

However, before beginning to clean, let's select the data we actually need to answer the three questions above. We can do this by looking at the columns that are present in the dataset.

In [210]:
# Read the column names that we have
print(rawdata.columns.values)


['Name' 'Platform' 'Year_of_Release' 'Genre' 'Publisher' 'NA_Sales'
 'EU_Sales' 'JP_Sales' 'Other_Sales' 'Global_Sales' 'Critic_Score'
 'Critic_Count' 'User_Score' 'User_Count' 'Developer' 'Rating']


### Column Analysis

From the column name list above, to answer our questions, we really only need to have three columns: the **Genre**, **Global_Sales**, and **Rating** columns. With these three columns, we will be able to answer which videogame genres sell better, and find if there are any correlations between the ratings, sales and the genres. For curiosity and to address the third question, we will also need the **Year_of_Release** column.
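To make the link between columns and questions concrete, here is a rough sketch of the kind of operation each question might lead to. It runs on a hypothetical five-row frame with made-up values (`toy` and all variable names are assumptions for illustration), and it is not meant as the final analysis.

```python
import pandas as pd

# Hypothetical stand-in using the four analysis columns (illustrative values only)
toy = pd.DataFrame({
    "Year_of_Release": [2006, 1985, 2008, 2009, 1996],
    "Genre": ["Sports", "Platform", "Racing", "Sports", "Platform"],
    "Global_Sales": [5.0, 4.0, 3.0, 2.0, 1.0],
    "Rating": ["E", None, "E", "T", None],
})

# Question 1: total global sales per genre, largest first
sales_by_genre = (
    toy.groupby("Genre")["Global_Sales"].sum().sort_values(ascending=False)
)

# Question 2: sales broken down by genre and rating
# (groupby drops rows whose Rating is missing by default)
sales_by_genre_rating = toy.groupby(["Genre", "Rating"])["Global_Sales"].sum()

# Question 3: earliest release year among the rows that carry a rating
first_rated_year = toy.loc[toy["Rating"].notna(), "Year_of_Release"].min()

print(sales_by_genre)
print(sales_by_genre_rating)
print("Earliest rated release year:", first_rated_year)
```

For the third question, the earliest release year of a rated game is only a proxy, since a rating could have been assigned after a game's original release.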


### Column Selection

Let's begin by selecting only the columns that we need to answer our questions: **Year_of_Release**, **Genre**, **Global_Sales**, and **Rating**. We will also keep the **Name** column, so that we can later give examples of games that exist in each category.

In [211]:
# Select only the Name, Year_of_Release, Genre, Global_Sales, and Rating columns

# Build a new DataFrame holding just the five columns of interest;
# .copy() keeps it independent of the original raw dataset
columns_of_interest = ['Name', 'Year_of_Release', 'Genre', 'Global_Sales', 'Rating']
rawdataedit = rawdata[columns_of_interest].copy()

rawdataedit.head()


| | Name | Year_of_Release | Genre | Global_Sales | Rating |
|---|---|---|---|---|---|
| 0 | Wii Sports | 2006.0 | Sports | 82.53 | E |
| 1 | Super Mario Bros. | 1985.0 | Platform | 40.24 | NaN |
| 2 | Mario Kart Wii | 2008.0 | Racing | 35.52 | E |
| 3 | Wii Sports Resort | 2009.0 | Sports | 32.77 | E |
| 4 | Pokemon Red/Pokemon Blue | 1996.0 | Role-Playing | 31.37 | NaN |


### Data Cleaning

Now that we have only the five columns of interest, we can proceed to the next task, which is cleaning the data so that no NaNs remain in any of these columns.
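As a small illustration of this cleaning step, the sketch below drops incomplete rows from a hypothetical four-row frame and reports how many rows were lost. The `toy` frame, the `required` list, and the other variable names are assumptions for the example; `subset` simply makes explicit which columns must be present.

```python
import pandas as pd

# Hypothetical stand-in for the five selected columns (illustrative values only)
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year_of_Release": [2006.0, None, 2008.0, 2009.0],
    "Genre": ["Sports", "Platform", "Racing", "Sports"],
    "Global_Sales": [1.2, 0.5, 0.9, 0.1],
    "Rating": ["E", "E", None, "T"],
})

required = ["Name", "Year_of_Release", "Genre", "Global_Sales", "Rating"]

# Drop every row with a missing value in any required column;
# .copy() makes the result an independent DataFrame
clean = toy.dropna(subset=required).copy()

rows_dropped = len(toy) - len(clean)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
print(clean.index.tolist())  # surviving rows keep their original index labels
```

Because `dropna` keeps the original index labels, the index of a cleaned frame can have gaps; `reset_index(drop=True)` would renumber the rows if that is preferred.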

In [219]:
# Establishing a new dataset that will be dataclean from the latest edited rawdata
dataclean = rawdataedit.dropna()

# Display the head and the info of the clean dataset
display(dataclean.head())
print("\n")
dataclean.info()

| | Name | Year_of_Release | Genre | Global_Sales | Rating |
|---|---|---|---|---|---|
| 0 | Wii Sports | 2006.0 | Sports | 82.53 | E |
| 2 | Mario Kart Wii | 2008.0 | Racing | 35.52 | E |
| 3 | Wii Sports Resort | 2009.0 | Sports | 32.77 | E |
| 6 | New Super Mario Bros. | 2006.0 | Platform | 29.8 | E |
| 7 | Wii Play | 2006.0 | Misc | 28.92 | E |




<class 'pandas.core.frame.DataFrame'>
Int64Index: 9769 entries, 0 to 16710
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             9769 non-null   object 
 1   Year_of_Release  9769 non-null   float64
 2   Genre            9769 non-null   object 
 3   Global_Sales     9769 non-null   float64
 4   Rating           9769 non-null   object 
dtypes: float64(2), object(3)
memory usage: 457.9+ KB


From the above, we see that we now have the five columns that we are interested in, and that all five columns have the same number of non-null rows, 9769, which is what we want. We also know that each of these rows is "complete", meaning that there are no longer any NaNs in our dataset.

### Setting the Data Types

As of now, the columns **Name**, **Genre** and **Rating** are set as object. We should change **Genre** and **Rating** to be categories, because we know that the entries in these columns come from a known set of values. **Name** should be converted into a string because it is the name of the videogame in question. We will keep **Global_Sales** as is, because float64 gives us a numerical value with a decimal point for more precision. We will convert **Year_of_Release** to the int64 datatype, because a year does not need decimal places.
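As an alternative way to express these conversions, all of them can be passed to a single `astype` call as a column-to-dtype mapping, which returns a new DataFrame instead of assigning column by column. This is a sketch on a hypothetical three-row frame (`toy` and `typed` are illustrative names), not the approach used in the cell that follows.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataset (illustrative values only)
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year_of_Release": [2006.0, 2008.0, 2009.0],
    "Genre": ["Sports", "Racing", "Sports"],
    "Global_Sales": [1.2, 0.9, 0.1],
    "Rating": ["E", "E", "T"],
})

# One astype call with a column -> dtype mapping; columns that are not
# listed (here Global_Sales) keep their current dtype
typed = toy.astype({
    "Name": "string",
    "Year_of_Release": "int64",
    "Genre": "category",
    "Rating": "category",
})

print(typed.dtypes)
```

Note that converting a float column to `int64` only works once missing values have been removed, which is why this step comes after the cleaning.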

In [224]:
# Work on an explicit copy so the assignments below do not trigger a SettingWithCopyWarning
dataclean = dataclean.copy()

# Set Name to string, Year_of_Release to int64, and Genre and Rating to category
dataclean['Name'] = dataclean['Name'].astype("string")
dataclean['Year_of_Release'] = dataclean['Year_of_Release'].astype('int64')
dataclean['Genre'] = dataclean['Genre'].astype('category')
dataclean['Rating'] = dataclean['Rating'].astype('category')

# Check that Name is now string, Year_of_Release is int64, and Genre and Rating are category
display(dataclean.head())
print("\n")
display(dataclean.info())

| | Name | Year_of_Release | Genre | Global_Sales | Rating |
|---|---|---|---|---|---|
| 0 | Wii Sports | 2006 | Sports | 82.53 | E |
| 2 | Mario Kart Wii | 2008 | Racing | 35.52 | E |
| 3 | Wii Sports Resort | 2009 | Sports | 32.77 | E |
| 6 | New Super Mario Bros. | 2006 | Platform | 29.8 | E |
| 7 | Wii Play | 2006 | Misc | 28.92 | E |




<class 'pandas.core.frame.DataFrame'>
Int64Index: 9769 entries, 0 to 16710
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Name             9769 non-null   string  
 1   Year_of_Release  9769 non-null   int64   
 2   Genre            9769 non-null   category
 3   Global_Sales     9769 non-null   float64 
 4   Rating           9769 non-null   category
dtypes: category(2), float64(1), int64(1), string(1)
memory usage: 325.1 KB


None

From the above, we have now set up the dataset data types the way we want them.  
**Name** is now a string.  
**Year_of_Release** is now an integer.  
**Genre** and **Rating** are now both categories.  
**Global_Sales** remains a float.

### Exploratory Data Analysis

With everything now set up the way we need and the dataset cleaned, we can begin the EDA.

#### Unique Values Per Column

The first thing we are interested in knowing is the set of unique values in the **Genre** and **Rating** columns.  
This will tell us how many genres of videogames are considered in this dataset, as well as how many rating categories there are for the videogames.

In [232]:
# Unique values in Genre and Rating columns

print('The following are the unique genres considered in the dataset:', '\n', list(dataclean.Genre.unique()), '\n')

print('The following are the unique ratings considered in the dataset:', '\n', list(dataclean.Rating.unique()))

The following are the unique genres considered in the dataset: 
 ['Sports', 'Racing', 'Platform', 'Misc', 'Action', 'Puzzle', 'Shooter', 'Fighting', 'Simulation', 'Role-Playing', 'Adventure', 'Strategy'] 

The following are the unique ratings considered in the dataset: 
 ['E', 'M', 'T', 'E10+', 'K-A', 'AO', 'EC', 'RP']
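The list of unique values shows which ratings exist, but not how often each one occurs. A possible next step, sketched here on a hypothetical six-entry series (the `ratings` values are made up for illustration), is to count the categories with `nunique` and `value_counts`; the same calls could be applied to `dataclean.Rating` or `dataclean.Genre`.

```python
import pandas as pd

# Hypothetical rating column (illustrative values only)
ratings = pd.Series(["E", "M", "E", "T", "E", "M"], dtype="category", name="Rating")

# Number of distinct ratings that actually occur
n_unique = ratings.nunique()

# Number of rows per rating, most frequent first
counts = ratings.value_counts()

print("Distinct ratings:", n_unique)
print(counts)
```

Frequencies like these help judge whether a rarely used rating has enough rows to support any comparison.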
