# Milestone 2: Exploratory Data Analysis
##### Eduardo Sequeira
---

For this milestone, **exploratory data analysis (EDA)** will be conducted on the selected dataset for the project.

## 1. Introduction

For this EDA, there are two main questions that we seek to answer:
1. Which videogame *genre* tends to *sell better / more*?
2. What is the *correlation* between *game ratings, sales, and genre*?
3. When did ratings begin for videogames?

The first question is a basic, direct, comparison. The second question is more towards divining the relationships between the three critiera mentioned. The third question, for curiosity, is based around when were the first ratings applied to videogames, or start becoming something expected?

Before carrying out the EDA, we need to first clean the dataset. But before cleaning the dataset, we must first understand it and identify what data we need to answer the above questions.

## 2. Reviewing the Data

To begin reviewing the data, we must first load in the required libraries for the EDA and the raw dataset. Once this is done, we can start analyzing the data and move from selection to cleaning, to then performing the analysis. 

In [65]:
# Loading the required libraries
import pandas as pd
import numpy as np
import matplotlib.pylab as plt

In [66]:
# This is to go and fetch the raw dataset file and load it in

# Read the raw data from file 'filename.csv' 
rawdata = pd.read_csv("https://github.com/data301-2020-winter2/course-project-group_1019/blob/main/data/raw/Video_Games_Sales_as_at_22_Dec_2016.csv?raw=true.csv") 

# Preview the basic information of the loaded data 
rawdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB


#### Dataset Information

When looking at the information of the dataset, we see that the columns whose values are words are of the object (strings) datatype, and the columns which contain numerical values are of the float datatype, which makes sense. We also see that there is a fair difference in the non-null counts for the data between columns, which means that effectively, the number of complete rows (no missing values or NaNs) will be significantly lower than the total names of videogames in the dataset, 16717.

In [67]:
# Preview the first 5 lines of the loaded data 
rawdata.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


From the above head, showing us the first five lines of the raw data, we see that there are columns and rows which contain NaNs. When cleaning and manipulating the data at the end, we will want to make sure that we have no NaNs present.

However, before beginning to clean, let's select the data we actually need to answer the two above questions. We can do this by looking at the columns that are present in the dataset.

In [68]:
# Read the column names that we have
rawdata.columns

Index(['Name', 'Platform', 'Year_of_Release', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales', 'Critic_Score',
       'Critic_Count', 'User_Score', 'User_Count', 'Developer', 'Rating'],
      dtype='object')

#### Column Analysis

From the column name list above, to answer our questions, we really only need to have three columns: the **Genre**, **Global_Sales**, and **Rating** columns. With these three columns, we will be able to answer which videogame genres sell better, and find if there are any correlations between the ratings, sales and the genres. For curiosity and to address the third question, we will also need the **Year_of_Release** column.

As of now, the columns **Genre** and **Rating** are set as object, and we should change these to be categories. We will want to do this because we know that the entries in these columns will be from a known set of variables. For the **Global_Sales**, we will want to keep this as it, because float64 will give us a numerical value with a decimal point for more precision. For **Year_of_Release**, we will convert this to an **str** datatype because we know that the year is simply the year, not needing decimal places for this. 

####  Dataset Cleaning

Before initializing any analysis or data selection for our EDA, the first task that we will undertake is cleaning the data,
so that there are no NaNs in four main columns that are of interest to us: **Genre**, **Year_of_Release**, **Global_Sales**, and **Rating** columns.

In [None]:
# Remove NaNs in the Genre, Year_of_Release, Global_Sales, and Rating columns



In [69]:
# Create a new edited raw dataset from the original raw dataset
rawdataedit = rawdata

# Set columns Genre and Rating to be categories and check the info again
rawdataedit['Genre'] = rawdataedit['Genre'].astype('category')
rawdataedit['Rating'] = rawdataedit['Rating'].astype('category')
rawdataedit['Year_of_Release'] = rawdataedit['Year_of_Release'].astype('int64')

# Check if the Dtype for columns Genre and Rating are now changed to be category, and Year_of_Release is str
rawdataedit.info()

ValueError: Cannot convert non-finite values (NA or inf) to integer

####  Column Selection

Now, we can begin  selecting only the columns that are of interest to us to answer our questions.

In [55]:
dataselect = rawdataedit.copy()

dataselect.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


In [62]:
dataselect = dataselect().drop(['Name','Platform','Publisher'], axis=1)

print(dataselect)

TypeError: 'NoneType' object is not callable

In [None]:
df = df.filter(['a', 'b'])