## **Purpose**
In this notebook we change the data types of some of the columns in our dataset: release_date becomes a datetime and the sales columns become integers.
## **Datasets**
100.csv


In [7]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import os

In [8]:
if not os.path.exists("../data/prep/100.csv"):
    print("Missing dataset file")
else:
    print("Success!")

Success!


In [5]:
df= pd.read_csv("../data/prep/100.csv", low_memory = False)

Firstly, let's look at each column's data type.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8183 entries, 0 to 8182
Data columns (total 25 columns):
name                 8183 non-null object
total_shipped        8183 non-null object
developer            8183 non-null object
rank                 8183 non-null int64
platform             8183 non-null object
release_date         8183 non-null object
publisher            8183 non-null object
na_sales             7055 non-null object
eu_sales             7171 non-null object
jp_sales             2169 non-null object
other_sales          7912 non-null object
global_sales         8183 non-null object
game_genre           8183 non-null object
game_url             8183 non-null object
game_url_string      8183 non-null object
meta_game_name       8183 non-null object
meta_developer       8170 non-null object
meta_critic_score    8183 non-null int64
meta_critic_count    7338 non-null float64
meta_user_score      8183 non-null float64
meta_user_count      8183 non-null int64
meta_esrb   

Most of the columns seem to have the correct data type. The only columns that seem wrong are the sales columns and the release_date column, which are all stored as strings (object).

## **Change Data Type of release_date column**
We must change release_date from a string to a datetime. This is easy, as the pandas library has a function called to_datetime(). Currently the date is in a format like 17th Sep 13. We will convert it to the format 2013-09-17.

In [6]:
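# treat the 'N/A' placeholder as missing, then drop rows without a release date
df['release_date'] = df['release_date'].replace('N/A', np.nan, regex=True)
df = df.dropna(subset=['release_date']).copy()
# parse the remaining strings into datetime values
df['release_date'] = pd.to_datetime(df['release_date'])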
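The cell above lets pandas infer the date format. As a minimal sketch of a more explicit alternative (not what this notebook does), the ordinal suffix can be stripped first and the format passed to to_datetime() directly. The sample values and the names `raw`, `cleaned` and `parsed` are hypothetical, and the sketch assumes every date follows the day, abbreviated month, two-digit year pattern.

```python
import pandas as pd

# Hypothetical sample in the "17th Sep 13" style described above.
raw = pd.Series(["17th Sep 13", "1st Nov 98", "N/A"])

# Drop the ordinal suffix (st, nd, rd, th) that follows the day number.
cleaned = raw.str.replace(r"(?<=\d)(st|nd|rd|th)\b", "", regex=True)

# Parse with an explicit format; errors="coerce" turns "N/A" into NaT.
parsed = pd.to_datetime(cleaned, format="%d %b %y", errors="coerce")
print(parsed)
```

With `%y`, two-digit years from 69 to 99 are read as 1969 to 1999 and the rest as 2000 to 2068, which matters for older releases.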

In [7]:
df.sample()

Unnamed: 0,name,total_shipped,developer,rank,platform,release_date,publisher,na_sales,eu_sales,jp_sales,...,meta_game_name,meta_developer,meta_critic_score,meta_critic_count,meta_user_score,meta_user_count,meta_esrb,meta_genre,meta_multiplayer,meta_full_url
7887,XCOM 2,,Firaxis Games,217,PS4,2016-09-27,2K Games,0.16m,0.10m,,...,XCOM 2,Firaxis Games,87,29.0,7.3,364,T,Strategy,yes,https://www.metacritic.com/game/playstation-4/...


For my analysis, I also want to add a column holding the year of release rather than only the full release date.

In [8]:
df['release_year'] = df['release_date'].dt.year

## **Change Data Type of sales columns**

In [9]:
df['global_sales'].head(3)

0    20.32m
1    19.39m
2    16.15m
Name: global_sales, dtype: object

As you can see, the sales columns contain an 'm' suffix to represent millions. This won't work for our analysis. To convert these columns to integers we must first remove the 'm'.

In [10]:
# take off the 'm' so we can convert to floats
df['global_sales'] = df['global_sales'].str.replace('m','')
df['na_sales'] = df['na_sales'].str.replace('m','')
df['eu_sales'] = df['eu_sales'].str.replace('m','')
df['jp_sales'] = df['jp_sales'].str.replace('m','')
df['other_sales'] = df['other_sales'].str.replace('m','')
df['total_shipped'] = df['total_shipped'].str.replace('m','')

In [11]:
# convert a string in millions (e.g. '20.32') to an integer number of units
def convert_to_int(s):
    return round(float(s) * 1000000)
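As a quick check of what this helper returns, the sketch below repeats the same function and applies it to a few values. The inputs are illustrative: "20.32" mirrors the first global_sales value shown earlier with its 'm' removed, and 0 stands for a filled-in missing value.

```python
def convert_to_int(s):
    # Same helper as in the cell above: scale millions to whole units.
    return round(float(s) * 1000000)

# float() accepts numeric strings and plain numbers alike,
# so both "20.32" and the filler value 0 convert cleanly.
examples = [convert_to_int("20.32"), convert_to_int("0.16"), convert_to_int(0)]
print(examples)
```

Using round() rather than int() matters here: a floating-point product can land slightly below the intended whole number, and int() would truncate it downwards.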

A lot of the sales columns have empty values, apart from the global_sales column, as we filtered those rows out earlier on. We could just remove rows with empty sales values; however, as long as no row has an empty global_sales value, I feel the data will still be useful to us. We do have to do something with these empty values before converting the columns to int, so I decided to set them to 0. If we later decide to do some regional sales analysis, we can just remove rows with 0 in one of their sales columns.

In [12]:
df['jp_sales'] = df['jp_sales'].replace(np.nan,0)
df['global_sales'] = df['global_sales'].replace(np.nan,0)
df['na_sales'] = df['na_sales'].replace(np.nan,0)
df['eu_sales'] = df['eu_sales'].replace(np.nan,0)
df['other_sales'] = df['other_sales'].replace(np.nan,0)
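Filling with 0 makes a missing value indistinguishable from a genuine zero. As a hedged alternative (not the approach taken in this notebook), pandas' nullable `Int64` dtype can hold integers alongside missing values. The series below is hypothetical sample data, and `millions` and `units` are illustrative names.

```python
import pandas as pd

# Hypothetical regional sales values; None stands for a missing entry.
jp_sales = pd.Series(["0.16m", None, "1.05m"])

# Strip the suffix literally (regex=False), then parse to float (missing stays NaN).
millions = pd.to_numeric(jp_sales.str.replace("m", "", regex=False))

# Scale to whole units and store in the nullable Int64 dtype.
units = (millions * 1_000_000).round().astype("Int64")
print(units)
```

The missing entry is kept as `<NA>`, so a later regional analysis could drop it with dropna() instead of filtering on 0.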

In [13]:
df['global_sales'] = df['global_sales'].apply(convert_to_int)
df['na_sales'] = df['na_sales'].apply(convert_to_int)
df['eu_sales'] = df['eu_sales'].apply(convert_to_int)
df['jp_sales'] = df['jp_sales'].apply(convert_to_int)
df['other_sales'] = df['other_sales'].apply(convert_to_int)
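The three steps above (remove the 'm', fill missing values with 0, scale to whole units) are repeated for each column. A minimal sketch of the same idea as one loop is shown below; the two-row frame `sales` is hypothetical, and pd.to_numeric() is swapped in for the convert_to_int helper so the conversion is vectorised.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with the same kind of values as the sales columns.
sales = pd.DataFrame({
    "global_sales": ["20.32m", "0.26m"],
    "jp_sales": [np.nan, "0.10m"],
})

for col in ["global_sales", "jp_sales"]:
    stripped = sales[col].str.replace("m", "", regex=False)  # remove the suffix
    sales[col] = (
        pd.to_numeric(stripped)  # "20.32" -> 20.32, missing stays NaN
        .fillna(0)               # missing -> 0, as decided above
        .mul(1_000_000)          # millions -> units
        .round()
        .astype("int64")
    )
print(sales.dtypes)
```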

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8172 entries, 0 to 8182
Data columns (total 26 columns):
name                 8172 non-null object
total_shipped        8172 non-null object
developer            8172 non-null object
rank                 8172 non-null int64
platform             8172 non-null object
release_date         8172 non-null datetime64[ns]
publisher            8172 non-null object
na_sales             8172 non-null int64
eu_sales             8172 non-null int64
jp_sales             8172 non-null int64
other_sales          8172 non-null int64
global_sales         8172 non-null int64
game_genre           8172 non-null object
game_url             8172 non-null object
game_url_string      8172 non-null object
meta_game_name       8172 non-null object
meta_developer       8159 non-null object
meta_critic_score    8172 non-null int64
meta_critic_count    7330 non-null float64
meta_user_score      8172 non-null float64
meta_user_count      8172 non-null int64
meta_esrb

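The release_date and sales columns now have the correct data types. Note that total_shipped is still stored as object, since only its 'm' suffix was removed.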

In [15]:
df=df.set_index('name')
df.to_csv("../data/prep/200.csv",sep=",",encoding='utf-8')
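One thing to keep in mind for the next notebook: a CSV file stores no dtype information, so release_date comes back as a string unless it is parsed again on load. The sketch below illustrates this with an in-memory buffer in place of ../data/prep/200.csv; the one-row frame `out` and its values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for the cleaned data.
out = pd.DataFrame({
    "name": ["Example Game"],
    "release_date": pd.to_datetime(["2016-09-27"]),
    "global_sales": [260000],
}).set_index("name")

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
out.to_csv(buffer, sep=",")
buffer.seek(0)

# Without parse_dates, release_date would be read back as plain strings.
back = pd.read_csv(buffer, index_col="name", parse_dates=["release_date"])
print(back.dtypes)
```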