Title: Video Games Analysis Throughout the World

In [1]:
# Import the necessary libraries
import pandas as pd
import streamlit as st
import plotly.express as px

In [2]:
# Load the dataset into a pandas DataFrame
df = pd.read_csv('../games.csv')

In [3]:
# View the first rows of the data
df.head()

,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.0,,,


In [4]:
# Get an overview of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


In [62]:
# Convert all column names to lowercase
df.columns = df.columns.str.lower()
df.sample(5)

,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
2741,Turok,X360,2008,Action,0.46,0.2,0.01,0.07,69,7.1,M
4455,Neon Genesis Evangelion,SAT,1996,Adventure,0.0,0.0,0.44,0.0,70,tbd,E
9423,Alone in the Dark: Inferno,PS3,2008,Adventure,0.09,0.03,0.0,0.02,69,5.6,M
15607,The Conveni 4,PS2,2006,Simulation,0.0,0.0,0.02,0.0,70,tbd,E
10003,18 Wheels of Steel: Extreme Trucker 2,PC,2011,Racing,0.08,0.02,0.0,0.01,70,8.3,E


In [35]:
# inspecting data 
# checking for missing values
df.isna().sum()

name               0
platform           0
year_of_release    0
genre              0
na_sales           0
eu_sales           0
jp_sales           0
other_sales        0
critic_score       0
user_score         0
rating             0
dtype: int64

In [36]:
# Fill missing values and convert column data types, handling each column according to its type.

# Fill missing values in year_of_release with 0 and convert the column to int in the same step.
# I convert this column to int because years are whole numbers, which makes comparisons and filtering by year simpler.
# Removing NaN also avoids errors in later integer operations; the 0 placeholder marks an unknown year and should be excluded from year-based calculations.
df['year_of_release'] = df['year_of_release'].fillna(0).astype(int)
# Fill missing values in genre according to its data type.
# The lambda fills missing values with the mode (the most frequent value); if no mode exists, it falls back to 'unknown', since this is an object column.
df['genre'] = df['genre'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'unknown'))
# Fill missing values in the name column according to its data type.
# I use the same procedure as for genre, since both are object columns.
df['name'] = df['name'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'unknown'))
# Fill missing values in critic_score, then convert the column to int.
# I convert it for the same reason as year_of_release: it is more consistent and easier to work with.
# Missing values are filled with the mode (the most frequent value); if no mode exists, they are filled with 0, since the column is converted to int.
df['critic_score'] = df['critic_score'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 0)).astype(int)
# Fill missing values in user_score the same way (mode, otherwise 'unknown')
df['user_score'] = df['user_score'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'unknown'))
# Fill missing values in rating the same way (mode, otherwise 'unknown')
df['rating'] = df['rating'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'unknown'))
df.isna().sum()



name               0
platform           0
year_of_release    0
genre              0
na_sales           0
eu_sales           0
jp_sales           0
other_sales        0
critic_score       0
user_score         0
rating             0
dtype: int64
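
As a side note on the year conversion, here is a minimal sketch of an alternative that I did not use above: pandas' nullable `Int64` dtype stores whole-number years while keeping missing entries as `<NA>` instead of a 0 placeholder. The `toy` frame and the names `filled` and `nullable` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy column standing in for year_of_release; values are illustrative only.
toy = pd.DataFrame({"year_of_release": [2006.0, None, 1985.0]})

# Approach used in the cell above: fill with a 0 placeholder, then cast to int.
filled = toy["year_of_release"].fillna(0).astype(int)

# Alternative (assumption, not the notebook's method): nullable integer dtype.
# Missing years stay missing (<NA>) instead of becoming 0.
nullable = toy["year_of_release"].astype("Int64")

# With the placeholder, the 0 rows must be excluded by hand before aggregating.
earliest_from_filled = filled[filled > 0].min()

# With Int64, aggregations skip <NA> by default.
earliest_from_nullable = nullable.min()

print(earliest_from_filled, earliest_from_nullable)
```

Either approach gives an integer year column; the difference is whether unknown years need to be filtered out explicitly later.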

Why do you think the values are missing? Give possible reasons.
1. The data was collected across many regions, and it is hard to gather complete data in every part of the world.
2. Companies in different regions might not have the same tools for collecting data.
3. Data entry errors.
4. Inapplicable data.
5. Data loss or optional fields: some fields might not be mandatory, so respondents might have skipped them.
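
One way to probe these explanations is to look at how much is missing per column and whether the gaps cluster in particular groups. The sketch below uses a small made-up frame (`toy`) rather than the real data, and it would need to run before any filling step, since filling removes the missing values it measures.

```python
import pandas as pd

# Toy frame standing in for the raw data; values are illustrative only.
toy = pd.DataFrame({
    "platform": ["NES", "NES", "Wii", "Wii"],
    "critic_score": [None, None, 76.0, 82.0],
    "rating": [None, None, "E", None],
})

# Share of missing values in each column (0.0 = complete, 1.0 = all missing).
missing_share = toy.isna().mean()

# Share of missing critic scores within each platform.
# A strong difference between groups suggests the gaps are not random.
missing_by_platform = toy["critic_score"].isna().groupby(toy["platform"]).mean()

print(missing_share)
print(missing_by_platform)
```

If missingness concentrates in certain platforms or years, inapplicable or never-collected data is a more plausible explanation than random entry errors.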


Dealing with TBD (to be determined):
I will create a separate indicator column to flag rows with TBD, allowing me to revisit them when the data becomes available.
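
A minimal sketch of that indicator column, shown on a made-up frame: the column name `user_score_tbd` is my own choice, and converting the remaining scores to numbers with `pd.to_numeric` is an optional extra step that assumes 'tbd' is the only non-numeric marker.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C", "Game D"],
    "user_score": ["8.0", "tbd", None, "7.1"],
})

# Flag rows whose user score is still to be determined.
# Truly missing scores are not flagged, so the two cases stay distinguishable.
toy["user_score_tbd"] = toy["user_score"].eq("tbd")

# Optional: make the column numeric. 'tbd' cannot be parsed as a number,
# so errors="coerce" turns it into NaN while the flag column keeps the record.
toy["user_score"] = pd.to_numeric(toy["user_score"], errors="coerce")

print(toy)
```

The flagged rows can later be selected with `toy[toy["user_score_tbd"]]` when updated scores become available.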

In [39]:
# checking for duplicates
df.duplicated().sum()
# no duplicates in this dataset

np.int64(0)
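
`df.duplicated()` only counts rows that match in every column. As a hedged follow-up, the sketch below checks for rows that share the same identifying columns but differ elsewhere; the choice of `name`, `platform`, and `year_of_release` as the key is my assumption, and the `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "name": ["Game A", "Game A", "Game B"],
    "platform": ["PS3", "PS3", "PC"],
    "year_of_release": [2012, 2012, 2011],
    "na_sales": [2.11, 0.00, 0.08],
})

# Full-row duplicates: every column must match.
full_row_dupes = toy.duplicated().sum()

# Key-based duplicates: same game on the same platform in the same year.
key_cols = ["name", "platform", "year_of_release"]  # assumed identifying columns
key_dupes = toy.duplicated(subset=key_cols).sum()

# Show every row involved in a key-based duplicate for manual review.
print(toy[toy.duplicated(subset=key_cols, keep=False)])
```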