# Project description
You work for the online store Ice, which sells video games all over the world. User and expert reviews, genres, platforms (e.g. Xbox or PlayStation), and historical data on game sales are available from open sources. You need to identify patterns that determine whether a game succeeds or not. This will allow you to spot potential big winners and plan advertising campaigns.

In front of you is data going back to 2016. Let’s imagine that it’s December 2016 and you’re planning a campaign for 2017. 

(The important thing is to get experience working with data. It doesn't really matter whether you're forecasting 2017 sales based on data from 2016 or 2027 sales based on data from 2026.)

The dataset contains the abbreviation ESRB. The Entertainment Software Rating Board evaluates a game's content and assigns an age rating such as Teen or Mature.

In [65]:
import pandas as pd
import numpy as np
import streamlit as st
import plotly.express as px
from matplotlib import pyplot as plt
import altair


In [66]:
df = pd.read_csv('games.csv')



In [67]:
df.columns = df.columns.str.lower()
# Make all column headers lower case for easier analysis
df.head()

,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.0,,,


In [68]:
df.columns = df.columns.str.lower().str.replace(' ','')
# Make certain there are no spaces in the column headers

In [69]:
#df.info()
value_counts = df['user_score'].value_counts()

value_counts

user_score
tbd    2424
7.8     324
8       290
8.2     282
8.3     254
       ... 
1.1       2
1.9       2
9.6       2
0         1
9.7       1
Name: count, Length: 96, dtype: int64

In [70]:
df['user_score'] = pd.to_numeric(df['user_score'], errors='coerce')
# Convert the user_score column to floats. errors='coerce' turns the 'tbd' values,
# and any other value that can't be parsed as a number, into NaN.
df['user_score'].isna().sum()

9125

I converted user_score to a float type because I will most likely need to analyze the scores later, and numerical analysis is only possible when the column has a float or int dtype. After the conversion, 9125 values are NaN; 2424 of those were 'tbd' entries and the rest were already missing.
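As a minimal sketch of what `errors='coerce'` does, the example below uses a small made-up Series (the values are hypothetical, not taken from `games.csv`) that mixes numeric strings, a 'tbd' entry, and a missing value.

```python
import pandas as pd

# Hypothetical sample mimicking the raw user_score column: strings plus 'tbd'
sample = pd.Series(['8', 'tbd', '7.8', None], name='user_score')

# errors='coerce' replaces anything that cannot be parsed as a number with NaN
converted = pd.to_numeric(sample, errors='coerce')

print(converted.dtype)         # float64
print(converted.isna().sum())  # 2: the 'tbd' entry and the original missing value
```

Note that after coercion a 'tbd' score can no longer be told apart from a score that was missing to begin with, so it may be worth counting the 'tbd' entries before converting.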

In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16715 non-null  object 
 2   year_of_release  16446 non-null  float64
 3   genre            16713 non-null  object 
 4   na_sales         16715 non-null  float64
 5   eu_sales         16715 non-null  float64
 6   jp_sales         16715 non-null  float64
 7   other_sales      16715 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       7590 non-null   float64
 10  rating           9949 non-null   object 
dtypes: float64(7), object(4)
memory usage: 1.4+ MB


In [72]:
df['user_score'].duplicated().sum()
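# Note: this counts rows whose user_score repeats an earlier value (NaN included),
# not duplicated rows. A check for fully duplicated rows would be df.duplicated().sum().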

16619
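The count above is high because `duplicated()` on a single column flags every repeated value in that column. The sketch below, using a made-up three-column frame (hypothetical data), shows how that differs from checking whole rows or a subset of key columns.

```python
import pandas as pd

# Hypothetical toy data: every row shares the same user_score,
# but only the last row repeats an earlier row in full
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'A'],
    'platform': ['Wii', 'NES', 'Wii', 'Wii'],
    'user_score': [8.0, 8.0, 8.0, 8.0],
})

# Repeated values within one column
repeated_scores = toy['user_score'].duplicated().sum()

# Rows that are identical to an earlier row across all columns
duplicate_rows = toy.duplicated().sum()

# Rows that repeat an earlier (name, platform) pair
name_platform_dupes = toy.duplicated(subset=['name', 'platform']).sum()

print(repeated_scores, duplicate_rows, name_platform_dupes)  # 3 1 1
```

Which columns to pass to `subset` is a judgment call; name and platform are used here only as an illustrative choice of key.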

## Missing Value Analysis
This dataset has many missing values. The name column has two, as does genre; perhaps the titles were in a different language and whoever entered the data could not work out the name, or the names were simply forgotten. year_of_release is missing for 269 games. None of the sales columns have missing values, although many entries are 0, indicating no recorded sales for that game in that region. The score columns have the largest gaps: critic_score is missing for 8578 rows and user_score for 9125 (2424 of which were 'tbd' before the conversion), most likely because many games were simply never given a score. Likewise, rating is missing for 6766 games, most likely because they were never given an ESRB rating.
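One way to produce such a summary directly is sketched below on a small made-up frame (hypothetical values and a reduced set of columns). It reports the missing count and share per column, and counts zero-valued sales separately, since zeros are not treated as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing name, missing scores, and zero sales
toy = pd.DataFrame({
    'name': ['A', None, 'C', 'D'],
    'na_sales': [1.2, 0.0, 0.0, 0.5],
    'critic_score': [76.0, np.nan, np.nan, 80.0],
})

# Count of missing values and share of rows missing, per column
missing = toy.isna().sum()
missing_share = toy.isna().mean().round(2)
summary = pd.DataFrame({'missing': missing, 'share': missing_share})

# Zeros are real values, so they have to be counted separately
zero_sales = (toy['na_sales'] == 0).sum()

print(summary)
print(zero_sales)  # 2
```

On the real data the same calls could be run on `df` in place of `toy`.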