# 1.2 Exercise: Exploring a Pandas Data Frame: Video Game Sales Analysis
## Kenn Wade
## DSC 550: Data Mining
## March 17th, 2024

Introduction:

In this assignment, I explored the Video Game Sales with Ratings dataset to gain insights into the video game industry. The dataset contained information about video game sales, including attributes such as platform, genre, critic scores, user scores, and global sales.

The objective of this exercise was to utilize the Pandas library in Python to perform various data exploration tasks and answer specific questions provided in the assignment instructions.

Below are the tasks I accomplished:

1. Downloaded the Video Game Sales with Ratings dataset from the provided link.
2. Loaded the dataset as a Pandas data frame.
3. Displayed the first ten rows of data.
4. Found the dimensions (number of rows and columns) in the data frame.
5. Found the top five games by critic score.
6. Found the number of video games in the data frame in each genre.
7. Found the first five games in the data frame on the SNES platform.
8. Found the five publishers with the highest total global sales.
9. Created a new column in the data frame that calculated the percentage of global sales from North America.
10. Displayed the first five rows of the new data frame.
11. Found the number of NaN entries (missing data values) in each column.
12. Calculated the median user score of all the video games: converted non-numeric entries to NaN, computed the median, and replaced the NaN entries in the user score column with that median.

I documented each step of the analysis and provided explanations for the code and findings below:

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("/Users/kennwade/Downloads/Video_Games_Sales_as_at_22_Dec_2016.csv")

# Display the first ten rows of data
print(df.head(10))

                        Name Platform  Year_of_Release         Genre  \
0                 Wii Sports      Wii           2006.0        Sports   
1          Super Mario Bros.      NES           1985.0      Platform   
2             Mario Kart Wii      Wii           2008.0        Racing   
3          Wii Sports Resort      Wii           2009.0        Sports   
4   Pokemon Red/Pokemon Blue       GB           1996.0  Role-Playing   
5                     Tetris       GB           1989.0        Puzzle   
6      New Super Mario Bros.       DS           2006.0      Platform   
7                   Wii Play      Wii           2006.0          Misc   
8  New Super Mario Bros. Wii      Wii           2009.0      Platform   
9                  Duck Hunt      NES           1984.0       Shooter   

  Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  \
0  Nintendo     41.36     28.96      3.77         8.45         82.53   
1  Nintendo     29.08      3.58      6.81         0.77         

In [2]:
# Find the dimensions (rows, columns) of the data frame
num_rows, num_cols = df.shape
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 16719
Number of columns: 16


In [3]:
# Select the five games with the highest critic score
top_five_critic_score = df.nlargest(5, 'Critic_Score')
print("Top five games by critic score:")
print(top_five_critic_score)

Top five games by critic score:
                          Name Platform  Year_of_Release     Genre  \
51         Grand Theft Auto IV     X360           2008.0    Action   
57         Grand Theft Auto IV      PS3           2008.0    Action   
227   Tony Hawk's Pro Skater 2       PS           2000.0    Sports   
5350               SoulCalibur       DC           1999.0  Fighting   
16          Grand Theft Auto V      PS3           2013.0    Action   

                 Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  \
51    Take-Two Interactive      6.76      3.07      0.14         1.03   
57    Take-Two Interactive      4.76      3.69      0.44         1.61   
227             Activision      3.05      1.41      0.02         0.20   
5350    Namco Bandai Games      0.00      0.00      0.34         0.00   
16    Take-Two Interactive      7.02      9.09      0.98         3.96   

      Global_Sales  Critic_Score  Critic_Count User_Score  User_Count  \
51           11.01          98.0   
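
Because several games can share the same critic score, the five rows returned by `nlargest` depend on how ties are broken. The following is a minimal sketch on hypothetical toy data (the names and scores are made up, not taken from the dataset) showing the default behavior next to `keep="all"`, which also returns every row tied with the last selected value.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Critic_Score": [98.0, 97.0, 97.0, 90.0],
})

# Default keep="first": exactly n rows, ties resolved by original row order
top_two = toy.nlargest(2, "Critic_Score")

# keep="all": rows tied with the last selected score are returned as well
top_two_with_ties = toy.nlargest(2, "Critic_Score", keep="all")

print(top_two["Name"].tolist())
print(top_two_with_ties["Name"].tolist())
```

With `keep="all"` the result can contain more than the requested number of rows, which may or may not be what a "top five" question intends.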

In [4]:
# Count the number of games in each genre
games_per_genre = df['Genre'].value_counts()
print("Number of video games in each genre:")
print(games_per_genre)

Number of video games in each genre:
Genre
Action          3370
Sports          2348
Misc            1750
Role-Playing    1500
Shooter         1323
Adventure       1303
Racing          1249
Platform         888
Simulation       874
Fighting         849
Strategy         683
Puzzle           580
Name: count, dtype: int64
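
The genre counts above add up to 16,717, two fewer than the 16,719 rows in the data frame, because `value_counts` skips missing values by default. The sketch below uses a small hypothetical Series (values made up for illustration) to show `dropna=False`, which keeps the missing entries in the tally, and `normalize=True`, which returns shares instead of counts.

```python
import pandas as pd

# Hypothetical toy column with one missing genre
toy_genre = pd.Series(["Action", "Sports", "Action", None], name="Genre")

# Default: missing values are left out of the tally
counts = toy_genre.value_counts()

# dropna=False adds a row for the missing entries
counts_with_missing = toy_genre.value_counts(dropna=False)

# normalize=True returns each genre's share of the non-missing rows
shares = toy_genre.value_counts(normalize=True)

print(counts)
print(counts_with_missing)
print(shares)
```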


In [5]:
# Filter the dataframe for SNES platform and get the first 5 games
snes_games = df[df['Platform'] == 'SNES'].head(5)
print("First five games on the SNES platform:")
print(snes_games)

First five games on the SNES platform:
                                     Name Platform  Year_of_Release     Genre  \
18                      Super Mario World     SNES           1990.0  Platform   
56                  Super Mario All-Stars     SNES           1993.0  Platform   
71                    Donkey Kong Country     SNES           1994.0  Platform   
76                       Super Mario Kart     SNES           1992.0    Racing   
137  Street Fighter II: The World Warrior     SNES           1992.0  Fighting   

    Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  \
18   Nintendo     12.78      3.75      3.54         0.55         20.61   
56   Nintendo      5.99      2.15      2.12         0.29         10.55   
71   Nintendo      4.36      1.71      3.00         0.23          9.30   
76   Nintendo      3.54      1.24      3.81         0.18          8.76   
137    Capcom      2.47      0.83      2.87         0.12          6.30   

     Critic_Score  Critic_Cou

In [6]:
# Group by publisher and sum global sales, then get top 5 publishers
top_publishers = df.groupby('Publisher')['Global_Sales'].sum().nlargest(5)
print("Top five publishers with highest total global sales:")
print(top_publishers)

Top five publishers with highest total global sales:
Publisher
Nintendo                       1788.81
Electronic Arts                1116.96
Activision                      731.16
Sony Computer Entertainment     606.48
Ubisoft                         471.61
Name: Global_Sales, dtype: float64
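
One detail worth keeping in mind with this grouping: `groupby` leaves out rows whose key is missing unless told otherwise. The sketch below, on hypothetical toy data (the publisher with no name and its sales figure are made up), compares the default with `dropna=False`, a `groupby` parameter available in pandas 1.1 and later.

```python
import pandas as pd

# Hypothetical toy data; the row with a missing publisher is made up
toy = pd.DataFrame({
    "Publisher": ["Nintendo", "Nintendo", None, "Capcom"],
    "Global_Sales": [20.61, 10.55, 1.00, 6.30],
})

# Default: the row with a missing publisher is excluded from every group
totals = toy.groupby("Publisher")["Global_Sales"].sum()

# dropna=False keeps missing publishers as their own group
totals_with_missing = toy.groupby("Publisher", dropna=False)["Global_Sales"].sum()

print(totals.nlargest(2))
print(totals_with_missing)
```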


In [7]:
# Calculate percentage of global sales from North America
df['NA_Sales_Percentage'] = (df['NA_Sales'] / df['Global_Sales']) * 100
print("First five rows with NA sales percentage:")
print(df.head())

First five rows with NA sales percentage:
                       Name Platform  Year_of_Release         Genre Publisher  \
0                Wii Sports      Wii           2006.0        Sports  Nintendo   
1         Super Mario Bros.      NES           1985.0      Platform  Nintendo   
2            Mario Kart Wii      Wii           2008.0        Racing  Nintendo   
3         Wii Sports Resort      Wii           2009.0        Sports  Nintendo   
4  Pokemon Red/Pokemon Blue       GB           1996.0  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  Critic_Score  \
0     41.36     28.96      3.77         8.45         82.53          76.0   
1     29.08      3.58      6.81         0.77         40.24           NaN   
2     15.68     12.76      3.79         3.29         35.52          82.0   
3     15.61     10.93      3.28         2.95         32.77          80.0   
4     11.27      8.89     10.22         1.00         31.37           NaN   

   Critic_Coun
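
The percentage is a plain ratio, so a row with a global sales value of zero would produce `inf` rather than a usable number. The sketch below shows one possible guard. Only the first row's figures come from the output above; the other two rows are hypothetical, and the rounding to two decimals is an assumption added for readability.

```python
import numpy as np
import pandas as pd

# First row uses the Wii Sports figures shown above; the rest are hypothetical
toy = pd.DataFrame({
    "NA_Sales": [41.36, 0.00, 0.01],
    "Global_Sales": [82.53, 0.34, 0.00],
})

# Swap a zero denominator for NaN so the ratio becomes NaN instead of inf
denominator = toy["Global_Sales"].replace(0, np.nan)
toy["NA_Sales_Percentage"] = (toy["NA_Sales"] / denominator * 100).round(2)

print(toy)
```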

In [8]:
# Count NaN entries in each column
nan_count = df.isnull().sum()
print("Number of NaN entries in each column:")
print(nan_count)


Number of NaN entries in each column:
Name                      2
Platform                  0
Year_of_Release         269
Genre                     2
Publisher                54
NA_Sales                  0
EU_Sales                  0
JP_Sales                  0
Other_Sales               0
Global_Sales              0
Critic_Score           8582
Critic_Count           8582
User_Score             6704
User_Count             9129
Developer              6623
Rating                 6769
NA_Sales_Percentage       0
dtype: int64
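
These counts only cover values that pandas already treats as missing. A text column can also hold placeholder strings that `isnull()` does not count. The sketch below uses a hypothetical toy column, and the placeholder `"tbd"` is an assumption for illustration, to count both kinds of entries separately.

```python
import pandas as pd

# Hypothetical toy column; "tbd" stands in for any non-numeric placeholder
toy = pd.DataFrame({"User_Score": ["8.0", "tbd", None, "7.5", "tbd"]})

# Entries pandas already recognizes as missing
missing = toy["User_Score"].isnull().sum()

# Entries that are present but cannot be parsed as numbers
as_numbers = pd.to_numeric(toy["User_Score"], errors="coerce")
non_numeric = (as_numbers.isna() & toy["User_Score"].notna()).sum()

print("Missing:", missing)
print("Non-numeric placeholders:", non_numeric)
```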


In [9]:
# Replace non-numeric entries in user score with NaN
df['User_Score'] = pd.to_numeric(df['User_Score'], errors='coerce')

# Calculate median user score
median_user_score = df['User_Score'].median()
print("Median user score:", median_user_score)

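# Replace NaN entries in user score with the median value.
# Assigning the result back avoids calling fillna(inplace=True) on a column
# selection, which recent pandas versions warn about.
df['User_Score'] = df['User_Score'].fillna(median_user_score)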

Median user score: 7.5
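
The coerce, median, fill sequence can be checked on a small hypothetical frame where the expected result is easy to verify by hand. The `User_Score_Imputed` flag column is an addition for illustration only and is not part of the analysis above; it records which rows received the median so they can be told apart from real scores later.

```python
import pandas as pd

# Hypothetical toy column; "tbd" stands in for any non-numeric placeholder
toy = pd.DataFrame({"User_Score": ["8.0", "tbd", None, "7.0", "6.0"]})

# Non-numeric strings and missing values both become NaN
toy["User_Score"] = pd.to_numeric(toy["User_Score"], errors="coerce")

# Flag the rows that are about to be filled (illustrative addition)
toy["User_Score_Imputed"] = toy["User_Score"].isna()

# median() skips NaN, so this is the median of 8.0, 7.0 and 6.0
median_score = toy["User_Score"].median()
toy["User_Score"] = toy["User_Score"].fillna(median_score)

print(toy)
```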
