# Video Game Sales EDA

This notebook performs exploratory data analysis (EDA) on a video game sales dataset. We will load the data, explore its structure, clean it, and visualize key trends.

In [1]:
# Load Data
import pandas as pd

# Update the filename below to match your dataset's actual filename
file_path = 'vgsales.csv'
df = pd.read_csv(file_path)
df.head()

,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37
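
The column names suggest that `Global_Sales` is the sum of the four regional columns. The notebook does not state this, so the sketch below treats it as an assumption and checks it only on the five rows shown above, re-entered by hand. The `Regional_Sum` and `Diff` columns and the 0.02 tolerance are illustrative choices, not part of the dataset.

```python
import pandas as pd

# The five rows displayed by df.head() above, re-entered by hand.
head = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii",
             "Wii Sports Resort", "Pokemon Red/Pokemon Blue"],
    "NA_Sales": [41.49, 29.08, 15.85, 15.75, 11.27],
    "EU_Sales": [29.02, 3.58, 12.88, 11.01, 8.89],
    "JP_Sales": [3.77, 6.81, 3.79, 3.28, 10.22],
    "Other_Sales": [8.46, 0.77, 3.31, 2.96, 1.0],
    "Global_Sales": [82.74, 40.24, 35.82, 33.0, 31.37],
})

regional = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Sum the regional columns row by row and compare with the reported total.
head["Regional_Sum"] = head[regional].sum(axis=1)
head["Diff"] = (head["Regional_Sum"] - head["Global_Sales"]).abs()

# Assumed tolerance (in millions) to allow for rounding in the source data.
consistent = bool((head["Diff"] <= 0.02).all())

print(head[["Name", "Global_Sales", "Regional_Sum", "Diff"]].round(2))
print("Regional sales match Global_Sales within tolerance:", consistent)
```

Running the same comparison on the full `df` would show whether the relationship holds beyond these five rows.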


In [2]:
# Explore Data Structure
print('Columns:', df.columns.tolist())
print('\nData Types:')
print(df.dtypes)
print('\nSample Rows:')
print(df.sample(5))

Columns: ['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']

Data Types:
Rank              int64
Name             object
Platform         object
Year            float64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object

Sample Rows:
        Rank                                 Name Platform    Year     Genre  \
1349    1351     Tiger Woods PGA Tour 09 All-Play      Wii  2008.0    Sports   
2982    2984  Mortal Kombat Mythologies: Sub-Zero       PS  1997.0  Fighting   
8241    8243            Major League Baseball 2K9      PS3  2009.0    Sports   
12846  12848                 Samurai Dou Portable      PSP  2008.0    Action   
5551    5553                          NBA Live 96       PS  1996.0    Sports   

             Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  \
1349   Ele

In [3]:
# Basic Data Cleaning
print('Missing values per column:')
print(df.isnull().sum())

# Example: Drop rows with missing values (customize as needed)
df_clean = df.dropna()
print(f'Rows after dropping missing values: {len(df_clean)}')

Missing values per column:
Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64
Rows after dropping missing values: 16291
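
Only `Year` (271) and `Publisher` (58) contain missing values, so dropping every incomplete row is not the only option. Below is a minimal sketch of a more targeted approach. It assumes a missing publisher can be labelled with the placeholder `'Unknown'` (an assumption, not something the dataset defines) and that rows without a year are not needed. The helper name `clean_targeted` is hypothetical, and the tiny frame reuses three rows from the `df.head()` output with two values blanked out on purpose.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame: three rows from the head() output above,
# with one Year and one Publisher removed to simulate missing data.
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii"],
    "Year": [2006.0, np.nan, 2008.0],
    "Publisher": ["Nintendo", "Nintendo", None],
    "Global_Sales": [82.74, 40.24, 35.82],
})


def clean_targeted(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop rows lacking Year, label missing Publisher."""
    # Copy so the caller's frame is left untouched.
    out = frame.dropna(subset=["Year"]).copy()
    # Placeholder label is an assumption made for this sketch.
    out["Publisher"] = out["Publisher"].fillna("Unknown")
    # With no missing years left, the column can be stored as integers.
    out["Year"] = out["Year"].astype(int)
    return out


cleaned = clean_targeted(sample)
print(cleaned)
```

Applied to the real data, this would keep the rows that are missing only a publisher, at the cost of introducing a placeholder category that later group-by steps need to account for.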


In [4]:
# Visualize Data
import matplotlib.pyplot as plt
import seaborn as sns

# Example: Number of games per platform
df_clean['Platform'].value_counts().plot(kind='bar', figsize=(10,5))
plt.title('Number of Games by Platform')
plt.xlabel('Platform')
plt.ylabel('Count')
plt.show()

# Example: Global sales distribution
sns.histplot(df_clean['Global_Sales'], bins=30, kde=True)
plt.title('Global Sales Distribution')
plt.xlabel('Global Sales (Millions)')
plt.show()

ModuleNotFoundError: No module named 'matplotlib'
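
The error above indicates that `matplotlib` is not installed in the kernel's environment, so neither plot was rendered. Below is a minimal sketch for checking up front which packages are importable, using only the standard library. The helper name `missing_packages` is hypothetical, and the list of required packages is taken from the imports used in this notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["pandas", "matplotlib", "seaborn"]
missing = missing_packages(required)

if missing:
    # Typically resolved by installing into the kernel's environment,
    # e.g. `%pip install matplotlib seaborn` in a notebook cell.
    print("Missing packages:", ", ".join(missing))
else:
    print("All plotting dependencies are available.")
```

Running a check like this in the first cell surfaces a missing dependency before any data is loaded, instead of at the plotting step.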