# Data Cleaning with Python

### Import the Libraries

In [1]:
import pandas as pd
import os

### Use pandas to read the CSV file

In [2]:
vgsales = pd.read_csv('vgsales.csv')

### Clean the data

1. Drop any rows that contain missing values

In [3]:
vgsales_clean = vgsales.dropna()
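As a minimal sketch of what '.dropna()' does by default, the toy DataFrame below (hypothetical data, not taken from vgsales.csv) compares the default `how='any'` with `how='all'`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from vgsales.csv
toy = pd.DataFrame({
    "Name": ["A", "B", None],
    "Year": [2006.0, np.nan, np.nan],
})

# Default how='any': drops every row with at least one missing value
any_missing = toy.dropna()

# how='all': drops only rows in which every value is missing
all_missing = toy.dropna(how="all")

print(len(any_missing), len(all_missing))
```

With the default, a row is removed as soon as a single column is missing, which is usually stricter than dropping only fully empty rows.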

2. Drop any duplicate rows

In [4]:
vgsales_clean = vgsales.drop_duplicates()
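Note that the cell above starts from 'vgsales' again, so the result of the earlier '.dropna()' call is not carried forward. If both steps are meant to apply to the same data, one option is to chain them. A minimal sketch on hypothetical toy data (not taken from vgsales.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from vgsales.csv
toy = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Year": [2006.0, 2006.0, np.nan],
})

# Starting from the original frame keeps the row with the missing Year
from_original = toy.drop_duplicates()

# Chaining applies both cleaning steps to the same data
chained = toy.dropna().drop_duplicates()

print(len(from_original), len(chained))
```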

3. Use the '.describe()' method to get a glimpse of the edited data

In [5]:
vgsales_clean.describe()

| | Rank | Year | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|
| count | 16598.0 | 16327.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 |
| mean | 8300.605254 | 2006.406443 | 0.264667 | 0.146652 | 0.077782 | 0.048063 | 0.537441 |
| std | 4791.853933 | 5.828981 | 0.816683 | 0.505351 | 0.309291 | 0.188588 | 1.555028 |
| min | 1.0 | 1980.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 4151.25 | 2003.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06 |
| 50% | 8300.5 | 2007.0 | 0.08 | 0.02 | 0.0 | 0.01 | 0.17 |
| 75% | 12449.75 | 2010.0 | 0.24 | 0.11 | 0.04 | 0.04 | 0.47 |
| max | 16600.0 | 2020.0 | 41.49 | 29.02 | 10.22 | 10.57 | 82.74 |


According to the table above, the 'Year' column has 16,327 non-null values while every other column has 16,598, so 271 values in the 'Year' column are _null_.
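The number of missing values can also be counted directly instead of being inferred from the 'count' row. A minimal sketch on hypothetical toy data (not taken from vgsales.csv), using '.isnull().sum()':

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from vgsales.csv
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Year": [2006.0, np.nan, 2008.0, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# .count() returns the number of non-null values, so the difference
# from the row count is the number of missing values
inferred = len(toy) - toy["Year"].count()

print(missing_per_column)
print(inferred)
```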

In [6]:
# Show the rows whose 'Year' value is missing
vgsales_clean[vgsales_clean['Year'].isnull()]

The 'Year' column still contains 'NaN' values. They are present because step 2 called '.drop_duplicates()' on the original 'vgsales', so the result of the earlier '.dropna()' call was overwritten. We drop every row with a 'NaN' value in the 'Year' column with the following:

In [7]:
vgsales_clean = vgsales_clean.dropna(subset=['Year'])
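The 'subset' argument restricts the check to the listed columns, so missing values in other columns are left in place. A minimal sketch on hypothetical toy data (the column values are illustrative and not taken from vgsales.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from vgsales.csv
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Year": [2006.0, np.nan, 2008.0],
    "Publisher": ["X", "Y", None],
})

# Only rows with a missing Year are dropped
by_year = toy.dropna(subset=["Year"])

print(by_year["Name"].tolist())
print(by_year["Publisher"].isnull().sum())
```

Row "C" is kept even though its 'Publisher' is missing, because only 'Year' is inspected.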

In [8]:
vgsales_clean.describe()

| | Rank | Year | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|
| count | 16327.0 | 16327.0 | 16327.0 | 16327.0 | 16327.0 | 16327.0 | 16327.0 |
| mean | 8292.868194 | 2006.406443 | 0.265415 | 0.147554 | 0.078661 | 0.048325 | 0.540232 |
| std | 4792.669778 | 5.828981 | 0.821591 | 0.508766 | 0.311557 | 0.189885 | 1.565732 |
| min | 1.0 | 1980.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 4136.5 | 2003.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06 |
| 50% | 8295.0 | 2007.0 | 0.08 | 0.02 | 0.0 | 0.01 | 0.17 |
| 75% | 12441.5 | 2010.0 | 0.24 | 0.11 | 0.04 | 0.04 | 0.48 |
| max | 16600.0 | 2020.0 | 41.49 | 29.02 | 10.22 | 10.57 | 82.74 |


### Export the cleaned dataset to a local CSV file

In [15]:
vgsales_clean.to_csv('C:\\Users\\justi\\OneDrive\\Desktop\\vgsales_clean.csv', index=False)
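To illustrate why 'index=False' is passed, the sketch below writes a hypothetical toy DataFrame to a temporary directory (the file names and the use of 'tempfile' are assumptions made for the example, not part of the original notebook) and reads it back both ways:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data, not taken from vgsales.csv
toy = pd.DataFrame({"Rank": [1, 2], "Name": ["A", "B"]})

with tempfile.TemporaryDirectory() as tmp:
    # index=False: only the data columns are written
    no_index_path = Path(tmp) / "toy_clean.csv"
    toy.to_csv(no_index_path, index=False)
    reloaded = pd.read_csv(no_index_path)

    # Default (index=True): the row index is written as an unnamed first column
    with_index_path = Path(tmp) / "toy_with_index.csv"
    toy.to_csv(with_index_path)
    reloaded_with_index = pd.read_csv(with_index_path)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

Reading back the file written with the default settings produces an extra 'Unnamed: 0' column, which is what 'index=False' avoids.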