<a href="https://colab.research.google.com/github/prokope/learning-datacleaning/blob/main/datacleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning
## Purpose
My goal here is to learn the most commonly used pandas functions and methods for data cleaning.

### Importing libs

In [19]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from IPython.display import display

### Creating a fictional dataset

In [28]:
students = {
    'name': ["Paul", "Richard", "Procopio", "Santos"],
    'grade': [9, None, 8, 10],
    'preferred_subject': [None, "Math", "Chemistry", "Physics"]
}

students = pd.DataFrame(students)
students

Unnamed: 0,name,grade,preferred_subject
0,Paul,9.0,
1,Richard,,Math
2,Procopio,8.0,Chemistry
3,Santos,10.0,Physics


### dropna() — Delete NaN values
Syntax: <code>df.dropna(axis=0, how="any", subset=None, inplace=False)</code>

In [30]:
# With axis=1, every column that contains a NaN/None value is dropped entirely
display(students.dropna(axis=1, inplace=False))

# With axis=0 (the default), every row that contains a NaN/None value is dropped entirely
display(students.dropna(axis=0))

Unnamed: 0,name
0,Paul
1,Richard
2,Procopio
3,Santos


Unnamed: 0,name,grade,preferred_subject
2,Procopio,8.0,Chemistry
3,Santos,10.0,Physics
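
A small extra sketch of the `dropna()` options I did not use above: `subset`, `how="all"` and `thresh`. The data is rebuilt under the illustrative name `students_demo` so the block runs on its own; the other variable names are also just for this example.

```python
import pandas as pd

# Same fictional data as above, rebuilt so the sketch is self-contained
students_demo = pd.DataFrame({
    "name": ["Paul", "Richard", "Procopio", "Santos"],
    "grade": [9, None, 8, 10],
    "preferred_subject": [None, "Math", "Chemistry", "Physics"],
})

# subset: only look at the "grade" column when deciding which rows to drop
by_grade = students_demo.dropna(subset=["grade"])

# how="all": drop a row only if every value in it is missing
all_missing = students_demo.dropna(how="all")

# thresh: keep only rows that have at least this many non-missing values
at_least_three = students_demo.dropna(thresh=3)

print(by_grade["name"].tolist())        # ['Paul', 'Procopio', 'Santos']
print(len(all_missing))                 # 4 (no row is completely empty)
print(at_least_three["name"].tolist())  # ['Procopio', 'Santos']
```

`subset` is handy when only some columns are mandatory, since rows with gaps elsewhere are kept.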


### fillna() — Filling NaN/None values
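Syntax: <code>df.fillna(value=None, method=None, axis=None, inplace=False, limit=None)</code>

Note: the <code>method</code> parameter is deprecated in recent pandas versions (2.1 and later), which is why the examples here call <code>ffill()</code> and <code>bfill()</code> directly.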

In [46]:
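# Replacing every NaN/None value with 0
display(students.fillna(value=0))

# Propagating the last valid value forward with ffill
# (a missing value in the first row stays missing, since nothing precedes it)
display(students.ffill())

# Propagating the next valid value backward with bfill
display(students.bfill())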

Unnamed: 0,name,grade,preferred_subject
0,Paul,9.0,0
1,Richard,0.0,Math
2,Procopio,8.0,Chemistry
3,Santos,10.0,Physics


Unnamed: 0,name,grade,preferred_subject
0,Paul,9.0,
1,Richard,9.0,Math
2,Procopio,8.0,Chemistry
3,Santos,10.0,Physics


Unnamed: 0,name,grade,preferred_subject
0,Paul,9.0,Math
1,Richard,8.0,Math
2,Procopio,8.0,Chemistry
3,Santos,10.0,Physics
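
Filling everything with the same value (like 0) rarely fits every column. As a possible alternative, `fillna()` also accepts a dict with one replacement per column. A minimal sketch, where `students_demo`, the choice of the mean grade and the `"Unknown"` placeholder are just my assumptions for the example:

```python
import pandas as pd

# Same fictional data as above, rebuilt so the sketch is self-contained
students_demo = pd.DataFrame({
    "name": ["Paul", "Richard", "Procopio", "Santos"],
    "grade": [9, None, 8, 10],
    "preferred_subject": [None, "Math", "Chemistry", "Physics"],
})

# One replacement value per column: the mean for grades, a label for subjects
filled = students_demo.fillna({
    "grade": students_demo["grade"].mean(),  # mean() skips NaN: (9 + 8 + 10) / 3
    "preferred_subject": "Unknown",
})

print(filled)
```

Columns that are not listed in the dict are left untouched.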


### astype() — Data conversion
Syntax: <code>Series.astype(dtype, copy=True, errors='raise')</code>

In [57]:
# Converting the "grade" column (float) to str and storing the result in `converted`
converted = students["grade"].astype(str)

# Checking the type of the first value in the Series
display(type(converted.iloc[0]))


str
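
Grades here are whole numbers, but the missing value forces the column to float. One option that I believe fits this case is the nullable `"Int64"` dtype (capital I), which stores integers and still keeps the missing value. A minimal sketch with an illustrative `grades` Series:

```python
import pandas as pd

# None in a numeric list becomes NaN, so the Series starts out as float64
grades = pd.Series([9, None, 8, 10], name="grade")
print(grades.dtype)

# Nullable integer dtype: whole numbers plus a missing-value marker (<NA>)
as_int = grades.astype("Int64")
print(as_int)

# Converting to str keeps the float formatting of the original values
as_text = grades.astype(str)
print(as_text.iloc[0])  # '9.0'
```

Converting to `"Int64"` only works here because every non-missing grade is a whole number.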

### replace() — Replacing specific values
Syntax: <code>df.replace(to_replace, value, inplace=False)</code>

In [61]:
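# Viewing the dataset before the replacement
display(students)

# Replacing every "Chemistry" value with "Biology"
display(students.replace("Chemistry", "Biology"))

# Plus: to set a whole column to a single value, assign through .loc.
# Note: `copy = students` does not copy the data. Both names point to the same
# DataFrame, so this assignment also overwrites students["preferred_subject"]
# (which is why the outputs below show "Chemistry" in every row).
# Use students.copy() to get an independent copy.
copy = students
copy.loc[:, "preferred_subject"] = "Chemistry"
display(copy)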

Unnamed: 0,name,grade,preferred_subject
0,Paul,9.0,Chemistry
1,Richard,,Chemistry
2,Procopio,8.0,Chemistry
3,Santos,10.0,Chemistry


Unnamed: 0,name,grade,preferred_subject
0,Paul,9.0,Biology
1,Richard,,Biology
2,Procopio,8.0,Biology
3,Santos,10.0,Biology


Unnamed: 0,name,grade,preferred_subject
0,Paul,9.0,Chemistry
1,Richard,,Chemistry
2,Procopio,8.0,Chemistry
3,Santos,10.0,Chemistry
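
The cell above assigns `copy = students`, which only gives the same DataFrame a second name. A minimal sketch of the difference between that and `DataFrame.copy()`, using a tiny illustrative frame (`original`, `alias` and `independent` are names made up for this example):

```python
import pandas as pd

original = pd.DataFrame({
    "name": ["Procopio", "Santos"],
    "preferred_subject": ["Chemistry", "Physics"],
})

alias = original               # second name for the very same object
independent = original.copy()  # separate DataFrame with its own data

# Overwriting the column on the real copy leaves the original untouched
independent.loc[:, "preferred_subject"] = "Biology"

print(alias is original)                         # True
print(original["preferred_subject"].tolist())    # ['Chemistry', 'Physics']
print(independent["preferred_subject"].tolist()) # ['Biology', 'Biology']
```

Any change made through `alias` would show up in `original` as well, since they are the same object.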


### rename() — Renaming columns or indexes
Syntax: <code>df.rename(columns=None, index=None, inplace=False)</code>

In [62]:
# Renaming the column "preferred_subject" to "Favorite Subject"
display(students.rename(columns={'preferred_subject': 'Favorite Subject'}))

Unnamed: 0,name,grade,Favorite Subject
0,Paul,9.0,Chemistry
1,Richard,,Chemistry
2,Procopio,8.0,Chemistry
3,Santos,10.0,Chemistry
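
Besides a dict, `rename()` also takes a function for `columns`, and the `index` parameter works the same way for row labels. A short sketch on an illustrative two-row frame (`students_demo` and the new labels are assumptions for the example):

```python
import pandas as pd

students_demo = pd.DataFrame({
    "name": ["Paul", "Richard"],
    "grade": [9.0, None],
})

# columns accepts a function that is applied to every column label
upper = students_demo.rename(columns=str.upper)

# index accepts a mapping from old row labels to new ones
labeled = students_demo.rename(index={0: "first", 1: "second"})

print(upper.columns.tolist())          # ['NAME', 'GRADE']
print(labeled.index.tolist())          # ['first', 'second']
print(students_demo.columns.tolist())  # ['name', 'grade'] (original unchanged)
```

Since `inplace` defaults to `False`, each call returns a new DataFrame and the original keeps its labels.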
