# Data Cleaning (Cleansing)

Data cleaning is the process of correcting or removing corrupt or inaccurate data from a dataset. It involves identifying the incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting them.

## This notebook includes techniques on how to find and clean:

- Missing data
- Irregular data, such as outliers
- Repetitive data: duplicates and unnecessary data
- Inconsistent data, such as capitalization and addresses
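
As a quick orientation, the four categories above can each be spotted with a one-line pandas check. The sketch below is only illustrative: the `toy` frame and its values are made up and are not part of the housing data, and the 1.5 * IQR rule is one common outlier heuristic among several.

```python
import pandas as pd

# Hypothetical toy data, not taken from the housing dataset.
toy = pd.DataFrame({
    "city": ["Moscow", "moscow", "Kazan", "Kazan", None],
    "area": [40.0, 42.0, 38.0, 38.0, 4000.0],
})

# Missing data: number of missing values per column.
missing_per_column = toy.isna().sum()

# Irregular data: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = toy["area"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["area"] < q1 - 1.5 * iqr) | (toy["area"] > q3 + 1.5 * iqr)]

# Repetitive data: rows that exactly repeat an earlier row.
duplicate_rows = toy[toy.duplicated()]

# Inconsistent data: the same category written with different capitalization.
distinct_raw = toy["city"].nunique()
distinct_normalized = toy["city"].str.lower().nunique()

print(missing_per_column)
print(outliers)
print(duplicate_rows)
print(distinct_raw, distinct_normalized)
```

If the normalized count is lower than the raw count, at least two spellings of the same category differ only by case.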

## About the data

For this notebook we will use the **Russian housing dataset** and an **anthropological dataset** that contains information about countries. 

In [2]:
# suppress warnings
import warnings
warnings.filterwarnings("ignore")

# import libraries
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib

plt.style.use('ggplot')
from matplotlib.pyplot import figure

%matplotlib inline
# set the default figure size
matplotlib.rcParams['figure.figsize'] = (12, 8)

pd.options.mode.chained_assignment = None

In [15]:
HOUSE_LOCATION = "./datasets/russian_house_market/house.csv"

In [16]:
df = pd.read_csv(HOUSE_LOCATION)

### Shape

Checking the size of a dataset is a good first step in data cleaning.
It tells us how many rows and columns the dataset contains.

In [28]:
# returns a tuple (x, y)
# x is the number of rows
# y is the number of columns
df.shape

(38133, 293)
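
Because `shape` is an ordinary tuple, it can be unpacked into separate row and column counts. A minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({"full_sq": [39.0, 79.2, 40.5], "floor": [2, 8, 3]})

# shape is a plain (rows, columns) tuple, so it unpacks directly.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```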

### Data types
Another useful piece of information is the data type of each column, or feature. It helps us identify which features are numerical and which are categorical.

In [20]:
df.dtypes

id                    float64
timestamp              object
full_sq               float64
life_sq               float64
floor                 float64
                       ...   
leisure_count_5000      int64
sport_count_5000        int64
market_count_5000       int64
id                    float64
price_doc             float64
Length: 293, dtype: object
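
With hundreds of columns, the truncated listing above is hard to scan. One possible shortcut is to count how many columns share each dtype. The sketch below assumes a hypothetical `toy` frame whose column names merely echo the housing data; the values are invented.

```python
import pandas as pd

# Hypothetical toy frame; the column names echo the housing data, the values are made up.
toy = pd.DataFrame({
    "id": [1.0, 2.0],
    "timestamp": ["2015-07-01", "2015-07-01"],
    "full_sq": [39.0, 79.2],
    "market_count_5000": [1, 1],
})

# Convert each dtype to its string name, then count columns per dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```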

### Separating
Let's look at an example of how to separate numerical and categorical features.

In [22]:
# select_dtypes(include=...) keeps only the columns of the given dtypes
# df_numeric contains only features of type np.number
df_numeric = df.select_dtypes(include=[np.number])
# store the names of the numeric features
numeric_cols = df_numeric.columns.values

In [25]:
# take a peek into the first 5 samples
df_numeric.head()

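|   | id | full_sq | life_sq | floor | max_floor | material | build_year | num_room | kitch_sq | state | ... | cafe_count_5000_price_4000 | cafe_count_5000_price_high | big_church_count_5000 | church_count_5000 | mosque_count_5000 | leisure_count_5000 | sport_count_5000 | market_count_5000 | id.1 | price_doc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 30474.0 | 39.0 | 20.7 | 2.0 | 9.0 | 1.0 | 1998.0 | 1.0 | 8.9 | 3.0 | ... | 0 | 0 | 1 | 10 | 1 | 0 | 14 | 1 | NaN | NaN |
| 1 | 30475.0 | 79.2 | NaN | 8.0 | 17.0 | 1.0 | 0.0 | 3.0 | 1.0 | 1.0 | ... | 1 | 0 | 2 | 11 | 0 | 1 | 12 | 1 | NaN | NaN |
| 2 | 30476.0 | 40.5 | 25.1 | 3.0 | 5.0 | 2.0 | 1960.0 | 2.0 | 4.8 | 2.0 | ... | 4 | 0 | 10 | 21 | 0 | 10 | 71 | 11 | NaN | NaN |
| 3 | 30477.0 | 62.8 | 36.0 | 17.0 | 17.0 | 1.0 | 2016.0 | 2.0 | 62.8 | 3.0 | ... | 2 | 0 | 0 | 10 | 0 | 0 | 2 | 0 | NaN | NaN |
| 4 | 30478.0 | 40.0 | 40.0 | 17.0 | 17.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 1 | 0 | 2 | 12 | 0 | 1 | 11 | 1 | NaN | NaN |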


In [26]:
# select_dtypes(exclude=...) drops the columns of the given dtypes
# df_non_numeric contains the features that are not np.number
df_non_numeric = df.select_dtypes(exclude=[np.number])
# store the names of the non-numeric features
non_numeric_cols = df_non_numeric.columns.values

In [27]:
# take a peek into the first 5 samples
df_non_numeric.head()

|   | timestamp | product_type | sub_area | culture_objects_top_25 | thermal_power_plant_raion | incineration_raion | oil_chemistry_raion | radiation_raion | railroad_terminal_raion | big_market_raion | nuclear_reactor_raion | detention_facility_raion | water_1line | big_road1_1line | railroad_1line | ecology |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2015-07-01 | Investment | Juzhnoe Butovo | no | no | no | no | no | no | no | no | no | no | no | no | satisfactory |
| 1 | 2015-07-01 | OwnerOccupier | Poselenie Vnukovskoe | no | no | no | no | no | no | no | no | no | no | no | no | no data |
| 2 | 2015-07-01 | Investment | Perovo | no | yes | no | yes | yes | no | no | no | no | no | no | no | poor |
| 3 | 2015-07-01 | OwnerOccupier | Poselenie Voskresenskoe | no | no | no | no | no | no | no | no | no | no | no | no | no data |
| 4 | 2015-07-01 | OwnerOccupier | Poselenie Vnukovskoe | no | no | no | no | no | no | no | no | no | no | no | no | no data |
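
A useful sanity check after such a split is that the numeric and non-numeric selections together account for every column, and that the non-numeric features have a manageable number of distinct values. The sketch below uses a hypothetical `toy` frame with invented values, and the names `covers_all_columns` and `cardinality` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names echo the housing data, the values are made up.
toy = pd.DataFrame({
    "full_sq": [39.0, 79.2, 40.5],
    "floor": [2, 8, 3],
    "product_type": ["Investment", "OwnerOccupier", "Investment"],
    "ecology": ["satisfactory", "no data", "poor"],
})

numeric_cols = toy.select_dtypes(include=[np.number]).columns
non_numeric_cols = toy.select_dtypes(exclude=[np.number]).columns

# include and exclude are complementary, so the two groups should cover all columns.
covers_all_columns = len(numeric_cols) + len(non_numeric_cols) == toy.shape[1]

# Number of distinct values in each non-numeric feature.
cardinality = toy[non_numeric_cols].nunique()

print(covers_all_columns)
print(cardinality)
```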


## Missing Data

Let's explore how to deal with missing data. Handling missing data is one of the most important parts of data cleaning.

Most models do not accept missing data.
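
Before deciding how to handle missing values, it helps to measure how many there are. A minimal sketch, assuming a hypothetical `toy` frame (the values are invented and only loosely echo the housing columns):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries.
toy = pd.DataFrame({
    "life_sq": [20.7, np.nan, 25.1, 36.0],
    "floor": [2.0, 8.0, 3.0, 17.0],
    "price_doc": [np.nan, np.nan, np.nan, np.nan],
})

# isna() marks each missing cell as True; summing counts them per column.
missing_count = toy.isna().sum()

# The mean of a boolean column is the fraction of True values.
missing_pct = toy.isna().mean() * 100

print(missing_count)
print(missing_pct)
```

A column that is entirely missing, like `price_doc` in this toy frame, carries no information and is usually a candidate for removal, while a column with only a few gaps may be worth filling instead.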