# Data Science: Bridging Principles and Practice
## Data Cleaning Template

<img src="images/cleaning02.jpg" width="400">

<br>



## Overview <a id="section9b"></a>

This template provides starter code and common steps for cleaning datasets so they can be used with the [Scikit-Learn](https://scikit-learn.org/stable/index.html) library. The tools and methods in this notebook will work for many, but not all, datasets.

You will get the most out of this notebook if you have already completed the 11 curriculum notebooks, or if you already have a basic familiarity with Python and Pandas.

Topics for this notebook include:
1. Loading messy data files
2. Looking at data types, missing values, and distributions
3. Handling missing values
4. Handling data type issues: converting text to numbers, one-hot encoding
5. Performing other dataset-specific tasks: unit conversion, feature engineering
6. Saving the clean dataset to a file


## Before you use this template, please note:
- Every dataset will have different cleaning needs. This template attempts to provide starter code for some common tasks, but it is far from comprehensive.
- Data cleaning can be done using many non-Python tools, such as Excel or R.
- Generally, any variables in the dataset that will go into a Scikit-Learn model should be numerical and free from missing values.
- Dataset cleaning must be considered in the context of the domain of study, the data collection method, and the problem to be solved. How the data is cleaned will depend on all these things and more.
- Often, there isn't one single "correct" way to clean a particular data set. The most important thing is to keep a copy of the "messy" data for reference, and to clearly document all of the data cleaning choices you made as well as why you made them.

In [None]:
# run this cell to import some necessary software
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline


## 1. Load the messy data

- If you're working on the Haas Executive Education cloud server, you will need to upload your dataset. Go to [bee.haas.berkeley.edu](bee.haas.berkeley.edu). Click on the "haas-ds-online" folder, then the "data" folder. Then, click the "Upload" button at the top right. The relative path to your dataset is now "data/name-of-your-file".
- If you're on your computer, make sure you know the relative path of your data file. Putting it in the same directory as your notebook will help.
- Replace the ... with the relative path of your data file. Don't forget the file extension.
- Use the first code cell if your data is in a csv (comma separated values) file. Use the second code cell if your data is in an Excel file.
- If your data is in a different file type, you can see if there are functions to read it with Pandas at [this link](https://pandas.pydata.org/docs/user_guide/io.html). Note that csv and Excel files tend to be the easiest to work with in Pandas.

In [None]:
# For csv file data
data = pd.read_csv(...)

# show the first 5 rows of the data
data.head()

In [None]:
#  For Excel file data
# reading .xlsx files also requires the openpyxl package to be installed
data = pd.read_excel(...)

# show the first 5 rows of the data
data.head()
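Messy files often mark missing values with placeholder text such as "unknown". The sketch below shows one way to handle that while loading, using the `na_values` argument of `pd.read_csv`. The file contents, column names, and the variable names `example` and `example_raw` are made up for illustration and are not part of this template.

```python
import io
import pandas as pd

# hypothetical messy file contents, held in memory so the example is self-contained
raw_text = """name,age,height_cm
Ana,34,170
Ben,unknown,182
Cy,,165
"""

# na_values tells Pandas which extra strings to treat as missing
example = pd.read_csv(io.StringIO(raw_text), na_values=["unknown"])

# keep an untouched copy of the messy data for reference
example_raw = example.copy()

# age is now numeric, with missing values where "unknown" or a blank appeared
print(example.dtypes)
print(example.isnull().sum())
```

With your own data, you would pass the file path instead of `io.StringIO(raw_text)`.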

## 2. Look at data types, missing values, distributions
The following methods and attributes will work for most datasets and can help you find missing values and outliers that might need to be cleaned.

In [None]:
# get the number of rows and columns
data.shape

In [None]:
# show summary statistics for numerical data
data.describe()

In [None]:
# show the distributions of numerical data
# the figsize parameter makes the histograms bigger
data.hist(figsize=(14,10));

In [None]:
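# create a correlation matrix for all numerical columns of data
# (selecting numerical columns first avoids errors from text columns in newer Pandas versions)
data.select_dtypes("number").corr()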

In [None]:
# show how many null values are in each column
data.isnull().sum()

In [None]:
# show the types of data in each column
data.dtypes
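Two further checks are sometimes useful alongside the cells above: the share of missing values per column, and the category counts of a text column. This is a minimal sketch on a small made-up DataFrame; the names `example`, `missing_share`, and `city_counts` and all the values are hypothetical.

```python
import numpy as np
import pandas as pd

# small hypothetical dataset with some gaps
example = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "city": ["Oakland", "Berkeley", None, "Oakland"],
    "income": [52000.0, 61000.0, np.nan, np.nan],
})

# fraction of missing values in each column, largest first
missing_share = example.isnull().mean().sort_values(ascending=False)
print(missing_share)

# counts of each category in a text column, including missing values
city_counts = example["city"].value_counts(dropna=False)
print(city_counts)
```

Columns with a large share of missing values may be candidates for dropping rather than filling, depending on your problem.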

## 3. Handle missing values (if they exist)

- the `fillna` method will fill all null/missing values with the value you put in the parentheses. You can also fill gaps from neighboring rows: the `ffill` method propagates the last valid observation forward to the next valid one, while `bfill` uses the next valid observation to fill the gap. Older versions of Pandas also offer these through the `method` argument of `fillna` ("ffill" or "bfill"), but that argument is deprecated in recent versions.
- depending on your problem and dataset, you may want to fill missing values using an average value, a previous or subsequent value, or other imputation methods. 
- depending on your problem and dataset, you may want to drop missing values using `dropna`. The `axis` parameter specifies whether rows with missing data (axis=0) or columns with missing data (axis=1) are dropped.
- note that both `fillna` and `dropna` do NOT overwrite your original DataFrame unless you set the `inplace` parameter to `True` or save the result of the expression to a variable as in the two cells below

In [None]:
# only run this cell if you want to fill missing values
# replace the ... with the value to fill in
data = data.fillna(...)

In [None]:
# only run this cell if you want to drop missing values
data = data.dropna(axis=0)
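The options described above can be compared side by side on a small example. The sketch below uses a made-up DataFrame; the column names and the variable names `filled_mean`, `filled_forward`, and `dropped_rows` are hypothetical, and which option is appropriate depends on your dataset.

```python
import numpy as np
import pandas as pd

# small hypothetical dataset with gaps in both columns
example = pd.DataFrame({
    "temp": [20.0, np.nan, 22.0, np.nan],
    "rain": [0.0, 1.5, np.nan, 0.2],
})

# option 1: fill each column's gaps with that column's mean
filled_mean = example.fillna(example.mean())

# option 2: carry the last valid observation forward
filled_forward = example.ffill()

# option 3: drop every row that has at least one missing value
dropped_rows = example.dropna(axis=0)

# the original DataFrame is unchanged because nothing was reassigned to it
print(example.isnull().sum())
```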

## 4. Handle data type issues (if they exist)

- generally, Scikit-Learn needs data to be numerical (int64 or float64)
- sometimes, Pandas will read in numerical data as text, or a non-numerical value will appear in a numerical column (e.g. "unknown" in a column of ages)
- non-numerical data can be dropped or converted to numerical data (through dummy or one-hot encoding, through imputation, or through forcing Pandas to read in data as a certain type). The best option will always depend on the dataset, the domain, and the problem being solved.

In [None]:
# one-hot encode all categorical data in your dataset
data = pd.get_dummies(data)

In [None]:
# select only columns with numerical data
data = data.select_dtypes("number")

In [None]:
# turn a specific column into numeric data and save the result back to that column
# (replace both ... with the column name)
# may not work if there is text that Pandas can't easily interpret as a number
data[...] = pd.to_numeric(data[...])
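As an illustration of the "unknown in a column of ages" case, the sketch below converts a text column to numbers and one-hot encodes a categorical column. The DataFrame, its column names, and the variable `encoded` are hypothetical. Passing `errors="coerce"` is one possible choice, not the only one: it turns unparseable values into missing values, which then need to be handled as in section 3.

```python
import pandas as pd

# hypothetical dataset: ages read in as text, plus a categorical column
example = pd.DataFrame({
    "age": ["34", "unknown", "29"],
    "color": ["red", "blue", "red"],
})

# errors="coerce" turns values that can't be parsed as numbers into NaN
example["age"] = pd.to_numeric(example["age"], errors="coerce")

# one-hot encode only the named column; dtype=int gives 0/1 columns
encoded = pd.get_dummies(example, columns=["color"], dtype=int)
print(encoded)
```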

## 5. Perform other dataset-specific cleaning tasks

This might include:
- using array operations to convert units
- feature engineering: creating new columns of numerical data from text data (or other numerical data)
- dropping irrelevant rows or columns

The [Official Pandas Library Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) has a list of common operations as well as what they do. You can also look up details on Pandas functions by searching the [documentation](https://pandas.pydata.org/docs/).

In [None]:
# you can do any other data cleaning here
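The tasks listed above might look something like the following sketch. Every column name and value here is made up for illustration; the right conversions and engineered features depend entirely on your dataset.

```python
import pandas as pd

# hypothetical dataset with imperial units and a free-text column
example = pd.DataFrame({
    "height_in": [65.0, 70.0],
    "weight_lb": [150.0, 180.0],
    "notes": ["ok", "recheck"],
})

# unit conversion with array operations (1 inch = 2.54 cm, 1 lb = 0.45359237 kg)
example["height_cm"] = example["height_in"] * 2.54
example["weight_kg"] = example["weight_lb"] * 0.45359237

# feature engineering from numerical data: body mass index
example["bmi"] = example["weight_kg"] / (example["height_cm"] / 100) ** 2

# feature engineering from text data: flag rows whose notes mention "recheck"
example["needs_recheck"] = example["notes"].str.contains("recheck").astype(int)

# drop columns that are no longer needed
example = example.drop(columns=["height_in", "weight_lb", "notes"])
print(example)
```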

## 6. Save the cleaned dataset to a file
Replace the ... with the path you want for your data file. For example, to save your data to a file called my_data.csv in the data folder, you would use the path "data/my_data.csv".

To save the data to formats besides csv or Excel, please refer to the [Pandas Input/Output documentation](https://pandas.pydata.org/docs/user_guide/io.html).

In [None]:
# save the cleaned dataset to a csv file
data.to_csv(...)

In [None]:
# save the cleaned dataset to an Excel file
data.to_excel(...)
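One way to check a saved file is to read it back and compare it with the DataFrame you saved. The sketch below does this with a made-up DataFrame and a temporary folder; the names `example`, `out_path`, and `reloaded` are hypothetical. It also passes `index=False`, an optional argument that leaves the row labels out of the file so they don't come back as an extra column.

```python
import os
import tempfile
import pandas as pd

# hypothetical cleaned dataset
example = pd.DataFrame({"age": [34, 29], "income": [52000.0, 61000.0]})

# write to a temporary folder so the example doesn't touch your real data
out_path = os.path.join(tempfile.mkdtemp(), "my_data.csv")

# index=False leaves the row labels out of the file
example.to_csv(out_path, index=False)

# read the file back and check that it matches what was saved
reloaded = pd.read_csv(out_path)
print(reloaded.equals(example))
```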

#### References
- Image credit: "Cleaning02", Nick Youngson / Alpha Stock Images. Licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/).

Notebook author: Keeley Takimoto