# Data Science: Bridging Principles and Practice
## Data Cleaning Template

<img src="images/office-and-workers-in-barcelona-spain.jpg" />

<br>

*In this notebook, we will walk through common steps for cleaning a dataset with the Pandas library for Python, so that the data is ready to use with the Scikit-Learn machine learning library.*


In [None]:
# run this cell to import some necessary software
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
import sklearn
import seaborn as sns

from sklearn.model_selection import train_test_split
from mpl_toolkits.mplot3d import Axes3D 

# set the random seed for reproducibility
np.random.seed(28)

## Overview <a id="section9b"></a>

This template provides starter code and common steps for cleaning datasets so that they can be used with the [Scikit-Learn](https://scikit-learn.org/stable/index.html) library. The tools and methods in this notebook will work for many, but not all, datasets.

You will get the most out of this notebook if you have already completed the 11 curriculum notebooks, or if you already have a basic familiarity with Python and Pandas.

Topics for this notebook include:
1. Loading messy data files
2. Looking at data types, missing values, and distributions
3. Handling missing values
4. Performing other common tasks: unit conversion, feature engineering, one-hot encoding
5. Saving the clean dataset to a file


## Before you use this template
- Every dataset will have different cleaning needs. This template attempts to provide starter code for some common tasks, but it is far from comprehensive.
- Data cleaning can be done using many non-Python tools, such as Excel or R.
- Generally, any variables in the dataset that will go into a Scikit-Learn model should be numerical and free from missing values.
- Dataset cleaning must be considered in the context of the domain of study, the data collection method, and the problem to be solved. How the data is cleaned will depend on all these things and more.
- Often, there isn't one single "correct" way to clean a particular data set. The most important thing is to keep a copy of the "messy" data for reference, and to clearly document all of the data cleaning choices you made as well as why you made them.
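
As a rough illustration of the "numerical and free from missing values" guideline above, the sketch below lists the columns that still need attention. The helper name `columns_needing_cleaning` and the small example table are hypothetical and are not part of the original template.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def columns_needing_cleaning(df):
    """Hypothetical helper: names of columns that are non-numeric or contain missing values."""
    return [
        col for col in df.columns
        if not is_numeric_dtype(df[col]) or df[col].isnull().any()
    ]

# made-up example data: "rooms" has a missing value, "town" is text
example = pd.DataFrame({
    "rooms": [6.0, np.nan],
    "price": [24.0, 21.6],
    "town": ["A", "B"],
})

problem_columns = columns_needing_cleaning(example)
print(problem_columns)
```

An empty list would suggest the table is ready to pass to a model, although domain-specific checks may still be needed.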


## 1. Load the messy data

In [None]:
# load the data
# fill in the ... with the path to the data file. Don't forget the file extension
data = pd.read_csv(...)

In [None]:
# run this cell to load the data
data = pd.read_csv("data/boston.csv", index_col=0)

# show the first 5 rows of the data
data.head()
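
Messy files often need a few extra `read_csv` arguments before they load cleanly. The following is only a sketch under assumptions: the inline text, the column names, and the `"?"` and `"N/A"` placeholders are made up to stand in for a real file.

```python
import io
import pandas as pd

# hypothetical messy file contents: a junk title line, plus "?" and "N/A" used for missing values
raw_text = """exported from some system
id,rooms,price
1,6,24.0
2,?,21.6
3,7,N/A
"""

messy = pd.read_csv(
    io.StringIO(raw_text),   # stands in for a path to a real file
    skiprows=1,              # skip the junk title line
    na_values=["?", "N/A"],  # treat these strings as missing values
    index_col=0,             # use the first column as the row index
)
messy.head()
```

Which arguments are needed depends entirely on how the file at hand was produced, so it helps to open the raw file in a text editor first.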

## 2. Look at data types, missing values, distributions
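- `data.shape` gives the number of rows and columns
- `data.describe()` gives summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for each numerical column
- `data.hist()` plots a histogram of each numerical column, which helps spot skewed distributions and outliers
- `data.corr()` gives the pairwise correlation between numerical columns
- `data.isnull().sum()` counts the missing values in each column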

In [None]:
data.shape

In [None]:
data.describe()

In [None]:
data.hist(figsize=(14,10));

In [None]:
data.corr()

In [None]:
data.isnull().sum()
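
To look at data types and the share of missing values as well, something like the following could be added. It is a sketch on a made-up table (`example` and its columns are assumptions), not output from the dataset loaded above.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the loaded dataset
example = pd.DataFrame({
    "rooms": [6.0, np.nan, 7.0, 5.0],
    "price": [24.0, 21.6, 18.9, 33.4],
    "town": ["A", None, "B", None],
})

# data type of each column; non-numeric columns will need converting before modeling
print(example.dtypes)

# fraction of missing values in each column, largest first
missing_share = example.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a large share of missing values are often candidates for dropping, while columns with only a few gaps may be better filled.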

## 3. Handle missing values

- `fillna` replaces each missing value with a value you choose, such as a constant or the column's mean or median. Fill in the `...` with that value
- `dropna` removes every row that contains at least one missing value
- both methods return a new DataFrame and leave `data` unchanged. To keep the result, assign it back, e.g. `data = data.dropna()`
- use one approach or the other for a given column, and record which one you chose and why

In [None]:
data.fillna(...)

In [None]:
data.dropna()
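
The two approaches can also be applied column by column. Below is a minimal sketch on a made-up table; the column names and the choice of the median as the fill value are assumptions for illustration, not a recommendation for any particular dataset.

```python
import numpy as np
import pandas as pd

# hypothetical data with one missing value in each column
example = pd.DataFrame({
    "rooms": [6.0, np.nan, 7.0, 5.0],
    "price": [24.0, 21.6, np.nan, 33.4],
})

# option 1: fill the missing values in one column with that column's median
filled = example.copy()
filled["rooms"] = filled["rooms"].fillna(filled["rooms"].median())

# option 2: drop only the rows where a specific column is missing
dropped = example.dropna(subset=["price"])

print(filled)
print(dropped)
```

Working on a copy keeps the original table available for comparison, in the same spirit as keeping a copy of the "messy" data.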

## 4. Perform other dataset-specific cleaning tasks

This might include:
- using array operations to convert units
- feature engineering: creating new columns of numerical data from text data (or other numerical data)
- one-hot encoding: turning a categorical column into one 0/1 column per category
- dropping irrelevant rows or columns

The [Official Pandas Library Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) has a list of common operations as well as what they do. You can also look up details on Pandas functions by searching the [documentation](https://pandas.pydata.org/docs/).
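
The tasks listed above might look roughly like this in Pandas. The table and every column name here are hypothetical, chosen only to show one small instance of each task.

```python
import pandas as pd

# made-up example data
example = pd.DataFrame({
    "height_in": [60.0, 72.0],
    "name": ["Ada Lovelace", "Alan Turing"],
    "color": ["red", "blue"],
    "notes": ["n/a", "n/a"],
})

# unit conversion: inches to centimeters with an array operation
example["height_cm"] = example["height_in"] * 2.54

# feature engineering: a numerical column derived from a text column
example["name_length"] = example["name"].str.len()

# one-hot encoding: one 0/1 indicator column per category
example = pd.get_dummies(example, columns=["color"], dtype=int)

# drop columns that are no longer needed
cleaned = example.drop(columns=["height_in", "name", "notes"])
print(cleaned)
```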

In [None]:
# write your dataset-specific cleaning code here


## 5. Save the cleaned dataset to a file

In [None]:
# save the cleaned data
# fill in the ... with the path for the new file. Don't forget the file extension
data.to_csv(...)
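
Writing the cleaned table to a new file, rather than overwriting the messy one, could be sketched as follows. The file name `cleaned_example.csv`, the temporary folder, and the small table are assumptions made so the example is self-contained.

```python
import os
import tempfile
import pandas as pd

# hypothetical cleaned data
cleaned = pd.DataFrame({"rooms": [6.0, 7.0], "price": [24.0, 33.4]})

# hypothetical output location; a temporary folder keeps this sketch self-contained
out_path = os.path.join(tempfile.mkdtemp(), "cleaned_example.csv")

# write to a new file so the original messy file is left untouched
cleaned.to_csv(out_path)

# read the file back with the index in the first column to confirm it round-trips
reloaded = pd.read_csv(out_path, index_col=0)
print(reloaded)
```

Reading the file back is optional, but it is a quick way to confirm that the saved copy matches what was cleaned.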

#### References
- [IBM HR Analytics Employee Attrition & Performance mock data set](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/home) is made available under the [Open Database License](http://opendatacommons.org/licenses/odbl/1.0/). Any rights in individual contents of the database are licensed under the [Database Contents License](http://opendatacommons.org/licenses/dbcl/1.0/).