# Introduction to Data Preprocessing
> In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/batman.jpg

> Note: This is a summary of the chapter 1 exercises from the course "Preprocessing for Machine Learning in Python" at DataCamp. <br>[GitHub repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## What is data preprocessing?


### Missing data - columns

<div class=""><p>We have a dataset comprised of volunteer information from New York City. The dataset has a number of features, but we want to get rid of features that have at least 3 missing values. </p>
<p>How many features are in the original dataset, and how many features are in the set after columns with at least 3 missing values are removed?</p>
<ul>
<li>The dataset <code>volunteer</code> has been provided.</li>
<li>Use the <code>dropna()</code> function to remove columns.</li>
<li>You'll have to set both the <code>axis=</code> and <code>thresh=</code> parameters.</li>
</ul></div>

In [26]:
volunteer = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/volunteer.csv')

<pre>
Possible Answers

<b>35, 24</b>

35, 35

35, 19
</pre>

In [14]:
volunteer.shape

(665, 35)

In [15]:
volunteer.dropna(axis=1, thresh=3).shape

(665, 24)

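**A lot of operations are done on a column basis, so it's useful to remember axis=1 when working with pandas. Note that thresh=3 keeps the columns that have at least 3 non-missing values; it does not count missing values.**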
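
The behaviour of `thresh=` is easy to misread, so here is a minimal sketch on a made-up frame. The `toy` DataFrame and its column names are assumptions for illustration only, not part of the `volunteer` data. It contrasts `thresh=3` with an explicit count of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the volunteer dataset), 4 rows per column.
toy = pd.DataFrame({
    "full": [1, 2, 3, 4],                        # 0 missing, 4 non-missing
    "sparse": [1.0, np.nan, np.nan, np.nan],     # 3 missing, 1 non-missing
    "half": [1.0, 2.0, np.nan, np.nan],          # 2 missing, 2 non-missing
    "partial": [1.0, 2.0, 3.0, np.nan],          # 1 missing, 3 non-missing
})

# thresh=3 keeps every column with at least 3 non-missing values.
kept = toy.dropna(axis=1, thresh=3)
print(list(kept.columns))

# To drop columns with at least 3 *missing* values instead,
# count the nulls per column and filter on that count.
few_missing = toy.loc[:, toy.isnull().sum() < 3]
print(list(few_missing.columns))
```

The two filters disagree on the `half` column: it has only 2 non-missing values, so `thresh=3` drops it, but it also has fewer than 3 missing values, so the null-count filter keeps it.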

### Missing data - rows

<p>Taking a look at the <code>volunteer</code> dataset again, we want to drop rows where the <code>category_desc</code> column values are missing. We're going to do this using boolean indexing, by checking for null values and then filtering the dataset so that we keep only the rows where <code>category_desc</code> is not null.</p>

Instructions
<ul>
<li>Check how many values are missing in the <code>category_desc</code> column using <code>isnull()</code> and <code>sum()</code>.</li>
<li>Subset the <code>volunteer</code> dataset by indexing by where <code>category_desc</code> is <code>notnull()</code>, and store in a new variable called <code>volunteer_subset</code>.</li>
<li>Take a look at the <code>.shape</code> attribute of the new dataset, to verify it worked correctly.</li>
</ul>

In [28]:
# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())

# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)

48
(617, 35)


**Remember that you can use boolean indexing to effectively subset DataFrames.**
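
As a small illustration of the same idea, the sketch below compares boolean indexing with `dropna(subset=...)` on a made-up frame. The `toy` DataFrame is an assumption for illustration; only the `category_desc` column name is borrowed from the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing labels.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "category_desc": ["Health", np.nan, "Education", np.nan],
})

# Boolean mask: True where the label is present.
mask = toy["category_desc"].notnull()
by_mask = toy[mask]

# dropna restricted to one column removes the same rows.
by_dropna = toy.dropna(subset=["category_desc"])

print(by_mask.shape)
print(by_mask.equals(by_dropna))
```

Both approaches should give the same rows here; boolean indexing is the more general tool because the mask can encode any condition, not only missingness.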

## Working with data types


### Exploring data types


<div class=""><p>Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we'll be working with as we start to do more preprocessing.</p>
<p>Which data types are present in the <code>volunteer</code> dataset?</p>
<ul>
<li>The dataset <code>volunteer</code> has been provided.</li>
<li>Use the <code>.dtypes</code> attribute to check the datatypes.</li>
</ul></div>

<pre>
Possible Answers

Float and int only

Int only

<b>Float, int, and object</b>

Float only
</pre>

In [17]:
volunteer.dtypes

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL       

**All three of these types are present in the DataFrame.**
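
With many columns, scanning the full `.dtypes` listing can be tedious. One possible shortcut, sketched on a made-up frame (the `toy` DataFrame is an assumption; its column names merely echo the dataset), is to count columns per type and to select columns by type.

```python
import pandas as pd

# Hypothetical toy data with one int, one text, and one float column.
toy = pd.DataFrame({
    "hits": [737, 22, 62],
    "title": ["x", "y", "z"],
    "Latitude": [40.7, 40.8, 40.6],
})

# How many columns are there of each dtype?
print(toy.dtypes.value_counts())

# Split the frame into numeric and non-numeric columns.
numeric = toy.select_dtypes(include="number")
non_numeric = toy.select_dtypes(exclude="number")
print(list(numeric.columns))
print(list(non_numeric.columns))
```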

### Converting a column type


<p>The exercise describes the <code>hits</code> column of the <code>volunteer</code> dataset as type <code>object</code>, even though it consists of integers. In the copy of the data loaded here, <code>.dtypes</code> already reports the column as <code>int64</code>, so the conversion below leaves the type unchanged. Either way, let's convert that column to type <code>int</code>.</p>

Instructions
<ul>
<li>Take a look at the <code>.head()</code> of the <code>hits</code> column.</li>
<li>Use the <code>.astype()</code> method to convert the column to type <code>int</code>.</li>
<li>Take a look at the <code>dtypes</code> of the dataset again, and notice that the column type has changed.</li>
</ul>

In [29]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer["hits"].astype(int)

# Look at the dtypes of the dataset
print(volunteer.dtypes)

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64


**You can use astype to convert between a variety of types.**
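
To see an actual type change, the sketch below converts a made-up Series of digit strings. The `raw` and `dirty` Series are assumptions for illustration. It also shows `pd.to_numeric` with `errors="coerce"` as one way to handle values that cannot be parsed, which `astype(int)` would reject with an error.

```python
import pandas as pd

# Hypothetical column of integers stored as text.
raw = pd.Series(["737", "22", "62"], name="hits")
as_int = raw.astype(int)
print(as_int.dtype)

# A value such as "n/a" cannot be converted by astype(int).
# pd.to_numeric with errors="coerce" turns it into NaN instead.
dirty = pd.Series(["737", "n/a", "62"], name="hits")
coerced = pd.to_numeric(dirty, errors="coerce")
print(coerced.isnull().sum())
```

Because NaN is a floating-point value, the coerced Series comes back as a float column rather than an integer one.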

## Class distribution


### Class imbalance
<div class=""><p>In the <code>volunteer</code> dataset, we're thinking about trying to predict the <code>category_desc</code> variable using the other features in the dataset. First, though, we need to know what the class distribution (and imbalance) is for that label.</p>
<p>Which descriptions occur fewer than 50 times in the <code>volunteer</code> dataset?</p>
<ul>
<li>The dataset <code>volunteer</code> has been provided.</li>
<li>The column you want to check is <code>category_desc</code>.</li>
<li>Use the <code>value_counts()</code> method to check variable counts.</li>
</ul></div>

<pre>
Possible Answers

Emergency Preparedness

Health

Environment

<b>1 and 3</b>

All of the above

</pre>

In [21]:
volunteer['category_desc'].value_counts()

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64

**Both Emergency Preparedness and Environment occur fewer than 50 times.**
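
Rather than reading the rare classes off the printed counts, they can also be filtered programmatically. A minimal sketch on made-up labels follows; the `labels` Series and the cutoff of 3 are assumptions chosen to fit the toy data.

```python
import pandas as pd

# Hypothetical labels: 6 of class A, 3 of class B, 1 of class C.
labels = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1, name="category_desc")

# Absolute counts and relative shares per class.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
print(shares)

# Classes that occur fewer than 3 times.
rare = counts[counts < 3].index.tolist()
print(rare)
```

Passing `normalize=True` reports proportions instead of counts, which makes the degree of imbalance easier to compare across datasets of different sizes.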

### Stratified sampling


<p>We know that the distribution of classes in the <code>category_desc</code> column in the <code>volunteer</code> dataset is uneven. If we wanted to train a model to try to predict <code>category_desc</code>, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.</p>

Instructions
<ul>
<li>Create a <code>volunteer_X</code> dataset with all of the columns except <code>category_desc</code>.</li>
<li>Create a <code>volunteer_y</code> training labels dataset.</li>
<li>Split up the <code>volunteer_X</code> dataset using scikit-learn's <code>train_test_split</code> function and passing <code>volunteer_y</code> into the <code>stratify=</code> parameter.</li>
<li>Take a look at the <code>category_desc</code> value counts on the training labels.</li>
</ul>

In [32]:
from sklearn.model_selection import train_test_split
volunteer = volunteer.dropna(subset=['category_desc'], axis=0)

In [33]:
# Create a dataset with all columns except category_desc
volunteer_X = volunteer.drop('category_desc', axis=1)

# Create a category_desc labels dataset
volunteer_y = volunteer[['category_desc']]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64


**You'll use train_test_split frequently while building models, so it's useful to be familiar with the function.**
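
To check that stratification preserves class proportions, the sketch below splits a made-up dataset with an 80/20 class ratio. The `toy` DataFrame, `test_size=0.25`, and `random_state=42` are assumptions for illustration and do not appear in the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 "common" rows and 20 "rare" rows.
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["common"] * 80 + ["rare"] * 20,
})
X = toy.drop("label", axis=1)
y = toy["label"]

# stratify=y keeps the class ratio the same in both splits;
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Both splits should show the same 80/20 ratio as the full data. Without `stratify=`, a random split of a small dataset can leave a rare class under-represented in one of the splits.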