# Chapter #1: Introduction to Data Preprocessing

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## 1. What is data preprocessing?

1. Preprocessing Data for Machine Learning
> Hello! Welcome to this course on Preprocessing Data for Machine Learning. My name is Sarah Guido, and I'll be helping you learn the skills necessary for preparing data for modeling. Let's jump right in.

2. What is data preprocessing?
> Data preprocessing comes after you've cleaned up your data and after you've done some exploratory analysis to understand your dataset. Once you understand your dataset, you'll probably have some idea about how you want to model your data. Machine learning models in Python require numerical input, so if your dataset has categorical variables, you'll need to transform them. Think of data preprocessing as a prerequisite for modeling.

3. Refresher on Pandas basics
> I'm going to walk through some basics in Pandas. Most of this should be review. If it isn't, go check out other courses related to Pandas. We're going to be working with some pretty straightforward files in this course. The important line here is the `hiking.head()` line. The first thing you're going to want to do with any dataset is look at it.

4. Refresher on Pandas basics
> It's useful to be able to generate a list of the features present in your dataset. You can easily see the columns in a dataset with the columns attribute, and you can see their data types with the dtypes attribute.

5. Refresher on Pandas basics
> Finally, you can quickly generate some basic stats about a dataframe like the mean, standard deviation, and quartiles using the describe method. One of the first steps you can take to preprocess your data is to remove missing data. There are many ways to deal with missing data, but here we're only going to cover ways to remove either columns or rows with missing data.
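
The inspection steps from the refresher can be sketched on a small made-up dataframe. The `hiking` data below is a hypothetical stand-in (the course file is not reproduced here), so the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the course's hiking data (values are made up).
hiking = pd.DataFrame({
    "name": ["Trail A", "Trail B", "Trail C"],
    "length_km": [1.5, 3.2, 0.8],
    "difficulty": [1, 3, 2],
})

print(hiking.head())     # first rows (at most 5)
print(hiking.columns)    # column labels
print(hiking.dtypes)     # one data type per column

# describe() summarises the numeric columns by default:
# count, mean, std, min, quartiles, and max.
summary = hiking.describe()
print(summary)
```

Note that `describe()` skips the text column here, which is one quick hint that a column is not being treated as numeric.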

6. Removing missing data
> If you wanted to drop all rows from a dataframe that contain missing values, you can do that with dropna.

7. Removing missing data
> You can drop specific rows by passing index labels to the drop method, which defaults to dropping rows.

8. Removing missing data
> Usually you'll want to focus on dropping a particular column, especially if all or most of its values are missing. You can use the drop method as well, though the parameters are different. The first parameter is the column name, in this case A. We have to specify axis=1 in order to designate that we want to drop a column.
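
A minimal sketch of the three removal patterns just described, using a made-up dataframe (the values are assumptions chosen so that each call has a visible effect):

```python
import numpy as np
import pandas as pd

# Toy dataframe with missing values in columns A and B (made-up data).
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [np.nan, 7.0, 7.0, 4.0],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Drop every row that contains at least one missing value.
# Only the row labelled 2 is complete, so only it survives.
complete_rows = df.dropna()

# Drop specific rows by index label (drop() works on rows by default).
without_rows = df.drop([1, 3])

# Drop a column by name; axis=1 tells drop() to look at columns.
without_a = df.drop("A", axis=1)

print(complete_rows, without_rows, without_a, sep="\n\n")
```

None of these calls modifies `df` itself; each returns a new dataframe.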

9. Removing missing data
> What if we want to drop rows where data is missing in a particular column? We can do this with the help of boolean indexing, which is a way to filter a dataframe based on certain values. Instead of indexing a dataframe using column or row names, you can set a condition and filter the dataframe by it to return a specific set of data. For example, if we wanted only the rows in this dataframe where column B is equal to 7, we can select the rows where that condition holds.

10. Removing missing data
> First, let's take a look at how many null values we have in column B, using isnull to get null values and then using sum to output a count. So we have 2 missing values. To filter those out, we can simply use the notnull method on column B as a boolean index. This will return a dataframe where all rows have a non-null value for column B.
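
The boolean-indexing steps can be sketched as follows. The dataframe is made up for illustration, with two missing values in column B to mirror the count mentioned in the transcript.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up data) with two missing values in column B.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [7.0, np.nan, 7.0, np.nan, 4.0],
})

# Boolean indexing: keep only the rows where the condition is True.
sevens = df[df["B"] == 7]

# Count the missing values in B: isnull() gives True/False, sum() counts the Trues.
n_missing = df["B"].isnull().sum()

# Keep only the rows where B is present.
b_present = df[df["B"].notnull()]

print(sevens, n_missing, b_present, sep="\n\n")
```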

11. Let's practice!
> Now it's your turn to get rid of missing data. Give it a try!

### 1.1. Missing data - columns

We have a dataset comprised of volunteer information from New York City. The dataset has a number of features, but we want to get rid of features that have at least 3 missing values.

How many features are in the original dataset, and how many features are in the set after columns with at least 3 missing values are removed?

- Getting everything ready.

In [2]:
# Reading the data & making sure that the hits column is imported as a str:
volunteer = pd.read_csv("./data/volunteering.csv", dtype={'hits': str})

In [3]:
# Exploring the shape:
volunteer.shape

(665, 35)

In [4]:
# Exploring the first 5 rows:
volunteer.head()

| Unnamed: 0 | opportunity_id | content_id | vol_requests | event_time | title | hits | summary | is_priority | category_id | category_desc | ... | end_date_date | status | Latitude | Longitude | Community Board | Community Council | Census Tract | BIN | BBL | NTA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4996 | 37004 | 50 | 0 | Volunteers Needed For Rise Up & Stay Put! Home... | 737 | Building on successful events last summer and ... | | | | ... | July 30 2011 | approved | | | | | | | | |
| 1 | 5008 | 37036 | 2 | 0 | Web designer | 22 | Build a website for an Afghan business | | 1.0 | Strengthening Communities | ... | February 01 2011 | approved | | | | | | | | |
| 2 | 5016 | 37143 | 20 | 0 | Urban Adventures - Ice Skating at Lasker Rink | 62 | Please join us and the students from Mott Hall... | | 1.0 | Strengthening Communities | ... | January 29 2011 | approved | | | | | | | | |
| 3 | 5022 | 37237 | 500 | 0 | Fight global hunger and support women farmers ... | 14 | The Oxfam Action Corps is a group of dedicated... | | 1.0 | Strengthening Communities | ... | March 31 2012 | approved | | | | | | | | |
| 4 | 5055 | 37425 | 15 | 0 | Stop 'N' Swap | 31 | Stop 'N' Swap reduces NYC's waste by finding n... | | 4.0 | Environment | ... | February 05 2011 | approved | | | | | | | | |


- The dataset `volunteer` has been provided.

- Use the `dropna()` function to remove columns.

- You'll have to set both the `axis=` and `thresh=` parameters.

In [5]:
# Keeping only columns with at least 3 non-missing values (thresh=3) & re-exploring the shape:
volunteer.dropna(axis=1, thresh=3).shape

(665, 24)

Possible Answers:
- 35, 24.
- 35, 35.
- 35, 19.

> 35, 24.
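
One caveat is worth checking here: in pandas, `thresh` is the minimum number of non-missing values a row or column needs in order to be kept. So `dropna(axis=1, thresh=3)` keeps every column with at least 3 non-null values, which is not the same as dropping columns with at least 3 missing values. A toy sketch of the difference, on made-up data:

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up data) with 6 rows.
toy = pd.DataFrame({
    "full": [1, 2, 3, 4, 5, 6],                                         # 0 missing
    "half_missing": [1.0, 2.0, 3.0, np.nan, np.nan, np.nan],           # 3 missing, 3 present
    "mostly_missing": [1.0, np.nan, np.nan, np.nan, np.nan, np.nan],   # 5 missing, 1 present
})

# thresh=3: keep columns that have at least 3 non-missing values.
kept_by_thresh = toy.dropna(axis=1, thresh=3)

# Dropping columns with at least 3 missing values instead:
# count the missing values per column and keep those with fewer than 3.
few_missing = toy.loc[:, toy.isnull().sum() < 3]

print(list(kept_by_thresh.columns))
print(list(few_missing.columns))
```

The `half_missing` column survives the `thresh=3` call but not the missing-count filter, so the two rules can give different column counts on real data.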

### 1.2. Missing data - rows

Taking a look at the `volunteer` dataset again, we want to drop rows where the `category_desc` column values are missing. We're going to do this using boolean indexing, by checking to see if we have any null values, and then filtering the dataset so that we only keep rows where `category_desc` is not null.

- Check how many values are missing in the `category_desc` column using `isnull()` and `sum()`.

In [6]:
# Checking for missing values in category_desc column:
volunteer['category_desc'].isna().sum()

48

- Subset the `volunteer` dataset by indexing by where `category_desc` is `notnull()`, and store in a new variable called `volunteer_subset`.

In [7]:
# Filtering out missing values in category_desc column:
volunteer_subset = volunteer[volunteer['category_desc'].notna()].copy()

- Take a look at the `.shape` attribute of the new dataset, to verify it worked correctly.

In [8]:
# Exploring the shape:
volunteer_subset.shape

(617, 35)

In [9]:
# We can also use .dropna() with specifying the category_desc column:
volunteer.dropna(axis=0, subset=['category_desc']).shape

(617, 35)

## 2. Exploring data types

1. Working With Data Types
> Now that we've reviewed some Pandas basics, we need to start thinking about other steps we have to take in order to prepare data for modeling. One of these steps is to think about the types that are present in your dataset, because you'll likely have to transform some of these columns to other types later on. Let's take a deeper look at types as well as how to convert column types in your dataset.

2. Why are types important?
> Recall that you can check the types of a dataframe by using the dtypes attribute, like this. Pandas datatypes are similar to native Python types, but there are a couple of things to be aware of. The most common types you'll be working with are object, int64, and float64 types. The object type is what Pandas uses to refer to a column that consists of string values or is of mixed types. int64 corresponds to the Python integer type; the 64 simply refers to the amount of memory allotted for storing the values. float64 corresponds to the float type. Another type you might see as you work with data in Pandas is the datetime64 type (or the timedelta type). This is because you can store dates as datetime types in Pandas dataframes, and even use datetimes as a special kind of index. All you need to be familiar with as we work through this course are the object, int64, and float64 types, though. Before any preprocessing can begin, you have to understand what types you're dealing with in your dataset. Sometimes, you'll start working with a dataset that has an incorrect column type: maybe a numerical column was written out into a csv as a string, and when you try to work with that column, numerical operations won't work.

3. Converting column types
> Let's take a look at how to adjust the type of a column if the type that Pandas has inferred upon reading in the file is incorrect. Here we have a simple dataset with a couple of columns. If you run df.dtypes, you'll see that the type for column C is object. However, if we simply look at this dataframe, you can see that these are float values: numbers with decimal points. If we want to preprocess and model this data, we're going to have to adjust the column type.

4. Converting column types
> Changing the type of a column is very straightforward. Pandas already has a method for converting the type of the column to a new type. You can change the type using the astype method and passing in the type you want to convert it to. Make sure you're only assigning it to the column you want converted. It's also good to be as sure as you can that the column type you want to convert to is representative of the whole column. Remember that the object type can represent a column that includes both string and numeric types.
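
A small sketch of the conversion described above. The dataframe is made up; column C holds numbers stored as text, which is an assumption standing in for the slide's example.

```python
import pandas as pd

# Toy dataframe (made-up data): column C looks numeric but is stored as strings.
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": ["string1", "string2", "string3"],
    "C": ["1.0", "2.5", "3.7"],
})
print(df.dtypes)

# While C holds strings, "+" concatenates text instead of adding numbers.
doubled_as_text = df["C"] + df["C"]

# Convert only column C, assigning the result back to that column.
df["C"] = df["C"].astype("float")

# Now arithmetic behaves numerically.
doubled_as_number = df["C"] + df["C"]

print(doubled_as_text.iloc[0], doubled_as_number.iloc[0])
print(df.dtypes)
```

`astype` raises an error if any value in the column cannot be converted, which is why it helps to check that the target type fits the whole column first.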

5. Let's practice!
> Now it's your turn to do some type conversion.

### 2.1. Exploring data types

Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we'll be working with as we start to do more preprocessing.

Which data types are present in the volunteer dataset?

- The dataset `volunteer` has been provided.

- Use the `.dtypes` attribute to check the datatypes.

In [10]:
# Exploring the data types:
volunteer.dtypes.value_counts()

object     15
float64    13
int64       7
dtype: int64

Possible Answers:
- Float and int only.
- Int only.
- Float, int, and object.
- Float only.

> Float, int, and object.

### 2.2. Converting a column type

If you take a look at the `volunteer` dataset types, you'll see that the column `hits` is type `object`. But, if you actually look at the column, you'll see that it consists of integers. Let's convert that column to type `int`.

- Take a look at the `.head()` of the `hits` column.

In [11]:
# Exploring the first 5 values in the hits column:
volunteer['hits'].head()

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: object

- Use the `.astype` function to convert the column to type `int`.

In [12]:
# Converting the hits column into integers:
volunteer['hits'] = volunteer['hits'].astype(int)

- Take a look at the `.dtypes` of the dataset again, and notice that the column type has changed.

In [13]:
# Exploring the data types:
volunteer.dtypes

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int32
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL       

## 3. Class distribution

1. Training and Test Sets
> One of the most necessary steps for preprocessing, which you should be familiar with if you've taken other courses on Python and machine learning, is splitting up your data into training and test sets. We do this to avoid the issue of overfitting. If we train a model on our entire set of data, we won't have any way to test and validate our model because the model will essentially know the dataset by heart. Holding out a test set allows us to preserve some data the model hasn't seen yet.

2. Splitting up your dataset
> Just to review, this is how you split up your dataset in scikit learn using the train_test_split function. This should look familiar to you. The function shuffles up your dataset and then randomly splits it. By default, the function will split 75% of the data into the training set and 25% into the test set. In many scenarios, the default splitting parameters will work well. However, if your labels have an uneven distribution, your test and training sets might not be representative samples of your dataset and could bias the model you're trying to train. For example, if you look at the example training and test datasets on this slide, you can see that the training set has only samples labeled n, while there is a y label in the test set.
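
A minimal sketch of the default split, using made-up arrays of 100 samples. The `random_state` argument is added here only to make the shuffle repeatable; it is not part of the transcript.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples with 2 features each, plus 100 labels.
X = np.arange(200).reshape(100, 2)
y = np.array(["n"] * 80 + ["y"] * 20)

# With no test_size given, 25% of the samples go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)
print(np.unique(y_train, return_counts=True))
print(np.unique(y_test, return_counts=True))
```

Because this split ignores the labels, the share of `y` labels in each set can drift from the 20% in the full dataset.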

3. Stratified sampling
> A good technique for sampling more accurately when you have imbalanced classes is stratified sampling, which is a way of sampling that takes into account the distribution of classes or features in your dataset. So for example, let's say we had a dataset with 100 samples, 80 of which are class 1 and 20 of which are class 2. We want the class distribution in both our training set and our test set to reflect this, so in both our training and test sets, we'd want 80% of our sample to be class 1 and 20% to be class 2, which means we'd want 60 class 1 samples and 15 class 2 samples in our training set of 75 samples. In our test set of 25 samples, we want to have 20 class 1 samples and 5 of class 2. This is on par with the distribution of classes in the original dataset.

4. Stratified sampling
> There's a really easy way to do this in scikit-learn using the train_test_split function. The function comes with a stratify parameter, and to stratify according to class labels, just pass in your y dataset to that parameter. So here we have our 100 labels: 80 are class 1 and 20 are class 2. Let's run train_test_split, and pass the y labels dataset into that stratify parameter.

5. Stratified sampling
> If we check the distribution of classes for our training and test labels, you can see the distribution of classes is in accordance with the original y class distribution.
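
The 80/20 example from the transcript can be sketched like this. The feature column is made up, and `random_state` is an added assumption for repeatability.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 made-up samples: 80 labelled class1 and 20 labelled class2.
X = pd.DataFrame({"feature": range(100)})
y = pd.Series(["class1"] * 80 + ["class2"] * 20)

# Passing the labels to stratify keeps the class proportions in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

train_counts = y_train.value_counts()
test_counts = y_test.value_counts()
print(train_counts)  # 60 class1 and 15 class2 in the 75 training samples
print(test_counts)   # 20 class1 and 5 class2 in the 25 test samples
```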

6. Let's practice!
> Now it's your turn to do some stratified sampling!

### 3.1. Class imbalance

In the `volunteer` dataset, we're thinking about trying to predict the `category_desc` variable using the other features in the dataset. First, though, we need to know what the class distribution (and imbalance) is for that label.

Which descriptions occur fewer than 50 times in the `volunteer` dataset?

- The dataset `volunteer` has been provided.

- The column you want to check is `category_desc`.

- Use the `.value_counts()` method to check variable counts.

In [14]:
# Counting the frequency of different values in category_desc column:
volunteer['category_desc'].value_counts()

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64

Possible Answers:
- Emergency Preparedness.
- Health.
- Environment.
- 1 and 3.
- All of the above.

> 1 and 3.
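
Rather than reading the answer off the printed counts, the rare classes can be filtered directly. The sketch below rebuilds the counts shown in the output above as a plain Series, since the CSV itself is not included here; with the real data, `counts` would be `volunteer['category_desc'].value_counts()`.

```python
import pandas as pd

# Counts copied from the value_counts() output shown above.
counts = pd.Series({
    "Strengthening Communities": 307,
    "Helping Neighbors in Need": 119,
    "Education": 92,
    "Health": 52,
    "Environment": 32,
    "Emergency Preparedness": 15,
})

# Boolean indexing on the counts: keep the classes seen fewer than 50 times.
rare = counts[counts < 50]
print(rare)
```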

### 3.2. Stratified sampling

We know that the distribution of values in the `category_desc` column in the `volunteer` dataset is uneven. If we wanted to train a model to try to predict `category_desc`, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.

- Getting everything ready.

In [15]:
# Dropping rows in which target column category_desc is missing:
volunteer.dropna(axis=0, subset=['category_desc'], inplace=True)

# Dropping columns which have missing values:
volunteer.dropna(axis=1, inplace=True)

- Create a `volunteer_X` dataset with all of the columns except `category_desc`.

In [16]:
# Creating the feature matrix (X):
volunteer_X = volunteer.drop(columns='category_desc')

- Create a `volunteer_y` training labels dataset.

In [17]:
# Creating the target column (y):
volunteer_y = volunteer['category_desc'].copy()

- Split up the `volunteer_X` dataset using scikit-learn's `train_test_split` function and passing `volunteer_y` into the `stratify=` parameter.

In [18]:
# Splitting the data into training & hold-out sets:
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

- Take a look at the `category_desc` value counts on the training labels.

In [19]:
# Checking the value counts % in training set:
print(y_train.value_counts(normalize=True))

Strengthening Communities    0.497835
Helping Neighbors in Need    0.192641
Education                    0.149351
Health                       0.084416
Environment                  0.051948
Emergency Preparedness       0.023810
Name: category_desc, dtype: float64


In [20]:
# Checking the value counts % in hold-out set:
print(y_test.value_counts(normalize=True))

Strengthening Communities    0.496774
Helping Neighbors in Need    0.193548
Education                    0.148387
Health                       0.083871
Environment                  0.051613
Emergency Preparedness       0.025806
Name: category_desc, dtype: float64
