# Data preprocessing using pandas and scikit-learn

### Feature selection

Data preprocessing is almost always the step that comes before training a machine learning model. Some features are not useful for predicting a given outcome. For example, an `id` field that uniquely identifies each sample carries no information that generalizes to new samples.
Such variables can safely be dropped.


In [85]:
import pandas as pd
from matplotlib import pyplot as plt

In [86]:
filename = "nyc_squirrels.csv"
df = pd.read_csv(filename)

Again, we'll ask you to do a bit of work yourself. This time, we ask you to drop unnecessary columns.

In [87]:
df.columns

Index(['X', 'Y', 'Unique Squirrel ID', 'Hectare', 'Shift', 'Date',
       'Hectare Squirrel Number', 'Age', 'Primary Fur Color',
       'Highlight Fur Color', 'Combination of Primary and Highlight Color',
       'Color notes', 'Location', 'Above Ground Sighter Measurement',
       'Specific Location', 'Running', 'Chasing', 'Climbing', 'Eating',
       'Foraging', 'Other Activities', 'Kuks', 'Quaas', 'Moans', 'Tail flags',
       'Tail twitches', 'Approaches', 'Indifferent', 'Runs from',
       'Other Interactions', 'Lat/Long'],
      dtype='object')

In [88]:
# Drop the "Unique Squirrel ID" column
df.drop(columns="Unique Squirrel ID", inplace=True)

In [89]:
df.columns

Index(['X', 'Y', 'Hectare', 'Shift', 'Date', 'Hectare Squirrel Number', 'Age',
       'Primary Fur Color', 'Highlight Fur Color',
       'Combination of Primary and Highlight Color', 'Color notes', 'Location',
       'Above Ground Sighter Measurement', 'Specific Location', 'Running',
       'Chasing', 'Climbing', 'Eating', 'Foraging', 'Other Activities', 'Kuks',
       'Quaas', 'Moans', 'Tail flags', 'Tail twitches', 'Approaches',
       'Indifferent', 'Runs from', 'Other Interactions', 'Lat/Long'],
      dtype='object')
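
One quick way to spot identifier-like columns is to compare the number of distinct values in each column with the number of rows. The sketch below uses a small made-up frame (the column names and values are illustrative, not taken from the squirrel dataset). It is only a heuristic: a column can be unique per row and still be informative, for instance a continuous measurement.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "sample_id": ["a1", "b2", "c3", "d4"],
    "shift": ["AM", "PM", "AM", "PM"],
    "age": ["Adult", "Adult", "Juvenile", "Adult"],
})

# Columns whose number of distinct values equals the number of rows
n_unique = toy.nunique()
id_like = n_unique[n_unique == len(toy)].index.tolist()
print(id_like)

# Drop them, returning a new frame instead of modifying `toy` in place
toy_reduced = toy.drop(columns=id_like)
print(toy_reduced.columns.tolist())
```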

### Feature slicing
Feature slicing is the act of *slicing* a feature into multiple different features.
For example, we can slice the `Date` into day, month, and year.


In [90]:
df["Date"].head()

0    10142018
1    10192018
2    10142018
3    10172018
4    10172018
Name: Date, dtype: int64

In [91]:
df.head()

Unnamed: 0,X,Y,Hectare,Shift,Date,Hectare Squirrel Number,Age,Primary Fur Color,Highlight Fur Color,Combination of Primary and Highlight Color,...,Kuks,Quaas,Moans,Tail flags,Tail twitches,Approaches,Indifferent,Runs from,Other Interactions,Lat/Long
0,-73.956134,40.794082,37F,PM,10142018,3,,,,+,...,False,False,False,False,False,False,False,False,,POINT (-73.9561344937861 40.7940823884086)
1,-73.968857,40.783783,21B,AM,10192018,4,,,,+,...,False,False,False,False,False,False,False,False,,POINT (-73.9688574691102 40.7837825208444)
2,-73.974281,40.775534,11B,PM,10142018,8,,Gray,,Gray+,...,False,False,False,False,False,False,False,False,,POINT (-73.97428114848522 40.775533619083)
3,-73.959641,40.790313,32E,PM,10172018,14,Adult,Gray,,Gray+,...,False,False,False,False,False,False,False,True,,POINT (-73.9596413903948 40.7903128889029)
4,-73.970268,40.776213,13E,AM,10172018,5,Adult,Gray,Cinnamon,Gray+Cinnamon,...,False,False,False,False,False,False,False,False,,POINT (-73.9702676472613 40.7762126854894)


Hint: use the `Series.apply()` method with a lambda function. [[Help]](https://www.analyticsvidhya.com/blog/2020/03/what-are-lambda-functions-in-python/)


In [92]:
# Split the feature "Date" into "day", "month" and "year" columns
def split_date(date_int):
    date_str = str(date_int)
    month = date_str[:2]
    day = date_str[2:4]
    year = date_str[4:]
    return month,day,year

df[['Month','Day','Year']] = df["Date"].apply(lambda x : pd.Series(split_date(x))) #pd.Series to split result into separate columns

df[["Date", "Day", "Month", "Year"]].head()

Unnamed: 0,Date,Day,Month,Year
0,10142018,14,10,2018
1,10192018,19,10,2018
2,10142018,14,10,2018
3,10172018,17,10,2018
4,10172018,17,10,2018
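
As an alternative to slicing the string by hand, the date can probably also be parsed with `pd.to_datetime`. The sketch below assumes the integers follow the `MMDDYYYY` layout seen above, and it pads them to 8 digits in case a leading zero was lost when the column was read as an integer. The three sample values are copied from the output above. Unlike the string slicing version, this yields integer columns.

```python
import pandas as pd

# Sample values taken from the "Date" column shown above
dates = pd.Series([10142018, 10192018, 10172018], name="Date")

# Convert to zero-padded strings, then parse as month/day/year
parsed = pd.to_datetime(dates.astype(str).str.zfill(8), format="%m%d%Y")

parts = pd.DataFrame({
    "Date": dates,
    "Day": parsed.dt.day,
    "Month": parsed.dt.month,
    "Year": parsed.dt.year,
})
print(parts)
```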


### Feature engineering

You can create new features based on the features you have. These might be more useful for your (future) machine learning model than the ones that are already present in the dataset.
In this squirrel dataset, most of the fields encode the action taken by the squirrel when being approached by the human.
We will combine them into a single feature `Reaction` with values `"yes"` and `"no"`.

In [93]:
reaction_columns = ["Kuks", "Quaas", "Moans", "Tail flags",
                    "Tail twitches", "Approaches", "Runs from",
                    "Other Interactions"]

df["Reaction"] = df[reaction_columns].any(axis=1)  # 3023 rows: False if every reaction column is False, True otherwise
df["Reaction"] = df["Reaction"].apply(lambda x: "yes" if x else "no")
df["Reaction"]

0        no
1        no
2        no
3       yes
4        no
       ... 
3018    yes
3019     no
3020     no
3021     no
3022    yes
Name: Reaction, Length: 3023, dtype: object
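
To make the row-wise logic easier to follow, here is a minimal sketch on a made-up three-row frame (the values are invented for illustration). It uses `Series.map` with a dictionary instead of `apply` with a lambda, which should give the same result for boolean input.

```python
import pandas as pd

# Hypothetical toy data: one row per sighting, one boolean per reaction
toy = pd.DataFrame({
    "Kuks": [False, True, False],
    "Quaas": [False, False, False],
    "Runs from": [False, False, True],
})

# True as soon as at least one column in the row is True
reacted = toy.any(axis=1)

# Translate the booleans into the "yes"/"no" labels used above
toy["Reaction"] = reacted.map({True: "yes", False: "no"})
print(toy)
```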

An important step in a data processing pipeline is making the data understandable to machine learning algorithms. Most of them cannot handle strings such as `yes` and `no` in our newly created column.
We need to transform them into a binary format so that the machine learning model can take advantage of that feature. We are going to **one-hot encode** our feature.


In [94]:
df_test = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})

df_test

Unnamed: 0,A,B,C
0,a,b,1
1,b,a,2
2,a,c,3


In [100]:
pd.get_dummies(df_test, prefix=['col1', 'col2'],drop_first=False)

Unnamed: 0,C,col1_a,col1_b,col2_a,col2_b,col2_c
0,1,True,False,False,True,False
1,2,False,True,True,False,False
2,3,True,False,False,False,True


In [101]:
pd.get_dummies(df_test, prefix=['col1', 'col2'],drop_first=True)
# dropped: col1_a, col2_a

Unnamed: 0,C,col1_b,col2_b,col2_c
0,1,False,True,False
1,2,True,False,False
2,3,False,False,True


In [96]:
pd.get_dummies(df_test)

Unnamed: 0,C,A_a,A_b,B_a,B_b,B_c
0,1,True,False,False,True,False
1,2,False,True,True,False,False
2,3,True,False,False,False,True


### Personal notes

- `get_dummies()` converts categorical variables into numerical ones. It therefore has no effect on columns that are already numerical.
- Each new column is named after a value present in the dataset (with a prefix if one is given).

- The code below creates two columns, "Reaction_yes" and "Reaction_no". "Reaction_yes" is true if the "Reaction" column had a "yes" in that row. ("Reaction_no" is therefore always the opposite of "Reaction_yes".)

In [73]:
pd.get_dummies(df.Reaction, prefix="Reaction")

Unnamed: 0,Reaction_no,Reaction_yes
0,True,False
1,True,False
2,True,False
3,False,True
4,True,False
...,...,...
3018,False,True
3019,True,False
3020,True,False
3021,True,False


However, we have a redundancy here, as we could just transform `"yes"` to `1` and `"no"` to `0` in our `"Reaction"` column. This can be done by setting the argument `drop_first` to `True`.

NB: `drop_first=True` drops the first category in sorted order (here, `"no"`).

The code below therefore produces a "Reaction_yes" column, which is then renamed "Reaction".

In [102]:
# Store the result in its own variable so that the full `df` is not overwritten
reaction = pd.get_dummies(df.Reaction, prefix="Reaction", drop_first=True)
reaction.rename(columns={"Reaction_yes": "Reaction"})

Unnamed: 0,Reaction
0,False
1,False
2,False
3,True
4,False
...,...
3018,True
3019,False
3020,False
3021,False



Similar things can be done after converting the data frame to an array using the `scikit-learn` library with `LabelBinarizer` or `OneHotEncoder`.
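
A minimal sketch of the scikit-learn route, using a short made-up array of labels. `LabelBinarizer` is applied to a 1-D array of labels, while `OneHotEncoder` expects a 2-D array of features, hence the `reshape`. The `drop="first"` option plays a role similar to `drop_first=True` in pandas.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# Hypothetical labels, for illustration only
reaction = np.array(["no", "no", "yes", "no", "yes"])

# LabelBinarizer: with two classes, the result is a single 0/1 column
lb = LabelBinarizer()
binary = lb.fit_transform(reaction)
print(lb.classes_)
print(binary.ravel())

# OneHotEncoder: needs shape (n_samples, n_features);
# the default output is sparse, so we convert it to a dense array
ohe = OneHotEncoder(drop="first")
encoded = ohe.fit_transform(reaction.reshape(-1, 1)).toarray()
print(ohe.categories_)
print(encoded.ravel())
```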

## Feature normalization or standardization
Although they are sometimes used interchangeably, normalization and standardization are two different ways to bring a column of values to a common scale.
In this section, we're going to use the word normalization to refer to this concept.

Why do we normalize data?

*For example, assume your input dataset contains one column with values ranging from 0 to 1, and another column with values ranging from 10,000 to 100,000. The great difference in the scale of the numbers could cause problems when you attempt to combine the values as features during modeling.* [[Source]](https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data)

Some algorithms require that data be normalized before training a model. Other algorithms perform their own data scaling or normalization.

Given a column of values *x*, if we choose to scale them, there are a few options:

- Normalization, also called *min-max scaling*, rescales every value to the range [0, 1]. The maximum and the minimum are computed for each column separately.

  $$ z = \frac{x - min(x)}{max(x) - min(x)} $$

- Standardization, also called z-score normalization, rescales the values so that they have a mean of 0 and a standard deviation of 1. It essentially transforms all values of *x* to a *z-score*. Mean and standard deviation are computed for each column separately.

$$ z = \frac{x - mean(x)}{std(x)} $$
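
Both formulas can be checked on a tiny made-up column with NumPy. This is only an illustrative sketch. Note that `np.std` computes the population standard deviation by default (`ddof=0`).

```python
import numpy as np

# Hypothetical column of values
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: the smallest value maps to 0, the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)

# Z-score standardization: the result has mean 0 and standard deviation 1
x_zscore = (x - np.mean(x)) / np.std(x)
print(x_zscore.mean(), x_zscore.std())
```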


### Be careful!

When you want to normalize your dataset, you have to do so **AFTER** splitting your data into training and testing sets. Normalizing before the split would let information from the testing set (its minimum, maximum, mean, or standard deviation) leak into the training set, which biases the model.
In a real-world scenario, you would not have access to the testing set, as it stands for the data you are meant to predict.

The procedure is the following:
1. Split your data into train and test
2. For every variable $x$ of your **training set**, compute $max(x_{train})$ and $min(x_{train})$, or $mean(x_{train})$ and $std(x_{train})$, depending on whether you use min-max scaling or z-score normalization.
3. Normalize your training set and your testing set using these values (here I'm only showing the testing set).
$$ z_{test} = \frac{x_{test} - min(x_{train})}{max(x_{train}) - min(x_{train})} $$


$$ z_{test} = \frac{x_{test} - mean(x_{train})}{std(x_{train})} $$
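
The three steps above can also be sketched with scikit-learn's `StandardScaler`, which stores the training mean and standard deviation when fitted and reuses them on the testing set. This is one possible way to do it, not the only one. The split parameters are arbitrary choices for the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=41
)

# 2. Learn mean and standard deviation from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Apply the *training* statistics to the testing set
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))
```

The scaled testing set will generally not have a mean of exactly 0 or a standard deviation of exactly 1, which is expected since its own statistics were never used.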


In [103]:
from sklearn.model_selection import train_test_split
from sklearn import datasets
import numpy as np
dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target

In [104]:
print(dataset.feature_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


### Personal Notes
### First, let's take a look at train_test_split

#### Features and target
- Features: the characteristics used to predict something else
- Target: the thing we're trying to predict
- Convention: `X` = features, `y` = target

#### [Official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
sklearn.model_selection.train_test_split(*arrays, test_size, random_state, ...)

INPUT
- *arrays : this means the number of arrays passed can change (usually we write x,y). The "\*" in Python function signatures usually refers to handling multiple arguments. We can put different types of data (lists, pandas DataFrames) but they will be treated as arrays. 
- test_size (and/or train_size)
- random_state for reproducibility

OUTPUT
- list containing train-test split of inputs
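
A short sketch of these inputs and outputs on made-up arrays. With two input arrays, the returned list has four elements, which are usually unpacked directly. Fixing `random_state` should give the same split on every call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and a binary target
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Two arrays in, four arrays out: train/test for X, then train/test for y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)

# Same random_state, same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(np.array_equal(X_test, X_test2))
```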

#### randint
numpy.random.randint(low, high=None, size=None, dtype=int)
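
One detail worth keeping in mind: `low` is inclusive and `high` is exclusive. A quick illustrative check, with an arbitrary seed and sample size:

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the draw repeatable

# 1000 integers drawn from 1 (inclusive) to 10 (exclusive)
draws = np.random.randint(1, 10, size=1000)

# The value 10 itself is never drawn
print(draws.min(), draws.max())
```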

In [122]:
# We create a dictionary where the first two keys each hold 10 random values
# (between 1 and 9, and between 10 and 19, respectively).
# The last key (target) holds 10 integer values, each 0 or 1.

data_test = {
    'feature1': np.random.randint(1, 10, 10),
    'feature2': np.random.randint(10, 20, 10),
    'target': np.random.randint(0, 2, 10)
}
df_test = pd.DataFrame(data_test)
#print(df_test)

print(train_test_split(df_test[["feature1", "feature2"]], df_test["target"]))

[   feature1  feature2
0         1        10
3         4        12
1         7        10
8         9        10
5         6        19
6         4        10
2         2        17,    feature1  feature2
4         9        17
7         8        16
9         5        17, 0    0
3    1
1    1
8    1
5    0
6    1
2    0
Name: target, dtype: int32, 4    0
7    0
9    1
Name: target, dtype: int32]


In [119]:
# Splitting data into train and test split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=41, test_size=0.2)
print(X_train.shape)

(455, 30)


In [117]:
X_train

array([[1.283e+01, 1.573e+01, 8.289e+01, ..., 9.783e-02, 3.006e-01,
        7.802e-02],
       [1.426e+01, 1.817e+01, 9.122e+01, ..., 7.530e-02, 2.636e-01,
        7.676e-02],
       [1.365e+01, 1.316e+01, 8.788e+01, ..., 8.056e-02, 2.380e-01,
        8.718e-02],
       ...,
       [1.375e+01, 2.377e+01, 8.854e+01, ..., 6.106e-02, 2.663e-01,
        6.321e-02],
       [2.016e+01, 1.966e+01, 1.311e+02, ..., 1.425e-01, 3.055e-01,
        5.933e-02],
       [1.145e+01, 2.097e+01, 7.381e+01, ..., 6.127e-02, 2.762e-01,
        8.851e-02]])

Here, we'll want you to use numpy's mean and standard deviation functions to standardize each feature of the training and testing set. These are `np.mean()` and `np.std()`.

In [None]:
dataset

In [None]:
for idx, name in enumerate(dataset.feature_names):
    print(idx, name)

In [None]:
# Standardizing each feature using the train mean and standard deviation
for idx, name in enumerate(dataset.feature_names):
    # Get mean and standard deviation from training set (per feature)
    mean = np.mean(X_train[:, idx])
    stdev = np.std(X_train[:, idx])
    print(f"Feature '{name}' has mean {mean:.2f} and standard deviation {stdev:.2f}")
    # Standardize training and testing set using the mean and standard deviation from the training set
    X_train[:, idx] = (X_train[:, idx] - mean) / stdev
    X_test[:, idx] = (X_test[:, idx] - mean) / stdev


If you run the previous cell twice (without running the other cells again), you'll see that the second time, the mean and standard deviation of each training feature are 0 and 1 respectively, which is exactly what we want when we standardize (z-score normalization).

## Resampling

When the classes in your dataset are imbalanced, i.e. some classes have far more samples than others, you can resort to resampling.
Resampling means drawing more (or fewer) samples of a given class to obtain a balanced dataset.

**BE CAREFUL**, resample **AFTER** splitting your data set into two parts. You do not want to accidentally have a copy of a testing sample in the training set.
Moreover, **do not resample the testing set**. This would give a false sense of the performance of the model.


In [None]:

import sklearn
import numpy as np
from sklearn import datasets

dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target

In [None]:
# We separate the samples of the different classes
class_one_idx = np.argwhere(y==1)
class_zero_idx = np.argwhere(y==0)

class_one_x = np.squeeze(X[class_one_idx])
class_zero_x = np.squeeze(X[class_zero_idx])

print("Shape of class 0 samples: ", class_zero_x.shape)
print("Shape of class 1 samples: ", class_one_x.shape)

You see that we have 212 samples of class 0 and 357 samples of class 1.
We can either upsample, i.e. take more samples of, the minority class (here class 0) or we can downsample, i.e. take fewer samples of, the majority class (here class 1).
This is why we first separated the samples of each class.

In [None]:
from sklearn.utils import resample

# Upsample minority class
class_zero_upsampled = resample(class_zero_x, 
                                 replace=True,     # sample with replacement
                                 n_samples=357,    # to match majority class
                                 random_state=123) # reproducible results

print("New shape of class 0 samples: ",class_zero_upsampled.shape)

# Downsample majority class
class_one_downsampled = resample(class_one_x, 
                                 replace=False,    # sample without replacement
                                 n_samples=212,    # to match minority class
                                 random_state=123) # reproducible results

print("New shape of class 1 samples: ",class_one_downsampled.shape)

After having either upsampled our minority class, or downsampled our majority class, we can combine the upsampled with the majority class or the downsampled with the minority class to have a balanced data set.

Which one you use depends on what you want to do, and which one does best.

In [None]:
X_balanced = np.concatenate((class_zero_upsampled, class_one_x), axis=0)
print("(Upsampled) Balanced data set shape: ", X_balanced.shape)

X_balanced = np.concatenate((class_one_downsampled, class_zero_x), axis=0)
print("(Downsampled) Balanced data set shape: ", X_balanced.shape)
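
The cells above resample the whole dataset to keep the demonstration short. Following the warning at the start of this section, a fuller sketch would split first, upsample only the training set, and rebuild the matching labels. The split parameters and the `stratify=y` option are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = datasets.load_breast_cancer(return_X_y=True)

# Split first; the testing set is left untouched afterwards
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=41, stratify=y
)

# Separate the classes of the training set with boolean masks
minority = X_train[y_train == 0]
majority = X_train[y_train == 1]

# Upsample the minority class to the size of the majority class
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=123)

# Rebuild features and labels in the same order
X_train_bal = np.concatenate((majority, minority_up), axis=0)
y_train_bal = np.concatenate((np.ones(len(majority), dtype=int),
                              np.zeros(len(minority_up), dtype=int)))

print(X_train_bal.shape, np.bincount(y_train_bal))
```

The rows end up grouped by class, so shuffling `X_train_bal` and `y_train_bal` together may be needed for models that are sensitive to sample order.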

## Reading material and additional resources

[[1] Feature Engineering - Elite Data Science](https://elitedatascience.com/feature-engineering)  
[[2] Feature Engineering - Towards Data Science](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)  
[[3] Feature Engineering Tutorial - Kaggle](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)  
[[4] LabelBinarizer - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html)  
[[5] OneHotEncoder - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)  
[[6] Zscore - Simply Psychology](https://www.simplypsychology.org/z-score.html)  
[[7] Normalize data - Microsoft Azure](https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data)  
[[8] How to handle imbalanced classes - Elite Data Science ](https://elitedatascience.com/imbalanced-classes)