## <center><h1>**Data preprocessing**</h1>
<center> </br>Astrid Jourdan & Peio Loubière & Yannick Le Nir<br/>

This tutorial is about data preparation:
- importing a dataset
- checking the type of the variables
- checking the distribution of the target variable
- creating a training set, a validation set and a test set
- scaling the variables
- encoding the categorical variables

In [1]:
# Library imports

import math
import pandas as pnd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# **Data importation**
<br/>
The dataset is the multi-spectral values of pixels in 3x3 neighborhoods in a satellite image, and the classification associated with the central pixel in each neighborhood. The target variable is the soil type : <br/><center><i> red soil - cotton crop - grey soil - damp grey soil - soil with vegetation stubble - very damp grey soil </i></center>

We display the dimensions and the first three rows to make sure that the file has been read correctly <br/>
<br/>

In [None]:
# File import
dataset = pnd.read_csv("Landsat.txt", delimiter=" ")

# Displays the column and row names
print("The column names are:",dataset.columns)
print("The row names are:",dataset.index)

# Display the dimensions
print("The dimension of the dataset : ",dataset.shape)

# Display the first three rows
dataset.head(3)
# Display a summary of the whole dataset
# dataset.info()


<br/> **Variables type**<br/>
All the input variables are numerical in this dataset (int64). The target variable is categorical (object).
For the preprocessing, we need to get the list of numerical variables (for the normalization) and the list of categorical variables (for the encoding).

In [None]:
# Displays the variable type
print("The variable type: \n",dataset.dtypes)
print("\n")

# Displays the categorical variables
ListVarCat = dataset.select_dtypes(include=['object']).columns.tolist()
print("List of categorical variables: \n", ListVarCat)
print("\n")

# Display the numeric variables
ListVarNum = dataset.select_dtypes(exclude=['object']).columns.tolist()
print("List of numerical variables: \n",ListVarNum)


**Target distribution**
<br/>
It is necessary to check the distribution of the target variable to ensure that the classes are not unbalanced. We display the table of the numbers of examples in each class as well as a frequency diagram. <br/>
<br/>
When the target variable is unbalanced, two common strategies are used:

*   Data reduction: when the number of examples in each class is large enough, we reduce the size of the largest classes by removing examples from them.

*   Data augmentation : when the number of examples in a class is too small, we increase the size of the class by artificially creating new examples in this class (with images we apply transformations: filter, rotation,...)
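
As an illustration of data reduction, here is a minimal sketch of random undersampling with pandas. The toy dataframe, its column names and its values are hypothetical and are not taken from the Landsat dataset; this is one possible way to reduce the largest class, not the only one.

```python
import pandas as pnd

# Toy dataset (hypothetical values): class "a" is over-represented
toy = pnd.DataFrame({
    "feature": range(10),
    "label": ["a"] * 7 + ["b"] * 3,
})

# Size of the smallest class
n_min = toy["label"].value_counts().min()

# Data reduction: keep n_min randomly chosen examples in each class
balanced = toy.groupby("label").sample(n=n_min, random_state=42)

print(balanced["label"].value_counts())
```

After this step, every class contains as many examples as the smallest one.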


In [None]:
# Distribution
print("Distribution : \n", dataset['SoilType'].value_counts())
print("\n")

# Barplot
dataset.groupby('SoilType').size().plot.bar(title="Soil type distribution", ylabel='Nb of examples')


# or with a pie
# dataset.groupby('SoilType').size().plot.pie(title="Soil type distribution", ylabel='', autopct='%.2f')


# **Data preprocessing**
**Separation of the input variables and the target variable**

We use the function *drop* to remove the target variable and create a dataset with the input variables (x) only.
We create a vector with the target variable only (y).

In [None]:
x = dataset.drop(columns=['SoilType']) # Attribute columns
y = dataset['SoilType'] # Target column

print("Inputs: \n",x[0:3])
print("\n")
print("Target : \n",y[0:3])
print("\n")

n = x.shape[0] # number of examples
d = x.shape[1] #number of input variables
print("The number of examples is",n)
print("The number of input variables is",d)

**Training, validation and test subsets**
- The training data set is used to find the optimal parameters of the model (the weights for a neural network). Here, we use 70% of the original dataset.
- The validation data set is used to find the best hyperparameters of the model (number of layers, activation functions, optimization algorithm parameters,...). Here, we use 15% of the original dataset.
- The test data set is used to measure the performance of the model with examples that were never used to build it. Here, we use 15% of the original dataset.
 </br>
  </br>
The function
 </br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<i>train_test_split(x, y, train_size = fraction of the training set)</i></br>
from the sklearn library splits a dataset (x,y) into two parts. We use it a first time with the original data to create the training dataset and a second time with the rest of the dataset to obtain the validation and test datasets.
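
The arithmetic of the two successive splits can be sketched on toy data. The arrays below are hypothetical (200 examples, 4 input variables); the point is only that splitting the remaining 30% in half gives 15% + 15% of the original dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 200 examples, 4 input variables, 2 classes
x_toy = np.arange(800).reshape(200, 4)
y_toy = np.array([0, 1] * 100)

# First split: 70% for training, 30% left over
x_tr, x_rest, y_tr, y_rest = train_test_split(
    x_toy, y_toy, train_size=0.7, random_state=42)

# Second split: half of the remaining 30% each, i.e. 15% + 15% of the original
x_val, x_te, y_val, y_te = train_test_split(
    x_rest, y_rest, train_size=0.5, random_state=42)

print(len(x_tr), len(x_val), len(x_te))
```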


In [None]:
# First split : 70% for the training set and 30% for the test and validation sets
xTrain, xRest, yTrain, yRest = train_test_split(x, y, train_size = 0.7, random_state = 42)

print("Dimensions of the train dataset:",xTrain.shape)
print("Dimensions of the remaining dataset:",xRest.shape)
print("\n")

# Second split : 15% for the validation set and 15% for the test set
xVal, xTest, yVal, yTest = train_test_split(xRest, yRest, train_size = 0.5, random_state = 42)

print("Dimensions of the validation dataset:",xVal.shape)
print("Dimensions of the test dataset:",xTest.shape)
print("\n")


We can check the target distribution in the training set. The function <i>groupby</i> is applied here to a dataframe. The resulting y (yTrain, yVal and yTest) after splitting is a vector (a pandas Series) and not a dataframe, so it has no column to group by. We have to transform it into a dataframe.
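
As a side note, a possible alternative can be sketched on a toy Series (the values below are hypothetical): <i>value_counts</i> works directly on a Series, and <i>to_frame</i> turns a Series into a one-column dataframe.

```python
import pandas as pnd

# Toy target (hypothetical values), stored as a Series like yTrain after the split
y_toy = pnd.Series(["red soil", "grey soil", "red soil"], name="SoilType")

# value_counts works directly on a Series
counts = y_toy.value_counts()

# to_frame converts the Series into a one-column dataframe, keeping its name
y_frame = y_toy.to_frame()

print(counts)
print(y_frame.columns.tolist())
```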

In [None]:
print("y is a vector:\n",yTrain[0:5]) # yTrain is a vector and not a dataframe
print("\n")

yTrain= pnd.DataFrame(yTrain, columns=["SoilType"]) # Transformation into a dataframe

print("y is a dataframe:\n",yTrain[0:5]) # yTrain is now a dataframe

# Verification of the target distribution
yTrain.groupby('SoilType').size().plot.bar(title="Soil type Train distribution", ylabel='Nb of examples')


yVal= pnd.DataFrame(yVal, columns=["SoilType"])
yTest= pnd.DataFrame(yTest, columns=["SoilType"])

**Encoding the target variable**
 </br>
To encode the categorical variables into binary variables, we use the function </br>
</br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<i>pnd.get_dummies(X)</i></br>
</br>
where X is a dataset with only categorical variables.</br>
</br>
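
One point to keep in mind, sketched below on hypothetical toy data: when <i>get_dummies</i> is applied separately to each subset, a class that is absent from one subset produces no column there. Assuming the training set contains all the classes, <i>reindex</i> can give the other subsets the same columns.

```python
import pandas as pnd

# Toy targets (hypothetical values): the validation set lacks the class "cotton crop"
y_train_toy = pnd.DataFrame({"SoilType": ["red soil", "cotton crop", "grey soil"]})
y_val_toy = pnd.DataFrame({"SoilType": ["grey soil", "red soil"]})

y_train_enc = pnd.get_dummies(y_train_toy)
y_val_enc = pnd.get_dummies(y_val_toy)

# Give the validation set the same columns as the training set,
# filling the missing class with 0
y_val_enc = y_val_enc.reindex(columns=y_train_enc.columns, fill_value=0)

print(y_val_enc.columns.tolist())
```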

In [None]:
# Encode the target variable in the training set
print("Target variable before binarization \n",yTrain.head(3)) # Before binarization
print("\n")
yTrain=pnd.get_dummies(yTrain)
print("Target variable after binarization \n",yTrain.head(3)) # After binarization
print("\n")

# Number of categories
p = yTrain.shape[1] #number of classes
print("The number of categories is",p)


yVal=pnd.get_dummies(yVal)
yTest=pnd.get_dummies(yTest)

Target variable after binarization 
       SoilType_cotton crop  SoilType_damp grey soil  SoilType_grey soil  \
1289                     0                        0                   1   
1116                     0                        1                   0   
583                      1                        0                   0   

      SoilType_red soil  SoilType_vegetation stubble  \
1289                  0                            0   
1116                  0                            0   
583                   0                            0   

      SoilType_very damp grey soil  
1289                             0  
1116                             0  
583                              0  

<br/>
Now that the datasets are ready, we will normalize the training set. To each variable X, we apply the transformation (X-mu)/sigma where mu is the mean of X and sigma its standard deviation. The result is a variable such that mu=0 and sigma=1.
Then, we apply the same transformation to the test and validation datasets (with mu and sigma from the training set).
<br/>
<br/>
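
The transformation (X-mu)/sigma can be sketched by hand and compared with <i>StandardScaler</i>. The toy arrays are hypothetical; the sketch assumes the population standard deviation (ddof=0), which is what <i>StandardScaler</i> uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): one column per variable
train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test_toy = np.array([[2.0, 30.0]])

# Mean and standard deviation are computed on the training set only
mu = train_toy.mean(axis=0)
sigma = train_toy.std(axis=0)  # population standard deviation (ddof=0)

train_manual = (train_toy - mu) / sigma
test_manual = (test_toy - mu) / sigma  # same mu and sigma as for the training set

# Same computation with StandardScaler
toy_scaler = StandardScaler().fit(train_toy)
print(np.allclose(train_manual, toy_scaler.transform(train_toy)))
print(np.allclose(test_manual, toy_scaler.transform(test_toy)))
```

Only the training set is guaranteed to have mean 0 and variance 1 after the transformation; the other subsets are merely close to these values.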

In [None]:
# Normalization
scaler = StandardScaler()


# Training set: the scaler learns the mean and variance of each variable
scaler.fit(xTrain)
print("Training set")
print("Means before normalization: ", scaler.mean_)
print("Variance before normalization: ", scaler.var_)
print("\n")


# Normalization of the training set
xTrain_scaled = scaler.transform(xTrain)
print("Means after normalization: ",xTrain_scaled.mean(axis=0))
print("Variance after normalization: ",xTrain_scaled.var(axis=0))
print("\n")
xTrain_scaled = pnd.DataFrame(xTrain_scaled, index=xTrain.index, columns=xTrain.columns) # transform into a dataframe with the row and column names of xTrain
xTrain_scaled.head()


In [None]:
# Transformation of the validation set
xVal_scaled = scaler.transform(xVal)
print("Validation set")
print("Means after normalization: ",xVal_scaled.mean(axis = 0))
print("Variance after normalization: ",xVal_scaled.var(axis = 0))
print("\n")
xVal_scaled = pnd.DataFrame(xVal_scaled, index=xVal.index, columns=xTrain.columns) # transform into a dataframe with the row names of xVal and the column names of xTrain



In [None]:


# Transformation of the test set
xTest_scaled = scaler.transform(xTest)
print("Test set")
print("Means after normalization: ",xTest_scaled.mean(axis = 0))
print("Variance after normalization: ",xTest_scaled.var(axis = 0))
print("\n")
xTest_scaled = pnd.DataFrame(xTest_scaled, index=xTest.index, columns=xTrain.columns) # transform into a dataframe with the row names of xTest and the column names of xTrain



 <center>
<h1>EXERCISE</h1>
</center>

The dataset <i>Drug_Consumption.csv</i> contains records for respondents. For each respondent, we know: <br/>
<br/>
<ul>
<li>Personality measurements, which include NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), BIS-11 (impulsivity), and ImpSS (sensation seeking). </li>
<li>Personal information, which includes level of education, age, gender, country of residence and ethnicity.</li>
<li>Cocaine consumption.</li>
</ul>
<br/>

<ol>
<li>How many respondents have been recorded? Display the first 3 rows of the dataset.</li>
<li>How many variables are in the dataset? What is their type?</li>
<li>What is the target variable? What is its type? Display its distribution.</li>
<li>Split the dataset into training, validation and test datasets (80%-10%-10%).</li>
<li>Encode the target variable.</li>
<li>Extract the numeric variables and normalize them.</li>
<li>Extract the categorical variables and encode them.</li>
<li>Build a new x dataset with the encoded input variables and the standardized variables.</li>
</ol>

In [None]:
## QUESTION 1
##
## Your code here
##
## Warning: the delimiter in the data file is ";"

In [None]:
## QUESTION 2
##
## Your code here
##

In [None]:
## QUESTION 3
##
## Your code here
##

In [None]:
## QUESTION 4
##
## Your code here
##

In [None]:
## QUESTION 5
##
## Your code here
##

In [None]:
## QUESTION 6
##
## Complete the code
##

# List of numerical variables
numeric_list = 
# Standardization
numeric_xTrain= # training set with the numerical variables
scaler.fit(numeric_xTrain) # the scaler learns the means and standard deviations on the training set
numeric_xTrain_scaled = scaler.transform(numeric_xTrain)
print("Training set")
print("Means after normalization: ",numeric_xTrain_scaled.mean(axis = 0))
print("Variance after normalization: ",numeric_xTrain_scaled.var(axis = 0))
print("\n")

# Transformation into a dataframe
numeric_xTrain_scaled=pnd.DataFrame(numeric_xTrain_scaled, index=xTrain.index,columns=numeric_list)

In [None]:
## QUESTION 6 (continued)
##
## Your code here
##

To encode the categorical variables into binary variables, we use the function </br>
</br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<i>pnd.get_dummies(X)</i></br>
</br>
where X is a dataset with only categorical variables.</br>
</br>
We proceed in 3 steps: </br>
<ol>
<li>We extract the categorical variables from the dataset.</li>
<li>We encode the categorical variables.</li>
<li>We rebuild the dataset by concatenating the encoded variables, the numerical variables and the target variable. </li>
</ol>

In [None]:
## QUESTION 7
##
## Complete the code
##
# List of the categorical variables
cat_list= # List of categorical variables
print(cat_list)
print("\n")

# Encode the categorical variables in the training set
xTrain_cat=xTrain[cat_list[:-1]] # xTrain set with the categorical variables except the target Cocaine
print(xTrain_cat.head(3)) # Display the categorical variables before binarization
print("\n")
xTrain_cat_encoded=pnd.get_dummies(xTrain_cat)
print(xTrain_cat_encoded.head(3)) # Display the categorical variables after binarization

# Encode the categorical variables in the test set


To concatenate dataframes, data1, data2, ..., we use the function
</br>
</br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<i>pnd.concat([data1,data2,...],axis=1)</i></br>
</br>
where <i>axis</i> is the axis to concatenate along (axis=1 for columns and axis=0 for rows)
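
With axis=1, <i>concat</i> matches the rows on their index labels, which is why the row names of the dataframes must agree. A minimal sketch on hypothetical toy dataframes (the column names and values are invented for the illustration):

```python
import pandas as pnd

# Toy dataframes (hypothetical values) sharing the same row labels
num_part = pnd.DataFrame({"age": [0.5, -1.2]}, index=[7, 3])
cat_part = pnd.DataFrame({"gender_F": [1, 0], "gender_M": [0, 1]}, index=[7, 3])

# axis=1 places the columns side by side; rows are matched on the index labels
both = pnd.concat([num_part, cat_part], axis=1)
print(both.shape)

# If one dataframe has a default index (0, 1, ...), the labels no longer match
# and concat creates extra rows filled with NaN
misaligned = pnd.concat([num_part.reset_index(drop=True), cat_part], axis=1)
print(misaligned.shape)
```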


In [None]:
## QUESTION 8
##
## Complete the code
##

# Concatenate the numerical variables and the encoded variables for the training set
xTrain_ready=pnd.concat([numeric_xTrain_scaled,xTrain_cat_encoded],axis=1) # Concatenates tables by columns (axis=1)
print("The number of variables in the new dataset is",xTrain_ready.shape[1])

# Concatenate the numerical variables and the encoded variables for the test set


xTrain_ready.head(3)