# Laboratory practice 2_1: Classification I

For this practice, you will need the following dataset:

- **Simdata.dat**: a synthetic dataset containing several input variables and one output variable **Y**.

The main package for machine learning in Python is **scikit-learn**.

Further reading:
- [scikit-learn](https://scikit-learn.org) (Machine Learning libraries)

In addition, we will be using the following libraries:
- Data management
    - [numpy](https://numpy.org/) (linear algebra)
    - [pandas](https://pandas.pydata.org/) (data processing, CSV file)

- Plotting
    - [seaborn](https://seaborn.pydata.org/)
    - [matplotlib](https://matplotlib.org/)


In [2]:
# Import necessary modules
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Interactive plotting
%matplotlib inline
%config InlineBackend.figure_format = 'svg' # 'png', 'retina', 'jpeg', 'svg', 'pdf'

### 1. Our data

#### Load data
The first step is always to make the data available. Keep in mind:
- Use `read_csv` from pandas
- Write the correct path to the file
- Use the correct separator `sep`

In this case, load the file with `sep = "\t"`
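A minimal sketch of the loading step, assuming a tab-separated file with a header row. Since the real path to `Simdata.dat` depends on your setup, the example reads a small in-memory stand-in (`sample_text`, with made-up values and hypothetical `YES`/`NO` labels) instead of the actual file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of a tab-separated file.
# The values and the YES/NO labels are made up for illustration only.
sample_text = "X1\tX2\tY\n0.5\t1.2\tYES\n-0.3\t0.8\tNO\n1.1\t-0.4\tYES\n"

# For the real file, pass its path instead of the StringIO object,
# for example: pd.read_csv("Simdata.dat", sep="\t")
df = pd.read_csv(io.StringIO(sample_text), sep="\t")

print(df.head())
```

If the separator is wrong, pandas usually reads every row into a single column, which is a quick way to spot the problem.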

#### Explore file

* What are the input variables?

* What is the type of each variable?

* What is the target?

* What is the shape of the dataframe?
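The questions above can be explored with a few standard pandas calls. This is only a sketch on a toy dataframe (made-up values, hypothetical `YES`/`NO` labels); apply the same calls to the dataframe you loaded.

```python
import pandas as pd

# Toy frame standing in for the loaded data (values are made up)
df = pd.DataFrame(
    {
        "X1": [0.5, -0.3, 1.1],
        "X2": [1.2, 0.8, -0.4],
        "Y": ["YES", "NO", "YES"],
    }
)

print(df.head())   # first rows: which columns are inputs, which is the target
print(df.dtypes)   # type of each column
print(df.shape)    # (number of rows, number of columns)
df.info()          # compact summary: dtypes and non-null counts
```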

### 2. Preprocessing and data cleansing

As we've seen in theory, 

    Data cleansing is one of the most important steps when dealing with a ML problem, but it can be less appealing than other tasks.

#### Deal with missing values (NA)

* Show how many missing values each column has:

* Which option do you think is best for treating the missing values in this case?

    - Removal: [dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)
    - Substitution: [fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
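As a sketch of both options, the example below counts missing values and then applies removal and substitution to a toy dataframe (made-up values). Filling with the column median is just one possible substitution strategy, chosen here for illustration; which option suits the real data is for you to decide.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (made-up data)
df = pd.DataFrame(
    {
        "X1": [0.5, np.nan, 1.1, 0.2],
        "X2": [1.2, 0.8, np.nan, 0.1],
        "Y": ["YES", "NO", "YES", "NO"],
    }
)

# Number of missing values per column
na_counts = df.isna().sum()
print(na_counts)

# Option 1, removal: drop every row that contains at least one NA
df_dropped = df.dropna()

# Option 2, substitution: fill NAs in numeric columns with the column median
df_filled = df.fillna(df.median(numeric_only=True))

print(df_dropped.shape, df_filled.shape)
```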

Apply your criterion:

In [2]:
# Check results

#### Plot the data and look for outliers

As we've seen in theory, outliers are:

    Samples that are exceptionally far from the mainstream of the data
    
Note: this applies to numeric columns.

Two basic ways to detect them:

* Graphically: plot a histogram or a boxplot of every variable in the dataframe.


In [3]:
# histogram:


In [4]:
# boxplot
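One possible way to produce both plots is with the plotting methods built into pandas. The sketch below uses randomly generated data as a stand-in for the real dataframe; the bin count and figure sizes are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random toy data standing in for the numeric columns of the real dataframe
rng = np.random.default_rng(0)
df = pd.DataFrame({"X1": rng.normal(size=100), "X2": rng.normal(size=100)})

# One histogram per numeric column
axes_hist = df.hist(bins=20, figsize=(8, 3))

# One box per numeric column, drawn on a single set of axes
fig, ax = plt.subplots(figsize=(5, 3))
df.boxplot(ax=ax)
```

In a boxplot, points drawn beyond the whiskers are the candidates to inspect as outliers.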


* Analytically, using statistics:

The [IQR (interquartile range)](https://www.geeksforgeeks.org/interquartile-range-iqr/) approach is a widely used rule of thumb. With IQR = Q3 - Q1, a value is flagged as an outlier when it lies more than 1.5 × IQR below the first quartile (Q1) or more than 1.5 × IQR above the third quartile (Q3).


In [33]:
## Example for X1 ##

# IQR (interquartile range) rule: flag values that lie more than
# 1.5 * IQR below Q1 or above Q3.

# Percentile 25
Q1 = np.percentile(df['X1'], 25,
                   method = 'midpoint')

# Percentile 75
Q3 = np.percentile(df['X1'], 75,
                   method = 'midpoint')

# IQR
IQR = Q3 - Q1

In [34]:
# upper bound:
upper = df['X1'] >= (Q3+1.5*IQR)
df[upper]

Unnamed: 0,X1,X2,Y


In [37]:
# lower bound:
lower = df['X1'] <= (Q1-1.5*IQR)
df[lower]

Unnamed: 0,X1,X2,Y


* Find outliers using IQR method for *X2*
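To avoid repeating the same cells for every column, the IQR rule can be wrapped in a small helper. The function name `iqr_outlier_mask` and its parameter `k` are hypothetical, introduced only for this sketch; it keeps the `midpoint` percentile method and the `<=` / `>=` comparisons used in the cells above, and is shown on made-up data.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values at or beyond k * IQR from the quartiles."""
    q1 = np.percentile(series, 25, method="midpoint")
    q3 = np.percentile(series, 75, method="midpoint")
    iqr = q3 - q1
    lower = series <= (q1 - k * iqr)
    upper = series >= (q3 + k * iqr)
    return lower | upper


# Made-up column with one obviously extreme value
df = pd.DataFrame({"X2": [1.0, 1.2, 0.9, 1.1, 1.0, 8.0]})

mask = iqr_outlier_mask(df["X2"])
print(df[mask])
```

Note that `np.percentile` returns `nan` if the column still contains missing values, so handle NAs first.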

#### Encode categorical variables

* In our case, there are no categorical input variables.

* Convert the target variable to a categorical type (the pandas `category` dtype, equivalent to a factor in R)
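A minimal sketch of the conversion using `astype("category")`, on a toy dataframe whose `YES`/`NO` labels are hypothetical:

```python
import pandas as pd

# Toy frame; the labels are made up for illustration
df = pd.DataFrame({"X1": [0.5, -0.3, 1.1], "Y": ["YES", "NO", "YES"]})

# Convert the target column to the pandas categorical dtype
df["Y"] = df["Y"].astype("category")

print(df["Y"].dtype)            # category
print(df["Y"].cat.categories)   # the distinct classes
```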

#### Analyse continuous variables

We should standardize the variables, but we won't do it in this practice: in future practices we'll use a pipeline that handles all the steps before modeling.

Relationship between variables. Our task is to find out whether there is any relationship between the predictors.
* Plot a scatterplot (seaborn library) where:
    * x-axis = 'X1'
    * y-axis = 'X2'
    * hue = 'Y' (color of points according to Y categories)
    * data = df
    * title = 'Whole set'


* Plot a pairplot (seaborn) where:
    - data = df
    - hue = 'Y'

* EXTRA: According to these graphs, could a new nonlinear variable be generated?


One way to check for collinearity is through the correlation matrix.

* Plot correlation matrix using `df.corr(method = 'pearson')`
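As an illustration, the sketch below computes the Pearson correlation matrix of two deliberately correlated toy columns and draws it as a seaborn heatmap. The heatmap options (`annot`, `vmin`, `vmax`, `cmap`) are presentation choices assumed for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data: X2 is built from X1 plus noise, so the two are correlated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({"X1": x1, "X2": 0.8 * x1 + rng.normal(scale=0.5, size=200)})

# Pearson correlation between every pair of numeric columns
corr = df.corr(method="pearson")
print(corr)

# Heatmap with the coefficient written in each cell
fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
```

If the dataframe still contains non-numeric columns, recent pandas versions need `numeric_only=True` in `df.corr`, or those columns must be dropped first.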

Draw your conclusions from the results.

#### Deal with class imbalances

To check whether there is class imbalance, look at the proportion of each category of the target variable:
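A short sketch of this check with `value_counts`, on a made-up target with a 70/30 split between two hypothetical classes:

```python
import pandas as pd

# Made-up target: 7 samples of one class, 3 of the other
y = pd.Series(["YES"] * 7 + ["NO"] * 3, name="Y")

counts = y.value_counts()                      # absolute frequency per class
proportions = y.value_counts(normalize=True)   # relative frequency per class

print(counts)
print(proportions)
```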

### 3. Split data into train/test

#### Define feature and target matrix

In [49]:
features = list(df.columns)
target = 'Y'
features.remove(target)

In [50]:
X = df[features]
y = df[target]

#### Split
Using [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), create `X_train`, `X_test`, `y_train` and `y_test` where:
- test_size = 0.2 (proportion of the data held out for testing)
- random_state = 0 (seed for reproducibility)
- stratify = y (the target values, not the column name, so that both splits preserve the class distribution of y)
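The split with these parameters can be sketched as follows, using random toy features and a made-up 70/30 target so that the effect of `stratify` is easy to verify:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 70 of one class and 30 of the other (made up)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "X1": rng.normal(size=100),
        "X2": rng.normal(size=100),
        "Y": ["YES"] * 70 + ["NO"] * 30,
    }
)
X = df[["X1", "X2"]]
y = df["Y"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=0,    # fixed seed so the split is reproducible
    stratify=y,        # keep the class proportions of y in both splits
)

# Both splits should show roughly the same class proportions
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```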

Check all the generated data:

* Save the data to CSV files named as follows:
- X_train: `X_train.csv`
- y_train: `y_train.csv`
- X_test: `X_test.csv`
- y_test: `y_test.csv`

with `sep=','` and `index = False`
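A sketch of the saving step on toy data. The temporary output directory `out_dir` is an assumption made so the example does not write into your working folder; in the practice you would normally write the four files next to the notebook.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for two of the four objects produced by the split
X_train = pd.DataFrame({"X1": [0.5, -0.3], "X2": [1.2, 0.8]})
y_train = pd.Series(["YES", "NO"], name="Y")

# Hypothetical output location: a fresh temporary directory
out_dir = tempfile.mkdtemp()

X_path = os.path.join(out_dir, "X_train.csv")
y_path = os.path.join(out_dir, "y_train.csv")

# index=False leaves the row index out of the file
X_train.to_csv(X_path, sep=",", index=False)
y_train.to_csv(y_path, sep=",", index=False)

# Read the files back to confirm they round-trip
X_check = pd.read_csv(X_path, sep=",")
y_check = pd.read_csv(y_path, sep=",")
print(X_check.shape, y_check.shape)
```

Saving without the index avoids an extra `Unnamed: 0` column appearing when the file is read back.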

These files will be useful for the next practice.