<a href="https://colab.research.google.com/github/kozen88/Data_prepro/blob/main/data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-processing a breast cancer detection dataset


The project consists of several distinct pre-processing pipelines. Each pipeline must contain an appropriate combination of `Pipeline` and `ColumnTransformer` objects. The result must be a single final object: the call to its `fit_transform` method must appear at the end of the notebook and return the dataset transformed according to the required pre-processing.<br>
<br>


**Each pipeline must be run on the columns of the dataset, excluding the “target” column.**<br>
<br>

------
------

<br>

## PIPELINE 1:<br>
Pre-process only the records with target = 1, using the following pipeline:

1. Missing-value cleaning
2. Symmetrization of the asymmetric variables only
3. One-hot encoding of the categorical variables
4. Rescaling by standardization

When building this pipeline, take the asymmetry of the variables into account. Missing-value cleaning must be handled separately for symmetric and asymmetric variables, using the most appropriate fill value depending on whether asymmetry is present.<br>
<br>
### Explicit requirements:
- **You are free to choose the most appropriate form of cleaning for the categorical variables.**

- **To measure the asymmetry of a variable, a graphical analysis is recommended, optionally supported by a statistical analysis of the skewness.**<br>
<br>
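The steps of Pipeline 1 can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's final solution: the column names (`sym`, `skewed`, `cat`), the 0.5 skewness threshold, and the choice of mean imputation for symmetric variables versus median imputation for asymmetric ones are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler

# Hypothetical toy data standing in for the real dataset.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "sym": rng.normal(10.0, 2.0, n),        # roughly symmetric
    "skewed": rng.lognormal(0.0, 1.0, n),   # right-skewed
    "cat": rng.choice(["A", "B", "C"], n),  # categorical
    "target": rng.integers(0, 2, n),
})
toy.loc[::10, "sym"] = np.nan
toy.loc[::7, "skewed"] = np.nan
toy.loc[::15, "cat"] = np.nan

# Keep only the records with target = 1 and drop the target column.
X1 = toy.loc[toy["target"] == 1].drop(columns="target")

# Split the numeric columns by sample skewness (threshold is an assumption).
num_cols = X1.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in X1.columns if c not in num_cols]
skewness = X1[num_cols].skew()
skewed_cols = skewness[skewness.abs() > 0.5].index.tolist()
symmetric_cols = [c for c in num_cols if c not in skewed_cols]

# Symmetric variables: the mean is a reasonable fill value.
symmetric_branch = SimpleImputer(strategy="mean")
# Asymmetric variables: the median is robust to the long tail.
skewed_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("symmetrize", PowerTransformer(method="yeo-johnson", standardize=False)),
])
categorical_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

pipeline_1 = Pipeline([
    ("columns", ColumnTransformer([
        ("symmetric", symmetric_branch, symmetric_cols),
        ("skewed", skewed_branch, skewed_cols),
        ("categorical", categorical_branch, cat_cols),
    ], sparse_threshold=0)),  # always return a dense array
    ("scale", StandardScaler()),
])

X1_t = pipeline_1.fit_transform(X1)
```

Here the standardization is applied to every output column, including the one-hot indicators, which follows the listed order literally. Scaling only the numeric branches would be an equally defensible reading.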

------

<br>

## PIPELINE 2:<br>
Apply the following pipeline to all records in the dataset:

1. Missing-value cleaning with procedures of your choice
2. Discretization of all numeric variables into 20 bins
3. Ordinal encoding of the categorical variable, with values in ascending order: A, B, C
4. Selection of the 5 most informative variables with respect to the given target, using the most appropriate metric given the transformations applied so far.<br>
<br>
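A possible shape for Pipeline 2 is sketched below on synthetic data. The column names, the imputation strategies, and the uniform binning strategy are assumptions for illustration. Because every feature is a non-negative discrete code after binning and ordinal encoding, `chi2` is used as the scoring function here. `mutual_info_classif` would be another reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OrdinalEncoder

# Hypothetical toy data: six numeric columns and one categorical column.
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, n)
toy = pd.DataFrame(
    {f"num_{i}": rng.normal(size=n) + 0.5 * i * y for i in range(6)}
)
toy["cat"] = rng.choice(["A", "B", "C"], n)
toy.loc[::9, "num_0"] = np.nan
toy.loc[::11, "cat"] = np.nan

num_cols = [c for c in toy.columns if c.startswith("num_")]
cat_cols = ["cat"]

numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    # 20 equal-width bins, encoded as integer codes 0..19.
    ("bin", KBinsDiscretizer(n_bins=20, encode="ordinal", strategy="uniform")),
])
categorical_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # Explicit category order: A < B < C.
    ("ordinal", OrdinalEncoder(categories=[["A", "B", "C"]])),
])

pipeline_2 = Pipeline([
    ("columns", ColumnTransformer([
        ("numeric", numeric_branch, num_cols),
        ("categorical", categorical_branch, cat_cols),
    ])),
    ("select", SelectKBest(score_func=chi2, k=5)),
])

# Feature selection is supervised, so the target is passed to fit_transform.
X2_t = pipeline_2.fit_transform(toy, y)
```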

------

<br>

## PIPELINE 3:<br>
Apply the following pipeline to all records in the dataset:

1. Missing-value cleaning with a method of your choice
2. Principal Component Analysis at 80% explained variance
3. Symmetrization
4. Rescaling by normalization between 0 and 1.<br>
<br>

Finally, export the final pipeline.
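The following sketch shows one way Pipeline 3 and the export step could look. It runs on synthetic, purely numeric data, so any categorical column in the real dataset would need to be encoded or excluded before PCA. The median imputation, the Yeo-Johnson transform, the use of `joblib`, and the file name `pipeline_3.joblib` are assumptions for the example.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

# Hypothetical toy data: six correlated numeric columns with some gaps.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
X3 = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))
X3[::8, 0] = np.nan

pipeline_3 = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    # A float in (0, 1) keeps as many components as needed to reach
    # that fraction of explained variance.
    ("pca", PCA(n_components=0.80, svd_solver="full")),
    ("symmetrize", PowerTransformer(method="yeo-johnson")),
    ("scale", MinMaxScaler(feature_range=(0, 1))),
])
X3_t = pipeline_3.fit_transform(X3)

# Export the fitted pipeline and load it back (temporary directory here).
export_path = Path(tempfile.mkdtemp()) / "pipeline_3.joblib"
joblib.dump(pipeline_3, export_path)
restored = joblib.load(export_path)
```

The restored object can be used exactly like the original one, for example `restored.transform(X3)`.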

In [1]:
# Install a library for finding the best PCA reduction (knee/elbow detection)
!pip install kneed

Collecting kneed
  Downloading kneed-0.8.5-py3-none-any.whl (10 kB)
Installing collected packages: kneed
Successfully installed kneed-0.8.5


In [2]:
# Import everything we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Objects for performing scaling
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
# Objects for handling missing values
from sklearn.impute import SimpleImputer, KNNImputer
# Encoder types
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
# Objects for binning and binarizing
from sklearn.preprocessing import KBinsDiscretizer, Binarizer
# Objects to correct asymmetry and to apply a general function to the dataset
from sklearn.preprocessing import PowerTransformer, FunctionTransformer

# We retrieve the object that allows us to perform PCA
# and the elbow/knee method
from sklearn.decomposition import PCA
from kneed import KneeLocator

# Building blocks for the pipelines
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector

# Objects and metrics for feature selection
from sklearn.feature_selection import SelectKBest, f_regression, r_regression, mutual_info_regression
from sklearn.feature_selection import f_classif, mutual_info_classif, chi2

## Data exploration
We load the data and review its main characteristics, check for missing values and other issues, and then analyze the symmetry of the variables to separate those that are symmetrically distributed from those that are asymmetrically distributed.
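As a rough sketch of the checks described above, the snippet below counts missing values per column and flags asymmetric variables by their sample skewness. It uses a small synthetic frame so that it is self-contained. The column names and the 0.5 threshold are assumptions for illustration, and the numeric check is meant to complement a visual inspection of the histograms, not replace it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one symmetric and one right-skewed variable.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "roughly_symmetric": rng.normal(0.0, 1.0, 500),
    "right_skewed": rng.exponential(1.0, 500),
})
toy.loc[::5, "right_skewed"] = np.nan  # inject missing values

# Number of missing values per column.
missing = toy.isna().sum()

# Sample skewness per column (NaN values are skipped by default).
skewness = toy.skew(numeric_only=True)

# Flag a variable as asymmetric when |skewness| exceeds the threshold.
threshold = 0.5  # assumed cut-off, adjust after looking at the plots
asymmetric = skewness[skewness.abs() > threshold].index.tolist()

print(missing)
print(skewness)
print(asymmetric)
```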

In [3]:
df = pd.read_csv("sample_dataset.csv")
df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,,10.38,122.80,1001.0,0.11840,0.27760,0.3001,0.14710,0.2419,0.07871,...,17.33,,2019.0,0.16220,0.6656,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,,,0.0869,0.07017,,0.05667,...,23.41,158.80,1956.0,0.12380,0.1866,0.2416,0.1860,0.2750,,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,,,,0.05999,...,25.53,,1709.0,0.14440,0.4245,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.2414,,0.2597,0.09744,...,26.50,,567.7,0.20980,0.8663,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,,,,0.13280,0.1980,,0.1809,,...,16.67,152.20,1575.0,0.13740,,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,,0.2439,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.2113,0.4107,0.2216,0.2060,0.07115,0
565,,28.25,131.20,1261.0,0.09780,0.10340,0.1440,0.09791,0.1752,,...,38.25,155.00,1731.0,0.11660,0.1922,0.3215,0.1628,0.2572,,0
566,16.60,28.08,108.30,,0.08455,0.10230,,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.3514,0.15200,0.2397,,...,,184.60,1821.0,0.16500,0.8681,0.9387,0.2650,0.4087,0.12400,0


In [6]:
# Let's see a brief statistical summary for each variable.
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
mean radius,482.0,14.059548,3.501791,6.981,11.6725,13.28,15.745,28.11
mean texture,492.0,19.311829,4.347769,9.71,16.17,18.86,21.8025,39.28
mean perimeter,513.0,92.039025,24.028669,43.79,75.27,86.34,104.7,188.5
mean area,403.0,661.522581,356.669534,143.5,428.1,556.7,796.0,2501.0
mean smoothness,384.0,0.097156,0.014502,0.05263,0.086688,0.096565,0.106825,0.1634
mean compactness,480.0,0.104531,0.053335,0.01938,0.064815,0.093125,0.130325,0.3454
mean concavity,439.0,0.094063,0.083301,0.0,0.03041,0.06824,0.1351,0.4268
mean concave points,382.0,0.049115,0.038449,0.0,0.020682,0.03377,0.074122,0.2012
mean symmetry,471.0,0.181405,0.027633,0.106,0.16195,0.1791,0.1966,0.2906
mean fractal dimension,504.0,0.062626,0.007102,0.04996,0.05753,0.0613,0.066003,0.09744
