<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/marco-canas/intro-Machine-Learning/blob/main/classes/class_20/class_20.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table> 

# Class 20: Continuing to build the linear regression Machine Learning model

1. Frame the question properly.

   * Regression or classification?
   * Which type of regression or classification?

2. Initial exploration.
   * State the source of the data.
   * Make the target function explicit.
   * List the attributes, with a brief description of each.
   * Produce a first tabular summary and a graphical exploration of the data.

3. Prepare the data for the learning algorithms.
   
   * Make an initial split of the data into training and test sets.
   * Explore linear correlations with the target variable.
   * Add attributes that are better correlated with the target variable.
   * Fill in missing values.
   * Encode the categorical variables.
   * Standardize the data.

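In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Load the housing dataset (expects vivienda.csv in the working directory)
v = pd.read_csv('vivienda.csv')

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
v_train, v_test = train_test_split(v, test_size=0.2, random_state=513)

# Work on a copy of the training set so later edits do not touch v_train
v = v_train.copy()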

# Experimenting with Attribute Combinations

Hopefully the previous sections gave you an idea of a few ways you can explore the data and gain insights. 

You identified a few data quirks that you may want to clean up before feeding the data to a Machine Learning algorithm, and you found interesting correlations between attributes, in particular with the target attribute. 

You also noticed that some attributes have a tail-heavy distribution, so you may want to transform them (e.g., by computing their logarithm). 

Of course, your mileage will vary considerably with each project, but the general ideas are similar.
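
As a rough illustration of the logarithm idea mentioned above, the sketch below applies `np.log1p` to a small synthetic column. The values and the column names `poblacion_demo` and `log_poblacion_demo` are hypothetical and are not taken from `vivienda.csv`; the point is only to show how a long right tail gets compressed.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical values standing in for a tail-heavy attribute
demo = pd.DataFrame({"poblacion_demo": [3.0, 120.0, 850.0, 1200.0, 2300.0, 35000.0]})

# log1p(x) = log(1 + x), which stays defined when x is 0
demo["log_poblacion_demo"] = np.log1p(demo["poblacion_demo"])

# Compare the sample skewness before and after the transformation
skew_before = demo["poblacion_demo"].skew()
skew_after = demo["log_poblacion_demo"].skew()

print(demo)
print("skew before:", skew_before)
print("skew after: ", skew_after)
```

On this toy column the skewness moves much closer to zero after the transformation. Whether the same holds for a real attribute should be checked with a histogram.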

## Handling Missing Data

We saw earlier that the `dormitorios` attribute has some missing values, so let’s fix this.

You have three options:

1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the missing values to some value (zero, the mean, the median, etc.).

You can accomplish these easily using the DataFrame methods `dropna()`, `drop()`, and `fillna()`:

In [5]:
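# Option 1: drop the districts (rows) where dormitorios is missing.
# This returns a new DataFrame and leaves v unchanged.
v.dropna(subset=['dormitorios'])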

|       | longitud | latitud | antiguedad | habitaciones | dormitorios | población | hogares | ingresos | proximidad | precio   |
|-------|----------|---------|------------|--------------|-------------|-----------|---------|----------|------------|----------|
| 18223 | -122.08  | 37.41   | 20.0       | 1896.0       | 456.0       | 1069.0    | 436.0   | 4.6875   | NEAR BAY   | 288900.0 |
| 6687  | -118.07  | 34.14   | 42.0       | 3200.0       | 685.0       | 1668.0    | 628.0   | 3.3750   | INLAND     | 260400.0 |
| 3854  | -118.43  | 34.18   | 25.0       | 3830.0       | 1105.0      | 2328.0    | 1017.0  | 2.6238   | <1H OCEAN  | 210000.0 |
| 11267 | -117.97  | 33.80   | 35.0       | 2985.0       | 474.0       | 1614.0    | 453.0   | 5.4631   | <1H OCEAN  | 225600.0 |
| 14498 | -117.23  | 32.86   | 16.0       | 1675.0       | 354.0       | 604.0     | 332.0   | 5.2326   | NEAR OCEAN | 188300.0 |
| ...   | ...      | ...     | ...        | ...          | ...         | ...       | ...     | ...      | ...        | ...      |
| 12680 | -121.39  | 38.55   | 25.0       | 2171.0       | 431.0       | 1053.0    | 422.0   | 3.5278   | INLAND     | 126200.0 |
| 17219 | -119.70  | 34.47   | 32.0       | 3725.0       | 569.0       | 1304.0    | 527.0   | 7.7261   | <1H OCEAN  | 500001.0 |
| 18525 | -122.04  | 36.97   | 45.0       | 1302.0       | 245.0       | 621.0     | 258.0   | 5.1806   | NEAR OCEAN | 266400.0 |
| 13822 | -117.21  | 34.49   | 14.0       | 2125.0       | 348.0       | 1067.0    | 360.0   | 3.6333   | INLAND     | 116200.0 |


In [6]:
v.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 18223 to 17044
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   longitud      16512 non-null  float64
 1   latitud       16512 non-null  float64
 2   antiguedad    16512 non-null  float64
 3   habitaciones  16512 non-null  float64
 4   dormitorios   16348 non-null  float64
 5   población     16512 non-null  float64
 6   hogares       16512 non-null  float64
 7   ingresos      16512 non-null  float64
 8   proximidad    16512 non-null  object 
 9   precio        16512 non-null  float64
dtypes: float64(9), object(1)
memory usage: 1.4+ MB


In [8]:
# Option 2: drop the whole attribute (column). This also returns a new DataFrame.
v.drop('dormitorios', axis=1)

In [11]:
median = v.dormitorios.median()

In [12]:
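# Option 3: fill the gaps with the median and assign the result back.
# Calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions.
v['dormitorios'] = v['dormitorios'].fillna(median)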

Scikit-Learn provides the `SimpleImputer` class to take care of missing values. It learns the median of each numerical attribute with `fit()` and fills the gaps with `transform()`, so the same medians can later be reused on the test set:

In [7]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

In [8]:
# The median is only defined for numerical attributes, so set the text attribute aside
v_num = v.drop('proximidad', axis=1)


In [10]:
imputer.fit(v_num)


SimpleImputer(strategy='median')

In [11]:
imputer.statistics_

array([-1.1849e+02,  3.4250e+01,  2.9000e+01,  2.1130e+03,  4.3200e+02,
        1.1620e+03,  4.0800e+02,  3.5385e+00,  1.7865e+05])

In [12]:
v_num.median()


longitud          -118.4900
latitud             34.2500
antiguedad          29.0000
habitaciones      2113.0000
dormitorios        432.0000
población         1162.0000
hogares            408.0000
ingresos             3.5385
precio          178650.0000
dtype: float64

In [13]:
X = imputer.transform(v_num)

In [14]:
v_tr = pd.DataFrame(X, columns=v_num.columns, index=v_num.index)
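
The cells above fit the imputer and transform the same training frame. A minimal sketch of the other common use, reusing the learned medians on data the imputer has not seen, might look like the following. The frames `train_demo` and `new_demo` and their values are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames; the column names mirror two of the dataset's attributes
train_demo = pd.DataFrame({"habitaciones": [1000.0, 2000.0, 3000.0],
                           "dormitorios": [200.0, np.nan, 400.0]})
new_demo = pd.DataFrame({"habitaciones": [1500.0, np.nan],
                         "dormitorios": [np.nan, 350.0]})

demo_imputer = SimpleImputer(strategy="median")
demo_imputer.fit(train_demo)  # learns one median per column, ignoring NaN

# transform() returns a NumPy array, so wrap it back into a DataFrame
new_filled = pd.DataFrame(demo_imputer.transform(new_demo),
                          columns=new_demo.columns, index=new_demo.index)

print(demo_imputer.statistics_)
print(new_filled)
```

The gaps in `new_demo` are filled with the medians computed from `train_demo`, not with statistics from `new_demo` itself. That is the behavior you want when the second frame plays the role of a test set.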

## Handling Text and Categorical Attributes

So far we have only dealt with numerical attributes, but now let’s look at text attributes. 

In this dataset, there is just one: the `proximidad` (ocean proximity) attribute.

Let’s look at its value for the first 10 instances:

In [16]:
v_cat = v[['proximidad']]
v_cat.head(10)

|       | proximidad |
|-------|------------|
| 18223 | NEAR BAY   |
| 6687  | INLAND     |
| 3854  | <1H OCEAN  |
| 11267 | <1H OCEAN  |
| 14498 | NEAR OCEAN |
| 19404 | INLAND     |
| 15517 | <1H OCEAN  |
| 7774  | <1H OCEAN  |
| 4730  | <1H OCEAN  |
| 6416  | INLAND     |


It’s not arbitrary text: there are a limited number of possible values, each of which represents a category. 

So this attribute is a categorical attribute. 

Most Machine Learning algorithms prefer to work with numbers, so let’s convert these categories from text to numbers. 

For this, we can use Scikit-Learn’s `OrdinalEncoder` class:

In [17]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
v_cat_encoded = ordinal_encoder.fit_transform(v_cat)
v_cat_encoded[:10]

array([[3.],
       [1.],
       [0.],
       [0.],
       [4.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.]])

In [18]:
ordinal_encoder.categories_


[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. That is not the case for `proximidad`: for example, categories 0 (`<1H OCEAN`) and 4 (`NEAR OCEAN`) are clearly more similar than categories 0 and 1 (`INLAND`). A common solution is one-hot encoding, which creates one binary attribute per category, equal to 1 for the matching category and 0 otherwise. Scikit-Learn provides the `OneHotEncoder` class for this:

In [20]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder_1hot = OneHotEncoder()
X_cat_1hot = cat_encoder_1hot.fit_transform(v_cat)
X_cat_1hot


<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

In [21]:
X_cat_1hot.toarray()

array([[0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.]])

In [22]:
cat_encoder_1hot.categories_


[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

## Feature Scaling

One of the most important transformations you need to apply to your data is feature scaling. 

With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales.

This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. 

Note that scaling the target values is generally not required.

There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.

Min-max scaling (many people call this normalization) is the simplest: values are shifted and rescaled so that they end up ranging from 0 to 1. 

We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called `MinMaxScaler` for this.

It has a `feature_range` hyperparameter that lets you change the range if, for some reason, you don’t want 0–1.
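
To make the formula concrete, here is a small sketch that compares the manual computation with `MinMaxScaler`, and then uses `feature_range` to rescale to -1 to 1 instead. The `rooms` array is a hypothetical handful of values chosen to span the range quoted above; it is not read from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column of room counts, shaped (n_samples, 1) as scikit-learn expects
rooms = np.array([[6.0], [2000.0], [10000.0], [39320.0]])

# Manual formula: (x - min) / (max - min)
manual = (rooms - rooms.min()) / (rooms.max() - rooms.min())

# Default range is 0 to 1
scaled_01 = MinMaxScaler().fit_transform(rooms)

# Custom range via the feature_range hyperparameter
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(rooms)

print(np.allclose(manual, scaled_01))
print(scaled_01.ravel())
print(scaled_pm1.ravel())
```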

Standardization is different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. 

Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). 

However, standardization is much less affected by outliers. 

For example, suppose a district had a median income equal to 100 (by mistake). 

Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected. 

Scikit-Learn provides a transformer called `StandardScaler` for standardization.
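
The outlier example can be reproduced with a short sketch. The incomes below are synthetic (drawn uniformly between 0 and 15, with one erroneous value of 100 appended), so the exact numbers are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical median incomes between 0 and 15, plus one erroneous value of 100
rng = np.random.default_rng(0)
incomes = rng.uniform(0.0, 15.0, size=(1000, 1))
with_outlier = np.vstack([incomes, [[100.0]]])

std_scaled = StandardScaler().fit_transform(with_outlier)
minmax_scaled = MinMaxScaler().fit_transform(with_outlier)

# Standardized values have zero mean and unit variance
print("mean:", std_scaled.mean(), "std:", std_scaled.std())

# Min-max scaling squeezes the legitimate values into roughly 0 to 0.15
print("largest legitimate value after min-max scaling:", minmax_scaled[:-1].max())

# Range of the legitimate values after standardization, for comparison
print("standardized range:", std_scaled[:-1].min(), "to", std_scaled[:-1].max())
```

With min-max scaling the single bad value fixes the top of the range, so every legitimate value ends up at or below about 0.15. Standardization does not bound the values to a fixed range, so the legitimate values keep a spread of a few standard deviations.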

# Transformation Pipelines

As you can see, there are many data transformation steps that need to be executed in the right order. 

Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations. 

Here is a small pipeline for the numerical attributes: