# Boston house prices project

Topics covered:
- Dataset description
- Exploratory analysis
- Correlation
- Multicollinearity
- Missing-value analysis
- Model fitting
- Testing the model assumptions

## Load data

In [1]:
# libraries
import pandas as pd
# note: load_boston was removed in scikit-learn 1.2, so this import needs an older release
from sklearn.datasets import load_boston

In [2]:
# load data
boston = load_boston()
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

The dataset description shows 506 records and 14 variables (13 predictors plus the target). *MEDV* is usually used as the target variable. The data contain no missing values.

The goal is to build a model that predicts *MEDV*, the median value of the homes in a tract, from a set of characteristics of that tract.
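
On scikit-learn releases without `load_boston`, the warning printed above suggests reading the raw file instead. The sketch below wraps that suggestion in a helper, `unstack_boston` (a hypothetical name, not part of any library), which turns the raw two-rows-per-record array into the predictors and target used in this notebook. It is shown on the first record only, typed in by hand from the `head()` outputs further down, so it runs without network access.

```python
import numpy as np
import pandas as pd

# Column names as they appear in the notebook's X_orig output.
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def unstack_boston(raw_values):
    """Rebuild (X, y) from the raw layout, where each record spans two rows.

    Even rows hold the first 11 features; odd rows hold the last 2 features
    followed by the target. This mirrors the snippet in the scikit-learn warning.
    """
    raw_values = np.asarray(raw_values, dtype=float)
    data = np.hstack([raw_values[::2, :], raw_values[1::2, :2]])
    target = raw_values[1::2, 2]
    X = pd.DataFrame(data, columns=FEATURE_NAMES)
    y = pd.Series(target, name="MEDV")
    return X, y


# First record of the dataset in the raw layout (NaN pads the short second row).
nan = np.nan
raw_example = [
    [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 296.0, 15.3],
    [396.9, 4.98, 24.0, nan, nan, nan, nan, nan, nan, nan, nan],
]
X_example, y_example = unstack_boston(raw_example)
print(X_example)
print(y_example)
```

With the real file, `raw_values` would be the `raw_df.values` array produced by the `pd.read_csv` call quoted in the warning.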

In [3]:
# predictor variables as a DataFrame
X_orig = pd.DataFrame(boston.data, columns=boston.feature_names)
X_orig.head()

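| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |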


In [4]:
# target variable as a Series
y_orig = pd.Series(boston.target, name='MEDV')
y_orig[0:5]

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

## Training and test data

In [5]:
# visualization libraries
# render plots below the code cell
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

In [6]:
# function for splitting the data
from sklearn.model_selection import train_test_split

In [7]:
# create the two datasets
# the default test_size splits the data into 75% for training and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X_orig, y_orig, random_state=1)
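
The 75/25 split comes from the default that `train_test_split` applies when neither `test_size` nor `train_size` is given. A minimal sketch on placeholder arrays with the same number of rows (506) shows the resulting sizes; the arrays are synthetic stand-ins, not the Boston data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same number of rows as the dataset (506).
X_demo = np.arange(506 * 2).reshape(506, 2)
y_demo = np.arange(506)

# No test_size given, so scikit-learn falls back to 25% for the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

n_train, n_test = len(X_tr), len(X_te)
print(n_train, n_test)
```

Passing `test_size=0.25` explicitly would make the proportion visible in the code instead of relying on the default.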

In [8]:
# combine the training data into a single DataFrame for later processing
df = pd.concat([X_train, y_train], axis=1)
df.head()

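| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.12 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.9 | 9.08 | 20.6 |
| 172 | 0.13914 | 0.0 | 4.05 | 0.0 | 0.51 | 5.572 | 88.5 | 2.5961 | 5.0 | 296.0 | 16.6 | 396.9 | 14.69 | 23.1 |
| 80 | 0.04113 | 25.0 | 4.86 | 0.0 | 0.426 | 6.727 | 33.5 | 5.4007 | 4.0 | 281.0 | 19.0 | 396.9 | 5.29 | 28.0 |
| 46 | 0.18836 | 0.0 | 6.91 | 0.0 | 0.448 | 5.786 | 33.3 | 5.1004 | 3.0 | 233.0 | 17.9 | 396.9 | 14.15 | 20.0 |
| 318 | 0.40202 | 0.0 | 9.9 | 0.0 | 0.544 | 6.382 | 67.2 | 3.5325 | 4.0 | 304.0 | 18.4 | 395.21 | 10.36 | 23.1 |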
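
`pd.concat(..., axis=1)` aligns rows by index label rather than by position, which is why the shuffled labels (502, 172, 80, ...) survive the join and each row keeps its own target. A small sketch with two rows from the output above, with the target deliberately listed in a different order:

```python
import pandas as pd

# Two predictor rows, keyed by their original row labels.
X_part = pd.DataFrame({"RM": [6.12, 5.572]}, index=[502, 172])
# The matching targets, listed in the opposite order.
y_part = pd.Series([23.1, 20.6], index=[172, 502], name="MEDV")

# Rows are matched on the index label, not on their position.
joined = pd.concat([X_part, y_part], axis=1)
print(joined)
```

If the two objects had been re-indexed independently, for example with `reset_index(drop=True)` on only one of them, the same call would pair predictors with the wrong targets.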


## EDA

**Do the training data contain null values?**

In [9]:
X_train.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64

In [10]:
y_train.isna().sum()

0

Neither the predictors nor the target contain missing values.

**Statistical summary of the predictor variables**

In [11]:
X_train.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 379.0 | 379.0 | 379.0 | 379.0 | 379.0 | 379.0 | 379.0 | 379.0 | 379.0 | 379.0 | 379.0 | 379.0 | 379.0 |
| mean | 3.805183 | 11.521108 | 11.220053 | 0.081794 | 0.554073 | 6.255726 | 68.751451 | 3.824433 | 9.525066 | 405.182058 | 18.4781 | 358.304802 | 12.936174 |
| std | 9.375846 | 23.492644 | 6.875362 | 0.274413 | 0.117825 | 0.687415 | 28.276504 | 2.138449 | 8.73455 | 169.483657 | 2.141433 | 89.601791 | 7.243381 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 6.0 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 |
| 25% | 0.083475 | 0.0 | 5.255 | 0.0 | 0.4475 | 5.875 | 45.05 | 2.09445 | 4.0 | 277.0 | 17.4 | 376.125 | 7.165 |
| 50% | 0.24522 | 0.0 | 9.69 | 0.0 | 0.538 | 6.172 | 79.2 | 3.3175 | 5.0 | 329.0 | 19.1 | 392.04 | 11.97 |
| 75% | 3.68339 | 17.75 | 18.1 | 0.0 | 0.624 | 6.611 | 94.05 | 5.10855 | 24.0 | 666.0 | 20.2 | 396.22 | 17.135 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 |


- The crime rate (CRIM) has a mean of 3.81 and a standard deviation of 9.38. A standard deviation roughly 2.5 times the mean, together with a median of only 0.25, indicates a widely dispersed, right-skewed distribution. Values range from 0.006 to 88.98, so some towns are far safer than others.

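- The proportion of non-retail business acres (INDUS) has a mean of 11.22 and a median of 9.69, so much of the data lies near the center. The summary alone cannot confirm that the distribution is normal; that would need a histogram or a normality test.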

- CHAS is a binary variable, and its mean shows that about 8% of the tracts in the training set bound the Charles River. Because it takes only the values zero and one, its mean equals the proportion of ones.

- The proportion of owner-occupied units built before 1940 (AGE) has a mean of 68.75 and a median of 79.2.
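
Two of the readings above rest on simple properties of the mean. For a 0/1 variable such as CHAS the mean is the share of ones, and for a right-skewed variable such as CRIM a few large values pull the mean well above the median. The sketch below illustrates both on small hand-made series; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical 0/1 indicator: 2 ones out of 25 values.
flag = pd.Series([1, 1] + [0] * 23)
share_of_ones = (flag == 1).sum() / len(flag)
print(flag.mean(), share_of_ones)

# Hypothetical right-skewed values: mostly small, one very large.
skewed = pd.Series([0.1, 0.2, 0.2, 0.3, 0.4, 40.0])
print(skewed.mean(), skewed.median(), skewed.std())
```

In the second series the single large value lifts the mean far above the median and pushes the standard deviation above the mean, the same pattern the CRIM column shows in the summary table.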

**Statistical summary of the target variable**

In [12]:
y_train.describe()

count    379.000000
mean      22.344591
std        8.920931
min        5.000000
25%       17.100000
50%       20.800000
75%       25.000000
max       50.000000
Name: MEDV, dtype: float64

- The mean of MEDV is 22.34, which is about $22,300 because the variable is expressed in thousands of dollars. The minimum and maximum are $5,000 and $50,000 respectively.
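
Since MEDV is recorded in thousands of dollars, the summary has to be scaled by 1,000 to be read as prices. A minimal sketch using only the five target values shown earlier in the `y_orig[0:5]` output:

```python
import pandas as pd

# First five MEDV values from the notebook output, in thousands of dollars.
medv = pd.Series([24.0, 21.6, 34.7, 33.4, 36.2], name="MEDV")

# Scale every statistic except the count to dollars.
summary = medv.describe()
summary_usd = summary.drop("count") * 1000
print(summary_usd.round(0))
```

The count is dropped before scaling because it is a number of rows, not a price.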

**Correlation analysis**