In [3]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler

Our dataset contains the following columns:

1. CRIM      per capita crime rate by town

2. ZN        proportion of residential land zoned for lots over 25,000 sq.ft.

3. INDUS     proportion of non-retail business acres per town

4. CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

5. NOX       nitric oxides concentration (parts per 10 million)

6. RM        average number of rooms per dwelling

7. AGE       proportion of owner-occupied units built prior to 1940

8. DIS       weighted distances to five Boston employment centres

9. RAD       index of accessibility to radial highways

10. TAX      full-value property-tax rate per $10,000

11. PTRATIO  pupil-teacher ratio by town

12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

13. LSTAT    % lower status of the population

14. MEDV     Median value of owner-occupied homes in $1000's

The data is prepared as in M05T02. For the same process explained step by step and with plots, see https://github.com/maribelseara/SkLearn_Train_Test/blob/main/sklear_train_test.ipynb

Reading the data and generating the dummy columns

In [6]:
columnes=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE','DIS','RAD','TAX','PTRATIO','B', 'LSTAT', 'MEDV']
df=pd.read_csv("housing data.csv", names=columnes)
df=pd.get_dummies(df, columns=['RAD'])
# Drop the last dummy column to avoid multicollinearity
df=df.drop(['RAD_24'], axis=1)
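
As a minimal sketch of why one dummy column is dropped, the toy dataframe below (hypothetical values, not taken from the housing data) shows that the full set of dummy columns always sums to 1 in every row, so any one of them can be recovered from the others. The `dtype=int` argument is only there so the output is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy column standing in for RAD (not the real housing data)
toy = pd.DataFrame({"RAD": [1, 2, 24, 2, 1]})

# One dummy column per distinct value of RAD
dummies = pd.get_dummies(toy, columns=["RAD"], dtype=int)

# Each row has exactly one 1, so the dummy columns are linearly dependent
# with a model intercept
row_sums = dummies.sum(axis=1)

# Dropping one level removes that redundancy; the dropped level becomes
# the baseline category (all remaining dummies equal to 0)
reduced = dummies.drop(["RAD_24"], axis=1)

print(dummies.columns.tolist())
print(reduced.columns.tolist())
```

Which level is dropped does not change the fitted predictions of a linear model with an intercept, only the interpretation of the coefficients relative to the baseline.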

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,TAX,PTRATIO,...,LSTAT,MEDV,RAD_1,RAD_2,RAD_3,RAD_4,RAD_5,RAD_6,RAD_7,RAD_8
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,296.0,15.3,...,4.98,24.0,1,0,0,0,0,0,0,0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,242.0,17.8,...,9.14,21.6,0,1,0,0,0,0,0,0
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,242.0,17.8,...,4.03,34.7,0,1,0,0,0,0,0,0
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,222.0,18.7,...,2.94,33.4,0,0,1,0,0,0,0,0
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,222.0,18.7,...,5.33,36.2,0,0,1,0,0,0,0,0


Scaling the data: M05T02 already showed that the attributes are not Gaussian and that some have outliers while others do not, so RobustScaler is used for the attributes with outliers and MinMaxScaler for those without.

In [7]:
atributs_amb_outliers=['CRIM', 'ZN', 'RM', 'DIS', 'B', 'LSTAT', 'MEDV']
atributs_sense_outliers=['INDUS', 'NOX', 'AGE', 'TAX', 'PTRATIO']
df_estandaritzat=df.copy()
df_estandaritzat[atributs_amb_outliers]=RobustScaler().fit_transform(df_estandaritzat[atributs_amb_outliers])
df_estandaritzat[atributs_sense_outliers]=MinMaxScaler().fit_transform(df_estandaritzat[atributs_sense_outliers])
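
To illustrate the difference between the two scalers, here is a small sketch on a made-up column with one extreme value (the numbers are hypothetical). With default settings, RobustScaler subtracts the median and divides by the interquartile range, while MinMaxScaler maps the minimum to 0 and the maximum to 1, so a single outlier compresses every other value towards 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical single-column data with one clear outlier (100.0)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit_transform(x)
minmax = MinMaxScaler().fit_transform(x)

# Manual equivalents of the default behaviour of both scalers
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual_robust = (x - median) / (q3 - q1)
manual_minmax = (x - x.min()) / (x.max() - x.min())

print(robust.ravel())
print(minmax.ravel())
```

In this toy case the non-outlier values stay spread between -1 and 0.5 under RobustScaler, but fall below about 0.03 under MinMaxScaler, which is the reason for keeping MinMaxScaler for the attributes without outliers.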

In [8]:
df_estandaritzat

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,TAX,PTRATIO,...,LSTAT,MEDV,RAD_1,RAD_2,RAD_3,RAD_4,RAD_5,RAD_6,RAD_7,RAD_8
0,-0.069593,1.44,0.067815,0,0.314815,0.496612,0.641607,0.285777,0.208015,0.287234,...,-0.637681,0.351097,1,0,0,0,0,0,0,0
1,-0.063755,0.00,0.242302,0,0.172840,0.287940,0.782698,0.569789,0.104962,0.553191,...,-0.221889,0.050157,0,1,0,0,0,0,0,0
2,-0.063760,0.00,0.242302,0,0.172840,1.323171,0.599382,0.569789,0.104962,0.553191,...,-0.732634,1.692790,0,1,0,0,0,0,0,0
3,-0.062347,0.00,0.063050,0,0.150206,1.069783,0.441813,0.924391,0.066794,0.648936,...,-0.841579,1.529781,0,0,1,0,0,0,0,0
4,-0.052144,0.00,0.063050,0,0.150206,1.271680,0.528321,0.924391,0.066794,0.648936,...,-0.602699,1.880878,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,-0.053930,0.00,0.420455,0,0.386831,0.521003,0.681771,-0.236007,0.164122,0.893617,...,-0.168916,0.150470,1,0,0,0,0,0,0,0
502,-0.058759,0.00,0.420455,0,0.386831,-0.119919,0.760041,-0.297887,0.164122,0.893617,...,-0.227886,-0.075235,1,0,0,0,0,0,0,0
503,-0.054450,0.00,0.420455,0,0.386831,1.039973,0.907312,-0.336744,0.164122,0.893617,...,-0.571714,0.338558,1,0,0,0,0,0,0,0
504,-0.040867,0.00,0.420455,0,0.386831,0.793360,0.889804,-0.265053,0.164122,0.893617,...,-0.487756,0.100313,1,0,0,0,0,0,0,0


Now that we have the scaled dataframe with the dummy columns, we split it into train and test.

In [9]:
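# Split the scaled dataframe (df_estandaritzat), not the raw df
train, test=train_test_split(df_estandaritzat,test_size=0.33, random_state=1)
# train_test_split already returns DataFrames; copying them keeps the RAD_* dummy columns,
# whereas reindexing with `columnes` would drop them and add an all-NaN RAD column
train_df=train.copy()
test_df=test.copy()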
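
As a quick sanity check of what `train_test_split` returns when given a dataframe, the sketch below uses a hypothetical 10-row toy dataframe (not the housing data). It suggests that both parts come back as DataFrames with all columns, that the rows do not overlap, and that fixing `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe; column names are illustrative only
toy = pd.DataFrame({"a": range(10), "b": range(10, 20), "RAD_1": [0, 1] * 5})

toy_train, toy_test = train_test_split(toy, test_size=0.33, random_state=1)

# Same call with the same seed gives the same rows
toy_train_again, toy_test_again = train_test_split(toy, test_size=0.33, random_state=1)

print(type(toy_train).__name__, toy_train.shape, toy_test.shape)
print(sorted(toy_test.index))
```

The original row index is preserved in both parts, which makes it easy to trace any row back to the source dataframe.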