# Categorical Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
# load the datasets

X = pd.read_csv('train.csv', index_col = 'Id')
X_test = pd.read_csv('test.csv', index_col= 'Id')

X.shape, X_test.shape

((1460, 80), (1459, 79))

We need to drop the rows that have missing values in the `SalePrice` column, because they are of no use for either training or validating the model.

In [3]:
# drop the rows with a missing SalePrice
X.dropna(axis = 0, subset=['SalePrice'], inplace = True)
X.shape

(1460, 80)

In [4]:
# define the labels we will use to train and validate the model
y = X.SalePrice
type(y)

pandas.core.series.Series

In [5]:
# drop the SalePrice column from the training data
X.drop(['SalePrice'], axis = 1, inplace = True)
X.shape

(1460, 79)

Handling missing values is outside the scope of this notebook (see notebook 02-Missing values). We will therefore simply drop every column that has missing values.

In [6]:
cols_with_missing_values = [col for col in X.columns if X[col].isnull().any()]
cols_with_missing_values

['LotFrontage',
 'Alley',
 'MasVnrType',
 'MasVnrArea',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'FireplaceQu',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature']

In [7]:
X.drop(cols_with_missing_values, axis = 1, inplace = True)
X.shape

(1460, 60)

In [8]:
# Drop the same columns from the test dataset
X_test.drop(cols_with_missing_values, axis = 1, inplace = True)
X_test.shape

(1459, 60)
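
Note that the columns to drop were chosen from the training data only, so the test set may still contain missing values in other columns. A minimal sketch on toy data shows one way to check for this; the frames `train_toy` and `test_toy` are hypothetical stand-ins, not the notebook's datasets:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for X and X_test
train_toy = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 3.0], "c": [5.0, 6.0]})
test_toy = pd.DataFrame({"a": [1.0, 2.0], "b": [4.0, 3.0], "c": [np.nan, 6.0]})

# Columns with missing values, determined from the training frame only
missing_in_train = [col for col in train_toy.columns if train_toy[col].isnull().any()]

# Drop those columns from both frames (not in place, so the originals stay intact)
train_reduced = train_toy.drop(columns=missing_in_train)
test_reduced = test_toy.drop(columns=missing_in_train)

# Columns that still contain missing values in the test frame
still_missing_in_test = test_reduced.columns[test_reduced.isnull().any()].tolist()
print(missing_in_train, still_missing_in_test)
```

If the second list is not empty, the test set would need imputation or further column drops before a model can predict on it.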

In [9]:
# split the dataset X into training and validation sets
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [10]:
X_train.shape, X_valid.shape

((1168, 60), (292, 60))

In [11]:
y_train.shape, y_valid.shape

((1168,), (292,))

We now define a function that lets us evaluate each transformation we apply to the datasets.

In [12]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    # use a standard configuration of the algorithm;
    # the hyperparameters can be tuned at a later stage
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
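
The value returned by `score_dataset` is the mean absolute error (MAE), the average of the absolute differences between true and predicted prices, so lower is better. A small illustration with made-up numbers (not taken from the dataset) compares a manual computation with scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true prices and predictions, for illustration only
y_true = np.array([200000, 150000, 320000])
y_pred = np.array([210000, 140000, 300000])

# MAE computed by hand: mean of the absolute differences
mae_manual = np.mean(np.abs(y_true - y_pred))

# The same quantity computed by scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)
```

Because the MAE is expressed in the same units as `SalePrice`, scores from different approaches can be compared directly.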

### Step 1: Drop columns with categorical data

In [13]:
# select the columns that do not hold categorical data
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_train.shape

(1168, 33)

In [14]:
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
drop_X_valid.shape

(292, 33)

In [16]:
print("MAE from Approach 1 (Drop Missing Val Cols and Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop Missing Val Cols and Drop categorical variables):
17837.82570776256


### Step 2: Ordinal encoding


In [17]:
print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())

Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']

Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']
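
The output above shows that the validation data contains `Condition2` values ('RRAn', 'RRNn') that never appear in the training data, so an encoder fitted only on the training set would have no code for them. The following is a hedged sketch of one way to spot such columns, using toy data; `train_toy`, `valid_toy`, and the helper `unseen_categories` are hypothetical names, not part of the notebook:

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train and X_valid
train_toy = pd.DataFrame({
    "Condition2": ["Norm", "PosA", "Feedr"],
    "Street": ["Pave", "Grvl", "Pave"],
})
valid_toy = pd.DataFrame({
    "Condition2": ["Norm", "RRAn", "RRNn"],
    "Street": ["Pave", "Pave", "Grvl"],
})

def unseen_categories(train, valid, cols):
    """Map each column in cols to the sorted validation values absent from training."""
    result = {}
    for col in cols:
        extra = set(valid[col].unique()) - set(train[col].unique())
        if extra:
            result[col] = sorted(extra)
    return result

problem_cols = unseen_categories(train_toy, valid_toy, ["Condition2", "Street"])
print(problem_cols)
```

Columns reported by such a check would need special handling, for example being dropped before encoding; columns that are not reported can be encoded safely with categories learned from the training set.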
