## Introduction

In [27]:
import pandas as pd

data = pd.read_csv('AmesHousing.txt', delimiter="\t")
train = data[0:1460]
test = data[1460:]

train_null_counts = train.isnull().sum()
print(train_null_counts)

Order               0
PID                 0
MS SubClass         0
MS Zoning           0
Lot Frontage      249
                 ... 
Mo Sold             0
Yr Sold             0
Sale Type           0
Sale Condition      0
SalePrice           0
Length: 82, dtype: int64


In [28]:
df_no_mv = train[train_null_counts[train_null_counts == 0].index]

## Categorical Features

In [29]:
text_cols = df_no_mv.select_dtypes(include=['object']).columns

for col in text_cols:
    print(col+":", len(train[col].unique()))

MS Zoning: 6
Street: 2
Lot Shape: 4
Land Contour: 4
Utilities: 3
Lot Config: 5
Land Slope: 3
Neighborhood: 26
Condition 1: 9
Condition 2: 6
Bldg Type: 5
House Style: 8
Roof Style: 6
Roof Matl: 5
Exterior 1st: 14
Exterior 2nd: 16
Exter Qual: 4
Exter Cond: 5
Foundation: 6
Heating: 6
Heating QC: 4
Central Air: 2
Electrical: 4
Kitchen Qual: 5
Functional: 7
Paved Drive: 3
Sale Type: 9
Sale Condition: 5


In [30]:
train = train.copy()  # `train` is a slice of `data`; copy it so the assignment below does not raise SettingWithCopyWarning
for col in text_cols:
    train[col] = train[col].astype('category')


In [31]:
train['Utilities'].cat.codes.value_counts()

0    1457
2       2
1       1
dtype: int64
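
The integer codes above are hard to read on their own. As a minimal sketch, on a hypothetical toy column rather than the real `train` data, the `.cat.categories` index can be used to see which label each code stands for (pandas orders string categories lexically by default):

```python
import pandas as pd

# Hypothetical toy column; the labels are illustrative only.
s = pd.Series(["AllPub", "AllPub", "NoSewr", "NoSeWa"], dtype="category")

# Position in .cat.categories is the integer code pandas assigns.
code_to_label = dict(enumerate(s.cat.categories))
print(code_to_label)

# The codes line up row by row with the original labels.
print(pd.DataFrame({"label": s, "code": s.cat.codes}))
```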

## Dummy Coding

When we convert a column to the categorical data type, pandas assigns each value an integer code from `0` to `n-1`, where `n` is the number of unique values in the column. The drawback is that using these codes directly as a feature violates an assumption of linear regression: that each feature is linearly related to the target column. For a categorical feature, the codes pandas assigns carry no numerical meaning. An increase in the `Utilities` code from `1` to `2` says nothing about how the target column changes; the codes only keep the categories unique and mutually exclusive (the category coded `0` is different from the one coded `1`).

The common solution is to use a technique called dummy coding. Instead of having a single column with `n` integer codes, we have `n` binary columns.
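
To make the difference concrete, here is a small sketch on a hypothetical four-row column (the labels and the `toy` frame are illustrative, not taken from the training set). It contrasts the single integer-coded column with the `n` binary columns produced by `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical toy data with n = 3 distinct categories.
toy = pd.DataFrame({"Utilities": ["AllPub", "NoSewr", "AllPub", "NoSeWa"]})

# Integer codes: one column, values 0..n-1 with no numeric meaning.
codes = toy["Utilities"].astype("category").cat.codes

# Dummy coding: n binary columns, exactly one of which is 1 in each row.
dummies = pd.get_dummies(toy["Utilities"], prefix="Utilities", dtype=int)

print(codes.tolist())
print(dummies)
```

Each row of `dummies` sums to 1, so a linear model can learn a separate coefficient per category instead of a single slope over arbitrary codes.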

In [52]:
data = pd.read_csv('AmesHousing.txt', delimiter="\t")
train = data[0:1460]
test = data[1460:]

In [53]:
# dummy_cols = pd.get_dummies(train[text_cols])
# dropped_train = train.drop(text_cols,axis=1)
# train = pd.concat([dummy_cols, dropped_train], axis=1)

In [54]:
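for col in text_cols:
    # Prefix each dummy with its source column so labels shared by several columns stay distinct
    col_dummies = pd.get_dummies(train[col], prefix=col)
    train = pd.concat([train, col_dummies], axis=1)
    del train[col]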

## Transforming Improper Numerical Features

Some numerical columns do not represent their information in a form a linear model can use directly. `Year Built` and `Year Remod/Add` are an example: the difference between them, the number of years until a house was remodeled, is more informative than either raw year. For this particular piece of information, taking the difference is a sensible approach. Domain knowledge can help you understand how best to transform features so they represent information well for a linear model. If you're ever unsure about a feature or how it should be represented, reading scientific papers or posts by researchers in that domain can help. Many winners of [Kaggle data science competitions](https://www.import.io/post/how-to-win-a-kaggle-competition/), for example, claim that their focus on data preparation and feature engineering, combined with common machine learning models, helped them win.

In [55]:
train['years_until_remod'] = train['Year Remod/Add'] - train['Year Built']

In [57]:
train['years_until_remod'].value_counts()

 0      773
 1      213
 40      20
 30      20
 10      16
       ... 
 80       1
 70       1
 58       1
 127      1
-1        1
Name: years_until_remod, Length: 108, dtype: int64
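
The counts above include one house with a value of `-1`, meaning its remodel year precedes its build year, which presumably reflects a data entry error. One possible way to handle such rows, assuming that dropping them is acceptable for the analysis, is sketched below on hypothetical toy data:

```python
import pandas as pd

# Hypothetical rows; the column names follow the notebook, the values do not.
toy = pd.DataFrame({
    "Year Built": [2000, 1950, 2002],
    "Year Remod/Add": [2000, 1990, 2001],
})
toy["years_until_remod"] = toy["Year Remod/Add"] - toy["Year Built"]

# Rows where the remodel year is earlier than the build year.
suspect = toy[toy["years_until_remod"] < 0]
print(suspect)

# Drop the suspect rows (an assumption; correcting them is another option).
cleaned = toy.drop(suspect.index)
```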

## Missing Values

In [58]:
data = pd.read_csv('AmesHousing.txt', delimiter="\t")
train = data[0:1460]
test = data[1460:]

train_null_counts = train.isnull().sum()

In [62]:
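# Columns with at least one missing value but fewer than 584 (40% of the 1,460 training rows)
missing_cols = train_null_counts[(train_null_counts > 0) & (train_null_counts < 584)].index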
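
The cutoff of `584` is 40% of the 1,460 training rows. As a sketch of an alternative that does not hard-code the row count, the same filter can be written in terms of the fraction of missing values. The toy frame and the name `max_missing_fraction` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 0%, 20% and 60% missing values per column.
toy = pd.DataFrame({
    "complete": [1, 2, 3, 4, 5],
    "few_missing": [1.0, np.nan, 3.0, 4.0, 5.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0, 5.0],
})

max_missing_fraction = 0.4  # hypothetical parameter name

# isnull().mean() gives the share of missing values in each column.
null_fraction = toy.isnull().mean()
missing_cols = null_fraction[
    (null_fraction > 0) & (null_fraction < max_missing_fraction)
].index
print(list(missing_cols))
```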

In [63]:
df_missing_values = train[missing_cols]

In [64]:
df_missing_values.dtypes

Lot Frontage      float64
Mas Vnr Type       object
Mas Vnr Area      float64
Bsmt Qual          object
Bsmt Cond          object
Bsmt Exposure      object
BsmtFin Type 1     object
BsmtFin SF 1      float64
BsmtFin Type 2     object
BsmtFin SF 2      float64
Bsmt Unf SF       float64
Total Bsmt SF     float64
Bsmt Full Bath    float64
Bsmt Half Bath    float64
Garage Type        object
Garage Yr Blt     float64
Garage Finish      object
Garage Qual        object
Garage Cond        object
dtype: object

## Imputing Missing Values

In [70]:
float_cols = df_missing_values.select_dtypes(include=['float'])

In [71]:
float_cols.isnull().sum()

Lot Frontage      249
Mas Vnr Area       11
BsmtFin SF 1        1
BsmtFin SF 2        1
Bsmt Unf SF         1
Total Bsmt SF       1
Bsmt Full Bath      1
Bsmt Half Bath      1
Garage Yr Blt      75
dtype: int64

In [67]:
float_cols.mean()

Lot Frontage        68.928984
Mas Vnr Area       102.591442
BsmtFin SF 1       447.726525
BsmtFin SF 2        51.529815
Bsmt Unf SF        548.705963
Total Bsmt SF     1047.962303
Bsmt Full Bath       0.448252
Bsmt Half Bath       0.056203
Garage Yr Blt     1977.738628
dtype: float64

In [72]:
float_cols = float_cols.fillna(float_cols.mean())  # reassign rather than use inplace=True, which warns when called on a slice


In [73]:
float_cols.isnull().sum()

Lot Frontage      0
Mas Vnr Area      0
BsmtFin SF 1      0
BsmtFin SF 2      0
Bsmt Unf SF       0
Total Bsmt SF     0
Bsmt Full Bath    0
Bsmt Half Bath    0
Garage Yr Blt     0
dtype: int64
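
The cells above impute only the training rows. If the same columns are later needed for `test`, one common practice (an assumption here, since the notebook does not show this step) is to reuse the means computed on the training data so that no information from the test rows leaks into the fill values. A minimal sketch on hypothetical toy frames:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the float columns of train and test.
train_toy = pd.DataFrame({
    "Lot Frontage": [60.0, np.nan, 80.0],
    "Mas Vnr Area": [0.0, 100.0, np.nan],
})
test_toy = pd.DataFrame({
    "Lot Frontage": [np.nan, 70.0],
    "Mas Vnr Area": [np.nan, 50.0],
})

# Column means computed on the training rows only (NaN values are skipped).
train_means = train_toy.mean()

# fillna with a Series fills each column with the value matching its label.
train_filled = train_toy.fillna(train_means)
test_filled = test_toy.fillna(train_means)

print(test_filled)
```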