<a href="https://colab.research.google.com/github/iqbalamo93/Feature_Eng/blob/master/Feature_Engineering_Linreg_Baisics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [0]:
url='https://raw.githubusercontent.com/iqbalamo93/Datasets/master/AmesHousing_Feat.txt'

In [0]:
# The file is tab-separated
data = pd.read_csv(url, delimiter="\t")
# First 1460 rows for training, the remaining rows for testing
train = data[0:1460]
test = data[1460:]

In [0]:
train_null_counts = train.isnull().sum()

Select just the columns from the train data frame that contain no missing values. Assign the resulting data frame, which contains only these columns, to df_no_mv. Then inspect the result to become familiar with these columns.

In [0]:
train_null_counts = train.isnull().sum()
print(train_null_counts)
df_no_mv = train[train_null_counts[train_null_counts==0].index]

Order               0
PID                 0
MS SubClass         0
MS Zoning           0
Lot Frontage      249
                 ... 
Mo Sold             0
Yr Sold             0
Sale Type           0
Sale Condition      0
SalePrice           0
Length: 82, dtype: int64
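
The same selection can also be written with `DataFrame.dropna(axis=1)`, which drops every column that contains at least one missing value. A minimal sketch on a made-up frame (the `toy` data and the names `no_mv_mask` and `no_mv_dropna` are hypothetical, not part of the Ames data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`.
toy = pd.DataFrame({
    "Lot Area": [8450, 9600, 11250],
    "Lot Frontage": [65.0, np.nan, 68.0],
    "SalePrice": [208500, 181500, 223500],
})

null_counts = toy.isnull().sum()
# Index-based selection, as in the cell above.
no_mv_mask = toy[null_counts[null_counts == 0].index]
# Alternative: drop any column that has a missing value.
no_mv_dropna = toy.dropna(axis=1)

print(list(no_mv_mask.columns))
print(no_mv_mask.equals(no_mv_dropna))
```

Both approaches keep the same columns here; the index-based version is handy when the null counts are also needed for other filters.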


In [0]:
train.loc[:,'Utilities']=train['Utilities'].astype('category')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
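
The warning above appears because `train` was created by slicing `data`, so pandas cannot tell whether the assignment is also meant to affect `data`. One common way to avoid it, sketched here on a hypothetical toy frame (`data_toy` and `train_toy` are illustrative names), is to take an explicit copy of the slice before modifying it:

```python
import pandas as pd

# Hypothetical stand-in for the full data set.
data_toy = pd.DataFrame({"Utilities": ["AllPub", "NoSewr", "AllPub", "NoSeWa"]})

# .copy() makes train_toy an independent frame, so the assignment is unambiguous.
train_toy = data_toy[0:3].copy()
train_toy["Utilities"] = train_toy["Utilities"].astype("category")

# Only the copy was converted; the original frame keeps its text column.
print(train_toy["Utilities"].dtype)
print(data_toy["Utilities"].dtype)
```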


In [0]:
train['Utilities']

0       AllPub
1       AllPub
2       AllPub
3       AllPub
4       AllPub
         ...  
1455    AllPub
1456    AllPub
1457    AllPub
1458    AllPub
1459    AllPub
Name: Utilities, Length: 1460, dtype: category
Categories (3, object): [AllPub, NoSeWa, NoSewr]

In [0]:
train['Utilities'].value_counts()

AllPub    1457
NoSewr       2
NoSeWa       1
Name: Utilities, dtype: int64

In [0]:
train['Utilities'].cat.codes.value_counts()

0    1457
2       2
1       1
dtype: int64

When we convert a column to the categorical data type, pandas assigns each value an integer code from 0 to n-1, where n is the number of unique values in the column. The drawback of using these codes directly is that they violate one of the assumptions of linear regression, namely that each feature is linearly related to the target column. The codes pandas assigns to a categorical feature have no numerical meaning: an increase in the Utilities code from 1 to 2 does not correspond to any consistent change in the target column. The codes only express uniqueness and exclusivity (the category associated with 0 is different from the one associated with 1). The usual remedy, applied in the next cells, is to replace the column with dummy (0/1 indicator) columns, one per category.
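
The difference between integer codes and indicator columns is easier to see on a small example. The sketch below uses a hypothetical toy Series (`utilities`, `codes`, and `dummies_toy` are illustrative names) with the same three category labels as the Utilities column:

```python
import pandas as pd

# Hypothetical toy column mirroring the three Utilities categories.
utilities = pd.Series(["AllPub", "NoSewr", "AllPub", "NoSeWa"], dtype="category")

# Integer codes follow the sorted category order; the numbers carry no magnitude.
codes = utilities.cat.codes
print({label: int(code) for label, code in zip(utilities, codes)})

# One indicator column per category instead of a single pseudo-ordinal column.
dummies_toy = pd.get_dummies(utilities)
print(dummies_toy.astype(int))
```

Each row has exactly one indicator set, so a linear model can learn a separate coefficient per category rather than a single slope across arbitrary codes.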

In [0]:
# drop_first=True omits the first category (AllPub), which becomes the baseline
dummies = pd.get_dummies(train['Utilities'], drop_first=True)



In [0]:
dummies['NoSeWa'].value_counts()

0    1459
1       1
Name: NoSeWa, dtype: int64

In [0]:
dummies['NoSewr'].value_counts()

0    1458
1       2
Name: NoSewr, dtype: int64

In [0]:
dummies.columns

CategoricalIndex(['NoSeWa', 'NoSewr'], categories=['AllPub', 'NoSeWa', 'NoSewr'], ordered=False, dtype='category')

In [0]:
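# Text (object) columns among the columns that have no missing values
text_cols = df_no_mv.select_dtypes(include=['object']).columns

# Convert each text column to the categorical dtype
for col in text_cols:
    train[col] = train[col].astype('category')
train['Utilities'].cat.codes.value_counts()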


0    1457
2       2
1       1
dtype: int64

In [0]:
df = pd.concat([train,dummies],axis=1)

In [0]:
#del train['Utilities']

In [0]:
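# Replace each text column with its dummy (indicator) columns.
# Note: without a prefix, labels shared by several columns produce duplicate column names.
for col in text_cols:
    col_dummies = pd.get_dummies(train[col])
    train = pd.concat([train, col_dummies], axis=1)
    del train[col]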
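
`pd.get_dummies` can also encode several columns in one call through its `columns` parameter, which replaces the listed columns with indicator columns prefixed by the original column name. A minimal sketch on hypothetical toy data (`toy`, `text_cols_toy`, and `encoded` are illustrative names); unlike the loop above, the prefixed names cannot collide when two columns share a label:

```python
import pandas as pd

# Hypothetical toy frame with two text columns and the target.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "Utilities": ["AllPub", "AllPub", "NoSewr"],
    "SalePrice": [208500, 181500, 223500],
})
text_cols_toy = ["Street", "Utilities"]

# Encode both text columns at once; untouched columns are kept as they are.
encoded = pd.get_dummies(toy, columns=text_cols_toy)
print(encoded.columns.tolist())
```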


So far, we have focused on categorical values that are represented as text columns. Some of the numerical columns in the data set are also categorical and have only a limited set of unique values. We won't explicitly explore those columns here, but the feature transformation process is the same if the numbers used for those categories have no numerical meaning.

Let's now look at numerical features that aren't categorical, but whose numerical representation needs to be improved. We'll focus on the Year Remod/Add and Year Built columns:

In [0]:
print(train[['Year Remod/Add', 'Year Built']])

      Year Remod/Add  Year Built
0               1960        1960
1               1961        1961
2               1958        1958
3               1968        1968
4               1998        1997
...              ...         ...
1455            2000        2000
1456            2001        2001
1457            2000        1999
1458            1999        1998
1459            2002        2001

[1460 rows x 2 columns]


The two main issues with these features are:

1.   Year values aren't representative of how old a house is
2.   The Year Remod/Add column doesn't actually provide useful information for a linear regression model

The challenge with year values like 1960 and 1961 is that they don't capture how old a house is. For example, a house built in 1960 and sold in 1980 was 20 years old at sale, half the age of one built in 1960 and sold in 2000. Instead of the years in which certain events happened, we want the difference between those years. We should create a new column that holds the difference between these two columns.

For this particular piece of information (years until remodeled), taking the difference is a sensible approach. Domain knowledge can help you understand how best to transform features so that they represent information well for a linear model. If you're unsure about a feature or how it should be represented, reading scientific papers or posts by researchers in the specific domain is valuable. Many winners of Kaggle data science competitions, for example, say that their focus on data preparation and feature engineering, combined with common machine learning models, helped them win.


In [0]:
train['years_until_remod'] = train['Year Remod/Add'] - train['Year Built']
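
As a small illustration of year differences, the sketch below computes `years_until_remod` on hypothetical toy rows. It also derives the age of the house at sale from the `Yr Sold` column; the name `years_before_sale` is an assumption introduced here for illustration and is not created in this notebook.

```python
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration.
toy = pd.DataFrame({
    "Year Built": [1960, 1960, 1997],
    "Year Remod/Add": [1960, 1975, 1998],
    "Yr Sold": [1980, 2000, 2010],
})

# Same transformation as the cell above.
toy["years_until_remod"] = toy["Year Remod/Add"] - toy["Year Built"]
# Assumed extra feature: age of the house when it was sold.
toy["years_before_sale"] = toy["Yr Sold"] - toy["Year Built"]

print(toy[["years_until_remod", "years_before_sale"]])
```

The first two rows share a build year but differ in age at sale (20 versus 40 years), which the raw year columns alone do not express.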

In [0]:
train_null_counts = train.isnull().sum()

In [0]:
100 * train_null_counts/len(train)

Order                 0.000000
PID                   0.000000
MS SubClass           0.000000
Lot Frontage         17.054795
Lot Area              0.000000
                       ...    
Alloca                0.000000
Family                0.000000
Normal                0.000000
Partial               0.000000
years_until_remod     0.000000
Length: 237, dtype: float64

In [0]:
train_null_counts = train.isnull().sum()
# Keep columns that have some missing values but fewer than 584 (40% of the 1460 rows)
df_missing_values = train[train_null_counts[(train_null_counts>0) & (train_null_counts<584)].index]

print(df_missing_values.isnull().sum())
print(df_missing_values.dtypes)

Lot Frontage      249
Mas Vnr Type       11
Mas Vnr Area       11
Bsmt Qual          40
Bsmt Cond          40
Bsmt Exposure      41
BsmtFin Type 1     40
BsmtFin SF 1        1
BsmtFin Type 2     41
BsmtFin SF 2        1
Bsmt Unf SF         1
Total Bsmt SF       1
Bsmt Full Bath      1
Bsmt Half Bath      1
Garage Type        74
Garage Yr Blt      75
Garage Finish      75
Garage Qual        75
Garage Cond        75
dtype: int64
Lot Frontage      float64
Mas Vnr Type       object
Mas Vnr Area      float64
Bsmt Qual          object
Bsmt Cond          object
Bsmt Exposure      object
BsmtFin Type 1     object
BsmtFin SF 1      float64
BsmtFin Type 2     object
BsmtFin SF 2      float64
Bsmt Unf SF       float64
Total Bsmt SF     float64
Bsmt Full Bath    float64
Bsmt Half Bath    float64
Garage Type        object
Garage Yr Blt     float64
Garage Finish      object
Garage Qual        object
Garage Cond        object
dtype: object
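
The count-based cutoff can also be expressed as a fraction of rows, which keeps working if the number of rows changes. This is a minimal sketch on hypothetical toy data; `max_missing_frac` is an assumed name, and its value 0.4 corresponds to 584 out of 1460 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],           # no missing values
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],        # 20% missing
    "c": [np.nan, np.nan, np.nan, 4.0, 5.0],  # 60% missing
})

max_missing_frac = 0.4  # assumed cutoff, equivalent to 584 / 1460
# The mean of a boolean mask is the fraction of missing values per column.
missing_frac = toy.isnull().mean()
keep = missing_frac[(missing_frac > 0) & (missing_frac < max_missing_frac)].index

print(list(keep))
```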


In [0]:
missing_floats = df_missing_values.select_dtypes(include=['float'])
print(missing_floats)

      Lot Frontage  Mas Vnr Area  ...  Bsmt Half Bath  Garage Yr Blt
0            141.0         112.0  ...             0.0         1960.0
1             80.0           0.0  ...             0.0         1961.0
2             81.0         108.0  ...             0.0         1958.0
3             93.0           0.0  ...             0.0         1968.0
4             74.0           0.0  ...             0.0         1997.0
...            ...           ...  ...             ...            ...
1455           NaN           0.0  ...             0.0         2000.0
1456           NaN         227.0  ...             0.0         2001.0
1457          73.0         320.0  ...             0.0         1999.0
1458          75.0         202.0  ...             0.0         1998.0
1459           NaN         396.0  ...             0.0         2001.0

[1460 rows x 9 columns]


In [0]:
# Two imputation options for the float columns: a constant 0 or each column's mean
fill_with_zero = missing_floats.fillna(0)
fill_with_mean = missing_floats.fillna(missing_floats.mean())

In [0]:
float_cols = df_missing_values.select_dtypes(include=['float'])
float_cols = float_cols.fillna(float_cols.mean())
print(float_cols.isnull().sum())

Lot Frontage      0
Mas Vnr Area      0
BsmtFin SF 1      0
BsmtFin SF 2      0
Bsmt Unf SF       0
Total Bsmt SF     0
Bsmt Full Bath    0
Bsmt Half Bath    0
Garage Yr Blt     0
dtype: int64
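
To see how the two fill strategies differ, the sketch below applies both to a hypothetical toy column (`toy`, `zero_filled`, and `mean_filled` are illustrative names). Filling with 0 pulls the column mean down, while filling with the mean leaves it unchanged:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
toy = pd.DataFrame({"Lot Frontage": [60.0, np.nan, 80.0, np.nan]})

zero_filled = toy.fillna(0)
# toy.mean() skips NaN by default, so the fill value is (60 + 80) / 2.
mean_filled = toy.fillna(toy.mean())

print(zero_filled["Lot Frontage"].mean())
print(mean_filled["Lot Frontage"].mean())
```

For a measurement such as lot frontage, where 0 is unlikely to be a realistic value, mean filling is often the less distorting choice, though this depends on why the values are missing.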
