## Feature Engineering Using Data Aggregation on the AMES Housing Dataset
In this exercise, we will create new features using data aggregation. First, we'll calculate the maximum SalePrice and LotArea for each combination of Neighborhood and YrSold. Then, we will add this information back to the dataset, and finally, we will calculate the ratio of each property's SalePrice and LotArea to these two maximum values.

In [1]:
import pandas as pd

In [2]:
file_url = ('https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter12/Dataset/ames_iowa_housing.csv')

In [3]:
df = pd.read_csv(file_url)

In [4]:
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

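| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |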


Perform data aggregation to find the maximum **SalePrice** for each combination of **Neighborhood** and **YrSold** using the **.groupby()** and **.agg()** methods, and save the results in a new DataFrame called **df_agg**:

In [6]:
df_agg = df.groupby(['Neighborhood', 'YrSold']).agg({'SalePrice': 'max'}).reset_index()
df_agg

| | Neighborhood | YrSold | SalePrice |
|---|---|---|---|
| 0 | Blmngtn | 2006 | 264561 |
| 1 | Blmngtn | 2007 | 194201 |
| 2 | Blmngtn | 2008 | 191000 |
| 3 | Blmngtn | 2009 | 192500 |
| 4 | Blmngtn | 2010 | 192000 |
| ... | ... | ... | ... |
| 114 | Timber | 2009 | 375000 |
| 115 | Timber | 2010 | 378500 |
| 116 | Veenker | 2006 | 385000 |
| 117 | Veenker | 2007 | 294000 |


In [9]:
# Rename the df_agg columns to Neighborhood, YrSold, and SalePriceMax:
df_agg.columns = ['Neighborhood', 'YrSold', 'SalePriceMax']
df_agg

| | Neighborhood | YrSold | SalePriceMax |
|---|---|---|---|
| 0 | Blmngtn | 2006 | 264561 |
| 1 | Blmngtn | 2007 | 194201 |
| 2 | Blmngtn | 2008 | 191000 |
| 3 | Blmngtn | 2009 | 192500 |
| 4 | Blmngtn | 2010 | 192000 |
| ... | ... | ... | ... |
| 114 | Timber | 2009 | 375000 |
| 115 | Timber | 2010 | 378500 |
| 116 | Veenker | 2006 | 385000 |
| 117 | Veenker | 2007 | 294000 |
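
The aggregation and renaming steps above can also be combined using pandas named aggregation. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are hypothetical stand-ins for the housing data, not part of the exercise):

```python
import pandas as pd

# Hypothetical toy data standing in for the housing DataFrame (values are made up)
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'YrSold': [2008, 2008, 2009, 2008, 2008],
    'SalePrice': [100000, 150000, 120000, 200000, 180000],
})

# Named aggregation: output_column=(input_column, aggregation_function).
# as_index=False keeps the grouping keys as regular columns,
# so no reset_index() or manual column renaming is needed.
toy_agg = (
    toy.groupby(['Neighborhood', 'YrSold'], as_index=False)
       .agg(SalePriceMax=('SalePrice', 'max'))
)
print(toy_agg)
```

This produces one row per Neighborhood and YrSold pair, with the maximum already stored under the name SalePriceMax.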


Merge the original DataFrame, df, with df_agg using the merge() method with a left join (how='left') on the Neighborhood and YrSold columns, and save the results into a new DataFrame called df_new:

In [10]:
df_new = pd.merge(df, df_agg, how='left', on=['Neighborhood', 'YrSold'])
df_new.head()

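| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | SalePriceMax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | | | | 0 | 2 | 2008 | WD | Normal | 208500 | 287000 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | | | | 0 | 5 | 2007 | WD | Normal | 181500 | 294000 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | | | | 0 | 9 | 2008 | WD | Normal | 223500 | 287000 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 | 250000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | | | | 0 | 12 | 2008 | WD | Normal | 250000 | 350000 |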
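
As a side note, the aggregate-then-merge pattern used above has a shorter alternative in pandas: `groupby(...).transform('max')` returns the group maximum aligned to the original rows. This is not the approach the exercise takes; the sketch below only illustrates, on hypothetical toy data, that both routes should give the same SalePriceMax column:

```python
import pandas as pd

# Hypothetical toy data standing in for the housing DataFrame (values are made up)
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'YrSold': [2008, 2008, 2009, 2008, 2008],
    'SalePrice': [100000, 150000, 120000, 200000, 180000],
})
keys = ['Neighborhood', 'YrSold']

# Route 1 (as in the exercise): aggregate, rename, then left-merge back
toy_agg = toy.groupby(keys).agg({'SalePrice': 'max'}).reset_index()
toy_agg.columns = ['Neighborhood', 'YrSold', 'SalePriceMax']
merged = pd.merge(toy, toy_agg, how='left', on=keys)

# Route 2 (alternative): transform broadcasts each group's max to every row
transformed = toy.groupby(keys)['SalePrice'].transform('max')

# Compare the two results row by row
same = (merged['SalePriceMax'].values == transformed.values).all()
print(same)
```

The merge route keeps df_agg around as a reusable lookup table, which is handy if the same maxima need to be attached to another dataset later.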


Create a new column called SalePriceRatio by dividing SalePrice by SalePriceMax:

In [12]:
df_new['SalePriceRatio'] = df_new['SalePrice'] / df_new['SalePriceMax']
df_new.head()

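| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | SalePriceMax | SalePriceRatio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | | | 0 | 2 | 2008 | WD | Normal | 208500 | 287000 | 0.726481 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | | | 0 | 5 | 2007 | WD | Normal | 181500 | 294000 | 0.617347 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | | | 0 | 9 | 2008 | WD | Normal | 223500 | 287000 | 0.778746 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | | | 0 | 2 | 2006 | WD | Abnorml | 140000 | 250000 | 0.56 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | | | 0 | 12 | 2008 | WD | Normal | 250000 | 350000 | 0.714286 |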
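
To see the arithmetic behind SalePriceRatio in isolation, the short sketch below recomputes it from the SalePrice and SalePriceMax values of the five rows displayed above (the `sample` DataFrame is a hypothetical helper built only for this check):

```python
import pandas as pd

# SalePrice and SalePriceMax copied from the five rows shown in the output above
sample = pd.DataFrame({
    'SalePrice': [208500, 181500, 223500, 140000, 250000],
    'SalePriceMax': [287000, 294000, 287000, 250000, 350000],
})

# Element-wise division: each sale price as a fraction of its group maximum
sample['SalePriceRatio'] = sample['SalePrice'] / sample['SalePriceMax']
print(sample.round(6))
```

A ratio of 1.0 would mean the property is itself the most expensive one sold in its neighborhood that year.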


Perform data aggregation to find the maximum LotArea for each combination of Neighborhood and YrSold using the .groupby() and .agg() methods, and save the results in a new DataFrame called df_agg2:

In [18]:
df_agg2 = df.groupby(['Neighborhood', 'YrSold']).agg({'LotArea': 'max'}).reset_index()

In [25]:
df_agg2.columns = ['Neighborhood', 'YrSold', 'LotAreaMax']
df_agg2

| | Neighborhood | YrSold | LotAreaMax |
|---|---|---|---|
| 0 | Blmngtn | 2006 | 4045 |
| 1 | Blmngtn | 2007 | 3922 |
| 2 | Blmngtn | 2008 | 3182 |
| 3 | Blmngtn | 2009 | 3684 |
| 4 | Blmngtn | 2010 | 3182 |
| ... | ... | ... | ... |
| 114 | Timber | 2009 | 215245 |
| 115 | Timber | 2010 | 57200 |
| 116 | Veenker | 2006 | 50271 |
| 117 | Veenker | 2007 | 17542 |


Merge the df_new DataFrame (which already contains SalePriceMax and SalePriceRatio) with df_agg2 using the merge() method with a left join (how='left') on the Neighborhood and YrSold columns, and save the results into a new DataFrame called df_final:

In [26]:
df_final = pd.merge(df_new, df_agg2, how='left', on=['Neighborhood', 'YrSold'])
df_final.head()

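| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | SalePriceMax | SalePriceRatio | LotAreaMax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | | 0 | 2 | 2008 | WD | Normal | 208500 | 287000 | 0.726481 | 13125 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | | 0 | 5 | 2007 | WD | Normal | 181500 | 294000 | 0.617347 | 17542 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | | 0 | 9 | 2008 | WD | Normal | 223500 | 287000 | 0.778746 | 13125 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | | 0 | 2 | 2006 | WD | Abnorml | 140000 | 250000 | 0.56 | 16560 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | | 0 | 12 | 2008 | WD | Normal | 250000 | 350000 | 0.714286 | 14303 |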


In [28]:
# Create a new column called LotAreaRatio by dividing LotArea by LotAreaMax
df_final['LotAreaRatio'] = df_final['LotArea'] / df_final['LotAreaMax']
df_final.head()

| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | SalePriceMax | SalePriceRatio | LotAreaMax | LotAreaRatio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | 2 | 2008 | WD | Normal | 208500 | 287000 | 0.726481 | 13125 | 0.64381 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | 5 | 2007 | WD | Normal | 181500 | 294000 | 0.617347 | 17542 | 0.547258 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | 9 | 2008 | WD | Normal | 223500 | 287000 | 0.778746 | 13125 | 0.857143 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | 2 | 2006 | WD | Abnorml | 140000 | 250000 | 0.56 | 16560 | 0.576691 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | 12 | 2008 | WD | Normal | 250000 | 350000 | 0.714286 | 14303 | 0.996994 |


In [29]:
df_final[['Id', 'Neighborhood', 'YrSold', 'SalePrice', 'SalePriceMax', 'SalePriceRatio', 'LotArea', 
         'LotAreaMax', 'LotAreaRatio']].head()

| | Id | Neighborhood | YrSold | SalePrice | SalePriceMax | SalePriceRatio | LotArea | LotAreaMax | LotAreaRatio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CollgCr | 2008 | 208500 | 287000 | 0.726481 | 8450 | 13125 | 0.64381 |
| 1 | 2 | Veenker | 2007 | 181500 | 294000 | 0.617347 | 9600 | 17542 | 0.547258 |
| 2 | 3 | CollgCr | 2008 | 223500 | 287000 | 0.778746 | 11250 | 13125 | 0.857143 |
| 3 | 4 | Crawfor | 2006 | 140000 | 250000 | 0.56 | 9550 | 16560 | 0.576691 |
| 4 | 5 | NoRidge | 2008 | 250000 | 350000 | 0.714286 | 14260 | 14303 | 0.996994 |


That's it. We have created two new features that express each property's **SalePrice** and **LotArea** as a fraction of the highest value sold in the same neighborhood and the same year. Because both ratios are relative to the local maximum, they make it easier to compare properties sold in different neighborhoods and years. For instance, from the output of the last step, we can see that the fifth property **(Id 5, LotArea 14260)** was almost as large **(LotAreaRatio 0.996994)** as the biggest property sold **(LotArea 14303)** in the same neighborhood and the same year. Its sale price **(SalePrice 250000)**, however, was well below **(SalePriceRatio 0.714286)** the highest one **(SalePrice 350000)**. This suggests that features other than lot size had an impact on the sale price.
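
Since the exercise repeats the same three steps (aggregate, merge, divide) for both SalePrice and LotArea, they could be wrapped in a small helper. The function below, `add_group_max_ratio`, is a hypothetical name introduced here for illustration and is shown on made-up toy data; treat it as a sketch rather than part of the exercise:

```python
import pandas as pd

def add_group_max_ratio(data, value_col, group_cols):
    """Return a copy of data with <value_col>Max and <value_col>Ratio columns added."""
    max_col = f'{value_col}Max'
    # Step 1: maximum of value_col within each group
    group_max = (
        data.groupby(group_cols, as_index=False)
            .agg(**{max_col: (value_col, 'max')})
    )
    # Step 2: attach the group maximum to every row with a left join
    out = pd.merge(data, group_max, how='left', on=group_cols)
    # Step 3: ratio of each row's value to its group maximum
    out[f'{value_col}Ratio'] = out[value_col] / out[max_col]
    return out

# Hypothetical toy data standing in for the housing DataFrame (values are made up)
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'YrSold': [2008, 2008, 2009, 2008, 2008],
    'SalePrice': [100000, 150000, 120000, 200000, 180000],
    'LotArea': [8000, 10000, 9000, 12000, 15000],
})

result = toy
for col in ['SalePrice', 'LotArea']:
    result = add_group_max_ratio(result, col, ['Neighborhood', 'YrSold'])

# Sanity check: ratios should never exceed 1, and every group
# should contain at least one row whose ratio equals exactly 1
print(result[['SalePriceRatio', 'LotAreaRatio']].max())
print(result.groupby(['Neighborhood', 'YrSold'])[['SalePriceRatio', 'LotAreaRatio']].max())
```

Checking that the largest ratio in every group equals 1 is a quick way to confirm that the merge matched each row to the maximum of its own neighborhood and year.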