# Feature Engineering
Feature engineering, as the name suggests, is the process of generating new features from existing ones.  
Most of the time, feature engineering is used to improve the robustness and accuracy of a model. That is, we derive new, meaningful information from the existing information, so the model can learn from it and become more robust and efficient.  
<br>
However, feature engineering can also make analysis smoother. In this notebook, we'll use it for exactly that, and to clean the data further.

In [1]:
# Import necessary package(s).
import pandas as pd

In the previous notebook, we saved our processed dataframe.  
We'll load that dataframe and assign it to a new __df__.

In [2]:
%store -r df3
df = df3

In [3]:
df.head()

Unnamed: 0,location,total_sqft,bath,price,bedrooms
0,Electronic City Phase II,1056.0,2.0,39.07,2
1,Chikka Tirupathi,2600.0,5.0,120.0,4
2,Uttarahalli,1440.0,2.0,62.0,3
3,Lingadheeranahalli,1521.0,3.0,95.0,3
4,Kothanur,1200.0,2.0,51.0,2


The price per sqft is a good indicator of whether a house is reasonably priced.  
It can easily tell you whether the house is a good deal or not. You could also use it to gauge the value of the house in the real estate market.  
Hence, we'll engineer a new feature, __price_per_sqft__. We have a price column (in lakhs) and a total_sqft column, so we'll convert the price to its full amount and divide it by total_sqft to get the new feature.

In [4]:
df1 = df.copy()

# Price is in lakhs, so multiply by 100000 first (1 lakh = 100,000).
df1['price_per_sqft'] = df1['price']*100000/df1['total_sqft']
df1.head()

Unnamed: 0,location,total_sqft,bath,price,bedrooms,price_per_sqft
0,Electronic City Phase II,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,1200.0,2.0,51.0,2,4250.0


As we can see, it is a lot easier to evaluate the value of a house with this new column.  Better still, you can easily compare different houses and perhaps secure a better deal for yourself.
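
As a quick sanity check on the arithmetic, here is a minimal sketch that recomputes the feature for the first two rows shown above. The `toy` dataframe and the `LAKH` constant are names made up for this illustration; they are not part of the notebook.

```python
import pandas as pd

# Values copied from the first two rows shown above; price is in lakhs.
toy = pd.DataFrame({
    "total_sqft": [1056.0, 2600.0],
    "price": [39.07, 120.0],
})

LAKH = 100_000  # 1 lakh = 100,000 (hypothetical constant name)

# Convert the price to its full amount, then divide by the area.
toy["price_per_sqft"] = toy["price"] * LAKH / toy["total_sqft"]
print(toy.round(2))
```

The two results should agree with the price_per_sqft values in the first two rows of the output above.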

Another important feature that strongly affects the cost of a house is its location.  In Nigeria, there are locations where a 1 bedroom apartment costs much more than a 4 bedroom apartment in another location; the same pattern is apparent in this dataset.  
This goes to show that location is a very important variable in determining the price of a house.  
<br>
Let's have a look at the location column...

In [5]:
# Number of locations in the dataset.
len(df1.location.unique())

1304

Yowza! 1304 locations, that's a lot.  
Let's take a deeper look into this. We can check how many houses are in each of these locations.

In [6]:
# Get rid of unnecessary spaces in the location string.
df1.location = df1.location.apply(lambda x: x.strip())

# Number of houses in each location.
location_tally = df1.groupby('location')['location'].agg('count').sort_values(ascending=False)
location_tally.head(30)

location
Whitefield                  535
Sarjapur  Road              392
Electronic City             304
Kanakpura Road              266
Thanisandra                 236
Yelahanka                   210
Uttarahalli                 186
Hebbal                      176
Marathahalli                175
Raja Rajeshwari Nagar       171
Bannerghatta Road           152
Hennur Road                 150
7th Phase JP Nagar          149
Haralur Road                141
Electronic City Phase II    131
Rajaji Nagar                106
Chandapura                   98
Bellandur                    96
KR Puram                     88
Hoodi                        88
Electronics City Phase 1     87
Yeshwanthpur                 85
Begur Road                   84
Sarjapur                     81
Kasavanhalli                 79
Harlur                       79
Hormavu                      74
Banashankari                 74
Ramamurthy Nagar             73
Kengeri                      73
Name: location, dtype: int64

We can see that some locations are more populated than others.  
However, there could be some locations that are extremely sparse.  
Let's see how many locations have 10 houses or fewer.

In [7]:
# Number of sparse locations
len(location_tally[location_tally<=10])

1052

In [8]:
sparse_locations = location_tally[location_tally<=10]
sparse_locations

location
Basapura                 10
1st Block Koramangala    10
Gunjur Palya             10
Kalkere                  10
Sector 1 HSR Layout      10
                         ..
1 Giri Nagar              1
Kanakapura Road,          1
Kanakapura main  Road     1
Karnataka Shabarimala     1
whitefiled                1
Name: location, Length: 1052, dtype: int64

There are so many of these sparse locations, some with just one house.  
Personally, I do not think these sparse locations will have a great effect on the regression.  
However, if we decide to take all of these locations into account, we will end up with an excessive number of dimensions. 

Hence, I think it's fair to group these sparse locations under one collective label, __"other_locations"__.
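
To make the dimensionality concern concrete, here is a minimal sketch on a made-up column. It assumes the location column would eventually be one-hot encoded (for example with `pd.get_dummies`), which this notebook does not state; the toy values and the threshold of 1 are purely illustrative.

```python
import pandas as pd

# Hypothetical toy column: two common locations and three one-off locations.
toy = pd.Series(["A", "A", "A", "B", "B", "B", "C", "D", "E"], name="location")

counts = toy.value_counts()
sparse = set(counts[counts <= 1].index)  # threshold of 1 only for this toy data

# Relabel every sparse location with one collective label.
grouped = toy.apply(lambda x: "other_locations" if x in sparse else x)

# One-hot encoding creates one indicator column per distinct value.
cols_before = pd.get_dummies(toy).shape[1]
cols_after = pd.get_dummies(grouped).shape[1]
print(cols_before, cols_after)
```

Under that assumption, every distinct location costs one column, so folding the rare ones together shrinks the feature space directly.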

In [9]:
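# Modify the location column so that any location with 10 or fewer houses is replaced with "other_locations".
# Note: `x in sparse_locations` tests membership in the Series index, which holds the location names.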
df1.location = df1.location.apply(lambda x: 'other_locations' if x in sparse_locations else x)

Ok, let's have a look at how many locations we are left with...

In [10]:
len(df1.location.unique())

242

Voilà! We have a more concise location column.
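
The same grouping can also be written without a Python-level lambda. The following is a sketch of one possible alternative on a made-up dataframe, not the approach used in this notebook; `THRESHOLD` and `houses_in_location` are hypothetical names.

```python
import pandas as pd

# Made-up data: one common location and two rarer ones.
toy = pd.DataFrame({"location": ["Whitefield"] * 3 + ["Hebbal"] * 2 + ["Kalkere"]})

# For each row, look up how many houses share that row's location.
houses_in_location = toy["location"].map(toy["location"].value_counts())

THRESHOLD = 2  # hypothetical cut-off for this toy data; the notebook uses 10

# Keep the location where the count exceeds the threshold, otherwise relabel it.
toy["location"] = toy["location"].where(houses_in_location > THRESHOLD, "other_locations")
print(toy["location"].value_counts())
```

`Series.where` keeps a value wherever the condition is true and substitutes the second argument elsewhere, so the whole column is handled in one vectorized step.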

In [11]:
df1.head(10)

Unnamed: 0,location,total_sqft,bath,price,bedrooms,price_per_sqft
0,Electronic City Phase II,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,1200.0,2.0,51.0,2,4250.0
5,Whitefield,1170.0,2.0,38.0,2,3247.863248
6,Old Airport Road,2732.0,4.0,204.0,4,7467.057101
7,Rajaji Nagar,3300.0,4.0,600.0,4,18181.818182
8,Marathahalli,1310.0,3.0,63.25,3,4828.244275
9,other_locations,1020.0,6.0,370.0,6,36274.509804


In the previous notebook, we saw a particular house with 43 bedrooms and barely 2400sqft. How anomalous could that be? You tell me.  
Another typical example of an anomaly is the 10th entry in the dataframe, __df.loc[9]__. Here, a house has 6 bedrooms and barely 1020sqft.  
These are typical outliers, perhaps errors in the dataset, and one must address them to avoid skewing the model's outcome.  
<br>
To address this issue, you could use a rule of thumb: filter out any entry where the sqft per bedroom is below a certain threshold. I'll use 320...

In [12]:
# Entries in the dataset where total_sqft / bedrooms < 320.
print(len(df1[df1.total_sqft/df1.bedrooms<320]))
df1[df1.total_sqft/df1.bedrooms<320].head()

967


Unnamed: 0,location,total_sqft,bath,price,bedrooms,price_per_sqft
9,other_locations,1020.0,6.0,370.0,6,36274.509804
45,HSR Layout,600.0,9.0,200.0,8,33333.333333
58,Murugeshpalya,1407.0,4.0,150.0,6,10660.98081
68,Devarachikkanahalli,1350.0,7.0,85.0,8,6296.296296
70,other_locations,500.0,3.0,100.0,3,20000.0


Wow! There are a lot of them.  Observe __loc[45]__: a 600sqft house with 8 bedrooms seems very anomalous.  
Entries of this type are classified as outliers and need to be addressed before training a model.  
We'll handle several outliers in the next notebook... __Outlier Filtering__.
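
As a preview, here is a minimal sketch of what applying the rule of thumb as a filter could look like, using a few total_sqft and bedrooms values taken from the tables above. The next notebook's actual approach may differ; `toy`, `MIN_SQFT_PER_BEDROOM` and `kept` are names made up for this illustration.

```python
import pandas as pd

# A few (total_sqft, bedrooms) pairs taken from the tables above.
toy = pd.DataFrame({
    "total_sqft": [1056.0, 600.0, 1020.0, 1440.0],
    "bedrooms": [2, 8, 6, 3],
})

MIN_SQFT_PER_BEDROOM = 320  # rule-of-thumb threshold used above

sqft_per_bedroom = toy["total_sqft"] / toy["bedrooms"]

# Keep only the rows at or above the threshold.
kept = toy[sqft_per_bedroom >= MIN_SQFT_PER_BEDROOM]
print(f"dropped {len(toy) - len(kept)} of {len(toy)} rows")
```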

In [13]:
# Save the processed dataframe.
%store df1

Stored 'df1' (DataFrame)


In [14]:
# ifunanyaScript