# Multi-Dimensional Linear Regression
# Feature Engineering

The [dataset](https://github.com/ageron/handson-ml/tree/master/datasets/housing) used for this exercise in multi-dimensional linear regression is a modified version of the California Housing dataset available from [Luís Torgo's page](http://www.dcc.fc.up.pt/%7Eltorgo/Regression/cal_housing.html) (University of Porto). Luís Torgo obtained it from the StatLib repository, which has since closed; the dataset may also be downloaded from StatLib mirrors.

This dataset appeared in a 1997 paper titled "Sparse Spatial Autoregressions" by R. Kelley Pace and Ronald Barry, published in the journal Statistics & Probability Letters. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

## Theoretical background 

- [Generalized Linear Models](http://scikit-learn.org/stable/modules/linear_model.html)
- [Linear Regression Example](http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html) 
--- 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Imputer, StandardScaler

import seaborn as sns
sns.set()

In [2]:
data = pd.read_csv("housing.csv", na_values='') 

In [3]:
data.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [4]:
data.describe()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


--- 
## Normalization

### Cleanse 
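I decided to drop data points that skew the dataset: rows with missing values, rows where `median_house_value` equals 500001, and rows where `housing_median_age` equals 52. Both of these values are the column maxima reported by `describe()`, which suggests they were capped rather than measured.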
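
One way to spot such data points is to check how many rows sit exactly at a column's maximum; a large share hints that the values were capped. The following is a minimal sketch on a small invented frame. `share_at_max` is a hypothetical helper (not part of pandas), and the toy numbers are made up for illustration.

```python
import pandas as pd


def share_at_max(df, columns):
    """Return, per column, the fraction of rows equal to that column's maximum."""
    return pd.Series({col: (df[col] == df[col].max()).mean() for col in columns})


# Toy data (invented): three of six house values sit at the same ceiling.
toy = pd.DataFrame({
    "median_house_value": [120000.0, 500001.0, 500001.0, 250000.0, 500001.0, 90000.0],
    "median_income": [2.5, 8.1, 9.3, 4.0, 7.7, 1.9],
})

shares = share_at_max(toy, ["median_house_value", "median_income"])
print(shares)
```

A column whose maximum is shared by many rows (here `median_house_value`) is a candidate for this kind of filtering, while a column whose maximum occurs only once (here `median_income`) is probably not capped.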

In [5]:
data_norm = data.copy()

In [6]:
data_norm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [7]:
# drop rows that contain any NA (i.e. missing) values
data_norm = data_norm.dropna(axis=0, how='any')

In [8]:
# drop rows with median_house_value equal to 500001
data_norm = data_norm[data_norm['median_house_value'] != 500001]

In [9]:
# drop rows with housing_median_age equal to 52
data_norm = data_norm[data_norm['housing_median_age'] != 52]

In [10]:
data_norm = data_norm.reset_index(drop=True)

In [11]:
data_norm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18379 entries, 0 to 18378
Data columns (total 10 columns):
longitude             18379 non-null float64
latitude              18379 non-null float64
housing_median_age    18379 non-null float64
total_rooms           18379 non-null float64
total_bedrooms        18379 non-null float64
population            18379 non-null float64
households            18379 non-null float64
median_income         18379 non-null float64
median_house_value    18379 non-null float64
ocean_proximity       18379 non-null object
dtypes: float64(9), object(1)
memory usage: 1.4+ MB
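
The two `info()` outputs show the frame shrinking from 20640 to 18379 rows. To see how much each filter contributes, the steps can be applied one at a time while recording the row count. This is only a sketch on an invented toy frame; `apply_filters` is a hypothetical helper and the values are made up.

```python
import numpy as np
import pandas as pd


def apply_filters(df, filters):
    """Apply named keep-mask filters in order and record rows removed by each."""
    removed = {}
    for name, keep in filters:
        before = len(df)
        df = df[keep(df)]
        removed[name] = before - len(df)
    return df.reset_index(drop=True), removed


# Toy data (invented): one missing value, one capped price, two capped ages.
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, 280.0],
    "median_house_value": [452600.0, 358500.0, 500001.0, 341300.0, 342200.0],
    "housing_median_age": [41.0, 21.0, 30.0, 52.0, 52.0],
})

filters = [
    ("missing values", lambda d: d.notna().all(axis=1)),
    ("median_house_value == 500001", lambda d: d["median_house_value"] != 500001),
    ("housing_median_age == 52", lambda d: d["housing_median_age"] != 52),
]

cleaned, removed = apply_filters(toy, filters)
print(removed)
print(len(cleaned))
```

Because the filters run in sequence, a row that matches several conditions is counted only under the first one that removes it.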


--- 
## Skewness 

Next, I determine the skewness of the numerical features (as seen in [California-House-Price-Prediction by sonarsushant](https://github.com/sonarsushant/California-House-Price-Prediction)).

In [14]:
num_features=['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income']
cat_featues=['ocean_proximity']

In [15]:
skewness=[]

for i in num_features:
    skewness.append(data_norm[i].skew())
    
pd.DataFrame(data=skewness,index=num_features,columns=['skewness']).sort_values(by='skewness',ascending=False)

|   | skewness |
|---|---|
| population | 4.955245 |
| total_rooms | 4.184387 |
| total_bedrooms | 3.425574 |
| households | 3.382151 |
| median_income | 0.88953 |
| latitude | 0.519514 |
| housing_median_age | -0.055777 |
| longitude | -0.342703 |
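
For reference, `Series.skew()` reports a bias-adjusted sample skewness: values near 0 indicate a roughly symmetric distribution, while large positive values (as for `population` and `total_rooms`) indicate a long right tail. The sketch below recomputes that statistic by hand on synthetic data and compares it with pandas. The helper name `sample_skewness` and the generated data are assumptions for illustration only, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness, the statistic pandas reports for skew()."""
    x = np.asarray(values, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # biased second central moment
    m3 = np.mean(centered ** 3)  # biased third central moment
    g1 = m3 / m2 ** 1.5          # unadjusted skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1


# Synthetic data: a long right tail versus a symmetric bell shape.
rng = np.random.default_rng(0)
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))
symmetric = pd.Series(rng.normal(size=1000))

manual = sample_skewness(right_skewed)
print(manual, right_skewed.skew())
print(sample_skewness(symmetric), symmetric.skew())
```

The hand-computed value should agree with `skew()` up to floating-point error, with the right-tailed sample giving a clearly positive result and the symmetric sample a value close to zero.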
