# Predict house prices using linear regression

- **Tasks**:
    1. Use linear regression to predict house prices (using `tf.estimator.LinearRegressor`).
    2. Further improve the accuracy by switching to a deep learning estimator.
    
   
- **Dataset**:
    1. California Housing dataset, originally published at http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html
    
    
- **Highlights**:
    1. Use scikit-learn for preprocessing (scaling/normalizing features)
    2. Use scikit-learn for the train/test split

In [2]:
import pandas as pd

In [3]:
dataset = pd.read_csv("datasets/cal_housing_clean.csv")
dataset.head()

Unnamed: 0,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue
0,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0


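## 1. Feature engineering
- Normalize the column values so that they all lie between 0 and 1. Columns such as `totalRooms` and `medianIncome` are on very different scales, and bringing them to a common range makes training easier. Use `sklearn.preprocessing` for this.
- Note that the target column `medianHouseValue` is scaled too, so the losses reported later are in scaled units rather than dollars.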


In [4]:
from sklearn import preprocessing

In [5]:
# dataset.columns
columns = ['housingMedianAge', 'totalRooms', 'totalBedrooms', 'population', 'households', 'medianIncome', 'medianHouseValue']

min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))

In [6]:
dataset[columns] = min_max_scaler.fit_transform(dataset[columns])
dataset.head()

Unnamed: 0,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue
0,0.784314,0.022331,0.019863,0.008941,0.020556,0.539668,0.902266
1,0.392157,0.180503,0.171477,0.06721,0.186976,0.538027,0.708247
2,1.0,0.03726,0.02933,0.013818,0.028943,0.466028,0.695051
3,1.0,0.032352,0.036313,0.015555,0.035849,0.354699,0.672783
4,1.0,0.04133,0.043296,0.015752,0.042427,0.230776,0.674638
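
To make the transformation concrete, here is a minimal pure-Python sketch of the formula x' = (x - min) / (max - min) that a min-max scaler with `feature_range=(0, 1)` applies to each column. The helper name `min_max_scale` is hypothetical, and the column minimum of 1.0 is an assumption inferred from the scaled output above (the raw sample only shows the values 21, 41 and 52).

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant column: map every value to the lower bound of the range.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# housingMedianAge values from the first rows above, plus an ASSUMED
# column minimum of 1.0 (not visible in the sample; inferred from the output).
ages = [1.0, 21.0, 41.0, 52.0]
scaled_ages = min_max_scale(ages)

for raw, scaled in zip(ages, scaled_ages):
    print(f"{raw:5.1f} -> {scaled:.6f}")
```

Under that assumption, 41.0 maps to 40/51 ≈ 0.784314 and 21.0 to 20/51 ≈ 0.392157, which agrees with rows 0 and 1 of the scaled table.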


## 2. Train and test split

In [7]:
# Split the data into features (x_data) and the target column medianHouseValue (y_data)
x_data = dataset.drop("medianHouseValue", axis=1)
y_data = dataset["medianHouseValue"]
x_data.head()

Unnamed: 0,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome
0,0.784314,0.022331,0.019863,0.008941,0.020556,0.539668
1,0.392157,0.180503,0.171477,0.06721,0.186976,0.538027
2,1.0,0.03726,0.02933,0.013818,0.028943,0.466028
3,1.0,0.032352,0.036313,0.015555,0.035849,0.354699
4,1.0,0.04133,0.043296,0.015752,0.042427,0.230776


In [8]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.30)

print(" Total size of the original dataset: {}, Train dataset: {} Test dataset: {}".format(len(x_data), len(x_train), len(x_test)))

 Total size of the original dataset: 20640, Train dataset: 14448 Test dataset: 6192
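
The sizes printed above follow from `test_size=0.30`. Here is a small sketch of the arithmetic, assuming scikit-learn rounds the test share up and gives the remainder to the training set when `train_size` is not set. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(20640, 0.30)
print(n_train, n_test)
```

Because `train_test_split` shuffles by default and no `random_state` is passed here, which rows land in each split (and therefore the metrics further down) will vary from run to run.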


## 3. Create feature columns

In [9]:
import tensorflow as tf

print(x_train.columns)

feature_cols = [
    tf.feature_column.numeric_column(key="housingMedianAge"),
    tf.feature_column.numeric_column(key="totalRooms"),
    tf.feature_column.numeric_column(key="totalBedrooms"),
    tf.feature_column.numeric_column(key="population"),
    tf.feature_column.numeric_column(key="households"),
    tf.feature_column.numeric_column(key="medianIncome")
]

Index(['housingMedianAge', 'totalRooms', 'totalBedrooms', 'population',
       'households', 'medianIncome'],
      dtype='object')


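## 4. Create the input functions (they fetch the inputs and pass them to the model)
- An input function is a convenience wrapper that feeds data to the estimator in batches. It can be built from pandas data (`pandas_input_fn`, used here) or from NumPy arrays (`numpy_input_fn`).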

In [16]:
# Create the Input Function for Training
train_input_function = tf.estimator.inputs.pandas_input_fn(x=x_train, y=y_train, shuffle=False, batch_size=128, num_epochs=100)

# Create the Input Function for Test / Evaluation
test_input_function = tf.estimator.inputs.pandas_input_fn(x=x_test, y=y_test, shuffle=False)
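
For orientation, the number of batches each input function yields can be worked out from the split sizes. This sketch assumes the `pandas_input_fn` defaults of `batch_size=128` and `num_epochs=1` for the test function (the call above sets neither), and that a final partial batch is still emitted. `batches_per_epoch` is a hypothetical helper.

```python
import math


def batches_per_epoch(n_rows, batch_size=128):
    """Number of batches in one pass, counting a final partial batch."""
    return math.ceil(n_rows / batch_size)


train_rows, test_rows = 14448, 6192  # sizes printed after the split

train_batches = batches_per_epoch(train_rows)
test_batches = batches_per_epoch(test_rows)
last_test_batch = test_rows - (test_batches - 1) * 128
print(train_batches, test_batches, last_test_batch)

# Rows consumed by a training call capped at 100 steps of 128 rows each
rows_in_100_steps = 100 * 128
print(rows_in_100_steps, rows_in_100_steps < train_rows)
```

One consequence: a `train(..., steps=100)` call sees 12,800 rows, which is less than one full pass over the 14,448 training rows, so the `num_epochs=100` limit is never reached during training.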


## 5. Train the model

In [17]:
# Create the model
model = tf.estimator.LinearRegressor(feature_columns=feature_cols)

W1103 00:34:42.039487  9340 estimator.py:1811] Using temporary folder as model directory: C:\Users\Obaid\AppData\Local\Temp\tmp_cpdz1od


In [20]:
model.train(input_fn=train_input_function, steps=100)

<tensorflow_estimator.python.estimator.canned.linear.LinearRegressor at 0x1d414dffb70>

## 6. Evaluate: compare train and test metrics

In [21]:
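# Evaluate on the training data. train_input_function repeats the data
# (num_epochs=100), which lengthens this pass without changing the average loss.
train_metrics = model.evaluate(input_fn=train_input_function)

# Evaluate on the held-out test data (single pass)
test_metrics = model.evaluate(input_fn=test_input_function)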

In [14]:
train_metrics

{'average_loss': 0.027483802,
 'label/mean': 0.39470482,
 'loss': 3.5140352,
 'prediction/mean': 0.3734718,
 'global_step': 113}

In [15]:
test_metrics

{'average_loss': 0.027077353,
 'label/mean': 0.39762032,
 'loss': 3.421693,
 'prediction/mean': 0.37482846,
 'global_step': 113}
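
Because the target column was scaled together with the features, `average_loss` is a mean squared error in scaled units. The sketch below converts the test figure to an approximate RMSE in dollars by recovering the scaler's range from two rows of the tables above, and checks how `loss` relates to `average_loss`. Reading `loss` as the mean of the per-batch summed losses, and the count of 49 batches (6192 rows at the default batch size of 128), are assumptions; they happen to reproduce the reported number.

```python
import math

# Test metrics reported above
average_loss = 0.027077353   # mean squared error per example, scaled units
reported_loss = 3.421693

# Two (raw, scaled) medianHouseValue pairs from rows 0 and 1 of the tables
raw_a, scaled_a = 452600.0, 0.902266
raw_b, scaled_b = 358500.0, 0.708247

# For min-max scaling, scaled = (raw - min) / value_range, so the range is
# the ratio of the raw difference to the scaled difference.
value_range = (raw_a - raw_b) / (scaled_a - scaled_b)

rmse_scaled = math.sqrt(average_loss)
rmse_dollars = rmse_scaled * value_range
print(f"RMSE: {rmse_scaled:.4f} scaled, about ${rmse_dollars:,.0f}")

# ASSUMPTION: 'loss' = total squared error / number of batches
n_rows, n_batches = 6192, 49
loss_estimate = average_loss * n_rows / n_batches
print(f"estimated loss: {loss_estimate:.4f} (reported: {reported_loss})")
```

The resulting error is on the order of $80,000, which is large for house values in the hundreds of thousands of dollars and motivates the second task of trying a deep learning estimator. The train and test losses are close to each other, so the linear model does not appear to be overfitting.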