# California House Pricing

## 01 : Framing the problem

Build a model of housing prices in California using the California census data. The data has metrics such as population, median income, and median housing price for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data.

## 02 : Obtain Data

#### Importing the basic required libraries

In [0]:
!pip install missingno

In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as ms
%matplotlib inline

#### Reading the data from CSV file

In [0]:
!ls -l

In [0]:
!wget https://www.dropbox.com/s/x4lk7tftrij2psh/housing.csv

In [0]:
!ls -l

In [0]:
housing_data = pd.read_csv('housing.csv')

In [0]:
housing_data.head()

In [0]:
housing_data.info()

In [0]:
housing_data.isnull().sum()
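
The missing-value count above can be complemented with the share of missing rows per column. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not rows from `housing.csv`) to show both numbers side by side.

```python
import numpy as np
import pandas as pd

# Toy stand-in for housing_data; the values are made up for illustration.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Number of missing values per column, as in housing_data.isnull().sum()
missing_count = toy.isnull().sum()

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

On the real data the same two lines can be applied to `housing_data` directly.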

## 03 : Analyze Data

#### Obtaining a glimpse of data

In [0]:
housing_data.info()

In [0]:
housing_data.head()

In [0]:
housing_data.tail()

In [0]:
housing_data.describe()

#### Generating a Correlation heatmap

In [0]:
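# Correlation is only defined for numeric columns, so leave out the categorical 'ocean_proximity'
corr = housing_data.select_dtypes(include='number').corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)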

In [0]:
corr['median_house_value']
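
The correlations with the target are easier to read when sorted. The sketch below works on a tiny made-up frame (`toy` is an illustrative assumption, not the census data) and ranks the features by their correlation with `median_house_value`.

```python
import pandas as pd

# Toy stand-in for housing_data; the values are made up for illustration.
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "latitude": [5.0, 4.0, 3.0, 2.0, 1.0],
    "median_house_value": [10.0, 20.0, 30.0, 40.0, 50.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND", "INLAND"],
})

# Keep numeric columns only, then compute the pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()

# Correlation of every feature with the target, strongest positive first
target_corr = (
    corr["median_house_value"]
    .drop("median_house_value")
    .sort_values(ascending=False)
)
print(target_corr)
```

Values near +1 or -1 indicate a strong linear relationship with the target; values near 0 indicate a weak one.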

#### Generating a strip plot

In [0]:
housing_data['ocean_proximity'].value_counts()

In [0]:
housing_data['ocean_proximity'].unique()

In [0]:
sns.stripplot(x="ocean_proximity", y="median_house_value", data=housing_data)
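
A numeric summary per category can back up what the strip plot shows. This is a minimal sketch on made-up values (`toy` is an illustrative assumption); it reports the row count and the median house value for each `ocean_proximity` category.

```python
import pandas as pd

# Toy stand-in for housing_data; the values are made up for illustration.
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR BAY"],
    "median_house_value": [100000.0, 140000.0, 250000.0, 300000.0, 350000.0],
})

# Row count and median house value for each category
stats = toy.groupby("ocean_proximity")["median_house_value"].agg(["count", "median"])
print(stats)
```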

In [0]:
housing = housing_data.copy()

In [0]:
housing.plot(kind="scatter", x="longitude", y="latitude")

In [0]:
housing.plot(kind="scatter", x="longitude", y="latitude",alpha=0.1)

In [0]:
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.scatter(housing['longitude'], housing['latitude'])
ax2.scatter(housing['longitude'], housing['latitude'], alpha=0.1)
plt.show()

In [0]:
# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

In [0]:
# Histogram of every numeric column (matplotlib is already imported above)
housing.hist(bins=25, figsize=(20,15))
plt.show()

## 04 : Feature Engineering

#### Fill the missing values in the obtained data

In [0]:
# Visualize the data to see if there are any missing values
ms.matrix(housing_data)

In [0]:
# Fill the missing values in 'total_bedrooms' with the column mean
housing_data['total_bedrooms'] = housing_data['total_bedrooms'].fillna(housing_data['total_bedrooms'].mean())

In [0]:
# Round up so that the bedroom counts are whole numbers
housing_data['total_bedrooms'] = np.ceil(housing_data['total_bedrooms']).astype(int)

In [0]:
ms.matrix(housing_data)
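
The fill-and-round step can be checked by hand on a few rows. The sketch below uses made-up bedroom counts (`toy` is an illustrative assumption) so that the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'total_bedrooms' column; the values are made up.
toy = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, 237.0]})

# mean() skips NaN by default: (129 + 190 + 237) / 3 = 185.33...
mean_bedrooms = toy["total_bedrooms"].mean()

# Replace the missing entry with the mean, then round up to a whole number
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(mean_bedrooms)
toy["total_bedrooms"] = np.ceil(toy["total_bedrooms"]).astype(int)

print(toy["total_bedrooms"].tolist())  # [129, 186, 190, 237]
```

The mean is only one possible fill value; the median is a common alternative when a column has a long tail.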

### Categorical value conversion

In [0]:
# get_dummies() converts the categorical 'ocean_proximity' column into numeric indicator columns;
# drop_first=True drops the first category, which the remaining columns already imply
df = pd.get_dummies(housing_data['ocean_proximity'], drop_first=True)
df.head()

In [0]:
# Concatenating the dummy columns of 'ocean_proximity' to housing_data
housing_data = pd.concat([housing_data, df], axis=1)
housing_data.head()

In [0]:
housing_data.tail()

In [0]:
#Since we have created dummy columns for 'ocean_proximity', we are dropping the column
housing_data.drop('ocean_proximity', inplace=True, axis=1)
housing_data.info()

In [0]:
# DataFrame.columns returns an Index holding all the column labels of the dataframe
housing_data.columns
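
To see what the dummy encoding does, here is a minimal sketch on four made-up rows (`toy` is an illustrative assumption). With `drop_first=True` the first category in sorted order, here `<1H OCEAN`, gets no column of its own and is represented by all indicators being 0.

```python
import pandas as pd

# Toy stand-in for the 'ocean_proximity' column; the values are made up.
toy = pd.DataFrame({"ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR BAY", "INLAND"]})

# One indicator column per category, minus the first category in sorted order
dummies = pd.get_dummies(toy["ocean_proximity"], drop_first=True)

# Cast to int so the result prints as 0/1 regardless of the pandas version
print(dummies.astype(int))
```

This is why the feature list used for training has `INLAND`, `ISLAND`, `NEAR BAY` and `NEAR OCEAN` but no `<1H OCEAN` column.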

## 05 : Model Selection

### Train-Test Split

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X = housing_data[['longitude', 'latitude', 'housing_median_age', 'total_rooms','total_bedrooms', 
                  'population', 'households', 'median_income','INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']]

y = housing_data['median_house_value']

# Split the dataset: 60% for training and 40% for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 0)
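
The effect of `test_size` can be checked on a small array. The sketch below uses ten made-up rows (the arrays are illustrative assumptions) and the same `test_size` and `random_state` as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.4 sends 4 of the 10 rows to the test set and keeps 6 for training
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print(x_train.shape, x_test.shape)  # (6, 2) (4, 2)
```

Every row lands in exactly one of the two sets, so the test rows are never seen during training.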

### Linear Regression

#### Training the model

In [0]:
#Import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression

#Initializing the model
lm = LinearRegression()

#Fit the data to the algorithm
lm.fit(x_train, y_train)

In [0]:
lm.intercept_

In [0]:
lm.coef_
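
The raw `coef_` array is easier to interpret when each value is paired with its feature name. The sketch below fits a model on synthetic data generated from a known formula (the data and the two feature columns are illustrative assumptions), so the recovered coefficients can be compared with the true ones.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data from a known relationship: y = 3*income - 2*age + 5
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "median_income": rng.uniform(0, 10, 50),
    "housing_median_age": rng.uniform(1, 50, 50),
})
y = 3.0 * X["median_income"] - 2.0 * X["housing_median_age"] + 5.0

lm = LinearRegression().fit(X, y)

# Pair each coefficient with the column it belongs to
coef_table = pd.Series(lm.coef_, index=X.columns)
print(coef_table)
print("intercept:", lm.intercept_)
```

Each coefficient is the change in the prediction for a one-unit increase in that feature with the other features held fixed, so the magnitudes are only comparable across features that share a scale.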

### Predicting on the test set

In [0]:
# Predict on the test set and plot the actual values against the predictions
y_hat = lm.predict(x_test)
plt.scatter(y_test, y_hat)
plt.xlabel('Actual median_house_value')
plt.ylabel('Predicted median_house_value')

## 06 : Evaluate the predictions

In [0]:
y_hat[:5]

In [0]:
y_test[:5]

In [0]:
from sklearn import metrics
# Mean Absolute Error
print('MAE:', metrics.mean_absolute_error(y_test, y_hat))

# Mean Squared Error
print('MSE:', metrics.mean_squared_error(y_test, y_hat))

# Root Mean Squared Error
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_hat)))
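
The three metrics can be worked out by hand on a few values. The sketch below uses four made-up pairs of actual and predicted values (illustrative assumptions, not model output) and computes each metric with plain NumPy.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_pred - y_true          # [10, -10, 30, -40]

mae = np.mean(np.abs(errors))     # (10 + 10 + 30 + 40) / 4 = 22.5
mse = np.mean(errors ** 2)        # (100 + 100 + 900 + 1600) / 4 = 675.0
rmse = np.sqrt(mse)               # about 25.98

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)
print("sklearn MAE:", metrics.mean_absolute_error(y_true, y_pred))
```

MAE and RMSE are in the same unit as the target, so they can be read as a typical error in dollars; RMSE penalizes large errors more heavily than MAE.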

## 07 : Predicting on Validation set
This problem does not have a separate validation set.
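
When no separate validation set is available, one common workaround is k-fold cross-validation on the training data. It is not used in this notebook; the sketch below only illustrates the idea on synthetic data (the arrays and the noise level are illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: a linear signal plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=100)

# 5-fold cross-validation; scikit-learn returns the negated RMSE so that higher is better
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores

print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```

Each fold serves once as the held-out part, so the mean RMSE gives a validation-style estimate without touching the test set.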
