**California Housing Price Prediction**
Download the dataset here: https://www.kaggle.com/datasets/camnugent/california-housing-prices

**Background of Problem Statement:**

The US Census Bureau has published California Census Data, which contains 10 metrics such as population, median income, and median housing price for each block group in California. The dataset also serves as an input for project scoping and for specifying the project's functional and nonfunctional requirements.


**Problem Objective**

The project aims at building a model of housing prices to predict median house values in California using the provided dataset. This model should learn from the data and be able to predict the median housing price in any district, given all the other metrics.

Districts or block groups are the smallest geographical units for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). There are 20,640 districts in the project dataset.

Domain: Finance and Housing

**UNDERSTANDING THE DATASET**

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Importing libraries and resources
import plotly.graph_objects as go
import seaborn as sns
import plotly.express as ex
%matplotlib inline
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import LabelEncoder
import statsmodels.api as sm
from scipy import stats
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.tree import DecisionTreeRegressor
from sklearn import ensemble

In [None]:
dataset = pd.read_csv("/content/california_housing_price.csv")

In [None]:
df = dataset.copy()

In [None]:
df.head()

1. longitude: A measure of how far west a house is; California longitudes are negative, so a more negative value is farther west
2. latitude: A measure of how far north a house is; a higher value is farther north
3. housing_median_age: Median age of a house within a block; a lower number is a newer building
4. total_rooms: Total number of rooms within a block
5. total_bedrooms: Total number of bedrooms within a block
6. population: Total number of people residing within a block
7. households: Total number of households (groups of people residing within a home unit) for a block
8. median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
9. median_house_value: Median house value for households within a block (measured in US Dollars)
10. ocean_proximity: Location of the house relative to the ocean/sea

In [None]:
df.describe()

In [None]:
df.info()

**Shape of data**

In [None]:
nRow, nCol = df.shape
print("Shape of dataset {}".format(df.shape))
print(f"Rows: {nRow} \nColumns: {nCol}")

**Handle missing data**

In [None]:
print(df.isna().sum())
df.isna().sum().plot(kind='bar')

In [None]:
df.dropna(thresh = 10, inplace=True)
print(df.isna().sum())
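
The `thresh` argument keeps only rows with at least that many non-missing values, so `thresh=10` on a 10-column frame drops every row that has any missing value. A minimal sketch of that behaviour, using a made-up three-column frame (`toy` is hypothetical, not the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 3 columns; not the housing data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, np.nan],
})

# thresh equal to the number of columns keeps only fully populated rows,
# which matches the default dropna() behaviour.
complete = toy.dropna(thresh=toy.shape[1])

# A lower thresh also keeps rows with at least that many non-missing values.
at_least_two = toy.dropna(thresh=2)

print(len(complete), len(at_least_two))
```

If the number of columns ever changes, `df.dropna()` without `thresh` may be the less fragile way to express the same intent.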

**Check and remove duplicated data**

In [None]:
df.duplicated().value_counts()

In [None]:
# inplace=True returns None, so there is nothing useful to assign
df.drop_duplicates(subset=None, keep="first", inplace=True)
df.duplicated().value_counts()

**Handle outlier values**

In [None]:
plt.figure(figsize=(18,8))
sns.boxplot(data=df)

**Handling total rooms outliers**

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,8))
sns.boxplot(data=df, y='total_rooms', ax=ax[0], color='#7209b7')
sns.scatterplot(data=df,x = 'median_house_value', s = 100, y='total_rooms', ax=ax[1])

In [None]:
df1 = df[(df['total_rooms'] <= 23000)]
df1.shape[0]  # row count after filtering (df1, not df)

*225 rows (about 1%) were removed to handle total rooms outliers*

**Handling total bedrooms outliers**

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,8))
sns.boxplot(data=df1, y='total_bedrooms', ax=ax[0], color='#7209b7')
sns.scatterplot(data=df1,x = 'median_house_value', s = 100, y='total_bedrooms', ax=ax[1])

In [None]:
df2 = df1[(df1['total_bedrooms'] < 3000)]
df2.shape[0]

*49 rows (about 0.24%) were removed to handle total bedrooms outliers*

**Handling population outliers**

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,8))
sns.boxplot(data=df2, y='population', ax=ax[0], color='#7209b7')
sns.scatterplot(data=df2,x = 'median_house_value', s = 100, y='population', ax=ax[1])

In [None]:
df3 = df2[(df2['population'] < 7500)]
df3.shape[0]

*37 rows (about 0.18%) were removed to handle population outliers*

**Handling households outliers**

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,8))
sns.boxplot(data=df3, y='households', ax=ax[0], color='#7209b7')
sns.scatterplot(data=df3,x = 'median_house_value', s = 100, y='households', ax=ax[1])

In [None]:
df4 = df3[(df3['households'] < 2300)]
df4.shape[0]

*47 rows (about 0.23%) were removed to handle households outliers*

**Handling median income outliers**

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,8))
sns.boxplot(data=df4, y='median_income', ax=ax[0], color='#7209b7')
sns.scatterplot(data=df4,x = 'median_house_value', s = 100, y='median_income', ax=ax[1])

In [None]:
df5 = df4[(df4['median_income'] < 11)]
df5.shape[0]

*156 rows (about 0.76%) were removed to handle median income outliers*

In total, 514 rows (about 2.5%) were removed to handle outliers.
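
The per-step counts and percentages above can be computed rather than read off by hand. The following is a rough sketch on synthetic data: the `apply_filter` helper and the `toy` frame are hypothetical, and only the two thresholds are taken from the cells above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing frame (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "total_rooms": rng.integers(100, 30000, size=1000),
    "population": rng.integers(50, 10000, size=1000),
})

def apply_filter(frame, mask, label, n_start):
    """Apply a boolean mask and report how many rows it removed."""
    kept = frame[mask]
    removed = len(frame) - len(kept)
    print(f"{label}: removed {removed} rows ({removed / n_start:.2%} of the original)")
    return kept, removed

n_start = len(toy)
step1, r1 = apply_filter(toy, toy["total_rooms"] <= 23000, "total_rooms", n_start)
step2, r2 = apply_filter(step1, step1["population"] < 7500, "population", n_start)

total_removed = n_start - len(step2)
print(f"total: removed {total_removed} rows ({total_removed / n_start:.2%})")
```

Reporting every percentage against the original row count keeps the per-step figures additive, so they sum to the total.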

**Visualize how population and house prices are distributed across the map**

In [None]:
df.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
        s=df["population"]/100, label="population", figsize=(15,8),
        c="median_house_value", cmap=plt.get_cmap("jet"),colorbar=True,
    )
plt.legend()
plt.show()

**Correlation**

In [None]:
plt.figure(figsize = (12,8))
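# numeric_only=True skips text columns; ocean_proximity appears only once it has been encoded
sns.heatmap(df5.corr(numeric_only=True) , annot = True , cmap = "YlGnBu")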

* median income has a 68% correlation with median house value.
* total rooms has a 16% correlation with median house value.
* latitude has a 14% correlation with median house value.
* housing median age has an 11% correlation with median house value.
* longitude has a 92% correlation with latitude.
* longitude has a 29% correlation with ocean proximity.
* latitude has a 20% correlation with ocean proximity.
* median income has a 24% correlation with total rooms.
* housing median age has an average 32% correlation with total rooms, total bedrooms, population and households.
* total rooms, total bedrooms, population and households have an average 92% correlation with each other.
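
The correlations with the target can also be listed directly, sorted by strength, instead of being read from the heatmap. This is a minimal sketch on synthetic data: the column names mirror the dataset, but the values and the relationship between them are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
income = rng.normal(4.0, 1.5, n)
rooms = rng.normal(2500, 800, n)

# Made-up target driven mostly by income, purely for illustration.
toy = pd.DataFrame({
    "median_income": income,
    "total_rooms": rooms,
    "median_house_value": 40000 * income + 5 * rooms + rng.normal(0, 30000, n),
})

# Correlation of every numeric feature with the target, strongest first.
target_corr = (
    toy.corr(numeric_only=True)["median_house_value"]
    .drop("median_house_value")
    .sort_values(key=np.abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative correlations near the top, and the sign stays visible in the printed output.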



**Feature distributions**

In [None]:
# Histogram

fig = plt.figure(figsize=(15, 5))

plt.subplot(131)

(sns.distplot(df5["housing_median_age"], bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "Housing Median Age", ylabel = "Density", title = "Median House Age Histogram"));

plt.subplot(132)

(sns.distplot(df5["total_rooms"], bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "Total Rooms", ylabel = "Density", title = "Total Rooms Histogram"));

plt.subplot(133)

(sns.distplot(df5["total_bedrooms"], bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "Total Bedrooms", ylabel = "Density", title = "Total Bedrooms Histogram"));

plt.tight_layout()
plt.show()

# Boxplot

fig = plt.figure(figsize=(15, 5))

plt.subplot(131)
sns.boxplot(y=df5["housing_median_age"], color="skyblue").set_title('Median House Age Boxplot')

plt.subplot(132)
sns.boxplot(y=df5["total_rooms"], color="skyblue").set_title('Total Rooms Boxplot')

plt.subplot(133)
sns.boxplot(y=df5["total_bedrooms"], color="skyblue").set_title('Total Bedrooms Boxplot')

plt.tight_layout()
plt.show()


# Histogram

fig = plt.figure(figsize=(15, 5))

plt.subplot(131)

(sns.distplot(df5["population"], bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "Population", ylabel = "Density", title = "Population Histogram"));

plt.subplot(132)

(sns.distplot(df5["households"], bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "Households", ylabel = "Density", title = "Households Histogram"));

plt.subplot(133)

(sns.distplot(df5["median_income"], bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "Median Income", ylabel = "Density", title = "Median Income Histogram"));

plt.tight_layout()
plt.show()


# Boxplot

fig = plt.figure(figsize=(15, 5))

plt.subplot(131)
sns.boxplot(y=df5["population"], color="skyblue").set_title('Population Boxplot')

plt.subplot(132)
sns.boxplot(y=df5["households"], color="skyblue").set_title('Households Boxplot')

plt.subplot(133)
sns.boxplot(y=df5["median_income"], color="skyblue").set_title('Median Income Boxplot')

plt.tight_layout()
plt.show()

**Ocean proximity values' counts**

In [None]:
ex.pie(df5,names='ocean_proximity',title='Proportion of Locations of the house w.r.t ocean/sea')

* 44.06% of houses are <1H Ocean.
* 31.96% of houses are Inland.
* 12.85% of houses are Near ocean.
* 11.11% of houses are Near bay.
* 0.02% of houses are Island.

In [None]:
fig=plt.figure(figsize=(17, 4))
plt.subplot(132)
# take both columns from df5 so the rows line up
sns.boxplot(data=df5, x="ocean_proximity", y="median_house_value", palette="Blues").set_title('Median House Value Boxplot by Ocean Proximity')

plt.tight_layout()
plt.show()

In [None]:
# sns.pairplot(df)

**Encode categorical values**

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
ocean_proximity_le = LabelEncoder()

In [None]:
df5 = df5.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
df5['ocean_proximity'] = ocean_proximity_le.fit_transform(df5['ocean_proximity'])

In [None]:
df5["ocean_proximity"].value_counts()

In [None]:
print("representation value")
for i in range(len(ocean_proximity_le.classes_)):
    print(f"{i}\t\t{ocean_proximity_le.classes_[i]}")
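
`LabelEncoder` assigns integer codes in sorted order of the labels it sees during fitting, so the mapping depends on the data it was fitted on. A small sketch of why new data should go through `transform()` with the already fitted encoder rather than a fresh `fit_transform()` (the two label lists are made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up label lists for illustration.
train_labels = ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "NEAR OCEAN"]
new_labels = ["NEAR OCEAN", "INLAND"]

le = LabelEncoder()
train_codes = le.fit_transform(train_labels)  # learns the label -> code mapping
new_codes = le.transform(new_labels)          # reuses that mapping

# Refitting on the new data alone would produce a different mapping.
refit_codes = LabelEncoder().fit_transform(new_labels)

print(list(le.classes_))
print(list(new_codes), list(refit_codes))
```

Here `INLAND` is encoded as 1 by the fitted encoder but as 0 by the refitted one, which would silently change the meaning of the feature for a model trained on the first mapping.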

**BUILD PREDICTIVE MODELS**

**Split Data**

In [None]:
df5.columns

In [None]:
final = df5[["longitude", "latitude",'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income', 'ocean_proximity', 'median_house_value']]

In [None]:
final.tail(2)

In [None]:
X = final.drop(['median_house_value'] , axis = 1).values
y = final['median_house_value'].values

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25, random_state= 42)

**Scaling**

In [None]:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

In [None]:
X_train=scale.fit_transform(X_train)
X_test=scale.transform(X_test)  # reuse the training mean/std; do not refit on test data

In [None]:
X_train.shape, y_train.shape
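
The scaler should learn its mean and standard deviation from the training split only and then apply those same statistics to the test split. A minimal sketch on synthetic arrays (`X_train_toy` and `X_test_toy` are hypothetical stand-ins):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
X_test_toy = rng.normal(loc=10.0, scale=2.0, size=(50, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test_toy)        # apply the same mean/std

# Training columns are centred exactly; test columns only approximately.
print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

The test means are close to zero but not exactly zero, which is expected: the test data is measured on the training data's scale.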

**Linear Regression**

Linear regression models are useful for prediction and to explain variation in the response variable. In this case, we will focus on prediction. So we will fit a predictive model to an observed data set of values of the response and explanatory variables.

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$

To fit the linear model we will use the scikit-learn class **LinearRegression**, whose **fit()** method estimates the coefficients from the training data.
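
To make the link between the formula and the estimator concrete, the sketch below generates data from known coefficients and checks that the fitted model recovers them. The coefficients (3, 2, -1) and the synthetic data are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 2))

# Made-up ground truth: beta0 = 3, beta1 = 2, beta2 = -1, plus small noise.
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X_toy, y_toy)

print("intercept (beta0):", model.intercept_)
print("coefficients (beta1, beta2):", model.coef_)
```

The intercept corresponds to β0 and `coef_` holds one estimated β per explanatory variable, in column order.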

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error , mean_absolute_percentage_error

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)

As shown above, the variable `lr` is a linear regression model fitted on `X_train` and `y_train`. Fitting the model means finding the coefficients that best fit the training data, which is what **fit()** does; **predict()** then applies the fitted model to `X_test` to produce `y_pred_lr`.

In [None]:
lr_frame = pd.DataFrame({"Y_test": y_test , "Y_pred" : y_pred_lr})
lr_frame.head(10)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(lr_frame[:50])
plt.legend(["Actual" , "Predicted"])

In [None]:
# round each metric to 2 decimals (the stray ", 2" in print() only printed a literal 2)
mae_linear = np.round(metrics.mean_absolute_error(y_test, y_pred_lr), 2)
mse_linear = np.round(metrics.mean_squared_error(y_test, y_pred_lr), 2)
rmse_linear = np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred_lr)), 2)


print('Mean Absolute Error:', mae_linear)
print('Mean Squared Error:', mse_linear)
print('Root Mean Squared Error:', rmse_linear)
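
For reference, the three metrics can be computed by hand on a tiny example and checked against scikit-learn. The numbers below are made up purely to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values chosen for easy arithmetic.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_true - y_hat          # [-10, 10, -30, 40]
mae = np.mean(np.abs(errors))    # (10 + 10 + 30 + 40) / 4 = 22.5
mse = np.mean(errors ** 2)       # (100 + 100 + 900 + 1600) / 4 = 675.0
rmse = np.sqrt(mse)              # square root of 675, roughly 25.98

print(mae, mse, round(rmse, 2))
print(mean_absolute_error(y_true, y_hat), mean_squared_error(y_true, y_hat))
```

RMSE is in the same units as the target (US Dollars here), which usually makes it easier to interpret than MSE.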

In [None]:
# residual plot
plt.figure(figsize=(12,7))
(sns.distplot((y_test-y_pred_lr), bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "(y_test-y_pred)", ylabel = "Density"
    , title = "Linear Regression Residual Plot"));

**Random Forest**

Random Forest is an evolution of bagging. The Random Forest model provides an **improvement over bagged trees** by way of a small tweak that **decorrelates the trees**.

As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a **random sample of m predictors is chosen as split candidates from the full set of p predictors**. The idea behind this process is to decorrelate the trees, for example: Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other. Hence the predictions from the bagged trees will be highly correlated.

In Random Forest p is the full set of predictors and m is the predictors taken at each split. So, **the main difference between bagging and random forests is the choice of predictor subset size m**. For instance, if a random forest is built using m = p, then this amounts simply to bagging.
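
In scikit-learn, m is controlled by the `max_features` parameter of `RandomForestRegressor`. The sketch below, on synthetic data with one deliberately strong predictor, compares m = p with m = 2 by counting how often the strong predictor is used for the top split. The `root_feature_share` helper and the data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 6))
# Feature 0 is a deliberately strong predictor; the others are weaker.
y_toy = 5 * X_toy[:, 0] + X_toy[:, 1:].sum(axis=1) + rng.normal(scale=0.5, size=300)

# m = p: every split may consider all 6 predictors (plain bagging).
bagging = RandomForestRegressor(n_estimators=50, max_features=1.0, random_state=0)
# m < p: each split considers a random subset of 2 of the 6 predictors.
forest = RandomForestRegressor(n_estimators=50, max_features=2, random_state=0)

bagging.fit(X_toy, y_toy)
forest.fit(X_toy, y_toy)

def root_feature_share(model, feature=0):
    """Fraction of trees whose top split uses the given feature."""
    roots = np.array([tree.tree_.feature[0] for tree in model.estimators_])
    return np.mean(roots == feature)

print("m = p:", root_feature_share(bagging))
print("m = 2:", root_feature_share(forest))
```

With m = p the strong predictor dominates the top split, while a smaller m forces many trees to start from other predictors, which is the decorrelation described above. Note that recent scikit-learn versions default to `max_features=1.0` for `RandomForestRegressor`, so the default regressor behaves like bagged trees unless `max_features` is lowered.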

In [None]:
regressor_rf = RandomForestRegressor(random_state=42)
regressor_rf.fit(X_train, y_train)
y_pred_forest = regressor_rf.predict(X_test)

In [None]:
regressor_rf_frame = pd.DataFrame({"Y_test": y_test , "Y_pred" : y_pred_forest})
regressor_rf_frame.head(10)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(regressor_rf_frame[:50])
plt.legend(["Actual" , "Predicted"])

In [None]:
# round each metric to 2 decimals (the stray ", 2" in print() only printed a literal 2)
mae_forest = np.round(metrics.mean_absolute_error(y_test, y_pred_forest), 2)
mse_forest = np.round(metrics.mean_squared_error(y_test, y_pred_forest), 2)
rmse_forest = np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred_forest)), 2)


print('Mean Absolute Error:', mae_forest)
print('Mean Squared Error:', mse_forest)
print('Root Mean Squared Error:', rmse_forest)

In [None]:
# residual plot
plt.figure(figsize=(12,7))
(sns.distplot((y_test-y_pred_forest), bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "(y_test-y_pred)", ylabel = "Density", title = "RF Residual Plot"));

**Cross Validation Random Forest**

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score, KFold

param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

random_search = RandomizedSearchCV(regressor_rf, param_dist, n_iter=10, scoring='neg_mean_squared_error', cv=5, random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)
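
Because the search uses `scoring='neg_mean_squared_error'`, the reported `best_score_` is a negative MSE (higher is better). A small sketch, on synthetic data with a deliberately tiny search space, of how to read the result back as an RMSE:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = 3 * X_toy[:, 0] + rng.normal(scale=0.5, size=200)

# Tiny, made-up search space so the sketch runs quickly.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [None, 3]},
    n_iter=3, scoring="neg_mean_squared_error", cv=3, random_state=0,
)
search.fit(X_toy, y_toy)

# best_score_ is the mean negative MSE across folds; flip the sign before the root.
cv_rmse = np.sqrt(-search.best_score_)
print("best params:", search.best_params_)
print("cross-validated RMSE (root of mean MSE):", cv_rmse)
```

This cross-validated figure comes from the training split only, so it can be compared with the held-out test RMSE as a rough check for overfitting.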


In [None]:
best_regressor_rf = random_search.best_estimator_
y_pred_forest_cv = best_regressor_rf.predict(X_test)

In [None]:
regressor_rf_frame_cv = pd.DataFrame({"Y_test": y_test , "Y_pred" : y_pred_forest_cv})
regressor_rf_frame_cv.head(10)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(regressor_rf_frame_cv[:50])
plt.legend(["Actual" , "Predicted"])

In [None]:
# evaluate the tuned model's predictions (y_pred_forest_cv), not the untuned ones
mae_forest_cv = np.round(metrics.mean_absolute_error(y_test, y_pred_forest_cv), 2)
mse_forest_cv = np.round(metrics.mean_squared_error(y_test, y_pred_forest_cv), 2)
rmse_forest_cv = np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred_forest_cv)), 2)


print('Mean Absolute Error:', mae_forest_cv)
print('Mean Squared Error:', mse_forest_cv)
print('Root Mean Squared Error:', rmse_forest_cv)

In [None]:
# residual plot
plt.figure(figsize=(12,7))
(sns.distplot((y_test-y_pred_forest_cv), bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "(y_test-y_pred)", ylabel = "Density", title = "RF CV Residual Plot"));

In [None]:
models = ['Random Forest' , 'Random Forest CV']
data = [ [mae_forest , mse_forest, rmse_forest], [mae_forest_cv , mse_forest_cv, rmse_forest_cv]]
cols = ['mae' , 'mse', 'rmse']
pd.DataFrame(data=data , index = models , columns = cols)

**REGRESSION TREE**

Decision tree methods involve stratifying or segmenting the predictor space into a number of simple regions. Decision trees are **simple and useful for interpretation**; however, they are typically **not competitive in terms of prediction accuracy**.

Decision trees can be applied to both **regression and classification problems**. For this task, we use **Regression Trees**.

**Advantages of Decision Trees**:

1. Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
2. Some people believe that decision trees more closely mirror human decision-making than other regression and classification approaches do.
3. Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
4. Trees can, in principle, handle qualitative predictors without the need to create dummy variables (scikit-learn's trees still require numeric input, hence the encoding step above).

**Disadvantages of Decision Trees**:

1. Trees generally do not have the same level of predictive accuracy as some other regression and classification approaches.
2. Trees can be very non-robust. In other words, a small change in the data can cause a large change in the final estimated tree (high variance).

However, by aggregating many decision trees the predictive performance of trees can be substantially improved. We will evaluate these models in the next cells.
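
To illustrate the interpretability point, a shallow tree can be printed as plain if/else rules with `export_text`. This is a sketch on synthetic data: the feature names are borrowed from the dataset, but the values and the step relationship are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
# Made-up target: a step at 5 on the first feature, plus small noise.
y_toy = np.where(X_toy[:, 0] > 5, 300.0, 100.0) + rng.normal(scale=5, size=200)

# A depth-limited tree stays small enough to read.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)

rules = export_text(tree, feature_names=["median_income", "housing_median_age"])
print(rules)
```

Each printed line is a threshold test on one feature, and each leaf shows the predicted value for its region of the predictor space.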

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error

In [None]:
regressor_tree = DecisionTreeRegressor(random_state = 42)
regressor_tree.fit(X_train, y_train)

y_pred_tree = regressor_tree.predict(X_test)

In [None]:
regressor_tree_frame = pd.DataFrame({"Y_test": y_test , "Y_pred" : y_pred_tree})
regressor_tree_frame.head(10)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(regressor_tree_frame[:50])
plt.legend(["Actual" , "Predicted"])

In [None]:
# round each metric to 2 decimals (the stray ", 2" in print() only printed a literal 2)
mae_tree = np.round(metrics.mean_absolute_error(y_test, y_pred_tree), 2)
mse_tree = np.round(metrics.mean_squared_error(y_test, y_pred_tree), 2)
rmse_tree = np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred_tree)), 2)


print('Mean Absolute Error:', mae_tree)
print('Mean Squared Error:', mse_tree)
print('Root Mean Squared Error:', rmse_tree)

In [None]:
# residual plot
plt.figure(figsize=(12,7))
(sns.distplot((y_test-y_pred_tree), bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "(y_test-y_pred)", ylabel = "Density", title = "Regression Tree Residual Plot"));

**Cross Validation Regression Tree**

In [None]:
param_grid = {
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(regressor_tree, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Hyperparameters:", grid_search.best_params_)

In [None]:
best_regressor_tree = grid_search.best_estimator_
y_pred_tree_cv = best_regressor_tree.predict(X_test)

In [None]:
regressor_tree_frame_cv = pd.DataFrame({"Y_test": y_test, "Y_pred": y_pred_tree_cv})
regressor_tree_frame_cv.head(10)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(regressor_tree_frame_cv[:50])
plt.legend(["Actual" , "Predicted"])

In [None]:
# round each metric to 2 decimals (the stray ", 2" in print() only printed a literal 2)
mae_tree_cv = np.round(metrics.mean_absolute_error(y_test, y_pred_tree_cv), 2)
mse_tree_cv = np.round(metrics.mean_squared_error(y_test, y_pred_tree_cv), 2)
rmse_tree_cv = np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred_tree_cv)), 2)


print('Mean Absolute Error:', mae_tree_cv)
print('Mean Squared Error:', mse_tree_cv)
print('Root Mean Squared Error:', rmse_tree_cv)

In [None]:
# residual plot
plt.figure(figsize=(12,7))
(sns.distplot((y_test-y_pred_tree_cv), bins = "fd", norm_hist = True, kde = False, color = "skyblue", hist_kws = dict(alpha = 1))
    .set(xlabel = "(y_test-y_pred)", ylabel = "Density", title = "Regression Tree CV Residual Plot"));

**Comparison Regression Tree**

In [None]:
models = ['Regression Tree' , 'Regression Tree CV']
data = [ [mae_tree , mse_tree, rmse_tree], [mae_tree_cv , mse_tree_cv, rmse_tree_cv]]
cols = ['mae' , 'mse', 'rmse']
pd.DataFrame(data=data , index = models , columns = cols)

**Evaluation on Test Data**

In [None]:
testing = pd.read_excel('/content/testing_data.xlsx')

In [None]:
# transform() reuses the mapping learned on the training data; refitting could change the codes
testing['ocean_proximity'] = ocean_proximity_le.transform(testing['ocean_proximity'])
df_test = testing[["longitude", "latitude", 'housing_median_age',
       'total_rooms', 'total_bedrooms', 'population', 'households',
       'median_income', 'ocean_proximity']]

In [None]:
# reuse the scaler from the Scaling step instead of fitting a new one on the test file
predict_test = scale.transform(df_test.values)

In [None]:
testing.head()

In [None]:
testing.shape

In [None]:
y_pred_lr_test = lr.predict(predict_test)

y_pred_forest_test = best_regressor_rf.predict(predict_test)

y_pred_tree_test = best_regressor_tree.predict(predict_test)
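
One way to make sure new data always receives the same preprocessing as the training data is to bundle the scaler and the model in a scikit-learn `Pipeline`. This is only a sketch of that alternative, on synthetic data; it is not how the models above were built.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50, scale=10, size=(200, 3))
y_fit = X_fit @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)
X_new = rng.normal(loc=50, scale=10, size=(20, 3))  # stands in for an unseen test file

# fit() fits the scaler and the model together on the training data.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_fit, y_fit)

# predict() applies the training-time scaling to the new data automatically.
preds = pipe.predict(X_new)
print(preds[:5])
```

With this layout there is no separate scaler object to refit by accident when scoring a new file.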

In [None]:
lr_frame_test = pd.DataFrame({"Y_test": testing["median_house_value"] , "Y_pred" : y_pred_lr_test})
regressor_rf_frame_test = pd.DataFrame({"Y_test": testing["median_house_value"], "Y_pred" : y_pred_forest_test})
regressor_tree_frame_test = pd.DataFrame({"Y_test": testing["median_house_value"], "Y_pred": y_pred_tree_test})

In [None]:
lr_frame_test[5:]

In [None]:
regressor_rf_frame_test[5:]

In [None]:
regressor_tree_frame_test

**Comparison**

In [None]:
models = ['LinearRegression' , 'Regression Tree' , 'Random Forest']
data = [[mae_linear , mse_linear, rmse_linear], [mae_tree_cv , mse_tree_cv, rmse_tree_cv],[mae_forest_cv , mse_forest_cv, rmse_forest_cv]]
cols = ['mae' , 'mse', 'rmse']
pd.DataFrame(data=data , index = models , columns = cols)