#### Problem Objective:
##### The project aims at building a model of housing prices to predict median house values in California using the provided dataset. This model should learn from the data and be able to predict the median housing price in any district, given all the other metrics.

#### Analysis Tasks to be performed:
#### 1. Build a model of housing prices to predict median house values in California using the provided dataset.
#### 2. Train the model to learn from the data to predict the median housing price in any district, given all the other metrics.
#### 3. Predict housing prices based on median_income and plot the regression chart for it.


In [1]:
# importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn import model_selection

In [2]:
# Getting a view of the data
df=pd.read_csv("california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200


In [3]:
# Inspecting column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  int64  
 3   total_rooms         20640 non-null  int64  
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  int64  
 6   households          20640 non-null  int64  
 7   median_income       20640 non-null  float64
 8   ocean_proximity     20640 non-null  object 
 9   median_house_value  20640 non-null  int64  
dtypes: float64(4), int64(5), object(1)
memory usage: 1.6+ MB


In [4]:
# Checking for null values
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64

#### From the above we can see that the "total_bedrooms" column has 207 missing values. We will handle this by filling them with the column mean.

In [5]:
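# Our target variable "median_income" has no null values, so we can extract it immediately
# Note: the stated objective is to predict "median_house_value"; the results reported below use "median_income" as the target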
target=df["median_income"]

In [6]:
# dropping the target column from our dataset
df=df.drop("median_income",axis=1)

In [7]:
# Filling the null values present in the "total_bedrooms" column with the column mean
df["total_bedrooms"]=df["total_bedrooms"].fillna(df["total_bedrooms"].mean())
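
The mean imputation used above can be illustrated on a toy column. This is a minimal sketch; the four values below are illustrative and the `toy` frame is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (illustrative values, not the full dataset)
toy = pd.DataFrame({"total_bedrooms": [129.0, 1106.0, np.nan, 235.0]})

# .mean() skips NaN by default, so this is the mean of the three known values
col_mean = toy["total_bedrooms"].mean()

# Replace the missing entry with that mean
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(col_mean)

print(toy["total_bedrooms"].tolist())  # [129.0, 1106.0, 490.0, 235.0]
```

Because the mean is sensitive to very large districts, the median (`.median()`) may be a reasonable alternative for a skewed column like this one.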

#### We will perform label encoding on the dataset to convert the categorical variable "ocean_proximity" into numeric values

In [8]:
# Initializing the label encoder
le=LabelEncoder()

In [9]:
# Performing label encoding on the categorical variable "ocean_proximity"
df["ocean_proximity"]=le.fit_transform(df["ocean_proximity"])
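
`LabelEncoder` maps each category to an integer in sorted label order, which a linear model will read as an ordered numeric feature. For a nominal column such as "ocean_proximity", one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a hypothetical `toy` frame; the category values other than "NEAR BAY" are illustrative assumptions.

```python
import pandas as pd

# Toy categorical column (category names are illustrative)
toy = pd.DataFrame(
    {"ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "NEAR OCEAN"]}
)

# One indicator column per category; each row has exactly one 1
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)

print(encoded.columns.tolist())
print(encoded.values.tolist())
```

Tree-based models are usually less affected by the integer coding than linear regression, so the choice matters most for the linear model.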

In [10]:
# Splitting the dataset into training (80%) and test (20%) sets
xtrain,xtest,ytrain,ytest=model_selection.train_test_split(df,target,test_size=0.2,random_state=42)
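
As a small illustration of what the split does, the sketch below applies `train_test_split` to a hypothetical 10-row array with the same `test_size` and `random_state`. The arrays `X` and `y` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy rows with 2 features; y[i] is tied to row i so alignment can be checked
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# 80% of the rows go to training, 20% to test
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Features and targets are shuffled together, so rows stay aligned
print(np.array_equal(X_te[:, 0], 2 * y_te))  # True
```

Fixing `random_state` makes the split repeatable, so scores from different models are computed on the same held-out rows.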

In [11]:
# Standardizing the features to zero mean and unit variance; the scaler is fit on the training data only and then applied to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(xtrain)
X_test_scaled = scaler.transform(xtest)
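
The sketch below shows what `StandardScaler` does on a tiny hypothetical array: statistics are learned from the training rows, and the test rows are transformed with those same statistics. The `train` and `test` arrays are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train mean and std

# Each training column now has mean 0 and standard deviation 1
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))

# The test row is expressed relative to the training distribution
print(test_scaled)
```

Fitting the scaler on the training set only keeps information from the test set out of the preprocessing step.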

In [12]:
# Fitting a linear regression model on the scaled training data
lin=LinearRegression()
lin.fit(X_train_scaled,ytrain)

In [13]:
# Predicting values
y_pred=lin.predict(X_test_scaled)

In [14]:
# Printing the mean squared error on the test set
mse = mean_squared_error(ytest, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 1.457503659096428
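
MSE is expressed in squared units of the target, so its square root (RMSE) is often easier to read; for instance, an MSE of about 1.46 corresponds to an RMSE of about 1.21. The sketch below computes both by hand on hypothetical values and checks the result against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# MSE: average of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

# RMSE: same units as the target
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn)  # 1.3125 1.3125
print(rmse)
```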


# Performing Decision Tree Regression

In [15]:
# Initializing the regression model
Dregressor = DecisionTreeRegressor()

# Fitting the regression model to the training data
Dregressor.fit(X_train_scaled,ytrain)

In [16]:
# Predicting values
y_pred=Dregressor.predict(X_test_scaled)

In [17]:
# Printing the mean squared error on the test set
mse = mean_squared_error(ytest, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 1.7652164077810077
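
`DecisionTreeRegressor()` is created above without a `random_state`, so ties between equally good splits can be broken differently from run to run and the MSE may change slightly when the notebook is re-executed. A minimal sketch of pinning the seed, using synthetic data that is an assumption of this example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(20, 3))

# Two trees trained with the same seed on the same data
tree_a = DecisionTreeRegressor(random_state=42).fit(X, y)
tree_b = DecisionTreeRegressor(random_state=42).fit(X, y)

# With a fixed random_state the predictions are identical
same = np.array_equal(tree_a.predict(X_new), tree_b.predict(X_new))
print(same)  # True
```

The same argument applies to `RandomForestRegressor`, which additionally draws a random bootstrap sample for each tree.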


# Performing Random Forest Regression

In [18]:
# Initializing the regression model
Rregressor = RandomForestRegressor()

# Fitting the regression model to the training data
Rregressor.fit(X_train_scaled,ytrain)

In [19]:
# Predicting values
y_pred=Rregressor.predict(X_test_scaled)

In [20]:
# Printing the mean squared error on the test set
mse = mean_squared_error(ytest, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.9210044876285557
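
One way to see why a random forest often beats a single tree is that its prediction is the average of many trees, which smooths out the errors of each individual tree. The sketch below checks this on synthetic data (an assumption of this example) by averaging the per-tree predictions manually.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(5, 3))

forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X, y)

# Predict with each fitted tree, then average across trees
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_avg = per_tree.mean(axis=0)

# The forest's own prediction matches the manual average
print(np.allclose(manual_avg, forest.predict(X_new)))  # True
```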


#### From the results we got:
#### Linear Regression MSE = 1.46
#### Decision Tree MSE = 1.77
#### Random Forest MSE = 0.92

#### From the results above we can see that the random forest regressor has the lowest test MSE. Hence, it is the best of the three models on this train/test split.
#### Author: Tanimowo Possible
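
Task 3 (predicting housing prices from median_income and plotting the regression) is not covered by the cells above. A minimal sketch follows, using only the five rows displayed by `df.head()` so that it runs on its own; on the full dataset, `target` (the extracted median_income column) and `df["median_house_value"]` would take the place of the hypothetical `income` and `value` arrays.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# median_income and median_house_value from the five rows shown by df.head()
income = np.array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462]).reshape(-1, 1)
value = np.array([452600, 358500, 352100, 341300, 342200], dtype=float)

# Simple one-feature linear regression: value ~ intercept + slope * income
model = LinearRegression()
model.fit(income, value)

# Evaluate the fitted line across the observed income range
grid = np.linspace(income.min(), income.max(), 50).reshape(-1, 1)
line = model.predict(grid)

# Scatter of the data with the fitted regression line
fig, ax = plt.subplots()
ax.scatter(income, value, label="districts")
ax.plot(grid, line, color="red", label="fitted line")
ax.set_xlabel("median_income")
ax.set_ylabel("median_house_value")
ax.legend()
# In a notebook the figure renders at the end of the cell;
# in a script, call plt.show() to display it.
```

With only five rows the fitted slope is not meaningful; the sketch only shows the shape of the workflow.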
