
### Problem Statement

A real estate company wishes to analyse the prices of properties based on various factors such as area, number of rooms, bathrooms, bedrooms, etc. Create a multiple linear regression model which is capable of predicting the sale price of houses based on multiple factors and evaluate the accuracy of this model.








---

### List of Activities

**Activity 1:** Analysing the Dataset

**Activity 2:** Data Preparation
  
**Activity 3:** Train-Test Split

**Activity 4:** Model Training

**Activity 5:** Model Prediction and Evaluation







---


#### Activity 1: Analysing the Dataset

- Create a pandas DataFrame for the **Housing** dataset using the link below. The dataset consists of the following columns:


|Field|Description|
|---:|:---|
|price|Sale price of a house in INR|
|area|Total size of a property in square feet|
|bedrooms|Number of bedrooms|
|bathrooms|Number of bathrooms|
|stories|Number of storeys excluding basement|
|mainroad|yes, if the house faces a main road|
|guestroom|yes, if the house has a separate living room or a drawing room for guests|
|basement|yes, if the house has a basement|
|hotwaterheating|yes, if the house uses gas for hot water heating|
|airconditioning|yes, if there is central air conditioning|
|parking|Number of cars that can be parked|
|prefarea|yes, if the house is located in the preferred neighbourhood of the city|
|furnishingstatus|furnished, semi-furnished or unfurnished|


  **Dataset Link:** https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/house-prices.csv

- Print the first five rows of the dataset. Check for null values and treat them accordingly.






In [None]:
# Import modules
import pandas as pd
# Load the dataset
property_df = pd.read_csv("https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/house-prices.csv")
# Print first five rows using head() function
property_df.head()

| |price|area|bedrooms|bathrooms|stories|mainroad|guestroom|basement|hotwaterheating|airconditioning|parking|prefarea|furnishingstatus|
|---:|---:|---:|---:|---:|---:|:---|:---|:---|:---|:---|---:|:---|:---|
|0|13300000|7420|4|2|3|yes|no|no|no|yes|2|yes|furnished|
|1|12250000|8960|4|4|4|yes|no|no|no|yes|3|no|furnished|
|2|12250000|9960|3|2|2|yes|no|yes|no|no|2|yes|semi-furnished|
|3|12215000|7500|4|2|2|yes|no|yes|no|yes|3|yes|furnished|
|4|11410000|7420|4|1|2|yes|yes|yes|no|yes|2|no|furnished|


In [None]:
# Check if there are any null values. If any column has null values, treat them accordingly
property_df.isnull().sum()

price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64
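
The check above reports zero missing values in every column, so this dataset needs no treatment. As a minimal sketch of what "treating them accordingly" could look like if nulls were present, the toy DataFrame below (hypothetical values, not part of the Housing data) fills a numeric column with its median and a categorical column with its most frequent value. This is one common convention, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries (hypothetical, for illustration only)
toy_df = pd.DataFrame({
    "area": [7420.0, np.nan, 9960.0, 7500.0],
    "mainroad": ["yes", "yes", None, "no"],
})

# Count the missing values per column, as done for property_df above
print(toy_df.isnull().sum())

# Numeric column: fill with the median of the observed values
toy_df["area"] = toy_df["area"].fillna(toy_df["area"].median())

# Categorical column: fill with the most frequent value (the mode)
toy_df["mainroad"] = toy_df["mainroad"].fillna(toy_df["mainroad"].mode()[0])

# Total number of missing values remaining after the fill
print(toy_df.isnull().sum().sum())
```

Dropping the affected rows with `dropna()` is another option when only a few rows are incomplete.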

---

#### Activity 2: Data Preparation

This dataset contains several categorical columns whose values are `yes` or `no`. Linear regression needs numerical inputs, so you need to convert all `yes` and `no` values to 1s and 0s, where
- 1 means `yes`
- 0 means `no`

Similarly, in the `furnishingstatus` column, replace

- `unfurnished` with 0
- `semi-furnished` with 1
- `furnished` with 2

**Hint:** To replace all `yes` values with 1 and all `no` values with 0, use the `replace()` method of the DataFrame object.

For example, `df.replace(to_replace="yes", value=1, inplace=True)` replaces every cell equal to "yes", in any column, with 1. By default `replace()` matches whole cell values, so replacing `furnished` does not affect `unfurnished` or `semi-furnished`. Pass `inplace=True` to modify the DataFrame directly instead of getting a new one back.



In [None]:
# Replace all the non-numeric values with numeric values.
property_df.replace(to_replace="yes",value=1,inplace=True)
property_df.replace(to_replace="no",value=0,inplace=True)
property_df.replace(to_replace="unfurnished",value=0,inplace=True)
property_df.replace(to_replace="semi-furnished",value=1,inplace=True)
property_df.replace(to_replace="furnished",value=2,inplace=True)
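
An alternative way to do the same encoding is to map each column through a dictionary. The sketch below uses a toy DataFrame with hypothetical rows, and the names `yes_no_map` and `furnishing_map` are illustrative, not part of the original notebook. Unlike `replace()`, `map()` turns any value missing from the dictionary into `NaN`, which makes unexpected categories easy to spot.

```python
import pandas as pd

# Toy rows mimicking two of the categorical columns (hypothetical values)
toy_df = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished"],
})

# Illustrative lookup tables matching the encoding described above
yes_no_map = {"yes": 1, "no": 0}
furnishing_map = {"unfurnished": 0, "semi-furnished": 1, "furnished": 2}

# Encode each column through its own dictionary
toy_df["mainroad"] = toy_df["mainroad"].map(yes_no_map)
toy_df["furnishingstatus"] = toy_df["furnishingstatus"].map(furnishing_map)

print(toy_df)
```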

---

#### Activity 3: Train-Test Split

You need to predict the house prices based on several factors, so `price` is the target variable and all the other columns are feature variables.

Split the dataset so that the training set contains 67% of the instances and the test set contains the remaining 33%.

In [None]:
# Split the DataFrame into the training and test sets.
from sklearn.model_selection import train_test_split
x = property_df.drop(columns="price")
y = property_df["price"]
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.33,random_state=42)
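
To see what `test_size=0.33` and `random_state=42` do, here is a small sketch on 100 hypothetical samples (the arrays `features` and `target` are made up for illustration). It checks the sizes of the two sets and shows that repeating the split with the same `random_state` gives the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical samples with 2 features each
features = np.arange(200).reshape(100, 2)
target = np.arange(100)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.33, random_state=42
)
print(len(f_train), len(f_test))

# Repeating the split with the same random_state reproduces the same partition
_, f_test_again, _, _ = train_test_split(
    features, target, test_size=0.33, random_state=42
)
print(np.array_equal(f_test, f_test_again))
```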

---

#### Activity 4: Model Training

Implement multiple linear regression using the `sklearn` module in the following way:

1. Reshape the target variable arrays into two-dimensional arrays by calling the `reshape(-1, 1)` method on the NumPy arrays.
2. Build the model by importing the `LinearRegression` class and creating an object of this class.
3. Call the `fit()` function on the `LinearRegression` object, passing the training features and the training target.
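
As a quick illustration of step 1, the sketch below reshapes a small one-dimensional array (three prices taken from the rows printed in Activity 1) into a single column. The `-1` asks NumPy to infer the number of rows from the array's length.

```python
import numpy as np

# Three prices from the first rows of the dataset
prices = np.array([13300000, 12250000, 12215000])
print(prices.shape)

# -1 lets NumPy infer the row count; 1 fixes a single column
prices_2d = prices.reshape(-1, 1)
print(prices_2d.shape)
```

`LinearRegression.fit()` also accepts a one-dimensional target. With a two-dimensional target, the fitted `coef_` attribute is two-dimensional as well, so the coefficients for the single target sit in its first row.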

In [None]:
# Import the LinearRegression class
from sklearn.linear_model import LinearRegression

# Create two-dimensional NumPy arrays for the target variable
y_train = y_train.values.reshape(-1,1)
y_test = y_test.values.reshape(-1,1)


# Build linear regression model
lin_reg = LinearRegression()
lin_reg.fit(x_train,y_train)

# Print the value of the intercept
print(lin_reg.intercept_)

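# Print the names of the features along with the values of their corresponding coefficients.
# The target is two-dimensional, so coef_ has shape (1, n_features); row 0 holds the coefficients.
for name, coef in zip(x.columns, lin_reg.coef_[0]):
    print(f"{name}: {coef:.6f}")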

[-276654.39716309]
area: 251.340200
bedrooms: 92716.605269
bathrooms: 1126479.377436
stories: 396248.427747
mainroad: 410635.155697
guestroom: 320496.711210
basement: 484622.278853
hotwaterheating: 623047.392904
airconditioning: 678375.342262
parking: 292410.463141
prefarea: 524417.242824
furnishingstatus: 200615.357036
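
To make the meaning of these numbers concrete, the sketch below rebuilds one prediction by hand: the intercept plus the sum of each coefficient multiplied by its feature value. The intercept and coefficients are copied from the output above, and the feature values are row 0 of the dataset after the encoding from Activity 2. The variable names are illustrative.

```python
# Intercept and coefficients copied from the output printed above
intercept = -276654.39716309
coefficients = {
    "area": 251.340200,
    "bedrooms": 92716.605269,
    "bathrooms": 1126479.377436,
    "stories": 396248.427747,
    "mainroad": 410635.155697,
    "guestroom": 320496.711210,
    "basement": 484622.278853,
    "hotwaterheating": 623047.392904,
    "airconditioning": 678375.342262,
    "parking": 292410.463141,
    "prefarea": 524417.242824,
    "furnishingstatus": 200615.357036,
}

# Row 0 as printed by head(), with yes/no and furnishing status encoded as numbers
first_row = {
    "area": 7420, "bedrooms": 4, "bathrooms": 2, "stories": 3,
    "mainroad": 1, "guestroom": 0, "basement": 0, "hotwaterheating": 0,
    "airconditioning": 1, "parking": 2, "prefarea": 1, "furnishingstatus": 2,
}

# Linear model: intercept + sum of (coefficient * feature value)
predicted_price = intercept + sum(
    coefficients[name] * first_row[name] for name in coefficients
)
print(f"Predicted price: {predicted_price:.0f}")
```

For this row the sum comes to roughly 8.0 million INR, against a listed price of 13.3 million INR. Calling `lin_reg.predict()` performs the same computation for every row at once.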


---

#### Activity 5: Model Prediction and Evaluation

Predict the values for both training and test sets by calling the `predict()` function on the LinearRegression object. Also, calculate the $R^2$, MSE, RMSE and MAE values to evaluate the accuracy of your model.
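
For reference, these are the standard definitions of the four metrics, where $y_i$ is the actual price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean of the actual prices and $n$ the number of instances:

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
\qquad
\text{RMSE} = \sqrt{\text{MSE}}
\qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE and MAE are expressed in the same unit as the target (INR here), while MSE is in squared units.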

In [None]:
# Predict the target variable values for training and test set
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
y_train_pred = lin_reg.predict(x_train)
y_test_pred = lin_reg.predict(x_test)

In [None]:
# Evaluate the linear regression model using the 'r2_score', 'mean_squared_error' & 'mean_absolute_error' functions of the 'sklearn' module.
print(f"\nr2 score for train set {r2_score(y_train,y_train_pred)},\nMean squared error for train set {mean_squared_error(y_train,y_train_pred)},\nMean absolute error for train set {mean_absolute_error(y_train,y_train_pred)}")
print(f"\nr2 score for test set {r2_score(y_test,y_test_pred)},\nMean squared error for test set {mean_squared_error(y_test,y_test_pred)},\nMean absolute error for test set {mean_absolute_error(y_test,y_test_pred)}")


r2 score for train set 0.68603602364727,
Mean squared error for train set 971946527815.6637,
Mean absolute error for train set 720751.2129481052

r2 score for test set 0.6557070707485257,
Mean squared error for test set 1475542475754.5508,
Mean absolute error for test set 906953.7908301718
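
The task also asks for RMSE, which the cell above does not print. Since RMSE is the square root of MSE, a minimal sketch is to take the square root of the two MSE values reported above. The variable names are illustrative. In the notebook itself, the same result can be obtained by applying a square root to the value returned by `mean_squared_error`.

```python
import math

# MSE values copied from the output printed above
mse_train = 971946527815.6637
mse_test = 1475542475754.5508

# RMSE is the square root of MSE and is expressed in INR, like the prices
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"RMSE for train set {rmse_train:.2f}")
print(f"RMSE for test set {rmse_test:.2f}")
```

Both values are on the order of a million INR, which makes them directly comparable with the MAE values above.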


---