# Abstract
The purpose of this lab is to use regression to predict house prices using the house_prices dataset. I fit two regression models, Linear Regression and Ridge Regression, and select the better-performing one based on the R-squared, MAE, and RMSE of each.

# Introduction
The problem at hand is using features related to house price, such as carpet area, location, and bathroom count, to accurately predict housing prices in India. We can use the dataset to analyze and predict housing market trends in order to aid those interested in purchasing homes in the country.

# Related Work
The work in this lab is based on Chapter 2 of Machine Learning using Python by Professor Itauma. I follow his steps for guidance in performing regression, and use MSE, RMSE, and R-squared to assess performance as referenced in the chapter. I also reference the LinkedIn Learning course "Machine Learning with Scikit-Learn" for data preprocessing and other coding.

# Methodology

## Data Preprocessing

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv("house_prices.csv") ## loading dataset

In [None]:
df.head() ## view data

In [None]:
print(df.shape) ## printed explicitly, since only the last expression in a cell is displayed
df.info()

In [None]:
df.isnull().sum() ## look at number of missing entries in each column

Since several columns have a large number of missing values, we will remove those with more than 100,000 missing values from the dataset. We will also remove the "Index" column since it will not be used.

In [None]:
df.drop(columns=['Index', 'Society', 'Car Parking', 'Super Area', 'Dimensions', 'Plot Area'], inplace = True)
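
The column list above was picked by reading the missing-value counts. As a minimal sketch, the same selection could be made programmatically. The helper name `columns_over_missing_threshold` and the toy frame are hypothetical, and the threshold is scaled down to fit the toy data.

```python
import numpy as np
import pandas as pd

def columns_over_missing_threshold(frame, threshold):
    """Return the names of columns with more than `threshold` missing values."""
    missing_counts = frame.isnull().sum()
    return missing_counts[missing_counts > threshold].index.tolist()

# Toy stand-in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "Society": [np.nan, np.nan, np.nan, "A"],
    "Bathroom": [2, np.nan, 3, 1],
    "location": ["x", "y", "z", "x"],
})

to_drop = columns_over_missing_threshold(toy, threshold=2)
toy_clean = toy.drop(columns=to_drop)
print(to_drop)
```

On the real data the call would presumably use `threshold=100000`, with "Index" added to the list by hand.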

Next, we will replace the missing values in the text column "Description" with "N/A". 

In [None]:
df_text = ['Description']
df[df_text] = df[df_text].fillna("N/A")

We will impute the missing values for categorical columns with the mode.

In [None]:
df_categorical = ['Status', 'Transaction', 'Furnishing', 'facing', 'overlooking', 'Ownership']
for col in df_categorical:
    mode_col = df[col].mode().iloc[0]
    df[col] = df[col].fillna(mode_col)

For the numeric columns, we will impute the missing values using each column's mean (rounded to a whole number for the bathroom and balcony counts).

In [None]:
mean_price = df['Price (in rupees)'].mean()
df['Price (in rupees)'] = df['Price (in rupees)'].fillna(mean_price)

In [None]:
## setting >10 to 11 for the sake of imputation
df.loc[df['Bathroom'] == '> 10', 'Bathroom'] = 11
df.loc[df['Balcony'] == '> 10', 'Balcony'] = 11 

## converting bathroom and balcony to numeric
df['Bathroom'] = pd.to_numeric(df['Bathroom'], errors = 'coerce')
df['Balcony'] = pd.to_numeric(df['Balcony'], errors = 'coerce')

#imputing
mean_bathroom = round(df['Bathroom'].mean())
df['Bathroom'] = df['Bathroom'].fillna(mean_bathroom)

mean_balcony = round(df['Balcony'].mean())
df['Balcony'] = df['Balcony'].fillna(mean_balcony)

In [None]:
## converting carpet areas to same unit and numeric
def as_sqft(area):
    if pd.notnull(area):
        if 'sqft' in area:
            area = float(area.replace(' sqft', ''))
        elif 'kanal' in area:
            area = float(area.replace(' kanal', '')) * 5445
        elif 'marla' in area:
            area = float(area.replace(' marla', '')) * 272.251
        elif 'bigha' in area:
            area = float(area.replace(' bigha', '')) * 27000
        elif 'cent' in area:
            area = float(area.replace(' cent', '')) * 435.56
        elif 'ground' in area:
            area = float(area.replace(' ground', '')) * 2400.35
        elif 'acre' in area:
            area = float(area.replace(' acre', '')) * 43560
        elif 'sqyrd' in area:
            area = float(area.replace(' sqyrd', '')) * 9
        else:
            ## any remaining unit is assumed to be square metres
            area = float(area.replace(' sqm', '')) * 10.764
        return area
    return np.nan ## keep missing values as NaN for the mean imputation below

In [None]:
df['Carpet Area'] = df['Carpet Area'].apply(as_sqft)
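
The chain of `elif` branches in `as_sqft` can also be written as a lookup table. The sketch below is an alternative and not the notebook's code: the name `to_sqft` is hypothetical, the factors are copied from `as_sqft`, and it assumes every value has the form "<number> <unit>".

```python
import math

# Conversion factors to square feet, copied from as_sqft above
SQFT_PER_UNIT = {
    "sqft": 1.0,
    "kanal": 5445,
    "marla": 272.251,
    "bigha": 27000,
    "cent": 435.56,
    "ground": 2400.35,
    "acre": 43560,
    "sqyrd": 9,
    "sqm": 10.764,
}

def to_sqft(area):
    """Convert a string such as '120 sqyrd' to square feet; missing values become NaN."""
    if not isinstance(area, str):
        return math.nan
    value, unit = area.split()  # assumes exactly one space between number and unit
    return float(value) * SQFT_PER_UNIT[unit]

print(to_sqft("100 sqyrd"))
```

One design difference: an unrecognized unit raises a `KeyError` here instead of being silently treated as square metres.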

In [None]:
mean_carpet = df['Carpet Area'].mean()
df['Carpet Area'] = df['Carpet Area'].fillna(mean_carpet)

Finally, we will impute the floor with the mode.

In [None]:
mode_floor = df['Floor'].mode().iloc[0]
df['Floor'] = df['Floor'].fillna(mode_floor)

In [None]:
df.isnull().sum() ## final check to ensure that all missing values have been imputed.

Since our response variable "Amount(in rupees)" is not numeric, we will remove the rows without a numeric value and convert the rest.

In [None]:
df = df[df['Amount(in rupees)'] != "Call for Price"]

In [None]:
def as_rupee(amount):
    if 'Lac' in amount:
        amount = float(amount.replace('Lac', '')) * 100000 
    else:
        amount = float(amount.replace('Cr', '')) * 10000000 
    return amount

In [None]:
df['Amount(in rupees)'] = df['Amount(in rupees)'].apply(as_rupee)
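
The conversion relies on the Indian numbering units: 1 Lac (lakh) is 100,000 rupees and 1 Cr (crore) is 10,000,000 rupees. A small standalone sketch of the same arithmetic follows. The name `parse_amount` and the example strings are hypothetical, and unlike `as_rupee` it raises an error for any value that ends in neither unit.

```python
LAKH = 100_000        # 1 Lac (lakh) = 10^5 rupees
CRORE = 10_000_000    # 1 Cr (crore) = 10^7 rupees

def parse_amount(text):
    """Convert strings such as '42 Lac' or '1.5 Cr' to a rupee amount."""
    text = text.strip()
    if text.endswith("Lac"):
        return float(text[:-3]) * LAKH
    if text.endswith("Cr"):
        return float(text[:-2]) * CRORE
    raise ValueError(f"unrecognised amount: {text!r}")

print(parse_amount("42 Lac"))   # 42 * 100,000
print(parse_amount("1.5 Cr"))   # 1.5 * 10,000,000
```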

We will remove all duplicate rows.

In [None]:
df = df.drop_duplicates()

In [None]:
df.index = range(len(df)) ## reindex

Finally, we will remove outliers from the numeric columns using the IQR method.

In [None]:
# Carpet Area
Q1 = df['Carpet Area'].quantile(0.25)
Q3 = df['Carpet Area'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR

upper_array = np.where(df['Carpet Area'] >= upper)[0]
lower_array = np.where(df['Carpet Area'] <= lower)[0]

df = df.drop(index=upper_array)
df = df.drop(index=lower_array)

In [None]:
df.index = range(len(df)) ## reindex

In [None]:
# Price (in rupees)
Q1 = df['Price (in rupees)'].quantile(0.25)
Q3 = df['Price (in rupees)'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR

upper_array = np.where(df['Price (in rupees)'] >= upper)[0]
lower_array = np.where(df['Price (in rupees)'] <= lower)[0]

df = df.drop(index=upper_array)
df = df.drop(index=lower_array)

In [None]:
df.index = range(len(df)) ## reindex

In [None]:
# Amount(in rupees)
Q1 = df['Amount(in rupees)'].quantile(0.25)
Q3 = df['Amount(in rupees)'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR

upper_array = np.where(df['Amount(in rupees)'] >= upper)[0]
lower_array = np.where(df['Amount(in rupees)'] <= lower)[0]

df = df.drop(index=upper_array)
df = df.drop(index=lower_array)

In [None]:
df.index = range(len(df)) ## reindex
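
The three outlier cells above repeat the same steps, so they could be folded into one helper. This is a minimal sketch on toy data. The name `drop_iqr_outliers` is hypothetical, and it uses a boolean mask rather than positional indices, so no reindexing is needed between calls.

```python
import pandas as pd

def drop_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose `column` value lies strictly inside the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values on or beyond a fence are dropped, matching the >= and <= tests above
    inside = (frame[column] > lower) & (frame[column] < upper)
    return frame[inside].reset_index(drop=True)

# Toy data with one obvious outlier (hypothetical values)
toy = pd.DataFrame({"Carpet Area": [900, 1000, 1100, 1200, 1300, 50000]})
trimmed = drop_iqr_outliers(toy, "Carpet Area")
print(len(toy), "->", len(trimmed))
```

Applied in sequence to "Carpet Area", "Price (in rupees)", and "Amount(in rupees)", this should reproduce the filtering done above.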

In [None]:
df.info() ## the final dataset
df.describe()

## Exploratory Data Analysis

In [None]:
import plotly.express as px

### Categorical Variables

In [None]:
px.histogram(df, x ="location")

![location](location.png)

There does not appear to be a particularly popular location for housing in India.

In [None]:
px.pie(df, names = "Transaction")

![transaction](transaction.png)

It seems that the majority of the houses are being resold.

In [None]:
px.pie(df, names = "Furnishing")

![furnishing](furnishing.png)

The houses are roughly evenly divided between semi-furnished and unfurnished, with only 13.2% of homes in the dataset being furnished.

In [None]:
px.pie(df, names = "facing")

![facing](facing.png)

It is really interesting to note that the majority of the houses face East.

In [None]:
px.histogram(df, x = "overlooking")

![overlooking](overlooking.png)

Most houses overlook the main road, which is to be expected.

In [None]:
px.pie(df, names = "Ownership")

![ownership](ownership.png)

The majority of the houses in the dataset are under freehold.

In [None]:
px.histogram(df, x = "Bathroom")

![bathroom](bathroom.png)

Most of the houses have 2 bathrooms, with the next most being 3.

In [None]:
px.histogram(df, x = "Balcony")

![balcony](balcony.png)

Balconies follow the same trend as bathrooms, with most houses having 2, but the next most common count is 1.

### Numeric Variables

In [None]:
px.box(df, x = "Amount(in rupees)")

![amount](amount.png)

It seems the median housing price in the dataset is 5.8 million Rupees.

In [None]:
px.box(df, x = "Price (in rupees)")

![price](price.png)

The median price per square foot is 4837 Rupees.

In [None]:
px.box(df, x = "Carpet Area")

![carpet_area](carpetarea.png)

The median carpet area for homes in the dataset is 1700 sqft.

### Correlations

In [None]:
px.scatter_matrix(df,
                  dimensions = ["Amount(in rupees)", "Price (in rupees)", "Carpet Area"],
                  width=800, height=800) ## pair plot of the three numeric variables

![corr1](corr1.png)

In [None]:
px.scatter_matrix(df,
                  dimensions = ["Amount(in rupees)", "Bathroom", "Balcony"],
                  width=800, height=800) ## pair plot of the two categorical variables that contain numbers

![corr2](corr2.png)

There appears to be a slight positive correlation between our response variable, amount in rupees, and both carpet area and price per square foot, but not between amount and the number of bathrooms or balconies.
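
The visual impression from the pair plots can be backed by Pearson correlation coefficients. The sketch below shows the call on a small toy frame with hypothetical values. On the real data, the same `corr()` call on the numeric columns of `df` would give the actual coefficients.

```python
import pandas as pd

# Toy stand-in for the numeric columns (hypothetical values)
toy = pd.DataFrame({
    "Amount(in rupees)": [3.0e6, 4.1e6, 4.9e6, 6.5e6, 8.0e6],
    "Carpet Area": [800, 1000, 1200, 1500, 2000],
    "Bathroom": [2, 2, 3, 2, 3],
})

corr = toy.corr()  # Pearson correlation by default
print(corr["Amount(in rupees)"].sort_values(ascending=False))
```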

## Regression Models

Since Title and Description are pure text-based fields, we cannot perform regression using them; thus, we will drop the fields. Also, since Status contains only one category, we will drop the field from regression.

In [None]:
df = df.drop(columns=['Title','Description','Status'])

We will now encode the categorical variables.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
df['location'] = le.fit_transform(df['location'])
df['Transaction'] = le.fit_transform(df['Transaction'])
df['Floor'] = le.fit_transform(df['Floor'])
df['Furnishing'] = le.fit_transform(df['Furnishing'])
df['facing'] = le.fit_transform(df['facing'])
df['overlooking'] = le.fit_transform(df['overlooking'])
df['Ownership'] = le.fit_transform(df['Ownership'])
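
`LabelEncoder` maps each category to an integer code, which a linear model then treats as an ordered numeric quantity. A common alternative for nominal features is one-hot encoding. The sketch below uses `pd.get_dummies` on a toy column with hypothetical values. It is a different technique from the one used in this lab, shown only for comparison.

```python
import pandas as pd

# Toy stand-in for one categorical column (hypothetical values)
toy = pd.DataFrame({
    "Furnishing": ["Furnished", "Semi-Furnished", "Unfurnished", "Furnished"],
})

# One indicator column per category; drop_first avoids a redundant column
encoded = pd.get_dummies(toy, columns=["Furnishing"], drop_first=True)
print(encoded.columns.tolist())
```

For a high-cardinality column such as "location", one-hot encoding would add many columns, which is one practical reason to prefer integer codes.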

To prepare for regression, we standardize every column, including the target, to zero mean and unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()

In [None]:
df_standardized = standard_scaler.fit_transform(df)
df = pd.DataFrame(df_standardized, columns=df.columns)

In the final step before fitting the regression models, we will split the data into training (80%) and testing (20%) sets. We will use the random seed 42 for reproducibility.

In [None]:
X = df.drop(columns=['Amount(in rupees)'])
y = df['Amount(in rupees)']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
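
In this notebook the scaler is fitted on the full dataset before the split, so the test rows contribute to the means and standard deviations. A common variation is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data. All array names and values are hypothetical, and it is not the procedure used for the results reported here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (hypothetical values)
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=1000.0, scale=200.0, size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)     # test rows reuse the training mean and std
print(X_tr_std.shape, X_te_std.shape)
```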

### Method 1. Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
reg = LinearRegression(fit_intercept=True)
reg.fit(X_train, y_train)

In [None]:
y_pred = reg.predict(X_test)

### Method 2. Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

In [None]:
clf = Ridge(alpha=1.0)
clf.fit(X_train, y_train)
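
For reference, the two models differ only in a penalty term. Writing $w$ for the coefficient vector and $\alpha$ for the regularization strength (`alpha=1.0` above), the objectives minimized by `LinearRegression` and `Ridge` can be written as:

$$
\min_{w} \lVert y - Xw \rVert_2^2 \quad \text{(Linear Regression)}, \qquad \min_{w} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \quad \text{(Ridge)}
$$

The intercept is not penalized. When the number of rows is large relative to the number of features, a penalty of $\alpha = 1$ is usually small compared with the squared-error term, so the two fits tend to be close.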

# Results

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [None]:
r_squared = r2_score(y_test, y_pred)
print("The R-squared for linear regression is", r_squared)

In [None]:
mae = mean_absolute_error(y_test, y_pred)
print("The MAE for linear regression is", mae)

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("The RMSE for linear regression is", rmse)

In [None]:
y_pred_ridge = clf.predict(X_test)

In [None]:
r_squared_ridge = r2_score(y_test, y_pred_ridge)
print("The R-squared for Ridge Regression is", r_squared_ridge)

In [None]:
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
print("The MAE for Ridge Regression is", mae_ridge)

In [None]:
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
print("The RMSE for Ridge Regression is", rmse_ridge)

Both Linear Regression and Ridge Regression perform almost identically on this dataset. Because Linear Regression scores very slightly better on $R^2$ and RMSE, I would propose using it over Ridge Regression.
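
As a reminder of what the three metrics measure, here is a minimal NumPy sketch that computes them from their definitions. The function name `regression_metrics` and the example numbers are hypothetical. The scikit-learn functions used above should give the same values for the same inputs.

```python
import numpy as np

def regression_metrics(y_true, y_hat):
    """Compute R-squared, MAE, and RMSE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residuals = y_true - y_hat
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
    }

m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(m)
```

Because the target was standardized, the MAE and RMSE in this lab are in units of standard deviations of the amount, not rupees.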

In [None]:
reg.coef_

In [None]:
reg.intercept_

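With all variables standardized, the final Linear Regression equation is $\hat{y} = 0.00028 + 0.61533x_1 - 0.03134x_2 + 0.11015x_3 + 0.00492x_4 - 0.03134x_5 - 0.01740x_6 + 0.03902x_7 - 0.02434x_8 + 0.49721x_9 + 0.05925x_{10} + 0.00925x_{11}$, where $x_1, \dots, x_{11}$ are the standardized features in the column order of `X` and $\hat{y}$ is the predicted standardized amount.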
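
Because both the features and the target were standardized, each coefficient is in standard-deviation units: a one standard deviation increase in $x_j$ changes the predicted amount by $\beta_j$ standard deviations. A slope in original units is $\beta_j \sigma_y / \sigma_j$. The sketch below checks this identity on synthetic single-feature data. All values are hypothetical, and it uses `np.polyfit` instead of scikit-learn for brevity.

```python
import numpy as np

# Synthetic feature and target (hypothetical values)
rng = np.random.default_rng(0)
x = rng.normal(1500.0, 300.0, size=200)
y = 4000.0 * x + rng.normal(0.0, 1e5, size=200)

# Standardize both variables (population standard deviation, as StandardScaler does)
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()

beta_std = np.polyfit(x_std, y_std, 1)[0]   # slope in standard-deviation units
beta_raw = beta_std * y.std() / x.std()     # converted back to original units
beta_direct = np.polyfit(x, y, 1)[0]        # slope fitted directly on raw data

print(beta_raw, beta_direct)
```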

# Discussion

Overall, the model's $R^2$ is relatively low and its MAE and RMSE are relatively high, so I do not believe that it predicts housing prices in India accurately. One limitation is the assumption that each independent variable has a linear relationship with the dependent variable, which may not hold.

# Conclusion

After cleaning, encoding, and standardizing the data, we conclude that, of the two regression methods used, Linear Regression is slightly more effective for this dataset. However, the relatively low $R^2$ value suggests that the model does not fit the dataset very well. A more flexible method, such as Random Forest, might fit the data better.

# References

Galarnyk, M. (2020, October 15). Effective machine learning with scikit-learn [Video]. In Machine Learning with Scikit-Learn. LinkedIn Learning. https://www.linkedin.com/learning/machine-learning-with-scikit-learn/effective-machine-learning-with-scikit-learn?u=279222306

Itauma, I. (n.d.). Chapter 2: Supervised learning - regression. In Machine Learning using Python. https://amightyo.quarto.pub/machine-learning-using-python/Chapter_2.html