# Dataset Glossary (Column-Wise)

- **BHK:** Number of bedrooms, plus a hall and a kitchen (for example, 2 BHK means two bedrooms, one hall, and one kitchen).
- **Rent:** Rent of the house/apartment/flat.
- **Size:** Size of the house/apartment/flat in square feet.
- **Floor:** Floor on which the house/apartment/flat is situated and the total number of floors in the building (for example, "Ground out of 2" or "3 out of 5").
- **Area Type:** Basis on which the size is calculated: Super Area, Carpet Area, or Built Area.
- **Area Locality:** Locality of the house/apartment/flat.
- **City:** City where the house/apartment/flat is located.
- **Furnishing Status:** Whether the house/apartment/flat is Furnished, Semi-Furnished, or Unfurnished.
- **Tenant Preferred:** Type of tenant preferred by the owner or agent.
- **Bathroom:** Number of bathrooms.
- **Point of Contact:** Whom to contact for more information about the house/apartment/flat.

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from wordcloud import WordCloud
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

import joblib

import warnings
warnings.filterwarnings('ignore')

# Loading Data

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
rent_data = pd.read_csv('/content/drive/MyDrive/dataset/house_rent/House_Rent_Dataset.csv')
rent_data.head()

Unnamed: 0,Posted On,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2022-05-18,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2022-05-13,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2022-05-16,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2022-07-04,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2022-05-09,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner


In [None]:
rent_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Posted On          4746 non-null   object
 1   BHK                4746 non-null   int64 
 2   Rent               4746 non-null   int64 
 3   Size               4746 non-null   int64 
 4   Floor              4746 non-null   object
 5   Area Type          4746 non-null   object
 6   Area Locality      4746 non-null   object
 7   City               4746 non-null   object
 8   Furnishing Status  4746 non-null   object
 9   Tenant Preferred   4746 non-null   object
 10  Bathroom           4746 non-null   int64 
 11  Point of Contact   4746 non-null   object
dtypes: int64(4), object(8)
memory usage: 445.1+ KB


In [None]:
rent_data.isna().sum()

Unnamed: 0,0
Posted On,0
BHK,0
Rent,0
Size,0
Floor,0
Area Type,0
Area Locality,0
City,0
Furnishing Status,0
Tenant Preferred,0
Bathroom,0
Point of Contact,0


#### Observations:
- No column in the dataset contains null values.

## Descriptive Statistics

In [None]:
rent_data.describe()

Unnamed: 0,BHK,Rent,Size,Bathroom
count,4746.0,4746.0,4746.0,4746.0
mean,2.08386,34993.45,967.490729,1.965866
std,0.832256,78106.41,634.202328,0.884532
min,1.0,1200.0,10.0,1.0
25%,2.0,10000.0,550.0,1.0
50%,2.0,16000.0,850.0,2.0
75%,3.0,33000.0,1200.0,2.0
max,6.0,3500000.0,8000.0,10.0


# Modeling

In [None]:
# Drop unnecessary columns from the dataset
rent_data = rent_data.drop(['Posted On','Point of Contact'],axis=1)
rent_data.head()

Unnamed: 0,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom
0,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,Bachelors/Family,2
1,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1
2,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1
3,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1
4,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1


In [None]:
# Function that converts the "Floor" string into numbers
def extract_floor_data(floor_str):
    try:
        parts = floor_str.split(' out of ')
        floor = parts[0]
        total = parts[1]

        # Map "Ground" to 0; convert everything else to int
        if floor.strip().lower() == 'ground':
            floor_num = 0
        else:
            floor_num = int(floor)
        total_floor = int(total)
    except (AttributeError, IndexError, ValueError):
        # Values that do not match "<floor> out of <total>" become NaN
        floor_num = np.nan
        total_floor = np.nan
    return pd.Series([floor_num, total_floor])

# Apply the function to the dataset
rent_data[['floor_num', 'total_floor']] = rent_data['Floor'].apply(extract_floor_data)

# Drop the original column
rent_data.drop('Floor', axis=1, inplace=True)
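
To see what the parsing rule above does with different inputs, here is a minimal sketch that re-implements the same rule without pandas. The name `parse_floor` and the sample strings are illustrative assumptions; they are not taken from the dataset.

```python
import math

def parse_floor(floor_str):
    """Return (floor_num, total_floor); NaN for anything that does not parse."""
    try:
        # Unpacking fails with ValueError unless there are exactly two parts.
        floor, total = floor_str.split(' out of ')
        floor_num = 0 if floor.strip().lower() == 'ground' else int(floor)
        return floor_num, int(total)
    except (AttributeError, ValueError):
        return math.nan, math.nan

# Hypothetical inputs: two well-formed strings and two that cannot be parsed.
samples = ['Ground out of 2', '3 out of 5', 'Ground', 'Upper Basement out of 2']
parsed = {s: parse_floor(s) for s in samples}
for s, result in parsed.items():
    print(f'{s!r:28} -> {result}')
```

Strings without a numeric floor or without the " out of " separator end up as NaN, which is why rows with NaN need to be handled before modeling.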


In [None]:
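# Separate the features (X) from the target (y)
X = rent_data.drop('Rent', axis=1)
y = rent_data['Rent']

# Temporarily recombine X and y so rows with NaN are dropped from both consistently
data = pd.concat([X, y], axis=1)

# Drop rows that contain NaN
data = data.dropna()

# Split into features and target again
X = data.drop('Rent', axis=1)
y = data['Rent']


# One-hot encode the categorical columns
X = pd.get_dummies(X, drop_first=True)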

## Splitting into Train and Test dataset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Scaling the data

In [None]:
# Scaling the data
from sklearn.preprocessing import StandardScaler

# StandardScaler expects 2-D input, so reshape the target into a column vector
y_train = y_train.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)

sc_X = StandardScaler()
sc_y = StandardScaler()

# Fit the scalers on the training split only, then reuse them on the test split
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
y_train = sc_y.fit_transform(y_train)
y_test = sc_y.transform(y_test)
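
A common convention is to fit a scaler on the training split only and reuse its statistics for the test split, so that no information from the test data leaks into preprocessing. The sketch below illustrates the difference on synthetic data; the arrays and variable names are assumptions made for this example, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 1))  # synthetic "training" feature
test = rng.normal(loc=50.0, scale=10.0, size=(50, 1))    # synthetic "test" feature

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from the training data only
test_scaled = scaler.transform(test)        # reuse those statistics on the test data

# Refitting on the test set would standardize it with its own mean/std instead.
refit_scaled = StandardScaler().fit_transform(test)

print('train mean/std:', train_scaled.mean(), train_scaled.std())
print('test mean (train statistics):', test_scaled.mean())
print('test mean (refit on test):', refit_scaled.mean())
```

With the refit, the test mean is forced to zero regardless of how the test data relates to the training data, which hides any shift between the two splits.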

In [None]:
print(X_train,y_train)

[[ 1.09656575  1.36934348  1.15708155 ... -0.78535416  0.60551865
  -0.32168761]
 [ 1.09656575  1.13654613  1.15708155 ... -0.78535416  0.60551865
  -0.32168761]
 [-1.30501893 -0.69479307  0.03639263 ... -0.78535416  0.60551865
  -0.32168761]
 ...
 [-0.10422659 -0.41543624  0.03639263 ... -0.78535416  0.60551865
  -0.32168761]
 [-0.10422659 -0.5861543  -1.08429629 ...  1.27331088  0.60551865
  -0.32168761]
 [-0.10422659  0.05015847  1.15708155 ... -0.78535416  0.60551865
  -0.32168761]] [[ 0.28256607]
 [-0.12074087]
 [-0.31663281]
 ...
 [-0.23597142]
 [-0.26477906]
 [ 1.55010215]]


## Support Vector Regressor
Support Vector Regression (SVR) applies the same principle as the Support Vector Machine (SVM) to regression problems.

Regression means finding a function that approximates the mapping from an input domain to real numbers, based on a training sample.

In the figure below, the two red lines are the decision boundaries and the green line is the hyperplane. SVR concentrates on the points that lie within the decision boundaries: deviations inside them are not penalized, so the best-fit line is the hyperplane that keeps the maximum number of points between the two boundaries.

![Support-Vector-Regression.jpg](attachment:c35b39cc-05ef-4379-a4f4-73deca88610d.jpg)

The decision boundaries (the red lines in the figure above) lie at a fixed distance, say ‘a’, on either side of the hyperplane, that is, at ‘+a’ and ‘-a’. This distance ‘a’ is conventionally called epsilon.
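
The role of epsilon can be made concrete with the epsilon-insensitive loss: residuals inside the tube cost nothing, and residuals outside it are penalized linearly by how far they exceed epsilon. The following is a minimal NumPy sketch of that loss only, not of SVR training; the function name and the sample values are assumptions chosen for illustration. The value 0.1 matches the default `epsilon` of scikit-learn's `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the tube |y - f(x)| <= epsilon, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

# Hypothetical targets and predictions: two inside the tube, two outside.
y_true = np.array([1.00, 1.00, 1.00, 1.00])
y_pred = np.array([1.05, 0.92, 1.30, 0.50])
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first two predictions are within 0.1 of the target and contribute no loss; the last two contribute only the part of their error that exceeds epsilon.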

In [None]:
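svr = SVR()
# SVR expects a 1-D target, so flatten the (n, 1) column vector
svr.fit(X_train, y_train.ravel())
svr_prediction = svr.predict(X_test)

# Evaluation metrics (computed on the standardized target, not in rent units)
mae_svr = metrics.mean_absolute_error(y_test, svr_prediction)
mse_svr = metrics.mean_squared_error(y_test, svr_prediction)
rmse_svr = np.sqrt(mse_svr)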

In [None]:
print('MAE:', mae_svr)
print('MSE:', mse_svr)
print('RMSE:', rmse_svr)

MAE: 0.3558941383438831
MSE: 0.6061517007963095
RMSE: 0.7785574486165485
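
Because the target was standardized before training, the errors above are in standardized units rather than in rent units. Below is a minimal sketch of how predictions could be mapped back to the original scale with `inverse_transform` before computing an error. It uses synthetic rents and simulated predictions; `sc_y_demo` and every number in it are illustrative assumptions, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
rent = rng.uniform(5_000, 50_000, size=(100, 1))  # synthetic rents, not dataset values

sc_y_demo = StandardScaler()
rent_scaled = sc_y_demo.fit_transform(rent)

# Pretend these are model predictions in standardized units.
pred_scaled = rent_scaled.ravel() + rng.normal(0, 0.1, size=100)

# inverse_transform expects a 2-D array, so reshape before and flatten after.
pred_rent = sc_y_demo.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

mae_scaled = mean_absolute_error(rent_scaled.ravel(), pred_scaled)
mae_rent = mean_absolute_error(rent.ravel(), pred_rent)
print('MAE (standardized):', mae_scaled)
print('MAE (original units):', mae_rent)
```

Since standardization is a linear map, the MAE in original units equals the standardized MAE multiplied by the scaler's standard deviation (`scale_`).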


In [None]:
# Persist the trained model
joblib.dump(svr, "house_rent_model.pkl")
print("Model saved as house_rent_model.pkl")

# Persist the scalers and the training column order for use at inference time
joblib.dump(sc_X, 'scaler_X.pkl')
joblib.dump(sc_y, 'scaler_y.pkl')
joblib.dump(X.columns.tolist(), 'columns.pkl')

Model saved as house_rent_model.pkl


['columns.pkl']
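
The saved column list matters at inference time: a new listing has to be one-hot encoded and aligned to exactly the columns seen during training. The sketch below shows one possible way to do this on a tiny synthetic dataset written to a temporary directory. The demo frame, the file names inside the temporary directory, and the `*_demo` / `*_l` variable names are all assumptions made for illustration; this is not the notebook's actual inference code.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Tiny synthetic stand-in for the training data (not real listings).
train = pd.DataFrame({
    'BHK': [1, 2, 2, 3, 3, 4],
    'Size': [400, 800, 900, 1200, 1500, 2000],
    'City': ['Kolkata', 'Mumbai', 'Kolkata', 'Delhi', 'Mumbai', 'Delhi'],
})
rent = np.array([6000, 25000, 12000, 30000, 60000, 55000], dtype=float)

X_demo = pd.get_dummies(train, drop_first=True)
sc_X_demo, sc_y_demo = StandardScaler(), StandardScaler()
model = SVR().fit(sc_X_demo.fit_transform(X_demo),
                  sc_y_demo.fit_transform(rent.reshape(-1, 1)).ravel())

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    joblib.dump(model, tmp / 'model.pkl')
    joblib.dump(sc_X_demo, tmp / 'scaler_X.pkl')
    joblib.dump(sc_y_demo, tmp / 'scaler_y.pkl')
    joblib.dump(X_demo.columns.tolist(), tmp / 'columns.pkl')

    # Inference side: reload every artifact from disk.
    model_l = joblib.load(tmp / 'model.pkl')
    sc_X_l = joblib.load(tmp / 'scaler_X.pkl')
    sc_y_l = joblib.load(tmp / 'scaler_y.pkl')
    columns = joblib.load(tmp / 'columns.pkl')

# Encode the new row without drop_first, then align it to the training columns;
# dummy columns missing from the new row are filled with 0.
new_listing = pd.DataFrame({'BHK': [2], 'Size': [850], 'City': ['Mumbai']})
new_X = pd.get_dummies(new_listing).reindex(columns=columns, fill_value=0)

pred_scaled = model_l.predict(sc_X_l.transform(new_X))
pred_rent = sc_y_l.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
print('Predicted rent:', pred_rent[0])
```

Reindexing against the saved columns also silently drops any category that was never seen in training, so such a listing is treated as belonging to the baseline category.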