## Data Understanding And Preparation

In [1]:
import pandas as pd 

In [4]:
#Read the data 
df = pd.read_csv("Bengaluru_House_Data.csv")

In [5]:
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [6]:
#checking the shape of dataset 
df.shape

(13320, 9)

In [7]:
#info of dataset 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


In [10]:
#checking for duploicates 
df[df.duplicated()]

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
971,Super built-up Area,Ready To Move,Haralur Road,3 BHK,NRowse,1464,3.0,2.0,56.0
1115,Super built-up Area,Ready To Move,Haralur Road,2 BHK,,1027,2.0,2.0,44.0
1143,Super built-up Area,Ready To Move,Vittasandra,2 BHK,Prlla C,1246,2.0,1.0,64.5
1290,Super built-up Area,Ready To Move,Haralur Road,2 BHK,,1194,2.0,2.0,47.0
1394,Super built-up Area,Ready To Move,Haralur Road,2 BHK,,1027,2.0,2.0,44.0
...,...,...,...,...,...,...,...,...,...
13285,Super built-up Area,Ready To Move,VHBCS Layout,2 BHK,OlarkLa,1353,2.0,2.0,110.0
13299,Super built-up Area,18-Dec,Whitefield,4 BHK,Prtates,2830 - 2882,5.0,0.0,154.5
13311,Plot Area,Ready To Move,Ramamurthy Nagar,7 Bedroom,,1500,9.0,2.0,250.0
13313,Super built-up Area,Ready To Move,Uttarahalli,3 BHK,Aklia R,1345,2.0,1.0,57.0


## Observation 
   * There are duplicates in datasets 

In [11]:
#drop the duplicates 
df.drop_duplicates(inplace=True)

In [13]:
df[df.duplicated()]

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price


In [15]:
df.shape

(12791, 9)

In [14]:
#checking for null values 
df.isnull().sum()

area_type          0
availability       0
location           1
size              16
society         5328
total_sqft         0
bath              73
balcony          605
price              0
dtype: int64

## Observation 
   * There are missing values in datasets 

In [19]:
#handling missing values by mode imputation 

columns_to_fill = ['society', 'balcony', 'bath', 'size','location']
modes = df[columns_to_fill].mode().iloc[0] 

for column in columns_to_fill:
    df[column].fillna(modes[column], inplace=True)

In [21]:
#Now we can see that there is no missing values 
df.isnull().sum()

area_type       0
availability    0
location        0
size            0
society         0
total_sqft      0
bath            0
balcony         0
price           0
dtype: int64

In [22]:
#descriptive statistics of numerical column 
df.describe()

Unnamed: 0,bath,balcony,price
count,12791.0,12791.0,12791.0
mean,2.704558,1.602064,114.317646
std,1.354936,0.807728,151.48031
min,1.0,0.0,8.0
25%,2.0,1.0,50.0
50%,2.0,2.0,73.0
75%,3.0,2.0,121.0
max,40.0,3.0,3600.0


In [23]:
df.corr()

Unnamed: 0,bath,balcony,price
bath,1.0,0.204692,0.451203
balcony,0.204692,1.0,0.123589
price,0.451203,0.123589,1.0


## Converting categorical  data into numerical

In [70]:
import numpy as np 
df['total_sqft'] = df['total_sqft'].str.replace(r'\D', '', regex=True)

In [74]:
df['total_sqft'] = df['total_sqft'].astype(float)

In [78]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 12790 entries, 0 to 13318
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     12790 non-null  object 
 1   availability  12790 non-null  object 
 2   location      12790 non-null  object 
 3   size          12790 non-null  object 
 4   society       12790 non-null  object 
 5   total_sqft    12790 non-null  float64
 6   bath          12790 non-null  float64
 7   balcony       12790 non-null  float64
 8   price         12790 non-null  float64
dtypes: float64(4), object(5)
memory usage: 999.2+ KB


In [79]:
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056.0,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600.0,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,GrrvaGr,1440.0,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521.0,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,GrrvaGr,1200.0,2.0,1.0,51.0


In [80]:
from sklearn.preprocessing import LabelEncoder

In [81]:
encoder = LabelEncoder()

In [83]:
columns_to_encode = ["area_type","availability","location","size","society"]
df[columns_to_encode] = df[columns_to_encode].apply(encoder.fit_transform)

In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12790 entries, 0 to 13318
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     12790 non-null  int32  
 1   availability  12790 non-null  int32  
 2   location      12790 non-null  int32  
 3   size          12790 non-null  int32  
 4   society       12790 non-null  int32  
 5   total_sqft    12790 non-null  float64
 6   bath          12790 non-null  float64
 7   balcony       12790 non-null  float64
 8   price         12790 non-null  float64
dtypes: float64(4), int32(5)
memory usage: 749.4 KB


In [85]:
## Independent and dependent features 
X = df.drop(["price"],axis=1)
y = df['price']

## Model Building

In [86]:
from sklearn.model_selection import train_test_split

In [87]:
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.33,random_state=42)

In [88]:
from sklearn.preprocessing import StandardScaler

In [89]:
scaler = StandardScaler()

In [90]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [91]:
from sklearn.svm import SVR

In [92]:
svr = SVR()

In [93]:
svr.fit(X_train,y_train)

SVR()

In [94]:
y_pred = svr.predict(X_test)

In [95]:
from sklearn.metrics import mean_squared_error,mean_absolute_error

In [96]:
print(mean_squared_error(y_pred,y_test))
print(mean_absolute_error(y_pred,y_test))
print(np.sqrt(mean_squared_error(y_pred,y_test)))

18980.20800949039
49.5277087525034
137.76867571944788


In [97]:
from sklearn.metrics import r2_score

In [98]:
score = r2_score(y_pred,y_test)

In [99]:
score

-12.542255397608354

In [None]:
## Q1. In order to predict house price based on several characteristics, such as location, square footage,
##number of bedrooms, etc., you are developing an SVM regression model. Which regression metric in this
##situation would be the best to employ?



Mean Absolute Error (MAE) would be the best regression metric for SVM regression model due to its interpretability 
and robustness.

In [None]:
##Q2. You have built an SVM regression model and are trying to decide between using MSE or R-squared as
##your evaluation metric. Which metric would be more appropriate if your goal is to predict the actual price
##of a house as accurately as possible?

Mean Squared Error (MSE) would be more appropriate for predicting house prices accurately as it penalizes larger 
errors more, crucial for precise price estimation.

In [None]:
##Q3. You have a dataset with a significant number of outliers and are trying to select an appropriate
##regression metric to use with your SVM model. Which metric would be the most appropriate in this
##scenario?*\



Median Absolute Error (MedAE) would be the most appropriate metric for SVM regression with outliers due to its 
robustness.

In [None]:
##Q4. You have built an SVM regression model using a polynomial kernel and are trying to select the best
##metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values
##are very close. Which metric should you choose to use in this case?

Choose Root Mean Squared Error (RMSE) as it provides error in the same units as the target variable, aiding 
interpretability.

In [None]:
##Q5. You are comparing the performance of different SVM regression models using different kernels (linear,
##polynomial, and RBF) and are trying to select the best evaluation metric. Which metric would be most
##appropriate if your goal is to measure how well the model explains the variance in the target variable?

The most appropriate metric for measuring how well the model explains the variance in the target variable is the 
coefficient of determination, commonly known as R2R2 (R-squared).