### Q1. In order to predict house price based on several characteristics, such as location, square footage, number of bedrooms, etc., you are developing an SVM regression model. Which regression metric in this situation would be the best to employ?

In [3]:
import pandas as pd
import re
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np


In [4]:

df = pd.read_csv("./Bengaluru_House_Data.csv")

In [5]:
df.head()

| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.0 |


In [6]:
df.dtypes

area_type        object
availability     object
location         object
size             object
society          object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object

In [7]:
df.shape

(13320, 9)

In [8]:
df.isnull().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

In [9]:
# Drop the rows that are missing location, size, or bath
df.dropna(subset=['location', 'size', 'bath'], inplace=True)


In [10]:

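# Fill the remaining gaps with each column's most frequent value.
# Assigning the result back avoids the chained in-place fillna, which
# recent pandas versions warn about and which may not update df.
mode_society = df['society'].mode()[0]
df['society'] = df['society'].fillna(mode_society)

mode_balcony = df['balcony'].mode()[0]
df['balcony'] = df['balcony'].fillna(mode_balcony)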


In [11]:
df.isnull().sum()

area_type       0
availability    0
location        0
size            0
society         0
total_sqft      0
bath            0
balcony         0
price           0
dtype: int64

In [12]:
df.dtypes

area_type        object
availability     object
location         object
size             object
society          object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object

In [61]:
df.shape

(13246, 9)

In [13]:
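# Keep only the first number found in each total_sqft entry
# (a range such as '2100 - 2850' becomes 2100); entries with no number become None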

def filter_numeric_values(val):
    match = re.findall(r'\d+\.\d+|\d+', str(val))
    if match:
        return float(match[0]) if '.' in match[0] else int(match[0])
    else:
        return None

df['total_sqft'] = df['total_sqft'].apply(filter_numeric_values)
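The helper above keeps only the first number in each entry, so a range-style value is reduced to its lower bound. As a rough sketch of an alternative, the hypothetical function `parse_sqft` below takes the midpoint of a range instead. It assumes ranges are written as `low - high`; the example strings are made up for illustration, and units are not converted.

```python
import re

def parse_sqft(val):
    """Return a float for a square-footage entry, or None if no number is found.

    Assumption: a range is written as 'low - high'; its midpoint is used.
    Units embedded in the text (for example square metres) are NOT converted.
    """
    text = str(val)
    numbers = [float(n) for n in re.findall(r'\d+\.\d+|\d+', text)]
    if not numbers:
        return None
    if len(numbers) == 2 and '-' in text:
        return (numbers[0] + numbers[1]) / 2
    return numbers[0]

# Illustrative inputs only, not taken from the dataset
examples = ['1056', '2100 - 2850', '34.46Sq. Meter', 'unknown']
parsed = [parse_sqft(v) for v in examples]
print(parsed)
```

Whether the midpoint or the lower bound is the better choice depends on how the listings were entered, so this is worth checking against the raw values before adopting it.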


In [14]:

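# Features are the first eight columns; .copy() avoids a SettingWithCopyWarning
# when the categorical columns are overwritten below
X = df.iloc[:, :8].copy()

# Target
y = df['price']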


In [15]:
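# Convert categorical variables to integer codes using LabelEncoder.
# Note: the same encoder object is refitted for every column, so the
# per-column mappings are not kept after this loop.
label_encoder = LabelEncoder()
for col in ['area_type', 'availability', 'location', 'size', 'society']:
    X[col] = label_encoder.fit_transform(X[col])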


In [16]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [17]:

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [18]:
from sklearn.svm import SVR
svm_regressor = SVR(kernel='linear')
svm_regressor.fit(X_train, y_train)
predictions = svm_regressor.predict(X_test)


In [19]:

# Calculating regression metrics
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R-squared (R²): {r2}")

Mean Squared Error (MSE): 6782.819866476444
Root Mean Squared Error (RMSE): 82.35787677251305
Mean Absolute Error (MAE): 36.774862645793846
R-squared (R²): 0.4813782852819609


In [54]:
# Create a dictionary with custom input values
custom_data = {
    'area_type': ['Super built-up  Area'],
    'availability': ['Ready To Move'],
    'location': ['Electronic City'],
    'size': ['2 BHK'],
    'society': ['Coomee '],
    'total_sqft': [1000],
    'bath': [2],
    'balcony': [1]
}

In [55]:
custom_input = pd.DataFrame(custom_data)

In [56]:
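# Label encoding for categorical columns.
# Caveat: a fresh LabelEncoder fitted on a single row maps every category to 0,
# so these codes do not match the codes the model saw during training.
# A meaningful prediction needs the encoders that were fitted on the training data.
label_encoder = LabelEncoder()
categorical_cols = ['area_type', 'availability', 'location', 'size', 'society']
for col in categorical_cols:
    custom_input[col] = label_encoder.fit_transform(custom_input[col])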

In [57]:
# Scale all eight features (encoded categoricals included) with the scaler
# that was fitted on the training data
feature_cols = ['area_type', 'availability', 'location', 'size', 'society', 'total_sqft', 'bath', 'balcony']
custom_input[feature_cols] = scaler.transform(custom_input[feature_cols])

In [58]:
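# Make predictions using the trained model.
# The model was fitted on a NumPy array, so pass an array here as well
# to avoid scikit-learn's feature-name warning.
predicted_price = svm_regressor.predict(custom_input.to_numpy())
print("Predicted Price:", predicted_price)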

Predicted Price: [67.61506232]
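In the cells above, a new LabelEncoder is fitted on the single custom row, so its categories are not encoded the way the training data were. One way to keep the encoding consistent is to bundle the encoder, scaler, and model in a single pipeline. The sketch below is not the notebook's method: it uses a tiny made-up frame and swaps in scikit-learn's `OrdinalEncoder` (which is designed for feature columns) with unseen categories mapped to -1.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.svm import SVR

# Tiny made-up training frame (illustrative values only, not from the dataset)
train = pd.DataFrame({
    'location': ['A', 'B', 'A', 'C', 'B', 'C'],
    'total_sqft': [1000, 1500, 1200, 2000, 1700, 2200],
    'price': [50.0, 80.0, 60.0, 120.0, 95.0, 130.0],
})
categorical_cols = ['location']

# Encode the categorical columns; pass the numeric columns through unchanged
encode = ColumnTransformer(
    [('cat',
      OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
      categorical_cols)],
    remainder='passthrough',
)

# The fitted pipeline remembers the category mapping and the scaling statistics
model = make_pipeline(encode, StandardScaler(), SVR(kernel='linear'))
model.fit(train[['location', 'total_sqft']], train['price'])

# A new row is encoded with the mapping learned from the training data
new_row = pd.DataFrame({'location': ['B'], 'total_sqft': [1600]})
encoded = model.named_steps['columntransformer'].transform(new_row)
prediction = model.predict(new_row)
print(encoded)
print(prediction)
```

With this layout, a new row only needs the raw column values; the same fitted transformations are applied at prediction time.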




### Q2. You have built an SVM regression model and are trying to decide between using MSE or R-squared as your evaluation metric. Which metric would be more appropriate if your goal is to predict the actual price of a house as accurately as possible?

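The Mean Squared Error (MSE) would be a more appropriate evaluation metric than R-squared. MSE directly measures how far the predicted prices are from the actual prices, whereas R-squared only reports the share of variance explained relative to always predicting the mean, so a high R-squared does not by itself guarantee small price errors.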
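To illustrate the difference, the sketch below applies the same prediction errors to two made-up sets of prices, one tightly clustered and one widely spread. The arrays are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# The same prediction errors applied to two made-up sets of prices
errors = np.array([5.0, -5.0, 5.0, -5.0])
y_narrow = np.array([48.0, 50.0, 52.0, 54.0])    # prices that barely vary
y_wide = np.array([20.0, 60.0, 100.0, 140.0])    # prices that vary a lot

mse_narrow = mean_squared_error(y_narrow, y_narrow + errors)
mse_wide = mean_squared_error(y_wide, y_wide + errors)
r2_narrow = r2_score(y_narrow, y_narrow + errors)
r2_wide = r2_score(y_wide, y_wide + errors)

# MSE is identical because the errors are identical;
# R-squared differs because it is relative to the spread of the target.
print(f"MSE narrow: {mse_narrow}, MSE wide: {mse_wide}")
print(f"R2 narrow: {r2_narrow}, R2 wide: {r2_wide}")
```

The predictions are equally far from the true prices in both cases, which MSE reflects, while R-squared changes only because the targets are spread differently.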

### Q3. You have a dataset with a significant number of outliers and are trying to select an appropriate regression metric to use with your SVM model. Which metric would be the most appropriate in this scenario?

When the data contain many outliers, metrics based on absolute rather than squared errors are more appropriate:

- Mean Absolute Error (MAE): MAE measures the average absolute difference between predicted and actual values. Because it does not square the errors, it is less sensitive to outliers than MSE and gives a more robust estimate of the typical error.

- Median Absolute Error: Like MAE, but it takes the median of the absolute differences instead of the mean. It is even more robust than MAE, because a few extreme errors do not move the median.

- R-squared (R2 Score): R2 measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It is computed from squared errors, so outliers distort it in much the same way as they distort MSE, and it is not a robust choice here.
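The effect of a single outlier on these metrics can be checked with a small experiment. The values below are made up for illustration; the only change between the two cases is one extreme target value.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
)

# Made-up values: predictions are within 1-2 units of the clean targets
y_pred = np.array([52.0, 58.0, 71.0, 79.0, 92.0])
y_clean = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

# Same data, but the last target is replaced by an extreme value
y_outlier = y_clean.copy()
y_outlier[-1] = 590.0

results = {}
for name, y_true in [('clean', y_clean), ('outlier', y_outlier)]:
    results[name] = {
        'mse': mean_squared_error(y_true, y_pred),
        'mae': mean_absolute_error(y_true, y_pred),
        'medae': median_absolute_error(y_true, y_pred),
    }
    print(name, results[name])
```

In this toy case the median absolute error does not move at all, MAE grows noticeably, and MSE grows by several orders of magnitude.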

### Q4. You have built an SVM regression model using a polynomial kernel and are trying to select the best metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values are very close. Which metric should you choose to use in this case?

RMSE is simply the square root of MSE, so both metrics rank models identically, and the two values can only be close to each other when MSE is near 1 (or near 0). In that case I would choose RMSE for its interpretability: it expresses the error in the same units as the target variable, which makes the model's performance easier to explain to stakeholders or in reports.
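The relationship between the two metrics can be sketched with made-up numbers: RMSE is computed as the square root of MSE, so the two coincide when every error has magnitude 1 and drift apart as the errors grow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([10.0, 12.0, 14.0, 16.0])

# Case 1: every error has magnitude 1, so MSE and RMSE coincide
y_pred_small = np.array([11.0, 11.0, 15.0, 15.0])
mse_small = mean_squared_error(y_true, y_pred_small)
rmse_small = np.sqrt(mse_small)

# Case 2: every error has magnitude 10, so MSE is far larger than RMSE
y_pred_large = np.array([20.0, 2.0, 24.0, 6.0])
mse_large = mean_squared_error(y_true, y_pred_large)
rmse_large = np.sqrt(mse_large)

print(f"errors of 1:  MSE={mse_small}, RMSE={rmse_small}")
print(f"errors of 10: MSE={mse_large}, RMSE={rmse_large}")
```

In both cases RMSE reads directly as an error size in the target's units, which is the interpretability argument made above.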

### Q5. You are comparing the performance of different SVM regression models using different kernels (linear, polynomial, and RBF) and are trying to select the best evaluation metric. Which metric would be most appropriate if your goal is to measure how well the model explains the variance in the target variable?

R-squared is particularly well-suited for evaluating how well the model explains the variance in the target variable. It quantifies the proportion of the variance in the dependent variable (target) that is predictable from the independent variables used in the model.
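A kernel comparison based on R-squared might look like the sketch below. It uses synthetic data from `make_regression` rather than the house-price dataset, and the default SVR hyperparameters, so the resulting scores are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit one SVR per kernel and record its R-squared on the held-out data
r2_by_kernel = {}
for kernel in ['linear', 'poly', 'rbf']:
    model = SVR(kernel=kernel)
    model.fit(X_train, y_train)
    r2_by_kernel[kernel] = r2_score(y_test, model.predict(X_test))

best_kernel = max(r2_by_kernel, key=r2_by_kernel.get)
print(r2_by_kernel)
print("Highest R-squared:", best_kernel)
```

Because R-squared is unitless and bounded above by 1, the scores of the three kernels can be compared directly on the same test split.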