# House Price Prediction

- [1 - Introduction](#Introduction)
    - [1.1 - Project Overview](#Project-Overview)
    - [1.2 - Problem Statement](#Problem-Statement)
    - [1.3 - Dataset Description](#Dataset-Description)

- [2 - Import Libraries](#Import-Libraries)

- [3 - Data Loading and Exploration](#Data-Loading-and-Exploration)
    - [3.1 - Load the Dataset](#Load-the-Dataset)
    - [3.2 - Display Basic Information](#Display-Basic-Information)
    
- [4 - Data Preprocessing](#Data-Preprocessing)
    - [4.1 - Removing Irrelevant Features](#Removing-Irrelevant-Features)
    - [4.2 - Handle Missing Values](#Handle-Missing-Values)
    - [4.3 - Encoding Categorical Variables](#Encoding-Categorical-Variables)
    - [4.4 - Feature Engineering](#Feature-Engineering)
    - [4.5 - Outlier Removal](#Outlier-Removal)
    - [4.6 - Further Encoding Categorical Variables](#Futher-Encoding-Categorical-Variables)
    - [4.7 - Feature Scaling](#Feature-Scaling)


- [5 - Data Splitting](#Data-Splitting)
    - [5.1 - Split into Train, Validation, and Test Sets](#Split-into-Train-Validation-and-Test-Sets)
    - [5.2 - Split Data into Features (X) and Target (y)](#Split-Data-into-Features-X-and-Target-y)


- [6 - Model Definition and Training](#Model-Definition)
    - [6.1 - Define the Logistic Regression Model using Sklearn](#Define-the-Logistic-Regression-Model-using-Sklearn)
    - [6.2 - Train the Model](#Train-the-Model)
    - [6.3 - Evaluating Model on Validation Set](#Validation-During-Training)

- [7 - Hypertuning of Model](#Hypertuning-of-Model)

- [8 - Model Evaluation](#Model-Evaluation)
    - [8.1 - Confusion Matrix and Scores](#Confusion-Matrix-and-Scores)
    - [8.2 - ROC Curve](#ROC-Curve)

- [9 - Conclusion](#Conclusion)

# 1 - Introduction

## 1.1 - Project Overview
The goal of this project is to develop a predictive model that can estimate the prices of houses in Bengaluru, India. Accurately predicting house prices is crucial for real estate agents, buyers, and sellers to make informed decisions. By analyzing various factors such as the size of the property, location, and available amenities, we aim to build a machine learning model that can effectively predict house prices based on historical data.

## 1.2 - Problem Statement
The real estate market in Bengaluru is dynamic and influenced by multiple factors, making it challenging to estimate property prices accurately. The primary objective of this project is to address the following questions:

- Can we build an accurate model to predict house prices using historical real estate data from Bengaluru?
- How can we interpret the model's predictions to provide actionable insights for real estate professionals and potential buyers?

By answering these questions, we aim to create a tool that can assist in making more accurate and informed real estate decisions.

## 1.3 - Dataset Description
The dataset used in this project is sourced from Kaggle and contains detailed information on various properties in Bengaluru, India.

### Bengaluru House Data
Each row in the dataset represents a property listing, and each column provides different attributes about the properties.

- **Number of Rows:** 13,320 (properties)
- **Number of Columns:** 9 (8 features plus the target)
- **Target Column:** "price"

### Data Composition
The dataset includes the following information:

- **Area Type:**
  - The type of area (e.g., Super built-up Area, Plot Area, Built-up Area).

- **Availability:**
  - The availability status of the property (e.g., Ready to Move, available from a specific date).

- **Location:**
  - The location of the property within Bengaluru.

- **Size:**
  - The size of the property in terms of the number of bedrooms (e.g., 2 BHK, 3 Bedroom).

- **Total Area:**
  - The total area of the property in square feet.

- **Number of Bathrooms:**
  - The number of bathrooms available in the property.

- **Number of Balconies:**
  - The number of balconies available in the property.

This dataset provides a comprehensive view of the real estate market in Bengaluru, allowing us to analyze and model the factors that influence house prices effectively.


# [2 - Import Libraries](#Import-Libraries)

In this section, we import the necessary libraries required for data manipulation, visualization, and building a machine learning model using Sklearn.


In [707]:
# Basic libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sklearn for data preprocessing, building, training the model and evaluation
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc
from sklearn.model_selection import GridSearchCV

# [3 - Data Loading and Exploration](#Data-Loading-and-Exploration)

## [3.1 - Load the Dataset](#Load-the-Dataset)

In this section, we will load the Bengaluru House dataset into a pandas DataFrame for further exploration and analysis.

In [708]:
# Load the dataset into a pandas DataFrame
data_path = './Bengaluru_House_Data.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataset to verify loading
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


## [3.2 - Display Basic Information](#Display-Basic-Information)

In this section, we will display basic information about the dataset to understand its structure and contents.


In [709]:
# Display the basic information about the dataset
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


(13320, 9)

# [4 - Data Preprocessing](#Data-Preprocessing)

## [4.1 - Removing Irrelevant Features](#Removing-Irrelevant-Features)

In this section, we remove features that we assume carry no decisive weight for the target (house price).



The following code will remove the 'availability' feature from the dataframe, as it is considered irrelevant for the analysis.

In [710]:
# Remove irrelevant features from the dataframe
df.drop(['availability'], axis=1, inplace=True)
df.head()

Unnamed: 0,area_type,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Kothanur,2 BHK,,1200,2.0,1.0,51.0


## [4.2 - Handle Missing Values](#Handle-Missing-Values)

In this section, we will identify and handle missing values in the dataset to ensure the data is clean and ready for modeling.

In [711]:
missing_values = df.isnull().sum()
print(missing_values)

area_type        0
location         1
size            16
society       5502
total_sqft       0
bath            73
balcony        609
price            0
dtype: int64
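These counts are easier to compare as a share of all rows. Below is a minimal sketch on a small hypothetical frame; the `toy` data is made up for illustration and is not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame({
    "society": ["Coomee", None, None, "Soiewre"],
    "balcony": [1.0, 3.0, np.nan, 1.0],
    "price": [39.07, 120.0, 62.0, 95.0],
})

# isnull().mean() gives the fraction of missing entries per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real DataFrame, the same expression would report **society** at 5502 / 13320, roughly 41% missing.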


5502 out of 13320 samples have no value for **society**, so we drop this feature as well.

In [712]:
df = df.drop(['society'], axis=1)
df.head()

Unnamed: 0,area_type,location,size,total_sqft,bath,balcony,price
0,Super built-up Area,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Plot Area,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0
2,Built-up Area,Uttarahalli,3 BHK,1440,2.0,3.0,62.0
3,Super built-up Area,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0
4,Super built-up Area,Kothanur,2 BHK,1200,2.0,1.0,51.0


Next, we drop all samples that have no value for the balcony feature.

In [713]:
df = df.dropna(subset=['balcony'])
df.isnull().sum()

area_type     0
location      1
size          0
total_sqft    0
bath          0
balcony       0
price         0
dtype: int64

Now let's drop the row that has no value for the location.

In [714]:
df = df.dropna(subset=['location'])
df.isnull().sum()

area_type     0
location      0
size          0
total_sqft    0
bath          0
balcony       0
price         0
dtype: int64

## [4.3 - Encoding Categorical Variables](#Encoding-Categorical-Variables)

In this section, we begin converting the non-numerical columns into a numerical format that our machine learning model can use. The **size** and **total_sqft** columns hold numbers embedded in strings, so we parse them directly. The truly categorical columns, **location** and **area_type**, are handled with **One-Hot Encoding** in section 4.6, which transforms each categorical variable into a set of binary columns (0 or 1) representing the presence or absence of each category.


In [715]:
# Convert size (e.g. '2 BHK') to a numerical feature by keeping the leading number
df['size'] = df['size'].apply(lambda x: float(x.split(' ')[0]))
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 12710 entries, 0 to 13319
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   area_type   12710 non-null  object 
 1   location    12710 non-null  object 
 2   size        12710 non-null  float64
 3   total_sqft  12710 non-null  object 
 4   bath        12710 non-null  float64
 5   balcony     12710 non-null  float64
 6   price       12710 non-null  float64
dtypes: float64(4), object(3)
memory usage: 794.4+ KB
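The `split`-based conversion above relies on every entry being a string that starts with a number. As an alternative sketch (not the notebook's method), a regular expression can pull out the leading digits and leave anything unparseable as NaN. The `raw_size` sample below is hypothetical.

```python
import pandas as pd

# Hypothetical sample of raw `size` strings, including a missing entry
raw_size = pd.Series(["2 BHK", "4 Bedroom", "1 RK", None])

# Extract the leading digits; rows without a match stay NaN instead of raising
bedrooms = raw_size.str.extract(r"^(\d+)", expand=False).astype(float)
print(bedrooms.tolist())
```

This variant tolerates missing values, so it would not depend on the earlier `dropna` calls having run first.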


The **total_sqft** feature is not numerical. Let's find out what its non-numeric entries look like.

In [716]:
# Helper that checks whether a value can be converted to a float
def isFloat(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# Show the rows whose total_sqft value is NOT a plain float
df[~df['total_sqft'].apply(isFloat)].head(40)


Unnamed: 0,area_type,location,size,total_sqft,bath,balcony,price
30,Super built-up Area,Yelahanka,4.0,2100 - 2850,4.0,0.0,186.0
122,Super built-up Area,Hebbal,4.0,3067 - 8156,4.0,0.0,477.0
137,Super built-up Area,8th Phase JP Nagar,2.0,1042 - 1105,2.0,0.0,54.005
165,Super built-up Area,Sarjapur,2.0,1145 - 1340,2.0,0.0,43.49
188,Super built-up Area,KR Puram,2.0,1015 - 1540,2.0,0.0,56.8
410,Super built-up Area,Kengeri,1.0,34.46Sq. Meter,1.0,0.0,18.5
549,Super built-up Area,Hennur Road,2.0,1195 - 1440,2.0,0.0,63.77
661,Super built-up Area,Yelahanka,2.0,1120 - 1145,2.0,0.0,48.13
672,Built-up Area,Bettahalsoor,4.0,3090 - 5002,4.0,0.0,445.0
772,Super built-up Area,Banashankari Stage VI,2.0,1160 - 1195,2.0,0.0,59.935


We can see that the total_sqft column contains strings that describe the area in forms such as:
2100 - 2850, 1005.03 - 1252.49, 142.61Sq. Meter, 5.31Acres

The following code converts these kinds of strings into floats (in square feet).

In [717]:
def convert_to_float(x):
    try:
        # Case 1: Range values, e.g., "2100 - 2850"
        if '-' in x:
            parts = x.split('-')
            return (float(parts[0].strip()) + float(parts[1].strip())) / 2
        
        # Case 2: Values in Sq. Meter, e.g., "34.46Sq. Meter"
        elif 'Sq. Meter' in x:
            return float(x.replace('Sq. Meter', '').strip()) * 10.7639  # Convert Sq. Meter to Sq. Feet
        
        # Case 3: Values in Acres, e.g., "5.31Acres"
        elif 'Acres' in x:
            return float(x.replace('Acres', '').strip()) * 43560  # Convert Acres to Sq. Feet
        
        # Default case: Single float value, e.g., "2100"
        else:
            return float(x.strip())
    except (AttributeError, IndexError, TypeError, ValueError):
        return None

# Applying the function to the total_sqft column
df['total_sqft'] = df['total_sqft'].apply(convert_to_float)

# Drop the rows where the conversion failed (edge cases)
df = df.dropna(subset=['total_sqft'])

print("NaN Values per Column")
print(df.isnull().sum())
df.head(40)

NaN Values per Column
area_type     0
location      0
size          0
total_sqft    0
bath          0
balcony       0
price         0
dtype: int64


Unnamed: 0,area_type,location,size,total_sqft,bath,balcony,price
0,Super built-up Area,Electronic City Phase II,2.0,1056.0,2.0,1.0,39.07
1,Plot Area,Chikka Tirupathi,4.0,2600.0,5.0,3.0,120.0
2,Built-up Area,Uttarahalli,3.0,1440.0,2.0,3.0,62.0
3,Super built-up Area,Lingadheeranahalli,3.0,1521.0,3.0,1.0,95.0
4,Super built-up Area,Kothanur,2.0,1200.0,2.0,1.0,51.0
5,Super built-up Area,Whitefield,2.0,1170.0,2.0,1.0,38.0
8,Super built-up Area,Marathahalli,3.0,1310.0,3.0,1.0,63.25
10,Super built-up Area,Whitefield,3.0,1800.0,2.0,2.0,70.0
11,Plot Area,Whitefield,4.0,2785.0,5.0,3.0,295.0
12,Super built-up Area,7th Phase JP Nagar,2.0,1000.0,2.0,1.0,38.0


## [4.4 - Feature Engineering](#Feature-Engineering)

In this section, we introduce a new feature, the price per unit of area, that **will assist in identifying and removing outliers** in the dataset. Listings whose price per area deviates strongly from the rest of their location are likely data errors or unusual properties, which is hard to see from `price` or `total_sqft` alone. Note that the column is named `square_meter_price`, but it is computed from `total_sqft`, so it is in fact a price per square foot.




In [718]:
# Create new feature 'square_meter_price' (price per square foot, despite the name)
df['square_meter_price'] = df['price'] * 10000 / df['total_sqft']

df.head()

Unnamed: 0,area_type,location,size,total_sqft,bath,balcony,price,square_meter_price
0,Super built-up Area,Electronic City Phase II,2.0,1056.0,2.0,1.0,39.07,369.981061
1,Plot Area,Chikka Tirupathi,4.0,2600.0,5.0,3.0,120.0,461.538462
2,Built-up Area,Uttarahalli,3.0,1440.0,2.0,3.0,62.0,430.555556
3,Super built-up Area,Lingadheeranahalli,3.0,1521.0,3.0,1.0,95.0,624.589086
4,Super built-up Area,Kothanur,2.0,1200.0,2.0,1.0,51.0,425.0
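As a quick sanity check, the formula can be reproduced by hand for the first two rows of the table. The sketch below uses only the `price` and `total_sqft` values shown there.

```python
# Values taken from the first two rows of the table
prices = [39.07, 120.0]
areas = [1056.0, 2600.0]

# Same formula as in the notebook cell: price * 10000 / total_sqft
per_area = [p * 10000 / a for p, a in zip(prices, areas)]

for value in per_area:
    print(round(value, 6))
```

The results, 369.981061 and 461.538462, match the `square_meter_price` column.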


## [4.5 - Outlier Removal](#Outlier-Removal)

In this section, we identify and remove outliers, partly with the help of the newly engineered feature. Removing these anomalies gives a cleaner dataset and should improve the model's accuracy and reliability.

As a first step, we remove listings with an implausibly small area per bedroom (less than 300 square feet per bedroom).

In [719]:
# Remove all properties with less than 300 square feet per bedroom (total_sqft / size)
df_tmp = df[(df['total_sqft'] / df['size'] < 300)]
print(df_tmp[['total_sqft', 'size', 'bath','balcony']].head())

df = df[~(df['total_sqft'] / df['size'] < 300)]
df.square_meter_price.describe()



    total_sqft  size  bath  balcony
58      1407.0   6.0   4.0      1.0
68      1350.0   8.0   7.0      0.0
70       500.0   3.0   3.0      2.0
78       460.0   2.0   1.0      0.0
89       710.0   6.0   6.0      3.0


count    12037.000000
mean       619.942773
std        398.669479
min          0.225742
25%        419.653179
50%        525.031056
75%        682.352941
max      17647.058824
Name: square_meter_price, dtype: float64
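The area-per-bedroom rule can be checked on a handful of rows. In the sketch below, the first two rows are ordinary listings from the earlier tables and the other three are among the rows flagged above. The `toy` frame itself is only an illustration.

```python
import pandas as pd

# First two rows: ordinary listings; last three: rows flagged by the filter
toy = pd.DataFrame({
    "total_sqft": [1056.0, 2600.0, 1407.0, 1350.0, 500.0],
    "size": [2.0, 4.0, 6.0, 8.0, 3.0],
})

sqft_per_bedroom = toy["total_sqft"] / toy["size"]

# Same condition as in the notebook: drop rows below 300 sqft per bedroom
kept = toy[~(sqft_per_bedroom < 300)]

print(sqft_per_bedroom.round(2).tolist())
print(len(kept))
```

Only the first two rows (528 and 650 square feet per bedroom) survive. The remaining three fall between roughly 167 and 235 square feet per bedroom.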

The following function removes outliers based on the square_meter_price column, separately for each location in the dataset. It keeps only the data points within one standard deviation of the location's mean, filtering out extreme values that could skew the analysis or model training.

In [720]:
def remove_pps_outliers(df):
    # Keep, per location, only the rows within one standard deviation of the mean
    kept = []
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.square_meter_price)
        st = np.std(subdf.square_meter_price)
        kept.append(subdf[(subdf.square_meter_price > (m - st)) & (subdf.square_meter_price <= (m + st))])
    return pd.concat(kept, ignore_index=True)

In [721]:
# Remove outliers from the dataset using the method defined above
df_new = remove_pps_outliers(df)

print("Removed Samples: ", df.shape[0] - df_new.shape[0])

df = df_new

Removed Samples:  3107
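The same per-location rule can also be written without an explicit loop. The sketch below is an alternative formulation, not the notebook's function. It uses `groupby(...).transform` on a made-up `toy` frame and assumes the population standard deviation (`ddof=0`), which is what `np.std` computes by default.

```python
import pandas as pd

# Hypothetical data: location A has one extreme value (500)
toy = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B", "B"],
    "square_meter_price": [100.0, 110.0, 120.0, 500.0, 300.0, 310.0, 320.0],
})

values = toy["square_meter_price"]
grouped = toy.groupby("location")["square_meter_price"]

# Broadcast each location's mean and standard deviation back onto its rows
mean = grouped.transform("mean")
std = grouped.transform(lambda s: s.std(ddof=0))

# Keep rows with mean - std < value <= mean + std
filtered = toy[(values > mean - std) & (values <= mean + std)].reset_index(drop=True)
print(filtered)
```

In location A only the extreme value 500 is removed, while in location B the one-standard-deviation band is narrow enough that only the middle value 310 remains. This shows how aggressive the rule is, which is consistent with the large number of removed samples above. Unlike the loop version, this variant keeps the original row order instead of grouping rows by location.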


## [4.6 - Further Encoding Categorical Variables](#Futher-Encoding-Categorical-Variables)
The next step is to convert the **location** feature to a numerical feature by using One Hot Encoding.

In [722]:
# Print the number of unique values in the location column
print(len(df.location.unique()))

744


In [723]:
# Number of locations with fewer than 10 entries
print(len(df['location'].value_counts()[df['location'].value_counts() < 10]))

557


To simplify the dataset and enhance the effectiveness of model training, we will replace all locations that appear fewer than 10 times with the label 'Other'.

In [724]:
# Get the locations with fewer than 10 occurrences
rare_locations = df['location'].value_counts()[df['location'].value_counts() < 10].index

# Replace rare locations with 'Other'
df.loc[df['location'].isin(rare_locations), 'location'] = 'Other'

df.head(40)

Unnamed: 0,area_type,location,size,total_sqft,bath,balcony,price,square_meter_price
0,Super built-up Area,Other,3.0,1672.0,3.0,2.0,150.0,897.129187
1,Built-up Area,Other,3.0,1750.0,3.0,3.0,149.0,851.428571
2,Super built-up Area,Other,3.0,1750.0,3.0,2.0,150.0,857.142857
3,Super built-up Area,Devarachikkanahalli,3.0,1250.0,2.0,3.0,44.0,352.0
4,Super built-up Area,Devarachikkanahalli,2.0,1250.0,2.0,2.0,40.0,320.0
5,Plot Area,Devarachikkanahalli,2.0,1200.0,2.0,2.0,83.0,691.666667
6,Super built-up Area,Devarachikkanahalli,2.0,1170.0,2.0,2.0,40.0,341.880342
7,Super built-up Area,Devarachikkanahalli,3.0,1425.0,2.0,2.0,65.0,456.140351
8,Super built-up Area,Devarachikkanahalli,2.0,947.0,2.0,2.0,43.0,454.06547
9,Super built-up Area,Devarachikkanahalli,2.0,1130.0,2.0,2.0,36.0,318.584071
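The grouping of rare categories can be illustrated on a small series. This sketch is one possible formulation using `map` and `where`. The sample locations and the threshold of 3 are hypothetical (the notebook uses 10).

```python
import pandas as pd

locations = pd.Series(
    ["Whitefield", "Whitefield", "Whitefield", "Hebbal", "Hebbal", "Kengeri"]
)
min_count = 3  # hypothetical threshold; the notebook uses 10

# Map every row to the frequency of its own location, then relabel the rare ones
counts = locations.map(locations.value_counts())
grouped = locations.where(counts >= min_count, "Other")

print(grouped.value_counts().to_dict())
```

Here "Hebbal" and "Kengeri" both fall below the threshold and end up in the shared 'Other' bucket, which keeps the number of one-hot columns manageable.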


Now let's encode the location feature by using One Hot Encoding.

In [725]:
#Encoding the categorical feature
df = pd.get_dummies(df, columns=['location'])

df.head()

Unnamed: 0,area_type,size,total_sqft,bath,balcony,price,square_meter_price,location_ Devarachikkanahalli,location_1st Phase JP Nagar,location_5th Phase JP Nagar,...,location_Uttarahalli,location_Varthur,location_Vidyaranyapura,location_Vijayanagar,location_Vittasandra,location_Whitefield,location_Yelachenahalli,location_Yelahanka,location_Yelahanka New Town,location_Yeshwanthpur
0,Super built-up Area,3.0,1672.0,3.0,2.0,150.0,897.129187,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,Built-up Area,3.0,1750.0,3.0,3.0,149.0,851.428571,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,Super built-up Area,3.0,1750.0,3.0,2.0,150.0,857.142857,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,Super built-up Area,3.0,1250.0,2.0,3.0,44.0,352.0,True,False,False,...,False,False,False,False,False,False,False,False,False,False
4,Super built-up Area,2.0,1250.0,2.0,2.0,40.0,320.0,True,False,False,...,False,False,False,False,False,False,False,False,False,False
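For readers new to One Hot Encoding, the effect of `pd.get_dummies` is easiest to see on a tiny frame. The `toy` data below is made up, and `dtype=int` is an optional choice that yields 0/1 columns instead of the booleans shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["Whitefield", "Other", "Whitefield"],
    "bath": [2.0, 3.0, 2.0],
})

# Each distinct location becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["location"], dtype=int)
print(encoded)
```

The original `location` column is replaced by `location_Other` and `location_Whitefield`, and each row has exactly one of them set to 1.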


The next step is to convert the **area_type** feature to a numerical feature by using One Hot Encoding.

In [726]:
# Check how many unique values are there in the area_type column
print(df['area_type'].nunique())

# Print the names of the unique area types
print(df['area_type'].unique())


#Encoding the categorical feature
df = pd.get_dummies(df, columns=['area_type'])

# Print the number of columns in the dataframe after encoding 
print(len(df.columns))

df.head()

4
['Super built-up  Area' 'Built-up  Area' 'Plot  Area' 'Carpet  Area']
198


Unnamed: 0,size,total_sqft,bath,balcony,price,square_meter_price,location_ Devarachikkanahalli,location_1st Phase JP Nagar,location_5th Phase JP Nagar,location_6th Phase JP Nagar,...,location_Vittasandra,location_Whitefield,location_Yelachenahalli,location_Yelahanka,location_Yelahanka New Town,location_Yeshwanthpur,area_type_Built-up Area,area_type_Carpet Area,area_type_Plot Area,area_type_Super built-up Area
0,3.0,1672.0,3.0,2.0,150.0,897.129187,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
1,3.0,1750.0,3.0,3.0,149.0,851.428571,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
2,3.0,1750.0,3.0,2.0,150.0,857.142857,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
3,3.0,1250.0,2.0,3.0,44.0,352.0,True,False,False,False,...,False,False,False,False,False,False,False,False,False,True
4,2.0,1250.0,2.0,2.0,40.0,320.0,True,False,False,False,...,False,False,False,False,False,False,False,False,False,True


## [4.7 - Feature Scaling](#Feature-Scaling)

In this section, we will apply feature scaling and normalization to ensure that all features contribute equally to the model, as algorithms like linear regression can be sensitive to the scale of input data. This step is crucial to improve the model's performance by preventing features with larger ranges from dominating the learning process.

In [727]:
# Create a StandardScaler object
scaler = StandardScaler()

# Standardize all numerical columns (zero mean, unit variance per column)
num_cols = ['size', 'bath', 'balcony', 'total_sqft', 'price', 'square_meter_price']
df[num_cols] = scaler.fit_transform(df[num_cols])

df.head()

Unnamed: 0,size,total_sqft,bath,balcony,price,square_meter_price,location_ Devarachikkanahalli,location_1st Phase JP Nagar,location_5th Phase JP Nagar,location_6th Phase JP Nagar,...,location_Vittasandra,location_Whitefield,location_Yelachenahalli,location_Yelahanka,location_Yelahanka New Town,location_Yeshwanthpur,area_type_Built-up Area,area_type_Carpet Area,area_type_Plot Area,area_type_Super built-up Area
0,0.59375,0.019587,0.644048,0.512552,0.680647,1.375567,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
1,0.59375,0.030822,0.644048,1.775896,0.669476,1.186001,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
2,0.59375,0.030822,0.644048,0.512552,0.680647,1.209704,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
3,0.59375,-0.041193,-0.471852,1.775896,-0.503464,-0.885631,True,False,False,False,...,False,False,False,False,False,False,False,False,False,True
4,-0.648562,-0.041193,-0.471852,0.512552,-0.548147,-1.018367,True,False,False,False,...,False,False,False,False,False,False,False,False,False,True
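`StandardScaler` rescales each column to zero mean and unit variance by computing `(x - mean) / std` per column. A minimal sketch on made-up numbers that checks the scaler against this formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column input
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaled = StandardScaler().fit_transform(x)

# Manual standardization with the population standard deviation (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
```

Because the scaling is done column by column, scaling several columns in one call or in separate calls gives the same values.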


# [5 - Data Splitting](#Data-Splitting)

## [5.1 - Split into Train, Validation, and Test Sets](#Split-into-Train-Validation-and-Test-Sets)


In this section, we split the dataset into training, validation, and test sets.
This step is essential to evaluate the model's performance, tune hyperparameters, and check that it generalizes to unseen data.


In [728]:
# First, split the data into training + validation and test sets (80% train+validation, 20% test)
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Next, split the training + validation set into separate training and validation sets (75% train, 25% validation)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)

# Display the sizes of each set to verify the split
print("Training set size:", len(train_df))
print("Validation set size:", len(val_df))
print("Test set size:", len(test_df))

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
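The two-stage split in this section corresponds to overall proportions of 60% training, 20% validation, and 20% test, since 0.75 × 0.8 = 0.6. A small sketch on a hypothetical 100-row frame shows the resulting sizes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with 100 rows
toy = pd.DataFrame({"x": range(100)})

# Stage 1: hold out 20% as the test set
train_val, test = train_test_split(toy, test_size=0.2, random_state=42)

# Stage 2: take 25% of the remaining 80% as the validation set
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```

This prints `60 20 20`. The three subsets are disjoint and together cover every row of the original frame.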

## [5.2 - Split Data into Features (X) and Target (y)](#Split-Data-into-Features-X-and-Target-y)

In this section, we divide each of our sets into two components: the features (X) and the target variable (y). The features (X) consist of all the independent variables that are used as input to the model, while the target variable (y) represents the outcome we aim to predict, in this case the house price. This separation is required for training and evaluating the model.



In [None]:
# Define the target column
target_column = 'price'

# Split the training set into features (X_train) and target (y_train)
X_train = train_df.drop(columns=[target_column])
y_train = train_df[target_column].astype(int)

# Split the validation set into features (X_val) and target (y_val)
X_val = val_df.drop(columns=[target_column])
y_val = val_df[target_column].astype(int)

# Split the test set into features (X_test) and target (y_test)
X_test = test_df.drop(columns=[target_column])
y_test = test_df[target_column].astype(int)

# Display the first few rows of each to verify
print("Training features (X_train):")
print(X_train.head())
print("\nTraining target (y_train):")
print(y_train.head())

print("\nValidation features (X_val):")
print(X_val.head())
print("\nValidation target (y_val):")
print(y_val.head())

print("\nTest features (X_test):")
print(X_test.head())
print("\nTest target (y_test):")
print(y_test.head())