Exercise 1: Loading Data with Pandas
Objective: Learn how to load and inspect datasets using Pandas.
Steps:
Import the Pandas library and load a CSV file into a DataFrame.
Use the head(), tail(), and info() functions to inspect the dataset.
Check for missing values and data types of each column using isnull() and dtypes.


In [2]:
import pandas as pd
pdDf=pd.read_csv("cust_segment.csv")
pdDf.head()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A


In [3]:
pdDf.tail()


Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
8063,464018,Male,No,22,No,,0.0,Low,7.0,Cat_1,D
8064,464685,Male,No,35,No,Executive,3.0,Low,4.0,Cat_4,D
8065,465406,Female,No,33,Yes,Healthcare,1.0,Low,1.0,Cat_6,D
8066,467299,Female,No,27,Yes,Healthcare,1.0,Low,4.0,Cat_6,B
8067,461879,Male,Yes,37,Yes,Executive,0.0,Average,3.0,Cat_4,B


In [4]:
pdDf.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB


In [5]:
pdDf.isnull()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
8063,False,False,False,False,False,True,False,False,False,False,False
8064,False,False,False,False,False,False,False,False,False,False,False
8065,False,False,False,False,False,False,False,False,False,False,False
8066,False,False,False,False,False,False,False,False,False,False,False


In [6]:
pdDf.dtypes

ID                   int64
Gender              object
Ever_Married        object
Age                  int64
Graduated           object
Profession          object
Work_Experience    float64
Spending_Score      object
Family_Size        float64
Var_1               object
Segmentation        object
dtype: object

Questions:
How do you load a CSV file into a Pandas DataFrame?
I used the read_csv() function from the Pandas library, passing it the path to the CSV file.
What information does the info() function provide about the dataset?
The info() function gives a summary of the DataFrame: the number of rows and columns, the data type and non-null count of each column, and the amount of memory the DataFrame uses.
How can you identify missing values in the dataset?
isnull() returns a Boolean DataFrame of the same shape, where True marks a missing value and False means the value exists.
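The full Boolean table is hard to read on thousands of rows, so it is usually summarized. Below is a minimal sketch of two common summaries. It uses a tiny made-up CSV read from a string instead of cust_segment.csv, so the values and the `csv_text` variable are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a real file on disk.
csv_text = """Gender,Age,Work_Experience
Male,22,1.0
Female,38,
Female,67,1.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

The per-column count shows which columns need attention, while the row filter shows which records are affected.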

Exercise 2: Handling Missing Data
Objective: Practice techniques for handling missing data in a dataset.
Steps:
Identify missing values in the dataset using isnull().sum().
Use different strategies to handle missing data:
Remove rows with missing values using dropna().
Fill missing values with the mean, median, or a specific value using fillna().
Use forward or backward filling (ffill() or bfill()) to fill missing data.
Compare the results of each method.


In [9]:
pdDf.isnull().sum()

ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64

In [10]:
pdDf = pdDf.dropna(subset=['Var_1'])
pdDf.isnull().sum()

ID                   0
Gender               0
Ever_Married       139
Age                  0
Graduated           78
Profession         121
Work_Experience    820
Spending_Score       0
Family_Size        321
Var_1                0
Segmentation         0
dtype: int64

In [11]:
mean_work=pdDf['Work_Experience'].mean()
pdDf['Work_Experience'] = pdDf['Work_Experience'].fillna(mean_work)
pdDf.isnull().sum()

ID                   0
Gender               0
Ever_Married       139
Age                  0
Graduated           78
Profession         121
Work_Experience      0
Spending_Score       0
Family_Size        321
Var_1                0
Segmentation         0
dtype: int64

In [12]:
pdDf['Ever_Married']=pdDf['Ever_Married'].bfill()
pdDf.isnull().sum()

ID                   0
Gender               0
Ever_Married         0
Age                  0
Graduated           78
Profession         121
Work_Experience      0
Spending_Score       0
Family_Size        321
Var_1                0
Segmentation         0
dtype: int64

Questions:
What strategy did you use to handle missing values, and why?
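dropna() was used for Var_1, a less important column that also has the fewest missing values, so few rows are lost. fillna() with the column mean was used for Work_Experience, the numeric column with the most missing values. bfill() was used for Ever_Married; it copies the next valid value backward, which in a dataset with no meaningful row order behaves like an arbitrary fill.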
How did filling missing values affect the dataset?
The dataset keeps its size, so all remaining rows and columns can be used for analysis. The filled values are estimates, however, so they can change a column's distribution.
When might it be more appropriate to drop rows with missing values instead of filling them?
When the dataset is large and only a small share of rows have missing values, so dropping them does not significantly affect the analysis.
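To compare the strategies side by side, here is a minimal sketch on a made-up five-value Series (the numbers are illustrative, not from the dataset). It shows what each method puts in the gaps.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0])

dropped = s.dropna()              # removes the two missing entries
mean_filled = s.fillna(s.mean())  # mean of 1, 3, and 8 is 4.0
forward = s.ffill()               # copies the previous valid value forward
backward = s.bfill()              # copies the next valid value backward

comparison = pd.DataFrame({
    "original": s,
    "mean": mean_filled,
    "ffill": forward,
    "bfill": backward,
})
print(comparison)
```

Forward and backward filling depend on row order, so they are most defensible for ordered data such as time series.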

In [14]:
pdDf = pdDf.dropna(subset=['Profession'])
mean_family=pdDf['Family_Size'].mean()
pdDf['Family_Size'] = pdDf['Family_Size'].fillna(mean_family)
pdDf['Graduated']=pdDf['Graduated'].bfill()
pdDf.isnull().sum()

ID                 0
Gender             0
Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
Var_1              0
Segmentation       0
dtype: int64

In [15]:
pdDf = pdDf.drop('ID', axis=1)
pdDf = pdDf.drop('Profession', axis=1)

Exercise 3: Data Transformation
Objective: Transform data to prepare it for analysis.
Steps:
Normalize numerical features using Min-Max scaling or Z-score standardization with sklearn.preprocessing.
Encode categorical variables using one-hot encoding with pd.get_dummies() or sklearn.preprocessing.OneHotEncoder.
Use pd.cut() to bin continuous variables into discrete intervals.

In [17]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# After dropping ID and Profession, positions 0, 1, and 7 are Gender, Ever_Married, and Var_1.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0, 1, 7])], remainder='passthrough')
pdDf = pd.DataFrame(ct.fit_transform(pdDf))


In [18]:
pdDf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,22,No,1.0,Low,4.0,D
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,38,Yes,2.642638,Average,3.0,A
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,67,Yes,1.0,Low,1.0,B
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,67,Yes,0.0,High,2.0,B
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,40,Yes,2.642638,High,6.0,A


In [19]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Positions of Graduated, Spending_Score, and Segmentation after one-hot encoding.
columns = [12, 14, 16]

for col in columns:
    pdDf.iloc[:, col] = le.fit_transform(pdDf.iloc[:, col])

In [20]:
pdDf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,22,0,1.0,2,4.0,3
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,38,1,2.642638,0,3.0,0
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,67,1,1.0,2,1.0,1
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,67,1,0.0,1,2.0,1
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,40,1,2.642638,1,6.0,0


In [21]:
age_bins = pd.cut(x=pdDf[11], bins=[1, 20, 40, 60, 80, 100],
                    labels=['1 to 20', '21 to 40', '41 to 60',
                            '61 to 80', '81 to 100'])

In [22]:
from sklearn.preprocessing import MinMaxScaler

segments=pdDf[16]
scaler = MinMaxScaler()
pdDf = pd.DataFrame(scaler.fit_transform(pdDf))
pdDf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.056338,0.0,0.071429,1.0,0.375,1.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.28169,1.0,0.18876,0.0,0.25,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.071429,1.0,0.0,0.333333
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.0,0.5,0.125,0.333333
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.309859,1.0,0.18876,0.5,0.625,0.0


In [23]:
pdDf[16]=segments
pdDf['age_bins']=age_bins
pdDf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,age_bins
0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.056338,0.0,0.071429,1.0,0.375,3,21 to 40
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.28169,1.0,0.18876,0.0,0.25,0,21 to 40
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.071429,1.0,0.0,1,61 to 80
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.0,0.5,0.125,1,61 to 80
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.309859,1.0,0.18876,0.5,0.625,0,21 to 40


Questions:
What is the difference between normalization and standardization?
Normalization scales data to a fixed range, usually [0, 1]. It is useful when all values should be on the same bounded scale, as in neural networks. Standardization transforms data to have a mean of 0 and a standard deviation of 1. It is helpful for models that are sensitive to feature scale or work best with centered features, such as SVMs or regularized linear regression.
In short, normalization focuses on scaling to a range, while standardization focuses on centering data around the mean with unit variance.
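The two formulas are (x - min) / (max - min) for Min-Max normalization and (x - mean) / std for Z-score standardization. Here is a minimal sketch applying both to a made-up column of ages (the values are illustrative assumptions).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up ages as a single feature column.
ages = np.array([[20.0], [30.0], [40.0], [60.0]])

# Min-Max: (x - min) / (max - min), so the result lies in [0, 1].
normalized = MinMaxScaler().fit_transform(ages)

# Z-score: (x - mean) / std, so the result has mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(ages)

print(normalized.ravel())
print(standardized.ravel())
```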
How does one-hot encoding transform categorical variables?
One-hot encoding replaces a categorical variable with one binary column per unique category; each row has a 1 in the column for its category and 0s elsewhere.
Why might you want to bin continuous variables into categories?
Binning can simplify the data by converting continuous variables into discrete categories, which some algorithms handle more easily. It can also reduce the impact of outliers by grouping extreme values into a common category.
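The last two answers can be illustrated with pd.get_dummies() and pd.cut(), the Pandas alternatives named in the steps. This is a minimal sketch on made-up rows; the `Age_Group` column name and the bin edges are assumptions chosen for the example.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [22, 38, 67],
})

# One binary column per category: Gender_Female and Gender_Male.
encoded = pd.get_dummies(df, columns=["Gender"], dtype=int)

# Right-inclusive bins: (0, 20], (20, 40], (40, 60], (60, 80].
encoded["Age_Group"] = pd.cut(
    df["Age"],
    bins=[0, 20, 40, 60, 80],
    labels=["1 to 20", "21 to 40", "41 to 60", "61 to 80"],
)
print(encoded)
```

Unlike the positional column indices produced by ColumnTransformer, get_dummies() keeps readable column names, which makes later steps easier to follow.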

Exercise 4: Feature Engineering
Objective: Create new features to improve the predictive power of a dataset.
Steps:
Create new features by combining or transforming existing features (e.g., adding interaction terms or polynomial features).
Extract date-based features (e.g., year, month, day) from datetime columns using pd.to_datetime() and dt accessor.
Use domain knowledge to engineer features that might be useful for your specific problem.

In [26]:
# Geometric mean of scaled Age (column 11) and scaled Spending_Score (column 14).
pdDf['Age_Spending_Score'] = (pdDf[11] * pdDf[14])**0.5

In [27]:
pdDf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,age_bins,Age_Spending_Score
0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.056338,0.0,0.071429,1.0,0.375,3,21 to 40,0.237356
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.28169,1.0,0.18876,0.0,0.25,0,21 to 40,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.071429,1.0,0.0,1,61 to 80,0.830747
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.0,0.5,0.125,1,61 to 80,0.587427
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.309859,1.0,0.18876,0.5,0.625,0,21 to 40,0.393611


Questions:
What new features did you create, and why?
I created Age_Spending_Score, the square root of the product of the scaled Age and Spending_Score columns (their geometric mean), to capture potential interactions between age and spending behavior.
How did the new features improve the dataset?
The dataset now has an interaction term that can help a model capture how spending behavior changes with age. This can improve predictive performance if the interaction between age and spending is significant.
How can date-based features be useful in a dataset?
Date-based features expose trends and patterns over time. We can extract the year, month, day, day of the week, and time of day, which helps in analyzing time-dependent behaviors and trends.
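cust_segment.csv has no datetime column, so the date-based step is not shown above. Here is a minimal sketch of it on a hypothetical `purchase_date` column (the column name and dates are assumptions made up for the example), using pd.to_datetime() and the dt accessor.

```python
import pandas as pd

# Hypothetical purchase dates; the real dataset has no datetime column.
df = pd.DataFrame({"purchase_date": ["2024-01-15", "2024-03-02", "2024-12-25"]})
df["purchase_date"] = pd.to_datetime(df["purchase_date"])

df["year"] = df["purchase_date"].dt.year
df["month"] = df["purchase_date"].dt.month
df["day"] = df["purchase_date"].dt.day
df["day_of_week"] = df["purchase_date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5
print(df)
```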

Exercise 5: Data Cleaning
Objective: Clean data to ensure it's ready for analysis.
Steps:
Remove duplicate rows using drop_duplicates().
Detect and remove outliers using the Z-score method or the IQR method.
Correct inconsistencies in categorical data (e.g., standardizing text formats or merging similar categories).

In [30]:
pdDf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,age_bins,Age_Spending_Score
0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.056338,0.0,0.071429,1.0,0.375,3,21 to 40,0.237356
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.281690,1.0,0.188760,0.0,0.250,0,21 to 40,0.000000
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.071429,1.0,0.000,1,61 to 80,0.830747
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.000000,0.5,0.125,1,61 to 80,0.587427
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.309859,1.0,0.188760,0.5,0.625,0,21 to 40,0.393611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7866,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.323944,1.0,0.000000,0.5,0.500,1,41 to 60,0.402457
7867,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.239437,0.0,0.214286,1.0,0.375,3,21 to 40,0.489323
7868,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.211268,1.0,0.071429,1.0,0.000,3,21 to 40,0.459639
7869,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.126761,1.0,0.071429,1.0,0.375,1,21 to 40,0.356034


In [31]:
pdDf = pdDf.drop_duplicates()
pdDf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,age_bins,Age_Spending_Score
0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.056338,0.0,0.071429,1.0,0.375,3,21 to 40,0.237356
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.281690,1.0,0.188760,0.0,0.250,0,21 to 40,0.000000
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.071429,1.0,0.000,1,61 to 80,0.830747
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.000000,0.5,0.125,1,61 to 80,0.587427
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.309859,1.0,0.188760,0.5,0.625,0,21 to 40,0.393611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7864,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.422535,1.0,0.000000,0.0,0.625,0,41 to 60,0.000000
7866,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.323944,1.0,0.000000,0.5,0.500,1,41 to 60,0.402457
7867,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.239437,0.0,0.214286,1.0,0.375,3,21 to 40,0.489323
7869,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.126761,1.0,0.071429,1.0,0.375,1,21 to 40,0.356034


In [32]:
pdDf = pdDf.drop('age_bins', axis=1)
Q1 = pdDf.quantile(0.25)
Q3 = pdDf.quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

pdDf_new = pdDf[~((pdDf < lower_bound) | (pdDf > upper_bound)).any(axis=1)]

pdDf_new.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,Age_Spending_Score
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.071429,1.0,0.0,1,0.830747
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.690141,1.0,0.0,0.5,0.125,1,0.587427
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.309859,1.0,0.18876,0.5,0.625,0,0.393611
5,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.535211,0.0,0.0,0.0,0.125,2,0.0
6,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.197183,1.0,0.071429,1.0,0.25,2,0.444053


Questions:
How did you identify and handle duplicate rows in the dataset?
I checked the total number of rows and then called drop_duplicates(). About 700 rows were deleted.
What method did you use to detect and remove outliers, and why?
I used the IQR method to detect and remove outliers because the data may be skewed. Quartiles are not strongly affected by extreme values, so the IQR bounds give a robust way to detect outliers.
How did you address inconsistencies in categorical data?
I did not find any inconsistencies in the categorical data.
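Here is a minimal sketch of the IQR method and of categorical clean-up on made-up data (the values are illustrative). As an assumption of this sketch, the IQR bounds are computed for one continuous column only. For a 0/1 one-hot column the IQR is often 0, in which case every occurrence of the rarer value would be flagged as an outlier.

```python
import pandas as pd

# Made-up data: one extreme age and inconsistent spellings of the same category.
df = pd.DataFrame({
    "Age": [22, 25, 27, 30, 31, 33, 35, 120],
    "Spending_Score": ["Low", "low ", "LOW", "High", "high", "Average", " Low", "High"],
})

# IQR bounds computed for a single numeric column.
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
within_bounds = df["Age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[within_bounds].copy()

# Standardize text: trim whitespace and unify capitalization.
cleaned["Spending_Score"] = cleaned["Spending_Score"].str.strip().str.capitalize()
print(cleaned)
```

Checking `value_counts()` on each categorical column before and after the clean-up is a quick way to confirm whether inconsistent spellings exist.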

Exercise 6: Splitting Data into Training and Testing Sets
Objective: Prepare the data for model training by splitting it into training and testing sets.
Steps:
Use sklearn.model_selection.train_test_split() to split the dataset into training and testing sets.
Ensure that the target variable is correctly separated from the features.
Explore the impact of different train-test split ratios (e.g., 70-30, 80-20) on model performance.

In [35]:
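from sklearn.model_selection import train_test_split

# Column 16 holds the label-encoded Segmentation target; all other columns are features.
# The last column is the engineered Age_Spending_Score feature, so the target is selected by name.
X = pdDf.drop(columns=[16]).values
y = pdDf[16].astype(int).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)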

Questions:
How do you split a dataset into training and testing sets in Python?
I used the train_test_split function from the sklearn.model_selection module.
What considerations should you keep in mind when choosing a train-test split ratio?
Larger datasets can have a smaller percentage for testing, while smaller datasets might need a larger percentage to ensure a representative test set.
How does the size of the training set impact the model's ability to generalize?
A larger training set gives the model more examples to learn from, which generally improves performance and stability. However, it takes more time and computational resources.
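To explore different split ratios, the split can be repeated with several test_size values. This is a minimal sketch on synthetic stand-in data (the arrays are randomly generated assumptions, not the customer dataset). It also sets stratify and random_state, which the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features, 4 balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.repeat([0, 1, 2, 3], 25)

for test_size in (0.2, 0.3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        stratify=y,        # keep class proportions similar in both parts
        random_state=42,   # make the split reproducible
    )
    print(test_size, X_train.shape, X_test.shape)
```

Fixing random_state makes accuracy figures comparable between runs, because otherwise every call produces a different split.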

Exercise 7: Data Preprocessing Pipeline
Objective: Build a preprocessing pipeline to automate the data preparation process.
Steps:
Use sklearn.pipeline.Pipeline to create a pipeline that includes steps such as missing value imputation, feature scaling, and encoding categorical variables.
Fit the pipeline to the training data and transform the test data.
Integrate the preprocessing pipeline with a machine learning model for end-to-end training and evaluation.

In [38]:
pdDf2=pd.read_csv("cust_segment.csv")
pdDf2


Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A
...,...,...,...,...,...,...,...,...,...,...,...
8063,464018,Male,No,22,No,,0.0,Low,7.0,Cat_1,D
8064,464685,Male,No,35,No,Executive,3.0,Low,4.0,Cat_4,D
8065,465406,Female,No,33,Yes,Healthcare,1.0,Low,1.0,Cat_6,D
8066,467299,Female,No,27,Yes,Healthcare,1.0,Low,4.0,Cat_6,B


In [39]:
pdDf2 = pdDf2.drop('ID', axis=1)

In [40]:
numerical_features = ['Age', 'Work_Experience', 'Family_Size']
categorical_features = ['Gender', 'Ever_Married', 'Profession', 'Var_1']
label_features = ['Graduated', 'Spending_Score']


In [41]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

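numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # fill missing numbers with the column mean
    ('normalizer', Normalizer())                  # rescale each row (sample) to unit L2 norm
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # fill missing categories with the mode
    ('onehot', OneHotEncoder(handle_unknown='ignore'))     # categories unseen in training encode as all zeros
])

# Label-encode Graduated and Spending_Score in place.
le2 = LabelEncoder()

for col in label_features:
    pdDf2[col] = le2.fit_transform(pdDf2[col])

# Only the columns listed here reach the model: remainder defaults to 'drop',
# so the label-encoded columns are not passed to the classifier.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ]
)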
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
X = pdDf2.drop(columns=['Segmentation'])  
y = pdDf2['Segmentation']                       

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model_pipeline.fit(X_train, y_train)

y_pred = model_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.45


In [42]:
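# iloc[:, -1] selects the last column of pdDf, which at this point is Age_Spending_Score;
# the label-encoded Segmentation values are in column 16.
X = pdDf.iloc[:, :-1].values
y = pdDf.iloc[:, -1].values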

y = y.astype(str)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
rfc=RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.62


Questions:
What are the benefits of using a preprocessing pipeline?
It automates data preparation, so the same steps run in the same order every time and are easy to reuse.
How does the pipeline ensure consistency between training and test data transformations?
The pipeline is fitted on the training data only, and the same fitted transformations are then applied to both the training and test data, so both are processed identically.
How can you extend the pipeline to include additional preprocessing steps?
By adding further (name, transformer) steps to the pipeline sequence before the model training step.
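The last two answers can be sketched on made-up data. In this sketch the scaling step is StandardScaler, which is an assumption for illustration and not the Normalizer used in the notebook, and the `train` and `test` frames are invented. The preprocessor learns its statistics from the training frame and reuses them on the test frame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up train and test frames; "Other" appears only in the test data.
train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 67.0],
    "Gender": ["Male", "Female", "Female", np.nan],
})
test = pd.DataFrame({"Age": [np.nan, 40.0], "Gender": ["Male", "Other"]})

numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),  # an extra step is just another (name, transformer) pair
])
categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric, ["Age"]),
    ("cat", categorical, ["Gender"]),
])

# The mean, most frequent value, and category list are learned from train only...
train_out = preprocessor.fit_transform(train)
# ...and reused unchanged on the test data.
test_out = preprocessor.transform(test)
print(test_out)
```

The missing test age is filled with the training mean, and the unseen "Other" category becomes an all-zero one-hot row instead of raising an error.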