
Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, PowerTransformer, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

Loading the dataset

In [2]:
diamonds = sns.load_dataset('diamonds')

Viewing the dataset

In [3]:
diamonds

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


Checking for missing values and duplicate entries

In [4]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


From the information above, we can see that there are no missing values in the dataset.
<br>There are three categorical columns:
1. cut
2. color
3. clarity
<br>All the remaining columns are numerical.
<br>`price` is our target column.
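The same checks can also be made explicitly instead of being read off `info()`. Below is a minimal sketch on a small hypothetical frame; `toy` and its values are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature frame with the same kinds of columns as diamonds.
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": pd.Categorical(["Ideal", "Premium", "Good"]),
    "price": [326, 326, 334],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Split the column names by dtype instead of listing them by hand.
categorical_cols = toy.select_dtypes(include="category").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(missing_per_column)
print(categorical_cols, numeric_cols)
```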

In [5]:
diamonds.duplicated().sum()

146

We can see that there are 146 duplicates. We will drop the duplicate entries.

In [6]:
diamonds.drop_duplicates(inplace = True)
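As a small illustration of what these two calls do, here is a sketch on a hypothetical four-row frame (`toy` is not part of the notebook). Resetting the index afterwards is optional and is shown only as one possible choice.

```python
import pandas as pd

# Hypothetical frame in which rows 1 and 3 repeat row 0.
toy = pd.DataFrame({
    "carat": [0.30, 0.30, 0.50, 0.30],
    "price": [400, 400, 900, 400],
})

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and the original index labels,
# so reset_index(drop=True) is used here to get a contiguous index again.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```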

We will now look at the summary statistics of the dataset.

In [8]:
diamonds.describe()

,carat,depth,table,price,x,y,z
count,53794.0,53794.0,53794.0,53794.0,53794.0,53794.0,53794.0
mean,0.79778,61.74808,57.458109,3933.065082,5.731214,5.734653,3.538714
std,0.47339,1.429909,2.233679,3988.11446,1.120695,1.141209,0.705037
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,951.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5326.75,6.54,6.54,4.03
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


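The maxima of price, x, y and z lie well above their 75th percentiles (y, for example, reaches 58.9 against a 75th percentile of 6.54), so these features may contain outliers. The minimum of 0.0 for x, y and z also looks suspicious, since a physical dimension of zero is not plausible. Some of the large values, however, may simply belong to unique or special diamonds.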
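One way to make this inspection concrete is to count zero-valued dimensions and to apply Tukey's 1.5 × IQR rule. The sketch below uses a hypothetical six-row frame; the rule and the threshold of 1.5 are a common convention chosen for illustration, not something the notebook applies.

```python
import pandas as pd

# Hypothetical frame: the last row has a zero x and an extreme y.
toy = pd.DataFrame({
    "x": [4.0, 4.5, 5.0, 5.5, 6.0, 0.0],
    "y": [4.1, 4.6, 5.1, 5.6, 6.1, 58.9],
})

# Rows where any dimension is recorded as zero.
zero_dim_rows = toy[(toy[["x", "y"]] == 0).any(axis=1)]

# Tukey's rule: flag values further than 1.5 * IQR outside the quartiles.
q1 = toy["y"].quantile(0.25)
q3 = toy["y"].quantile(0.75)
iqr = q3 - q1
outlier_mask = (toy["y"] < q1 - 1.5 * iqr) | (toy["y"] > q3 + 1.5 * iqr)

print(len(zero_dim_rows), int(outlier_mask.sum()))
```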

Splitting the independent features and the target feature

In [9]:
X = diamonds.drop(columns = ['price'])
y = diamonds['price']

Splitting X and y into train and test sets

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
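To show what `test_size = 0.25` and `random_state = 42` imply, here is a minimal sketch on hypothetical arrays (`X_toy` and `y_toy` are illustrative): a quarter of the rows go to the test set, features and target stay aligned, and repeating the call with the same seed reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 samples, 2 features; the first feature equals 2 * target.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# Same arguments and same seed give the same split again.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

print(X_tr.shape, X_te.shape)
```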

Identifying ordinal, nominal categorical variables and continuous variables

In [12]:
X_Ordinal = ['cut']
X_Nominal = ['color','clarity']
X_cont = ['carat','depth','table','x','y','z']

Ordinal Categories Pipeline

In [16]:
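ordinal_pipeline = Pipeline(steps = [
    # Fill missing values with the most frequent category.
    ('SimpleImputing',SimpleImputer(strategy = 'most_frequent')),
    # The position in the list is the integer code: Premium=0, Very Good=1, Good=2, Fair=3, Ideal=4.
    # Note that this is not the usual quality ranking (Fair < Good < Very Good < Premium < Ideal).
    ('ordinalEncoding',OrdinalEncoder(categories=[['Premium','Very Good','Good','Fair','Ideal']]))
])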
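The order of the list passed to `categories` determines the integer code each category receives. The sketch below assumes the conventional grading order Fair < Good < Very Good < Premium < Ideal; that ordering is an assumption made for illustration and is not the order used in the pipeline above. The names `quality_order` and `cuts` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Assumed quality ranking, worst to best (illustrative, not the notebook's list).
quality_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
encoder = OrdinalEncoder(categories=[quality_order])

# OrdinalEncoder expects a 2-D input: one column per feature.
cuts = np.array([["Fair"], ["Ideal"], ["Good"], ["Premium"]], dtype=object)
codes = encoder.fit_transform(cuts).ravel()

print(codes)
```

With this ordering a larger code always means a better cut, which is the property an ordinal encoding is meant to preserve.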

Nominal Categories Pipeline

In [17]:
nominal_pipeline = Pipeline(steps = [
    ('SimpleImputing',SimpleImputer(strategy = 'most_frequent')),
    ('oneHotEncoding',OneHotEncoder(sparse_output = False, drop='first'))
])

Continuous variables Pipeline

In [18]:
cont_pipeline = Pipeline(steps = [
    ('SimpleImputing',SimpleImputer(strategy = 'median')),
    ('robustScaling',RobustScaler()),
    ('powerTransformation',PowerTransformer())
])
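As a rough check of what the scaling and power-transformation steps produce, the sketch below runs them on hypothetical right-skewed data (`skewed` is synthetic). `PowerTransformer` uses the Yeo-Johnson method and standardizes its output by default, so the result should have approximately zero mean and unit variance.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, RobustScaler

# Synthetic right-skewed feature (log-normal), for illustration only.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

toy_pipeline = Pipeline(steps=[
    ("robustScaling", RobustScaler()),            # centre on the median, scale by the IQR
    ("powerTransformation", PowerTransformer()),  # Yeo-Johnson, then standardize
])
transformed = toy_pipeline.fit_transform(skewed)

print(transformed.mean(), transformed.std())
```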

Column Transformer for wrapping all the continuous and categorical pipelines.

In [19]:
pre_col_trans = ColumnTransformer(transformers = [
    ('ordinalPipeline',ordinal_pipeline,X_Ordinal),
    ('nominalPipeline',nominal_pipeline,X_Nominal),
    ('continuousPipeline',cont_pipeline,X_cont)
    ], remainder = 'passthrough')

Final Pipeline

In [20]:
final_pipeline = Pipeline(steps=[
    ('preColTransformer', pre_col_trans)
    ])

Final Pipeline Diagram

In [25]:
final_pipeline

Applying fit_transform on our Train data

In [21]:
X_train_ft = pd.DataFrame(final_pipeline.fit_transform(X_train),
                          columns = final_pipeline.get_feature_names_out(),
                          index = X_train.index)

In [22]:
X_train_ft

,ordinalPipeline__cut,nominalPipeline__color_E,nominalPipeline__color_F,nominalPipeline__color_G,nominalPipeline__color_H,nominalPipeline__color_I,nominalPipeline__color_J,nominalPipeline__clarity_IF,nominalPipeline__clarity_SI1,nominalPipeline__clarity_SI2,nominalPipeline__clarity_VS1,nominalPipeline__clarity_VS2,nominalPipeline__clarity_VVS1,nominalPipeline__clarity_VVS2,continuousPipeline__carat,continuousPipeline__depth,continuousPipeline__table,continuousPipeline__x,continuousPipeline__y,continuousPipeline__z
12820,4.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.719284,0.447645,-0.100971,0.721947,0.646248,0.718761
19997,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.769790,-0.323142,0.757046,0.729743,0.799002,0.718761
6099,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.518909,0.589454,0.350840,0.481751,0.470807,0.523787
37984,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.252853,-0.599611,-0.100971,-1.275784,-1.226540,-1.277152
24865,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.401892,0.095201,-0.613787,1.392846,1.305833,1.369668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11311,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.684692,-0.943058,0.350840,0.753053,0.723376,0.615678
44869,4.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,-0.145376,-0.044807,-1.804268,-0.090204,-0.038344,-0.082748
38271,4.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,-1.252853,0.025121,-1.243771,-1.275784,-1.262204,-1.241295
860,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.499369,0.731675,0.757046,0.448735,0.379833,0.483895


Applying transform on our Test data

In [23]:
X_test_ft = pd.DataFrame(final_pipeline.transform(X_test),
                         columns = final_pipeline.get_feature_names_out(),
                         index = X_test.index)
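The test data is only passed through `transform`, so every statistic (imputation values, scaling parameters, encoder categories) comes from the training data alone. A minimal sketch of that behaviour with a single imputer on hypothetical arrays (`train` and `test` are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature arrays with missing values.
train = np.array([[1.0], [2.0], [3.0], [np.nan]])
test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                      # learns the median of the training values
learned_median = imputer.statistics_[0]

# The missing test value is filled with the training median, not the test median.
test_imputed = imputer.transform(test)

print(learned_median, test_imputed.ravel())
```

Calling `fit_transform` on the test set instead would let test values influence the learned statistics, which is the leakage this fit/transform split avoids.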

In [24]:
X_test_ft

,ordinalPipeline__cut,nominalPipeline__color_E,nominalPipeline__color_F,nominalPipeline__color_G,nominalPipeline__color_H,nominalPipeline__color_I,nominalPipeline__color_J,nominalPipeline__clarity_IF,nominalPipeline__clarity_SI1,nominalPipeline__clarity_SI2,nominalPipeline__clarity_VS1,nominalPipeline__clarity_VS2,nominalPipeline__clarity_VVS1,nominalPipeline__clarity_VVS2,continuousPipeline__carat,continuousPipeline__depth,continuousPipeline__table,continuousPipeline__x,continuousPipeline__y,continuousPipeline__z
43657,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.073616,2.243935,-1.804268,0.012448,-0.066839,0.209298
4274,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.499369,-0.530649,0.757046,0.457012,0.503330,0.403143
47412,4.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.327525,0.235792,-0.613787,-0.302815,-0.262270,-0.266970
44437,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-0.525082,0.518496,1.130257,-0.524706,-0.498360,-0.457528
13975,4.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.986147,0.376908,-1.183900,0.972801,0.989182,1.016923
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43980,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-1.252853,-0.943058,0.757046,-1.275784,-1.262204,-1.349284
1115,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.124169,-0.668477,-0.100971,0.166113,0.208151,0.094763
48829,4.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-0.491084,0.165428,-1.183900,-0.453129,-0.414900,-0.425343
42876,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,-0.359380,1.953676,1.478090,-0.382457,-0.508891,-0.220307
