This notebook uses the Adult dataset (https://archive-beta.ics.uci.edu/ml/datasets/adult). The goal is to build a classification model that determines whether a given individual's income is greater than 50K USD (>50K) or at most 50K USD (<=50K).

Variables:

income (target): >50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Some variables need to be one-hot encoded, but we'll need to be careful about how we proceed with this.

education: OrdinalEncoder, since someone who has received higher education is more likely to have a higher earning potential in the employment market than someone who has not.
sex, race, native-country, marital-status, relationship, occupation, workclass: OneHotEncoder (LabelEncoder is used only for the income target). These variables can be quite complex, as they play a huge role in someone's earning potential, especially for women and once intersectionality is taken into consideration.
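
One thing to watch with the ordinal encoding: by default, scikit-learn's OrdinalEncoder assigns codes in sorted string order, which does not follow schooling level. Below is a minimal sketch of passing the order explicitly. The list `education_order` is an assumption: it is written by hand and agrees with the education-num values visible in the outputs further down (for example Bachelors 13, HS-grad 9), but it is not taken from the dataset documentation. The raw CSV values also carry a leading space, so the strings would need to match whatever is actually loaded.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering from least to most schooling (hand-written for illustration).
education_order = [
    "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th",
    "12th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm",
    "Bachelors", "Masters", "Prof-school", "Doctorate",
]

sample = pd.DataFrame({"education": ["HS-grad", "Bachelors", "10th", "Doctorate"]})

# Default behaviour: categories are sorted as strings before codes are assigned.
default_codes = OrdinalEncoder().fit_transform(sample).ravel()

# Explicit order: each code is the position of the value in education_order.
ordered_codes = OrdinalEncoder(categories=[education_order]).fit_transform(sample).ravel()

print(default_codes)
print(ordered_codes)
```

With the default settings "HS-grad" receives a higher code than "Doctorate", which is the opposite of the intended ranking.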

In [1]:
! pip install pandas numpy scikit-learn pandas-profiling



In [2]:

import os

from typing import List

import numpy as np
import pandas as pd


# Note: "education-nom" and "education-occupation" hold the dataset's
# education-num and occupation variables; the names are kept as-is because
# later cells refer to them.
colnames: List[str] = [
    "age", "workclass", "fnlwgt", "education", "education-nom",
    "marital-status", "education-occupation", "relationship", "race",
    "sex", "capital-gain", "capital-loss", "hours-per-week",
    "native-country", "income",
]

data_dir = os.path.join(os.getcwd(), "data")
train = pd.read_csv(os.path.join(data_dir, "adult.data"), names=colnames, header=None)
test = pd.read_csv(os.path.join(data_dir, "adult.test"), names=colnames, header=None)

print(train.shape, test.shape)

(32561, 15) (16282, 15)


In [3]:
print(train.head())

   age          workclass  fnlwgt   education  education-nom  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status education-occupation    relationship    race      sex  \
0        Never-married         Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse      Exec-managerial         Husband   White     Male   
2             Divorced    Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse    Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse       Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
0          2174             0              40   United-States 

In [4]:
print(train['income'].value_counts())

## Class imbalance: roughly 76% of rows are <=50K versus 24% >50K (about 3:1)

 <=50K    24720
 >50K      7841
Name: income, dtype: int64
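
To put later accuracy figures in context, it helps to know what a classifier that always predicts the majority class would score on the training data. A small sketch using only the counts printed above:

```python
# Class counts copied from the value_counts() output above.
counts = {"<=50K": 24720, ">50K": 7841}

total = sum(counts.values())
# Accuracy of a model that always predicts the majority class.
majority_share = max(counts.values()) / total
# How many majority-class rows there are per minority-class row.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"majority-class share: {majority_share:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.2f} : 1")
```

Any accuracy close to the majority-class share says little about how well the >50K class is recognised, which is why per-class metrics are worth checking as well.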


In [5]:
# adult.test begins with a non-data line that read_csv loads as row 0; drop it.
test = test.iloc[1:]
print(test.head())

  age   workclass    fnlwgt      education  education-nom  \
1  25     Private  226802.0           11th            7.0   
2  38     Private   89814.0        HS-grad            9.0   
3  28   Local-gov  336951.0     Assoc-acdm           12.0   
4  44     Private  160323.0   Some-college           10.0   
5  18           ?  103497.0   Some-college           10.0   

        marital-status education-occupation relationship    race      sex  \
1        Never-married    Machine-op-inspct    Own-child   Black     Male   
2   Married-civ-spouse      Farming-fishing      Husband   White     Male   
3   Married-civ-spouse      Protective-serv      Husband   White     Male   
4   Married-civ-spouse    Machine-op-inspct      Husband   Black     Male   
5        Never-married                    ?    Own-child   White   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
1           0.0           0.0            40.0   United-States   <=50K  
2           0.0           

In [6]:
train = train.drop_duplicates()
print(train.head())

   age          workclass  fnlwgt   education  education-nom  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status education-occupation    relationship    race      sex  \
0        Never-married         Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse      Exec-managerial         Husband   White     Male   
2             Divorced    Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse    Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse       Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
0          2174             0              40   United-States 

In [7]:
print(train.isin([" ?"]).any())

age                     False
workclass                True
fnlwgt                  False
education               False
education-nom           False
marital-status          False
education-occupation     True
relationship            False
race                    False
sex                     False
capital-gain            False
capital-loss            False
hours-per-week          False
native-country           True
income                  False
dtype: bool


In [8]:
# Missing values are encoded as " ?" (with a leading space); np.NaN was removed in NumPy 2.0.
train = train.replace(' ?', np.nan)

In [9]:
print(train.isin([" ?"]).any())

age                     False
workclass               False
fnlwgt                  False
education               False
education-nom           False
marital-status          False
education-occupation    False
relationship            False
race                    False
sex                     False
capital-gain            False
capital-loss            False
hours-per-week          False
native-country          False
income                  False
dtype: bool


In [10]:
na_train = train[train.isnull().any(axis=1)]
train = train.dropna()


In [10]:
! pip install imbalanced-learn



In [11]:
train.reset_index(drop=True, inplace=True)
y_train = train["income"]

del train["income"]

In [12]:
print(train.shape, y_train.shape)

(32537, 14) (32537,)


In [13]:
from pandas_profiling import ProfileReport

report: ProfileReport = ProfileReport(train, explorative=True)
report.to_notebook_iframe()


In [None]:
### Continuous variables: age, fnlwgt, education-nom, capital-gain, capital-loss and hours-per-week
### Categorical variables: workclass, marital-status, education-occupation, relationship, race, sex, native-country
### Ordinal variables: education

from imblearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder

numeric_features: List[str] = ["age", "fnlwgt", "education-nom", "capital-gain", "capital-loss", "hours-per-week"]
categorical_features: List[str] = ["workclass", "marital-status", "education-occupation", "relationship", "race", "sex", "native-country"]
ordinal_features: List[str] = ["education"]

numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown="error")
ordinal_transformer = OrdinalEncoder(handle_unknown="error")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        ("ord", ordinal_transformer, ordinal_features),
    ]
)
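
With handle_unknown="error", both encoders raise as soon as transform meets a category that was absent during fit. That can happen with rare values (for example a native-country that appears only a handful of times) when the data is split again, as in cross-validation. The following is a hedged sketch on a toy column, showing the alternative settings that scikit-learn provides; whether they are appropriate here is a modelling choice.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

fit_df = pd.DataFrame({"native-country": ["United-States", "Mexico", "India"]})
new_df = pd.DataFrame({"native-country": ["Holand-Netherlands"]})  # unseen during fit

# handle_unknown="ignore": an unseen category becomes an all-zero row.
onehot = OneHotEncoder(handle_unknown="ignore").fit(fit_df)
onehot_row = onehot.transform(new_df).toarray()

# handle_unknown="use_encoded_value": an unseen category gets the sentinel code.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(fit_df)
ordinal_row = ordinal.transform(new_df)

# The strict setting used in the cell above raises ValueError instead.
strict = OneHotEncoder(handle_unknown="error").fit(fit_df)
try:
    strict.transform(new_df)
    raised = False
except ValueError:
    raised = True

print(onehot_row, ordinal_row, raised)
```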

In [None]:

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("sampling", SMOTE()), ("classifier", LogisticRegression())]
)

In [17]:
from sklearn.preprocessing import LabelEncoder

lb = LabelEncoder().fit(y_train.astype(str))

y_train = lb.transform(y_train.astype(str))


In [18]:
print(y_train)

[0 0 0 ... 0 0 1]


In [19]:
print(train.head())

   age          workclass  fnlwgt   education  education-nom  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status education-occupation    relationship    race      sex  \
0        Never-married         Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse      Exec-managerial         Husband   White     Male   
2             Divorced    Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse    Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse       Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  
0          2174             0              40   United-States  
1     

In [20]:

clf.fit(train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
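
The warning above comes from the solver stopping at its iteration limit (LogisticRegression defaults to max_iter=100). Here is a minimal sketch of the remedy the warning suggests, on synthetic data from make_classification rather than the Adult data: scale the features, raise max_iter, then inspect n_iter_ to see how many iterations were actually used. The value 1000 is an assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed features.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

max_iter = 1000  # assumed value; the scikit-learn default is 100
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=max_iter))
model.fit(X, y)

# n_iter_ holds the number of iterations the solver actually ran.
iterations_used = int(model.named_steps["logisticregression"].n_iter_[0])
print(iterations_used, "iterations used out of", max_iter)
```

In the notebook's pipeline the equivalent change would be to pass LogisticRegression(max_iter=1000) as the "classifier" step.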


In [21]:

print(test.head())

  age   workclass    fnlwgt      education  education-nom  \
1  25     Private  226802.0           11th            7.0   
2  38     Private   89814.0        HS-grad            9.0   
3  28   Local-gov  336951.0     Assoc-acdm           12.0   
4  44     Private  160323.0   Some-college           10.0   
5  18           ?  103497.0   Some-college           10.0   

        marital-status education-occupation relationship    race      sex  \
1        Never-married    Machine-op-inspct    Own-child   Black     Male   
2   Married-civ-spouse      Farming-fishing      Husband   White     Male   
3   Married-civ-spouse      Protective-serv      Husband   White     Male   
4   Married-civ-spouse    Machine-op-inspct      Husband   Black     Male   
5        Never-married                    ?    Own-child   White   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
1           0.0           0.0            40.0   United-States   <=50K  
2           0.0           

In [22]:
test = test.replace(' ?', np.nan)

na_test = test[test.isnull().any(axis=1)]
test = test.dropna()

print(test.head())

  age   workclass    fnlwgt      education  education-nom  \
1  25     Private  226802.0           11th            7.0   
2  38     Private   89814.0        HS-grad            9.0   
3  28   Local-gov  336951.0     Assoc-acdm           12.0   
4  44     Private  160323.0   Some-college           10.0   
6  34     Private  198693.0           10th            6.0   

        marital-status education-occupation    relationship    race    sex  \
1        Never-married    Machine-op-inspct       Own-child   Black   Male   
2   Married-civ-spouse      Farming-fishing         Husband   White   Male   
3   Married-civ-spouse      Protective-serv         Husband   White   Male   
4   Married-civ-spouse    Machine-op-inspct         Husband   Black   Male   
6        Never-married        Other-service   Not-in-family   White   Male   

   capital-gain  capital-loss  hours-per-week  native-country  income  
1           0.0           0.0            40.0   United-States   <=50K  
2           0.0     

In [23]:
test["income"] = lb.transform(test["income"].astype(str))

print(test.head())


  age   workclass    fnlwgt      education  education-nom  \
1  25     Private  226802.0           11th            7.0   
2  38     Private   89814.0        HS-grad            9.0   
3  28   Local-gov  336951.0     Assoc-acdm           12.0   
4  44     Private  160323.0   Some-college           10.0   
6  34     Private  198693.0           10th            6.0   

        marital-status education-occupation    relationship    race    sex  \
1        Never-married    Machine-op-inspct       Own-child   Black   Male   
2   Married-civ-spouse      Farming-fishing         Husband   White   Male   
3   Married-civ-spouse      Protective-serv         Husband   White   Male   
4   Married-civ-spouse    Machine-op-inspct         Husband   Black   Male   
6        Never-married        Other-service   Not-in-family   White   Male   

   capital-gain  capital-loss  hours-per-week  native-country  income  
1           0.0           0.0            40.0   United-States       0  
2           0.0     

In [24]:
y_test = test["income"]
del test["income"]

train.reset_index(drop=True, inplace=True)

In [25]:
print("model score: %.3f" % clf.score(test, y_test))

model score: 0.805


In [26]:
### Now let's try several classifiers

from typing import Dict

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier

seed: int = 42
scores: Dict = {}

models: Dict = {
    "svc": SVC(gamma='auto', random_state=seed),
    "tree": DecisionTreeClassifier(random_state=seed),
    "knn": KNeighborsClassifier(),
    "rf": RandomForestClassifier(random_state=seed),
    "gb": GradientBoostingClassifier(random_state=seed),
    "ada": AdaBoostClassifier(random_state=seed),
    "logistic": LogisticRegression(random_state=seed),
    "sgd": SGDClassifier(random_state=seed),
}

for key, value in models.items():
    classifier = Pipeline(steps=[("preprocessor", preprocessor), ("sampling", SMOTE()), ("classifier", value)])
    models[key] = classifier.fit(train, y_train)
    scores[key] = models[key].score(test, y_test)
 

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [27]:
for key, value in scores.items():
    print(f"{key} - model score: {value: .5f}")

svc - model score:  0.80744
tree - model score:  0.80737
knn - model score:  0.78187
rf - model score:  0.84024
gb - model score:  0.82928
ada - model score:  0.82915
logistic - model score:  0.80551
sgd - model score:  0.76487


In [28]:
test.reset_index(drop=True, inplace=True)
print(test.head())

  age   workclass    fnlwgt      education  education-nom  \
0  25     Private  226802.0           11th            7.0   
1  38     Private   89814.0        HS-grad            9.0   
2  28   Local-gov  336951.0     Assoc-acdm           12.0   
3  44     Private  160323.0   Some-college           10.0   
4  34     Private  198693.0           10th            6.0   

        marital-status education-occupation    relationship    race    sex  \
0        Never-married    Machine-op-inspct       Own-child   Black   Male   
1   Married-civ-spouse      Farming-fishing         Husband   White   Male   
2   Married-civ-spouse      Protective-serv         Husband   White   Male   
3   Married-civ-spouse    Machine-op-inspct         Husband   Black   Male   
4        Never-married        Other-service   Not-in-family   White   Male   

   capital-gain  capital-loss  hours-per-week  native-country  
0           0.0           0.0            40.0   United-States  
1           0.0           0.0       

In [29]:
test = test.to_numpy()
y_test = y_test.to_numpy()

# print(test[:10])

print(type(test), type(y_test))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [30]:
import statistics

from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import precision_score, recall_score, f1_score, balanced_accuracy_score, roc_auc_score, accuracy_score
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score

skfolds: StratifiedKFold = StratifiedKFold(n_splits=10, random_state=seed, shuffle=True)

columnnames: List[str] = [
    "age", "workclass", "fnlwgt", "education", "education-nom",
    "marital-status", "education-occupation", "relationship", "race",
    "sex", "capital-gain", "capital-loss", "hours-per-week",
    "native-country",
]

statistic: Dict = {}

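# split() yields (train_indices, test_indices); only the held-out indices of
# each fold are used here, to score the already-fitted models on ten
# stratified subsets of the test set.
for _, fold_index in skfolds.split(test, y_test):
    tmp_test = pd.DataFrame(test[fold_index], columns=columnnames)
    y_true = y_test[fold_index]

    for key, value in models.items():
        if key not in statistic: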
            statistic[key] = {}
            statistic[key]["precision"] = []
            statistic[key]["recall"] = []
            statistic[key]["f1"] = []
            statistic[key]["accuracy"] = [] 

        y_pred = value.predict(tmp_test)

        
        statistic[key]["precision"].append(precision_score(y_true, y_pred, average="micro"))
        statistic[key]["recall"].append(recall_score(y_true, y_pred, average="micro"))
        statistic[key]["f1"].append(f1_score(y_true, y_pred, average="micro"))
        statistic[key]["accuracy"].append(accuracy_score(y_true, y_pred))

        # print(f"Model: {key}")

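results_dir = os.path.join(os.getcwd(), "results")
os.makedirs(results_dir, exist_ok=True)  # open() fails if the directory is missing

with open(os.path.join(results_dir, "results.txt"), "w") as f:
    for key, value in statistic.items():
        for skey, svalue in value.items():
            f.write(f"Model: {key} - {skey}: {statistics.mean(svalue): .5f} +/- {statistics.stdev(svalue): .5f} \n")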
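
A note on average="micro": for a single-label problem like this one, micro-averaged precision, recall and F1 are all equal to plain accuracy, so the four lists collected above hold the same numbers. The sketch below shows this on toy labels (not taken from the notebook) and contrasts it with average="binary", which scores only the positive class.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Micro-averaging pools true/false positives over both classes.
micro = {
    "precision": precision_score(y_true, y_pred, average="micro"),
    "recall": recall_score(y_true, y_pred, average="micro"),
    "f1": f1_score(y_true, y_pred, average="micro"),
}

# average="binary" scores the positive class (label 1) only.
binary = {
    "precision": precision_score(y_true, y_pred, average="binary"),
    "recall": recall_score(y_true, y_pred, average="binary"),
    "f1": f1_score(y_true, y_pred, average="binary"),
}

print(accuracy, micro, binary)
```

If the aim is to see how well the >50K class is handled, average="binary" or average="macro" may be more informative than "micro".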





In [31]:
from sklearn.metrics import confusion_matrix

for key, value in models.items():
    tmp_test =  pd.DataFrame(test, columns=columnnames)
    y_pred = value.predict(tmp_test)
    cm = confusion_matrix(y_test, y_pred, normalize='true')

    print(f"{key}")
    print(cm)
    print('======================================================================')

svc
[[0.79242958 0.20757042]
 [0.14648649 0.85351351]]
tree
[[0.85818662 0.14181338]
 [0.34864865 0.65135135]]
knn
[[0.78274648 0.21725352]
 [0.22081081 0.77918919]]
rf
[[0.88952465 0.11047535]
 [0.31108108 0.68891892]]
gb
[[0.83353873 0.16646127]
 [0.18378378 0.81621622]]
ada
[[0.83538732 0.16461268]
 [0.19       0.81      ]]
logistic
[[0.79542254 0.20457746]
 [0.16351351 0.83648649]]
sgd
[[0.72024648 0.27975352]
 [0.09810811 0.90189189]]
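
With normalize='true', each row of the confusion matrix is divided by the number of true samples in that class, so every row sums to 1 and the diagonal holds the per-class recall. For instance, the 0.8535 in the second row of the svc matrix is the recall for class 1. A small sketch on toy labels (not the notebook's data) that checks this relationship:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, normalize="true")

# average=None returns one recall value per class.
per_class_recall = recall_score(y_true, y_pred, average=None)

print(cm)
print(per_class_recall)
```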


In [32]:
from sklearn.metrics import classification_report

for key, value in models.items():
    tmp_test =  pd.DataFrame(test, columns=columnnames)
    y_pred = value.predict(tmp_test)
    report = classification_report(y_test, y_pred)

    print(f"{key}")
    print(report)
    print('======================================================================')

svc
              precision    recall  f1-score   support

           0       0.94      0.79      0.86     11360
           1       0.57      0.85      0.69      3700

    accuracy                           0.81     15060
   macro avg       0.76      0.82      0.77     15060
weighted avg       0.85      0.81      0.82     15060

tree
              precision    recall  f1-score   support

           0       0.88      0.86      0.87     11360
           1       0.60      0.65      0.62      3700

    accuracy                           0.81     15060
   macro avg       0.74      0.75      0.75     15060
weighted avg       0.81      0.81      0.81     15060

knn
              precision    recall  f1-score   support

           0       0.92      0.78      0.84     11360
           1       0.54      0.78      0.64      3700

    accuracy                           0.78     15060
   macro avg       0.73      0.78      0.74     15060
weighted avg       0.82      0.78      0.79     15060

rf
   

In [33]:
print(train.head())

   age          workclass  fnlwgt   education  education-nom  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status education-occupation    relationship    race      sex  \
0        Never-married         Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse      Exec-managerial         Husband   White     Male   
2             Divorced    Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse    Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse       Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  
0          2174             0              40   United-States  
1     

In [34]:
print(y_train)

[0 0 0 ... 0 0 1]


In [37]:

# ## This search takes too long to run; it needs to be revised to be more efficient.

from sklearn.model_selection import RandomizedSearchCV

# clf = Pipeline(
#     steps=[("preprocessor", preprocessor), ("sampling", SMOTE()), ("classifier", SVC(kernel="rbf"))]
# )

# param_grid = {"classifier__gamma": [0.1, 1, 10, 100], "classifier__C": [0.1, 1, 10, 100, 1000]}

# grid_search = RandomizedSearchCV(clf, param_grid, refit = True, verbose = 0, cv = 5, n_jobs=-1)
# grid_search.fit(train, y_train)

# print("tuned hyperparameters :(best parameters) ", grid_search.best_params_)
# print("tuned hyperparameters :(best estimator) ", grid_search.best_estimator_)
# print("tuned hyperparameters :(best score) ", grid_search.best_score_)


In [38]:

clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("sampling", SMOTE()), ("classifier", GradientBoostingClassifier())]
)

param_grid = {"classifier__n_estimators":[5,50,250,500], "classifier__max_depth":[1,3,5,7,9], "classifier__learning_rate":[0.01,0.1,1,10,100]}

grid_search = RandomizedSearchCV(clf, param_grid, refit = True, verbose = 0, cv = 5)
grid_search.fit(train, y_train)

print("tuned hyperparameters :(best parameters) ", grid_search.best_params_)
print("tuned hyperparameters :(best estimator) ", grid_search.best_estimator_)
print("tuned hyperparameters :(best score) ", grid_search.best_score_)

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\model_selection\_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\metrics\_scorer.py", line 429, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\pipeline.py", line 695, in score
    Xt = transform.transform(Xt)
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\compose\_column_transformer.py", line 763, in transform
    Xs = self._fit_transform(

tuned hyperparameters :(best parameters)  {'classifier__n_estimators': 500, 'classifier__max_depth': 1, 'classifier__learning_rate': 0.1}
tuned hyperparameters :(best estimator)  Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fnlwgt',
                                                   'education-nom',
                                                   'capital-gain',
                                                   'capital-loss',
                                                   'hours-per-week']),
                                                 ('cat', OneHotEncoder(),
                                                  ['workclass',
                                                   'marital-status',
                  

In [1]:

clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("sampling", SMOTE()), ("classifier", AdaBoostClassifier())]
)

param_grid = {
    'classifier__base_estimator__max_depth': list(range(2, 11, 2)),
    'classifier__base_estimator__min_samples_leaf': [5, 10],
    'classifier__n_estimators': [10, 50, 250, 1000],
    'classifier__learning_rate': [0.01, 0.1],
}

grid_search = RandomizedSearchCV(clf, param_grid, refit = True, verbose = 0, cv = 5)
grid_search.fit(train, y_train)

print("tuned hyperparameters :(best parameters) ", grid_search.best_params_)
print("tuned hyperparameters :(best estimator) ", grid_search.best_estimator_)
print("tuned hyperparameters :(best score) ", grid_search.best_score_)

NameError: name 'Pipeline' is not defined
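
The NameError suggests this cell ran in a fresh kernel (the cell counter is back at 1) before the import cells were executed. Separately, the base_estimator__... parameters can only be set when AdaBoostClassifier has been given an estimator object, and that parameter is named estimator in newer scikit-learn releases and base_estimator in older ones. Below is a minimal self-contained sketch under stated assumptions: synthetic data instead of the Adult data, scikit-learn's own Pipeline with no SMOTE step, and reduced grids and n_iter to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# The weak-learner parameter was renamed across scikit-learn versions.
ada_params = AdaBoostClassifier().get_params()
base_name = "estimator" if "estimator" in ada_params else "base_estimator"

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", AdaBoostClassifier(**{base_name: DecisionTreeClassifier()}, random_state=42)),
])

# Nested parameters are addressed as <step>__<parameter>__<sub-parameter>.
param_distributions = {
    f"classifier__{base_name}__max_depth": [2, 4],
    f"classifier__{base_name}__min_samples_leaf": [5, 10],
    "classifier__n_estimators": [10, 50],
    "classifier__learning_rate": [0.01, 0.1],
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=3, cv=3, random_state=42)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

For the notebook itself, re-running the import cells (Pipeline, SMOTE, AdaBoostClassifier, RandomizedSearchCV) before this cell should be enough to clear the NameError.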