# End-to-End Machine Learning Pipeline with ChatGPT: A Beginner's Guide - part 3

# Introduction

This is part 3 of the **How to use ChatGPT in a competition** series, in which I show how to leverage the power of ChatGPT to build a complete pipeline without the need for extensive coding skills.

If you haven't already, first check out [part 1](https://www.kaggle.com/code/jacoporepossi/how-to-use-chatgpt-in-a-competition-eda-part-1/notebook) and [part 2](https://www.kaggle.com/code/jacoporepossi/how-to-use-chatgpt-in-a-competition-model-part-2), which cover EDA and baseline creation respectively.

In this notebook we are going to improve our previous baseline thanks to **hyperparameter tuning and ensembling using ChatGPT**!

# Hyperparameter tuning and ensembling with ChatGPT

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
import warnings
warnings.filterwarnings("ignore")
simplefilter("ignore", category=ConvergenceWarning)
np.random.seed(42)
random.seed(42)


# Load the dataset
data = pd.read_csv("/kaggle/input/playground-series-s3e3/train.csv")

# Splitting the data into features and target
X = data.drop(['Attrition', 'id'], axis=1)
y = data['Attrition']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# Selecting the categorical features
categorical_features = X_train.select_dtypes(include='object').columns

# Encoding the categorical features
# (the `sparse` argument is named `sparse_output` in scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse=False)
X_train_cat = encoder.fit_transform(X_train[categorical_features])
X_test_cat = encoder.transform(X_test[categorical_features])
encoded_columns = encoder.get_feature_names_out()

# Dropping the original categorical features
X_train = X_train.drop(categorical_features, axis=1)
X_test = X_test.drop(categorical_features, axis=1)

# Concatenating the encoded categorical features, keeping string column names
X_train = pd.concat([X_train, pd.DataFrame(X_train_cat, index=X_train.index, columns=encoded_columns)], axis=1)
X_test = pd.concat([X_test, pd.DataFrame(X_test_cat, index=X_test.index, columns=encoded_columns)], axis=1)

In [part 2](https://www.kaggle.com/code/jacoporepossi/how-to-use-chatgpt-in-a-competition-model-part-2) **we didn't get a great roc_auc score**. That was expected: the purpose of that notebook was only to **test ChatGPT in creating a baseline** starting from a textual input.

Will ChatGPT be able to provide better code to increase the performance obtained previously? Let's try! 

**Starting where we left off**, perhaps it makes sense to **provide feedback** that the performance hasn't been that great. Let's see what the algorithm proposes as a possible solution **without additional help**:

<details><summary><i><mark style='background-color:burlywood'>PROMPT</mark> - the machine learning proposed didn't <b>achieved a decent roc_auc score</b>. Provide some <b>code to improve</b> the performance and <b>test the final optimized model</b> on the test set as before.</i></summary>
<pre><code>
There are several ways to improve the performance of a machine learning model, some of which include:
- Feature Engineering: You can create new features or transform existing features to improve the model's performance. For example, you could create interaction terms between features, or take the logarithm of a feature to reduce the impact of outliers.
- Hyperparameter Tuning: You can use techniques like grid search or random search to find the best hyperparameters for your model. This can improve the model's performance by adjusting the learning rate, regularization strength, and other parameters.
- Ensemble Methods: You can use ensemble methods like bagging, boosting, and stacking to improve the performance of your model. These methods combine the predictions of multiple models to create a more accurate final prediction.
- Model selection: Try different models and compare their performance. For example, you can try Decision Trees, Random Forest, SVM, XGBoost, etc.
- Cross-Validation: Use cross-validation techniques like K-Fold cross-validation to get a more robust evaluation of the model's performance. This can help you avoid overfitting and select a model that generalizes well to unseen data.

Here's an example of how you can use GridSearchCV and RandomizedSearchCV to tune the hyperparameters of a logistic regression model:
</code></pre>
<br>
<img src="https://i.imgur.com/vOjVKFz.png">
</details>

In [2]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

# Define the parameter grid
param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}

# Initialize the logistic regression model
clf = LogisticRegression()

# Initialize the GridSearchCV object
grid_search = GridSearchCV(clf, param_grid, cv=5)

# Fit the GridSearchCV object to the data
grid_search.fit(X_train, y_train)

# Get the best estimator
best_clf = grid_search.best_estimator_

# Print the best parameters and the best score
print(grid_search.best_params_)
print(grid_search.best_score_)

# Predict on the test set
y_pred = best_clf.predict(X_test)

# Evaluate the model
auc = roc_auc_score(y_test, y_pred)
print(f'AUC: {auc}')

{'C': 0.1, 'penalty': 'l2'}
0.8799395217222438
AUC: 0.5


It's definitely interesting to read ChatGPT's "opinions". However, the **more we delve into certain details** of machine learning, the more its information tends to be **quite vague**. For instance, in a bullet list of possible ways to improve a model's performance, **cross validation, model selection and hyperparameter tuning should be presented as related to each other**, since selection and tuning are typically driven by cross-validated scores, and not as separate steps. This could lead to **misunderstandings** if used by an inexperienced Citizen Data Scientist.
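To make that relationship concrete, here is a minimal sketch of how the three ideas can live in a single search. It runs on synthetic data from `make_classification`, not on the competition dataset, and the step name `model` and the grid values are illustrative assumptions of mine: the candidate models and their hyperparameters form one grid, and every candidate is scored by cross validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem (a stand-in for the real data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=42
)

# "model" is a placeholder step that the grid swaps out
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# Each dict is one model family together with its own hyperparameters
param_grid = [
    {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.1, 1, 10]},
    {"model": [RandomForestClassifier(random_state=42)], "model__max_depth": [5, 10]},
]

# Cross validation scores every candidate with the competition metric
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_demo, y_demo)

print("Selected model:", type(search.best_params_["model"]).__name__)
print(f"Best cross-validated AUC: {search.best_score_:.3f}")
```

Note that `scoring="roc_auc"` is set explicitly here. Without it, the search ranks classifier candidates by accuracy, which is not the metric we care about in this competition.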

Given that the performance didn't improve, let's further challenge ChatGPT.

<details><summary><i><mark style='background-color:burlywood'>PROMPT</mark> - we are still getting a low AUC score, <b>change the algorithm and optimize its parameters</b>. Provide only the code</i></summary>

<br>
<img src="https://i.imgur.com/sGLiYw3.png">
</details>

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score

# Define the parameter grid
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 20]}

# Initialize the RandomForestClassifier
clf = RandomForestClassifier()

# Initialize the RandomizedSearchCV object
random_search = RandomizedSearchCV(clf, param_grid, cv=5)

# Fit the RandomizedSearchCV object to the data
random_search.fit(X_train, y_train)

# Print the best parameters and the best score
print(random_search.best_params_)
print(random_search.best_score_)

# Get the best estimator
best_clf = random_search.best_estimator_

# Predict on the test set
y_pred = best_clf.predict(X_test)

# Evaluate the model
auc = roc_auc_score(y_test, y_pred)
print(f'AUC: {auc}')

{'n_estimators': 50, 'max_depth': 10}
0.8888919713699162
AUC: 0.5324324324324324


The code changed the algorithm to `RandomForestClassifier` and used `RandomizedSearchCV` to find the best hyperparameters. Indeed, we were able to **increase our roc_auc score on the test set!**
ChatGPT was able to guess that `RandomForestClassifier` is often more powerful than `LogisticRegression` when it comes to working with tabular data.

Can we push a little bit further by asking for an **ensemble of algorithms**?

<details><summary><i><mark style='background-color:burlywood'>PROMPT</mark> - how can I <b>ensemble different models</b>, weighting them properly in order to increase my performance? Provide code and skip the explanation</i></summary>

<br>
<img src="https://i.imgur.com/AqgRX6p.png">
</details>

In [4]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Initialize the classifiers
clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier()
clf3 = RandomForestClassifier()

# Create the ensemble
ensemble = VotingClassifier(estimators=[('lr', clf1), ('dt', clf2), ('rf', clf3)], voting='soft', weights=[1,1,1])

# Fit the ensemble to the data
ensemble.fit(X_train, y_train)

# Predict on the test set
y_pred = ensemble.predict(X_test)

# Evaluate the ensemble
auc = roc_auc_score(y_test, y_pred)
print(f'AUC: {auc}')

AUC: 0.5415540540540541


Once again, we improved our score, but <b>this may be pure luck</b>, as we have no idea whether the test set we set aside is representative of the entire problem.<br> Let's tell that to ChatGPT.
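A quick way to get a feeling for how much a single hold-out score depends on the split is to repeat the split with different seeds. The sketch below uses synthetic data and a plain `LogisticRegression` as stand-ins (both are my assumptions, not the models above), so the numbers it prints say nothing about the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (a stand-in for the real data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)

holdout_aucs = []
for seed in range(10):
    # Same data, same model, only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    holdout_aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

holdout_aucs = np.array(holdout_aucs)
print(f"min={holdout_aucs.min():.3f}  max={holdout_aucs.max():.3f}  std={holdout_aucs.std():.3f}")
```

If the spread between the minimum and the maximum is comparable to the gain we observed between two models, a single split cannot tell them apart.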

<details><summary><i><mark style='background-color:burlywood'>PROMPT</mark> - the performance improved but probably is due to luck. Can we use <b>cross validation</b> to properly assess that the roc_auc is better? Provide a new code and skip the explanation.</i></summary>

<br>
<img src="https://i.imgur.com/r1tty9I.png">
</details>

In [5]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Initialize the classifiers
clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier()
clf3 = RandomForestClassifier()

# Create the ensemble
ensemble = VotingClassifier(estimators=[('lr', clf1), ('dt', clf2), ('rf', clf3)], voting='soft', weights=[1,1,1])

# Use cross_val_score to assess the performance of the ensemble
scores = cross_val_score(ensemble, X, y, cv=5, scoring='roc_auc')
print(scores)

# Print the mean and standard deviation of the scores
print(f'Mean AUC: {scores.mean():.3f} (+/- {scores.std():.3f})')


[nan nan nan nan nan]
Mean AUC: nan (+/- nan)


We are getting all `nan` because we are passing the raw `X`, whose **categorical features are not encoded**, so the model cannot be fitted on any fold.
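When a cross validation comes back as `nan`, the underlying exception is hidden, because the default `error_score=np.nan` replaces failed fits with `nan` (recent scikit-learn releases may instead raise an error when every fit fails). Passing `error_score="raise"` surfaces the real cause. Here is a minimal sketch on a tiny made-up frame; the columns `age` and `dept` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny made-up frame with one string column (hypothetical data)
toy_X = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 36],
    "dept": ["sales", "hr", "sales", "it", "it", "hr", "sales", "it"],
})
toy_y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

error_message = None
try:
    # error_score="raise" re-raises the exception thrown inside fit
    cross_val_score(
        LogisticRegression(), toy_X, toy_y,
        cv=2, scoring="roc_auc", error_score="raise",
    )
except ValueError as exc:
    error_message = str(exc)
    print("Fit failed:", error_message)
```

The message points at the string column that the model could not convert to a number, which is exactly the missing preprocessing step.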

<details><summary><i><mark style='background-color:burlywood'>PROMPT</mark> - I get only nan results in the cross_val_score. This is because <b>X has not been preprocessed</b>. Fix the code considering all the information I gave to you in the previous prompts</i></summary>

<br>
<img src="https://i.imgur.com/ZnKl90q.png">
</details>
<br>

Nope...we are starting all over again. Moreover, this is not exactly what we want. I was expecting code that **handles the categorical encoding inside the cross_val_score** and, why not, also **hyperparameter tuning with GridSearchCV**.

Here's my code to achieve that:

In [6]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pprint

# Initialize the classifiers
clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier()
clf3 = RandomForestClassifier()

# Define the preprocessing for the numerical and the categorical features
numeric_features = X.select_dtypes(include='number').columns
numeric_transformer = StandardScaler()

categorical_features = X.select_dtypes(include='object').columns
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Create the ensemble
ensemble = VotingClassifier(estimators=[('lr', clf1), ('dt', clf2), ('rf', clf3)], voting='soft', weights=[1,1,1])

# Define the parameters for the GridSearchCV
param_grid = {
    'dt__max_depth': [5, 10, 20],
    'weights': [[1,1,1], [2,1,1], [1,2,1], [1,1,2]]
}
grid = GridSearchCV(ensemble, param_grid, cv=3, verbose=1)

# Put everything all together in a Pipeline
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", grid)]
)

# Use cross_validate to assess the performance of the tuned ensemble
scores = cross_validate(clf, X, y, cv=5, scoring='roc_auc', return_estimator=True)
print('\nScores on the test folds:', scores['test_score'])

# Print the mean and standard deviation of the scores
print('Mean AUC: {:.3f} (+/- {:.3f})'.format(scores['test_score'].mean(), scores['test_score'].std()))

# Best params found for each fold
print('Best estimators found:')
pprint.pprint([estim.steps[1][1].best_params_ for estim in scores['estimator']])

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits

Scores on the test folds: [0.78327703 0.85346284 0.76389831 0.81762712 0.8490678 ]
Mean AUC: 0.813 (+/- 0.035)
Best estimators found:
[{'dt__max_depth': 10, 'weights': [2, 1, 1]},
 {'dt__max_depth': 5, 'weights': [1, 1, 2]},
 {'dt__max_depth': 10, 'weights': [2, 1, 1]},
 {'dt__max_depth': 5, 'weights': [2, 1, 1]},
 {'dt__max_depth': 10, 'weights': [1, 1, 2]}]


In this code we are performing **cross validation** using `cross_validate`, which can **also return the estimator** fitted on each training fold. **Within the cross validation**, we are **processing** the categorical features and the numerical ones with `ColumnTransformer`, and performing **hyperparameter tuning** with `GridSearchCV`.
This resembles the framework [I shared about nested cross validation](https://www.kaggle.com/code/jacoporepossi/nested-cross-validation-and-ensembling-framework).

We can see that the **performance is quite different** from that obtained before with ChatGPT, mainly because:
- I **tuned our VotingClassifier estimators and the weights associated with them**
- with `scoring='roc_auc'`, `cross_validate` scores the model on `predict_proba` and not on `predict`, which will generally **yield a higher roc_auc score**. The metric uses the raw probabilities to measure how well the model ranks the two classes across all possible thresholds, whereas if we use `predict` we collapse everything to the **default threshold 0.5**
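The effect of scoring hard labels instead of probabilities can be reproduced by hand. In this sketch the labels and probabilities are hypothetical numbers I made up: the model ranks the positives fairly well, but every probability stays below 0.5, so thresholding predicts the negative class everywhere.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.25, 0.40])

# What predict would return for a binary model: a cut at 0.5
hard_labels = (proba > 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard_labels)
auc_from_probas = roc_auc_score(y_true, proba)

print("Hard labels:", hard_labels)                    # all zeros
print(f"AUC from labels: {auc_from_labels:.3f}")      # 0.500
print(f"AUC from probabilities: {auc_from_probas:.3f}")  # 0.875
```

With constant predictions the metric cannot rank anything and falls back to 0.5, which is the value we saw with the tuned `LogisticRegression`. With the probabilities, 7 of the 8 positive/negative pairs are ordered correctly, hence 0.875.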

# Submission

It's time to submit our predictions to Kaggle. How can we do that using our VotingClassifier?

In what I think is the [best answer about nested cross validation and model selection](https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection), I've learned that we can select **any estimator** we previously found in a cross validation scheme as long as **our performance is quite stable**. Stability means that the **model does not change (much)** if the training data is "perturbed" by replacing a few cases with other cases.<br>
Given that the **standard deviation** is about `0.04`, we can reasonably say that **any model** we created so far would yield a score around `0.81 +- 0.04`, so let's just **select the first estimator** fitted and make a submission.

Because the **estimator is in a pipeline**, we can predict directly on the raw test data, knowing that the **preprocessing will be handled automatically by sklearn!**
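As a small illustration of that convenience, here is a sketch with a toy pipeline (made-up columns `age` and `dept`, and a plain `LogisticRegression` instead of our tuned ensemble, both assumptions of mine). Once fitted, it accepts raw rows, including a category it has never seen.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up training frame (hypothetical data)
train_df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 36],
    "dept": ["sales", "hr", "sales", "it", "it", "hr", "sales", "it"],
})
train_y = [0, 1, 0, 1, 0, 1, 0, 1]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["dept"]),
])
toy_pipeline = Pipeline([
    ("preprocessor", preprocess),
    ("classifier", LogisticRegression()),
])
toy_pipeline.fit(train_df, train_y)

# New raw rows; "legal" was never seen during fit and is encoded as all zeros
new_rows = pd.DataFrame({"age": [30, 55], "dept": ["hr", "legal"]})
toy_preds = toy_pipeline.predict_proba(new_rows)[:, 1]
print(toy_preds)
```

The scaler and the encoder reuse the statistics and categories learned during `fit`, so no information from the new rows leaks into the preprocessing.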

In [7]:
submission = pd.read_csv('/kaggle/input/playground-series-s3e3/sample_submission.csv')
test = pd.read_csv('/kaggle/input/playground-series-s3e3/test.csv').drop(['id'], axis=1)

submission.head()

|   | id | Attrition |
|---|---|---|
| 0 | 1677 | 0.119261 |
| 1 | 1678 | 0.119261 |
| 2 | 1679 | 0.119261 |
| 3 | 1680 | 0.119261 |
| 4 | 1681 | 0.119261 |


In [8]:
preds = scores['estimator'][0].predict_proba(test)[:, 1]
preds

array([0.12869493, 0.11766861, 0.0242274 , ..., 0.04729937, 0.0514997 ,
       0.01396995])

In [9]:
submission['Attrition'] = preds
submission.to_csv('submission.csv', index=False)

# Conclusions

In this final notebook of the **How to use ChatGPT in a competition** series, we focused on improving the performance of our machine learning model and submitting it to Kaggle.

Throughout the notebook, we saw how ChatGPT could provide **some basic guidance** but **we needed to correct it quite a few times** as it was leading us to a **wrong implementation**.

I think this highlights the **importance of having a good understanding** of the machine learning workflow and the specific task at hand when using ChatGPT. For a beginner, **ChatGPT could lead to a disaster**, as it may provide incorrect information just to **"please" the user**. 

Let me give you a final example to convince you about that:

<img src="https://i.imgur.com/rGAH2Rt.png" width=700>

It's clear that **based on how you ask the question**, ChatGPT will try to provide the best answer in what **I think is a deductive way**: if you are asking why something is happening, it probably **means that it should be like that**. It's like being asked why 2+2=4 at school...you probably wouldn't argue with the teacher that 2+2 is not 4, but rather try to come up with the best answer to demonstrate that.<br>
**That's exactly why we might get fooled by ChatGPT: it plays to our expectations. It tries to predict what we foresee and what seems to be the general opinion held by experts in some domains.**

However, the area where ChatGPT has shown quite remarkable results is Exploratory Data Analysis. EDA is often full of code tricks you should know, sometimes just to have some basic plots. Here we can **definitely leverage the tool to speed up the process** and collect some quick insights about the dataset.

In conclusion, these are my **3 basic rules** that I've learned about ChatGPT:
 1. with **minimum-effort prompts**, you'll get **low quality results**. It takes time and effort to fine-tune an answer
 2. **don't trust anything it says** unless you check online if the answer makes sense
 3. **don't ask confirmatory questions** unless you already have an idea of where you are going. A confirmatory question is a question that leads to answers that can only support a certain point. Instead of "why do we do this instead of that?", try "should we do this or that?" or "what should we do?"
 
Who knows what the future will hold and how Data Scientists will evolve to adapt to new advances in AI tools.

If you've made it this far, thank you for your interest and I hope it helped!