# End-to-End Machine Learning Pipeline with ChatGPT: A Beginner's Guide - part 2

# Introduction

This is part 2 of the **How to use ChatGPT in a competition** series, in which I show how to leverage the power of ChatGPT to build a complete pipeline without the need for extensive coding skills.

If you haven't already, first check out [part 1](https://www.kaggle.com/code/jacoporepossi/how-to-use-chatgpt-in-a-competition-eda-part-1/notebook).

In this notebook we are going to build our first model using ChatGPT! [Part 3](https://www.kaggle.com/code/jacoporepossi/how-to-use-chatgpt-in-a-competition-final-part-3) will be devoted to hyperparameter tuning and ensembling.

# Modelling with ChatGPT

In case you are wondering, the OpenAI interface allows you to save previous conversations with ChatGPT, so we should be able to continue our conversation with the tool.<br>
However, it's quite common to see this message:
>Due to high demand on our systems, previous conversations are temporarily unavailable

so I'll start over again.<br>
Let's start by giving ChatGPT the task again, this time asking for suggestions on how to tackle the problem.<br> Once again, **click on the question to see ChatGPT's answer!**

<details><summary><i><mark style='background-color:burlywood'>PROMPT</mark> - I'm taking part to a tabular kaggle competition. The dataset comes from the Employee Attrition Dataset and the task is to classifier whether employees had attrition or not (binary classification).<br>The <b>metric is ROC AUC</b> and the dataset has 35 available features, here's a list: <b>[list of features]</b><br>
I <b>already performed EDA</b> so now help me understand how I can tackle this problem using machine learning and python. I need to submit my predictions to Kaggle and I need to know <b>the next steps after EDA.</b></i></summary>
<br>
<pre><code>
After completing exploratory data analysis (EDA), the next step in tackling a machine learning problem is to preprocess the data and split it into training and testing sets. This can be done using tools such as Pandas and scikit-learn in Python.

1. Preprocessing: This includes tasks such as handling missing values, encoding categorical variables, and scaling numerical features.
2. Splitting the data: Use the train_test_split() function from scikit-learn to split the data into training and testing sets. This will allow you to train your model on the training set and evaluate its performance on the unseen testing set.
3. Feature Selection: Use feature selection techniques like Correlation matrix with heatmap, Univariate feature selection, RFE( Recursive Feature Elimination) to select the important features for the model.
4. Model selection: Choose a suitable machine learning model for binary classification. Some popular models for binary classification include logistic regression, decision trees, and random forests.
5. Training and tuning the model: Train the model using the training set and tune its parameters using techniques such as cross-validation to achieve the best performance.
6. Evaluation: Evaluate the performance of the model on the testing set using metrics such as accuracy, precision, recall, and ROC AUC.
7. Once you are satisfied with your model's performance, use it to make predictions on the Kaggle test set and submit your results.

It is also a good idea to use ensemble models like Bagging, Boosting, and Stacking to improve the performance.
</code></pre>
</details>

<br>
Quite interesting: in a nutshell we have a complete overview of the steps needed to successfully submit to the competition.
<br>
<div class="alert alert-block alert-warning">  
    <b>Expert warning: Data leakage</b><br> Fitting preprocessing steps (for example a scaler or an encoder) on the whole dataset <b>before</b> splitting it will result in <b>data leakage in your pipeline</b>.
    </div>

Let's ask ChatGPT why we should encode the categorical feature before splitting the data and let's see if it's able to correct itself.

<details><summary><i><mark style='background-color:burlywood'>PROMPT</mark> - Why are we preprocessing the data, such as encoding the categorical variables, <b>before splitting the data?</b></i></summary>
    
<br>
<pre><code>
Encoding categorical variables and scaling numerical features are considered preprocessing steps because they are performed on the entire dataset before it is split into training and testing sets.

Encoding categorical variables is important because many machine learning models cannot handle categorical data in its raw form. Categorical variables need to be transformed into numerical values for the model to be able to process and make predictions. There are different ways to encode categorical variables, such as one-hot encoding and label encoding.

Scaling numerical features is important because many machine learning models are sensitive to the scale of the input features. For example, if some features have large values while others have small values, the model may weight the features with large values more heavily than the features with small values. Scaling helps to ensure that all features are on a similar scale and that the model is not biased towards features with larger values.

It is important to preprocess the data before splitting it into training and testing sets to ensure that the same preprocessing steps are applied to both sets. This way, the testing set is representative of the same distribution as the training set, and the model will perform better when making predictions on unseen data.

It is also important to perform these preprocessing steps on the entire dataset and not just the training set because otherwise, the testing set would be preprocessed differently and the model would be trained on a different distribution than the one it will be tested on.
</code></pre>
</details>

<br>
This is going all wrong and I hope you understand why the homepage reports the current limitations of the tool such as `May occasionally generate incorrect information` and `May occasionally produce harmful instructions or biased content`.
The problematic part is the following:

>It is important to preprocess the data before splitting it into training and testing sets to <b>ensure that the same preprocessing steps are applied to both sets.</b> This way, the testing set is representative of the same distribution as the training set, and the model will perform better when making predictions on unseen data.<br>
It is also important to perform these preprocessing steps on the <b>entire dataset</b> and not just the training set because otherwise, the testing set would be <b>preprocessed differently and the model would be trained on a different distribution</b> than the one it will be tested on.

The problem here is that preprocessing before splitting means that information from the test set is also used to prepare the training data, so we are "cheating". Let's tell that to ChatGPT:

<details><summary><i><mark style='background-color:burlywood'>PROMPT</mark> - Preprocessing the data before splitting into train and test will lead to data leakage in my pipeline. </i></summary>
    
<br>
<pre><code>
You are correct that preprocessing the data before splitting it into training and testing sets can lead to data leakage. Data leakage occurs when information from the testing set is used to preprocess the training set, which can lead to overfitting and artificially inflated performance metrics.
</code></pre>
</details>

<br>
This makes much more sense and it's exactly what I was referring to before.
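To make the point concrete, here is a minimal sketch of the difference, assuming a synthetic numeric feature and a `StandardScaler` (neither comes from the competition data or from the ChatGPT conversation). The leaky version computes its statistics on all rows, while the leak-free version fits on the training rows only and then reuses the fitted scaler on the held-out rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one numeric feature (not the competition data)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Leaky: the scaler's mean and variance are computed from all rows,
# so the held-out rows influence how the training rows are transformed.
leaky_scaler = StandardScaler().fit(X)

# Leak-free: statistics come from the training rows only,
# and the same fitted scaler is then applied to the held-out rows.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("mean seen by leaky scaler:", leaky_scaler.mean_[0])
print("mean seen by train-only scaler:", scaler.mean_[0])
```

The two means differ only slightly on data like this, but the principle is the same for any fitted transformer: call `fit` on the training split and `transform` on both splits.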

<details><summary><i><mark style='background-color:burlywood'>PROMPT</mark> - back to my original problem, provide a python code to preprocess all the categorical features but after splitting into train and test using scikit-learn.</i></summary>
<br>
<img src="https://i.imgur.com/U2Aehra.png">
</details>

In [1]:
# Importing the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
data = pd.read_csv("/kaggle/input/playground-series-s3e3/train.csv")

# Splitting the data into features and target
X = data.drop('Attrition', axis=1)
y = data['Attrition']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Selecting the categorical features
categorical_features = X_train.select_dtypes(include='object').columns

# Encoding the categorical features
encoder = OneHotEncoder(sparse=False)
X_train_cat = encoder.fit_transform(X_train[categorical_features])
X_test_cat = encoder.transform(X_test[categorical_features])

# Dropping the original categorical features
X_train = X_train.drop(categorical_features, axis=1)
X_test = X_test.drop(categorical_features, axis=1)


# Concatenating the encoded categorical features
X_train = pd.concat([X_train, pd.DataFrame(X_train_cat, index=X_train.index)], axis=1)
X_test = pd.concat([X_test, pd.DataFrame(X_test_cat, index=X_test.index)], axis=1)

We finally obtained what we were looking for. Let's check if everything is ok:

In [2]:
X_train.head()

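| | id | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 661 | 661 | 49 | 950 | 5 | 3 | 1 | 1 | 73 | 3 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1539 | 1539 | 23 | 1329 | 22 | 1 | 1 | 4 | 92 | 4 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 22 | 22 | 51 | 1107 | 2 | 3 | 1 | 1 | 59 | 3 | 5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1523 | 1523 | 37 | 783 | 14 | 4 | 1 | 4 | 39 | 2 | 3 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 1491 | 1491 | 40 | 1299 | 4 | 3 | 1 | 3 | 64 | 2 | 2 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |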


Everything looks fine, so it's time to train a machine learning model! We'll specifically ask to drop the column `id` since it's not useful.

<details><summary><i><mark style='background-color:burlywood'>PROMPT</mark> - now train a machine learning model. Drop the column id since it's not useful</i></summary>
<br>
<img src="https://i.imgur.com/gojZWJI.png">
</details>
<br>

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Dropping the 'id' column
X_train = X_train.drop(['id'], axis=1)
X_test = X_test.drop(['id'], axis=1)

# Training the model
clf = LogisticRegression()

# Fitting the model
clf.fit(X_train, y_train)

# Predicting on the test set
y_pred = clf.predict(X_test)

# Evaluating the model
roc_auc = roc_auc_score(y_test, y_pred)
print(f'ROC AUC score: {roc_auc}')

ROC AUC score: 0.5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


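Amazing! The warning shown above is a convergence warning: the solver reached its iteration limit before converging, which typically happens when the features are not scaled, and it can be addressed by scaling the data or raising `max_iter`. A separate issue is that the columns produced by the `OneHotEncoder` step have integer names instead of strings, so the DataFrame mixes string and integer column names, which scikit-learn may warn about or reject depending on the version.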

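In terms of results, the score is poor mainly because ChatGPT is using `predict` instead of `predict_proba`. ROC AUC measures how well the model ranks positive cases above negative ones, so it needs probabilities or scores; `predict` returns hard labels obtained with the default `0.5` threshold, which throws that ranking away. A score of exactly `0.5` suggests that the model predicted the same class for every row.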
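The difference can be seen with a small sketch on a synthetic, imbalanced dataset generated with `make_classification` (an assumption made for illustration, not the competition data). The same fitted model is scored twice, once with hard labels and once with the probability of the positive class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem standing in for the competition data
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Hard labels: every prediction is 0 or 1, so the ranking information is lost
auc_labels = roc_auc_score(y_te, clf.predict(X_te))

# Probabilities of the positive class (column 1) keep the full ranking
proba = clf.predict_proba(X_te)
auc_proba = roc_auc_score(y_te, proba[:, 1])

print(f"AUC from labels: {auc_labels:.3f}")
print(f"AUC from probabilities: {auc_proba:.3f}")
```

For a Kaggle submission scored with ROC AUC, it is the second form, the positive-class probability, that would normally go into the prediction column.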

# Conclusions

In this second part, we used ChatGPT to create a **basic machine learning model** and we saw how the tool can be useful in providing a basic understanding of the **steps involved in creating an ML pipeline** and the relevant Python libraries to use. We were able to **preprocess the categorical features** and use a **logistic regression model** to make predictions and evaluate its performance.

However, we did **encounter some problems** and incorrect information that required **human intervention**. <br>
I hope this shows the **importance of Data Scientists in the process**: as we saw, the data leakage mistake needed to be corrected.

This ends the second part of this series. In [part 3](https://www.kaggle.com/code/jacoporepossi/how-to-use-chatgpt-in-a-competition-final-part-3) we'll see how we can **improve the performance** thanks to **hyperparameter tuning and ensembling**, eventually making a submission to Kaggle.

Stay tuned!