<a href="https://colab.research.google.com/github/hargurjeet/LJMU_Thesis/blob/main/open_ai_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Enrichment with OpenAI GPT-3.5

This notebook contains end-to-end code for generating enriched text with OpenAI's ChatGPT model (gpt-3.5-turbo).

In [1]:
!pip install pandas openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0


## 1. Library imports and Dataset Imports

In [9]:
import pandas as pd
import openai
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score, classification_report

## Update your api key here
openai.api_key = ''
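
Rather than pasting the key into the cell, it can be read from the environment so it never ends up in a saved notebook. The following is a minimal sketch; the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions for illustration, not part of the original notebook.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key


# In the notebook this would replace the hard-coded empty string:
# openai.api_key = load_api_key()
```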

In [3]:
travel_data = pd.read_csv('/content/Customertravel.csv')
travel_data.head()

Unnamed: 0,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target
0,34,No,Middle Income,6,No,Yes,0
1,34,Yes,Low Income,5,Yes,No,1
2,37,No,Middle Income,3,Yes,No,0
3,30,No,Middle Income,2,No,No,0
4,30,No,Low Income,1,No,No,0


In [4]:
# Work on a copy so the raw data stays untouched; 'Target' is kept for model training
df = travel_data.copy()
df

Unnamed: 0,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target
0,34,No,Middle Income,6,No,Yes,0
1,34,Yes,Low Income,5,Yes,No,1
2,37,No,Middle Income,3,Yes,No,0
3,30,No,Middle Income,2,No,No,0
4,30,No,Low Income,1,No,No,0
...,...,...,...,...,...,...,...
949,31,Yes,Low Income,1,No,No,0
950,30,No,Middle Income,5,No,Yes,0
951,37,No,Middle Income,4,No,No,0
952,30,No,Low Income,1,Yes,Yes,0


## 2. Generating textual data using gpt-3.5-turbo

In [None]:
## Function to convert a tabular row into a textual description
def row_to_textual_data(row):
    row_str = ', '.join([f"{col}: {val}" for col, val in row.items()])
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"""A Tour & Travels Company Wants To Predict Whether A Customer Will Churn Or Not Based On Indicators Given Below.
                                            Convert the following row to a detailed textual description: {row_str}.
                                            Consider the following constraints:
                                              1.Dont repeat similar discription for every line. No need to follow an order while generation of discription.\
                                              which mean the age can go in last or annual income can be in first...etc.
                                          """}
        ]
    )
    return response['choices'][0]['message']['content']

# Apply the function to each row (excluding the target) and store the result in a new column
df['TextualData'] = df.drop(columns='Target').apply(row_to_textual_data, axis=1)

df
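
The cell above makes one API request per row, so a single transient failure (rate limit, timeout) would abort the whole `apply`. One way to guard against that is a small retry wrapper with exponential backoff. This is only a sketch: `call_with_retries` and its parameters are hypothetical names, and it retries on any exception, where a real run would probably narrow that to the API's error types.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait base_delay, then 2x, then 4x, ... before trying again
            sleep(base_delay * 2 ** (attempt - 1))
```

With the function defined earlier, the call could then be wrapped as `df.drop(columns='Target').apply(lambda row: call_with_retries(row_to_textual_data, row), axis=1)`.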

## 3. Generating Embeddings using text-embedding-ada-002

In [6]:

def get_embedding(text):
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response['data'][0]['embedding']

df['Embedding'] = df['TextualData'].apply(get_embedding)
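
Embedding one text per request works, but it means roughly one call per row. A possible alternative is to send texts in batches. The sketch below shows only the batching logic; `chunked`, `embed_in_batches`, and the `embed_batch` callable are hypothetical names, and the batch size of 100 is an arbitrary assumption. With `openai==0.28`, `embed_batch` could wrap `openai.Embedding.create` with a list passed as `input`, assuming the endpoint accepts a list of strings.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed `texts` batch by batch.

    `embed_batch` is any callable that maps a list of strings to a list of
    vectors in the same order (hypothetical; supplied by the caller).
    """
    vectors = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

Passing the embedding call in as a parameter keeps the batching testable without network access.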

In [7]:
len(df['Embedding'][0])

1536

## 4. Saving the Final Results

In [14]:
df.drop(columns='Embedding', axis=1).to_csv('openai_output.csv', index=False)

In [13]:
df.drop(columns='Embedding', axis=1)

Unnamed: 0,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target,TextualData
0,34,No,Middle Income,6,No,Yes,0,The customer is 34 years old and has indicated...
1,34,Yes,Low Income,5,Yes,No,1,The customer is aged 34 and is a frequent flye...
2,37,No,Middle Income,3,Yes,No,0,The customer is 37 years old and is not a freq...
3,30,No,Middle Income,2,No,No,0,The customer is aged 30 and is not a frequent ...
4,30,No,Low Income,1,No,No,0,"The customer is 30 years old, does not frequen..."
...,...,...,...,...,...,...,...,...
949,31,Yes,Low Income,1,No,No,0,The customer is 31 years old and falls under t...
950,30,No,Middle Income,5,No,Yes,0,The customer is aged 30 and falls under the mi...
951,37,No,Middle Income,4,No,No,0,The customer is 37 years old and falls under t...
952,30,No,Low Income,1,Yes,Yes,0,"Based on the provided indicators, a customer a..."
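
The saved CSV omits the `Embedding` column, so a later notebook would have to call the embedding API again. If the vectors should be kept, one option is to store each one as a JSON string in a single cell. The sketch below demonstrates the round trip with the standard library only; the helper names are hypothetical and the three-element vector is a made-up stand-in for a 1536-dimensional embedding.

```python
import csv
import io
import json


def embedding_to_cell(vector):
    """Serialise a list of floats into a single CSV-safe string."""
    return json.dumps(vector)


def cell_to_embedding(cell):
    """Parse a string produced by embedding_to_cell back into a list of floats."""
    return json.loads(cell)


# Round trip through an in-memory CSV file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "Embedding"])
writer.writerow([0, embedding_to_cell([0.1, -0.2, 0.3])])

buffer.seek(0)
rows = list(csv.DictReader(buffer))
restored = cell_to_embedding(rows[0]["Embedding"])
```

In pandas the same idea would be `df['Embedding'].apply(json.dumps)` before `to_csv` and `.apply(json.loads)` after `read_csv`.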


## 5. Training ML models

In [10]:
import pandas as pd
import openai
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assuming df is already populated with TextualData and Embedding columns

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['Embedding'].tolist(),
    df['Target'],  # Assuming 'Target' is your target variable
    test_size=0.2,
    random_state=42
)

# Standardize the embeddings
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Dimensionality Reduction with PCA
# pca = PCA(n_components=min(len(X_train), X_train.shape[1]))
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_pca, y_train)

# Make predictions and evaluate the model
y_pred = clf.predict(X_test_pca)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Further steps could involve hyperparameter tuning, cross-validation, etc.


Accuracy: 0.84


In [11]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.84      0.99      0.91       153
           1       0.89      0.21      0.34        38

    accuracy                           0.84       191
   macro avg       0.86      0.60      0.62       191
weighted avg       0.85      0.84      0.79       191
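
To see why the accuracy of 0.84 sits next to a class-1 recall of only 0.21, it helps to work the metrics out from confusion-matrix counts. The counts below are not notebook output; they are inferred from the rounded report (supports of 153 and 38) as the integer values consistent with it, so treat them as an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts for class 1 (churn), inferred from the rounded report above (assumption)
churn = binary_metrics(tp=8, fp=1, fn=30, tn=152)

# Accuracy of always predicting the majority class 0 on the same test split
majority_baseline = 153 / 191

for name, value in churn.items():
    print(f"{name}: {value:.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

Under these counts the model finds only 8 of the 38 churners, and its accuracy is a few points above what predicting "no churn" for everyone would achieve, which is why recall and F1 for class 1 are the more informative figures here.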



In [None]:
X_train.shape

(4, 1536)

## Important note

This is not the final notebook; a series of notebooks follows this one. The code is split across several notebooks because of computational requirements, and I cannot rerun everything to reformat it into a single place.

Please follow the order outlined below after reviewing this notebook: