## Classification of 5000 forum posts by GPT to serve as training data for fine-tuning one of the open-source models
This notebook classifies the sentiment of another 5000 forum posts with a selected GPT model via the OpenAI API, so that in a second step they can be used as training data to fine-tune one of the leaner open-source models.
This involves the following steps.



In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
!pip install openai
import openai
from openai import OpenAI

Collecting openai
  Downloading openai-1.47.1-py3-none-any.whl.metadata (24 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading openai-1.47.1-py3-none-any.whl (375 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)

In [None]:
import pandas as pd
import yaml
with open("/content/drive/MyDrive/github_projects/fine_tuning_ai_for_sentiments/config/config.yaml", "r") as f:
  config = yaml.safe_load(f)

# load the key from the yaml file
with open("/content/drive/MyDrive/github_projects/chatgpt_api_credentials.yaml", "r") as file:
  chatgpt_api = yaml.safe_load(file)
# Ensuring the OpenAI API key is loaded correctly
display(chatgpt_api.keys())

dict_keys(['openAI_key'])

In [None]:
# os.environ is a dictionary-like mapping of the process's environment variables
# store the OpenAI API key in an environment variable so it does not have to be hard-coded in the notebook
import os
os.environ["OPENAI_API_KEY"] = chatgpt_api["openAI_key"]
# setting the OpenAI key from the environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")
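
If the YAML file is missing the key or the value is empty, the problem otherwise only surfaces at the first API call. A minimal sketch of an early check is shown below; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used for illustration and are not part of the original notebook.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper, not part of the original notebook.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or empty")
    return value

# Demo with a dummy variable name and a placeholder value (not a real key).
os.environ["DEMO_API_KEY"] = "placeholder-value"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook, the same check could be applied to `OPENAI_API_KEY` right after it is set.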

In [None]:
# importing the DataCleanerAndRefiner
import sys
sys.path.append(config["project_path"]+config["notebooks_dir"])
from data_cleaning_and_refining_helpers import DataCleanerAndRefiner

In [None]:
# # Load the CSV file containing all forum posts except the 100 posts used for testing (forum_posts_without_100_test.csv).
# # Draw a random training set of 5000 forum posts and save it (forum_posts_5000_train.csv); save the remaining posts to a separate file (forum_posts_without_5000_train.csv).
# # Important: no random_state was set in the sampling step, so each run selects a different sample.
# # To continue this project with the same sample, load the previously saved files with the code in the next cell.
# # To use the code on a different dataset, set a random_state (e.g. 3) to ensure reproducibility.

# # Warning: running the following lines may overwrite the provided files that contain the original sample.

# df_forum_posts_without_100_test = pd.read_csv(config["project_path"]+config["data_processed_dir"]+"forum_posts_without_100_test.csv")
# df_forum_posts_5000_train = df_forum_posts_without_100_test.sample(n=5000, random_state=None)
# display(df_forum_posts_5000_train)
# df_forum_posts_5000_train.to_csv(config["project_path"]+config["data_processed_dir"]+"forum_posts_5000_train.csv", index=False)
# df_forum_posts_without_5000_train = df_forum_posts_without_100_test[~df_forum_posts_without_100_test["ID"].isin(df_forum_posts_5000_train["ID"])]
# display(df_forum_posts_without_5000_train)
# df_forum_posts_without_5000_train.to_csv(config["project_path"]+config["data_processed_dir"]+"forum_posts_without_5000_train.csv", index=False)
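
The reproducibility note above can be illustrated with a small sketch. It assumes a DataFrame with an `ID` column, as in the notebook; the helper name `split_sample` and the toy data are hypothetical. With a fixed `random_state`, `DataFrame.sample` returns the same rows on every run.

```python
import pandas as pd

def split_sample(df, n, id_column="ID", random_state=3):
    """Draw a reproducible random sample and return (sample, remaining rows)."""
    sample = df.sample(n=n, random_state=random_state)
    # keep only the rows whose ID was not drawn into the sample
    rest = df[~df[id_column].isin(sample[id_column])]
    return sample, rest

# Toy data standing in for the forum posts
posts = pd.DataFrame({"ID": range(100, 120),
                      "text": [f"post {i}" for i in range(20)]})

train_a, rest_a = split_sample(posts, n=5)
train_b, rest_b = split_sample(posts, n=5)  # same seed, same sample
```

Note that splitting by `ID` assumes the IDs are unique; duplicate IDs would remove more rows from the remainder than were sampled.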

In [None]:
# loading the previously saved file with 5000 forum posts
df_forum_posts_5000_train = pd.read_csv(config["project_path"]+config["data_processed_dir"]+"forum_posts_5000_train.csv")
display(df_forum_posts_5000_train)
df_forum_posts_without_5000_train = pd.read_csv(config["project_path"]+config["data_processed_dir"]+"forum_posts_without_5000_train.csv")
df_forum_posts_without_5000_train

Unnamed: 0,ID,text,datetime,company
0,1244692,"Du hast vollkommen Recht, VW wird definitiv d...",2017-08-02 23:16:24,Volkswagen
1,237957,und Mutti knickt noch nicht ein Nach den ...,2012-05-07 15:28:37,Commerzbank
2,1419572,Ich will ja nicht so sein: Der Staat Israe...,2016-11-03 19:41:47,Wirecard
3,1251842,22.01.2014 WOLFSBURG/POSEN - Volkswagen st...,2014-01-22 14:18:09,Volkswagen
4,278275,"....das ist so, wie wenn man das kursziel ...",2011-09-14 08:45:52,Commerzbank
...,...,...,...,...
4995,417633,"""Obwohl das Unternehmen wegen der Übernahme d...",2010-09-23 17:02:48,Deutsche_Bank
4996,1253328,"Ja da sehe ich sie jetzt auch bald landen,...",2013-04-12 15:26:04,Volkswagen
4997,1396418,"Und dieser Laan_Pa ist sowas von Short, der s...",2019-04-18 23:23:28,Wirecard
4998,1128494,SE besteht nicht nur aus Gamesa! Die anderen ...,2023-07-10 08:09:31,Siemens_Energy


Unnamed: 0,ID,text,datetime,company
0,214,Wovon sollte das bezahlt werden? Die Divid...,2020-08-29 08:20:13,1_und_1_Drillisch
1,330,">>Wer genau hinschaut, erkennt die Sinnlosigk...",2019-06-19 18:04:05,1_und_1_Drillisch
2,607,27.10.15 13:12 aktiencheck.de Maintal (www.a...,2015-11-02 10:04:41,1_und_1_Drillisch
3,695,ich habs auf aktiecheck gefunden gruss Tageshoch,2015-01-26 17:35:36,1_und_1_Drillisch
4,875,07.07.14 16:07 Bankhaus Lampe Düsseldorf (...,2014-07-08 09:32:19,1_und_1_Drillisch
...,...,...,...,...
81078,1440349,"naja aber was wäre : stoppkurse bei 7,75€ (al...",2008-07-02 12:29:21,Wirecard
81079,1440354,"hübsches sümmchen, was da investiert wurde...",2008-07-02 11:56:10,Wirecard
81080,1440381,SES macht bewusst den Kurs kaput. Das ist ...,2008-07-01 16:00:48,Wirecard
81081,1440414,9 EUR ...... wir kommen,2008-07-01 09:09:33,Wirecard


In [None]:
# defining the function to conduct the API calls on the GPT models from OpenAI
# detailed doc see here: https://platform.openai.com/docs/api-reference/chat/create

def get_sentiment_classifications(text, model):
    """
    Classifies the sentiment of a given text (e.g. forum post in the presented case)
    using a selected GPT model from OpenAI via their API.

    The function sends a text to the selected GPT model with a prompt to classify
    the sentiment either as negative, neutral or positive. It expects the model to
    return a single word indicating this sentiment.

    Parameters:
    - text (str): The text of the forum post to be classified.
    - model (str): The name of the GPT model to use for the classification.
                  Overview of the available models: https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turb

    Returns:
    - str: The sentiment of the forum post as determined by the model
            ("negative", "neutral", "positive"). If an exception occurs, the
            error message is returned instead of a sentiment.
    """

    try:
        messages = [
            {"role": "system",
             "content": """You are an AI language model trained to analyze
                            and detect the sentiment of forum posts."""},
            {"role": "user",
             "content": f"""Analyze the following forum post and determine
                            if the sentiment is: negative, neutral or positive.
                            Return only a single word, either negative, neutral
                             or positive: {text}"""}
        ]
        # client refers to the OpenAI() client that sends the request
        client = OpenAI()

        # "completion" refers to the text that the model generates in response to an input prompt
        completion = client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=1,
            n=1,              #the number of responses the model will generate for the given input
            temperature=0
        )
        # extracting the sentiment classification from the received answer
        sentiment = completion.choices[0].message.content.lower()
        return sentiment

    #captures all the errors and returns the error statements for analysis
    except Exception as e:
        return f"error: {e}"
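
The function above returns the error message on the first failure. For a long run over thousands of posts, transient failures could instead be retried before giving up. The following is a minimal sketch of exponential backoff, not the notebook's method: `call_with_retries` and `flaky_classifier` are hypothetical names, and the sketch assumes the API call is factored into a function that raises on failure (unlike `get_sentiment_classifications`, which catches exceptions itself).

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func, retrying on any exception with exponential backoff.

    After the last failed attempt an "error: ..." string is returned,
    mirroring the error convention used in the notebook.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            if attempt == max_attempts:
                return f"error: {exc}"
            # wait base_delay, 2*base_delay, 4*base_delay, ... between attempts
            sleep(base_delay * 2 ** (attempt - 1))

# Demo with a stand-in classifier that fails twice before succeeding
calls = {"count": 0}

def flaky_classifier(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "neutral"

delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky_classifier, "some post", sleep=delays.append)
```

Passing `sleep` as a parameter keeps the demo fast and makes the backoff schedule easy to inspect.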

In [None]:
# testing the function with the API calls and the model which was selected to see if the calls and sentiment classification works properly
test_texts = ["That's a bad thing", "That's a neutral thing", "Thats a great thing"]
print(f"The used model {config['train_data_eval_model']}, provides the following sentiment classifications:")
for text in test_texts:
  print(get_sentiment_classifications(text, config["train_data_eval_model"]))

The used model gpt-4-turbo-preview, provides the following sentiment classifications:
negative
neutral
positive


### Classifying the 5000 Forum Posts with GPT

In [None]:
# CAREFUL: this cell is EXPENSIVE because it calls the GPT API!
# To classify the sentiment of the 5000 forum posts via the GPT API, the data is processed in chunks
# of 500 so that results are saved periodically. The results up to the last checkpoint are thus retained even if the API connection is lost or interrupted.

# # set the chunk size for periodic saving
# chunk_size = 500

# # create an empty list to collect the sentiment classified chunks
# sentiment_classified_chunks = []

# # iterating over the dataset in chunks of 500 rows, starting from index 0 to the end (here 4999)
# for start in range(0, len(df_forum_posts_5000_train), chunk_size):
#   end = start + chunk_size
#   # creating the chunk data frame
#   df_chunk = df_forum_posts_5000_train.iloc[start:end]
#   # apply the sentiment classification via the GPT API on the chunk
#   df_chunk["sentiment_"+config["train_data_eval_model"]] = df_chunk["text"].apply(lambda t: get_sentiment_classifications(t, config["train_data_eval_model"]))
#   # append the processed chunk to the list
#   sentiment_classified_chunks.append(df_chunk)
#   # concatenate the already classified chunks and save them
#   df_forum_posts_5000_train_classification_by_GPT_model_checkpoint = pd.concat(sentiment_classified_chunks, ignore_index=True)
#   # saving/overwriting the checkpoint file
#   df_forum_posts_5000_train_classification_by_GPT_model_checkpoint.to_csv(config["project_path"]+config["data_raw_dir"]+"forum_posts_5000_train_classification_by_GPT_model_checkpoint.csv", index=False)
#   print(f"a checkpoint was saved after performing the sentiment classification up to row {end - 1}")
# # after the sentiment of all 5000 forum posts has been classified, the result is saved as csv

# df_forum_posts_5000_train_classification_by_GPT_model_completed_raw = pd.concat(sentiment_classified_chunks, ignore_index=True)           # Question: could the DataFrame from the loop above be used here, or does it only exist inside the loop?
# df_forum_posts_5000_train_classification_by_GPT_model_completed_raw.to_csv(config["project_path"]+config["data_raw_dir"]+"forum_posts_5000_train_classification_by_GPT_model_completed_raw.csv", index=False) # ... = df_forum_posts_5000_train_classification_by_GPT_model_checkpoint.copy()
# df_forum_posts_5000_train_classification_by_GPT_model_completed_raw

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_chunk["sentiment_"+config["train_data_eval_model"]] = df_chunk["text"].apply(lambda t: get_sentiment_classifications(t, config["train_data_eval_model"]))


a checkpoint was saved after performing the sentiment classification up to row 4


a checkpoint was saved after performing the sentiment classification up to row 9


a checkpoint was saved after performing the sentiment classification up to row 14


a checkpoint was saved after performing the sentiment classification up to row 19
a checkpoint was saved after performing the sentiment classification up to row 24


Unnamed: 0,ID,text,datetime,company,sentiment_gpt-4-turbo-preview
0,1244692,"Du hast vollkommen Recht, VW wird definitiv d...",2017-08-02 23:16:24,Volkswagen,positive
1,237957,und Mutti knickt noch nicht ein Nach den ...,2012-05-07 15:28:37,Commerzbank,negative
2,1419572,Ich will ja nicht so sein: Der Staat Israe...,2016-11-03 19:41:47,Wirecard,neutral
3,1251842,22.01.2014 WOLFSBURG/POSEN - Volkswagen st...,2014-01-22 14:18:09,Volkswagen,neutral
4,278275,"....das ist so, wie wenn man das kursziel ...",2011-09-14 08:45:52,Commerzbank,negative
5,124322,Weiter geht's: 🦌,2022-05-31 09:42:21,Bayer,neutral
6,1402054,"Oha, da bringst Du etwas durcheinander... Die...",2019-03-19 22:12:59,Wirecard,neutral
7,1224069,"Die Frage ist, ob es nur eine Marktbereinigun...",2020-05-14 17:52:49,Varta,neutral
8,484056,Lufthansa-Aktie: Profis setzten jetzt auf ...,2013-09-10 11:30:19,Deutsche_Lufthansa,positive
9,1121986,"streiche das ""und"" vor Ziele....",2008-01-24 20:21:43,SGL_Carbon,neutral
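
The `SettingWithCopyWarning` in the output above arises because a new column is assigned to a slice obtained with `.iloc`. A minimal sketch of the same chunked loop that avoids the warning by working on an explicit copy is shown below. The helper name `classify_in_chunks`, the stand-in classifier, and the temporary checkpoint path are assumptions for illustration; no API call is made.

```python
import os
import tempfile

import pandas as pd

def classify_in_chunks(df, classify, column, checkpoint_path, chunk_size=500):
    """Apply classify to df["text"] chunk by chunk, saving a checkpoint after each chunk.

    Assumes df is non-empty and has a "text" column.
    """
    processed = []
    for start in range(0, len(df), chunk_size):
        # .copy() makes the chunk an independent DataFrame, so adding a
        # column does not trigger SettingWithCopyWarning
        chunk = df.iloc[start:start + chunk_size].copy()
        chunk[column] = chunk["text"].apply(classify)
        processed.append(chunk)
        # overwrite the checkpoint file with everything classified so far
        pd.concat(processed, ignore_index=True).to_csv(checkpoint_path, index=False)
    return pd.concat(processed, ignore_index=True)

# Demo with toy data and a stand-in classifier instead of the GPT API
demo = pd.DataFrame({"ID": range(7), "text": [f"post {i}" for i in range(7)]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.csv")
    result = classify_in_chunks(demo, lambda t: "neutral", "sentiment_demo",
                                path, chunk_size=3)
    saved = pd.read_csv(path)
```

Because each chunk is copied, the input DataFrame stays unchanged and the classified result is returned as a new DataFrame.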


### Cleaning and Refining the Assigned Data with Helper Function

In [None]:
# loading the previously saved raw GPT classifications of the 5000 forum posts
df_forum_posts_5000_train_classification_by_GPT_model_completed_raw = pd.read_csv(config["project_path"]+config["data_raw_dir"]+"forum_posts_5000_train_classification_by_GPT_model_completed_raw.csv")
display(df_forum_posts_5000_train_classification_by_GPT_model_completed_raw)

Unnamed: 0,ID,text,datetime,company,sentiment_gpt-4-turbo-preview
0,330,">>Wer genau hinschaut, erkennt die Sinnlosigk...",2019-06-19 18:04:05,1_und_1_Drillisch,Negative
1,429,Der Markt dürfte für Drillisch enger durch di...,2018-10-28 20:27:34,1_und_1_Drillisch,Negative
2,607,27.10.15 13:12 aktiencheck.de Maintal (www.a...,2015-11-02 10:04:41,1_und_1_Drillisch,Negative
3,695,ich habs auf aktiecheck gefunden gruss Tageshoch,2015-01-26 17:35:36,1_und_1_Drillisch,neutral
4,875,07.07.14 16:07 Bankhaus Lampe Düsseldorf (...,2014-07-08 09:32:19,1_und_1_Drillisch,Positive
...,...,...,...,...,...
4995,669305,Sie bekommen doch aber jeden Monat ihre Monat...,2020-09-18 18:29:51,Grenke,Neutral
4996,669384,dpa-AFX: *GRENKE-CEO: VICEROY HAT VIEL POR...,2020-09-18 15:17:02,Grenke,Negative
4997,669405,"gepostet? Verkauf 33,x - Leihgebühr je nach D...",2020-09-18 14:33:10,Grenke,Neutral
4998,669556,Nachbörse sieht auch mau aus...,2020-09-17 17:37:51,Grenke,Negative


In [None]:
# create an instance of the DataCleanerAndRefiner class from the data_cleaning_and_refining_helpers
valid_values = ["positive", "neutral", "negative"]
column_names = ["sentiment_"+config["train_data_eval_model"]] #, "sentiment_"+config["test_data_eval_model1"], "sentiment_"+config["test_data_eval_model2"]]
sentiment_label_conversion = {"positive":0, "neutral":1, "negative":2}

cleaner_refiner = DataCleanerAndRefiner(valid_values, column_names, sentiment_label_conversion)

# defining the paths and filenames under which the refined valid data and the invalid data from the cleaning process are saved
df_raw = df_forum_posts_5000_train_classification_by_GPT_model_completed_raw.copy()
refined_valid_path_and_filename = config["project_path"]+config["data_processed_dir"]+"df_forum_posts_5000_train_classification_by_GPT_model_completed_refined_valid.csv"
invalid_path_and_filename = config["project_path"]+config["data_processed_dir"]+"df_forum_posts_5000_train_classification_by_GPT_model_completed_invalid.csv"

clean_and_refine_data = cleaner_refiner.clean_and_refine_data(df_raw, refined_valid_path_and_filename, invalid_path_and_filename)

display(cleaner_refiner.df_valid_refined)
display(cleaner_refiner.df_invalid)
print("Cell execution completed")

Unnamed: 0,ID,text,datetime,company,sentiment_gpt-4-turbo-preview
0,330,">>Wer genau hinschaut, erkennt die Sinnlosigk...",2019-06-19 18:04:05,1_und_1_Drillisch,2
1,429,Der Markt dürfte für Drillisch enger durch di...,2018-10-28 20:27:34,1_und_1_Drillisch,2
2,607,27.10.15 13:12 aktiencheck.de Maintal (www.a...,2015-11-02 10:04:41,1_und_1_Drillisch,2
3,695,ich habs auf aktiecheck gefunden gruss Tageshoch,2015-01-26 17:35:36,1_und_1_Drillisch,1
4,875,07.07.14 16:07 Bankhaus Lampe Düsseldorf (...,2014-07-08 09:32:19,1_und_1_Drillisch,0
...,...,...,...,...,...
4995,669305,Sie bekommen doch aber jeden Monat ihre Monat...,2020-09-18 18:29:51,Grenke,1
4996,669384,dpa-AFX: *GRENKE-CEO: VICEROY HAT VIEL POR...,2020-09-18 15:17:02,Grenke,2
4997,669405,"gepostet? Verkauf 33,x - Leihgebühr je nach D...",2020-09-18 14:33:10,Grenke,1
4998,669556,Nachbörse sieht auch mau aus...,2020-09-17 17:37:51,Grenke,2


Unnamed: 0,ID,text,datetime,company,sentiment_gpt-4-turbo-preview
4027,546197,erg.:,2007-03-15 13:15:51,Deutsche_Telekom,Sure please
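
The internals of `DataCleanerAndRefiner` are not shown in this notebook. Judging from the outputs above, the cleaning step appears to normalize the label case, separate rows with unexpected answers, and convert the labels to integers. A rough sketch of such a step, under that assumption, could look as follows; `split_valid_invalid` is a hypothetical function and not the helper's actual implementation.

```python
import pandas as pd

def split_valid_invalid(df, column, label_to_id):
    """Normalize labels, then split into (valid rows with integer labels, invalid rows)."""
    # normalize case and surrounding whitespace, e.g. "Negative" -> "negative"
    normalized = df[column].astype(str).str.strip().str.lower()
    is_valid = normalized.isin(list(label_to_id))
    df_valid = df[is_valid].copy()
    df_valid[column] = normalized[is_valid].map(label_to_id)
    df_invalid = df[~is_valid].copy()
    return df_valid, df_invalid

# A few rows taken from the outputs above
raw = pd.DataFrame({
    "ID": [330, 695, 875, 546197],
    "sentiment": ["Negative", "neutral", "Positive", "Sure please"],
})
valid, invalid = split_valid_invalid(
    raw, "sentiment", {"positive": 0, "neutral": 1, "negative": 2}
)
```

Rows such as the `Sure please` answer end up in the invalid frame and can be inspected or reclassified separately.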


This cell seems to pass
