# Project 2 AI Chat Notebook

In this notebook, we want to see the interactions with ChatGPT that you used to solve
both the unsupervised and the supervised machine learning tasks in `project2.ipynb`.  


# ChatGPT Functions to Use

You are free to use these functions, other LLMs, or the ChatGPT web interface, as long as you show those systems' outputs
so that we can see you used a large language model to perform the machine learning tasks.


In [109]:
from openai import OpenAI

client = OpenAI(
    # Point the client at the OpenAI-compatible endpoint used in this notebook
    base_url="http://aitools.cs.vt.edu:7860/openai/v1",
    api_key="aitools",
)


In [110]:

def converse(prompt, messages=None,
             model="gpt-3.5-turbo",
             max_tokens=4000,
             temperature=0,
             top_p=1,
             frequency_penalty=0,
             presence_penalty=0):
    """Send `prompt` to the chat model; return (reply text, updated message history)."""
    # Start a new conversation when no history is passed in
    if messages is None:
        messages = []

    # Add the user's message to the list of messages
    messages.append({"role": "user", "content": prompt})

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        presence_penalty=presence_penalty,
    ).choices[0].message.content

    # Add the assistant's message to the list of messages
    messages.append({"role": "assistant", "content": response})

    return response, messages
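
The cells below either pass `[]` to start a fresh conversation or pass the returned `messages` back in to continue one. The sketch below illustrates that history handling without contacting a server. `EchoClient` and `converse_demo` are hypothetical stand-ins written for this illustration: the stub only mimics the `client.chat.completions.create(...)` call shape and is not part of the OpenAI library.

```python
from types import SimpleNamespace


class EchoClient:
    """Hypothetical stand-in for the OpenAI client; echoes the last user message."""

    def __init__(self):
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=self._create))

    def _create(self, model, messages, **kwargs):
        reply = f"echo: {messages[-1]['content']}"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def converse_demo(client, prompt, messages=None, model="gpt-3.5-turbo"):
    """Same history handling as converse(), with the client passed in explicitly."""
    if messages is None:
        messages = []
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model=model, messages=messages
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": response})
    return response, messages


stub = EchoClient()
_, history = converse_demo(stub, "first prompt", [])            # new conversation
reply, history = converse_demo(stub, "second prompt", history)  # continues it
roles = [m["role"] for m in history]
print(reply)
print(roles)
```

Because the history list is extended in place and also returned, each follow-up prompt sees every earlier prompt and reply, which is why later prompts in this notebook can refer to variables such as `df` without redefining them.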

# Unsupervised Learning

In this section, show us how you performed the unsupervised learning tasks with the help of ChatGPT.

In [111]:
import glob

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


In [112]:
prompt = """
- Load all data files matching 'data/datasets/train-00000-5k-*.parquet'.
- Merge those dataframes into a single dataframe "df".
"""
response, messages = converse(prompt, [])
print(response)

import pandas as pd
import glob

# Load all data files matching the pattern
files = glob.glob('data/datasets/train-00000-5k-*.parquet')

# Merge all dataframes into a single dataframe
df = pd.concat([pd.read_parquet(file) for file in files], ignore_index=True)

# Display the merged dataframe
print(df.head())


In [113]:
prompt = """
- Define the code snippets (from the dataframe's "code" attribute)
"""
response, messages = converse(prompt, messages)
print(response)

To define the code snippets from the data frames' "code" attribute, you can use the following code:

```python
# Define the code snippets from the data frames' "code" attribute
code_snippets = df['code'].tolist()

# Display the code snippets
for snippet in code_snippets:
    print(snippet)
```

This code snippet will extract the code snippets from the "code" attribute of the data frames in the merged dataframe "df" and display them one by one.


In [114]:
prompt = """
- Define labels (from dataframe's "language" attribute).
"""
response, messages = converse(prompt, messages)
print(response)

To define the labels from the dataframe's "language" attribute, you can use the following code:

```python
# Define the labels from the dataframe's "language" attribute
labels = df['language'].tolist()

# Display the labels
print(labels)
```

This code snippet will extract the labels from the "language" attribute of the merged dataframe "df" and store them in a list called "labels". It will then display the list of labels.


In [115]:
prompt = """
- Convert code snippets to TF-IDF features
"""
response, messages = converse(prompt, messages)
print(response)

To convert the code snippets to TF-IDF features, you can use the following code snippet:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the code snippets to TF-IDF features
tfidf_features = tfidf_vectorizer.fit_transform(code_snippets)

# Display the shape of the TF-IDF features
print("TF-IDF features shape:", tfidf_features.shape)
```

This code snippet uses the `TfidfVectorizer` from scikit-learn to convert the code snippets stored in the `code_snippets` list to TF-IDF features. It fits the vectorizer on the code snippets and transforms them to TF-IDF features. Finally, it displays the shape of the TF-IDF features matrix.
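
To make the weighting concrete, here is a small hand-rolled sketch of the same kind of computation on three toy snippets. It assumes the defaults documented for scikit-learn's `TfidfVectorizer` (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows). The helper name `tfidf_rows` and the toy snippets are illustrative only, and the function is not a replacement for the library class.

```python
import math
import re
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    # Tokenise roughly like the vectorizer's default: lowercase, 2+ word characters
    tokenised = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


toy_snippets = [
    "def add(a, b): return a + b",
    "public static int add(int a, int b) { return a + b; }",
    "SELECT name FROM users WHERE id = 1",
]
vocabulary, rows = tfidf_rows(toy_snippets)
print(vocabulary)
print([round(w, 3) for w in rows[0]])
```

Terms that appear in only one snippet (such as `def`) receive a larger weight than terms shared across snippets (such as `add`), which is what lets the later clustering and classification steps separate languages by their characteristic keywords.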


In [116]:
prompt = """
- Perform K-Means clustering on the TF-IDF features. Use 5 clusters.
"""
response, messages = converse(prompt, messages)
print(response)

To perform K-Means clustering on the TF-IDF features with 5 clusters, you can use the following code snippet:

```python
from sklearn.cluster import KMeans

# Perform K-Means clustering with 5 clusters
kmeans = KMeans(n_clusters=5, random_state=42)
cluster_labels = kmeans.fit_predict(tfidf_features)

# Display the cluster labels
print("Cluster labels:", cluster_labels)
```

This code snippet uses the `KMeans` algorithm from scikit-learn to perform clustering on the TF-IDF features with 5 clusters. It assigns each code snippet to one of the 5 clusters and stores the cluster labels in the `cluster_labels` array. Finally, it displays the cluster labels assigned to each code snippet.
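
The `labels` list defined earlier is not used by the clustering itself, but it can serve as a rough sanity check on the clusters. The sketch below counts how many snippets of each language land in each cluster and computes a simple purity score. The helper names `cluster_composition` and `purity` and the toy values are assumptions made for illustration; with the real data you would pass `cluster_labels` and `labels` instead.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_ids, true_labels):
    """Count how many items of each true label fall into each cluster."""
    composition = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, true_labels):
        composition[cluster_id][label] += 1
    return composition


def purity(composition):
    """Fraction of items that carry the majority label of their cluster."""
    total = sum(sum(counts.values()) for counts in composition.values())
    majority = sum(max(counts.values()) for counts in composition.values())
    return majority / total


# Toy stand-ins for `cluster_labels` and `labels` from the cells above
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_labels = ["Python", "Python", "Java", "Java", "Java", "Go", "Go", "Python"]

composition = cluster_composition(toy_clusters, toy_labels)
for cluster_id in sorted(composition):
    print(cluster_id, dict(composition[cluster_id]))
print("purity:", purity(composition))
```

A purity close to 1 would suggest that the clusters line up with programming languages; a low value would suggest that K-Means is grouping snippets by something else, such as shared identifiers or comment text.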


In [117]:
prompt = """
- Show 10 sample code snippets truncated to 80 characters for each identified cluster. If code has newlines, replace them with spaces.
"""
response, messages = converse(prompt, messages)
print(response)

To show 10 sample code snippets truncated to 80 characters for each identified cluster (replacing newlines with spaces), you can use the following code snippet:

```python
import textwrap

# Create a DataFrame with code snippets and cluster labels
df_clustered = pd.DataFrame({'code': code_snippets, 'cluster': cluster_labels})

# Truncate code snippets to 80 characters and replace newlines with spaces
df_clustered['code_truncated'] = df_clustered['code'].str.replace('\n', ' ').apply(lambda x: textwrap.shorten(x, width=80))

# Display 10 sample code snippets for each identified cluster
for cluster_id in range(5):
    cluster_samples = df_clustered[df_clustered['cluster'] == cluster_id]['code_truncated'].head(10).tolist()
    print(f"Cluster {cluster_id} samples:")
    for sample in cluster_samples:
        print(sample)
    print()
```

This code snippet creates a DataFrame with code snippets and their corresponding cluster labels. It then truncates the code snippets to 80 characters and

## Showing Unsupervised Clustering Visually (Optional)

In [34]:
prompt = """
- Generate code to make a plot (use matplotlib) showing the data points and clusters. Use a distinct color for each cluster.
"""
response, messages = converse(prompt, messages)
print(response)

import matplotlib.pyplot as plt

# Assuming the dataframe has a column 'cluster' indicating the cluster for each data point
clusters = df['cluster'].unique()

# Define colors for each cluster
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k']

# Plot data points with different colors for each cluster
plt.figure(figsize=(10, 6))
for i, cluster in enumerate(clusters):
    cluster_data = df[df['cluster'] == cluster]
    plt.scatter(cluster_data['x'], cluster_data['y'], color=colors[i], label=f'Cluster {cluster}')

plt.xlabel('X')
plt.ylabel('Y')
plt.title('Data Points and Clusters')
plt.legend()
plt.show()
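
The generated code above assumes that `df` already has `cluster`, `x`, and `y` columns, which none of the earlier cells create. One possible way to obtain 2-D coordinates is to project the TF-IDF matrix with `TruncatedSVD` and plot the result. This is a minimal sketch under that assumption, not the approach the response above describes; `plot_clusters` and the toy snippets are hypothetical names introduced for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def plot_clusters(features, cluster_ids, n_clusters):
    """Project TF-IDF features to 2-D and scatter-plot them, one colour per cluster."""
    # TruncatedSVD works directly on the sparse TF-IDF matrix
    points = TruncatedSVD(n_components=2, random_state=42).fit_transform(features)
    fig, ax = plt.subplots(figsize=(10, 6))
    for cluster_id in range(n_clusters):
        mask = cluster_ids == cluster_id
        ax.scatter(points[mask, 0], points[mask, 1], label=f"Cluster {cluster_id}")
    ax.set_xlabel("SVD component 1")
    ax.set_ylabel("SVD component 2")
    ax.set_title("Code snippets by K-Means cluster")
    ax.legend()
    return fig, points


# Toy snippets standing in for `code_snippets`; illustrative only
toy_snippets = [
    "def add(a, b): return a + b",
    "def sub(a, b): return a - b",
    "import os; print(os.getcwd())",
    "SELECT name FROM users WHERE id = 1",
    "SELECT id FROM orders WHERE total > 10",
    "UPDATE users SET name = 'x' WHERE id = 2",
]
toy_features = TfidfVectorizer().fit_transform(toy_snippets)
toy_cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(toy_features)
fig, points = plot_clusters(toy_features, toy_cluster_ids, n_clusters=2)
```

With the notebook's own variables, the call would be `plot_clusters(tfidf_features, cluster_labels, n_clusters=5)`. In a notebook the figure normally renders inline at the end of the cell; otherwise call `plt.show()`.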


# Supervised Learning

In this section, show us how you performed the supervised learning tasks with the help of ChatGPT or another LLM.

In [118]:
import glob
import time

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


In [119]:
prompt = """
- Track measurements for each round: the round number/index, how long in seconds the train, predict steps take, and an accuracy report
- Use a list of dictionaries for measurement storage, with "round", "train", "predict", "accuracy" as keys
"""
response, messages = converse(prompt, [])
print(response)

```python
# Initialize list to store measurements
measurements = []

# Function to track measurements for each round
def track_measurements(round_num, train_time, predict_steps, accuracy):
    measurement = {
        "round": round_num,
        "train": train_time,
        "predict": predict_steps,
        "accuracy": accuracy
    }
    measurements.append(measurement)

# Example usage
track_measurements(1, 30, 1000, 0.85)
track_measurements(2, 25, 900, 0.88)

# Print measurements
for measurement in measurements:
    print(measurement)
```

Output:
```
{'round': 1, 'train': 30, 'predict': 1000, 'accuracy': 0.85}
{'round': 2, 'train': 25, 'predict': 900, 'accuracy': 0.88}
```
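
The example values above are hard-coded, and the `predict` entry is passed as a step count rather than a duration. A minimal sketch of measuring both steps in seconds, as the prompt asks, could look like the following. The helpers `timed` and `record_round` and the toy train/predict callables are hypothetical and exist only for this illustration.

```python
import time

measurements = []


def timed(step):
    """Run `step()` and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = step()
    return result, time.perf_counter() - start


def record_round(round_num, train_step, predict_step, score_step):
    """Time one train/predict round and append its measurement dictionary."""
    model, train_seconds = timed(train_step)
    predictions, predict_seconds = timed(lambda: predict_step(model))
    measurement = {
        "round": round_num,
        "train": train_seconds,
        "predict": predict_seconds,
        "accuracy": score_step(predictions),
    }
    measurements.append(measurement)
    return measurement


# Toy stand-ins for real training and prediction; illustrative only
truth = [1, 0, 1, 1]
record_round(
    0,
    train_step=lambda: {"always": 1},
    predict_step=lambda model: [model["always"]] * len(truth),
    score_step=lambda preds: sum(p == t for p, t in zip(preds, truth)) / len(truth),
)
print(measurements)
```

`time.perf_counter()` is used here because it is intended for measuring short durations; `time.time()` would also work for the multi-second training runs in this project.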


In [120]:
prompt = """
- Write a function named learn that takes a dataframe and an index as parameters.
- The index indicates the round number of learning. The function should:
- 1. Create a dictionary to store the measurements for this round
- 2. Print a message indicating the round number
- 3. Store the round number in the dictionary under 'round'
- 4. Split the dataframe into code snippets and labels. The code snippets is column 'code' and labels is the column 'language'.
- 5. Split the code snippets and labels into training and test sets
- 6. Create a TF-IDF vectorizer
- 7. Use the 'fit_transform' method on the training data to learn the vocabulary and idf, and return term-document matrix.
- 8. Use the 'transform' method on the test data to transform documents to document-term matrix.
- 9. Create a Support Vector Machine classifier
- 10. Train the classifier using the training data
- 11. Save the model to a file and load it back from a file (to make sure it works)
- 12. Use the classifier to predict the labels for the test data
- 13. Print the classification report which should be a dictionary
- 14. Store the training time in the dictionary under 'train'
- 15. Store the prediction time in the dictionary under 'predict'
- 16. Store the classification report in the dictionary under 'report'
- 17. Add 'accuracy' to the dictionary and set it to the accuracy score from the classification report
- 18. Append the dictionary to the measurements list for this round
"""
response, messages = converse(prompt, messages)
print(response)

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report
import pickle
import time

# Initialize list to store measurements
measurements = []

def learn(df, index):
    # Create dictionary to store measurements for this round
    measurement = {"round": index}
    
    print(f"Round {index}")
    
    # Split dataframe into code snippets and labels
    X = df['code']
    y = df['language']
    
    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    
    # Fit and transform training data
    start_train = time.time()
    X_train_tfidf = vectorizer.fit_transform(X_train)
    end_train = time.time()
    
    # Transform test data
    start_predict = time.time()
    X_test_tfidf = vectorizer.
```

In [121]:
prompt = """
- Load all data files matching 'data/datasets/train-00000-5k*.parquet'. * represents a wildcard.
- For each file loaded, merge the latest data file with the merged data to date,
- and call the learn function with the dataframe and the index of the file in the list of files.
- The idea is for round 1, load in dataset 0. Then for round 2, load in datasets 0,1. Then for round 3, load in datasets 0,1,2. Only perform for 3 rounds.
"""
response, messages = converse(prompt, messages)
print(response)

```python
import glob
import pandas as pd

# Load all data files matching the pattern, sorted so the rounds use datasets 0, 1, 2 in order
files = sorted(glob.glob('data/datasets/train-00000-5k*.parquet'))

# Initialize merged data
merged_data = pd.DataFrame()

# Perform for 3 rounds
for i in range(3):
    # Load the latest data file
    file = files[i]
    df = pd.read_parquet(file)
    
    # Merge latest data file with merged data
    merged_data = pd.concat([merged_data, df])
    
    # Call learn function with dataframe and index
    learn(merged_data, i)
```
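
The loop above calls `learn`, but the generated definition of that function is cut off in the earlier response. The following is a minimal sketch of the eighteen steps listed in the prompt, not a reconstruction of the missing output. It assumes that "training time" means the time spent in `classifier.fit`, uses `pickle` for the save/load round trip, and adds a hypothetical `model_path` parameter; the toy dataframe only mirrors the `code` and `language` column names.

```python
import os
import pickle
import tempfile
import time

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

measurements = []


def learn(df, index, model_path="svm_model.pkl"):
    """One learning round: vectorise, train an SVM, evaluate, and record timings."""
    measurement = {"round": index}
    print(f"Round {index}")

    # Split into code snippets and labels, then into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        df["code"], df["language"], test_size=0.2, random_state=42
    )

    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_train)  # learn vocabulary and IDF
    X_test_tfidf = vectorizer.transform(X_test)        # reuse the fitted vocabulary

    classifier = SVC()
    start = time.perf_counter()
    classifier.fit(X_train_tfidf, y_train)
    measurement["train"] = time.perf_counter() - start

    # Round-trip the model through a file to confirm it can be saved and restored
    with open(model_path, "wb") as handle:
        pickle.dump(classifier, handle)
    with open(model_path, "rb") as handle:
        classifier = pickle.load(handle)

    start = time.perf_counter()
    predictions = classifier.predict(X_test_tfidf)
    measurement["predict"] = time.perf_counter() - start

    report = classification_report(y_test, predictions, output_dict=True, zero_division=0)
    print(report)
    measurement["report"] = report
    measurement["accuracy"] = report["accuracy"]
    measurements.append(measurement)
    return measurement


# Toy dataframe with the same column names as the project data; illustrative only
toy_df = pd.DataFrame({
    "code": [f"def f{i}(x): return x + {i}" for i in range(10)]
            + [f"SELECT col{i} FROM table{i} WHERE id = {i}" for i in range(10)],
    "language": ["Python"] * 10 + ["SQL"] * 10,
})
with tempfile.TemporaryDirectory() as tmp:
    result = learn(toy_df, 0, model_path=os.path.join(tmp, "svm_model.pkl"))
```

Fitting the vectorizer on the training split only, and then calling `transform` on the test split, keeps test-set vocabulary from leaking into the features the classifier is trained on.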


In [97]:
prompt = """
I am getting following error for the above piece of code you provided:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/PycharmProjects/AI-Tools/Week2/project2/venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'code_snippets'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[26], line 19
     16 merged_data = pd.concat([merged_data, df], ignore_index=True)
     18 # Call learn function with dataframe and index
---> 19 learn(merged_data, index)

Cell In[25], line 15, in learn(df, index)
     12 print(f"Round {index}")
     14 # Split dataframe into code snippets and labels
---> 15 X = df['code_snippets']
     16 y = df['labels']
     18 # Split code snippets and labels into training and test sets

File ~/PycharmProjects/AI-Tools/Week2/project2/venv/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/PycharmProjects/AI-Tools/Week2/project2/venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'code_snippets'
"""

response, messages = converse(prompt, messages)
print(response)

The error indicates that the column 'code_snippets' does not exist in the DataFrame `df` that is being passed to the `learn` function. 

To resolve this issue, you need to check the column names in the DataFrame `df` loaded from the parquet file and ensure that it contains the columns 'code_snippets' and 'labels'. If the column names are different, you should adjust the code to use the correct column names.

You can print the column names of the DataFrame `df` before calling the `learn` function to verify the column names. Here is an updated version of the code with added print statements to check the column names:

```python
import glob
import pandas as pd

# Load all data files matching the pattern
data_files = glob.glob('data/datasets/train-00000-5k*.parquet')

# Initialize merged data
merged_data = pd.DataFrame()

for i, file in enumerate(data_files):
    # Load data file
    df = pd.read_parquet(file)
    
    # Print column names to verify
    print(df.columns)
    
    # Merge l
```

In [123]:
prompt = """
- If I have measurements in Python like a list of dictionaries such as:
- `[{'round': 0, 'train': 32.76, 'predict': 2.13, 'accuracy': 0.78},....]`
- Multiply accuracy by 100 and round it to 2 decimal places.
- let us plot lines on the same graph for train, predict and accuracy using python?  Use matplotlib.
- Add a legend.  Add axis labels.  Add a title.
- Let us show the plot in this notebook below.
"""
response, messages = converse(prompt, messages)
print(response)

```python
import matplotlib.pyplot as plt

# List of measurements
measurements = [{'round': 0, 'train': 32.76, 'predict': 2.13, 'accuracy': 0.78},
                {'round': 1, 'train': 28.45, 'predict': 1.98, 'accuracy': 0.82},
                {'round': 2, 'train': 25.67, 'predict': 1.85, 'accuracy': 0.87}]

# Multiply accuracy by 100 and round to 2 decimal places
for measurement in measurements:
    measurement['accuracy'] = round(measurement['accuracy'] * 100, 2)

# Extract data for plotting
rounds = [m['round'] for m in measurements]
train_times = [m['train'] for m in measurements]
predict_times = [m['predict'] for m in measurements]
accuracies = [m['accuracy'] for m in measurements]

# Plot lines on the same graph
plt.plot(rounds, train_times, label='Train Time')
plt.plot(rounds, predict_times, label='Predict Time')
plt.plot(rounds, accuracies, label='Accuracy')

# Add legend, axis labels, and title
plt.legend()
plt.xlabel('Round')
plt.ylabel('Time (s) / Accuracy (%)')
plt.title('P