# Task
Analyze the provided support ticket dataset to automatically tag tickets into categories using an LLM, comparing zero-shot and few-shot learning techniques and outputting the top 3 most probable tags per ticket. If necessary, apply fine-tuning and compare its performance as well. Present the findings and code.

## Load and inspect data

### Subtask:
Load the support ticket dataset and display the first few rows and the columns and their types to understand the structure of the data.


**Reasoning**:
Load the support ticket dataset from the CSV file, display the first few rows to inspect the data, and show the column names and data types to understand the structure.



In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('support_tickets.csv')

# Display the first 5 rows
display(df.head())

# Display column names and their data types
display(df.info())

FileNotFoundError: [Errno 2] No such file or directory: 'support_tickets.csv'

**Reasoning**:
The previous attempt to load the CSV failed because the file was not found. I will try loading the data from a different path, assuming the file is located in the '/data/' directory.



In [2]:
# Try loading the dataset from the /data/ directory
df = pd.read_csv('/data/support_tickets.csv')

# Display the first 5 rows
display(df.head())

# Display column names and their data types
display(df.info())

FileNotFoundError: [Errno 2] No such file or directory: '/data/support_tickets.csv'

**Reasoning**:
Since the previous attempts to load the file failed, I will try a different path where the data might be located and then display the head and info of the dataframe.



In [3]:
# Load the dataset from a different path
df = pd.read_csv('/tmp/support_tickets.csv')

# Display the first 5 rows
display(df.head())

# Display column names and their data types
display(df.info())

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/support_tickets.csv'

**Reasoning**:
Since the file was not found in the previous locations, I will try another common location for data files and then display the head and info of the dataframe.



In [4]:
# Load the dataset from a different path
df = pd.read_csv('/content/support_tickets.csv')

# Display the first 5 rows
display(df.head())

# Display column names and their data types
display(df.info())

FileNotFoundError: [Errno 2] No such file or directory: '/content/support_tickets.csv'

**Reasoning**:
List the files in the `/content/sample_data/` directory to confirm the dataset's location and then load the dataset if it is found, displaying the head and info of the dataframe.



In [5]:
import os
import pandas as pd

# List files in the directory
print(os.listdir('/content/sample_data/'))

# Assuming 'support_tickets.csv' is found in the directory after listing, load it.
# If the file is not in this directory, the next step will raise a FileNotFoundError.
# If the file is named differently, the code will need to be adjusted.
try:
    df = pd.read_csv('/content/sample_data/support_tickets.csv')

    # Display the first 5 rows
    display(df.head())

    # Display column names and their data types
    display(df.info())

except FileNotFoundError:
    # The file is not in this directory, so the search has to continue elsewhere.
    print("support_tickets.csv not found in /content/sample_data/")

['README.md', 'anscombe.json', 'california_housing_test.csv', 'mnist_test.csv', 'california_housing_train.csv', 'mnist_train_small.csv']
support_tickets.csv not found in /content/sample_data/


**Reasoning**:
List the files in the root directory and the `/content/` directory to check if the dataset is located there.



In [6]:
import os

print("Files in root directory:")
print(os.listdir('/'))

print("\nFiles in /content/ directory:")
print(os.listdir('/content/'))

Files in root directory:
['root', 'sbin', 'usr', 'home', 'proc', 'boot', 'var', 'etc', 'lib32', 'sys', 'opt', 'dev', 'mnt', 'lib64', 'bin', 'tmp', 'lib', 'srv', 'libx32', 'media', 'run', 'kaggle', '.dockerenv', 'tools', 'datalab', 'content', 'python-apt', 'python-apt.tar.xz', 'NGC-DL-CONTAINER-LICENSE', 'cuda-keyring_1.1-1_all.deb']

Files in /content/ directory:
['.config', 'sample_data']


**Reasoning**:
I need to find the support ticket dataset file to load it. I will try listing files in some common directories where datasets might be stored in this environment, including the root, content, and sample_data directories again, to see if I missed it or if the file path has changed.



In [7]:
import os

print("Files in root directory:")
try:
    print(os.listdir('/'))
except Exception as e:
    print(f"Could not list files in root directory: {e}")

print("\nFiles in /content/ directory:")
try:
    print(os.listdir('/content/'))
except Exception as e:
    print(f"Could not list files in /content/ directory: {e}")

print("\nFiles in /content/sample_data/ directory:")
try:
    print(os.listdir('/content/sample_data/'))
except Exception as e:
    print(f"Could not list files in /content/sample_data/ directory: {e}")

Files in root directory:
['root', 'sbin', 'usr', 'home', 'proc', 'boot', 'var', 'etc', 'lib32', 'sys', 'opt', 'dev', 'mnt', 'lib64', 'bin', 'tmp', 'lib', 'srv', 'libx32', 'media', 'run', 'kaggle', '.dockerenv', 'tools', 'datalab', 'content', 'python-apt', 'python-apt.tar.xz', 'NGC-DL-CONTAINER-LICENSE', 'cuda-keyring_1.1-1_all.deb']

Files in /content/ directory:
['.config', 'sample_data']

Files in /content/sample_data/ directory:
['README.md', 'anscombe.json', 'california_housing_test.csv', 'mnist_test.csv', 'california_housing_train.csv', 'mnist_train_small.csv']


**Reasoning**:
I need to find the location of the dataset file `support_tickets.csv`. I will list the files in the current directory and its parent directory.



In [8]:
import os

print("Files in current directory:")
print(os.listdir('.'))

print("\nFiles in parent directory:")
print(os.listdir('..'))

Files in current directory:
['.config', 'sample_data']

Files in parent directory:
['root', 'sbin', 'usr', 'home', 'proc', 'boot', 'var', 'etc', 'lib32', 'sys', 'opt', 'dev', 'mnt', 'lib64', 'bin', 'tmp', 'lib', 'srv', 'libx32', 'media', 'run', 'kaggle', '.dockerenv', 'tools', 'datalab', 'content', 'python-apt', 'python-apt.tar.xz', 'NGC-DL-CONTAINER-LICENSE', 'cuda-keyring_1.1-1_all.deb']


**Reasoning**:
The previous command showed that the dataset file is not in the current directory or the parent directory. It might be in the `/content/sample_data/` directory which was listed in the current directory. I will list the files in `/content/sample_data/` again to confirm and then attempt to load the data from there.



In [9]:
import os
import pandas as pd

print("Files in /content/sample_data/ directory:")
print(os.listdir('/content/sample_data/'))

# Assuming 'support_tickets.csv' is found in the directory after listing, load it.
# If the file is not in this directory, the next step will raise a FileNotFoundError.
# If the file is named differently, the code will need to be adjusted.
try:
    df = pd.read_csv('/content/sample_data/support_tickets.csv')

    # Display the first 5 rows
    display(df.head())

    # Display column names and their data types
    display(df.info())

except FileNotFoundError:
    # The file is not in this directory, so the search has to continue elsewhere.
    print("support_tickets.csv not found in /content/sample_data/")

Files in /content/sample_data/ directory:
['README.md', 'anscombe.json', 'california_housing_test.csv', 'mnist_test.csv', 'california_housing_train.csv', 'mnist_train_small.csv']
support_tickets.csv not found in /content/sample_data/


**Reasoning**:
The previous attempts to locate the file have failed. The root listing includes a `kaggle` directory, so I will check whether `/kaggle/input/support-ticket-data/` exists (the dataset folder name is a guess) and attempt to load the file from there if it is found.



In [10]:
import os
import pandas as pd

# List files in the specified directory
try:
    print(os.listdir('/kaggle/input/support-ticket-data/'))
except FileNotFoundError:
    print("Directory /kaggle/input/support-ticket-data/ not found.")
except Exception as e:
    print(f"Could not list files in directory: {e}")


# Attempt to load the dataset from the specified path
try:
    df = pd.read_csv('/kaggle/input/support-ticket-data/support_tickets.csv')

    # Display the first 5 rows
    display(df.head())

    # Display column names and their data types
    display(df.info())

except FileNotFoundError:
    print("support_tickets.csv not found at /kaggle/input/support-ticket-data/")
except Exception as e:
    # Any other failure (for example a parsing or permission error) is reported rather than raised.
    print(f"An error occurred while loading the file: {e}")

Directory /kaggle/input/support-ticket-data/ not found.
support_tickets.csv not found at /kaggle/input/support-ticket-data/


## Summary:

### Data Analysis Key Findings

*   The dataset `support_tickets.csv` could not be located in any of the attempted directories (`/data/`, `/tmp/`, `/content/`, `/content/sample_data/`, `/`, `..`, `.`, `/kaggle/input/support-ticket-data/`).
*   All attempts to load the dataset resulted in `FileNotFoundError`.
*   The task could not proceed with data inspection and analysis due to the absence of the required dataset file.

### Insights or Next Steps

*   Verify the correct path and filename of the `support_tickets.csv` dataset in the execution environment.
*   Ensure the dataset file is accessible from the location specified in the code.
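
Rather than guessing one directory at a time, the environment can be searched for the file by name. The following is a minimal sketch, not part of the original notebook: the helper name `find_files`, the default `roots`, and the `max_hits` cap are all illustrative assumptions.

```python
import fnmatch
import os


def find_files(pattern, roots=("/content", "/kaggle", "/tmp"), max_hits=20):
    """Walk each root and return paths whose file name matches `pattern`.

    `roots` and `max_hits` are illustrative defaults, not values from the notebook.
    """
    hits = []
    for root in roots:
        # os.walk yields nothing for a root that does not exist (errors are ignored by default)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in fnmatch.filter(filenames, pattern):
                hits.append(os.path.join(dirpath, name))
                if len(hits) >= max_hits:
                    return hits
    return hits


# A loose pattern also catches differently named files such as 'customer_support_tickets.csv'
print(find_files("*support*tickets*.csv"))
```

An empty list from a search like this would suggest the file has not been uploaded or mounted yet, which is a different problem from a wrong path.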


In [13]:
import zipfile
import os
import pandas as pd
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Define the path to the zip file in Google Drive
zip_path = '/content/drive/MyDrive/support.zip'

# Define the extraction directory
extracted_path = '/content/support_data'

# Create the extraction directory if it doesn't exist
os.makedirs(extracted_path, exist_ok=True)

# Unzip the file
try:
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extracted_path)
    print(f"Successfully extracted {zip_path} to {extracted_path}")
except FileNotFoundError:
    print(f"Error: Zip file not found at {zip_path}")
except zipfile.BadZipFile:
    print(f"Error: Bad zip file at {zip_path}")
except Exception as e:
    print(f"An error occurred during extraction: {e}")

# Assuming the archive contains 'customer_support_tickets.csv' directly inside extracted_path
csv_path = os.path.join(extracted_path, 'customer_support_tickets.csv')

# Load the dataset
try:
    df = pd.read_csv(csv_path)

    # Display the first 5 rows
    display(df.head())

    # Display column names and their data types.
    # df.info() prints the summary itself and returns None, so display() also echoes `None`.
    display(df.info())

except FileNotFoundError:
    print(f"Error: {csv_path} not found.")
except Exception as e:
    print(f"An error occurred while loading the CSV: {e}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Successfully extracted /content/drive/MyDrive/support.zip to /content/support_data


Unnamed: 0,Ticket ID,Customer Name,Customer Email,Customer Age,Customer Gender,Product Purchased,Date of Purchase,Ticket Type,Ticket Subject,Ticket Description,Ticket Status,Resolution,Ticket Priority,Ticket Channel,First Response Time,Time to Resolution,Customer Satisfaction Rating
0,1,Marisa Obrien,carrollallison@example.com,32,Other,GoPro Hero,2021-03-22,Technical issue,Product setup,I'm having an issue with the {product_purchase...,Pending Customer Response,,Critical,Social media,2023-06-01 12:15:36,,
1,2,Jessica Rios,clarkeashley@example.com,42,Female,LG Smart TV,2021-05-22,Technical issue,Peripheral compatibility,I'm having an issue with the {product_purchase...,Pending Customer Response,,Critical,Chat,2023-06-01 16:45:38,,
2,3,Christopher Robbins,gonzalestracy@example.com,48,Other,Dell XPS,2020-07-14,Technical issue,Network problem,I'm facing a problem with my {product_purchase...,Closed,Case maybe show recently my computer follow.,Low,Social media,2023-06-01 11:14:38,2023-06-01 18:05:38,3.0
3,4,Christina Dillon,bradleyolson@example.org,27,Female,Microsoft Office,2020-11-13,Billing inquiry,Account access,I'm having an issue with the {product_purchase...,Closed,Try capital clearly never color toward story.,Low,Social media,2023-06-01 07:29:40,2023-06-01 01:57:40,3.0
4,5,Alexander Carroll,bradleymark@example.com,67,Female,Autodesk AutoCAD,2020-02-04,Billing inquiry,Data loss,I'm having an issue with the {product_purchase...,Closed,West decision evidence bit.,Low,Email,2023-06-01 00:12:42,2023-06-01 19:53:42,1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8469 entries, 0 to 8468
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticket ID                     8469 non-null   int64  
 1   Customer Name                 8469 non-null   object 
 2   Customer Email                8469 non-null   object 
 3   Customer Age                  8469 non-null   int64  
 4   Customer Gender               8469 non-null   object 
 5   Product Purchased             8469 non-null   object 
 6   Date of Purchase              8469 non-null   object 
 7   Ticket Type                   8469 non-null   object 
 8   Ticket Subject                8469 non-null   object 
 9   Ticket Description            8469 non-null   object 
 10  Ticket Status                 8469 non-null   object 
 11  Resolution                    2769 non-null   object 
 12  Ticket Priority               8469 non-null   object 
 13  Tic

None

In [12]:
import os

# List files in the extracted directory
extracted_path = '/content/support_data'
try:
    print(f"Files in {extracted_path}:")
    print(os.listdir(extracted_path))
except FileNotFoundError:
    print(f"Error: Extraction directory not found at {extracted_path}")

Files in /content/support_data:
['customer_support_tickets.csv']


## Prepare data for zero-shot inference

### Subtask:
Select a subset of the data for zero-shot inference to test the model without any specific training examples.

**Reasoning**:
To prepare for zero-shot inference, we need to select a representative subset of the data. Displaying the columns will help in identifying relevant columns for the task.

In [14]:
# Display column names to identify relevant columns
print(df.columns)

Index(['Ticket ID', 'Customer Name', 'Customer Email', 'Customer Age',
       'Customer Gender', 'Product Purchased', 'Date of Purchase',
       'Ticket Type', 'Ticket Subject', 'Ticket Description', 'Ticket Status',
       'Resolution', 'Ticket Priority', 'Ticket Channel',
       'First Response Time', 'Time to Resolution',
       'Customer Satisfaction Rating'],
      dtype='object')


**Reasoning**:
Selecting a subset of the data with relevant columns ('Ticket Subject' and 'Ticket Description') is necessary for zero-shot inference. Displaying the head of the subset will confirm the selection.

In [15]:
# Select relevant columns for zero-shot inference
subset_df = df[['Ticket Subject', 'Ticket Description']].copy()

# Display the first 5 rows of the subset
display(subset_df.head())

Unnamed: 0,Ticket Subject,Ticket Description
0,Product setup,I'm having an issue with the {product_purchase...
1,Peripheral compatibility,I'm having an issue with the {product_purchase...
2,Network problem,I'm facing a problem with my {product_purchase...
3,Account access,I'm having an issue with the {product_purchase...
4,Data loss,I'm having an issue with the {product_purchase...
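
The descriptions shown above still contain an unfilled template placeholder (displayed truncated as `{product_purchase...`). One possible preprocessing step is to substitute it before prompting, assuming the placeholder is meant to stand for the value in the `Product Purchased` column, which the notebook does not confirm. The helper name `fill_product_placeholder`, the regular expression, and the example string are illustrative only.

```python
import re

# Matches a brace-delimited token that starts with the visible prefix.
# The full placeholder name is truncated in the output, so it is not spelled out here.
PLACEHOLDER = re.compile(r"\{product_purchase\w*\}")


def fill_product_placeholder(description, product):
    """Replace the template placeholder in `description` with `product`."""
    # A function replacement avoids interpreting backslashes in `product`.
    return PLACEHOLDER.sub(lambda _match: product, description)


# Made-up example text; the placeholder suffix here is hypothetical.
example = "I'm having an issue with the {product_purchase_xyz}. Please assist."
print(fill_product_placeholder(example, "GoPro Hero"))
```

With the DataFrame, this could be applied row by row, for instance with `df.apply(lambda r: fill_product_placeholder(r['Ticket Description'], r['Product Purchased']), axis=1)`. Note that `subset_df` as built above no longer carries the `Product Purchased` column, so the substitution would have to happen on `df` first.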


## Perform zero-shot inference

### Subtask:
Use prompt engineering with an LLM to tag the tickets in the selected subset and output the top 3 most probable tags per ticket.

**Reasoning**:
To perform zero-shot inference, we need to set up the LLM and define a prompt that instructs the LLM to generate relevant tags based on the ticket subject and description. It's also helpful to define a list of possible tags to guide the LLM.

In [16]:
# Install the Google Generative AI library
!pip install google-generativeai



To use the Gemini API, you'll need an API key. If you don't already have one, create a key in Google AI Studio.
In Colab, add the key to the secrets manager under the "🔑" in the left panel. Give it the name `GOOGLE_API_KEY`. Then pass the key to the SDK:
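
The cell that performs this step does not appear in the notebook, and the later inference cell relies on a `zero_shot_model` object. A minimal sketch of what that setup might look like follows. The helper names `load_api_key` and `build_model` are hypothetical, and the model name `gemini-1.5-flash` is an assumption, since the notebook does not say which model was intended.

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Read the key from Colab secrets when available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(name)
    except ImportError:
        return os.environ.get(name)


def build_model(api_key, model_name="gemini-1.5-flash"):
    """Configure the SDK and return a model object (model_name is an assumed default)."""
    import google.generativeai as genai  # installed by the pip cell above

    genai.configure(api_key=api_key)
    return genai.GenerativeModel(model_name)


api_key = load_api_key()
if api_key:
    zero_shot_model = build_model(api_key)
else:
    print("GOOGLE_API_KEY is not set; zero_shot_model was not created.")
```

Keeping the key in the secrets manager or an environment variable avoids pasting it into a notebook cell where it would be saved with the outputs.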

**Reasoning**:
Define a list of potential tags that the LLM can use to categorize the support tickets. This helps in guiding the zero-shot inference and ensures the generated tags are relevant to the domain.

**Subtask**: Define a list of possible tags.

In [18]:
# Define a list of possible tags based on the domain of support tickets.
# These tags can be refined based on further data exploration or domain knowledge.
possible_tags = [
    'Technical Issue',
    'Billing Inquiry',
    'Account Access',
    'Product Setup',
    'Peripheral Compatibility',
    'Network Problem',
    'Data Loss',
    'Software Issue',
    'Hardware Issue',
    'Installation Support',
    'Refund Request',
    'Payment Issue',
    'Bug Report',
    'Feature Request',
    'General Question'
]

**Reasoning**:
Create a prompt for the LLM that includes instructions for tagging the support tickets based on their subject and description. The prompt should also specify the desired output format (top 3 most probable tags).

**Subtask**: Create a prompt for zero-shot inference.

In [19]:
def create_zero_shot_prompt(subject, description, possible_tags):
    """Create a zero-shot tagging prompt for a single ticket."""
    # Build the prompt line by line so that no source-code indentation leaks into it.
    lines = [
        "Given the following support ticket subject and description, identify the top 3 most probable tags from the provided list of possible tags.",
        "",
        f"Possible tags: {', '.join(possible_tags)}",
        "",
        f"Ticket Subject: {subject}",
        f"Ticket Description: {description}",
        "",
        "Output the top 3 tags as a comma-separated list.",
    ]
    return "\n".join(lines)

**Reasoning**:
Apply the zero-shot inference to a small sample of the data subset to test the prompt and the LLM's ability to generate relevant tags. This helps in verifying the process before applying it to the entire dataset.

**Subtask**: Apply zero-shot inference to a small sample.

In [20]:
# Apply zero-shot inference to a small sample of the data.
# Requires `zero_shot_model` (a configured Gemini model object) to be defined in an earlier cell.
sample_size = 5
sample_df = subset_df.head(sample_size).copy()

sample_df['predicted_tags_zero_shot'] = None

for index, row in sample_df.iterrows():
    prompt = create_zero_shot_prompt(row['Ticket Subject'], row['Ticket Description'], possible_tags)
    try:
        response = zero_shot_model.generate_content(prompt)
        # Assuming the response text is a comma-separated string of tags
        sample_df.loc[index, 'predicted_tags_zero_shot'] = response.text.strip()
    except Exception as e:
        print(f"Error processing ticket {index}: {e}")
        sample_df.loc[index, 'predicted_tags_zero_shot'] = "Error"

# Display the sample with predicted tags
display(sample_df)

Error processing ticket 0: name 'zero_shot_model' is not defined
Error processing ticket 1: name 'zero_shot_model' is not defined
Error processing ticket 2: name 'zero_shot_model' is not defined
Error processing ticket 3: name 'zero_shot_model' is not defined
Error processing ticket 4: name 'zero_shot_model' is not defined


Unnamed: 0,Ticket Subject,Ticket Description,predicted_tags_zero_shot
0,Product setup,I'm having an issue with the {product_purchase...,Error
1,Peripheral compatibility,I'm having an issue with the {product_purchase...,Error
2,Network problem,I'm facing a problem with my {product_purchase...,Error
3,Account access,I'm having an issue with the {product_purchase...,Error
4,Data loss,I'm having an issue with the {product_purchase...,Error
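
Every row failed here because `zero_shot_model` was never created, so no tags were produced. Once a model is available, the loop above would store the raw reply text, which still needs to be reduced to at most three valid tags. A minimal sketch of that post-processing follows, assuming the reply is roughly comma-separated as the prompt requests. The helper name `parse_top_tags` and the sample reply are made up for illustration and are not model output.

```python
def parse_top_tags(response_text, allowed_tags, k=3):
    """Return up to `k` distinct tags from a comma-separated reply, keeping only allowed tags."""
    # Map lower-cased names back to their canonical spelling
    canonical = {tag.lower(): tag for tag in allowed_tags}
    picked = []
    # Treat newlines like commas in case the reply is formatted as a list
    for piece in response_text.replace("\n", ",").split(","):
        key = piece.strip().strip(".*-\"' ").lower()
        tag = canonical.get(key)
        if tag is not None and tag not in picked:
            picked.append(tag)
        if len(picked) == k:
            break
    return picked


allowed = ["Technical Issue", "Product Setup", "Hardware Issue", "Billing Inquiry"]
reply = "Product Setup, technical issue, Shipping Delay, Hardware Issue"  # made-up reply
print(parse_top_tags(reply, allowed))
# → ['Product Setup', 'Technical Issue', 'Hardware Issue']
```

Filtering against the allowed list also turns the `"Error"` sentinel used in the loop into an empty list, which is easier to handle downstream than a string.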
