## AI-Assisted Labeling  
In this Jupyter notebook, we demonstrate an approach to automating the labeling of text data with Large Language Models (LLMs). The strategy relies on transformer models from Hugging Face, applied through zero-shot classification.  
The process involves:  
1. **Data Ingestion**: Loading text data in CSV format into a pandas DataFrame.  
2. **Automated Labeling**: Using Hugging Face's zero-shot classification pipeline to assign a label and per-label confidence scores to each text entry.  
3. **Visualization and Export**: The labeled data and confidence scores are displayed in a styled pandas DataFrame, color-coded by confidence level, and then exported to an Excel file.

In [21]:
#| default_exp assistant

In [22]:
#| hide
from nbdev.showdoc import *

### Libraries and Data Loading

In [30]:
# import necessary libraries 
import pandas as pd
import numpy as np
from transformers import pipeline
import os
import torch
# print the installed torch version to confirm the environment is up to date
print(torch.__version__)

2.1.1


In [24]:
# load the data
df = pd.read_csv('../data/009-1.csv')
df.head()

Unnamed: 0,Text,Label
0,"Good morning class, today we are going to lear...",PRS
1,"A noun is a word that represents a person, pla...",
2,Can anyone give me an example of a noun?,OTR
3,"That's right, 'dog' is a noun because it is a ...",
4,Let's write down some nouns in our notebooks.,


### Zero-shot Model  
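A zero-shot classifier assigns each text to one of a set of predefined categories without being fine-tuned on examples of those categories. The model used here is a natural language inference (NLI) model: each candidate label is turned into a hypothesis, and the model scores how strongly the text entails it.  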

For instance, if your dataset contains customer reviews, the zero-shot model can categorize them into sentiments like 'positive', 'negative', or 'neutral', even if it hasn't been trained on these specific reviews. The model leverages its existing knowledge base to infer the most probable category for each text entry.   

It also returns a score for each candidate label, indicating how strongly the model favors that label over the others. This can speed up labeling considerably when little or no category-specific training data is available, although the predictions remain estimates and are not guaranteed to be accurate.
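
As a rough illustration of where these scores come from: to our understanding, in its default single-label mode the Hugging Face zero-shot pipeline takes one entailment logit per candidate label and normalizes them with a softmax, so the scores sum to 1. The sketch below reproduces only that final step with NumPy. The logit values are made up for illustration, and `scores_from_entailment_logits` is a hypothetical helper, not part of `transformers`.

```python
import numpy as np

def scores_from_entailment_logits(labels, entail_logits):
    """Softmax over per-label entailment logits, sorted from highest to lowest score."""
    logits = np.asarray(entail_logits, dtype=float)
    # Subtract the max before exponentiating for numerical stability
    exp = np.exp(logits - logits.max())
    scores = exp / exp.sum()
    order = np.argsort(scores)[::-1]
    return {
        "labels": [labels[i] for i in order],
        "scores": [float(scores[i]) for i in order],
    }

# Hypothetical logits, one per candidate label (not produced by a real model run)
candidate_labels = ['PRS', 'REP', 'OTR', 'NEU']
example_logits = [0.2, -0.5, 2.1, 0.9]

prediction = scores_from_entailment_logits(candidate_labels, example_logits)
# Same reshaping step the notebook applies to the real pipeline output
label_scores = dict(zip(prediction['labels'], prediction['scores']))
print(label_scores)
```

The returned dictionary mimics the `labels` and `scores` keys that the labeling cell reads from the real pipeline output.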

In [25]:
# Remove the 'Legal' column if it exists
if 'Legal' in df.columns:
    df.drop('Legal', axis=1, inplace=True)

# Initialize the zero-shot classifier
classifier = pipeline("zero-shot-classification", model="sileod/deberta-v3-base-tasksource-nli")

# Define candidate labels
candidate_labels = ['PRS', 'REP', 'OTR', 'NEU']

# Initialize columns for scores
score_columns = ['PRS_Score', 'REP_Score', 'OTR_Score', 'NEU_Score']
for col in score_columns:
    df[col] = 0.0

# Rule-based override for classroom transcripts: keyword and punctuation cues
# that suggest a label regardless of the classifier's scores
def apply_rule_based_override(text):
    positive_words = ['great', 'well', 'excellent', 'good', 'proud', 'amazing']
    negative_words = ['bad', 'stop', 'disrespectful', 'quiet', 'get out']
    text_lower = text.lower()

    # Substring matching: a cue also matches when it appears inside a longer word
    if any(word in text_lower for word in positive_words):
        return 'PRS'
    elif any(word in text_lower for word in negative_words):
        return 'REP'
    elif text.strip().endswith('?'):
        return 'OTR'
    return None

# Process each text: run the classifier, then apply the override rules
for index, row in df.iterrows():
    text = row['Text']
    # Run classifier
    prediction = classifier(text, candidate_labels, truncation=True, max_length=1024)
    label_scores = dict(zip(prediction['labels'], prediction['scores']))

    # Apply rule-based override: raise the matched label's score to at least 0.5
    override_label = apply_rule_based_override(text)
    if override_label:
        label_scores[override_label] = max(label_scores[override_label], 0.5)

    # Update the DataFrame with scores
    for label in candidate_labels:
        df.at[index, f'{label}_Score'] = label_scores[label]

# Determine the label with the highest score
df['Label'] = df[score_columns].idxmax(axis=1).str.replace('_Score', '')

new_column_order = [col for col in df.columns if col not in score_columns and col != 'Label'] + score_columns + ['Label']
df = df[new_column_order]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
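
The override rules in the cell above use substring matching, so a cue such as 'well' also fires on 'dwell', and 'bad' fires on 'badge'. One possible refinement, shown here as a sketch rather than the method used in this notebook, matches cues only at word boundaries with a regular expression. `apply_rule_based_override_strict` and `_compile_cues` are hypothetical names; the word lists are copied from the cell above.

```python
import re

POSITIVE_WORDS = ['great', 'well', 'excellent', 'good', 'proud', 'amazing']
NEGATIVE_WORDS = ['bad', 'stop', 'disrespectful', 'quiet', 'get out']

def _compile_cues(words):
    # \b anchors each cue at word boundaries; re.escape guards against special characters
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    return re.compile(pattern, re.IGNORECASE)

POSITIVE_RE = _compile_cues(POSITIVE_WORDS)
NEGATIVE_RE = _compile_cues(NEGATIVE_WORDS)

def apply_rule_based_override_strict(text):
    """Same rule order as the original function, but cues must match whole words."""
    if POSITIVE_RE.search(text):
        return 'PRS'
    elif NEGATIVE_RE.search(text):
        return 'REP'
    elif text.strip().endswith('?'):
        return 'OTR'
    return None

print(apply_rule_based_override_strict("Great example, 'run' is a verb."))  # PRS
print(apply_rule_based_override_strict("Where do polar bears dwell?"))      # OTR
print(apply_rule_based_override_strict("Let's write down some nouns."))     # None
```

Whole-word matching does not resolve every ambiguity: a greeting such as "Good morning" still triggers the 'good' cue.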


In [26]:
# create a color_map function for better visualization 
def color_map(val):
    """
    Takes a scalar score and returns a string with
    the css property `'background-color'` for a color.
    Scores are binned at 0.2, 0.3, 0.4, and 0.5; higher bins get darker greens.
    """
    if np.isnan(val):
        return ''
    elif val < 0.2:
        return 'background-color: #ffffcc'  # light yellow
    elif val < 0.3:
        return 'background-color: #d9f0a3'  # light green
    elif val < 0.4:
        return 'background-color: #addd8e'  # green
    elif val < 0.5:
        return 'background-color: #78c679'  # darker green
    else:
        return 'background-color: #31a354'  # dark green

# Apply the styling (note: pandas 2.1+ deprecates Styler.applymap in favor of Styler.map)
score_columns = ['PRS_Score', 'REP_Score', 'OTR_Score', 'NEU_Score']
styled_df = df.style.applymap(color_map, subset=score_columns)

# Display the styled DataFrame in Jupyter Notebook
styled_df

Unnamed: 0,Text,PRS_Score,REP_Score,OTR_Score,NEU_Score,Label
0,"Good morning class, today we are going to learn about nouns.",0.5,0.139344,0.143033,0.561613,NEU
1,"A noun is a word that represents a person, place, thing, or idea.",0.089902,0.136978,0.092486,0.680633,NEU
2,Can anyone give me an example of a noun?,0.154021,0.274473,0.5,0.411084,OTR
3,"That's right, 'dog' is a noun because it is a thing.",0.209836,0.190717,0.22598,0.373467,NEU
4,Let's write down some nouns in our notebooks.,0.166816,0.178466,0.179472,0.475246,NEU
5,"Now, let's talk about verbs. Does anyone know what a verb is?",0.215511,0.242683,0.5,0.268819,OTR
6,"A verb is a word that describes an action, occurrence, or state of being.",0.20749,0.36132,0.211763,0.219428,REP
7,Can someone give me an example of a verb?,0.181728,0.354987,0.5,0.275267,OTR
8,"Great example, 'run' is a verb because it is an action.",0.5,0.26665,0.207185,0.247342,PRS
9,"Now, let's write down some verbs in our notebooks.",0.234103,0.255851,0.236518,0.273527,NEU
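
Several rows in the output above are close calls: in row 9, for example, all four scores fall between 0.23 and 0.28. A simple way to prioritize manual review, sketched here under assumed thresholds, is to flag rows whose top score is low or whose top two scores are nearly tied. The `flag_for_review` helper and its `min_top` and `min_margin` defaults are hypothetical choices, not values from this notebook; the three sample rows copy scores from the table above.

```python
import pandas as pd

score_columns = ['PRS_Score', 'REP_Score', 'OTR_Score', 'NEU_Score']

def flag_for_review(df, score_columns, min_top=0.4, min_margin=0.1):
    """Return a boolean Series marking rows whose prediction looks uncertain."""
    scores = df[score_columns]
    top = scores.max(axis=1)
    # Second-highest score per row: take the two largest values and keep the runner-up
    second = scores.apply(lambda r: r.nlargest(2).iloc[-1], axis=1)
    return (top < min_top) | ((top - second) < min_margin)

# Rows 1, 8, and 9 from the styled output above
sample = pd.DataFrame(
    [
        [0.089902, 0.136978, 0.092486, 0.680633],
        [0.5, 0.26665, 0.207185, 0.247342],
        [0.234103, 0.255851, 0.236518, 0.273527],
    ],
    columns=score_columns,
    index=[1, 8, 9],
)
sample['Needs_Review'] = flag_for_review(sample, score_columns)
print(sample['Needs_Review'])
```

With these assumed thresholds, only row 9 is flagged, since its top score is below 0.4.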


In [33]:
directory_path = 'example_label_file'
# Create the output directory if it does not already exist
os.makedirs(directory_path, exist_ok=True)
file_path = os.path.join(directory_path, 'AI_assisted_labeled.xlsx')

styled_df.to_excel(file_path, engine='openpyxl')

In [28]:
#| hide
import nbdev; nbdev.nbdev_export()