# Hello Foundation Model: Spam SMS Classifier

Use a ChatGPT foundation model to classify SMS messages as "Spam" or "Not Spam".

### Get Data

In [1]:
from datasets import load_dataset

# Load the train split of the SMS spam dataset
spam_data = load_dataset("sms_spam", split="train")

for item in spam_data.select(range(10)):
    sms = item["sms"]
    label = item["label"]
    print(f"Label: '{label}', sms: '{sms}'")


Label: '0', sms: 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
'
Label: '0', sms: 'Ok lar... Joking wif u oni...
'
Label: '1', sms: 'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
'
Label: '0', sms: 'U dun say so early hor... U c already then say...
'
Label: '0', sms: 'Nah I don't think he goes to usf, he lives around here though
'
Label: '1', sms: 'FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
'
Label: '0', sms: 'Even my brother is not like to speak with me. They treat me like aids patent.
'
Label: '0', sms: 'As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
'
Label: '1', sms: 'WINNER!! As a valued netw

### Helper Functions

Convert between label names and numeric label ids

In [2]:
id2label = {0: "HAM", 1: "SPAM"}
label2id = {"HAM": 0, "SPAM": 1}

for item in spam_data.select(range(10)):
    sms = item["sms"]
    id_label = item["label"]
    print(f"Label: '{id2label[id_label]}', sms: '{sms}'")

Label: 'HAM', sms: 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
'
Label: 'HAM', sms: 'Ok lar... Joking wif u oni...
'
Label: 'SPAM', sms: 'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
'
Label: 'HAM', sms: 'U dun say so early hor... U c already then say...
'
Label: 'HAM', sms: 'Nah I don't think he goes to usf, he lives around here though
'
Label: 'SPAM', sms: 'FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
'
Label: 'HAM', sms: 'Even my brother is not like to speak with me. They treat me like aids patent.
'
Label: 'HAM', sms: 'As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
'
Label: 'SPAM', sms: 'WINN

Format SMS messages for the LLM

In [4]:
def get_sms_msg_string(dataset, item_numbers, include_labels=False):
    sms_msg_string = ""
    for item_no, item in zip(item_numbers, dataset.select(item_numbers)):
        sms = item["sms"]
        id_label = item["label"]

        if include_labels:
            sms_msg_string += (f"{item_no} (label = {id2label[id_label]}) -> sms: {sms}\n")
        else:
            sms_msg_string += (f"{item_no} -> sms: {sms}\n")

    return sms_msg_string

In [5]:
print(get_sms_msg_string(spam_data, range(10), include_labels=True))

0 (label = HAM) -> sms: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

1 (label = HAM) -> sms: Ok lar... Joking wif u oni...

2 (label = SPAM) -> sms: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

3 (label = HAM) -> sms: U dun say so early hor... U c already then say...

4 (label = HAM) -> sms: Nah I don't think he goes to usf, he lives around here though

5 (label = SPAM) -> sms: FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

6 (label = HAM) -> sms: Even my brother is not like to speak with me. They treat me like aids patent.

7 (label = HAM) -> sms: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune

8

### Prompt LLM

Select a few messages and build a prompt for the LLM

In [6]:
# Get message strings for a few messages 
sms_msg_string = get_sms_msg_string(spam_data, range(20, 28))

# Prompt for LLM including messages 
query_string = f"""
{sms_msg_string}
---
Please classify the above messages as SPAM or NOT SPAM = HAM. Respond in JSON format.
Use this format {{"0": "HAM", "1": "SPAM"}}

"""
print(query_string)


20 -> sms: Is that seriously how you spell his name?

21 -> sms: I‘m going to try for 2 months ha ha only joking

22 -> sms: So ü pay first lar... Then when is da stock comin...

23 -> sms: Aft i finish my lunch then i go str down lor. Ard 3 smth lor. U finish ur lunch already?

24 -> sms: Ffffffffff. Alright no way I can meet up with you sooner?

25 -> sms: Just forced myself to eat a slice. I'm really not hungry tho. This sucks. Mark is getting worried. He knows I'm sick when I turn down pizza. Lol

26 -> sms: Lol your always so convincing.

27 -> sms: Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?


---
Please classify the above messages as SPAM or NOT SPAM = HAM. Respond in JSON format.
Use this format {"0": "HAM", "1": "SPAM"}




In [16]:
# ChatGPT response (as a Python dict)
response_1 = {
  "20": "HAM",
  "21": "HAM",
  "22": "SPAM",
  "23": "HAM",
  "24": "HAM",
  "25": "HAM",
  "26": "HAM",
  "27": "HAM"
}
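
The prompt asks for JSON, but a chat reply often arrives as a string with extra text around the JSON object. Below is a minimal sketch of how such a reply could be turned into a dict like `response_1`. The function name `parse_llm_response` and the sample reply are hypothetical and not part of the original notebook.

```python
import json


def parse_llm_response(raw_reply):
    """Extract the outermost JSON object from a chat reply and return it as a dict."""
    # Locate the first "{" and the last "}" so surrounding chatter is ignored
    start = raw_reply.find("{")
    end = raw_reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in reply")

    parsed = json.loads(raw_reply[start:end + 1])

    # Normalize keys to strings and labels to upper case ("spam" -> "SPAM")
    return {str(key): str(value).strip().upper() for key, value in parsed.items()}


# Hypothetical reply text, invented for illustration
raw_reply = 'Sure, here is the classification: {"20": "HAM", "22": "spam"} Let me know if you need more.'
print(parse_llm_response(raw_reply))
```

Normalizing the labels to upper case keeps them in line with the `id2label` values, although `check_accuracy` compares case-insensitively anyway.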


### Check Accuracy

In [17]:
def check_accuracy(response, dataset, original_indices):
    """Return the share of correct predictions, rounded to two decimals."""
    nr_correct = 0
    nr_total = 0
    for item_no, prediction in response.items():
        # Skip predictions for indices that were not part of the prompt
        if int(item_no) not in original_indices:
            continue

        id_label = dataset[int(item_no)]["label"]
        label = id2label[id_label]

        # If LLM prediction is correct, increment correct count
        if prediction.lower() == label.lower():
            nr_correct += 1

        # Increment total count
        nr_total += 1

    # Compute accuracy once, after the loop; nr_total is 0 when nothing matched
    if nr_total == 0:
        print("No matching indices found in the dataset.")
        return None

    return round(nr_correct / nr_total, 2)


In [18]:
print(f"Accuracy with a zero-shot prompt: {check_accuracy(response_1, spam_data, range(20, 28))}")

Accuracy with a zero-shot prompt: 0.88
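
The accuracy figure alone does not show which messages were misclassified. The sketch below lists the wrong predictions next to the true labels. The helper name `find_mismatches` and the toy labels are assumptions made for illustration; they are not taken from the notebook or the dataset.

```python
def find_mismatches(response, true_labels):
    """Return {index: (predicted, actual)} for every wrong prediction."""
    mismatches = {}
    for item_no, prediction in response.items():
        actual = true_labels.get(int(item_no))
        # Ignore predictions for indices without a known label
        if actual is None:
            continue
        # Same case-insensitive comparison as in check_accuracy
        if prediction.lower() != actual.lower():
            mismatches[int(item_no)] = (prediction, actual)
    return mismatches


# Toy data, invented for illustration only (not real dataset labels)
toy_labels = {0: "HAM", 1: "SPAM", 2: "HAM"}
toy_response = {"0": "HAM", "1": "HAM", "2": "HAM"}
print(find_mismatches(toy_response, toy_labels))
```

With the real data, `true_labels` could be built as `{i: id2label[spam_data[i]["label"]] for i in range(20, 28)}`.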


### Prompt with Examples

In [19]:
# Get message strings for a few messages 
sms_msg_string = get_sms_msg_string(spam_data, range(50, 60), include_labels=True)

# Prompt for LLM including messages 
query_string = f"""
{sms_msg_string}
---
Please classify the above messages as SPAM or NOT SPAM = HAM. Respond in JSON format.
Use this format {{"0": "HAM", "1": "SPAM"}}

"""
print(query_string)


50 (label = HAM) -> sms: What you thinked about me. First time you saw me in class.

51 (label = HAM) -> sms: A gram usually runs like  &lt;#&gt; , a half eighth is smarter though and gets you almost a whole second gram for  &lt;#&gt;

52 (label = HAM) -> sms: K fyi x has a ride early tomorrow morning but he's crashing at our place tonight

53 (label = HAM) -> sms: Wow. I never realized that you were so embarassed by your accomodations. I thought you liked it, since i was doing the best i could and you always seemed so happy about "the cave". I'm sorry I didn't and don't have more to give. I'm sorry i offered. I'm sorry your room was so embarassing.

54 (label = SPAM) -> sms: SMS. ac Sptv: The New Jersey Devils and the Detroit Red Wings play Ice Hockey. Correct or Incorrect? End? Reply END SPTV

55 (label = HAM) -> sms: Do you know what Mallika Sherawat did yesterday? Find out now @  &lt;URL&gt;

56 (label = SPAM) -> sms: Congrats! 1 year special cinema pass for 2 is yours. call 09061

In [20]:
# ChatGPT response
response_2 = {
  "50": "HAM",
  "51": "HAM",
  "52": "HAM",
  "53": "HAM",
  "54": "SPAM",
  "55": "HAM",
  "56": "SPAM",
  "57": "HAM",
  "58": "HAM",
  "59": "HAM"
}

Check Accuracy

In [None]:
print(f"Accuracy when prompted with examples: {check_accuracy(response_2, spam_data, range(50, 60)):.2f}")

Accuracy when prompted with examples: 1.00
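
In the prompt above, the labels are attached to the same messages the model is asked to classify, so the answers can be read directly from the prompt. A few-shot setup usually keeps labeled examples and unlabeled targets separate. The sketch below shows one way this could look; the function `build_few_shot_prompt` and the section headings in the prompt are hypothetical, while the message texts and labels are copied from the outputs earlier in this notebook.

```python
def build_few_shot_prompt(examples, targets):
    """Build a prompt with labeled examples followed by unlabeled targets.

    examples: list of (item_no, label, sms) tuples shown with their labels
    targets:  list of (item_no, sms) tuples the model should classify
    """
    example_lines = "".join(
        f"{item_no} (label = {label}) -> sms: {sms}\n"
        for item_no, label, sms in examples
    )
    target_lines = "".join(
        f"{item_no} -> sms: {sms}\n" for item_no, sms in targets
    )
    return (
        "Examples:\n"
        f"{example_lines}"
        "---\n"
        "Messages to classify:\n"
        f"{target_lines}"
        "---\n"
        'Classify only the messages under "Messages to classify" as SPAM or HAM. '
        "Respond in JSON format.\n"
        'Use this format {"0": "HAM", "1": "SPAM"}\n'
    )


# Labeled examples (messages 1 and 2) and unlabeled targets (messages 20 and 24)
examples = [
    (1, "HAM", "Ok lar... Joking wif u oni..."),
    (2, "SPAM", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
                "Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"),
]
targets = [
    (20, "Is that seriously how you spell his name?"),
    (24, "Ffffffffff. Alright no way I can meet up with you sooner?"),
]
print(build_few_shot_prompt(examples, targets))
```

Scoring the reply with `check_accuracy` over the target indices only would then measure predictions on messages whose labels the model has not seen.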

