# Data Preprocessing

This notebook covers the data preprocessing for our machine learning project. It performs data ingestion, cleaning, and processing so that the resulting dataset can be used in downstream tasks.

## 1. Data Loading

### BothBosu 1

In [25]:
import pandas as pd

single_agent_scam_dialogue_train_df = pd.read_csv("hf://datasets/BothBosu/single-agent-scam-conversations/single-agent-scam-dialogue_train.csv")
single_agent_scam_dialogue_test_df = pd.read_csv("hf://datasets/BothBosu/single-agent-scam-conversations/single-agent-scam-dialogue_test.csv")

single_agent_scam_dialogue = pd.concat([single_agent_scam_dialogue_train_df, single_agent_scam_dialogue_test_df])
single_agent_scam_dialogue.head()

Unnamed: 0,dialogue,type,labels
0,"Suspect: Hi, this is Karen from Dr. Smith's of...",appointment,0
1,"Suspect: Hi, is this John? Innocent: Yeah, tha...",appointment,0
2,"Suspect: Hi, I'm calling from XYZ Medical Cent...",appointment,0
3,"Suspect: Hi, I'm calling to confirm your appoi...",appointment,0
4,"Suspect: Hi, I'm calling from Dr. Smith's offi...",appointment,0


In [26]:
print(f"BothBosu 1 Dataset Count: {single_agent_scam_dialogue.shape[0]}")

BothBosu 1 Dataset Count: 1600


### YouTube

In [73]:
youtube_scam_conversations_df = pd.read_csv("hf://datasets/BothBosu/youtube-scam-conversations/youtube_scam_conversations.csv")

youtube_scam_conversations_df.head()

Unnamed: 0,dialogue,type,call,link,channel,labels
0,"Innocent: Yes, hello.\nSuspect: Yes ma'am, are...",support,outgoing,Scammer BEGS For His Deleted Files As I Drink ...,Scammer Payback,1
1,Suspect: Your Geek Squad subscription is upgra...,support,outgoing,"Incompetent Scammer Sends Me $25,000 (youtube....",Rinoa Poison,1
2,"Suspect: Cancellation department, how may I he...",support,outgoing,Scammer Gets Really Angry... Must See [scambai...,scambait tv,1
3,Suspect: Thank you for calling support. How ca...,support,outgoing,Tech Support Scammer vs Linux Mint (youtube.com),Lewis's Tech,1
4,"Suspect: Thank you for calling PayPal, how may...",refund,outgoing,"Telling a Scammer I'm Keeping His $18,000 (you...",Rinoa Poison,1


In [27]:
print(f"Youtube Dataset Count: {youtube_scam_conversations_df.shape[0]}")

Youtube Dataset Count: 20


### BothBosu 2

In [28]:
scam_dialogue_train_df = pd.read_csv("hf://datasets/BothBosu/scam-dialogue/scam-dialogue_train.csv")
scam_dialogue_test_df = pd.read_csv("hf://datasets/BothBosu/scam-dialogue/scam-dialogue_test.csv")

scam_dialogue = pd.concat([scam_dialogue_train_df, scam_dialogue_test_df])
scam_dialogue.rename(columns={"label": "labels"}, inplace=True)
scam_dialogue.head()

Unnamed: 0,dialogue,type,labels
0,"caller: Hello, is this John? receiver: Yes, it...",ssn,1
1,"caller: Hello, is this John? receiver: Yeah, t...",ssn,1
2,"caller: Hello, is this Mr. Johnson? receiver: ...",ssn,1
3,"caller: Hello, is this John? receiver: Yeah, t...",ssn,1
4,"caller: Hello, this is Officer Johnson from th...",ssn,1


In [38]:
scam_dialogue['dialogue'].iloc[555]

"caller: Hello, is this Mrs. Johnson?  receiver: Yes, it is. Who's calling?  caller: My name is David from Customer Rewards Center. How are you today?  receiver: I'm doing well, thanks. What can I do for you?  caller: Congratulations, Mrs. Johnson! You've been selected to receive a $1,000 gift card as part of our customer appreciation program.  receiver: That sounds amazing! How did I get selected?  caller: We randomly select customers from our database, and your name was chosen. All you need to do is pay a small processing fee of $20 to receive your gift card.  receiver: That's so exciting! Can you tell me more about the database you used? Was it from a specific store or website I visited?  caller: Uh, no, ma'am. It was just a general database. Don't worry about that. Would you like to proceed with the payment?  receiver: Okay, but how do I know this is legit? Can you provide me with a website or phone number to verify this offer?  caller: I understand your concern, ma'am. Unfortunate

In [29]:
print(f"BothBosu 2 Dataset Count: {scam_dialogue.shape[0]}")

BothBosu 2 Dataset Count: 1600


### BothBosu 3

In [30]:
agent_conversation_train_df = pd.read_csv("hf://datasets/BothBosu/multi-agent-scam-conversation/agent_conversation_train.csv")
agent_conversation_test_df = pd.read_csv("hf://datasets/BothBosu/multi-agent-scam-conversation/agent_conversation_test.csv")

agent_conversation = pd.concat([agent_conversation_train_df, agent_conversation_test_df])
agent_conversation.head()

Unnamed: 0,dialogue,personality,type,labels
0,"Innocent: Hello. Suspect: Hi, this is Karen f...",aggressive,appointment,0
1,"Innocent: Hello. Suspect: Hi, this is Karen f...",aggressive,appointment,0
2,"Innocent: Hello. Suspect: Hi, this is Karen f...",aggressive,appointment,0
3,"Innocent: Hello. Suspect: Hi, this is Rachel ...",aggressive,appointment,0
4,"Innocent: Hello. Suspect: Hi, this is Karen f...",aggressive,appointment,0


In [31]:
print(f"BothBosu 3 Dataset Count: {agent_conversation.shape[0]}")

BothBosu 3 Dataset Count: 1600


### BothBosu 4

In [76]:
gen_conver_noIdentifier_1000_df = pd.read_csv("hf://datasets/BothBosu/Scammer-Conversation/gen_conver_noIdentifier_1000.csv")

gen_conver_noIdentifier_1000_df.rename(columns={"conversation": "dialogue", "label": "labels"}, inplace=True)
gen_conver_noIdentifier_1000_df.head()

Unnamed: 0,dialogue,labels
0,"Person A: Hello, I'm calling from the bank and...",1
1,"Person A: Hi, how was your weekend? Person B: ...",0
2,"Person A: Hello, this is Sarah from the credit...",1
3,"Person A: Hey, how's your week going so far? P...",0
4,"Person A: Good day, I'm calling from a lottery...",1


In [32]:
print(f"BothBosu 4 Dataset Count: {gen_conver_noIdentifier_1000_df.shape[0]}")

BothBosu 4 Dataset Count: 1000
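
The loaded datasets do not share one schema: some use `label` or `conversation` and are renamed above, and BothBosu 4 has no `type` column. A minimal sketch of a schema check that could be run before combining is shown below. The helper `missing_columns`, the `REQUIRED_COLUMNS` set, and the toy frames are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Assumption: the minimum schema every dataset should expose before combining
REQUIRED_COLUMNS = {"dialogue", "labels"}

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the set of required column names that are absent from df."""
    return set(required) - set(df.columns)

# Toy frames standing in for a loaded dataset (not real data)
toy_raw = pd.DataFrame({"conversation": ["hi"], "label": [1]})
toy_renamed = toy_raw.rename(columns={"conversation": "dialogue", "label": "labels"})

print(sorted(missing_columns(toy_raw)))      # columns still to be renamed or added
print(sorted(missing_columns(toy_renamed)))  # empty once the rename is applied
```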


## 2. Data Preprocessing

### Adding 'Type' column in the dataset

The 'type' column matters because one of the prediction targets is the type of scam (e.g., delivery scam, bank scam) once a call conversation has been flagged as a potential scam call. The other datasets already include this column, but BothBosu 4 does not, so it has to be filled in here.

In [33]:
# Add a 'type' column to the last dataset (BothBosu 4), which has none

scam_only_df = gen_conver_noIdentifier_1000_df[gen_conver_noIdentifier_1000_df['labels'] == 1].copy()
non_scam_df = gen_conver_noIdentifier_1000_df[gen_conver_noIdentifier_1000_df['labels'] == 0].copy()

# Define keywords for each type
type_keywords = {
    "appointment": ["appointment", "schedule", "meeting", "consultation"],
    "delivery": ["delivery", "package", "courier", "ship"],
    "insurance": ["insurance", "policy", "claim", "premium"],
    "refund": ["refund", "money back", "return"],
    "reward": ["reward", "prize", "win", "lottery"],
    "ssn": ["ssn", "social security", "identity"],
    "support": ["support", "help desk", "customer service"],
    "telemarketing": ["offer", "promo", "discount", "telemarketing"],
    "wrong": ["wrong number", "misdialed"],
    "bank": ["bank", "account", "loan", "credit", "card", "transaction"] # new type
}

def classify_dialogue(dialogue):
    # Case-insensitive substring match; the first type in type_keywords with a hit wins
    text = dialogue.lower()
    for type_, keywords in type_keywords.items():
        if any(keyword.lower() in text for keyword in keywords):
            return type_
    return "others"  # For dialogues that don't match any type

scam_only_df['type'] = scam_only_df['dialogue'].apply(classify_dialogue)

# For non-scam rows, assign "NIL" as type
non_scam_df['type'] = "NIL"

# Combine classified scam rows and non-scam rows
final_df = pd.concat([scam_only_df, non_scam_df], ignore_index=True)
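
The keyword lookup above matches substrings, so a short keyword such as "win" also matches inside "window". One possible alternative, sketched below, matches keywords as whole words using regular expressions. This is not the method used in this notebook; the names `demo_type_keywords`, `compile_patterns`, and `classify_dialogue_whole_word` are hypothetical, and the keyword map is shortened for illustration.

```python
import re

# Hypothetical, shortened keyword map for illustration (not the full map above)
demo_type_keywords = {
    "reward": ["reward", "prize", "win", "lottery"],
    "delivery": ["delivery", "package", "courier", "ship"],
}

def compile_patterns(type_keywords):
    """Build one case-insensitive, whole-word regex per type."""
    return {
        type_: re.compile(
            r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
            re.IGNORECASE,
        )
        for type_, keywords in type_keywords.items()
    }

def classify_dialogue_whole_word(dialogue, patterns):
    """Return the first type whose keywords appear as whole words, else 'others'."""
    for type_, pattern in patterns.items():
        if pattern.search(dialogue):
            return type_
    return "others"

demo_patterns = compile_patterns(demo_type_keywords)
print(classify_dialogue_whole_word("Please open a new window.", demo_patterns))
print(classify_dialogue_whole_word("You win a prize!", demo_patterns))
```

The trade-off is that whole-word matching no longer catches inflected forms such as "shipping" or "winner", so the keyword lists would need to include those forms explicitly.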

In [34]:
# Combine all the datasets, keeping only the columns they have in common
dfs = [single_agent_scam_dialogue, youtube_scam_conversations_df, scam_dialogue, agent_conversation, final_df]
combined_df = pd.concat(dfs, ignore_index=True)
combined_df = combined_df[["dialogue", "labels", "type"]]
combined_df.head()

Unnamed: 0,dialogue,labels,type
0,"Suspect: Hi, this is Karen from Dr. Smith's of...",0,appointment
1,"Suspect: Hi, is this John? Innocent: Yeah, tha...",0,appointment
2,"Suspect: Hi, I'm calling from XYZ Medical Cent...",0,appointment
3,"Suspect: Hi, I'm calling to confirm your appoi...",0,appointment
4,"Suspect: Hi, I'm calling from Dr. Smith's offi...",0,appointment


In [35]:
print("Combined dataset shape: ", combined_df.shape)
print("Unique types: ", combined_df.type.unique()) # "wrong" type means wrong number scam

Combined dataset shape:  (5820, 3)
Unique types:  ['appointment' 'delivery' 'insurance' 'wrong' 'refund' 'reward' 'ssn'
 'support' 'telemarketing' 'bank' 'others' 'NIL']
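
As a quick sanity check, the combined row count should equal the sum of the per-dataset counts printed in the Data Loading section. The short sketch below redoes that arithmetic; the dictionary `source_counts` is only a convenience introduced here.

```python
# Row counts as printed in the Data Loading cells
source_counts = {
    "BothBosu 1": 1600,
    "YouTube": 20,
    "BothBosu 2": 1600,
    "BothBosu 3": 1600,
    "BothBosu 4": 1000,
}

expected_rows = sum(source_counts.values())
print(expected_rows)  # 5820, matching the first dimension of the combined shape
```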


### Remove caller's identity

Because we focus on certain keywords and phrases (telltale signs) and on compromised PII within the conversation, we remove the speaker identities (labels such as `Suspect:` or `caller:`) from the dialogue.

In [39]:
from collections import Counter

# From the dialogue, retrieve every whitespace-separated word that ends with a colon (:),
# with the trailing colon stripped
def get_colon_words(dialogue):
    return [word.rstrip(":") for word in dialogue.split() if word.endswith(":")]

# Count how often each colon-terminated word occurs across all dialogues
colon_word_counts = Counter()
for dialogue in combined_df["dialogue"]:
    colon_word_counts.update(get_colon_words(dialogue))

# Display the words sorted by count (descending order)
for word, count in colon_word_counts.most_common():
    print(f"{word}: {count}")

Innocent: 21501
Suspect: 21304
caller: 11285
receiver: 10605
A: 5424
B: 4950
number: 30
: 14
question: 14
this: 14
address: 14
thing: 13
ask: 13
clear: 12
you: 10
code: 9
information: 6
is: 5
simple: 5
confirm: 5
compromise: 4
alternative: 4
extension: 4
supervisor: 3
correct: 3
refund: 3
follows: 3
solution: 3
way: 3
codes: 2
one: 2
42: 2
reads: 2
insurance: 2
go: 2
policy: 2
account: 2
it: 2
courtesy: 2
deal: 2
again: 2
hint: 2
details: 2
Alex: 2
benefits: 2
Error: 1
"TeamViewer: 1
Rachel: 1
no: 1
things: 1
down: 1
moment.Rachel: 1
quote: 1
far: 1
though: 1
offer: 1
Discounts: 1
fee: 1
numbers: 1
premium: 1
Deductible: 1
Johnson: 1
info: 1
are: 1
location: 1
something: 1
confirmed: 1
Name: 1
ThompsonAddress: 1
processor: 1
at: 1
now: 1
system: 1
bonus: 1
chance: 1
important: 1
time: 1
David: 1
do: 1
assurance: 1
guarantee: 1
software: 1
issue.John: 1
website: 1
435: 1
ID: 1
get: 1
directly: 1
summarize: 1
works: 1
Insurance: 1
Program: 1
Services: 1
discount: 1
features: 1
case: 1
ch

In [40]:
# Remove the speaker labels from the dialogue.
# regex=False makes the literal match explicit regardless of pandas version.
speaker_prefixes = ["Person A: ", "Person B: ", "Innocent: ", "Suspect: ", "receiver: ", "caller: "]
for prefix in speaker_prefixes:
    combined_df["dialogue"] = combined_df["dialogue"].str.replace(prefix, "", regex=False)
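
The literal replacements above only match a label that is followed by exactly one space. A hedged alternative is sketched below: a single regular expression that strips the same six labels and also tolerates a missing space or a newline after the colon. The names `SPEAKER_LABELS`, `speaker_pattern`, and `strip_speaker_labels` are assumptions for illustration, and the sample string is made up.

```python
import re

# Speaker labels that dominate the colon-word counts above
SPEAKER_LABELS = ["Person A", "Person B", "Innocent", "Suspect", "receiver", "caller"]

# Match a label as a whole word, its colon, and any whitespace that follows
speaker_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(label) for label in SPEAKER_LABELS) + r"):\s*"
)

def strip_speaker_labels(dialogue):
    """Remove every known speaker label from a dialogue string."""
    return speaker_pattern.sub("", dialogue)

sample = "caller: Hello, is this John? receiver: Yes, it is."
print(strip_speaker_labels(sample))  # Hello, is this John? Yes, it is.
```

On the DataFrame, this could be applied with `combined_df["dialogue"].apply(strip_speaker_labels)`. Because the pattern also consumes newlines after a label, the result can differ slightly from the literal replacements on multi-line dialogues.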

## 3. Type of Scams

The cell below saves the combined dataset to a CSV file and prints the number of rows for each type in the dataset.

In [41]:
combined_df.to_csv("combined_scam_dataset.csv", index=False)

# Print the count of each type
type_counts = combined_df['type'].value_counts()
print(type_counts)

type
insurance        644
support          638
reward           614
refund           613
delivery         609
ssn              606
wrong            600
NIL              476
appointment      400
bank             384
telemarketing    209
others            27
Name: count, dtype: int64
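
The counts above do not show how many rows of each type are actually scams: the preview of the combined dataset, for example, contains `appointment` rows with `labels` equal to 0. A minimal sketch of a cross-tabulation that would show the split is given below. It runs on toy rows invented for illustration, not on the real dataset; on the notebook's data the same call would take `combined_df["type"]` and `combined_df["labels"]`.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real dataset)
toy_df = pd.DataFrame({
    "type": ["appointment", "appointment", "ssn", "ssn", "NIL"],
    "labels": [0, 0, 1, 1, 0],
})

# Rows are types, columns are label values, cells are row counts
type_by_label = pd.crosstab(toy_df["type"], toy_df["labels"])
print(type_by_label)
```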
