<a href="https://colab.research.google.com/github/msmekka/ncssm-summer25-cyber/blob/main/Secure_AI_Spam_Classifier_Colab_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 Secure Your Data, Power Your AI
Welcome to your project notebook! You will build a spam classifier using anonymized data. Work through each section and complete the code where prompted.

# Python Practice

A quick overview of basic Python concepts. The cell below stores a string, an integer, and a float in variables.

In [None]:
name = "Ada"
age = 14
height = 5.3
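As a small illustration of what those three variables hold, the sketch below prints each variable's type and builds a sentence with an f-string. The variable `greeting` is a name introduced here for illustration only.

```python
# The same variables as in the practice cell
name = "Ada"
age = 14
height = 5.3

# type() reports what kind of value a variable holds
type_names = [type(name).__name__, type(age).__name__, type(height).__name__]
print(type_names)

# An f-string inserts variable values into text
greeting = f"{name} is {age} years old."
print(greeting)
```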

## 🔍 Step 1: Load Your Dataset
We're using the SMS Spam Collection dataset from Kaggle. The cell below loads a cleaned copy (`spam_clean.csv`) from the course GitHub repository.

In [2]:
# Import libraries
import pandas as pd

filename = 'https://raw.githubusercontent.com/msmekka/ncssm-summer25-cyber/refs/heads/main/spam_clean.csv'


# Load dataset (replace with actual file path or use Kaggle API)
df = pd.read_csv(filename, sep=',')
#df = pd.read_csv('/content/sample_data/spam.csv', encoding='latin1')
print(df.shape)
df.tail()

(5495, 5)


Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
5490,spam,This is the 2nd time we have tried 2 contact u...,,,
5491,ham,Will Ì_ b going to esplanade fr home?,,,
5492,ham,"Pity, * was in mood for that. So...any other s...",,,
5493,ham,The guy did some *****ing but I acted like i'd...,,,
5494,ham,Rofl. Its true to its name,,,
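The output above shows several `Unnamed` columns that look empty in the last five rows. One possible tidy-up, sketched here on a tiny made-up DataFrame with the same column layout, is to drop columns that are entirely empty and then count messages per label. The names `demo`, `tidy`, and `label_counts` are hypothetical, and whether the `Unnamed` columns are empty across the whole real dataset is an assumption that `dropna(how="all")` checks for you: it only drops a column when every value is missing.

```python
import pandas as pd

# Tiny made-up stand-in for df, using the same column names
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "v1": ["ham", "spam"],
    "v2": ["See you at lunch", "WINNER!! Claim your prize now"],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
    "Unnamed: 4": [None, None],
})

# Drop only the columns in which every value is missing
tidy = demo.dropna(axis="columns", how="all")

# Drop the saved index column (assumption: it just repeats the row index)
tidy = tidy.drop(columns=["Unnamed: 0"])

# Count how many messages carry each label
label_counts = tidy["v1"].value_counts()
print(tidy.columns.tolist())
print(label_counts)
```

Working on a copy like `tidy` leaves `df` untouched, so the later cells that rely on the `v1` and `v2` columns still run as written.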


## 🛡️ Step 2: Explore for Sensitive Data
Check for names, numbers, emails, or other personally identifiable information (PII).

In [3]:
# Use regex to identify PII patterns
import re

def find_phone_numbers(text):
  if isinstance(text, str):
    return re.findall(r'\d{11}', text)
    #return re.findall(r'\d{11}|\(?\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\)?', text)
  else:
    return []

df['phone_numbers'] = df['v2'].apply(find_phone_numbers)
df['phone_numbers'].head()

def find_emails(text):
  if isinstance(text, str):
    return re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
  else:
    return []

df['email_addresses'] = df['v2'].apply(find_emails)
df['email_addresses'].head()

#How many phone numbers did I find?
#total_phone_numbers = df['phone_numbers'].apply(lambda x: len(x)).sum()
#print(f"Total phone numbers found: {total_phone_numbers}")

#How many email addresses did I find?
total_email_addresses = df['email_addresses'].apply(lambda x: len(x)).sum()
print(f"Total email addresses found: {total_email_addresses}")

Total email addresses found: 7
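The pattern `\d{11}` matches any run of 11 digits, including the first 11 digits of a longer number. The sketch below compares it with a stricter variant that uses lookarounds to require that the 11 digits are not part of a longer digit run. The function name `find_phone_numbers_strict` and the sample text are made up for illustration, and treating every standalone 11-digit run as a phone number is itself an assumption about this dataset.

```python
import re

def find_phone_numbers_strict(text):
    """Return 11-digit runs that are not part of a longer digit run."""
    if not isinstance(text, str):
        return []
    # (?<!\d) = no digit just before; (?!\d) = no digit just after
    return re.findall(r'(?<!\d)\d{11}(?!\d)', text)

sample = "Call 07732584351 now or quote ref 123456789012345"

loose = re.findall(r'\d{11}', sample)       # pattern used in the cell above
strict = find_phone_numbers_strict(sample)  # stricter variant

print(loose)   # ['07732584351', '12345678901']
print(strict)  # ['07732584351']
```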


## 🔒 Step 3: Anonymize Sensitive Data
Use masking or redaction to protect the data. The cell below redacts 11-digit phone numbers by replacing each one with a `[PHONE]` placeholder.

In [5]:
#Anonymize the message
def anonymize_message(msg):
    #msg = re.sub(r'\b\d{11}\b', '[PHONE]', msg)
    msg = re.sub(r'\d{11}', '[PHONE]', msg)
    # Add other anonymization logic here
    return msg

df['v2_anonymized'] = df['v2'].apply(anonymize_message)
#df[['v2', 'v2_anonymized']].head()
# Filter for rows that had phone numbers and view the original and anonymized message
df[df['phone_numbers'].apply(lambda x: len(x) > 0)][['v2', 'v2_anonymized']].head()

Unnamed: 0,v2,v2_anonymized
2,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...
8,WINNER!! As a valued network customer you have...,WINNER!! As a valued network customer you have...
9,Had your mobile 11 months or more? U R entitle...,Had your mobile 11 months or more? U R entitle...
39,07732584351 - Rodger Burns - MSG = We tried to...,[PHONE] - Rodger Burns - MSG = We tried to cal...
51,Congrats! 1 year special cinema pass for 2 is ...,Congrats! 1 year special cinema pass for 2 is ...
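Step 2 also found email addresses, which the function above leaves in place. One way to fill in the "other anonymization logic" is sketched below: it reuses the email pattern from Step 2, replaces matches with an `[EMAIL]` placeholder, and passes non-string values (such as missing messages) through unchanged. The name `anonymize_message_extended`, the `[EMAIL]` placeholder, and the example message are assumptions made for this sketch.

```python
import re

# Same patterns as in Steps 2 and 3
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
PHONE_PATTERN = re.compile(r'\d{11}')

def anonymize_message_extended(msg):
    """Redact email addresses and 11-digit phone numbers in one message."""
    if not isinstance(msg, str):
        return msg  # leave missing or non-text values untouched
    # Emails first, so digits inside an address are not replaced separately
    msg = EMAIL_PATTERN.sub('[EMAIL]', msg)
    msg = PHONE_PATTERN.sub('[PHONE]', msg)
    return msg

example = "Txt 07732584351 or mail win@prizes.example.com today"
result = anonymize_message_extended(example)
print(result)  # Txt [PHONE] or mail [EMAIL] today
```

In the notebook, this could be applied the same way as before, for example with `df['v2'].apply(anonymize_message_extended)`.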


## 🤖 Step 4: Train a Spam Classifier
Use Scikit-learn to train a simple model on the anonymized messages.

In [10]:
#Import libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

#split the data
X_train, X_test, y_train, y_test = train_test_split(
    df['v2_anonymized'], df['v1'], test_size=0.2, random_state=42)

#X_train, X_test, y_train, y_test = train_test_split(
#    df['v2'], df['v1'], test_size=0.2, random_state=42)

#vectorize the data
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

#train the model
#model = LogisticRegression()
model = LogisticRegression(class_weight='balanced')
model.fit(X_train_vec, y_train)

# Ask the model to predict
y_pred = model.predict(X_test_vec)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       947
        spam       0.92      0.92      0.92       152

    accuracy                           0.98      1099
   macro avg       0.95      0.95      0.95      1099
weighted avg       0.98      0.98      0.98      1099
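The commented-out split on `df['v2']` hints at a useful experiment: train once on the raw text and once on the anonymized text, then compare scores. The sketch below wraps the Step 4 setup in a hypothetical helper, `evaluate`, and runs it on ten made-up messages so it is self-contained. The scores from such a tiny sample mean nothing on their own; in the notebook you would pass `df['v2']`, `df['v2_anonymized']`, and `df['v1']` instead.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate(texts, labels):
    """Train and score the Step 4 model on one version of the text."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42)
    vectorizer = TfidfVectorizer()
    model = LogisticRegression(class_weight='balanced')
    model.fit(vectorizer.fit_transform(X_train), y_train)
    return accuracy_score(y_test, model.predict(vectorizer.transform(X_test)))

# Made-up stand-in data (not from the real dataset)
raw = [
    "Call 07732584351 to claim your prize",
    "Win cash now, call 09061701461",
    "Free entry, text 08002986030 today",
    "Urgent! Ring 09066364589 for your reward",
    "Prize waiting, phone 08712300220",
    "See you after school",
    "Can you send the homework?",
    "Running late, sorry",
    "Lunch tomorrow?",
    "Thanks for the notes",
]
labels = ["spam"] * 5 + ["ham"] * 5
anonymized = [re.sub(r'\d{11}', '[PHONE]', m) for m in raw]

acc_raw = evaluate(raw, labels)
acc_anonymized = evaluate(anonymized, labels)
print(f"Raw text accuracy:        {acc_raw:.2f}")
print(f"Anonymized text accuracy: {acc_anonymized:.2f}")
```

One thing to keep in mind when reading the comparison: with the default `TfidfVectorizer` settings, every `[PHONE]` placeholder becomes the same token, `phone`, whereas each raw number is its own rare token.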



## 🧪 Step 5: Test the Classifier

In [11]:
# Sample messages for testing the classifier
sample_messages = [
    # Ham (not spam)
    "Hey, are we still meeting after school?",
    "Don’t forget to bring your science notebook!",
    "Happy birthday! Hope you have an amazing day",
    "Can you send me the homework?",
    "I’ll be there in 10 minutes.",

    # Spam (but safe-for-classroom)
    "Congratulations! You’ve won a $50 gift card. Reply YES to claim.",
    "Act fast! Your account has been selected for a reward.",
    "Click this link now to unlock your mystery prize.",
    "Final notice: Your warranty is about to expire. Renew now.",
    "Get 50% off your next order—limited time only!"
]

# Transform the messages with the same vectorizer used for training
sample_vectors = vectorizer.transform(sample_messages)

# Use the trained model to predict spam or ham
sample_predictions = model.predict(sample_vectors)

# Display the results
for msg, prediction in zip(sample_messages, sample_predictions):
    print(f"Prediction: {prediction} | Message: {msg}")

Prediction: ham | Message: Hey, are we still meeting after school?
Prediction: ham | Message: Don’t forget to bring your science notebook!
Prediction: ham | Message: Happy birthday! Hope you have an amazing day
Prediction: ham | Message: Can you send me the homework?
Prediction: ham | Message: I’ll be there in 10 minutes.
Prediction: spam | Message: Congratulations! You’ve won a $50 gift card. Reply YES to claim.
Prediction: spam | Message: Act fast! Your account has been selected for a reward.
Prediction: spam | Message: Click this link now to unlock your mystery prize.
Prediction: spam | Message: Final notice: Your warranty is about to expire. Renew now.
Prediction: spam | Message: Get 50% off your next order—limited time only!
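Because the model was trained on anonymized text, it is reasonable to anonymize new messages the same way before classifying them, and `predict_proba` can show how confident the model is rather than just the label. The sketch below is a self-contained illustration on made-up training data; `demo_model`, the training messages, and the test messages are all hypothetical and stand in for the notebook's `model`, `vectorizer`, and dataset.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def anonymize_message(msg):
    """Same phone redaction as Step 3."""
    return re.sub(r'\d{11}', '[PHONE]', msg)

# Made-up training data standing in for the real anonymized messages
train_texts = [
    "Claim your free prize now, call 09061701461",
    "You have won a reward, reply YES to claim",
    "Limited time offer, click the link now",
    "Are we still meeting after school?",
    "Can you send me the homework?",
    "I will be there in ten minutes",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# A pipeline keeps the vectorizer and classifier together
demo_model = make_pipeline(
    TfidfVectorizer(), LogisticRegression(class_weight='balanced'))
demo_model.fit([anonymize_message(t) for t in train_texts], train_labels)

# Anonymize new messages before asking for predictions
new_messages = ["Call 07732584351 to claim your prize", "See you after school"]
cleaned = [anonymize_message(m) for m in new_messages]
probabilities = demo_model.predict_proba(cleaned)

# Find which probability column belongs to the "spam" label
spam_index = list(demo_model.classes_).index("spam")
for msg, row in zip(cleaned, probabilities):
    print(f"P(spam) = {row[spam_index]:.2f} | Message: {msg}")
```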


# ➕ More Anonymization Techniques

In [1]:
#Generalization exercise
# Import libraries
import pandas as pd
import re

filename = 'https://raw.githubusercontent.com/msmekka/ncssm-summer25-cyber/refs/heads/main/Dummy_5000_Employee_Details_Dataset.csv'


# Load dataset (replace with actual file path or use Kaggle API)
gdf = pd.read_csv(filename, sep=',')
print(gdf.shape)
gdf.tail()

(5000, 19)


Unnamed: 0.1,Unnamed: 0,Name,Address,Salary,DOJ,DOB,Age,Sex,Dependents,HRA,DA,PF,Gross Salary,Insurance,Marital Status,In Company Years,Year of Experience,Department,Position
4995,4995,qWoLfJfL,"ZySbqPyI St, ngMINtRCsM, MUB 927128",129159.11,2015-01-27,1975-02-10,49,Female,2.0,8769.089,30482.806983,19157.030038,149253.975945,Medical,Divorced,9,28,IT,IT Manager
4996,4996,bSgpJtiF,"frtFLglE St, lTFYlKNUjF, HYD 805378",134664.03,2010-08-15,1972-12-18,51,Female,3.0,10036.597,31782.021679,19973.526201,156509.122477,Medical,Divorced,13,30,Sales,National Sales Manager
4997,4997,gToCBVUH,"EYXeRLkj St, VmMSAYuspF, HYD 735138",92852.65,2023-09-18,1985-09-23,38,Other,4.0,6651.735,21914.129075,13772.013489,107646.500586,Medical,Married,0,17,IT,QA Lead
4998,4998,dwTQETtZ,"ivLjcIWM St, qWGPFaIuvI, KOL 978537",17412.05,2023-06-02,2002-07-20,21,Other,0.0,6315.795,4109.41326,2582.575591,25254.682669,Both,Single,1,0,Human Resources,HR Representative
4999,4999,EWrIRxIJ,"vVrWcNkW St, NsOMLwTEmX, KOL 378650",91693.8,2021-06-09,1988-01-17,36,Female,3.0,4421.62,21640.629197,13600.131504,104155.917693,,Married,3,15,Marketing,Regional Marketing Manager
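Generalization can apply to numeric columns as well as text. As a minimal sketch, the hypothetical helper `age_to_band` below replaces an exact age with a ten-year range; the band width and the `Age_Band` column name mentioned afterwards are assumptions, not part of the original exercise. The sample ages are the five values from the `Age` column shown above.

```python
def age_to_band(age, width=10):
    """Map an exact age to a range such as '40-49'."""
    low = (int(age) // width) * width
    return f"{low}-{low + width - 1}"

# Ages taken from the last five rows shown above
ages = [49, 51, 38, 21, 36]
bands = [age_to_band(a) for a in ages]
print(bands)  # ['40-49', '50-59', '30-39', '20-29', '30-39']
```

On the DataFrame, this could be applied with `gdf['Age_Band'] = gdf['Age'].apply(age_to_band)`.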


## Generalize the Position

In [8]:

#Now let's categorize by title

# Assumes the DataFrame is already loaded and named 'gdf'
# and has a column named 'Position'.
# Replace 'Position' with the actual name of your title column if it's different.

# Define the list of words to search for
keywords = ['software', 'hr', 'executive', 'finance', 'marketing', 'sales', 'it', 'qa']

# Create a regex pattern to match any of the keywords
# Use re.IGNORECASE for case-insensitive matching
pattern = re.compile(r'\b(?:' + '|'.join(keywords) + r')\b', re.IGNORECASE)

# Find titles that include any of the specified words
# Use .str.contains() with the compiled pattern and handle potential NaN values
matching_titles_df = gdf[gdf['Position'].str.contains(pattern, na=False)].copy()

# Count occurrences of each matching title
matching_title_counts = matching_titles_df['Position'].value_counts()

# Calculate the total number of occurrences with matching titles
total_matching_occurrences = len(matching_titles_df)

# Convert the pandas Series of counts into a DataFrame
matching_title_counts_df = matching_title_counts.reset_index()
matching_title_counts_df.columns = ['Matching Title', 'Count']

# Sort the DataFrame by count in descending order
matching_title_counts_df = matching_title_counts_df.sort_values(by='Count', ascending=False)

print(f"Total number of occurrences with matching titles: {total_matching_occurrences}")

print("\nTable of Matching Title Occurrences:")
print(matching_title_counts_df)

# To save the table of matching titles and their counts to a CSV file
# matching_title_counts_df.to_csv('matching_title_counts.csv', index=False)

Total number of occurrences with matching titles: 3697

Table of Matching Title Occurrences:
                          Matching Title  Count
0                                QA Lead    198
1                  Software Engineer III    197
2                 National Sales Manager    196
3                       Senior Executive    195
4                              Senior HR    195
5                         Sales Director    187
6             National Marketing Manager    184
7             Senior Marketing Executive    184
8                 Regional Sales Manager    184
9             Regional Marketing Manager    175
10              Senior Account Executive    174
11                            IT Manager    164
12                           HR Director    163
13                    Marketing Director    146
14                          HR Associate    108
15                          HR Executive     98
16                   Marketing Associate     96
17                       Sales Executive   

## Update the dataset

In [9]:
#Generalization continued. Let's update the dataset to be more general

# Define a function to apply the generalization
def generalize_position(position):
    if isinstance(position, str):
        match = pattern.search(position)
        if match:
            # Get the matched keyword (in lowercase for consistency)
            keyword = match.group(0).lower()
            return f"{keyword.capitalize()} Professional"
    return position # Return original position if no keyword is found or it's not a string

# Apply the function to the 'Position' column to create a new generalized column
gdf['Generalized_Position'] = gdf['Position'].apply(generalize_position)

# Display the original and generalized positions for comparison (optional)
print(gdf[['Position', 'Generalized_Position']].tail())

                        Position    Generalized_Position
4995                  IT Manager         It Professional
4996      National Sales Manager      Sales Professional
4997                     QA Lead         Qa Professional
4998           HR Representative         Hr Professional
4999  Regional Marketing Manager  Marketing Professional
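In the output above, `capitalize()` turns the acronyms into "It", "Qa", and "Hr". A possible variant, sketched below, keeps a small set of acronyms in upper case. The names `ACRONYMS` and `generalize_position_acronyms` are introduced here for illustration; the keyword list and pattern are repeated from the earlier cell so the block runs on its own.

```python
import re

# Same keywords and pattern as in the "Generalize the Position" cell
keywords = ['software', 'hr', 'executive', 'finance', 'marketing', 'sales', 'it', 'qa']
pattern = re.compile(r'\b(?:' + '|'.join(keywords) + r')\b', re.IGNORECASE)

# Keywords that should stay fully upper case (assumption for this sketch)
ACRONYMS = {'hr', 'it', 'qa'}

def generalize_position_acronyms(position):
    """Generalize a job title, keeping acronyms such as IT in upper case."""
    if isinstance(position, str):
        match = pattern.search(position)
        if match:
            keyword = match.group(0).lower()
            label = keyword.upper() if keyword in ACRONYMS else keyword.capitalize()
            return f"{label} Professional"
    return position  # unchanged if no keyword matches or it is not a string

titles = ["IT Manager", "QA Lead", "HR Representative", "National Sales Manager"]
generalized = [generalize_position_acronyms(t) for t in titles]
print(generalized)
# ['IT Professional', 'QA Professional', 'HR Professional', 'Sales Professional']
```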


## 🧾 Step 6: Reflect & Document
Answer the following:
- What types of PII did you find and remove?
- How did you anonymize them?
- Did anonymization affect model accuracy?
- What would the risks be if the data were left unprotected?