# Logistic Regression

Logistic regression is a statistical method used for binary classification. The goal is to predict the probability that a given input belongs to one of two classes (e.g. 0 or 1).

## How does Logistic Regression work?

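1. A linear combination (weighted sum) of the inputs is calculated.
2. The linear combination is passed through a sigmoid function, which maps the result to a value between 0 and 1 that is interpreted as a probability.
3. To make a classification, a threshold is set (commonly 0.5): if the probability is above the threshold the prediction is 1, otherwise it is 0.
4. Training: the weights are fitted to labelled examples by minimizing a loss function, typically the log loss (cross-entropy).
5. Optimization: the loss is minimized with an iterative method such as gradient descent.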
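
The five steps above can be sketched in plain Python. This is a minimal illustration, not the code used later in this notebook. It assumes a tiny made-up one-feature dataset, a learning rate of 0.5, and 2000 gradient-descent iterations. The names `sigmoid`, `predict_proba`, `predict`, and `train` are illustrative only.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Steps 1-2: linear combination of the inputs, then the sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def predict(weights, bias, x, threshold=0.5):
    """Step 3: compare the probability with a threshold."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0

def train(X, y, lr=0.5, epochs=2000):
    """Steps 4-5: fit the weights by gradient descent on the mean log loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Derivative of the log loss with respect to the linear combination
            error = predict_proba(weights, bias, xi) - yi
            for j in range(n_features):
                grad_w[j] += error * xi[j]
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

# Toy data (made up): small values belong to class 0, large values to class 1
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

weights, bias = train(X, y)
predictions = [predict(weights, bias, xi) for xi in X]
print(predictions)
```

In practice a library implementation such as scikit-learn's `LogisticRegression` (used later in this notebook) replaces the hand-written loop with a more robust solver and adds regularization.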

## Advantages vs Disadvantages

**Advantages**

1. Simple to implement and train
2. Efficient (uses fewer computing resources than more complex models)
3. Easy to interpret

**Disadvantages**

1. Can perform poorly when the number of features is large relative to the number of training examples
2. Outliers can shift the decision boundary
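
The interpretability advantage comes from the model's form: each weight is the change in the log-odds of class 1 per unit increase in its feature, so `exp(weight)` can be read as an odds ratio. The sketch below uses made-up weights and hypothetical feature names purely for illustration. They do not come from the model trained in this notebook.

```python
import math

# Hypothetical fitted weights (illustrative only, not from a real model)
weights = {"contains_link": 1.2, "body_length_kb": -0.3}

# exp(weight) is the multiplicative change in the odds of class 1
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: math.exp(w) for name, w in weights.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of class 1)")
```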


References:

https://www.ahajournals.org/doi/full/10.1161/CIRCULATIONAHA.106.682658

https://ceur-ws.org/Vol-2124/paper_12.pdf

https://link.springer.com/book/10.1007/978-1-4419-1742-3



# Data Preprocessing

In [5]:
import pandas as pd

# UVIC Dataset
file_path = '../../datasets/CaptstoneProjectData_2024.csv'
uvicData = pd.read_csv(file_path)

# Remove unnecessary columns
uvicData_cleaned = uvicData.drop(columns=['Unnamed: 2', 'Unnamed: 3'], errors='ignore')

# Replace empty 'Subject' with space
uvicData_cleaned['Subject'] = uvicData_cleaned['Subject'].fillna(' ')

# Remove rows with missing 'Body' (assign back so the rows are actually dropped)
uvicData_cleaned = uvicData_cleaned.dropna(subset=['Body'])

# Normalize text: convert to lowercase, remove special characters, and trim whitespace
# (raw strings avoid the invalid-escape-sequence warning for \w and \s)
uvicData_cleaned['Subject'] = uvicData_cleaned['Subject'].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.strip()
uvicData_cleaned['Body'] = uvicData_cleaned['Body'].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.strip()

# Confirm cleaning
print(uvicData_cleaned.head())

                                             Subject  \
0  review your shipment details  shipment notific...   
1                            υоur ассоunt іѕ оn hоld   
2  completed invoice  kz89tys2564 frombestbuycom ...   
3                              uvic important notice   
4             you have 6 suspended incoming messages   

                                                Body  
0  notice this message was sent from outside the ...  
1  votre réponse a bien été prise en compte\r\nht...  
2  notice this message was sent from outside the ...  
3  your uvic account has been filed under the lis...  
4  message generated from  uvicca source\r\n\r\n\...  


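
To see what the normalization step does to individual strings, the same lowercase, regex, and strip sequence can be reproduced with the standard `re` module. This is a small sketch that mirrors the pandas calls above. The helper name `normalize_text` and the sample strings are made up for illustration.

```python
import re

def normalize_text(text):
    """Lowercase, delete characters that are neither word characters nor whitespace, then strip."""
    return re.sub(r'[^\w\s]', '', text.lower()).strip()

print(normalize_text("  Review your shipment details: Shipment #42!  "))
# -> review your shipment details shipment 42

print(normalize_text("UVic, Important-Notice"))
# -> uvic importantnotice
```

Two side effects are worth keeping in mind. Punctuation is deleted rather than replaced by a space, so tokens on either side of it are joined (compare `frombestbuycom` in the output above). And `\w` matches Unicode letters, digits, and the underscore, so accented or look-alike characters in subjects are kept.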


In [6]:
# Load the normal emails dataset
file_path = '../../datasets/emails.csv'

normData = pd.read_csv(file_path)
normData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517401 entries, 0 to 517400
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   file     517401 non-null  object
 1   message  517401 non-null  object
dtypes: object(2)
memory usage: 7.9+ MB


In [7]:
normData.head()

   file                      message
0  allen-p/_sent_mail/1.     Message-ID: <18782981.1075855378110.JavaMail.e...
1  allen-p/_sent_mail/10.    Message-ID: <15464986.1075855378456.JavaMail.e...
2  allen-p/_sent_mail/100.   Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...


In [8]:
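def parse_email(message):
    """Split a raw email into (subject, body) at the first blank line."""
    lines = message.split('\n')
    # Headers end at the first blank line; if there is none, treat every line as a header
    body_start = next((i for i, line in enumerate(lines) if line.strip() == ''), len(lines)) + 1
    # Look for the subject among the header lines only, not in the body
    subject = next((line.split(": ", 1)[1] for line in lines[:body_start] if line.lower().startswith('subject: ')), "")
    body = "\n".join(lines[body_start:])
    return subject, body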

# Apply the function to the 'message' column
normData[['Subject', 'Body']] = normData['message'].apply(lambda x: pd.Series(parse_email(x)))
normData.head()

   file                      message                                            Subject    Body
0  allen-p/_sent_mail/1.     Message-ID: <18782981.1075855378110.JavaMail.e...             Here is our forecast\n\n
1  allen-p/_sent_mail/10.    Message-ID: <15464986.1075855378456.JavaMail.e...  Re:        Traveling to have a business meeting takes the...
2  allen-p/_sent_mail/100.   Message-ID: <24216240.1075855687451.JavaMail.e...  Re: test   test successful. way to go!!!
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...             Randy,\n\n Can you send me a schedule of the s...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...  Re: Hello  Let's shoot for Tuesday at 11:45.
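
As an alternative to splitting lines by hand, Python's standard `email` package can parse the headers and body. The sketch below is not what the notebook uses. It assumes single-part, plain-text messages, and the function name `parse_email_stdlib` and the sample message are made up for illustration.

```python
from email import message_from_string

def parse_email_stdlib(message):
    """Return (subject, body) using the standard library's email parser."""
    msg = message_from_string(message)
    subject = msg.get('Subject', '')
    # get_payload() returns a list of sub-messages for multipart mail;
    # this sketch only handles single-part, plain-text messages
    body = msg.get_payload() if not msg.is_multipart() else ''
    return subject, body

# Made-up sample in the same header/blank-line/body layout as the dataset
raw = (
    "Message-ID: <1.example@host>\n"
    "From: alice@example.com\n"
    "Subject: Re: test\n"
    "\n"
    "test successful.  way to go!!!\n"
)

subject, body = parse_email_stdlib(raw)
print(repr(subject))
print(repr(body))
```

The parser also handles header details that a line-based split does not, such as header values folded across several lines.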


In [9]:
normData['Subject'] = normData['Subject'].fillna(' ')
normData = normData.dropna(subset=['Body'])
normData = normData.drop(columns=['file', 'message'], errors='ignore')
# Normalize text: convert to lowercase, remove special characters, and trim whitespace
# (raw strings avoid the invalid-escape-sequence warning for \w and \s)
normData['Subject'] = normData['Subject'].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.strip()
normData['Body'] = normData['Body'].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.strip()

# Showing the updated DataFrame with subject and body columns
normData.head()



   Subject   Body
0            here is our forecast
1  re        traveling to have a business meeting takes the...
2  re test   test successful way to go
3            randy\n\n can you send me a schedule of the sa...
4  re hello  lets shoot for tuesday at 1145


In [10]:
# Label the UVic emails as 1 and the normal emails as 0
uvicData_cleaned['label'] = 1
normData['label'] = 0

masterData = pd.concat([uvicData_cleaned, normData], ignore_index=True)
masterData.head()

   Subject                                           Body                                               label
0  review your shipment details shipment notific...  notice this message was sent from outside the ...  1
1  υоur ассоunt іѕ оn hоld                           votre réponse a bien été prise en compte\r\nht...  1
2  completed invoice kz89tys2564 frombestbuycom ...  notice this message was sent from outside the ...  1
3  uvic important notice                             your uvic account has been filed under the lis...  1
4  you have 6 suspended incoming messages            message generated from uvicca source\r\n\r\n\...   1
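
After concatenating the two sources it is worth checking how the labels are distributed, since the normal-email set alone has 517,401 rows. The sketch below shows one way to do that with `value_counts`. The two toy frames are made-up stand-ins for `uvicData_cleaned` and `normData`.

```python
import pandas as pd

# Toy stand-ins for the two labelled frames (contents are made up)
positives = pd.DataFrame({'Subject': ['a'], 'Body': ['x'], 'label': 1})
negatives = pd.DataFrame({'Subject': ['b', 'c', 'd'], 'Body': ['y', 'z', 'w'], 'label': 0})

combined = pd.concat([positives, negatives], ignore_index=True)

# Absolute and relative class frequencies
counts = combined['label'].value_counts()
shares = combined['label'].value_counts(normalize=True)

print(counts)
print(shares)
```

A heavily skewed distribution is a reason to prefer metrics such as balanced accuracy or AUC over plain accuracy, as is done in the evaluation later in this notebook.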


In [11]:
masterData.loc[232, 'label']

1

In [12]:
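# Feature Engineering: Length of the email body

def add_body_length(df):
    """Add a 'Body_Length' column with the character count of each body (pd.NA for non-string values)."""
    df['Body_Length'] = df['Body'].apply(lambda x: len(x) if isinstance(x, str) else pd.NA)
    return df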

masterData = add_body_length(masterData.copy())
masterData.head()

   Subject                                           Body                                               label  Body_Length
0  review your shipment details shipment notific...  notice this message was sent from outside the ...  1      890
1  υоur ассоunt іѕ оn hоld                           votre réponse a bien été prise en compte\r\nht...  1      1235
2  completed invoice kz89tys2564 frombestbuycom ...  notice this message was sent from outside the ...  1      3024
3  uvic important notice                             your uvic account has been filed under the lis...  1      528
4  you have 6 suspended incoming messages            message generated from uvicca source\r\n\r\n\...   1      1234


In [None]:
# save final csv
masterData.to_csv('capstone_dataset_final.csv', index=False)  # use index=True to also write the row index

# create a download link in jupyter
from IPython.display import FileLink
FileLink(r'capstone_dataset_final.csv')
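
One thing to watch when the saved CSV is loaded again: subjects that became empty strings during normalization are written as empty fields, and `pd.read_csv` reads empty fields back as NaN by default. The sketch below demonstrates this on a made-up two-row frame, using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Made-up frame with one empty subject
frame = pd.DataFrame({
    'Subject': ['', 're test'],
    'Body': ['here is our forecast', 'test successful way to go'],
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default parsing turns the empty subject into NaN
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded['Subject'].isna().tolist())

# keep_default_na=False keeps empty fields as empty strings
buffer.seek(0)
reloaded_as_text = pd.read_csv(buffer, keep_default_na=False)
print(reloaded_as_text['Subject'].tolist())
```

Either option works as long as it is applied consistently. The modelling cell in the next section takes the other route and calls `fillna('')` on both text columns.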

# Logistic Regression Code & Results

In [25]:
# importing the appropriate modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix, balanced_accuracy_score
from sklearn.pipeline import Pipeline

# using the combined dataset built above (copy so masterData is left unchanged)
df = masterData.copy()

# combining the subject and body into a single text feature for better context
df['text'] = df['Subject'].fillna('') + " " + df['Body'].fillna('')  # Fill NaN with empty string

# defining the feature and target variable
X = df['text']
y = df['label']

# splitting the data into training and test sets (sequester 10% of the data for final validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# creating a pipeline with TF-IDF Vectorizer and Logistic Regression
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(random_state=42))
])

# training the model
pipeline.fit(X_train, y_train)

# predicting on the test data
y_pred = pipeline.predict(X_test)
y_pred_prob = pipeline.predict_proba(X_test)[:, 1]

# evaluating the model
auc = roc_auc_score(y_test, y_pred_prob)
cm = confusion_matrix(y_test, y_pred)
balanced_acc = balanced_accuracy_score(y_test, y_pred)

print("Confusion Matrix:")
print(cm)
print("\nAUC Score:", auc)
print("Balanced Accuracy Score:", balanced_acc)


Confusion Matrix:
[[51724     3]
 [   76   195]]

AUC Score: 0.9985969484842256
Balanced Accuracy Score: 0.8597495993905557
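
The gap between the AUC and the balanced accuracy can be read directly off the confusion matrix. The short calculation below uses only the four counts printed above and follows scikit-learn's layout, in which rows are true labels and columns are predicted labels. The variable names are illustrative.

```python
# Confusion matrix from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1)
tn, fp = 51724, 3
fn, tp = 76, 195

recall_pos = tp / (tp + fn)       # share of label-1 emails that were detected
recall_neg = tn / (tn + fp)       # share of label-0 emails that were correctly passed
precision_pos = tp / (tp + fp)    # share of label-1 predictions that were correct

# Balanced accuracy is the mean of the per-class recalls
balanced_accuracy = (recall_pos + recall_neg) / 2

print(f"Recall (label 1):    {recall_pos:.4f}")
print(f"Recall (label 0):    {recall_neg:.4f}")
print(f"Precision (label 1): {precision_pos:.4f}")
print(f"Balanced accuracy:   {balanced_accuracy:.4f}")
```

At the default 0.5 threshold the model misses 76 of the 271 label-1 emails in the test set, which is what pulls the balanced accuracy well below the AUC. Options worth trying, none of which were run here, include lowering the decision threshold, passing `class_weight='balanced'` to `LogisticRegression`, and passing `stratify=y` to `train_test_split`.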
