In [1]:
from pathlib import Path
import numpy as np
import pandas as pd
import urllib.request
import tarfile

def fetch_spam_data():
    spam_root = "https://spamassassin.apache.org/old/publiccorpus/"
    fileinfos = [
        ("easy_ham_2", "20030228_easy_ham_2.tar.bz2"),
        ("hard_ham", "20030228_hard_ham.tar.bz2"),
        ("spam_2", "20050311_spam_2.tar.bz2"),
    ]

    spam_path = Path() / "datasets" / "spam"
    spam_path.mkdir(parents=True, exist_ok=True)

    for folder_name, tar_filename in fileinfos:
        if not (spam_path / folder_name).is_dir():
            url = spam_root + tar_filename
            path = spam_path / tar_filename
            print("Downloading", path.name)
            urllib.request.urlretrieve(url, path)
            with tarfile.open(path) as tar:
                tar.extractall(path=spam_path)

    return [spam_path / folder_name for folder_name, _ in fileinfos]


In [2]:
easy_ham_dir, hard_ham_dir, spam_dir = fetch_spam_data()

In [3]:
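# Keep only the message files, which have long hash-style names;
# short-named entries in the extracted folders are not emails.
easy_ham_filenames = [f for f in sorted(easy_ham_dir.iterdir()) if len(f.name) > 20]
hard_ham_filenames = [f for f in sorted(hard_ham_dir.iterdir()) if len(f.name) > 20]
spam_filenames = [f for f in sorted(spam_dir.iterdir()) if len(f.name) > 20]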

In [4]:
len(easy_ham_filenames)

1400

In [5]:
len(hard_ham_filenames)

250

In [6]:
len(spam_filenames)

1396

In [7]:
import email
import email.parser  # imported explicitly; load_email uses email.parser.BytesParser
import email.policy

def load_email(filepath):
    with open(filepath, "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

In [8]:
easy_ham_emails = [load_email(filepath) for filepath in easy_ham_filenames]
hard_ham_emails = [load_email(filepath) for filepath in hard_ham_filenames]
spam_emails = [load_email(filepath) for filepath in spam_filenames]


In [16]:
ham_emails = easy_ham_emails + hard_ham_emails
# Combine easy_ham and hard_ham into a single ham list so the classifier
# also trains on the harder ham examples instead of underfitting them.

In [17]:
print(ham_emails[10].get_content().strip())

Hi!

Is there a command to insert the signature using a combination of keys and not 
to have sent the mail to insert it then?

Regards,
Ulises




_______________________________________________
Exmh-users mailing list
Exmh-users@redhat.com
https://listman.redhat.com/mailman/listinfo/exmh-users


In [18]:
import random

# shuffles the list in place
random.shuffle(ham_emails)

In [19]:
print(ham_emails[10].get_content().strip())

I think there's a link to all the details on welcomehome.org, but 
basically it was told to me this way: govt wanted land in arizona or someplace, 
indians said no, govt said 'okay, u know that land in MS u want?  we will 
fux0r it, then.'  indians still said no, govt fux0red it.
C

On Wed, 31 Jul 2002, Elias Sinderson wrote:

> Heh. Never mind the perfectly good desert in the southwest, right? Or is 
> that area too hot for desert warfare training? I'll never understand.
> 
> E
> 
> CDale wrote:
> 
> >Okay, I'll ammend that to LIVE OLD tree saving, like the thousands of 
> >acres of virgin pine forest that was razed here in MS so that our military 
> >can practice desert warfare?  Fought it for years, lost, now there is a 
> >stand of trees up and down Hwy 49 that's supposed to try to hide the fact 
> >that a huge portion of the Desoto Ntl. Forest is gone.
> >
> 
> 
> http://xent.com/mailman/listinfo/fork
> 

-- 
"My theology, briefly, is that the universe was dictated but not
       

In [20]:
print(ham_emails[1].get_content().strip())

Halloechen!

If I create an RPM according to one of the how-to's with having
Red Hat in mind, how big are my chances that it will also work
for the SuSE distribution, or others?  (I don't know how many
base on the RPM system.)

Or what must I pay attention to when creating an RPM that should
work with the big distributions?

Tschoe,
Torsten.

_______________________________________________
RPM-List mailing list <RPM-List@freshrpms.net>
http://lists.freshrpms.net/mailman/listinfo/rpm-list


In [21]:
print(spam_emails[6].get_content().strip())

NEW PRODUCT ANNOUNCEMENT

From: OUTSOURCE ENG.& MFG. INC.


Sir/Madam;

This note is to inform you of new watchdog board technology for maintaining
continuous unattended operation of PC/Servers etc. that we have released for
distribution.
  
We are proud to announce Watchdog Control Center featuring MAM (Multiple
Applications Monitor) capability.
The key feature of this application enables you to monitor as many
applications as you
have resident on any computer as well as the operating system for
continuous unattended operation.  The Watchdog Control Center featuring
MAM capability expands third party application "control" of a Watchdog as
access to the application's
source code is no longer needed.

Here is how it all works:
Upon installation of the application and Watchdog, the user may select
many configuration options, based on their model of Watchdog, to fit their
operational needs.  If the MAM feature is enabled, the user may select any
executable program that they wish for monit

In [25]:
print(spam_emails[10].get_content().strip())

Yes we do purchase uncollected Judicial Judgements!!!            st10                           .           

If you, your company or an acquaintance have an uncollected Judicial Judgement then please call us and find out how we can help you receive the money that the court states you are rightfully due.

We have strong interest in acquiring uncollected Judicial Judgements in your City and Area.

J T C is the largest firm in the world specializing in the purchase and collection of Judicial Judgements.

Currently we are processing over 455 million dollars worth of judgements in the United States alone. We have associate offices in virtually every city in the US and in most foreign countries.

You have nothing to lose and everything to gain by calling. There is absolutely no cost to you.

We can be reached Toll free at 1-888-557-5744. in the US or if you are in Canada call 1-310-842-3521. You can call 24 hours per day.

Thank you for your time.







++++++++++++++++++++++++++++++++++++

In [31]:
def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        multipart = ", ".join([get_email_structure(sub_email)
                              for sub_email in payload])
        return f"multipart({multipart})"
    else:
        return email.get_content_type()
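
A quick way to sanity-check the structure helper without the downloaded corpus is to build a small synthetic message and round-trip it through bytes, the way load_email reads a file. This is only a sketch: the message content and the name `describe_structure` are made up for illustration, and the function body simply mirrors get_email_structure above.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def describe_structure(message):
    """Return a nested description of a message's MIME content types."""
    payload = message.get_payload()
    if isinstance(payload, list):
        parts = ", ".join(describe_structure(part) for part in payload)
        return f"multipart({parts})"
    return message.get_content_type()


# Synthetic two-part message: plain text plus an HTML alternative.
toy = EmailMessage()
toy["Subject"] = "Toy message"
toy.set_content("Hello in plain text")
toy.add_alternative("<p>Hello in HTML</p>", subtype="html")

# Round-trip through bytes, as happens when an email is parsed from disk.
parsed = BytesParser(policy=policy.default).parsebytes(toy.as_bytes())

structure = describe_structure(parsed)
print(structure)
```

The printed string has the same shape as the keys counted by structures_counter, for example 'multipart(text/plain, text/html)'.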

In [32]:
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [33]:
len(ham_emails)

1650

In [34]:
len(spam_emails)

1396

In [35]:
structures_counter(ham_emails).most_common()

[('text/plain', 1424),
 ('text/html', 120),
 ('multipart(text/plain, text/html)', 55),
 ('multipart(text/plain, application/pgp-signature)', 35),
 ('multipart(text/html)', 2),
 ('multipart(text/plain, image/bmp)', 1),
 ('multipart(multipart(text/plain, text/html), image/jpeg, image/gif, image/gif, image/gif, image/gif)',
  1),
 ('multipart(multipart(text/plain, multipart(text/plain), text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, application/x-patch)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(text/plain, text/plain)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/jpeg, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif)',
  1),
 ('multipart(text/plain, application/ms-tnef)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(multipart(text/plain, text/html))

In [36]:
structures_counter(spam_emails).most_common()

[('text/plain', 597),
 ('text/html', 589),
 ('multipart(text/plain, text/html)', 114),
 ('multipart(text/html)', 29),
 ('multipart(text/plain)', 25),
 ('multipart(multipart(text/html))', 18),
 ('multipart(multipart(text/plain, text/html))', 5),
 ('multipart(text/plain, application/octet-stream, text/plain)', 3),
 ('multipart(text/html, text/plain)', 2),
 ('multipart(text/html, image/jpeg)', 2),
 ('multipart(multipart(text/plain), application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(multipart(text/plain, text/html), image/jpeg, image/jpeg, image/jpeg, image/jpeg, image/jpeg)',
  1),
 ('multipart(multipart(text/plain, text/html), image/jpeg, image/jpeg, image/jpeg, image/jpeg, image/gif)',
  1),
 ('text/plain charset=us-ascii', 1),
 ('multipart(multipart(text/html), image/gif)', 1),
 ('multipart(multipart(text/plain, text/html), application/octet-stream, application/octet-stream, applic

In [37]:
spam_emails[0]["Subject"]

'[ILUG] STOP THE MLM INSANITY'

In [38]:
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=42)

In [39]:
import re
from html import unescape

def html_to_plain_text(html):
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
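
To see what each regex pass does, it can help to run the same steps on a tiny hand-written HTML string before trying a real spam email. The snippet below is a sketch: `strip_html` repeats the four substitutions from html_to_plain_text so that it runs on its own, and the sample HTML is invented.

```python
import re
from html import unescape


def strip_html(html):
    """Same four regex passes as html_to_plain_text, repeated so this runs alone."""
    # 1. Drop the whole <head> section (title, styles, links).
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    # 2. Replace every opening anchor tag with the token HYPERLINK.
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    # 3. Remove all remaining tags.
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    # 4. Collapse runs of blank lines into a single newline.
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    # Finally, turn HTML entities such as &amp; back into characters.
    return unescape(text)


toy_html = (
    "<html><head><title>Ignored</title></head>"
    "<body><p>Save 50&#37; &amp; more</p>"
    '<a href="http://example.com">click</a></body></html>'
)
plain = strip_html(toy_html)
print(plain)
```

The title inside the head is dropped, the link target is replaced by the HYPERLINK token while the link text is kept, and the entities are decoded last.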

In [40]:
html_spam_emails = [email for email in X_train[y_train==1]
                   if get_email_structure(email) == "text/html"]
sample_html_spam = html_spam_emails[7]
print(sample_html_spam.get_content().strip()[:1000], "...")

<html>
<head>
<title>Web Letter</title>
<LINK REL="stylesheet" HREF="http://www.sancristobalsa.net/letter/styles.css" TYPE="text/css"> 
<style>
<!--
body
{
background-image:
url(http://www.sancristobalsa.net/letter/images/bg.gif);
background-repeat: repeat-y;
background-position: top left
}
//-->
</style>
</head>
<body bgcolor="#FFFFFF" marginwidth="0" marginheight="0" topmargin="0" leftmargin="0">
<table align="left" cellpadding="0" cellspacing="0" border="0" width="760">
 <tr>
  <td colspan="2"><img src="http://www.sancristobalsa.net/letter/images/header.jpg" width="760" height="90" border="0" alt=""></td>
 </tr>
 <tr>
  <td align="left" width="190" valign="top"><br>
<table align="left" cellpadding="0" cellspacing="0" border="0" width="190">
 <tr><td><img src="http://www.sancristobalsa.net/letter/images/set1.jpg" width="190" height="400" border="0" alt=""></td></tr>
 <tr><td><img src="http://www.sancristobalsa.net/letter/images/set2.jpg" width="190" height="360" border="0" alt=""></t

In [41]:
print(html_to_plain_text(sample_html_spam.get_content())[:1000], "...")


                Tremendous Investment Opportunity!
                Are you are fed up with the manipulation and erratic performance of the stock market?  Like most people, are you looking for a stable investment that can provide a 25%-35% tax-free annual cash flow return?  If you answered yes to either question, I have an exciting offering for you.
     I represent a company that has a limited offering of emerging growth Caribbean Real Estate that enjoins a tropical working farm, which provides substantial cash flow, long-term wealth accumulation, and the proven appreciation of Caribbean real estate.
                Here are the details:
                                             A 20-acre tract of waterfront Caribbean real estate is only $106,000 (including closing costs and dual residency status in a tax-advantaged country).
    The investment will be totally private and protected from lawsuits.
                                                 The plantation has a proven track rec

In [45]:
def email_to_text(email):
    # Return the first text/plain part if there is one; otherwise fall back to
    # the last text/html part, converted to plain text.
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset()
            if charset is None:
                charset = "utf-8"
            content = payload.decode(charset, errors="replace")
        except Exception as e:
            # e.g. unknown charset labels end up here; keep the raw payload text
            print(f"Decoding error: {e}")
            content = str(part.get_payload())

        if ctype == "text/plain":
            return content.strip()
        else:
            html = content

    if html:
        return html_to_plain_text(html)
    return ""
            

In [46]:
print(email_to_text(sample_html_spam)[:100], "...")


                Tremendous Investment Opportunity!
                Are you are fed up with the mani ...


In [47]:
import nltk

stemmer = nltk.PorterStemmer()
for word in ("Computations", "Computation", "Computing", "Computed", "Compute", 
            "Compulsive"):
    print(word, "=>", stemmer.stem(word))

Computations => comput
Computation => comput
Computing => comput
Computed => comput
Compute => comput
Compulsive => compuls


In [48]:
import urlextract

url_extractor = urlextract.URLExtract()
some_text = "Will it detect github.com and https://youtu.be/7Pq-S557XQU?t=3m32s"
print(url_extractor.find_urls(some_text))


['github.com', 'https://youtu.be/7Pq-S557XQU?t=3m32s']


In [49]:
from sklearn.base import BaseEstimator, TransformerMixin

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True,
                remove_punctuation=True, replace_urls=True,
                replace_numbers=True, stemming=True):
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""
            if self.lower_case:
                text = text.lower()
            if self.replace_urls and url_extractor is not None:
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            word_counts = Counter(text.split())
            if self.stemming and stemmer is not None:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [50]:
X_few = X_train[:3]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts

array([Counter({'i': 14, 'the': 9, 'to': 6, 'firewal': 5, 'a': 5, 'linux': 5, 'so': 4, 'connect': 4, 'have': 4, 'at': 3, 'my': 3, 'usb': 3, 'and': 3, 'be': 3, 'of': 3, 'adsl': 3, 'what': 3, 'm': 2, 'configur': 2, 'thi': 2, 'that': 2, 'box': 2, 'one': 2, 've': 2, 'as': 2, 'do': 2, 'nat': 2, 'am': 2, 'need': 2, 'router': 2, 'option': 2, 'look': 2, 'are': 2, 'modem': 2, 'machin': 2, 'would': 2, 'ie': 2, 'moment': 1, 'still': 1, 'found': 1, 'out': 1, 'ha': 1, 'dodgi': 1, 'await': 1, 'pci': 1, 'control': 1, 'finish': 1, 'meant': 1, 'buy': 1, 'yesterday': 1, 'forgot': 1, 'bought': 1, 'smoothwal': 1, 'corpor': 1, 'edit': 1, 'it': 1, 'best': 1, 'rout': 1, 'solut': 1, 'around': 1, 'or': 1, 'been': 1, 'led': 1, 'believ': 1, 'will': 1, 'proxi': 1, 'mayb': 1, 'vpn': 1, 'too': 1, 'but': 1, 'we': 1, 'shall': 1, 'see': 1, 'cw': 1, 'hi': 1, 'all': 1, 'serious': 1, 'think': 1, 'get': 1, 'solo': 1, 'from': 1, 'eircom': 1, 'also': 1, 'more': 1, 'than': 1, 'comput': 1, 'therefor': 1, 'can': 1, 'them': 1, 

In [52]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size


    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        # Indices start at 1; column 0 is reserved for out-of-vocabulary words
        self.vocabulary_ = {word: index + 1
                           for index, (word, count) in enumerate(most_common)}
        return self

    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []

        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)),
                         shape=(len(X), self.vocabulary_size + 1))

In [53]:
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 21 stored elements and shape (3, 11)>

In [54]:
X_few_vectors.toarray()

array([[150,   9,   1,  14,   6,   5,   3,   5,   5,   4,   3],
       [ 60,   2,   1,   0,   1,   0,   3,   0,   0,   0,   1],
       [ 38,   1,  62,   0,   0,   1,   0,   0,   0,   0,   0]])

In [55]:
vocab_transformer.vocabulary_

{'the': 1,
 'number': 2,
 'i': 3,
 'to': 4,
 'a': 5,
 'and': 6,
 'firewal': 7,
 'linux': 8,
 'so': 9,
 'of': 10}
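
The large values in column 0 of X_few_vectors (150, 60, 38) come from words that are not in the vocabulary: they all map to index 0, and when a csr_matrix is built from (data, (rows, cols)) triplets, duplicate (row, col) entries are summed. The sketch below shows the same mapping with plain Python lists instead of a sparse matrix; the toy vocabulary and counts are made up for illustration.

```python
from collections import Counter

# Hypothetical toy vocabulary: index 0 is reserved for out-of-vocabulary words.
vocabulary = {"the": 1, "number": 2, "i": 3}
word_counts = [
    Counter({"the": 2, "firewal": 5, "linux": 4}),
    Counter({"number": 7, "i": 1, "adsl": 3}),
]

# Dense stand-in for the sparse matrix: one row per email, one column per index.
n_cols = len(vocabulary) + 1
dense = [[0] * n_cols for _ in word_counts]
for row, counts in enumerate(word_counts):
    for word, count in counts.items():
        # Unknown words all map to column 0, so their counts accumulate there.
        dense[row][vocabulary.get(word, 0)] += count

print(dense)
```

In the first row, "firewal" and "linux" are both unknown to this toy vocabulary, so their counts (5 and 4) add up in column 0.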

In [56]:
from sklearn.pipeline import Pipeline

preprocess_pipeline_vec = Pipeline([
    ("email_to_wordcount", EmailToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer())
])
X_train_transform_vec = preprocess_pipeline_vec.fit_transform(X_train)

Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: gb2312_charset
Decoding error: unknown encoding: default
Decoding error: unknown encoding: chinesebig5
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
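
The "unknown encoding" messages above come from emails that declare a charset label Python has no codec for (for example "default" or "default_charset"). One possible way to handle them quietly is to check the label first and fall back to UTF-8. This is a sketch, not what the notebook does: `decode_with_fallback` is a hypothetical helper, and labels such as "gb2312_charset" would still fall back to UTF-8 here rather than being mapped to a real codec.

```python
import codecs


def decode_with_fallback(payload, charset, fallback="utf-8"):
    """Decode bytes, falling back when the declared charset is not a known codec."""
    encoding = charset or fallback
    try:
        codecs.lookup(encoding)
    except LookupError:
        # Labels such as "default" or "default_charset" are not real codecs.
        encoding = fallback
    # errors="replace" keeps going on undecodable bytes instead of raising.
    return payload.decode(encoding, errors="replace")


unknown_label = decode_with_fallback(b"caf\xc3\xa9", "default_charset")
known_label = decode_with_fallback(b"caf\xe9", "latin-1")
print(unknown_label, known_label)
```

Inside email_to_text, a helper like this could replace the direct payload.decode(charset, ...) call, so the except branch would only be reached for other failures.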


In [57]:
class EmailToTextTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_transformed = []
        for email_obj in X:
            text = email_to_text(email_obj) or ""
            X_transformed.append(text)
        return np.array(X_transformed)

In [71]:
from sklearn.feature_extraction.text import TfidfVectorizer

preprocessing_pipeline_tf = Pipeline([
    ("email_to_text", EmailToTextTransformer()),
    ("tfidf", TfidfVectorizer(
        max_features=1000,
        strip_accents="unicode",
        lowercase=True,
        stop_words="english",
        ngram_range=(1, 2)
    ))
])

X_train_transform_tf = preprocessing_pipeline_tf.fit_transform(X_train)

Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: gb2312_charset
Decoding error: unknown encoding: default
Decoding error: unknown encoding: chinesebig5
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset


In [62]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(max_iter=1000, random_state=42)
score_vec = cross_val_score(log_clf, X_train_transform_vec, y_train, cv=3)
score_vec.mean()

# checking the vec-transformed dataset with LogReg, without using GridSearch

np.float64(0.9683908045977011)

In [72]:
score_tf = cross_val_score(log_clf, X_train_transform_tf, y_train, cv=3)
score_tf.mean()

np.float64(0.9634646962233169)

In [64]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# build a pipeline: scaling + SVM
svm_pipeline = Pipeline([
    ("scaler", StandardScaler(with_mean=False)), # important: with_mean=False because sparse matrix
    ("svm_clf", SVC())
])

param_grid = {
    "svm_clf__C": [0.1, 1, 10],
    "svm_clf__kernel": ["linear", "rbf"],
    "svm_clf__gamma": ["scale", "auto"]
}

grid_search = GridSearchCV(svm_pipeline, param_grid, cv=3, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train_transform_vec, y_train)

svm_model = grid_search.best_estimator_

score_svm = cross_val_score(svm_model, X_train_transform_vec, y_train, cv=3)
score_svm.mean()


np.float64(0.951559934318555)

In [65]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression  # Import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# Build a pipeline: scaling + Logistic Regression
log_reg_pipeline = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),  # Important: with_mean=False because sparse matrix
    ("log_reg_clf", LogisticRegression(solver='liblinear'))  # Logistic Regression classifier
])

# Define parameter grid for GridSearchCV
param_grid = {
    "log_reg_clf__C": [0.1, 1, 10],  # Regularization parameter for Logistic Regression
    "log_reg_clf__penalty": ["l2"],  # Regularization type (L2 is the default)
    "log_reg_clf__solver": ["liblinear", "saga"]  # Solvers for optimization
}

# Set up GridSearchCV
grid_search = GridSearchCV(log_reg_pipeline, param_grid, cv=3, scoring="accuracy", n_jobs=-1)

# Fit the model to the training data
grid_search.fit(X_train_transform_vec, y_train)

# Get the best model
log_reg_model = grid_search.best_estimator_

# Evaluate the model using cross-validation
score_log_reg = cross_val_score(log_reg_model, X_train_transform_vec, y_train, cv=3)
print("Logistic Regression Accuracy (CV Mean):", score_log_reg.mean())


Logistic Regression Accuracy (CV Mean): 0.9675697865353038


In [68]:
from sklearn.metrics import precision_score, recall_score

X_test_transform_vec = preprocess_pipeline_vec.transform(X_test)

y_pred_vec = log_reg_model.predict(X_test_transform_vec)

print(f"Precision: {precision_score(y_test, y_pred_vec):.2%}")
print(f"Recall: {recall_score(y_test, y_pred_vec):.2%}")

Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Precision: 93.01%
Recall: 95.83%


In [74]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression  # Import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# Build a pipeline: scaling + Logistic Regression
log_reg_pipeline = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),  # Important: with_mean=False because sparse matrix
    ("log_reg_clf", LogisticRegression(solver='liblinear', max_iter=1000))  # Logistic Regression classifier
])

# Define parameter grid for GridSearchCV
param_grid = {
    "log_reg_clf__C": [0.1, 1, 10],  # Regularization parameter for Logistic Regression
    "log_reg_clf__penalty": ["l2"],  # Regularization type (L2 is the default)
    "log_reg_clf__solver": ["liblinear", "saga"]  # Solvers for optimization
}

# Set up GridSearchCV
grid_search = GridSearchCV(log_reg_pipeline, param_grid, cv=3, scoring="accuracy", n_jobs=-1)

# Fit the model to the training data
grid_search.fit(X_train_transform_tf, y_train)

# Get the best model
log_reg_model = grid_search.best_estimator_

# Evaluate the model using cross-validation
score_log_reg = cross_val_score(log_reg_model, X_train_transform_tf, y_train, cv=3)
print("Logistic Regression Accuracy (CV Mean):", score_log_reg.mean())


Logistic Regression Accuracy (CV Mean): 0.972495894909688


In [77]:
from sklearn.metrics import precision_score, recall_score

X_test_transform_tf = preprocessing_pipeline_tf.transform(X_test)

# Predictions from the model trained on the TF-IDF features
y_pred_tf = log_reg_model.predict(X_test_transform_tf)

print(f"Precision: {precision_score(y_test, y_pred_tf):.2%}")
print(f"Recall: {recall_score(y_test, y_pred_tf):.2%}")

Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Precision: 95.80%
Recall: 95.08%


In [88]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Logistic Regression Pipeline
log_reg_pipeline2 = Pipeline([
    ("log_reg", LogisticRegression(max_iter=1000, solver='saga', penalty='l2'))
])

# Random Forest Pipeline
rf_pipeline = Pipeline([
    ("rf_clf", RandomForestClassifier(n_estimators=100))
])


# Gradient Boosting Pipeline
gb_pipeline = Pipeline([
    ("gb_clf", GradientBoostingClassifier(n_estimators=100))
])

In [89]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[
        ("log_reg", log_reg_pipeline2),
        ("rf", rf_pipeline),
        ("gb", gb_pipeline)
    ],
    voting="soft"
)

In [90]:
score_tf_ensm = cross_val_score(voting_clf, X_train_transform_tf, y_train, cv=3)
score_tf_ensm.mean()

np.float64(0.9626436781609197)

In [91]:
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "log_reg__log_reg__C": [0.1, 1, 10],
    "rf__rf_clf__n_estimators": [50, 100, 200],
    "gb__gb_clf__n_estimators": [50, 100, 200]
}

rnd_search = RandomizedSearchCV(voting_clf, param_distributions, n_iter=10, cv=3, 
                                scoring="accuracy", random_state=42)

rnd_search.fit(X_train_transform_tf, y_train)

best_voting_clf = rnd_search.best_estimator_

score_rnd = cross_val_score(best_voting_clf, X_train_transform_tf, y_train, cv=3)
score_rnd.mean()

np.float64(0.9704433497536945)

In [92]:
from sklearn.metrics import precision_score, recall_score

X_test_transform_tf = preprocessing_pipeline_tf.transform(X_test)

# Predictions from the tuned voting ensemble on the TF-IDF features
y_pred_voting = best_voting_clf.predict(X_test_transform_tf)

print(f"Precision: {precision_score(y_test, y_pred_voting):.2%}")
print(f"Recall: {recall_score(y_test, y_pred_voting):.2%}")

Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default_charset
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default
Decoding error: unknown encoding: default_charset
Precision: 93.01%
Recall: 95.83%


In [94]:
"""
Summary of the modelling work. The best scores came from the TF-IDF pipeline:
EmailToTextTransformer followed by TfidfVectorizer, fit on the ham and spam
emails in X_train. LogisticRegression tuned with GridSearchCV on those features
gave a mean cross-validation accuracy of 0.9725, with Precision: 95.80% and
Recall: 95.08% on the test set. The other models tried, including a soft-voting
ensemble of LogisticRegression, RandomForest and GradientBoosting, scored lower
on cross-validation accuracy, so this model (log_reg_model) was the best.
"""

score_log_reg = cross_val_score(log_reg_model, X_train_transform_tf, y_train, cv=3)
print("Logistic Regression Accuracy (CV Mean):", score_log_reg.mean())

y_pred_vec = log_reg_model.predict(X_test_transform_tf)

print(f"Precision: {precision_score(y_test, y_pred_vec):.2%}")
print(f"Recall: {recall_score(y_test, y_pred_vec):.2%}")

Logistic Regression Accuracy (CV Mean): 0.972495894909688
Precision: 95.80%
Recall: 95.08%
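
Since the two TF-IDF models trade precision against recall in opposite directions, a single combined number can make the comparison easier. The sketch below computes the F1 score (the harmonic mean of precision and recall) from the test-set figures printed above; the values are copied by hand from those outputs, and `f1_score_from` is a hypothetical helper, not part of the notebook.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Test-set figures copied from the printed outputs above (In [77] and In [92]).
reported = {
    "log_reg_model (TF-IDF)": (0.9580, 0.9508),
    "best_voting_clf (TF-IDF)": (0.9301, 0.9583),
}

f1_by_model = {
    name: f1_score_from(precision, recall)
    for name, (precision, recall) in reported.items()
}
for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.4f}")
```

With the actual arrays available, sklearn.metrics.f1_score(y_test, y_pred) would give the same quantity directly from the predictions.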
