
# Spam Classifier
- Download examples from http://spamassassin.apache.org/old/publiccorpus/
- Split the datasets into a training set and a test set (using sklearn train_test_split)
- Write a data preparation pipeline
    - convert each email into a feature vector (using nltk)
    - optionally strip off email headers
    - convert to lowercase
    - remove punctuation
    - replace all URLs with "URL"
    - replace all numbers with "NUMBER"
    - perform stemming (i.e. trim off word endings)
- Try several classifiers and aim for both high recall and high precision (PR curve closer to the top-right corner)

## I. Load datasets

In [0]:
import os
import tarfile
import urllib.request  # urlretrieve lives in the urllib.request submodule

In [0]:
DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

def fetch_spam_data(ham_url=HAM_URL, spam_url=SPAM_URL, spam_path=SPAM_PATH):
    # Create the target directory if it does not exist yet
    os.makedirs(spam_path, exist_ok=True)
    for filename, url in (('ham.tar.bz2', ham_url), ('spam.tar.bz2', spam_url)):
        path = os.path.join(spam_path, filename)
        if not os.path.isfile(path):  # download each archive only once
            urllib.request.urlretrieve(url, path)
        # Extract into the directory passed by the caller, then close the archive
        with tarfile.open(path) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

In [0]:
fetch_spam_data()

In [0]:
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")

ham_filenames = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 20]
spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]

In [0]:
len(ham_filenames)

2500

In [0]:
len(spam_filenames)

500

## II. Parse email structure
- Use Python's `email` module to parse headers, encodings, and so on
- Handle multipart emails, such as those with images and attachments
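
As a minimal sketch of what the `email` module provides, the snippet below parses a tiny hand-written message. The message itself is made up for illustration and does not come from the corpus. It reads a header, the content type, and the decoded body.

```python
import email.parser
import email.policy

# A tiny hand-written message (hypothetical content, not from the corpus).
raw = (
    b"From: Alice <alice@example.com>\r\n"
    b"To: Bob <bob@example.com>\r\n"
    b"Subject: Lunch on Friday?\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Are you free at noon?\r\n"
)

msg = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw)

subject = msg["Subject"]               # header lookup is case-insensitive
content_type = msg.get_content_type()  # MIME type of the message
is_multipart = msg.is_multipart()      # False for a single-part message
body = msg.get_content().strip()       # decoded body text

print(subject, content_type, is_multipart, body, sep=" | ")
```

With `email.policy.default`, the parser returns an `EmailMessage`, which is what makes `get_content()` available.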

In [0]:
import email
import email.parser  # BytesParser is used below
import email.policy

def load_email(is_spam, filename, spam_path=SPAM_PATH):
    directory = "spam" if is_spam else "easy_ham"
    with open(os.path.join(spam_path, directory, filename), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

In [0]:
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]

Take a look at example ham emails and spam emails

In [0]:
print(ham_emails[6].get_content().strip())

The Scotsman - 22 August 2002

 Playboy wants to go out with a bang 
 
 
 AN AGEING Berlin playboy has come up with an unusual offer to lure women into
 his bed - by promising the last woman he sleeps with an inheritance of 250,000
 (£160,000). 
 
 Rolf Eden, 72, a Berlin disco owner famous for his countless sex partners,
 said he could imagine no better way to die than in the arms of an attractive
 young woman - preferably under 30. 
 
 "I put it all in my last will and testament - the last woman who sleeps with
 me gets all the money," Mr Eden told Bild newspaper. 
 
 "I want to pass away in the most beautiful moment of my life. First a lot of
 fun with a beautiful woman, then wild sex, a final orgasm - and it will all
 end with a heart attack and then Im gone." 
 
 Mr Eden, who is selling his nightclub this year, said applications should be
 sent in quickly because of his age. "It could end very soon," he said.


------------------------ Yahoo! Groups Sponsor ---------------------~

In [0]:
print(spam_emails[5].get_content().strip())

A POWERHOUSE GIFTING PROGRAM You Don't Want To Miss! 
 
  GET IN WITH THE FOUNDERS! 
The MAJOR PLAYERS are on This ONE
For ONCE be where the PlayerS are
This is YOUR Private Invitation

EXPERTS ARE CALLING THIS THE FASTEST WAY 
TO HUGE CASH FLOW EVER CONCEIVED
Leverage $1,000 into $50,000 Over and Over Again

THE QUESTION HERE IS:
YOU EITHER WANT TO BE WEALTHY 
OR YOU DON'T!!!
WHICH ONE ARE YOU?
I am tossing you a financial lifeline and for your sake I 
Hope you GRAB onto it and hold on tight For the Ride of youR life!

Testimonials

Hear what average people are doing their first few days:
"We've received 8,000 in 1 day and we are doing that over and over again!' Q.S. in AL
 "I'm a single mother in FL and I've received 12,000 in the last 4 days." D. S. in FL
"I was not sure about this when I sent off my $1,000 pledge, but I got back $2,000 the very next day!" L.L. in KY
"I didn't have the money, so I found myself a partner to work this with. We have received $4,000 over the last 2 days

### II-A. Email Structure --> Useful info
Most ham emails are plain text, and a fair number are signed with PGP (Pretty Good Privacy, an encryption program that provides cryptographic privacy and authentication for data communication):
https://en.wikipedia.org/wiki/Pretty_Good_Privacy

Many spam emails contain HTML.
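
To make "multipart" concrete, here is a small sketch that builds a two-part message with the standard library and lists the content type of every part. The subject and bodies are invented for illustration.

```python
from email.message import EmailMessage

# Build a hypothetical two-part message: plain text plus an HTML alternative.
msg = EmailMessage()
msg["Subject"] = "Weekly newsletter"
msg.set_content("Hello in plain text.")
msg.add_alternative("<html><body><p>Hello in <b>HTML</b>.</p></body></html>", subtype="html")

# walk() yields the container first, then each nested part in order.
part_types = [part.get_content_type() for part in msg.walk()]
print(part_types)
```

The outer container reports `multipart/alternative`, and the two nested parts report `text/plain` and `text/html`.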


In [0]:
# Describe the structure of a (possibly multipart) email
def get_email_structure(email):
    # A bare string has no further structure, so return it as is
    if isinstance(email, str):
        return email
    # For a multipart email, get_payload() returns a list of sub-messages
    payload = email.get_payload()

    if isinstance(payload, list):
        return "multipart({})".format(", ".join([get_email_structure(sub_email) for sub_email in payload]))
    else:
        return email.get_content_type()

In [0]:
# Count the structures of email
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [0]:
structures_counter(ham_emails).most_common()

[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]

In [0]:
structures_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

### II-B. Headers --> Useful info
- Focus on the Subject header in this exercise

In [0]:
for header, value in spam_emails[5].items():
    print(header, ":", value)

Return-Path : <Thecashsystem@firemail.de>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 3453043F99	for <zzzz@localhost>; Thu, 22 Aug 2002 11:58:24 -0400 (EDT)
Received : from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 16:58:24 +0100 (IST)
Received : from mailbox-13.st1.spray.net (mailbox-13.st1.spray.net [212.78.202.113])	by webnote.net (8.9.3/8.9.3) with ESMTP id QAA05573	for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 16:55:29 +0100
Received : from freesource (user-24-214-168-210.knology.net [24.214.168.210])	by mailbox-13.st1.spray.net (Postfix) with ESMTP	id ADDD03E25C; Thu, 22 Aug 2002 17:50:55 +0200 (DST)
Message-ID : <413-220028422154219900@freesource>
X-Priority : 1
To : 1 <thecashsystem@firemail.de>
From : TheCashSystem <Thecashsystem@firemail.de>
Subject : RE: Your Bank Ac


In [0]:
spam_emails[5]['Subject']

'RE: Your Bank Account Information '

## II. Train Test Split

In [0]:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails, dtype=object)  # dtype=object keeps each email as a single element
# Label ham emails as 0 and spam emails as 1
y = np.array([0]*len(ham_emails) + [1]*len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## III. Data Preprocessing

### III-A. Parsing/ Converting HTML to text
- Need a function to convert HTML to plain text
    - The most robust option: BeautifulSoup
    - Quick and dirty solution: regular expressions
        1. Drop the \<head> section
        2. Convert all \<a> tags to the word "HYPERLINK"
        3. Get rid of all HTML tags, leaving only the plain text
        4. Replace multiple newlines with single newlines
        5. Unescape HTML entities (e.g. `&gt;` and `&nbsp;`)
- Need a function to take an email as input, then return its content as plain text
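
Between the two options above there is a middle ground: the standard library's `html.parser.HTMLParser`. The sketch below is neither BeautifulSoup nor the regular-expression approach. It is an illustrative alternative that follows the same five steps, and the names `TextExtractor` and `html_to_text_sketch` are hypothetical.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <head> and marking links (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser unescape entities such as &gt;
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "a" and not self.in_head:
            self.chunks.append(" HYPERLINK ")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

    def handle_data(self, data):
        if not self.in_head:
            self.chunks.append(data)

def html_to_text_sketch(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of blank lines into a single newline
    return re.sub(r"(\s*\n)+", "\n", text).strip()

sample = (
    "<html><head><title>Ignore me</title></head>\n"
    "<body><p>Save &gt; 50%</p>\n\n\n"
    "<a href='http://example.com'>click here</a></body></html>"
)
result = html_to_text_sketch(sample)
print(result)
```

A real parser copes with malformed or oddly nested tags more gracefully than a regular expression, at the cost of a little more code.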

In [0]:
import re
from html import unescape

def html_to_plain_text(html):
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I) # drop <head> section
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I) # convert <a> tag to "HYPERLINK"
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S) # drop all HTML tags
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S) # replace multiple newlines with a single newline
    return unescape(text)

In [0]:
html_spam_emails = [email for email in X_train[y_train==1] if get_email_structure(email)=="text/html"]

sample_html_spam = html_spam_emails[7]
# print(sample_html_spam.get_content().strip()[:1000], "...")

In [0]:
print(sample_html_spam.get_content().strip()[:1000], "...")

<HTML><HEAD><TITLE></TITLE><META http-equiv="Content-Type" content="text/html; charset=windows-1252"><STYLE>A:link {TEX-DECORATION: none}A:active {TEXT-DECORATION: none}A:visited {TEXT-DECORATION: none}A:hover {COLOR: #0033ff; TEXT-DECORATION: underline}</STYLE><META content="MSHTML 6.00.2713.1100" name="GENERATOR"></HEAD>
<BODY text="#000000" vLink="#0033ff" link="#0033ff" bgColor="#CCCC99"><TABLE borderColor="#660000" cellSpacing="0" cellPadding="0" border="0" width="100%"><TR><TD bgColor="#CCCC99" valign="top" colspan="2" height="27">
<font size="6" face="Arial, Helvetica, sans-serif" color="#660000">
<b>OTC</b></font></TD></TR><TR><TD height="2" bgcolor="#6a694f">
<font size="5" face="Times New Roman, Times, serif" color="#FFFFFF">
<b>&nbsp;Newsletter</b></font></TD><TD height="2" bgcolor="#6a694f"><div align="right"><font color="#FFFFFF">
<b>Discover Tomorrow's Winners&nbsp;</b></font></div></TD></TR><TR><TD height="25" colspan="2" bgcolor="#CCCC99"><table width="100%" border="0" 

Converted version of above HTML spam

In [0]:
print(html_to_plain_text(sample_html_spam.get_content().strip())[:1000], "...")


OTC
 Newsletter
Discover Tomorrow's Winners 
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI.  CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future.
Put CBYI on your watch list, acquire a position TODAY.
REASONS TO INVEST IN CBYI
A profitable company and is on track to beat ALL earnings estimates!
One of the FASTEST growing distributors in environmental & safety equipment instruments.
Excellent management team, several EXCLUSIVE contracts.  IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research.
RAPIDLY GROWING INDUSTRY
Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billi

In [0]:
def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ('text/plain', 'text/html'):
            continue
        try:
            content = part.get_content()
        except Exception: # in case of encoding issue
            content = str(part.get_payload())
        if ctype == 'text/plain':
            return content
        else:
            html = content
    if html:
        return html_to_plain_text(html)

In [0]:
print(email_to_text(sample_html_spam)[:100], "...")


OTC
 Newsletter
Discover Tomorrow's Winners 
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Wat ...


### III-B. Stemming
- Remove morphological affixes from words using NLTK: http://www.nltk.org/
- http://www.nltk.org/api/nltk.stem.html?highlight=stemming

In [0]:
import nltk

stemmer = nltk.PorterStemmer()
# for word in ("Corrections", "Correction", "Correcting", "Corrected", "Correct"):
for word in ("Computations", "Computation", "Computing", "Computed", "Compute", "Compulsive"):
    print(word, "-->", stemmer.stem(word))

Computations --> comput
Computation --> comput
Computing --> comput
Computed --> comput
Compute --> comput
Compulsive --> compuls


### III-C. Replace URLs with the word "URL"
- Option 1: Use regular expressions https://mathiasbynens.be/demo/url-regex
- Option 2: use urlextract library https://github.com/lipoja/URLExtract
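
A rough sketch of Option 1 follows. The pattern is deliberately simple and is an assumption made for illustration. It is not one of the patterns from the linked comparison page, and `replace_urls` is a hypothetical helper name.

```python
import re

# Deliberately simple pattern (an assumption for illustration): an http(s)
# scheme or a leading "www.", followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)

def replace_urls(text, token=" URL "):
    """Replace every match of URL_PATTERN with the given token."""
    return URL_PATTERN.sub(token, text)

sample = "Visit https://example.com/offer?id=42 or www.example.org today"
replaced = replace_urls(sample)
print(replaced)
```

This pattern misses bare domains such as `github.com` and swallows trailing punctuation, which is one reason a dedicated library can be the more practical choice.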

In [0]:
!pip install urlextract

Collecting urlextract
  Downloading https://files.pythonhosted.org/packages/06/db/23b47f32d990dea1d9852ace16d551a0003bdfc8be33094cfd208757466e/urlextract-0.14.0-py3-none-any.whl
Collecting appdirs
  Downloading https://files.pythonhosted.org/packages/56/eb/810e700ed1349edde4cbdc1b2a21e28cdf115f9faf263f6bbf8447c1abf3/appdirs-1.4.3-py2.py3-none-any.whl
Collecting uritools
  Downloading https://files.pythonhosted.org/packages/eb/1a/5995c0a000ef116111b9af9303349ba97ec2446d2c9a79d2df028a3e3b19/uritools-3.0.0-py3-none-any.whl
Installing collected packages: appdirs, uritools, urlextract
Successfully installed appdirs-1.4.3 uritools-3.0.0 urlextract-0.14.0


In [0]:
import urlextract

url_extractor = urlextract.URLExtract()
print(url_extractor.find_urls("Will it detect github.com and https://youtu.be/7Pq-S557XQU?t=3m32s"))

['github.com', 'https://youtu.be/7Pq-S557XQU?t=3m32s']


### III-D. Custom Transformer 1: Convert emails to word counts
- Split sentences into words, using split()
- Count words
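
The two steps above can be sketched with `collections.Counter` on a made-up sentence:

```python
from collections import Counter

text = "free money now claim your free money"
word_counts = Counter(text.split())  # split() breaks on any run of whitespace

print(word_counts.most_common(2))
```

A `Counter` returns 0 for words it has never seen, so lookups need no special casing.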

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True, stemming=True):
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or "" # 1st: Convert email to text
            if self.lower_case:
                text = text.lower() # 2nd: convert lower case
            if self.replace_urls and url_extractor is not None: # 3rd: replace URLs with "URL"
                urls = list(set(url_extractor.find_urls(text))) # de-duplicate the extracted URLs
                urls.sort(key=len, reverse=True) # longest first, so a short URL never clobbers part of a longer one
                for url in urls:
                    text = text.replace(url, " URL ")
            if self.replace_numbers: # 4th: replace numbers with "NUMBER"
                text = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', "NUMBER", text)
            if self.remove_punctuation: # 5th: remove punctuation (replace them with whitespace)
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            word_counts = Counter(text.split()) # 6th: Split and then count
            if self.stemming and stemmer is not None: # 7th: stem the words
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed) # Make sure X_transformed in np array

In [0]:
# Testing the transformer with a few examples
X_few = X_train[:5]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)

In [0]:
X_few[1].get_content().strip()

'Some interesting quotes...\n\nhttp://www.postfun.com/pfp/worbois.html\n\n\nThomas Jefferson:\n\n"I have examined all the known superstitions of the word, and I do not\nfind in our particular superstition of Christianity one redeeming feature.\nThey are all alike founded on fables and mythology. Millions of innocent\nmen, women and children, since the introduction of Christianity, have been\nburnt, tortured, fined and imprisoned. What has been the effect of this\ncoercion? To make one half the world fools and the other half hypocrites;\nto support roguery and error all over the earth."\n\nSIX HISTORIC AMERICANS,\nby John E. Remsburg, letter to William Short\nJefferson again:\n\n"Christianity...(has become) the most perverted system that ever shone on\nman. ...Rogueries, absurdities and untruths were perpetrated upon the\nteachings of Jesus by a large band of dupes and importers led by Paul, the\nfirst great corrupter of the teaching of Jesus."'

In [0]:
print(X_few_wordcounts)

[Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1, 'r': 1})
 Counter({'the': 11, 'of': 9, 'and': 8, 'all': 3, 'christian': 3, 'to': 3, 'by': 3, 'jefferson': 2, 'i': 2, 'have': 2, 'superstit': 2, 'one': 2, 'on': 2, 'been': 2, 'ha': 2, 'half': 2, 'rogueri': 2, 'teach': 2, 'jesu': 2, 'some': 1, 'interest': 1, 'quot': 1, 'url': 1, 'thoma': 1, 'examin': 1, 'known': 1, 'word': 1, 'do': 1, 'not': 1, 'find': 1, 'in': 1, 'our': 1, 'particular': 1, 'redeem': 1, 'featur': 1, 'they': 1, 'are': 1, 'alik': 1, 'found': 1, 'fabl': 1, 'mytholog': 1, 'million': 1, 'innoc': 1, 'men': 1, 'women': 1, 'children': 1, 'sinc': 1, 'introduct': 1, 'burnt': 1, 'tortur': 1, 'fine': 1, 'imprison': 1, 'what': 1, 'effect': 1, 'thi': 1, 'coercion': 1, 'make': 1, 'world': 1, 'fool': 1, 'other': 1, 'hypocrit': 1, 'support': 1, 'error': 1, 'over': 1, 'earth': 1, 'six': 1, 'histor': 1, 'american': 1, 'john': 1, 'e': 1, 'remsburg': 1, 'letter': 1, 'william': 1, 'short': 1, 'again': 1, 'becom': 1, 'most':

### III-E. Custom Transformer 2: Convert word counts to vectors (Vocabulary)
- fit() method: builds the vocabulary (the most common words, each mapped to a column index)
- transform() method: uses the vocabulary to convert word counts into vectors
- the output is a sparse matrix
- enumerate() https://www.geeksforgeeks.org/enumerate-in-python/
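
Before the sparse version, here is a pure-Python sketch of the same idea on three made-up emails. The helper names `build_vocabulary` and `to_dense_row` are hypothetical. Index 0 is reserved for every word outside the vocabulary, and the cap of 10 only affects how words are ranked, not the counts stored in each row.

```python
from collections import Counter

# Hypothetical word counts for three tiny emails.
emails = [
    Counter({"free": 3, "money": 2, "now": 1}),
    Counter({"meeting": 1, "now": 2}),
    Counter({"free": 20, "lunch": 1}),
]

def build_vocabulary(word_counts, vocabulary_size, cap=10):
    """Map the most common words to indices 1..vocabulary_size (0 is reserved)."""
    totals = Counter()
    for counts in word_counts:
        for word, count in counts.items():
            totals[word] += min(count, cap)  # one email cannot dominate the ranking
    top = totals.most_common(vocabulary_size)
    return {word: index + 1 for index, (word, _) in enumerate(top)}

def to_dense_row(counts, vocabulary):
    """Return one row: column 0 sums out-of-vocabulary words, the rest are per-word counts."""
    row = [0] * (len(vocabulary) + 1)
    for word, count in counts.items():
        row[vocabulary.get(word, 0)] += count
    return row

vocabulary = build_vocabulary(emails, vocabulary_size=2)
rows = [to_dense_row(counts, vocabulary) for counts in emails]
print(vocabulary)
print(rows)
```

In the sparse version, several entries can land on the same (row, 0) position. SciPy sums duplicate entries when a matrix is built from `(data, (rows, cols))`, which produces the same totals as the explicit `+=` here.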

In [0]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000): # Keep only the 1000 most common words by default
        self.vocabulary_size = vocabulary_size

    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10) # Cap each email's contribution to a word at 10
        most_common = total_count.most_common()[:self.vocabulary_size] # Keep the vocabulary_size most common words
        self.most_common_ = most_common
        self.vocabulary_ = {word: index+1 for index, (word, count) in enumerate(most_common)}
        return self

    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size+1))

In [0]:
# Testing the transformer with a few examples
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)

In [0]:
X_few_vectors.toarray()

array([[  6,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [105,  11,   3,   8,   9,   1,   1,   1,   2,   2,   0],
       [ 67,   0,   3,   2,   1,   0,   4,   2,   0,   1,   1],
       [ 48,   1,   6,   1,   1,   2,   2,   1,   2,   0,   0],
       [ 88,   6,   2,   2,   1,   8,   1,   3,   1,   2,   4]],
      dtype=int64)

How to read the matrix:
- 2nd row, 1st col: 105 --> The second email contains 105 word occurrences that are not in the vocabulary
- 2nd row, 2nd col: 11 --> The first word of the vocabulary appears 11 times in this email ("the" appears 11 times)
- 2nd row, 3rd col: 3 --> The second word of the vocabulary appears 3 times in this email ("to" appears 3 times)



In [0]:
# The fitted vocabulary (word --> column index)
vocab_transformer.vocabulary_

{'a': 5,
 'and': 3,
 'i': 8,
 'in': 7,
 'number': 10,
 'of': 4,
 'on': 9,
 'the': 1,
 'to': 2,
 'url': 6}

In [0]:
X_few_wordcounts[1].most_common(10)

[('the', 11),
 ('of', 9),
 ('and', 8),
 ('all', 3),
 ('christian', 3),
 ('to', 3),
 ('by', 3),
 ('jefferson', 2),
 ('i', 2),
 ('have', 2)]

## IV. Data Pipeline
1. EmailToWordCounterTransformer: email to word count
2. WordCounterToVectorTransformer: word count to vector


In [0]:
from sklearn.pipeline import Pipeline

preprocess_pipeline = Pipeline([
    ("email_to_wordcount", EmailToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer())
])

X_train_transformed = preprocess_pipeline.fit_transform(X_train)

## V. Train Model
- 1st trial: Logistic Regression

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(solver='liblinear', random_state=42)
scores = cross_val_score(log_clf, X_train_transformed, y_train, cv=5, verbose=3, n_jobs=-1)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    1.0s remaining:    1.6s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    1.4s finished


In [0]:
scores.mean()

0.9870833333333333

1st trial: Logistic Regression reaches 98.7% mean cross-validation accuracy

Further studies:
- Try harder datasets
- Try multiple models
- Select the best ones
- Fine-tune them using Grid Search Cross-validation

In [0]:
from sklearn.metrics import precision_score, recall_score

log_clf = LogisticRegression(solver='liblinear', random_state=42)
log_clf.fit(X_train_transformed, y_train)

X_test_transformed = preprocess_pipeline.transform(X_test)
y_test_pred = log_clf.predict(X_test_transformed)

In [0]:
print("Precision score: {:.2f}%".format(100 * precision_score(y_test, y_test_pred)))
print("Recall score: %.2f%%" % (100 * recall_score(y_test, y_test_pred)))

Precision score: 96.88%
Recall score: 97.89%


## VI. Evaluation
- Confusion Matrix

In [0]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, log_clf.predict(X_test_transformed))

array([[502,   3],
       [  2,  93]])
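
The precision and recall reported earlier can be recomputed by hand from this matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so with ham = 0 and spam = 1 the four cells are TN, FP, FN, and TP. A short arithmetic check:

```python
# Confusion matrix from the cell above: rows are true labels, columns are
# predicted labels, with ham = 0 and spam = 1.
tn, fp = 502, 3
fn, tp = 2, 93

precision = tp / (tp + fp)  # of the emails flagged as spam, the share that really is spam
recall = tp / (tp + fn)     # of the real spam emails, the share that was caught
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("Precision: {:.2f}%".format(100 * precision))
print("Recall: {:.2f}%".format(100 * recall))
print("Accuracy: {:.2f}%".format(100 * accuracy))
```

In practical terms, 3 of the 505 ham emails in the test set were flagged as spam and 2 of the 95 spam emails slipped through.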