<a href="https://colab.research.google.com/github/onkarvkunte/NLP_Assignment/blob/main/M3_Part_I_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[link text](https://)##Assignment 3 Naïve Bayes and Sentiment Classification and Logistic Regression
Instructions
* Read the following Chapter 4: Naive Bayes and Sentiment Classification. Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2021. All rights reserved. Draft of September 21, 2021. I have tried to pull out relevant notes for you below, but it is encouraged that you read each chapter provided.
* Read the following Chapter 5: Logistic Regression. Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2021. All rights reserved. Draft of September 21, 2021. I have tried to pull out relevant notes for you below, but it is encouraged that you read each chapter provided.

Summary
Classification is one of the most important tasks of NLP and in machine learning. In NLP it often means the task of text categorization for both sentiment analysis, spam detection, and topic modeling. Naïve Bayes is often one of the first classification algorithms defined in NLP.  The intuition behind a classifier is lies at the underlying probability inferred by the Bayesian Inference, which uses Baye’s rule and conditional probabilities.

Here’s a reminder on Baye’s Rule:
P(y)=P(x)P(x)/(P(y))

We are saying “what is the probability of x given y”. Naïve Bayes is a generative model because there is an input that helps the model determine what the output could be. Said differently, “to train a generative model we first collect a large amount of data in some domain (e.g., think millions of images, sentences, or sounds, etc.) and then train a model to generate data like it.” [6]

So in the case of Naïve Bayes, we say given some word, what should be the class of the current word we are assessing? Contrastingly, discriminative models such as logistic regression, learn from features provided to the algorithm and then determine or predict what the class is. [7]


With Naïve Bayes, the assumption is that the probabilities are independent. We often call the Naïve Bayes classifier the bag-of-words approach. That’s because we are essentially throwing in the collection of words into a ‘bag’, selecting a word at random, and then calculating their frequency to use in the Bayesian Inference. Thus, context – the position of words -- is ignored and despite this, it turns out that the Naïve Bayes approach can be accurate and effective at determining whether an email is spam for example.

Back to bag of words. With bag of words, we assume that the position of the words are not relevant -- that dependency or context in the word phrase or sentence doesn’t matter. Relatedly, the naive Bayes assumption implies that the conditional probabilities are independent -- a rather strange assumption to make for words in a sentence! The equation for the naive Bayes classifier is outlined below:

You can use Naive Bayes by creating an index of words and walking through every word position in a test or corpus.


It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this Assignment, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).   One example corpus:   https://spamassassin.apache.org/old/publiccorpus/

You may work alone or in a group on this project.  You're welcome to use any tools or approach that you like.  Due before our next meetup. Starter code provided below.

Test example is provided at the end.

Libraries you may wish to use

In [2]:
import pandas as pd
import numpy as np
from os import makedirs, path, remove, rename, rmdir
from tarfile import open as open_tar
from shutil import rmtree
from urllib import request, parse
from glob import glob
from os import path
from re import sub
from email import message_from_file
from glob import glob
from sklearn.model_selection import StratifiedShuffleSplit
from collections import defaultdict
from functools import partial
from sklearn.metrics import (accuracy_score, f1_score, precision_score, recall_score)
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
import gc

download corpus using the following functions

Note: you may need to mount your drive on google then run this location. See previous exercises.

In [3]:
def download_corpus(dataset_dir: str = 'data'):
    base_url = 'https://spamassassin.apache.org'
    corpus_path = 'old/publiccorpus'
    files = {
        '20021010_easy_ham.tar.bz2': 'ham',
        '20021010_hard_ham.tar.bz2': 'ham',
        '20021010_spam.tar.bz2': 'spam',
        '20030228_easy_ham.tar.bz2': 'ham',
        '20030228_easy_ham_2.tar.bz2': 'ham',
        '20030228_hard_ham.tar.bz2': 'ham',
        '20030228_spam.tar.bz2': 'spam',
        '20030228_spam_2.tar.bz2': 'spam',
        '20050311_spam_2.tar.bz2': 'spam' }

    #creates the folders: downloads, ham and spam
    downloads_dir = path.join(dataset_dir, 'downloads')
    ham_dir = path.join(dataset_dir, 'ham')
    spam_dir = path.join(dataset_dir, 'spam')

    makedirs(downloads_dir, exist_ok=True)
    makedirs(ham_dir, exist_ok=True)
    makedirs(spam_dir, exist_ok=True)


    for file, spam_or_ham in files.items():
        # download files from URL of each specific .bz2 file
        url = parse.urljoin(base_url, f'{corpus_path}/{file}')
        tar_filename = path.join(downloads_dir, file)
        request.urlretrieve(url, tar_filename)

        #list e-mails in the compressed .bz2 file
        emails = []
        with open_tar(tar_filename) as tar:
            tar.extractall(path=downloads_dir)
            for tarinfo in tar:
                if len(tarinfo.name.split('/')) > 1:
                    emails.append(tarinfo.name)

        # move e-mails to ham or spam directory
        for email in emails:
            directory, filename = email.split('/')
            directory = path.join(downloads_dir, directory)

            if not path.exists(path.join(dataset_dir, spam_or_ham, filename)):
                rename(path.join(directory, filename),
                   path.join(dataset_dir, spam_or_ham, filename))

        rmtree(directory)

download_corpus()

#How many e-mails are classified in our dataset as either Spam or not Spam?


In [4]:
#How many e-mails are classified in our dataset as either Spam or not Spam?
ham_dir = path.join('data', 'ham')
spam_dir = path.join('data', 'spam')

print('Number of Non-Spam E-mails:', len(glob(f'{ham_dir}/*')))
print('\nNumber of Spam E-mails:', len(glob(f'{spam_dir}/*')))

Number of Non-Spam E-mails: 6952

Number of Spam E-mails: 2399


Provide your classifier below

In [5]:
import glob
import pandas as pd

ham_dir = path.join('data', 'ham')
spam_dir = path.join('data', 'spam')

# Create lists to store email content and labels
email_content = []
labels = []

# Process non-spam emails
print("Non-Spam E-mails:")
for file_path in glob.glob(f"{ham_dir}/*"):
    with open(file_path, 'rb') as file:  # Open the file in binary mode
        try:
            content = file.read().decode('utf-8')
        except UnicodeDecodeError:
            content = file.read().decode('latin-1')  # Try a different encoding if UTF-8 fails
        email_content.append(content)
        labels.append("ham")
        print(content)

# Process spam emails
print("\nSpam E-mails:")
for file_path in glob.glob(f"{spam_dir}/*"):
    with open(file_path, 'rb') as file:  # Open the file in binary mode
        try:
            content = file.read().decode('utf-8')
        except UnicodeDecodeError:
            content = file.read().decode('latin-1')  # Try a different encoding if UTF-8 fails
        email_content.append(content)
        labels.append("spam")
        print(content)

# Create a DataFrame from the email content and labels
df = pd.DataFrame({'Content': email_content, 'Label': labels})
print(df)


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
                        <td width=3D3><img src=3D"http://www.zdnet.com/grap=
hics/anchordesk/frontpage/b.gif" width=3D"3" height=3D"1" border=3D"0"><br>=
</td>
                </tr>
                </table>
                </td></tr></table>

       =20
</td>
<td width=3D12 colspan=3D3 valign=3Dbottom><img src=3D"http://www.zdnet.com=
/techupdate/i/gl_corner2.gif" width=3D"12" height=3D"6" border=3D"0" alt=3D=
""></td>
</tr>
<tr valign=3Dtop>


<td width=3D1 bgcolor=3D"#83A3CB"><img src=3D"http://www.zdnet.com/b.gif" w=
idth=3D1 height=3D1></td>
<td width=3D10 bgcolor=3D"#1E5C99"><img src=3D"http://www.zdnet.com/b.gif" =
width=3D10 height=3D1></td>
<td width=3D1 bgcolor=3D"#000000"><img src=3D"http://www.zdnet.com/b.gif" w=
idth=3D1 height=3D1></td>
<td width=3D428 bgcolor=3D"#ffffff">


                <table cellpadding=3D0 cellspacing=3D0 border=3D0 width=3D4=
28 bgcolor=3D"#ffffef">
                        <tr valign=

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Return-Path: <Online#3.19799.13-cLBKgHwZrJFX3dRR.1.b@newsletter.online.com>
Received: from abv-sfo1-acmta5.cnet.com (abv-sfo1-acmta5.cnet.com [206.16.1.164])
	by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g6BAY2J23694
	for <qqqqqqqqqq-zdnet@spamassassin.taint.org>; Thu, 11 Jul 2002 11:34:02 +0100
Received: from abv-sfo1-ac-agent5 (206.16.0.240) by abv-sfo1-acmta5.cnet.com (PowerMTA(TM) v1.5); Thu, 11 Jul 2002 03:27:16 -0700 (envelope-from <Online#3.19799.13-cLBKgHwZrJFX3dRR.1.b@newsletter.online.com>)
Message-ID: <4648047.1026383634977.JavaMail.root@abv-sfo1-ac-agent5>
Date: Thu, 11 Jul 2002 03:33:52 -0700 (PDT)
From: AnchorDesk <Online#3.19799.13-cLBKgHwZrJFX3dRR.1@newsletter.online.com>
To: qqqqqqqqqq-zdnet@spamassassin.taint.org
Subject: Next MS Office? Here's what I'd add--and delete [ANCHORDESK]
Mime-Version: 1.0
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
X-Mailer: Accucast (http://www.accucast.com)
X-Mailer-Version: 2.8.4-2


 <

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



From ellebt@hotmail.com  Fri Sep 20 17:36:05 2002
Return-Path: <ellebt@hotmail.com>
Delivered-To: zzzz@localhost.spamassassin.taint.org
Received: from localhost (jalapeno [127.0.0.1])
	by zzzzason.org (Postfix) with ESMTP id A469016F03
	for <zzzz@localhost>; Fri, 20 Sep 2002 17:36:02 +0100 (IST)
Received: from jalapeno [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for zzzz@localhost (single-drop); Fri, 20 Sep 2002 17:36:02 +0100 (IST)
Received: from server2k.TISWorld.com ([4.41.176.180]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g8KFivC11606 for
    <webmaster@efi.ie>; Fri, 20 Sep 2002 16:44:58 +0100
Received: from chastney.demon.co.uk (66-0-177-154.deltacom.net
    [66.0.177.154]) by server2k.TISWorld.com with SMTP (Microsoft Exchange
    Internet Mail Service Version 5.5.2653.13) id STA9WDQF; Fri,
    20 Sep 2002 03:03:16 -0700
Message-Id: <000029de097e$00000b1a$000006f0@avalonprinting.com>
To: <tmg4@hotmail.com>, <steve@cash4u.com>,
	<patti_contreras@hotmail.

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



The following email is a test email. You can take this and test your classifier to see if it predicts spam or not.



In [7]:
df

Unnamed: 0,Content,Label
0,From fork-admin@xent.com Sun Sep 22 14:11:41 ...,ham
1,From rssfeeds@jmason.org Tue Sep 24 10:48:01 ...,ham
2,From razor-users-admin@lists.sourceforge.net ...,ham
3,,ham
4,From fork-admin@xent.com Fri Aug 23 11:08:39 ...,ham
...,...,...
9346,From paige_455@aol.com Sun Sep 22 14:13:09 20...,spam
9347,From fork-admin@xent.com Tue Aug 6 12:01:13 ...,spam
9348,From antheaygd@chinchilla.freeserve.co.uk Wed...,spam
9349,From sophia_komar@eudoramail.com Mon Jun 24 1...,spam


In [6]:
spam_email = """
Subject: Get Rich Quick!

Dear Friend,

Congratulations! You've been selected to participate in an exclusive opportunity to make thousands of dollars from the comfort of your own home. Our revolutionary system guarantees quick and easy cash with minimal effort.

No more struggling to pay bills or worrying about financial security. With our proven method, you can start earning massive amounts of money in no time.

Here's what some of our satisfied customers have to say:
- "I was skeptical at first, but I'm now living my dream life thanks to this incredible system!" - John S.
- "I never thought making money online could be this simple. It's changed my life!" - Sarah L.

Don't miss out on this limited-time offer. Act now to secure your spot and start enjoying a life of financial freedom.

Click the link below to get started:
www.getrichquick.com

Remember, this opportunity is exclusive and won't last long. Take control of your financial future today!

Best regards,
The Get Rich Quick Team
"""
