# Document Filtering

Chapter 6 of *Programming Collective Intelligence*, based on code from:
* https://github.com/arthur-e/Programming-Collective-Intelligence/tree/master/chapter6
* https://go.oreilly.com/old-dominion-university/library/view/programming-collective-intelligence/9780596529321/

**Goal:** Classify email as spam or not spam.

**Implemented Example:** Classify a given document as "bad" or "good".

## General Functions

In [1]:
import sqlite3 as sqlite   # replaces import stmt from book
import re
import math

`getwords(doc)` - returns a dictionary whose keys are the unique words found in the given document

* breaks up the text into tokens by splitting on runs of non-word characters (`\W+`, anything other than a letter, digit, or underscore)
* keeps only tokens of 3 to 19 characters, converted to lowercase
* returns each word once (so it doesn't count how many times a word is used in a document)

Note that lowercasing reduces the number of features, because the text becomes case-insensitive. However, it also means ALL CAPS text can never be picked up as a potential feature for spam.


In [3]:
def getwords(doc):
  splitter=re.compile(r'\W+')  # different than book; raw string avoids an invalid-escape warning
  #print (doc)
  # Split on runs of non-word characters and keep
  # lowercased tokens of 3 to 19 characters
  words=[s.lower() for s in splitter.split(doc) 
          if len(s)>2 and len(s)<20]
  
  # Return the unique set of words only, as a dict mapping each word to 1
  uniq_words = dict([(w,1) for w in words])

  return uniq_words

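The note about ALL CAPS suggests one possible extension. The sketch below is not part of the notebook: the function name `getwords_with_caps`, the `caps_ratio` parameter, and the `ALLCAPS` feature name are all hypothetical, chosen only to illustrate how a case-based feature could be kept alongside the lowercased words.

```python
import re

def getwords_with_caps(doc, caps_ratio=0.3):
    """Hypothetical variant of getwords that also flags mostly-uppercase text."""
    # Same tokenization and length filter as getwords, but keep original case for now
    tokens = [s for s in re.split(r'\W+', doc) if 2 < len(s) < 20]

    # Lowercased unique words, as in getwords
    features = {s.lower(): 1 for s in tokens}

    # Count tokens written entirely in upper case
    upper = sum(1 for s in tokens if s.isupper())

    # Add a virtual feature when the share of uppercase tokens is high
    # (caps_ratio=0.3 is an assumed cutoff, not a tuned value)
    if tokens and upper / len(tokens) > caps_ratio:
        features['ALLCAPS'] = 1
    return features

print(sorted(getwords_with_caps("BUY CHEAP PILLS now")))
# ['ALLCAPS', 'buy', 'cheap', 'now', 'pills']
```

Because the virtual feature is written in upper case and every real word is lowercased, it cannot collide with a word from the text.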


## Naive Bayes Classifier

*To use this with the basic classifier (and to change it back later), make the following changes:*
* `class naivebayes(classifier)` -> `class naivebayes(basic_classifier)`
* `classifier.__init__(self,getfeatures)` -> `basic_classifier.__init__(self,getfeatures)`

In [11]:
import os

# Modified Naive Bayes Classifier (change basic_classifier if needed)
class naivebayes(basic_classifier):   # change for basic_classifier

  def __init__(self,getfeatures):   
    basic_classifier.__init__(self,getfeatures)  # change for basic_classifier
    self.thresholds={}
  
  def docprob(self,item,cat):
    features=self.getfeatures(item)   

    # Multiply the probabilities of all the features together
    p=1
    for f in features: p*=self.weightedprob(f,cat,self.fprob)
    return p

  def prob(self,item,cat):
    catprob=self.catcount(cat)/self.totalcount()
    docprob=self.docprob(item,cat)
    return docprob*catprob
  
  def setthreshold(self,cat,t):
    self.thresholds[cat]=t
    
  def getthreshold(self,cat):
    if cat not in self.thresholds: return 1.0
    return self.thresholds[cat]
  
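  def classify(self,item,default=None):
    probs={}
    # Find the category with the highest probability
    best=None     # stays None if every probability is 0 (e.g. no training data)
    best_p=0.0    # renamed from max to avoid shadowing the built-in
    for cat in self.categories():
      probs[cat]=self.prob(item,cat)
      if probs[cat]>best_p: 
        best_p=probs[cat]
        best=cat

    # No category scored above zero, so fall back to the default
    if best is None: return default

    # Make sure the probability exceeds threshold*next best
    for cat in probs:
      if cat==best: continue
      if probs[cat]*self.getthreshold(best)>probs[best]: return default
    return best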

# Function to read content of emails
def read_emails_from_folder(folder_path):
    emails = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                content = file.read().strip()
                emails.append(content)
    return emails

classifier = naivebayes(getwords)

# Training the Classifier
work_folder = "C:/Users/JHON G. BOTELLO/OneDrive - Old Dominion University/PHD/Courses/Spring 2024/Web Science/Web-Science/HW7-Email Classification/Code/Data/Training/work"
nonwork_folder = "C:/Users/JHON G. BOTELLO/OneDrive - Old Dominion University/PHD/Courses/Spring 2024/Web Science/Web-Science/HW7-Email Classification/Code/Data/Training/nonwork"

work_emails = read_emails_from_folder(work_folder)
nonwork_emails = read_emails_from_folder(nonwork_folder)

for email in work_emails:
    classifier.train(email, "work")

for email in nonwork_emails:
    classifier.train(email, "nonwork")

print("Training completed!")


Training completed!
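The threshold rule in `classify` is easier to follow in isolation. The sketch below restates it as a standalone function; `pick_category` is a hypothetical helper and the scores are made-up numbers, not values produced by the trained classifier. A category wins only if its score is at least its threshold times every other category's score; otherwise the default is returned.

```python
def pick_category(probs, thresholds, default=None):
    """Standalone restatement of the threshold rule (hypothetical helper)."""
    if not probs:
        return default

    # Category with the highest score
    best = max(probs, key=probs.get)
    if probs[best] <= 0:
        return default

    # Same fallback as getthreshold: 1.0 when no threshold is set
    t = thresholds.get(best, 1.0)

    # Reject the winner if any other score, scaled by the threshold, beats it
    for cat, p in probs.items():
        if cat == best:
            continue
        if p * t > probs[best]:
            return default
    return best

scores = {"work": 0.02, "nonwork": 0.03}   # made-up scores for illustration

print(pick_category(scores, {}))                            # nonwork
print(pick_category(scores, {"nonwork": 3.0}, "unknown"))   # unknown
```

With a threshold of 3.0 on `nonwork`, a score of 0.03 is not three times larger than 0.02, so the item is left unclassified instead of being labeled `nonwork`.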


In [12]:
# Testing the Classifier
test_work_folder = "C:/Users/JHON G. BOTELLO/OneDrive - Old Dominion University/PHD/Courses/Spring 2024/Web Science/Web-Science/HW7-Email Classification/Code/Data/Testing/work"
test_nonwork_folder = "C:/Users/JHON G. BOTELLO/OneDrive - Old Dominion University/PHD/Courses/Spring 2024/Web Science/Web-Science/HW7-Email Classification/Code/Data/Testing/nonwork"

test_work_emails = read_emails_from_folder(test_work_folder)
test_nonwork_emails = read_emails_from_folder(test_nonwork_folder)

results = []

# Classify work emails
for email in test_work_emails:
    prediction = classifier.classify(email)
    results.append(("work", prediction))

# Classify nonwork emails
for email in test_nonwork_emails:
    prediction = classifier.classify(email)
    results.append(("nonwork", prediction))

# Print the results as a table
print("\nTesting Results:")
print(f"{'Actual':<15}{'Predicted':<15}")
for actual, predicted in results:
    # classify() may return None (the default), which does not accept a width spec
    print(f"{actual:<15}{str(predicted):<15}")



Testing Results:
Actual         Predicted      
work           nonwork        
work           nonwork        
work           work           
work           work           
work           work           
nonwork        nonwork        
nonwork        nonwork        
nonwork        nonwork        
nonwork        nonwork        
nonwork        nonwork        
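A table of individual predictions can be condensed into counts per (actual, predicted) pair and an overall accuracy. The sketch below is one way to do that; `summarize` is a hypothetical helper, and the `observed` list simply retypes the ten rows printed above so that the block runs on its own.

```python
from collections import Counter

def summarize(pairs):
    """Count (actual, predicted) pairs and compute accuracy (hypothetical helper)."""
    counts = Counter(pairs)   # maps (actual, predicted) -> number of emails
    correct = sum(n for (actual, predicted), n in counts.items()
                  if actual == predicted)
    accuracy = correct / len(pairs) if pairs else 0.0
    return counts, accuracy

# The ten rows from the testing table, retyped by hand
observed = ([("work", "nonwork")] * 2 +
            [("work", "work")] * 3 +
            [("nonwork", "nonwork")] * 5)

counts, accuracy = summarize(observed)
for (actual, predicted), n in sorted(counts.items()):
    print(f"{actual:<10}{predicted:<10}{n}")
print(f"Accuracy: {accuracy:.2f}")   # Accuracy: 0.80
```

In the notebook itself, the same function could be called directly on the `results` list built in the testing cell.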


In [13]:
# Analyze misclassified emails
print("\nMisclassified Emails:")
for (actual, predicted), email in zip(results, test_work_emails + test_nonwork_emails):
    if actual != predicted:
        print(f"Actual: {actual}, Predicted: {predicted}")
        print(f"Content: {email}\n")



Misclassified Emails:
Actual: work, Predicted: nonwork
Content: 1 new citation to your articles
---------- Forwarded message ---------
From: Google Scholar Alerts <scholaralerts-noreply@google.com>
Date: Sat, Nov 30, 2024 at 4:12 AM
Subject: 1 new citation to your articles

[HTML] Storyline Extraction of Document-Level Events Using Large Language Models
Z Hu, Y Li - Journal of Computer and Communications, 2024
This article proposes a document-level prompt learning approach using LLMs to
extract the timeline-based storyline. Through verification tests on datasets such as
ESCv1. 2 and Timeline17, the results show that the prompt+ one-shot learning
proposed in this article works well. Meanwhile, our research findings indicate that
although timeline-based storyline extraction has shown promising prospects in the
practical applications of LLMs, it is still a complex natural language processing task …
•	Cites: ‪Generating stories from archived collections‬  Edit
Save	Twitter	


Weigle, Mic