# Document Classification
### The p < 0.05 Team - Haig Bedros, Noori Selina, Julia Ferris, Matthew Roland


It can be useful to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether a new document is spam. One example of such data is the Spambase Data Set from the UCI Machine Learning Repository.

For this project, we used the BBC Full Text Document Classification dataset from Kaggle. It contains full-text documents in five categories: business, entertainment, politics, sport, and tech.

The goal of our text classification is to predict the category of new documents in the test set.

Link to the dataset: https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification?resource=download

The models we used include the Naive Bayes Classifier, Support Vector Machines, and Random Forests. The results were compared for accuracy.

In [4]:
import os
import requests
from bs4 import BeautifulSoup

# Function to list all .txt data files in a category subfolder
def get_txt_files_from_github(category_url):
    response = requests.get(category_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    files = []
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and href.endswith('.txt'):
            files.append(href.split('/')[-1])
    return files

# Github repo
base_url = "https://github.com/juliaDataScience-22/documentClassification/tree/main/data/bbc"
raw_base_url = "https://raw.githubusercontent.com/juliaDataScience-22/documentClassification/main/data/bbc"

## 1. Load and Process the Documents

The documents for each category are downloaded from the GitHub copy of the dataset, split into words, and stored together with their category label.

In [14]:
categories = ['business', 'entertainment', 'politics', 'sport', 'tech']
documents = []
all_words_list = []

for category in categories:
    category_url = f"{base_url}/{category}"
    txt_files = get_txt_files_from_github(category_url)
    
    for filename in txt_files:
        file_url = f"{raw_base_url}/{category}/{filename}"
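        file_response = requests.get(file_url)
        # requests' .text replaces undecodable bytes instead of raising
        # UnicodeDecodeError, so decode the raw bytes to make the fallback work.
        try:
            text = file_response.content.decode('utf-8')
        except UnicodeDecodeError:
            text = file_response.content.decode('ISO-8859-1')
        words = text.split()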
        documents.append((words, category))
        all_words_list.extend(w.lower() for w in words)

# Print some data to verify
print(f"Total documents: {len(documents)}")
print(f"Sample document: {documents[0]}")
print(f"Total words: {len(all_words_list)}")

Total documents: 800
Sample document: (['Ad', 'sales', 'boost', 'Time', 'Warner', 'profit', 'Quarterly', 'profits', 'at', 'US', 'media', 'giant', 'TimeWarner', 'jumped', '76%', 'to', '$1.13bn', '(£600m)', 'for', 'the', 'three', 'months', 'to', 'December,', 'from', '$639m', 'year-earlier.', 'The', 'firm,', 'which', 'is', 'now', 'one', 'of', 'the', 'biggest', 'investors', 'in', 'Google,', 'benefited', 'from', 'sales', 'of', 'high-speed', 'internet', 'connections', 'and', 'higher', 'advert', 'sales.', 'TimeWarner', 'said', 'fourth', 'quarter', 'sales', 'rose', '2%', 'to', '$11.1bn', 'from', '$10.9bn.', 'Its', 'profits', 'were', 'buoyed', 'by', 'one-off', 'gains', 'which', 'offset', 'a', 'profit', 'dip', 'at', 'Warner', 'Bros,', 'and', 'less', 'users', 'for', 'AOL.', 'Time', 'Warner', 'said', 'on', 'Friday', 'that', 'it', 'now', 'owns', '8%', 'of', 'search-engine', 'Google.', 'But', 'its', 'own', 'internet', 'business,', 'AOL,', 'had', 'has', 'mixed', 'fortunes.', 'It', 'lost', '464,000', 
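
A quick per-category count is a useful check that all five categories were actually loaded, since a missing folder would otherwise go unnoticed until the evaluation reports. The sketch below uses a small hypothetical `sample_documents` list in place of the notebook's `documents`; `count_by_category` and `expected_categories` are illustrative names, not part of the original code.

```python
from collections import Counter

# Hypothetical stand-in for the notebook's `documents` list of (words, category) pairs.
sample_documents = [
    (["Ad", "sales", "boost"], "business"),
    (["Film", "wins", "award"], "entertainment"),
    (["Party", "leader", "speaks"], "politics"),
    (["Race", "won", "easily"], "sport"),
    (["Profits", "jumped"], "business"),
]
expected_categories = ["business", "entertainment", "politics", "sport", "tech"]


def count_by_category(docs):
    """Return a Counter mapping each category label to its number of documents."""
    return Counter(category for _, category in docs)


counts = count_by_category(sample_documents)
# Any expected category with a zero count was never loaded.
missing = [c for c in expected_categories if counts[c] == 0]
print(counts)
print("Missing categories:", missing)
```

Running the same two lines against the real `documents` list would show whether each category contributed files.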

## 2. Feature Extraction
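Extract features using NLTK for the Naive Bayes classifier. Each document is represented by boolean features recording whether it contains each of 1,000 vocabulary words. The shuffled feature sets are then split 70/30 into training and test sets.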

In [15]:
import nltk
import random
from nltk import FreqDist

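all_words = FreqDist(all_words_list)  # words were already lower-cased when loaded
# Keep the 1,000 most frequent words; in NLTK 3, list(all_words)[:1000]
# follows first-seen order rather than frequency.
word_features = [word for word, _ in all_words.most_common(1000)]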

def document_features(document):
    # Lower-case the document's words so they match the lower-cased word_features
    document_words = set(w.lower() for w in document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]

random.shuffle(featuresets)
total_samples = len(featuresets)
train_size = int(0.7 * total_samples)
train_set, test_set = featuresets[:train_size], featuresets[train_size:]
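
The split above uses `random.shuffle` without a seed, so the training and test sets, and therefore the accuracy, change from run to run. A minimal sketch of a repeatable 70/30 split follows. The helper name `seeded_split`, the seed value, and the toy feature sets are assumptions for illustration only.

```python
import random


def seeded_split(items, train_fraction=0.7, seed=42):
    """Shuffle a copy of `items` with a fixed seed and split it into train and test lists."""
    shuffled = list(items)
    # A private Random instance avoids changing the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy (features, label) pairs standing in for the notebook's `featuresets`.
toy_featuresets = [
    ({"contains(film)": i % 2 == 0}, "entertainment" if i % 2 == 0 else "sport")
    for i in range(10)
]

train_a, test_a = seeded_split(toy_featuresets)
train_b, test_b = seeded_split(toy_featuresets)
print(len(train_a), len(test_a))  # 7 3
print(train_a == train_b)         # True: same seed, same split
```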

## 3. Training and Evaluation of Naive Bayes Classifier
Train and evaluate the NLTK Naive Bayes classifier.

In [16]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print('NLTK Naive Bayes Accuracy:', nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)

NLTK Naive Bayes Accuracy: 0.9541666666666667
Most Informative Features
          contains(race) = True            sport : busine =     37.7 : 1.0
         contains(actor) = True           entert : politi =     37.0 : 1.0
          contains(film) = True           entert : politi =     29.6 : 1.0
        contains(leader) = True           politi : sport  =     28.5 : 1.0
          contains(star) = True           entert : politi =     27.7 : 1.0
         contains(award) = True           entert : busine =     27.1 : 1.0
      contains(starring) = True           entert : politi =     26.9 : 1.0
         contains(drugs) = True            sport : busine =     24.3 : 1.0
           contains(won) = True            sport : busine =     23.1 : 1.0
         contains(party) = True           politi : entert =     22.5 : 1.0


**Naive Bayes Classifier Results:**
- **Accuracy**: 95.4%
- **Summary**: The NLTK Naive Bayes classifier achieved an accuracy of 95.4% on the held-out test documents in the run shown above. Because the documents are shuffled without a fixed seed, this figure varies between runs. The most informative features were words such as 'race', 'actor', 'film', 'leader', and 'star', each far more likely to appear in one category than in another.
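
The ratios in the "Most Informative Features" output (for example `37.7 : 1.0`) compare how likely a feature value is under one label versus another. The sketch below shows how such a ratio can be computed. It assumes add-0.5 smoothing over the two outcomes True and False, which is believed to match NLTK's default estimator but should be treated as an assumption. The counts are invented, and `smoothed_prob` and `likelihood_ratio` are hypothetical helper names.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with add-gamma smoothing over `bins` outcomes."""
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature value | label A) to P(feature value | label B)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Invented counts: the word appears in 60 of 140 sport training documents
# but in only 1 of 140 business training documents.
ratio = likelihood_ratio(count_a=60, total_a=140, count_b=1, total_b=140)
print(f"contains(race) = True   sport : business = {ratio:.1f} : 1.0")
```

Smoothing keeps the ratio finite even when a word never occurs under one of the labels.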

## 4. Prepare Data for SVM Classifier using TF-IDF
Convert the documents to TF-IDF features for use with the SVM classifier.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

def document_to_string(document):
    return ' '.join(document)

documents_str = [document_to_string(doc) for doc, _ in documents]
labels = [label for _, label in documents]

vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(documents_str)
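
To make the TF-IDF weighting concrete, here is a small pure-Python sketch of the computation on a toy corpus. It follows scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row). It skips tokenisation details and `max_features`, so it is an illustration rather than a replacement for `TfidfVectorizer`. The function name `tfidf_rows` and the toy sentences are assumptions.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalised TF-IDF weights for a list of tokenised documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


toy_docs = [
    "profits rose at the firm".split(),
    "the film won the award".split(),
    "the race was won".split(),
]
rows = tfidf_rows(toy_docs)
for row in rows:
    print({t: round(w, 2) for t, w in sorted(row.items())})
```

A word such as "the", which occurs in every toy document, receives the minimum IDF of 1, so rarer words like "profits" carry more weight within the same document.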


Split the data into training (70%) and testing (30%) sets.

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)
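
The vectorizer above is fitted on all documents before the split, so vocabulary and IDF statistics from the test documents influence the training features. One common way to avoid this is to split the raw text first and fit the vectorizer inside a pipeline. The sketch below shows that arrangement on a toy corpus. The texts, labels, and the use of `stratify` are assumptions for illustration, not part of the original analysis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for `documents_str` and `labels`.
toy_texts = [
    "profits rose at the bank",
    "shares fell as the firm cut jobs",
    "the company reported higher sales",
    "oil prices lifted market profits",
    "the striker scored in the final",
    "the team won the match",
    "the champion won the race",
    "the coach praised the players",
]
toy_labels = ["business"] * 4 + ["sport"] * 4

# Split the raw text first; stratify keeps the class balance in both parts.
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training text only, then trains the SVM.
model = make_pipeline(TfidfVectorizer(max_features=2000), LinearSVC())
model.fit(texts_train, labels_train)
predictions = model.predict(texts_test)
print(list(zip(labels_test, predictions)))
```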

## 5. Train and Evaluate the SVM Classifier
Train and evaluate the SVM classifier.

In [19]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

svm_classifier = LinearSVC()
svm_classifier.fit(X_train, y_train)

svm_predictions = svm_classifier.predict(X_test)

svm_accuracy = accuracy_score(y_test, svm_predictions)
print(f'SVM Accuracy: {svm_accuracy}')
print(classification_report(y_test, svm_predictions))

SVM Accuracy: 1.0
               precision    recall  f1-score   support

     business       1.00      1.00      1.00        62
entertainment       1.00      1.00      1.00        63
     politics       1.00      1.00      1.00        51
        sport       1.00      1.00      1.00        64

     accuracy                           1.00       240
    macro avg       1.00      1.00      1.00       240
 weighted avg       1.00      1.00      1.00       240



**SVM Classifier Results:**
- **Accuracy**: 100%
- **Summary**: The SVM classifier achieved an accuracy of 100% on the 240 test documents, with precision and recall of 1.00 for every category in the report. The report lists only business, entertainment, politics, and sport; no tech documents appear in the test split.


## 6. Train and Evaluate the Random Forest Classifier
Train and evaluate a Random Forest classifier on the same TF-IDF features and train/test split.

In [20]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

rf_predictions = rf_classifier.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f'Random Forest Accuracy: {rf_accuracy}')
print(classification_report(y_test, rf_predictions))

Random Forest Accuracy: 0.9916666666666667
               precision    recall  f1-score   support

     business       1.00      0.97      0.98        62
entertainment       1.00      1.00      1.00        63
     politics       0.96      1.00      0.98        51
        sport       1.00      1.00      1.00        64

     accuracy                           0.99       240
    macro avg       0.99      0.99      0.99       240
 weighted avg       0.99      0.99      0.99       240



**Random Forest Classifier Results:**
- **Accuracy**: 99.2%
- **Summary**: The Random Forest classifier achieved an accuracy of 99.2% on the 240 test documents. Precision and recall were at least 0.96 for every category in the report (business, entertainment, politics, and sport); the only shortfalls were business recall (0.97) and politics precision (0.96).
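
Precision, recall, and F1 in the report above follow directly from counts of correct and incorrect predictions. The sketch below recomputes them in plain Python for one error pattern that is consistent with the rounded Random Forest figures: two business documents predicted as politics. That pattern is inferred from the report rather than taken from the actual predictions, and `per_class_metrics` is a hypothetical helper name.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from paired label lists."""
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    actual = Counter(y_true)
    metrics = {}
    for label in sorted(actual):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label]
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Supports copied from the classification report.
supports = {"business": 62, "entertainment": 63, "politics": 51, "sport": 64}
y_true = [label for label, n in supports.items() for _ in range(n)]
y_pred = list(y_true)
y_pred[0] = y_pred[1] = "politics"  # assumed: two business documents predicted as politics

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)
print(f"accuracy: {accuracy:.4f}")
for label, (p, r, f1) in metrics.items():
    print(f"{label:>13}  {p:.2f}  {r:.2f}  {f1:.2f}")
```

Under this assumption, business recall is 60/62 and politics precision is 51/53, which round to the 0.97 and 0.96 shown in the report.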

## 7. Conclusion
- **Naive Bayes Classifier**:
  - **Accuracy**: 95.4%
  - **Summary**: The Naive Bayes classifier correctly classified most documents and identified key words for each category.

- **SVM Classifier**:
  - **Accuracy**: 100%
  - **Summary**: The SVM classifier was the most accurate, classifying every test document correctly.

- **Random Forest Classifier**:
  - **Accuracy**: 99.2%
  - **Summary**: The Random Forest classifier also performed well, correctly classifying a large majority of documents.

- **Key Outcome**:
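  - The SVM classifier was the most accurate model on this test split. Naive Bayes was evaluated on a different random split from the SVM and Random Forest, and the test reports contain no tech documents, so the comparison is indicative rather than exact.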


Citation:

D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.
