# Document Classification
### Team: The p < 0.05 Team - Haig Bedros, Noori Selina, Julia Ferris, Matthew Roland


It can be useful to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether a new message is spam. One example of such data is the Spambase Data Set in the UCI Machine Learning Repository.

For this project, we used the BBC Full Text Document Classification dataset from Kaggle. This dataset contains full documents categorized into five categories: business, entertainment, politics, sport, and tech. The goal of our text classification is to predict the category of new documents in the test set.

We trained three models: a Naive Bayes classifier, a linear Support Vector Machine (SVM), and a Random Forest. We then compared their accuracy on held-out test documents.

In [11]:
from google.colab import drive
import os


drive.mount('/content/drive')
data_path = '/content/drive/My Drive/bbc'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Load and Process the Documents

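Once the dataset's zip file has been extracted to `data_path`, the documents in each category folder are read, split on whitespace into words, and stored together with their category label. Files that are not valid UTF-8 are re-read as ISO-8859-1.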

In [12]:
import os
import random

documents = []
all_words_list = []

for category in ['business', 'entertainment', 'politics', 'sport', 'tech']:
    category_path = os.path.join(data_path, category)
    for filename in os.listdir(category_path):
        if filename.endswith(".txt"):
            filepath = os.path.join(category_path, filename)
            try:
                with open(filepath, 'r', encoding='utf-8') as file:
                    words = file.read().split()
            except UnicodeDecodeError:
                with open(filepath, 'r', encoding='ISO-8859-1') as file:
                    words = file.read().split()
            documents.append((words, category))
            all_words_list.extend(w.lower() for w in words)
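The encoding fallback in the loop above can be isolated into a small helper, which makes it easier to check on its own. The following is a minimal sketch; the helper name `read_words` and the temporary sample file are hypothetical and not part of the notebook.

```python
import os
import tempfile


def read_words(filepath):
    """Read a text file as UTF-8, falling back to ISO-8859-1, and split on whitespace."""
    try:
        with open(filepath, 'r', encoding='utf-8') as file:
            return file.read().split()
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this read always decodes.
        with open(filepath, 'r', encoding='ISO-8859-1') as file:
            return file.read().split()


# Demo on a temporary file containing the byte 0xA3 (the pound sign in
# ISO-8859-1), which is not valid UTF-8 on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'wb') as file:
        file.write(b'Profits rose \xa3100m')
    words = read_words(sample_path)

print(words)
```

With such a helper, the body of the loading loop would reduce to a single `words = read_words(filepath)` call.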

## 2. Feature Extraction
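Each document is converted into a dictionary of binary `contains(word)` features over a 1,000-word vocabulary built with NLTK's `FreqDist`. The feature sets are then shuffled and split 70/30 into training and test sets for the Naive Bayes classifier.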

In [13]:
import nltk
from nltk import FreqDist

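# all_words_list is already lowercased, so the .lower() here is a no-op.
all_words = FreqDist(w.lower() for w in all_words_list)
# Note: in NLTK 3, FreqDist subclasses collections.Counter, so list(all_words)
# yields words in first-seen order, not by frequency. To select the 1,000 most
# frequent words, use [w for w, _ in all_words.most_common(1000)] instead.
word_features = list(all_words)[:1000]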

def document_features(document):
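    # Note: word_features is lowercased but the document's words are not, so a
    # capitalized occurrence such as 'Market' does not match the feature 'market'.
    document_words = set(document)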
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]

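# Shuffle before splitting. No seed is set, so the split (and therefore the
# accuracy reported below) varies from run to run.
random.shuffle(featuresets)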
total_samples = len(featuresets)
train_size = int(0.7 * total_samples)
train_set, test_set = featuresets[:train_size], featuresets[train_size:]
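As a possible variation on the feature extraction above, the vocabulary can be chosen by frequency and the membership test made case-insensitive. The sketch below uses `collections.Counter` in place of `FreqDist` (which subclasses it) so that it runs without NLTK. The function `top_word_features`, the extra `word_features` argument, and the toy documents are assumptions made for illustration. Results from this variant would differ from the accuracy reported in the next section.

```python
from collections import Counter


def top_word_features(tokenized_docs, n):
    """Return the n most frequent lowercased words across all documents."""
    counts = Counter(w.lower() for doc in tokenized_docs for w in doc)
    return [word for word, _ in counts.most_common(n)]


def document_features(document, word_features):
    """Binary 'contains(word)' features, compared case-insensitively."""
    document_words = {w.lower() for w in document}
    return {f'contains({w})': (w in document_words) for w in word_features}


# Hypothetical toy documents standing in for the tokenized BBC articles.
toy_docs = [
    ['Shares', 'fell', 'as', 'the', 'market', 'closed'],
    ['The', 'market', 'rallied'],
    ['The', 'match', 'ended'],
]

features = top_word_features(toy_docs, 2)
print(features)
print(document_features(toy_docs[2], features))
```

Here `most_common(n)` orders words by count, so the vocabulary no longer depends on the order in which files happen to be read.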

## 3. Training and Evaluation of Naive Bayes Classifier
Train the NLTK Naive Bayes classifier on the training set, report its accuracy on the test set, and list the ten most informative features.

In [14]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print('NLTK Naive Bayes Accuracy:', nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)

NLTK Naive Bayes Accuracy: 0.8907185628742516
Most Informative Features
        contains(market) = True           busine : sport  =     85.8 : 1.0
    contains(government) = True           politi : sport  =     79.6 : 1.0
    contains(technology) = True             tech : entert =     78.4 : 1.0
       contains(digital) = True             tech : busine =     76.4 : 1.0
       contains(million) = True             tech : sport  =     68.1 : 1.0
         contains(music) = True           entert : busine =     66.4 : 1.0
          contains(star) = True           entert : politi =     54.4 : 1.0
         contains(actor) = True           entert : busine =     51.5 : 1.0
        contains(shares) = True           busine : sport  =     50.7 : 1.0
      contains(industry) = True             tech : sport  =     49.1 : 1.0


**Naive Bayes Classifier Results:**
- **Accuracy**: 89.1%
- **Summary**: The NLTK Naive Bayes classifier achieved an accuracy of 89.1% on the test set, meaning it assigned 89.1% of the test documents to their correct category (business, entertainment, politics, sport, or tech). The most informative features were words such as 'market', 'government', 'technology', 'digital', and 'million'. Each ratio in the output compares how likely the feature is in one category relative to another; for example, 'market' was about 86 times more likely to appear in a business document than in a sport document.
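To make the ratios in the "Most Informative Features" listing more concrete, here is a minimal sketch of how such a likelihood ratio can be computed from document counts. The counts are hypothetical, not taken from the BBC data. The add-0.5 smoothing (expected likelihood estimation) is an assumption about the estimator; to our knowledge it is NLTK's default, but the sketch does not call NLTK.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with add-gamma smoothing over `bins` values."""
    return (count + gamma) / (total + bins * gamma)


# Hypothetical training counts: documents containing 'market' in each category.
business_with_word, business_total = 120, 350
sport_with_word, sport_total = 1, 360

p_business = smoothed_prob(business_with_word, business_total)
p_sport = smoothed_prob(sport_with_word, sport_total)

# The ratio says how much more likely the feature is under one label than the other.
ratio = p_business / p_sport
print(f'contains(market) = True   business : sport = {ratio:.1f} : 1.0')
```

Smoothing keeps the ratio finite even when a word never appears in one category, which would otherwise cause a division by zero.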

## 4. Prepare Data for SVM Classifier using TF-IDF
Convert the documents to TF-IDF features (limited to 2,000 terms) for use with the SVM classifier. The Random Forest in Section 6 reuses the same features.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

def document_to_string(document):
    return ' '.join(document)

documents_str = [document_to_string(doc) for doc, _ in documents]
labels = [label for _, label in documents]

vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(documents_str)
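For readers unfamiliar with TF-IDF, the weighting can be sketched in plain Python. This assumes scikit-learn's default scheme: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. It leaves out tokenization, lowercasing, and the `max_features` cut-off. The function `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Leave an empty document as an empty vector instead of dividing by zero.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


toy_docs = [['market', 'shares', 'market'], ['match', 'goal'], ['market', 'goal']]
vecs = tfidf_vectors(toy_docs)
for vec in vecs:
    print({term: round(weight, 3) for term, weight in vec.items()})
```

Terms that occur in fewer documents receive a larger IDF, so category-specific words tend to dominate a document's vector.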


Split the data into training (70%) and testing (30%) sets.

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)
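In the cells above, the vectorizer is fitted on all documents before the split, so the test documents contribute to the vocabulary and IDF weights. One way to avoid this is to split the raw strings first and wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below shows the idea on a hypothetical toy corpus; the `toy_` names and the `stratify` argument are assumptions and are not part of the notebook. We have not re-run the BBC experiment this way, so its effect on the reported scores is unknown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for documents_str and labels.
toy_texts = [
    'shares rose as the market rallied', 'the bank cut interest rates',
    'profits fell at the oil firm', 'the market closed lower on shares',
    'the striker scored in the final match', 'the team won the cup match',
    'the coach praised the team', 'the striker missed the final',
]
toy_labels = ['business'] * 4 + ['sport'] * 4

# Split the raw text first so that the test documents stay unseen.
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels)

# The pipeline fits TF-IDF on the training text only, then trains the SVM.
model = make_pipeline(TfidfVectorizer(max_features=2000), LinearSVC(random_state=42))
model.fit(toy_train, toy_y_train)

toy_predictions = model.predict(toy_test)
print(list(zip(toy_test, toy_predictions)))
```

Calling `model.predict` on new text applies the training-set vocabulary and IDF weights automatically.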

## 5. Train and Evaluate the SVM Classifier
Train a linear SVM (`LinearSVC` with default settings) on the TF-IDF training vectors and evaluate it on the test set.

In [17]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

svm_classifier = LinearSVC()
svm_classifier.fit(X_train, y_train)

svm_predictions = svm_classifier.predict(X_test)

svm_accuracy = accuracy_score(y_test, svm_predictions)
print(f'SVM Accuracy: {svm_accuracy}')
print(classification_report(y_test, svm_predictions))

SVM Accuracy: 0.9700598802395209
               precision    recall  f1-score   support

     business       0.96      0.95      0.95       165
entertainment       1.00      0.98      0.99       118
     politics       0.95      0.96      0.95       120
        sport       0.98      1.00      0.99       140
         tech       0.96      0.97      0.96       125

     accuracy                           0.97       668
    macro avg       0.97      0.97      0.97       668
 weighted avg       0.97      0.97      0.97       668



**SVM Classifier Results:**
- **Accuracy**: 97.0%
- **Summary**: The SVM classifier achieved an accuracy of 97.0% on the 668 test documents. Per-class F1 scores ranged from 0.95 (business, politics) to 0.99 (entertainment, sport), so no single category lagged far behind the others.
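The precision, recall, and F1 columns in the classification report can be reproduced by hand from the true and predicted labels. The following is a minimal sketch using hypothetical toy labels rather than the notebook's predictions; `per_class_metrics` is an illustrative helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} from two label lists."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1, tp + fn)
    return metrics


# Hypothetical labels for five test documents.
toy_true = ['tech', 'tech', 'sport', 'sport', 'sport']
toy_pred = ['tech', 'sport', 'sport', 'sport', 'tech']

toy_metrics = per_class_metrics(toy_true, toy_pred)
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

for label, (precision, recall, f1, support) in toy_metrics.items():
    print(f'{label:>6}  {precision:.2f}  {recall:.2f}  {f1:.2f}  {support}')
print(f'accuracy  {toy_accuracy:.2f}')
```

Precision asks how many documents predicted as a category truly belong to it. Recall asks how many documents of a category were found. Support is the number of true documents in that category.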


## 6. Train and Evaluate the Random Forest Classifier
Train a Random Forest with 100 trees on the same TF-IDF features and evaluate it on the same test set.

In [18]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

rf_predictions = rf_classifier.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f'Random Forest Accuracy: {rf_accuracy}')
print(classification_report(y_test, rf_predictions))

Random Forest Accuracy: 0.9476047904191617
               precision    recall  f1-score   support

     business       0.89      0.96      0.92       165
entertainment       0.99      0.94      0.97       118
     politics       0.93      0.92      0.92       120
        sport       0.97      0.99      0.98       140
         tech       0.98      0.92      0.95       125

     accuracy                           0.95       668
    macro avg       0.95      0.95      0.95       668
 weighted avg       0.95      0.95      0.95       668



**Random Forest Classifier Results:**
- **Accuracy**: 94.8%
- **Summary**: The Random Forest classifier achieved an accuracy of 94.8% on the same 668 test documents. Per-class precision ranged from 0.89 (business) to 0.99 (entertainment), and recall from 0.92 (politics, tech) to 0.99 (sport). The lower business precision means some documents from other categories were labeled as business.

## 7. Conclusion
- **Naive Bayes Classifier**:
  - **Accuracy**: 89.1%
  - **Summary**: The Naive Bayes classifier correctly classified most documents and identified key words for each category.

- **SVM Classifier**:
  - **Accuracy**: 97.0%
  - **Summary**: The SVM classifier was the most accurate, effectively classifying documents with very high precision and recall.

- **Random Forest Classifier**:
  - **Accuracy**: 94.8%
  - **Summary**: The Random Forest classifier also performed well, correctly classifying a large majority of documents.

- **Key Outcome**:
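  - The SVM classifier was the best model for classifying documents into business, entertainment, politics, sport, and tech categories. The SVM and Random Forest results are directly comparable because both used the same TF-IDF features and the same test split. The Naive Bayes model used binary word-presence features and a separate, unseeded shuffle, so its 89.1% is only a rough comparison.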
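To put the accuracy gap between the two scikit-learn models in terms of documents, the reported accuracies can be converted back to counts. The accuracies and the test-set size of 668 come from the outputs above; only the variable names are ours.

```python
test_size = 668  # total support in both classification reports

accuracies = {
    'SVM': 0.9700598802395209,
    'Random Forest': 0.9476047904191617,
}

# accuracy = correct / test_size, so correct = accuracy * test_size.
correct = {name: round(acc * test_size) for name, acc in accuracies.items()}
for name, n_correct in correct.items():
    print(f'{name}: {n_correct} of {test_size} correct, {test_size - n_correct} errors')

gap = correct['SVM'] - correct['Random Forest']
print(f'The SVM classified {gap} more test documents correctly.')
```

The SVM misclassified 20 of the 668 test documents and the Random Forest 35, so the gap of roughly 2.2 percentage points corresponds to 15 documents.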


Citation:

D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.
