# Document Classification
### The p < 0.05 Team: Haig Bedros, Noori Selina, Julia Ferris, Matthew Roland


It can be useful to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether a new e-mail is spam. One example of such data is the UCI Machine Learning Repository's Spambase Data Set.

For this project, we used the BBC Full Text Document Classification dataset from Kaggle. Each document in this dataset is labeled with one of five categories: business, entertainment, politics, sport, or tech.

The goal of our text classification is to predict the category of new documents in the test set.

Link to the dataset: https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification?resource=download

We trained three models: an NLTK Naive Bayes classifier, a linear Support Vector Machine (SVM), and a Random Forest. Each model was evaluated by its accuracy on a held-out 30% test set.

## 1. Load and Process the Documents

After extracting the dataset's zip file, we load the documents from each category folder, split each one into whitespace-delimited tokens, and store the tokens together with the category label.
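
The extraction step itself is not shown in the notebook. A minimal sketch of how it could be done with the standard library follows; the archive name `bbc_demo.zip`, the folder layout, and the helper `extract_dataset` are hypothetical, and a tiny stand-in archive is created first so the example runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_dataset(zip_path, target_dir):
    """Extract a zip archive and return the sorted relative paths of its .txt files."""
    target = Path(target_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return sorted(p.relative_to(target).as_posix() for p in target.rglob("*.txt"))

# Build a tiny stand-in archive (hypothetical layout: bbc/<category>/<file>.txt).
work_dir = Path(tempfile.mkdtemp())
demo_zip = work_dir / "bbc_demo.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("bbc/business/001.txt", "Shares fell sharply.")
    archive.writestr("bbc/sport/001.txt", "The match ended in a draw.")

extracted_files = extract_dataset(demo_zip, work_dir / "data")
print(extracted_files)
```

With the real download, `zip_path` would point at the Kaggle archive and `target_dir` at the parent of the `bbc` folder used as `base_path` below.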

In [3]:
import os

# Function to get all .txt files from a local directory
def get_txt_files_from_local_directory(category_path):
    files = []
    for filename in os.listdir(category_path):
        if filename.endswith('.txt'):
            files.append(filename)
    return files

base_path = "/Users/haigbedros/Desktop/MSDS/Summer 24/620/HW/documentClassification/data/bbc"
category_paths = {
    'business': os.path.join(base_path, 'business'),
    'entertainment': os.path.join(base_path, 'entertainment'),
    'politics': os.path.join(base_path, 'politics'),
    'sport': os.path.join(base_path, 'sport'),
    'tech': os.path.join(base_path, 'tech')
}

documents = []
all_words_list = []

for category, category_path in category_paths.items():
    txt_files = get_txt_files_from_local_directory(category_path)
    
    for filename in txt_files:
        file_path = os.path.join(category_path, filename)
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                words = file.read().split()
        except UnicodeDecodeError:
            with open(file_path, 'r', encoding='ISO-8859-1') as file:
                words = file.read().split()
        except Exception as e:
            print(f"Failed to read {file_path}: {e}")
            continue
        documents.append((words, category))
        all_words_list.extend(w.lower() for w in words)

# Data verification
print(f"Total documents: {len(documents)}")
print(f"Sample document: {documents[0]}")
print(f"Total words: {len(all_words_list)}")

Total documents: 2194
Sample document: (['Asian', 'quake', 'hits', 'European', 'shares', 'Shares', 'in', "Europe's", 'leading', 'reinsurers', 'and', 'travel', 'firms', 'have', 'fallen', 'as', 'the', 'scale', 'of', 'the', 'damage', 'wrought', 'by', 'tsunamis', 'across', 'south', 'Asia', 'has', 'become', 'apparent.', 'More', 'than', '23,000', 'people', 'have', 'been', 'killed', 'following', 'a', 'massive', 'underwater', 'earthquake', 'and', 'many', 'of', 'the', 'worst', 'hit', 'areas', 'are', 'popular', 'tourist', 'destinations.', 'Reisurance', 'firms', 'such', 'as', 'Swiss', 'Re', 'and', 'Munich', 'Re', 'lost', 'value', 'as', 'investors', 'worried', 'about', 'rebuilding', 'costs.', 'But', 'the', 'disaster', 'has', 'little', 'impact', 'on', 'stock', 'markets', 'in', 'the', 'US', 'and', 'Asia.', 'Currencies', 'including', 'the', 'Thai', 'baht', 'and', 'Indonesian', 'rupiah', 'weakened', 'as', 'analysts', 'warned', 'that', 'economic', 'growth', 'may', 'slow.', '"It', 'came', 'at', 'the', '

## 2. Feature Extraction
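Build binary bag-of-words features for the NLTK Naive Bayes classifier: each feature records whether a given vocabulary word appears in the document. The feature sets are then shuffled and split into training (70%) and test (30%) sets.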

In [4]:
import nltk
from nltk import FreqDist
import random

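# Count how often each lowercase word occurs across the whole corpus
all_words = FreqDist(w.lower() for w in all_words_list)
# Vocabulary of 1000 feature words. Note: in NLTK 3, list(all_words) follows
# first-seen order, so this is the first 1000 distinct words, not the 1000 most frequent.
word_features = list(all_words)[:1000]

def document_features(document):
    # Map each vocabulary word to True/False depending on whether it occurs in the document.
    # Note: word_features are lowercase, while document tokens keep their original case.
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features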

featuresets = [(document_features(d), c) for (d,c) in documents]

random.shuffle(featuresets)
total_samples = len(featuresets)
train_size = int(0.7 * total_samples)
train_set, test_set = featuresets[:train_size], featuresets[train_size:]

# data verification
print(f"Train set size: {len(train_set)}")
print(f"Test set size: {len(test_set)}")

Train set size: 1535
Test set size: 659
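
A frequency-ranked, case-consistent variant of the feature extraction can be sketched as follows. This is not the code that produced the results in this notebook: it uses `collections.Counter` in place of NLTK's `FreqDist`, and the toy documents, the vocabulary size of 3, and the helper name `document_features_lower` are assumptions made for illustration.

```python
from collections import Counter

# Toy stand-in for the (tokens, category) pairs built in section 1.
toy_documents = [
    (["Shares", "fell", "as", "the", "market", "slid"], "business"),
    (["The", "film", "won", "the", "award"], "entertainment"),
    (["The", "market", "rallied"], "business"),
]

# Count lowercase word frequencies across all documents.
word_counts = Counter(w.lower() for words, _ in toy_documents for w in words)

# most_common ranks words by frequency; 3 stands in for the notebook's 1000.
top_words = [word for word, _ in word_counts.most_common(3)]

def document_features_lower(document, vocabulary):
    """Binary presence features, comparing lowercase tokens to a lowercase vocabulary."""
    document_words = {w.lower() for w in document}
    return {f"contains({word})": (word in document_words) for word in vocabulary}

toy_features = document_features_lower(toy_documents[1][0], top_words)
print(top_words)
print(toy_features)
```

Lowercasing the document tokens means that "The" at the start of a sentence matches the vocabulary entry "the", which the original-case comparison would miss.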


## 3. Training and Evaluation of Naive Bayes Classifier
Train and evaluate the NLTK Naive Bayes classifier.

In [5]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print('NLTK Naive Bayes Accuracy:', nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)

NLTK Naive Bayes Accuracy: 0.9195751138088012
Most Informative Features
          contains(film) = True           entert : sport  =     96.2 : 1.0
    contains(technology) = True             tech : sport  =     96.1 : 1.0
        contains(market) = True           busine : sport  =     80.7 : 1.0
       contains(million) = True             tech : sport  =     74.4 : 1.0
    contains(government) = True           politi : sport  =     74.0 : 1.0
       contains(digital) = True             tech : busine =     67.7 : 1.0
      contains(minister) = True           politi : entert =     62.0 : 1.0
          contains(star) = True           entert : politi =     56.0 : 1.0
         contains(phone) = True             tech : sport  =     55.9 : 1.0
          contains(firm) = True             tech : entert =     54.2 : 1.0


**Naive Bayes Classifier Results:**
- **Accuracy**: 92%
- **Summary**: The NLTK Naive Bayes classifier achieved an accuracy of 92%, meaning it assigned 92% of the 659 test documents to the correct category (business, entertainment, politics, sport, or tech). The most informative features were the presence of words such as 'film', 'technology', 'market', 'million', and 'government'. Each of these is far more likely to appear in one category than in another; for example, 'film' was about 96 times more likely in entertainment documents than in sport documents.
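
The ratios in the "Most Informative Features" table compare the estimated probability of a feature value under two labels. A minimal sketch of that arithmetic follows, using made-up counts and assuming add-0.5 smoothing over the two values of a binary feature; the counts, the smoothing constant, and the helper `smoothed_prob` are illustrative assumptions, not values taken from the trained classifier.

```python
def smoothed_prob(count_true, n_docs, gamma=0.5, bins=2):
    """Estimate P(feature is True | label) with additive smoothing.

    gamma is the pseudo-count added to each of the `bins` possible feature values.
    """
    return (count_true + gamma) / (n_docs + gamma * bins)

# Hypothetical counts: training documents per label that contain the word "film".
p_film_entertainment = smoothed_prob(count_true=120, n_docs=270)
p_film_sport = smoothed_prob(count_true=1, n_docs=350)

# A ratio far above 1 means the word is strong evidence for the first label.
likelihood_ratio = p_film_entertainment / p_film_sport
print(f"entert : sport = {likelihood_ratio:.1f} : 1.0")
```

Smoothing keeps the ratio finite even when a word never appears under one of the labels.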

## 4. Prepare Data for SVM Classifier using TF-IDF
Convert the documents to TF-IDF features, keeping the 2,000 most frequent terms in the corpus, for use with the SVM and Random Forest classifiers.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

def document_to_string(document):
    return ' '.join(document)

documents_str = [document_to_string(doc) for doc, _ in documents]
labels = [label for _, label in documents]

vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(documents_str)
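
To make the TF-IDF weighting concrete, the sketch below computes it by hand for a toy corpus. It follows the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. Tokenization is simplified to whitespace splitting, and the corpus and the helper `tfidf_vector` are assumptions made for illustration.

```python
import math
from collections import Counter

toy_corpus = [
    "market shares fell",
    "film star wins award",
    "market rallies as shares rise",
]
tokenized = [doc.split() for doc in toy_corpus]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_vector = tfidf_vector(tokenized[0])
print(first_vector)
```

In the first document, "fell" receives a larger weight than "market" or "shares" because it occurs in only one document of the corpus.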


Split the data into training (70%) and testing (30%) sets.

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)
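
The notebook fits the `TfidfVectorizer` on all documents before splitting, so the vocabulary and IDF weights also reflect the test documents. One way to keep the test set fully unseen is to split the raw strings first and wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below uses a toy corpus; the documents, labels, and variable names are assumptions, and this is an alternative arrangement, not the code behind the reported results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for documents_str and labels.
toy_docs = [
    "shares fell as the market slid",
    "the bank raised interest rates",
    "profits rose at the oil firm",
    "the firm reported record sales",
    "the striker scored two goals",
    "the team won the cup final",
    "the coach praised the players",
    "the match ended in a draw",
]
toy_labels = ["business"] * 4 + ["sport"] * 4

# Split the raw text first so that the vectorizer never sees the test documents.
docs_train, docs_test, labels_train, labels_test = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training documents only, then trains the SVM.
text_pipeline = make_pipeline(TfidfVectorizer(max_features=2000), LinearSVC())
text_pipeline.fit(docs_train, labels_train)

toy_predictions = text_pipeline.predict(docs_test)
print(list(zip(labels_test, toy_predictions)))
```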

## 5. Train and Evaluate the SVM Classifier
Train a linear SVM (`LinearSVC`) on the TF-IDF training features and evaluate it on the test set.

In [8]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

svm_classifier = LinearSVC()
svm_classifier.fit(X_train, y_train)

svm_predictions = svm_classifier.predict(X_test)

svm_accuracy = accuracy_score(y_test, svm_predictions)
print(f'SVM Accuracy: {svm_accuracy}')
print(classification_report(y_test, svm_predictions))

SVM Accuracy: 0.9696509863429439
               precision    recall  f1-score   support

     business       0.94      0.96      0.95       150
entertainment       0.97      0.95      0.96       119
     politics       0.95      0.97      0.96       118
        sport       0.99      0.99      0.99       149
         tech       0.99      0.97      0.98       123

     accuracy                           0.97       659
    macro avg       0.97      0.97      0.97       659
 weighted avg       0.97      0.97      0.97       659



**SVM Classifier Results:**
- **Accuracy**: 97%
- **Summary**: The SVM classifier achieved an accuracy of 97% on the 659 test documents. Per-category F1-scores ranged from 0.95 (business) to 0.99 (sport), so performance was strong across all five categories (business, entertainment, politics, sport, and tech).
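
The precision, recall, and F1 columns in the classification report are computed per category from the true and predicted labels. A minimal sketch of those definitions on a toy example follows; the labels and the helper `per_class_metrics` are assumptions made for illustration.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    false_pos = sum(1 for t, p in pairs if t != label and p == label)
    false_neg = sum(1 for t, p in pairs if t == label and p != label)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy labels: one tech document is misclassified as business.
toy_true = ["sport", "sport", "tech", "tech", "business", "business"]
toy_pred = ["sport", "sport", "tech", "business", "business", "business"]

business_metrics = per_class_metrics(toy_true, toy_pred, "business")
tech_metrics = per_class_metrics(toy_true, toy_pred, "tech")
print("business:", business_metrics)
print("tech:", tech_metrics)
```

In this toy case the misclassified document lowers precision for business (a false positive) and recall for tech (a false negative), which is the kind of pattern the per-category rows of the report make visible.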


## 6. Train and Evaluate the Random Forest Classifier
Train and evaluate the Random Forest classifier on the same TF-IDF features and train/test split.

In [9]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

rf_predictions = rf_classifier.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f'Random Forest Accuracy: {rf_accuracy}')
print(classification_report(y_test, rf_predictions))

Random Forest Accuracy: 0.9590288315629742
               precision    recall  f1-score   support

     business       0.91      0.97      0.94       150
entertainment       0.97      0.93      0.95       119
     politics       0.97      0.95      0.96       118
        sport       1.00      0.99      0.99       149
         tech       0.96      0.94      0.95       123

     accuracy                           0.96       659
    macro avg       0.96      0.96      0.96       659
 weighted avg       0.96      0.96      0.96       659



**Random Forest Classifier Results:**
- **Accuracy**: 96%
- **Summary**: The Random Forest classifier achieved an accuracy of 96% on the 659 test documents. Per-category F1-scores ranged from 0.94 (business) to 0.99 (sport). Business had the lowest precision (0.91), while sport was classified almost perfectly (precision 1.00, recall 0.99).

## 7. Conclusion
- **Naive Bayes Classifier**:
  - **Accuracy**: 92%
  - **Summary**: The Naive Bayes classifier correctly classified most documents and identified key words for each category.

- **SVM Classifier**:
  - **Accuracy**: 97%
  - **Summary**: The SVM classifier was the most accurate, effectively classifying documents with very high precision and recall.

- **Random Forest Classifier**:
  - **Accuracy**: 96%
  - **Summary**: The Random Forest classifier also performed well, correctly classifying a large majority of documents.

- **Key Outcome**:
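  - The SVM classifier had the highest accuracy (97%), narrowly ahead of the Random Forest (96%) and clearly ahead of Naive Bayes (92%). Because Naive Bayes used binary word-presence features and a different random split than the two TF-IDF models, the comparison with it is indicative rather than exact.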
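
One way to compare the three model families on identical data is to cross-validate them on the same folds. The sketch below does this on a toy corpus. It swaps the NLTK Naive Bayes classifier for scikit-learn's `MultinomialNB` so that all models share the TF-IDF features; the documents, labels, and the choice of two folds are assumptions, and this is not how the accuracies above were obtained.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for documents_str and labels.
toy_docs = [
    "shares fell as the market slid",
    "the bank raised interest rates",
    "profits rose at the oil firm",
    "the firm reported record sales",
    "the striker scored two goals",
    "the team won the cup final",
    "the coach praised the players",
    "the match ended in a draw",
]
toy_labels = ["business"] * 4 + ["sport"] * 4

candidate_models = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# The same stratified folds are reused for every model.
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

mean_scores = {}
for name, model in candidate_models.items():
    # TF-IDF is refit inside each fold, so held-out documents stay unseen.
    pipeline = make_pipeline(TfidfVectorizer(max_features=2000), model)
    scores = cross_val_score(pipeline, toy_docs, toy_labels, cv=folds, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {mean_scores[name]:.2f}")
```

On the full dataset, a larger number of folds (for example 5) would give more stable estimates than the two folds used here.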


Citation:

D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.
