We begin by importing the required libraries and loading the data needed by NLTK and spaCy.

In [1]:
import math
import os
import random

from pprint import pprint
from typing import List, Dict

import nltk
import numpy as np
import spacy
# Download the required dataset from NLTK
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from tqdm.notebook import tqdm

# If this fails, please run `python -m spacy download en_core_web_sm`
nlp = spacy.load("en_core_web_sm")

Two functions are then defined to load the data from the text files into memory.

In [2]:
def load_corpus(folder: str) -> List[str]:
    """Load strings from folder of text

    Args:
        folder (str): The path to the folder to load

    Returns:
        List[str]: List of strings retrieved from text files in the folder
    """
    corpus = []
    for root, _dirs, files in os.walk(folder, topdown=False):
        for name in files:
            try:
                with open(os.path.join(root, name), "r") as fp:
                    corpus.append(fp.read())
            except UnicodeDecodeError:
                # Skip files that cannot be decoded with the default encoding
                continue
    return corpus

def load_corpuses(folder: str) -> Dict[str, List[str]]:
    """Load corpuses from sub-folders of specified folder

    Args:
        folder (str): The parent folder

    Returns:
        Dict[str, List[str]]: Dictionary of corpuses
    """
    # Only the immediate sub-folders are corpuses; nested folders are read by load_corpus
    sub_folders = sorted(
        entry.name for entry in os.scandir(folder) if entry.is_dir()
    )

    corpuses = {}
    for sub_folder in sub_folders:
        corpuses[sub_folder] = load_corpus(os.path.join(folder, sub_folder))
    return corpuses
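
`load_corpus` drops any file that cannot be decoded. As a possible alternative, the sketch below keeps such files by replacing the undecodable bytes. The name `load_corpus_lenient` and the choice of UTF-8 are assumptions made for illustration; they are not part of the notebook.

```python
from pathlib import Path
from typing import List


def load_corpus_lenient(folder: str, encoding: str = "utf-8") -> List[str]:
    """Hypothetical variant of load_corpus that never skips a file."""
    corpus = []
    # rglob("*") walks the folder recursively; sorting makes the order deterministic
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
            corpus.append(path.read_text(encoding=encoding, errors="replace"))
    return corpus
```

Whether replacing bytes is preferable to skipping a file depends on how many files are affected; for a handful of stories either choice should have little effect on the classifier.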

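We then build our dataset from these corpuses. Each story is stored alongside the name of its corpus, which serves as its label. The $x$ vector later extracted for each story is made from three features:
- Word frequencies
- Frequency of named entity types
- Weighted (TF-IDF) word frequencies

Once constructed, the data set is shuffled.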

In [3]:
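x = []
y = []

corpuses = load_corpuses("bbc")

# Pair every story with the name of the corpus (category) it came from
for category, stories in corpuses.items():
    for story in stories:
        x.append(story)
        y.append(category)

# Shuffle stories and labels together so that each pair stays aligned
c = list(zip(x, y))
random.shuffle(c)
x, y = zip(*c)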

The data is then split at random into test and training sets in a 20:80 ratio. The $Y$ values are encoded as integers so that they can be used as labels by the SVM.

In [4]:
size_dataset_full = len(x)
size_test = int(round(size_dataset_full * 0.2, 0))

# A set gives constant-time membership checks in the loop below
test_indices = set(random.sample(range(size_dataset_full), size_test))

test_x = []
test_y = []
train_x = []
train_y = []

for i, example in enumerate(x):
    if i in test_indices:
        test_x.append(example)
        test_y.append(y[i])
    else:
        train_x.append(example)
        train_y.append(y[i])

le = preprocessing.LabelEncoder()
le.fit(train_y)
train_y = le.transform(train_y)
test_y = le.transform(test_y)
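
Because neither the shuffle nor the sampling is seeded, every run produces a different split. A minimal sketch of a repeatable 20:80 split follows. The helper name `split_dataset` and its `seed` parameter are hypothetical and do not appear in the notebook.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    x: Sequence[str], y: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Hypothetical helper: returns train_x, train_y, test_x, test_y."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    size_test = int(round(len(x) * test_fraction))
    test_indices = set(rng.sample(range(len(x)), size_test))
    train_x, train_y, test_x, test_y = [], [], [], []
    for i, (example, label) in enumerate(zip(x, y)):
        if i in test_indices:
            test_x.append(example)
            test_y.append(label)
        else:
            train_x.append(example)
            train_y.append(label)
    return train_x, train_y, test_x, test_y
```

Calling the helper twice with the same `seed` returns the same partition, which makes a classification report comparable between runs.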

In [5]:
def feature_extraction(stories: List[str]) -> List[List[float]]:
    """Extracts features from a list of strings

    Relies on the module-level `vectorizer` and `tfid` objects having been fitted.

    Args:
        stories (List[str]): Strings to extract features from

    Returns:
        List[List[float]]: List of vectors which can be used in a model
    """
    # Fitting on the spaCy entity labels fixes the order of the entity-type columns
    entity_types = CountVectorizer(stop_words=stopwords.words('english'))
    entity_types.fit(['CARDINAL', 'PERSON', 'GPE', 'MONEY', 'ORG', 'ORDINAL', 'WORK_OF_ART', 'NORP', 'PERCENT', 'DATE', 'LANGUAGE', 'FAC', 'LOC', 'TIME', 'PRODUCT', 'EVENT', 'QUANTITY', 'LAW'])
    processed_stories = []
    for story in tqdm(stories):
        analysed = nlp(story)
        word_counts = vectorizer.transform([story])
        # Join the labels into one document so that every entity in the story is
        # counted; passing the labels as a list would keep only the first entity
        entity_labels = " ".join(tag.label_ for tag in analysed.ents)
        processed_stories.append(
            list(word_counts.toarray()[0]) +
            list(entity_types.transform([entity_labels]).toarray()[0]) +
            list(tfid.transform(word_counts).toarray()[0])
        )
    return processed_stories
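
The entity-type feature is meant to record how often each kind of named entity occurs in a story. The idea can be sketched without spaCy, assuming the labels have already been extracted (for example from `ent.label_`). The shortened label list and the function name `entity_type_counts` are illustrative assumptions.

```python
from collections import Counter
from typing import List, Sequence

# Fixed label order so that every story maps to a vector with the same layout.
# This is an illustrative subset of the labels used in the notebook.
ENTITY_LABELS = ["CARDINAL", "PERSON", "GPE", "MONEY", "ORG", "DATE"]


def entity_type_counts(
    labels: Sequence[str], vocabulary: Sequence[str] = ENTITY_LABELS
) -> List[int]:
    """Count how often each label in `vocabulary` occurs in `labels`."""
    counts = Counter(labels)
    # Labels outside the vocabulary are ignored; missing labels count as zero
    return [counts[label] for label in vocabulary]
```

A story with no recognised entities simply yields a vector of zeros, so the feature always has the same length.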

We then fit two of the feature extractors used by `feature_extraction`. `CountVectorizer` builds a vocabulary from the previously loaded training data. `TfidfTransformer` is then fitted on the count matrix produced by `CountVectorizer`.
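
To make the weighting concrete, the sketch below computes TF-IDF from a small count matrix in plain Python. It uses a smoothed inverse document frequency, $\ln\frac{1+n}{1+\mathrm{df}}+1$, followed by L2 normalisation of each row, which to our understanding matches the default settings of `TfidfTransformer`. Treat it as an illustration rather than a description of the library's internals.

```python
import math
from typing import List


def tfidf(counts: List[List[int]]) -> List[List[float]]:
    """TF-IDF weights for a documents-by-terms count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    weighted = []
    for row in counts:
        scaled = [tf * w for tf, w in zip(row, idf)]
        # Scale each document to unit Euclidean length
        norm = math.sqrt(sum(v * v for v in scaled))
        weighted.append([v / norm if norm else 0.0 for v in scaled])
    return weighted
```

A term that appears in every document receives the smallest weight, while a term confined to a few documents is boosted.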

Combining the three feature vectors produces very large $x$ vectors to train on, so the best 500 features are selected using the $\chi^2$ method.
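
For each feature, the $\chi^2$ score compares the total observed in each class, $O_c$, with the total expected if the feature were independent of the class, $E_c$: $\chi^2 = \sum_c (O_c - E_c)^2 / E_c$. The plain-Python sketch below follows that formulation, which we believe is the one used by scikit-learn's `chi2`. The function name `chi2_scores` is hypothetical.

```python
from typing import Dict, List, Sequence


def chi2_scores(rows: Sequence[Sequence[float]], labels: Sequence[str]) -> List[float]:
    """Chi-squared score of each non-negative feature column against the labels."""
    n_samples = len(rows)
    n_features = len(rows[0])
    class_sizes: Dict[str, int] = {}
    observed: Dict[str, List[float]] = {}
    for row, label in zip(rows, labels):
        class_sizes[label] = class_sizes.get(label, 0) + 1
        totals = observed.setdefault(label, [0.0] * n_features)
        for j, value in enumerate(row):
            totals[j] += value  # observed feature total within this class
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in rows)
        score = 0.0
        for label, size in class_sizes.items():
            # Expected total if the feature were spread evenly over the samples
            expected = feature_total * size / n_samples
            if expected > 0:
                score += (observed[label][j] - expected) ** 2 / expected
        scores.append(score)
    return scores
```

Features that are concentrated in one class score highly, while features spread evenly across classes score close to zero. Keeping the $k$ highest-scoring columns is the selection step.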

In [6]:
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
tfid = TfidfTransformer()

vectorizer.fit(train_x)
tfid.fit(vectorizer.transform(train_x))

train_x = feature_extraction(train_x)
print(type(train_x))

get_best=SelectKBest(chi2, k=500).fit(train_x, train_y)
train_x_chi = get_best.transform(train_x)

  0%|          | 0/1779 [00:00<?, ?it/s]

<class 'list'>


In [7]:
print(test_y)

[4 4 0 0 1 1 2 2 4 2 4 3 1 1 2 2 2 3 1 1 3 0 1 0 1 2 4 1 4 0 3 2 2 1 0 1 3
 3 0 0 4 1 0 0 1 1 0 2 4 1 1 2 2 2 2 3 0 4 1 3 2 0 1 1 4 2 3 1 0 4 1 1 4 1
 2 3 1 3 4 4 1 0 2 4 0 2 1 1 0 4 3 2 1 0 1 0 2 0 3 4 4 4 4 0 3 4 1 4 0 1 3
 1 2 2 2 1 3 2 2 3 2 4 4 0 3 3 2 1 0 2 0 2 1 4 1 2 1 4 0 4 4 3 4 1 3 2 1 2
 2 3 2 0 2 1 1 4 1 4 1 2 2 0 2 2 1 0 0 3 2 1 0 0 3 2 2 2 2 2 1 3 2 1 2 2 1
 3 3 4 4 2 3 3 2 3 0 4 3 0 3 4 4 2 3 3 1 0 3 2 2 4 3 2 0 3 1 4 0 1 4 3 2 0
 3 1 0 3 4 4 3 1 0 4 1 4 0 4 4 2 2 0 0 2 0 0 1 3 0 2 4 2 0 4 4 0 0 0 4 4 1
 3 2 4 4 3 1 4 0 3 2 0 3 1 4 4 1 0 3 0 2 4 3 0 3 3 1 0 4 1 0 0 0 1 0 0 2 2
 3 3 0 0 3 4 0 4 4 3 1 2 0 1 2 0 3 3 3 1 0 2 0 1 4 4 4 0 0 3 2 3 1 4 4 4 3
 1 0 0 4 2 3 3 4 3 0 2 3 0 2 2 2 4 2 1 3 3 3 1 0 0 1 1 1 3 3 2 3 0 3 2 2 4
 2 1 4 2 2 2 1 3 0 0 0 3 1 0 1 2 0 3 1 1 2 0 0 3 1 3 4 1 4 4 1 1 3 2 2 3 4
 3 4 3 2 3 0 0 1 3 3 0 0 0 0 0 0 0 4 3 3 1 3 3 0 4 1 1 1 4 3 0 1 0 0 3 2 1
 1]


The SVM classifier is then constructed. The pipeline first passes the data through `StandardScaler`, which will "Standardize \[sic\] features by removing the mean and scaling to unit variance".

In [8]:
svm_clf=make_pipeline(StandardScaler(), svm.SVC(cache_size=10000, decision_function_shape='ovo'))
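
As a rough illustration of the standardisation step, the sketch below rescales a single feature column to zero mean and unit variance. It assumes the population standard deviation (dividing by $n$) and maps a constant column to zeros, which to our knowledge is also how `StandardScaler` behaves. The function name `standardise` is hypothetical.

```python
import statistics
from typing import List


def standardise(column: List[float]) -> List[float]:
    """Rescale one feature column to zero mean and unit variance."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation (divides by n)
    if std == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]
```

In the pipeline the mean and standard deviation are learned from the training data only, and the same values are then applied to any data passed to `predict`.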

We can then pass the training data to the SVM to train the model.

In [9]:
svm_clf.fit(train_x_chi, train_y)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(cache_size=10000, decision_function_shape='ovo'))])

Using this model, we can run the held-out test data through it in order to evaluate the SVM.

In [10]:
Y_text_predictions = svm_clf.predict(get_best.transform(feature_extraction(test_x)))

  0%|          | 0/445 [00:00<?, ?it/s]

The `classification_report` function allows us to easily generate a report on the success of the SVM by providing known good $Y$ values as well as $Y$ values attained through the SVM.

In [11]:
print(classification_report(test_y, Y_text_predictions, target_names=le.inverse_transform(svm_clf.classes_)))

               precision    recall  f1-score   support

     business       0.93      0.97      0.95        97
entertainment       1.00      0.92      0.96        91
     politics       0.98      0.96      0.97        89
        sport       0.99      0.99      0.99        89
         tech       0.94      1.00      0.97        79

     accuracy                           0.97       445
    macro avg       0.97      0.97      0.97       445
 weighted avg       0.97      0.97      0.97       445
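
The per-class columns of the report can be reproduced by hand. The sketch below computes precision, recall and F1 for each label from the true and predicted labels. The function name `per_class_scores` is hypothetical, and a score is set to zero when its denominator is zero, which is an assumption of this sketch.

```python
from typing import Dict, Sequence, Tuple


def per_class_scores(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[int, Tuple[float, float, float]]:
    """Map each label to its (precision, recall, f1) triple."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Precision: share of predictions for this label that were correct
        precision = tp / (tp + fp) if tp + fp else 0.0
        # Recall: share of true examples of this label that were found
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores
```

The `support` column of the report is simply the number of true examples of each label in the test set.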



To classify a new story, the string has to go through the same feature extraction and feature selection steps as the training data. This has been encapsulated in the following function:

In [12]:
def predict(story: str) -> str:
    """Gives a genre prediction for a news story

    Args:
        story (str): The plaintext of the story

    Returns:
        str: The genre of the story
    """
    return le.inverse_transform(
        svm_clf.predict(
            get_best.transform(
                feature_extraction([story])
            )
        )
    )[0]

We can then try this function with a news story.

In [13]:
predict("""
Greene sets sights on world title

Maurice Greene aims to wipe out the pain of losing his Olympic 100m title in Athens by winning a fourth World Championship crown this summer.

He had to settle for bronze in Greece behind fellow American Justin Gatlin and Francis Obikwelu of Portugal. "It really hurts to look at that medal. It was my mistake. I lost because of the things I did," said Greene, who races in Birmingham on Friday. "It's never going to happen again. My goal - I'm going to win the worlds." Greene crossed the line just 0.02 seconds behind Gatlin, who won in 9.87 seconds in one of the closest and fastest sprints of all time. But Greene believes he lost the race and his title in the semi-finals. "In my semi-final race, I should have won the race but I was conserving energy. "That's when Francis Obikwelu came up and I took third because I didn't know he was there. "I believe that's what put me in lane seven in the final and, while I was in lane seven, I couldn't feel anything in the race.

"I just felt like I was running all alone. "I believe if I was in the middle of the race I would have been able to react to people that came ahead of me." Greene was also denied Olympic gold in the 4x100m men's relay when he could not catch Britain's Mark Lewis-Francis on the final leg. The Kansas star is set to go head-to-head with Lewis-Francis again at Friday's Norwich Union Grand Prix. The pair contest the 60m, the distance over which Greene currently holds the world record of 6.39 seconds. He then has another indoor meeting in France before resuming training for the outdoor season and the task of recapturing his world title in Helsinki in August. Greene believes Gatlin will again prove the biggest threat to his ambitions in Finland. But he also admits he faces more than one rival for the world crown. "There's always someone else coming. I think when I was coming up I would say there was me and Ato (Boldon) in the young crowd," Greene said. "Now you've got about five or six young guys coming up at the same time."
""")

  0%|          | 0/1 [00:00<?, ?it/s]

'sport'