# Text Genre Classifier

Based on *Another Exercise: Classifying News Documents in Categories: sport, humor, adventure, science fiction, etc.* in [Natural Language Processing with Python/NLTK](https://github.com/luchux/ipython-notebook-nltk/blob/master/NLP%20-%20MelbDjango.ipynb) by Luciano M. Guasco.

## 1. Exploring the `brown` corpus

The corpus consists of 500 samples distributed across 15 genres. Each sample begins at a random sentence boundary in the article or other unit chosen and continues up to the first sentence boundary after 2,000 words.

1. **PRESS: Reportage** *(44 texts)*
2. **PRESS: Editorial** *(27 texts)*
3. **PRESS: Reviews** *(17 texts)*
4. **RELIGION** *(17 texts)*
5. **SKILLS AND HOBBIES** *(36 texts)*
6. **POPULAR LORE** *(48 texts)*
7. **BELLES-LETTRES** – Biography, Memoirs, etc. *(75 texts)*
8. **MISCELLANEOUS: US Government & House Organs** *(30 texts)*
9. **LEARNED** – Natural sciences, Medicine, Mathematics, etc. *(80 texts)*
10. **FICTION: General** *(29 texts)*
11. **FICTION: Mystery and Detective Fiction** *(24 texts)*
12. **FICTION: Science** *(6 texts)*
13. **FICTION: Adventure and Western** *(29 texts)*
14. **FICTION: Romance and Love Story** *(29 texts)*
15. **HUMOR** *(9 texts)*

In [1]:
from nltk.corpus import brown

In [2]:
print(brown.readme()[:105])

BROWN CORPUS

A Standard Corpus of Present-Day Edited American
English, for use with Digital Computers.




In [3]:
brown.fileids()[:5]

['ca01', 'ca02', 'ca03', 'ca04', 'ca05']

In [4]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [5]:
brown.sents('ca01')[0]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that',
 'any',
 'irregularities',
 'took',
 'place',
 '.']

## 2. Compiling a list of the most popular words in the corpus

We will pass every word in the corpus to `FreqDist`, which counts how many times each word appears. Python’s `str.isalpha()` method filters out punctuation tokens, at the cost of also dropping tokens such as `1` (very common), `Aug.`, `1913`, `13th` and `over-all`.

In [6]:
from nltk import FreqDist

word_freq_in_corpus = FreqDist(w.lower() for w in brown.words() if w.isalpha())
word_freq_in_corpus

FreqDist({'the': 69971, 'of': 36412, 'and': 28853, 'to': 26158, 'a': 23195, 'in': 21337, 'that': 10594, 'is': 10109, 'was': 9815, 'he': 9548, ...})
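
As a rough illustration of the counting and filtering step, the sketch below uses `collections.Counter` from the standard library in place of `FreqDist` (in NLTK 3, `FreqDist` is a `Counter` subclass). The token list is invented for the example.

```python
from collections import Counter

# Hypothetical tokens standing in for brown.words()
tokens = ['The', 'over-all', 'plan', ',', 'the', '1913', 'Aug.', 'plan', '13th', '.']

# Same filter and normalisation as the cell above:
# keep purely alphabetic tokens, then lowercase them
counts = Counter(w.lower() for w in tokens if w.isalpha())
print(counts)
```

Only `the` and `plan` survive the filter; the hyphenated, abbreviated and numeric tokens are all dropped.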

Next, we convert `word_freq_in_corpus` to a list of lists, so that each entry is a mutable `[word, count]` pair that is easy to sort and trim.

In [7]:
word_freq_in_corpus = list(map(list, word_freq_in_corpus.items()))
word_freq_in_corpus[:10]

[['the', 69971],
 ['fulton', 17],
 ['county', 155],
 ['grand', 48],
 ['jury', 67],
 ['said', 1961],
 ['friday', 60],
 ['an', 3740],
 ['investigation', 51],
 ['of', 36412]]

In [8]:
word_freq_in_corpus_sorted = sorted(word_freq_in_corpus, key=lambda x: x[1], reverse=True)
word_freq_in_corpus_sorted[:10]

[['the', 69971],
 ['of', 36412],
 ['and', 28853],
 ['to', 26158],
 ['a', 23195],
 ['in', 21337],
 ['that', 10594],
 ['is', 10109],
 ['was', 9815],
 ['he', 9548]]

In [9]:
top_1500_words = word_freq_in_corpus_sorted[:1500]

for list_item in top_1500_words:
    del list_item[1]

top_1500_words[:10]

[['the'],
 ['of'],
 ['and'],
 ['to'],
 ['a'],
 ['in'],
 ['that'],
 ['is'],
 ['was'],
 ['he']]

We need to flatten `top_1500_words`, which is now a list of one-element lists. We can do this by unpacking the list into its individual sublists and chaining them with `itertools.chain`. This returns an `itertools.chain` iterator that we can then convert to a list.

In [10]:
import itertools

chain = itertools.chain(*top_1500_words)
top_1500_words = list(chain)
top_1500_words[:10]

['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', 'was', 'he']
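
The sort, delete and flatten steps above can probably be collapsed into a single call, because `FreqDist` inherits `most_common()` from `collections.Counter`. The sketch below uses a plain `Counter` with made-up counts so that it runs without the corpus; `toy_freq` and `top_3_words` are illustrative names.

```python
from collections import Counter

# Made-up counts standing in for the FreqDist built in cell 6
toy_freq = Counter({'the': 7, 'of': 5, 'fulton': 1, 'and': 4, 'jury': 2})

# most_common(n) returns the n highest-count (word, count) pairs, highest first
top_3_words = [word for word, _count in toy_freq.most_common(3)]
print(top_3_words)  # ['the', 'of', 'and']
```

On the real data, the equivalent would be applied to the `FreqDist` from cell 6, before it is overwritten by the list-of-lists conversion.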

Let’s remove stop words from our top 1500 words.

In [11]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')


def filter_out_stopwords(words_list):
    return [word for word in words_list if word not in stop_words]


top_1500_words_filtered = filter_out_stopwords(top_1500_words)
top_1500_words_filtered[:10]

['one', 'would', 'said', 'new', 'could', 'time', 'two', 'may', 'first', 'like']
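
A small variation, sketched here under the assumption that the stop word list is all lowercase (as NLTK’s English list is): storing the stop words in a `set` makes each membership test constant-time, and lowercasing before the comparison also catches capitalised tokens. The stop word set below is a tiny hypothetical stand-in, not the NLTK list, and `filter_out_stopwords_ci` is an illustrative name.

```python
# Tiny hypothetical stand-in for stopwords.words('english')
toy_stop_words = {'he', 'was', 'of', 'the'}


def filter_out_stopwords_ci(words_list, stop_words=toy_stop_words):
    """Drop stop words, comparing case-insensitively."""
    return [word for word in words_list if word.lower() not in stop_words]


print(filter_out_stopwords_ci(['He', 'was', 'well', 'rid', 'of', 'her']))
# ['well', 'rid', 'her']
```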

## 3. Converting the corpus to a form suitable for classification

Each file in the corpus ultimately needs to be represented by a dictionary indicating the presence of the corpus’s most popular words in the particular file.

We will start by representing the corpus as a list of tuples, where each tuple contains a list of all the words in a file of the corpus and the category of the file.

In [12]:
words_and_categories_by_file = [
    (
        [item.lower() for item in filter_out_stopwords(brown.words(fileid)) if item.isalpha()],
        category,
    )
    for category in brown.categories()
    for fileid in brown.fileids(category)
]
print(words_and_categories_by_file[0][0][:10], words_and_categories_by_file[0][1])

['dan', 'morgan', 'told', 'would', 'forget', 'ann', 'turner', 'he', 'well', 'rid'] adventure

Note that `he` survives the stop word filter: `filter_out_stopwords` runs before the tokens are lowercased, so capitalised stop words such as `He` do not match the lowercase NLTK list. This does not affect the features extracted later, which only test for words in `top_1500_words_filtered`.


In [13]:
from random import shuffle

shuffle(words_and_categories_by_file)
print(words_and_categories_by_file[0][0][:10], words_and_categories_by_file[0][1])

['thomas', 'douglas', 'fifth', 'earl', 'selkirk', 'noble', 'humanitarian', 'scot', 'concerned', 'plight'] lore
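
`shuffle` is unseeded here, so the train/test split, and therefore the results reported in the following sections, will differ from run to run. If reproducibility matters, one option is a dedicated seeded generator. The seed value and the toy list below are arbitrary choices for this sketch.

```python
from random import Random

# Hypothetical (words, category) pairs standing in for the real list
toy_files = [(['dan', 'morgan'], 'adventure'), (['thomas', 'douglas'], 'lore'),
             (['fulton', 'county'], 'news'), (['god', 'church'], 'religion')]

first = list(toy_files)
second = list(toy_files)

# Two generators created with the same seed produce the same permutation
Random(42).shuffle(first)
Random(42).shuffle(second)
print(first == second)  # True
```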


We can now extract features that indicate the presence or absence of each of the top 1500 words (minus stop words) in a file.

In [14]:
def extract_word_presence_absence_features(word_list):
    words_set = set(word_list)
    features_dict = {}
    for word in top_1500_words_filtered:
        features_dict['has(%s)' % word] = word in words_set
    return features_dict


features_and_categories_by_file = [
    (extract_word_presence_absence_features(d), c) for (d, c) in words_and_categories_by_file
]

print('SOME EXAMPLE WORD PRESENCE AND ABSENCE FEATURES FOR THE FIRST CORPUS FILE')
print('has(one):', features_and_categories_by_file[0][0]['has(one)'])
print('has(would):', features_and_categories_by_file[0][0]['has(would)'])
print('has(said):', features_and_categories_by_file[0][0]['has(said)'])
print('has(new):', features_and_categories_by_file[0][0]['has(new)'])
print('has(time):', features_and_categories_by_file[0][0]['has(time)'])

SOME EXAMPLE WORD PRESENCE AND ABSENCE FEATURES FOR THE FIRST CORPUS FILE
has(one): True
has(would): True
has(said): False
has(new): True
has(time): True
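
To make the shape of a feature dictionary concrete, here is the same idea on a three-word vocabulary. `toy_vocabulary` and `extract_features` are illustrative names; the real function above uses `top_1500_words_filtered`.

```python
toy_vocabulary = ['one', 'said', 'time']


def extract_features(word_list, vocabulary=toy_vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs."""
    words_set = set(word_list)  # set lookup keeps each membership test cheap
    return {'has(%s)' % word: word in words_set for word in vocabulary}


print(extract_features(['one', 'time', 'one', 'plight']))
# {'has(one)': True, 'has(said)': False, 'has(time)': True}
```

Repeated words and words outside the vocabulary (`plight` here) have no effect: only presence or absence of vocabulary words is recorded.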


## 4. Building the classifier

In [15]:
from nltk import NaiveBayesClassifier

TRAIN_TEST_SPLIT = 0.8
TRAIN_SET_SIZE = round(len(features_and_categories_by_file) * TRAIN_TEST_SPLIT)
train_set = features_and_categories_by_file[:TRAIN_SET_SIZE]
test_set = features_and_categories_by_file[TRAIN_SET_SIZE:]

classifier = NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(10)

Most Informative Features
               has(wife) = True            humor : learne =     33.6 : 1.0
              has(music) = True           review : learne =     29.7 : 1.0
               has(dark) = True           advent : learne =     29.7 : 1.0
              has(woman) = True           fictio : learne =     28.4 : 1.0
                has(god) = True           religi : learne =     28.0 : 1.0
          has(telephone) = True            humor : belles =     26.1 : 1.0
            has(watched) = True           advent : learne =     26.1 : 1.0
            has(waiting) = True           myster : learne =     25.0 : 1.0
             has(walked) = True           myster : learne =     25.0 : 1.0
            has(playing) = True           review : learne =     24.5 : 1.0
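
Each ratio in this table compares how likely a feature value is under two categories. The sketch below reproduces that kind of calculation for made-up counts. It assumes expected-likelihood (add-0.5) smoothing over the two values `True` and `False`, which we take to be NLTK’s default estimator (`ELEProbDist`) for `NaiveBayesClassifier.train`; treat the function and the numbers as illustrative rather than as NLTK’s exact code path.

```python
def smoothed_prob(docs_with_feature, docs_in_category, bins=2, gamma=0.5):
    """P(feature = True | category) with add-gamma smoothing over `bins` values."""
    return (docs_with_feature + gamma) / (docs_in_category + bins * gamma)


# Made-up counts: 'wife' appears in 6 of 7 humor files and 1 of 64 learned files
p_humor = smoothed_prob(6, 7)
p_learned = smoothed_prob(1, 64)

print('humor : learned = %.1f : 1.0' % (p_humor / p_learned))
```

Small categories such as humor can produce large ratios from a handful of files, which is worth keeping in mind when reading the table.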


## 5. Testing the classifier

In [16]:
from nltk.classify import accuracy

round(accuracy(classifier, test_set), 2)

0.63
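
`accuracy` here is simply the fraction of test files whose predicted category matches the true one. A minimal sketch of that calculation, with invented predictions and labels (`fraction_correct` is an illustrative name):

```python
def fraction_correct(predicted, actual):
    """Share of positions where the predicted label equals the true label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


predicted = ['news', 'learned', 'lore', 'learned']
actual = ['news', 'religion', 'lore', 'learned']
print(fraction_correct(predicted, actual))  # 0.75
```

With 500 files and an 80/20 split, the test set holds 100 files, so 0.63 corresponds to 63 correctly classified files.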

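Let’s try to classify the `ca01` file, which is under the “news” category. Because the corpus was shuffled before the 80/20 split, this file was most likely part of the training set, so this is a sanity check rather than an independent test.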

In [17]:
classifier.classify(extract_word_presence_absence_features(brown.words('ca01')))

'news'

We will now try classifying our own text. It needs to be long enough to contain a significant number of the top 1500 words, and needs to belong to a clear category.

We will use the first section of the prologue of the Catechism of the Catholic Church, which should be classified under “religion”.

In [18]:
catechism_text = '''1 God, infinitely perfect and blessed in himself, in a plan of sheer goodness
                    freely created man to make him share in his own blessed life. For this reason,
                    at every time and in every place, God draws close to man. He calls man to seek
                    him, to know him, to love him with all his strength. He calls together all men,
                    scattered and divided by sin, into the unity of his family, the Church. To
                    accomplish this, when the fullness of time had come, God sent his Son as
                    Redeemer and Saviour. In his Son and through him, he invites men to become, in
                    the Holy Spirit, his adopted children and thus heirs of his blessed life.
                    2 So that this call should resound throughout the world, Christ sent forth the
                    apostles he had chosen, commissioning them to proclaim the gospel: "Go
                    therefore and make disciples of all nations, baptizing them in the name of the
                    Father and of the Son and of the Holy Spirit, teaching them to observe all that
                    I have commanded you; and lo, I am with you always, to the close of the age."
                    Strengthened by this mission, the apostles "went forth and preached everywhere,
                    while the Lord worked with them and confirmed the message by the signs that
                    attended it."
                    3 Those who with God's help have welcomed Christ's call and freely responded to
                    it are urged on by love of Christ to proclaim the Good News everywhere in the
                    world. This treasure, received from the apostles, has been faithfully guarded
                    by their successors. All Christ's faithful are called to hand it on from
                    generation to generation, by professing the faith, by living it in fraternal
                    sharing, and by celebrating it in liturgy and prayer.'''

In [19]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
catechism_text_tokens = filter_out_stopwords(tokenizer.tokenize(catechism_text.lower()))
catechism_text_tokens = [w for w in catechism_text_tokens if w.isalpha()]
catechism_text_tokens[:5]

['god', 'infinitely', 'perfect', 'blessed', 'plan']
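
With its default settings, `RegexpTokenizer(r'\w+')` keeps each run of word characters as a token, which should behave like `re.findall` with the same pattern. The standard-library sketch below shows the effect on a short phrase from the text, including how the possessives are split.

```python
import re

sample = "3 Those who with God's help have welcomed Christ's call"

# Each maximal run of letters, digits or underscores becomes one token
tokens = re.findall(r'\w+', sample.lower())
print(tokens)
# ['3', 'those', 'who', 'with', 'god', 's', 'help', 'have', 'welcomed', 'christ', 's', 'call']

# The isalpha() filter then removes the paragraph number
print([t for t in tokens if t.isalpha()][:3])  # ['those', 'who', 'with']
```

The stray `s` tokens are harmless here, since `s` is in NLTK’s English stop word list and is not one of the feature words.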

In [20]:
classifier.classify(extract_word_presence_absence_features(catechism_text_tokens))

'learned'

The classifier returns `learned` rather than the expected `religion`. One plausible reason is that the passage is far shorter than the roughly 2,000-word corpus samples, so most of its `has(...)` features are `False`.