
# Tutorial 7: Classifying News Documents into Categories

Based on "Another Exercise: Classifying News Documents in Categories: sport, humor, adventure, science fiction, etc." in [Natural Language Processing with Python/NLTK by Luciano M. Guasco](https://github.com/luchux/ipython-notebook-nltk/blob/master/NLP%20-%20MelbDjango.ipynb)

**Exploring the Brown corpus**

The Corpus consists of 500 samples, distributed across 15 genres. Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words.

A. PRESS: Reportage (44 texts)

B. PRESS: Editorial (27 texts)

C. PRESS: Reviews (17 texts)

D. RELIGION (17 texts)

E. SKILL AND HOBBIES (36 texts)

F. POPULAR LORE (48 texts)

G. BELLES-LETTRES - Biography, Memoirs, etc. (75 texts)

H. MISCELLANEOUS: US Government & House Organs (30 texts)

J. LEARNED - Natural sciences, Medicine, Mathematics, etc. (80 texts)

K. FICTION: General (29 texts)

L. FICTION: Mystery and Detective Fiction (24 texts)

M. FICTION: Science (6 texts)

N. FICTION: Adventure and Western (29 texts)

P. FICTION: Romance and Love Story (29 texts)

R. HUMOR (9 texts)
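
The genre sizes listed above are uneven, which is worth keeping in mind when judging a classifier's accuracy. As a rough sanity check, the sketch below stores the counts in a plain dictionary and computes the accuracy of always guessing the largest genre. The name `genre_sizes` is only illustrative, and the keys assume NLTK's category names for the genres listed above.

```python
# Text counts per genre, copied from the list above.
# Keys assume the category names that NLTK uses for these genres.
genre_sizes = {
    "news": 44, "editorial": 27, "reviews": 17, "religion": 17,
    "hobbies": 36, "lore": 48, "belles_lettres": 75, "government": 30,
    "learned": 80, "fiction": 29, "mystery": 24, "science_fiction": 6,
    "adventure": 29, "romance": 29, "humor": 9,
}

total = sum(genre_sizes.values())
largest = max(genre_sizes, key=genre_sizes.get)

# Accuracy of a trivial classifier that always predicts the largest genre.
majority_baseline = genre_sizes[largest] / total

print(total)              # 500
print(largest)            # learned
print(majority_baseline)  # 0.16
```

Any classifier trained on this corpus should comfortably beat that baseline to be considered useful.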

Download the Brown corpus and print its README with newlines replaced by spaces

In [None]:
import nltk
nltk.download('brown')

from nltk.corpus import brown

brown.readme().replace('\n', ' ')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


'BROWN CORPUS  A Standard Corpus of Present-Day Edited American English, for use with Digital Computers.  by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA  Revised 1971, Revised and Amplified 1979  http://www.hit.uib.no/icame/brown/bcm.html  Distributed with the permission of the copyright holder, redistribution permitted. '

List the file IDs in the Brown corpus

In [None]:
brown.fileids()

Get the categories (text genres) in the Brown corpus

In [None]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

Print the first sentence of a specific file (`ca01`) in the Brown corpus

In [None]:
brown.sents('ca01')[0]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that',
 'any',
 'irregularities',
 'took',
 'place',
 '.']

Compile a list of the most frequent words in the corpus

`FreqDist` takes a sequence of tokens and counts how often each distinct token occurs.

In [None]:
from nltk import FreqDist
# Keeping only alphabetic tokens (str.isalpha) excludes punctuation tokens such as `` and '', which are very common.
# It also drops tokens such as 1 (very common), aug., 1913, $30, 13th and over-all. str.isalnum() would keep 1, 1913 and 13th but still drop the others.
words_in_corpora = FreqDist(w.lower() for w in brown.words() if w.isalpha()) 
#words_in_corpora
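
To make the effect of the `isalpha()` filter concrete, here is a small sketch on a made-up token list. It uses `collections.Counter` rather than `FreqDist` so that it runs without NLTK; `FreqDist` is a `Counter` subclass, so the counting behaves the same way. The token list and variable names are illustrative only.

```python
from collections import Counter

# A hypothetical handful of tokens in the style of the Brown corpus.
tokens = ["The", "jury", "said", "``", "no", "''", "the",
          "1913", "13th", "over-all", "$30", "."]

# Lowercase and count, keeping only purely alphabetic tokens.
alpha_counts = Counter(t.lower() for t in tokens if t.isalpha())

# Same, but keeping tokens made of letters and/or digits.
alnum_counts = Counter(t.lower() for t in tokens if t.isalnum())

print(alpha_counts["the"])                            # 2
print(sorted(set(alnum_counts) - set(alpha_counts)))  # ['13th', '1913']
```

Tokens containing punctuation, such as `over-all` and `$30`, are dropped by both filters.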

Convert the (word, count) pairs into a list of two-element lists rather than tuples. Lists are mutable, so the count can later be deleted from each pair; the tuples that `sorted()` would return from `.items()` are immutable.


In [None]:
words_in_corpora_freq_sorted = list(map(list, words_in_corpora.items()))
#words_in_corpora_freq_sorted

Sort the words by frequency, most frequent first

In [None]:
words_in_corpora_freq_sorted.sort(key=lambda x: x[1], reverse=True) # Sort by count (index 1); operator.itemgetter(1) would work equally well as the key.
words_in_corpora_freq_sorted

Keep the 1500 most frequent words and delete the count (the item at index 1) from each pair


In [None]:
best1500 = words_in_corpora_freq_sorted[:1500]

for list_item in best1500:
    del list_item[1]

#best1500

Each item of `best1500` is now a one-element list holding a single word, so the list should be flattened into a plain list of words.

`itertools.chain(*best1500)` receives each sublist as a separate argument and yields their elements one after another.

Because `chain` removes exactly one level of nesting, this approach can be used to flatten any list of lists.

In [None]:
import itertools

chain = itertools.chain(*best1500) 
best1500 = list(chain) # chain is a lazy iterator (type itertools.chain), so convert it to a list
#best1500
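
The sort, truncate and flatten steps above can also be condensed into a single call to `most_common`, which `FreqDist` inherits from `collections.Counter`. The sketch below is an alternative, not the tutorial's own method; it uses a toy `Counter` with made-up counts so that it runs without NLTK.

```python
from collections import Counter

# Toy counts standing in for words_in_corpora (values are made up).
word_counts = Counter({"the": 50, "of": 30, "jury": 4, "said": 9, "election": 2})

top_n = 3  # the tutorial keeps 1500

# most_common returns (word, count) pairs sorted by descending count;
# the comprehension keeps only the word from each pair.
top_words = [word for word, _count in word_counts.most_common(top_n)]

print(top_words)  # ['the', 'of', 'said']
```

The result is already a flat list of words, so no deletion or chaining is needed.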

Define a function that takes a list of words and returns it with the English stop words removed

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stopw = stopwords.words('english')

def nonstop(listwords):
    return [word for word in listwords if word not in stopw]

best1500_words_corpora = nonstop(best1500) # Note that this will contain fewer than 1500 words once stop words are removed.
#best1500_words_corpora
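
`nonstop` checks each word against a list, which scans the list on every lookup. A possible variation, sketched below under the assumption that lookup speed matters, converts the stop words to a set first. The short `stop_list` and the name `nonstop_fast` are hypothetical; in the tutorial the list would come from `stopwords.words('english')`.

```python
# A tiny hypothetical stop word list; the tutorial uses stopwords.words('english').
stop_list = ["the", "of", "and", "to", "in"]

# Membership tests on a set take constant time on average.
stop_set = frozenset(stop_list)

def nonstop_fast(words):
    """Return the words that are not stop words, preserving their order."""
    return [word for word in words if word not in stop_set]

print(nonstop_fast(["the", "jury", "said", "in", "atlanta"]))
# ['jury', 'said', 'atlanta']
```

The output is identical to the list-based version; only the lookup cost changes.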

***Converting corpus to form suitable for classification:*** Each file in the corpus will eventually be represented by a dictionary showing the presence of the corpus’ most popular words in the particular file.

In [None]:
# documents = [(nonstop(brown.words(fileid)), category) for category in brown.categories() for fileid in brown.fileids(category)]
# documents # Note how documents is a list of tuples.

# The code above builds a representation of the corpus without removing punctuation or lowercasing.
# This version lowercases each token before filtering, so that capitalized stop words such as "The" are removed too:
documents = [(nonstop(w.lower() for w in brown.words(fileid) if w.isalpha()), category)
             for category in brown.categories()
             for fileid in brown.fileids(category)]
documents # Note how documents is a list of (word list, category) tuples.

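Shuffle the list of tuples. The list was built one category at a time, so without shuffling the training and test slices taken later would contain different categories.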

In [None]:
from random import shuffle

shuffle(documents)
documents
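
Because `shuffle` uses a random order, every run produces a different split and therefore different results. If reproducibility is wanted, one option (an addition, not part of the original tutorial) is to shuffle with a seeded generator, as in this sketch. The toy `docs` list and the seed value 42 are arbitrary.

```python
import random

# Toy (tokens, category) pairs standing in for the real documents list.
docs = [(["jury", "said"], "news"), (["god", "faith"], "religion"),
        (["ran", "road"], "adventure"), (["music", "playing"], "reviews")]

first = docs[:]
random.Random(42).shuffle(first)   # dedicated generator with a fixed seed

second = docs[:]
random.Random(42).shuffle(second)  # same seed, so the same order

print(first == second)  # True
```

Using a separate `random.Random` instance keeps the seed from affecting other code that relies on the global generator.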

Given a document, extract its features: for each of the corpus's most frequent words (the top 1500, minus stop words), whether or not the word occurs in the document

In [None]:
def document_features(doc):
    doc_set_words = set(doc) # Checking whether a word occurs in a set is much faster than checking whether it occurs in a list.
    features_dic = {} # Features is a dictionary
    for word in best1500_words_corpora:
        features_dic['has(%s)' % word] = (word in doc_set_words)
    return features_dic

doc_features_set = [(document_features(d),c) for (d,c) in documents]
doc_features_set[0]
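
To show what one feature dictionary looks like, the sketch below applies the same idea to a three-word vocabulary. The names `vocabulary` and `toy_document_features` are hypothetical stand-ins for `best1500_words_corpora` and `document_features`.

```python
# A hypothetical three-word vocabulary standing in for best1500_words_corpora.
vocabulary = ["jury", "music", "god"]

def toy_document_features(doc, vocab):
    """Map each vocabulary word to True or False depending on whether it occurs in doc."""
    doc_words = set(doc)
    return {"has(%s)" % word: (word in doc_words) for word in vocab}

features = toy_document_features(["the", "jury", "said", "jury"], vocabulary)
print(features)
# {'has(jury)': True, 'has(music)': False, 'has(god)': False}
```

Note that the features record only presence: "jury" appears twice in the toy document, but the feature is simply `True`.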

Now build the classifier that determines which category a document falls into, based on the most frequent words

In [None]:
from nltk import NaiveBayesClassifier

train_set = doc_features_set[:350] # 350 of the 500 documents for training
test_set  = doc_features_set[350:] # The remaining 150 for testing; slicing from 150 would overlap the training set

classifier = NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(15)

Most Informative Features
             has(walked) = True           myster : learne =     29.7 : 1.0
              has(music) = True           review : learne =     28.9 : 1.0
                has(ran) = True           advent : learne =     28.5 : 1.0
          has(afternoon) = True           fictio : learne =     27.2 : 1.0
               has(road) = True           myster : learne =     27.1 : 1.0
            has(playing) = True           review : learne =     26.2 : 1.0
                has(god) = True           religi : learne =     25.8 : 1.0
                has(car) = True            humor : learne =     25.3 : 1.0
               has(hair) = True           romanc : learne =     23.8 : 1.0
              has(maybe) = True           romanc : learne =     23.8 : 1.0
            has(watched) = True           advent : learne =     22.6 : 1.0
            has(kitchen) = True            humor : belles =     22.4 : 1.0
          has(communism) = True           editor : learne =     22.1 : 1.0
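
For an accuracy estimate to reflect performance on unseen documents, the training and test sets must not share any items. A minimal sketch of a split at a single cut point follows; the helper name `split_at` is hypothetical, and integers stand in for the (features, category) pairs.

```python
def split_at(items, n_train):
    """Split items into two non-overlapping slices at index n_train."""
    return items[:n_train], items[n_train:]

# Integers stand in for the 500 (features, category) pairs.
items = list(range(500))
train, test = split_at(items, 350)

print(len(train), len(test))   # 350 150
print(set(train) & set(test))  # set()
```

Using the same index for the end of one slice and the start of the other guarantees that the two sets are disjoint and together cover every item.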

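Compute the classifier's accuracy on the test set. The value shown below was recorded with `test_set = doc_features_set[150:]`, which shares 200 documents with the training set, so it overstates performance on unseen documents. Results also vary between runs because the documents are shuffled randomly.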

In [None]:
from nltk.classify import accuracy

print(accuracy(classifier, test_set))

0.7371428571428571
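
Accuracy here means the fraction of test items whose predicted label matches the true label. The sketch below computes that by hand as an illustration of the idea; `manual_accuracy` and the rule-based `toy_classify` are hypothetical stand-ins, not NLTK functions.

```python
def manual_accuracy(classify, gold):
    """Fraction of (features, label) pairs for which classify(features) equals label."""
    correct = sum(1 for features, label in gold if classify(features) == label)
    return correct / len(gold)

# A hypothetical rule-based stand-in for classifier.classify.
def toy_classify(features):
    return "religion" if features.get("has(god)") else "news"

gold = [
    ({"has(god)": True}, "religion"),
    ({"has(god)": False}, "news"),
    ({"has(god)": False}, "humor"),   # misclassified as 'news'
    ({"has(god)": True}, "religion"),
]

print(manual_accuracy(toy_classify, gold))  # 0.75
```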


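Test the classification of document 'ca01' (it is under the 'news' category). Because the documents were shuffled before splitting, this file may have been part of the training set, so treat this as a sanity check rather than an independent test.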

In [None]:
classifier.classify(document_features(brown.words('ca01')))

'news'

In [None]:
from nltk.tokenize import RegexpTokenizer

# The test text needs to be long enough to contain a significant number of the most common words in our training corpus.
text = """1 God, infinitely perfect and blessed in himself, in a plan of sheer goodness freely created man to make him share in his own blessed life. For this reason, at every time and in every place, God draws close to man. He calls man to seek him, to know him, to love him with all his strength. He calls together all men, scattered and divided by sin, into the unity of his family, the Church. To accomplish this, when the fullness of time had come, God sent his Son as Redeemer and Saviour. In his Son and through him, he invites men to become, in the Holy Spirit, his adopted children and thus heirs of his blessed life. 2 So that this call should resound throughout the world, Christ sent forth the apostles he had chosen, commissioning them to proclaim the gospel: \"Go therefore and make disciples of all nations, baptizing them in the name of the Father and of the Son and of the Holy Spirit, teaching them to observe all that I have commanded you; and lo, I am with you always, to the close of the age.\"4 Strengthened by this mission, the apostles \"went forth and preached everywhere, while the Lord worked with them and confirmed the message by the signs that attended it.\" 3 Those who with God's help have welcomed Christ's call and freely responded to it are urged on by love of Christ to proclaim the Good News everywhere in the world. This treasure, received from the apostles, has been faithfully guarded by their successors. All Christ's faithful are called to hand it on from generation to generation, by professing the faith, by living it in fraternal sharing, and by celebrating it in liturgy and prayer. 4 Quite early on, the name catechesis was given to the totality of the Church's efforts to make disciples, to help men believe that Jesus is the Son of God so that believing they might have life in his name, and to educate and instruct them in this life, thus building up the body of Christ.
Catechesis is an education in the faith of children, young people and adults which includes especially the teaching of Christian doctrine imparted, generally speaking, in an organic and systematic way, with a view to initiating the hearers into the fullness of Christian life. While not being formally identified with them, catechesis is built on a certain number of elements of the Church's pastoral mission which have a catechetical aspect, that prepare for catechesis, or spring from it. They are: the initial proclamation of the Gospel or missionary preaching to arouse faith; examination of the reasons for belief; experience of Christian living; celebration of the sacraments; integration into the ecclesial community; and apostolic and missionary witness. Catechesis is intimately bound up with the whole of the Church's life. Not only her geographical extension and numerical increase, but even more her inner growth and correspondence with God's plan depend essentially on catechesis. Periods of renewal in the Church are also intense moments of catechesis. In the great era of the Fathers of the Church, saintly bishops devoted an important part of their ministry to catechesis. St. Cyril of Jerusalem and St. John Chrysostom, St. Ambrose and St. Augustine, and many other Fathers wrote catechetical works that remain models for us. The ministry of catechesis draws ever fresh energy from the councils. the Council of Trent is a noteworthy example of this. It gave catechesis priority in its constitutions and decrees. It lies at the origin of the Roman Catechism, which is also known by the name of that council and which is a work of the first rank as a summary of Christian teaching. The Council of Trent initiated a remarkable organization of the Church's catechesis. Thanks to the work of holy bishops and theologians such as St. Peter Canisius, St. Charles Borromeo, St. Turibius of Mongrovejo or St. Robert Bellarmine, it occasioned the publication of numerous catechisms.
It is therefore no surprise that catechesis in the Church has again attracted attention in the wake of the Second Vatican Council, which Pope Paul VI considered the great catechism of modern times. the General Catechetical Directory (1971) the sessions of the Synod of Bishops devoted to evangelization (1974) and catechesis (1977), the apostolic exhortations Evangelii nuntiandi (1975) and Catechesi tradendae (1979), attest to this. the Extraordinary Synod of Bishops in 1985 asked that a catechism or compendium of all Catholic doctrine regarding both faith and morals be composed. The Holy Father, Pope John Paul II, made the Synod's wish his own, acknowledging that this desire wholly corresponds to a real need of the universal Church and of the particular Churches. He set in motion everything needed to carry out the Synod Fathers' wish."""

tokenizer = RegexpTokenizer(r'\w+') # Picks out sequences of alphanumeric characters as tokens and drops everything else
text_tokens = nonstop(tokenizer.tokenize(text.lower()))
text_tokens = [w for w in text_tokens if w.isalpha()]
#text_tokens
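
The pattern `\w+` matches runs of letters, digits and underscores, which is why a separate `isalpha()` filter is still applied afterwards. The sketch below illustrates this on a short made-up sentence. It uses the standard library's `re.findall` as a stand-in for `RegexpTokenizer(r'\w+').tokenize`, on the assumption that the two yield the same tokens for this simple pattern.

```python
import re

sample = "St. John's 2nd letter"

# Each run of letters, digits or underscores becomes one token;
# the period and the apostrophe act as separators and are dropped.
raw_tokens = re.findall(r"\w+", sample.lower())
print(raw_tokens)    # ['st', 'john', 's', '2nd', 'letter']

# Keep only purely alphabetic tokens, as the tutorial does.
alpha_tokens = [t for t in raw_tokens if t.isalpha()]
print(alpha_tokens)  # ['st', 'john', 's', 'letter']
```

Note how the apostrophe splits "john's" into two tokens, leaving a stray "s".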

Determine which of the most frequent words selected above occur in the list of tokens

In [None]:
text_features = document_features(text_tokens)
#text_features

Classify the new document based on which of the frequent Brown corpus words it contains

In [None]:
classifier.classify(text_features) # Reuses the feature dictionary computed in the previous cell

'belles_lettres'