# Text Classification

In this lecture we look at text classification using Naïve Bayes. We will try to identify the authors of inaugural addresses by US presidents.

In [1]:
import nltk
nltk.download("inaugural")
from nltk.corpus import inaugural
print(inaugural.fileids())

[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.

Divide the corpus into a training set containing each president's first address and a test set containing the subsequent addresses of presidents who served consecutive terms.

In [2]:
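def fileid_to_author(fileid):
  # "1789-Washington.txt" -> "Washington": drop the "YYYY-" prefix and the ".txt" suffix
  return fileid[5:-4]

# An address goes into the test set when the preceding address has the same author name
test = [fileid2 for fileid1, fileid2 in
         zip(inaugural.fileids(), inaugural.fileids()[1:])
         if fileid_to_author(fileid1) == fileid_to_author(fileid2)]

train = [fileid for fileid in inaugural.fileids() if fileid not in test]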

print("train=", len(train), train)
print("test=", len(test), test)

train= 40 ['1789-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1809-Madison.txt', '1817-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1869-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '2001-Bush.txt', '2009-Obama.txt', '2017-Trump.txt']
test= 18 ['1793-Washington.txt', '1805-Jefferson.txt', '1813-Madison.txt', '1821-Monroe.txt', '1833-Jackson.txt', '1865-Lincoln.txt', '1873-Grant.txt', '1901-McKinley.txt', '1917-Wilson.txt', '1937-Ro

Calculate the size of the vocabulary overlap between two documents, i.e. the number of distinct words they share.

In [3]:
def overlap(id1, id2):
  return len(set(inaugural.words(id1)).intersection(inaugural.words(id2)))

assert overlap("1793-Washington.txt","1789-Washington.txt")==61

Classify each test document by assigning it to the training document with the largest overlap.

In [4]:
for doc in test:
  print(doc,"=",max(train, key=lambda d: overlap(doc,d)))

1793-Washington.txt = 1841-Harrison.txt
1805-Jefferson.txt = 1841-Harrison.txt
1813-Madison.txt = 1841-Harrison.txt
1821-Monroe.txt = 1841-Harrison.txt
1833-Jackson.txt = 1841-Harrison.txt
1865-Lincoln.txt = 1841-Harrison.txt
1873-Grant.txt = 1841-Harrison.txt
1901-McKinley.txt = 1841-Harrison.txt
1917-Wilson.txt = 1841-Harrison.txt
1937-Roosevelt.txt = 1841-Harrison.txt
1941-Roosevelt.txt = 1841-Harrison.txt
1945-Roosevelt.txt = 1841-Harrison.txt
1957-Eisenhower.txt = 1841-Harrison.txt
1973-Nixon.txt = 1841-Harrison.txt
1985-Reagan.txt = 1841-Harrison.txt
1997-Clinton.txt = 1841-Harrison.txt
2005-Bush.txt = 1841-Harrison.txt
2013-Obama.txt = 2009-Obama.txt


Why does this happen? What change would you suggest to improve the results?

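<font color="red">
The classifier almost always predicts the same document, 1841-Harrison.txt. Raw overlap favors long documents: Harrison's 1841 address is the longest in the corpus, so it contains the most distinct words and shares the most with almost any test address, regardless of author.</font>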

<font color="red">We should account for document length and frequency of words.
</font>
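
To make the length effect concrete, here is a minimal sketch on made-up token lists (they are hypothetical and not taken from the corpus). It compares the raw overlap count with a Jaccard score, which divides the overlap by the size of the combined vocabulary. Jaccard is only one possible normalization and is not the method used in the next cell.

```python
# Hypothetical toy token lists, not taken from the inaugural corpus.
query = "the people will keep our promise and our peace".split()
short_doc = "we the people will keep our promise".split()
long_doc = ("we the people of these united states will keep our "
            "laws and our promise and our union and our peace").split()

def raw_overlap(a, b):
    """Number of distinct words shared by two token lists."""
    return len(set(a) & set(b))

def jaccard(a, b):
    """Shared distinct words divided by all distinct words in either list."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

for name, doc in [("short_doc", short_doc), ("long_doc", long_doc)]:
    print(name, raw_overlap(query, doc), round(jaccard(query, doc), 3))
```

In this toy case the longer list wins on raw overlap simply because it has more distinct words, while the shorter list wins once the score is normalized.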

In [7]:
from nltk.probability import FreqDist
from math import log

fd_train = { d:FreqDist(inaugural.words(d)) for d in train }

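def prob(id1, id2):
  # Average log-probability of the words of id2 under the unigram
  # distribution of training document id1
  total = 0
  for w in inaugural.words(id2):
    # 1e-6 keeps the log finite for words that never occur in id1
    total += log((fd_train[id1][w]+1e-6) / fd_train[id1].N())
  # Dividing by the length of id2 gives a per-word score
  return total / len(inaugural.words(id2))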
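
The same scoring idea can be tried on toy data without downloading the corpus. The sketch below is an illustration under assumptions: the token lists are invented, `avg_log_prob` is a hypothetical helper name, and `collections.Counter` stands in for `FreqDist`. It mirrors the formula above: each test word contributes the log of its smoothed relative frequency in the training document, and the sum is averaged over the test words.

```python
from collections import Counter
from math import log

def avg_log_prob(train_tokens, test_tokens, eps=1e-6):
    """Average smoothed log relative frequency of test words in train_tokens."""
    counts = Counter(train_tokens)   # word -> count, 0 for unseen words
    n = len(train_tokens)            # total number of training tokens
    total = sum(log((counts[w] + eps) / n) for w in test_tokens)
    return total / len(test_tokens)

# Hypothetical toy documents.
doc_a = "the union the union the people".split()
doc_b = "the economy the economy the jobs".split()
query = "the union".split()

print("doc_a", avg_log_prob(doc_a, query))
print("doc_b", avg_log_prob(doc_b, query))
```

The word "union" never occurs in `doc_b`, so it contributes a very negative term there and `doc_a` receives the higher score. The size of that penalty depends entirely on the smoothing constant `eps`.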

In [8]:
for doc in test:
  print(doc,"=",max(train, key=lambda d: prob(d,doc)))

1793-Washington.txt = 1861-Lincoln.txt
1805-Jefferson.txt = 1841-Harrison.txt
1813-Madison.txt = 1841-Harrison.txt
1821-Monroe.txt = 1817-Monroe.txt
1833-Jackson.txt = 1841-Harrison.txt
1865-Lincoln.txt = 1841-Harrison.txt
1873-Grant.txt = 1841-Harrison.txt
1901-McKinley.txt = 1897-McKinley.txt
1917-Wilson.txt = 1841-Harrison.txt
1937-Roosevelt.txt = 1929-Hoover.txt
1941-Roosevelt.txt = 1841-Harrison.txt
1945-Roosevelt.txt = 1969-Nixon.txt
1957-Eisenhower.txt = 1953-Eisenhower.txt
1973-Nixon.txt = 2009-Obama.txt
1985-Reagan.txt = 1981-Reagan.txt
1997-Clinton.txt = 2009-Obama.txt
2005-Bush.txt = 2009-Obama.txt
2013-Obama.txt = 2009-Obama.txt


<font color="red">By accounting for document length and word frequency, we improve the quality of the classification: 5 of the 18 test addresses are now attributed to the right author, up from 1. It is still not a great classifier, but it is better than before.</font>
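
The two runs can be compared by tallying the printed predictions. The sketch below hard-codes the outputs shown above rather than recomputing them, and `count_correct` is a hypothetical helper. It compares author names using the same slicing as `fileid_to_author`, so presidents who share a surname (for example the two Roosevelts or the two Bushes) would be counted as the same author.

```python
def author(fileid):
    # Same slicing as fileid_to_author: drop "YYYY-" and ".txt".
    return fileid[5:-4]

def count_correct(docs, predictions):
    """Number of test documents whose predicted author name matches."""
    return sum(author(d) == author(p) for d, p in zip(docs, predictions))

test_docs = [
    "1793-Washington.txt", "1805-Jefferson.txt", "1813-Madison.txt",
    "1821-Monroe.txt", "1833-Jackson.txt", "1865-Lincoln.txt",
    "1873-Grant.txt", "1901-McKinley.txt", "1917-Wilson.txt",
    "1937-Roosevelt.txt", "1941-Roosevelt.txt", "1945-Roosevelt.txt",
    "1957-Eisenhower.txt", "1973-Nixon.txt", "1985-Reagan.txt",
    "1997-Clinton.txt", "2005-Bush.txt", "2013-Obama.txt",
]

# Predictions copied from the overlap-based run.
overlap_pred = ["1841-Harrison.txt"] * 17 + ["2009-Obama.txt"]

# Predictions copied from the log-probability run.
prob_pred = [
    "1861-Lincoln.txt", "1841-Harrison.txt", "1841-Harrison.txt",
    "1817-Monroe.txt", "1841-Harrison.txt", "1841-Harrison.txt",
    "1841-Harrison.txt", "1897-McKinley.txt", "1841-Harrison.txt",
    "1929-Hoover.txt", "1841-Harrison.txt", "1969-Nixon.txt",
    "1953-Eisenhower.txt", "2009-Obama.txt", "1981-Reagan.txt",
    "2009-Obama.txt", "2009-Obama.txt", "2009-Obama.txt",
]

for name, pred in [("overlap", overlap_pred), ("log-prob", prob_pred)]:
    print(name, count_correct(test_docs, pred), "of", len(test_docs))
```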

<font color="red">Did you achieve a better result? You can share your method for discussion in the forums.</font>