# Chapter 9: Counting and Indexing Words
Applying text categorization to Homer's works with scikit-learn functions. We train a model to recognize whether a chapter comes from the _Iliad_ or the _Odyssey_.

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## Loading the data

We use Homer's _Iliad_ and _Odyssey_, split into chapters, and read each chapter from its respective folder.

Source: http://classics.mit.edu/Browse/index.html

In [1]:
PATH = '../datasets/'

In [2]:
ILIAD_FILES = [PATH + 'iliad_chapters/iliad.' +
               str(i) + '.txt' for i in range(1, 25)]
ILIAD_FILES

['../datasets/iliad_chapters/iliad.1.txt',
 '../datasets/iliad_chapters/iliad.2.txt',
 '../datasets/iliad_chapters/iliad.3.txt',
 '../datasets/iliad_chapters/iliad.4.txt',
 '../datasets/iliad_chapters/iliad.5.txt',
 '../datasets/iliad_chapters/iliad.6.txt',
 '../datasets/iliad_chapters/iliad.7.txt',
 '../datasets/iliad_chapters/iliad.8.txt',
 '../datasets/iliad_chapters/iliad.9.txt',
 '../datasets/iliad_chapters/iliad.10.txt',
 '../datasets/iliad_chapters/iliad.11.txt',
 '../datasets/iliad_chapters/iliad.12.txt',
 '../datasets/iliad_chapters/iliad.13.txt',
 '../datasets/iliad_chapters/iliad.14.txt',
 '../datasets/iliad_chapters/iliad.15.txt',
 '../datasets/iliad_chapters/iliad.16.txt',
 '../datasets/iliad_chapters/iliad.17.txt',
 '../datasets/iliad_chapters/iliad.18.txt',
 '../datasets/iliad_chapters/iliad.19.txt',
 '../datasets/iliad_chapters/iliad.20.txt',
 '../datasets/iliad_chapters/iliad.21.txt',
 '../datasets/iliad_chapters/iliad.22.txt',
 '../datasets/iliad_chapters/iliad.23.txt

In [3]:
ODYSSEY_FILES = [PATH + 'odyssey_chapters/odyssey.' +
                 str(i) + '.txt' for i in range(1, 25)]
ODYSSEY_FILES

['../datasets/odyssey_chapters/odyssey.1.txt',
 '../datasets/odyssey_chapters/odyssey.2.txt',
 '../datasets/odyssey_chapters/odyssey.3.txt',
 '../datasets/odyssey_chapters/odyssey.4.txt',
 '../datasets/odyssey_chapters/odyssey.5.txt',
 '../datasets/odyssey_chapters/odyssey.6.txt',
 '../datasets/odyssey_chapters/odyssey.7.txt',
 '../datasets/odyssey_chapters/odyssey.8.txt',
 '../datasets/odyssey_chapters/odyssey.9.txt',
 '../datasets/odyssey_chapters/odyssey.10.txt',
 '../datasets/odyssey_chapters/odyssey.11.txt',
 '../datasets/odyssey_chapters/odyssey.12.txt',
 '../datasets/odyssey_chapters/odyssey.13.txt',
 '../datasets/odyssey_chapters/odyssey.14.txt',
 '../datasets/odyssey_chapters/odyssey.15.txt',
 '../datasets/odyssey_chapters/odyssey.16.txt',
 '../datasets/odyssey_chapters/odyssey.17.txt',
 '../datasets/odyssey_chapters/odyssey.18.txt',
 '../datasets/odyssey_chapters/odyssey.19.txt',
 '../datasets/odyssey_chapters/odyssey.20.txt',
 '../datasets/odyssey_chapters/odyssey.21.txt',
 

In [4]:
def read_chapter(path):
    # Read one chapter; the with block closes the file handle
    with open(path, encoding='utf8') as f:
        return f.read().strip()

homer_corpus = [read_chapter(path) for path in ODYSSEY_FILES]
homer_corpus += [read_chapter(path) for path in ILIAD_FILES]
homer_corpus[0][:60]

"Book I\n\nTHE GODS IN COUNCIL--MINERVA'S VISIT TO ITHACA--THE "

In [5]:
homer_titles = ['odyssey'] * len(ODYSSEY_FILES)
homer_titles += ['iliad'] * len(ILIAD_FILES)
homer_titles

['odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad']

## Train and test corpus

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    homer_corpus, homer_titles, test_size=0.2)

In [7]:
X_train[:2]

['BOOK XIV\n\n  Agamemnon proposes that the Achaeans should sail home, and\n  is rebuked by Ulysses--Juno beguiles Jupiter--Hector is\n  wounded.\n\nNESTOR was sitting over his wine, but the cry of battle did not escape\nhim, and he said to the son of Aesculapius, "What, noble Machaon, is\nthe meaning of all this? The shouts of men fighting by our ships grow\nstronger and stronger; stay here, therefore, and sit over your wine,\nwhile fair Hecamede heats you a bath and washes the clotted blood from\noff you. I will go at once to the look-out station and see what it is\nall about."\n\nAs he spoke he took up the shield of his son Thrasymedes that was lying\nin his tent, all gleaming with bronze, for Thrasymedes had taken his\nfather\'s shield; he grasped his redoubtable bronze-shod spear, and as\nsoon as he was outside saw the disastrous rout of the Achaeans who, now\nthat their wall was overthrown, were flying pell-mell before the\nTrojans. As when there is a heavy swell upon the sea, bu

In [8]:
y_train[:2]

['iliad', 'odyssey']

In [9]:
X_test

['BOOK XX\n\n  The gods hold a council and determine to watch the fight, from\n  the hill Callicolone, and the barrow of Hercules--A fight\n  between Achilles and AEneas is interrupted by Neptune, who\n  saves AEneas--Achilles kills many Trojans.\n\nTHUS, then, did the Achaeans arm by their ships round you, O son of\nPeleus, who were hungering for battle; while the Trojans over against\nthem armed upon the rise of the plain.\n\nMeanwhile Jove from the top of many-delled Olympus, bade Themis gather\nthe gods in council, whereon she went about and called them to the\nhouse of Jove. There was not a river absent except Oceanus, nor a\nsingle one of the nymphs that haunt fair groves, or springs of rivers\nand meadows of green grass. When they reached the house of\ncloud-compelling Jove, they took their seats in the arcades of polished\nmarble which Vulcan with his consummate skill had made for father Jove.\n\nIn such wise, therefore, did they gather in the house of Jove. Neptune\nalso, lord

In [10]:
y_test

['iliad',
 'odyssey',
 'iliad',
 'odyssey',
 'iliad',
 'iliad',
 'iliad',
 'odyssey',
 'odyssey',
 'odyssey']
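
The split above is random, so the chapters that end up in the test set change from run to run. A minimal sketch of a reproducible, class-balanced variant follows. It assumes toy stand-ins for the chapters (`texts`, `labels`), and the values `random_state=0` and `stratify=labels` are illustrative choices, not settings used in the notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 48 chapters and their labels (hypothetical data)
texts = [f'chapter {i}' for i in range(48)]
labels = ['odyssey'] * 24 + ['iliad'] * 24

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

print(len(X_tr), len(X_te))
print(Counter(y_te))
```

With 48 chapters and `test_size=0.2`, the split yields 38 training and 10 test chapters, which matches the matrix shape obtained after vectorization.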

## Converting texts into matrices

We count the words of each chapter and weight the counts with tf-idf.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_train_tfidf.shape

(38, 9113)

In [12]:
tfidf_vectorizer.get_feature_names_out()[-20:]

array(['yoked', 'yokestraps', 'yolking', 'yonder', 'you', 'young',
       'younger', 'youngest', 'youngster', 'your', 'yours', 'yourself',
       'yourselves', 'youth', 'youths', 'zacynthus', 'zeal', 'zelea',
       'zethus', 'zeus'], dtype=object)

In [13]:
X_train_tfidf.toarray()[0, :200]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

## Training the model

In [14]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_train_tfidf, y_train)

## Predicting

We vectorize the test set with the vectorizer fitted on the training set.

In [15]:
X_test_tfidf = tfidf_vectorizer.transform(X_test)

Predicting the classes

In [16]:
y_test_hat = clf.predict(X_test_tfidf)
y_test_hat

array(['iliad', 'odyssey', 'iliad', 'odyssey', 'iliad', 'iliad', 'iliad',
       'odyssey', 'odyssey', 'odyssey'], dtype='<U7')
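
The vectorizing, training, and predicting steps can also be chained in a single object. The following is a sketch using scikit-learn's `make_pipeline` on made-up sentences (`toy_texts`, `toy_labels` are hypothetical data). It is an alternative arrangement of the same steps, not the notebook's own code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up sentences and labels (hypothetical data)
toy_texts = ['achilles fought hector at troy',
             'hector fell before achilles',
             'ulysses sailed home to ithaca',
             'penelope waited for ulysses in ithaca']
toy_labels = ['iliad', 'iliad', 'odyssey', 'odyssey']

# fit() runs fit_transform on the vectorizer, then fits the classifier;
# predict() runs transform, then predict
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(toy_texts, toy_labels)

y_hat = pipe.predict(['hector and achilles'])
print(y_hat)
```

A pipeline takes raw strings as input, which avoids vectorizing the test set with the wrong method by mistake.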

In [17]:
y_test

['iliad',
 'odyssey',
 'iliad',
 'odyssey',
 'iliad',
 'iliad',
 'iliad',
 'odyssey',
 'odyssey',
 'odyssey']

In [18]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_test_hat)

1.0
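
Accuracy is a single number; a confusion matrix additionally shows which class is mistaken for which. The sketch below uses made-up label lists (`y_true`, `y_pred` are hypothetical, not the notebook's results), with one _Iliad_ chapter deliberately misclassified.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up gold labels and predictions (hypothetical data)
y_true = ['iliad', 'odyssey', 'iliad', 'odyssey', 'iliad']
y_pred = ['iliad', 'odyssey', 'odyssey', 'odyssey', 'iliad']

labels = ['iliad', 'odyssey']
acc = accuracy_score(y_true, y_pred)
# Rows are the true classes, columns the predicted ones, in the order of labels
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(acc)
print(cm)
```

With a test set of only ten chapters, a single score such as the one above depends heavily on the random split, so it is worth repeating the experiment with different splits before drawing conclusions.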