# Chapter 9: Counting and Indexing Words
Applying text categorization to Homer's works with scikit-learn functions. We train a model to recognize whether a chapter comes from the _Iliad_ or the _Odyssey_.

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## Loading the data

We use Homer's _Iliad_ and _Odyssey_, each split into chapters, and read the chapters from their respective folders.

Source: http://classics.mit.edu/Browse/index.html

In [1]:
PATH = '../datasets/'

In [2]:
ILIAD_FILES = [PATH + 'iliad_chapters/iliad.' +
               str(i) + '.txt' for i in range(1, 25)]
ILIAD_FILES

['../datasets/iliad_chapters/iliad.1.txt',
 '../datasets/iliad_chapters/iliad.2.txt',
 '../datasets/iliad_chapters/iliad.3.txt',
 '../datasets/iliad_chapters/iliad.4.txt',
 '../datasets/iliad_chapters/iliad.5.txt',
 '../datasets/iliad_chapters/iliad.6.txt',
 '../datasets/iliad_chapters/iliad.7.txt',
 '../datasets/iliad_chapters/iliad.8.txt',
 '../datasets/iliad_chapters/iliad.9.txt',
 '../datasets/iliad_chapters/iliad.10.txt',
 '../datasets/iliad_chapters/iliad.11.txt',
 '../datasets/iliad_chapters/iliad.12.txt',
 '../datasets/iliad_chapters/iliad.13.txt',
 '../datasets/iliad_chapters/iliad.14.txt',
 '../datasets/iliad_chapters/iliad.15.txt',
 '../datasets/iliad_chapters/iliad.16.txt',
 '../datasets/iliad_chapters/iliad.17.txt',
 '../datasets/iliad_chapters/iliad.18.txt',
 '../datasets/iliad_chapters/iliad.19.txt',
 '../datasets/iliad_chapters/iliad.20.txt',
 '../datasets/iliad_chapters/iliad.21.txt',
 '../datasets/iliad_chapters/iliad.22.txt',
 '../datasets/iliad_chapters/iliad.23.txt

In [3]:
ODYSSEY_FILES = [PATH + 'odyssey_chapters/odyssey.' +
                 str(i) + '.txt' for i in range(1, 25)]
ODYSSEY_FILES

['../datasets/odyssey_chapters/odyssey.1.txt',
 '../datasets/odyssey_chapters/odyssey.2.txt',
 '../datasets/odyssey_chapters/odyssey.3.txt',
 '../datasets/odyssey_chapters/odyssey.4.txt',
 '../datasets/odyssey_chapters/odyssey.5.txt',
 '../datasets/odyssey_chapters/odyssey.6.txt',
 '../datasets/odyssey_chapters/odyssey.7.txt',
 '../datasets/odyssey_chapters/odyssey.8.txt',
 '../datasets/odyssey_chapters/odyssey.9.txt',
 '../datasets/odyssey_chapters/odyssey.10.txt',
 '../datasets/odyssey_chapters/odyssey.11.txt',
 '../datasets/odyssey_chapters/odyssey.12.txt',
 '../datasets/odyssey_chapters/odyssey.13.txt',
 '../datasets/odyssey_chapters/odyssey.14.txt',
 '../datasets/odyssey_chapters/odyssey.15.txt',
 '../datasets/odyssey_chapters/odyssey.16.txt',
 '../datasets/odyssey_chapters/odyssey.17.txt',
 '../datasets/odyssey_chapters/odyssey.18.txt',
 '../datasets/odyssey_chapters/odyssey.19.txt',
 '../datasets/odyssey_chapters/odyssey.20.txt',
 '../datasets/odyssey_chapters/odyssey.21.txt',
 

In [4]:
from pathlib import Path
# Read the Odyssey chapters first, then the Iliad ones;
# read_text() opens and closes each file for us
homer_corpus = [Path(file).read_text(encoding='utf8').strip()
                for file in ODYSSEY_FILES]
homer_corpus += [Path(file).read_text(encoding='utf8').strip()
                 for file in ILIAD_FILES]
homer_corpus[0][:60]

"Book I\n\nTHE GODS IN COUNCIL--MINERVA'S VISIT TO ITHACA--THE "

In [5]:
homer_titles = ['odyssey'] * len(ODYSSEY_FILES)
homer_titles += ['iliad'] * len(ILIAD_FILES)
homer_titles

['odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad']

## Train and test corpus

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    homer_corpus, homer_titles, test_size=0.2)
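
Without a fixed seed, `train_test_split` shuffles the data differently on every run, so the training and test sets change between executions. The following is a minimal sketch, not the notebook's own code: it assumes toy stand-in lists `toy_texts` and `toy_labels` (hypothetical names and data) and shows how `random_state` makes the split repeatable and how `stratify` keeps the class proportions in both parts.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for the corpus: 24 documents per class (hypothetical data)
toy_texts = [f'odyssey text {i}' for i in range(24)]
toy_texts += [f'iliad text {i}' for i in range(24)]
toy_labels = ['odyssey'] * 24 + ['iliad'] * 24


def split(seed):
    # random_state fixes the shuffle; stratify preserves the class ratio
    return train_test_split(
        toy_texts, toy_labels, test_size=0.2,
        random_state=seed, stratify=toy_labels)


X_tr_a, X_te_a, y_tr_a, y_te_a = split(seed=0)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(seed=0)

print(X_te_a == X_te_b)          # same seed, same test set
print(len(X_tr_a), len(X_te_a))  # sizes of the two parts
print(Counter(y_te_a))           # class balance in the test set
```

With 48 texts and `test_size=0.2`, scikit-learn rounds the test set up to 10 texts and keeps 38 for training.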

In [7]:
X_train[:2]

['Book VI\n\nTHE MEETING BETWEEN NAUSICAA AND ULYSSES.\n\nSo here Ulysses slept, overcome by sleep and toil; but Minerva went off\nto the country and city of the Phaeacians--a people who used to live in\nthe fair town of Hypereia, near the lawless Cyclopes. Now the Cyclopes\nwere stronger than they and plundered them, so their king Nausithous\nmoved them thence and settled them in Scheria, far from all other\npeople. He surrounded the city with a wall, built houses and temples,\nand divided the lands among his people; but he was dead and gone to\nthe house of Hades, and King Alcinous, whose counsels were inspired\nof heaven, was now reigning. To his house, then, did Minerva hie in\nfurtherance of the return of Ulysses.\n\nShe went straight to the beautifully decorated bedroom in which there\nslept a girl who was as lovely as a goddess, Nausicaa, daughter to King\nAlcinous. Two maid servants were sleeping near her, both very pretty,\none on either side of the doorway, which was closed w

In [8]:
y_train[:2]

['odyssey', 'iliad']

In [9]:
X_test

['Book XXI\n\nTHE TRIAL OF THE AXES, DURING WHICH ULYSSES REVEALS HIMSELF TO EUMAEUS\nAND PHILOETIUS\n\nMinerva now put it in Penelope\'s mind to make the suitors try their\nskill with the bow and with the iron axes, in contest among themselves,\nas a means of bringing about their destruction. She went upstairs and\ngot the store-room key, which was made of bronze and had a handle of\nivory; she then went with her maidens into the store-room at the end of\nthe house, where her husband\'s treasures of gold, bronze, and wrought\niron were kept, and where was also his bow, and the quiver full of\ndeadly arrows that had been given him by a friend whom he had met in\nLacedaemon--Iphitus the son of Eurytus. The two fell in with one another\nin Messene at the house of Ortilochus, where Ulysses was staying in\norder to recover a debt that was owing from the whole people; for the\nMessenians had carried off three hundred sheep from Ithaca, and had\nsailed away with them and with their shepherds

In [10]:
y_test

['odyssey',
 'iliad',
 'iliad',
 'odyssey',
 'iliad',
 'odyssey',
 'iliad',
 'iliad',
 'iliad',
 'iliad']

## Converting texts into matrices

We count the words of each text and weight the counts with tf-idf using `TfidfVectorizer`.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_train_tfidf.shape

(38, 9073)
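
The matrix has one row per training text and one column per word of the vocabulary. To make the weights less opaque, here is a hedged sketch that recomputes them by hand on a toy corpus (`toy_docs` is hypothetical data, not the Homer texts). It assumes the default `TfidfVectorizer` settings: raw term counts, smoothed idf $\ln\frac{1+n}{1+df}+1$, and L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
toy_docs = ['sing goddess the anger of achilles',
            'tell me muse of that ingenious hero',
            'the anger of the hero']

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(toy_docs).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # document frequency of each word
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (default setting)

weights = counts * idf
# L2 normalization: each row is divided by its Euclidean norm
manual_tfidf = weights / np.linalg.norm(weights, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

A word such as _of_, present in every toy document, gets the lowest idf, while a word seen in a single document gets the highest.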

In [12]:
tfidf_vectorizer.get_feature_names_out()[-20:]

array(['yoked', 'yokestraps', 'yolking', 'yonder', 'you', 'young',
       'younger', 'youngest', 'youngster', 'your', 'yours', 'yourself',
       'yourselves', 'youth', 'youths', 'zacynthus', 'zeal', 'zelea',
       'zephyrus', 'zethus'], dtype=object)

In [13]:
X_train_tfidf.toarray()[0, :200]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

## Training the model

In [14]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_train_tfidf, y_train)
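
As an alternative to handling the vectorizer and the classifier separately, scikit-learn can chain them in a pipeline. The sketch below is an illustration on hypothetical toy sentences (`toy_train`, `toy_train_labels`), not the notebook's method: the pipeline applies the same fitted vectorizer at training and prediction time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
toy_train = ['achilles fought hector before troy',
             'hector fell under the walls of troy',
             'ulysses sailed home to ithaca',
             'penelope waited for ulysses in ithaca']
toy_train_labels = ['iliad', 'iliad', 'odyssey', 'odyssey']

# Vectorization and classification in a single estimator
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(toy_train, toy_train_labels)

# The pipeline vectorizes raw text before predicting
toy_pred = model.predict(['the ships reached ithaca'])
print(model[-1].classes_)  # labels known to the classifier
print(toy_pred)
```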

## Predicting

Formatting the data

In [15]:
# Reuse the vectorizer fitted on the training set
X_test_tfidf = tfidf_vectorizer.transform(X_test)
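
The test texts have to be converted with the vectorizer already fitted on the training set, so that they share its vocabulary and idf weights. A minimal sketch on hypothetical toy sentences (`toy_train` and `toy_test` are illustrative names) shows the difference between `fit_transform` and `transform`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data
toy_train = ['achilles fought hector', 'ulysses sailed to ithaca']
toy_test = ['hector met ulysses and telemachus']

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and the idf weights
train_matrix = vectorizer.fit_transform(toy_train)
# transform only reuses them; words unseen in training are ignored
test_matrix = vectorizer.transform(toy_test)

print(train_matrix.shape, test_matrix.shape)
print('telemachus' in vectorizer.vocabulary_)
```

Both matrices have the same number of columns, which is what the classifier expects at prediction time.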

Predicting the classes

In [18]:
y_test_hat = clf.predict(X_test_tfidf)
y_test_hat

array(['odyssey', 'odyssey', 'iliad', 'odyssey', 'odyssey', 'odyssey',
       'odyssey', 'iliad', 'iliad', 'iliad'], dtype='<U7')

In [19]:
y_test

['odyssey',
 'odyssey',
 'iliad',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'iliad',
 'iliad',
 'iliad']

In [20]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_test_hat)

1.0
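
To see where this score comes from, the sketch below recomputes the accuracy by hand and builds a confusion matrix from the gold labels and predictions printed above. The variable names `gold`, `pred`, and `labels` are illustrative.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Gold labels and predictions copied from the outputs above
gold = ['odyssey', 'odyssey', 'iliad', 'odyssey', 'odyssey',
        'odyssey', 'odyssey', 'iliad', 'iliad', 'iliad']
pred = ['odyssey', 'odyssey', 'iliad', 'odyssey', 'odyssey',
        'odyssey', 'odyssey', 'iliad', 'iliad', 'iliad']

# Accuracy: proportion of predictions equal to the gold label
manual_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Rows are gold classes, columns are predicted classes
labels = ['iliad', 'odyssey']
cm = confusion_matrix(gold, pred, labels=labels)

print(manual_accuracy == accuracy_score(gold, pred))
print(cm)
```

With only ten test texts, each text accounts for 10% of the score, so the estimate is coarse.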