# Chapter 9: Counting and Indexing Words
This notebook applies text categorization to Homer's works with scikit-learn functions. We train a model to recognize whether a chapter comes from the _Iliad_ or the _Odyssey_.

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## Loading the data

We use Homer's _Iliad_ and _Odyssey_, each split into chapters, and read the chapters from their respective folders.

Source: http://classics.mit.edu/Browse/index.html

In [1]:
PATH = '../datasets/'

In [2]:
ILIAD_FILES = [PATH + 'iliad_chapters/iliad.' +
               str(i) + '.txt' for i in range(1, 25)]
ILIAD_FILES

['../datasets/iliad_chapters/iliad.1.txt',
 '../datasets/iliad_chapters/iliad.2.txt',
 '../datasets/iliad_chapters/iliad.3.txt',
 '../datasets/iliad_chapters/iliad.4.txt',
 '../datasets/iliad_chapters/iliad.5.txt',
 '../datasets/iliad_chapters/iliad.6.txt',
 '../datasets/iliad_chapters/iliad.7.txt',
 '../datasets/iliad_chapters/iliad.8.txt',
 '../datasets/iliad_chapters/iliad.9.txt',
 '../datasets/iliad_chapters/iliad.10.txt',
 '../datasets/iliad_chapters/iliad.11.txt',
 '../datasets/iliad_chapters/iliad.12.txt',
 '../datasets/iliad_chapters/iliad.13.txt',
 '../datasets/iliad_chapters/iliad.14.txt',
 '../datasets/iliad_chapters/iliad.15.txt',
 '../datasets/iliad_chapters/iliad.16.txt',
 '../datasets/iliad_chapters/iliad.17.txt',
 '../datasets/iliad_chapters/iliad.18.txt',
 '../datasets/iliad_chapters/iliad.19.txt',
 '../datasets/iliad_chapters/iliad.20.txt',
 '../datasets/iliad_chapters/iliad.21.txt',
 '../datasets/iliad_chapters/iliad.22.txt',
 '../datasets/iliad_chapters/iliad.23.txt

In [3]:
ODYSSEY_FILES = [PATH + 'odyssey_chapters/odyssey.' +
                 str(i) + '.txt' for i in range(1, 25)]
ODYSSEY_FILES

['../datasets/odyssey_chapters/odyssey.1.txt',
 '../datasets/odyssey_chapters/odyssey.2.txt',
 '../datasets/odyssey_chapters/odyssey.3.txt',
 '../datasets/odyssey_chapters/odyssey.4.txt',
 '../datasets/odyssey_chapters/odyssey.5.txt',
 '../datasets/odyssey_chapters/odyssey.6.txt',
 '../datasets/odyssey_chapters/odyssey.7.txt',
 '../datasets/odyssey_chapters/odyssey.8.txt',
 '../datasets/odyssey_chapters/odyssey.9.txt',
 '../datasets/odyssey_chapters/odyssey.10.txt',
 '../datasets/odyssey_chapters/odyssey.11.txt',
 '../datasets/odyssey_chapters/odyssey.12.txt',
 '../datasets/odyssey_chapters/odyssey.13.txt',
 '../datasets/odyssey_chapters/odyssey.14.txt',
 '../datasets/odyssey_chapters/odyssey.15.txt',
 '../datasets/odyssey_chapters/odyssey.16.txt',
 '../datasets/odyssey_chapters/odyssey.17.txt',
 '../datasets/odyssey_chapters/odyssey.18.txt',
 '../datasets/odyssey_chapters/odyssey.19.txt',
 '../datasets/odyssey_chapters/odyssey.20.txt',
 '../datasets/odyssey_chapters/odyssey.21.txt',
 

In [4]:
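def read_text(path):
    """Return the stripped content of a UTF-8 text file."""
    # The context manager closes the file handle after reading
    with open(path, encoding='utf8') as f:
        return f.read().strip()

# The Odyssey chapters come first, then the Iliad chapters
homer_corpus = [read_text(path) for path in ODYSSEY_FILES]
homer_corpus += [read_text(path) for path in ILIAD_FILES]
homer_corpus[0][:60]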

"Book I\n\nTHE GODS IN COUNCIL--MINERVA'S VISIT TO ITHACA--THE "

In [5]:
homer_titles = ['odyssey'] * len(ODYSSEY_FILES)
homer_titles += ['iliad'] * len(ILIAD_FILES)
homer_titles

['odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad']

## Train and test corpus

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    homer_corpus, homer_titles, test_size=0.2)
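
The split above is random, so the chapters that land in the test set change from run to run. The sketch below shows one way to make the split reproducible and balanced. The `random_state=0` and `stratify` arguments are assumptions that the notebook does not use, and the placeholder strings stand in for the real chapters. Applying these arguments to the cell above would produce a different split from the one shown in the outputs.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder documents standing in for the 24 + 24 chapters (assumption)
texts = [f'odyssey chapter {i}' for i in range(1, 25)]
texts += [f'iliad chapter {i}' for i in range(1, 25)]
labels = ['odyssey'] * 24 + ['iliad'] * 24

# random_state fixes the shuffle; stratify keeps the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

print(len(X_tr), len(X_te))   # 38 10
print(Counter(y_te))          # five chapters of each class
```

With 48 documents and `test_size=0.2`, scikit-learn rounds the test size up to 10, which matches the 38 training rows reported later in the notebook.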

In [7]:
X_train[:2]

['BOOK XIII\n\n  Neptune helps the Achaeans--The feats of Idomeneus--Hector at\n  the ships.\n\nNOW when Jove had thus brought Hector and the Trojans to the ships, he\nleft them to their never-ending toil, and turned his keen eyes away,\nlooking elsewhither towards the horse-breeders of Thrace, the Mysians,\nfighters at close quarters, the noble Hippemolgi, who live on milk, and\nthe Abians, justest of mankind. He no longer turned so much as a glance\ntowards Troy, for he did not think that any of the immortals would go\nand help either Trojans or Danaans.\n\nBut King Neptune had kept no blind look-out; he had been looking\nadmiringly on the battle from his seat on the topmost crests of wooded\nSamothrace, whence he could see all Ida, with the city of Priam and the\nships of the Achaeans. He had come from under the sea and taken his\nplace here, for he pitied the Achaeans who were being overcome by the\nTrojans; and he was furiously angry with Jove.\n\nPresently he came down from his p

In [8]:
y_train[:2]

['iliad', 'iliad']

In [9]:
X_test

['Book XVII\n\nTELEMACHUS AND HIS MOTHER MEET--ULYSSES AND EUMAEUS COME DOWN TO THE\nTOWN, AND ULYSSES IS INSULTED BY MELANTHIUS--HE IS RECOGNISED BY THE\nDOG ARGOS--HE IS INSULTED AND PRESENTLY STRUCK BY ANTINOUS WITH A\nSTOOL--PENELOPE DESIRES THAT HE SHALL BE SENT TO HER.\n\nWhen the child of morning, rosy-fingered Dawn, appeared, Telemachus\nbound on his sandals and took a strong spear that suited his hands, for\nhe wanted to go into the city. "Old friend," said he to the swineherd,\n"I will now go to the town and show myself to my mother, for she will\nnever leave off grieving till she has seen me. As for this unfortunate\nstranger, take him to the town and let him beg there of any one who will\ngive him a drink and a piece of bread. I have trouble enough of my own,\nand cannot be burdened with other people. If this makes him angry so\nmuch the worse for him, but I like to say what I mean."\n\nThen Ulysses said, "Sir, I do not want to stay here; a beggar can always\ndo better in t

In [10]:
y_test

['odyssey',
 'iliad',
 'iliad',
 'iliad',
 'odyssey',
 'odyssey',
 'odyssey',
 'iliad',
 'odyssey',
 'iliad']

## Converting texts into matrices

Counting the words

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(38, 9092)

In [12]:
count_vect.get_feature_names_out()[-20:]

array(['yoked', 'yokestraps', 'yonder', 'you', 'young', 'younger',
       'youngest', 'youngster', 'your', 'yours', 'yourself', 'yourselves',
       'youth', 'youths', 'zacynthus', 'zeal', 'zelea', 'zephyrus',
       'zethus', 'zeus'], dtype=object)

In [13]:
X_train_counts.toarray()[0, :200]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  0,  0,  0,  0,  0,  0,
        0,  0,  0, 17,  3,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  3, 36,  0,  0,  0,  2,  0])

Applying tf-idf to the matrix

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(38, 9092)
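
The transformer keeps the shape of the count matrix and only reweights its cells. As a minimal sketch, assuming the default settings of `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`), the weight of a term is its count multiplied by ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. Each row is then scaled to unit Euclidean length. The small count matrix below is a toy example, not data from the Homer corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 6 documents, 3 terms (assumption for illustration)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf
weighted = counts * idf
# L2 normalization of each row
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, sk))
```

The first term occurs in every document, so its idf is 1 and it gets the lowest weight per occurrence. The rarer terms are boosted.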

In [15]:
X_train_tfidf.toarray()[0, :200]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

## Training the model

In [16]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_train_tfidf, y_train)
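
The three steps used so far (counting, tf-idf weighting, and logistic regression) can also be chained in a scikit-learn `Pipeline`. This is an alternative arrangement, not the one the notebook uses, and the sentences and step names below are made up for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Toy training data standing in for the chapters (assumption)
toy_texts = ['Achilles fought Hector before the walls of Troy',
             'The Achaeans and the Trojans clashed by the ships',
             'Ulysses sailed home to Ithaca and Penelope',
             'Telemachus searched for news of his father Ulysses']
toy_labels = ['iliad', 'iliad', 'odyssey', 'odyssey']

# The step names 'counts', 'tfidf', and 'clf' are arbitrary
model = Pipeline([('counts', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', LogisticRegression())])
model.fit(toy_texts, toy_labels)

# The pipeline applies transform() on each step before predicting
toy_pred = model.predict(['Hector led the Trojans', 'Penelope waited in Ithaca'])
print(model.classes_)
print(toy_pred)
```

A pipeline guarantees that the test data go through exactly the transformations fitted on the training data, which removes the need to call each `transform` by hand.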

## Predicting

Formatting the test data with the vectorizer and the tf-idf transformer fitted on the training set

In [17]:
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

Predicting the classes

In [18]:
y_test_hat = clf.predict(X_test_tfidf)
y_test_hat

array(['odyssey', 'iliad', 'iliad', 'iliad', 'odyssey', 'odyssey',
       'odyssey', 'iliad', 'odyssey', 'iliad'], dtype='<U7')

In [19]:
y_test

['odyssey',
 'iliad',
 'iliad',
 'iliad',
 'odyssey',
 'odyssey',
 'odyssey',
 'iliad',
 'odyssey',
 'iliad']

In [20]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_test_hat)

1.0
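
An accuracy of 1.0 on ten chapters is a small sample, and the random split means the figure may vary between runs. As a sketch, the confusion matrix below breaks the result down by class. It reuses the true and predicted labels shown in the outputs above; the variable names are chosen for this example.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels copied from the y_test and y_test_hat outputs of this notebook
y_true = ['odyssey', 'iliad', 'iliad', 'iliad', 'odyssey',
          'odyssey', 'odyssey', 'iliad', 'odyssey', 'iliad']
y_pred = ['odyssey', 'iliad', 'iliad', 'iliad', 'odyssey',
          'odyssey', 'odyssey', 'iliad', 'odyssey', 'iliad']

# Rows are the true classes, columns the predicted ones
cm = confusion_matrix(y_true, y_pred, labels=['iliad', 'odyssey'])
print(cm)
# [[5 0]
#  [0 5]]

# Accuracy is the share of matching labels
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc == accuracy_score(y_true, y_pred))   # True
```

All the counts sit on the diagonal: the five _Iliad_ chapters and the five _Odyssey_ chapters of this test set are classified correctly.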