# Chapter 9: Counting and Indexing Words
We apply text categorization to Homer's works using scikit-learn functions. We train a model to recognize whether a chapter comes from the _Iliad_ or the _Odyssey_.

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## Loading the data

We use Homer's _Iliad_ and _Odyssey_, split into chapters, and read the chapters from their respective folders.

Source: http://classics.mit.edu/Browse/index.html

In [1]:
PATH = '../datasets/'

In [2]:
ILIAD_FILES = [PATH + 'iliad_chapters/iliad.' +
               str(i) + '.txt' for i in range(1, 25)]
ILIAD_FILES

['../datasets/iliad_chapters/iliad.1.txt',
 '../datasets/iliad_chapters/iliad.2.txt',
 '../datasets/iliad_chapters/iliad.3.txt',
 '../datasets/iliad_chapters/iliad.4.txt',
 '../datasets/iliad_chapters/iliad.5.txt',
 '../datasets/iliad_chapters/iliad.6.txt',
 '../datasets/iliad_chapters/iliad.7.txt',
 '../datasets/iliad_chapters/iliad.8.txt',
 '../datasets/iliad_chapters/iliad.9.txt',
 '../datasets/iliad_chapters/iliad.10.txt',
 '../datasets/iliad_chapters/iliad.11.txt',
 '../datasets/iliad_chapters/iliad.12.txt',
 '../datasets/iliad_chapters/iliad.13.txt',
 '../datasets/iliad_chapters/iliad.14.txt',
 '../datasets/iliad_chapters/iliad.15.txt',
 '../datasets/iliad_chapters/iliad.16.txt',
 '../datasets/iliad_chapters/iliad.17.txt',
 '../datasets/iliad_chapters/iliad.18.txt',
 '../datasets/iliad_chapters/iliad.19.txt',
 '../datasets/iliad_chapters/iliad.20.txt',
 '../datasets/iliad_chapters/iliad.21.txt',
 '../datasets/iliad_chapters/iliad.22.txt',
 '../datasets/iliad_chapters/iliad.23.txt

In [3]:
ODYSSEY_FILES = [PATH + 'odyssey_chapters/odyssey.' +
                 str(i) + '.txt' for i in range(1, 25)]
ODYSSEY_FILES

['../datasets/odyssey_chapters/odyssey.1.txt',
 '../datasets/odyssey_chapters/odyssey.2.txt',
 '../datasets/odyssey_chapters/odyssey.3.txt',
 '../datasets/odyssey_chapters/odyssey.4.txt',
 '../datasets/odyssey_chapters/odyssey.5.txt',
 '../datasets/odyssey_chapters/odyssey.6.txt',
 '../datasets/odyssey_chapters/odyssey.7.txt',
 '../datasets/odyssey_chapters/odyssey.8.txt',
 '../datasets/odyssey_chapters/odyssey.9.txt',
 '../datasets/odyssey_chapters/odyssey.10.txt',
 '../datasets/odyssey_chapters/odyssey.11.txt',
 '../datasets/odyssey_chapters/odyssey.12.txt',
 '../datasets/odyssey_chapters/odyssey.13.txt',
 '../datasets/odyssey_chapters/odyssey.14.txt',
 '../datasets/odyssey_chapters/odyssey.15.txt',
 '../datasets/odyssey_chapters/odyssey.16.txt',
 '../datasets/odyssey_chapters/odyssey.17.txt',
 '../datasets/odyssey_chapters/odyssey.18.txt',
 '../datasets/odyssey_chapters/odyssey.19.txt',
 '../datasets/odyssey_chapters/odyssey.20.txt',
 '../datasets/odyssey_chapters/odyssey.21.txt',
 

In [4]:
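# Read one chapter; the `with` block closes the file after reading
def read_chapter(path):
    with open(path, encoding='utf8') as f:
        return f.read().strip()

# Odyssey chapters first, then Iliad chapters
homer_corpus = [read_chapter(path) for path in ODYSSEY_FILES]
homer_corpus += [read_chapter(path) for path in ILIAD_FILES]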
homer_corpus[0][:60]

"Book I\n\nTHE GODS IN COUNCIL--MINERVA'S VISIT TO ITHACA--THE "

In [5]:
homer_titles = ['odyssey'] * len(ODYSSEY_FILES)
homer_titles += ['iliad'] * len(ILIAD_FILES)
homer_titles

['odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'odyssey',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad',
 'iliad']

## Train and test corpus

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    homer_corpus, homer_titles, test_size=0.2)

In [7]:
X_train[:2]

['Book XXIII\n\nPENELOPE EVENTUALLY RECOGNISES HER HUSBAND--EARLY IN THE MORNING\nULYSSES, TELEMACHUS, EUMAEUS, AND PHILOETIUS LEAVE THE TOWN.\n\nEuryclea now went upstairs laughing to tell her mistress that her dear\nhusband had come home. Her aged knees became young again and her feet\nwere nimble for joy as she went up to her mistress and bent over her\nhead to speak to her. "Wake up Penelope, my dear child," she exclaimed,\n"and see with your own eyes something that you have been wanting this\nlong time past. Ulysses has at last indeed come home again, and has\nkilled the suitors who were giving so much trouble in his house, eating\nup his estate and ill treating his son."\n\n"My good nurse," answered Penelope, "you must be mad. The gods sometimes\nsend some very sensible people out of their minds, and make foolish\npeople become sensible. This is what they must have been doing to you;\nfor you always used to be a reasonable person. Why should you thus mock\nme when I have trouble 

In [8]:
y_train[:2]

['odyssey', 'iliad']

In [9]:
X_test

['Book X\n\nAEOLUS, THE LAESTRYGONES, CIRCE.\n\n"Thence we went on to the Aeolian island where lives Aeolus son of\nHippotas, dear to the immortal gods. It is an island that floats (as\nit were) upon the sea, {83} iron bound with a wall that girds it. Now,\nAeolus has six daughters and six lusty sons, so he made the sons marry\nthe daughters, and they all live with their dear father and mother,\nfeasting and enjoying every conceivable kind of luxury. All day long the\natmosphere of the house is loaded with the savour of roasting meats till\nit groans again, yard and all; but by night they sleep on their well\nmade bedsteads, each with his own wife between the blankets. These were\nthe people among whom we had now come.\n\n"Aeolus entertained me for a whole month asking me questions all the\ntime about Troy, the Argive fleet, and the return of the Achaeans. I\ntold him exactly how everything had happened, and when I said I must go,\nand asked him to further me on my way, he made no sort

In [10]:
y_test

['odyssey',
 'iliad',
 'odyssey',
 'odyssey',
 'iliad',
 'iliad',
 'odyssey',
 'iliad',
 'odyssey',
 'iliad']
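
The split above is random, so each run of the notebook may produce different training and test sets. The sketch below shows one possible way to make the split reproducible and balanced across the two classes. The `random_state` and `stratify` arguments are not used in the notebook itself, and the `toy_` variables and placeholder strings are assumptions that stand in for the real chapters.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder texts standing in for the 24 + 24 chapters (assumption)
toy_texts = [f'odyssey chapter {i}' for i in range(1, 25)]
toy_texts += [f'iliad chapter {i}' for i in range(1, 25)]
toy_titles = ['odyssey'] * 24 + ['iliad'] * 24

# random_state fixes the shuffle; stratify keeps the class proportions
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_titles, test_size=0.2,
    random_state=0, stratify=toy_titles)

print(len(toy_X_train), len(toy_X_test))  # 38 10
print(Counter(toy_y_test))                # 5 chapters from each work
```

With 48 chapters and `test_size=0.2`, the test set holds 10 chapters and the training set 38, which matches the shapes shown in this notebook.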

## Converting texts into matrices

Counting the words

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(38, 9152)

In [12]:
count_vect.get_feature_names_out()[-20:]

array(['yokestraps', 'yolking', 'yonder', 'you', 'young', 'younger',
       'youngest', 'youngster', 'your', 'yours', 'yourself', 'yourselves',
       'youth', 'youths', 'zacynthus', 'zeal', 'zelea', 'zephyrus',
       'zethus', 'zeus'], dtype=object)

In [13]:
X_train_counts.toarray()[0, :200]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  1,  1,  1,  1,  1,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
       18,  0,  1,  3,  0,  0,  1,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  2,  0,  0,  0])

Applying tf-idf to the matrix

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(38, 9152)

In [15]:
X_train_tfidf.toarray()[0, :200]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     
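
To see what the transformer computes, the sketch below reproduces tf-idf by hand on a made-up three-document corpus and compares the result with `TfidfTransformer`. It assumes the transformer's default settings (`smooth_idf=True`, `norm='l2'`), where the inverse document frequency of a term $t$ is $\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$ and each row is then scaled to unit Euclidean length. The `toy_` names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up documents (assumption, for illustration only)
toy_docs = ['the wrath of Achilles',
            'the return of Ulysses',
            'Achilles and Ulysses']
toy_sparse = CountVectorizer().fit_transform(toy_docs)
toy_counts = toy_sparse.toarray()

# Manual computation with the smoothed idf
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)  # documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = toy_counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# scikit-learn computation with default parameters
sk_tfidf = TfidfTransformer().fit_transform(toy_sparse).toarray()

print(np.allclose(manual_tfidf, sk_tfidf))  # True
```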

## Training the model

In [16]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_train_tfidf, y_train)

## Predicting

Formatting the data

In [17]:
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
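
Note that the test set goes through `transform`, not `fit_transform`: the vocabulary and idf weights come from the training set only, and words unseen in training are ignored. One way to avoid repeating these steps by hand is a scikit-learn pipeline. The sketch below is an alternative formulation, not the notebook's own code, and runs on made-up sentences with illustrative `toy_` names.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Made-up data (assumption, for illustration only)
toy_train = ['achilles fought hector at troy',
             'hector fell before the walls of troy',
             'ulysses sailed home to ithaca',
             'penelope waited in ithaca for ulysses']
toy_labels = ['iliad', 'iliad', 'odyssey', 'odyssey']
toy_test = ['hector and achilles met outside troy',
            'ulysses reached ithaca']

# fit() calls fit_transform on each step with the training texts;
# predict() calls transform only, then the classifier's predict
toy_pipe = make_pipeline(CountVectorizer(),
                         TfidfTransformer(),
                         LogisticRegression())
toy_pipe.fit(toy_train, toy_labels)
toy_pred = toy_pipe.predict(toy_test)
print(toy_pred)
```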

Predicting the classes

In [18]:
y_test_hat = clf.predict(X_test_tfidf)
y_test_hat

array(['odyssey', 'iliad', 'odyssey', 'odyssey', 'iliad', 'iliad',
       'odyssey', 'iliad', 'odyssey', 'iliad'], dtype='<U7')

In [19]:
y_test

['odyssey',
 'iliad',
 'odyssey',
 'odyssey',
 'iliad',
 'iliad',
 'odyssey',
 'iliad',
 'odyssey',
 'iliad']

In [20]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_test_hat)

1.0
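
An accuracy of 1.0 is computed here on only 10 chapters, so it is worth looking at the per-class breakdown as well. The sketch below builds a confusion matrix from the labels and predictions printed above, copied by hand into `shown_` variables (names chosen for illustration) so that the block runs on its own.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Gold labels as printed in the y_test output above
shown_y_test = ['odyssey', 'iliad', 'odyssey', 'odyssey', 'iliad',
                'iliad', 'odyssey', 'iliad', 'odyssey', 'iliad']
# The printed predictions are identical to the gold labels
shown_y_hat = list(shown_y_test)

# Rows: true class, columns: predicted class, in the order of `labels`
cm = confusion_matrix(shown_y_test, shown_y_hat,
                      labels=['iliad', 'odyssey'])
print(cm)
# [[5 0]
#  [0 5]]
print(accuracy_score(shown_y_test, shown_y_hat))  # 1.0
```

Because the split is random, another run may select different test chapters and yield a different score.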