# CLTK Data Cleaning / Exploration

Before diving into the Epistles, I spent this week getting more familiar with some of the tools for processing Classical texts in Python, my language of choice. Specifically, I experimented with loading the texts, cleaning the data, and generating different representations of each document, all centered on the problem of classifying paragraphs as either Xenophon or Plutarch. Next week, I'll work on developing the classification models themselves. Xenophon and Plutarch make a good benchmark because the authorship of their works is not contested, so the success of a model is easy to measure. Once I've explored the models there, I'll see what works and identify good candidates for the more difficult problem of classifying Plato's Epistles (authorship unknown, genre different from Plato's other works).

Ultimately, many of the decisions in this notebook (lemmatization, text normalization) are temporary and intended to get simple models up and running quickly, and I'll be examining them more closely over the next few months.

## Acquiring the Corpus

Acquiring the documents proved to be a simple task with the CLTK's Corpus Importer (which also allows users to import pre-trained word vectors and Greek-specific data cleaning functionality).

In [1]:
from cltk.corpus.utils.importer import CorpusImporter
from cltk.corpus.readers import get_corpus_reader

corpus_importer = CorpusImporter('greek')

corpus_importer.import_corpus("greek_text_perseus")
corpus_importer.import_corpus("greek_text_first1kgreek")
corpus_importer.import_corpus("greek_models_cltk")
corpus_importer.import_corpus("greek_word2vec_cltk")

## Creating a Dataframe

In [2]:

import pandas as pd

data = {'Paragraph': [],
        'Author':[]}

df = pd.DataFrame (data, columns = ['Paragraph','Author'])

## Cleaning Data

At this step, I converted the JSON-style hierarchical documents into lists of strings, one string per paragraph. I also took advantage of CLTK's formatting utilities, which remove superfluous punctuation (tailored to Perseus text) and normalize different representations of accented characters (polytonic vs monotonic Greek characters). Since capitalization in Greek is more or less restricted to proper nouns, I decided not to case-normalize the text explicitly.
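To see why normalization matters, note that the same accented Greek letter can be encoded in more than one way in Unicode. The sketch below uses only Python's standard `unicodedata` module to show three encodings of ά collapsing to one form under NFC. It is an illustration of the general idea, not a description of how `cltk_normalize` is implemented.

```python
import unicodedata

# Three different encodings of the same accented alpha
alpha_tonos = "\u03ac"            # GREEK SMALL LETTER ALPHA WITH TONOS (precomposed)
alpha_oxia = "\u1f71"             # GREEK SMALL LETTER ALPHA WITH OXIA (polytonic block)
alpha_combining = "\u03b1\u0301"  # plain alpha followed by COMBINING ACUTE ACCENT

variants = [alpha_tonos, alpha_oxia, alpha_combining]

# Before normalization the raw strings compare as different
distinct_before = len(set(variants))

# NFC maps each variant to the same canonical code point
normalized = [unicodedata.normalize("NFC", v) for v in variants]
distinct_after = len(set(normalized))

print(distinct_before, distinct_after)
```

Without this step, a vectorizer would treat the variants as separate vocabulary entries even though they are the same word to a reader.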

In [3]:
from cltk.corpus.utils.formatter import tlg_plaintext_cleanup, cltk_normalize

def process_document(doc):
    cleaned_paragraphs = []
    for paragraph in doc['text'].values():
        para_string = ""
        if type(paragraph) != str:
            for sent in paragraph.values():
                if type(sent) == str:
                    para_string += tlg_plaintext_cleanup(sent)
        sentence = cltk_normalize(para_string)
        cleaned_paragraphs.append(sentence)
        # cleaned_paragraphs.append(lemmatizer.lemmatize(sentence))
    return cleaned_paragraphs

In [4]:
perseus_reader = get_corpus_reader(corpus_name='greek_text_perseus', language='greek')

rows = []

for doc in perseus_reader.docs():
    if doc["author"] == "plutarch":
        for paragraph in process_document(doc):
            rows.append({"Paragraph": paragraph, "Author": "Plutarch"})
    elif doc["author"] == "xenophon":
        for paragraph in process_document(doc):
            rows.append({"Paragraph": paragraph, "Author": "Xenophon"})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df = pd.DataFrame(rows, columns=['Paragraph', 'Author'])

In [5]:
len(df.iloc[-14]["Paragraph"])

2490

In [6]:
df.head()

| | Paragraph | Author |
|---|---|---|
| 0 | ἐμοὶ τῆς τῶν βίων ἅψασθαι μὲν γραφῆς συνέβη δ... | Plutarch |
| 1 | τὸν Αἰμιλίων οἶκον ἐν Ῥώμῃ τῶν εὐπατριδῶν γεγ... | Plutarch |
| 2 | πρώτην γοῦν τῶν ἐπιφανῶν ἀρχῶν ἀγορανομίαν με... | Plutarch |
| 3 | συστάντος δὲ τοῦ πρὸς Ἀντίοχον τὸν μέγαν πολέ... | Plutarch |
| 4 | ἔγημε δὲ Παπιρίαν, ἀνδρὸς ὑπατικοῦ Μάσωνος θυ... | Plutarch |


## Train / Test Split

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

text_train, text_test, author_train, author_test = train_test_split(df["Paragraph"], df["Author"], test_size=0.3)

X = pd.concat([text_train, author_train], axis=1)

# separate minority and majority classes
plutarch = X[X.Author == "Plutarch"]
xenophon = X[X.Author == "Xenophon"]

# upsample minority
xenophon_upsampled = resample(xenophon, replace=True, n_samples=len(plutarch))
upsampled = pd.concat([plutarch, xenophon_upsampled])

text_train = upsampled["Paragraph"]
author_train = upsampled["Author"]
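
The upsampling step above can be illustrated on toy data. This is a minimal sketch that uses only the standard library in place of `sklearn.utils.resample`; the `upsample_minority` helper and the toy rows are hypothetical. Unlike the `resample` call, it keeps every original minority row and only draws the shortfall with replacement.

```python
import random
from collections import Counter

def upsample_minority(rows, label_key, seed=0):
    """Sample smaller classes with replacement until every class matches the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # draw extra rows (with replacement) only for classes below the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical imbalanced training set: 6 Plutarch rows, 2 Xenophon rows
toy_rows = ([{"Paragraph": f"p{i}", "Author": "Plutarch"} for i in range(6)]
            + [{"Paragraph": f"x{i}", "Author": "Xenophon"} for i in range(2)])

balanced_rows = upsample_minority(toy_rows, "Author")
counts = Counter(row["Author"] for row in balanced_rows)
print(counts)
```

Upsampling only after the split, as in the cell above, keeps duplicated paragraphs out of the test set.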

## Lemmatization and Word Representation

The next step is to transform each document into a vectorized representation.

One popular representation is the bag of words model, in which each document is represented as a vector of length *m*, where *m* is the number of unique words in the vocabulary. The value at each index of the vector is equal to the frequency of the corresponding word in the document.
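
As a small illustration of the bag of words idea, here is a sketch in plain Python. The toy documents and the `bag_of_words_vector` helper are hypothetical stand-ins; the notebook itself relies on scikit-learn's `CountVectorizer`.

```python
from collections import Counter

toy_docs = ["a b a c", "b b d"]  # hypothetical whitespace-tokenized documents

# Vocabulary: every unique word in the corpus, in a fixed (sorted) order
vocabulary = sorted({word for doc in toy_docs for word in doc.split()})

def bag_of_words_vector(doc, vocabulary):
    """Return a vector of length m whose i-th entry counts vocabulary[i] in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words_vector(doc, vocabulary) for doc in toy_docs]
print(vocabulary)
print(vectors)
```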

The next representation is the TF-IDF model, in which each document is also represented as a vector of length *m*. Here, however, the value at each index is a score for how important that word is to the document: the score grows with the word's frequency in the document and shrinks as the word appears in more documents across the corpus.
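
A minimal sketch of that scoring, assuming the textbook formula tf × log(N / df) and a hypothetical toy corpus. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each vector by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical tokenized documents
toy_docs = [["a", "b", "a"], ["b", "c"], ["b", "d", "d"]]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

def tfidf_scores(doc):
    """Score each word in doc as term frequency times log(N / document frequency)."""
    term_freq = Counter(doc)
    return {word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()}

scores = [tfidf_scores(doc) for doc in toy_docs]
print(scores[0])
```

The word `b` occurs in every toy document, so its score is zero everywhere, while `a`, which is frequent in one document and absent from the others, scores highest there.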

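Finally, I've examined the possibility of using gensim to load the pre-trained Greek word embeddings distributed by the CLTK team in the `greek_word2vec_cltk` corpus (the model filename, `greek_s100_w30_min5_sg.model`, suggests the skip-gram variant of word2vec, though I have not confirmed this). Alternatively, I intend to train my own word embeddings through more advanced neural methods.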

In this process, I decided to use a lemmatizer, which reduces each inflected form to its dictionary form (lemma). Given that Greek nouns, adjectives, and especially verbs can take up to hundreds of different morphological forms, I thought this would be an appropriate choice. However, it comes at the expense of losing valuable grammatical information: Greek marks subject and object with case endings rather than word order, so once the endings are stripped, the sentences "X sees Y" and "Y sees X" are rendered the same. One of my research goals is to weigh this tradeoff more carefully and formulate a method that preserves both word semantics and morphology as much as possible.
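
The tradeoff can be made concrete with a toy example. The sketch below uses a hypothetical hand-written lemma table (`TOY_LEMMAS`) rather than CLTK's `LemmaReplacer`, and compares "the man sees the horse" with "the horse sees the man" as bags of words before and after lemmatization.

```python
from collections import Counter

# Surface forms, defined once so the sentences and the lemma table stay consistent
the_nom, the_acc = "ὁ", "τὸν"
man_nom, man_acc = "ἄνθρωπος", "ἄνθρωπον"
horse_nom, horse_acc = "ἵππος", "ἵππον"
sees, see_lemma = "βλέπει", "βλέπω"

# Hypothetical hand-written lemma table (a stand-in for a real lemmatizer)
TOY_LEMMAS = {
    the_nom: the_nom, the_acc: the_nom,
    man_nom: man_nom, man_acc: man_nom,
    horse_nom: horse_nom, horse_acc: horse_nom,
    sees: see_lemma,
}

man_sees_horse = [the_nom, man_nom, sees, the_acc, horse_acc]
horse_sees_man = [the_nom, horse_nom, sees, the_acc, man_acc]

def bag(tokens, lemmatize=False):
    """Count tokens, optionally replacing each one by its lemma first."""
    return Counter(TOY_LEMMAS[t] if lemmatize else t for t in tokens)

same_without_lemmas = bag(man_sees_horse) == bag(horse_sees_man)
same_with_lemmas = bag(man_sees_horse, True) == bag(horse_sees_man, True)
print(same_without_lemmas, same_with_lemmas)
```

With the inflected forms, the two bags differ because the case endings record who sees whom. After lemmatization they are identical.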

In [8]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from cltk.stem.lemma import LemmaReplacer
lemmatizer = LemmaReplacer('greek')

analyze_text = lambda x: lemmatizer.lemmatize(x)

cv = CountVectorizer(ngram_range = (1,1), tokenizer=analyze_text)
bag_of_words = cv.fit_transform(text_train)

tf = TfidfVectorizer(ngram_range = (1,1), tokenizer=analyze_text)
tfidf_train = tf.fit_transform(text_train)
tfidf_test = tf.transform(text_test)

In [9]:
# from gensim.models import Word2Vec
# model = Word2Vec.load("/Users/blissperry/cltk_data/greek/model/greek_word2vec_cltk/greek_s100_w30_min5_sg.model")


## Potential Classification Models (Part 2 - Coming Soon)

Roughly in order of complexity: 
- (Unigram) Naive Bayes
- Straight N-gram model
- RNN language model (then, with LSTM) 
- Attention-based models
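
As a reference point for the first item in the list, here is a minimal sketch of a unigram multinomial Naive Bayes classifier with add-one smoothing, written with only the standard library. The `ToyNaiveBayes` class and the toy training data are hypothetical; the notebook itself uses scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

class ToyNaiveBayes:
    """Unigram multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_score(self, tokens, label):
        # log prior plus the smoothed log likelihood of each known word
        total_words = sum(self.word_counts[label].values())
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for word in tokens:
            if word in self.vocabulary:  # words never seen in training are skipped
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
        return score

    def predict(self, tokens):
        return max(self.class_counts, key=lambda label: self.log_score(tokens, label))

# Hypothetical tokenized training documents with placeholder words
train_docs = [["a", "b", "a"], ["a", "c"], ["d", "e"], ["d", "d", "b"]]
train_labels = ["Plutarch", "Plutarch", "Xenophon", "Xenophon"]

classifier = ToyNaiveBayes().fit(train_docs, train_labels)
print(classifier.predict(["a", "a", "c"]), classifier.predict(["d", "e"]))
```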

In [10]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

naive_bayes = MultinomialNB().fit(tfidf_train, author_train)
author_pred = naive_bayes.predict(tfidf_test)

print(metrics.classification_report(author_test, author_pred))

print(metrics.confusion_matrix(author_test, author_pred))

              precision    recall  f1-score   support

    Plutarch       0.90      1.00      0.95       162
    Xenophon       1.00      0.62      0.77        48

    accuracy                           0.91       210
   macro avg       0.95      0.81      0.86       210
weighted avg       0.92      0.91      0.91       210

[[162   0]
 [ 18  30]]
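
The figures in the report can be recovered directly from the confusion matrix, whose rows are true authors and whose columns are predicted authors (scikit-learn's convention). A short sketch of that arithmetic, with a hypothetical `precision_recall` helper:

```python
# Confusion matrix from the output above: rows = true author, columns = predicted author
labels = ["Plutarch", "Xenophon"]
matrix = [[162, 0],
          [18, 30]]

def precision_recall(matrix, i):
    """Precision and recall for class i of a square confusion matrix."""
    true_positive = matrix[i][i]
    predicted_as_i = sum(row[i] for row in matrix)  # column sum
    actually_i = sum(matrix[i])                     # row sum
    return true_positive / predicted_as_i, true_positive / actually_i

report = {label: precision_recall(matrix, i) for i, label in enumerate(labels)}
accuracy = sum(matrix[i][i] for i in range(len(labels))) / sum(map(sum, matrix))

for label, (precision, recall) in report.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The classifier never mislabels Plutarch as Xenophon, but it assigns 18 of the 48 Xenophon paragraphs to Plutarch, which is where the low Xenophon recall comes from.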


### Simple Neural Network

In [39]:
from tensorflow.keras import layers, Input, Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.losses import CategoricalCrossentropy


# vectorize_layer = TextVectorization(
#     standardize=analyze_test,
#     max_tokens=max_features,
#     output_mode="tf-idf",
# )
# vectorize_layer.adapt(text_train["Paragraph"])

VOCAB_SIZE = tfidf_train.shape[1]
print(VOCAB_SIZE)
EMBEDDING_SIZE = 100

model = Sequential()
model.add(layers.Dense(100, activation='relu', input_shape=(VOCAB_SIZE,)))
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(2, activation='softmax'))
# the final layer already applies softmax, so the loss receives probabilities, not logits
model.compile("adam", CategoricalCrossentropy(from_logits=False), metrics=["accuracy"])
model.summary()


15188
Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_20 (Dense)             (None, 100)               1518900   
_________________________________________________________________
dense_21 (Dense)             (None, 10)                1010      
_________________________________________________________________
dense_22 (Dense)             (None, 2)                 22        
Total params: 1,519,932
Trainable params: 1,519,932
Non-trainable params: 0
_________________________________________________________________
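
The parameter counts in the summary follow from the layer sizes: a dense layer has one weight per input-unit pair plus one bias per unit. A quick sketch of the arithmetic, with a hypothetical `dense_params` helper:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

vocab_size = 15188          # printed by the cell above
layer_sizes = [100, 10, 2]  # units in the three Dense layers

params = []
n_inputs = vocab_size
for n_units in layer_sizes:
    params.append(dense_params(n_inputs, n_units))
    n_inputs = n_units  # each layer's output feeds the next layer

print(params, sum(params))
```

Almost all of the parameters sit in the first layer, since its input is the full TF-IDF vocabulary.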


In [40]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

encoder = LabelEncoder()
encoded_author_train = encoder.fit_transform(author_train)
encoded_author_test = encoder.transform(author_test)

# convert integers to dummy variables (i.e. one hot encoded)
encoded_author_train = to_categorical(encoded_author_train)
encoded_author_test = to_categorical(encoded_author_test)

print(encoded_author_train.shape)
print(tfidf_train.shape)

(736, 2)
(736, 15188)


In [41]:
import numpy as np

def batch_generator(X, y, batch_size):
    # Yield dense mini-batches from the sparse matrix X forever, reshuffling after each full pass
    n_samples = X.shape[0]
    number_of_batches = int(np.ceil(n_samples / batch_size))
    counter = 0
    shuffle_index = np.arange(n_samples)
    np.random.shuffle(shuffle_index)
    while True:
        index_batch = shuffle_index[batch_size*counter:batch_size*(counter+1)]
        X_batch = X[index_batch, :].toarray()
        y_batch = y[index_batch]
        counter += 1
        yield X_batch, y_batch
        # start a new pass only once every batch has been served
        if counter >= number_of_batches:
            np.random.shuffle(shuffle_index)
            counter = 0

# fit_generator is deprecated; Model.fit accepts Python generators directly
model.fit(batch_generator(tfidf_train, np.array(encoded_author_train), 128),
          epochs=5,
          steps_per_epoch=len(text_train)//128,
          validation_data=(tfidf_test, encoded_author_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1cc1679a10>

In [47]:
author_test_pred = model.predict(tfidf_test).argmax(axis=1)
# LabelEncoder sorts the classes alphabetically: 0 = Plutarch, 1 = Xenophon
author_test_true = (author_test == "Xenophon").astype(int)

# scikit-learn expects the true labels first, then the predictions
print(metrics.classification_report(author_test_true, author_test_pred))
print(metrics.confusion_matrix(author_test_true, author_test_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98       162
           1       1.00      0.88      0.93        48

    accuracy                           0.97       210
   macro avg       0.98      0.94      0.96       210
weighted avg       0.97      0.97      0.97       210

[[162   0]
 [  6  42]]
