# CLTK Data Cleaning / Exploration

Before diving into the Epistles, I've spent the week getting more familiar with some of the tools for processing Classical texts in Python, my language of choice. Specifically, I've experimented with loading the texts, cleaning the data, and generating different representations of each document, all centered on the problem of classifying sentences as either Xenophon or Plutarch. Next week, I'll work on developing the classification models themselves on this problem; success is easier to benchmark here because the authorship of Xenophon's and Plutarch's works is not contested. Once I've explored the models there, I'll see what works and identify good candidates for the more difficult problem of classifying Plato's Epistles (authorship unknown, genre different from Plato's other works).

## Acquiring the Corpus

Acquiring the documents proved to be a simple task with the CLTK's Corpus Importer (which also allows users to import pre-trained word vectors and Greek-specific data cleaning functionality).

In [1]:
from cltk.corpus.utils.importer import CorpusImporter
from cltk.corpus.readers import get_corpus_reader

corpus_importer = CorpusImporter('greek')

corpus_importer.import_corpus("greek_text_perseus")
corpus_importer.import_corpus("greek_text_first1kgreek")
corpus_importer.import_corpus("greek_models_cltk")
corpus_importer.import_corpus("greek_word2vec_cltk")

## Creating a Dataframe

In [2]:

import pandas as pd

# Start with an empty frame holding one paragraph and its author per row.
df = pd.DataFrame(columns=['Paragraph', 'Author'])

## Cleaning Data

At this step, I converted the JSON-style hierarchical documents into lists of strings, one string per paragraph. I also took advantage of CLTK's data-cleaning functions, which remove superfluous punctuation (tailored to Perseus text) and normalize different representations of accented characters (polytonic vs monotonic Greek characters). Since capitalization in Greek is more or less restricted to proper nouns, I decided not to case-normalize the text explicitly.

In [3]:
from cltk.corpus.utils.formatter import tlg_plaintext_cleanup, cltk_normalize

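def process_document(doc):
    """Flatten one document into a list of cleaned paragraph strings."""
    cleaned_paragraphs = []
    for paragraph in doc['text'].values():
        para_string = ""
        # Paragraphs are expected to be dicts of sentences; a bare string yields "".
        if not isinstance(paragraph, str):
            for sent in paragraph.values():
                # Skip anything nested more deeply than a plain sentence string.
                if isinstance(sent, str):
                    para_string += tlg_plaintext_cleanup(sent)
        # Normalize the accented characters of the whole paragraph.
        sentence = cltk_normalize(para_string)
        cleaned_paragraphs.append(sentence)
        # cleaned_paragraphs.append(lemmatizer.lemmatize(sentence))
    return cleaned_paragraphs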
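
To make the normalization step more concrete: Unicode has two code points for several accented Greek vowels, one in the basic Greek block (tonos) and one in the Greek Extended block (oxia), and they look identical on screen. The sketch below uses the standard library's `unicodedata` as a stand-in. It assumes that `cltk_normalize` performs a comparable Unicode normalization, which is not verified here.

```python
import unicodedata

# Two encodings of what looks like the same letter, alpha with an acute accent.
alpha_oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA (Greek Extended block)
alpha_tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS (basic Greek block)

# Without normalization the two strings compare as different,
# so a vectorizer would count them as two separate tokens.
same_before = alpha_oxia == alpha_tonos

# NFC normalization maps both onto a single canonical code point.
same_after = (unicodedata.normalize("NFC", alpha_oxia)
              == unicodedata.normalize("NFC", alpha_tonos))

print(same_before, same_after)
```

Without this step, the same word typed in two different encodings would end up in two different vocabulary slots later on.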

In [4]:
perseus_reader = get_corpus_reader(corpus_name='greek_text_perseus', language='greek')

# Map the corpus's lowercase author field to the label used in the dataframe.
author_labels = {'plutarch': 'Plutarch', 'xenophon': 'Xenophon'}
rows = []

for doc in perseus_reader.docs():
    label = author_labels.get(doc["author"])
    if label is None:
        continue  # not one of the two authors of interest
    for paragraph in process_document(doc):
        rows.append({"Paragraph": paragraph, "Author": label})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
df = pd.DataFrame(rows, columns=['Paragraph', 'Author'])

In [11]:
df.head()
len(df[df["Author"] == "Xenophon"])

167

## Lemmatization and Word Representation

The next step is to transform each document into a vectorized representation.

One popular representation is the bag of words model, in which each document is represented as a vector of length *m*, where *m* is the number of unique words in the vocabulary. The value at each index of the vector is equal to the frequency of the corresponding word in the document.
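
As a minimal sketch of the bag of words idea, the snippet below builds count vectors by hand with `collections.Counter`. The two toy documents and the helper `bag_of_words_vector` are made up for illustration; they are not taken from the corpus or from any library.

```python
from collections import Counter

# Toy documents, split on whitespace; hypothetical, not taken from the corpus.
toy_docs = ["λόγος καὶ ἔργον", "λόγος καὶ λόγος"]

# The vocabulary is the sorted set of unique words across all documents.
toy_vocab = sorted({word for doc in toy_docs for word in doc.split()})

def bag_of_words_vector(doc, vocab):
    """Return a list of length len(vocab) holding each word's count in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

toy_vectors = [bag_of_words_vector(doc, toy_vocab) for doc in toy_docs]

for doc, vector in zip(toy_docs, toy_vectors):
    print(doc, dict(zip(toy_vocab, vector)))
```

Every vector has the same length *m* (here 3), and a word that is absent from a document simply gets a count of zero.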

The next representation is the TF-IDF model, in which each document is also represented as a vector of length *m*; however, the value at each index of the vector is now a score corresponding to how important that word is to the document - a score directly proportional to the word's frequency in the document and inversely proportional to the word's frequency in the document corpus at large.
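
The TF-IDF weighting can be sketched by hand as follows. This uses one common textbook form, term frequency times `log(N / document frequency)`; the toy documents and the helper `tfidf_scores` are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

# Toy tokenized documents; hypothetical, not taken from the corpus.
toy_corpus = [
    ["λόγος", "καὶ", "ἔργον"],
    ["λόγος", "καὶ", "λόγος"],
    ["καὶ", "πόλις"],
]

def tfidf_scores(doc, all_docs):
    """Score each word in doc as (term frequency) * log(N / document frequency)."""
    n_docs = len(all_docs)
    counts = Counter(doc)
    scores = {}
    for word, count in counts.items():
        term_freq = count / len(doc)
        doc_freq = sum(1 for d in all_docs if word in d)
        scores[word] = term_freq * math.log(n_docs / doc_freq)
    return scores

toy_scores = [tfidf_scores(doc, toy_corpus) for doc in toy_corpus]

for doc_scores in toy_scores:
    print(doc_scores)
```

A word that occurs in every document (here the conjunction) scores zero, while a word confined to a single document gets the largest weight.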

Finally, I've examined the possibility of using gensim to load pre-trained Greek word embeddings - trained by the CLTK team, to my knowledge, through n-grams. Alternatively, I intend to train my own word embeddings through more advanced neural methods. 

In this process, I made the decision to use a lemmatizer, which reduces each form to its morphological root. Given that Greek nouns, adjectives, and especially verbs can take hundreds of different morphological forms, I thought this would be an appropriate choice. However, this process comes at the expense of losing valuable semantic information - that is to say, the sentences "X sees Y" and "Y sees X" would be rendered the same, since the case endings that mark subject and object are stripped. One of my research goals is to ponder this tradeoff more intently and formulate a method which preserves both word semantics and morphology as much as possible.
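
The "X sees Y" / "Y sees X" problem can be illustrated with a toy example. The lemma table below is hand-written and hypothetical; it only stands in for a real lemmatizer so that the sketch runs without CLTK's models.

```python
import unicodedata
from collections import Counter

def nfc(text):
    """Normalize to NFC so dictionary keys and tokens compare reliably."""
    return unicodedata.normalize("NFC", text)

# Hypothetical hand-written lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {nfc(form): nfc(lemma) for form, lemma in {
    "ἄνθρωπον": "ἄνθρωπος",
    "ἵππον": "ἵππος",
    "τὸν": "ὁ",
    "ὁρᾷ": "ὁράω",
}.items()}

def toy_lemmatize(sentence):
    """Replace each whitespace-separated token by its lemma, if one is listed."""
    return [TOY_LEMMAS.get(tok, tok) for tok in nfc(sentence).split()]

man_sees_horse = "ὁ ἄνθρωπος τὸν ἵππον ὁρᾷ"  # "the man sees the horse"
horse_sees_man = "ὁ ἵππος τὸν ἄνθρωπον ὁρᾷ"  # "the horse sees the man"

# As surface forms, the two bags of words differ (nominative vs accusative).
surface_equal = (Counter(nfc(man_sees_horse).split())
                 == Counter(nfc(horse_sees_man).split()))

# After lemmatization the case endings are gone and the bags coincide.
lemma_equal = (Counter(toy_lemmatize(man_sees_horse))
               == Counter(toy_lemmatize(horse_sees_man)))

print(surface_equal, lemma_equal)
```

In a bag of words over surface forms the two sentences remain distinguishable through their endings; over lemmata they become the same vector.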

In [6]:
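from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from cltk.stem.lemma import LemmaReplacer
lemmatizer = LemmaReplacer('greek')

def analyze_text(text):
    """Tokenizer for the vectorizers: return the lemmata of a paragraph."""
    return lemmatizer.lemmatize(text)

# Unigram counts over lemmata (bag of words).
# Note: both vectorizers lowercase their input by default (lowercase=True).
cv = CountVectorizer(ngram_range=(1, 1), tokenizer=analyze_text)
bag_of_words = cv.fit_transform(df["Paragraph"])

# Same tokenizer, but weighted by TF-IDF instead of raw counts.
tf = TfidfVectorizer(ngram_range=(1, 1), tokenizer=analyze_text)
tfidf = tf.fit_transform(df['Paragraph'])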

In [7]:
# from gensim.models import Word2Vec
# model = Word2Vec.load("/Users/blissperry/cltk_data/greek/model/greek_word2vec_cltk/greek_s100_w30_min5_sg.model")


## Potential Classification Models (Part 2 - Coming Soon)

Roughly in order of complexity: 
- (Unigram) Naive Bayes
- Straight N-gram model
- RNN language model (with LSTM) 
- Attention-based models
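
As a preview of the simplest item on this list, here is a minimal sketch of a unigram Naive Bayes classifier in plain Python, using add-one smoothing. The function names (`train_naive_bayes`, `predict_author`) and the placeholder training tokens are hypothetical; this is not the model that will be used in Part 2, only an illustration of the idea.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Collect unigram counts per author from (tokens, author) pairs."""
    word_counts = defaultdict(Counter)  # author -> word -> count
    doc_counts = Counter()              # author -> number of training documents
    for tokens, author in labeled_docs:
        word_counts[author].update(tokens)
        doc_counts[author] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def predict_author(tokens, model):
    """Return the author with the highest log posterior for the tokens."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_author, best_score = None, float("-inf")
    for author in doc_counts:
        score = math.log(doc_counts[author] / total_docs)  # log prior
        total_words = sum(word_counts[author].values())
        for tok in tokens:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[author][tok] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_author, best_score = author, score
    return best_author

# Hypothetical placeholder tokens, not real text from either author.
toy_training = [
    (["a", "b", "a"], "Xenophon"),
    (["a", "c"], "Xenophon"),
    (["d", "e", "d"], "Plutarch"),
    (["d", "b"], "Plutarch"),
]
nb_model = train_naive_bayes(toy_training)
prediction = predict_author(["a", "a", "c"], nb_model)
print(prediction)
```

In practice the tokens would be the lemmata of each paragraph, and the predictions would be scored on held-out paragraphs from both authors.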
