Some sources:

* Gensim LDA: https://radimrehurek.com/gensim/models/ldamodel.html
* Misc clustering with Python: http://brandonrose.org/clustering
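* Scikit LDA (note: `sklearn.lda.LDA` is Linear Discriminant Analysis, not the Latent Dirichlet Allocation topic model): http://scikit-learn.org/0.16/modules/generated/sklearn.lda.LDA.html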
* Scikit NMF: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
* WMD in Python: http://vene.ro/blog/word-movers-distance-in-python.html
* Original WMD paper: http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf
* Scikit Affinity propagation: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html

# Greek (TLG)

In [1]:
import os
import time

In [2]:
from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_index
from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithets
from cltk.corpus.greek.tlg.parse_tlg_indices import select_authors_by_epithet
from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_of_author
from cltk.stop.greek.stops import STOPS_LIST as greek_stops
from cltk.tokenize.word import nltk_tokenize_words

from greek_accentuation.characters import base

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
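def stream_lemmatized_files(corpus_dir):
    """Yield the full text of each file in ~/cltk_data/user_data/<corpus_dir>."""
    user_dir = os.path.expanduser('~/cltk_data/user_data/' + corpus_dir)
    files = os.listdir(user_dir)

    for file in files:
        filepath = os.path.join(user_dir, file)
        # Assumes the lemmatized files are UTF-8; avoids relying on the platform default
        with open(filepath, encoding='utf-8') as fo:
            yield fo.read()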

In [5]:
t0 = time.time()

# Materialize the generator so the texts can be counted and reused
data_samples = list(stream_lemmatized_files('tlg_lemmatized_no_accents_no_stops'))

print('Time to collect texts: {}'.format(time.time() - t0))
print('Number of texts:', len(data_samples))

Time to collect texts: 3.0573339462280273
Number of texts: 1139


In [17]:
# tf-idf features
n_samples = 2000  # only echoed in the log message below; the corpus holds 1139 texts
n_features = 1000  # TODO: increase
n_topics = len(get_epithets())  # one topic per TLG epithet
n_top_words = 20

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words=None)
t0 = time.time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print('Time to extract tf-idf features: {} secs.'.format(time.time() - t0))

Time to extract tf-idf features: 48.39504814147949 secs.
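
To make the `min_df=2` and `max_df=0.95` settings above more concrete, here is a minimal pure-Python sketch of document-frequency pruning. It is a simplification and not scikit-learn's implementation: the helper name `prune_vocabulary` and the toy documents are made up for illustration, tokenization is a plain whitespace split, and `max_features` is ignored.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=2, max_df=0.95):
    """Return the sorted terms that survive document-frequency pruning.

    Hypothetical helper: min_df is an absolute document count and
    max_df a proportion of documents, as in the vectorizer call above.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency, not term frequency)
        doc_freq.update(set(doc.split()))
    return sorted(term for term, df in doc_freq.items()
                  if df >= min_df and df <= max_df * n_docs)


# Made-up toy corpus of unaccented lemmata
toy_docs = ['ειμι λογος θεος', 'ειμι λογος', 'ειμι ανηρ', 'ειμι θεος']
kept = prune_vocabulary(toy_docs)
print(kept)
```

In this toy corpus `ειμι` occurs in every document and is dropped by `max_df`, while `ανηρ` occurs in only one and is dropped by `min_df`.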


In [18]:
print("Fitting the NMF model with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time.time()
# NOTE: newer scikit-learn releases replace `alpha` with `alpha_W` / `alpha_H`
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print('Fit NMF model in {}'.format(time.time() - t0))

Fitting the NMF model with tf-idf features, n_samples=2000 and n_features=1000...
Fit NMF model in 3.3092401027679443
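
For intuition about what the fit above computes, the following is a rough sketch of non-negative matrix factorization using the classic multiplicative update rules on a tiny made-up matrix. This is a swapped-in technique for illustration: scikit-learn's default solver is coordinate descent, and this sketch has no regularization (no `alpha` or `l1_ratio`). The function name `nmf_multiplicative` and the matrix `V` are assumptions, not part of the notebook.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix V (docs x terms) as W @ H.

    W is docs x topics, H is topics x terms; both stay non-negative
    because each update only multiplies by non-negative ratios.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius-norm objective
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Made-up 4 x 4 "document-term" matrix with two obvious blocks
V = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.8, 1.0, 0.1, 0.0],
              [0.0, 0.0, 1.0, 0.9],
              [0.0, 0.1, 0.9, 1.0]])

W, H = nmf_multiplicative(V, n_components=2)
error = np.linalg.norm(V - W @ H)
print('Reconstruction error:', error)
```

Here `H` plays the role of `nmf.components_` (topic-term weights) and `W` the role of the document-topic matrix returned by `transform`.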


In [19]:
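def print_top_words(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted terms of each topic with their weights."""
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic Nr.%d:' % (topic_id + 1))
        # argsort() is ascending, so the reversed slice walks the largest weights first
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
              +' | ' for i in topic.argsort()[:-n_top_words - 1:-1]]))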
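
The slice `argsort()[:-n_top_words - 1:-1]` used above is compact but easy to misread. The sketch below, on a made-up weight vector, shows that it selects the indices of the largest weights in descending order and that it matches the more explicit "reverse, then take the first n" form.

```python
import numpy as np

# Made-up topic weights for a five-term vocabulary
topic = np.array([0.1, 2.2, 0.0, 0.8, 1.5])
n_top_words = 3

# argsort() returns indices ordered from smallest to largest weight
ascending = topic.argsort()

# Walk backwards from the end and stop after n_top_words items
top_idx = ascending[:-n_top_words - 1:-1]

# Equivalent, more explicit spelling
explicit = ascending[::-1][:n_top_words]

print(top_idx)   # indices of the three largest weights, largest first
print(explicit)
```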

In [20]:
print("Topics in NMF model:")
# NOTE: newer scikit-learn releases use get_feature_names_out() instead
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

print_top_words(nmf, tfidf_feature_names, n_top_words)

Topics in NMF model:

Topic Nr.1:
ειμι 2.22 | ου 2.01 | τος 1.06 | αυ 0.87 | εχω 0.83 | τας 0.57 | αλλ 0.48 | ποιεω 0.4 | εγω 0.38 | ωσπερ 0.35 | δεω1 0.34 | τουτων 0.32 | πα 0.32 | ευ 0.29 | ος 0.29 | εαυτου 0.29 | αυτον 0.28 | οσος 0.27 | ουδεν 0.27 | παντα 0.27 | 

Topic Nr.2:
αυ 2.86 | τος 2.8 | αυτον 0.65 | αυτους 0.25 | χριστος 0.23 | αυτην 0.23 | θεαομαι 0.22 | τας 0.2 | υιος 0.19 | εκει 0.17 | κυριου 0.16 | εαυτου 0.15 | τοτε 0.13 | γυνη 0.13 | εγενετο 0.13 | αδελφος 0.13 | κυριος 0.12 | υιον 0.12 | θεος 0.12 | ινα 0.11 | 

Topic Nr.3:
εγω 3.51 | νυ 0.31 | συ 0.19 | εμεω 0.19 | θεαομαι 0.16 | ου 0.14 | χριστος 0.13 | εμοι 0.12 | παρ 0.12 | ειδον 0.11 | αγιος 0.11 | αλλ 0.11 | οπως 0.1 | πως 0.1 | ερχομαι 0.1 | οιδα 0.1 | ταυτ 0.09 | ποθεν 0.09 | καγω 0.09 | ειπε 0.09 | 

Topic Nr.4:
φημι 2.73 | καλεω 0.38 | ειμι 0.36 | αυτον 0.35 | φησι 0.28 | γενεσθαι 0.23 | φησιν 0.22 | αθηνη 0.22 | λεγει 0.18 | λεγεσθαι 0.14 | φασι 0.14 | αθηναι 0.13 | μεντοι 0.13 | λεγεται 0.12 | καθα 0.12 

In [22]:
tfidf.shape

(1139, 1000)

In [23]:
doc_topic_distrib = nmf.transform(tfidf)  # numpy.ndarray, (n_docs, n_topics); NMF weights, rows are not normalized

In [26]:
doc_topic_distrib.shape

(1139, 55)
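
Despite the name `doc_topic_distrib`, the rows returned by `nmf.transform` are non-negative weights, not probabilities, so they do not generally sum to 1. One possible way to work with such a matrix is sketched below on a small made-up stand-in: pick the strongest topic per document and, if proportions are wanted, normalize each row while guarding against all-zero rows. The variable names here are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for the document-topic matrix: 3 documents x 3 topics (made-up values)
doc_topic = np.array([[0.01, 0.38, 0.00],
                      [0.13, 0.07, 0.01],
                      [0.00, 0.00, 0.00]])

# Index of the highest-weighted topic for each document
# (an all-zero row falls back to topic 0, so check row sums before trusting it)
dominant_topic = doc_topic.argmax(axis=1)

# Row-normalize to proportions; rows that sum to zero are left as zeros
row_sums = doc_topic.sum(axis=1, keepdims=True)
proportions = np.divide(doc_topic, row_sums,
                        out=np.zeros_like(doc_topic),
                        where=row_sums > 0)

print(dominant_topic)
print(proportions.sum(axis=1))
```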


In [32]:
import pandas

In [37]:
df = pandas.DataFrame(doc_topic_distrib)

In [39]:
print(df)

            0         1         2         3         4         5         6   \
0     0.013578  0.038987  0.029025  0.006141  0.024565  0.383003  0.000000   
1     0.132518  0.075712  0.010431  0.000000  0.018797  0.013699  0.000000   
2     0.074000  0.015881  0.074922  0.000000  0.025416  0.215582  0.000000   
3     0.090141  0.000000  0.068305  0.000000  0.061753  0.101658  0.000000   
4     0.188525  0.038693  0.000000  0.017228  0.024812  0.007424  0.000000   
5     0.024308  0.000000  0.121566  0.001350  0.034013  0.237043  0.000000   
6     0.157141  0.039724  0.047996  0.000000  0.016129  0.000000  0.000000   
7     0.105163  0.000000  0.068595  0.000000  0.050525  0.105301  0.000000   
8     0.020445  0.020189  0.041453  0.001801  0.019356  0.366678  0.000000   
9     0.134103  0.093481  0.000000  0.000000  0.004997  0.017075  0.000000   
10    0.118175  0.035509  0.007176  0.001182  0.027865  0.099679  0.000000   
11    0.125590  0.052472  0.057957  0.000000  0.025009  0.000000