## Libraries ##

[From [Digital Ocean](https://www.digitalocean.com/community/tutorials/how-to-import-modules-in-python-3)] Modules are Python `.py` files that consist of Python code. Any Python file can be referenced as a module. A Python file called `hello.py` has the module name `hello`, which can be imported into other Python files or used in the Python command line interpreter.

Modules can define functions, classes, and variables that you can reference in other Python `.py` files or via the Python command line interpreter. In Python, modules are accessed with the `import` statement. Importing a module executes its code and keeps its definitions in the module's own namespace, so the importing file can use them through the module name.
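
As a minimal sketch of the import mechanism described above, the following writes a throwaway `hello.py` into a temporary directory and imports it. The module contents (`GREETING`, `greet`) are hypothetical and only serve as an illustration.

```python
import importlib
import os
import sys
import tempfile

# Write a throwaway module named hello.py into a temporary directory.
# GREETING and greet are made-up names used only for this illustration.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "hello.py"), "w") as fobj:
    fobj.write("GREETING = 'Hello'\n")
    fobj.write("\n")
    fobj.write("def greet(name):\n")
    fobj.write("    return '{}, {}!'.format(GREETING, name)\n")

# Make the directory importable, then import the file by its module name.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
import hello

# Definitions live in the module's namespace and are reached through its name.
print(hello.GREETING)
print(hello.greet("World"))
```

The file name without `.py` becomes the module name, and everything defined at the top level of the file is available as an attribute of `hello`.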

When Python imports a module called `hello`, for example, the interpreter first searches for a built-in module with that name. If no built-in module is found, the interpreter then searches for a file named `hello.py` in the list of directories given by the `sys.path` variable.
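
The search order can be inspected from the standard library. This sketch only looks modules up; the name `hello_module_that_does_not_exist` is a deliberately made-up example.

```python
import importlib.util
import sys

# Names of modules compiled into the interpreter itself; these are found first.
print("sys" in sys.builtin_module_names)

# Directories (and archives) that are searched, in order, for other modules.
for entry in sys.path[:3]:
    print(repr(entry))

# find_spec reports where a module would be loaded from without importing it.
spec = importlib.util.find_spec("json")
print(spec.origin)  # typically a path inside the standard library

# A top-level name that cannot be found yields None instead of raising.
missing = importlib.util.find_spec("hello_module_that_does_not_exist")
print(missing)
```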

In [9]:
import glob, os, re

from natsort import natsorted
# structuring
from sklearn.feature_extraction.text import TfidfVectorizer
# model
from sklearn.decomposition import NMF

## Functions ##
Python lets us package code for reuse by defining functions: named blocks of code that can be executed again by calling them. A function definition opens with the keyword `def`, followed by the name of the function (`txt2dict`) and a parenthesized list of parameter names (`filenames`). The body of the function, that is, the statements that are executed when it runs, is indented below the definition line. In `txt2dict` the body concludes with the `return` keyword followed by the return value; a function without a `return` statement, such as `display_topics`, returns `None`.
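
A small sketch of the same anatomy, using two hypothetical helper functions (`strip_extension` and `announce` are not part of this notebook):

```python
def strip_extension(filename, extension=".txt"):
    """Return filename without the given extension, if it ends with it."""
    if filename.endswith(extension):
        return filename[: -len(extension)]
    return filename


def announce(filename):
    """Print a message; without a return statement the call evaluates to None."""
    print("reading {}".format(filename))


name = strip_extension("report_01.txt")
result = announce(name)
print(name, result)
```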

In [18]:
def txt2dict(filenames):
    """ read all txt files to dictionary with filename as key
    """
    # out vars
    db = dict()
    # patterns to delete: the ".txt" extension and anything inside angle brackets
    tag_file = re.compile(r"\.txt$")
    tag_content = re.compile(r"<(.*?)>")
    # build
    for filename in filenames:
        with open(filename, "r") as fobj:
            content = fobj.read()
        # strip tag-like markup, then key the text by the bare file name
        content = tag_content.sub("", content)
        db[tag_file.sub("", os.path.basename(filename))] = content

    return db

def display_topics(model, feature_names, no_top_words):
    """ display no_top_words number of words for each topic in a sklearn latent variable model
    """
    for topic_idx, topic in enumerate(model.components_):
        print("Topic {}:".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
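
The slice `[:-no_top_words - 1:-1]` in `display_topics` walks the sorted indices backwards from the end, so it yields the indices of the largest weights first. A pure-Python sketch of the same idea, with made-up weights and feature names, and `sorted` standing in for NumPy's `argsort`:

```python
# Hypothetical topic weights and vocabulary, for illustration only.
weights = [0.1, 0.7, 0.05, 0.4, 0.3]
feature_names = ["art", "site", "poem", "rock", "cave"]
no_top_words = 3

# Indices ordered by ascending weight, like numpy's argsort.
order = sorted(range(len(weights)), key=lambda i: weights[i])

# Step backwards from the last index and stop after no_top_words items.
top = order[:-no_top_words - 1:-1]

print(order)
print(top)
print(" ".join(feature_names[i] for i in top))
```

Here `order` is `[2, 0, 4, 3, 1]` and `top` is `[1, 3, 4]`, so the printed words are `site rock cave`.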

## Data ##

In [12]:
d_path = os.path.join("..","data")
fnames = sorted(glob.glob(os.path.join(d_path,"ocr","*.txt")))
data = txt2dict(fnames)
fnames = natsorted(list(data.keys()))
texts = [data[fname] for fname in fnames]

## Vector space model ##

[From [Wiki](https://en.wikipedia.org/wiki/Vector_space_model)] Documents and queries are represented as vectors over the same $t$ terms:

$$
d_j = (w_{1,j}, w_{2,j}, \dotsc, w_{t,j})
$$
$$
q = (w_{1,q}, w_{2,q}, \dotsc, w_{t,q})
$$

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting (see the example below).

The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus).

Vector operations can be used to compare documents with queries.
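
One common vector operation for this comparison is cosine similarity, the cosine of the angle between a document vector and a query vector. The sketch below uses a hypothetical four-term vocabulary and made-up weights; it illustrates the idea and is not part of the notebook's pipeline.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_u * norm_v)


# One dimension per term: ["pottery", "site", "poem", "poet"]
documents = {
    "archaeology": [2.0, 3.0, 0.0, 0.0],
    "poetry": [0.0, 0.0, 4.0, 1.0],
}
query = [1.0, 1.0, 0.0, 0.0]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(documents,
                key=lambda name: cosine_similarity(query, documents[name]),
                reverse=True)
print(ranked)
```

Documents that share no terms with the query get a similarity of 0, and a document pointing in the same direction as the query gets 1.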

### TF-IDF weighting ###

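tf-idf weights a term by how often it occurs in a document (term frequency) and discounts it by how many documents in the corpus contain it (inverse document frequency), so terms that are frequent in one document but rare in the corpus receive the highest weights. In the cell below, `max_df=0.80` discards terms that occur in more than 80% of the documents, `min_df=10` discards terms that occur in fewer than 10 documents, and `max_features` keeps only the `no_features` most frequent of the remaining terms.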

In [14]:
# number of words to use from the lexicon (passed to the vectorizer as max_features)
no_features = 1000

tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=10, max_features=no_features)

tfidf = tfidf_vectorizer.fit_transform(texts)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
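
To make the weighting concrete, here is a pure-Python sketch of tf-idf on a made-up three-document corpus. It assumes a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector, which is my understanding of `TfidfVectorizer`'s default settings. Tokenization is simplified to whitespace splitting, so this is an approximation and not a reimplementation of the vectorizer.

```python
import math
from collections import Counter

# Made-up toy corpus, for illustration only.
docs = [
    "rock art site",
    "rock cave site site",
    "poem poet love",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
vocabulary = sorted({term for doc in tokenized for term in doc})

# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))


def idf(term):
    """Smoothed inverse document frequency (assumed default, see lead-in)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1.0


def tfidf_vector(tokens):
    """Raw term counts times idf, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[term] * idf(term) for term in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw


matrix = [tfidf_vector(doc) for doc in tokenized]
for row in matrix:
    print(["{:.2f}".format(x) for x in row])
```

In the first document, `art` (found in one document) ends up with a larger weight than `rock` (found in two), even though both occur once.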

## Non-negative Matrix Factorization (NMF) ##

In [15]:
no_topics = 50
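# configure only; the model is fitted by fit_transform in the next cell
# note: `alpha` was removed in scikit-learn 1.2 in favour of `alpha_W` / `alpha_H`
nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd')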
no_top_words = 10

In [16]:
W = nmf.fit_transform(tfidf)
H = nmf.components_
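
The cell above factorizes the tf-idf matrix V into two non-negative matrices with V ≈ WH. Each row of `W` holds one document's topic weights and each row of `H` holds one topic's term weights, which is what `display_topics` prints. The pure-Python sketch below illustrates the idea on a made-up 4×4 matrix using multiplicative updates. That solver is swapped in for brevity and is not what the `NMF` call above uses; the helper names are hypothetical.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Euclidean distance between v and the reconstruction w @ h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf_multiplicative(v, n_components, n_iter=200, seed=1, eps=1e-9):
    """Approximate v by w @ h with non-negative factors (multiplicative updates)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random start, so the factors stay non-negative.
    w = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(n_components)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(n_components)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_components)]
             for i in range(n_rows)]
    return w, h


# Made-up document-term matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
v = [
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 2.0, 3.0],
]
w, h = nmf_multiplicative(v, n_components=2)
print("W is {}x{}, H is {}x{}".format(len(w), len(w[0]), len(h), len(h[0])))
print("reconstruction error: {:.3f}".format(frobenius_error(v, w, h)))
```

With two components, `W` has one row per document and two columns, and `H` has two rows and one column per term, mirroring the shapes of `W` and `H` in the notebook.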

In [17]:
print("*** NMF model with {} components".format(no_topics))
display_topics(nmf, tfidf_feature_names, no_top_words)

*** NMF model with 50 components
Topic 0:
theory does sense way experience say fact object point things
Topic 1:
pottery site sites archaeological excavations archaeology area neolithic stone period
Topic 2:
der die und von das den zu ist nicht des
Topic 3:
god divine theology biblical man evil love creation lord does
Topic 4:
images image visual painting representation century space cult vision landscape
Topic 5:
la et le les des du une dans est que
Topic 6:
book chapter author text books reader work volume authors reading
Topic 7:
10 11 12 13 14 15 20 16 17 18
Topic 8:
amp press 2000 2002 2003 2004 2001 university 1997 1998
Topic 9:
art artist painting work arts aesthetic works museum objects visual
Topic 10:
animals animal human humans beings moral species food like people
Topic 11:
rock sites art site cave figures river archaeological figure red
Topic 12:
poem poetry poet love lines literary like line death self
Topic 13:
american history university america york war century nationa