## Importing files sorted on filename ##

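Here we import your .txt files together with their filenames. The filenames are sorted in natural order (so that, for example, `2.txt` comes before `10.txt`) to keep the files in the same order as your index.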

NB: you need to install the `natsort` module (https://anaconda.org/anaconda/natsort),
    which you can do from either the Anaconda Prompt or the Anaconda Navigator.
    If you use the prompt, run: `conda install natsort`
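
To see why natural sorting matters for numbered files, here is a minimal standard-library sketch. The helper `natural_key` and the example filenames are made up for illustration; the helper only approximates what `natsorted` does and is not a replacement for it.

```python
import re

def natural_key(name):
    """Split a name into text and digit chunks so digit runs compare as integers."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]

# hypothetical filenames, not taken from the corpus
names = ["doc10.txt", "doc2.txt", "doc1.txt"]

plain_order = sorted(names)                      # character-by-character comparison
natural_order = sorted(names, key=natural_key)   # digit runs compared numerically

print(plain_order)
print(natural_order)
```

With plain sorting `doc10.txt` lands before `doc2.txt`, which would silently misalign the texts with an index that is ordered numerically.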

In [2]:
import os, glob
from natsort import natsorted

def read_txt_dir(dirpath):
    """ Read all .txt files in the directory dirpath
        - returns two lists in the same (natural) order:
          the file contents and the corresponding filenames
    """
    filenames = natsorted(glob.glob(os.path.join(dirpath,"*.txt")))
    files = list()
    for filename in filenames:
        with open(filename,"r") as fobj:
            files.append(fobj.read())
    # keep only the file name; os.path.basename also handles Windows path separators
    filenames = [os.path.basename(filename) for filename in filenames]
    return files, filenames

# import articles
article_path = os.path.join("dat","articles")
articles, article_names = read_txt_dir(article_path)

# import magazines
magazine_path = os.path.join("dat","magazines")
magazines, magazine_names = read_txt_dir(magazine_path)

## Tokenization and basic corpus statistics ##

Start by computing basic corpus statistics. This requires a few functions for preprocessing your string data. We use `re` to remove punctuation and `NLTK` for tokenization; the same functionality could be implemented with `re` alone.
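
As a rough illustration of the `re`-only alternative, the sketch below tokenizes by collecting runs of word characters. The function name `regex_word_tokenize` is hypothetical, and its output is not guaranteed to match the NLTK-based tokenizer on every input (contractions, hyphenated words and similar cases may be split differently).

```python
import re

def regex_word_tokenize(string, casefold=True):
    """Tokenize with re alone: every run of word characters becomes one token."""
    if casefold:
        string = string.lower()
    # \w+ matches letters, digits and underscore, so punctuation is dropped implicitly
    return re.findall(r"\w+", string)

sample_tokens = regex_word_tokenize("Hello, world! It's 2 o'clock.")
print(sample_tokens)
```

Because punctuation never matches `\w+`, no separate punctuation-removal step or empty-token filter is needed in this variant.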

In [3]:
import re
from nltk import word_tokenize

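def remove_punctuation(string):
    """ Remove every non-word character from string
    (anything other than letters, digits and underscore)
    """
    pattern = re.compile(r"\W+")
    return pattern.sub("", string)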

def clean_word_tokenize(string, casefold=True, nopunct=True):
    """ Word-level tokenizer that lowercases the string if casefold is True
    and removes punctuation from the tokens if nopunct is True
    """
    if casefold:
        string = string.lower()
    tokens = word_tokenize(string)
    if nopunct:
        tokens = [remove_punctuation(token) for token in tokens]
    # drop tokens that are empty, e.g. pure punctuation after removal
    tokens = [token for token in tokens if token]
    return tokens

flatten = lambda l: [item for sublist in l for item in sublist]

# tokenize all articles at word level (all words are lowercased and punctuation removed)
articles_tokens = list(map(clean_word_tokenize, articles))  # apply tokenizer to every string in the list
tokens = sorted(flatten(articles_tokens))  # all tokens in one (flat) sorted list
n_tokens = len(tokens)  # number of tokens
n_types = len(set(tokens))  # number of types (distinct tokens)

print("The corpus consists of {} tokens distributed over {} lexical types".format(n_tokens, n_types))
print("The lexical richness measured as the type-token ratio is {}".format(round(n_types/n_tokens,4)))
print("On average every word is repeated {} times".format(round(n_tokens/n_types,2)))

The corpus consists of 287177 tokens distributed over 13104 lexical types
The lexical richness measured as the type-token ratio is 0.0456
On average every word is repeated 21.92 times
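
To make the two ratios concrete, here is a small worked example on a made-up token list (the `toy_` names are illustrative and do not touch the corpus variables). The type-token ratio is types divided by tokens, and the average repetition is its inverse, i.e. the mean number of occurrences per type.

```python
from collections import Counter

# made-up tokens for illustration only
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

toy_counts = Counter(toy_tokens)          # occurrences per type
n_toy_tokens = len(toy_tokens)            # 8 tokens
n_toy_types = len(toy_counts)             # 6 types: the, cat, sat, on, mat, end

toy_ttr = n_toy_types / n_toy_tokens      # 6 / 8
toy_mean_freq = n_toy_tokens / n_toy_types  # 8 / 6

print(round(toy_ttr, 4), round(toy_mean_freq, 2))
```

Note that the type-token ratio depends on corpus size: longer texts tend to have lower ratios, so values are only comparable between corpora of similar length.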


## Topic Modelling ##

In [4]:
from nltk.tag import pos_tag

# tag every token with its part of speech (universal tagset) and store the tagged tokens in a list
tokens_tagged = pos_tag(tokens, tagset = 'universal', lang = 'eng')
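
`pos_tag` returns a list of `(token, tag)` tuples. One possible way to inspect such a list, shown here as a sketch and not as a step from the notebook, is to count the tags and pick out the tokens of a single word class. The sample pairs below are written by hand for illustration and are not output from the corpus.

```python
from collections import Counter

# hand-written sample in the (token, tag) shape that pos_tag returns
sample_tagged = [
    ("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("mat", "NOUN"),
]

# frequency of each part-of-speech tag
tag_counts = Counter(tag for _, tag in sample_tagged)

# keep only the tokens tagged as nouns
sample_nouns = [token for token, tag in sample_tagged if tag == "NOUN"]

print(tag_counts.most_common())
print(sample_nouns)
```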
