## Importing files sorted on filename ##

Here we import your .txt files together with their filenames. The filenames are sorted in natural order, so the texts keep the order of your index.

NB: you need to install the `natsort` module (https://anaconda.org/anaconda/natsort), which you can do either from the Anaconda Prompt or from the Anaconda Navigator. If you use the prompt, write: `conda install natsort`
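
Why natural sorting matters: a plain string sort places `issue10.txt` before `issue2.txt`, which would break the order of the index. The sketch below illustrates the idea with the standard library only. It is not how `natsort` is implemented, and the helper `natural_key` and the file names are made up for the illustration.

```python
import re

def natural_key(name):
    """Split a name into text and number chunks (hypothetical helper, not part of natsort)."""
    # re.split with a capturing group keeps the digit runs, which land on the odd positions
    chunks = re.split(r"(\d+)", name)
    return [int(chunk) if i % 2 else chunk.lower() for i, chunk in enumerate(chunks)]

# made-up file names for illustration
names = ["issue10.txt", "issue2.txt", "issue1.txt"]

plain_sorted = sorted(names)                      # character-by-character comparison
natural_sorted = sorted(names, key=natural_key)   # digit runs compared as integers

print(plain_sorted)
print(natural_sorted)
```

In the notebook itself `natsorted` does this job and handles more cases (signs, floats, locale), so the sketch is only meant to show what "sorted on filename" means here.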

In [2]:
import os, glob
from natsort import natsorted

def read_txt_dir(dirpath):
    """ import all .txt files from the directory with path dirpath
        - output file contents and filenames in two lists (naturally sorted on filename)
    """
    filenames = natsorted(glob.glob(os.path.join(dirpath,"*.txt")))
    files = list()
    for filename in filenames:
        with open(filename,"r") as fobj:
            files.append(fobj.read())
    # keep only the file name; os.path.basename also works with Windows path separators
    filenames = [os.path.basename(filename) for filename in filenames]
    return files, filenames 

# import articles
article_path = os.path.join("dat","articles")
articles, article_names = read_txt_dir(article_path)

# import magazines
magazine_path = os.path.join("dat","magazines")
magazines, magazine_names = read_txt_dir(magazine_path)

print(magazines[0][:1000])

The people of falsehood constantly attempt to make the deaths of righteous men and their slayings by the enemies of Islam – the mushrikin and the apostates – into a sign foretelling the breaking of the muwahiddin. But those fools do not realize that Allah has ordained for each soul its set term before He created the heavens and the earth. Allah said, “And each nation has its set term. They can neither delay it for an hour nor advance it” (Al-A’raf 34). In this decree, all people are equal, including prophets and righteous people as well as disbelievers and tyrants. Those fools do not realize that Allah preserves His religion however He wills, and this religion will remain established and will not be damaged by the death of any person. If it would have been damaged by anything, it would have been by the death of the Prophet and those of his noble Companions. But the religion remained long after their departure, as Allah established its foothold and spread it on the earth. He preserved i

## Tokenization and basic corpus statistics ##

Start by computing basic corpus statistics. This requires a few functions for preprocessing your string data. We use `re` to remove punctuation and `NLTK` for tokenization, although the same functionality could be implemented with `re` alone.

In [3]:
import re
from nltk import word_tokenize

def remove_punctuation(string):
    pattern = re.compile(r"\W+")
    return pattern.sub("",string)

def clean_word_tokenize(string, casefold = True, nopunct = True):
    """ Word-level tokenizer that casefolds to lower if casefold is True
    and removes punctuation if nopunct is True
    """
    if casefold:
        string = string.lower()
    tokens = word_tokenize(string)
    if nopunct:
        tokens = [remove_punctuation(token) for token in tokens]
    # drop tokens that are empty after punctuation removal
    tokens = [token for token in tokens if token]
    return tokens

flatten = lambda l: [item for sublist in l for item in sublist]

# tokenize all articles at word level (all words are lowercased and punctuation removed)
articles_tokens = list(map(clean_word_tokenize, articles))  # apply tokenizer to every string object in the list
tokens = sorted(flatten(articles_tokens))  # all tokens in one (flat) sorted list
n_tokens = len(tokens)  # number of tokens
n_types = len(set(tokens))  # number of types

print("The corpus consists of {} tokens distributed over {} lexical types".format(n_tokens, n_types))
print("The lexical richness measured as the type-token ratio is {}".format(round(n_types/n_tokens,4)))
print("On average every word is repeated {} times".format(round(n_tokens/n_types,2)))

The corpus consists of 287177 tokens distributed over 13104 lexical types
The lexical richness measured as the type-token ratio is 0.0456
On average every word is repeated 21.92 times
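
As noted above, the tokenization could also be done with `re` alone. The following is a minimal sketch of that idea, assuming a token is simply a run of word characters. The names `re_word_tokenize`, `TOKEN_PATTERN` and the example sentence are made up for the illustration, and the result will not match the NLTK tokenizer in every case (contractions in particular are split differently).

```python
import re

# assumption: a token is a run of word characters (letters, digits, underscore)
TOKEN_PATTERN = re.compile(r"\w+")

def re_word_tokenize(string, casefold=True):
    """Tokenize with re only; punctuation is dropped because it never matches \\w+."""
    if casefold:
        string = string.lower()
    return TOKEN_PATTERN.findall(string)

# made-up example sentence
example = "He said: 'I loved the food, didn't I?'"
example_tokens = re_word_tokenize(example)
example_types = set(example_tokens)

print(example_tokens)
print(len(example_tokens), len(example_types))
```

Note that `didn't` becomes the two tokens `didn` and `t` here, so token and type counts can differ slightly from the NLTK-based numbers.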


## Stopword filtering in list of lists ##

In [26]:
from collections import defaultdict
from operator import itemgetter

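def gen_ls_stoplist(texts, n = 100):
    """ generate a stopword list with the n most frequent tokens
    in a list of tokenized texts
    """
    t_f_total = defaultdict(int)  # total frequency of each token
    for text in texts:
        for token in text:
            t_f_total[token] += 1
    # sort tokens by descending frequency and keep the n most frequent
    nmax = sorted(t_f_total.items(), key = itemgetter(1), reverse = True)[:n]
    return [elem[0] for elem in nmax]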

# generate stopword list from articles with 50 stopwords (not really a good sw list)
sw = gen_ls_stoplist(articles_tokens, 50)

# list with three texts in string
listofstrings = ["he said he loved good food","she and allah had great fun","so what am I going to do"]

# tokenize each text
listoflists = list()
for string in listofstrings:
    listoflists.append(clean_word_tokenize(string))

print("list of lists with stopwords:\n")
print(listoflists)

# remove stopwords from list of list
nostopwords = list()
for alist in listoflists:
    out = [token for token in alist if token not in sw]
    nostopwords.append(out)

print("list of lists without stopwords:\n")
print(nostopwords)

list of lists with stopwords:

[['he', 'said', 'he', 'loved', 'good', 'food'], ['she', 'and', 'allah', 'had', 'great', 'fun'], ['so', 'what', 'am', 'i', 'going', 'to', 'do']]
list of lists without stopwords:

[['loved', 'good', 'food'], ['she', 'had', 'great', 'fun'], ['am', 'going', 'do']]
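
The frequency-based stoplist can also be written with `collections.Counter`, which may be easier to read. This is only a sketch of an alternative, not a replacement for the function above. The name `gen_stoplist_counter` and the toy texts are made up, and turning the stoplist into a set is an optional extra that speeds up the `not in` lookups on large corpora.

```python
from collections import Counter
from itertools import chain

def gen_stoplist_counter(texts, n=100):
    """Return the n most frequent tokens in a list of tokenized texts (hypothetical helper)."""
    # chain.from_iterable flattens the list of token lists without building a copy
    counts = Counter(chain.from_iterable(texts))
    return [token for token, _ in counts.most_common(n)]

# made-up toy corpus: two tokenized texts
toy_texts = [["the", "cat", "sat"], ["the", "dog", "sat", "the"]]
toy_sw = gen_stoplist_counter(toy_texts, 2)
toy_sw_set = set(toy_sw)  # set membership tests are faster than list membership tests

# filter each text separately so the list-of-lists structure is kept
toy_nostopwords = [[token for token in text if token not in toy_sw_set] for text in toy_texts]

print(toy_sw)
print(toy_nostopwords)
```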


# Stopwords

In [4]:
from collections import defaultdict
from operator import itemgetter

def gen_ls_stoplist(texts, n = 100):
    t_f_total = defaultdict(int)  # total frequency of each token
    for text in texts:
        for token in text:
            t_f_total[token] += 1
    # sort tokens by descending frequency and keep the n most frequent
    nmax = sorted(t_f_total.items(), key = itemgetter(1), reverse = True)[:n]
    return [elem[0] for elem in nmax]

# generate stopword list from articles with 50 stopwords (not really a good sw list)
sw = gen_ls_stoplist(articles_tokens, 50)


# import sw list from nltk instead

import io

def read_txt(filepath):
    """
    Read the txt file at filepath and return its content
    as a string
    """
    # the with statement closes the file even if reading fails
    with io.open(filepath, 'r', encoding = 'utf-8') as f:
        content = f.read()
    return content

nltk_sw = read_txt('stopwords/english') # save nltk stopword list in variable
nltk_sw = clean_word_tokenize(nltk_sw)  # tokenize nltk stopword list

# apply stopword lists

# NB: test on a single flat list of tokens
test1 = ['i', 'am', 'a', 'muslim', 'who', 'loves', 'allah']
# works (list comprehension)
test1_nosw = [w for w in test1 if w not in sw]
# works (explicit loop)
test1_nosw = []
for w in test1: 
    if w not in sw: 
        test1_nosw.append(w)


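# filtering articles_tokens directly does not work: it is a list of token lists,
# so each element is a whole article (a list), which is never in sw.
# Filter inside each article instead:
nosw_articles_tokens = [[w for w in article if w not in sw] for article in articles_tokens]

# the same with explicit loops. The append of the filtered article must sit at the
# level of the outer loop; inside the inner loop the same list is appended once per
# token, which looks like the first article being duplicated over and over.
nosw_articles_tokens = []
for article in articles_tokens:
    nosw_article = []
    for word in article:
        if word not in sw:
            nosw_article.append(word)
    nosw_articles_tokens.append(nosw_article)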

# apply nltk_sw

# test
test2 = ['i', 'am', 'a', 'muslim', 'who', 'loves', 'allah']
# works (list comprehension)
test2_nosw = [w for w in test2 if w not in nltk_sw]
# works (explicit loop)
test2_nosw = []
for w in test2: 
    if w not in nltk_sw: 
        test2_nosw.append(w)




# Stemming

In [None]:
# open stemmer downloaded from nltk and apply to dataset

# Word frequency and possibly a distribution plot of selected words

In [None]:
# code
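
One possible starting point for this cell is sketched below, using only the standard library: it computes the relative frequency of a few selected words in each tokenized text, which could later feed a distribution plot. The helper `relative_frequencies`, the toy documents and the selected words are all made up; in the notebook the input would presumably be `articles_tokens`.

```python
from collections import Counter

def relative_frequencies(tokens, selected):
    """Relative frequency of each selected word in one tokenized text (hypothetical helper)."""
    counts = Counter(tokens)
    total = len(tokens)
    # guard against empty texts to avoid division by zero
    return {word: (counts[word] / total if total else 0.0) for word in selected}

# made-up toy data standing in for tokenized articles
toy_docs = [["a", "b", "a", "c"], ["b", "b", "c", "d"]]
selected_words = ["a", "b"]

toy_freqs = [relative_frequencies(doc, selected_words) for doc in toy_docs]

for i, freqs in enumerate(toy_freqs):
    print("Document", i, freqs)
```

Relative rather than raw frequencies are used so that texts of different length stay comparable.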

# Topic modeling (for now without stopword removal and without pos_tag)

In [None]:
from gensim import corpora, models

dictionary = corpora.Dictionary(articles_tokens)
print(dictionary.num_docs)


# create bag-of-words representation of the articles
text_bow = [dictionary.doc2bow(article) for article in articles_tokens]

# 10 topics
k = 10
mdl = models.LdaModel(text_bow, id2word = dictionary, num_topics = k, random_state = 1234)

for i in range(k):
    print('Topic',i)
    print([t[0] for t in mdl.show_topic(i,10)])
    print('-----')



# Dependency modeling (association rules)

In [None]:
# code