# c5-topic_extract challenge

### 1 Challenge Introduction
This particular notebook deals with the challenge of "topic extraction".<br>
**INTRO:**<br>
Extracting topics from documents is a common task in natural language processing. There are already some well-established algorithms for multi-document topic extraction. But what if there's only one document? Welcome to this challenge!<br>
**INPUT:**<br>
Data type: string<br>
Example: *a lot of text...*"The experimental results show that the Nouns are more related, reliable, and suitable words for finding the topic of the text."*...a lot of text*<br>
Depending on the performance of the preprocessing components (text_cleaning, parsing_pptx), this component receives more or less clean strings of text. <br>
**OUTPUT:** <br>
Data type: list of tuples<br>
Example: [("financial times", 0.74235),("controlling", 0.72123),("business", 0.69824)]<br> 
The goal is to produce a list of tuples, each containing a term and its respective score, so that the list can be ranked by score.
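
As a minimal sketch of that output format, ranking comes down to sorting the tuples by their second element. The terms and scores are the example values from above; `rank_terms` is a hypothetical helper name and not part of the challenge code.

```python
from typing import List, Tuple


def rank_terms(scored_terms: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Return the (term, score) tuples sorted by score, highest first."""
    return sorted(scored_terms, key=lambda pair: pair[1], reverse=True)


# Example values taken from the OUTPUT description, deliberately shuffled.
unranked = [("business", 0.69824), ("financial times", 0.74235), ("controlling", 0.72123)]
ranked = rank_terms(unranked)
print(ranked)
```

Whatever scoring approach you end up with, returning the result in this shape keeps it comparable with the baseline.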

In [1]:
import spacy
nlp = spacy.load('en')
import PyPDF2
import gensim
from gensim import corpora
# 'tika' is an alternative PDF parser; its output quality is similar, so we do not use it here
#from tika import parser

### 2 Loading Input Data
We use as base input data Bayer's sustainability report from 2016. It can be publicly downloaded here: https://www.investor.bayer.de/en/reports/sustainability-reports/.

It is the same document as in the text cleaning session, but uncleaned. Since it is a rather large document, the impact of the noise should be limited.

In [8]:
input_file_path = './document1.pdf'

In [9]:
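# This function goes through the pages of the PDF document and collects their text as one string with PyPDF2.
# Note: PdfFileReader and extractText are the legacy PyPDF2 (< 3.0) names.
def pdfparsing(file):
    pdf_reader = PyPDF2.PdfFileReader(file)
    # join the page texts, appending ". " after every page
    str_text = "".join(str(page.extractText()) + ". " for page in pdf_reader.pages)
    # drop all newline characters
    return str_text.replace("\n", "")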

In [10]:
# apply the pdfparsing function on our input document
with open(input_file_path, 'rb') as file:
    input_string = pdfparsing(file)
# alternative library for pdf parsing
# example = parser.from_file(input_file_path,'http://localhost:9998/tika')

In [11]:
# The input string is not perfectly clean but hopefully good enough for a sensible topic extraction
input_string[0:1000]

'  A Combined Management Report 1.4 Sustainable Conduct Bayer Annual Report 2016 Augmented Version 1.4 Sustainable Conduct 1.4.1 Commitment to Employees and Society > Attracting, developing and retaining the best managers and employees > Corporate culture: dialogue, diversity, innovation > Creating attractive working conditions > Wide-ranging societal engagement  Our business success is based to a large extent on the knowledge, skills, commitment and satis-faction of our employees. As a modern international employer, we offer our employees attractive conditions and wide-ranging individual development opportunities. The key to this is our highly effective system of vocational and ongoing training, which we are continuously extending. Along-side professional training, we focus on conveying our corporate values (LIFE) and establishing a dialogue-oriented corporate culture based on trust, respect for diversity and equality of oppor-tunity. That plays a part in employee satisfaction – along

In [6]:
# Before we do anything else, we need to tokenize our text, i.e. split the string into words and collect them in a list.
# The one-liner below does just that while also dropping tokens that are 2 characters long or shorter.
tokens = [token.text for token in nlp(input_string) if len(token) > 2]
print("first 10 tokens: ",tokens[0:10])
print("count of all tokens: ", len(tokens))

first 10 tokens:  ['Send', 'Orders', 'for', 'Reprints', 'reprints@benthamscience.net', 'Current', 'Drug', 'Safety', '2014', 'Validation']
count of all tokens:  3919


In [7]:
# In order to analyze "topics" from the text, we first need to convert our list of tokens into a dictionary.
# It is essentially just a Python dictionary that maps each unique word to an integer ID.
dictionary = corpora.Dictionary([tokens])
print("number of unique words: ",len(dictionary.token2id))
print("A quick glance into the dictionary: ", dictionary.token2id)
# As we can see, the dictionary is in alphabetical order, which immediately exposes all of the noise.
# If we did a good job in the text_cleaning challenge, that should not be an issue in the future.
# Let's hope that it won't affect our results for now.

number of unique words:  1322
A quick glance into the dictionary:  {'   ': 0, '                ': 1, '02451': 2, '1,000': 3, '1,17': 4, '1,299,056': 5, '1,351': 6, '1.7': 7, '1.781.434.1701': 8, '1.781.434.1763': 9, '10-year': 10, '100': 11, '100,000': 12, '101': 13, '106(11': 14, '108': 15, '10th': 16, '11(2': 17, '112': 18, '128': 19, '129(2': 20, '13(4': 21, '13.8': 22, '14.8': 23, '1431': 24, '1440': 25, '15(4': 26, '159': 27, '1594': 28, '16%-100': 29, '1623': 30, '17(1': 31, '17(10': 32, '17(8': 33, '18.2': 34, '180': 35, '195': 36, '1990': 37, '1992': 38, '1993': 39, '1996': 40, '1999': 41, '1RTI': 42, '2,192': 43, '2,907': 44, '2.4': 45, '20.2': 46, '2001': 47, '2002': 48, '2003': 49, '2004': 50, '2005': 51, '2006': 52, '2007': 53, '2008': 54, '2009': 55, '201': 56, '2010': 57, '2011': 58, '2012': 59, '2013': 60, '2014': 61, '21(Suppl': 62, '214': 63, '21CFR56.107': 64, '22.6': 65, '221': 66, '2212': 67, '241': 68, '25,000': 69, '26%-94': 70, '265': 71, '272': 72, '29.9': 73, '

In [11]:
# The dictionary simply maps words to their unique IDs. "doc2bow" builds (ID, frequency) tuples.
corpus = [dictionary.doc2bow(text) for text in [tokens]]
print(corpus)
# Meaning, the word with the ID "0" (which is just "    " as we can see above) appeared 58 times. 

[[(0, 58), (1, 21), (2, 11), (3, 2), (4, 5), (5, 3), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 5), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 3), (67, 1), (68, 12), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 53), (76, 27), (77, 1), (78, 1), (79, 1), (80, 1), (81, 2), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 3), (105, 1), (106, 1), (107, 1), (108, 1), (109, 1), (1
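
To make the two steps above more tangible, here is a pure-Python sketch of the idea behind the dictionary and the bag-of-words conversion. It is not gensim's actual implementation; `build_token2id` and `to_bag_of_words` are hypothetical helper names, and the sample tokens are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_token2id(tokens: List[str]) -> Dict[str, int]:
    """Assign one integer ID per unique token, in alphabetical order."""
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}


def to_bag_of_words(tokens: List[str], token2id: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return (token_id, count) tuples sorted by ID, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Made-up sample tokens, only to show the shape of the result.
sample_tokens = ["safety", "employee", "safety", "product", "employee", "safety"]
sample_token2id = build_token2id(sample_tokens)
sample_bow = to_bag_of_words(sample_tokens, sample_token2id)
print(sample_token2id)
print(sample_bow)
```

Each tuple pairs a word ID with the number of times that word occurs, which is all the information the following steps work with.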

In [12]:
# Now that we have the table of occurrences, we can extract some meaning from the text, although "meaning" is a
# pretty pretentious term here.

# We are using a bit of a trick. The algorithm of choice is the so-called Latent Dirichlet Allocation (LDA).
# In theory (and in practice), this algorithm is applied to a term-document matrix, so it expects a list of documents.
# However, we only have one document, so we pass a list containing just that one. The result is that we effectively
# get only "1 topic", defined by the most frequent words, no matter how many topics we request. Let's see:

# Find Topics in Textbody with Gensim LDA
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=1)
for i in range(NUM_TOPICS):
    print(f'topic {i}: ', ldamodel.show_topic(i, topn=10), "\n")

topic 0:  [('the', 0.035857454), ('and', 0.031633146), ('for', 0.011897531), ('Bayer', 0.011309708), ('our', 0.009557784), ('with', 0.009017344), ('2016', 0.008306255), ('are', 0.0069280136), ('The', 0.005201543), ('also', 0.0046851807)] 

topic 1:  [('the', 0.031902153), ('and', 0.025121765), ('for', 0.01106729), ('with', 0.009704396), ('our', 0.008160761), ('Bayer', 0.007862078), ('are', 0.0074881753), ('The', 0.005931841), ('2016', 0.005316668), ('Report', 0.0036851075)] 

topic 2:  [('the', 0.040739827), ('and', 0.03639443), ('our', 0.012603901), ('for', 0.010308251), ('are', 0.009632117), ('2016', 0.00937799), ('Bayer', 0.009086022), ('with', 0.007826334), ('The', 0.005672117), ('safety', 0.0048947735)] 

topic 3:  [('the', 0.051239863), ('and', 0.048497614), ('our', 0.013241939), ('for', 0.012166675), ('Bayer', 0.009038047), ('are', 0.008691564), ('with', 0.008513404), ('2016', 0.0070455484), ('The', 0.005992525), ('from', 0.0052657994)] 

topic 4:  [('the', 0.038394514), ('and',

What's happening here? 

Basically, LDA tries to find words that are distributed similarly across documents and allocates them to a common topic. Since we only look at one document, there is no distribution of words over documents to compare, only the distribution of words within that single document. That is why "meaning" is quite a pretentious term here: we are essentially just using the count statistic to derive the "topic" or, in our case, the "skills".
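
The count statistic mentioned here can be sketched in a few lines of plain Python. This is not what gensim computes internally; it is only the relative word frequency that a single-document, single-topic run ends up close to in practice. `relative_frequencies` is a hypothetical helper name and the demo tokens are made up.

```python
from collections import Counter
from typing import List, Tuple


def relative_frequencies(tokens: List[str], topn: int = 10) -> List[Tuple[str, float]]:
    """Return the topn tokens with count / total as a pseudo 'topic probability'."""
    if not tokens:
        return []
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(token, count / total) for token, count in counts.most_common(topn)]


# Made-up tokens, only to show the shape of the result.
demo_tokens = ["bayer", "employee", "bayer", "product", "bayer", "employee"]
demo_scores = relative_frequencies(demo_tokens, topn=3)
print(demo_scores)
```

The result has the same shape as the output of `show_topic`: a list of (word, score) tuples ordered by score.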

In [13]:
# To demonstrate it again: let's sort the corpus by frequency, extract the top 10 and see which words we get.
top10 = {}
# iterate over the corpus sorted by count (descending) and keep only the first 10 entries
for i, (token_id, _) in enumerate(sorted(corpus[0], key=lambda x: x[1], reverse=True)[:10]):
    # look up the word for the respective ID in the dictionary
    top10[i] = dictionary[token_id]
print(top10)

{0: 'the', 1: 'and', 2: 'for', 3: 'our', 4: 'Bayer', 5: 'with', 6: 'are', 7: '2016', 8: 'The', 9: 'from'}


As we can see, the simple word count produces approximately the same output as our fancy model. The few slight deviations between topics 0 to 4 most likely stem from LDA's random initialization and its prior hyperparameters, which influence how words are spread across topics.

### 3 The current approach
If it is that simple, why does it already *somehow* work? Again, spacy comes to the rescue. Basically, we filter out all words that are irrelevant and then run LDA, aka frequency analysis. The filtering is possible thanks to spacy's powerful tools. However, that power comes at a price: computation is pretty expensive, which makes it one of the potential improvements. Let's take a look at the current filtering. Don't worry about the apparent complexity, it's actually quite simple:

In [14]:
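# NoRelevantWordsError was not defined in this notebook; this minimal definition is an assumption
class NoRelevantWordsError(Exception):
    pass

# original implementation from Bayron
def tokenizing(text):
    # make sure (again) to remove the newline character \n
    clean_text = text.replace('\n', '')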
    # run text through spacy's pipeline
    doc = nlp(clean_text)
    # chunk based: enables processing word chunks
    relevant_words = []

    for chunk in doc.noun_chunks:
        # drop the chunk if its root is a 'complement of preposition'
        if chunk.root.dep_ != 'pcomp':
            term = ""

            for token in chunk:
                # drop numbers, symbols, punctuation, spaces, coordinating conjunctions and determiners
                if token.pos_ not in ['NUM','SYM','PUNCT','SPACE','CCONJ','DET']:
                    # drop pronouns and words that are super short. Potentially also filter for nouns only (commented out)
                    if token.lemma_ != "-PRON-" and len(token.lemma_) > 1: #and token.pos_ == "NOUN":
                        if term == "":
                            term += token.lemma_

                        else:
                            term += " " + token.lemma_
            # add term to the relevant_word list
            if term != "" and len(term) > 2:
                relevant_words.append(term)
    if len(relevant_words) == 0:
        raise NoRelevantWordsError(400, 'No words found')
    return relevant_words

In [15]:
# the implementation in action, which in this case takes roughly 3 seconds on my machine (far too long)
tokens = tokenizing(input_string)

In [16]:
tokens

['combined management report sustainable conduct bayer annual report',
 'augmented',
 'version',
 'sustainable conduct commitment',
 'employees',
 'society',
 'attracting',
 'good manager',
 'employee',
 'corporate culture',
 'dialogue',
 'diversity',
 'innovation',
 'attractive working condition',
 'wide range societal engagement',
 'business success',
 'large extent',
 'knowledge',
 'skill',
 'commitment',
 'satis faction',
 'employee',
 'modern international employer',
 'employee',
 'attractive condition',
 'wide range individual development opportunity',
 'key',
 'highly effective system',
 'vocational ongoing training',
 'corporate value',
 'life',
 'dialogue orient corporate culture',
 'trust',
 'respect',
 'diversity',
 'equality',
 'oppor tunity',
 'part',
 'employee satisfaction',
 'responsible approach',
 'struc tur working condition',
 'fair respectful treatment',
 'work',
 'transparent competitive equitable compensation system',
 'company pension plan',
 'ability',
 'family

In [17]:
# Let's take a look at the dictionary when words are filtered and moved into chunks 
# remember: our dictionary before contained ~4,500 unique tokens 
dictionary = corpora.Dictionary([tokens])
print("number of unique words: ",len(dictionary.token2id))
print("A quick glance into the dictionary: ", dictionary.token2id)
# Now we only have ~3,500, which is surprisingly many given how intensively we filtered.
# However, since we created word chunks, we did not just drop tokens but also created new ones.

number of unique words:  3429


In [18]:
# Now running the frequency analysis and outputting the 10 most frequent tokens
corpus = [dictionary.doc2bow(text) for text in [tokens]]

# As before: sort the corpus by frequency, extract the top 10 and see which words we get.
top10 = {}
# iterate over the corpus sorted by count (descending) and keep only the first 10 entries
for i, (token_id, _) in enumerate(sorted(corpus[0], key=lambda x: x[1], reverse=True)[:10]):
    # look up the word for the respective ID in the dictionary
    top10[i] = dictionary[token_id]
print('Count based top10: \n', top10, '\n')

# 'fake' LDA for comparison:
NUM_TOPICS = 1
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=1)
print(f'LDA top10: \n',ldamodel.show_topic(0, topn=10),"\n")

Count based top10: 
 {0: 'bayer', 1: 'employee', 2: 'product', 3: 'online annex', 4: 'germany', 5: 'safety', 6: 'example', 7: 'crop science', 8: 'health', 9: 'supplier'} 

LDA top10: 
 [('bayer', 0.010772358), ('employee', 0.007418699), ('product', 0.005894309), ('online annex', 0.0055894307), ('germany', 0.0049796747), ('safety', 0.0032520324), ('example', 0.0031504065), ('crop science', 0.0030487804), ('supplier', 0.0029471547), ('health', 0.0029471545)] 



Bayer, employee, product, online annex, Germany, safety, example, crop science, health, supplier....
Not too bad for a simple count-based model! Let's remember that the Bayer sustainability report is also quite a generic piece of writing. From the top 10, roughly 7 words could potentially be considered 'skill terms' (in my interpretation). <br>
<br>
One difficulty is that for large texts the accuracy commonly drops, because the most frequent words are very often simply not the most important ones (let's call it the bullshit effect). The fewer words a text contains, the more likely we are to extract relevant ones. <br>
<br>
## What's next? 
Well, we'd like to improve the accuracy, of course. One big problem is the comparability of outcomes, as this is a highly subjective matter. Thus the best you can do to test your outputs at this point is to assess for yourself whether the outcome of your algorithm is good or not, and ideally write a comment on your self-assessment, e.g. why it performs well or not.
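
One way to make such a self-assessment a little more repeatable is to write down which terms you personally consider relevant and compute the share of hits among the top results. The sketch below does that; `precision_at_k` is a hypothetical helper name, and the label set is an invented example of one reader's opinion, not a reference solution.

```python
from typing import Iterable, List, Set


def precision_at_k(ranked_terms: List[str], relevant_terms: Iterable[str], k: int = 10) -> float:
    """Fraction of the first k ranked terms that appear in the hand-labelled set."""
    top_k = ranked_terms[:k]
    if not top_k:
        return 0.0
    relevant: Set[str] = {term.lower() for term in relevant_terms}
    hits = sum(1 for term in top_k if term.lower() in relevant)
    return hits / len(top_k)


# Ranked output of the count-based model shown earlier in this notebook.
model_output = ["bayer", "employee", "product", "online annex", "germany",
                "safety", "example", "crop science", "health", "supplier"]
# Hypothetical hand labels: one reader's opinion, purely for illustration.
my_labels = {"employee", "product", "safety", "crop science", "health", "supplier"}
assessment = precision_at_k(model_output, my_labels, k=10)
print(assessment)
```

The number is only as good as the labels behind it, so it complements the written comment rather than replacing it.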

If you want to learn more about LDA (and how it should actually be used) I can highly recommend the following webapp (https://lettier.com/projects/lda-topic-modeling/) and its respective article (https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d).

Now I leave the stage to you. As always, with a couple of hints, but from now on only creativity can help solve the challenge.

### 4 Last hints

Off the top of my head, I can think of three possible ways to improve from here:
1. (Not sophisticated:) Use the named entities from challenge e1-c4-named_entity_data and do a simple string matching to find potential (and most importantly interesting) named entities in the uploaded data. The matched named entities could be scored higher.
2. Make use of spacy's named entity recognizer (e.g. "for ent in doc.ents:") to identify named entities in the document and, similar to 1., mix them with the outcome of LDA.
3. Additionally train spacy on new named entities (again from challenge e1-c4-named_entity_data) in order to detect, purely by pattern recognition on the text structure, all relevant named entities and only those.
4. ...?
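
A minimal sketch of the score mixing behind hints 1 and 2, under stated assumptions: `boost_known_entities` and its `boost` factor are hypothetical, the scores are rounded from the LDA output above, and the entity list is an invented stand-in. In practice the entities could come from the named entity data or from something like `[ent.text for ent in doc.ents]`.

```python
from typing import Iterable, List, Tuple


def boost_known_entities(
    scored_terms: List[Tuple[str, float]],
    known_entities: Iterable[str],
    boost: float = 2.0,
) -> List[Tuple[str, float]]:
    """Multiply the score of every term that matches a known entity, then re-rank."""
    known = {entity.lower() for entity in known_entities}
    boosted = [
        (term, score * boost if term.lower() in known else score)
        for term, score in scored_terms
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)


# Scores rounded from the LDA output above; the entity list is a hypothetical stand-in.
frequency_scores = [("bayer", 0.0108), ("employee", 0.0074), ("crop science", 0.0030)]
entity_list = ["Crop Science"]
reranked = boost_known_entities(frequency_scores, entity_list, boost=3.0)
print(reranked)
```

A multiplicative boost is only one possible design choice. An additive bonus or a hard filter on matched entities would be equally easy to try, and which one works best is exactly the kind of thing to note in your self-assessment.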

For clarification, in this context *topic* and *named entity* are used somewhat interchangeably. They describe the same goal (finding *skill terms*) through different means (count frequencies for topics, pattern recognition for named entities). As our current baseline approach uses LDA, which is by definition a topic extraction algorithm, this challenge is called topic_extraction. In the future, however, we'd like to focus on finding skill terms through pattern recognition rather than just frequency-based statistics.

### 5 Your code