# e1-c1-parsing_pptx challenge

### 1 Challenge Introduction
This notebook covers the "parsing pptx" challenge (pptx = PowerPoint presentations).<br>
<br>
**INTRO:**<br>
PowerPoint presentations are less structured than, say, a CV or an Excel sheet, but more structured than plain text. In plain text we follow the paradigm "know the meaning of a word by the company it keeps"; that approach hardly applies to PowerPoint files, where a word may appear only once and still be highly relevant to the content presented. Presentations usually follow the rather broad structure presentation > slides > title, subtitle, paragraph. We would expect such relevant words to appear in the title of a slide, which is a first hint toward a possible solution.<br>
**INPUT:**<br>
Data type: Presentation (a class from the python-pptx library)<br>
Example: `<class 'pptx.presentation.Presentation'>`<br>
Essentially, your component receives a Presentation object that holds a lot of metadata on the presentation, such as shape types and slide numbers, which we would like to make use of.<br>
**OUTPUT:** <br>
Data type: dict/list<br>
Example: {"name":"Frenz Josef Freud"}<br> 
The goal is to output structured data in the form of dictionaries or lists, e.g. {"skills": ["strategic management","product supply"]}, AND ALSO a string corpus which we'll send through our plain text pipeline. Ideally, this way we can capture not just the frequency-based relevance but also the position-based relevance.
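
As a minimal sketch of what such a combined output could look like: the function name `make_component_output` and the keys `"structured"` and `"corpus"` are hypothetical choices for illustration, not part of the challenge specification.

```python
def make_component_output(skills, text_snippets):
    """Bundle structured data and a plain-text corpus into one result.

    Hypothetical helper: the key names "structured" and "corpus" are
    assumptions made for this sketch.
    """
    # Drop snippets that are empty or contain only whitespace.
    cleaned = [snippet.strip() for snippet in text_snippets if snippet.strip()]
    return {
        "structured": {"skills": list(skills)},
        # One snippet per line, so the plain text pipeline can still see boundaries.
        "corpus": "\n".join(cleaned),
    }


example_output = make_component_output(
    ["strategic management", "product supply"],
    ["Strategy overview", "   ", "Supply chain basics"],
)
print(example_output)
```

Keeping both parts in one result lets the structured part carry position-based information while the corpus feeds the frequency-based pipeline.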

In [None]:
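# The 'en' shortcut was removed in spaCy 3; use the full model name instead.
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')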
from pptx import Presentation

### 2 Loading Input Data
As input data we use an arbitrary, publicly available PowerPoint file. It is simple, fairly typical, and contains many of the content types we would like to cover.

In [None]:
input_file_path = './presentation.pptx'

In [None]:
# This function goes through the slides of the pptx document and collects their text as a single string with python-pptx.
# Since this loses ALL of the meta information from the PowerPoint file, it should only be the worst-case fallback.
def pptxparsing(file):
    # Extract the text of every run in the presentation as a single string
    ppt = Presentation(file)
    parts = []
    for slide in ppt.slides:
        for shape in slide.shapes:
            # Shapes such as pictures have no text frame
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    parts.append(run.text + ". ")
    return "".join(parts)

In [None]:
# apply the pptx parsing function on our input document.
with open(input_file_path, 'rb') as file:
    input_string = pptxparsing(file)
input_string
# As we can see, this is just unstructured, noisy text, which is far from ideal for further analysis.

In [None]:
# Open the PowerPoint document
with open(input_file_path, 'rb') as file:
    # Dictionary to store the extracted strings in
    texts = {"notes": [], "titles": [], "paragraphs": []}
    # Send the PowerPoint document through the parser
    prs = Presentation(file)

    # Iterate through the slides
    for slide in prs.slides:
        # If the slide has a non-empty notes section, extract its text into the dictionary.
        # notes_text_frame is None when the notes slide has no notes placeholder.
        if slide.has_notes_slide and slide.notes_slide.notes_text_frame is not None:
            notes = slide.notes_slide.notes_text_frame.text
            if notes != "":
                # Replace line breaks and tabs with spaces so that words do not run together
                texts["notes"].append(notes.replace("\n", " ").replace("\t", " "))

        # If the slide has a non-empty title, extract it into the dictionary
        title = slide.shapes.title
        if title is not None and title.text != "":
            texts["titles"].append(title.text.replace("\n", " ").replace("\t", " "))

        # Shapes are the next hierarchical level under slides. They can represent text frames, pictures, rectangles etc.
        # For now we only look at shapes with a text frame. Note that the title shape is one of them,
        # so title text also ends up in "paragraphs".
        for shape in slide.shapes:
            # Make sure that the shape has text
            if shape.has_text_frame and shape.text != "":
                # Extract the shape's text into the dictionary
                texts["paragraphs"].append(shape.text.replace("\n", " ").replace("\t", " "))
texts
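
With titles, paragraphs and notes kept apart, position-based relevance can be approximated by weighting a term according to where it appears. The following is only a sketch under stated assumptions: the weights in `FIELD_WEIGHTS` are arbitrary placeholders, `rank_terms` is a hypothetical helper, and terms are single lowercase words with no stop-word removal.

```python
import re
from collections import Counter

# Assumed weights for illustration: a term in a title counts more than
# the same term in a paragraph or in the notes.
FIELD_WEIGHTS = {"titles": 3.0, "paragraphs": 1.0, "notes": 0.5}


def rank_terms(texts_by_field, weights=FIELD_WEIGHTS):
    """Rank terms by their weighted frequency across the text fields."""
    scores = Counter()
    for field, snippets in texts_by_field.items():
        # Fields without an explicit weight count once per occurrence.
        weight = weights.get(field, 1.0)
        for snippet in snippets:
            # Very simple tokenizer: runs of ASCII letters, lowercased.
            for token in re.findall(r"[a-z]+", snippet.lower()):
                scores[token] += weight
    # Highest score first
    return scores.most_common()


sample_texts = {
    "titles": ["Supply Strategy"],
    "paragraphs": ["Planning the supply chain", "Chain planning"],
    "notes": ["Mention strategy"],
}
for term, score in rank_terms(sample_texts):
    print(term, score)
```

Passing the `texts` dictionary built above instead of `sample_texts` gives a first, rough ranking that can be refined with better tokenization or tuned weights.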

In [None]:
# Wonderful. Now we have all titles, paragraphs and notes in one dictionary. Whatever happens now is up to you. 
# Some ideas: 
# - Make use of spacy's named entity recognizer and extract the category (see below). 
# - Run an LDA (https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24,
# and https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) on the titles, paragraphs and notes.
# - Find an appropriate weighting of notes vs titles vs paragraph texts.
# - Your ideas...

# Ideally, we end up with a list or dictionary that provides us with the skill terms ranked by relevance (how to rank is up to you).

# spaCy's NER on the titles:
for title in texts["titles"]:
    doc = nlp(title)
    for ent in doc.ents:
        # Skip very short entities (3 characters or fewer)
        if len(ent.text) > 3:
            print(ent.text, ",", ent.label_)
# If you're not familiar with spaCy's named entity labeling: spaCy can identify "named entities",
# which are essentially real-world objects with a name, like "Berlin", "Barack Obama" or "Bayer".
# ent.label_ indicates the entity type, e.g. Berlin = GPE.
# spacy.explain() tells us what a label means:
print(spacy.explain("GPE"))
# correct :) 

print("\nBe aware that spaCy is great, but not perfect! Many entities, especially those without context, will be misclassified!")

# Happy Coding!