# c2-text_cleaning challenge

### 1 Challenge Introduction
This particular notebook deals with the challenge of "text cleaning":
**INTRO:**<br>
Only in very few cases do we receive structured data from the user, or semi-structured data that we can successfully turn into structured data. Most often, we will not be able to turn the uploaded documents straight into structured data. In any case, simply using the raw text data from the files is **always, always, always** our next best option, our wild card if you will. The algorithms we use to extract information from raw text rely heavily on the assumption that the text is well cleaned. This challenge therefore has a huge impact on every step that follows.<br>
**INPUT:**<br>
Data type: string<br>
Example: "on the liver \ninjury date and ICD\n-9-\nCM codes of interest"<br>
Essentially, your component receives unfiltered, human-readable text and outputs nicely structured, filtered text. Your input will most likely be soaked with HTML tags, incorrectly decoded strings and broken formatting; let's call it *noise*. <br>
**OUTPUT:** <br>
Data type: string<br>
Example: "on the liver injury date and ICD-9-CM codes of interest"<br> 
The goal is to produce a string that is free from any sort of noise. Just nice, clean sentences.
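
To get a first feel for the task, the example above can be repaired with a few regular expressions. This is only a minimal sketch: the function name `clean_line_breaks` is made up for illustration, and the rules assume that a line break next to a hyphen is an extraction artifact, which will not hold for every document.

```python
import re

def clean_line_breaks(text):
    """Remove stray line breaks (hypothetical helper, not part of the challenge code)."""
    # A line break right before or after a hyphen is dropped: "ICD\n-9-\nCM" -> "ICD-9-CM"
    text = re.sub(r"\s*\n\s*-", "-", text)
    text = re.sub(r"-\s*\n\s*", "-", text)
    # Any remaining line break (with surrounding whitespace) becomes a single space
    text = re.sub(r"\s*\n\s*", " ", text)
    # Collapse runs of spaces and trim the ends
    return re.sub(r" {2,}", " ", text).strip()

noisy = "on the liver \ninjury date and ICD\n-9-\nCM codes of interest"
print(clean_line_breaks(noisy))
# on the liver injury date and ICD-9-CM codes of interest
```

Real input will contain many more kinds of noise than this one example, so treat it as a warm-up rather than a solution.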

In [1]:
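import spacy
import PyPDF2

# spaCy 3 removed the 'en' shortcut, so the model is loaded by its package name
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')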

### 2 Loading Input Data
As base input data we use a publicly available publication called "Validation of Acute Liver Injury Cases in a Population-Based Cohort Study of Oral Antimicrobial Users" by our colleagues from Medical Affairs, available at https://www.ncbi.nlm.nih.gov/pubmed/24111729.

In [None]:
input_file_path = './document.pdf'

In [None]:
# Let's load the pdf file into memory and look at the raw data
with open(input_file_path, 'rb') as file:
    for i, element in enumerate(file):
        # stop after 5 lines (or whatever number you like)
        if i >= 5:
            break
        print(element)

In [None]:
# "\xcbqj\x8c\xed\x84..." Certainly not what we want. Luckily, other people have created PyPDF2, 
# which enables us to turn the character salad into nice, readable string text.
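# This function goes through the pages of the pdf document and collects the text as one string with PyPDF2.
# Note: PyPDF2 3.x removed the older names PdfFileReader and extractText(),
# so we use PdfReader and extract_text() instead.
def pdfparsing(file):
    str_text = ""
    pdf_reader = PyPDF2.PdfReader(file)
    for page in pdf_reader.pages:
        # append the text of each page, separated by ". "
        str_text = str_text + str(page.extract_text()) + ". "
    return str_text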

In [None]:
# Let's apply the pdfparsing function on our input document
with open(input_file_path, 'rb') as file:
    input_string = pdfparsing(file)

In [None]:
# If we now look at our 'input_string', we can see that we have some human readable text. 
# A lot of noise, but for a human it would be possible to extract the meaning of the sentences. 
input_string[0:1000]
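
Before cleaning anything, it can help to put rough numbers on the noise. The sketch below counts a few indicators in a string. The function name `noise_profile` and the choice of indicators are assumptions made for illustration, and the sample is the example string from the introduction rather than the real `input_string`.

```python
import unicodedata

def noise_profile(text):
    """Count a few rough noise indicators in a string (hypothetical helper)."""
    return {
        "characters": len(text),
        "line_breaks": text.count("\n"),
        # characters outside plain ASCII, often a sign of decoding problems
        "non_ascii": sum(1 for ch in text if ord(ch) > 127),
        # control characters other than line breaks and tabs
        "control_chars": sum(
            1 for ch in text
            if unicodedata.category(ch) == "Cc" and ch not in "\n\t"
        ),
    }

sample = "on the liver \ninjury date and ICD\n-9-\nCM codes of interest"
print(noise_profile(sample))
```

Running the same function on `input_string` before and after your cleaning step gives a quick, if crude, sanity check.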

### 4 Your Turn
Now it's up to you. Enjoy the challenge!
![title](img/jk.png)<br>
Well, that might be a bit of a rough starting point. You can find some potential strategies for approaching the problem further below. However, please first **challenge your brain and come up with potential solutions yourself!** Looking at the hints straight away might limit your creativity afterwards, and our hints surely don't present the absolute best way of solving the challenge (otherwise it wouldn't be a challenge). So please take some time to investigate the problem **creatively**. <br>
<br>
The following cells are just for you! :) <br>
Further hints at the bottom of the notebook.

In [None]:
# your code here

In [None]:
# and don't forget your input!
input_string

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

### 5 Additional hints
Before diving straight into code, a brief analysis of the problem we're actually trying to solve: <br>
We want to make sure that sentences are "clean", meaning that they only contain "real" words (e.g. not "\nbr\n") and follow a semantic structure. Thus if we have a way to identify those real words or analyze the semantic structure, we can easily distinguish "clean" sentences from "noisy" sentences. Based on this assumption, there are two obvious strategies to go with:
1. **Rule based filtering**: We can set up rules that a sentence is required to meet in order to be classified as a "clean" sentence. Such rules could be "must include a verb" or "must be longer than 10 characters" etc. This will most likely give us **high precision (large proportion of true positives) but rather low recall (many false negatives)**. Whether optimizing for precision at the cost of recall is a fair trade-off depends on the individual case. It is fair in large texts where data is abundant. In small texts, such as a PowerPoint pitch presentation, it will certainly lead to issues.
2. **Machine learning**: We could train a machine learning algorithm, such as a neural network, on identifying clean sentences or even "real words". On the plus side, the resulting model will probably generalize well, so we could run the model on any text body, it wouldn't matter if it's small or large. **The issue here is rather to find (or create) large, labeled, high-quality data sets** on which to train the algorithm. 
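
The rule based strategy can be sketched without any NLP library. The example below is a minimal illustration, not a recommended rule set: the function name `looks_clean`, the thresholds and the HTML-tag rule are assumptions chosen for the demo. A rule such as "must include a verb" needs a part-of-speech tagger and is therefore left out here.

```python
import re

def looks_clean(sentence, min_chars=10, min_alpha_ratio=0.7):
    """Return True if the sentence passes a few hand-picked rules (illustrative only)."""
    stripped = sentence.strip()
    # Rule 1: must be longer than min_chars characters
    if len(stripped) <= min_chars:
        return False
    # Rule 2: most characters should be letters or whitespace
    alpha = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    if alpha / len(stripped) < min_alpha_ratio:
        return False
    # Rule 3: no leftover HTML-like tags such as <br>
    if re.search(r"<[^>]+>", stripped):
        return False
    return True

candidates = [
    "on the liver injury date and ICD-9-CM codes of interest",
    "\nbr\n",
    "Table 2 12.5 (3.1) 44% 0.001",
    "<br>This sentence still contains a tag.",
]
for candidate in candidates:
    print(repr(candidate), looks_clean(candidate))
```

Strict thresholds like these tend to reject a lot of usable text, which is exactly the precision-for-recall trade-off described above.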

#### 5.1 Rule based filtering with spaCy
spaCy is a natural language processing library and it is BEAUTIFUL. In case you wondered "how the heck are we gonna figure out if a word is a verb or noun?!", here's the answer: spaCy does it for you. We might be able to build models that outperform spaCy, but that would cost a lot of time and spaCy's performance is "good enough" (it's really, really good actually). If you're interested in how spaCy built its models, e.g. the named entity recognizer or the part-of-speech tagger, you can check out their documentation or just reach out to Jannis or Sabrina for a brief explanation. <br>
<br>
For us, the details on *how* spaCy works are irrelevant. What matters is *that* it works and *what* we get out of it. So let's dive into it.

In [None]:
# import the spacy library
import spacy
# load the English language model. That's where all the 'smartness' comes from
# (spaCy 3 removed the 'en' shortcut, so the model is loaded by its package name)
nlp = spacy.load('en_core_web_sm')

# a clean, simple example sentence to discover the core capabilities of spacy
example = "The red shark ate a large mango because she was very hungry."

# the nlp function is essentially our command to analyze whatever text we pass as argument. 
# under the hood, nlp() is a so-called pipeline that runs different models on the text data.
# If you like learning new things, I can highly recommend spaCy's documentation: 
# https://spacy.io/usage/processing-pipelines 
doc = nlp(example)

In [None]:
# if we print doc, we just get the same text we gave spacy as input. 
print(doc)
# checking for the type though exposes that we're dealing with a "doc" object. 
print(type(doc))
# The doc object contains many valuable methods and attributes which can help us analyze text.
# You can read about the many linguistic features spacy provides here: https://spacy.io/usage/linguistic-features

In [None]:
# The doc object contains several tokens, which are representations of words, numbers and special characters.
# By iterating through the doc object's tokens we can investigate the token's attributes like part of speech and lemma
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

In [None]:
# WOW, lots of information. If you're curious about the details, see https://spacy.io/usage/linguistic-features for exhaustive explanations.
# Though if you want to know what an abbreviation means, you can also do:
print(spacy.explain("advcl"))
print(spacy.explain("VBD"))

A brief explanation of the different attributes, copied straight from the spaCy documentation:

**Text**: The original word text.<br>
**Lemma**: The base form of the word.<br>
**POS**: The simple part-of-speech tag.<br>
**Tag**: The detailed part-of-speech tag.<br>
**Dep**: Syntactic dependency, i.e. the relation between tokens.<br>
**Shape**: The word shape – capitalisation, punctuation, digits.<br>
**is alpha**: Is the token an alpha character?<br>
**is stop**: Is the token part of a stop list, i.e. the most common words of the language?<br>
<br>
Analyzing our previous output for correctness shows: spaCy is pretty accurate. E.g. "mango" is indeed a noun and "ate" a verb in the past tense. The `advcl` label we looked up above stands for adverbial clause modifier: the clause "because she was very hungry" modifies "ate", so that label should sit on "was", while "very" is a plain adverbial modifier of "hungry".  :) <br>
<br>
Identifying all of the above is not all that spaCy can do. spaCy can also detect '**noun chunks**':

In [None]:
for chunk in doc.noun_chunks:
    print(chunk.text,", ", chunk.root.text,", ", chunk.root.dep_,", ",
        chunk.root.head.text)

Taking another look at spaCy's documentation gives us the descriptions of the attributes:<br>
<br>
**Text**: The original noun chunk text.<br>
**Root text**: The original text of the word connecting the noun chunk to the rest of the parse.<br>
**Root dep**: Dependency relation connecting the root to its head.<br>
**Root head text**: The text of the root token's head.<br>
<br>
Wonderful. But none of this is of any help if we can't even identify sentence breaks. To dive into that topic we need a larger example with several sentences. For example, a fraction of the world's longest joke from http://www.longestjokeintheworld.com/:

In [None]:
example2 = "He thinks about walking at night to avoid the heat and sun(8. ), but based upon how dark it actually was the night before, and given that he has no flashlight, he's afraid that he'll break a leg or step on a rattlesnake. So, he puts on some sun block, puts the rest in his pocket for reapplication later, brings an umbrella he'd had in the back of the SUV with him to give him a little shade, pours the windshield wiper fluid into his water bottle in case he gets that desperate, brings his pocket knife in case he finds a cactus that looks like it might have water in it, and heads out in the direction he thinks is right. He walks for the entire day. By the end of the day he's really thirsty. He's been sweating all day, and his lips are starting to crack. He's reapplied the sunblock twice, and tried to stay under the umbrella, but he still feels sunburned. The windshield wiper fluid sloshing in the bottle in his pocket is really getting tempting now. He knows that it's mainly water and some ethanol and coloring, but he also knows that they add some kind of poison to it to keep people from drinking it. He wonders what the poison is, and whether the poison would be worse than dying of thirst."
print(example2)

In [None]:
# one way to split into sentences is to just split at every ".":
doc = nlp(example2)

# list to temporarily store all words of the sentence in until a "." arrives
sentence = []
for token in doc:
    sentence.append(token.text)
    if token.text == ".":
        print(sentence, "\n")
        sentence = []

Works quite well. But look at the 1st and 2nd printed sentences. The split after "sun(8." is wrong: it shouldn't be the end of a sentence, it is more of a human error in the text. There are many ways to deal with this issue, but as you can imagine...<br>
<br>
...spaCy already takes care of that! spaCy comes with a (smarter) **sentence splitter**:

In [None]:
for sent in doc.sents:
    print(sent.text, "\n")

Thank you spaCy! Well, the sentence splitter is not perfect as you will see when you go through real life examples with very nasty input text. But it is a fair approximation and helps a lot for our task.<br>
<br>
Now, one rather simple way to categorize sentences as "clean" or "noisy" could be to **define rules based on...parts of speech, dependencies, noun chunks, sentence lengths... you name it!** From here on you may start to play around. Your input_string should still be loaded in memory:

In [None]:
# maybe start by sending input_string through spacy pipeline?
doc = nlp(input_string)

i = 0
for sent in doc.sents:
    print(sent.text)
    # limit output
    if i > 5:
        break
    i += 1
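
One possible starting point for rules based on parts of speech is sketched below. To keep it runnable without a language model, the function works on plain `(text, pos)` pairs, which you could build from a spaCy sentence with `[(t.text, t.pos_) for t in sent]`. The function name `passes_pos_rules`, the threshold and the rules themselves are assumptions, and the tags in the examples are written by hand rather than produced by spaCy.

```python
def passes_pos_rules(tagged_tokens, min_tokens=4):
    """Check a sentence given as (text, pos) pairs against simple POS rules (illustrative only)."""
    pos_tags = [pos for _, pos in tagged_tokens]
    # Ignore punctuation and whitespace tokens when counting the sentence length
    words = [pos for pos in pos_tags if pos not in ("PUNCT", "SPACE")]
    if len(words) < min_tokens:
        return False
    # Rule: must include a verb (auxiliaries count as well)
    has_verb = any(pos in ("VERB", "AUX") for pos in pos_tags)
    # Rule: must include something noun-like
    has_nominal = any(pos in ("NOUN", "PROPN", "PRON") for pos in pos_tags)
    return has_verb and has_nominal

# Hand-written tags standing in for tagger output
clean = [("He", "PRON"), ("walks", "VERB"), ("for", "ADP"), ("the", "DET"),
         ("entire", "ADJ"), ("day", "NOUN"), (".", "PUNCT")]
noisy = [("\n", "SPACE"), ("br", "NOUN"), ("\n", "SPACE")]

print(passes_pos_rules(clean))  # True
print(passes_pos_rules(noisy))  # False
```

Keeping the rules separate from the tagging step also makes them easy to test on hand-made examples like these.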

P.S.:¹
Uffffff, lots of work to be done here! Please keep in mind:<br>
### This is just ONE proposed way to solve a problem out of A TRILLION
Thus, feel free to play around with other libraries you know or found on the internet (spaCy is only one option), or try to build your very own methods to clean text effectively.
### All that matters is the output. And now happy coding :) 

P.S.:² We have not yet talked about a machine learning approach for identifying clean sentences. The reason is that we need a large enough data set of labeled clean and noisy sentences, which we haven't created yet. We could potentially also use existing open-source datasets and algorithms. There are some interesting papers on this issue (e.g. http://anthology.aclweb.org/C/C10/C10-2022.pdf). Feel free to dive into this further if you like the topic!
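
Even a handful of hand-labeled sentences is enough to measure how well any cleaning approach does, whether rule based or learned. The sketch below computes precision and recall for a deliberately naive length rule. The labeled examples are made up for illustration, and the names `precision_recall` and `is_clean_by_length` are assumptions.

```python
def precision_recall(predicted, actual):
    """Precision and recall for boolean predictions, where True means 'clean'."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def is_clean_by_length(sentence):
    # Naive rule: anything longer than 10 characters counts as clean
    return len(sentence.strip()) > 10

# Toy labeled data: (sentence, is it really clean?)
labeled = [
    ("He walks for the entire day.", True),
    ("By the end of the day he's really thirsty.", True),
    ("Too short.", True),
    ("\nbr\n", False),
    ("12 345 678 9 10 11 12", False),
]

predictions = [is_clean_by_length(sentence) for sentence, _ in labeled]
labels = [label for _, label in labeled]
precision, recall = precision_recall(predictions, labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.67 recall=0.67
```

The same labeled pairs could later serve as a seed for training data, although a real model would need far more of them.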