# c3-analyzing_cvs challenge

### 1 Challenge Introduction
This particular notebook deals with the challenge of "analyzing CVs".<br>
<br>
**INTRO:**<br>
CVs are a special challenge for data extraction because they usually consist of *semi-structured* data: there is a structure, but it is not standardized or as easily readable as a relational database. Additionally, the terms found in a CV are very much on point. When a CV says "Languages: English, Cantonese", that is worth extracting, whereas in free text relevance is represented by the number of occurrences. As an example, remember Seamus Finnigan? No? Well... remember Harry Potter? Same book, but one name occurs far more often than the other. The assumptions of traditional NLP algorithms therefore don't apply to semi-structured text, hence the challenge: extract the information with an algorithm specifically tailored to CVs.<br>
**INPUT:**<br>
Data type: string<br>
Example: "\n \n \nName\n \nD\nr. J\nohnny Depp\n \nAddress\n \nBroadway 10"<br>
Essentially, your component receives unfiltered, human-readable text and, in the ideal case, outputs a structured data format such as a list or a dictionary of, e.g., skills or previous employers. As PDF and DOCX files always come in slightly different formats and versions, you can expect a fair amount of noise.<br>
**OUTPUT:** <br>
Data type: dict/list<br>
Example: {"name":"Johnny Depp"}<br> 
The goal is to output structured data in the form of dictionaries or lists. In the worst case, you might output a slightly cleaner string that then goes through the same skill extraction process as all other normal texts.
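
To make the input/output contract concrete, here is a minimal sketch that turns the example string above into a dictionary. The function name `parse_sections` and the header list `SECTION_HEADERS` are hypothetical and only serve as illustration. The sketch also assumes that the newlines can simply be deleted, which happens to work for this example because they split words in the middle; a real CV will need more careful cleaning.

```python
import re

# Hypothetical list of section headers; a real CV would need a much longer list.
SECTION_HEADERS = ["Name", "Address"]

def parse_sections(raw_text, headers=SECTION_HEADERS):
    """Split noisy CV text into {header: value} pairs based on known header words."""
    # Assumption: newlines are extraction noise, so delete them, then collapse whitespace.
    text = re.sub(r"\s+", " ", raw_text.replace("\n", "")).strip()
    # Split on the header words; the capture group keeps the headers in the result.
    pattern = r"\b(" + "|".join(re.escape(h) for h in headers) + r")\b"
    parts = re.split(pattern, text)
    # parts looks like [prefix, header1, value1, header2, value2, ...]
    result = {}
    for header, value in zip(parts[1::2], parts[2::2]):
        result[header.lower()] = value.strip()
    return result

example = "\n \n \nName\n \nD\nr. J\nohnny Depp\n \nAddress\n \nBroadway 10"
print(parse_sections(example))
```

The approach depends entirely on the header list, so any section whose heading is not listed ends up inside the value of the preceding section.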

In [1]:
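import PyPDF2
import spacy

# Note: the 'en' shortcut only works in spaCy 2.x; with spaCy 3+ load the model by its full name, e.g. 'en_core_web_sm'.
nlp = spacy.load('en')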

### 4 Loading Input Data
We use Jesus' CV as input data. It is simple and probably a good example of an average employee's CV.

In [2]:
input_file_path = './cv.pdf'

In [None]:
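# This function goes through the pages of the PDF document and collects the text as one string with PyPDF2.
# Note: PdfFileReader and extractText are the older PyPDF2 names; PyPDF2 3.x renamed them to PdfReader and extract_text.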
def pdfparsing(file):
    str_text = ""
    pdfReader = PyPDF2.PdfFileReader(file)
    for page in pdfReader.pages:
        str_text = str_text + str(page.extractText()) + ". "
    return str_text

In [None]:
# apply the pdf parsing function on our input document
with open(input_file_path, 'rb') as file:
    input_string = pdfparsing(file)
input_string

In [None]:
# As we can see, the input string contains a number of "\n" newline characters,
# which we would like to get rid of in order to extract information effectively.
# Strings are immutable, so we assign the result of replace() back to the variable:
input_string = input_string.replace("\n", " ")
input_string

In [None]:
# Much better, though we still see a lot of unclean data (which shows the relevance of challenge 2, "text cleaning"),
# but let's work with what we have. First, where do we want to get to?
# Ideally, we end up with a dictionary of {"skill area": [skills]} pairs, like {"language": ["german", "spanish"]}
# or {"jobs": ["scientist", "freelancer", "founder"]}.

# Just to explore, let's check what spaCy can give us:
doc = nlp(input_string)
for ent in doc.ents:
    if ent.end_char-ent.start_char > 3:
        print(ent.text, ",", ent.start_char,",", ent.end_char,",",  ent.label_)
# Oh right, you might not be familiar with spaCy's named entity labeling. spaCy is pretty awesome
# because it can identify "named entities", which are essentially *not normal words* like "Berlin",
# "Barack Obama", or "Bayer". ent.label_ indicates what kind of entity it is, e.g. Berlin = GPE.
# spacy.explain() can explain what that label means:

In [None]:
print(spacy.explain("GPE"))
# correct :) 
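
As a small step from printing entities towards the dictionary we are after, the pairs of entity text and label can be grouped by label. The sketch below is only one possible way to do this; `group_entities` and its `min_length` parameter are hypothetical names, and the sample pairs are hand-written stand-ins, not real model output. Working on plain `(text, label)` tuples keeps the function independent of spaCy.

```python
from collections import defaultdict

def group_entities(entities, min_length=4):
    """Group (text, label) pairs into {label: [texts]}, skipping very short spans."""
    grouped = defaultdict(list)
    for text, label in entities:
        text = text.strip()
        # Roughly mirrors the length filter used above: keep spans longer than 3 characters.
        if len(text) < min_length:
            continue
        # Avoid listing the same entity twice under one label.
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

# With spaCy, the pairs would come from: [(ent.text, ent.label_) for ent in doc.ents]
# Hand-written stand-in pairs (hypothetical, not real model output):
sample = [
    ("Berlin", "GPE"),
    ("Barack Obama", "PERSON"),
    ("Bayer", "ORG"),
    ("Berlin", "GPE"),
    ("UK", "GPE"),
]
print(group_entities(sample))
```

The result is keyed by spaCy's label names, so a further mapping from labels to skill areas would still be needed.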

Now there are multiple ways of operating:
1. Rely on the accuracy of spaCy's named entity recognizer and simply put this information into the dictionaries based on the entity labels. (Easy and quick to implement, but not recommended because it is not very accurate.)
2. Create a vocabulary of words we could be interested in and apply simple string matching, e.g. in pseudocode:
interesting_words = {"language":["german","english","spanish"]}
if word in interesting_words["language"]:
    add to skills_list
3. Approach 2 is very simple and can be very powerful; however, that depends solely on the size of our "interesting_words" list, and we will probably never be able to cover all potentially important words. Thus a hybrid approach of 1 and 2 is probably best.
4. Do you have any further ideas? Looking forward to seeing your approaches!
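
Approaches 2 and 3 from the list above could be sketched roughly as follows. This is an illustration under assumptions, not a finished solution: the vocabulary `INTERESTING_WORDS`, the function names `match_vocabulary` and `merge_results`, and the sample text are all made up for this example, and the entity hits are hand-written stand-ins for what spaCy's entity recognizer might return.

```python
import re

# Hypothetical vocabulary; a useful one would be far larger.
INTERESTING_WORDS = {
    "language": ["german", "english", "spanish"],
    "jobs": ["scientist", "freelancer", "founder"],
}

def match_vocabulary(text, vocabulary=INTERESTING_WORDS):
    """Approach 2: return {category: [terms found]} via case-insensitive whole-word matching."""
    # Tokenize into lowercase words so "English," still matches "english".
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    found = {}
    for category, terms in vocabulary.items():
        hits = [term for term in terms if term in tokens]
        if hits:
            found[category] = hits
    return found

def merge_results(vocab_hits, entity_hits):
    """Approach 3 (hybrid): start from vocabulary hits and add entity hits without duplicates."""
    merged = {key: list(values) for key, values in vocab_hits.items()}
    for key, values in entity_hits.items():
        bucket = merged.setdefault(key, [])
        for value in values:
            if value not in bucket:
                bucket.append(value)
    return merged

cv_text = "Languages: English, Spanish. Worked as a freelancer and founder."
vocab_hits = match_vocabulary(cv_text)
# Stand-in for entities grouped by label; real values would come from doc.ents.
entity_hits = {"GPE": ["Berlin"], "ORG": ["Bayer"]}
print(merge_results(vocab_hits, entity_hits))
```

Because matching works on single tokens, multi-word terms such as "project manager" would not be found; handling them would need phrase matching instead.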