# c4-named_entity_data challenge

### 1 Challenge Introduction
This particular notebook deals with the challenge of creating "named entity data".<br>
<br>
**DISCLAIMER:**<br>
This challenge differs from the other challenges in that it does not involve any actual coding. It is a trivial task that simply needs to be done, so **if you work on this, make sure to submit your intermediate steps as quickly as possible to avoid redundant effort!** <br>
**INTRO:** Named entities are an essential part of our language. Wikipedia says: *In information extraction, a named entity is a real-world object, such as persons, locations, organizations, products, etc., that can be denoted with a proper name.* Examples are *New York City*, *Aspirin* or *Pfizer Inc*. Identifying those named entities is an important task when trying to understand language systematically. <br>
**BAYRON:**<br>
One of our goals is to capture the expertise of a person in the best and most natural way possible. Expertise can be captured with generic words only, but named entities let us be far more precise about the real world. For example, "I work at McDonald's in Berlin" gives much more insight than "I work at a restaurant company in a big city". If we could understand named entities the same way we understand generic words ("Berlin" vs. "city"), we would get a much better picture of our experts.<br>
**DEEP LEARNING:**<br>
Named entity recognition is a long-established discipline in NLP. Most recently, the evolution of deep neural networks has enabled slim and well-performing models. In essence, a deep neural network is confronted with a large corpus of text and has to guess which words are named entities, e.g. company names, and which are not. The network then runs through the corpus over and over again to adjust its weights and recognize patterns in the sentences. Ideally, the result is a model that reliably predicts whether a word is a company name or not. **What we need, however, is a set of predefined named entities to begin with.**<br>
**OUTPUT:** <br>
Data type: list of tuples<br>
Example: [("software", "excel"), ("software", "powerpoint")]<br>
At Bayron, we have a couple of predefined named entities that we would like to learn about a person. The following are just proposals and open for discussion. You're welcome to propose your own ideas of named entities. <br>
We suggest beginning with the named entities "Software", "Scientific Discipline" and "Job positions".
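
To make the output format concrete, here is a minimal sketch. It assumes each entry is a `(label, term)` tuple; the variable names `named_entities` and `by_label` are illustrative, not part of the challenge. Grouping by label is an easy way to check how many terms each entity type has so far.

```python
from collections import defaultdict

# Assumed output shape: a list of (label, term) tuples.
named_entities = [("software", "excel"), ("software", "powerpoint")]

# Group the terms by their label to get a quick overview per entity type.
by_label = defaultdict(list)
for label, term in named_entities:
    by_label[label].append(term)

print(dict(by_label))
```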

In [None]:
import spacy
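# The 'en' shortcut only works in spaCy 2.x; spaCy 3+ needs the full package
# name (install it with: python -m spacy download en_core_web_sm).
nlp = spacy.load('en_core_web_sm')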

### 2 Start coding
Below are some examples analyzed by spaCy's named entity recognizer (NER), followed by the three lists with a few starter entries.

In [None]:
example = "Johnny Depp was the CEO of Acting Stars Inc. from 2014 until 2018."
doc = nlp(example)

In [None]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
# As we can see, spaCy is already pretty good at predicting people and dates, and it gets much better
# when given more text than just a single sentence. We want some more specific named entities, so let's just create them:

In [None]:
spacy.explain("GPE")

In [None]:
software = ["powerpoint","excel","sap","eclipse","atom","jupyter"]
science = ["biology","oncology","hematology","engineering","computer science","theology","medicine","business"]
position = ["CEO","manager","research scientist","computer scientist","assistant","head"]
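
To illustrate why such lists are useful as a starting point, the sketch below finds list terms in a sentence and reports character offsets in the same style as the `doc.ents` loop above. It is plain case-insensitive string lookup with the standard `re` module, not a trained model and not necessarily how the lists will be used later; the helper `find_terms` and the example sentence are hypothetical.

```python
import re

def find_terms(text, terms, label):
    """Return (start_char, end_char, label, matched_text) for each whole-word match."""
    matches = []
    for term in terms:
        # \b anchors keep "sap" from matching inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), label, m.group()))
    # Sort by start offset so matches appear in reading order.
    return sorted(matches)

software_terms = ["powerpoint", "excel", "sap"]
sentence = "She builds reports in Excel and presents them in PowerPoint."

hits = find_terms(sentence, software_terms, "software")
for hit in hits:
    print(hit)
```

Spans found this way could serve as rough annotations for a text corpus, though ambiguous terms such as "atom" or "head" would need manual review.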

In [None]:
# Once you're done you could transform the single lists into a list of tuples, e.g. for the software list:
named_entity = []
for term in software:
    named_entity.append(("software", term))
named_entity
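
The same transformation can be applied to all three lists at once. The following is a minimal sketch; the label strings `"science"` and `"position"` are assumptions taken from the variable names above, and the final labels are open for discussion.

```python
software = ["powerpoint", "excel", "sap", "eclipse", "atom", "jupyter"]
science = ["biology", "oncology", "hematology", "engineering",
           "computer science", "theology", "medicine", "business"]
position = ["CEO", "manager", "research scientist", "computer scientist",
            "assistant", "head"]

# Map each (assumed) label to its list of terms.
entity_lists = {"software": software, "science": science, "position": position}

# Flatten into a single list of (label, term) tuples.
named_entity = [
    (label, term)
    for label, terms in entity_lists.items()
    for term in terms
]

print(len(named_entity))
print(named_entity[:3])
```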