# Information Extraction
Extract keyphrases and named entities from unstructured text using several approaches. In particular:
- Treat the string variable `text` defined below as the content of your document.
- Extract the keyphrases of the document using unsupervised algorithms such as `TextRank` and `SGRank`. Implementations of both are available in [`textacy`](https://textacy.readthedocs.io/en/0.12.0/api_reference/extract.html).
- Recognize the named entities in the document using pretrained models. Such models are available in [spaCy](https://spacy.io/usage/linguistic-features) and [Hugging Face Transformers](https://huggingface.co/docs/transformers/task_summary#named-entity-recognition).
- Compare the results of the different approaches, and analyze how the hyperparameters affect the quality of the results.

In [1]:
text = """
About GISMA Business School
Since its foundation in 1999, GISMA Business School has paved the way for talented and qualified people to enter the international business world. Equipped with an interdisciplinary foundation and digital literacy, our graduates are able to pinpoint problem situations in companies of all sizes, start-ups or other organisations, and develop innovative solutions with commitment, motivation and creativity. With our goals in mind, we continue to expand and support students from all over the world to find their dream job and be successful.
As a state-recognised university, GISMA Business School awards its own Bachelor's and Master's degrees. In addition, we enjoy the trust of some of the best universities in Europe to offer their degree programmes through GISMA.

Our Mission
GISMA educates individuals to become highly sought-after, leading members of the global business community. GISMA offers both traditional physical and modern virtual learning spaces that enable the acquisition of future-oriented competencies through state-of-the-art technology, an innovative and creative learning environment, and highly qualified staff. GISMA stands for practical and inspiring management education, where students learn from research-strong professors as well as top executives and founders. GISMA cooperates with a network of globally active organisations from business and education. It supports business and society by preparing students for management practice in a world characterised by permanent change, uncertainty, complexity and ambiguity. GISMA offers a learning environment characterised by a high degree of internationality.
"""

In [16]:
import spacy
from textacy.extract.keyterms import textrank, sgrank
from transformers import pipeline

In [4]:
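# Load the small English pipeline and parse the text once;
# the keyterm extractors and doc.ents below all reuse this Doc.
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)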

In [11]:
t_key_terms = textrank(doc)
t_key_terms

[('GISMA Business School award', 0.03416257583806404),
 ('international business world', 0.025506347861031266),
 ('modern virtual learning space', 0.02252718016778866),
 ('creative learning environment', 0.019963268446678853),
 ('global business community', 0.01949552325521034),
 ('inspiring management education', 0.01832441779133801),
 ('degree programme', 0.014733900191777467),
 ('high degree', 0.014490240591750157),
 ('active organisation', 0.012487918297846283),
 ('qualified people', 0.011646443497108763)]
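
To analyze the effect of hyperparameters, one option is to rerun an extractor over a small grid of settings and keep the ranked terms for each combination. The following is a minimal sketch under that assumption; `sweep_keyterms` is a hypothetical helper (not part of textacy), and the extractor is passed in as a callable so the same loop can serve `textrank` or `sgrank`.

```python
from itertools import product


def sweep_keyterms(extractor, doc, grid):
    """Call extractor(doc, **params) for every parameter combination in grid.

    grid maps a keyword-argument name to the list of values to try.
    Returns a dict keyed by the sorted (name, value) pairs of each run,
    with the ranked term strings (scores dropped) as values.
    """
    names = sorted(grid)
    results = {}
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        terms = [term for term, _score in extractor(doc, **params)]
        results[tuple(params.items())] = terms
    return results
```

With the objects from this notebook, a call might look like `sweep_keyterms(textrank, doc, {"window_size": [2, 5, 10], "position_bias": [False, True]})`. Both keyword names are taken from the textacy 0.12 reference linked above and should be checked against the installed version.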

In [10]:
s_key_terms = sgrank(doc)
s_key_terms

[('GISMA Business School', 0.653917081448932),
 ('world', 0.010860673886228727),
 ('digital literacy', 0.01081000197616003),
 ('problem situation', 0.010723782048406233),
 ('interdisciplinary foundation', 0.009940002754286116),
 ('innovative solution', 0.009910816642060461),
 ('business world', 0.009856493456615236),
 ('international business', 0.008777599275258932),
 ('student', 0.008442754504057653),
 ('qualified people', 0.008413677274300087)]
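
One simple way to compare the two rankings is to measure how many terms they share. The sketch below copies the term strings from the two outputs above and computes the exact-match Jaccard overlap, plus a looser count of SGRank terms that occur inside some TextRank term. The helper name `jaccard` and the substring criterion are illustrative choices, not a standard evaluation protocol.

```python
# Term strings copied from the TextRank and SGRank outputs above (scores omitted).
textrank_terms = [
    "GISMA Business School award", "international business world",
    "modern virtual learning space", "creative learning environment",
    "global business community", "inspiring management education",
    "degree programme", "high degree", "active organisation", "qualified people",
]
sgrank_terms = [
    "GISMA Business School", "world", "digital literacy", "problem situation",
    "interdisciplinary foundation", "innovative solution", "business world",
    "international business", "student", "qualified people",
]


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Terms that both algorithms returned verbatim.
shared = sorted(set(textrank_terms) & set(sgrank_terms))
score = jaccard(textrank_terms, sgrank_terms)

# SGRank terms that appear as a substring of at least one TextRank term.
nested = [s for s in sgrank_terms if any(s in t for t in textrank_terms)]

print(shared, round(score, 3))
print(nested)
```

Only one term matches exactly, while several SGRank terms are shorter pieces of longer TextRank phrases, which suggests the two methods differ mainly in preferred phrase length on this document.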

In [14]:
ents = [(e.text, e.label_) for e in doc.ents]
ents

[('GISMA Business School', 'ORG'),
 ('1999', 'DATE'),
 ('GISMA Business School', 'ORG'),
 ('GISMA Business School', 'ORG'),
 ('Bachelor', 'ORG'),
 ('Master', 'WORK_OF_ART'),
 ('Europe', 'LOC'),
 ('GISMA', 'ORG'),
 ('GISMA', 'ORG')]
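
The spaCy output lists every mention separately, so repeated entities appear several times. Counting the `(text, label)` pairs makes the result easier to inspect and compare with other models. A small sketch, with the list copied from the output above:

```python
from collections import Counter

# (text, label) pairs copied from the spaCy output above.
ents = [
    ("GISMA Business School", "ORG"), ("1999", "DATE"),
    ("GISMA Business School", "ORG"), ("GISMA Business School", "ORG"),
    ("Bachelor", "ORG"), ("Master", "WORK_OF_ART"),
    ("Europe", "LOC"), ("GISMA", "ORG"), ("GISMA", "ORG"),
]

# Number of mentions for each distinct entity.
counts = Counter(ents)

# Distinct entity texts grouped by label, in order of first appearance.
by_label = {}
for (ent_text, label) in counts:
    by_label.setdefault(label, []).append(ent_text)

for (ent_text, label), n in counts.most_common():
    print(f"{ent_text} ({label}) x{n}")
```

The grouped view also makes questionable labels stand out: `Bachelor` tagged as `ORG` and `Master` tagged as `WORK_OF_ART` both come from the phrase "Bachelor's and Master's degrees" in the source text.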

In [18]:
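# Token-level NER with a BERT model fine-tuned on CoNLL-2003.
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
results = ner_pipeline(text)
# Each result is a single WordPiece token; pieces that continue a word start with "##".
for entity in results:
    print(f"{entity['word']} ({entity['entity']}) [{entity['score']:.2f}]")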

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


G (I-ORG) [1.00]
##IS (I-ORG) [1.00]
##MA (I-ORG) [1.00]
Business (I-ORG) [0.99]
School (I-ORG) [0.96]
G (I-ORG) [1.00]
##IS (I-ORG) [1.00]
##MA (I-ORG) [1.00]
Business (I-ORG) [0.99]
School (I-ORG) [0.98]
G (I-ORG) [1.00]
##IS (I-ORG) [1.00]
##MA (I-ORG) [1.00]
Business (I-ORG) [0.98]
School (I-ORG) [0.94]
Europe (I-LOC) [1.00]
G (I-ORG) [0.97]
##IS (I-ORG) [0.91]
##MA (I-ORG) [0.74]
G (I-ORG) [0.96]
##IS (I-ORG) [0.98]
##MA (I-ORG) [0.95]
G (I-ORG) [0.99]
##IS (I-ORG) [0.99]
##MA (I-ORG) [0.89]
G (I-ORG) [0.97]
##IS (I-ORG) [0.93]
##MA (I-ORG) [0.83]
G (I-ORG) [0.99]
##IS (I-ORG) [0.99]
##MA (I-ORG) [0.93]
G (I-ORG) [0.99]
##IS (I-ORG) [0.98]
##MA (I-ORG) [0.82]
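
The Transformers output is reported per WordPiece token, so `G`, `##IS`, `##MA` must be joined before it can be compared with the spaCy entities. The sketch below groups consecutive tokens that share a label into one span. `merge_wordpieces` is a hypothetical helper, the sample scores are rounded values from the first mention and from `Europe` above, and the `index` values are illustrative rather than copied from a real run.

```python
def merge_wordpieces(tokens):
    """Group consecutive token-level predictions with the same label into spans.

    Each token is a dict with "word", "entity", "score" and "index"
    (the token position in the input). Returns (text, label, min_score) tuples.
    """
    spans = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        prev = spans[-1] if spans else None
        if prev and tok["index"] == prev["last"] + 1 and label == prev["label"]:
            if tok["word"].startswith("##"):
                prev["text"] += tok["word"][2:]    # continue the current word
            else:
                prev["text"] += " " + tok["word"]  # start a new word in the span
            prev["scores"].append(tok["score"])
            prev["last"] = tok["index"]
        else:
            spans.append({"text": tok["word"], "label": label,
                          "scores": [tok["score"]], "last": tok["index"]})
    return [(s["text"], s["label"], min(s["scores"])) for s in spans]


# Hand-written sample in the shape of the pipeline's token-level results.
sample = [
    {"word": "G", "entity": "I-ORG", "score": 1.00, "index": 2},
    {"word": "##IS", "entity": "I-ORG", "score": 1.00, "index": 3},
    {"word": "##MA", "entity": "I-ORG", "score": 1.00, "index": 4},
    {"word": "Business", "entity": "I-ORG", "score": 0.99, "index": 5},
    {"word": "School", "entity": "I-ORG", "score": 0.96, "index": 6},
    {"word": "Europe", "entity": "I-LOC", "score": 1.00, "index": 40},
]
merged = merge_wordpieces(sample)
print(merged)
```

The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a similar grouping internally; its output format differs from the token-level listing shown above. After merging, the listing corresponds to three `GISMA Business School` mentions, one `Europe`, and six standalone `GISMA` mentions, compared with three and two `GISMA` variants in the spaCy output.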
