# Information Extraction
Evaluate some information extraction approaches in action. In particular, you should do the following:
- Define a string variable that contains a piece of text as your document.
- Extract the keyphrases of your document using unsupervised algorithms such as `TextRank` and `SGRank`. These algorithms are already implemented in Python libraries such as [`textacy`](https://textacy.readthedocs.io/en/0.11.0/api_reference/extract.html).
- Recognize the named entities in your document using a library such as `spaCy`.

## Importing Modules

In [2]:
import pandas
import spacy
import textacy
import transformers

!python3.8 -m spacy download en_core_web_sm

2021-11-25 12:11:22.017903: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-25 12:11:22.017936: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


2021-11-25 12:11:24.676026: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-25 12:11:24.676053: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
     |████████████████████████████████| 13.6 MB 9.7 MB/s            
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Keyphrase Extraction

In [11]:
text = """
Has Angela Merkel actually gone now?
Her imminent departure has been reported on for months, in the lead-up to Germany's September general election and ever since. 
Yet Mrs Merkel kept popping up this autumn at press conferences in Berlin, the G20 meeting of world leaders in Rome, EU leaders summits and more.
Now she really, truly is poised to tiptoe into Germany's political sunset.
Social Democrat Olaf Scholz is the Chancellor-in-waiting. 
After presenting his plans for coalition government on Wednesday, he's hoping to get the formal nod of approval from parliament in a couple of weeks.
So what can we expect? Will it essentially be same old, same old for Germany?
Mr Scholz is a solid member of the political establishment, most recently serving as Angela Merkel's deputy prime minister and finance minister; perceived as a calm and steady hand throughout the ongoing coronavirus crisis.
"""

en = textacy.load_spacy_lang("en_core_web_sm")
doc = textacy.make_spacy_doc(text, lang=en)

textrank_keyphrases = textacy.extract.keyterms.textrank(doc, topn=5)
sgrank_keyphrases = textacy.extract.keyterms.sgrank(doc, topn=5)

df = pandas.DataFrame({
    "TextRank KP": [kp for kp, s in textrank_keyphrases], 
    "TextRank Scores": [s for kp, s in textrank_keyphrases],
    "SGRank KP": [kp for kp, s in sgrank_keyphrases], 
    "SGRank Scores": [s for kp, s in sgrank_keyphrases]
})
df

|   | TextRank KP | TextRank Scores | SGRank KP | SGRank Scores |
|---|---|---|---|---|
| 0 | Social Democrat Olaf Scholz | 0.029629 | Angela Merkel | 0.193277 |
| 1 | deputy prime minister | 0.025201 | Social Democrat Olaf Scholz | 0.149371 |
| 2 | Angela Merkel | 0.023054 | September general election | 0.091945 |
| 3 | Mrs Merkel | 0.02097 | imminent departure | 0.058762 |
| 4 | September general election | 0.020316 | EU leader | 0.038809 |
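
One simple way to compare the two algorithms is to measure how much their top-5 lists overlap. The sketch below is not part of the original notebook: it copies the keyphrases from the table above into plain lists and computes their Jaccard similarity. The helper name `jaccard` and the case-insensitive comparison are assumptions made for illustration.

```python
# Keyphrases copied from the results table above (top 5 per algorithm).
textrank_kp = [
    "Social Democrat Olaf Scholz",
    "deputy prime minister",
    "Angela Merkel",
    "Mrs Merkel",
    "September general election",
]
sgrank_kp = [
    "Angela Merkel",
    "Social Democrat Olaf Scholz",
    "September general election",
    "imminent departure",
    "EU leader",
]


def jaccard(a, b):
    """Jaccard similarity of two collections of strings (case-insensitive)."""
    set_a = {x.lower() for x in a}
    set_b = {x.lower() for x in b}
    if not set_a and not set_b:
        return 1.0  # two empty sets are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Keyphrases that both algorithms returned, and the overall overlap score.
shared = sorted(set(textrank_kp) & set(sgrank_kp))
overlap = jaccard(textrank_kp, sgrank_kp)

print(shared)
print(round(overlap, 3))
```

Three of the five keyphrases appear in both lists, so the two algorithms agree on the main actors and the election but differ on the remaining candidates. Exact string matching is a deliberately crude measure: it counts `Mrs Merkel` and `Angela Merkel` as different phrases.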


## Named Entity Recognition

### Using spaCy

In [3]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

df = pandas.DataFrame({
    "Token": [ent.text for ent in doc.ents],  # doc.ents yields entity spans, not single tokens
    "Type": [ent.label_ for ent in doc.ents]
})

spacy.displacy.render(doc, style="ent")
df

|   | Token | Type |
|---|---|---|
| 0 | Apple | ORG |
| 1 | U.K. | GPE |
| 2 | $1 billion | MONEY |


### Using Transformers

In [6]:
nlp = transformers.pipeline("ner")

sequence = "Hugging Face Inc. is a company based in New York City."

result = nlp(sequence)
result

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'entity': 'I-ORG',
  'score': 0.99926627,
  'index': 1,
  'word': 'Hu',
  'start': 0,
  'end': 2},
 {'entity': 'I-ORG',
  'score': 0.9808882,
  'index': 2,
  'word': '##gging',
  'start': 2,
  'end': 7},
 {'entity': 'I-ORG',
  'score': 0.9953625,
  'index': 3,
  'word': 'Face',
  'start': 8,
  'end': 12},
 {'entity': 'I-ORG',
  'score': 0.9993382,
  'index': 4,
  'word': 'Inc',
  'start': 13,
  'end': 16},
 {'entity': 'I-LOC',
  'score': 0.99902683,
  'index': 11,
  'word': 'New',
  'start': 40,
  'end': 43},
 {'entity': 'I-LOC',
  'score': 0.9988483,
  'index': 12,
  'word': 'York',
  'start': 44,
  'end': 48},
 {'entity': 'I-LOC',
  'score': 0.99917734,
  'index': 13,
  'word': 'City',
  'start': 49,
  'end': 53}]
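
The pipeline output above is token-level: `Hugging` is split into the word pieces `Hu` and `##gging`, and each piece carries its own label. The sketch below is not part of the original notebook. It shows one possible way to merge consecutive pieces of the same type into entity spans, using the `index`, `start`, and `end` fields copied from the output (scores omitted for brevity). The helper `group_entities` is a hypothetical name, and it assumes adjacent tokens with the same type belong to the same entity, which would wrongly merge two distinct neighbouring entities of the same type.

```python
sequence = "Hugging Face Inc. is a company based in New York City."

# Token-level predictions copied from the pipeline output above (scores omitted).
tokens = [
    {"entity": "I-ORG", "index": 1, "start": 0, "end": 2},
    {"entity": "I-ORG", "index": 2, "start": 2, "end": 7},
    {"entity": "I-ORG", "index": 3, "start": 8, "end": 12},
    {"entity": "I-ORG", "index": 4, "start": 13, "end": 16},
    {"entity": "I-LOC", "index": 11, "start": 40, "end": 43},
    {"entity": "I-LOC", "index": 12, "start": 44, "end": 48},
    {"entity": "I-LOC", "index": 13, "start": 49, "end": 53},
]


def group_entities(text, tokens):
    """Merge consecutive tokens of the same entity type into (text, label) spans."""
    spans = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        last = spans[-1] if spans else None
        # Extend the current span only if the token is the immediate successor
        # (by token index) and has the same entity type.
        if last and last["label"] == label and tok["index"] == last["last_index"] + 1:
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            spans.append({
                "label": label,
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
            })
    # Slice the original text so spacing between word pieces is preserved.
    return [(text[s["start"]:s["end"]], s["label"]) for s in spans]


entities = group_entities(sequence, tokens)
print(entities)
```

Slicing the original string by character offsets avoids having to strip the `##` prefixes by hand. Recent versions of `transformers` also accept an `aggregation_strategy` argument on the `ner` pipeline that performs comparable grouping; check the documentation for the installed version before relying on it.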