# Named entity recognition

This exercise shows how to extract named entities, such as the names of companies and countries, from text.

## Tasks

1. Read the classification of [Named Entities](http://clarin-pl.eu/pliki/warsztaty/Wyklad3-inforex-liner2.pdf).
> Done
2. Read the [API of NER](http://nlp.pwr.wroc.pl/redmine/projects/nlprest2/wiki) in [Clarin](http://ws.clarin-pl.eu/ner.shtml).
> Done
3. Read the [documentation of CCL format](http://nlp.pwr.wroc.pl/redmine/projects/corpus2/wiki/CCL_format).
> Done
4. Sort bills according to their size and take top 50 (largest) bills.

In [1]:
import os

directory = '../ustawy/'
files = os.listdir(directory)
print(files)
# Sort by file size in bytes, largest first.
files.sort(key=lambda f: os.stat(os.path.join(directory, f)).st_size, reverse=True)
print(files[:50])
files50 = files[:50]

['2004_1644.txt', '2001_1086.txt', '2000_856.txt', '1994_645.txt', '2001_452.txt', '1997_590.txt', '1995_31.txt', '2004_889.txt', '1996_602.txt', '2003_1569.txt', '1997_714.txt', '2000_1026.txt', '1996_775.txt', '2001_803.txt', '1995_208.txt', '1999_440.txt', '2001_1441.txt', '2001_1070.txt', '2004_1647.txt', '2001_1369.txt', '2002_1523.txt', '1996_776.txt', '2001_1196.txt', '1995_5.txt', '2001_166.txt', '2004_2390.txt', '2003_2259.txt', '1995_2.txt', '2002_1086.txt', '2000_545.txt', '2000_704.txt', '1997_691.txt', '1999_980.txt', '2001_1384.txt', '2001_1371.txt', '2001_745.txt', '1995_419.txt', '1996_461.txt', '2004_2135.txt', '2001_498.txt', '2000_549.txt', '2001_465.txt', '2001_1645.txt', '2003_1324.txt', '1997_776.txt', '1996_720.txt', '1997_297.txt', '1996_407.txt', '2001_1321.txt', '2001_1440.txt', '2000_1312.txt', '2001_1067.txt', '1997_154.txt', '1996_499.txt', '2001_1081.txt', '2001_1365.txt', '2001_474.txt', '2001_744.txt', '2000_178.txt', '1997_501.txt', '1997_681.txt', '200

5. Use the lemmatized and sentence split documents (from ex. 5) to identify the expressions that consist of consecutive
   words starting with a capital letter (you will have to look at the inflected form of the word to check its
   capitalization) that do not occupy the first position in a sentence. E.g. the sentence:
   ```
   Wczoraj w Krakowie miało miejsce spotaknie prezydentów Polski i Stanów Zjednoczonych.
   ```
   should yield the following entries: `Kraków`, `Polska`, `Stan Zjednoczony`.

>$ docker run -it -p 9200:9200 apohllo/krnnt:0.1 python3 /home/krnnt/krnnt/krnnt_serve.py /home/krnnt/krnnt/data

In [21]:
import os

import regex
import requests

# Matches "lemma<TAB>tag" on the tab-indented lines of the tagger response.
LEMMA_TAG_RE = r"(?<=\n\t)\S*\t\w*"


def lemmatize(data):
    # NOTE: debugging leftover. The sample sentences below replace the `data`
    # argument, so the recorded output shows their lemmas, not those of the bill.
    # To process the real text, replace the right-hand side with `data.split()`.
    data = "Wczoraj w Krakowie miało miejsce spotaknie prezydentów Polski i Stanów Zjednoczonych." \
           "Dzisiaj w Zwierzyńcu świeto Rzeczpospolitej Polskiej".split(' ')

    corp = []
    # Each space-separated word is sent to the tagger in its own request.
    for word in data:
        response = requests.post('http://localhost:9200', data=word.encode('utf-8'))
        for match in regex.finditer(LEMMA_TAG_RE, response.text):
            # Keep only the lemma, dropping the tag.
            corp.append(match.group().split("\t")[0])

    return corp


def lemmatize_corpus(directory, fileList):
    corp = []
    for filename in fileList:
        with open(os.path.join(directory, filename), 'r') as file:
            corp.extend(lemmatize(file.read()))
            print('', end='.')  # progress marker, one dot per file
    return corp


# files50[49:] selects only the last of the 50 largest bills.
corp = lemmatize_corpus(directory, files50[49:])
print(corp)

.['wczoraj', 'w', 'Krak', 'mieć', 'miejsce', 'spotaknie', 'prezydent', 'Polska', 'i', 'Stany', 'Zjednoczony', '.', 'dzisiaj', 'w', 'zwierzyniec', 'świeto', 'Rzeczpospolita', 'polski']


In [15]:
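def find_capital_expression(corp):
    # Keep single title-cased tokens. Consecutive words are not merged yet
    # and the position in the sentence is not checked.
    return [t for t in corp if t.istitle()]

capital = find_capital_expression(corp)
for t in capital:
    print(t)

Kraków
Polska
Stany
Zjednoczony
Rzeczpospolita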
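
The cell above looks at lemmas one at a time. A minimal sketch of the full rule from task 5 (check the inflected form, skip the sentence-initial token, merge consecutive capitalized words) is shown below. It assumes the tagger output has already been turned into sentences of `(inflected_form, lemma)` pairs; the function name `capitalized_expressions` and the hand-written `sample` are illustrative and do not come from the tagger.

```python
def capitalized_expressions(sentences):
    """Return lemmatized expressions built from consecutive capitalized words.

    `sentences` is a list of sentences; each sentence is a list of
    (inflected_form, lemma) pairs. The first token of every sentence is
    skipped, because it is capitalized whether or not it is a name.
    """
    expressions = []
    for sentence in sentences:
        current = []  # lemmas of the run currently being collected
        for form, lemma in sentence[1:]:
            if form[:1].isupper():
                current.append(lemma)
            else:
                if current:
                    expressions.append(" ".join(current))
                current = []
        if current:  # flush a run that reaches the end of the sentence
            expressions.append(" ".join(current))
    return expressions


# Hand-written (form, lemma) pairs for the example sentence; a real run
# would take them from the tagger output instead.
sample = [[
    ("Wczoraj", "wczoraj"), ("w", "w"), ("Krakowie", "Kraków"),
    ("miało", "mieć"), ("miejsce", "miejsce"), ("spotaknie", "spotaknie"),
    ("prezydentów", "prezydent"), ("Polski", "Polska"), ("i", "i"),
    ("Stanów", "Stan"), ("Zjednoczonych", "Zjednoczony"), (".", "."),
]]
print(capitalized_expressions(sample))
```

With these hand-written lemmas the sketch yields the three entries named in the task. A real tagger may lowercase some lemmas, so the capitalization test is deliberately applied to the inflected form only.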


6. Compute the frequency of each identified expression and print 50 results with the largest number of occurrences.
7. Apply the NER algorithm to identify the named entities in the same set of documents (not lemmatized) using the `n82` model.
8. Plot the frequency (histogram) of the coarse-grained classes (e.g. `nam_adj`, `nam_eve`, `nam_fac`).
9. Display 10 most frequent Named Entities for each coarse-grained type.
10. Display 50 most frequent Named Entities including their count and fine-grained type.
11. Answer the following questions:
   i. Which of the methods (counting expressions with capital letters vs. NER) worked better for the task of
      identifying proper names?
   ii. What are the drawbacks of the method based on capital letters?
   iii. What are the drawbacks of the method based on NER?
   iv. Which of the coarse-grained NER groups has the best and which has the worst results? Try to justify this
      observation.
   v. Do you think NER is sufficient for identifying different occurrences of the same entity (e.g. "USA",
      "Stany Zjednoczone" and "Stany Zjednoczone Ameryki Północnej")? If not, can you suggest an algorithm or a tool that
      would be able to group such names together?
   vi. Can you think of a real-world problem that would benefit the most from applying a Named Entity Recognition
      algorithm?
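
Tasks 8 to 10 all come down to counting annotations in the NER output. The sketch below assumes the output is in CCL format, where each `<tok>` carries `<ann chan="...">` elements and a non-zero number marks membership in an entity of that channel. `SAMPLE_CCL` is a hand-written, simplified fragment (the `<lex>` elements are omitted), and the helper names are illustrative. It also assumes that the coarse-grained class is the first two components of the fine-grained channel name.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hand-written, simplified CCL fragment; not real service output.
SAMPLE_CCL = """<chunkList><chunk id="ch1"><sentence id="s1">
<tok><orth>Jan</orth><ann chan="nam_liv_person">1</ann><ann chan="nam_loc_gpe_country">0</ann></tok>
<tok><orth>Kowalski</orth><ann chan="nam_liv_person">1</ann><ann chan="nam_loc_gpe_country">0</ann></tok>
<tok><orth>odwiedził</orth><ann chan="nam_liv_person">0</ann><ann chan="nam_loc_gpe_country">0</ann></tok>
<tok><orth>Stany</orth><ann chan="nam_liv_person">0</ann><ann chan="nam_loc_gpe_country">1</ann></tok>
<tok><orth>Zjednoczone</orth><ann chan="nam_liv_person">0</ann><ann chan="nam_loc_gpe_country">1</ann></tok>
<tok><orth>i</orth><ann chan="nam_liv_person">0</ann><ann chan="nam_loc_gpe_country">0</ann></tok>
<tok><orth>Polskę</orth><ann chan="nam_liv_person">0</ann><ann chan="nam_loc_gpe_country">2</ann></tok>
</sentence></chunk></chunkList>"""


def entities_from_ccl(ccl_xml):
    """Return (entity_text, fine_grained_type) pairs found in a CCL document."""
    root = ET.fromstring(ccl_xml)
    entities = []
    for sentence in root.iter("sentence"):
        groups = defaultdict(list)  # (channel, number) -> token forms
        for tok in sentence.iter("tok"):
            orth = tok.findtext("orth")
            for ann in tok.iter("ann"):
                number = int(ann.text)
                if number > 0:  # 0 means the token is outside this channel
                    groups[(ann.get("chan"), number)].append(orth)
        for (channel, _), forms in groups.items():
            entities.append((" ".join(forms), channel))
    return entities


def coarse_type(fine_type):
    """Assumed mapping: 'nam_loc_gpe_country' -> 'nam_loc'."""
    return "_".join(fine_type.split("_")[:2])


entities = entities_from_ccl(SAMPLE_CCL)
fine_counts = Counter(entities)                            # task 10
coarse_counts = Counter(coarse_type(t) for _, t in entities)  # task 8
print(fine_counts.most_common(50))
print(coarse_counts.most_common())
```

The `coarse_counts` mapping can be passed to any plotting library for the histogram in task 8, and grouping `fine_counts` by `coarse_type` gives the per-class rankings in task 9.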

## Hints

1. Named entity recognition identifies entities mentioned in text by determining their span and assigning each of
   them to a predefined type. The more types there are, the harder the problem.
2. Named entities are usually proper names and temporal expressions. They usually convey the most important information
   in text.
3. The IOB format is typically used to tag named entities. The name (IOB) comes from the types of tokens (_inside_, _outside_, _beginning_).
   The following example shows how the format works:
   ```
   W            O
   1776         B-TIME
   niemiecki    O
   zoolog       O
   Peter        B-PER
   Simon        I-PER
   Pallas       I-PER
   dokonał      O
   formalnego   O
   ...
   ```
4. The set of classes used in NER is partially task-dependent. Some general classes, such as names of people or cities,
   are used universally, but categories such as references to legal regulations are specific to legal information systems.
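
The IOB example in hint 3 can be turned back into entity spans with a short decoder. This is a minimal sketch, assuming the input is a list of `(token, tag)` pairs; the name `decode_iob` is illustrative. An `I-` tag that does not continue an open entity of the same type is treated as outside, which is one of several possible policies.

```python
def decode_iob(tagged_tokens):
    """Collect (entity_type, entity_text) pairs from IOB-tagged tokens."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or a stray "I-" tag closes the open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


# The tokens and tags from the example in hint 3.
example = [
    ("W", "O"), ("1776", "B-TIME"), ("niemiecki", "O"), ("zoolog", "O"),
    ("Peter", "B-PER"), ("Simon", "I-PER"), ("Pallas", "I-PER"),
    ("dokonał", "O"), ("formalnego", "O"),
]
print(decode_iob(example))
```

For the example this returns one `TIME` entity (`1776`) and one `PER` entity (`Peter Simon Pallas`).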