# Finding Mentions by Name

In this lab we will try to measure how frequently proper nouns occur in the news corpus. More specifically, we are looking for names.

You will learn how to:
- identify words based on capitalization
- work with regular expressions
- compute Levenshtein distance

## Prerequisites

This lab requires the following third-party library. Run this command before starting:

```
pip install python-Levenshtein
```

In [1]:
# import dependencies 
import re
from corpus_utilities import load_articles
from Levenshtein import distance as levenshtein_distance

## 1. Read article data from corpus

In [2]:
# path to corpus directory
# change this value as necessary
directory_path = '../corpus'

# use utility script to load articles
articles = load_articles(directory_path)

print('Sanity check! Got', len(articles), 'articles.')  

Sanity check! Got 67259 articles.


## 2. Naive search approach using Regular Expression

Let's use capitalization as a way to identify proper nouns. When a word occurs at the beginning of a sentence, capitalization alone cannot tell us whether it is a proper noun or just the first word of the sentence, so we will accept those words as well. This makes the approach very naive, but it will get us started.

In [3]:
# Regular expression pattern for matching capitalized words
# Initially, let's match every sequence of two or more capitalized words
pattern = r'([A-ZÄÖ][a-zäö]+(?=\s[A-ZÄÖ])(?:\s[A-ZÄÖ][a-zäö]+)+)'

matches = []

for article in articles:

    [title, summary] = article[1:3]
    
    # find capitalized words in article title
    title_words = re.findall(pattern, title)
  
    # find capitalized words in article summary
    summary_words = re.findall(pattern, summary)
    
    # combine all matches
    proper_nouns = title_words + summary_words
    
    if len(proper_nouns) > 0:
        matches += proper_nouns


# show stats about findings        
print('Found', len(matches), 'capitalized word sequences.')
print('Found', len(set(matches)), 'unique capitalized word sequences.', '\n')

# display some matches on the screen to observe the results
words_to_show, cols = 80, 2
preview = list(set(matches))[0:words_to_show]
display = lambda x, y: str(matches.count(x)).ljust(5, ' ') + ' ' + x.ljust(y,' ')

for i in range(0, int(words_to_show / cols)):
    upper, lower = i * cols, (i + 1) * cols
    row_words = [display(p, 45) for p in preview[upper:lower]]
    print(' '.join(row_words))

Found 73575 capitalized word sequences.
Found 31362 unique capitalized word sequences.

2     Niko Vesterinen                               1     Bella Hadidin                                
3     Chicago Blackhawksia                          1     Sekä Yhdysvaltain                            
1     Overia St                                     1     Krista Tervon                                
4     Maria Sid                                     1     Brooklyn Nets                                
2     Lauri Markkasen Bulls                         1     Graeme Hall                                  
1     Pirjo Mannermaa                               2     Niki Mäenpään                                
1     Savonlinnalainen Heli                         1     Lähtikö Sara Siepin                          
1     Washington Capiltalsin                        1     Näyttelijä Timo Lavikainen                   
1     Carline Flack                                 4     Viihdetaiteilija Ve

These results show that the approach does identify proper nouns, but it has specific issues:

- the matched sequences include extra words: `Juontaja`
- we cannot assume a fixed-length word sequence: `Valerie Morris Campbell` vs. `Frederik`
- different word forms prevent us from grouping matches properly: `Petteri Orpolta` vs. `Petteri Orpo`
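These issues can be reproduced on a tiny scale. The sketch below applies the same pattern as the cell above to a made-up sample text (the sentences and the names `PATTERN`, `sample_text`, `found`, and `single` are assumptions for illustration, not corpus data):

```python
import re

# Same pattern as in the lab cell: two or more capitalized words in a row
PATTERN = r'([A-ZÄÖ][a-zäö]+(?=\s[A-ZÄÖ])(?:\s[A-ZÄÖ][a-zäö]+)+)'

# Made-up sample text, for illustration only
sample_text = ("Juontaja Maria Sid kertoi asiasta. "
               "Petteri Orpolta kysyttiin samaa, ja Petteri Orpo vastasi.")

found = re.findall(PATTERN, sample_text)
print(found)

# A lone capitalized word is never matched, because the pattern
# requires at least two capitalized words in sequence
single = re.findall(PATTERN, "Frederik saapui paikalle.")
print(single)
```

The first match carries the extra leading word `Juontaja`, the two forms of `Petteri Orpo` come back as separate strings, and the single-word name is not found at all.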

## 3. Try frequency based sorting

Observing the results above, actual names of people seem to occur more frequently than random word pairs. Based on this observation, let's sort the previously discovered capitalized word sequences by frequency and see what happens. **Caution: This operation may take a while to finish.**
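The sorting cell in this section is slow because it calls `matches.count(w)` once for every unique sequence, and each call rescans the whole list. One possible alternative is a single counting pass with `collections.Counter` from the standard library, which should produce the same `(count, word)` pairs. The sketch below uses a small made-up list, `sample_matches`, as a stand-in for `matches`:

```python
from collections import Counter

# Made-up stand-in for the `matches` list built earlier in the lab
sample_matches = [
    'Kimi Räikkönen', 'Valtteri Bottas', 'Kimi Räikkönen',
    'Kimi Räikkösen', 'Valtteri Bottas', 'Kimi Räikkönen',
]

# One pass over the list tallies every sequence
counts = Counter(sample_matches)

# Build (count, word) tuples and sort them in descending order,
# mirroring the shape of `sorted_matches` in the lab cell
sorted_sample = sorted(((n, w) for w, n in counts.items()), reverse=True)

for n, w in sorted_sample:
    print(str(n).ljust(5), w)
```

With the tally in hand, a lookup such as `counts[w]` could also replace the repeated `matches.count(x)` call inside `display`.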

In [5]:
words_to_show, cols = 100, 3
sorted_matches = sorted([(matches.count(w), w) for w in list(set(matches))], reverse=True)

for i in range(0, int(words_to_show / cols)):
    upper, lower = i * cols, (i + 1) * cols
    row_words = [display(p[1], 30) for p in sorted_matches[upper:lower]]
    print(' '.join(row_words))
    
print('\nWords with 2 or more occurrences:', len([p for p in sorted_matches if p[0] > 1]))    

537   Kimi Räikkönen                 406   Valtteri Bottas                388   Sensuroimaton Päivärinta      
344   Kimi Räikkösen                 259   Patrik Laine                   210   Teemu Pukki                   
206   Valtteri Bottaksen             201   Kaisa Mäkäräinen               196   Matti Nykäsen                 
195   Antti Rinteen                  188   Iivo Niskanen                  183   Lewis Hamilton                
181   Antti Rinne                    180   Kaapo Kakko                    166   Matti Latvala                 
162   Patrik Laineen                 159   Krista Pärmäkoski              151   Donald Trump                  
151   Big Brother                    137   Teemu Pukin                    136   Therese Johaug                
132   Jussi Halla                    128   Prinssi Harry                  124   New Yorkissa                  
123   Herttuatar Meghan              119   Olli Lindholmin                117   Sauli Niinistö                
1

This result is already much better. We have narrowed the list down from more than 70K capitalized word sequences to fewer than 8K of the most frequent expressions.

But there are still some issues we should address. The following pairs are inflected forms that refer to the same person:

- `Kimi Räikkönen` and `Kimi Räikkösen`
- `Valtteri Bottas` and `Valtteri Bottaksen`
- `Matti Nykänen` and `Matti Nykäsen`
- `Antti Rinteen` and `Antti Rinne`

The frequency count would be more accurate if we were able to identify the relationship between these word forms.

## 4. Grouping similar words

[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) measures how different two strings are: it is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. Let's try to improve our result by grouping words based on this distance.
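To make the idea of edit distance concrete, here is a minimal pure-Python sketch based on the standard dynamic-programming recurrence. The function name `edit_distance` is made up for illustration; the lab itself relies on `distance` from the `python-Levenshtein` package imported at the top, which is considerably faster.

```python
def edit_distance(a: str, b: str) -> int:
    """Illustrative Levenshtein distance, keeping one row of the table at a time."""
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        # current[0] = distance between a[:i] and the empty string
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance('Kimi Räikkönen', 'Kimi Räikkösen'))  # 1
print(edit_distance('Patrik Laine', 'Patrik Laineen'))    # 2
print(edit_distance('Antti Rinne', 'Antti Rinteen'))      # 3
```

The last pair shows why inflected forms are not always one edit apart: the stem itself changes between `Rinne` and `Rinteen`, so a single fixed threshold may group some names well and others poorly.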

In [17]:
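# for each of the 50 most frequent sequences, find the closest other sequences
for (n, w1) in sorted_matches[0:50]:

    # the word itself starts the group; len(w1) is the initial upper bound
    closest_matches = [w1]
    min_distance = len(w1)

    for (m, w2) in sorted_matches:
        dist = levenshtein_distance(w1, w2)
        if dist == 0:
            # skip the word itself
            continue
        if dist < min_distance:
            # found a closer word: restart the group
            min_distance = dist
            closest_matches = [w1, w2]
        elif dist == min_distance:
            # tie with the current minimum: add to the group
            closest_matches.append(w2)

    print(min_distance,
          # "{:.1f}".format(min_distance/len(w1)),
          # "{:.1f}".format(len(w1)/min_distance),
          closest_matches)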


1 ['Kimi Räikkönen', 'Kimi Räikkösen']
2 ['Valtteri Bottas', 'Valtteri Bottasta']
4 ['Sensuroimaton Päivärinta', 'Sensuroimaton Päivärinnassa']
1 ['Kimi Räikkösen', 'Kimi Räikkönen', 'Kimi Räikköseen', 'Kimin Räikkösen']
2 ['Patrik Laine', 'Patrik Laineen']
1 ['Teemu Pukki', 'Teemu Pukkia']
1 ['Valtteri Bottaksen', 'Valtteri Bottakseen', 'Valtterin Bottaksen']
1 ['Kaisa Mäkäräinen', 'Kaisa Mäkäräisen']
1 ['Matti Nykäsen', 'Matti Nykänen', 'Matti Nykäseen']
2 ['Antti Rinteen', 'Antti Rintanen']
1 ['Iivo Niskanen', 'Iivo Niskasen', 'Ivo Niskanen', 'Iio Niskanen']
2 ['Lewis Hamilton', 'Lewis Hamiltonin', 'Lewis Hamiltonia']
3 ['Antti Rinne', 'Antti Rinteen', 'Antti Rinnettä', 'Antti Reinin', 'Antti Reini', 'Antti Rönkä', 'Antti Rintanen', 'Antti Rinnekin']
1 ['Kaapo Kakko', 'Kaapo Kakkoa']
1 ['Matti Latvala', 'Matti Latvalan', 'Matti Latvalaa']
2 ['Patrik Laineen', 'Patrik Laine', 'Patrik Lainekin']
2 ['Krista Pärmäkoski', 'Krista Pärmäkosken', 'Krista Pärmäkoskea']
2 ['Donald Trump', 'Do