# Finding Mentions by Name

In this lab we will count how often proper nouns occur in the news corpus. More specifically, we are looking for names that appear frequently in the news.

You will learn how to:

- identify words based on capitalization
- work with regular expressions
- compute Levenshtein distance

You will get slightly different results by adjusting the parameters. This is a basic analysis, and the primary purpose of this exercise is to demonstrate one possible use of this dataset.

## Prerequisites

This lab requires the following third-party library. Run this command before starting:

```
pip install python-Levenshtein
```

In [1]:
# import dependencies 
import re
from corpus_utilities import load_articles, longest_base_word
from Levenshtein import distance as levenshtein_distance

## 1. Read article data from corpus

In [2]:
# path to corpus directory
# change this value as necessary
directory_path = '../corpus'

# use utility script to load articles
articles = load_articles(directory_path)

print('Sanity check! Got', len(articles), 'articles.')  

Sanity check! Got 67259 articles.


## 2. Naive approach using Regular Expression

Let's use capitalization as a way to identify proper nouns. When a word occurs at the beginning of a sentence, it is impossible to tell from capitalization alone whether it is a proper noun or just the first word of the sentence, so we will accept those words as well. This is a very naive approach to finding proper nouns, but it will get us started.

In [3]:
# Regular expression pattern for matching capitalized words
# Initially let's match all capitalized word sequences 
# using {1,} to only detect expressions with 2 or more words
pattern = r'([A-ZÄÖ][a-zäöóé\-]+(?=\s[A-ZÄÖ])(?:\s[A-ZÄÖ][a-zäöóé\-]+){1,})'

matches = []

for article in articles:

    [title, summary] = article[1:3]
    
    # find capitalized words in article title
    title_words = re.findall(pattern, title)
  
    # find capitalized words in article summary
    summary_words = re.findall(pattern, summary)
    
    # combine all matches
    proper_nouns = title_words + summary_words
    
    if len(proper_nouns) > 0:
        matches += proper_nouns


# show stats about findings        
print('Found', len(matches), 'capitalized word sequences.')
print('Found', len(list(set(matches))), 'unique capitalized word sequences.', '\n')

# display some matches on the screen to observe the results
words_to_show, cols = 80, 2
preview = list(set(matches))[0:words_to_show]
display = lambda x, y: str(matches.count(x)).ljust(5, ' ') + ' ' + x.ljust(y,' ')

for i in range(0, int(words_to_show / cols)):
    upper, lower = i * cols, (i + 1) * cols
    row_words = [display(p, 45) for p in preview[upper:lower]]
    print(' '.join(row_words))

Found 74440 capitalized word sequences.
Found 33070 unique capitalized word sequences. 

2     Mati Alaveria                                 1     Heikki Maliselle                             
1     Samuli Sirviöön                               2     Colin Jostinin                               
1     Radio Novan Aamuun                            1     Kemikaaliyliherkän Tiinan                    
1     Minnesotassa Yhdysvalloissa                   1     Dave Mustaine                                
2     Dina Lohan                                    1     Alexander Mc                                 
2     B-juontaja Alma Hätönen                       1     Jannika Landen                               
1     Simpsonit-ääninäyttelijä Russi Taylor         1     Hyväntekeväisyysjärjestö Marie Stopes        
7     Tyson Furyn                                   1     Detroit Pistonsin                            
1     Vantaalainen Helena                           9     Renny Harlin       

These results show that the approach does identify proper nouns, but it has some specific issues:

- the matches include extra words that are not part of a name: `Juontaja`
- we cannot assume a fixed-length word sequence: `Valerie Morris Campbell` vs. `Frederik`
- the different word forms prevent us from grouping matches properly: `Petteri Orpolta` vs. `Petteri Orpo`
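These issues can be reproduced on a small scale. The following is a minimal sketch that applies the same pattern to a made-up sample text (the sentences are an illustration and do not come from the corpus):

```python
import re

# Same pattern as in the cell above
pattern = r'([A-ZÄÖ][a-zäöóé\-]+(?=\s[A-ZÄÖ])(?:\s[A-ZÄÖ][a-zäöóé\-]+){1,})'

# Made-up sample text (an assumption for illustration, not corpus data)
sample = ('Juontaja Alma Hätönen haastatteli Petteri Orpoa. '
          'Myöhemmin Petteri Orpolta kysyttiin lisää.')

found = re.findall(pattern, sample)

for match in found:
    print(match)
```

The job title `Juontaja` and the sentence-initial word `Myöhemmin` are swept into the matches because they are capitalized, and the two mentions of the same person come out in two different word forms.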

## 3. Try frequency based sorting

Observing the results above, we can see that actual names of people seem to occur more frequently than random word pairs. Based on this observation, let's sort the previously discovered capitalized word sequences by frequency and see what happens.
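As a minimal sketch of frequency-based sorting, the example below counts a tiny hand-made list with `collections.Counter` from the standard library. The list `sample_matches` is a hypothetical stand-in for `matches`.

```python
from collections import Counter

# Tiny hand-made list standing in for `matches` (hypothetical data)
sample_matches = ['Kimi Räikkönen', 'Tyson Furyn', 'Kimi Räikkönen',
                  'Renny Harlin', 'Kimi Räikkönen', 'Renny Harlin']

# Counter tallies every distinct item in a single pass over the list
counts = Counter(sample_matches)

# most_common() yields (item, count) pairs with the highest count first
for name, count in counts.most_common():
    print(str(count).ljust(5), name)
```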

In [4]:
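from collections import Counter

words_to_show, cols = 100, 3

# count every sequence in one pass instead of calling matches.count() per unique word,
# then sort by frequency with the most frequent first
sorted_matches = sorted([(n, w) for w, n in Counter(matches).items()], reverse=True)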

for i in range(0, int(words_to_show / cols)):
    upper, lower = i * cols, (i + 1) * cols
    row_words = [display(p[1], 25) for p in sorted_matches[upper:lower]]
    print(' '.join(row_words))
    
print('\nWords with 2 or more occurrences:', len([p for p in sorted_matches if p[0] > 1]))    

537   Kimi Räikkönen            404   Valtteri Bottas           388   Sensuroimaton Päivärinta 
344   Kimi Räikkösen            254   Patrik Laine              210   Teemu Pukki              
205   Valtteri Bottaksen        201   Kaisa Mäkäräinen          195   Matti Nykäsen            
195   Antti Rinteen             188   Iivo Niskanen             181   Lewis Hamilton           
180   Antti Rinne               179   Kaapo Kakko               166   Matti Latvala            
161   Patrik Laineen            159   Krista Pärmäkoski         151   Donald Trump             
150   Big Brother               137   Teemu Pukin               136   Therese Johaug           
128   Prinssi Harry             123   Herttuatar Meghan         121   New Yorkissa             
118   Olli Lindholmin           117   Sauli Niinistö            116   Iivo Niskasen            
110   Jukka Jalonen             109   Mikko Rantanen            108   Donald Trumpin           
106   Winnipeg Jetsin           106   Ka

This result is already much better. We have narrowed the list down from more than 70K capitalized word sequences to fewer than 8K of the most frequent expressions.

But there are still some issues we should address. The following pairs are inflected forms that refer to the same person:

- `Kimi Räikkönen` and `Kimi Räikkösen`
- `Valtteri Bottas` and `Valtteri Bottaksen`
- `Matti Nykänen` and `Matti Nykäsen`
- `Antti Rinteen` and `Antti Rinne`

The frequency count would be more accurate if we were able to identify the relationship between these word forms.
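To see how much the ranking could change, here is a small sketch that merges the counts of forms that belong together. The counts are copied from the frequency table above; the groups are written by hand for illustration only.

```python
# Match counts copied from the frequency table above
form_counts = {
    'Kimi Räikkönen': 537, 'Kimi Räikkösen': 344,
    'Valtteri Bottas': 404, 'Valtteri Bottaksen': 205,
    'Antti Rinne': 180, 'Antti Rinteen': 195,
}

# Hand-written groups (an assumption for illustration)
groups = [
    ['Kimi Räikkönen', 'Kimi Räikkösen'],
    ['Valtteri Bottas', 'Valtteri Bottaksen'],
    ['Antti Rinne', 'Antti Rinteen'],
]

# Sum the counts of all forms in a group under the first form
merged = {forms[0]: sum(form_counts[f] for f in forms) for forms in groups}

for name, total in sorted(merged.items(), key=lambda kv: kv[1], reverse=True):
    print(str(total).ljust(5), name)
```

Note that these are counts of matched expressions, not of articles, so a single article can contribute more than once.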

## 4. Grouping similar words

[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) measures the distance between two strings as the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. Let's try to improve our result by clustering related words, such as inflected forms of the same name, based on this distance. We will only allow a very small Levenshtein distance (`max_dist`), which gives us high confidence that two words are related.
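To get a feel for the numbers involved, here is a pure-Python sketch of the distance computation. It is a stand-in for `Levenshtein.distance` so that the example runs without the third-party library; the function name `edit_distance` is our own.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


pairs = [('Kimi Räikkönen', 'Kimi Räikkösen'),
         ('Valtteri Bottas', 'Valtteri Bottaksen'),
         ('Antti Rinne', 'Antti Rinteen')]

for w1, w2 in pairs:
    print(edit_distance(w1, w2), w1, '/', w2)
```

The three pairs are 1, 3, and 3 edits apart, so a threshold of 3 is enough to relate each of them.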

In [5]:
# dictionary to hold related terms
related_terms = {}

# copy the unique capitalized word sequences that occur more than twice,
# sorted alphabetically; drop the frequency count
search_base = sorted([w[1] for w in sorted_matches[:] if w[0] > 2])

# maximum allowed Levenshtein distance
# increasing this number will create more relationships and possibly incorrect matches
# decreasing this number will make the relationships more strict and identify fewer inflected forms
# adding more fuzziness at this step gets you better ranking at the end
max_dist = 3

# test for substring match occurring at the beginning of strings
sub_match = lambda w1,w2: w1.startswith(w2) or w2.startswith(w1)

# test base word length match; return True if the shared base word is
# at least 0.75 times the length of the shorter word
base_word_match = lambda w1,w2: len(longest_base_word([w1,w2])) >= 0.75 * min([len(w1), len(w2)])

# when computing distance treat substring match as equal; 
# otherwise compute Levenshtein distance if the base word matches
get_distance = lambda w1,w2: 0 if sub_match(w1, w2) \
                else levenshtein_distance(w1, w2) if base_word_match(w1, w2) else 100

def find_in_dict(w1):
    """Check if there is a close match in the dictionary"""
    for k, words in related_terms.items():
        for w2 in words:
            if get_distance(w1, w2) <= max_dist:
                return True, k    
    return False, None


print('Grouping related words....')
iter_count, percent, num_iters = 0, 0., len(search_base)

# iterate over words to find inflected forms
for w1 in search_base:

    # display progress in 10% increments
    if iter_count / num_iters >= percent + 0.1:
        percent = iter_count / num_iters
        print('progress....', "{:.0%}".format(percent))

    iter_count += 1

    # first check if word has a closely 
    # related term in the dictionary
    dict_match, dict_key = find_in_dict(w1)

    if dict_match:
        if w1 not in related_terms[dict_key]:
            related_terms[dict_key].append(w1)
        continue

    # if not, look for best match in sorted words list
    # these are words that are yet to be paired
    for w2 in search_base[iter_count:]:
        dist = get_distance(w1, w2)
        
        # pair close matches
        if dist <= max_dist:
            dict_match, dict_key = find_in_dict(w2)
            if dict_match:
                if w1 not in related_terms[dict_key]:
                    related_terms[dict_key].append(w1)
                if w2 not in related_terms[dict_key]:
                    related_terms[dict_key].append(w2)
            else:
                related_terms[len(related_terms.keys()) + 1] = [w1, w2]
            break

# safety check to ensure we don't have duplication in the dictionary            
for (k,v) in related_terms.items():
    related_terms[k] = list(set(v))
        
# display results
print('\nFound', len(related_terms.keys()), 'related words!', '\n')    

# display the first 50 groups (slicing avoids an IndexError if there are fewer)
for i, v in enumerate(list(related_terms.values())[:50]):
    output = ', '.join(v)
    dots = ('...' if len(output) > 80 else '')
    print((str(i + 1) + '.').ljust(3, ' '), output[0: 80].strip() + dots) 

Grouping related words....
progress.... 10%
progress.... 20%
progress.... 30%
progress.... 40%
progress.... 50%
progress.... 60%
progress.... 70%
progress.... 80%
progress.... 90%

Found 810 related words! 

1.  Abu Dhabin, Abu Dhabissa
2.  Aira Samulinin, Aira Samulin
3.  Airiston Helmen, Airiston Helmi
4.  Aki Linnanahteen, Aki Linnanahde
5.  Aki Manninen, Aki Mannisen
6.  Aki Palsanmäen, Aki Palsanmäki
7.  Aki Riihilahti, Aki Riihilahden
8.  Akseli Herlevin, Akseli Herlevi
9.  Aku Hirviniemen, Aku Hirviniemi, Aku Hirviniemelle
10. Aku Louhimiehen, Aku Louhimies
11. Alec Baldwin, Alec Baldwinin
12. Aleksander Barkov, Aleksander Barkovin, Aleksander Barkovilla
13. Aleksandr Bolshunovin, Aleksandr Bolshunov
14. Aleksandr Loginov, Aleksandr Loginovin
15. Aleksandr Ovetshkin, Aleksandr Ovetshkinin
16. Alexander Stubbin, Alexander Stubb
17. Alfa Romeo Racingin, Alfa Romeolla, Alfa Romeolta, Alfa Romeon, Alfa Romeo, Alfa...
18. Alina Tomnikovin, Alina Tomnikov
19. Alisa Ranta-, Alisa Ranta

These results look generally good: different inflected forms of the same name are grouped together. There are still some limitations; for example, `Sauli Niinistö` and `Presidentti Niinistö` are grouped separately.
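This limitation follows from the checks that gate the distance computation. The sketch below is hedged: it assumes `longest_base_word` returns the longest common prefix of its inputs. Its implementation in `corpus_utilities` is not shown in this lab, so `common_prefix` is a hypothetical stand-in.

```python
import os


def common_prefix(words):
    """Hypothetical stand-in for `longest_base_word`: longest shared prefix."""
    return os.path.commonprefix(words)


def sub_match(w1, w2):
    # one string is a prefix of the other
    return w1.startswith(w2) or w2.startswith(w1)


def base_word_match(w1, w2):
    # shared prefix must cover at least 75% of the shorter string
    return len(common_prefix([w1, w2])) >= 0.75 * min(len(w1), len(w2))


pairs = [('Alfa Romeo', 'Alfa Romeon'),
         ('Antti Rinne', 'Antti Rinteen'),
         ('Sauli Niinistö', 'Presidentti Niinistö')]

for w1, w2 in pairs:
    comparable = sub_match(w1, w2) or base_word_match(w1, w2)
    print(w1, '/', w2, '->', 'can be grouped' if comparable else 'kept apart')
```

Under this assumption, two expressions that share a surname but start with different words have no common prefix, so they are assigned the fallback distance of 100 and never end up in the same group.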

Given the complexity of this exercise, and accepting that this method produces a close approximation rather than exact matches, we will move on with these results. It is certainly faster than trying to do this manually. Let's figure out who the most discussed people in the Finnish media are!

## 5. Identify most discussed people

In this last step we will use `related_terms` to compute the total number of articles that mention each group, which lets us discover the "most discussed people". Since we are performing this search over the entire corpus, the result is an all-time count.
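The counting idea can be sketched on toy data. The example assumes each article is a sequence whose second and third items are the title and the summary, mirroring the `article[1:3]` slicing used earlier; the articles themselves are made up.

```python
# Toy articles as (id, title, summary) tuples (made-up data; layout is an assumption)
toy_articles = [
    (1, 'Kimi Räikkönen voitti', 'Kimi Räikkösen kausi alkoi hyvin.'),
    (2, 'Valtteri Bottas toinen', 'Kimi Räikkönen jäi kolmanneksi.'),
    (3, 'Sää viilenee', 'Viikonloppuna sataa.'),
]

# One group of related word forms
group = ['Kimi Räikkönen', 'Kimi Räikkösen']

# Count each article at most once, no matter how many forms it contains
article_count = sum(
    1 for _, title, summary in toy_articles
    if any(form in title or form in summary for form in group)
)

print(article_count)
```

The first article contains both forms but is counted only once, which is why article counts can be lower than the sum of the individual match counts.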

In [6]:
# create a list to hold final results
result = []

for (k, words) in related_terms.items():
    
    match_terms = []
    article_matches = 0
        
    for article in articles:
        [title, summary] = article[1:3]
        is_match = False
        
        for word in words:
            if word in title or word in summary:
                match_terms.append(word)
                is_match = True

        if is_match:
            article_matches += 1
    
    top_word_form = sorted([(match_terms.count(x), x) for x in list(set(match_terms))], reverse=True)[0][1]
    result.append((article_matches, top_word_form))

# sort by popularity with top expression first    
result = sorted(result, reverse=True)    
    
# Display results
print('=' * 50)
print('All-Time Mentions, Top 100'.upper())
print('=' * 50, '\n')
print('#'.ljust(5), 'Articles'.upper().ljust(12), 'Topic'.upper())

# slicing avoids an IndexError if there are fewer than 100 results
for i, (count, word) in enumerate(result[:100]):
    print((str(i + 1) + '.').ljust(5), ('  ' + str(count)).ljust(12), word)

ALL-TIME MENTIONS, TOP 100

#     ARTICLES     TOPIC
1.      727        Kimi Räikkönen
2.      457        Valtteri Bottas
3.      389        Antti Rinne
4.      365        New York
5.      364        Patrik Laine
6.      308        Donald Trump
7.      300        Teemu Pukki
8.      278        Kaisa Mäkäräinen
9.      275        Iivo Niskanen
10.     265        Kaapo Kakko
11.     222        Lewis Hamilton
12.     218        Matti Nykäsen
13.     212        Krista Pärmäkoski
14.     203        Sauli Niinistö
15.     202        Winnipeg Jets
16.     187        Temptation Island
17.     187        Matti Latvala
18.     183        Herttuatar Meghan
19.     173        Helsingin Sanomien
20.     165        Therese Johaug
21.     163        Big Brother
22.     158        Alfa Romeo
23.     151        Prinssi Harry
24.     144        Juha Sipilä
25.     143        Sebastian Vettel
26.     143        Olli Lindholm
27.     138        Sanna Marin
28.     138        Miss Suomi
29.     137        

# [&laquo; Previous Lab](plotting_frequencies.ipynb)