# Finding Mentions by Name

In this lab we will count how often proper nouns occur in the news corpus. More specifically, we are looking for names.

You will learn how to:
- work with corpus utility functions
- analyze text based on word capitalization
- work with regular expressions

In [1]:
# import dependencies 
import re
from corpus_utilities import load_articles

## 1. Read article data from corpus

In [2]:
# path to corpus directory
# change this value as necessary
directory_path = '../corpus'

# use utility script to load articles
articles = load_articles(directory_path)

print('Sanity check! Got', len(articles), 'articles.')  

Sanity check! Got 67259 articles.


## 2. Naive search approach

Let's use capitalization rules as a way to identify proper nouns. When a word occurs at the beginning of a sentence, we cannot tell whether it is capitalized because it is a proper noun or simply because it starts the sentence, so we will accept those words as well. This makes the approach very naive, but it will get us started.
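
To make the capitalization rule concrete, here is the pattern used in the next cell written out with `re.VERBOSE` so that each part can carry a comment. This is only an annotated sketch; the sample sentence is made up for illustration and does not come from the corpus.

```python
import re

# Same pattern as in the next cell, spelled out with comments.
annotated = re.compile(r'''
    (                             # capture the whole word sequence
      [A-ZÄÖ][a-zäö]+             # first capitalized word
      (?=\s[A-ZÄÖ])               # it must be followed by another capitalized word
      (?:\s[A-ZÄÖ][a-zäö]+)+      # one or more further capitalized words
    )
''', re.VERBOSE)

# Made-up sentence: "Yesterday Sky News reported that Jane Doe won."
sentence = 'Eilen Sky News kertoi, että Jane Doe voitti.'

found = annotated.findall(sentence)
print(found)  # ['Eilen Sky News', 'Jane Doe']
```

The sentence-initial word `Eilen` ("yesterday") is swept into the first match, which is exactly the ambiguity described above.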

In [3]:
# Regular expression pattern for matching capitalized words
# Initially, let's match every sequence of two or more capitalized words
pattern = r'([A-ZÄÖ][a-zäö]+(?=\s[A-ZÄÖ])(?:\s[A-ZÄÖ][a-zäö]+)+)'

matches = []

for article in articles:

    # title and summary are stored at indices 1 and 2 of each article record
    title, summary = article[1:3]

    # find capitalized word sequences in the article title
    title_words = re.findall(pattern, title)

    # find capitalized word sequences in the article summary
    summary_words = re.findall(pattern, summary)

    # combine all matches
    matches += title_words + summary_words


# show stats about findings
print('Found', len(matches), 'capitalized word sequences.')
print('Found', len(set(matches)), 'unique capitalized word sequences.', '\n')

# display some matches on the screen to observe the results
words_to_show, cols = 80, 2
preview = list(set(matches))[:words_to_show]

def display(word, width):
    # occurrence count followed by the word, both left-aligned
    return str(matches.count(word)).ljust(5) + ' ' + word.ljust(width)

for i in range(words_to_show // cols):
    upper, lower = i * cols, (i + 1) * cols
    row_words = [display(p, 45) for p in preview[upper:lower]]
    print(' '.join(row_words))

Found 73575 capitalized word sequences.
Found 31362 unique capitalized word sequences. 

1     Antti Niemellä                                1     Venäjän Sotshin                              
1     La Ligan Sevilla                              1     Juventuksen Torinossa                        
1     Lähihoitaja Johanna                           1     Tapahtumajärjestäjä Live Entertainment Finland
1     Kosicesta Bratislavaan                        1     Josie Goldberg                               
1     Jane Doe                                      6     Sky News                                     
1     Sarin Instagram                               1     Portugalin Kansojen                          
1     New Yorkissä                                  1     Alfa Romeon Antonio Giovinazzille            
2     Fazerin Sinisen                               2     Ilja Bryzgalov                               
1     Geto Boys                                     1     Suosikkibloggaaja 

These results show that the approach does find proper nouns, but it has specific issues:

- the matches include extra words that are not part of a name: `Juontaja` (a job title, "host")
- we cannot assume a fixed-length word sequence: `Valerie Morris Campbell` vs. `Frederik`
- different inflected forms prevent us from grouping matches properly: `Petteri Orpolta` vs. `Petteri Orpo`
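
The last two issues can be reproduced on a few made-up sentences. The sketch below reuses the pattern from the cell above; the sample text is invented for illustration and is not taken from the corpus.

```python
import re

pattern = r'([A-ZÄÖ][a-zäö]+(?=\s[A-ZÄÖ])(?:\s[A-ZÄÖ][a-zäö]+)+)'

# Invented sample text using the names mentioned in the list above.
text = ('Valerie Morris Campbell saapui paikalle. '
        'Paikalla oli myös Frederik. '
        'Asiaa kysyttiin Petteri Orpolta, ja Petteri Orpo vastasi.')

found = re.findall(pattern, text)
print(found)
# ['Valerie Morris Campbell', 'Petteri Orpolta', 'Petteri Orpo']
```

The three-word name is matched as one sequence, the single-word name `Frederik` is not matched at all because the pattern requires at least two capitalized words, and the two forms of the same name end up as separate entries.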

## 3. Try frequency based sorting

Observing the results above, we can see that actual names of people seem to occur more frequently than random word pairs. Based on this observation, let's sort the previously discovered capitalized word sequences by frequency and see what happens. **Caution: This operation may take a while to finish.**
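
The cell below is slow because `matches.count(w)` rescans the whole list once for every unique sequence. With the 73575 matches and 31362 unique sequences reported above, that is on the order of 2.3 billion string comparisons. As an alternative, `collections.Counter` tallies everything in a single pass. The following is a minimal sketch on a small stand-in list (`sample_matches` is a hypothetical name, not part of the lab code):

```python
from collections import Counter

# Small stand-in for the `matches` list built in the previous section.
sample_matches = ['Sky News', 'Jane Doe', 'Sky News',
                  'Geto Boys', 'Sky News', 'Jane Doe']

# Count every sequence in one pass over the list.
counts = Counter(sample_matches)

# Build (count, word) pairs and sort them most frequent first,
# the same ordering the cell below produces.
ranked = sorted(((n, w) for w, n in counts.items()), reverse=True)

for n, w in ranked:
    print(str(n).ljust(5), w)
```

Replacing `sample_matches` with `matches` should give the same ranking as the cell below, since both sort the same `(count, word)` pairs.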

In [7]:
words_to_show, cols = 100, 3
# (count, word) pairs, most frequent first
preview = sorted([(matches.count(w), w) for w in set(matches)], reverse=True)

for i in range(words_to_show // cols):
    upper, lower = i * cols, (i + 1) * cols
    row_words = [display(p[1], 30) for p in preview[upper:lower]]
    print(' '.join(row_words))

print('\nWords with 2 or more occurrences:', len([p for p in preview if p[0] > 1]))

537   Kimi Räikkönen                 406   Valtteri Bottas                388   Sensuroimaton Päivärinta      
344   Kimi Räikkösen                 259   Patrik Laine                   210   Teemu Pukki                   
206   Valtteri Bottaksen             201   Kaisa Mäkäräinen               196   Matti Nykäsen                 
195   Antti Rinteen                  188   Iivo Niskanen                  183   Lewis Hamilton                
181   Antti Rinne                    180   Kaapo Kakko                    166   Matti Latvala                 
162   Patrik Laineen                 159   Krista Pärmäkoski              151   Donald Trump                  
151   Big Brother                    137   Teemu Pukin                    136   Therese Johaug                
132   Jussi Halla                    128   Prinssi Harry                  124   New Yorkissa                  
123   Herttuatar Meghan              119   Olli Lindholmin                117   Sauli Niinistö                
1

This result is already much better. We have narrowed the list down from more than 70K capitalized word sequences to fewer than 8K expressions that occur two or more times.

But there are still some issues we should address. Each of the following pairs refers to the same person, and our solution would be better if it grouped such forms together:

- `Kimi Räikkönen` and `Kimi Räikkösen`
- `Valtteri Bottas` and `Valtteri Bottaksen`
- `Matti Nykänen` and `Matti Nykäsen`
- `Antti Rinteen` and `Antti Rinne`
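
To illustrate what grouping could look like, here is a deliberately crude sketch that merges forms sharing the same first word and the same first three letters of the last word. This heuristic is an assumption made for illustration only, not the method this lab goes on to use; `crude_key` and `prefix_len` are hypothetical names. The counts are copied from the frequency table above.

```python
from collections import defaultdict

def crude_key(name, prefix_len=3):
    # Hypothetical grouping key: first word plus a short prefix of the last word.
    parts = name.split()
    return (parts[0], parts[-1][:prefix_len])

# Counts taken from the frequency table above.
counts = {
    'Kimi Räikkönen': 537, 'Kimi Räikkösen': 344,
    'Valtteri Bottas': 406, 'Valtteri Bottaksen': 206,
    'Antti Rinne': 181, 'Antti Rinteen': 195,
}

# Sum the counts of all forms that share a key.
grouped = defaultdict(int)
for name, n in counts.items():
    grouped[crude_key(name)] += n

for key, total in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(total, key)
```

A prefix this short will also merge unrelated people whose names happen to start alike, so it only shows the goal. Handling Finnish inflection properly needs something more careful.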