# Finding Mentions by Name

In this lab we will try to find out how often proper nouns occur in the news corpus. More specifically, we are looking for names.

You will learn how to:
- work with corpus utility functions
- analyze text based on word capitalization
- work with regular expressions

In [1]:
# import dependencies 
import re
from corpus_utilities import load_articles

## 1. Read article data from corpus

In [2]:
# path to corpus directory
# change this value as necessary
directory_path = '../corpus'

# use utility script to load articles
articles = load_articles(directory_path)

print('Sanity check! Got', len(articles), 'articles.')  

Sanity check! Got 67259 articles.


## 2. Naive search approach

Let's use capitalization rules as a way to identify proper nouns. When a word occurs at the beginning of a sentence, we cannot tell whether it is capitalized because it is a proper noun or simply because it is the first word of the sentence, so we will accept those words as well. This makes it a very naive approach to finding proper nouns, but it will get us started.
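
Before running the search over the whole corpus, it may help to see what the pattern from the next cell does on a single string. The sketch below uses a made-up sentence (`sample_text` is not taken from the corpus) and the same pattern as the lab.

```python
import re

# Same pattern as in the lab: two or more consecutive capitalized words.
# Note that the character classes only cover A-Z plus Ä and Ö.
pattern = r'([A-ZÄÖ][a-zäö]+(?=\s[A-ZÄÖ])(?:\s[A-ZÄÖ][a-zäö]+)+)'

# Made-up example sentence, for illustration only
sample_text = 'Kun Satu Rusasen kirja ilmestyi, Helsinki juhli. Blade Runner palaa.'

found = re.findall(pattern, sample_text)
print(found)
```

Two things are worth noticing here: the sentence-initial word `Kun` is pulled into the match because it is capitalized, and the single capitalized word `Helsinki` is skipped because the pattern requires at least two capitalized words in a row.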

In [34]:
# Regular expression pattern for matching capitalized words
# Initially let's match every sequence of two or more capitalized words
pattern = r'([A-ZÄÖ][a-zäö]+(?=\s[A-ZÄÖ])(?:\s[A-ZÄÖ][a-zäö]+)+)'

matches = []

for article in articles:

    title, summary = article[1:3]
    
    # find capitalized words in article title
    title_words = re.findall(pattern, title)
  
    # find capitalized words in article summary
    summary_words = re.findall(pattern, summary)
    
    # combine all matches
    proper_nouns = title_words + summary_words
    
    if len(proper_nouns) > 0:
        matches += proper_nouns


# show stats about findings        
print('Found', len(matches), 'capitalized word sequences.')
print('Found', len(set(matches)), 'unique capitalized word sequences.', '\n')

# display some matches on the screen to observe the results
words_to_show, cols = 80, 2
preview = list(set(matches))[0:words_to_show]

for i in range(0, words_to_show // cols):
    upper, lower = i * cols, (i + 1) * cols
    row_words = [str(p).ljust(50, ' ') for p in preview[upper:lower]]
    print(' '.join(row_words))

Found 73575 capitalized word sequences.
Found 31362 unique capitalized word sequences. 

Sebastian Schauman                                 Martin Johnsrud Sundbyhyn                         
Blade Runner                                       Lars Nelsonin                                     
Elli Immon                                         Asettuminen Lappiin                               
Stefan Kraft                                       Sanna Tilanto                                     
Palaako Gareth Bale Valioliigaan                   Mark Schmidt                                      
Heikki Korpela                                     Jonathan Kingin                                   
Keimolan Nesteellä                                 Kun Satu Rusasen                                  
Urmas Viilunkina                                   Aira Samulinin                                    
Näyttelijä Malla Malmivaaran                       Mahailya Reeves                           

These results show that this approach does identify proper nouns, but it has some limitations:

- the matches can include extra words that are not part of the name, such as `Juontaja` (or `Näyttelijä` and `Kun` in the preview above)
- we cannot assume that a name has a fixed number of words: compare `Blade Runner` with `Martin Johnsrud Sundbyhyn`
- the different word forms prevent us from grouping matches properly: `Satu Rusanen` vs. `Satu Rusasen`
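
Since the goal of the lab is to look at frequencies, one possible next step is to count how often each matched sequence occurs. The sketch below uses `collections.Counter` from the standard library on a small hand-made list; `sample_matches` is a hypothetical stand-in for the `matches` list built above, not real corpus output.

```python
from collections import Counter

# Hypothetical stand-in for the `matches` list collected from the corpus
sample_matches = [
    'Satu Rusanen', 'Satu Rusasen', 'Kun Satu Rusasen',
    'Blade Runner', 'Blade Runner', 'Satu Rusanen',
]

# Count occurrences of each exact string
counts = Counter(sample_matches)

# Print the sequences from most to least frequent
for phrase, count in counts.most_common():
    print(phrase.ljust(30), count)
```

Because the counting is done on exact strings, `Satu Rusanen`, `Satu Rusasen` and `Kun Satu Rusasen` end up as three separate entries, which illustrates the grouping problem listed above.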