## A frequency-ranked list of economics vocabulary

### Aim of project:
To help a friend improve her economics-specific English vocabulary in an efficient way.

### End result of project:
1. A <a href="https://github.com/pvonglehn/economics_vocab/blob/master/economics_vocab.txt">list</a> of the most common economics-specific English words that don't have Spanish cognates. That is, words that are common in economics texts but uncommon in general texts, and whose meaning a Spanish speaker cannot easily guess.
<br><br>
2. An Anki flashcard deck for studying these words with example sentences (still to do)
 


### Problem:
A friend is preparing for an English exam as part of her studies to become an economist in the Spanish civil service. Part of the exam will be based on an article from the International Monetary Fund's <a href="https://www.imf.org/external/pubs/ft/fandd/">Finance and Development magazine</a>, or a similar publication. She finds that she lacks much of the economics-specific vocabulary needed to understand these articles.

The classic strategy for improving vocabulary is to read a lot and look up words that you don't know. However, this approach is very inefficient. 

Let's consider an example.
The student reads the sentence:<br>
'Higher <strong>wages</strong> in China make <strong>offshoring</strong> less attractive.'

The student may not know the meaning of 'wages' or 'offshoring', so their instinct might be to look up and learn both. What the student doesn't know, however, is that while the word 'wage' is very common in financial texts and definitely worth learning, the word 'offshoring' is much less common. It is not worth the effort of learning it while there are many more useful words to learn first.

### Project results:

Below is a table showing the first 5 words in the final list. The imf_rank column gives each word's rank by frequency of occurrence in the IMF magazine corpus. This corpus was compiled during this project and consists of over two million tokens (words) in over 100,000 sentences from over 1000 articles. The general_rank column gives the word's position on a list of 5000 English words ordered by their frequency of occurrence in a large corpus covering a wide range of English texts. This list was downloaded from https://www.wordfrequency.info/

<br><em>A general_rank of 1000000 means that the word is not on the list from the general corpus.</em>


In [1157]:
economics_vocab_no_cognates.head(5)

| word | imf_rank | imf_freq | general_rank |
|---|---|---|---|
| currency | 136 | 2328 | 3297 |
| poverty | 146 | 2123 | 2080 |
| spending | 175 | 1853 | 2082 |
| inequality | 206 | 1556 | 1000000 |
| banking | 228 | 1432 | 3476 |


Below is the table including words with Spanish cognates. The meanings of 'percent', 'fiscal' and 'export' could all be easily guessed by a Spanish speaker, so we are taking them off the list.

In [1158]:
economics_vocab.head(5)

| word | imf_rank | imf_freq | general_rank |
|---|---|---|---|
| percent | 32 | 6845 | 1000000 |
| fiscal | 98 | 2952 | 3212 |
| currency | 136 | 2328 | 3297 |
| poverty | 146 | 2123 | 2080 |
| export | 157 | 2036 | 3062 |







## Procedural overview
1. Generate a text corpus by web scraping economics articles 
2. Split the text into sentences and then tag each word by its part of speech (verb, noun etc.)
3. Convert the words into their lemmas (dictionary entry forms). e.g. running -> run
4. Create a frequency table of the words and put them in rank order
5. Get an existing ranked list of words from a corpus of general English
6. Produce a list of words which have a higher rank in the economics corpus than in the general corpus
7. Remove words which have Spanish cognates e.g. the economy - la economía
8. Get example sentences with Spanish translation for each word
9. Turn these example sentences into flash cards for studying
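
Steps 4 to 7 above can be sketched on a toy example. This is only an illustration using the standard library: the word lists, ranks and cognate set below are invented, and the notebook's own filter additionally requires the general rank to be above 2000.

```python
from collections import Counter

NOT_LISTED = 1000000  # sentinel rank for words missing from the general list

# Toy data, invented for illustration only
economics_lemmas = ["the"] * 5 + ["wage"] * 4 + ["of"] * 3 + ["economy"] * 2 + ["currency"]
general_rank = {"the": 1, "of": 2, "wage": 2500, "economy": 3000}
cognates = {"economy"}

# Step 4: rank lemmas by frequency in the economics corpus (1 = most frequent)
counts = Counter(economics_lemmas)
economics_rank = {
    word: rank for rank, (word, _count) in enumerate(counts.most_common(), start=1)
}

# Steps 5-6: keep words that rank higher in the economics corpus than in general English
specific = [
    word for word, rank in economics_rank.items()
    if rank < general_rank.get(word, NOT_LISTED)
]

# Step 7: drop words with a Spanish cognate
study_list = [word for word in specific if word not in cognates]
print(study_list)
```

Here 'the' and 'of' are dropped because they are at least as common in general English, 'economy' is dropped as a cognate, and 'wage' and 'currency' remain.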





### To do / improvements to be made / features to add

Get example sentences and make Anki flashcards.<br>
Get more articles, e.g. from The Economist.
<br>Use an existing web crawling library such as https://scrapy.org/ to crawl entire sites in the future.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

In [552]:
# Web crawling function to get all links from a webpage using beautiful soup

# WARNING: Take care before reusing this function. It is one of my first attempts at 
# writing a function to scrape multiple pages and is quite hacky. It works for this case but may not
# work as desired in other cases.

# Arguments are:
# url: starting url
# base_url: the url of the website homepage
# existing_links: list of links already found (copied inside the function, so the caller's list is not modified)
# must_include: regex that the links must match (default "." matches any non-empty link)
# must_not_include: regex that the links must not match (default is the nonsense string "xasdfcasdf")

def get_all_links(url,base_url,existing_links,must_include = ".",must_not_include = "xasdfcasdf"):
    content = requests.get(url).content
    soup = BeautifulSoup(content,'lxml')
    links = []
    existing_links = existing_links.copy() # this copy is required to prevent modifying the original list (side effect)
    for anchor in soup.find_all("a"):
        # if the anchor element doesn't have an href, it isn't a proper link, so skip to the next anchor
        try:
            link = anchor['href']
        except KeyError:
            continue

        full_link = None
        if re.search("^http",link): # if link starts with http it is either an external link or internal with full url
            if base_url in link: # exclude links to external sites
                full_link = link
        elif re.search("^/",link): # if link is an internal link relative to the site root, create full link from base_url
            full_link = base_url + link # link already starts with "/", so no extra separator is needed
        else:
            full_link = url + "/" + link # otherwise treat it as a link relative to the current url
            
        # filter out links based on various conditions
        if (full_link is not None # skip anchors that did not yield an internal link
             and (full_link not in existing_links) # ignore links that have already been found
             and (not re.search("#",full_link)) # filter out links to ids on the same page
             and (not re.search("htm.*htm",full_link)) # hack: drops malformed links that contain "htm" twice - should fix
             and re.search("htm$",full_link) # only keep links that end in "htm"
             and re.search(must_include,full_link)
             and (not re.search(must_not_include,full_link))):
            
            # add to list of links found on the whole site so far
            # (the existing_links variable here has local scope only)
            existing_links.append(full_link)
            
            # add to list of links found within this function invocation
            links.append(full_link)
            
    return links
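
As a possible alternative to building full links by string concatenation, the standard library's `urllib.parse.urljoin` resolves absolute, root-relative and page-relative links in one call. The sketch below is not what the function above does: the helper name `resolve_link` is made up for illustration, and it strips `#fragment` parts rather than discarding such links.

```python
from urllib.parse import urljoin, urldefrag

def resolve_link(page_url, href):
    """Return the absolute url for href as found on page_url, without any #fragment."""
    absolute = urljoin(page_url, href)
    return urldefrag(absolute).url

page = "https://www.imf.org/external/pubs/ft/fandd/2018/03/index.htm"
print(resolve_link(page, "/external/index.htm"))  # root-relative link
print(resolve_link(page, "point.htm#top"))        # page-relative link with a fragment
print(resolve_link(page, "https://example.com/a.htm"))  # absolute link is left unchanged
```

Because relative links are resolved against the page's directory rather than appended to the page url, this approach should not produce links that contain "htm" twice.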


In [576]:
%%time
# Get all internal links from the IMF Finance and Development publications website

f = open("imf_links.txt","w") # file to save the links

base_url = "https://www.imf.org"
must_not_include = "fandd/spa|fandd/fre|fandd/rus|fandd/chi|fandd/ara|fandd/ger"
all_links = []

# The online magazine is published quarterly
# loop over the years and quarters
for year in range(1996,2019):
    for month in ("03","06","09","12"):
        must_include = "external/pubs/ft/fandd/{}/{}".format(year,month) # only get links from current edition
        current_existing = [] # initialise list of links that have been visited within this loop
        current_existing.append("https://www.imf.org/external/pubs/ft/fandd/{}/{}".format(year,month))
        for link in current_existing:
            # get all new links from within this link and add them to the list of links already found
            current_existing.extend(get_all_links(link,base_url,current_existing,must_include,must_not_include))
        all_links.extend(current_existing)
        for link in current_existing:
            f.write((link + "\n"))

f.close()


CPU times: user 1min 3s, sys: 3.35 s, total: 1min 6s
Wall time: 6min 51s
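
The inner loop above grows `current_existing` while iterating over it, which amounts to a breadth-first crawl. A more explicit way to write the same idea is with a queue and a set of visited urls, sketched below. The `crawl` function and the `fake_site` dictionary are hypothetical stand-ins so that the example runs without any network access.

```python
from collections import deque

def crawl(start_url, get_links):
    """Breadth-first crawl from start_url; get_links(url) returns the links found on that page."""
    seen = {start_url}          # set membership checks are fast, unlike list membership
    queue = deque([start_url])  # pages waiting to be visited
    order = []                  # pages in the order they were visited
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical site map standing in for real HTTP requests
fake_site = {
    "index": ["a", "b"],
    "a": ["b", "c"],
    "b": ["index"],
    "c": [],
}
pages = crawl("index", lambda url: fake_site.get(url, []))
print(pages)
```

In the real crawl, `get_links` would wrap a call to `get_all_links` with the appropriate filters.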


In [577]:
%%time

# Visit each link
# Extract all text
# Break the text into sentences with nltk (natural language tool kit) sentence tokenizer 

import nltk
all_sentences = []
for link in all_links:
    content = requests.get(link).content
    soup = BeautifulSoup(content,'lxml')
    text = soup.body.text
    sentences = nltk.sent_tokenize(text) # this splits the text into sentences
    all_sentences.extend(sentences)

CPU times: user 1min 9s, sys: 3.1 s, total: 1min 12s
Wall time: 4min 4s


In [578]:
# Keep only sentences with more than 6 words and fewer than 2 line breaks,
# as anything else is probably not a proper sentence

long_sentences = []
for sentence in all_sentences:
    if sentence.count("\n") < 2:
        if len(sentence.split()) > 6:
            long_sentences.append(sentence)

f = open("imf_sentences.txt","w")
for line in long_sentences:
    f.write((line + "\n"))
f.close()
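
The two checks in the cell above can be wrapped in a small predicate, which makes the thresholds explicit and easy to test. This is a sketch; the function name and parameter names are made up for illustration.

```python
def is_probable_sentence(text, min_words=7, max_newlines=1):
    """Mirror the filter above: at most one line break and at least seven words."""
    return text.count("\n") <= max_newlines and len(text.split()) >= min_words

examples = [
    "Higher wages in China make offshoring less attractive.",  # 8 words
    "Home | About | Contact",                                  # too short
    "one two three\nfour five\nsix seven eight",               # 2 line breaks
]
kept = [text for text in examples if is_probable_sentence(text)]
print(kept)
```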

In [822]:
sentence_count = len(long_sentences)
word_count = pd.Series(long_sentences).apply(lambda x : len(x.split())).sum()
words_per_sentence = pd.Series(long_sentences).apply(lambda x : len(x.split())).mean()
print("sentence count = {}\ntotal word count = {}\nmean words per sentence = {:.2f}"
      .format(sentence_count,word_count,words_per_sentence))
      

sentence count = 101541
total word count = 2470329
mean words per sentence = 24.33


In [87]:
# These are the libraries and functions that we will need to turn the
# words into their lemmas (dictionary forms) https://en.wikipedia.org/wiki/Lemma_(morphology)
# E.g. running will be turned into run and played into play
# To do this we need to:

# 1. tokenize the sentences (break them up into words)
# 2. tag the parts of speech for each token (word) e.g. verb, adjective
# 3. lemmatize the tokens (turn into dictionary form), passing each token's
#    part of speech to the lemmatizer

# Full sentences need to be passed to the part-of-speech tagger in order to tag them accurately
# If the word "play" is given in isolation, it is ambiguous whether it is a verb or a noun,
# but if you give the pos tagger a full sentence, it can determine this from context
# e.g. I am going to play (verb) tennis. I am going to see a play (noun).

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')
lemmatizer = WordNetLemmatizer()

# This function converts the parts of speech tags from nltk pos tagger 
# To POS tags that are compatible with the wordnet lemmatizer
# function adapted from: 
# https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    else:
        return None
    

[nltk_data] Downloading package wordnet to /Users/pv7409/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/pv7409/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
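
To see what the tag conversion does without loading nltk, the same mapping can be written as a dictionary lookup on the first letter of the tag. This is a sketch that assumes the WordNet part-of-speech constants are the single characters 'a', 'v', 'r' and 'n' (the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV` and `wordnet.NOUN`). The name `treebank_to_wordnet` is made up.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech character
WORDNET_POS = {"J": "a", "V": "v", "R": "r", "N": "n"}

def treebank_to_wordnet(tag):
    """Return the WordNet part of speech for a Treebank tag, or None if there is no match."""
    return WORDNET_POS.get(tag[:1])

for tag in ["JJ", "VBG", "RB", "NNS", "DT"]:
    print(tag, treebank_to_wordnet(tag))
```

Note that any tag starting with 'R' is treated as an adverb, which includes the particle tag 'RP'. The function above behaves the same way.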


In [629]:
%%time

all_tokens = []
all_lemmas = []
for sentence in long_sentences:
    tokens = nltk.word_tokenize(sentence)
    tokenized = nltk.pos_tag(tokens)
    for i, token in enumerate(tokenized):
        # Filter out proper nouns (words with capital letters that aren't at the beginning of a sentence)
        if token[0].islower() or i == 0:
            word = token[0].lower()
            wordnet_pos = get_wordnet_pos(token[1])
            if wordnet_pos is None:
                # no wordnet part of speech for this tag, so keep the word unchanged
                all_lemmas.append(word)
            else:
                all_lemmas.append(lemmatizer.lemmatize(word,wordnet_pos))
            all_tokens.append(word)

# write out lemmas to text file
f = open("all_lemmas.txt","w")
for i in all_lemmas:
    f.write((i+"\n"))
f.close()

CPU times: user 3min 25s, sys: 4.73 s, total: 3min 30s
Wall time: 3min 39s


In [635]:
# Now we'll process the lemmas a little

all_lemma_series = pd.Series(all_lemmas)

# remove punctuation (any lemma containing a non-word character)
all_lemma_series = all_lemma_series[~all_lemma_series.str.contains(r"\W")]

# load a list of proper names
# (the filter that would remove them is commented out, so it is not applied)
propernames = pd.Series(open("/usr/share/dict/propernames","r").read().split("\n"))
propernames = propernames.apply(lambda x: x.lower())
propernames = set(propernames)
#all_lemma_series = all_lemma_series[~all_lemma_series.isin(propernames)]

# get rid of numbers (any lemma containing a digit)
all_lemma_series = all_lemma_series[~all_lemma_series.str.contains(r"\d")]
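
The effect of the two regex filters can be checked on a handful of tokens with the standard `re` module. The token list below is invented for illustration. Note that the filters drop any lemma that contains a non-word character or a digit anywhere, so hyphenated words and contractions are removed along with pure punctuation.

```python
import re

NON_WORD = re.compile(r"\W")  # anything other than a letter, digit or underscore
DIGIT = re.compile(r"\d")

def keep_lemma(lemma):
    """Keep a lemma only if it has no non-word characters and no digits."""
    return not NON_WORD.search(lemma) and not DIGIT.search(lemma)

lemmas = ["wage", ",", "2008", "long-term", "g20", "n't", "inflation", "u.s."]
cleaned = [lemma for lemma in lemmas if keep_lemma(lemma)]
print(cleaned)
```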


In [826]:
# Create a frequency table for each lemma
# Give each lemma a rank (1 = most frequent)

lemma_frequency = all_lemma_series.value_counts()
ranked = pd.Series(lemma_frequency.index)
ranks = pd.DataFrame(list(range(1,len(lemma_frequency)+1)))
ranks.index = lemma_frequency.index
ranks['freq'] = lemma_frequency

In [827]:
ranks.head()

| word | 0 | freq |
|---|---|---|
| the | 1 | 149311 |
| of | 2 | 85426 |
| and | 3 | 82856 |
| be | 4 | 74219 |
| to | 5 | 69129 |


Here is a list of the 5000 most frequent words in English from https://www.wordfrequency.info/
<br>Note that although there are 5000 entries in the list, there are only 4353 unique words,
as sometimes the same word has several entries because it appears as a different part of speech 
<br>e.g. 'light' appears as a noun (the light at the end of the tunnel) and as an adjective (a light breakfast)

In [828]:
# Get list of English words by frequency, generated from a 14 billion word internet corpus
# https://www.wordfrequency.info/
f = open("5000_eng_words.txt","r")
freq_5000 = f.read().split("\n")
freq_5000 = freq_5000[1:] # ignore header
freq_5000_list = pd.Series(freq_5000).str.lower()

#remove duplicated words in freq_5000_list
freq_5000_list = pd.Series(freq_5000_list).unique()

# give each word in the general corpus frequency list a rank
freq5000df = pd.DataFrame(list(range(1,len(freq_5000_list)+1)))
freq5000df.index = freq_5000_list

In [829]:
freq5000df.head()

| word | 0 |
|---|---|
| the | 1 |
| be | 2 |
| and | 3 |
| of | 4 |
| a | 5 |
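
Removing duplicate entries before assigning ranks means each word is ranked by its first appearance and the ranks are renumbered without gaps. With 5000 entries but 4353 unique words, the general ranks therefore run from 1 to 4353. A small standard-library sketch of the same idea, using an invented entry list:

```python
# Invented entries: 'light' and 'be' each appear twice (e.g. as different parts of speech)
entries = ["the", "be", "light", "and", "light", "be"]

# dict.fromkeys keeps the first occurrence of each word and preserves order
unique_words = list(dict.fromkeys(entries))

# Rank 1 is the most frequent word; ranks are consecutive after deduplication
general_rank = {word: rank for rank, word in enumerate(unique_words, start=1)}
print(general_rank)
```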


In [830]:
# merge the general corpus frequency data frame with our economics vocab frequency list

ranks = ranks.merge(freq5000df,how="left",left_index=True,right_index=True)
ranks.columns = "imf_rank","imf_freq","general_rank"
ranks = ranks.sort_values("imf_rank")

# if word not in general corpus list, give it rank of 1000000
ranks.loc[ranks["general_rank"].isnull(),["general_rank"]] = 1000000 
ranks["general_rank"] = ranks["general_rank"].astype(int) # turn back into integers


In [831]:
ranks.head()

| word | imf_rank | imf_freq | general_rank |
|---|---|---|---|
| the | 1 | 149311 | 1 |
| of | 2 | 85426 | 4 |
| and | 3 | 82856 | 3 |
| be | 4 | 74219 | 2 |
| to | 5 | 69129 | 7 |


In [832]:
# exclude short words (keep only words with more than 3 letters)
large_words = pd.Series(ranks.index)
large_words = large_words[large_words.apply(lambda x : len(x) > 3)]
ranks = ranks.loc[list(large_words)]

In [851]:
my_series = pd.Series(long_sentences)
my_series[my_series.str.contains("estuary")].values

array(['Work evaporated at the port on the south shore of the giant estuary, the Rio de la Plata.'],
      dtype=object)

In [1086]:
# list of spanish/english cognates downloaded from: http://cognates.org/pdf/mfcogn.pdf
# massage the text file (formatting was messed up when converted from pdf)
f = open("cognates.txt","r").read()
f = re.sub(r'\((.*?)\)',",",f)
f = re.sub("por ciento","porciento",f)
f = re.sub("se relajó","serelajó",f)
f = re.sub("en el presente","enelpresente",f)
f = re.sub("ex prefix","prefix",f)
f = re.sub("ex prefijo","prefix",f)
f = re.sub("soul, música","soulmúsica",f)
f = re.sub("substituir v. 5/sustituir","substituir/sustituir",f)
f = re.sub("rock n' roll","rock'n'roll",f)
f = re.sub("El Salvador","ElSalvador",f)
f = re.sub("valuación, avalúo","valuación/avalúo",f)
f = re.sub("prefix","",f)
f = re.sub(r"intj\.","",f)

f = re.sub(r'\d',",",f)
f = re.split(r"PMF|MFW| |conj\.|v\.|adj\.|n\.|s\.|adv\.|,|abbr\.|abr\.|prep\.",f)
to_remove = "Cognate","org","","clic","prefijo","prefix"
for item in to_remove:
    while item in f: f.remove(item)
f[f.index("porciento")] = "por ciento"
f[f.index("serelajó")] = "se relajó"
f[f.index("enelpresente")] = "en el presente"
f[f.index("soulmúsica")] = "soul música"




In [1091]:
cognate_list = []
for i in range(0,len(f)-1,2):
    cognate_list.append([f[i],f[i+1]])
cognate_df = pd.DataFrame(cognate_list)
cognate_df.columns = "english","spanish"
cognate_df.to_csv("spanish_english_cognates.csv",index=False)
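
The pairing loop above relies on the cleaned list strictly alternating between English and Spanish entries. The same pairing can be sketched with slicing and `zip`, shown here on an invented list. As with the loop, a trailing unpaired item is ignored. If the clean-up ever leaves one token too many or too few in the middle of the list, every later pair will be shifted, so spot checks such as `cognate_df.sample(5)` are worthwhile.

```python
# Invented flat list: english, spanish, english, spanish, ... plus one stray item at the end
flat = ["attitude", "actitud", "attribute", "atributo", "communist", "comunista", "stray"]

# Even positions are English words, odd positions are their Spanish cognates
pairs = list(zip(flat[0::2], flat[1::2]))
print(pairs)
```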

In [1096]:
cognate_df.sample(5)

|  | english | spanish |
|---|---|---|
| 3373 | servant | sirviente |
| 687 | completed | completado |
| 315 | attribute | atributo |
| 658 | communist | comunista |
| 308 | attitude | actitud |


In [1113]:
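#economics_vocab = ranks[(ranks["imf_rank"] > ranks["general_rank"] + 500) & (ranks["general_rank"] < 10000)]
# Keep words that rank higher (smaller rank number) in the imf corpus than in the general corpus,
# and that are not among the 2000 most common general English words
economics_vocab = ranks[((ranks["imf_rank"] ) < ranks["general_rank"]) & (ranks["general_rank"] > 2000)]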

In [1118]:
s = pd.Series(economics_vocab.index)
no_cognates = list(s[~s.isin(cognate_df["english"])])
have_cognates = list(s[s.isin(cognate_df["english"])])
economics_vocab_no_cognates = economics_vocab.loc[no_cognates]
economics_vocab_with_cognates = economics_vocab.loc[have_cognates]

In [1129]:
pd.options.display.max_rows=100
economics_vocab_with_cognates


| word | imf_rank | imf_freq | general_rank |
|---|---|---|---|
| percent | 32 | 6845 | 1000000 |
| fiscal | 98 | 2952 | 3212 |
| export | 157 | 2036 | 3062 |
| monetary | 170 | 1892 | 1000000 |
| inflation | 181 | 1823 | 2830 |
| finance | 207 | 1551 | 2492 |
| advanced | 220 | 1473 | 2483 |
| external | 240 | 1362 | 2536 |
| shock | 264 | 1240 | 2050 |
| economist | 285 | 1150 | 2650 |


In [1130]:
pd.options.display.max_rows=10
economics_vocab_no_cognates[:1000]


| word | imf_rank | imf_freq | general_rank |
|---|---|---|---|
| currency | 136 | 2328 | 3297 |
| poverty | 146 | 2123 | 2080 |
| spending | 175 | 1853 | 2082 |
| inequality | 206 | 1556 | 1000000 |
| banking | 228 | 1432 | 3476 |
| ... | ... | ... | ... |
| paul | 3146 | 41 | 1000000 |
| hurdle | 3152 | 41 | 1000000 |
| isolation | 3154 | 41 | 3753 |
| societal | 3155 | 41 | 1000000 |


In [1133]:
f = open("economics_vocab.txt","w")
f.write("""#List of 2000 economics related words 
#Ranked by frequency of occurrence in the imf finance and development magazine
#https://www.imf.org/external/pubs/ft/fandd/
#Words with Spanish cognates removed\n""")
for i in economics_vocab_no_cognates[:2000].index:
    f.write((i+"\n"))
f.close()