## A frequency-ranked list of economics vocabulary

### Aim of project:
To help a friend improve her economics-specific English vocabulary efficiently.

### End result of project:
1. A <a href="https://github.com/pvonglehn/economics_vocab/blob/master/economics_vocab.txt">list</a> of the most common economics-specific English words which don't have Spanish cognates. That is, words which are common in economics texts but uncommon in general texts, and which cannot be easily guessed by a Spanish speaker.
<br><br>
2. An Anki flashcard deck for studying these words with example sentences (still to do)
 


### Problem:
A friend is preparing for an English exam as part of her studies to become an economist for the civil service in Spain. Part of the exam will be based on an article from the International Monetary Fund's <a href="https://www.imf.org/external/pubs/ft/fandd/">Finance and Development magazine</a>, or a similar publication. She finds that she lacks much of the economics-specific vocabulary needed to understand these articles.

The classic strategy for improving vocabulary is to read a lot and look up words that you don't know. However, this approach is very inefficient. 

Let's consider an example.
The student reads the sentence:<br>
'Higher <strong>wages</strong> in China make <strong>offshoring</strong> less attractive.'

The student may not know the meaning of 'wages' or 'offshoring', so their instinct might be to look up and learn both. What the student doesn't know, however, is that while the word 'wage' is very common in financial texts and definitely worth learning, the word 'offshoring' is much less common. It is not worth the effort of learning it as long as there are many more useful words to learn first.

### Project results:

Below is a table showing the first 10 words in the final list. The imf_rank column gives each word's rank based on its frequency of occurrence in the IMF magazine corpus. This corpus was compiled during this project and consists of over two million tokens (words) from over 100,000 sentences from over 1000 articles. The general_rank column gives the word's position on a list of 5000 English words ordered by their frequency of occurrence in a large corpus constructed from a wide range of English texts. This list was downloaded from https://www.wordfrequency.info/

This list only contains words which rank higher in the economics corpus than in general English, so words like 'the' and 'and' don't appear. We can see that the top words in our list occur much more frequently in the economics corpus than in general English. This indicates that these words are very important for understanding these texts, but students may not know them because they are relatively uncommon in the general texts that students will mostly have been exposed to in their English studies.

<br><em>A general_rank of 1000000 means that the word is not on the list from the general corpus.</em>


In [1196]:
economics_vocab_no_cognates.head(10)

Unnamed: 0,imf_rank,imf_freq,general_rank
currency,136,2328,3297
poverty,146,2123,2080
spending,175,1853,2082
inequality,206,1556,1000000
banking,228,1432,3476
macroeconomic,241,1356,1000000
deficit,289,1134,2142
wage,305,1070,2300
unemployment,325,1014,3089
commodity,329,1004,4038


### Removing words with Spanish cognates

Below is the table including words with Spanish cognates. The meanings of 'percent', 'fiscal', 'export' and similar words could all be easily guessed by a Spanish speaker. Of the top 1000 words on this list, 40% have Spanish cognates, so it is well worth excluding them from our list to save the student time and effort.

In [1195]:
w_cognate.head(10)

Unnamed: 0,imf_rank,imf_freq,general_rank,spanish_cognate
percent,32,6845,1000000,por ciento
fiscal,98,2952,3212,fiscal
currency,136,2328,3297,-
poverty,146,2123,2080,-
export,157,2036,3062,exportar
monetary,170,1892,1000000,monetario
spending,175,1853,2082,-
inflation,181,1823,2830,inflación
inequality,206,1556,1000000,-
finance,207,1551,2492,financiar







## Procedural overview
1. Generate a text corpus by web scraping economics articles 
2. Split the text into sentences and then tag each word by its part of speech (verb, noun etc.)
3. Convert the words into their lemmas (dictionary entry forms). e.g. running -> run
4. Create a frequency table of the words and put them in rank order
5. Get an existing ranked list of words from a corpus of general English
6. Produce a list of words which have a higher rank in the economics corpus than in the general corpus
7. Remove words which have Spanish cognates e.g. the economy - la economía
8. Get example sentences with Spanish translation for each word
9. Turn these example sentences into flashcards for studying
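
Steps 4 to 7 can be sketched on toy data using only the standard library. This is an illustration under stated assumptions, not the notebook's implementation (which uses pandas): the `lemmas` list and the `cognates` set are made up, and `general_rank` holds a few ranks copied from the tables above.

```python
from collections import Counter

# Hypothetical toy "economics corpus", already lemmatized
lemmas = ["the"] * 5 + ["fiscal"] * 4 + ["wage"] * 3 + ["deficit"] * 2 + ["offshoring"]

# A few general English ranks taken from the tables above
general_rank = {"the": 1, "wage": 2300, "deficit": 2142, "fiscal": 3212}
NOT_LISTED = 1000000          # sentinel rank for words missing from the general list
cognates = {"fiscal"}         # assumed set of words with Spanish cognates

# Step 4: frequency table in rank order (rank 1 = most frequent)
counts = Counter(lemmas)
corpus_rank = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Steps 5-6: keep words that rank higher (smaller number) in the corpus than in general English
specific = [w for w in corpus_rank if corpus_rank[w] < general_rank.get(w, NOT_LISTED)]

# Step 7: drop words that a Spanish speaker could guess
result = [w for w in specific if w not in cognates]
print(result)  # ['wage', 'deficit', 'offshoring']
```

The notebook applies one extra condition that this sketch leaves out: it also requires a general rank above 2000, so very common general English words are excluded.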

## Tools used
Beautiful Soup for web scraping<br>
NLTK for natural language processing<br>
pandas for data manipulation<br> 



## To do/ improvements to be made/ features to add

Get example sentences and make Anki flashcards.<br>
Get more articles, e.g. from The Economist.<br>
Use an existing web crawling library such as https://scrapy.org/ to crawl entire sites in the future.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

In [552]:
# Web crawling function to get all links from a webpage using beautiful soup

# WARNING: Take care before reusing this function. It is one of my first attempts at 
# writing a function to scrape multiple pages and is quite hacky. It works for this case but may not
# work as desired in other cases.

# Arguments are:
# url: starting url
# base_url: the url of website homepage
# must_include: regex that the links must include (default is the wildcard character ".")
# must_not_include: regex that the links must not include (default is the nonsense string "xasdfcasdf" )

def get_all_links(url,base_url,existing_links,must_include = ".",must_not_include = "xasdfcasdf"):
    content = requests.get(url).content
    soup = BeautifulSoup(content,'lxml')
    links = []
    existing_links = existing_links.copy() # this copy is required to prevent modifying the original list (side effect)
    for anchor in soup.findAll("a"):
        # if the anchor element doesn't have an href, it isn't a proper link, so skip to the next anchor
        try:
            link = anchor['href']
        except KeyError:
            continue
            
        full_link = None 
        if re.search("^http",link): # if link starts with http it is either an external link or internal with full url
            if base_url in link: # exclude links to external sites
                full_link = link 
        elif re.search("^/",link): # if link is an internal link relative to the site root, create full link from base_url
            full_link = (base_url + "/" + anchor['href'])
        else:
            full_link = (url + "/" + anchor['href']) # otherwise treat the link as relative to the current url
            
        # filter out links based on various conditions    
        if ((full_link not in existing_links) # ignore links that have already been found
             and full_link is not None
             and (not re.search("#",full_link)) # filter out links to id's on the same page
             and (not re.search("htm.*htm",full_link)) # this is hack because of some buggy behaviour - should fix
             and  re.search("htm$",full_link) 
             and re.search(must_include,full_link)
             and (not re.search(must_not_include,full_link))):
            
            # add to list of links found on the whole site so far
            # (the existing_links variable here has local scope only)
            existing_links.append(full_link)
            
            # add to list of links found within this function invocation
            links.append(full_link)
            
    return links
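
The manual concatenation of `base_url`, `url` and the `href` in the function above could instead be delegated to `urllib.parse.urljoin` from the standard library, which resolves absolute, root-relative and page-relative links in one call. The sketch below is an alternative, not what the notebook ran. `resolve_link` is a hypothetical helper, and the `article.htm` file name and the `example.com` address are made up for illustration.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def resolve_link(page_url, href, base_url):
    """Return an absolute internal URL for href, or None for external links."""
    absolute = urljoin(page_url, href)           # handles "http...", "/..." and relative hrefs
    absolute, _fragment = urldefrag(absolute)    # drop "#section" anchors
    if urlparse(absolute).netloc != urlparse(base_url).netloc:
        return None                              # link points to another site
    return absolute

page = "https://www.imf.org/external/pubs/ft/fandd/2018/03/index.htm"
base = "https://www.imf.org"

print(resolve_link(page, "/external/pubs/ft/fandd/2018/03/article.htm", base))
print(resolve_link(page, "article.htm#top", base))
print(resolve_link(page, "https://example.com/other.htm", base))
```

Unlike `get_all_links`, which discards links containing `#`, this sketch strips the fragment and keeps the page URL.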


In [576]:
%%time
# Get all internal links from the imf finance and development publications website

f = open("imf_links.txt","w") # file to save the links

base_url = "https://www.imf.org"
must_not_include = "fandd/spa|fandd/fre|fandd/rus|fandd/chi|fandd/ara|fandd/ger"
all_links = []

# The online magazine is published quarterly
# loop over the years and quarters
for year in range(1996,2019):
    for month in ("03","06","09","12"):
        must_include = "external/pubs/ft/fandd/{}/{}".format(year,month) # only get links from current edition
        current_existing = [] # initialise list of links that have been visited within this loop
        current_existing.append("https://www.imf.org/external/pubs/ft/fandd/{}/{}".format(year,month))
        for link in current_existing:
            # get all new links from within this link and add them to the list of links already found
            current_existing.extend(get_all_links(link,base_url,current_existing,must_include,must_not_include))
        all_links.extend(current_existing)
        for link in current_existing:
            f.write((link + "\n"))

f.close()


CPU times: user 1min 3s, sys: 3.35 s, total: 1min 6s
Wall time: 6min 51s


In [577]:
%%time

# Visit each link
# Extract all text
# Break the text into sentences with nltk (natural language tool kit) sentence tokenizer 

import nltk
all_sentences = []
for link in all_links:
    content = requests.get(link).content
    soup = BeautifulSoup(content,'lxml')
    text = soup.body.text
    sentences = nltk.sent_tokenize(text) # this splits the text into sentences
    all_sentences.extend(sentences)

CPU times: user 1min 9s, sys: 3.1 s, total: 1min 12s
Wall time: 4min 4s


In [578]:
# Keep only sentences with more than 6 words and fewer than 2 line breaks,
# as anything else is probably not a proper sentence

long_sentences = []
for sentence in all_sentences:
    if sentence.count("\n") < 2:
        if len(sentence.split()) > 6:
            long_sentences.append(sentence)

f = open("imf_sentences.txt","w")
for line in long_sentences:
    f.write((line + "\n"))
f.close()

In [822]:
sentence_count = len(long_sentences)
word_count = pd.Series(long_sentences).apply(lambda x : len(x.split())).sum()
words_per_sentence = pd.Series(long_sentences).apply(lambda x : len(x.split())).mean()
print("sentence count = {}\ntotal word count = {}\nmean words per sentence = {:.2f}"
      .format(sentence_count,word_count,words_per_sentence))
      

sentence count = 101541
total word count = 2470329
mean words per sentence = 24.33


In [87]:
# There are libraries and functions that we will need to turn the
# words into their lemmas (dictionary forms) https://en.wikipedia.org/wiki/Lemma_(morphology)
# E.g. running will be turned into run and played into play
# To do this we need to:

# 1. tokenize the sentences (break them up into words)
# 2. tag the parts of speech for each token (word) e.g. verb, adjective
# 3. lemmatize the tokens (turn into dictionary form)

# The lemmatizer needs to be told the part of speech of each word.
# Full sentences need to be passed to the parts of speech tagger in order to tag them accurately
# If the word "play" is given in isolation, it is ambiguous if it is a verb or a noun,
# but if you give the pos tagger a full sentence, it can determine from context
# e.g. I am going to play (verb) tennis. I am going to see a play (noun).

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')
lemmatizer = WordNetLemmatizer()

# This function converts the parts of speech tags from nltk pos tagger 
# To POS tags that are compatible with the wordnet lemmatizer
# function adapted from: 
# https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    else:
        return None
    

[nltk_data] Downloading package wordnet to /Users/pv7409/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/pv7409/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [629]:
%%time

all_tokens = []
all_lemmas = []
for sentence in long_sentences:
    tokens = nltk.word_tokenize(sentence)
    tokenized = nltk.pos_tag(tokens)
    for i, token in enumerate(tokenized):
        # Filter out proper nouns (words with capital letters that aren't at the beginning of a sentence)
        if token[0].islower() or i == 0:
            word = token[0].lower()
            wordnet_pos = get_wordnet_pos(token[1])
        #print(word,wordnet_pos)
            if wordnet_pos is None:
                all_lemmas.append(word)
                all_tokens.append(word)
            else:
                all_lemmas.append(lemmatizer.lemmatize(word,wordnet_pos))
                all_tokens.append(word)

# write out lemmas to text file
f = open("all_lemmas.txt","w")
for i in all_lemmas:
    f.write((i+"\n"))
f.close()

CPU times: user 3min 25s, sys: 4.73 s, total: 3min 30s
Wall time: 3min 39s


In [635]:
# Now we'll process the lemmas a little

all_lemma_series = pd.Series(all_lemmas)

# remove tokens containing punctuation (any non-word character)
all_lemma_series = all_lemma_series[~all_lemma_series.str.contains(r"\W")]

# load a list of proper names (the filter itself is currently disabled)
propernames = pd.Series(open("/usr/share/dict/propernames","r").read().split("\n"))
propernames = propernames.apply(lambda x: x.lower())
propernames = set(propernames)
#all_lemma_series = all_lemma_series[~all_lemma_series.isin(propernames)]

# get rid of tokens containing digits
all_lemma_series = all_lemma_series[~all_lemma_series.str.contains(r"\d")]
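
To make the effect of the two regex filters concrete, here is a minimal standard-library sketch that applies the same conditions to a made-up token list (the tokens are illustrative, not taken from the corpus). It shows a side effect worth knowing about: hyphenated words and contractions are dropped along with pure punctuation.

```python
import re

# Hypothetical tokens as they might come out of the tokenizer
tokens = ["wage", ",", "fund-raiser", "1996", "g7", "n't", "deficit", "%"]

# Same two conditions as the pandas filters:
# no non-word character anywhere in the token, and no digit
clean = [t for t in tokens
         if not re.search(r"\W", t) and not re.search(r"\d", t)]

print(clean)  # ['wage', 'deficit']
```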


In [826]:
# Create a frequency table for each lemma
# Give each lemma a rank

lemma_frequencey = all_lemma_series.value_counts()
ranked = pd.Series(lemma_frequencey.index)
ranks = pd.DataFrame(list(range(1,len(lemma_frequencey)+1)))
ranks.index = lemma_frequencey.index
ranks['freq'] = lemma_frequencey

In [827]:
ranks.head()

Unnamed: 0,0,freq
the,1,149311
of,2,85426
and,3,82856
be,4,74219
to,5,69129


Here is a list of the 5000 most frequent words in English from https://www.wordfrequency.info/
<br>Note that although there are 5000 entries in the list, there are only 4353 unique words,
as sometimes the same word has several entries because it appears as a different part of speech 
<br>e.g. 'light' appears as a noun (the light at the end of the tunnel) and as an adjective (a light breakfast)

In [828]:
# Get list of English words by frequency, generated from 14 billion word internet corpus
# https://www.wordfrequency.info/ 
f = open("5000_eng_words.txt","r")
freq_5000 = f.read().split("\n")
freq_5000 = freq_5000[1:] # ignore header
freq_5000_list = pd.Series(freq_5000).str.lower()

#remove duplicated words in freq_5000_list
freq_5000_list = pd.Series(freq_5000_list).unique()

# give each word in the general corpus frequency list a rank
freq5000df = pd.DataFrame(list(range(1,len(freq_5000_list)+1)))
freq5000df.index = freq_5000_list
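
The deduplication step keeps the first occurrence of each word, so a word listed under several parts of speech keeps the rank of its most frequent entry. A minimal sketch of the same idea without pandas, assuming a made-up `entries` list in frequency order:

```python
# Hypothetical frequency-ordered entries; "light" appears twice (noun and adjective)
entries = ["the", "be", "light", "run", "light", "play"]

# dict.fromkeys keeps the first occurrence of each word and preserves order
unique_words = list(dict.fromkeys(entries))

# rank 1 = most frequent
rank = {word: i for i, word in enumerate(unique_words, start=1)}

print(unique_words)   # ['the', 'be', 'light', 'run', 'play']
print(rank["play"])   # 5
```

Note that ranks after a removed duplicate shift up by one, which is why the deduplicated list has fewer than 5000 ranks.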

In [829]:
freq5000df.head()

Unnamed: 0,0
the,1
be,2
and,3
of,4
a,5


In [830]:
# merge the general corpus frequency data frame with our economics vocab frequency list

ranks = ranks.merge(freq5000df,how="left",left_index=True,right_index=True)
ranks.columns = "imf_rank","imf_freq","general_rank"
ranks = ranks.sort_values("imf_rank")

# if word not in general corpus list, give it rank of 1000000
ranks.loc[ranks["general_rank"].isnull(),["general_rank"]] = 1000000 
ranks["general_rank"] = ranks["general_rank"].astype(int) # turn back into integers


In [831]:
ranks.head()

Unnamed: 0,imf_rank,imf_freq,general_rank
the,1,149311,1
of,2,85426,4
and,3,82856,3
be,4,74219,2
to,5,69129,7


In [832]:
# exclude short words (3 letters or fewer)
large_words = pd.Series(ranks.index)
large_words = large_words[large_words.apply(lambda x : len(x) > 3)]
ranks = ranks.loc[list(large_words)]

In [851]:
my_series = pd.Series(long_sentences)
my_series[my_series.str.contains("estuary")].values

array(['Work evaporated at the port on the south shore of the giant estuary, the Rio de la Plata.'],
      dtype=object)

In [1086]:
# list of spanish/english cognates downloaded from: http://cognates.org/pdf/mfcogn.pdf
# massage the text file (formatting was messed up when converted from pdf)
f = open("cognates.txt","r").read()
f = re.sub(r'\((.*?)\)',",",f)
f = re.sub("por ciento","porciento",f)
f = re.sub("se relajó","serelajó",f)
f = re.sub("en el presente","enelpresente",f)
f = re.sub("ex prefix","prefix",f)
f = re.sub("ex prefijo","prefix",f)
f = re.sub("soul, música","soulmúsica",f)
f = re.sub("substituir v. 5/sustituir","substituir/sustituir",f)
f = re.sub("rock n' roll","rock'n'roll",f)
f = re.sub("El Salvador","ElSalvador",f)
f = re.sub("valuación, avalúo","valuación/avalúo",f)
f = re.sub("prefix","",f)
f = re.sub(r"intj\.","",f)

f = re.sub(r'\d',",",f)
f = re.split(r"PMF|MFW| |conj\.|v\.|adj\.|n\.|s\.|adv\.|,|abbr\.|abr\.|prep\.",f)
to_remove = "Cognate","org","","clic","prefijo","prefix"
for item in to_remove:
    while item in f: f.remove(item)
f[f.index("porciento")] = "por ciento"
f[f.index("serelajó")] = "se relajó"
f[f.index("enelpresente")] = "en el presente"
f[f.index("soulmúsica")] = "soul música"




In [1091]:
cognate_list = []
for i in range(0,len(f)-1,2):
    cognate_list.append([f[i],f[i+1]])
cognate_df = pd.DataFrame(cognate_list)
cognate_df.columns = "english","spanish"
cognate_df.to_csv("spanish_english_cognates.csv",index=False)
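
The pairing loop above relies on the cleaned token list strictly alternating between English and Spanish. A compact sketch of the same pairing with slicing and `zip`, assuming a made-up `flat` list (the three pairs are from the sample output; the trailing `"orphan"` is a hypothetical leftover token):

```python
# Alternating English / Spanish tokens, with one unpaired token at the end
flat = ["attitude", "actitud", "attribute", "atributo", "servant", "sirviente", "orphan"]

# Even positions are English, odd positions are Spanish; zip drops the unpaired tail
pairs = list(zip(flat[0::2], flat[1::2]))

print(pairs)
# [('attitude', 'actitud'), ('attribute', 'atributo'), ('servant', 'sirviente')]
```

Either way, a single missing or extra token shifts every pair after it, which is why the text file needs the manual clean-up before this step.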

In [1096]:
cognate_df.sample(5)

Unnamed: 0,english,spanish
3373,servant,sirviente
687,completed,completado
315,attribute,atributo
658,communist,comunista
308,attitude,actitud


In [1113]:
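# Keep words that rank higher (smaller rank number) in the IMF corpus than in general English
# and that are not among the 2000 most common general English words
#economics_vocab = ranks[(ranks["imf_rank"] > ranks["general_rank"] + 500) & (ranks["general_rank"] < 10000)]
economics_vocab = ranks[((ranks["imf_rank"] ) < ranks["general_rank"]) & (ranks["general_rank"] > 2000)]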

In [1118]:
s = pd.Series(economics_vocab.index)
no_cognates = list(s[~s.isin(cognate_df["english"])])
have_cognates = list(s[s.isin(cognate_df["english"])])
economics_vocab_no_cognates = economics_vocab.loc[no_cognates]
economics_vocab_with_cognates = economics_vocab.loc[have_cognates]

In [1184]:
cog_df = pd.DataFrame(cognate_df["spanish"])
cog_df.index = cognate_df["english"]

In [1190]:
w_cognate = economics_vocab.merge(cog_df,how="left",left_index=True,right_index=True)
w_cognate = w_cognate.sort_values("imf_rank")
w_cognate.loc[w_cognate['spanish'].isnull(),['spanish']] = "-"

In [1193]:
w_cognate = w_cognate.rename({"spanish":"spanish_cognate"},axis=1)

In [1129]:
pd.options.display.max_rows=100
economics_vocab_with_cognates


Unnamed: 0,imf_rank,imf_freq,general_rank
percent,32,6845,1000000
fiscal,98,2952,3212
export,157,2036,3062
monetary,170,1892,1000000
inflation,181,1823,2830
finance,207,1551,2492
advanced,220,1473,2483
external,240,1362,2536
shock,264,1240,2050
economist,285,1150,2650


In [1130]:
pd.options.display.max_rows=10
economics_vocab_no_cognates[:1000]


Unnamed: 0,imf_rank,imf_freq,general_rank
currency,136,2328,3297
poverty,146,2123,2080
spending,175,1853,2082
inequality,206,1556,1000000
banking,228,1432,3476
...,...,...,...
paul,3146,41,1000000
hurdle,3152,41,1000000
isolation,3154,41,3753
societal,3155,41,1000000


In [1133]:
f = open("economics_vocab.txt","w")
f.write("""#List of 2000 economics related words 
#Ranked by frequency of occurrence in the imf finance and development magazine
#https://www.imf.org/external/pubs/ft/fandd/
#Words with Spanish cognates removed\n""")
for i in economics_vocab_no_cognates[:2000].index:
    f.write((i+"\n"))
f.close()

In [1198]:
df.head()

Unnamed: 0_level_0,1,2,3,4,5
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,cmn,我們試試看！,sysko,\N,2010-03-14 19:46:23
2,cmn,我该去睡觉了。,fucongcong,\N,2010-01-01 15:23:53
3,cmn,你在干什麼啊？,sysko,\N,2010-06-30 11:54:11
4,cmn,這是什麼啊？,Martha,2008-09-08 09:20:59,2011-02-27 12:01:04
5,cmn,今天是６月１８号，也是Muiriel的生日！,Zifre,\N,2011-08-17 19:03:02


In [1197]:
# Load the Tatoeba database into a pandas DataFrame
# Note: file is actually tab separated, although named .csv
# I haven't uploaded this file to github because it is very large, you will need to download it from:
# https://tatoeba.org/eng/downloads
df = pd.read_csv("/Users/pv7409/Spanish/verbs_w_prepositions/sentences_detailed.csv",sep="\t",index_col=0,header=None)


# This produced a FutureWarning. I'm not sure why this warning occurred. Everything has loaded correctly now 
# but the warning means that if numpy or python is upgraded there could be issues 
# (This is unlikely to cause problems in my opinion)

  mask |= (ar1 == a)


In [1223]:
phrases = []
for word in economics_vocab_no_cognates.index[:500]:
    phrases.append(df.loc[df[2].str.contains(word)].head(1))

KeyboardInterrupt: 

In [1232]:
# combine the per-word results into one DataFrame
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
collect = pd.concat(phrases)

In [1236]:
collect

Unnamed: 0_level_0,1,2,3,4,5
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
21981,eng,The exchange rates for foreign currency change daily.,Swift,\N,2010-12-03 19:03:06
1533,eng,Rye was called the grain of poverty.,Zifre,\N,2011-01-23 01:21:24
1959,eng,"My roommate is prodigal when it comes to spending money on movies; he buys them the day they're released, regardless of price.",Nero,\N,2010-12-09 21:55:17
276921,eng,No inequality should be allowed to exist between men and women.,CM,\N,2012-03-28 03:53:32
18429,eng,What are the banking hours?,Zifre,\N,2011-04-18 00:52:08
19393,eng,A huge federal budget deficit has been plaguing the American economy for many years.,CK,\N,2010-10-18 00:17:18
1065,deu,Wir brauchen einen Krankenwagen.,MUIRIEL,\N,2010-10-14 20:51:46
58815,eng,This increase in unemployment is a consequence of the recession.,CK,\N,2013-12-07 10:36:25
54196,eng,Service economy is a useful labor that does not produce a tangible commodity.,\N,\N,\N
20054,eng,"Thanks to the technological innovation, the maximum output of the factory has doubled.",NekoKanjya,\N,2011-12-02 22:14:30
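
The German sentence in the output above most likely appears because `str.contains(word)` matches substrings in any language: 'wage' occurs inside 'Krankenwagen'. A minimal sketch of a stricter lookup, restricted to English and to whole words, is shown below. `first_example` is a hypothetical helper, and the 'minimum wage' sentence is made up; the other two sentences are taken from the outputs above.

```python
import re

# (language code, sentence) rows
rows = [
    ("deu", "Wir brauchen einen Krankenwagen."),
    ("eng", "The minimum wage was raised last year."),  # made-up example
    ("eng", "What are the banking hours?"),
]

def first_example(word, rows, lang="eng"):
    """Return the first sentence in `lang` containing `word` as a whole word."""
    pattern = re.compile(r"\b{}\b".format(re.escape(word)), re.IGNORECASE)
    for language, sentence in rows:
        if language == lang and pattern.search(sentence):
            return sentence
    return None

print(first_example("wage", rows))      # The minimum wage was raised last year.
print(first_example("currency", rows))  # None
```

The whole-word match is a trade-off: it will not find inflected forms such as 'wages', so the pattern would need loosening if those should count.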


In [1408]:
%%time
import numpy as np
my_list = []
anki = pd.DataFrame({"english": [],"spanish": [],"example":[],"translation":[]})
for word in economics_vocab_no_cognates.index[:50]:
#for word in ["macroeconomic"]:
    # start from missing values so that a partially scraped entry can still be stored
    definition = np.nan
    example = np.nan
    example_trans = np.nan
    try:
        url = "https://www.wordreference.com/es/translation.asp?tranword={}".format(word)
        content = requests.get(url).content
        soup = BeautifulSoup(content,'lxml')
        definition = soup.select(".ToWrd")[1].text.split()[0]
        example_tag = soup.select(".FrEx")[0]
        example = example_tag.text
        # look for the translated example two siblings after the example's parent element
        translations = example_tag.parent.nextSibling.nextSibling.select(".ToEx")
        if translations:
            example_trans = translations[0].text
        else:
            example_trans = "no translation found"
    except Exception:
        # the page was not laid out as expected: keep whatever was found before the error
        pass
    tmp_df = pd.DataFrame({"english": [word],"spanish": [definition],"example":[example],"translation":[example_trans]})
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    anki = pd.concat([anki, tmp_df])
        
        

CPU times: user 3.58 s, sys: 234 ms, total: 3.81 s
Wall time: 18.3 s


In [1362]:
example_tag.parent.nextSibling.nextSibling.select(".ToEx")[0]

<td class="ToEx" colspan="2">La globalización significa que debemos competir con trabajadores de todo el mundo.</td>

In [1466]:
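# placeholder column, filled with the cloze text below
anki["cloze"] = ""
anki = anki.dropna(axis=0)
for i in range(len(anki)):
    # find the English word in the example by its stem (last two letters dropped),
    # so that inflected forms such as "commodities" are also matched
    regex = r"{}\S*".format(anki.iloc[i,0][:-2])
    match = re.findall(regex,anki.iloc[i,2].lower())[0]
    # same idea for the Spanish word in the translation (last four letters dropped)
    regex2 = r" {}\S*".format(anki.iloc[i,1][:-4].lower())
    if re.search(regex2,anki.iloc[i,3].lower()):
        match2 = re.findall(regex2,anki.iloc[i,3].lower())[0]
    else:
        match2 = anki.iloc[i,1]
    # wrap the English word in Anki cloze syntax, with the Spanish word as the hint
    anki.iloc[i,4] = re.sub(re.escape(match),"{{{{c1::{}::{}}}}}".format(match,match2),anki.iloc[i,2].lower())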


In [1467]:
anki

Unnamed: 0,english,spanish,example,translation,cloze
0,currency,moneda,I need to get some foreign currency for my holidays.,Tengo que conseguir moneda extranjera para mis vacaciones.,i need to get some foreign {{c1::currency:: moneda}} for my holidays.
0,poverty,pobreza,He has a tiny income and lives in great poverty.,Tiene ingresos muy reducidos y vive en la pobreza.,he has a tiny income and lives in great {{c1::poverty.:: pobreza.}}
0,spending,gastos,Robert buys more than he can afford every month; his spending is out of control.,"Todos los meses, Robert compra más cosas de las que se puede permitir. Sus gastos están fuera de control.",robert buys more than he can afford every month; his {{c1::spending:: gastos}} is out of control.
0,inequality,injusticia,The party's main aim is to eliminate inequality in society.,El objetivo principal del partido es eliminar la injusticia en la sociedad.,the party's main aim is to eliminate {{c1::inequality:: injusticia}} in society.
0,banking,banca,Banking as an industry is struggling right now.,La banca es una industria que en estos momentos tiene dificultades.,{{c1::banking:: banca}} as an industry is struggling right now.
0,deficit,déficit,The deficit in next year's budget is expected to be lower.,Se espera que el déficit en el presupuesto del año que viene sea inferior.,the {{c1::deficit:: déficit}} in next year's budget is expected to be lower.
0,unemployment,desempleo,"Unemployment fell for a third month, encouraging economists.","El desempleo cayó por tercer mes consecutivo, alentando a los economistas.","{{c1::unemployment:: desempleo}} fell for a third month, encouraging economists."
0,commodity,producto,The country is famous for commodities such as textiles and grains.,"El país es famoso por sus productos básicos, tales como textiles y granos.",the country is famous for {{c1::commodities:: productos}} such as textiles and grains.
0,framework,armazón,"After completing the building, they remove the supporting framework.","Cuando terminaron el edificio, quitaron el armazón de soporte.","after completing the building, they remove the supporting {{c1::framework.:: armazón}}"
0,financing,financiamiento,We need to carefully plan some fund-raisers for the financing of the project. An anonymous donor has volunteered to be responsible for the financing of the project.,Necesitamos planificar con detalle cómo recaudar fondos para el financiamiento del proyecto. Un donante anónimo se ha ofrecido como voluntario para el financiamiento del proyecto.,we need to carefully plan some fund-raisers for the {{c1::financing:: financiamiento}} of the project. an anonymous donor has volunteered to be responsible for the {{c1::financing:: financiamiento}} of the project.
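
In the output above, some cloze deletions swallow the trailing punctuation (for example `poverty.`), because the pattern uses `\S*`. A minimal sketch of a variant that stops at the end of the word is shown below. `make_cloze` is a hypothetical helper, not part of the notebook; like the notebook, it matches on the word with its last two letters dropped so that inflected forms are found.

```python
import re

def make_cloze(sentence, word, hint):
    """Wrap every word starting with the stem of `word` in an Anki cloze deletion."""
    stem = word[:-2]  # same stemming shortcut as the notebook
    # \w* extends the match to the end of the word but stops before punctuation
    pattern = re.compile(r"\b{}\w*".format(re.escape(stem)), re.IGNORECASE)
    # a function is used as the replacement so that the hint is inserted literally
    return pattern.sub(lambda m: "{{c1::" + m.group(0) + "::" + hint + "}}", sentence)

print(make_cloze("He has a tiny income and lives in great poverty.", "poverty", "pobreza"))
# He has a tiny income and lives in great {{c1::poverty::pobreza}}.
```

Dropping two letters is a crude stem, so a short word could match unrelated words that share its first letters. The generated cards are worth a quick manual check.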


In [1468]:
anki[["cloze","translation"]].to_csv("to_anki.csv",sep="\t",header=None,index=False)

In [1443]:
regex = r"{}\S*".format(anki.iloc[i,0][:-2])
re.findall(match,anki.iloc[i,2].lower())[0]

'banking'

In [1439]:
anki.iloc[i,2]


'Banking as an industry is struggling right now.'

In [1359]:
economics_vocab_no_cognates.index[:20]

Index(['currency', 'poverty', 'spending', 'inequality', 'banking',
       'macroeconomic', 'deficit', 'wage', 'unemployment', 'commodity',
       'output', 'framework', 'financing', 'euro', 'policymakers',
       'strengthen', 'governance', 'regulatory', 'employment',
       'globalization'],
      dtype='object')

In [1360]:
anki

Unnamed: 0,english,spanish,example,translation
0,currency,moneda,I need to get some foreign currency for my holidays.,Tengo que conseguir moneda extranjera para mis vacaciones.
0,poverty,pobreza,He has a tiny income and lives in great poverty.,Tiene ingresos muy reducidos y vive en la pobreza.
0,spending,gastos,Robert buys more than he can afford every month; his spending is out of control.,"Todos los meses, Robert compra más cosas de las que se puede permitir. Sus gastos están fuera de control."
0,inequality,injusticia,The party's main aim is to eliminate inequality in society.,El objetivo principal del partido es eliminar la injusticia en la sociedad.
0,banking,banca,Banking as an industry is struggling right now.,La banca es una industria que en estos momentos tiene dificultades.
0,deficit,déficit,The deficit in next year's budget is expected to be lower.,Se espera que el déficit en el presupuesto del año que viene sea inferior.
0,unemployment,desempleo,"Unemployment fell for a third month, encouraging economists.","El desempleo cayó por tercer mes consecutivo, alentando a los economistas."
0,commodity,producto,The country is famous for commodities such as textiles and grains.,"El país es famoso por sus productos básicos, tales como textiles y granos."
0,framework,armazón,"After completing the building, they remove the supporting framework.","Cuando terminaron el edificio, quitaron el armazón de soporte."
0,financing,financiamiento,We need to carefully plan some fund-raisers for the financing of the project. An anonymous donor has volunteered to be responsible for the financing of the project.,Necesitamos planificar con detalle cómo recaudar fondos para el financiamiento del proyecto. Un donante anónimo se ha ofrecido como voluntario para el financiamiento del proyecto.


'gastos'

In [1270]:
my_list = []
for word in economics_vocab_no_cognates.index[:10]:
    try:
        url = "https://www.linguee.com/english-spanish/search?source=auto&query={}".format(word)
        content = requests.get(url).content
        soup = BeautifulSoup(content,'lxml')
        example1 = soup.select(".tag_s")[0].text
        example2 = soup.select(".tag_t")[0].text
        translation = soup.select(".dictLink.featured")[0].text
        my_list.append((word,example1,example2,translation))
    except:
        continue

In [1544]:
for i, word in enumerate(economics_vocab_no_cognates.index[:50]):
    try:
        url = "https://www.linguee.com/english-spanish/search?source=auto&query={}".format(word)
        content = requests.get(url).content
        soup = BeautifulSoup(content,'lxml')
        #soup.find_all(True, {'class':['sentence', 'left']},text="imf.org")
        print(i,word)
        print(soup.find_all(text="imf.org")[0].parent.parent.parent.parent.select(".sentence.left")[0].text)
        print(soup.find_all(text="imf.org")[0].parent.parent.parent.parent.select(".sentence.right2")[0].text)
    except:
        continue

0 currency
1 poverty
2 spending
3 inequality
4 banking
5 macroeconomic
6 deficit
7 wage
8 unemployment
9 commodity
10 output
11 framework
12 financing
13 euro
14 policymakers
15 strengthen
16 governance
17 regulatory
18 employment
19 globalization
20 liberalization
21 enterprise
22 moreover
23 inflow
24 liquidity
25 expenditure
26 lending
27 equity
28 further
29 donor
30 boost
31 indicator
32 tariff
33 enhance
34 trading
35 instance
36 wealth
37 volatility
38 boom
39 creditor
40 remittance
41 imbalance
42 transparency
43 constraint
44 arrangement
45 surplus
46 sustainable
47 scheme
48 borrow
49 liability


In [1497]:
soup


In [1563]:
url="https://www.imf.org/es/news/articles/2015/09/28/04/53/sonew101015a"
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
spanish_t = soup.body.text

In [1564]:
url="https://www.imf.org/en/news/articles/2015/09/28/04/53/sonew101015a"
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
english_t = soup.body.text

In [1571]:
print(english_sentences)

['\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEnglish\n\nعربي\n中文\nFrançais\n日本語 \nРусский\nEspañol\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\n\n\n\r\nAbout the IMF                                            \n\n\nResearch\n\n\nGlobal Analysis\nResearchers at the IMF\nData Visualization\n\n\nStaff Discussion Notes\nLatest Working Papers\nResearch Bulletin\n\n\nEconomic Review\nIMFBlog\nCommodity Prices\n\n\n\n\n\r\nCountries                                            \n\n\nIMF reports and publications by country\nA\nB\nC\nD\nE\nF\nG\nH\nI\nJ\nK\nL\nM\nN\nO\nP\nQ\nR\nS\nT\nU\nV\nY\nZ\n\n\n\n\n\r\nCapacity Development                                            \n\n\nAbout Us\n\n\nWhat We Do\nPublic Finances\nMonetary and Financial Systems\nLegislative Frameworks\nStatistics\nMacroeconomic Frameworks\n\n\nHow We Work\nRegional Capacity Development Centers\nOnline Training \n\n\nOur Partners\nFunds for Capacity Development\n\n\nCountry Examples\n\n\n\n\n\r\nNews      

In [1572]:
import nltk
english_sentences = nltk.sent_tokenize(english_t)
spanish_sentences = nltk.sent_tokenize(spanish_t)


In [1585]:
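# zip stops at the shorter list. Indexing both lists with range(len(english_sentences))
# raised the IndexError recorded below, because the Spanish page split into fewer sentences.
for eng, spa in zip(english_sentences, spanish_sentences):
    print("\n----\n", eng, "\n\n", spa)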



IndexError: list index out of range
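
The output above shows that `soup.body.text` includes the site's navigation menu and long runs of blank lines, which then end up inside the "sentences". Below is a minimal sketch of one way to clean the page text before calling `nltk.sent_tokenize`, assuming that menu entries are short lines and article sentences are longer. The helper name `clean_page_text` and the `min_words` threshold are hypothetical and not part of the original notebook.

```python
import re


def clean_page_text(text, min_words=6):
    """Collapse whitespace and keep only lines that look like running prose.

    Assumption: navigation entries ("Home", "Research", "A", "B", ...) are
    short, so lines with fewer than `min_words` words are dropped.
    """
    kept = []
    for line in text.splitlines():
        # Collapse repeated whitespace (including non-breaking spaces).
        line = re.sub(r"\s+", " ", line).strip()
        if len(line.split()) >= min_words:
            kept.append(line)
    return "\n".join(kept)


sample = (
    "\n\n\nHome\n\n\r\nAbout the IMF      \n\nResearch\n\n"
    "The economy has continued to benefit from rising natural gas export volumes.\n"
)
print(clean_page_text(sample))
```

A word-count threshold is crude and would also drop short headings, so it may need tuning per page.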

In [1528]:
# Gale-Church sentence alignment
url="https://www.imf.org/es/News/Articles/2015/09/28/04/53/pn0780"
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
spanish_t = soup.body.text

url="https://www.imf.org/en/News/Articles/2015/09/28/04/53/pn0780"
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
english_t = soup.body.text
english_sentences = nltk.sent_tokenize(english_t)
spanish_sentences = nltk.sent_tokenize(spanish_t)


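# In recent NLTK versions Gale-Church lives in nltk.translate (there is no nltk.align module)
from nltk.translate import gale_church

# align_blocks expects lists of sentence lengths, not the sentences themselves
#aligned = gale_church.align_blocks([len(s) for s in english_sentences], [len(s) for s in spanish_sentences])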

In [1531]:
nltk.translate.gale_church.align_blocks

[nltk_data] Error loading Align: Package 'Align' not found in index


False

In [1532]:
from bleualign.align import Aligner

In [1534]:
Aligner(options=(english_t,spanish_t))

ValueError: dictionary update sequence element #0 has length 19007; 2 is required

In [1536]:
# Write one sentence per line for the external aligners
with open("english.txt", "w") as e:
    for line in english_sentences:
        e.write(line + "\n")
with open("spanish.txt", "w") as s:
    for line in spanish_sentences:
        s.write(line + "\n")

In [1539]:
from gale_church import *

SyntaxError: Missing parentheses in call to 'print'. Did you mean print(src[1])? (gale_church.py, line 120)

In [1540]:
import glate_church3

ModuleNotFoundError: No module named 'minimath'

In [1541]:
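from urllib.parse import quote

sentence = "hello my name is Patrick"
# Percent-encode the sentence for use in the URL
url = "https://translate.google.com/#en/es/{}".format(quote(sentence))
# Everything after '#' is a URL fragment and is never sent to the server,
# so this request returns the generic page rather than the translation
content = requests.get(url).content
soup = BeautifulSoup(content, 'lxml')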


In [1543]:
url

'https://translate.google.com/#en/es/hello%20my%20name%20is%20Patrick'

In [1548]:
align_texts(english_sentences,spanish_sentences)

ValueError: Source and target texts do not have the same number of blocks.

ValueError: Source and target texts do not have the same number of blocks.
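
As far as the NLTK documentation describes it, `align_texts` expects each text as a list of blocks, where every block is a list of sentence lengths in characters, and both texts must contain the same number of blocks. Passing the two sentence lists directly makes every sentence count as a block, which would explain the error above. Below is a minimal sketch of preparing the input, treating each article as a single block. `to_length_blocks` is a hypothetical helper and the sample sentences are made up for illustration.

```python
def to_length_blocks(sentences):
    """Wrap one text as a single block of sentence lengths (in characters)."""
    return [[len(sentence) for sentence in sentences]]


# Made-up sample sentences, not taken from the IMF article.
english_sample = ["The economy grew.", "Inflation fell sharply last year."]
spanish_sample = ["La economía creció.", "La inflación cayó con fuerza el año pasado."]

source_blocks = to_length_blocks(english_sample)
target_blocks = to_length_blocks(spanish_sample)

# Both texts now contain exactly one block, which is the precondition
# that the ValueError above complains about.
assert len(source_blocks) == len(target_blocks) == 1
print(source_blocks, target_blocks)
```

With NLTK installed, these blocks could then be passed to `nltk.translate.gale_church.align_texts(source_blocks, target_blocks)`, which returns one list of index pairs per block.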

In [1561]:
".".join(english_sentences[4:10])

'With the consent of the country (or\xa0countries) concerned, PINs are issued after Executive Board discussions of Article IV consultations with member countries, of its surveillance of developments at the regional level, of post-program monitoring, and of ex post assessments of member countries with longer-term program engagements..PINs are also issued after Executive Board discussions of general policy matters, unless otherwise decided by the Executive Board in a particular case..Public Information Notice (PIN) No..07/80\nJuly 17, 2007\n\nEspañol\n\n\n\nOn July 13, 2007, the Executive Board of the International Monetary Fund (IMF) concluded the Article IV consultation with Bolivia.1\nBackground\nThe economy has continued to benefit from rising natural gas export volumes, which has been compounded, over the last two years, by high hydrocarbons and mining export prices..Real GDP growth has been around 4-4½\xa0percent, mainly driven by an expansion in hydrocarbons and mining-related act

Aligners to try:
- forum discussion: https://groups.google.com/forum/#!topic/nltk-dev/nuVHpzqldXA
- hunalign
- GIZA++

In [1617]:
found_links = []

# Walk the 97 result pages of the Spanish news search and collect article links
for i in range(97):
    url = "https://www.imf.org/es/News/Search?datefrom=1994-01-01&dateto=2018-11-02&page={}".format(i)
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'lxml')
    for link in soup.find_all("a", href=True):
        href = link['href']
        # Keep each article URL once, in order of first appearance
        if re.search("es/news/articles", href.lower()) and href not in found_links:
            found_links.append(href)

    

In [1657]:
found_links

['https://www.imf.org/es/News/Articles/2018/10/31/pr18402-imf-staff-concludes-visit-to-nicaragua',
 'https://www.imf.org/es/News/Articles/2018/10/26/pr18395-argentina-imf-executive-board-completes-first-review-under-argentina-stand-arrangement',
 'https://www.imf.org/es/News/Articles/2018/10/17/NA101718-Recovery-in-Latin-America-and-the-Caribbean-Has-Lost-Momentum',
 'https://www.imf.org/es/News/Articles/2018/10/10/October-2018-imfc-attendance-list',
 'https://www.imf.org/es/News/Articles/2018/10/10/communique-of-the-thirty-eighth-meeting-of-the-international-monetary-and-financial-committee',
 'https://www.imf.org/es/News/Articles/2018/10/09/NA101118-external-risks-threaten-sub-saharan-africas-steady-recovery',
 'https://www.imf.org/es/News/Articles/2018/10/11/sp101218-new-economic-landscape-new-multilateralism',
 'https://www.imf.org/es/News/Articles/2018/09/27/sp100118-steer-dont-drift',
 'https://www.imf.org/es/News/Articles/2018/09/17/sp09172018-the-case-for-the-sustainable-develo

In [1659]:
url="https://www.imf.org/es/News/Articles/2017/10/13/NA101317-Latin-Americas-Recovery-on-Track-but-Long-Term-Growth-Weak"
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')

spanish_t = soup.article.text

url="https://www.imf.org/en/News/Articles/2017/10/13/NA101317-Latin-Americas-Recovery-on-Track-but-Long-Term-Growth-Weak"
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')

english_t = soup.article.text
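
The English and Spanish versions of the article above differ only in the language segment of the URL path. Below is a small sketch of deriving one from the other, assuming that every article URL follows this `/es/News/...` and `/en/News/...` pattern. `swap_language` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_language(url, target="en"):
    """Return `url` with its first path segment replaced by `target`.

    Assumption: the first path segment is a two-letter language code, as in
    https://www.imf.org/es/News/Articles/...
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # segments[0] is the empty string before the leading slash.
    if len(segments) < 2 or len(segments[1]) != 2:
        raise ValueError("no two-letter language segment in {!r}".format(url))
    segments[1] = target
    return urlunsplit(parts._replace(path="/".join(segments)))


spanish_url = (
    "https://www.imf.org/es/News/Articles/2017/10/13/"
    "NA101317-Latin-Americas-Recovery-on-Track-but-Long-Term-Growth-Weak"
)
print(swap_language(spanish_url))
```

Applied to each entry of `found_links`, this would give the English counterpart to fetch, although not every Spanish article necessarily has an English page at the derived address.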

In [1727]:
import os

os.makedirs("pdfs", exist_ok=True)
# Download the quarterly English and Spanish issues for 2009-2017
for year in range(2009, 2018):
    for month in ["03", "06", "09", "12"]:
        short_year = str(year)[-2:]
        cmd = """curl "https://www.imf.org/external/pubs/ft/fandd/{}/{}/pdf/fd{}{}.pdf" \
                --output ./pdfs/eng-{}-{}.pdf""".format(year, month, month, short_year, year, month)
        os.system(cmd)
        cmd = """curl "https://www.imf.org/external/pubs/ft/fandd/spa/{}/{}/pdf/fd{}{}s.pdf" \
                --output ./pdfs/spa-{}-{}.pdf""".format(year, month, month, short_year, year, month)
        os.system(cmd)

In [1730]:
os.makedirs("tmp", exist_ok=True)
for year in range(2009, 2010):
    for month in ["03", "06", "09", "12"]:
        # Convert both PDFs to plain text
        cmd = """xpdf-tools-mac-4.00/bin64/pdftotext \
        ./pdfs/eng-{}-{}.pdf  tmp/eng-{}-{}.txt""".format(year, month, year, month)
        os.system(cmd)
        cmd = """xpdf-tools-mac-4.00/bin64/pdftotext \
        ./pdfs/spa-{}-{}.pdf  tmp/spa-{}-{}.txt""".format(year, month, year, month)
        os.system(cmd)
        # Read each language from its own text file
        e = open("tmp/eng-{}-{}.txt".format(year, month), "r", encoding="Latin-1").read()
        s = open("tmp/spa-{}-{}.txt".format(year, month), "r", encoding="Latin-1").read()
        english_sentences = nltk.sent_tokenize(e)
        spanish_sentences = nltk.sent_tokenize(s)
        # Write one sentence per line, one file per language
        with open("tmp/sent-eng-{}-{}.txt".format(year, month), "w") as eo:
            for line in english_sentences:
                eo.write(line + "\n")
        with open("tmp/sent-spa-{}-{}.txt".format(year, month), "w") as so:
            for line in spanish_sentences:
                so.write(line + "\n")


xpdf-tools-mac-4.00/bin64/pdftotext         ./pdfs/eng-2009-03.pdf  tmp/eng-2009-03.txt
xpdf-tools-mac-4.00/bin64/pdftotext         ./pdfs/eng-2009-06.pdf  tmp/eng-2009-06.txt
xpdf-tools-mac-4.00/bin64/pdftotext         ./pdfs/eng-2009-09.pdf  tmp/eng-2009-09.txt
xpdf-tools-mac-4.00/bin64/pdftotext         ./pdfs/eng-2009-12.pdf  tmp/eng-2009-12.txt


In [None]:
with open("eng.txt", "r", encoding="Latin-1") as f:
    e = f.read()
with open("spa.txt", "r", encoding="Latin-1") as f:
    s = f.read()
english_sentences = nltk.sent_tokenize(e)
spanish_sentences = nltk.sent_tokenize(s)
# Write one sentence per line as input for hunalign
with open("eng_sen.txt", "w") as eo:
    for line in english_sentences:
        eo.write(line + "\n")
with open("spa_sen.txt", "w") as so:
    for line in spanish_sentences:
        so.write(line + "\n")

Using hunalign:
D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy (2005). Parallel corpora for medium density languages. In Proceedings of RANLP 2005.

In [1706]:
import os
command = """hunalign/hunalign-1.2/src/hunalign/hunalign \
hunalign/hunalign-1.2/data/null.dic \
eng_sen.txt spa_sen.txt -text -bisent > eng_span.txt"""
os.system(command)

0

In [1714]:
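# Each line of the hunalign output holds an English and a Spanish sentence separated by a tab
with open("eng_span.txt", "r") as f:
    for line in f:
        both = line.split("\t")
        if len(both) < 2:
            continue  # skip lines without an aligned pair
        eng, spa = both[0], both[1]
        # Only show pairs whose English side has more than 7 words
        if len(eng.split()) > 7:
            print("\n------\n", eng, "\n", spa)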
    
    


------
 For Southeast Asia, the next couple of decades could prove exhilarating but also 12 tumultuous. 
 Para el sudeste asiático, las próximas dos décadas podrían deparar grandes promesas, pero también podrían ser 12 tumultuosas.

------
 18 A Hidden Scourge Human trafficking is a crime that usually goes unreported Mely Caballero-Anthony 
 Vietnam se destaca en el proceso de aumento de la participación de las mujeres en la fuerza laboral en Asia

------
 Subscribe at www.imfbookstore.org/f&d Read at www.imf.org/fandd Connect at facebook.com/FinanceandDevelopment 
 Suscríbase en www.imfbookstore.org/f&d Lea la edición digital www.imf.org/fandd Conéctese en facebook.com/FinanceandDevelopment

------
 FINANCE & DEVELOPMENT A Quarterly Publication of the International Monetary Fund September 2018 | Volume 55 | Number 3 
 FINANZAS & DESARROLLO Publicación trimestral del Fondo Monetario Internacional Septiembre de 2018 | Volumen 55 | Número 3

------
 Globalists: The End of Empire and the

 "I want to prove to people that I can do it--maybe even better than some guy." 
 "Quiero demostrar que puedo lograrlo, y hacerlo quizá mejor que algún chico".

------
 The last straw Pocholo Espina, 22, always thought he would grow up to be a doctor or lawyer. 
 El último sorbete Pocholo Espina, 22, pensaba que sería médico o abogado.

------
 Instead, the young Manila resident is the founder and CEO of Sip PH, a company that makes and distributes stainless steel straws. 
 Sin embargo, el joven residente en Manila es fundador y Director General de Sip PH, una empresa que fabrica y distribuye sorbetes de acero inoxidable.

------
 It all started when Espina was a student at Ateneo de Manila University. 
 Todo comenzó cuando Espina estudiaba en la Universidad Ateneo de Manila.

------
 He got interested in the zerowaste movement, which promotes a lifestyle that minimizes the amount of waste sent to landfills by encouraging the reuse of products. 
 Se interesó en el movimiento basura cer

 Enhancing economic competitiveness and strengthening governance and social institutions are already considered essential to the inclusive growth agenda. 
 Mejorar la competitividad económica y fortalecer la gobernanza y las instituciones sociales ya se consideran esenciales para la agenda de crecimiento inclusivo.

------
 But the remittance trap lends urgency to these goals. 
 Pero la trampa de las remesas hace que estas metas sean más urgentes.

------
 Avoiding this potentially serious pitfall of remittances may actually be the key to unlocking their development potential by removing a previously unrecognized obstacle to inclusive development. 
 En realidad, sortear las trampa de las remesas, y sus serios efectos potenciales, podría ser la clave, no identificada hasta ahora, para liberar el potencial de desarrollo al eliminar un obstáculo para el desarrollo inclusivo.

------
 RALPH CHAMI is an assistant director in the IMF's Institute for Capacity Development, EKKEHARD ERNST is ch
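
The pairs printed above include front-matter lines and at least one clear misalignment (the "Hidden Scourge" entry). Below is a sketch of a stricter filter, assuming each line of the hunalign text output holds the source sentence, the target sentence and, optionally, a numeric confidence score separated by tabs. `read_bisentences` and its thresholds are hypothetical, and the scores in the sample are invented for illustration.

```python
def read_bisentences(lines, min_words=8, min_score=None):
    """Yield (english, spanish, score) tuples from tab-separated lines.

    Assumption: columns are source sentence, target sentence and an
    optional numeric score; lines with fewer than two columns are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not an aligned pair
        eng, spa = fields[0].strip(), fields[1].strip()
        score = None
        if len(fields) > 2:
            try:
                score = float(fields[2])
            except ValueError:
                pass  # leave score as None if the column is not numeric
        if len(eng.split()) < min_words or not spa:
            continue
        if min_score is not None and (score is None or score < min_score):
            continue
        yield eng, spa, score


sample = [
    "But the remittance trap lends urgency to these goals.\t"
    "Pero la trampa de las remesas hace que estas metas sean más urgentes.\t0.9\n",
    "Too short.\tDemasiado corto.\t1.2\n",
    "a malformed line without tabs\n",
]
for pair in read_bisentences(sample):
    print(pair)
```

Raising `min_score` trades coverage for cleaner pairs. A suitable value would have to be found by inspecting the actual output.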

Glossaries and dictionaries of economics:

- The Economist: https://www.economist.com/economics-a-to-z
- https://core-econ.org/the-economy/book/text/50-02-glossary.html
- https://econclassroom.com/glossary/macroeconomics/
- http://www-personal.umich.edu/~alandear/glossary (very thorough)

In [1686]:
all_links[-20:]

['https://www.imf.org/external/pubs/ft/fandd/2018/09',
 'https://www.imf.org/external/pubs/ft/fandd/2018/09/index.htm',
 'https://www.imf.org/external/pubs/ft/fandd/2018/09/future-of-southeast-asia-bhaskaran.htm',
 'https://www.imf.org/external/pubs/ft/fandd/2018/09/what-are-subsidies-basics.htm',
 'https://www.imf.org/external/pubs/ft/fandd/2018/09/stanford-economist-raj-chetty-profile-people.htm',
 'https://www.imf.org/external/pubs/ft/fandd/2018/09/southeast-asia-progress-and-reform-rhee.htm',
 'https://www.imf.org/external/pubs/ft/fandd/2018/09/female-labor-force-participation-in-vietnam-banerji.htm',
 'https://www.imf.org/external/pubs/ft/fandd/2018/09/human-trafficking-in-southeast-asia-caballero.htm',
 'https://www.imf.org/external/pubs/ft/fandd/2018/09/southeast-asia-climate-change-and-greenhouse-gas-emissions-prakash.htm',
 'https://www.imf.org/external/pubs/ft/fandd/2018/09/southeast-asian-youth-on-the-future-overman.htm',
 'https://www.imf.org/external/pubs/ft/fandd/2018/09/

In [None]:
https://www.imf.org/external/pubs/ft/fandd/spa/2018/09/pdf/fd0918s.pdf
https://www.imf.org/external/pubs/ft/fandd/2018/09/pdf/fd0918.pdf
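
The two URLs above follow a fixed pattern: year and month in the path, then `fd` plus the month and two-digit year, with a `spa/` path segment and an `s` suffix for the Spanish edition. Below is a small sketch of that pattern, assuming it holds for every quarterly issue. `issue_pdf_urls` is a hypothetical helper.

```python
BASE = "https://www.imf.org/external/pubs/ft/fandd"


def issue_pdf_urls(year, month):
    """Return (english_url, spanish_url) for one Finance & Development issue."""
    # File stem is "fd" + two-digit month + two-digit year, e.g. fd0918.
    stem = "fd{:02d}{:02d}".format(month, year % 100)
    english = "{}/{}/{:02d}/pdf/{}.pdf".format(BASE, year, month, stem)
    spanish = "{}/spa/{}/{:02d}/pdf/{}s.pdf".format(BASE, year, month, stem)
    return english, spanish


print(issue_pdf_urls(2018, 9))
```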

In [1653]:
# For each of the top 30 words, keep the shortest corpus sentence
# longer than 9 words that contains the word
mydict = {}
for i in economics_vocab_no_cognates.index[:30]:
    results = LS[LS.str.contains(i)]
    results = sorted(results, key=lambda x: len(x.split()))
    results = pd.Series(results)
    results = results[results.apply(lambda x: len(x.split()) > 9)]
    mydict[i] = results.iloc[0]
        

In [1655]:
len(mydict)
for i in mydict.items():
    print(i)

('currency', 'Other indicators of international currency use tell a similar story.')
('poverty', '1Defined as proportion of families living below the poverty line.')
('spending', 'Some governments allowed arrears to accumulate on their spending commitments.')
('inequality', 'In extreme cases of income inequality, outcomes are clearly critical.')
('banking', 'Moreover, banking controls stifled the development of prudential supervisory systems.')
('macroeconomic', 'Mauritius successfully overcame its macroeconomic imbalances in the early 1980s.')
('deficit', 'Government securities issued to cover the deficit were short term.')
('wage', 'Economic reforms have freed wages and allowed incomes to diverge.')
('unemployment', 'Industrial output shrank by 40 percent, leading to mass unemployment.')
('commodity', 'Do exchange rate regimes influence the behavior of commodity currencies?')
('output', 'Industrial output shrank by 40 percent, leading to mass unemployment.')
('framework', "This will 
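
The substring match used above also returns sentences where the word is only part of a longer one, and extraction debris such as the footnote marker in '1Defined' can slip through. Below is a sketch of a stricter selection, assuming whole-word matching (allowing a plural `s`) and skipping sentences that start with a digit. `pick_example` and its thresholds are hypothetical choices, not the notebook's method.

```python
import re


def pick_example(word, sentences, min_words=10):
    """Return the shortest sentence of at least `min_words` words that
    contains `word` as a whole word, or None if there is none."""
    pattern = re.compile(r"\b{}s?\b".format(re.escape(word)), re.IGNORECASE)
    candidates = [
        s for s in sentences
        if pattern.search(s)
        and len(s.split()) >= min_words
        and not s.lstrip()[0].isdigit()  # skip leftover footnote markers
    ]
    return min(candidates, key=lambda s: len(s.split()), default=None)


sentences = [
    "Economic reforms have freed wages and allowed incomes to diverge.",
    "1Defined as proportion of families living below the poverty line.",
    "Industrial output shrank by 40 percent, leading to mass unemployment.",
]
print(pick_example("wage", sentences))
print(pick_example("poverty", sentences))
```

Returning `None` instead of raising makes it easy to see which words still lack a usable example sentence.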

In [1622]:
LS = pd.Series(long_sentences)

In [1662]:
url="https://www.collinsdictionary.com/dictionary/english-spanish/currency"
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')

print(soup.select(".content.definitions.dictionary.biling"))

[<div class="content definitions dictionary biling"><div class="hom"><span class="gramGrp"><span class="span hi rend-sc"><span class="pos">noun</span></span></span><div class="sense"><span class="span sensenum bluebold">1. </span> <span class="lbl type-syn"><span class="span punctuation">(</span><span class="span punctuation">= </span>monetary system, money<span class="span punctuation">)</span></span> <span class="cit type-translation"><span class="quote">moneda <span class="lbl type-pos">f</span></span></span><div class="cit type-example"><span class="quote">foreign currency</span> <span class="cit type-translation"><span class="quote">moneda <span class="lbl type-pos">f</span> extranjera</span></span><span class="span bluebold"> ⧫ </span><span class="cit type-translation"><span class="quote">divisas <span class="lbl type-pos">fpl</span></span></span></div><span class="xr"> <span class="lbl"><span class="span italics">see also</span></span> <a class="ref" href="https://www.collinsdic