## A frequency ranked list of economics vocabulary

### Aim of project:
To help a friend improve her economics-specific English vocabulary efficiently.

### End result of project:
1. A <a href="https://github.com/pvonglehn/economics_vocab/blob/master/economics_vocab.txt">list</a> of the most common economics-specific English words that don't have Spanish cognates. That is, words that are common in economics texts but uncommon in general texts, and whose meaning a Spanish speaker cannot easily guess.
<br><br>
2. An Anki flashcard <a href="https://github.com/pvonglehn/economics_vocab/blob/master/Blanca.apkg">deck</a> for studying these words with example sentences.
 


### Problem:
A friend is preparing for an English exam as part of her studies to become an economist for the civil service in Spain. Part of the exam will be based on an article from <a href="https://www.economist.com/">The Economist</a>, the International Monetary Fund (IMF) <a href="https://www.imf.org/external/pubs/ft/fandd/">Finance and Development magazine</a>, or a similar publication. She finds that she lacks much of the economics-specific vocabulary needed to understand these articles.

The classic strategy for improving vocabulary is to read a lot and look up words that you don't know. However, this approach is very inefficient. 

Let's consider an example.
The student reads the sentence:<br>
'Higher <strong>wages</strong> in China make <strong>offshoring</strong> less attractive.'

The student may not know the meaning of 'wages' or 'offshoring', so their instinct might be to look up and learn both. What the student doesn't know, however, is that while the word 'wage' is very common in financial texts and definitely worth learning, the word 'offshoring' is much less common. It is not worth the effort of learning it as long as there are many more useful words to learn first.

The student should learn the most relevant words first, which is what this project aims to help with.

### Project results:

Below is a table showing the first 10 words in the final list. The economics_rank indicates each word's rank based on its frequency of occurrence in the IMF magazine and The Economist. This corpus was compiled during this project and consists of over four million tokens (words) from over 4000 articles. The general_rank indicates the word's position on a list of 5000 English words ordered by their frequency of occurrence in a large corpus constructed from a wide range of English texts. This list was downloaded from https://www.wordfrequency.info/

This list only contains words which are more common in the economics corpus than in general English, so 'stop words' like 'the','and' etc. don't appear. We can see that the top words in our list occur much more frequently in the economics corpus than in general English. This indicates that these words are very important to know in order to understand these texts, but students may not know them because they are relatively uncommon in general texts that they will mostly have been exposed to in their English studies. 

<br><em>A general_rank of 1000000 means that the word is not on the list from the general corpus.</em>


In [435]:
economics_vocab_no_cognates.head(10)

| word | economics_rank | economics_freq | general_rank |
|---|---|---|---|
| inbox | 174 | 2781 | 1000000 |
| upgrade | 214 | 2410 | 1000000 |
| debt | 240 | 2163 | 1684 |
| investor | 311 | 1761 | 1536 |
| asset | 315 | 1714 | 1869 |
| spending | 329 | 1640 | 2082 |
| currency | 348 | 1487 | 3297 |
| decline | 349 | 1470 | 1790 |
| revenue | 421 | 1212 | 1514 |
| wage | 450 | 1131 | 2300 |


### Removing words with Spanish cognates

Below is the table including words with Spanish cognates. The meanings of 'percent', 'minister', 'sector' etc could all be easily guessed by a Spanish speaker. Of the top 1000 words on this list, 40% of the words have Spanish cognates, so it is really worth excluding them from our list to save the student's time and effort. 

In [438]:
w_cognate.head(10)

| word | economics_rank | economics_freq | general_rank | spanish_cognate |
|---|---|---|---|---|
| percent | 157 | 2976 | 1000000 | por ciento |
| inbox | 174 | 2781 | 1000000 | - |
| upgrade | 214 | 2410 | 1000000 | - |
| minister | 224 | 2294 | 1711 | ministro |
| debt | 240 | 2163 | 1684 | - |
| accord | 253 | 2052 | 1000000 | acuerdo |
| sector | 298 | 1824 | 1767 | sector |
| investor | 311 | 1761 | 1536 | - |
| asset | 315 | 1714 | 1869 | - |
| spending | 329 | 1640 | 2082 | - |


### Anki Flashcards

Anki is a flashcard app that uses active recall testing and spaced repetition to help you learn almost anything very efficiently. The flash cards in this project contain three components: 1) A sentence in English with the word to be learned/tested highlighted in blue. 2) The translation of this word and of the full sentence in Spanish. 3) A dictionary entry with translations into Spanish.
<br>
<br>
<img src="./images/Anki1.png">
<br>
Above you can see the front of the flashcard, which asks the user to give the meaning of the word in blue. The user then has to guess the meaning (say it aloud or in your head; you don't type anything into Anki) and then press "show answer" to check whether they are correct.
<br>
<br>
<br>
<br>
<br>
<br>
<img src="./images/Anki2.png">
<br>
When the "show answer" button is pressed, the Spanish translation of the single word and of the full sentence is given. If the user wants to, they can click "show definition" to get a full dictionary definition of the word.
<br>
<br>
<br>
<br>
<br>
<br>
<img src="./images/Anki3.png">
<br>
The full dictionary definition is given. The dictionary information was scraped from the site http://www.spanishdict.com






## Procedural overview

1. Download the IMF articles in Spanish and English
2. Split the text into sentences and align Spanish and English sentence pairs
3. Add sentences from The Economist magazine to the corpus
4. Tokenize the sentences (split into words)
5. Lemmatize the words (put into dictionary form e.g. running -> run)
6. Rank all words by frequency
7. Get a separate frequency ranked list of English words from a general corpus of English texts
8. Get a list of Spanish-English cognates (e.g. la economía <-> the economy)
9. Create our list of most common economics specific words without Spanish cognates 
10. Get dictionary entries with examples for each word by web scraping
11. Extract examples from dictionary entries
12. Make the Anki flash cards


## Tools used
Beautiful Soup for web scraping<br>
NLTK for natural language processing<br>
pandas for data manipulation<br> 



## To do/ improvements to be made/ features to add

Include examples from the sentences aligned parallel corpus in the Anki cards


Resolve issues caused by question marks in regular expressions when making the Anki deck
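One possible way to approach the question mark issue is sketched below. It assumes the problem arises where fragments of a sentence are reused as regular expression patterns when closing up the space before punctuation. Escaping each fragment with `re.escape` makes characters such as `?` match literally. The helper name `close_up_punctuation` is hypothetical and is not part of the notebook.

```python
import re

def close_up_punctuation(text):
    """Remove the space before punctuation, e.g. 'scheme ?' -> 'scheme?'."""
    for fragment in re.findall(r"\w\w\w \W", text):
        # re.escape makes '?', '.', '(' etc. match literally instead of
        # being interpreted as regular expression operators
        pattern = re.escape(fragment)
        replacement = fragment.replace(" ", "")
        # a function as replacement avoids backslash processing in the result
        text = re.sub(pattern, lambda m: replacement, text)
    return text

print(close_up_punctuation("Is this a scheme ? Yes , it is"))
# Is this a scheme? Yes, it is
```

Plain `str.replace` would also avoid the problem here, since no pattern matching is needed once the fragment is known.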


## Getting the text corpus
### Get a sentence-aligned parallel corpus of IMF magazine articles
Fortunately, the IMF published much of its material in multiple languages. Having downloaded all of the articles from the beginning of 2009 to the end of 2017 in English and Spanish, I then needed to match up each English sentence with its Spanish counterpart (for creating flashcards with examples later). The first step was to split the articles into sentences using the nltk package in Python. I then used an external package called hunalign to match up each English sentence with its Spanish counterpart.


hunalign:<br>
https://github.com/danielvarga/hunalign<br>
D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy (2005).<br>
Parallel corpora for medium density languages<br>
In Proceedings of the RANLP 2005<br>

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
import os
import nltk

### 1. Download the imf articles in Spanish and English

In [1727]:
# Get the pdfs of the imf articles from 2009 to 2017 from the imf website

cmd = "mkdir pdfs"
os.system(cmd)
for year in range(2009,2018):
    for month in ["03","06","09","12"]:
        short_year = str(year)[-2:]
        cmd = """curl "https://www.imf.org/external/pubs/ft/fandd/{}/{}/pdf/fd{}{}.pdf" \
                --output ./pdfs/eng-{}-{}.pdf""".format(year,month,month,short_year,year,month)
        os.system(cmd)
        cmd = """curl "https://www.imf.org/external/pubs/ft/fandd/spa/{}/{}/pdf/fd{}{}s.pdf" \
                --output ./pdfs/spa-{}-{}.pdf""".format(year,month,month,short_year,year,month)
        os.system(cmd)
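The URL pattern used in the download cell can be isolated into a small helper so that it can be checked without any network access. This is only a sketch: the function name `fandd_pdf_urls` is hypothetical, and the pattern is copied from the `curl` commands above.

```python
def fandd_pdf_urls(year, month):
    """Return the (English, Spanish) PDF URLs for one quarterly issue."""
    short_year = str(year)[-2:]
    base = "https://www.imf.org/external/pubs/ft/fandd"
    english = "{}/{}/{}/pdf/fd{}{}.pdf".format(base, year, month, month, short_year)
    spanish = "{}/spa/{}/{}/pdf/fd{}{}s.pdf".format(base, year, month, month, short_year)
    return english, spanish

# four issues per year from 2009 to 2017
issues = [(year, month) for year in range(2009, 2018) for month in ["03", "06", "09", "12"]]
print(len(issues))
# 36
print(fandd_pdf_urls(2009, "03")[0])
# https://www.imf.org/external/pubs/ft/fandd/2009/03/pdf/fd0309.pdf
```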

### 2. Split the text into sentences and align Spanish and English sentence pairs

In [7]:
%%time
# Create the sentence-aligned parallel corpus
# (the %%time cell magic has to be the first line of the cell)
import pandas as pd
import nltk
import os
sentence_df = pd.DataFrame({"english":[],"spanish":[]})
cmd = "rm -rf tmp ; mkdir tmp"
os.system(cmd)
for year in range(2009,2018):
    for month in ["03","06","09","12"]:
        
        # convert the pdfs to raw text
        cmd = """xpdf-tools-mac-4.00/bin64/pdftotext \
        ./pdfs/eng-{}-{}.pdf  tmp/eng-{}-{}.txt""".format(year,month,year,month)
        os.system(cmd)
        cmd = """xpdf-tools-mac-4.00/bin64/pdftotext \
        ./pdfs/spa-{}-{}.pdf  tmp/spa-{}-{}.txt""".format(year,month,year,month)
        os.system(cmd)
        e = open("tmp/eng-{}-{}.txt".format(year,month),"r",encoding="Latin-1").read()
        s = open("tmp/spa-{}-{}.txt".format(year,month),"r",encoding="Latin-1").read()
        
        # split the texts into sentences
        english_sentences = nltk.sent_tokenize(e)
        spanish_sentences = nltk.sent_tokenize(s)
        eo = open("tmp/sent-eng-{}-{}.txt".format(year,month),"w")
        for line in english_sentences:
            eo.write((line+"\n"))
        eo.close()
        so = open("tmp/sent-spa-{}-{}.txt".format(year,month),"w")
        for line in spanish_sentences:
            so.write((line+"\n"))
        so.close()
        
        #Align the sentences with external package hunalign
        cmd = """hunalign/hunalign-1.2/src/hunalign/hunalign \
        hunalign/hunalign-1.2/data/null.dic \
        tmp/sent-eng-{}-{}.txt tmp/sent-spa-{}-{}.txt\
        -text -bisent > tmp/aligned-{}-{}.txt""".format(year,month,year,month,year,month)
        os.system(cmd)

        # put the aligned lines into a dataframe
        # (rows are collected in a list first; DataFrame.append was removed in pandas 2.0)
        rows = []
        with open("tmp/aligned-{}-{}.txt".format(year,month),"r") as aligned:
            for line in aligned:
                sentences = line.split("\t")
                rows.append({"english":sentences[0],"spanish":sentences[1]})
        tmp_df = pd.DataFrame(rows,columns=["english","spanish"])
        sentence_df = pd.concat([sentence_df,tmp_df])
        
sentence_df.to_csv("imf_sentences.txt",sep="\t",index=False,header=None)

CPU times: user 2min 28s, sys: 2.15 s, total: 2min 30s
Wall time: 5min 58s


In [386]:
all_imf_sentences = pd.read_csv("imf_sentences.txt",sep="\t",header=None)
all_imf_sentences.columns = "english","spanish"
all_imf_sentences = all_imf_sentences.dropna(axis=0)

In [387]:
long_imf_sentences = all_imf_sentences[all_imf_sentences['english'].apply(lambda x: len(x.split()) > 7)]

In [388]:
sentence_count = len(long_imf_sentences['english'])
word_count = long_imf_sentences['english'].apply(lambda x : len(x.split())).sum()
words_per_sentence = long_imf_sentences['english'].apply(lambda x : len(x.split())).mean()
print("sentence count = {}\ntotal word count = {}\nmean words per sentence = {:.2f}"
      .format(sentence_count,word_count,words_per_sentence))
      

sentence count = 46668
total word count = 1005216
mean words per sentence = 21.54


###  3. Add sentences from the economist magazine to corpus

This step adds sentences from a year's worth of articles from The Economist. The IMF articles were downloaded in PDF format from the IMF website, and the sentences from The Economist were provided already in CSV format by a friend. I am not including the raw data from The Economist in this repository because I do not have permission to distribute it.

In [390]:
# I'm not adding the_economist.csv file to the github repository for this project because it
# contains articles that require a subscription to access and I do not have permission to share them. 
the_economist = pd.read_csv("the_economist.csv",encoding="Latin-1")

In [391]:
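# regex=True is needed from pandas 2.0 on, where str.replace treats the pattern literally by default
the_economist['words'] = the_economist['words'].str.replace("[',?]","",regex=True)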

In [392]:
def sentence_break(text):
    text = re.sub("\[|\]","",text)
    for match in re.findall("\D\.\s",text):
        try:
            regex = match[0] + "\."
            text = re.sub(regex,(regex[0] + "\t"),text)
        except:
            continue

    return text

In [393]:
%%time
tab_separated = the_economist['words'].apply(sentence_break).sum()
split_sentences = tab_separated.split("\t")
split_sentences = pd.Series(split_sentences)
long_econ_sentences = split_sentences[split_sentences.apply(lambda x: len(x.split()) > 7)]

CPU times: user 23.8 s, sys: 28.8 s, total: 52.6 s
Wall time: 52.9 s
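The splitting rule can also be written as a single regular expression. The sketch below is a simplified variant and not identical to `sentence_break`: it splits on a full stop that follows a non-digit and is followed by whitespace, then keeps only sentences longer than seven words. The function name `split_sentences` and the sample text are made up for illustration.

```python
import re

def split_sentences(text, longer_than=7):
    """Split on a full stop that follows a non-digit and precedes whitespace."""
    parts = re.split(r"(?<=\D)\.\s+", text)
    # keep only sentences with more than `longer_than` words
    return [part.strip() for part in parts if len(part.split()) > longer_than]

text = ("Higher wages in China make offshoring much less attractive. "
        "Growth was 2.5 percent in the first quarter of the year. Short one.")
for sentence in split_sentences(text):
    print(sentence)
# Higher wages in China make offshoring much less attractive
# Growth was 2.5 percent in the first quarter of the year
```

Like the original rule, this does not split after a sentence that ends in a digit (for example "... in 2009. Then ..."), so decimal numbers stay intact at the cost of some merged sentences.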


In [394]:
# combine the economist and imf articles
imf_econ = pd.concat([long_imf_sentences['english'],long_econ_sentences],axis=0)

In [397]:
imf_econ.head()

14               SENIOR EDITORS Camilla Andersen Archana Kumar James Rowe Simon Willson
15      ASSISTANT EDITORS Maureen Burke Sergio Negrete Cardenas Natalie Ramirez-Djumena
18                 EDITORIAL ASSISTANTS Lijun Li Kelley McCollum Niccole Braynen-Kimani
20    Periodicals postage is paid at Washington, DC, and at additional mailing offices.
21            The English edition is printed at United Lithographers Inc., Ashburn, VA.
dtype: object

In [437]:
sentence_count = len(imf_econ)
word_count = imf_econ.apply(lambda x : len(x.split())).sum()
words_per_sentence = imf_econ.apply(lambda x : len(x.split())).mean()
articles = len(the_economist)
print("article count = {}\nsentence count = {}\ntotal word count = {}\nmean words per sentence = {:.2f}"
      .format(articles,sentence_count,word_count,words_per_sentence))
      

article count = 4440
sentence count = 213807
total word count = 4523948
mean words per sentence = 21.16


### 4. Tokenize the sentences (split into words)
### 5. Lemmatize the words (put into dictionary form e.g. running -> run)

In [75]:
# There are libraries and functions that we will need to turn the
# words into their lemmas (dictionary forms) https://en.wikipedia.org/wiki/Lemma_(morphology)
# E.g. running will be turned into run and played into play
# To do this we need to:

# 1. tokenize the sentences (break them up into words)
# 2. tag the parts of speech for each token (word) e.g. verb, adjective
# 3. lemmatize the tokens (turn into dictionary form)

# The lemmatizer needs to be told the part of speech of each word.
# Full sentences need to be passed to the part-of-speech tagger in order to tag them accurately
# If the word "play" is given in isolation, it is ambiguous if it is a verb or a noun,
# but if you give the pos tagger a full sentence, it can determine from context
# e.g. I am going to play (verb) tennis. I am going to see a play (noun).

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')
lemmatizer = WordNetLemmatizer()

# This function converts the parts of speech tags from nltk pos tagger 
# To POS tags that are compatible with the wordnet lemmatizer
# function adapted from: 
# https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    else:
        return None
    

[nltk_data] Downloading package wordnet to /Users/pv7409/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/pv7409/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
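To see how the tag mapping and the capital-letter filter used in the next cell work together, here is a small sketch that runs on a hand-tagged sentence and therefore needs no NLTK data. The tags are an assumption about what a Penn Treebank style tagger would assign, the name `prepare_for_lemmatizer` is hypothetical, and the single-letter values are the WordNet part-of-speech constants from `nltk.corpus.wordnet`.

```python
# WordNet part-of-speech constants (ADJ, VERB, ADV, NOUN in nltk.corpus.wordnet),
# keyed by the first letter of the Penn Treebank tag
WORDNET_POS = {"J": "a", "V": "v", "R": "r", "N": "n"}

def prepare_for_lemmatizer(tagged_sentence):
    """Yield (lowercased word, WordNet POS or None) for the tokens that are kept."""
    for position, (token, treebank_tag) in enumerate(tagged_sentence):
        # skip tokens that are not all lower case unless they start the sentence
        if token.islower() or position == 0:
            yield token.lower(), WORDNET_POS.get(treebank_tag[:1])

# hand-tagged example sentence (assumed tags)
tagged = [("Wages", "NNS"), ("in", "IN"), ("China", "NNP"), ("are", "VBP"), ("rising", "VBG")]
print(list(prepare_for_lemmatizer(tagged)))
# [('wages', 'n'), ('in', None), ('are', 'v'), ('rising', 'v')]
```

"China" is dropped as a probable proper noun, and "in" gets `None` because its tag does not map to a WordNet part of speech, so it would be kept unlemmatized.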


In [53]:
%%time

all_tokens = []
all_lemmas = []
for sentence in imf_econ:
    tokens = nltk.word_tokenize(sentence)
    tokenized = nltk.pos_tag(tokens)
    for i, token in enumerate(tokenized):
        # Filter out proper nouns (words with capital letters that aren't at the beginning of a sentence)
        if token[0].islower() or i == 0:
            word = token[0].lower()
            wordnet_pos = get_wordnet_pos(token[1])
        #print(word,wordnet_pos)
            if wordnet_pos is None:
                all_lemmas.append(word)
                all_tokens.append(word)
            else:
                all_lemmas.append(lemmatizer.lemmatize(word,wordnet_pos))
                all_tokens.append(word)

# write out lemmas to text file
f = open("all_lemmas.txt","w")
for i in all_lemmas:
    f.write((i+"\n"))
f.close()

CPU times: user 5min 43s, sys: 5.84 s, total: 5min 49s
Wall time: 5min 50s


In [400]:
# Now we'll process the lemmas a little

all_lemmas = open("all_lemmas.txt","r").read().split("\n")

all_lemma_series = pd.Series(all_lemmas)

#remove punctuation
all_lemma_series = all_lemma_series[~all_lemma_series.str.contains("\W")]

# remove proper names
propernames = pd.Series(open("/usr/share/dict/propernames","r").read().split("\n"))
propernames = propernames.apply(lambda x: x.lower())
propernames = set(propernames)
#all_lemma_series = all_lemma_series[~all_lemma_series.isin(propernames)]

#get rid of numbers
all_lemma_series = all_lemma_series[~all_lemma_series.str.contains("\d")]



### 6. Rank all words by frequency

In [415]:
# Create a frequency table for each lemma
# Give each lemma a rank

lemma_frequencey = all_lemma_series.value_counts()
ranked = pd.Series(lemma_frequencey.index)
ranks = pd.DataFrame(list(range(1,len(lemma_frequencey)+1)))
ranks.index = lemma_frequencey.index
ranks['freq'] = lemma_frequencey

In [416]:
ranks[2000:2010]

| word | 0 | freq |
|---|---|---|
| spillover | 2001 | 188 |
| corrupt | 2002 | 188 |
| behavior | 2003 | 188 |
| sensitive | 2004 | 188 |
| typical | 2005 | 188 |
| generous | 2006 | 188 |
| attractive | 2007 | 188 |
| jet | 2008 | 188 |
| usual | 2009 | 187 |
| brother | 2010 | 187 |
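The same ranking idea can be shown in plain Python on a toy list of lemmas. This is a sketch with made-up data, using `collections.Counter` in place of `value_counts`.

```python
from collections import Counter

# toy list of lemmas (made up for illustration)
lemmas = ["debt", "wage", "debt", "asset", "wage", "debt"]
counts = Counter(lemmas)

# most_common() sorts by descending count; rank 1 is the most frequent lemma
ranks = {
    lemma: (rank, freq)
    for rank, (lemma, freq) in enumerate(counts.most_common(), start=1)
}
print(ranks)
# {'debt': (1, 3), 'wage': (2, 2), 'asset': (3, 1)}
```

Words with equal counts still receive distinct consecutive ranks, which matches the rank table above, where several words share a frequency of 188.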


### 7. Get a frequency ranked list of English words from a general corpus of English texts

Here is a list of the 5000 most frequent words in English from https://www.wordfrequency.info/
<br>Note that although there are 5000 entries in the list, there are only 4353 unique words,
as sometimes the same word has several entries because it appears as a different part of speech 
<br>e.g. 'light' appears as a noun (the light at the end of the tunnel) and as an adjective (a light breakfast)

In [403]:
# Get list of English words by frequency, generated from a 14 billion word internet corpus
# https://www.wordfrequency.info/ 
f = open("5000_eng_words.txt","r")
freq_5000 = f.read().split("\n")
freq_5000 = freq_5000[1:] # ignore header
freq_5000_list = pd.Series(freq_5000).str.lower()

#remove duplicated words in freq_5000_list
freq_5000_list = pd.Series(freq_5000_list).unique()

# give each word in the general corpus frequency list a rank
freq5000df = pd.DataFrame(list(range(1,len(freq_5000_list)+1)))
freq5000df.index = freq_5000_list

In [404]:
freq5000df.head()

| word | 0 |
|---|---|
| the | 1 |
| be | 2 |
| and | 3 |
| of | 4 |
| a | 5 |

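The effect of removing duplicated entries before assigning ranks can be sketched without pandas. The word list below is a made-up excerpt. Each word keeps the position of its first (most frequent) entry, and later entries for other parts of speech are dropped.

```python
# made-up excerpt of a frequency-ordered list with one repeated word
entries = ["the", "be", "and", "light", "run", "light"]

# dict.fromkeys keeps the first occurrence of each word and preserves order
unique_words = list(dict.fromkeys(word.lower() for word in entries))
general_rank = {word: rank for rank, word in enumerate(unique_words, start=1)}
print(general_rank)
# {'the': 1, 'be': 2, 'and': 3, 'light': 4, 'run': 5}
```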

In [417]:
# merge the general corpus frequency data frame with our economics vocab frequency list

ranks = ranks.merge(freq5000df,how="left",left_index=True,right_index=True)
ranks.columns = "economics_rank","economics_freq","general_rank"
ranks = ranks.sort_values("economics_rank")

# if word not in general corpus list, give it rank of 1000000
ranks.loc[ranks["general_rank"].isnull(),["general_rank"]] = 1000000 
ranks["general_rank"] = ranks["general_rank"].astype(int) # turn back into integers


In [418]:
# remove small words (get rid of single letters etc.)
words_series = pd.Series(ranks.index)
long_words = words_series[words_series.apply(lambda x: len(x) > 3)].values

In [419]:
ranks = ranks.loc[long_words]

In [420]:
ranks.head()

| word | economics_rank | economics_freq | general_rank |
|---|---|---|---|
| have | 8 | 55367 | 8 |
| that | 9 | 54685 | 11 |
| with | 15 | 26638 | 15 |
| from | 16 | 22019 | 25 |
| more | 21 | 17803 | 78 |


### 8. Get a list of Spanish-English cognates (e.g. la economía <-> the economy)

In [1086]:
# list of spanish/english cognates downloaded from: http://cognates.org/pdf/mfcogn.pdf
# massage the text file (formatting was messed up when converted from pdf)
f = open("cognates.txt","r").read()
f = re.sub(r'\((.*?)\)',",",f)
f = re.sub("por ciento","porciento",f)
f = re.sub("se relajó","serelajó",f)
f = re.sub("en el presente","enelpresente",f)
f = re.sub("ex prefix","prefix",f)
f = re.sub("ex prefijo","prefix",f)
f = re.sub("soul, música","soulmúsica",f)
f = re.sub("substituir v. 5/sustituir","substituir/sustituir",f)
f = re.sub("rock n' roll","rock'n'roll",f)
f = re.sub("El Salvador","ElSalvador",f)
f = re.sub("valuación, avalúo","valuación/avalúo",f)
f = re.sub("prefix","",f)
f = re.sub("intj.","",f)

f = re.sub(r'\d',",",f)
f = re.split("PMF|MFW| |conj\.|v\.|adj\.|n\.|s\.|adv\.|,|abbr\.|abr\.|prep\.",f)
to_remove = "Cognate","org","","clic","prefijo","prefix"
for item in to_remove:
    while item in f: f.remove(item)
f[f.index("porciento")] = "por ciento"
f[f.index("serelajó")] = "se relajó"
f[f.index("enelpresente")] = "en el presente"
f[f.index("soulmúsica")] = "soul música"




In [1091]:
cognate_list = []
for i in range(0,len(f)-1,2):
    cognate_list.append([f[i],f[i+1]])
cognate_df = pd.DataFrame(cognate_list)
cognate_df = pd.DataFrame(cognate_df)
cognate_df.columns = "english","spanish"
cognate_df.to_csv("spanish_english_cognates.csv",index=False)

In [421]:
cognate_df = pd.read_csv("spanish_english_cognates.csv",header=None)
cognate_df.columns = "english","spanish"

In [422]:
cognate_df.head()

|  | english | spanish |
|---|---|---|
| 0 | english | spanish |
| 1 | a.m. | a.m. |
| 2 | abandon | abandonar |
| 3 | abandoned | abandonó |
| 4 | abandoned | abandonado |


### 9. Create our list of most common economics specific words without Spanish cognates

In [424]:
#economics_vocab = ranks[(ranks["economics_rank"] > ranks["general_rank"] + 500) & (ranks["general_rank"] < 10000)]
economics_vocab = ranks[((ranks["economics_rank"] ) < ranks["general_rank"] - 1000) ]

In [425]:
s = pd.Series(economics_vocab.index)
no_cognates = list(s[~s.isin(cognate_df["english"])])
have_cognates = list(s[s.isin(cognate_df["english"])])
economics_vocab_no_cognates = economics_vocab.loc[no_cognates]
economics_vocab_with_cognates = economics_vocab.loc[have_cognates]

In [426]:
cog_df = pd.DataFrame(cognate_df["spanish"])
cog_df.index = cognate_df["english"]

In [427]:
w_cognate = economics_vocab.merge(cog_df,how="left",left_index=True,right_index=True)
w_cognate = w_cognate.sort_values("economics_rank")
w_cognate.loc[w_cognate['spanish'].isnull(),['spanish']] = "-"

In [428]:
w_cognate = w_cognate.rename({"spanish":"spanish_cognate"},axis=1)
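The selection rule from this step can be checked by hand on a few rows taken from the tables in this README. The sketch below uses plain Python instead of pandas. The helper name `is_economics_specific` is hypothetical, and the margin of 1000 ranks is the one used in the filter above.

```python
NOT_IN_GENERAL_LIST = 1000000  # sentinel rank used in the notebook

# (economics_rank, general_rank or None if the word is not on the general list)
ranks = {
    "have": (8, 8),
    "percent": (157, None),
    "minister": (224, 1711),
    "debt": (240, 1684),
    "wage": (450, 2300),
}
cognates = {"percent", "minister"}  # English words with a Spanish cognate

def is_economics_specific(economics_rank, general_rank, margin=1000):
    """True if the word ranks more than `margin` places higher in the economics corpus."""
    if general_rank is None:
        general_rank = NOT_IN_GENERAL_LIST
    return economics_rank < general_rank - margin

selected = [
    word
    for word, (economics_rank, general_rank) in ranks.items()
    if is_economics_specific(economics_rank, general_rank) and word not in cognates
]
print(selected)
# ['debt', 'wage']
```

"have" is rejected because it is just as common in general English, while "percent" and "minister" pass the rank test but are removed as cognates.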

In [127]:
# Write vocab list
f = open("economics_vocab.txt","w")
f.write("""#List of 2000 economics related words 
#Ranked by frequency of occurrence in the imf finance and development magazine
#https://www.imf.org/external/pubs/ft/fandd/
#Words with Spanish cognates removed\n""")
for i in economics_vocab_no_cognates[:2000].index:
    f.write((i+"\n"))
f.close()

### 10. Get dictionary entries with examples for each word by web scraping

In [131]:
economics_vocab_list = open("economics_vocab.txt","r").readlines()
economics_vocab_list = pd.Series(economics_vocab_list)
economics_vocab_list = economics_vocab_list.iloc[4:]
economics_vocab_list = economics_vocab_list.str.rstrip()

In [132]:
%%time
import requests
from bs4 import BeautifulSoup
import numpy as np
definition_rows = []
for word in economics_vocab_list:
    try:
        url="http://www.spanishdict.com/translate/{}".format(word)
        content = requests.get(url).content
        soup = BeautifulSoup(content,'lxml')
        entry = str(soup.select(".dictionary-entry")[0])
    except Exception:
        # no dictionary entry could be retrieved for this word
        entry = np.nan
    definition_rows.append({"word":word,"definitions":entry})
# build the dataframe in one go (DataFrame.append was removed in pandas 2.0)
definitions = pd.DataFrame(definition_rows,columns=["word","definitions"])
definitions.to_csv("defintions_4_anki.txt",header=None,index=False,sep="\t")

CPU times: user 1min 46s, sys: 4.24 s, total: 1min 50s
Wall time: 19min 49s


In [52]:
definitions = pd.read_csv("defintions_4_anki.txt",header=None,sep="\t")
definitions.columns="word","definitions"
definitions.index = definitions['word'].values

In [430]:
# create a set of words on the first draft of the list that my friend says she already knows

knows = pd.read_csv("words_blanca_knows.txt",header=None)[1:]
knows.loc[220:,1] = "x"
already_knows = set(knows[knows[1].isnull()][0])

### 11. Extract examples from dictionary entries

In [61]:
for word in definitions['word'][:400]:
    try:
        content = definitions.loc[word,'definitions']
        soup = BeautifulSoup(content,'lxml')
        for i, item in enumerate(soup.find_all(class_="dictionary-neodict-example")):
            trans = item.parent.previousSibling.find_all(class_="dictionary-neodict-translation-translation")[0].text
            if trans == "":
                trans = "No direct translation"
            english = item.find_all("span")[0].text
            spanish = item.find_all(class_="exB")[0].text
            definitions.loc[word,'ex{}trans'.format(i)] = trans 
            definitions.loc[word,'ex{}eng'.format(i)] = english
            definitions.loc[word,'ex{}spa'.format(i)] = spanish
    except:
        continue

### 12. Make the Anki flash cards:

To start with I will only make cards for around the first 200 words.  

In [70]:
to_check = definitions[~definitions['word'].isin(already_knows)]
to_check = to_check.iloc[2:]

In [72]:
to_check[:200].to_csv("to_check.csv",sep="\t",header=None)

In [73]:
for_anki = pd.read_csv("./first_195.csv",sep="\t",index_col=0,header=None)

In [432]:
### make anki for guessing English word from Spanish

def make_anki(entry,replacement,sentence,translation,extra,definition):
    try:
        tokens = nltk.word_tokenize(sentence)
        tokenized = nltk.pos_tag(tokens)
        lemmas = []
        for i, token in enumerate(tokenized):
            word = token[0]
            wordnet_pos = get_wordnet_pos(token[1])
            if wordnet_pos is None:
                lemmas.append(word)
            else:
                lemmas.append(lemmatizer.lemmatize(word,wordnet_pos))
        target_index = lemmas.index(entry)
        tokens[target_index] = "{{c1::" + replacement + "::" + tokens[target_index] + "}}"
        tokens[-2] = tokens[-2] + tokens[-1]
        result = " ".join(tokens[:-1])
        return (result + "\t" + extra + "\t" + translation + "\t" + definition)
    except:
        return None

# entry = "scheme"
# replacement = "el plan"
# sentence = "The city government developed a scheme to revitalize its downtown area."
# translation = "El gobierno municipal ideó un plan para revitalizar el centro de la ciudad."
# extra = " "
# definition = "definition goes here"
# make_anki(entry,replacement,sentence,translation,extra,definition)

In [434]:
### Make anki for guessing Spanish word from English

def make_anki(entry,replacement,sentence,translation,extra,definition):
    """makes an Anki card with sentence in English, translation in Spanish and definition"""
    try:
        # Split the sentence into tokens (words plus punctuation)
        tokens = nltk.word_tokenize(sentence)
        tokenized = nltk.pos_tag(tokens)
        lemmas = []
        # Lemmatize the words (necessary for replacing/highlighting the words later)
        for i, token in enumerate(tokenized):
            word = token[0]
            wordnet_pos = get_wordnet_pos(token[1])
            if wordnet_pos is None:
                lemmas.append(word)
            else:
                lemmas.append(lemmatizer.lemmatize(word,wordnet_pos))
        target_index = lemmas.index(entry)
        # insert css styling around the target word
        tokens[target_index] = "<span style='color:blue;'> " + tokens[target_index] + "</span>"
        tokens[-2] = tokens[-2] + tokens[-1]
        result = " ".join(tokens[:-1])
        result = re.sub(" n't","n't",result)
        for match in re.findall("\w\w\w \W",result):
            result = re.sub(match,re.sub(" ","",r"{}".format(match)),result)
             

        return ("¿Qué significa la palabra en azul?<br><br>" + result + "\t" + "<strong>" + replacement + "</strong><br>" + translation + "\t" + definition)
    except:
        return None


# Lines below are a test of the make_anki function    
entry = "scheme"
replacement = "el plan"
sentence = "The city government developed a scheme to revitalize its downtown area."
translation = "El gobierno municipal ideó un plan para revitalizar el centro de la ciudad."
extra = " "
definition = "definition goes here"
make_anki(entry,replacement,sentence,translation,extra,definition)

"¿Qué significa la palabra en azul?<br><br>The city government developed a <span style='color:blue;'> scheme</span> to revitalize its downtown area.\t<strong>el plan</strong><br>El gobierno municipal ideó un plan para revitalizar el centro de la ciudad.\tdefinition goes here"

In [374]:
# Create a dictionary where the keys are the words to be learned and the 
# values are a list of anki cards with one example per card

cards = {}
for entry in for_anki.index:
    try:
#        cards[entry] = ["{{c1::what is meaning of...::what is meaning of...}}<br>" + entry + "\t" + " " + "\t" + definitions.loc[entry,"definitions"] + "\t" + " "  ]
        cards[entry] = []
        for i in range(for_anki.loc[entry].dropna().shape[0]//3):
            replacement = for_anki.loc[entry,1+(i*3)]
            sentence = for_anki.loc[entry,2+(i*3)]
            translation = for_anki.loc[entry,3+(i*3)]
            definition = definitions.loc[entry,'definitions']
            extra = " "
            result = make_anki(entry,replacement,sentence,translation,extra,definition)
            if result is None:
                continue
            else:
                cards[entry].append(result)
    except:
        continue



In [375]:
# Prepare the deck for loading into Anki
# We want the easier cards to appear first, but we also want to randomize
# the order a little to spread out examples of the same words to make
# the deck less predictable

import copy
import random
f = open("test_anki_deck.txt","w")
# deep copy required because we are removing cards from the list and don't want to
# alter the original list
destroy_cards = copy.deepcopy(cards) 
# Group the cards in intervals of 15 and randomize within those groups
interval = 15
written_cards = []
for start in range(0,len(for_anki),interval):
    randlist = []
    # Take up to 4 examples for each word, or as many as are available if fewer than 4
    for i in range(4):
        for entry in for_anki.index[start:start+interval]:
            if len(destroy_cards[entry]) > 0:
                randlist.append(destroy_cards[entry].pop(0))
    while len(randlist) > 0:
        next_card = randlist.pop(random.randrange(len(randlist)))
        f.write(next_card)
        f.write("\n")
        written_cards.append(next_card.split("\t")[0])           
f.close()
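The core idea of the cell above, shuffling within fixed-size groups while keeping the groups themselves in order, can be sketched in isolation. This is a simplified version that leaves out the step of taking several examples per word. The function name `shuffle_in_blocks` is hypothetical.

```python
import random

def shuffle_in_blocks(items, block_size, rng):
    """Shuffle items within consecutive blocks, keeping the block order."""
    result = []
    for start in range(0, len(items), block_size):
        block = items[start:start + block_size]
        rng.shuffle(block)  # randomize only inside this block
        result.extend(block)
    return result

cards = list(range(10))  # stand-ins for cards, ordered from easy to hard
shuffled = shuffle_in_blocks(cards, 5, random.Random(0))
# the first five cards are still the five easiest, in some random order
print(sorted(shuffled[:5]) == [0, 1, 2, 3, 4])
# True
```

Passing in a seeded `random.Random` instance makes the order reproducible, which can help when checking a deck before importing it.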

The Anki deck can then be imported as a tab-separated file. I needed to create a special card type in Anki to accommodate the front/back/definition format of the card. These card types are already in the .apkg file in this repository.
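Because each card is one line with three tab-separated fields, a stray tab or line break inside a field would shift or split the fields on import. The sketch below shows one way to guard against that. It assumes that HTML is allowed in the fields (the cards already use `<br>` and `<span>`), and the function name `anki_line` is hypothetical.

```python
def anki_line(front, back, definition):
    """Join three card fields into one tab-separated line for import."""
    fields = [front, back, definition]
    # tabs or newlines inside a field would break the one-card-per-line layout
    cleaned = [field.replace("\t", " ").replace("\n", "<br>") for field in fields]
    return "\t".join(cleaned)

line = anki_line("What does 'scheme' mean?", "el plan\nEl gobierno ideó un plan.", "noun\tel plan")
print(len(line.split("\t")))
# 3
```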

# Cells not currently in use

### Getting examples from the parallel corpus

In [155]:
to_anki = definitions.copy()

In [166]:
%%time
# Sentence length in words; this only needs to be computed once
long_imf_sentences['sentence_length'] = long_imf_sentences['english'].apply(lambda x: len(x.split()))
for word in to_anki.index[:10]:
    try:
        examples = (long_imf_sentences[long_imf_sentences['english']
             .str.contains(word)].sort_values('sentence_length'))
        # Keep sentences longer than 15 words that contain no digits
        examples = examples[['english','spanish']].loc[examples['sentence_length'] > 15]
        examples = examples.loc[~examples['english'].str.contains(r"\d") & ~examples['spanish'].str.contains(r"\d")]
        # Store the 5 shortest remaining sentences with their translations
        examples = examples.iloc[:5,:].copy()
        for i in range(5):
            to_anki.loc[word,"ex{}_eng".format(i)] = examples.iloc[i,0]
            to_anki.loc[word,"ex{}_spa".format(i)] = examples.iloc[i,1]
    except Exception:
        # Fewer than 5 examples (or any other failure): leave this word without examples
        for i in range(5):
            to_anki.loc[word,"ex{}_eng".format(i)] = np.nan
            to_anki.loc[word,"ex{}_spa".format(i)] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


CPU times: user 4.59 s, sys: 1.71 s, total: 6.3 s
Wall time: 18.8 s


In [163]:
to_anki['ex0_eng'].head()

word
inbox      NaN
upgrade    NaN
debt       NaN
investor   NaN
asset      NaN
Name: ex0_eng, dtype: float64

In [66]:
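import re
for word in to_anki.index[:100]:
    try:
        # Match the word's stem (last two letters dropped) plus any ending
        regex = r"{}\S*".format(re.escape(to_anki.loc[word,'word'][:-2]))
        match = re.findall(regex,to_anki.loc[word,'ex0_eng'].lower())[0]
        # Wrap every occurrence of the matched form in Anki cloze markup
        to_anki.loc[word,'ex0_cloze'] = re.sub(re.escape(match),"{{{{c1::{}}}}}".format(match),to_anki.loc[word,'ex0_eng'].lower())
    except Exception:
        # No example sentence or no match: skip this word
        continue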
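
The cloze step can also be written as a standalone function, which makes it easy to try on single sentences. This is a sketch under assumptions, not the cell above: `make_cloze` and its `trim` parameter are hypothetical, it uses word boundaries so trailing punctuation stays outside the cloze, it keeps the sentence's original case, and it wraps only the first match.

```python
import re

def make_cloze(word, sentence, trim=2):
    """Wrap the first inflected form of `word` in `sentence` in Anki cloze markup."""
    # Drop the last letters so endings such as plural forms still match
    stem = word[:-trim] if len(word) > trim else word
    pattern = re.compile(r"\b{}\w*".format(re.escape(stem)), flags=re.IGNORECASE)
    # Doubled braces in the format string produce literal {{ and }}
    return pattern.sub(lambda m: "{{{{c1::{}}}}}".format(m.group(0)), sentence, count=1)

example = make_cloze("wage", "Higher wages in China make offshoring less attractive.")
```

A short stem can match unrelated words (for "wage" the stem is "wa", which would also match "was"), so the output is worth checking by eye before importing.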

In [72]:
to_anki.iloc[:100,[12,3,1]].to_csv("examples_anki.txt",index=False,header=None,sep="\t")
    

In [71]:
pd.options.display.max_colwidth=1000
tmp = to_anki.iloc[:100,[12,3,1]]
tmp.head(20)

word,ex0_cloze,ex0_spa,definitions
currency,but what goes up does not come down so easily when there is no independent {{c1::currency.}},Pero lo que sube no baja con tanta facilidad si la moneda no es independiente.,"<div class=""dictionary-entry dictionary-neodict""><div class=""dictionary-neodict-entry-title"">currency</div><a class=""has-tooltip dictionary-neodict-first-part-of-speech part_of_speech"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""A noun is a word referring to a person, animal, place, thing, feeling or idea (e.g. man, dog, house)."">noun</a><div class=""dictionary-neodict-indent-1""><span class=""def"">1. </span><span class=""context"">(finance)</span> <div class=""dictionary-neodict-indent-2""><div class=""dictionary-neodict-translation""><span class=""dictionary-neodict-translation-letters"">a. </span><a class=""dictionary-neodict-translation-translation"" href=""/translate/la%20moneda"">la moneda</a><span class=""def""> </span><a class=""has-tooltip def"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""(f) means that a noun is feminine. Spanish nouns have a gender, which is either feminine (like la ..."
poverty,she wants it to be part of a data revolution to guide the fight against {{c1::poverty.}},Quiere que sea parte de una revolución en materia de datos que oriente la lucha contra la pobreza.,"<div class=""dictionary-entry dictionary-neodict""><div class=""dictionary-neodict-entry-title"">poverty</div><a class=""has-tooltip dictionary-neodict-first-part-of-speech part_of_speech"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""A noun is a word referring to a person, animal, place, thing, feeling or idea (e.g. man, dog, house)."">noun</a><div class=""dictionary-neodict-indent-1""><span class=""def"">1. </span><span class=""context"">(state of being poor)</span> <div class=""dictionary-neodict-indent-2""><div class=""dictionary-neodict-translation""><span class=""dictionary-neodict-translation-letters"">a. </span><a class=""dictionary-neodict-translation-translation"" href=""/translate/la%20pobreza"">la pobreza</a><span class=""def""> </span><a class=""has-tooltip def"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""(f) means that a noun is feminine. Spanish nouns have a gender, which is either femin..."
spending,"canada, meanwhile, implemented profound structural reforms in {{c1::spending}} and tax policy that had a longer-lasting impact.","Canadá, entre tanto, puso en marcha profundas reformas estructurales de la política de gasto y tributación que tuvieron un impacto más duradero.","<div class=""dictionary-entry dictionary-neodict""><div class=""dictionary-neodict-entry-title"">spending</div><a class=""has-tooltip dictionary-neodict-first-part-of-speech part_of_speech"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""A noun is a word referring to a person, animal, place, thing, feeling or idea (e.g. man, dog, house)."">noun</a><div class=""dictionary-neodict-indent-1""><span class=""def"">1. </span><span class=""context"">(finance)</span> <div class=""dictionary-neodict-indent-2""><div class=""dictionary-neodict-translation""><span class=""dictionary-neodict-translation-letters"">a. </span><a class=""dictionary-neodict-translation-translation"" href=""/translate/los%20gastos"">los gastos</a><span class=""def""> </span><a class=""has-tooltip def"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""(m) means that a noun is masculine. Spanish nouns have a gender, which is either feminine (like ..."
inequality,there are many channels through which opening up the capital account can lead to higher {{c1::inequality.}},de capital puede incrementar la,"<div class=""dictionary-entry dictionary-neodict""><div class=""dictionary-neodict-entry-title"">inequality</div><a class=""has-tooltip dictionary-neodict-first-part-of-speech part_of_speech"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""A noun is a word referring to a person, animal, place, thing, feeling or idea (e.g. man, dog, house)."">noun</a><div class=""dictionary-neodict-indent-1""><span class=""def"">1. </span><span class=""context"">(general)</span> <div class=""dictionary-neodict-indent-2""><div class=""dictionary-neodict-translation""><span class=""dictionary-neodict-translation-letters"">a. </span><a class=""dictionary-neodict-translation-translation"" href=""/translate/la%20desigualdad"">la desigualdad</a><span class=""def""> </span><a class=""has-tooltip def"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""(f) means that a noun is feminine. Spanish nouns have a gender, which is either femini..."
banking,"in fact, caribbean countries have been among the most affected by loss of correspondent {{c1::banking}} relationships.","De hecho, los países del Caribe han sido de los más afectados por la pérdida de relaciones de corresponsalía bancaria.","<div class=""dictionary-entry dictionary-neodict""><div class=""dictionary-neodict-entry-title"">banking</div><a class=""has-tooltip dictionary-neodict-first-part-of-speech part_of_speech"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""A noun is a word referring to a person, animal, place, thing, feeling or idea (e.g. man, dog, house)."">noun</a><div class=""dictionary-neodict-indent-1""><span class=""def"">1. </span><span class=""context"">(financial business)</span> <div class=""dictionary-neodict-indent-2""><div class=""dictionary-neodict-translation""><span class=""dictionary-neodict-translation-letters"">a. </span><a class=""dictionary-neodict-translation-translation"" href=""/translate/la%20banca"">la banca</a><span class=""def""> </span><a class=""has-tooltip def"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""(f) means that a noun is feminine. Spanish nouns have a gender, which is either feminine (..."
macroeconomic,"to be effective, regulations should provide incentives to firms to smooth the impact of {{c1::macroeconomic}} shocks.","Para ser eficaces, las normas deben alentar a las empresas a atenuar el efecto de los shocks macroeconómicos.","<div class=""dictionary-entry dictionary-collins""><div class=""entry""><div class=""hw_unit""><span class=""hw"">macroeconomic</span> <span class=""pron"">[ˌmækrəʊˌiːkəˈnɒmɪk]</span> </div><div class=""gram_cat""><a name=""adjective""></a><div class=""entry_pos"">adjective</div><span class=""sense""><span class=""tran_group""> <span class=""tran_main"">macroeconómico</span></span></span></div></div><div class=""d-copyright""><a href=""http://www.collinsdictionary.com/dictionary/english-spanish"" rel=""nofollow"">Collins Complete Spanish Electronic Dictionary © HarperCollins Publishers 2011</a></div></div>"
deficit,"people often think that the program of growth, employment, and redistribution was about cutting the {{c1::deficit.}}","Se suele pensar que el programa de crecimiento, empleo y redistribución consistía en reducir el déficit.","<div class=""dictionary-entry dictionary-neodict""><div class=""dictionary-neodict-entry-title"">deficit</div><a class=""has-tooltip dictionary-neodict-first-part-of-speech part_of_speech"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""A noun is a word referring to a person, animal, place, thing, feeling or idea (e.g. man, dog, house)."">noun</a><div class=""dictionary-neodict-indent-1""><span class=""def"">1. </span><span class=""context"">(finance)</span> <div class=""dictionary-neodict-indent-2""><div class=""dictionary-neodict-translation""><span class=""dictionary-neodict-translation-letters"">a. </span><a class=""dictionary-neodict-translation-translation"" href=""/translate/el%20d%C3%A9ficit"">el déficit</a><span class=""def""> </span><a class=""has-tooltip def"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""(m) means that a noun is masculine. Spanish nouns have a gender, which is either feminine (l..."
wage,"but they are merely a high-{{c1::wage,}} capital- or skill-intensive drop in india's low{{c1::wage,}} unskilled, labor-abundant ocean.",Pero estos son solo una gota de agua de mano de obra de salarios altos y uso intensivo de capital en un océano de abundante mano de obra no calificada de salarios bajos.,"<div class=""dictionary-entry dictionary-neodict""><div class=""dictionary-neodict-entry-title"">wage</div><a class=""has-tooltip dictionary-neodict-first-part-of-speech part_of_speech"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""A noun is a word referring to a person, animal, place, thing, feeling or idea (e.g. man, dog, house)."">noun</a><div class=""dictionary-neodict-indent-1""><span class=""def"">1. </span><span class=""context"">(rate of pay)</span> <div class=""dictionary-neodict-indent-2""><div class=""dictionary-neodict-translation""><span class=""dictionary-neodict-translation-letters"">a. </span><a class=""dictionary-neodict-translation-translation"" href=""/translate/el%20salario"">el salario</a><span class=""def""> </span><a class=""has-tooltip def"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""(m) means that a noun is masculine. Spanish nouns have a gender, which is either feminine (like ..."
unemployment,"because {{c1::unemployment}} follows growth with a delay, it is called a lagging indicator of economic activity.","Por ese motivo, el desempleo es un indicador rezagado de la actividad económica.","<div class=""dictionary-entry dictionary-neodict""><div class=""dictionary-neodict-entry-title"">unemployment</div><a class=""has-tooltip dictionary-neodict-first-part-of-speech part_of_speech"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""A noun is a word referring to a person, animal, place, thing, feeling or idea (e.g. man, dog, house)."">noun</a><div class=""dictionary-neodict-indent-1""><span class=""def"">1. </span><span class=""context"">(lack of work)</span> <div class=""dictionary-neodict-indent-2""><div class=""dictionary-neodict-translation""><span class=""dictionary-neodict-translation-letters"">a. </span><a class=""dictionary-neodict-translation-translation"" href=""/translate/el%20desempleo"">el desempleo</a><span class=""def""> </span><a class=""has-tooltip def"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""(m) means that a noun is masculine. Spanish nouns have a gender, which is either fe..."
commodity,the latest sharp rise and fall in {{c1::commodity}} prices is not the first nor the last,Los recientes altibajos de los precios de las materias primas no son ni los primeros ni los últimos,"<div class=""dictionary-entry dictionary-neodict""><div class=""dictionary-neodict-entry-title"">commodity</div><a class=""has-tooltip dictionary-neodict-first-part-of-speech part_of_speech"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""A noun is a word referring to a person, animal, place, thing, feeling or idea (e.g. man, dog, house)."">noun</a><div class=""dictionary-neodict-indent-1""><span class=""def"">1. </span><span class=""context"">(commerce)</span> <div class=""dictionary-neodict-indent-2""><div class=""dictionary-neodict-translation""><span class=""dictionary-neodict-translation-letters"">a. </span><a class=""dictionary-neodict-translation-translation"" href=""/translate/el%20art%C3%ADculo"">el artículo</a><span class=""def""> </span><a class=""has-tooltip def"" data-toggle=""tooltip"" href=""http://www.spanishdict.com/guide/masculine-and-feminine-nouns"" title=""(m) means that a noun is masculine. Spanish nouns have a gender, which is either femini..."


In [1829]:
w_defs = economics_vocab_no_cognates.copy()
w_defs = w_defs.merge(tmp,left_index=True,right_index=True,how="left")

In [1617]:
# Links for articles on the imf news website
# Not currently using these but could be a nice additional resource

found_links = []

for i in range(97): 
    url="https://www.imf.org/es/News/Search?datefrom=1994-01-01&dateto=2018-11-02&page={}".format(i)
    content = requests.get(url).content
    soup = BeautifulSoup(content,'lxml')
    for link in soup.find_all("a"):
        if link.get("href"):
            if re.search("es/news/articles",link['href'].lower()):
                if link['href'] not in found_links:
                    found_links.append(link['href'])

    

In [1544]:
# Tried to get definitions from Linguee. Linguee blocked me for sending too many requests

for i, word in enumerate(economics_vocab_no_cognates.index[:50]):
    try:
        url = "https://www.linguee.com/english-spanish/search?source=auto&query={}".format(word)
        content = requests.get(url).content
        soup = BeautifulSoup(content,'lxml')
        #soup.find_all(True, {'class':['sentence', 'left']},text="imf.org")
        print(i,word)
        print(soup.find_all(text="imf.org")[0].parent.parent.parent.parent.select(".sentence.left")[0].text)
        print(soup.find_all(text="imf.org")[0].parent.parent.parent.parent.select(".sentence.right2")[0].text)
    except:
        continue

0 currency
1 poverty
2 spending
3 inequality
4 banking
5 macroeconomic
6 deficit
7 wage
8 unemployment
9 commodity
10 output
11 framework
12 financing
13 euro
14 policymakers
15 strengthen
16 governance
17 regulatory
18 employment
19 globalization
20 liberalization
21 enterprise
22 moreover
23 inflow
24 liquidity
25 expenditure
26 lending
27 equity
28 further
29 donor
30 boost
31 indicator
32 tariff
33 enhance
34 trading
35 instance
36 wealth
37 volatility
38 boom
39 creditor
40 remittance
41 imbalance
42 transparency
43 constraint
44 arrangement
45 surplus
46 sustainable
47 scheme
48 borrow
49 liability
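
Since the site blocked the requests above, one option is to pause between consecutive requests. The sketch below is an assumption about how that could be done, not something the notebook ran: `fetch_politely` is a hypothetical helper, the delay value is arbitrary, and a pause does not guarantee that a site will accept automated requests, so its terms of use should be checked first.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each url, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results

# Demonstration with a stand-in fetch function and a recorded (not real) sleep
pauses = []
pages = fetch_politely(["a", "b", "c"], fetch=str.upper, delay_seconds=1.5, sleep=pauses.append)
```

With real pages, `fetch` could be something like `lambda url: requests.get(url).content`.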


### Try getting definitions and examples from Word Reference

In [1408]:
# Try getting definitions and examples from Word Reference

%%time
import numpy as np
rows = []
for word in economics_vocab_no_cognates.index[:50]:
    # These defaults are kept for any field that cannot be scraped
    definition = np.nan
    example = np.nan
    example_trans = np.nan
    try:
        url = "https://www.wordreference.com/es/translation.asp?tranword={}".format(word)
        content = requests.get(url).content
        soup = BeautifulSoup(content,'lxml')
        definition = soup.select(".ToWrd")[1].text.split()[0]
        example_tag = soup.select(".FrEx")[0]
        example = example_tag.text
        # The translated example sits two siblings after the example's parent row
        trans_tags = example_tag.parent.nextSibling.nextSibling.select(".ToEx")
        if trans_tags:
            example_trans = trans_tags[0].text
        else:
            example_trans = "no translation found"
    except Exception:
        # Keep whatever was scraped before the failure
        pass
    rows.append({"english": word,"spanish": definition,"example": example,"translation": example_trans})

# Build the DataFrame once instead of appending row by row
anki = pd.DataFrame(rows,columns=["english","spanish","example","translation"])
        
        

CPU times: user 3.58 s, sys: 234 ms, total: 3.81 s
Wall time: 18.3 s


In [1362]:
example_tag.parent.nextSibling.nextSibling.select(".ToEx")[0]

<td class="ToEx" colspan="2">La globalización significa que debemos competir con trabajadores de todo el mundo.</td>

In [1466]:
anki["cloze"] = 0
anki = anki.dropna(axis=0)
for i in range(len(anki)):
    regex = r"{}\S*".format(anki.iloc[i,0][:-2])
    match = re.findall(regex,anki.iloc[i,2].lower())[0]
    regex2 = r" {}\S*".format(anki.iloc[i,1][:-4].lower())
    if re.search(regex2,anki.iloc[i,3].lower()):
        match2 = re.findall(regex2,anki.iloc[i,3].lower())[0]
    else:
        match2 = anki.iloc[i,1]
    anki.iloc[i,4] = re.sub(match,"{{{{c1::{}::{}}}}}".format(match,match2),anki.iloc[i,2].lower())


In [1270]:
# Trying Linguee
my_list = []
for word in economics_vocab_no_cognates.index[:10]:
    try:
        url = "https://www.linguee.com/english-spanish/search?source=auto&query={}".format(word)
        content = requests.get(url).content
        soup = BeautifulSoup(content,'lxml')
        example1 = soup.select(".tag_s")[0].text
        example2 = soup.select(".tag_t")[0].text
        translation = soup.select(".dictLink.featured")[0].text
        my_list.append((word,example1,example2,translation))
    except:
        continue

In [1564]:
# imf news site

url="https://www.imf.org/en/news/articles/2015/09/28/04/53/sonew101015a"
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
english_t = soup.body.text

### Web scraping articles from back to 1996

In [552]:
# Web crawling function to get all links from a webpage using beautiful soup

# WARNING: Take care before reusing this function. It is one of my first attempts at 
# writing a function to scrape multiple pages and is quite hacky. It works for this case but may not
# work as desired in other cases.

# Arguments are:
# url: starting url
# base_url: the url of website homepage
# must_include: regex that the links must include (default is the wildcard character ".")
# must_not_include: regex that the links must not include (default is the nonsense string "xasdfcasdf" )

def get_all_links(url,base_url,existing_links,must_include = ".",must_not_include = "xasdfcasdf"):
    content = requests.get(url).content
    soup = BeautifulSoup(content,'lxml')
    links = []
    existing_links = existing_links.copy() # this copy is required to prevent modifying the original list (side effect)
    for anchor in soup.findAll("a"):
        # if the anchor element doesn't have an href, it isn't a proper link, so skip to the next anchor
        try:
            link = anchor['href']
        except:
            continue 
            
        full_link = None 
        if re.search("^http",link): # if link starts with http it is either an external link or internal with full url
            if base_url in link: # exclude links to external sites
                full_link = link 
        elif re.search("^/",link): # if link is an internal link relative to the base_url, create full link from base_url
            full_link = (base_url + "/" + anchor['href'])
        else:
            full_link = (url + "/" + anchor['href']) # if link is relative to the current page, append it to the current url
            
        # filter out links based on various conditions    
        if ((full_link not in existing_links) # ignore links that have already been found
             and full_link is not None
             and (not re.search("#",full_link)) # filter out links to id's on the same page
             and (not re.search("htm.*htm",full_link)) # this is hack because of some buggy behaviour - should fix
             and  re.search("htm$",full_link) 
             and re.search(must_include,full_link)
             and (not re.search(must_not_include,full_link))):
            
            # add to list of links found on the whole site so far
            # (the existing_links variable here has local scope only)
            existing_links.append(full_link)
            
            # add to list of links found within this function invocation
            links.append(full_link)
            
    return links
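
The three branches that build `full_link` could alternatively be handled by the standard library's `urllib.parse.urljoin`, which resolves absolute, root-relative and page-relative links in one call. This is a sketch of that alternative, not the function used above; `resolve_link` is a hypothetical name and the article file name `basu.htm` is made up for the example. Note that it strips `#fragment` parts, whereas the function above discards such links entirely.

```python
from urllib.parse import urljoin, urldefrag

def resolve_link(page_url, href):
    """Resolve an href found on page_url into an absolute URL without a fragment."""
    absolute = urljoin(page_url, href)
    # Drop any "#section" part so the same page is not collected twice
    without_fragment, _fragment = urldefrag(absolute)
    return without_fragment

page = "https://www.imf.org/external/pubs/ft/fandd/2018/03/index.htm"
root_relative = resolve_link(page, "/external/pubs/ft/fandd/2018/03/basu.htm")
page_relative = resolve_link(page, "basu.htm#top")
external = resolve_link(page, "https://example.com/a.htm")
```

Resolving links this way avoids doubled slashes and repeated path segments, which may be what the `htm.*htm` filter above is working around.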


In [576]:
%%time
# Get all internal links from the imf finance and development publications website

f = open("imf_links.txt","w") # file to save the links

base_url = "https://www.imf.org"
must_not_include = "fandd/spa|fandd/fre|fandd/rus|fandd/chi|fandd/ara|fandd/ger"
all_links = []

# The online magazine is published quarterly
# loop over the years and quarters
for year in range(1996,2019):
    for month in ("03","06","09","12"):
        must_include = "external/pubs/ft/fandd/{}/{}".format(year,month) # only get links from current edition
        current_existing = [] # initialise list of links that have been visited within this loop
        current_existing.append("https://www.imf.org/external/pubs/ft/fandd/{}/{}".format(year,month))
        for link in current_existing:
            # get all new links from within this link and add them to the list of links already found
            current_existing.extend(get_all_links(link,base_url,current_existing,must_include,must_not_include))
        all_links.extend(current_existing)
        for link in current_existing:
            f.write((link + "\n"))

f.close()


CPU times: user 1min 3s, sys: 3.35 s, total: 1min 6s
Wall time: 6min 51s


In [577]:
%%time

# Visit each link
# Extract all text
# Break the text into sentences with nltk (natural language tool kit) sentence tokenizer 

import nltk
all_sentences = []
for link in all_links:
    content = requests.get(link).content
    soup = BeautifulSoup(content,'lxml')
    text = soup.body.text
    sentences = nltk.sent_tokenize(text) # this splits the text into sentences
    all_sentences.extend(sentences)

CPU times: user 1min 9s, sys: 3.1 s, total: 1min 12s
Wall time: 4min 4s
