### Basic Keywords Search 
- This is a simple example of how to search through documents, find keywords, and get the paragraph context.
- We will use all Staff Reports from 2000 to 2016. Most of the data came from the COM XML database, but I also patched some missing years with IR documents. That is why the code processes two different data sources.
- A metadata sheet is also available to match all our documents with the proper metadata.

In [1]:
### import the modules we are going to use
import os 
python_root = '..'
import sys
sys.path.insert(0, python_root)
import data_util
import pickle
import csv
import re
from collections import Counter

- make sure you have pandas and nltk installed 

In [2]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

#### Download preprocessed data

- In 1_process_xmls, we showed one way of processing COM's XML database. Here, I have already compiled a small database with every Article IV document I could find from 1990 to 2016.
- All documents have been preprocessed into the document object that we defined earlier in 1_process_xmls; detailed steps can be found in ../data_util.
- From the link, you can download all the preprocessed data, as well as the raw data in case you want to process it differently.

In [3]:
### specify download path and extract path 
download_path = "staff_reports_full_sample.zip"
download_link = "https://www.dropbox.com/s/23vdujvh67io2iz/staff_reports_full_sample.zip?dl=1"
extract_path = "./data"  # place data in Python project root folder

In [4]:
## details of the download_data function are in the data_util module in python_root
## if you do not yet have the data, run this cell; it will set up a data folder under the ./Python folder
data_util.download_data(download_path,download_link,extract_path)

article iv: 589MB [02:31, 3.87MB/s]                                            


#### 1. Search

- Load xml documents and txt documents from pickle

In [5]:
## load xml_dict and txt_dict (context managers make sure the files are closed)
with open('data/xlm_docs.p', "rb") as f:
    xml_dict = pickle.load(f)
with open('data/txt_docs.p', "rb") as f:
    txt_dict = pickle.load(f)

## merge them together 
# if you are running python 3.5 or later, you can use this one line to merge them 
doc_dict = {**xml_dict,**txt_dict}

## print some information 
doc_ids = list(doc_dict.keys())
print("number of documents: ", len(doc_dict))
print("\nOne example: \n\ndoc_id: {} \n\nFirst paragraph: \n{}".format(doc_ids[0],doc_dict[doc_ids[0]].paras[0]))

number of documents:  2960

One example: 

doc_id: 9781451821079 

First paragraph: 
1. At the conclusion of the last Article IV consultation in March 2002 (SUR/02/33, 3/19/02), Directors commended Kenya for achieving a measure of macroeconomic stability during recent years in difficult circumstances. Directors, however, were concerned that the macroeconomic and financial situation remained fragile, and that investor confidence was low. They stressed the importance of implementing a comprehensive medium-term economic and structural reform program and undertaking measures to address the governance problems that had stalled progress to date. Directors stressed that it was important for Kenya to implement the prior actions needed to resume the Poverty Reduction and Growth Facility (PRGF)-supported program and to help restore confidence.1
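
The one-line merge used in the cell above needs Python 3.5 or later. A minimal sketch of an equivalent merge for older versions is shown here; the two small dictionaries are made-up stand-ins for `xml_dict` and `txt_dict`, not the real data.

```python
# Toy stand-ins for xml_dict and txt_dict (hypothetical ids and values).
xml_docs = {"id_1": "xml doc 1", "id_2": "xml doc 2"}
txt_docs = {"id_2": "txt doc 2", "id_3": "txt doc 3"}

# Copy the first dict, then update it with the second.
# This behaves like {**xml_docs, **txt_docs}.
merged = dict(xml_docs)
merged.update(txt_docs)

print(len(merged))
print(merged["id_2"])
```

Either way, if the same document id appears in both sources, the entry from the second dictionary replaces the one from the first, so it is worth checking for overlapping ids before merging.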


- Get the keywords. I am reading them from a CSV file, but you can of course simply define them as a list in your script.

In [6]:
### get keyword list 
def read_keywords(file):
    """
    file: csv file with keyword list
    """
    with open(file,'r') as f:
        reader = csv.reader(f)
        mylist = list(reader)
    return mylist

key_file = 'search_keywords.csv'
keywords = read_keywords(key_file)
keywords = [k[0].replace('_',' ') for k in keywords] ## in keywords we used _ as [space], change it back 
print('Keywords: ', keywords[:10])

Keywords:  ['asset quality', 'bad asset', 'bad debt', 'bad loan', 'balance sheet mismatch', 'balance-sheet mismatch', 'bank profitability', 'bankruptcies', 'bankruptcy', 'bond spread']
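
To make the expected file layout concrete, here is a small sketch that parses an in-memory sample in the same way as `read_keywords` and compares it with a list defined directly in the script. The sample rows are an assumption based on the output above (one keyword per row, underscores standing in for spaces); the real `search_keywords.csv` may contain more columns.

```python
import csv
import io

# Hypothetical sample of the keyword file: one keyword per row,
# with "_" used in place of a space.
sample_csv = "asset_quality\nbad_loan\nbalance-sheet_mismatch\n"

rows = list(csv.reader(io.StringIO(sample_csv)))
# Take the first column of each non-empty row and restore the spaces.
keywords_from_file = [row[0].replace("_", " ") for row in rows if row]

# The same keywords defined directly as a list.
keywords_inline = ["asset quality", "bad loan", "balance-sheet mismatch"]

print(keywords_from_file == keywords_inline)
```

Note that only underscores are replaced, so hyphenated keywords such as `balance-sheet mismatch` keep their hyphen.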


- Use regular expressions to find keywords in your text.
- Regular expressions can get really complicated; we may have a standalone tutorial on them.
- What we are trying to achieve here is to feed in a list of keywords and get back the keywords found, along with their frequencies.

In [7]:
## define functions to locate keywords
def construct_rex(keywords):
    r_keywords = [r'\b' + re.escape(k) + r'(s|es)?\b' for k in keywords]   # transform the keyword list into a pattern list, allowing an optional s or es ending
    rex = re.compile('|'.join(r_keywords),flags=re.I)                        # join all patterns with "or", ignore case
    #match = [(m.start(),m.group()) for m in rex.finditer(content)]          # get the position and the word
    return rex

def find_exact_keywords(content,keywords,rex=None):
    if rex is None: 
        rex = construct_rex(keywords)
    content = content.replace('\n', '').replace('\r', '')#.replace('.',' .')
    match = Counter([m.group() for m in rex.finditer(content)])             # get all instances of matched words 
                                                                             # and turn them into a Counter object, to see frequencies
    return match
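
Before running this on real paragraphs, a toy sketch may help show what the pattern does and does not match. The keywords and the sentence below are made up for illustration; the pattern is built the same way as in `construct_rex`.

```python
import re
from collections import Counter

# Made-up keywords and text, for illustration only.
toy_keywords = ["bad loan", "bankruptcy", "bond spread"]
pattern = re.compile(
    "|".join(r"\b" + re.escape(k) + r"(s|es)?\b" for k in toy_keywords),
    flags=re.I,
)

text = ("Bad loans rose; one bad loan led to a bankruptcy, "
        "then more bankruptcies. Bond spreads widened.")

# Count each matched string, lower-cased so "Bad loans" and "bad loans" agree.
counts = Counter(m.group().lower() for m in pattern.finditer(text))
print(counts)
```

Two things are worth noticing. Singular and plural forms are counted under separate keys (`bad loan` and `bad loans`), so they end up as separate columns later on. And the optional `s`/`es` ending does not cover irregular plurals: `bankruptcies` is not matched by the keyword `bankruptcy`, which is presumably why the keyword list above contains both forms.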

- Let's see how it works

In [8]:
content = doc_dict[doc_ids[100]].paras[9]
print("Content: \n{}".format(content))
rex = construct_rex(keywords)
results = find_exact_keywords(content.lower(),keywords,rex)
print("\nkeywords found: \n {}".format(results))

Content: 
10. Remarkable progress has been made in consolidating the peace process culminating in the agreement signed in Pretoria on December 17,2002. In line with the Lusaka accords of 1999, peace agreements were signed with Rwanda (end-July 2002) and Uganda (early September 2002), and these countries have all but completed the withdrawal of their troops. On the government side, Angola is also completing the withdrawal of its troops, while Namibia and Zimbabwe have already done so. On November 11, 2002, Presidents Kabila and Kagame agreed to extend the period envisaged in the original peace agreement by three months to allow for the disarming and repatriation of ex-Rwandan Hutu soldiers. Meanwhile, Phase III of the United Nations Observation Mission (MONUC) to the DRC continues and the regional demobilization and reintegration program is gradually being put in place, with the help of the United Nations (UN) and the World Bank. On December 5, 2002, the UN Security Council passed Resol

#### Now, let's loop through all the documents we have and put all the keyword frequencies together
- Depending on how many documents you have, this may take a while.
- You are more than welcome to modify this to speed it up. If all you care about is document-level statistics, you can concatenate all paragraphs and do one regex search per document, which will be much faster.
- We are doing it at the paragraph level, just in case you need paragraph context in the future.
- You can also try running it in parallel with multiprocessing, if time is really precious to you. You can find some useful information here: https://pymotw.com/2/multiprocessing/basics.html
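
The document-level shortcut mentioned in the list above could look roughly like the following sketch. The toy documents (built with `SimpleNamespace` so that they expose a `paras` attribute) and the two-keyword pattern are assumptions for illustration; with the real data you would use `doc_dict` and the compiled `rex` instead.

```python
import re
from collections import Counter
from types import SimpleNamespace

# Toy documents that mimic the `paras` attribute of the real document objects.
toy_docs = {
    "doc_a": SimpleNamespace(paras=["Bad loans rose.",
                                    "Asset quality fell; bad loans doubled."]),
    "doc_b": SimpleNamespace(paras=[]),
}
toy_rex = re.compile(r"\b(?:bad loan|asset quality)(?:s|es)?\b", flags=re.I)

doc_counts = {}
for doc_id, doc in toy_docs.items():
    if not doc.paras:                           # skip documents with no paragraphs
        continue
    full_text = " ".join(doc.paras).lower()     # one string per document
    doc_counts[doc_id] = Counter(m.group() for m in toy_rex.finditer(full_text))

print(doc_counts)
```

Joining with a space keeps words at paragraph boundaries from running together. The trade-off is that the paragraph context is lost, so this only suits document-level statistics.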

In [9]:
total_doc_num = len(doc_dict)
rows = []                         ## collect one row per matching paragraph (DataFrame.append was removed in pandas 2.0)
rex = construct_rex(keywords)     ## construct regular expression 

for idx,(key,doc) in enumerate(doc_dict.items()):
    #if idx > 500: break
    val = doc.paras   ## read all paragraphs 
    if len(val)==0:   ## skip if document has no paragraphs 
        continue
    for idxp,content in enumerate(val):                                    ## loop through each paragraph in a document 
        results = find_exact_keywords(content.lower(),keywords,rex)        ## get keyword frequencies in each paragraph 
        if len(results) == 0:                                              ## if nothing found, skip 
            continue
        results['context'] = content                                    ## if a keyword is found, keep the paragraph as context
        results['doc_id'] = str(key)                                    ## assign document id 
        results['para_id'] = str(key) + '_' + str(idxp)                 ## assign paragraph id, we may not use this at all, but just in case
        results['para_word_count'] = len(content.split())               ## get paragraph word count 
        rows.append(dict(results))
    
    if idx%100 == 0 :                                     ## print every 100 iterations 
        print('{}/{}'.format(idx,total_doc_num))

df = pd.DataFrame(rows)           ## build the data frame once, after the loop

0/2960
100/2960
200/2960
300/2960
400/2960
500/2960
600/2960
700/2960
800/2960
900/2960
1000/2960
1100/2960
1200/2960
1300/2960
1400/2960
1500/2960
1600/2960
1700/2960
1800/2960
1900/2960
2000/2960
2100/2960
2200/2960
2300/2960
2400/2960
2500/2960
2600/2960
2700/2960
2800/2960
2900/2960


In [10]:
## move doc_id, para_id, context and para_word_count to the first four columns
keys = df.columns
meta = ['doc_id','para_id','context','para_word_count']
keys = [i for i in keys if i not in meta]
meta.extend(keys)
df = df[meta]
df.head()

Unnamed: 0,doc_id,para_id,context,para_word_count,investor confidence,nonperforming,npls,risk premium,restructuring,asset quality,...,maturity transformation,capital flow reversal,liquidity crunches,cds premia,hard landings,risk appetites,counterparty credit risk,depositor runs,sovereign defaults,solvents
0,9781451821079,9781451821079_0,1. At the conclusion of the last Article IV co...,106.0,1.0,,,,,,...,,,,,,,,,,
1,9781451821079,9781451821079_17,10. The lack of financial services and access ...,107.0,,1.0,2.0,,,,...,,,,,,,,,,
2,9781451821079,9781451821079_19,12. During 1998–2001 and the first half of 200...,228.0,,,,3.0,,,...,,,,,,,,,,
3,9781451821079,9781451821079_30,A strong implementation of the governance agen...,304.0,,,,,1.0,,...,,,,,,,,,,
4,9781451821079,9781451821079_33,a slower implementation of the governance agen...,109.0,,,,,1.0,,...,,,,,,,,,,


In [11]:
### save results to csv 
df.to_csv('data/search_results.csv',encoding='utf-8')
print('finished')

finished


##### Now every row is a paragraph with its keyword frequencies. In most cases, we would want country-level aggregated data. Please see the follow-up notebook for data aggregation and visualization.

##### Alternatively, you can simply import all this data into Stata and aggregate it there. I have attached a Stata do-file for your reference.
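
As a rough idea of what the aggregation step involves, here is a minimal pandas sketch that rolls paragraph-level rows up to the document level. The small data frame is made up and only mimics the column layout of the search results; going on to the country level would additionally require merging in the metadata sheet, which is not shown here.

```python
import pandas as pd

# Made-up paragraph-level rows with the same layout as the search results.
toy = pd.DataFrame({
    "doc_id": ["A", "A", "B"],
    "para_id": ["A_0", "A_3", "B_1"],
    "context": ["...", "...", "..."],
    "para_word_count": [100, 50, 80],
    "bad loan": [1, None, 2],
    "asset quality": [None, 3, None],
})

meta = ["doc_id", "para_id", "context", "para_word_count"]
keyword_cols = [c for c in toy.columns if c not in meta]

# Sum keyword counts per document; missing values are treated as zero.
doc_level = toy.groupby("doc_id")[keyword_cols + ["para_word_count"]].sum()
print(doc_level)
```

Keep in mind that `para_word_count` here only adds up paragraphs in which at least one keyword was found, so it is not the total word count of the document.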