# Purpose: 
The purpose of this project is to **accurately and quickly** match a database of articles that are related to development to the actual International Development Projects.  Prior to this attempt, people were manually going through articles and matching.  Automation would allow for a significant increase in the number of cases that could be tracked.  

### Method
Using Natural Language Processes (NLP) is a valuable technique in machine learning to understand text.  Words can be converted to numerical vectors and the similarities between articles and projects can be quantified to find **matches**.  The problem with this technique is that more articles that are related to development will have wording and themes simliar to many different projects, even if they are not directly related.  **OUR METHOD** aims to explore entity extraction methods.  An **entity** is a real-world object that has been assigned a name.  By only looking at the entitites extrated from the articles and comparing them with the entities in the project description, we feel that we can create a method that produces less false positives.

# import the necessary package

In [20]:
import pandas as pd
import pickle
import random
pd.set_option('display.max_columns',100)
%matplotlib inline

from scipy import spatial
import numpy as np
import spacy
from spacy import displacy
from geotext import GeoText

# load the dataset

**News Articles ** 

This is the "raw" newsdeed data that we want to try to match with projects. If you run into a protocol error try the '_py27.pkl' version 

In [21]:
news_data = pd.read_pickle('../Data/Feedly_Processed_DF_cleaned.pkl')

Note that the third item's `title` field looks to have a lot of content - there is likely cleaning needed on some of these fields. 

In [22]:
news_data.head(3)

Unnamed: 0,article_id,title,url,feed_label,content,published,summary,article_text,article_keywords,article_text_len,top_lang
10900,eebb9702,"India, World Bank sign financing agreement for...",http://www.abplive.in/business/india-world-ban...,NEWS WB- All Streams,,2017-12-21 09:22:12,"<table border=""0"" cellspacing=""3"" cellpadding=...","New Delhi [India], Dec 20 (ANI): A financing a...","[institutes, india, skill, financing, training...",1031,en
4268,6832ce57,Rs 40000-crore development projects in limbo i...,http://www.moneycontrol.com/news/business/econ...,NEWS AIIB - All Streams,,2017-12-10 09:40:00,"<table border=""0"" cellspacing=""3"" cellpadding=...","Development projects worth more than Rs 40,000...","[development, crore, andhra, eaps, state, proj...",4390,en
1663,30f8f65e,https://www.the-american-interest.com/2018/01/...,https://www.the-american-interest.com/2018/01/...,NEWS AFDB- All Streams,,2018-01-03 12:21:54,"<table border=""0"" cellspacing=""3"" cellpadding=...",Ten Lessons\n\nDevelopment with Chinese Charac...,"[transitions, university, chinese, united, dev...",575,en


In [23]:
news_data.columns

Index(['article_id', 'title', 'url', 'feed_label', 'content', 'published',
       'summary', 'article_text', 'article_keywords', 'article_text_len',
       'top_lang'],
      dtype='object')

** Projcets Data ** 

This is the "raw" unmatched project data 

In [24]:
project_info = pd.read_csv('../Data/EWS_Published Project_Listing_DD.csv',encoding='ISO-8859-1')

In [25]:
project_info.head(3)

Unnamed: 0,EWS ID,ProjectNumber,Published,Bank Risk Rating,Project Status,EWS URL,Detailed Analysis URL,Project Name,City,Country Count,Country 1,Country 2,Country 3,Country 4,Country 5,Country 6,Country 7,Country 8,Country 9,Country 10,Country 11,Country 12,Borrower or Client,Private Actor Count,Private Actor 1,Private Actor 2,Private Actor 3,Private Actor 4,Private Actor 5,Private Actor 6,Private Actor 7,Private Actor 8,Private Actor 9,Private Actor 10,Private Actor 11,Private Actor 12,Private Actor 13,Private Actor 14,Private Actor 15,Bank Count,Bank 1,Bank 2,Bank 3,Bank 4,Bank 5,Sector Count,Sector 1,Sector 2,Sector 3,Sector 4,Sector 5,Sector 6,Sector 7,Last Edited,Date Scraped,Date Disclosed,Board Date,Source URL,Project Cost,Investment Amount,Project Description,Contact Information
0,29164,AFDB-P-TN-BB0-007,Published,U,Proposed,https://ews.rightsindevelopment.org/projects/p...,,TUNISIA FERTILIZER PROJECT,,1,Tunisia,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,1,African Development Bank (AFDB),,,,,1,Agriculture and Forestry,,,,,,,9/4/17,8/15/17,12/13/01,12/13/01,http://www.afdb.org/en/projects-and-operations...,,,,ACCOUNTABILITY MECHANISM OF AfDB\r\r\r\rThe In...
1,29166,AFDB-P-SZ-HAA-001,Published,U,Approved,https://ews.rightsindevelopment.org/projects/p...,,LINE OF CREDIT TO SWAZILAND DEVELOPMENT FINANC...,,1,Swaziland,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,1,African Development Bank (AFDB),,,,,2,Finance,Industry and Trade,,,,,,9/4/17,8/15/17,12/13/01,5/12/17,http://www.afdb.org/en/projects-and-operations...,4.76,1.36,,MACHARIA Lilian Wanjiru - PIFD1\r\r\r\rACCOUNT...
2,29931,IADB-UR-T1100,Pending,C,Approved,https://ews.rightsindevelopment.org/projects/u...,,Supporting INEFOP in Improving Labor Training ...,,1,Uruguay,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,1,Inter-American Development Bank (IADB),,,,,0,,,,,,,,,10/3/17,12/31/99,7/16/13,http://www.iadb.org/en/projects/project-descri...,0.44,0.44,,


In [26]:
project_info.columns

Index(['EWS ID', 'ProjectNumber', 'Published', 'Bank Risk Rating',
       'Project Status', 'EWS URL', 'Detailed Analysis URL', 'Project Name',
       'City', 'Country Count', 'Country 1', 'Country 2', 'Country 3',
       'Country 4', 'Country 5', 'Country 6', 'Country 7', 'Country 8',
       'Country 9', 'Country 10', 'Country 11', 'Country 12',
       'Borrower or Client', 'Private Actor Count', 'Private Actor 1',
       'Private Actor 2', 'Private Actor 3', 'Private Actor 4',
       'Private Actor 5', 'Private Actor 6', 'Private Actor 7',
       'Private Actor 8', 'Private Actor 9', 'Private Actor 10',
       'Private Actor 11', 'Private Actor 12', 'Private Actor 13',
       'Private Actor 14', 'Private Actor 15', 'Bank Count', 'Bank 1',
       'Bank 2', 'Bank 3', 'Bank 4', 'Bank 5', 'Sector Count', 'Sector 1',
       'Sector 2', 'Sector 3', 'Sector 4', 'Sector 5', 'Sector 6', 'Sector 7',
       'Last Edited', 'Date Scraped', 'Date Disclosed', 'Board Date',
       'Source URL', 'Pro

**Labeled Projcet Data ** 

There isn't much of this data - so it is wise to treat this as a test/validation set for any algorithm you develop. Perhaps you can use a small sample of this data to build intuition for designing your algorithm. 

In [8]:
projects_labeled = pd.read_csv('../Data/Labeled_Data/projects.csv',encoding='ISO-8859-1')

In [9]:
projects_labeled.head(3)

Unnamed: 0,article_id,published,title,url,feed_label,ProjectNumber,EWS Project Name,EWS hyperlink,Matched
0,10f9ed2,1/11/18,ADB Provides Support for Three Infrastructure ...,http://moderndiplomacy.eu/2018/01/11/adb-provi...,NEWS ADB - All Streams,"ADB-41123-015, ADB-48158-001, ADB-41435-053",Road Network Improvement Project (formerly Sec...,https://ewsdata.rightsindevelopment.org/projec...,1
1,c0eece9b,5/13/18,ADB Helps Inaugurate New Power Distribution Ne...,http://feedproxy.google.com/~r/adb_news/~3/2My...,NEWS ADB - All Streams,ADB-47282-001,Energy Supply Improvement Investment Program (...,https://ewsdata.rightsindevelopment.org/projec...,1
2,d1d79dd8,2/20/18,ADB Provides $360 Million for Rolling Stock to...,http://feedproxy.google.com/~r/adb_news/~3/v9s...,NEWS ADB - All Streams,ADB-50312-003,Railway Rolling Stock Operations Improvement P...,https://ewsdata.rightsindevelopment.org/projec...,1


# Possible Places to Start 

1. Extract features from the News Articles and the raw Projects data (bank, keywords, country etc). From news article with extract the banks from feed_label. feed_label is from news article. 

2. Maybe start with title and banks - see if you are getting any matches that make sense.

In [10]:
project_info.columns

Index(['EWS ID', 'ProjectNumber', 'Published', 'Bank Risk Rating',
       'Project Status', 'EWS URL', 'Detailed Analysis URL', 'Project Name',
       'City', 'Country Count', 'Country 1', 'Country 2', 'Country 3',
       'Country 4', 'Country 5', 'Country 6', 'Country 7', 'Country 8',
       'Country 9', 'Country 10', 'Country 11', 'Country 12',
       'Borrower or Client', 'Private Actor Count', 'Private Actor 1',
       'Private Actor 2', 'Private Actor 3', 'Private Actor 4',
       'Private Actor 5', 'Private Actor 6', 'Private Actor 7',
       'Private Actor 8', 'Private Actor 9', 'Private Actor 10',
       'Private Actor 11', 'Private Actor 12', 'Private Actor 13',
       'Private Actor 14', 'Private Actor 15', 'Bank Count', 'Bank 1',
       'Bank 2', 'Bank 3', 'Bank 4', 'Bank 5', 'Sector Count', 'Sector 1',
       'Sector 2', 'Sector 3', 'Sector 4', 'Sector 5', 'Sector 6', 'Sector 7',
       'Last Edited', 'Date Scraped', 'Date Disclosed', 'Board Date',
       'Source URL', 'Pro

In [11]:
project_info['Project Description'].head()

0    None
1    None
2     NaN
3     NaN
4     NaN
Name: Project Description, dtype: object

In [12]:
project_info.head()

Unnamed: 0,EWS ID,ProjectNumber,Published,Bank Risk Rating,Project Status,EWS URL,Detailed Analysis URL,Project Name,City,Country Count,Country 1,Country 2,Country 3,Country 4,Country 5,Country 6,Country 7,Country 8,Country 9,Country 10,Country 11,Country 12,Borrower or Client,Private Actor Count,Private Actor 1,Private Actor 2,Private Actor 3,Private Actor 4,Private Actor 5,Private Actor 6,Private Actor 7,Private Actor 8,Private Actor 9,Private Actor 10,Private Actor 11,Private Actor 12,Private Actor 13,Private Actor 14,Private Actor 15,Bank Count,Bank 1,Bank 2,Bank 3,Bank 4,Bank 5,Sector Count,Sector 1,Sector 2,Sector 3,Sector 4,Sector 5,Sector 6,Sector 7,Last Edited,Date Scraped,Date Disclosed,Board Date,Source URL,Project Cost,Investment Amount,Project Description,Contact Information
0,29164,AFDB-P-TN-BB0-007,Published,U,Proposed,https://ews.rightsindevelopment.org/projects/p...,,TUNISIA FERTILIZER PROJECT,,1,Tunisia,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,1,African Development Bank (AFDB),,,,,1,Agriculture and Forestry,,,,,,,9/4/17,8/15/17,12/13/01,12/13/01,http://www.afdb.org/en/projects-and-operations...,,,,ACCOUNTABILITY MECHANISM OF AfDB\r\r\r\rThe In...
1,29166,AFDB-P-SZ-HAA-001,Published,U,Approved,https://ews.rightsindevelopment.org/projects/p...,,LINE OF CREDIT TO SWAZILAND DEVELOPMENT FINANC...,,1,Swaziland,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,1,African Development Bank (AFDB),,,,,2,Finance,Industry and Trade,,,,,,9/4/17,8/15/17,12/13/01,5/12/17,http://www.afdb.org/en/projects-and-operations...,4.76,1.36,,MACHARIA Lilian Wanjiru - PIFD1\r\r\r\rACCOUNT...
2,29931,IADB-UR-T1100,Pending,C,Approved,https://ews.rightsindevelopment.org/projects/u...,,Supporting INEFOP in Improving Labor Training ...,,1,Uruguay,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,1,Inter-American Development Bank (IADB),,,,,0,,,,,,,,,10/3/17,12/31/99,7/16/13,http://www.iadb.org/en/projects/project-descri...,0.44,0.44,,
3,30104,IADB-BR-T1279,Pending,C,Approved,https://ews.rightsindevelopment.org/projects/b...,,"Racial Equality and Social, Economic, Politica...",,1,Brazil,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,1,Inter-American Development Bank (IADB),,,,,0,,,,,,,,,10/3/17,12/31/99,6/4/13,http://www.iadb.org/en/projects/project-descri...,0.97,0.82,,
4,30322,IADB-PE-T1297,Pending,C,Approved,https://ews.rightsindevelopment.org/projects/p...,,Adaptation to Climate Change of the Fishery Se...,,1,Peru,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,1,Inter-American Development Bank (IADB),,,,,0,,,,,,,,,10/3/17,12/31/99,12/4/13,http://www.iadb.org/en/projects/project-descri...,1.5,1.5,,


### Let's play around with a sample description:

In [28]:
sentence = project_info.iloc[6809]['Project Description']
sentence

"The project's objective is to provide the necessary conditions for the growth and competitiveness of businesses in Paraguay by supporting a network of impact oriented companies."

In [29]:
## Load an already-trained model from Spacy 
## this is a small english model, could look at more complex models
nlp = spacy.load('en')

In [30]:
## Make the sentence and NLP object
doc = nlp(sentence)
doc

The project's objective is to provide the necessary conditions for the growth and competitiveness of businesses in Paraguay by supporting a network of impact oriented companies.

Detect sentences within the block of text:

In [16]:
for sent in doc.sents:
    print(sent)

The project's objective is to provide the necessary conditions for the growth and competitiveness of businesses in Paraguay by supporting a network of impact oriented companies.


In [17]:
## Look at the tags for eah word in the sentence
print([(token.text, token.tag_) for token in doc])

[('The', 'DT'), ('project', 'NN'), ("'s", 'POS'), ('objective', 'NN'), ('is', 'VBZ'), ('to', 'TO'), ('provide', 'VB'), ('the', 'DT'), ('necessary', 'JJ'), ('conditions', 'NNS'), ('for', 'IN'), ('the', 'DT'), ('growth', 'NN'), ('and', 'CC'), ('competitiveness', 'NN'), ('of', 'IN'), ('businesses', 'NNS'), ('in', 'IN'), ('Paraguay', 'NNP'), ('by', 'IN'), ('supporting', 'VBG'), ('a', 'DT'), ('network', 'NN'), ('of', 'IN'), ('impact', 'NN'), ('oriented', 'VBN'), ('companies', 'NNS'), ('.', '.')]


### Explore the entities (real-world objects):

In [31]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Paraguay ORG


In [33]:
doc = nlp(sentence)
displacy.render(doc, style='ent', jupyter=True)

In [34]:
## Noun chunks: noun plus the words describing the noun
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)
    print(" ")

The project's objective NP objective
 
the necessary conditions NP conditions
 
the growth NP growth
 
competitiveness NP competitiveness
 
businesses NP businesses
 
Paraguay NP Paraguay
 
a network NP network
 
impact oriented companies NP companies
 


In [67]:
for token in doc:
    print("{}/{} <--{}-- {}/{}".format(token.text, 
                                      token.tag_,
                                      token.dep_,
                                      token.head.text,
                                      token.head.tag_))

The/DT <--det-- project/NN
project/NN <--poss-- objective/NN
's/POS <--case-- project/NN
objective/NN <--nsubj-- is/VBZ
is/VBZ <--ROOT-- is/VBZ
to/TO <--aux-- provide/VB
provide/VB <--xcomp-- is/VBZ
the/DT <--det-- conditions/NNS
necessary/JJ <--amod-- conditions/NNS
conditions/NNS <--dobj-- provide/VB
for/IN <--prep-- conditions/NNS
the/DT <--det-- growth/NN
growth/NN <--pobj-- for/IN
and/CC <--cc-- growth/NN
competitiveness/NN <--conj-- growth/NN
of/IN <--prep-- growth/NN
businesses/NNS <--pobj-- of/IN
in/IN <--prep-- growth/NN
Paraguay/NNP <--pobj-- in/IN
by/IN <--prep-- provide/VB
supporting/VBG <--pcomp-- by/IN
a/DT <--det-- network/NN
network/NN <--dobj-- supporting/VBG
of/IN <--prep-- network/NN
impact/NN <--npadvmod-- oriented/VBN
oriented/VBN <--amod-- companies/NNS
companies/NNS <--pobj-- of/IN
./. <--punct-- is/VBZ


#### dependency visualizer 

In [68]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

### Load a more complex model (will take some time)

In [35]:
nlp = spacy.load('en_core_web_lg')

### Quantify the simliarity between terms.

In [37]:
np.dot(nlp.vocab['chiquita'].vector, nlp.vocab['banana'].vector)

14.928883

In [38]:
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

In [39]:
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector
king = nlp.vocab['king'].vector

In [40]:
# We now need to find the closest vector in the vocabulary to the 
# result of "man" - "woman" + "queen"
maybe_king =  man - woman + queen

In [41]:
computed_similarities = []
for word in nlp.vocab:
    #ignore words wo vectors
    if not word.has_vector:
        continue
    similarity = cosine_similarity(maybe_king, word.vector)
    computed_similarities.append((word, similarity))

In [42]:
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

In [43]:
## These are the words that are similar to the vector maybe_king
print([w[0].text for w in computed_similarities[:10]])

['Queen', 'QUEEN', 'queen', 'King', 'KING', 'king', 'KIng', 'Kings', 'KINGS', 'kings']


#### Back to the project

In [47]:
news_sentence = nlp(news_data.iloc[4268]['article_text'])
news_sentence

2018-02-08 00:00:06

In 2009, the New York Times published an article titled: In a Digital Future, Textbooks are History. Since then, many have queried whether textbooks are needed in a digital world, given that extensive and often free online resources are now available.



The digital era has challenged conventional practices on textbooks. Should physical textbooks be replaced by packaged digital materials? With free online resources, should governments stop allocating budgets to textbooks? Are schools in developing countries using digital sources to improve learning?



The answers to these questions are not binary but depend on the development context of countries including their digital infrastructure availability, affordability and capacity. Even as developing countries catch up on the digital curve, the good old physical textbook cannot be debunked.

Digital learning materials are indispensable in the modern era. But it’s just as important that they be supported by next-generati

In [48]:
project_description = nlp(sentence)

In [49]:
## Similarity between the project description and the news sentence
project_description.similarity(news_sentence)

0.9317586846877413

In [50]:
news_data.columns

Index(['article_id', 'title', 'url', 'feed_label', 'content', 'published',
       'summary', 'article_text', 'article_keywords', 'article_text_len',
       'top_lang'],
      dtype='object')

#### Apply NLP to article text and determine the entitites in each article

In [51]:
article = news_data.iloc[5173]['article_text']

In [52]:
news_data.article_text = news_data.article_text.apply(lambda x: x.replace('\n', ''))

In [None]:
## Take a long time to run
news_data['article_text_nlp'] = news_data.article_text.apply(nlp)

In [None]:
news_data['article_text_entities'] = news_data.article_text_nlp.apply(lambda x: x.ents)

# When you want to Test your algorithm 

* The projects in the labeled data should also present in the projects data. 
* WATCH OUT : Some articles are linked to multiple projects so you need to decide on how to deal with before you just join the files
* ALSO : The labeled data inlcudes articles that don't match projects (Articles that should be classified as having NO Match) - you can see this if the "Matched" field = 0. 

In [162]:
projects_labeled.columns

Index(['article_id', 'published', 'title', 'url', 'feed_label',
       'ProjectNumber', 'EWS Project Name', 'EWS hyperlink', 'Matched'],
      dtype='object')

In [164]:
news_data.columns

Index(['article_id', 'title', 'url', 'feed_label', 'content', 'published',
       'summary', 'article_text', 'article_keywords', 'article_text_len',
       'top_lang'],
      dtype='object')

In [167]:
project_info.columns

Index(['EWS ID', 'ProjectNumber', 'Published', 'Bank Risk Rating',
       'Project Status', 'EWS URL', 'Detailed Analysis URL', 'Project Name',
       'City', 'Country Count', 'Country 1', 'Country 2', 'Country 3',
       'Country 4', 'Country 5', 'Country 6', 'Country 7', 'Country 8',
       'Country 9', 'Country 10', 'Country 11', 'Country 12',
       'Borrower or Client', 'Private Actor Count', 'Private Actor 1',
       'Private Actor 2', 'Private Actor 3', 'Private Actor 4',
       'Private Actor 5', 'Private Actor 6', 'Private Actor 7',
       'Private Actor 8', 'Private Actor 9', 'Private Actor 10',
       'Private Actor 11', 'Private Actor 12', 'Private Actor 13',
       'Private Actor 14', 'Private Actor 15', 'Bank Count', 'Bank 1',
       'Bank 2', 'Bank 3', 'Bank 4', 'Bank 5', 'Sector Count', 'Sector 1',
       'Sector 2', 'Sector 3', 'Sector 4', 'Sector 5', 'Sector 6', 'Sector 7',
       'Last Edited', 'Date Scraped', 'Date Disclosed', 'Board Date',
       'Source URL', 'Pro

In [176]:
project_info_snip = project_info[['ProjectNumber', 'Project Description', 'Project Name']].copy()
project_info_snip['Project Description'] = project_info_snip['Project Description'].astype('str')
project_info_snip['Project Name'] = project_info_snip['Project Name'].astype('str')
project_info_snip['Project Description'] = project_info_snip['Project Description'].apply(lambda x: x.replace('\n', ''))

news_data_snip = news_data[['article_id', 'article_text', 'article_keywords', 'title']].copy()
news_data_snip.article_text = news_data_snip.article_text.apply(lambda x: x.replace('\n', ''))
news_data_snip.rename(columns={'title':'article_title'}, inplace=True)

projects_labeled_snip = projects_labeled[['article_id', 'ProjectNumber', 'Matched']].copy()
projects_labeled_snip['ProjectNumber'] = projects_labeled_snip['ProjectNumber'].astype(str)
#projects_labeled_snip = projects_labeled_snip.dropna()

In [177]:
projects_labeled_snip.shape

(123, 3)

In [178]:
## Explode rows with multiple project numbers
x = projects_labeled_snip.assign(**{'ProjectNumber':projects_labeled_snip['ProjectNumber'].str.split(',')})
projects_labeled_snip_explode = pd.DataFrame({col:np.repeat(x[col].values, x['ProjectNumber'].str.len())
                                              for col in x.columns.difference(['ProjectNumber'])}).assign(**{'ProjectNumber':np.concatenate(x['ProjectNumber'].values)})[x.columns.tolist()]

In [179]:
projects_labeled_snip_explode.shape

(136, 3)

In [181]:
list(projects_labeled_snip.ProjectNumber.unique())

['ADB-41123-015, ADB-48158-001, ADB-41435-053',
 'ADB-47282-001',
 'ADB-50312-003',
 'ADB-50410-001',
 'nan',
 'AIIB-000011',
 'AIIB-000015',
 'AIIB-000015, AIIB-0003, AIIB-00057',
 'AIIB-000019',
 'AIIB-000020',
 'AIIB-000021',
 'AIIB-000023',
 'AIIB-000038',
 'AIIB-000079',
 'AIIB-0003, AIIB-000015, AIIB-00057',
 'AIIB-00057',
 'EBRD-46778',
 'EBRD-48132',
 'EBRD-48576',
 'EBRD-49078',
 'EBRD-49188',
 'EBRD-49222',
 'EBRD-49556',
 'EBRD-49649',
 'EIB-20140596',
 'EIB-20140645',
 'EIB-20150676',
 'EIB-20160341',
 'EIB-20160727',
 'EIB-20160816',
 'EIB-20161001',
 'EIB-20170105',
 'EIB-20170819',
 'EBRD-48424',
 'GCF-FP080, WB-P160383',
 'IADB-EC-L1111',
 'IIC-12063-02',
 'IIC-12116-01',
 'ADB-36330-033',
 'WB-P146330',
 'WB-P148775',
 'WB-P160408',
 'WB-P161234',
 'WB-P162422',
 'WB-P163628',
 'EIB-20170635',
 'EIB-20160848',
 'ADB-47320-001',
 'IIC-11794-04',
 'IADB-BH-L1035',
 'IIC-12114-01',
 'IADB-JA-G1002',
 'IIC-12161-01']

In [18]:
train_df = projects_labeled_snip_explode.merge(news_data_snip, on='article_id', how='inner')

In [19]:
train_df = train_df.merge(project_info_snip, on='ProjectNumber', how='outer')
train_df = train_df[train_df.Matched.isin([0,1])]
train_df['Matched'] = train_df['Matched'].astype(int)

In [20]:
#train_df.to_csv('../Data/train_df_clean.csv')

In [21]:
train_df.columns

Index(['article_id', 'ProjectNumber', 'Matched', 'article_text',
       'article_keywords', 'article_title', 'Project Description',
       'Project Name'],
      dtype='object')

In [22]:
nlp = spacy.load('en_core_web_lg')

In [23]:
train_df.loc[train_df['Project Description'].apply(type) != str]

Unnamed: 0,article_id,ProjectNumber,Matched,article_text,article_keywords,article_title,Project Description,Project Name
1,10f9ed2,ADB-48158-001,1,The Asian Development Bank (ADB) today signed ...,"[development, supply, tonle, million, infrastr...",ADB Provides Support for Three Infrastructure ...,,
2,10f9ed2,ADB-41435-053,1,The Asian Development Bank (ADB) today signed ...,"[development, supply, tonle, million, infrastr...",ADB Provides Support for Three Infrastructure ...,,
12,c2a956dd,,0,"YEREVAN, May 7. /ARKA/. The Asian Development ...","[inclusion, growth, financial, services, smes,...","ADB loan, equity to Ameriabank to help promote...",,
13,9fd5c398,,0,Board of the African Development Bank (AfDB) h...,"[african, development, support, longterm, prom...",AfDB okays $10million bond support fund,,
14,b2d42591,,0,"(Ecofin Agency) - Tomorrow March 27, 2018, the...","[smes, african, sector, agriculture, support, ...",AfDB to launch 12 projects for agricultural SM...,,
15,86d7fa70,,0,"ZIMBORDERS, a locally-owned company, has been ...","[post, rehabilitation, border, zimborders, ent...",US$100m for Beitbridge border rehabilitation,,
16,fd6ca411,,0,Try OOSKAnews current Daily Water Briefings or...,"[newsletters, subscription, water, daily, week...",AfDB Funds USD 101 Million Angola Projects,,
17,82edb610,,0,Harare BureauTHE African Development Bank (AfD...,"[25m, sector, development, foreign, finance, e...",AfDB extends $25m loan facility to private sector,,
18,98cae838,,0,"Business News of Tuesday, 6 February 2018Sourc...","[development, construct, adomi, amoakoatta, al...",Gov't secures $20m AfDB cash to construct link...,,
19,cd0ebdc7,,0,Construction of a four storey building to repl...,"[annex, construction, polytechnic, wing, used,...",Plem construction of Polytechnic College annex...,,


In [24]:
train_df['Project Description'] = train_df['Project Description'].astype('str')

In [25]:
train_df.loc[train_df['Project Name'].apply(type) != str]

Unnamed: 0,article_id,ProjectNumber,Matched,article_text,article_keywords,article_title,Project Description,Project Name
1,10f9ed2,ADB-48158-001,1,The Asian Development Bank (ADB) today signed ...,"[development, supply, tonle, million, infrastr...",ADB Provides Support for Three Infrastructure ...,,
2,10f9ed2,ADB-41435-053,1,The Asian Development Bank (ADB) today signed ...,"[development, supply, tonle, million, infrastr...",ADB Provides Support for Three Infrastructure ...,,
12,c2a956dd,,0,"YEREVAN, May 7. /ARKA/. The Asian Development ...","[inclusion, growth, financial, services, smes,...","ADB loan, equity to Ameriabank to help promote...",,
13,9fd5c398,,0,Board of the African Development Bank (AfDB) h...,"[african, development, support, longterm, prom...",AfDB okays $10million bond support fund,,
14,b2d42591,,0,"(Ecofin Agency) - Tomorrow March 27, 2018, the...","[smes, african, sector, agriculture, support, ...",AfDB to launch 12 projects for agricultural SM...,,
15,86d7fa70,,0,"ZIMBORDERS, a locally-owned company, has been ...","[post, rehabilitation, border, zimborders, ent...",US$100m for Beitbridge border rehabilitation,,
16,fd6ca411,,0,Try OOSKAnews current Daily Water Briefings or...,"[newsletters, subscription, water, daily, week...",AfDB Funds USD 101 Million Angola Projects,,
17,82edb610,,0,Harare BureauTHE African Development Bank (AfD...,"[25m, sector, development, foreign, finance, e...",AfDB extends $25m loan facility to private sector,,
18,98cae838,,0,"Business News of Tuesday, 6 February 2018Sourc...","[development, construct, adomi, amoakoatta, al...",Gov't secures $20m AfDB cash to construct link...,,
19,cd0ebdc7,,0,Construction of a four storey building to repl...,"[annex, construction, polytechnic, wing, used,...",Plem construction of Polytechnic College annex...,,


In [26]:
train_df['Project Name'] = train_df['Project Name'].astype('str')

In [27]:
def apply_nlp_col(df, col_name, nlp_model):
    df['{}_nlp'.format(col_name)] = df[col_name].apply(lambda x: nlp_model(x))
    return df

In [28]:
train_df = apply_nlp_col(train_df, 'article_title', nlp)
train_df = apply_nlp_col(train_df, 'article_text', nlp)

train_df = apply_nlp_col(train_df, 'Project Description', nlp)
train_df = apply_nlp_col(train_df, 'Project Name', nlp)

In [29]:
train_df.columns

Index(['article_id', 'ProjectNumber', 'Matched', 'article_text',
       'article_keywords', 'article_title', 'Project Description',
       'Project Name', 'article_title_nlp', 'article_text_nlp',
       'Project Description_nlp', 'Project Name_nlp'],
      dtype='object')

In [30]:
def apply_entity_col(df, col_name):
    df['{}_entities'.format(col_name)] = df['{}_nlp'.format(col_name)].apply(lambda x: x.ents)
    return df

In [31]:
train_df = apply_entity_col(train_df, 'article_title')
train_df = apply_entity_col(train_df, 'article_text')

train_df = apply_entity_col(train_df, 'Project Description')
train_df = apply_entity_col(train_df, 'Project Name')

In [32]:
train_df.columns

Index(['article_id', 'ProjectNumber', 'Matched', 'article_text',
       'article_keywords', 'article_title', 'Project Description',
       'Project Name', 'article_title_nlp', 'article_text_nlp',
       'Project Description_nlp', 'Project Name_nlp', 'article_title_entities',
       'article_text_entities', 'Project Description_entities',
       'Project Name_entities'],
      dtype='object')

In [130]:
text_ents_test = train_df['article_text_entities'].iloc[0]
proj_ents_test = train_df['Project Description_entities'].iloc[0]

In [131]:
proj_ents_test

(the Greater Mekong Subregion,
 GMS,
 SEC,
 Cambodia,
 Prey Veng,
 Siem Reap,
 Svay Rieng,
 Cambodia,
 the Ministry of Public Works and Transport (MPWT,
 HTAP,
 2014 2018,
 Guidelines on the Use of Consultants by Asian Development Bank,
 March 2013)

In [40]:
proj_ents_labels = [ent.label_ for ent in proj_ents_test]
text_ents_labels = [ent.label_ for ent in text_ents_test]

In [132]:
proj_ents_text = [ent.text for ent in proj_ents_test]
text_ents_text = [ent.text for ent in text_ents_test]

In [133]:
proj_ents_text

['the Greater Mekong Subregion',
 'GMS',
 'SEC',
 'Cambodia',
 'Prey Veng',
 'Siem Reap',
 'Svay Rieng',
 'Cambodia',
 'the Ministry of Public Works and Transport (MPWT',
 'HTAP',
 '2014 2018',
 'Guidelines on the Use of Consultants by Asian Development Bank',
 'March 2013']

In [85]:
proj_ents_text = " ".join(proj_ents_text)
text_ents_text = " ".join(text_ents_text)

In [86]:
nlp(proj_ents_text).similarity(nlp(text_ents_text))

0.8450022837775083

In [170]:
train_df['project_description_ent_text'] = train_df['Project Description_entities'].apply(lambda x: [ent.text for ent in x])
train_df['project_description_ent_text'] = train_df['project_description_ent_text'].apply(lambda x: ' '.join(x)).astype('str')

In [171]:
train_df['article_text_ent_text'] = train_df['article_text_entities'].apply(lambda x: [ent.text for ent in x])
train_df['article_text_ent_text'] = train_df['article_text_ent_text'].apply(lambda x: ' '.join(x)).astype('str')

In [None]:
np.where(df.applymap(lambda x: x == '')).

In [160]:
train_df['project_description_ent_text'] = np.where(train_df['project_description_ent_text'].apply(lambda x: x == ''), None, train_df['project_description_ent_text'])

In [166]:
train_df['project_description_ent_text'] = train_df['project_description_ent_text'].astype(str)

In [172]:
train_df['nlp_project_description_ent_text'] = train_df['project_description_ent_text'].apply(nlp)
train_df['nlp_article_text_ent_text'] = train_df['article_text_ent_text'].apply(nlp)

In [173]:
traind_df['similarity'] = train_df['nlp_article_text_ent_text'].similarity(train_df['nlp_project_description_ent_text'])

AttributeError: 'Series' object has no attribute 'similarity'

In [164]:
train_df['nlp_project_descriptaion_ent_text']

KeyError: 'nlp_project_description_ent_text'

In [None]:
nlp(text_ents_text)