# Knowledge Base Population Using Relation Extraction and Word2vec
**by JAPeTo**

Knowledge Base Population (KBP) is the task of automatically identifying relevant entities, learning and discovering attributes of their relations, and finally searching for other relations to expand the knowledge base (KB).

The idea is to take a small set of sample pairs, automatically infer the semantic relation they share, and expand the set with new pairs.
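
The expansion idea can be illustrated with a toy example. The sketch below is not the project's implementation: it assumes that a relation can be approximated by the average vector offset between the two members of each sample pair, and it uses made-up 2-D vectors (`toy_vectors`) in place of real word2vec embeddings. The helper names are hypothetical.

```python
import math

# Hypothetical 2-D "embeddings"; real word2vec vectors have hundreds of dimensions.
toy_vectors = {
    "Athens": (1.0, 1.0), "Greece": (1.0, 3.0),
    "Berlin": (2.0, 1.0), "Germany": (2.0, 3.1),
    "Paris": (3.0, 1.1), "France": (3.0, 3.0),
    "Tokyo": (4.0, 0.9), "Japan": (4.0, 3.0),
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_offset(pairs, vectors):
    """Average of (second - first) over all sample pairs."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for first, second in pairs:
        for i in range(dims):
            total[i] += vectors[second][i] - vectors[first][i]
    return [t / len(pairs) for t in total]

def predict_partner(word, offset, vectors):
    """Return the word whose vector is closest to vectors[word] + offset."""
    target = [v + o for v, o in zip(vectors[word], offset)]
    candidates = (w for w in vectors if w != word)
    return min(candidates, key=lambda w: euclidean(vectors[w], target))

samples = [("Athens", "Greece"), ("Berlin", "Germany")]
offset = mean_offset(samples, toy_vectors)
new_pairs = [(city, predict_partner(city, offset, toy_vectors)) for city in ("Paris", "Tokyo")]
print(new_pairs)  # [('Paris', 'France'), ('Tokyo', 'Japan')]
```

With real embeddings the nearest neighbour is often noisy, which is why the notebook later filters candidates with a classifier.
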
## Installation
1. **Prerequisites**
    You need the following:
    * Python >= 3.0
    * [gensim](https://radimrehurek.com/gensim/) library
    * *NumPy* and *SciPy*, which are installed as gensim dependencies

In [1]:
#!pip3 install gensim

2. **Setting paths**
    Sample settings in the `config.py` file:

    * **word2vec_file** - Path to the file with the word embeddings dataset.
    You can use any word2vec format (vec or bin) or custom vectors from the gensim library.
    Popular pre-trained datasets can be found on the official
    [word2vec page](https://code.google.com/archive/p/word2vec/), such as the [Google News dataset](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) (1.5GB).

    * **output_file** - The new, expanded set of pairs (entities with a possible semantic relation), without tags.
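
For illustration, a `config.py` holding the two settings above might look like the sketch below. The dictionary layout mirrors the `config` dictionary used in the notebook cells; the directory and file names are placeholders, not paths taken from the project.

```python
from pathlib import Path

# Placeholder locations; adjust them to wherever the files live on your machine.
DATA_DIR = Path("data")
OUTPUT_DIR = Path("outputs")

config = {
    # Pre-trained embeddings in word2vec format (.bin or .vec) or a gensim vector file.
    "word2vec_file": str(DATA_DIR / "GoogleNews-vectors-negative300.bin"),
    # Untagged candidate pairs produced by the expansion step.
    "output_file": str(OUTPUT_DIR / "new_pairs.txt"),
}
```
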


## Load libraries and utilities

In [2]:
import numpy as np
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from collections import Counter
import os
import time
import random
import re
import glob
import datetime
import math

config={}
config["word2vec_file"] = '/Users/macbookpro/Downloads/GoogleNews-vectors-negative300.bin'
config["input_kbase"] = '/Users/macbookpro/Desktop/nlp-workshop-2020/inputs/capitals.txt'
curr_date = str(datetime.date.today())
config["new_kbase"] = f'/Users/macbookpro/Desktop/nlp-workshop-2020/outputs/output_{curr_date}.txt'


# own libraries
from utilities import *
import embedd_utils as utils
import classes as model

## Additional functions

In [3]:
def embedding_object(word=None, vector=None):
    """
    This function serves as an interface to the embedding cache. If an embedding for the given word was
    already used, it returns that object. Otherwise it creates a new object with the specified vector.
    :param word: string
    :param vector: list of floats
    :return: Embedding
    """
    if word is None:
        return Embedding(vector=vector)
    cached = utils.cached_embedding(word)
    if cached is None:
        utils.embeddings[word] = Embedding(word=word, vector=vector)
    return utils.embeddings[word]

def method_names(x):
    return {
        1: 'avg',
        2: 'max',
        3: 'svm'
    }[x]

def similarity_names(x):
    return {
        1: 'euclidean',
        2: 'cosine',
    }[x]

def normalization_names(x):
    return {
        1: 'none',
        2: 'standard',
        3: 'softmax'
    }[x]
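
The names returned by `similarity_names` and `normalization_names` refer to standard measures. How the project's `classes` module computes them is not shown here, so the following is only a sketch of the textbook definitions, with `standard` assumed to mean z-score standardization. The function names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def standardize(scores):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [(s - mean) / std for s in scores] if std > 0 else [0.0] * len(scores)

def softmax(scores):
    """Exponentiate and rescale so that the results are positive and sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(cosine_similarity([1, 0], [0, 1]))    # 0.0
print(softmax([1.0, 1.0]))                  # [0.5, 0.5]
```

Euclidean distance is smaller for closer vectors, while cosine similarity is larger, so a ranking step has to sort in opposite directions depending on the measure.
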


## Load Word2Vec model
Load the pretrained model from the Google News dataset. The model cannot be refined with additional data.

In [4]:
model.model = utils.load_model(config["word2vec_file"])
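
`utils.load_model` is a project helper whose body is not shown. A plausible equivalent, assuming it wraps gensim's `KeyedVectors.load_word2vec_format`, is sketched below. `guess_binary` and `load_vectors` are hypothetical names, and the `limit` argument is used only to reduce memory usage while experimenting.

```python
def guess_binary(path):
    """Treat .bin files as binary word2vec format, everything else as text."""
    return path.lower().endswith(".bin")

def load_vectors(path, limit=None):
    """Load word vectors only; vectors loaded this way cannot be trained further."""
    # Imported here so that guess_binary can be used without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=guess_binary(path), limit=limit)

# Example (not executed here because it needs the downloaded dataset):
# vectors = load_vectors(config["word2vec_file"], limit=200_000)
# vectors.most_similar(positive=["Greece", "Berlin"], negative=["Athens"], topn=3)
```
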

## Create the recognizer
Build a **recognizer** from the dataset.

In [5]:
builder = model.PairSet.create_from_file(filename=config["input_kbase"])

The **recognizer** searches for candidate pairs.

In [6]:
new_pairs = builder.find_new_pairs(output=config["new_kbase"], result_count=3, neighborhood=5,
                                   method=method_names(2),
                                   distance=similarity_names(2),
                                   normalization=normalization_names(2))
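
The exact search performed by `find_new_pairs` lives in the project's `classes` module and is not reproduced here. As a rough illustration of what the `method` and `result_count` arguments could mean, the sketch below assumes that each candidate pair receives one similarity score per sample pair, that `avg` and `max` aggregate those scores, and that the `result_count` best candidates are kept. All scores and helper names are invented.

```python
def aggregate(scores, method):
    """Combine a candidate's per-sample similarity scores into one value."""
    if method == "avg":
        return sum(scores) / len(scores)
    if method == "max":
        return max(scores)
    raise ValueError(f"unknown method: {method}")

def top_candidates(scored, method, result_count):
    """Rank candidates by aggregated score, best first, and keep the top ones."""
    ranked = sorted(scored, key=lambda pair: aggregate(scored[pair], method), reverse=True)
    return ranked[:result_count]

# Invented similarity scores of three candidates against three sample pairs.
scored = {
    ("France", "Paris"): [0.80, 0.70, 0.75],
    ("Japan", "Tokyo"): [0.95, 0.40, 0.45],
    ("Hu", "Wen"): [0.30, 0.20, 0.10],
}
print(top_candidates(scored, "avg", 2))  # [('France', 'Paris'), ('Japan', 'Tokyo')]
print(top_candidates(scored, "max", 2))  # [('Japan', 'Tokyo'), ('France', 'Paris')]
```

Under these assumptions, `avg` rewards candidates that resemble all samples, whereas `max` rewards a strong match with any single sample.
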

Pair lists: samples and candidates

In [7]:
print("#"*5, "Samples")
show_content_file(config["input_kbase"], lines=8)

print("#"*5, "Candidates")
new_pairs.print(lines=8)

##### Samples
t Athens Greece
t Baghdad Iraq
t Bangkok Thailand
t Beijing China
t Berlin Germany
t Bern Switzerland
t Cairo Egypt
t Canberra Australia
##### Candidates
? Egypt Cairo
? Pakistan Islamabad
? Thailand Bangkok
? Russia Moscow
? Iran Tehran
? France Paris
? Japan Tokyo
? Cuba Havana


## Classifier
Train an SVM **classifier** with the samples and candidates.

Before running, build [SVM_perf](http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html):
- Download
- Compile
- Set the folder below
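
SVM_perf reads training data in the SVM-light format, one example per line: a target (`+1` or `-1`) followed by `index:value` features, with indices starting at 1 and listed in increasing order. The sketch below shows one way to serialize a pair as such a line. Using the difference of the two word vectors as features is an assumption made for illustration, not necessarily what `svm_learning` does, and both helper names are hypothetical.

```python
def pair_features(first_vec, second_vec):
    """Hypothetical feature choice: element-wise difference of the two vectors."""
    return [b - a for a, b in zip(first_vec, second_vec)]

def svmlight_line(label, features):
    """Format one example as '<target> 1:<v1> 2:<v2> ...'; zero features are omitted."""
    target = "+1" if label else "-1"
    parts = [f"{i}:{v:.6g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([target] + parts)

positive = svmlight_line(True, pair_features([1.0, 1.0, 0.5], [1.0, 3.0, 0.25]))
negative = svmlight_line(False, pair_features([2.0, 1.0, 0.0], [0.5, 1.0, 1.0]))
print(positive)  # +1 2:2 3:-0.25
print(negative)  # -1 1:-1.5 3:1
```
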


In [8]:
from IPython.display import HTML, Image
Image(url="https://ipython-books.github.io/pages/chapter08_ml/05_svm_files/kernel.png")

In [9]:
config["svm_perf_path"] = '/Users/macbookpro/Desktop/nlp-workshop-2020/lib/svm_perf/'
config["svm_folder"] = f'/Users/macbookpro/Desktop/nlp-workshop-2020/outputs/svm/svm_{curr_date}'

tagged = builder.svm_learning(method='svm',
                              svm_folder=config["svm_folder"],
                              svm_perf_path=config["svm_perf_path"])

Start SVM [svm_perf_learn] 1589166126
load predictions from /Users/macbookpro/Desktop/nlp-workshop-2020/outputs/svm/svm_2020-05-10_capitals_prediction
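
The prediction file written by `svm_perf_classify` typically contains one real-valued score per line, in the same order as the input examples, where a positive score means the positive class. The sketch below shows how such scores could be turned into tags. The tag letters `t` and `f`, the zero threshold, and the helper name are assumptions; the mapping used by `svm_learning` is not shown in this notebook.

```python
import io

def tag_predictions(pairs, score_lines, threshold=0.0):
    """Pair each candidate with its score and tag it 't' or 'f'."""
    tagged = []
    for pair, line in zip(pairs, score_lines):
        score = float(line.split()[0])
        tagged.append(("t" if score > threshold else "f", *pair))
    return tagged

# Invented scores standing in for the contents of a prediction file.
prediction_file = io.StringIO("1.27\n-0.48\n0.03\n")
candidates = [("Japan", "Tokyo"), ("apple", "banana"), ("Cuba", "Havana")]
for row in tag_predictions(candidates, prediction_file):
    print(*row)
```
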


In [10]:
results = model.ResultList()
results.from_array(tagged)

In [13]:
print("#"*5, "Samples")
show_content_file(config["input_kbase"], lines=8)

print("#"*5, "Candidates tagged")
[str(order) for order in results if str(order)[0] =="t"]
results.print(lines=12)

##### Samples
t Athens Greece
t Baghdad Iraq
t Bangkok Thailand
t Beijing China
t Berlin Germany
t Bern Switzerland
t Cairo Egypt
t Canberra Australia
##### Candidates tagged
t South_Korea Korea
t Argentine Argentina
t Prague_Czech_Republic Czech_Republic
t Doer Stephen_Harper
t Hu Wen
t Canadians Manitobans
t Spanish Portuguese
t Yamada Tanaka
t Brisbane Adelaide
t Munich_Germany Germany
t Vientiane Laos
t Ecuadorean Ecuador
