# Build knowledge graph via deep learning

## Problem
Which entities are shared across DeFi whitepapers?

## Solution
Using a bag-of-words model, we extract the most frequent terms (treated here as entities) from each whitepaper.
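
As a minimal sketch of the bag-of-words idea, the snippet below counts token frequencies with only the standard library. The two toy lines and the helper name `bag_of_words` are hypothetical and are not taken from the whitepapers; the notebook itself relies on scikit-learn's `CountVectorizer` further down.

```python
from collections import Counter

# Hypothetical toy lines; not taken from the whitepapers.
docs = [
    "block hash block",
    "hash chain block",
]

def bag_of_words(lines):
    """Count how often each whitespace-separated token occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

counts = bag_of_words(docs)
# The two most frequent tokens, as (token, count) pairs.
top_two = counts.most_common(2)
print(top_two)  # [('block', 3), ('hash', 2)]
```

Word order is discarded: only the per-token counts survive, which is why a separate trigram pass is needed to recover short phrases.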

## Expected outcome
1. The top ten entities (most frequent words) are extracted for each whitepaper.
2. The top ten trigrams are extracted for each whitepaper.

Authors:
* Xiaoyuan Liu
* Neel Kovelamudi
* Zijian Xie
* Mu He
* Cuiqianhe Du
* Nicholas Lin
* Austin Wei

Principal Investigator: 
* Dawn Song

Date: Fall 2021

References: 
[Bag_of_words](https://www.analyticsvidhya.com/blog/2021/08/a-friendly-guide-to-nlp-bag-of-words-with-python-example/)

### Import library

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import collections
import re
import os
import string
pd.set_option('display.max_colwidth', 200)
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
#!pip install BeautifulSoup4
#import nltk
#nltk.download()  # Download text data sets, including stop words

In [3]:
# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup         

In [4]:
from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Whitepaper datasource

In [5]:
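def read_whitepapers(filename):
    # Read one whitepaper into a single-column DataFrame, one text line per row.
    directory = "../whitepapers/top20_whitepapers/"
    path = os.path.join(directory, filename)
    # Read the file line by line; newer pandas versions may reject sep="\n" in read_csv.
    # The utf-8 encoding is an assumption about the text files.
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    a_dataframe = pd.DataFrame({filename: lines})
    # Treat empty lines as missing, drop them, and renumber the rows.
    a_dataframe = a_dataframe.replace('', np.nan).dropna().reset_index(drop=True)
    return a_dataframe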

In [6]:
# bitcoin_filename="Bitcoin.txt"
# whitepapers = read_whitepapers(bitcoin_filename)
# # whitepapers.rename(columns={bitcoin_filename: "whitepapers"}, inplace=True)
# whitepapers

In [7]:
filenames = ['Algorand.txt', 'Avalanche.txt', 'Binance.txt', 'Bitcoin.txt', 'Cardano.txt', 'Chainlink.txt',
            'Crypto_com.txt', 'Ethereum.txt', 'FTX_token.txt', 'PolkaDot.txt', 'Polygon.txt', 'Ripple.txt', 
            'ShibaInu.txt', 'Solana.txt', 'Terra.txt', 'Tether.txt', 'Tron.txt', 'Uniswap.txt', 'Wrapped.txt']

In [8]:
def create_dataframe():
    # Read every whitepaper and place each one in its own column (axis=1).
    df_from_each_whitepaper = (read_whitepapers(i) for i in filenames)
    whitepapers = pd.concat(df_from_each_whitepaper, ignore_index=True, axis=1)
    return whitepapers

### Create dataframe
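Concatenate the whitepaper dataframes side by side, one column per whitepaper. Each row holds one line of text, so shorter whitepapers are padded with NaN at the bottom.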
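
The column-wise layout can be illustrated with a small sketch. The two frames and their column names (`A.txt`, `B.txt`) are hypothetical stand-ins for whitepapers of different lengths; the point is only to show how `pd.concat(..., axis=1)` aligns rows by index and leaves NaN where the shorter document has no line.

```python
import pandas as pd

# Hypothetical stand-ins for two whitepapers of different lengths.
short_paper = pd.DataFrame({"A.txt": ["title a", "line a1"]})
long_paper = pd.DataFrame({"B.txt": ["title b", "line b1", "line b2"]})

# axis=1 aligns rows on the index and gives each frame its own column.
combined = pd.concat([short_paper, long_paper], axis=1)

print(combined.shape)                  # (3, 2)
print(combined["A.txt"].isna().sum())  # 1 missing row in the shorter column
```

This is why the later steps drop NaN values before fitting a vectorizer to a column.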

In [9]:
df_whitepapers = create_dataframe()
df_whitepapers.columns = filenames
df_whitepapers

Unnamed: 0,Algorand.txt,Avalanche.txt,Binance.txt,Bitcoin.txt,Cardano.txt,Chainlink.txt,Crypto_com.txt,Ethereum.txt,FTX_token.txt,PolkaDot.txt,Polygon.txt,Ripple.txt,ShibaInu.txt,Solana.txt,Terra.txt,Tether.txt,Tron.txt,Uniswap.txt,Wrapped.txt
0,ALGORAND AGREEMENT,Avalanche Platform,Binance Exchange,Bitcoin: A Peer-to-Peer Electronic Cash System,Ouroboros: A Provably Secure Proof-of-Stake Blockchain Protocol,Chainlink 2.0: Next Steps in the Evolution of,Crypto.com Whitepaper 1.03,HOME / WHITEPAPER,FTT Whitepaper,POLKADOT: VISION FOR A HETEROGENEOUS MULTI-CHAIN FRAMEWORK,Search or jump to… Pulls Issues Marketplace Explore,"RippleLabsInc,2014",SHIBA INU,Solana: A new architecture for a high,Terra Money:,,Advanced Decentralized Blockchain Platform,Uniswap v2 Core,
1,Super Fast and Partition Resilient Byzantine Agreement,2020/06/30,www.binance.com,Satoshi Nakamoto,Aggelos Kiayias∗ Alexander Russell† Bernardo David‡ Roman Oliynykov§,Decentralized Oracle Networks,August 2020,"Page last updated: January 30, 2022","Initially Released June 25, 2019",DRAFT 1,maticnetwork / whitepaper,The Ripple Protocol Consensus Algorithm,ECOSYSTEM,performance blockchain v0.8.14,Stability and Adoption,,Whitepaper Version: 2.0,Hayden Adams Noah Zinsmeister Dan Robinson,Wrapped Tokens
2,Jing Chen Sergey Gorbunov Silvio Micali Georgios Vlachos,"Kevin Sekniqi, Daniel Laine, Stephen Buttolph, and Emin Gu¨n Sirer",Whitepaper,satoshin@gmx.com,"July 20, 2019",Lorenz Breidenbach1 Christian Cachin2 Benedict Chan1,Version 1.03.17 - April 2021,On this page,1,DR.GAVINWOOD,Public Watch 12 Fork 29 Star 125,David Schwartz,WWW.SHIBATOKEN.COM,Anatoly Yakovenko,"Evan Kereiakes, Do Kwon, Marco Di Maggio, Nicholas Platias",,TRON Protocol Version: 3.2,hayden@uniswap.org noah@uniswap.org dan@paradigm.xyz,A multi-institutional framework for tokenizing any asset
3,"{jing, sergey, silvio, georgios@algorand.com}","Abstract. ThispaperprovidesanarchitecturaloverviewoftheﬁrstreleaseoftheAvalancheplatform,",V1.2,www.bitcoin.org,Abstract,Alex Coventry1 Steve Ellis1 Ari Juels3 Farinaz Koushanfar4,This whitepaper is a working document that is subject to review and changes,Ethereum Whitepaper,Contents,"FOUNDER,ETHEREUM&PARITY",Code Issues 3 Pull requests 1 Actions Projects Wiki Security,This paper does not reflect the current state of the ledger consensus protocol or its,BONE,anatoly@solana.io,April 2019,,TRON DAO,March 2020,
4,"April 25, 2018","5 codenamed Avalanche Borealis. For details on the economics of the native token, labeled $AVAX, we",Intro 3,Abstract. A purely peer-to-peer version of electronic cash would allow online,"We present “Ouroboros,” the ﬁrst blockchain protocol based on proof of stake with rig-",Andrew Miller5 Brendan Magauran1 Daniel Moroz6,Crypto.com 2,"This introductory paper was originally published in 2013 by Vitalik Buterin, the founder of Ethereum,",1 Our Mission 4,GAVIN@PARITY.IO,master whitepaper / README.md Go to file,"david@ripple.com analysis. We will continue hosting this draft for historical interest, but it SHOULD NOT be",v1 - 4/29/21 - WOOF Paper,"Legal Disclaimer NothinginthisWhitePaperisanoffertosell,orthesolicitationofanoffer",Abstract,,"December 10th, 2018, San Francisco",Abstract,Whitepaper v0.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5327,,,,,,tion.,,,,,,,,,,,,,
5328,,,,,,• Customization: Diﬀerent users may have diﬀerent preferences in terms of trade-,,,,,,,,,,,,,
5329,,,,,,"oﬀs among reliability, performance, and cost tradeoﬀs, and should be able to",,,,,,,,,,,,,
5330,,,,,,express these preferences in terms of their selection of providers.,,,,,,,,,,,,,


### Data cleaning/preprocessing

In [10]:
def sentence_to_words(sentence):
    # Function to convert a raw sentence to a string of words
    # The input is a single string (a whitepaper sentence), and
    # the output is a single string (a preprocessed sentence)

    # If the line is NaN, return it unchanged
    if pd.isnull(sentence):
        return sentence
    # 1. Remove HTML (name the parser explicitly to avoid a bs4 warning)
    sentence = BeautifulSoup(sentence, "html.parser").get_text()
    #
    # 2. Replace non-letters with spaces
    letters_only = re.sub("[^a-zA-Z]", " ", sentence)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space,
    # and return the result.
    return " ".join(meaningful_words)
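
The cleaning steps above can be sketched without NLTK or BeautifulSoup. This is a simplified illustration, not the notebook's function: `STOP_WORDS` is a hypothetical, deliberately tiny stop-word set, `clean_line` is a hypothetical name, and the HTML-stripping step is omitted.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"a", "of", "the", "to", "would"}

def clean_line(line):
    """Keep letters only, lowercase, split into words, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", line)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)

cleaned = clean_line("Bitcoin: A Peer-to-Peer Electronic Cash System")
print(cleaned)  # bitcoin peer peer electronic cash system
```

Because every non-letter becomes a space, digits and punctuation vanish entirely, so a line such as a date can end up as an empty string.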

In [11]:
for i in filenames:
    df_whitepapers[i] = df_whitepapers[i].apply(sentence_to_words)
df_whitepapers

Unnamed: 0,Algorand.txt,Avalanche.txt,Binance.txt,Bitcoin.txt,Cardano.txt,Chainlink.txt,Crypto_com.txt,Ethereum.txt,FTX_token.txt,PolkaDot.txt,Polygon.txt,Ripple.txt,ShibaInu.txt,Solana.txt,Terra.txt,Tether.txt,Tron.txt,Uniswap.txt,Wrapped.txt
0,algorand agreement,avalanche platform,binance exchange,bitcoin peer peer electronic cash system,ouroboros provably secure proof stake blockchain protocol,chainlink next steps evolution,crypto com whitepaper,home whitepaper,ftt whitepaper,polkadot vision heterogeneous multi chain framework,search jump pulls issues marketplace explore,ripplelabsinc,shiba inu,solana new architecture high,terra money,,advanced decentralized blockchain platform,uniswap v core,
1,super fast partition resilient byzantine agreement,,www binance com,satoshi nakamoto,aggelos kiayias alexander russell bernardo david roman oliynykov,decentralized oracle networks,august,page last updated january,initially released june,draft,maticnetwork whitepaper,ripple protocol consensus algorithm,ecosystem,performance blockchain v,stability adoption,,whitepaper version,hayden adams noah zinsmeister dan robinson,wrapped tokens
2,jing chen sergey gorbunov silvio micali georgios vlachos,kevin sekniqi daniel laine stephen buttolph emin gu n sirer,whitepaper,satoshin gmx com,july,lorenz breidenbach christian cachin benedict chan,version april,page,,dr gavinwood,public watch fork star,david schwartz,www shibatoken com,anatoly yakovenko,evan kereiakes kwon marco di maggio nicholas platias,,tron protocol version,hayden uniswap org noah uniswap org dan paradigm xyz,multi institutional framework tokenizing asset
3,jing sergey silvio georgios algorand com,abstract thispaperprovidesanarchitecturaloverviewofthe rstreleaseoftheavalancheplatform,v,www bitcoin org,abstract,alex coventry steve ellis ari juels farinaz koushanfar,whitepaper working document subject review changes,ethereum whitepaper,contents,founder ethereum parity,code issues pull requests actions projects wiki security,paper reflect current state ledger consensus protocol,bone,anatoly solana io,april,,tron dao,march,
4,april,codenamed avalanche borealis details economics native token labeled avax,intro,abstract purely peer peer version electronic cash would allow online,present ouroboros rst blockchain protocol based proof stake rig,andrew miller brendan magauran daniel moroz,crypto com,introductory paper originally published vitalik buterin founder ethereum,mission,gavin parity io,master whitepaper readme md go file,david ripple com analysis continue hosting draft historical interest,v woof paper,legal disclaimer nothinginthiswhitepaperisanoffertosell orthesolicitationofanoffer,abstract,,december th san francisco,abstract,whitepaper v
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5327,,,,,,tion,,,,,,,,,,,,,
5328,,,,,,customization di erent users may di erent preferences terms trade,,,,,,,,,,,,,
5329,,,,,,among reliability performance cost tradeo able,,,,,,,,,,,,,
5330,,,,,,express preferences terms selection providers,,,,,,,,,,,,,


In [12]:
def fit(vectorizer, df_column, occ=10):
    # Function to fit a vectorizer to one whitepaper
    # The input is a vectorizer and a df_column (one whitepaper);
    # the output is a DataFrame of the top `occ` terms and their counts

    # Drop missing lines without mutating the caller's DataFrame
    df_column = df_column.dropna()
    features = vectorizer.fit_transform(df_column)
    # get_feature_names() was removed in scikit-learn 1.2; use the _out variant
    vocabulary = vectorizer.get_feature_names_out()
    features = features.toarray()

    # Sum up the counts of each vocabulary word
    count_sum = np.sum(features, axis=0)

    # Pair each vocabulary word with the number of times it
    # appears in the whitepaper
    paired = list(zip(vocabulary, count_sum))

    # Sort words by occurrence, most frequent first
    feat_occ = sorted(paired, key=lambda x: x[1], reverse=True)
    return pd.DataFrame(list(feat_occ[:occ]))

### CountVectorizer

In [13]:
vectorizer1 = CountVectorizer(stop_words='english')

### Count of each word in the vocabulary

In [14]:
bow_occ_from_each_whitepaper = (fit(vectorizer1, df_whitepapers[i]) for i in filenames)
df_bow_occ = pd.concat(bow_occ_from_each_whitepaper, ignore_index=True, axis=1)
df_bow_columns = [val for val in filenames for _ in (0, 1)]
df_bow_occ.columns = [i if not c % 2 else "count" for c, i in enumerate(df_bow_columns)]
df_bow_occ

Unnamed: 0,Algorand.txt,count,Avalanche.txt,count.1,Binance.txt,count.2,Bitcoin.txt,count.3,Cardano.txt,count.4,...,Terra.txt,count.5,Tether.txt,count.6,Tron.txt,count.7,Uniswap.txt,count.8,Wrapped.txt,count.9
0,period,103,avalanche,57,binance,32,block,51,cid,264,...,terra,79,tether,66,tron,105,uniswap,47,tokens,68
1,cid,79,network,31,exchange,29,hash,43,protocol,227,...,mining,50,tethers,57,trx,82,contract,45,wrapped,48
2,honest,67,platform,31,systems,23,transaction,37,honest,194,...,luna,48,fiat,56,network,72,price,35,wbtc,44
3,votes,65,protocols,28,bnb,22,transactions,32,sl,148,...,price,40,bitcoin,52,block,69,liquidity,33,merchant,43
4,value,62,consensus,27,trading,20,nodes,31,stake,143,...,rewards,37,users,42,account,65,eth,32,user,30
5,users,61,nodes,25,team,19,proof,27,chain,136,...,funding,27,exchange,36,bandwidth,43,pair,28,chain,28
6,step,56,state,25,founder,18,work,27,slot,129,...,unit,26,currency,32,token,43,pool,20,asset,27
7,protocol,37,set,24,bijietech,17,chain,25,fork,123,...,currency,24,exchanges,28,transaction,43,asset,19,custodian,26
8,time,36,avax,23,cz,17,attacker,22,slots,104,...,stability,24,limited,25,contract,41,tokens,17,ethereum,26
9,user,36,node,21,exchanges,15,network,21,length,102,...,stable,24,blockchain,23,smart,41,assets,16,token,25


### Actual result
The table above shows the ten most frequent words in each whitepaper under the bag-of-words model, for the 19 whitepapers listed in `filenames`.

### Trigram

In [15]:
vectorizer2 = CountVectorizer(stop_words='english',ngram_range=(3, 3))

In [16]:
tri_occ_from_each_whitepaper = (fit(vectorizer2, df_whitepapers[i]) for i in filenames)
df_tri_occ = pd.concat(tri_occ_from_each_whitepaper, ignore_index=True, axis=1)
df_tri_columns = [val for val in filenames for _ in (0, 1)]
df_tri_occ.columns = [i if not c % 2 else "count" for c, i in enumerate(df_tri_columns)]
df_tri_occ

Unnamed: 0,Algorand.txt,count,Avalanche.txt,count.1,Binance.txt,count.2,Bitcoin.txt,count.3,Cardano.txt,count.4,...,Terra.txt,count.5,Tether.txt,count.6,Tron.txt,count.7,Uniswap.txt,count.8,Wrapped.txt,count.9
0,cid cid cid,17,buttolph emin gu,8,july th july,4,hash hash hash,7,cid cid cid,23,...,unit mining rewards,17,audit flaws exchanges,4,tron virtual machine,11,liquidity pool share,6,asset backed tokens,6
1,certi ed value,10,daniel laine stephen,8,th july th,4,prev hash nonce,7,sl sl sl,15,...,fees luna burn,4,custodian reserve assets,4,virtual machine tvm,8,asset terms asset,3,atomic swap contract,3
2,votes value cid,9,emin gu sirer,8,allan yan product,2,tx tx tx,5,cid exp cid,9,...,rate luna burn,4,existing fiat pegging,4,create new account,7,basis point fee,3,erc token ethereum,3
3,soft votes value,8,kevin sekniqi daniel,8,binance coin bnb,2,block header block,4,cid lexp cid,9,...,changes unit mining,3,fiat currency held,4,dynamic network parameters,7,cid ti pi,3,new wrapped tokens,3
4,ed value period,7,laine stephen buttolph,8,bnb pay fees,2,hash nonce prev,4,leader selection process,9,...,long term commitment,3,fiat pegging systems,4,delegated proof stake,5,https eips ethereum,3,address secret hash,2
5,potentially certi ed,7,sekniqi daniel laine,8,bnb value burn,2,nonce prev hash,4,closed fork let,7,...,long term stable,3,flaws exchanges wallets,4,false notice set,5,org eips eip,3,aml kyc procedures,2
6,value cid period,6,stephen buttolph emin,8,bnb vesting plan,2,proof work chain,4,computer science pages,7,...,luna burn rate,3,limitations existing fiat,4,notice set false,5,price asset terms,3,asset case wbtc,2
7,honest users cert,5,forward looking statements,6,changpeng zhao ceo,2,majority cpu power,3,exp cid cid,7,...,target exchange rate,3,omni layer protocol,4,set false true,5,url https eips,3,atomic swap fee,2
8,honest users period,5,post quantum cryptography,3,cz years worked,2,owner owner owner,3,lecture notes computer,7,...,available mining power,2,currency held reserves,3,total vote reward,5,value liquidity pool,3,atomic swap use,2
9,sees soft votes,5,blockchain ned vm,2,english chinese japanese,2,proof work block,3,notes computer science,7,...,central banks governments,2,decentralized digital currency,3,true false notice,5,angeris et al,2,backed tokens usually,2


### Actual result
The table above shows the ten most frequent trigrams in each whitepaper, for the 19 whitepapers listed in `filenames`.
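
For intuition, trigram counting can be sketched with the standard library alone. The helper `trigrams` and the two toy lines are hypothetical. The sketch splits on whitespace only, which is a simplification of what `CountVectorizer` does, but it shares one property with the setup above: each row is treated as a separate document, so trigrams never span two lines.

```python
from collections import Counter

def trigrams(line):
    """Return every run of three consecutive whitespace-separated tokens in a line."""
    tokens = line.split()
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

# Hypothetical cleaned lines; not taken from the whitepapers.
lines = ["proof work chain grows", "longest proof work chain"]

tri_counts = Counter()
for line in lines:
    # Count per line so that no trigram crosses a line boundary.
    tri_counts.update(trigrams(line))

print(tri_counts.most_common(1))  # [('proof work chain', 2)]
```

Lines with fewer than three tokens contribute nothing, which helps explain why trigram counts are much lower than single-word counts.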