# Build knowledge graph via deep learning

## Problem
Which entities are shared across all DeFi whitepapers?

## Solution
Using a bag-of-words model, we extract the entities that the whitepapers have in common.
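
As a minimal illustration of the bag-of-words idea, the sketch below counts word occurrences in two made-up sentences using only the standard library. The sentences and the `bag_of_words` helper are hypothetical and are not taken from the whitepapers; the notebook itself relies on scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase the text, split on whitespace, and count each word.
    # Word order is discarded; only the counts are kept.
    return Counter(text.lower().split())

# Hypothetical toy documents, for illustration only
doc_a = "the network validates each block"
doc_b = "each block extends the chain"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# Words present in both bags are the candidate shared entities
shared = sorted(set(bag_a) & set(bag_b))
print(shared)
```

Note that common function words such as "the" also show up as shared, which is why stop words are removed during preprocessing.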

## Expected outcome
1. The top ten shared entities are extracted.
2. The top ten shared trigrams are extracted.

Authors:
* Xiaoyuan Liu
* Neel Kovelamudi
* Zijian Xie
* Mu He
* Cuiqianhe Du
* Nicholas Lin
* Austin Wei

Principal Investigator: 
* Dawn Song

Date: Fall 2021

References: 
[Bag_of_words](https://www.analyticsvidhya.com/blog/2021/08/a-friendly-guide-to-nlp-bag-of-words-with-python-example/)

### Import library

In [1]:
import pandas as pd
import numpy as np
import collections
import re
import os
import string
pd.set_option('display.max_colwidth', 200)
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Uncomment on first run to install dependencies and fetch the stop word list
#!pip install beautifulsoup4 nltk
#import nltk
#nltk.download('stopwords')  # Download the NLTK stop word list

In [3]:
# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup         

In [4]:
from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Whitepaper datasource

In [5]:
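def read_whitepapers(filename):
    # Load one whitepaper as a single-column dataframe, one row per non-empty line
    directory = "../whitepapers/top20_whitepapers/"
    for entry in os.scandir(directory):
        if entry.is_file() and entry.path.endswith(filename):
            # Read lines directly; newer pandas versions reject sep="\n" in read_csv
            with open(entry.path, encoding="utf-8") as f:
                lines = [line.rstrip("\n") for line in f]
            # Drop empty lines, as the original replace/dropna steps did
            return pd.DataFrame({filename: [line for line in lines if line]})
    # Fail loudly instead of hitting an undefined variable when nothing matches
    raise FileNotFoundError(f"{filename} not found in {directory}")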

In [6]:
bitcoin_filename="Bitcoin.txt"
whitepapers = read_whitepapers(bitcoin_filename)
whitepapers.rename(columns={bitcoin_filename: "whitepapers"}, inplace=True)
whitepapers

Unnamed: 0,whitepapers
0,Bitcoin: A Peer-to-Peer Electronic Cash System
1,Satoshi Nakamoto
2,satoshin@gmx.com
3,www.bitcoin.org
4,Abstract. A purely peer-to-peer version of electronic cash would allow online
...,...
348,"http://www.hashcash.org/papers/hashcash.pdf, 2002."
349,"[7] R.C. Merkle, ""Protocols for public key cryptosystems,"" In Proc. 1980 Symposium on Security and"
350,"Privacy, IEEE Computer Society, pages 122-133, April 1980."
351,"[8] W. Feller, ""An introduction to probability theory and its applications,"" 1957."


In [7]:
filenames = ['Algorand.txt', 'Avalanche.txt', 'Binance.txt', 'Cardano.txt', 'Chainlink.txt',
            'Crypto_com.txt', 'Ethereum.txt', 'FTX_token.txt', 'PolkaDot.txt', 'Polygon.txt', 'Ripple.txt', 
            'ShibaInu.txt', 'Solana.txt', 'Terra.txt', 'Tether.txt', 'Tron.txt', 'Uniswap.txt', 'Wrapped.txt']

In [8]:
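def create_dataframe(whitepapers):
    # Append each remaining whitepaper to the dataframe passed in
    for i in filenames:
        whitepaper = read_whitepapers(i)
        whitepaper.rename(columns={i: "whitepapers"}, inplace=True)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        whitepapers = pd.concat([whitepapers, whitepaper])
    return whitepapers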

### Create dataframe
Append the whitepaper dataframes one after another.

In [9]:
whitepapers = create_dataframe(whitepapers)
whitepapers

Unnamed: 0,whitepapers
0,Bitcoin: A Peer-to-Peer Electronic Cash System
1,Satoshi Nakamoto
2,satoshin@gmx.com
3,www.bitcoin.org
4,Abstract. A purely peer-to-peer version of electronic cash would allow online
...,...
452,​
453,
454,
455,


### Data cleaning/preprocessing

In [10]:
# Searching a set is much faster than searching a list, so build the
# stop word set once instead of on every call
STOPS = set(stopwords.words("english"))

def sentence_to_words(sentence):
    # Convert a raw sentence to a string of words.
    # The input is a single string (a whitepaper sentence), and
    # the output is a single string (a preprocessed sentence).
    #
    # 1. Remove HTML (naming the parser explicitly avoids a bs4 warning)
    sentence = BeautifulSoup(sentence, "html.parser").get_text()
    #
    # 2. Replace non-letters with spaces
    letters_only = re.sub("[^a-zA-Z]", " ", sentence)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. Remove stop words
    meaningful_words = [w for w in words if w not in STOPS]
    #
    # 5. Join the words back into one string separated by spaces,
    # and return the result.
    return " ".join(meaningful_words)
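
To make the effect of these cleaning steps concrete, here is a dependency-free sketch of the same pipeline applied to one made-up sentence. The `clean_sentence` helper, the tiny `TOY_STOPS` set, and the regular expression used to strip tags are illustrative assumptions; the notebook uses BeautifulSoup and the full NLTK English stop word list instead.

```python
import re

# Hypothetical miniature stop word list; the notebook uses NLTK's English list
TOY_STOPS = {"a", "the", "is", "of", "to", "and"}

def clean_sentence(sentence):
    # 1. Strip anything that looks like an HTML tag (crude stand-in for BeautifulSoup)
    text = re.sub(r"<[^>]+>", " ", sentence)
    # 2. Replace non-letters with spaces
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # 3. Lowercase and split into words
    words = letters_only.lower().split()
    # 4. Drop stop words and rejoin with single spaces
    return " ".join(w for w in words if w not in TOY_STOPS)

cleaned = clean_sentence("<b>Proof-of-Work</b> is the backbone of Bitcoin (2008).")
print(cleaned)
```

Two side effects are worth keeping in mind: hyphenated terms are split into separate words, and all digits are discarded, so years and version numbers never reach the vocabulary.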

In [11]:
whitepapers['whitepapers'] = whitepapers['whitepapers'].apply(sentence_to_words)
whitepapers['whitepapers'].sample(5)



282                                                   sequence prior event consumers
1600                                                                               w
3187          sign messages inside key valid message space respectively remark proxy
570     matic network well seamless mechanism connect browser based dapps mobile app
28                                                                      introduction
Name: whitepapers, dtype: object

### CountVectorizer

In [12]:
vectorizer1 = CountVectorizer(stop_words='english')
bow_features = vectorizer1.fit_transform(whitepapers['whitepapers'])
vocabulary1 = vectorizer1.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
bow_features = bow_features.toarray()
print(bow_features.shape)

(20695, 13802)


### Count of each word in the vocabulary

In [13]:
# Sum up the counts of each vocabulary word
count_sum1 = np.sum(bow_features, axis=0)

# Pair each vocabulary word with the number of times it
# appears in the corpus
paired1 = list(zip(vocabulary1, count_sum1))

# Sort words by occurrence count, most frequent first
bow_occ = sorted(paired1, key=lambda x: x[1], reverse=True)
bow_occ[:10]

[('chain', 719),
 ('network', 660),
 ('block', 612),
 ('data', 601),
 ('transaction', 593),
 ('protocol', 500),
 ('transactions', 490),
 ('crypto', 485),
 ('nodes', 468),
 ('cid', 429)]

### Actual result
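The output above lists the ten most frequent words under the bag-of-words model across the 19 whitepapers loaded above (Bitcoin plus the 18 files in `filenames`). The token `cid` is most likely residue from PDF text extraction rather than a genuine entity.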
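
A total count does not by itself say whether a word is shared: a word can be frequent because a single whitepaper repeats it. One way to check this, sketched below under the assumption that the cleaned text is grouped per whitepaper, is to also count in how many documents each word occurs. The `docs` dictionary and its contents are hypothetical toy data, not results from this notebook.

```python
from collections import Counter

# Hypothetical cleaned documents, one string per whitepaper
docs = {
    "paper_a": "chain block block block block network",
    "paper_b": "chain network token",
    "paper_c": "chain token token",
}

total_counts = Counter()  # how often a word occurs overall
doc_counts = Counter()    # in how many documents a word occurs

for text in docs.values():
    words = text.split()
    total_counts.update(words)
    doc_counts.update(set(words))  # count each word once per document

# Words that occur in every document
shared_by_all = sorted(w for w, n in doc_counts.items() if n == len(docs))
print(total_counts.most_common(1))
print(shared_by_all)
```

In this toy data the most frequent word overall occurs in only one document, while the word shared by all three documents is a different one.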

### Trigram

In [14]:
vectorizer2 = CountVectorizer(stop_words='english',ngram_range=(3, 3))
tri_features = vectorizer2.fit_transform(whitepapers['whitepapers'])
vocabulary2 = vectorizer2.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
tri_features = tri_features.toarray()
print(tri_features.shape)

(20695, 65783)
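
Converting a matrix of this shape to a dense array can be expensive. The back-of-the-envelope sketch below estimates the dense size for the two shapes printed in this notebook, assuming 8-byte integer counts; the `dense_size_gib` helper is hypothetical.

```python
def dense_size_gib(rows, cols, bytes_per_value=8):
    # Size of a dense rows x cols array in GiB.
    # 8 bytes per value assumes 64-bit integer counts.
    return rows * cols * bytes_per_value / 2**30

# Shapes printed by the notebook for the unigram and trigram matrices
unigram_gib = dense_size_gib(20695, 13802)
trigram_gib = dense_size_gib(20695, 65783)
print(round(unigram_gib, 2), round(trigram_gib, 2))
```

Under that assumption the estimates come to roughly 2 GiB and 10 GiB. If memory is tight, the column sums can be taken on the sparse matrix returned by `fit_transform` instead, for example with `np.asarray(sparse_matrix.sum(axis=0)).ravel()`, which avoids the dense allocation.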


In [15]:
# Sum up the counts of each trigram
count_sum2 = np.sum(tri_features, axis=0)

# Pair each trigram with the number of times it
# appears in the corpus
paired2 = list(zip(vocabulary2, count_sum2))

# Sort trigrams by occurrence count, most frequent first
tri_occ = sorted(paired2, key=lambda x: x[1], reverse=True)
tri_occ[:10]

[('cid cid cid', 46),
 ('crypto org chain', 46),
 ('crypto com app', 40),
 ('matic development team', 24),
 ('heterogeneous multi chain', 23),
 ('crypto com exchange', 22),
 ('www shibatoken com', 22),
 ('multi chain framework', 21),
 ('paper www shibatoken', 21),
 ('polkadot vision heterogeneous', 21)]

### Actual result
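The output above lists the ten most frequent trigrams across the 19 whitepapers loaded above (Bitcoin plus the 18 files in `filenames`). Because these are corpus-wide totals, a phrase such as `www shibatoken com` can rank highly even if it is repeated in only one whitepaper.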
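
For reference, a trigram is a window of three consecutive words. The sketch below builds trigrams with plain Python to show one consequence of storing one text line per dataframe row: n-grams are formed within each row, so a phrase broken across two lines is not counted. The `trigrams` helper and the two example rows are hypothetical.

```python
def trigrams(text):
    # Slide a window of three consecutive words over the text
    words = text.split()
    return [" ".join(t) for t in zip(words, words[1:], words[2:])]

# Hypothetical cleaned rows, as if one sentence were split over two lines
row_1 = "heterogeneous multi chain"
row_2 = "framework design"

per_row = trigrams(row_1) + trigrams(row_2)
joined = trigrams(row_1 + " " + row_2)
print(per_row)
print(joined)
```

Processing the rows separately yields a single trigram, while joining them first yields three. Concatenating each whitepaper into one string before vectorizing would be one way to capture phrases that span line breaks.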