# Build knowledge graph via deep learning

Authors:
* Xiaoyuan Liu
* Neel Kovelamudi
* Zijian Xie
* Mu He
* Cuiqianhe Du
* Nicholas Lin
* Austin Wei

Principal Investigator: 
* Dawn Song

Date: Fall 2021

References:
* [Bag of words](https://www.analyticsvidhya.com/blog/2021/08/a-friendly-guide-to-nlp-bag-of-words-with-python-example/)

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import collections
import re
import os
import string
pd.set_option('display.max_colwidth', 200)
from sklearn.feature_extraction.text import CountVectorizer

### Whitepaper data source

In [2]:
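def read_whitepapers(filename):
    '''Load one whitepaper as a single-column dataframe, one row per non-empty line.'''
    directory = "../whitepapers/top20_whitepapers/"
    for entry in os.scandir(directory):
        if entry.is_file() and entry.path.endswith(filename):
            # Read the file line by line; recent pandas versions reject sep="\n" in read_csv.
            with open(entry.path, encoding="utf-8") as f:
                lines = [line.rstrip("\r\n") for line in f]
            # Drop empty lines so the row index stays contiguous.
            lines = [line for line in lines if line != ""]
            return pd.DataFrame({filename: lines})
    # Fail loudly instead of hitting an unbound variable when no file matches.
    raise FileNotFoundError(f"No file ending with {filename} in {directory}")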

In [3]:
bitcoin_filename="Bitcoin.txt"
whitepapers = read_whitepapers(bitcoin_filename)
whitepapers.rename(columns={bitcoin_filename: "whitepapers"}, inplace=True)
whitepapers

Unnamed: 0,whitepapers
0,Bitcoin: A Peer-to-Peer Electronic Cash System
1,Satoshi Nakamoto
2,satoshin@gmx.com
3,www.bitcoin.org
4,Abstract. A purely peer-to-peer version of electronic cash would allow online
...,...
348,"http://www.hashcash.org/papers/hashcash.pdf, 2002."
349,"[7] R.C. Merkle, ""Protocols for public key cryptosystems,"" In Proc. 1980 Symposium on Security and"
350,"Privacy, IEEE Computer Society, pages 122-133, April 1980."
351,"[8] W. Feller, ""An introduction to probability theory and its applications,"" 1957."


In [4]:
filenames = ['Algorand.txt', 'Avalanche.txt', 'Binance.txt', 'Cardano.txt', 'Chainlink.txt',
            'Crypto_com.txt', 'Ethereum.txt', 'FTX_token.txt', 'PolkaDot.txt', 'Polygon.txt', 'Ripple.txt', 
            'ShibaInu.txt', 'Solana.txt', 'Terra.txt', 'Tether.txt', 'Tron.txt', 'Uniswap.txt', 'Wrapped.txt']

In [5]:
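def create_dataframe(whitepapers):
    '''Append the dataframe of every file in `filenames` to `whitepapers`.'''
    frames = [whitepapers]
    for i in filenames:
        whitepaper = read_whitepapers(i)
        whitepaper.rename(columns={i : "whitepapers"}, inplace=True)
        frames.append(whitepaper)
    # DataFrame.append was removed in pandas 2.0; pd.concat keeps each file's own row index.
    return pd.concat(frames)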

### Create dataframe
Append each whitepaper's dataframe to the Bitcoin dataframe, one after another, so that every row holds one line of a whitepaper.

In [6]:
whitepapers = create_dataframe(whitepapers)
whitepapers

Unnamed: 0,whitepapers
0,Bitcoin: A Peer-to-Peer Electronic Cash System
1,Satoshi Nakamoto
2,satoshin@gmx.com
3,www.bitcoin.org
4,Abstract. A purely peer-to-peer version of electronic cash would allow online
...,...
452,​
453,
454,
455,


### Data cleaning/preprocessing

In [7]:
def tweet_cleaning(text):
    '''Lowercase the text; remove text in square brackets, links, HTML-like tags,
    punctuation, and words containing digits; then collapse whitespace.'''
    # Despite its name, this function is applied to whitepaper lines.
    # Raw strings avoid invalid-escape warnings in the regex patterns.
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', ' ', text)
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    text = re.sub(r'<.*?>+', ' ', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text
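
To make the effect of these steps concrete, here is a minimal sketch that runs them on a single made-up line. `clean_line` is a hypothetical standalone copy of the steps above, included only so the example runs on its own; the sample sentence is an assumption, not a line taken from the dataset.

```python
import re
import string

def clean_line(text):
    # Hypothetical standalone copy of the cleaning steps, for illustration only.
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', ' ', text)                # bracketed citations such as [1]
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # links
    text = re.sub(r'<.*?>+', ' ', text)                 # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', ' ', text)               # tokens containing digits
    return re.sub(r'\s+', ' ', text).strip()            # collapse whitespace

# Made-up sample line (assumption).
sample = "Bitcoin: A Peer-to-Peer Electronic Cash System [1], www.bitcoin.org, 2008."
cleaned = clean_line(sample)
print(cleaned)  # bitcoin a peertopeer electronic cash system
```

Note that punctuation is deleted rather than replaced by a space, so hyphenated terms such as "peer-to-peer" collapse into a single token.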

In [8]:
whitepapers['whitepapers'] = whitepapers['whitepapers'].apply(tweet_cleaning)
whitepapers['whitepapers'].sample(5)

355                                                                                                   ​
112    probabilistic argument regarding a combinatorial notion of “forkable strings” which we formulate
5                               thistechnicalwhitepaperexplainssomeofthedesigndecisionsbehindtheuniswap
147                        or other malicious groups or organizations may attempt to interfere with our
64      be aware of all transactions in the mint based model the mint was aware of all transactions and
Name: whitepapers, dtype: object
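
The sample above contains a row that looks empty, and the vocabularies built further down contain ligature characters such as "ﬂ". The following is a minimal sketch of one possible extra normalization step, not part of the original pipeline. It assumes the invisible character is a zero-width space (U+200B); `normalize_line` is a hypothetical helper name.

```python
import unicodedata

def normalize_line(text):
    # NFKC folds compatibility characters, e.g. the ligature U+FB02 becomes "fl".
    text = unicodedata.normalize("NFKC", text)
    # A zero-width space is not whitespace for str.strip(), so remove it explicitly.
    text = text.replace("\u200b", "")
    return text.strip()

# Made-up lines (assumption) covering a ligature, a zero-width space, and a blank line.
lines = ["\ufb02ow of transactions", "\u200b", "  ", "network \ufb02ooding"]
normalized = [normalize_line(line) for line in lines]
non_empty = [line for line in normalized if line]
print(non_empty)  # ['flow of transactions', 'network flooding']
```

Applied before vectorization, a step like this would merge tokens such as "ﬂow" and "flow" into one vocabulary entry and let rows that contain no visible text be dropped.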

### CountVectorizer

In [9]:
vectorizer1 = CountVectorizer(stop_words='english')
bow = vectorizer1.fit_transform(whitepapers['whitepapers'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
header1 = vectorizer1.get_feature_names_out()
df_bow = pd.DataFrame(bow.toarray(),columns=header1)
df_bow.head()

Unnamed: 0,aaa,aarhus,aave,aaveg,aavegotchi,aaveunsecuredborrowingdefi,ab,aban,abate,abc,...,ﬂoat,ﬂooding,ﬂoods,ﬂoor,ﬂoorblkparentoplimit,ﬂoorparentopcount,ﬂow,ﬂows,ﬂuctuations,ﬂushed
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
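
To illustrate what the rows and columns of this table mean, here is a pure-Python sketch of a bag-of-words count matrix on two made-up lines. It is a simplified stand-in, not scikit-learn's implementation: the whitespace tokenizer and the one-word stop list are assumptions chosen for brevity.

```python
from collections import Counter

# Made-up documents and a tiny stop list (assumptions for illustration).
docs = ["nodes accept the longest chain", "the longest chain wins"]
stop_words = {"the"}

# Tokenize on whitespace and drop stop words.
tokenized = [[w for w in doc.split() if w not in stop_words] for doc in docs]

# The vocabulary is the sorted set of remaining tokens: one column per word.
vocabulary = sorted({w for tokens in tokenized for w in tokens})

# Each row counts how often every vocabulary word occurs in one document.
counts = [Counter(tokens) for tokens in tokenized]
matrix = [[c[w] for w in vocabulary] for c in counts]

print(vocabulary)  # ['accept', 'chain', 'longest', 'nodes', 'wins']
for row in matrix:
    print(row)     # [1, 1, 1, 1, 0] then [0, 1, 1, 0, 1]
```

Most entries are zero because each line uses only a handful of the vocabulary's words, which is why the first rows of the table above contain only zeros in the displayed columns.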


### Trigram

In [10]:
vectorizer2 = CountVectorizer(stop_words='english',ngram_range=(3, 3))
trigram = vectorizer2.fit_transform(whitepapers['whitepapers'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
header2 = vectorizer2.get_feature_names_out()
df_tri = pd.DataFrame(trigram.toarray(),columns=header2)
df_tri.head()

Unnamed: 0,aaa reserve governmental,aarhus university iohk,aave brought unse,aave oracle network,aavegotchi introduction aavegotchi,aavegotchi wiki contributors,ab ba used,abc currently xyz,abc def abc,abc inconvenient user,...,ﬂips biased coin,ﬂooding attacks example,ﬂooding network intend,ﬂooding transactions strategy,ﬂoorparentopcount blklimitfactor emafactor,ﬂow transactions enter,ﬂuctuations fall demand,ﬂuctuations lastly discuss,ﬂuctuations miner compensation,ﬂuctuations price bitcoin
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
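
The column names above are word trigrams. As a rough illustration of how such features arise, the sketch below builds trigrams in pure Python from one made-up line. It mirrors the fact that `CountVectorizer` removes stop words before forming n-grams, so a trigram can span words that were not adjacent in the original text. The helper `word_ngrams`, the whitespace tokenizer, and the two-word stop list are assumptions.

```python
def word_ngrams(tokens, n):
    # Slide a window of n tokens over the list and join each window with spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up line and tiny stop list (assumptions for illustration).
stop_words = {"the", "of"}
line = "nodes accept the longest chain of blocks"

# Stop words are removed first, then trigrams are built from what remains.
tokens = [w for w in line.split() if w not in stop_words]
trigrams = word_ngrams(tokens, 3)
print(trigrams)
# ['nodes accept longest', 'accept longest chain', 'longest chain blocks']
```

Here "longest chain blocks" bridges the removed word "of", which is the same mechanism that can produce features like "aarhus university iohk" in the table above.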
