Text Categorization Using GCN

(Showcase cover image: inzva AI Projects #4, Text Categorization Using GCN)

Abstract

Our project tackles the text classification problem with two graph-based approaches, Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), combining deep learning and natural language processing techniques.
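
Both approaches build on the graph convolution of [Kipf and Welling, 2017], which propagates node features through a symmetrically normalized adjacency matrix. A minimal NumPy sketch of a single layer is shown below for reference; it is illustrative only and not this repository's implementation.

import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    A: (n, n) adjacency matrix, H: (n, f_in) node features,
    W: (f_in, f_out) trainable weights.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # node degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^-1/2
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU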

Build

The build script targets Linux distributions only; on other platforms (macOS, Windows, etc.) you can substitute the equivalent commands (see the sketch at the end of this section). Executing build.sh creates a new virtual environment in the project folder and installs the dependencies into it. Run the following command to build:

bash build.sh 

Make sure your computer is connected to the internet; downloading and installing the dependencies can take a while.
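
If you are on macOS or Windows, where build.sh will not run as-is, a rough manual equivalent is to create the virtual environment and install the dependencies yourself. The commands below are a sketch under the assumption that build.sh only performs those two steps; the requirements.txt name is illustrative, so check the script for what it actually installs.

python3 -m venv venv                      # on Windows: python -m venv venv
venv/bin/pip install -r requirements.txt  # on Windows: venv\Scripts\pip install -r requirements.txt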

Run

Available Datasets:

  • 20ng (20 Newsgroups Dataset)
  • R8 (Reuters News Dataset with 8 labels)
  • R52 (Reuters News Dataset with 52 labels)
  • ohsumed (Cardiovascular Diseases Abstracts Dataset)
  • mr (Movie Reviews Dataset)
  • cora (Citation Dataset)
  • citeseer (Citation Dataset)
  • pubmed (Citation Dataset)

Preprocess:

venv/bin/python3 preprocess.py <DATASET_NAME>

Example: venv/bin/python3 preprocess.py R8

Train:

venv/bin/python3 train.py <DATASET_NAME>

Example: venv/bin/python3 train.py R8

Contributors

Example Output

$ python3 preprocess.py R8
[WARN] directory:'/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.cleaned/' already exists, not overwritten.
[nltk_data] Downloading package stopwords to venv/nltk_data/...
[nltk_data]   Unzipping corpora/stopwords.zip.
[INFO] Cleaned-Corpus Dir='/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.cleaned/'
[INFO] Rare-Count=<5>
[INFO] ========= CLEANED DATA: Removed rare & stop-words. =========
[WARN] directory:'/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/' already exists, not overwritten.
[WARN] directory:'/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/meta/' already exists, not overwritten.
[WARN] directory:'/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/split_index/' already exists, not overwritten.
[INFO] Shuffled-Corpus Dir='/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/'
[INFO] ========= SHUFFLED DATA: Corpus documents shuffled. =========
[WARN] directory:'/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/vocabulary/' already exists, not overwritten.
[WARN] directory:'/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/word_vectors/' already exists, not overwritten.
[nltk_data] Downloading package wordnet to venv/nltk_data/...
[nltk_data]   Unzipping corpora/wordnet.zip.
[INFO] Vocabulary Dir='/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/vocabulary/'
[INFO] Word-Vector Dir='/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/word_vectors/'
[INFO] ========= PREPARED WORDS: Vocabulary & word-vectors extracted. =========
[WARN] directory:'/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/node_features//R8' already exists, not overwritten.
[INFO] x.shape=   (4937, 300),	 y.shape=   (4937, 8)
[INFO] tx.shape=  (2189, 300),	 ty.shape=  (2189, 8)
[INFO] allx.shape=(13173, 300),	 ally.shape=(13173, 8)
[INFO] ========= EXTRACTED NODE FEATURES: x, y, tx, ty, allx, ally. =========
[WARN] directory:'/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/adjacency/' already exists, not overwritten.

[INFO] Adjacency Dir='/home/linuxuser/Desktop/GCNN - Repo/gcn_text_categorization/data/corpus.shuffled/adjacency/'
[INFO] ========= EXTRACTED ADJACENCY MATRIX: Heterogenous doc-word adjacency matrix. =========
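
The "heterogenous doc-word adjacency matrix" above refers to the TextGCN graph construction described in [Yao, Mao, and Luo, 2018]: a single graph over all documents and vocabulary words, with document-word edges weighted by TF-IDF and word-word edges weighted by positive PMI over a sliding window. The snippet below is a rough sketch of the document-word block only, using scikit-learn and SciPy; it is illustrative and not the preprocessing code in this repository.

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

def doc_word_adjacency(docs):
    """Build the document-word block of a TextGCN-style adjacency matrix.

    Nodes 0..n_docs-1 are documents, the remaining nodes are vocabulary
    words; each (doc, word) edge is weighted by TF-IDF. Word-word PMI
    edges and self-loops would be added on top of this in TextGCN.
    """
    tfidf = TfidfVectorizer().fit_transform(docs)   # (n_docs, n_words) sparse
    n_docs, n_words = tfidf.shape
    zero_dd = sp.csr_matrix((n_docs, n_docs))
    zero_ww = sp.csr_matrix((n_words, n_words))
    # Symmetric block matrix: [[0, TF-IDF], [TF-IDF^T, 0]]
    return sp.bmat([[zero_dd, tfidf], [tfidf.T, zero_ww]], format="csr")
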
$ python3 train.py R8
[2020/8/16 19:12:00] Epoch:1, train_loss=2.10250, train_acc=0.04436, val_loss=2.06262, val_acc=0.62044, time=4.47397
[2020/8/16 19:12:02] Epoch:2, train_loss=1.91917, train_acc=0.66235, val_loss=2.05057, val_acc=0.68978, time=2.37341
[2020/8/16 19:12:05] Epoch:3, train_loss=1.80157, train_acc=0.72514, val_loss=2.04366, val_acc=0.72263, time=2.30123
[2020/8/16 19:12:07] Epoch:4, train_loss=1.73192, train_acc=0.76727, val_loss=2.03955, val_acc=0.75365, time=2.29406

...
[2020/8/16 19:14:40] Epoch:61, train_loss=1.42440, train_acc=0.99149, val_loss=2.01033, val_acc=0.96350, time=2.95969
[2020/8/16 19:14:43] Epoch:62, train_loss=1.42407, train_acc=0.99190, val_loss=2.01033, val_acc=0.96168, time=2.79453
[2020/8/16 19:14:45] Epoch:63, train_loss=1.42377, train_acc=0.99271, val_loss=2.01034, val_acc=0.96168, time=2.56621
[2020/8/16 19:14:45] Early stopping...
[2020/8/16 19:14:45] Optimization Finished!
[2020/8/16 19:14:46] Test set results: 
[2020/8/16 19:14:46] 	 loss= 1.79863, accuracy= 0.97076, time= 0.85016
[2020/8/16 19:14:46] Test Precision, Recall and F1-Score...
[2020/8/16 19:14:47]               precision    recall  f1-score   support
[2020/8/16 19:14:47] 
[2020/8/16 19:14:47]            0     0.9826    0.9917    0.9871      1083
[2020/8/16 19:14:47]            1     0.9706    0.8148    0.8859        81
[2020/8/16 19:14:47]            2     0.9826    0.9741    0.9784       696
[2020/8/16 19:14:47]            3     0.9435    0.9669    0.9551       121
[2020/8/16 19:14:47]            4     0.8454    0.9425    0.8913        87
[2020/8/16 19:14:47]            5     0.9012    0.9733    0.9359        75
[2020/8/16 19:14:47]            6     0.9615    0.6944    0.8065        36
[2020/8/16 19:14:47]            7     1.0000    1.0000    1.0000        10
[2020/8/16 19:14:47] 
[2020/8/16 19:14:47]     accuracy                         0.9708      2189
[2020/8/16 19:14:47]    macro avg     0.9484    0.9197    0.9300      2189
[2020/8/16 19:14:47] weighted avg     0.9715    0.9708    0.9703      2189
[2020/8/16 19:14:47] 
[2020/8/16 19:14:47] Macro average Test Precision, Recall and F1-Score...
[2020/8/16 19:14:47] (0.9484369779553932, 0.9197363948390138, 0.9300186011259608, None)
[2020/8/16 19:14:47] Micro average Test Precision, Recall and F1-Score...
[2020/8/16 19:14:48] (0.9707629054362723, 0.9707629054362723, 0.9707629054362723, None)
[2020/8/16 19:14:48] Embeddings:
Word_embeddings:7688 
Train_doc_embeddings:5485
Test_doc_embeddings:2189
Word_embeddings:
[[0.         0.16006473 0.         ... 0.23560655 0.03900092 0.03204489]
 [0.77342826 0.         0.         ... 0.         0.5302985  0.6264921 ]
 [0.7280124  0.2820953  0.         ... 0.         0.11174901 0.5485109 ]
 ...
 [0.3487121  0.00971795 0.16308378 ... 0.23769674 0.19000074 0.06477314]
 [0.04132688 0.10753749 0.06603149 ... 0.05606151 0.13652363 0.0443929 ]
 [0.06743763 0.23347118 0.36717737 ... 0.15584956 0.         0.        ]]
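
The per-class table and the macro/micro tuples in the training log look like scikit-learn's standard metrics output. A minimal sketch of how such numbers are typically produced is shown below; the y_true and y_pred arrays are hypothetical stand-ins for the real test labels and predictions.

from sklearn.metrics import classification_report, precision_recall_fscore_support

y_true = [0, 0, 1, 2, 2, 2]   # hypothetical test labels
y_pred = [0, 1, 1, 2, 2, 0]   # hypothetical model predictions

# Per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_true, y_pred, digits=4))

# (precision, recall, f1, None) tuples like the ones in the log above
print(precision_recall_fscore_support(y_true, y_pred, average="macro"))
print(precision_recall_fscore_support(y_true, y_pred, average="micro"))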

References

Papers

  • [Yao, Mao, and Luo, 2018] Graph Convolutional Networks for Text Classification
  • [Kipf and Welling, 2017] Semi-Supervised Classification with Graph Convolutional Networks
  • [Pal et al., 2020] Multi-label Text Classification using Attention-based Graph Neural Network
  • [Veličković et al., 2017] Graph Attention Networks

Repos
