# TFKit - Multi-Label Classifier - PapersWithCode Dataset
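In this notebook, we train a multi-label classifier that predicts the tasks of a paper from the Papers with Code dataset. The dataset includes abstracts, but the conversion code below uses each paper's title as the input text.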

References:
- https://github.com/voidful/TFkit
- https://github.com/paperswithcode/paperswithcode-data

In [None]:
%rm -f papers-with-abstracts.json.gz*
!wget -nc https://paperswithcode.com/media/about/papers-with-abstracts.json.gz
!gunzip -f papers-with-abstracts.json.gz
!ls -lhS
!head -n 30 papers-with-abstracts.json

--2020-09-14 16:03:52--  https://paperswithcode.com/media/about/papers-with-abstracts.json.gz
Resolving paperswithcode.com (paperswithcode.com)... 104.26.13.155, 172.67.73.69, 104.26.12.155, ...
Connecting to paperswithcode.com (paperswithcode.com)|104.26.13.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 78863047 (75M) [application/octet-stream]
Saving to: ‘papers-with-abstracts.json.gz’


2020-09-14 16:03:59 (13.4 MB/s) - ‘papers-with-abstracts.json.gz’ saved [78863047/78863047]

total 245M
-rw-r--r-- 1 root root 245M Sep 13 20:09 papers-with-abstracts.json
drwxr-xr-x 1 root root 4.0K Aug 27 16:39 sample_data
[
  {
    "paper_url": "https://paperswithcode.com/paper/understanding-the-semantic-intent-of-natural",
    "arxiv_id": null,
    "title": "Understanding the Semantic Intent of Natural Language Query",
    "abstract": "",
    "url_abs": "https://www.aclweb.org/anthology/I13-1063/",
    "url_pdf": "https://www.aclweb.org/anthology/I13-1063",
    "pro

In [None]:
!pip install tfkit nlprep

Collecting tfkit
  Downloading https://files.pythonhosted.org/packages/bd/3e/083f71c56b87c97651affa740a00da5cb32019d8367afd747bcdf69abbda/tfkit-0.3.88-py3-none-any.whl (54kB)
Collecting nlprep
  Downloading https://files.pythonhosted.org/packages/5c/6b/c8b866d11466e4be3f1024c0d2603662cb952f5be7413890a93a6584c339/nlprep-0.1.53-py3-none-any.whl (44kB)
Collecting transformers>=2.5.1
  Downloading https://files.pythonhosted.org/packages/ae/05/c8c55b600308dc04e95100dc8ad8a244dd800fe75dfafcf1d6348c6f6209/transformers-3.1.0-py3-none-any.whl (884kB)
Collecting inquirer
  Downloading https://files.pythonhosted.org/packages/60/10/450a7edfaea3d09a4a7062bd567178bfb66233bae3ee0042934910e180de/inquirer-2.7.0-py2.py3-none-any.whl
Collecting tqdm>=4.45.0
  Downloading https://files.pythonho

### 2.2 Preparing the training data

We need to convert the JSON dataset to a comma-separated CSV file with two columns: the input text, followed by its labels. For example, a row looks like this:
```
"Fifty-four patients had pancreas cancer, confirmed by resection or biopsy in all cases .",outcome/population
```

When a row has more than one label, the labels in the target column are separated by "/".
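
To make the format concrete, here is a minimal sketch of how one row can be parsed and its target column turned into a multi-hot vector. The helpers `split_labels` and `to_multi_hot` and the three-label vocabulary are hypothetical illustrations; they are not part of TFKit or nlprep.

```python
import csv
import io

def split_labels(target, sep="/"):
    """Split a "/"-separated target string into a list of label names."""
    return [label for label in target.split(sep) if label]

def to_multi_hot(labels, vocabulary):
    """Return a 0/1 vector with a 1 at the position of every label present."""
    present = set(labels)
    return [1 if name in present else 0 for name in vocabulary]

# One row in the format described above: text first, labels second.
sample = ('"Fifty-four patients had pancreas cancer, confirmed by resection '
          'or biopsy in all cases .",outcome/population\n')
text, target = next(csv.reader(io.StringIO(sample)))

labels = split_labels(target)
vocabulary = ["intervention", "outcome", "population"]  # illustrative label set
vector = to_multi_hot(labels, vocabulary)
print(labels, vector)
```

The quoted first field shows why the CSV writer matters: the comma inside the sentence stays part of the text column instead of starting a new one.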


In [None]:
!pip install tqdm



In [None]:
import csv
import json

from tqdm import tqdm

file_name = "papers-with-abstracts.json"

# Load the full JSON dump (a list of paper records).
with open(file_name, encoding='utf-8') as f:
  docs = json.load(f)

# newline='' lets the csv module control line endings itself.
with open('papers-with-abstracts.csv', 'w', encoding='utf-8', newline='') as fw:
  csv_write = csv.writer(fw)
  for doc in tqdm(docs):
    # Keep only papers that have a title and at least one task.
    if doc['title'] != '' and len(doc['tasks']) > 0:
      # Input text is the title; the target joins all tasks with "/".
      csv_write.writerow([doc['title'], "/".join(doc['tasks'])])

100%|██████████| 163401/163401 [00:00<00:00, 473790.74it/s]


In [None]:
!head -n 35 papers-with-abstracts.csv

Tmuse: Lexical Network Exploration,Machine Translation/Semantic Textual Similarity
Parsing Croatian and Serbian by Using Croatian Dependency Treebanks,Dependency Parsing
Predicting the relevance of distributional semantic similarity with contextual information,Information Retrieval/Semantic Similarity/Semantic Textual Similarity/Word Sense Disambiguation
Assessing the Difficulty of Classifying ConceptNet Relations in a Multi-Label Classification Setting,Multi-Label Classification/Relation Classification
Deep Transfer Reinforcement Learning for Text Summarization,Text Summarization/Transfer Learning/Transfer Reinforcement Learning
Webly Supervised Joint Embedding for Cross-Modal lmage-Text Retrieval,Cross-Modal Retrieval
The role of grammar in transition-probabilities of subsequent words in English text,Text Generation
End-to-End Speech Recognition with High-Frame-Rate Features Extraction,Data Augmentation/End-To-End Speech Recognition/Speech Recognition
Automatic Language Ident

In [None]:
!nlprep --dataset clas_csv --infile papers-with-abstracts.csv --outdir data_pwa --util splitData

2020-09-14 16:04:47.289678: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
  import pandas.util.testing as tm
seed (number), [default=612]: 
train_ratio (between 0-1), [default=0.7]: 
test_ratio (between 0-1), [default=0.2]: 
valid_ratio (between 0-1), [default=0.1]: 
Start processing data...
100% 66194/66194 [00:00<00:00, 178600.54it/s]
100% 18912/18912 [00:00<00:00, 270816.44it/s]
100% 9456/9456 [00:00<00:00, 290057.77it/s]
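
The three progress bars correspond to the 0.7 / 0.2 / 0.1 train, test, and validation ratios accepted at the prompts. As a rough illustration of what such a split does, the sketch below shuffles rows with a fixed seed and cuts them by ratio. It is an assumption about the general technique, not nlprep's actual implementation, and it will not reproduce the same row assignment; `split_rows` is a hypothetical helper.

```python
import random

def split_rows(rows, train_ratio=0.7, test_ratio=0.2, seed=612):
    """Shuffle rows with a fixed seed and cut them into train/test/valid parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    n_test = round(len(shuffled) * test_ratio)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]  # whatever remains, roughly 10%
    return train, test, valid

rows = [f"paper {i}" for i in range(1000)]
train, test, valid = split_rows(rows)
print(len(train), len(test), len(valid))
```

Fixing the seed makes the split repeatable, so later runs train and evaluate on the same rows.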


In [None]:
!tfkit-train --train ./data_pwa/papers-with-abstracts.csv_valid.csv --test ./data_pwa/papers-with-abstracts.csv_valid.csv --model clas --config albert-base-v2 --maxlen 300 --batch 10 --cache

2020-09-14 16:06:58.021121: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
TRAIN PARAMETER
batch : 10
lr : [5e-05]
epoch : 10
maxlen : 300
savedir : checkpoints/
add_tokens : 0
train : ['./data_pwa/papers-with-abstracts.csv_valid.csv']
test : ['./data_pwa/papers-with-abstracts.csv_valid.csv']
model : ['clas']
tag : None
config : albert-base-v2
seed : 609
worker : 8
grad_accum : 1
tensorboard : False
resume : None
cache : True
enable_arg_panel : False
9456it [00:03, 2384.93it/s]
Processed 9456 data, removed 0 data that exceed the maximum length.
Using device: cuda
training batch : 10
 11% 100/946 [01:09<09:45,  1.45it/s]epoch: 1, tag: clas_0, model: MtClassifier, step: 100, loss: 0.40530025828629734, total:946
 21% 200/946 [02:18<08:36,  1.44it/s]epoch: 1, tag: clas_0, model: MtClassifier, step: 200, loss: 0.22391750159673393, total:946
 32% 300/946 [03:27<07:26,  1.45it/s]epoch: 1, tag: clas_0, model: MtClassifier, 
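
The `total:946` in the progress lines is consistent with 9456 examples at a batch size of 10, assuming the final partial batch is kept rather than dropped. A quick check of that arithmetic:

```python
import math

num_examples = 9456   # "Processed 9456 data" in the training log
batch_size = 10       # --batch 10

# Number of optimizer steps per epoch when the last partial batch is kept.
steps_per_epoch = math.ceil(num_examples / batch_size)
# Size of that final, smaller batch.
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)
```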

In [None]:
!tfkit-eval --valid ./data_pwa/papers-with-abstracts.csv_valid.csv --model ./checkpoints/1.pt --metric clas

2020-09-14 16:19:11.953641: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
===model info===
model_config : albert-base-v2
tags : ['clas_0']
type : ['clas']
maxlen : 300
epoch : 1
task-label : {'input_target_0_multi_label': ['', ' Named Entity Recognition ', ' Relation Extraction', '3D Absolute Human Pose Estimation', '3D Action Recognition', '3D Car Instance Understanding', '3D Character Animation From A Single Photo', '3D Depth Estimation', '3D Face Reconstruction', '3D Facial Expression Recognition', '3D Hand Pose Estimation', '3D Human Pose Estimation', '3D Instance Segmentation', '3D Multi-Object Tracking', '3D Multi-Person Pose Estimation', '3D Object Classification', '3D Object Detection', '3D Object Recognition', '3D Object Reconstruction', '3D Object Reconstruction From A Single Image', '3D Object Retrieval', '3D Object Super-Resolution', '3D Part Segmentation', '3D Point Cloud Matching', '3D Pose Estimation
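
The label list above contains an empty string and names with stray spaces, such as `' Named Entity Recognition '`, which end up as separate classes. One possible cleanup, applied to `doc['tasks']` before joining them with "/", is sketched below. `clean_tasks` is a hypothetical helper and is not part of TFKit or nlprep.

```python
def clean_tasks(tasks):
    """Strip surrounding whitespace, drop empty names, and remove duplicates
    while keeping the original order."""
    seen = set()
    cleaned = []
    for task in tasks:
        name = task.strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned

raw = ["", " Named Entity Recognition ", " Relation Extraction", "3D Object Detection"]
target = "/".join(clean_tasks(raw))
print(target)
```

Rows whose task list becomes empty after cleaning would also need to be skipped when writing the CSV.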

In [None]:
!tfkit-eval --valid ./data_pwa/papers-with-abstracts.csv_valid.csv --model ./checkpoints/1.pt --metric clas --print

2020-09-14 16:25:04.761307: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
===model info===
model_config : albert-base-v2
tags : ['clas_0']
type : ['clas']
maxlen : 300
epoch : 1
task-label : {'input_target_0_multi_label': ['', ' Named Entity Recognition ', ' Relation Extraction', '3D Absolute Human Pose Estimation', '3D Action Recognition', '3D Car Instance Understanding', '3D Character Animation From A Single Photo', '3D Depth Estimation', '3D Face Reconstruction', '3D Facial Expression Recognition', '3D Hand Pose Estimation', '3D Human Pose Estimation', '3D Instance Segmentation', '3D Multi-Object Tracking', '3D Multi-Person Pose Estimation', '3D Object Classification', '3D Object Detection', '3D Object Recognition', '3D Object Reconstruction', '3D Object Reconstruction From A Single Image', '3D Object Retrieval', '3D Object Super-Resolution', '3D Part Segmentation', '3D Point Cloud Matching', '3D Pose Estimation
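
As a rough illustration of how multi-label predictions can be scored against gold labels, the sketch below computes micro-averaged precision, recall, and F1 over label sets. This is an assumption about a reasonable metric, not necessarily what `tfkit-eval --metric clas` computes; `micro_prf` is a hypothetical helper, and the predictions are made up for the example.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over parallel lists of label sets."""
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = [{"Machine Translation", "Semantic Textual Similarity"}, {"Dependency Parsing"}]
pred = [{"Machine Translation"}, {"Dependency Parsing", "Text Generation"}]  # made-up predictions
precision, recall, f1 = micro_prf(gold, pred)
print(precision, recall, f1)
```

Micro-averaging pools every label decision before dividing, so frequent tasks weigh more than rare ones.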