# AdaptNLP - Multi-Label Classifier - PapersWithCode Dataset
In this notebook, we will try to predict the tasks associated with a paper from its abstract, using the Papers with Code dataset.

References:
- https://github.com/Novetta/adaptnlp
- https://github.com/paperswithcode/paperswithcode-data

## 1. Google Colaboratory

Google Colaboratory is a hosted Python development environment based on [Jupyter notebooks](https://jupyter.org/). These notebooks let you interweave structured text (using the [Markdown syntax](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)) and Python code. They are particularly suited for prototyping new ideas thanks to interactive code execution.

### 1.1 Activate GPU support

To activate GPU support, click on `Runtime > Change runtime type` in the notebook menu and choose `GPU` as the hardware accelerator. To check whether the GPU is available for computation, we import the deep learning framework [PyTorch](https://pytorch.org/) and query it:

In [None]:
import torch
torch.cuda.is_available()

If a GPU is available, the cell above outputs `True`. Note that Google Colaboratory also offers [TPU](https://cloud.google.com/tpu/) support. These *Tensor Processing Units* are specifically designed for machine learning tasks and may outperform conventional GPUs. While support for TPUs in PyTorch is still pending, [TensorFlow](https://www.tensorflow.org/) models may benefit from using TPUs (see [this tutorial](https://colab.research.google.com/notebooks/tpu.ipynb)).
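
A common follow-up to this check is to choose the device once and reuse it later. The cell below is a minimal sketch: the helper name `pick_device` is our own, and the guarded import is only there so the cell also runs in environments where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # PyTorch is preinstalled on Colab; the guard covers other setups
    torch = None


def pick_device() -> str:
    """Return "cuda" when a GPU is usable, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to `torch.device(...)` or to `.to(device)` calls.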

### 1.2 Useful commands

Within the notebook environment, you can not only execute Python code, but also bash commands by prepending a `!`. For example, you can install new Python packages via the package manager `pip`. Here, we just check the installed version of PyTorch:

In [None]:
!pip show torch
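
The same information can be read from Python without leaving the interpreter. The following sketch assumes Python 3.8 or newer (where `importlib.metadata` is part of the standard library); the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of the packages used in this notebook (None if absent).
for name in ("torch", "pandas", "tqdm"):
    print(name, installed_version(name))
```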

Another useful command is `!kill -9 -1`. It kills every process of the runtime, which resets all running kernels and frees up memory (including GPU memory). Furthermore, there are a few commands to take a closer look at the hardware specifications, i.e. to get information about the installed CPU and GPU:

In [None]:
!lscpu |grep 'Model name'

In [None]:
!nvidia-smi -L

In addition, you can check the available RAM and disk space:

In [None]:
!cat /proc/meminfo | grep 'MemAvailable'

In [None]:
!df -h / | awk '{print $4}'
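
If you prefer to stay in Python, roughly the same figures can be collected with the standard library. This is a sketch under the assumption that the runtime is Linux (as on Colab); the helper `available_memory_kib` is our own name and simply returns `None` where `/proc/meminfo` does not exist.

```python
import os
import shutil


def available_memory_kib(meminfo_path: str = "/proc/meminfo"):
    """Return MemAvailable in KiB, or None where the file is unavailable."""
    try:
        with open(meminfo_path, encoding="utf-8") as f:
            for line in f:
                # Lines look like "MemAvailable:   12345678 kB".
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


disk = shutil.disk_usage("/")
print("CPU cores:", os.cpu_count())
print("Free disk (GiB):", round(disk.free / 2**30, 1))
print("MemAvailable (KiB):", available_memory_kib())
```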

Finally, you can execute the following command to get a snapshot of the current GPU usage. This is useful to check how much of the GPU memory is in use when tuning the batch size for training. Note that while the training routine in a notebook is still running, you need to execute this command in another Colaboratory notebook to get an instant response:

In [None]:
!nvidia-smi
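
To use the memory figures programmatically (for example while tuning the batch size), the query mode of `nvidia-smi` can be called from Python. The sketch below relies on the `--query-gpu` and `--format=csv,noheader,nounits` options; the function names are hypothetical, and an empty list is returned when `nvidia-smi` is not installed.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text: str):
    """Parse lines like "1024, 15109" into (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            used, total = (int(part) for part in line.split(","))
        except ValueError:
            continue  # skip lines that do not hold two integers
        rows.append((used, total))
    return rows


def gpu_memory():
    """Return per-GPU (used, total) memory in MiB, or [] without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_memory(result.stdout) if result.returncode == 0 else []


print(gpu_memory())
```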

### 1.3 Mount Google Drive

Another important prerequisite for training our neural network is a place to save checkpoints of the trained model and to store obtained training data. Colaboratory provides convenient access to Google Drive via the `google.colab` Python module. The following command will mount your Google Drive contents to the folder path `/content/gdrive` on the Colaboratory instance. For authentication, you have to click the generated link and paste the authorization code into the input field:



In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Now you can use conventional Python packages such as `os` or `sys` to create/delete/change files in your Google Drive folders just as if you were working on your local machine.
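
As a small illustration, the sketch below creates a folder for checkpoints and writes a file into it. The folder name `adaptnlp-checkpoints` and the helper `ensure_checkpoint_dir` are hypothetical; on Colab, `root` would be a folder below the mount point `/content/gdrive`, while here a temporary directory stands in for it so the cell runs anywhere.

```python
import os
import tempfile


def ensure_checkpoint_dir(root: str, name: str = "adaptnlp-checkpoints") -> str:
    """Create `root/name` if it does not exist yet and return its path."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    return path


# A temporary directory replaces the Drive folder for this demonstration.
with tempfile.TemporaryDirectory() as root:
    ckpt_dir = ensure_checkpoint_dir(root)
    with open(os.path.join(ckpt_dir, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("checkpoint placeholder\n")
    print(os.listdir(ckpt_dir))
```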

## 2. Getting training data

We will fetch the Papers with Code dataset, which is updated daily, via its direct download link. For more information, please visit the dedicated repository:
- https://github.com/paperswithcode/paperswithcode-data

In [None]:
%rm -f papers-with-abstracts.json.gz*
!wget -nc https://paperswithcode.com/media/about/papers-with-abstracts.json.gz
!gunzip -f papers-with-abstracts.json.gz
!ls -lhS
!head -n 30 papers-with-abstracts.json

--2020-09-14 11:38:57--  https://paperswithcode.com/media/about/papers-with-abstracts.json.gz
Resolving paperswithcode.com (paperswithcode.com)... 172.67.73.69, 104.26.13.155, 104.26.12.155, ...
Connecting to paperswithcode.com (paperswithcode.com)|172.67.73.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 78863047 (75M) [application/octet-stream]
Saving to: ‘papers-with-abstracts.json.gz’


2020-09-14 11:39:06 (10.3 MB/s) - ‘papers-with-abstracts.json.gz’ saved [78863047/78863047]

total 245M
-rw-r--r-- 1 root root 245M Sep 13 20:09 papers-with-abstracts.json
drwx------ 4 root root 4.0K Sep 14 11:14 gdrive
drwxr-xr-x 1 root root 4.0K Aug 27 16:39 sample_data
[
  {
    "paper_url": "https://paperswithcode.com/paper/understanding-the-semantic-intent-of-natural",
    "arxiv_id": null,
    "title": "Understanding the Semantic Intent of Natural Language Query",
    "abstract": "",
    "url_abs": "https://www.aclweb.org/anthology/I13-1063/",
    "url_pdf": "https
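
Before converting the data, it can help to check how many records are actually usable. The sketch below reads either the compressed or the uncompressed file and counts the records that have a non-empty `abstract` and at least one entry in `tasks` (the two fields used later in this notebook). The function names are our own.

```python
import gzip
import json


def load_papers(path: str):
    """Load the list of paper records from a .json or .json.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def summarize(papers):
    """Count all records and those with an abstract and at least one task."""
    usable = sum(1 for p in papers if p.get("abstract") and p.get("tasks"))
    return {"total": len(papers), "usable": usable}


# Usage, once the download cell has run:
# print(summarize(load_papers("papers-with-abstracts.json")))
```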

### 2.1 Getting AdaptNLP package

<p align="center">
    <a href="https://github.com/Novetta/adaptnlp"> <img src="https://raw.githubusercontent.com/novetta/adaptnlp/master/docs/img/NovettaAdaptNLPlogo-400px.png" width="400"/></a>
</p>

<p align="center">
<strong> A high level framework and library for running, training, and deploying state-of-the-art Natural Language Processing (NLP) models for end to end tasks.</strong>
</p>
<p align="center">
    <a href="https://circleci.com/gh/Novetta/adaptnlp">
        <img src="https://img.shields.io/circleci/build/github/Novetta/adaptnlp/master">
    </a>
    <a href="https://badge.fury.io/py/adaptnlp">
        <img src="https://badge.fury.io/py/adaptnlp.svg">
    </a>
    <a href="https://github.com/Novetta/adaptnlp/blob/master/LICENSE">
        <img src="https://img.shields.io/github/license/novetta/adaptnlp">
    </a>
</p>


AdaptNLP allows users ranging from beginner Python coders to experienced machine learning engineers to leverage
state-of-the-art NLP models and training techniques in one easy-to-use Python package.

Built atop Zalando Research's Flair and Hugging Face's Transformers library, AdaptNLP provides Machine
Learning Researchers and Scientists a modular and **adaptive** approach to a variety of NLP tasks with an
**Easy** API for training, inference, and deploying NLP-based microservices.

**Key Features**

  - **[Full Guides and API Documentation](https://novetta.github.io/adaptnlp)**
  - [Tutorial](https://github.com/Novetta/adaptnlp/tree/master/tutorials) Jupyter/Google Colab Notebooks
  - Unified API for NLP Tasks with SOTA Pretrained Models (Adaptable with Flair and Transformer's Models)
    - Token Tagging 
    - Sequence Classification
    - Embeddings
    - Question Answering
    - Summarization
    - Translation
    - Text Generation
    - <em> More in development </em>
  - Training and Fine-tuning Interface
    - Integration with Transformer's Trainer Module for fast and easy transfer learning with custom datasets
    - Jeremy's **[ULM-FIT](https://arxiv.org/abs/1801.06146)** approach for transfer learning in NLP
    - Fine-tuning Transformer's language models and task-specific predictive heads like Flair's `SequenceClassifier`
  - [Rapid NLP Model Deployment](https://github.com/Novetta/adaptnlp/tree/master/rest) with Sebastián's [FastAPI](https://github.com/tiangolo/fastapi) Framework
    - Containerized FastAPI app
    - Immediately deploy any custom trained Flair or AdaptNLP model
  - [Dockerizing AdaptNLP with GPUs](https://hub.docker.com/r/achangnovetta/adaptnlp)
    - Easily build and run AdaptNLP containers leveraging NVIDIA GPUs with Docker

In [None]:
!pip install git+https://github.com/Novetta/adaptnlp

Collecting git+https://github.com/Novetta/adaptnlp
  Cloning https://github.com/Novetta/adaptnlp to /tmp/pip-req-build-_38de75j
  Running command git clone -q https://github.com/Novetta/adaptnlp /tmp/pip-req-build-_38de75j
Collecting jupyterlab
  Downloading https://files.pythonhosted.org/packages/d7/a9/d7c904ee406d1ce320fd1d91e05111fa158e66bb217f68d070b5f58c5937/jupyterlab-2.2.8-py3-none-any.whl (7.8MB)
Collecting transformers<4.0.0,>=3.0.0
  Downloading https://files.pythonhosted.org/packages/ae/05/c8c55b600308dc04e95100dc8ad8a244dd800fe75dfafcf1d6348c6f6209/transformers-3.1.0-py3-none-any.whl (884kB)
Collecting nlp<1.0.0,>=0.4.0
  Downloading https://files.pythonhosted.org/packages/09/e3/bcdc59f3434b224040c1047769c47b82705feca2b89ebbc28311e3764782/nlp-0.4.0-py3-none-any.whl (1.7MB)

### 2.2 Preparing train data

We have to convert the JSON dataset to a comma-separated CSV file with one text column and one label column (named `content` and `tasks` in this notebook). The expected layout looks like this example:
```
Content,Labels
"Fifty-four patients had pancreas cancer, confirmed by resection or biopsy in all cases .",outcome/population
```

When a sample has several labels, they are written into the single label field and separated by "/".
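
To make the format concrete, the following sketch splits such a label field and turns it into a multi-hot vector, which is the usual target representation for multi-label classification. The helper names and the third sample row (with the label `intervention`) are made up for illustration.

```python
def split_labels(label_field: str, sep: str = "/"):
    """Split a label field such as "outcome/population" into a list of labels."""
    return [label.strip() for label in label_field.split(sep) if label.strip()]


def multi_hot(labels, vocabulary):
    """Encode `labels` as a 0/1 vector following the order of `vocabulary`."""
    present = set(labels)
    return [1 if label in present else 0 for label in vocabulary]


rows = ["outcome/population", "outcome", "intervention/population"]
# The vocabulary is the sorted set of all labels seen in the data.
vocabulary = sorted({label for row in rows for label in split_labels(row)})
print(vocabulary)
for row in rows:
    print(row, "->", multi_hot(split_labels(row), vocabulary))
```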


In [None]:
!pip install pandas
!pip install tqdm



In [None]:
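import json

import pandas as pd
from tqdm import tqdm

file_name = "papers-with-abstracts.json"
cols = ['content', 'tasks']

with open(file_name, encoding='utf-8') as f:
    docs = json.load(f)

# Collect the rows in a list first: calling DataFrame.append in a loop is
# quadratic, and the method was removed in pandas 2.0.
rows = []
for doc in tqdm(docs):
    # Use the abstract as input text, in line with the goal of this notebook.
    abstract = doc.get('abstract') or ''
    tasks = doc.get('tasks') or []
    # Keep only papers with a non-empty abstract and at least one task.
    if abstract and tasks:
        # Several labels are joined with "/" as described above.
        rows.append({'content': abstract, 'tasks': "/".join(tasks)})

df = pd.DataFrame(rows, columns=cols)
df.to_csv('papers-with-abstracts.csv', index=False)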

In [None]:
!head -n 35 papers-with-abstracts.csv
df.info()

content,tasks
"Commonsense knowledge relations are crucial for advanced NLU tasks. We examine the learnability of such relations as represented in CONCEPTNET, taking into account their specific properties, which can make relation classification difficult: a given concept pair can be linked by multiple relation types, and relations can have multi-word arguments of diverse semantic types. We explore a neural open world multi-label classification approach that focuses on the evaluation of classification accuracy for individual relations. Based on an in-depth study of the specific properties of the CONCEPTNET resource, we investigate the impact of different relation representations and model variations. Our analysis reveals that the complexity of argument types and relation ambiguity are the most important challenges to address. We design a customized evaluation method to address the incompleteness of the resource that can be expanded in future work.",Multi-Label Classification/Relation Cl
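
Since the label distribution matters a lot for multi-label training, it is worth counting how often each task occurs. The sketch below assumes a CSV with a `tasks` column whose labels are separated by "/", as in the preview above; the function name `count_tasks` and the in-memory sample with the labels `Task A` and `Task B` are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_tasks(csv_file, sep: str = "/"):
    """Count how often each task label occurs in a CSV with a `tasks` column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts.update(task for task in row["tasks"].split(sep) if task)
    return counts


# In-memory sample; quoted fields may contain commas.
sample = io.StringIO(
    'content,tasks\n'
    '"First abstract, with a comma.",Task A/Task B\n'
    '"Second abstract.",Task B\n'
)
print(count_tasks(sample).most_common(2))
```

For the real file, open it with `open('papers-with-abstracts.csv', newline='', encoding='utf-8')` and pass the file object to `count_tasks`.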