<a href="https://colab.research.google.com/github/kunalsonalkar/transformers-nlp/blob/main/transformer_based_label_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
dataset_url = "https://git.io/nlp-with-transformers"
df_issues = pd.read_json(dataset_url, lines = True)

In [2]:
cols = ["url","id","title","user","labels","state","created_at","body"]
df_issues.loc[2, cols].to_frame()

,2
url,https://api.github.com/repos/huggingface/trans...
id,849529761
title,[DeepSpeed] ZeRO stage 3 integration: getting ...
user,"{'login': 'stas00', 'id': 10676103, 'node_id':..."
labels,"[{'id': 2659267025, 'node_id': 'MDU6TGFiZWwyNj..."
state,open
created_at,2021-04-02 23:40:42
body,"**[This is not yet alive, preparing for the re..."


In [3]:
# Getting only the names of labels
df_issues['labels'] = (df_issues["labels"].apply(lambda x: [meta['name'] for meta in x]))

In [4]:
df_issues[["labels"]]

,labels
0,[]
1,[]
2,[DeepSpeed]
3,[]
4,[]
...,...
9925,[]
9926,[]
9927,[]
9928,[]


In [5]:
df_issues["labels"].apply(lambda x: len(x)).value_counts().to_frame().T

labels,0,1,2,3,4,5
count,6440,3057,305,100,25,3


In [6]:
df_issues["labels"].explode().value_counts().to_frame().head(10)

labels,count
wontfix,2284
model card,649
Core: Tokenization,106
New model,98
Core: Modeling,64
Help wanted,52
Good First Issue,50
Usage,46
Core: Pipeline,42
Feature request,41


In [7]:
label_map = {"Core: Tokenization": "tokenization",
             "New model": "new model",
             "Core: Modeling": "model training",
             "Usage": "usage",
             "Core: Pipeline": "pipeline",
             "Tensorflow": "tensorflow or tf",
             "PyTorch": "pytorch",
             "Examples": "examples",
             "Documentation": "documentation"}

In [8]:
def filter_labels(x):
  return [label_map[label] for label in x if label in label_map]

In [9]:
df_issues["labels"] = df_issues["labels"].apply(filter_labels)
all_labels = list(label_map.values())
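
To illustrate what `filter_labels` does to a single row, here is a minimal self-contained sketch. It redefines a shortened copy of `label_map` (three entries only, for brevity), and the input lists are made-up examples rather than rows taken from the dataset.

```python
# Shortened copy of label_map, for illustration only.
label_map = {
    "Core: Tokenization": "tokenization",
    "New model": "new model",
    "Usage": "usage",
}

def filter_labels(x):
    # Keep only labels that appear in label_map, renamed to their mapped value.
    return [label_map[label] for label in x if label in label_map]

# Hypothetical label lists, not taken from the dataset.
mixed = filter_labels(["wontfix", "Core: Tokenization", "Usage"])
unmapped = filter_labels(["wontfix", "model card"])

print(mixed)     # ['tokenization', 'usage']
print(unmapped)  # []
```

A row whose labels are all outside `label_map` ends up with an empty list, which is what later places it in the "unlabeled" split.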

In [10]:
df_counts = df_issues["labels"].explode().value_counts().to_frame().T

In [11]:
df_counts

labels,tokenization,new model,model training,usage,pipeline,pytorch,documentation,examples
count,106,98,64,46,42,37,28,24


In [12]:
df_issues["split"] = "unlabeled"
mask = df_issues["labels"].apply(lambda x: len(x)> 0)
df_issues.loc[mask, "split"] = "labeled"
df_issues["split"].value_counts().to_frame()

split,count
unlabeled,9516
labeled,414


In [13]:
for column in ["title", "body", "labels"]:
  print(f" {df_issues[column].iloc[26]}\n")

 Add new CANINE model

 # 🌟 New model addition

## Model description

Google recently proposed a new **C**haracter **A**rchitecture with **N**o tokenization **I**n **N**eural **E**ncoders architecture (CANINE). Not only the title is exciting:

> Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficie

In [14]:
# Concatenate the title and body into a single text field
df_issues["text"] = df_issues.apply(lambda x: x["title"] + "\n\n" + x["body"], axis=1)

In [15]:
df_issues["text"].iloc[26]

"Add new CANINE model\n\n# 🌟 New model addition\r\n\r\n## Model description\r\n\r\nGoogle recently proposed a new **C**haracter **A**rchitecture with **N**o tokenization **I**n **N**eural **E**ncoders architecture (CANINE). Not only the title is exciting:\r\n\r\n> Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectivel

In [16]:
len_before = len(df_issues)
df_issues = df_issues.drop_duplicates(subset = ["text"])
len_after = len(df_issues)
print(f"Removed {len_before - len_after} duplicates")

Removed 187 duplicates


In [17]:
# Create training sets: fit a multi-label binarizer on the full label vocabulary
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb_fit = mlb.fit([all_labels])

In [18]:
mlb.transform([["tokenization", "new model"], ["documentation"]])

array([[0, 0, 0, 1, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0]])
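
The column order in the array above follows the alphabetically sorted class names, which `MultiLabelBinarizer` exposes as `mlb.classes_`. The sketch below reproduces the same encoding in plain Python to make that mapping explicit. `multi_hot` is a hypothetical helper written for illustration; it is not part of scikit-learn.

```python
# Same values as list(label_map.values()) in the notebook.
all_labels = ["tokenization", "new model", "model training", "usage",
              "pipeline", "tensorflow or tf", "pytorch", "examples",
              "documentation"]

# MultiLabelBinarizer sorts the classes it sees during fit.
classes = sorted(set(all_labels))

def multi_hot(label_list):
    # One column per class: 1 if the class is present, else 0.
    return [1 if c in label_list else 0 for c in classes]

encoded = [multi_hot(["tokenization", "new model"]),
           multi_hot(["documentation"])]

print(classes)
for row in encoded:
    print(row)
```

With this ordering, "documentation" is column 0, "new model" is column 3, and "tokenization" is column 7, which matches the positions of the ones in the output above.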

In [19]:
%pip install scikit-multilearn

Collecting scikit-multilearn
  Downloading scikit_multilearn-0.2.0-py3-none-any.whl.metadata (6.0 kB)
Downloading scikit_multilearn-0.2.0-py3-none-any.whl (89 kB)
Installing collected packages: scikit-multilearn
Successfully installed scikit-multilearn-0.2.0


In [23]:
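import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

def balanced_split(df, test_size=0.5):
  # Row positions as an (n, 1) array, the 2D shape iterative_train_test_split expects
  ind = np.expand_dims(np.arange(len(df)), axis=1)
  # Multi-hot label matrix used to stratify the split
  labels = mlb.transform(df["labels"])
  # Returns X_train, y_train, X_test, y_test; only the index arrays are needed
  ind_train, _, ind_test, _ = iterative_train_test_split(ind, labels, test_size)
  return df.iloc[ind_train[:, 0]], df.iloc[ind_test[:, 0]]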

from sklearn.model_selection import train_test_split
df_clean = df_issues[["text","labels","split"]].reset_index(drop=True).copy()
df_unsup = df_clean.loc[df_clean["split"] == "unlabeled",["text","labels"]].reset_index(drop=True).copy()
df_sup = df_clean.loc[df_clean["split"] == "labeled",["text","labels"]].reset_index(drop=True).copy()


In [31]:
import numpy as np
np.random.seed(0)
df_train, df_tmp = balanced_split(df_sup, test_size=0.5)
df_tmp = df_tmp.reset_index(drop=True).copy()
df_valid, df_test = balanced_split(df_tmp, test_size=0.5)

In [43]:
%pip install datasets

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

In [44]:
from datasets import Dataset, DatasetDict
ds = DatasetDict({
    "train": Dataset.from_pandas(df_train.reset_index(drop=True)),
    "valid": Dataset.from_pandas(df_valid.reset_index(drop=True)),
    "test": Dataset.from_pandas(df_test.reset_index(drop=True)),
    "unsup": Dataset.from_pandas(df_unsup.reset_index(drop=True)),
})

In [45]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 206
    })
    valid: Dataset({
        features: ['text', 'labels'],
        num_rows: 104
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 103
    })
    unsup: Dataset({
        features: ['text', 'labels'],
        num_rows: 9330
    })
})

In [46]:
np.random.seed(0)
all_indices = np.expand_dims(list(range(len(ds["train"]))), axis=1)
indices_pool = all_indices
labels = mlb.transform(ds["train"]["labels"])
train_samples = [8, 16, 32, 64, 128]
train_slices, last_k = [], 0

In [50]:
labels.size

1854
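
The value 1854 is the total number of entries in the multi-hot label matrix: one row per training example and one column per class. A quick check, assuming the 206 training rows reported in the `DatasetDict` output and the 9 entries of `label_map`:

```python
num_train_rows = 206  # num_rows of ds["train"] in the DatasetDict output
num_classes = 9       # number of entries in label_map

# ndarray.size is the product of the array's dimensions.
total = num_train_rows * num_classes
print(total)  # 1854
```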