For zero-shot learning, we experiment with the Hugging Face Transformers library.
The experiment tries to identify unknown classes encountered at test time.

# Importing Dependencies

In [8]:
# !pip install transformers==3.1.0
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py): started
  Building wheel for langdetect (setup.py): finished with status 'done'
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993242 sha256=ecfeb50ed99ca7a88e5e1eeb98ff363f4abde35913f60f5d609b594cf3a0cf18
  Stored in directory: c:\users\ht_13\appdata\local\pip\cache\wheels\d1\c1\d9\7e068de779d863bc8f8fc9467d85e25cfe47fa5051fff1a1bb
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [1]:
import pprint
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
from tqdm import tqdm

%matplotlib inline
warnings.filterwarnings("ignore")

In [2]:
import torch
from torch.utils.data import DataLoader

In [3]:
from torch import nn
import torchvision, torch
# from torchsummary import summary
from torchvision import transforms as T
from torch import optim
import copy

In [4]:
import scipy.io as sio

In [5]:
from transformers import pipeline
from langdetect import detect

# Loading Training and Testing Data

In [None]:
train_df = pd.read_csv('./dataset/Gungor_2018_VictorianAuthorAttribution_data-train.csv',encoding='latin-1')

In [6]:
test_df = pd.read_csv('dataset/dataset/Gungor_2018_VictorianAuthorAttribution_data.csv',encoding='latin-1')

In [7]:
test_author = sio.loadmat('dataset/dataset/test_author.mat')["test_author"]

In [8]:
test_df.count()

text    38809
dtype: int64

In [9]:
author_list_path='dataset/dataset/author_list.txt'

In [10]:
def load_author_names():
    """Map 1-based author IDs to the names listed in author_list_path."""
    # One author name per line; IDs follow line order, starting at 1
    with open(author_list_path, 'r') as f:
        author_list = f.read().split('\n')
    return {i: name for i, name in enumerate(author_list, start=1)}

In [11]:
author_catalog = load_author_names()

In [12]:
test_df['author_id'] = test_author

In [13]:
test_df.head()

   text                                                author_id
0  nt it seems te me how much money is he worth a...          1
1  to talk about why you heard of such a case as ...          1
2  my foot on the ground and said i believe you d...          1
3  hour or wait for miss oh wait for by all means...          1
4  will not listen to such words now go and remem...          1


In [14]:
test_df['author_name'] = [author_catalog[i] for i in test_df['author_id']]

In [15]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38809 entries, 0 to 38808
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   text         38809 non-null  object
 1   author_id    38809 non-null  uint8 
 2   author_name  38809 non-null  object
dtypes: object(2), uint8(1)
memory usage: 644.4+ KB


In [16]:
authors = [author_catalog[i] for i in test_df['author_id'].unique()]

# Zero-shot Learning Implementation

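The zero-shot classifier is built on a natural language inference (NLI) model, which takes two sequences (a premise and a hypothesis) and determines whether they contradict each other, entail each other, or neither.
Here the input text serves as the premise and each candidate author name is turned into a hypothesis; the entailment score for that pair is used as the score for the label. We then apply a confidence threshold to the resulting probabilities.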
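
To make the scoring step concrete, here is a minimal sketch of how per-label entailment logits can be turned into label scores in single-label mode, assuming a softmax across the candidate labels. The function name and the logit values are hypothetical and are not taken from the pipeline's source.

```python
import math


def entailment_to_label_scores(entailment_logits):
    """Softmax over one entailment logit per candidate label."""
    peak = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical entailment logits for three candidate labels
toy_logits = [2.0, 0.5, -1.0]
toy_scores = entailment_to_label_scores(toy_logits)
print(toy_scores)

# With 50 equally plausible labels, every score is 1/50
uniform_scores = entailment_to_label_scores([0.0] * 50)
print(max(uniform_scores))
```

Because the scores share a single probability budget, adding more candidate labels tends to push each individual score down.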


In [17]:
classifier = pipeline("zero-shot-classification",device = 0)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)



In [27]:
test_df = test_df.sample(frac=1).reset_index(drop=True)

In [29]:
candidate_results = [0] * len(authors)  # confident predictions per author
results = []                            # raw (scores, labels) per text
candidate_results_argmax = []           # top-scoring author per text

#### Perform classification for 1000 text inputs

**Confidence Threshold:** We apply a threshold to the probabilities returned by the classifier. If the top score is below the threshold, we classify the sample as unknown; if it is equal to or above the threshold, we assign the sample to the predicted known class.
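
The thresholding rule can be sketched as a small helper. This is an illustration only: the function name, the `"unknown"` label string, and the toy scores are assumptions, and the default of 0.5 mirrors the value used in the loop in this notebook.

```python
UNKNOWN_LABEL = "unknown"  # hypothetical name for the rejected class


def apply_confidence_threshold(labels, scores, threshold=0.5):
    """Return (label, score), replacing the label with UNKNOWN_LABEL
    when the top score falls below the threshold."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    best_score = scores[best_index]
    if best_score >= threshold:
        return labels[best_index], best_score
    return UNKNOWN_LABEL, best_score


# Toy inputs: one confident prediction, one that is rejected
confident = apply_confidence_threshold(["A", "B", "C"], [0.7, 0.2, 0.1])
rejected = apply_confidence_threshold(["A", "B", "C"], [0.4, 0.35, 0.25])
print(confident, rejected)
```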

In [30]:
CONFIDENCE_THRESHOLD = 0.5

for text in tqdm(test_df['text'].values[:1000]):
    # By default the scores are normalised across all candidate labels.
    # To score each label independently, pass multi_class=True;
    # each score will then fall between 0 and 1 on its own.
    res = classifier(text, authors)

    scores = res["scores"]
    labels = res["labels"]
    results.append((scores, labels))

    # Keep the top-scoring author for this text
    best_index = np.argmax(scores)
    predicted_class = labels[best_index]
    predicted_score = scores[best_index]
    candidate_results_argmax.append(predicted_class)

    # Count the prediction only if it clears the confidence threshold
    for i, author in enumerate(authors):
        if predicted_class == author and predicted_score >= CONFIDENCE_THRESHOLD:
            candidate_results[i] += 1

print(candidate_results)

100%|████████████████████████████████████████████████████████████████████████████| 1000/1000 [3:05:48<00:00, 11.15s/it]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]





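We performed zero-shot classification on 1000 text inputs, but the confidence scores across the 50 classes were very low: no prediction cleared the 0.5 threshold, so every counter stayed at zero. This happened because the model was never trained for such a large multi-class classification task. A likely contributing factor is that the scores are normalised across all 50 candidate labels, so a uniform guess yields only 1/50 = 0.02 per class.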
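
To quantify how far the scores fall short of the threshold, one could summarise the top score per sample. The sketch below assumes a list shaped like `results` in this notebook (one `(scores, labels)` tuple per text); the helper name and the toy values are hypothetical and are not the measured scores.

```python
def summarize_top_scores(results, threshold=0.5):
    """Summarise the best score per sample from (scores, labels) tuples."""
    top_scores = [max(scores) for scores, _labels in results]
    n = len(top_scores)
    above = sum(1 for s in top_scores if s >= threshold)
    return {
        "n": n,
        "mean_top_score": sum(top_scores) / n,
        "max_top_score": max(top_scores),
        "fraction_above_threshold": above / n,
    }


# Toy data standing in for the notebook's `results` list
toy_results = [
    ([0.06, 0.04], ["A", "B"]),
    ([0.70, 0.30], ["B", "A"]),
]
summary = summarize_top_scores(toy_results)
print(summary)
```

Comparing `mean_top_score` with the uniform baseline of 1/50 would indicate whether the classifier is doing much better than guessing.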