[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/analytics-and-ml/extreme-classification/extreme-classification.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/analytics-and-ml/extreme-classification/extreme-classification.ipynb)

# Extreme Classification

This demo aims to label new texts automatically when the number of possible labels is enormous. This scenario is known as extreme classification, a supervised learning variant that deals with multi-class and multi-label problems involving many choices. 

Examples of extreme classification include labeling a new article with Wikipedia's topical labels, matching web content with a set of relevant advertisements, classifying product descriptions with catalog labels, and matching a resume to a collection of pertinent job titles.

Here's how we'll perform extreme classification:

1. We'll transform 250,000 labels into vector embeddings using a publicly available embedding model and upload them into a [managed vector index](https://www.pinecone.io/). 
2. Then we'll take an article that requires labeling and transform it into a [vector embedding](https://www.pinecone.io/learn/vector-embeddings/) using the same model.
3. We'll use that article's vector embedding as the query to search the vector index. In effect, this will retrieve the most similar labels to the article's semantic content.
4. With the most relevant labels retrieved, we can automatically apply them to the article.
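
The four steps above amount to a nearest-neighbor search over label vectors. As a minimal, dependency-free sketch of that idea: the three-dimensional vectors and label names below are made up for illustration, whereas the demo itself uses 384-dimensional model embeddings and a Pinecone index instead of a Python loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_labels(query, label_vectors, top_k=2):
    """Return the top_k label names ranked by cosine similarity to the query."""
    ranked = sorted(
        label_vectors,
        key=lambda name: cosine_similarity(query, label_vectors[name]),
        reverse=True,
    )
    return ranked[:top_k]

# hypothetical toy "label embeddings" (step 1)
toy_label_vectors = {
    "Baseball": [0.9, 0.1, 0.0],
    "Video_codecs": [0.0, 0.2, 0.9],
    "Sports_in_Texas": [0.7, 0.6, 0.1],
}
# hypothetical toy "article embedding" (step 2)
toy_article_vector = [0.8, 0.3, 0.0]

# steps 3 and 4: search the labels and keep the closest ones
toy_prediction = nearest_labels(toy_article_vector, toy_label_vectors)
print(toy_prediction)
```

A linear scan like this is only practical for a handful of labels; with hundreds of thousands of labels, a vector index does the same ranking far more efficiently.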

Let's get started!

# Install Dependencies

In [1]:
!pip install -qU pinecone sentence-transformers datasets

# Setting up Pinecone's Similarity Search Service
Here we set up our similarity search service. We assume you are familiar with Pinecone's [quick start tutorial](https://www.pinecone.io/docs/quickstart-python/). To create our vector index, we first need to initialize our connection to Pinecone. For this we need a [free API key](https://app.pinecone.io). Once we have that, we initialize the connection like so:

In [2]:
from pinecone import Pinecone, ServerlessSpec

# create a client; later cells refer to this client object as `pinecone`
pinecone = Pinecone(api_key="YOUR_API_KEY")

Now we create a new index called "extreme-ml". The name itself is arbitrary, but the dimension must match the size of the embeddings we will store (384).

In [3]:
# pick a name for the new index
index_name = 'extreme-ml'

# check if the extreme-ml index exists
if index_name not in pinecone.list_indexes().names():
    # create the index if it does not exist
    pinecone.create_index(
        name=index_name,
        dimension=384,  # embedding size of the model used below
        metric="cosine",
        # assumption: a serverless index; set cloud and region to match your project
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# connect to extreme-ml index we created
index = pinecone.Index(index_name)

# Data Preparation 
In this demo, we classify Wikipedia articles using a standard dataset from an extreme classification benchmarking [resource](http://manikvarma.org/downloads/XC/XMLRepository.html). The data used in this example is [Wikipedia-500k](https://drive.google.com/drive/folders/12HiiGWmbLfTEEObs2Y2jiTETZfXDowrn), which contains around 500,000 labels. We load a subset of this dataset from Hugging Face. The subset contains 200,000 articles with around 250,000 different labels and is already prepared for this classification task.


In [22]:
from datasets import load_dataset

# load the dataset from huggingface
data = load_dataset("ashraq/WikiTitles-200K")
# load the train split into a pandas dataframe
df = data["train"].to_pandas()
df.head()




| | title | content | target |
|---|---|---|---|
| 0 | Anarchism | anarchism is a political philosophy that a... | [Anarchism, Anti-capitalism, Anti-fascism, Far... |
| 1 | Academy_Awards | the academy awards or the oscars (the offi... | [1929_establishments_in_the_United_States, Aca... |
| 2 | Anthropology | anthropology is the scientific study of hu... | [Anthropology, Social_sciences] |
| 3 | American_Football_Conference | the american football conference (afc) is o... | [American_Football_League, National_Football_L... |
| 4 | Analysis_of_variance | analysis of variance (anova) is a collection ... | [Analysis_of_variance, Design_of_experiments, ... |


# Create Vector Embeddings

Recall that we want to index and search all possible *labels* (about 250,000). We represent each label by the average of the vector embeddings of the articles tagged with that label.

Let's first create the article vector embeddings. Here we use the SentenceTransformer model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to generate article vector embeddings.

In [5]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load the model from huggingface
model = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2',
    device=device
)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [6]:
import pandas as pd

# Create embeddings
encoded_articles = model.encode(df['content'].tolist(), show_progress_bar=True)
# add the embeddings to our dataframe
df['content_vector'] = pd.Series(encoded_articles.tolist())

Batches:   0%|          | 0/6250 [00:00<?, ?it/s]

Indexing and searching the article embeddings themselves did not give accurate enough results, so we index and search the labels directly. Each label embedding is simply the average of the embeddings of all articles tagged with that label.

Let's create the label embeddings.

In [8]:
import numpy as np

# Explode the target indicator column
df_explode = df.explode('target')

# Group by label and define a unique vector for each label
label_vectors = df_explode.groupby('target').agg(mean=('content_vector', lambda x: np.vstack(x).mean(axis=0).tolist()))
label_vectors['target'] = label_vectors.index
label_vectors.columns = ['content_vector', 'label']

label_vectors.sample(10)

| target | content_vector | label |
|---|---|---|
| Chilean_people_of_Spanish_descent | [-0.01580731357846941, -0.0031899410699095044,... | Chilean_people_of_Spanish_descent |
| Danish_scientists | [-0.0458383746445179, 0.06631474941968918, -0.... | Danish_scientists |
| Jewish_delicatessens | [0.004267878830432892, 0.019741827622056007, -... | Jewish_delicatessens |
| Companies_based_in_Victoria,_British_Columbia | [-0.009467452298849821, -0.014646705240011215,... | Companies_based_in_Victoria,_British_Columbia |
| Cities_in_Lewis_County,_Kentucky | [0.04348186030983925, -0.012287872843444347, 0... | Cities_in_Lewis_County,_Kentucky |
| Airlines_of_Costa_Rica | [0.07663661427795887, 0.02771153673529625, -0.... | Airlines_of_Costa_Rica |
| School_buildings_completed_in_1891 | [0.07744941860437393, -0.017717836424708366, -... | School_buildings_completed_in_1891 |
| Political_parties_established_in_1971 | [-0.02586129680275917, -0.039618962444365025, ... | Political_parties_established_in_1971 |
| 1946_in_Japan | [-0.026875406503677368, 0.03590376675128937, 0... | 1946_in_Japan |
| 19th_century_in_Rome | [-0.09453806281089783, -0.02094138227403164, 0... | 19th_century_in_Rome |
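
To make the explode-and-average step concrete, here is a minimal, dependency-free sketch of the same computation on made-up two-dimensional vectors. The function name `average_label_vectors` and the toy data are illustrative only; the notebook itself does this with pandas and NumPy.

```python
from collections import defaultdict

def average_label_vectors(articles):
    """Map each label to the element-wise mean of its articles' vectors.

    `articles` is a list of (vector, labels) pairs.
    """
    grouped = defaultdict(list)
    for vector, labels in articles:
        for label in labels:  # the "explode" step: one entry per (article, label)
            grouped[label].append(vector)
    # the "group by and average" step: mean of each coordinate
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

# hypothetical toy articles: (embedding, labels)
toy_articles = [
    ([1.0, 0.0], ["A", "B"]),
    ([0.0, 1.0], ["B"]),
    ([0.5, 0.5], ["B", "C"]),
]
toy_means = average_label_vectors(toy_articles)
print(toy_means)
```

Note that an average of unit-length embeddings is generally not unit length itself. With the cosine metric used by the index this does not affect the ranking, because cosine similarity ignores vector magnitude.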


# Upsert Embeddings to Pinecone Index

In [9]:
from tqdm.auto import tqdm

# we will use batches of 256
batch_size = 256

for i in tqdm(range(0, len(label_vectors), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(label_vectors))
    # extract batch
    batch = label_vectors.iloc[i:i_end]
    # select embeddings for batch
    emb = batch["content_vector"].tolist()
    # get metadata
    meta = [{"label": l} for l in batch["label"]]
    # create unique IDs
    ids = [f"{idx}" for idx in range(i, i_end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

  0%|          | 0/1004 [00:00<?, ?it/s]

Let's validate the number of indexed labels.

In [10]:
# check that we have all vectors in index
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.2,
 'namespaces': {'': {'vector_count': 256899}},
 'total_vector_count': 256899}
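
As a quick sanity check on the batching arithmetic, the sketch below reproduces the index ranges that the upsert loop walks through, using the vector count reported above and the batch size of 256. The helper name `batch_ranges` is illustrative and not part of the notebook.

```python
import math

def batch_ranges(n_items, batch_size):
    """Yield (start, end) index pairs, mirroring i and i_end in the upsert loop."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

n_vectors = 256_899  # total_vector_count reported by describe_index_stats()
batch_size = 256     # batch size used in the upsert loop

ranges = list(batch_ranges(n_vectors, batch_size))
n_batches = len(ranges)
last_start, last_end = ranges[-1]

print(n_batches, math.ceil(n_vectors / batch_size))
print(last_start, last_end, last_end - last_start)
```

The loop therefore makes 1004 upsert calls, the last of which carries only 131 vectors, and the IDs run from "0" to "256898".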

# Query 

Now, let's test the vector index and examine the classifier results. We use new articles that were not used when we generated the label embeddings. For that, we load the test split of the dataset as follows:

In [11]:
# load the test split into a pandas dataframe
df_test = data["test"].to_pandas()
df_test.head()

| | title | content | target |
|---|---|---|---|
| 0 | Autism | \<!-- notes: 1) please follow the wikipedia... | [Autism, Communication_disorders, Mental_and_b... |
| 1 | Altruism | altruism or selflessness is the principle o... | [Altruism, Auguste_Comte, Defence_mechanisms, ... |
| 2 | Alchemy | alchemy is an influential tradition whose p... | [Alchemy, Esotericism, Hermeticism] |
| 3 | Andorra | {{infobox country \| conventional_long_name ... | [Andorra, Constitutional_monarchies, Countries... |
| 4 | Alkali_metal | the alkali metals are a group (column) in th... | [Alkali_metals, Groups_in_the_periodic_table, ... |


First, we write some helper functions to select a test article and to query the Pinecone index for labels.

In [12]:
from pprint import pprint

In [13]:
def select_test_article(index):
    print("Query Article:")
    # select the article associated with index from test split
    article = df_test.iloc[index]
    # print test article data
    data = {"Title": article.title, "Content": article.content[:1000],  "Original Labels": list(article.target)}
    pprint(data)
    return article

In [14]:
def query_pinecone(article, top_k=3):
    # Create embeddings for test articles
    xq = model.encode(article.content).tolist()
    # query pinecone for labels
    results = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # select only the labels from result and print
    labels = [res["metadata"]["label"] for res in results.matches]
    pprint({"Predicted Labels": labels})

Now let's run some queries. We select articles from the test split and use their content embeddings to query the Pinecone index and retrieve the labels most similar to each article's semantic content.
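
Keep in mind that `query_pinecone` always returns `top_k` labels, however weak the later matches are. One possible refinement, sketched here under assumptions, is to keep only the labels whose similarity score clears a cutoff. The helper `labels_above`, the cutoff of 0.5, and the scores in the toy matches are all hypothetical; the sketch only assumes that each match carries a `score` next to its `metadata`, with higher cosine scores meaning closer matches.

```python
def labels_above(matches, min_score):
    """Keep (label, score) pairs whose similarity score is at least min_score."""
    return [
        (m["metadata"]["label"], m["score"])
        for m in matches
        if m["score"] >= min_score
    ]

# hypothetical matches with made-up scores, shaped like entries of `results.matches`
toy_matches = [
    {"id": "0", "score": 0.71, "metadata": {"label": "MPEG"}},
    {"id": "1", "score": 0.64, "metadata": {"label": "Video_codecs"}},
    {"id": "2", "score": 0.38, "metadata": {"label": "Free_video_software"}},
]

kept = labels_above(toy_matches, min_score=0.5)
print(kept)
```

A suitable cutoff depends on the data and the embedding model, so it would need to be tuned on labeled examples rather than fixed in advance.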

In [15]:
article = select_test_article(524)

Query Article:
{'Content': 'mpeg-4 is a method of defining compression of audio and visual '
            '(av) digital data. it was introduced in late 1998 and designated '
            'a standard for a group of audio and video coding formats and '
            'related technology agreed upon by the iso/iec moving picture '
            'experts group (mpeg) (iso/iec jtc1/sc29/wg11) under the formal '
            'standard iso/iec 14496&nbsp;â\x80\x93 coding of audio-visual '
            'objects. uses of mpeg-4 include compression of av data for web '
            '(streaming media) and cd distribution, voice (telephone, '
            'videophone) and broadcast television applications. ==background== '
            'mpeg-4 absorbs many of the features of mpeg-1 and mpeg-2 and '
            'other related standards, adding new features such as (extended) '
            'vrml support for 3d rendering, object-oriented composite files '
            '(including audio, video and vrml objects), s

In [16]:
query_pinecone(article, top_k=10)

{'Predicted Labels': ['Container_formats',
                      'MPEG',
                      'Open_standards_covered_by_patents',
                      'Video_codecs',
                      'Microsoft_Windows_multimedia_technology',
                      'Free_video_software',
                      'Audio_format_converters',
                      'Video_conversion_software',
                      'Video_editing_software',
                      'Digital_rights_management_standards']}


Let's run some more queries.

In [17]:
article = select_test_article(354)

Query Article:
{'Content': '  the houston astros are a professional baseball team located in '
            'houston, texas. the team is a member of the western division of '
            "major league baseball's american league, having moved in 2013 "
            'after spending their first 51 seasons in the national league. the '
            'astros have played their home games at minute maid park since '
            '2000. the astros were established as the houston colt .45s in . '
            'th',
 'Original Labels': ['Houston_Astros',
                     'Major_League_Baseball_teams',
                     'Professional_baseball_teams_in_Texas',
                     'Sports_clubs_established_in_1962'],
 'Title': 'Houston_Astros'}


In [18]:
query_pinecone(article, top_k=15)

{'Predicted Labels': ['1965_Major_League_Baseball_season',
                      'Houston_Astros_seasons',
                      'Cities_in_Texas_County,_Missouri',
                      'San_Diego_Padres_minor_league_affiliates',
                      'Professional_baseball_teams_in_Texas',
                      'Sports_in_San_Antonio,_Texas',
                      'Houston_Astros_broadcasters',
                      'Texas_soccer_clubs',
                      'Sports_venues_in_Greater_Orlando',
                      'Buildings_and_structures_in_Kissimmee,_Florida',
                      'Martinsville_Astros_players',
                      'Culture_of_Houston,_Texas',
                      'American_soccer_clubs_2008_season',
                      '2008_Major_League_Soccer_season',
                      'Buildings_and_structures_in_Houston,_Texas']}


In [19]:
article = select_test_article(400)

Query Article:
{'Content': ' this article details the variety of means of transport in '
            'jersey, channel islands. thumb|cycle lane in st helier ==air '
            'transport== airports: *jersey airport ==rail transport== '
            'historically there were public railway services in the island, '
            'provided by two railway companies: *the jersey railway closed in '
            '1936. *the jersey eastern railway closed in 1929. during the '
            'german military occupation 1940â\x80\x931945, light railways were '
            're-established by the germans for the purpose of supplying '
            'coastal fortifications. a one-metre gauge line was laid down '
            'following the route of the former jersey railway from saint '
            'helier to la corbiÃ¨re, with a branch line connecting the stone '
            'quarry at ronez in saint john. a 60cm line ran along the west '
            'coast, and another was laid out heading east from sain

In [20]:
query_pinecone(article, top_k=10)

{'Predicted Labels': ['PATH_stations_in_New_Jersey',
                      'New_Jersey_Transit_Rail_Operations',
                      'New_Jersey_streetcar_lines',
                      'Hudson-Bergen_Light_Rail_stations',
                      'Beeching_closures_in_Wales',
                      'Railway_stations_opened_in_1898',
                      'Rail_infrastructure_in_New_Jersey',
                      'Transportation_in_Rockland_County,_New_York',
                      'Closed_railway_lines_in_Wales',
                      'Disused_railway_stations_in_Kent']}


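We can see that the predicted labels are topically related to the articles, although not every label is correct. For the article on transport in Jersey (Channel Islands), for example, several of the retrieved labels refer to New Jersey instead.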
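
To put a number on a single example, one simple option is exact-match precision and recall at k. The sketch below applies it to the Houston_Astros query, with both label lists copied from the outputs above. The helper name `precision_recall_at_k` is illustrative, and this is only one of several reasonable ways to score the results.

```python
def precision_recall_at_k(predicted, original):
    """Exact-match precision and recall of predicted labels against true labels."""
    hits = set(predicted) & set(original)
    precision = len(hits) / len(predicted)
    recall = len(hits) / len(original)
    return precision, recall, sorted(hits)

# original labels of the Houston_Astros test article
original = [
    "Houston_Astros",
    "Major_League_Baseball_teams",
    "Professional_baseball_teams_in_Texas",
    "Sports_clubs_established_in_1962",
]
# labels returned by query_pinecone(article, top_k=15)
predicted = [
    "1965_Major_League_Baseball_season",
    "Houston_Astros_seasons",
    "Cities_in_Texas_County,_Missouri",
    "San_Diego_Padres_minor_league_affiliates",
    "Professional_baseball_teams_in_Texas",
    "Sports_in_San_Antonio,_Texas",
    "Houston_Astros_broadcasters",
    "Texas_soccer_clubs",
    "Sports_venues_in_Greater_Orlando",
    "Buildings_and_structures_in_Kissimmee,_Florida",
    "Martinsville_Astros_players",
    "Culture_of_Houston,_Texas",
    "American_soccer_clubs_2008_season",
    "2008_Major_League_Soccer_season",
    "Buildings_and_structures_in_Houston,_Texas",
]

precision, recall, hits = precision_recall_at_k(predicted, original)
print(hits, round(precision, 3), recall)
```

Exact matching is a strict measure: a close label such as `Houston_Astros_seasons` counts as a miss even though it is clearly on topic, so these numbers understate how relevant the retrieved labels look to a human reader.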

# Summary

We demonstrated a similarity search approach to extreme classification of texts. We took a simple approach, representing each label as the average of the vector embeddings of its corresponding texts. At classification time, we match a new article's embedding to its nearest label embeddings. The example results suggest that this approach is useful.

You can take this forward by exploring advanced ideas. For example, you can utilize the hierarchical relationship between labels or improve the label representations.  Just have fun, and feel free to [share](https://www.pinecone.io/contact/) your thoughts. 

# Delete the index

Delete the index when you no longer need it. Deletion is permanent: once the index is deleted, you cannot use it again.

In [None]:
pinecone.delete_index(index_name)

---