# Kick Start the Bulk Labeling using Embeddings

Contributor: Manikandan Sivanesan

## Elevator Pitch

- For text classification problems in machine learning projects, a well-labeled dataset is essential for training an accurate model.
- E.g., for sentiment analysis of movie reviews, each review needs a label: positive, negative, or neutral.
- At Red Hat, when cases come in, we need a way to identify common trends in them, both to improve the product and to allocate support resources.
- We currently create rules using keyword-based heuristics. E.g., if a case in the RHEL product with the Services SBR contains the keyword 'email', the case belongs to the email topic.
- The challenge is that there can be more than 10 tags per SBR, so this approach is hard to scale.
- In this hackday, we evaluate a tool that can speed up this process by using embeddings and bulk labeling to curate subsets of the dataset.
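
The keyword-based rule described above can be sketched roughly as follows. This is an illustration only: the `RULES` table, the `tag_case` function, and the dictionary field names are hypothetical and do not reflect the actual rule engine.

```python
# Hypothetical sketch of a keyword-based tagging rule; all names and fields are assumptions.
RULES = [
    # (product, sbr, keyword, topic)
    ("RHEL", "Services", "email", "email"),
]

def tag_case(case, rules=RULES):
    """Return the topics whose rule matches the case (possibly an empty list)."""
    text = case["summary"].lower()
    return [
        topic
        for product, sbr, keyword, topic in rules
        if case["product"] == product and case["sbr"] == sbr and keyword in text
    ]

case = {"product": "RHEL", "sbr": "Services", "summary": "Email delivery fails after update"}
print(tag_case(case))  # ['email']
```

Every new tag needs another hand-written row in the table, which is why this approach becomes hard to maintain once an SBR has many tags.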

## Goals

- Evaluate the Bulk tool
- Create embeddings for the Ansible dataset
- Visualize the embedding clusters of different topics

## Background and Technologies

- sentence-transformers package to create the embeddings
- paraphrase-MiniLM-L6-v2 model to create 384-dimensional embeddings
- UMAP to reduce the high-dimensional embeddings to two dimensions for visualization

## Resources

- [Youtube: Tools to Improve Training Data - Vincent Warmerdam - Talking Language AI Ep 2](https://youtu.be/KRQJDLyc1uM?si=fjAF2jUJa3yM9u5R) : Vincent Warmerdam builds a lot of NLP tools (https://github.com/koaning). Many of these tools target the scikit-learn ecosystem and there's a theme of labeling across many of them. A recent focus of his stack of tools is to improve training data. In this video, Vincent and Jay discuss a few of these tools and show how they work together. 
  - Human-learn: a toolkit to build human-based scikit-learn components
  - Doubtlab: a toolkit to help find doubtful labels in data
  - Embetter: A library that makes it very easy to use embeddings in scikit-learn
  - [Bulk](https://github.com/koaning/bulk): a library that uses embeddings to leverage bulk labeling

- We are specifically exploring [Bulk](https://github.com/koaning/bulk), a library that uses embeddings to make bulk labeling possible.


## Create embeddings for dataset and reduce the dimensionality for visual exploration

In [None]:
from fastcore.all import *
import pandas as pd
from umap import UMAP

from sentence_transformers import SentenceTransformer

In [None]:
# Load the sentence-transformers model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [None]:
# Load the dataset
df = pd.read_csv('customer_support.csv')
sentences = df['text'].tolist()  # encode() expects a string or a list of strings

In [None]:
# Calculate the embeddings
embeddings = model.encode(sentences)
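
Each row of `embeddings` is a vector, and sentences with similar meaning should get vectors that point in similar directions. Below is a minimal sketch of cosine similarity, a common way to compare such vectors. It uses made-up 3-dimensional toy vectors in place of real model output; the vectors and the `cosine_similarity` helper are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embeddings; real vectors would come from model.encode(...)
login_issue = [0.9, 0.1, 0.0]
password_reset = [0.8, 0.2, 0.1]
billing_question = [0.0, 0.2, 0.9]

# The two account-related vectors are closer to each other than to the billing one
print(cosine_similarity(login_issue, password_reset)
      > cosine_similarity(login_issue, billing_question))  # True
```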

In [None]:
# Reduce the dimensionality of the embeddings
umap = UMAP(n_components=2)  # UMAP reduces the embeddings from 384 dimensions to 2
X_tfm = umap.fit_transform(embeddings)

In [None]:
# Store the 2D coordinates as columns
df['x'] = X_tfm[:,0]
df['y'] = X_tfm[:,1]

In [None]:
df.to_csv('customer_support_embeddings_visual_ready.csv', index=False)
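
Bulk reads this CSV and, as far as we understand its README, its text view expects `text`, `x`, and `y` columns. Below is a small sketch of a sanity check to run before launching the tool. The `missing_columns` helper is hypothetical, and an in-memory CSV stands in for the real file.

```python
import csv
import io

REQUIRED = ("text", "x", "y")  # assumed required columns for the text view

def missing_columns(csv_file, required=REQUIRED):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]

# In-memory stand-in for customer_support_embeddings_visual_ready.csv
sample = io.StringIO("text,x,y\nCannot log in,1.5,-0.3\n")
print(missing_columns(sample))  # []
```

For a file on disk, the same helper can be called as `missing_columns(open(path, newline=""))`; an empty list means the header looks ready.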

## Apply the learning to the Ansible dataset

In [9]:
from fastcore.all import *
import pandas as pd
from umap import UMAP

from sentence_transformers import SentenceTransformer

In [10]:
# Load the sentence-transformers model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')


In [1]:
BASE = '/home/msivanes/0Work/routing/garlock/data/interim'
ans_df = pd.read_feather(f'{BASE}/ansible_labeled_dataset.feather')

In [11]:
def embed_2d(df):
    sentences = df['text'].tolist()       # encode() expects a string or a list of strings
    embeddings = model.encode(sentences)  # Calculate the embeddings
    umap = UMAP(n_components=2)           # UMAP reduces the embeddings from 384 dimensions to 2
    X_tfm = umap.fit_transform(embeddings)
    df['x'], df['y'] = X_tfm[:,0], X_tfm[:,1]
    return df

ans_df['text'] = ans_df['case_summary']  # 'text' is a required column; using the case summary for now
ans_df = embed_2d(ans_df)

In [6]:
ans_df[['case_number', 'case_summary', 'case_tags']].head()

|   | case_number | case_summary | case_tags |
|---|---|---|---|
| 0 | 2551134 | Ansible Tower Installation Issue | installation |
| 1 | 2551460 | Is there a way to provide permission to invent... | api |
| 2 | 2551667 | Need assistance setting up LDAP to RH IdM. | ldap |
| 3 | 2551788 | Unable to update Ansible 2.7 to 2.8 - dependen... | upgrade |
| 4 | 2552169 | RFE - Unable to manage multiple scm sources in... | rfe |


In [14]:
ans_df.columns

Index(['case_number', 'case_createdDate', 'case_product', 'case_summary',
       'case_description', 'case_sbr', 'case_tags', 'case_type', 'case_origin',
       'sbr_length', 'tag_length', 'targets', 'labels', 'text', 'x', 'y'],
      dtype='object')

In [12]:
ans_df.to_csv(f'{BASE}/ansible_labeled_dataset_visualize.csv', index=False)

In [15]:
f'{BASE}/ansible_labeled_dataset_visualize.csv'

'/home/msivanes/0Work/routing/garlock/data/interim/ansible_labeled_dataset_visualize.csv'

In [7]:
len(ans_df)

3796

In [18]:
tags = ans_df['case_tags'].value_counts().to_dict()

In [20]:
tags.keys()

dict_keys(['upgrade', 'registration', 'ldap', 'installation', 'tower_license', 'automation_hub', 'inventory', 'scm_update', 'api', 'rfe', 'windows', 'backup_restore', 'security', 'collections', 'execution_environments', 'ansible_analytics', 'ansible_builder', 'ansible_navigator'])
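
`value_counts()` sorts by frequency by default, so the keys above run from the most common tag to the least common. Below is a sketch of the same counting with the standard library, which can help spot rare tags that may need more labeled examples. The sample tag list and the `rare_tags` helper are made up for illustration.

```python
from collections import Counter

def rare_tags(tags, min_count):
    """Return tags that occur fewer than min_count times, least frequent first."""
    counts = Counter(tags)
    rare = (tag for tag, n in counts.items() if n < min_count)
    return sorted(rare, key=lambda tag: (counts[tag], tag))

# Made-up sample; the real input would be ans_df['case_tags']
sample_tags = ["upgrade", "upgrade", "upgrade", "ldap", "ldap", "api"]
print(Counter(sample_tags).most_common())   # [('upgrade', 3), ('ldap', 2), ('api', 1)]
print(rare_tags(sample_tags, min_count=3))  # ['api', 'ldap']
```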