### Zero-shot Topic Modeling Algorithm

Zero-shot topic modeling applies zero-shot text classification to topic prediction. Zero-shot text classification is built on a Natural Language Inference (NLI) model, which compares two sequences, a premise and a hypothesis, and predicts whether the premise entails the hypothesis, contradicts it, or is neutral (neither entails nor contradicts).

In zero-shot topic modeling, the text is the premise and each pre-defined candidate label is turned into a hypothesis. If the model predicts that a document, such as a review, entails the hypothesis for a candidate label, the document likely belongs to that topic. Otherwise, it likely does not.
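To make the premise/hypothesis framing concrete, the sketch below builds the pairs an NLI model would score for one review. The helper name `build_nli_pairs` is an illustrative assumption and not part of the Hugging Face API; the review text and the template string are the ones used elsewhere in this tutorial.

```python
def build_nli_pairs(text, candidate_labels, hypothesis_template):
    """Return one (premise, hypothesis) pair per candidate label.

    Hypothetical helper for illustration only; the zero-shot pipeline
    builds equivalent pairs internally.
    """
    return [(text, hypothesis_template.format(label)) for label in candidate_labels]


review = "The mic is great."
labels = ["sound quality", "battery", "price", "comfortable"]
pairs = build_nli_pairs(review, labels, "The topic of this review is {}.")

# Each pair is scored separately by the NLI model
for premise, hypothesis in pairs:
    print(f"premise: {premise!r} | hypothesis: {hypothesis!r}")
```

Each candidate label yields its own hypothesis, so one review with four labels results in four NLI comparisons.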

In [None]:
conda install -c huggingface transformers

In [None]:
# Data processing
import pandas as pd

# Modeling
from transformers import pipeline

# device=0 runs the model on the first GPU; use device=-1 to run on CPU
classifier = pipeline(task="zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device=0)

In [4]:
# Read in data
amz_review = pd.read_csv('amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])

# Drop the label column
amz_review = amz_review.drop('label', axis=1)

# Take a look at the data
amz_review.head()

| | review |
|---|---|
| 0 | So there is no way for me to plug it in here i... |
| 1 | Good case, Excellent value. |
| 2 | Great for the jawbone. |
| 3 | Tied to charger for conversations lasting more... |
| 4 | The mic is great. |


### Zero-shot Topic Prediction of a Single Topic

* First, the reviews are put into a list for the pipeline.
* Then, the candidate labels are defined. We set four candidate labels: `sound quality`, `battery`, `price`, and `comfortable`.
* After that, the hypothesis template is defined. The default template used by the Hugging Face pipeline is `This example is {}.` We use a template that is more specific to topic modeling, `The topic of this review is {}.`, which helps improve the results.
* Finally, the text, the candidate labels, and the hypothesis template are passed into the zero-shot classification pipeline called `classifier`.

In [5]:
# Put reviews in a list
sequences = amz_review['review'].to_list()

# Define the candidate labels 
candidate_labels = ["sound quality", "battery", "price", "comfortable"]

# Set the hypothesis template
hypothesis_template = "The topic of this review is {}."

# Prediction results
single_topic_prediction = classifier(sequences, candidate_labels, hypothesis_template=hypothesis_template)

# Save the output as a dataframe
single_topic_prediction = pd.DataFrame(single_topic_prediction)

# Take a look at the data
single_topic_prediction.head()


In [None]:
# Tune the batch_size to fit in the memory
batch_size = 4 

# Put reviews in a list
sequences = amz_review['review'].to_list()

# Define the candidate labels 
candidate_labels = ["sound quality", "battery", "price", "comfortable"]

# Set the hypothesis template
hypothesis_template = "The topic of this review is {}."

# Create an empty list to save the prediction results
single_topic_prediction = []

# Loop through the batches
for i in range(0, len(sequences), batch_size):
    # Append the results 
    single_topic_prediction += classifier(sequences[i:i+batch_size], candidate_labels, hypothesis_template=hypothesis_template)

In [None]:
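# Make sure the predictions are in a dataframe
# (the batched loop above returns a list of dicts)
single_topic_prediction = pd.DataFrame(single_topic_prediction)

# The column for the predicted topic
single_topic_prediction['predicted_topic'] = single_topic_prediction['labels'].apply(lambda x: x[0])

# The column for the score of the predicted topic
single_topic_prediction['predicted_topic_score'] = single_topic_prediction['scores'].apply(lambda x: x[0])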

# Take a look at the data
single_topic_prediction.head()
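Taking the first element of `labels` and `scores` relies on the pipeline returning both lists sorted by score, highest first. As a minimal sketch, the same top prediction can be picked without depending on that ordering. The `results` list below is hypothetical and only mimics the shape of the pipeline output (`sequence`, `labels`, `scores`); the scores are made up, and `top_prediction` is an illustrative helper name.

```python
# Hypothetical pipeline-style output; the scores are invented for illustration
results = [
    {"sequence": "The mic is great.",
     "labels": ["sound quality", "comfortable", "price", "battery"],
     "scores": [0.90, 0.05, 0.03, 0.02]},
    {"sequence": "Good case, Excellent value.",
     "labels": ["price", "comfortable", "sound quality", "battery"],
     "scores": [0.70, 0.20, 0.06, 0.04]},
]


def top_prediction(result):
    """Return the (label, score) pair with the highest score, regardless of order."""
    return max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])


top = [top_prediction(r) for r in results]
for r, (label, score) in zip(results, top):
    print(f"{r['sequence']!r} -> {label} ({score:.2f})")
```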

### Zero-shot Topic Prediction of Multiple Topics

In [None]:
# Put reviews in a list
sequences = amz_review['review'].to_list()

# Define the candidate labels 
candidate_labels = ["sound quality", "battery", "price", "comfortable"]

# Set the hypothesis template
hypothesis_template = "The topic of this review is {}."

# Prediction results
multi_topic_prediction = classifier(sequences, candidate_labels, hypothesis_template=hypothesis_template, multi_label=True)

# Save the output in a dataframe
multi_topic_prediction = pd.DataFrame(multi_topic_prediction)

# Take a look at the data
multi_topic_prediction.head()
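The sketch below illustrates why `multi_label=True` changes the scores. It assumes, based on our understanding of the Hugging Face pipeline, that single-label scores come from a softmax over the entailment logits across all candidate labels (so they sum to 1), while multi-label scores come from a separate softmax over the contradiction and entailment logits of each label (so each score is independent). The logits are made-up values, not model output, and `softmax` is a plain-Python helper written for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical NLI logits per label: (contradiction, neutral, entailment)
logits = {
    "sound quality": (-2.0, 0.5, 3.0),
    "battery": (2.5, 0.0, -1.5),
    "price": (0.0, 0.5, 2.0),
}

# Single-label (assumed behavior): labels compete, scores sum to 1
entailment_logits = [triple[2] for triple in logits.values()]
single = dict(zip(logits, softmax(entailment_logits)))

# Multi-label (assumed behavior): each label is scored on its own,
# entailment vs. contradiction, so several labels can score high
multi = {
    label: softmax([contradiction, entailment])[1]
    for label, (contradiction, _neutral, entailment) in logits.items()
}

print("single-label:", {k: round(v, 3) for k, v in single.items()})
print("multi-label: ", {k: round(v, 3) for k, v in multi.items()})
```

Under these assumptions, a review that mentions both sound quality and price can receive a high score for both topics in multi-label mode, which is what makes a per-label threshold meaningful.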

We set the threshold to 0.6, meaning that every label with a predicted probability greater than or equal to 0.6 is assigned to the review.

In [None]:
# Threshold probability
threshold = 0.6

# Expand the lists
multi_topic_prediction = multi_topic_prediction.set_index('sequence').apply(pd.Series.explode).reset_index()

# Filter by threshold
multi_topic_prediction = multi_topic_prediction[multi_topic_prediction['scores'] >= threshold]

# Take a look at the data
multi_topic_prediction.head()
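One consequence of filtering by threshold is that a review with no label at or above 0.6 disappears from the result entirely. The plain-Python sketch below shows the same expand-and-filter logic on a hypothetical two-review output (the sequences and scores are invented) and also collects the reviews that end up with no topic.

```python
threshold = 0.6

# Hypothetical multi-label output; the scores are invented for illustration
results = [
    {"sequence": "review A",
     "labels": ["battery", "price", "sound quality"],
     "scores": [0.95, 0.72, 0.10]},
    {"sequence": "review B",
     "labels": ["price", "battery", "sound quality"],
     "scores": [0.40, 0.20, 0.05]},
]

# One row per (review, label) pair that passes the threshold
rows = [
    (r["sequence"], label, score)
    for r in results
    for label, score in zip(r["labels"], r["scores"])
    if score >= threshold
]

# Reviews whose best score is below the threshold receive no topic
unassigned = [r["sequence"] for r in results if max(r["scores"]) < threshold]

print(rows)
print(unassigned)
```

Tracking the unassigned reviews can help when choosing the threshold: a high value gives cleaner topics but leaves more reviews without any label.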