# Introduction

Hello and welcome to this NLP Labs API walkthrough notebook! The NLP Labs API is built to auto-categorize text data in a customized manner. We'll walk through how to perform two types of text categorization tasks:

* Text Classification
* Named Entity Recognition

In each of these tasks we make an API request that sends a dataset of Disneyland reviews along with the custom labels. The output is a downloadable CSV: for classification, each input text is tagged with one of the specified custom labels, and for NER, each extracted entity is tagged with one.

Behind the scenes, we host zero-shot models that let you tag text data however you want without creating a labeled training dataset yourself.

Pretty cool, huh?

# Documentation of API endpoints

[Swagger Docs](https://anishpdalal-nlp-labs-api-fastapi-app.modal.run/docs)

# Setup

In [None]:
# Import dependencies
import pandas as pd
import requests
from io import StringIO

In [None]:
# Download sample dataset
!wget https://raw.githubusercontent.com/nlp-labs-ai/nlp-labs-api-walkthrough/main/sample.csv

--2022-10-13 02:19:48--  https://raw.githubusercontent.com/nlp-labs-ai/nlp-labs-api-walkthrough/main/sample.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 768345 (750K) [text/plain]
Saving to: ‘sample.csv’


2022-10-13 02:19:48 (16.6 MB/s) - ‘sample.csv’ saved [768345/768345]



Notice that we have the following columns:
* Review_ID - unique identifier of the Review
* Rating - Rating on scale of 1-5 the customer gave
* Year_Month - Review date
* Reviewer_Location - Country of Reviewer
* **Review_Text** - actual review (this is the text data we want to categorize)
* Branch - Disneyland location

In [None]:
!head -5 sample.csv

Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch
279977484,3,2015-6,United Kingdom,"As a family we left a little disappointed by our visit to DLP. The atmosphere in the park is brought down by rude people who think nothing of pushing in front of others in queues and smoking wherever they want. There also seemed to be a lot of construction work going on around the park and a few of the rides were shut. I have visited Disney World in Florida several times and have never encountered any of these problems there.On a positive note some of the rides were very good such as Thunder mountain, small world and Pirates of the Carribean. If you have younger children DLP is also more manageable to get round as its a much smaller park than Florida. Overall we managed to enjoy our holiday but think we will save for Florida next time.",Disneyland_Paris
182524324,5,2013-6,United States,"It truly IS the funnest place on earth!!! We had the chance to take our daughter and kids to Disneylan

# Upload Text to NLP Labs API

In [None]:
API_KEY = "<Your API Key Here>"
API_URL = "https://anishpdalal-nlp-labs-api-fastapi-app.modal.run"

In [None]:
# Upload the CSV; the context manager closes the file once the request finishes
with open("sample.csv", "rb") as f:
    upload_response = requests.post(
        f"{API_URL}/uploads",
        headers={"X-Api-Key": API_KEY, "Accept": "application/json"},
        files={"file": f}
    ).json()

In [None]:
upload_response

{'id': 'bb86a83c-f48b-44ef-b566-beda066d60b3', 'name': 'sample.csv'}

# Text Classification

In [None]:
request_body = {
  "upload_id": upload_response["id"],
  "text_column": "Review_Text",
  "name": "Disney_Land_Reviews_Classification",
  "type": "classification",
  "categories": ["food", "rides", "lines", "pricing"]
}

In [None]:
dataset_response = requests.post(
    f"{API_URL}/datasets",
    headers={"X-Api-Key": API_KEY, "Accept": "application/json"},
    json=request_body
).json()

In [None]:
dataset_response

{'id': '91eacaa7-5858-4215-8a01-a685f7507056',
 'upload_id': 'bb86a83c-f48b-44ef-b566-beda066d60b3',
 'name': 'Disney_Land_Reviews_Classification',
 'type': 'classification',
 'created': '2022-10-13 02:20:08'}

Let's poll the dataset to see how far along the classification task is

In [None]:
# The dataset id in the URL identifies the job, so the GET request needs no body
dataset_status_response = requests.get(
    f"{API_URL}/datasets/{dataset_response['id']}/status",
    headers={"X-Api-Key": API_KEY, "Accept": "application/json"}
).json()

There are two statuses: `PENDING` and `COMPLETE`.

`progress` is the percentage of the total dataset that has been processed so far.
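
One caveat when polling: a loop that waits only for `COMPLETE` never exits if the job stalls. Below is a minimal sketch of a polling helper with a timeout. The names `wait_until_complete` and `fetch_status` and the default timeout are hypothetical and not part of the NLP Labs API. The status request is replaced by a stand-in so the sketch runs offline.

```python
import time


def wait_until_complete(fetch_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch_status() until it reports COMPLETE or the timeout elapses.

    fetch_status is any zero-argument callable returning a dict shaped like
    {'status': 'PENDING', 'progress': 80}.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = fetch_status()
        if response["status"] == "COMPLETE":
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"dataset still {response['status']} after {timeout} seconds")
        sleep(interval)


# Stand-in for the real status request (assumption: same response shape).
fake_responses = iter([
    {"status": "PENDING", "progress": 80},
    {"status": "PENDING", "progress": 90},
    {"status": "COMPLETE", "progress": 100},
])
final = wait_until_complete(lambda: next(fake_responses), interval=0, sleep=lambda s: None)
print(final)
```

In the notebook, `fetch_status` could be a lambda that wraps the `requests.get(...).json()` call to the `/status` endpoint.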

In [None]:
import time

# Poll every 5 seconds until the status is no longer PENDING
status = "PENDING"
while status == "PENDING":
  dataset_status_response = requests.get(
    f"{API_URL}/datasets/{dataset_response['id']}/status",
    headers={"X-Api-Key": API_KEY, "Accept": "application/json"}
  ).json()
  print(dataset_status_response)
  status = dataset_status_response["status"]
  time.sleep(5)

{'status': 'PENDING', 'progress': 80}
{'status': 'PENDING', 'progress': 90}
{'status': 'PENDING', 'progress': 90}
{'status': 'COMPLETE', 'progress': 100}


Awesome! Looks like our results are ready for download. The results are downloadable as a CSV file object.

In [None]:
download_response = requests.get(
    f"{API_URL}/datasets/{dataset_response['id']}/download",
    headers={"X-Api-Key": API_KEY}
)
results_io = StringIO(download_response.content.decode("utf-8"))

Let's now load the results into a Pandas dataframe to inspect them

In [None]:
results_df = pd.read_csv(results_io)

The results have the following columns:
* id - corresponds to the index of the input row. So the first row of data (excluding column headers) in the input CSV file corresponds to `id=0`
* label - assigned label pulled from the list of custom labels
* score - how confident the classifier is on a scale of 0 - 1.
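
To make the `id` convention concrete, here is a small sketch in plain Python that maps result rows back to input rows by position. Both CSV snippets are made-up stand-ins, not rows from `sample.csv` or from a real download.

```python
import csv
from io import StringIO

# Made-up stand-ins for the input file and the downloaded results.
input_csv = StringIO(
    "Review_ID,Review_Text\n"
    "101,The queue was long\n"
    "102,Loved the rides\n"
)
results_csv = StringIO(
    "id,label,score\n"
    "0,lines,0.91\n"
    "1,rides,0.88\n"
)

# DictReader consumes the header, so list index 0 is the first data row.
input_rows = list(csv.DictReader(input_csv))

tagged = []
for result in csv.DictReader(results_csv):
    source = input_rows[int(result["id"])]  # id is a zero-based row position
    tagged.append((source["Review_ID"], result["label"], float(result["score"])))

print(tagged)
```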

In [None]:
results_df.head()

Unnamed: 0,id,label,score
0,0,lines,0.421821
1,1,rides,0.407061
2,2,rides,0.461304
3,3,rides,0.948876
4,4,rides,0.513835


## Celebrate

🙌

Now that we have our results the fun part begins! You can ask all sorts of interesting questions about the reviews because the auto classification has given them structure!

This is the value of the NLP Labs API. Without auto categorization it would be too time-consuming and impractical to manually label this dataset just to ask a few questions.

But maybe those questions could lead to business-altering insights! Now it's practical to ask a lot of questions against text data because the API removes the need to manually label and makes text analytics 100x faster!

# Asking Questions of your new and improved dataset

One easy way to ask questions is to join `results_df` with the original dataset 



In [None]:
sample_df = pd.read_csv("sample.csv")
merged_df = pd.merge(sample_df, results_df, left_index=True, right_on="id").drop(columns=["id"])

In [None]:
merged_df.head()

Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch,label,score
0,279977484,3,2015-6,United Kingdom,As a family we left a little disappointed by o...,Disneyland_Paris,lines,0.421821
1,182524324,5,2013-6,United States,It truly IS the funnest place on earth!!! We h...,Disneyland_California,rides,0.407061
2,603684918,4,2017-12,United States,We took the bus from Shenzhen to get there (gi...,Disneyland_HongKong,rides,0.461304
3,278627765,5,2014-10,United States,I always love Disneyland. Best Rides and atmos...,Disneyland_California,rides,0.948876
4,193829431,3,2014-2,Panama,It was nice to be at the original park created...,Disneyland_California,rides,0.513835


In [None]:
# What is the breakdown of reviews by label where the model was confident
merged_df[merged_df.score >= 0.5].groupby("label")["Review_ID"].count()

label
food        26
lines       90
pricing     28
rides      182
Name: Review_ID, dtype: int64

In [None]:
# What is the breakdown of reviews for California branch by label where the model was confident.
merged_df[(merged_df.score >= 0.5) & (merged_df.Branch == "Disneyland_California")].groupby("label")["Review_ID"].count()

label
food       12
lines      39
pricing    14
rides      92
Name: Review_ID, dtype: int64
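
The `0.5` cutoff in the queries above is a judgment call, and the breakdown can shift with it. Here is a small sketch in plain Python that recounts labels at a few cutoffs, using only the five rows shown by `results_df.head()` earlier. `count_confident` is a hypothetical helper name.

```python
from collections import Counter

# (label, score) pairs copied from the results_df.head() output.
rows = [
    ("lines", 0.421821),
    ("rides", 0.407061),
    ("rides", 0.461304),
    ("rides", 0.948876),
    ("rides", 0.513835),
]


def count_confident(rows, threshold):
    """Count labels whose score is at or above the threshold."""
    return Counter(label for label, score in rows if score >= threshold)


for threshold in (0.4, 0.5, 0.9):
    print(threshold, dict(count_confident(rows, threshold)))
```

On this tiny sample, lowering the cutoff to `0.4` brings the `lines` row back in, while `0.9` keeps a single `rides` row.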

Wouldn't it also be nice to know the overall sentiment broken down by label? There is a rating scale from 1-5, but people may have their own personal scales as to what constitutes a 1 or a 3. Let's augment this dataset with another classification task, this time using sentiment labels: `positive`, `neutral`, `negative`

In [None]:
request_body = {
  "upload_id": upload_response["id"],
  "text_column": "Review_Text",
  "name": "Disney_Land_Reviews_Sentiment",
  "type": "classification",
  "categories": ["positive", "neutral", "negative"]
}

dataset_response = requests.post(
    f"{API_URL}/datasets",
    headers={"X-Api-Key": API_KEY, "Accept": "application/json"},
    json=request_body
).json()

In [None]:
# Check the status once; the GET request needs no body
dataset_status_response = requests.get(
    f"{API_URL}/datasets/{dataset_response['id']}/status",
    headers={"X-Api-Key": API_KEY, "Accept": "application/json"}
).json()

In [None]:
import time

# Poll every 5 seconds until the status is no longer PENDING
status = "PENDING"
while status == "PENDING":
  dataset_status_response = requests.get(
    f"{API_URL}/datasets/{dataset_response['id']}/status",
    headers={"X-Api-Key": API_KEY, "Accept": "application/json"}
  ).json()
  print(dataset_status_response)
  status = dataset_status_response["status"]
  time.sleep(5)

{'status': 'COMPLETE', 'progress': 100}


In [None]:
download_response = requests.get(
    f"{API_URL}/datasets/{dataset_response['id']}/download",
    headers={"X-Api-Key": API_KEY}
)
results_io = StringIO(download_response.content.decode("utf-8"))
sentiment_df = pd.read_csv(results_io)

In [None]:
sentiment_df.head()

Unnamed: 0,id,label,score
0,0,negative,0.510492
1,1,positive,0.915286
2,2,negative,0.539722
3,3,positive,0.953319
4,4,negative,0.531893


In [None]:
merged_df = pd.merge(merged_df, sentiment_df, left_index=True, right_on="id").drop(columns=["id"])

In [None]:
merged_df.rename(inplace=True, columns={"label_x": "topic", "score_x": "topic_score", "label_y": "sentiment", "score_y": "sentiment_score"})

In [None]:
merged_df.groupby(["topic", "sentiment"])["Review_ID"].count()

topic    sentiment
food     negative      55
         neutral        4
         positive      80
lines    negative      98
         neutral        1
         positive     191
pricing  negative      47
         neutral        5
         positive      88
rides    negative     104
         neutral        5
         positive     322
Name: Review_ID, dtype: int64
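
Raw counts are hard to compare across topics because the topics differ in size. Here is a short sketch that converts the counts above into the share of negative reviews per topic. The numbers are copied from the `groupby` output. Note that this `groupby` does not filter by score, so low-confidence labels are included.

```python
# Counts copied from the groupby output: topic -> sentiment -> number of reviews.
counts = {
    "food": {"negative": 55, "neutral": 4, "positive": 80},
    "lines": {"negative": 98, "neutral": 1, "positive": 191},
    "pricing": {"negative": 47, "neutral": 5, "positive": 88},
    "rides": {"negative": 104, "neutral": 5, "positive": 322},
}

# Share of negative reviews within each topic.
negative_share = {
    topic: by_sentiment["negative"] / sum(by_sentiment.values())
    for topic, by_sentiment in counts.items()
}

# Print topics from the highest negative share to the lowest.
for topic, share in sorted(negative_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{topic:<8} {share:.1%}")
```

On these counts, `food` has the highest negative share (about 40%) and `rides` the lowest (about 24%).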

# Named Entity Recognition

Using the NLP Labs API for NER is very similar to classification. The request has the same shape, and the key change is a single line: setting `type` to `entity_recognition` instead of `classification` (plus a new `name` and `categories` for the task). Let's use this capability to extract locations, mascots, and attractions mentioned in the reviews.

In [None]:
request_body = {
  "upload_id": upload_response["id"],
  "text_column": "Review_Text",
  "name": "Disney_Land_Reviews_NER",
  "type": "entity_recognition",
  "categories": ["location", "mascot", "attraction"]
}

dataset_response = requests.post(
    f"{API_URL}/datasets",
    headers={"X-Api-Key": API_KEY, "Accept": "application/json"},
    json=request_body
).json()

In [None]:
import time

# Poll every 5 seconds until the status is no longer PENDING
status = "PENDING"
while status == "PENDING":
  dataset_status_response = requests.get(
    f"{API_URL}/datasets/{dataset_response['id']}/status",
    headers={"X-Api-Key": API_KEY, "Accept": "application/json"}
  ).json()
  print(dataset_status_response)
  status = dataset_status_response["status"]
  time.sleep(5)

{'status': 'PENDING', 'progress': 0}
{'status': 'PENDING', 'progress': 0}
{'status': 'PENDING', 'progress': 0}
{'status': 'PENDING', 'progress': 0}
{'status': 'PENDING', 'progress': 0}
{'status': 'PENDING', 'progress': 0}
{'status': 'PENDING', 'progress': 10}
{'status': 'PENDING', 'progress': 10}
{'status': 'PENDING', 'progress': 10}
{'status': 'PENDING', 'progress': 20}
{'status': 'PENDING', 'progress': 20}
{'status': 'PENDING', 'progress': 20}
{'status': 'PENDING', 'progress': 30}
{'status': 'PENDING', 'progress': 30}
{'status': 'PENDING', 'progress': 30}
{'status': 'PENDING', 'progress': 40}
{'status': 'PENDING', 'progress': 40}
{'status': 'PENDING', 'progress': 40}
{'status': 'PENDING', 'progress': 40}
{'status': 'PENDING', 'progress': 50}
{'status': 'PENDING', 'progress': 50}
{'status': 'PENDING', 'progress': 50}
{'status': 'PENDING', 'progress': 70}
{'status': 'PENDING', 'progress': 70}
{'status': 'PENDING', 'progress': 80}
{'status': 'COMPLETE', 'progress': 100}


In [None]:
download_response = requests.get(
    f"{API_URL}/datasets/{dataset_response['id']}/download",
    headers={"X-Api-Key": API_KEY}
)
results_io = StringIO(download_response.content.decode("utf-8"))
ner_df = pd.read_csv(results_io)

In [None]:
ner_df.head(10)

Unnamed: 0,id,text,label,score
0,0,DLP,attraction,0.981427
1,0,Disney World,location,0.999901
2,0,Florida,location,0.999988
3,0,Thunder mountain,attraction,0.998749
4,0,Pirates of the Carribean,mascot,0.999953
5,0,DLP,attraction,0.980512
6,0,Florida,location,0.999992
7,0,Florida,location,0.999992
8,1,Disneyland,location,0.999575
9,1,Disneyland,location,0.999707
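
The NER results have one row per entity mention, so the same entity can appear several times for one review (`DLP` and `Florida` repeat for `id=0` above). Here is a minimal sketch in plain Python that counts both total mentions and the number of distinct reviews mentioning each entity. It uses only the ten rows shown above.

```python
from collections import Counter

# (id, text, label) triples copied from the ner_df.head(10) output.
mentions = [
    (0, "DLP", "attraction"),
    (0, "Disney World", "location"),
    (0, "Florida", "location"),
    (0, "Thunder mountain", "attraction"),
    (0, "Pirates of the Carribean", "mascot"),
    (0, "DLP", "attraction"),
    (0, "Florida", "location"),
    (0, "Florida", "location"),
    (1, "Disneyland", "location"),
    (1, "Disneyland", "location"),
]

# Total mentions per (text, label) pair.
mention_counts = Counter((text, label) for _, text, label in mentions)

# Distinct reviews per pair: de-duplicate the triples first so that
# repeated mentions within one review count once.
review_counts = Counter((text, label) for _, text, label in set(mentions))

print(mention_counts.most_common(3))
print(review_counts[("Florida", "location")])
```

The sample also tags `Pirates of the Carribean`, which is a ride, as `mascot`, so it is worth spot-checking labels before aggregating.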
