In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

Oracle Data Science service sample notebook.

Copyright (c) 2021, 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">Text Classification with Data Labeling Service Integration</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---

# Overview:

Data labeling is the process of identifying specific properties or characteristics of raw data (text and images) and applying annotations called labels. Examples of meaningful labels are the topic of a news article, the sentiment of a tweet, the caption for an image, important words spoken in an audio recording, the genre of a video, etc. The purpose of data labeling is to create enriched, labeled datasets. These labeled datasets are required inputs for many machine learning applications.

The Oracle Cloud Infrastructure (OCI) Data Labeling service enables customers to efficiently build enriched, labeled datasets for the purpose of accurately training AI/ML models. This notebook demonstrates operations that can be performed using the Accelerated Data Science (ADS) Data Labeling module. The demonstrated operations are:

* How to list datasets in the OCI Data Labeling service (DLS).
* How to generate a snapshot of a dataset.
* How to load a labeled dataset.

The purpose of the `data_labeling` module is to provide an efficient and convenient way for users to utilize OCI DLS in a notebook session.

Compatible conda pack: [Natural Language Processing](https://docs.oracle.com/iaas/data-science/using/conda-nlp-fam.htm) for CPU on Python 3.7 (version 2.0)

---

## Contents:

- <a href="#intro">Introduction</a>
- <a href='#dls'>Data Labeling Service Class</a>
    - <a href="#dls_list">Listing Labeled Datasets</a>
    - <a href='#dls_snapshot'>Export</a>
- <a href='#text_classification'>Text Classification</a>
    - <a href="#binary_class">Binary Classification</a>
        - <a href="#single_dataset">Dataset</a>
            - <a href="#single_loaddataset">Load the Labeled Dataset</a>
        - <a href='#single_train'>Decision Tree Classifier</a>
        - <a href='#single_test'>Prediction</a>
    - <a href="#multilabel_class">Multilabel Classification</a>
        - <a href="#multi_dataset">Dataset</a>
            - <a href="#multi_loaddataset">Load the Labeled Dataset</a>
        - <a href='#multi_train'>Linear Support Vector Classifier</a>
        - <a href='#multi_test'>Evaluation</a>   
- <a href='#ref'>References</a>

---


Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle.
    
You can access the `20 Newsgroups` dataset license [here](https://github.com/scikit-learn/scikit-learn/blob/master/COPYING).


In [None]:
import warnings

warnings.filterwarnings("ignore")

import logging

logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.ERROR)

import ads
import nltk
import oci
import os
import pandas as pd

pd.set_option("display.max_colwidth", None)

from ads.data_labeling import DataLabeling
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

<a id="intro"></a>
# Introduction

The OCI Data Labeling service (DLS) allows customers to create and browse datasets, view data records (text, images), and apply labels to build AI/ML models. The service also provides interactive user interfaces designed to aid in the labeling process. Once records are labeled, the dataset can be exported as line-delimited JSON for machine learning model development. 

Datasets are the core resource available within the Data Labeling service. Datasets contain records and their associated labels. A record represents a single image or text document. Records are stored by reference to their original source, such as a path on Object Storage. Customers can also upload records from local storage. Labels are annotations that describe a data record. In the case of text, labels may include the beginning and ending character position. Labeled datasets can be exported as a JSON manifest for use as an input to machine learning model development. 

Training datasets that are used in natural language processing applications require a corpus of raw text that is tagged. The tags indicate the important parts of the text, and labels are assigned to these tags. There are different types of labeling, such as parts of speech (POS) or named entities. POS labeling identifies words or phrases such as a noun, verb, adverb, or adjective. Named entity labeling identifies proper nouns and places within the text, such as Carey and Redwood City. The start and end positions are identified in the text, and a label is associated with that bounded region.
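
As a rough illustration of how a label can be tied to a bounded region of text, the sketch below parses two line-delimited JSON records and slices out each labeled span. The field names (`text`, `entities`, `start`, `end`, `label`) and the label values are hypothetical and chosen for readability; they are not the DLS export schema.

```python
import json

# Hypothetical line-delimited JSON: one record per line.
# Field names are illustrative only, not the DLS export schema.
jsonl = "\n".join(
    [
        json.dumps(
            {
                "text": "Carey lives in Redwood City.",
                "entities": [
                    {"start": 0, "end": 5, "label": "PERSON"},
                    {"start": 15, "end": 27, "label": "LOCATION"},
                ],
            }
        ),
        json.dumps({"text": "No entities here.", "entities": []}),
    ]
)


def labeled_spans(line):
    """Return (substring, label) pairs for one JSON record."""
    record = json.loads(line)
    return [
        (record["text"][e["start"] : e["end"]], e["label"])
        for e in record["entities"]
    ]


spans = [labeled_spans(line) for line in jsonl.splitlines()]
print(spans)
```

Each span is recovered by slicing the record text from the start position up to, but not including, the end position.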

<a id='dls'></a>
# Data Labeling Service Class

The main entry point to working with the DLS in ADS is the `DataLabeling` class. Generally, you will work with the DLS user interface in the console to label the data. The ADS library allows you to work with the labeled datasets within the notebook session efficiently.

To obtain a handle to a `DataLabeling` object, call the `DataLabeling()` constructor. The default compartment is the same compartment as the notebook session but the parameter `compartment_id` can be used to select a different compartment.

In [None]:
dls = DataLabeling()

<a id="dls_list"></a>
## Listing Labeled Datasets

The `list_dataset()` method lists the labeled datasets that are available in the compartment that was selected when `DataLabeling()` was called. It returns a `DatasetList` object where each row is a dataset. It has the following columns:
* `annotation_format`: The format of the annotation. Examples are ENTITY_EXTRACTION, MULTI_LABEL, etc.
* `defined_tags`: Preset tags.
* `display_name`: Name of the dataset as it is shown in the console.
* `format_type`: The format of the dataset, for example, TEXT or IMAGE.
* `freeform_tags`: User specified tags.
* `id`: OCID of the dataset.
* `lifecycle_state`: The lifecycle state of the dataset, such as ACTIVE, DELETED, etc.
* `lifecycle_details`: Details about the lifecycle state.
* `system_tags`: Tags that are defined by OCI.
* `time_created`: The timestamp of when the dataset was created.
* `time_updated`: The timestamp when the dataset was last updated. It will be None if the dataset has not been updated.

The following cell will list the datasets that are available in the notebook session's compartment.

In [None]:
dls.list_dataset()

<a id='dls_snapshot'></a>
## Export

To work with the labeled data, a snapshot of the dataset is needed. The `export()` method will copy the labeled data from the DLS into a bucket in Object Storage. The `export()` method has the following parameters:

* `dataset_id`: The OCID of the DLS dataset to take a snapshot of.
* `path`: The object storage path to store the generated snapshot.

The snapshot operation will create a unique file in the specified bucket. The `export()` method will return the key to the file that was created.

This notebook does not assume that you have a labeled text dataset in the DLS service. Therefore, it will not attempt to create a snapshot.
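
If you do have a labeled dataset, the call could be wrapped as in the following sketch. It relies only on the `export()` method and the `dataset_id` and `path` parameters described above. The helper name `export_snapshot`, the OCID, and the bucket path are hypothetical placeholders. The helper is defined but not called, so no snapshot is created.

```python
def export_snapshot(dls, dataset_id, path):
    """Export a DLS dataset snapshot and return the key of the created file.

    `dls` is expected to be a `DataLabeling` object.
    """
    return dls.export(dataset_id=dataset_id, path=path)


# Example call (placeholders, not real resources):
# metadata_key = export_snapshot(
#     dls,
#     dataset_id="<dataset_ocid>",
#     path="oci://<bucket>@<namespace>/<prefix>/",
# )
```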

<a id='text_classification'></a>
# Text Classification

This section shows you how to build a binary and a multilabel classifier using two different annotated datasets that were labeled by the DLS.


<a id="binary_class"></a>
## Binary Classification
To demonstrate a typical data science workflow, we will train a binary classifier using the text data as the source for generating features and the annotations to represent the response variable.

<a id="single_dataset"></a>
### Dataset

A subset of the 20 Newsgroups dataset is used in this notebook. The complete dataset is a collection of approximately 20,000 newsgroup documents partitioned across 20 different newsgroups. The dataset is popular for experiments where the machine learning application predicts which newsgroup a record belongs to. The newsgroups are treated as the categories/labels that you would like to predict. For the toy model created in this notebook, only the `rec.sport.baseball` and `sci.space` newsgroups are used, as this is a binary classification model.

<a id='single_loaddataset'></a>
#### Load the Labeled Dataset

The Oracle Data Science service has created a labeled dataset based on the data described in the <a href="#single_dataset">Dataset</a> section. It is stored in a publicly accessible bucket and will be used here. The snapshot process creates a JSONL file that contains metadata about the labeled dataset. There is also a record JSONL file that has links to the original dataset and its label.

Use the `read_labeled_data()` method to read in the metadata file, record file, and all the corpus documents. Only the metadata file needs to be specified as it contains references to the record and corpus documents. The `read_labeled_data()` method returns a dataframe that is easy to work with.

The following cell loads a labeled dataset. It will return the text from each newsgroup post and the labeled annotation.

In [None]:
df = pd.DataFrame.ads.read_labeled_data(
    "oci://hosted-ds-datasets@bigdatadatasciencelarge/DLS/text_single_label_20news/metadata.jsonl",
    auth={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
    materialize=True,
)

df.head()

<a id='single_train'></a>
### Decision Tree Classifier 

The first step is to perform some preprocessing on the data. The following cell performs the following operations:
* Converts the text to lower case.
* Uses a regular expression (regex) to remove any character that is not alphanumeric, underscore, or whitespace.
* Replaces each newline character (`\n`) with a space.
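
The three steps listed above can be traced on a single string. This is a minimal sketch using the standard `re` module rather than pandas, and the sample string is made up for illustration.

```python
import re

sample = "Hello, World!\nIt's 1993; go_team."

lowered = sample.lower()
# Drop every character that is not a word character
# (letter, digit, underscore) or whitespace.
no_punct = re.sub(r"[^\w\s]+", "", lowered)
# Replace newline characters with spaces.
cleaned = no_punct.replace("\n", " ")

print(cleaned)
```

Note that the underscore survives because `\w` includes it, while the apostrophe is removed.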

In [None]:
df["text_clean"] = (
    df["Content"]
    .str.lower()
    .str.replace(r"[^\w\s]+", "", regex=True)  # pandas 2.0+ defaults to regex=False
    .str.replace("\n", " ")
)

The binary classifier model we will train is a decision tree where the features are based on n-grams of the words. We will use n-grams that are one, two, and three words long: unigrams, bigrams, and trigrams. The vectorizer will remove English stop words as they provide little value to the model being built. A weight will be assigned to these features using the term frequency-inverse document frequency (TF*IDF) approach. More details about TF*IDF can be found [here](https://en.wikipedia.org/wiki/Tf–idf).
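
To make the n-gram features concrete, here is a minimal pure-Python sketch that enumerates the unigrams, bigrams, and trigrams of a short token list. The helper `word_ngrams` and the sample sentence are illustrative assumptions. `TfidfVectorizer` performs its own tokenization and stop-word removal, so its vocabulary will differ in detail.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams for n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i : i + n]))
    return grams


tokens = "shuttle launch delayed again".split()
ngrams = word_ngrams(tokens)
print(ngrams)
```

Four tokens yield four unigrams, three bigrams, and two trigrams, which shows how quickly the feature space grows as the n-gram range widens.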

In this example, we skip splitting the dataset into training and test sets because our goal is to build a toy model for demonstration purposes. The classifier is fit directly on the `Annotations` column, which identifies each document as `rec.sport.baseball` or `sci.space`.

In [None]:
vectorizer = TfidfVectorizer(stop_words="english", analyzer="word", ngram_range=(1, 3))
classifier = DecisionTreeClassifier()

feature = vectorizer.fit_transform(df["text_clean"])
model = classifier.fit(feature, df["Annotations"])

<a id='single_test'></a>
### Prediction

Use the following code to predict the category of a given piece of text using the trained binary classifier.

In [None]:
classifier.predict(vectorizer.transform(["reggie jackson played right field"]))

<a id="multilabel_class"></a>
## Multilabel Classification

In the previous section, you saw how to build a binary text classifier. Building a multilabel text classifier is a similar process. However, a one-vs-the-rest (OvR) strategy will be employed. That is, you create one binary classifier for each class, and each classifier predicts whether or not an observation belongs to its class. If there are m classes, then there will be m classifiers. Because a document can carry several labels, it is assigned every class whose classifier predicts membership.

<a id="multi_dataset"></a>
### Dataset

A subset of the [Reuters Corpus](https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection) dataset is used in this notebook. You will use scikit-learn and nltk to build a multilabel classifier. The Reuters data is a benchmark dataset for document classification. More precisely, it is a multilabel (each document can belong to many classes) dataset. It has 90 categories, 7,769 training documents, and 3,019 testing documents.

<a id='multi_loaddataset'></a>
#### Load the Labeled Dataset

The following cell loads a multi-labeled dataset. It will return the text and the multi-labeled annotation in a dataframe.

In [None]:
df = pd.DataFrame.ads.read_labeled_data(
    "oci://hosted-ds-datasets@bigdatadatasciencelarge/DLS/text_multi_label_nltk_reuters/metadata.jsonl",
    auth={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
    materialize=True,
)

df.head()

<a id='multi_train'></a>
### Linear Support Vector Classifier

To demonstrate a typical data science workflow, you will train a `LinearSVC` classifier using the text data to generate features and annotations to represent the response variable.

We use the `MultiLabelBinarizer` transformer to convert the labels into the scikit-learn multilabel format during dataset preprocessing. This transformer converts a list of sets or tuples into a binary indicator matrix of shape `(n_samples, n_classes)`. Further details about how this works can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html).
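
The conversion can be seen on a tiny example. The sketch below uses three made-up label sets; the category names and the variable names `mlb_demo` and `y_demo` are illustrative and separate from the notebook's own variables.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label sets; the category names are made up for illustration.
labels = [("grain", "wheat"), ("trade",), ("grain",)]

mlb_demo = MultiLabelBinarizer()
y_demo = mlb_demo.fit_transform(labels)

print(mlb_demo.classes_)  # column order of the binary matrix
print(y_demo)  # one row per sample, one column per class
```

The columns follow the sorted class names, so the first row has a 1 in the `grain` and `wheat` columns and a 0 in the `trade` column.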

The next step is to vectorize the input text to feed it into a supervised machine learning system. In this example, TF*IDF vectorization is used in a manner that is similar to what was used in the <a href="#single_train">Decision Tree Classifier</a> section.

For performance reasons, the `TfidfVectorizer` vocabulary is limited to the 10,000 most frequent terms.

In [None]:
nltk.download("stopwords")

stop_words = stopwords.words("english")  # NLTK's English stop word list
vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=10000)
mlb = MultiLabelBinarizer()

X_train = vectorizer.fit_transform(df["Content"])  # Vectorize the inputs with TF*IDF
y_train = mlb.fit_transform(df["Annotations"])  # Binarize the labels

A one-vs-rest (OvR) strategy is used in this model. This heuristic trains m binary classifiers, where m is the number of classes in the training dataset. For each classifier, the data from the class under study is treated as positive, and the data from all the other classes is treated as negative. For more details, see [here](https://probml.github.io/pml-book/book0.html).
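
As a sketch of how the per-class decisions combine in the multilabel case, the snippet below takes one decision score per class for a single document and keeps every class whose score is positive. The class names and scores are made-up illustrative values, the helper name is hypothetical, and the zero threshold is an assumption that mirrors the sign convention of a linear classifier's decision function.

```python
# Hypothetical decision scores for one document, one per binary classifier.
scores = {"grain": 1.3, "trade": -0.4, "wheat": 0.2}


def ovr_multilabel_predict(class_scores, threshold=0.0):
    """Return every class whose one-vs-rest classifier votes 'positive'."""
    return sorted(c for c, s in class_scores.items() if s > threshold)


predicted = ovr_multilabel_predict(scores)
print(predicted)
```

Because each binary classifier decides independently, a document can receive several labels, one label, or none at all.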

The following cell uses the scalable Linear Support Vector Machine, `LinearSVC`, for classification. It is quick to train and empirically adequate on NLP problems.

In [None]:
clf = OneVsRestClassifier(LinearSVC(class_weight="balanced"), n_jobs=-1)
clf.fit(X_train, y_train)

Let's see how the model did! The next cell applies cross-validation to estimate the prediction error. `K`-fold cross-validation works by partitioning a dataset into `K` splits. For each `k` from 1 to `K`, it fits the model to the other `K-1` splits of the data and uses the held-out `k`th split to calculate the prediction error. For more details about this process, see [here](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) and specifically this [image](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/1920px-K-fold_cross_validation_EN.svg.png).
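
The partitioning can be sketched in a few lines of plain Python. The helper `k_fold_indices` below is a hypothetical illustration that uses contiguous, unshuffled folds; scikit-learn's own splitters may shuffle or stratify the data, so this is not how the library is implemented.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start : start + size]
        train = indices[:start] + indices[start + size :]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test fold, so each observation is used for evaluation once and for training `K-1` times.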


<a id='multi_test'></a>
### Evaluation

By performing cross-validation with `cv=5`, five separate models are trained on different train and test splits. The goal is to get an estimate of the error that is expected when the model is generalized to an independent dataset. The following cell uses the [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function, which returns one score per fold. By default, the score comes from the estimator's own `score` method, which is accuracy for classifiers, so higher is better. The mean and standard deviation of these scores summarize the expected performance.

In [None]:
cross_val_score(clf, X_train, y_train, cv=5)

<a id="ref"></a>
# References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)