<a href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/classify_freetext_with_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Gretel Classify to Label Free Text

In this blueprint, we analyze and label a set of email dumps to find PII and other potentially sensitive information.

## Setup

First, we install our Python dependencies and configure the Gretel client.

_Note: we install spaCy for its visualization helper, displaCy._

In [None]:
!pip install -Uqq gretel-client spacy datasets

In [None]:

from getpass import getpass
import json
import datasets
import pandas as pd
import yaml
from smart_open import open
from gretel_client import create_project, poll, configure_session, ClientConfig

pd.set_option('max_colwidth', None)

dataset_file_path = "emails.csv"

configure_session(
    ClientConfig(
        api_key=getpass(prompt="Enter Gretel API key"),
        endpoint="https://api.gretel.cloud",
    )
)

## Load the dataset

Using Hugging Face's [datasets](https://github.com/huggingface/datasets) library, we load a dataset containing a dump of [Enron emails](https://huggingface.co/datasets/aeslc). This data contains unstructured emails that we pass through an NER pipeline for labeling and PII discovery.

In [None]:
source_dataset = datasets.load_dataset("aeslc")
source_df = pd.DataFrame(source_dataset["train"]).sample(n=10000, random_state=99)
source_df.to_csv(dataset_file_path, index=False)

In [None]:
source_df.head()

## Configure a Gretel Project and Model

In [None]:
project = create_project(display_name="Gretel NLP Blueprint")

In [None]:
# Passing `use_nlp: true` into the model config
# enables additional predictions using NLP models.
classify_config = """
schema_version: "1.0"
models:
  - classify:
      data_source: "_"
      use_nlp: true
"""

If you want to transform the dataset instead of classifying it, you can pass the same `use_nlp: true` property into a transformation pipeline. For a complete transform example, see the [Redact PII Notebook](https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/redact_pii.ipynb). Below is an example transform pipeline that uses NLP.

```yaml
schema_version: "1.0"
models:
  - transforms:
      data_source: "_"
      use_nlp: true
      policies:
        - name: remove_pii
          rules:
            - name: redact_pii
              conditions: 
                value_label:
                  - person_name
                  - location
                  - credit_card_number
                  - phone_number
                  - email_address
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
```

### Create the Classification Model

The next cell creates the classification model. After we verify that the model is working correctly, the entire dataset will be passed into the model for classification.

In [None]:
model = project.create_model_obj(yaml.safe_load(classify_config), dataset_file_path)
model.submit_cloud()
poll(model)

Using the created model, we download the report to get a summary view of found entities. This report is based on a sample of the original dataset, and is used to ensure the model has been configured correctly.

In [None]:
# `report_json` contains a summary of entities by field
with open(model.get_artifact_link("report_json")) as fh:
    report = json.load(fh)

In [None]:
# By converting these summaries into a dataframe we can quickly view
# entities found by the model.
summary = []
for field in report["metadata"]["fields"]:
    row = {"name": field["name"]}
    for entity in field["entities"]:
        row[entity["label"]] = entity["count"]
    summary.append(row)

pd.DataFrame(summary).set_index("name").fillna(0)
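The per-field table above can also be collapsed into a single total per label, which is a quick way to check that the expected entity types show up at all. The following is a minimal sketch using only the standard library. The `sample_report` dictionary and the `label_totals` helper are hypothetical; they assume only the keys that the cell above already reads (`metadata`, `fields`, `name`, `entities`, `label`, `count`), and the counts are made up for illustration.

```python
from collections import Counter

# Hypothetical report fragment with made-up counts. Only the keys read by
# the summary cell above are assumed to exist.
sample_report = {
    "metadata": {
        "fields": [
            {
                "name": "email_body",
                "entities": [
                    {"label": "person_name", "count": 12},
                    {"label": "email_address", "count": 5},
                ],
            },
            {
                "name": "subject_line",
                "entities": [{"label": "person_name", "count": 3}],
            },
        ]
    }
}


def label_totals(report):
    """Sum entity counts per label across every field in the report."""
    totals = Counter()
    for field in report["metadata"]["fields"]:
        for entity in field["entities"]:
            totals[entity["label"]] += entity["count"]
    return totals


totals = label_totals(sample_report)
for label, count in totals.most_common():
    print(f"{label}: {count}")
```

With a real report, passing the `report` object loaded earlier to `label_totals` should work the same way, provided the report has the shape assumed here.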

### Classify the emails

Now that the model has been configured and verified, let's run the full dataset through the model.

In [None]:
records = model.create_record_handler_obj(data_source=dataset_file_path)
records.submit_cloud()
poll(records)

In [None]:
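# The `data` artifact is a JSONL-formatted file containing
# entity predictions by row.
# Note: `records` is rebound here from the record handler to the
# list of parsed predictions, one dict per input row.
with open(records.get_artifact_link("data")) as fh:
    records = [json.loads(line) for line in fh]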
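Before rendering anything, it can help to check which labels were predicted for which field. The sketch below parses a small JSONL payload and groups labels by field using only the standard library. The sample payload and the helpers `read_jsonl` and `labels_by_field` are hypothetical, and the record shape (an `entities` list whose items carry `field`, `label`, `start`, and `end`) is an assumption inferred from how the rendering cell in this notebook uses the records.

```python
import io
import json
from collections import defaultdict

# Hypothetical JSONL payload standing in for the `data` artifact.
# The blank line in the middle checks that empty lines are skipped.
sample_jsonl = io.StringIO(
    '{"entities": [{"field": "email_body", "label": "person_name", "start": 3, "end": 7}]}\n'
    "\n"
    '{"entities": []}\n'
)


def read_jsonl(handle):
    """Parse one JSON object per non-blank line."""
    return [json.loads(line) for line in handle if line.strip()]


def labels_by_field(rows):
    """Map each field name to the set of labels predicted for it."""
    found = defaultdict(set)
    for row in rows:
        for entity in row["entities"]:
            found[entity["field"]].add(entity["label"])
    return dict(found)


rows = read_jsonl(sample_jsonl)
print(len(rows), labels_by_field(rows))
```

Skipping blank lines is a defensive choice; `json.loads` raises an error on an empty string, so a trailing newline-only line would otherwise break the list comprehension.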

In [None]:
from spacy import displacy

for row, entities in zip(source_df.values, records):
    email_body, subject = row
    displacy.render(
        {
            "title": f"Subject: {subject}",
            "text": email_body,
            "ents": [e for e in entities["entities"] if e["field"] == "email_body"]
        },
        style="ent",
        jupyter=True,
        manual=True
    )
    input("\nPress [enter] to see the next email")
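With `manual=True`, displaCy does not parse the text itself; it expects a dictionary with a `text` string, an optional `title`, and an `ents` list whose items have `start`, `end`, and `label` keys. The sketch below builds that dictionary for one field without importing spaCy, so the offsets can be checked before rendering. The helper `to_displacy_input` and the sample data are hypothetical, and the assumption that each prediction carries `field`, `label`, `start`, and `end` comes from how the loop above passes entities straight to `displacy.render`.

```python
def to_displacy_input(text, title, entities, field):
    """Build the dict that displaCy accepts when called with manual=True."""
    ents = [
        {"start": e["start"], "end": e["end"], "label": e["label"]}
        for e in entities
        if e["field"] == field
    ]
    # Sorting by start offset is a defensive choice so spans are
    # highlighted in reading order.
    ents.sort(key=lambda e: e["start"])
    return {"text": text, "title": title, "ents": ents}


# Hypothetical text and predictions, not taken from the dataset.
sample_text = "Please call Jane Doe at 713-555-0100."
sample_entities = [
    {"field": "email_body", "label": "phone_number", "start": 24, "end": 36},
    {"field": "email_body", "label": "person_name", "start": 12, "end": 20},
    {"field": "subject_line", "label": "person_name", "start": 0, "end": 3},
]

doc = to_displacy_input(sample_text, "Subject: call me", sample_entities, "email_body")
for ent in doc["ents"]:
    print(ent["label"], repr(sample_text[ent["start"]:ent["end"]]))
```

The resulting `doc` can be passed to `displacy.render(doc, style="ent", manual=True, jupyter=True)` in place of the inline dictionary used above.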

In [None]:
# Now that you've run the notebook, you can also view the same
# project using our web console.
project.get_console_url()