# Parallelize Data Visualization Pipeline

Currently, the data visualization pipeline only generates visuals for the training split of our dataset.

In this notebook, we will:
- Set up a "local" development environment with `pachctl mount`.
- Test the current functionality of the visualization pipeline. 
- Update our pipeline to visualize all 3 data splits (train, val, and test). 
- Parallelize our pipeline to create these visualizations simultaneously.


## 0. Install Requirements
This notebook should be run after first deploying the market-sentiment example. 

First, install the project requirements in the notebook environment with `pip`.

Note: Best practice is to install these with a virtual env like `venv`, `pipenv` or `conda`. 
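
Before installing, it can help to confirm that the notebook kernel is actually running inside an isolated environment. The check below is a minimal sketch, not part of the original project: `in_isolated_env` is a hypothetical helper, and it assumes that `venv`-style environments change `sys.prefix` and that conda sets the `CONDA_PREFIX` environment variable.

```python
import os
import sys


def in_isolated_env():
    """Return True if the interpreter appears to run inside a venv or conda env."""
    # venv and virtualenv point sys.prefix at the environment directory,
    # while sys.base_prefix keeps pointing at the base installation.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    # Assumption: an activated conda environment exports CONDA_PREFIX.
    in_conda = bool(os.environ.get("CONDA_PREFIX"))
    return in_venv or in_conda


if not in_isolated_env():
    print("Warning: no virtual environment detected; packages will install globally.")
```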

In [None]:
!pip install -r requirements.txt

## 1. Mount Visualization pipeline data

In a terminal (inside JupyterLab) run: 
```
pachctl mount -r dataset@master -r sentiment_words@master /pfs/
```

In [None]:
# If not already installed, install the tree command to view the directory structure
!sudo apt-get install tree -y

In [2]:
!tree /pfs

/pfs
├── dataset
│   ├── test.csv
│   ├── train.csv
│   └── validation.csv
└── sentiment_words
    └── LoughranMcDonald_SentimentWordLists_2018.csv

2 directories, 4 files


## 2. Test current functionality

In [None]:
!mkdir output

In [None]:
!python data_visualization.py --data-file /pfs/dataset/train.csv --sentiment-words-file /pfs/sentiment_words/LoughranMcDonald_SentimentWordLists_2018.csv --output-dir ./output/ -v

In [5]:
!tree output/

output/
├── correlation.png
├── frequent_words.png
└── word_cloud.png

0 directories, 3 files


## 3. Modify the code to visualize each split

In [9]:
%%writefile data_visualization.py
# Python libraries
from tqdm import tqdm
import os
import logging
import random
import json
import argparse

# Data Science modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from market_sentiment.nlp_utils import *
from market_sentiment.data_utils import *
from market_sentiment.visualize import visualize_frequent_words, generate_word_cloud

plt.style.use("ggplot")

# Import Scikit-learn modules
from sklearn.preprocessing import LabelEncoder

parser = argparse.ArgumentParser(description="Sentiment Analysis Trainer")
parser.add_argument(
    "--data-dir",
    help="directory of dataset csv files.",
    default="/pfs/dataset/",
)
parser.add_argument(
    "--sentiment-words-file",
    help="csv with sentiment word list",
    default="/pfs/sentiment_words/LoughranMcDonald_SentimentWordLists_2018.csv",
)
parser.add_argument(
    "--output-dir", metavar="DIR", default="/pfs/out", help="output directory for model"
)
parser.add_argument("--seed", type=int, default=42, help="random seed value")
parser.add_argument(
    "-v", "--verbose", help="increase output verbosity", action="store_true"
)


# Set Seaborn Style
sns.set(style="white", palette="deep")


def create_vis(filename, sentiment_words_file, output_dir, seed=42):
    train_df = load_finphrase(filename)
    file_basename = os.path.splitext(os.path.basename(filename))[0]

    # Samples
    # None means "no limit"; the older -1 sentinel is rejected by recent pandas releases
    pd.set_option("display.max_colwidth", None)
    logging.debug(train_df.sample(n=1, random_state=seed))

    # Encode the label
    le = LabelEncoder()
    le.fit(train_df["label"])
    train_df["label"] = le.transform(train_df["label"])
    logging.debug(list(le.classes_))
    logging.debug(train_df["label"])

    corpus = create_corpus(train_df)
    fig = visualize_frequent_words(corpus, stop_words)
    fig.savefig(os.path.join(output_dir, "frequent_words_" + file_basename + ".png"))

    wordcloud = generate_word_cloud(corpus, stop_words)
    wordcloud.to_file(os.path.join(output_dir, "word_cloud_" + file_basename + ".png"))

    # Load sentiment data
    sentiment_df = pd.read_csv(sentiment_words_file)

    # Make all words lower case
    sentiment_df["word"] = sentiment_df["word"].str.lower()
    sentiments = sentiment_df["sentiment"].unique()
    sentiment_df.groupby(by=["sentiment"]).count()

    sentiment_dict = {
        sentiment: sentiment_df.loc[sentiment_df["sentiment"] == sentiment][
            "word"
        ].values.tolist()
        for sentiment in sentiments
    }

    columns = [
        "tone_score",
        "word_count",
        "n_pos_words",
        "n_neg_words",
        "pos_words",
        "neg_words",
    ]

    # Analyze tone for original text dataframe
    print(train_df.shape)
    tone_lmdict = [
        tone_count_with_negation_check(sentiment_dict, x.lower())
        for x in tqdm(train_df["sentence"], total=train_df.shape[0])
    ]
    tone_lmdict_df = pd.DataFrame(tone_lmdict, columns=columns)
    train_tone_df = pd.concat(
        [train_df, tone_lmdict_df.reindex(train_df.index)], axis=1
    )
    train_tone_df.head()

    # Show correlations between the label and the positive/negative word counts
    plt.figure(figsize=(10, 6))
    corr_columns = ["label", "n_pos_words", "n_neg_words"]
    sns.heatmap(
        train_tone_df[corr_columns].astype(float).corr(),
        cmap="coolwarm",
        annot=True,
        fmt=".2f",
        vmin=-1,
        vmax=1,
    )
    plt.savefig(os.path.join(output_dir, "correlation_" + file_basename + ".png"))


def main():
    args = parser.parse_args()
    if args.verbose:
        logging.basicConfig(level=logging.DEBUG)

    os.makedirs(args.output_dir, exist_ok=True)

    # Set Random Seed
    random.seed(args.seed)
    np.random.seed(args.seed)

    sentiment_words_file = args.sentiment_words_file
    
    for dirpath, dnames, fnames in os.walk(args.data_dir):
        for f in fnames:
            create_vis(os.path.join(dirpath, f), sentiment_words_file, args.output_dir, args.seed)


if __name__ == "__main__":
    main()


Overwriting data_visualization.py


In [None]:
!python data_visualization.py --data-dir /pfs/dataset/ --sentiment-words-file /pfs/sentiment_words/LoughranMcDonald_SentimentWordLists_2018.csv --output-dir ./output/ -v

In [18]:
!tree output/

output/
├── correlation_test.png
├── correlation_train.png
├── correlation_validation.png
├── frequent_words_test.png
├── frequent_words_train.png
├── frequent_words_validation.png
├── word_cloud_test.png
├── word_cloud_train.png
└── word_cloud_validation.png

0 directories, 9 files
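
The nine files above follow from how `create_vis` builds its output names: it strips the directory and extension from each input file and appends the result to a fixed prefix. The sketch below reproduces only that naming step; `output_names` is a hypothetical helper for illustration and is not part of `data_visualization.py`.

```python
import os


def output_names(data_file):
    """Return the image names that create_vis would write for one input CSV."""
    # Same basename logic as create_vis: "/pfs/dataset/train.csv" -> "train"
    base = os.path.splitext(os.path.basename(data_file))[0]
    prefixes = ("correlation", "frequent_words", "word_cloud")
    return [f"{prefix}_{base}.png" for prefix in prefixes]


for split in ("test.csv", "train.csv", "validation.csv"):
    print(output_names(os.path.join("/pfs/dataset", split)))
```

Because the split name is part of every file name, the outputs of different splits never overwrite each other when they are written to the same directory.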


## 4. Update Pipeline with our new version
In this notebook, we will take a shortcut to avoid rebuilding a Docker image for the pipeline. This is useful for debugging, but a proper [development cycle](https://docs.pachyderm.com/latest/how-tos/developer-workflow/) is recommended in production to maintain reproducibility.

<img src="https://docs.pachyderm.com/latest/assets/images/d_steps_analysis_pipeline.svg" alt="Drawing" style="width: 800px;"/>

Here we will inject our code as the entrypoint to our container by:
1. Encoding our Python file as base64.
2. Updating the pipeline, injecting our new Python file as the entrypoint.


In [10]:
import base64

def generate_command(python_file):
    with open(python_file,'r') as file:
        user_code = file.read()
    code = user_code 
    encoded_code = base64.standard_b64encode(bytes(code, 'utf-8')).decode('ascii')
    
    command = r"""
    python -c "import base64; __name__ = \"__main__\"; exec(base64.b64decode(\"{code_64}\"))"
    """.strip(("\n\r ")).format(code_64=encoded_code)
    return command
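
As a sanity check on the encoding step, the sketch below confirms that a base64 round trip returns the original source and that the encoded text contains no quote or backslash characters, which is why it can sit inside the escaped double quotes of the `python -c` command. `encode_source` and the `sample` string are hypothetical stand-ins for illustration; they are not part of the notebook's pipeline code.

```python
import base64


def encode_source(source):
    """Encode Python source text the same way generate_command does."""
    return base64.standard_b64encode(source.encode("utf-8")).decode("ascii")


# A tiny stand-in for the contents of data_visualization.py
sample = 'if __name__ == "__main__":\n    print("hello")\n'

encoded = encode_source(sample)
decoded = base64.b64decode(encoded).decode("utf-8")

# The round trip must reproduce the source exactly
assert decoded == sample
# The standard base64 alphabet is A-Z, a-z, 0-9, "+", "/" and "=",
# so the payload cannot terminate or escape the surrounding quotes
assert '"' not in encoded and "\\" not in encoded
```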

Using our Python client, [python-pachyderm](https://github.com/pachyderm/python-pachyderm/), we will update our pipeline. This will allow us to modify our pipeline programmatically rather than by editing our pipeline definition file.

In [13]:
import python_pachyderm
from python_pachyderm.service import pps_proto
client = python_pachyderm.Client()

In [14]:
def create_python_pipeline(name, client, command):
    client.create_pipeline(
        pipeline_name=name,
        transform=python_pachyderm.service.pps_proto.Transform(
            image="jimmywhitaker/market_sentiment:dev0.25",
            cmd=['/bin/sh'],
            stdin=[command]
        ),
        input=pps_proto.Input(
            cross=[
                pps_proto.Input(pfs=pps_proto.PFSInput(repo="dataset", glob="/")),
                pps_proto.Input(pfs=pps_proto.PFSInput(repo="sentiment_words", glob="/"))
            ]),
        update=True,
        reprocess_spec="every_job"
    )

Update the pipeline with our base64-encoded Python file.

In [None]:
code = generate_command('./data_visualization.py')

In [19]:
create_python_pipeline('visualizations', client, code)

After the job runs, we can view the output of our new pipeline and see we now have visualizations for each split. 

In [50]:
!pachctl list file visualizations@master

NAME                           TYPE SIZE     
/correlation_test.png          file 31.12KiB 
/correlation_train.png         file 30.94KiB 
/correlation_validation.png    file 31.26KiB 
/frequent_words_test.png       file 36.78KiB 
/frequent_words_train.png      file 37.15KiB 
/frequent_words_validation.png file 37.32KiB 
/word_cloud_test.png           file 63.49KiB 
/word_cloud_train.png          file 73.95KiB 
/word_cloud_validation.png     file 72.52KiB 


## 5. Parallelize Pachyderm Pipeline with Glob Pattern
Pachyderm pipelines can easily parallelize your processing across your data with zero code changes. 

[Glob patterns](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/glob-pattern/) are the Pachyderm mechanism to do this. 

For example, an input with the glob `/` like so,
```
"input": "dataset"
"glob": "/"
```
would pass all the files in the `dataset` repository to the job as a single datum.


In our example, that would be: 
```
# Job 1 of 1
/pfs/dataset
     ├── test.csv
     ├── train.csv
     └── validation.csv
```
We can test this with the `pachctl glob file` command.  

In [26]:
!pachctl glob file dataset@master:/

NAME TYPE SIZE     
/    dir  669.1KiB 


However, if we change this glob pattern to `/*`, we tell Pachyderm to treat each file in the root directory of the repo as a separate datum, which means each file can be processed independently of the others.

For example, an input with the glob `/*` like so, 
```
"input": "dataset"
"glob": "/*"
```
would pass one file from the `dataset` repository to each datum.


In our example, that would be: 
```
# Datum 1 of 3
/pfs/dataset
     └── test.csv

# Datum 2 of 3
/pfs/dataset
     └── train.csv

# Datum 3 of 3
/pfs/dataset
     └── validation.csv
```

This means that each file can be processed separately, and parallelized automatically by Pachyderm.

In [28]:
!pachctl glob file dataset@master:/*

NAME            TYPE SIZE     
/test.csv       file 134.9KiB 
/train.csv      file 481.5KiB 
/validation.csv file 52.74KiB 
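
The difference between the two glob patterns can be illustrated with a simplified model in plain Python. This is a sketch only, not Pachyderm's implementation: `glob_to_regex` and `datums_for_glob` are hypothetical helpers, and the model assumes that `*` matches any run of characters except `/` and that the pattern `/` selects the whole root directory as one datum.

```python
import re


def glob_to_regex(glob):
    """Translate a glob into a regex in which '*' never crosses a '/'."""
    parts = ["[^/]*" if ch == "*" else re.escape(ch) for ch in glob]
    return re.compile("^" + "".join(parts) + "$")


def datums_for_glob(file_paths, glob):
    """Group file paths into datums under the simplified model described above."""
    if glob == "/":
        # The root directory is the single match, so all files share one datum
        return [sorted(file_paths)]
    pattern = glob_to_regex(glob)
    # Every matching top-level path becomes its own datum
    return [[path] for path in sorted(file_paths) if pattern.match(path)]


files = ["/test.csv", "/train.csv", "/validation.csv"]
print(datums_for_glob(files, "/"))   # one datum holding all three files
print(datums_for_glob(files, "/*"))  # three datums, one file each
```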


In [70]:
def create_parallelized_python_pipeline(name, client, command):
    client.create_pipeline(
        pipeline_name=name,
        transform=python_pachyderm.service.pps_proto.Transform(
            image="jimmywhitaker/market_sentiment:dev0.25",
            cmd=['/bin/sh'],
            stdin=[command]
        ),
        input=pps_proto.Input(
            cross=[
                pps_proto.Input(
                    pfs=pps_proto.PFSInput(repo="dataset", glob="/*")
                ),
                pps_proto.Input(
                    pfs=pps_proto.PFSInput(repo="sentiment_words", glob="/")
                )
            ]),
        update=True,
        reprocess_spec="every_job"
    )

In [71]:
create_parallelized_python_pipeline('visualizations', client, code)

If we list our jobs, we can see the changes in the `PROGRESS` field.

In [84]:
!pachctl list job -p visualizations --history all

PIPELINE       ID                               STARTED        DURATION   RESTART PROGRESS  DL       UL       STATE                                 
visualizations 4261036ed5f64d1b8d7ee563961edf0b 4 minutes ago  27 seconds 0       3 + 0 / 3 913.8KiB 424.1KiB success
visualizations 22fd16c90d7b47429c90ac5e5b8245cd 9 minutes ago  26 seconds 0       3 + 0 / 3 913.8KiB 419.5KiB success
visualizations 4472d147797143ed88a40a3c67ae308a 16 minutes ago 16 seconds 0       1 + 0 / 1 750.7KiB 414.5KiB success
visualizations ad348405004947ea9595d11f09273c49 20 minutes ago 3 seconds  0       0 + 0 / 1 750.7KiB 0B       failure: datum 51932426570330879ff...
visualizations d862e8fe11924198bcdfc604ec43b829 2 hours ago    10 seconds 0       1 + 0 / 1 750.7KiB 141.3KiB success


The format of the progress column is `DATUMS PROCESSED + DATUMS SKIPPED / TOTAL DATUMS`.
We can see that our `PROGRESS` value has changed from `1 + 0 / 1` to `3 + 0 / 3`. Instead of processing everything as a single datum, Pachyderm now splits our data into three datums and runs our code on each one independently, without any code changes.
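
When scripting around `pachctl` output, the progress string can be split into its three counts. The parser below is a minimal sketch that assumes the `PROCESSED + SKIPPED / TOTAL` layout described above; `parse_progress` is a hypothetical helper, not part of `pachctl` or python-pachyderm.

```python
import re


def parse_progress(progress):
    """Split a PROGRESS value such as '3 + 0 / 3' into its three datum counts."""
    match = re.fullmatch(r"\s*(\d+)\s*\+\s*(\d+)\s*/\s*(\d+)\s*", progress)
    if match is None:
        raise ValueError(f"unrecognized progress value: {progress!r}")
    processed, skipped, total = (int(group) for group in match.groups())
    return {"processed": processed, "skipped": skipped, "total": total}


print(parse_progress("1 + 0 / 1"))  # {'processed': 1, 'skipped': 0, 'total': 1}
print(parse_progress("3 + 0 / 3"))  # {'processed': 3, 'skipped': 0, 'total': 3}
```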

In [77]:
# Original job had a single datum 
!pachctl list job d862e8fe11924198bcdfc604ec43b829

PIPELINE       ID                               STARTED     DURATION   RESTART PROGRESS  DL       UL       STATE   
visualizations d862e8fe11924198bcdfc604ec43b829 2 hours ago 10 seconds 0       1 + 0 / 1 750.7KiB 141.3KiB success


In [75]:
# New job has 3 datums that can execute in parallel 
!pachctl list job 4261036ed5f64d1b8d7ee563961edf0b

PIPELINE       ID                               STARTED        DURATION   RESTART PROGRESS  DL       UL       STATE   
visualizations 4261036ed5f64d1b8d7ee563961edf0b 28 seconds ago 27 seconds 0       3 + 0 / 3 913.8KiB 424.1KiB success


If we view the datums that were input to the job, we can see that our Pachyderm pipeline automatically parallelizes our code without us having to make any changes.

In [81]:
!pachctl list datum visualizations@4261036ed5f64d1b8d7ee563961edf0b

ID                                                               FILES                                                                                                        STATUS  TIME
22b0aef9fa4e54ff1e967e1c28716b5b55a49b90e75a396d3a4442d81a2d79b6 dataset@4261036ed5f64d1b8d7ee563961edf0b:/validation.csv, sentiment_words@4261036ed5f64d1b8d7ee563961edf0b:/ success 3 seconds
41dae939561609fc88d8f72cdff08adaf6d9f6f1ea8c41809a448dac84d5404a dataset@4261036ed5f64d1b8d7ee563961edf0b:/train.csv, sentiment_words@4261036ed5f64d1b8d7ee563961edf0b:/      success 6 seconds
9c92a8264bd805733271a310ce90761ed00cd92d3344410fe2ea175fcac54c78 dataset@4261036ed5f64d1b8d7ee563961edf0b:/test.csv, sentiment_words@4261036ed5f64d1b8d7ee563961edf0b:/       success 4 seconds
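
Each datum in this list pairs one `dataset` file with the `sentiment_words` root, which is what a cross input produces: every datum from one input is combined with every datum from the other. The sketch below models that pairing with `itertools.product`; `cross_datums` and the two lists are illustrative assumptions, not calls into Pachyderm.

```python
from itertools import product

# Datums produced by each input on its own (per the glob patterns used above)
dataset_datums = ["/test.csv", "/train.csv", "/validation.csv"]  # glob "/*"
sentiment_datums = ["/"]                                         # glob "/"


def cross_datums(*inputs):
    """Combine every datum of each input with every datum of the others."""
    return [tuple(combo) for combo in product(*inputs)]


for datum in cross_datums(dataset_datums, sentiment_datums):
    print(datum)

# 3 dataset datums x 1 sentiment_words datum = 3 combined datums
print(len(cross_datums(dataset_datums, sentiment_datums)))
```

Under this model, changing the `sentiment_words` glob to a pattern that yields several datums would multiply the total, so keeping it at `/` holds the count at one datum per dataset file.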
