Email Sentiment Analysis

Background

This example connects to an IMAP mail account, collects all the incoming mail and analyzes it for positive or negative sentiment, sorting the emails into directories in its output repo with scoring information added to the email header "X-Sentiment-Rating".

It is inspired by the email sentiment analysis bot documented in this article by Shanglung Wang.

It uses Python-based VADER from CJ Hutto at Georgia Tech.

Introduction

In this example, we will connect a spout called imap_spout to an email account using IMAP. That spout's repo will be the input to a pipeline, sentimentalist, which will score each email's positive, negative, neutral, and compound sentiment. The pipeline adds a header to each email with the detailed sentiment scores and sorts the emails into two folders, positive and negative, in its output repo based on the compound score.

This demo will process emails from an account you configure, moving them from the Inbox to a mailbox called "Processed", which it will create if it doesn't exist. The emails will be scored and then sorted. In the sentimentalist output repo, each email is named by its unique identifier from the Inbox, which keeps file names from colliding.

Setup

This guide assumes that you already have a Pachyderm cluster running and have configured pachctl to talk to the cluster and kubectl to talk to Kubernetes. Installation instructions can be found here.

  1. Create an email account you want to use.
    Keep the email address (which is usually the account name) and the password handy.
  2. Enable IMAP on that account. In Gmail, click the gear for "settings" and then click "Forwarding and POP/IMAP" to get to the IMAP settings. In this example, we're assuming you're using Gmail. Look in the source code for ./imap_spout.py for environment variables you may need to add to the pipeline spec for the spout to use another email service or other default IMAP folders.
  3. Create the secrets needed to securely access the account.
    The values <your-password> and <account-name> are enclosed in single quotes to prevent the shell from interpreting them. Confirm the values in these files are what you expect.
$ echo -n '<your-password>' > IMAP_PASSWORD
$ echo -n '<account-name>' > IMAP_LOGIN
$ kubectl create secret generic imap-credentials --from-file=./IMAP_LOGIN --from-file=./IMAP_PASSWORD
  4. Confirm that the secrets were set correctly. Use kubectl get secret to output the secrets, then decode them to confirm they're correct.
$ kubectl get secret imap-credentials -o yaml
apiVersion: v1
data:
  IMAP_LOGIN: <base64-encoded-imap-login>
  IMAP_PASSWORD: <base64-encoded-imap-password>
kind: Secret
metadata:
  creationTimestamp: "2019-05-20T21:27:56Z"
  name: imap-credentials
  namespace: default
  resourceVersion: <some-version>
  selfLink: /api/v1/namespaces/default/secrets/imap-credentials
  uid: <some-uid>
type: Opaque
$ echo -n '<base64-encoded-imap-login>' | base64 -d
<imap-login>
$ echo -n '<base64-encoded-imap-password>' | base64 -d
<imap-password>
  5. Build the docker image for the imap_spout. Put your own docker account name in for <docker-account-name>. There is a prebuilt image in the Pachyderm DockerHub registry account, if you want to use it.
$ docker login
$ docker build -t <docker-account-name>/imap_spout:1.9 -f ./Dockerfile.imap_spout .
$ docker push <docker-account-name>/imap_spout:1.9
  6. Build the docker image for the sentimentalist. Put your own docker account name in for <docker-account-name>. There is a prebuilt image in the Pachyderm DockerHub registry account, if you want to use it.
$ docker build -t <docker-account-name>/sentimentalist:1.9 -f ./Dockerfile.sentimentalist .
$ docker push <docker-account-name>/sentimentalist:1.9
  7. Edit the pipeline definition files to refer to your own docker repo. Put your own docker account name in for <docker-account-name>. There are prebuilt images for both pipelines in the Pachyderm DockerHub registry account, if you want to use those.
$ sed 's/pachyderm/<docker-account-name>/g' < sentimentalist.json > my_sentimentalist.json
$ sed 's/pachyderm/<docker-account-name>/g' < imap_spout.json > my_imap_spout.json
  8. Confirm the pipeline definition files are correct.
  9. Create the pipelines.
$ pachctl create pipeline -f my_imap_spout.json
$ pachctl create pipeline -f my_sentimentalist.json
  10. Start sending plain-text emails to the account you created. Every few seconds, the imap_spout pipeline will fetch emails from that account via IMAP and send them to its output repo, where the sentimentalist pipeline will score them as positive or negative and sort them into directories in its output repo accordingly. Have fun! Try tricking the VADER sentiment engine with vague and ironic statements. Try emojis!
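
For the step that edits the pipeline definition files, sed replaces every occurrence of "pachyderm" in the file. A narrower alternative is to rewrite only the image field. The sketch below assumes each spec keeps its image under transform.image in the form account/name:tag; retarget_image is a hypothetical helper, not part of this example's code.

```python
import copy
import json


def retarget_image(spec, account):
    """Return a copy of a pipeline spec whose image points at `account`.

    Assumes the image lives at spec["transform"]["image"] and looks like
    "<account>/<name>:<tag>". The input spec is left unmodified.
    """
    updated = copy.deepcopy(spec)
    image = updated["transform"]["image"]
    # Keep everything after the first "/" (name and tag); if there is no
    # account prefix at all, keep the whole string.
    rest = image.split("/", 1)[1] if "/" in image else image
    updated["transform"]["image"] = f"{account}/{rest}"
    return updated


def retarget_file(src, dst, account):
    """Read a spec from `src`, retarget its image, and write it to `dst`."""
    with open(src) as f:
        spec = json.load(f)
    with open(dst, "w") as f:
        json.dump(retarget_image(spec, account), f, indent=2)
```

For instance, retarget_file("sentimentalist.json", "my_sentimentalist.json", "<docker-account-name>") would produce the same kind of file as the sed command, provided the spec matches the assumed layout.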

Pipelines

imap_spout

The imap_spout pipeline is an implementation of a Pachyderm spout in Python. It's configurable with environment variables that can be populated by Kubernetes secrets.

The spout connects to an IMAP account via SSL, creates a "Processed" mailbox for storing already-scored emails, and every five seconds checks for new emails.

It then puts each email as a separate file in the spout's output repo.
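
The connect, fetch, and move cycle described above can be sketched with Python's standard imaplib. This is a minimal illustration, not the code in ./imap_spout.py: IMAP_LOGIN and IMAP_PASSWORD come from the secret created during setup, while IMAP_HOST, IMAP_INBOX, and IMAP_PROCESSED_BOX are hypothetical variable names chosen for this sketch.

```python
import imaplib
import os


def load_imap_config(env=None):
    """Collect connection settings from environment variables."""
    env = os.environ if env is None else env
    return {
        "login": env["IMAP_LOGIN"],        # from the imap-credentials secret
        "password": env["IMAP_PASSWORD"],  # from the imap-credentials secret
        # The names below are assumptions made for this sketch.
        "host": env.get("IMAP_HOST", "imap.gmail.com"),
        "inbox": env.get("IMAP_INBOX", "INBOX"),
        "processed": env.get("IMAP_PROCESSED_BOX", "Processed"),
    }


def fetch_new_mail(cfg):
    """Return {uid: raw_bytes} for the inbox, moving each email to Processed."""
    conn = imaplib.IMAP4_SSL(cfg["host"])
    try:
        conn.login(cfg["login"], cfg["password"])
        # CREATE answers "NO" if the mailbox already exists; that is harmless.
        conn.create(cfg["processed"])
        conn.select(cfg["inbox"])
        _, data = conn.uid("search", None, "ALL")
        messages = {}
        for uid in data[0].split():
            _, msg_data = conn.uid("fetch", uid, "(RFC822)")
            messages[uid.decode()] = msg_data[0][1]
            # IMAP "move" done as copy, flag as deleted, then expunge.
            conn.uid("copy", uid, cfg["processed"])
            conn.uid("store", uid, "+FLAGS", r"(\Deleted)")
        conn.expunge()
        return messages
    finally:
        conn.logout()
```

A spout built this way would call fetch_new_mail(load_imap_config()) in a loop, sleeping five seconds between polls.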

A couple of things to note, to expand on the Pachyderm spout documentation.

  1. Look in the source code for ./imap_spout.py for environment variables you may need to add to the pipeline spec for the spout to use another email service or other default IMAP folders.
  2. The function open_pipe opens /pfs/out, the named pipe that's the gateway to the spout's output repo. Note that it must open that pipe as write-only and in binary mode. If you open it in text mode instead, you're likely to see errors like TypeError: a bytes-like object is required, not 'str' in your pachctl logs for the pipeline.
  3. The files are not written directly to /pfs/out; they're written as part of a tarfile object.
    Pachyderm uses the Unix tar format to ensure that multiple files can be written to /pfs/out and appear correctly in the output repo of your spout.
  4. In Python, the tarfile.open() command must use the mode="w|" argument, along with the named pipe's file object, to ensure that the tarfile object won't try to seek on the named pipe /pfs/out. If you forget this argument, you're likely to see errors like file stream is not seekable in your pachctl logs for the pipeline.
  5. Every time you close() /pfs/out, it's a commit.
  6. Note that open_pipe backs off and retries opening /pfs/out if any errors happen. Sometimes it'll take the spout a little bit of time to reopen /pfs/out after our code closes it for a commit; the backoff is insurance.
  7. The spout saves each email in a file with the mbox extension, which is the standard extension for Unix emails. eml is also commonly used, but is a slightly different format from the one we use here. Each mbox file contains one email.
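
The notes above on binary mode, tar streaming, commits, and backoff fit together roughly as follows. This is a sketch under assumptions rather than a copy of ./imap_spout.py: the retry count, the delays, and the commit_emails helper are invented for illustration, and only the open_pipe name comes from the source.

```python
import io
import tarfile
import time


def open_pipe(path="/pfs/out", attempts=5, delay=0.1):
    """Open the spout pipe write-only in binary mode, backing off on errors."""
    for attempt in range(attempts):
        try:
            return open(path, "wb")
        except OSError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff: delay, 2*delay, 4*delay, ...
            time.sleep(delay * 2 ** attempt)


def commit_emails(emails, path="/pfs/out"):
    """Write {uid: raw_bytes} as one tar stream; closing the pipe ends it."""
    with open_pipe(path) as pipe:
        # "w|" selects non-seeking stream mode, which a named pipe requires.
        with tarfile.open(fileobj=pipe, mode="w|") as tar:
            for uid, raw in emails.items():
                info = tarfile.TarInfo(name=f"{uid}.mbox")
                info.size = len(raw)  # size must be set before addfile()
                tar.addfile(info, io.BytesIO(raw))
```

Because the stream mode never seeks, the same function can be exercised against an ordinary file path when testing outside a cluster.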

sentimentalist

Sentimentalist is a thin wrapper around the Python-based VADER from CJ Hutto at Georgia Tech.

It looks in its input repo for individual email files, loads them into a Python email object, and extracts the body and subject as plain text for scoring.

It uses the "compound" score to sort the emails into different directories, and adds a header to each email with detailed scoring information for use by subsequent pipelines.
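
A minimal sketch of that flow is below. It takes the scoring function as a parameter so the example runs without VADER installed. The zero threshold on the compound score and the key=value header layout are assumptions made for illustration; the actual values are in ./sentimentalist.py.

```python
import email
from email import policy


def extract_text(msg):
    """Return the subject and plain-text body of an EmailMessage as one string."""
    subject = msg.get("Subject", "")
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return f"{subject}\n{body}"


def score_and_label(raw_bytes, score_fn, threshold=0.0):
    """Score one raw email; return (folder, email bytes with the new header).

    score_fn maps text to a dict with "neg", "neu", "pos", and "compound"
    keys. With the vaderSentiment package installed, this could be
    SentimentIntensityAnalyzer().polarity_scores.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    scores = score_fn(extract_text(msg))
    # Header layout is an assumption: comma-separated key=value pairs.
    msg["X-Sentiment-Rating"] = ", ".join(
        f"{name}={value}" for name, value in sorted(scores.items())
    )
    # Threshold is an assumption: non-negative compound counts as positive.
    folder = "positive" if scores["compound"] >= threshold else "negative"
    return folder, msg.as_bytes()
```

A pipeline using this sketch would write the returned bytes to /pfs/out/<folder>/<original file name>.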

Citations

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.