This repository has been archived by the owner on Dec 23, 2021. It is now read-only.

morsapaes/pyflink-nlp


Building an Analytics Pipeline with PyFlink

⚠️ Update: This repository will no longer be actively maintained. Please check the Ververica fork.

See the slides for more context.

Docker

To keep things simple, this demo uses a Docker Compose setup that bundles all the services you need:

[Figure: demo overview]

Getting the setup up and running

docker-compose build

docker-compose up -d

Is everything really up and running?

docker-compose ps

You should be able to access the Flink Web UI (http://localhost:8081), as well as Superset (http://localhost:8088).
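If you prefer to verify this from a script rather than a browser, a minimal sketch along these lines may help. It only assumes the two URLs mentioned above; the helper names (`is_up`, `check_services`) and the "status below 500 counts as up" rule are illustrative choices, not part of the demo.

```python
import urllib.error
import urllib.request

# URLs taken from the paragraph above.
SERVICES = {
    "Flink Web UI": "http://localhost:8081",
    "Superset": "http://localhost:8088",
}


def is_up(url, opener=urllib.request.urlopen, timeout=5):
    """Return True if the URL answers with an HTTP status below 500.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, just not with a 2xx/3xx status.
        return err.code < 500
    except OSError:
        # Connection refused, timeout, DNS failure (URLError is an OSError).
        return False


def check_services(services=SERVICES, opener=urllib.request.urlopen):
    """Map each service name to True (reachable) or False."""
    return {name: is_up(url, opener=opener) for name, url in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```

Services can take a little while to start after `docker-compose up -d`, so a "not reachable" result right after startup does not necessarily mean something is broken.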

Analyzing the Flink User Mailing List

What do people ask about most frequently in the Flink User Mailing List? How can you make sense of such a huge amount of unstructured text?

Some Background

The model in this demo was trained with LDA (Latent Dirichlet Allocation), a popular topic modeling algorithm, using Gensim, a Python library with a good implementation of the algorithm. The trained model knows, to some extent, which combinations of words are associated with which topics, and can simply be passed as a dependency to PyFlink.

Don't trust the model. 👹
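To build some intuition for what "words associated with topics" means, here is a toy sketch in plain Python. It is not LDA inference and does not use Gensim or the demo's trained model: the topic names and word weights are made up for illustration, and the scoring is simple weighted word counting.

```python
from collections import Counter

# Hypothetical topic -> word weights, invented for this illustration.
# A real LDA model learns such weights from the corpus.
TOPIC_WORDS = {
    "checkpointing": {"checkpoint": 0.5, "state": 0.3, "savepoint": 0.2},
    "sql": {"table": 0.4, "sql": 0.4, "join": 0.2},
}


def score_topics(text, topic_words=TOPIC_WORDS):
    """Return a normalized score per topic based on weighted word counts."""
    counts = Counter(text.lower().split())
    raw = {
        topic: sum(weight * counts[word] for word, weight in words.items())
        for topic, words in topic_words.items()
    }
    total = sum(raw.values())
    if total == 0:
        # None of the known words appear in the text.
        return {topic: 0.0 for topic in raw}
    return {topic: score / total for topic, score in raw.items()}


def most_likely_topic(text, topic_words=TOPIC_WORDS):
    """Return the best-scoring topic, or None if no known word matches."""
    scores = score_topics(text, topic_words)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(most_likely_topic("my checkpoint state is very large"))
```

A message that shares no words with any topic gets no label at all here, which is one small reason not to trust such a model blindly.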

Submitting the PyFlink job

docker-compose exec jobmanager ./bin/flink run -py /opt/pyflink-nlp/pipeline.py -d

Once you get the "Job has been submitted with JobID <JobId>" green light, you can check and monitor the job's execution in the Flink Web UI:

[Figure: Flink Web UI]
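The same information is also available from Flink's REST API, which is served on the same port as the Web UI. The sketch below assumes the `/jobs/overview` endpoint and its `jobs` list with `jid`, `name` and `state` fields; the helper names are hypothetical, and the job name you see depends on what `pipeline.py` registers.

```python
import json
import urllib.request

# Same address as the Flink Web UI mentioned earlier.
FLINK_REST = "http://localhost:8081"


def summarize_jobs(overview):
    """Map job ID -> (name, state) from a /jobs/overview payload."""
    return {
        job["jid"]: (job["name"], job["state"])
        for job in overview.get("jobs", [])
    }


def fetch_overview(base_url=FLINK_REST, timeout=5):
    """Fetch and decode the job overview from the Flink REST API."""
    url = f"{base_url}/jobs/overview"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        jobs = summarize_jobs(fetch_overview())
    except OSError as err:
        # Covers connection errors and HTTP errors raised by urllib.
        print(f"Could not reach Flink at {FLINK_REST}: {err}")
    else:
        for job_id, (name, state) in jobs.items():
            print(f"{job_id}  {state:<10}  {name}")
```

A job submitted with `-d` (detached) should show up here in the `RUNNING` state shortly after submission.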

Visualizing on Superset

To visualize the results, navigate to http://localhost:8088 and log into Superset using:

username: admin

password: superset

There should be a default dashboard named "Flink User Mailing List" listed under Dashboards:

[Figure: Superset dashboard]


And that's it!

For the latest updates on PyFlink, follow Apache Flink on Twitter.

About

Self-contained demo using PyFlink with Gensim+spaCy to find topics in the Flink User Mailing List. All you need is Docker! 🐳
