# Flink Execution Environment

To connect to a Flink cluster, you need an _execution environment_.

Since we use Python, we will rely on the Table API.

During this lab session we will use only Flink's batch processing engine, although Flink is best known as a stream processor.
The reason is that streaming applications for Flink are usually written in Java, not Python.

Nevertheless, the Table API is _the same_ for working with streams and batches, so let's get started:

In [1]:
from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, BatchTableEnvironment

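# Batch execution environment; every operator runs as a single parallel task
exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
# Table API environment built on top of the batch environment
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)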

# Reading and Writing Data

Again, because the Python API is limited, we will treat JSON files as single-column CSVs: each line of the file is read as one string value.
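
To make the "one column per line" idea concrete, here is a minimal sketch in plain Python (not Flink code). It assumes the input holds one JSON object per line; the sample records and the helper name `read_single_column` are made up for illustration. Each line is kept as one opaque string and is parsed into fields only when needed.

```python
import json
import os
import tempfile


def read_single_column(path):
    """Return each line of the file as a one-field row, like a one-column CSV."""
    with open(path, encoding="utf-8") as f:
        return [{"line": raw.rstrip("\n")} for raw in f]


# Hypothetical sample data: one JSON object per line.
sample = [
    {"review_id": "r1", "stars": 5.0},
    {"review_id": "r2", "stars": 3.0},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")

    rows = read_single_column(path)

# Each row holds only raw text; json.loads recovers the fields when needed.
parsed = [json.loads(row["line"]) for row in rows]
print(len(rows), parsed[0]["review_id"])
```

Counting rows, as done later in this notebook, never needs the parsing step, which is why a single string column is enough here.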

Since we use the Table API, we should first register our files as tables in Flink.

In [2]:
from pyflink.table.descriptors import Schema, OldCsv, FileSystem
from pyflink.table import DataTypes

infile = "/workdir/boris/data/yelp_dataset/yelp_academic_dataset_review.json"
(
    t_env.connect(FileSystem().path(infile))
    .with_format(OldCsv())
    .with_schema(Schema().field("line", DataTypes.STRING()))
    .create_temporary_table("input_table")
)

<pyflink.table.descriptors.BatchTableDescriptor at 0x7fef14fb9490>

In [3]:
(
    t_env.connect(FileSystem().path("result.json"))
    .with_format(OldCsv())
    .with_schema(Schema().field("line", DataTypes.STRING()))
    .create_temporary_table("output_table")
)

<pyflink.table.descriptors.BatchTableDescriptor at 0x7fef14745ed0>

# Coding a Flink Job

A typical Flink job has a source, a sink, and some transformations in between:

In [4]:
(
    t_env.from_path("input_table")
    .select("""
    CONCAT('{"total_lines": ', count(1), '}') AS line
    """)
    .insert_into("output_table")
)

In [5]:
# Before running the job, we remove any previous results
!rm result.json

rm: cannot remove 'result.json': No such file or directory


In [6]:
# After running the job, we can inspect some statistics of its execution (net runtime, in milliseconds)

res = t_env.execute("io_job")
print(res.get_net_runtime())

30366


In [7]:
# Let's see the results:
!head result.json

{"total_lines": 8021122}
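
As a sanity check on a small file, the job's logic can be reproduced without Flink. The sketch below is plain Python, not the Table API: it counts the lines of a file and formats the count the same way the `CONCAT` expression does. The function name `count_lines_as_json` and the tiny sample file are hypothetical.

```python
import os
import tempfile


def count_lines_as_json(path):
    """Count the lines of a file and format the count like the CONCAT expression."""
    with open(path, encoding="utf-8") as f:
        total = sum(1 for _ in f)
    return '{"total_lines": ' + str(total) + '}'


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.json")
    # Three hypothetical records, one per line.
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"a": 1}\n{"a": 2}\n{"a": 3}\n')
    result = count_lines_as_json(path)

print(result)
```

Running the Flink job on the same small file and comparing the two outputs is a quick way to confirm that the source and sink tables are registered as intended.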


# Do It Yourself

1. Get some CSV data: https://grouplens.org/datasets/movielens/
1. Register the new data as a multi-column CSV table.
1. Find the total number of movies and the total number of users.
1. Find the movies viewed by the most users, and the users who viewed the most movies.
1. Find the average number of users per movie, and the average number of movies per user.
1. Find the totals of users and movies, and the averages above, per year and per month.
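
Before writing the Flink queries, it can help to have reference answers for a tiny sample. The sketch below is plain Python rather than Flink. It assumes a ratings file with the header `userId,movieId,rating,timestamp` (check the files you actually download), treats each rating as a "view", and assumes each user rates a movie at most once. The sample rows are invented.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone

# Invented sample in the assumed ratings layout; timestamps are Unix seconds.
SAMPLE = """userId,movieId,rating,timestamp
1,10,4.0,946684800
1,20,3.5,946684800
2,10,5.0,978307200
3,10,2.0,978307200
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Totals of distinct users and movies.
users = {r["userId"] for r in rows}
movies = {r["movieId"] for r in rows}

# Views per movie and per user (one row is one user-movie pair).
users_per_movie = Counter(r["movieId"] for r in rows)
movies_per_user = Counter(r["userId"] for r in rows)
most_viewed_movie = users_per_movie.most_common(1)[0]
most_active_user = movies_per_user.most_common(1)[0]

# Averages follow from the pair count divided by the number of groups.
avg_users_per_movie = len(rows) / len(movies)
avg_movies_per_user = len(rows) / len(users)

# Bucket ratings by year and month, interpreting timestamps as UTC.
ratings_per_month = Counter(
    datetime.fromtimestamp(int(r["timestamp"]), tz=timezone.utc).strftime("%Y-%m")
    for r in rows
)

print(len(users), len(movies), most_viewed_movie, most_active_user)
print(avg_users_per_movie, avg_movies_per_user, dict(ratings_per_month))
```

The same quantities map onto grouped aggregations in the Table API, so these numbers can serve as expected values when the Flink job is run on the same sample.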