Big Data with Hadoop and Python

Here you can find the code examples from my talk "Big Data with Hadoop and Python", given at EuroPython 2015. They were used to benchmark the performance of different Python frameworks for Hadoop against Java and Apache Pig.

The examples are a simple wordcount implementation. As input I used the text of Mark Lutz's Learning Python, 3rd edition, multiplied 10000 times with the following simple bash script:

#!/bin/bash

# Write 10000 concatenated copies of the input file to the output file.
IN="${1}"    # path to the source text
OUT="${2}"   # path where the multiplied text is written

for i in {1..10000}; do
    echo "${IN}"
done | xargs cat > "${OUT}"
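
For example, if the script is saved as multiply.sh (an illustrative name, not one used in the repository), running ./multiply.sh learning-python.txt input.txt writes 10000 concatenated copies of learning-python.txt to input.txt, which can then be copied into HDFS.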

Below are brief instructions that will help you recreate the benchmarks if needed.

Java

  1. Download WordCount.java and run the following command to compile it:

    javac -classpath /path/to/hadoop-common.jar:/path/to/hadoop-mapreduce-client-core.jar:/path/to/hadoop-annotations.jar WordCount.java
  2. Run the following command to create a JAR file from the compiled class files:

    jar -cvf WordCount.jar WordCount*.class
  3. Finally run the following command to submit a job to the cluster:

    hadoop jar WordCount.jar WordCount hdfs:///input hdfs:///output

Pig

  1. Download wordcount.pig. Open it in your favourite text editor and set the input and output directories inside HDFS.

  2. Run the following command to submit a job to the cluster:

    pig wordcount.pig

Streaming

  1. Download mapper.py and reducer.py and change the first line (the shebang) if you want to run them under PyPy; a minimal sketch of these two scripts is shown after these steps.

  2. Run the following command to submit a job to the cluster:

    hadoop jar /path/to/hadoop-streaming.jar \
        -Dmapreduce.job.reduces=98 \
        -file /path/to/mapper.py \
        -file /path/to/reducer.py \
        -mapper /path/to/mapper.py \
        -reducer /path/to/reducer.py \
        -combiner /path/to/reducer.py \
        -input hdfs:///input \
        -output hdfs:///output
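
The sketch below illustrates what such a streaming mapper and reducer typically look like. It is a minimal illustration of the approach, not necessarily the exact mapper.py and reducer.py used in the benchmarks.

#!/usr/bin/env python
# mapper: emit "<word><TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer: sum the counts for each word; Hadoop feeds the input sorted by key
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rsplit('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

Because the reducer's output has the same word/count format as the mapper's output, the same script can also serve as the combiner, which is what the command above does.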

MRJob

  1. Download raw-wordcount.py and ujson-wordcount.py. Keep in mind that you'll need to install the ujson library across the whole cluster and you'll need MRJob >= 0.5.0 for this to work. A minimal sketch of an MRJob wordcount is shown after these steps.

  2. Set the path to the Hadoop home directory with the following command:

    export HADOOP_HOME=/path/to/hadoop/home/dir
  3. Run the following command to submit a job to the cluster:

    python raw-wordcount.py -r hadoop hdfs:///input --output-dir hdfs:///output --no-output --hadoop-streaming-jar /path/to/hadoop-streaming.jar --jobconf mapreduce.job.reduces=98
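
For reference, an MRJob wordcount is roughly shaped like the sketch below (the raw- and ujson- variants from the talk presumably differ in how intermediate data is serialized, so treat this as an illustration rather than a copy of raw-wordcount.py):

from mrjob.job import MRJob


class WordCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every word in the input line
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # pre-aggregate counts on the mapper side
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    WordCount.run()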

Luigi

  1. Download client.cfg, default-wordcount.py and json-wordcount.py. Open them in your favourite text editor and set the streaming jar path and the input/output directories inside HDFS. A minimal sketch of such a Luigi task is shown after these steps.

  2. Run the following command to submit a job to the cluster:

    python default-wordcount.py WordCount --local-scheduler
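
For reference, a Luigi Hadoop wordcount task built on luigi.contrib.hadoop is roughly shaped like the sketch below. The HDFS paths and class names are illustrative only; the real default-wordcount.py and json-wordcount.py may differ.

import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs


class InputText(luigi.ExternalTask):
    # Marks an existing HDFS directory as the job input (path is illustrative).
    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/input')


class WordCount(luigi.contrib.hadoop.JobTask):

    def requires(self):
        return InputText()

    def output(self):
        # Illustrative HDFS output path.
        return luigi.contrib.hdfs.HdfsTarget('/output')

    def mapper(self, line):
        for word in line.split():
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    luigi.run()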

Pydoop

  1. Download wordcount.py; a minimal sketch of such a script is shown after these steps.

  2. Set the path to the Java home directory with the following command:

    export JAVA_HOME=/path/to/java/home/dir
  3. Create the Pydoop archive with the following command (this is needed because Pydoop doesn't automatically upload itself to the cluster):

    tar -czf pydoop.tgz -C /path/to/pydoop .
  4. Run the following command to submit a job to the cluster:

    pydoop submit --upload-archive-to-cache pydoop.tgz --num-reducers 98 --upload-file-to-cache wordcount.py wordcount /input /output
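
For reference, a Pydoop wordcount script based on the Pydoop MapReduce API is roughly shaped like the sketch below; it is a minimal illustration, not necessarily identical to the wordcount.py used in the benchmarks.

import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes


class Mapper(api.Mapper):

    def map(self, context):
        # context.value holds one line of the input split
        for word in context.value.split():
            context.emit(word, 1)


class Reducer(api.Reducer):

    def reduce(self, context):
        # context.values holds all counts emitted for context.key
        context.emit(context.key, sum(context.values))


def __main__():
    # entry point called by pydoop submit for the uploaded module
    pipes.run_task(pipes.Factory(Mapper, reducer_class=Reducer))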