# Activity 1: Setting Up

In [1]:
import findspark

# provide path to your spark directory directly
findspark.init("../../spark2")

import pyspark

IndexError: list index out of range

**Ok, so we have an error. Now what?**

**Did you start the Spark instance first?**

        cd spark2/sbin
        ./start-master.sh

**Have you specified the path correctly?**

In [1]:
import findspark

# provide path to your spark directory directly
findspark.init("../../spark2/")

import pyspark

IndexError: list index out of range
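One way to check the path before calling `findspark.init` is sketched below. It assumes (this is not confirmed by the notebook) that findspark locates py4j by globbing `python/lib/py4j-*.zip` under the Spark directory, in which case an empty match would explain the `IndexError`. The helper name `looks_like_spark_home` and the zip file name are hypothetical, and the demo runs on throwaway directories rather than a real Spark install.

```python
import glob
import os
import tempfile


def looks_like_spark_home(path):
    """Return True if `path` contains python/lib/py4j-*.zip (hypothetical helper)."""
    pattern = os.path.join(path, "python", "lib", "py4j-*.zip")
    return len(glob.glob(pattern)) > 0


# Demonstrate on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    empty_dir_ok = looks_like_spark_home(tmp)      # no py4j zip yet
    lib_dir = os.path.join(tmp, "python", "lib")
    os.makedirs(lib_dir)
    # Made-up file name standing in for the py4j archive shipped with Spark
    open(os.path.join(lib_dir, "py4j-0.10.4-src.zip"), "w").close()
    fake_home_ok = looks_like_spark_home(tmp)      # zip is now present

print(empty_dir_ok, fake_home_ok)
```

If the check fails for the path you pass to `findspark.init`, the relative path is probably being resolved from a different working directory than you expect; an absolute path is the simplest thing to try.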

**Now let's create a SparkContext and use it to count the number of lines in a file. For that, let's create a text file first.**

        cd
        ls >> helloworld
        cat helloworld

In [3]:
print("hello\nworld")

hello
world


In [2]:
sc = pyspark.SparkContext(appName="helloworld")

In [3]:
def nonempty(x):
    return len(x) > 0
    
# let's test our setup by counting the number of nonempty lines in a text file
lines = sc.textFile('../README.md')
lines_nonempty = lines.filter(nonempty)
lines_nonempty.count()

29

In [None]:
lines_nonempty = lines.filter( lambda x: len(x) > 0 )
lines_nonempty.count()

In [7]:
lines

../README.md MapPartitionsRDD[6] at textFile at NativeMethodAccessorImpl.java:0

In [9]:
lines_nonempty

PythonRDD[8] at RDD at PythonRDD.scala:48

In [11]:
lines.take(5)

['# Learn Spark2 with Python',
 '',
 '1. [Set up your machine](https://github.com/soumendra/learn-spark-python/blob/master/setting-up.md)',
 '2. [Go through the pre-class reading list](https://github.com/soumendra/learn-spark-python/blob/master/pre-course-reading.md)',
 '']

In [5]:
lines_nonempty = lines.filter( lambda x: len(x) > 0 )
lines_nonempty.count()

29

In [12]:
lines.count()

42
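The gap between the two counts comes from the empty lines that the filter drops. A plain-Python analogue (not Spark code; the sample lines are made up) shows the same relationship on a small list:

```python
def nonempty(x):
    return len(x) > 0


# Made-up sample standing in for the lines of a text file
sample_lines = ["# Title", "", "first item", "", "second item"]

total = len(sample_lines)                                   # like lines.count()
nonempty_count = len(list(filter(nonempty, sample_lines)))  # like lines.filter(nonempty).count()

print(total, nonempty_count)
```

Here the difference of two corresponds to the two empty strings, just as the difference between 42 and 29 above corresponds to the empty lines in the README.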

**Ok, so we can't run multiple SparkContexts at once! What about running the one created before?**

In [None]:
# let's test our setup by counting the number of nonempty lines in a text file
lines = sc.textFile('README.md')
lines_nonempty = lines.filter( lambda x: len(x) > 0 )
lines_nonempty.count()

# Activity 2: Using Anonymous Functions

**Let's use *lambda* to create an anonymous function to count the number of lines containing *Python*.**

In [13]:
%%bash
text="Python is a fun language,\n
but then what language\n
is not, if\n
I may ask. But Python\n
is also."

echo -e $text > python.txt
cat python.txt

Python is a fun language,
 but then what language
 is not, if
 I may ask. But Python
 is also.


In [15]:
lines = sc.textFile("python.txt")
pythonLines = lines.filter(lambda line: "Python" in line)
print("No of lines containing 'Python':", pythonLines.count())

No of lines containing 'Python': 2


In [16]:
lines.count()

5

**Well, do explain the answer.**
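As a hint towards the explanation, the same filter can be replayed in plain Python on the five lines printed above. This is a sketch rather than Spark code; the list simply copies the contents of `python.txt` as shown in the cell output.

```python
# Lines as they appear in python.txt (the leading spaces come from the echo call)
python_txt = [
    "Python is a fun language,",
    " but then what language",
    " is not, if",
    " I may ask. But Python",
    " is also.",
]

# Same predicate as the lambda passed to lines.filter
matching = [line for line in python_txt if "Python" in line]

print(len(python_txt), len(matching))
```

The filter counts lines, not occurrences, and the `in` test is case-sensitive, so only the first and fourth lines match.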

* Task: Count the number of lines on the Wikipedia page for Python that contain the word Python.
* Hint
    - Use the [requests](http://docs.python-requests.org/en/master/) package to read a URL
    - Use the lxml package to extract the text content out of the HTML
    - [Understand how to write a string to a file](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)
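The three hints can be rehearsed offline before touching the network. The sketch below swaps in the standard library's `html.parser` for lxml and a made-up HTML string for the downloaded page, so it is only an illustration of the extract, write, and count steps; `TextExtractor`, `sample_html`, and the file name are all hypothetical.

```python
import os
import tempfile
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags (hypothetical helper, stdlib only)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Made-up HTML standing in for a downloaded page
sample_html = (
    "<html><body>\n"
    "<h1>Python (programming language)</h1>\n"
    "<p>Python is widely used.</p>\n"
    "<p>It was first released in 1991.</p>\n"
    "</body></html>\n"
)

# Step 1: extract the text content
parser = TextExtractor()
parser.feed(sample_html)
parser.close()
text = "".join(parser.chunks)

# Step 2: write the text to a file
path = os.path.join(tempfile.mkdtemp(), "sample_page.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Step 3: count the lines containing "Python"
with open(path, encoding="utf-8") as f:
    python_line_count = sum(1 for line in f if "Python" in line)

print(python_line_count)
```

With a real page, the last step is where `sc.textFile` and `filter` would take over from the plain file loop.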

In [29]:
%%bash
source activate bdap
pip install requests
pip install lxml
pip install html2text

Collecting html2text
  Downloading html2text-2016.9.19.tar.gz (47kB)
Building wheels for collected packages: html2text
  Running setup.py bdist_wheel for html2text: started
  Running setup.py bdist_wheel for html2text: finished with status 'done'
  Stored in directory: /Users/soumendra/Library/Caches/pip/wheels/96/13/e0/25f9de1c524662d264bb143dde112812b72789bc8058dc4f57
Successfully built html2text
Installing collected packages: html2text
Successfully installed html2text-2016.9.19


In [34]:
import requests
from html2text import html2text

r = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
if r.status_code==200:
    with open('python_html', 'w') as f:
        f.write(html2text(r.text))

In [32]:
%%bash
head python_html

[![This is a good article. Click here for more
information.](//upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/19px-
Symbol_support_vote.svg.png)](/wiki/Wikipedia:Good_articles "This is a good
article. Click here for more information." )

# Python (programming language)

From Wikipedia, the free encyclopedia

Jump to: navigation, search


In [35]:
lines = sc.textFile("python_html")
pythonLines = lines.filter(lambda line: "Python" in line)
print("No of lines containing 'Python':", pythonLines.count())

No of lines containing 'Python': 375


# Activity 3: Counting Primes

**We’ll go ahead and calculate the number of primes less than a given large number. To start with, we'll define a function that determines the primality of any given number (we'll later parallelize this function on a set of numbers).**

In [37]:
def isprime(n):
    """
    check if integer n is a prime
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # range starts at 3 and only needs to go up to the square root of n,
    # stepping over odd numbers only
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

In [38]:
# Create an RDD of numbers from 0 to 1,000,000
nums = sc.parallelize(range(1000000))

# Compute the number of primes in the RDD
print(nums.filter(isprime).count())

78498
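The count can be cross-checked without Spark. The sketch below uses a Sieve of Eratosthenes, which is a different technique from the trial division in `isprime`, so agreement between the two is a reasonable sanity check. The function name `count_primes_below` is hypothetical.

```python
def count_primes_below(limit):
    """Count primes p with p < limit using a Sieve of Eratosthenes."""
    if limit < 2:
        return 0
    sieve = bytearray([1]) * limit      # sieve[i] == 1 means "i is still a candidate"
    sieve[0] = sieve[1] = 0             # 0 and 1 are not primes
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            # Cross out every multiple of i starting at i*i
            sieve[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return sum(sieve)


primes_below_million = count_primes_below(1000000)
print(primes_below_million)
```

Because `range(1000000)` stops at 999,999, both approaches count the primes strictly below one million.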


# Activity 4: Word and Line Counting

In [4]:
import re
from operator import add

filein = sc.textFile('../README.md')

**Count lines**

In [42]:
print('number of lines in file: %s' % filein.count())

number of lines in file: 42


In [43]:
filein

../README.md MapPartitionsRDD[31] at textFile at NativeMethodAccessorImpl.java:0

In [44]:
filein.collect()

['# Learn Spark2 with Python',
 '',
 '1. [Set up your machine](https://github.com/soumendra/learn-spark-python/blob/master/setting-up.md)',
 '2. [Go through the pre-class reading list](https://github.com/soumendra/learn-spark-python/blob/master/pre-course-reading.md)',
 '',
 '# Setting up AWS instance',
 '',
 'When you are logging into a new ec2 instance for the first time, execute the following:',
 '',
 '```bash',
 'sudo apt-get update -y',
 'sudo apt-get upgrade -y',
 'sudo apt-get install -y python-dev software-properties-common curl default-jre ',
 'sudo apt-get install -y default-jdk python-software-properties byobu vim',
 '',
 'sudo apt-get install git git-core',
 'git config --global user.email "you@example.com"',
 'git config --global user.name "Your Name"',
 '```',
 '* Set up anaconda - https://github.com/soumendra/python-machinelearning-setup',
 '* clone and install from spark.yml',
 '',
 '',
 '```bash',
 'jupyter notebook --generate-config',
 'mkdir certs',
 'cd certs',
 'cd

**Count non-empty lines**

In [45]:
filein_nonempty = filein.filter( lambda x: len(x) > 0 )
print('number of non-empty lines in file: %s' % filein_nonempty.count()) 

number of non-empty lines in file: 29


**Count the number of characters**

In [46]:
chars = filein.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)

number of characters in file: 1080
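The `map` followed by `reduce(add)` pattern has a direct plain-Python analogue, sketched here on a made-up list of lines. Note that `textFile` yields lines without their line terminators, so a count built this way excludes newline characters.

```python
from functools import reduce
from operator import add

# Made-up sample standing in for the lines of a text file
sample_lines = ["# Title", "", "first item"]

# map each line to its length, then fold the lengths together with add
chars = reduce(add, map(len, sample_lines))

print(chars)
```

The fold works because `add` is associative, which is also what lets Spark combine partial sums from different partitions in any order.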


**Count words of length greater than 3 characters**

In [7]:
words = filein.flatMap(lambda line: re.split(r'\W+', line.lower().strip()))
words = words.filter(lambda x: len(x) > 3)
words.collect()

['learn',
 'spark2',
 'with',
 'python',
 'your',
 'machine',
 'https',
 'github',
 'soumendra',
 'learn',
 'spark',
 'python',
 'blob',
 'master',
 'setting',
 'through',
 'class',
 'reading',
 'list',
 'https',
 'github',
 'soumendra',
 'learn',
 'spark',
 'python',
 'blob',
 'master',
 'course',
 'reading',
 'setting',
 'instance',
 'when',
 'logging',
 'into',
 'instance',
 'first',
 'time',
 'execute',
 'following',
 'bash',
 'sudo',
 'update',
 'sudo',
 'upgrade',
 'sudo',
 'install',
 'python',
 'software',
 'properties',
 'common',
 'curl',
 'default',
 'sudo',
 'install',
 'default',
 'python',
 'software',
 'properties',
 'byobu',
 'sudo',
 'install',
 'core',
 'config',
 'global',
 'user',
 'email',
 'example',
 'config',
 'global',
 'user',
 'name',
 'your',
 'name',
 'anaconda',
 'https',
 'github',
 'soumendra',
 'python',
 'machinelearning',
 'setup',
 'clone',
 'install',
 'from',
 'spark',
 'bash',
 'jupyter',
 'notebook',
 'generate',
 'config',
 'mkdir',
 'certs',
 '

In [8]:
words = filein.map(lambda line: re.split(r'\W+', line.lower().strip()))
words = words.filter(lambda x: len(x) > 3)
words.collect()

[['', 'learn', 'spark2', 'with', 'python'],
 ['1',
  'set',
  'up',
  'your',
  'machine',
  'https',
  'github',
  'com',
  'soumendra',
  'learn',
  'spark',
  'python',
  'blob',
  'master',
  'setting',
  'up',
  'md',
  ''],
 ['2',
  'go',
  'through',
  'the',
  'pre',
  'class',
  'reading',
  'list',
  'https',
  'github',
  'com',
  'soumendra',
  'learn',
  'spark',
  'python',
  'blob',
  'master',
  'pre',
  'course',
  'reading',
  'md',
  ''],
 ['', 'setting', 'up', 'aws', 'instance'],
 ['when',
  'you',
  'are',
  'logging',
  'into',
  'a',
  'new',
  'ec2',
  'instance',
  'for',
  'the',
  'first',
  'time',
  'execute',
  'the',
  'following',
  ''],
 ['sudo', 'apt', 'get', 'update', 'y'],
 ['sudo', 'apt', 'get', 'upgrade', 'y'],
 ['sudo',
  'apt',
  'get',
  'install',
  'y',
  'python',
  'dev',
  'software',
  'properties',
  'common',
  'curl',
  'default',
  'jre'],
 ['sudo',
  'apt',
  'get',
  'install',
  'y',
  'default',
  'jdk',
  'python',
  'software',
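
The two outputs differ because `flatMap` flattens the per-line token lists into one stream of words, while `map` keeps one list per line. In the `map` version, `len(x) > 3` therefore tests the number of tokens in a line rather than the length of a word. A plain-Python sketch of the difference (not Spark code; the two sample lines are taken from the README shown above, and `tokenize` is a hypothetical helper):

```python
import re
from itertools import chain

sample_lines = ["# Learn Spark2 with Python", "sudo apt-get update -y"]


def tokenize(line):
    return re.split(r'\W+', line.lower().strip())


mapped = [tokenize(line) for line in sample_lines]   # like map: one list per line
flat = list(chain.from_iterable(mapped))             # like flatMap: one flat list of words

long_words = [w for w in flat if len(w) > 3]                   # filters on word length
long_lists = [tokens for tokens in mapped if len(tokens) > 3]  # filters on tokens per line

print(long_words)
print(long_lists)
```

Both sample lines split into five tokens, so the second filter keeps them whole, short words and empty strings included.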
 

In [59]:
# filein holds whole lines, so each key here is a line, not a word
words = filein.map(lambda w: (w, 1))
words = words.reduceByKey(add)
print('number of distinct lines in file: %s' % words.count())

number of distinct lines in file: 28
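The cell above pairs each element of `filein` with 1, and those elements are whole lines. A per-word variant would first split the lines and filter by word length. The plain-Python sketch below illustrates that variant, with `Counter` standing in for the `(word, 1)` pairs plus `reduceByKey(add)`; the sample lines are taken from the README and the result says nothing about the full file.

```python
import re
from collections import Counter

sample_lines = ["sudo apt-get update -y", "sudo apt-get upgrade -y"]

# Split every line into words and keep those longer than 3 characters
words = [w
         for line in sample_lines
         for w in re.split(r'\W+', line.lower().strip())
         if len(w) > 3]

# Counter plays the role of map(lambda w: (w, 1)).reduceByKey(add)
counts = Counter(words)
distinct = len(counts)

print(counts)
print(distinct)
```

In Spark the equivalent would start from the `flatMap` result rather than from `filein`.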


# Activity 5: Workflow Template

In [None]:
## Spark Application Template - execute with spark-submit

## Imports
from pyspark import SparkConf, SparkContext

## Module Constants
APP_NAME = "Name of Application"  #helps in debugging

## Closure Functions

## Main functionality

def main(sc):
    pass

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc   = SparkContext(conf=conf)

    # Execute Main functionality
    main(sc)

# To close or exit the program use sc.stop() or sys.exit(0)

In [None]:
from add import *

add(5,6)

# Activity 6: Sample Application

In [None]:
import findspark

# provide path to your spark directory directly
findspark.init("/Users/soumendra/spark2/")

import pyspark

In [None]:
%matplotlib inline
## Imports
import csv
import matplotlib.pyplot as plt

from io import StringIO
from datetime import datetime
from collections import namedtuple
from operator import add, itemgetter
from pyspark import SparkConf, SparkContext

## Module Constants
APP_NAME = "Flight Delay Analysis"
DATE_FMT = "%Y-%m-%d"
TIME_FMT = "%H%M"

fields   = ('date', 'airline', 'flightnum', 'origin', 'dest', 'dep',
            'dep_delay', 'arv', 'arv_delay', 'airtime', 'distance')
Flight   = namedtuple('Flight', fields)

## Closure Functions
def parse(row):
    """
    Parses a row and returns a named tuple.
    """

    row[0]  = datetime.strptime(row[0], DATE_FMT).date()
    row[5]  = datetime.strptime(row[5], TIME_FMT).time()
    row[6]  = float(row[6])
    row[7]  = datetime.strptime(row[7], TIME_FMT).time()
    row[8]  = float(row[8])
    row[9]  = float(row[9])
    row[10] = float(row[10])
    return Flight(*row[:11])

def split(line):
    """
    Operator function for splitting a line with csv module
    """
    reader = csv.reader(StringIO(line))
    return next(reader)

def plot(delays):
    """
    Show a bar chart of the total delay per airline
    """
    airlines = [d[0] for d in delays]
    minutes  = [d[1] for d in delays]
    index    = list(range(len(airlines)))

    fig, axe = plt.subplots()
    bars = axe.barh(index, minutes)

    # Add the total minutes to the right
    for idx, air, mins in zip(index, airlines, minutes):
        if mins > 0:
            bars[idx].set_color('#d9230f')
            axe.annotate(" %0.0f min" % mins, xy=(mins+1, idx+0.5), va='center')
        else:
            bars[idx].set_color('#469408')
            axe.annotate(" %0.0f min" % mins, xy=(10, idx+0.5), va='center')

    # Set the ticks
    ticks = plt.yticks([idx+ 0.5 for idx in index], airlines)
    xt = plt.xticks()[0]
    plt.xticks(xt, [' '] * len(xt))

    # minimize chart junk
    plt.grid(axis = 'x', color ='white', linestyle='-')

    plt.title('Total Minutes Delayed per Airline')
    plt.show()

## Main functionality
def main(sc):

    # Load the airlines lookup dictionary
    airlines = dict(sc.textFile("ontime/airlines.csv").map(split).collect())

    # Broadcast the lookup dictionary to the cluster
    airline_lookup = sc.broadcast(airlines)

    # Read the CSV Data into an RDD
    flights = sc.textFile("ontime/flights.csv").map(split).map(parse)

    # Map the total delay to the airline (joined using the broadcast value)
    delays  = flights.map(lambda f: (airline_lookup.value[f.airline],
                                     add(f.dep_delay, f.arv_delay)))

    # Reduce the total delay for the month to the airline
    delays  = delays.reduceByKey(add).collect()
    delays  = sorted(delays, key=itemgetter(1))

    # Provide output from the driver
    for d in delays:
        print("%0.0f minutes delayed\t%s" % (d[1], d[0]))

    # Show a bar chart of the delays
    plot(delays)

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setMaster("local[*]")
    conf = conf.setAppName(APP_NAME)
    sc   = SparkContext(conf=conf)
    # Keep the three lines above when running the application with spark-submit (spark-submit app.py)
    # Comment them out when running in an IPython Notebook that already has a SparkContext `sc`

    # Execute Main functionality
    main(sc)
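
The parsing and aggregation logic can be tried without Spark or the `ontime` data. The sketch below reuses the same field layout and formats, with a dict standing in for the broadcast lookup and a running total standing in for `reduceByKey(add)`. The three sample rows, the lookup entries, and their values are all made up for illustration.

```python
import csv
from collections import namedtuple
from datetime import datetime
from io import StringIO
from operator import itemgetter

DATE_FMT = "%Y-%m-%d"
TIME_FMT = "%H%M"
fields = ('date', 'airline', 'flightnum', 'origin', 'dest', 'dep',
          'dep_delay', 'arv', 'arv_delay', 'airtime', 'distance')
Flight = namedtuple('Flight', fields)


def split(line):
    """Split one CSV line into a list of strings."""
    return next(csv.reader(StringIO(line)))


def parse(row):
    """Convert the string fields of a row and return a Flight."""
    row[0] = datetime.strptime(row[0], DATE_FMT).date()
    row[5] = datetime.strptime(row[5], TIME_FMT).time()
    row[7] = datetime.strptime(row[7], TIME_FMT).time()
    for i in (6, 8, 9, 10):
        row[i] = float(row[i])
    return Flight(*row[:11])


# Made-up rows in the column order given by `fields`
sample = [
    "2014-04-01,AA,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00",
    "2014-04-01,AA,2,LAX,JFK,0944,14.00,1736,-29.00,269.00,2475.00",
    "2014-04-01,DL,5,ATL,SFO,0700,3.00,0905,7.00,285.00,2139.00",
]
# Made-up lookup standing in for airlines.csv and the broadcast variable
airline_lookup = {"AA": "American Airlines Inc.", "DL": "Delta Air Lines Inc."}

totals = {}
for f in map(parse, map(split, sample)):
    name = airline_lookup[f.airline]
    totals[name] = totals.get(name, 0.0) + f.dep_delay + f.arv_delay

delays = sorted(totals.items(), key=itemgetter(1))
for name, minutes in delays:
    print("%0.0f minutes delayed\t%s" % (minutes, name))
```

A negative total means the airline's flights in the sample arrived or departed early on balance, which is the case the green bars in `plot` are meant to show.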