## PySpark Tutorial 

Sources:

- https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial
- http://www.kdnuggets.com/2015/11/introduction-spark-python.html



In [1]:
import findspark
findspark.init()  # locate the local Spark installation and add it to sys.path

from pyspark import SparkContext
sc = SparkContext()  # entry point for creating RDDs

In [2]:
# Check that the SparkContext works: distribute a list and collect it back
data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
data.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [20]:
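# Read a text file into an RDD, one element per line.
# Any text file works; replace the path with your own.
# (This cell was re-run later with confusion.txt; the outputs of cells 11-17
# below come from an earlier run on a different file, the PySpark README.)
RDDread = sc.textFile("/Users/ijung/Desktop/confusion.txt")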

In [11]:
RDDread.collect()

['# Apache Spark',
 '',
 'Spark is a fast and general cluster computing system for Big Data. It provides',
 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that',
 'supports general computation graphs for data analysis. It also supports a',
 'rich set of higher-level tools including Spark SQL for SQL and DataFrames,',
 'MLlib for machine learning, GraphX for graph processing,',
 'and Spark Streaming for stream processing.',
 '',
 '<http://spark.apache.org/>',
 '',
 '## Online Documentation',
 '',
 'You can find the latest Spark documentation, including a programming',
 'guide, on the [project web page](http://spark.apache.org/documentation.html)',
 '',
 '',
 '## Python Packaging',
 '',
 'This README file only contains basic information related to pip installed PySpark.',
 'This packaging is currently experimental and may change in future versions (although we will do our best to keep compatibility).',
 'Using PySpark requires the Spark JARs, and if you are build

In [12]:
# first(): returns the first element of the RDD.
RDDread.first()

'# Apache Spark'

In [15]:
# take(n): returns the first n elements of the RDD as a list.
RDDread.take(3)

['# Apache Spark',
 '',
 'Spark is a fast and general cluster computing system for Big Data. It provides']

In [16]:
# takeSample(withReplacement, num, [seed]): returns a random sample of num elements,
# drawn with replacement (True) or without (False). The optional seed initializes
# the random number generator so that the sample is reproducible.
RDDread.takeSample(False, 10, 2)


['',
 '',
 '',
 'At its core PySpark depends on Py4J (currently version 0.10.4), but additional sub-packages have their own requirements (including numpy and pandas).',
 '## Python Requirements',
 'This packaging is currently experimental and may change in future versions (although we will do our best to keep compatibility).',
 '# Apache Spark',
 'supports general computation graphs for data analysis. It also supports a',
 '',
 'and Spark Streaming for stream processing.']
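
The arguments of `takeSample` can be illustrated without Spark. The following is a plain-Python sketch, not Spark's own sampling code: `take_sample_local` and the sample list are hypothetical names invented for this illustration, and the particular elements chosen for a given seed will differ from what Spark returns.

```python
import random

# Hypothetical stand-in for the lines of an RDD.
lines = ["a", "b", "c", "d", "e"]

def take_sample_local(items, with_replacement, num, seed=None):
    """Local analogue of takeSample: draw num elements at random."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    if with_replacement:
        # The same element may be drawn more than once.
        return rng.choices(items, k=num)
    # Each element can be drawn at most once.
    return rng.sample(items, num)

first = take_sample_local(lines, False, 3, seed=2)
second = take_sample_local(lines, False, 3, seed=2)
repeated = take_sample_local(lines, True, 8, seed=2)

print(first == second)   # True: same seed, same sample
print(len(set(first)))   # 3: no duplicates without replacement
print(len(repeated))     # 8: with replacement, num may exceed the list size
```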

In [17]:
# count(): returns the number of elements in an RDD (here, the number of lines).
RDDread.count()

32

Transformations and Actions in Apache Spark

Spark Transformations:
- map()
- flatMap()
- filter()
- sample()
- union()
- intersection()
- distinct()
- join()

Spark Actions:
- reduce()             
- collect()               
- count()
- first()    
- takeSample(withReplacement, num, [seed])    
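
The two lists differ in when work happens: a transformation only describes a new RDD, and nothing is computed until an action asks for a result. A plain-Python analogy using a generator (this is not Spark code; `to_upper` and the sample words are made up for the illustration):

```python
calls = []

def to_upper(word):
    calls.append(word)  # record every time the function really runs
    return word.upper()

words = ["spark", "rdd", "action"]

# "Transformation" analogue: the generator expression only describes the work.
pending = (to_upper(w) for w in words)
calls_before = len(calls)  # nothing has been computed yet

# "Action" analogue: materializing the result forces the computation.
result = list(pending)
calls_after = len(calls)

print(calls_before, calls_after, result)  # 0 3 ['SPARK', 'RDD', 'ACTION']
```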


In [22]:
RDDread.take(5)

['Confusion is the inability to think as clearly or quickly as you normally do.',
 '',
 'You may  have difficulty paying attention to anything , remembering anyone, and making decisions.',
 '',
 'Confusion may come to anyone early or late phase of the life, depending on the reason behind it .']

In [23]:
# map: each line becomes one list of words, so the result is a list of lists
mappedconfusion = RDDread.map(lambda line: line.split(" "))
mappedconfusion.take(2)

[['Confusion',
  'is',
  'the',
  'inability',
  'to',
  'think',
  'as',
  'clearly',
  'or',
  'quickly',
  'as',
  'you',
  'normally',
  'do.'],
 ['']]

In [24]:
# flatMap: the per-line word lists are flattened into a single list of strings
flatMappedConfusion = RDDread.flatMap(lambda line: line.split(" "))
flatMappedConfusion.take(2)

['Confusion', 'is']
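
The difference between the two results can be reproduced locally. This is a plain-Python sketch of the semantics, not Spark code; the two sample lines are copied from the `take(2)` output above, and `split_words` is a helper named here for illustration.

```python
from itertools import chain

# The first two lines of the file, as shown in the take(2) output above.
lines = [
    "Confusion is the inability to think as clearly or quickly as you normally do.",
    "",
]

def split_words(line):
    return line.split(" ")

# map analogue: exactly one output element per input element (one list per line).
mapped = [split_words(line) for line in lines]

# flatMap analogue: apply the same function, then flatten the results one level.
flat_mapped = list(chain.from_iterable(split_words(line) for line in lines))

print(len(mapped))        # 2: one list per line
print(mapped[1])          # ['']: an empty line splits into one empty string
print(flat_mapped[:2])    # ['Confusion', 'is']
print(len(flat_mapped))   # 15: 14 words plus the empty string
```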

In [25]:
# Keep only the lines of RDDread that contain "confus" (case-insensitive)
onlyconfusion = RDDread.filter(lambda line: "confus" in line.lower())
onlyconfusion.count()

7

In [26]:
onlyconfusion.collect() 

['Confusion is the inability to think as clearly or quickly as you normally do.',
 'Confusion may come to anyone early or late phase of the life, depending on the reason behind it .',
 'Many times, confusion lasts for a very short span and goes away.',
 'Confusion is more common in people who are in late stages of the life and often occurs when you have stayed in hospital.',
 'Some confused people may have strange or unusual behavior or may act aggressively.',
 'A good way to find out if anyone is confused is to question the person their identity i.e. name, age, and the date.',
 'If they are little not sure or unable to answer correctly, they are confused']

Source for the word-count example below: https://www.tutorialspoint.com/apache_spark/apache_spark_core_programming.htm

In [27]:
inputfile = sc.textFile("input.txt")

In [28]:
inputfile.collect()

['people are not as beautiful as they look, ',
 'as they walk or as they talk.',
 'they are only as beautiful  as they love, ',
 'as they care as they share.']

In [32]:
inputfile.flatMap(lambda line: line.split(" ")).collect()

['people',
 'are',
 'not',
 'as',
 'beautiful',
 'as',
 'they',
 'look,',
 '',
 'as',
 'they',
 'walk',
 'or',
 'as',
 'they',
 'talk.',
 'they',
 'are',
 'only',
 'as',
 'beautiful',
 '',
 'as',
 'they',
 'love,',
 '',
 'as',
 'they',
 'care',
 'as',
 'they',
 'share.']

In [33]:
inputfile.map(lambda line: line.split(" ")).collect()

[['people', 'are', 'not', 'as', 'beautiful', 'as', 'they', 'look,', ''],
 ['as', 'they', 'walk', 'or', 'as', 'they', 'talk.'],
 ['they', 'are', 'only', 'as', 'beautiful', '', 'as', 'they', 'love,', ''],
 ['as', 'they', 'care', 'as', 'they', 'share.']]

In [36]:
# Pair every word with the count 1
inputfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).collect()

[('people', 1),
 ('are', 1),
 ('not', 1),
 ('as', 1),
 ('beautiful', 1),
 ('as', 1),
 ('they', 1),
 ('look,', 1),
 ('', 1),
 ('as', 1),
 ('they', 1),
 ('walk', 1),
 ('or', 1),
 ('as', 1),
 ('they', 1),
 ('talk.', 1),
 ('they', 1),
 ('are', 1),
 ('only', 1),
 ('as', 1),
 ('beautiful', 1),
 ('', 1),
 ('as', 1),
 ('they', 1),
 ('love,', 1),
 ('', 1),
 ('as', 1),
 ('they', 1),
 ('care', 1),
 ('as', 1),
 ('they', 1),
 ('share.', 1)]

In [37]:
# With map instead of flatMap, each key is a whole list of words, not a single word
inputfile.map(lambda line: line.split(" ")).map(lambda words: (words, 1)).collect()

[(['people', 'are', 'not', 'as', 'beautiful', 'as', 'they', 'look,', ''], 1),
 (['as', 'they', 'walk', 'or', 'as', 'they', 'talk.'], 1),
 (['they', 'are', 'only', 'as', 'beautiful', '', 'as', 'they', 'love,', ''],
  1),
 (['as', 'they', 'care', 'as', 'they', 'share.'], 1)]

In [52]:

# Word count: split into words, pair each word with 1, then sum the 1s per word
counts = inputfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
counts.collect()

[('are', 2),
 ('as', 8),
 ('', 3),
 ('walk', 1),
 ('only', 1),
 ('share.', 1),
 ('people', 1),
 ('not', 1),
 ('beautiful', 2),
 ('they', 7),
 ('look,', 1),
 ('or', 1),
 ('talk.', 1),
 ('love,', 1),
 ('care', 1)]
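
The same pipeline can be traced in plain Python to check the counts by hand. This is a local sketch, not Spark code: `reduce_by_key` is a hypothetical stand-in for `reduceByKey` that runs on a single machine, and the four lines are copied from the `inputfile.collect()` output above.

```python
from itertools import chain

# Contents of input.txt, as shown by inputfile.collect() above.
lines = [
    "people are not as beautiful as they look, ",
    "as they walk or as they talk.",
    "they are only as beautiful  as they love, ",
    "as they care as they share.",
]

def reduce_by_key(pairs, func):
    """Merge the values of each key with func (local stand-in for reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

words = chain.from_iterable(line.split(" ") for line in lines)  # flatMap step
pairs = ((word, 1) for word in words)                           # map step
counts_local = reduce_by_key(pairs, lambda x, y: x + y)         # reduceByKey step

print(counts_local["as"], counts_local["they"], counts_local[""])  # 8 7 3
```

The `('', 3)` entry comes from `split(" ")` producing empty strings at trailing and doubled spaces; calling `line.split()` with no argument would drop them.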

In [54]:
# Show the lineage of the RDD: the chain of RDDs it was derived from
counts.toDebugString()

b'(2) PythonRDD[87] at collect at <ipython-input-52-93d337ba8392>:3 []\n |  MapPartitionsRDD[86] at mapPartitions at PythonRDD.scala:422 []\n |  ShuffledRDD[85] at partitionBy at NativeMethodAccessorImpl.java:0 []\n +-(2) PairwiseRDD[84] at reduceByKey at <ipython-input-52-93d337ba8392>:2 []\n    |  PythonRDD[83] at reduceByKey at <ipython-input-52-93d337ba8392>:2 []\n    |  input.txt MapPartitionsRDD[26] at textFile at NativeMethodAccessorImpl.java:0 []\n    |  input.txt HadoopRDD[25] at textFile at NativeMethodAccessorImpl.java:0 []'

In [None]:
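# Scala version of the same word count, for comparison with cell 52.
# It is not valid Python, so it is kept as a comment:
# counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)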