**Checking Java Version**

In [0]:
!java -version

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~18.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)


**Selecting Java 8 as the default Java**

In [0]:
!sudo update-alternatives --config java

There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                            Priority   Status
------------------------------------------------------------
  0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      manual mode
* 2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2


**Downloading Spark**

In [0]:
!wget -q http://apachemirror.wuchna.com/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

**Extracting Spark Files**

In [0]:
!tar xf spark-2.4.4-bin-hadoop2.7.tgz

**Installing FindSpark**

In [0]:
!pip install -q findspark

**Setting the JAVA_HOME and SPARK_HOME environment variables**

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
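Before starting a session, it can help to confirm that both variables point at directories that actually exist, since a wrong path may otherwise show up later as a less obvious error. A minimal sketch using only the standard library; `check_env_dirs` is a hypothetical helper, not part of findspark or PySpark:

```python
import os

def check_env_dirs(names):
    """Map each variable name to True if it is set and points to an existing directory."""
    result = {}
    for name in names:
        path = os.environ.get(name)  # None when the variable is unset
        result[name] = bool(path) and os.path.isdir(path)
    return result

status = check_env_dirs(["JAVA_HOME", "SPARK_HOME"])
for name, ok in status.items():
    print(name, "OK" if ok else "missing or not a directory")
```

If either line reports a problem, re-check the paths against the `update-alternatives` listing and the extracted Spark folder name.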

**Creating Spark Session**

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

**Stopping the session**

In [0]:
spark.stop()

**Installing PySpark**

In [0]:
!pip install pyspark



# First PySpark Job

**Importing PySpark**

In [0]:
import pyspark

**Creating a Spark Context**

In [0]:
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
conf.setMaster('local')
conf.setAppName('spark-basic')
sc = SparkContext(conf=conf)

**Function to calculate mod**

In [0]:
def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))

**Creating an RDD**

In [0]:
# take(10) is an action: it returns a plain Python list, not an RDD
first_ten = sc.parallelize(range(1000)).map(mod).take(10)
print(first_ten)

[(0, 0), (1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1), (8, 0), (9, 1)]
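For comparison, the same pipeline can be sketched in plain Python without a Spark context. This is only an illustration of what `map` followed by `take(10)` computes, not how Spark executes it; the `%` operator stands in for `np.mod` so the snippet needs no third-party packages:

```python
from itertools import islice

def mod(x):
    # Same pairing as the notebook's mod(), using % instead of numpy
    return (x, x % 2)

# map() is lazy, so islice pulls only the first 10 results, much like take(10)
first_ten = list(islice(map(mod, range(1000)), 10))
print(first_ten)
```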


**Creating an RDD using List**

In [0]:
values = [1, 2, 3, 4, 5]
rdd = sc.parallelize(values)

**Printing all 5 elements of the RDD**

In [0]:
rdd.take(5)

[1, 2, 3, 4, 5]

**Uploading Files to Colab**

In [0]:
from google.colab import files
uploaded = files.upload()

Saving Spark.txt to Spark.txt


**Loading a text file to Spark**

In [0]:
rdd = sc.textFile("Spark.txt")

**Printing the RDD data**

In [0]:
rdd.collect()

['Apache Spark with Python is PySpark']

**RDD Persistence**

In [0]:
aba = sc.parallelize(range(1,10000,2))
aba.persist()

PythonRDD[7] at RDD at PythonRDD.scala:53

**RDD Caching**

In [0]:
textFile = sc.textFile("Spark.txt")
textFile.cache()

Spark.txt MapPartitionsRDD[9] at textFile at NativeMethodAccessorImpl.java:0

**Map**

In [0]:
x = sc.parallelize(["spark", "rdd", "example",  "sample", "example"])
y = x.map(lambda x:(x, 1))
y.collect()

[('spark', 1), ('rdd', 1), ('example', 1), ('sample', 1), ('example', 1)]

**FlatMap**

In [0]:
rdd = sc.parallelize([2, 3, 4])
sorted(rdd.flatMap(lambda x: range(1, x)).collect())

[1, 1, 1, 2, 2, 3]
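The difference between `map` and `flatMap` may be easier to see in plain Python. The sketch below is a local model of the two operations on the same input, not Spark code: `map` keeps one result per element, while `flatMap` concatenates the per-element results.

```python
from itertools import chain

values = [2, 3, 4]

# map-style: one output per input, so each element becomes a whole list
mapped = [list(range(1, x)) for x in values]

# flatMap-style: the per-element results are flattened into one sequence
flat_mapped = sorted(chain.from_iterable(range(1, x) for x in values))

print(mapped)
print(flat_mapped)
```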

**Filter**

In [0]:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
rdd.filter(lambda x: x % 2 == 0).collect()

[2, 4, 6]

**Sample (with replacement)**

In [0]:
parallel = sc.parallelize(range(9))
parallel.sample(True,.2).count()

1

**Sample (without replacement)**

In [0]:
parallel.sample(False,1).collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8]

**Union**

In [0]:
parallel = sc.parallelize(range(1,9))
par = sc.parallelize(range(5,15))
parallel.union(par).collect()

[1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

**Intersection**

In [0]:
parallel = sc.parallelize(range(1,9))
par = sc.parallelize(range(5,15))
parallel.intersection(par).collect()

[6, 8, 5, 7]

**Distinct**

In [0]:
parallel = sc.parallelize(range(1,9))
par = sc.parallelize(range(5,15))
parallel.union(par).distinct().collect()

[2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13]
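The three set-style transformations above can be mirrored with plain Python lists and sets. This is a sketch of the expected contents only. The Spark results come back in a different order because `intersection` and `distinct` shuffle data between partitions, so the results are sorted here for readability.

```python
left = list(range(1, 9))
right = list(range(5, 15))

union_all = left + right                   # union keeps duplicates
common = sorted(set(left) & set(right))    # intersection drops duplicates
unique = sorted(set(union_all))            # union followed by distinct

print(union_all)
print(common)
print(unique)
```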

**SortBy (ascending)**

In [0]:
y = sc.parallelize([5, 7, 1, 3, 2, 1, 10])
y.sortBy(lambda c: c, True).collect()

[1, 1, 2, 3, 5, 7, 10]

**SortBy (descending)**

In [0]:
z = sc.parallelize([("H", 10), ("A", 26), ("Z", 1), ("L", 5)])
z.sortBy(lambda c: c, False).collect()

[('Z', 1), ('L', 5), ('H', 10), ('A', 26)]

**MapPartitions**

In [0]:
rdd = sc.parallelize([1, 2, 3, 4], 2)
def f(iterator): yield sum(iterator)
rdd.mapPartitions(f).collect()

[3, 7]

**MapPartitionsWithIndex**

In [0]:
rdd = sc.parallelize([1, 2, 3, 4], 4)
def f(splitIndex, iterator): yield splitIndex
rdd.mapPartitionsWithIndex(f).sum()

6
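To see why the two cells above return `[3, 7]` and `6`, it may help to model the partitions locally. The sketch assumes an even-split rule (slice `i` covers positions `i*n//k` up to `(i+1)*n//k`) that reproduces the partition layouts shown in this notebook. `split_evenly` is a hypothetical helper, and Spark's actual partitioning is an implementation detail that may differ.

```python
def split_evenly(items, num_slices):
    """Assumed rule: slice i covers positions [i*n//k, (i+1)*n//k)."""
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

def partition_sum(iterator):
    yield sum(iterator)

partitions = split_evenly([1, 2, 3, 4], 2)

# mapPartitions-style: call the function once per partition, chain the outputs
per_partition = [out for part in partitions for out in partition_sum(iter(part))]

# mapPartitionsWithIndex-style: with 4 partitions the indices are 0, 1, 2, 3
index_sum = sum(i for i, _ in enumerate(split_evenly([1, 2, 3, 4], 4)))

print(partitions, per_partition, index_sum)
```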

**GroupBy**

In [0]:
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
result = rdd.groupBy(lambda x: x % 2).collect()
sorted([(x, sorted(y)) for (x, y) in result])

[(0, [2, 8]), (1, [1, 1, 3, 5])]
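A local sketch of the same grouping, for readers who want to check the result without a Spark context. `group_by` is a hypothetical helper that collects elements under the key returned by the function, which is the result `groupBy` produces here.

```python
from collections import defaultdict

def group_by(items, key_func):
    """Collect items into lists keyed by key_func(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return groups

result = group_by([1, 1, 2, 3, 5, 8], lambda x: x % 2)
grouped = sorted((key, sorted(vals)) for key, vals in result.items())
print(grouped)
```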

**KeyBy**

In [0]:
x = sc.parallelize(range(0,3)).keyBy(lambda x: x*x)
y = sc.parallelize(zip(range(0,5), range(0,5)))
[(x, list(map(list, y))) for x, y in sorted(x.cogroup(y).collect())]

[(0, [[0], [0]]),
 (1, [[1], [1]]),
 (2, [[], [2]]),
 (3, [[], [3]]),
 (4, [[2], [4]])]

**Zip**

In [0]:
x = sc.parallelize(range(0,5))
y = sc.parallelize(range(1000, 1005))
x.zip(y).collect()

[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]

**ZipWithIndex**

In [0]:
sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect()

[('a', 0), ('b', 1), ('c', 2), ('d', 3)]

**Repartition**

In [0]:
rdd = sc.parallelize([1,2,3,4,5,6,7], 4)
sorted(rdd.glom().collect())

[[1], [2, 3], [4, 5], [6, 7]]

In [0]:
len(rdd.repartition(2).glom().collect())

2

**Coalesce**

In [0]:
sc.parallelize([1, 2, 3, 4, 5], 3).glom().collect()

[[1], [2, 3], [4, 5]]

In [0]:
sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(2).glom().collect()

[[1], [2, 3, 4, 5]]

**Reduce**

In [0]:
from operator import add
sc.parallelize([1, 2, 3, 4, 5]).reduce(add)

15

In [0]:
sc.parallelize((2 for _ in range(10))).map(lambda x: 1).cache().reduce(add)

10
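The two reductions above have direct counterparts in `functools.reduce`, sketched here for comparison. One difference worth keeping in mind: Spark combines partial results from separate partitions, so the function passed to `RDD.reduce` should be commutative and associative, whereas the local version simply folds left to right.

```python
from functools import reduce
from operator import add

total = reduce(add, [1, 2, 3, 4, 5])

# Ten elements, each mapped to 1, then summed: a count written as a reduce
ones = reduce(add, (1 for _ in range(10)))

print(total, ones)
```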

**First**

In [0]:
sc.parallelize([1, 2, 3, 4]).first()

1

**TakeOrdered**

In [0]:
nums = sc.parallelize([1,5,3,9,4,0,2])
nums.takeOrdered(5)

[0, 1, 2, 3, 4]

**Take**

In [0]:
nums = sc.parallelize([1,5,3,9,4,0,2])
nums.take(5)

[1, 5, 3, 9, 4]
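The contrast between `takeOrdered` and `take` can be sketched with the standard library: the first returns the smallest elements in sorted order, the second returns elements in their existing order. This is a local analogy, not how Spark implements either action.

```python
import heapq

nums = [1, 5, 3, 9, 4, 0, 2]

smallest_five = heapq.nsmallest(5, nums)  # analogous to takeOrdered(5)
first_five = nums[:5]                     # analogous to take(5)

print(smallest_five)
print(first_five)
```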

**Count**

In [0]:
nums = sc.parallelize([1,5,3,9,4,0,2,4])
nums.count()

8

**Collect**

In [0]:
c = sc.parallelize(["Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"], 2)
c.collect()

['Gnu', 'Cat', 'Rat', 'Dog', 'Gnu', 'Rat']

In [0]:
c = sc.parallelize(["Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"], 2)
c.distinct().collect()

['Cat', 'Rat', 'Gnu', 'Dog']

In [0]:
alphanumerics = sc.parallelize([(1,"a"),(2,"b"),(3,"c")])
alphanumerics.collectAsMap()

{1: 'a', 2: 'b', 3: 'c'}

**SaveAsTextFile**

In [0]:
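a = sc.parallelize(range(1,10000), 3)
# saveAsTextFile writes a directory of part files (one per partition);
# the target path must not already exist
a.saveAsTextFile("/usr/bin/mydata_a1")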

In [0]:
x = sc.parallelize([1,2,3,4,5,6,6,7,9,8,10,21], 3)
# Despite the .txt suffix, this creates a directory named sample1.txt
x.saveAsTextFile("/usr/bin/sample1.txt")

**Foreach**

In [0]:
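# print runs inside the Spark worker process, so the output may appear
# in the worker's stdout or logs rather than in the notebook cell
def f(x): print(x)
sc.parallelize([1,2,3,4,5]).foreach(f)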

**ForeachPartition**

In [0]:
def f(iterator):
    for x in iterator:
        print(x)
sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(f)

**Mathematical Actions**

In [0]:
numbers = sc.parallelize(range(1,100))

In [0]:
numbers.sum()

4950

In [0]:
numbers.min()

1

In [0]:
numbers.variance()

816.6666666666666

In [0]:
numbers.max()

99

In [0]:
numbers.mean()

50.0

In [0]:
numbers.stdev()

28.577380332470412
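These figures can be cross-checked locally with the `statistics` module. The sketch also shows that `variance()` and `stdev()` above are the population versions (dividing by n). The sample versions (dividing by n - 1) are available on RDDs as `sampleVariance()` and `sampleStdev()`.

```python
import statistics

numbers = list(range(1, 100))

total = sum(numbers)
mean = statistics.mean(numbers)
variance = statistics.pvariance(numbers)        # population variance (divide by n)
stdev = statistics.pstdev(numbers)              # population standard deviation
sample_variance = statistics.variance(numbers)  # sample variance (divide by n - 1)

print(total, min(numbers), max(numbers), mean, variance, stdev, sample_variance)
```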

**CountByValue**

In [0]:
a = sc.parallelize([1,2,3,4,5,6,7,8,2,4,2,3,3,3,1,1,1])
a.countByValue()

defaultdict(int, {1: 4, 2: 3, 3: 4, 4: 2, 5: 1, 6: 1, 7: 1, 8: 1})
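`countByValue` returns a dictionary of element frequencies to the driver. A local equivalent can be sketched with `collections.Counter`:

```python
from collections import Counter

values = [1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 3, 3, 3, 1, 1, 1]

# Tally how many times each distinct value occurs
counts = Counter(values)
print(dict(sorted(counts.items())))
```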

**toDebugString**

In [0]:
a = sc.parallelize(range(1,19),3)
b = sc.parallelize(range(1,13),3)
c = a.subtract(b)
c.toDebugString()

b'(6) PythonRDD[156] at RDD at PythonRDD.scala:53 []\n |  MapPartitionsRDD[155] at mapPartitions at PythonRDD.scala:133 []\n |  ShuffledRDD[154] at partitionBy at NativeMethodAccessorImpl.java:0 []\n +-(6) PairwiseRDD[153] at subtract at <ipython-input-82-e1f9a4054d92>:3 []\n    |  PythonRDD[152] at subtract at <ipython-input-82-e1f9a4054d92>:3 []\n    |  UnionRDD[151] at union at NativeMethodAccessorImpl.java:0 []\n    |  PythonRDD[149] at RDD at PythonRDD.scala:53 []\n    |  ParallelCollectionRDD[147] at parallelize at PythonRDD.scala:195 []\n    |  PythonRDD[150] at RDD at PythonRDD.scala:53 []\n    |  ParallelCollectionRDD[148] at parallelize at PythonRDD.scala:195 []'

**Creating Pair RDDs**

In [0]:
rdd = sc.parallelize([("a1", "b1", "c1", "d1", "e1"), ("a2", "b2", "c2", "d2", "e2")])
result = rdd.map(lambda x: (x[0], list(x[1:])))
result.collect()

[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]

**WordCount using RDD concepts**

In [0]:
rdd = sc.textFile("Spark.txt")

In [0]:
nonempty_lines = rdd.filter(lambda x: len(x) > 0)

In [0]:
words = nonempty_lines.flatMap(lambda x: x.split(' '))

In [0]:
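wordcount = (
    words.map(lambda x: (x, 1))             # pair each word with a count of 1
         .reduceByKey(lambda x, y: x + y)   # sum the counts per word
         .map(lambda x: (x[1], x[0]))       # swap to (count, word)
         .sortByKey(False)                  # sort by count, descending
)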

In [0]:
# Each element is a (count, word) pair
for pair in wordcount.collect():
    print(pair)

(3, 'PySpark')
(1, 'Apache')
(1, 'Spark')
(1, 'with')
(1, 'Python')
(1, 'is')
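The same word-count steps can be traced in plain Python. The sketch below uses hypothetical sample lines (an assumption, chosen so that "PySpark" occurs three times as in the output above) and mirrors each stage: filter empty lines, split into words, count per word, swap to `(count, word)`, and sort by count in descending order.

```python
from collections import Counter

# Hypothetical file contents, not the actual Spark.txt
lines = ["Apache Spark with Python is PySpark", "", "PySpark PySpark"]

nonempty_lines = [line for line in lines if len(line) > 0]          # filter
words = [w for line in nonempty_lines for w in line.split(' ')]     # flatMap
counts = Counter(words)                                             # map + reduceByKey

# Swap to (count, word) and sort by count only, highest first
ranked = sorted(((n, w) for w, n in counts.items()),
                key=lambda pair: pair[0], reverse=True)

for pair in ranked:
    print(pair)
```

Because the sort key is the count alone, words with equal counts keep their first-seen order here. In Spark, the relative order of tied keys after `sortByKey` should not be relied on.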


In [0]:
wordcount.saveAsTextFile("/content/Wordcount")

**Passing Functions to Spark**

In [0]:
rdd = sc.parallelize([1,2,3,4,5])
rdd.map(lambda x: x+2).collect()

[3, 4, 5, 6, 7]

In [0]:
lambda x: x+2

<function __main__.<lambda>>