# Introduction to Big Data analysis with Spark

This track introduces Big Data, along with the core concepts and frameworks used to process it. You will see why Apache Spark is so widely used for Big Data processing.

## Preparing the environment

### Importing libraries

In [1]:
from pyspark.sql.types import (_parse_datatype_string, StructType, StructField,
                               DoubleType, IntegerType, StringType)
from pyspark.sql import SparkSession

### Connect to Spark

In [2]:
spark = SparkSession.builder.getOrCreate()

# Evaluate DataFrames eagerly so they render as tables in the notebook
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)

### Reading the data

In [3]:
fifa = spark.read.csv('data-sources/Fifa2018_dataset.csv', header=True, inferSchema=True)
# Cast the player attribute columns to integer
for col_name in ['Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control', 'Composure', 
                 'Crossing', 'Curve', 'Dribbling', 'Finishing', 'Free kick accuracy', 'GK diving', 
                 'GK handling', 'GK kicking', 'GK positioning', 'GK reflexes', 'Heading accuracy', 
                 'Interceptions', 'Jumping', 'Long passing', 'Long shots', 'Marking', 'Penalties', 
                 'Positioning', 'Reactions', 'Short passing', 'Shot power', 'Sliding tackle', 
                 'Sprint speed', 'Stamina', 'Standing tackle', 'Strength', 'Vision', 'Volleys']:
    fifa = fifa.withColumn(col_name, fifa[col_name].cast('integer'))
fifa.createOrReplaceTempView("fifa")
fifa.printSchema()
fifa.limit(2)

root
 |-- _c0: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Photo: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Flag: string (nullable = true)
 |-- Overall: integer (nullable = true)
 |-- Potential: integer (nullable = true)
 |-- Club: string (nullable = true)
 |-- Club Logo: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: integer (nullable = true)
 |-- Acceleration: integer (nullable = true)
 |-- Aggression: integer (nullable = true)
 |-- Agility: integer (nullable = true)
 |-- Balance: integer (nullable = true)
 |-- Ball control: integer (nullable = true)
 |-- Composure: integer (nullable = true)
 |-- Crossing: integer (nullable = true)
 |-- Curve: integer (nullable = true)
 |-- Dribbling: integer (nullable = true)
 |-- Finishing: integer (nullable = true)
 |-- Free kick accuracy: integer (nullable = true)
 |-- GK diving: integer (nulla

_c0,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,Value,Wage,Special,Acceleration,Aggression,Agility,Balance,Ball control,Composure,Crossing,Curve,Dribbling,Finishing,Free kick accuracy,GK diving,GK handling,GK kicking,GK positioning,GK reflexes,Heading accuracy,Interceptions,Jumping,Long passing,Long shots,Marking,Penalties,Positioning,Reactions,Short passing,Shot power,Sliding tackle,Sprint speed,Stamina,Standing tackle,Strength,Vision,Volleys,CAM,CB,CDM,CF,CM,ID,LAM,LB,LCB,LCM,LDM,LF,LM,LS,LW,LWB,Preferred Positions,RAM,RB,RCB,RCM,RDM,RF,RM,RS,RW,RWB,ST
0,Cristiano Ronaldo,32,https://cdn.sofif...,Portugal,https://cdn.sofif...,94,94,Real Madrid CF,https://cdn.sofif...,€95.5M,€565K,2228,89,63,89,63,93,95,85,81,91,94,76,7,11,15,14,11,88,29,95,77,92,22,85,95,96,83,94,23,91,92,31,80,85,88,89.0,53.0,62.0,91.0,82.0,20801,89.0,61.0,53.0,82.0,62.0,91.0,89.0,92.0,91.0,66.0,ST LW,89.0,61.0,53.0,82.0,62.0,91.0,89.0,92.0,91.0,66.0,92.0
1,L. Messi,30,https://cdn.sofif...,Argentina,https://cdn.sofif...,93,93,FC Barcelona,https://cdn.sofif...,€105M,€565K,2154,92,48,90,95,95,96,77,89,97,95,90,6,11,15,14,8,71,22,68,87,88,13,74,93,95,88,85,26,87,73,28,59,90,85,92.0,45.0,59.0,92.0,84.0,158023,92.0,57.0,45.0,84.0,59.0,92.0,90.0,88.0,91.0,62.0,RW,92.0,57.0,45.0,84.0,59.0,92.0,90.0,88.0,91.0,62.0,88.0


In [4]:
# An explicit schema is supplied, so schema inference is not needed
movies = spark.read.csv('data-sources/movie-ratings.csv', header=False,
                        schema='userId int, movieId int, rating double, timestamp int')
movies.createOrReplaceTempView("movies")
movies.printSchema()
movies.limit(2)

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179


In [5]:
people = spark.read.csv('data-sources/people.csv', header=True, inferSchema=True)
people.createOrReplaceTempView("people")
people.printSchema()
people.limit(2)

root
 |-- _c0: integer (nullable = true)
 |-- person_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- date of birth: timestamp (nullable = true)



_c0,person_id,name,sex,date of birth
0,100,Penelope Lewis,female,1990-08-31 00:00:00
1,101,David Anthony,male,1971-10-14 00:00:00


In [6]:
wine = spark.read.csv('data-sources/wine-data.csv', header=True, inferSchema=True)
wine.createOrReplaceTempView("wine")
wine.printSchema()
wine.limit(2)

root
 |-- Wine: integer (nullable = true)
 |-- Alcohol: double (nullable = true)
 |-- Malic.acid: double (nullable = true)
 |-- Ash: double (nullable = true)
 |-- Acl: double (nullable = true)
 |-- Mg: integer (nullable = true)
 |-- Phenols: double (nullable = true)
 |-- Flavanoids: double (nullable = true)
 |-- Nonflavanoid.phenols: double (nullable = true)
 |-- Proanth: double (nullable = true)
 |-- Color.int: double (nullable = true)
 |-- Hue: double (nullable = true)
 |-- OD: double (nullable = true)
 |-- Proline: integer (nullable = true)



Wine,Alcohol,Malic.acid,Ash,Acl,Mg,Phenols,Flavanoids,Nonflavanoid.phenols,Proanth,Color.int,Hue,OD,Proline
1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050


In [7]:
spark.catalog.listTables()

[Table(name='fifa', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='movies', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='people', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='wine', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

## Ex. 1 - Understanding SparkContext

A SparkContext is the entry point to Spark functionality, much as a key is the way into a car. When a Spark application runs, a driver program starts; it contains the main function, and the SparkContext is initialized there. In the PySpark shell, a SparkContext is created for you automatically and exposed through the variable `sc`. In this notebook, `sc` is taken from the existing `SparkSession`.

In this simple exercise, you'll find out the attributes of the SparkContext in your PySpark shell which you'll be using for the rest of the course.

**Instructions:**

1. Print the version of `SparkContext` in the PySpark shell.
2. Print the Python version of `SparkContext` in the PySpark shell.
3. What is the master of `SparkContext` in the PySpark shell?

In [8]:
sc = spark.sparkContext

In [9]:
# Print the version of SparkContext
print("The version of Spark Context in the PySpark shell is:", sc.version)

# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is:", sc.pythonVer)

# Print the master of SparkContext
print("The master of Spark Context in the PySpark shell is:", sc.master)

The version of Spark Context in the PySpark shell is: 3.5.1
The Python version of Spark Context in the PySpark shell is: 3.11
The master of Spark Context in the PySpark shell is: local[*]


## Ex. 2 - Interactive Use of PySpark

Spark ships with an interactive Python shell, the PySpark shell, in which PySpark is already available. The shell is useful for basic testing and debugging. In this exercise, you'll load a simple list containing the numbers 1 to 100 in the PySpark shell.

The most important thing to understand here is that we are not creating any SparkContext object because PySpark automatically creates the SparkContext object named `sc` in the PySpark shell.

**Instructions:**

1. Create a Python list named `numb` containing the numbers `1` to `100`.
2. Load the list into Spark using the `SparkContext` method `parallelize()` and assign the result to a variable named `spark_data`.

In [10]:
# Create the numbers 1 to 100 (a range, which parallelize() accepts like a list)
numb = range(1, 101)

# Load the numbers into PySpark as an RDD
spark_data = sc.parallelize(numb)

# Review the parallelized data
print(f"""
{spark_data}
Total values: {spark_data.count()}
Values:
{spark_data.take(100)}
""")


PythonRDD[70] at RDD at PythonRDD.scala:53
Total values: 100
Values:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]



## Ex. 3 - Loading data in PySpark shell

In PySpark, computations are expressed as operations on distributed collections that are automatically parallelized across the cluster. The previous exercise loaded a list as a parallelized collection; in this exercise, you'll load data from a local file in the PySpark shell.

**Instructions:**

1. Load the local text file `sample_text.md` in the PySpark shell.

In [11]:
file_path = 'data-sources/sample_text.md'

# Load a local file into PySpark shell
lines = sc.textFile(file_path)

# Review the loaded data
print(f"""
{lines}
Total lines in the file: {lines.count()}
First 5 lines of the file:
""")
lines.take(5)


data-sources/sample_text.md MapPartitionsRDD[76] at textFile at <unknown>:0
Total lines in the file: 36
First 5 lines of the file:



['[![buildstatus](https://travis-ci.org/holdenk/learning-spark-examples.svg?branch=master)](https://travis-ci.org/holdenk/learning-spark-examples)',
 'Examples for Learning Spark',
 'Examples for the Learning Spark book. These examples require a number of libraries and as such have long build files. We have also added a stand alone example with minimal dependencies and a small build file',
 'in the mini-complete-example directory.']

## Ex. 4 - Use of lambda with map()

In Python 3, the built-in `map()` function applies a given function to each item of an iterable (list, tuple, etc.) and returns an iterator over the results; wrap it in `list()` to get a list. The general syntax is `map(fun, iter)`. The function passed to `map()` can be a `lambda` expression. Refer to slide 5 of video 1.7 for general help on using `map()` with `lambda`.

In this exercise, you'll use a `lambda` function inside the built-in `map()` function to square all the numbers in a list.

**Instructions:**

1. Print `my_list` which is available in your environment.
2. Square each item in `my_list` using `map()` and `lambda`.
3. Print the result of the `map()` function.

In [12]:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [13]:
# Print my_list in the console
print("Input list is", my_list)

# Square all numbers in my_list
squared_list_lambda = list(map(lambda x: x**2, my_list))

# Print the result of the map function
print("The squared numbers are", squared_list_lambda)

Input list is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The squared numbers are [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
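
In Python 3, `map()` produces a lazy iterator rather than a list, which is why the cell above wraps the call in `list()`. The following is a minimal sketch of that behavior in plain Python; the names `numbers`, `lazy_squares`, `leftover`, and `squares_comprehension` are illustrative and not part of the course material.

```python
numbers = [1, 2, 3, 4, 5]

# map() returns a lazy iterator; no squares are computed yet
lazy_squares = map(lambda x: x ** 2, numbers)

# list() consumes the iterator and materializes the results
squares = list(lazy_squares)

# The iterator is now exhausted, so a second pass yields nothing
leftover = list(lazy_squares)

# A list comprehension expresses the same transformation
squares_comprehension = [x ** 2 for x in numbers]

print(squares)
print(leftover)
print(squares_comprehension)
```

Because the iterator can be consumed only once, store the materialized list if the results are needed more than once.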


## Ex. 5 - Use of lambda with filter()

Another function used extensively in Python is `filter()`. It takes a function and an iterable as arguments and returns an iterator over the items for which the function returns a true value. Like `map()`, `filter()` can be used with a `lambda` function. Refer to slide 6 of video 1.7 for general help on using `filter()` with `lambda`.

In this exercise, you'll use a `lambda` function inside the built-in `filter()` function to find all the numbers in a list that are divisible by 10.

**Instructions:**

1. Print `my_list2` which is available in your environment.
2. Filter the numbers divisible by 10 from `my_list2` using `filter()` and `lambda`.
3. Print the numbers divisible by 10 from `my_list2`.

In [14]:
my_list2 = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]

In [15]:
# Print my_list2 in the console
print("Input list is:", my_list2)

# Filter numbers divisible by 10
filtered_list = list(filter(lambda x: x % 10 == 0, my_list2))

# Print the numbers divisible by 10
print("Numbers divisible by 10 are:", filtered_list)

Input list is: [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]
Numbers divisible by 10 are: [10, 40, 60, 80]
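
Since `filter()` also returns an iterator, its output can be passed straight to `map()` without building an intermediate list. The sketch below chains the two in plain Python; the names `values`, `multiples_of_ten`, `tenths`, and `tenths_comprehension` are illustrative assumptions, not names from the course.

```python
values = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]

# Keep only the numbers divisible by 10 (lazy iterator)
multiples_of_ten = filter(lambda x: x % 10 == 0, values)

# Feed the filtered iterator into map() and materialize once at the end
tenths = list(map(lambda x: x // 10, multiples_of_ten))

# The same filter-then-transform logic as a single list comprehension
tenths_comprehension = [x // 10 for x in values if x % 10 == 0]

print(tenths)
print(tenths_comprehension)
```

The comprehension is often easier to read in plain Python, while the `filter()` and `map()` form uses the same function-passing style as the exercises above.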


## Close

In [16]:
spark.stop()