<h2> PySpark - First Touch </h2>

This notebook is a first look at PySpark and walks through some basic, fundamental commands.

What is PySpark?
PySpark is the Python API for Spark.

Prerequisites for running the following code:

Installed and configured: Spark + PySpark + Python

For the configuration, I suggest this source: https://docs.anaconda.com/anaconda-scale/howto/spark-configuration/

Author: Luciano Nieto    |     Date: 05/06/20

In [1]:
import pyspark
from pyspark.sql import SparkSession

# The first step is to create a Spark session, where it is possible to configure the cluster nodes
# and the memory allocated to each one.

# Initialize your Spark session:

spark = SparkSession.builder \
   .master("local") \
   .appName("My First App") \
   .config("spark.executor.memory", "1gb") \
   .getOrCreate()
   
sc = spark.sparkContext

In [2]:
# RDD (Resilient Distributed Dataset): Spark revolves around the concept of the RDD,
# a fault-tolerant collection of elements that can be operated on in parallel.
# There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing
# a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source
# offering a Hadoop InputFormat. Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html

# Two ways to create an RDD:
# parallelize: from a collection
# textFile: from external data


# From a collection (the second argument, 2, is the number of partitions):
rdd = sc.parallelize([10, 11, 12, 13, 14], 2)

# Look at the first 2 elements of the RDD:
rdd.take(2)

[10, 11]
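
The second argument of `parallelize` sets how many partitions the collection is split into. As a minimal sketch (the helper name `describe_partitions` is hypothetical, and it assumes an RDD such as the `rdd` created above), the partition count and the contents of each partition can be inspected like this:

```python
def describe_partitions(rdd):
    """Return (number of partitions, list of elements per partition)."""
    # getNumPartitions() reports how many partitions the RDD has.
    num_partitions = rdd.getNumPartitions()
    # glom() turns each partition into a list; collect() brings them to the driver.
    # Only do this on small RDDs: collect() loads everything into driver memory.
    contents = rdd.glom().collect()
    return num_partitions, contents
```

In this notebook it could be called as `describe_partitions(rdd)`.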

In [70]:
# RDD from external data

# a) First, get the dataset. Source: https://www.kaggle.com/blastchar/telco-customer-churn/data
# b) Read the data from the CSV file into an RDD:

file = 'data/WA_Fn-UseC_-Telco-Customer-Churn.csv'  # <file location + filename>

rdd = sc.textFile(file)

# c) See the content of the RDD (first 3 elements):
rdd.take(3)

['customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn',
 '7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No',
 '5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No']

_________________________________________________________________________________________________________


<h4> RDD Transformation: produces a new RDD with the transformation applied. </h4>

  Narrow Transformation: each output partition is computed from a single input partition, so no data moves between partitions.
  Commands: map, flatMap, mapPartitions, filter, sample, union...

  Wide Transformation: an output partition may need elements that live in several input partitions, so the data is shuffled across the cluster.
  Commands: intersection, distinct, reduceByKey, groupByKey, join, cartesian, repartition, coalesce (with shuffle=True)...

<h4> RDD Action: triggers the computation and returns a result to the driver program. </h4>
  Commands: reduce, collect, count, first, take, countByKey...
  
  
More: https://spark.apache.org/docs/latest/rdd-programming-guide.html
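
To make the three categories concrete, here is a small sketch that chains one narrow transformation, one wide transformation and one action. The function name `distinct_column_values` is hypothetical, and it assumes an RDD whose elements are lists of fields (for example, CSV lines already split by comma):

```python
def distinct_column_values(rdd_of_fields, column_index):
    """Return the distinct values found in one column of an RDD of field lists."""
    # map is a narrow transformation: each partition is processed on its own.
    column = rdd_of_fields.map(lambda fields: fields[column_index])
    # distinct is a wide transformation: equal values from different
    # partitions must be brought together, which requires a shuffle.
    unique = column.distinct()
    # Transformations are lazy, so nothing has run so far;
    # collect is the action that triggers the whole chain.
    return unique.collect()
```

For the churn dataset, column index 15 would hold the contract type, so the call would return the contract types plus the header value `'Contract'`.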


_________________________________________________________________________________________________________


In [4]:
# RDD action:
# First element

rdd.first()

'customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn'

In [5]:
# Length of the first 3 elements:

rdd.map(lambda s: len(s)).take(3)

[259, 122, 98]

In [102]:
# Split lines by comma. When the RDD is created from a text file, each line is a single string.
# With map, we can split each line into its fields, using the comma as the separator.

rdd_splited = rdd.map(lambda line: line.split(","))
rdd_splited.take(2)

[['customerID',
  'gender',
  'SeniorCitizen',
  'Partner',
  'Dependents',
  'tenure',
  'PhoneService',
  'MultipleLines',
  'InternetService',
  'OnlineSecurity',
  'OnlineBackup',
  'DeviceProtection',
  'TechSupport',
  'StreamingTV',
  'StreamingMovies',
  'Contract',
  'PaperlessBilling',
  'PaymentMethod',
  'MonthlyCharges',
  'TotalCharges',
  'Churn'],
 ['7590-VHVEG',
  'Female',
  '0',
  'Yes',
  'No',
  '1',
  'No',
  'No phone service',
  'DSL',
  'No',
  'Yes',
  'No',
  'No',
  'No',
  'No',
  'Month-to-month',
  'Yes',
  'Electronic check',
  '29.85',
  '29.85',
  'No']]
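
Splitting on every comma works for this file, but it would break a field that contains a comma inside double quotes. A possible alternative, sketched here with Python's standard `csv` module (the helper name `parse_csv_line` and the second sample line are made up for illustration):

```python
import csv

def parse_csv_line(line):
    """Split one CSV line into fields, honoring double-quoted fields."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line]))

# A line in the style of the dataset:
print(parse_csv_line('7590-VHVEG,Female,0'))   # ['7590-VHVEG', 'Female', '0']
# A made-up line with a quoted comma, which line.split(",") would cut in two:
print(parse_csv_line('1,"Smith, John",No'))    # ['1', 'Smith, John', 'No']
```

With the RDD from this notebook it could be applied as `rdd.map(parse_csv_line)`.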

In [20]:
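# filter: keep only the elements that satisfy a condition.
# Use the split RDD, so that x[20] is the Churn field and not the 21st character of the line:

rdd_churns = rdd_splited.filter(lambda x: x[20] == "Yes")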
rdd_churns.take(1)

[['3668-QPYBK',
  'Male',
  '0',
  'No',
  'No',
  '2',
  'Yes',
  'No',
  'DSL',
  'Yes',
  'Yes',
  'No',
  'No',
  'No',
  'No',
  'Month-to-month',
  'Yes',
  'Mailed check',
  '53.85',
  '108.15',
  'Yes']]

In [29]:
# filter with "in": keep the rows where any field equals "Fiber optic":

rdd_fiber = rdd_splited.filter(lambda x: "Fiber optic" in x)
rdd_fiber.take(1)

[['9237-HQITU',
  'Female',
  '0',
  'No',
  'No',
  '2',
  'Yes',
  'No',
  'Fiber optic',
  'No',
  'No',
  'No',
  'No',
  'No',
  'No',
  'Month-to-month',
  'Yes',
  'Electronic check',
  '70.7',
  '151.65',
  'Yes']]

In [61]:
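# Sort the RDD by the TotalCharges column, descending.
# The fields are still strings, so the order is lexicographic: the header row comes first and '999.9' sorts above larger numbers.

rdd_sorted = rdd_splited.sortBy(lambda line: line[19], ascending=False)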
rdd_sorted.take(2)

[['customerID',
  'gender',
  'SeniorCitizen',
  'Partner',
  'Dependents',
  'tenure',
  'PhoneService',
  'MultipleLines',
  'InternetService',
  'OnlineSecurity',
  'OnlineBackup',
  'DeviceProtection',
  'TechSupport',
  'StreamingTV',
  'StreamingMovies',
  'Contract',
  'PaperlessBilling',
  'PaymentMethod',
  'MonthlyCharges',
  'TotalCharges',
  'Churn'],
 ['9093-FPDLG',
  'Female',
  '0',
  'No',
  'No',
  '11',
  'Yes',
  'No',
  'Fiber optic',
  'No',
  'Yes',
  'Yes',
  'Yes',
  'No',
  'Yes',
  'Month-to-month',
  'Yes',
  'Electronic check',
  '94.2',
  '999.9',
  'No']]
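
The header row and '999.9' end up on top because the sort key is a string, so values are compared character by character. A minimal sketch of a numeric sort follows; the names `charge_as_float` and `sort_by_total_charges` are hypothetical, and treating blank or malformed charges as 0.0 is an assumption, not something the dataset documentation states:

```python
def charge_as_float(text, default=0.0):
    """Convert a charge field to float; blank or malformed text becomes `default`."""
    try:
        return float(text)
    except ValueError:
        return default

def sort_by_total_charges(rdd_of_fields, header):
    """Sort rows by TotalCharges (index 19) numerically, largest first, skipping the header."""
    rows = rdd_of_fields.filter(lambda fields: fields != header)
    return rows.sortBy(lambda fields: charge_as_float(fields[19]), ascending=False)

# Plain-Python check of the difference between the two orderings:
values = ['999.9', '1889.5', '29.85']
print(sorted(values, reverse=True))                       # ['999.9', '29.85', '1889.5']
print(sorted(values, key=charge_as_float, reverse=True))  # ['1889.5', '999.9', '29.85']
```

In this notebook the call could look like `sort_by_total_charges(rdd_splited, rdd_splited.first()).take(2)`.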

In [123]:
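# Key-value pair RDD: each element is a (key, value) tuple, the form that aggregations such as reduceByKey expect.
# Here the key is fields 0-16 and the value is fields 18-19 (MonthlyCharges, TotalCharges); field 17 (PaymentMethod) is left out.


rdd_2 = rdd_splited.map(lambda line: (line[0:17], line[18:20]))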
rdd_2.take(2)

[(['customerID',
   'gender',
   'SeniorCitizen',
   'Partner',
   'Dependents',
   'tenure',
   'PhoneService',
   'MultipleLines',
   'InternetService',
   'OnlineSecurity',
   'OnlineBackup',
   'DeviceProtection',
   'TechSupport',
   'StreamingTV',
   'StreamingMovies',
   'Contract',
   'PaperlessBilling'],
  ['MonthlyCharges', 'TotalCharges']),
 (['7590-VHVEG',
   'Female',
   '0',
   'Yes',
   'No',
   '1',
   'No',
   'No phone service',
   'DSL',
   'No',
   'Yes',
   'No',
   'No',
   'No',
   'No',
   'Month-to-month',
   'Yes'],
  ['29.85', '29.85'])]
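
The pairs above use a Python list as the key, which is fine for inspection, but aggregations need a hashable key such as a string or a tuple. As a hedged sketch of an actual aggregation (the function name `monthly_charges_by_contract` is hypothetical; the column positions 15 for Contract and 18 for MonthlyCharges are read off the output above):

```python
from operator import add

def monthly_charges_by_contract(rdd_of_fields, header):
    """Sum MonthlyCharges (index 18) per Contract type (index 15)."""
    # Drop the header row, whose MonthlyCharges field is not a number.
    rows = rdd_of_fields.filter(lambda fields: fields != header)
    # Build (key, value) pairs; the key is a string, so it is hashable.
    pairs = rows.map(lambda fields: (fields[15], float(fields[18])))
    # reduceByKey merges all values that share a key using the given function.
    return pairs.reduceByKey(add)
```

In this notebook it could be used as `monthly_charges_by_contract(rdd_splited, rdd_splited.first()).collect()`.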

<h3> RDD x DataFrame x Dataset </h3>

RDD: the primary user-facing API in Spark since the beginning, about 2011.

DataFrame: a distributed collection of Row objects organized into named columns, with UDFs and a logical plan optimizer. Introduced in Spark 1.3 (2015).

Dataset: a strongly-typed, optimized API. Introduced as experimental in Spark 1.6 and unified with DataFrame in Spark 2.0 (2016). Available in Scala and Java only, until now.

Source: https://spark.apache.org/docs/latest/sql-programming-guide.html

<h3> PySpark DataFrames </h3>

In [127]:
# There are many ways to create DataFrames in PySpark. Let's see how it works:

# a) Create a DataFrame from an RDD, just as an example:

df = rdd_splited.toDF()
df.head(2)  # return the first 2 rows of the DataFrame.

[Row(_1='customerID', _2='gender', _3='SeniorCitizen', _4='Partner', _5='Dependents', _6='tenure', _7='PhoneService', _8='MultipleLines', _9='InternetService', _10='OnlineSecurity', _11='OnlineBackup', _12='DeviceProtection', _13='TechSupport', _14='StreamingTV', _15='StreamingMovies', _16='Contract', _17='PaperlessBilling', _18='PaymentMethod', _19='MonthlyCharges', _20='TotalCharges', _21='Churn'),
 Row(_1='7590-VHVEG', _2='Female', _3='0', _4='Yes', _5='No', _6='1', _7='No', _8='No phone service', _9='DSL', _10='No', _11='Yes', _12='No', _13='No', _14='No', _15='No', _16='Month-to-month', _17='Yes', _18='Electronic check', _19='29.85', _20='29.85', _21='No')]
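
Without a schema the columns are named `_1`, `_2`, ... and the header line becomes a data row. One possible way around both, sketched under the assumption that the first element of the RDD is the header (the helper name `rdd_to_named_df` is hypothetical):

```python
def rdd_to_named_df(rdd_of_fields):
    """Build a DataFrame whose column names come from the first row of the RDD."""
    header = rdd_of_fields.first()
    # Remove the header row (and any row identical to it) from the data.
    rows = rdd_of_fields.filter(lambda fields: fields != header)
    # toDF accepts a list of column names; every column stays a string here.
    return rows.toDF(header)
```

In this notebook it could be called as `rdd_to_named_df(rdd_splited).head(2)`.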

In [136]:
# b) Create a DataFrame from a collection:

c = [(1,2,3),(4,5,6),(7,8,9),(10,11,12),(13,14,15)]
df = spark.createDataFrame(c)
df.head(2)

[Row(_1=1, _2=2, _3=3), Row(_1=4, _2=5, _3=6)]

In [139]:
# c) Create a DataFrame from a CSV file (external):

df = spark.read.csv(file)
df.head(2)

[Row(_c0='customerID', _c1='gender', _c2='SeniorCitizen', _c3='Partner', _c4='Dependents', _c5='tenure', _c6='PhoneService', _c7='MultipleLines', _c8='InternetService', _c9='OnlineSecurity', _c10='OnlineBackup', _c11='DeviceProtection', _c12='TechSupport', _c13='StreamingTV', _c14='StreamingMovies', _c15='Contract', _c16='PaperlessBilling', _c17='PaymentMethod', _c18='MonthlyCharges', _c19='TotalCharges', _c20='Churn'),
 Row(_c0='7590-VHVEG', _c1='Female', _c2='0', _c3='Yes', _c4='No', _c5='1', _c6='No', _c7='No phone service', _c8='DSL', _c9='No', _c10='Yes', _c11='No', _c12='No', _c13='No', _c14='No', _c15='Month-to-month', _c16='Yes', _c17='Electronic check', _c18='29.85', _c19='29.85', _c20='No')]
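
As the output shows, a plain `spark.read.csv(file)` names the columns `_c0`, `_c1`, ... and treats the header line as data. A minimal sketch of reading the same file with its header follows; the wrapper name `read_csv_with_header` is hypothetical, while `header` and `inferSchema` are options of the CSV reader:

```python
def read_csv_with_header(spark, path):
    """Read a CSV file, using its first line as column names."""
    # header=True takes the column names from the first line instead of _c0, _c1, ...
    # inferSchema=True asks Spark to guess the column types,
    # at the cost of an extra pass over the data.
    return spark.read.csv(path, header=True, inferSchema=True)
```

In this notebook it could be called as `read_csv_with_header(spark, file)`, followed by `df.printSchema()` to check the inferred types.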