# What is PySpark?
PySpark is a Python API for working with Apache Spark. I will first explain what I mean by a "Python API" for something, and then explain what, specifically, Apache Spark is.

What I mean by **'Python API'** is that you can use the syntax and agility of Python to interact with and send commands to a system that is not based, at its core, on Python.

With PySpark, you interact with Apache Spark, a system designed for processing, analyzing and modeling immense amounts of data on many computers at the same time. Put differently, Apache Spark allows you to run computations in parallel instead of sequentially: it divides one incredibly large task into many smaller tasks and runs each of them on a different machine. This lets you accomplish your analysis goals in a reasonable time, which would not be possible on a single machine.

Usually, the amount of data that suits PySpark is an amount that would not fit into a single machine's storage (let alone its RAM).

**Important related concepts:**
1. Distributed computing - splitting a task into several smaller tasks that run at the same time. This is what PySpark lets you do across many machines; the same split-and-combine idea can also be applied on a single machine with several threads, for example.
2. Cluster - a network of machines that can take on tasks from a user, interact with one another and return results. These provide the computing resources that PySpark uses to run the computations.
3. Resilient Distributed Dataset (RDD) - an immutable distributed collection of data. It is not tabular, unlike the DataFrames we will work with later, and has no data schema. Therefore, for tabular data wrangling, DataFrames allow for more API options and under-the-hood optimizations. Still, you might encounter RDDs as you learn more about Spark, and should be aware of their existence.
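
To make the "split a task into smaller tasks" idea concrete, here is a minimal single-machine sketch using only the Python standard library. It is not how Spark works internally; the names `chunk_sum`, `numbers` and `n_workers` are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    """Sum one slice of the data; this is one 'small task'."""
    return sum(chunk)

numbers = list(range(1_000))
n_workers = 4
chunk_size = len(numbers) // n_workers  # 250 items per chunk

# split the big task into smaller, independent tasks
chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]

# run the small tasks concurrently and combine the partial results
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partial_sums = list(pool.map(chunk_sum, chunks))

total = sum(partial_sums)
print(total)  # 499500
```

Threads are used here only to show the split-and-combine pattern. In CPython they do not speed up CPU-bound work like this sum, whereas Spark spreads the small tasks over separate processes and machines.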

**Parts of PySpark we will cover:**
1. PySpark SQL - contains commands for data processing and manipulation.
2. PySpark MLlib - includes a variety of models, model training and related commands.

**Spark Architecture:**
To send commands to a cluster and receive results from it, you need to initiate a Spark session. This object is your tool for interacting with Spark. Each user of the cluster has their own Spark session, which lets them use the cluster in isolation from other users. Sessions reach the cluster through a Spark context, which runs in the driver program and plays the role of the master node: it splits the work into tasks, assigns them to the computers in the cluster and coordinates them. Each computer in the cluster that performs tasks for the driver is called a worker node. To use a worker node, the driver needs that node's computing power to be allocated to it by a cluster manager, which is responsible for distributing the cluster's resources. Inside each worker node there are executor processes that run the tasks: they can run multiple tasks simultaneously, and have their own cache for storing results. So, each driver can have multiple worker nodes, each of which can have multiple tasks running.

In [1]:
# a SparkSession object can perform the most common data processing tasks
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate() # will return existing session if one was
                                                           # created before and was not closed

In [2]:
spark

**dataset:**
https://www.kaggle.com/fedesoriano/heart-failure-prediction

In [3]:
# read csv, all columns will be of type string
df = spark.read.option('header','true').csv('heart.csv')

In [4]:
df.show()

+---+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
|Age|Sex|ChestPainType|RestingBP|Cholesterol|FastingBS|RestingECG|MaxHR|ExerciseAngina|Oldpeak|ST_Slope|HeartDisease|
+---+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
| 40|  M|          ATA|      140|        289|        0|    Normal|  172|             N|      0|      Up|           0|
| 49|  F|          NAP|      160|        180|        0|    Normal|  156|             N|      1|    Flat|           1|
| 37|  M|          ATA|      130|        283|        0|        ST|   98|             N|      0|      Up|           0|
| 48|  F|          ASY|      138|        214|        0|    Normal|  108|             Y|    1.5|    Flat|           1|
| 54|  M|          NAP|      150|        195|        0|    Normal|  122|             N|      0|      Up|           0|
| 39|  M|          NAP|      120|        339|        0| 

In [6]:
# show head of table
df.show(3)

+---+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
|Age|Sex|ChestPainType|RestingBP|Cholesterol|FastingBS|RestingECG|MaxHR|ExerciseAngina|Oldpeak|ST_Slope|HeartDisease|
+---+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
| 40|  M|          ATA|      140|        289|        0|    Normal|  172|             N|      0|      Up|           0|
| 49|  F|          NAP|      160|        180|        0|    Normal|  156|             N|      1|    Flat|           1|
| 37|  M|          ATA|      130|        283|        0|        ST|   98|             N|      0|      Up|           0|
+---+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
only showing top 3 rows



In [7]:
# count number of rows
df.count()

918

In [8]:
# show parts of the table
df.select('Age').show(3)
df.select(['Age','Sex']).show(3)

+---+
|Age|
+---+
| 40|
| 49|
| 37|
+---+
only showing top 3 rows

+---+---+
|Age|Sex|
+---+---+
| 40|  M|
| 49|  F|
| 37|  M|
+---+---+
only showing top 3 rows



## Pandas DataFrame VS PySpark DataFrame

Both represent a table of data with rows and columns. However, under the hood they are different, as a PySpark DataFrame needs to support distributed computations. As we move forward, we will see more and more of its features that are not present in a Pandas DataFrame. That being said, if you know how to use Pandas, then moving to PySpark will feel like a natural transition.

## DAG
A directed acyclic graph (DAG) is the way Spark organizes computations. When you give it a series of transformations to apply to the dataset, it builds a graph out of those transformations, so it knows what to do, but it does not execute them immediately if it does not have to. Rather, it is lazy: it goes through the DAG and applies the transformations only when it must, to provide a needed result. This allows better performance, since Spark knows what lies ahead of a given computation and can optimize the process accordingly.

## Transformations VS Actions
In PySpark, there are two types of commands: transformations and actions. Transformation commands are added to the DAG but do not cause it to be executed. They transform one DataFrame into another without changing the input DataFrame. Actions, on the other hand, make PySpark execute the DAG but do not create a new DataFrame; instead, they output the result of the DAG.

## Caching
Every time you run a DAG, it is re-computed from the beginning; that is, the results are not saved in memory.
So, if we want to keep a result so it won't have to be recomputed, we can use the cache command. Note that this occupies space in the worker nodes' memory, so be careful with the sizes of the datasets you cache! By default, a cached DataFrame is kept in RAM in deserialized form (not converted into a stream of bytes), and partitions that do not fit in memory spill to disk. You can change this by choosing a different storage level with the persist command: store the data on the hard disk only, serialize it, or both!

## Collecting
Even after caching a DataFrame, it still sits in the worker nodes' memory. If you want to collect its pieces, assemble them and keep them on the driver (the master node) so you won't have to pull them every time, use the collect command. Again, be very careful with this, since the collected data has to fit in the driver's memory!

In [9]:
df.cache()
df.collect()

[Row(Age='40', Sex='M', ChestPainType='ATA', RestingBP='140', Cholesterol='289', FastingBS='0', RestingECG='Normal', MaxHR='172', ExerciseAngina='N', Oldpeak='0', ST_Slope='Up', HeartDisease='0'),
 Row(Age='49', Sex='F', ChestPainType='NAP', RestingBP='160', Cholesterol='180', FastingBS='0', RestingECG='Normal', MaxHR='156', ExerciseAngina='N', Oldpeak='1', ST_Slope='Flat', HeartDisease='1'),
 Row(Age='37', Sex='M', ChestPainType='ATA', RestingBP='130', Cholesterol='283', FastingBS='0', RestingECG='ST', MaxHR='98', ExerciseAngina='N', Oldpeak='0', ST_Slope='Up', HeartDisease='0'),
 Row(Age='48', Sex='F', ChestPainType='ASY', RestingBP='138', Cholesterol='214', FastingBS='0', RestingECG='Normal', MaxHR='108', ExerciseAngina='Y', Oldpeak='1.5', ST_Slope='Flat', HeartDisease='1'),
 Row(Age='54', Sex='M', ChestPainType='NAP', RestingBP='150', Cholesterol='195', FastingBS='0', RestingECG='Normal', MaxHR='122', ExerciseAngina='N', Oldpeak='0', ST_Slope='Up', HeartDisease='0'),
 Row(Age='39',

In [10]:
# convert PySpark DataFrame to Pandas DataFrame
pd_df = df.toPandas()
# convert it back
spark_df = spark.createDataFrame(pd_df)

In [11]:
# show the first three rows as three Row objects, which is how Spark represents single rows of a table.
# we will learn more about it later
df.head(3)

[Row(Age='40', Sex='M', ChestPainType='ATA', RestingBP='140', Cholesterol='289', FastingBS='0', RestingECG='Normal', MaxHR='172', ExerciseAngina='N', Oldpeak='0', ST_Slope='Up', HeartDisease='0'),
 Row(Age='49', Sex='F', ChestPainType='NAP', RestingBP='160', Cholesterol='180', FastingBS='0', RestingECG='Normal', MaxHR='156', ExerciseAngina='N', Oldpeak='1', ST_Slope='Flat', HeartDisease='1'),
 Row(Age='37', Sex='M', ChestPainType='ATA', RestingBP='130', Cholesterol='283', FastingBS='0', RestingECG='ST', MaxHR='98', ExerciseAngina='N', Oldpeak='0', ST_Slope='Up', HeartDisease='0')]

In [12]:
# types of columns
df.printSchema()

root
 |-- Age: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- ChestPainType: string (nullable = true)
 |-- RestingBP: string (nullable = true)
 |-- Cholesterol: string (nullable = true)
 |-- FastingBS: string (nullable = true)
 |-- RestingECG: string (nullable = true)
 |-- MaxHR: string (nullable = true)
 |-- ExerciseAngina: string (nullable = true)
 |-- Oldpeak: string (nullable = true)
 |-- ST_Slope: string (nullable = true)
 |-- HeartDisease: string (nullable = true)



In [13]:
# column dtypes as list of tuples
df.dtypes

[('Age', 'string'),
 ('Sex', 'string'),
 ('ChestPainType', 'string'),
 ('RestingBP', 'string'),
 ('Cholesterol', 'string'),
 ('FastingBS', 'string'),
 ('RestingECG', 'string'),
 ('MaxHR', 'string'),
 ('ExerciseAngina', 'string'),
 ('Oldpeak', 'string'),
 ('ST_Slope', 'string'),
 ('HeartDisease', 'string')]

In [14]:
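# cast a column from one type to another
from pyspark.sql.types import FloatType
df = df.withColumn("Age",df.Age.cast(FloatType()))
# caution: the next line casts df.Age, so RestingBP becomes a copy of Age;
# use df.RestingBP.cast(FloatType()) to keep the real blood pressure values
df = df.withColumn("RestingBP",df.Age.cast(FloatType()))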

In [15]:
# compute summary statistics
df.select(['Age','RestingBP']).describe().show()

+-------+------------------+------------------+
|summary|               Age|         RestingBP|
+-------+------------------+------------------+
|  count|               918|               918|
|   mean|53.510893246187365|53.510893246187365|
| stddev|  9.43261650673202|  9.43261650673202|
|    min|              28.0|              28.0|
|    max|              77.0|              77.0|
+-------+------------------+------------------+



In [16]:
# add a new column or replace existing one
AgeFixed = df['Age'] + 1  # select always returns a DataFrame object, and we need a Column object
df = df.withColumn('AgeFixed', AgeFixed)

In [17]:
df.select(['AgeFixed','Age']).describe().show()

+-------+------------------+------------------+
|summary|          AgeFixed|               Age|
+-------+------------------+------------------+
|  count|               918|               918|
|   mean|54.510893246187365|53.510893246187365|
| stddev|  9.43261650673202|  9.43261650673202|
|    min|              29.0|              28.0|
|    max|              78.0|              77.0|
+-------+------------------+------------------+



In [18]:
# remove columns
df.drop('AgeFixed').show(1) # add df = to get the new DataFrame into a variable

+----+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
| Age|Sex|ChestPainType|RestingBP|Cholesterol|FastingBS|RestingECG|MaxHR|ExerciseAngina|Oldpeak|ST_Slope|HeartDisease|
+----+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
|40.0|  M|          ATA|     40.0|        289|        0|    Normal|  172|             N|      0|      Up|           0|
+----+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
only showing top 1 row



In [19]:
# rename a column
df.withColumnRenamed('Age','age').select('age').show(1)
# to rename more than a single column, I would suggest a loop.
name_pairs = [('Age','age'),('Sex','sex')]
for old_name, new_name in name_pairs:
    df = df.withColumnRenamed(old_name,new_name)

+----+
| age|
+----+
|40.0|
+----+
only showing top 1 row



In [20]:
df.select(['age','sex']).show(1)

+----+---+
| age|sex|
+----+---+
|40.0|  M|
+----+---+
only showing top 1 row



In [21]:
# drop all rows that contain any NA
df = df.na.drop()
df.count()
# drop all rows where all values are NA
df = df.na.drop(how='all')
# drop all rows with fewer than 2 non-NA values (keep rows where at least 2 values are NOT NA)
df = df.na.drop(thresh=2)
# drop all rows where any value in the specified columns is NA
df = df.na.drop(how='any', subset=['age','sex']) # 'any' is the default

In [22]:
# fill missing values in a specific column with a '?'
df = df.na.fill(value='?',subset=['sex'])
# replace NAs with mean of column
from pyspark.ml.feature import Imputer # In statistics, imputation is the process of
                                       # replacing missing data with substituted values
imptr = Imputer(inputCols=['age','RestingBP'],
                outputCols=['age','RestingBP']).setStrategy('mean') # 'median' is also supported

df = imptr.fit(df).transform(df)

In [23]:
# filter to adults only (each line below returns a new, filtered DataFrame)
df.filter('age > 18')
df.where('age > 18')# 'where' is an alias to 'filter'
df.where(df['age'] > 18) # third option
# add another condition ('&' means and, '|' means or)
df.where((df['age'] > 18) | (df['ChestPainType'] == 'ATA'))
# take every record where the 'ChestPainType' is NOT 'ATA'
df.filter(~(df['ChestPainType'] == 'ATA'))

DataFrame[age: float, sex: string, ChestPainType: string, RestingBP: float, Cholesterol: string, FastingBS: string, RestingECG: string, MaxHR: string, ExerciseAngina: string, Oldpeak: string, ST_Slope: string, HeartDisease: string, AgeFixed: float]

In [24]:
df.filter('age > 18').show()

+----+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+--------+
| age|sex|ChestPainType|RestingBP|Cholesterol|FastingBS|RestingECG|MaxHR|ExerciseAngina|Oldpeak|ST_Slope|HeartDisease|AgeFixed|
+----+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+--------+
|40.0|  M|          ATA|     40.0|        289|        0|    Normal|  172|             N|      0|      Up|           0|    41.0|
|49.0|  F|          NAP|     49.0|        180|        0|    Normal|  156|             N|      1|    Flat|           1|    50.0|
|37.0|  M|          ATA|     37.0|        283|        0|        ST|   98|             N|      0|      Up|           0|    38.0|
|48.0|  F|          ASY|     48.0|        214|        0|    Normal|  108|             Y|    1.5|    Flat|           1|    49.0|
|54.0|  M|          NAP|     54.0|        195|        0|    Normal|  122|             N|      0|      Up

In [27]:
# turn a SQL expression given as a string into a column expression
from pyspark.sql.functions import expr
exp = 'age + 0.2 * AgeFixed'
df.withColumn('new_col', expr(exp)).select('new_col').show(3)

+-------+
|new_col|
+-------+
|   48.2|
|   59.0|
|   44.6|
+-------+
only showing top 3 rows



In [28]:
df.show()

+----+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+--------+
| age|sex|ChestPainType|RestingBP|Cholesterol|FastingBS|RestingECG|MaxHR|ExerciseAngina|Oldpeak|ST_Slope|HeartDisease|AgeFixed|
+----+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+--------+
|40.0|  M|          ATA|     40.0|        289|        0|    Normal|  172|             N|      0|      Up|           0|    41.0|
|49.0|  F|          NAP|     49.0|        180|        0|    Normal|  156|             N|      1|    Flat|           1|    50.0|
|37.0|  M|          ATA|     37.0|        283|        0|        ST|   98|             N|      0|      Up|           0|    38.0|
|48.0|  F|          ASY|     48.0|        214|        0|    Normal|  108|             Y|    1.5|    Flat|           1|    49.0|
|54.0|  M|          NAP|     54.0|        195|        0|    Normal|  122|             N|      0|      Up

In [30]:
# group by age
disease_by_age = df.groupby('age').mean().select(['age'])
# sort values in descending order
from pyspark.sql.functions import desc
disease_by_age.orderBy(desc("age")).show(5)

+----+
| age|
+----+
|77.0|
|76.0|
|75.0|
|74.0|
|73.0|
+----+
only showing top 5 rows



In [31]:
from pyspark.sql.functions import asc  # asc(...) sorts in ascending order; this cell still sorts with desc
disease_by_age = df.groupby('age').mean().select(['age'])
disease_by_age.orderBy(desc("age")).show(3)

+----+
| age|
+----+
|77.0|
|76.0|
|75.0|
+----+
only showing top 3 rows



In [32]:
# aggregate to get several statistics for several columns
# commonly used aggregate functions include avg, max, min, sum and count
from pyspark.sql import functions as F
# note: 'sex' is a string column, so its average comes out as NULL
df.agg(F.min(df['age']),F.max(df['age']),F.avg(df['sex'])).show()

+--------+--------+--------+
|min(age)|max(age)|avg(sex)|
+--------+--------+--------+
|    28.0|    77.0|    NULL|
+--------+--------+--------+



In [33]:
df.groupby('HeartDisease').agg(F.min(df['age']),F.avg(df['sex'])).show()

+------------+--------+--------+
|HeartDisease|min(age)|avg(sex)|
+------------+--------+--------+
|           0|    28.0|    NULL|
|           1|    31.0|    NULL|
+------------+--------+--------+



In [34]:
# run an SQL query on the data
df.createOrReplaceTempView("df") # tell PySpark what the table will be called in the SQL query
spark.sql("""SELECT sex from df""").show(2)

# we can also choose columns using SQL syntax, with a command that combines '.select()' and '.sql()'
df.selectExpr("age >= 40 as older", "age").show(2)

+---+
|sex|
+---+
|  M|
|  F|
+---+
only showing top 2 rows

+-----+----+
|older| age|
+-----+----+
| true|40.0|
| true|49.0|
+-----+----+
only showing top 2 rows



In [35]:
df.groupby('age').pivot('sex', ("M", "F")).count().show(3)

+----+---+---+
| age|  M|  F|
+----+---+---+
|64.0| 16|  6|
|47.0| 15|  4|
|58.0| 35|  7|
+----+---+---+
only showing top 3 rows



In [36]:
# pivot - expensive operation
df.selectExpr("age >= 40 as older", "age",'sex').groupBy("sex")\
                    .pivot("older", ("true", "false")).count().show()

+---+----+-----+
|sex|true|false|
+---+----+-----+
|  F| 174|   19|
|  M| 664|   61|
+---+----+-----+



In [37]:
df.select(['age','MaxHR','Cholesterol']).show(4)

+----+-----+-----------+
| age|MaxHR|Cholesterol|
+----+-----+-----------+
|40.0|  172|        289|
|49.0|  156|        180|
|37.0|   98|        283|
|48.0|  108|        214|
+----+-----+-----------+
only showing top 4 rows



## Machine Learning Application

In [52]:
df = spark.read.option('header','true').csv('heart.csv')

In [53]:
df.printSchema()

root
 |-- Age: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- ChestPainType: string (nullable = true)
 |-- RestingBP: string (nullable = true)
 |-- Cholesterol: string (nullable = true)
 |-- FastingBS: string (nullable = true)
 |-- RestingECG: string (nullable = true)
 |-- MaxHR: string (nullable = true)
 |-- ExerciseAngina: string (nullable = true)
 |-- Oldpeak: string (nullable = true)
 |-- ST_Slope: string (nullable = true)
 |-- HeartDisease: string (nullable = true)



In [54]:
df.show()

+---+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
|Age|Sex|ChestPainType|RestingBP|Cholesterol|FastingBS|RestingECG|MaxHR|ExerciseAngina|Oldpeak|ST_Slope|HeartDisease|
+---+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
| 40|  M|          ATA|      140|        289|        0|    Normal|  172|             N|      0|      Up|           0|
| 49|  F|          NAP|      160|        180|        0|    Normal|  156|             N|      1|    Flat|           1|
| 37|  M|          ATA|      130|        283|        0|        ST|   98|             N|      0|      Up|           0|
| 48|  F|          ASY|      138|        214|        0|    Normal|  108|             Y|    1.5|    Flat|           1|
| 54|  M|          NAP|      150|        195|        0|    Normal|  122|             N|      0|      Up|           0|
| 39|  M|          NAP|      120|        339|        0| 

In [55]:
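from pyspark.sql.types import FloatType
df = df.withColumn("Age",df.Age.cast(FloatType()))
# caution: the next two lines cast df.Age, so Cholesterol and MaxHR become copies of Age
# (hence the perfect regression fit later on); cast df.Cholesterol and df.MaxHR to use the real values
df = df.withColumn('Cholesterol',df.Age.cast(FloatType()))
df = df.withColumn('MaxHR',df.Age.cast(FloatType()))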

In [56]:
# divide dataset into training features and target
X_column_names = ['Age','Cholesterol']
target_column_name = ['MaxHR']

In [57]:
# combine the feature columns into a single column whose values are feature vectors
from pyspark.ml.feature import VectorAssembler
v_asmblr = VectorAssembler(inputCols=X_column_names, outputCol='Fvec')

In [58]:
df = v_asmblr.transform(df)

In [60]:
df.show()

+----+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+-----------+
| Age|Sex|ChestPainType|RestingBP|Cholesterol|FastingBS|RestingECG|MaxHR|ExerciseAngina|Oldpeak|ST_Slope|HeartDisease|       Fvec|
+----+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+-----------+
|40.0|  M|          ATA|      140|       40.0|        0|    Normal| 40.0|             N|      0|      Up|           0|[40.0,40.0]|
|49.0|  F|          NAP|      160|       49.0|        0|    Normal| 49.0|             N|      1|    Flat|           1|[49.0,49.0]|
|37.0|  M|          ATA|      130|       37.0|        0|        ST| 37.0|             N|      0|      Up|           0|[37.0,37.0]|
|48.0|  F|          ASY|      138|       48.0|        0|    Normal| 48.0|             Y|    1.5|    Flat|           1|[48.0,48.0]|
|54.0|  M|          NAP|      150|       54.0|        0|    Normal| 54.0|          

In [61]:
X = df.select(['Age','Cholesterol','Fvec','MaxHR'])
X.show(3)

+----+-----------+-----------+-----+
| Age|Cholesterol|       Fvec|MaxHR|
+----+-----------+-----------+-----+
|40.0|       40.0|[40.0,40.0]| 40.0|
|49.0|       49.0|[49.0,49.0]| 49.0|
|37.0|       37.0|[37.0,37.0]| 37.0|
+----+-----------+-----------+-----+
only showing top 3 rows



In [62]:
# divide dataset into training and testing sets
trainset, testset = X.randomSplit([0.8,0.2])

In [63]:
# predict 'MaxHR' using linear regression
from pyspark.ml.regression import LinearRegression
model = LinearRegression(featuresCol='Fvec', labelCol='MaxHR')
model = model.fit(trainset)
print(model.coefficients)
print(model.intercept)

[0.49999999999999845,0.49999999999999845]
1.6597346967072897e-13


In [64]:
# evaluate model
model.evaluate(testset).predictions.show(3)

+----+-----------+-----------+-----+------------------+
| Age|Cholesterol|       Fvec|MaxHR|        prediction|
+----+-----------+-----------+-----+------------------+
|28.0|       28.0|[28.0,28.0]| 28.0| 28.00000000000008|
|32.0|       32.0|[32.0,32.0]| 32.0|32.000000000000064|
|32.0|       32.0|[32.0,32.0]| 32.0|32.000000000000064|
+----+-----------+-----------+-----+------------------+
only showing top 3 rows



In [65]:
# handle categorical features by indexing: each label is mapped to a number (the most frequent label gets 0.0)
from pyspark.ml.feature import StringIndexer
indxr = StringIndexer(inputCol='ChestPainType', outputCol='ChestPainTypeInxed')
indxr.fit(df).transform(df).select('ChestPainTypeInxed').show(3)

+------------------+
|ChestPainTypeInxed|
+------------------+
|               2.0|
|               1.0|
|               2.0|
+------------------+
only showing top 3 rows

