# Introduction to PySpark
>  Spark is a framework for working with Big Data. In this chapter you'll cover some background about Spark and Machine Learning. You'll then find out how to connect to Spark using Python and load CSV data.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp]
- image: images/datacamp/___

> Note: This is a summary of the chapter 1 exercises from the course "Machine Learning with PySpark" on DataCamp. <br>[GitHub repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [None]:
%%capture
!pip install pyspark
import pyspark
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

## Machine Learning & Spark

### Characteristics of Spark

<div class=""><p>Spark is one of the most popular technologies for processing large quantities of data. Not only can it handle enormous data volumes, it also does so efficiently. And, unlike some other distributed computing technologies, developing with Spark is a pleasure.</p>
<p>Which of these describe Spark?</p></div>

<pre>
Possible Answers
Spark is a framework for cluster computing.
Spark does most processing in memory.
Spark has a high-level API, which conceals a lot of complexity.
<b>All of the above.</b>
</pre>

### Components in a Spark Cluster

<div class=""><p>Spark is a distributed computing platform. It achieves efficiency by distributing data and computation across a cluster of computers.</p>
<p>A Spark cluster consists of a number of hardware and software components which work together.</p>
<p>Which of these is not part of a Spark cluster?</p></div>

<pre>
Possible Answers
One or more nodes
A cluster manager
<b>A load balancer</b>
Executors
</pre>

## Connecting to Spark

### Location of Spark master

<p>Which of the following is <strong>not</strong> a valid way to specify the location of a Spark cluster?</p>

<pre>
Possible Answers
spark://13.59.151.161:7077
spark://ec2-18-188-22-23.us-east-2.compute.amazonaws.com:7077
<b>spark://18.188.22.23</b>
local
local[4]
local[*]
</pre>

**A Spark URL must always include a port number, so this URL is not valid.**
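The port rule can be checked mechanically. The sketch below is not part of Spark: `is_valid_master` is a hypothetical helper that assumes only the forms listed in this quiz (`local`, `local[N]`, `local[*]` and `spark://host:port`) and ignores other cluster managers.

```python
import re
from urllib.parse import urlsplit

# Matches 'local', 'local[4]' and 'local[*]' (assumed forms, taken from the quiz)
_LOCAL_PATTERN = re.compile(r"local(\[(\d+|\*)\])?")


def is_valid_master(master: str) -> bool:
    """Return True if `master` looks like a usable Spark master location."""
    if _LOCAL_PATTERN.fullmatch(master):
        return True
    parts = urlsplit(master)
    if parts.scheme != "spark" or not parts.hostname:
        return False
    try:
        # A remote master must name a port explicitly
        return parts.port is not None
    except ValueError:
        # Raised by urllib when the port is not a valid integer
        return False


for candidate in [
    "spark://13.59.151.161:7077",
    "spark://18.188.22.23",
    "local",
    "local[4]",
    "local[*]",
]:
    print(f"{candidate!r}: {is_valid_master(candidate)}")
```

Only `spark://18.188.22.23` is reported as invalid, because it names a host without a port.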

### Creating a SparkSession

<div class=""><p>In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a <code>SparkSession</code> object.</p>
<p>The <code>SparkSession</code> class has a <code>builder</code> attribute, which is an instance of the <code>Builder</code> class. The <code>Builder</code> class exposes three important methods that let you:</p>
<ul>
<li>specify the location of the master node;</li>
<li>name the application (optional); and </li>
<li>retrieve an existing <code>SparkSession</code> or, if there is none, create a new one.</li>
</ul>
<p>The <code>SparkSession</code> class has a <code>version</code> attribute which gives the version of Spark.</p>
<p>Find out more about <code>SparkSession</code> <a href="https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession" target="_blank" rel="noopener noreferrer">here</a>.</p>
<p>Once you are finished with the cluster, it's a good idea to shut it down, which will free up its resources, making them available for other processes.</p>
<p><strong>Note:</strong> You might find it useful to review the slides from the lessons in the <em>Slides</em> panel next to the <em>IPython Shell</em>.</p></div>

Instructions
<ul>
<li>Import the <code>SparkSession</code> class from <code>pyspark.sql</code>.</li>
<li>Create a <code>SparkSession</code> object connected to a local cluster. Use all available cores. Name the application <code>'test'</code>.</li>
<li>Use the <code>SparkSession</code> object to retrieve the version of Spark running on the cluster. <strong>Note:</strong> The version might be different to the one that's used in the presentation (it gets updated from time to time).</li>
<li>Shut down the cluster.</li>
</ul>

In [None]:
# Import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# Create SparkSession object
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('test') \
                    .getOrCreate()

# What version of Spark?
# (Might be different to what you saw in the presentation!)
print(spark.version)

# Terminate the cluster
spark.stop()

3.0.2
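Because the session should always be shut down when you are finished, it can help to tie `stop()` to a `with` block. The following is a minimal sketch, not part of the course solution: `managed_session` is a hypothetical helper that accepts any builder-like object exposing `getOrCreate()`.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(builder):
    """Yield a session created from `builder` and always stop it afterwards."""
    spark = builder.getOrCreate()
    try:
        yield spark
    finally:
        # Runs even if the body raises, so the cluster's resources are released
        spark.stop()


# Usage (requires pyspark to be installed):
# from pyspark.sql import SparkSession
# with managed_session(SparkSession.builder.master('local[*]').appName('test')) as spark:
#     print(spark.version)
```

Note that `getOrCreate()` may return a session that already exists, in which case leaving the `with` block stops that shared session too.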


## Loading Data

### Loading flights data

<div class=""><p>In this exercise you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly these data have been trimmed down to only 50 000 records. You can get a larger dataset in the same format <a href="https://assets.datacamp.com/production/repositories/3918/datasets/e1c1a03124fb2199743429e9b7927df18da3eacf/flights-larger.csv" target="_blank" rel="noopener noreferrer">here</a>.</p>
<p>Notes on CSV format:</p>
<ul>
<li>fields are separated by a comma (this is the default separator) and</li>
<li>missing data are denoted by the string 'NA'.</li>
</ul>
<p>Data dictionary:</p>
<ul>
<li><code>mon</code> — month (integer between 1 and 12)</li>
<li><code>dom</code> — day of month (integer between 1 and 31)</li>
<li><code>dow</code> — day of week (integer; 1 = Monday and 7 = Sunday)</li>
<li><code>org</code> — origin airport (<a href="https://en.wikipedia.org/wiki/IATA_airport_code" target="_blank" rel="noopener noreferrer">IATA code</a>)</li>
<li><code>mile</code> — distance (miles)</li>
<li><code>carrier</code> — carrier (<a href="https://en.wikipedia.org/wiki/List_of_airline_codes" target="_blank" rel="noopener noreferrer">IATA code</a>)</li>
<li><code>depart</code> — departure time (decimal hour)</li>
<li><code>duration</code> — expected duration (minutes)</li>
<li><code>delay</code> — delay (minutes)</li>
</ul>
<p><code>pyspark</code> has been imported for you and the session has been initialized.</p>
<p><em>Note:</em> The data have been aggressively down-sampled.</p></div>
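The `depart` column stores the time of day as a decimal hour. As a small illustration, assuming this means whole hours plus a fraction of an hour (so 9.5 is 09:30), a hypothetical helper can turn it into a clock time:

```python
def decimal_hour_to_hhmm(decimal_hour: float) -> str:
    """Convert a decimal hour such as 9.5 into an 'HH:MM' string."""
    # Work in whole minutes to avoid carrying fractional minutes around
    total_minutes = round(decimal_hour * 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


print(decimal_hour_to_hhmm(9.5))    # 09:30
print(decimal_hour_to_hhmm(16.25))  # 16:15
```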

In [None]:
%%capture
!wget https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/22-machine-learning-with-pyspark/datasets/flights.csv
spark = SparkSession.builder.master('local[*]').appName('flights').getOrCreate()

Instructions
<ul>
<li>Read data from a CSV file called 'flights.csv'. Assign data types to columns automatically. Deal with missing data.</li>
<li>How many records are in the data?</li>
<li>Take a look at the first five records.</li>
<li>What data types have been assigned to the columns? Do these look correct?</li>
</ul>

In [None]:
# Read data from CSV file
flights = spark.read.csv('flights.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue='NA')

# Get number of records
print(f"The data contain {flights.count()} records.")

# View the first five records
flights.show(5)

# Check column data types
print(flights.dtypes)

The data contain 50000 records.
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows

[('mon', 'int'), ('dom', 'int'), ('dow', 'int'), ('carrier', 'string'), ('flight', 'int'), ('org', 'string'), ('mile', 'int'), ('depart', 'double'), ('duration', 'int'), ('delay', 'int')]


**The correct data types have been inferred for all of the columns.**
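To see what `nullValue='NA'` and type inference achieve, here is a plain-Python sketch that needs no Spark at all. The two sample lines are reconstructed from the first records shown above, assuming the missing delay appears as `NA` in the file. The column sets and `parse_value` are illustrative stand-ins, not how Spark works internally.

```python
import csv
import io

# Two records in the flights.csv layout; the first has a missing delay
SAMPLE = """mon,dom,dow,carrier,flight,org,mile,depart,duration,delay
11,20,6,US,19,JFK,2153,9.48,351,NA
0,22,2,UA,1107,ORD,316,16.33,82,30
"""

# Hand-written stand-in for the types Spark inferred above
INT_COLUMNS = {"mon", "dom", "dow", "flight", "mile", "duration", "delay"}
FLOAT_COLUMNS = {"depart"}


def parse_value(column, raw):
    """Map the 'NA' marker to None and cast everything else by column."""
    if raw == "NA":
        return None
    if column in INT_COLUMNS:
        return int(raw)
    if column in FLOAT_COLUMNS:
        return float(raw)
    return raw


rows = [
    {column: parse_value(column, raw) for column, raw in record.items()}
    for record in csv.DictReader(io.StringIO(SAMPLE))
]
for row in rows:
    print(row)
```

Without the `NA` check, `int("NA")` would raise a `ValueError`. In Spark, omitting `nullValue='NA'` would most likely cause `delay` to be inferred as a string column instead.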

### Loading SMS spam data

<div class=""><p>You've seen that it's possible to infer data types directly from the data. Sometimes it's convenient to have direct control over the column types. You do this by defining an explicit schema.</p>
<p>The file <code>sms.csv</code> contains a selection of SMS messages which have been classified as either 'spam' or 'ham'. These data have been adapted from the <a href="https://archive.ics.uci.edu/ml/datasets/sms+spam+collection" target="_blank" rel="noopener noreferrer">UCI Machine Learning Repository</a>. There are a total of 5574 SMS messages, of which 747 have been labelled as spam.</p>
<p>Notes on CSV format:</p>
<ul>
<li>no header record and</li>
<li>fields are separated by a semicolon (this is <strong>not</strong> the default separator).</li>
</ul>
<p>Data dictionary:</p>
<ul>
<li><code>id</code> — record identifier</li>
<li><code>text</code> — content of SMS message</li>
<li><code>label</code> — spam or ham (integer; 0 = ham and 1 = spam)</li>
</ul></div>

In [None]:
%%capture
!wget https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/22-machine-learning-with-pyspark/datasets/sms.csv

Instructions
<ul>
<li>Specify the data schema, giving column names (<code>"id"</code>, <code>"text"</code>, and <code>"label"</code>) and column types.</li>
<li>Read data from a delimited file called <code>"sms.csv"</code>.</li>
<li>Print the schema for the resulting DataFrame.</li>
</ul>

In [None]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Load data from a delimited file
sms = spark.read.csv('sms.csv', sep=';', header=False, schema=schema)

# Print schema of DataFrame
sms.printSchema()

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)
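As an alternative to building a `StructType`, `spark.read.csv` also accepts the schema as a DDL-formatted string. The sketch below is not part of the course solution: `load_sms` and `SMS_SCHEMA_DDL` are hypothetical names, and an active `SparkSession` is assumed to be passed in.

```python
# Same three columns as the StructType above, written as a DDL string
SMS_SCHEMA_DDL = "id INT, text STRING, label INT"


def load_sms(spark, path="sms.csv"):
    """Read the semicolon-delimited, header-less SMS file with a DDL schema."""
    return spark.read.csv(path, sep=";", header=False, schema=SMS_SCHEMA_DDL)


# Usage, given an active SparkSession named `spark`:
# sms = load_sms(spark)
# sms.printSchema()
```

The DDL form is shorter for flat schemas like this one. A `StructType` may still be the clearer choice when fields are nested or need explicit nullability.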



**You now know how to initiate a Spark session and load data. In the next chapter you'll use the data you've just loaded to build a classification model.**