# Introduction to Spark and Notebooks - Python
This notebook is a live tutorial that introduces notebooks and some basic Spark operations.

The goal of this tutorial is to get productive with notebooks as quickly as possible.

## What is a Notebook?
A **notebook** is a web application that enables the creation of documents that include executable code.<br/>
This is why a notebook includes a kernel that defines the programming language and Spark engine used. 

It starts with the concept of a **cell**: A Cell can either be formatted text (Markdown), executable code (Code), or Raw which can contain anything.<br/>
`Markdown` and `Code` are the basic cell elements we will be working with throughout this lab.

You can see menus and buttons at the top of the screen. The main menus are:<br/>
`File`, `Edit`, `View`, `Insert`, `Cell`, `Kernel`, and `Help`

Under the menus, we see a list of buttons. You can hover the mouse on top of them to get an idea of what they are used for.

You should take some time to look at the menu options. We'll mention some of them as we go through this tutorial.

## Executing code
It's time to execute our first cell. The executable cell below is preceded by __`In [ ]:`__

* Click on the cell below to make sure it is selected
* Find the __`run cell`__ button in the buttons above
* Click the run cell button

You should see the __`In [ ]`__ become __`In [*]`__ and finally __`In [1]`__.

When you see __`In [*]`__, it means that the code is executing.<br/>
Once it becomes a number, the execution is completed. The number is the cell's position in the sequence of executions.<br/>
In our case, it is the first cell we executed so it is __`[1]`__

Some cells may not return a result. The execution sequence will indicate that it is done executing.

The cell below gets a Spark session and displays the Spark version.<br/>
A Spark session is the basic interface with the Spark kernel.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

spark.version

## Reading a file
One fundamental operation is to access existing data. Spark has the ability to read from many sources including files and databases.<br/>
When reading from files, it can accommodate files that are compressed or not, in csv, json, and parquet format, and more.

Spark uses lazy evaluation. This means that if a result does not need to be computed yet, Spark simply notes what needs to be done and waits to execute it until it is really needed. In this case, the result is a DataFrame that is a reference to what needs to be done to access the data.

In the following example we read a compressed file that is in json format. The `count` method forces the evaluation and displays the record count.

In [None]:
df = spark.read.format('json').load('../datasets/world_bank.json.gz')
df.count()

## RDDs or DataFrames?
These two elements are representations of data. You may have heard of RDDs: Resilient Distributed Datasets. RDDs are the foundation of data manipulation in Spark. Over time, DataFrames were added to provide more functionality and optimization opportunities.

In general, you'll want to stay with the manipulation of DataFrames. DataFrames provide access to the underlying RDDs, so if necessary you can still go to that level of manipulation.

## DataFrame schema
We can look at the data definition to get a better idea of what we want to do with it.

Execute the following cell to see the dataframe schema.

In [None]:
df.printSchema()

## Code completion
It is impossible to remember all the methods available for a specific class.<br/>
Let's take the DataFrame we created earlier. How can we know there is a `printSchema` method?

We can keep a browser window open on the documentation and search for what we want, or we can use the `tab` key.

Try it in the cell below with the following. Type __`df.p`__ and then __press the tab key__.<br/>
You should see a selection drop down menu appear where you can select the one you want.

How can you find out the names of the attributes that are part of the DataFrame? Hint, it starts with "c".<br/>
Try it and find out.

In [None]:
# df.p
# tab completion should list df.printSchema

## Manipulating DataFrames
We already saw the use of the __`count()`__ method on DataFrames. There are many methods available. In this section we explore some of these methods. For further information, you need to refer to the <a href="https://spark.apache.org/docs/latest/" target="_blank">Spark documentation</a>.

The following methods are used to extract rows (records) from a DataFrame:
* __collect()__ - Return the entire content of a DataFrame
* __first()__   - Return the first row of a DataFrame
* __head(n)__   - Return the first n rows of a DataFrame
* __take(n)__   - Return the first n rows of a DataFrame

The __`collect()`__ method can be dangerous since it returns the entire content of a DataFrame, which could represent many millions of records. It can be used in conjunction with the __`limit()`__ method to reduce the number of records returned, making __`limit(n).collect()`__ equivalent to __`head(n)`__.

In [None]:
# Limit the dataframe to two rows then collect them
df.limit(2).collect()

In [None]:
# return the first two rows
df.head(2)

## Data exploration
An important aspect of data science is the exploration of the data to achieve a better understanding. Several of these methods can be used together to refine the exploration:

* __`describe(cols)`__ - provide basic statistics on the provided columns
* __`filter(conditions)`__ - select specific rows based on column values (`where()` is an alias)

The next few cells show some examples of what could be done.

In [None]:
# Distribution of values for grantamt
df.describe(['grantamt']).show()

In [None]:
# Distribution of values for grantamt in Africa
df.filter("countryname = 'Africa'").describe(['grantamt']).show()

In [None]:
# How many grantamt are zero
df.filter("grantamt = 0").count()

In [None]:
# Distribution of grant amounts that are not zero
df.filter("grantamt != 0").describe(['grantamt']).show()

## Splitting data and sampling
One important operation in data science is to work with samples to speed up our testing and to split a dataset into training, testing and validation sets. These operations are directly available using:

* __`randomSplit()`__
* __`sample()`__
* __`sampleBy()`__

Check out the following examples:

In [None]:
# Create the three DataFrames in one operation with split at 70%, 15%, 15% approximately
(training_df, testing_df, valid_df) = df.randomSplit([70.0, 15.0, 15.0], 41)
print ("Training set count:   " + str(training_df.count()) )
print ("Testing set count:    " + str(testing_df.count()) )
print ("Validation set count: " + str(valid_df.count()) )

In [None]:
# Take a random sample (without replacement) of around 20% of the data
random_df = df.sample(False, 0.2)
random_df.count()

## Writing data out
For completeness, we need to quickly cover writing data out.

A `DataFrame` is a distributed data set. It is expected to live on multiple nodes in the cluster at the same time.<br/>
For this reason, the default write method is expected to save it in a distributed manner if we are to save it as a file.<br/>
This means the `write` attribute returns a `DataFrameWriter` that expects a distributed file system, where a file is saved as a directory that contains multiple parts of the file. Of course, if a DataFrame consists of only one partition, the directory will contain only one part.

It is also possible to save the data into a database using a `jdbc` interface.

## Executing SQL queries
DataFrames provide an SQL-like interface to the data. It is also possible to define a reference to a view so we can execute SQL statements directly on the data. What is returned is a DataFrame.

In this example, we simply display the result to the output.

In [None]:
df.createOrReplaceTempView("world_bank")
spark.sql("""
SELECT id, borrower
FROM world_bank
LIMIT 5
""").show()

## Displaying results using Pandas
The Python Pandas library is included in the environment. It can be used to provide a nicer format to the output.<br/>
To enable this functionality, we simply import the library.

In [None]:
import pandas as pd

spark.sql("""
SELECT  regionname, count(*) as project_count
FROM     world_bank
GROUP BY regionname 
ORDER BY count(*) DESC
""").toPandas()

## More on Pandas
Pandas is a Python data analysis toolkit.

When we convert a Spark DataFrame to a Pandas DataFrame, we basically read the DataFrame into our local memory. For this reason, a Spark DataFrame should be converted to a Pandas DataFrame only when it is small, like in the case of an aggregate operation like the one we just saw.

The Pandas toolkit includes a lot of functionality and is beyond the scope of this introduction, but one part of it should be known right away: the ability to plot the content of a DataFrame with the help of matplotlib.


In [None]:
pddf = spark.sql("""
SELECT  regionname, count(*) as project_count
FROM     world_bank
GROUP BY regionname 
ORDER BY count(*) DESC
""").toPandas()

In [None]:
# Show the previous result as a pie chart showing percentages of total.
import matplotlib.pyplot as plt
%matplotlib inline

plt.close()
plt.figure(figsize=(16,8))
ax1 = plt.subplot(121, aspect='equal')

pddf.plot(kind='pie', y='project_count', ax = ax1, autopct='%1.1f%%', 
 startangle=90, shadow=False, labels=pddf['regionname'], legend = False, fontsize=14)
plt.show()

## Saving a notebook
Watson Studio will automatically save your notebook from time to time.<br/>
If you have a lot of changes, you may want to explicitly save your notebook. For this, you can either use the save icon (which looks like a floppy disk) under the __`File`__ menu at the top of the screen, or you can click on __`File`__ and select __"Save and Checkpoint"__.

You can also download your saved notebook to your local machine using the __"Download as"__ option in the __`File`__ menu.

## Congratulations!
You've completed the introductory tutorial.<br/>
Explore the other tutorials for more in-depth knowledge.