## Introduction to PySpark

Every PySpark program that you write will have three core parts:

1. importing the pyspark package (or modules)

2. starting your Spark Session

3. defining a set of transformations and actions over Spark DataFrames


> PySpark is organized into a number of modules, such as sql (to access Spark SQL), pandas (to access the pandas API on Spark), and ml (to access Spark MLlib). Going further, a module can have sub-modules (modules inside a module) too. As an example, the sql module of pyspark has the functions and window sub-modules.



In [1]:
import pyspark
import pandas as pd

In [2]:
# How we can import modules

from pyspark.sql.functions import sum, col  # importing specific functions (note: this sum shadows Python's built-in sum)
import pyspark.sql.functions as F  # importing the entire module under an alias

### Starting your Spark Session

Every Spark application starts with a Spark Session, which is the entry point to your application. This means that every pyspark program you write should start by defining its Spark Session. We do this by calling the getOrCreate() method on the pyspark.sql.SparkSession.builder object.

Store the result of this method in a Python variable. It is very common to name this variable spark, as in the example below. You can then access all of Spark's information and methods through this spark object.

In [3]:
from pyspark.sql import SparkSession

In [12]:
spark = SparkSession.builder.appName("prueba_spark").getOrCreate()

In [13]:
df5 = spark.range(5)  # create a Spark DataFrame with ids 0 to 4 and store it in df5
print(dir(df5))  # list all attributes and methods of the DataFrame

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__firstlineno__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__static_attributes__', '__str__', '__subclasshook__', '__weakref__', '_collect_as_arrow', '_ipython_key_completions_', '_jcols', '_jdf', '_jmap', '_joinAsOf', '_jseq', '_lazy_rdd', '_repr_html_', '_sc', '_schema', '_session', '_show_string', '_sort_cols', '_sql_ctx', '_support_repr_html', 'agg', 'alias', 'approxQuantile', 'cache', 'checkpoint', 'coalesce', 'colRegex', 'collect', 'columns', 'corr', 'count', 'cov', 'createGlobalTempView', 'createOrReplaceGlobalTempView', 'createOrReplaceTempView', 'createTempView', 'crossJoin', 'crosstab', 'cube', 'describe', 'distinct', 'drop', 'dropDuplicates', 'dropDuplicatesWithinWatermark',

In [14]:
df5.show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+



In [15]:
# Building a Spark DataFrame

from datetime import date
from pyspark.sql import Row

In [16]:
# Creating a list of rows

data = [ 
    Row(id = 1, value = 28.3, date = date(2021,1,1)), # Row() to create rows
    Row(id = 2, value = 28.3, date = date(2021,1,2)),
    Row(id = 3, value = 28.3, date = date(2021,1,3)),
    Row(id = 4, value = 28.3, date = date(2021,1,4))
]

# Transforming to Spark DataFrame

df = spark.createDataFrame(data) 

In [17]:
type(df)

pyspark.sql.dataframe.DataFrame

In [None]:
# Displaying the rows of the DataFrame
df.show()

In [24]:
print(df.count())  # count() is a DataFrame action; df[0] is a Column, which has no count() method

4

In [20]:
data = [
  (12114, 'Anne', 21, 1.56, 8, 9, 10, 9, 'Economics', 'SC'),
  (13007, 'Adrian', 23, 1.82, 6, 6, 8, 7, 'Economics', 'SC'),
  (10045, 'George', 29, 1.77, 10, 9, 10, 7, 'Law', 'SC'),
  (12459, 'Adeline', 26, 1.61, 8, 6, 7, 7, 'Law', 'SC'),
  (10190, 'Mayla', 22, 1.67, 7, 7, 7, 9, 'Design', 'AR'),
  (11552, 'Daniel', 24, 1.75, 9, 9, 10, 9, 'Design', 'AR')
]

columns = [
  'StudentID', 'Name', 'Age', 'Height', 'Score1',
  'Score2', 'Score3', 'Score4', 'Course', 'Department'
]

students = spark.createDataFrame(data, columns)
students

DataFrame[StudentID: bigint, Name: string, Age: bigint, Height: double, Score1: bigint, Score2: bigint, Score3: bigint, Score4: bigint, Course: string, Department: string]

A key aspect of Spark is its laziness. In other words, for most operations, Spark only checks that your code is valid and records what should be done. Spark will not actually execute the operations you describe in your code unless you explicitly ask for it with a **trigger operation**, which is called an **action**.

In [21]:
# Evaluating the DataFrame name on its own prints only a summary of its structure (column names and types), not the data itself.
students

DataFrame[StudentID: bigint, Name: string, Age: bigint, Height: double, Score1: bigint, Score2: bigint, Score3: bigint, Score4: bigint, Course: string, Department: string]

In [23]:
# show() is a DataFrame method; students[0] is a Column, so select it into a DataFrame first
students.select(students[0]).show(5)