# Getting Started

1. Create a SparkSession that connects to Spark in local mode. Configure the SparkSession to use two cores.
1. Using the example from the lesson, create a Spark DataFrame that contains your favorite programming languages. The column should be named `language`.
1. Print the schema of the DataFrame.
1. View the DataFrame.
1. Count the number of records using `.count()`.

In [1]:
import multiprocessing
import pyspark
import pandas as pd

In [2]:
# 'local[2]' runs Spark locally with two worker threads (plain 'local' uses one).
spark = pyspark.sql.SparkSession.builder.master('local[2]').\
config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.16').\
config('spark.driver.memory', '4G').config('spark.driver.cores', 2).\
config('spark.sql.shuffle.partitions', 2).appName('MySparkApplication').getOrCreate()

In [3]:
# df = spark.createDataFrame([('Python', ), ('Swift', )], schema=['language'])
# df.show()

In [4]:
# df.printSchema()

In [5]:
# df.count()

In [6]:
import numpy as np

pandas_dataframe = pd.DataFrame(dict(n=np.arange(100), group=np.random.choice(list('abc'), 100)))

In [9]:
df = spark.createDataFrame(pandas_dataframe)

In [10]:
df.createOrReplaceTempView('numbers')

In [16]:
df.show(5)

+---+-----+
|  n|group|
+---+-----+
|  0|    a|
|  1|    c|
|  2|    c|
|  3|    c|
|  4|    b|
+---+-----+
only showing top 5 rows



##### Grouping, then averaging with a SQL expression string

In [11]:
from pyspark.sql.functions import expr, avg

df.groupBy('group').agg(expr('avg(n)')).show()

+-----+-----------------+
|group|           avg(n)|
+-----+-----------------+
|    a|45.34285714285714|
|    c|49.94871794871795|
|    b|54.42307692307692|
+-----+-----------------+



##### Grouping, then averaging with the avg column function

In [13]:
df.groupBy('group').agg(avg(df.n)).show()

+-----+-----------------+
|group|           avg(n)|
+-----+-----------------+
|    a|45.34285714285714|
|    c|49.94871794871795|
|    b|54.42307692307692|
+-----+-----------------+

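Both aggregations above return the same per-group averages. As a rough cross-check, the same computation can be sketched in pandas. The frame is rebuilt here with a fixed seed, which is an assumption added for repeatability (the notebook itself does not seed), so the group labels and averages will not match the Spark output shown above.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like pandas_dataframe from the notebook.
# The seeded generator is an assumption, added only so reruns agree.
rng = np.random.default_rng(0)
pandas_dataframe = pd.DataFrame(
    dict(n=np.arange(100), group=rng.choice(list("abc"), 100))
)

# pandas counterpart of df.groupBy('group').agg(avg(df.n))
group_means = pandas_dataframe.groupby("group")["n"].mean()

# Per-group row counts, useful for sanity-checking the means
group_counts = pandas_dataframe.groupby("group")["n"].count()

print(group_means)
print(group_counts)
```

One way to sanity-check the result: each group's mean multiplied by its row count, summed over groups, should recover the total of `n` (4950 for the values 0 through 99).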


##### Running a SQL query against the temporary view

In [23]:
spark.sql('''
SELECT group as letter, count(n) as number_of_occurrences
FROM numbers
GROUP BY group
''').show(5)

+------+---------------------+
|letter|number_of_occurrences|
+------+---------------------+
|     a|                   35|
|     c|                   39|
|     b|                   26|
+------+---------------------+



##### Selecting columns with a SQL-style expression using expr

In [29]:
df2 = df.select('n', 'group', expr('n + 1 as incremented'))

In [30]:
df2.show(5)

+---+-----+-----------+
|  n|group|incremented|
+---+-----+-----------+
|  0|    a|          1|
|  1|    c|          2|
|  2|    c|          3|
|  3|    c|          4|
|  4|    b|          5|
+---+-----+-----------+
only showing top 5 rows



##### Selecting columns in a style closer to pandas

In [33]:
df.select(df.n, df.group, (df.n + 1).alias('incremented')).show(5)

+---+-----+-----------+
|  n|group|incremented|
+---+-----+-----------+
|  0|    a|          1|
|  1|    c|          2|
|  2|    c|          3|
|  3|    c|          4|
|  4|    b|          5|
+---+-----+-----------+
only showing top 5 rows

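For comparison, here is a minimal sketch of how the same derived column might be written in pandas itself. The five-row frame is a hand-built stand-in (an assumption) that copies the first five rows shown above; it is not the notebook's full `pandas_dataframe`.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same column names as the notebook's data.
# The group values are typed by hand (an assumption) to mirror the rows shown.
pandas_dataframe = pd.DataFrame(dict(n=np.arange(5), group=list("acccb")))

# assign() returns a new frame with the extra column and leaves the original
# unchanged, much like select(..., (df.n + 1).alias('incremented')) in Spark.
incremented = pandas_dataframe.assign(incremented=pandas_dataframe["n"] + 1)

print(incremented)
```

As with the Spark `select`, the original frame keeps only its `n` and `group` columns; the new column exists only on the returned frame.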
