## PySpark Notes

1. [Read in data](#Read_in_data)
2. [Get the schema of the dataframe](#get_schema)
3. [Get column names](#get_cols)
4. [Select columns](#select_cols)
5. [Describe the dataset](#describe_data)
6. [Add columns](#add_cols)
7. [Drop columns](#drop_cols)
8. [Rename columns](#rename_cols)

In [1]:
import pyspark
from pyspark.sql import SparkSession

In [2]:
# Create session
spark = SparkSession.builder.appName("Spark Intro").getOrCreate()

#### Read in data
<a id='Read_in_data'></a>

In [3]:
# inferSchema=True makes Spark infer column types, so Total Profit is read as a double instead of a string
df = spark.read.csv('sales.csv', header=True, inferSchema=True)
df.show(n=10)

+----------+------------+----------------+---------------+
|Order Date|Total Profit|         Country|      Item Type|
+----------+------------+----------------+---------------+
|2012-07-27|     3839.13|    South Africa|         Fruits|
|2013-09-14|   338631.84|         Morocco|        Clothes|
|2015-05-15|     20592.0|Papua New Guinea|           Meat|
|2017-05-17|    41273.28|        Djibouti|        Clothes|
|2016-10-26|    62217.18|        Slovakia|      Beverages|
|2011-11-07|     3323.39|       Sri Lanka|         Fruits|
|2013-01-18|     9349.02|     Seychelles |      Beverages|
|2016-11-30|    23114.16|        Tanzania|      Beverages|
|2017-03-23|    113120.0|           Ghana|Office Supplies|
|2016-05-23|  1350622.16|        Tanzania|      Cosmetics|
+----------+------------+----------------+---------------+
only showing top 10 rows



#### Get the schema of the dataframe
<a id='get_schema'></a>

In [4]:
df.printSchema()

root
 |-- Order Date: string (nullable = true)
 |-- Total Profit: double (nullable = true)
 |-- Country: string (nullable = true)
 |-- Item Type: string (nullable = true)



In [5]:
# Another option: dtypes returns a list of (column name, type) tuples (similar in spirit to pandas' df.dtypes)
df.dtypes

[('Order Date', 'string'),
 ('Total Profit', 'double'),
 ('Country', 'string'),
 ('Item Type', 'string')]

#### Get column names
<a id='get_cols'></a>

In [6]:
df.columns

['Order Date', 'Total Profit', 'Country', 'Item Type']

#### Select columns
<a id='select_cols'></a>

Select the `Order Date` and `Country` columns

In [7]:
df.select('Order Date', 'Country').show(n=10)

+----------+----------------+
|Order Date|         Country|
+----------+----------------+
|2012-07-27|    South Africa|
|2013-09-14|         Morocco|
|2015-05-15|Papua New Guinea|
|2017-05-17|        Djibouti|
|2016-10-26|        Slovakia|
|2011-11-07|       Sri Lanka|
|2013-01-18|     Seychelles |
|2016-11-30|        Tanzania|
|2017-03-23|           Ghana|
|2016-05-23|        Tanzania|
+----------+----------------+
only showing top 10 rows



Select the `Total Profit` and `Item Type` columns

In [8]:
# Passing in a list is fine too
df.select(['Total Profit', 'Item Type']).show(n=10)

+------------+---------------+
|Total Profit|      Item Type|
+------------+---------------+
|     3839.13|         Fruits|
|   338631.84|        Clothes|
|     20592.0|           Meat|
|    41273.28|        Clothes|
|    62217.18|      Beverages|
|     3323.39|         Fruits|
|     9349.02|      Beverages|
|    23114.16|      Beverages|
|    113120.0|Office Supplies|
|  1350622.16|      Cosmetics|
+------------+---------------+
only showing top 10 rows



#### Describe the dataset
<a id='describe_data'></a>

In [9]:
df.describe().show()

+-------+----------+------------------+-----------+----------+
|summary|Order Date|      Total Profit|    Country| Item Type|
+-------+----------+------------------+-----------+----------+
|  count|    500000|            500000|     500000|    500000|
|   mean|      null| 392479.9645884398|       null|      null|
| stddev|      null|378751.68881151074|       null|      null|
|    min|2010-01-01|              2.41|Afghanistan| Baby Food|
|    max|2017-07-28|         1738700.0|   Zimbabwe|Vegetables|
+-------+----------+------------------+-----------+----------+



#### Add columns
<a id='add_cols'></a>
Add a `cost` column with arbitrary values (here `Total Profit` - 3000, just to see how it works)

In [10]:
# DataFrames are immutable: withColumn returns a new DataFrame rather than modifying df in place
df = df.withColumn(colName='cost', col=df['Total Profit']-3000)

#### Drop columns
<a id='drop_cols'></a>
Drop the `cost` and `Item Type` columns

In [11]:
# drop also returns a new DataFrame, so there is no pandas-style inplace=True
df = df.drop('cost', 'Item Type')
df.show(n=10)

+----------+------------+----------------+
|Order Date|Total Profit|         Country|
+----------+------------+----------------+
|2012-07-27|     3839.13|    South Africa|
|2013-09-14|   338631.84|         Morocco|
|2015-05-15|     20592.0|Papua New Guinea|
|2017-05-17|    41273.28|        Djibouti|
|2016-10-26|    62217.18|        Slovakia|
|2011-11-07|     3323.39|       Sri Lanka|
|2013-01-18|     9349.02|     Seychelles |
|2016-11-30|    23114.16|        Tanzania|
|2017-03-23|    113120.0|           Ghana|
|2016-05-23|  1350622.16|        Tanzania|
+----------+------------+----------------+
only showing top 10 rows



#### Rename columns
<a id='rename_cols'></a>

In [12]:
df = df.withColumnRenamed(existing='Order Date', new='Date')
df.show(n=10)

+----------+------------+----------------+
|      Date|Total Profit|         Country|
+----------+------------+----------------+
|2012-07-27|     3839.13|    South Africa|
|2013-09-14|   338631.84|         Morocco|
|2015-05-15|     20592.0|Papua New Guinea|
|2017-05-17|    41273.28|        Djibouti|
|2016-10-26|    62217.18|        Slovakia|
|2011-11-07|     3323.39|       Sri Lanka|
|2013-01-18|     9349.02|     Seychelles |
|2016-11-30|    23114.16|        Tanzania|
|2017-03-23|    113120.0|           Ghana|
|2016-05-23|  1350622.16|        Tanzania|
+----------+------------+----------------+
only showing top 10 rows



In [13]:
# Rename several columns at once; toDF needs a name for every column, in order
df = df.toDF('Order Date', 'Total Profit', 'Order Country')
df.show(n=10)

+----------+------------+----------------+
|Order Date|Total Profit|   Order Country|
+----------+------------+----------------+
|2012-07-27|     3839.13|    South Africa|
|2013-09-14|   338631.84|         Morocco|
|2015-05-15|     20592.0|Papua New Guinea|
|2017-05-17|    41273.28|        Djibouti|
|2016-10-26|    62217.18|        Slovakia|
|2011-11-07|     3323.39|       Sri Lanka|
|2013-01-18|     9349.02|     Seychelles |
|2016-11-30|    23114.16|        Tanzania|
|2017-03-23|    113120.0|           Ghana|
|2016-05-23|  1350622.16|        Tanzania|
+----------+------------+----------------+
only showing top 10 rows

