#### Data Problems:
- Missing fields
- Bizarre formatting
- Orders of magnitude more data

#### Possible tasks in data cleaning:
- Reformatting or replacing text
- Performing calculations
- Removing garbage or incomplete data

#### Spark Schemas 
- Define the format of a DataFrame
- Various data types:
    - strings
    - dates
    - integers
    - arrays
    
- Can filter garbage data during import 
- Improves read performance

#### Note: The primary limit to Spark's abilities is the amount of RAM available in the Spark cluster.

In [1]:
import os
import sys

os.environ['SPARK_HOME'] = "C:/Spark"
sys.path.append("C:/Spark/spark-3.1.2-bin-hadoop3.2/python/")



In [2]:
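# NOTE: this only aliases the SparkContext class as `sc`; it does not create a context
from pyspark import SparkContext as sc

# Prints the class itself rather than a running context
print(sc)

# On the class, `version` is a property object; SparkContext.getOrCreate().version returns the version string
print(sc.version)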

<class 'pyspark.context.SparkContext'>
<property object at 0x000000306C18C368>


In [3]:
# Show the SPARK_HOME value set in the first cell
print(os.environ.get("SPARK_HOME"))

print(os.path.join(os.environ.get("SPARK_HOME"), './bin/spark-submit'))

# Point the environment at the local Spark, Java, and Hadoop installs
os.environ['SPARK_HOME'] = "C:/Spark/spark-3.1.2-bin-hadoop3.2"
os.environ['JAVA_HOME'] = "C:/Program Files/Java/jdk1.8.0_144"
sys.path.append("C:/Spark/spark-3.1.2-bin-hadoop3.2/python")
os.environ['HADOOP_HOME'] = "C:/Hadoop"

from pyspark import SparkContext
from pyspark import SparkConf

import pyspark
from pyspark.sql.functions import to_date, col
from pyspark.sql import SparkSession

# Create the SparkSession, or reuse one that already exists
spark = SparkSession.builder.getOrCreate()
print(spark)

# Write Parquet files with gzip compression
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

C:/Spark
C:/Spark\./bin/spark-submit
<pyspark.sql.session.SparkSession object at 0x000000306C3F7F48>


In [4]:

processFile = '..\\PySpark'
fileName = os.listdir(processFile)
file_path_total = []

def file_path(fileName):
    """Append the path of every CSV file in fileName to file_path_total and return the list."""
    for file in fileName:
        # Substring match, so compressed files such as .csv.gz are included too
        if ".csv" in file:
            # Use a distinct local name so the function name is not shadowed
            path = "%s/%s" % (processFile, file)
            file_path_total.append(path)
    return file_path_total

### Spark Schema
`StructField(String name, DataType dataType, boolean nullable, Metadata metadata) `

In [5]:
from pyspark.sql.types import *

# Define a new schema as a StructType holding one StructField per column
people_schema = StructType([
  # Each StructField takes a name, a data type, and whether nulls are allowed
  StructField('name', StringType(), False),
  StructField('age', IntegerType(), False),
  StructField('city', StringType(), False)
])
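
To make the "filter garbage data during import" point concrete, here is a minimal plain-Python sketch (standard library only, not the Spark API) of what enforcing a schema like `people_schema` amounts to: each row needs the right number of fields, no empty values (the non-nullable case), and an `age` that parses as an integer. The helper `parse_people`, the `PEOPLE_FIELDS` list, and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical field list mirroring people_schema: (column name, converter)
PEOPLE_FIELDS = [("name", str), ("age", int), ("city", str)]


def parse_people(text):
    """Return (good_rows, bad_rows) for CSV text checked against PEOPLE_FIELDS."""
    good, bad = [], []
    for raw in csv.reader(io.StringIO(text)):
        try:
            # A wrong field count or an empty value violates the non-nullable schema
            if len(raw) != len(PEOPLE_FIELDS) or any(v.strip() == "" for v in raw):
                raise ValueError("missing field")
            # int("forty") raises ValueError, which routes the row to bad
            row = {name: conv(value.strip())
                   for (name, conv), value in zip(PEOPLE_FIELDS, raw)}
            good.append(row)
        except ValueError:
            bad.append(raw)
    return good, bad


sample = "Ann,34,Dallas\nBob,,Austin\nCid,forty,Waco\nDee,51\n"
good_rows, bad_rows = parse_people(sample)
print(good_rows)
print(len(bad_rows))
```

In Spark itself, supplying a schema to the CSV reader together with its `mode` option (for example `DROPMALFORMED`) plays a comparable role; the sketch only illustrates the idea.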

#### Immutability and lazy processing

**Immutability:**
- Immutable variables are:
    - A component of functional programming
    - Defined once
    - Unable to be directly modified
    - Re-created if reassigned
    - Able to be shared efficiently

Spark takes advantage of data immutability to efficiently share / create new data representations throughout the cluster.
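
As a rough analogy for these properties, the following plain-Python sketch uses a frozen dataclass: direct modification is rejected, and a "change" produces a new object while the original stays intact. The `Flight` class and its fields are hypothetical and are not part of Spark.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Flight:
    number: str
    airport: str


original = Flight(number="0005", airport="HNL")

# Direct modification is rejected because the object is frozen
try:
    original.airport = "hnl"
    modified_in_place = True
except FrozenInstanceError:
    modified_in_place = False

# "Changing" a value means building a new object; the original is untouched
lowered = replace(original, airport=original.airport.lower())

print(modified_in_place)
print(original)
print(lowered)
```

This mirrors the pattern `df = df.withColumn(...)` used in this notebook: the name is rebound to a new DataFrame rather than the old one being edited.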

#### Using lazy processing
Lazy processing operations will usually return in about the same amount of time regardless of the actual quantity of data. Remember that this is due to Spark not performing any transformations until an action is requested.

For this exercise, we'll define a DataFrame (`aa_dfw_df`) and add a couple of transformations. Note the time required when the transformations are defined versus when the data is actually queried. The differences may be small, but they will be noticeable. On a full Spark cluster with larger quantities of data, the difference will be more apparent.
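
The define-now, compute-later behaviour can be imitated with a Python generator. This is only an analogy under stated assumptions, not Spark's execution model: the `lower_airport` helper, the `calls` list, and the three sample rows are made up for illustration.

```python
import time

calls = []  # records each row at the moment work is actually performed


def lower_airport(row):
    calls.append(row)
    return {**row, "airport": row["Destination Airport"].lower()}


rows = [{"Destination Airport": code} for code in ("HNL", "OGG", "SFO")]

# "Transformation": building the generator does not call lower_airport yet
start = time.perf_counter()
pipeline = (lower_airport(r) for r in rows)
define_seconds = time.perf_counter() - start
calls_after_define = len(calls)

# "Action": consuming the generator triggers the work
start = time.perf_counter()
result = list(pipeline)
action_seconds = time.perf_counter() - start

print(calls_after_define, len(calls))
print([r["airport"] for r in result])
print(define_seconds, action_seconds)
```

The definition step costs about the same however many rows there are, while the consuming step grows with the data, which is the pattern to look for when timing the Spark cells.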

In [6]:
from pyspark.sql import functions as F
aa_dfw_df = spark.read.format('csv').options(Header=True).load('AA_DFW_2017_Departures_Short.csv.gz')
aa_dfw_df = aa_dfw_df.withColumn('airport', F.lower(aa_dfw_df['Destination Airport']))
aa_dfw_df.show(5)

+-----------------+-------------+-------------------+-----------------------------+-------+
|Date (MM/DD/YYYY)|Flight Number|Destination Airport|Actual elapsed time (Minutes)|airport|
+-----------------+-------------+-------------------+-----------------------------+-------+
|       01/01/2017|         0005|                HNL|                          537|    hnl|
|       01/01/2017|         0007|                OGG|                          498|    ogg|
|       01/01/2017|         0037|                SFO|                          241|    sfo|
|       01/01/2017|         0043|                DTW|                          134|    dtw|
|       01/01/2017|         0051|                STL|                           88|    stl|
+-----------------+-------------+-------------------+-----------------------------+-------+
only showing top 5 rows



In [7]:

# Drop the Destination Airport column
aa_dfw_df = aa_dfw_df.drop(aa_dfw_df['Destination Airport'])
aa_dfw_df.show(5)

+-----------------+-------------+-----------------------------+-------+
|Date (MM/DD/YYYY)|Flight Number|Actual elapsed time (Minutes)|airport|
+-----------------+-------------+-----------------------------+-------+
|       01/01/2017|         0005|                          537|    hnl|
|       01/01/2017|         0007|                          498|    ogg|
|       01/01/2017|         0037|                          241|    sfo|
|       01/01/2017|         0043|                          134|    dtw|
|       01/01/2017|         0051|                           88|    stl|
+-----------------+-------------+-----------------------------+-------+
only showing top 5 rows



#### Understanding Parquet
- Parquet is a columnar data format, which allows Spark to use predicate pushdown.
- Predicate pushdown means Spark processes only the data needed to complete the operations you define, rather than reading the entire dataset.

=> This gives Spark more flexibility in accessing the data and often drastically improves performance on large datasets.
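
A toy model may help picture why a columnar layout with per-group statistics lets a reader skip data. The sketch below is plain Python under loud assumptions: the `row_groups` structure, its min/max `stats`, and the function name are invented for illustration and greatly simplify how real Parquet files are organised.

```python
# Hypothetical "row groups": columns stored separately, plus min/max statistics
row_groups = [
    {"stats": {"minutes": (88, 134)},
     "columns": {"airport": ["stl", "dtw"], "minutes": [88, 134]}},
    {"stats": {"minutes": (241, 537)},
     "columns": {"airport": ["sfo", "ogg", "hnl"], "minutes": [241, 498, 537]}},
]


def airports_with_minutes_over(groups, threshold):
    """Return (matching airports, number of groups actually read)."""
    matches, groups_read = [], 0
    for group in groups:
        _, max_minutes = group["stats"]["minutes"]
        if max_minutes <= threshold:
            continue  # the filter is checked against stats: no row here can match
        groups_read += 1
        cols = group["columns"]  # only the two columns the query needs are touched
        for airport, minutes in zip(cols["airport"], cols["minutes"]):
            if minutes > threshold:
                matches.append(airport)
    return matches, groups_read


matches, groups_read = airports_with_minutes_over(row_groups, 200)
print(matches, groups_read)
```

Here the first group is skipped entirely because its maximum is below the threshold, so only one of the two groups is read.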

In [8]:
from pyspark.sql import functions as F
voter_df_org = spark.read.format('csv').options(Header=True).load('DallasCouncilVotes.csv.gz')

voter_df_org.show(5)

+----------+------------------+---------+--------+-------------+-------------------+---------+------------------+-----------------------+------------------+--------------------+
|      DATE|AGENDA_ITEM_NUMBER|ITEM_TYPE|DISTRICT|        TITLE|         VOTER NAME|VOTE CAST|FINAL ACTION TAKEN|AGENDA ITEM DESCRIPTION|         AGENDA_ID|             VOTE_ID|
+----------+------------------+---------+--------+-------------+-------------------+---------+------------------+-----------------------+------------------+--------------------+
|02/08/2017|                 1|   AGENDA|      13|Councilmember|  Jennifer S. Gates|      N/A|  NO ACTION NEEDED|          Call to Order|020817__Special__1|020817__Special__...|
|02/08/2017|                 1|   AGENDA|      14|Councilmember| Philip T. Kingston|      N/A|  NO ACTION NEEDED|          Call to Order|020817__Special__1|020817__Special__...|
|02/08/2017|                 1|   AGENDA|      15|        Mayor|Michael S. Rawlings|      N/A|  NO ACTION NEED

### Cleaning voter_name
The `voter_df` DataFrame contains:
- the date of the vote
- the name and position of the voter

=> Clean this data so it can later be integrated into the desired reports.

**Tasks:**
- Remove any null entries or odd characters.
- Return a specific set of voters whose information can be validated.
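
The notebook cells after this point apply two simple rules to `VOTER_NAME`: reject values containing an underscore and keep only values shorter than 20 characters. A minimal plain-Python sketch of those rules, with an added null check, is shown here; the helper `is_plausible_name` and the sample values are hypothetical.

```python
def is_plausible_name(value):
    """True for non-null values without underscores that are shorter than 20 characters."""
    return value is not None and "_" not in value and 0 < len(value) < 20


raw_names = ["Scott Griggs", None, "011018__42", "x" * 400, "Michael S. Rawlings"]
kept = [n for n in raw_names if is_plausible_name(n)]
print(kept)
```

These rules are heuristics: a legitimate name of 20 or more characters would be dropped as well, so the threshold is worth checking against the data.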

In [9]:
# Rename the column so its name has no space, then keep only the columns needed here
voter_df_org = voter_df_org.withColumnRenamed("VOTER NAME", "VOTER_NAME")
voter_df_org = voter_df_org.select("DATE", "TITLE", "VOTER_NAME")
voter_df = voter_df_org

# Show the distinct VOTER_NAME entries
voter_df.select("VOTER_NAME").distinct().show(truncate=False)


[Output truncated in the source: a single VOTER_NAME column padded to roughly 400 characters, the width of its longest entry.]

In [10]:
# Add the length of each VOTER_NAME (F.length already returns an integer column)
voter_df = voter_df.withColumn("length_of_VOTER_NAME", F.length("VOTER_NAME"))

# Keep only names shorter than 20 characters
voter_df = voter_df.filter(voter_df.length_of_VOTER_NAME < 20)
voter_df.select("VOTER_NAME").distinct().show()


+-------------------+
|         VOTER_NAME|
+-------------------+
|     Tennell Atkins|
|       Scott Griggs|
|      Scott  Griggs|
|      Sandy Greyson|
|Michael S. Rawlings|
|       Kevin Felder|
|       Adam Medrano|
|         011018__42|
|   Casey Thomas, II|
|      Mark  Clayton|
|  Casey  Thomas, II|
|     Sandy  Greyson|
|       Mark Clayton|
| Jennifer S.  Gates|
|  Tiffinni A. Young|
|   B. Adam  McGough|
|       Omar Narvaez|
| Philip T. Kingston|
| Rickey D. Callahan|
|  Dwaine R. Caraway|
+-------------------+
only showing top 20 rows



In [11]:
# Start again from the original DataFrame
voter_df = voter_df_org

# Remove entries containing an underscore (ID-like values such as 011018__42)
voter_df = voter_df.filter(~F.col('VOTER_NAME').contains('_'))

# Keep only names shorter than 20 characters
voter_df = voter_df.filter(F.length("VOTER_NAME") < 20)
voter_df.select("VOTER_NAME").distinct().sort("VOTER_NAME").show(200, truncate=False)


+-------------------+
|VOTER_NAME         |
+-------------------+
|Adam Medrano       |
|B. Adam  McGough   |
|Carolyn King Arnold|
|Casey  Thomas, II  |
|Casey Thomas, II   |
|Dwaine R. Caraway  |
|Erik Wilson        |
|Jennifer S.  Gates |
|Jennifer S. Gates  |
|Kevin Felder       |
|Lee Kleinman       |
|Lee M. Kleinman    |
|Mark  Clayton      |
|Mark Clayton       |
|Michael S. Rawlings|
|Monica R. Alonzo   |
|Omar Narvaez       |
|Philip T.  Kingston|
|Philip T. Kingston |
|Rickey D.  Callahan|
|Rickey D. Callahan |
|Sandy  Greyson     |
|Sandy Greyson      |
|Scott  Griggs      |
|Scott Griggs       |
|Tennell Atkins     |
|Tiffinni A. Young  |
+-------------------+



In [12]:
voter_df_org.filter(F.col('VOTER_NAME').contains('_')).show(30)
voter_df.filter(F.col('VOTER_NAME').contains('_')).show(30)

+--------------------+-------+----------+
|                DATE|  TITLE|VOTER_NAME|
+--------------------+-------+----------+
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
|MADELEINE JOHNSON...| 2020]"|011018__42|
+--------------------+-------+----------+

+----+-----+----------+
|DATE|TITLE|VOTER_NAME|
+----+-----+----------+
+----+-----+----------+



In [13]:
voter_df = voter_df.filter(~ F.col('VOTER_NAME').contains('_'))

voter_df.show()
voter_df.filter(F.col('VOTER_NAME').contains('_')).show(30)

+----------+-------------+-------------------+
|      DATE|        TITLE|         VOTER_NAME|
+----------+-------------+-------------------+
|02/08/2017|Councilmember|  Jennifer S. Gates|
|02/08/2017|Councilmember| Philip T. Kingston|
|02/08/2017|        Mayor|Michael S. Rawlings|
|02/08/2017|Councilmember|       Adam Medrano|
|02/08/2017|Councilmember|   Casey Thomas, II|
|02/08/2017|Councilmember|Carolyn King Arnold|
|02/08/2017|Councilmember|       Scott Griggs|
|02/08/2017|Councilmember|   B. Adam  McGough|
|02/08/2017|Councilmember|       Lee Kleinman|
|02/08/2017|Councilmember|      Sandy Greyson|
|02/08/2017|Councilmember|  Jennifer S. Gates|
|02/08/2017|Councilmember| Philip T. Kingston|
|02/08/2017|        Mayor|Michael S. Rawlings|
|02/08/2017|Councilmember|       Adam Medrano|
|02/08/2017|Councilmember|   Casey Thomas, II|
|02/08/2017|Councilmember|Carolyn King Arnold|
|02/08/2017|Councilmember| Rickey D. Callahan|
|01/11/2017|Councilmember|  Jennifer S. Gates|
|04/25/2018|C

In [14]:
voter_df.select('VOTER_NAME').distinct().show(40, truncate=True)


+-------------------+
|         VOTER_NAME|
+-------------------+
|     Tennell Atkins|
|       Scott Griggs|
|      Scott  Griggs|
|      Sandy Greyson|
|Michael S. Rawlings|
|       Kevin Felder|
|       Adam Medrano|
|   Casey Thomas, II|
|      Mark  Clayton|
|  Casey  Thomas, II|
|     Sandy  Greyson|
|       Mark Clayton|
| Jennifer S.  Gates|
|  Tiffinni A. Young|
|   B. Adam  McGough|
|       Omar Narvaez|
| Philip T. Kingston|
| Rickey D. Callahan|
|  Dwaine R. Caraway|
|Philip T.  Kingston|
|  Jennifer S. Gates|
|    Lee M. Kleinman|
|   Monica R. Alonzo|
|Rickey D.  Callahan|
|Carolyn King Arnold|
|        Erik Wilson|
|       Lee Kleinman|
+-------------------+
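
The distinct list above still contains pairs such as `Scott Griggs` and `Scott  Griggs` that differ only in spacing. One way to collapse them is to normalise whitespace before comparing. The sketch below is plain Python for illustration, and the helper `normalize_name` is hypothetical; in PySpark the same pattern could be applied with `F.regexp_replace` and `F.trim`.

```python
import re


def normalize_name(name):
    """Collapse runs of whitespace to a single space and strip both ends."""
    return re.sub(r"\s+", " ", name).strip()


names = ["Scott Griggs", "Scott  Griggs", "Jennifer S.  Gates", "Jennifer S. Gates"]
distinct = sorted({normalize_name(n) for n in names})
print(distinct)
```

Variants such as `Lee Kleinman` and `Lee M. Kleinman` differ by more than spacing and would need a separate rule.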



### Modifying DataFrame columns
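
The cells in this section derive `first_name` and `last_name` from `VOTER_NAME` in three steps: keep the text before the first comma, split it on whitespace, then take the first and last pieces. The same logic as a plain-Python sketch, using the hypothetical helper `first_and_last` (the extra `strip()` is an assumption added for safety and is not in the Spark cells):

```python
import re


def first_and_last(voter_name):
    """Return (first_name, last_name) using the three split steps."""
    before_comma = voter_name.split(",")[0]         # 'Casey Thomas, II' -> 'Casey Thomas'
    parts = re.split(r"\s+", before_comma.strip())  # tolerates repeated spaces
    return parts[0], parts[-1]


for name in ["Jennifer S. Gates", "Casey Thomas, II", "B. Adam  McGough"]:
    print(name, "->", first_and_last(name))
```

Note that a leading initial such as `B.` ends up as the first name, which matches what the Spark cells produce.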

In [22]:
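# Keep only the part of the name before the first comma (drops suffixes such as "II")
voter_df = voter_df.withColumn('splits', F.split(voter_df.VOTER_NAME, ','))
voter_df = voter_df.withColumn('splits', voter_df.splits.getItem(0))

# Split on runs of whitespace; the raw string keeps \s from being read as a Python escape
voter_df = voter_df.withColumn('splits', F.split(voter_df.splits, r'\s+'))
voter_df.show()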

+----------+-------------+-------------------+--------------------+
|      DATE|        TITLE|         VOTER_NAME|              splits|
+----------+-------------+-------------------+--------------------+
|02/08/2017|Councilmember|  Jennifer S. Gates|[Jennifer, S., Ga...|
|02/08/2017|Councilmember| Philip T. Kingston|[Philip, T., King...|
|02/08/2017|        Mayor|Michael S. Rawlings|[Michael, S., Raw...|
|02/08/2017|Councilmember|       Adam Medrano|     [Adam, Medrano]|
|02/08/2017|Councilmember|   Casey Thomas, II|     [Casey, Thomas]|
|02/08/2017|Councilmember|Carolyn King Arnold|[Carolyn, King, A...|
|02/08/2017|Councilmember|       Scott Griggs|     [Scott, Griggs]|
|02/08/2017|Councilmember|   B. Adam  McGough| [B., Adam, McGough]|
|02/08/2017|Councilmember|       Lee Kleinman|     [Lee, Kleinman]|
|02/08/2017|Councilmember|      Sandy Greyson|    [Sandy, Greyson]|
|02/08/2017|Councilmember|  Jennifer S. Gates|[Jennifer, S., Ga...|
|02/08/2017|Councilmember| Philip T. Kingston|[P

In [23]:

# Create a new column called first_name based on the first item in splits
voter_df = voter_df.withColumn('first_name', voter_df.splits.getItem(0))

voter_df.show()

+----------+-------------+-------------------+--------------------+----------+
|      DATE|        TITLE|         VOTER_NAME|              splits|first_name|
+----------+-------------+-------------------+--------------------+----------+
|02/08/2017|Councilmember|  Jennifer S. Gates|[Jennifer, S., Ga...|  Jennifer|
|02/08/2017|Councilmember| Philip T. Kingston|[Philip, T., King...|    Philip|
|02/08/2017|        Mayor|Michael S. Rawlings|[Michael, S., Raw...|   Michael|
|02/08/2017|Councilmember|       Adam Medrano|     [Adam, Medrano]|      Adam|
|02/08/2017|Councilmember|   Casey Thomas, II|     [Casey, Thomas]|     Casey|
|02/08/2017|Councilmember|Carolyn King Arnold|[Carolyn, King, A...|   Carolyn|
|02/08/2017|Councilmember|       Scott Griggs|     [Scott, Griggs]|     Scott|
|02/08/2017|Councilmember|   B. Adam  McGough| [B., Adam, McGough]|        B.|
|02/08/2017|Councilmember|       Lee Kleinman|     [Lee, Kleinman]|       Lee|
|02/08/2017|Councilmember|      Sandy Greyson|    [S

In [25]:
# Get the last entry of the splits list and create a column called last_name
voter_df = voter_df.withColumn('last_name', voter_df.splits.getItem(F.size('splits') - 1))

# Drop the intermediate splits column
voter_df = voter_df.drop('splits')
voter_df.show()

+----------+-------------+-------------------+----------+---------+
|      DATE|        TITLE|         VOTER_NAME|first_name|last_name|
+----------+-------------+-------------------+----------+---------+
|02/08/2017|Councilmember|  Jennifer S. Gates|  Jennifer|    Gates|
|02/08/2017|Councilmember| Philip T. Kingston|    Philip| Kingston|
|02/08/2017|        Mayor|Michael S. Rawlings|   Michael| Rawlings|
|02/08/2017|Councilmember|       Adam Medrano|      Adam|  Medrano|
|02/08/2017|Councilmember|   Casey Thomas, II|     Casey|   Thomas|
|02/08/2017|Councilmember|Carolyn King Arnold|   Carolyn|   Arnold|
|02/08/2017|Councilmember|       Scott Griggs|     Scott|   Griggs|
|02/08/2017|Councilmember|   B. Adam  McGough|        B.|  McGough|
|02/08/2017|Councilmember|       Lee Kleinman|       Lee| Kleinman|
|02/08/2017|Councilmember|      Sandy Greyson|     Sandy|  Greyson|
|02/08/2017|Councilmember|  Jennifer S. Gates|  Jennifer|    Gates|
|02/08/2017|Councilmember| Philip T. Kingston|  