# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
# The wildcard import comes first so that the explicit imports below take precedence.
# It shadows Python built-ins such as sum and round, and later cells rely on it.
from pyspark.sql.functions import *
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.types import StringType
import configparser
import os
import shutil
import time
from datetime import datetime
from functools import reduce
import boto3
import pandas as pd

pd.set_option('max_colwidth', 200)
pd.set_option('display.max_columns', 200)

# Read the AWS credentials from the local config file and export them as environment variables
config = configparser.ConfigParser()
with open('csp.cfg') as cfg_file:
    config.read_file(cfg_file)
KEY = config.get('AWS', 'AWS_ACCESS_KEY_ID')
SECRET = config.get('AWS', 'AWS_SECRET_ACCESS_KEY')
os.environ['AWS_ACCESS_KEY_ID'] = KEY
os.environ['AWS_SECRET_ACCESS_KEY'] = SECRET

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did they come from? What type of information is included?

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.
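
As a minimal sketch of what such a check computes, the plain-Python helper below counts missing values per column and fully duplicated rows in a small list of records. The function name `profile_rows` and the sample records are hypothetical and only illustrate the idea; on the full data set the same counts would be computed in Spark rather than in local Python.

```python
from collections import Counter


def profile_rows(rows, columns):
    """Return (null_counts, duplicate_count) for a list of dict records."""
    # Count None or empty-string values per column
    null_counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) in (None, ""):
                null_counts[c] += 1
    # Every repeat of an identical row beyond its first occurrence is a duplicate
    seen = Counter(tuple(row.get(c) for c in columns) for row in rows)
    duplicate_count = sum(n - 1 for n in seen.values())
    return null_counts, duplicate_count


# Hypothetical sample records, loosely modelled on the loan data
sample = [
    {"Loan_Number": "IBRD00010", "Project_ID": "P037383"},
    {"Loan_Number": "IBRD00010", "Project_ID": "P037383"},
    {"Loan_Number": "IBRDM2001", "Project_ID": None},
]
null_counts, duplicate_count = profile_rows(sample, ["Loan_Number", "Project_ID"])
print(null_counts, duplicate_count)
```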

#### Cleaning Steps
Document steps necessary to clean the data

In [2]:
from pyspark.sql import SparkSession

def create_spark_session():
    """
    This function is used to create a spark session to work in
    """
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()
    return spark
spark = create_spark_session()


In [None]:
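# Count the distinct values in each column (run this after the cell that builds df)
arr = []
for i in df.columns:
    a = df.select(i).distinct().count()
    arr.append((i, a))
# Use a separate name so that the Spark DataFrame df is not overwritten
distinct_counts = pd.DataFrame(arr, columns=['column', 'distinct_count'])
distinct_counts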

In [None]:
# Read in the data here
s3 = boto3.resource('s3',
                       region_name="us-east-1",
                       aws_access_key_id=KEY,
                       aws_secret_access_key=SECRET
                   )
## This is to display what we have in S3 
sampleDbBucket =  s3.Bucket("john-udacity-s3")
i = 0
for obj in sampleDbBucket.objects.filter(Prefix="movie"):
    print(obj)
    i += 1
    if i > 10:
        break   

In [3]:
df_spark = spark.read.csv("s3a://john-udacity-s3/loan/ibrd-statement-of-loans-historical-data.csv", header=True)
# Upper-case every string column; non-string columns pass through unchanged.
# Without inferSchema every CSV column is read as a string, so the column order is preserved.
fields = df_spark.schema.fields
allFields = [
    upper(col(f.name)) if isinstance(f.dataType, StringType) else col(f.name)
    for f in fields
]
df_new = df_spark.select(allFields)
# rename the column name
oldColumns = df_new.schema.names
newColumns  = ['End_of_Period', 'Loan_Number', 'Region', 'Country_Code', 'Country','Borrower','Guarantor_Country_Code','Guarantor','Loan_Type','Loan_Status','Interest_Rate','Currency_of_Commitment','Project_ID','Project_Name','Original_Principal_Amount','Cancelled_Amount','Undisbursed_Amount','Disbursed_Amount','Repaid_to_IBRD','Due_to_IBRD','Exchange_Adjustment','Borrowers_Obligation','Sold_3rd_Party','Repaid_3rd_Party','Due_3rd_Party','Loans_Held','First_Repayment_Date','Last_Repayment_Date','Agreement_Signing_Date','Board_Approval_Date','Effective_Date_Most_Recent','Closed_Date_Most_Recent','Last_Disbursement_Date']
# Rename the columns by position; toDF expects exactly one new name per existing column
df = df_new.toDF(*newColumns)

In [62]:
# Display the number of null values in each column
# F.sum is referenced explicitly instead of importing sum, which shadows Python's built-in
df.select(*(F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns)).toPandas()

Unnamed: 0,End_of_Period,Loan_Number,Region,Country_Code,Country,Borrower,Guarantor_Country_Code,Guarantor,Loan_Type,Loan_Status,Interest_Rate,Project_ID,Project_Name,Original_Principal_Amount,Cancelled_Amount,Undisbursed_Amount,Disbursed_Amount,Repaid_to_IBRD,Due_to_IBRD,Exchange_Adjustment,Borrowers_Obligation,Sold_3rd_Party,Repaid_3rd_Party,Due_3rd_Party,Loans_Held,First_Repayment_Date,Last_Repayment_Date,Agreement_Signing_Date,Board_Approval_Date,Effective_Date_Most_Recent,Closed_Date_Most_Recent
0,0,0,0,0,0,5565,31121,57779,0,0,24930,0,167552,0,0,0,0,0,0,0,0,0,0,0,0,1570,1474,9979,4,4603,742


In [5]:
df = df.drop('Last_Disbursement_Date', 'Currency_of_Commitment')

In [6]:
# Normalise Loan_Number to 9 characters: keep the 4-character prefix and zero-pad the numeric part to 5 digits
df_good = df.filter(length(col("Loan_Number")) == 9)
df_6 = df.where(length(col("Loan_Number")) == 6).withColumn("Loan_Number", regexp_replace(col("Loan_Number") ,  "(\\w{4})(\\d{2})" , "$1000$2" ))
df_7 = df.where(length(col("Loan_Number")) == 7).withColumn("Loan_Number", regexp_replace(col("Loan_Number") ,  "(\\w{4})(\\d{3})" , "$100$2" ))
df_8 = df.where(length(col("Loan_Number")) == 8).withColumn("Loan_Number", regexp_replace(col("Loan_Number") ,  "(\\w{4})(\\d{4})" , "$10$2" ))
# Rows whose Loan_Number has any other length are not carried over; short values that do not match the pattern stay unchanged
df = df_good.union(df_6).union(df_7).union(df_8)
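
The padding rule used in the cell above can be written as a small plain-Python function, which makes it easy to unit test outside Spark. This is only a sketch: the name `normalize_loan_number` is hypothetical, and unlike the Spark version it returns `None` for values that cannot be normalised instead of leaving them unchanged.

```python
import re

# Four word characters followed by a numeric part of 2 to 4 digits
SHORT_LOAN_NUMBER = re.compile(r"(\w{4})(\d{2,4})")


def normalize_loan_number(loan_number):
    """Return a 9-character loan number, or None if it cannot be normalised."""
    if len(loan_number) == 9:
        return loan_number  # already in the target format
    match = SHORT_LOAN_NUMBER.fullmatch(loan_number)
    if match is None:
        return None  # unexpected length or pattern
    prefix, digits = match.groups()
    # Zero-pad the numeric part to 5 digits so that the result has 9 characters
    return prefix + digits.zfill(5)


for raw in ["IBRD10", "IBRD123", "IBRD1234", "IBRDM2001", "IBRD1"]:
    print(raw, "->", normalize_loan_number(raw))
```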

In [61]:
# Back-fill missing Project_ID values from other rows that share the same Loan_Number
# (the loan number is present in the data, but the project id is missing on some rows)
# Rows where Project_ID is missing; the empty column is dropped before the join
LwithoutPP = df.filter('Project_ID is NULL').drop('Project_ID')
# Known Loan_Number -> Project_ID pairs
LwithPP = df.filter('Project_ID is not NULL').select('Loan_Number', 'Project_ID').distinct()
# The inner join keeps only the rows whose Project_ID can be recovered;
# select(df.columns) restores the original column order so the positional union lines up
new = LwithoutPP.join(LwithPP, on='Loan_Number', how='inner').select(df.columns)
df = df.filter('Project_ID is not NULL').union(new)
df.count()
##check 
##
#LwithPP = df.filter('Project_ID is not NULL').select('Project_ID','Loan_Number').filter('Loan_Number =\'IBRDM2001\'').limit(5).toPandas()
#LwithoutPP.filter('Loan_Number =\'IBRDM2001\'').toPandas()
#new.filter('Loan_Number =\'IBRDM2001\'').toPandas()

770790
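
The back-fill logic of this cell is sketched below in plain Python on a few hypothetical records, as an illustration rather than the notebook's actual implementation. The function name `backfill_project_ids` is an assumption. One difference from a Spark join: if a loan number were linked to several project ids, this sketch keeps the first one seen, whereas a join would emit one row per match.

```python
def backfill_project_ids(rows):
    """Fill missing Project_ID values from rows sharing the same Loan_Number."""
    # Lookup of Loan_Number -> Project_ID built from rows where the id is known
    known = {}
    for row in rows:
        if row["Project_ID"] is not None:
            known.setdefault(row["Loan_Number"], row["Project_ID"])
    result = []
    for row in rows:
        if row["Project_ID"] is not None:
            result.append(dict(row))
        elif row["Loan_Number"] in known:
            result.append({**row, "Project_ID": known[row["Loan_Number"]]})
        # Rows without a recoverable Project_ID are dropped, as an inner join would do
    return result


# Hypothetical sample records
rows = [
    {"Loan_Number": "IBRD00010", "Project_ID": "P037383"},
    {"Loan_Number": "IBRD00010", "Project_ID": None},
    {"Loan_Number": "IBRDM2001", "Project_ID": None},
]
filled = backfill_project_ids(rows)
print(filled)
```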

In [65]:
df.filter('Project_ID is not NULL').filter('Project_Name is NULL').toPandas()

Unnamed: 0,End_of_Period,Loan_Number,Region,Country_Code,Country,Borrower,Guarantor_Country_Code,Guarantor,Loan_Type,Loan_Status,Interest_Rate,Project_ID,Project_Name,Original_Principal_Amount,Cancelled_Amount,Undisbursed_Amount,Disbursed_Amount,Repaid_to_IBRD,Due_to_IBRD,Exchange_Adjustment,Borrowers_Obligation,Sold_3rd_Party,Repaid_3rd_Party,Due_3rd_Party,Loans_Held,First_Repayment_Date,Last_Repayment_Date,Agreement_Signing_Date,Board_Approval_Date,Effective_Date_Most_Recent,Closed_Date_Most_Recent
0,2017-05-31T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,FULLY REPAID,4.25,P037383,,250000000.00,0.00,0.00,250000000.00,38000.00,0.00,0.00,0.00,249962000.00,249962000.00,0.00,0.00,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000
1,2017-05-31T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,FULLY REPAID,4.25,P037383,,250000000.00,0.00,0.00,250000000.00,38000.00,0.00,0.00,0.00,249962000.00,249962000.00,0.00,0.00,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000
2,2017-07-31T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,FULLY REPAID,4.25,P037383,,250000000.00,0.00,0.00,250000000.00,38000.00,0.00,0.00,0.00,249962000.00,249962000.00,0.00,0.00,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000
3,2017-08-31T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,FULLY REPAID,4.25,P037383,,250000000.00,0.00,0.00,250000000.00,38000.00,0.00,0.00,0.00,249962000.00,249962000.00,0.00,0.00,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000
4,2017-09-30T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,FULLY REPAID,4.25,P037383,,250000000.00,0.00,0.00,250000000.00,38000.00,0.00,0.00,0.00,249962000.00,249962000.00,0.00,0.00,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000
5,2017-10-31T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,FULLY REPAID,4.25,P037383,,250000000.00,0.00,0.00,250000000.00,38000.00,0.00,0.00,0.00,249962000.00,249962000.00,0.00,0.00,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000
6,2017-11-30T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,,NON POOL,FULLY REPAID,,P037383,,250000000.00,0.00,0.00,250000000.00,38000.00,0.00,0.00,0.00,249962000.00,249962000.00,0.00,0.00,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000
7,2017-12-31T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,FULLY REPAID,4.25,P037383,,250000000.00,0.00,0.00,250000000.00,38000.00,0.00,0.00,0.00,249962000.00,249962000.00,0.00,0.00,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000
8,2018-01-31T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,FULLY REPAID,4.25,P037383,,250000000.00,0.00,0.00,250000000.00,38000.00,0.00,0.00,0.00,249962000.00,249962000.00,0.00,0.00,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000
9,2018-02-28T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,FULLY REPAID,4.25,P037383,,250000000.00,0.00,0.00,250000000.00,38000.00,0.00,0.00,0.00,249962000.00,249962000.00,0.00,0.00,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000


In [7]:
df.limit(2).toPandas()

Unnamed: 0,End_of_Period,Loan_Number,Region,Country_Code,Country,Borrower,Guarantor_Country_Code,Guarantor,Loan_Type,Loan_Status,Interest_Rate,Project_ID,Project_Name,Original_Principal_Amount,Cancelled_Amount,Undisbursed_Amount,Disbursed_Amount,Repaid_to_IBRD,Due_to_IBRD,Exchange_Adjustment,Borrowers_Obligation,Sold_3rd_Party,Repaid_3rd_Party,Due_3rd_Party,Loans_Held,First_Repayment_Date,Last_Repayment_Date,Agreement_Signing_Date,Board_Approval_Date,Effective_Date_Most_Recent,Closed_Date_Most_Recent
0,2011-04-30T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,REPAID,4.25,P037383,RECONSTRUCTION,250000000.0,0.0,0.0,250000000.0,38000.0,0.0,0.0,0.0,249962000.0,249962000.0,0.0,0.0,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000
1,2011-05-31T00:00:00.000,IBRD00010,EUROPE AND CENTRAL ASIA,FR,FRANCE,CREDIT NATIONAL,FR,FRANCE,NON POOL,REPAID,4.25,P037383,RECONSTRUCTION,250000000.0,0.0,0.0,250000000.0,38000.0,0.0,0.0,0.0,249962000.0,249962000.0,0.0,0.0,1952-11-01T00:00:00.000,1977-05-01T00:00:00.000,1947-05-09T00:00:00.000,1947-05-09T00:00:00.000,1947-06-09T00:00:00.000,1947-12-31T00:00:00.000


In [56]:
df.filter('Project_ID is NULL').select('Loan_Number').distinct().limit(10).toPandas()

Unnamed: 0,Loan_Number
0,IBRDM2001
1,IBRDM2003
2,IBRDM0723
3,IBRDM0905
4,IBRDM1104
5,IBRDM0827
6,IBRDM3010
7,IBRDM0807
8,IBRD09820
9,IBRDM1511


In [None]:
df.select('Project_ID', 'Loan_Number').filter('Project_ID is NULL').filter('Loan_Number like \'IBRDM%\'').distinct().toPandas()

In [99]:
df.select('Project_ID', 'Loan_Number').filter('Loan_Number like \'HIDA%\'').distinct().toPandas()

Unnamed: 0,Project_ID,Loan_Number


In [100]:
df.select('Project_ID', 'Loan_Number').filter('Loan_Number like \'IBRDM%\'').count()

26864

In [16]:
LwithoutPP = df.filter('Project_ID is NULL')
LwithPP = df.filter('Project_ID is not NULL').select('Project_ID','Loan_Number').distinct()

In [57]:
# Display only; the result is not assigned, so the Spark DataFrame LwithPP is not overwritten
df.filter('Project_ID is not NULL').select('Project_ID','Loan_Number').filter('Loan_Number =\'IBRDM2001\'').limit(5).toPandas()

In [None]:
LwithoutPP.filter('Loan_Number =\'IBRDM2001\'').toPandas()

In [17]:
LwithPP.count()

8444

In [None]:
df1 = LwithoutPP.alias('df1')
df2 = LwithPP.alias('df2')


#df1.alias('a').join(df2.alias('b'),col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns]
#select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).toPandas()

In [35]:
# Build a comma-separated list of df1-qualified column names
x = ''.join('df1.' + c + ',' for c in newColumns1)
x

'df1.End_of_Period,df1.Loan_Number,df1.Region,df1.Country_Code,df1.Country,df1.Borrower,df1.Guarantor_Country_Code,df1.Guarantor,df1.Loan_Type,df1.Loan_Status,df1.Interest_Rate,df1.Currency_of_Commitment,df1.Project_Name,df1.Original_Principal_Amount,df1.Cancelled_Amount,df1.Undisbursed_Amount,df1.Disbursed_Amount,df1.Repaid_to_IBRD,df1.Due_to_IBRD,df1.Exchange_Adjustment,df1.Borrowers_Obligation,df1.Sold_3rd_Party,df1.Repaid_3rd_Party,df1.Due_3rd_Party,df1.Loans_Held,df1.First_Repayment_Date,df1.Last_Repayment_Date,df1.Agreement_Signing_Date,df1.Board_Approval_Date,df1.Effective_Date_Most_Recent,df1.Closed_Date_Most_Recent,df1.Last_Disbursement_Date,'

In [53]:
df1.join(df2, df1.Loan_Number == df2.Loan_Number).drop(col('df1.Loan_Number')).drop(col('df1.Project_ID')).count()

42

In [140]:
aa = df.filter('Project_ID is NULL').where(df.Loan_Number == LwithPP.Loan_Number).withColumn('Project_ID', LwithPP.Project_ID)

In [10]:
aa.toPandas()

NameError: name 'aa' is not defined

In [59]:
dft.filter('Loan_Number =\'IBRDM2001\'').toPandas()

Unnamed: 0,End_of_Period,Loan_Number,Region,Country_Code,Country,Borrower,Guarantor_Country_Code,Guarantor,Loan_Type,Loan_Status,Interest_Rate,Project_ID,Project_Name,Original_Principal_Amount,Cancelled_Amount,Undisbursed_Amount,Disbursed_Amount,Repaid_to_IBRD,Due_to_IBRD,Exchange_Adjustment,Borrowers_Obligation,Sold_3rd_Party,Repaid_3rd_Party,Due_3rd_Party,Loans_Held,First_Repayment_Date,Last_Repayment_Date,Agreement_Signing_Date,Board_Approval_Date,Effective_Date_Most_Recent,Closed_Date_Most_Recent,Project_ID.1


In [78]:
df.select(length('Project_ID').alias('Project_ID')).groupBy('Project_ID').count().show()

+----------+------+
|Project_ID| count|
+----------+------+
|      null| 31341|
|         7|770748|
+----------+------+



In [59]:
df_good = df.filter(length(col("Loan_Number")) == 9)
df = df_good.union(newdf)

In [None]:
df.filter('Project_ID ==\'P037383\'').toPandas()

In [None]:
# Generate user (borrower) table
df_user = df.select('Borrower').distinct()
# Add the surrogate key to df_user itself, not to df_country
df_user = df_user.withColumn('User_Id',row_number().over(Window.orderBy(monotonically_increasing_id())))
df_user.limit(2).toPandas()

In [23]:
# Generate country table 
df_country = df.select('Country_Code','Country','Region').distinct()
df_country = df_country.withColumn('Country_Id',row_number().over(Window.orderBy(monotonically_increasing_id())))
df_country.limit(2).toPandas()

Unnamed: 0,Country_Code,Country,Region,Country_Id
0,SK,SLOVAK REPUBLIC,EUROPE AND CENTRAL ASIA,1
1,MT,MALTA,MIDDLE EAST AND NORTH AFRICA,2


In [60]:
# Generate project table 
df_project = df.select('Project_ID','Project_Name','Loan_Number','Loan_Type','Loan_Status').filter('Project_ID is NULL').distinct()
df_project.toPandas()

Unnamed: 0,Project_ID,Project_Name,Loan_Number,Loan_Type,Loan_Status



In [None]:
df.select('Project_ID','Project_Name','Loan_Number','Loan_Type','Loan_Status').filter('Project_ID == \'P037362\'').toPandas()

In [None]:
df_spark.select('Project ID').groupBy('Project ID').agg({'Project ID': 'count'}).withColumnRenamed('count(Project ID)', 'pc').sort(('pc')).show(1000)

In [11]:
df_project = 

In [27]:
df.select('Country_Code','Country','Region').distinct().filter('Region is NULL').toPandas()

Unnamed: 0,Country_Code,Country,Region


In [27]:
df.select('Project_ID','Project_Name','Loan_Number','First_Repayment_Date').filter('Project_Name is not NULL').where(length(col("Project_ID")) <= 6).show(5)

+----------+------------+-----------+--------------------+
|Project_ID|Project_Name|Loan_Number|First_Repayment_Date|
+----------+------------+-----------+--------------------+
+----------+------------+-----------+--------------------+



In [None]:
df.select('Guarantor_Country_Code','Guarantor').distinct().toPandas()

In [35]:
df_project.limit(5).toPandas()

Unnamed: 0,Project_ID,Project_Name,Loan_Number,First_Repayment_Date
0,P037458,KLM AIRLINES,IBRD00590,1954-01-01T00:00:00.000
1,P010002,RAILWAY,IBRD00600,1954-08-15T00:00:00.000
2,P007511,II TOLL TRANS PROJEC,IBRD04010,1969-04-01T00:00:00.000
3,P002013,HIGHWAY REHABILITATI,IBRD06400,1972-12-15T00:00:00.000
4,P006240,DFC-BANCO DO NORDEST,IBRD06560,1973-02-15T00:00:00.000


In [31]:
df.select('Project_ID','Project_Name','Loan_Number','First_Repayment_Date').where(df.Project_ID=='P037456').show(5)

+----------+------------+-----------+--------------------+
|Project_ID|Project_Name|Loan_Number|First_Repayment_Date|
+----------+------------+-----------+--------------------+
|   P037456| SHIPPING IV|  IBRD00100|2049-01-15T00:00:...|
|   P037456| SHIPPING IV|  IBRD00100|2049-01-15T00:00:...|
|   P037456| SHIPPING IV|  IBRD00100|2049-01-15T00:00:...|
|   P037456| SHIPPING IV|  IBRD00100|2049-01-15T00:00:...|
|   P037456| SHIPPING IV|  IBRD00100|1949-01-15T00:00:...|
+----------+------------+-----------+--------------------+
only showing top 5 rows



In [None]:
# Performing cleaning tasks here

def process_song_data(spark, input_data, output_data):
    """
    This function is used to load songs data from s3 to our data lake and export as parquet file back to my s3 folder.
    """
    # get filepath to song data file
    song_data = input_data + 'song_data/A/A/A/*'
    # uncomment if you want to load all data
    #song_data = input_data + 'song_data/*/*/*/*'
    
    # read song data file, using song_data/A/A/A/* for performance
    df = spark.read.json(song_data)

    # extract columns to create songs table
    songs_table = df.select('song_id', 'title', 'artist_id', 'year', 'duration')
    
    # convert the data type to proper data type for each column
    fields_1 = {'song_id':'string','title':'string', 'artist_id':'string', 'year':'int', 'duration':'float'}
    exprs_1 = [ "cast ({} as {})".format(key,value) for key, value in fields_1.items()]
    songs_table = songs_table.selectExpr(*exprs_1)
    
    # write songs table to parquet files partitioned by year and artist
    songs_table.write.partitionBy('year','artist_id').parquet(output_data + "songs.parquet")

    # extract columns to create artists table
    artists_table = df.select('artist_id', 'artist_name', 'artist_location', 'artist_latitude', 'artist_longitude')
    
    # convert the data type to proper data type for each column
    fields_2 = {'artist_id':'string', 'artist_name':'string', 'artist_location':'string', 'artist_latitude':'string', 'artist_longitude':'string'}
    exprs_2 = [ "cast ({} as {})".format(key,value) for key, value in fields_2.items()]
    artists_table = artists_table.selectExpr(*exprs_2)
    
    # write artists table to parquet files
    artists_table.write.parquet(output_data + "artists.parquet")



### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
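
The checks listed above could be sketched as small functions that raise an error when a check fails. This is a minimal illustration with hypothetical names (`check_not_empty`, `check_unique_key`, `check_counts_match`) and hypothetical sample values; with Spark tables, the row counts would typically come from `df.count()` and the key values from the key column of each table.

```python
def check_not_empty(table_name, row_count):
    """Completeness check: the table must contain at least one row."""
    if row_count < 1:
        raise ValueError(f"{table_name}: quality check failed, table is empty")


def check_unique_key(table_name, keys):
    """Integrity check: key values must be non-null and unique."""
    keys = list(keys)
    if any(k is None for k in keys):
        raise ValueError(f"{table_name}: quality check failed, null key found")
    if len(set(keys)) != len(keys):
        raise ValueError(f"{table_name}: quality check failed, duplicate key found")


def check_counts_match(table_name, source_count, target_count):
    """Source/count check: the target must hold as many rows as the source."""
    if source_count != target_count:
        raise ValueError(
            f"{table_name}: quality check failed, "
            f"expected {source_count} rows but found {target_count}"
        )


# Hypothetical key values for a small country table
country_ids = [1, 2, 3]
check_not_empty("country", len(country_ids))
check_unique_key("country", country_ids)
check_counts_match("country", 3, len(country_ids))
print("all checks passed")
```

Raising an exception, rather than only printing a message, makes a failed check stop the pipeline run.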

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.