# Cloud Data Pipeline for the International Bank for Reconstruction and Development (IBRD)
### Data Engineering Capstone Project

#### Project Summary
The International Bank for Reconstruction and Development (IBRD) wants to move its processes and data pipeline to the cloud. Its data resides in S3, in a directory of CSV files that record the transaction history of its loans.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Clean and Save the Data
* Step 4: Define the Data Model
* Step 5: Run ETL to Model the Data
* Step 6: Complete Project Write Up

### Step 1: Scope the Project and Gather Data

#### Scope 
* Project Description

In this project, we build an ETL pipeline that extracts data from S3, processes and cleans it with Spark, and loads the data back into S3 as a set of dimensional tables in CSV format. We then create a Redshift cluster and load the data into Redshift. This allows the analytics team to keep finding insights about the bank's borrowers.

* Tools

In this project, we use AWS S3, Redshift, Spark, and Python.
 
#### Describe and Gather Data 
The data comes from the World Bank and can be downloaded from the page below:

https://finances.worldbank.org/Loans-and-Credits/IBRD-Statement-Of-Loans-Historical-Data/zucq-nrc3

The file is around 300 MB. For its layout, see the data layout file in the workspace.

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.
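
As a rough illustration of these two checks, the sketch below counts missing values per column and fully duplicated rows using only the standard library. The sample rows and the `profile_rows` helper are made up for this example; the notebook itself runs the equivalent checks with Spark on the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the spirit of the loan file (not real data).
SAMPLE_CSV = """Loan_Number,Country_Code,Interest_Rate
IBRD00010,FR,4.25
IBRD00020,NL,
IBRD00020,NL,
IBRD00030,,3.5
"""

def profile_rows(text):
    """Return (missing value count per column, number of duplicated rows)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = [tuple(row.items()) for row in reader]
    missing = Counter()
    for row in rows:
        for column, value in row:
            if value is None or value.strip() == "":
                missing[column] += 1
    # A row counts as a duplicate each time it repeats after its first occurrence.
    seen = Counter(rows)
    duplicates = sum(count - 1 for count in seen.values())
    return dict(missing), duplicates

missing_counts, duplicate_rows = profile_rows(SAMPLE_CSV)
print(missing_counts)
print(duplicate_rows)
```

Counting before and after each cleaning step makes it easy to see how many records a step removed or repaired.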


In [1]:
# Import all modules and setup here
# Loading all library
import numpy as np
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
import configparser
from datetime import datetime
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Window
import boto3
import time
from functools import reduce
import pandas as pd
# Change pandas display options to adjust visualization
pd.set_option('max_colwidth', 200)
pd.set_option('display.max_columns', 200)
# Load in aws credential
config = configparser.ConfigParser()
config.read_file(open('dwh.cfg'))
KEY = config.get('AWS','KEY')
SECRET = config.get('AWS','SECRET')
os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['KEY']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['SECRET']
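
The cell above expects a `dwh.cfg` file next to the notebook. A minimal sketch of the layout it appears to need is shown below, based only on the section and option names the notebook reads (`AWS` and `CLUSTER`). Every value is a placeholder, the host name is invented, and the `REQUIRED` check is an addition for illustration rather than part of the project.

```python
import configparser

# Assumed layout of dwh.cfg; every value below is a placeholder.
EXAMPLE_CFG = """
[AWS]
KEY = YOUR_ACCESS_KEY_ID
SECRET = YOUR_SECRET_ACCESS_KEY

[CLUSTER]
HOST = example-cluster.abc123.us-west-2.redshift.amazonaws.com
DB_NAME = dwh
DB_USER = dwhuser
DB_PASSWORD = change-me
DB_PORT = 5439
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_CFG)

# Fail early with a clear message if a required option is missing.
REQUIRED = {
    "AWS": ["KEY", "SECRET"],
    "CLUSTER": ["HOST", "DB_NAME", "DB_USER", "DB_PASSWORD", "DB_PORT"],
}
missing_options = [
    f"{section}.{option}"
    for section, options in REQUIRED.items()
    for option in options
    if not example_config.has_option(section, option)
]
print(missing_options)
```

Keeping the real file out of version control avoids leaking the access key and the database password.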

In [None]:
# Define a function that creates spark session
def create_spark_session():
    """
    This function is used to create a spark session to work in
    """
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()
    return spark
# Create spark session
spark = create_spark_session()

In [None]:
# Load data from AWS s3
df_spark = spark.read.csv("s3a://udacity-leejohn/loan/ibrd-statement-of-loans-historical-data.csv", header=True)
# Uppercase every string column in the dataframe
fields = df_spark.schema.fields
# A list comprehension keeps the original column order and avoids the built-in filter,
# which `from pyspark.sql.functions import *` can shadow on newer Spark versions
allFields = [upper(col(f.name)) if isinstance(f.dataType, StringType) else col(f.name) for f in fields]
df_new = df_spark.select(allFields)
# Rename the column name
# Get old column names 
oldColumns = df_new.schema.names
# Setup new column names
newColumns  = ['End_of_Period', 'Loan_Number', 'Region', 'Country_Code', 'Country','Borrower','Guarantor_Country_Code','Guarantor','Loan_Type','Loan_Status','Interest_Rate','Currency_of_Commitment','Project_ID','Project_Name','Original_Principal_Amount','Cancelled_Amount','Undisbursed_Amount','Disbursed_Amount','Repaid_to_IBRD','Due_to_IBRD','Exchange_Adjustment','Borrowers_Obligation','Sold_3rd_Party','Repaid_3rd_Party','Due_3rd_Party','Loans_Held','First_Repayment_Date','Last_Repayment_Date','Agreement_Signing_Date','Board_Approval_Date','Effective_Date_Most_Recent','Closed_Date_Most_Recent','Last_Disbursement_Date']
# Rename the dataframe
df = reduce(lambda df_spark, idx: df_spark.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), df_new)

In [None]:
# Count the number of rows in data
df.count()

In [None]:
# Describe the data
df.describe().toPandas()

In [None]:
# Display the number of null values in each column before any cleaning
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).toPandas()

### Step 3: Cleaning Steps

In this step, we clean the data and remove null values, using a different method for each column.
This process takes about 8 minutes.

In [None]:
# These 2 columns have too many null values (about 50% on average) and are not needed, so we drop them.
df = df.drop('Last_Disbursement_Date')
df = df.drop('Currency_of_Commitment')
# In the raw data, these 2 columns contain stray characters such as "?" and "@".
# These errors probably come from a character encoding mismatch.
# Instead of trying to fix them, we drop the columns.
df = df.drop('Borrower')
df = df.drop('Project_Name')
start_time = time.time()

In [None]:
# Normalize Loan_Number to 9 characters
# According to the data layout, Loan_Number should follow a 9-character pattern.
# Find records whose Loan_Number has fewer than 9 characters and pad the numeric part with zeros
df_6 = df.where(length(col("Loan_Number")) == 6).withColumn("Loan_Number", regexp_replace(col("Loan_Number") ,  "(\\w{4})(\\d{2})" , "$1000$2" ))
df_7 = df.where(length(col("Loan_Number")) == 7).withColumn("Loan_Number", regexp_replace(col("Loan_Number") ,  "(\\w{4})(\\d{3})" , "$100$2" ))
df_8 = df.where(length(col("Loan_Number")) == 8).withColumn("Loan_Number", regexp_replace(col("Loan_Number") ,  "(\\w{4})(\\d{4})" , "$10$2" ))

# Keep the records that already have 9 characters and add the padded ones, so no Loan_Number is shorter than 9 characters
df = df.filter(length(col("Loan_Number")) == 9).union(df_6).union(df_7).union(df_8)
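
The padding rule used in this cell (keep the 4-character prefix, left-pad the digits with zeros up to a total of 9 characters) can be tried outside Spark. The sketch below is a plain Python equivalent with a hypothetical helper name, `pad_loan_number`, and made-up loan numbers. One difference to keep in mind: the Spark cell drops values shorter than 6 or longer than 9 characters, while this helper returns them unchanged.

```python
import re

def pad_loan_number(loan_number):
    """Pad a 6-8 character loan number to 9 characters with zeros after the prefix."""
    if len(loan_number) >= 9:
        return loan_number
    match = re.fullmatch(r"(\w{4})(\d{2,4})", loan_number)
    if match is None:
        # Unexpected format: leave the value untouched.
        return loan_number
    prefix, digits = match.groups()
    # 4 prefix characters + 5 zero-padded digits = 9 characters.
    return prefix + digits.zfill(5)

# Made-up loan numbers of length 6, 7, 8 and 9.
padded = [pad_loan_number(n) for n in ["IBRD12", "IBRD123", "IBRD1234", "IBRD12345"]]
print(padded)
```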

In [None]:
# If both Guarantor_Country_Code and Guarantor are empty, we cannot tell whether they are supposed to be empty (no guarantor)
# or are missing values, so we drop those records.
df = df.filter('Guarantor_Country_Code is not NULL or Guarantor is not NULL')

In [None]:
# Each loan number should have exactly one country code. The commented queries below show 3 loan numbers with conflicting country data, so we drop the wrong records.
#df.select('Loan_Number','Country_Code').distinct().groupBy('Loan_Number').count().withColumnRenamed('count', 'ccount').filter('ccount>1').toPandas()
#df.where(df.Loan_Number == 'IBRD82610').select('Country').groupBy('Country').count().toPandas()
#df.where(df.Loan_Number == 'IBRD82550').select('Country').groupBy('Country').count().toPandas()
#df.where(df.Loan_Number == 'IBRD82580').select('Country').groupBy('Country').count().toPandas()
df = df.where((df.Loan_Number == 'IBRD82580') & (df.Country != 'CHINA') | (df.Loan_Number != 'IBRD82580'))
df = df.where((df.Loan_Number == 'IBRD82550') & (df.Country != 'CHINA') | (df.Loan_Number != 'IBRD82550'))
df = df.where((df.Loan_Number == 'IBRD82610') & (df.Country != 'INDIA') | (df.Loan_Number != 'IBRD82610'))
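
The check behind this cell, finding loan numbers that map to more than one country code, can be sketched in plain Python as follows. The pairs and the helper name `loans_with_conflicting_countries` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical (Loan_Number, Country_Code) pairs, not real records.
pairs = [
    ("IBRD00010", "FR"),
    ("IBRD00010", "FR"),
    ("IBRD00020", "CN"),
    ("IBRD00020", "IN"),
    ("IBRD00030", "BR"),
]

def loans_with_conflicting_countries(records):
    """Return {loan_number: sorted country codes} for loans with more than one code."""
    codes_by_loan = defaultdict(set)
    for loan_number, country_code in records:
        codes_by_loan[loan_number].add(country_code)
    return {
        loan: sorted(codes)
        for loan, codes in codes_by_loan.items()
        if len(codes) > 1
    }

conflicts = loans_with_conflicting_countries(pairs)
print(conflicts)
```

Once the conflicting loans are known, counting the records per country for each of them shows which country is the outlier to remove.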

In [None]:
# Fill missing Project_ID values from other records of the same loan
# Build a dict that maps Loan_Number to a known Project_ID
x1 = df.filter('Project_ID is not NULL').select('Loan_Number','Project_ID').distinct().toPandas().set_index('Loan_Number')['Project_ID'].to_dict() 
# Collect the records with a missing Project_ID into a pandas dataframe
pdf = df.filter('Project_ID is NULL').toPandas()
# Look up each Loan_Number in the dict. Assigning to `row` inside iterrows() is not guaranteed to update pdf, so map the whole column instead
pdf['Project_ID'] = pdf['Loan_Number'].map(x1)
# Convert back to a spark dataframe; values that could not be filled become the string 'None' or 'nan' and are dropped
ddd = spark.createDataFrame(pdf.astype(str)).filter("Project_ID not in ('None', 'nan')")
# Union the filled records with the records that were already complete
df = df.filter('Project_ID is not NULL').union(ddd)

In [None]:
# Fill missing Guarantor and Guarantor_Country_Code values
# Build a dict that maps Loan_Number to the list [Country, Country_Code]
c1 = df.select('Loan_Number','Country','Country_Code').distinct().toPandas().set_index('Loan_Number').T.to_dict('list')
# Exploring the data shows one more country in Guarantor, the United Kingdom, so we add it to the dict
c1b = df.filter('Guarantor == \'UNITED KINGDOM\'').select('Loan_Number').distinct().collect()
for i in c1b:
    c1[i.Loan_Number] = ['UNITED KINGDOM','GB']
# Collect the records with a missing guarantor field and fill both fields from the dict
pdf = df.filter('Guarantor is NULL or Guarantor_Country_Code is NULL').toPandas()
# Write through pdf.at, because assigning to `row` inside iterrows() is not guaranteed to update pdf
for index,row in pdf.iterrows():
    att = row.Loan_Number
    if att in c1:
        pdf.at[index, 'Guarantor'] = c1[att][0]
        pdf.at[index, 'Guarantor_Country_Code'] = c1[att][1]
# Convert it back to a spark dataframe
ddd = spark.createDataFrame(pdf).filter('Guarantor is not NULL and Guarantor_Country_Code is not NULL')
# Union the filled records with the records that were already complete
df = df.filter('Guarantor is not NULL and Guarantor_Country_Code is not NULL').union(ddd)

In [None]:
# Fill missing Interest_Rate values
# A loan number can have several interest rates, so for each loan number we replace null values with the mean of its distinct known rates.
# Build a dict that maps Loan_Number to its mean interest rate
i1 = df.filter('Interest_Rate is not NULL').filter('Interest_Rate != \'None\'').select('Loan_Number','Interest_Rate').distinct().toPandas()
# float64 avoids the rounding error that float16 introduces
i1.Interest_Rate = i1.Interest_Rate.astype(np.float64)
i1 = i1.groupby('Loan_Number', as_index=True).agg({"Interest_Rate": "mean"})['Interest_Rate'].to_dict()
pdf = df.filter('Interest_Rate is NULL').toPandas()
# Look up the mean rate by Loan_Number in i1
pdf['Interest_Rate'] = pdf['Loan_Number'].map(i1)
# Convert back to a spark dataframe; rates that could not be filled become the string 'nan' and are dropped
ddd = spark.createDataFrame(pdf.astype(str)).filter("Interest_Rate not in ('None', 'nan')")
# Union the filled records with the records that were already complete
df = df.filter('Interest_Rate is not NULL').union(ddd)
print("--- %s seconds ---" % (time.time() - start_time))
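
The fill strategy described in this cell (replace a missing rate with the mean of the distinct known rates of the same loan, and drop records that cannot be filled) is easier to verify on a handful of rows. The sketch below uses plain Python with invented records and a hypothetical helper, `fill_with_loan_mean`.

```python
from statistics import mean

# Hypothetical records; None marks a missing Interest_Rate.
records = [
    {"Loan_Number": "IBRD00010", "Interest_Rate": 4.0},
    {"Loan_Number": "IBRD00010", "Interest_Rate": 5.0},
    {"Loan_Number": "IBRD00010", "Interest_Rate": None},
    {"Loan_Number": "IBRD00020", "Interest_Rate": None},
]

def fill_with_loan_mean(rows):
    """Fill missing rates with the mean of the distinct known rates of the same loan."""
    known = {}
    for row in rows:
        if row["Interest_Rate"] is not None:
            known.setdefault(row["Loan_Number"], set()).add(row["Interest_Rate"])
    means = {loan: mean(rates) for loan, rates in known.items()}
    filled = []
    for row in rows:
        rate = row["Interest_Rate"]
        if rate is None:
            rate = means.get(row["Loan_Number"])
        if rate is not None:  # rows that cannot be filled are dropped
            filled.append({**row, "Interest_Rate": rate})
    return filled

filled_records = fill_with_loan_mean(records)
print(filled_records)
```

The loan without any known rate disappears from the result, which mirrors how the cell drops records that stay unfilled.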

In [None]:
# Display the number of null values in each column after cleaning, to compare with the counts from before cleaning
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).toPandas()

After the steps above, no missing values remain except in the date-related columns.
The date-related columns have more problems, but we do not need to remove all of their missing values. We handle them in a different way.

There are 2 issues with the date-related columns:
1. Missing values.
2. More than one date per loan.

For each loan_number, we should have only one unique set of [First_Repayment_Date, Last_Repayment_Date, Agreement_Signing_Date, Board_Approval_Date, Effective_Date_Most_Recent, Closed_Date_Most_Recent].
However, the data contains more than one set.

To handle this, for each date-related column we find the most frequent date of each loan number and use it as the loan's only date. This gives a unique set of date values for each loan number.
We can therefore create the time table directly. Later, when we create the transaction data, we ignore the raw values of these date columns and use the loan number to look the dates up.
This also saves the time of cleaning the missing dates, so we can move on to data modeling directly.
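
The "most frequent date per loan" rule can be sketched in a few lines of plain Python. The observations below are invented, and the tie-break (take the earliest date when two dates are equally frequent) is an assumption made here so the result is deterministic; the notebook does not specify one.

```python
from collections import Counter, defaultdict

# Hypothetical (Loan_Number, Board_Approval_Date) observations; None is a missing date.
observations = [
    ("IBRD00010", "1947-05-09"),
    ("IBRD00010", "1947-05-09"),
    ("IBRD00010", "1947-05-10"),
    ("IBRD00010", None),
    ("IBRD00020", "1947-08-07"),
]

def most_frequent_date(rows):
    """Map each loan number to the date that occurs most often for it."""
    counts = defaultdict(Counter)
    for loan_number, date in rows:
        if date is not None:
            counts[loan_number][date] += 1
    # Assumed tie-break: among equally frequent dates, take the earliest one.
    return {
        loan: min(dates, key=lambda d: (-dates[d], d))
        for loan, dates in counts.items()
    }

canonical_dates = most_frequent_date(observations)
print(canonical_dates)
```

Missing dates are ignored while counting, so a loan only drops out when it has no known date at all for that column.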

### Step 4: Define the Data Model
#### Conceptual Data Model
In this project, I use a fact and dimension (star schema) data model, because it is straightforward and easy for users to query.
The model has 1 fact table and 5 dimension tables.

##### Fact table:
trainsaction: Loan_Number, Time_Id, Country_Id, Guarantor_Country_Id, Loan_Type, Loan_Status_Id, Amount_Id, End_of_Period, Interest_Rate, Project_ID, Exchange_Adjustment, Borrowers_Obligation, Cancelled_Amount, Undisbursed_Amount, Disbursed_Amount, Repaid_to_IBRD, Due_to_IBRD, Loans_Held

##### Dimension table

df_time: Time_Id, First_Repayment_Date, Last_Repayment_Date, Board_Approval_Date, Agreement_Signing_Date, Effective_Date_Most_Recent, Closed_Date_Most_Recent

df_country: Country_Id, Country_Code, Country, Region

df_amount: Amount_Id, Original_Principal_Amount, Sold_3rd_Party, Repaid_3rd_Party, Due_3rd_Party

df_loan_type: Loan_Type_Id, Loan_Type

df_loan_status: Loan_Status_Id, Loan_Status
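
To make the relationship between a dimension table and the fact table concrete, here is a small sketch in plain Python. It builds a loan status dimension with surrogate ids and then replaces the status text in the fact rows with the matching id. The rows and the helper names `build_dimension` and `build_fact` are hypothetical, and ids are assigned in sorted order only to keep the example reproducible.

```python
# Hypothetical source rows with only the columns needed for the example.
source_rows = [
    {"Loan_Number": "IBRD00010", "Loan_Status": "REPAID", "Disbursed_Amount": 100.0},
    {"Loan_Number": "IBRD00020", "Loan_Status": "DISBURSING", "Disbursed_Amount": 40.0},
    {"Loan_Number": "IBRD00030", "Loan_Status": "REPAID", "Disbursed_Amount": 75.0},
]

def build_dimension(rows, column):
    """Assign a 1-based surrogate id to each distinct value of `column`."""
    distinct_values = sorted({row[column] for row in rows})
    return {value: key for key, value in enumerate(distinct_values, start=1)}

def build_fact(rows, status_dim):
    """Replace the status text with its surrogate id, keeping the measures."""
    return [
        {
            "Loan_Number": row["Loan_Number"],
            "Loan_Status_Id": status_dim[row["Loan_Status"]],
            "Disbursed_Amount": row["Disbursed_Amount"],
        }
        for row in rows
    ]

loan_status_dim = build_dimension(source_rows, "Loan_Status")
fact_rows = build_fact(source_rows, loan_status_dim)
print(loan_status_dim)
print(fact_rows)
```

The fact table stays narrow because it stores ids and measures, while the descriptive text lives once in each dimension.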

In [None]:
# Generate time dataframe 
# Exploring the data shows that each loan number should have only one set of date-related values, but the date columns contain more than one combination, so this function finds the most frequent value of a column for each loan number.
def getss(df, column):
    # Create a sql table with loan number and column
    df.select('Loan_Number', column).filter('{} is not NULL'.format(column)).filter('{} != \'None\''.format(column)).createOrReplaceTempView("time")
    # Run query to get the most frequent value of each category for each loan number
    x = spark.sql("""
    SELECT Loan_Number, {} FROM 
    (
    SELECT Loan_Number, {}, count(1) total_records, ROW_NUMBER() OVER (PARTITION BY Loan_Number ORDER BY count(1) desc) AS seqnum
    FROM time 
    group by Loan_Number, {}
    )
    WHERE seqnum = 1
    """.format(column, column, column))
    return x
# Getting the most frequent value of each column for each loan number
x1 = getss(df, 'First_Repayment_Date')
x2 = getss(df, 'Last_Repayment_Date')
x3 = getss(df, 'Agreement_Signing_Date')
x4 = getss(df, 'Board_Approval_Date')
x5 = getss(df, 'Effective_Date_Most_Recent')
x6 = getss(df, 'Closed_Date_Most_Recent')
# Combine them and drop the duplicate key
df_time = x1.join(x2, x1.Loan_Number == x2.Loan_Number).drop(x2.Loan_Number)
df_time = df_time.join(x3, df_time.Loan_Number == x3.Loan_Number).drop(x3.Loan_Number)
df_time = df_time.join(x4, df_time.Loan_Number == x4.Loan_Number).drop(x4.Loan_Number)
df_time = df_time.join(x5, df_time.Loan_Number == x5.Loan_Number).drop(x5.Loan_Number)
df_time = df_time.join(x6, df_time.Loan_Number == x6.Loan_Number).drop(x6.Loan_Number)
# Add a unique id Time_Id for this dataframe
df_time = df_time.withColumn('Time_Id',row_number().over(Window.orderBy(monotonically_increasing_id())))
# Rename its column name
df_time = df_time.selectExpr('Time_Id', 'Loan_Number', 'First_Repayment_Date as First_Repayment_Date_t', 'Last_Repayment_Date as Last_Repayment_Date_t', 'Agreement_Signing_Date as Agreement_Signing_Date_t','Board_Approval_Date \
                         as Board_Approval_Date_t','Effective_Date_Most_Recent as Effective_Date_Most_Recent_t','Closed_Date_Most_Recent as Closed_Date_Most_Recent_t')

In [None]:
# Generate country dataframe 
df_country = df.select('Country_Code','Country','Region').distinct()
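# Add the United Kingdom, which appears only as a guarantor; uppercase so it matches the uppercased Guarantor values in the join
newRow = spark.createDataFrame([('GB','UNITED KINGDOM','EUROPE AND CENTRAL ASIA')])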
df_country = df_country.union(newRow)
# Add a unique id Country_Id for this dataframe
df_country = df_country.withColumn('Country_Id',row_number().over(Window.orderBy(monotonically_increasing_id())))
# Casting to the right data type
fields = {'Country_Code':'string', 'Country':'string', 'Region':'string', 'Country_Id':'integer'}
exprs = [ "cast ({} as {})".format(key,value) for key, value in fields.items()]
df_country = df_country.selectExpr(*exprs)

In [None]:
# Generate Amount dataframe 
df_amount = df.select('Loan_Number','Original_Principal_Amount','Sold_3rd_Party','Repaid_3rd_Party', 'Due_3rd_Party').distinct()
# Add a unique id Amount_Id for this dataframe
df_amount = df_amount.withColumn('Amount_Id',row_number().over(Window.orderBy(monotonically_increasing_id())))

In [None]:
# Generate Loan_Type dataframe 
df_loan_type = df.select('Loan_Type').distinct()
# Add a unique id Loan_Type_Id for this dataframe
df_loan_type = df_loan_type.withColumn('Loan_Type_Id',row_number().over(Window.orderBy(monotonically_increasing_id())))
# Casting to the right data type
fields = {'Loan_Type':'string', 'Loan_Type_Id':'integer'}
exprs = [ "cast ({} as {})".format(key,value) for key, value in fields.items()]
df_loan_type = df_loan_type.selectExpr(*exprs)

In [None]:
# Generate Loan_Status dataframe 
df_loan_status = df.select('Loan_Status').distinct()
# Add a unique id Loan_Status_Id for this dataframe
df_loan_status = df_loan_status.withColumn('Loan_Status_Id',row_number().over(Window.orderBy(monotonically_increasing_id())))
# Casting to the right data type
fields = {'Loan_Status':'string', 'Loan_Status_Id':'integer'}
exprs = [ "cast ({} as {})".format(key,value) for key, value in fields.items()]
df_loan_status = df_loan_status.selectExpr(*exprs)

In [None]:
# create sql tables 
df_country.createOrReplaceTempView("country")
df_time.createOrReplaceTempView("time")
df_amount.createOrReplaceTempView("amount")
df_loan_type.createOrReplaceTempView("loan_type")
df_loan_status.createOrReplaceTempView("loan_status")
df.createOrReplaceTempView("log")

In [None]:
# Generate the transaction dataframe by joining the log table with the other tables
trainsaction = spark.sql("""
    select l.Loan_Number, t.Time_Id, c.Country_Id, cc.Country_Id as Guarantor_Country_Id, lt.Loan_Type, ls.Loan_Status_Id, a.Amount_Id,
    l.End_of_Period, l.Interest_Rate, l.Project_ID, l.Exchange_Adjustment, l.Borrowers_Obligation, l.Cancelled_Amount,
    l.Undisbursed_Amount, l.Disbursed_Amount, l.Repaid_to_IBRD, l.Due_to_IBRD, l.Loans_Held
    from log l 
    join country c on l.Country_Code= c.Country_Code
    join country cc on l.Guarantor = cc.Country 
    join loan_type lt on l.Loan_Type = lt.Loan_Type 
    join loan_status ls on l.Loan_Status = ls.Loan_Status
    join amount a on l.Loan_Number = a.Loan_Number
    join time t on l.Loan_Number = t.Loan_Number
    """)
# Casting to the right data type
fields = {'Loan_Number':'string', 'Time_Id':'integer', 'Country_Id':'integer', 'Guarantor_Country_Id':'integer',\
          'Loan_Type':'string','Loan_Status_Id':'integer', 'Amount_Id':'integer', 'End_of_Period':'timestamp', \
          'Interest_Rate':'float', 'Project_ID':'string', 'Exchange_Adjustment':'float','Borrowers_Obligation':'float',\
          'Cancelled_Amount':'float', 'Undisbursed_Amount':'float', 'Disbursed_Amount':'float','Repaid_to_IBRD':'float', 'Due_to_IBRD':'float', 'Loans_Held':'float'}
exprs = [ "cast ({} as {})".format(key,value) for key, value in fields.items()]
trainsaction = trainsaction.selectExpr(*exprs)

In [None]:
# Drop the Loan_Number column from the amount dataframe
df_amount_final = df_amount.select('Amount_Id','Original_Principal_Amount','Sold_3rd_Party','Repaid_3rd_Party', 'Due_3rd_Party')
# Casting to the right data type
fields = {'Amount_Id':'integer', 'Original_Principal_Amount':'float', 'Sold_3rd_Party':'float', 'Repaid_3rd_Party':'float', 'Due_3rd_Party':'float'}
exprs = [ "cast ({} as {})".format(key,value) for key, value in fields.items()]
df_amount_final = df_amount_final.selectExpr(*exprs)

In [None]:
# Rename the column names
df_time_final = df_time.selectExpr('Time_Id', 'First_Repayment_Date_t as First_Repayment_Date', \
                                   'Last_Repayment_Date_t as Last_Repayment_Date','Agreement_Signing_Date_t as Agreement_Signing_Date',\
                                   'Board_Approval_Date_t as Board_Approval_Date','Effective_Date_Most_Recent_t as Effective_Date_Most_Recent',\
                                   'Closed_Date_Most_Recent_t as Closed_Date_Most_Recent')
# Casting to the right data type
fields = {'Time_Id':'integer', 'First_Repayment_Date':'timestamp', 'Last_Repayment_Date':'timestamp', \
          'Agreement_Signing_Date':'timestamp', 'Board_Approval_Date':'timestamp', 'Effective_Date_Most_Recent':'timestamp', 'Closed_Date_Most_Recent':'timestamp'}
exprs = [ "cast ({} as {})".format(key,value) for key, value in fields.items()]
df_time_final = df_time_final.selectExpr(*exprs)

In [None]:
# Generate a dataframe that contains the counts of each table
a = df_country.count()
b = df_time_final.count()
c = df_amount_final.count()
d = df_loan_type.count()
e = df_loan_status.count()
f = trainsaction.count()
data = {'Table_name':['df_country', 'df_time_final', 'df_amount_final', 'df_loan_type','df_loan_status','trainsaction'], 
        'Count':[a, b, c, d, e, f]}
df_count = pd.DataFrame(data)
print(df_count)

### Saving files to S3 
This process takes about 20 minutes. The files are already saved on AWS, so you can skip this part and go directly to the next one.

In [None]:
# Uncomment to save the data to AWS. The data is already available at udacity-leejohn-w2/loan, so the path here is set to loan_test to avoid an error from writing to an existing path
#start_time = time.time()
#df_country.write.csv("s3a://udacity-leejohn-w2/loan_test/country")
#df_time_final.write.csv("s3a://udacity-leejohn-w2/loan_test/time")
#df_loan_status.write.csv("s3a://udacity-leejohn-w2/loan_test/loan_status")
#df_loan_type.write.csv("s3a://udacity-leejohn-w2/loan_test/loan_type")
#df_amount_final.write.csv("s3a://udacity-leejohn-w2/loan_test/amount")
#trainsaction.write.csv("s3a://udacity-leejohn-w2/loan_test/trainsaction")
#print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
# Run this command to see the file on AWS
s3 = boto3.resource('s3',
                       region_name="us-west-2",
                       aws_access_key_id=KEY,
                       aws_secret_access_key=SECRET
                   )
# List the objects stored under the loan prefix in S3
sampleDbBucket =  s3.Bucket("udacity-leejohn-w2")
for obj in sampleDbBucket.objects.filter(Prefix="loan"):
    print(obj)

### Step 5: Run ETL to load the Data
First, run the script below to create a Redshift cluster. This takes about 5-10 minutes.

Second, when you see "cluster is available", go ahead and run etl.py.

In [2]:
# This script creates a Redshift cluster automatically. Do not move forward until you see "cluster is available!" and "Policy created, you can go ahead!"
%run -i 'Redshift_Create.py'
# This will take 5-10 min

                    Param       Value
0        DWH_CLUSTER_TYPE  multi-node
1           DWH_NUM_NODES           4
2           DWH_NODE_TYPE   dc2.large
3  DWH_CLUSTER_IDENTIFIER    redshift
4                  DWH_DB         dwh
5             DWH_DB_USER     dwhuser
6         DWH_DB_PASSWORD    Passw0rd
7                DWH_PORT        5439
8       DWH_IAM_ROLE_NAME     dwhRole
1.1 Creating a new IAM Role
1.2 Attaching Policy
1.3 Get the IAM role ARN
arn:aws:iam::638983295418:role/dwhRole
1.4 Creating Cluster
We are still working on creating the cluster, approximate 0/20 done!
We are still working on creating the cluster, approximate 1/20 done!
We are still working on creating the cluster, approximate 2/20 done!
We are still working on creating the cluster, approximate 3/20 done!
We are still working on creating the cluster, approximate 4/20 done!
We are still working on creating the cluster, approximate 5/20 done!
We are still working on creating the cluster, approximate 6/20 done!
We 

In [3]:
# This script drops, creates, and loads the tables in the Redshift cluster. You will see "All table dropped! All table created! All table loaded!" if everything works as expected.
%run -i 'etl.py'

All table dropped!
All table created!
All table loaded!


#### Data Quality Checks
First, we check the integrity constraints on the relational database (e.g., unique keys, data types).

Then, we count the rows of each table to make sure no table is empty. This confirms that the scripts do what they should.

Last, we compare the counts between each target table and its Spark dataframe to ensure the data is loaded correctly.
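
The count comparison can also be automated instead of being read off by eye. The sketch below is one possible shape for such a check in plain Python. The helper `compare_counts` and the numbers are hypothetical; in the notebook the expected counts would come from `df_count` and the actual counts from the Redshift queries.

```python
def compare_counts(expected, actual):
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []
    for table, expected_count in expected.items():
        actual_count = actual.get(table)
        if actual_count is None:
            problems.append(f"{table}: missing from target")
        elif actual_count == 0:
            problems.append(f"{table}: target table is empty")
        elif actual_count != expected_count:
            problems.append(
                f"{table}: expected {expected_count} rows, found {actual_count}"
            )
    return problems

# Made-up counts for illustration only.
spark_counts = {"country": 3, "loan_type": 2, "amount": 5}
redshift_counts = {"country": 3, "loan_type": 0}
problems = compare_counts(spark_counts, redshift_counts)
print(problems)
```

Returning the full list of problems, rather than stopping at the first one, shows every table that needs attention in a single run.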

In [4]:
# Perform quality checks here
%load_ext sql
DWH_DB_USER = config.get('CLUSTER','DB_USER')
DWH_DB_PASSWORD = config.get('CLUSTER','DB_PASSWORD')
DWH_ENDPOINT = config.get('CLUSTER','HOST')
DWH_PORT = config.get('CLUSTER','DB_PORT')
DWH_DB = config.get('CLUSTER','DB_NAME')
conn_string="postgresql://{}:{}@{}:{}/{}".format(DWH_DB_USER, DWH_DB_PASSWORD, DWH_ENDPOINT, DWH_PORT,DWH_DB)
print(conn_string)
%sql $conn_string

postgresql://dwhuser:Passw0rd@redshift.cqvdryicbxdk.us-west-2.redshift.amazonaws.com:5439/dwh


'Connected: dwhuser@dwh'

In [None]:
# This is the count of each table from spark dataframe and we can compare this with the count of target table on redshift to make sure everything is loaded correctly.
print(df_count)

In [5]:
%sql select count(*) from amount;

 * postgresql://dwhuser:***@redshift.cqvdryicbxdk.us-west-2.redshift.amazonaws.com:5439/dwh
1 rows affected.


count
8152


In [None]:
%sql select * from PG_TABLE_DEF where schemaname = 'public' and tablename = 'amount'

In [None]:
%sql select count(*) from country;

In [None]:
%sql select * from PG_TABLE_DEF where schemaname = 'public' and tablename = 'country'

In [None]:
%sql select count(*) from loan_status;

In [None]:
%sql select * from PG_TABLE_DEF where schemaname = 'public' and tablename = 'loan_status'

In [None]:
%sql select count(*) from loan_type;

In [None]:
%sql select * from PG_TABLE_DEF where schemaname = 'public' and tablename = 'loan_type'

In [None]:
%sql select count(*) from time;

In [None]:
%sql select * from PG_TABLE_DEF where schemaname = 'public' and tablename = 'time'

In [None]:
%sql select count(*) from transaction;

In [None]:
%sql select * from PG_TABLE_DEF where schemaname = 'public' and tablename = 'transaction'

In [6]:
#### Run to delete the created resources
%run -i 'Redshift_Drop.py'
# This will take 5-10 min
#### CAREFUL!!

                    Param       Value
0  DWH_CLUSTER_TYPE        multi-node
1  DWH_NUM_NODES           4         
2  DWH_NODE_TYPE           dc2.large 
3  DWH_CLUSTER_IDENTIFIER  redshift  
4  DWH_DB                  dwh       
5  DWH_DB_USER             dwhuser   
6  DWH_DB_PASSWORD         Passw0rd  
7  DWH_PORT                5439      
8  DWH_IAM_ROLE_NAME       dwhRole   
We are still working on deleting the cluster!
We are still working on deleting the cluster!
We are still working on deleting the cluster!
We are still working on deleting the cluster!
We are still working on deleting the cluster!
We are still working on deleting the cluster!
We are still working on deleting the cluster!
We are still working on deleting the cluster!
Cluster is dropped, please go ahead!
Role is dropped, please go ahead!


#### Data dictionary 
Every column is coming from the raw data except those unique IDs. For data layout, please see Data Layout.txt file in workspace.

### Step 6: Project Write Up
##### Rationale for the choice of tools and technologies for the project.

In this project, I chose AWS S3, Redshift, and Spark because they are well suited to handling big data in a cloud environment.

AWS S3: a widely used object storage service that accepts any file type and provides an API that easily covers our file transfer requirements.

AWS Redshift: a managed data warehouse built on a massively parallel processing (MPP) architecture. It offers several node types and handles large-scale data sets. Its most important benefit is scalability.

PySpark: easy to learn and use, with a simple and comprehensive API. It also works alongside a wide range of Python libraries such as numpy, pandas, scikit-learn, seaborn, and matplotlib.
It is backed by a large and active community.

#### Below are some scenarios that we might encounter in the future.
##### The data was increased by 100x.

If the data increased by 100x, there are 2 ways to deal with it:
1. Upgrade the AWS Redshift cluster and EC2 instances to get more compute power. This is the easier way, but it costs more.
2. Partition the data by Loan Number and process each chunk in parallel.
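
The second option can be illustrated with a small, self-contained sketch. It assigns rows to buckets by a stable hash of the loan number, so all rows of one loan land in the same bucket, and then processes the buckets concurrently. The rows, the helper names, and the use of threads are assumptions for illustration; in the real pipeline, Spark would do the equivalent work by repartitioning on the `Loan_Number` column and running the partitions on its executors.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def partition_by_loan_number(rows, num_partitions):
    """Group rows into buckets so that every row of a loan lands in the same bucket."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        index = zlib.crc32(row["Loan_Number"].encode("utf-8")) % num_partitions
        buckets[index].append(row)
    return buckets

def process_chunk(chunk):
    """Placeholder for the per-chunk cleaning; here it only totals one column."""
    return sum(row["Disbursed_Amount"] for row in chunk)

# Made-up rows: 100 loans with one record each.
rows = [
    {"Loan_Number": f"IBRD{n:05d}", "Disbursed_Amount": float(n)}
    for n in range(1, 101)
]
chunks = partition_by_loan_number(rows, num_partitions=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_totals = list(pool.map(process_chunk, chunks))
print(sum(partial_totals))
```

Partitioning on the loan number keeps the per-loan steps, such as picking the most frequent date, local to a single chunk.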

##### The data populates a dashboard that must be updated on a daily basis by 7am every day.

We can use Airflow to create a DAG that takes care of the scheduling. This is also a further step I want to work on for this project.

##### The database needed to be accessed by 100+ people.

We can create different users and roles in AWS IAM and assign each user the role that matches the access they need.