In [1]:
import psycopg2
import pandas as pd
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Database connection
conn = psycopg2.connect(
    host=os.getenv('DB_HOST'),
    port=os.getenv('DB_PORT'),
    user=os.getenv('DB_USER'),
    password=os.getenv('DB_PASSWORD'),
    database=os.getenv('DB_NAME')
)

print("✅ Successfully connected to lending database!")

✅ Successfully connected to lending database!


# Practical Exam: Loan Insights

EasyLoan offers a wide range of loan services, including personal loans, car loans, and mortgages.

EasyLoan offers loans to clients from Canada, the United Kingdom, and the United States.

The analytics team wants to report performance across different geographic areas. They aim to identify areas of strength and weakness for the business strategy team.

They need your help to ensure the data is accessible and reliable before they start reporting.


**Database Schema**

The data you need is in the database named `lending`.

![database schema](lending_schema.png)

# Task 1 

The analytics team wants to use the `client` table to create a dashboard for client details. For them to proceed, they need to be sure the data is clean enough to use.

The `client` table below illustrates what the analytics team expects the data types and format to be.

Write an SQL query that returns the `client` table with the specified format, including identifying and cleaning all invalid values. 
-  Your output should be a DataFrame with the name 'client'. Do not modify the `client` table.
-  Note that the DataLab environment formats dates as YYYY-MM-DD-hh-ss-SSS. 

| Column Name       | Description                                                                                   |
|-------------------|-----------------------------------------------------------------------------------------------|
| client_id         | Unique integer (set by the database, can’t take any other value)                              |
| date_of_birth     | Date of birth of the client, as a date                                                        |
| employment_status | Current employment status of the client, either employed or unemployed, as a lowercase string |
| country           | The country where the client resides, either USA, UK or CA, as an uppercase string            |

In [8]:
# Preview client table
query = "SELECT * FROM client LIMIT 10"
df_preview = pd.read_sql_query(query, conn)
df_preview


Unnamed: 0,client_id,date_of_birth,employment_status,country
0,1,1963-07-08T00:00:00.000,unemployed,USA
1,2,1957-02-07T00:00:00.000,unemployed,UK
2,3,1993-02-21T00:00:00.000,Emplouyed,CA
3,4,1978-03-19T00:00:00.000,employed,CA
4,5,2000-10-02T00:00:00.000,Emplouyed,USA
5,6,1974-08-05T00:00:00.000,unemployed,USA
6,7,1980-07-14T00:00:00.000,Emplouyed,UK
7,8,1995-06-24T00:00:00.000,unemployed,USA
8,9,1962-02-21T00:00:00.000,unemployed,USA
9,10,1992-05-28T00:00:00.000,employed,CA


In [9]:
# Task 1: Clean client table
query = """
SELECT
    client_id,
    CAST(date_of_birth AS DATE) AS date_of_birth,
    CASE
        -- Collapse every spelling variant (e.g. 'Emplouyed') onto the two allowed values
        WHEN employment_status ILIKE 'un%' THEN 'unemployed'
        WHEN employment_status ILIKE 'e%'
          OR employment_status ILIKE 'f%'
          OR employment_status ILIKE 'p%' THEN 'employed'
        ELSE NULL
    END AS employment_status,
    country

FROM client;
"""

client = pd.read_sql_query(query, conn)
print(f"\n📊 Cleaned client table: {len(client)} rows")
client.head(10)


📊 Cleaned client table: 300 rows



Unnamed: 0,client_id,date_of_birth,employment_status,country
0,1,1963-07-08,unemployed,USA
1,2,1957-02-07,unemployed,UK
2,3,1993-02-21,employed,CA
3,4,1978-03-19,employed,CA
4,5,2000-10-02,employed,USA
5,6,1974-08-05,unemployed,USA
6,7,1980-07-14,employed,UK
7,8,1995-06-24,unemployed,USA
8,9,1962-02-21,unemployed,USA
9,10,1992-05-28,employed,CA
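
The expected format from the Task 1 table can also be checked programmatically. The following is a minimal sketch, not part of the exam solution: `check_client_format` is a hypothetical helper name, and the two sample frames reuse rows shown in the preview and in the cleaned output above.

```python
import pandas as pd

ALLOWED_STATUS = {"employed", "unemployed"}
ALLOWED_COUNTRY = {"USA", "UK", "CA"}


def check_client_format(df):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if not df["client_id"].is_unique:
        problems.append("client_id contains duplicates")
    # errors="coerce" turns unparseable values into NaT so they can be detected
    if pd.to_datetime(df["date_of_birth"], errors="coerce").isna().any():
        problems.append("date_of_birth has missing or unparseable values")
    # isin() is False for NULLs, so missing values are flagged as well
    if not df["employment_status"].isin(ALLOWED_STATUS).all():
        problems.append("employment_status has unexpected values")
    if not df["country"].isin(ALLOWED_COUNTRY).all():
        problems.append("country has unexpected values")
    return problems


# Rows copied from the raw preview and from the cleaned output
raw_sample = pd.DataFrame({
    "client_id": [1, 2, 3],
    "date_of_birth": ["1963-07-08", "1957-02-07", "1993-02-21"],
    "employment_status": ["unemployed", "unemployed", "Emplouyed"],
    "country": ["USA", "UK", "CA"],
})
clean_sample = raw_sample.assign(
    employment_status=["unemployed", "unemployed", "employed"]
)

print(check_client_format(raw_sample))    # one problem: the misspelled status
print(check_client_format(clean_sample))  # empty list
```

In the notebook, calling `check_client_format(client)` on the query result would run the same checks over all rows rather than the first ten.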


# Task 2

You have been told that there was a problem in the backend system as some of the `repayment_channel` values are missing. 

The missing values are critical to the analysis so they need to be filled in before proceeding.

Luckily, they have discovered a pattern in the missing values:

- Repayment higher than 4000 dollars should be made via `bank account`.
- Repayment lower than 1000 dollars should be made via `mail`.

Write an SQL query that makes the `repayment` table match these criteria.
-  Your output should be a DataFrame with the name 'repayment'. Do not modify the original `repayment` table.
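
The two rules can be prototyped in pandas before writing the SQL. This is a sketch under assumptions: it treats `-` as the missing-value marker (as seen in the preview of the `repayment` table), the first three rows are taken from that preview, and the last two rows (4500.00 and 2500.00) are made-up values added only to exercise the other branches.

```python
import pandas as pd

sample = pd.DataFrame({
    "repayment_amount": [772.19, 410.42, 1675.83, 4500.00, 2500.00],
    "repayment_channel": ["-", "-", "bank account", "-", "-"],
})

missing = sample["repayment_channel"] == "-"
amount = sample["repayment_amount"]

# Series.mask replaces values where the condition is True
filled = sample["repayment_channel"].mask(missing & (amount > 4000), "bank account")
filled = filled.mask(missing & (amount < 1000), "mail")

print(filled.tolist())
# ['mail', 'mail', 'bank account', 'bank account', '-']
```

Note that a missing channel with an amount between 1000 and 4000 stays `-`, because the stated pattern does not cover that range.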

In [10]:
# Preview repayment table
query = "SELECT * FROM repayment LIMIT 10"
df_preview_repayment = pd.read_sql_query(query, conn)
df_preview_repayment


Unnamed: 0,repayment_id,loan_id,repayment_date,repayment_amount,repayment_channel
0,1,357,2022-10-16T00:00:00.000,1675.83,bank account
1,2,805,2023-01-12T00:00:00.000,867.22,debit card
2,3,843,2022-06-02T00:00:00.000,718.83,phone
3,4,243,2022-12-26T00:00:00.000,1620.97,credit card
4,5,991,2023-03-18T00:00:00.000,2182.17,phone
5,6,130,2023-01-31T00:00:00.000,772.19,-
6,7,903,2022-05-23T00:00:00.000,1340.22,bank account
7,8,157,2022-10-11T00:00:00.000,1381.22,credit card
8,9,121,2022-06-21T00:00:00.000,1941.47,credit card
9,10,120,2023-03-31T00:00:00.000,410.42,-


In [11]:
# Task 2: Impute missing repayment_channel values
query = """
SELECT
    repayment_id,
    loan_id,
    repayment_date,
    repayment_amount,
    CASE
        WHEN repayment_channel = '-' AND repayment_amount > 4000 THEN 'bank account'
        WHEN repayment_channel = '-' AND repayment_amount < 1000 THEN 'mail'
        ELSE repayment_channel
    END AS repayment_channel
	
FROM repayment;
"""

repayment = pd.read_sql_query(query, conn)
print(f"\n📊 Repayment table with imputed values: {len(repayment)} rows")
repayment.head(10)


📊 Repayment table with imputed values: 1500 rows



Unnamed: 0,repayment_id,loan_id,repayment_date,repayment_amount,repayment_channel
0,1,357,2022-10-16T00:00:00.000,1675.83,bank account
1,2,805,2023-01-12T00:00:00.000,867.22,debit card
2,3,843,2022-06-02T00:00:00.000,718.83,phone
3,4,243,2022-12-26T00:00:00.000,1620.97,credit card
4,5,991,2023-03-18T00:00:00.000,2182.17,phone
5,6,130,2023-01-31T00:00:00.000,772.19,mail
6,7,903,2022-05-23T00:00:00.000,1340.22,bank account
7,8,157,2022-10-11T00:00:00.000,1381.22,credit card
8,9,121,2022-06-21T00:00:00.000,1941.47,credit card
9,10,120,2023-03-31T00:00:00.000,410.42,mail


# Task 3

Starting on January 1st, 2022, all US clients started to use an online system to sign contracts.

The analytics team wants to analyze the loans for US clients who used the new online system.

Write a query that returns the data for the analytics team. Your output should include `client_id`, `contract_date`, `principal_amount` and `loan_type` columns.

![database schema](lending_schema.png)

In [12]:
# Preview loan table
query = "SELECT * FROM loan LIMIT 10"
df_preview_loan = pd.read_sql_query(query, conn)
df_preview_loan


Unnamed: 0,loan_id,client_id,contract_id,principal_amount,interest_rate,loan_type
0,1,2,359,133143.0,0.02,car
1,2,235,106,154242.0,0.34,personal
2,3,117,120,45256.0,0.19,car
3,4,149,239,70487.0,0.13,car
4,5,11,23,55389.0,0.19,car
5,6,215,240,15580.0,0.28,personal
6,7,8,470,127781.0,0.06,car
7,8,68,63,51017.0,0.04,car
8,9,198,52,110165.0,0.22,personal
9,10,140,236,108734.0,0.14,car


In [15]:
# Task 3: US clients who used the online system (contracts from 2022-01-01 onwards)
query = """
SELECT
    c.client_id,
    ctr.contract_date,
    l.principal_amount,
    l.loan_type
	
FROM loan AS l
JOIN client  AS c   ON c.client_id    = l.client_id
JOIN contract AS ctr ON ctr.contract_id = l.contract_id
	
WHERE c.country='USA'
  AND CAST(ctr.contract_date AS DATE) >= DATE '2022-01-01';
"""

df_us_online = pd.read_sql_query(query, conn)
print(f"\n📊 US clients using online system (2022+): {len(df_us_online)} loans")
df_us_online.head(10)


📊 US clients using online system (2022+): 94 loans



Unnamed: 0,client_id,contract_date,principal_amount,loan_type
0,267,2022-03-08T00:00:00.000,179230.0,personal
1,50,2022-01-13T00:00:00.000,143729.0,mortgage
2,280,2022-01-02T00:00:00.000,171122.0,car
3,79,2022-01-24T00:00:00.000,43784.0,mortgage
4,245,2022-01-03T00:00:00.000,95003.0,mortgage
5,181,2022-02-16T00:00:00.000,45866.0,mortgage
6,194,2022-01-03T00:00:00.000,174800.0,car
7,251,2022-04-14T00:00:00.000,93214.0,personal
8,128,2022-03-27T00:00:00.000,44186.0,personal
9,211,2022-03-18T00:00:00.000,107766.0,car
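
The join-and-filter logic of the Task 3 query can be reproduced on toy data to confirm how the date boundary behaves. This is an illustrative sketch only: all three frames below contain made-up rows, with column names assumed to match the schema used in the query.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the lending database
loan = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "client_id": [10, 11, 12],
    "contract_id": [100, 101, 102],
    "principal_amount": [50000.0, 75000.0, 20000.0],
    "loan_type": ["car", "mortgage", "personal"],
})
client = pd.DataFrame({
    "client_id": [10, 11, 12],
    "country": ["USA", "USA", "UK"],
})
contract = pd.DataFrame({
    "contract_id": [100, 101, 102],
    "contract_date": pd.to_datetime(["2021-12-31", "2022-01-01", "2022-03-08"]),
})

# Inner joins, mirroring the two JOIN clauses in the SQL
merged = loan.merge(client, on="client_id").merge(contract, on="contract_id")

# Same predicate as the WHERE clause: US clients, contracts on or after 2022-01-01
keep = (merged["country"] == "USA") & (
    merged["contract_date"] >= pd.Timestamp("2022-01-01")
)
result = merged.loc[
    keep, ["client_id", "contract_date", "principal_amount", "loan_type"]
]

print(result)
```

Only client 11 survives: client 10 signed one day before the cutoff and client 12 is not in the USA. Because the comparison is `>=`, a contract dated exactly 2022-01-01 is included.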


# Task 4

The business strategy team is considering offering a more competitive rate to the US market. 

The analytics team wants to compare the average interest rates offered by the company for the same loan type in different countries to determine if there are significant differences.

Write a query that returns the data for the analytics team. Your output should include `loan_type`, `country` and `avg_rate` columns.

![database schema](lending_schema.png)

In [14]:
# Task 4: Average interest rate by loan_type and country
query = """
SELECT
    l.loan_type,
    c.country AS country,
    AVG(l.interest_rate) AS avg_rate
	
FROM loan AS l
JOIN client AS c ON c.client_id = l.client_id
	
GROUP BY l.loan_type, c.country
ORDER BY l.loan_type, c.country;
"""

df_avg_rates = pd.read_sql_query(query, conn)
print("\n📊 Average interest rates by loan type and country:")
df_avg_rates


📊 Average interest rates by loan type and country:



Unnamed: 0,loan_type,country,avg_rate
0,car,CA,0.112039
1,car,UK,0.122613
2,car,USA,0.103636
3,mortgage,CA,0.044068
4,mortgage,UK,0.042281
5,mortgage,USA,0.04386
6,personal,CA,0.217253
7,personal,UK,0.198738
8,personal,USA,0.202721
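
For a side-by-side comparison, the long result above can be pivoted so that each loan type is a row and each country a column. The sketch below rebuilds the result from the nine values shown; the `usa_minus_others` column is an illustrative addition, not something the task asks for, and it says nothing about statistical significance.

```python
import pandas as pd

# Values copied from the query output above
avg_rates = pd.DataFrame({
    "loan_type": ["car"] * 3 + ["mortgage"] * 3 + ["personal"] * 3,
    "country": ["CA", "UK", "USA"] * 3,
    "avg_rate": [
        0.112039, 0.122613, 0.103636,
        0.044068, 0.042281, 0.04386,
        0.217253, 0.198738, 0.202721,
    ],
})

# One row per loan type, one column per country
wide = avg_rates.pivot(index="loan_type", columns="country", values="avg_rate")

# Gap between the US rate and the mean of the other two countries
wide["usa_minus_others"] = wide["USA"] - wide[["CA", "UK"]].mean(axis=1)

print(wide.round(4))
```

In the notebook, the same pivot could be applied directly to `df_avg_rates` instead of the hand-copied frame.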
