# Client Demographics

## Description
This script sets up a connection to an Amazon Redshift database using SQLAlchemy, configures Pandas display options, and calculates various date and time values. Below is a summary of the code with explanations.



### Importing Modules:

- `create_engine` from `sqlalchemy` to establish a database connection.
- `pandas` (pd) for data manipulation and analysis.
- `numpy` (np) for numerical operations (although not used in this script).
- `json` to handle JSON files.
- `datetime` and `timedelta` from `datetime` to work with dates and times.

### Loading Credentials:

- Reads database credentials from a JSON file located at `/Workspace/Credentials/db_data.json`.

### Database Connection:

- Uses the extracted credentials to create a connection string for Amazon Redshift using `sqlalchemy.create_engine`.
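
Because the connection string embeds the user name and password directly, characters such as `@` or `/` in a password would be read as URL delimiters. The following is a minimal sketch of one way to guard against that using only the standard library. The helper name `build_redshift_url` and the sample values are hypothetical and not part of the original script.

```python
from urllib.parse import quote_plus


def build_redshift_url(user, passwd, host, database, port=5439):
    """Build a SQLAlchemy-style URL, percent-encoding the credentials."""
    # quote_plus escapes characters such as '@', ':' and '/' that would
    # otherwise be interpreted as URL delimiters.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(passwd)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical values, for illustration only
url = build_redshift_url("analyst", "p@ss/word", "example-host", "dev")
print(url)
```

The resulting string could then be passed to `create_engine` in place of the hand-built f-string.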

### Pandas Configuration:

- Configures Pandas to display float values with two decimal places.
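
The display option only changes how floats are rendered, not the values stored in the DataFrame. As a small sketch, the formatter can be tried on its own, without Pandas. The function name `two_decimals` and the sample values are hypothetical.

```python
# Same formatting rule as the lambda passed to
# pd.set_option('display.float_format', ...)
def two_decimals(x):
    return '%.2f' % x


# Made-up sample values
for value in (1234.5, 0.126, 3.0):
    print(two_decimals(value))
```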

### Date Calculations:

- Calculates today's date, yesterday's date, the date 14 days ago, and the current date and time.
- Prints these dates and times for reference.

### Time Calculations:

- Calculates the date and time 30 minutes ago and prints the range from 30 minutes ago to the current time.
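
The date and time values described above can also be derived from a single reference timestamp, so that every string refers to the same instant. This is a sketch under that assumption, not the script's own approach. The function name `reference_dates` is hypothetical, and a fixed timestamp is used so the example is reproducible.

```python
from datetime import datetime, timedelta


def reference_dates(now):
    """Derive every date/time string from one reference timestamp."""
    day_fmt = '%Y-%m-%d'
    ts_fmt = '%Y-%m-%d %H:%M:%S'
    return {
        'today': now.strftime(day_fmt),
        'yesterday': (now - timedelta(days=1)).strftime(day_fmt),
        'last_2_wks': (now - timedelta(days=14)).strftime(day_fmt),
        'now': now.strftime(ts_fmt),
        'last_30_mins': (now - timedelta(minutes=30)).strftime(ts_fmt),
    }


# Fixed timestamp for illustration; in the notebook this would be datetime.today()
dates = reference_dates(datetime(2024, 3, 1, 0, 10, 0))
print(dates['last_30_mins'], 'to', dates['now'])
```

Note that subtracting 30 minutes shortly after midnight rolls back to the previous day, which `timedelta` handles automatically.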

This script provides a basic setup for connecting to a database, configuring Pandas, and performing simple date and time calculations.


In [0]:
from sqlalchemy import create_engine
import pandas as pd 
import numpy as np 
import json
from datetime import datetime, timedelta

# Load database credentials from a JSON file
with open('/Workspace/Credentials/db_data.json', 'r') as fp:
    data = json.load(fp)

# Extract Redshift credentials from the loaded JSON data
host = data['redshift']['host']
user = data['redshift']['user']
passwd = data['redshift']['passwd']
database = data['redshift']['database']

# Create a connection engine to the Redshift database
conn = create_engine(f"postgresql+psycopg2://{user}:{passwd}@{host}:5439/{database}")

# Set Pandas to display floats with two decimal places
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Get today's date and yesterday's date in 'YYYY-MM-DD' format
today = datetime.today().strftime('%Y-%m-%d')
yesterday =  (datetime.today() - timedelta(days = 1)).strftime('%Y-%m-%d')

# Print today's and yesterday's dates
print(today)
print(yesterday)

# Calculate and print the date 14 days ago
last_2_wks = (datetime.today() - timedelta(days = 14)).strftime('%Y-%m-%d')
print('------------------------------------')
print(last_2_wks)

# Print a newline for separation
print('\n')

# Get the current date and time in 'YYYY-MM-DD HH:MM:SS' format
now = datetime.today().strftime('%Y-%m-%d %H:%M:%S')

# Calculate the date and time 30 minutes ago (full, and truncated to 'YYYY-MM-DD HH:MM'), then print the time range
last_30_mins = (datetime.today() - timedelta(minutes = 30)).strftime('%Y-%m-%d %H:%M:%S')
trunc_last_30_mins = (datetime.today() - timedelta(minutes = 30)).strftime('%Y-%m-%d %H:%M')
print(last_30_mins, 'to', now)

## Generating demographic attributes


The script below executes a SQL query to retrieve client demographic data from a Redshift database and loads the results into a Pandas DataFrame. Below is a summary of the code with explanations.

`%%time` is an IPython magic command to measure the execution time of the query.

### SQL Query:

A multi-line string containing the SQL query is defined using triple quotes (`'''`). The query retrieves client demographic data from the `dwh_all_clients` table, and joins it with the `dwh_clients_bvn_data` table on `client_id`.

### SELECT Clause:

- Retrieves distinct client details such as `client_id`, `client_name`, `mobile_number`, `bvn_phone_no`, `email_address`, `state`, and other demographic details from the joined tables.
- Includes a `CASE` statement to classify clients into generational cohorts based on their birth year.
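
The cohort classification can be mirrored in Python to check individual values. This is a rough sketch rather than the query's exact logic: it compares integers instead of strings, and it assigns 1945 to the Silent Generation, which is an assumption about the intended boundary. The function name `generation_from_dob` and the sample date strings are hypothetical; like the query, it only relies on the last four characters being the birth year.

```python
def generation_from_dob(dob):
    """Map a date-of-birth string to a generational cohort label."""
    if dob is None:
        return 'Not Found'
    year = int(dob[-4:])  # same idea as RIGHT(bvn_dob, 4) in the query
    if year <= 1945:      # assumption: 1945 counts as Silent Generation
        return 'Silent Generation'
    if year <= 1964:
        return 'Baby Boomers'
    if year <= 1979:
        return 'Generation X'
    if year <= 1994:
        return 'Millennials'
    if year <= 2012:
        return 'Generation Z'
    return 'Generation Alpha'


# Made-up sample values
for dob in (None, '14-02-1990', '01-01-1945', '30-06-2013'):
    print(dob, '->', generation_from_dob(dob))
```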

### FROM Clause:

- Specifies the `dwh_all_clients` table as the main table.
- Uses a `LEFT OUTER JOIN` to include data from the `dwh_clients_bvn_data` table where available.

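### WHERE Clause:

- Keeps only clients whose `client_status` is not `'closed'`. Rows with a `NULL` status are also dropped, because `NULL != 'closed'` does not evaluate to true.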

### GROUP BY Clause:

- Groups the results by every non-aggregated column so that `MIN(dac.activation_date)` returns the earliest activation date for each unique client record.
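
To see how the join, filter, and grouping interact, here is a small self-contained sketch that runs a trimmed version of the query against an in-memory SQLite database instead of Redshift. The table names match the query, but the columns are reduced and the rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dwh_all_clients (client_id INTEGER, client_status TEXT, activation_date TEXT);
CREATE TABLE dwh_clients_bvn_data (client_id INTEGER, bvn_gender TEXT);
-- Invented rows: client 1 appears twice, client 2 is closed, client 3 has no BVN record
INSERT INTO dwh_all_clients VALUES
    (1, 'active', '2021-03-01'),
    (1, 'active', '2020-01-15'),
    (2, 'closed', '2019-06-30'),
    (3, 'active', '2022-08-09');
INSERT INTO dwh_clients_bvn_data VALUES (1, 'Female');
""")

rows = con.execute("""
SELECT dac.client_id,
       dcbd.bvn_gender AS gender,
       MIN(dac.activation_date) AS date_onboarded
FROM dwh_all_clients dac
LEFT OUTER JOIN dwh_clients_bvn_data dcbd ON dac.client_id = dcbd.client_id
WHERE dac.client_status != 'closed'
GROUP BY dac.client_id, dcbd.bvn_gender
ORDER BY dac.client_id
""").fetchall()

for row in rows:
    print(row)
con.close()
```

Client 1 collapses to one row with its earliest activation date, client 2 is filtered out, and client 3 is kept with a missing gender because the left join finds no matching BVN record.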

### Pandas DataFrame:

- `pd.read_sql_query` executes the SQL query and loads the result into a Pandas DataFrame `rcdem`.
- `conn` is the database connection engine created earlier.

### Execution Time Measurement:

- The `%%time` magic command outputs the time taken to execute the query and load the data into the DataFrame.
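
`%%time` is only available inside IPython or a notebook. When the same code runs as a plain Python script, a comparable wall-clock measurement can be sketched with `time.perf_counter`. The summation below is just a stand-in for the query and DataFrame load.

```python
import time

start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for the query and DataFrame load
elapsed = time.perf_counter() - start

print(f"Wall time: {elapsed:.3f} s")
```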


In [0]:
%%time

rcdem = pd.read_sql_query(f'''
-- CLIENT DEMOGRAPH

SELECT
    DISTINCT dac.client_id,
    dac.client_name,
    dac.mobile_number,
    dcbd.bvn_phone_no AS bvn_phone_no,
    dac.email_address,
    dac.state,
    dcbd.bvn_email bvn_email,
    dcbd.bvn,
    dcbd.bvn_gender AS gender,
    dcbd.bvn_dob AS date_of_birth,
    CASE
        WHEN dcbd.bvn_dob IS NULL THEN 'Not Found'
        WHEN RIGHT(dcbd.bvn_dob, 4) <= '1945' THEN 'Silent Generation'
        WHEN RIGHT(dcbd.bvn_dob, 4) BETWEEN '1946' AND '1964' THEN 'Baby Boomers'
        WHEN RIGHT(dcbd.bvn_dob, 4) BETWEEN '1965' AND '1979' THEN 'Generation X'
        WHEN RIGHT(dcbd.bvn_dob, 4) BETWEEN '1980' AND '1994' THEN 'Millennials'
        WHEN RIGHT(dcbd.bvn_dob, 4) BETWEEN '1995' AND '2012' THEN 'Generation Z'
        WHEN RIGHT(dcbd.bvn_dob, 4) > '2012' THEN 'Generation Alpha'
    END AS generation,
    dcbd.bvn_state_of_origin AS state_of_origin,
    dcbd.bvn_state_of_residence AS residence_state,
    dac.client_tier,
    dac.client_category,
   /* rb.client_id AS referral_id,
    rb.client_name AS referral_name,
    rb.referral_code,*/
    MIN(dac.activation_date) AS date_onboarded,
    CURRENT_DATE as run_date
    
FROM
    dwh_all_clients dac
-- LEFT JOIN referred_by rb ON rb.client_id = dac.referred_by_id
LEFT OUTER JOIN dwh_clients_bvn_data dcbd ON dac.client_id = dcbd.client_id
WHERE
    dac.client_status != 'closed' 
GROUP BY
    dac.client_id,
    dac.client_name,
    dac.mobile_number,
    dac.email_address,
    dcbd.bvn_email,
    dac.state,
    dcbd.bvn,
    dcbd.bvn_phone_no,
    dcbd.bvn_gender,
    dcbd.bvn_dob,
    dcbd.bvn_state_of_origin,
    dcbd.bvn_state_of_residence,
    dac.client_tier,
    dac.client_category
/*    rb.client_id,
    rb.client_name,
    rb.referral_code*/
--  LIMIT 1000;



''' , conn)


rcdem


## Load demographic attributes to Redshift table

The script below writes the `rcdem` DataFrame to the Redshift table `dwh_clients_demograph` with a single `to_sql` call. `if_exists='replace'` drops and recreates the table, `chunksize=30000` sends the rows in batches of 30,000, and `method='multi'` packs multiple rows into each `INSERT` statement. The IPython `%%time` magic command measures the execution time of the operation.

### Code Summary

```python
%%time
rcdem.to_sql("dwh_clients_demograph", conn, index=False, if_exists='replace', chunksize=30000, method='multi')

print("run completed successfully on " + now)
```

In [0]:

%%time
rcdem.to_sql("dwh_clients_demograph", conn, index=False, if_exists='replace', chunksize=30000, method='multi')

print("run completed successfully on " + now)



This script writes data from a Pandas DataFrame into an Amazon Redshift database in iterative batches, which helps limit memory usage for large datasets. The first batch replaces the table and later batches append to it; using `if_exists='replace'` for every batch would leave only the last batch in the table.

### Code Summary

```python
# Define the batch size for iterative writing
batch_size = 10000

# Loop through the DataFrame in increments of batch_size
for i in range(0, len(rcdem), batch_size):
    # Extract a batch of rows by position
    rcdem_batch = rcdem.iloc[i:i+batch_size]

    # Replace the table on the first batch, then append the remaining batches
    mode = 'replace' if i == 0 else 'append'
    rcdem_batch.to_sql("dwh_clients_demograph", conn, index=False, if_exists=mode)
```

In [0]:
'''

# Write the data in iterative batches into Redshift
batch_size = 10000
for i in range(0, len(rcdem), batch_size):
    rcdem_batch = rcdem.iloc[i:i+batch_size]
    # Replace the table on the first batch, then append the rest
    mode = 'replace' if i == 0 else 'append'
    rcdem_batch.to_sql("dwh_clients_demograph", conn, index=False, if_exists=mode)
'''
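
The slice boundaries and write mode for each batch can be checked without a database connection. The sketch below shows one way to plan the batches so that the first write replaces the table and later writes append to it. The helper name `batch_plan` and the row counts are hypothetical.

```python
def batch_plan(n_rows, batch_size):
    """Yield (start, stop, if_exists) for each batch of an n_rows DataFrame."""
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        # Only the first batch may replace the table; later ones must append,
        # otherwise each write would discard the batches before it.
        mode = 'replace' if start == 0 else 'append'
        yield start, stop, mode


# Hypothetical sizes: 25 rows written in batches of 10
for start, stop, mode in batch_plan(25, 10):
    print(start, stop, mode)
```

Each tuple maps onto a call of the form `rcdem.iloc[start:stop].to_sql(..., if_exists=mode)`.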



In [0]:
print("run completed successfully on " + now)