<h1 align="center">MSIN0114: Business Analytics Consulting Project</h1>
<h2 align="center">S2R Analytics</h2>

# Table of Contents

**Data engineering (ETL pipeline)**

* [Part 0](#part0): Data extraction

* [Part 1](#part1): Data transformation
    * [1.1](#1_1): Projects
    * [1.2](#1_2): Transactions
    * [1.3](#1_3): Stages
    * [1.4](#1_4): Data health
    * [1.5](#1_5): Clients 
    * [1.6](#1_6): Staff
    * [1.7](#1_7): Clean-up    
 <br />
 
* [Part 2](#part2): Data loading
    * [2.1](#2_1): Database design and storage
    * [2.2](#2_2): Conversion to flat file

## Notebook Setup

In [None]:
#Essentials
import pandas as pd
from pandas import Series, DataFrame
from pandas.api.types import CategoricalDtype
pd.options.display.max_columns = None
import numpy as np; np.random.seed(2022)
import random
import sqlite3
import pyodbc

#Image creation and display
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.patches as mpatches
from matplotlib import pyplot
import plotly.express as px
import plotly.graph_objects as go
#from image import image, display

#Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

#Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.base import clone
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import BaggingClassifier

#Other
import itertools as it
import io
import os
os.sys.path
import sys
import glob
import concurrent.futures
import binascii
import struct
from PIL import Image
import scipy
import scipy.misc
import scipy.cluster
import datetime, time
import functools, operator
from datetime import datetime

## Part 0: <a class="anchor" id="part0"></a> Data extraction

The main data comes from the Synergy API extraction scripts provided by Jonny.
X columns come from Power BI.

## Part 1: <a class="anchor" id="part1"></a> Data transformation

### 1.1 <a class="anchor" id="1_1"></a> Projects (wga.projects)

Step 1: Create a list of valid projects to keep (all other projects are dropped).
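
The filter applied in this step can be sketched on a toy table. The rows below are invented for illustration; only the column names and the accepted status values mirror the Synergy export used in this notebook.

```python
import pandas as pd

# Toy stand-in for the Synergy projects export (values are made up).
toy_projects = pd.DataFrame({
    'Project ID': [1, 2, 3, 4],
    'Is Office Project': ['No', 'Yes', 'No', 'No'],
    'Is Billable': ['Yes', 'Yes', 'No', 'Yes'],
    'Project Status': ['Complete', 'Active', 'Active', 'Lost'],
})

keep_statuses = ['Complete', 'Active', 'Pending Invoice']

# Keep external, billable projects whose status is one of the accepted values.
is_valid = (
    (toy_projects['Is Office Project'] != 'Yes')
    & (toy_projects['Is Billable'] != 'No')
    & toy_projects['Project Status'].isin(keep_statuses)
)
toy_valid_ids = toy_projects.loc[is_valid, 'Project ID'].tolist()
print(toy_valid_ids)
```

Combining the three conditions into one boolean mask gives the same rows as filtering in three separate steps, and keeps the rule readable in one place.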

In [None]:
# Read all projects from Synergy API
all_projects = pd.read_csv('csv-files/wga_synergy_incremental_projects.csv')
all_projects = all_projects[['Project ID', 'Project Number', 'Project Name', 'Is Office Project', 'Is Billable', 'Project Status']]

# Projects to keep: external (i.e. client only)
external_projects = all_projects[(all_projects['Is Office Project'] != 'Yes')]
external_projects = external_projects[(external_projects['Is Billable'] != 'No')]
external_ids = external_projects['Project ID'].tolist()

# Projects to keep: status-based
successful_projects = external_projects[external_projects['Project Status'].isin(['Complete', 'Active', 'Pending Invoice'])]
valid_ids = successful_projects['Project ID'].tolist()

# See how many unique projects we should have
print('We should have ' + str(len(valid_ids)) + ' projects in total.')

Step 2: Cleaning data from Synergy API.

In [None]:
# Load only valid projects
api_projects = pd.read_csv('csv-files/wga_synergy_incremental_projects.csv')
api_projects = api_projects[api_projects['Project ID'].isin(valid_ids)].copy()


# Drop unnecessary columns
api_projects.drop(columns = ['Unnamed: 0', 'Primary Contact Name', 'Status Name', 'Organisation ID',
                             'customFields', 'Address Line 1', 'Address Line 2', 'Project Type ID',
                             'Primary Contact', 'Primary Contact ID', 'Project Scope', 'Address Postal Code',
                             'Address State', 'Address Town', 'Address Google', 'Client Reference Number',
                             'Address State Postal Code Country', 'Address Single Line', 'Project Type Code',
                             'External Name', 'Address Longitude', 'Address Latitude',
                             'Project Forecast Value', 'Created Date', 'Updated Date', 'Manager ID'], inplace = True)


# Convert columns for unified style
api_projects.rename(columns = {'Invoices':'Number of Invoices', 'Project Net Residual (Neg as Zero)':'Project Net Residual',
                              'Start Date (Project)': 'Project Start Date', 'End Date (Project)': 'Project End Date',
                              'Address Country':'Country', 'Project Type': 'Sector'}, inplace = True)
# Assign the result back: in-place replace on a selected column is not reliable in recent pandas
api_projects['Country'] = api_projects['Country'].replace(
    ['AUSTRALIA', 'AUS', 'Autralia', 'NZ', 'new zealand', 'PNG', 'samoa', 'SAMOA', 'TONGA', 'SA', 'CHINA'],
    ['Australia', 'Australia', 'Australia', 'New Zealand', 'New Zealand', 'Papua New Guinea', 'Samoa', 'Samoa', 'Tonga', 'Saudi Arabia', 'China'])
api_projects['Project Start Date'] = pd.to_datetime(api_projects['Project Start Date'])
api_projects['Project End Date'] = pd.to_datetime(api_projects['Project End Date'])


# Generalise minority observations into bigger groups
api_projects['Sector'] = api_projects['Sector'].replace({'Commercial': 'Commercial & Retail Buildings',
                                                         'Residential': 'Civic & Education Buildings'})
api_projects['Default Rate Group'] = api_projects['Default Rate Group'].mask(api_projects['Default Rate Group'] != 'Standard', 'Non-standard')


# Adding 'Due Date' and 'Project Director' columns
custom_fields = pd.read_csv('csv-files/wga_synergy_incremental_projects_custom_fields.csv')
custom_fields = custom_fields[['PROPOSAL - Due Date', 'PROSPECT - Project Director', 'Project ID']].copy()
custom_fields.rename(columns = {'PROSPECT - Project Director':'Project Director', 'PROPOSAL - Due Date': 'Due Date'}, inplace = True)
custom_fields['Due Date'] = pd.to_datetime(custom_fields['Due Date'])
api_projects = pd.merge(api_projects, custom_fields,  how='left', left_on='Project ID', right_on='Project ID')


# Rearrange column names for easier interpretation
api_projects = api_projects[['Project ID', 'Country',
                             'Project Status', 'Sector',
                             'Project Director', 'Project Manager', 'Office',
                             'Project Start Date', 'Project End Date', 'Due Date',
                             'Default Rate Group']]

api_projects.head(1)
len(api_projects)

Step 3: Cleaning transformed PowerBI data from S2R Analytics.

In [None]:
# Read the pre-transformed data from PowerBI
pbi_projects = pd.read_csv('csv-files/wga_power_bi_projects.csv', encoding = 'ISO-8859-1')
pbi_projects = pbi_projects[['Project ID', 'Project Size Sort Order',
                             'Project Duration (Weeks)', 'Is Multi Discipline Project','Is First Client Project']]

# Load only valid projects
pbi_projects = (pbi_projects[pbi_projects['Project ID'].isin(valid_ids)])

# Convert columns for unified style
pbi_projects.rename(columns = {'Project Duration (Weeks)':'Project Duration Weeks'}, inplace = True)
pbi_projects['Is Multi Discipline Project'] = pbi_projects['Is Multi Discipline Project'].replace(['No', 'Yes'], [False, True])
pbi_projects['Is First Client Project'] = pbi_projects['Is First Client Project'].replace(['No', 'Yes'], [False, True])

pbi_projects.head(1)
len(pbi_projects)

Step 4: Merge the two 'Projects' tables together.
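
A left merge keeps every API project even when Power BI has no row for it. The sketch below uses invented rows and adds the `validate` argument of `pd.merge` as an optional safeguard; the notebook itself does not use it.

```python
import pandas as pd

# Invented rows; only the column names mirror the notebook.
toy_api = pd.DataFrame({'Project ID': [1, 2, 3], 'Sector': ['Health', 'Civic', 'Retail']})
toy_pbi = pd.DataFrame({'Project ID': [1, 2], 'Project Duration Weeks': [10, 52]})

# validate='one_to_one' raises pandas.errors.MergeError if either side
# repeats a 'Project ID', so accidental row duplication is caught early.
toy_merged = pd.merge(toy_api, toy_pbi, how='left', on='Project ID', validate='one_to_one')

print(len(toy_merged))                                    # same row count as toy_api
print(toy_merged['Project Duration Weeks'].isna().sum())  # projects missing from Power BI
```

Checking the row count after each merge is a cheap way to confirm that the 'Projects' table still has one row per project.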

In [None]:
# Merge the projects table from API and preprocessed Power BI table
projects = pd.merge(api_projects, pbi_projects,  how='left', left_on='Project ID', right_on='Project ID')
projects.columns = projects.columns.str.replace(' ', '_')


**4 features to engineer:**
* Delivered_on_Time
* Fully_In_Lockdown
* Partially_In_Lockdown
* Suffered_Data_Loss (projects that started after July 2018 did not suffer from data loss, projects that ended before July 2018 did not suffer from data loss)

These features are engineered after the transactions table has been processed, because transaction dates are used to fill in the missing start and end dates of projects.
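
The Suffered_Data_Loss rule listed above can be sketched as a single vectorised comparison: only projects that were running across the cutoff are flagged. The exact cutoff of 15 July 2018 is an assumption taken from the commented-out draft further down in this notebook, and the project rows are invented.

```python
import pandas as pd

CUTOFF = pd.Timestamp('2018-07-15')  # assumed cutoff date

toy = pd.DataFrame({
    'Project_Start_Date': pd.to_datetime(['2017-01-01', '2019-02-01', '2018-01-10', None]),
    'Project_End_Date': pd.to_datetime(['2018-03-01', '2019-09-01', '2018-12-20', '2019-01-01']),
})

# Only projects that started before and ended after the cutoff straddle it.
# Comparisons against NaT evaluate to False, so undated projects are not flagged.
toy['Suffered_Data_Loss'] = (toy['Project_Start_Date'] < CUTOFF) & (toy['Project_End_Date'] > CUTOFF)
print(toy['Suffered_Data_Loss'].tolist())
```

Because the comparison is done row by row, each project is checked against its own pair of dates.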

### 1.2 <a class="anchor" id="1_2"></a> Transactions (wga.transactions)

In [None]:
# Read only valid projects' transactions from Synergy API.
transactions = pd.read_csv('csv-files/wga_sql_transactions.csv')
transactions = (transactions[transactions['projectId'].isin(valid_ids)])

transactions = transactions[['id', 'projectId', 'stageId', 'transactionTypeId',
                             'rateType', 'status','units','valueTotal',
                             'invoiceValueTotal','actualCostTotal',
                             'targetChargeTotal', 'date']]

transactions.rename(columns = {'id':'Transaction ID', 'projectId':'Project ID',
                               'transactionTypeId': 'Transaction Type',
                               'rateType': 'Rate Type', 'status': 'Status',
                               'stageId': 'Stage ID', 'date':'Date',
                               'invoiceValueTotal': 'Invoice Value Total',
                               'actualCostTotal':'Actual Cost Total',
                               'targetChargeTotal':'Target Charge Total',
                               'valueTotal':'Value Total',
                               'units': 'Units'}, inplace = True)

transactions = transactions[(transactions['Status'] == 'Invoiced') | (transactions['Status'] == 'Written off')]
transactions['Transaction Type'] = transactions['Transaction Type'].replace(
    [100, 200, 300, 400, 500, 700, 750, 800],
    ['Time', 'Cash', 'Travel', 'Office', 'Bill', 'Balance', 'Unearned', 'Invoice Custom'])
transactions['Date'] = pd.to_datetime(transactions['Date'])
transactions.columns = transactions.columns.str.replace(' ', '_')

transactions.head(1)

**4 features to engineer:**
* Perc_of_Subcontractors (move to 'Projects' table)
* Is_Front_Loaded (move to 'Projects' table)
* Avg_Profit (move to 'Projects' table) - average profit margin per project
* Avg_Rec (move to 'Projects' table) - average financial recoverability per project

**2 features to update:**
* Project_Start_Date, Project_End_Date for projects with absent dates in the 'projects' table.

**Table alterations:**
* FK Time_Profile (links 'wga.projects' table on 'Project_ID')

Perc_of_Subcontractors = 
* total units of subcontractors divided by
* sum of units where transaction type is 'Bill' or 'Time'

where:
* 'Time' = the company's employees
* 'Bill' = hired subcontractors
* Time + Bill = total human capital on the project, in hours
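
As a worked example of this ratio, the sketch below uses a handful of invented transactions. The 'Standard' rate type and all unit values are assumptions made for illustration.

```python
import pandas as pd

toy_tx = pd.DataFrame({
    'Project_ID': [1, 1, 1, 2, 2],
    'Transaction_Type': ['Time', 'Bill', 'Travel', 'Time', 'Time'],
    'Rate_Type': ['Standard', 'Subcontractor', 'Standard', 'Standard', 'Standard'],
    'Units': [30.0, 10.0, 5.0, 8.0, 12.0],
})

# Denominator: hours logged as 'Time' or 'Bill'.
labour = toy_tx[toy_tx['Transaction_Type'].isin(['Time', 'Bill'])]
total_hours = labour.groupby('Project_ID')['Units'].sum()

# Numerator: hours booked at the subcontractor rate.
sub_hours = toy_tx[toy_tx['Rate_Type'] == 'Subcontractor'].groupby('Project_ID')['Units'].sum()

# Projects without subcontractors get 0 hours instead of NaN.
share = (sub_hours.reindex(total_hours.index, fill_value=0) / total_hours).round(2)
print(share.to_dict())
```

Project 1 has 10 subcontractor hours out of 40 labour hours, so its share is 0.25; project 2 has none, so its share is 0.0. The 'Travel' row is ignored because it is not labour.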

In [None]:
# Perc_of_Subcontractors
subs = transactions[['Project_ID', 'Units', 'Rate_Type']]
subs = subs[(subs['Rate_Type'] == 'Subcontractor')].drop(columns = ['Rate_Type'])
# Sum the units (hours) per project; count() would only count transaction rows
subs = pd.DataFrame(subs.groupby(['Project_ID'])['Units'].sum()).reset_index()
subs.rename(columns = {'Units': 'Sub_Hours_Per_Project'}, inplace = True)

total_hours = transactions[['Project_ID', 'Units', 'Transaction_Type']]
total_hours = total_hours[(total_hours['Transaction_Type'] == 'Time') | (total_hours['Transaction_Type'] == 'Bill')]
total_hours = pd.DataFrame(total_hours.groupby(['Project_ID'])['Units'].sum()).reset_index()
total_hours.rename(columns = {'Units': 'Total_Hours_Per_Project'}, inplace = True)

df_1 = pd.merge(projects, subs, how='left', on='Project_ID')
df_2 = pd.merge(df_1, total_hours, how='left', on='Project_ID')
df_2['Sub_Hours_Per_Project'] = df_2['Sub_Hours_Per_Project'].fillna(0)
df_2['Total_Hours_Per_Project'] = df_2['Total_Hours_Per_Project'].fillna(0)
df_2['Perc_of_Subcontractors'] = (df_2['Sub_Hours_Per_Project'] / df_2['Total_Hours_Per_Project']).round(decimals = 2)
df_2 = df_2[['Project_ID', 'Perc_of_Subcontractors']]

# Add the new feature to the 'Projects' table
projects = pd.merge(projects, df_2,  how='left', left_on='Project_ID', right_on='Project_ID')

In [None]:
# Is_Front_Loaded
project_dates = projects[['Project_ID', 'Project_Start_Date', 'Project_End_Date']]
df_3 = transactions[['Project_ID', 'Units', 'Date']]
df_3 = pd.merge(df_3, project_dates, how='left', left_on='Project_ID', right_on='Project_ID')

first_half = df_3[(df_3['Date']  < df_3['Project_Start_Date'] + (df_3['Project_End_Date'] - df_3['Project_Start_Date'])/2)] # finding mid-point between 2 dates
first_half = pd.DataFrame(first_half.groupby(['Project_ID'])['Units'].sum()).reset_index()
first_half.rename(columns = {'Units': '1st_Half_Units'}, inplace = True)

total_units = pd.DataFrame(df_3.groupby(['Project_ID'])['Units'].sum()).reset_index()
total_units.rename(columns = {'Units': 'Total_Effort_Units'}, inplace = True)

df_4 = pd.merge(total_units, first_half, how ='left', left_on='Project_ID', right_on='Project_ID')
df_4['Perc_Being_Front'] = df_4['1st_Half_Units']/df_4['Total_Effort_Units']
df_4['Is_Front_Loaded'] = (df_4['Perc_Being_Front']>=0.7)
df_4 = df_4[['Project_ID', 'Is_Front_Loaded']]

# Add the new feature to the 'Projects' table
projects = pd.merge(projects, df_4,  how='left', left_on='Project_ID', right_on='Project_ID')

In [None]:
# Recoverability, Profit_Measure
transactions = transactions[['Project_ID', 'Stage_ID', 'Value_Total', 'Invoice_Value_Total', 'Actual_Cost_Total', 'Target_Charge_Total', 'Date']]
transactions['Recoverability'] = transactions['Value_Total']/transactions['Target_Charge_Total']
infinites = transactions[(transactions['Recoverability'] == np.inf) | (transactions['Recoverability'] == -np.inf)]
#infinites['Target_Charge_Total'].sum() - shows that all cells in 'Target_Charge_Total' column are equal to 0, creating unwanted infinite values

transactions = transactions[(transactions['Target_Charge_Total'] != 0)]
#transactions['Recoverability'].min(), transactions['Recoverability'].max() - shows that the limits are real numbers, not infinite

transactions['Profit_Measure'] = transactions['Invoice_Value_Total']/transactions['Actual_Cost_Total']
#transactions['Profit_Measure'].min(), transactions['Profit_Measure'].max()  - shows that the limits are real numbers, not infinite
transactions = transactions[['Project_ID', 'Stage_ID', 'Date', 'Recoverability', 'Profit_Measure']]

# stage_transactions
df_5 = pd.DataFrame(transactions.groupby(['Project_ID', 'Stage_ID'])['Recoverability'].count()).reset_index()
df_5.rename(columns = {'Recoverability':'Count'}, inplace = True)
stage_transactions = pd.DataFrame(transactions.groupby(['Project_ID', 'Stage_ID'])[['Recoverability', 'Profit_Measure']].sum()).reset_index()
stage_transactions = pd.merge(stage_transactions, df_5, how='left', on=['Project_ID', 'Stage_ID'])
stage_transactions['Avg_Rec'] = stage_transactions['Recoverability']/stage_transactions['Count']
stage_transactions['Avg_Profit'] = stage_transactions['Profit_Measure']/stage_transactions['Count']
stage_transactions = stage_transactions[['Project_ID', 'Stage_ID', 'Avg_Rec', 'Avg_Profit']]

# project_transactions
df_6 =  pd.DataFrame(transactions.groupby(['Project_ID'])['Recoverability'].count()).reset_index()
df_6.rename(columns = {'Recoverability':'Count'}, inplace = True)
project_transactions = pd.DataFrame(transactions.groupby(['Project_ID'])[['Recoverability', 'Profit_Measure']].sum()).reset_index()
project_transactions = pd.merge(project_transactions, df_6, how='left', on='Project_ID')
project_transactions['Avg_Rec'] = project_transactions['Recoverability']/project_transactions['Count']
project_transactions['Avg_Profit'] = project_transactions['Profit_Measure']/project_transactions['Count']
project_transactions = project_transactions[['Project_ID', 'Avg_Rec', 'Avg_Profit']]

# Add the 2 new features to the 'Projects' table
projects = pd.merge(projects, project_transactions, how='left', on='Project_ID')

In [None]:
print('Only ' + str(transactions['Project_ID'].nunique()) + ' projects have transactions recorded, meaning ' + str(len(projects) - transactions['Project_ID'].nunique()) + ' projects will be missing from the transaction tables.')

In [None]:
# Project_Start_Date, Project_End_Date

def nat_check(date):
    # True for a missing date (NaT), False otherwise
    return pd.isna(date)

min_dates =  pd.DataFrame(transactions.groupby(['Project_ID'])['Date'].min()).reset_index()
min_dates.rename(columns = {'Date':'Min_Date'}, inplace = True)
max_dates =  pd.DataFrame(transactions.groupby(['Project_ID'])['Date'].max()).reset_index()
max_dates.rename(columns = {'Date':'Max_Date'}, inplace = True)
all_dates = pd.merge(min_dates, max_dates, how='left', on='Project_ID')
projects = pd.merge(projects, all_dates, how='left', on='Project_ID')


In [None]:
# Fill missing project dates with the first/last transaction date of the project
projects['Project_Start_Date'] = pd.to_datetime(projects['Project_Start_Date']).fillna(projects['Min_Date'])
projects['Project_End_Date'] = pd.to_datetime(projects['Project_End_Date']).fillna(projects['Max_Date'])
projects['Due_Date'] = pd.to_datetime(projects['Due_Date']).fillna(projects['Max_Date'])

projects.drop(columns = ['Min_Date', 'Max_Date'], inplace = True)

Now, let's go back to engineering the date-dependent features, using the newly filled-in values in the 'Projects' table.
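
A minimal sketch of the two ideas involved, on invented rows: missing end dates are filled from the latest transaction date, and the on-time flag is then computed row by row. It assumes that "on time" means the project ended on or before its own due date.

```python
import pandas as pd

toy_projects = pd.DataFrame({
    'Project_ID': [1, 2],
    'Project_End_Date': pd.to_datetime(['2021-06-30', None]),
    'Due_Date': pd.to_datetime(['2021-07-15', '2021-03-01']),
})
toy_tx = pd.DataFrame({
    'Project_ID': [1, 2, 2],
    'Date': pd.to_datetime(['2021-05-01', '2021-02-01', '2021-04-10']),
})

# Latest transaction date per project, aligned to the project rows.
last_tx = toy_projects['Project_ID'].map(toy_tx.groupby('Project_ID')['Date'].max())

# Fill only the missing end dates; recorded dates are left untouched.
toy_projects['Project_End_Date'] = toy_projects['Project_End_Date'].fillna(last_tx)

# Assumption: on time means the project ended on or before its own due date.
toy_projects['Delivered_on_Time'] = toy_projects['Project_End_Date'] <= toy_projects['Due_Date']
print(toy_projects['Delivered_on_Time'].tolist())
```

Project 2 has no recorded end date, so its last transaction (10 April 2021) is used, which is after its due date of 1 March 2021.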

In [None]:
# Delivered_on_Time
# A project counts as on time when it ended on or before its own due date.
# Projects without a due date are left as missing rather than False.
df_7 = projects[['Due_Date', 'Project_End_Date']].copy()
df_7['Delivered_on_Time'] = (df_7['Project_End_Date'] <= df_7['Due_Date']).where(df_7['Due_Date'].notna())

projects['Delivered_on_Time'] = df_7['Delivered_on_Time']
projects.head()

In [None]:
# Fully_In_Lockdown, Partially_In_Lockdown

lockdown_start, lockdown_end = pd.Timestamp('2020-03-16'), pd.Timestamp('2020-11-21')

# between() is inclusive on both ends and returns False for missing dates
projects['Start_in_Lockdown'] = projects['Project_Start_Date'].between(lockdown_start, lockdown_end)
projects['End_in_Lockdown'] = projects['Project_End_Date'].between(lockdown_start, lockdown_end)
dates_prep = projects[['Start_in_Lockdown', 'End_in_Lockdown']]

projects['Fully_In_Lockdown'] = dates_prep.all(axis=1)
projects['Partially_In_Lockdown'] = dates_prep.any(axis=1)

projects.drop(columns = ['Start_in_Lockdown', 'End_in_Lockdown'], inplace = True)

In [None]:
# Suffered_Data_Loss

#def data_loss_check(start_date, end_date):
#    if start_date < pd.Timestamp('2018-07-15') and end_date < pd.Timestamp('2018-07-15'): #project started and ended before the acquisition
#        return False
#    elif start_date > pd.Timestamp('2018-07-15') and end_date > pd.Timestamp('2018-07-15'): #project started and ended after the acquisition
#        return False
#    elif start_date < pd.Timestamp('2018-07-15') and end_date > pd.Timestamp('2018-07-15'): #project started before the acquisition but ended after it
#        return True

#Data_Loss_Check = {}

#for start_date in projects['Project_Start_Date']:
#    for end_date in projects['Project_End_Date']:
#        if (nat_check(start_date) or nat_check(end_date)) == True:
#            continue
#        else:
#            if data_loss_check(start_date, end_date) == True:
#                Data_Loss_Check[start_date.strftime(format = '%Y-%m-%d %H:%M:%S'), end_date.strftime(format = '%Y-%m-%d %H:%M:%S')] = True
#            else:
#                Data_Loss_Check[start_date.strftime(format = '%Y-%m-%d %H:%M:%S'), end_date.strftime(format = '%Y-%m-%d %H:%M:%S')] = False

In [None]:
# URL: https://www.geeksforgeeks.org/python-program-to-convert-a-tuple-to-a-string/#:~:text=There%20are%20various%20approaches%20to%20convert%20a%20tuple,of%20the%20tuple%20and%20convert%20it%20into%20string.

#def convertTuple(tup):
#    string = ', '.join(tup)
#    return string

#Execution_Timeframe = Data_Loss_Check.copy()

#for key in Execution_Timeframe.keys():
#    Execution_Timeframe[key] = convertTuple(key)
    
# Changing keys of our final dictionary
#Suffered_Data_Loss = dict(zip((Execution_Timeframe.values()), (Data_Loss_Check.values())))

In [None]:
# Create a column in 'Projects' table to create merge on
#projects['Execution_Timeframe'] = projects['Project_Start_Date'].map(str) + ', ' + projects['Project_End_Date'].map(str)
#projects['Execution_Timeframe'][0]
#type(projects['Execution_Timeframe'][0])

In [None]:
# Check whether the two future columns will merge
#list(Suffered_Data_Loss)[0] == projects['Execution_Timeframe'][0]

In [None]:
#df_8 = pd.DataFrame.from_dict(Suffered_Data_Loss, orient ='index')
#df_8 = df_8.reset_index()
#df_8.rename(columns = {'index':'Execution_Timeframe', 0:'Suffered_Data_Loss'}, inplace = True)

#projects = pd.merge(projects, df_8, how='left', on='Execution_Timeframe')
#projects

### 1.3 <a class="anchor" id="1_3"></a> Stages (wga.stages)

In [None]:
# Read only valid projects' stages
stages = pd.read_csv('csv-files/wga_power_bi_stages.csv', encoding = 'ISO-8859-1')
stages = (stages[stages['Project ID'].isin(valid_ids)])
stages = stages[(stages['Stage Type'] != 'Proposal')] # We only want professional fees
stages = stages[['Project ID', 'Stage ID', 'Stage Fee Type', 'Is Disbursement Stage',
                 'Stage Manager', 'Stage Discipline','Stage Start Date','Stage End Date']]

stages['Is Disbursement Stage'] = stages['Is Disbursement Stage'].replace(['No', 'Yes'], [False, True])
stages['Stage Start Date'] = pd.to_datetime(stages['Stage Start Date'])
stages['Stage End Date'] = pd.to_datetime(stages['Stage End Date'])
stages.columns = stages.columns.str.replace(' ', '_')

# Add profit and recoverability measures to the 'Stages' table
stages = pd.merge(stages, stage_transactions, how='left', left_on=['Project_ID', 'Stage_ID'], right_on = ['Project_ID', 'Stage_ID'])
stages

In [None]:
stages['Project_ID'].nunique()

**1 feature to engineer:**
* Perc_of_Stages_with_Fixed_Fee
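
The feature is the share of a project's stages whose fee type is 'Fixed fee'. One compact way to express it, sketched on invented stages, is the mean of a boolean column; the 'Hourly rates' label is an assumed example of a non-fixed fee type.

```python
import pandas as pd

toy_stages = pd.DataFrame({
    'Project_ID': [1, 1, 1, 1, 2, 2],
    'Stage_ID': [11, 12, 13, 14, 21, 22],
    'Stage_Fee_Type': ['Fixed fee', 'Fixed fee', 'Hourly rates', 'Fixed fee',
                       'Hourly rates', 'Hourly rates'],
})

# The mean of a boolean column is the share of True values, so projects
# with no fixed-fee stage come out as 0.0 without a separate fill step.
is_fixed = toy_stages['Stage_Fee_Type'].eq('Fixed fee')
fixed_share = is_fixed.groupby(toy_stages['Project_ID']).mean().round(2)
print(fixed_share.to_dict())
```

Note that this variant counts stages with a missing fee type in the denominator, which may differ slightly from counting only stages with a recorded fee type.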

In [None]:
#Perc_of_Stages_with_Fixed_Fee
df_9 = pd.DataFrame(stages.groupby(['Project_ID', 'Stage_Fee_Type'])['Stage_ID'].count()).reset_index()
df_10 = pd.DataFrame(stages.groupby(['Project_ID'])['Stage_Fee_Type'].count()).reset_index()
df_10.rename(columns = {'Stage_Fee_Type':'Total_Num_Stages'}, inplace = True)
df_10 = pd.merge(df_9, df_10, how='left', on='Project_ID')
df_10.rename(columns = {'Stage_ID':'Num_of_Stages_Per_Type'}, inplace = True)
df_11 = df_10[(df_10['Stage_Fee_Type'] == 'Fixed fee')].copy()  # copy so the new column is not written to a view
df_11['Perc_of_Stages_with_Fixed_Fee'] = (df_11['Num_of_Stages_Per_Type'] / df_11['Total_Num_Stages']).round(decimals = 2)
df_11 = df_11[['Project_ID', 'Perc_of_Stages_with_Fixed_Fee']]
all_stages =  pd.merge(df_10, df_11,  how='left', on='Project_ID')
all_stages = all_stages.fillna(0)
test = pd.merge(stages, all_stages,  how='left', on='Project_ID')
test['Perc_of_Stages_with_Fixed_Fee'].isnull().sum()

In [None]:
# Rows where the fixed-fee share is still missing
nulls = test[test['Perc_of_Stages_with_Fixed_Fee'].isnull()]
nulls

In [None]:
# Since the projects with missing values don't have a fixed fee at all, let's fill them with zero
test = test.fillna(0)
test['Perc_of_Stages_with_Fixed_Fee'].isnull().sum()

In [None]:
test = test[['Project_ID', 'Total_Num_Stages', 'Perc_of_Stages_with_Fixed_Fee']]
test.drop_duplicates(inplace = True, ignore_index=True)
test

In [None]:
projects = pd.merge(projects, test,  how ='left', on ='Project_ID')
projects['Perc_of_Stages_with_Fixed_Fee'].isnull().sum()

In [None]:
# Projects where the fixed-fee share is still missing
nulls = projects[projects['Perc_of_Stages_with_Fixed_Fee'].isnull()]
nulls

In [None]:
stages.loc[stages['Project_ID'].isin([15907, 59885, 60137])]

It turns out that 3 projects are not included in the 'stages' table, so their stage attributes are not recorded.
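
A check of this kind can also be done in one step with the `indicator` argument of `pd.merge`, which labels rows that exist on one side only. The sketch below uses invented IDs rather than the real project numbers.

```python
import pandas as pd

toy_projects = pd.DataFrame({'Project_ID': [1, 2, 3, 4]})
toy_stages = pd.DataFrame({'Project_ID': [1, 1, 3], 'Stage_ID': [10, 11, 30]})

# indicator=True adds a '_merge' column; 'left_only' marks projects
# that have no matching row in the stages table.
check = pd.merge(
    toy_projects,
    toy_stages[['Project_ID']].drop_duplicates(),
    how='left', on='Project_ID', indicator=True,
)
missing_ids = check.loc[check['_merge'] == 'left_only', 'Project_ID'].tolist()
print(missing_ids)
```

This lists every project without stages directly, instead of looking up suspected IDs one at a time.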

In [None]:
# Stage_Duration_Weeks
#df_4 = pd.DataFrame(stages['Stage_Start_Date'].notnull() & stages['Stage_End_Date'].notnull())
#df_4.rename(columns = {0:'checker'}, inplace = True)
#df_4 = df_4.loc[df_4['checker'] == True]

#stages = pd.merge(stages, df_4, left_index=True, right_index=True)
#stages['Stage_Duration_Weeks'] = ((stages['Stage_End_Date'] - stages['Stage_Start_Date']).astype('timedelta64[W]'))
#stages.drop(columns = 'checker', inplace = True)

In [None]:
projects['Fully_In_Lockdown'].value_counts()

In [None]:
projects['Partially_In_Lockdown'].value_counts()

In [None]:
projects.drop(columns = ['Fully_In_Lockdown', 'Partially_In_Lockdown'], inplace = True)

### 1.4 <a class="anchor" id="1_4"></a> Data health (wga.health)

In [None]:
# Load only valid projects
health = pd.read_csv('csv-files/wga_power_bi_stages.csv', encoding = 'ISO-8859-1')
health = (health[health['Project ID'].isin(valid_ids)])


# Only leave columns that are relevant
health = health[['Project ID', 'Stage ID',
                 'Data Quality - Has Issues',
                 'Data Quality - Has Inactive Staff Resourced', 
                 'Data Quality - Rate Group', 'Health - % Duration Complete',
                 'Health - % Fee Used', 'Health - Stages With Alerts #']]

# Convert columns for unified style
health.rename(columns = {'Data Quality - Has Issues': 'DQ_Has_Issues',
                         'Data Quality - Has Inactive Staff Resourced':'DQ_Has_Inactive_Staff_Resourced',
                         'Data Quality - Rate Group':'DQ_Rate_Group',
                         'Health - % Duration Complete':'Health_Perc_Duration_Complete',
                         'Health - % Fee Used':'Health_Perc_Fee_Used',
                         'Health - Stages With Alerts #':'Alerts_Total_Per_Stage'}, inplace = True)

health['DQ_Has_Issues'] = health['DQ_Has_Issues'].replace(['No', 'Yes'], [False, True])
health['DQ_Has_Inactive_Staff_Resourced'] = health['DQ_Has_Inactive_Staff_Resourced'].replace(['No', 'Yes'], [False, True])
health.columns = health.columns.str.replace(' ', '_')

health.head(1)
len(health)

In [None]:
checker = health[health['Project_ID'].isin([368035]) == True]
checker = checker[['Project_ID', 'Stage_ID', 'Alerts_Total_Per_Stage']]
checker

**1 feature to engineer:**
* Total_Data_Issues

In [None]:
# Total_Data_Issues: sum of stage-level alerts per project
issues = health.groupby(['Project_ID'], sort=False)['Alerts_Total_Per_Stage'].sum().reset_index()
issues = issues[['Project_ID', 'Alerts_Total_Per_Stage']]
issues.rename(columns = {'Alerts_Total_Per_Stage':'Total_Data_Issues'}, inplace = True)
projects = pd.merge(projects, issues, how ='left', on='Project_ID')
projects

### 1.5 <a class="anchor" id="1_5"></a> Clients (wga.clients)

Step 1: Cleaning all given data, from Synergy API and Power BI.

In [None]:
# Step 1: Cleaning data from Synergy API.
api_clients = pd.read_csv('csv-files/wga_synergy_overnight_1_clients.csv')
api_clients.drop(columns = ['Client Name', 'Unnamed: 0', 'Contact Type', 'Organisation ID'], inplace = True)
api_clients['Created Date'] = pd.to_datetime(api_clients['Created Date'])

# Step 2: Cleaning transformed PowerBI data from S2R Analytics.
pbi_clients = pd.read_csv('csv-files/wga_power_bi_clients.csv', encoding = 'ISO-8859-1')
pbi_clients = pbi_clients[['Client ID', 'Client Projects - Total No', 'Client Projects - First Project ID']]
pbi_clients.rename(columns = {'Client Projects - Total No': 'Client Projects Total No',
                              'Client Projects - First Project ID':'1st Project ID'}, inplace = True)

# Step 3: Merge the two 'Clients' tables together.
clients = pd.merge(api_clients, pbi_clients,  how='left', left_on='Client ID', right_on='Client ID')
clients.columns = clients.columns.str.replace(' ', '_')
clients.head(1)

In [None]:
clients['Client_ID'].nunique()

**3 features to engineer:**
* Client_Duration_Months
* Client_Is_Repeated
* Client_Is_Recent
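
The duration and recency features can be sketched as follows. The reference date is fixed here so the example is reproducible (the notebook uses the current date), the rows are invented, and a month is approximated as 30.44 days.

```python
import pandas as pd

as_of = pd.Timestamp('2022-08-01')  # fixed reference date, chosen for the example
toy_clients = pd.DataFrame({
    'Client_ID': [1, 2, 3],
    'Created_Date': pd.to_datetime(['2022-05-01', '2020-06-15', None]),
})

# Whole months elapsed, approximating a month as 30.44 days.
elapsed_days = (as_of - toy_clients['Created_Date']).dt.days
toy_clients['Client_Duration_Months'] = (elapsed_days / 30.44) // 1

# NaN durations compare as False, so undated clients are not marked recent.
toy_clients['Client_Is_Recent'] = toy_clients['Client_Duration_Months'] < 6
print(toy_clients[['Client_Duration_Months', 'Client_Is_Recent']])
```

Comparing the whole column against the threshold avoids building a lookup dictionary and merging it back, which can duplicate rows when durations are missing.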

In [None]:
# Client_Is_Repeated
clients['Client_Is_Repeated'] = clients['1st_Project_ID'].notnull()

# Client_Duration_Months
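clients['Client_Duration_Months'] = datetime.now() - clients['Created_Date']
# Whole months, approximating a month as 30.44 days ('timedelta64[M]' is not supported by recent pandas)
clients['Client_Duration_Months'] = np.floor(clients['Client_Duration_Months'].dt.days / 30.44)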
clients['Client_Duration_Months'].isnull().sum()

In [None]:
# Client_Is_Recent: clients created less than 6 months ago

clients['Client_Is_Recent'] = clients['Client_Duration_Months'] < 6
df_12 = clients[['Client_Duration_Months', 'Client_Is_Recent']].drop_duplicates()  # duration-to-flag lookup
#clients['1st_Project_ID'] = clients['1st_Project_ID'].astype(int)
clients.head(1)

### 1.6 <a class="anchor" id="1_6"></a> Human resources (wga.staff)

In [None]:
staff = pd.read_csv('csv-files/wga_synergy_overnight_1_staff.csv')
staff = staff[['Staff ID', 'Staff Name', 'Employment Date', 'Synergy Team']] #staff['Termination_Date'].nunique() was 0, so we don't include it
staff['Employment Date'] = pd.to_datetime(staff['Employment Date'])
staff.columns = staff.columns.str.replace(' ', '_')
staff.head(1)

**2 features to engineer:**
* Employment_Total_Months
* Manager_Is_Recent (move it to 'Projects' table)

In [None]:
# Employment_Total_Months
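# Whole months, approximating a month as 30.44 days ('timedelta64[M]' is not supported by recent pandas)
staff['Employment_Total_Months'] = np.floor((datetime.now() - staff['Employment_Date']).dt.days / 30.44)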

In [None]:
# Manager_Is_Recent
managers = projects[['Project_ID', 'Project_Manager', 'Project_Start_Date']]
managers = pd.merge(managers, staff, how='left', left_on='Project_Manager', right_on='Staff_Name')
managers.drop(columns = ['Staff_Name'], inplace = True)
# Whole months between employment and project start, approximating a month as 30.44 days
managers['Months_Before_Project'] = np.floor((managers['Project_Start_Date'] - managers['Employment_Date']).dt.days / 30.44)

# Managers employed for less than 6 months at project start are recent;
# projects with an unknown employment date stay missing
managers['Manager_Is_Recent'] = (managers['Months_Before_Project'] < 6).where(managers['Months_Before_Project'].notna())
df_13 = managers[['Months_Before_Project', 'Manager_Is_Recent']].drop_duplicates()  # duration-to-flag lookup
managers.head(1)
len(managers)

In [None]:
managers.head(1)

In [None]:
managers = managers[['Project_ID',	'Staff_ID', 'Synergy_Team', 'Employment_Total_Months', 'Manager_Is_Recent']]
projects = pd.merge(projects, managers, how ='left', on='Project_ID')

### 1.7 <a class="anchor" id="1_7"></a> Clean-up

Dropped 7 columns: Project_Status, Project_Start_Date, Project_End_Date, Due_Date, Number_of_Invoices, Project_Net_Residual, Execution_Timeframe.

In [None]:
# Drop columns unnecessary for analysis and rearrange
projects = projects[['Project_ID', 'Country', 'Office', 'Sector', 'Project_Size_Sort_Order',
                     'Total_Num_Stages', 'Is_Multi_Discipline_Project', 'Is_First_Client_Project',
                     'Default_Rate_Group', 'Perc_of_Stages_with_Fixed_Fee',
                     'Project_Manager', 'Manager_Is_Recent', 'Project_Director', 'Perc_of_Subcontractors',
                     'Project_Duration_Weeks', 'Is_Front_Loaded', 'Delivered_on_Time',
                     'Total_Data_Issues', 'Avg_Rec', 'Avg_Profit']]

projects.head(1)

In [None]:
%who DataFrame

In [None]:
# Release all dataframes from Python memory apart from final ones that go into the WGA schema
dfs = [all_dates, all_projects, all_stages, api_clients, api_projects, checker, custom_fields, dates_prep, df_1, df_10, df_11, df_12, df_13,
       df_2, df_3, df_4, df_5, df_6, df_7, df_9, external_projects, first_half, health, infinites, issues,
       managers, max_dates, min_dates, nulls, pbi_clients, pbi_projects, project_dates, project_transactions, stage_transactions, subs,
       successful_projects, test, total_hours, total_units]
del all_dates, all_projects, all_stages, api_clients, api_projects, checker, custom_fields, dates_prep, df_1, df_10, df_11, df_12, df_13, df_2, df_3, df_4, df_5, df_6, df_7, df_9, external_projects, first_half, health, infinites, issues, managers,  max_dates, min_dates, nulls, pbi_clients, pbi_projects, project_dates, project_transactions, stage_transactions, subs, successful_projects, test, total_hours,total_units
del dfs

In [None]:
%who DataFrame

In [None]:
 # Save the dataframes in Parquet format

#projects.to_parquet('parquet-files/projects.parquet', index=False)
#clients.to_parquet('parquet-files/clients.parquet', index=False)
#stages.to_parquet('parquet-files/stages.parquet', index=False)
#transactions.to_parquet('parquet-files/transactions.parquet', index=False)
#staff.to_parquet('parquet-files/staff.parquet', index=False)

## Part 2: <a class="anchor" id="part2"></a> Data loading

### 2.1 <a class="anchor" id="2_1"></a> Database design and storage

### 2.2 <a class="anchor" id="2_2"></a> Conversion to flat file

In [None]:
# Flat file on project level
project_lvl = pd.merge(projects, staff, how='left', left_on='Project_Manager', right_on='Staff_Name')
project_lvl.drop(columns = ['Staff_Name', 'Project_Manager', 'Employment_Date'], inplace = True)
project_lvl

In [None]:
# Flat file on stage level

stage_lvl = pd.merge(project_lvl, clients, how='left', left_on='Project_ID', right_on='1st_Project_ID')
stage_lvl.drop(columns = ['1st_Project_ID', 'Created_Date'], inplace = True)
stage_lvl.rename(columns = {'Staff_ID':'Project_Manager'}, inplace = True)
stages_prep = stages[['Project_ID',  'Stage_ID', 'Is_Disbursement_Stage', 'Stage_Discipline', 'Stage_Fee_Type']]
stage_lvl = pd.merge(stage_lvl, stages_prep, how='left', on='Project_ID')
stage_lvl

In [None]:
transactions.head(1)

In [None]:
stage_lvl.head(1)

In [None]:
# Flat file on transaction level
transaction_lvl = pd.merge(stage_lvl, transactions, how='left', on=['Project_ID', 'Stage_ID'])

transaction_lvl = transaction_lvl[['Project_ID', 'Country', 'Office', 'Sector', 'Project_Size_Sort_Order', 'Total_Num_Stages',
'Is_Multi_Discipline_Project', 'Is_First_Client_Project', 'Default_Rate_Group', 'Perc_of_Stages_with_Fixed_Fee',
'Project_Director',  'Project_Manager', 'Synergy_Team', 'Employment_Total_Months', 'Manager_Is_Recent', 'Perc_of_Subcontractors',
'Project_Duration_Weeks', 'Is_Front_Loaded', 'Delivered_on_Time', 'Total_Data_Issues',
'Client_ID', 'Client_Projects_Total_No', 'Client_Is_Repeated', 'Client_Duration_Months', 'Client_Is_Recent',
'Stage_ID', 'Stage_Discipline', 'Recoverability', 'Profit_Measure']]
transaction_lvl

In [None]:
# Save the final dataframes in CSV format
project_lvl.to_csv('csv-files/project_lvl.csv', index=False)
stage_lvl.to_csv('csv-files/stage_lvl.csv', index=False)
transaction_lvl.to_csv('csv-files/transaction_lvl.csv', index=False)