<h1 align="center">MSIN0114: Business Analytics Consulting Project</h1>
<h2 align="center">S2R Analytics</h2>

# Table of Contents

**Data engineering (ETL pipeline)**

* [Part 0](#part0): Data extraction

* [Part 1](#part1): Data transformation
    * [1.1](#1_1): Projects
    * [1.2](#1_2): Clients
    * [1.3](#1_3): Stages
    * [1.4](#1_4): Transactions
    * [1.5](#1_5): Data health
    * [1.6](#1_6): Staff
    * [1.7](#1_7): Clean-up    
 <br />
* [Part 2](#part2): Data loading
    * [2.1](#2_1): Database design and storage
    * [2.2](#2_2): Conversion to flat file

## Notebook Setup

In [2]:
#Essentials
import pandas as pd
from pandas import Series, DataFrame
from pandas.api.types import CategoricalDtype
pd.options.display.max_columns = None
import numpy as np; np.random.seed(2022)
import random
import sqlite3
import pyodbc

#Image creation and display
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.patches as mpatches
from matplotlib import pyplot
import plotly.express as px
import plotly.graph_objects as go
#from image import image, display

#Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

#Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.base import clone
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import BaggingClassifier

#Other
import itertools as it
import io
import os
import sys
import glob
import concurrent.futures
import binascii
import struct
from PIL import Image
import scipy
import scipy.misc
import scipy.cluster
import time
import functools, operator
from datetime import datetime # the class is imported directly, so datetime.now() works below

## Part 0: <a class="anchor" id="part0"></a> Data extraction

The main data comes from the API scripts provided by Jonny.
X columns come from Power BI.

## Part 1: <a class="anchor" id="part1"></a> Data transformation

### 1.1 <a class="anchor" id="1_1"></a> Projects (wga.projects)

Step 1: Create a list of valid projects to keep (all other projects are dropped).

In [4]:
# Read all projects from Synergy API
all_projects = pd.read_csv('csv-files/wga_synergy_incremental_projects.csv')
all_projects = all_projects[['Project ID', 'Project Number', 'Project Name', 'Is Office Project', 'Is Billable', 'Project Status']]

# Projects to keep: external (i.e. client only)
external_projects = all_projects[(all_projects['Is Office Project'] != 'Yes')]
external_projects = external_projects[(external_projects['Is Billable'] != 'No')]
external_ids = external_projects['Project ID'].tolist()

# Projects to keep: status-based
successful_projects = external_projects[external_projects['Project Status'].isin(['Complete', 'Active', 'Pending Invoice'])]
valid_ids = successful_projects['Project ID'].tolist()

# See how many unique projects we should have
print('We should have ' + str(len(valid_ids)) + ' projects in total.')

We should have 9755 projects in total.


Step 2: Cleaning data from Synergy API.

In [95]:
# Load only valid projects
api_projects = pd.read_csv('csv-files/wga_synergy_incremental_projects.csv')
api_projects = (api_projects[api_projects['Project ID'].isin(valid_ids)])


# Drop unnecessary columns
api_projects.drop(columns = ['Unnamed: 0', 'Primary Contact Name', 'Status Name', 'Organisation ID',
                             'customFields', 'Address Line 1', 'Address Line 2', 'Project Type ID',
                             'Primary Contact', 'Primary Contact ID', 'Project Scope', 'Address Postal Code',
                             'Address State', 'Address Town', 'Address Google', 'Client Reference Number',
                             'Address State Postal Code Country', 'Address Single Line', 'Project Type Code',
                             'External Name', 'Address Longitude', 'Address Latitude',
                             'Project Forecast Value', 'Created Date', 'Updated Date', 'Manager ID'], inplace = True)


# Convert columns for unified style
api_projects.rename(columns = {'Invoices':'Number of Invoices', 'Project Net Residual (Neg as Zero)':'Project Net Residual',
                              'Start Date (Project)': 'Project Start Date', 'End Date (Project)': 'Project End Date',
                              'Address Country':'Country', 'Project Type': 'Sector'}, inplace = True)
# Assign the result back instead of relying on an in-place replace on a single column
api_projects['Country'] = api_projects['Country'].replace(
    ['AUSTRALIA', 'AUS', 'Autralia', 'NZ', 'new zealand', 'PNG', 'samoa', 'SAMOA', 'TONGA', 'SA', 'CHINA'],
    ['Australia', 'Australia', 'Australia', 'New Zealand', 'New Zealand', 'Papua New Guinea', 'Samoa', 'Samoa', 'Tonga', 'Saudi Arabia', 'China'])
api_projects['Project Start Date'] = pd.to_datetime(api_projects['Project Start Date'])
api_projects['Project End Date'] = pd.to_datetime(api_projects['Project End Date'])


# Combine minority sectors with larger ones
api_projects['Sector'] = api_projects['Sector'].mask(api_projects['Sector'] == 'Commercial', 'Commercial & Retail Buildings')
api_projects['Sector'] = api_projects['Sector'].mask(api_projects['Sector'] == 'Residential', 'Civic & Education Buildings')


# Adding 'Due Date' and 'Project Director' columns
custom_fields = pd.read_csv('csv-files/wga_synergy_incremental_projects_custom_fields.csv')
custom_fields = custom_fields[['PROPOSAL - Due Date', 'PROSPECT - Project Director', 'Project ID']].copy()
custom_fields.rename(columns = {'PROSPECT - Project Director':'Project Director', 'PROPOSAL - Due Date': 'Due Date'}, inplace = True)
custom_fields['Due Date'] = pd.to_datetime(custom_fields['Due Date'])
api_projects = pd.merge(api_projects, custom_fields,  how='left', left_on='Project ID', right_on='Project ID')


# Rearrange column names for easier interpretation
api_projects = api_projects[['Project ID', 'Country',
                             'Project Status', 'Sector',
                             'Project Director', 'Project Manager', 'Office',
                             'Project Start Date', 'Project End Date', 'Due Date',
                             'Default Rate Group','Number of Invoices', 'Project Net Residual']]

api_projects.head(1)
len(api_projects)

9755

Step 3: Cleaning transformed PowerBI data from S2R Analytics.

In [96]:
# Read the pre-transformed data from PowerBI
pbi_projects = pd.read_csv('csv-files/wga_power_bi_projects.csv', encoding = 'ISO-8859-1')
pbi_projects = pbi_projects[['Project ID', 'Project Size Sort Order',
                             'Project Duration (Weeks)', 'Is Multi Discipline Project','Is First Client Project']]

# Load only valid projects
pbi_projects = (pbi_projects[pbi_projects['Project ID'].isin(valid_ids)])

# Convert columns for unified style
pbi_projects.rename(columns = {'Project Duration (Weeks)':'Project Duration Weeks'}, inplace = True)
pbi_projects['Is Multi Discipline Project'] = pbi_projects['Is Multi Discipline Project'].replace(['No', 'Yes'], [False, True])
pbi_projects['Is First Client Project'] = pbi_projects['Is First Client Project'].replace(['No', 'Yes'], [False, True])

pbi_projects.head(1)
len(pbi_projects)

9754

Step 4: Merge the two 'Projects' tables together.

In [97]:
# Merge the projects table from API and preprocessed Power BI table
projects = pd.merge(api_projects, pbi_projects,  how='left', left_on='Project ID', right_on='Project ID')
projects.columns = projects.columns.str.replace(' ', '_')
projects.head(1)
len(projects)

9755
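
The row counts above (9755 from the API, 9754 from Power BI, 9755 after the left merge) can also be checked programmatically. The following is a minimal sketch on made-up toy tables, not the notebook's data: `validate` and `indicator` are standard `pd.merge` options, while the frame names `left`, `right` and `merged` are placeholders chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for api_projects and pbi_projects (values are made up)
left = pd.DataFrame({'Project ID': [1, 2, 3],
                     'Country': ['Australia', 'Samoa', 'Tonga']})
right = pd.DataFrame({'Project ID': [1, 2],
                      'Project Duration Weeks': [4.0, 12.0]})

# validate raises pandas.errors.MergeError if 'Project ID' is duplicated on either side;
# indicator adds a '_merge' column recording where each row was found
merged = pd.merge(left, right, how='left', on='Project ID',
                  validate='one_to_one', indicator=True)

# Projects present in the first table but missing from the second
missing = merged.loc[merged['_merge'] == 'left_only', 'Project ID'].tolist()
print(len(merged), missing)
```

With `validate='one_to_one'`, a merge that would silently duplicate rows fails loudly instead, which makes the repeated `len(projects)` checks less critical.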

**4 features to engineer:**
* Delivered_on_Time
* Fully_In_Lockdown
* Partially_In_Lockdown
* Suffered_Data_Loss (projects that started after July 2018 did not suffer from data loss, projects that ended before July 2018 did not suffer from data loss)
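
As a rough illustration of the three date-based flags, the sketch below applies them to a made-up toy table. It assumes the lockdown window (2020-03-16 to 2020-11-21) and the cut-off date (2018-07-15) used in the cells that follow; the frame `toy` is hypothetical. Unlike the notebook cells, rows with a missing date come out as `False` here rather than missing.

```python
import pandas as pd

# Toy projects (dates are made up for illustration)
toy = pd.DataFrame({
    'Project_Start_Date': pd.to_datetime(['2020-04-01', '2020-01-10', '2018-01-05', None]),
    'Project_End_Date': pd.to_datetime(['2020-06-30', '2020-05-01', '2019-02-01', '2021-01-01']),
})

lockdown_start, lockdown_end = pd.Timestamp('2020-03-16'), pd.Timestamp('2020-11-21')
cutoff = pd.Timestamp('2018-07-15')

# Does each date fall inside the lockdown window? (missing dates give False)
start_in = toy['Project_Start_Date'].between(lockdown_start, lockdown_end)
end_in = toy['Project_End_Date'].between(lockdown_start, lockdown_end)

toy['Fully_In_Lockdown'] = start_in & end_in       # both dates inside the window
toy['Partially_In_Lockdown'] = start_in | end_in   # at least one date inside the window

# Data loss: the project started before the cut-off and ended after it
toy['Suffered_Data_Loss'] = (toy['Project_Start_Date'] < cutoff) & (toy['Project_End_Date'] > cutoff)
print(toy)
```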

In [88]:
# Delivered_on_Time

def nat_check(date):
    # True when the value is a missing timestamp (NaT)
    return pd.isna(date)

# Compare each project's end date with its own due date in one vectorised step.
# Only rows where both dates are present are kept, so the flag stays missing elsewhere.
df_1 = projects.loc[projects['Due_Date'].notna() & projects['Project_End_Date'].notna(),
                    ['Due_Date', 'Project_End_Date']]

# On time = the project ended on or before its due date (assignment aligns on the row index)
projects['Delivered_on_Time'] = df_1['Project_End_Date'] <= df_1['Due_Date']

In [None]:
# Fully_In_Lockdown, Partially_In_Lockdown

Lockdown_Period = (pd.date_range(start='2020-03-16', end = '2020-11-21', freq='D')).to_series()

# Pass the dates themselves to isin(); wrapping the Series in a list does not compare against the individual dates
projects['Start_in_Lockdown'] = projects['Project_Start_Date'].dt.normalize().isin(Lockdown_Period)
projects['End_in_Lockdown'] = projects['Project_End_Date'].dt.normalize().isin(Lockdown_Period)
dates_prep = pd.concat([projects['Start_in_Lockdown'], projects['End_in_Lockdown']], axis = 1) #axis=1 specifies horizontal stacking

projects['Fully_In_Lockdown'] = dates_prep.all(axis=1)
projects['Partially_In_Lockdown'] = dates_prep.any(axis=1)

projects.drop(columns = ['Start_in_Lockdown', 'End_in_Lockdown'], inplace = True)
len(projects)

9755

In [45]:
# Suffered_Data_Loss

def data_loss_check(start_date, end_date):
    if start_date < pd.Timestamp('2018-07-15') and end_date < pd.Timestamp('2018-07-15'): #project started and ended before the acquisition
        return False
    elif start_date > pd.Timestamp('2018-07-15') and end_date > pd.Timestamp('2018-07-15'): #project started and ended after the acquisition
        return False
    elif start_date < pd.Timestamp('2018-07-15') and end_date > pd.Timestamp('2018-07-15'): #project started before the acquisition but ended after it
        return True

Data_Loss_Check = {}

# Pair each start date with the end date of the same project (one pass over the table)
for start_date, end_date in zip(projects['Project_Start_Date'], projects['Project_End_Date']):
    if nat_check(start_date) or nat_check(end_date):
        continue
    key = (start_date.strftime('%Y-%m-%d %H:%M:%S'), end_date.strftime('%Y-%m-%d %H:%M:%S'))
    # Anything other than an explicit True (including None) is stored as False
    Data_Loss_Check[key] = data_loss_check(start_date, end_date) == True

In [46]:
# URL: https://www.geeksforgeeks.org/python-program-to-convert-a-tuple-to-a-string/

def convertTuple(tup):
    string = ', '.join(tup)
    return string

Execution_Timeframe = Data_Loss_Check.copy()

for key in Execution_Timeframe.keys():
    Execution_Timeframe[key] = convertTuple(key)
    
# Changing keys of our final dictionary
Suffered_Data_Loss = dict(zip((Execution_Timeframe.values()), (Data_Loss_Check.values())))

In [47]:
# Create a column in the 'Projects' table to merge on
projects['Execution_Timeframe'] = projects['Project_Start_Date'].map(str) + ', ' + projects['Project_End_Date'].map(str)
projects['Execution_Timeframe'][0]
type(projects['Execution_Timeframe'][0])

str

In [48]:
# Check whether the two future columns will merge
list(Suffered_Data_Loss)[0] == projects['Execution_Timeframe'][0]

True

In [49]:
df_2 = pd.DataFrame.from_dict(Suffered_Data_Loss, orient ='index')
df_2 = df_2.reset_index()
df_2.rename(columns = {'index':'Execution_Timeframe', 0:'Suffered_Data_Loss'}, inplace = True)

projects = projects.merge(df_2, how='left', left_on='Execution_Timeframe', right_on='Execution_Timeframe')
projects.head(1)
len(projects)

9755

### 1.2 <a class="anchor" id="1_2"></a> Clients (wga.clients)

Steps 1-3: Clean the data from the Synergy API and Power BI, then merge the two tables.

In [50]:
# Step 1: Cleaning data from Synergy API.
api_clients = pd.read_csv('csv-files/wga_synergy_overnight_1_clients.csv')
api_clients.drop(columns = ['Client Name', 'Unnamed: 0', 'Contact Type', 'Organisation ID'], inplace = True)
api_clients['Created Date'] = pd.to_datetime(api_clients['Created Date'])

# Step 2: Cleaning transformed PowerBI data from S2R Analytics.
pbi_clients = pd.read_csv('csv-files/wga_power_bi_clients.csv', encoding = 'ISO-8859-1')
pbi_clients = pbi_clients[['Client ID', 'Client Projects - Total No', 'Client Projects - First Project ID']]
pbi_clients.rename(columns = {'Client Projects - Total No': 'Client Projects Total No',
                              'Client Projects - First Project ID':'1st Project ID'}, inplace = True)

# Step 3: Merge the two 'Clients' tables together.
clients = pd.merge(api_clients, pbi_clients,  how='left', left_on='Client ID', right_on='Client ID')
clients.columns = clients.columns.str.replace(' ', '_')
clients.head(1)

Unnamed: 0,Client_ID,Created_Date,Client_Projects_Total_No,1st_Project_ID
0,10317738,2022-05-06,,


**3 features to engineer:**
* Client_Duration_Months
* Client_Is_Repeated
* Client_Is_Recent

In [51]:
# Client_Is_Repeated
clients['Client_Is_Repeated'] = clients['1st_Project_ID'].notnull()

# Client_Duration_Months
clients['Client_Duration_Months'] = datetime.now() - clients['Created_Date']
clients['Client_Duration_Months'] = (clients['Client_Duration_Months'].astype('timedelta64[M]'))
clients['Client_Duration_Months'].isnull().sum() 

0

In [52]:
# Client_Is_Recent

Client_Is_Recent = {}

for months in clients['Client_Duration_Months']:
    if months < 6:
        Client_Is_Recent[months] = True
    else:
        Client_Is_Recent[months] = False
         
df_3 = pd.DataFrame(
    [{'Client_Duration_Months': months, 'Client_Is_Recent': recent_status} for (months, recent_status) in Client_Is_Recent.items()])

clients = clients.merge(df_3,how='left', left_on='Client_Duration_Months', right_on='Client_Duration_Months')
#clients['1st_Project_ID'] = clients['1st_Project_ID'].astype(int)
clients.head(1)

Unnamed: 0,Client_ID,Created_Date,Client_Projects_Total_No,1st_Project_ID,Client_Is_Repeated,Client_Duration_Months,Client_Is_Recent
0,10317738,2022-05-06,,,False,1.0,True
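
The dictionary-and-merge pattern used for `Client_Is_Recent` amounts to a single comparison per row. A minimal sketch on made-up values, assuming the same 6-month threshold as the cell above (the frame `toy_clients` is hypothetical):

```python
import pandas as pd

# Toy clients table (values are made up)
toy_clients = pd.DataFrame({
    'Client_ID': [1, 2, 3],
    'Client_Duration_Months': [1.0, 6.0, 40.0],
    '1st_Project_ID': [None, 367704.0, 368035.0],
})

# A client is "recent" when it was created fewer than 6 months ago
toy_clients['Client_Is_Recent'] = toy_clients['Client_Duration_Months'] < 6

# Same rule as in the notebook: flagged when a first project ID is recorded
toy_clients['Client_Is_Repeated'] = toy_clients['1st_Project_ID'].notnull()
print(toy_clients)
```

The direct comparison avoids the intermediate lookup table and keeps one row per client by construction.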


### 1.3 <a class="anchor" id="1_3"></a> Stages (wga.stages)

In [53]:
# Read only valid projects' stages
stages = pd.read_csv('csv-files/wga_power_bi_stages.csv', encoding = 'ISO-8859-1')
stages = (stages[stages['Project ID'].isin(valid_ids)])
stages = stages[(stages['Stage Type'] != 'Proposal')] # We only want professional fees
stages = stages[(stages['Is Disbursement Stage'] == 'No')] # We only want professional fees
stages = stages[['Project ID', 'Stage ID',
                 'Is Disbursement Stage', 'Stage Fee Type',
                 'Stage Manager', 'Stage Discipline','Stage Start Date','Stage End Date']]

stages['Stage Start Date'] = pd.to_datetime(stages['Stage Start Date'])
stages['Stage End Date'] = pd.to_datetime(stages['Stage End Date'])
stages.columns = stages.columns.str.replace(' ', '_')
len(stages)

53209

**2 features to engineer:**
* Stage_Duration_Weeks
* Perc_of_Stages_with_Fixed_Fee
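
One way to read `Perc_of_Stages_with_Fixed_Fee` is as the share of a project's stages whose fee type is 'Fixed fee'. The sketch below computes that share on a made-up table; the label 'Hourly rates' and the frame `toy_stages` are assumptions for illustration only. Note that this version returns 0 for projects without any fixed-fee stage, whereas the merge-based cells below leave such projects missing.

```python
import pandas as pd

# Toy stages table (IDs and the 'Hourly rates' label are made up)
toy_stages = pd.DataFrame({
    'Project_ID': [1, 1, 1, 2, 2],
    'Stage_ID': [10, 11, 12, 20, 21],
    'Stage_Fee_Type': ['Fixed fee', 'Hourly rates', 'Fixed fee', 'Hourly rates', 'Hourly rates'],
})

# The mean of a boolean column is the fraction of True values
is_fixed = toy_stages['Stage_Fee_Type'] == 'Fixed fee'
perc_fixed = (is_fixed.groupby(toy_stages['Project_ID']).mean() * 100).round(2)
perc_fixed = perc_fixed.rename('Perc_of_Stages_with_Fixed_Fee').reset_index()
print(perc_fixed)
```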

In [54]:
# Stage_Duration_Weeks
df_4 = pd.DataFrame(stages['Stage_Start_Date'].notnull() & stages['Stage_End_Date'].notnull())
df_4.rename(columns = {0:'checker'}, inplace = True)
df_4 = df_4.loc[df_4['checker'] == True]

stages = pd.merge(stages, df_4, left_index=True, right_index=True)
stages['Stage_Duration_Weeks'] = ((stages['Stage_End_Date'] - stages['Stage_Start_Date']).astype('timedelta64[W]'))
stages.drop(columns = 'checker', inplace = True)
len(stages)

20630

In [55]:
#Perc_of_Stages_with_Fixed_Fee
df_5 = pd.DataFrame(stages.groupby(['Project_ID', 'Stage_Fee_Type'])['Stage_ID'].count()).reset_index()
df_6 = pd.DataFrame(df_5.groupby('Project_ID')['Stage_ID'].agg('sum')).reset_index().astype(int)
df_6.rename(columns = {'Stage_ID':'Total_Num_Stages'}, inplace = True)
df_6 = df_5.merge(df_6,how='left', left_on='Project_ID', right_on='Project_ID')
df_6.rename(columns = {'Stage_ID':'Sum_Fixed_Type_Stages'}, inplace = True)
df_7 = df_6[(df_6['Stage_Fee_Type'] == 'Fixed fee')].copy() # copy so the new column is not set on a slice
df_7['Perc_of_Stages_with_Fixed_Fee'] = ((df_7['Sum_Fixed_Type_Stages'] / df_7['Total_Num_Stages']) * 100).round(decimals = 2)
stages = pd.merge(stages, df_7,  how='left', left_on='Project_ID', right_on='Project_ID')
stages.drop(columns = 'Stage_Fee_Type_x', inplace = True)
stages.rename(columns = {'Stage_Fee_Type_y': 'Stage_Fee_Type'}, inplace = True)
stages.head(1)
len(stages)

20630

In [56]:
fixed_fee = stages[['Project_ID', 'Perc_of_Stages_with_Fixed_Fee']]
fixed_fee = fixed_fee.drop_duplicates()
projects = pd.merge(projects, fixed_fee,  how='left', left_on='Project_ID', right_on='Project_ID')
projects.head(1)
len(projects)

9755

### 1.4 <a class="anchor" id="1_4"></a> Transactions (wga.transactions)

In [6]:
# Read only valid projects' transactions from Synergy API.
transactions = pd.read_csv('csv-files/wga_sql_transactions.csv')
transactions = (transactions[transactions['projectId'].isin(valid_ids)])

transactions = transactions[['id', 'projectId', 'stageId', 'transactionTypeId',
                             'rateType', 'status','units','valueTotal',
                             'invoiceValueTotal','actualCostTotal',
                             'targetChargeTotal', 'date']]

transactions.rename(columns = {'id':'Transaction ID', 'projectId':'Project ID',
                               'transactionTypeId': 'Transaction Type',
                               'rateType': 'Rate Type', 'status': 'Status',
                               'stageId': 'Stage ID', 'date':'Date',
                               'invoiceValueTotal': 'Invoice Value Total',
                               'actualCostTotal':'Actual Cost Total',
                               'targetChargeTotal':'Target Charge Total',
                               'valueTotal':'Value Total',
                               'units': 'Units'}, inplace = True)

transactions = transactions[(transactions['Status'] == 'Invoiced') | (transactions['Status'] == 'Written off')]
transactions['Transaction Type'] = transactions['Transaction Type'].replace(
    [100, 200, 300, 400, 500, 700, 750, 800],
    ['Time', 'Cash', 'Travel', 'Office', 'Bill', 'Balance', 'Unearned', 'Invoice Custom'])
transactions['Date'] = pd.to_datetime(transactions['Date'])
transactions.columns = transactions.columns.str.replace(' ', '_')

transactions.head(1)

Unnamed: 0,Transaction_ID,Project_ID,Stage_ID,Transaction_Type,Rate_Type,Status,Units,Value_Total,Invoice_Value_Total,Actual_Cost_Total,Target_Charge_Total,Date
9,45264071,375028,1427835,Time,Staff,Invoiced,1.5,450.0,491.33,360.735,360.735,2021-08-26


**3 features to engineer:**
* Perc_of_Subcontractors (move to 'Projects' table)
* Is_Front_Loaded (move to 'Projects' table)
* Profitability (move to 'Projects' table) - not in the ETL pipeline, in ML pipeline instead

**Table alterations:**
* FK Time_Profile (links 'wga.projects' table on 'Project_ID')

Perc_of_Subcontractors = 
* total units of subcontractors, divided by
* sum of units where the transaction type is 'Bill' or 'Time'

where:
* 'Time' = the company's employees
* 'Bill' = hired subcontractors
* Time + Bill = total human capital on the project, in hours
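
To make the ratio concrete, here is a small worked sketch on made-up transactions. The frame `toy_tx` and its values are hypothetical; the column names follow the renamed transactions table.

```python
import pandas as pd

# Toy transactions (values are made up)
toy_tx = pd.DataFrame({
    'Project_ID': [1, 1, 1, 2],
    'Transaction_Type': ['Time', 'Bill', 'Cash', 'Time'],
    'Rate_Type': ['Staff', 'Subcontractor', 'Staff', 'Staff'],
    'Units': [30.0, 10.0, 5.0, 8.0],
})

# Denominator: units booked as 'Time' or 'Bill' (total human capital in hours)
labour = toy_tx[toy_tx['Transaction_Type'].isin(['Time', 'Bill'])]
total_units = labour.groupby('Project_ID')['Units'].sum()

# Numerator: units booked at the subcontractor rate
sub_units = toy_tx[toy_tx['Rate_Type'] == 'Subcontractor'].groupby('Project_ID')['Units'].sum()

# Projects without subcontractor units get 0 rather than a missing value
perc_subs = (sub_units.reindex(total_units.index, fill_value=0) / total_units * 100).round(2)
print(perc_subs)
```

For project 1 this gives 10 / (30 + 10) = 25%, and for project 2 it gives 0%.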

In [58]:
# Perc_of_Subcontractors
subs = transactions[['Project_ID', 'Units', 'Rate_Type']]
subs = subs[(subs['Rate_Type'] == 'Subcontractor')]
subs = subs.drop(columns = ['Rate_Type'])
# Sum the units (hours) rather than counting rows, as the definition above requires
subs = pd.DataFrame(subs.groupby(['Project_ID'])['Units'].sum()).reset_index()
subs.rename(columns = {'Units': 'Sub_Hours_Per_Project'}, inplace = True)

total_hours = transactions[['Project_ID', 'Units', 'Transaction_Type']]
total_hours = total_hours[(total_hours['Transaction_Type'] == 'Time') | (total_hours['Transaction_Type'] == 'Bill')]
total_hours = pd.DataFrame(total_hours.groupby(['Project_ID'])['Units'].sum()).reset_index()
total_hours.rename(columns = {'Units': 'Total_Hours_Per_Project'}, inplace = True)

df_8 = pd.merge(projects, subs, how='left', left_on='Project_ID', right_on='Project_ID')
df_9 = pd.merge(df_8, total_hours, how='left', left_on='Project_ID', right_on='Project_ID')
df_9['Sub_Hours_Per_Project'] = df_9['Sub_Hours_Per_Project'].fillna(0)
df_9['Total_Hours_Per_Project'] = df_9['Total_Hours_Per_Project'].fillna(0)
df_9['Perc_of_Subcontractors'] = ((df_9['Sub_Hours_Per_Project'] / df_9['Total_Hours_Per_Project']) * 100).round(decimals = 2)
df_9 = df_9[['Project_ID', 'Perc_of_Subcontractors']]

# Add the new feature to the 'Projects' table
projects = pd.merge(projects, df_9,  how='left', left_on='Project_ID', right_on='Project_ID')

In [59]:
# Is_Front_Loaded
project_dates = projects[['Project_ID', 'Project_Start_Date', 'Project_End_Date']]
df_10 = transactions[['Project_ID', 'Units', 'Date']]
df_10 = pd.merge(df_10, project_dates, how='left', left_on='Project_ID', right_on='Project_ID')

first_half = df_10[(df_10['Date']  < df_10['Project_Start_Date'] + (df_10['Project_End_Date'] - df_10['Project_Start_Date'])/2)] # finding mid-point between 2 dates
first_half = pd.DataFrame(first_half.groupby(['Project_ID'])['Units'].sum()).reset_index()
first_half.rename(columns = {'Units': '1st_Half_Units'}, inplace = True)

total_units = pd.DataFrame(df_10.groupby(['Project_ID'])['Units'].sum()).reset_index()
total_units.rename(columns = {'Units': 'Total_Effort_Units'}, inplace = True)

df_11 = pd.merge(total_units, first_half, how ='left', left_on='Project_ID', right_on='Project_ID')
df_11['Perc_Being_Front'] = df_11['1st_Half_Units']/df_11['Total_Effort_Units']
df_11['Is_Front_Loaded'] = (df_11['Perc_Being_Front']>=0.7)
df_11 = df_11[['Project_ID', 'Is_Front_Loaded']]

# Add the new feature to the 'Projects' table
projects = pd.merge(projects, df_11,  how='left', left_on='Project_ID', right_on='Project_ID')
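
The front-loading rule in the cell above (at least 70% of units logged before the project's mid-point) can be traced on a small made-up example. This is only a sketch: the frame `toy` and all of its values are hypothetical.

```python
import pandas as pd

# Toy transactions already joined with project dates (values are made up)
toy = pd.DataFrame({
    'Project_ID': [1, 1, 1, 2, 2],
    'Date': pd.to_datetime(['2021-01-10', '2021-02-01', '2021-11-01', '2021-03-01', '2021-10-01']),
    'Units': [50.0, 30.0, 20.0, 10.0, 40.0],
    'Project_Start_Date': pd.to_datetime(['2021-01-01'] * 5),
    'Project_End_Date': pd.to_datetime(['2021-12-31'] * 5),
})

# Mid-point between the start and end date of each project
midpoint = toy['Project_Start_Date'] + (toy['Project_End_Date'] - toy['Project_Start_Date']) / 2

# Keep units logged before the mid-point, count the rest as 0
early_units = toy['Units'].where(toy['Date'] < midpoint, 0)

# Share of effort spent in the first half, per project
share = early_units.groupby(toy['Project_ID']).sum() / toy.groupby('Project_ID')['Units'].sum()
is_front_loaded = share >= 0.7
print(share, is_front_loaded)
```

Project 1 logs 80 of its 100 units before the mid-point and is flagged; project 2 logs 10 of 50 and is not.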

In [60]:
# Recoverability, Profit_Measure
transactions = transactions[['Project_ID', 'Stage_ID', 'Value_Total', 'Invoice_Value_Total', 'Actual_Cost_Total', 'Target_Charge_Total']]
transactions['Recoverability'] = transactions['Value_Total']/transactions['Target_Charge_Total']
infinites = transactions[(transactions['Recoverability'] == np.inf) | (transactions['Recoverability'] == -np.inf)]
#infinites['Target_Charge_Total'].sum() - shows that all cells in 'Target_Charge_Total' column are equal to 0, creating unwanted infinite values

transactions = transactions[(transactions['Target_Charge_Total'] != 0)]
#transactions['Recoverability'].min(), transactions['Recoverability'].max() - shows that the limits are real numbers, not infinite

transactions['Profit_Measure'] = transactions['Invoice_Value_Total']/transactions['Actual_Cost_Total']
#transactions['Profit_Measure'].min(), transactions['Profit_Measure'].max()  - shows that the limits are real numbers, not infinite
transactions = transactions[['Project_ID', 'Stage_ID', 'Recoverability', 'Profit_Measure']]

#stage_transactions
df_12 = pd.DataFrame(transactions.groupby(['Project_ID', 'Stage_ID'])['Recoverability'].count()).reset_index()
df_12.rename(columns = {'Recoverability':'Count'}, inplace = True)
stage_transactions = pd.DataFrame(transactions.groupby(['Project_ID', 'Stage_ID'])[['Recoverability', 'Profit_Measure']].sum()).reset_index() # select several columns with a list
stage_transactions = pd.merge(stage_transactions, df_12, how='left', on=['Project_ID', 'Stage_ID'])
stage_transactions['Avg_Rec'] = stage_transactions['Recoverability']/stage_transactions['Count']
stage_transactions['Avg_Profit'] = stage_transactions['Profit_Measure']/stage_transactions['Count']
stage_transactions = stage_transactions[['Project_ID', 'Stage_ID', 'Avg_Rec', 'Avg_Profit']]

# project_transactions
df_13 =  pd.DataFrame(transactions.groupby(['Project_ID'])['Recoverability'].count()).reset_index()
df_13.rename(columns = {'Recoverability':'Count'}, inplace = True)
project_transactions = pd.DataFrame(transactions.groupby(['Project_ID'])[['Recoverability', 'Profit_Measure']].sum()).reset_index() # select several columns with a list
project_transactions = pd.merge(project_transactions, df_13, how='left', left_on='Project_ID', right_on='Project_ID')
project_transactions['Avg_Rec'] = project_transactions['Recoverability']/project_transactions['Count']
project_transactions['Avg_Profit'] = project_transactions['Profit_Measure']/project_transactions['Count']
project_transactions = project_transactions[['Project_ID', 'Avg_Rec', 'Avg_Profit']]

# Add the 2 new features to the 'Stages' table
stages = pd.merge(stages, stage_transactions, how='left', left_on=['Project_ID', 'Stage_ID'], right_on = ['Project_ID', 'Stage_ID'])

# Add the 2 new features to the 'Projects' table
projects = pd.merge(projects, project_transactions, how='left', left_on='Project_ID', right_on='Project_ID')

len(stages), len(projects)

(20630, 9755)
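
Dividing a per-group sum by a per-group count, as done for `Avg_Rec` and `Avg_Profit`, is the same as taking the group mean. A minimal sketch on made-up values, using the hypothetical frame `toy_tx`, shows the recoverability half of the calculation:

```python
import pandas as pd

# Toy transactions (values are made up)
toy_tx = pd.DataFrame({
    'Project_ID': [1, 1, 2, 2],
    'Value_Total': [450.0, 100.0, 200.0, 50.0],
    'Target_Charge_Total': [300.0, 100.0, 0.0, 100.0],
})

# Drop zero targets first so the ratio never becomes infinite
toy_tx = toy_tx[toy_tx['Target_Charge_Total'] != 0].copy()
toy_tx['Recoverability'] = toy_tx['Value_Total'] / toy_tx['Target_Charge_Total']

# Average recoverability per project (equivalent to sum / count)
avg_rec = toy_tx.groupby('Project_ID')['Recoverability'].mean().rename('Avg_Rec').reset_index()
print(avg_rec)
```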

### 1.5 <a class="anchor" id="1_5"></a> Data health (wga.health)

In [61]:
# Load only valid projects
health = pd.read_csv('csv-files/wga_power_bi_stages.csv', encoding = 'ISO-8859-1')
health = (health[health['Project ID'].isin(valid_ids)])


# Only leave columns that are relevant
health = health[['Project ID', 'Stage ID',
                 'Data Quality - Has Issues',
                 'Data Quality - Has Inactive Staff Resourced', 
                 'Data Quality - Rate Group', 'Health - % Duration Complete',
                 'Health - % Fee Used', 'Health - Stages With Alerts #']]

# Convert columns for unified style
health.rename(columns = {'Data Quality - Has Issues': 'DQ_Has_Issues',
                         'Data Quality - Has Inactive Staff Resourced':'DQ_Has_Inactive_Staff_Resourced',
                         'Data Quality - Rate Group':'DQ_Rate_Group',
                         'Health - % Duration Complete':'Health_Perc_Duration_Complete',
                         'Health - % Fee Used':'Health_Perc_Fee_Used',
                         'Health - Stages With Alerts #':'Alerts_Total_Per_Stage'}, inplace = True)

health['DQ_Has_Issues'] = health['DQ_Has_Issues'].replace(['No', 'Yes'], [False, True])
health['DQ_Has_Inactive_Staff_Resourced'] = health['DQ_Has_Inactive_Staff_Resourced'].replace(['No', 'Yes'], [False, True])
health.columns = health.columns.str.replace(' ', '_')

health.head(1)
len(health)

60245

In [62]:
checker = health[health['Project_ID'] == 368035]
checker = checker[['Project_ID', 'Stage_ID', 'Alerts_Total_Per_Stage']]
checker

Unnamed: 0,Project_ID,Stage_ID,Alerts_Total_Per_Stage
0,368035,1390483,0
26089,368035,1390482,1
76438,368035,1390484,1
76439,368035,1390485,0
76440,368035,1390486,0


**1 feature to engineer:**
* Total_Data_Issues
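
`Total_Data_Issues` is the sum of stage-level alerts per project. The sketch below reproduces the five stages of project 368035 shown in the check above, using a hypothetical frame `toy_health`:

```python
import pandas as pd

# Stage-level alerts for project 368035, copied from the check above
toy_health = pd.DataFrame({
    'Project_ID': [368035] * 5,
    'Stage_ID': [1390483, 1390482, 1390484, 1390485, 1390486],
    'Alerts_Total_Per_Stage': [0, 1, 1, 0, 0],
})

# Sum the alerts of all stages belonging to the same project
toy_issues = (toy_health.groupby('Project_ID', sort=False)['Alerts_Total_Per_Stage']
              .sum().rename('Total_Data_Issues').reset_index())
print(toy_issues)
```

Two of the five stages carry one alert each, so the project total is 2.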

In [63]:
# Total_Data_Issues: sum the stage-level alerts for each project
issues = health.groupby(['Project_ID'], sort=False)['Alerts_Total_Per_Stage'].sum().reset_index()
issues = issues[['Project_ID', 'Alerts_Total_Per_Stage']]
issues.rename(columns = {'Alerts_Total_Per_Stage':'Total_Data_Issues'}, inplace = True)
projects = projects.merge(issues, how ='left', left_on='Project_ID', right_on='Project_ID')
projects.head(1)
len(projects)

9755

### 1.6 <a class="anchor" id="1_6"></a> Human resources (wga.staff)

In [117]:
staff = pd.read_csv('csv-files/wga_synergy_overnight_1_staff.csv')
staff = staff[['Staff ID', 'Staff Name', 'Employment Date', 'Synergy Team']] #staff['Termination_Date'].nunique() was 0, so we don't include it
staff['Employment Date'] = pd.to_datetime(staff['Employment Date'])
staff.columns = staff.columns.str.replace(' ', '_')
staff.head(1)

Unnamed: 0,Staff_ID,Staff_Name,Employment_Date,Synergy_Team
0,7612683,Mel Chittleborough,2000-09-25,SA - Finance


**2 features to engineer:**
* Employment_Total_Months
* Manager_Is_Recent (move it to 'Projects' table)

In [118]:
# Employment_Total_Months
staff['Employment_Total_Months'] = ((datetime.now() - staff['Employment_Date']).astype('timedelta64[M]'))

In [119]:
# Manager_Is_Recent
managers = projects[['Project_ID', 'Project_Manager', 'Project_Start_Date']]
managers = managers.merge(staff,how='left', left_on='Project_Manager', right_on='Staff_Name')
managers.drop(columns = ['Staff_Name'], inplace = True)
managers['Months_Before_Project'] = (managers['Project_Start_Date'] - managers['Employment_Date']).astype('timedelta64[M]')

Manager_Is_Recent = {}

for months in managers['Months_Before_Project']:
    if np.isnan(months) == True:
        continue
    else:
        if months < 6:
            Manager_Is_Recent[months] = True
        else:
            Manager_Is_Recent[months] = False
        
df_14 = pd.DataFrame([{'Months_Before_Project': months, 'Manager_Is_Recent': recent_status} for (months, recent_status) in Manager_Is_Recent.items()])

managers = managers.merge(df_14, how ='left', left_on='Months_Before_Project', right_on='Months_Before_Project')
managers.head(1)
len(managers)

9755

In [120]:
managers.head(1)

Unnamed: 0,Project_ID,Project_Manager,Project_Start_Date,Staff_ID,Employment_Date,Synergy_Team,Employment_Total_Months,Months_Before_Project,Manager_Is_Recent
0,367704,David McKay,2015-01-07,7612852.0,2012-09-01,SA - Industrial,117.0,28.0,False


In [123]:
managers = managers[['Project_ID', 'Staff_ID', 'Synergy_Team', 'Employment_Total_Months', 'Manager_Is_Recent']]
projects = projects.merge(managers, how ='left', left_on='Project_ID', right_on='Project_ID')
projects.head(1)

Unnamed: 0,Project_ID,Country,Project_Status,Sector,Project_Director,Project_Manager,Office,Project_Start_Date,Project_End_Date,Due_Date,Default_Rate_Group,Number_of_Invoices,Project_Net_Residual,Project_Size_Sort_Order,Project_Duration_Weeks,Is_Multi_Discipline_Project,Is_First_Client_Project,Staff_ID,Synergy_Team,Employment_Total_Months,Manager_Is_Recent
0,367704,Australia,Complete,Ports & Marine,Mark Gilbert,David McKay,Whyalla,2015-01-07,2015-01-08,NaT,Standard,6,0.0,4.0,4.0,False,True,7612852.0,SA - Industrial,117.0,False


### 1.7 <a class="anchor" id="1_7"></a> Clean-up

Dropped 7 columns: Project_Status, Project_Start_Date, Project_End_Date, Due_Date, Number_of_Invoices, Project_Net_Residual, Execution_Timeframe.

In [70]:
# Drop columns unnecessary for analysis and rearrange
projects = projects[['Project_ID', 'Country', 'Office', 'Sector', 'Project_Size_Sort_Order',
                     'Is_Multi_Discipline_Project', 'Is_First_Client_Project',
                     'Default_Rate_Group', 'Perc_of_Stages_with_Fixed_Fee',
                     'Project_Manager', 'Manager_Is_Recent', 'Project_Director', 'Perc_of_Subcontractors',
                     'Project_Duration_Weeks', 'Is_Front_Loaded', 'Delivered_on_Time',
                     'Fully_In_Lockdown','Partially_In_Lockdown',
                     'Suffered_Data_Loss', 'Total_Data_Issues', 'Avg_Rec', 'Avg_Profit']]

projects.head(1)

Unnamed: 0,Project_ID,Country,Office,Sector,Project_Size_Sort_Order,Is_Multi_Discipline_Project,Is_First_Client_Project,Default_Rate_Group,Perc_of_Stages_with_Fixed_Fee,Project_Manager,Manager_Is_Recent,Project_Director,Perc_of_Subcontractors,Project_Duration_Weeks,Is_Front_Loaded,Delivered_on_Time,Fully_In_Lockdown,Partially_In_Lockdown,Suffered_Data_Loss,Total_Data_Issues,Avg_Rec,Avg_Profit
0,367704,Australia,Whyalla,Ports & Marine,4.0,False,True,Standard,100.0,David McKay,False,Mark Gilbert,0.0,4.0,True,,False,False,False,2.0,2.070711,0.991479


In [71]:
%who DataFrame

all_projects	 api_clients	 api_projects	 checker	 clients	 custom_fields	 dates_prep	 df_1	 df_10	 
df_11	 df_12	 df_13	 df_14	 df_2	 df_3	 df_4	 df_5	 df_6	 
df_7	 df_8	 df_9	 external_projects	 first_half	 fixed_fee	 health	 infinites	 issues	 
managers	 pbi_clients	 pbi_projects	 project_dates	 project_transactions	 projects	 staff	 stage_transactions	 stages	 
subs	 successful_projects	 total_hours	 total_units	 transactions	 


In [72]:
# Release all dataframes from Python memory apart from the final ones that go into the WGA schema
del (all_projects, api_clients, api_projects, checker, custom_fields, dates_prep,
     df_1, df_10, df_11, df_12, df_13, df_14, df_2, df_3, df_4, df_5, df_6, df_7, df_8, df_9,
     external_projects, first_half, fixed_fee, health, infinites, issues,
     managers, pbi_clients, pbi_projects, project_dates, project_transactions,
     stage_transactions, subs, successful_projects, total_hours, total_units)

In [73]:
%who DataFrame

clients	 projects	 staff	 stages	 transactions	 


In [None]:
# Save the dataframes in Parquet format

projects.to_parquet('parquet-files/projects.parquet', index=False)
clients.to_parquet('parquet-files/clients.parquet', index=False)
stages.to_parquet('parquet-files/stages.parquet', index=False)
transactions.to_parquet('parquet-files/transactions.parquet', index=False)
staff.to_parquet('parquet-files/staff.parquet', index=False)

## Part 2: <a class="anchor" id="part2"></a> Data loading

### 2.1 <a class="anchor" id="2_1"></a> Database design and storage

### 2.2 <a class="anchor" id="2_2"></a> Conversion to flat file

In [74]:
# Flat file on project level
project_lvl = pd.merge(projects, staff, how='left', left_on='Project_Manager', right_on='Staff_Name')
project_lvl.drop(columns = ['Staff_Name', 'Project_Manager', 'Employment_Date'], inplace = True)
project_lvl

Unnamed: 0,Project_ID,Country,Office,Sector,Project_Size_Sort_Order,Is_Multi_Discipline_Project,Is_First_Client_Project,Default_Rate_Group,Perc_of_Stages_with_Fixed_Fee,Manager_Is_Recent,Project_Director,Perc_of_Subcontractors,Project_Duration_Weeks,Is_Front_Loaded,Delivered_on_Time,Fully_In_Lockdown,Partially_In_Lockdown,Suffered_Data_Loss,Total_Data_Issues,Avg_Rec,Avg_Profit,Staff_ID,Synergy_Team,Employment_Total_Months
0,367704,Australia,Whyalla,Ports & Marine,4.0,False,True,Standard,100.00,False,Mark Gilbert,0.0,4.0,True,,False,False,False,2.0,2.070711,0.991479,7612852.0,SA - Industrial,117.0
1,367705,Australia,WGASA Pty Ltd,Civic & Education Buildings,1.0,False,True,Standard,,,Geoff Wallbridge,0.0,,False,,False,False,,2.0,1.926671,0.000000,7612773.0,SA - Buildings,310.0
2,367706,Australia,WGASA Pty Ltd,Civic & Education Buildings,1.0,False,True,Standard,,,Loreto Taglienti,0.0,,False,,False,False,,0.0,1.936449,1.290966,7612773.0,SA - Buildings,310.0
3,367707,Australia,WGASA Pty Ltd,Commercial & Retail Buildings,3.0,False,True,Standard,,,Mark Gilbert,0.0,,False,,False,False,,1.0,2.053392,1.631426,7612773.0,SA - Buildings,310.0
4,367708,Australia,WGASA Pty Ltd,Civic & Education Buildings,7.0,False,True,Standard,,,Peter McBean,0.0,,False,,False,False,,3.0,1.082860,0.000000,7612695.0,SA - Buildings,420.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9750,1524776,Australia,WGAVIC Pty Ltd,Civic & Education Buildings,1.0,False,False,Vic chargeout FY22,66.67,False,James Brownlie,,1.0,,True,False,False,False,1.0,,,9501752.0,VIC - Traffic,10.0
9751,1529864,Australia,WGAWA Pty Ltd,Civic & Education Buildings,1.0,False,False,WA Chargeout Rates from 1st July 2021,50.00,False,James Davidson,,1.0,,,False,False,False,1.0,,,7809410.0,WA - Civil,20.0
9752,1538446,Australia,WGAVIC Pty Ltd,Energy,1.0,False,False,Vic chargeout FY22,100.00,False,Cameron Jackson (MPD),,3.0,,True,False,False,False,0.0,,,7612860.0,VIC - Buildings,172.0
9753,1538447,Australia,WGAVIC Pty Ltd,Energy,1.0,False,False,Vic chargeout FY22,100.00,False,Cameron Jackson (MPD),,4.0,,True,False,False,False,1.0,,,7612860.0,VIC - Buildings,172.0
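
A left merge on a name column can silently duplicate project rows if a name appears more than once in the staff table. The following is a minimal sketch, using small made-up frames (`projects_demo`, `staff_demo`) rather than the real data, of how the `validate` argument of `pd.merge` could guard against that.

```python
import pandas as pd

# Toy stand-ins for the real projects and staff tables (all values are made up)
projects_demo = pd.DataFrame({
    'Project_ID': [1, 2, 3],
    'Project_Manager': ['Ann Lee', 'Bob Ray', 'Ann Lee'],
})
staff_demo = pd.DataFrame({
    'Staff_Name': ['Ann Lee', 'Bob Ray'],
    'Staff_ID': [101, 102],
})

# validate='many_to_one' raises pandas.errors.MergeError
# if Staff_Name is not unique in the right-hand table
project_lvl_demo = pd.merge(projects_demo, staff_demo, how='left',
                            left_on='Project_Manager', right_on='Staff_Name',
                            validate='many_to_one')

# A many-to-one left merge keeps exactly one row per project
rows_preserved = len(project_lvl_demo) == len(projects_demo)
```

If the check fails on the real data, the duplicated names would need to be resolved (for example by merging on an ID instead of a name) before building the flat file.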


In [75]:
# Flat file on stage level
stage_lvl = pd.merge(project_lvl, clients, how='left', left_on='Project_ID', right_on='1st_Project_ID')
stage_lvl.drop(columns = ['1st_Project_ID', 'Created_Date'], inplace = True)
stage_lvl.rename(columns = {'Staff_ID':'Project_Manager'}, inplace = True)
stages_prep = stages[['Project_ID',  'Stage_ID', 'Total_Num_Stages', 'Is_Disbursement_Stage', 'Stage_Discipline', 'Stage_Duration_Weeks', 'Stage_Fee_Type']]
stage_lvl = pd.merge(stage_lvl, stages_prep, how='left', on='Project_ID')
stage_lvl

Unnamed: 0,Project_ID,Country,Office,Sector,Project_Size_Sort_Order,Is_Multi_Discipline_Project,Is_First_Client_Project,Default_Rate_Group,Perc_of_Stages_with_Fixed_Fee,Manager_Is_Recent,Project_Director,Perc_of_Subcontractors,Project_Duration_Weeks,Is_Front_Loaded,Delivered_on_Time,Fully_In_Lockdown,Partially_In_Lockdown,Suffered_Data_Loss,Total_Data_Issues,Avg_Rec,Avg_Profit,Project_Manager,Synergy_Team,Employment_Total_Months,Client_ID,Client_Projects_Total_No,Client_Is_Repeated,Client_Duration_Months,Client_Is_Recent,Stage_ID,Total_Num_Stages,Is_Disbursement_Stage,Stage_Discipline,Stage_Duration_Weeks,Stage_Fee_Type
0,367704,Australia,Whyalla,Ports & Marine,4.0,False,True,Standard,100.0,False,Mark Gilbert,0.0,4.0,True,,False,False,False,2.0,2.070711,0.991479,7612852.0,SA - Industrial,117.0,7615441.0,6.0,True,146.0,False,1388262.0,1.0,No,Design,0.0,Fixed fee
1,367705,Australia,WGASA Pty Ltd,Civic & Education Buildings,1.0,False,True,Standard,,,Geoff Wallbridge,0.0,,False,,False,False,,2.0,1.926671,0.000000,7612773.0,SA - Buildings,310.0,7615975.0,1.0,True,138.0,False,,,,,,
2,367706,Australia,WGASA Pty Ltd,Civic & Education Buildings,1.0,False,True,Standard,,,Loreto Taglienti,0.0,,False,,False,False,,0.0,1.936449,1.290966,7612773.0,SA - Buildings,310.0,7614333.0,3.0,True,149.0,False,,,,,,
3,367707,Australia,WGASA Pty Ltd,Commercial & Retail Buildings,3.0,False,True,Standard,,,Mark Gilbert,0.0,,False,,False,False,,1.0,2.053392,1.631426,7612773.0,SA - Buildings,310.0,7614635.0,1.0,True,149.0,False,,,,,,
4,367708,Australia,WGASA Pty Ltd,Civic & Education Buildings,7.0,False,True,Standard,,,Peter McBean,0.0,,False,,False,False,,3.0,1.082860,0.000000,7612695.0,SA - Buildings,420.0,7613925.0,13.0,True,149.0,False,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26372,1529864,Australia,WGAWA Pty Ltd,Civic & Education Buildings,1.0,False,False,WA Chargeout Rates from 1st July 2021,50.0,False,James Davidson,,1.0,,,False,False,False,1.0,,,7809410.0,WA - Civil,20.0,,,,,,5998563.0,2.0,No,* NOT SET,5.0,Fixed fee
26373,1529864,Australia,WGAWA Pty Ltd,Civic & Education Buildings,1.0,False,False,WA Chargeout Rates from 1st July 2021,50.0,False,James Davidson,,1.0,,,False,False,False,1.0,,,7809410.0,WA - Civil,20.0,,,,,,5998564.0,2.0,No,* NOT SET,5.0,Fixed fee
26374,1538446,Australia,WGAVIC Pty Ltd,Energy,1.0,False,False,Vic chargeout FY22,100.0,False,Cameron Jackson (MPD),,3.0,,True,False,False,False,0.0,,,7612860.0,VIC - Buildings,172.0,,,,,,6009400.0,1.0,No,* NOT SET,15.0,Fixed fee
26375,1538447,Australia,WGAVIC Pty Ltd,Energy,1.0,False,False,Vic chargeout FY22,100.0,False,Cameron Jackson (MPD),,4.0,,True,False,False,False,1.0,,,7612860.0,VIC - Buildings,172.0,,,,,,6009401.0,1.0,No,* NOT SET,16.0,Fixed fee
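
Rows 26372 to 26375 in the output above have empty client columns, which is what a left merge leaves behind when a `Project_ID` has no matching `1st_Project_ID` in the clients table. As a hedged illustration on made-up data (the `_demo` frames are hypothetical), `indicator=True` makes it possible to count how many rows found no match.

```python
import pandas as pd

# Toy stand-ins for project_lvl and clients (all values are made up)
project_demo = pd.DataFrame({'Project_ID': [10, 11, 12]})
clients_demo = pd.DataFrame({
    '1st_Project_ID': [10, 12],
    'Client_ID': [501, 502],
})

# indicator=True adds a '_merge' column: 'both' for matched rows,
# 'left_only' for projects without a matching client row
merged_demo = pd.merge(project_demo, clients_demo, how='left',
                       left_on='Project_ID', right_on='1st_Project_ID',
                       indicator=True)

match_counts = merged_demo['_merge'].value_counts()
n_unmatched = int((merged_demo['_merge'] == 'left_only').sum())
```

Inspecting `match_counts` on the real tables would show how much of the client information is missing at stage level.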


In [76]:
# Flat file on transaction level
transaction_lvl = pd.merge(stage_lvl, transactions, how='left', on=['Project_ID', 'Stage_ID'])

transaction_lvl = transaction_lvl[['Project_ID', 'Country', 'Office', 'Sector', 'Project_Size_Sort_Order', 'Total_Num_Stages',
'Is_Multi_Discipline_Project', 'Is_First_Client_Project', 'Default_Rate_Group', 'Perc_of_Stages_with_Fixed_Fee',
'Project_Director',  'Project_Manager', 'Synergy_Team', 'Employment_Total_Months', 'Manager_Is_Recent', 'Perc_of_Subcontractors',
'Project_Duration_Weeks', 'Is_Front_Loaded', 'Delivered_on_Time','Fully_In_Lockdown', 'Partially_In_Lockdown',
'Suffered_Data_Loss','Total_Data_Issues',
'Client_ID', 'Client_Projects_Total_No', 'Client_Is_Repeated', 'Client_Duration_Months', 'Client_Is_Recent',
'Stage_ID', 'Stage_Discipline', 'Stage_Duration_Weeks', 'Recoverability', 'Profit_Measure']]
transaction_lvl

Unnamed: 0,Project_ID,Country,Office,Sector,Project_Size_Sort_Order,Total_Num_Stages,Is_Multi_Discipline_Project,Is_First_Client_Project,Default_Rate_Group,Perc_of_Stages_with_Fixed_Fee,Project_Director,Project_Manager,Synergy_Team,Employment_Total_Months,Manager_Is_Recent,Perc_of_Subcontractors,Project_Duration_Weeks,Is_Front_Loaded,Delivered_on_Time,Fully_In_Lockdown,Partially_In_Lockdown,Suffered_Data_Loss,Total_Data_Issues,Client_ID,Client_Projects_Total_No,Client_Is_Repeated,Client_Duration_Months,Client_Is_Recent,Stage_ID,Stage_Discipline,Stage_Duration_Weeks,Recoverability,Profit_Measure
0,367704,Australia,Whyalla,Ports & Marine,4.0,1.0,False,True,Standard,100.0,Mark Gilbert,7612852.0,SA - Industrial,117.0,False,0.0,4.0,True,,False,False,False,2.0,7615441.0,6.0,True,146.0,False,1388262.0,Design,0.0,2.424242,0.000000
1,367704,Australia,Whyalla,Ports & Marine,4.0,1.0,False,True,Standard,100.0,Mark Gilbert,7612852.0,SA - Industrial,117.0,False,0.0,4.0,True,,False,False,False,2.0,7615441.0,6.0,True,146.0,False,1388262.0,Design,0.0,2.424242,0.000000
2,367704,Australia,Whyalla,Ports & Marine,4.0,1.0,False,True,Standard,100.0,Mark Gilbert,7612852.0,SA - Industrial,117.0,False,0.0,4.0,True,,False,False,False,2.0,7615441.0,6.0,True,146.0,False,1388262.0,Design,0.0,2.424242,0.000000
3,367704,Australia,Whyalla,Ports & Marine,4.0,1.0,False,True,Standard,100.0,Mark Gilbert,7612852.0,SA - Industrial,117.0,False,0.0,4.0,True,,False,False,False,2.0,7615441.0,6.0,True,146.0,False,1388262.0,Design,0.0,1.843658,1.284239
4,367704,Australia,Whyalla,Ports & Marine,4.0,1.0,False,True,Standard,100.0,Mark Gilbert,7612852.0,SA - Industrial,117.0,False,0.0,4.0,True,,False,False,False,2.0,7615441.0,6.0,True,146.0,False,1388262.0,Design,0.0,1.728429,1.203816
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
504658,1529864,Australia,WGAWA Pty Ltd,Civic & Education Buildings,1.0,2.0,False,False,WA Chargeout Rates from 1st July 2021,50.0,James Davidson,7809410.0,WA - Civil,20.0,False,,1.0,,,False,False,False,1.0,,,,,,5998563.0,* NOT SET,5.0,,
504659,1529864,Australia,WGAWA Pty Ltd,Civic & Education Buildings,1.0,2.0,False,False,WA Chargeout Rates from 1st July 2021,50.0,James Davidson,7809410.0,WA - Civil,20.0,False,,1.0,,,False,False,False,1.0,,,,,,5998564.0,* NOT SET,5.0,,
504660,1538446,Australia,WGAVIC Pty Ltd,Energy,1.0,1.0,False,False,Vic chargeout FY22,100.0,Cameron Jackson (MPD),7612860.0,VIC - Buildings,172.0,False,,3.0,,True,False,False,False,0.0,,,,,,6009400.0,* NOT SET,15.0,,
504661,1538447,Australia,WGAVIC Pty Ltd,Energy,1.0,1.0,False,False,Vic chargeout FY22,100.0,Cameron Jackson (MPD),7612860.0,VIC - Buildings,172.0,False,,4.0,,True,False,False,False,1.0,,,,,,6009401.0,* NOT SET,16.0,,
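
Each merge onto a finer-grained table multiplies rows, so the transaction-level file is much longer than the stage-level one. The sketch below, on made-up data and assuming the stage table is unique on (`Project_ID`, `Stage_ID`), shows one way to work out the expected row count before merging and compare it with the result.

```python
import pandas as pd

# Toy stand-ins for stage_lvl and transactions (all values are made up)
stage_demo = pd.DataFrame({
    'Project_ID': [1, 1, 2],
    'Stage_ID': [10, 11, 20],
})
transactions_demo = pd.DataFrame({
    'Project_ID': [1, 1, 1],
    'Stage_ID': [10, 10, 11],
    'Recoverability': [1.2, 0.9, 1.5],
})

keys = ['Project_ID', 'Stage_ID']

# Transactions per stage; a stage without transactions still keeps one row in a left merge
per_stage = transactions_demo.groupby(keys).size().rename('n_tx').reset_index()
counts = stage_demo.merge(per_stage, how='left', on=keys)['n_tx'].fillna(0)
expected_rows = int(counts.clip(lower=1).sum())

transaction_demo = pd.merge(stage_demo, transactions_demo, how='left', on=keys)
row_count_matches = len(transaction_demo) == expected_rows
```

A mismatch between the two numbers would point to duplicate keys on the stage side.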


In [77]:
# Save the final dataframes in CSV format
project_lvl.to_csv('csv-files/project_lvl.csv', index=False)
stage_lvl.to_csv('csv-files/stage_lvl.csv', index=False)
transaction_lvl.to_csv('csv-files/transaction_lvl.csv', index=False)
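
`to_csv` does not create missing folders, so the cell above relies on `csv-files/` already existing. A minimal sketch of creating the output folder first and reading the file back as a sanity check; the frame and the file name `demo_lvl.csv` are hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Made-up frame standing in for one of the flat files
demo = pd.DataFrame({'Project_ID': [1, 2], 'Avg_Profit': [0.99, 1.29]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'csv-files'
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    out_path = out_dir / 'demo_lvl.csv'
    demo.to_csv(out_path, index=False)
    round_trip = pd.read_csv(out_path)  # read back to confirm the file is intact

same_shape = round_trip.shape == demo.shape
```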