<h1 align="center">MSIN0114: Business Analytics Consulting Project</h1>
<h2 align="center">S2R Analytics</h2>

# Table of Contents

**Data engineering**

* [Part 0](#part0): Data extraction

* [Part 1](#part1): Data transformation
    * [1.1](#1_1): Projects
    * [1.2](#1_2): Clients
    * [1.3](#1_3): Stages
    * [1.4](#1_4): Transactions
    * [1.5](#1_5): Data health
    * [1.6](#1_6): Human resources
    
* [Part 2](#part2): Data loading
    * [2.1](#2.1): Database design
    * [2.2](#2.2): Data storage

**Predictive analytics**

* [Part 3](#part3): Data splitting and scaling
* [Part 4](#part4): Model training
* [Part 5](#part5): Performance evaluation
* [Part 6](#part6): Feature importance and statistical tests
* [Part 7](#part7): Converting the output
* [Part 8](#part8): Pipeline creation

# Report

## Notebook Setup

In [None]:
#!pip install plotly
#!pip install xgboost

In [None]:
#Essentials
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd
from pandas import Series, DataFrame
from pandas.api.types import CategoricalDtype
pd.options.display.max_columns = None
import numpy as np; np.random.seed(2022)
import random
import sqlite3
import pyodbc

#Image creation
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.patches as mpatches
from matplotlib import pyplot
import plotly.express as px
import plotly.graph_objects as go

#Image display
from IPython.display import Image as image
from IPython.display import display

#Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

#Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.base import clone
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import BaggingClassifier

#Other
import itertools as it
import io
import os
import sys
import glob
import concurrent.futures
import binascii
import struct
from PIL import Image
import scipy
import scipy.misc
import scipy.cluster

## Part 0: <a class="anchor" id="part0"></a> Data extraction

API scripts from Jonny.

## Part 1: <a class="anchor" id="part1"></a> Data transformation

Check whether the database is relational or a flat file. If it is a flat file, proceed to step 3. If it is a relational database (RDB), flatten it, i.e., join its hierarchical tables into a single two-dimensional data table.

URL: https://stackoverflow.com/questions/52122119/create-database-using-python-on-jupyter-notebook#:~:text=read%20the%20CSV%20df%20%3D%20pd.read_csv%20%28%27sample.csv%27%29%20connect,creates%20a%20Any_Database_Name.db%20file%20in%20the%20current%20directory?msclkid=92378564cf7911eca8118bfb618ee4dd

URL: https://sparkbyexamples.com/pyspark/select-columns-from-pyspark-dataframe/#:~:text=You%20can%20select%20the%20single%20or%20multiple%20columns,ways%20to%20select%20single%2C%20multiple%20or%20all%20columns
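
The flattening step described above can be sketched with a left join in pandas. This is only an illustration on two tiny made-up tables (`projects_demo` and `statuses_demo` are hypothetical names, not the WGA files); `validate='many_to_one'` is used so that pandas raises an error if the lookup table contains duplicate keys, which would otherwise silently duplicate project rows.

```python
import pandas as pd

# Hypothetical miniature tables; the real WGA tables are much wider
projects_demo = pd.DataFrame({'Project ID': [1, 2, 3], 'Status ID': [10, 20, 10]})
statuses_demo = pd.DataFrame({'Status ID': [10, 20], 'Project Status': ['Active', 'Complete']})

# A left join keeps every project and attaches the status label as a new column;
# validate raises if 'Status ID' is not unique in the lookup table
flat_demo = projects_demo.merge(statuses_demo, how='left', on='Status ID', validate='many_to_one')
print(flat_demo)
```

Because the join is a left join on a unique key, the flat table has exactly as many rows as the project table, which is what the `len(...)` checks after each merge in this notebook are verifying.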

### 1.1 <a class="anchor" id="1_1"></a> Projects (wga.projects)

In [None]:
# Read the data from Synergy API
api_projects = pd.read_csv('wga_synergy_incremental_projects.csv')


# Drop unnecessary columns
api_projects.drop(columns = ['Unnamed: 0', 'Primary Contact Name', 'Project Status', 'Status Name',
                             'customFields', 'Address Line 1', 'Address Line 2', 'Project Type ID',
                             'Primary Contact', 'Primary Contact ID', 'Project Scope', 'Address Postal Code',
                             'Address State', 'Address Town', 'Address Google', 'Client Reference Number',
                             'Address State Postal Code Country', 'Address Single Line', 'Project Type Code',
                             'External Name', 'Address Longitude', 'Address Latitude',
                             'Project Forecast Value', 'Created Date', 'Updated Date', 'Manager ID'], inplace = True)


# Drop all internal projects
api_projects = api_projects[(api_projects['Is Office Project'] != 'Yes') & (api_projects['Is Billable'] != 'No')]
api_projects.drop(columns = ['Project Number', 'Project Name', 'Is Office Project', 'Is Billable'], inplace = True)


# Drop rows that are not 'Complete', 'Active' or 'Pending Invoice'
projects_status = pd.read_csv('wga_synergy_overnight_1_projects_status.csv')
projects_status.rename(columns = {'Project Status ID': 'Status ID', 'Group': 'Project Status'}, inplace = True)
projects_status.drop(columns = ['Unnamed: 0', 'Status Name', 'Status Type', 'Success Factor'], inplace = True)
api_projects = pd.merge(api_projects, projects_status, how='left', left_on='Status ID', right_on='Status ID')
api_projects.drop(columns = 'Status ID', inplace = True)
api_projects = api_projects[api_projects['Project Status'].isin(['Complete', 'Active', 'Pending Invoice']) == True]


# Convert columns for unified style
api_projects.rename(columns = {'Invoices':'Number of Invoices', 'Project Net Residual (Neg as Zero)':'Project Net Residual',
                              'Start Date (Project)': 'Project Start Date', 'End Date (Project)': 'Project End Date',
                              'Address Country':'Country'}, inplace = True)
api_projects['Country'] = api_projects['Country'].replace(['AUSTRALIA', 'AUS', 'Autralia', 'NZ', 'new zealand', 'PNG', 'samoa', 'SAMOA', 'TONGA', 'SA', 'CHINA'],
                                                          ['Australia', 'Australia', 'Australia', 'New Zealand', 'New Zealand', 'Papua New Guinea', 'Samoa', 'Samoa', 'Tonga', 'Saudi Arabia', 'China'])
api_projects['Project Start Date'] = pd.to_datetime(api_projects['Project Start Date'])
api_projects['Project End Date'] = pd.to_datetime(api_projects['Project End Date'])


# Move 31 'Commercial' project types to 'Commercial & Retail Buildings' project types
api_projects['Project Type'] = api_projects['Project Type'].mask(api_projects['Project Type'] == 'Commercial', 'Commercial & Retail Buildings')


# Adding 'Due Date' and 'Project Director' columns
custom_fields = pd.read_csv('wga_synergy_incremental_projects_custom_fields.csv')
custom_fields = custom_fields[['PROPOSAL - Due Date', 'PROSPECT - Project Director', 'Project ID']].copy()
custom_fields.rename(columns = {'PROSPECT - Project Director':'Project Director', 'PROPOSAL - Due Date': 'Due Date'}, inplace = True)
custom_fields['Due Date'] = pd.to_datetime(custom_fields['Due Date'])
api_projects = pd.merge(api_projects, custom_fields,  how='left', left_on='Project ID', right_on='Project ID')


# Rearrange column names for easier interpretation
api_projects = api_projects[['Project ID', 'Organisation ID', 'Country',
                             'Project Status', 'Project Type',
                             'Project Director', 'Project Manager', 'Office',
                             'Project Start Date', 'Project End Date', 'Due Date',
                             'Default Rate Group','Number of Invoices', 'Project Net Residual']]


api_projects.head(1)
len(api_projects)

In [None]:
# Read the pre-transformed data from PowerBI
pbi_projects = pd.read_csv('wga_power_bi_projects.csv', encoding = 'ISO-8859-1')
pbi_projects = pbi_projects[['Project ID', 'Project Number', 'Project Name',
                             'Project Size Sort Order', 'Project Duration (Weeks)',
                             'Sector', 'Is Multi Discipline Project',
                             'Is First Client Project',  'Project Status Group',
                             'Is Office Project', 'Is Billable','Is INT Project']].copy()

# Exclude all projects that we are not interested in
pbi_projects = pbi_projects[pbi_projects['Project Status Group'].isin(['Complete', 'Active', 'Pending Invoice']) == True]
pbi_projects = pbi_projects[(pbi_projects['Project Number'] != 'Internal')]
pbi_projects = pbi_projects[(pbi_projects['Project Name'] != 'Internal')]
pbi_projects = pbi_projects[(pbi_projects['Is INT Project'] != 'Yes')]
pbi_projects = pbi_projects[(pbi_projects['Is Office Project'] != 'Yes') & (pbi_projects['Is Billable'] != 'No')]
pbi_projects.drop(columns = ['Project Number', 'Project Name', 'Is Office Project', 'Is Billable', 'Is INT Project', 'Project Status Group'], inplace = True)


# Convert columns for unified style
pbi_projects.rename(columns = {'Project Duration (Weeks)':'Project Duration Weeks'}, inplace = True)
pbi_projects['Is Multi Discipline Project'] = pbi_projects['Is Multi Discipline Project'].replace(['No', 'Yes'],[False, True])
pbi_projects['Is First Client Project'] = pbi_projects['Is First Client Project'].replace(['No', 'Yes'],[False, True])


pbi_projects.head(1)
len(pbi_projects)

In [None]:
inc_projects.isnull().sum().sort_values(ascending=False)

In [None]:
inc_projects.drop(columns = ['Unnamed: 0', 'Project Number', 'Primary Contact Name', 'Status Name',
                             'customFields', 'Address Line 1', 'Address Line 2',
                             'Primary Contact', 'Primary Contact ID', 'Project Scope', 'Address Postal Code',
                             'Address State', 'Address Town', 'Address Google', 'Client Reference Number',
                             'Address State Postal Code Country', 'Address Single Line', 'Project Type Code',
                             'External Name', 'Address Longitude', 'Address Latitude',
                             'Project Forecast Value', 'Project Type'], inplace = True)
inc_projects.isnull().sum().sort_values(ascending=False)

In [None]:
inc_projects['Is Office Project'].value_counts()
inc_projects['Is Billable'].value_counts()

In [None]:
inc_projects['Is Office Project'] = inc_projects['Is Office Project'].replace(['No', 'Yes'],[False, True])
inc_projects['Is Office Project'].value_counts()

inc_projects['Is Billable'] = inc_projects['Is Billable'].replace(['No', 'Yes'],[False, True])
inc_projects['Is Billable'].value_counts()

In [None]:
inc_projects.rename(columns = {'Invoices':'Number of Invoices', 'Project Net Residual (Neg as Zero)':'Project Net Residual',
                              'Start Date (Project)': 'Project Start Date', 'End Date (Project)': 'Project End Date'}, inplace = True)

In [None]:
inc_projects['Address Country'].value_counts()

In [None]:
inc_projects['Address Country'] = inc_projects['Address Country'].replace(['AUSTRALIA', 'AUS', 'Autralia', 'NZ', 'new zealand', 'PNG', 'samoa', 'SAMOA', 'TONGA', 'SA', 'CHINA'],
                                                                          ['Australia', 'Australia', 'Australia', 'New Zealand', 'New Zealand', 'Papua New Guinea', 'Samoa', 'Samoa', 'Tonga', 'Saudi Arabia', 'China'])
                                        
inc_projects['Address Country'].value_counts()

In [None]:
inc_projects.dtypes.astype(str).value_counts()

Notice that no column has a timestamp data type. This means we have to transform the start and end date columns.

In [None]:
inc_projects['Project Start Date'] = pd.to_datetime(inc_projects['Project Start Date'])
inc_projects['Project Start Year'] = pd.DatetimeIndex(inc_projects['Project Start Date']).year
inc_projects['Project Start Month'] = pd.DatetimeIndex(inc_projects['Project Start Date']).month

inc_projects['Project End Date'] = pd.to_datetime(inc_projects['Project End Date'])
inc_projects['Project End Year'] = pd.DatetimeIndex(inc_projects['Project End Date']).year
inc_projects['Project End Month'] = pd.DatetimeIndex(inc_projects['Project End Date']).month

In [None]:
projects_status = pd.read_csv('wga_synergy_overnight_1_projects_status.csv')
projects_status.rename(columns = {'Project Status ID':'Status ID', 'Group': 'Project Status Group'}, inplace = True)
projects_status.drop(columns = ['Unnamed: 0', 'Status Name', 'Status Type', 'Success Factor'], inplace = True)
projects_status.head()
len(projects_status)

In [None]:
merge_1 = pd.merge(inc_projects, projects_status, on=['Status ID'])
merge_1.head(1)
len(merge_1)

We want as few categorical columns as possible. This is why we will mostly use IDs rather than the full names of stages and project types, so that encoding is not needed.

In [None]:
projects_types = pd.read_csv('wga_synergy_overnight_1_projects_types.csv')
projects_types.drop(columns = ['Unnamed: 0', 'Project Type Display', 'Project Type Code'], inplace = True)
projects_types.head(1)
len(projects_types)

In [None]:
merge_2 = pd.merge(merge_1, projects_types,  how='left', left_on='Project Type ID', right_on='Project Type ID')
merge_2.head(1)
len(merge_2)

In [None]:
custom_fields = pd.read_csv('wga_synergy_incremental_projects_custom_fields.csv')
custom_fields.head()
len(custom_fields)

In [None]:
custom_fields.isnull().sum().sort_values(ascending=False)

In [None]:
custom_fields = custom_fields[['PROPOSAL - Due Date', 'PROSPECT - Project Director', 'Project ID']].copy()
custom_fields.rename(columns = {'PROSPECT - Project Director':'Project Director', 'PROPOSAL - Due Date': 'Due Date'}, inplace = True)
custom_fields['Due Date'] = pd.to_datetime(custom_fields['Due Date'])
custom_fields

In [None]:
merge_3 = pd.merge(merge_2, custom_fields,  how='left', left_on='Project ID', right_on='Project ID')

In [None]:
merge_3.columns
len(merge_3.columns)

In [None]:
given_projects = merge_3[['Organisation ID', 'Project ID','Project Name',
                          'Status ID', 'Project Status', 'Project Status Group',
                          'Project Type ID', 'Project Type Name', 'Is Office Project',
                          'Project Director', 'Project Manager', 'Manager ID', 'Office',
                          'Address Country','Is Billable', 'Default Rate Group',
                          'Number of Invoices', 'Project Net Residual',
                          'Created Date', 'Updated Date',
                          'Project Start Date', 'Project End Date',
                          'Project Start Year', 'Project Start Month',
                          'Project End Year','Project End Month', 'Due Date']]
len(given_projects.columns)

In [None]:
# Checking if the two columns are the same
data_compare = [given_projects['Project Status'], given_projects['Project Status Group']]
df_compare = pd.concat(data_compare, axis=1)
df_compare['same'] = (df_compare['Project Status'] == df_compare['Project Status Group']) 

# Printing the dataframe
df_compare[df_compare['same'] == False]

In [None]:
given_projects['Project Status'].value_counts()

In [None]:
given_projects['Project Status Group'].value_counts()

In [None]:
given_projects.drop(columns = 'Project Status', inplace = True)

In [None]:
# Drop rows that are not 'Complete' and 'Active'
given_projects = given_projects[given_projects['Project Status Group'].isin(['Complete', 'Active']) == True]
given_projects.drop(columns = 'Project Status Group', inplace = True)
given_projects.head(1)
len(given_projects)

In [None]:
power_bi_projects = pd.read_csv('wga_power_bi_projects.csv', encoding = 'ISO-8859-1')
power_bi_projects.head(1)

In [None]:
power_bi_projects = pd.read_csv('wga_power_bi_projects.csv', encoding = 'ISO-8859-1')
power_bi_projects = power_bi_projects[['Project ID','Project Duration (Weeks)', 'Project Size', 'Project Size Sort Order',
                                       'Is Multi Discipline Project', 
                                       'Project Fee Remaining','Project Fee Remaining - Active Only',
                                       'Is First Client Project', 'Primary Service', 'Sector', 'Sub-Sector',
                                       'Is INT Project','Project Team Manager', 'Project Status Group']].copy()
power_bi_projects = power_bi_projects[power_bi_projects['Project Status Group'].isin(['Complete', 'Active']) == True]
power_bi_projects.drop(columns = 'Project Status Group', inplace = True)
power_bi_projects.head(1)
len(power_bi_projects)

In [None]:
power_bi_projects.rename(columns = {'Is INT Project':'Is Int Project', 'Sub-Sector': 'Sub Sector',
                                    'Project Fee Remaining - Active Only':'Project Fee Remaining Active Only',
                                    'Project Duration (Weeks)':'Project Duration Weeks'}, inplace = True)

power_bi_projects['Is Multi Discipline Project'] = power_bi_projects['Is Multi Discipline Project'].replace(['No', 'Yes'],[False, True])
power_bi_projects['Is Multi Discipline Project'].value_counts()

power_bi_projects['Is Int Project'] = power_bi_projects['Is Int Project'].replace(['No', 'Yes'],[False, True])
power_bi_projects['Is Int Project'].value_counts()

power_bi_projects.head(1)

In [None]:
merge_4 = pd.merge(given_projects, power_bi_projects,  how='left', left_on='Project ID', right_on='Project ID')
merge_4.head(1)
len(merge_4)

In [None]:
engineered_projects = merge_4.copy()
# Placeholders for the features still to be engineered
#engineered_projects['Suffered Data Loss'] = 
#engineered_projects['Between_March_2020_Jan_2022'] = 
#engineered_projects['Delivered_on_Time'] =
#engineered_projects['Perc_of_Stages_with_Fixed_Fee'] =
#engineered_projects['Is_Government_Project'] =
#engineered_projects['Sector_Profitability_Rank'] =
#engineered_projects.head(1)

#engineered_projects['End Before July 2018'].head()

In [None]:
def data_loss_1(date): #projects that started after July 2018 did not suffer from data loss
    if date > pd.Timestamp('2018-07-15'):
        return False
    else:
        return True

def data_loss_2(date): #projects that ended before July 2018 did not suffer from data loss
    if date < pd.Timestamp('2018-07-15'):
        return False
    else:
        return True

engineered_projects['Suffered Data Loss_1'] = engineered_projects['Project Start Date'].apply(data_loss_1)
engineered_projects['Suffered Data Loss_2'] = engineered_projects['Project End Date'].apply(data_loss_2)

In [None]:
#engineered_projects['Suffered Data Loss'] = 

In [None]:
def prompt_checker(row): #projects that ended on or before the due date are prompt
    # row is one project (a pandas Series), so both dates are compared for the same project
    if row['Project End Date'] > row['Due Date']:
        return False
    else:
        return True

In [None]:
# Reference pattern: apply a function row-wise only where the second column is not null
#df[['A','C']].apply(lambda x: my_func(x) if(np.all(pd.notnull(x[1]))) else x, axis = 1)

In [None]:
engineered_projects['Delivered on Time'] = engineered_projects.apply(prompt_checker, axis = 1)

**6 more features to engineer:**
* Suffered_Data_Loss
* Between_March_2020_Jan_2022
* Delivered_on_Time
* Is_Government_Project
* Sector_Profitability_Rank (research)
* Perc_of_Stages_with_Fixed_Fee

**Table alterations:**
* FK Total_Data_Health_Issues (references 'wga.health' table on 'Alerts_Total_Per_Project')
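
Three of the date-based features listed above can be sketched with vectorised comparisons instead of row-wise functions. This is a minimal illustration on made-up rows, not the final implementation. The cutoff `2018-07-15` is the date used by the data-loss functions in this notebook. The window boundaries and the rule that a project counts as "between" when its start date falls inside the window are assumptions.

```python
import pandas as pd

# Hypothetical sample rows using the column names of the projects table
sample = pd.DataFrame({
    'Project Start Date': pd.to_datetime(['2018-01-10', '2019-03-01', '2020-06-15']),
    'Project End Date': pd.to_datetime(['2018-12-20', '2019-09-30', '2021-02-01']),
    'Due Date': pd.to_datetime(['2018-11-30', '2019-10-15', None]),
})

cutoff = pd.Timestamp('2018-07-15')        # date used by the data-loss functions
window_start = pd.Timestamp('2020-03-01')  # assumed window boundaries
window_end = pd.Timestamp('2022-01-31')

# Data loss: the project started before the cutoff and ended after it
sample['Suffered Data Loss'] = (sample['Project Start Date'] < cutoff) & (sample['Project End Date'] > cutoff)

# Assumed meaning: the start date falls inside the window (inclusive)
sample['Between March 2020 Jan 2022'] = sample['Project Start Date'].between(window_start, window_end)

# Keep projects without a due date as <NA> rather than counting them as on time
on_time = (sample['Project End Date'] <= sample['Due Date']).astype('boolean')
sample['Delivered on Time'] = on_time.mask(sample['Due Date'].isna())
print(sample)
```

The nullable `boolean` dtype is a design choice here: a plain comparison against a missing due date evaluates to `False`, which would be indistinguishable from a genuinely late project.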

In [None]:
# Creates a new column containing all the days between Project_Start_Date and Project_End_Date
projects_exec_dates['Execution_Period'] = projects_exec_dates.apply(lambda row: pd.date_range(start=row['Project_Start_Date'], end=row['Project_End_Date'], freq='D'), axis=1)
projects_exec_dates

In [None]:
# Suffered_Data_Loss

ACQUISITION_DATE = pd.Timestamp('2018-07-15')

def data_loss_checker(start_date, end_date):
    # Only projects that started before the acquisition and ended after it suffered data loss;
    # projects that started and ended on the same side of the acquisition did not
    return bool(start_date < ACQUISITION_DATE and end_date > ACQUISITION_DATE)

Suffered_Data_Loss = {}

# zip pairs each project's own start and end date;
# nested loops would compare every start date with every end date
for start_date, end_date in zip(projects['Project_Start_Date'], projects['Project_End_Date']):
    Suffered_Data_Loss[start_date, end_date] = data_loss_checker(start_date, end_date)

Suffered_Data_Loss

### 1.2 <a class="anchor" id="1_2"></a> Clients (wga.clients)

In [None]:
overnight_1_clients = pd.read_csv('wga_synergy_overnight_1_clients.csv')
overnight_1_clients.drop(columns = ['Client Name', 'Unnamed: 0'], inplace = True)
overnight_1_clients['Contact Type'] = overnight_1_clients['Contact Type'].replace(['Company', 'Individual'],[1, 0])
overnight_1_clients['Created Date'] = pd.to_datetime(overnight_1_clients['Created Date'])
overnight_1_clients

#client.columns = client.columns.str.replace(' ', '_')

In [None]:
power_bi_clients = pd.read_csv('wga_power_bi_clients.csv', encoding = 'ISO-8859-1')
power_bi_clients.drop(columns = ['ï»¿Synergy URL (Client)', 'Organisation ID', 'Client Name', 'Created Date', 'Contact Type'], inplace=True)
power_bi_clients.rename(columns = {'Client Name (Short)':'Short Client Name',
                                   'Client Projects - Total No': 'Client Projects Total No',
                                   'Client Projects - First Project ID':'1st Project ID'}, inplace = True)
power_bi_clients.head()

In [None]:
clients = pd.merge(overnight_1_clients, power_bi_clients,  how='left', left_on='Client ID', right_on='Client ID')
clients

**2 features to engineer:**
* Tenure_Duration_Weeks
* Is_New

**Table alterations:**
* FK 1st_Project_ID (links 'wga.projects' table on 'Project_ID')
* FK Organisation_ID (references 'wga.projects' table on 'Organisation_ID')
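
The two client features listed above could be derived from `Created Date` roughly as follows. This is a sketch under stated assumptions: `reference_date` (the date against which tenure is measured) and `new_client_weeks` (the threshold below which a client counts as new) are hypothetical values that are not defined anywhere in this notebook, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample of the clients table
clients_sample = pd.DataFrame({
    'Client ID': [1, 2],
    'Created Date': pd.to_datetime(['2021-01-04', '2022-04-25']),
})

reference_date = pd.Timestamp('2022-05-09')  # assumption: snapshot date of the data
new_client_weeks = 12                        # assumption: threshold for a "new" client

# Tenure in weeks = days since the client record was created, divided by 7
tenure = reference_date - clients_sample['Created Date']
clients_sample['Tenure_Duration_Weeks'] = tenure.dt.days / 7
clients_sample['Is_New'] = clients_sample['Tenure_Duration_Weeks'] < new_client_weeks
print(clients_sample)
```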

In [None]:
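#client['Is_New'] = 
# The reference date (###) is still to be decided; the unit is one week ('W') because the feature is measured in weeks
#clients['Tenure_Duration_Weeks'] = (clients['Created Date']-###).apply(lambda x: x/np.timedelta64(1,'W'))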

### 1.3 <a class="anchor" id="1_3"></a> Stages (wga.stages)

In [None]:
# from API file

#sql_stages = pd.read_csv('wga_sql_stages.csv')
#sql_stages = sql_stages[[ 'customer', 'id', 'projectId', 'managerId', 'organisationId', 'name', 'accounts','statusId']]
#sql_stages.head(1)
#len(sql_stages)

In [None]:
# From PowerBI file
stages = pd.read_csv('wga_power_bi_stages.csv', encoding = 'ISO-8859-1')
stages = stages[['customer', 'Project ID', 'Stage ID', 'Phase Name (Short)', 'Stage Status Sort Order',
                 'Is Disbursement Stage', 'Stage Type', 'Stage Forecast Distribution Type', 'Stage Fee Type',
                 'Stage Manager', 'Stage Discipline','Stage Start Date','Stage End Date', 'Stage Updated Date']].copy()

stages['Stage Start Date'] = pd.to_datetime(stages['Stage Start Date'])
stages['Stage End Date'] = pd.to_datetime(stages['Stage End Date'])
stages['Stage Updated Date'] = pd.to_datetime(stages['Stage Updated Date'])
stages.rename(columns = {'customer':'Customer','Phase Name (Short)': 'Phase Name'}, inplace = True)
stages.columns = stages.columns.str.replace(' ', '_')

stages.head(1)
len(stages)

In [None]:
stages['Customer'].nunique()

In [None]:
stages.drop(columns = 'Customer', inplace = True)

**1 feature to engineer:**
* Stage_Duration_Weeks


**Table alterations:**
* FK Alerts_Total_Per_Stage (references 'wga.health' table on 'Alerts_Total_Per_Stage')
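
A minimal sketch of `Stage_Duration_Weeks`, assuming it is simply the time between `Stage_Start_Date` and `Stage_End_Date` expressed in weeks. The sample rows are made up; a stage without an end date yields `NaN`.

```python
import pandas as pd

# Hypothetical sample using the underscore column names created above
stages_sample = pd.DataFrame({
    'Stage_ID': [101, 102],
    'Stage_Start_Date': pd.to_datetime(['2021-03-01', '2021-03-01']),
    'Stage_End_Date': pd.to_datetime(['2021-03-29', None]),
})

# Subtracting two datetime columns gives a timedelta column; convert days to weeks
duration = stages_sample['Stage_End_Date'] - stages_sample['Stage_Start_Date']
stages_sample['Stage_Duration_Weeks'] = duration.dt.days / 7
print(stages_sample)
```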

### 1.4 <a class="anchor" id="1_4"></a> Transactions (wga.transactions)

In [None]:
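# Read only valid projects' transactions
# valid_ids is assumed to hold the Project IDs kept in section 1.1; the raw file names this column 'projectId'
transactions = pd.read_csv('wga_sql_transactions.csv')
transactions = transactions[transactions['projectId'].isin(valid_ids)]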

transactions = transactions[['id', 'projectId', 'projectNumber', 'projectName',
                                     'transactionTypeId','status',
                                     'stageId', 'stageName',
                                     'expenseType',
                                     'invoiceValueTotal',
                                     'cost', 'actualCostTotal',
                                     'targetChargeTotal', 'standardCostTotal',
                                     'valueTotal', 'date']].copy()

transactions.rename(columns = {'id':'Transaction_ID', 'projectId':'Project_ID',
                                   'transactionTypeId':'Transaction_Type_ID',
                                   'status': 'Status', 'stageId': 'Stage_ID', 'stageName':'Stage_Name',
                                   'expenseType':'Expense_Type',
                                   'invoiceValueTotal': 'Invoice_Value_Total', 'cost': 'Cost',
                                   'actualCostTotal':'Actual_Cost_Total',
                                   'targetChargeTotal':'Target_Charge_Total',
                                   'standardCostTotal':'Standard_Cost_Total',
                                   'valueTotal':'Value_Total', 'date':'Date'}, inplace = True)

# Drop any rows that have 'Internal' in the project name and the project number columns
transactions = transactions[(transactions['projectNumber'] != 'Internal') & (transactions['projectName'] != 'Internal')]
transactions.drop(columns = ['projectNumber', 'projectName'], inplace = True)

# Transform timestamps from object data type to datetime
transactions['Date'] = pd.to_datetime(transactions['Date'])

transactions.head(1)
len(transactions)

In [None]:
original = pd.read_csv('wga_sql_transactions.csv')
original.head(1)

In [None]:
sql_transactions = pd.read_csv('wga_sql_transactions.csv')
sql_transactions = sql_transactions[['id', 'invoiceNumber', 'projectId','transactionTypeId',
                                     'statusId', 'status',
                                     'stageId', 'stageName',
                                     'taskId', 'taskName',
                                     'rateTypeId', 'rateType',
                                     'reasonCode', 'expenseType',
                                     'writeOffStaffId', 'writeOffStaff',
                                     'cost', 'actualCostTotal',
                                     'targetChargeTotal', 'standardCostTotal',
                                     'valueTotal', 'wipValueTotal', 'isUtilised',
                                     'date', 'invoiceDate' ]].copy()

sql_transactions.rename(columns = {'Data Quality - Is Forecastable':'DQ_Is_Forecastable',
                         'Data Quality - Has Issues': 'DQ_Has_Issues',
                         'Data Quality - Has Inactive Staff Resourced':'DQ_Has_Inactive_Staff_Resourced',
                         'Data Quality - Rate Group':'DQ_Rate_Group',
                         'Health - % Duration Complete':'Health_Perc_Duration_Complete',
                         'Health - % Fee Used':'Health_Perc_Fee_Used',
                         'Health - Stages With Alerts #':'Alerts_Total_Per_Stage'}, inplace = True)

sql_transactions.head(1)
len(sql_transactions)

In [None]:
invoices_payments = pd.read_csv('wga_synergy_overnight_1_invoices_payments.csv')
invoices_payments.head(1)
len(invoices_payments)

In [None]:
rates = pd.read_csv('wga_synergy_overnight_1_rates.csv')
rates.head()
len(rates)

**2 features to engineer:**
* Tenure_Duration_Weeks
* Is_New

**Table alterations:**
* PK 1st_Project_ID (links 'wga.projects' table on 'Project_ID')
* FK Organisation_ID (references 'wga.projects' table on 'Organisation_ID')

### 1.5 <a class="anchor" id="1_5"></a> Data health (wga.health)

In [None]:
health = pd.read_csv('wga_power_bi_stages.csv', encoding = 'ISO-8859-1')

health = health[['Project Number', 'Project Name', 'Is INT Project', 'Is Office Project',
                 'Project ID', 'Stage ID',
                 'Data Quality - Is Forecastable',
                 'Data Quality - Has Issues',
                 'Data Quality - Has Inactive Staff Resourced', 
                 'Data Quality - Rate Group', 'Health - % Duration Complete',
                 'Health - % Fee Used', 'Health - Stages With Alerts #']].copy()

# Exclude all projects that we are not interested in
health = health[(health['Project Number'] != 'Internal')]
health = health[(health['Project Name'] != 'Internal')]
health = health[(health['Is INT Project'] != 'Yes')]
health = health[(health['Is Office Project'] != 'Yes')]
health = health.drop(columns = ['Project Number', 'Project Name', 'Is Office Project', 'Is INT Project'])



health.rename(columns = {'Data Quality - Is Forecastable':'DQ_Is_Forecastable',
                         'Data Quality - Has Issues': 'DQ_Has_Issues',
                         'Data Quality - Has Inactive Staff Resourced':'DQ_Has_Inactive_Staff_Resourced',
                         'Data Quality - Rate Group':'DQ_Rate_Group',
                         'Health - % Duration Complete':'Health_Perc_Duration_Complete',
                         'Health - % Fee Used':'Health_Perc_Fee_Used',
                         'Health - Stages With Alerts #':'Alerts_Total_Per_Stage'}, inplace = True)

health['DQ_Is_Forecastable'] = health['DQ_Is_Forecastable'].replace(['No', 'Yes'],[False, True])
health['DQ_Has_Issues'] = health['DQ_Has_Issues'].replace(['No', 'Yes'],[False, True])
health['DQ_Has_Inactive_Staff_Resourced'] = health['DQ_Has_Inactive_Staff_Resourced'].replace(['No', 'Yes'],[False, True])
health.columns = health.columns.str.replace(' ', '_')

health.head(1)
len(health)

In [None]:
checker = health[health['Project_ID'].isin([368035]) == True]
checker = checker[['Project_ID', 'Stage_ID', 'Alerts_Total_Per_Stage']]
checker

**1 feature to engineer:**
* Alerts_Total_Per_Project

**Table alterations:**

* FK Project_ID (references 'wga.projects' table on 'Project_ID')
* FK Stage_ID (references 'wga.stages' table on 'Stage_ID')
* FK Alerts_Total_Per_Project (links 'wga.projects' table on 'Total_Data_Health_Issues')
* FK Alerts_Total_Per_Stage (links 'wga.stages' table on 'Alerts_Total_Per_Stage')
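
One way to obtain `Alerts_Total_Per_Project` is to sum the stage-level alerts within each project. This sketch assumes that the project total is the plain sum of `Alerts_Total_Per_Stage` over the project's stages; the sample values are made up.

```python
import pandas as pd

# Hypothetical sample of the health table: one row per stage
health_sample = pd.DataFrame({
    'Project_ID': [1, 1, 2],
    'Stage_ID': [11, 12, 21],
    'Alerts_Total_Per_Stage': [2, 0, 3],
})

# transform('sum') returns one value per row, so the project total
# can be stored next to each stage without a separate merge
health_sample['Alerts_Total_Per_Project'] = (
    health_sample.groupby('Project_ID')['Alerts_Total_Per_Stage'].transform('sum')
)
print(health_sample)
```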

### 1.6 <a class="anchor" id="1_6"></a> Human resources (wga.hr)

In [None]:
staff = pd.read_csv('wga_synergy_overnight_1_staff.csv')
staff = staff[['Organisation ID', 'Staff ID', 'Reports To', 'Synergy Team', 'Employment Date', 'Termination Date']]
staff.head(1)
len(staff)

**1 feature to engineer:**
* Employment_Duration_Weeks_by_May_22

**Table alterations:**
* FK Organisation_ID (references 'wga.projects' table on 'Organisation_ID')
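
`Employment_Duration_Weeks_by_May_22` could be computed as sketched below. The exact cutoff day in May 2022 is an assumption (`2022-05-30` here), as is the rule that staff who left earlier are measured up to their termination date. The sample rows are made up.

```python
import pandas as pd

# Hypothetical sample of the staff table
staff_sample = pd.DataFrame({
    'Staff ID': [1, 2],
    'Employment Date': pd.to_datetime(['2022-01-03', '2021-05-31']),
    'Termination Date': pd.to_datetime([None, '2021-08-09']),
})

cutoff = pd.Timestamp('2022-05-30')  # assumption: exact day used for "by May 22"

# Current staff are measured up to the cutoff, leavers up to their termination date
end = staff_sample['Termination Date'].fillna(cutoff).clip(upper=cutoff)
staff_sample['Employment_Duration_Weeks_by_May_22'] = (end - staff_sample['Employment Date']).dt.days / 7
print(staff_sample)
```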

**Functions**

In [None]:
# Suffered_Data_Loss
def data_loss_1(date): #projects that started after July 2018 did not suffer from data loss
    if date > pd.Timestamp('2018-07-15'):
        return False
    else:
        return True

def data_loss_2(date): #projects that ended before July 2018 did not suffer from data loss
    if date < pd.Timestamp('2018-07-15'):
        return False
    else:
        return True

# Between_March_2020_Jan_2022


# Delivered_on_Time
def prompt_checker(row): #projects that ended on or before the due date are prompt
    # row is one project (a pandas Series); use with DataFrame.apply(prompt_checker, axis = 1)
    if row['Project End Date'] > row['Due Date']:
        return False
    else:
        return True
    

# Reference pattern for comparing a datetime column with a timestamp
#print(df['new_time'] > pd.Timestamp(2018, 1, 5, 12))

# Perc_of_Stages_with_Fixed_Fee

# Is_Government_Project

# Sector_Profitability_Rank

**Applying functions**

## Part 2: <a class="anchor" id="part2"></a> Data loading

In [None]:
# Create a database and connect to it

conn = sqlite3.connect('WGA.db') #since the db does not exist, this creates a WGA.db file in the current directory

In [None]:
# Connecting to a database created in MS SQL Server Management Studio

server = r'.\sqlexpress' 
database = 'wga' 
username = 'sa'  
password  = 'marfa'
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+password)
cursor = cnxn.cursor()

# Test query
sql_statement = "select 1"
response = cursor.execute(sql_statement).fetchone()
print(response[0])

In [None]:
# Store the tables in the database:

sql_transactions.to_sql('sql_transactions', conn)
sql_stages.to_sql('sql_stages', conn)
sql_stages_snapshot.to_sql('sql_stages_snapshot', conn)


inc_projects.to_sql('inc_projects', conn)
inc_projects_custom_fields.to_sql('inc_projects_custom_fields', conn)
inc_stages_forecast.to_sql('inc_stages_forecast', conn)
inc_staff.to_sql('inc_staff', conn)
inc_invoices_all_fy22.to_sql('inc_invoices_all_fy22', conn)

fy21.to_sql('fy21', conn)
fy20.to_sql('fy20', conn)
fy19.to_sql('fy19', conn)
fy18.to_sql('fy18', conn)
fy18_2.to_sql('fy18_2', conn)
fy17.to_sql('fy17', conn)
fy16.to_sql('fy16', conn)
fy15.to_sql('fy15', conn)
fy14.to_sql('fy14', conn)
fy13.to_sql('fy13', conn)
fy12.to_sql('fy12', conn)
fy11.to_sql('fy11', conn)
fy10.to_sql('fy10', conn)

overnight_1_clients.to_sql('overnight_1_clients', conn)
overnight_1_invoices_payments.to_sql('overnight_1_invoices_payments', conn)
overnight_1_projects_contracts.to_sql('overnight_1_projects_contracts', conn)
overnight_1_projects_status.to_sql('overnight_1_projects_status', conn)
overnight_1_projects_status_change.to_sql('overnight_1_projects_status_change', conn)
overnight_1_projects_types.to_sql('overnight_1_projects_types', conn)
overnight_1_rates.to_sql('overnight_1_rates', conn)
overnight_1_staff.to_sql('overnight_1_staff', conn)
overnight_1_staff_leavers.to_sql('overnight_1_staff_leavers', conn)
overnight_1_stages_status_changes.to_sql('overnight_1_stages_status_changes', conn)
overnight_2_notes.to_sql('overnight_2_notes', conn)

resourcing.to_sql('resourcing', conn)
reference_staff_data.to_sql('reference_staff_data', conn)
reference_stages_forecast_custom.to_sql('reference_stages_forecast_custom', conn)
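
The repeated `to_sql` calls above could also be written as a loop over a dictionary of frames. The sketch below uses an in-memory SQLite database and two made-up frames. The `tables` dictionary and `demo_conn` are illustrative names, not objects defined elsewhere in this notebook.

```python
import sqlite3
import pandas as pd

# Stand-in frames; in the notebook these would be the extracted tables
tables = {
    'fy21': pd.DataFrame({'Project ID': [1, 2], 'Amount': [100.0, 250.0]}),
    'fy20': pd.DataFrame({'Project ID': [3], 'Amount': [75.0]}),
}

demo_conn = sqlite3.connect(':memory:')  # in-memory database for the sketch

for table_name, frame in tables.items():
    # if_exists='replace' lets the cell be rerun without a "table already exists" error;
    # index=False leaves the pandas index out of the stored table
    frame.to_sql(table_name, demo_conn, if_exists='replace', index=False)

# Row counts per stored table, read back from the database
row_counts = {
    name: demo_conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    for name in tables
}
```

With the default `if_exists='fail'`, running the storage cell a second time raises a `ValueError` for every table that already exists.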

In [None]:
# Step 4: Read a SQL Query out of WGA database and transform it into a pandas dataframe for closer investigation
#sql_string = 'SELECT * FROM sql_stages'
#sql_stages = pd.read_sql(sql_string, conn)

https://chrisnicoll.net/2020/02/exploring-an-sqlite-database-from-jupyter-notebook/ (Nicoll, 2020)

In [None]:
curs = conn.cursor()

In [None]:
curs.execute('SELECT * FROM fy21').description
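
A short sketch of how the cursor can be used to explore a SQLite database: listing its tables through `sqlite_master` and reading column names from `description`. The in-memory database and the `fy21` columns shown here are placeholders for illustration only.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE fy21 (project_id INTEGER, amount REAL)')

demo_curs = demo_conn.cursor()

# Names of all tables stored in the database
table_names = [row[0] for row in demo_curs.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# description holds one 7-item tuple per column; the first item is the column name
column_names = [col[0] for col in demo_curs.execute('SELECT * FROM fy21').description]
```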

## Part 2: <a class="anchor" id="part2"></a> Data cleaning

* Convert object columns to numeric columns, wrapping the conversion in a `try`/`except` block so that columns that cannot be converted are left unchanged
* Drop columns with perfect collinearity, like 'project ID' and 'invoice ID'
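
The two cleaning steps above might look like the sketch below. The `raw` frame and its column names are invented for the example. Catching `ValueError` and `TypeError` assumes `pd.to_numeric` is called with its default `errors='raise'` behaviour.

```python
import pandas as pd

# Made-up frame in which every column was read in as text
raw = pd.DataFrame({
    'Project ID': ['P1', 'P2', 'P3'],
    'Fee': ['100', '250.5', '75'],
    'Hours': ['8', '12', '3'],
})

cleaned = raw.copy()
for col in cleaned.select_dtypes(exclude='number').columns:
    try:
        cleaned[col] = pd.to_numeric(cleaned[col])  # raises if any value is not numeric
    except (ValueError, TypeError):
        pass  # leave genuinely non-numeric columns as they are

# Identifier columns carry no predictive signal of their own, so they are dropped
cleaned = cleaned.drop(columns=['Project ID'])
```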

## Part 3: <a class="anchor" id="part3"></a> Feature engineering

* Size of the team
* Project complexity (number of stages)
* Client longevity (number of months with the company)
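
One way the three features could be computed with pandas is sketched below. The table layouts, the column names (`project_id`, `stage_id`, `staff_id`, `first_project_date`) and the snapshot date are assumptions made for the example. The real WGA tables may be organised differently.

```python
import pandas as pd

# Made-up rows: one row per (project, stage, staff member) assignment
assignments = pd.DataFrame({
    'project_id': [1, 1, 1, 2],
    'stage_id':   [10, 10, 11, 20],
    'staff_id':   [100, 101, 100, 102],
})
clients = pd.DataFrame({
    'client_id': [7, 8],
    'first_project_date': pd.to_datetime(['2021-05-15', '2022-04-30']),
})
snapshot = pd.Timestamp('2022-05-31')  # assumed reference date

# Team size and project complexity: distinct staff and distinct stages per project
project_features = assignments.groupby('project_id').agg(
    team_size=('staff_id', 'nunique'),
    n_stages=('stage_id', 'nunique'),
)

# Client longevity: calendar-month difference, ignoring the day of the month
clients['client_longevity_months'] = (
    (snapshot.year - clients['first_project_date'].dt.year) * 12
    + (snapshot.month - clients['first_project_date'].dt.month)
)
```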

## Part 4: <a class="anchor" id="part4"></a> Label encoding

In [None]:
encoded_df = df.copy()

le = LabelEncoder()
encoded_df['creator'] = le.fit_transform(encoded_df['creator'])
encoded_df['artwork_name'] = le.fit_transform(encoded_df['artwork_name'])
encoded_df['collection'] = le.fit_transform(encoded_df['collection'])
encoded_df['art_series'] = le.fit_transform(encoded_df['art_series'])
encoded_df = encoded_df.drop(columns = ['path'])
encoded_df.head()
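
Because the cell above refits the same `le` object for each column, only the mapping of the last column survives. A possible variation, sketched here on a made-up `demo_df`, keeps one encoder per column so that the codes can be translated back later. The `encoders` dictionary is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical data
demo_df = pd.DataFrame({
    'creator': ['ann', 'bob', 'ann'],
    'collection': ['x', 'y', 'y'],
})

encoders = {}  # one fitted encoder per column
encoded_demo = demo_df.copy()
for col in ['creator', 'collection']:
    encoders[col] = LabelEncoder()
    encoded_demo[col] = encoders[col].fit_transform(encoded_demo[col])

# The stored encoder maps integer codes back to the original labels
original_creators = encoders['creator'].inverse_transform(encoded_demo['creator'])
```

scikit-learn documents `LabelEncoder` as a tool for target labels. For feature columns, `OrdinalEncoder` offers the same integer coding for several columns at once.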

## Part 5: <a class="anchor" id="part5"></a> Data splitting and scaling

In [None]:
# Split dataset into features and labels
X = full_df[['creator', 'artwork_name', 'collection',
           'art_series', 'media', 'likes', 'nsfw',
           'tokens','year', 'rights', 'artwork_counts']]  # Removed original price
y = full_df['price_class']

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022) # 80% training and 20% test
print(f"No. of training data: {X_train.shape[0]}")
print(f"No. of training targets: {y_train.shape[0]}")
print(f"No. of testing data: {X_test.shape[0]}")
print(f"No. of testing targets: {y_test.shape[0]}")
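
The scaling step named in the heading could follow the split as sketched below. The data is randomly generated for illustration. The `stratify` argument is an addition that is not used in the cell above. The scaler is fitted on the training set only, so no information from the test set leaks into the scaling parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and a balanced binary target
rng = np.random.default_rng(2022)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = np.array([0, 1] * 50)

# stratify keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2022, stratify=y_demo)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training data only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test data
```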

## Part 6: <a class="anchor" id="part6"></a> Model training

### 6.5 <a class="anchor" id="6_5"></a> Stacking

In [None]:
#URL: https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/

# get a stacking ensemble of models
def get_stacking():
	# define the base models
	level0 = list()
	level0.append(('lr', LogisticRegression()))
	level0.append(('knn', KNeighborsClassifier()))
	level0.append(('dtc', DecisionTreeClassifier()))
	level0.append(('rfc', rfc_tuned))
	level0.append(('xgb', XGBClassifier()))
	level0.append(('gnb', GaussianNB()))

	# define meta learner model
	level1 = rfc_tuned
	# define the stacking ensemble
	model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
	return model

# get a list of models to evaluate
def get_models():
	models = dict()
	models['lr'] = LogisticRegression()
	models['knn'] = KNeighborsClassifier()
	models['dtc'] = DecisionTreeClassifier()
	models['rfc'] = rfc_tuned
	models['xgb'] = XGBClassifier()
	models['gnb'] = GaussianNB()
	models['stacking'] = get_stacking()
	return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
	return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	scores = evaluate_model(model, X, y)
	results.append(scores)
	names.append(name)
	print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

## Part 7: <a class="anchor" id="part7"></a> Performance evaluation

## Part 8: <a class="anchor" id="part8"></a> Feature performance

In [None]:
feature_imp = pd.Series(rfc_tuned.feature_importances_, index=X.columns).sort_values(ascending=False)
feature_imp

In [None]:
# Create a new DataFrame for feature importance
# Feature names are taken from X, the same frame used to index the importances above
rfc_tuned_feature_importance = pd.DataFrame({"Feature": X.columns, "Importance": rfc_tuned.feature_importances_})
rfc_tuned_feature_importance = rfc_tuned_feature_importance.sort_values(by = ["Importance"], ascending = False)

In [None]:
# Plotting a bar plot for feature importance
%matplotlib inline

plt.figure(figsize = (14,7))
# x and y are passed as keywords, as required by recent seaborn versions
sns.barplot(x = "Feature", y = "Importance", data = rfc_tuned_feature_importance, color = "navy")
plt.title("Feature Importance")
plt.xlabel("Features")
plt.ylabel("Feature Importance Score")
plt.xticks(rotation = "vertical")
plt.show()

## Part 9: <a class="anchor" id="part9"></a> Converting output

## Part 10: <a class="anchor" id="Part 10"></a> Pipeline creation