# Team Wrangler KPIs

---

**What:** This code calculates KPIs related to Team Wrangler.<br>
**When:** This code should run *weekly* in order to calculate KPIs correctly.<br>
- Note that some KPIs have no historical data (these are marked below), so querying them on time (i.e., weekly) is the only way to capture these data at all.
- Some KPIs are collected with day-level granularity, others only with week-level granularity (this is marked below).

**Where:** Results from these calculations are appended to tables in `s3://nucleo-databricks-shared-files/wrangler_kpis`

>**Note:** As of the Oct. 21, 2022 version, this notebook does not tolerate data being written to S3 outside of a regular cycle.
This is because there is not yet a mechanism to prevent duplicate data from being written. For now, the only safeguard is to write data to S3 ONLY at regularly scheduled intervals,
and to keep that interval equal to the 'timespan_days' value below (e.g., our DB job runs this notebook once every 7 days & timespan_days is also set to 7).
---
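
One way the missing duplicate check from the note above could look is a guard that compares the window about to be written with the windows already stored. This is only a sketch and not part of this notebook: `window_already_written` and the shape of `existing_windows` are assumptions, and in practice the stored pairs would first have to be read from the `first_date` and `last_date` columns of the table in S3.

```python
from typing import Iterable, Tuple

Window = Tuple[str, str]  # (first_date, last_date) as "YYYY-MM-DD" strings


def window_already_written(existing_windows: Iterable[Window],
                           first_date_str: str,
                           last_date_str: str) -> bool:
    """Return True if any stored window overlaps the window about to be written."""
    for stored_first, stored_last in existing_windows:
        # ISO dates compare correctly as strings; two closed intervals overlap
        # when each one starts on or before the day the other one ends.
        if stored_first <= last_date_str and first_date_str <= stored_last:
            return True
    return False


# Hypothetical windows already present in the output table
existing = [("2022-10-05", "2022-10-11"), ("2022-10-12", "2022-10-18")]

print(window_already_written(existing, "2022-10-12", "2022-10-18"))  # True
print(window_already_written(existing, "2022-10-19", "2022-10-25"))  # False
```

With a check like this, a run could skip the append (or fail loudly) whenever the window overlaps stored data, instead of relying only on the schedule.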

**More details:**
Planning and overview of these KPIs can be found here: https://docs.google.com/spreadsheets/d/1G0tD4O8eEj5ZC1eWxoR16jmuWxxF2IaLoPieMRiJFFo/edit#gid=596961878

## Setup

####  >> Configurable parameters <<

In [0]:
#Set how many days in the past the LAST DAY of your queries should be. Note that the date logic below steps back 1 additional day, so a value of 2 means that the last date of KPIs queried will be 3 days ago.
day_delay = 2

#Set how many total days you want to query KPI data for. This is added to 'day_delay' to determine the FIRST DAY that your queries will look at (e.g., a value of 14 will query a 2-week range of data)
timespan_days = 7

#Team members of 'Team Wrangler'. These are used to filter this team's responses to GitHub issues and PRs. Update as members enter / leave the team.
#Team members last updated: Sept. 28, 2022
team_wrangler = [
    'aakin',
    'smolkuva',
    'pmali',
    'arukhan',
    'hyonetani',
    'ianiskovets',
    'kakula'
]

#Toggle this value to exclude KPIs which do not have historical data (and which therefore cannot be backfilled for historical dates).
#Set this to True if you need to query data more than 1 week into the past, so that current data is not labeled as historical data.
exclude_kpis_with_no_historical_data = False

#Location where result tables will be written:
output_bucket = "s3://nucleo-databricks-shared-files/wrangler_kpis"

In [0]:
from py4j.protocol import Py4JJavaError
#Only writing to S3 if this is NOT a manual run (we don't want to write to S3 accidentally > once)

try:
    write_results = (dbutils.widgets.get("triggered_by") == "scheduled_job")
    
except Py4JJavaError as e:
    write_results = False
    print('''Error: No data will be written to S3 (job not triggered automatically - this is a mechanism to avoid duplicate data in S3).
    If this job was run automatically, check the job parameters to ensure the correct value is being passed.''')
    raise

#### Module Import

In [0]:
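from datetime import datetime, timedelta
from github import Github
from pyspark.sql import Row
#col and lit are used when combining the KPI tables further below
from pyspark.sql.functions import col, lit
import logging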

In [0]:
%run ./support_functions/utilities_functions

#### Setting date range to query

In [0]:
#Setting reference_date to today
reference_date = datetime.now()

#last_date should run UNTIL the end of its day (i.e., inclusive of that day)
last_date = reference_date - timedelta(day_delay)
last_date = last_date.replace(hour=23, minute=59, second=59)
#Because last_date is inclusive of its whole day, we move it back by 1 day so that the range
#covers exactly 'timespan_days' days and no extra day is pulled in
last_date = last_date - timedelta(days=1)

#first_date should be STARTING AT midnight (i.e., inclusive of that day)
first_date = reference_date - timedelta(day_delay + timespan_days)
first_date = first_date.replace(hour=0,minute=0,second=0)
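
To make the window arithmetic concrete, here is a small worked example that repeats the same steps as the cell above with a fixed run date. The function name `query_window` and the example run date are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def query_window(reference_date, day_delay, timespan_days):
    """Return (first_date, last_date) using the same arithmetic as the cell above."""
    # End of the day that lies day_delay days back, then one more day back
    last_date = reference_date - timedelta(day_delay)
    last_date = last_date.replace(hour=23, minute=59, second=59)
    last_date = last_date - timedelta(days=1)

    # Start of the day that lies day_delay + timespan_days days back
    first_date = reference_date - timedelta(day_delay + timespan_days)
    first_date = first_date.replace(hour=0, minute=0, second=0)
    return first_date, last_date


# Example run date (an assumption), combined with the notebook defaults
first, last = query_window(datetime(2022, 10, 21, 9, 30), day_delay=2, timespan_days=7)
print(first)  # 2022-10-12 00:00:00
print(last)   # 2022-10-18 23:59:59
```

For a run on Oct. 21 the window covers Oct. 12 through Oct. 18, i.e. exactly 7 calendar days, and its last day lies `day_delay + 1` days before the run date.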

In [0]:
#-------------------------------------
#Creating derived data types for dates:
first_date_str = first_date.strftime("%Y-%m-%d")
last_date_str = last_date.strftime("%Y-%m-%d")
 
#Creating range of days between a start date and end date - stored as list
date_pairs = get_time_pairs(first_date,last_date)
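
`get_time_pairs` is defined in the utilities notebook and is not shown here. Assuming it splits the window into one (start, end) pair per calendar day, a minimal stand-in could look like the sketch below; the name `day_pairs` and the exact pair format are hypothetical, so check the real helper before relying on this.

```python
from datetime import datetime, timedelta


def day_pairs(first_date, last_date):
    """Split [first_date, last_date] into one (day_start, day_end) pair per calendar day."""
    pairs = []
    day_start = first_date.replace(hour=0, minute=0, second=0, microsecond=0)
    while day_start <= last_date:
        day_end = day_start.replace(hour=23, minute=59, second=59)
        # The final pair never extends past the end of the window
        pairs.append((day_start, min(day_end, last_date)))
        day_start += timedelta(days=1)
    return pairs


pairs = day_pairs(datetime(2022, 10, 12), datetime(2022, 10, 18, 23, 59, 59))
print(len(pairs))     # 7
print(pairs[-1][1])   # 2022-10-18 23:59:59
```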

#### Collecting KPI results in dictionaries for eventual JOINing

In [0]:
weekly_kpis = {}
daily_kpis = {}

## GitHub KPIs
Setup:

In [0]:
gh_token = dbutils.secrets.get(scope = "team-wrangler-monitoring-scope", key = "github_kpi_kevinstine_token")
gh_client = Github(base_url="https://github.bus.zalan.do/api/v3", login_or_token=gh_token)

In [0]:
%run ./support_functions/GitHub_support_functions

### GitHub: Average response time to new tickets in 'Wrangler/issues' (per week)

In [0]:
gh_ticket_response_time = get_new_tickets_response_time(gh_client, team_wrangler, first_date, last_date, date_pairs)

#Converting to Spark DF
gh_ticket_response_time_df = spark.createDataFrame(Row(**x) for x in gh_ticket_response_time)

#aggregating values over all days and removing 'date' field (to match other week-level KPIs):
gh_ticket_response_time_df = sum_all_sparkdf_cols(gh_ticket_response_time_df, ['date'])


weekly_kpis.update({
    "gh_ticket_response_time":gh_ticket_response_time_df
})
#gh_ticket_response_time_df.show()
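
The helper `get_new_tickets_response_time` lives in the GitHub support notebook and is not shown here. As a rough illustration of the kind of calculation behind this KPI, the sketch below averages the time from issue creation to the first comment by a team member. The function name, the input structure, and the choice to skip unanswered issues are all assumptions, not the helper's actual behavior.

```python
from datetime import datetime


def mean_first_response_hours(issues, team_members):
    """Average hours between issue creation and the first comment by a team member."""
    team = set(team_members)
    response_hours = []
    for issue in issues:
        team_replies = [ts for author, ts in issue["comments"] if author in team]
        if not team_replies:
            continue  # unanswered issues are skipped in this sketch
        delta = min(team_replies) - issue["created_at"]
        response_hours.append(delta.total_seconds() / 3600)
    return sum(response_hours) / len(response_hours) if response_hours else None


# Made-up issues for illustration
issues = [
    {"created_at": datetime(2022, 10, 12, 9, 0),
     "comments": [("someone_else", datetime(2022, 10, 12, 9, 30)),
                  ("aakin", datetime(2022, 10, 12, 11, 0))]},
    {"created_at": datetime(2022, 10, 13, 14, 0),
     "comments": [("pmali", datetime(2022, 10, 13, 18, 0))]},
    {"created_at": datetime(2022, 10, 14, 8, 0), "comments": []},
]

print(mean_first_response_hours(issues, ["aakin", "pmali"]))  # 3.0
```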

### GitHub: Average time to merge new pull requests in 'Wrangler/processing-platform' (per week)

In [0]:
gh_pr_merge_time = get_pr_merge_response_time(gh_client, team_wrangler, first_date, last_date, date_pairs)

#Converting to Spark DF
gh_pr_merge_time_df = spark.createDataFrame(Row(**x) for x in gh_pr_merge_time)

#aggregating values over all days and removing 'date' field (to match other week-level KPIs):
gh_pr_merge_time_df = sum_all_sparkdf_cols(gh_pr_merge_time_df, ['date'])


weekly_kpis.update({
    "gh_pr_merge_time":gh_pr_merge_time_df
})
#gh_pr_merge_time_df.show()

---

## Databricks KPIs

---

In [0]:
db_credentials = dbutils.secrets.get(scope = "team-wrangler-monitoring-scope", key = "databricks_kpi_admin_token")
databricks_headers = {'Authorization': 'Bearer ' + db_credentials,'Content-Type':'application/json'}

In [0]:
%run ./support_functions/Databricks_support_functions

In [0]:
#Pre-loading cluster list used for multiple KPIs
databricks_clusters = get_cluster_list(databricks_headers)

### Databricks: AWS Costs per Databricks unit (monthly)
>Note: Although AWS costs are only published once per month, this KPI is currently calculated on a weekly basis.

In [0]:
aws_spend_per_dbu = get_aws_spend_over_dbus_ratio(first_date_str,last_date_str)

#aggregating values over all days and removing 'date' field (to match other week-level KPIs):
aws_spend_per_dbu = sum_all_sparkdf_cols(aws_spend_per_dbu, ['date'])

weekly_kpis.update({
    "aws_spend_per_dbu":aws_spend_per_dbu
})
#aws_spend_per_dbu.show()

### Databricks: Average cluster startup time + nodes lost events (per day)

In [0]:
#Returns JSON list
dbricks_startup_nodes = get_cluster_startup_and_nodes_lost(date_pairs, databricks_clusters, databricks_headers)

#Converting to Spark DF
dbricks_startup_nodes_df = spark.createDataFrame(Row(**x) for x in dbricks_startup_nodes)

daily_kpis.update({
    "dbricks_startup_nodes":dbricks_startup_nodes_df
})
#dbricks_startup_nodes_df.show()

### Databricks: Number of interactive clusters vs. jobs (per day)

In [0]:
#Returns Spark DF
dbricks_interactive_clusters = get_interactive_clusters_or_jobs(first_date_str,last_date_str)
daily_kpis.update({
    "dbricks_interactive_clusters":dbricks_interactive_clusters
})
#dbricks_interactive_clusters.show()

### Databricks: Availability (per day)

In [0]:
#Returns Spark DF
dbricks_databricks_availability = get_databricks_availability(first_date_str, last_date_str, minimum_downtime_in_minutes = 1)
daily_kpis.update({
    "dbricks_databricks_availability":dbricks_databricks_availability
})
#dbricks_databricks_availability.show()

### Databricks: Number of clusters with unsupported runtimes (per week)

---

**NOTE: No historical data available! Only gets current status. Be cautious when backfilling KPI data for previous weeks / months**

---

In [0]:
if not exclude_kpis_with_no_historical_data:
    #Returns JSON list
    dbricks_unsupported_runtimes = get_clusters_unsupported_runtimes(databricks_clusters, databricks_headers)

    #Converting to Spark DF
    dbricks_unsupported_runtimes_df = spark.createDataFrame(Row(**x) for x in [dbricks_unsupported_runtimes])

    weekly_kpis.update({
        "dbricks_unsupported_runtimes":dbricks_unsupported_runtimes_df
    })
    
#dbricks_unsupported_runtimes_df.show()

### Databricks: Number of teams using Databricks (per week)

### TESTING:
- [?] Returns inclusive data?
- [?] Data is checked for accuracy?

In [0]:
#Returns Spark DF
try:
    dbricks_active_teams = get_active_databricks_teams(first_date_str, last_date_str)
    #weekly_kpis is a dict, so the result is added with update() under its own key
    weekly_kpis.update({
        "dbricks_active_teams":dbricks_active_teams
    })
    #dbricks_active_teams.show()
except Exception as e:
    print("Getting active teams on Databricks failed. Likely due to inaccessible employee data on S3. Error:", e)


### Databricks: Number of unique users using Databricks (per week)

In [0]:
#Returns Spark DF
dbricks_active_users = get_active_databricks_users(first_date_str,last_date_str)
weekly_kpis.update({
    "dbricks_active_users":dbricks_active_users
})

#dbricks_active_users.show()

## Oracle KPIs

Setup:

In [0]:
oracle_config = {
    'user':dbutils.secrets.get(scope="team-wrangler-monitoring-scope", key="oracle-user-zalando-nagios"),
    'password':dbutils.secrets.get(scope="team-wrangler-monitoring-scope", key="oracle-password-zalando-nagios"),
    'jdbc_url':'jdbc:oracle:thin:@{}:{}/zalando.dummy.url.com',
    'driver':'oracle.jdbc.driver.OracleDriver',
    'host':'00.000.00.00',
    'port':'1521'
}

In [0]:
%run ./support_functions/Oracle_support_functions

### Oracle: Availability (per day)

In [0]:
oracle_availability = get_oracle_availability(oracle_config,first_date_str,last_date_str,minimum_downtime_in_minutes = 1)
daily_kpis.update({
    "oracle_availability":oracle_availability
})
#oracle_availability.show()

### Oracle: Percentage of erroneous jobs (per day)

In [0]:
oracle_erroneous_jobs = oracle_get_erroneous_jobs(oracle_config,first_date_str,last_date_str)
daily_kpis.update({
    "oracle_erroneous_jobs":oracle_erroneous_jobs
})
#oracle_erroneous_jobs.show()

### Oracle: Get % of remaining storage (per week)

---

**NOTE: No historical data available! Only gets current status. Be cautious when backfilling KPI data for previous weeks / months**

---

In [0]:
if not exclude_kpis_with_no_historical_data:
    oracle_pct_remaining_storage = oracle_get_remaining_storage(oracle_config)
    weekly_kpis.update({
        "oracle_pct_remaining_storage":oracle_pct_remaining_storage
    })
    
#oracle_pct_remaining_storage.show()

### Oracle: Active Users (per week)

In [0]:
oracle_active_users = oracle_get_active_users(oracle_config,first_date_str,last_date_str)
weekly_kpis.update({
    "oracle_active_users":oracle_active_users
})
#oracle_active_users.show()

---

## Combining data + uploading to central data storage

#### To Do (Oct. 18)
1. Configure both tables to fit format agreed on by Aykut + Shrini
2. Upload test table to S3
3. Download table and check accuracy
4. Upload new table (append) 
5. Download table and see effects of 'append' action

### Day-level data:

In [0]:
#Working on a copy so that this cell can be re-run easily if it fails
daily_kpis_copy = list(daily_kpis.values())
daily_kpi_table = daily_kpis_copy.pop()

#Joining all day-level KPIs on 'date' field
while(len(daily_kpis_copy) > 0):
    new_table = daily_kpis_copy.pop()
    daily_kpi_table = daily_kpi_table \
        .join(new_table, "date","full")
        

daily_kpi_table = daily_kpi_table.withColumnRenamed("date", "first_date")

#Duplicating 'first_date' value in 'last_date' column so that table format matches weekly data
daily_kpi_table = daily_kpi_table.withColumn("last_date",col("first_date"))
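
The "full" join type matters here because not every day-level KPI necessarily has a row for every date. The plain-Python sketch below mimics the effect on small dictionaries: dates that appear in only one table are kept and the missing columns are filled with `None`. The function name, the column names, and the values are invented for illustration.

```python
def full_outer_join_on_date(left, right):
    """Merge two {date: {column: value}} tables, keeping dates found in either one."""
    left_cols = {c for row in left.values() for c in row}
    right_cols = {c for row in right.values() for c in row}
    joined = {}
    for date in sorted(set(left) | set(right)):
        # Start with every column empty, then fill in whatever each side provides
        row = {c: None for c in sorted(left_cols | right_cols)}
        row.update(left.get(date, {}))
        row.update(right.get(date, {}))
        joined[date] = row
    return joined


availability = {
    "2022-10-12": {"availability_pct": 100.0},
    "2022-10-13": {"availability_pct": 99.5},
}
erroneous_jobs = {
    "2022-10-13": {"erroneous_jobs_pct": 2.0},
    "2022-10-14": {"erroneous_jobs_pct": 0.0},
}

joined = full_outer_join_on_date(availability, erroneous_jobs)
for date, row in joined.items():
    print(date, row)
# 2022-10-12 {'availability_pct': 100.0, 'erroneous_jobs_pct': None}
# 2022-10-13 {'availability_pct': 99.5, 'erroneous_jobs_pct': 2.0}
# 2022-10-14 {'availability_pct': None, 'erroneous_jobs_pct': 0.0}
```

An inner join would have kept only 2022-10-13 in this example.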


In [0]:
#Converting dataframe from 'wide' format to 'long'
col_names = daily_kpi_table.columns
col_names.remove("first_date")
col_names.remove("last_date")
daily_kpi_table_long = melt(daily_kpi_table,id_vars=['first_date','last_date'],value_vars=col_names)

#Specifying that these variables are day-level granularity
daily_kpi_table_long = daily_kpi_table_long.withColumn("granularity",lit("daily"))
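
The `melt` helper used above comes from the support notebooks and is not shown here. The sketch below illustrates the wide-to-long reshape on plain dictionaries, assuming the result has one row per (date, KPI) pair. The function name `melt_rows`, the output column names `variable` and `value`, and the KPI names are assumptions borrowed from the common pandas convention.

```python
def melt_rows(rows, id_vars, value_vars):
    """Turn each wide row into one long row per value column."""
    long_rows = []
    for row in rows:
        for name in value_vars:
            long_row = {key: row[key] for key in id_vars}
            long_row["variable"] = name
            long_row["value"] = row[name]
            long_rows.append(long_row)
    return long_rows


# One wide row with two made-up KPI columns
wide = [{"first_date": "2022-10-12", "last_date": "2022-10-12",
         "availability_pct": 99.5, "erroneous_jobs_pct": 2.0}]

long_rows = melt_rows(wide, ["first_date", "last_date"],
                      ["availability_pct", "erroneous_jobs_pct"])
for r in long_rows:
    print(r)
# {'first_date': '2022-10-12', 'last_date': '2022-10-12', 'variable': 'availability_pct', 'value': 99.5}
# {'first_date': '2022-10-12', 'last_date': '2022-10-12', 'variable': 'erroneous_jobs_pct', 'value': 2.0}
```

The long format keeps the table schema stable when KPIs are added or removed, which suits a table that is only ever appended to.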

### Week-level data:

In [0]:
#Aggregating day-level KPIs to be week-level:
daily_kpi_table_aggregated = sum_all_sparkdf_cols(daily_kpi_table, ['first_date','last_date'])
weekly_kpis.update({
    "daily_kpi_table_aggregated":daily_kpi_table_aggregated
})

In [0]:
#Working on a copy so that this cell can be re-run easily if it fails
weekly_kpis_copy = list(weekly_kpis.values())
weekly_kpi_table = weekly_kpis_copy.pop()

weekly_kpi_table = weekly_kpi_table.withColumn("last_date",lit(last_date_str))

#Joining all week-level KPIs on 'last_date' field
while(len(weekly_kpis_copy) > 0):
    new_table = weekly_kpis_copy.pop()
    weekly_kpi_table = weekly_kpi_table \
        .join(
        new_table.withColumn("last_date",lit(last_date_str)),
        "last_date",
        "full"
    )
    
weekly_kpi_table = weekly_kpi_table.withColumn("first_date",lit(first_date_str))

In [0]:
#Converting dataframe from 'wide' format to 'long'
col_names = weekly_kpi_table.columns
col_names.remove("first_date")
col_names.remove("last_date")
weekly_kpi_table_long = melt(weekly_kpi_table,id_vars=['first_date','last_date'],value_vars=col_names)

#Specifying that these variables are week-level granularity
weekly_kpi_table_long = weekly_kpi_table_long.withColumn("granularity",lit("weekly"))

### Merging & writing tables

In [0]:
kpi_write_table = daily_kpi_table_long.union(weekly_kpi_table_long)

In [0]:
#Only writing to S3 if this is NOT a manual run (we don't want to write to S3 accidentally > once)

if write_results:
    print("Writing to S3...")
    kpi_write_table \
        .write \
        .option("header",True) \
        .mode("append") \
        .parquet(output_bucket)
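else:
    print("No data written to S3 because this notebook was triggered manually. This is a mechanism to avoid duplicate data in S3.")
    #A bare 'raise' has no active exception here, so we raise an explicit error to mark the run as not written
    raise RuntimeError("KPI results were not written to S3 because this run was not triggered by the scheduled job.")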