# Weekly Core Metric Analysis

This notebook pre-processes, analyzes, and visualizes weekly ATRT data provided by Cerner (Armand Kok). The data arrives as a .csv file containing daily aggregated timer data, transaction counts, and related fields for a 90-day lookback period. 

10/26:
- **Business questions**:
    - Which timers are negatively affecting ATRT? 
    - What is the makeup of ATRT for a given period, and how has that changed over the previous periods? 
    - When did a shift begin for a particular timer? 
    - What are the smoothed trends for our core metrics overall?  
    - What is our ATRT for relative date ranges (month to date, past 30/60/90 days, etc.)? 
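
For the relative-date question, it is probably safer to re-aggregate the daily rows than to average the daily ATRT values. A minimal sketch, assuming ATRT equals `total_elapsed / transaction_cnt` (this matches the sample rows in this notebook, e.g. 0.80064 / 21 ≈ 0.04, but it is an inference, not something Cerner confirmed). The helper names and the toy data are hypothetical.

```python
import pandas as pd

def atrt_for_window(df, end_date, days):
    """Transaction-weighted ATRT for the `days` calendar days ending at `end_date` (inclusive)."""
    end = pd.Timestamp(end_date)
    start = end - pd.Timedelta(days=days - 1)
    window = df[(df['dt'] >= start) & (df['dt'] <= end)]
    total_cnt = window['transaction_cnt'].sum()
    if total_cnt == 0:
        return float('nan')  # no transactions in the window
    return window['total_elapsed'].sum() / total_cnt

def atrt_month_to_date(df, end_date):
    """ATRT from the first of the month through `end_date`."""
    end = pd.Timestamp(end_date)
    return atrt_for_window(df, end, end.day)

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'dt': pd.to_datetime(['2018-09-28', '2018-10-01', '2018-10-29']),
    'total_elapsed': [40.0, 30.0, 20.0],
    'transaction_cnt': [10, 20, 30],
})

past_30 = atrt_for_window(toy, '2018-10-29', 30)  # 2018-09-30 .. 2018-10-29
past_60 = atrt_for_window(toy, '2018-10-29', 60)  # includes the 2018-09-28 row
mtd = atrt_month_to_date(toy, '2018-10-29')       # 2018-10-01 .. 2018-10-29
```

Weighting by transaction count keeps low-volume days from dominating the figure.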

**To-do**
- [ ] consider moving to 120 day lookback period for control charts. 
- [ ] Joining most recent week's data: 
    - MySQL/postgres database that will index based on timer, date, application. 
    - run import job on .csv every week (Friday/Monday) to update with current week's data.

In [30]:
# Import libraries/packages
import csv
import pandas as pd
import numpy as np
import os
from datetime import datetime  # class only; a plain `import datetime` would be shadowed by this name
from pandas.tseries.offsets import BDay
from scipy.stats.mstats import winsorize
#import statsmodels.api as sm
#import seaborn as sns
#import matplotlib.pyplot as plt

### Import dataset and make sure it's working. 
**To-do**:
- 10/25 automate dataset generation. Use case for a hook into Vertica: Armand currently provides this dataset manually each week. 
- 10/25 find the best place to store the weekly files so that history is captured

In [31]:
#client_atrt = pd.read_csv('Z:\IUH\khickman1\Datasets\client_atrt10.15.18.csv')
client_atrt = pd.read_csv('C:\\Users\\khickman1\\Desktop\\client_atrt10.29.csv')
#client_atrt = client_atrt[['dt', 'application_name', 'Timer Subtimer Name', 'ATRT']]
client_atrt.head()
len(client_atrt) #should be around 650-680k rows

676399

## Preprocessing

 - Dates
 - Filtering
 - Outlier removal
 - moving average calculation
 - window calculation
 - if/else statement based on window 
 - calculate UCL/LCL
 - feature selection

**To-Do**
 - [ ] remove subtimer name? 
 
### Date column transformation

In [56]:
atrt_df = pd.DataFrame(client_atrt)
atrt_df['dt'] = pd.to_datetime(atrt_df['dt'])
print(atrt_df['dt'][0:10])
atrt_df.head()
# check to make sure the 'dt' column is type:datetime64

0   2018-10-01
1   2018-10-29
2   2018-08-15
3   2018-08-28
4   2018-09-11
5   2018-10-19
6   2018-08-20
7   2018-10-17
8   2018-09-13
9   2018-08-08
Name: dt, dtype: datetime64[ns]


Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,total_elapsed,transAbove5,transUnder2,transaction_cnt,isBDay
0,0.04,0.00%,100.00%,USR:SRG SNDOCDISP DISCONTINUING SEGMENT (BMH P...,1,SNSURGINET.EXE,2018-10-01,0.003983,BMH PACU PHASE 1 DELAYS,USR:SRG SNDOCDISP DISCONTINUING SEGMENT,0.80064,0,21,21,2018-10-01
1,0.41,0.00%,100.00%,USR: PDOC NAVIGATOR BAND CLICK (NICU FINNEGAN),1,POWERCHART,2018-10-29,0.087445,NICU FINNEGAN,USR: PDOC NAVIGATOR BAND CLICK,13.028515,0,32,32,2018-10-29
2,0.65,0.00%,100.00%,USR: ICU-OPENING-BAND (PCA ADLS AND SAFETY),1,FIRSTNET,2018-08-15,0.07563,PCA ADLS AND SAFETY,USR: ICU-OPENING-BAND,4.530593,0,7,7,2018-08-15
3,0.15,0.00%,100.00%,USR: PDOC NAVIGATOR DOC SET CLICK (PICU RISK A...,1,POWERCHART,2018-08-28,0.096875,PICU RISK ASSESSMENT,USR: PDOC NAVIGATOR DOC SET CLICK,11.668,0,79,79,2018-08-28
4,0.38,0.00%,100.00%,USR:MPG.DOCUMENTATION_HPI.O1 - LOAD COMPONENT ...,1,POWERCHART,2018-09-11,0.311815,VB_WORKFLOWAMBPEDSCARDIOLOGY,USR:MPG.DOCUMENTATION_HPI.O1 - LOAD COMPONENT,36.360582,0,95,95,2018-09-11


### Remove non-working days.  

Currently only removing weekends. 

**To-do**
- remove federal/bank holidays? E.g. Sept 3 was Labor Day. 
- clean up the isBDay column via subsetting or a boolean flag
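
One way to get a boolean flag and drop federal holidays in the same step. A rough sketch, assuming pandas' `USFederalHolidayCalendar` is an acceptable stand-in for the holidays we actually observe (that is an assumption); the toy dates are made up.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

dates = pd.DataFrame({'dt': pd.to_datetime([
    '2018-08-31',  # Friday
    '2018-09-01',  # Saturday
    '2018-09-03',  # Monday, Labor Day
    '2018-09-04',  # Tuesday
])})

# Boolean weekday flag: Monday=0 ... Sunday=6.
is_weekday = dates['dt'].dt.dayofweek < 5

# Federal holidays that fall inside the data's date range.
holidays = USFederalHolidayCalendar().holidays(
    start=dates['dt'].min(), end=dates['dt'].max())
is_holiday = dates['dt'].isin(holidays)

# True only for working days; usable directly as a row filter.
dates['isBDay'] = is_weekday & ~is_holiday
working_days = dates[dates['isBDay']]
```

A boolean column avoids the `notnull()` check that the date-valued `isBDay` column needs later on.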

In [57]:
isBusinessDay = BDay().is_on_offset  # this method was called onOffset before pandas 1.0
match_series = atrt_df['dt'].map(isBusinessDay)
atrt_df['isBDay'] = atrt_df['dt'][match_series]
atrt_df

Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,total_elapsed,transAbove5,transUnder2,transaction_cnt,isBDay
0,0.04,0.00%,100.00%,USR:SRG SNDOCDISP DISCONTINUING SEGMENT (BMH P...,1,SNSURGINET.EXE,2018-10-01,0.003983,BMH PACU PHASE 1 DELAYS,USR:SRG SNDOCDISP DISCONTINUING SEGMENT,0.800640,0,21,21,2018-10-01
1,0.41,0.00%,100.00%,USR: PDOC NAVIGATOR BAND CLICK (NICU FINNEGAN),1,POWERCHART,2018-10-29,0.087445,NICU FINNEGAN,USR: PDOC NAVIGATOR BAND CLICK,13.028515,0,32,32,2018-10-29
2,0.65,0.00%,100.00%,USR: ICU-OPENING-BAND (PCA ADLS AND SAFETY),1,FIRSTNET,2018-08-15,0.075630,PCA ADLS AND SAFETY,USR: ICU-OPENING-BAND,4.530593,0,7,7,2018-08-15
3,0.15,0.00%,100.00%,USR: PDOC NAVIGATOR DOC SET CLICK (PICU RISK A...,1,POWERCHART,2018-08-28,0.096875,PICU RISK ASSESSMENT,USR: PDOC NAVIGATOR DOC SET CLICK,11.668000,0,79,79,2018-08-28
4,0.38,0.00%,100.00%,USR:MPG.DOCUMENTATION_HPI.O1 - LOAD COMPONENT ...,1,POWERCHART,2018-09-11,0.311815,VB_WORKFLOWAMBPEDSCARDIOLOGY,USR:MPG.DOCUMENTATION_HPI.O1 - LOAD COMPONENT,36.360582,0,95,95,2018-09-11
5,0.21,0.00%,100.00%,USR:MPG.MICRO.O2 - LOAD COMPONENT (VB_WORKFLOW...,1,POWERCHART,2018-10-19,0.108207,VB_WORKFLOWIPCHARIS,USR:MPG.MICRO.O2 - LOAD COMPONENT,0.615875,0,3,3,2018-10-19
6,0.53,0.00%,100.00%,USR:MPG.NEW_ORDER_ENTRY.O1 - LOAD COMPONENT (V...,1,FIRSTNET,2018-08-20,0.170823,VB_EDQUICKORDERS,USR:MPG.NEW_ORDER_ENTRY.O1 - LOAD COMPONENT,255.817332,0,484,484,2018-08-20
7,1.15,0.00%,100.00%,USR:SRG SNDOCDISP SAVING SEGMENT (UH RAD GENER...,1,SNSURGINET.EXE,2018-10-17,0.493563,UH RAD GENERAL CASE DATA,USR:SRG SNDOCDISP SAVING SEGMENT,16.084465,0,14,14,2018-10-17
8,0.61,0.00%,100.00%,USR:SRG SNDOCDISP SAVING SEGMENT (NH ENDO DEPA...,1,SNSURGINET.EXE,2018-09-13,0.320618,NH ENDO DEPARTURE FROM OR,USR:SRG SNDOCDISP SAVING SEGMENT,7.933674,0,13,13,2018-09-13
9,0.29,0.00%,100.00%,USR:FAMHX.LOAD PROFILE (N/A),1,SNSURGINET,2018-08-08,0.081754,,USR:FAMHX.LOAD PROFILE,15.467000,0,53,53,2018-08-08


### Filter dataset by application (e.g. PowerChart, FirstNet, SurgiNet)
The weekend/weekday filter is also applied in this step. 

**to-do**
- [ ] are there differences between timers in the different applications? 
- [x] clean up the workflow to do one step at a time

In [58]:
atrt_df_PC = atrt_df[atrt_df['application_name']== 'POWERCHART']
atrt_df_PC = atrt_df_PC[atrt_df_PC.isBDay.notnull()]
atrt_df_PC

Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,total_elapsed,transAbove5,transUnder2,transaction_cnt,isBDay
1,0.41,0.00%,100.00%,USR: PDOC NAVIGATOR BAND CLICK (NICU FINNEGAN),1,POWERCHART,2018-10-29,0.087445,NICU FINNEGAN,USR: PDOC NAVIGATOR BAND CLICK,13.028515,0,32,32,2018-10-29
3,0.15,0.00%,100.00%,USR: PDOC NAVIGATOR DOC SET CLICK (PICU RISK A...,1,POWERCHART,2018-08-28,0.096875,PICU RISK ASSESSMENT,USR: PDOC NAVIGATOR DOC SET CLICK,11.668000,0,79,79,2018-08-28
4,0.38,0.00%,100.00%,USR:MPG.DOCUMENTATION_HPI.O1 - LOAD COMPONENT ...,1,POWERCHART,2018-09-11,0.311815,VB_WORKFLOWAMBPEDSCARDIOLOGY,USR:MPG.DOCUMENTATION_HPI.O1 - LOAD COMPONENT,36.360582,0,95,95,2018-09-11
5,0.21,0.00%,100.00%,USR:MPG.MICRO.O2 - LOAD COMPONENT (VB_WORKFLOW...,1,POWERCHART,2018-10-19,0.108207,VB_WORKFLOWIPCHARIS,USR:MPG.MICRO.O2 - LOAD COMPONENT,0.615875,0,3,3,2018-10-19
13,2.27,0.66%,35.77%,USR: DOCV: ADD NEW PN NOTE (N/A),1,POWERCHART,2018-10-11,1.044684,,USR: DOCV: ADD NEW PN NOTE,22526.261366,66,3551,9928,2018-10-11
17,0.08,0.00%,100.00%,USR:MPG.SECONDARYASSESSMENT.O2 - LOAD COMPONEN...,1,POWERCHART,2018-10-25,0.003860,VB_WORKFLOWAMBOBPRENATALPPUN,USR:MPG.SECONDARYASSESSMENT.O2 - LOAD COMPONENT,0.163154,0,2,2,2018-10-25
22,0.63,0.00%,96.30%,USR:MPG.DOCUMENTS.O2 - LOAD COMPONENT (VB_WORK...,1,POWERCHART,2018-08-10,0.449859,VB_WORKFLOWIPPEDSSURGERY,USR:MPG.DOCUMENTS.O2 - LOAD COMPONENT,17.040464,0,26,27,2018-08-10
26,0.24,0.00%,99.87%,USR:MPG.ORDERSELECTION.O2 - LOAD COMPONENT (VB...,1,POWERCHART,2018-10-11,0.126692,VB_QUICKORDERSPULMONARY,USR:MPG.ORDERSELECTION.O2 - LOAD COMPONENT,185.579233,0,783,784,2018-10-11
29,0.14,0.00%,100.00%,USR:MPG.COMMUNICATION_EVENTS.O1.SUM - LOAD COM...,1,POWERCHART,2018-09-11,,VB_HCCASELIST,USR:MPG.COMMUNICATION_EVENTS.O1.SUM - LOAD COM...,0.138463,0,1,1,2018-09-11
30,0.25,0.00%,100.00%,USR:MPG.ORDERSELECTION.O2 - LOAD COMPONENT (VB...,1,POWERCHART,2018-09-25,0.129549,VB_QUICKORDERSPULMONARY,USR:MPG.ORDERSELECTION.O2 - LOAD COMPONENT,67.500276,0,273,273,2018-09-25


## Anomaly detection/Handling

#### Applying quantiles:

Apply 5th and 95th quantiles to each timer for the entire 90 day period. 
Then we can lookup the values using ```loc``` and ```filter``` 


**To-do**
- Investigate outliers/anomalies by timer. Replace with mean/remove altogether. 
- Use KNN? 
- Complete - remove outliers by timer. 

Sample outlier removal calculation by group. 
```atrt_df_PC[np.abs(atrt_df_PC.ATRT-atrt_df_PC.ATRT.mean()) <= (3*atrt_df_PC.ATRT.std())]```

Takes the absolute deviation of each data point from the mean and keeps the point only if it lies within 3 standard deviations; anything farther out is removed. 
- I need to do this by group! 
- Group first, then pass in group calculation. 
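
A sketch of the same 3-standard-deviation rule applied per timer, using `groupby(...).transform` so the group mean and standard deviation line up row by row with the original frame. The toy data is made up; only the column names come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'Timer Subtimer Name': ['A'] * 12 + ['B'] * 3,
    'ATRT': [1.0] * 11 + [50.0] + [10.0, 11.0, 12.0],
})

grouped = toy.groupby('Timer Subtimer Name')['ATRT']
group_mean = grouped.transform('mean')  # each row gets its own timer's mean
group_std = grouped.transform('std')    # sample std (ddof=1); NaN for single-row timers

# Keep rows within 3 standard deviations of their own timer's mean.
# Rows from single-row timers compare against NaN and are therefore dropped.
within_3_sigma = (toy['ATRT'] - group_mean).abs() <= 3 * group_std
filtered = toy[within_3_sigma]
```

Here only the 50.0 reading in timer A is removed; timer B is untouched because its values are judged against its own mean, not the overall one.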

In [59]:
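#find outliers: 5th and 95th percentile of ATRT for each timer
outliers = atrt_df_PC.groupby(["Timer Subtimer Name"])['ATRT'].quantile([0.05, 0.95]).unstack(level=1)
#filter outliers: keep rows strictly between their own timer's 5th and 95th percentiles
timer_names = atrt_df_PC['Timer Subtimer Name']
lower = outliers.loc[timer_names, .05].values
upper = outliers.loc[timer_names, .95].values
#.copy() so that columns added later do not trigger SettingWithCopyWarning
atrt_df_PC = atrt_df_PC.loc[(lower < atrt_df_PC.ATRT.values) & (atrt_df_PC.ATRT.values < upper)].copy()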


### Rolling average notes
Lambda function for applying rolling average to a group: 
The groupby statement can be in an earlier variable, but I've chosen to include it here. 

**To-do**
- [ ] Find out exactly what reset_index does - reset the average calculation to the beginning of each group? 
- [ ] Remove outliers/anomalies by timer. Replace with mean/remove altogether. 
- [ ] Use KNN for clustering timers...analyze what timers that have shifted up have in common. 
- [ ] Establish control chart
    - UCL = .75 quartile + 1.5x IQR
    - LCL = .25 quartile - 1.5x IQR
    - mean = 60-day mean
    - median = 60-day median (use with outliers included)
```
atrt_df_PC.groupby('Timer Subtimer Name')['ATRT'].rolling(30).mean().reset_index(0,drop=True)
```
Group by timer only: adding 'dt' to the groupby leaves roughly one row per group (one row per timer per day), so a 30-row window never fills and every value comes back NaN. 
Pass into dataset feature with ```atrt_df_PC['mavg30'] = (above statement)```

```atrt_df_PC['mavg30'] = atrt_df_PC.groupby('Timer Subtimer Name')['ATRT'].rolling(30).mean().reset_index(0,drop=True)```
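
On the `reset_index` to-do: a small sketch of what it does after a grouped rolling mean. The toy frame and the window of 2 are made up for readability.

```python
import pandas as pd

toy = pd.DataFrame({
    'Timer Subtimer Name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'ATRT': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

rolled = toy.groupby('Timer Subtimer Name')['ATRT'].rolling(2).mean()
# `rolled` has a two-level index: (timer name, original row label).
# The window restarts for each timer because of the groupby, not because of reset_index.

aligned = rolled.reset_index(level=0, drop=True)
# Dropping level 0 leaves only the original row labels, so the result
# lines up with `toy` when it is assigned as a new column.
toy['mavg2'] = aligned
```

So `reset_index(0, drop=True)` only discards the group-key level of the index; it does not change any of the averages.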

In [60]:
#Setting index appears to cause issues when outputting to csv: 
#atrt_df_PC = atrt_df_PC.set_index(['Timer Subtimer Name', 'dt'])

atrt_df_PC = atrt_df_PC.sort_values(['Timer Subtimer Name', 'dt'])
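# transform() keeps the original row index, so the result aligns with atrt_df_PC on assignment
atrt_df_PC['mavg30'] = atrt_df_PC.groupby('Timer Subtimer Name')['ATRT'].transform(lambda x: x.rolling(center=False, window=30).mean())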

#check to make sure that everything looks correct: 
atrt_df_PC


Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,total_elapsed,transAbove5,transUnder2,transaction_cnt,isBDay,mavg30
642711,0.14,0.02%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-07-31,0.273690,,DMSM_GETMEDIACONTENT,880.955,1,6210,6211,2018-07-31,
280854,0.14,0.02%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-01,0.143897,,DMSM_GETMEDIACONTENT,900.327,1,6230,6231,2018-08-01,
447545,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-02,0.074053,,DMSM_GETMEDIACONTENT,889.563,0,6090,6090,2018-08-02,
376550,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-03,0.086916,,DMSM_GETMEDIACONTENT,777.760,0,5006,5007,2018-08-03,
305295,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-06,0.080803,,DMSM_GETMEDIACONTENT,908.990,0,6021,6021,2018-08-06,
98917,0.15,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-07,0.085864,,DMSM_GETMEDIACONTENT,931.508,0,6202,6203,2018-08-07,
426129,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-08,0.093581,,DMSM_GETMEDIACONTENT,900.311,0,5663,5664,2018-08-08,
347498,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-09,0.104255,,DMSM_GETMEDIACONTENT,938.544,0,5701,5702,2018-08-09,
8354,0.16,0.00%,99.96%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-10,0.097991,,DMSM_GETMEDIACONTENT,782.390,0,4969,4971,2018-08-10,
661101,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-14,0.070129,,DMSM_GETMEDIACONTENT,921.509,0,6318,6318,2018-08-14,


### Windows

Add window - 7/8 day window using pandas shift(). 

Difference between Trend and Shift
- trend = "7 days in a row"
- shift = "8 days above median"

**to-do**
- [x] start with shift/window. 
- [ ] establish how to identify trends and shifts using elif statements.

**questions**
- what happens when I compare something with a NaN value? - e.g. I compare days 1-7 with NaN? 
- do I need to drop rows with NaN, or return a null value? 
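
A quick check of the NaN questions: any ordered comparison against NaN is False, so the first rows of a lag column silently count as "not above". The sketch below also shows one way to return a null instead. The `baseline` column and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ATRT': [0.14, 0.15, 0.16, 0.17],
    'baseline': [0.15, 0.15, 0.15, 0.15],
})
toy['t-1'] = toy['ATRT'].shift(1)  # first row becomes NaN

# NaN >= anything evaluates to False, so row 0 is reported as "not above".
above = toy['t-1'] >= toy['baseline']

# To return a null instead of False, blank out rows whose lag is missing.
above_or_null = above.astype(object).where(toy['t-1'].notna(), np.nan)
```

Dropping the NaN rows is therefore not strictly required, but without the mask a missing lag is indistinguishable from a genuine "below baseline".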

In [61]:
# dataframe var is atrt_df_PC (already sorted by timer and date)

# Lag ATRT by 1..8 rows within each timer so values never leak from one timer into the next.
atrt_by_timer = atrt_df_PC.groupby('Timer Subtimer Name')['ATRT']
for lag in range(1, 9):
    atrt_df_PC[f't-{lag}'] = atrt_by_timer.shift(lag)

df_pc = atrt_df_PC
df_pc.head()

Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,...,isBDay,mavg30,t-1,t-2,t-3,t-4,t-5,t-6,t-7,t-8
642711,0.14,0.02%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-07-31,0.27369,,DMSM_GETMEDIACONTENT,...,2018-07-31,,,,,,,,,
280854,0.14,0.02%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-01,0.143897,,DMSM_GETMEDIACONTENT,...,2018-08-01,,0.14,,,,,,,
447545,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-02,0.074053,,DMSM_GETMEDIACONTENT,...,2018-08-02,,0.14,0.14,,,,,,
376550,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-03,0.086916,,DMSM_GETMEDIACONTENT,...,2018-08-03,,0.15,0.14,0.14,,,,,
305295,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-06,0.080803,,DMSM_GETMEDIACONTENT,...,2018-08-06,,0.16,0.15,0.14,0.14,,,,


## Control Charts
Set upper and lower control limits, 60-day mean, signal detection. 
- Upper and lower control limits. 
    - Upper OUTLIER = .75 qt + 1.5x IQR??
    - [x] Upper CONTROL = .75 qt 
    - [ ] Lower Outlier = .25 qt - 1.5x IQR
    - [x] Lower control = .25 qt 
    - [x] Obtain IQR (on 60 day mean) for each timer. 
    - [ ] Multiply that by 1.5, set as variable
    - [x] Then calculate UCL and LCL for each timer. 
    
- Shift detection:
    - When 8 of the previous N days (t-1 ... t-N) are above/below the median/mean
    - elif statement? 
   
- Trend detection:
    - When 7 days in a row (t-1 ... t-7) are each higher than the day before.
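
A vectorized sketch of the two rules above, as an alternative to a chain of elif statements. It uses the stricter "8 in a row" reading of the shift rule, and the fixed baseline of 1.5 is an assumed reference value (in practice it would be the timer's median or mean). `run_of_true` is a hypothetical helper and the series is made up.

```python
import pandas as pd

def run_of_true(flags, length):
    """True where the current row and the previous `length - 1` rows are all True."""
    return flags.astype(int).rolling(length).sum() >= length

atrt = pd.Series([1.0, 1.0, 1.0, 1.0, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7])
baseline = 1.5  # assumed reference level, e.g. a prior-period median

# Shift: 8 consecutive points above the baseline.
shift_up = run_of_true(atrt > baseline, 8)

# Trend: 7 consecutive day-over-day increases.
trend_up = run_of_true(atrt.diff() > 0, 7)
```

On the real data this would need to run per timer, for example inside a `groupby('Timer Subtimer Name')`, so that runs do not span two timers.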
    
**To-Do**

- [x] interquartile range - loc lookup to the dataset? running calc? pandas has quantile built into its rolling functions
- [x] drop NA values after creating all relevant metrics. 
- [ ] plotting control charts for timer. 
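
A sketch of the per-timer limits listed above: quartiles as control limits and 1.5x IQR fences for outliers. It uses whole-period quartiles rather than a rolling window to keep the example short; the output column names and the toy data are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Timer Subtimer Name': ['A'] * 5 + ['B'] * 5,
    'ATRT': [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 10.0, 10.0, 10.0, 30.0],
})

grouped = toy.groupby('Timer Subtimer Name')['ATRT']
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

toy['LCL'] = q1                          # lower control = .25 quantile
toy['UCL'] = q3                          # upper control = .75 quantile
toy['lower_outlier'] = q1 - 1.5 * iqr    # lower outlier fence
toy['upper_outlier'] = q3 + 1.5 * iqr    # upper outlier fence
toy['out_of_range'] = (
    (toy['ATRT'] < toy['lower_outlier']) | (toy['ATRT'] > toy['upper_outlier'])
)
```

Note that a timer with a constant ATRT has an IQR of 0, so any deviation at all is flagged, as with the 30.0 reading in timer B.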


**Questions**

- Does it make sense to aggregate by week for smoother trends and less volatility? 
- Which timers gave consistent up/down trend/shift signals for that week? Month? 
- Which timers were outside the control limits more than N times during the previous period? 

### Quantiles

In [81]:
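# Rolling 30-row quartiles and variance, computed within each timer
atrt_by_timer = df_pc.groupby('Timer Subtimer Name')['ATRT']
df_pc['LCL_30'] = atrt_by_timer.transform(lambda s: s.rolling(30).quantile(.25, interpolation='lower'))
df_pc['UCL_30'] = atrt_by_timer.transform(lambda s: s.rolling(30).quantile(.75, interpolation='higher'))
df_pc['var_30'] = atrt_by_timer.transform(lambda s: s.rolling(30).var(ddof=1))

#IQR for future outlier removal
#df_pc['IQR'] = df_pc['UCL_30'] - df_pc['LCL_30']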

df_pc

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,...,t-2,t-3,t-4,t-5,t-6,t-7,t-8,LCL_30,UCL_30,var_30
673451,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-12,0.070212,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,0.14,0.15,0.15,,,
261648,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-13,0.089830,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.14,0.15,0.14,0.15,,,
48954,0.15,0.00%,99.96%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-14,0.089884,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.15,0.14,0.15,0.14,,,
394703,0.15,0.03%,99.97%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-17,0.156930,,DMSM_GETMEDIACONTENT,...,0.16,0.15,0.15,0.15,0.15,0.14,0.15,,,
332814,0.14,0.03%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-18,0.233025,,DMSM_GETMEDIACONTENT,...,0.15,0.16,0.15,0.15,0.15,0.15,0.14,,,
641619,0.15,0.00%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-27,0.101519,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.16,0.15,0.15,0.15,0.15,,,
383603,0.15,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-28,0.100805,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.16,0.15,0.15,0.15,,,
399610,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-02,0.070028,,DMSM_GETMEDIACONTENT,...,0.15,0.14,0.15,0.15,0.16,0.15,0.15,,,
392869,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-03,0.062836,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,0.15,0.16,0.15,,,
571543,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-09,0.069858,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.14,0.15,0.15,0.16,,,


### Remove missing values
Remove any rows that do not yet have a 30-day moving average.
Use ```np.isfinite``` instead of dropping NA values. 

**ToDo**

- [ ] should we instead just use a dummy variable if any value is NA? 

In [84]:
df_pc = df_pc[np.isfinite(df_pc['mavg30'])].copy()  # .copy() avoids SettingWithCopyWarning on later assignments
df_pc

Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,...,t-2,t-3,t-4,t-5,t-6,t-7,t-8,LCL_30,UCL_30,var_30
673451,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-12,0.070212,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,0.14,0.15,0.15,,,
261648,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-13,0.089830,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.14,0.15,0.14,0.15,,,
48954,0.15,0.00%,99.96%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-14,0.089884,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.15,0.14,0.15,0.14,,,
394703,0.15,0.03%,99.97%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-17,0.156930,,DMSM_GETMEDIACONTENT,...,0.16,0.15,0.15,0.15,0.15,0.14,0.15,,,
332814,0.14,0.03%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-18,0.233025,,DMSM_GETMEDIACONTENT,...,0.15,0.16,0.15,0.15,0.15,0.15,0.14,,,
641619,0.15,0.00%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-27,0.101519,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.16,0.15,0.15,0.15,0.15,,,
383603,0.15,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-28,0.100805,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.16,0.15,0.15,0.15,,,
399610,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-02,0.070028,,DMSM_GETMEDIACONTENT,...,0.15,0.14,0.15,0.15,0.16,0.15,0.15,,,
392869,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-03,0.062836,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,0.15,0.16,0.15,,,
571543,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-09,0.069858,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.14,0.15,0.15,0.16,,,


### Signal detection: 

- when did the signal start? 
- what is the signal up/down shift or trend? 
- include "magnitude of change"
- how about cumulative sum? 
- abs(t1-t2) + abs(t2-t3)


**Questions**

- How useful is identifying a spike if we don't have daily data? 
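
One possible reading of the "magnitude of change" bullet (`abs(t1-t2) + abs(t2-t3)`): a running total of absolute day-over-day changes per timer. This is a sketch only, and it is not the classical CUSUM control-chart statistic. The column name `cum_abs_change` and the toy data are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Timer Subtimer Name': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'ATRT': [1.0, 1.5, 1.0, 3.0, 5.0, 5.0, 4.0],
})

# Absolute day-over-day change within each timer (first row of each timer is NaN).
abs_change = toy.groupby('Timer Subtimer Name')['ATRT'].diff().abs()

# Running total of those changes, restarted for every timer.
toy['cum_abs_change'] = (
    abs_change.fillna(0).groupby(toy['Timer Subtimer Name']).cumsum()
)
```

A timer that oscillates accumulates as much as one that climbs steadily, so this measures volatility rather than direction.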

In [95]:
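df_pc_test = df_pc.copy()  # copy so the 'signal' column is not added to df_pc itself

def signal_detect(df):
    # `df` is a single row here because apply() is called with axis=1.
    # So far only the two oldest lags (t-7, t-8) are compared with the 30-day moving average;
    # rows that do not match fall through and return None.
    if df['t-8'] >= df['mavg30'] and df['t-7'] >= df['mavg30']: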
        return 'Red'
    #elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
    #    return 'Yellow'
    #elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
    #    return 'Orange'
    #elif (df['height'] > 8):
    #    return np.nan

df_pc_test['signal'] = df_pc_test.apply(signal_detect, axis = 1)
df_pc_test

Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,...,t-4,t-5,t-6,t-7,t-8,LCL_30,UCL_30,var_30,cusum,signal
673451,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-12,0.070212,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.14,0.15,0.15,,,,0.00,
261648,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-13,0.089830,,DMSM_GETMEDIACONTENT,...,0.15,0.14,0.15,0.14,0.15,,,,0.00,
48954,0.15,0.00%,99.96%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-14,0.089884,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,0.14,,,,0.01,
394703,0.15,0.03%,99.97%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-17,0.156930,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.14,0.15,,,,0.01,
332814,0.14,0.03%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-18,0.233025,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.15,0.14,,,,0.00,
641619,0.15,0.00%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-27,0.101519,,DMSM_GETMEDIACONTENT,...,0.16,0.15,0.15,0.15,0.15,,,,0.01,
383603,0.15,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-28,0.100805,,DMSM_GETMEDIACONTENT,...,0.15,0.16,0.15,0.15,0.15,,,,0.01,
399610,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-02,0.070028,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.16,0.15,0.15,,,,0.00,
392869,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-03,0.062836,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.16,0.15,,,,0.01,Red
571543,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-09,0.069858,,DMSM_GETMEDIACONTENT,...,0.15,0.14,0.15,0.15,0.16,,,,0.00,Red


### Investigate the data on a weekly basis
- Counts
- Sums
- Averages
- Variance
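
A sketch of the weekly roll-up, grouping by timer and by calendar week. Weeks ending on Sunday (`W-SUN`) are an assumption, as are the output column names and the toy data.

```python
import pandas as pd

toy = pd.DataFrame({
    'Timer Subtimer Name': ['A', 'A', 'A', 'B'],
    'dt': pd.to_datetime(['2018-10-01', '2018-10-03', '2018-10-08', '2018-10-02']),
    'ATRT': [1.0, 3.0, 5.0, 2.0],
    'transaction_cnt': [10, 30, 20, 5],
})

weekly = (
    toy.groupby(['Timer Subtimer Name', pd.Grouper(key='dt', freq='W-SUN')])
       .agg(days=('ATRT', 'count'),                  # rows (days) in the week
            transactions=('transaction_cnt', 'sum'),
            mean_atrt=('ATRT', 'mean'),
            var_atrt=('ATRT', 'var'))                # NaN when the week has one row
       .reset_index()
)
```

The `dt` column in `weekly` holds the week-ending date, so 2018-10-01 and 2018-10-03 both land on 2018-10-07.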

### Write out to csv:

In [81]:
atrt_df_PC.to_csv('C:\\Users\\khickman1\\Desktop\\PCMavg30.csv', sep=',')

### Stuff I learned: 

- Moving average example
- Subsetting
- Date and column filtering
- Index by both date and by timer/subtimer using pandas 'set index'
- Indexing
- Sorting
- Get unique values
- grouping
- filtering outliers (clipping, trimming, winsorizing)
- moving data

```
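# Wrong: the comparison sits inside the brackets, so this evaluates atrt_df[False] and raises a KeyError:
#open_chart = atrt_df['Timer Subtimer Name' == "USR:PWR-OPEN CHART (DISCERNRPT/DISCERNPCTAB.DLL)"]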
open_chart = atrt_df[atrt_df['Timer Subtimer Name'] == 'USR:PWR-OPEN CHART (DISCERNRPT/DISCERNPCTAB.DLL)']
open_chart = open_chart[open_chart['application_name']== 'POWERCHART']
open_chart_post = open_chart[(open_chart['dt'] > '2018-07-01') & (open_chart['dt'] < '2018-09-30')]
open_chart = open_chart.set_index(['Timer Subtimer Name', 'dt'])
open_chart = open_chart.sort_index(axis=0)
open_chart['moving_avg'] = open_chart['ATRT'].rolling(30).mean()
atrt_df['dt'] = pd.to_datetime(atrt_df['dt'].dt.date)
atrt_high_trans = atrt_df[atrt_df['transaction_cnt'] > 1000]
atrt_high_trans.sort_values(by=['transaction_cnt'], ascending=False)
len(atrt_high_trans)
grp_high_trans = atrt_high_trans.groupby(["Timer Subtimer Name", "dt"])
grp_high_trans.head()
open_chart_post.groupby('Timer Subtimer Name').nunique()
len(open_chart_post['Timer Subtimer Name'].unique())
len(open_chart_post)
open_chart_slim = open_chart_post[['ATRT', 'dt']]
open_chart_slim.set_index('dt')
open_chart_slim.sort_values('dt')

```

Stuff that didn't work
```
#atrt_df_PC = atrt_df_PC[['Timer Subtimer Name', 'dt', 'ATRT', 'application_name']]
#atrt_df_PC['mavg30'] = atrt_df_PC['ATRT'].rolling(30).mean()
#PC = atrt_df_PC.groupby(['Timer Subtimer Name', 'dt'])
#atrt_df_PC = atrt_df_PC.set_index(['Timer Subtimer Name', 'dt'])
#PC = atrt_df_PC
#PC.head()
#atrt_df_PC = atrt_df_PC.set_index(['Timer Subtimer Name'])
#atrt_df_PC = atrt_df_PC.sort_index(axis=0)
#atrt_df_PC = atrt_df_PC.groupby(['Timer Subtimer Name'])
#atrt_df_PC = atrt_df_PC.sort(['Timer Subtimer Name', 'dt']).groupby('Timer Subtimer Name')
#atrt_df_PC['mavg30'] = atrt_df_PC['ATRT'].rolling(30).mean()
#atrt_df_PC[0:100]
```


Getting csv from website example
```
import csv
import urllib3
import requests

#This URL will be the URL that your login form points to with the "action" tag.
post_login_url = 'https://cernercare.com/accounts/login?returnTo=https%3A%2F%2Flightson.cerner.com%2Fsocial-auth%2Fcomplete%2Fprofessional%2F%3Fjanrain_nonce%3D2018-10-23T15%253A00%253A41ZYe3mOd'

#This URL is the page you actually want to pull down with requests.
request_url = 'https://lightson.cerner.com/clients/CHP_IN/domains/P1558/kpi/response-time/worst-timers.csv?dt=2018-10-22'
#request_url = 'https://lightson.cerner.com/api/metrics/trend.csv?category_name=Performance&metrics%5B%5D=RT_2SEC&cdr_ids=95370&data_type=ENVIRONMENT&date_type=DAILY&days_of_week=1%2C2%2C3%2C4%2C5%2C6%2C7&doc_cuid=LONfa31c997c3d543768036e2941&end_date=2018-10-22&physician_rollup_flag=2&start_date=2018-07-25'

payload = {
    'username': 'khickman1',
    'pass': '<password>'
}

with requests.Session() as session:
    post = session.post(post_login_url, data=payload)
    r = session.get(request_url)
    decoded_content = r.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    for row in my_list: 
        print(row)

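# Write all rows in one pass. Reuse my_list because the csv.reader was already consumed by list(cr).
# Text mode with newline="" is what the csv module expects in Python 3.
with open("daily.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(my_list)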
```
