# Weekly Core Metric Analysis

This notebook pre-processes, analyzes, and visualizes weekly ATRT data provided by Cerner (Armand Kok). The data arrives as a .csv dataset of daily aggregated timer data, transaction counts, and related fields for a 90-day lookback period.

10/26:
- **Business questions**:
    - Which timers are negatively affecting ATRT? 
    - What is the makeup of ATRT for a given period, and how has that changed over the previous periods? 
    - When did a shift begin for a particular timer? 
    - What are the smoothed trends for our core metrics overall?  
    - What is our ATRT for relative dates (month to date, past 30/60/90, etc). 
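
A minimal sketch of the relative-date question above. It assumes ATRT can be recomputed as `total_elapsed / transaction_cnt`, which is an inference from the sample rows shown later in this notebook (for example 3482.946764 / 1202 ≈ 2.9), not a confirmed definition. The function name `atrt_for_lookback` and the demo values are hypothetical.

```python
import pandas as pd

def atrt_for_lookback(df, days, as_of=None):
    """Transaction-weighted ATRT over the `days` ending at `as_of` (assumed definition)."""
    as_of = df['dt'].max() if as_of is None else pd.Timestamp(as_of)
    start = as_of - pd.Timedelta(days=days)
    window = df[(df['dt'] > start) & (df['dt'] <= as_of)]
    # Weighted average: total seconds divided by total transactions in the window.
    return window['total_elapsed'].sum() / window['transaction_cnt'].sum()

# Hypothetical demo data, not taken from the real dataset.
demo = pd.DataFrame({
    'dt': pd.to_datetime(['2018-10-01', '2018-10-20', '2018-10-31']),
    'total_elapsed': [10.0, 30.0, 20.0],
    'transaction_cnt': [10, 10, 20],
})
atrt_30 = atrt_for_lookback(demo, 30)
atrt_60 = atrt_for_lookback(demo, 60)
```

Weighting by transaction count keeps low-volume timers from dominating the result, which a plain mean of the daily `ATRT` column would not do.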

**To-do**
- [ ] consider moving to 120 day lookback period for control charts. 
- [ ] Joining most recent week's data: 
    - MySQL/postgres database that will index based on timer, date, application. 
    - run import job on .csv every week (Friday/Monday) to update with current week's data.

In [1]:
# Import libraries/packages
import csv
import os
# Only the datetime class is imported; a plain "import datetime" would be shadowed by it.
from datetime import datetime

import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay
from scipy import stats
#from scipy.stats.mstats import winsorize
#import statsmodels.api as sm
#import seaborn as sns
#import matplotlib.pyplot as plt

### Import dataset and make sure it's working. 
**To-do**:
- 10/25 automate dataset generation. Use case for hook into Vertica: Armand is currently manually giving this ds weekly. 
- 10/25 find best place to store to capture history

In [2]:
#client_atrt = pd.read_csv('Z:\IUH\khickman1\Datasets\client_atrt10.15.18.csv')
client_atrt = pd.read_csv('C:\\Users\\khickman1\\Desktop\\client_atrt11.1.18.csv')
#client_atrt = client_atrt[['dt', 'application_name', 'Timer Subtimer Name', 'ATRT']]
client_atrt.head()
len(client_atrt) #should be around 650-680k rows

678178

## Preprocessing

 - Dates
 - Filtering
 - Outlier removal
 - moving average calculation
 - window calculation
 - if/else statement based on window 
 - calculate UCL/LCL
 - feature selection

**To-Do**
 - [ ] remove subtimer name? 
 
### Date column transformation

In [3]:
atrt_df = pd.DataFrame(client_atrt)
atrt_df['dt'] = pd.to_datetime(atrt_df['dt'])
print(atrt_df['dt'][0:10])
atrt_df.head()
# check to make sure the 'dt' column is type:datetime64

0   2018-08-19
1   2018-10-11
2   2018-08-24
3   2018-09-14
4   2018-09-04
5   2018-09-27
6   2018-08-04
7   2018-09-11
8   2018-10-20
9   2018-10-04
Name: dt, dtype: datetime64[ns]


Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,total_elapsed,transAbove5,transUnder2,transaction_cnt
0,2.9,0.08%,7.57%,USR: PN AUTOPOPULATE DATA (N/A),1,FIRSTNET,2018-08-19,0.562857,,USR: PN AUTOPOPULATE DATA,3482.946764,1,91,1202
1,0.03,0.00%,100.00%,USR:SRG SNDOCDISP DISCONTINUING SEGMENT (SSSC ...,1,SNSURGINET.EXE,2018-10-11,0.003413,SSSC HEMOSTASIS,USR:SRG SNDOCDISP DISCONTINUING SEGMENT,0.154385,0,5,5
2,0.22,0.00%,100.00%,USR:MPG.POC SUMMARY COMPONENT - LOAD COMPONENT...,1,FIRSTNET,2018-08-24,,,USR:MPG.POC SUMMARY COMPONENT - LOAD COMPONENT,0.221,0,1,1
3,0.85,0.02%,99.88%,USR:BSC-ENSURE MAW RESULTS (N/A),1,POWERCHART,2018-09-14,0.225968,,USR:BSC-ENSURE MAW RESULTS,31670.881,8,37109,37153
4,2.11,0.00%,66.67%,USR:MPG.NEW_ORDER_ENTRY.O2 - LOAD COMPONENT (V...,1,POWERCHART,2018-09-04,0.615354,VB_WORKFLOWAMBPEDSPSYCH,USR:MPG.NEW_ORDER_ENTRY.O2 - LOAD COMPONENT,6.338292,0,2,3


### Remove non-working days.  

Currently only removing weekends. 

**To-do**
- remove federal/bank holidays?  E.g. sept 3 was labor day. 
- clean up is-bday column via subsetting or boolean
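
One possible way to cover both to-do items, sketched under the assumption that US federal holidays are the right calendar for this data. `USFederalHolidayCalendar` ships with pandas; the helper name `working_day_mask` is hypothetical.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def working_day_mask(dates):
    """Boolean Series: True for Mon-Fri dates that are not US federal holidays."""
    holidays = USFederalHolidayCalendar().holidays(start=dates.min(), end=dates.max())
    is_weekday = dates.dt.dayofweek < 5          # Monday=0 ... Sunday=6
    is_holiday = dates.dt.normalize().isin(holidays)
    return is_weekday & ~is_holiday

# Saturday, Labor Day 2018 (Monday), and a regular Tuesday.
demo_dates = pd.Series(pd.to_datetime(['2018-09-01', '2018-09-03', '2018-09-04']))
mask = working_day_mask(demo_dates)
```

Because the result is a plain boolean mask, it could be used directly as `atrt_df[working_day_mask(atrt_df['dt'])]` without a helper date column.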

In [4]:
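# Monday=0 ... Friday=4, so dayofweek < 5 marks business days.
# (BDay().onOffset gives the same weekday test but is not available in newer pandas.)
is_business_day = atrt_df['dt'].dt.dayofweek < 5
# Keep the date on business days and NaT on weekends.
atrt_df['isBDay'] = atrt_df['dt'].where(is_business_day)
atrt_df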

Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,total_elapsed,transAbove5,transUnder2,transaction_cnt,isBDay
0,2.90,0.08%,7.57%,USR: PN AUTOPOPULATE DATA (N/A),1,FIRSTNET,2018-08-19,0.562857,,USR: PN AUTOPOPULATE DATA,3482.946764,1,91,1202,NaT
1,0.03,0.00%,100.00%,USR:SRG SNDOCDISP DISCONTINUING SEGMENT (SSSC ...,1,SNSURGINET.EXE,2018-10-11,0.003413,SSSC HEMOSTASIS,USR:SRG SNDOCDISP DISCONTINUING SEGMENT,0.154385,0,5,5,2018-10-11
2,0.22,0.00%,100.00%,USR:MPG.POC SUMMARY COMPONENT - LOAD COMPONENT...,1,FIRSTNET,2018-08-24,,,USR:MPG.POC SUMMARY COMPONENT - LOAD COMPONENT,0.221000,0,1,1,2018-08-24
3,0.85,0.02%,99.88%,USR:BSC-ENSURE MAW RESULTS (N/A),1,POWERCHART,2018-09-14,0.225968,,USR:BSC-ENSURE MAW RESULTS,31670.881000,8,37109,37153,2018-09-14
4,2.11,0.00%,66.67%,USR:MPG.NEW_ORDER_ENTRY.O2 - LOAD COMPONENT (V...,1,POWERCHART,2018-09-04,0.615354,VB_WORKFLOWAMBPEDSPSYCH,USR:MPG.NEW_ORDER_ENTRY.O2 - LOAD COMPONENT,6.338292,0,2,3,2018-09-04
5,0.49,0.07%,98.56%,USR:MPG.NOTES_REMINDERS.O1 - LOAD COMPONENT (V...,1,POWERCHART,2018-09-27,0.397299,VB_ONCOLOGYSUMMARY,USR:MPG.NOTES_REMINDERS.O1 - LOAD COMPONENT,1355.636862,2,2740,2780,2018-09-27
6,0.52,0.00%,100.00%,USR:SRG SNDOCDISP SAVING SEGMENT (BLH ENDO CAS...,1,SNSURGINET.EXE,2018-08-04,0.494043,BLH ENDO CASE TIMES,USR:SRG SNDOCDISP SAVING SEGMENT,10.919433,0,21,21,NaT
7,0.00,0.00%,100.00%,USR:ORM.SIGNORDERS-NONMEDDUPCHECK (N/A),1,POWERCHART,2018-09-11,0.002295,,USR:ORM.SIGNORDERS-NONMEDDUPCHECK,259.930018,0,70146,70146,2018-09-11
8,0.82,0.42%,87.71%,USR:PWR-CREATE VIEW (CLINDOCUMENT/PVNOTES),1,FIRSTNET,2018-10-20,1.061376,CLINDOCUMENT/PVNOTES,USR:PWR-CREATE VIEW,6609.329131,34,7048,8036,NaT
9,0.19,0.00%,100.00%,USR:SRG SNDOCDISP LOADING SEGMENT (NH ENDO IMP...,1,SNSURGINET.EXE,2018-10-04,0.013296,NH ENDO IMPLANT RECORD,USR:SRG SNDOCDISP LOADING SEGMENT,0.757934,0,4,4,2018-10-04


### Filter dataset by application - e.g. Powerchart, firstnet, surginet.
(Actually apply the weekend/weekday filter here as well). 

**to-do**
- [ ] are there differences between timers in the different applications? 
- [x] clean up the workflow to do one step at a time

In [5]:
atrt_df_PC = atrt_df[atrt_df['application_name']== 'POWERCHART']
atrt_df_PC = atrt_df_PC[atrt_df_PC.isBDay.notnull()]
atrt_df_PC

Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,total_elapsed,transAbove5,transUnder2,transaction_cnt,isBDay
3,0.85,0.02%,99.88%,USR:BSC-ENSURE MAW RESULTS (N/A),1,POWERCHART,2018-09-14,0.225968,,USR:BSC-ENSURE MAW RESULTS,31670.881000,8,37109,37153,2018-09-14
4,2.11,0.00%,66.67%,USR:MPG.NEW_ORDER_ENTRY.O2 - LOAD COMPONENT (V...,1,POWERCHART,2018-09-04,0.615354,VB_WORKFLOWAMBPEDSPSYCH,USR:MPG.NEW_ORDER_ENTRY.O2 - LOAD COMPONENT,6.338292,0,2,3,2018-09-04
5,0.49,0.07%,98.56%,USR:MPG.NOTES_REMINDERS.O1 - LOAD COMPONENT (V...,1,POWERCHART,2018-09-27,0.397299,VB_ONCOLOGYSUMMARY,USR:MPG.NOTES_REMINDERS.O1 - LOAD COMPONENT,1355.636862,2,2740,2780,2018-09-27
7,0.00,0.00%,100.00%,USR:ORM.SIGNORDERS-NONMEDDUPCHECK (N/A),1,POWERCHART,2018-09-11,0.002295,,USR:ORM.SIGNORDERS-NONMEDDUPCHECK,259.930018,0,70146,70146,2018-09-11
13,0.21,0.00%,100.00%,USR:MPG.DOCUMENTS.O2 - LOAD COMPONENT (VB_WORK...,1,POWERCHART,2018-09-20,0.043101,VB_WORKFLOWIPCHARIS,USR:MPG.DOCUMENTS.O2 - LOAD COMPONENT,2.673216,0,13,13,2018-09-20
20,0.32,0.00%,97.50%,USR: PDOC NAVIGATOR BAND CLICK (PICU ECMO),1,POWERCHART,2018-10-22,0.484452,PICU ECMO,USR: PDOC NAVIGATOR BAND CLICK,12.656418,0,39,40,2018-10-22
28,0.86,0.00%,91.18%,USR:MPG.REMINDERS.O2 - LOAD COMPONENT (VB_WORK...,1,POWERCHART,2018-09-19,0.806611,VB_WORKFLOWAMBPEDSNEPHROLOGY,USR:MPG.REMINDERS.O2 - LOAD COMPONENT,29.397991,0,31,34,2018-09-19
29,0.52,0.00%,100.00%,USR:MPG.NOTES_REMINDERS.O1 - LOAD COMPONENT (V...,1,POWERCHART,2018-08-21,0.130344,VB_HCWORKLIST,USR:MPG.NOTES_REMINDERS.O1 - LOAD COMPONENT,227.100146,0,439,439,2018-08-21
30,0.55,0.40%,96.39%,USR:MPG.MEDS.O1 - LOAD COMPONENT (VB_IUHNEONAT...,1,POWERCHART,2018-08-20,0.750812,VB_IUHNEONATESUMMARY,USR:MPG.MEDS.O1 - LOAD COMPONENT,679.916677,5,1201,1246,2018-08-20
35,0.21,0.04%,99.91%,USR:ORM.CONVERTTOINPATIENTMR-BEGIN (N/A),1,POWERCHART,2018-08-15,0.276444,,USR:ORM.CONVERTTOINPATIENTMR-BEGIN,490.151686,1,2323,2325,2018-08-15


## Anomaly detection/Handling

#### Applying quantiles:

Apply 5th and 95th quantiles to each timer for the entire 90 day period. 
Then we can lookup the values using ```loc``` and ```filter``` 


**To-do**
- Investigate outliers/anomalies by timer. Replace with mean/remove altogether. 
- Use KNN? 
- Complete - remove outliers by timer. 

Sample outlier removal calculation by group. 
```atrt_df_PC[np.abs(atrt_df_PC.ATRT-atrt_df_PC.ATRT.mean()) <= (3*atrt_df_PC.ATRT.std())]```

Takes the absolute deviation of each datapoint from the mean and keeps the datapoint only if it is within 3 standard deviations of the mean; anything further out is removed.
- I need to do this by group! 
- Group first, then pass in group calculation. 
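
A sketch of the "group first" idea, using `groupby(...).transform` so each row is compared against the mean and standard deviation of its own timer. The function name and the demo frame are hypothetical.

```python
import pandas as pd

def drop_outliers_by_group(df, group_col, value_col, n_std=3.0):
    """Keep rows within n_std standard deviations of their own group's mean."""
    grouped = df.groupby(group_col)[value_col]
    mean = grouped.transform('mean')   # per-row copy of the group mean
    std = grouped.transform('std')     # per-row copy of the group std (ddof=1)
    keep = (df[value_col] - mean).abs() <= n_std * std
    return df[keep]

# Two hypothetical timers on very different scales, each with one spike.
demo = pd.DataFrame({
    'timer': ['A'] * 12 + ['B'] * 12,
    'ATRT': [1.0] * 11 + [50.0] + [100.0] * 11 + [150.0],
})
filtered = drop_outliers_by_group(demo, 'timer', 'ATRT')
```

Note that a group with a single row has an undefined standard deviation, so the comparison is False and that row is dropped; whether that is desirable depends on the analysis.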

In [6]:
#find outliers
outliers = atrt_df_PC.groupby(["Timer Subtimer Name"])['ATRT'].quantile([0.05, 0.95]).unstack(level=1)
#filter outliers
atrt_df_PC = atrt_df_PC.loc[((outliers.loc[atrt_df_PC['Timer Subtimer Name'], .05] < atrt_df_PC.ATRT.values) & (atrt_df_PC.ATRT.values < outliers.loc[atrt_df_PC['Timer Subtimer Name'], .95])).values]


### Rolling average notes
Lambda function for applying rolling average to a group: 
The groupby statement can be in an earlier variable, but I've chosen to include it here. 

**To-do**
- [ ] Find out exactly what reset_index does - reset the average calculation to the beginning of each group? 
- [ ] Remove outliers/anomalies by timer. Replace with mean/remove altogether. 
- [ ] Use KNN for clustering timers...analyze what timers that have shifted up have in common. 
- [ ] Establish control chart:
    - UCL = .75 quartile + 1.5x IQR
    - LCL = .25 quartile - 1.5x IQR
    - mean = 60-day mean
    - median = 60-day median (use with outliers included)
```
atrt_df_PC.groupby(['Timer Subtimer Name', 'dt'])['ATRT'].rolling(30).mean().reset_index(0,drop=True)
```
Pass into dataset feature with ```atrt_df_PC['mavg30'] = (above statement)```

```atrt_df_PC['mavg30'] = atrt_df_PC.groupby(['Timer Subtimer Name', 'dt'])['ATRT'].rolling(30).mean().reset_index(0,drop=True)```
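
On the `reset_index` to-do: a small illustration on made-up data. A grouped rolling mean comes back with a two-level index (group key, original row label), and `reset_index(level=0, drop=True)` only drops the group level so the result lines up with the original rows again. It does not restart the average; the windows are already restarted per group by the `groupby`.

```python
import pandas as pd

# Hypothetical data: two timers with three observations each.
demo = pd.DataFrame({
    'timer': ['A', 'A', 'A', 'B', 'B', 'B'],
    'ATRT': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Index is (timer, original row label); each timer starts its own window.
rolled = demo.groupby('timer')['ATRT'].rolling(2).mean()

# Drop the timer level so the index matches demo's index again.
flat = rolled.reset_index(level=0, drop=True)

# Assignment aligns on the row labels.
demo['mavg2'] = flat
```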

In [7]:
#Setting index appears to cause issues when outputting to csv: 
#atrt_df_PC = atrt_df_PC.set_index(['Timer Subtimer Name', 'dt'])

atrt_df_PC = atrt_df_PC.sort_values(['Timer Subtimer Name', 'dt'])
# transform returns a result aligned to the original rows, so it can be assigned directly.
atrt_df_PC['mavg30'] = atrt_df_PC.groupby('Timer Subtimer Name')['ATRT'].transform(lambda x: x.rolling(center=False, window=30).mean())

#check to make sure that everything looks correct: 
atrt_df_PC


Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,total_elapsed,transAbove5,transUnder2,transaction_cnt,isBDay,mavg30
497739,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-03,0.086916,,DMSM_GETMEDIACONTENT,777.760,0,5006,5007,2018-08-03,
473231,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-06,0.080803,,DMSM_GETMEDIACONTENT,908.990,0,6021,6021,2018-08-06,
163400,0.15,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-07,0.085864,,DMSM_GETMEDIACONTENT,931.508,0,6202,6203,2018-08-07,
428185,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-08,0.093581,,DMSM_GETMEDIACONTENT,900.311,0,5663,5664,2018-08-08,
671485,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-09,0.104255,,DMSM_GETMEDIACONTENT,938.544,0,5701,5702,2018-08-09,
571364,0.16,0.00%,99.96%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-10,0.097991,,DMSM_GETMEDIACONTENT,782.390,0,4969,4971,2018-08-10,
451173,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-14,0.070129,,DMSM_GETMEDIACONTENT,921.509,0,6318,6318,2018-08-14,
628818,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-15,0.081760,,DMSM_GETMEDIACONTENT,900.628,0,6110,6110,2018-08-15,
426393,0.15,0.02%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-16,0.109795,,DMSM_GETMEDIACONTENT,939.268,1,6180,6181,2018-08-16,
195299,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-17,0.073036,,DMSM_GETMEDIACONTENT,762.252,0,5042,5042,2018-08-17,


### Windows

Add window - 7/8 day window using pandas shift(). 

Difference between Trend and Shift
- trend = "7 days in a row"
- shift = "8 days above median"

**to-do**
- [x] start with shift/window. 
- [ ] establish how to identify trends and shifts using elif statements.

**questions**
- what happens when I compare something with a NaN value? - e.g. I compare days 1-7 with NaN? 
- do I need to drop rows with NaN, or return a null value? 
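
A quick check of the NaN question on a toy series (values are made up): any ordering comparison involving NaN evaluates to False, so rows whose lags are missing simply never trigger a signal and do not have to be dropped beforehand.

```python
import pandas as pd

s = pd.Series([0.16, 0.17, 0.15])
lag1 = s.shift(1)                 # first element becomes NaN

rising = s > lag1                 # NaN on the right-hand side compares as False
falling_or_flat = s <= lag1       # also False for the NaN row

# For the first row neither comparison is True.
first_row_any = bool(rising.iloc[0] or falling_or_flat.iloc[0])
```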

In [8]:
# dataframe var is atrt_df_PC

# Shift within each timer so lag values never come from a different timer.
for n in range(1, 9):
    atrt_df_PC['t-%d' % n] = atrt_df_PC.groupby('Timer Subtimer Name')['ATRT'].shift(n)

df_pc = atrt_df_PC
df_pc.head()

Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,...,isBDay,mavg30,t-1,t-2,t-3,t-4,t-5,t-6,t-7,t-8
497739,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-03,0.086916,,DMSM_GETMEDIACONTENT,...,2018-08-03,,,,,,,,,
473231,0.15,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-06,0.080803,,DMSM_GETMEDIACONTENT,...,2018-08-06,,0.16,,,,,,,
163400,0.15,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-07,0.085864,,DMSM_GETMEDIACONTENT,...,2018-08-07,,0.15,0.16,,,,,,
428185,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-08,0.093581,,DMSM_GETMEDIACONTENT,...,2018-08-08,,0.15,0.15,0.16,,,,,
671485,0.16,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-08-09,0.104255,,DMSM_GETMEDIACONTENT,...,2018-08-09,,0.16,0.15,0.15,0.16,,,,


## Control Charts
Set upper and lower control limits, 60-day mean, signal detection. 
- Upper and lower control limits. 
    - Upper OUTLIER = .75 qt + 1.5x IQR??
    - [x] Upper CONTROL = .75 qt 
    - [ ] Lower Outlier = .25 qt - 1.5x IQR
    - [x] Lower control = .25 qt 
    - [x] Obtain IQR (on 60 day mean) for each timer. 
    - [ ] Multiply that by 1.5, set as variable
    - [x] Then calculate UCL and LCL for each timer. 
    
- Shift detection:
    - When 8 of the previous N days are above/below the median/mean
    - elif statement? 
   
- Trend detection:
    - When t-n (7) days in a row are higher than the previous day.
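
The two rules above can also be sketched without lag columns, using rolling counts of boolean flags. This is one possible reading of the rules (7 consecutive day-over-day increases for a trend, 8 consecutive days above a center line for a shift); the function name `run_flags` and the demo values are hypothetical.

```python
import pandas as pd

def run_flags(series, center, trend_len=7, shift_len=8):
    """Flag rows that end a run of increases (trend) or a run above center (shift)."""
    rising = (series.diff() > 0).astype(int)   # 1 when higher than the previous day
    above = (series > center).astype(int)      # 1 when above the center line
    trend = rising.rolling(trend_len).sum() == trend_len
    shift = above.rolling(shift_len).sum() == shift_len
    return pd.DataFrame({'trend': trend, 'shift': shift})

# Nine steadily increasing values, all above the center line.
demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
flags = run_flags(demo, center=0.5)
```

For per-timer use, the function would need to be applied inside a `groupby` on the timer column so runs do not span two timers.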
    
**To-Do**

- [x] interquartile range - loc lookup to ds? running calc? pandas has quartile built into running package
- [x] drop NA values after creating all relevant metrics. 
- [ ] plotting control charts for timer. 
- [ ] investigate using actual IQR for control limits. What about when NOR shifts up/down? 
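
A sketch of the IQR-based limits listed above (quartile ± 1.5 × IQR), computed once per timer over whatever period is passed in. The function name `iqr_limits`, the column names of the result, and the demo data are assumptions for illustration.

```python
import pandas as pd

def iqr_limits(df, group_col='Timer Subtimer Name', value_col='ATRT', k=1.5):
    """Per-group quartiles, IQR, and fences at q25 - k*IQR and q75 + k*IQR."""
    q = df.groupby(group_col)[value_col].quantile([0.25, 0.75]).unstack()
    q.columns = ['q25', 'q75']
    iqr = q['q75'] - q['q25']
    return q.assign(IQR=iqr, LCL=q['q25'] - k * iqr, UCL=q['q75'] + k * iqr)

# Hypothetical timers with five observations each.
demo = pd.DataFrame({
    'Timer Subtimer Name': ['A'] * 5 + ['B'] * 5,
    'ATRT': [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})
limits = iqr_limits(demo)
```

The resulting table is indexed by timer, so limits can be looked up with `limits.loc[timer_name, 'UCL']` or merged back onto the daily rows.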


**Questions**

Does it make sense to aggregate by week for smoother trends and less volatility? 
Which timers gave consistent up/down trend/shift signals for that week? Month? 
Which timers were outside the control limits more than N times during the previous period? 

### Quantiles
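Experiment with different thresholds for what counts as an outlier. The rolling limits below use the 25th and 75th percentiles of a 30-row window; the earlier outlier filter trims values outside the 5th and 95th percentiles.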

In [29]:
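# Compute rolling limits within each timer so a window never spans two timers.
atrt_by_timer = df_pc.groupby('Timer Subtimer Name')['ATRT']
df_pc['LCL_30'] = atrt_by_timer.transform(lambda s: s.rolling(30).quantile(.25, interpolation='lower'))
df_pc['UCL_30'] = atrt_by_timer.transform(lambda s: s.rolling(30).quantile(.75, interpolation='higher'))
df_pc['var_30'] = atrt_by_timer.transform(lambda s: s.rolling(30).var(ddof=1))

#IQR for future outlier removal
#df_pc['IQR'] = df_pc['UCL_30'] - df_pc['LCL_30']

df_pc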

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,...,t-5,t-6,t-7,t-8,LCL_30,UCL_30,var_30,signal,trend,shift
224889,0.15,0.03%,99.97%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-17,0.156930,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,,,,,,
517425,0.14,0.03%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-18,0.233025,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.14,,,,,,
372849,0.15,0.00%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-27,0.101519,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.15,,,,,,
19242,0.15,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-28,0.100805,,DMSM_GETMEDIACONTENT,...,0.16,0.15,0.15,0.15,,,,,,
290846,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-02,0.070028,,DMSM_GETMEDIACONTENT,...,0.15,0.16,0.15,0.15,,,,,,
223081,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-03,0.062836,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.16,0.15,,,,,,
379310,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-09,0.069858,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.16,,,,,,
53862,0.14,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-15,0.083543,,DMSM_GETMEDIACONTENT,...,0.15,0.14,0.15,0.15,,,,,,
364756,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-16,0.075542,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,,,,,,
261106,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-18,0.062809,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.14,,,,,,


### Remove missing values
Removing any values without the 30 day moving average.
Use ```np.isfinite``` instead of dropping na values. 

**To-Do**
- [ ] should we instead just use a dummy var if any value isna?

In [30]:
df_pc = df_pc[np.isfinite(df_pc['mavg30'])]
df_pc

Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,...,t-5,t-6,t-7,t-8,LCL_30,UCL_30,var_30,signal,trend,shift
224889,0.15,0.03%,99.97%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-17,0.156930,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,,,,,,
517425,0.14,0.03%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-18,0.233025,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.14,,,,,,
372849,0.15,0.00%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-27,0.101519,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.15,,,,,,
19242,0.15,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-28,0.100805,,DMSM_GETMEDIACONTENT,...,0.16,0.15,0.15,0.15,,,,,,
290846,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-02,0.070028,,DMSM_GETMEDIACONTENT,...,0.15,0.16,0.15,0.15,,,,,,
223081,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-03,0.062836,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.16,0.15,,,,,,
379310,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-09,0.069858,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.16,,,,,,
53862,0.14,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-15,0.083543,,DMSM_GETMEDIACONTENT,...,0.15,0.14,0.15,0.15,,,,,,
364756,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-16,0.075542,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,,,,,,
261106,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-18,0.062809,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.14,,,,,,


### Signal detection: 

- when did the signal start? 
- what is the signal up/down shift or trend? 
- include "magnitude of change"
- how about cumulative sum? 
- abs(t1-t2) + abs(t2-t3)
- **Trend**: 
    - look at scipy.stats.linregress module.
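
The "magnitude of change" idea above, `abs(t1-t2) + abs(t2-t3)`, can be sketched as a rolling sum of absolute day-over-day differences. The function name and the window length are assumptions.

```python
import pandas as pd

def change_magnitude(series, window=2):
    """Sum of absolute day-over-day changes over the last `window` differences."""
    return series.diff().abs().rolling(window).sum()

# Hypothetical daily values for one timer.
demo = pd.Series([1.0, 2.0, 4.0, 3.0, 3.0])
magnitude = change_magnitude(demo, window=2)
```

Unlike a cumulative sum of signed differences, this measure grows for both upward and downward moves, so it describes volatility rather than direction.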

**Questions**

- How useful is identifying a spike if we don't have daily data? 
- See .loc access method vs. chained indexing, which I'm currently doing below: 

**To-Do**
- [ ] fit line for last 8 days. Positive = upward trend. 

### Trend Detection: 
Using linear regression over the last 8 days: 

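```pandas.rolling_corr(arg1, arg2=None, window=None, min_periods=None, freq=None, center=False, pairwise=None, how=None)```

Note: this module-level function is no longer available in current pandas; the equivalent is the method form ```Series.rolling(window).corr(other)```.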

In [12]:
#df_pc_test['roll_corr'] = pd.rolling.corr(df_pc_test['ATRT'], window=7, min_periods=7, freq=None, center=False, pairwise=None, how=None)

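This doesn't work: `axis=0` passes each column (not each row) to `lin_regress`, so the `t-n` labels cannot be looked up, and the function returns `r_value**2` rather than the slope.

```
df_pc_test = df_pc

def lin_regress(df):
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = df[['t-8', 't-7', 't-6', 't-5', 't-4', 't-3', 't-2', 't-1']]
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    return r_value**2
    
df_pc_test['slope'] = df_pc_test.apply(lin_regress, axis=0)
```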
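
A sketch of a row-wise version of the idea above, assuming the `t-8` ... `t-1` lag columns exist. To stay dependency-light it uses `numpy.polyfit` instead of `scipy.stats.linregress`; for a degree-1 fit both give the same least-squares slope. The helper name `lag_slope` and the demo rows are hypothetical.

```python
import numpy as np
import pandas as pd

LAG_COLS = ['t-8', 't-7', 't-6', 't-5', 't-4', 't-3', 't-2', 't-1']

def lag_slope(row):
    """Least-squares slope through the eight lag values, oldest to newest."""
    y = row[LAG_COLS].astype(float).to_numpy()
    if np.isnan(y).any():
        return np.nan                      # not enough history for this row
    x = np.arange(1, len(LAG_COLS) + 1)
    return np.polyfit(x, y, 1)[0]          # coefficient of x, i.e. the slope

# Hypothetical rows: rising, falling, and one with a missing lag.
demo = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8],
     [4.0, 3.5, 3.0, 2.5, 2.0, 1.5, 1.0, 0.5],
     [np.nan, 2, 3, 4, 5, 6, 7, 8]],
    columns=LAG_COLS,
)
demo['slope'] = demo.apply(lag_slope, axis=1)   # axis=1 passes one row at a time
```

A positive slope would indicate an upward trend over the eight lagged days and a negative slope a downward one.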

In [31]:
df_pc_test = df_pc
df_pc_test

def signal_detect(row):
    # Upward trend: four consecutive day-over-day increases across five lags,
    # checked for the windows starting at t-1, t-2 and t-3.
    for start in (1, 2, 3):
        lags = [row['t-%d' % n] for n in range(start, start + 5)]
        if all(newer > older for newer, older in zip(lags, lags[1:])):
            return 'Upward Trend'
    return None

def shift_detect(row):
    # Upward shift: five consecutive days above the 30-day moving average,
    # checked for the windows starting at t-1, t-2 and t-3.
    for start in (1, 2, 3):
        if all(row['t-%d' % n] > row['mavg30'] for n in range(start, start + 5)):
            return 'Upward Shift'
    return None

df_pc_test['trend'] = df_pc_test.apply(signal_detect, axis=1)
df_pc_test['shift'] = df_pc_test.apply(shift_detect, axis=1)

df_pc_test


Unnamed: 0,ATRT,% of Transactions > 5 seconds,% of Transactions < 2 seconds,Timer Subtimer Name,Number of Records,application_name,dt,strt,subtimername,timername,...,t-5,t-6,t-7,t-8,LCL_30,UCL_30,var_30,signal,trend,shift
224889,0.15,0.03%,99.97%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-17,0.156930,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,,,,,,
517425,0.14,0.03%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-18,0.233025,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.14,,,,,,
372849,0.15,0.00%,99.95%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-27,0.101519,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.15,0.15,,,,,,
19242,0.15,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-09-28,0.100805,,DMSM_GETMEDIACONTENT,...,0.16,0.15,0.15,0.15,,,,,,
290846,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-02,0.070028,,DMSM_GETMEDIACONTENT,...,0.15,0.16,0.15,0.15,,,,,,
223081,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-03,0.062836,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.16,0.15,,,,,,
379310,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-09,0.069858,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.16,,,,,,
53862,0.14,0.00%,99.98%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-15,0.083543,,DMSM_GETMEDIACONTENT,...,0.15,0.14,0.15,0.15,,,,,,
364756,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-16,0.075542,,DMSM_GETMEDIACONTENT,...,0.15,0.15,0.14,0.15,,,,,,
261106,0.14,0.00%,100.00%,DMSM_GETMEDIACONTENT (N/A),1,POWERCHART,2018-10-18,0.062809,,DMSM_GETMEDIACONTENT,...,0.14,0.15,0.15,0.14,,,,,,


### Investigate the data on a weekly basis
- Counts
- Sums
- Averages
- Variance
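
A possible starting point for the weekly view, sketched on made-up data: group by timer and calendar week with `pd.Grouper`, then aggregate the four statistics listed above. Weeks here end on Sunday, which is the pandas default for `freq='W'`.

```python
import pandas as pd

# Hypothetical daily rows for one timer spanning two weeks.
demo = pd.DataFrame({
    'Timer Subtimer Name': ['A', 'A', 'A'],
    'dt': pd.to_datetime(['2018-10-01', '2018-10-02', '2018-10-08']),
    'ATRT': [1.0, 3.0, 5.0],
})

weekly = (
    demo.groupby(['Timer Subtimer Name', pd.Grouper(key='dt', freq='W')])['ATRT']
        .agg(['count', 'sum', 'mean', 'var'])
)
first_week = weekly.loc[('A', pd.Timestamp('2018-10-07'))]
```

A week with a single observation has an undefined variance (NaN), so low-volume timers may need a minimum-count filter before comparing weekly variance.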

### Write out to csv:

In [28]:
df_pc_test.to_csv('C:\\Users\\khickman1\\Desktop\\PCMavg30.csv', sep=',')

### Stuff I learned: 

- Moving average example
- Subsetting
- Date and column filtering
- Index by both date and by timer/subtimer using pandas 'set index'
- Indexing
- Sorting
- Get unique values
- grouping
- filtering outliers (clipping, trimming, winsorizing)
- moving data

```
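# Incorrect: the comparison is inside the brackets, so this evaluates to atrt_df[False] and raises a KeyError.
#open_chart = atrt_df['Timer Subtimer Name' == "USR:PWR-OPEN CHART (DISCERNRPT/DISCERNPCTAB.DLL)"]
# Correct: compare the column to the string, then use the boolean mask.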
open_chart = atrt_df[atrt_df['Timer Subtimer Name'] == 'USR:PWR-OPEN CHART (DISCERNRPT/DISCERNPCTAB.DLL)']
open_chart = open_chart[open_chart['application_name']== 'POWERCHART']
open_chart_post = open_chart[(open_chart['dt'] > '2018-07-01') & (open_chart['dt'] < '2018-09-30')]
open_chart = open_chart.set_index(['Timer Subtimer Name', 'dt'])
open_chart = open_chart.sort_index(axis=0)
open_chart['moving_avg'] = open_chart['ATRT'].rolling(30).mean()
atrt_df['dt'] = pd.to_datetime(atrt_df['dt'].dt.date)
atrt_high_trans = atrt_df[atrt_df['transaction_cnt'] > 1000]
atrt_high_trans.sort_values(by=['transaction_cnt'], ascending=False)
len(atrt_high_trans)
grp_high_trans = atrt_high_trans.groupby(["Timer Subtimer Name", "dt"])
grp_high_trans.head()
open_chart_post.groupby('Timer Subtimer Name').nunique()
len(open_chart_post['Timer Subtimer Name'].unique())
len(open_chart_post)
open_chart_slim = open_chart_post[['ATRT', 'dt']]
open_chart_slim.set_index('dt')
open_chart_slim.sort_values('dt')

```

Stuff that didn't work:

```
#atrt_df_PC = atrt_df_PC[['Timer Subtimer Name', 'dt', 'ATRT', 'application_name']]
#atrt_df_PC['mavg30'] = atrt_df_PC['ATRT'].rolling(30).mean()
#PC = atrt_df_PC.groupby(['Timer Subtimer Name', 'dt'])
#atrt_df_PC = atrt_df_PC.set_index(['Timer Subtimer Name', 'dt'])
#PC = atrt_df_PC
#PC.head()
#atrt_df_PC = atrt_df_PC.set_index(['Timer Subtimer Name'])
#atrt_df_PC = atrt_df_PC.sort_index(axis=0)
#atrt_df_PC = atrt_df_PC.groupby(['Timer Subtimer Name'])
#atrt_df_PC = atrt_df_PC.sort(['Timer Subtimer Name', 'dt']).groupby('Timer Subtimer Name')
#atrt_df_PC['mavg30'] = atrt_df_PC['ATRT'].rolling(30).mean()
#atrt_df_PC[0:100]
```


Getting csv from website example
```
import csv
import urllib3
import requests

#This URL will be the URL that your login form points to with the "action" tag.
post_login_url = 'https://cernercare.com/accounts/login?returnTo=https%3A%2F%2Flightson.cerner.com%2Fsocial-auth%2Fcomplete%2Fprofessional%2F%3Fjanrain_nonce%3D2018-10-23T15%253A00%253A41ZYe3mOd'

#This URL is the page you actually want to pull down with requests.
request_url = 'https://lightson.cerner.com/clients/CHP_IN/domains/P1558/kpi/response-time/worst-timers.csv?dt=2018-10-22'
#request_url = 'https://lightson.cerner.com/api/metrics/trend.csv?category_name=Performance&metrics%5B%5D=RT_2SEC&cdr_ids=95370&data_type=ENVIRONMENT&date_type=DAILY&days_of_week=1%2C2%2C3%2C4%2C5%2C6%2C7&doc_cuid=LONfa31c997c3d543768036e2941&end_date=2018-10-22&physician_rollup_flag=2&start_date=2018-07-25'

payload = {
    'username': 'khickman1',
    'pass': '<password>'
}

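with requests.Session() as session:
    # Log in first so the session carries the authentication cookies.
    post = session.post(post_login_url, data=payload)
    r = session.get(request_url)
    decoded_content = r.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    for row in my_list:
        print(row)

# Write all parsed rows once; the reader is already exhausted, so use my_list.
# csv.writer needs a text-mode file opened with newline="" in Python 3.
with open("daily.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(my_list)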
```
