## PRMT-2023
### Hypothesis
We believe that, for two Vision practices, pending Vision transfers look different to EMIS and TPP pending transfers.
We will know this to be true when we can see different patterns in the data for each supplier in terms of the number of messages per conversation.

### Scope
- Look at the following practice ASID codes:
  - 896286726030
  - 244934959036
  - 052047562039 (NB: added as a third practice after we discovered a distinct pattern in the first)

- Compare the makeup of messages per conversation across pending transfers for each supplier and identify whether there are any patterns
- Generate a sample of 10 conversation IDs per practice, for pending transfers that are Vision to Vision
- Show the number of messages per conversation ID

### Acceptance Criteria
- We have a list of 20 conversation IDs for Vision to Vision pending transfers for the two practices stated, and we know how many messages there are for each of these conversations
- We have a Confluence page that shows any patterns that either prove or disprove the hypothesis

In [11]:
import pandas as pd
import numpy as np
# Using data generated from branch PRMT-1742-duplicates-analysis.
# This is needed to correctly handle duplicates.
# Once the upstream pipeline has a fix for duplicate EHRs, then we can go back to using the main output.
transfer_file_location = "s3://prm-gp2gp-data-sandbox-dev/transfers-duplicates-hypothesis/"
transfer_files = [
    "9-2020-transfers.parquet",
    "10-2020-transfers.parquet",
    "11-2020-transfers.parquet",
    "12-2020-transfers.parquet",
    "1-2021-transfers.parquet",
    "2-2021-transfers.parquet"
]

transfer_input_files = [transfer_file_location + f for f in transfer_files]
transfers_raw = pd.concat((
    pd.read_parquet(f)
    for f in transfer_input_files
))

# In the data from the PRMT-1742-duplicates-analysis branch, these columns have been added but contain only empty values.
transfers_raw = transfers_raw.drop(["sending_supplier", "requesting_supplier"], axis=1)


# PRMT-1742 found that many duplicate EHR errors are misclassified; the code below reclassifies the affected transfers.

has_at_least_one_successful_integration_code = lambda errors: any((np.isnan(e) or e==15 for e in errors))
successful_transfers_bool = transfers_raw['request_completed_ack_codes'].apply(has_at_least_one_successful_integration_code)
transfers = transfers_raw.copy()
transfers.loc[successful_transfers_bool, "status"] = "INTEGRATED"

# Correctly interpret certain sender errors as failed.
# This is explained in PRMT-1974. Eventually this will be fixed upstream in the pipeline.
pending_sender_error_codes=[6,7,10,24,30,23,14,99]
transfers_with_pending_sender_code_bool=transfers['sender_error_code'].isin(pending_sender_error_codes)
transfers_with_pending_with_error_bool=transfers['status']=='PENDING_WITH_ERROR'
transfers_which_need_pending_to_failure_change_bool=transfers_with_pending_sender_code_bool & transfers_with_pending_with_error_bool
transfers.loc[transfers_which_need_pending_to_failure_change_bool,'status']='FAILED'

# Add integrated Late status
eight_days_in_seconds=8*24*60*60
transfers_after_sla_bool=transfers['sla_duration']>eight_days_in_seconds
transfers_with_integrated_bool=transfers['status']=='INTEGRATED'
transfers_integrated_late_bool=transfers_after_sla_bool & transfers_with_integrated_bool
transfers.loc[transfers_integrated_late_bool,'status']='INTEGRATED LATE'

# If the record integrated after 28 days, change the status back to pending.
# This is to handle each month consistently and to always reflect a transfer's status 28 days after it was made.
# TBD how this is handled upstream in the pipeline.
twenty_eight_days_in_seconds=28*24*60*60
transfers_after_month_bool=transfers['sla_duration']>twenty_eight_days_in_seconds
transfers_pending_at_month_bool=transfers_after_month_bool & transfers_integrated_late_bool
transfers.loc[transfers_pending_at_month_bool,'status']='PENDING'
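# NB: `~` binds more tightly than `>`, so the second term evaluates `(~lengths) > 0`, which is False for every row.
# As written, only sender_error_code contributes to this mask.
transfers_with_early_error_bool=(~transfers.loc[:,'sender_error_code'].isna()) |(~transfers.loc[:,'intermediate_error_codes'].apply(len)>0)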
transfers.loc[transfers_with_early_error_bool & transfers_pending_at_month_bool,'status']='PENDING_WITH_ERROR'

# Supplier name mapping
supplier_renaming = {
    "EGTON MEDICAL INFORMATION SYSTEMS LTD (EMIS)":"EMIS",
    "IN PRACTICE SYSTEMS LTD":"Vision",
    "MICROTEST LTD":"Microtest",
    "THE PHOENIX PARTNERSHIP":"TPP",
    None: "Unknown"
}

asid_lookup_file = "s3://prm-gp2gp-data-sandbox-dev/asid-lookup/asidLookup-Mar-2021.csv.gz"
asid_lookup = pd.read_csv(asid_lookup_file)
lookup = asid_lookup[["ASID", "MName", "NACS","OrgName"]]

transfers = transfers.merge(lookup, left_on='requesting_practice_asid',right_on='ASID',how='left')
transfers = transfers.rename({'MName': 'requesting_supplier', 'ASID': 'requesting_supplier_asid', 'NACS': 'requesting_ods_code','OrgName':'requesting_practice_name'}, axis=1)
transfers = transfers.merge(lookup, left_on='sending_practice_asid',right_on='ASID',how='left')
transfers = transfers.rename({'MName': 'sending_supplier', 'ASID': 'sending_supplier_asid', 'NACS': 'sending_ods_code','OrgName':'sending_practice_name'}, axis=1)

transfers["sending_supplier"] = transfers["sending_supplier"].replace(supplier_renaming.keys(), supplier_renaming.values())
transfers["requesting_supplier"] = transfers["requesting_supplier"].replace(supplier_renaming.keys(), supplier_renaming.values())
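The early-error mask in the cell above mixes `~` with a `>` comparison, so it is worth checking on a toy frame. The sketch below assumes the intended rule is "a sender error code is present, or at least one intermediate error code was recorded"; that reading is an assumption, and the rows are made up. Only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: no error, a sender error, and an intermediate error.
toy = pd.DataFrame({
    "sender_error_code": [np.nan, 30.0, np.nan],
    "intermediate_error_codes": [[], [], [29]],
})

# Assumed intended rule: either kind of error marks the transfer as having an early error.
has_sender_error = toy["sender_error_code"].notna()
has_intermediate_error = toy["intermediate_error_codes"].apply(len) > 0
early_error = has_sender_error | has_intermediate_error

# Pitfall: `~` is applied before `>`, so this compares the bitwise
# complement of each length (-1, -1, -2) with zero.
lengths = toy["intermediate_error_codes"].apply(len)
negated_first = (~lengths) > 0

print(early_error.tolist())
print(negated_first.tolist())
```

Writing the length comparison without `~` (or parenthesising it) keeps the mask independent of operator precedence.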

In [12]:
practice_asids=['896286726030' ,'244934959036','052047562039']
relevant_practice_bool=transfers['requesting_practice_asid'].isin(practice_asids)
from_vision_bool=transfers['sending_supplier']=='Vision'
pending_status_bool=transfers['status']=='PENDING'
relevant_data_bool=(relevant_practice_bool)&(from_vision_bool)&(pending_status_bool)
relevant_transfers=transfers.loc[relevant_data_bool]

### Practice 1

In [13]:
# Randomly select 10 conversations. The original 10 that were used are written in the query.
practice_1_transfers_data=relevant_transfers.loc[relevant_transfers['requesting_practice_asid']==practice_asids[0]]
practice_1_transfers_data.sample(n=10)['conversation_id'].values
practice_1_transfers_data['date_requested'].dt.month.value_counts()

2     183
9       7
10      4
1       2
12      1
11      1
Name: date_requested, dtype: int64
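Without a fixed seed, `sample(n=10)` returns a different set of rows on each run, which is why the original ten IDs had to be recorded elsewhere. A minimal sketch of a repeatable draw, using a made-up frame with hypothetical conversation IDs and an arbitrary seed:

```python
import pandas as pd

# Hypothetical stand-in for one practice's pending transfers.
toy_transfers = pd.DataFrame({"conversation_id": [f"conv-{i}" for i in range(50)]})

# Passing the same random_state makes the draw repeatable across runs.
first_draw = toy_transfers.sample(n=10, random_state=1)["conversation_id"].tolist()
second_draw = toy_transfers.sample(n=10, random_state=1)["conversation_id"].tolist()

print(first_draw == second_draw)
```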

In [14]:
# The output from Splunk was placed in S3; we load it and look at the message patterns that occur.
practice_data_folder="s3://prm-gp2gp-data-sandbox-dev/PRMT-2023-Practice-Data/"
practice_1_filename="PRMT-2023_Practice_1_Data.csv"
practice_1_data=pd.read_csv(practice_data_folder+practice_1_filename)
practice_1_data=practice_1_data.sort_values(by=['conversationID','_time'])
practice_1_data.groupby('conversationID')['interactionName'].apply(list).value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 1709, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[request started]                                 9
[request started, application acknowledgement]    1
Name: interactionName, dtype: int64
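The `TypeError: unhashable type: 'list'` message above appears because `value_counts` tries to hash each value and Python lists are not hashable. One way to avoid it, sketched here on made-up messages, is to collapse each conversation's interactions into a single string before counting. The column names follow the notebook; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical Splunk-style rows: one row per message.
toy_messages = pd.DataFrame({
    "conversationID": ["a", "a", "b", "c"],
    "_time": [1, 2, 1, 1],
    "interactionName": [
        "request started",
        "application acknowledgement",
        "request started",
        "request started",
    ],
})

toy_messages = toy_messages.sort_values(by=["conversationID", "_time"])

# Join each conversation's interactions into one hashable string.
patterns = toy_messages.groupby("conversationID")["interactionName"].apply(", ".join)
pattern_counts = patterns.value_counts()

print(pattern_counts)
```

Joining into a string loses the list structure, so this form suits counting and display rather than further per-message processing.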

### Practice 2

In [15]:
practice_2_transfers_data=relevant_transfers.loc[relevant_transfers['requesting_practice_asid']==practice_asids[1]]
practice_2_transfers_data.sample(n=10)['conversation_id'].values
practice_2_transfers_data['date_requested'].dt.month.value_counts()

2     22
9     20
10    19
1     19
12    16
11    12
Name: date_requested, dtype: int64

In [16]:
practice_2_filename="PRMT-2023_Practice_2_Data.csv"
practice_2_data=pd.read_csv(practice_data_folder+practice_2_filename)
practice_2_data=practice_2_data.sort_values(by=['conversationID','_time'])
practice_2_data.groupby('conversationID')['interactionName'].apply(list).value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 1709, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[request started, application acknowledgement]                                                    6
[request started, application acknowledgement, request completed]                                 2
[request started, application acknowledgement, request completed, application acknowledgement]    2
Name: interactionName, dtype: int64

### Practice 3

In [17]:
practice_3_transfers_data=relevant_transfers.loc[relevant_transfers['requesting_practice_asid']==practice_asids[2]]
practice_3_transfers_data.sample(n=10,random_state=1)['conversation_id'].values
practice_3_transfers_data['date_requested'].dt.month.value_counts()

9     30
1     20
12    19
11    19
2     18
10    13
Name: date_requested, dtype: int64

In [18]:
practice_data_folder="s3://prm-gp2gp-data-sandbox-dev/PRMT-2023-Practice-Data/"
practice_3_filename="PRMT-2023_Practice_3_Data.csv"
practice_3_data=pd.read_csv(practice_data_folder+practice_3_filename)
practice_3_data=practice_3_data.sort_values(by=['conversationID','_time'])
practice_3_data.groupby('conversationID')['interactionName'].apply(list).value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 1709, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[request started, application acknowledgement]    10
Name: interactionName, dtype: int64
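Since the hypothesis is framed in terms of the number of messages per conversation, the same data can also be summarised as a count distribution. A small sketch on made-up rows, reusing the notebook's `conversationID` column name:

```python
import pandas as pd

# Hypothetical messages: conversation "a" has two messages, "b" and "c" have one each.
toy_messages = pd.DataFrame({
    "conversationID": ["a", "a", "b", "c"],
    "interactionName": [
        "request started",
        "application acknowledgement",
        "request started",
        "request started",
    ],
})

# Number of messages in each conversation.
messages_per_conversation = toy_messages.groupby("conversationID").size()

# How many conversations have 1 message, 2 messages, and so on.
distribution = messages_per_conversation.value_counts().sort_index()

print(distribution)
```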

### Output data to Excel

In [19]:
with pd.ExcelWriter('32-PRMT-2023-Vision_conversations.xlsx') as writer:
    practice_1_data.groupby('conversationID')['interactionName'].apply(list).to_excel(writer, sheet_name=practice_asids[0])
    practice_2_data.groupby('conversationID')['interactionName'].apply(list).to_excel(writer, sheet_name=practice_asids[1])
    practice_3_data.groupby('conversationID')['interactionName'].apply(list).to_excel(writer, sheet_name=practice_asids[2])