
# Akaike Structured Data Assignment

# > Description of data

The dataset in question contains a comprehensive collection of electronic
health records belonging to patients who have been diagnosed with a specific
disease. These health records comprise a detailed log of every aspect of the
patients' medical history, including all diagnoses, symptoms, prescribed drug
treatments, and medical tests that they have undergone. Each row represents a
healthcare record (a medical event) for a patient and includes a timestamp for that
event, which allows a chronological view of the patient's medical history.

The data has three main columns:

1) Patient-Uid - Unique alphanumeric identifier for a patient.
2) Date - Date on which the patient encountered the event.
3) Incident - Describes which event occurred on that day.
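
To make the schema concrete, here is a minimal sketch that builds a toy table with the three documented columns and pulls out one patient's history in time order. The patient IDs `p1` and `p2` and all rows are made up for illustration; only the column names and the incident naming pattern come from the dataset.

```python
import pandas as pd

# Toy rows that mimic the three documented columns; the values are made up.
sample = pd.DataFrame(
    {
        "Patient-Uid": ["p1", "p1", "p2"],
        "Date": ["2019-03-09", "2019-03-05", "2018-01-30"],
        "Incident": ["PRIMARY_DIAGNOSIS", "SYMPTOM_TYPE_0", "DRUG_TYPE_0"],
    }
)

# Parse the date strings so that sorting and date arithmetic are chronological.
sample["Date"] = pd.to_datetime(sample["Date"])

# One patient's history in time order.
history_p1 = sample[sample["Patient-Uid"] == "p1"].sort_values("Date")
print(history_p1)
```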

# > Problem Statement

# > Problem 1

The development of drugs is critical in providing therapeutic options
for patients suffering from chronic and terminal illnesses. “Target Drug”, in particular,
is designed to enhance the patient's health and well-being without causing
dependence on other medications that could potentially lead to severe and
life-threatening side effects. These drugs are specifically tailored to treat a particular
disease or condition, offering a more focused and effective approach to treatment,
while minimising the risk of harmful reactions.

# > Objective of Problem Statement

The primary objective of this assignment is to develop a predictive model that accurately determines a patient's eligibility for the "Target Drug" within the next 30 days. By achieving this objective, the following outcomes are anticipated:

**Enhancing Patient Care:**

 The predictive model will empower healthcare professionals, including physicians, to make informed decisions about patient treatments. By knowing in advance whether a patient will be eligible for the "Target Drug," physicians can tailor their treatment plans for improved patient care and outcomes.

**Risk Mitigation:**

The model will assist in identifying patients who can benefit from the "Target Drug" while minimizing the risk of adverse reactions or side effects associated with other medications. This risk mitigation is critical for patients with chronic and terminal illnesses.

**Focused Treatment:**

The "Target Drug" is designed to provide focused and effective treatment for specific diseases or conditions. The predictive model will help ensure that patients who would benefit from this specialized treatment receive it in a timely manner.

**Patient Well-Being:**

By accurately predicting eligibility for the "Target Drug," the model contributes to enhancing the health and well-being of patients. It aligns treatment plans with patients' needs, potentially improving their quality of life.

In summary, the objective is to harness predictive modeling to support healthcare decision-making, ultimately leading to better patient care, reduced risks, and more effective treatments for patients suffering from chronic and terminal illnesses. The assignment aims to develop a model that provides a valuable tool for healthcare professionals in their mission to improve patient outcomes and well-being.
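
One way to turn the 30-day objective into a training label is sketched below. It is only an illustration under stated assumptions: the incident value `"TARGET DRUG"`, the helper `label_within_horizon`, and the idea of treating "eligible" as "has a target-drug event within 30 days after a cutoff date" are all hypothetical choices, not something the assignment text specifies.

```python
import pandas as pd

TARGET = "TARGET DRUG"  # hypothetical incident label; check the real value in the data
HORIZON = pd.Timedelta(days=30)


def label_within_horizon(events: pd.DataFrame, cutoff: pd.Timestamp) -> pd.Series:
    """Return 1 per patient if a TARGET event falls in (cutoff, cutoff + HORIZON], else 0."""
    in_window = (
        (events["Incident"] == TARGET)
        & (events["Date"] > cutoff)
        & (events["Date"] <= cutoff + HORIZON)
    )
    positives = set(events.loc[in_window, "Patient-Uid"])
    patients = pd.Index(events["Patient-Uid"].unique(), name="Patient-Uid")
    return pd.Series(
        [int(p in positives) for p in patients], index=patients, name="label"
    )


# Made-up events for three toy patients.
events = pd.DataFrame(
    {
        "Patient-Uid": ["p1", "p1", "p2", "p3", "p3"],
        "Date": pd.to_datetime(
            ["2020-01-01", "2020-01-20", "2020-01-05", "2020-01-02", "2020-03-15"]
        ),
        "Incident": ["DRUG_TYPE_0", TARGET, "SYMPTOM_TYPE_0", "TEST_TYPE_0", TARGET],
    }
)

labels = label_within_horizon(events, pd.Timestamp("2020-01-10"))
print(labels)
```

In this toy data only `p1` has a target event inside the window; `p3` has one, but it falls after the 30-day horizon.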

In [4]:
import pandas as pd
import numpy as np

In [7]:
df = pd.read_parquet("/content/train.parquet")
df.head()

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-09,PRIMARY_DIAGNOSIS
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,2015-05-16,PRIMARY_DIAGNOSIS
3,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,2018-01-30,SYMPTOM_TYPE_0
4,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,2015-04-22,DRUG_TYPE_0
8,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2016-06-18,DRUG_TYPE_1


In [8]:
df.shape

(3220868, 3)

In [12]:
len(df['Patient-Uid'].unique())

27033
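
With roughly 3.2 million rows spread over about 27 thousand patients, it can also help to look at how many events each patient has. The sketch below shows one way to do that with `groupby(...).size()`; it runs on a made-up toy frame, not on `train.parquet`.

```python
import pandas as pd

# Made-up events; only the column names match the dataset.
events = pd.DataFrame(
    {
        "Patient-Uid": ["p1", "p1", "p1", "p2", "p3", "p3"],
        "Incident": [
            "PRIMARY_DIAGNOSIS", "DRUG_TYPE_0", "TEST_TYPE_0",
            "SYMPTOM_TYPE_0", "DRUG_TYPE_1", "DRUG_TYPE_1",
        ],
    }
)

# Number of distinct patients.
n_patients = events["Patient-Uid"].nunique()

# Number of rows (events) per patient, then summary statistics of that distribution.
events_per_patient = events.groupby("Patient-Uid").size()
print(n_patients)
print(events_per_patient.describe())
```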

In [17]:
a = df['Incident'].unique()
a

array(['PRIMARY_DIAGNOSIS', 'SYMPTOM_TYPE_0', 'DRUG_TYPE_0',
       'DRUG_TYPE_1', 'DRUG_TYPE_2', 'TEST_TYPE_0', 'DRUG_TYPE_3',
       'DRUG_TYPE_4', 'DRUG_TYPE_5', 'DRUG_TYPE_6', 'DRUG_TYPE_8',
       'DRUG_TYPE_7', 'SYMPTOM_TYPE_1', 'DRUG_TYPE_10', 'SYMPTOM_TYPE_29',
       'SYMPTOM_TYPE_2', 'DRUG_TYPE_11', 'DRUG_TYPE_9', 'DRUG_TYPE_13',
       'SYMPTOM_TYPE_5', 'TEST_TYPE_1', 'SYMPTOM_TYPE_6', 'TEST_TYPE_2',
       'SYMPTOM_TYPE_3', 'SYMPTOM_TYPE_8', 'DRUG_TYPE_14', 'DRUG_TYPE_12',
       'SYMPTOM_TYPE_9', 'SYMPTOM_TYPE_10', 'SYMPTOM_TYPE_7',
       'SYMPTOM_TYPE_11', 'TEST_TYPE_3', 'DRUG_TYPE_15', 'SYMPTOM_TYPE_4',
       'SYMPTOM_TYPE_14', 'SYMPTOM_TYPE_13', 'SYMPTOM_TYPE_16',
       'SYMPTOM_TYPE_17', 'SYMPTOM_TYPE_15', 'SYMPTOM_TYPE_18',
       'SYMPTOM_TYPE_12', 'SYMPTOM_TYPE_20', 'SYMPTOM_TYPE_21',
       'DRUG_TYPE_17', 'SYMPTOM_TYPE_22', 'TEST_TYPE_4',
       'SYMPTOM_TYPE_23', 'DRUG_TYPE_16', 'TEST_TYPE_5',
       'SYMPTOM_TYPE_19', 'SYMPTOM_TYPE_24', 'SYMPTOM_TYPE_25',
   

In [21]:
# Lowercase every incident name so that case-only variants collapse together
b = [incident.lower() for incident in a]

In [27]:
c = pd.DataFrame(b,columns = ['uni'])
len(c['uni'].unique())

57
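
The lowercasing check above can be done directly on a Series, and the incident names can be grouped into families by the text before `_TYPE_`. This is a small sketch on made-up values; the `drug_type_0` entry is an invented case variant added only to show the effect of normalization.

```python
import pandas as pd

incidents = pd.Series(
    [
        "PRIMARY_DIAGNOSIS", "SYMPTOM_TYPE_0", "DRUG_TYPE_0", "DRUG_TYPE_1",
        "TEST_TYPE_0", "drug_type_0",  # last value is a made-up case variant
    ]
)

# Distinct names before and after case normalization.
n_raw = incidents.nunique()
n_normalized = incidents.str.lower().nunique()

# Family = text before "_TYPE_"; names without that marker are kept whole.
family = incidents.str.upper().str.split("_TYPE_").str[0]
family_counts = family.value_counts()

print(n_raw, n_normalized)
print(family_counts)
```

If the two counts match on the real data, no incident names differ only by case.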

In [10]:
a = df[df['Patient-Uid'] == 'a0db1e73-1c7c-11ec-ae39-16262ee38c7f']
a

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-09,PRIMARY_DIAGNOSIS
26810,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-08-04,DRUG_TYPE_2
54034,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-05,SYMPTOM_TYPE_0
69930,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-04-16,DRUG_TYPE_6
145155,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-12-27,DRUG_TYPE_0
...,...,...,...
3100536,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-08-03,SYMPTOM_TYPE_0
3119371,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-03-08,PRIMARY_DIAGNOSIS
3138153,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-01-09,TEST_TYPE_0
3178420,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-03-09,TEST_TYPE_0


In [15]:
# Compare against the string itself; comparing against a one-element list is not a valid row filter
b = a[a['Incident'] == 'PRIMARY_DIAGNOSIS']
b

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-09,PRIMARY_DIAGNOSIS
245656,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-02-21,PRIMARY_DIAGNOSIS
653465,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-08-04,PRIMARY_DIAGNOSIS
1034216,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-05,PRIMARY_DIAGNOSIS
1497523,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-08-03,PRIMARY_DIAGNOSIS
1797762,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-17,PRIMARY_DIAGNOSIS
2014519,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-01-09,PRIMARY_DIAGNOSIS
2362143,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-08-03,PRIMARY_DIAGNOSIS
2635491,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-16,PRIMARY_DIAGNOSIS
2727679,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-18,PRIMARY_DIAGNOSIS


In [16]:
df_dt1 = b.sort_values(by = 'Date', ascending = True).reset_index()
df_dt1

Unnamed: 0,index,Patient-Uid,Date,Incident
0,245656,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-02-21,PRIMARY_DIAGNOSIS
1,1034216,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-05,PRIMARY_DIAGNOSIS
2,2732312,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-05,PRIMARY_DIAGNOSIS
3,0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-09,PRIMARY_DIAGNOSIS
4,2635491,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-16,PRIMARY_DIAGNOSIS
5,3038398,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-16,PRIMARY_DIAGNOSIS
6,1797762,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-17,PRIMARY_DIAGNOSIS
7,2727679,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-18,PRIMARY_DIAGNOSIS
8,2014519,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-01-09,PRIMARY_DIAGNOSIS
9,3119371,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2020-03-08,PRIMARY_DIAGNOSIS


In [19]:
df_dt1 = a.sort_values(by = 'Date', ascending = True).reset_index()
df_dt1.head(50)

Unnamed: 0,index,Patient-Uid,Date,Incident
0,1750087,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2015-09-22,DRUG_TYPE_7
1,1473893,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-04-13,SYMPTOM_TYPE_2
2,1387922,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-05-02,DRUG_TYPE_7
3,748933,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,DRUG_TYPE_0
4,223191,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,SYMPTOM_TYPE_0
5,1284014,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,DRUG_TYPE_11
6,557302,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,DRUG_TYPE_9
7,858256,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,TEST_TYPE_0
8,1320912,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-24,DRUG_TYPE_0
9,1237510,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-24,DRUG_TYPE_7
