# Assignment 1

**Objective:**  
This assignment tests your algorithmic thinking skills. You may use any AI tools or other resources to help you complete this coding assignment.

---

## Instructions:

1. **Teamwork:**  
   - This assignment is to be done in pairs, preferably with your thesis partner.
   
2. **GitHub Repository:**  
   - Each person should create a GitHub repository titled `Assignment_1_Data_Analytics`.
   
3. **Read the Journal:**  
   - Thoroughly read the journal provided (make sure to check any additional resources).
   
4. **Develop Python Implementation:**  
   - Implement the procedures discussed in the journal in clean, well-documented Python.
   
5. **Deadline:**  
   - The assignment is due **before the pre-midterm week**. Make sure to finish and submit your work on time.

---

## Deliverables:
1. Python implementation of the procedures outlined in the journal.
2. Link to the **GitHub repository** containing the code.


<span style="font-size: 20px; color: #6E3482;">I. Imports and Setup</span>


In [15]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from collections import defaultdict

# Set random seed for reproducibility
np.random.seed(24)

# Define constants
number_of_patients = 200
evaluation_years = 4
evaluation_months = evaluation_years * 12
MAX_MATCHES = 100

# Define the variables used in the model
variables = ['pain_current', 'urgency_current', 'frequency_current', 
             'pain_baseline', 'urgency_baseline', 'frequency_baseline']


<span style="font-size: 20px; color: #6E3482;"> II. Baseline Data Generation</span>

In [19]:
# Generate baseline data for patients
baseline = pd.DataFrame({
    "patient_id": np.arange(number_of_patients),  # Ensure unique patient_id from 0 to number_of_patients-1
    "gender": np.random.choice(['M', 'F'], number_of_patients),
    "pain": np.random.randint(0, 10, number_of_patients),
    "urgency": np.random.randint(0, 10, number_of_patients),
    "frequency": np.random.randint(0, 20, number_of_patients),
    "age": np.random.randint(18, 75, number_of_patients),
    "location": np.random.choice(['Urban', 'Rural'], number_of_patients),
    "medical_history": np.random.choice(['None', 'Diabetes', 'Hypertension', 'Asthma'], number_of_patients)
})

# Display the first few rows of the baseline data
print(baseline.head())

   patient_id gender  pain  urgency  frequency  age location medical_history
0           0      F     3        8         10   28    Urban          Asthma
1           1      F     1        1         18   62    Rural        Diabetes
2           2      M     6        6          8   27    Urban          Asthma
3           3      F     0        0          0   60    Urban        Diabetes
4           4      M     6        2          3   18    Rural          Asthma


<span style="font-size: 20px; color: #6E3482;"> III. Generating Patient Evaluations</span>

In [20]:
# List to store all patient evaluations
patient_evaluations = []

# Generate evaluations for each patient
for patient_index in range(number_of_patients):
    treatment_start_time = np.random.choice(list(np.arange(3, evaluation_months + 1, 3)) + [None])
    
    # Simulate evaluations every 3 months
    for evaluation_month in range(3, evaluation_months + 1, 3):
        pain_level = np.random.randint(0, 10)
        urgency_level = np.random.randint(0, 10)
        symptom_frequency = np.random.randint(0, 20)
        # A patient counts as treated from the start month onward; None means never treated
        is_treated = 1 if treatment_start_time is not None and evaluation_month >= treatment_start_time else 0
        
        # Store the evaluation for this patient
        patient_evaluations.append({
            'patient_index': patient_index,
            'pain_level': pain_level,
            'urgency_level': urgency_level,
            'symptom_frequency': symptom_frequency,
            'months_since_entry': evaluation_month,
            'treatment_start_month': treatment_start_time,
            'is_treated': is_treated
        })

# Convert evaluations to DataFrame
evaluations_df = pd.DataFrame(patient_evaluations)

# Group by patient_index and calculate the mean for each patient's evaluations
evaluation_summary = evaluations_df.groupby('patient_index')[['pain_level', 'urgency_level', 'symptom_frequency']].mean()

# Display a preview of the evaluation summary
evaluation_summary.head()


               pain_level  urgency_level  symptom_frequency
patient_index
0                  4.8125         4.3750            11.6875
1                  4.2500         4.5625             7.7500
2                  5.8125         3.2500             7.5000
3                  3.3750         3.0625             9.3750
4                  3.5625         4.1875             8.1875
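
The `treatment_start_month` values generated above record when, if ever, each patient starts treatment. As a rough sketch of how such values could be turned into time-specific risk sets, the snippet below pairs the patients who start treatment in a given month with those not yet treated at that month. The helper name `build_risk_sets` and the toy data are assumptions for illustration and are not part of the notebook's own code.

```python
import pandas as pd

def build_risk_sets(starts: pd.Series) -> dict:
    """Hypothetical helper: start month -> (treated_ids, not_yet_treated_ids).

    `starts` is indexed by patient id and holds the treatment start month,
    with None/NaN for patients who are never treated.
    """
    starts = pd.to_numeric(starts, errors="coerce")  # None becomes NaN
    risk_sets = {}
    for month in sorted(starts.dropna().unique()):
        treated_ids = starts[starts == month].index.tolist()
        # Still untreated at `month`: starts later, or never starts.
        control_ids = starts[(starts > month) | starts.isna()].index.tolist()
        if treated_ids and control_ids:
            risk_sets[int(month)] = (treated_ids, control_ids)
    return risk_sets

# Toy example (made-up values): patient id -> treatment start month
toy_starts = pd.Series({0: 3, 1: 6, 2: None, 3: 6, 4: 12})
toy_risk_sets = build_risk_sets(toy_starts)
print(toy_risk_sets)
```

With the notebook's data, a per-patient series could presumably be obtained by keeping one row per `patient_index` from `evaluations_df` and selecting `treatment_start_month`; that wiring is left as an assumption here.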


<span style="font-size: 20px; color: #6E3482;"> IV. Matching Treated and Untreated Patients by Distance</span>
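
The matching cell in this section calls `compute_distance`, which none of the cells shown here define. A minimal sketch of one possible definition follows, assuming a plain Euclidean distance computed with `scipy.spatial.distance.cdist`. The metric is an assumption: the journal's procedure may call for a different one (for example Mahalanobis), in which case the `metric` argument would need to change.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def compute_distance(treated, untreated, variables, metric="euclidean"):
    """Assumed helper: pairwise distances, rows = treated, columns = untreated."""
    treated_values = np.asarray(treated[variables], dtype=float)
    untreated_values = np.asarray(untreated[variables], dtype=float)
    return cdist(treated_values, untreated_values, metric=metric)

# Toy check with made-up values (two treated, three untreated patients)
toy_treated = pd.DataFrame({"pain": [0, 3], "urgency": [0, 4]})
toy_untreated = pd.DataFrame({"pain": [3, 0, 6], "urgency": [4, 0, 8]})
toy_distances = compute_distance(toy_treated, toy_untreated, ["pain", "urgency"])
print(toy_distances.shape)
```

The returned matrix has one row per treated patient and one column per untreated patient, which is the layout the nested `enumerate` loops in the matching cell rely on.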

In [22]:
# Randomly split the baseline patients into treated and untreated groups
treated_group = baseline.sample(frac=0.5, random_state=24)  # 50% treated
untreated_group = baseline.drop(treated_group.index)  # Rest are untreated

# Define risk_sets (treated and untreated)
risk_sets = {
    "set_1": (treated_group, untreated_group)  # This is just an example; you can define multiple sets as needed
}

# Now proceed with matching and distance calculation

# Create an empty dictionary to store the new risk sets with matches
new_rs = {}

# Simulate risk sets matching process
for key, (t, u) in risk_sets.items():
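    # Attach baseline covariates; the suffixes create the *_current and
    # *_baseline columns listed in `variables`. Because t and u are themselves
    # subsets of `baseline`, both versions hold the same values here.
    treated = t.merge(baseline, on='patient_id', suffixes=('_current', '_baseline'))
    untreated = u.merge(baseline, on='patient_id', suffixes=('_current', '_baseline'))
    
    # Calculate distances between treated and untreated.
    # compute_distance must be defined before this cell runs; it is expected to
    # return a matrix with one row per treated and one column per untreated patient.
    distance_matrix = compute_distance(treated, untreated, variables)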

    # Store matches by distance
    distance_dict = defaultdict(list)
    for i, row in enumerate(distance_matrix):
        for j, distance in enumerate(row):
            distance_dict[distance].append((int(treated['patient_id'].iloc[i]), int(untreated['patient_id'].iloc[j])))
    
    # Walk the distances in ascending order and keep the MAX_MATCHES closest pairs
    mcf_matches = []
    for distance in sorted(distance_dict):
        remaining = MAX_MATCHES - len(mcf_matches)
        if remaining <= 0:
            break  # Stop scanning once the quota is filled
        mcf_matches.extend(distance_dict[distance][:remaining])
    
    new_rs[key] = mcf_matches

# Preview the matches for the first risk set
list(new_rs.values())[0][:5]


[(197, 28), (197, 33), (197, 36), (197, 38), (197, 42)]
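
The preview above shows patient 197 in several pairs, because the loop keeps the globally closest pairs without limiting how often a patient is used. If one-to-one pairing is wanted, one option is optimal assignment with `scipy.optimize.linear_sum_assignment`. This is a swapped-in technique, not the notebook's own method, and `one_to_one_matches` is a hypothetical helper name. The sketch runs on a made-up distance matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_matches(distance_matrix, treated_ids, untreated_ids):
    """Hypothetical helper: pair each treated patient with a distinct control,
    minimising the total distance summed over all pairs."""
    costs = np.asarray(distance_matrix, dtype=float)
    rows, cols = linear_sum_assignment(costs)
    return [(treated_ids[r], untreated_ids[c]) for r, c in zip(rows, cols)]

# Made-up distances: rows are treated patients 10 and 11,
# columns are untreated patients 20, 21 and 22.
toy_matrix = np.array([[1.0, 2.0, 9.0],
                       [1.5, 9.0, 9.0]])
toy_pairs = one_to_one_matches(toy_matrix, [10, 11], [20, 21, 22])
print(toy_pairs)
```

In the toy matrix, taking the single closest pair first (10 with 20) would force patient 11 onto a control at distance 9.0, whereas the assignment solver accepts a slightly worse first pair to lower the total.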

<span style="font-size: 20px; color: #6E3482;"> V. Result Evaluation</span>

In [23]:
# Check the length of the matches for each risk set
for key, matches in new_rs.items():
    print(f"Risk Set {key} has {len(matches)} matches.")

Risk Set set_1 has 100 matches.
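
The count alone does not show whether the 100 pairs cover many patients or reuse a few of them. Below is a small sketch of a reuse check on pairs in the same `(treated_id, untreated_id)` format as `new_rs`. `summarize_matches` is a hypothetical helper, and the toy list mixes two pairs from the preview with one made-up pair.

```python
from collections import Counter

def summarize_matches(matches):
    """Hypothetical helper: count pairs, distinct patients, and treated reuse."""
    treated_counts = Counter(t for t, _ in matches)
    untreated_counts = Counter(u for _, u in matches)
    return {
        "pairs": len(matches),
        "unique_treated": len(treated_counts),
        "unique_untreated": len(untreated_counts),
        "max_treated_reuse": max(treated_counts.values(), default=0),
    }

# (197, 28) and (197, 33) come from the preview; (5, 28) is made up.
toy_matches = [(197, 28), (197, 33), (5, 28)]
toy_summary = summarize_matches(toy_matches)
print(toy_summary)
```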
