### Simulating a Time-Based Data Pipeline with Full and Incremental Extraction Using Hospital Admissions Data


#### **Import Necessary Libraries**

In [37]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

#### **Hospital Admissions Data Simulation**
In the following code:

- A hospital admissions dataset is simulated over a 60-day period. Because the loop offsets `start_date` (April 1, 2025) by 1 to 60 days, admission dates run from April 2 through May 31, 2025.  
- Hospitals and patient severity levels are predefined for random assignment to each admission.  
- For each day, between 3 and 6 admissions are generated.
- Any run therefore yields between 180 and 360 records, inclusive (60 days × 3 to 6 admissions per day).
- Each admission is assigned a random patient ID, hospital, severity level, and a last updated timestamp with a random hour and minute on the admission date.  
- All records are stored in a list of dictionaries.  
- The list is converted into a Pandas DataFrame for easier data handling.  
- The DataFrame is saved as a CSV file named `'hospital_admissions.csv'`.  
- A preview of the first few rows is displayed to verify the structure and content of the data.


In [38]:
# Fix the seed so the simulated data is reproducible
random.seed(42)

# List of hospitals and severity levels
hospitals = ['General Hospital', 'City Clinic', 'Mercy Medical', 'St. Mary’s', 'County Hospital']
severity_levels = ['Low', 'Moderate', 'High', 'Critical']

data = []
start_date = datetime(2025, 4, 1)

# Simulate data for 60 days (offsets 1..60 from start_date, i.e. 2025-04-02 to 2025-05-31)
for i in range(1, 61):
    date = start_date + timedelta(days=i)
    
    # Random 3–6 admissions per day
    for _ in range(random.randint(3, 6)):
        data.append({
            'id': random.randint(1000, 9999),  # Random 4-digit patient ID (not guaranteed unique)
            'hospital': random.choice(hospitals),  # Random hospital
            'admission_date': date.date().isoformat(),  # Admission date
            'severity': random.choice(severity_levels),  # Condition severity
            'last_updated': (date + timedelta(
                hours=random.randint(0, 23),
                minutes=random.randint(0, 59)
            )).isoformat()  # Timestamp of last record update
        })

# Create DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('hospital_admissions.csv', index=False)

# Preview first few rows
df.head()


Unnamed: 0,id,hospital,admission_date,severity,last_updated
0,1409,Mercy Medical,2025-04-02,Moderate,2025-04-02T07:08:00
1,2679,County Hospital,2025-04-02,Low,2025-04-02T18:27:00
2,1520,General Hospital,2025-04-02,Low,2025-04-02T06:14:00
3,4257,County Hospital,2025-04-03,Critical,2025-04-03T07:28:00
4,5557,General Hospital,2025-04-03,Moderate,2025-04-03T22:27:00
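
The record-count and date-range claims for the simulation can be checked directly. The following is a minimal sketch, assuming a hypothetical helper `simulate_admissions` that wraps the same loop in a function (the helper name, its parameters, and the uppercase constants are illustrative and not part of the original notebook). It uses a private `random.Random` instance so the global random state is left alone.

```python
import random
from datetime import datetime, timedelta

import pandas as pd

HOSPITALS = ['General Hospital', 'City Clinic', 'Mercy Medical', 'St. Mary’s', 'County Hospital']
SEVERITY_LEVELS = ['Low', 'Moderate', 'High', 'Critical']


def simulate_admissions(seed=42, days=60, start=datetime(2025, 4, 1)):
    """Hypothetical helper: the notebook's simulation loop wrapped in a function."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    rows = []
    for offset in range(1, days + 1):
        day = start + timedelta(days=offset)
        for _ in range(rng.randint(3, 6)):
            rows.append({
                'id': rng.randint(1000, 9999),
                'hospital': rng.choice(HOSPITALS),
                'admission_date': day.date().isoformat(),
                'severity': rng.choice(SEVERITY_LEVELS),
                'last_updated': (day + timedelta(hours=rng.randint(0, 23),
                                                 minutes=rng.randint(0, 59))).isoformat(),
            })
    return pd.DataFrame(rows)


sim = simulate_admissions()

# 60 days x 3..6 admissions per day
assert 180 <= len(sim) <= 360
# Offsets 1..60 from April 1 give April 2 through May 31
assert sim['admission_date'].min() == '2025-04-02'
assert sim['admission_date'].max() == '2025-05-31'
assert sim['admission_date'].nunique() == 60

# IDs are drawn independently, so repeats are possible
print("duplicate ids:", int(sim['id'].duplicated().sum()))
```

Since `id` values can repeat, a downstream step that treats `id` as a primary key would need a different ID scheme; that is outside what this notebook does.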


#### **Full Extraction**
In the following code:
- The entire dataset is loaded from the CSV file `'hospital_admissions.csv'` with the `'last_updated'` column parsed as datetime.  
- The number of rows and columns in the dataset is displayed separately to give a detailed overview of its dimensions.  
- The total number of rows pulled is reiterated for clarity.  
- A sample of the first few rows is printed to verify the data content and structure.



In [39]:
import pandas as pd

# Load all rows from the CSV and parse 'last_updated' as datetime
df_full = pd.read_csv("hospital_admissions.csv", parse_dates=["last_updated"])

print(f"Number of rows: {df_full.shape[0]}")    # Rows
print(f"Number of columns: {df_full.shape[1]}") # Columns

# Show how many records were pulled
print(f"Pulled {df_full.shape[0]} rows via full extraction.")

print("Sample data:")
df_full.head()


Number of rows: 280
Number of columns: 5
Pulled 280 rows via full extraction.
Sample data:


Unnamed: 0,id,hospital,admission_date,severity,last_updated
0,1409,Mercy Medical,2025-04-02,Moderate,2025-04-02 07:08:00
1,2679,County Hospital,2025-04-02,Low,2025-04-02 18:27:00
2,1520,General Hospital,2025-04-02,Low,2025-04-02 06:14:00
3,4257,County Hospital,2025-04-03,Critical,2025-04-03 07:28:00
4,5557,General Hospital,2025-04-03,Moderate,2025-04-03 22:27:00
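
To see what `parse_dates` changes, here is a small sketch that reads two of the sample rows from an in-memory string rather than from disk (an assumption made only so the snippet runs on its own) and compares the resulting column types. The variable names `parsed` and `unparsed` are illustrative.

```python
import io

import pandas as pd

csv_text = """id,hospital,admission_date,severity,last_updated
1409,Mercy Medical,2025-04-02,Moderate,2025-04-02T07:08:00
2679,County Hospital,2025-04-02,Low,2025-04-02T18:27:00
"""

parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["last_updated"])
unparsed = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is a datetime type, so comparisons and .max() work on real timestamps
print(parsed["last_updated"].dtype)

# Without it, the column is kept as text
print(unparsed["last_updated"].dtype)

# admission_date was not listed in parse_dates, so it is still text in both frames
print(parsed["admission_date"].dtype)
```

The datetime type is what makes the later `>` comparison against the checkpoint a comparison of timestamps rather than of strings.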


#### **Setting Initial Last Extraction Time**
In the following code:
  
- A fixed datetime string `"2025-04-20 12:00:00"` is written to the file `last_extraction.txt` (opening it in `"w"` mode creates the file if it does not exist) to simulate the time of the previous extraction.  
- This timestamp serves as a reference point for future incremental data extraction processes.


In [40]:
# Set initial last extraction time 
with open("last_extraction.txt", "w") as f:
    f.write("2025-04-20 12:00:00") 
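
On a very first run there may be no checkpoint file at all. One possible way to handle that, sketched here under assumptions, is a hypothetical `read_checkpoint` helper that falls back to an early default timestamp so the first incremental run behaves like a full extraction. The helper name and the default value are illustrative; the snippet works in a temporary directory so it does not touch the real `last_extraction.txt`.

```python
import os
import tempfile

import pandas as pd


def read_checkpoint(path, default="1970-01-01 00:00:00"):
    """Hypothetical helper: return the stored checkpoint, or `default` if none exists."""
    try:
        with open(path, "r") as f:
            text = f.read().strip()
    except FileNotFoundError:
        text = ""
    # An empty or missing file falls back to the default
    return pd.to_datetime(text or default)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "last_extraction.txt")

    # No file yet: the default is returned
    first = read_checkpoint(path)

    # After the checkpoint is written, the stored value is returned
    with open(path, "w") as f:
        f.write("2025-04-20 12:00:00")
    second = read_checkpoint(path)

print(first)
print(second)
```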

#### **Incremental Extraction**
In the following code:

- The last extraction timestamp is read from the file `'last_extraction.txt'`, and any surrounding whitespace is removed with `strip()`.  
- The full dataset is loaded from `'hospital_admissions.csv'` with the `'last_updated'` column parsed as datetime.  
- The timestamp from the file is converted into a pandas datetime object for comparison.  
- The dataset is filtered to include only rows where the `'last_updated'` timestamp is strictly later than the last extraction time, simulating incremental extraction.  
- The filtered rows are saved to `'hospital_admission_incremental.csv'`.  
- The number of new or updated rows since the last extraction is displayed.  
- A sample of these new/updated records is shown for verification.


In [41]:
import pandas as pd

# Step 1: Read the last extraction timestamp from the text file
with open("last_extraction.txt", "r") as f:
    last_extraction = f.read().strip()

# Step 2: Load the full dataset and parse 'last_updated' as datetime
df = pd.read_csv("hospital_admissions.csv", parse_dates=["last_updated"])

# Step 3: Convert the last extraction time to datetime format
last_extraction_time = pd.to_datetime(last_extraction)

# Step 4: Filter rows that were updated after the last extraction time
df_incremental = df[df['last_updated'] > last_extraction_time]

# Step 5: Save the incremental rows to a separate CSV file
df_incremental.to_csv("hospital_admission_incremental.csv", index=False)

# Step 6: Display results
print(f"Pulled {len(df_incremental)} new/updated rows since {last_extraction}.")
df_incremental.head()


Pulled 196 new/updated rows since 2025-04-20 12:00:00.


Unnamed: 0,id,hospital,admission_date,severity,last_updated
81,4652,General Hospital,2025-04-20,Moderate,2025-04-20 12:21:00
85,6663,Mercy Medical,2025-04-21,Critical,2025-04-21 19:32:00
86,2894,St. Mary’s,2025-04-21,Moderate,2025-04-21 08:02:00
87,8144,General Hospital,2025-04-21,Moderate,2025-04-21 11:27:00
88,2146,Mercy Medical,2025-04-21,High,2025-04-21 21:54:00
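
The filter uses a strict `>` comparison, so a row stamped exactly at the checkpoint is not pulled again. The toy example below, with made-up rows and illustrative variable names, shows the difference between `>` and `>=` at the boundary.

```python
import pandas as pd

checkpoint = pd.Timestamp("2025-04-20 12:00:00")

# Three made-up rows around the checkpoint
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "last_updated": pd.to_datetime([
        "2025-04-20 11:59:00",  # before the checkpoint
        "2025-04-20 12:00:00",  # exactly at the checkpoint
        "2025-04-20 12:21:00",  # after the checkpoint
    ]),
})

strict = toy[toy["last_updated"] > checkpoint]
inclusive = toy[toy["last_updated"] >= checkpoint]

print(strict["id"].tolist())     # [3]
print(inclusive["id"].tolist())  # [2, 3]
```

When the checkpoint is the maximum `last_updated` of the previous pull, `>` avoids re-extracting that row. The trade-off is that a row written later with exactly the same timestamp would be missed; with minute-level timestamps like the ones simulated here, that is worth keeping in mind.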


#### **Updating the Last Extraction Timestamp**
In the following code:

- The most recent `'last_updated'` timestamp in the current dataset is identified as the new checkpoint.  
- This new checkpoint timestamp is saved to the `'last_extraction.txt'` file, overwriting the previous value.  
- A confirmation message is printed to indicate that the last extraction timestamp has been updated successfully.


In [42]:
# Step 1: Get the latest timestamp from the data
new_checkpoint = df['last_updated'].max()

# Step 2: Save this new checkpoint to the extraction file
with open("last_extraction.txt", "w") as f:
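    # isoformat() uses a 'T' separator (e.g. 2025-05-31T15:47:00); pd.to_datetime parses it like the space-separated form
    f.write(new_checkpoint.isoformat())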

# Step 3: Confirm the update
print(f"Updated last_extraction.txt to {new_checkpoint}")


Updated last_extraction.txt to 2025-05-31 15:47:00
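
Two edge cases are worth considering when updating the checkpoint: an extraction that returns no rows (where `.max()` yields `NaT`), and a crash midway through writing the file. The sketch below is one possible way to guard against both, assuming two hypothetical helpers, `next_checkpoint` and `save_checkpoint`, that are not part of the original notebook. It writes into a temporary directory rather than the real `last_extraction.txt`.

```python
import os
import tempfile

import pandas as pd


def next_checkpoint(batch, previous):
    """Hypothetical helper: advance the checkpoint only if the batch has rows."""
    if batch.empty:
        return previous  # .max() on an empty column is NaT, so keep the old value
    return max(previous, batch["last_updated"].max())


def save_checkpoint(path, ts):
    """Hypothetical helper: write to a temp file, then swap it in with os.replace."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(ts.isoformat())
    os.replace(tmp_path, path)


previous = pd.Timestamp("2025-04-20 12:00:00")
empty_batch = pd.DataFrame({"last_updated": pd.to_datetime([])})
new_batch = pd.DataFrame({
    "last_updated": pd.to_datetime(["2025-04-21 08:02:00", "2025-04-20 12:21:00"]),
})

kept = next_checkpoint(empty_batch, previous)    # unchanged
advanced = next_checkpoint(new_batch, previous)  # latest timestamp in the batch

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "last_extraction.txt")
    save_checkpoint(path, advanced)
    with open(path) as f:
        stored = f.read()

print(kept)
print(advanced)
print(stored)
```

Deriving the checkpoint from the extracted batch, rather than from the whole dataset, keeps the stored value tied to what was actually pulled. In this notebook the two coincide because the batch always contains the newest rows.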
