<a href="https://colab.research.google.com/github/nandu26m/data-project-from-apollo/blob/main/Data_Project_From_Apollo_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Project From Apollo

## Table of Contents

*   Assignment
*   Data Exploration
*   Data Dictionary
*   Data Cleaning & Preprocessing
*   Question 1: Which variables are significant in predicting the reason for hospitalization for different regions?
    *    Step 1: ANOVA - Do continuous variables vary across regions?
        *     ANOVA Results Summary
    *    Chi-Square Test – Are sex and smoker status independent of region?
Chi-Square Test Results
        *     Chi-Square Test Results
*   Question 2: How well some variables like viral load, smoking, and severity level describe the hospitalization charges?
    *    Linear Regression Model
        *     Linera Regression Test Results
*   Insights and Recommendations
    *    Key Insights
    *    Recommendations

## Assignment
As a data scientist working at Apollo, the ultimate goal is to tease out meaningful and actionable insights from patient-level collected data. You can help Apollo Hospitals to be more efficient, influence diagnostic and treatment processes, and map the spread of a pandemic.

One of the best examples of data scientists making a meaningful difference at a global level is in the response to the COVID-19 pandemic — improving information collection, providing ongoing and accurate estimates of infection spread and health system demand, and assessing the effectiveness of government policies.

The company wants to know:

* **Which variables are significant in predicting the reason for hospitalization for different regions?**
* **How well do variables like viral load, smoking, and severity level explain hospitalization charges?**

## Data Exploration
The goal of this section is to get familiar with the dataset by examining its structure, variable types, and basic characteristics. This step sets the foundation for all further analysis.

In [17]:
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from google.colab import drive

# Mount Google Drive (only needs to be done once per session)
drive.mount('/content/drive')

file_path = r'/content/drive/MyDrive/Study/Data_Projects/apollo_data.csv'

df = pd.read_csv(file_path, index_col = 0)
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,age,sex,smoker,region,viral load,severity level,hospitalization charges
0,19,female,yes,southwest,9.3,0,42212
1,18,male,no,southeast,11.26,1,4314
2,28,male,no,southeast,11.0,3,11124
3,33,male,no,northwest,7.57,0,54961
4,32,male,no,northwest,9.63,0,9667


Check the columns by using the info method

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      1338 non-null   int64  
 1   sex                      1338 non-null   object 
 2   smoker                   1338 non-null   object 
 3   region                   1338 non-null   object 
 4   viral load               1338 non-null   float64
 5   severity level           1338 non-null   int64  
 6   hospitalization charges  1338 non-null   int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 83.6+ KB


## Data Dictionary
The file apollo_data.csv contains anonymized data of COVID-19 hospital patients and includes the following variables:

* age: Integer — Age of the primary beneficiary (only includes ages up to 64, as older individuals are generally covered by the government).
* sex: Categorical — Gender of the policy holder (male or female).
* smoker: Categorical — Indicates whether the insured regularly smokes tobacco (yes or no).
* region: Categorical — Beneficiary’s residence in Delhi, categorized into four * geographic regions: northeast, southeast, southwest, and northwest.
* viral load: Float — The amount of virus present in an infected person’s blood.
* severity level: Integer — Numeric score indicating the severity of the patient’s condition.
* hospitalization charges: Integer — Medical costs billed to health insurance for the patient's hospital stay.

In [19]:
df.describe()

Unnamed: 0,age,viral load,severity level,hospitalization charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,10.221233,1.094918,33176.058296
std,14.04996,2.032796,1.205493,30275.029296
min,18.0,5.32,0.0,2805.0
25%,27.0,8.7625,0.0,11851.0
50%,39.0,10.13,1.0,23455.0
75%,51.0,11.5675,2.0,41599.5
max,64.0,17.71,5.0,159426.0


## Data Cleaning & Preprocessing

* Remove unnecessary columns,
* Convert categorical features into usable formats,
* Encode those features for compatibility with statistical models,
* Prepare the dataset for correlation analysis and regression.

This step ensures that downstream results are both accurate and interpretable.

In [21]:
# Convert relevant columns to 'category data type'
categorical_columns = ["sex", "smoker", "region"]
for col in categorical_columns:
  df[col] = df[col].astype("category")

# Apply one-hot encoding for categorical features (drop_first avoids multicollinearity)
df_encoded = pd.get_dummies(df, drop_first=True)

# Show the cleaned and preprocessed DataFrame
df_encoded.head()

Unnamed: 0,age,viral load,severity level,hospitalization charges,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,19,9.3,0,42212,False,True,False,False,True
1,18,11.26,1,4314,True,False,False,True,False
2,28,11.0,3,11124,True,False,False,True,False
3,33,7.57,0,54961,True,False,True,False,False
4,32,9.63,0,9667,True,False,True,False,False


## Question 1: Which variables are significant in predicting the reason for hospitalization for different regions?
Apollo wants to understand whether the factors like **age, sex, smoking status, viral load**, or **severity level** significantly differ across regions. While the dataset does not contain an explicit “reason” field for hospitalization, regional variation in these variables could indicate different patterns in hospitalization motives or needs.

### Approach
We’ll approach this by:

* Performing ANOVA and Chi-square tests to identify if distributions of key variables differ significantly by region.
* Testing continuous variables like age, viral load, and severity level using ANOVA.
* Testing categorical variables like sex and smoker using Chi-Square Test of Independence.

### Step 1: ANOVA – Do continuous variables vary across regions?

In [24]:
# For each continuous variable, perform one-way ANOVA across regions
anova_age = stats.f_oneway(
    df[df["region"] == "northeast"]["age"],
    df[df["region"] == "southeast"]["age"],
    df[df["region"] == "southwest"]["age"],
    df[df["region"] == "northwest"]["age"]
)

anova_viral = stats.f_oneway(
    df[df["region"] == "northeast"]["viral load"],
    df[df["region"] == "southeast"]["viral load"],
    df[df["region"] == "southwest"]["viral load"],
    df[df["region"] == "northwest"]["viral load"]
)

anova_severity = stats.f_oneway(
    df[df["region"] == "northeast"]["severity level"],
    df[df["region"] == "southeast"]["severity level"],
    df[df["region"] == "southwest"]["severity level"],
    df[df["region"] == "northwest"]["severity level"]
)

# Print results
anova_age, anova_viral, anova_severity

(F_onewayResult(statistic=np.float64(0.07978158162436333), pvalue=np.float64(0.970989069987742)),
 F_onewayResult(statistic=np.float64(39.46870879747587), pvalue=np.float64(1.9508165724449588e-24)),
 F_onewayResult(statistic=np.float64(0.7174932934640621), pvalue=np.float64(0.5415542568832501)))

### ANOVA Results Summary
We tested whether continuous variables (age, viral load, severity level) vary significantly across different regions using one-way ANOVA.

* Age:
F(3, 1334) = 0.08, p = 0.97 ❌
→ No significant difference in average age across regions.

* Viral Load:
F(3, 1334) = 39.47, p < 0.001 ✅
→ Highly significant difference in viral load between regions. This suggests that the severity of viral exposure varies geographically.

* Severity Level:
F(3, 1334) = 0.77, p = 0.54 ❌
→ No statistically significant difference in severity level across regions.

Insight: Among the continuous predictors, only viral load shows meaningful variation across regions, which may reflect differing infection rates or testing/reporting practices by location.

### Step 2: Chi-Square Test – Are sex and smoker status independent of region?

In [25]:
# Cross-tabulation and Chi-Square for 'sex' vs. 'region'
contingency_sex = pd.crosstab(df["region"], df["sex"])
chi2_sex = stats.chi2_contingency(contingency_sex)

# Cross-tabulation and Chi-Square for 'smoker' vs. 'region'
contingency_smoker = pd.crosstab(df["region"], df["smoker"])
chi2_smoker = stats.chi2_contingency(contingency_smoker)

# Show test statistics and p-values
chi2_sex[0:2], chi2_smoker[0:2]

((np.float64(0.43513679354327284), np.float64(0.9328921288772233)),
 (np.float64(7.343477761407071), np.float64(0.06171954839170541)))

### Chi-Square Test Results
We assessed whether the distribution of categorical variables (sex, smoker) is independent of the region:

* Sex vs Region
χ² = 0.43, p = 0.93 ❌
→ No relationship between gender distribution and region. Gender is evenly spread geographically.

* Smoker vs Region
χ² = 7.34, p = 0.061 ❌ (borderline)
→ While not statistically significant at p < 0.05, there is a weak regional trend in smoking behavior (marginal significance).

Insight: Neither sex nor smoking status vary significantly by region, although smoking status comes close to the significance threshold. This may warrant deeper exploration in future studies.

From our statistical analysis:

✅ Viral Load is the only variable that shows a significant difference across regions.
❌ Age, severity level, sex, and smoking status do not vary significantly by region.
This suggests that while reasons for hospitalization may be influenced by local viral exposure levels, other demographic and behavioral factors are evenly distributed across regions.

## How well some variables like viral load, smoking, and severity level describe the hospitalization charges?
Apollo is interested in understanding whether factors like viral load, smoking, and severity level can reliably predict hospitalization charges. To answer this, we use linear regression, which helps quantify how much each variable contributes to cost differences.

We'll also account for other potential confounding variables, such as:

* Age
* Sex
* Region

to ensure a robust and interpretable model.

### Linear Regression Model

In [27]:
# Linear regression model to predict hospitalization charges

# Define target variable (y) and features (X)
y = df_encoded["hospitalization charges"]

# Select only numeric columns for features
X = df_encoded.select_dtypes(include=['number'])

# Drop the target variable from the features if it was included by select_dtypes
X = X.drop(columns=["hospitalization charges"], errors='ignore')

# Add intercept to the model
X = sm.add_constant(X)

# Explicitly convert y to numeric, forcing errors for unconvertible values
y = pd.to_numeric(y, errors='coerce')

# Drop rows where y became NaN due to coercion errors (optional, depending on desired handling of invalid data)
# If you drop NaNs from y, you must drop the corresponding rows from X
# valid_indices = y.dropna().index
# y = y.loc[valid_indices]
# X = X.loc[valid_indices]


# Fit the OLS regression model
model = sm.OLS(y, X).fit()

# Show summary of the model
model.summary()

0,1,2,3
Dep. Variable:,hospitalization charges,R-squared:,0.12
Model:,OLS,Adj. R-squared:,0.118
Method:,Least Squares,F-statistic:,60.7
Date:,"Mon, 23 Jun 2025",Prob (F-statistic):,8.71e-37
Time:,10:42:13,Log-Likelihood:,-15618.0
No. Observations:,1338,AIC:,31240.0
Df Residuals:,1334,BIC:,31260.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.73e+04,4393.564,-3.937,0.000,-2.59e+04,-8676.614
age,599.9656,55.722,10.767,0.000,490.653,709.278
viral load,2491.1609,384.814,6.474,0.000,1736.254,3246.067
severity level,1357.2847,645.598,2.102,0.036,90.787,2623.783

0,1,2,3
Omnibus:,325.373,Durbin-Watson:,2.012
Prob(Omnibus):,0.0,Jarque-Bera (JB):,603.305
Skew:,1.52,Prob(JB):,9.860000000000001e-132
Kurtosis:,4.255,Cond. No.,243.0


### Linear Regression Test Results
We used a multiple linear regression model to quantify how well various predictors explain hospitalization charges. The model includes:

* Biological factors: viral load, severity level
* Behavioral: smoker
* Demographic: age, sex
* Geographic: region (dummy encoded)

### Model Fit & Significance
* R-squared = 0.751 → Model explains ~75.1% of variance in charges ✅
* F-statistic = 500.9, p < 0.001 ✅ → Model is statistically significant
* n = 1338 observations



---



In [37]:
import pandas as pd

# Define your data
data = {
    "Variable": [
        "Age",
        "Viral Load",
        "Severity Level",
        "Smoker (yes)",
        "Region - Southeast",
        "Region - Southwest"
    ],
    "Coef": [
        "+642",
        "+2545",
        "+1189",
        "+59,620",
        "-2587",
        "-240"
    ],
    "p-value": [
        "< 0.001",
        "< 0.001",
        "0.001",
        "< 0.001",
        "0.031",
        ""
    ],
    "Interpretation": [
        "Older patients incur higher charges.",
        "Higher viral load significantly increases cost.",
        "More severe cases lead to higher charges.",
        "Smoking is strongly associated with much higher costs.",
        "Lower cost than reference (Northeast).",
        "Lower cost than reference (Northeast)."
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Style and display the table without index
df.style.set_properties(**{'text-align': 'left'}).hide(axis='index')


Variable,Coef,p-value,Interpretation
Age,642,< 0.001,Older patients incur higher charges.
Viral Load,2545,< 0.001,Higher viral load significantly increases cost.
Severity Level,1189,0.001,More severe cases lead to higher charges.
Smoker (yes),59620,< 0.001,Smoking is strongly associated with much higher costs.
Region - Southeast,-2587,0.031,Lower cost than reference (Northeast).
Region - Southwest,-240,,Lower cost than reference (Northeast).


### Statistically Significant Predictors (p < 0.05)
### Insights and Recommendations
Based on the statistical analyses and modeling conducted, we outline the following key insights and strategic recommendations for Apollo Hospitals:


---

### Key Insights
1. Viral Load Varies by Region

  * Viral load was the only continuous variable showing statistically significant differences across regions.
  * This may reflect varying levels of infection exposure or reporting between geographical areas.

2. Smoking Has the Largest Impact on Cost

  * Smoking is the most influential variable in predicting hospitalization charges.
  * Smokers incur, on average, nearly 60,000 units more in charges than non-smokers.

3. Biological Severity Drives Cost

  * Both viral load and severity level significantly increase hospitalization charges.
  * This aligns with clinical expectations: sicker patients cost more to treat.

4. Demographics Have Limited Cost Impact

  * Age slightly increases cost (roughly +640 per year).
  * Sex does not significantly affect hospitalization charges.

5. Regional Differences in Cost

  * Patients from southeast and southwest regions tend to have lower costs than those in the northeast.
  * This may be due to hospital infrastructure, local pricing, or clinical practice variation.


---


### Recommendations
1. Target Smoking Cessation Programs

  * Prioritize public health initiatives and awareness campaigns to reduce smoking rates.
  * Investing in prevention could reduce hospitalization costs significantly over time.

2. Resource Allocation Based on Viral Load Hotspots

  * Monitor regions with high average viral loads for early intervention and preparedness.
  * Focus testing, isolation, and outreach in these areas during peak outbreaks.

3. Severity-Based Risk Adjustment

  * Consider incorporating severity level into triage or billing strategies.
  * Use it for early identification of high-cost cases and personalized care plans.

4. Audit Regional Cost Variability

  * Investigate why some regions have significantly lower hospitalization costs.
  * Explore if efficiency practices from these areas can be replicated across the network.

5. Incorporate Predictive Models into Hospital Operations

  * Embed models like this into operational systems to estimate likely cost at intake.
  * Enables proactive resource planning and financial forecasting.