
**Effect of Gene Polymorphism on Renal Dysfunction After Liver Transplantation in Children**

It has been demonstrated that by 3 years after transplant, renal failure develops in 16.5% of all nonrenal solid organ recipients. The attached Excel sheet presents hypothetical study data in which the occurrence of disease (i.e. renal failure) depends on sex, genotype, and TST (time since transplant, in years). The data includes 60 children (male and female) who received a liver transplant. Renal dysfunction is recorded as a binary outcome.

a) Using some suitable software system, produce the binary logistic model from the given training data. (3)

b) Compute the odds ratio for all the predictor variables (using the probability method) and interpret them appropriately. (4)

c) Re-compute the odds ratios from part (b) using the exponential formulas. (3)

d) Compare the odds in favor of the patients having three years since transplant with the odds in favor of the patients having seven years since transplant. Also interpret it properly. (2)

e) Evaluate the performance of the model in (a) from the given test data (see excel sheet). (3)

In [100]:
%pip install pandas statsmodels numpy scikit-learn openpyxl



**Part (a): Produce the Binary Logistic Model from the Given Training Data**

In [101]:
import numpy as np
import pandas as pd

file_path = '/content/Dataset_CEP.xlsx'
data = pd.read_excel(file_path, sheet_name='TrainingData')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ID       60 non-null     int64  
 1   Sex      60 non-null     object 
 2   Type     60 non-null     object 
 3   TST      56 non-null     float64
 4   Disease  60 non-null     object 
dtypes: float64(1), int64(1), object(3)
memory usage: 2.5+ KB


In [102]:
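# Fill each missing TST value with a roll number modulo 7
roll_numbers = [3, 25, 33, 50]
missing_tst_values = [roll_no % 7 for roll_no in roll_numbers]  # [3, 4, 5, 1]

# TST has 4 missing rows (56 of 60 non-null), one per roll number
missing_indices = data[data['TST'].isna()].index
data.loc[missing_indices, 'TST'] = missing_tst_values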

In [103]:
data['Sex'] = data['Sex'].map({'Male': 0, 'Female': 1})
data['Type'] = data['Type'].astype('category').cat.codes

In [104]:
import statsmodels.api as sm

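# Sex and Type are already numerically encoded, so get_dummies leaves X unchanged
X = pd.get_dummies(data[['Sex', 'Type', 'TST']], drop_first=True)
# Disease becomes a single 0/1 indicator column named 'Present'
y = pd.get_dummies(data['Disease'], drop_first=True).astype(int)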

# Adding constant for intercept
X = sm.add_constant(X)

# Fitting the logistic regression model
model = sm.Logit(y, X)
result = model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.547535
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                Present   No. Observations:                   60
Model:                          Logit   Df Residuals:                       56
Method:                           MLE   Df Model:                            3
Date:                Sat, 29 Jun 2024   Pseudo R-squ.:                  0.2075
Time:                        21:43:19   Log-Likelihood:                -32.852
converged:                       True   LL-Null:                       -41.455
Covariance Type:            nonrobust   LLR p-value:                 0.0006408
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.7278      1.247     -2.988      0.003      -6.173      -1.283
Sex           -0.1961      0.

**Part (b): Compute the Odds Ratio for All the Predictor Variables Using the Probability Method and Interpret Them**

# 1. Compute the Odds Ratios

In [105]:
odds_ratios_prob_method = np.exp(result.params)
print("Odds Ratios (Probability Method):")
print(odds_ratios_prob_method)

Odds Ratios (Probability Method):
const    0.024045
Sex      0.821923
Type     2.000294
TST      1.369950
dtype: float64


# 2. Interpretation of Odds Ratios (Probability Method):

Intercept (const):

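The exponentiated intercept (const) is approximately 0.024. Strictly, this is the baseline odds rather than an odds ratio.
It means that when all predictors are zero (Sex = 0, i.e. male; Type code 0; TST = 0), the odds of renal dysfunction are about 0.024, which corresponds to a probability of roughly 2.3%.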

Sex:

The odds ratio for Sex is approximately 0.822.
This indicates that, holding other variables constant, females (coded 1) have about 18% lower odds of renal dysfunction than males (coded 0).

Type:

The odds ratio for Type is approximately 2.000.
This suggests that, holding other variables constant, each one-step increase in the Type code doubles the odds of renal dysfunction. If Type has only two genotypes, the genotype coded 1 has twice the odds of the reference genotype coded 0.

TST:

The odds ratio for TST is approximately 1.370.
This indicates that, holding other variables constant, each additional year since transplant multiplies the odds of renal dysfunction by about 1.37, i.e. an increase of roughly 37% per year.
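
The cell above obtains the odds ratios by exponentiating the coefficients. A minimal sketch of what a probability-based calculation could look like follows: predict the probability at a baseline and again with one predictor raised by one unit, convert both to odds, and take the ratio. The coefficients are recovered from the printed odds ratios (beta = ln(OR)); the helper names and the baseline patient are assumptions made for illustration, not part of the original notebook.

```python
import math

# Coefficients recovered from the odds ratios printed above (beta = ln(OR)).
coef = {
    "const": math.log(0.024045),
    "sex": math.log(0.821923),
    "gene_type": math.log(2.000294),
    "tst": math.log(1.369950),
}

def predicted_probability(sex, gene_type, tst):
    """Logistic-model probability of renal dysfunction for one patient."""
    z = (coef["const"] + coef["sex"] * sex
         + coef["gene_type"] * gene_type + coef["tst"] * tst)
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

def odds_ratio_by_probability(name, baseline):
    """Raise one predictor by one unit and compare the odds before and after."""
    raised = dict(baseline, **{name: baseline[name] + 1})
    p0 = predicted_probability(**baseline)
    p1 = predicted_probability(**raised)
    return odds(p1) / odds(p0)

# Hypothetical reference patient: male, Type code 0, 3 years since transplant.
baseline = {"sex": 0, "gene_type": 0, "tst": 3}
for name in baseline:
    print(name, odds_ratio_by_probability(name, baseline))
```

Because the model has no interaction terms, the ratio does not depend on which baseline patient is chosen, so this route should reproduce the exponentiated coefficients up to rounding.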

**Part (c): Re-compute the Odds Ratios Using the Exponential Formulas**

In [106]:
# Odds ratio for each term: OR = exp(coefficient)
odds_ratios_exp_formula = pd.DataFrame({
    'Variable': X.columns,
    'Odds Ratio': np.exp(result.params)
})
print("Odds Ratios (Exponential Formula):")
print(odds_ratios_exp_formula)

Odds Ratios (Exponential Formula):
      Variable  Odds Ratio
const    const    0.024045
Sex        Sex    0.821923
Type      Type    2.000294
TST        TST    1.369950


**Part (d): Compare the Odds for Patients Having Three Years Since Transplant with the Odds for Patients Having Seven Years Since Transplant and Interpret**

# 1. Calculate the odds ratios for TST = 3 and TST = 7 (relative to TST = 0):

In [107]:
odds_3_years = np.exp(result.params['TST'] * 3)
odds_7_years = np.exp(result.params['TST'] * 7)
print(f"Odds ratio for 3 years: {odds_3_years}")
print(f"Odds ratio for 7 years: {odds_7_years}")

Odds ratio for 3 years: 2.571068792589684
Odds ratio for 7 years: 9.05590711921535


# 2. Interpret the results:

### Interpretation of Odds Ratios:

1. **Odds Ratio for 3 years:**
   - The odds ratio for 3 years since transplant is approximately 2.571.
   - This indicates that, holding other variables constant, patients who are 3 years post-transplant have about 2.571 times the odds of renal dysfunction of patients at TST = 0, because exp(3 × coefficient of TST) compares TST = 3 with the model's baseline of TST = 0.

2. **Odds Ratio for 7 years:**
   - The odds ratio for 7 years since transplant is approximately 9.056.
   - This suggests that, holding other variables constant, patients who are 7 years post-transplant have about 9.056 times the odds of renal dysfunction of patients at TST = 0.

### Comparison and Impact on Renal Dysfunction:

- **Duration since transplant:** As the duration since transplant increases from 3 years to 7 years, the odds of renal dysfunction rise markedly.

- The odds ratio for 7 years (9.056) is about 3.52 times that for 3 years (2.571). Equivalently, holding other variables constant, a patient 7 years post-transplant has roughly 3.52 times the odds of renal dysfunction of a patient 3 years post-transplant, which is the per-year odds ratio raised to the fourth power (1.370^4).

- This comparison highlights the progressive effect of time since transplant on renal dysfunction, emphasizing the importance of monitoring and potentially adjusting treatment strategies over time.
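
The direct comparison of 7 years against 3 years can be sketched as follows. The values exp(3β) and exp(7β) are each odds ratios relative to TST = 0, so their quotient exp(4β) compares the two time points with each other. The coefficients here are recovered from the printed odds ratios, and the reference patient (male, Type code 0) is an assumption chosen only to show actual odds rather than ratios.

```python
import math

# Coefficients recovered from the printed odds ratios (beta = ln(OR)).
beta_const = math.log(0.024045)
beta_tst = math.log(1.369950)

# Odds ratios relative to TST = 0, as computed in the cell above.
or_3_vs_0 = math.exp(beta_tst * 3)
or_7_vs_0 = math.exp(beta_tst * 7)

# Odds ratio of 7 years versus 3 years: a four-year difference in TST.
or_7_vs_3 = math.exp(beta_tst * (7 - 3))

# Actual odds for a hypothetical reference patient (Sex = 0, Type code 0).
odds_3_years = math.exp(beta_const + beta_tst * 3)
odds_7_years = math.exp(beta_const + beta_tst * 7)

print(f"OR, 7 vs 3 years: {or_7_vs_3:.3f}")
print(f"Odds at 3 years: {odds_3_years:.4f}, at 7 years: {odds_7_years:.4f}")
```

The ratio `odds_7_years / odds_3_years` equals `or_7_vs_3` for any fixed Sex and Type, since those terms cancel.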

**Part (e): Evaluate the Performance of the Model from Part (a) Using the Test Data**

In [108]:
test_data = pd.read_excel(file_path, sheet_name='TestData')

In [109]:
test_data['Sex'] = test_data['Sex'].map({'Male': 0, 'Female': 1})
test_data['Type'] = test_data['Type'].astype('category').cat.codes
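
One caveat with encoding `Type` separately in each sheet: `astype('category').cat.codes` numbers the levels found in that particular frame, so the same genotype could receive different codes in the training and test data if a level is absent from one of them. A minimal sketch of a shared encoding follows, assuming the original string labels are still available. The genotype labels `"AA"` and `"AG"` are hypothetical, since the sheet's actual labels are not shown here.

```python
import pandas as pd

# Hypothetical genotype labels, for illustration only.
train = pd.DataFrame({"Type": ["AA", "AG", "AG", "AA"]})
test = pd.DataFrame({"Type": ["AG", "AG"]})

# Independent encoding: "AG" is the only level in test, so it gets code 0.
independent = test["Type"].astype("category").cat.codes.tolist()

# Shared encoding: reuse the level order learned from the training data.
type_levels = sorted(train["Type"].unique())
shared = pd.Categorical(test["Type"], categories=type_levels).codes.tolist()

print(independent)  # [0, 0]
print(shared)       # [1, 1]
```

If both sheets contain every genotype, the two approaches agree; the shared encoding only matters when the sets of levels differ.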

In [110]:
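from sklearn.metrics import accuracy_score

X_test = test_data[['Sex', 'Type', 'TST']]
y_test = test_data['Disease']
# Encode the outcome the same way as in training: Present = 1, Absent = 0
y_true_numeric = y_test.map({'Absent': 0, 'Present': 1})
# has_constant='add' forces the intercept column even if a test column is constant
X_test = sm.add_constant(X_test, has_constant='add')

# Predicted probabilities, classified as "Present" above a 0.5 cutoff
predictions = result.predict(X_test)
predictions_binary = (predictions > 0.5).astype(int)

accuracy = accuracy_score(y_true_numeric, predictions_binary)
print(f"Model accuracy: {accuracy}")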

Model accuracy: 0.5714285714285714
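
Accuracy alone does not show which kind of error the model makes. A short sketch of how a confusion matrix, sensitivity, and specificity could be added is given below. The labels are hypothetical placeholders, not the notebook's test data; in the notebook, `y_true_numeric` and `predictions_binary` would take their place.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (1 = Present, 0 = Absent).
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 0, 0]

# With labels=[0, 1], the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # share of true "Present" cases detected
specificity = tn / (tn + fp)  # share of true "Absent" cases detected

print(f"Accuracy: {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
```

For a clinical outcome such as renal dysfunction, sensitivity is often the figure of most interest, because a missed case is usually costlier than a false alarm.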
