## Submission Deadline

- All submissions must be made before 10:00 PM on Thursday, April 18, 2024.

## Submission Guidelines

- **Format:** Submissions are to be made in PDF format via Avenue to Learn, either individually or as a group of up to three members.
    - **GitHub Repository:** Your submission must include a link to a public GitHub repository containing the assignment.
    - **Team Submissions:** For group submissions, Question 15 must detail each member's contributions. Note that while there are no points allocated to Question 15, failure to provide this information will result in the assignment not being graded.

Please view the submission in the public GitHub repository available here: `https://github.com/kyleosung/stats_3da3_a6.git`

## Late Submissions

- 15% will be deducted from the assignment grade for each day after the due date (rounding up).

- Assignments will not be accepted more than 48 hours after the due date.

## Assignment Standards

Please ensure your assignment adheres to the following standards for submission:

- **Title Page Requirements:** Each submission must include a title page featuring your group members' names and student numbers. Assignments lacking a title page will not be considered for grading.
- **Individual Work:** While discussing homework problems with peers and your group is permitted, the final written submission must be your group's own work.
- **Formatting Preferences:** The use of LaTeX for document preparation is highly recommended.
- **Font and Spacing:** Submissions must utilize an eleven-point font (Times New Roman or a similar font) with 1.5 line spacing. Ensure margins of at least 1 inch on all sides.
- **Submission Content:** Do not include the assignment questions within your PDF. Instead, clearly mark each response with the corresponding question number. Screenshots are not an acceptable form of submission under any circumstances.
- **Academic Writing:** Ensure that your writing and any references used are appropriate for an undergraduate level of study.
- **Originality Checks:** Be aware that the instructor may use various tools, including those available on the internet, to verify the originality of submitted assignments.
- Assignment policy on the use of generative AI:
    - Students are not permitted to use generative AI in this assignment. In alignment with the [McMaster academic integrity policy](https://secretariat.mcmaster.ca/app/uploads/Academic-Integrity-Policy-1-1.pdf), it "shall be an offence knowingly to ... submit academic work for assessment that was purchased or acquired from another source". This includes work created by generative AI tools. Also stated in the policy is the following: "Contract Cheating is the act of 'outsourcing of student work to third parties' (Lancaster & Clarke, 2016, p. 639) with or without payment." Using generative AI tools is a form of contract cheating. Charges of academic dishonesty will be brought forward to the Office of Academic Integrity.

\newpage

## Chronic Kidney Disease Classification Challenge

### Overview

Engage with the dataset from the [Early Stage of Indians Chronic Kidney Disease (CKD)](https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease) project, which comprises data on 250 early-stage CKD patients and 150 healthy controls.

For foundational knowledge on the subject, refer to "Predict, diagnose, and treat chronic kidney disease with machine learning: a systematic literature review" by [Sanmarchi et al., (2023)](https://link.springer.com/article/10.1007/s40620-023-01573-4).

### Objectives

Analyze the dataset using two classification algorithms, focusing on exploratory data analysis, feature selection and engineering, and especially the handling of missing values and outliers. Summarize your findings with insightful conclusions.

**Classifier Requirement:** Ensure at least one of the classifiers is interpretable, to facilitate in-depth analysis and inference.

### Guidelines

- **Teamwork:** Group submissions should compile the workflow (Python codes and interpretations) into a single PDF, including a GitHub repository link. The contributions listed should reflect the GitHub activity.
- **Content:** Address the following questions in your submission, offering detailed insights and conclusions from your analysis.

### Assignment Questions

1. **Classification Problem Identification:** Define and describe a classification problem based on the dataset.
2. **Variable Transformation:** Implement any transformations chosen or justify the absence of such modifications.
3. **Dataset Overview:** Provide a detailed description of the dataset, covering variables, summaries, observation counts, data types, and distributions (at least three statements).
4. **Association Between Variables:** Analyze variable relationships and their implications for feature selection or extraction (at least three statements).
5. **Missing Value Analysis and Handling:** Implement your strategy for identifying and addressing missing values in the dataset, or provide reasons for not addressing them.
6. **Outlier Analysis:** Implement your approach for identifying and managing outliers, or provide reasons for not addressing them.
7. **Sub-group Analysis:** Explore potential sub-groups within the data, employing appropriate data science methods to find the sub-groups of patients and visualize the sub-groups. The sub-group analysis must not include the labels (for CKD patients and healthy controls).
8. **Data Splitting:** Segregate 30% of the data for testing, using a random seed of 1. Use the remaining 70% for training and model selection.
9. **Classifier Choices:** Identify the two classifiers you have chosen and justify your selections.
10. **Performance Metrics:** Outline the two metrics for comparing the performance of the classifiers.
11. **Feature Selection/Extraction:** Implement methods to enhance the performance of at least one classifier in (9). The answer for this question can be included in (12).
12. **Classifier Comparison:** Utilize the selected metrics to compare the classifiers based on the test set. Discuss your findings (at least two statements).
13. **Interpretable Classifier Insight:** After re-training the interpretable classifier with all available data, analyze and interpret the significance of predictor variables in the context of the data and the challenge (at least two statements).
14. **[Bonus]** Sub-group Improvement Strategy: If sub-groups were identified, propose and implement a method to further improve the performance of one classifier. Compare the performance of the new classifier with the results in (12).
15. **Team Contributions:** Document each team member's specific contributions related to the questions above.
16. **Link** to the public GitHub repository.

### Notes

- This assignment encourages you to apply sophisticated machine learning methods to a vital healthcare challenge, promoting the development of critical analytical skills, teamwork, and practical problem-solving abilities in the context of chronic kidney disease diagnosis and treatment.
- Students can choose one classifier not covered in the lectures.

# Section 1: Classification Problem Identification

In [18]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [19]:
## Load in the dataset
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
chronic_kidney_disease = fetch_ucirepo(id=336) 
  
# data (as pandas dataframes) 
X = chronic_kidney_disease.data.features 
y = chronic_kidney_disease.data.targets

In [20]:
# metadata 
chronic_kidney_disease.metadata

{'uci_id': 336,
 'name': 'Chronic Kidney Disease',
 'repository_url': 'https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease',
 'data_url': 'https://archive.ics.uci.edu/static/public/336/data.csv',
 'abstract': 'This dataset can be used to predict the chronic kidney disease and it can be collected from the hospital nearly 2 months of period.',
 'area': 'Other',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 400,
 'num_features': 24,
 'feature_types': ['Real'],
 'demographics': ['Age'],
 'target_col': ['class'],
 'index_col': None,
 'has_missing_values': 'yes',
 'missing_values_symbol': 'NaN',
 'year_of_dataset_creation': 2015,
 'last_updated': 'Mon Mar 04 2024',
 'dataset_doi': '10.24432/C5G020',
 'creators': ['L. Rubini', 'P. Soundarapandian', 'P. Eswaran'],
 'intro_paper': None,
 'additional_info': {'summary': 'We use the following representation to collect the dataset\r\n                        age\t\t-\tage\t\r\n\t\t\tbp\t\t-\tblood p

In [21]:
# variable information 
chronic_kidney_disease.variables

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,age,Feature,Integer,Age,,year,yes
1,bp,Feature,Integer,,blood pressure,mm/Hg,yes
2,sg,Feature,Categorical,,specific gravity,,yes
3,al,Feature,Categorical,,albumin,,yes
4,su,Feature,Categorical,,sugar,,yes
5,rbc,Feature,Binary,,red blood cells,,yes
6,pc,Feature,Binary,,pus cell,,yes
7,pcc,Feature,Binary,,pus cell clumps,,yes
8,ba,Feature,Binary,,bacteria,,yes
9,bgr,Feature,Integer,,blood glucose random,mgs/dl,yes


From the variables dataframe, we see that there are 24 predictor variables and one target variable (`class`). The metadata identifies the task at hand: to classify whether a patient has chronic kidney disease or not, based on the predictor variables.

In our exploratory analysis, we observe the following:
- There are a few categorical variables: `sg` (specific gravity), `al` (albumin), and `su` (sugar). We will have to preprocess these, perhaps with one-hot encoding, before we can apply machine learning algorithms.
- There are missing values in every predictor variable. These missing values may have to be imputed, or the affected rows may have to be removed. However, given that there are only 400 observations in the dataset, we can afford to remove rows only sparingly.
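
To judge how sparingly rows can be removed, it may help to tabulate the missing values per column and count the rows that would survive a complete-case deletion. The following is a minimal sketch on a small made-up frame; `demo` and its values are hypothetical stand-ins for `X`, not data from the study.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few columns of X; the values are made up.
demo = pd.DataFrame({
    "age": [48.0, 7.0, np.nan, 51.0],
    "rbc": [np.nan, np.nan, "normal", "normal"],
    "sod": [np.nan, np.nan, np.nan, 111.0],
})

# Missing values per column, as a count and as a fraction of the rows.
missing_summary = pd.DataFrame({
    "n_missing": demo.isna().sum(),
    "frac_missing": demo.isna().mean(),
}).sort_values("n_missing", ascending=False)

# Rows that would remain after dropping every row with any missing value.
n_complete_rows = len(demo.dropna())

print(missing_summary)
print(f"Complete rows: {n_complete_rows} of {len(demo)}")
```

Applied to the real data, the same two quantities would show which columns drive the loss of rows under listwise deletion.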

In [22]:
print(f"sg Specific Gravity Unique Values: \n{X['sg'].unique()}\n")
print(f"al Albumin Unique Values: \n{X['al'].unique()}\n")
print(f"su Sugar Unique Values: \n{X['su'].unique()}")

sg Specific Gravity Unique Values: 
[1.02  1.01  1.005 1.015   nan 1.025]

al Albumin Unique Values: 
[ 1.  4.  2.  3.  0. nan  5.]

su Sugar Unique Values: 
[ 0.  3.  4.  1. nan  2.  5.]
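
One-hot encoding, mentioned above as an option for these categorical variables, could be sketched with `pandas.get_dummies`. The frame `demo` below is a hypothetical stand-in with made-up values, and keeping missingness as its own indicator column (`dummy_na=True`) is an assumption for illustration, not a choice made in this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the albumin column; the values are made up.
demo = pd.DataFrame({"al": [1.0, 4.0, 0.0, np.nan, 1.0]})

# One indicator column per observed level. dummy_na=True adds a separate
# indicator for missing values instead of leaving all indicators at zero.
encoded = pd.get_dummies(demo, columns=["al"], dummy_na=True, dtype=int)

print(encoded)
```

Because the levels of `sg`, `al`, and `su` are numeric and ordered, leaving them as numbers is also a defensible alternative to expanding them into indicators.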


In [23]:
# view the first couple rows for the first fifteen columns
X.iloc[:3, :15]

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,36.0,1.2,,,15.4
1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,,18.0,0.8,,,11.3
2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,423.0,53.0,1.8,,,9.6


In [24]:
# view the first couple rows for the last columns
X.iloc[:3, 15:]

Unnamed: 0,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane
0,44.0,7800.0,5.2,yes,yes,no,good,no,no
1,38.0,6000.0,,no,no,no,good,no,no
2,31.0,7500.0,,no,yes,no,poor,no,yes


# Section 2: Variable Transformation

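The binary variables can be encoded as 0/1 indicator variables (equivalent to one-hot encoding with one level dropped). Where it makes the meaning of the value 1 explicit, we also rename the column.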

In [25]:
chronic_kidney_disease.variables[chronic_kidney_disease.variables['type'] == 'Binary']

Unnamed: 0,name,role,type,demographic,description,units,missing_values
5,rbc,Feature,Binary,,red blood cells,,yes
6,pc,Feature,Binary,,pus cell,,yes
7,pcc,Feature,Binary,,pus cell clumps,,yes
8,ba,Feature,Binary,,bacteria,,yes
18,htn,Feature,Binary,,hypertension,,yes
19,dm,Feature,Binary,,diabetes mellitus,,yes
20,cad,Feature,Binary,,coronary artery disease,,yes
21,appet,Feature,Binary,,appetite,,yes
22,pe,Feature,Binary,,pedal edema,,yes
23,ane,Feature,Binary,,anemia,,yes


In [26]:
# Binary predictors only: filtering on role excludes the binary target (class)
variables = chronic_kidney_disease.variables
binary_variables = variables[(variables['type'] == 'Binary') & (variables['role'] == 'Feature')]['name'].unique()
binary_variables

array(['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane'],
      dtype=object)

In [27]:
for var in binary_variables:
    print(f'{var} Values: {X[var].unique()}')


rbc Values: [nan 'normal' 'abnormal']
pc Values: ['normal' 'abnormal' nan]
pcc Values: ['notpresent' 'present' nan]
ba Values: ['notpresent' 'present' nan]
htn Values: ['yes' 'no' nan]
dm Values: ['yes' 'no' '\tno' nan]
cad Values: ['no' 'yes' nan]
appet Values: ['good' 'poor' nan]
pe Values: ['no' 'yes' nan]
ane Values: ['no' 'yes' nan]


In [28]:
%%capture
# suppress warnings

# ONE-HOT ENCODE RBC COLUMN
X.loc[X['rbc'] == 'normal', 'rbc'] = 1
X.loc[X['rbc'] == 'abnormal', 'rbc'] = 0
X = X.rename(columns = {'rbc': 'rbc_normal'})

# ONE-HOT ENCODE PC COLUMN
X.loc[X['pc'] == 'normal', 'pc'] = 1
X.loc[X['pc'] == 'abnormal', 'pc'] = 0
X = X.rename(columns = {'pc': 'pc_normal'})

# ONE-HOT ENCODE PCC COLUMN
X.loc[X['pcc'] == 'present', 'pcc'] = 1
X.loc[X['pcc'] == 'notpresent', 'pcc'] = 0
X = X.rename(columns = {'pcc': 'pcc_present'})

# ONE-HOT ENCODE BA COLUMN
X.loc[X['ba'] == 'present', 'ba'] = 1
X.loc[X['ba'] == 'notpresent', 'ba'] = 0
X = X.rename(columns = {'ba': 'ba_present'})

# ONE-HOT ENCODE HTN COLUMN
X.loc[X['htn'] == 'yes', 'htn'] = 1
X.loc[X['htn'] == 'no', 'htn'] = 0
# no need to rename

# ONE-HOT ENCODE DM COLUMN
X = X.replace('\tno', 'no', regex = True)
X.loc[X['dm'] == 'yes', 'dm'] = 1
X.loc[X['dm'] == 'no', 'dm'] = 0
# no need to rename

# ONE-HOT ENCODE CAD COLUMN
X.loc[X['cad'] == 'yes', 'cad'] = 1
X.loc[X['cad'] == 'no', 'cad'] = 0
# no need to rename

# ONE-HOT ENCODE appet COLUMN
X.loc[X['appet'] == 'good', 'appet'] = 1
X.loc[X['appet'] == 'poor', 'appet'] = 0
X = X.rename(columns={'appet': 'good_appet'})

# ONE-HOT ENCODE pe COLUMN
X.loc[X['pe'] == 'yes', 'pe'] = 1
X.loc[X['pe'] == 'no', 'pe'] = 0
# no need to rename

# ONE-HOT ENCODE ane COLUMN
X.loc[X['ane'] == 'yes', 'ane'] = 1
X.loc[X['ane'] == 'no', 'ane'] = 0
# no need to rename
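
The cell above repeats the same three steps for each column. A more compact variant could drive the encoding from a single lookup table, as in the sketch below. The frame `demo`, the dictionary `encoding`, and the use of `str.strip` to clean entries such as `'\tno'` are illustrative assumptions, not the code that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few binary columns of X; the values are made up.
demo = pd.DataFrame({
    "rbc": [np.nan, "normal", "abnormal"],
    "dm": ["yes", "\tno", "no"],
    "appet": ["good", "poor", np.nan],
})

# For each column: (label mapped to 1, label mapped to 0, new column name).
encoding = {
    "rbc": ("normal", "abnormal", "rbc_normal"),
    "dm": ("yes", "no", "dm"),
    "appet": ("good", "poor", "good_appet"),
}

encoded = demo.copy()  # work on a copy so the original frame is untouched
for col, (one_label, zero_label, new_name) in encoding.items():
    cleaned = encoded[col].str.strip()  # removes stray whitespace such as '\tno'
    encoded[col] = cleaned.map({one_label: 1.0, zero_label: 0.0})  # NaN stays NaN
    encoded = encoded.rename(columns={col: new_name})

print(encoded)
```

Working on an explicit copy also avoids the chained-assignment warnings that the original cell suppresses with `%%capture`.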

In [29]:
# view the first couple rows for the first fifteen columns
X.iloc[:5, :15]

Unnamed: 0,age,bp,sg,al,su,rbc_normal,pc_normal,pcc_present,ba_present,bgr,bu,sc,sod,pot,hemo
0,48.0,80.0,1.02,1.0,0.0,,1.0,0.0,0.0,121.0,36.0,1.2,,,15.4
1,7.0,50.0,1.02,4.0,0.0,,1.0,0.0,0.0,,18.0,0.8,,,11.3
2,62.0,80.0,1.01,2.0,3.0,1.0,1.0,0.0,0.0,423.0,53.0,1.8,,,9.6
3,48.0,70.0,1.005,4.0,0.0,1.0,0.0,1.0,0.0,117.0,56.0,3.8,111.0,2.5,11.2
4,51.0,80.0,1.01,2.0,0.0,1.0,1.0,0.0,0.0,106.0,26.0,1.4,,,11.6


In [30]:
# view the first couple rows for the last columns
X.iloc[:5, 15:]

Unnamed: 0,pcv,wbcc,rbcc,htn,dm,cad,good_appet,pe,ane
0,44.0,7800.0,5.2,1.0,1,0,1,0,0
1,38.0,6000.0,,0.0,0,0,1,0,0
2,31.0,7500.0,,0.0,1,0,0,0,1
3,32.0,6700.0,3.9,1.0,0,0,0,1,1
4,35.0,7300.0,4.6,0.0,0,0,1,0,0


In [31]:
for col in X.columns:
    print(f"{col} Values: {X[col].unique()}")

age Values: [48.  7. 62. 51. 60. 68. 24. 52. 53. 50. 63. 40. 47. 61. 21. 42. 75. 69.
 nan 73. 70. 65. 76. 72. 82. 46. 45. 35. 54. 11. 59. 67. 15. 55. 44. 26.
 64. 56.  5. 74. 38. 58. 71. 34. 17. 12. 43. 41. 57.  8. 39. 66. 81. 14.
 27. 83. 30.  4.  3.  6. 32. 80. 49. 90. 78. 19.  2. 33. 36. 37. 23. 25.
 20. 29. 28. 22. 79.]
bp Values: [ 80.  50.  70.  90.  nan 100.  60. 110. 140. 180. 120.]
sg Values: [1.02  1.01  1.005 1.015   nan 1.025]
al Values: [ 1.  4.  2.  3.  0. nan  5.]
su Values: [ 0.  3.  4.  1. nan  2.  5.]
rbc_normal Values: [nan  1.  0.]
pc_normal Values: [ 1.  0. nan]
pcc_present Values: [ 0.  1. nan]
ba_present Values: [ 0.  1. nan]
bgr Values: [121.  nan 423. 117. 106.  74. 100. 410. 138.  70. 490. 380. 208.  98.
 157.  76.  99. 114. 263. 173.  95. 108. 156. 264. 123.  93. 107. 159.
 140. 171. 270.  92. 137. 204.  79. 207. 124. 144.  91. 162. 246. 253.
 141. 182.  86. 150. 146. 425. 112. 250. 360. 163. 129. 133. 102. 158.
 165. 132. 104. 127. 415. 169. 251. 109. 280. 2

Categorical variables that are naturally ordered can be encoded as integers corresponding to their order. Such variables include `sg`, `al`, and `su`.

# Section 3: Dataset Overview

In [32]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 24 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          391 non-null    float64
 1   bp           388 non-null    float64
 2   sg           353 non-null    float64
 3   al           354 non-null    float64
 4   su           351 non-null    float64
 5   rbc_normal   248 non-null    float64
 6   pc_normal    335 non-null    float64
 7   pcc_present  396 non-null    float64
 8   ba_present   396 non-null    float64
 9   bgr          356 non-null    float64
 10  bu           381 non-null    float64
 11  sc           383 non-null    float64
 12  sod          313 non-null    float64
 13  pot          312 non-null    float64
 14  hemo         348 non-null    float64
 15  pcv          329 non-null    float64
 16  wbcc         294 non-null    float64
 17  rbcc         269 non-null    float64
 18  htn          398 non-null    float64
 19  dm      

# Section 4: Association Between Variables

# Section 5: Missing Value Analysis and Handling

In [33]:
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

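If values are missing completely at random (MCAR), we may remove the affected rows without biasing the analysis. If the missingness depends on other variables or on the missing values themselves, we should not simply remove the rows and must take this into account.

We use Little's MCAR test.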

In [34]:
from pyampute.exploration.mcar_statistical_tests import MCARTest

def mcar(data, alpha = 0.05):
    # Little's MCAR test; null hypothesis: the data are missing completely at random
    mt = MCARTest(method='little')
    p_value = mt.little_mcar_test(data)

    if p_value < alpha:
        print(f'Reject null hypothesis: Data is not MCAR (p-value={p_value:.4f})')
    else:
        # failing to reject does not prove that the data are MCAR
        print(f'Do not reject null hypothesis: no evidence against MCAR (p-value={p_value:.4f})')

mcar(X)

Reject null hypothesis: Data is not MCAR (p-value=0.0000)
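
Since the test rejects MCAR, dropping incomplete rows could bias the analysis, which points toward imputation. One simple option is sketched below with scikit-learn's `SimpleImputer`, using the median for numeric columns and the most frequent value for binary ones. The frame `demo`, the column grouping, and the choice of strategies are assumptions for illustration, not the strategy adopted in this assignment.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for two encoded columns of X; the values are made up.
demo = pd.DataFrame({
    "bgr": [121.0, np.nan, 423.0, 117.0],
    "htn": [1.0, 0.0, np.nan, 0.0],
})

numeric_cols = ["bgr"]  # assumed grouping of continuous predictors
binary_cols = ["htn"]   # assumed grouping of 0/1 predictors

imputed = demo.copy()
# Median is less sensitive to extreme values than the mean.
imputed[numeric_cols] = SimpleImputer(strategy="median").fit_transform(demo[numeric_cols])
# The most frequent value keeps binary columns restricted to 0 and 1.
imputed[binary_cols] = SimpleImputer(strategy="most_frequent").fit_transform(demo[binary_cols])

print(imputed)
```

In a modelling pipeline the imputers would normally be fitted on the training split only and then applied to the test split. `SimpleImputer` also accepts `add_indicator=True`, which appends missingness indicator columns; that may be worth considering when the data are not MCAR.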


# Section 6: Outlier Analysis

# Section 7: Subgroup Analysis

# Section 8: Data Splitting
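
Question 8 asks for 30% of the data to be held out for testing with a random seed of 1. A minimal sketch of such a split with scikit-learn's `train_test_split` follows. The names `X_demo` and `y_demo`, the feature names, and the synthetic values are hypothetical, and stratifying on the label is an assumption added here rather than a requirement of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape of problem: 400 rows, two classes.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(["ckd"] * 250 + ["notckd"] * 150, name="class")

# Hold out 30% for testing; random_state=1 fixes the shuffle.
# stratify keeps the class proportions equal in both splits (an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts())
```

Any preprocessing that learns from the data, such as imputation or scaling, would then be fitted on the training portion only.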

# Section 9: Classifier Choices

# Section 10: Performance Metrics

# Section 11: Feature Selection and Extraction

# Section 12: Classifier Comparison

# Section 13: Interpretable Classifier Insight

# [BONUS] Section 14: Subgroup Improvement Strategy

\newpage

## Grading scheme 

\begin{table}[H]
\begin{tabular}{p{0.15\textwidth}  p{0.65\textwidth}}
1.   & Answer [1]\\
2.   & Codes [2] \\
     & OR answer [2]\\
3.   & Codes [3] and answer [3]\\
4.   & Codes [2] and answer [3]\\
5.   & Codes [2]\\
     & OR answer [2]\\
6.   & Codes [2] \\
     & OR answer [2]\\
7.   & Codes [3] and Plot [1]\\
8.   & Codes [1]\\
9.   & Answers [2]\\
10.   & Describe the two metrics [2]\\
11.   & Codes [2] \\
      & these codes can be included in (12)\\
12.   & Codes (two classifiers training,\\
     & model selection for each classifier, \\
     & classifiers comparisons) [5] and answer [2]\\
13.   & Codes [1] and answers [2]\\
14.   & Codes and comparison will \\
     & give \textbf{bonus 2 points for the final grade}.\\
\end{tabular}
\end{table}

**The maximum score for this assignment is 39 points. We will convert this to 100%.**

**All group members will receive the same grade if they contribute equally.**
