# Mini-Datathon

For this session, we provide 3 datasets for the mini-datathon. You are free to choose another dataset, but please check with us before proceeding.

The aim of this mini-datathon is to see how we can apply what we have learnt to:
- process and tidy the data
- explore the data and define a question of interest
- visualize the relevant data
- use the appropriate statistical test/plot
- present and interpret the plot(s)
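
As a rough sketch of how these steps can fit together, the example below runs them on a tiny made-up table. The data frame `toy` and its columns `group` and `value` are hypothetical, and the numbers are invented purely for illustration. The Wilcoxon rank-sum test is only one possible choice for comparing a numeric variable between two groups; pick whatever suits your own question.

```R
library(tidyverse)

# Hypothetical toy data (invented values, not taken from any of the datasets)
toy <- tibble(
  group = c(0, 0, 0, 0, 1, 1, 1, 1),
  value = c(5.1, 4.8, 6.0, 5.5, 7.2, 6.9, 8.1, 7.5)
)

# 1. Process and tidy: turn the 0/1 code into a labelled factor
toy <- toy %>%
  mutate(group = factor(group, levels = c(0, 1), labels = c("control", "treated")))

# 2. Explore: summarise the numeric variable within each group
toy_summary <- toy %>%
  group_by(group) %>%
  summarise(n = n(), median_value = median(value))
print(toy_summary)

# 3. Visualise: boxplot with the raw points overlaid
p <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot() +
  geom_jitter(width = 0.1)
print(p)

# 4. Test: compare the two groups with a Wilcoxon rank-sum test
test_result <- wilcox.test(value ~ group, data = toy)
print(test_result$p.value)
```

The interpretation step is then a sentence or two on what the plot and the test result suggest about the question you defined.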

For the mini-datathon, please choose **ONE** dataset and prepare a **short slide presentation** (e.g. 3-4 slides) of your analysis. We will provide the links on the day of the mini-datathon.

The slides should contain the following:
- The team name and group members
- 1 paragraph stating question(s) of interest
- 1-2 plots to address the question(s)
- 1 paragraph stating your choice of the statistical plot and your interpretation
- The code used to process and plot the data (include it as an appendix slide for documentation)

We will review the submissions using the following rubric:

![datathon_rubric_stat.png](images/datathon_rubric_stat.png)

You may find the cheatsheets here:
- base R: https://github.com/rstudio/cheatsheets/raw/main/base-r.pdf
- Tidy data transformation: https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf
- ggplot2: https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf

We have also included a notebook that summarizes the statistical tables/tests/plots **(09 - Summary)**.

---
# Dataset 1: Heart Failure Dataset

This is a dataset containing the medical records of 299 heart failure patients collected at the Faisalabad Institute of Cardiology and at the Allied Hospital in Faisalabad (Punjab, Pakistan), during April–December 2015. All 299 patients had left ventricular systolic dysfunction and had previous heart failures that put them in classes III or IV of New York Heart Association (NYHA) classification of the stages of heart failure.

The target outcome is `DEATH_EVENT` (0=alive, 1=dead)

Columns:
- `age`: age of the patient (years)
- `anaemia`: decrease of red blood cells or hemoglobin (0 = no, 1 = yes)
- `high_blood_pressure`: if the patient has hypertension (0 = no, 1 = yes)
- `creatinine_phosphokinase`: level of the CPK enzyme in the blood (mcg/L)
- `diabetes`: if the patient has diabetes (0 = no, 1 = yes)
- `ejection_fraction`: percentage of blood leaving the heart at each contraction (percentage)
- `platelets`: platelets in the blood (kiloplatelets/mL)
- `sex`: woman or man (0 = woman, 1 = man)
- `serum_creatinine`: level of serum creatinine in the blood (mg/dL)
- `serum_sodium`: level of serum sodium in the blood (mEq/L)
- `smoking`: if the patient smokes or not (0 = no, 1 = yes)
- `time`: follow-up period (days)
- `DEATH_EVENT`: if the patient died during the follow-up period (0 = no, 1 = yes)

**Source:** UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records)

```R
library(tidyverse)

data1 <- read_csv("https://raw.githubusercontent.com/kennethban/dataset/main/heart_failure_clinical_records_dataset.csv")

head(data1)

# start here
```
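
One possible first tidying step for this dataset is to recode the 0/1 columns as labelled factors so that tables and plots are easier to read. The sketch below uses a small hypothetical stand-in table, `toy1`, with invented values and a few of the column names listed above; the same `mutate()` call could be adapted to `data1`.

```R
library(tidyverse)

# Hypothetical stand-in for a few columns of data1 (invented values)
toy1 <- tibble(
  sex = c(0, 1, 1, 0, 1, 0),
  smoking = c(0, 1, 0, 0, 1, 0),
  ejection_fraction = c(38, 20, 35, 60, 25, 40),
  DEATH_EVENT = c(0, 1, 0, 0, 1, 1)
)

# Recode the 0/1 columns into labelled factors, following the column descriptions
toy1_labelled <- toy1 %>%
  mutate(
    sex = factor(sex, levels = c(0, 1), labels = c("woman", "man")),
    smoking = factor(smoking, levels = c(0, 1), labels = c("no", "yes")),
    DEATH_EVENT = factor(DEATH_EVENT, levels = c(0, 1), labels = c("alive", "dead"))
  )

# Count patients in each outcome group
outcome_counts <- toy1_labelled %>% count(DEATH_EVENT)
print(outcome_counts)
```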

---
# Dataset 2: Pima Indians Diabetes Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Columns:
- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (mg/dL)
- `BloodPressure`: Diastolic blood pressure (mmHg)
- `SkinThickness`: Triceps skin fold thickness (mm)
- `Insulin`: 2-Hour serum insulin (micro U/ml)
- `BMI`: Body mass index (weight in kg/(height in m)^2)
- `DiabetesPedigreeFunction`: Diabetes pedigree function (derived by study authors - may ignore)
- `Age`: Age (years)
- `Outcome`: Onset of non-insulin-dependent diabetes mellitus (DM) within a five-year period (0 = no, 1 = yes)

**Source:** Kaggle (https://www.kaggle.com/uciml/pima-indians-diabetes-database)

```R
library(tidyverse)

data2 <- read_csv("https://raw.githubusercontent.com/kennethban/dataset/main/diabetes.csv")

head(data2)

# start here
```
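
A value of 0 is not physiologically plausible for measurements such as `Glucose`, `BloodPressure`, or `BMI`, so it may be worth checking whether zeros stand in for missing measurements. That reading is an assumption to verify against the data, not something stated in the description above. The sketch below uses a hypothetical stand-in table, `toy2`, with invented values, to show how zeros could be counted and then converted to `NA`.

```R
library(tidyverse)

# Hypothetical stand-in for a few columns of data2 (invented values)
toy2 <- tibble(
  Glucose = c(148, 85, 0, 89, 137),
  BloodPressure = c(72, 66, 64, 0, 40),
  BMI = c(33.6, 26.6, 23.3, 28.1, 0),
  Outcome = c(1, 0, 1, 0, 1)
)

# Count zeros in each measurement column
zero_counts <- toy2 %>%
  summarise(across(c(Glucose, BloodPressure, BMI), ~ sum(.x == 0)))
print(zero_counts)

# Assumption: treat 0 as missing in these columns only (Outcome is left alone)
toy2_clean <- toy2 %>%
  mutate(across(c(Glucose, BloodPressure, BMI), ~ na_if(.x, 0)))
print(toy2_clean)
```

Whether to drop, keep, or impute such rows afterwards depends on the question being asked.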

---
# Dataset 3: Hepatocellular Carcinoma Survival

This dataset was obtained from a University Hospital in Portugal and contains demographic, risk-factor, laboratory, and overall survival features of 165 patients diagnosed with hepatocellular carcinoma (HCC).

This is a heterogeneous dataset, with 23 quantitative variables and 26 qualitative variables. The target variable is survival at 1 year (Alive/Dead).

Columns:
- `Gender`: M/F
- `Symptoms`: Y/N
- `Alcohol`: Y/N
- `HBsAg`: Hepatitis B Surface Antigen (Y/N)
- `HBeAg`: Hepatitis B e Antigen (Y/N)
- `HBcAb`: Hepatitis B Core Antibody (Y/N)
- `HCVAb`: Hepatitis C Virus Antibody (Y/N)
- `Cirrhosis`: Y/N
- `Endemic`: Endemic country (Y/N)
- `Smoking`: Y/N
- `Diabetes`: Y/N
- `Obesity`: Y/N
- `Hemochro`: Hemochromatosis (Y/N)
- `AHT`: Arterial Hypertension (Y/N)
- `CRI`: Chronic Renal Insufficiency (Y/N)
- `HIV`: Human Immunodeficiency Virus (Y/N)
- `NASH`: Nonalcoholic Steatohepatitis (Y/N)
- `Varices`: Esophageal Varices (Y/N)
- `Spleno`: Splenomegaly (Y/N)
- `PHT`: Portal Hypertension (Y/N)
- `PVT`: Portal Vein Thrombosis (Y/N)
- `Metastasis`: Liver Metastasis (Y/N)
- `Hallmark`: Radiological Hallmark (Y/N)
- `Age`: Age at diagnosis (years)
- `Alcohol_grams_day`: Grams of Alcohol per day
- `Packs_year`: Packs of cigarettes per year
- `PS`: Performance Status (Active/Restricted/Ambulatory/Selfcare/Disabled)
- `Encephalopathy`: Encephalopathy degree (None/Grade_I_II/Grade_III_IV)
- `Ascites`: Ascites degree (None/Mild/Moderate_Severe)
- `INR`: International Normalised Ratio
- `AFP`: Alpha-Fetoprotein (ng/mL)
- `Hb`: Haemoglobin (g/dL)
- `MCV`: Mean Corpuscular Volume (fl)
- `Leucocytes`: Leukocytes (G/L)
- `Platelets`: Platelets (G/L)
- `Albumin`: Albumin (mg/dL)
- `Total_Bil`: Total Bilirubin (mg/dL)
- `ALT`: Alanine transaminase (U/L)
- `AST`: Aspartate transaminase (U/L)
- `GGT`: Gamma glutamyl transferase (U/L)
- `ALP`: Alkaline phosphatase (U/L)
- `TP`: Total Proteins (g/dL)
- `Creatinine`: Creatinine (mg/dL)
- `Nodules`: Number of Nodules
- `Major_dim`: Major dimension of nodule (cm)
- `Dir_Bil`: Direct Bilirubin (mg/dL)
- `Iron`: Iron (mcg/dL)
- `Sat`: Oxygen Saturation (%)
- `Ferritin`: Ferritin (ng/mL)
- `Outcome`: Alive/Dead


In the original dataset, missing data represents 10.22% of the whole dataset and only eight patients have complete information in all fields. For this mini-datathon, the missing values were imputed using the `missForest` package:

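```R
library(tidyverse)
data_input <- read_csv("https://raw.githubusercontent.com/kennethban/dataset/main/hcc_survival.csv") # original with missing values
data_imputed <- data_input %>%
  mutate_if(is.character, as.factor) %>% # strings to factors
  as.data.frame() %>%                    # pass a plain data frame rather than a tibble
  missForest::missForest()               # impute; the imputed table is in data_imputed$ximp
```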
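
Figures like the share of missing cells and the number of complete rows can be computed directly in base R. The sketch below does this on a small hypothetical table, `toy_missing`, with invented values; applying the same two expressions to the original HCC file is one way to check the numbers quoted above.

```R
# Hypothetical toy table with missing values (invented, not the HCC data)
toy_missing <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(10, 20, 30, 40)
)

# Percentage of all cells that are missing
pct_missing <- 100 * mean(is.na(toy_missing))

# Number of rows with no missing values in any column
n_complete <- sum(complete.cases(toy_missing))

print(pct_missing)
print(n_complete)
```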

**Source:** https://archive.ics.uci.edu/ml/datasets/HCC+Survival

```R
library(tidyverse)

data3 <- read_csv("https://raw.githubusercontent.com/kennethban/dataset/main/hcc_survival_impute.csv") %>%
         mutate_if(sapply(., is.character), as.factor) # strings to factors

head(data3)

# start here
```
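
Many columns in this dataset are categorical, so a question here may involve comparing two categorical variables. As a minimal sketch, the example below builds a contingency table, runs Fisher's exact test, and draws a proportion bar plot on a hypothetical stand-in table, `toy3`, with invented values. Fisher's exact test is used here only because the toy counts are small; it is one option among several.

```R
library(tidyverse)

# Hypothetical stand-in for two columns of data3 (invented values)
toy3 <- tibble(
  Symptoms = factor(c("Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N")),
  Outcome  = factor(c("Dead", "Dead", "Dead", "Dead", "Alive",
                      "Alive", "Alive", "Alive", "Dead", "Alive"))
)

# Contingency table of the two categorical variables
tab <- table(toy3$Symptoms, toy3$Outcome)
print(tab)

# Fisher's exact test of association between the two variables
fisher_result <- fisher.test(tab)
print(fisher_result$p.value)

# Bar plot showing the proportion of each outcome within each group
p <- ggplot(toy3, aes(x = Symptoms, fill = Outcome)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")
print(p)
```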