# Module 5 - Getting to know your data

Hello, and welcome to this introductory course in Machine Learning for Biomedical Applications. To kick things off, I'd like to introduce you to [Weka](https://waikato.github.io/weka-wiki/downloading_weka/), one of the first machine learning platforms I used when I started out learning ML. Please take a minute to check out [this great Weka video tutorial](https://waikato.github.io/weka-wiki/downloading_weka/) by Google. I think Weka is a great gateway platform that will take you a long way down these first steps into the ML ecosystem. In fact, Weka is still my go-to platform for exploratory data analysis (EDA), which is today's topic. Follow along with the Weka example in the accompanying lecture slides as we get to know the autism dataset distributed with this notebook.

 ## What we learned about the autism dataset using Weka
 1. We have 292 observations and 21 features
 2. The authors have developed a model using Demographics and 10 Yes/No questions to score patients for autism.
 3. Some features contain missing values.
 4. The "relation" categorical variable appears to have been filled in by hand and contains typos.
 5. Some features are not uniformly sampled. Ethnicity is one example: most patients are white.
 6. Most patients are either 4 or 11 years old, with few patients in between.
 7. 2/3 of patients are Male.
 8. We can train a model with 100% accuracy using only the 10 yes/no questions.

**Questions**
+ How would you interpret the value of the authors' score metric?
+ What are the precision and sensitivity of the authors' model?
+ In our trained model using only the 10 yes/no questions, what score does a patient start off with before we consider the answers to the 10 yes/no questions?
+ What score would a new patient get if they answered yes to all 10 questions?

## What question am I answering?

When getting to know new data try to think about the following questions:
+ Why was the data collected? 
+ How was the data intended to be used?
+ Who was the author? This can inform your understanding of the data’s purpose.
+ What does the literature say?
    + Cohorts and datasets are often well studied.
    + Which features seem to be important, and which clinical need do they address?
+ What do the values in the data represent?
+ What are the values trying to represent?
+ **What potential added value does this data have for the greater clinical need?** 

A lot of this preliminary work is going to be done by hand and driven by your understanding of the subject matter. It is easy to underestimate how much effort this will take, especially if the subject is outside of your wheelhouse. Only after you complete this critical step should you begin to formulate your central question. Skipping this due diligence is a common pitfall that leads to poor, or at the very least unimpactful, science. So, it is not enough to ask
*"What relationship can I demonstrate with this data?"* Rather, you should have a benchmark in mind that you want to overcome __*because*__ you have this data.

For example, let's say the clinical standard for diagnosing some disease is through a biopsy. A well accepted rubric exists that quantifies the level of pathology in the tissue sample based on some histological metrics. But, you have accompanying urine analysis data not utilized by the benchmark model and also have evidence suggesting it has some prognostic potential.

You can:
1. Demonstrate a gain in prognostic strength by adding urine markers to the benchmark model.
2. Demonstrate that the urine analysis is a more cost-effective predictor than the invasive biopsy.
3. Demonstrate a relationship between the urine markers and the underlying histological features to improve the decision-making process that dictates when a biopsy is performed.
4. Infer a causal relationship between the urine markers and histological features to make statements about the progression of the pathology.

Each analysis will require the data to be formatted in its own way.

# Exercise

In this exercise we will tidy up the 21 variables in the autism data. We will then build a model to predict the binary YES/NO autism diagnosis. To format the data we will:
+ Load the autism data in .arff format into a pandas dataframe
+ Check that all categorical features have consistent annotations, i.e. no typos
+ Ensure there are no missing values
+ Remove uninformative features
+ Report a "table 1" of patient characteristics

## Requirements
Let's begin by installing all the packages we'll need for this exercise. It's also good practice to set the random number seed first thing so that we can reproduce the results.

In [None]:
# requirements
!pip install scipy
!pip install pandas
!pip install numpy
!pip install tableone
!pip install matplotlib

# Globals
seed = 1017

## Load the data
Ideally, data should be loaded from a repository and accessed programmatically for the sake of transparency and reproducibility. The best analyses are totally divorced from the source data. We don't have that here. Today we will be reading a data file distributed along with this exercise.

In [None]:
from scipy.io import arff

#Read the arff file with scipy.io.arff.loadarff, which returns the data and metadata
data, meta = arff.loadarff('data/Autism-Child-Data.arff')

#the metadata contains information such as name and type of attributes along with value ranges

In [None]:
#let's look at the meta data
display(meta)

In [None]:
#the feature names and types can also be accessed as arrays
display(meta.names())
display(meta.types())

In [None]:
#the data is loaded as a record array accessible by attribute names
#but it might be more convenient to convert it to a pandas dataframe 
import pandas as pd
import numpy as np
df = pd.DataFrame(data)

In [None]:
#Let's look at the data
display(df)

In [None]:
#Those 'b' prefixes are Python's way of displaying a bytes object and are not part of the data.
#A bytes object is literally a sequence of octets; here the octets encode ASCII characters.
#They will appear only in nominal features. The numeric features "age" and "result" will not have them.
#You can decode them using .decode("utf-8")

#Let's decode only the nominal features
for i in range(len(meta.types())):
    if meta.types()[i] == 'nominal':
        df[meta.names()[i]] = df[meta.names()[i]].str.decode('utf-8')    


In [None]:
#let's look at the data now
display(df)

## Heal annotations

Recall the 'Self' vs 'self' typo for the 'relation' variable.
One easy fix is to convert all strings to lowercase.

In [None]:
# convert all nominal variable string elements to lowercase.
for i in range(len(meta.types())):
    if meta.types()[i] == 'nominal':
        df[meta.names()[i]] = df[meta.names()[i]].str.lower()    

#let's look at the data now
display(df)
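
Lowercasing fixes case typos, but hand-entered categories often carry stray whitespace as well. The following is a minimal sketch on a made-up series (the labels are illustrative and not taken from the autism data) that lists the distinct labels before and after cleaning, which is a quick way to spot inconsistent annotations.

```python
import pandas as pd

# hypothetical hand-entered labels with case and whitespace inconsistencies
relation = pd.Series(["Self", "self", "Parent", " parent", "Relative"])

# distinct labels before cleaning
print(sorted(relation.unique()))

# strip surrounding whitespace, then convert to lowercase
cleaned = relation.str.strip().str.lower()

# distinct labels after cleaning
print(sorted(cleaned.unique()))
```

If the list of distinct labels is still longer than expected after this step, the remaining differences are usually genuine spelling errors that need to be mapped by hand.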

## Treat Missing

Treating missing values generally falls into one of two strategies:
1. A mask that globally indicates missing values.
2. A sentinel value that indicates a missing entry.

In the **mask** approach, a separate Boolean array of the same size marks which entries are missing; alternatively, one bit of each entry's representation can be reserved to flag a missing value. Allocating the additional Boolean array adds memory and computational overhead.

In the **sentinel** value approach, a tag value is used for indicating the missing value, such as NaN (Not a Number), null or a special value which is part of the programming language. This limits the range of valid values that can be represented and may require extra logic in CPU and GPU arithmetic.
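
To make the two strategies concrete, here is a small sketch using NumPy only. The array values are made up for illustration; `np.ma.masked_array` stands in for the mask approach and NaN for the sentinel approach.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
is_missing = np.array([False, True, False, False])

# mask approach: a separate Boolean array marks the missing entry
masked = np.ma.masked_array(values, mask=is_missing)
mask_mean = masked.mean()             # mean over the unmasked entries 1, 3 and 4

# sentinel approach: the missing entry itself is overwritten with NaN
sentinel = values.copy()
sentinel[is_missing] = np.nan
sentinel_mean = np.nanmean(sentinel)  # NaN-aware mean over the same entries

print(mask_mean, sentinel_mean)
```

Both computations ignore the second entry; the difference is where the "missing" information lives: in an extra array, or inside the data itself.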

Here we will use the sentinel approach; later we will use masks for image processing in the computer vision project. Pandas uses sentinels to handle missing values, specifically two already-existing Python null values:
+ the python None object
+ the floating-point NaN value

None is a Python object that is often used to mark missing data. Because it is a Python object, None cannot be stored in an arbitrary NumPy/Pandas array, only in arrays with data type ‘object’. Notice the dtype of the following NumPy array of integers created with a None element.

In [None]:
arr1 = np.array([[1,2,3],
                [4,None,6]])
arr1


Because the array is of type object, aggregations like sum() or min() will generally raise an error.

In [None]:
#uncomment to show error
#np.sum(arr1, axis = 0)

NaN is an acronym for 'Not a Number' and is a special floating-point value in the standard IEEE floating-point representation. Notice the type of the array if you initialize it with a NaN instead of None.

In [None]:
arr2 = np.array([[1,2,3],
                [4,np.nan,6]])
arr2.dtype


NaN is specifically a floating-point value; there is no such thing for integers, strings or other types. Also, be aware that the result of any arithmetic operation involving NaN is another NaN. For example,

In [None]:
#aggregating arrays with np.nan elements
np.sum(arr2, axis = 0)
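
When the NaN should be skipped rather than propagated, NumPy provides NaN-aware counterparts of the common aggregations. A short sketch, re-creating the same small array so the cell runs on its own:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, np.nan, 6]])

# ordinary sum: the NaN propagates into the second column
col_sums = np.sum(arr, axis=0)

# NaN-aware sum: NaN entries are treated as zero
col_nansums = np.nansum(arr, axis=0)

# NaN is not equal to itself, so test for it with np.isnan instead of ==
nan_equals_itself = np.nan == np.nan

print(col_sums, col_nansums, nan_equals_itself)
```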

Pandas is built to handle the None and NaN nearly interchangeably, converting between them where appropriate. For types that don’t have an available sentinel value, Pandas automatically type-casts when NaN values are present. Notice the difference in type in the following pandas series. 

In [None]:
iseries = pd.Series([0, 1, 2])
aseries = pd.Series(["a", "b", "c"])
display(iseries)
display(aseries)

In [None]:
iseries[1]= None
aseries[1]= None
display(iseries)
display(aseries)

In [None]:
iseries[2]= np.nan
aseries[2]= np.nan
display(iseries)
display(aseries)

Often missing values don't come in nice and clean as None or np.nan. If we know which characters are used to mark missing values in the dataset, we can handle them while creating the DataFrame using the na_values parameter: `df = pd.read_csv("source.csv", na_values = ['?', '&'])`
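
The effect of na_values can be sketched without a file on disk by reading from an in-memory string. The CSV content and column names below are made up for illustration.

```python
import io
import pandas as pd

# hypothetical CSV content in which '?' and '&' mark missing entries
csv_text = "age,relation\n4,parent\n?,self\n11,&\n"

# without na_values the markers are kept as ordinary strings
raw = pd.read_csv(io.StringIO(csv_text))

# with na_values the markers are converted to missing values on load
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?", "&"])

print(raw.dtypes)            # 'age' is not numeric because of the '?'
print(clean.dtypes)          # 'age' is now a float column
print(clean.isnull().sum())  # one missing value in each column
```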

However, since we already have the data loaded, we will have to edit the dataframe ourselves. In our case we need to replace the '?' characters used to indicate missing data in the arff file with None. Pandas will automatically type-cast to NaNs for the numeric variables while keeping None for the categorical variables. This can be accomplished with the pandas replace() function.

In [None]:
#replace all "?" with None objects
df = df.replace({ "?": None })   

In [None]:
#Now show summary stats for each feature
for i in range(len(meta.types())):
    display("--------------")
    display(df[[meta.names()[i]]].describe())
    display(df[[meta.names()[i]]].value_counts()) 

### Counting the missing values
Pandas has two useful methods for detecting missing values: isnull() and notnull(). Either one returns a Boolean mask over the data. For example, df.isnull() returns a same-sized Boolean DataFrame indicating which values are missing.

In [None]:
#display the sum of missing values for each feature
display(df.isnull().sum())
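
Because the mask is just a Boolean DataFrame, it can be aggregated in other ways too. The sketch below uses a small made-up frame (not the autism data) to compute the fraction of missing values per column and to select the fully observed rows.

```python
import numpy as np
import pandas as pd

# toy frame with missing values in a numeric and a nominal column
toy = pd.DataFrame({"age": [4.0, np.nan, 11.0, 7.0],
                    "relation": ["parent", None, "self", None]})

missing_mask = toy.isnull()                 # True where a value is missing
missing_fraction = missing_mask.mean()      # fraction missing per column
complete_rows = toy.notnull().all(axis=1)   # True for rows with no missing values

print(missing_mask)
print(missing_fraction)
print(toy[complete_rows])
```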

### Brutal approach
You can use dropna() to remove missing values. You can specify to drop either rows or columns with missing data by specifying the axis.

In [None]:
display(df.shape)
display(df.dropna(axis=0, inplace=False).shape)
display(df.dropna(axis=1, inplace=False).shape)
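
dropna() can also be made less brutal with its subset and thresh arguments. A minimal sketch on a made-up frame (column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"age": [4.0, np.nan, 11.0, 7.0],
                    "gender": ["m", "f", None, "m"],
                    "relation": [None, None, None, "self"]})

# keep only the rows with no missing values at all
rows_all = toy.dropna(axis=0)

# keep the rows where 'age' is observed, whatever the other columns hold
rows_subset = toy.dropna(axis=0, subset=["age"])

# keep the columns that have at least 3 observed values
cols_thresh = toy.dropna(axis=1, thresh=3)

print(rows_all.shape, rows_subset.shape, cols_thresh.shape)
```

Restricting the drop to the features you actually need usually preserves many more observations than dropping on every column.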

### OK approach
You can use fillna() to fill in missing values. The main options are:
+ value: the value used to replace NaN.
+ forward or backward replacement: ffill() propagates the previous valid value forward and bfill() propagates the next valid value backward. Older code passes method='ffill' or method='bfill' to fillna(), which recent pandas versions deprecate.

In [None]:
# replaces missing values with values in the previous row
display(df.ffill(axis=0).shape)
# replaces missing values with a constant
display(df.fillna(value=0).shape)

### Better approach
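The other common strategy is to replace missing values with the median (or mean) for continuous variables and with the mode, the most frequent value, for categorical ones. A median is not defined for unordered string categories, so we treat the two kinds of feature separately.

In [None]:
# median for numeric features, then mode (most frequent value) for the nominal ones
df = df.fillna(df.median(numeric_only=True)).fillna(df.mode().iloc[0])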

## Remove uninformative variables
 If all observations have the same value for a particular feature then that feature cannot help discriminate among the observations. These variables should be removed. In later modules we will elaborate on this idea of informative variables when we discuss feature selection.

In [None]:
#print the number of unique values per feature
for i in range(len(meta.names())):
    display(meta.names()[i] +" has "+str(df[meta.names()[i]].nunique()))

Looks like 'age_desc' has only one value. Let's remove it.

In [None]:
#remove variables with one unique value
for col in df.columns:
    if df[col].nunique() == 1:
        #drop the column
        print(col)
        df.drop([col], axis=1, inplace=True)

#display the remaining columns
display(df.columns)

Finally, we should remove the 'result' variable provided by the authors. It is the authors' own score, so keeping it as a predictor would leak the outcome we are trying to predict.

**Challenge: Remove the 'result' column from the dataframe.**

In [None]:
#write code here

#this line should confirm the column has been removed 
display(df.columns)

## Generating a Table 1
Often in a biomedical paper the first table is a summary of the study population characteristics. Not only will your reviewers expect it, but generating it is also a useful way to distill your understanding of the problem you propose to tackle. Fortunately there are many packages out there that will generate a table 1 for you. Here we will use the cleverly named 'tableone' package and stratify our summary based on the autism outcome.

In [None]:
from tableone import TableOne
TableOne(df, groupby="Class/ASD", pval=True, dip_test=True, normal_test=True, tukey_test=True)

### Exploring the warnings
The chi-squared test is used to see whether the distributions of categorical variables differ from one another. Looks like the statistics for the country, relation, and ethnicity variables are poor, so we cannot say one way or another whether the distributions are different.

The normality test tells you which variables are not normally distributed. For these variables, one would often report the median and interquartile range instead of the mean and standard deviation.

Tukey's rule is a method to find outliers among continuous variables. None were found here, but outliers can be visualized in a box plot. To demonstrate, we'll create box plots for age.

Hartigan's Dip Test is a test for multimodality. The test has suggested that the age distribution may be multimodal. We'll plot the distributions too and have a look.
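
Tukey's rule mentioned above flags values that lie more than k interquartile ranges below the first quartile or above the third quartile. The sketch below is a plain NumPy illustration of that rule, not the implementation used by tableone; the helper name `tukey_outliers` and the ages are made up, and the input is assumed to contain no missing values.

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# made-up ages with one implausible entry
ages = [4, 5, 5, 6, 7, 8, 9, 10, 11, 40]

print(tukey_outliers(ages))       # conventional fences, k = 1.5
print(tukey_outliers(ages, k=3))  # wider fences, k = 3
```

The box plot in the next cells uses whis=3, which corresponds to the wider k = 3 fences.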

In [None]:
import matplotlib.pyplot as plt

In [None]:
#box plots
df[['age']].boxplot(whis=3)
plt.show()

In [None]:
#distributions
df[['age']].dropna().plot.kde(figsize=[12,8])
#only the age density is plotted, so the legend needs a single label
plt.legend(['Age (years)'])
plt.xlim([0,15])
plt.show()

When a distribution has distinct humps like this, one might consider binning the values to convert the variable into a categorical feature.

**Challenge - Bin the age variable into the categories 'child' and 'adolescent' with a cutoff at 8 years old and regenerate your Table 1**