# Tutorial 2 - Data exploration & preprocessing

In today's tutorial we will focus on a core skill of data science: data exploration and preprocessing. \
The most important thing to do before any machine learning task is to look at your dataset. After doing so, you should be able to answer the following questions:
* Do the values in your dataset make sense to you?
* How many missing values does your dataset contain?
* Which variables are relevant to your task?
* Does your data include correlated variables?
* Is your data "clean"? Should you filter it?
* Should you reorganize it so that it is easy to access later on?
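
As a rough illustration, several of the questions above map onto one-line `pandas` calls. This is only a sketch: the `toy` table and its column names are made up for the example and are not the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the tutorial's dataset)
toy = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 180.0, 175.0],
    "weight_kg": [70.0, 60.0, 65.0, np.nan, 80.0],
    "label": [0, 1, 0, 1, 0],
})

missing_per_column = toy.isna().sum()  # how many missing values per variable?
value_ranges = toy.describe()          # do the value ranges make sense?
correlations = toy.corr()              # are any variables correlated?

print(missing_per_column)
```

The same calls are applied to the real dataset later in this tutorial.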

One of the most widely used packages in Python is `pandas`. This tutorial will cover some of its best-known commands. Later on, we will also use `scikit-learn`, which one might consider the most useful package for basic machine learning tasks.

Before we go on, make sure you have updated our environment `bm-336546` with the `tutorial2.yml` file, as explained in the previous tutorial. Once you are all set, we can move on.
```shell
conda env update --name bm-336546 --file tutorial2.yml
```

# Interfacing with PyCharm
Before we continue with this tutorial, we should all become familiar with `PyCharm`, one of the best Python IDEs. In `PyCharm`, we can debug our code and create projects that contain many `.py` files running in the same virtual environment. `PyCharm` also helps us structure our code correctly, and it even has a spell checker for your comments :). Furthermore, `PyCharm` can interact with remote servers.

In general, the professional version of `PyCharm` has a `Jupyter` [editor](https://www.jetbrains.com/help/pycharm/jupyter-notebook-support.html#ui). Here, however, we will see how to convert `.ipynb` files to `.py` files. This can be done easily with `jupytext`, a package installed in your `bm-336546` environment. The `jupytext` package can also convert `.py` files back to `.ipynb` once they have been edited in `PyCharm`.

Open the Anaconda Prompt and `cd` to the directory that contains your notebook. You can then perform one of the following three operations:
```shell
jupytext --to py notebook.ipynb                 # convert notebook.ipynb to a .py file

jupytext --to notebook notebook.py              # convert notebook.py to an .ipynb file with no outputs

jupytext --to notebook --execute notebook.py    # convert notebook.py to an .ipynb file and run it 
```

# Medical topic
Diabetes mellitus affects hundreds of millions of people around the world and can cause many complications if it is not diagnosed early and treated properly. Diabetes can be predicted in advance from a set of medical explanatory variables. In our case we will use data from a study of the Pima Indian population near Phoenix, Arizona. All of the patients were women above the age of 21. The population has been under continuous study since 1965 by the National Institute of Diabetes and Digestive and Kidney Diseases because of its high incidence rate of diabetes. In this tutorial we will focus only on the data exploration part.

## Dataset
The following features have been provided to help us predict whether a person is diabetic or not:
* Pregnancies: Number of times each woman was pregnant.
* Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test $[mg/dl]$.
* BloodPressure: Diastolic blood pressure in $[mmHg]$.
* SkinThickness: Triceps skin fold thickness in $[mm]$.
* Insulin: Insulin serum over 2 hours in an oral glucose tolerance test $[\mu U/ml]$.
* BMI: Body mass index in $[ kg / m ^ 2 ]$.
* DiabetesPedigreeFunction: A function which scores likelihood of diabetes based on family history.
* Age: Age in $[years]$.
* Outcome: Class variable (0 if non-diabetic, 1 if diabetic).

Credit: The data was imported from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

import numpy as np
import pandas as pd
from sklearn import preprocessing
from pathlib import Path


# to make this notebook's output stable across runs
np.random.seed(42)

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

from sklearn.impute import SimpleImputer 
from pandas.plotting import scatter_matrix


`pandas` can load many types of files into a table called a `DataFrame`. Each column of this table is a `Series`.
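
To make the `DataFrame` / `Series` distinction concrete, here is a minimal sketch on a made-up table (the values are for illustration only):

```python
import pandas as pd

# Hypothetical toy table (values are made up)
toy = pd.DataFrame({"Glucose": [148, 85, 183], "Age": [50, 31, 32]})

column = toy["Glucose"]       # single brackets return a Series
sub_table = toy[["Glucose"]]  # double brackets return a DataFrame

print(type(column).__name__)     # Series
print(type(sub_table).__name__)  # DataFrame
```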

In [None]:
col_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'Skin Thickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv("Data/PimaDiabetes.csv", header=None, names=col_names)

Here are some of the most useful commands using `pandas`.

In [None]:
df.info() # general information: number of samples, column types and non-null counts

In [None]:
df.head()  # print the first 5 observations

In [None]:
# print 5 random observations; random_state makes the sample reproducible
df.sample(n=5, random_state=1)

In [None]:
df.describe() # print summary statistics of variables

---
<span style="color:red">***Question:***</span> *In which cases do we only need the **mean** and the **std** for distribution estimation and in which cases do we need the whole **summary statistics**?*

---

### Ways of subsetting a dataframe

In [None]:
df[1:5] # Notice it does not include the last element

In [None]:
G = df[['Glucose', 'Insulin']]  # double brackets for column access (i.e. G is also a dataframe)

In [None]:
G['Glucose']

In [None]:
df[['Glucose', 'Insulin']][1:5]  # Double brackets for column access and additional outer brackets for observations

In [None]:
df.loc[1:5, ['Glucose', 'Insulin']] # loc selects by label: row labels and column names (here the last element is included!)

In [None]:
df.iloc[1:5, 1:3]  # iloc selects by integer position (the last element is not included!)

`loc` and `iloc` should be used carefully. Basically, `loc` selects by label (column names and row labels), whereas `iloc` selects by integer position. For more information, see the explanations [here](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c) and [here](https://datacarpentry.org/python-ecology-lesson/03-index-slice-subset/index.html).
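
The difference is easiest to see when the row labels are not simply 0, 1, 2, and so on. The following sketch uses a made-up table with an arbitrary index chosen for illustration:

```python
import pandas as pd

# Hypothetical toy table whose row labels are not 0, 1, 2, ...
toy = pd.DataFrame(
    {"Glucose": [148, 85, 183, 89], "Insulin": [0, 0, 0, 94]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[10:30, "Glucose"]  # labels 10 to 30, end label included
by_position = toy.iloc[0:2, 0]        # positions 0 and 1, end position excluded

print(by_label.tolist())     # [148, 85, 183]
print(by_position.tolist())  # [148, 85]
```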

We would now like to examine the distribution of our data:

In [None]:
%matplotlib inline

axarr = df.hist(bins=50, figsize=(20, 15))  # histograms of dataframe variables
xlbl = ['# of pregnancies [N.u]', 'Glucose [mg/dl]', 'Blood Pressure [mmHg]','Skin Thickness [mm]', 
        'Insulin [uU/ml]','BMI [Kg/m^2]','DPF [N.U]', 'Age [years]', 'Diabetes [binary]' ]

for idx, ax in enumerate(axarr.flatten()):
    ax.set_xlabel(xlbl[idx])
    ax.set_ylabel("Count")
    
plt.show()

We can see from these histograms that some of the values are physiologically impossible, such as the 0 values in BMI, insulin, skin thickness and blood pressure.\
First, we'll replace these values with `nan`.

*All of the operations will be applied on a copy of the dataframe called* `df_nan`.

In [None]:
df_nan = df.copy()
df_nan.loc[:, 'Glucose':'BMI'] = df_nan.loc[:, 'Glucose':'BMI'].replace(0, np.nan) # replace the non-realistic (0 in our case) values with nan
df_nan.isna().sum()/df.shape[0] # fraction of replaced values

Most of the missing data are in the variables *insulin* and *skin thickness*. There are several ways to handle missing values. Here are some examples:
* The variable's missing values can be imputed with a single value (the median, for instance).
* They can be replaced by randomly picked values from the distribution of the rest of the data.
* The probability density function (pdf) can be estimated from the histogram of the variable's values, and the missing values replaced by values sampled from that pdf.
* The missing values can be replaced by random values drawn from the variable's observed values.
* The whole variable can be eliminated when it does not have a sufficient number of samples.

For more options, visit [this site](https://hrngok.github.io/posts/missing%20values/).
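
As an example, the histogram-based option from the list above could be sketched as follows. This is only one possible way to do it, under our own assumptions: the `values` series is made up, the number of bins is arbitrary, and samples are drawn uniformly within each bin.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical variable with missing entries (values are made up)
values = pd.Series([10.0, 12.0, np.nan, 11.0, 30.0, np.nan, 13.0, 12.5])
observed = values.dropna().to_numpy()

# Estimate a discrete pdf from the histogram of the observed values
counts, edges = np.histogram(observed, bins=4)
probabilities = counts / counts.sum()

# Pick a bin for each missing entry, then a uniform value inside that bin
n_missing = int(values.isna().sum())
bins = rng.choice(len(counts), size=n_missing, p=probabilities)
samples = rng.uniform(edges[bins], edges[bins + 1])

imputed = values.copy()
imputed.loc[imputed.isna()] = samples
print(imputed)
```

Empty bins have zero probability, so sampled values always fall in regions where data were actually observed.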

Here, we will show only median imputation, implemented in two different ways, on the relevant variables.

In [None]:
df_nan.head()

In [None]:
df_nan.loc[:, 'Glucose':'BMI'] = df_nan.loc[:, 'Glucose':'BMI'].fillna(df_nan.median())  # method 1
df_nan.head()

In [None]:
# create df_nan again for second method
df_nan = df.copy()
df_nan.loc[:, 'Glucose':'BMI'] = df_nan.loc[:, 'Glucose':'BMI'].replace(0, np.nan) # replace the non-realistic results with nan
df_nan.head()

In [None]:
imputer = SimpleImputer(strategy="median") # method 2, mostly preferred due to its generalized form
p = imputer.fit(df_nan)
X = imputer.transform(df_nan)
df1 = pd.DataFrame(X, columns=df_nan.columns)  # construct X object as Dataframe
df1.head()

In [None]:
axarr = df1.hist(bins=50, figsize=(20, 15)) # histograms of dataframe variables
for idx, ax in enumerate(axarr.flatten()):
    ax.set_xlabel(xlbl[idx])
    ax.set_ylabel("Count")
plt.show()

As expected, the median imputation did not "work well" for *insulin* and *skin thickness*: because these variables have many missing values, a large share of their samples now sit exactly at the median. 

- For the *insulin* values, it might be better to replace them with values drawn from the distribution of the observed values, which has a fairly low variance relative to its mean. 

- For the *skin thickness* variable, we can consider eliminating the whole variable if we assume that it does not affect the outcome. 

Either way, it is not right to simply "drop" the samples in which these variables are missing, because that would significantly reduce the amount of data. Dropping a feature, on the other hand, is reasonable. 
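
The trade-off between dropping samples and dropping a feature can be illustrated with a small made-up table (the column names `a`, `b` and `y` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: column "b" is mostly missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 5.0, np.nan],
    "y": [0, 1, 0, 1],
})

rows_dropped = toy.dropna()                # keeps only complete observations
feature_dropped = toy.drop(columns=["b"])  # keeps every observation

print(rows_dropped.shape)     # (1, 3)
print(feature_dropped.shape)  # (4, 2)
```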



***
Let's move forward and see how to apply a function to a `dataframe` variable. In our case we will replace `nan` with random values drawn from the distribution of the variable's observed values.

In order to do so, we will first apply median imputation to all of the variables other than *insulin* and *skin thickness*. \
We will then apply random sampling to the *insulin* values and "drop" the *skin thickness* variable. All of the operations will now be applied directly to the original `dataframe`.

In [None]:
df.loc[:, 'Glucose':'BMI'] = df.loc[:, 'Glucose':'BMI'].replace(0, np.nan) # replace the non-realistic results with nan
df.loc[:, ['Glucose', 'BloodPressure', 'BMI']] = df.loc[:, ['Glucose', 'BloodPressure', 'BMI']].fillna(df.median())  # apply a "known function" on selected variables # median imputation
df.drop(columns=['Skin Thickness'],inplace=True)
insulin_hist = df_nan.loc[:,'Insulin'].dropna()

In [None]:
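def rand_sampling(x, var_hist):
    if pd.isna(x):  # `x == np.nan` is always False, so NaN must be tested explicitly
        rand_idx = np.random.choice(len(var_hist))
        x = var_hist.iloc[rand_idx]  # positional access, since the index has gaps after dropna()
    return x

In [None]:
df['Insulin'] = df['Insulin'].apply(lambda x: rand_sampling(x, insulin_hist))  # assign the result back, otherwise df is unchanged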
xlbl.remove('Skin Thickness [mm]')
xlbl.append('')

axarr = df.hist(bins=50, figsize=(20, 15))
for idx, ax in enumerate(axarr.flatten()):
    ax.set_xlabel(xlbl[idx])
    ax.set_ylabel("Count")
plt.show()

Pay attention to the missing category and the difference in the insulin figure.
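
A note on detecting missing values: `nan` never compares equal to anything, including itself, so an equality test cannot be used to find it. A minimal sketch (the series `s` is made up for illustration):

```python
import numpy as np
import pandas as pd

x = np.nan

print(x == np.nan)  # False: NaN is not equal to anything, including itself
print(np.isnan(x))  # True
print(pd.isna(x))   # True

# Element-wise detection on a Series (values are made up)
s = pd.Series([1.0, np.nan, 3.0])
mask = s.isna()
print(mask.tolist())  # [False, True, False]
```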

In many tasks, we may find that we need to scale our data. Each task will likely require a specific kind of scaling. \
The scaling process will help us to correctly identify the variables that are most important for our task, regardless of their magnitudes. 

Here is an example of scaling your data using the mean and standard deviation.

In [None]:
scaled_features = preprocessing.StandardScaler().fit_transform(df.values)
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)

In [None]:
scaled_features_df.hist(bins=50, figsize=(20, 15)) 
plt.show()
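
For reference, scaling by the mean and standard deviation amounts to computing a z-score, $z = (x - \mu) / \sigma$, per variable. The sketch below does this by hand on a made-up table; it uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table (values are made up)
toy = pd.DataFrame({
    "Age": [21.0, 30.0, 45.0, 60.0],
    "BMI": [22.0, 28.5, 33.1, 40.2],
})

# z-score per column, using the population standard deviation (ddof=0)
z = (toy - toy.mean()) / toy.std(ddof=0)

print(z.round(3))
```

After this transformation every column has zero mean and unit standard deviation, so variables measured in different units become comparable.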

---
<span style="color:red">***Question:***</span> *Should we scale all of our data as we did?*

---

Now let's see if we can find correlations among selected variables. \
This can help us later on in choosing the most relevant variables with minimum redundancy.

In [None]:
attributes = ["Age", "BMI", "Glucose", "Pregnancies"]
scatter_matrix(df[attributes], figsize=(12, 8)) # correlation between chosen variables
plt.show()

Unfortunately, we cannot see any notable correlation, except between age and number of pregnancies, which was to be expected. 
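
A scatter matrix gives a visual impression only. A numeric complement is the correlation matrix returned by `DataFrame.corr()`. The sketch below runs it on synthetic data that we generate ourselves for illustration (columns `a`, `b`, `c` are hypothetical), not on the tutorial's dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical synthetic data: "b" is built from "a", "c" is independent noise
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))
```

Here `a` and `b` should show a correlation close to 1, while `c` should be only weakly correlated with either of them.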

Another important thing that we would like to check within our data is the prevalence of the outcome. \
Let's check the prevalence of diabetes in our dataset.

In [None]:
prc_diab = 100 * df['Outcome'].value_counts(normalize=True)  # normalize=True for percentage

print(r'%.2f%% of the Pima tribe women have diabetes.' % prc_diab[1])

Sometimes we would like to count values above or below a specific threshold to get a sense of the data. Then, we can check if those conditions have any impact on the outcome prevalence, or in other words, check if they are *predictive*.

In [None]:
val = df[df['Glucose'] > 150].shape[0]  # how many of the tribe women have glucose values higher than 150

print(r'%d women have glucose values higher than 150 [mg/dl].' % val)

In [None]:
selected_obs = df[(df['Glucose'] > 150) & (df['Insulin'] > 100)]  # Extract patients who have glucose values higher than 150 and insulin values higher than 100
val = 100 * selected_obs['Outcome'].value_counts(normalize=True)[1]  # percentage of the selected patients who have diabetes

print(r'Out of the women who have glucose values higher than 150[mg/dl] and insulin values higher than 100[uU/ml], %.2f%% have diabetes.' % val)

A noticeable change in the prevalence can be seen once we select women with high levels of insulin and glucose. 
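
The prevalence inside and outside a condition can also be compared in a single step by grouping on the condition itself. A minimal sketch on made-up data (the values and the group names `"high"` and `"not high"` are our own choices for illustration):

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Glucose": [90, 160, 170, 100, 155, 120],
    "Outcome": [0, 1, 1, 0, 0, 1],
})

# Label each observation according to the threshold condition
glucose_group = toy["Glucose"].gt(150).map({True: "high", False: "not high"})

# The mean of a 0/1 column is the fraction of positives in each group
prevalence = 100 * toy.groupby(glucose_group)["Outcome"].mean()

print(prevalence.round(2))
```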


The last thing that we will see in this tutorial is how to sort, group, filter and plot variables. Here are some examples:

In [None]:
df.sort_values(by='Pregnancies') # notice that the row labels keep their original values

In [None]:
df.groupby('Pregnancies').describe()  # summary statistics of subsets of women who had the same number of pregnancies

In [None]:
df.groupby('Pregnancies').describe()['Age'] # for a single variable

In [None]:
Preg_group = df.groupby('Pregnancies')

In [None]:
Preg_group.get_group(5)['Age'].shape  # how many women have had 5 pregnancies

In [None]:
Preg_group.filter(lambda x: len(x) > 24) # keep only the groups that have more than 24 samples

In [None]:
df.plot('Age', 'Glucose', kind='scatter') # scatter plot of two variables
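
When only a few statistics per group are needed, `describe()` can be replaced by more targeted calls. The sketch below shows `size`, `agg` and `filter` on a made-up table (values chosen for illustration only):

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Pregnancies": [0, 1, 1, 2, 2, 2],
    "Age": [22, 25, 31, 33, 40, 29],
})

grouped = toy.groupby("Pregnancies")

group_sizes = grouped.size()                       # number of observations per group
age_summary = grouped["Age"].agg(["mean", "max"])  # only the chosen statistics
large_groups = grouped.filter(lambda g: len(g) >= 2)  # keep groups with at least 2 rows

print(group_sizes)
print(age_summary)
```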


***
> ### <span style="color:#3385ff">In this tutorial we demonstrated some of the capabilities of `pandas`. We are sure that you will find the capabilities of `pandas` very useful in almost every task in data science. *See you next time!*</span>
***


#### *This tutorial was written by Moran Davoodi & Alon Begin with the assistance of Yuval Ben Sason & Kevin Kotzen*