# 1 Getting Familiar with Raw Data

## Preparing Jupyter Environment
Usually, the first step is to prepare the Jupyter notebook environment for the code that follows. This preparation may include importing the necessary packages and libraries, setting graphical options or cell behaviour, or anything else the coding procedure needs.

In [1]:
# The following setting makes a cell display the output of every expression,
# not only the last one:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd

## Loading Data

In [2]:
path = "/home/damoon/Dropbox/McGill Teaching/data/UCI/pima_indian_diabetes/"
filename = "diabetes.csv"
data = pd.read_csv(path+filename)

## First Look

In [5]:
# print(data.head(20))
data.head(10)

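| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |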


## Data Dimension
Knowing the dimensions of a dataset is useful for several reasons. You may want to check them before taking further steps:

 + To find out whether you have too many rows, since some algorithms may take too long to train;
 + To see whether you have enough observations to train a particular model; or
 + To check the number of features, since too many features may worsen the performance of some algorithms. This problem is called *the curse of dimensionality*.

In [6]:
data.shape

(768, 9)
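
The `shape` attribute is a `(rows, columns)` tuple, so it can be unpacked directly when the two counts are needed separately. The following is a minimal sketch on a hypothetical toy frame (`toy` is not part of the notebook; its values are copied from the first three rows shown earlier):

```python
import pandas as pd

# Hypothetical toy frame standing in for `data` (three rows, three columns).
toy = pd.DataFrame({"preg": [6, 1, 8], "plas": [148, 85, 183], "class": [1, 0, 1]})

# shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = toy.shape
print(f"{n_rows} observations, {n_cols} attributes")
```

Applied to `data`, the same unpacking would give 768 observations and 9 attributes, matching the output above.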

## Type of Attributes

In [7]:
data.dtypes

preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object

# 2 Descriptive Statistics
Descriptive statistics are a useful tool for getting a sense of the data at hand. Understanding the data better in turn leads to more reasonable modelling.


## 2.1 Common Statistics
Often you can create more summaries than you have time to review. The `describe()` method of the **Pandas DataFrame** lists eight statistical properties of each attribute:

+ Count
+ Mean
+ Standard Deviation
+ Minimum Value
+ 25th Percentile
+ 50th Percentile (Median)
+ 75th Percentile
+ Maximum Value

In [11]:
pd.set_option('display.width', 100)
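# Use the full option name; the short form 'precision' is rejected by recent pandas versions.
pd.set_option('display.precision', 2)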
data.describe()

| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.85 | 120.89 | 69.11 | 20.54 | 79.8 | 31.99 | 0.47 | 33.24 | 0.35 |
| std | 3.37 | 31.97 | 19.36 | 15.95 | 115.24 | 7.88 | 0.33 | 11.76 | 0.48 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.08 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.37 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.63 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


Note the calls to `pandas.set_option()`: they change the precision of the displayed numbers and the preferred width of the output, which makes the output more readable for this example.
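
Options set this way stay in effect for the rest of the session. If the change should apply to a single summary only, one possible alternative is `pd.option_context`, which restores the previous values when the `with` block ends. The sketch below assumes the full option names `display.width` and `display.precision` and uses a hypothetical toy frame rather than `data`:

```python
import pandas as pd

# Hypothetical toy frame; values copied from the first three rows of `mass` and `pedi`.
toy = pd.DataFrame({"mass": [33.6, 26.6, 23.3], "pedi": [0.627, 0.351, 0.672]})

precision_before = pd.get_option("display.precision")

# The options are changed only inside the with-block.
with pd.option_context("display.width", 100, "display.precision", 2):
    summary_text = str(toy.describe())
    print(summary_text)

# Outside the block the earlier setting is back in place.
precision_after = pd.get_option("display.precision")
```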

Also, in classification problems you may want to know how balanced the class values are. This can be checked as follows:

In [12]:
data.groupby('class').size()

class
0    500
1    268
dtype: int64
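
Raw counts are easier to compare across datasets when expressed as proportions. A minimal sketch, assuming a hypothetical series `labels` that reproduces the 500/268 split shown above, uses `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical label series with the same class counts as the output above.
labels = pd.Series([0] * 500 + [1] * 268, name="class")

# normalize=True returns the fraction of observations in each class.
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```

On the real data, `data['class'].value_counts(normalize=True)` would be the equivalent call.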

## 2.2 Correlation Between Attributes
Correlation refers to the relationship between two variables and how they may or may not change together.

The most common method for calculating correlation is *Pearson’s Correlation Coefficient*. Some machine learning algorithms, such as *linear* and *logistic regression*, can perform poorly if the dataset contains highly correlated attributes. As such, it is a good idea to review all of the pairwise correlations between the attributes in your dataset. You can use the `corr()` method of the Pandas DataFrame to calculate a correlation matrix.

In [13]:
data.corr(method='pearson')

| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| preg | 1.0 | 0.13 | 0.14 | -0.08 | -0.07 | 0.02 | -0.03 | 0.54 | 0.22 |
| plas | 0.13 | 1.0 | 0.15 | 0.06 | 0.33 | 0.22 | 0.14 | 0.26 | 0.47 |
| pres | 0.14 | 0.15 | 1.0 | 0.21 | 0.09 | 0.28 | 0.04 | 0.24 | 0.07 |
| skin | -0.08 | 0.06 | 0.21 | 1.0 | 0.44 | 0.39 | 0.18 | -0.11 | 0.07 |
| test | -0.07 | 0.33 | 0.09 | 0.44 | 1.0 | 0.2 | 0.19 | -0.04 | 0.13 |
| mass | 0.02 | 0.22 | 0.28 | 0.39 | 0.2 | 1.0 | 0.14 | 0.04 | 0.29 |
| pedi | -0.03 | 0.14 | 0.04 | 0.18 | 0.19 | 0.14 | 1.0 | 0.03 | 0.17 |
| age | 0.54 | 0.26 | 0.24 | -0.11 | -0.04 | 0.04 | 0.03 | 1.0 | 0.24 |
| class | 0.22 | 0.47 | 0.07 | 0.07 | 0.13 | 0.29 | 0.17 | 0.24 | 1.0 |
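
Reading a full correlation matrix cell by cell gets tedious as the number of attributes grows. One way to review the pairwise correlations is to keep only the pairs whose absolute coefficient exceeds a cutoff. The sketch below is an illustration, not part of the original notebook: the toy frame, its column names, and the threshold of 0.8 are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: `b` is roughly 2 * `a`, `c` is only weakly related to either.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [3.0, 5.0, 1.0, 4.0, 2.0],
})

corr = toy.corr(method="pearson")

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Arbitrary cutoff for this example; masked (NaN) entries never pass the comparison.
threshold = 0.8
strong = pairs[pairs.abs() > threshold]
print(strong)
```

The result is a Series indexed by attribute pairs, which is easier to scan than the full matrix.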


## 2.3 Skewness

In [14]:
data.skew()

preg     0.90
plas     0.17
pres    -1.84
skin     0.11
test     2.27
mass    -0.43
pedi     1.92
age      1.13
class    0.64
dtype: float64
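
When reading these values, the sign indicates the direction of the longer tail: a positive skew means a longer right tail, a negative skew a longer left tail, and values near zero suggest a roughly symmetric distribution. A small sketch on two hypothetical series (not taken from the dataset) illustrates the sign convention:

```python
import pandas as pd

# Hypothetical sample with one large value, giving a long right tail.
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 10])
# Mirroring the sample flips the tail to the left.
left_tailed = -right_tailed

right_skew = right_tailed.skew()
left_skew = left_tailed.skew()
print(right_skew, left_skew)
```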