# Exploring Data with Descriptive Statistics

You must understand your data in order to get better results. This notebook shows how you can use Python to better understand your data. There is no substitute for looking at the raw data: it can reveal **insights** that you cannot get any other way.

## Import Libraries 

In [1]:
import pandas as pd # for data manipulation
import numpy as np # for linear algebra 

# set options 
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)  # full option name; the short 'precision' is ambiguous in recent pandas

## Load Dataset 
- This example uses the heart disease dataset
- Data Source - [Kaggle](https://www.kaggle.com/ronitf/heart-disease-uci)

In [2]:
# load data 
df = pd.read_csv('../data/heart.csv')

## Peek at Your Data 

In [3]:
# examine first few rows 
df.head() 

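|   | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |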


In [4]:
# examine last few rows 
df.tail() 

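|   | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |
| 302 | 57 | 0 | 1 | 130 | 236 | 0 | 0 | 174 | 0 | 0.0 | 1 | 1 | 2 | 0 |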


In [5]:
# examine specific number of rows 
df.head(10)

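|   | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| 5 | 57 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 | 0.4 | 1 | 0 | 1 | 1 |
| 6 | 56 | 0 | 1 | 140 | 294 | 0 | 0 | 153 | 0 | 1.3 | 1 | 0 | 2 | 1 |
| 7 | 44 | 1 | 1 | 120 | 263 | 0 | 1 | 173 | 0 | 0.0 | 2 | 0 | 3 | 1 |
| 8 | 52 | 1 | 2 | 172 | 199 | 1 | 1 | 162 | 0 | 0.5 | 2 | 0 | 3 | 1 |
| 9 | 57 | 1 | 2 | 150 | 168 | 0 | 1 | 174 | 0 | 1.6 | 2 | 0 | 2 | 1 |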


In [12]:
# unique values of the target column
df.output.unique()

array([1, 0])

## Dimensions
- With too many rows, algorithms may take too long to train. With too few, you may
  not have enough data to train them.
- With too many features, some algorithms can be distracted or perform poorly due
  to the curse of dimensionality.

In [6]:
# check shape of data 
df.shape

(303, 14)

In [7]:
# no. of rows 
df.shape[0]

303

In [8]:
# no. of columns 
df.shape[1]

14

## Data Type For Each Attribute

In [9]:
# dtypes 
df.dtypes

age           int64
sex           int64
cp            int64
trtbps        int64
chol          int64
fbs           int64
restecg       int64
thalachh      int64
exng          int64
oldpeak     float64
slp           int64
caa           int64
thall         int64
output        int64
dtype: object

## Descriptive Statistics
- Count 
- Mean 
- Median 
- Standard Deviation
- Five Number Summary
    - Minimum 
    - 25th Percentile 
    - 50th Percentile 
    - 75th Percentile 
    - Maximum 
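
The statistics listed above can also be computed one at a time, which helps make clear what each one measures. The following is a minimal sketch on a small, made-up series (the values are hypothetical and are not taken from the heart disease dataset); it assumes NumPy's default linear interpolation for percentiles.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, NOT taken from the heart disease dataset
ages = pd.Series([29, 41, 47, 55, 57, 61, 63, 77], name="age")

# Count, mean, median, and sample standard deviation (pandas uses ddof=1)
count = ages.count()
mean = ages.mean()
median = ages.median()
std = ages.std()

# Five-number summary: minimum, 25th, 50th, 75th percentile, maximum
five_number = np.percentile(ages, [0, 25, 50, 75, 100])

print("count:", count, "mean:", mean, "median:", median, "std:", std)
print("five-number summary:", five_number)
```

The 50th percentile and the median are the same quantity, so the two values printed for them should agree.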


In [10]:
# summary stats 
df.describe() 

|   | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366 | 0.683 | 0.967 | 131.624 | 246.264 | 0.149 | 0.528 | 149.647 | 0.327 | 1.04 | 1.399 | 0.729 | 2.314 | 0.545 |
| std | 9.082 | 0.466 | 1.032 | 17.538 | 51.831 | 0.356 | 0.526 | 22.905 | 0.47 | 1.161 | 0.616 | 1.023 | 0.612 | 0.499 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


In [11]:
# transpose summary stats 
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 303.0 | 54.366 | 9.082 | 29.0 | 47.5 | 55.0 | 61.0 | 77.0 |
| sex | 303.0 | 0.683 | 0.466 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| cp | 303.0 | 0.967 | 1.032 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| trtbps | 303.0 | 131.624 | 17.538 | 94.0 | 120.0 | 130.0 | 140.0 | 200.0 |
| chol | 303.0 | 246.264 | 51.831 | 126.0 | 211.0 | 240.0 | 274.5 | 564.0 |
| fbs | 303.0 | 0.149 | 0.356 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| restecg | 303.0 | 0.528 | 0.526 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| thalachh | 303.0 | 149.647 | 22.905 | 71.0 | 133.5 | 153.0 | 166.0 | 202.0 |
| exng | 303.0 | 0.327 | 0.47 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| oldpeak | 303.0 | 1.04 | 1.161 | 0.0 | 0.0 | 0.8 | 1.6 | 6.2 |


## Class Distribution (Classification Only)

In [14]:
# class distribution 
df['output'].value_counts() 

1    165
0    138
Name: output, dtype: int64

In [16]:
# class distribution as proportions
df['output'].value_counts(normalize=True) 

1    0.545
0    0.455
Name: output, dtype: float64

In [17]:
# class distribution using group by 
df.groupby('output').size() 

output
0    138
1    165
dtype: int64

## Correlations Between Attributes
- Correlation refers to the relationship between two variables and how they may or may not
  change together.
- The most common method for calculating correlation is **Pearson’s** Correlation
  Coefficient, which assumes a normal distribution of the attributes involved.
- A correlation of -1 or 1 indicates a perfect negative or positive correlation, respectively,
  whereas a value of 0 indicates no linear correlation at all.
- Some machine learning algorithms, such as linear and logistic regression, can suffer
  **poor** performance if there are **highly** correlated attributes in your dataset.
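
To make the definition concrete, the sketch below computes Pearson's coefficient by hand and then lists attribute pairs whose absolute correlation exceeds a cut-off. It is only an illustration: the toy data, the column names `x`, `y`, `z`, the helper `pearson_r`, and the 0.9 threshold are all assumptions made for this example, not part of the heart disease dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises almost linearly with x
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],  # tends to fall as x rises
})

def pearson_r(a: pd.Series, b: pd.Series) -> float:
    """Pearson's r: co-variation of a and b, scaled by their individual spreads."""
    a_dev = a - a.mean()
    b_dev = b - b.mean()
    return float((a_dev * b_dev).sum() / np.sqrt((a_dev ** 2).sum() * (b_dev ** 2).sum()))

corr = toy.corr(method="pearson")

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

threshold = 0.9  # arbitrary cut-off chosen for this sketch
high = pairs[pairs.abs() > threshold]

print("x vs z by hand:", pearson_r(toy["x"], toy["z"]))
print(high)
```

Pairs flagged this way are candidates for closer inspection before fitting a model that is sensitive to correlated inputs; what counts as "highly" correlated depends on the problem.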

In [19]:
# correlation: Pearson’s by default 
df.corr(method='pearson')

|   | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.0 | -0.098 | -0.069 | 0.279 | 0.214 | 0.121 | -0.116 | -0.399 | 0.097 | 0.21 | -0.169 | 0.276 | 0.068 | -0.225 |
| sex | -0.098 | 1.0 | -0.049 | -0.057 | -0.198 | 0.045 | -0.058 | -0.044 | 0.142 | 0.096 | -0.031 | 0.118 | 0.21 | -0.281 |
| cp | -0.069 | -0.049 | 1.0 | 0.048 | -0.077 | 0.094 | 0.044 | 0.296 | -0.394 | -0.149 | 0.12 | -0.181 | -0.162 | 0.434 |
| trtbps | 0.279 | -0.057 | 0.048 | 1.0 | 0.123 | 0.178 | -0.114 | -0.047 | 0.068 | 0.193 | -0.121 | 0.101 | 0.062 | -0.145 |
| chol | 0.214 | -0.198 | -0.077 | 0.123 | 1.0 | 0.013 | -0.151 | -0.01 | 0.067 | 0.054 | -0.004 | 0.071 | 0.099 | -0.085 |
| fbs | 0.121 | 0.045 | 0.094 | 0.178 | 0.013 | 1.0 | -0.084 | -0.009 | 0.026 | 0.006 | -0.06 | 0.138 | -0.032 | -0.028 |
| restecg | -0.116 | -0.058 | 0.044 | -0.114 | -0.151 | -0.084 | 1.0 | 0.044 | -0.071 | -0.059 | 0.093 | -0.072 | -0.012 | 0.137 |
| thalachh | -0.399 | -0.044 | 0.296 | -0.047 | -0.01 | -0.009 | 0.044 | 1.0 | -0.379 | -0.344 | 0.387 | -0.213 | -0.096 | 0.422 |
| exng | 0.097 | 0.142 | -0.394 | 0.068 | 0.067 | 0.026 | -0.071 | -0.379 | 1.0 | 0.288 | -0.258 | 0.116 | 0.207 | -0.437 |
| oldpeak | 0.21 | 0.096 | -0.149 | 0.193 | 0.054 | 0.006 | -0.059 | -0.344 | 0.288 | 1.0 | -0.578 | 0.223 | 0.21 | -0.431 |


## Skewness
- Skew describes a distribution that would otherwise be **Gaussian** (normal or bell curve) but is shifted or
  squashed in one direction or another.
- Many machine learning algorithms assume a **Gaussian or normal** distribution.
- Knowing that an attribute is skewed may allow you to perform data preparation to correct the skew and later improve the accuracy of your models.
- The skew result shows a positive (right) or negative (left) skew. Values closer to zero show less skew.
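
As an illustration of the data preparation mentioned above, the sketch below applies a log transform to a right-skewed series and compares the skew before and after. The values are hypothetical, and the log transform is just one common option assumed here for demonstration; it is not necessarily the right correction for any particular column of the heart disease dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: most are small, one is very large
values = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 13, 40], dtype=float)

skew_before = values.skew()

# log1p computes log(1 + x), so zero values would also be handled safely
transformed = np.log1p(values)
skew_after = transformed.skew()

print(f"skew before: {skew_before:.3f}")
print(f"skew after:  {skew_after:.3f}")
```

On this toy series the transform pulls the long right tail in, so the skew moves closer to zero without reaching it.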

In [20]:
# skew 
df.skew() 

age        -0.202
sex        -0.791
cp          0.485
trtbps      0.714
chol        1.143
fbs         1.987
restecg     0.163
thalachh   -0.537
exng        0.743
oldpeak     1.270
slp        -0.508
caa         1.310
thall      -0.477
output     -0.180
dtype: float64

In [21]:
# kurtosis (pandas reports excess kurtosis, so a normal distribution scores 0)
df.kurtosis()

age        -0.542
sex        -1.383
cp         -1.193
trtbps      0.929
chol        4.505
fbs         1.960
restecg    -1.363
thalachh   -0.062
exng       -1.458
oldpeak     1.576
slp        -0.628
caa         0.839
thall       0.298
output     -1.981
dtype: float64