# Understanding and Interpreting Data using Descriptive Statistics
## Data Analysis Workflow
- Data Collection
- Importing Data
- Data Cleaning
  - Handling Missing Data
  - Outlier Detection and Removal
- Exploring Data using Descriptive Statistics
  - Understanding Data using Summary Statistics
    - Univariate Analysis
    - Bivariate Analysis
    - Multivariate Analysis
  - Understanding Data using Visualizations
    - Univariate
      - Histograms
      - Density Plot
    - Bivariate
      - Scatter Plot
      - Boxplot
    - Multivariate
      - Correlation Matrix
      - Covariance Matrix
- Decision Making using Inferential Statistics
  - Hypothesis Testing (t-test, z-test, chi-square, ANOVA)
  - Creating Predictive Models

## Dataset 
### Source
- http://www.statsci.org/data/oz/ms212.html

The data was supplied by Dr Richard J. Wilson, Department of Mathematics,
University of Queensland. The original data file is tab-delimited text.

### Description
110 students in an introductory statistics class (MS212 taught by Professor John Eccleston and Dr Richard Wilson
at The University of Queensland) participated in a simple experiment. The students took their own pulse rate.
They were then asked to flip a coin. If the coin came up heads, they were to run in place for one minute.
Otherwise they sat for one minute. Then everyone took their pulse again. The pulse rates and other physiological
and lifestyle data are given in the dataset. Data were missing for one student, and the heights recorded for
two students were seemingly incorrect. These observations were removed, resulting in 107 subjects in the final dataset.
Five class groups between 1993 and 1998 participated in the experiment. The lecturer, Richard Wilson, was
concerned that some students would choose the less strenuous option of sitting rather than running even if their
coin came up heads, so in the years 1995-1998 a different method of random assignment was used. In these
years, data forms were handed out to the class before the experiment. The forms were pre-assigned to either
running or non-running and there were an equal number of each. In 1995 and 1998 not all of the forms were
returned, so the numbers running and sitting were still not entirely controlled.

### Variable Information 
![](../img/data_docs.png)

## Importing Data

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

In [None]:
data = pd.read_table('../data/pulse.txt')
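
The source describes the original file as tab-delimited text, which is why `pd.read_table` is used here. As a minimal sketch, the snippet below reads a small in-memory sample instead of the real file; the three rows and their values are made up for illustration and only mimic the tab-separated layout.

```python
import io
import pandas as pd

# Hypothetical three-row sample in tab-delimited form; the values are made up
# and only mimic the layout of a tab-separated text file.
sample_text = "Height\tWeight\tAge\n173\t57\t18\n179\t58\t19\n167\t62\t18\n"

# read_table uses a tab as its default separator ...
via_read_table = pd.read_table(io.StringIO(sample_text))
# ... so it should match read_csv called with sep="\t".
via_read_csv = pd.read_csv(io.StringIO(sample_text), sep="\t")

print(via_read_table.equals(via_read_csv))
```

If a file turns out to use a different delimiter, passing `sep=` explicitly is usually the simplest adjustment.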

## Exploring Data 

In [None]:
data.head() 

In [None]:
data.tail() 

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.dtypes

In [None]:
data.info() 

## Data Preprocessing 
- Rename variables 
- Check missing values 
- Remove missing values 
- Check duplicate rows 
- Drop duplicate rows 
- Create new variables 
- Detect and remove outliers

### Missing Values 

In [None]:
data.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(data.isnull(), cmap="viridis")
plt.show()

In [None]:
# impute with mean 
data['Pulse1'] = data['Pulse1'].fillna(data['Pulse1'].mean())

In [None]:
# impute Pulse2 with its own mean (not the mean of Pulse1)
data['Pulse2'] = data['Pulse2'].fillna(data['Pulse2'].mean())

In [None]:
data.isnull().sum() 
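
Mean imputation fills each column with that column's own mean. The sketch below shows one way to do this for several columns at once on a toy frame; the values are invented, and only the column names `Pulse1` and `Pulse2` come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are made up).
toy = pd.DataFrame({
    "Pulse1": [60.0, np.nan, 80.0],
    "Pulse2": [90.0, 110.0, np.nan],
})

pulse_cols = ["Pulse1", "Pulse2"]
# DataFrame.fillna accepts a Series of fill values indexed by column name,
# so each column is filled with its own mean.
toy[pulse_cols] = toy[pulse_cols].fillna(toy[pulse_cols].mean())

print(toy)
```

Filling every column in one call avoids the copy-paste slip of using one column's mean for another.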

### Duplicate Rows

In [None]:
data.duplicated().sum() 
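
The preprocessing checklist also mentions dropping duplicate rows. A minimal sketch on a toy frame (values are made up) shows how counting and dropping fit together:

```python
import pandas as pd

# Toy frame in which the second row repeats the first (values are made up).
toy = pd.DataFrame({
    "Height": [170, 170, 182],
    "Weight": [60, 60, 75],
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduplicated.shape)
```

In survey-style data two identical rows can be two genuine respondents, so it may be worth inspecting the flagged rows before dropping them.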

### Outlier Detection and Removal 

In [None]:
data.describe() 

In [None]:
data.quantile(0.25)

In [None]:
# calculate the quartiles of Height 
Q1, Q2, Q3 = data['Height'].quantile([.25, .50, .75])
print("Q1 (25th percentile): ", Q1)
print("Q2 (50th percentile, the median): ", Q2)
print("Q3 (75th percentile): ", Q3)

In [None]:
# range = maximum - minimum 
r = data.Height.max() - data.Height.min() 
print(r)

In [None]:
# iqr 
IQR = Q3 - Q1 
print(IQR)

In [None]:
# set upper and lower limit [Q1 - 1.5 x IQR, Q3 + 1.5 x IQR]
lower = Q1 - 1.5 * IQR 
upper = Q3 + 1.5 * IQR 
lower, upper

In [None]:
data.shape

In [None]:
# detect and remove outliers: keep only rows whose Height lies strictly inside the limits 
data_new = data[(data['Height'] < upper) & (data['Height'] > lower)]
data_new

In [None]:
data.shape, data_new.shape 
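
The IQR rule used above can be wrapped in a small helper so that it can be reused for other columns. This is only a sketch: the function name `iqr_fences` and the toy heights are hypothetical, and the multiplier 1.5 is the one used in the cells above.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR for a Series."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy heights in cm with one implausible value (values are made up).
heights = pd.Series([160, 165, 170, 175, 180, 250])

lower, upper = iqr_fences(heights)
# Keep only values strictly inside the fences, as in the cell above.
kept = heights[(heights > lower) & (heights < upper)]

print(lower, upper, kept.tolist())
```

Note that the filtered result has to be assigned and used afterwards; filtering into a new variable leaves the original frame unchanged.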

### Creating New Variable

In [None]:
data.head() 

In [None]:
# BMI = weight / height^2, with Height divided by 100 to convert it to metres
data['BMI'] = data['Weight'] / (data['Height'] / 100) ** 2
data.head() 

In [None]:
# 1 = Underweight, 2 = Normal, 3 = Overweight, 4 = Obese
def bmicat(bmi): 
    if 0 <= bmi < 18.5:  # upper bound must meet the lower bound of the next category
        return 1 
    elif 18.5 <= bmi < 25: 
        return 2 
    elif 25 <= bmi < 30: 
        return 3 
    else: 
        return 4 

In [None]:
data["BMICat"] = data["BMI"].apply(bmicat)
data.head() 
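
As an alternative to a row-wise `apply`, the same binning can be sketched with `pd.cut`. The cut points 18.5, 25 and 30 follow the category comments above; the toy BMI values are made up.

```python
import pandas as pd

# Toy BMI values chosen to sit on and around the cut points (made up).
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 31.2])

# right=False makes every bin closed on the left:
# [0, 18.5) -> 1, [18.5, 25) -> 2, [25, 30) -> 3, [30, inf) -> 4
bmi_cat = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=[1, 2, 3, 4],
    right=False,
)

print(bmi_cat.tolist())
```

One behavioural difference worth knowing: `pd.cut` leaves a missing BMI as missing, whereas an `if`/`elif`/`else` chain of comparisons sends a missing value to the final `else` branch.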

### Base-10 Logarithm Transformation

In [None]:
data['WeightLog10'] = np.log10(data['Weight'])
data.head() 
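
The cell above uses `np.log10`, the base-10 logarithm; the natural logarithm is `np.log`. The short sketch below, on made-up weights, shows that the two differ only by a constant factor, so either can be used to compress a right-skewed variable.

```python
import numpy as np
import pandas as pd

# Toy weights chosen as powers of ten so the result is easy to read (made up).
weights = pd.Series([10.0, 100.0, 1000.0])

log10_w = np.log10(weights)   # base-10 logarithm
ln_w = np.log(weights)        # natural logarithm (base e)

# Change of base: log10(x) = ln(x) / ln(10)
print(np.allclose(log10_w, ln_w / np.log(10)))
```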

### Standardize a Variable

In [None]:
data['AgeStd'] = (data['Age'] - data['Age'].mean())/data['Age'].std() 
data.head() 
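
Standardizing subtracts the mean and divides by the standard deviation, so the result should have mean 0 and standard deviation 1. A quick check on made-up ages:

```python
import pandas as pd

# Toy ages (values are made up).
ages = pd.Series([18, 19, 20, 21, 22])

# Series.std() uses the sample standard deviation (ddof=1) by default.
z = (ages - ages.mean()) / ages.std()

print(z.mean(), z.std())
```

Up to floating-point rounding, the printed mean is 0 and the printed standard deviation is 1.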

## Identifying Variables

In [None]:
data.columns

### Categorical Variables 
- Gender
- Smokes
- Alcohol
- Exercise
- Ran
- BMICat

### Numerical Variables 
- Height
- Weight
- Age
- Pulse1
- Pulse2
- BMI

## Qualitative Univariate Analysis 

### Frequency Distribution: One-way Table 

In [None]:
import researchpy as rp 

In [None]:
rp.summary_cat(data['Gender'])

In [None]:
rp.summary_cat(data[['Gender', 'Smokes', 'Alcohol', 'Exercise']])

In [None]:
rp.codebook(data[['Age', 'Height']])

In [None]:
data.columns

In [None]:
# Sex (1 = Male, 2 = Female)
data['Gender'].value_counts() 

In [None]:
data['Gender'].value_counts(normalize=True) 

In [None]:
# Regular smoker? (1 = Yes, 2 = No)
data['Smokes'].value_counts() 

In [None]:
data['Smokes'].value_counts(normalize=True) 

In [None]:
# Regular drinker? (1 = Yes, 2 = No)
data['Alcohol'].value_counts() 

In [None]:
data['Alcohol'].value_counts(normalize=True) 

In [None]:
# Frequency of exercise (1 = High, 2 = Moderate, 3 = Low)
data['Exercise'].value_counts() 

In [None]:
# Frequency of exercise (1 = High, 2 = Moderate, 3 = Low)
data['Exercise'].value_counts(normalize=True) 

In [None]:
data['Ran'].value_counts() 

In [None]:
data['Ran'].value_counts(normalize=True) 

In [None]:
data['BMICat'].value_counts() 

In [None]:
data['BMICat'].value_counts(normalize=True) 
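
The count and proportion cells above come in pairs for every variable. One way to reduce the repetition is a small helper that returns both in a single table; the name `one_way_table` and the toy data are hypothetical.

```python
import pandas as pd

def one_way_table(series):
    """Return counts and percentages for each level of a categorical Series."""
    counts = series.value_counts().sort_index()
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})

# Toy coded variable (values are made up).
gender = pd.Series([1, 2, 2, 1, 2], name="Gender")

table = one_way_table(gender)
print(table)
```

Because `value_counts` ignores missing values by default, the percentages here are relative to the non-missing observations.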

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.countplot(data = data, x = "Gender")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.countplot(data = data, x = "Smokes")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.countplot(data = data, x = "Alcohol")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.countplot(data = data, x = "Exercise")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.countplot(data = data, x = "BMICat")
plt.show()

## Qualitative Bivariate Analysis 

### Frequency Distribution: Two-way Table 

In [None]:
pd.crosstab(data['Gender'], data['Smokes'])

In [None]:
pd.crosstab(data['Gender'], data['Smokes'], normalize=True) * 100 

In [None]:
pd.crosstab(data['Gender'], data['Alcohol'])

In [None]:
pd.crosstab(data['Gender'], data['Alcohol'], normalize=True)

In [None]:
pd.crosstab(data['Gender'], data['Exercise'])

In [None]:
pd.crosstab(data['Gender'], data['Exercise'], normalize=True)

In [None]:
pd.crosstab(data['Gender'], data['BMICat'])

### Frequency Distribution: Marginal Table

In [None]:
pd.crosstab(data['Gender'], data['Smokes'], normalize=True, margins=True)
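
With `normalize=True` every cell is a share of the grand total, and `margins=True` adds the row and column totals. When the question is about proportions within each row, `normalize="index"` is the relevant option. The sketch below contrasts the two on a made-up frame.

```python
import pandas as pd

# Toy coded data (values are made up).
toy = pd.DataFrame({
    "Gender": [1, 1, 1, 2, 2],
    "Smokes": [1, 2, 2, 2, 2],
})

# Joint proportions with marginal totals: all cells sum to 1.
joint = pd.crosstab(toy["Gender"], toy["Smokes"], normalize=True, margins=True)

# Row-wise (conditional) proportions: each row sums to 1.
row_pct = pd.crosstab(toy["Gender"], toy["Smokes"], normalize="index")

print(joint)
print(row_pct)
```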

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.countplot(data = data, x = "Gender", hue="Smokes")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.countplot(data = data, x = "Gender", hue="Alcohol")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.countplot(data = data, x = "Gender", hue="Exercise")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.countplot(data = data, x = "Gender", hue="BMICat")
plt.show()

## Quantitative Univariate Analysis

In [None]:
data['Height'].describe() 

In [None]:
data['Weight'].describe() 

In [None]:
data['Age'].describe() 

In [None]:
data['Pulse1'].describe() 

In [None]:
data['Pulse2'].describe() 

In [None]:
data['BMI'].describe() 

In [None]:
data['BMI'].skew() 

In [None]:
data['BMI'].kurtosis() 
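
To help read these two numbers: skewness is about 0 for a symmetric distribution and positive when the right tail is longer, and pandas reports excess kurtosis, for which a normal distribution scores 0. A small illustration on made-up values:

```python
import pandas as pd

# Two toy samples (values are made up).
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])

# Skewness: zero for the symmetric sample, positive for the long right tail.
print(symmetric.skew(), right_skewed.skew())

# Excess kurtosis: negative values indicate lighter tails than a normal distribution.
print(symmetric.kurt())
```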

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.histplot(data=data, x="Age")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.histplot(data=data, x="Height")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.histplot(data=data, x="Pulse1")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.histplot(data=data, x="Pulse2")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.histplot(data=data, x="BMI")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.boxplot(data=data, x="BMI")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
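# same boxplot drawn vertically; passing the Series by keyword keeps the
# orientation independent of the seaborn version
sns.boxplot(y=data['BMI'])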
plt.show()

## Quantitative Bivariate Analysis

In [None]:
data.Age.corr(data.Height)

In [None]:
data.Age.corr(data.BMI)

In [None]:
data.Age.corr(data.Weight)

In [None]:
data.Age.cov(data.BMI)
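
Covariance depends on the units of both variables, while the Pearson correlation rescales it by the two standard deviations so that it lies between -1 and 1. The sketch below checks that relationship on made-up values.

```python
import pandas as pd

# Toy paired measurements (values are made up).
x = pd.Series([18.0, 19.0, 20.0, 23.0, 30.0])
y = pd.Series([20.5, 22.0, 21.0, 24.5, 27.0])

cov_xy = x.cov(y)                          # sample covariance
r_from_cov = cov_xy / (x.std() * y.std())  # rescale by both standard deviations
r = x.corr(y)                              # Pearson correlation (pandas default)

print(r, r_from_cov)
```

This is why the correlation is usually easier to compare across variable pairs than the raw covariance.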

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.scatterplot(data=data, x="Age", y="BMI")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.scatterplot(data=data, x="Age", y="BMI", hue="Gender")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.scatterplot(data=data, x="Age", y="Weight")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.scatterplot(data=data, x="Age", y="Pulse1")
plt.show()

## Multivariate Analysis 

In [None]:
data.corr() 

In [None]:
data.cov() 

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.heatmap(data.corr())
plt.show()

In [None]:
plt.figure(figsize=(20,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.heatmap(data.corr(), annot=True)
plt.show()
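
`DataFrame.corr()` works on numeric columns. If a frame also holds text columns, selecting the numeric ones first is a safe habit. A minimal sketch with a made-up frame and a hypothetical `Label` column:

```python
import pandas as pd

# Toy frame mixing numeric and text columns (values are made up).
toy = pd.DataFrame({
    "Age": [18, 19, 20, 21],
    "Height": [160, 172, 168, 181],
    "Label": ["a", "b", "c", "d"],
})

# Keep only the numeric columns before computing the correlation matrix.
numeric = toy.select_dtypes(include="number")
corr_matrix = numeric.corr()

print(corr_matrix)
```

Coded categorical variables such as `Gender` are stored as numbers, so they still appear in the matrix; their correlation coefficients should be read with care.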

## Categorical - Quantitative (C-Q) Analysis 

In [None]:
data.groupby('Gender')['BMI'].describe()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.boxplot(data=data, x="Gender", y="BMI")
plt.show()
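
When only a few statistics per group are needed, `agg` with a list of function names gives a more compact table than `describe()`. A sketch on made-up values:

```python
import pandas as pd

# Toy grouped data (values are made up).
toy = pd.DataFrame({
    "Gender": [1, 1, 2, 2, 2],
    "BMI": [22.0, 24.0, 19.0, 21.0, 23.0],
})

# One row per group, one column per requested statistic.
summary = toy.groupby("Gender")["BMI"].agg(["count", "mean", "median"])

print(summary)
```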

## Categorical - Categorical (C-C) Analysis

In [None]:
data.groupby('Gender')['Smokes'].value_counts().unstack() 

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.countplot(data=data, x="Gender", hue='Smokes')
plt.show()

## Quantitative - Quantitative (Q-Q) Analysis 

In [None]:
data.Age.corr(data.BMI) 

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.scatterplot(data=data, x="Age", y="BMI")
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5, palette= "viridis")
sns.scatterplot(data=data, x="Age", y="BMI", hue='Gender')
plt.show()