<a href="https://colab.research.google.com/github/lisovyy/pokedex/blob/master/Statistical_Data_Analysis_with_Python_Udemy_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Statistical Analysis with Python: Notebook

### Pre-requisites

In [None]:
# Importing the required libraries
# ---
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## 1. Univariate Analysis

### Frequency Tables

#### Example

In [None]:
# Example: Frequency Tables
# ---
# Loading and previewing our dataset
# ---
credit_df = pd.read_csv('http://tiny.cc/aspnpz')
credit_df.head()

Unnamed: 0.1,Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose
0,0,67,male,2,own,,little,1169,6,radio/TV
1,1,22,female,2,own,little,moderate,5951,48,radio/TV
2,2,49,male,1,own,little,,2096,12,education
3,3,45,male,2,free,little,little,7882,42,furniture/equipment
4,4,53,male,2,free,little,little,4870,24,car


In [None]:
# Then creating our frequency table
# ---
# 
credit_df['Sex'].value_counts()

male      690
female    310
Name: Sex, dtype: int64
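Raw counts are often easier to compare as proportions. The sketch below uses a small made-up Series (not the credit dataset) to show `value_counts` with `normalize=True` for relative frequencies and `dropna=False` to count missing values, which matters for columns such as `Saving accounts` that contain empty entries.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
sex = pd.Series(['male', 'female', 'male', 'male', np.nan])

# Absolute counts (missing values are excluded by default)
counts = sex.value_counts()

# Relative frequencies: each count divided by the number of non-missing values
proportions = sex.value_counts(normalize=True)

# Treat missing values as their own category
counts_with_nan = sex.value_counts(dropna=False)

print(counts)
print(proportions)
print(counts_with_nan)
```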

#### <font color="green">Challenge</font>

In [None]:
# Challenge
# ---
# Question: Create a frequency table from the credit dataset 
# above on the values in the 'Saving accounts' variable.
# ---
# 

### Bar Charts

#### Example

In [None]:
# Example 
# ---
# A bar chart presents categorical data with rectangular bars with heights 
# or lengths proportional to the values that they represent. 
# Let's visualise the 'Saving accounts' variable using a bar chart
# ---
# 
credit_df['Saving accounts'].value_counts().plot(kind='bar');  

#### <font color="green">Challenge</font>

In [None]:
# Challenge 
# ---
# Create a bar chart to understand the purpose variable
# given the credit dataset above. Record 3 key observations.
# ---
# 

### Histograms

#### Example

In [None]:
# Example
# ---
# A histogram is a visualisation that uses bars of different heights. 
# It is similar to a bar chart, but a histogram groups numbers into ranges.
# It is an alternative way to display the distribution of a quantitative variable. 
# ---
# Let's create a histogram to get the distribution of our duration variable:
# ---
# 
plt.hist(credit_df['Duration'], bins=10, histtype='bar', rwidth=0.9);

Observations:
1. The highest occurrence of duration is around 22 months.
2. The histogram illustrates positive skew. This means there's a long tail on the right side of our peak. Because of this skew, we expect the mean duration to be larger than the median duration.
3. Most durations fall between 18 and 24 months.
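The link between a right-hand tail and a mean that exceeds the median can be checked numerically. The following is a minimal sketch on made-up durations (not the credit dataset); `np.histogram` returns the same bin counts that `plt.hist` would draw.

```python
import numpy as np

# Hypothetical right-skewed sample (months), for illustration only
durations = np.array([6, 12, 12, 12, 18, 18, 24, 24, 36, 48, 60, 72])

# np.histogram returns the count per bin and the bin edges
counts, edges = np.histogram(durations, bins=4)

mean_duration = durations.mean()
median_duration = np.median(durations)

print(counts, edges)
print(mean_duration, median_duration)  # the long right tail pulls the mean above the median
```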


#### <font color="green">Challenge</font>

In [None]:
# Challenge 
# ---
# Create a histogram to understand the distribution of the age variable in the credit dataset.
# Record 3 key observations.
# ---
# 

### Pie Charts

#### Example

In [None]:
# Example 
# ---
# A pie chart is a circular graph that shows the relative contribution 
# of different categories to an overall total.
# ---
# Let's create a pie chart of the savings account variable.
# ---
# 

# We first create a summary table
# ---
# 
credit_savings = credit_df['Saving accounts'].value_counts()
credit_savings

In [None]:
# create our labels
# ---
labels = credit_savings.index.tolist()
labels

In [None]:
# Then later create our pie chart
# ---
plt.pie(credit_savings, labels=labels, autopct='%i%%');
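When `autopct` is a format string, matplotlib applies it to each wedge's percentage using Python's `%` formatting. The short sketch below, with a made-up percentage, shows that `'%i%%'` keeps only the integer part, while `'%.1f%%'` keeps one decimal place.

```python
# Hypothetical wedge percentage, as it would be passed to the autopct format string
pct = 60.3

label_int = '%i%%' % pct       # integer part only
label_one_dp = '%.1f%%' % pct  # one decimal place

print(label_int)     # 60%
print(label_one_dp)  # 60.3%
```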

What did we observe?

#### <font color="green">Challenge</font>

In [None]:
# Challenge 
# ---
# Create a pie chart to understand the contribution of 
# Checking account categories to the total.
# ---
# 

### Box Plots

#### Example  

In [None]:
# Example
# ---
# A box plot is used in exploratory data analysis to visually 
# show the distribution of numerical data and skewness through 
# displaying the data quartiles (or percentiles) and the median.
# It can tell us whether our variable contains outliers. 
# ---
# Let's create a box plot to show the distribution 
# of the Duration variable. Record your observations.
# ---
# 
sns.boxplot(x=credit_df["Duration"]);

Observations:
1. The median duration is roughly 18 months.
2. The minimum recorded duration is 4 months and the maximum is 72 months.
3. 75% of durations were at or below 24 months and 25% were at or below 12 months.






In [None]:
# To see exact numeric values of the quartiles in a box
# We can print out the following:
# ---
# 
credit_df['Duration'].describe()
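By default, matplotlib and seaborn draw box plot whiskers out to the furthest points within 1.5 times the interquartile range (IQR) of the quartiles, and plot anything beyond that as an outlier. A minimal sketch of that rule on made-up durations (not the credit dataset):

```python
import pandas as pd

# Hypothetical durations in months, for illustration only
duration = pd.Series([4, 6, 12, 12, 18, 18, 24, 24, 36, 72])

q1 = duration.quantile(0.25)
q3 = duration.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = duration[(duration < lower_fence) | (duration > upper_fence)]
print(q1, q3, iqr)
print(outliers.tolist())
```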

#### <font color="green">Challenge</font>

In [None]:
# Challenge 
# ---
# Create a box plot of the Credit amount variable recording key observations.
# ---
# 

### Measures of Central Tendency

#### Examples

In [None]:
# Example 1: Mean
# ---
# The mean is the average of the values in a variable.
# ---
# Let's find the mean of the Age variable 
# ---
# 
mean_age = credit_df['Age'].mean()
mean_age
round(mean_age, 0)

In [None]:
# Example 2: Median
# ---
# The median is the middle value of a variable once its values are sorted.
# ---
# 
median_age = credit_df['Age'].median()
median_age

In [None]:
# Example 3: Mode 
# ---
# The mode is the value that occurs the most frequently in your data set.
# ---
# Let's calculate the mode of the Age variable
# ---
# 
mode_age = credit_df['Age'].mode().iloc[0]
mode_age
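`Series.mode()` returns a Series because more than one value can tie for the highest frequency, which is why the example takes `.iloc[0]`. The sketch below uses made-up ages to show a tie and how the three measures differ.

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([23, 27, 27, 31, 31, 45])

modes = ages.mode()         # all most frequent values, sorted ascending
first_mode = modes.iloc[0]  # keeps only the smallest of the tied values

print(modes.tolist())
print(ages.mean(), ages.median())
```

When several modes exist, reporting only the first can hide the tie, so it is worth inspecting the full result.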

In [None]:
# Bonus: We can also query our credit_df dataframe 
# to see the records in which Age is equal to 27
# ---
# 
credit_df.query('Age==27')

#### <font color="green">Challenges</font>

In [None]:
# Challenge 1
# ---
# Question: Calculate the mean of Credit Amount
# ---
# 

In [None]:
# Challenge 2
# ---
# Question: Calculate the median of Credit Amount
# ---
#

In [None]:
# Challenge 3
# ---
# Question: Calculate the mode of Credit Amount
# ---
#

### Measures of Dispersion

#### Example

In [None]:
# Example 1: Variance
# ---
# Variance is a measurement of the spread between numbers in a data set. 
# It is the average squared distance of each number from the mean, 
# so a larger variance means the values lie further from the mean.
# ---
# Finding the variance of the Credit amount variable:
#
credit_df["Credit amount"].var()
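pandas computes the sample variance by default (dividing by n - 1, i.e. `ddof=1`), whereas NumPy's `np.var` defaults to the population variance (dividing by n). The sketch below, on made-up amounts, computes both by hand and also confirms that the standard deviation is the square root of the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical credit amounts, for illustration only
amounts = pd.Series([1000.0, 2000.0, 3000.0, 6000.0])

mean = amounts.mean()
squared_deviations = (amounts - mean) ** 2

sample_var = squared_deviations.sum() / (len(amounts) - 1)  # matches Series.var()
population_var = squared_deviations.sum() / len(amounts)    # matches np.var()

print(sample_var, amounts.var())
print(population_var, np.var(amounts.to_numpy()))
print(np.sqrt(sample_var), amounts.std())
```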

In [None]:
# Example 2: Standard Deviation  
# ---
# Standard deviation measures the dispersion of a dataset relative 
# to its mean. It is calculated as the square root of variance by determining 
# the variation between each data point relative to the mean. 
# If the data points are further from the mean, there is a 
# higher deviation within the data set; thus, the more spread 
# out the data, the higher the standard deviation.
# ---
# Finding the standard deviation for Credit Amount
# ---
# 
credit_df["Credit amount"].std()

In [None]:
# Example 3: Range
# ---
# Range is the difference between the minimum and maximum values in a variable
# ---
# Let's find out the range for Credit Amount
# ---
# 
# Finding the min and max values of the Credit amount variable
credit_df_max = credit_df["Credit amount"].max()
credit_df_min = credit_df["Credit amount"].min()

# Calculating the range
credit_df_max - credit_df_min

In [None]:
# Example 4: Quantiles 
# ---
# Quantiles divide the sorted values into equal-sized groups; 
# the quartiles (0.25, 0.5, 0.75) divide them into four groups.
# They measure the spread of values above and below the median.
# Quantiles are also used to calculate the interquartile range, 
# which is a measure of variability around the median.
# ---
# 
# Finding the quantiles of the Credit amount variable
# 
credit_df["Credit amount"].quantile([0.25, 0.5, 0.75])
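The interquartile range is the difference between the 0.75 and 0.25 quantiles. The sketch below uses a tiny made-up Series; note that `quantile` interpolates linearly between neighbouring values by default, so the quartiles need not be values that occur in the data.

```python
import pandas as pd

# Hypothetical values, for illustration only
values = pd.Series([1, 2, 3, 4])

quartiles = values.quantile([0.25, 0.5, 0.75])

# Interquartile range: spread of the middle 50% of the data
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)
```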

In [None]:
# Example 5: Skewness
# ---
# Skewness is a measure of symmetry, or more precisely, 
# the lack of symmetry. A distribution, or data set, is symmetric 
# if it looks the same to the left and right of the center point.
# A negatively skewed distribution has a negative value while 
# a positive value means the distribution is positively skewed.
# ---
# Determining the skewness of the credit amount variable.
# 
credit_df["Credit amount"].skew()
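To see how the sign of the skewness relates to the shape of the data, the sketch below compares three made-up samples: a symmetric one, one with a long right tail, and its mirror image.

```python
import pandas as pd

# Hypothetical samples, for illustration only
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])  # one large value stretches the right tail
left_skewed = -right_skewed                    # mirror image of the sample above

print(symmetric.skew())     # approximately 0
print(right_skewed.skew())  # positive
print(left_skewed.skew())   # negative, same magnitude
```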

In [None]:
# Example 6: Kurtosis
# ---
# Kurtosis is a statistical measure used to describe the degree 
# to which scores cluster in the tails or the peak of a frequency distribution. 
# The peak is the tallest part of the distribution, 
# and the tails are the ends of the distribution.
# pandas' kurt() reports excess kurtosis, so a normal distribution scores about 0.
# Positive values indicate a peaked distribution with thick tails (leptokurtic).
# Negative values indicate a flat distribution with thin tails (platykurtic).
# Values close to 0 indicate tails similar to a normal distribution (mesokurtic).
# ---
# Determining the kurtosis of the Credit amount variable
# 
credit_df["Credit amount"].kurt()
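`Series.kurt()` uses Fisher's definition, in which a normal distribution has kurtosis 0. The sketch below contrasts two made-up samples: evenly spread values give a negative result, while a sample with most values at the centre and a few extreme ones gives a positive result.

```python
import pandas as pd

# Hypothetical samples, for illustration only
flat = pd.Series([1, 2, 3, 4, 5])                             # evenly spread, thin tails
heavy_tailed = pd.Series([-10, 0, 0, 0, 0, 0, 0, 0, 0, 10])   # sharp peak, extreme tails

print(flat.kurt())          # negative: platykurtic
print(heavy_tailed.kurt())  # positive: leptokurtic
```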

In [None]:
# Example 7: Summary statistics
# ---
# We can display the summary statistics of a variable by:
# ---
# 
credit_df['Credit amount'].describe()

#### <font color="green">Challenge</font>

In [None]:
# Challenge 1 
# ---
# Find the variance of the duration variable and record your observations.
# ---
# 

In [None]:
# Challenge 2
# ---
# Find the standard deviation of the duration variable and record your observations.
# ---
#

In [None]:
# Challenge 3
# ---
# Calculate the range of the duration variable and record your observations.
# ---
#

In [None]:
# Challenge 4
# ---
# Find the quantiles of the duration variable and record your observations.
# ---
#

In [None]:
# Challenge 5
# ---
# Determine the skewness of the duration variable. 
# Record your observation.
# ---
# 

In [None]:
# Challenge 6
# ---
# Question: Determine the kurtosis of the duration variable.
# Record your observation.
# ---
#

In [None]:
# Challenge 7
# ---
# Display the summary statistics of the duration variable
# and record your observation.
# ---
#

## 2. Bivariate Analysis

### Scatter Plots

#### Example

In [None]:
# Scatter Plot
# ---
# A scatter plot reveals relationships or association between two variables.
# If the markers lie close to a straight line in the scatter plot, 
# the two variables have a high correlation. 
# If the markers are scattered evenly with no visible pattern, 
# the correlation is low, or zero. 
# A visual impression can be misleading, however, so we confirm it 
# with a correlation coefficient or a correlation matrix.
# ---
# This type of plot helps us answer the following questions:
# - Are variables X and Y related?
# - Are variables X and Y linearly related?
# - Are variables X and Y non-linearly related?
# - Does the variation in Y change depending on X?
# - Are there outliers?
# ---
# Let's create a scatter plot using the following startup dataset to determine 
# whether there is any correlation between R&D spend and profit for startups.
# ---
#
startup_df = pd.read_csv('https://bit.ly/Startupsdataset')
startup_df.head()

In [None]:
# Plotting our Scatterplot
# ---
# 
startup_df.plot.scatter(x='R&D Spend', y='Profit') 
plt.title('R&D Spend vs Profit') 
plt.show()

Key observations:
1. As R&D spend increases, Profit also increases. This means that the two variables have a positive correlation. 

NB: This correlation does not mean causation. One would need to investigate further whether that is valid.

#### <font color="green">Challenge</font>

In [None]:
# Challenge 1
# ---
# Using the startup dataset, determine whether there is a correlation between 
# marketing spend and profit and record your observations. 
# ---
# 

In [None]:
# Challenge 2
# ---
# Again using the startup dataset, determine whether there is a correlation between 
# R&D and Administration costs and record your observations. 
# ---
# 

### Pearson Correlation Coefficient

#### Example


In [None]:
# Example 1
# ---
# We can get the pearson correlation coefficient for R&D Spend and marketing spend:
# ---
pearson_coeff = startup_df["R&D Spend"].corr(startup_df["Marketing Spend"], method="pearson") 
pearson_coeff

Key observations
1. A correlation coefficient of 0.72 means that the two variables have a strong positive correlation.
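The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below, on made-up spend figures (not the startup dataset), computes it by hand and compares the result with `Series.corr`.

```python
import pandas as pd

# Hypothetical spend figures, for illustration only
x = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
y = pd.Series([12.0, 25.0, 29.0, 46.0, 48.0])

# Pearson r = cov(x, y) / (std(x) * std(y))
covariance = x.cov(y)
r_manual = covariance / (x.std() * y.std())

r_pandas = x.corr(y, method='pearson')

print(r_manual, r_pandas)
```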

A few rules of thumb to note while working with correlation coefficients:
- A correlation coefficient between 0.5 and 1 means that the two variables have a strong positive correlation.
- A correlation coefficient between 0 and 0.5 means that the two variables have a weak positive correlation.
- A correlation coefficient of 0 means that the two variables have no linear correlation.
- A correlation coefficient between -0.5 and 0 means that the two variables have a weak negative correlation.
- A correlation coefficient between -1 and -0.5 means that the two variables have a strong negative correlation.
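The thresholds above can be wrapped in a small helper. `describe_correlation` is a hypothetical function written for this sketch, and treating a coefficient of exactly 0.5 or -0.5 as "strong" is an assumption, since the list leaves the boundary open.

```python
def describe_correlation(r):
    """Label a correlation coefficient using the 0.5 rule-of-thumb threshold."""
    if not -1.0 <= r <= 1.0:
        raise ValueError('a correlation coefficient must lie between -1 and 1')
    if r == 0:
        return 'no correlation'
    # Assumption: the boundary value 0.5 counts as strong
    strength = 'strong' if abs(r) >= 0.5 else 'weak'
    direction = 'positive' if r > 0 else 'negative'
    return f'{strength} {direction} correlation'


print(describe_correlation(0.72))
print(describe_correlation(-0.2))
```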


In [None]:
# Example 2
# ---
# We can also create a correlation matrix to get all the pearson correlation 
# coefficients for the variables in a dataframe as shown below.
# ---
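# Only numeric columns can be correlated, so we select them first
# (recent pandas versions raise an error if corr() receives text columns).
startup_df.select_dtypes(include='number').corr()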
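A correlation matrix is only defined for numeric columns. The sketch below builds a small made-up frame with one text column (the column names are assumptions for illustration) and uses `select_dtypes` to keep the numeric columns before calling `corr()`.

```python
import pandas as pd

# Hypothetical data with one non-numeric column, for illustration only
df = pd.DataFrame({
    'spend': [10.0, 20.0, 30.0, 40.0],
    'profit': [15.0, 22.0, 35.0, 41.0],
    'region': ['north', 'south', 'north', 'south'],  # text column, cannot be correlated
})

# Keep only numeric columns, then compute pairwise Pearson coefficients
numeric_df = df.select_dtypes(include='number')
corr_matrix = numeric_df.corr()

print(corr_matrix)
```

The diagonal is always 1 because every variable is perfectly correlated with itself, and the matrix is symmetric.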

In [None]:
# Example 3 
# ---
# To better visualise our correlation matrix we can use a heatmap as shown
# ---
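# We again restrict the dataframe to its numeric columns before computing correlations
sns.heatmap(startup_df.select_dtypes(include='number').corr(), annot=True, vmin=-1, vmax=1);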

#### <font color="green">Challenge</font>

In [None]:
# Challenge 
# ---
# Calculate the pearson correlation coefficient to determine 
# the relationship between R&D Spend and Administration costs
# given the startup dataset.
# ---
# 

In [None]:
# Challenge
# ---
# You have been provided the following dataset, which contains 
# data on 7 species of fish sold at market. 
# Create a correlation matrix and a heat map, and determine which 
# variables have a strong positive correlation.
# ---
# Dataset url = https://bit.ly/Fishdataset
# ---
# 

### Line Graphs

In [None]:
# We can also examine the relationship between two variables through the use 
# of line charts as shown in this example. 
# ---
# We will use two variables in the following example. 
# The given dataset provides production of chocolate by month, 
# as a percent production from the year 2016 January.
# ---
# Dataset = https://bit.ly/CandyProductionDs
# Dataset info = The industrial production (IP) index measures the real output 
# of all relevant establishments located in the United States, 
# regardless of their ownership, but not those located in U.S. territories. 
# This dataset tracks industrial production every month from January 1972 to August 2017.
# ---
# 

# Reading our dataset 
# ---
# 
candy_df = pd.read_csv('https://bit.ly/CandyProductionDs')

# Getting our years 2016 onwards
# ---
#
candy2016_plus_df = candy_df[candy_df['observation_date'] > '2015-12-01']
candy2016_plus_df.head()

In [None]:
# We then plot our line graph as shown 
# ---
# We can also answer the following question:
# - Which months have the highest candy production?
# ---
#   
candy2016_plus_df.plot.line(x='observation_date', y='IPG3113N');
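Parsing the date column with `pd.to_datetime` makes the filter a true date comparison and lets us group by month to answer the question above. This is a minimal sketch on made-up rows that reuse the `observation_date` and `IPG3113N` column names; the values are invented.

```python
import pandas as pd

# Hypothetical rows in the same layout as the candy dataset, for illustration only
candy = pd.DataFrame({
    'observation_date': ['2015-11-01', '2015-12-01', '2016-01-01', '2016-02-01', '2016-12-01'],
    'IPG3113N': [100.0, 105.0, 98.0, 95.0, 110.0],
})

# Convert text dates to datetime64 so comparisons and .dt accessors work on real dates
candy['observation_date'] = pd.to_datetime(candy['observation_date'])

# Keep January 2016 onwards
recent = candy[candy['observation_date'] >= pd.Timestamp('2016-01-01')]

# Average production per calendar month, then the month with the highest average
monthly_mean = recent.groupby(recent['observation_date'].dt.month)['IPG3113N'].mean()
peak_month = monthly_mean.idxmax()

print(monthly_mean)
print(peak_month)
```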

#### <font color="green">Challenge</font>

In [None]:
# Challenge 
# ---
# Given the following dataset of temperature data recorded 
# in the city of Sao Paulo, Brazil, clean the dataset then
# perform data exploration of the average temperature levels 
# for the past 20 years.
# ---
# Dataset url = https://bit.ly/SaoPauloTemp
# 