# Data Analytics and Visualization with Python

### Learning Objectives

- Descriptive Statistics
- Visualizing Data
    - Introduction to matplotlib library
    - Anatomy of a figure
    - Creating sub-plots
    - Chart aesthetics
- Visual Data Analytics
    - Univariate Analysis
        - count plots
        - histograms and boxplot
    - Bivariate Analysis
        - scatter plot
        - bar plot
        - line charts
        - pair plots, heatmaps
- Create and publish interactive charts using plotly and Dash


In [None]:
# Final Code ------------

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = [4, 3]

# Reading data from file
df = pd.read_csv("coffee_sales.csv", header = 2)

# Removing null columns and rows and replacing nulls
df.drop(columns=['Unnamed: 0'], inplace=True)
df.dropna(how = "all", inplace= True)
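# Assign the result back; fillna(inplace=True) on a selected column may not update df in recent pandas
df["Target Profit"] = df["Target Profit"].fillna("0")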

# Converting str columns to float/date
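# regex=False treats "$" as a literal character rather than a regex end-of-string anchor
df.Sales = df.Sales.str.replace("$", "", regex=False).astype(float)
df.Profit = df.Profit.str.replace("$", "", regex=False).astype(float)
df["Target Profit"] = df["Target Profit"].str.replace("$", "", regex=False).astype(float)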
df["Target Sales"] = df["Target Sales"].str.translate(str.maketrans("", "", "$,")).astype(float)
df["Date"] = pd.to_datetime(df["Date"])

df["Target Status"] = np.where(df.Sales >= df["Target Sales"], "Achieved", "Not Achieved")
df.insert(2, "Year", df.Date.dt.year)
df.insert(3, "Month", df.Date.dt.month_name())
df.insert(3, "Month#", df.Date.dt.month)
df.head()

## Data Visualization

### Primary Objects of matplotlib
<img src = "./images/figure.png" align = left width = 300>
<br><br><br><br><br><br><br><br>

- The [figure] is the overall drawing space; it can contain one or more plots
- The [axes] are the individual plots rendered within the figure
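
The figure/axes distinction can be sketched with `plt.subplots`, which returns one figure and an array of axes. This is a minimal illustration only: the plotted numbers and titles are made up for the example and are not taken from the coffee sales data.

```python
import matplotlib.pyplot as plt

# One figure (the overall canvas) holding two axes (the individual plots)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Each axes object is drawn on and labelled independently
axes[0].plot([1, 2, 3], [10, 20, 15])
axes[0].set_title("Left axes")

axes[1].bar(["A", "B", "C"], [3, 5, 2])
axes[1].set_title("Right axes")

# Figure-level properties apply to the whole canvas
fig.suptitle("One figure, two axes")

n_axes = len(fig.axes)  # number of axes attached to the figure
```

Titles, labels, and ticks belong to an individual axes, while the overall size and the super-title belong to the figure.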

### Anatomy of a figure

<img src = "./images/figure_anatomy.jpg" align = left width = 400>

In [None]:
import numpy as np
dates = np.arange('2019-01', '2022-01', dtype='datetime64[M]')
sales = np.array([42390, 77560, 77385, 76039, 42968, 53833, 47205, 68936, 51175, 48186, 71520, 66387, 62717, 52614, 42866, 64275, 44749, 68560,66258, 62221, 66303, 52428, 42300, 65645, 59215, 66944, 67519,46231, 79780, 59746, 59992, 70805, 64609, 72995, 60402, 76956])
profits = np.array([ 7206.3 ,  8531.6 , 13155.45,  9885.07,  7304.56,  9689.94, 5664.6 , 11029.76,  6141.  ,  5300.46,  9297.6 , 11285.79, 7526.04,  5787.54,  6429.9 , 12212.25,  5369.88, 12340.8 , 12589.02,  6222.1 ,  7293.33,  8388.48,  6768.  , 11816.1 , 7697.95, 11380.48,  7427.09,  6934.65,  8775.8 ,  7169.52, 7199.04,  9204.65, 10337.44,  9489.35, 10268.34, 14621.64])

### Bar Chart

### Bullet Chart

## Visual Data Analytics

### Univariate Analysis
Univariate analysis is a statistical method used to describe and analyze data consisting of only one variable. It focuses on understanding the characteristics and distribution of a single variable without considering the relationship with other variables.

- Descriptive Statistics
- Frequency Distribution
- Measures of Central Tendency
- Measures of Dispersion
- Visualization:
    - Box plots: Displaying the distribution of data using quartiles.
    - Histograms: Showing the frequency distribution of continuous variables.
    - Bar charts: Displaying the frequency distribution of categorical variables.
- Probability Distribution:
    - Normal distribution: Assessing if the data follows a normal distribution using graphical methods or statistical tests.

#### Categorical Variable

#### Numeric Variable

Descriptive statistics summarize and describe the main features of a dataset or sample, including measures of central tendency, dispersion, shape, and relationships between variables.

**Measures of Central Tendency:**

- Mean: The average value of the data points.
- Median: The middle value of the data when arranged in ascending order.
- Mode: The most frequently occurring value in the dataset.
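
As a quick illustration of the three measures, the sketch below uses a small made-up `pandas` Series (the values are assumptions chosen for easy arithmetic, not data from `coffee_sales.csv`).

```python
import pandas as pd

# Illustrative values only
values = pd.Series([2, 3, 3, 5, 7, 10])

mean_value = values.mean()      # sum / count = 30 / 6
median_value = values.median()  # even count: average of the two middle values (3 and 5)
mode_values = values.mode()     # returns a Series, because several values can tie

print(mean_value, median_value, mode_values.tolist())
```

Note that `mode()` returns a Series rather than a scalar, since a dataset can have more than one mode.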

**Measures of Dispersion:**

- Range: The difference between the maximum and minimum values in the dataset.
- Variance: The average of the squared differences from the mean.
- Standard Deviation: The square root of the variance, expressing the typical spread around the mean in the same units as the data.
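
The sketch below computes these measures on a small made-up Series (illustrative values, not from the dataset). One point worth knowing: `pandas` uses the sample formula (`ddof=1`) by default, so `ddof=0` is passed here to match the "average of the squared differences" definition above.

```python
import pandas as pd

# Illustrative values only; the mean is 5
spread = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

value_range = spread.max() - spread.min()  # 9 - 2

# Population formulas (divide by n), matching the definition in the text
pop_variance = spread.var(ddof=0)  # 32 / 8
pop_std = spread.std(ddof=0)       # square root of the population variance

# pandas default is the sample formula (divide by n - 1)
sample_variance = spread.var()     # 32 / 7

print(value_range, pop_variance, pop_std, round(sample_variance, 3))
```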

**Measures of Shape:**

- Skewness: A measure of the asymmetry of the distribution.
    - Positive skewness indicates a longer right tail and a concentration of data on the left side.
    - Negative skewness indicates a longer left tail and a concentration of data on the right side.
    - Skewness close to zero indicates approximate symmetry around the mean.

- Kurtosis: A measure of how heavy the tails of the distribution are, often described loosely as "peakedness" or "flatness". The values below refer to excess kurtosis (kurtosis minus 3, the value for a normal distribution).
    - Positive excess kurtosis indicates heavy tails and a sharp peak (leptokurtic).
    - Negative excess kurtosis indicates light tails and a flat peak (platykurtic).
    - An excess kurtosis of 0 indicates tails similar to those of the normal distribution (mesokurtic).
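
A rough way to see these measures in practice is to compare a symmetric Series with one that has a single large value on the right. The data below are made up for illustration. `Series.skew()` and `Series.kurt()` are used; `kurt()` reports excess kurtosis, so a normal distribution would score near 0.

```python
import pandas as pd

# Illustrative values only
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])
right_skewed = pd.Series([1, 1, 2, 2, 3, 3, 4, 20])  # one large value stretches the right tail

skew_sym = symmetric.skew()       # approximately 0: evenly spread around the mean
skew_right = right_skewed.skew()  # positive: longer right tail

kurt_sym = symmetric.kurt()       # negative: flat, light-tailed
kurt_right = right_skewed.kurt()  # positive: the outlier makes the tail heavy

print(round(skew_sym, 3), round(skew_right, 3))
print(round(kurt_sym, 3), round(kurt_right, 3))
```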

**Frequency Distribution:**

- Frequency table: A table that shows the frequency or count of each value in the dataset.
- Histogram: A graphical representation of the frequency distribution, showing the distribution of values in bins or intervals.
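
Both ideas can be sketched without drawing anything: `value_counts()` builds a frequency table for a categorical variable, and `np.histogram` returns the bin counts that a histogram would plot. The product names and amounts below are hypothetical examples, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Frequency table for a categorical variable (hypothetical product names)
products = pd.Series(["Espresso", "Latte", "Latte", "Mocha", "Latte", "Espresso"])
freq_table = products.value_counts()

# Bin counts for a numeric variable: 4 equal-width bins over 0..40
amounts = np.array([5, 12, 18, 22, 25, 31, 38, 40])
counts, bin_edges = np.histogram(amounts, bins=4, range=(0, 40))

print(freq_table)
print(counts, bin_edges)
```

The last bin is closed on the right, so the value 40 is counted in the 30 to 40 bin.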

**Measures of Association:**

- Correlation: A measure of the strength and direction of the linear relationship between two variables.
- Covariance: A measure of the joint variability between two variables.
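
The two measures are closely related: correlation is covariance rescaled by the standard deviations of both variables. A minimal sketch with made-up arrays (illustrative values only), using the sample formulas that `np.cov` applies by default:

```python
import numpy as np

# Illustrative values only
x_vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_vals = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
cov_xy = np.cov(x_vals, y_vals)[0, 1]

# np.corrcoef returns the 2x2 correlation matrix
r_xy = np.corrcoef(x_vals, y_vals)[0, 1]

# Correlation = covariance / (std_x * std_y), with matching ddof
r_manual = cov_xy / (x_vals.std(ddof=1) * y_vals.std(ddof=1))

print(cov_xy, round(r_xy, 4), round(r_manual, 4))
```

Covariance depends on the units of both variables, while correlation is unitless, which is why correlation is usually easier to compare across pairs of columns.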

### Bivariate Analysis

Bivariate analysis is a statistical method used to analyze the relationship between two variables simultaneously. 

#### Numerical-Numerical Analysis:

- Scatter Plots: Scatter plots with a regression line can show the relationship between two continuous variables. Each data point represents a combination of values from both variables.

- Correlation Analysis: Quantifies the strength and direction of the linear relationship between two continuous variables. The Pearson correlation coefficient (r) measures the degree of linear association between the variables. It ranges from -1 to 1, where:
	- r = 1: Perfect positive correlation
	- r = -1: Perfect negative correlation
	- r = 0: No linear correlation
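
The three reference values of r can be reproduced with small synthetic arrays (constructed for illustration, not drawn from the dataset). The last case shows that r = 0 rules out only a linear relationship: the second variable is fully determined by the first, yet r is still zero.

```python
import numpy as np

base = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Exact increasing linear relationship
r_pos = np.corrcoef(base, 2 * base + 1)[0, 1]

# Exact decreasing linear relationship
r_neg = np.corrcoef(base, -3 * base + 10)[0, 1]

# Symmetric U-shape: y depends entirely on x, but not linearly
centered = base - base.mean()
r_zero = np.corrcoef(centered, centered ** 2)[0, 1]

print(round(r_pos, 6), round(r_neg, 6), round(abs(r_zero), 6))
```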

#### Categorical-Categorical Analysis:

- Contingency tables (also known as cross-tabulations) display the frequency distribution of categories for two categorical variables. 
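
A contingency table can be built with `pd.crosstab`. The sketch below uses a small made-up frame named `demo` so that it does not overwrite `df`; the `Region` column and its values are hypothetical, while `Target Status` mirrors the column created earlier.

```python
import pandas as pd

# Hypothetical rows for illustration only
demo = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "West", "East"],
    "Target Status": ["Achieved", "Not Achieved", "Achieved",
                      "Achieved", "Not Achieved", "Achieved"],
})

# Rows: Region, columns: Target Status, cells: number of matching records
table = pd.crosstab(demo["Region"], demo["Target Status"])

print(table)
```

Passing `normalize="index"` to `pd.crosstab` would show row proportions instead of raw counts.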

#### Categorical-Numerical Analysis:

- Box plots or bar charts with groupings display the distribution of a numerical variable across different categories of a categorical variable.
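
The numbers behind such a grouped plot can be sketched with `groupby`. The frame below is made up for illustration (the `Product` column and all values are assumptions); `describe()` returns the count, mean, and quartiles that a box plot would draw for each category.

```python
import pandas as pd

# Hypothetical rows for illustration only
grouped_demo = pd.DataFrame({
    "Product": ["Latte", "Latte", "Latte", "Espresso", "Espresso"],
    "Sales": [100.0, 120.0, 140.0, 80.0, 90.0],
})

# One row of summary statistics per category
summary = grouped_demo.groupby("Product")["Sales"].describe()

# Median per category (the centre line of each box in a box plot)
medians = grouped_demo.groupby("Product")["Sales"].median()

print(summary[["count", "mean", "min", "max"]])
print(medians)
```

With seaborn available, the same comparison could be drawn with something like `sns.boxplot(data=grouped_demo, x="Product", y="Sales")`.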

#### Categorical vs Numerical

#### Numerical vs Numerical

#### Categorical vs Categorical

### Example on Multivariate Analysis

###### Ex. Number of franchises selling the product in each city