# W3: Statistics, Matplotlib, Seaborn
- Contributors: Dr. Zhonghua Zheng, Yuan Sun
- Course Unit: Earth and Environmental Data Science (EART60702)
- Last modified date: 04 February, 2024

## Intended Learning Outcomes (ILOs)
- Statistical Analysis: Gain a practical understanding of core statistical analysis methods to evaluate and interpret data.

- Visualization Skills with Matplotlib and Seaborn: Learn to create effective visual representations of data using the plotting capabilities of Matplotlib and Seaborn.

## 1. Statistical Analysis (15 mins)
**SciPy**
- Statistical functions (scipy.stats): contains a large number of probability distributions, summary and frequency statistics, correlation functions and statistical tests, masked statistics, kernel density estimation, quasi-Monte Carlo functionality, and more: https://docs.scipy.org/doc/scipy/reference/stats.html

In [None]:
import numpy as np
import pandas as pd
from scipy import stats

In [None]:
# ref: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html#scipy.stats.spearmanr
x = np.array([7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4])
y = np.array([2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0])

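As a quick illustration of the summary statistics that `scipy.stats` offers, the sketch below summarises the array `x` defined above with `stats.describe`. It is only a minimal example and not part of the exercises; note that the variance it reports is the sample variance (`ddof=1`).

```python
import numpy as np
from scipy import stats

# Same x as in the cell above, repeated so this block runs on its own
x = np.array([7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4])

summary = stats.describe(x)
print("n        :", summary.nobs)
print("min, max :", summary.minmax)
print("mean     :", summary.mean)
print("variance :", summary.variance)  # sample variance (ddof=1)
print("skewness :", summary.skewness)
```
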
### 1.1 Pearson correlation coefficient

For more information: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $$

Q1: Please use `scipy.stats` to calculate the Pearson correlation coefficient
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

Q2: Please use `numpy` to calculate the Pearson correlation coefficient by implementing the equation above

In [None]:
mean_x = np.mean(x)
mean_y = np.mean(y)

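numerator = ...    # TODO: sum of (x_i - mean_x) * (y_i - mean_y)
denominator = ...  # TODO: square root of [sum of (x_i - mean_x)**2] * [sum of (y_i - mean_y)**2]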
pearson_r = numerator / denominator
pearson_r

Q3: Please use a built-in `numpy` function (rather than implementing the equation yourself) to calculate the Pearson correlation coefficient

Q4: Please use `pandas` to calculate the Pearson correlation coefficient

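Once you have tried Q1 to Q4 yourself, the sketch below shows one possible set of answers and checks that the four routes agree. The variable names (`r_scipy`, `r_manual`, `r_numpy`, `r_pandas`) are illustrative choices, and using `np.corrcoef` for Q3 and `Series.corr` for Q4 is an assumption about what the questions intend.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Same data as above, repeated so this block runs on its own
x = np.array([7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4])
y = np.array([2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0])

# Q1: scipy.stats returns the coefficient together with a p-value
r_scipy, p_value = stats.pearsonr(x, y)

# Q2: direct implementation of the equation
dx = x - np.mean(x)
dy = y - np.mean(y)
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Q3: off-diagonal entry of the 2x2 correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

# Q4: pandas Series.corr uses Pearson by default
r_pandas = pd.Series(x).corr(pd.Series(y))

print(r_scipy, r_manual, r_numpy, r_pandas)
```

All four values should be identical up to floating-point rounding.
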
### 1.2 Spearman's rank correlation coefficient
For more information: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

Q1: Please use `scipy.stats` to calculate Spearman's rank correlation coefficient
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

Q2: Please calculate Spearman's rank correlation coefficient using a method other than `scipy.stats`

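After attempting both questions, you can compare your answers with the sketch below. It relies on the fact that Spearman's coefficient is the Pearson coefficient of the ranked data, with tied values given their average rank. The choice of `pandas` for ranking and the variable names are assumptions made for illustration; other approaches are equally valid.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Same data as above, repeated so this block runs on its own
x = np.array([7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4])
y = np.array([2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0])

# Q1: scipy.stats returns the coefficient together with a p-value
rho_scipy, p_value = stats.spearmanr(x, y)

# Q2, option A: rank the data (ties get the average rank),
# then take the Pearson correlation of the ranks
rank_x = pd.Series(x).rank()
rank_y = pd.Series(y).rank()
rho_ranks = np.corrcoef(rank_x, rank_y)[0, 1]

# Q2, option B: ask pandas for the Spearman correlation matrix directly
df = pd.DataFrame({"x": x, "y": y})
rho_pandas = df.corr(method="spearman").loc["x", "y"]

print(rho_scipy, rho_ranks, rho_pandas)
```
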
## 2. Matplotlib (15 mins)
- Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python: https://matplotlib.org/stable/users/getting_started/
- Matplotlib makes easy things easy and hard things possible.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### 2.1 Draw a line plot
- `plot(x, y)`
- Attributes: `color`, `linewidth`, `marker`, `linestyle`, etc. For more examples, see: https://matplotlib.org/stable/gallery/lines_bars_and_markers/index.html

In [None]:
# a simple case
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)

fig, ax = plt.subplots() # Create a figure and axis
ax.plot(x, y) # Plot the sine function
plt.show()

In [None]:
# customize parameters
x = np.linspace(0, 2 * np.pi, 200)
y1 = np.sin(x)
y2 = np.cos(x)
fig1, ax = plt.subplots() 
ax.plot(x, y1, color = 'red', linewidth = 0.5, linestyle = ':', label='sin(x)') # set the color, width, and style of each line
ax.plot(x, y2, color = 'green', linewidth = 0.75, linestyle = '--', label='cos(x)') 
ax.legend() # Add a legend
plt.show()

In [None]:
# use a loop for a series of lines
# prepare data and variables
x = np.linspace(0, 2 * np.pi, 200)
y = [100 * np.sin(x), 100 * np.cos(x), np.tan(x), -x]
color = ['red', 'green', 'orange', 'blue']
linewidth = [1, 0.3, 0.75, 1.2]
linestyle = ['-', '--', ':', '-.'] 
marker = ['o', '*', 'D', '.']
alpha = [0.9, 0.75, 1, 0.4]
label = ['sin', 'cos', 'tan', 'neg']
fontsize = [8, 10, 7]

# create the figure 
fig2, ax = plt.subplots() 
ax.set_facecolor('white')
for i in range(4):
    ax.plot(x, y[i], color = color[i], linewidth = linewidth[i], linestyle = linestyle[i], 
            label= label[i], alpha = alpha[i], marker = marker[i], markersize=0.3) 
ax.legend(frameon=False) # Remove legend border
ax.set_title('This is an example.') # add a title to the axes
ax.set_xlabel('X-axis', fontsize= fontsize[1], color = color[0])  # Adjust X-axis label font size
ax.set_ylabel('Y-axis', fontsize= fontsize[1], color = color[1])  # Adjust Y-axis label font size
ax.grid(True, linestyle=linestyle[1], linewidth = linewidth[1], alpha=alpha[2]) # add grid
plt.show()

### 2.2 Customize figure style

In [None]:
# create the figure 
fig2, ax = plt.subplots(figsize=(10, 4)) # Specify width and height in inches
for i in range(4):
    ax.plot(x, y[i], color = color[i], linewidth = linewidth[i], linestyle = linestyle[i], label= label[i], alpha = alpha[i]) 
ax.legend(frameon=False) 
ax.set_title('This is an example.') 
ax.set_xlabel('X-axis', fontsize= fontsize[1], color = color[0])
ax.set_ylabel('Y-axis', fontsize= fontsize[1], color = color[1]) 
ax.yaxis.grid(True, linestyle=linestyle[1], linewidth = linewidth[1], alpha=alpha[2]) # add y-axis grid
ax.tick_params(axis='x', which='major', top=False, bottom=True, labelbottom=True, labelcolor='black', labelsize=8, pad=1, color = 'grey', width = 1, length = 4)
ax.spines['left'].set_visible(False) # spines visibility
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#ax.set_facecolor('white')

# Annotate the max value
max_value_index = np.argmax(y[0])  # Assuming you want to mark the max value for the first line (sin(x))
max_value_x = x[max_value_index]
max_value_y = y[0][max_value_index]
ax.annotate(f'Max Value: {max_value_y:.2f}', 
            xy=(max_value_x, max_value_y), 
            xytext=(max_value_x + 1, max_value_y), 
            color='red',
            fontsize=8,
            arrowprops=dict(color='blue', arrowstyle='->'))  # 'color' is used because an open '->' arrow has no filled face

plt.subplots_adjust(left=0.1, right=0.75, top=0.75, bottom=0.2) # adjust plot location
plt.show()

### 2.3 Questions

Q1: Comment out and then uncomment each line of code, one at a time, to see how the figure changes.

Q2: Given a dataset of fruit weights and their sweetness levels, create a scatter plot to visualize the relationship between weight and sweetness. Here are the data points: 
- `weights = [100, 150, 180, 200, 220]`
- `sweetness = [30, 45, 50, 55, 60]`

Annotate each point with the corresponding fruit name: `["Apple", "Banana", "Grape", "Orange", "Watermelon"]`. Label the axes appropriately.

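Once you have attempted Q2, the sketch below shows one possible approach. The source gives no units for weight or sweetness, so the axis labels are left without units; the title, colors, figure size, and label offsets are illustrative choices rather than requirements.

```python
import matplotlib.pyplot as plt

weights = [100, 150, 180, 200, 220]
sweetness = [30, 45, 50, 55, 60]
fruits = ["Apple", "Banana", "Grape", "Orange", "Watermelon"]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(weights, sweetness, color="tab:orange")

# Place each fruit name slightly up and to the right of its point
for name, w, s in zip(fruits, weights, sweetness):
    ax.annotate(name, xy=(w, s), xytext=(5, 5),
                textcoords="offset points", fontsize=8)

ax.set_xlabel("Weight")
ax.set_ylabel("Sweetness")
ax.set_title("Fruit weight vs. sweetness")
```

Using `textcoords="offset points"` keeps the labels at a fixed distance from the markers regardless of the axis limits. Call `plt.show()` afterwards if the figure is not displayed automatically.
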
Q3: Try to reproduce the style of a figure from the IPCC report: https://www.ipcc.ch/report/ar6/syr/figures/figure-3-6
- color
- size of labels and fonts
- location of title, text
- style of axis and ticks

## 3. Seaborn (15 mins)
- Seaborn is a library for making statistical graphics in Python: https://seaborn.pydata.org/tutorial/introduction.html
- It builds on top of matplotlib and integrates closely with pandas data structures.

### 3.0 If you haven't installed seaborn: 

In [None]:
!conda install -c conda-forge seaborn -y

In [None]:
import seaborn as sns

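To make the link between the three libraries concrete, the sketch below passes a pandas `DataFrame` to a Seaborn plotting function and draws onto a Matplotlib `Axes`, which can then be customized with the Matplotlib methods from Section 2. The DataFrame and its column names (`temperature`, `humidity`, `site`) are made-up values used only for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up DataFrame (hypothetical values, for illustration only)
df = pd.DataFrame({
    "temperature": [10.2, 12.5, 15.1, 18.3, 21.0, 23.4],
    "humidity": [80, 75, 70, 62, 55, 50],
    "site": ["A", "A", "A", "B", "B", "B"],
})

fig, ax = plt.subplots(figsize=(5, 3))

# Seaborn reads the columns by name and draws on the Axes we pass in
returned_ax = sns.scatterplot(data=df, x="temperature", y="humidity",
                              hue="site", ax=ax)

# The result is an ordinary Matplotlib Axes, so the usual methods apply
ax.set_title("Seaborn drawing on a Matplotlib Axes")
ax.set_xlabel("Temperature")
ax.set_ylabel("Humidity")
```
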
### 3.1 Seaborn fundamentals

In [None]:
# Apply the default theme
sns.set_theme()

# Load an example dataset
tips = sns.load_dataset("tips")
tips

In [None]:
# Create a visualization
sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
)

In [None]:
# Define your custom color palette, markers, and other attributes
custom_palette = {"Yes": "green", "No": "red"}
custom_markers = {"Yes": "D", "No": "X"}
sns.set_theme(style="whitegrid", rc={"grid.linestyle": "--", "grid.color": "gray", "grid.alpha": 0.3})
g = sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
    palette=custom_palette,  # Set the custom color palette
    markers=custom_markers # Set the custom marker styles
)
g.figure.set_facecolor("white")  # on older seaborn releases, use g.fig instead
plt.show()

### 3.2 Question 

Use Seaborn to create a pairplot to analyze the relationships between multiple variables in the `Iris` dataset. The `Iris` dataset contains measurements for iris flowers of three different species, including sepal length, sepal width, petal length, and petal width. Your objectives are:

- Load the Iris dataset directly from Seaborn.
- Create a pairplot to visualize the pairwise relationships between the variables.
- Use different colors to distinguish between species.
- Add a title to the pairplot (Note: You might need to use Matplotlib functions for this, as Seaborn's pairplot does not have a direct way to add a title).

In [None]:
# Load the Iris dataset
iris = sns.load_dataset('iris')

## Project 1

Please fork the repo (https://github.com/m-edal/Earth-Env-DS-MSc-Course/tree/main) and continue working on your project.