{{< include _include_d3.qmd >}}

In [None]:
#| eval: true
#| echo: false
#| output: false

import warnings
warnings.filterwarnings("ignore", "is_categorical_dtype")
warnings.filterwarnings("ignore", "use_inf_as_na")

import pandas as pd
#df = pd.read_csv('/home/sol-nhl/rnd/d/cca-cce/csv/iris.tsv', sep='\t')
df = pd.read_csv('https://raw.githubusercontent.com/nils-holmberg/cca-cce/main/csv/iris.csv', sep='\t')

## Frequency diagrams

The frequency diagram, or histogram, shows the distribution of the "sepal_length" variable in the `df` DataFrame. The example uses Seaborn's `histplot` function with 20 bins and overlays a kernel density estimate (KDE) curve, which gives a smoother view of the distribution. The x-axis covers the range of "sepal_length" values, and the y-axis shows the number of observations in each bin.

Here's the code snippet to generate the frequency diagram:

In [None]:
#| eval: true
#| echo: true
#| output: true

import seaborn as sns
import matplotlib.pyplot as plt

# Create a frequency diagram for 'sepal_length'
sns.histplot(df['sepal_length'], bins=20, kde=True)

# Add labels and title
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Frequency Diagram of Sepal Length')

# Show the plot
plt.show()

This visualization allows you to quickly grasp the shape, center, and spread of the "sepal_length" data.
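To make the idea of bins concrete, the following minimal sketch counts values per bin with NumPy's `histogram` function, which is roughly the computation behind the bars of a histogram. The sample values and the choice of 4 bins are assumptions made for illustration; they are not taken from the iris data.

```python
import numpy as np

# Hypothetical sepal lengths in cm (illustrative values, not the real dataset)
sepal_length = np.array([4.6, 4.9, 5.0, 5.1, 5.4, 5.7, 6.1, 6.3, 6.5, 7.0])

# Split the observed range into 4 equal-width bins and count the values in each
counts, edges = np.histogram(sepal_length, bins=4)

# Print each bin's interval together with its count
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {count}")

# Numerical summaries of center and spread that complement the picture
print("mean:", sepal_length.mean())
print("standard deviation:", sepal_length.std(ddof=1))
```

Each printed count corresponds to the height of one bar; more bins give a finer but noisier picture of the same data.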

## Bar plots

Bar plots are useful for displaying the relationship between a categorical variable and a numerical variable. In the example, we use Seaborn's `barplot` function to visualize the average "sepal_length" for each species in the `df` DataFrame. The x-axis represents the different species, and the y-axis shows the average "sepal_length" for each. 

Here's the code snippet to generate the bar plot:

In [None]:
#| eval: true
#| echo: true
#| output: true
#| warning: false

import seaborn as sns
import matplotlib.pyplot as plt

# Create a barplot for the 'species' column showing the average 'sepal_length'
sns.barplot(x='species', y='sepal_length', data=df, errorbar="ci")

# Add labels and title
plt.xlabel('Species')
plt.ylabel('Average Sepal Length')
plt.title('Average Sepal Length by Species')

# Show the plot
plt.show()

In this case, the `errorbar="ci"` parameter draws confidence interval bars (95% by default) around each mean. Pass `errorbar=None` instead to remove the bars and focus solely on the mean values. The plot provides a quick way to compare the average "sepal_length" across different species.
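The bar heights are per-group means, since `barplot` uses the mean as its default estimator. As a minimal sketch, the same numbers can be computed directly with a pandas `groupby`. The small `toy_df` DataFrame below is hypothetical and only stands in for `df`.

```python
import pandas as pd

# Hypothetical measurements (illustrative values, not the real dataset)
toy_df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 5.9, 6.1, 6.5, 6.7],
})

# Mean sepal length per species: the quantity each bar represents
species_means = toy_df.groupby("species")["sepal_length"].mean()
print(species_means)
```

Checking the table alongside the plot is a simple way to confirm that the chart shows what you expect.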

## Scatter plots

Scatter plots are excellent tools for visualizing relationships between two numerical variables. In the given example, we use Seaborn's `scatterplot` function to create a scatter plot of "sepal_length" against "sepal_width" from the `df` DataFrame. The points are colored based on the "species" category, providing an additional layer of information. 

Here's the code snippet to generate the scatter plot:

In [None]:
#| eval: true
#| echo: true
#| output: true

import seaborn as sns
import matplotlib.pyplot as plt

# Create a scatter plot for 'sepal_length' and 'sepal_width' colored by 'species'
g = sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=df)

# Add labels and title
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Scatter Plot of Sepal Dimensions by Species')

# Show the plot
plt.show()

This scatter plot allows you to identify patterns or relationships between "sepal_length" and "sepal_width" while also considering the species. Coloring the points by species adds a third variable to a two-dimensional plot, which helps reveal patterns that differ between groups.
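One reason to color by group is that the trend within each group can differ from the trend in the pooled data. The sketch below compares the overall Pearson correlation with the per-group correlations. The `toy_df` values and the group names `a` and `b` are made up for illustration and do not come from the iris data.

```python
import pandas as pd

# Hypothetical data with two groups that trend in opposite directions
toy_df = pd.DataFrame({
    "species": ["a"] * 4 + ["b"] * 4,
    "sepal_length": [4.8, 5.0, 5.2, 5.4, 6.0, 6.2, 6.4, 6.6],
    "sepal_width": [3.0, 3.2, 3.3, 3.6, 3.1, 2.9, 2.8, 2.6],
})

# Correlation across all rows, ignoring the grouping
overall_r = toy_df["sepal_length"].corr(toy_df["sepal_width"])

# Correlation computed separately within each group
per_species_r = {
    name: group["sepal_length"].corr(group["sepal_width"])
    for name, group in toy_df.groupby("species")
}

print("overall:", round(overall_r, 2))
for name, r in per_species_r.items():
    print(name, round(r, 2))
```

In this toy example the two groups have correlations of opposite sign, which a single uncolored scatter plot would hide.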

## Heatmaps

Heatmaps are excellent tools for visualizing complex relationships between numerical variables. In Python, the Seaborn library provides an easy-to-use `heatmap` function for this purpose. For instance, you can create a heatmap of the correlation matrix of numerical features in the `df` DataFrame. The color gradients in the heatmap represent the strength and direction of correlation, making it easier to identify highly or weakly correlated variables.

Here's the code snippet to generate the heatmap:

In [None]:
#| eval: true
#| echo: true
#| output: true

import seaborn as sns
import matplotlib.pyplot as plt

# Drop the 'species' column to only keep numerical columns
numerical_df = df.drop('species', axis=1)

# Calculate the correlation matrix for the numerical columns
correlation_matrix = numerical_df.corr()

# Create a heatmap to visualize the correlation matrix
g = sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

# Add title
plt.title('Heatmap of Feature Correlations')

# Show the plot
plt.show()

This heatmap makes it easier to understand the relationships between different numerical features, aiding in feature selection and further data analysis.
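Dropping "species" by name works when it is the only non-numeric column. As an alternative sketch, `select_dtypes` keeps every numeric column without naming the ones to drop. The `toy_df` DataFrame below is hypothetical and stands in for `df`.

```python
import pandas as pd

# Hypothetical data with two numeric columns and one text column
toy_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1],
    "petal_length": [1.4, 1.5, 4.9, 5.1, 5.9],
    "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
})

# Keep only numeric columns, whatever the other columns are called
numeric_df = toy_df.select_dtypes(include="number")

# Pearson correlation matrix: symmetric, with ones on the diagonal
corr = numeric_df.corr()
print(corr.round(2))
```

The resulting `corr` DataFrame can be passed to `sns.heatmap` in the same way as `correlation_matrix` above.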

## Save plot images

In [None]:
#| eval: true
#| echo: true
#| output: true

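# Save the most recent plot to an image file.
# `g` is the Axes returned by the last plotting call; `g.figure` is the Figure that contains it.
# The target directory must already exist.
g.figure.savefig('../../tmp/some.png', format='png', dpi=300)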

## Try it yourself!

In [None]:
#| eval: true
#| echo: true
#| output: true

import pandas as pd

penguins_df = pd.read_csv('https://raw.githubusercontent.com/nils-holmberg/cca-cce/main/csv/palmerpenguins.tsv', sep='\t')

![palmer penguins](https://nils-holmberg.github.io/cca-cce/res/img/lter_penguins.png){#fig-penguins height=214px width=360px}

**Tasks:**

1. Plot a bar chart showing the average bill length for each species.
2. Visualize the distribution of body mass using a histogram.
3. Create a scatter plot between bill length and bill depth, color-coded by species.
4. Plot a pair plot for numerical columns to visualize pairwise relationships, colored by species.
5. Display a box plot comparing the flipper length distributions of different species.
6. Create a scatter plot of bill length vs. flipper length with a regression line.
7. Visualize the distribution of flipper length for each species using violin plots.
8. Display a heatmap of the correlation matrix for the numerical columns.
9. Plot a bar chart of the number of penguins per island.
10. Create a scatter plot of bill depth vs. body mass, color-coded by sex, with a regression line.

In [None]:
#| eval: false
#| echo: false
#| output: false

import seaborn as sns
import matplotlib.pyplot as plt

# Task 1
plt.figure(figsize=(8, 6))
sns.barplot(data=penguins_df, x="species", y="bill_length_mm")
plt.title("Average Bill Length by Species")
plt.show()

# Task 2
plt.figure(figsize=(8, 6))
sns.histplot(data=penguins_df, x="body_mass_g")
plt.title("Distribution of Body Mass")
plt.show()

# Task 3
plt.figure(figsize=(8, 6))
sns.scatterplot(data=penguins_df, x="bill_length_mm", y="bill_depth_mm", hue="species")
plt.title("Bill Length vs Bill Depth by Species")
plt.show()

# Task 4
sns.pairplot(penguins_df, hue="species")
plt.show()

# Task 5
plt.figure(figsize=(8, 6))
sns.boxplot(data=penguins_df, x="species", y="flipper_length_mm")
plt.title("Flipper Length Distributions by Species")
plt.show()

# Task 6
plt.figure(figsize=(8, 6))
sns.regplot(data=penguins_df, x="bill_length_mm", y="flipper_length_mm")
plt.title("Bill Length vs Flipper Length with Regression Line")
plt.show()

# Task 7
plt.figure(figsize=(8, 6))
sns.violinplot(data=penguins_df, x="species", y="flipper_length_mm")
plt.title("Distribution of Flipper Length by Species")
plt.show()

# Task 8
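# Restrict to numeric columns; text columns such as species would otherwise raise an error in recent pandas versions
correlation_matrix = penguins_df.corr(numeric_only=True)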
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Heatmap of Correlation Matrix")
plt.show()

# Task 9
plt.figure(figsize=(8, 6))
sns.countplot(data=penguins_df, x="island")
plt.title("Number of Penguins Per Island")
plt.show()

# Task 10
# lmplot is a figure-level function that creates its own figure, so set the size with height and aspect
sns.lmplot(data=penguins_df, x="bill_depth_mm", y="body_mass_g", hue="sex", height=6, aspect=8/6)
plt.title("Bill Depth vs Body Mass by Sex with Regression Line")
plt.show()

These tasks and their solutions give students practice with the main Seaborn plot types covered above, using the Palmer Penguins dataset.
