# Basics of Quantitative Graphs with GIT

Matti Kuikka 18.9.2023
- Examples of graphs for quantitative data, with assignments
- The examples use Seaborn
- Similar graphs can also be made with Matplotlib and Pandas


In [None]:
# Import the libraries used in this notebook
import pandas as pd
import numpy as np
import matplotlib as mp
import matplotlib.pyplot as plt 
import seaborn as sns 

## Load dataset 

### Read iris (flowers) dataset  

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. 

Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

The data is available in various formats (CSV, JSON) and can also be loaded with library functions (e.g. Seaborn's load_dataset or Scikit-Learn's load_iris).

https://en.wikipedia.org/wiki/Iris_flower_data_set
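As an alternative to Seaborn, the Scikit-Learn load_iris function mentioned above can be sketched as follows. This is a minimal sketch that assumes scikit-learn is installed; the name `iris_df` and the column renaming (to match the Seaborn column names) are choices made here for illustration only.

```python
from sklearn.datasets import load_iris

# as_frame=True returns the data as a pandas DataFrame in the .frame attribute
iris = load_iris(as_frame=True)
iris_df = iris.frame.copy()

# Rename 'sepal length (cm)' -> 'sepal_length' etc. (illustrative choice)
iris_df.columns = [c.replace(" (cm)", "").replace(" ", "_") for c in iris_df.columns]

# Replace the numeric target (0, 1, 2) with the species names
names = {i: str(name) for i, name in enumerate(iris.target_names)}
iris_df["species"] = iris_df["target"].map(names)
iris_df = iris_df.drop(columns="target")

print(iris_df.shape)
```

The rest of this notebook keeps using the Seaborn version of the dataset.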

In [None]:
# Read iris data using Seaborn 'load_dataset'
df = sns.load_dataset('iris')
df.shape # present structure: number of rows and columns

In [None]:
# Present data types of dataset
df.dtypes

In [None]:
df

If you want to display all the rows, there are several ways to do it
- See guidance e.g. in https://www.geeksforgeeks.org/how-to-display-all-rows-from-dataframe-using-pandas/ 
- Below is one way, using the Pandas function set_option

In [None]:
# Set display.max_rows to None to display
# all rows in the dataframe
pd.set_option('display.max_rows', None)
df
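Note that set_option changes the setting for the rest of the session. If all rows are needed only once, a temporary setting is another possibility. The sketch below uses `pd.option_context` on a small hypothetical DataFrame called `numbers` (not part of the iris data) and compares the text representation inside and outside the `with` block.

```python
import pandas as pd

# Hypothetical 100-row frame, used only to demonstrate the display option
numbers = pd.DataFrame({"value": range(100)})

max_rows_before = pd.get_option("display.max_rows")
short_view = repr(numbers)  # uses the current display.max_rows setting

# Inside the block all rows are shown; the old value is restored afterwards
with pd.option_context("display.max_rows", None):
    full_view = repr(numbers)

max_rows_after = pd.get_option("display.max_rows")
print(len(short_view.splitlines()), len(full_view.splitlines()))
```

A permanent change made with set_option can be undone with `pd.reset_option('display.max_rows')`.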

## Examine iris dataset

### Quantitative measurements
- The easiest way to get the key figures for quantitative data is to use 'describe()'
- It prints the number of values (count), the mean, the standard deviation and the so-called "five-number summary"
  - Minimum value
  - 1st quartile (25%) - the value below which 25% of the observations fall
  - Median (50%), 2nd quartile - the middle value of the data set
  - 3rd quartile (75%) - the value below which 75% of the observations fall
  - Maximum value

In [None]:
# find measurements for quantitative data
df.describe()
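To see where the numbers reported by describe() come from, the five-number summary can also be computed directly with quantile(). The sketch below uses a small hypothetical Series `values` instead of the iris data so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical measurements, chosen so the quartiles are easy to verify
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Minimum, 1st quartile, median, 3rd quartile, maximum
five_number = values.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_number)

# describe() reports the same values under 'min', '25%', '50%', '75%', 'max'
summary = values.describe()
print(summary)
```

Pandas interpolates linearly between neighbouring observations when a quartile falls between two values.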

### Graphs

#### 1. Histogram where quantitative data is separated by species  
- Use Seaborn: https://seaborn.pydata.org/generated/seaborn.histplot.html

In [None]:
sns.histplot?

In [None]:
# parameter kde = True draws a kernel density estimate curve over the histogram
# parameter hue splits the data by a qualitative variable
sns.histplot(data = df, x= 'sepal_length',bins=15, hue = 'species', kde = True) 
plt.xlabel('Sepal length (cm)', fontsize=14)
plt.ylabel('Freq', fontsize=12)   
plt.title("Histogram for iris species", fontsize=18)
plt.show()
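The bins parameter used above controls how many equal-width intervals the value range is divided into; the bar heights are the counts per interval. As a conceptual illustration only (Seaborn does its own binning internally), the same idea can be sketched with NumPy on a few hypothetical values:

```python
import numpy as np

# Hypothetical lengths, not taken from the iris data
lengths = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0])

# Split the range [1, 5] into 4 equal-width bins and count values in each
counts, edges = np.histogram(lengths, bins=4)

print(edges)   # bin borders
print(counts)  # number of values per bin
```

Every bin includes its left border and excludes its right border, except the last bin, which also includes the maximum value.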

In [None]:
# parameter multiple = 'dodge' draws a separate bar for each category of the hue variable
sns.histplot(data = df, x= 'petal_length',bins=15, hue = 'species', kde = True, multiple = 'dodge') 
plt.xlabel('Petal length (cm)', fontsize=14)
plt.ylabel('Freq', fontsize=12)   
plt.title("Histogram for iris species", fontsize=18)
plt.show()

#### Task: Create histogram for sepal width

In [None]:
# Your code in here

#### 2. Box plot shows the variation of quantitative variables
- It shows the 1st quartile (25%), the median (50%) and the 3rd quartile (75%) as a box, and the minimum and maximum as whiskers; with Seaborn's defaults, points far from the box are drawn separately as outliers
- Use Seaborn: https://seaborn.pydata.org/generated/seaborn.boxplot.html

In [None]:
sns.boxplot?

In [None]:
# x is the qualitative variable (one box per species), y is the quantitative variable
sns.boxplot(data = df, x= 'species', y = 'petal_length') 
plt.ylabel('Petal length (cm)', fontsize=12)
# plt.ylabel('', fontsize=12)   
plt.title("Boxplot for iris species", fontsize=18)
plt.show()
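Some points in a box plot may be drawn as separate markers beyond the whiskers. The sketch below shows how these limits can be computed by hand, assuming the default whisker length of 1.5 times the interquartile range (the `whis` parameter of boxplot). The Series `values` is hypothetical and contains one clearly deviating value.

```python
import pandas as pd

# Hypothetical measurements with one clearly deviating value (30)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range = height of the box

# Limits assumed here: 1.5 * IQR below Q1 and above Q3
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits would be drawn as separate outlier points
outliers = values[(values < lower_limit) | (values > upper_limit)]

# Whiskers end at the most extreme values that are still inside the limits
whisker_low = values[values >= lower_limit].min()
whisker_high = values[values <= upper_limit].max()

print(q1, q3, iqr, lower_limit, upper_limit)
print(outliers.tolist(), whisker_low, whisker_high)
```

In this case the upper whisker ends at 9 and the value 30 would be shown as an outlier.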

#### Task: Create similar figures for sepal length and width 

In [None]:
# Your code in here

#### 3. Scatter diagrams can be used to see relationships between quantitative variables 
- Use Seaborn: https://seaborn.pydata.org/generated/seaborn.scatterplot.html

In [None]:
sns.scatterplot?

In [None]:
# Two quantitative variables, one in x and another in y
sns.scatterplot(data = df, x= 'petal_length', y= 'sepal_length', hue = 'species') 
plt.xlabel('Petal length (cm)', fontsize=12)
plt.ylabel('Sepal length (cm)', fontsize=12)   
plt.title("Scatter diagram for iris species", fontsize=18)
plt.show()
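A similar figure can also be made with Matplotlib alone. The sketch below is one possible way, not the only one: it draws each group separately so that every species gets its own colour and legend entry. The small DataFrame `flowers` contains hypothetical values and is used only to keep the example self-contained.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements (not actual rows of the iris dataset)
flowers = pd.DataFrame({
    "petal_length": [1.4, 1.5, 1.3, 4.5, 4.7, 4.2],
    "sepal_length": [5.0, 5.1, 4.8, 6.0, 6.4, 5.8],
    "species": ["setosa", "setosa", "setosa",
                "versicolor", "versicolor", "versicolor"],
})

fig, ax = plt.subplots()

# One scatter call per species; Matplotlib picks a new colour for each call
for species, group in flowers.groupby("species"):
    ax.scatter(group["petal_length"], group["sepal_length"], label=species)

ax.set_xlabel("Petal length (cm)", fontsize=12)
ax.set_ylabel("Sepal length (cm)", fontsize=12)
ax.set_title("Scatter diagram with Matplotlib", fontsize=18)
ax.legend(title="species")
# Outside a notebook, call plt.show() to display the figure
```

Seaborn's hue parameter does this grouping automatically, which is why the Seaborn version needs only one call.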

##### The same figure can also contain several separate graphs
- Use Seaborn: https://seaborn.pydata.org/generated/seaborn.relplot.html

In [None]:
sns.relplot?

In [None]:
# With parameters col and row it is possible to get a separate graph for each value of a qualitative variable
# Parameter col_wrap restricts the number of graphs per row
sns.relplot(data = df, x= 'sepal_width', y= 'sepal_length', col = 'species', col_wrap = 2) 

#### 4. Pairplot
- Pairplot is an easy way to present scatter diagrams and histograms for all quantitative variables of a dataset in the same figure

In [None]:
sns.pairplot?

In [None]:
sns.pairplot(data = df) 

#### Task: Use pairplot and present data separated by species 
- How do you do this?
- Create the figure

In [None]:
# Your code in here