# **Introduction to Python. Day 4**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 4!**

## **By now, we have covered:**

+ The overall workflow of Jupyter Notebooks in Google Colab
+ The basics of Python syntax and operations with lists
+ How to read in external datasets
+ How to navigate datasets - subsetting, accessing rows/columns
+ Operations with variables i.e. recoding, creating new variables
+ Exploratory data analysis


## **Today, we are going to look at:**

+ Data visualization

---



# **Preparing to work in Python**

In [None]:
# Import the necessary libraries

import pandas as pd # data analysis and management library
import numpy as np # multi-dimensional arrays

# Data visualization libraries

import seaborn as sns # easy-syntax plots
import matplotlib.pyplot as plt # deep-level library used to tweak the details of the seaborn plots


---

# **1. Data visualization**

<figure>
<left>
<img src=https://livecodestream.dev/post/how-to-build-beautiful-plots-with-python-and-seaborn/featured.jpg  width="450">
</figure>



In [None]:
# Let's upload the Pokemon dataset into the current Google Colab session

from google.colab import files

uploaded = files.upload()


In [None]:
# Here is an example of how to read a .csv file into a dataframe using the pandas library

df = pd.read_csv('pokemon.csv')

df

# Data source: https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6


In [None]:
print(df.shape) # dimensions of the dataset (printed explicitly, since a cell only displays its last expression)

df.info() # information about the variables


## **1.1 Numeric variables**

### **Histogram**

See more examples of `seaborn` histograms [here](https://seaborn.pydata.org/generated/seaborn.histplot.html).


In [None]:
# Basic histogram

plt.figure(figsize = (10, 7)) # define the size of the figure (width = 10, height = 7)

sns.histplot(data = df, x = 'Attack')

plt.show() # include this to avoid the plot object description being printed above the graph


In [None]:
# Change the width of bins by using either the 'binwidth' or the 'bins' argument

# binwidth - the width of bins
# bins - the total number of bins

plt.figure(figsize = (10, 7))

sns.histplot(data = df, x = 'Attack', binwidth = 5)

plt.show()

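To see how the two arguments differ, here is a rough sketch in plain `numpy`. It is not seaborn's internal code, and the `attack` values are made up for illustration: a `bins`-style request fixes the number of bins, while a `binwidth`-style request fixes their width and lets the number of bins follow from the data range.

```python
import numpy as np

# Hypothetical attack scores, made up for illustration
attack = np.array([5, 20, 35, 45, 50, 55, 60, 75, 90, 110])

# 'bins'-style: ask for a fixed number of equal-width bins
counts_n, edges_n = np.histogram(attack, bins = 5)

# 'binwidth'-style: fix the width and derive the bin edges from the data range
binwidth = 25
edges_w = np.arange(attack.min(), attack.max() + binwidth, binwidth)
counts_w, _ = np.histogram(attack, bins = edges_w)

print(edges_n, counts_n) # 5 bins, whatever width that requires
print(edges_w, counts_w) # bins of width 25, however many are needed
```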

In [None]:
# Adding plot title and specifying the axes labels

plt.figure(figsize = (10, 7))

sns.histplot(data = df, x = 'Attack', binwidth = 5)

plt.title('The distribution of Pokemon attack points', fontsize = 20) # plot title
plt.xlabel('Attack points', fontsize = 15) # X axis label
plt.ylabel('Number of Pokemons', fontsize = 15) # Y axis label

plt.show()


In [None]:
# Manually changing the limits of X and Y axes, and setting axes breaks

plt.figure(figsize = (10, 7))

sns.histplot(data = df, x = 'Attack', binwidth = 5)

plt.title('The distribution of Pokemon attack points', fontsize = 20)
plt.xlabel('Attack points', fontsize = 15)
plt.ylabel('Number of Pokemons', fontsize = 15)

plt.xlim(0, 200)
plt.ylim(0, 60)

plt.xticks(ticks = np.arange(0, 210, 10))
plt.yticks(ticks = np.arange(0, 65, 5))

plt.show()

# try running np.arange(0, 210, 10) in a separate code cell and see what you get as an output

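If you are curious what `np.arange()` hands over to `plt.xticks()`, this small sketch prints the tick positions. Note that the `stop` value itself is excluded, which is why the code above uses 210 to get ticks up to 200.

```python
import numpy as np

# np.arange(start, stop, step): the 'stop' value itself is not included
ticks = np.arange(0, 210, 10)

print(ticks)      # evenly spaced tick positions
print(len(ticks)) # how many ticks will be drawn
```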

In [None]:
# Change the color of bars

plt.figure(figsize = (10, 7))

sns.histplot(data = df, x = 'Attack', binwidth = 5, color = 'crimson')

plt.title('The distribution of Pokemon attack points', fontsize = 20)
plt.xlabel('Attack points', fontsize = 15)
plt.ylabel('Number of Pokemons', fontsize = 15)

plt.xlim(0, 200)
plt.ylim(0, 60)

plt.xticks(ticks = np.arange(0, 210, 10))
plt.yticks(ticks = np.arange(0, 65, 5))

plt.show()


List of colors available in `matplotlib` / `seaborn`

<figure>
<left>
<img src=https://i.stack.imgur.com/lFZum.png width="600">
</figure>

In [None]:
# The powerful thing about visualizing data is that you can perform a group comparison
# by drawing multiple graphs on one plot

# Here I add a hue = 'Legendary' argument to the sns.histplot() command,
# asking seaborn to draw histograms separately for Legendary and Non-legendary pokemons

# Note that 'hue' argument overrides 'color' argument,
# so color = 'crimson' from previous plot is no longer valid and I delete it

# I also amend the plot title a bit to reflect what the graph is now showing

plt.figure(figsize = (10, 7))

sns.histplot(data = df,
             x = 'Attack',
             binwidth = 5,
             hue = 'Legendary',
             alpha = 0.3) # transparency ranges from 0 to 1 

plt.title('The distribution of Pokemon attack points \nby legendary status', fontsize = 20)
plt.xlabel('Attack points', fontsize = 15)
plt.ylabel('Number of Pokemons', fontsize = 15)

plt.xlim(0, 200)
plt.ylim(0, 60)

plt.xticks(ticks = np.arange(0, 210, 10))
plt.yticks(ticks = np.arange(0, 65, 5))

plt.show()


In [None]:
# If you want to manually change the colors of the plot which uses 'hue' argument,
# use 'palette' argument

plt.figure(figsize = (10, 7))

sns.histplot(data = df,
             x = 'Attack',
             binwidth = 5,
             hue = 'Legendary',
             alpha = 0.3, # transparency ranges from 0 to 1 
             palette = 'Dark2') 

plt.title('The distribution of Pokemon attack points \nby legendary status', fontsize = 20)
plt.xlabel('Attack points', fontsize = 15)
plt.ylabel('Number of Pokemons', fontsize = 15)

plt.xlim(0, 200)
plt.ylim(0, 60)

plt.xticks(ticks = np.arange(0, 210, 10))
plt.yticks(ticks = np.arange(0, 65, 5))

plt.show()


`matplotlib` / `seaborn` color palettes available [here](https://matplotlib.org/stable/tutorials/colors/colormaps.html).

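A named palette is essentially a list of colors, and you can inspect it before using it. The sketch below also shows a dictionary palette, which assigns a specific color to each category of the `hue` variable. The two colors in `legendary_colors` are my own choice for illustration, and the plotting line is commented out because it needs the Pokemon dataframe `df`.

```python
import seaborn as sns

# Take the first two colors of the 'Dark2' palette (one per hue category)
two_colors = sns.color_palette('Dark2', n_colors = 2)
print(two_colors) # a list of (red, green, blue) tuples

# Alternatively, map each category of the hue variable to a color by hand
legendary_colors = {True: 'crimson', False: 'steelblue'}

# sns.histplot(data = df, x = 'Attack', hue = 'Legendary', palette = legendary_colors)
```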
### **Boxplot**

See more examples of `seaborn` boxplots [here](https://seaborn.pydata.org/generated/seaborn.boxplot.html).

In [None]:
# Basic boxplot

plt.figure(figsize = (10, 7))

sns.boxplot(data = df, y = 'HP') # use x = 'HP' instead of y to draw a horizontal boxplot

plt.show()


In [None]:
# Boxplots are not very useful without group comparison.
# Let's add a variable of Pokemon type on the X axis

plt.figure(figsize = (14, 10))

sns.boxplot(data = df, x = 'Type 1', y = 'HP')

plt.show()


In [None]:
# Adding plot title and specifying the axes labels

plt.figure(figsize = (14, 10))

sns.boxplot(data = df, x = 'Type 1', y = 'HP')

plt.title('The distribution of Pokemon health points (HP) by their type', fontsize = 20)
plt.xlabel('Pokemon type', fontsize = 15)
plt.ylabel('Health points (HP)', fontsize = 15)

plt.show()


In [None]:
# Tweak the X and Y axes

plt.figure(figsize = (14, 10))

sns.boxplot(data = df, x = 'Type 1', y = 'HP')

plt.title('The distribution of Pokemon health points (HP) by their type', fontsize = 20)
plt.xlabel('Pokemon type', fontsize = 15)
plt.ylabel('Health points (HP)', fontsize = 15)

plt.ylim(0, df['HP'].max() + 15) # from 0 to Max HP + 15

plt.xticks(rotation = 15) # add a bit of rotation to X ticks
plt.yticks(ticks = np.arange(0, df['HP'].max() + 15, 15)) # from 0 to Max HP + 15 by 15

plt.show()


In [None]:
# You can change the color palette if you wish

plt.figure(figsize = (14, 10))

sns.boxplot(data = df,
            x = 'Type 1',
            y = 'HP',
            palette = 'tab20b')

plt.title('The distribution of Pokemon health points (HP) by their type', fontsize = 20)
plt.xlabel('Pokemon type', fontsize = 15)
plt.ylabel('Health points (HP)', fontsize = 15)

plt.ylim(0, df['HP'].max() + 15)

plt.xticks(rotation = 15) # add a bit of rotation
plt.yticks(ticks = np.arange(0, df['HP'].max() + 15, 15))

plt.show()


In [None]:
# Lastly, you might want to arrange the types of Pokemons depending on their median HP value.

# This requires a few steps. Eventually, I create an object called desc_order
# in which I keep the types of Pokemons according to the median level of HP (from the highest one to the lowest one)

# As a first step to achieve this, I get the median values of HP for each Type of Pokemons: df.groupby('Type 1')['HP'].median()
# Then I sort the outcome vector in the descending order via .sort_values(ascending = False)
# And lastly, I take only the index labels (that is, the names of Pokemon types) out of the arranged vector

# desc_order then goes into the 'order' argument of sns.boxplot()

desc_order = df.groupby('Type 1')['HP'].median().sort_values(ascending = False).index

plt.figure(figsize = (14, 10))

sns.boxplot(data = df,
            x = 'Type 1',
            y = 'HP',
            palette = 'tab20b',
            order = desc_order) # arrange by median values

plt.title('The distribution of Pokemon health points (HP) by their type', fontsize = 20)
plt.xlabel('Pokemon type', fontsize = 15)
plt.ylabel('Health points (HP)', fontsize = 15)

plt.ylim(0, df['HP'].max() + 15)

plt.xticks(rotation = 15)
plt.yticks(ticks = np.arange(0, df['HP'].max() + 15, 15))

plt.show()

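The one-line `desc_order` command above can be easier to follow when broken into its three steps. Here is a sketch on a tiny made-up dataframe (`toy` and its values are hypothetical; only the column names match the Pokemon data).

```python
import pandas as pd

# A tiny made-up dataset with the same column names as the Pokemon data
toy = pd.DataFrame({'Type 1': ['Grass', 'Grass', 'Fire', 'Fire', 'Water', 'Water'],
                    'HP': [45, 60, 39, 78, 44, 100]})

medians = toy.groupby('Type 1')['HP'].median()          # step 1: median HP per type
sorted_medians = medians.sort_values(ascending = False) # step 2: highest median first
desc_order = sorted_medians.index                       # step 3: keep only the type names

print(sorted_medians)
print(list(desc_order))
```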

In [None]:
# There are clearly two Pokemon outliers among the Normal type that score extremely high on the HP variable
# compared to other Pokemons

# Their HP values are around 255, so let me keep only those Pokemons that score less than 200,
# and see if it makes the graph prettier

# You can subset data right within the sns.boxplot() command

desc_order = df.groupby('Type 1')['HP'].median().sort_values(ascending = False).index

plt.figure(figsize = (14, 10))

sns.boxplot(data = df[df['HP'] < 200], # note that now, instead of data = df, it's data = df[df['HP'] < 200]
            x = 'Type 1',
            y = 'HP',
            palette = 'tab20b',
            order = desc_order)

plt.title('The distribution of Pokemon health points (HP) by their type', fontsize = 20)
plt.xlabel('Pokemon type', fontsize = 15)
plt.ylabel('Health points (HP)', fontsize = 15)

plt.ylim(0, 200) # and I change the upper limit of Y scale accordingly

plt.xticks(rotation = 15)
plt.yticks(ticks = np.arange(0, 200, 15)) # and Y ticks as well

plt.show()

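In case the `df[df['HP'] < 200]` expression looks cryptic, here is a minimal sketch of what it does, using a made-up dataframe (`toy`, its names, and its values are hypothetical). The inner condition produces a column of True/False values, and the outer square brackets keep only the rows marked True.

```python
import pandas as pd

# Made-up data: two extreme HP values and two ordinary ones
toy = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                    'HP': [250, 255, 160, 35]})

mask = toy['HP'] < 200 # True for rows that satisfy the condition
filtered = toy[mask]   # keep only the rows where mask is True

print(mask.tolist())
print(len(toy), len(filtered)) # number of rows before and after filtering
```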

### **Scatterplot**

See more examples of `seaborn` scatterplots [here](https://seaborn.pydata.org/generated/seaborn.scatterplot.html).

In [None]:
# Basic scatterplot

plt.figure(figsize = (10, 7))

sns.scatterplot(data = df, x = 'Speed', y = 'Attack')

plt.show()


In [None]:
# Adding plot title and specifying the axes labels

plt.figure(figsize = (10, 7))

sns.scatterplot(data = df, x = 'Speed', y = 'Attack')

plt.title('The relationship between Pokemon speed and attack', fontsize = 20)
plt.xlabel('Speed level', fontsize = 15)
plt.ylabel('Attack level', fontsize = 15)

plt.show()


In [None]:
# Manually changing the limits of X and Y axes, and setting axes breaks

plt.figure(figsize = (10, 7))

sns.scatterplot(data = df, x = 'Speed', y = 'Attack')

plt.title('The relationship between Pokemon speed and attack', fontsize = 20)
plt.xlabel('Speed level', fontsize = 15)
plt.ylabel('Attack level', fontsize = 15)

plt.xlim(0, df['Speed'].max() + 10)
plt.ylim(0, df['Attack'].max() + 10)

plt.xticks(ticks = np.arange(0, 190, 10))
plt.yticks(ticks = np.arange(0, 200, 10))

plt.show()


In [None]:
# Changing color of dots, decreasing their size, and changing the type of marker

plt.figure(figsize = (10, 7))

sns.scatterplot(data = df,
                x = 'Speed',
                y = 'Attack',
                s = 35, # make points a bit smaller
                marker = 'X', # available markers: https://matplotlib.org/stable/api/markers_api.html
                color = 'forestgreen') # change the color of points

plt.title('The relationship between Pokemon speed and attack', fontsize = 20)
plt.xlabel('Speed level', fontsize = 15)
plt.ylabel('Attack level', fontsize = 15)

plt.xlim(0, df['Speed'].max() + 10)
plt.ylim(0, df['Attack'].max() + 10)

plt.xticks(ticks = np.arange(0, 190, 10))
plt.yticks(ticks = np.arange(0, 200, 10))

plt.show()


In [None]:
# Now let's split the scatterplot by Pokemon legendary status and see if we can identify any patterns

plt.figure(figsize = (14, 8))

sns.scatterplot(data = df,
                x = 'Speed',
                y = 'Attack',
                s = 100,
                hue = 'Legendary', # note that since hue is introduced, we don't need 'color' argument any more
                palette = 'YlOrRd') # changing colormap

plt.title('The relationship between Pokemon speed and attack', fontsize = 20)
plt.xlabel('Speed level', fontsize = 15)
plt.ylabel('Attack level', fontsize = 15)

plt.xlim(0, df['Speed'].max() + 10)
plt.ylim(0, df['Attack'].max() + 10)

plt.xticks(ticks = np.arange(0, 190, 10))
plt.yticks(ticks = np.arange(0, 200, 10))

plt.show()


In [None]:
# You can assign more than one property of the graph to the same variable,
# e.g. both the color and the type of marker can depend on the Pokemon's legendary status

plt.figure(figsize = (14, 8))

sns.scatterplot(data = df,
                x = 'Speed',
                y = 'Attack',
                s = 100,
                hue = 'Legendary', # note that since hue is introduced, we don't need color argument any more
                palette = 'YlOrRd', # changing colormap
                style = 'Legendary')

plt.title('The relationship between Pokemon speed and attack', fontsize = 20)
plt.xlabel('Speed level', fontsize = 15)
plt.ylabel('Attack level', fontsize = 15)

plt.xlim(0, df['Speed'].max() + 10)
plt.ylim(0, df['Attack'].max() + 10)

plt.xticks(ticks = np.arange(0, 190, 10))
plt.yticks(ticks = np.arange(0, 200, 10))

plt.show()


## **1.2 Categorical variables**

### **Count plot (Bar chart)**

See more examples of `seaborn` count plots [here](https://seaborn.pydata.org/generated/seaborn.countplot.html).


In [None]:
# Basic count plot

plt.figure(figsize = (12, 7))

sns.countplot(data = df, x = 'Generation')

plt.show()


In [None]:
# Count plot with axes labels, a plot title, tweaked X and Y scales, and a different color palette

plt.figure(figsize = (12, 7))

sns.countplot(data = df, x = 'Generation', palette = 'Oranges')

plt.title('Number of Pokemons in different generations', fontsize = 20)
plt.xlabel('Pokemon generation', fontsize = 15)
plt.ylabel('Number of Pokemons', fontsize = 15)

plt.ylim(0, 170)
plt.yticks(np.arange(0, 180, 10))

# You can manually change the names of X ticks too if you wish
plt.xticks(ticks = [0, 1, 2, 3, 4, 5], # positions of bars
           labels = ['Gen. 1', 'Gen. 2', 'Gen. 3', 'Gen. 4', 'Gen. 5', 'Gen. 6']) # new labels

plt.show()


In [None]:
# Subsetting graph by Legendary status via 'hue' argument

plt.figure(figsize = (12, 7))

sns.countplot(data = df,
              x = 'Generation',
              palette = 'Oranges',
              hue = 'Legendary')

plt.title('Number of Pokemons in different generations\nby legendary status', fontsize = 20)
plt.xlabel('Pokemon generation', fontsize = 15)
plt.ylabel('Number of Pokemons', fontsize = 15)

plt.ylim(0, 170)
plt.yticks(np.arange(0, 180, 10))

# You can manually change the names of X ticks too if you wish
plt.xticks(ticks = [0, 1, 2, 3, 4, 5], # positions of bars
           labels = ['Gen. 1', 'Gen. 2', 'Gen. 3', 'Gen. 4', 'Gen. 5', 'Gen. 6']) # new labels

plt.show()

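The bar heights in a count plot are simply the frequencies of each category, so you can check them with `value_counts()`. A minimal sketch on made-up data (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# Made-up generations for six Pokemons
toy = pd.DataFrame({'Generation': [1, 1, 1, 2, 2, 3]})

# Count the rows per category and sort by category label rather than by frequency
counts = toy['Generation'].value_counts().sort_index()

print(counts)
```

With the real data, the same check would be `df['Generation'].value_counts().sort_index()`.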

### **Stacked bar chart**



In [None]:
# Lifehack: use sns.histplot() with multiple = 'stack' argument

plt.figure(figsize = (12, 7))

sns.histplot(data = df, y = 'Generation',
             hue = 'Legendary',
             multiple = 'stack', # stack Legendary and Non-Legendary Pokemons
             palette = 'Oranges',
             hue_order = [True, False]) # changing the order of hue categories

plt.title('Number of Pokemons in different generations\nby legendary status', fontsize = 20)
plt.xlabel('Number of Pokemons', fontsize = 15)
plt.ylabel('Pokemon generation', fontsize = 15)

plt.xlim(0, 170)
plt.xticks(np.arange(0, 180, 10))

plt.yticks(ticks = [1.25, 2.15, 3.05, 3.95, 4.9, 5.80],
           labels = ['Gen. 1', 'Gen. 2', 'Gen. 3', 'Gen. 4', 'Gen. 5', 'Gen. 6'])

plt.show()

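The numbers behind a stacked bar chart are a two-way frequency table, which `pd.crosstab()` produces directly. This is a sketch on made-up data (the `toy` dataframe is hypothetical). The last line, left commented out, is one possible alternative way to draw the stacked bars through pandas rather than through `sns.histplot()`.

```python
import pandas as pd

# Made-up data: generation and legendary status for six Pokemons
toy = pd.DataFrame({'Generation': [1, 1, 1, 2, 2, 3],
                    'Legendary': [False, False, True, False, True, False]})

# Rows are generations, columns are legendary status, cells are counts
counts = pd.crosstab(toy['Generation'], toy['Legendary'])

print(counts)

# counts.plot(kind = 'barh', stacked = True) # horizontal stacked bars from the table
```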

## **1.3 Visualizing aggregated analysis**

### **Bar plot**

A **bar plot** represents **an estimate of central tendency** for a numeric variable with the **height of each rectangle** and provides some indication of the **uncertainty** around that estimate using **error bars**.

See more examples of `seaborn` bar plots [here](https://seaborn.pydata.org/generated/seaborn.barplot.html).

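To make the idea of the error bars more concrete: as far as I understand the seaborn documentation, by default they show a 95% confidence interval obtained by bootstrapping. The sketch below imitates that by hand with `numpy` for a single group. It is an illustration rather than seaborn's actual code, and the `speed` values are made up.

```python
import numpy as np

rng = np.random.default_rng(seed = 42) # fixed seed so the result is reproducible

# Hypothetical speed values for one group of Pokemons, made up for illustration
speed = np.array([45, 60, 80, 100, 65, 43, 58, 78, 90, 70])

# Resample the data with replacement many times and record the mean of each resample
boot_means = np.array([rng.choice(speed, size = len(speed), replace = True).mean()
                       for _ in range(1000)])

bar_height = speed.mean()                             # what the bar shows
lower, upper = np.percentile(boot_means, [2.5, 97.5]) # what the error bar spans

print(bar_height, lower, upper)
```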

In [None]:
# Say we want to plot a mean value of Speed for each Generation of Pokemons

# Basic bar plot

plt.figure(figsize = (12, 7))

sns.barplot(data = df, x = 'Generation', y = 'Speed')

plt.show()

# You can get the exact values of bar height from the aggregated analysis code below: 
# I.e. get the mean speed level for Pokemons in different generations

df.groupby('Generation')['Speed'].mean()


In [None]:
# Say we want to plot a mean value of Speed for each Generation of Pokemons

# Adding title and axes labels, tweaking X and Y axes, changing color palette

plt.figure(figsize = (12, 7))

sns.barplot(data = df,
            x = 'Generation',
            y = 'Speed',
            palette = 'YlGn')

plt.title('Average speed level per Pokemon generation', fontsize = 20)
plt.xlabel('Pokemon generation', fontsize = 15)
plt.ylabel('Average speed level', fontsize = 15)

plt.ylim(0, 80)
plt.yticks(np.arange(0, 90, 10))

plt.xticks(ticks = [0, 1, 2, 3, 4, 5],
           labels = ['Gen. 1', 'Gen. 2', 'Gen. 3', 'Gen. 4', 'Gen. 5', 'Gen. 6'])

plt.show()


In [None]:
# Mean is a default estimator for sns.barplot() but you can override it and calculate median or other statistic instead

plt.figure(figsize = (12, 7))

sns.barplot(data = df,
            x = 'Generation',
            y = 'Speed',
            palette = 'YlGn',
            estimator = np.median, # use numpy median function instead of default mean
            capsize = 0.2) # add caps to the confidence interval boundaries

plt.title('Median speed level per Pokemon generation', fontsize = 20)
plt.xlabel('Pokemon generation', fontsize = 15)
plt.ylabel('Median speed level', fontsize = 15)

plt.ylim(0, 90)
plt.yticks(np.arange(0, 100, 10))

plt.xticks(ticks = [0, 1, 2, 3, 4, 5],
           labels = ['Gen. 1', 'Gen. 2', 'Gen. 3', 'Gen. 4', 'Gen. 5', 'Gen. 6'])

plt.show()

# You can get the exact values of bar height from the aggregated analysis code below: 
# I.e. get the median speed level for Pokemons in different generations
# df.groupby('Generation')['Speed'].median()


## **Exercise**

Alright, it's time to practice!


In [None]:
# Let's upload the mtcars dataset into the current Google Colab session

from google.colab import files

uploaded = files.upload()


In [None]:
# And save it as an mtcars object

mtcars = pd.read_csv('mtcars.csv')

mtcars.head(10)


In [None]:
# Here is the description of the dataset

# Motor Trend Car Road Tests data
# The data was extracted from the 1974 Motor Trend US magazine,
# and comprises fuel consumption and 10 aspects of automobile design and performance
# for 32 automobiles (1973–74 models).

# mpg - Miles/(US) gallon
# cyl - Number of cylinders
# disp - Displacement (cu.in.)
# hp - Gross horsepower
# drat - Rear axle ratio
# wt - Weight (1000 lbs)
# qsec - 1/4 mile time
# vs - Engine (0 = V-shaped, 1 = straight)
# am - Transmission (0 = automatic, 1 = manual)
# gear - Number of forward gears
# carb - Number of carburetors


Using the `mtcars` dataset, please draw the following graphs (and make them as clear and pretty as possible):

+ **Histogram** of **displacement** for cars with V-shaped and straight **engines**
+ **Boxplot** of **gross horsepower** for cars with different number of **cylinders**
+ **Scatterplot** of **weight** and **1/4 mile time** for cars with automatic and manual **transmission**
+ **Count plot** of number of **carburetors**
+ **Bar plot** of average **miles per gallon** for cars with different number of **forward gears**


In [None]:
# Histogram of displacement for cars with V-shaped and straight engines




In [None]:
# Boxplot of gross horsepower for cars with different number of cylinders




In [None]:
# Scatterplot of weight and 1/4 mile time for cars with automatic and manual transmission




In [None]:
# Count plot of number of carburetors




In [None]:
# Bar plot of average miles per gallon for cars with different number of forward gears




---

# **That's the end of Day 4!**
