<img src="./intro_images/logo.png" width="100%" align="left" />

<table style="float:right;">
    <tr>
        <td>                      
            <div style="text-align: right">Dr Lijing Lin</div>
            <div style="text-align: right">Lecturer in Data Science Technology</div>
            <div style="text-align: right">School of Health Sciences</div>
            <div style="text-align: right">University of Manchester</div>
         </td>
     </tr>
</table>

#   Data Visualisation

In the previous notebook, we discussed data characteristics and summaries, and how to use visualisations to explore them. You also became familiar with the basics of visualisation in Matplotlib and Seaborn. In this notebook, we further explore data visualisation with Python, in particular plotting with `matplotlib` (directly and through `pandas`) and `seaborn`.


Estimated time needed: **40** minutes
    
<div class="alert alert-block alert-warning"><b>Learning Objectives:</b> 
<br/> By the end of this notebook you will have further developed your skills to:
    
* Create data visualisation with Python
* Use various Python libraries for visualisation

</div> 

***

**Load your data**

We will first load the datasets we need to perform the tasks within this notebook.

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
# I-SPY 2 trial data on breast cancer
breast_data = pd.read_csv("./data/ISPY2 Imaging Cohort 1 Clinical Data.csv")

# some simple data on examination scores
exam_data = pd.read_csv("./data/exam data.csv")

# diabetes data
diabetes_data = pd.read_csv("./data/pima-indians-diabetes.csv")

# suicide rates for different countries from the year 2000 to 2016. 
suicide_data = pd.read_csv("./data/Age-standardized suicide rates.csv")

In [None]:
breast_data.head(2)

In [None]:
exam_data.head(2)

In [None]:
diabetes_data.head(2)

In [None]:
suicide_data.head(2)

---
## Visualising Data using Matplotlib





### Matplotlib: Standard Python Visualisation Library 

We briefly discussed `matplotlib` in the previous notebook. **Matplotlib** (http://matplotlib.org/) is the primary plotting library we will explore in this course. As mentioned on their website: 
>Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.


#### Matplotlib.Pyplot

One of the core aspects of Matplotlib is `matplotlib.pyplot`. It is the scripting layer of Matplotlib, designed to provide a simple and intuitive interface for creating plots and visualisations, similar to how plots are created in MATLAB, a popular software used for numerical computing. 

Each `pyplot` function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.  `pyplot` provides an extensive collection of functions to create, modify, and customise plots. From basic line plots to more advanced visualisations like contour or 3D plots, `pyplot` offers tools for nearly every kind of plot you can think of, as well as fine-grained control over the appearance and layout. 


Let's start by importing `matplotlib` and `matplotlib.pyplot` as follows:

In [None]:
# we are using the inline backend
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt

In [None]:
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0

A small example of `pyplot`: here, `plt.plot()` creates a line plot, and `plt.title()`, `plt.xlabel()`, and `plt.ylabel()` modify the plot. 

In [None]:
# Simple plot
plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.title('Sample Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

**Behind the Scenes: Object-Oriented Interface**

* While `pyplot` provides a high-level interface, it is important to note that it internally manages `Figure` and `Axes` objects. These objects represent the actual plot elements.
* In other words, while you are using command-style functions like `plt.plot()` or `plt.title()`, `pyplot` is creating and managing `Figure` and `Axes` objects under the hood.
* If you need more control, you can switch to Matplotlib's object-oriented interface, where you explicitly work with `Figure` and `Axes` objects. This allows for more flexibility, but requires more code.
  
Example using object-oriented interface:

In [None]:
fig, ax = plt.subplots()  # Create figure and axes
ax.plot([1, 2, 3, 4], [10, 20, 25, 30])  # Plot on the axes
ax.set_title('Sample Plot')
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
plt.show()

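One thing the object-oriented interface makes straightforward is placing several plots in one figure and customising each independently. A minimal sketch, where the names `fig_demo` and `axes_demo` are illustrative only:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

# One figure holding two Axes side by side
fig_demo, axes_demo = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Each Axes is drawn on and titled independently
axes_demo[0].plot(x, y)
axes_demo[0].set_title('Line')

axes_demo[1].scatter(x, y)
axes_demo[1].set_title('Scatter')

# Label the axes of both panels
for panel in axes_demo:
    panel.set_xlabel('X-axis')
    panel.set_ylabel('Y-axis')

fig_demo.tight_layout()  # reduce overlap between the two panels
plt.show()
```
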
### Plotting in *pandas*

Fortunately, *pandas* provides built-in plotting methods that use Matplotlib behind the scenes. We have shown some examples in the data characteristics notebook. 

Plotting in *pandas* is as simple as appending a `.plot()` method to a series or dataframe.

**Documentation**:
- Plotting in Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/plotting.html


#### **Line plot with pandas Series/Dataframe**

**What is a line plot and why use it?**

A line chart or line plot displays information as a series of data points, called 'markers', connected by straight line segments. It is a basic type of chart common in many fields.
Use a line plot when you have a continuous data set. Line plots are best suited to trend-based visualisations of data over a period of time.

Now, plot a line graph of the suicide rate in Afghanistan using the pandas `.plot()` method.

In [None]:
# First extract data:
af_sr = suicide_data.loc[suicide_data.Country == 'Afghanistan', :]
af_sr.head()

In [None]:
# we first drop the non-numerical data (Country, Sex) and take mean of all Afghanistan data for each year
af_sr_mean =  af_sr.drop(['Country','Sex'],axis=1).mean()
af_sr_mean

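To see what the filter, drop and mean steps above do, here is a minimal sketch on a tiny made-up table. It assumes the same wide layout as `suicide_data` (one column per year, plus `Country` and `Sex`); the name `toy` and all values are hypothetical.

```python
import pandas as pd

# Made-up numbers in a wide layout: one column per year (column names are strings)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B'],
    'Sex': ['Male', 'Female', 'Male'],
    '2002': [3.0, 1.0, 9.0],
    '2001': [4.0, 2.0, 8.0],
    '2000': [5.0, 3.0, 7.0],
})

# Keep only the rows for one country
toy_a = toy.loc[toy.Country == 'A', :]

# Drop the non-numerical columns, then average each year column over the remaining rows
toy_mean = toy_a.drop(['Country', 'Sex'], axis=1).mean()

# The result is a Series indexed by year, in the original column order
print(toy_mean)
```
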
The resulting mean, `af_sr_mean`, is a pandas Series. We would like to plot the trend from 2000 to 2016, so we sort the index (year) and then draw a line plot by appending `.plot()` to the mean Series.

In [None]:
af_sr_mean = af_sr_mean.sort_index() 
af_sr_mean.plot(kind = 'line')
plt.show()

*pandas* automatically populated the x-axis with the index values (years), and the y-axis with the series values (average rate).  

Next, let's add a title and label the x- and y-axes using `plt.title()`, `plt.xlabel()`, and `plt.ylabel()` as follows:

In [None]:
 
af_sr_mean.plot(kind='line')

plt.title('Suicide rate: Afghanistan')
plt.ylabel('Age-standardised suicide rate')

plt.xlabel('Years')

plt.show() # need this line to show the updates made to the figure

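The `.plot()` method returns the Matplotlib `Axes` it drew on, so the same title and labels can also be set through the object-oriented interface. A minimal sketch, using a made-up Series (`toy_series`, hypothetical values) in place of `af_sr_mean`:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values standing in for a Series indexed by year
toy_series = pd.Series([4.0, 3.0, 2.0], index=['2000', '2001', '2002'])

# .plot() returns the Matplotlib Axes, which we keep for customisation
ax_line = toy_series.plot(kind='line')
ax_line.set_title('Toy series')
ax_line.set_xlabel('Years')
ax_line.set_ylabel('Value')

plt.show()
```
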
#### Other Plots

Many other plot types are available besides the default line plot, all of which can be accessed by passing the `kind` keyword to `plot()`. In addition to `line`, the available kinds are as follows:

* `bar` for vertical bar plots
* `barh` for horizontal bar plots
* `hist` for histogram
* `box` for boxplot
* `kde` or `density` for density plots
* `area` for area plots
* `pie` for pie plots
* `scatter` for scatter plots
* `hexbin` for hexbin plot

These plots correspond to specific plotting methods in `Matplotlib`, though `pandas` simplifies the process of calling them.

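As a brief illustration of the `kind` keyword, the sketch below draws a bar plot and a histogram from a small made-up table of scores (`toy_scores` and its values are hypothetical, not the course's `exam_data`). Passing `ax=` places each plot on its own Axes.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up scores for four students
toy_scores = pd.DataFrame(
    {'maths': [55, 70, 62, 88], 'physics': [60, 65, 58, 91]},
    index=['Ann', 'Ben', 'Cat', 'Dan'],
)

# Two empty Axes side by side, one per plot kind
fig_kinds, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(9, 3))

# kind='bar': one group of bars per row, one bar per column
toy_scores.plot(kind='bar', ax=ax_bar)
ax_bar.set_ylabel('Score')

# kind='hist': distribution of a single column
toy_scores['maths'].plot(kind='hist', bins=4, ax=ax_hist)
ax_hist.set_xlabel('Maths score')

fig_kinds.tight_layout()
plt.show()
```
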

<div class="alert alert-block alert-info">
<b>Task:</b>
<br> 
    We demonstrated a line plot above. You will practise other plots with more examples in the `Final Exercise` notebook.
</div>


## Plotting with Seaborn

**`seaborn`** is a Python visualisation library built on top of `matplotlib`. As you saw in the previous notebooks, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. It produces more complex visualisations with fewer lines of code, so well-styled plots are easier to create out of the box.

* **Seaborn Adds Statistical Features**: While Matplotlib provides basic plotting capabilities, Seaborn adds more statistical plot types and makes working with data frames more intuitive. Seaborn is built with functions designed for specific types of statistical plots like:
    * Pairplots (multiple scatter plots for all pairwise relationships),
    * Heatmaps,
    * Violin plots, and more.
      
These advanced plots, especially for visualising distributions and relationships, are more complicated to create from scratch with Matplotlib, but Seaborn simplifies them.

* **Matplotlib Customisation for Seaborn Plots**: Seaborn still relies on Matplotlib's plotting engine. So, after generating a plot with Seaborn, you can further modify it using Matplotlib functions (e.g. adjusting axis labels, titles, or figure sizes). Axes-level Seaborn functions such as `countplot`, `barplot` and `regplot` return Matplotlib Axes objects, so you can use Matplotlib commands to customise them further; figure-level functions such as `pairplot` return a Seaborn grid object that wraps a Matplotlib figure.

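As a small illustration of customising a Seaborn plot with Matplotlib, the sketch below draws a violin plot from made-up data (`toy_groups` is hypothetical) and then sets the title and y-label on the returned Axes.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up measurements for two groups
toy_groups = pd.DataFrame({
    'group': ['a'] * 5 + ['b'] * 5,
    'value': [1.0, 2.0, 2.5, 3.0, 4.0, 2.0, 3.5, 4.0, 4.5, 6.0],
})

# violinplot is an axes-level function: it returns the Matplotlib Axes it drew on
ax_violin = sns.violinplot(x='group', y='value', data=toy_groups)

# From here on, ordinary Matplotlib methods apply
ax_violin.set_title('Toy violin plot')
ax_violin.set_ylabel('Value (arbitrary units)')

plt.show()
```
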




<div class="alert alert-block alert-success"><b>Note:</b> 
You can learn more about <code>seaborn</code> by following its own <strong>User guide and tutorial</strong>: https://seaborn.pydata.org/tutorial.html
</div>





In this section, we will explore *seaborn*: first with categorical plots, and then to see how efficiently it creates regression lines and fits.

In [None]:
import seaborn as sns

### Categorical Plots
In `breast_data`, let's find out which categories of `Race` are recorded:


In [None]:
breast_data['Race'].unique()

#### **Countplot**
**A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.**

Let's count the records for each race in `breast_data` using a countplot on `Race`.


In [None]:
sns.countplot(x='Race', data=breast_data )
plt.show()

The labels on the x-axis do not look as expected, so we rotate the x-axis tick labels by 90 degrees.

In [None]:
sns.countplot(x='Race', data=breast_data )
plt.xticks(rotation=90)
plt.show()

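The bar heights in a countplot are simply the number of rows in each category, which `value_counts()` reports as a table. A minimal sketch on made-up data (`toy_cat` is hypothetical):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up categorical column
toy_cat = pd.DataFrame({'colour': ['red', 'blue', 'red', 'green', 'red', 'blue']})

# Tabulate the number of rows per category
counts = toy_cat['colour'].value_counts()
print(counts)

# The countplot draws one bar per category with these counts as heights
ax_count = sns.countplot(x='colour', data=toy_cat)
plt.show()
```
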
#### **Barplot**
**This plot groups the data by a categorical variable and plots an aggregated value (the mean by default) for each group, with confidence intervals.**<br> 

Let's plot `Age_at_Screening` by `Race`.


In [None]:
plt.figure(figsize=(8, 4))
sns.barplot(x='Race', y='Age_at_Screening', data=breast_data) 
plt.xticks(rotation=90)
plt.show()

You can verify the bar heights by grouping `Age_at_Screening` by `Race` and taking the `mean()`:


In [None]:
br_age = breast_data.groupby('Race')['Age_at_Screening'].mean()
br_age

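By default, Seaborn's `barplot` shows the mean of each group, and the error bar is a 95% confidence interval estimated by bootstrapping. The sketch below illustrates the idea for a single group using NumPy. The ages are made up, and the number of resamples (1000) and the random seed are arbitrary choices, so the interval will not match Seaborn's exactly.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Made-up ages for one group
ages = np.array([38, 45, 51, 42, 60, 47, 55, 49, 36, 58], dtype=float)

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(ages, size=ages.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the resampled means gives the confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f'mean = {ages.mean():.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})')
```
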
### Regression Plot 
With *seaborn*, generating a regression plot is as simple as calling the **regplot** function.

We will use `diabetes_data`, restricted to patients aged 25 or over and under 70.


In [None]:
diabetes_data.head()

In [None]:
df = diabetes_data.loc[(diabetes_data.Age<70) & (diabetes_data.Age>=25), :]

In [None]:
#seaborn is already imported at the start of this section

sns.regplot(x='Age', y='Glucose', data=df)
plt.show()

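`regplot` draws the fitted line but does not return its coefficients. If you need the slope and intercept, one option is to fit the same kind of straight line by least squares with `np.polyfit`. A minimal sketch on made-up data (`toy_xy` is hypothetical); on the notebook's data you would pass `df['Age']` and `df['Glucose']` instead.

```python
import numpy as np
import pandas as pd

# Made-up x/y pairs with a roughly linear relationship
toy_xy = pd.DataFrame({
    'x': [25, 30, 35, 40, 45, 50],
    'y': [101, 109, 122, 128, 141, 149],
})

# Degree-1 polynomial fit = straight line; coefficients come back highest power first
slope, intercept = np.polyfit(toy_xy['x'], toy_xy['y'], deg=1)
print(f'y = {slope:.2f} * x + {intercept:.2f}')
```
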
`regplot` also lets you:

* customise the colour of the scatter points and regression line, e.g. change the colour to green

* customise the marker shape and size

* add a title and x- and y-labels
  
* ... (many more)

See the `regplot` reference in the Seaborn documentation for the full list of options: 
https://seaborn.pydata.org/generated/seaborn.regplot.html

In [None]:
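sns.set(font_scale=1.5)  # set the font scale before plotting so it applies to this figure

plt.figure(figsize=(12, 8))

# regplot returns the Matplotlib Axes it drew on; keep it so we can customise the plot
ax = sns.regplot(x='Age', y='Glucose', data=df, color='green', marker='+', scatter_kws={'s': 80})

ax.set_title('Glucose versus Age')  # add title
ax.set(xlabel='Age', ylabel='Glucose')  # add x- and y-labels
plt.show()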

<div class="alert alert-block alert-info">
<b>Task:</b>
<br> 
   Using the suicide rate dataset, create a scatter plot with a regression line in Seaborn to visualise the suicide rate versus year, highlighting the differences between males and females.
</div>


In [None]:
# Your code here:



