<a href="https://colab.research.google.com/github/mfernandes61/py-dropin-session/blob/main/pydropin_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise in Exploratory Data Analysis.  
## Uses the [Pandas](https://pandas.pydata.org/) and [Plotnine](https://plotnine.org/) packages
## Created as a teaching aid by Mark Fernandes.
University of Cambridge.

## Useful Python Notebook insight.  
We can run Linux command-line utilities from a notebook code cell by prefixing them with an exclamation mark (`!`).   
In the next cell we use the **wget** tool to download a data file from the internet onto our computer (see the [Wget documentation](https://www.gnu.org/software/wget/manual/wget.html)).    

This is a simple exploration of a dataset using Python. A much more in-depth example of using this data is given [in this paper](https://pmc.ncbi.nlm.nih.gov/articles/PMC8943493/).  




In [None]:
# Pull the Pima-Indians-Diabetes-Data using wget
!wget -O diabetes.csv https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv
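
If **wget** is not available (for example on some Windows setups), the same download can be done from Python itself. The following is a minimal sketch using only the standard library; the helper name `fetch_if_missing` is made up for this illustration and is not part of the original exercise.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return the path.

    Hypothetical helper for illustration only.
    """
    dest = Path(dest)
    if not dest.exists():
        # urlretrieve saves the remote file under the given local name
        urlretrieve(url, str(dest))
    return dest
```

Calling `fetch_if_missing("https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv", "diabetes.csv")` should leave the same `diabetes.csv` file in the working directory as the wget cell, and it skips the download when the file is already there.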

## Build our analysis environment   
- Load packages (**numpy** & **pandas**)
- Read the data into a Python data structure (a pandas **DataFrame**).  

In [None]:
import numpy as np
import pandas as pd

# Read data into python (read_csv is available through the pd alias)
pima = pd.read_csv("diabetes.csv")

pima

## What is the data like?   
We start with an algebraic (numerical) representation of the data:
- Characterise the data: do we have the expected number of rows & columns?
- Get the summary statistics for the data (these can indicate data provenance issues).
- Are there any indications of possibly correlated variables?

In [None]:
# give data the once-over
print(pima.shape)
print(pima.head())  # head is a method, so it must be called to return the first rows
print(pima.columns)

In [None]:
# get summary statistics for our data
pima.describe()

In [None]:
# can we see any correlations in our data?
print(pima.corr(method="pearson"))
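
A full correlation matrix can be hard to scan by eye. As a rough sketch, the stronger pairs can be pulled out by keeping only the upper triangle of the matrix and filtering on the absolute value. The helper `strong_pairs`, the toy columns `x`, `y`, `z` and the 0.4 cutoff are all assumptions made for this illustration; in the notebook you would pass `pima` instead of `toy`.

```python
import numpy as np
import pandas as pd


def strong_pairs(df, threshold=0.4):
    """Return variable pairs whose |Pearson r| exceeds `threshold`.

    Hypothetical helper for illustration only.
    """
    corr = df.corr(method="pearson")
    # Keep the upper triangle only, so each pair appears once
    # and the diagonal (always 1.0) is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    strong = pairs[pairs.abs() > threshold]
    # Strongest relationships first, regardless of sign
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy stand-in data (made-up values): y is exactly 2 * x, z just alternates
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 6, 8, 10, 12],
    "z": [1, -1, 1, -1, 1, -1],
})
result = strong_pairs(toy)
print(result)
```

With the toy data only the `x`/`y` pair passes the cutoff. A high coefficient found this way is still only a prompt to look at the data more closely, not a conclusion.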

## Moving from algebraic to geometrical exploration.  
Taken at face value, these results suggest some potential correlations: some pairs of variables have a Pearson coefficient (which measures the strength of a linear relationship) above 0.4.   
Rather than relying on summary **statistics** alone, we should also examine the data graphically.   
Using both approaches can give us useful insights into our data.   
For example, a histogram shows the distribution of a variable, which helps us decide on suitable statistical tests (e.g. is the data normally distributed?).   

In [None]:
# plots using plotnine
%matplotlib inline
import plotnine as p9

(p9.ggplot(data=pima,
           mapping=p9.aes(x='Age', y='Pregnancies'))
    + p9.geom_point()
)


In [None]:

(p9.ggplot(data=pima,
           mapping=p9.aes(x='Outcome', y='Glucose'))
    + p9.geom_point()
)

In [None]:

(p9.ggplot(data=pima,
           mapping=p9.aes(x='Insulin', y='SkinThickness'))
    + p9.geom_point()
)

In [None]:

(p9.ggplot(data=pima,
           mapping=p9.aes(x='BMI', y='SkinThickness'))
    + p9.geom_point()
)

## Where does this take us?   

- Does this last plot suggest a relationship?
- What is a third way of interpreting this data and what issue with the data does it illustrate?



## Hint.  
Look closely at skin thickness and BMI.  
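
One way to follow up on this is to count how often each column holds the value 0. A measurement such as skin thickness or BMI cannot really be zero, so zeros there may be standing in for missing readings. The sketch below uses a tiny toy table with made-up values so that it runs on its own; the choice of which columns to treat as measurements is an assumption, and in the notebook you would apply the same calls to `pima`.

```python
import numpy as np
import pandas as pd

# Toy stand-in data (made-up values) with the same column names as the dataset
toy = pd.DataFrame({
    "SkinThickness": [35, 0, 29, 0, 23],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Count zeros per column; a zero in Outcome is a genuine value,
# but a zero skin thickness or BMI is not physically plausible
zero_counts = (toy == 0).sum()
print(zero_counts)

# Assumption: treat zeros in the measurement columns as missing values
measurement_cols = ["SkinThickness", "BMI"]
cleaned = toy.copy()
cleaned[measurement_cols] = cleaned[measurement_cols].replace(0, np.nan)
print(cleaned)
```

Once the zeros are marked as `NaN`, pandas leaves them out of calculations such as `describe()` and `corr()`, so the summary statistics and correlations can be compared before and after.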

Let's break out another pandas tool to further explore correlation between our variables.   


In [None]:
# A scatter-plot matrix is tricky to create in plotnine, so let's use a pandas
# plotting tool to explore correlation between ALL of our variables
from pandas.plotting import scatter_matrix
# diagonal='kde' draws a density estimate for each variable on the diagonal
scatter_matrix(pima, alpha=0.2, diagonal='kde')

To improve your Python programming it is worth visiting the Python Documentation [site](https://docs.python.org/3/).   
