# DA2 : Data (tabular) and plotting - live demo

<u>Reading a CSV (Comma-Separated Values) file into Python.</u>  
We use the pandas __read_csv__ function.    
_CSV is the recommended format to store data for program interoperability._

In [None]:
import pandas as pd
surveys = pd.read_csv("data/surveys.csv")

# ALWAYS take a look at data after reading it in. pandas has a handy .head() method.
surveys.head()
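
If `data/surveys.csv` is not to hand, the same `read_csv` call can be tried on a small in-memory sample. The sketch below assumes a made-up three-row extract using a few of the column names seen in this demo (the values are illustrative, not taken from the real file), with `io.StringIO` standing in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row sample; values are made up for illustration.
csv_text = """record_id,year,species_id,sex,hindfoot_length,weight
1,1977,NL,M,32,
2,1977,DM,F,37,40
3,1978,DM,M,,48
"""

# read_csv accepts any file-like object, so StringIO stands in for a file path.
sample = pd.read_csv(io.StringIO(csv_text))

# Empty fields in the CSV are read in as missing values (NaN).
print(sample.head())
```

Looking at the printed rows straight away is what makes the empty fields visible.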

In [None]:
# The complementary .tail() method is also useful
surveys.tail()

## What issue is immediately apparent in the output?

In [None]:
# Let us explore our data further with pandas properties
surveys.shape

In [None]:
# number of rows?
surveys.shape[0]

In [None]:
# number of columns?
surveys.shape[1]

In [None]:
# We can use the .describe() method to get some summary statistics
surveys.describe()

## Hold on!
We had 9 columns and only got 7 from describe, which suggests that pandas cannot generate    
summary statistics for the other 2. What are these 2 columns called?

In [None]:
surveys.columns
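
The columns that `describe()` left out can also be found programmatically rather than by comparing the two lists by eye. A minimal sketch, assuming a small made-up data frame rather than the real survey data:

```python
import pandas as pd

# Hypothetical sample mixing numeric and text columns (values are made up).
sample = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["NL", "DM", "DM"],
    "sex": ["M", "F", "M"],
    "weight": [None, 40.0, 48.0],
})

# By default describe() summarises numeric columns only.
described = sample.describe().columns

# select_dtypes(exclude="number") keeps the columns describe() skipped.
skipped = sample.select_dtypes(exclude="number").columns

print(list(described))
print(list(skipped))
```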

In [None]:
# Looks like those 2 columns are probably some form of categorical data.
# We might find it useful to see what unique values they take, using the .unique() method.
surveys["species_id"].unique()

In [None]:
surveys["sex"].unique()

We can find the number of missing values in a column by combining two methods    
_.isna()_ and _.sum()_.

In [None]:
surveys["weight"].isna().sum()
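
The same pair of methods works on a whole data frame, giving one count per column. A minimal sketch, assuming a made-up four-row sample (not the real survey values):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few missing entries.
sample = pd.DataFrame({
    "sex": ["M", None, "F", "M"],
    "hindfoot_length": [32.0, np.nan, 37.0, np.nan],
    "weight": [np.nan, 40.0, 48.0, 45.0],
})

# isna() marks each missing cell as True; sum() then counts each True as 1.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

This works because `.sum()` treats `True` as 1 and `False` as 0, so the total is the number of missing cells.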

# Subsetting data objects (pandas DataFrame example)
Start by listing the record_id column.

In [None]:
surveys.record_id

__Now listing record_id and weight...__

In [None]:
surveys[["record_id","weight"]]

__Too much data?__ Let's just look at the first 5 rows using the __.iloc__ indexer.   
_(Note Python's zero-based indexing and slicing: `0:5` selects positions 0 to 4.)_

In [None]:
surveys[["record_id","weight"]].iloc[0:5]
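
The indexing note above matters because `.iloc` and `.loc` treat the end of a slice differently. A minimal sketch, assuming a made-up ten-row data frame with the default 0, 1, 2, ... index:

```python
import pandas as pd

# Hypothetical sample; weights are made up for illustration.
sample = pd.DataFrame({
    "record_id": list(range(1, 11)),
    "weight": [40, 48, 29, 46, 36, 52, 44, 31, 50, 39],
})

# .iloc slices by position and excludes the end: positions 0 to 4, 5 rows.
first_five = sample[["record_id", "weight"]].iloc[0:5]

# .loc slices by label and includes the end: labels 0 to 5, 6 rows.
by_label = sample[["record_id", "weight"]].loc[0:5]

print(len(first_five), len(by_label))
```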

## Simple plots using _plotnine_
Starting with a scatterplot

In [None]:
# make plotnine available in this notebook
from plotnine import *

In [None]:
p = (ggplot(surveys, aes(x = "weight", y = "hindfoot_length")) +
  geom_point())
  
p.show()

__Whole lot of data - where is it all coming from? Let's try colouring by year...__

In [None]:
p = (ggplot(surveys, aes(x = "weight", y = "hindfoot_length", colour= "year")) +
  geom_point())
  
p.show()

## Faceting plots

That didn't help a lot as we have overplotting - if only there was a way    
to create a plot for each year... &#x1F600; <!-- Grinning Face -->

In [None]:
p = (ggplot(surveys, aes(x = "weight", y = "hindfoot_length", colour= "year")) +
  geom_point() +
  facet_wrap("year"))
  
p.show()