# Week X - Pandas DataFrames

<hr style="border:2px solid gray">

## Index: <a id='index'></a>
1. [Introduction to Pandas](#pandas)
1. [Creating a DataFrame](#create)
1. [Manipulating DataFrames](#manipulate)
1. [Displaying Data](#display)
1. [Reading Data from Files](#files)

<hr style="border:2px solid gray">

## Section One: Introduction to Pandas  [^](#index) <a id='pandas'></a>

**Pandas** is a Python library for data manipulation, analysis and display. Pandas has two data formats: the **Series** and the **DataFrame**. To be honest, I (very) rarely use the Series on its own, but do use the DataFrame quite a lot.

DataFrames are a tabular data structure, a bit like Excel spreadsheets (and you can read/write spreadsheets to/from pandas DataFrames). 
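
As a rough illustration of how the two formats relate, the sketch below builds a small Series and a small DataFrame and shows that selecting a single column of a DataFrame gives back a Series. The names `ages`, `table` and `age_column` are made up for this example.

```python
import pandas as pd

# A Series is a single labelled column of values
ages = pd.Series([2, 4, 12], name="Age")

# A DataFrame is a table made up of several such columns
table = pd.DataFrame({"Name": ["Rex", "Bruno", "Biffa"], "Age": [2, 4, 12]})

# Selecting one column of a DataFrame returns a Series
age_column = table["Age"]

print(type(age_column))
print(age_column.equals(ages))
```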

There are many online teaching materials for pandas, for example [w3resource](https://www.w3resource.com/python-exercises/pandas/index.php), so this worksheet is only intended to give you a taste.

<hr style="border:2px solid gray">

## Section Two: Creating a DataFrame  [^](#index) <a id='create'></a>

A DataFrame is a 2D data structure that is composed of the following components:
- 1) The data
- 2) The index
    - These are the row labels of the DataFrame (by default, the row numbers)
- 3) The columns 
    - Contains the data taken at each index, labelled with headers

The labels at the 'top' of the DataFrame are known as **headers**. These allow you to access your data by column name, without needing to use numerical indices. The cell below shows two equivalent ways to create a DataFrame.

In [None]:
import pandas as pd

# Method 1: Set data as dictionary structure, data formatted in columns

data={'Name':["Rex","Bruno","Biffa","Queeny", "Bob"],
     'Breed':["bulldog","labrador","doberman","poodle", "pug"],
     'Age':[2,4,12,0.5, 7]}

dogs=pd.DataFrame(data)

display(dogs)

# Method 2: Splitting Headers and data - data formatted in rows

d=[["Rex","bulldog",2],
    ["Bruno","labrador",4],
    ["Biffa", "doberman", 12],
    ["Queeny","poodle", 0.5],
    ["Bob", "pug", 7]]

Headers=['Name', 'Breed', 'Age']

dogs2=pd.DataFrame(data=d,columns=Headers)

display(dogs2)
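
One way to convince yourself that the two methods give the same table is to compare the results directly. The sketch below rebuilds the same data under the made-up names `from_columns` and `from_rows` and compares them with `DataFrame.equals`, which checks values, labels and column types.

```python
import pandas as pd

# Column-wise input: one dictionary entry per column
columns = {"Name": ["Rex", "Bruno", "Biffa", "Queeny", "Bob"],
           "Breed": ["bulldog", "labrador", "doberman", "poodle", "pug"],
           "Age": [2, 4, 12, 0.5, 7]}

# Row-wise input: one inner list per row
rows = [["Rex", "bulldog", 2],
        ["Bruno", "labrador", 4],
        ["Biffa", "doberman", 12],
        ["Queeny", "poodle", 0.5],
        ["Bob", "pug", 7]]

from_columns = pd.DataFrame(columns)
from_rows = pd.DataFrame(data=rows, columns=["Name", "Breed", "Age"])

# Prints whether the two DataFrames hold identical contents
print(from_columns.equals(from_rows))
```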

The first column is the **index**, which can be used to specify the rows you want to display.

In [None]:
display(dogs[2:4])

To change the index to something more relevant (although this is not a particularly good example of this):

In [None]:
dogs=pd.DataFrame(data,index=["a","b","c","d", "e"])
display(dogs["b":"d"])

# Unlike slicing by position, slicing by label includes both end points:
# this displays rows 'b', 'c' and 'd'
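
The difference between slicing by position and slicing by label can be made explicit with the `.iloc` and `.loc` accessors. The following is a minimal sketch on a cut-down table; the name `pets` is just for illustration.

```python
import pandas as pd

pets = pd.DataFrame({"Name": ["Rex", "Bruno", "Biffa", "Queeny", "Bob"],
                     "Age": [2, 4, 12, 0.5, 7]},
                    index=["a", "b", "c", "d", "e"])

# .iloc selects by position: rows 1 and 2 (the stop position is excluded)
by_position = pets.iloc[1:3]

# .loc selects by label: rows 'b', 'c' and 'd' (the stop label is included)
by_label = pets.loc["b":"d"]

# .loc also accepts a row label and a column name to pick out a single value
single = pets.loc["c", "Name"]

print(by_position)
print(by_label)
print(single)
```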

To insert a new column into the DataFrame, simply assign a list of values (one per row) to a new column name:

In [None]:
dogs["Length"]=[50,100,105,85, 40]
display(dogs)

You can even create columns that are functions of other columns. Pandas applies the operation element-wise to whole columns at once, which is much faster than looping over the rows in Python.

In [None]:
dogs["combination"]=dogs.Age*dogs.Length
display(dogs)
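
As a further sketch of derived columns, the example below uses a small made-up table called `pets`. It also shows `DataFrame.assign`, which returns a new DataFrame with the extra column and leaves the original untouched; the column names `AgeInMonths` and `LengthPerYear` are invented for this example.

```python
import pandas as pd

pets = pd.DataFrame({"Age": [2, 4, 12], "Length": [50, 100, 105]})

# Arithmetic on a column acts on every row at once
pets["AgeInMonths"] = pets["Age"] * 12

# assign() returns a new DataFrame; pets itself is not modified
with_ratio = pets.assign(LengthPerYear=pets["Length"] / pets["Age"])

print(pets)
print(with_ratio)
```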

<hr style="border:2px solid gray">

## Section Three: Manipulating DataFrames  [^](#index) <a id='manipulate'></a>

### Filtering DataFrames
Here we are choosing to display all dogs above a given age.

In [None]:
display(dogs[dogs.Age > 6])
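
The expression inside the square brackets is itself a Series of True/False values, and several conditions can be combined with `&` (and) or `|` (or), each wrapped in parentheses. A minimal sketch, using a made-up copy of the table called `pets`:

```python
import pandas as pd

pets = pd.DataFrame({"Name": ["Rex", "Bruno", "Biffa", "Queeny", "Bob"],
                     "Age": [2, 4, 12, 0.5, 7],
                     "Length": [50, 100, 105, 85, 40]})

# The condition alone gives one True/False value per row
mask = pets["Age"] > 6

# Combine conditions with & (and) or | (or); the parentheses are required
old_and_long = pets[(pets["Age"] > 6) & (pets["Length"] > 50)]

# The same filter written as a query string
same_with_query = pets.query("Age > 6 and Length > 50")

print(mask)
print(old_and_long)
```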

### Statistical Analysis
You can calculate things like the correlation and covariance matrices. Passing `numeric_only = True` restricts the calculation to the numerical columns.

In [None]:
display(dogs.corr(numeric_only = True))
display(dogs.cov(numeric_only = True))
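
The two matrices are related: each (Pearson) correlation coefficient is the covariance divided by the product of the two standard deviations. The sketch below checks this numerically on a made-up numerical table called `pets`; it assumes the default settings of `.cov()`, `.std()` and `.corr()`.

```python
import numpy as np
import pandas as pd

pets = pd.DataFrame({"Age": [2, 4, 12, 0.5, 7],
                     "Length": [50, 100, 105, 85, 40]})

cov = pets.cov()   # covariance matrix
std = pets.std()   # standard deviation of each column

# Divide each covariance by the product of the two standard deviations
corr_from_cov = cov / np.outer(std, std)

# Prints whether this agrees with the matrix returned by .corr()
print(np.allclose(corr_from_cov, pets.corr()))
```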

<hr style="border:2px solid gray">

## Section Four: Displaying Data  [^](#index) <a id='display'></a>


It is possible to display your DataFrame content quite easily. Here we will cover a few common examples.

### Basic plotting

To display a basic plot of our data, we can use:
```python
df['column name'].plot()
```

or:

```python
df.plot('x column name','y column name')
```
We only need to reference the name of the column; we don't need to know its position. For the first method we didn't set an x-axis; with that syntax, pandas will use the index as the x-axis.
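
To make the difference between the two calls concrete, here is a small sketch on a made-up table called `wave` (not the 'dogs' data), whose index has deliberately been set to 10, 20, ..., 50 so that it is easy to spot on the axis. It assumes matplotlib is installed, which pandas needs for plotting.

```python
import numpy as np
import pandas as pd

times = np.linspace(0, 2, 5)
wave = pd.DataFrame({"time": times, "signal": times**2},
                    index=[10, 20, 30, 40, 50])

# No x column given: the index (10 to 50) is used as the x-axis
ax_index = wave["signal"].plot()

# x and y columns given: the 'time' column (0 to 2) is used as the x-axis
ax_column = wave.plot("time", "signal")
```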

<div style="background-color:#C2F5DD">

## Exercise
Experiment with these methods of data plotting using our 'dogs' DataFrame.


Other useful data visualisation:
### Histograms

In [None]:
import numpy as np
import scipy as sp
import pylab as pl

histogram=dogs.hist()

In [None]:
dogs['Length'].plot()

In [None]:
h1=dogs.hist(column="Length")

In [None]:
dogs[dogs.Age>6].hist(column="Length")

### Scatter Plots

In [None]:
dogs.plot(kind="scatter",x="Age",y="Length",alpha=1) 
#alpha controls the opacity of data points. 
#For larger amounts of data, setting alpha to a lower value can make the plot easier to interpret

A **scatter_matrix** displays a scatter plot for every pair of numerical columns, with a histogram of each column along the diagonal. Run the cell below to see what this looks like for our data. Just like with any pandas plot, this can also be filtered.

In [None]:
import pandas.plotting as pdp
pdp.scatter_matrix(dogs)

In [None]:
pdp.scatter_matrix(dogs[dogs.Age>3])

Further examples of plots can be found [here](https://pandas.pydata.org/docs/user_guide/visualization.html).

<div style="background-color:#C2F5DD">

## Exercise

The purpose of this exercise is to get you to play around with pandas DataFrames and to consolidate the knowledge that you already have.

* Generate 5 samples with 100,000 correlated random numbers distributed according to Gaussian distributions (you can choose whatever covariance matrix you like). See worksheet [] if you require a refresher.

* Read these into a DataFrame

* Create a 6th column in your DataFrame: the values should be the second column plus the fourth column

* Verify that the covariance (and correlation) matrices are what you would expect 

* Display your data

<hr style="border:2px solid gray">

## Section Five: Reading Data from Files  [^](#index) <a id='files'></a>

You can read data from all sorts of files (csv, excel, etc.) into a DataFrame. Sometimes (especially with csv) you have to be careful with the separator, i.e. the character that divides one column from the next.
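
The sketch below shows what can go wrong with the separator. It reads a short, made-up piece of semicolon-separated text (held in memory with `io.StringIO` rather than in a real file) first with the default comma separator and then with `sep=";"`.

```python
import io
import pandas as pd

# Made-up file contents using ';' rather than ',' between columns
raw_text = "Name;Breed;Age\nRex;bulldog;2\nBruno;labrador;4\n"

# Default separator is ',': every line ends up in a single column
wrong = pd.read_csv(io.StringIO(raw_text))

# Telling pandas about the separator gives the three expected columns
right = pd.read_csv(io.StringIO(raw_text), sep=";")

print(wrong.shape)
print(right.shape)
```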

In [None]:
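students=pd.read_excel(r'student-por.xlsx')
# The 'r' prefix makes this a raw string, so backslashes (e.g. in Windows file paths)
# are not treated as escape characters. It is optional for a plain file name like this one.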

In [None]:
display(students)

This will display a lot of information. We can reduce this display and make the data easier to interpret at a glance using <span style="color:blue">.head()</span> and <span style="color:blue">.tail()</span>, which show the first and last few rows respectively. To get a top level summary of the data, we can use the <span style="color:blue">.info()</span> command. For example:

In [None]:
display(students.head())

print('\n And the summary of the data: \n')

# .info() prints its summary directly and returns None, so it is not wrapped in display()
students.info()
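
Both `.head()` and `.tail()` show five rows by default and accept the number of rows as an argument. A minimal sketch on a made-up ten-row table called `table`:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) ** 2})

first_three = table.head(3)   # first three rows
last_two = table.tail(2)      # last two rows

print(first_three)
print(last_two)

# Column names, non-null counts and data types
table.info()
```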

<div style="background-color:#C2F5DD">

## Exercise 

These data are taken from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Student+Performance#)

Read the description of the student data and then read in the data set. Then work together as a group to analyse these data. What are the most important factors that determine a student's scores? What are the least important? What other correlations do you see here (look at data values that aren't simply numerical as well as those that are)?