# Pandas DataFrames

Pandas is a Python library for data manipulation, analysis and display. It is a very useful package, and many other tools interface with it. Pandas has two main data structures: the *Series* and the *DataFrame*. Being honest, I (very) rarely use the Series on its own, but I do use the DataFrame quite a lot. This is just a brief introduction to them.

DataFrames are tabular, a bit like Excel spreadsheets (and you can read spreadsheets into pandas DataFrames and write DataFrames back out to spreadsheets).

There are many online teaching materials for pandas, for example the [w3resource](https://www.w3resource.com/python-exercises/pandas/index.php) exercises, so this is only meant to give you a taste.

Let's create a simple DataFrame:



In [None]:
import pandas as pd
# set data as dictionary structure
data={'Name':["Rex","Bruno","Biffa","Queeny","Sheiba","Crusoe"],
     'Breed':["bulldog","labrador","doberman","poodle","labrador","scotty"],
     'Age':[2,4,12,0.5,10,7]}

dogs=pd.DataFrame(data)

display(dogs)
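To connect the two data structures mentioned at the start: each column of a DataFrame is a Series. The following is a minimal sketch using a cut-down, illustrative version of the dogs table (`print` is used instead of `display` so that it also runs outside a notebook).

```python
import pandas as pd

# A small illustrative table in the same spirit as the dogs example
dogs = pd.DataFrame({"Name": ["Rex", "Bruno", "Biffa"],
                     "Age": [2, 4, 12]})

# Selecting one column gives a Series that shares the DataFrame's index
ages = dogs["Age"]
print(type(ages).__name__)   # Series
print(ages.max())            # 12

# Selecting with a list of column names gives a DataFrame again
subset = dogs[["Name", "Age"]]
print(subset.shape)          # (3, 2)
```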



Note that the first column shown is the index, and you can use it to pick out the rows that you want to display.

In [None]:
display(dogs[2:4])

If you like, you can change this index to something more meaningful (not that this is a good example of one).

In [None]:
dogs=pd.DataFrame(data,index=["a","b","c","d","e","f"])
display(dogs["b":"d"])

Note that in this case the slice includes both the first and the last label, whereas the slice by position above left out the end position.
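The difference between the two kinds of slicing can be made explicit with `.iloc` (by position) and `.loc` (by label). This is a small sketch on an illustrative four-row table rather than the full dogs data.

```python
import pandas as pd

dogs = pd.DataFrame({"Name": ["Rex", "Bruno", "Biffa", "Queeny"]},
                    index=["a", "b", "c", "d"])

# Position-based slice: like a Python list, the end position is excluded
by_position = dogs.iloc[1:3]
print(list(by_position.index))   # ['b', 'c']

# Label-based slice: both end labels are included
by_label = dogs.loc["b":"d"]
print(list(by_label.index))      # ['b', 'c', 'd']
```

Writing `.iloc` or `.loc` explicitly makes it clear to a reader which of the two behaviours is intended.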

You can insert a new column:

In [None]:
dogs["Length"]=[50,100,105,85,100,80]
display(dogs)

You can select only entries with a given attribute ... say old dogs.

In [None]:
display(dogs[dogs.Age > 6])
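The expression `dogs.Age > 6` is itself a Series of True/False values, and several such conditions can be combined. The sketch below assumes you want "old labradors" or "young dogs or poodles" as example selections; these selections are illustrations and not part of the original material.

```python
import pandas as pd

dogs = pd.DataFrame({
    "Name": ["Rex", "Bruno", "Biffa", "Queeny", "Sheiba", "Crusoe"],
    "Breed": ["bulldog", "labrador", "doberman", "poodle", "labrador", "scotty"],
    "Age": [2, 4, 12, 0.5, 10, 7],
})

# A comparison on a column gives one True/False value per row
old = dogs["Age"] > 6
print(old.tolist())   # [False, False, True, False, True, True]

# Combine conditions with & (and) and | (or); each condition needs parentheses
old_labradors = dogs[old & (dogs["Breed"] == "labrador")]
print(old_labradors["Name"].tolist())     # ['Sheiba']

young_or_poodle = dogs[(dogs["Age"] < 3) | (dogs["Breed"] == "poodle")]
print(young_or_poodle["Name"].tolist())   # ['Rex', 'Queeny']
```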

You can even create columns that are functions of other columns, and pandas does this really quickly.

In [None]:
dogs["combination"]=dogs.Age*dogs.Length
display(dogs)
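Column arithmetic like this works element by element, without an explicit loop. Here is a small sketch on a three-row table; the conversion from centimetres to metres is an assumption made purely for illustration, since the units of `Length` are not stated.

```python
import pandas as pd

dogs = pd.DataFrame({"Age": [2, 4, 12], "Length": [50, 100, 105]})

# Multiplying two columns multiplies them row by row
dogs["combination"] = dogs["Age"] * dogs["Length"]
print(dogs["combination"].tolist())   # [100, 400, 1260]

# A constant is applied to every row (assuming, for illustration, Length is in cm)
dogs["Length_m"] = dogs["Length"] / 100
print(dogs["Length_m"].tolist())
```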

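You can calculate things like the correlation and covariance matrices. These only make sense for numerical columns, so select those first (recent versions of pandas raise an error if text columns such as `Name` are included):

In [None]:
# keep only the numerical columns before computing the matrices
numeric_dogs=dogs.select_dtypes(include="number")
display(numeric_dogs.corr())
display(numeric_dogs.cov())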
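A quick way to convince yourself that these matrices behave as expected is to add a column whose relationship to another is known exactly. The sketch below uses a hypothetical `Double` column (twice the age) on a cut-down table: its correlation with `Age` should be 1, and its covariance with `Age` should be twice the variance of `Age`.

```python
import pandas as pd

dogs = pd.DataFrame({"Name": ["Rex", "Bruno", "Biffa", "Queeny"],
                     "Age": [2, 4, 12, 0.5],
                     "Length": [50, 100, 105, 85]})

# Hypothetical column with an exactly known relationship to Age
dogs["Double"] = 2 * dogs["Age"]

# Correlation and covariance are computed on the numerical columns only
numeric = dogs.select_dtypes(include="number")
corr = numeric.corr()
cov = numeric.cov()

print(list(corr.columns))                      # ['Age', 'Length', 'Double']
print(round(corr.loc["Age", "Double"], 6))     # 1.0
print(cov.loc["Age", "Double"], 2 * dogs["Age"].var())
```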

## Displaying data


It is quite easy to plot the contents of your DataFrame, for example as histograms:

In [None]:
import numpy as np
import scipy as sp
import pylab as pl

histogram=dogs.hist()


In [None]:
h1=dogs.hist(column="Length")

In [None]:
dogs[dogs.Age>6].hist(column="Length")

A scatter plot is a particularly useful way of displaying the data:

In [None]:
dogs.plot(kind="scatter",x="Age",y="Length")
# If you have a larger number of points, giving alpha a small value
# (making the points semi-transparent) can make the plot easier to read:
#dogs.plot(kind="scatter",x="Age",y="Length",alpha=0.05)

In [None]:
import pandas.plotting as pdp
pdp.scatter_matrix(dogs)

In [None]:
pdp.scatter_matrix(dogs[dogs.Age>6])

## Exercise

The purpose of this exercise is to get you to play around with pandas DataFrames and to consolidate the knowledge that you already have.

* Generate 5 samples of 100,000 correlated random numbers distributed according to Gaussian distributions (you can choose whatever covariance matrix you like).

* Read these into a DataFrame

* Create a 6th column in your DataFrame that is the 2nd column plus the 4th column

* Verify that the covariance (and correlation) matrices are what you would expect 

* Display your data

## Reading data from files

You can read data from all sorts of files (CSV, Excel, ...). Sometimes (especially with CSV) you have to be careful with the separator.

In [None]:
students=pd.read_excel(r'student-por.xlsx')

In [None]:
display(students)
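To illustrate the point about separators, here is a minimal sketch that reads CSV text in which the fields are separated by semicolons. The content and column names are made up for the example, and `io.StringIO` stands in for a real file so that the code runs on its own.

```python
import io
import pandas as pd

# Made-up file content, using ';' between fields instead of ','
text = "school;age;G3\nGP;18;11\nMS;17;14\n"

# With the default separator (a comma) each row ends up in a single column
wrong = pd.read_csv(io.StringIO(text))
print(wrong.shape)          # (2, 1)

# Telling pandas about the separator gives the expected three columns
right = pd.read_csv(io.StringIO(text), sep=";")
print(right.shape)          # (2, 3)
print(right["G3"].mean())   # 12.5
```

If a freshly read DataFrame has only one column with a very long name, a wrong separator is a likely cause.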

## Exercise 

These data are taken from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Student+Performance#)

Read the description of the student data, then read in the data set. Work together as a group to analyse these data. What are the most important factors that determine a student's scores? What are the least important? What other correlations do you see here (look at things that aren't simply numerical as well as those that are)?