# ECB Data Academy - Evolve Programme
[Krisolis](http://www.krisolis.ie)

## Accessing Datasets with pandas

### Pandas DataFrame Data Structure

*Comma Separated Value* (**CSV**) files are one of the most common ways to exchange data. The **pandas** package in Python makes it easy to load data from a CSV file and analyse it. In this notebook we will demonstrate how this is done. 

### Import pandas

First import the pandas package. The **as pd** clause simply adds an *alias* so we can use the name **pd** instead of **pandas**. This is just so that we can be lazy and type *pd* instead of *pandas*.

In [None]:
import pandas as pd

### Load the File

We can load a CSV file using the **read_csv** function from pandas. The data from the CSV file is loaded into a special pandas data structure known as a **DataFrame**. This is essentially a data table like you might use in Excel.

In [None]:
country_data = pd.read_csv('../Data/country_data.csv')
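
If the course file is not to hand, the same call can be tried on CSV text held in memory. The following is only a minimal sketch: the three rows and all their values are invented placeholders, and the **Country** column is an assumption (only **Population**, **Continent** and **Life Exp.** are column names used in this notebook).

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for country_data.csv (all values are made up).
csv_text = """Country,Continent,Population,Life Exp.
A,Europe,100,80.0
B,Asia,300,70.0
C,Europe,200,82.0
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample_data = pd.read_csv(io.StringIO(csv_text))

print(type(sample_data))  # the loaded object is a pandas DataFrame
print(sample_data.shape)  # (number of rows, number of columns)
```

The first line of the CSV text is treated as the header row, so it becomes the column names of the DataFrame.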

### Examining the Data

The **head** and **tail** methods are used to show the first or last few lines of a DataFrame.

In [None]:
display(country_data.head())

In [None]:
display(country_data.tail())
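
By default **head** and **tail** return five rows, but both accept the number of rows to return. A small sketch on a made-up ten-row table (the column names **Id** and **Value** are hypothetical):

```python
import pandas as pd

# Hypothetical ten-row table used only for illustration.
sample_data = pd.DataFrame({
    "Id": list(range(10)),
    "Value": list(range(100, 110)),
})

first_three = sample_data.head(3)  # first 3 rows
last_two = sample_data.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```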

Or we can just see the full dataset using **display**.

In [None]:
display(country_data)

Note that we use **display** instead of **print** here just because it renders the DataFrame as a formatted table in the notebook, which looks much nicer!

### Accessing Columns

Accessing a *column* in a DataFrame is simply a matter of putting the name of the column in square brackets, [ ], similar to dictionary selection. The result is a pandas **Series** holding that column's values.

In [None]:
populations = country_data["Population"]
print(populations)

In [None]:
continents = country_data["Continent"]
print(continents)
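
It can help to see how single and double square brackets differ. The sketch below uses a small made-up table (the values and the **Country** column are assumptions): a single column name returns a Series, while a list of names returns a smaller DataFrame.

```python
import pandas as pd

# Hypothetical data used only for illustration.
sample_data = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Continent": ["Europe", "Asia", "Europe"],
    "Population": [100, 300, 200],
})

# One column name in single brackets gives a Series.
one_column = sample_data["Population"]

# A list of column names (hence the double brackets) gives a DataFrame.
two_columns = sample_data[["Country", "Population"]]

print(type(one_column).__name__)
print(type(two_columns).__name__)
```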

### Generating Summary Statistics

Pandas provides a range of functionality to calculate summary statistics - e.g. **sum**, **mean**, **min**, **max**, ... - from the data in a DataFrame. 

In [None]:
pop_sum = country_data["Population"].sum()
print(pop_sum)

In [None]:
pop_mean = country_data["Population"].mean()
print(pop_mean)

In [None]:
pop_min = country_data["Population"].min()
print(pop_min)

In [None]:
pop_max = country_data["Population"].max()
print(pop_max)
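
Several of these statistics can also be obtained in one call with the **describe** method. A minimal sketch on made-up population values:

```python
import pandas as pd

# Hypothetical values used only for illustration.
sample_data = pd.DataFrame({"Population": [100, 300, 200]})

# describe returns the count, mean, standard deviation, min, quartiles and max.
summary = sample_data["Population"].describe()
print(summary)

# Individual statistics can be picked out by name.
print(summary["mean"])
```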

Frequency tables are also a really useful analysis tool and are easily generated using the **value_counts** method.

In [None]:
continent_counts = country_data["Continent"].value_counts()
print(continent_counts)
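
If proportions are more useful than raw counts, **value_counts** accepts `normalize=True`. A short sketch on made-up continent labels:

```python
import pandas as pd

# Hypothetical data used only for illustration.
sample_data = pd.DataFrame({
    "Continent": ["Europe", "Asia", "Europe", "Africa"],
})

counts = sample_data["Continent"].value_counts()
proportions = sample_data["Continent"].value_counts(normalize=True)

print(counts)       # number of rows per continent
print(proportions)  # share of rows per continent (sums to 1)
```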

### Accessing Rows from a DataFrame

One very useful way to slice a DataFrame is using a condition. We can pass a sequence of Boolean values to a DataFrame indicating which rows should be retained (True) and which should be filtered out (False). A suitable sequence is easily generated by applying a simple Boolean expression to a column from the DataFrame, which gives a Series with one True or False value per row.

In [None]:
country_data["Life Exp."] > 60

The result of an expression like this can be passed directly to the DataFrame to perform the filtering:

In [None]:
healthy_countries = country_data[country_data["Life Exp."] > 70]
display(healthy_countries)
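
Conditions can also be combined. The sketch below, on a small made-up table (all values and the **Country** column are assumptions), uses `&` for "and" and `|` for "or". When the comparisons are written inline rather than stored in variables, each one needs its own parentheses.

```python
import pandas as pd

# Hypothetical data used only for illustration.
sample_data = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Continent": ["Europe", "Asia", "Europe", "Asia"],
    "Life Exp.": [80.0, 65.0, 68.0, 75.0],
})

is_european = sample_data["Continent"] == "Europe"
is_long_lived = sample_data["Life Exp."] > 70

# Rows where both conditions are True.
european_and_long_lived = sample_data[is_european & is_long_lived]

# Rows where at least one condition is True.
european_or_long_lived = sample_data[is_european | is_long_lived]

print(european_and_long_lived)
print(european_or_long_lived)
```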

So, for example, we could extract the mean life expectancy for the countries in our dataset that are in Europe.

In [None]:
european_countries = country_data[country_data["Continent"] == "Europe"]
euro_life_exp_mean = european_countries['Life Exp.'].mean()
print("The average life expectancy in Europe is:", euro_life_exp_mean)