<h1>Pandas Tutorial</h1>

(C) Bartosz and Maria Teleńczuk


Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)

<img src="https://raw.githubusercontent.com/maikia/pyladies-python/main/figs/pylady_simple_paris_2.png" width=200 height=200 />

# Pandas: data analysis in python

 Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.

## Why do you need pandas?

When working with *tabular or structured data* (like R dataframe, SQL table, Excel spreadsheet, ...):

- Import data
- Clean up messy data
- Explore data, gain insight into data
- Process and prepare your data for analysis
- Analyse your data (together with scikit-learn, statsmodels, ...)

## Further reading

- the documentation: http://pandas.pydata.org/pandas-docs/stable/
- Wes McKinney's book "Python for Data Analysis"
- lots of tutorials on the internet (search "pandas tutorial" on GitHub or YouTube)



## Dataframes

DataFrame: a two-dimensional table of structured, heterogeneous data, similar to a spreadsheet
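
As a minimal sketch of what such a table looks like, the snippet below builds a tiny DataFrame by hand. The name `toy`, the column names and the values are made up for illustration and are not the tutorial data.

```python
import pandas as pd

# Hypothetical values, for illustration only (not the tutorial data set)
toy = pd.DataFrame({
    "name": ["a", "b", "c"],       # text column
    "count": [1, 2, 3],            # integer column
    "ratio": [0.5, 0.25, 0.125],   # floating-point column
})

print(toy)
print(toy.dtypes)   # each column keeps its own data type
print(toy.shape)    # (number of rows, number of columns)
```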

We will first download the data, **please copy and paste**:

In [None]:
# download data from server and save in a local file if you downloaded this notebook only
from urllib.request import urlretrieve
url = "https://raw.githubusercontent.com/maikia/pyladies-python/main/data/countries.csv"
f = urlretrieve(url, filename='countries.csv')

In [None]:
import pandas as pd
pd.options.display.max_rows = 8  # how many rows in the table

countries = pd.read_csv("countries.csv")
countries

### Useful functions

In [None]:
countries.head(n=2)

In [None]:
countries.describe()

<div class="alert alert-success">

**Quiz** What are the minimum and maximum populations of the given countries?

</div>

### Indexing

#### Indexing columns

get a column

In [None]:
countries['population']

select several columns

In [None]:
countries[['population', 'area']]

#### Indexing rows

In [None]:
countries.loc[0]  # select the row with index label 0 using the .loc attribute

#### Two-dimensional indexing

In [None]:
countries.loc[0, "area"]

<div class="alert alert-success">
    <b>EXERCISE</b>: Print the population of Germany
</div>

#### Dataframe index

In [None]:
countries.index

In [None]:
countries.columns

In [None]:
# index is a label of a row

indexed = countries.set_index("country")
indexed

#### Indexing with .loc and .iloc

select rows (and columns) using the special `.loc` (label-based) or `.iloc` (position-based) attributes:

In [None]:
indexed.loc["Belgium", :]

In [None]:
indexed.loc[["Belgium", "France"], ["area", "population"]]

In [None]:
indexed.iloc[1:3, :]

#### Filtering

In [None]:
indexed[indexed['population'] > 20]
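
The expression inside the brackets is itself a Series of `True`/`False` values, one per row, and only the rows marked `True` are kept. A small sketch with made-up numbers (the `toy` frame and its values are hypothetical, not the tutorial data):

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame(
    {"population": [5, 25, 60], "area": [10, 300, 500]},
    index=["A", "B", "C"],
)

mask = toy["population"] > 20      # boolean Series: A False, B True, C True
print(mask)

print(toy[mask])                   # keeps rows B and C

# conditions can be combined with & (and) and | (or); the parentheses are required
both = toy[(toy["population"] > 20) & (toy["area"] < 400)]
print(both)                        # keeps row B only
```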

<div class="alert alert-success">
    
**Quiz** Imagine that you get the DataFrame called `dinos` with the following content:


sample | density | weight | species
--- | --- | --- | ---
3 | 10 | 11 | Hydrosaurus
1 | 9 | 5 | Diplodocus 
2 | 11 | 2 | Pterodactyl 

Try to guess what each of the commands will return:

A) `dinos['density']`

B) `dinos.iloc[1, 'species']`

C) `dinos.loc[1, 'species']`

D) `dinos.iloc[1:, 'weight']`

E*) `dinos.loc[1:, 'weight']`
    
</div>

In [None]:
data_dinos = {
    'sample': [3, 1, 2],
    'density': [10, 9, 11], 
    'weight': [11, 5, 2], 
    'species': ['Hydrosaurus', 'Diplodocus', 'Pterodactyl']
}

dinos = pd.DataFrame(data_dinos)

# print(dinos['density'])
# print(dinos.iloc[1, 'species'])  # raises an error: iloc accepts only integer positions (rows and columns)
# print(dinos.loc[1, 'species'])
# print(dinos.iloc[1:, 'weight'])  # raises an error for the same reason
# print(dinos.loc[1:, 'weight'])
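
One way to check the answers is sketched below. It rebuilds `dinos` so that the cell runs on its own, and wraps the `.iloc` call that mixes a position with a column label in `try`/`except`, because pandas rejects it. The helper name `try_iloc` is introduced here only for illustration.

```python
import pandas as pd

dinos = pd.DataFrame({
    'sample': [3, 1, 2],
    'density': [10, 9, 11],
    'weight': [11, 5, 2],
    'species': ['Hydrosaurus', 'Diplodocus', 'Pterodactyl'],
})

def try_iloc(rows, cols):
    """Return the .iloc result, or a short error note if pandas rejects the key."""
    try:
        return dinos.iloc[rows, cols]
    except (ValueError, TypeError, IndexError) as err:
        return f"error: {type(err).__name__}"

# .loc uses labels: row label 1 (default index 0, 1, 2) and column label 'species'
print(dinos.loc[1, 'species'])
# .iloc uses integer positions only, so a column name is rejected
print(try_iloc(1, 'species'))
# the positional equivalent: row position 1, column position 3
print(dinos.iloc[1, 3])
# a label slice with .loc includes both end points; here it runs to the last row
print(dinos.loc[1:, 'weight'])
```

Note that `sample` is an ordinary column here, so the row labels are the default 0, 1, 2 and not the sample numbers.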


## Calculations

In [None]:
countries['population'] * 1000

<div class="alert alert-success">
    <b>EXERCISE</b>: Calculate how big the population of each country is relative to France
</div>

In [None]:
pop_france = indexed.loc['France', 'population']
indexed['population'] / pop_france

# another possibility, but less robust because it depends on the row and column order:
# indexed.iloc[1, 0]

In [None]:
countries.mean(numeric_only=True)  # skip text columns, which recent pandas versions refuse to average

In [None]:
countries.sum()

In [None]:
countries.sort_index()

In [None]:
countries.sort_values(by='population')   # , ascending=False)

<div class="alert alert-success">
    <b>EXERCISE</b>: Calculate the total population of the 3 largest countries (area-wise)
</div>

In [None]:
largest = countries.sort_values(by='area', ascending=False)[:3]
print(largest)
largest['population'].sum()
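
The same selection can be sketched with `DataFrame.nlargest`, which gives the same rows as sorting in descending order and taking the first `n`. The frame below uses made-up numbers so that the cell runs on its own; it is not the tutorial data.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "population": [10, 40, 20, 30],
    "area": [100, 50, 300, 200],
})

top3 = toy.nlargest(3, "area")     # rows C, D, A (largest area first)
print(top3)
print(top3["population"].sum())    # 20 + 30 + 10 = 60
```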

## Plotting

In [None]:
indexed['population'].plot()

In [None]:
indexed['population'].plot(kind='bar')

In [None]:
indexed.plot(kind='scatter', x='area', y='population')

<div class="alert alert-success">
    <b>EXERCISE</b>: Calculate and plot the population density in each country as a bar plot
</div>

In [None]:
pop_density = indexed['population'] / indexed['area']
print(pop_density)
pop_density.plot(kind='bar')

### Styling and customisation

pandas graphs are matplotlib objects:

In [None]:
ax = indexed['population'].plot(kind='bar')


In [None]:
type(ax)

#### Using .plot arguments

you can pass extra arguments to plot that will be forwarded to matplotlib's plot function (such as `color`, `facecolor`, `edgecolor`, `linewidth`):

In [None]:
ax = indexed['population'].plot(
    kind='bar',
    facecolor='white',
    edgecolor='red', 
    linewidth=5)

#### Using matplotlib functions

In [None]:
import matplotlib.pyplot as plt

ax = indexed['population'].plot(kind='bar')

# add legend
ax.legend()

# set title
ax.set_title('countries')

# set x, y axes labels
ax.set_xlabel("countries")
ax.set_ylabel("millions")

# replace x-tick labels
ax.set_xticklabels(['This', 'will', 'replace', 'country', 'names'])

# toggle grid
plt.grid()

#### Changing matplotlib style

In [None]:
plt.style.use('grayscale')
indexed['population'].plot(kind='bar')

#### Using subplots

You can also place multiple graphs on one figure:

* using matplotlib directly 

In [None]:
fig, axes = plt.subplots(2, 2)
indexed['population'].plot(kind='bar', ax=axes[0, 1])
indexed['area'].plot(ax=axes[1, 0])

# resize graphs to avoid overlapping labels
plt.tight_layout()

* or use pandas argument (`subplots`)

In [None]:
countries.plot(kind='bar', subplots=True)

#### Saving

In [None]:
indexed['population'].plot(kind='bar')

# you can save graphs in many different formats
plt.savefig('population.png')
plt.savefig('population.pdf')
plt.savefig('population.jpg')
plt.savefig('population.eps')
plt.savefig('population.svg')

# Working with multiple data sources

## Series

A Series is a basic holder for **one-dimensional labeled data**. It can be created like a NumPy array:

In [None]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s

### Creating Series from dictionary

It's possible to construct a series directly from a Python dictionary. Let's first define the dictionary of GDP in 2007:

In [None]:
gdp_dict = {
 'France': 30470,
 'Germany': 32170,
 'United Kingdom': 33203,
 'Belgium': 33692,
 'Netherlands': 36797,
 'Albania': 5937}

In [None]:
pd.Series(gdp_dict)

Now we construct a `Series` object from the dictionary and store it in the variable `gdp`.

In [None]:
gdp = pd.Series(gdp_dict)
gdp

### Automatic alignment

In [None]:
indexed['population'] * gdp
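
Alignment means that the two operands are matched by index label, not by position, and labels found in only one of them produce `NaN`. A minimal sketch, assuming made-up population values (the GDP figures are taken from the dictionary above):

```python
import pandas as pd

# population values are hypothetical, for illustration only
population = pd.Series({"Belgium": 10.0, "France": 60.0, "Spain": 45.0})
# GDP values copied from gdp_dict, deliberately listed in a different order
gdp = pd.Series({"France": 30470, "Belgium": 33692, "Albania": 5937})

product = population * gdp
print(product)

# Belgium and France appear in both Series, so they get a number;
# Spain and Albania appear in only one, so they get NaN
print(product.isna())
```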

## Adding column to a dataset

Most novel information is hidden in **relations** between several data sets. You will hardly ever find all relevant information in a single table.

You can add a column to a dataframe; it will also be aligned automatically on the index:

In [None]:
indexed["gdp"] = gdp

In [None]:
indexed

**Note** Albania is missing from the list, because it was not included in our DataFrame with countries

This changed the dataframe **in place**.
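
A small sketch of what "in place" means here, again with made-up population values: assigning a Series to a new column name modifies the existing object. Labels present only in the Series are dropped, and labels present only in the DataFrame get `NaN`. `DataFrame.assign` is shown as an alternative that returns a new DataFrame and leaves the original untouched. The names `toy` and `with_gdp` are introduced only for this example.

```python
import pandas as pd

# Hypothetical population values, for illustration only
toy = pd.DataFrame({"population": [10.0, 60.0]}, index=["Belgium", "France"])
gdp = pd.Series({"France": 30470, "Albania": 5937})

with_gdp = toy.assign(gdp=gdp)     # new DataFrame; toy is unchanged
print("gdp" in toy.columns)        # False

toy["gdp"] = gdp                   # modifies toy itself
print(toy)                         # Belgium gets NaN, Albania is not added
```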