# Using Pandas (basic introduction)
Pandas is a library for working with data organized in relational tables.

To prepare for this lesson, execute the following cell.

In [None]:
!git clone https://github.com/datasciencedojo/datasets.git

Import the pandas library and assign it a shorter alias.

In [None]:
import pandas as pd

## Loading data
Pandas includes a rich set of input functions that allow you to get data from various file types

| function | format | notes |
|----------|--------|-------|
| `pd.read_csv` | textual csv | |
| `pd.read_excel` | binary excel format | requires an external library (e.g. openpyxl) |
| `pd.read_parquet` | fast binary columnar format | requires pyarrow or fastparquet |

A data frame offers many methods to explore its content, e.g. the `.head()` method shows the first rows of a data frame.

In [None]:
df = pd.read_csv("datasets/titanic.csv")
df.head()
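
The same loading step can be tried without any file on disk. The sketch below is only an illustration: the CSV text and the `people` variable are made up, and the text is wrapped in `io.StringIO` because `pd.read_csv` accepts file-like objects as well as paths.

```python
import io
import pandas as pd

# Hypothetical CSV content, kept in memory so no file is needed
csv_text = """name,age,city
Ada,36,London
Linus,28,Helsinki
Grace,45,New York
"""

# pd.read_csv accepts a path or any file-like object
people = pd.read_csv(io.StringIO(csv_text))

first_rows = people.head(2)   # keep only the first two rows
print(first_rows)
print(people.shape)           # (number of rows, number of columns)
```

Passing a number to `.head()` controls how many rows are shown; without it, pandas shows five.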

## Projection, Selection and Extension
A data frame is a table; you can get its column names using the `.columns` attribute.

In [None]:
df.columns

Columns can be accessed individually or in groups; this operation is called **projection**.

Single columns can be accessed either
1. using the square bracket operator, e.g. `df["Age"]`
2. using the dot operator, e.g. `df.Age`, if the column name is a valid **identifier**

Each column is called a **Series** in pandas jargon.
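
A minimal sketch of the two access styles, assuming a made-up `passengers` table whose column names imitate the Titanic data set:

```python
import pandas as pd

# Hypothetical miniature table with Titanic-like column names
passengers = pd.DataFrame({
    "Name": ["Allen", "Braund", "Cumings"],
    "Age": [29.0, 22.0, 38.0],
    "Pclass": [1, 3, 1],
})

ages_bracket = passengers["Age"]   # bracket operator: works for any column name
ages_dot = passengers.Age          # dot operator: needs a valid identifier

print(type(ages_bracket).__name__)     # Series
print(ages_bracket.equals(ages_dot))   # True
```

The bracket form is the safer default: it also works for names containing spaces and for names that clash with data frame methods.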

Groups of columns can be accessed by passing a list of strings to the bracket operator 

In [None]:
df[["Survived","Pclass","Sex","Age"]]

Operations on series are vectorized, i.e. they are applied element by element and return a new series.

An operation between a series and a scalar value is repeated for every value of the series

```python
df.Pclass == 1
```

returns a series of booleans
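
The following sketch shows the three cases on two small, made-up series (`prices` and `quantities` are illustrative names, not part of the lesson data):

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0])
quantities = pd.Series([1, 2, 3])

totals = prices * quantities   # element by element: 10*1, 20*2, 30*3
doubled = prices * 2           # the scalar is applied to every element
is_expensive = prices > 15     # a comparison gives a boolean series

print(totals.tolist())         # [10.0, 40.0, 90.0]
print(doubled.tolist())        # [20.0, 40.0, 60.0]
print(is_expensive.tolist())   # [False, True, True]
```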

Passing a list of booleans to the square bracket operator keeps only the rows for which the logical condition is true; this operation is called **selection**, which is a synonym for filtering.

In [None]:
df[df.Pclass == 1]

Selection and projection are usually applied together; the `.loc[rows, columns]` indexer can be conveniently used for this purpose. Its arguments are:
1. a boolean list for the rows, or the slice operator `:` for no filter
2. a list of column names, or the slice operator `:` for all columns

In [None]:
df.loc[df.Pclass==1,["Survived","Sex","Age"]]
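
Selection, projection and their combination can be compared side by side on a tiny table. This is a sketch under the assumption of a made-up `passengers` data frame; only the column names imitate the Titanic data.

```python
import pandas as pd

# Hypothetical miniature table with Titanic-like column names
passengers = pd.DataFrame({
    "Name": ["Allen", "Braund", "Cumings", "Heikkinen"],
    "Pclass": [1, 3, 1, 3],
    "Age": [29.0, 22.0, 38.0, 26.0],
})

mask = passengers.Pclass == 1                        # boolean series, one value per row
first_class = passengers[mask]                       # selection only
names_ages = passengers.loc[mask, ["Name", "Age"]]   # selection and projection
all_rows = passengers.loc[:, ["Name"]]               # ':' keeps every row

print(names_ages)
```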

It is possible to extend a table with more columns, typically as the result of a computation on other columns.

To create a new column, just assign an expression to a new column name e.g.

```python
df["above_average"] = (df.score > df.score.mean())
```
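
A runnable version of this idea, assuming a made-up `results` table with a `score` column:

```python
import pandas as pd

# Hypothetical exam results
results = pd.DataFrame({
    "student": ["Ann", "Bob", "Cleo", "Dan"],
    "score": [70, 85, 90, 55],
})

# The mean score is (70 + 85 + 90 + 55) / 4 = 75
results["above_average"] = results.score > results.score.mean()
print(results)
```

Note that a new column has to be created with the bracket operator; the dot operator only reads columns that already exist.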

In [None]:
countries = pd.read_csv("datasets/WorldDBTables/CountryTable.csv")
countries.columns

### Exercise
Calculate the population density of each country.

The countries table contains the population size in the `population` column and the land area in the `surface_area` column.
1. calculate the ratio of these two columns and store it in a new column called `population_density`

In [None]:
countries["population_density"] = countries.population / countries.surface_area
countries.loc[:,["name","population_density"]]

2. sort the table by `population_density` in descending order using the `.sort_values` method
3. restrict the columns to only the `["name","population_density"]` columns
4. show the first lines of the table using the `.head()` method: what are the most densely populated countries?

In [None]:
countries.sort_values("population_density",ascending=False).loc[:,["name","population_density"]].head()

## Join and concatenation

A data set may be spread over more than one table; this can improve consistency and make operations more efficient.

If two tables represent related entities, they can be **joined** on one or more columns that hold the attributes establishing the relationship.

Each row of one table is replicated once for every matching row in the other table.

There are four kinds of join available

| join | data included | missing values added |
|------|---------------|----------------------|
| inner | only rows that match in both tables | none |
| left | all rows of the first table | in the second table's columns, for unmatched rows of the first table |
| right | all rows of the second table | in the first table's columns, for unmatched rows of the second table |
| outer | all rows of both tables | in the other table's columns, for every unmatched row |
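
The four kinds can be compared on two tiny tables. The sketch below uses pandas' `pd.merge` function with its `how=` and `on=` arguments; the tables `left` and `right` and their columns are made up for the illustration.

```python
import pandas as pd

# Two hypothetical tables sharing the keys "b" and "c" only
left = pd.DataFrame({"key": ["a", "b", "c"], "left_value": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_value": [20, 30, 40]})

row_counts = {}
missing_counts = {}
for how in ["inner", "left", "right", "outer"]:
    joined = pd.merge(left, right, on="key", how=how)
    row_counts[how] = len(joined)
    # total number of missing cells in the joined table
    missing_counts[how] = int(joined.isna().sum().sum())

print(row_counts)       # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
print(missing_counts)   # {'inner': 0, 'left': 1, 'right': 1, 'outer': 2}
```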

Suppose we have a list of courses, a list of classrooms and the classroom bookings for each course; if we want to know where each professor should hold their lessons, we need to join these tables.

| course_id | title | professor |
|-----------|-------|-----------|
| 1 | quantum field theory | Bohr |
| 2 | thermodynamics | Carnot |
| 3 | statistics | Gosset |

| classroom_id | building | floor |
|--------------|----------|-------|
| p124 | Purple | 1 |
| r201 | Red | 2 |

| course_id | classroom_id | weekday | start | end |
|-----------|--------------|---------|-------|-----|
| 1 | p124 | Monday | 9 | 11 |
| 1 | r201 | Wednesday | 14 | 15 |
| 2 | r201 | Tuesday | 14 | 17 |
| 3 | r201 | Monday | 14 | 15 |
| 3 | p124 | Tuesday | 9 | 10 |
| 3 | p124 | Wednesday | 9 | 10 |

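The `pd.merge()` function performs the join operation. Courses and classrooms share no column, so they are linked through the bookings table, e.g.
```python
courses_bookings = pd.merge(courses, bookings)      # joined on course_id
schedule = pd.merge(courses_bookings, classrooms)   # joined on classroom_id
```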
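
To try this out, the three tables above can be rebuilt by hand. In this sketch the variable names (`courses`, `classrooms`, `bookings`, `schedule`) are assumptions; bookings are joined to courses first because those two tables share `course_id`, and classrooms are attached afterwards through `classroom_id`.

```python
import pandas as pd

courses = pd.DataFrame({
    "course_id": [1, 2, 3],
    "title": ["quantum field theory", "thermodynamics", "statistics"],
    "professor": ["Bohr", "Carnot", "Gosset"],
})
classrooms = pd.DataFrame({
    "classroom_id": ["p124", "r201"],
    "building": ["Purple", "Red"],
    "floor": [1, 2],
})
bookings = pd.DataFrame({
    "course_id": [1, 1, 2, 3, 3, 3],
    "classroom_id": ["p124", "r201", "r201", "r201", "p124", "p124"],
    "weekday": ["Monday", "Wednesday", "Tuesday", "Monday", "Tuesday", "Wednesday"],
    "start": [9, 14, 14, 14, 9, 9],
    "end": [11, 15, 17, 15, 10, 10],
})

courses_bookings = pd.merge(courses, bookings)      # shared column: course_id
schedule = pd.merge(courses_bookings, classrooms)   # shared column: classroom_id

print(schedule[["professor", "weekday", "start", "building", "floor"]])
```

Each course row is repeated once per booking, so the result has one row for every lesson.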
The default kind of join is `inner`; use the optional `how=` argument to choose another kind.

By default `pd.merge` joins on all columns with identical names; to restrict the join to a given list of columns, use the `on=` option.

If the join columns have different names in the two tables, use the `left_on=` and `right_on=` options to match them.
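
A short sketch of these options, assuming two made-up tables (`employees` and `departments`) whose join columns are named differently:

```python
import pandas as pd

# Hypothetical tables: the department code is called dept_code in one, code in the other
employees = pd.DataFrame({
    "name": ["Ann", "Bob", "Cleo"],
    "dept_code": ["hr", "it", "it"],
})
departments = pd.DataFrame({
    "code": ["hr", "it", "ops"],
    "dept_name": ["Human Resources", "IT", "Operations"],
})

# The join columns have different names, so name each side explicitly
staff = pd.merge(employees, departments, left_on="dept_code", right_on="code")

# how="outer" also keeps the department that has no employees
everyone = pd.merge(employees, departments, how="outer",
                    left_on="dept_code", right_on="code")

print(staff)
print(everyone)
```

With `left_on=` and `right_on=` both key columns are kept in the result, so `dept_code` and `code` appear side by side.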

### Exercise
- in the country table we have a list of countries including their population
- in the languages table we have a list of languages spoken in each country and the percentage of the population which speaks said language
- in the country table we have a textual `code` which is uniquely assigned to each country
- in the languages table we have the same code in a column called `country_code`

1. load the language table from `datasets/WorldDBTables/LanguageTable.csv` using the `pd.read_csv` function and store it in a variable called `languages`
2. create a table named `languages_by_country` using the `pd.merge` function, joining the column `code` of the `countries` table with the column `country_code` of the `languages` table
3. calculate the number of people speaking each language by multiplying the `population` column by the `percentage` column (don't forget to divide by 100!); put the result in a column called `people_speaking`
4. show some rows of the table, keeping only the following columns: `["name","language","people_speaking","official"]`. What do you see?

In [None]:
languages = pd.read_csv("datasets/WorldDBTables/LanguageTable.csv")

In [None]:
languages_by_country = pd.merge(countries, languages, how="inner", left_on=["code"], right_on=["country_code"])
languages_by_country["people_speaking"] = languages_by_country.population * languages_by_country.percentage / 100
languages_by_country[["name","language","people_speaking","official"]]

## Pivoting and melting

## Aggregation
Very often you may want to group your data by one or more attributes and perform some calculation on each group; this operation is called **aggregation**.

E.g. suppose I want to split a restaurant bill with my friends and I have a data frame called `bill` which looks like the following table

| person | item | amount |
|--------|------|--------|
| me | pepperoni pizza | 12 |
| me | lager pils | 5 |
| andrea | cheeseburger | 10 |
| andrea | coca cola | 2 |
| andrea | french fries | 2 |

```python
groups = bill.groupby(["person"])
groups.agg({"amount":"sum"})
```

will return one row per group, sorted by group key

| person | amount |
|--------|--------|
| andrea | 14 |
| me | 17 |
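
The bill example can be reproduced end to end as follows. The data frame is rebuilt by hand from the table above; the second aggregation, with several functions at once, is an extra illustration rather than part of the original example.

```python
import pandas as pd

bill = pd.DataFrame({
    "person": ["me", "me", "andrea", "andrea", "andrea"],
    "item": ["pepperoni pizza", "lager pils", "cheeseburger",
             "coca cola", "french fries"],
    "amount": [12, 5, 10, 2, 2],
})

# One row per person; pandas sorts the group keys by default,
# so "andrea" comes before "me"
totals = bill.groupby(["person"]).agg({"amount": "sum"})
print(totals)

# Several aggregations at once: one output column per function
summary = bill.groupby("person")["amount"].agg(["sum", "count", "max"])
print(summary)
```

The grouping column becomes the index of the result; calling `.reset_index()` turns it back into an ordinary column.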

### Exercise
Using the `languages_by_country` table we created in the previous exercise:
1. create a grouping by using the `"language"` column
2. using the `.agg()` method calculate how many people speak each language
3. sort the result in descending order, largest group first
4. show the first lines using `.head()` method

In [None]:
languages_by_country.groupby(["language"]) \
.agg({"people_speaking":"sum"}) \
.sort_values("people_speaking",ascending=False) \
.head()