# Using Pandas
Pandas is a Python library for working with tabular data, organized much like relational tables.

To prepare for this lesson, execute the following cell, which downloads the example datasets.

In [None]:
!git clone https://github.com/datasciencedojo/datasets.git

Import the pandas library and assign it a shorter alias.

In [None]:
import pandas as pd

## Loading data
Pandas includes a rich set of input functions that load data from various file formats.

| function | format | notes |
|----------|--------|-------|
| `pd.read_csv` | textual csv | |
| `pd.read_excel` | binary excel format | requires an external library such as openpyxl |
| `pd.read_parquet` | fast binary columnar format | requires pyarrow or fastparquet |

A data frame offers many methods to explore its content, e.g. the `.head()` method shows the first rows of a data frame.

In [None]:
df = pd.read_csv("datasets/titanic.csv")
df.head()
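
The same loading and first-look steps can be tried without any file on disk. The sketch below assumes a tiny made-up CSV held in a string (the column names and values are illustrative and are not taken from the Titanic file) and reads it through `io.StringIO`, which `pd.read_csv` accepts in place of a path.

```python
import io

import pandas as pd

# Hypothetical CSV content, used only for illustration
csv_text = """name,age,city
Ann,34,Rome
Bob,27,Paris
Cid,45,Oslo
"""

# read_csv accepts a path or any file-like object
people = pd.read_csv(io.StringIO(csv_text))

print(people.head(2))   # first two rows
print(people.shape)     # (number of rows, number of columns)
print(people.dtypes)    # inferred type of each column
```

`.shape` and `.dtypes` are attributes rather than methods, so they are used without parentheses.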

## Projection, Selection and Extension
A data frame is a table; you can get its column names using the `.columns` attribute.

In [None]:
df.columns

Columns can be accessed individually or in groups; this operation is called **projection**.

Single columns can be accessed either
1. using the square bracket operator, e.g. `df["Age"]`
2. using the dot operator, e.g. `df.Age`, if the column name is a valid Python **identifier** and does not clash with an existing data frame attribute

Each column is called a **Series** in pandas jargon
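
As a minimal sketch of the two access styles, the example below builds a small hypothetical frame (the names `people`, `age` and `first name` are made up for illustration). The column `first name` contains a space, so only the bracket form can reach it.

```python
import pandas as pd

# Hypothetical frame; "first name" is not a valid Python identifier
people = pd.DataFrame({"age": [34, 27, 45], "first name": ["Ann", "Bob", "Cid"]})

ages_by_bracket = people["age"]   # bracket access works for any column name
ages_by_dot = people.age          # dot access needs a valid identifier

print(type(ages_by_bracket))                 # a pandas Series
print(ages_by_bracket.equals(ages_by_dot))   # both forms return the same data
print(people["first name"])                  # only the bracket form works here
```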

Groups of columns can be accessed by passing a list of strings to the bracket operator 

In [None]:
df[["Survived","Pclass","Sex","Age"]]

Operations on series are vectorized, i.e. they are applied element by element and return a new series.

Operations between a series and a scalar value are repeated for every value of the series

```python
df.Pclass == 1
```

returns a series of booleans
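
A short sketch of vectorized operations, using two made-up series (`price` and `quantity` are hypothetical names chosen for the example):

```python
import pandas as pd

# Hypothetical values, used only for illustration
price = pd.Series([10.0, 20.0, 30.0])
quantity = pd.Series([1, 2, 3])

total = price * quantity     # series with series: element-by-element product
discounted = price * 0.9     # series with scalar: the scalar is applied to every element
is_expensive = price > 15    # comparisons work the same way and give booleans

print(total.tolist())
print(discounted.tolist())
print(is_expensive.tolist())
```

No explicit loop is needed in any of the three cases; pandas pairs up the elements by their index labels.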

Passing a series (or list) of booleans to the square bracket operator keeps only the rows for which the value is `True`; this operation is called **selection**, which is a synonym for filter.

In [None]:
df[df.Pclass == 1]
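
Several conditions can be combined into one filter with `&` (and), `|` (or) and `~` (not). The sketch below uses a small made-up table shaped like the Titanic data (the values are hypothetical) so it can run on its own.

```python
import pandas as pd

# Hypothetical passenger-like data, not the real Titanic file
passengers = pd.DataFrame({
    "Pclass": [1, 3, 1, 2],
    "Sex": ["female", "male", "male", "female"],
    "Age": [29, 40, 54, 18],
})

first_class = passengers.Pclass == 1   # boolean series
women = passengers.Sex == "female"     # boolean series

# Named boolean series can be combined directly
first_class_women = passengers[first_class & women]

# Parentheses are needed when the comparisons are written inline
first_or_young = passengers[(passengers.Pclass == 1) | (passengers.Age < 20)]

print(first_class_women)
print(first_or_young)
```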

Selection and projection are usually used together; the `.loc[rows, columns]` operator is convenient for this purpose. Its arguments are:
1. a boolean list for rows, or the slice operator `:` for no filter
2. a list of column names, or the slice operator `:` for all columns

In [None]:
df.loc[df.Pclass==1,["Survived","Sex","Age"]]
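
The different argument combinations of `.loc` can be compared on a small made-up table (the `passengers` frame below is hypothetical and only mimics the Titanic columns):

```python
import pandas as pd

# Hypothetical data, used only for illustration
passengers = pd.DataFrame({
    "Pclass": [1, 3, 1],
    "Sex": ["female", "male", "male"],
    "Age": [29, 40, 54],
})

# Rows filtered by a boolean series, columns picked by name
first_class_ages = passengers.loc[passengers.Pclass == 1, ["Sex", "Age"]]

# ":" keeps every row (first position) or every column (second position)
all_rows_two_columns = passengers.loc[:, ["Pclass", "Age"]]
filtered_all_columns = passengers.loc[passengers.Age > 30, :]

print(first_class_ages)
print(all_rows_two_columns)
print(filtered_all_columns)
```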

A table can be extended with more columns, typically as the result of a computation on other columns.

To create a new column, assign an expression to a new column name, e.g.

```python
df["above_average"] = (df.score > df.score.mean())
```
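
A runnable version of the snippet above, assuming a small made-up `results` table with a `score` column (both names are hypothetical):

```python
import pandas as pd

# Hypothetical scores, used only for illustration
results = pd.DataFrame({"student": ["Ann", "Bob", "Cid"], "score": [60, 90, 75]})

# Each score is compared with the column mean (75.0 for these values)
results["above_average"] = results.score > results.score.mean()

print(results)
```

The new column must be created with the bracket form; assigning to `results.above_average` would set a plain Python attribute instead of adding a column.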

In [None]:
countries = pd.read_csv("datasets/WorldDBTables/CountryTable.csv")
countries.columns

### Exercise
Calculate the population density of each country.

- the countries table contains 

In [None]:
countries["population_density"] = countries.population / countries.surface_area
countries.loc[:,["name","population_density"]]

In [None]:
countries.sort_values("population_density",ascending=False).loc[:,["name","population_density"]].head()

## Join and concatenation

A relation may be split across more than one table; this can improve consistency and make operations more efficient.

If two tables represent related entities, they can be **joined** on one or more columns that contain the attributes defining the relationship.

Each matched row of one table is replicated as many times as the number of rows it matches in the other table.
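
The replication of matched rows can be seen on two tiny made-up tables (`countries_demo` and `languages_demo` are hypothetical and only mimic the column names used later in this lesson):

```python
import pandas as pd

# Hypothetical tables: Switzerland matches two language rows, Iceland one
countries_demo = pd.DataFrame({"code": ["CHE", "ISL"], "name": ["Switzerland", "Iceland"]})
languages_demo = pd.DataFrame({
    "country_code": ["CHE", "CHE", "ISL"],
    "language": ["German", "French", "Icelandic"],
})

# The Switzerland row is repeated once per matching language row
joined = pd.merge(countries_demo, languages_demo, left_on="code", right_on="country_code")

print(joined[["name", "language"]])
```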

There are four kinds of joins available:

| join | data included | missing values added |
|------|---------------|----------------------|
| inner | only rows with a match in both tables | none |
| left | all rows of the first table | in the second table's columns, for rows of the first table without a match |
| right | all rows of the second table | in the first table's columns, for rows of the second table without a match |
| outer | all rows of both tables | in the other table's columns, for every row without a match |
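
The four join kinds can be compared side by side on two made-up tables that share only some keys (`left`, `right` and their columns are hypothetical names chosen for the example):

```python
import pandas as pd

# Hypothetical tables sharing a "key" column; only "b" and "c" appear in both
left = pd.DataFrame({"key": ["a", "b", "c"], "left_value": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_value": [20, 30, 40]})

results = {}
for how in ["inner", "left", "right", "outer"]:
    # Same tables and key every time; only the join kind changes
    results[how] = pd.merge(left, right, how=how, on="key")
    print(how, "->", len(results[how]), "rows")
    print(results[how])
```

In the printed output, missing values appear as `NaN` in the columns that come from the table without a matching row.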

### Example 
- in the country table we have a list of countries including their population
- in the languages table we have a list of languages spoken in each country and the percentage of the population which speaks said language
- in the country table we have a textual `code` which is uniquely assigned to each country
- in the languages table we have the same code in a column called `country_code`

In [None]:
languages = pd.read_csv("datasets/WorldDBTables/LanguageTable.csv")

In [None]:
languages_by_country = pd.merge(countries, languages, how="inner", left_on=["code"], right_on=["country_code"])
languages_by_country["people_speaking"] = languages_by_country.population * languages_by_country.percentage / 100
languages_by_country[["name","language","people_speaking","official"]]

## Pivoting and melting

## Aggregation

In [None]:
languages_by_country.groupby(["language"]) \
.agg({"people_speaking":"sum"}) \
.sort_values("people_speaking",ascending=False) \
.head()
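
The cell above groups rows that share the same language and reduces each group to a single total. The same pattern can be sketched on a small made-up table (the `speakers` frame and its numbers are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical speaker counts, in millions
speakers = pd.DataFrame({
    "language": ["French", "German", "French", "German", "Italian"],
    "country": ["France", "Germany", "Switzerland", "Switzerland", "Italy"],
    "people_speaking": [60.0, 75.0, 2.0, 5.0, 58.0],
})

totals = (
    speakers.groupby("language")                      # one group per language
    .agg({"people_speaking": "sum"})                  # sum the column within each group
    .sort_values("people_speaking", ascending=False)  # largest total first
)

print(totals)
```

After `groupby`, the grouping column becomes the index of the result, which is why `language` is not listed among the regular columns.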

## Exercise