# Using Pandas
pandas is a Python library for working with tabular data, stored in data frames that resemble relational tables.

To prepare for this lesson, execute the following cell. It clones the repository that holds the example datasets.

In [None]:
!git clone https://github.com/datasciencedojo/datasets.git

Import the pandas library and assign it the shorter alias `pd`.

In [None]:
import pandas as pd

## Loading data
pandas includes a rich set of input functions that read data from various file formats into a data frame:

| function | format | notes |
|----------|--------|-------|
| `pd.read_csv` | textual csv | |
| `pd.read_excel` | binary excel format | requires an external library such as `openpyxl` (for `.xlsx` files) |
| `pd.read_parquet` | fast binary columnar format | requires `pyarrow` or `fastparquet` |

A data frame provides many methods for exploring its contents. For example, the `.head()` method shows the first rows of a data frame (five by default).

In [None]:
df = pd.read_csv("datasets/titanic.csv")
df.head()
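
The same loading and exploring steps can be tried without any file on disk. The sketch below is only an illustration: the inline CSV text and the `people` frame are made up, and `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical inline CSV that stands in for a file on disk
csv_text = """name,age,city
Ada,36,London
Grace,45,New York
Linus,28,Helsinki
"""

# read_csv accepts a path or any file-like object
people = pd.read_csv(io.StringIO(csv_text))

print(people.head(2))  # first two rows
print(people.shape)    # (number of rows, number of columns)
print(people.dtypes)   # type inferred for each column
```

`.shape` and `.dtypes` are attributes rather than methods, so they are used without parentheses.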

## Selection, filter and extension

In [None]:
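# List the column names
df.columns

In [None]:
# Select a subset of columns by passing a list of names
df[["Survived", "Pclass", "Sex", "Age"]]

In [None]:
# Keep only the rows where the condition holds (first-class passengers)
df[df.Pclass == 1]

In [None]:
# .loc combines both: a row filter and a list of columns
df.loc[df.Pclass == 1, ["Survived", "Sex", "Age"]]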
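
Row filters like the ones above are boolean Series, which can be stored in a variable and combined. The following is a minimal sketch on a small made-up frame (`passengers` and its values are assumptions, not the Titanic data):

```python
import pandas as pd

# Hypothetical miniature passenger table
passengers = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Pclass": [1, 3, 1, 2],
    "Age": [38.0, 22.0, 54.0, 27.0],
})

# A comparison yields a boolean Series with one True/False per row
is_first_class = passengers["Pclass"] == 1

# Indexing with the mask keeps only the rows marked True
first_class = passengers[is_first_class]

# Masks combine with & (and), | (or), ~ (not); each condition needs parentheses
older_first_class = passengers.loc[
    is_first_class & (passengers["Age"] > 40), ["Name", "Age"]
]
print(older_first_class)
```

The parentheses around each comparison are needed because `&` and `|` bind more tightly than `==` and `>` in Python.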

In [None]:
countries = pd.read_csv("datasets/WorldDBTables/CountryTable.csv")
countries.columns

In [None]:
# Extend the data frame with a column computed from two existing columns
countries["population_density"] = countries.population / countries.surface_area
countries.loc[:, ["name", "population_density"]]

In [None]:
# Show the five most densely populated countries
countries.sort_values("population_density", ascending=False).loc[:, ["name", "population_density"]].head()
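
The pattern of deriving a column and then sorting by it can be checked on a tiny frame. This is a sketch with invented numbers; `regions` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical regions with made-up figures
regions = pd.DataFrame({
    "name": ["X", "Y", "Z"],
    "population": [1000, 500, 3000],
    "surface_area": [10.0, 50.0, 20.0],
})

# Column arithmetic is element-wise: one result per row
regions["population_density"] = regions["population"] / regions["surface_area"]

# Sort from densest to sparsest and keep the top two rows
top = (
    regions.sort_values("population_density", ascending=False)
    .loc[:, ["name", "population_density"]]
    .head(2)
)
print(top)
```

`sort_values` returns a new data frame and leaves `regions` in its original order.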

## Join and concatenation

In [None]:
languages = pd.read_csv("datasets/WorldDBTables/LanguageTable.csv")

In [None]:
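# Inner join: keep only rows whose country code appears in both tables
languages_by_country = pd.merge(countries, languages, how="inner", left_on=["code"], right_on=["country_code"])
# Estimate the number of speakers from the population and the language's share in percent
languages_by_country["people_speaking"] = languages_by_country.population * languages_by_country.percentage / 100
languages_by_country[["name", "language", "people_speaking", "official"]]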
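
To see how the `how` argument changes the result, and how concatenation differs from a join, here is a minimal sketch on two invented tables. All names and values (`countries_demo`, `languages_demo`, `more_countries`) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical tables that mimic the shape of the country and language data
countries_demo = pd.DataFrame({
    "code": ["AAA", "BBB", "CCC"],
    "name": ["Alpha", "Beta", "Gamma"],
    "population": [1000, 2000, 500],
})
languages_demo = pd.DataFrame({
    "country_code": ["AAA", "AAA", "BBB", "DDD"],
    "language": ["L1", "L2", "L1", "L3"],
    "percentage": [60.0, 40.0, 100.0, 100.0],
})

# Inner join: only codes present in both tables survive (AAA twice, BBB once)
inner = pd.merge(countries_demo, languages_demo, how="inner",
                 left_on="code", right_on="country_code")

# Left join: every country is kept; CCC gets missing values for the language columns
left = pd.merge(countries_demo, languages_demo, how="left",
                left_on="code", right_on="country_code")

# Concatenation stacks frames with the same columns on top of each other
more_countries = pd.DataFrame({"code": ["DDD"], "name": ["Delta"], "population": [300]})
all_countries = pd.concat([countries_demo, more_countries], ignore_index=True)

print(len(inner), len(left), len(all_countries))
```

A join matches rows by key and widens the table, while `pd.concat` simply appends rows; `ignore_index=True` renumbers the resulting index from zero.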

## Pivoting and melting
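
As a minimal sketch of what this heading refers to: `pivot` turns a long table (one row per observation) into a wide one (one column per category), and `melt` reverses the operation. The data below is invented for illustration.

```python
import pandas as pd

# Hypothetical long-format table: one row per (country, year) pair
long_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2010, 2000, 2010],
    "population": [10, 12, 20, 25],
})

# pivot: one row per country, one column per year
wide = long_df.pivot(index="country", columns="year", values="population")
print(wide)

# melt: back to long format, one row per (country, year) pair
back = wide.reset_index().melt(
    id_vars="country", var_name="year", value_name="population"
)
print(back)
```

`pivot` raises an error when an index/column pair occurs more than once; `pivot_table` handles that case by aggregating the duplicates.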

## Aggregation

In [None]:
# Total number of speakers per language, largest first
(
    languages_by_country.groupby("language")
    .agg({"people_speaking": "sum"})
    .sort_values("people_speaking", ascending=False)
    .head()
)
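
The group-then-aggregate pattern can also compute several statistics at once through named aggregation. This is a sketch on made-up data; `speakers` and the output column names `total_speakers` and `countries` are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one per (country, language) pair
speakers = pd.DataFrame({
    "language": ["L1", "L2", "L1", "L3"],
    "people_speaking": [600.0, 400.0, 2000.0, 50.0],
})

# Named aggregation: output_column=(input_column, function)
totals = (
    speakers.groupby("language")
    .agg(
        total_speakers=("people_speaking", "sum"),
        countries=("people_speaking", "count"),
    )
    .sort_values("total_speakers", ascending=False)
)
print(totals)
```

After `groupby`, the grouping column becomes the index of the result; calling `.reset_index()` turns it back into an ordinary column.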

## Exercise