In [None]:
import pandas as pd

In [None]:
pd.options.display.max_columns = 50

# Pandas values and overview methods

We'll use free data from the gapminder.org [repository](https://github.com/open-numbers/ddf--gapminder--systema_globalis) for this exercise. The data is distributed under a CC-BY license.

Let's load the data directly from GitHub: two tables with information about the world's countries.

In [None]:
countries = pd.read_csv(
    "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis"
    "/master/ddf--entities--geo--country.csv"
)

In [None]:
population = pd.read_csv(
    "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis"
    "/master/countries-etc-datapoints/ddf--datapoints--population_total--by--geo--time.csv"    
)

## 1. Pandas operates on columns

In pandas tables, rows and columns have different meanings:
* rows describe observations
* columns describe the features of those observations

Take birds as an example. We define the features we are interested in as columns: size, weight, color, diet, habitat.

Then we observe birds, and each new bird gets its own row.

It follows that each column has a single data type and meaning (for example, a number giving the weight in grams), while a row can combine multiple data types (numbers, strings, categories).
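The bird example can be sketched as a tiny table. The column names and values below are made up purely for illustration; they are not part of the Gapminder data.

```python
import pandas as pd

# Hypothetical observations: one row per bird, one column per feature.
birds = pd.DataFrame(
    {
        "species": ["sparrow", "owl", "heron"],  # text
        "weight_g": [30.5, 450.0, 1500.0],       # made-up weights in grams
        "eats_fish": [False, False, True],       # boolean
    }
)

# Each column has exactly one data type.
column_types = birds.dtypes

# A single row mixes text, a number and a boolean,
# so pandas falls back to the generic `object` type for it.
first_bird = birds.iloc[0]
row_type = first_bird.dtype
```

Printing `column_types` and `row_type` shows the difference: the columns keep their specific types, while the row has to use a catch-all type.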

To view the data types of a table's columns, check the `dtypes` attribute:

In [None]:
countries.dtypes

* `object` means string (in most cases)
* `float64` is a floating point number
* `bool` is boolean (True/False)
* `int` types (such as `int64`) are integer numbers
* `category` is a special type for categorical data
* date and time types also exist

Not every operation is possible for every data type. For example, you cannot compute the average of strings or append `.` to numbers.
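A minimal sketch of both failures, using two small toy columns (made-up values). It assumes a reasonably recent pandas version, where both operations raise `TypeError`; the exact error message differs between versions.

```python
import pandas as pd

names = pd.Series(["sparrow", "owl", "heron"])  # toy text column
counts = pd.Series([1, 2, 3])                   # toy integer column

# Averaging text is not defined.
try:
    names.mean()
    mean_failed = False
except TypeError:
    mean_failed = True

# Adding a string to integers is not defined either.
try:
    counts + "."
    append_failed = False
except TypeError:
    append_failed = True

# Converting the numbers to strings first makes the append possible.
labels = counts.astype(str) + "."
```

After the explicit conversion with `astype(str)`, `labels` holds text values, so string operations work on it.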

To see basic statistics for each column, use the `describe` method:

In [None]:
countries.describe(include="all")

## 2. Summary operations for strings and categories

### Number of entries (present values)

In [None]:
countries.income_3groups.count()

Compare this with the total number of rows, which includes missing values:

In [None]:
countries.shape[0]

### Unique entries

In [None]:
countries.income_3groups.unique()

### Number of unique entries

In [None]:
countries.income_3groups.nunique()

In [None]:
countries.income_3groups.nunique(dropna=False)
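How missing values affect these numbers is easier to see on a small toy column. The values below are invented for illustration and only mimic the style of `income_3groups`.

```python
import pandas as pd

# Toy column with one missing value (None).
income = pd.Series(["high", "low", None, "low", "middle"])

total_rows = len(income)                              # every row, missing or not
present = income.count()                              # non-missing values only
n_unique = income.nunique()                           # distinct values, missing ignored
n_unique_with_missing = income.nunique(dropna=False)  # missing counted as one more value
```

Here `count()` is one less than the number of rows, and `nunique(dropna=False)` is one more than `nunique()`, because the single missing value is treated as an extra distinct entry.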

### Most frequent entry

In [None]:
countries.income_3groups.mode()

### Count of all different entries

In [None]:
countries.income_3groups.value_counts()
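Two details are worth knowing here: `mode()` returns a Series rather than a single value, because several entries can tie for most frequent, and `value_counts()` skips missing values unless asked otherwise. A small sketch on made-up data:

```python
import pandas as pd

# Toy column: "low" and "high" tie, plus one missing value.
groups = pd.Series(["low", "high", "low", "high", None])

modes = groups.mode()                           # all tied values, sorted
counts = groups.value_counts()                  # missing values are skipped
counts_all = groups.value_counts(dropna=False)  # missing values get their own row
```

To get a single most frequent value, take the first element with `modes.iloc[0]`, keeping in mind that ties are then resolved by sort order.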

## 3. Summary operations for numbers

We'll use the population data from the `population` table.

### Mean

In [None]:
population.population_total.mean()

### Standard deviation

In [None]:
population.population_total.std()
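Note that pandas computes the sample standard deviation by default (`ddof=1`), while NumPy's `np.std` defaults to the population version (`ddof=0`). A short sketch on toy numbers shows the difference:

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # toy data with mean 5

sample_std = values.std()            # divides by n - 1 (pandas default)
population_std = values.std(ddof=0)  # divides by n
numpy_std = np.std(values.to_numpy())  # NumPy default also divides by n
```

On large tables the two versions are nearly identical, but on short series the gap can be noticeable.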

### Minimum/maximum/median/percentiles

In [None]:
population.population_total.min()

In [None]:
population.population_total.max()

In [None]:
population.population_total.median()

In [None]:
population.population_total.quantile(q=0.25)

In [None]:
population.population_total.quantile(q=0.98)
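The median is the 0.5 quantile, and `quantile` also accepts a list to compute several percentiles at once. A minimal sketch on toy numbers:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])  # toy data

median = values.median()
half = values.quantile(q=0.5)               # same as the median
quartiles = values.quantile(q=[0.25, 0.75])  # returns a Series indexed by q
```

With a list of quantiles the result is a Series, so individual values can be read with `quartiles.loc[0.25]`.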

### Sum

In [None]:
population.population_total.sum()

## 4. Classwork

1. Find a country of interest in the `countries` table
2. Use its `country` code to get data about its population from the `population` table
3. Compute statistics for this country's population
4. Determine the year range of that population data
5. Select data for the years 2001–2022 only
6. Compute statistics for these years
7. Compare the two sets of statistics
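One possible shape for these steps is sketched below on a tiny made-up table, so the real exercise is left open. It assumes the `population` table has `geo` and `time` columns next to `population_total` (guessed from the file name), and the country code `"aaa"` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the `population` table; all values are invented.
population = pd.DataFrame(
    {
        "geo": ["aaa", "aaa", "aaa", "aaa", "bbb", "bbb"],
        "time": [1999, 2001, 2010, 2022, 2001, 2022],
        "population_total": [100, 110, 150, 200, 10, 20],
    }
)

# Step 2: keep the rows for one country code.
one = population[population["geo"] == "aaa"]

# Step 3: statistics over all available years.
all_stats = one["population_total"].describe()

# Step 4: year range of the data.
year_min, year_max = one["time"].min(), one["time"].max()

# Step 5: keep 2001-2022 only (`between` includes both ends by default).
recent = one[one["time"].between(2001, 2022)]

# Step 6: statistics for the selected years.
recent_stats = recent["population_total"].describe()
```

Placing `all_stats` and `recent_stats` side by side, for example with `pd.concat([all_stats, recent_stats], axis=1)`, makes the comparison in the last step easier to read.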

# Pandas documentation

A very useful starting point is the official "10 minutes to pandas" guide: https://pandas.pydata.org/docs/user_guide/10min.html