# Pandas - data types and basic operations
In the last lesson, we introduced the pandas library and its base classes: `Series`, `DataFrame` and `Index`. However, we treated them as static objects that we only viewed.
In this lesson, we will begin to edit existing tables. We will show:
* how to **add** or **remove columns** and **rows**
* how to **change the value of a specific cell**
* what **data types** are suitable for which purpose
* **arithmetic** and **logical operations** that can be performed on columns
* **filtering** and **sorting rows**

And since you definitely don't want to lose the results of your work, at the end we will also cover:
* **saving the results** to external files.

In [1]:
# Mandatory import
import pandas as pd

In [2]:
# Optional import to disable warnings

#import warnings
#warnings.simplefilter(action='ignore', category=FutureWarning)

## Manipulating DataFrames
To warm up, we will work with a small table containing some basic information about the planets, which you can easily find, for example, on Wikipedia (https://en.wikipedia.org/wiki/Planet).

In [3]:
planets = pd.DataFrame({
    
"name": ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"],
"symbol": ["☿", "♀", "⊕", "♂", "♃", "♄", "♅", "♆"],
"equatorian_diameter": [0.39, 0.72, 1.00, 1.52, 5.20, 9.54, 19.22, 30.06],
"orbital_period": [0.24, 0.62, 1, 1.88, 11.86, 29.46, 84.01, 164.8]})

planets = planets.set_index("name") # It will be easier for you to work with the name as index
planets

name,symbol,equatorian_diameter,orbital_period
Mercury,☿,0.39,0.24
Venus,♀,0.72,0.62
Earth,⊕,1.0,1.0
Mars,♂,1.52,1.88
Jupiter,♃,5.2,11.86
Saturn,♄,9.54,29.46
Uranus,♅,19.22,84.01
Neptune,♆,30.06,164.8


### Add a new column
When we want to add a new column (`Series`), we assign it to a `DataFrame` as a dictionary value - that is, in square brackets with the column name. 

`pandas` handles both the `Series` and the regular `list`, so we can create a new list and add it as a new column with number of moons for each planet.


In [4]:
moons = [0, 0, 1, 2, 79, 82, 27, 14]
# Alternatively: moons = pd.Series([...], index=planets.index)
# (a Series is matched to rows by index label, not by position)

planets["moons"] = moons  # define a new column
planets

name,symbol,equatorian_diameter,orbital_period,moons
Mercury,☿,0.39,0.24,0
Venus,♀,0.72,0.62,0
Earth,⊕,1.0,1.0,1
Mars,♂,1.52,1.88,2
Jupiter,♃,5.2,11.86,79
Saturn,♄,9.54,29.46,82
Uranus,♅,19.22,84.01,27
Neptune,♆,30.06,164.8,14
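
One caveat applies when the new column is a `Series` rather than a list: pandas matches its values to rows by index label, not by position. The following sketch uses a shortened, hypothetical table (`demo`) to show the difference; all names in it are illustrative only.

```python
import pandas as pd

# A shortened stand-in for the planets table (illustrative only)
demo = pd.DataFrame(
    {"symbol": ["☿", "♀", "⊕"]},
    index=pd.Index(["Mercury", "Venus", "Earth"], name="name"),
)

# A Series with the default 0, 1, 2 index shares no labels with the table,
# so every value in the new column ends up missing (NaN)
demo["moons_misaligned"] = pd.Series([0, 0, 1])

# Reusing the table's index makes the labels line up
demo["moons"] = pd.Series([0, 0, 1], index=demo.index)

print(demo)
```

A plain list has no labels, so it is always assigned by position and only needs to have the right length.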


💡 In this case, we have directly modified the existing `DataFrame`. 

Most methods and operations in `pandas` (e.g. `set_index`) by default return a new object with the modification applied and leave the original object unchanged.

This is generally a good habit and we will stick to it. Column assignment, however, is one of the accepted exceptions to this rule.

`DataFrame` also offers an `assign` method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html), which does not modify the table but creates a copy of it with the columns added (or replaced). If you want to avoid the annoying tracking of which tables you have changed, we can only recommend `assign`.

By the way, you can make a copy of a table at any time using the `copy` method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html). This is useful when writing functions in which the input table would otherwise be modified.

In [5]:
# New temporary DataFrame

planets.assign(has_rings = [False, False, False, False, True, True, True, True])

name,symbol,equatorian_diameter,orbital_period,moons,has_rings
Mercury,☿,0.39,0.24,0,False
Venus,♀,0.72,0.62,0,False
Earth,⊕,1.0,1.0,1,False
Mars,♂,1.52,1.88,2,False
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uranus,♅,19.22,84.01,27,True
Neptune,♆,30.06,164.8,14,True


In [6]:
planets # The planets df has remained unchanged.

name,symbol,equatorian_diameter,orbital_period,moons
Mercury,☿,0.39,0.24,0
Venus,♀,0.72,0.62,0
Earth,⊕,1.0,1.0,1
Mars,♂,1.52,1.88,2
Jupiter,♃,5.2,11.86,79
Saturn,♄,9.54,29.46,82
Uranus,♅,19.22,84.01,27
Neptune,♆,30.06,164.8,14


In [7]:
planets2 = planets.copy()
planets2["has_rings"] = [False, False, False, False, True, True, True, True]
planets2
# The original planets df will not change

name,symbol,equatorian_diameter,orbital_period,moons,has_rings
Mercury,☿,0.39,0.24,0,False
Venus,♀,0.72,0.62,0,False
Earth,⊕,1.0,1.0,1,False
Mars,♂,1.52,1.88,2,False
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uranus,♅,19.22,84.01,27,True
Neptune,♆,30.06,164.8,14,True


> ### Task 
Add a column with the (approximate) year of discovery (`"discovered"`). Try both methods.
You can find the data at https://en.wikipedia.org/wiki/Timeline_of_discovery_of_Solar_System_planets_and_their_moons.

In [8]:
#Your code goes here...

### Scalar value

Occasionally (we use this very rarely), a single scalar value is assigned as a new column; the same value is then used in every row:

In [9]:
planets["is_planet"] = True
planets

name,symbol,equatorian_diameter,orbital_period,moons,is_planet
Mercury,☿,0.39,0.24,0,True
Venus,♀,0.72,0.62,0,True
Earth,⊕,1.0,1.0,1,True
Mars,♂,1.52,1.88,2,True
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uranus,♅,19.22,84.01,27,True
Neptune,♆,30.06,164.8,14,True


### Add a new row
Let's travel back to a time before 2006, when the list of planets still had one more member: Pluto.

We will insert it into our table as a new row using the `loc` indexer, which we have previously used to access values from the table:

In [10]:
planets.loc["Pluto"] = ["♇", 39.48, 247.94, 5,"1980", True] # List of values in a row in planets
planets

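ValueError: cannot set a row with mismatched columns

The assignment fails because the list holds six values, while `planets` as shown above has only five columns (the extra value, `"1980"`, presumably belongs to the `discovered` column from the task above). A new row needs exactly one value per column, in column order. As a result, Pluto is not added here; its row appears in the tables below only once a single cell is written to it, which is why its other values are missing.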
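
For comparison, here is a minimal sketch of a row assignment that succeeds because the list has exactly one value per column. It uses a small hypothetical table (`demo`) rather than `planets`; the Pluto values are taken from the cell above.

```python
import pandas as pd

demo = pd.DataFrame(
    {"symbol": ["⊕", "♂"], "orbital_period": [1.0, 1.88], "is_planet": [True, True]},
    index=pd.Index(["Earth", "Mars"], name="name"),
)

# Three columns, so the new row needs exactly three values, in column order
demo.loc["Pluto"] = ["♇", 247.94, True]

print(demo)
```

Comparing `len(your_list)` with `len(demo.columns)` before assigning is a quick way to catch a mismatch.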

> ### Task
Add the Sun or some completely fictional planet to the table

In [None]:
# Your code goes here

### Change cell value(s)
The "indexers" `.loc` and `.iloc` with two arguments in square brackets refer directly to a specific cell, and by assigning them (again, similarly to a dictionary), the value is written to the appropriate place. You just need to keep the order (row, column).
Let's come back to the present moment and deprive Pluto of its status:

In [11]:
planets.loc["Pluto", "is_planet"] = False
planets

name,symbol,equatorian_diameter,orbital_period,moons,is_planet
Mercury,☿,0.39,0.24,0.0,True
Venus,♀,0.72,0.62,0.0,True
Earth,⊕,1.0,1.0,1.0,True
Mars,♂,1.52,1.88,2.0,True
Jupiter,♃,5.2,11.86,79.0,True
Saturn,♄,9.54,29.46,82.0,True
Uranus,♅,19.22,84.01,27.0,True
Neptune,♆,30.06,164.8,14.0,True
Pluto,,,,,False


> ⚠ Attention: As with a dictionary, it is possible to write a value into a row or column that does not exist yet; it is then created for you!

In [12]:
planets_bad = planets.copy()  # Work on a copy so that we don't mess up the original df
planets_bad.loc["Earth", "is_star"] = False
planets_bad

name,symbol,equatorian_diameter,orbital_period,moons,is_planet,is_star
Mercury,☿,0.39,0.24,0.0,True,
Venus,♀,0.72,0.62,0.0,True,
Earth,⊕,1.0,1.0,1.0,True,False
Mars,♂,1.52,1.88,2.0,True,
Jupiter,♃,5.2,11.86,79.0,True,
Saturn,♄,9.54,29.46,82.0,True,
Uranus,♅,19.22,84.01,27.0,True,
Neptune,♆,30.06,164.8,14.0,True,
Pluto,,,,,False,


💡 You must be wondering what `NaN` means in the table (it is how `pandas` displays the empty cells). `NaN` (Not a Number) indicates a **missing**, **invalid**, or **unknown value**.

In our example, we never entered these values, so their absence is not surprising. We will talk about missing values and how to deal with them next time, so let's not worry about them for now.

It is also possible to assign to ranges of the index. We just need to make sure that we assign either a *scalar value* (i.e. a single value for the whole area) or an array-like object (`Series`, `DataFrame`, list, ...) of the same shape (number of rows and columns) as the area we assign to:

In [13]:
planets.loc["Mercury": "Mars", "has_rings"] = False
planets.loc["Jupiter": "Neptune", "has_rings"] = [True, True, True, True]
planets

name,symbol,equatorian_diameter,orbital_period,moons,is_planet,has_rings
Mercury,☿,0.39,0.24,0.0,True,False
Venus,♀,0.72,0.62,0.0,True,False
Earth,⊕,1.0,1.0,1.0,True,False
Mars,♂,1.52,1.88,2.0,True,False
Jupiter,♃,5.2,11.86,79.0,True,True
Saturn,♄,9.54,29.46,82.0,True,True
Uranus,♅,19.22,84.01,27.0,True,True
Neptune,♆,30.06,164.8,14.0,True,True
Pluto,,,,,False,


> ### Task
Create a `has_moon` column that indicates whether a planet has a moon

In [14]:
#Your code goes here...

### Delete a row
Use the `drop` method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) to remove a row or column from the `DataFrame`.

Its first argument expects the label (or a list of labels) of the rows or columns that you want to remove. The `axis` argument indicates along which dimension the operation is applied. You can use either the number 0 or 1 (matching the zero-based position of the key when you reference a cell: row first, column second), or the name of the dimension:

Values of `axis`:
* 0 or "index" → rows
* 1 or "columns" → columns

Numerous other methods and functions use this argument, so make sure you understand it.
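
As a brief illustration of the same `axis` argument in another method, the sketch below applies `sum` to a small made-up table (the names `demo`, `a` and `b` are hypothetical). Note that for aggregations, `axis=0` means that the rows are collapsed, which gives one result per column.

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# axis=0 / "index": work along the rows -> one sum per column
column_sums = demo.sum(axis="index")

# axis=1 / "columns": work along the columns -> one sum per row
row_sums = demo.sum(axis="columns")

print(column_sums)
print(row_sums)
```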

Now, let's have no mercy on Pluto and delete it from the `DataFrame`. For the `drop` method, the default value of the `axis` argument is 0, so we would not have to write it:

In [15]:
planets

name,symbol,equatorian_diameter,orbital_period,moons,is_planet,has_rings
Mercury,☿,0.39,0.24,0.0,True,False
Venus,♀,0.72,0.62,0.0,True,False
Earth,⊕,1.0,1.0,1.0,True,False
Mars,♂,1.52,1.88,2.0,True,False
Jupiter,♃,5.2,11.86,79.0,True,True
Saturn,♄,9.54,29.46,82.0,True,True
Uranus,♅,19.22,84.01,27.0,True,True
Neptune,♆,30.06,164.8,14.0,True,True
Pluto,,,,,False,


In [16]:
planets2 = planets.drop("Pluto", axis=0)  # or axis="index" to be explicit
planets2

name,symbol,equatorian_diameter,orbital_period,moons,is_planet,has_rings
Mercury,☿,0.39,0.24,0.0,True,False
Venus,♀,0.72,0.62,0.0,True,False
Earth,⊕,1.0,1.0,1.0,True,False
Mars,♂,1.52,1.88,2.0,True,False
Jupiter,♃,5.2,11.86,79.0,True,True
Saturn,♄,9.54,29.46,82.0,True,True
Uranus,♅,19.22,84.01,27.0,True,True
Neptune,♆,30.06,164.8,14.0,True,True


> ### Task
Create a table from `planets` that will contain neither Uranus nor Neptune (with one command).

In [17]:
#Your code goes here

### Delete a column
The `drop` method works very similarly for a column, only this time we have to specify the `axis` argument.

Let's remove the unnecessary column with the information that a planet is a planet.

In [18]:
planets = planets.drop("is_planet", axis="columns")
planets

name,symbol,equatorian_diameter,orbital_period,moons,has_rings
Mercury,☿,0.39,0.24,0.0,False
Venus,♀,0.72,0.62,0.0,False
Earth,⊕,1.0,1.0,1.0,False
Mars,♂,1.52,1.88,2.0,False
Jupiter,♃,5.2,11.86,79.0,True
Saturn,♄,9.54,29.46,82.0,True
Uranus,♅,19.22,84.01,27.0,True
Neptune,♆,30.06,164.8,14.0,True
Pluto,,,,,


In [19]:
planets.drop("has_moon", axis="columns", inplace=True)  # KeyError unless the has_moon column from the task above exists
planets

KeyError: "['has_moon'] not found in axis"

The `drop` method, in accordance with the above-mentioned convention, returns a new `DataFrame` (which is why we assigned the result back to `planets`). If you want to operate directly on the table, you can use the `del` statement (it works the same way as for a dictionary) or the special argument `inplace=True`.

In [20]:
# Only at your own risk
# Alternative 1) # del planets["is_planet"]
# Alternative 2) # planets.drop("is_planet", axis=1, inplace=True)
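
To make the difference between the three variants concrete, here is a small sketch on a throwaway table (`demo` and its columns `a`, `b`, `c` are hypothetical, so `planets` stays untouched).

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop returns a new table; the original keeps all its columns
without_a = demo.drop("a", axis="columns")

# del removes the column from the original table itself
del demo["b"]

# inplace=True also modifies the original and returns None
result = demo.drop("c", axis="columns", inplace=True)

print(list(without_a.columns))  # columns of the new table
print(list(demo.columns))       # columns left in the original
```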

## Data types

### Data preparation
We will now leave the planets and look at some interesting characteristics of countries around the world (since the definition of a country is somewhat vague, we consider UN members). For each country, we captured data for one particular year of the past decade (because not all data are always available, we take the most recent year for which enough indicators are known).

The data comes mostly from the Gapminder project (https://www.gapminder.org/); we have added just a few more pieces of information from Wikipedia.

The following code (you don't have to understand it) will download the required file and save it in the local directory. 

Alternatively, you can download it manually from https://raw.githubusercontent.com/janpipek/data-pro-pyladies/master/data/countries.csv

In [21]:
import os
import requests


source = "https://raw.githubusercontent.com/janpipek/data-pro-pyladies/master/data/countries.csv"
filename = source.rsplit("/", 1)[-1]  # last part of the URL: "countries.csv"

if not os.path.exists(filename):
    print(f"The file {filename} has not been downloaded yet, let's go...")
    response = requests.get(source)
    response.raise_for_status()  # stop here if the download failed
    with open(filename, "wb") as out:
        out.write(response.content)
    print(f"File {filename} downloaded successfully.")
else:
    print(f"The file {filename} has already been downloaded, we will use the local copy.")

The file countries.csv has already been downloaded, we will use the local copy.


We will open it using the already familiar function `read_csv` (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). (Note: `pandas` can also open the file directly from the internet, but we prefer to use a local copy so that you can also work offline.)

In [22]:
# Instead of `set_index`, we select the index right away when loading

countries = pd.read_csv("countries.csv", index_col="name")

countries = countries.sort_index()
countries

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Afghanistan,AFG,south_asia,asia,low_income,False,False,,2018,652860.0,34500000.0,0.03,20.62,21.07,,2090.0,66.3,58.69,65.812,63.101,1946-11-19
Albania,ALB,europe_central_asia,europe,upper_middle_income,False,False,,2018,28750.0,3238000.0,7.29,26.45,25.66,5.978,3193.0,12.5,78.01,80.737,76.693,1955-12-14
Algeria,DZA,middle_east_north_africa,africa,upper_middle_income,False,False,,2018,2381740.0,36980000.0,0.69,24.60,26.37,,3296.0,21.9,77.86,77.784,75.279,1962-10-08
Andorra,AND,europe_central_asia,europe,high_income,False,False,,2017,470.0,88910.0,10.17,27.63,26.43,,,2.1,82.55,,,1993-07-28
Angola,AGO,sub_saharan_africa,africa,upper_middle_income,False,False,,2018,1246700.0,20710000.0,5.57,22.25,23.48,,2473.0,96.0,65.19,64.939,59.213,1976-12-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,VEN,america,americas,upper_middle_income,False,False,,2018,912050.0,30340000.0,7.60,27.45,28.13,7.332,2631.0,12.9,75.91,79.079,70.950,1945-11-15
Vietnam,VNM,east_asia_pacific,asia,lower_middle_income,False,False,,2018,330967.0,90660000.0,3.91,20.92,21.07,,2745.0,17.3,74.88,81.203,72.003,1977-09-20
Yemen,YEM,middle_east_north_africa,asia,lower_middle_income,False,False,,2018,527970.0,26360000.0,0.20,24.44,26.11,,2223.0,33.8,67.14,66.871,63.875,1947-09-30
Zambia,ZMB,sub_saharan_africa,africa,lower_middle_income,False,False,,2018,752610.0,14310000.0,3.56,20.68,23.05,11.260,1930.0,43.3,59.45,65.362,59.845,1964-12-01


Let's select a country and see what kind of data we have in the table.

In [23]:
countries.loc["Costa Rica"]

iso                                             CRI
world_6region                               america
world_4region                              americas
income_groups                   upper_middle_income
is_eu                                         False
is_oecd                                       False
eu_accession                                    NaN
year                                           2018
area                                        51100.0
population                                4860000.0
alcohol_adults                                 5.81
bmi_men                                       26.48
bmi_women                                     27.03
car_deaths_per_100000_people                  6.319
calories_per_day                             2848.0
infant_mortality                                8.5
life_expectancy                               81.42
life_expectancy_female                       82.598
life_expectancy_male                         77.912
un_accession

At first glance, the fields hold values of different types. But which types exactly?

The `dtypes` attribute (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) will provide us with an answer.
For a `Series`, use `dtype` (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dtype.html), or `dtype.name` if you want an equally nice string representation.

In [24]:
countries.dtypes

iso                              object
world_6region                    object
world_4region                    object
income_groups                    object
is_eu                              bool
is_oecd                            bool
eu_accession                     object
year                              int64
area                            float64
population                      float64
alcohol_adults                  float64
bmi_men                         float64
bmi_women                       float64
car_deaths_per_100000_people    float64
calories_per_day                float64
infant_mortality                float64
life_expectancy                 float64
life_expectancy_female          float64
life_expectancy_male            float64
un_accession                     object
dtype: object
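
A minimal sketch of the three attributes side by side, on a tiny made-up table (`demo` is hypothetical; its two columns only borrow their names and a few values from `countries`):

```python
import pandas as pd

demo = pd.DataFrame({"year": [2017, 2018], "bmi_men": [27.63, 22.25]})

print(demo.dtypes)                 # one dtype per column (returned as a Series)
print(demo["year"].dtype)          # dtype object of a single column
print(demo["bmi_men"].dtype.name)  # plain string with the dtype's name
```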

> 💡 The types in pandas are based on how they are defined by the `numpy` library

`numpy` is useful for working with numeric arrays. It provides vector operations that are one or two orders of magnitude faster than their pure Python equivalents. (Side note: vectorization is the practice of replacing explicit loops with array expressions.)

A data type object (an instance of the `numpy.dtype` class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted. Among other things, it describes the following aspects of the data:

* Type of the data (integer, float, Python object, etc.)
* Size of the data (how many bytes, e.g., an integer occupies)
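
Both aspects can be inspected directly on a `numpy.dtype` object. The sketch below is only an illustration with a handful of arbitrarily chosen type names:

```python
import numpy as np

for name in ["int8", "int64", "float32", "bool"]:
    dt = np.dtype(name)
    # kind: one-letter code for the type family ('i' signed integer,
    # 'f' floating point, 'b' boolean); itemsize: bytes per element
    print(name, dt.kind, dt.itemsize)
```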


Fortunately, the somewhat mysterious type system in `numpy` (described at https://docs.scipy.org/doc/numpy/user/basics.types.html) is slightly simplified in `pandas`, which offers just a few useful basic data types that we will use in our course.

### Integers
In Python, there is exactly one type reserved for integers: `int`, which allows you to work with integers of any size (0, -58 or 123456789012345678901234567890).

In `pandas` you can find `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32` and `uint64`. They all share the same basic features, but each can store only a certain range of numbers.

They differ in the amount of memory that one number takes up (the number in the name indicates the number of bits), and whether negative numbers are also supported (the prefix `u` means unsigned - it contains only zero and positive numbers).

Ranges:
* `int8`: -128 to 127
* `uint8`: 0 to 255
* `int16`: -32 768 to 32 767
* `uint16`: 0 to 65 535
* `int32`: -2 147 483 648 to 2 147 483 647 (+/- ~ 2 billion)
* `uint32`: 0 to 4 294 967 295 (up to ~ 4 billion)
* `int64`: -9 223 372 036 854 775 808 to 9 223 372 036 854 775 807 (+/- ~ 9 quintillion)
* `uint64`: 0 to 18 446 744 073 709 551 615 (up to ~ 18 quintillion)
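
You do not have to memorize these ranges. As a small sketch, `numpy` can report them through `np.iinfo`:

```python
import numpy as np

for name in ["int8", "uint8", "int16", "uint16", "int32", "uint32", "int64", "uint64"]:
    info = np.iinfo(np.dtype(name))
    print(f"{name}: {info.min} to {info.max}")
```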

For a detailed explanation of how integers are represented in computer memory, check https://en.wikipedia.org/wiki/Integer.

In `pandas`, the default integer type is `int64`, and unless you say otherwise, it will automatically be used for integers (in most cases this will be a good choice):

In [25]:
countries["year"]

name
Afghanistan    2018
Albania        2018
Algeria        2018
Andorra        2017
Angola         2018
               ... 
Venezuela      2018
Vietnam        2018
Yemen          2018
Zambia         2018
Zimbabwe       2018
Name: year, Length: 193, dtype: int64

In [26]:
pd.Series([0, 123, 12345])

# pd.Series([0, 123, 12345], dtype="int64")  # same result

0        0
1      123
2    12345
dtype: int64

You can use the `dtype` argument to specify exactly which type of integers you want:

In [27]:
pd.Series([0, 123, 12345], dtype="int16")

0        0
1      123
2    12345
dtype: int16

> ⚠ Caution: When selecting a specific integer type, you must watch the ranges. Depending on the `pandas` version, a value that does not "fit" into the range may only trigger a warning (or nothing at all) while the left-over part of its binary representation is discarded; recent versions raise an error instead. When the value is truncated, you get a much smaller number than you expected:

In [28]:
pd.Series([0, 123, 12345], dtype="int8")

0      0
1    123
2     57
dtype: int8
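
Where does the 57 come from? An `int8` keeps only the lowest 8 bits of a number, which amounts to taking the remainder after division by 256 (12345 = 48 · 256 + 57). The sketch below reproduces this with plain Python arithmetic and with an explicit `numpy` cast; using `astype` on a `numpy` array is our choice for the illustration, not something the cell above does.

```python
import numpy as np

value = 12345

# Keeping only the lowest 8 bits is the same as taking the remainder modulo 256
low_byte = value % 256
print(low_byte)  # → 57

# NumPy's astype performs this C-style cast without complaint
wrapped = np.array([value], dtype="int64").astype("int8")
print(wrapped)
```

If the remainder were 128 or more, the `int8` result would be negative (the remainder minus 256), because the highest bit is interpreted as the sign.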

Now, let's create a `Series` with an exceptionally big number (far beyond the range of `uint64`) and see what happens:

In [29]:
pd.Series([0, 123, 123456789012345678901234567890])  # too large for int64, so the dtype falls back to object

0                                 0
1                               123
2    123456789012345678901234567890
dtype: object

- If we explicitly request `int64`, an exception is thrown.
- When we let `pandas` choose, the general type `object` is used and we lose some advantages: the column takes up many times more memory, and arithmetic operations on it are one or two orders of magnitude slower. If performance is not our priority, we can leave it like that.

To sum up, we generally recommend sticking to `int64`, or letting `pandas` pick it for us automatically. Looking for a smaller type only pays off when memory requirements are strict.
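
The memory cost of `object` can be checked with `memory_usage`. This is a rough sketch on 1000 made-up integers; the exact byte counts depend on the platform and library versions, so none are quoted here.

```python
import pandas as pd

as_int64 = pd.Series(range(1000), dtype="int64")
as_object = as_int64.astype("object")

# deep=True also counts the Python objects that an object column points to
print(as_int64.memory_usage(deep=True))
print(as_object.memory_usage(deep=True))
```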

> ### Task: 
Create a `Series` with data type `uint8`, containing (at least) one small negative number. What will happen?

In [30]:
#Your code goes here...

### Floating numbers
As with integers, one type in Python (`float`) corresponds to several types in `pandas`: `float16`, `float32`, `float64`.

The name again includes the number of bits that one number occupies. Fortunately, in this case `float64` exactly matches the behavior of Python's `float`. The other two types are less precise and have a smaller range; apart from optimizing memory requirements for a specific kind of data, you probably won't use them.

In [31]:
countries["bmi_men"]

name
Afghanistan    20.62
Albania        26.45
Algeria        24.60
Andorra        27.63
Angola         22.25
               ...  
Venezuela      27.45
Vietnam        20.92
Yemen          24.44
Zambia         20.68
Zimbabwe       22.03
Name: bmi_men, Length: 193, dtype: float64

In [32]:
# Quite accurate pi
pd.Series([3.14159265])

0    3.141593
dtype: float64

In [33]:
# Not so accurate pi
pd.Series([3.14159265], dtype="float16")

0    3.140625
dtype: float16

> ### Task
Create a Series of type `float64` only from integers. What will happen?

In [34]:
#Your code goes here...

### Booleans
This is probably the least surprising data type. It behaves basically the same as the `bool` type in Python: it takes the values `True` and `False` (which are treated as 1 and 0 in some operations).

It has another great feature - `Series` and `DataFrame` objects can be filtered using a column of logical type:

In [35]:
countries["is_oecd"].iloc[:20]

name
Afghanistan            False
Albania                False
Algeria                False
Andorra                False
Angola                 False
Antigua and Barbuda    False
Argentina              False
Armenia                False
Australia               True
Austria                 True
Azerbaijan             False
Bahamas                False
Bahrain                False
Bangladesh             False
Barbados               False
Belarus                False
Belgium                 True
Belize                 False
Benin                  False
Bhutan                 False
Name: is_oecd, dtype: bool
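
The filtering mentioned above can be sketched as follows, assuming a three-row stand-in (`demo`) for the `countries` table; the sketch also shows the "True counts as 1" behavior.

```python
import pandas as pd

demo = pd.DataFrame(
    {"iso": ["ALB", "AUS", "AUT"], "is_oecd": [False, True, True]},
    index=pd.Index(["Albania", "Australia", "Austria"], name="name"),
)

# A boolean column used inside square brackets keeps only the rows where it is True
oecd_only = demo[demo["is_oecd"]]
print(oecd_only)

# True counts as 1 and False as 0, so sum() gives the number of True values
print(demo["is_oecd"].sum())
```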

In [36]:
# Create a new Series

pd.Series([True, False, False])

0     True
1    False
2    False
dtype: bool

Another way of creating it:

In [37]:
pd.Series([1, 0, 0], dtype="bool")

0     True
1    False
2    False
dtype: bool

> ### Task
What happens when you create a `Series` of type `bool` from the strings `"True"` and `"False"` (don't forget the quotes)?

In [38]:
#Your code goes here...

### Strings, objects
In the current version of the `pandas` library, it is recommended to use the dedicated dtype `string` instead of the general dtype `object`. The difference is more or less aesthetic at the moment (and for convenience we usually won't convert columns to `string`, since it is unlikely to improve performance).

In [39]:
countries["iso"]

name
Afghanistan    AFG
Albania        ALB
Algeria        DZA
Andorra        AND
Angola         AGO
              ... 
Venezuela      VEN
Vietnam        VNM
Yemen          YEM
Zambia         ZMB
Zimbabwe       ZWE
Name: iso, Length: 193, dtype: object

By default, strings, along with other unspecified or unrecognized values, fall into the `object` category.

If you want to be explicit or get some extra type checking, you can specify the data type `string` in the constructor, or convert the column using the `astype` method:

In [40]:
# countries["iso"].astype("string")

# Pets

pets = pd.Series( ["dog", "cat", "hamster", "tarantula", "boa"], dtype="string")

pets

0          dog
1          cat
2      hamster
3    tarantula
4          boa
dtype: string

In [41]:
# pets[0] = 42  # would throw an error
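
If you want to see the extra type checking without crashing the notebook, the following sketch wraps the assignment in `try`/`except`. The exact exception class may differ between `pandas` versions, which is why two are caught; the `loose` series is a hypothetical counterpart with dtype `object`.

```python
import pandas as pd

pets = pd.Series(["dog", "cat", "hamster"], dtype="string")

try:
    pets[0] = 42  # not a string, so the string dtype refuses it
except (TypeError, ValueError) as error:
    # The exact exception class may differ between pandas versions
    print(type(error).__name__, error)

# An object column accepts the same assignment without complaint
loose = pd.Series(["dog", "cat", "hamster"], dtype="object")
loose[0] = 42
print(loose)
```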

The `object` data type is the only option if we have heterogeneous data in `Series`:

In [42]:
pd.Series([1, "two", 3.0])  # integer, string & float

0      1
1    two
2    3.0
dtype: object

Note that lists can also be values in a column of type `object`:

In [43]:
# Orders
pd.Series(
    [["steak", "potatoes", "cola"], ["tuna", "french fries"], ["soda"]],
    index=["Eva", "Dagma", "Petra"],
)

Eva      [steak, potatoes, cola]
Dagma       [tuna, french fries]
Petra                     [soda]
dtype: object

> ### Task:
1. What kind of object (and what `dtype`) do we get when we try to get one row from the `planets` table?
2. What happens when you convert the column `planets["orbital_period"]` to `object` or `string`?

In [44]:
#Your code goes here

### Date / Time (datetime)
One of the future lessons deals with time data, but we already have some in the table of countries, so for completeness let's explore what `pandas` offers in this regard:
- Date and time data (*datetimes*) as points on the timeline.
- Date and time data with a time zone (*datetimes with time zone*).
- Time differences (*timedeltas*) as the length of an interval (measured in nanoseconds).
- Periods (*periods*) as specified spans of time (e.g. "February 2020").
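
The four kinds listed above correspond to scalar objects that you can create directly. The sketch below uses arbitrarily chosen dates; the variable names are ours.

```python
import pandas as pd

# A point on the timeline
moment = pd.Timestamp("2020-02-01 12:00")

# The same kind of point, with a time zone attached
moment_utc = pd.Timestamp("2020-02-01 12:00", tz="UTC")

# Subtracting two timestamps gives a Timedelta (a length of time)
span = pd.Timestamp("2020-02-29") - pd.Timestamp("2020-02-01")

# A Period stands for a whole span, here the month of February 2020
february = pd.Period("2020-02", freq="M")

print(moment, moment_utc, span, february, sep="\n")
print(february.start_time, february.end_time)
```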

💡 The `to_datetime` function converts values from various formats to date / time; we will use it in the following example:

In [45]:
pd.to_datetime(countries["un_accession"])

name
Afghanistan   1946-11-19
Albania       1955-12-14
Algeria       1962-10-08
Andorra       1993-07-28
Angola        1976-12-01
                 ...    
Venezuela     1945-11-15
Vietnam       1977-09-20
Yemen         1947-09-30
Zambia        1964-12-01
Zimbabwe      1980-08-25
Name: un_accession, Length: 193, dtype: datetime64[ns]

### Categorical
If we want to be efficient when working with columns where values are often repeated (especially strings), we can encode them into categories. This often saves space and speeds up some operations. In such a conversion, `pandas` finds all the unique values in the column, stores them in a special list, and then stores only the indexes into that list. Everything behaves transparently, so in everyday use you usually don't even notice whether you have a column of type `object` or `category`.

💡 The `astype` method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html) is used for conversion between different data types; it accepts as its argument the name of the dtype to which we want to convert:

In [46]:
countries.dtypes

iso                              object
world_6region                    object
world_4region                    object
income_groups                    object
is_eu                              bool
is_oecd                            bool
eu_accession                     object
year                              int64
area                            float64
population                      float64
alcohol_adults                  float64
bmi_men                         float64
bmi_women                       float64
car_deaths_per_100000_people    float64
calories_per_day                float64
infant_mortality                float64
life_expectancy                 float64
life_expectancy_female          float64
life_expectancy_male            float64
un_accession                     object
dtype: object

In [47]:
countries["income_groups"].astype("category")  # returns a new Series; nothing is stored
countries = countries.astype({"income_groups": "category"})  # changes the dtype in the df
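
To peek at the "special list" and the stored indexes behind a categorical column, you can use the `.cat` accessor. This is a minimal sketch on a four-value made-up `Series` (`income`), not on the real column:

```python
import pandas as pd

income = pd.Series(
    ["low_income", "high_income", "low_income", "upper_middle_income"],
    dtype="category",
)

# The unique values are stored once (sorted alphabetically by default)...
print(income.cat.categories)

# ...and each row only holds a small integer code pointing into that list
print(income.cat.codes)
```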

> ### Task 
Can you think of which columns from the `countries` table we should convert to some other type?

In [48]:
#Your code goes here...

## Mathematics
The `Series` in `pandas` is designed to be as functional as possible. Individual columns can thus become part of arithmetic expressions along with scalar values, other columns, `numpy` arrays of the appropriate shape, and even lists.
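
As a compact sketch of these combinations, the example below works on a two-row stand-in (`demo`) that reuses the area and population values of Afghanistan and Albania from the table; the derived names (`population_millions`, `density`, `area_plus`) are ours.

```python
import pandas as pd

demo = pd.DataFrame(
    {"area": [652860.0, 28750.0], "population": [34500000.0, 3238000.0]},
    index=pd.Index(["Afghanistan", "Albania"], name="name"),
)

# Column with a scalar: the scalar is applied to every row
population_millions = demo["population"] / 1_000_000

# Column with a column: values are combined row by row (matched by index)
density = demo["population"] / demo["area"]

# Column with a list of the same length: combined position by position
area_plus = demo["area"] + [1, 2]

print(population_millions, density, area_plus, sep="\n\n")
```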

In [49]:
countries

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Afghanistan,AFG,south_asia,asia,low_income,False,False,,2018,652860.0,34500000.0,0.03,20.62,21.07,,2090.0,66.3,58.69,65.812,63.101,1946-11-19
Albania,ALB,europe_central_asia,europe,upper_middle_income,False,False,,2018,28750.0,3238000.0,7.29,26.45,25.66,5.978,3193.0,12.5,78.01,80.737,76.693,1955-12-14
Algeria,DZA,middle_east_north_africa,africa,upper_middle_income,False,False,,2018,2381740.0,36980000.0,0.69,24.60,26.37,,3296.0,21.9,77.86,77.784,75.279,1962-10-08
Andorra,AND,europe_central_asia,europe,high_income,False,False,,2017,470.0,88910.0,10.17,27.63,26.43,,,2.1,82.55,,,1993-07-28
Angola,AGO,sub_saharan_africa,africa,upper_middle_income,False,False,,2018,1246700.0,20710000.0,5.57,22.25,23.48,,2473.0,96.0,65.19,64.939,59.213,1976-12-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,VEN,america,americas,upper_middle_income,False,False,,2018,912050.0,30340000.0,7.60,27.45,28.13,7.332,2631.0,12.9,75.91,79.079,70.950,1945-11-15
Vietnam,VNM,east_asia_pacific,asia,lower_middle_income,False,False,,2018,330967.0,90660000.0,3.91,20.92,21.07,,2745.0,17.3,74.88,81.203,72.003,1977-09-20
Yemen,YEM,middle_east_north_africa,asia,lower_middle_income,False,False,,2018,527970.0,26360000.0,0.20,24.44,26.11,,2223.0,33.8,67.14,66.871,63.875,1947-09-30
Zambia,ZMB,sub_saharan_africa,africa,lower_middle_income,False,False,,2018,752610.0,14310000.0,3.56,20.68,23.05,11.260,1930.0,43.3,59.45,65.362,59.845,1964-12-01


In [50]:
# Life expectancy in days
countries["life_exp_days"] = countries["life_expectancy"] * 365
countries

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,AFG,south_asia,asia,low_income,False,False,,2018,652860.0,34500000.0,...,20.62,21.07,,2090.0,66.3,58.69,65.812,63.101,1946-11-19,21421.85
Albania,ALB,europe_central_asia,europe,upper_middle_income,False,False,,2018,28750.0,3238000.0,...,26.45,25.66,5.978,3193.0,12.5,78.01,80.737,76.693,1955-12-14,28473.65
Algeria,DZA,middle_east_north_africa,africa,upper_middle_income,False,False,,2018,2381740.0,36980000.0,...,24.60,26.37,,3296.0,21.9,77.86,77.784,75.279,1962-10-08,28418.90
Andorra,AND,europe_central_asia,europe,high_income,False,False,,2017,470.0,88910.0,...,27.63,26.43,,,2.1,82.55,,,1993-07-28,30130.75
Angola,AGO,sub_saharan_africa,africa,upper_middle_income,False,False,,2018,1246700.0,20710000.0,...,22.25,23.48,,2473.0,96.0,65.19,64.939,59.213,1976-12-01,23794.35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,VEN,america,americas,upper_middle_income,False,False,,2018,912050.0,30340000.0,...,27.45,28.13,7.332,2631.0,12.9,75.91,79.079,70.950,1945-11-15,27707.15
Vietnam,VNM,east_asia_pacific,asia,lower_middle_income,False,False,,2018,330967.0,90660000.0,...,20.92,21.07,,2745.0,17.3,74.88,81.203,72.003,1977-09-20,27331.20
Yemen,YEM,middle_east_north_africa,asia,lower_middle_income,False,False,,2018,527970.0,26360000.0,...,24.44,26.11,,2223.0,33.8,67.14,66.871,63.875,1947-09-30,24506.10
Zambia,ZMB,sub_saharan_africa,africa,lower_middle_income,False,False,,2018,752610.0,14310000.0,...,20.68,23.05,11.260,1930.0,43.3,59.45,65.362,59.845,1964-12-01,21699.25


In [51]:
# Population density
countries["population_density"] = countries["population"] / countries["area"]

In [52]:
countries

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,AFG,south_asia,asia,low_income,False,False,,2018,652860.0,34500000.0,...,21.07,,2090.0,66.3,58.69,65.812,63.101,1946-11-19,21421.85,52.844408
Albania,ALB,europe_central_asia,europe,upper_middle_income,False,False,,2018,28750.0,3238000.0,...,25.66,5.978,3193.0,12.5,78.01,80.737,76.693,1955-12-14,28473.65,112.626087
Algeria,DZA,middle_east_north_africa,africa,upper_middle_income,False,False,,2018,2381740.0,36980000.0,...,26.37,,3296.0,21.9,77.86,77.784,75.279,1962-10-08,28418.90,15.526464
Andorra,AND,europe_central_asia,europe,high_income,False,False,,2017,470.0,88910.0,...,26.43,,,2.1,82.55,,,1993-07-28,30130.75,189.170213
Angola,AGO,sub_saharan_africa,africa,upper_middle_income,False,False,,2018,1246700.0,20710000.0,...,23.48,,2473.0,96.0,65.19,64.939,59.213,1976-12-01,23794.35,16.611855
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,VEN,america,americas,upper_middle_income,False,False,,2018,912050.0,30340000.0,...,28.13,7.332,2631.0,12.9,75.91,79.079,70.950,1945-11-15,27707.15,33.265720
Vietnam,VNM,east_asia_pacific,asia,lower_middle_income,False,False,,2018,330967.0,90660000.0,...,21.07,,2745.0,17.3,74.88,81.203,72.003,1977-09-20,27331.20,273.924591
Yemen,YEM,middle_east_north_africa,asia,lower_middle_income,False,False,,2018,527970.0,26360000.0,...,26.11,,2223.0,33.8,67.14,66.871,63.875,1947-09-30,24506.10,49.927079
Zambia,ZMB,sub_saharan_africa,africa,lower_middle_income,False,False,,2018,752610.0,14310000.0,...,23.05,11.260,1930.0,43.3,59.45,65.362,59.845,1964-12-01,21699.25,19.013832


In [81]:
countries.head()

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,AFG,south_asia,asia,low_income,False,False,,2018,652860.0,34500000.0,...,21.07,,2090.0,66.3,58.69,65.812,63.101,1946-11-19,21421.85,52.844408
Albania,ALB,europe_central_asia,europe,upper_middle_income,False,False,,2018,28750.0,3238000.0,...,25.66,5.978,3193.0,12.5,78.01,80.737,76.693,1955-12-14,28473.65,112.626087
Algeria,DZA,middle_east_north_africa,africa,upper_middle_income,False,False,,2018,2381740.0,36980000.0,...,26.37,,3296.0,21.9,77.86,77.784,75.279,1962-10-08,28418.9,15.526464
Andorra,AND,europe_central_asia,europe,high_income,False,False,,2017,470.0,88910.0,...,26.43,,,2.1,82.55,,,1993-07-28,30130.75,189.170213
Angola,AGO,sub_saharan_africa,africa,upper_middle_income,False,False,,2018,1246700.0,20710000.0,...,23.48,,2473.0,96.0,65.19,64.939,59.213,1976-12-01,23794.35,16.611855


In [82]:
# How food prices went up
pd.Series([1.2, 1.5], index = ["bread", "butter"]) + [0.5, 1.2] # list addition

bread     1.7
butter    2.7
dtype: float64
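
One detail worth keeping in mind: a list is added position by position, whereas another `Series` is first aligned on its index labels. A small sketch reusing the prices above (the `increase` and `partial` names are made up for illustration):

```python
import pandas as pd

prices = pd.Series([1.2, 1.5], index=["bread", "butter"])

# A list is added position by position
by_position = prices + [0.5, 1.2]

# A Series is aligned on index labels first, whatever the order of its rows
increase = pd.Series([1.2, 0.5], index=["butter", "bread"])
by_label = prices + increase

# A label present on only one side produces NaN
partial = prices + pd.Series([0.5], index=["bread"])

print(by_label)
print(partial)
```

Both `by_position` and `by_label` give the same prices here, even though `increase` lists butter first.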

> ### Task
1. Calculate the total number of deaths in car accidents in each country (use the columns "population" and "car_deaths_per_100000_people" and simple arithmetic). 
2. Which country has the highest total number of deaths in car accidents?

In [83]:
#Your code goes here...

How long has each country been a member of the UN?

In [85]:
from datetime import datetime
datetime.now()

datetime.datetime(2022, 4, 28, 18, 27, 49, 719434)

In [86]:
from datetime import datetime
datetime.now() - pd.to_datetime(countries["un_accession"])

name
Afghanistan   27554 days 18:28:13.438467
Albania       24242 days 18:28:13.438467
Algeria       21752 days 18:28:13.438467
Andorra       10501 days 18:28:13.438467
Angola        16584 days 18:28:13.438467
                         ...            
Venezuela     27923 days 18:28:13.438467
Vietnam       16291 days 18:28:13.438467
Yemen         27239 days 18:28:13.438467
Zambia        20967 days 18:28:13.438467
Zimbabwe      15221 days 18:28:13.438467
Name: un_accession, Length: 193, dtype: timedelta64[ns]
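
A `timedelta64` column is hard to read in days and nanoseconds. The sketch below converts it to approximate years; it assumes a fixed reference date instead of `datetime.now()` so that the result is reproducible, and uses only three sample dates copied from the table.

```python
import pandas as pd

# Three accession dates from the table, stored as text
un_accession = pd.Series(
    ["1946-11-19", "1955-12-14", "1993-07-28"],
    index=["Afghanistan", "Albania", "Andorra"],
)

# Fixed reference date (an assumption; datetime.now() changes on every run)
reference = pd.Timestamp("2022-04-28")

membership = reference - pd.to_datetime(un_accession)   # timedelta64 Series
membership_years = membership.dt.days / 365.25          # approximate years as floats

print(membership_years.round(1))
```

The `.dt` accessor exposes the whole-day component of each time difference, which can then be used in ordinary arithmetic.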

💡 Floating point numbers can also contain special values:
* `NaN` (Not a Number)
* `inf`/`-inf` (plus or minus infinity)

They arise, for example, when dividing by zero:

In [87]:
pd.Series([0, -1, 1]) / pd.Series([0, 0, 0])

0    NaN
1   -inf
2    inf
dtype: float64
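
These special values can be detected and cleaned up before further calculations. A minimal sketch, assuming that infinities should be treated the same way as missing values (this is a choice, not a rule):

```python
import numpy as np
import pandas as pd

ratio = pd.Series([0, -1, 1, 4]) / pd.Series([0, 0, 0, 2])

is_missing = ratio.isna()        # True only for NaN
is_infinite = np.isinf(ratio)    # True for inf and -inf

# Turn infinities into NaN, then drop all missing values
cleaned = ratio.replace([np.inf, -np.inf], np.nan).dropna()

print(cleaned)
```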

**Warning:** Be careful when working with narrow integer types such as `int8`. When a value or a result falls outside the range the type can hold (-128 to 127 for `int8`), it can overflow and wrap around to a misleading number. One more reason to stick to `int64`.

In [88]:
pd.Series([7, 14, 149], dtype="int8") * 2

0    14
1    28
2    42
dtype: int8
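
One way to avoid such surprises is to check the limits of the type and widen the column before calculating. A sketch under the assumption that the input values themselves fit into `int8` (the values below are made up):

```python
import numpy as np
import pandas as pd

small = pd.Series([7, 14, 100], dtype="int8")

limits = np.iinfo(np.int8)           # limits.min == -128, limits.max == 127

wrapped = small * 2                  # stays int8; 200 does not fit into the type
safe = small.astype("int64") * 2     # widen first, then compute

print(limits.min, limits.max)
print(safe)
```

`np.iinfo` reports the smallest and largest value an integer type can store, so it is a quick way to judge whether a planned calculation is safe.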

## Comparisons

Comparison operators can be applied to a `Series` just like arithmetic ones. The result is not a single logical value but a whole column of logical values.

In [89]:
# 15 liters of pure alcohol per person per year will be considered the limit of excessive drinking
# (not consulted with doctors!)
# In which countries do people drink a lot? (on average)

countries["alcohol_adults"] > 15

name
Afghanistan    False
Albania        False
Algeria        False
Andorra        False
Angola         False
               ...  
Venezuela      False
Vietnam        False
Yemen          False
Zambia         False
Zimbabwe       False
Name: alcohol_adults, Length: 193, dtype: bool

In [93]:
countries["alcohol_adults"].head()

name
Afghanistan     0.03
Albania         7.29
Algeria         0.69
Andorra        10.17
Angola          5.57
Name: alcohol_adults, dtype: float64

In [94]:
# Let's examine a specific country
countries.loc["Costa Rica", "alcohol_adults"] > 15

False

In [96]:
countries.loc["Czechia", "alcohol_adults"] > 15

True

In [97]:
# Do men have a higher BMI than women in each country?
countries["bmi_men"] > countries["bmi_women"]

name
Afghanistan    False
Albania         True
Algeria        False
Andorra         True
Angola         False
               ...  
Venezuela      False
Vietnam        False
Yemen          False
Zambia         False
Zimbabwe       False
Length: 193, dtype: bool

> ### Task
Find out if men or women live longer in a particular country

In [98]:
countries.columns

Index(['iso', 'world_6region', 'world_4region', 'income_groups', 'is_eu',
       'is_oecd', 'eu_accession', 'year', 'area', 'population',
       'alcohol_adults', 'bmi_men', 'bmi_women',
       'car_deaths_per_100000_people', 'calories_per_day', 'infant_mortality',
       'life_expectancy', 'life_expectancy_female', 'life_expectancy_male',
       'un_accession', 'life_exp_days', 'population_density'],
      dtype='object')

In [99]:
countries["life_expectancy_female"] > countries["life_expectancy_male"]

name
Afghanistan     True
Albania         True
Algeria         True
Andorra        False
Angola          True
               ...  
Venezuela       True
Vietnam         True
Yemen           True
Zambia          True
Zimbabwe        True
Length: 193, dtype: bool
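
Note that a `False` in such a result does not always mean the opposite is true: any comparison involving `NaN` yields `False`, which is why a country with missing life expectancy data (such as Andorra above) shows `False`. A small sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical values; the NaN entries mimic a country with missing data
female = pd.Series([65.8, np.nan, 80.7], index=["A", "B", "C"])
male = pd.Series([63.1, np.nan, 76.7], index=["A", "B", "C"])

women_live_longer = female > male            # B is False because NaN > NaN is False
comparable = female.notna() & male.notna()   # rows where both values are known

# Keep the comparison only where it is meaningful
result = women_live_longer[comparable]

print(result)
```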

In [101]:
countries.loc["Mexico", "life_expectancy_female"] >  countries.loc["Mexico", "life_expectancy_male"]

True

In [104]:
# Is the country in Africa?
countries["world_4region"] == "africa" # compares every row with the string "africa"

name
Afghanistan    False
Albania        False
Algeria         True
Andorra        False
Angola          True
               ...  
Venezuela      False
Vietnam        False
Yemen          False
Zambia          True
Zimbabwe        True
Name: world_4region, Length: 193, dtype: bool

As in Python, conditions can be combined using logical operators. However, Python does not allow `and`, `or` and `not` to be redefined for element-wise use, so `Series` relies on different operators: `&` (instead of `and`), `|` (instead of `or`), and `~` (instead of `not`). Because these bind more tightly than comparison operators such as `>` or `==`, always wrap each comparison in parentheses when combining them.

In [105]:
# Where do women and men live to be over 75?
(countries["life_expectancy_male"] > 75) & (countries["life_expectancy_female"] > 75)

name
Afghanistan    False
Albania         True
Algeria         True
Andorra        False
Angola         False
               ...  
Venezuela      False
Vietnam        False
Yemen          False
Zambia         False
Zimbabwe       False
Length: 193, dtype: bool
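
The cell above uses only `&`. The remaining two operators work the same way; the following sketch uses a hypothetical three-row table rather than `countries`:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame(
    {"is_eu": [True, False, False], "population": [10_000_000, 50_000, 80_000_000]},
    index=["X", "Y", "Z"],
)

eu_or_tiny = df["is_eu"] | (df["population"] < 100_000)   # OR
not_eu = ~df["is_eu"]                                      # NOT

# Without parentheses, `df["is_eu"] | df["population"] < 100_000` would be parsed as
# `(df["is_eu"] | df["population"]) < 100_000`, which is not the intended condition.

print(eu_or_tiny)
print(not_eu)
```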

In [108]:
countries[(countries["life_expectancy_male"] > 75) & (countries["life_expectancy_female"] > 75)]

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Albania,ALB,europe_central_asia,europe,upper_middle_income,False,False,,2018,28750.0,3238000.0,...,25.66,5.978,3193.0,12.5,78.01,80.737,76.693,1955-12-14,28473.65,112.626087
Algeria,DZA,middle_east_north_africa,africa,upper_middle_income,False,False,,2018,2381740.0,36980000.0,...,26.37,,3296.0,21.9,77.86,77.784,75.279,1962-10-08,28418.9,15.526464
Australia,AUS,east_asia_pacific,asia,high_income,False,True,,2018,7741220.0,23210000.0,...,26.88,5.335,3276.0,3.0,82.87,85.102,81.39,1945-11-01,30247.55,2.998235
Austria,AUT,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,83879.0,8441000.0,...,25.09,3.541,3768.0,2.9,81.84,84.249,79.585,1955-12-14,29871.6,100.633055
Bahrain,BHR,middle_east_north_africa,asia,high_income,False,False,,2018,771.0,1377000.0,...,28.79,8.19,,5.3,77.18,78.293,76.318,1971-09-21,28170.7,1785.992218
Belgium,BEL,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,30530.0,10820000.0,...,25.14,5.427,3733.0,3.3,81.23,83.751,79.131,1945-12-27,29648.95,354.405503
Brunei,BRN,east_asia_pacific,asia,high_income,False,False,,2018,5770.0,419800.0,...,22.89,11.67,2985.0,8.6,77.36,79.24,75.954,1984-09-21,28236.4,72.755633
Canada,CAN,america,americas,high_income,False,True,,2018,9984670.0,34990000.0,...,26.7,6.333,3494.0,4.3,82.16,84.509,80.859,1945-11-09,29988.4,3.504372
Chile,CHL,america,americas,high_income,False,True,,2018,756096.0,17570000.0,...,27.93,3.329,2979.0,7.0,80.66,82.272,77.416,1945-10-24,29440.9,23.23779
China,CHN,east_asia_pacific,asia,upper_middle_income,False,False,,2018,9562911.0,1359000000.0,...,22.91,3.59,3108.0,9.2,76.92,78.163,75.096,1945-10-24,28075.8,142.111539


## Filtering
If you want to select rows from the table that meet a criterion, you must convert this criterion into a column of **logical values**. Then you insert this column (the column itself, not its name!) in square brackets as the index in `DataFrame`.

For example, if you only want information about EU members, you can directly use the "is_eu" column, which contains logical values:

In [109]:
countries[countries["is_eu"]]

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Austria,AUT,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,83879.0,8441000.0,...,25.09,3.541,3768.0,2.9,81.84,84.249,79.585,1955-12-14,29871.6,100.633055
Belgium,BEL,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,30530.0,10820000.0,...,25.14,5.427,3733.0,3.3,81.23,83.751,79.131,1945-12-27,29648.95,354.405503
Bulgaria,BGR,europe_central_asia,europe,upper_middle_income,True,False,2007-01-01,2018,111000.0,7349000.0,...,25.52,9.662,2829.0,9.3,75.32,78.485,71.618,1955-12-14,27491.8,66.207207
Croatia,HRV,europe_central_asia,europe,high_income,True,False,2013-01-01,2018,56590.0,4379000.0,...,25.18,6.434,3059.0,3.6,77.66,81.167,74.701,1992-05-22,28345.9,77.381163
Cyprus,CYP,europe_central_asia,europe,high_income,True,False,2004-05-01,2018,9250.0,1141000.0,...,25.93,6.419,2649.0,2.5,80.79,82.918,78.734,1960-09-20,29488.35,123.351351
Czechia,CZE,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,78870.0,10590000.0,...,26.51,5.72,3256.0,2.8,79.37,81.858,76.148,1993-01-19,28970.05,134.271586
Denmark,DNK,europe_central_asia,europe,high_income,True,True,1973-01-01,2018,42922.0,5611000.0,...,25.11,3.481,3367.0,2.9,81.1,82.878,79.13,1945-10-24,29601.5,130.725502
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000.0,...,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17,28345.9,29.604245
Finland,FIN,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,338420.0,5419000.0,...,25.58,3.615,3368.0,1.9,82.06,84.423,78.934,1955-12-14,29951.9,16.012647
France,FRA,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,549087.0,63780000.0,...,24.83,2.491,3482.0,3.5,82.62,85.747,79.991,1945-10-24,30156.3,116.156456


You don't have to use an existing column in the table; any expression that returns a `Series` of logical values will do:

In [110]:
# Small countries
countries[countries["population"] < 100_000] # Underscores help to separate thousands visually

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Andorra,AND,europe_central_asia,europe,high_income,False,False,,2017,470.0,88910.0,...,26.43,,,2.1,82.55,,,1993-07-28,30130.75,189.170213
Antigua and Barbuda,ATG,america,americas,high_income,False,False,,2018,440.0,91400.0,...,27.51,,2417.0,5.8,77.6,79.028,74.154,1981-11-11,28324.0,207.727273
Dominica,DMA,america,americas,upper_middle_income,False,False,,2017,750.0,67700.0,...,28.78,,2931.0,19.6,73.01,,,1978-12-18,26648.65,90.266667
Liechtenstein,LIE,europe_central_asia,europe,high_income,False,False,,2017,160.0,36870.0,...,,,,1.76,,,,1990-09-18,,230.4375
Marshall Islands,MHL,east_asia_pacific,asia,upper_middle_income,False,False,,2017,180.0,56690.0,...,31.39,1.8,,29.6,65.0,,,1991-09-17,23725.0,314.944444
Monaco,MCO,europe_central_asia,europe,high_income,False,False,,2017,2.0,35460.0,...,,,,2.8,,,,1993-05-28,,17730.0
Nauru,NRU,east_asia_pacific,asia,,False,False,,2015,20.0,10440.0,...,35.02,,,29.1,,,,1999-09-14,,522.0
Palau,PLW,east_asia_pacific,asia,upper_middle_income,False,False,,2017,460.0,20920.0,...,31.85,10.73,,14.2,,,,1994-12-15,,45.478261
Saint Kitts and Nevis,KNA,america,americas,high_income,False,False,,2017,260.0,54340.0,...,30.51,,2492.0,8.4,,,,1983-09-23,,209.0
San Marino,SMR,europe_central_asia,europe,high_income,False,False,,2017,60.0,32160.0,...,,5.946,,2.6,,,,1992-03-02,,536.0
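
A logical column can also be stored in a variable and reused, for example with `.loc` to pick only some columns of the matching rows, or with `sum` to count them. A minimal sketch on three rows copied from the table (the variable names are made up):

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [88_910, 3_238_000, 35_460], "area": [470.0, 28_750.0, 2.0]},
    index=["Andorra", "Albania", "Monaco"],
)

mask = df["population"] < 100_000

small_areas = df.loc[mask, "area"]   # one column, matching rows only
n_small = int(mask.sum())            # True counts as 1, so this counts the matches

print(small_areas)
print(n_small)
```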


... and of course combinations:

In [111]:
# Poorer EU countries
countries[countries["is_eu"] & (countries["income_groups"] != "high_income")]

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bulgaria,BGR,europe_central_asia,europe,upper_middle_income,True,False,2007-01-01,2018,111000.0,7349000.0,...,25.52,9.662,2829.0,9.3,75.32,78.485,71.618,1955-12-14,27491.8,66.207207
Hungary,HUN,europe_central_asia,europe,upper_middle_income,True,True,2004-05-01,2018,93030.0,9934000.0,...,25.98,5.234,3037.0,5.3,75.9,79.557,72.61,1955-12-14,27703.5,106.782758
Romania,ROU,europe_central_asia,europe,upper_middle_income,True,False,2007-01-01,2018,238390.0,21340000.0,...,25.22,8.808,3358.0,9.7,75.53,79.158,72.265,1955-12-14,27568.45,89.517178


In [115]:
# Which OECD countries have a life expectancy of less than 78 years?
countries[(countries["is_oecd"]) & (countries["life_expectancy"] < 78)]

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000.0,...,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17,28345.9,29.604245
Hungary,HUN,europe_central_asia,europe,upper_middle_income,True,True,2004-05-01,2018,93030.0,9934000.0,...,25.98,5.234,3037.0,5.3,75.9,79.557,72.61,1955-12-14,27703.5,106.782758
Latvia,LVA,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,64490.0,2226000.0,...,25.62,8.275,3174.0,6.9,75.13,79.498,69.882,1991-09-17,27422.45,34.516979
Lithuania,LTU,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,65286.0,3278000.0,...,26.01,8.09,3417.0,3.3,75.31,80.06,69.554,1991-09-17,27488.15,50.209846
Mexico,MEX,america,americas,upper_middle_income,False,True,,2018,1964380.0,117500000.0,...,28.74,9.468,3072.0,11.3,76.78,79.88,75.12,1945-11-07,28024.7,59.815311
Slovakia,SVK,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,49035.0,5489000.0,...,26.32,6.746,2944.0,5.8,77.16,80.511,73.589,1993-01-19,28163.4,111.940451


Because this way of filtering is a bit verbose, there is also the `query` method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html), which selects rows based on a string describing the condition. Typically this is an (in)equality between column names and numeric values, but it doesn't have to be.

In [116]:
# Really big countries (population over 100 million)
countries.query("population > 100_000_000")

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bangladesh,BGD,south_asia,asia,low_income,False,False,,2018,147630.0,154400000.0,...,20.55,4.401,2450.0,30.7,73.41,74.937,71.484,1974-09-17,26794.65,1045.857888
Brazil,BRA,america,americas,upper_middle_income,False,False,,2018,8515770.0,200100000.0,...,25.99,1.872,3263.0,14.6,75.7,79.527,72.34,1945-10-24,27630.5,23.497582
China,CHN,east_asia_pacific,asia,upper_middle_income,False,False,,2018,9562911.0,1359000000.0,...,22.91,3.59,3108.0,9.2,76.92,78.163,75.096,1945-10-24,28075.8,142.111539
India,IND,south_asia,asia,lower_middle_income,False,False,,2018,3287259.0,1275000000.0,...,21.31,3.034,2459.0,37.9,69.1,70.678,67.538,1945-10-30,25221.5,387.861133
Indonesia,IDN,east_asia_pacific,asia,lower_middle_income,False,False,,2018,1910931.0,247200000.0,...,22.99,1.232,2777.0,22.8,72.03,71.742,67.426,1950-09-28,26290.95,129.361029
Japan,JPN,east_asia_pacific,asia,high_income,False,True,,2018,377962.0,126300000.0,...,21.87,1.381,2726.0,2.0,84.17,87.244,80.803,1956-12-18,30722.05,334.160577
Mexico,MEX,america,americas,upper_middle_income,False,True,,2018,1964380.0,117500000.0,...,28.74,9.468,3072.0,11.3,76.78,79.88,75.12,1945-11-07,28024.7,59.815311
Nigeria,NGA,sub_saharan_africa,africa,lower_middle_income,False,False,,2018,923770.0,170900000.0,...,23.67,,2700.0,69.4,66.14,55.158,53.512,1960-10-07,24141.1,185.00276
Pakistan,PAK,south_asia,asia,lower_middle_income,False,False,,2018,796100.0,183200000.0,...,23.45,,2440.0,65.8,67.96,67.869,65.75,1947-09-30,24805.4,230.121844
Russia,RUS,europe_central_asia,europe,high_income,False,False,,2018,17098250.0,142600000.0,...,27.21,14.38,3361.0,8.2,71.07,76.882,65.771,1945-10-24,25940.55,8.340035


In [117]:
# In which EU countries do people tend to consume a lot of calories?

countries.query("is_eu & (calories_per_day > 3500)")

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Austria,AUT,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,83879.0,8441000.0,...,25.09,3.541,3768.0,2.9,81.84,84.249,79.585,1955-12-14,29871.6,100.633055
Belgium,BEL,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,30530.0,10820000.0,...,25.14,5.427,3733.0,3.3,81.23,83.751,79.131,1945-12-27,29648.95,354.405503
Ireland,IRL,europe_central_asia,europe,high_income,True,True,1973-01-01,2018,70280.0,4631000.0,...,26.62,3.768,3600.0,3.0,81.49,83.737,79.885,1955-12-14,29743.85,65.893569
Italy,ITA,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,301340.0,61090000.0,...,24.79,3.778,3579.0,2.9,82.62,85.435,81.146,1955-12-14,30156.3,202.727816
Luxembourg,LUX,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,2590.0,530000.0,...,26.09,5.971,3539.0,1.5,82.39,84.227,79.981,1945-10-24,30072.35,204.633205


In [120]:
oced = countries[(countries["is_oecd"]) & (countries["life_expectancy"] < 78)]

In [121]:
oced.head()

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000.0,...,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17,28345.9,29.604245
Hungary,HUN,europe_central_asia,europe,upper_middle_income,True,True,2004-05-01,2018,93030.0,9934000.0,...,25.98,5.234,3037.0,5.3,75.9,79.557,72.61,1955-12-14,27703.5,106.782758
Latvia,LVA,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,64490.0,2226000.0,...,25.62,8.275,3174.0,6.9,75.13,79.498,69.882,1991-09-17,27422.45,34.516979
Lithuania,LTU,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,65286.0,3278000.0,...,26.01,8.09,3417.0,3.3,75.31,80.06,69.554,1991-09-17,27488.15,50.209846
Mexico,MEX,america,americas,upper_middle_income,False,True,,2018,1964380.0,117500000.0,...,28.74,9.468,3072.0,11.3,76.78,79.88,75.12,1945-11-07,28024.7,59.815311


In [122]:
oced["world_4region"]

name
Estonia        europe
Hungary        europe
Latvia         europe
Lithuania      europe
Mexico       americas
Slovakia       europe
Name: world_4region, dtype: object

> ### Task
1. Which is the only country in Africa that belongs to the high-income group?
2. In which countries do people drink a lot? (Use any criterion.)

In [132]:
countries["world_4region"].unique()

array(['asia', 'europe', 'africa', 'americas'], dtype=object)

In [133]:
countries["income_groups"].unique()

['low_income', 'upper_middle_income', 'high_income', 'lower_middle_income', NaN]
Categories (4, object): ['high_income', 'low_income', 'lower_middle_income', 'upper_middle_income']

In [134]:
countries[(countries["world_4region"] == "africa") & (countries["income_groups"] == "high_income")]

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Equatorial Guinea,GNQ,sub_saharan_africa,africa,high_income,False,False,,2018,28050.0,761000.0,...,24.53,,,68.2,66.13,59.653,56.979,1968-11-12,24137.45,27.130125


In [151]:
countries.query("(world_4region == africa)")

UndefinedVariableError: name 'africa' is not defined
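
The error above appears because `query` reads the bare word `africa` as a column or variable name. A string value has to be quoted inside the query string, or passed in through a Python variable prefixed with `@`. A sketch on a hypothetical three-row table:

```python
import pandas as pd

df = pd.DataFrame(
    {"world_4region": ["asia", "africa", "africa"]},
    index=["Afghanistan", "Algeria", "Angola"],
)

# Option 1: put the string literal in inner quotes
african = df.query("world_4region == 'africa'")

# Option 2: refer to a local Python variable with @
region = "africa"
african_too = df.query("world_4region == @region")

print(african)
```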

In [152]:
countries.columns

Index(['iso', 'world_6region', 'world_4region', 'income_groups', 'is_eu',
       'is_oecd', 'eu_accession', 'year', 'area', 'population',
       'alcohol_adults', 'bmi_men', 'bmi_women',
       'car_deaths_per_100000_people', 'calories_per_day', 'infant_mortality',
       'life_expectancy', 'life_expectancy_female', 'life_expectancy_male',
       'un_accession', 'life_exp_days', 'population_density'],
      dtype='object')

In [153]:
countries[(countries["alcohol_adults"] > 15) & (countries["income_groups"] != "low_income")]

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Belarus,BLR,europe_central_asia,europe,upper_middle_income,False,False,,2018,207600.0,9498000.0,...,26.64,8.454,3250.0,3.4,73.76,78.583,67.693,1945-10-24,26922.4,45.751445
Czechia,CZE,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,78870.0,10590000.0,...,26.51,5.72,3256.0,2.8,79.37,81.858,76.148,1993-01-19,28970.05,134.271586
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000.0,...,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17,28345.9,29.604245
Hungary,HUN,europe_central_asia,europe,upper_middle_income,True,True,2004-05-01,2018,93030.0,9934000.0,...,25.98,5.234,3037.0,5.3,75.9,79.557,72.61,1955-12-14,27703.5,106.782758
Lithuania,LTU,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,65286.0,3278000.0,...,26.01,8.09,3417.0,3.3,75.31,80.06,69.554,1991-09-17,27488.15,50.209846
Moldova,MDA,europe_central_asia,europe,lower_middle_income,False,False,,2018,33850.0,3496000.0,...,27.06,5.529,2714.0,13.6,72.41,76.09,67.544,1992-03-02,26429.65,103.279173
Romania,ROU,europe_central_asia,europe,upper_middle_income,True,False,2007-01-01,2018,238390.0,21340000.0,...,25.22,8.808,3358.0,9.7,75.53,79.158,72.265,1955-12-14,27568.45,89.517178
Russia,RUS,europe_central_asia,europe,high_income,False,False,,2018,17098250.0,142600000.0,...,27.21,14.38,3361.0,8.2,71.07,76.882,65.771,1945-10-24,25940.55,8.340035
South Korea,KOR,east_asia_pacific,asia,high_income,False,True,,2018,100280.0,48770000.0,...,23.33,4.319,3334.0,2.9,81.35,85.467,79.456,1991-09-17,29692.75,486.338253
Ukraine,UKR,europe_central_asia,europe,lower_middle_income,False,False,,2018,603550.0,44700000.0,...,26.23,8.771,3138.0,7.7,72.29,77.067,67.246,1945-10-24,26385.85,74.061801


## Sorting

In the introductory `pandas` lesson, we already showed how to sort rows by their index using the `sort_index` method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html). Since `countries` is already sorted by name, let's try it again on the planets:

In [154]:
planets.sort_index()

Unnamed: 0_level_0,symbol,equatorian_diameter,orbital_period,moons,has_rings
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Earth,⊕,1.0,1.0,1.0,False
Jupiter,♃,5.2,11.86,79.0,True
Mars,♂,1.52,1.88,2.0,False
Mercury,☿,0.39,0.24,0.0,False
Neptune,♆,30.06,164.8,14.0,True
Pluto,,,,,
Saturn,♄,9.54,29.46,82.0,True
Uranus,♅,19.22,84.01,27.0,True
Venus,♀,0.72,0.62,0.0,False


The `sort_values` method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) sorts a `Series` by its values:

In [155]:
# 10 countries with the smallest population (ascending=True is the default)
countries["population"].sort_values(ascending=True).head(10) 

name
Tuvalu                    9888.0
Nauru                    10440.0
Palau                    20920.0
San Marino               32160.0
Monaco                   35460.0
Liechtenstein            36870.0
Saint Kitts and Nevis    54340.0
Marshall Islands         56690.0
Dominica                 67700.0
Seychelles               87420.0
Name: population, dtype: float64

The optional argument `ascending` sets the sort direction. The default value is `True`, so changing it to `False` sorts from largest to smallest:

In [156]:
# The largest 10 countries by area
countries["area"].sort_values(ascending=False).head(10)

name
Russia           17098250.0
Canada            9984670.0
United States     9831510.0
China             9562911.0
Brazil            8515770.0
Australia         7741220.0
India             3287259.0
Argentina         2780400.0
Kazakhstan        2724902.0
Algeria           2381740.0
Name: area, dtype: float64
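
To see both directions side by side, here is a small sketch on a hand-made `Series` (the labels and values are invented for illustration). It also shows where missing values end up, which matters for a table like `countries` that contains `NaN`:

```python
import pandas as pd

# Made-up values; "D" is deliberately missing
areas = pd.Series(
    {"A": 300.0, "B": 100.0, "C": 200.0, "D": float("nan")},
    name="area",
)

smallest_first = areas.sort_values()                # ascending=True is the default
largest_first = areas.sort_values(ascending=False)

# By default (na_position="last"), NaN goes to the end in both directions
print(smallest_first)
print(largest_first)
```

Note that `sort_values` returns a new `Series`; `areas` itself keeps its original order.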

In the case of a table, the first argument should be the name of the column (or columns) by which we want to sort:

In [157]:
# 10 countries with the highest alcohol consumption per capita
countries.sort_values("alcohol_adults", ascending=False).head(10)

Unnamed: 0_level_0,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,...,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession,life_exp_days,population_density
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Moldova,MDA,europe_central_asia,europe,lower_middle_income,False,False,,2018,33850.0,3496000.0,...,27.06,5.529,2714.0,13.6,72.41,76.09,67.544,1992-03-02,26429.65,103.279173
South Korea,KOR,east_asia_pacific,asia,high_income,False,True,,2018,100280.0,48770000.0,...,23.33,4.319,3334.0,2.9,81.35,85.467,79.456,1991-09-17,29692.75,486.338253
Belarus,BLR,europe_central_asia,europe,upper_middle_income,False,False,,2018,207600.0,9498000.0,...,26.64,8.454,3250.0,3.4,73.76,78.583,67.693,1945-10-24,26922.4,45.751445
North Korea,PRK,east_asia_pacific,asia,low_income,False,False,,2018,120540.0,24650000.0,...,21.25,,2094.0,19.7,71.13,75.512,68.45,1991-09-17,25962.45,204.496433
Ukraine,UKR,europe_central_asia,europe,lower_middle_income,False,False,,2018,603550.0,44700000.0,...,26.23,8.771,3138.0,7.7,72.29,77.067,67.246,1945-10-24,26385.85,74.061801
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000.0,...,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17,28345.9,29.604245
Czechia,CZE,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,78870.0,10590000.0,...,26.51,5.72,3256.0,2.8,79.37,81.858,76.148,1993-01-19,28970.05,134.271586
Uganda,UGA,sub_saharan_africa,africa,low_income,False,False,,2018,241550.0,36760000.0,...,22.48,13.69,2130.0,37.7,62.86,62.667,58.252,1962-10-25,22943.9,152.183813
Lithuania,LTU,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,65286.0,3278000.0,...,26.01,8.09,3417.0,3.3,75.31,80.06,69.554,1991-09-17,27488.15,50.209846
Russia,RUS,europe_central_asia,europe,high_income,False,False,,2018,17098250.0,142600000.0,...,27.21,14.38,3361.0,8.2,71.07,76.882,65.771,1945-10-24,25940.55,8.340035
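
When sorting by several columns, `ascending` can also be a list with one value per column. A minimal sketch on an invented table (`toy`, its columns and its numbers are assumptions made for this example):

```python
import pandas as pd

# Hypothetical data: two regions, made-up consumption figures
toy = pd.DataFrame(
    {
        "region": ["europe", "asia", "europe", "asia"],
        "alcohol": [12.0, 3.0, 15.0, 9.0],
    },
    index=pd.Index(["P", "Q", "R", "S"], name="name"),
)

# Regions alphabetically, and within each region the highest consumption first
ranked = toy.sort_values(["region", "alcohol"], ascending=[True, False])
print(ranked)
```

The first column in the list takes priority; the second only decides the order among rows with the same `region`.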


💡 In the next cell, the entire code is enclosed in parentheses. This allows us to spread one expression over several lines so that we can comment on each of its parts.

In [158]:
(
    # Only EU countries
    countries[countries["is_eu"]]
    
    # Sort first by date of EU accession, then by UN accession
    .sort_values(["eu_accession", "un_accession"])

    # Show only those two columns
    [["eu_accession", "un_accession"]]
)

Unnamed: 0_level_0,eu_accession,un_accession
name,Unnamed: 1_level_1,Unnamed: 2_level_1
France,1952-07-23,1945-10-24
Luxembourg,1952-07-23,1945-10-24
Netherlands,1952-07-23,1945-12-10
Belgium,1952-07-23,1945-12-27
Italy,1952-07-23,1955-12-14
Germany,1952-07-23,1973-09-18
Denmark,1973-01-01,1945-10-24
United Kingdom,1973-01-01,1945-10-24
Ireland,1973-01-01,1955-12-14
Greece,1981-01-01,1945-10-25


> ### HOMEWORK Task
1. Sort the countries of the world according to population density.
2. Which countries have overweight problems (the average BMI of both men and women is over 25)?
3. In which 20 countries do the most people die in car accidents in absolute numbers?

In [159]:
#Your code goes here...

## Save Results!
And that is nearly the end. We have covered a great deal, so don't get frustrated if you forget most of it before next time. Feel free to come back to this notebook whenever you need to.

Fortunately, writing a `DataFrame` to an external file in one of the common formats is not complicated at all. The `pd.read_XXX` functions have counterparts in the `DataFrame.to_XXX` methods. Their parameters differ, but the basic use is very simple:

In [160]:
planets.to_csv("planets.csv")

In [161]:
planets.to_excel("planets.xlsx")
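
A quick way to check that nothing was lost is to read the written data back. The sketch below does a CSV round trip on a tiny invented table; it writes to an in-memory buffer instead of a file purely to keep the example self-contained (an assumption of this sketch, a file path works the same way):

```python
import io

import pandas as pd

# Hypothetical small table with a named index, similar in shape to `planets`
small = pd.DataFrame(
    {"moons": [0, 1, 2]},
    index=pd.Index(["Venus", "Earth", "Mars"], name="name"),
)

buffer = io.StringIO()
small.to_csv(buffer)      # the index is written as the first column
buffer.seek(0)            # rewind before reading

# index_col tells read_csv which column to turn back into the index
restored = pd.read_csv(buffer, index_col="name")
print(restored)
```

Without `index_col`, the names would come back as an ordinary column and the table would get a fresh numeric index.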

Excel and CSV are not well suited to storing large datasets; alternatives include `feather` (https://github.com/wesm/feather) and `parquet` (https://en.wikipedia.org/wiki/Apache_Parquet). For our purposes (small files, human-readable text format), CSV will suffice.

Another option is to create an HTML table. It can be styled in various ways, which we leave for another time or for self-study; see the "Styling" documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html):

In [162]:
planets.to_html("planets.html")
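
If no file name is given, `to_html` returns the markup as a string, which is handy for a quick look at what would be written. A small sketch on an invented two-row table:

```python
import pandas as pd

# Hypothetical two-row table
small = pd.DataFrame(
    {"moons": [1, 2]},
    index=pd.Index(["Earth", "Mars"], name="name"),
)

html = small.to_html()   # no path given, so the HTML comes back as a string
print(html[:60])         # peek at the beginning of the markup
```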

> ### HOMEWORK Task 
1. See what you find in the output files.
2. Look at the list of possible output formats and try to write planets or countries in one of them: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#serialization-io-conversion

In [163]:
#Your code goes here...

And that's really all. 👋