# Pandas - data types and basic operations
In the last lesson, we introduced the pandas library and its basic classes: `Series`, `DataFrame` and `Index`. However, we treated them as static objects that we only looked at.
In this lesson, we will start editing existing tables. We will show:
* how to add or remove columns and rows
* how to change the value of a specific cell
* which data types are suitable for which purpose
* arithmetic and logical operations that can be performed on columns
* filtering and sorting rows

And since you certainly don't want to lose the results of your work, saving them to external files will come in handy at the end.

In [1]:
# Required import
import pandas as pd

## Manipulating DataFrames
To warm up, we will work with a small table containing some basic information about the planets, which you can easily find, for example, on [Wikipedia](https://en.wikipedia.org/wiki/Planet).

In [2]:
planety = pd.DataFrame({
    "jmeno": ["Merkur", "Venuše", "Země", "Mars", "Jupiter", "Saturn", "Uran", "Neptun"],
    "symbol": ["☿", "♀", "⊕", "♂", "♃", "♄", "♅", "♆"],
    "obezna_poloosa": [0.39, 0.72, 1.00, 1.52, 5.20, 9.54, 19.22, 30.06],   # semi-major axis of the orbit
    "obezna_doba": [0.24, 0.62, 1, 1.88, 11.86, 29.46, 84.01, 164.8],       # orbital period
})
planety = planety.set_index("jmeno")   # "jmeno" = name; names are easier to work with as the index
planety

jmeno,symbol,obezna_poloosa,obezna_doba
Merkur,☿,0.39,0.24
Venuše,♀,0.72,0.62
Země,⊕,1.0,1.0
Mars,♂,1.52,1.88
Jupiter,♃,5.2,11.86
Saturn,♄,9.54,29.46
Uran,♅,19.22,84.01
Neptun,♆,30.06,164.8


### Add a new column
When we want to add a new column (`Series`), we assign it to the `DataFrame` like a value in a dictionary - that is, in square brackets with the column name. The good news is that, as in the constructor, `pandas` handles both a `Series` and a regular list (a list is matched to rows by position, a `Series` by its index).
In our specific case, we will look up and add the number of known moons (large and small).

In [3]:
mesice = [0, 0, 1, 2, 79, 82, 27, 14]    # Alternatively: mesice = pd.Series([...], index=planety.index)
planety["mesice"] = mesice               # "mesice" = moons
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice
Merkur,☿,0.39,0.24,0
Venuše,♀,0.72,0.62,0
Země,⊕,1.0,1.0,1
Mars,♂,1.52,1.88,2
Jupiter,♃,5.2,11.86,79
Saturn,♄,9.54,29.46,82
Uran,♅,19.22,84.01,27
Neptun,♆,30.06,164.8,14
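
The difference between assigning a list and assigning a `Series` is easy to miss, so here is a minimal sketch of it. The table and the column names `order` and `lost` are made up for illustration only; they are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical mini-table, used only for illustration
df = pd.DataFrame({"symbol": ["☿", "♀", "⊕"]}, index=["Merkur", "Venuše", "Země"])

# A list is matched to the rows by position
df["mesice"] = [0, 0, 1]

# A Series is matched by index label, regardless of the order of its values
df["order"] = pd.Series([3, 1, 2], index=["Země", "Merkur", "Venuše"])

# A Series with a different index (here the default 0, 1, 2) matches no label,
# so the whole new column is filled with NaN
df["lost"] = pd.Series([0, 0, 1])

print(df)
```

If you build the `Series` yourself, giving it the table's own index (`index=df.index`) is probably the simplest way to avoid the last case.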


💡 In this case, we directly modified an existing `DataFrame`. Most methods and operations in `pandas` (you already know `set_index`, for example) by default return a new object with the modification applied and leave the original object unchanged. This is a good habit that we will follow. Column assignment is one of the accepted exceptions to this rule, especially when the table is modified only within a few lines of code (or when copying would take too much memory).
However, `DataFrame` also offers the [`assign`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html) method, which does not modify the table but creates a copy of it with the columns added (or replaced). If you want to avoid the tedious tracking of which table you have or have not changed, we can only recommend `assign`.
By the way, you can make a copy of a table at any time using the [`copy`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html) method - this is useful when writing functions in which the input table gets modified for whatever reason.

In [4]:
# New temporary DataFrame
planety.assign(
    je_stavebnice=[True, False, False, False, False, False, False, False],
    ma_vztah_k_vestonicim=[False, True, False, False, False, False, False, False],
)
# The `planety` object remains unchanged.

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_stavebnice,ma_vztah_k_vestonicim
Merkur,☿,0.39,0.24,0,True,False
Venuše,♀,0.72,0.62,0,False,True
Země,⊕,1.0,1.0,1,False,False
Mars,♂,1.52,1.88,2,False,False
Jupiter,♃,5.2,11.86,79,False,False
Saturn,♄,9.54,29.46,82,False,False
Uran,♅,19.22,84.01,27,False,False
Neptun,♆,30.06,164.8,14,False,False


In [5]:
planety2 = planety.copy()
planety2["je_nezdrava_tycinka"] = [False, False, False, True, False, False, False, False]
planety2
# Even now, the original `planety` does not change

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_nezdrava_tycinka
Merkur,☿,0.39,0.24,0,False
Venuše,♀,0.72,0.62,0,False
Země,⊕,1.0,1.0,1,False
Mars,♂,1.52,1.88,2,True
Jupiter,♃,5.2,11.86,79,False
Saturn,♄,9.54,29.46,82,False
Uran,♅,19.22,84.01,27,False
Neptun,♆,30.06,164.8,14,False


**Task:** Try (one way or the other) to add a column with the year of discovery (`"discovered"`). You can find the data at https://cs.wikipedia.org/wiki/Slune%C4%8Dn%C3%AD_soustava.

A single scalar value can also be used for a new column (although this is rarely needed in practice) - the same value is then used in all rows:

In [6]:
planety["je_planeta"] = True   # "je_planeta" = is a planet
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_planeta
Merkur,☿,0.39,0.24,0,True
Venuše,♀,0.72,0.62,0,True
Země,⊕,1.0,1.0,1,True
Mars,♂,1.52,1.88,2,True
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uran,♅,19.22,84.01,27,True
Neptun,♆,30.06,164.8,14,True


### Add a new row
If we go back to the childhood (or early adulthood) of the authors of these materials, i.e. before 2006, when an astronomical congress held in Prague defined the term "planet" (but not before 1930!), we get one more planet: Pluto.
We will insert it into our table as a new row using the `loc` indexer, which we have so far used only to "look" into the table:

In [7]:
planety.loc["Pluto"] = ["♇", 39.48, 247.94, 5, True]    # List of values in the row
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_planeta
Merkur,☿,0.39,0.24,0,True
Venuše,♀,0.72,0.62,0,True
Země,⊕,1.0,1.0,1,True
Mars,♂,1.52,1.88,2,True
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uran,♅,19.22,84.01,27,True
Neptun,♆,30.06,164.8,14,True
Pluto,♇,39.48,247.94,5,True


**Task:** Try to add the Sun or some completely fictional planet.

### Change cell value
The "indexers" `.loc` and `.iloc` with two arguments in square brackets refer directly to a specific cell, and assigning to them (again, as with a dictionary) writes the value to that place. You just need to keep the order (row, column).
Let's return to the present and strip Pluto of its status:

In [8]:
planety.loc["Pluto", "je_planeta"] = False
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_planeta
Merkur,☿,0.39,0.24,0,True
Venuše,♀,0.72,0.62,0,True
Země,⊕,1.0,1.0,1,True
Mars,♂,1.52,1.88,2,True
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uran,♅,19.22,84.01,27,True
Neptun,♆,30.06,164.8,14,True
Pluto,♇,39.48,247.94,5,False
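
The cell above writes through `.loc`, i.e. by labels. The following sketch, on a small made-up table (its values are for illustration only), shows the same kind of write through `.iloc`, which addresses the cell by zero-based positions instead.

```python
import pandas as pd

# Hypothetical two-row table, used only for illustration
df = pd.DataFrame(
    {"mesice": [14, 0], "je_planeta": [True, True]},
    index=["Neptun", "Pluto"],
)

# By label: row "Pluto", column "je_planeta"
df.loc["Pluto", "je_planeta"] = False

# By position: second row (1), first column (0)
df.iloc[1, 0] = 5

print(df)
```

Both indexers keep the (row, column) order; only the way the row and column are named differs.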


**⚠ Attention:** As with a dictionary, but perhaps somewhat counterintuitively, it is possible to write a value into a row or column that does not exist yet! A typo in a label therefore silently creates a new row or column:

In [9]:
planety_bad = planety.copy()   # Work on a copy to be safe
planety_bad.loc["Zeme", "planeta"] = True
planety_bad

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_planeta,planeta
Merkur,☿,0.39,0.24,0.0,True,
Venuše,♀,0.72,0.62,0.0,True,
Země,⊕,1.0,1.0,1.0,True,
Mars,♂,1.52,1.88,2.0,True,
Jupiter,♃,5.2,11.86,79.0,True,
Saturn,♄,9.54,29.46,82.0,True,
Uran,♅,19.22,84.01,27.0,True,
Neptun,♆,30.06,164.8,14.0,True,
Pluto,♇,39.48,247.94,5.0,False,
Zeme,,,,,,True


💡 You may be wondering about the empty cells in the table above: they hold **NaN** values. NaN (Not a Number) indicates a missing, invalid, or unknown value. We did not enter any value for those cells, so finding NaN there is no surprise. We will talk about missing values and how to fix them next time, so let's not worry about them for now.
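
One side effect is visible in the table above: the moon counts changed from `0` to `0.0`. A minimal sketch of why, on a made-up two-row table: an `int64` column cannot hold NaN, so once a missing value appears in it, the column is converted to `float64`.

```python
import pandas as pd

# Hypothetical mini-table, used only for illustration
df = pd.DataFrame({"mesice": [1, 2]}, index=["Země", "Mars"])
print(df["mesice"].dtype)        # int64

# Writing to a row that does not exist creates it ...
df.loc["Zeme", "planeta"] = True

# ... and "mesice" has no value there, so it holds NaN and the column becomes float64
print(df["mesice"].dtype)        # float64
print(df.isna())
```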

It is also possible to assign to ranges of the index - we just need to make sure that we assign either a *scalar value* (i.e. a single value, not an array, which is then used for the whole area) or an array-like object (`Series`, `DataFrame`, list, ...) of the same shape (number of rows and columns) as the area we assign to:

In [10]:
planety.loc["Merkur":"Mars", "je_obr"] = False    # "je_obr" = is a giant
planety.loc["Jupiter":"Neptun", "je_obr"] = [True, True, True, True]
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_planeta,je_obr
Merkur,☿,0.39,0.24,0,True,False
Venuše,♀,0.72,0.62,0,True,False
Země,⊕,1.0,1.0,1,True,False
Mars,♂,1.52,1.88,2,True,False
Jupiter,♃,5.2,11.86,79,True,True
Saturn,♄,9.54,29.46,82,True,True
Uran,♅,19.22,84.01,27,True,True
Neptun,♆,30.06,164.8,14,True,True
Pluto,♇,39.48,247.94,5,False,
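
To make the shape rule concrete, here is a small sketch on a made-up table. It also shows a detail worth remembering: unlike ordinary Python slices, label slices in `.loc` include both end points. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical table, used only for illustration
df = pd.DataFrame({"je_obr": [False] * 4}, index=["Jupiter", "Saturn", "Uran", "Neptun"])

# Scalar: used for every selected row
df.loc["Jupiter":"Neptun", "je_obr"] = True

# List as long as the selection ("Jupiter":"Saturn" covers two rows, both ends included)
df.loc["Jupiter":"Saturn", "je_obr"] = [False, True]

# List of a different length: pandas refuses the assignment
attempt = df.copy()
raised = False
try:
    attempt.loc["Jupiter":"Neptun", "je_obr"] = [True, False]   # 4 rows, 2 values
except ValueError as err:
    raised = True
    print("ValueError:", err)

print(df)
```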




**Task:** Coincidentally (or is it an astronomical inevitability?), all giant planets have at least some ring. Can you easily create a `ma_ring` column?

### Delete a row
Use the [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method to remove a column or row from a `DataFrame`. Its first argument is the label (or a list of labels) of the rows or columns you want to remove. The `axis` argument says along which dimension the operation is applied. You can use either the number 0 or 1 (which matches the zero-based position of the key when addressing a cell: row first, column second), or the name of the dimension:
Axis (`axis`):
- 0 or "index" → rows
- 1 or "columns" → columns

*Many other methods and functions use this argument, so make sure you understand it.*

Now that we are back in the future (or the present), let's deal with Pluto mercilessly (for the `drop` method, the default value of the `axis` argument is 0, so we don't have to write it):

In [11]:
planety = planety.drop("Pluto")   # Add axis="index" to be explicit
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_planeta,je_obr
Merkur,☿,0.39,0.24,0,True,False
Venuše,♀,0.72,0.62,0,True,False
Země,⊕,1.0,1.0,1,True,False
Mars,♂,1.52,1.88,2,True,False
Jupiter,♃,5.2,11.86,79,True,True
Saturn,♄,9.54,29.46,82,True,True
Uran,♅,19.22,84.01,27,True,True
Neptun,♆,30.06,164.8,14,True,True
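
Since the `axis` argument described above appears in many other methods, here is a short sketch of how it behaves in `sum` and in `drop`. The table is made up purely for illustration.

```python
import pandas as pd

# Hypothetical table, used only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# axis=0 / "index": the operation runs down the rows, one result per column
print(df.sum(axis="index"))      # a: 6, b: 60

# axis=1 / "columns": the operation runs across the columns, one result per row
print(df.sum(axis="columns"))    # x: 11, y: 22, z: 33

# For drop, the axis says where to look for the label
print(df.drop("y", axis="index"))
print(df.drop("b", axis="columns"))
```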


**Task:** Try to create a table from `planety` that contains neither Uranus nor Neptune (with a single command).

### Delete a column
The `drop` method works very similarly for a column, only this time we have to specify the `axis` argument.
Let's remove an unnecessary column whose information value is on the level of "the wipers wipe, the horn honks"...

In [12]:
planety = planety.drop("je_planeta", axis="columns")
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_obr
Merkur,☿,0.39,0.24,0,False
Venuše,♀,0.72,0.62,0,False
Země,⊕,1.0,1.0,1,False
Mars,♂,1.52,1.88,2,False
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uran,♅,19.22,84.01,27,True
Neptun,♆,30.06,164.8,14,True


In line with the convention mentioned above, the `drop` method returns a new `DataFrame` (which is why we have to assign the result back to `planety`). If you want to operate directly on the table, you can use the `del` statement (it works the same as for a dictionary) or ask the pandas gods (and the authors of these materials) for forgiveness and add the argument `inplace=True` (unfortunately, many other operations accept this argument as well):

In [13]:
# Only at your own risk

# Alternative 1)
# del planety["je_planeta"]

# Alternative 2)
# planety.drop("je_planeta", axis=1, inplace=True)

## Data types

#### Data preparation
We will now leave the planets and look at some interesting characteristics of countries around the world (since the definition of a country is somewhat vague, we consider UN members), captured for one particular year of the past decade (because not all data are always available, we take the latest year for which enough indicators are known). The data comes mostly from the [Gapminder](https://www.gapminder.org/) project; we have added just a few more pieces of information from Wikipedia.
The following code (you don't have to understand it) downloads the required file and saves it in the local directory. Alternatively, you can download it manually from [https://raw.githubusercontent.com/janpipek/data-pro-pyladies/master/data/countries.csv](https://raw.githubusercontent.com/janpipek/data-pro-pyladies/master/data/countries.csv).

In [14]:
# Required imports
import os
import requests

# Path to the file (see below)
zdroj = "https://raw.githubusercontent.com/janpipek/data-pro-pyladies/master/data/countries.csv"
jmeno = zdroj.rsplit("/")[-1]    # the file name is the last part of the URL
if not os.path.exists(jmeno):
    print(f"File {jmeno} has not been downloaded yet, let's do it...")
    response = requests.get(zdroj)
    with open(jmeno, "wb") as out:
        out.write(response.content)
    print(f"File {jmeno} downloaded successfully.")
else:
    print(f"File {jmeno} has already been downloaded, we will use the local copy.")

File countries.csv has already been downloaded, we will use the local copy.


And we will open it using the already familiar function [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) (note: `pandas` can also open the file directly from the internet, but we prefer a local copy so that you can come back to the work offline).

In [15]:
# Instead of `set_index`, we choose the index right when loading
countries = pd.read_csv("countries.csv", index_col="name")

countries = countries.sort_index()
countries

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Afghanistan,AFG,south_asia,asia,low_income,False,False,,2018,652860.0,34500000.0,0.03,20.62,21.07,,2090.0,66.3,58.69,65.812,63.101,1946-11-19
Albania,ALB,europe_central_asia,europe,upper_middle_income,False,False,,2018,28750.0,3238000.0,7.29,26.45,25.66,5.978,3193.0,12.5,78.01,80.737,76.693,1955-12-14
Algeria,DZA,middle_east_north_africa,africa,upper_middle_income,False,False,,2018,2381740.0,36980000.0,0.69,24.60,26.37,,3296.0,21.9,77.86,77.784,75.279,1962-10-08
Andorra,AND,europe_central_asia,europe,high_income,False,False,,2017,470.0,88910.0,10.17,27.63,26.43,,,2.1,82.55,,,1993-07-28
Angola,AGO,sub_saharan_africa,africa,upper_middle_income,False,False,,2018,1246700.0,20710000.0,5.57,22.25,23.48,,2473.0,96.0,65.19,64.939,59.213,1976-12-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,VEN,america,americas,upper_middle_income,False,False,,2018,912050.0,30340000.0,7.60,27.45,28.13,7.332,2631.0,12.9,75.91,79.079,70.950,1945-11-15
Vietnam,VNM,east_asia_pacific,asia,lower_middle_income,False,False,,2018,330967.0,90660000.0,3.91,20.92,21.07,,2745.0,17.3,74.88,81.203,72.003,1977-09-20
Yemen,YEM,middle_east_north_africa,asia,lower_middle_income,False,False,,2018,527970.0,26360000.0,0.20,24.44,26.11,,2223.0,33.8,67.14,66.871,63.875,1947-09-30
Zambia,ZMB,sub_saharan_africa,africa,lower_middle_income,False,False,,2018,752610.0,14310000.0,3.56,20.68,23.05,11.260,1930.0,43.3,59.45,65.362,59.845,1964-12-01


We will randomly select a country and see what data we have in the table.

In [16]:
countries.loc["Czechia"]

iso                                             CZE
world_6region                   europe_central_asia
world_4region                                europe
income_groups                           high_income
is_eu                                          True
is_oecd                                        True
eu_accession                             2004-05-01
year                                           2018
area                                          78870
population                                1.059e+07
alcohol_adults                                16.47
bmi_men                                       27.91
bmi_women                                     26.51
car_deaths_per_100000_people                   5.72
calories_per_day                               3256
infant_mortality                                2.8
life_expectancy                               79.37
life_expectancy_female                       81.858
life_expectancy_male                         76.148
un_accession

At first glance, each field has a different type. But which one? The [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) attribute of our table will answer this (for a `Series`, use [`dtype`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dtype.html), or rather `dtype.name` if you want an equally nice string representation).

In [17]:
countries.dtypes

iso                              object
world_6region                    object
world_4region                    object
income_groups                    object
is_eu                              bool
is_oecd                            bool
eu_accession                     object
year                              int64
area                            float64
population                      float64
alcohol_adults                  float64
bmi_men                         float64
bmi_women                       float64
car_deaths_per_100000_people    float64
calories_per_day                float64
infant_mortality                float64
life_expectancy                 float64
life_expectancy_female          float64
life_expectancy_male            float64
un_accession                     object
dtype: object

The types in pandas build on how the `numpy` library defines them (`numpy` is generally useful for working with numeric arrays and provides vectorized operations orders of magnitude faster than plain Python). Above all, `numpy` needs to know how to allocate arrays for elements of a given type so that they can be laid out efficiently one after another, and therefore how many bytes of memory each element takes up. It mirrors the "native" data types you may already know from other languages (e.g. [C](https://cs.wikipedia.org/wiki/C_(programovac%C3%AD_jazyk))). Memory layout is something we don't usually deal with in Python, but fast computation can't do without it. We will not go into details, but the need for speed will come up here and there, and we will point out when operations run at the `numpy` level rather than in Python.
Fortunately, the somewhat mysterious type system of `numpy` (described in the [documentation](https://docs.scipy.org/doc/numpy/user/basics.types.html)) is (slightly) simplified in `pandas`, which offers just a few useful basic types (or families of types) that we will now introduce.

### Integers
In Python, there is exactly one type for integers: `int`, which lets you work with integers of any size (0, -58 or 123456789012345678901234567890). In `pandas` you will find `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32` and `uint64` - they all share the same basic features, but each can store only a certain range of numbers. They differ in the amount of memory one number takes up (the number in the name is the number of bits) and in whether negative numbers are supported (the prefix `u` means unsigned, i.e. only zero and positive numbers).
Ranges:
- `int8`: -128 to 127
- `uint8`: 0 to 255
- `int16`: -32 768 to 32 767
- `uint16`: 0 to 65 535
- `int32`: -2 147 483 648 to 2 147 483 647 (i.e. about ±2 billion)
- `uint32`: 0 to 4 294 967 295 (i.e. up to about 4 billion)
- `int64`: -9 223 372 036 854 775 808 to 9 223 372 036 854 775 807 (i.e. about ±9.2 × 10^18)
- `uint64`: 0 to 18 446 744 073 709 551 615 (i.e. up to about 1.8 × 10^19)

💡 To make things more complicated, each `int?` / `uint?` type has an alternative that allows missing values in the column. Its name starts with an uppercase letter instead of the lowercase `i` or `u` (e.g. `Int64`, `UInt8`). This feature (so-called "nullable integer types") is quite useful but still somewhat experimental. We will not use it in this course.
For a detailed explanation of how integers are represented in computer memory, see [Wikipedia](https://en.wikipedia.org/wiki/Integer).
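
You do not have to memorize the ranges listed above. As a small sketch, `numpy` can report them through `np.iinfo`:

```python
import numpy as np

for name in ["int8", "uint8", "int16", "uint16", "int32", "uint32", "int64", "uint64"]:
    info = np.iinfo(np.dtype(name))
    # min / max are the smallest and largest values the type can store
    print(f"{name:>6}: {info.min} to {info.max}")
```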

In `pandas`, the default integer type is `int64`, and unless you say otherwise, it is used for integers automatically (in most cases this is a good choice):

In [18]:
countries["year"]

name
Afghanistan    2018
Albania        2018
Algeria        2018
Andorra        2017
Angola         2018
               ... 
Venezuela      2018
Vietnam        2018
Yemen          2018
Zambia         2018
Zimbabwe       2018
Name: year, Length: 193, dtype: int64

In [19]:
pd.Series([0, 123, 12345])

# pd.Series([0, 123, 12345], dtype="int64")   # the same

0        0
1      123
2    12345
dtype: int64

However, you can use the `dtype` argument to specify exactly which type of integers you want:

In [20]:
pd.Series([0, 123, 12345], dtype="int16")

0        0
1      123
2    12345
dtype: int16

**⚠ Caution:** When choosing a specific integer type, you have to watch the ranges: in the `pandas` version used here, you get no warning when a value does not "fit" into the range. The extra part of the binary representation is cheerfully discarded and you end up with a much smaller number than you expected (newer versions may raise an error instead):

In [21]:
pd.Series([0, 123, 12345], dtype="int8")

0      0
1    123
2     57
dtype: int8
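
Where does 57 come from? An `int8` keeps only the lowest 8 bits, and 12345 - 48 × 256 = 57. The sketch below reproduces the wrap-around with an explicit `numpy` cast and then shows two ways to guard against it; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = pd.Series([0, 123, 12345])          # int64 by default

# An explicit cast of existing data wraps around silently
wrapped = values.to_numpy().astype("int8")
print(wrapped)                                # [  0 123  57]

# Safer: check the range first ...
limits = np.iinfo(np.int8)
fits = values.between(limits.min, limits.max).all()
print(fits)                                   # False

# ... or let pandas pick the smallest integer type that holds every value
smallest = pd.to_numeric(values, downcast="integer")
print(smallest.dtype)                         # int16
```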

Fortunately, this does not apply to the type with the widest range (`int64`). Let's try to put a really big number into it (for example 123456789012345678901234567890) and see what happens:

In [22]:
# This throws an exception:
# pd.Series([0, 123, 123456789012345678901234567890], dtype="int64")

# This passes, but the result is no longer int64:
pd.Series([0, 123, 123456789012345678901234567890])

0                                 0
1                               123
2    123456789012345678901234567890
dtype: object

- If we explicitly request `int64`, an exception is thrown.
- If we let `pandas` do its job, the general `object` type is used and we lose some advantages: the column takes up many times more memory and arithmetic on it is an order of magnitude or two slower. If performance is not our priority, that is not much of a problem.

Therefore, we generally recommend sticking to `int64`, or letting `pandas` pick it automatically. Looking for the most economical type pays off only when strict memory requirements demand it.
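
The memory cost of `object` mentioned above can be checked directly. This is only a rough sketch: the exact numbers depend on your platform and library versions, so none are quoted here.

```python
import pandas as pd

numbers = pd.Series(range(100_000), dtype="int64")
objects = numbers.astype("object")    # the same values, stored as Python objects

# deep=True also counts the Python objects the column points to
print(numbers.memory_usage(deep=True))
print(objects.memory_usage(deep=True))
```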

**Task:** Try to create a `Series` with the data type `uint8` containing (at least) one small negative number. What happens?

### Floating-point numbers
As with integers, the single Python type (`float`) corresponds to several types in `pandas`: `float16`, `float32`, `float64`. The name again contains the number of bits needed to store one number. Fortunately, in this case `float64` behaves exactly like Python's `float`; the other two types are less precise and have a smaller range - apart from optimizing memory for a specific kind of data, you will probably not use them.
You can find more theoretical reading on the representation of floating-point numbers on [Wikipedia](https://cs.wikipedia.org/wiki/Pohybliv%C3%A1_%C5%99%C3%A1dov%C3%A1_%C4%8D%C3%A1rka).

In [23]:
countries["bmi_men"]

name
Afghanistan    20.62
Albania        26.45
Algeria        24.60
Andorra        27.63
Angola         22.25
               ...  
Venezuela      27.45
Vietnam        20.92
Yemen          24.44
Zambia         20.68
Zimbabwe       22.03
Name: bmi_men, Length: 193, dtype: float64

In [24]:
# A fairly accurate pi
pd.Series([3.14159265])

0    3.141593
dtype: float64

In [25]:
# A not so accurate pi
pd.Series([3.14159265], dtype="float16")

0    3.140625
dtype: float16
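
How much precision each floating-point type offers can be looked up with `np.finfo`, the floating-point counterpart of `np.iinfo`. The sketch below is only an illustration of the differences described above.

```python
import numpy as np

for name in ["float16", "float32", "float64"]:
    info = np.finfo(np.dtype(name))
    # precision: approximate number of reliable decimal digits
    # max: largest finite value the type can represent
    print(f"{name}: ~{info.precision} decimal digits, max {info.max}")

# float64 stores the same value as Python's float; float32 is slightly off
print(float(np.float64(0.1)) == 0.1)   # True
print(float(np.float32(0.1)) == 0.1)   # False
```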

**Task:** Create a `Series` of type `float64` from integers only. What happens?

### Booleans
This is probably the least surprising data type. It behaves basically the same as the `bool` type in Python. It takes the values `True` and `False` (which can also be treated as 1 and 0 in some operations). It has another great feature - `Series` and `DataFrame` objects can be filtered using a boolean column (see below).

In [26]:
countries["is_oecd"].iloc[:20]

name
Afghanistan            False
Albania                False
Algeria                False
Andorra                False
Angola                 False
Antigua and Barbuda    False
Argentina              False
Armenia                False
Australia               True
Austria                 True
Azerbaijan             False
Bahamas                False
Bahrain                False
Bangladesh             False
Barbados               False
Belarus                False
Belgium                 True
Belize                 False
Benin                  False
Bhutan                 False
Name: is_oecd, dtype: bool

In [27]:
# Create a new column
pd.Series([True, False, False])

0     True
1    False
2    False
dtype: bool

However, this works too:

In [28]:
pd.Series([1, 0, 0], dtype="bool")

0     True
1    False
2    False
dtype: bool
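
Two consequences of the properties above are worth a quick sketch: because `True` counts as 1, summing a boolean column counts the `True` values, and a boolean `Series` can serve as a filter. The sample below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample, used only for illustration
is_oecd = pd.Series(
    [False, True, True, False],
    index=["Angola", "Australia", "Austria", "Bhutan"],
)

print(is_oecd.sum())       # 2, the number of True values
print(is_oecd.mean())      # 0.5, the share of True values
print(is_oecd[is_oecd])    # keeps only the rows where the mask is True
print(~is_oecd)            # logical negation
```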

**Task:** What happens when you create a `Series` of type `bool` from the strings `"True"` and `"False"` (don't forget the quotes)?

### Strings, objects
_The current version of `pandas` (1.2) has a somewhat split attitude to strings: it is in the process of moving from a not entirely fortunate approach (the general `object` dtype) to something better (the dedicated `string` dtype). The documentation recommends the second approach, although it also calls it experimental. At the moment the difference is more or less cosmetic (and for convenience we will usually not convert columns to `string`)._

In [29]:
countries["iso"]

name
Afghanistan    AFG
Albania        ALB
Algeria        DZA
Andorra        AND
Angola         AGO
              ... 
Venezuela      VEN
Vietnam        VNM
Yemen          YEM
Zambia         ZMB
Zimbabwe       ZWE
Name: iso, Length: 193, dtype: object

This may surprise you - by default, strings, along with other unspecified or unrecognized values, fall into the `object` category, which lets you put anything you know from Python into the column. The column then behaves much like a regular list, with its advantages (no strange conversions, no range checking, ...) and disadvantages (it is slower than it could be; nothing guarantees that the column contains only strings).

If you want to be explicit or get some extra type checking, you can specify the `string` dtype in the constructor, or convert the column using the `astype` method:

In [30]:
# countries["iso"].astype("string")

# Pets ("mazlicci" = pets)
mazlicci = pd.Series(
    ["dog", "cat", "hamster", "tarantula", "boa"],
    dtype="string"
)
mazlicci

0          dog
1          cat
2      hamster
3    tarantula
4          boa
dtype: string

In [31]:
# mazlicci[0] = 42    # Try it: a "string" column refuses a value that is not a string

The object data type is the only option if we have heterogeneous data in `Series`:

In [32]:
pd.Series([1, "two", 3.0])   # A string among other "junk"

0      1
1    two
2      3
dtype: object

Note that even a list can be a value in a column of type `object`:

In [33]:
# Orders
pd.Series(
    [["schnitzel", "potatoes", "cola"], ["fried cheese", "fries"], ["soda"]],
    index=["Eva", "Evelína", "Evženie"])

Eva        [schnitzel, potatoes, cola]
Evelína          [fried cheese, fries]
Evženie                         [soda]
dtype: object

**Task:** What kind of object (and what `dtype`) do we get when we take a single row from the `planety` table?
**Task:** What happens when you convert the column `planety["obezna_doba"]` to `object`, or to `string`?

### Date / Time (datetime)
One of the following lessons deals with time data, but we already have some in the table of countries, so at least for completeness we list what `pandas` offers in this regard:
- Dates and times (*datetimes*) as points on the timeline.
- Dates and times with a time zone (*datetimes with time zone*).
- Time differences (*timedeltas*) as the length of an interval (counted in nanoseconds).
- Periods (*periods*) denoting specific spans of time (e.g. "February 2020").

💡 The `to_datetime` function converts various formats to date / time; we use it in the following example:

In [34]:
pd.to_datetime(countries["un_accession"])

name
Afghanistan   1946-11-19
Albania       1955-12-14
Algeria       1962-10-08
Andorra       1993-07-28
Angola        1976-12-01
                 ...    
Venezuela     1945-11-15
Vietnam       1977-09-20
Yemen         1947-09-30
Zambia        1964-12-01
Zimbabwe      1980-08-25
Name: un_accession, Length: 193, dtype: datetime64[ns]
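
The four kinds of time data listed above each have a scalar counterpart in `pandas`. The following sketch creates one of each; the concrete values are arbitrary examples, not taken from the table.

```python
import pandas as pd

point = pd.Timestamp("2004-05-01")                       # a point on the timeline
point_utc = pd.Timestamp("2004-05-01 12:00", tz="UTC")   # a point with a time zone
length = pd.Timedelta(days=1, hours=12)                  # a length of time
month = pd.Period("2020-02", freq="M")                   # a whole period: February 2020

print(point + length)                      # 2004-05-02 12:00:00
print(month.start_time, month.end_time)    # first and last instant of the period
print(month.days_in_month)                 # 29
```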

### Categorical
If we want to work efficiently with columns in which values repeat often (especially strings), we can encode them as categories. This often saves space and speeds up some operations. In such a conversion, `pandas` finds all unique values in the column, stores them in a special list, and keeps only the positions in that list for the individual rows. Everything behaves transparently, so in everyday use you usually can't even tell whether a column is of type `object` or `category`.

💡 The [`astype`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html) method converts between data types; its argument is the name of the dtype we want to convert to:

In [35]:
countries["income_groups"].astype("category")

name
Afghanistan             low_income
Albania        upper_middle_income
Algeria        upper_middle_income
Andorra                high_income
Angola         upper_middle_income
                      ...         
Venezuela      upper_middle_income
Vietnam        lower_middle_income
Yemen          lower_middle_income
Zambia         lower_middle_income
Zimbabwe                low_income
Name: income_groups, Length: 193, dtype: category
Categories (4, object): ['high_income', 'low_income', 'lower_middle_income', 'upper_middle_income']
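
The "special list" and the stored positions described above can be inspected through the `.cat` accessor. A minimal sketch on made-up data; the memory figures depend on your environment, so they are printed rather than quoted.

```python
import pandas as pd

# Hypothetical sample, used only for illustration
income = pd.Series(["low_income", "high_income", "low_income", "high_income", "low_income"])
as_category = income.astype("category")

print(as_category.cat.categories)   # the unique values, stored once
print(as_category.cat.codes)        # per row: the position of the value in that list

# With many repeated strings, the categorical column is typically much smaller
long_column = pd.Series(["low_income", "high_income"] * 50_000)
print(long_column.memory_usage(deep=True))
print(long_column.astype("category").memory_usage(deep=True))
```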

**Task:** Can you think of columns in the `countries` table that we should convert to some other type?

## Mathematics
`Series` in `pandas` are designed to behave as unsurprisingly as possible. Individual columns can therefore take part in arithmetic expressions together with scalar values, other columns, `numpy` arrays of the appropriate shape, and even lists.

In [36]:
# Life expectancy in days
countries["life_expectancy"] * 365

name
Afghanistan    21421.85
Albania        28473.65
Algeria        28418.90
Andorra        30130.75
Angola         23794.35
                 ...   
Venezuela      27707.15
Vietnam        27331.20
Yemen          24506.10
Zambia         21699.25
Zimbabwe       21965.70
Name: life_expectancy, Length: 193, dtype: float64

In [37]:
# Population density
countries["population"] / countries["area"]

name
Afghanistan     52.844408
Albania        112.626087
Algeria         15.526464
Andorra        189.170213
Angola          16.611855
                  ...    
Venezuela       33.265720
Vietnam        273.924591
Yemen           49.927079
Zambia          19.013832
Zimbabwe        34.113011
Length: 193, dtype: float64

In [38]:
# How the price of our lunches went up
pd.Series([109, 99], index=["řízek", "smažák"]) + [20.9, 10.9]  # adding a list

řízek     129.9
smažák    109.9
dtype: float64
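
The operand types mentioned above can be put side by side in one short sketch. The `prices` column and all its numbers are hypothetical. Note that two `Series` are matched by index label, whereas arrays and lists are matched by position.

```python
import numpy as np
import pandas as pd

# Hypothetical prices; names and numbers are illustrative only
prices = pd.Series([100.0, 200.0, 300.0], index=["a", "b", "c"])

# A scalar is applied to every element
with_scalar = prices * 2

# Another Series is matched by index label
surcharge = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
with_series = prices + surcharge

# A numpy array and a plain list are matched by position
with_array = prices - np.array([10.0, 20.0, 30.0])
with_list = prices / [1, 2, 3]
```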

**Task**: Calculate the total number of deaths in car accidents in each country (use the columns "population" and "car_deaths_per_100000_people" and simple arithmetic). Does the result look right for the Czech Republic?

In [39]:
# How long have they been in the UN?
from datetime import datetime
datetime.now() - pd.to_datetime(countries["un_accession"])

name
Afghanistan   27150 days 13:13:55.741167
Albania       23838 days 13:13:55.741167
Algeria       21348 days 13:13:55.741167
Andorra       10097 days 13:13:55.741167
Angola        16180 days 13:13:55.741167
                         ...            
Venezuela     27519 days 13:13:55.741167
Vietnam       15887 days 13:13:55.741167
Yemen         26835 days 13:13:55.741167
Zambia        20563 days 13:13:55.741167
Zimbabwe      14817 days 13:13:55.741167
Name: un_accession, Length: 193, dtype: timedelta64[ns]
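
Subtracting dates produces a `timedelta64` column, which has its own `.dt` accessor. The sketch below uses two accession dates that appear in the tables of this lesson and a fixed reference date (an assumption made here so that the result does not change from day to day, unlike `datetime.now()`).

```python
import pandas as pd

# Two UN accession dates taken from the tables in this lesson
accession = pd.to_datetime(
    pd.Series(["1993-01-19", "1945-10-24"], index=["Czechia", "France"])
)

# Fixed reference date (an assumption, chosen for reproducibility)
reference = pd.Timestamp("2020-01-01")

membership = reference - accession     # timedelta64 column
membership_days = membership.dt.days   # whole days as integers
print(membership_days)
```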

💡 Floating-point numbers can also hold the special values "not a number" (NaN) and plus or minus infinity. They arise, for example, from an ill-advised division by zero:

In [40]:
pd.Series([0, -1, 1]) / pd.Series([0, 0, 0])

0    NaN
1   -inf
2    inf
dtype: float64
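
Such special values can be detected and, if needed, unified. The following is a minimal sketch; treating infinities as missing values is only one possible cleanup strategy, not the only correct one.

```python
import numpy as np
import pandas as pd

ratios = pd.Series([0, -1, 1]) / pd.Series([0, 0, 0])

is_missing = ratios.isna()       # True only for NaN
is_infinite = np.isinf(ratios)   # True for +inf and -inf

# One common cleanup: treat infinities as missing values too
cleaned = ratios.replace([np.inf, -np.inf], np.nan)
```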

**Warning:** Be careful when working with limited-range integer types. Just as with a careless type conversion, the result of an operation can *overflow* and silently give wrong values. One more reason to stick to `int64`.

In [41]:
pd.Series([7, 14, 149], dtype="int8") * 2

0    14
1    28
2    42
dtype: int8
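
A minimal sketch of the overflow and of one way to avoid it, using hypothetical values that fit into `int8` before the multiplication but not after it:

```python
import pandas as pd

# int8 can only hold values from -128 to 127
small = pd.Series([100, 120], dtype="int8")

# 200 and 240 do not fit into int8, so the values wrap around
wrapped = small * 2

# Widening the type first gives the expected result
safe = small.astype("int64") * 2

print(wrapped.tolist(), safe.tolist())
```

No error or warning is raised in the overflowing case, which is what makes this kind of bug easy to miss.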

## Comparing

Not only arithmetic but also comparison operators can be applied to a `Series`. The result is not a single logical value but a whole column of logical values.

In [42]:
# We will treat 15 liters of pure alcohol per person per year as the threshold for excessive drinking
# (not consulted with addiction specialists!)
# Where do people drink a lot?
countries["alcohol_adults"] > 15

name
Afghanistan    False
Albania        False
Algeria        False
Andorra        False
Angola         False
               ...  
Venezuela      False
Vietnam        False
Yemen          False
Zambia         False
Zimbabwe       False
Name: alcohol_adults, Length: 193, dtype: bool

In [43]:
# Almost nowhere. And how are we doing?
countries.loc["Czechia", "alcohol_adults"] > 15

True

In [44]:
# Are men fatter than women in each country?
countries["bmi_men"] > countries["bmi_women"]

name
Afghanistan    False
Albania         True
Algeria        False
Andorra         True
Angola         False
               ...  
Venezuela      False
Vietnam        False
Yemen          False
Zambia         False
Zimbabwe       False
Length: 193, dtype: bool

**Task**: Find out if there are more men or women living in each country.

In [45]:
# Is the country in Africa?
countries["world_4region"] == "africa"

name
Afghanistan    False
Albania        False
Algeria         True
Andorra        False
Angola          True
               ...  
Venezuela      False
Vietnam        False
Yemen          False
Zambia          True
Zimbabwe        True
Name: world_4region, Length: 193, dtype: bool

As in Python, conditions can be combined using operators. However, because of Python's syntax rules, you have to use alternatives to the logical operators you know: `&` (instead of `and`), `|` (instead of `or`), and `~` (instead of `not`). Because they have a different precedence than their classic counterparts, it is best to always use parentheses when combining them with other operators.

In [46]:
# Where do both women and men live to be over 75?
(countries["life_expectancy_male"] > 75) & (countries["life_expectancy_female"] > 75)

name
Afghanistan    False
Albania         True
Algeria         True
Andorra        False
Angola         False
               ...  
Venezuela      False
Vietnam        False
Yemen          False
Zambia         False
Zimbabwe       False
Length: 193, dtype: bool
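
All three operators can be tried on a tiny hypothetical table (the names and numbers below are made up). The parentheses around the comparisons matter: `&` and `|` bind more tightly than `>` or `<`, so without them Python would group the expression differently.

```python
import pandas as pd

# Hypothetical mini-table; the numbers are made up for illustration
people = pd.DataFrame(
    {"age": [15, 30, 70], "member": [True, False, True]},
    index=["Ann", "Bob", "Eve"],
)

adult_members = (people["age"] >= 18) & people["member"]    # "and"
young_or_member = (people["age"] < 18) | people["member"]   # "or"
non_members = ~people["member"]                             # "not"
```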

## Filtering
If you want to select the rows of a table that meet some criterion, you first need to express that criterion as a column of logical values (which is not always difficult :-)). You then put this column (the column itself, not its name!) in square brackets as the index of the `DataFrame`.
For example, if you only want information about EU members, you can directly use the "is_eu" column, which contains logical values:

In [47]:
countries[countries["is_eu"]]

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Austria,AUT,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,83879.0,8441000.0,12.4,26.47,25.09,3.541,3768.0,2.9,81.84,84.249,79.585,1955-12-14
Belgium,BEL,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,30530.0,10820000.0,10.41,26.76,25.14,5.427,3733.0,3.3,81.23,83.751,79.131,1945-12-27
Bulgaria,BGR,europe_central_asia,europe,upper_middle_income,True,False,2007-01-01,2018,111000.0,7349000.0,11.4,26.54,25.52,9.662,2829.0,9.3,75.32,78.485,71.618,1955-12-14
Croatia,HRV,europe_central_asia,europe,high_income,True,False,2013-01-01,2018,56590.0,4379000.0,15.0,26.6,25.18,6.434,3059.0,3.6,77.66,81.167,74.701,1992-05-22
Cyprus,CYP,europe_central_asia,europe,high_income,True,False,2004-05-01,2018,9250.0,1141000.0,8.84,27.42,25.93,6.419,2649.0,2.5,80.79,82.918,78.734,1960-09-20
Czechia,CZE,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,78870.0,10590000.0,16.47,27.91,26.51,5.72,3256.0,2.8,79.37,81.858,76.148,1993-01-19
Denmark,DNK,europe_central_asia,europe,high_income,True,True,1973-01-01,2018,42922.0,5611000.0,12.02,26.13,25.11,3.481,3367.0,2.9,81.1,82.878,79.13,1945-10-24
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000.0,17.24,26.26,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17
Finland,FIN,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,338420.0,5419000.0,13.1,26.73,25.58,3.615,3368.0,1.9,82.06,84.423,78.934,1955-12-14
France,FRA,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,549087.0,63780000.0,12.48,25.85,24.83,2.491,3482.0,3.5,82.62,85.747,79.991,1945-10-24


You don't have to use an existing column of the table; any computed column of the same shape works too:

In [48]:
# Tiny countries
countries[countries["population"] < 100_000]  # Underscores help separate thousands visually

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Andorra,AND,europe_central_asia,europe,high_income,False,False,,2017,470.0,88910.0,10.17,27.63,26.43,,,2.1,82.55,,,1993-07-28
Antigua and Barbuda,ATG,america,americas,high_income,False,False,,2018,440.0,91400.0,8.17,25.77,27.51,,2417.0,5.8,77.6,79.028,74.154,1981-11-11
Dominica,DMA,america,americas,upper_middle_income,False,False,,2017,750.0,67700.0,8.68,24.57,28.78,,2931.0,19.6,73.01,,,1978-12-18
Liechtenstein,LIE,europe_central_asia,europe,high_income,False,False,,2017,160.0,36870.0,,,,,,1.76,,,,1990-09-18
Marshall Islands,MHL,east_asia_pacific,asia,upper_middle_income,False,False,,2017,180.0,56690.0,,29.37,31.39,1.8,,29.6,65.0,,,1991-09-17
Monaco,MCO,europe_central_asia,europe,high_income,False,False,,2017,2.0,35460.0,,,,,,2.8,,,,1993-05-28
Nauru,NRU,east_asia_pacific,asia,,False,False,,2015,20.0,10440.0,4.81,33.9,35.02,,,29.1,,,,1999-09-14
Palau,PLW,east_asia_pacific,asia,upper_middle_income,False,False,,2017,460.0,20920.0,9.86,30.38,31.85,10.73,,14.2,,,,1994-12-15
Saint Kitts and Nevis,KNA,america,americas,high_income,False,False,,2017,260.0,54340.0,10.62,28.23,30.51,,2492.0,8.4,,,,1983-09-23
San Marino,SMR,europe_central_asia,europe,high_income,False,False,,2017,60.0,32160.0,,,,5.946,,2.6,,,,1992-03-02


... and of course combinations:

In [49]:
# Poorer EU countries
countries[countries["is_eu"] & (countries["income_groups"] != "high_income")]

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Bulgaria,BGR,europe_central_asia,europe,upper_middle_income,True,False,2007-01-01,2018,111000.0,7349000.0,11.4,26.54,25.52,9.662,2829.0,9.3,75.32,78.485,71.618,1955-12-14
Hungary,HUN,europe_central_asia,europe,upper_middle_income,True,True,2004-05-01,2018,93030.0,9934000.0,16.12,27.12,25.98,5.234,3037.0,5.3,75.9,79.557,72.61,1955-12-14
Romania,ROU,europe_central_asia,europe,upper_middle_income,True,False,2007-01-01,2018,238390.0,21340000.0,16.15,25.41,25.22,8.808,3358.0,9.7,75.53,79.158,72.265,1955-12-14


In [50]:
# Which OECD countries have a life expectancy of less than 78 years?
countries[countries["is_oecd"] & (countries["life_expectancy"] < 78)]

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000.0,17.24,26.26,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17
Hungary,HUN,europe_central_asia,europe,upper_middle_income,True,True,2004-05-01,2018,93030.0,9934000.0,16.12,27.12,25.98,5.234,3037.0,5.3,75.9,79.557,72.61,1955-12-14
Latvia,LVA,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,64490.0,2226000.0,13.45,26.46,25.62,8.275,3174.0,6.9,75.13,79.498,69.882,1991-09-17
Lithuania,LTU,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,65286.0,3278000.0,16.3,26.86,26.01,8.09,3417.0,3.3,75.31,80.06,69.554,1991-09-17
Mexico,MEX,america,americas,upper_middle_income,False,True,,2018,1964380.0,117500000.0,8.55,27.42,28.74,9.468,3072.0,11.3,76.78,79.88,75.12,1945-11-07
Slovakia,SVK,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,49035.0,5489000.0,13.31,26.93,26.32,6.746,2944.0,5.8,77.16,80.511,73.589,1993-01-19
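
The filtering pattern used in the cells above can be reduced to a minimal self-contained sketch. The `sample` table is a hypothetical three-row stand-in for `countries`; its values are copied from the outputs above.

```python
import pandas as pd

# Hypothetical mini-table; values are copied from the outputs above
sample = pd.DataFrame(
    {
        "population": [88910.0, 10590000.0, 63780000.0],
        "is_eu": [False, True, True],
    },
    index=["Andorra", "Czechia", "France"],
)

# Step 1: build a column of logical values (one per row)
mask = sample["is_eu"] & (sample["population"] > 50_000_000)

# Step 2: use that column as the index; only rows where it is True are kept
large_eu = sample[mask]
print(large_eu)
```

Storing the mask in a variable first is optional, but it makes longer conditions easier to read and reuse.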


Because this way of filtering is a bit clumsy, there is also the [`query`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) method, which selects rows based on a string describing some (in)equality between column names and numeric values (a form that covers many cases, though not all of them).

In [51]:
# Really big countries (population over 100 million)
countries.query("population > 100_000_000")

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Bangladesh,BGD,south_asia,asia,low_income,False,False,,2018,147630.0,154400000.0,0.17,20.4,20.55,4.401,2450.0,30.7,73.41,74.937,71.484,1974-09-17
Brazil,BRA,america,americas,upper_middle_income,False,False,,2018,8515770.0,200100000.0,10.08,25.79,25.99,1.872,3263.0,14.6,75.7,79.527,72.34,1945-10-24
China,CHN,east_asia_pacific,asia,upper_middle_income,False,False,,2018,9562911.0,1359000000.0,5.56,22.92,22.91,3.59,3108.0,9.2,76.92,78.163,75.096,1945-10-24
India,IND,south_asia,asia,lower_middle_income,False,False,,2018,3287259.0,1275000000.0,2.69,20.96,21.31,3.034,2459.0,37.9,69.1,70.678,67.538,1945-10-30
Indonesia,IDN,east_asia_pacific,asia,lower_middle_income,False,False,,2018,1910931.0,247200000.0,0.56,21.86,22.99,1.232,2777.0,22.8,72.03,71.742,67.426,1950-09-28
Japan,JPN,east_asia_pacific,asia,high_income,False,True,,2018,377962.0,126300000.0,7.79,23.5,21.87,1.381,2726.0,2.0,84.17,87.244,80.803,1956-12-18
Mexico,MEX,america,americas,upper_middle_income,False,True,,2018,1964380.0,117500000.0,8.55,27.42,28.74,9.468,3072.0,11.3,76.78,79.88,75.12,1945-11-07
Nigeria,NGA,sub_saharan_africa,africa,lower_middle_income,False,False,,2018,923770.0,170900000.0,12.72,23.03,23.67,,2700.0,69.4,66.14,55.158,53.512,1960-10-07
Pakistan,PAK,south_asia,asia,lower_middle_income,False,False,,2018,796100.0,183200000.0,0.05,22.3,23.45,,2440.0,65.8,67.96,67.869,65.75,1947-09-30
Russia,RUS,europe_central_asia,europe,high_income,False,False,,2018,17098250.0,142600000.0,16.23,26.01,27.21,14.38,3361.0,8.2,71.07,76.882,65.771,1945-10-24


In [52]:
# In which EU countries do people eat a lot?
countries.query("is_eu & (calories_per_day > 3500)")

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Austria,AUT,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,83879.0,8441000.0,12.4,26.47,25.09,3.541,3768.0,2.9,81.84,84.249,79.585,1955-12-14
Belgium,BEL,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,30530.0,10820000.0,10.41,26.76,25.14,5.427,3733.0,3.3,81.23,83.751,79.131,1945-12-27
Ireland,IRL,europe_central_asia,europe,high_income,True,True,1973-01-01,2018,70280.0,4631000.0,14.92,27.65,26.62,3.768,3600.0,3.0,81.49,83.737,79.885,1955-12-14
Italy,ITA,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,301340.0,61090000.0,9.72,26.48,24.79,3.778,3579.0,2.9,82.62,85.435,81.146,1955-12-14
Luxembourg,LUX,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,2590.0,530000.0,12.84,27.43,26.09,5.971,3539.0,1.5,82.39,84.227,79.981,1945-10-24


**Task**: Which is the only country in Africa that belongs to the high-income group?

**Task**: In which countries do people drink a lot? (Use the criterion above or any other.)

## Sorting

In the introductory `pandas` lesson, we already showed how to sort rows by index using the [`sort_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html) method. Since `countries` is already sorted, let's try it again on the planets:

In [53]:
planety.sort_index()

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_obr
Jupiter,♃,5.2,11.86,79,True
Mars,♂,1.52,1.88,2,False
Merkur,☿,0.39,0.24,0,False
Neptun,♆,30.06,164.8,14,True
Saturn,♄,9.54,29.46,82,True
Uran,♅,19.22,84.01,27,True
Venuše,♀,0.72,0.62,0,False
Země,⊕,1.0,1.0,1,False


The [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method is used to sort a `Series` by its values:

In [54]:
# 10 countries with the smallest population
countries["population"].sort_values().head(10)

name
Tuvalu                    9888.0
Nauru                    10440.0
Palau                    20920.0
San Marino               32160.0
Monaco                   35460.0
Liechtenstein            36870.0
Saint Kitts and Nevis    54340.0
Marshall Islands         56690.0
Dominica                 67700.0
Seychelles               87420.0
Name: population, dtype: float64

The optional argument `ascending` sets the sort direction. The default is `True`, so changing it to `False` sorts from largest to smallest:

In [55]:
# The 10 largest countries by area
countries["area"].sort_values(ascending=False).head(10)

name
Russia           17098250.0
Canada            9984670.0
United States     9831510.0
China             9562911.0
Brazil            8515770.0
Australia         7741220.0
India             3287259.0
Argentina         2780400.0
Kazakhstan        2724902.0
Algeria           2381740.0
Name: area, dtype: float64

In the case of a table, the first argument should be the name of the column (or columns) by which we want to sort:

In [56]:
# 10 countries with the highest alcohol consumption per capita
countries.sort_values("alcohol_adults", ascending=False).head(10)

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Moldova,MDA,europe_central_asia,europe,lower_middle_income,False,False,,2018,33850.0,3496000.0,23.01,24.24,27.06,5.529,2714.0,13.6,72.41,76.09,67.544,1992-03-02
South Korea,KOR,east_asia_pacific,asia,high_income,False,True,,2018,100280.0,48770000.0,19.15,23.99,23.33,4.319,3334.0,2.9,81.35,85.467,79.456,1991-09-17
Belarus,BLR,europe_central_asia,europe,upper_middle_income,False,False,,2018,207600.0,9498000.0,18.85,26.16,26.64,8.454,3250.0,3.4,73.76,78.583,67.693,1945-10-24
North Korea,PRK,east_asia_pacific,asia,low_income,False,False,,2018,120540.0,24650000.0,18.28,22.02,21.25,,2094.0,19.7,71.13,75.512,68.45,1991-09-17
Ukraine,UKR,europe_central_asia,europe,lower_middle_income,False,False,,2018,603550.0,44700000.0,17.47,25.42,26.23,8.771,3138.0,7.7,72.29,77.067,67.246,1945-10-24
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000.0,17.24,26.26,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17
Czechia,CZE,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,78870.0,10590000.0,16.47,27.91,26.51,5.72,3256.0,2.8,79.37,81.858,76.148,1993-01-19
Uganda,UGA,sub_saharan_africa,africa,low_income,False,False,,2018,241550.0,36760000.0,16.4,22.36,22.48,13.69,2130.0,37.7,62.86,62.667,58.252,1962-10-25
Lithuania,LTU,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,65286.0,3278000.0,16.3,26.86,26.01,8.09,3417.0,3.3,75.31,80.06,69.554,1991-09-17
Russia,RUS,europe_central_asia,europe,high_income,False,False,,2018,17098250.0,142600000.0,16.23,26.01,27.21,14.38,3361.0,8.2,71.07,76.882,65.771,1945-10-24


💡 In the next cell, the entire code is enclosed in parentheses. This allowed us to stretch one expression into several lines so that we could comment on its parts properly.

In [57]:
(
    # Consider only EU members
    countries[countries["is_eu"]]

    # Sort first by date of EU accession, then by UN accession
    .sort_values(["eu_accession", "un_accession"])

    # Show only those two columns
    [["eu_accession", "un_accession"]]
)

name,eu_accession,un_accession
France,1952-07-23,1945-10-24
Luxembourg,1952-07-23,1945-10-24
Netherlands,1952-07-23,1945-12-10
Belgium,1952-07-23,1945-12-27
Italy,1952-07-23,1955-12-14
Germany,1952-07-23,1973-09-18
Denmark,1973-01-01,1945-10-24
United Kingdom,1973-01-01,1945-10-24
Ireland,1973-01-01,1955-12-14
Greece,1981-01-01,1945-10-25


**Task:** Sort the countries of the world by population density.

**Task:** Which countries have a problem with overweight (the average BMI of men and women is over 25)?

**Task:** In which 20 countries do the most people die in car accidents in absolute terms?

## Save Results!
And that's slowly the end. But we have done an (almost) non-trivial amount of work, and it would be lost by next time. Fortunately, writing a `DataFrame` to an external file in one of the common formats is not complicated at all. The `pd.read_XXX` family of functions has a counterpart in the `DataFrame.to_XXX` methods. They differ in their parameters, but basic use is very simple:

In [58]:
planety.to_csv("planety.csv")

In [59]:
planety.to_excel("planety.xlsx")

Excel and CSV are not entirely suitable formats for storing large data (alternatives include [feather](https://github.com/wesm/feather) or [parquet](https://en.wikipedia.org/wiki/Apache_Parquet)), but for our purposes (small files, human-readable text format) CSV will suffice.
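
A quick way to check that nothing is lost on the way is to write a table and read it back. The sketch below does this with a hypothetical two-row table modelled on the planets example. It writes to an in-memory buffer instead of a real file (an assumption made only to keep the example self-contained).

```python
import io

import pandas as pd

# Hypothetical mini-table modelled on the planets example
table = pd.DataFrame(
    {"symbol": ["☿", "♀"], "mesice": [0, 0]},
    index=pd.Index(["Merkur", "Venuše"], name="jmeno"),
)

buffer = io.StringIO()   # in-memory stand-in for a file on disk
table.to_csv(buffer)     # the index is written as the first column
buffer.seek(0)           # rewind before reading

# Tell pandas which column should become the index again
restored = pd.read_csv(buffer, index_col="jmeno")
print(restored)
```

Without `index_col`, the former index would come back as an ordinary column and a new numeric index would be created.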

Another option is to create an HTML table (which can be given various formatting; we will leave that for another time or for home study, see the ["Styling" documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html)):

In [60]:
planety.to_html("planety.html")

**Task**: See what you find in the output files.

**Task**: Look at the list of possible output formats and try to write the planets or countries in one of them: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#serialization-io-conversion

And that's really all. 👋