# Pandas

Python Data Analysis Toolkit

In [1]:
import numpy as np
import pandas as pd

## Series

A one-dimensional list of values (an array).

In addition to the array, a Series has a name and a labeled index.

Under the hood, a Series typically stores its values in a NumPy array.
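The three parts described above (values, index, name) can also be passed to the constructor in one call. The following is a minimal sketch; the variable name `population_m` and the English series name are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Values, index labels and name passed directly to the constructor.
population_m = pd.Series(
    [9, 83.7, 8.6, 0.38],
    index=["CH", "DE", "AT", "LI"],
    name="Population (millions)",
)

values = population_m.to_numpy()  # the values as a NumPy array
print(type(values), values.dtype)
print(population_m.index.tolist())
print(population_m.name)
```

Because the list mixes integers and floats, pandas stores all values as float64.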

In [3]:
pd.Series([9, 83.7, 8.6, 0.38])

0     9.00
1    83.70
2     8.60
3     0.38
dtype: float64

In [4]:
einwohnerM = pd.Series([9, 83.7, 8.6, 0.38])

In [5]:
einwohnerM.name = "Einwohnerzahlen Europa"

In [6]:
einwohnerM

0     9.00
1    83.70
2     8.60
3     0.38
Name: Einwohnerzahlen Europa, dtype: float64

In [7]:
einwohnerM.index = ["CH", "DE", "AT", "LI"]

In [8]:
einwohnerM

CH     9.00
DE    83.70
AT     8.60
LI     0.38
Name: Einwohnerzahlen Europa, dtype: float64

The usual array operations can now be used on the Series

In [9]:
einwohnerM.iloc[0]  # access by position; einwohnerM[0] on a label index raises a FutureWarning in recent pandas

9.0

In [10]:
einwohnerM["AT"]

8.6

In [11]:
einwohnerM["CH":"AT"]

CH     9.0
DE    83.7
AT     8.6
Name: Einwohnerzahlen Europa, dtype: float64
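The cells above mix access by position and access by label. A small sketch of the difference, using the explicit `iloc` and `loc` accessors on a Series with the same values (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series([9, 83.7, 8.6, 0.38], index=["CH", "DE", "AT", "LI"])

first_by_position = s.iloc[0]    # position 0
by_label = s.loc["AT"]           # label "AT"
label_slice = s.loc["CH":"AT"]   # label slices include the end label
position_slice = s.iloc[0:2]     # position slices exclude the end position

print(label_slice.index.tolist())
print(position_slice.index.tolist())
```

Note that the label slice keeps "AT", while the position slice stops before position 2.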

Task: Find all countries with an above-average population

In [12]:
einwohnerM.mean()

25.419999999999998

In [14]:
einwohnerM > einwohnerM.mean()

CH    False
DE     True
AT    False
LI    False
Name: Einwohnerzahlen Europa, dtype: bool

In [15]:
einwohnerM[einwohnerM > einwohnerM.mean()]

DE    83.7
Name: Einwohnerzahlen Europa, dtype: float64

## DataFrame

A two-dimensional collection of Series (i.e. a table)

In [16]:
data = pd.DataFrame({"Col1": [1, 2, 3], "Col2": [4, 5, 6], "Col3": [7, 8, 9]})

In [17]:
data

,Col1,Col2,Col3
0,1,4,7
1,2,5,8
2,3,6,9


### Reading external data

Data can be read from many kinds of sources (e.g. files, databases, ...)

Pandas provides a family of functions for this, the read_... functions

Reading data with read_csv:

In [37]:
data = pd.read_csv("Data/PopulationData.csv", delimiter=";", thousands=",", decimal=".")
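To see what the `delimiter`, `thousands` and `decimal` arguments do without needing the file, the same call can be run on an in-memory string. This is only a sketch: the two rows and the country names are made up, and `io.StringIO` stands in for the file path.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a layout similar to the file above.
csv_text = (
    "#;Country;Population;Fert.Rate\n"
    "1;Testland;1,234,567;1.7\n"
    "2;Samplestan;89,000;N.A.\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    delimiter=";",   # field separator
    thousands=",",   # "1,234,567" is parsed as 1234567
    decimal=".",     # decimal point
)
print(sample.dtypes)
```

The Population column becomes an integer column, while Fert.Rate stays non-numeric because of the "N.A." entry.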

In [38]:
data

,#,Country (or dependency),Population(2020),YearlyChange,NetChange,Density(P/Km²),Land Area(Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
0,1,China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
1,2,India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
2,3,United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
3,4,Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
4,5,Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...,...,...
230,231,Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
231,232,Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
232,233,Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
233,234,Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


To get a quick overview, use the info() method

In [39]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   #                        235 non-null    int64  
 1   Country (or dependency)  235 non-null    object 
 2   Population(2020)         235 non-null    int64  
 3   YearlyChange             235 non-null    object 
 4   NetChange                235 non-null    int64  
 5   Density(P/Km²)           235 non-null    int64  
 6   Land Area(Km²)           235 non-null    int64  
 7   Migrants(net)            201 non-null    float64
 8   Fert.Rate                235 non-null    object 
 9   Med.Age                  235 non-null    object 
 10  UrbanPop %               235 non-null    object 
 11  WorldShare               235 non-null    object 
dtypes: float64(1), int64(5), object(6)
memory usage: 22.2+ KB


The dataset has a few problems:
- Numeric columns are stored as strings (object)
- Values with a % sign are not parsed as numbers
- Missing values appear in several forms (NaN, N.A., ...)
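One possible way to address the problems listed above is sketched below. It is not part of the original notebook: the sample rows are invented, and the choice of `na_values` plus `pd.to_numeric` is just one option among several.

```python
import io
import pandas as pd

# Hypothetical sample with the same kinds of problems as the real file.
csv_text = (
    "Country;YearlyChange;Fert.Rate;Med.Age\n"
    "A;0.39 %;1.7;38\n"
    "B;-1.69 %;N.A.;N.A.\n"
)

# na_values tells read_csv to treat "N.A." as a missing value.
sample = pd.read_csv(io.StringIO(csv_text), delimiter=";", na_values=["N.A."])

# Remove the percent sign and surrounding spaces, then convert to numbers.
sample["YearlyChange"] = pd.to_numeric(
    sample["YearlyChange"].str.replace("%", "", regex=False).str.strip()
)
print(sample.dtypes)
```

With "N.A." declared as missing, Fert.Rate and Med.Age are read as float columns (NaN forces a float dtype), and YearlyChange becomes numeric after the cleanup.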

### Adjusting the DataFrame itself

The DataFrame itself still has some minor flaws:
- Two index columns (the default index plus the # column)
- Poor column names

Make Country the index column:

In [40]:
data.set_index("Country (or dependency)", inplace=True)
# Most pandas methods return a new object and leave the original unchanged
# Use inplace=True or reassign (data = data...) to keep the change

In [41]:
data = data.drop(columns=["#"])
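The two cells above use both styles, `inplace=True` and reassignment. A minimal sketch on a made-up three-column frame shows that a method call without either leaves the original untouched:

```python
import pandas as pd

df = pd.DataFrame({"#": [1, 2], "Country": ["A", "B"], "Pop": [10, 20]})

dropped = df.drop(columns=["#"])  # returns a new DataFrame
print(list(df.columns))           # the original still contains "#"
print(list(dropped.columns))

df = df.set_index("Country")      # reassignment instead of inplace=True
print(df.index.name)
```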

In [42]:
data

Country (or dependency),Population(2020),YearlyChange,NetChange,Density(P/Km²),Land Area(Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...
Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


#### Renaming columns

rename takes a dictionary whose keys are the old column names and whose values are the new column names

In [47]:
data.rename(columns={
    "Country (or dependency)": "Country",  # no effect: this is the index now, not a column
    "Population(2020)": "Pop",
    "Density(P/Km²)": "Density",
    "Land Area(Km²)": "Area",
    "Migrants(net)": "Migrants",
    "UrbanPop %": "UrbanPop"
}, inplace=True)

In [48]:
data

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...
Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


In [49]:
data.index.name = "Country"  # set the index name
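The rename call and the separate index-name assignment can be reproduced on a small made-up frame. The sketch below also uses `rename_axis`, an alternative to assigning `index.name` that returns a new object; the key "NoSuchColumn" is deliberately invalid to show that unmatched keys are ignored by default.

```python
import pandas as pd

df = pd.DataFrame(
    {"Population(2020)": [10, 20]},
    index=pd.Index(["A", "B"], name="Country (or dependency)"),
)

# Keys that match no column are silently ignored (default errors="ignore").
df = df.rename(columns={"Population(2020)": "Pop", "NoSuchColumn": "X"})

# rename_axis returns a copy with a new index name.
df = df.rename_axis("Country")
print(df.columns.tolist(), df.index.name)
```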

In [50]:
data

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...
Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


#### head/tail

Show the first/last n rows

In [51]:
data.head(10)

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
Brazil,212559417,0.72 %,1509890,25,8358140,21200.0,1.7,33,88 %,2.73 %
Nigeria,206139589,2.58 %,5175990,226,910770,-60000.0,5.4,18,52 %,2.64 %
Bangladesh,164689383,1.01 %,1643222,1265,130170,-369501.0,2.1,28,39 %,2.11 %
Russia,145934462,0.04 %,62206,9,16376870,182456.0,1.8,40,74 %,1.87 %
Mexico,128932753,1.06 %,1357224,66,1943950,-60000.0,2.1,29,84 %,1.65 %


In [52]:
data.tail(10)

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Wallis & Futuna,11239,-1.69 %,-193,80,140,,N.A.,N.A.,0 %,0.00 %
Nauru,10824,0.63 %,68,541,20,,N.A.,N.A.,N.A.,0.00 %
Saint Barthelemy,9877,0.30 %,30,470,21,,N.A.,N.A.,0 %,0.00 %
Saint Helena,6077,0.30 %,18,16,390,,N.A.,N.A.,27 %,0.00 %
Saint Pierre & Miquelon,5794,-0.48 %,-28,25,230,,N.A.,N.A.,100 %,0.00 %
Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %
Vatican State,801,0.25 %,2,2003,0,,N.A.,N.A.,N.A.,0.00 %


### Accessing data

Three ways to access data:
- Via the index operator []
- Via loc[]
- Via iloc[]

Plain index operator: with a column name, [] selects a column

In [58]:
data["Pop"]  # get a column as a Series by its name

Country
China               1439323776
India               1380004385
United States        331002651
Indonesia            273523615
Pakistan             220892340
                       ...    
Montserrat                4992
Falkland Islands          3480
Niue                      1626
Tokelau                   1357
Vatican State              801
Name: Pop, Length: 235, dtype: int64

loc: label-based selection, rows first and optionally columns

In [59]:
data.loc["China"]  # get a row as a Series by its label

Pop             1439323776
YearlyChange        0.39 %
NetChange          5540090
Density                153
Area               9388211
Migrants         -348399.0
Fert.Rate              1.7
Med.Age                 38
UrbanPop              61 %
WorldShare         18.47 %
Name: China, dtype: object

In [70]:
data.loc["China" : "United States"]

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %


In [74]:
data.loc["China" : "United States", "Pop" : "Area"]  # loc can also restrict the columns

Country,Pop,YearlyChange,NetChange,Density,Area
China,1439323776,0.39 %,5540090,153,9388211
India,1380004385,0.99 %,13586631,464,2973190
United States,331002651,0.59 %,1937734,36,9147420


In [76]:
data.iloc[0:4]  # iloc: integer-location, like loc but with positions instead of labels

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
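The three access paths from this section can be compared side by side on a small made-up frame. This is only a sketch; the labels A, B, C and the numbers are placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    {"Pop": [30, 20, 10], "Area": [3, 2, 1], "Density": [10, 10, 10]},
    index=pd.Index(["A", "B", "C"], name="Country"),
)

column = df["Pop"]                     # [] with a label selects a column
row = df.loc["A"]                      # loc with one label selects a row
cell = df.loc["B", "Area"]             # row label, column label
block = df.loc["A":"B", "Pop":"Area"]  # label slices include both ends
first_two = df.iloc[0:2]               # position slice, end excluded
corner = df.iloc[0, 0]                 # row position, column position

print(block)
print(first_two.index.tolist())
```

As in the outputs above, `loc["China" : "United States"]` returns three rows because the end label is included, while `iloc[0:4]` returns four rows because position 4 is excluded.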


### Sorting

Rows can be sorted by column values with the sort_values method

In [79]:
data.sort_values("Area", ascending=False)

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Russia,145934462,0.04 %,62206,9,16376870,182456.0,1.8,40,74 %,1.87 %
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Canada,37742154,0.89 %,331107,4,9093510,242032.0,1.5,41,81 %,0.48 %
Brazil,212559417,0.72 %,1509890,25,8358140,21200.0,1.7,33,88 %,2.73 %
...,...,...,...,...,...,...,...,...,...,...
Nauru,10824,0.63 %,68,541,20,,N.A.,N.A.,N.A.,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %
Gibraltar,33691,-0.03 %,-10,3369,10,,N.A.,N.A.,N.A.,0.00 %
Monaco,39242,0.71 %,278,26337,1,,N.A.,N.A.,N.A.,0.00 %


In [80]:
data.sort_values(["Area", "Pop"], ascending=False)  # sort by Area first, then by Pop to break ties

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Russia,145934462,0.04 %,62206,9,16376870,182456.0,1.8,40,74 %,1.87 %
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Canada,37742154,0.89 %,331107,4,9093510,242032.0,1.5,41,81 %,0.48 %
Brazil,212559417,0.72 %,1509890,25,8358140,21200.0,1.7,33,88 %,2.73 %
...,...,...,...,...,...,...,...,...,...,...
Nauru,10824,0.63 %,68,541,20,,N.A.,N.A.,N.A.,0.00 %
Gibraltar,33691,-0.03 %,-10,3369,10,,N.A.,N.A.,N.A.,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %
Monaco,39242,0.71 %,278,26337,1,,N.A.,N.A.,N.A.,0.00 %
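In the output above, Gibraltar and Tokelau share the same Area, so the second sort key decides their order. The sketch below reproduces this tie-breaking on made-up data and also shows that `ascending` accepts one flag per column.

```python
import pandas as pd

df = pd.DataFrame(
    {"Area": [10, 10, 1, 20], "Pop": [100, 300, 200, 50]},
    index=["A", "B", "C", "D"],
)

# Area descending; ties in Area are broken by Pop descending.
both_desc = df.sort_values(["Area", "Pop"], ascending=False)

# One flag per column: Area descending, Pop ascending within equal Area.
mixed = df.sort_values(["Area", "Pop"], ascending=[False, True])

print(both_desc.index.tolist())
print(mixed.index.tolist())
```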


Sort by the index

In [81]:
data.sort_index()

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Afghanistan,38928346,2.33 %,886592,60,652860,-62920.0,4.6,18,25 %,0.50 %
Albania,2877797,-0.11 %,-3120,105,27400,-14000.0,1.6,36,63 %,0.04 %
Algeria,43851044,1.85 %,797990,18,2381740,-10000.0,3.1,29,73 %,0.56 %
American Samoa,55191,-0.22 %,-121,276,200,,N.A.,N.A.,88 %,0.00 %
Andorra,77265,0.16 %,123,164,470,,N.A.,N.A.,88 %,0.00 %
...,...,...,...,...,...,...,...,...,...,...
Wallis & Futuna,11239,-1.69 %,-193,80,140,,N.A.,N.A.,0 %,0.00 %
Western Sahara,597339,2.55 %,14876,2,266000,5582.0,2.4,28,87 %,0.01 %
Yemen,29825964,2.28 %,664042,56,527970,-30000.0,3.8,20,38 %,0.38 %
Zambia,18383955,2.93 %,522925,25,743390,-8000.0,4.7,18,45 %,0.24 %


### Filtering data

Using Boolean masks

Task: Find all countries with more than 100 million inhabitants

In [82]:
data["Pop"] > 100_000_000

Country
China                True
India                True
United States        True
Indonesia            True
Pakistan             True
                    ...  
Montserrat          False
Falkland Islands    False
Niue                False
Tokelau             False
Vatican State       False
Name: Pop, Length: 235, dtype: bool

In [84]:
data[data["Pop"] > 100_000_000]

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
Brazil,212559417,0.72 %,1509890,25,8358140,21200.0,1.7,33,88 %,2.73 %
Nigeria,206139589,2.58 %,5175990,226,910770,-60000.0,5.4,18,52 %,2.64 %
Bangladesh,164689383,1.01 %,1643222,1265,130170,-369501.0,2.1,28,39 %,2.11 %
Russia,145934462,0.04 %,62206,9,16376870,182456.0,1.8,40,74 %,1.87 %
Mexico,128932753,1.06 %,1357224,66,1943950,-60000.0,2.1,29,84 %,1.65 %


Task: Find all countries with between 50 million and 100 million inhabitants

In [85]:
data["Pop"] > 50_000_000

Country
China                True
India                True
United States        True
Indonesia            True
Pakistan             True
                    ...  
Montserrat          False
Falkland Islands    False
Niue                False
Tokelau             False
Vatican State       False
Name: Pop, Length: 235, dtype: bool

In [86]:
data["Pop"] < 100_000_000

Country
China               False
India               False
United States       False
Indonesia           False
Pakistan            False
                    ...  
Montserrat           True
Falkland Islands     True
Niue                 True
Tokelau              True
Vatican State        True
Name: Pop, Length: 235, dtype: bool

In [91]:
data[(data["Pop"] > 50_000_000) & (data["Pop"] < 100_000_000)]  # combine conditions with & (and) or | (or); each condition needs its own parentheses

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Vietnam,97338579,0.91 %,876473,314,310070,-80000.0,2.1,32,38 %,1.25 %
DR Congo,89561403,3.19 %,2770836,40,2267050,23861.0,6.0,17,46 %,1.15 %
Turkey,84339067,1.09 %,909452,110,769630,283922.0,2.1,32,76 %,1.08 %
Iran,83992949,1.30 %,1079043,52,1628550,-55000.0,2.2,32,76 %,1.08 %
Germany,83783942,0.32 %,266897,240,348560,543822.0,1.6,46,76 %,1.07 %
Thailand,69799978,0.25 %,174396,137,510890,19444.0,1.5,40,51 %,0.90 %
United Kingdom,67886011,0.53 %,355839,281,241930,260650.0,1.8,40,83 %,0.87 %
France,65273511,0.22 %,143783,119,547557,36527.0,1.9,42,82 %,0.84 %
Italy,60461826,-0.15 %,-88249,206,294140,148943.0,1.3,47,69 %,0.78 %
Tanzania,59734218,2.98 %,1728755,67,885800,-40076.0,4.9,18,37 %,0.77 %


In [94]:
data[(data["Pop"] > 50_000_000) & (data["Pop"] < 100_000_000)].sort_index()

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Colombia,50882891,1.08 %,543448,46,1109500,204796.0,1.8,31,80 %,0.65 %
DR Congo,89561403,3.19 %,2770836,40,2267050,23861.0,6.0,17,46 %,1.15 %
France,65273511,0.22 %,143783,119,547557,36527.0,1.9,42,82 %,0.84 %
Germany,83783942,0.32 %,266897,240,348560,543822.0,1.6,46,76 %,1.07 %
Iran,83992949,1.30 %,1079043,52,1628550,-55000.0,2.2,32,76 %,1.08 %
Italy,60461826,-0.15 %,-88249,206,294140,148943.0,1.3,47,69 %,0.78 %
Kenya,53771296,2.28 %,1197323,94,569140,-10000.0,3.5,20,28 %,0.69 %
Myanmar,54409800,0.67 %,364380,83,653290,-163313.0,2.2,29,31 %,0.70 %
South Africa,59308690,1.28 %,750420,49,1213090,145405.0,2.4,28,67 %,0.76 %
South Korea,51269185,0.09 %,43877,527,97230,11731.0,1.1,44,82 %,0.66 %
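The range filter above can be written in more than one way. The following sketch, on made-up numbers, shows the mask with parentheses, the equivalent `Series.between` call, and negation with `~`; `between` is an alternative not used in the original notebook.

```python
import pandas as pd

df = pd.DataFrame({"Pop": [40, 60, 90, 120]}, index=["A", "B", "C", "D"])

# & binds more tightly than > and <, so each comparison needs parentheses.
mask = (df["Pop"] > 50) & (df["Pop"] < 100)
in_range = df[mask]

# Series.between expresses the same range; "neither" excludes both bounds.
same = df[df["Pop"].between(50, 100, inclusive="neither")]

# ~ negates a mask.
outside = df[~mask]

print(in_range.index.tolist())
print(outside.index.tolist())
```

Python's `and` and `or` do not work element-wise on Series, which is why the operators `&` and `|` are needed.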


### Grouping

Build groups based on a criterion; each row then belongs to exactly one group

Example: grouping by Med.Age
- Group 15
- Group 16
- Group 17
- ...

In [98]:
data.groupby("Med.Age").count()["Pop"]  # group keys on the left, number of rows per group on the right

Med.Age
15       1
16       1
17       6
18      10
19      14
20       9
21       5
22       7
23       4
24       6
25       4
26       7
27       4
28      12
29       5
30       8
31       6
32      11
33       5
34       6
35       3
36       4
37       4
38       7
39       3
40       7
41       5
42      10
43      11
44       5
45       5
46       3
47       2
48       1
N.A.    34
Name: Pop, dtype: int64
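Counting rows per group, as in the cell above, can also be done with `size()` or `value_counts()`. The sketch below uses invented rows; as in the real data, the Med.Age keys are strings, which is why `get_group` is called with "19" rather than 19.

```python
import pandas as pd

df = pd.DataFrame(
    {"Med.Age": ["19", "19", "28", "N.A."], "Pop": [1, 2, 3, 4]},
    index=["A", "B", "C", "D"],
)

per_group = df.groupby("Med.Age").size()   # number of rows per group
counts = df["Med.Age"].value_counts()      # same numbers, sorted by frequency
group_19 = df.groupby("Med.Age").get_group("19")

print(per_group)
print(group_19.index.tolist())
```

Unlike `count()`, `size()` counts rows regardless of missing values in other columns, so no column has to be picked afterwards.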

In [101]:
data.groupby("Med.Age").get_group("19")

Country,Pop,YearlyChange,NetChange,Density,Area,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Ethiopia,114963588,2.57 %,2884858,115,1000000,30000.0,4.3,19,21 %,1.47 %
Cameroon,26545863,2.59 %,669483,56,472710,-4800.0,4.6,19,56 %,0.34 %
Côte d'Ivoire,26378274,2.57 %,661730,83,318000,-8000.0,4.7,19,51 %,0.34 %
Senegal,16743927,2.75 %,447563,87,192530,-20000.0,4.7,19,49 %,0.21 %
Zimbabwe,14862924,1.48 %,217456,38,386850,-116858.0,3.6,19,38 %,0.19 %
Benin,12123200,2.73 %,322049,108,112760,-2000.0,4.9,19,48 %,0.16 %
South Sudan,11193725,1.19 %,131612,18,610952,-174200.0,4.7,19,25 %,0.14 %
Togo,8278724,2.43 %,196358,152,54390,-2000.0,4.4,19,43 %,0.11 %
Sierra Leone,7976983,2.10 %,163768,111,72180,-4200.0,4.3,19,43 %,0.10 %
Congo,5518087,2.56 %,137579,16,341500,-4000.0,4.5,19,70 %,0.07 %


Task: What is the mean population for each median age?

In [105]:
data.groupby("Med.Age")["Pop"].mean()

Med.Age
15      2.420664e+07
16      2.025083e+07
17      3.539643e+07
18      4.148540e+07
19      1.824118e+07
20      1.939652e+07
21      9.621932e+06
22      7.686515e+06
23      6.028827e+07
24      7.213284e+06
25      3.304188e+07
26      2.510786e+07
27      2.088666e+06
28      1.398790e+08
29      4.693030e+07
30      4.894655e+07
31      1.847800e+07
32      3.284957e+07
33      4.789470e+07
34      3.726593e+06
35      1.595275e+07
36      2.161582e+06
37      1.772735e+06
38      2.590871e+08
39      1.120258e+06
40      4.277205e+07
41      1.942832e+07
42      1.717237e+07
43      7.631299e+06
44      1.210433e+07
45      1.320029e+07
46      3.480124e+07
47      3.041855e+07
48      1.264765e+08
N.A.    3.333179e+04
Name: Pop, dtype: float64
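When more than one statistic per group is needed, `agg` with named aggregations computes them in one pass. This is a sketch on made-up data; the output column names `countries`, `mean_pop` and `total_pop` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Med.Age": ["19", "19", "28"], "Pop": [10, 30, 50]},
    index=["A", "B", "C"],
)

# Named aggregation: output column = (input column, aggregation function).
summary = df.groupby("Med.Age").agg(
    countries=("Pop", "count"),
    mean_pop=("Pop", "mean"),
    total_pop=("Pop", "sum"),
)
print(summary)
```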

### Exporting data

= saving data

Pandas provides the to_... methods for this

Among other things, these methods can be used to convert data from one format to another (e.g. CSV -> XML)

In [107]:
data.to_csv("Data/PopulationDataFastFertig.csv")
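A round trip through CSV can be checked without touching the file system, since `to_csv` returns the text when no path is given. The sketch below uses a made-up two-row frame and also calls `to_json` as one example of another writer.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {"Pop": [10, 20]},
    index=pd.Index(["A", "B"], name="Country"),
)

# Without a path, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv()

# index_col=0 restores the first CSV column as the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Other writers follow the same naming pattern, for example to_json.
json_text = df.to_json(orient="index")

print(csv_text)
print(json_text)
```

Because the index is written as the first column, reading the saved file back needs `index_col=0` (or the column name) to get Country as the index again.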