# Pandas

The main tool for data analysis in Python

Features:
- Sorting
- Filtering
- Comparing
- Summing
- Computing means
- Adjusting data / fixing errors
- ...

Two parts:
- Introduction to Pandas
- Data cleaning with Pandas

In [1]:
import numpy as np
import pandas as pd

## Series

A Series is a one-dimensional, labelled array built on top of a NumPy array, with additional functionality

Every column of the DataFrame introduced below is a Series, and every row can be retrieved as one

In [3]:
pd.Series([0.38, 83.7, 8.6, 9])  # pd.Series: create a Series object from the given numbers

0     0.38
1    83.70
2     8.60
3     9.00
dtype: float64

In [4]:
einwohnerM = pd.Series([0.38, 83.7, 8.6, 9]) 

In [6]:
einwohnerM  # Not meaningful yet (cities? countries? both?)

0     0.38
1    83.70
2     8.60
3     9.00
dtype: float64

### Labels

A Series can be given a name and an index to label its values

In [8]:
einwohnerM.name = "Einwohnerzahlen Europa"

In [10]:
einwohnerM  # Better

0     0.38
1    83.70
2     8.60
3     9.00
Name: Einwohnerzahlen Europa, dtype: float64

In [11]:
einwohnerM.index = ["LI", "DE", "CH", "AT"]

In [12]:
einwohnerM

LI     0.38
DE    83.70
CH     8.60
AT     9.00
Name: Einwohnerzahlen Europa, dtype: float64
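
The name and the index can also be passed when the Series is created. The following is a minimal sketch with the same values as above; the variable name `einwohner` is chosen only for this illustration.

```python
import pandas as pd

# Same values as above, labelled at construction time
einwohner = pd.Series(
    [0.38, 83.7, 8.6, 9],
    index=["LI", "DE", "CH", "AT"],
    name="Einwohnerzahlen Europa",
)

print(einwohner["DE"])  # lookup by label
print(einwohner.name)
```

Setting the labels in the constructor avoids an intermediate, unlabelled Series.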

### Vectorization

Every Series is backed by an ndarray (NumPy array), so operations are applied to all elements at once (vectorization)

In [13]:
einwohnerM > 50

LI    False
DE     True
CH    False
AT    False
Name: Einwohnerzahlen Europa, dtype: bool

In [14]:
einwohnerM[einwohnerM > 50]

DE    83.7
Name: Einwohnerzahlen Europa, dtype: float64

In [17]:
# All countries with a below-average population
einwohnerM.mean()

25.419999999999998

In [18]:
einwohnerM[einwohnerM < einwohnerM.mean()]

LI    0.38
CH    8.60
AT    9.00
Name: Einwohnerzahlen Europa, dtype: float64
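
Boolean masks like the ones above can be combined. The sketch below assumes the same four values and uses illustrative variable names (`medium`, `not_large`) that do not appear in the notebook.

```python
import pandas as pd

einwohner = pd.Series([0.38, 83.7, 8.6, 9], index=["LI", "DE", "CH", "AT"])

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than < and >.
medium = einwohner[(einwohner > 1) & (einwohner < 50)]
print(medium)

# ~ negates a mask
not_large = einwohner[~(einwohner > 50)]
print(not_large)
```

Python's `and`/`or` do not work here because they expect a single truth value, not an element-wise result.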

In [21]:
einwohnerM.iloc[0]  # Positional access; einwohnerM[0] on a labelled index is deprecated in recent pandas

0.38

In [22]:
einwohnerM["AT"]

9.0

In [26]:
einwohnerM["CH":"AT"]  # IMPORTANT: label slices include the upper bound

CH    8.6
AT    9.0
Name: Einwohnerzahlen Europa, dtype: float64

In [27]:
einwohnerM[2:3]

CH    8.6
Name: Einwohnerzahlen Europa, dtype: float64
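
The difference between the two slice styles is easier to see when the access mode is spelled out with `loc` (labels) and `iloc` (positions). A minimal sketch, assuming the same four values; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([0.38, 83.7, 8.6, 9], index=["LI", "DE", "CH", "AT"])

by_label = s.loc["CH":"AT"]   # label slice: both ends are included
by_position = s.iloc[2:3]     # positional slice: the end is excluded
first = s.iloc[0]             # single value by position

print(by_label)
print(by_position)
print(first)
```

Using `loc` and `iloc` explicitly removes any ambiguity about whether a key is meant as a label or as a position.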

In [30]:
einwohnerM.sort_values(ascending=False)

DE    83.70
AT     9.00
CH     8.60
LI     0.38
Name: Einwohnerzahlen Europa, dtype: float64

In [32]:
einwohnerM.sort_index()["AT":"DE"]

AT     9.0
CH     8.6
DE    83.7
Name: Einwohnerzahlen Europa, dtype: float64

In [33]:
einwohnerM.values  # The underlying NumPy array

array([ 0.38, 83.7 ,  8.6 ,  9.  ])

In [34]:
type(einwohnerM.values)

numpy.ndarray

## DataFrame

Effectively a table

Every row/column is a Series

DataFrames are usually created with the read... methods

In [35]:
data = pd.DataFrame({ "Spalte 1": [1, 2, 3], "Spalte 2": [4, 5, 6], "Spalte 3": [7, 8, 9] })

In [37]:
data.index = ["Z1", "Z2", "Z3"]

In [38]:
data

,Spalte 1,Spalte 2,Spalte 3
Z1,1,4,7
Z2,2,5,8
Z3,3,6,9
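
That rows and columns are Series can be checked directly. The sketch below rebuilds the small table from above; `column` and `row` are names introduced only for this example.

```python
import pandas as pd

data = pd.DataFrame(
    {"Spalte 1": [1, 2, 3], "Spalte 2": [4, 5, 6], "Spalte 3": [7, 8, 9]},
    index=["Z1", "Z2", "Z3"],
)

column = data["Spalte 2"]  # a column: Series indexed by the row labels
row = data.loc["Z2"]       # a row: Series indexed by the column names

print(type(column).__name__, list(column))
print(type(row).__name__, list(row))
```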


### The read methods

The read methods are a family of functions that can read a wide range of file formats

To load PopulationData.csv, we need the read_csv function

In [41]:
data = pd.read_csv("Data/PopulationData.csv", delimiter=";")

In [42]:
data

,#,Country (or dependency),Population(2020),YearlyChange,NetChange,Density(P/Km²),Land Area(Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
0,1,China,1439323776,0.39 %,5540090,153,9388211,-348399,1.7,38,61 %,18.47 %
1,2,India,1380004385,0.99 %,13586631,464,2973190,-532687,2.2,28,35 %,17.70 %
2,3,United States,331002651,0.59 %,1937734,36,9147420,954806,1.8,38,83 %,4.25 %
3,4,Indonesia,273523615,1.07 %,2898047,151,1811570,-98955,2.3,30,56 %,3.51 %
4,5,Pakistan,220892340,2.00 %,4327022,287,770880,-233379,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...,...,...
230,231,Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
231,232,Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
232,233,Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
233,234,Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


### Problems
- Thousands separators and decimal marks
- Set the country as the index
- Remove the # column
- Rename columns
- Remove strings from numeric columns (%, N.A., ...)
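
The last item in the list (strings inside numeric columns) could be approached as sketched below. This is one possible approach under stated assumptions, not necessarily the method used later in the course: the CSV text is a hypothetical three-row miniature of the file, and treating "N.A." as missing via `na_values` is an assumption about the desired result.

```python
import io
import pandas as pd

# Hypothetical miniature of the file, not the real PopulationData.csv
csv_text = """Country;YearlyChange;Med.Age;UrbanPop %
China;0.39 %;38;61 %
Niue;0.68 %;N.A.;46 %
Monaco;0.71 %;N.A.;N.A.
"""

# na_values turns the "N.A." placeholder into NaN while reading
df = pd.read_csv(io.StringIO(csv_text), delimiter=";", na_values=["N.A."])

for col in ["YearlyChange", "UrbanPop %"]:
    # Strip the trailing " %" and convert the remaining text to numbers
    df[col] = pd.to_numeric(df[col].str.rstrip(" %"))

print(df.dtypes)
```

Missing values stay NaN throughout, so the affected columns end up as floats rather than integers.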

In [47]:
data.info()  # List the data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   #                        235 non-null    int64 
 1   Country (or dependency)  235 non-null    object
 2   Population(2020)         235 non-null    object
 3   YearlyChange             235 non-null    object
 4   NetChange                235 non-null    object
 5   Density(P/Km²)           235 non-null    object
 6   Land Area(Km²)           235 non-null    object
 7   Migrants(net)            201 non-null    object
 8   Fert.Rate                235 non-null    object
 9   Med.Age                  235 non-null    object
 10  UrbanPop %               235 non-null    object
 11  WorldShare               235 non-null    object
dtypes: int64(1), object(11)
memory usage: 22.2+ KB


#### Problem 1: Thousands separators and decimal marks

In [48]:
data = pd.read_csv("Data/PopulationData.csv", delimiter=";", thousands=",", decimal=".")

In [49]:
data

,#,Country (or dependency),Population(2020),YearlyChange,NetChange,Density(P/Km²),Land Area(Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
0,1,China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
1,2,India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
2,3,United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
3,4,Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
4,5,Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...,...,...
230,231,Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
231,232,Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
232,233,Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
233,234,Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


In [51]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   #                        235 non-null    int64  
 1   Country (or dependency)  235 non-null    object 
 2   Population(2020)         235 non-null    int64  
 3   YearlyChange             235 non-null    object 
 4   NetChange                235 non-null    int64  
 5   Density(P/Km²)           235 non-null    int64  
 6   Land Area(Km²)           235 non-null    int64  
 7   Migrants(net)            201 non-null    float64
 8   Fert.Rate                235 non-null    object 
 9   Med.Age                  235 non-null    object 
 10  UrbanPop %               235 non-null    object 
 11  WorldShare               235 non-null    object 
dtypes: float64(1), int64(5), object(6)
memory usage: 22.2+ KB
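
The effect of `thousands` can be reproduced without the file. The sketch below assumes that the raw file writes large numbers with comma separators (which is what the `object` dtype before the fix suggests); the two-row CSV text is a hypothetical stand-in.

```python
import io
import pandas as pd

# Hypothetical stand-in for the raw file, with comma thousands separators
csv_text = """Country;Population;Fert.Rate
China;1,439,323,776;1.7
India;1,380,004,385;2.2
"""

raw = pd.read_csv(io.StringIO(csv_text), delimiter=";")
parsed = pd.read_csv(
    io.StringIO(csv_text), delimiter=";", thousands=",", decimal="."
)

print(raw["Population"].iloc[0])     # still the text "1,439,323,776"
print(parsed["Population"].iloc[0])  # now an integer
print(parsed.dtypes)
```

Without `thousands`, pandas cannot interpret the values as numbers and keeps the column as text.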


#### Problem 2: Setting the index column

In [57]:
data = pd.read_csv("Data/PopulationData.csv", delimiter=";", thousands=",", decimal=".", index_col="Country (or dependency)")

In [58]:
data

Country (or dependency),#,Population(2020),YearlyChange,NetChange,Density(P/Km²),Land Area(Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
China,1,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,2,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,3,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,4,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,5,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...,...
Montserrat,231,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
Falkland Islands,232,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
Niue,233,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
Tokelau,234,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


Setting the index after loading:

In [55]:
data.set_index("#")

#,Population(2020),YearlyChange,NetChange,Density(P/Km²),Land Area(Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
1,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
2,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
3,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
4,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
5,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...
231,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
232,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
233,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
234,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


#### Problem 3: Removing the # column

In [62]:
data.drop(columns=["#"])  # columns= is required here; by default drop looks for row labels

Country (or dependency),Population(2020),YearlyChange,NetChange,Density(P/Km²),Land Area(Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...
Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


In [64]:
data  # drop returned a copy; the original is unchanged

Country (or dependency),#,Population(2020),YearlyChange,NetChange,Density(P/Km²),Land Area(Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
China,1,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,2,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,3,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,4,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,5,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...,...
Montserrat,231,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
Falkland Islands,232,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
Niue,233,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
Tokelau,234,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


Two options to make the change permanent:
- Assignment (data = data.drop(columns=["#"]))
- inplace=True
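
Both options can be compared on a tiny frame. This is a minimal sketch; the two-row DataFrame and the variable names (`without_rank`, `reassigned`, `in_place`, `return_value`) are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"#": [1, 2], "Pop": [1439323776, 1380004385]},
    index=["China", "India"],
)

# drop returns a new DataFrame and leaves df untouched
without_rank = df.drop(columns=["#"])
print(list(df.columns))            # still contains "#"
print(list(without_rank.columns))

# Option 1: re-assign the result to the same name
reassigned = df.copy()
reassigned = reassigned.drop(columns=["#"])

# Option 2: modify in place; the call itself then returns None
in_place = df.copy()
return_value = in_place.drop(columns=["#"], inplace=True)
print(return_value)
```

Because the in-place call returns `None`, writing `data = data.drop(..., inplace=True)` would overwrite `data` with `None`.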

In [65]:
data.drop(columns=["#"], inplace=True)

In [66]:
data

Country (or dependency),Population(2020),YearlyChange,NetChange,Density(P/Km²),Land Area(Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...
Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


#### Problem 4: Renaming columns

The rename method takes a dictionary of the form "old name": "new name", passed via the columns parameter
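
One detail worth knowing: by default, rename silently skips keys that do not match any column, so a typo in an old name goes unnoticed. The sketch below illustrates this on a one-row frame; the misspelled key "Urban Pop" and the variable names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"Population(2020)": [1439323776], "UrbanPop %": ["61 %"]})

# "Urban Pop" (with a space) does not exist, so that entry is silently skipped
renamed = df.rename(columns={"Population(2020)": "Pop", "Urban Pop": "UrbanPop"})
print(list(renamed.columns))

# errors="raise" turns the silent skip into a KeyError
try:
    df.rename(columns={"Urban Pop": "UrbanPop"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```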

In [70]:
data.rename(columns={
    "Population(2020)": "Pop",
    "Density(P/Km²)": "Density",
    "Land Area(Km²)": "LandArea",
    "Migrants(net)": "Migrants",
    "UrbanPop %": "UrbanPop"
}, inplace=True)

In [71]:
data

Country (or dependency),Pop,YearlyChange,NetChange,Density,LandArea,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...
Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


In [72]:
data = pd.read_csv("Data/PopulationData.csv", delimiter=";", thousands=",", decimal=".")

In [74]:
data.drop(columns=["#"], inplace=True)

In [77]:
data.rename(columns={
    "Country (or dependency)": "Country",
    "Population(2020)": "Pop",
    "Density(P/Km²)": "Density",
    "Land Area(Km²)": "LandArea",
    "Migrants(net)": "Migrants",
    "UrbanPop %": "UrbanPop"
}, inplace=True)

In [79]:
data.set_index("Country", inplace=True)

In [80]:
data

Country,Pop,YearlyChange,NetChange,Density,LandArea,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...
Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


### Data analysis

Drawing insights from the data

Two ways to access the data:
- By column
- By row

#### Accessing columns

In [81]:
data["Pop"]

Country
China               1439323776
India               1380004385
United States        331002651
Indonesia            273523615
Pakistan             220892340
                       ...    
Montserrat                4992
Falkland Islands          3480
Niue                      1626
Tokelau                   1357
Vatican State              801
Name: Pop, Length: 235, dtype: int64

In [82]:
data["Pop"].mean()

33171202.680851065

In [89]:
data["Pop"] > data["Pop"].mean()

Country
China                True
India                True
United States        True
Indonesia            True
Pakistan             True
                    ...  
Montserrat          False
Falkland Islands    False
Niue                False
Tokelau             False
Vatican State       False
Name: Pop, Length: 235, dtype: bool

In [88]:
# Exercise: Which countries have an above-average population?
data[data["Pop"] > data["Pop"].mean()]

Country,Pop,YearlyChange,NetChange,Density,LandArea,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
Brazil,212559417,0.72 %,1509890,25,8358140,21200.0,1.7,33,88 %,2.73 %
Nigeria,206139589,2.58 %,5175990,226,910770,-60000.0,5.4,18,52 %,2.64 %
Bangladesh,164689383,1.01 %,1643222,1265,130170,-369501.0,2.1,28,39 %,2.11 %
Russia,145934462,0.04 %,62206,9,16376870,182456.0,1.8,40,74 %,1.87 %
Mexico,128932753,1.06 %,1357224,66,1943950,-60000.0,2.1,29,84 %,1.65 %


#### Accessing rows

Rows are selected with the loc indexer

IMPORTANT: loc takes square brackets, not parentheses

In [92]:
data.loc["China"]

Pop             1439323776
YearlyChange        0.39 %
NetChange          5540090
Density                153
LandArea           9388211
Migrants         -348399.0
Fert.Rate              1.7
Med.Age                 38
UrbanPop              61 %
WorldShare         18.47 %
Name: China, dtype: object

In [98]:
data.loc["China":"Indonesia"]  # Select several rows (the end label is included)

Country,Pop,YearlyChange,NetChange,Density,LandArea,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %


In [105]:
data.loc["China":"Indonesia", "Pop":"Density"]

Country,Pop,YearlyChange,NetChange,Density
China,1439323776,0.39 %,5540090,153
India,1380004385,0.99 %,13586631,464
United States,331002651,0.59 %,1937734,36
Indonesia,273523615,1.07 %,2898047,151


iloc (integer location) selects by numeric position instead of by label
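
The two indexers can be put side by side. The sketch below uses three rows with values taken from the table above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"Pop": [1439323776, 1380004385, 331002651], "Density": [153, 464, 36]},
    index=["China", "India", "United States"],
)

by_label = df.loc["China":"India", "Pop"]  # labels: end label included
by_position = df.iloc[0:2, 0]              # positions: end position excluded
single = df.loc["India", "Density"]        # one cell: row label, column label

print(by_label.equals(by_position))
print(single)
```

Both slices select the same two rows, even though the `iloc` slice ends at 2 and the `loc` slice ends at "India".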

In [110]:
data.iloc[0:10]

Country,Pop,YearlyChange,NetChange,Density,LandArea,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
India,1380004385,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Indonesia,273523615,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
Pakistan,220892340,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
Brazil,212559417,0.72 %,1509890,25,8358140,21200.0,1.7,33,88 %,2.73 %
Nigeria,206139589,2.58 %,5175990,226,910770,-60000.0,5.4,18,52 %,2.64 %
Bangladesh,164689383,1.01 %,1643222,1265,130170,-369501.0,2.1,28,39 %,2.11 %
Russia,145934462,0.04 %,62206,9,16376870,182456.0,1.8,40,74 %,1.87 %
Mexico,128932753,1.06 %,1357224,66,1943950,-60000.0,2.1,29,84 %,1.65 %


In [112]:
# Exercise: What is the average population of the top 10 nations?
data.iloc[:10]["Pop"].mean()

450300237.1

In [113]:
# Exercise: What was the population of every nation one year earlier?

In [114]:
data["Pop"] - data["NetChange"]  # Current population minus the yearly net change

Country
China               1433783686
India               1366417754
United States        329064917
Indonesia            270625568
Pakistan             216565318
                       ...    
Montserrat                4989
Falkland Islands          3377
Niue                      1615
Tokelau                   1340
Vatican State              799
Length: 235, dtype: int64

#### Sorting data

Sort by a column: sort_values(column name)

Sort by index: sort_index()

Both methods sort in ascending or descending order via ascending=True/False
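
When sorting by several columns, `ascending` may also be a list with one entry per column. A minimal sketch on four rows whose values are taken from the dataset; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"LandArea": [10, 10, 1, 20], "Pop": [1357, 33691, 39242, 10824]},
    index=["Tokelau", "Gibraltar", "Monaco", "Nauru"],
)

# Smallest land area first; among equal areas, largest population first
by_area_then_pop = df.sort_values(["LandArea", "Pop"], ascending=[True, False])
print(by_area_then_pop)

# Alphabetical order of the row labels
by_name = df.sort_index()
print(by_name)
```

Both calls return sorted copies; `df` itself keeps its original order.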

In [122]:
data.sort_values("LandArea", ascending=False)

Country,Pop,YearlyChange,NetChange,Density,LandArea,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Russia,145934462,0.04 %,62206,9,16376870,182456.0,1.8,40,74 %,1.87 %
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
Canada,37742154,0.89 %,331107,4,9093510,242032.0,1.5,41,81 %,0.48 %
Brazil,212559417,0.72 %,1509890,25,8358140,21200.0,1.7,33,88 %,2.73 %
...,...,...,...,...,...,...,...,...,...,...
Nauru,10824,0.63 %,68,541,20,,N.A.,N.A.,N.A.,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %
Gibraltar,33691,-0.03 %,-10,3369,10,,N.A.,N.A.,N.A.,0.00 %
Monaco,39242,0.71 %,278,26337,1,,N.A.,N.A.,N.A.,0.00 %


Sorting by several columns (ties in the first column are broken by the next one)

In [124]:
data.sort_values(["LandArea", "Pop"], ascending=True)

Country,Pop,YearlyChange,NetChange,Density,LandArea,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Vatican State,801,0.25 %,2,2003,0,,N.A.,N.A.,N.A.,0.00 %
Monaco,39242,0.71 %,278,26337,1,,N.A.,N.A.,N.A.,0.00 %
Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %
Gibraltar,33691,-0.03 %,-10,3369,10,,N.A.,N.A.,N.A.,0.00 %
Nauru,10824,0.63 %,68,541,20,,N.A.,N.A.,N.A.,0.00 %
...,...,...,...,...,...,...,...,...,...,...
Brazil,212559417,0.72 %,1509890,25,8358140,21200.0,1.7,33,88 %,2.73 %
Canada,37742154,0.89 %,331107,4,9093510,242032.0,1.5,41,81 %,0.48 %
United States,331002651,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
China,1439323776,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %


In [147]:
data.sort_index()

Country,Pop,YearlyChange,NetChange,Density,LandArea,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Afghanistan,38928346,2.33 %,886592,60,652860,-62920.0,4.6,18,25 %,0.50 %
Albania,2877797,-0.11 %,-3120,105,27400,-14000.0,1.6,36,63 %,0.04 %
Algeria,43851044,1.85 %,797990,18,2381740,-10000.0,3.1,29,73 %,0.56 %
American Samoa,55191,-0.22 %,-121,276,200,,N.A.,N.A.,88 %,0.00 %
Andorra,77265,0.16 %,123,164,470,,N.A.,N.A.,88 %,0.00 %
...,...,...,...,...,...,...,...,...,...,...
Wallis & Futuna,11239,-1.69 %,-193,80,140,,N.A.,N.A.,0 %,0.00 %
Western Sahara,597339,2.55 %,14876,2,266000,5582.0,2.4,28,87 %,0.01 %
Yemen,29825964,2.28 %,664042,56,527970,-30000.0,3.8,20,38 %,0.38 %
Zambia,18383955,2.93 %,522925,25,743390,-8000.0,4.7,18,45 %,0.24 %


#### Grouping

Build groups from the values of one column and assign every record to its matching group
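
The idea can be sketched on five rows before applying it to the full dataset. The values are taken from the table; Med.Age is deliberately kept as text (including "N.A.") to mirror the current state of the data, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Med.Age": ["20", "20", "38", "38", "N.A."],
        "Pop": [53771296, 43849260, 1439323776, 331002651, 4992],
    },
    index=["Kenya", "Sudan", "China", "United States", "Montserrat"],
)

groups = df.groupby("Med.Age")

sizes = groups.size()             # number of records per group
means = groups["Pop"].mean()      # mean population per group
age_20 = groups.get_group("20")   # all records of one group

print(sizes)
print(means)
print(age_20)
```

`size()` counts rows per group directly, whereas `count()` counts non-missing values per column and therefore needs a column to be picked afterwards.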

In [151]:
data.groupby("Med.Age")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016C21399250>

In [154]:
data.groupby("Med.Age").get_group("20")  # "20" is a string because Med.Age is still an object column

Country,Pop,YearlyChange,NetChange,Density,LandArea,Migrants,Fert.Rate,Med.Age,UrbanPop,WorldShare
Kenya,53771296,2.28 %,1197323,94,569140,-10000.0,3.5,20,28 %,0.69 %
Sudan,43849260,2.42 %,1036022,25,1765048,-50000.0,4.4,20,35 %,0.56 %
Yemen,29825964,2.28 %,664042,56,527970,-30000.0,3.8,20,38 %,0.38 %
Madagascar,27691018,2.68 %,721711,48,581795,-1500.0,4.1,20,39 %,0.36 %
Rwanda,12952218,2.58 %,325268,525,24670,-9000.0,4.1,20,18 %,0.17 %
Mauritania,4649658,2.74 %,123962,5,1030700,5000.0,4.6,20,57 %,0.06 %
Comoros,869601,2.20 %,18715,467,1861,-2000.0,4.2,20,29 %,0.01 %
Solomon Islands,686884,2.55 %,17061,25,27990,-1600.0,4.4,20,23 %,0.01 %
Mayotte,272815,2.50 %,6665,728,375,0.0,3.7,20,46 %,0.00 %


In [155]:
# Exercise: How many records are there in each group?

In [157]:
data.groupby("Med.Age").count()["Pop"]

Med.Age
15       1
16       1
17       6
18      10
19      14
20       9
21       5
22       7
23       4
24       6
25       4
26       7
27       4
28      12
29       5
30       8
31       6
32      11
33       5
34       6
35       3
36       4
37       4
38       7
39       3
40       7
41       5
42      10
43      11
44       5
45       5
46       3
47       2
48       1
N.A.    34
Name: Pop, dtype: int64

In [158]:
data.groupby("Med.Age")["Pop"].mean()

Med.Age
15      2.420664e+07
16      2.025083e+07
17      3.539643e+07
18      4.148540e+07
19      1.824118e+07
20      1.939652e+07
21      9.621932e+06
22      7.686515e+06
23      6.028827e+07
24      7.213284e+06
25      3.304188e+07
26      2.510786e+07
27      2.088666e+06
28      1.398790e+08
29      4.693030e+07
30      4.894655e+07
31      1.847800e+07
32      3.284957e+07
33      4.789470e+07
34      3.726593e+06
35      1.595275e+07
36      2.161582e+06
37      1.772735e+06
38      2.590871e+08
39      1.120258e+06
40      4.277205e+07
41      1.942832e+07
42      1.717237e+07
43      7.631299e+06
44      1.210433e+07
45      1.320029e+07
46      3.480124e+07
47      3.041855e+07
48      1.264765e+08
N.A.    3.333179e+04
Name: Pop, dtype: float64

### Exporting data

In [159]:
data.to_csv("Data/PopulationDataFastFertig.csv")
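
A quick way to check an export is to read it back. The sketch below writes a two-row frame to a temporary directory; the file name `export.csv` and the variable names are hypothetical, and the default comma separator is used as in the call above.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"Pop": [1439323776, 1380004385]},
    index=pd.Index(["China", "India"], name="Country"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "export.csv")  # hypothetical file name
    df.to_csv(path)                         # the index is written as the first column
    restored = pd.read_csv(path, index_col="Country")

print(restored)
```

Because to_csv writes the index as an ordinary column, `index_col` is needed when reading the file back to restore the countries as the index.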