## Pandas

Advanced Python Data Analysis Toolkit

The main tool for processing data

In [1]:
import numpy as np
import pandas as pd

### Series

A list of values

In addition to the array, a Series carries further properties

e.g. a name, a labelled index, a data type, ...

Underneath a Series there is typically a NumPy array
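The properties listed above can be inspected directly. A minimal sketch with made-up values (the labels and the name "demo" are assumptions for illustration, not part of the dataset used later):

```python
import numpy as np
import pandas as pd

# Hypothetical example values, not taken from the notebook's dataset
s = pd.Series([1.5, 2.5, 4.0], index=["a", "b", "c"], name="demo")

series_name = s.name          # the name attached to the Series
index_labels = list(s.index)  # the labelled index
series_dtype = s.dtype        # the data type of the values
raw = s.to_numpy()            # the values as a plain NumPy array
```

to_numpy() is used here to get at the array; the notebook uses the older .values attribute further down for the same purpose.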

In [4]:
pd.Series([9, 83.7, 8.6, 0.38])

0     9.00
1    83.70
2     8.60
3     0.38
dtype: float64

These values are not meaningful on their own

In [5]:
einwohnerM = pd.Series([9, 83.7, 8.6, 0.38])

In [6]:
einwohnerM.name = "Einwohnerzahlen Europa"

In [7]:
einwohnerM

0     9.00
1    83.70
2     8.60
3     0.38
Name: Einwohnerzahlen Europa, dtype: float64

In [8]:
einwohnerM.index = ["CH", "DE", "AT", "LI"]

In [9]:
einwohnerM

CH     9.00
DE    83.70
AT     8.60
LI     0.38
Name: Einwohnerzahlen Europa, dtype: float64

This Series can now be processed NumPy-style

In [10]:
einwohnerM.iloc[0]  # Positional access; plain einwohnerM[0] is deprecated on a label index

np.float64(9.0)

In [11]:
einwohnerM["AT"]

np.float64(8.6)

In [13]:
einwohnerM["CH":"AT"]  # With a label range, the upper bound is included

CH     9.0
DE    83.7
AT     8.6
Name: Einwohnerzahlen Europa, dtype: float64
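The difference between label slices and positional slices is easy to miss. A small sketch, reusing the values from above under the shorter, hypothetical name s:

```python
import pandas as pd

s = pd.Series([9, 83.7, 8.6, 0.38], index=["CH", "DE", "AT", "LI"])

by_label = s.loc["CH":"AT"]   # label slice: "AT" is included
by_position = s.iloc[0:2]     # positional slice: position 2 is excluded
```

Writing loc and iloc explicitly makes it clear which of the two rules applies.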

In [15]:
# Exercise: Which countries have an above-average population?
einwohnerM > einwohnerM.mean()

CH    False
DE     True
AT    False
LI    False
Name: Einwohnerzahlen Europa, dtype: bool

In [16]:
einwohnerM[einwohnerM > einwohnerM.mean()]

DE    83.7
Name: Einwohnerzahlen Europa, dtype: float64

### DataFrame

A two-dimensional collection of Series (= a table)

In [18]:
data = pd.DataFrame({ "Spalte1": [1, 2, 3], "Spalte2": [4, 5, 6], "Spalte3": [7, 8, 9] })

In [19]:
data

|  | Spalte1 | Spalte2 | Spalte3 |
|---|---|---|---|
| 0 | 1 | 4 | 7 |
| 1 | 2 | 5 | 8 |
| 2 | 3 | 6 | 9 |


#### Reading external data

Data is usually read in from external sources

pandas provides the read_... functions for this

In [30]:
data = pd.read_csv("Data/PopulationData.csv", delimiter=";", thousands=",", decimal=".")
data

|  | # | Country (or dependency) | Population(2020) | YearlyChange | NetChange | Density(P/Km²) | Land Area(Km²) | Migrants(net) | Fert.Rate | Med.Age | UrbanPop % | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | China | 1439323776 | 0.39 % | 5540090 | 153 | 9388211 | -348399.0 | 1.7 | 38 | 61 % | 18.47 % |
| 1 | 2 | India | 1380004385 | 0.99 % | 13586631 | 464 | 2973190 | -532687.0 | 2.2 | 28 | 35 % | 17.70 % |
| 2 | 3 | United States | 331002651 | 0.59 % | 1937734 | 36 | 9147420 | 954806.0 | 1.8 | 38 | 83 % | 4.25 % |
| 3 | 4 | Indonesia | 273523615 | 1.07 % | 2898047 | 151 | 1811570 | -98955.0 | 2.3 | 30 | 56 % | 3.51 % |
| 4 | 5 | Pakistan | 220892340 | 2.00 % | 4327022 | 287 | 770880 | -233379.0 | 3.6 | 23 | 35 % | 2.83 % |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 230 | 231 | Montserrat | 4992 | 0.06 % | 3 | 50 | 100 | NaN | N.A. | N.A. | 10 % | 0.00 % |
| 231 | 232 | Falkland Islands | 3480 | 3.05 % | 103 | 0 | 12170 | NaN | N.A. | N.A. | 66 % | 0.00 % |
| 232 | 233 | Niue | 1626 | 0.68 % | 11 | 6 | 260 | NaN | N.A. | N.A. | 46 % | 0.00 % |
| 233 | 234 | Tokelau | 1357 | 1.27 % | 17 | 136 | 10 | NaN | N.A. | N.A. | 0 % | 0.00 % |
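The effect of the parsing parameters can be tried out without the file. A minimal sketch that reads a made-up inline sample (the country names and numbers are hypothetical). The na_values parameter is an addition not used in the notebook: it tells pandas to treat "N.A." as missing at read time.

```python
import io
import pandas as pd

# Hypothetical inline sample that mimics the delimiter and number format used above
raw_csv = io.StringIO(
    "Country;Population;FertRate\n"
    "Aland;1,250,000;1.7\n"
    "Borduria;4,992;N.A.\n"
)

# thousands="," removes the grouping commas; na_values marks "N.A." as missing
sample = pd.read_csv(raw_csv, delimiter=";", thousands=",", na_values=["N.A."])
```

With na_values set, FertRate is parsed as a float column with one missing value instead of a column of strings.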


The info() method gives a quick overview of the dataset

In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   #                        235 non-null    int64  
 1   Country (or dependency)  235 non-null    object 
 2   Population(2020)         235 non-null    int64  
 3   YearlyChange             235 non-null    object 
 4   NetChange                235 non-null    int64  
 5   Density(P/Km²)           235 non-null    int64  
 6   Land Area(Km²)           235 non-null    int64  
 7   Migrants(net)            201 non-null    float64
 8   Fert.Rate                235 non-null    object 
 9   Med.Age                  235 non-null    object 
 10  UrbanPop %               235 non-null    object 
 11  WorldShare               235 non-null    object 
dtypes: float64(1), int64(5), object(6)
memory usage: 22.2+ KB


#### Problems with the dataset

- The DataFrame itself
    - Two index columns
    - Poor column names
- The data
    - Numeric columns are stored as strings
    - % signs in the values
    - NaN, N.A., ...
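The data problems can be addressed with a few string and conversion calls. A sketch of one possible approach on a made-up miniature of the affected columns (not necessarily how the notebook continues):

```python
import pandas as pd

# Hypothetical miniature of the problematic columns
raw = pd.DataFrame({
    "UrbanPop %": ["61 %", "35 %", "N.A."],
    "Fert.Rate": ["1.7", "2.2", "N.A."],
})

clean = pd.DataFrame({
    # strip the trailing " %" and coerce non-numeric markers such as "N.A." to NaN
    "UrbanPopPct": pd.to_numeric(raw["UrbanPop %"].str.rstrip(" %"), errors="coerce"),
    "FertRate": pd.to_numeric(raw["Fert.Rate"], errors="coerce"),
})
```

errors="coerce" turns every value that cannot be parsed into NaN, so it should only be used when those values really are placeholders for missing data.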

#### Setting the index

Here the country itself can serve as the index

In [37]:
data.set_index("Country (or dependency)", inplace=True)

By default, pandas methods that modify a DataFrame return a modified copy and leave the original unchanged
Two options to keep the change:
- Variable assignment (data = data.set_index(...))
- The inplace parameter
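Both options can be compared on a small frame. A minimal sketch with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["CH", "AT"], "Pop": [9.0, 8.6]})

returned = df.set_index("Country")    # a new DataFrame; df itself is unchanged
unchanged_columns = list(df.columns)  # still contains "Country"

df = df.set_index("Country")          # option 1: reassign the result

other = pd.DataFrame({"Country": ["CH", "AT"], "Pop": [9.0, 8.6]})
other.set_index("Country", inplace=True)  # option 2: modify the object itself
```

Reassignment is generally the safer habit, because the result can also be chained or stored under a new name.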

In [38]:
data

| Country (or dependency) | # | Population(2020) | YearlyChange | NetChange | Density(P/Km²) | Land Area(Km²) | Migrants(net) | Fert.Rate | Med.Age | UrbanPop % | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|---|
| China | 1 | 1439323776 | 0.39 % | 5540090 | 153 | 9388211 | -348399.0 | 1.7 | 38 | 61 % | 18.47 % |
| India | 2 | 1380004385 | 0.99 % | 13586631 | 464 | 2973190 | -532687.0 | 2.2 | 28 | 35 % | 17.70 % |
| United States | 3 | 331002651 | 0.59 % | 1937734 | 36 | 9147420 | 954806.0 | 1.8 | 38 | 83 % | 4.25 % |
| Indonesia | 4 | 273523615 | 1.07 % | 2898047 | 151 | 1811570 | -98955.0 | 2.3 | 30 | 56 % | 3.51 % |
| Pakistan | 5 | 220892340 | 2.00 % | 4327022 | 287 | 770880 | -233379.0 | 3.6 | 23 | 35 % | 2.83 % |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Montserrat | 231 | 4992 | 0.06 % | 3 | 50 | 100 | NaN | N.A. | N.A. | 10 % | 0.00 % |
| Falkland Islands | 232 | 3480 | 3.05 % | 103 | 0 | 12170 | NaN | N.A. | N.A. | 66 % | 0.00 % |
| Niue | 233 | 1626 | 0.68 % | 11 | 6 | 260 | NaN | N.A. | N.A. | 46 % | 0.00 % |
| Tokelau | 234 | 1357 | 1.27 % | 17 | 136 | 10 | NaN | N.A. | N.A. | 0 % | 0.00 % |


#### Deleting columns

The drop method can be used for this

drop can also delete rows, which is why the columns parameter is specified here

In [41]:
data.drop(columns=["#"], inplace=True)
data

| Country (or dependency) | Population(2020) | YearlyChange | NetChange | Density(P/Km²) | Land Area(Km²) | Migrants(net) | Fert.Rate | Med.Age | UrbanPop % | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|
| China | 1439323776 | 0.39 % | 5540090 | 153 | 9388211 | -348399.0 | 1.7 | 38 | 61 % | 18.47 % |
| India | 1380004385 | 0.99 % | 13586631 | 464 | 2973190 | -532687.0 | 2.2 | 28 | 35 % | 17.70 % |
| United States | 331002651 | 0.59 % | 1937734 | 36 | 9147420 | 954806.0 | 1.8 | 38 | 83 % | 4.25 % |
| Indonesia | 273523615 | 1.07 % | 2898047 | 151 | 1811570 | -98955.0 | 2.3 | 30 | 56 % | 3.51 % |
| Pakistan | 220892340 | 2.00 % | 4327022 | 287 | 770880 | -233379.0 | 3.6 | 23 | 35 % | 2.83 % |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Montserrat | 4992 | 0.06 % | 3 | 50 | 100 | NaN | N.A. | N.A. | 10 % | 0.00 % |
| Falkland Islands | 3480 | 3.05 % | 103 | 0 | 12170 | NaN | N.A. | N.A. | 66 % | 0.00 % |
| Niue | 1626 | 0.68 % | 11 | 6 | 260 | NaN | N.A. | N.A. | 46 % | 0.00 % |
| Tokelau | 1357 | 1.27 % | 17 | 136 | 10 | NaN | N.A. | N.A. | 0 % | 0.00 % |
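drop distinguishes rows from columns through the keyword that is passed. A small sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"#": [1, 2, 3], "Pop": [100, 200, 300]},
    index=["A", "B", "C"],  # hypothetical row labels
)

without_column = df.drop(columns=["#"])  # removes the "#" column
without_row = df.drop(index=["B"])       # removes the row labelled "B"
```

Neither call uses inplace=True, so df keeps all three rows and both columns.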


#### Renaming columns

The rename method can be used for this

Here, too, columns= must be specified

It takes a dictionary of the form "old column name": "new column name", ...

In [43]:
data.rename(columns={
    "Population(2020)": "Pop",
    "Density(P/Km²)": "Density",
    "Land Area(Km²)": "LandArea",
    "Migrants(net)": "Migrants",
    "Fert.Rate": "FertRate",
    "Med.Age": "MedAge",
    "UrbanPop %": "UrbanPopPct"
}, inplace=True)

In [51]:
data.index.name = "Country"  # Rename the index

In [52]:
data

| Country | Pop | YearlyChange | NetChange | Density | LandArea | Migrants | FertRate | MedAge | UrbanPopPct | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|
| China | 1439323776 | 0.39 % | 5540090 | 153 | 9388211 | -348399.0 | 1.7 | 38 | 61 % | 18.47 % |
| India | 1380004385 | 0.99 % | 13586631 | 464 | 2973190 | -532687.0 | 2.2 | 28 | 35 % | 17.70 % |
| United States | 331002651 | 0.59 % | 1937734 | 36 | 9147420 | 954806.0 | 1.8 | 38 | 83 % | 4.25 % |
| Indonesia | 273523615 | 1.07 % | 2898047 | 151 | 1811570 | -98955.0 | 2.3 | 30 | 56 % | 3.51 % |
| Pakistan | 220892340 | 2.00 % | 4327022 | 287 | 770880 | -233379.0 | 3.6 | 23 | 35 % | 2.83 % |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Montserrat | 4992 | 0.06 % | 3 | 50 | 100 | NaN | N.A. | N.A. | 10 % | 0.00 % |
| Falkland Islands | 3480 | 3.05 % | 103 | 0 | 12170 | NaN | N.A. | N.A. | 66 % | 0.00 % |
| Niue | 1626 | 0.68 % | 11 | 6 | 260 | NaN | N.A. | N.A. | 46 % | 0.00 % |
| Tokelau | 1357 | 1.27 % | 17 | 136 | 10 | NaN | N.A. | N.A. | 0 % | 0.00 % |


### Accessing data

Three options:
- Via the index operator [] (columns by name)
- Via loc[] (rows, and optionally columns, by label)
- Via iloc[] (rows, and optionally columns, by integer position)
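The three access paths side by side, as a minimal sketch on a small hypothetical frame (the Area values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"Pop": [9.0, 83.7, 8.6], "Area": [41.3, 357.6, 83.9]},  # hypothetical values
    index=["CH", "DE", "AT"],
)

column = df["Pop"]             # [] selects a column by name
row_by_label = df.loc["DE"]    # loc selects a row by its label
cell = df.loc["DE", "Area"]    # loc also accepts a column label
row_by_position = df.iloc[1]   # iloc selects a row by integer position
```

Here loc["DE"] and iloc[1] return the same row, because "DE" happens to sit at position 1.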

In [61]:
data["Pop"]  # Access a single column

Country
China               1439323776
India               1380004385
United States        331002651
Indonesia            273523615
Pakistan             220892340
                       ...    
Montserrat                4992
Falkland Islands          3480
Niue                      1626
Tokelau                   1357
Vatican State              801
Name: Pop, Length: 235, dtype: int64

In [62]:
data.loc["China"]  # Access a single row by its label

Pop             1439323776
YearlyChange        0.39 %
NetChange          5540090
Density                153
LandArea           9388211
Migrants         -348399.0
FertRate               1.7
MedAge                  38
UrbanPopPct           61 %
WorldShare         18.47 %
Name: China, dtype: object

In [63]:
data.iloc[0]  # Access a single row by its integer position

Pop             1439323776
YearlyChange        0.39 %
NetChange          5540090
Density                153
LandArea           9388211
Migrants         -348399.0
FertRate               1.7
MedAge                  38
UrbanPopPct           61 %
WorldShare         18.47 %
Name: China, dtype: object

#### Accessing multiple rows

In [76]:
# Exercise: What is the average population of the top 10 nations?
data["Pop"][:10].mean()

np.float64(450300237.1)

In [78]:
# The top 10 nations with all columns
data.iloc[:10]

| Country | Pop | YearlyChange | NetChange | Density | LandArea | Migrants | FertRate | MedAge | UrbanPopPct | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|
| China | 1439323776 | 0.39 % | 5540090 | 153 | 9388211 | -348399.0 | 1.7 | 38 | 61 % | 18.47 % |
| India | 1380004385 | 0.99 % | 13586631 | 464 | 2973190 | -532687.0 | 2.2 | 28 | 35 % | 17.70 % |
| United States | 331002651 | 0.59 % | 1937734 | 36 | 9147420 | 954806.0 | 1.8 | 38 | 83 % | 4.25 % |
| Indonesia | 273523615 | 1.07 % | 2898047 | 151 | 1811570 | -98955.0 | 2.3 | 30 | 56 % | 3.51 % |
| Pakistan | 220892340 | 2.00 % | 4327022 | 287 | 770880 | -233379.0 | 3.6 | 23 | 35 % | 2.83 % |
| Brazil | 212559417 | 0.72 % | 1509890 | 25 | 8358140 | 21200.0 | 1.7 | 33 | 88 % | 2.73 % |
| Nigeria | 206139589 | 2.58 % | 5175990 | 226 | 910770 | -60000.0 | 5.4 | 18 | 52 % | 2.64 % |
| Bangladesh | 164689383 | 1.01 % | 1643222 | 1265 | 130170 | -369501.0 | 2.1 | 28 | 39 % | 2.11 % |
| Russia | 145934462 | 0.04 % | 62206 | 9 | 16376870 | 182456.0 | 1.8 | 40 | 74 % | 1.87 % |
| Mexico | 128932753 | 1.06 % | 1357224 | 66 | 1943950 | -60000.0 | 2.1 | 29 | 84 % | 1.65 % |


In [84]:
data.head(3)

| Country | Pop | YearlyChange | NetChange | Density | LandArea | Migrants | FertRate | MedAge | UrbanPopPct | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|
| China | 1439323776 | 0.39 % | 5540090 | 153 | 9388211 | -348399.0 | 1.7 | 38 | 61 % | 18.47 % |
| India | 1380004385 | 0.99 % | 13586631 | 464 | 2973190 | -532687.0 | 2.2 | 28 | 35 % | 17.70 % |
| United States | 331002651 | 0.59 % | 1937734 | 36 | 9147420 | 954806.0 | 1.8 | 38 | 83 % | 4.25 % |


In [90]:
data.loc["China":"Mexico", "Pop":"NetChange"]  # Select a rectangular block of rows and columns

| Country | Pop | YearlyChange | NetChange |
|---|---|---|---|
| China | 1439323776 | 0.39 % | 5540090 |
| India | 1380004385 | 0.99 % | 13586631 |
| United States | 331002651 | 0.59 % | 1937734 |
| Indonesia | 273523615 | 1.07 % | 2898047 |
| Pakistan | 220892340 | 2.00 % | 4327022 |
| Brazil | 212559417 | 0.72 % | 1509890 |
| Nigeria | 206139589 | 2.58 % | 5175990 |
| Bangladesh | 164689383 | 1.01 % | 1643222 |
| Russia | 145934462 | 0.04 % | 62206 |
| Mexico | 128932753 | 1.06 % | 1357224 |


### Sorting data

Two methods: sort_values("column") and sort_index()

Both methods accept the ascending parameter to set the sort direction

In [95]:
data.sort_values("NetChange", ascending=False)  # inplace=True is also possible here

| Country | Pop | YearlyChange | NetChange | Density | LandArea | Migrants | FertRate | MedAge | UrbanPopPct | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|
| India | 1380004385 | 0.99 % | 13586631 | 464 | 2973190 | -532687.0 | 2.2 | 28 | 35 % | 17.70 % |
| China | 1439323776 | 0.39 % | 5540090 | 153 | 9388211 | -348399.0 | 1.7 | 38 | 61 % | 18.47 % |
| Nigeria | 206139589 | 2.58 % | 5175990 | 226 | 910770 | -60000.0 | 5.4 | 18 | 52 % | 2.64 % |
| Pakistan | 220892340 | 2.00 % | 4327022 | 287 | 770880 | -233379.0 | 3.6 | 23 | 35 % | 2.83 % |
| Indonesia | 273523615 | 1.07 % | 2898047 | 151 | 1811570 | -98955.0 | 2.3 | 30 | 56 % | 3.51 % |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Venezuela | 28435940 | -0.28 % | -79889 | 32 | 882050 | -653249.0 | 2.3 | 30 | N.A. | 0.36 % |
| Italy | 60461826 | -0.15 % | -88249 | 206 | 294140 | 148943.0 | 1.3 | 47 | 69 % | 0.78 % |
| Romania | 19237691 | -0.66 % | -126866 | 84 | 230170 | -73999.0 | 1.6 | 43 | 55 % | 0.25 % |
| Ukraine | 43733762 | -0.59 % | -259876 | 75 | 579320 | 10000.0 | 1.4 | 41 | 69 % | 0.56 % |


In [96]:
data.sort_index()

| Country | Pop | YearlyChange | NetChange | Density | LandArea | Migrants | FertRate | MedAge | UrbanPopPct | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 38928346 | 2.33 % | 886592 | 60 | 652860 | -62920.0 | 4.6 | 18 | 25 % | 0.50 % |
| Albania | 2877797 | -0.11 % | -3120 | 105 | 27400 | -14000.0 | 1.6 | 36 | 63 % | 0.04 % |
| Algeria | 43851044 | 1.85 % | 797990 | 18 | 2381740 | -10000.0 | 3.1 | 29 | 73 % | 0.56 % |
| American Samoa | 55191 | -0.22 % | -121 | 276 | 200 | NaN | N.A. | N.A. | 88 % | 0.00 % |
| Andorra | 77265 | 0.16 % | 123 | 164 | 470 | NaN | N.A. | N.A. | 88 % | 0.00 % |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Wallis & Futuna | 11239 | -1.69 % | -193 | 80 | 140 | NaN | N.A. | N.A. | 0 % | 0.00 % |
| Western Sahara | 597339 | 2.55 % | 14876 | 2 | 266000 | 5582.0 | 2.4 | 28 | 87 % | 0.01 % |
| Yemen | 29825964 | 2.28 % | 664042 | 56 | 527970 | -30000.0 | 3.8 | 20 | 38 % | 0.38 % |
| Zambia | 18383955 | 2.93 % | 522925 | 25 | 743390 | -8000.0 | 4.7 | 18 | 45 % | 0.24 % |


In [98]:
data.sort_values(["Pop", "NetChange"])  # With a list of columns, ties in the first column are broken by the next one

| Country | Pop | YearlyChange | NetChange | Density | LandArea | Migrants | FertRate | MedAge | UrbanPopPct | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|
| Vatican State | 801 | 0.25 % | 2 | 2003 | 0 | NaN | N.A. | N.A. | N.A. | 0.00 % |
| Tokelau | 1357 | 1.27 % | 17 | 136 | 10 | NaN | N.A. | N.A. | 0 % | 0.00 % |
| Niue | 1626 | 0.68 % | 11 | 6 | 260 | NaN | N.A. | N.A. | 46 % | 0.00 % |
| Falkland Islands | 3480 | 3.05 % | 103 | 0 | 12170 | NaN | N.A. | N.A. | 66 % | 0.00 % |
| Montserrat | 4992 | 0.06 % | 3 | 50 | 100 | NaN | N.A. | N.A. | 10 % | 0.00 % |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Pakistan | 220892340 | 2.00 % | 4327022 | 287 | 770880 | -233379.0 | 3.6 | 23 | 35 % | 2.83 % |
| Indonesia | 273523615 | 1.07 % | 2898047 | 151 | 1811570 | -98955.0 | 2.3 | 30 | 56 % | 3.51 % |
| United States | 331002651 | 0.59 % | 1937734 | 36 | 9147420 | 954806.0 | 1.8 | 38 | 83 % | 4.25 % |
| India | 1380004385 | 0.99 % | 13586631 | 464 | 2973190 | -532687.0 | 2.2 | 28 | 35 % | 17.70 % |
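The sort direction can also be set per column by passing a list to ascending. A small sketch with hypothetical values, chosen so that ties in the first column actually occur:

```python
import pandas as pd

df = pd.DataFrame(
    {"MedAge": [28, 38, 28, 38], "Pop": [10, 30, 20, 40]},  # hypothetical values
    index=["D", "B", "C", "A"],
)

# MedAge ascending; ties are broken by Pop in descending order
by_values = df.sort_values(["MedAge", "Pop"], ascending=[True, False])

# sort_index takes the same ascending parameter
by_index_desc = df.sort_index(ascending=False)
```

The resulting row order is C, D, A, B for by_values and D, C, B, A for by_index_desc.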


### Filtering data

Data in DataFrames is filtered with vectorized boolean masks (as in NumPy)

In [106]:
# Exercise: Find all countries with between 10m and 50m inhabitants
(data["Pop"] > 10_000_000) & (data["Pop"] < 50_000_000)  # Python's and/or cannot be used here; use & or | instead

Country
China               False
India               False
United States       False
Indonesia           False
Pakistan            False
                    ...  
Montserrat          False
Falkland Islands    False
Niue                False
Tokelau             False
Vatican State       False
Name: Pop, Length: 235, dtype: bool
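Why and does not work can be shown on a small Series. A minimal sketch with hypothetical values in millions. Series.between is an alternative that the notebook does not use:

```python
import pandas as pd

pop = pd.Series([5, 20, 45, 80], index=["A", "B", "C", "D"])  # hypothetical, in millions

mask = (pop > 10) & (pop < 50)                        # element-wise AND
same_mask = pop.between(10, 50, inclusive="neither")  # same range with exclusive bounds

try:
    (pop > 10) and (pop < 50)   # `and` asks for a single truth value of the whole Series
    and_failed = False
except ValueError:
    and_failed = True           # pandas refuses because that truth value is ambiguous
```

The parentheses around each comparison are required, since & binds more tightly than > and <.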

In [111]:
data[(data["Pop"] > 10_000_000) & (data["Pop"] < 50_000_000)]

| Country | Pop | YearlyChange | NetChange | Density | LandArea | Migrants | FertRate | MedAge | UrbanPopPct | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|
| Spain | 46754778 | 0.04 % | 18002 | 94 | 498800 | 40000.0 | 1.3 | 45 | 80 % | 0.60 % |
| Uganda | 45741007 | 3.32 % | 1471413 | 229 | 199810 | 168694.0 | 5.0 | 17 | 26 % | 0.59 % |
| Argentina | 45195774 | 0.93 % | 415097 | 17 | 2736690 | 4800.0 | 2.3 | 32 | 93 % | 0.58 % |
| Algeria | 43851044 | 1.85 % | 797990 | 18 | 2381740 | -10000.0 | 3.1 | 29 | 73 % | 0.56 % |
| Sudan | 43849260 | 2.42 % | 1036022 | 25 | 1765048 | -50000.0 | 4.4 | 20 | 35 % | 0.56 % |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Greece | 10423054 | -0.48 % | -50401 | 81 | 128900 | -16000.0 | 1.3 | 46 | 85 % | 0.13 % |
| Jordan | 10203134 | 1.00 % | 101440 | 115 | 88780 | 10220.0 | 2.8 | 24 | 91 % | 0.13 % |
| Portugal | 10196709 | -0.29 % | -29478 | 111 | 91590 | -6000.0 | 1.3 | 46 | 66 % | 0.13 % |
| Azerbaijan | 10139177 | 0.91 % | 91459 | 123 | 82658 | 1200.0 | 2.1 | 32 | 56 % | 0.13 % |


In [113]:
data[(data["Pop"] > 10_000_000) & (data["Pop"] < 50_000_000)].sort_index()

| Country | Pop | YearlyChange | NetChange | Density | LandArea | Migrants | FertRate | MedAge | UrbanPopPct | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 38928346 | 2.33 % | 886592 | 60 | 652860 | -62920.0 | 4.6 | 18 | 25 % | 0.50 % |
| Algeria | 43851044 | 1.85 % | 797990 | 18 | 2381740 | -10000.0 | 3.1 | 29 | 73 % | 0.56 % |
| Angola | 32866272 | 3.27 % | 1040977 | 26 | 1246700 | 6413.0 | 5.6 | 17 | 67 % | 0.42 % |
| Argentina | 45195774 | 0.93 % | 415097 | 17 | 2736690 | 4800.0 | 2.3 | 32 | 93 % | 0.58 % |
| Australia | 25499884 | 1.18 % | 296686 | 3 | 7682300 | 158246.0 | 1.8 | 38 | 86 % | 0.33 % |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Uzbekistan | 33469203 | 1.48 % | 487487 | 79 | 425400 | -8863.0 | 2.4 | 28 | 50 % | 0.43 % |
| Venezuela | 28435940 | -0.28 % | -79889 | 32 | 882050 | -653249.0 | 2.3 | 30 | N.A. | 0.36 % |
| Yemen | 29825964 | 2.28 % | 664042 | 56 | 527970 | -30000.0 | 3.8 | 20 | 38 % | 0.38 % |
| Zambia | 18383955 | 2.93 % | 522925 | 25 | 743390 | -8000.0 | 4.7 | 18 | 45 % | 0.24 % |


In [121]:
data.index  # Returns the index as an Index object (not a Series)

Index(['China', 'India', 'United States', 'Indonesia', 'Pakistan', 'Brazil',
       'Nigeria', 'Bangladesh', 'Russia', 'Mexico',
       ...
       'Wallis & Futuna', 'Nauru', 'Saint Barthelemy', 'Saint Helena',
       'Saint Pierre & Miquelon', 'Montserrat', 'Falkland Islands', 'Niue',
       'Tokelau', 'Vatican State'],
      dtype='object', name='Country', length=235)

In [122]:
data.values  # Returns the DataFrame as a NumPy array; also works on Series

array([[1439323776, '0.39 %', 5540090, ..., '38', '61 %', '18.47 %'],
       [1380004385, '0.99 %', 13586631, ..., '28', '35 %', '17.70 %'],
       [331002651, '0.59 %', 1937734, ..., '38', '83 %', '4.25 %'],
       ...,
       [1626, '0.68 %', 11, ..., 'N.A.', '46 %', '0.00 %'],
       [1357, '1.27 %', 17, ..., 'N.A.', '0 %', '0.00 %'],
       [801, '0.25 %', 2, ..., 'N.A.', 'N.A.', '0.00 %']],
      shape=(235, 10), dtype=object)
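The dtype=object in the output above comes from the mix of numeric and string columns. A minimal sketch on a hypothetical two-column frame, using to_numpy() as an alternative to .values:

```python
import pandas as pd

mixed = pd.DataFrame({"Pop": [801, 1357], "WorldShare": ["0.00 %", "0.00 %"]})

whole = mixed.to_numpy()          # mixed column types fall back to dtype=object
single = mixed["Pop"].to_numpy()  # a single numeric column keeps its integer dtype
```

For numeric work it is therefore usually better to convert individual cleaned columns rather than the whole frame.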

### Grouping

Choose a criterion; the records are split into groups according to it

In [124]:
groups = data.groupby("MedAge")

In [126]:
groups.count()  # Groups as the index, counts as the values in the columns

| MedAge | Pop | YearlyChange | NetChange | Density | LandArea | Migrants | FertRate | UrbanPopPct | WorldShare |
|---|---|---|---|---|---|---|---|---|---|
| 15 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 16 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 17 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 18 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| 19 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 |
| 20 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
| 21 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| 22 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| 23 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 24 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |


In [127]:
groups.count()["Pop"]

MedAge
15       1
16       1
17       6
18      10
19      14
20       9
21       5
22       7
23       4
24       6
25       4
26       7
27       4
28      12
29       5
30       8
31       6
32      11
33       5
34       6
35       3
36       4
37       4
38       7
39       3
40       7
41       5
42      10
43      11
44       5
45       5
46       3
47       2
48       1
N.A.    34
Name: Pop, dtype: int64

In [129]:
groups.get_group("20")

| Country | Pop | YearlyChange | NetChange | Density | LandArea | Migrants | FertRate | MedAge | UrbanPopPct | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|
| Kenya | 53771296 | 2.28 % | 1197323 | 94 | 569140 | -10000.0 | 3.5 | 20 | 28 % | 0.69 % |
| Sudan | 43849260 | 2.42 % | 1036022 | 25 | 1765048 | -50000.0 | 4.4 | 20 | 35 % | 0.56 % |
| Yemen | 29825964 | 2.28 % | 664042 | 56 | 527970 | -30000.0 | 3.8 | 20 | 38 % | 0.38 % |
| Madagascar | 27691018 | 2.68 % | 721711 | 48 | 581795 | -1500.0 | 4.1 | 20 | 39 % | 0.36 % |
| Rwanda | 12952218 | 2.58 % | 325268 | 525 | 24670 | -9000.0 | 4.1 | 20 | 18 % | 0.17 % |
| Mauritania | 4649658 | 2.74 % | 123962 | 5 | 1030700 | 5000.0 | 4.6 | 20 | 57 % | 0.06 % |
| Comoros | 869601 | 2.20 % | 18715 | 467 | 1861 | -2000.0 | 4.2 | 20 | 29 % | 0.01 % |
| Solomon Islands | 686884 | 2.55 % | 17061 | 25 | 27990 | -1600.0 | 4.4 | 20 | 23 % | 0.01 % |
| Mayotte | 272815 | 2.50 % | 6665 | 728 | 375 | 0.0 | 3.7 | 20 | 46 % | 0.00 % |


In [132]:
# Exercise: What is the average population per median age?

In [133]:
groups["Pop"].mean()

MedAge
15      2.420664e+07
16      2.025083e+07
17      3.539643e+07
18      4.148540e+07
19      1.824118e+07
20      1.939652e+07
21      9.621932e+06
22      7.686515e+06
23      6.028827e+07
24      7.213284e+06
25      3.304188e+07
26      2.510786e+07
27      2.088666e+06
28      1.398790e+08
29      4.693030e+07
30      4.894655e+07
31      1.847800e+07
32      3.284957e+07
33      4.789470e+07
34      3.726593e+06
35      1.595275e+07
36      2.161582e+06
37      1.772735e+06
38      2.590871e+08
39      1.120258e+06
40      4.277205e+07
41      1.942832e+07
42      1.717237e+07
43      7.631299e+06
44      1.210433e+07
45      1.320029e+07
46      3.480124e+07
47      3.041855e+07
48      1.264765e+08
N.A.    3.333179e+04
Name: Pop, dtype: float64
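Count and mean can also be computed in one step with agg. A minimal sketch on a hypothetical miniature, with MedAge kept as strings as in the uncleaned dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "MedAge": ["20", "20", "38", "N.A."],  # strings, as in the uncleaned dataset
    "Pop": [10, 30, 100, 5],               # hypothetical values
})

# one row per group, one column per aggregation
summary = df.groupby("MedAge")["Pop"].agg(["count", "mean"])
```

As in the output above, "N.A." forms a group of its own as long as it is stored as an ordinary string rather than a missing value.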

### Exporting data

pandas provides the to_... methods for exporting

These methods can also be used to convert between formats

In [134]:
data.to_csv("Data/PopulationDataFastFertig.csv")
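A round trip shows what to_csv writes and how the to_... methods convert between formats. A minimal sketch that writes to an in-memory buffer instead of a file (the buffer and the two-row frame are assumptions for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame(
    {"Pop": [801, 1357]},
    index=pd.Index(["Vatican State", "Tokelau"], name="Country"),
)

buffer = io.StringIO()
df.to_csv(buffer)   # the index is written as the first column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Country")  # read it back with the same index

as_json = df.to_json()  # the same data as a JSON string
```

When reading the exported file back in, index_col is needed; otherwise the country names come back as an ordinary column next to a new numeric index.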