### Querying DataFrames

Analytical tasks often require querying a DataFrame to answer specific questions.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('december.csv', parse_dates = ['date'])
df.head()

Unnamed: 0,date,temperature,windspeed,event
0,2015-12-31,48.0,12.0,Rain
1,2015-12-30,46.0,7.0,Rain
2,2015-12-29,42.0,12.0,Rain
3,2015-12-28,42.0,14.0,Rain
4,2015-12-27,56.0,12.0,Fog


## Accessing a single column

In [40]:
# df.temperature (only works if the column name is a valid Python identifier, e.g. contains no spaces)
df['temperature'] # safe variant, similar to dictionary access

0     48.0
1     46.0
2     42.0
3     42.0
4     56.0
5      NaN
6     58.0
7     63.0
8     56.0
9     56.0
10    47.0
11    40.0
12    39.0
13    49.0
14    55.0
15    49.0
16    59.0
17     NaN
18    62.0
19    57.0
20    52.0
21    54.0
22     NaN
23    46.0
24    48.0
25    45.0
26    44.0
27    47.0
28    51.0
29    52.0
30    50.0
Name: temperature, dtype: float64
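
As a small sketch of why bracket access is the safer choice: attribute access stops working as soon as a column name is not a valid Python identifier or collides with an existing DataFrame attribute or method. The toy frame and its column names below are made up for illustration and are not part of `december.csv`.

```python
import pandas as pd

# Hypothetical toy data, not the december.csv file used above
toy = pd.DataFrame({
    "temperature": [48.0, 46.0],
    "wind speed": [12.0, 7.0],   # name contains a space
    "count": [1, 2],             # name collides with DataFrame.count()
})

# Bracket access works for every column name
wind = toy["wind speed"]
counts = toy["count"]

# Attribute access works for identifier-like names that do not collide
temp = toy.temperature

# toy.count is the DataFrame method here, not the column
is_method = callable(toy.count)
print(is_method)  # True
```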

## Accessing multiple columns

In [41]:
df[ ['temperature','windspeed'] ]

Unnamed: 0,temperature,windspeed
0,48.0,12.0
1,46.0,7.0
2,42.0,12.0
3,42.0,14.0
4,56.0,12.0
5,,11.0
6,58.0,2.0
7,63.0,11.0
8,56.0,10.0
9,56.0,9.0
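
The difference between single and double brackets can be sketched as follows: a single label returns a Series, while a list of labels returns a DataFrame, even if the list contains only one label. The toy frame is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"temperature": [48.0, 46.0], "windspeed": [12.0, 7.0]})

single = toy["temperature"]      # one label -> Series
as_frame = toy[["temperature"]]  # list with one label -> DataFrame with one column
both = toy[["temperature", "windspeed"]]

print(type(single).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
print(both.shape)               # (2, 2)
```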


## Simple queries

- All entries with a temperature value greater than 55

In [42]:
condition = df['temperature'] > 55
condition

0     False
1     False
2     False
3     False
4      True
5     False
6      True
7      True
8      True
9      True
10    False
11    False
12    False
13    False
14    False
15    False
16     True
17    False
18     True
19     True
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
Name: temperature, dtype: bool

In [43]:
df[condition] 

Unnamed: 0,date,temperature,windspeed,event
4,2015-12-27,56.0,12.0,Fog
6,2015-12-25,58.0,2.0,Fog
7,2015-12-24,63.0,11.0,Fog
8,2015-12-23,56.0,10.0,Fog
9,2015-12-22,56.0,9.0,Rain
16,2015-12-15,59.0,18.0,Rain
18,2015-12-13,62.0,7.0,Rain
19,2015-12-12,57.0,4.0,Rain


The result contains all entries for which the condition `df['temperature'] > 55` evaluates to `True`.
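
A minimal sketch of the mechanism, using made-up values: the comparison is evaluated element by element and produces a boolean Series, which then acts as a row filter. Note that a comparison with `NaN` yields `False`, so rows with missing temperatures are dropped by such a filter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({"temperature": [48.0, np.nan, 58.0, 63.0]})

mask = toy["temperature"] > 55   # element-wise comparison -> boolean Series
print(mask.tolist())             # [False, False, True, True]

hot = toy[mask]                  # keeps only the rows where the mask is True
print(hot.index.tolist())        # [2, 3]
```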

In [44]:
# usual notation
df[ (df['temperature'] > 55) & (df['windspeed'] > 7) ]  

Unnamed: 0,date,temperature,windspeed,event
4,2015-12-27,56.0,12.0,Fog
7,2015-12-24,63.0,11.0,Fog
8,2015-12-23,56.0,10.0,Fog
9,2015-12-22,56.0,9.0,Rain
16,2015-12-15,59.0,18.0,Rain
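
Boolean Series are combined with the element-wise operators `&` (and), `|` (or) and `~` (not); Python's `and`, `or` and `not` do not work element-wise. Because `&` and `|` bind more tightly than comparisons such as `>`, each comparison needs its own parentheses when written inline. The sketch below uses made-up values and names the masks first, which avoids the parentheses.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "temperature": [56.0, 58.0, 42.0, 63.0],
    "windspeed": [12.0, 2.0, 14.0, 11.0],
})

warm = toy["temperature"] > 55
windy = toy["windspeed"] > 7

both = toy[warm & windy]    # AND
either = toy[warm | windy]  # OR
not_warm = toy[~warm]       # NOT

print(both.index.tolist())      # [0, 3]
print(either.index.tolist())    # [0, 1, 2, 3]
print(not_warm.index.tolist())  # [2]
```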


- The subset of the DataFrame where it rained.

In [45]:
condition = df['event'] == 'Rain' 
df[condition]

Unnamed: 0,date,temperature,windspeed,event
0,2015-12-31,48.0,12.0,Rain
1,2015-12-30,46.0,7.0,Rain
2,2015-12-29,42.0,12.0,Rain
3,2015-12-28,42.0,14.0,Rain
5,2015-12-26,,11.0,Rain
9,2015-12-22,56.0,9.0,Rain
10,2015-12-21,47.0,12.0,Rain
12,2015-12-19,39.0,20.0,Rain
13,2015-12-18,49.0,14.0,Rain
14,2015-12-17,55.0,,Rain


We can combine conditions or restrict the output to _selected_ columns.

In [46]:
df[condition][['date', 'windspeed']] # only the date and wind speed of rainy days

Unnamed: 0,date,windspeed
0,2015-12-31,12.0
1,2015-12-30,7.0
2,2015-12-29,12.0
3,2015-12-28,14.0
5,2015-12-26,11.0
9,2015-12-22,9.0
10,2015-12-21,12.0
12,2015-12-19,20.0
13,2015-12-18,14.0
14,2015-12-17,
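
As an alternative to indexing twice in a row, `.loc` accepts the row condition and the column list in one step. This is a sketch on made-up data; for reading values both forms give the same result, while `.loc` is generally the form to prefer when values are to be assigned.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-12-31", "2015-12-27", "2015-12-26"]),
    "windspeed": [12.0, 12.0, 11.0],
    "event": ["Rain", "Fog", "Rain"],
})

rainy = toy["event"] == "Rain"

chained = toy[rainy][["date", "windspeed"]]          # two indexing steps
single_step = toy.loc[rainy, ["date", "windspeed"]]  # rows and columns at once

print(single_step.index.tolist())  # [0, 2]
```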


In [4]:
cond_1 = df['event'] == 'Rain'
cond_2 = df['temperature'] > 55
df[cond_1 & cond_2]['date'].to_frame()
# df[cond_1 & cond_2]['date']

Unnamed: 0,date
9,2015-12-22
16,2015-12-15
18,2015-12-13
19,2015-12-12


- On how many days did it rain?

In [48]:
df[df['event'] == 'Rain'].count()

date           19
temperature    18
windspeed      18
event          19
dtype: int64

In [49]:
df[(df['event'] == 'Rain')].count()

date           19
temperature    18
windspeed      18
event          19
dtype: int64
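
The differing numbers per column arise because `count()` counts only non-null values in each column. If a single number of matching rows is wanted, summing the boolean mask is one option, since `True` counts as 1. A sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing temperature
toy = pd.DataFrame({
    "temperature": [48.0, np.nan, 56.0],
    "event": ["Rain", "Rain", "Fog"],
})

rainy = toy["event"] == "Rain"

per_column = toy[rainy].count()  # non-null values per column
n_rows = int(rainy.sum())        # number of rows where the mask is True

print(per_column["temperature"])  # 1
print(per_column["event"])        # 2
print(n_rows)                     # 2
```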

- On how many days was the wind speed above average?

In [50]:
condition = df['windspeed'] > df['windspeed'].mean()
df[condition]

Unnamed: 0,date,temperature,windspeed,event
0,2015-12-31,48.0,12.0,Rain
2,2015-12-29,42.0,12.0,Rain
3,2015-12-28,42.0,14.0,Rain
4,2015-12-27,56.0,12.0,Fog
5,2015-12-26,,11.0,Rain
7,2015-12-24,63.0,11.0,Fog
10,2015-12-21,47.0,12.0,Rain
11,2015-12-20,40.0,12.0,Fog
12,2015-12-19,39.0,20.0,Rain
13,2015-12-18,49.0,14.0,Rain


In [51]:
len(df[condition]) # simply len() instead of .count()

14
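
Two details matter for this kind of query when values are missing: `mean()` skips `NaN` by default, and a `NaN` entry never satisfies the comparison. The following sketch with made-up wind speeds illustrates both.

```python
import numpy as np
import pandas as pd

# Hypothetical wind speeds with one missing value
wind = pd.Series([12.0, 7.0, np.nan, 2.0, 19.0])

avg = wind.mean()   # NaN is skipped: (12 + 7 + 2 + 19) / 4
above = wind > avg  # the NaN entry compares as False

print(avg)               # 10.0
print(above.tolist())    # [True, False, False, False, True]
print(int(above.sum()))  # 2
```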

- Find the days with a temperature between 48 and 55, both bounds inclusive

In [52]:
cond = (48 <= df['temperature']) & (df['temperature'] <= 55)

In [53]:
df[cond]

Unnamed: 0,date,temperature,windspeed,event
0,2015-12-31,48.0,12.0,Rain
13,2015-12-18,49.0,14.0,Rain
14,2015-12-17,55.0,,Rain
15,2015-12-16,49.0,9.0,Rain
20,2015-12-11,52.0,,Fog
21,2015-12-10,54.0,7.0,Rain
24,2015-12-07,48.0,8.0,Fog
28,2015-12-03,51.0,16.0,Rain
29,2015-12-02,52.0,5.0,Fog
30,2015-12-01,50.0,11.0,Rain


In [54]:
# alternatively
df[df['temperature'].between(48,55)]

Unnamed: 0,date,temperature,windspeed,event
0,2015-12-31,48.0,12.0,Rain
13,2015-12-18,49.0,14.0,Rain
14,2015-12-17,55.0,,Rain
15,2015-12-16,49.0,9.0,Rain
20,2015-12-11,52.0,,Fog
21,2015-12-10,54.0,7.0,Rain
24,2015-12-07,48.0,8.0,Fog
28,2015-12-03,51.0,16.0,Rain
29,2015-12-02,52.0,5.0,Fog
30,2015-12-01,50.0,11.0,Rain
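
By default `between` includes both bounds, which is why it matches the manual condition. The sketch below also shows the `inclusive` parameter for excluding the bounds; the string values used here assume pandas 1.3 or later, and the temperatures are made up.

```python
import pandas as pd

# Hypothetical temperatures
temps = pd.Series([47.0, 48.0, 52.0, 55.0, 56.0])

closed = temps.between(48, 55)                      # default: both bounds included
open_ = temps.between(48, 55, inclusive="neither")  # both bounds excluded (pandas >= 1.3 assumed)
manual = (48 <= temps) & (temps <= 55)

print(closed.tolist())  # [False, True, True, True, False]
print(open_.tolist())   # [False, False, True, False, False]
```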


- Which 5 days have the highest temperatures?

In [55]:
df.nlargest(5, 'temperature') # without having to sort the DataFrame

Unnamed: 0,date,temperature,windspeed,event
7,2015-12-24,63.0,11.0,Fog
18,2015-12-13,62.0,7.0,Rain
16,2015-12-15,59.0,18.0,Rain
6,2015-12-25,58.0,2.0,Fog
19,2015-12-12,57.0,4.0,Rain


In [None]:
#help(pd.DataFrame.nlargest)
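
For comparison, a sketch of what `nlargest` saves us from writing: sorting in descending order and taking the first rows gives the same rows when there are no ties. The counterpart `nsmallest` works the same way for the lowest values. The data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({"temperature": [48.0, 63.0, np.nan, 62.0, 59.0, 42.0]})

top3 = toy.nlargest(3, "temperature")
top3_sorted = toy.sort_values("temperature", ascending=False).head(3)
bottom2 = toy.nsmallest(2, "temperature")

print(top3.index.tolist())         # [1, 3, 4]
print(top3_sorted.index.tolist())  # [1, 3, 4]
print(bottom2.index.tolist())      # [5, 0]
```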

DataFrames can be iterated over

In [58]:
for column in df:
    print(column) # column names

date
temperature
windspeed
event


In [59]:
for row in df.iterrows():
    print(row[0])
    print(50*'-')
    print(row[1])
    print(50*'#')

0
--------------------------------------------------
date           2015-12-31 00:00:00
temperature                   48.0
windspeed                     12.0
event                         Rain
Name: 0, dtype: object
##################################################
1
--------------------------------------------------
date           2015-12-30 00:00:00
temperature                   46.0
windspeed                      7.0
event                         Rain
Name: 1, dtype: object
##################################################
2
--------------------------------------------------
date           2015-12-29 00:00:00
temperature                   42.0
windspeed                     12.0
event                         Rain
Name: 2, dtype: object
##################################################
3
--------------------------------------------------
date           2015-12-28 00:00:00
temperature                   42.0
windspeed                     14.0
event                         Rain
Name: 
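
Since `iterrows()` yields `(index, Series)` pairs, the pair can be unpacked directly in the loop header instead of using `row[0]` and `row[1]`. The sketch below also shows `itertuples()`, which yields namedtuples. The toy frame is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "temperature": [48.0, 46.0],
    "event": ["Rain", "Rain"],
})

# Unpack the (index, Series) pair directly
labels = []
for idx, row in toy.iterrows():
    labels.append(f"{idx}: {row['temperature']} {row['event']}")

# itertuples() yields namedtuples; columns are accessed as attributes
temps = [t.temperature for t in toy.itertuples()]

print(labels)  # ['0: 48.0 Rain', '1: 46.0 Rain']
print(temps)   # [48.0, 46.0]
```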

Assigning values to multiple columns

In [60]:
df[["temperature", "windspeed"]] = df[["temperature", "windspeed"]] * 1000

In [61]:
df.head()

Unnamed: 0,date,temperature,windspeed,event
0,2015-12-31,48000.0,12000.0,Rain
1,2015-12-30,46000.0,7000.0,Rain
2,2015-12-29,42000.0,12000.0,Rain
3,2015-12-28,42000.0,14000.0,Rain
4,2015-12-27,56000.0,12000.0,Fog
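
Such an assignment modifies the DataFrame in place, so running the cell a second time would scale the values again. One way to keep the original values is to work on a copy, sketched here with made-up data.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "temperature": [48.0, 46.0],
    "windspeed": [12.0, 7.0],
    "event": ["Rain", "Rain"],
})

cols = ["temperature", "windspeed"]

scaled = toy.copy()                 # work on a copy so that toy stays unchanged
scaled[cols] = scaled[cols] * 1000  # element-wise, both columns at once

print(scaled["temperature"].tolist())  # [48000.0, 46000.0]
print(toy["temperature"].tolist())     # [48.0, 46.0]
```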


- We want to store all columns that contain NaNs in a separate DataFrame

In [64]:
df.isnull().any()

date           False
temperature     True
windspeed       True
event          False
dtype: bool

In [68]:
columns_with_nans = df.columns[df.isnull().any()]
columns_with_nans

Index(['temperature', 'windspeed'], dtype='object')

In [70]:
df_nan = df[columns_with_nans]
df_nan

Unnamed: 0,temperature,windspeed
0,48000.0,12000.0
1,46000.0,7000.0
2,42000.0,12000.0
3,42000.0,14000.0
4,56000.0,12000.0
5,,11000.0
6,58000.0,2000.0
7,63000.0,11000.0
8,56000.0,10000.0
9,56000.0,9000.0
