# Example for reading formatted data from sensor logs

This notebook uses pandas in a straightforward manner and should give you some hints for further usage.

In [2]:
import pandas as pd

In [3]:
import numpy as np
import time

## Reading a file

When reading data from a file with a known format, conversions can already be done while reading.
The file path should use forward slashes ("./mydir/myfile"), escaped backslashes (".\\mydir\\myfile"), or a raw string (r".\mydir\myfile").
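
As a small illustration of the three spellings (the directory and file names are the placeholders from the text, not real files), the following sketch checks that the escaped and the raw form are the same string, and that `pathlib` treats the forward-slash form as the same Windows path:

```python
from pathlib import PureWindowsPath

forward = "./mydir/myfile"      # forward slashes
escaped = ".\\mydir\\myfile"    # escaped backslashes
raw = r".\mydir\myfile"         # raw string, backslashes taken literally

# The escaped and the raw spelling produce the identical string.
same_string = escaped == raw

# PureWindowsPath only manipulates path text; it never touches the disk.
same_path = PureWindowsPath(forward) == PureWindowsPath(raw)

print(same_string, same_path)
```

Forward slashes are usually the least error-prone choice because they need no escaping.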

The %time magic preceding the operation in a notebook cell reports the CPU time and wall time used.
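
Outside a notebook, where `%time` is not available, a rough CPU and wall-clock measurement can be taken with the standard `time` module. The helper name `timed` is made up for this sketch, and the summed range merely stands in for a real `pd.read_csv` call:

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


# Stand-in workload; replace with the operation you want to measure.
total, wall, cpu = timed(sum, range(1_000_000))
print(f"CPU time: {cpu:.3f} s, wall time: {wall:.3f} s")
```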

In [3]:
filename = "./data/bimsim_31_day_sample.csv"
%time data = pd.read_csv(filename, sep=",", parse_dates = ["datetime"], infer_datetime_format=True)

CPU times: total: 1.64 s
Wall time: 1.67 s


In [6]:
%time data_2 = pd.read_csv(filename, sep=",")

CPU times: total: 484 ms
Wall time: 490 ms


In [7]:
data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130887 entries, 0 to 130886
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  130887 non-null  int64  
 1   source      130887 non-null  object 
 2   datetime    130887 non-null  object 
 3   id          130887 non-null  float64
 4   celsius     130887 non-null  float64
 5   pressure    130887 non-null  float64
 6   humidity    130887 non-null  float64
 7   sensor      130887 non-null  object 
 8   room        130887 non-null  object 
dtypes: float64(4), int64(1), object(4)
memory usage: 9.0+ MB
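
The effect of `parse_dates` on the `datetime` column can be reproduced on a small in-memory sample. The two rows are modelled on the sensor data used in this notebook; the reduced column set and the `dtype` mapping are assumptions made for illustration only:

```python
import io

import pandas as pd

csv_text = (
    "datetime,celsius,room\n"
    "2021-08-18 09:40:38,23.144531,H205\n"
    "2021-08-18 09:45:42,23.154882,H205\n"
)

# Without hints the datetime column stays plain text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates and dtype the conversion happens while reading.
typed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["datetime"],
    dtype={"celsius": "float32"},
)

print(plain.dtypes)
print(typed.dtypes)
```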


In [8]:
data.head(5)

Unnamed: 0.1,Unnamed: 0,source,datetime,id,celsius,pressure,humidity,sensor,room
0,0,0013A20041A94FA3,2021-08-18 09:40:38,1629272000.0,23.144531,956.7491,42.368893,0013A20041A94FA3,H205
1,1,0013A20041A94FA3,2021-08-18 09:45:42,1629273000.0,23.154882,956.8015,42.05996,0013A20041A94FA3,H205
2,2,0013A20041A94FA3,2021-08-18 09:50:45,1629273000.0,23.16504,956.7753,42.528866,0013A20041A94FA3,H205
3,3,0013A20041A94FA3,2021-08-18 09:55:48,1629273000.0,23.16504,956.7753,42.512318,0013A20041A94FA3,H205
4,4,0013A20041A94FA3,2021-08-18 10:00:52,1629274000.0,23.180664,956.8277,42.29166,0013A20041A94FA3,H205


## Getting information about dataset

The info() method of a DataFrame reports the size of the DataFrame, the column types and the overall memory consumption.

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130887 entries, 0 to 130886
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   Unnamed: 0  130887 non-null  int64         
 1   source      130887 non-null  object        
 2   datetime    130887 non-null  datetime64[ns]
 3   id          130887 non-null  float64       
 4   celsius     130887 non-null  float64       
 5   pressure    130887 non-null  float64       
 6   humidity    130887 non-null  float64       
 7   sensor      130887 non-null  object        
 8   room        130887 non-null  object        
dtypes: datetime64[ns](1), float64(4), int64(1), object(3)
memory usage: 9.0+ MB


To get some fundamental statistical information on the numerical columns of a DataFrame, pandas provides the method DataFrame.describe().

In [34]:
data.describe()

Unnamed: 0.1,Unnamed: 0,id,celsius,pressure,humidity
count,130887.0,130887.0,130887.0,130887.0,130887.0
mean,65443.0,1628740000.0,24.783254,952.056672,42.197683
std,37783.966679,783388.0,2.82844,6.734564,3.48731
min,0.0,1627337000.0,-149.52812,933.16724,24.105137
25%,32721.5,1628093000.0,23.661915,946.65094,40.05078
50%,65443.0,1628762000.0,24.671875,952.8762,42.13707
75%,98164.5,1629501000.0,25.885645,956.72266,44.467346
max,130886.0,1629936000.0,34.46582,971.1399,57.478916
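
Since `describe()` itself returns a DataFrame, single statistics can be picked out with `loc`. A minimal sketch on made-up values (the numbers are not taken from the sensor file):

```python
import pandas as pd

sample = pd.DataFrame({
    "celsius": [23.0, 24.0, 25.0, 26.0],
    "humidity": [40.0, 42.0, 44.0, 46.0],
})

stats = sample.describe()

# Rows of the result are labelled "count", "mean", "std", "min", ...
mean_celsius = stats.loc["mean", "celsius"]
max_humidity = stats.loc["max", "humidity"]
print(mean_celsius, max_humidity)
```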


If we want to look at the contents, we can use the head() method to get a quick glimpse of some data.

In [26]:
data.head(10)

Unnamed: 0.1,Unnamed: 0,source,datetime,id,celsius,pressure,humidity,sensor,room
0,0,0013A20041A94FA3,2021-08-18 09:40:38+00:00,1629272000.0,23.144531,956.7491,42.368893,0013A20041A94FA3,H205
1,1,0013A20041A94FA3,2021-08-18 09:45:42+00:00,1629273000.0,23.154882,956.8015,42.05996,0013A20041A94FA3,H205
2,2,0013A20041A94FA3,2021-08-18 09:50:45+00:00,1629273000.0,23.16504,956.7753,42.528866,0013A20041A94FA3,H205
3,3,0013A20041A94FA3,2021-08-18 09:55:48+00:00,1629273000.0,23.16504,956.7753,42.512318,0013A20041A94FA3,H205
4,4,0013A20041A94FA3,2021-08-18 10:00:52+00:00,1629274000.0,23.180664,956.8277,42.29166,0013A20041A94FA3,H205
5,5,0013A20041A94FA3,2021-08-18 10:05:55+00:00,1629274000.0,23.19082,956.8015,42.68883,0013A20041A94FA3,H205
6,6,0013A20041A94FA3,2021-08-18 10:10:59+00:00,1629274000.0,23.19082,956.8015,42.892914,0013A20041A94FA3,H205
7,7,0013A20041A94FA3,2021-08-18 10:16:02+00:00,1629275000.0,23.180664,956.9325,42.65022,0013A20041A94FA3,H205
8,8,0013A20041A94FA3,2021-08-18 10:21:06+00:00,1629275000.0,23.180664,956.9849,42.490253,0013A20041A94FA3,H205
9,9,0013A20041A94FA3,2021-08-18 10:26:09+00:00,1629275000.0,23.19082,956.95874,42.738472,0013A20041A94FA3,H205


We can reduce the overall memory consumption by using float32 instead of float64. The astype() method converts a complete column. Note that the conversion is not done in place, i.e. you must assign the result somewhere.

In [27]:
data["celsius"] = data["celsius"].astype(np.float32)
data["pressure"] = data["pressure"].astype(np.float32)
data["humidity"] = data["humidity"].astype(np.float32)

In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130887 entries, 0 to 130886
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype              
---  ------      --------------   -----              
 0   Unnamed: 0  130887 non-null  int64              
 1   source      130887 non-null  object             
 2   datetime    130887 non-null  datetime64[ns, UTC]
 3   id          130887 non-null  float64            
 4   celsius     130887 non-null  float32            
 5   pressure    130887 non-null  float32            
 6   humidity    130887 non-null  float32            
 7   sensor      130887 non-null  object             
 8   room        130887 non-null  object             
dtypes: datetime64[ns, UTC](1), float32(3), float64(1), int64(1), object(3)
memory usage: 7.5+ MB


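So we reduced the overall memory consumption from 9.0+ MB to 7.5+ MB: three columns of 130887 values each now need 4 bytes less per value, which is roughly 1.5 MB.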
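
The saving can be checked per column with `memory_usage()`. The sketch below uses a synthetic column of 1000 values rather than the sensor data; it also shows that `astype()` leaves the original frame untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"celsius": np.linspace(20.0, 30.0, 1000)})

# astype() returns a converted copy; df itself keeps float64.
converted = df.astype({"celsius": np.float32})

# Bytes used by the data columns only (index excluded).
bytes_before = int(df.memory_usage(index=False).sum())
bytes_after = int(converted.memory_usage(index=False).sum())
print(bytes_before, bytes_after)
```

Keep in mind that float32 carries only about seven significant decimal digits, which should be checked against the precision the sensor values actually need.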

## A faster way to convert date and time
Datetime conversion can become slow when reading from disk, especially for larger datasets. How slow depends on the date format used in the file: in particular, if it contains nominal information such as weekdays ("Mon", "Tue") and time zones ("GMT", "EST"), conversion can be costly. A faster way is to read the column as plain text and convert it afterwards with an explicit format string:

In [4]:
filename = "bimsim_31_day_sample.csv"
%time data = pd.read_csv(filename, sep=",")
data.head()

CPU times: total: 312 ms
Wall time: 321 ms


Unnamed: 0.1,Unnamed: 0,source,datetime,id,celsius,pressure,humidity,sensor,room
0,0,0013A20041A94FA3,"Wed, 18 Aug 2021 09:40:38 GMT",1629272000.0,23.144531,956.7491,42.368893,0013A20041A94FA3,H205
1,1,0013A20041A94FA3,"Wed, 18 Aug 2021 09:45:42 GMT",1629273000.0,23.154882,956.8015,42.05996,0013A20041A94FA3,H205
2,2,0013A20041A94FA3,"Wed, 18 Aug 2021 09:50:45 GMT",1629273000.0,23.16504,956.7753,42.528866,0013A20041A94FA3,H205
3,3,0013A20041A94FA3,"Wed, 18 Aug 2021 09:55:48 GMT",1629273000.0,23.16504,956.7753,42.512318,0013A20041A94FA3,H205
4,4,0013A20041A94FA3,"Wed, 18 Aug 2021 10:00:52 GMT",1629274000.0,23.180664,956.8277,42.29166,0013A20041A94FA3,H205


In [5]:
%time data["datetime"] = pd.to_datetime(data["datetime"],format='%a, %d %b %Y %H:%M:%S %Z') 

CPU times: total: 6.2 s
Wall time: 6.32 s
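
The same conversion can be tried on a handful of strings. The two values are taken from the table above; passing `format` tells pandas exactly how the text is laid out, so it does not have to infer the layout:

```python
import pandas as pd

raw = pd.Series([
    "Wed, 18 Aug 2021 09:40:38 GMT",
    "Wed, 18 Aug 2021 09:45:42 GMT",
])

# %a weekday, %d day, %b month name, %Y year, %H:%M:%S time, %Z zone name
parsed = pd.to_datetime(raw, format="%a, %d %b %Y %H:%M:%S %Z")

print(parsed.dt.hour.tolist(), parsed.dt.minute.tolist())
```

The weekday and month names are matched against the active locale, so this sketch assumes an English or C locale.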


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130887 entries, 0 to 130886
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype              
---  ------      --------------   -----              
 0   Unnamed: 0  130887 non-null  int64              
 1   source      130887 non-null  object             
 2   datetime    130887 non-null  datetime64[ns, GMT]
 3   id          130887 non-null  float64            
 4   celsius     130887 non-null  float64            
 5   pressure    130887 non-null  float64            
 6   humidity    130887 non-null  float64            
 7   sensor      130887 non-null  object             
 8   room        130887 non-null  object             
dtypes: datetime64[ns, GMT](1), float64(4), int64(1), object(3)
memory usage: 9.0+ MB


In [7]:
data.head(10)

Unnamed: 0.1,Unnamed: 0,source,datetime,id,celsius,pressure,humidity,sensor,room
0,0,0013A20041A94FA3,2021-08-18 09:40:38+00:00,1629272000.0,23.144531,956.7491,42.368893,0013A20041A94FA3,H205
1,1,0013A20041A94FA3,2021-08-18 09:45:42+00:00,1629273000.0,23.154882,956.8015,42.05996,0013A20041A94FA3,H205
2,2,0013A20041A94FA3,2021-08-18 09:50:45+00:00,1629273000.0,23.16504,956.7753,42.528866,0013A20041A94FA3,H205
3,3,0013A20041A94FA3,2021-08-18 09:55:48+00:00,1629273000.0,23.16504,956.7753,42.512318,0013A20041A94FA3,H205
4,4,0013A20041A94FA3,2021-08-18 10:00:52+00:00,1629274000.0,23.180664,956.8277,42.29166,0013A20041A94FA3,H205
5,5,0013A20041A94FA3,2021-08-18 10:05:55+00:00,1629274000.0,23.19082,956.8015,42.68883,0013A20041A94FA3,H205
6,6,0013A20041A94FA3,2021-08-18 10:10:59+00:00,1629274000.0,23.19082,956.8015,42.892914,0013A20041A94FA3,H205
7,7,0013A20041A94FA3,2021-08-18 10:16:02+00:00,1629275000.0,23.180664,956.9325,42.65022,0013A20041A94FA3,H205
8,8,0013A20041A94FA3,2021-08-18 10:21:06+00:00,1629275000.0,23.180664,956.9849,42.490253,0013A20041A94FA3,H205
9,9,0013A20041A94FA3,2021-08-18 10:26:09+00:00,1629275000.0,23.19082,956.95874,42.738472,0013A20041A94FA3,H205


# Examples for Dataframe addressing

The next sections give examples of how to address single elements, subsets of rows and/or columns of a DataFrame. These are the most common operations when working with datasets and are worth remembering.

## Directly per column
* Use a column name or a list of column names to select data.
* You can run operations directly on the selection.
* Selecting with a list of column names returns a new DataFrame; a single column name returns a Series.
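
The difference between a single name and a list of names shows up in the return type. A short sketch on a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    "humidity": [42.4, 42.1, 42.5],
    "celsius": [23.1, 23.2, 23.2],
    "room": ["H205", "H205", "H205"],
})

one_column = df["humidity"]             # single name   -> Series
sub_frame = df[["humidity", "room"]]    # list of names -> DataFrame

# Operations run directly on the selection.
mean_humidity = one_column.mean()
print(type(one_column).__name__, type(sub_frame).__name__, mean_humidity)
```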

### Select a subset of columns and show some data

In [14]:
data[["humidity", "room"]].head(5)

Unnamed: 0,humidity,room
0,42.368893,H205
1,42.05996,H205
2,42.528866,H205
3,42.512318,H205
4,42.29166,H205


In [20]:
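# numeric_only=True skips text columns, which newer pandas versions refuse to average
data.groupby("room").mean(numeric_only=True)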

room,Unnamed: 0,id,celsius,pressure,humidity
H106,40057.0,1629604000.0,23.375539,958.82718,44.196729
H108,30313.0,1629605000.0,20.514411,962.083953,48.448743
H109,51864.5,1629604000.0,22.829626,959.914144,46.617265
H110,35174.5,1628609000.0,23.743433,952.593723,44.455246
H111,61622.0,1629604000.0,22.371614,958.800639,46.694685
H113,6017.5,1628609000.0,24.251669,952.799245,45.0869
H203,127035.0,1628609000.0,25.006882,950.855073,42.379556
H205,1089.0,1629604000.0,23.463388,958.658923,44.799518
H206,13593.0,1628609000.0,25.393256,952.078658,39.962568
H207,66628.0,1628609000.0,25.55837,949.129193,40.915445


In [39]:
data[["source", "datetime", "humidity"]].head(5)

Unnamed: 0,source,datetime,humidity
0,0013A20041A94FA3,2021-08-18 09:40:38+00:00,42.368893
1,0013A20041A94FA3,2021-08-18 09:45:42+00:00,42.05996
2,0013A20041A94FA3,2021-08-18 09:50:45+00:00,42.528866
3,0013A20041A94FA3,2021-08-18 09:55:48+00:00,42.512318
4,0013A20041A94FA3,2021-08-18 10:00:52+00:00,42.29166


### Compare two columns for elementwise equality
This is useful, e.g., if you want to decide whether one of the columns can be dropped.

In [8]:
data["source"] != data["sensor"]

0         False
1         False
2         False
3         False
4         False
          ...  
130882    False
130883    False
130884    False
130885    False
130886    False
Length: 130887, dtype: bool

In [12]:
data[ data["source"] != data["sensor"] ].shape[0]

0
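
The number of mismatching rows can also be obtained without building a filtered frame, since summing a boolean Series counts its `True` values. A sketch on a made-up frame with one deliberate mismatch:

```python
import pandas as pd

df = pd.DataFrame({
    "source": ["A1", "A2", "A3"],
    "sensor": ["A1", "XX", "A3"],
})

mismatch = df["source"] != df["sensor"]   # boolean Series
n_mismatch = int(mismatch.sum())          # True counts as 1
all_equal = not mismatch.any()

print(n_mismatch, all_equal)
```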

In [13]:
# The columns are identical if no row differs.
is_equal = (data["source"] == data["sensor"]).all()
if is_equal:
    data.drop(columns=["sensor"], inplace=True)

In [14]:
data.head(5)

Unnamed: 0.1,Unnamed: 0,source,datetime,id,celsius,pressure,humidity,room
0,0,0013A20041A94FA3,2021-08-18 09:40:38+00:00,1629272000.0,23.144531,956.7491,42.368893,H205
1,1,0013A20041A94FA3,2021-08-18 09:45:42+00:00,1629273000.0,23.154882,956.8015,42.05996,H205
2,2,0013A20041A94FA3,2021-08-18 09:50:45+00:00,1629273000.0,23.16504,956.7753,42.528866,H205
3,3,0013A20041A94FA3,2021-08-18 09:55:48+00:00,1629273000.0,23.16504,956.7753,42.512318,H205
4,4,0013A20041A94FA3,2021-08-18 10:00:52+00:00,1629274000.0,23.180664,956.8277,42.29166,H205


In [18]:
data.drop(["Unnamed: 0", "id"], axis=1, inplace=True)

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130887 entries, 0 to 130886
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype              
---  ------    --------------   -----              
 0   source    130887 non-null  object             
 1   datetime  130887 non-null  datetime64[ns, GMT]
 2   celsius   130887 non-null  float64            
 3   pressure  130887 non-null  float64            
 4   humidity  130887 non-null  float64            
 5   room      130887 non-null  object             
dtypes: datetime64[ns, GMT](1), float64(3), object(2)
memory usage: 6.0+ MB


## Addressing using the loc API, i.e. label based addressing
### Use label names or integers for row indices

In [55]:
data.loc[1:5, "id":"room"]

Unnamed: 0,id,celsius,pressure,humidity,room
1,1629273000.0,23.154882,956.8015,42.05996,H205
2,1629273000.0,23.16504,956.7753,42.528866,H205
3,1629273000.0,23.16504,956.7753,42.512318,H205
4,1629274000.0,23.180664,956.8277,42.29166,H205
5,1629274000.0,23.19082,956.8015,42.68883,H205
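
Unlike ordinary Python slices, `loc` slices include both end points, which is why rows 1 through 5 appear above. A minimal sketch with made-up data and a string index, to stress that `loc` works on labels:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "celsius": [23.1, 23.2, 23.3, 23.4],
        "pressure": [956.7, 956.8, 956.9, 957.0],
        "room": ["H205", "H205", "H106", "H106"],
    },
    index=["a", "b", "c", "d"],
)

# Both the row slice and the column slice include their end labels.
subset = df.loc["b":"c", "celsius":"pressure"]
print(subset)
```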


### You can also use a stride to skip elements
Take every third row and every second column

In [56]:
data.loc[1:5:3, "id":"room":2]

Unnamed: 0,id,pressure,room
1,1629273000.0,956.8015,H205
4,1629274000.0,956.8277,H205


## Addressing using the iloc API, i.e. integer based addressing
### Use integers for rows and columns

In [53]:
data.iloc[1:5, 0:6]

Unnamed: 0.1,Unnamed: 0,source,datetime,id,celsius,pressure
1,1,0013A20041A94FA3,2021-08-18 09:45:42+00:00,1629273000.0,23.154882,956.8015
2,2,0013A20041A94FA3,2021-08-18 09:50:45+00:00,1629273000.0,23.16504,956.7753
3,3,0013A20041A94FA3,2021-08-18 09:55:48+00:00,1629273000.0,23.16504,956.7753
4,4,0013A20041A94FA3,2021-08-18 10:00:52+00:00,1629274000.0,23.180664,956.8277


In [54]:
data.iloc[1:5:2, 0:6:2]

Unnamed: 0.1,Unnamed: 0,datetime,celsius
1,1,2021-08-18 09:45:42+00:00,23.154882
3,3,2021-08-18 09:55:48+00:00,23.16504
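
`iloc` follows normal Python slicing, so the end position is excluded, and it ignores index labels entirely. The sketch below uses a made-up frame whose labels start at 10 to make the difference from `loc` visible:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [0, 1, 2, 3, 4, 5],
        "b": [10, 11, 12, 13, 14, 15],
        "c": list("uvwxyz"),
    },
    index=[10, 11, 12, 13, 14, 15],
)

# Positions 1 and 3 (end position 5 excluded); columns at positions 0 and 2.
by_position = df.iloc[1:5:2, 0:3:2]

# Labels 11, 13 and 15 (end label included); columns "a" and "c".
by_label = df.loc[11:15:2, "a":"c":2]

print(by_position)
print(by_label)
```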
