# Luftdaten data: data cleaning, resampling - mini version

# - RETRYING THIS, just to make sure we've got the hang of it!
## Code builds a continuous-time tabular version of the Luftdaten data, such that the same time period is present for each sensor in the data, regardless of whether each sensor has data for all the time slots.

## Testing :
- using pd.resample
- constructing a time shift using pandas' own tools, rather than my own


#### Reference documents

Resampling time series data with Pandas ( Ben Alex Keen ) 
http://benalexkeen.com/resampling-time-series-data-with-pandas/

Pandas reference manual : 

.at - label-based access ( getting/setting ) to df values, for SINGLE VALUES
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html

.iat - only integer index values for getting/setting SINGLE df values
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html

.loc - label-based access ( getting/setting ) to rows and columns
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc

.iloc - purely integer indexed access ( getting/setting ) values 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc

#### methods of filling … 

These are some of the common methods you might use for resampling:

| Method | Description |
|---|---|
| bfill | Backward fill |
| count | Count of values |
| ffill | Forward fill |
| first | First valid data value |
| last | Last valid data value |
| max | Maximum data value |
| mean | Mean of values in time range |
| median | Median of values in time range |
| min | Minimum data value |
| nunique | Number of unique values |
| ohlc | Opening value, highest value, lowest value, closing value |
| pad | Same as forward fill |
| std | Standard deviation of values |
| sum | Sum of values |
| var | Variance of values |

#### time abbreviations 

| Alias | Description |
|---|---|
| B | Business day |
| D | Calendar day |
| W | Weekly |
| M | Month end |
| Q | Quarter end |
| A | Year end |
| BA | Business year end |
| AS | Year start |
| H | Hourly frequency |
| T, min | Minutely frequency |
| S | Secondly frequency |
| L, ms | Millisecond frequency |
| U, us | Microsecond frequency |
| N, ns | Nanosecond frequency |

In [1]:
import pandas as pd
import numpy as np
import time

In [2]:
# parameters

# start_time = "2018-12-31 21:58:42"
end_time = "2019-01-01 11:58:42"
# generate this please
start_time = "?????"

time_frequency_for_periods__for_basic_data = "5T"
num_of_time_periods___for_basic_data = 24*12 # 24 hrs * 12 x 5 mins in each hour

# when generating time periods 
sampling_frequency = "3T"



# --- data urls 

curr_url = "????"
nordic_midnight_24_hrs_data__url = "/Users/miska/Documents/open_something/luftdaten/luftdaten_code/luftdaten__make_tabular_data__from_db_data/ld_NYE_midnight_24hrs_nordics_all_data_01.csv"
# nordic_midnight_24_hrs_data__url = "/home/miska/documents/opensomething/luftdaten/dustmin_to_csv__various_code/ld_NYE_midnight_24hrs_nordics_all_data_01.csv"



# set the current data source 
curr_url = nordic_midnight_24_hrs_data__url
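
The `start_time` placeholder in the parameters could be derived from `end_time` with pandas' own time arithmetic. A minimal sketch, assuming a 24 hour window ( the `window_length` name and its value are assumptions, not something the notebook fixes ) and 5 minute periods:

```python
import pandas as pd

end_time = pd.Timestamp("2019-01-01 11:58:42")

# assumption: the window covers the 24 hours leading up to end_time
window_length = pd.Timedelta(hours=24)
start_time = end_time - window_length

# one entry per 5 minute step, both ends included
# ( "5min" is the same frequency as the "5T" alias )
time_periods = pd.date_range(start=start_time, end=end_time, freq="5min")

print(start_time)
print(len(time_periods))
```

Because both ends are included, the range holds one more entry than the number of 5 minute steps in the window.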

In [3]:
# load the data ( the timestamp column gets converted to a real timestamp further down )

in_data = pd.read_csv( curr_url )
in_data.shape

(127109, 7)

In [4]:
### prepare the data a bit

In [5]:
# what have we got? 
in_data.dtypes

sensor_id         int64
sensor_namee     object
lat             float64
lon             float64
timestamp        object
p1              float64
p2              float64
dtype: object

In [6]:
type( in_data['timestamp'][0] ) 

str

In [7]:
# set the timestamp as a timestamp
in_data['timestamp'] = pd.to_datetime( in_data['timestamp'] )

In [8]:
in_data

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
0,7273,SDS011,60.002,17.846,2018-12-31 11:57:22,3.43,1.56
1,7275,SDS011,57.720,11.888,2018-12-31 11:58:44,482.77,33.82
2,7277,SDS011,59.266,15.230,2018-12-31 11:58:47,5.48,2.47
3,7406,SDS011,56.964,24.128,2018-12-31 11:56:41,11.05,6.62
4,7428,SDS011,59.868,17.624,2018-12-31 11:57:42,1.78,1.02
5,7469,SDS011,56.944,24.142,2018-12-31 11:57:52,8.40,5.54
6,7597,SDS011,59.320,18.064,2018-12-31 11:58:51,3.68,2.00
7,8683,SDS011,59.744,18.206,2018-12-31 11:58:28,3.01,2.80
8,9411,SDS011,59.266,15.230,2018-12-31 11:57:18,3.44,2.18
9,9436,SDS011,59.334,18.034,2018-12-31 11:57:22,2.12,1.90


In [9]:
type( in_data['timestamp'][0] )

pandas._libs.tslibs.timestamps.Timestamp

#### slice things up a bit 

In [10]:
in_data__start = in_data[:10]

In [11]:
in_data__start

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
0,7273,SDS011,60.002,17.846,2018-12-31 11:57:22,3.43,1.56
1,7275,SDS011,57.72,11.888,2018-12-31 11:58:44,482.77,33.82
2,7277,SDS011,59.266,15.23,2018-12-31 11:58:47,5.48,2.47
3,7406,SDS011,56.964,24.128,2018-12-31 11:56:41,11.05,6.62
4,7428,SDS011,59.868,17.624,2018-12-31 11:57:42,1.78,1.02
5,7469,SDS011,56.944,24.142,2018-12-31 11:57:52,8.4,5.54
6,7597,SDS011,59.32,18.064,2018-12-31 11:58:51,3.68,2.0
7,8683,SDS011,59.744,18.206,2018-12-31 11:58:28,3.01,2.8
8,9411,SDS011,59.266,15.23,2018-12-31 11:57:18,3.44,2.18
9,9436,SDS011,59.334,18.034,2018-12-31 11:57:22,2.12,1.9


In [12]:
in_data__middle = in_data[60000:60010]

In [13]:
in_data__middle

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
60000,12679,SDS011,59.384,17.874,2018-12-31 16:52:43,0.4,0.4
60001,12687,SDS011,59.388,17.798,2018-12-31 16:52:21,2.83,1.9
60002,12691,SDS011,57.636,18.304,2018-12-31 16:51:17,2.4,1.7
60003,12693,SDS011,58.19,12.72,2018-12-31 16:51:03,6.42,2.62
60004,14017,SDS011,59.376,18.01,2018-12-31 16:51:10,4.07,1.94
60005,14133,SDS011,59.364,18.018,2018-12-31 16:50:49,4.14,2.38
60006,14209,SDS011,56.07,12.698,2018-12-31 16:51:35,28.52,13.48
60007,14264,SDS011,57.654,11.88,2018-12-31 16:51:51,26.92,8.46
60008,14276,SDS011,57.346,12.15,2018-12-31 16:50:46,72.48,20.16
60009,14278,SDS011,59.374,18.01,2018-12-31 16:51:31,4.64,2.3


In [14]:
in_data__end = in_data[-10:]
in_data__end

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
127099,13020,SDS011,57.722,11.948,2019-01-01 11:59:41,20.64,3.67
127100,16147,SDS011,59.364,18.018,2019-01-01 11:59:46,3.28,1.8
127101,16153,SDS011,55.648,13.208,2019-01-01 11:57:19,20.0,3.9
127102,16296,SDS011,56.144,13.394,2019-01-01 11:59:59,23.86,7.45
127103,16533,SDS011,55.722,13.202,2019-01-01 11:56:55,18.05,4.33
127104,16723,SDS011,57.736,11.894,2019-01-01 11:58:57,16.47,3.4
127105,16815,SDS011,59.462,18.04,2019-01-01 11:59:36,2.67,1.97
127106,17235,SDS011,59.272,17.78,2019-01-01 11:59:41,4.69,1.82
127107,10588,SDS011,55.676,13.346,2019-01-01 11:57:12,14.08,3.68
127108,10647,SDS011,55.608,13.036,2019-01-01 11:58:42,23.42,4.75


#### try putting the different bits together, so you can interpolate some data… 

In [15]:
# DataFrame.append was removed in pandas 2.0, pd.concat does the same job
in_data__beginning_and_middle_and_end = pd.concat( [ in_data__start, in_data__middle, in_data__end ] )

In [16]:
in_data__beginning_and_middle_and_end

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
0,7273,SDS011,60.002,17.846,2018-12-31 11:57:22,3.43,1.56
1,7275,SDS011,57.72,11.888,2018-12-31 11:58:44,482.77,33.82
2,7277,SDS011,59.266,15.23,2018-12-31 11:58:47,5.48,2.47
3,7406,SDS011,56.964,24.128,2018-12-31 11:56:41,11.05,6.62
4,7428,SDS011,59.868,17.624,2018-12-31 11:57:42,1.78,1.02
5,7469,SDS011,56.944,24.142,2018-12-31 11:57:52,8.4,5.54
6,7597,SDS011,59.32,18.064,2018-12-31 11:58:51,3.68,2.0
7,8683,SDS011,59.744,18.206,2018-12-31 11:58:28,3.01,2.8
8,9411,SDS011,59.266,15.23,2018-12-31 11:57:18,3.44,2.18
9,9436,SDS011,59.334,18.034,2018-12-31 11:57:22,2.12,1.9


#### now try interpolate the data

In [17]:
# first check that the different datatypes, eg the timestamps, are in order, and that the timestamps are the index…

In [18]:
# datatypes
in_data__beginning_and_middle_and_end.dtypes

sensor_id                int64
sensor_namee            object
lat                    float64
lon                    float64
timestamp       datetime64[ns]
p1                     float64
p2                     float64
dtype: object

In [19]:
type( in_data__beginning_and_middle_and_end['timestamp'][0] ) 

pandas._libs.tslibs.timestamps.Timestamp

In [20]:
# ok, that looks ok, 

In [21]:
# but what about the index? 
in_data__beginning_and_middle_and_end.index

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,  60000,  60001,  60002,  60003,  60004,  60005,
             60006,  60007,  60008,  60009, 127099, 127100, 127101, 127102,
            127103, 127104, 127105, 127106, 127107, 127108],
           dtype='int64')

In [22]:
# alas, this doesn't do the trick: set_index returns a new DataFrame, so the result has to be assigned to a variable…
in_data__beginning_and_middle_and_end.set_index( 'timestamp' )

Unnamed: 0_level_0,sensor_id,sensor_namee,lat,lon,p1,p2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2018-12-31 11:57:22,7273,SDS011,60.002,17.846,3.43,1.56
2018-12-31 11:58:44,7275,SDS011,57.72,11.888,482.77,33.82
2018-12-31 11:58:47,7277,SDS011,59.266,15.23,5.48,2.47
2018-12-31 11:56:41,7406,SDS011,56.964,24.128,11.05,6.62
2018-12-31 11:57:42,7428,SDS011,59.868,17.624,1.78,1.02
2018-12-31 11:57:52,7469,SDS011,56.944,24.142,8.4,5.54
2018-12-31 11:58:51,7597,SDS011,59.32,18.064,3.68,2.0
2018-12-31 11:58:28,8683,SDS011,59.744,18.206,3.01,2.8
2018-12-31 11:57:18,9411,SDS011,59.266,15.23,3.44,2.18
2018-12-31 11:57:22,9436,SDS011,59.334,18.034,2.12,1.9


In [23]:
in_data__beginning_and_middle_and_end.index

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,  60000,  60001,  60002,  60003,  60004,  60005,
             60006,  60007,  60008,  60009, 127099, 127100, 127101, 127102,
            127103, 127104, 127105, 127106, 127107, 127108],
           dtype='int64')

In [24]:
# this should work … 
in_data__beginning_and_middle_and_end = in_data__beginning_and_middle_and_end.set_index( 'timestamp' )


In [25]:
in_data__beginning_and_middle_and_end.index

DatetimeIndex(['2018-12-31 11:57:22', '2018-12-31 11:58:44',
               '2018-12-31 11:58:47', '2018-12-31 11:56:41',
               '2018-12-31 11:57:42', '2018-12-31 11:57:52',
               '2018-12-31 11:58:51', '2018-12-31 11:58:28',
               '2018-12-31 11:57:18', '2018-12-31 11:57:22',
               '2018-12-31 16:52:43', '2018-12-31 16:52:21',
               '2018-12-31 16:51:17', '2018-12-31 16:51:03',
               '2018-12-31 16:51:10', '2018-12-31 16:50:49',
               '2018-12-31 16:51:35', '2018-12-31 16:51:51',
               '2018-12-31 16:50:46', '2018-12-31 16:51:31',
               '2019-01-01 11:59:41', '2019-01-01 11:59:46',
               '2019-01-01 11:57:19', '2019-01-01 11:59:59',
               '2019-01-01 11:56:55', '2019-01-01 11:58:57',
               '2019-01-01 11:59:36', '2019-01-01 11:59:41',
               '2019-01-01 11:57:12', '2019-01-01 11:58:42'],
              dtype='datetime64[ns]', name='timestamp', freq=None)
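
The behaviour seen in the cells above ( `set_index` leaves the original frame alone and hands back a new one ) can be reproduced on a tiny made-up frame. This is only a sketch, and the `frame` and `indexed` names are illustrative:

```python
import pandas as pd

frame = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-12-31 11:57:22", "2018-12-31 11:58:44"]),
    "p1": [3.43, 482.77],
})

# set_index returns a new frame, the original keeps its integer index
indexed = frame.set_index("timestamp")

original_is_unchanged = not isinstance(frame.index, pd.DatetimeIndex)
result_has_time_index = isinstance(indexed.index, pd.DatetimeIndex)
print(original_is_unchanged, result_has_time_index)
```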

In [26]:
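# now try RESAMPLING! ( only the numeric columns, the text column sensor_namee can't be averaged )
numeric_columns = [ 'sensor_id', 'lat', 'lon', 'p1', 'p2' ]
in_data__beginning_and_middle_and_end__RESAMPLED = in_data__beginning_and_middle_and_end[ numeric_columns ].resample("5Min").mean().bfill()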

In [27]:
in_data__beginning_and_middle_and_end__RESAMPLED

Unnamed: 0_level_0,sensor_id,lat,lon,p1,p2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2018-12-31 11:55:00,7925.5,58.8428,18.0392,52.516,5.991
2018-12-31 12:00:00,13592.7,58.3782,15.7462,15.282,5.534
2018-12-31 12:05:00,13592.7,58.3782,15.7462,15.282,5.534
2018-12-31 12:10:00,13592.7,58.3782,15.7462,15.282,5.534
2018-12-31 12:15:00,13592.7,58.3782,15.7462,15.282,5.534
2018-12-31 12:20:00,13592.7,58.3782,15.7462,15.282,5.534
2018-12-31 12:25:00,13592.7,58.3782,15.7462,15.282,5.534
2018-12-31 12:30:00,13592.7,58.3782,15.7462,15.282,5.534
2018-12-31 12:35:00,13592.7,58.3782,15.7462,15.282,5.534
2018-12-31 12:40:00,13592.7,58.3782,15.7462,15.282,5.534
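
Note that the resample above averages all sensors together, so `sensor_id`, `lat` and `lon` become means across sensors. For the goal stated at the top ( the same time slots for every sensor ), one option is to group by sensor before binning. A small sketch on made-up readings; `readings`, `grid` and `table` are illustrative names, and the approach is one possibility rather than the notebook's settled method:

```python
import pandas as pd

readings = pd.DataFrame({
    "sensor_id": [1, 1, 2],
    "timestamp": pd.to_datetime([
        "2019-01-01 00:01:00", "2019-01-01 00:12:00", "2019-01-01 00:06:00",
    ]),
    "p1": [10.0, 30.0, 5.0],
})

# the time slots every sensor should end up with
grid = pd.date_range("2019-01-01 00:00:00", "2019-01-01 00:10:00", freq="5min")

table = (
    readings
    .groupby(["sensor_id", pd.Grouper(key="timestamp", freq="5min")])["p1"]
    .mean()                  # mean per sensor and 5 minute bin
    .unstack("sensor_id")    # one column per sensor
    .reindex(grid)           # same time slots for all, NaN where a sensor has no data
)
print(table)
```

Each sensor keeps its own column, and a slot without readings stays NaN until a fill method is chosen.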


#### now let's try making a single data row, of the kind in the data we have

In [28]:
# approach #1 - copy an existing row

In [29]:
single_row = in_data.iloc[0]
single_row

sensor_id                      7273
sensor_namee                 SDS011
lat                          60.002
lon                          17.846
timestamp       2018-12-31 11:57:22
p1                             3.43
p2                             1.56
Name: 0, dtype: object

In [30]:
type( single_row )

pandas.core.series.Series

In [31]:
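# testing if one could just add the single row, which turns out to be a Series,
# to in_data
# YES, it looks like it's possible
# ( DataFrame.append was removed in pandas 2.0, so the Series is turned into a one-row frame for pd.concat )
pd.concat( [ in_data, single_row.to_frame().T ] )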

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
0,7273,SDS011,60.002,17.846,2018-12-31 11:57:22,3.43,1.56
1,7275,SDS011,57.720,11.888,2018-12-31 11:58:44,482.77,33.82
2,7277,SDS011,59.266,15.230,2018-12-31 11:58:47,5.48,2.47
3,7406,SDS011,56.964,24.128,2018-12-31 11:56:41,11.05,6.62
4,7428,SDS011,59.868,17.624,2018-12-31 11:57:42,1.78,1.02
5,7469,SDS011,56.944,24.142,2018-12-31 11:57:52,8.40,5.54
6,7597,SDS011,59.320,18.064,2018-12-31 11:58:51,3.68,2.00
7,8683,SDS011,59.744,18.206,2018-12-31 11:58:28,3.01,2.80
8,9411,SDS011,59.266,15.230,2018-12-31 11:57:18,3.44,2.18
9,9436,SDS011,59.334,18.034,2018-12-31 11:57:22,2.12,1.90


In [32]:
# let's try modifying the line 
single_row['timestamp']

Timestamp('2018-12-31 11:57:22')

In [33]:
single_row['p1']

3.43

In [34]:
len( single_row )

7

In [35]:
single_row

sensor_id                      7273
sensor_namee                 SDS011
lat                          60.002
lon                          17.846
timestamp       2018-12-31 11:57:22
p1                             3.43
p2                             1.56
Name: 0, dtype: object

In [36]:
single_row.index

Index(['sensor_id', 'sensor_namee', 'lat', 'lon', 'timestamp', 'p1', 'p2'], dtype='object')

In [37]:
single_row.p1

3.43

In [38]:
single_row.p1 * 0

0.0

In [39]:
# np.int was removed in NumPy 1.24, a plain int does the same
single_row['p1'] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [40]:
# this fails: single_row['p1'] is a plain number, not a Series, so it has no .iloc
single_row['p1'].iloc[0] = list( range( len( single_row.index ) ) )

AttributeError: 'int' object has no attribute 'iloc'

In [41]:
in_data_slice = in_data[:10]

In [42]:
in_data_slice

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
0,7273,SDS011,60.002,17.846,2018-12-31 11:57:22,3.43,1.56
1,7275,SDS011,57.72,11.888,2018-12-31 11:58:44,482.77,33.82
2,7277,SDS011,59.266,15.23,2018-12-31 11:58:47,5.48,2.47
3,7406,SDS011,56.964,24.128,2018-12-31 11:56:41,11.05,6.62
4,7428,SDS011,59.868,17.624,2018-12-31 11:57:42,1.78,1.02
5,7469,SDS011,56.944,24.142,2018-12-31 11:57:52,8.4,5.54
6,7597,SDS011,59.32,18.064,2018-12-31 11:58:51,3.68,2.0
7,8683,SDS011,59.744,18.206,2018-12-31 11:58:28,3.01,2.8
8,9411,SDS011,59.266,15.23,2018-12-31 11:57:18,3.44,2.18
9,9436,SDS011,59.334,18.034,2018-12-31 11:57:22,2.12,1.9


In [43]:
in_data_slice.iloc[0]['p1']

3.43

In [44]:
in_data_slice.iloc[:1]['p1'] = in_data_slice.iloc[:1]['p1'] * 0 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [45]:
in_data_slice

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
0,7273,SDS011,60.002,17.846,2018-12-31 11:57:22,0.0,1.56
1,7275,SDS011,57.72,11.888,2018-12-31 11:58:44,482.77,33.82
2,7277,SDS011,59.266,15.23,2018-12-31 11:58:47,5.48,2.47
3,7406,SDS011,56.964,24.128,2018-12-31 11:56:41,11.05,6.62
4,7428,SDS011,59.868,17.624,2018-12-31 11:57:42,1.78,1.02
5,7469,SDS011,56.944,24.142,2018-12-31 11:57:52,8.4,5.54
6,7597,SDS011,59.32,18.064,2018-12-31 11:58:51,3.68,2.0
7,8683,SDS011,59.744,18.206,2018-12-31 11:58:28,3.01,2.8
8,9411,SDS011,59.266,15.23,2018-12-31 11:57:18,3.44,2.18
9,9436,SDS011,59.334,18.034,2018-12-31 11:57:22,2.12,1.9
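
The warning above comes from assigning through two indexing steps ( `.iloc[:1]` and then `['p1']` ) on a slice of another frame. A hedged sketch of the pattern the warning text itself suggests, an explicit copy followed by a single `.loc` assignment, on made-up data ( the `frame` and `frame_slice` names are illustrative ):

```python
import pandas as pd

frame = pd.DataFrame({"p1": [3.43, 482.77, 5.48], "p2": [1.56, 33.82, 2.47]})

# take an explicit copy, so it is clear which object gets modified
frame_slice = frame.iloc[:2].copy()

# one indexing step: row label and column label together
frame_slice.loc[frame_slice.index[0], "p1"] = 0.0

print(frame_slice)
print(frame)
```

The copy is changed and the original frame keeps its value.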


#### let's try using .at
"Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single value in a DataFrame or Series."
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html


In [46]:
in_data_slice.at[ 0, 'p1' ] = -99

In [47]:
in_data_slice

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
0,7273,SDS011,60.002,17.846,2018-12-31 11:57:22,-99.0,1.56
1,7275,SDS011,57.72,11.888,2018-12-31 11:58:44,482.77,33.82
2,7277,SDS011,59.266,15.23,2018-12-31 11:58:47,5.48,2.47
3,7406,SDS011,56.964,24.128,2018-12-31 11:56:41,11.05,6.62
4,7428,SDS011,59.868,17.624,2018-12-31 11:57:42,1.78,1.02
5,7469,SDS011,56.944,24.142,2018-12-31 11:57:52,8.4,5.54
6,7597,SDS011,59.32,18.064,2018-12-31 11:58:51,3.68,2.0
7,8683,SDS011,59.744,18.206,2018-12-31 11:58:28,3.01,2.8
8,9411,SDS011,59.266,15.23,2018-12-31 11:57:18,3.44,2.18
9,9436,SDS011,59.334,18.034,2018-12-31 11:57:22,2.12,1.9


In [48]:
in_data_slice.columns

Index(['sensor_id', 'sensor_namee', 'lat', 'lon', 'timestamp', 'p1', 'p2'], dtype='object')

In [49]:
# try fetching a value
in_data_slice.at[ 0, "sensor_namee" ]

'SDS011'

In [50]:
# now try setting it 
in_data_slice.at[ 0, "sensor_namee" ] = "SDS011_rules"

In [51]:
in_data_slice

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
0,7273,SDS011_rules,60.002,17.846,2018-12-31 11:57:22,-99.0,1.56
1,7275,SDS011,57.72,11.888,2018-12-31 11:58:44,482.77,33.82
2,7277,SDS011,59.266,15.23,2018-12-31 11:58:47,5.48,2.47
3,7406,SDS011,56.964,24.128,2018-12-31 11:56:41,11.05,6.62
4,7428,SDS011,59.868,17.624,2018-12-31 11:57:42,1.78,1.02
5,7469,SDS011,56.944,24.142,2018-12-31 11:57:52,8.4,5.54
6,7597,SDS011,59.32,18.064,2018-12-31 11:58:51,3.68,2.0
7,8683,SDS011,59.744,18.206,2018-12-31 11:58:28,3.01,2.8
8,9411,SDS011,59.266,15.23,2018-12-31 11:57:18,3.44,2.18
9,9436,SDS011,59.334,18.034,2018-12-31 11:57:22,2.12,1.9


In [52]:
# now try setting a timestamp

# - first try making one 
curr_timestamp = pd.Timestamp( time.ctime() ) 
curr_timestamp

Timestamp('2019-02-03 09:18:08')

In [53]:
curr_timestamp = pd.Timestamp( time.ctime() ) 
curr_timestamp

Timestamp('2019-02-03 09:18:11')

In [54]:
type( curr_timestamp )

pandas._libs.tslibs.timestamps.Timestamp

In [55]:
# now let's try setting a timestamp in the data
in_data_slice.at[0, 'timestamp'] = curr_timestamp

In [56]:
in_data_slice

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
0,7273,SDS011_rules,60.002,17.846,2019-02-03 09:18:11,-99.0,1.56
1,7275,SDS011,57.72,11.888,2018-12-31 11:58:44,482.77,33.82
2,7277,SDS011,59.266,15.23,2018-12-31 11:58:47,5.48,2.47
3,7406,SDS011,56.964,24.128,2018-12-31 11:56:41,11.05,6.62
4,7428,SDS011,59.868,17.624,2018-12-31 11:57:42,1.78,1.02
5,7469,SDS011,56.944,24.142,2018-12-31 11:57:52,8.4,5.54
6,7597,SDS011,59.32,18.064,2018-12-31 11:58:51,3.68,2.0
7,8683,SDS011,59.744,18.206,2018-12-31 11:58:28,3.01,2.8
8,9411,SDS011,59.266,15.23,2018-12-31 11:57:18,3.44,2.18
9,9436,SDS011,59.334,18.034,2018-12-31 11:57:22,2.12,1.9



### could this new slice be used for resampling-interpolation? 

#### let's try…

In [57]:
in_data_end02 = in_data[-10:]
in_data_end02

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
127099,13020,SDS011,57.722,11.948,2019-01-01 11:59:41,20.64,3.67
127100,16147,SDS011,59.364,18.018,2019-01-01 11:59:46,3.28,1.8
127101,16153,SDS011,55.648,13.208,2019-01-01 11:57:19,20.0,3.9
127102,16296,SDS011,56.144,13.394,2019-01-01 11:59:59,23.86,7.45
127103,16533,SDS011,55.722,13.202,2019-01-01 11:56:55,18.05,4.33
127104,16723,SDS011,57.736,11.894,2019-01-01 11:58:57,16.47,3.4
127105,16815,SDS011,59.462,18.04,2019-01-01 11:59:36,2.67,1.97
127106,17235,SDS011,59.272,17.78,2019-01-01 11:59:41,4.69,1.82
127107,10588,SDS011,55.676,13.346,2019-01-01 11:57:12,14.08,3.68
127108,10647,SDS011,55.608,13.036,2019-01-01 11:58:42,23.42,4.75


In [58]:
new_later_timestamp = pd.Timestamp( 2019, 1, 1, 15, 0, 0 )
new_later_timestamp

Timestamp('2019-01-01 15:00:00')

In [59]:
new_last_data_row = pd.DataFrame( in_data_end02[-2:] ) 
new_last_data_row

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
127107,10588,SDS011,55.676,13.346,2019-01-01 11:57:12,14.08,3.68
127108,10647,SDS011,55.608,13.036,2019-01-01 11:58:42,23.42,4.75


In [60]:
new_last_data_row.shape

(2, 7)

In [61]:
type( new_last_data_row )

pandas.core.frame.DataFrame

In [62]:
new_last_data_row.shape

(2, 7)

In [63]:
new_last_data_row.columns

Index(['sensor_id', 'sensor_namee', 'lat', 'lon', 'timestamp', 'p1', 'p2'], dtype='object')

In [64]:
new_last_data_row.iloc[0].index

Index(['sensor_id', 'sensor_namee', 'lat', 'lon', 'timestamp', 'p1', 'p2'], dtype='object')

In [65]:
new_last_data_row.iat[ 0, 0 ] = -9999

In [66]:
new_last_data_row

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
127107,-9999,SDS011,55.676,13.346,2019-01-01 11:57:12,14.08,3.68
127108,10647,SDS011,55.608,13.036,2019-01-01 11:58:42,23.42,4.75


In [67]:
# NOTE : .at needs to use the index labels of the DF, not the row count…
# ok, let's see if we can get .at working again…
new_last_data_row.at[ 127107, 'p1' ]

14.08
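
The difference between label-based `.at` and position-based `.iat` shows up as soon as the index does not start at 0, as in the slice above. A minimal sketch on a made-up two-row frame that reuses the same index labels:

```python
import pandas as pd

frame = pd.DataFrame({"p1": [14.08, 23.42]}, index=[127107, 127108])

value_by_label = frame.at[127107, "p1"]   # row label, column label
value_by_position = frame.iat[0, 0]       # row position, column position

# the label 0 is not in this index, so frame.at[0, "p1"] would raise a KeyError
label_zero_exists = 0 in frame.index

print(value_by_label, value_by_position, label_zero_exists)
```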

In [68]:
new_last_data_row__SINGLE_ROW_AS_SERIES = new_last_data_row.iloc[0]

In [69]:
new_last_data_row__SINGLE_ROW_AS_SERIES

sensor_id                     -9999
sensor_namee                 SDS011
lat                          55.676
lon                          13.346
timestamp       2019-01-01 11:57:12
p1                            14.08
p2                             3.68
Name: 127107, dtype: object

In [70]:
type( new_last_data_row__SINGLE_ROW_AS_SERIES )

pandas.core.series.Series

In [71]:
new_last_data_row__SINGLE_ROW_AS_SERIES.at['timestamp']

Timestamp('2019-01-01 11:57:12')

In [72]:
# HOWEVER, .at does work on single series, directly!
new_last_data_row__SINGLE_ROW_AS_SERIES.at['timestamp'] = new_later_timestamp

In [73]:
new_later_timestamp

Timestamp('2019-01-01 15:00:00')

In [74]:
new_last_data_row__SINGLE_ROW_AS_SERIES

sensor_id                     -9999
sensor_namee                 SDS011
lat                          55.676
lon                          13.346
timestamp       2019-01-01 15:00:00
p1                            14.08
p2                             3.68
Name: 127107, dtype: object

In [75]:
# this does NOT work for setting timestamps by row number: .iloc[0] hands back a copy of the row, so the frame itself stays unchanged ( see the next cell )
new_last_data_row.iloc[0].at['timestamp'] = new_later_timestamp

In [76]:
new_last_data_row

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
127107,-9999,SDS011,55.676,13.346,2019-01-01 11:57:12,14.08,3.68
127108,10647,SDS011,55.608,13.036,2019-01-01 11:58:42,23.42,4.75
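
To set a single value by row number while naming the column, one possibility is to translate the column name into a position and do the whole assignment in one `.iloc` step. A sketch on made-up data ( `frame` and `later` are illustrative names ), not necessarily the approach the notebook ends up using:

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2019-01-01 11:57:12", "2019-01-01 11:58:42"]),
        "p1": [14.08, 23.42],
    },
    index=[127107, 127108],
)
later = pd.Timestamp("2019-01-01 15:00:00")

# row by position, column by position ( looked up from its name )
frame.iloc[0, frame.columns.get_loc("timestamp")] = later

print(frame)
```

An equivalent label-based form would be `frame.at[frame.index[0], "timestamp"] = later`.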


In [77]:
# in_data_end02__plus_later_timestamp = in_data_end02.append( )

In [78]:
new_last_data_row.iloc[0]

sensor_id                     -9999
sensor_namee                 SDS011
lat                          55.676
lon                          13.346
timestamp       2019-01-01 11:57:12
p1                            14.08
p2                             3.68
Name: 127107, dtype: object

In [79]:
new_last_data_row.iloc[0]  = new_last_data_row__SINGLE_ROW_AS_SERIES

In [80]:
new_last_data_row

Unnamed: 0,sensor_id,sensor_namee,lat,lon,timestamp,p1,p2
127107,-9999,SDS011,55.676,13.346,2019-01-01 15:00:00,14.08,3.68
127108,10647,SDS011,55.608,13.036,2019-01-01 11:58:42,23.42,4.75


In [81]:
# this fails: resample needs a DatetimeIndex, but the timestamps are still an ordinary column here
new_last_data_rows__RESAMPLED = new_last_data_row.resample("5Min").mean()

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'

In [82]:
type( new_last_data_row['timestamp'].iat[ 0 ] ) 

pandas._libs.tslibs.timestamps.Timestamp

In [83]:
new_last_data_row.dtypes

sensor_id                int64
sensor_namee            object
lat                    float64
lon                    float64
timestamp       datetime64[ns]
p1                     float64
p2                     float64
dtype: object

In [84]:
new_last_data_row.index

RangeIndex(start=127107, stop=127109, step=1)

In [88]:
new_last_data_row = new_last_data_row.set_index( 'timestamp' )

In [89]:
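# only the numeric columns, the text column sensor_namee can't be averaged
new_last_data_rows__RESAMPLED = new_last_data_row[ [ 'sensor_id', 'lat', 'lon', 'p1', 'p2' ] ].resample("5Min").mean()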

In [90]:
new_last_data_rows__RESAMPLED

Unnamed: 0_level_0,sensor_id,lat,lon,p1,p2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-01-01 11:55:00,10647.0,55.608,13.036,23.42,4.75
2019-01-01 12:00:00,,,,,
2019-01-01 12:05:00,,,,,
2019-01-01 12:10:00,,,,,
2019-01-01 12:15:00,,,,,
2019-01-01 12:20:00,,,,,
2019-01-01 12:25:00,,,,,
2019-01-01 12:30:00,,,,,
2019-01-01 12:35:00,,,,,
2019-01-01 12:40:00,,,,,


In [91]:
# let's try filling the NaN values … 

In [92]:
# ok - that was simple 
new_last_data_rows__RESAMPLED__FILLed_NA = new_last_data_rows__RESAMPLED.fillna( 0 )
new_last_data_rows__RESAMPLED__FILLed_NA

Unnamed: 0_level_0,sensor_id,lat,lon,p1,p2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-01-01 11:55:00,10647.0,55.608,13.036,23.42,4.75
2019-01-01 12:00:00,0.0,0.0,0.0,0.0,0.0
2019-01-01 12:05:00,0.0,0.0,0.0,0.0,0.0
2019-01-01 12:10:00,0.0,0.0,0.0,0.0,0.0
2019-01-01 12:15:00,0.0,0.0,0.0,0.0,0.0
2019-01-01 12:20:00,0.0,0.0,0.0,0.0,0.0
2019-01-01 12:25:00,0.0,0.0,0.0,0.0,0.0
2019-01-01 12:30:00,0.0,0.0,0.0,0.0,0.0
2019-01-01 12:35:00,0.0,0.0,0.0,0.0,0.0
2019-01-01 12:40:00,0.0,0.0,0.0,0.0,0.0
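
Filling with 0 puts zero readings into every empty slot, which may or may not be what is wanted for sensor data. For comparison, a small sketch of two other fill methods on a made-up series; the values and the `series` name are illustrative only:

```python
import numpy as np
import pandas as pd

series = pd.Series(
    [23.42, np.nan, np.nan, 14.08],
    index=pd.date_range("2019-01-01 11:55:00", periods=4, freq="5min"),
)

zero_filled = series.fillna(0)                     # empty slots become 0
forward_filled = series.ffill()                    # repeat the last known value
interpolated = series.interpolate(method="time")   # straight line between known values

print(pd.DataFrame({
    "zero": zero_filled,
    "ffill": forward_filled,
    "interpolated": interpolated,
}))
```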


#### remove the NaN using .where 
#### as documented here 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html



In [93]:
# comparing with the string 'NaN' is False everywhere, so .where replaces EVERY value with 1, not just the NaN
new_last_data_rows__RESAMPLED.where( new_last_data_rows__RESAMPLED == 'NaN', 1 )

Unnamed: 0_level_0,sensor_id,lat,lon,p1,p2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-01-01 11:55:00,1.0,1.0,1.0,1.0,1.0
2019-01-01 12:00:00,1.0,1.0,1.0,1.0,1.0
2019-01-01 12:05:00,1.0,1.0,1.0,1.0,1.0
2019-01-01 12:10:00,1.0,1.0,1.0,1.0,1.0
2019-01-01 12:15:00,1.0,1.0,1.0,1.0,1.0
2019-01-01 12:20:00,1.0,1.0,1.0,1.0,1.0
2019-01-01 12:25:00,1.0,1.0,1.0,1.0,1.0
2019-01-01 12:30:00,1.0,1.0,1.0,1.0,1.0
2019-01-01 12:35:00,1.0,1.0,1.0,1.0,1.0
2019-01-01 12:40:00,1.0,1.0,1.0,1.0,1.0
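
The all-1.0 result above follows from how NaN compares: it is never equal to anything, so an `==` test cannot find it. A sketch of the usual alternative, testing with `notna()` ( or `isna()` ) instead, on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"p1": [23.42, np.nan], "p2": [4.75, np.nan]})

# an equality test never matches NaN
any_equal_to_nan = bool((frame == np.nan).to_numpy().any())

# .where keeps values where the condition is True and replaces the rest
only_nan_replaced = frame.where(frame.notna(), 1)

print(any_equal_to_nan)
print(only_nan_replaced)
```

Here the real readings survive and only the missing cells are set to 1.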


In [94]:
# same test on one column only: NaN never compares equal to anything, not even np.nan, so every value is replaced by 1
new_last_data_rows__RESAMPLED['p1'].where( new_last_data_rows__RESAMPLED['p1'] == np.nan, 1 )

timestamp
2019-01-01 11:55:00    1.0
2019-01-01 12:00:00    1.0
2019-01-01 12:05:00    1.0
2019-01-01 12:10:00    1.0
2019-01-01 12:15:00    1.0
2019-01-01 12:20:00    1.0
2019-01-01 12:25:00    1.0
2019-01-01 12:30:00    1.0
2019-01-01 12:35:00    1.0
2019-01-01 12:40:00    1.0
2019-01-01 12:45:00    1.0
2019-01-01 12:50:00    1.0
2019-01-01 12:55:00    1.0
2019-01-01 13:00:00    1.0
2019-01-01 13:05:00    1.0
2019-01-01 13:10:00    1.0
2019-01-01 13:15:00    1.0
2019-01-01 13:20:00    1.0
2019-01-01 13:25:00    1.0
2019-01-01 13:30:00    1.0
2019-01-01 13:35:00    1.0
2019-01-01 13:40:00    1.0
2019-01-01 13:45:00    1.0
2019-01-01 13:50:00    1.0
2019-01-01 13:55:00    1.0
2019-01-01 14:00:00    1.0
2019-01-01 14:05:00    1.0
2019-01-01 14:10:00    1.0
2019-01-01 14:15:00    1.0
2019-01-01 14:20:00    1.0
2019-01-01 14:25:00    1.0
2019-01-01 14:30:00    1.0
2019-01-01 14:35:00    1.0
2019-01-01 14:40:00    1.0
2019-01-01 14:45:00    1.0
2019-01-01 14:50:00    1.0
...

In [95]:
# try extracting one column and only working on that one
extracted_p1_col = new_last_data_rows__RESAMPLED['p1']

In [96]:
extracted_p1_col

timestamp
2019-01-01 11:55:00    23.42
2019-01-01 12:00:00      NaN
2019-01-01 12:05:00      NaN
2019-01-01 12:10:00      NaN
2019-01-01 12:15:00      NaN
2019-01-01 12:20:00      NaN
2019-01-01 12:25:00      NaN
2019-01-01 12:30:00      NaN
2019-01-01 12:35:00      NaN
2019-01-01 12:40:00      NaN
2019-01-01 12:45:00      NaN
2019-01-01 12:50:00      NaN
2019-01-01 12:55:00      NaN
2019-01-01 13:00:00      NaN
2019-01-01 13:05:00      NaN
2019-01-01 13:10:00      NaN
2019-01-01 13:15:00      NaN
2019-01-01 13:20:00      NaN
2019-01-01 13:25:00      NaN
2019-01-01 13:30:00      NaN
2019-01-01 13:35:00      NaN
2019-01-01 13:40:00      NaN
2019-01-01 13:45:00      NaN
2019-01-01 13:50:00      NaN
2019-01-01 13:55:00      NaN
2019-01-01 14:00:00      NaN
2019-01-01 14:05:00      NaN
2019-01-01 14:10:00      NaN
2019-01-01 14:15:00      NaN
2019-01-01 14:20:00      NaN
2019-01-01 14:25:00      NaN
2019-01-01 14:30:00      NaN
2019-01-01 14:35:00      NaN
2019-01-01 14:40:00      NaN
...

In [97]:
# working on only a single column ( the == np.nan test is still False everywhere )
extracted_p1_col.where( extracted_p1_col == np.nan, 1)

timestamp
2019-01-01 11:55:00    1.0
2019-01-01 12:00:00    1.0
2019-01-01 12:05:00    1.0
2019-01-01 12:10:00    1.0
2019-01-01 12:15:00    1.0
2019-01-01 12:20:00    1.0
2019-01-01 12:25:00    1.0
2019-01-01 12:30:00    1.0
2019-01-01 12:35:00    1.0
2019-01-01 12:40:00    1.0
2019-01-01 12:45:00    1.0
2019-01-01 12:50:00    1.0
2019-01-01 12:55:00    1.0
2019-01-01 13:00:00    1.0
2019-01-01 13:05:00    1.0
2019-01-01 13:10:00    1.0
2019-01-01 13:15:00    1.0
2019-01-01 13:20:00    1.0
2019-01-01 13:25:00    1.0
2019-01-01 13:30:00    1.0
2019-01-01 13:35:00    1.0
2019-01-01 13:40:00    1.0
2019-01-01 13:45:00    1.0
2019-01-01 13:50:00    1.0
2019-01-01 13:55:00    1.0
2019-01-01 14:00:00    1.0
2019-01-01 14:05:00    1.0
2019-01-01 14:10:00    1.0
2019-01-01 14:15:00    1.0
2019-01-01 14:20:00    1.0
2019-01-01 14:25:00    1.0
2019-01-01 14:30:00    1.0
2019-01-01 14:35:00    1.0
2019-01-01 14:40:00    1.0
2019-01-01 14:45:00    1.0
2019-01-01 14:50:00    1.0
...

In [98]:
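# using the extract to set values in a larger array
# ( where that extracted column originally came from )
# NOTE : the == np.nan test is False everywhere, so this overwrites the real p1 readings with 1 as well
new_last_data_rows__RESAMPLED['p1'] = extracted_p1_col.where( extracted_p1_col == np.nan, 1)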

In [99]:
new_last_data_rows__RESAMPLED

Unnamed: 0_level_0,sensor_id,lat,lon,p1,p2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-01-01 11:55:00,10647.0,55.608,13.036,1.0,4.75
2019-01-01 12:00:00,,,,1.0,
2019-01-01 12:05:00,,,,1.0,
2019-01-01 12:10:00,,,,1.0,
2019-01-01 12:15:00,,,,1.0,
2019-01-01 12:20:00,,,,1.0,
2019-01-01 12:25:00,,,,1.0,
2019-01-01 12:30:00,,,,1.0,
2019-01-01 12:35:00,,,,1.0,
2019-01-01 12:40:00,,,,1.0,


### still to do : 
- make a single row and add it to a table 
( eg to insert the rows one wants, with the times and values one wants, into a table )
- check if a given time-range is in a data frame … ( eg is the beginning / end period, that we want in the data, in the actual data )
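
For the first to-do item, one way to build a single row and add it to a table is a one-row DataFrame plus `pd.concat`. A minimal sketch with made-up values; the column subset and the `table` / `new_row` names are illustrative:

```python
import pandas as pd

table = pd.DataFrame({
    "sensor_id": [7273],
    "timestamp": pd.to_datetime(["2018-12-31 11:57:22"]),
    "p1": [3.43],
})

# a one-row frame with the time and value one wants to insert
new_row = pd.DataFrame([{
    "sensor_id": 7273,
    "timestamp": pd.Timestamp("2018-12-31 12:02:22"),
    "p1": 0.0,
}])

# ignore_index renumbers the rows 0, 1, ...
table = pd.concat([table, new_row], ignore_index=True)
print(table)
```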

### Check if the time-range we want, is in a dataframe

In [100]:
# let's use this 
new_last_data_rows__RESAMPLED.head()

Unnamed: 0_level_0,sensor_id,lat,lon,p1,p2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-01-01 11:55:00,10647.0,55.608,13.036,1.0,4.75
2019-01-01 12:00:00,,,,1.0,
2019-01-01 12:05:00,,,,1.0,
2019-01-01 12:10:00,,,,1.0,
2019-01-01 12:15:00,,,,1.0,


In [101]:
new_last_data_rows__RESAMPLED.tail()

Unnamed: 0_level_0,sensor_id,lat,lon,p1,p2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-01-01 14:40:00,,,,1.0,
2019-01-01 14:45:00,,,,1.0,
2019-01-01 14:50:00,,,,1.0,
2019-01-01 14:55:00,,,,1.0,
2019-01-01 15:00:00,-9999.0,55.676,13.346,1.0,3.68


In [104]:
new_last_data_rows__RESAMPLED.shape

(38, 5)

In [102]:
# which index does this have? 
new_last_data_rows__RESAMPLED.index

DatetimeIndex(['2019-01-01 11:55:00', '2019-01-01 12:00:00',
               '2019-01-01 12:05:00', '2019-01-01 12:10:00',
               '2019-01-01 12:15:00', '2019-01-01 12:20:00',
               '2019-01-01 12:25:00', '2019-01-01 12:30:00',
               '2019-01-01 12:35:00', '2019-01-01 12:40:00',
               '2019-01-01 12:45:00', '2019-01-01 12:50:00',
               '2019-01-01 12:55:00', '2019-01-01 13:00:00',
               '2019-01-01 13:05:00', '2019-01-01 13:10:00',
               '2019-01-01 13:15:00', '2019-01-01 13:20:00',
               '2019-01-01 13:25:00', '2019-01-01 13:30:00',
               '2019-01-01 13:35:00', '2019-01-01 13:40:00',
               '2019-01-01 13:45:00', '2019-01-01 13:50:00',
               '2019-01-01 13:55:00', '2019-01-01 14:00:00',
               '2019-01-01 14:05:00', '2019-01-01 14:10:00',
               '2019-01-01 14:15:00', '2019-01-01 14:20:00',
               '2019-01-01 14:25:00', '2019-01-01 14:30:00',
               ...

In [None]:
# ok, it's got a time index :) 

In [109]:
# let's check for a time interval… 
new_last_data_rows__RESAMPLED[ new_last_data_rows__RESAMPLED.index < '2019-01-01 12:00:00' ]

Unnamed: 0_level_0,sensor_id,lat,lon,p1,p2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-01-01 11:55:00,10647.0,55.608,13.036,1.0,4.75


In [111]:
# let's get a numeric reading… 
new_last_data_rows__RESAMPLED[ new_last_data_rows__RESAMPLED.index < '2019-01-01 12:00:00' ].shape

(1, 5)

In [113]:
# let's test an impossible case… to see what value one gets
new_last_data_rows__RESAMPLED[ new_last_data_rows__RESAMPLED.index < '2018-01-01 12:00:00' ].shape

(0, 5)

In [127]:
# now let's try to find things within a time interval
# ( and next, search for relevant time-intervals in the bigger data)
found_entries = new_last_data_rows__RESAMPLED[ ( new_last_data_rows__RESAMPLED.index > '2019-01-01 12:00:00' )  & ( new_last_data_rows__RESAMPLED.index <= '2019-01-01 12:05:00' ) ]
found_entries

Unnamed: 0_level_0,sensor_id,lat,lon,p1,p2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-01-01 12:05:00,,,,1.0,


In [128]:
found_entries.shape


(1, 5)
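
Building on the comparisons above, the check "is the wanted time-range inside the data?" could be wrapped in a small helper. This is a sketch under assumptions: `covers` is a hypothetical helper name, and the frame is made up to have the same 5 minute index as the resampled data:

```python
import pandas as pd

index = pd.date_range("2019-01-01 11:55:00", "2019-01-01 15:00:00", freq="5min")
frame = pd.DataFrame({"p1": 1.0}, index=index)

def covers(frame, start, end):
    """True if the frame's time index spans the whole start..end range."""
    return frame.index.min() <= start and frame.index.max() >= end

inside = covers(frame, pd.Timestamp("2019-01-01 12:00:00"), pd.Timestamp("2019-01-01 14:00:00"))
outside = covers(frame, pd.Timestamp("2018-01-01 12:00:00"), pd.Timestamp("2019-01-01 14:00:00"))

# label slicing on a sorted DatetimeIndex includes both end points
window = frame.loc[pd.Timestamp("2019-01-01 12:00:00"):pd.Timestamp("2019-01-01 12:05:00")]

print(inside, outside, window.shape)
```

Unlike the `>` / `<=` mask in the cell above, the `.loc` slice includes the 12:00 row as well.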