Load the libraries that I'm using

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
import numpy as np
import datetime

Load the dataset using pyarrow because it's a parquet file.

In [2]:
dfWave=pd.read_parquet('C:\\Users\\John\\Downloads\\waveDemo.parquet', engine='pyarrow')

This is what a sample row of the dataset looks like. 

In [3]:
dfWave.iloc[1]

csID

I want to look at just one line of the dataset to simplify my problem a little, so I'm only looking at ecg-I#1, which was chosen arbitrarily.
I also want to split my delimited cells into arrays. In hindsight I should probably have converted them when I was doing this in Spark, but I can go back and fix this later. In a final version of this it's likely that the str.split steps won't be needed because the values will already be arrays.

In [3]:
dfT=dfWave[dfWave['mgname']=='ecg-I#1'].sort_values(by=['offsetDate','offsetTime'])
dfT['mgwave']=dfT['mgwave'].str.split('^')
dfT['mginvalid']=dfT['mginvalid'].str.split(',')
dfT['mgmissing']=dfT['mgmissing'].str.split(',')
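
One thing to keep in mind after the split: the pieces are still strings, not numbers. A minimal sketch of parsing one cell into integers and masking the invalid/missing sentinel values in the same pass is below. The function name `parse_wave` and the sample strings are made up for illustration; only the delimiters ('^' and ',') come from the cell above.

```python
def parse_wave(raw_wave, raw_invalid, raw_missing):
    """Turn one '^'-delimited waveform string into a list of ints.

    Any value listed in the invalid or missing sentinel strings becomes None.
    (Hypothetical helper, not part of the notebook.)
    """
    bad = {int(v) for v in raw_invalid.split(",")}
    bad |= {int(v) for v in raw_missing.split(",")}
    return [None if int(v) in bad else int(v) for v in raw_wave.split("^")]


# Made-up sample shaped like one row of the dataset.
samples = parse_wave("-9^-8^-32768^5", "-32768,-32766", "-32767,-32763")
print(samples)
```

Doing the conversion per row like this would keep the sentinel lists tied to the row they came from, instead of relying on one hard-coded list for the whole dataset.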

I want to break down my data even further. For now I'm going to assume that none of the values other than the primary key and mgwave matter, and I'm going to build a simple, long dataset.

In [5]:
dftemp=dfT.mgwave.apply(pd.Series) #Convert mgwave into wide columns
#Each read should only have 480 columns (2 seconds of samples at 240 Hz), so I want to drop all of the columns past that
#Because of how the apply(pd.Series) command works, all of the columns are named by their position, starting at 0
#So first I make a list of all columns that come after the first 480 columns
dropList=[]
for x in dftemp.columns:
    if x >=480:
        dropList.append(x)
#Then I drop those columns
dftemp=dftemp.drop(dftemp.columns[dropList],axis = 1)
#Next I need to rejoin my dataset and drop all of the columns that I don't care about for this part of the problem.
dfTempLong=(
    dftemp.merge(dfT, right_index = True, left_index = True)
    .drop(["mgwave","mgGain","mgHZ","mguom","mgsite","mgscale","mginvalid","mgmissing","mgPoints","mgPointsBytes","mgMin","mgMax","mgOffset"], axis = 1)
    .melt(id_vars = ['csID','csBedID','offsetDate','offsetTime','mgname'], value_name = "mgwave")
    .sort_values(by=['offsetDate','offsetTime','variable'])
)
#And then I want to set all missing / invalid numbers (from mginvalid and mgmissing) to NaN, I can worry about the distinction later.
#np.nan is used because the np.NaN alias was removed in NumPy 2.0
dfTempLongNaN=dfTempLong.replace(["-32768","-32766","-32765","-32764","-32760","-32759","-32758","-32757","-32756","-32755","-32754","-32767","-32763","-32762","-32761","-32753","-32752"],np.nan)
#And I want to drop all of those missing or invalid readings, just to clean up the dataset a bit.
dfTempLongNoNaN=dfTempLongNaN.dropna()
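
As a possible alternative to going wide with apply(pd.Series) and then melting back to long, `DataFrame.explode` can go straight from a list column to long form. The sketch below uses a tiny made-up frame (`demo`) rather than the real data, and the two sentinel values are just examples from the list above; `variable` is rebuilt with `cumcount` so it still means "position within the read".

```python
import pandas as pd

# Hypothetical miniature of dfT: two reads, each a '^'-delimited waveform.
demo = pd.DataFrame({
    "offsetTime": ["22:45:01", "22:45:03"],
    "mgwave": ["1^2^-32768", "4^5^6"],
})
demo["mgwave"] = demo["mgwave"].str.split("^")

# One row per sample; the original row label is repeated for each sample.
long_df = demo.explode("mgwave")
# Position of each sample inside its read (0, 1, 2, ...).
long_df["variable"] = long_df.groupby(level=0).cumcount().to_numpy()
# The split pieces are strings, so convert before comparing to sentinels.
long_df["mgwave"] = pd.to_numeric(long_df["mgwave"])

sentinels = [-32768, -32767]  # example subset of the invalid/missing codes
long_df = long_df[~long_df["mgwave"].isin(sentinels)].reset_index(drop=True)
print(long_df)
```

Whether this is faster on the full dataset is untested here; the main appeal is that it never builds the wide frame in the first place.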

I need to think about how I want to deal with time. My instinct is to just make a datetime timestamp and add the sample offset (position / Hz) to it. But the issue with that is that there are duplicate timestamps.

The reason why there are duplicate timestamps is that polltime sometimes fluctuates by a second, which causes little 1 second gaps or 1 second duplications in 0.18% of cases. As a result, Date + Time + Hz count is an order of events, rather than any sort of absolute time.
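
To make the overlap concrete, here is a small sketch under stated assumptions: each read holds 480 samples at 240 Hz (2 seconds of signal), and two consecutive reads are stamped only 1 second apart. The two timestamps are hypothetical; the nanosecond rounding mirrors the formula used in the next cell.

```python
import numpy as np
import pandas as pd

HZ = 240              # assumed sampling rate
POINTS = 480          # assumed samples per read (2 seconds)

# Offset of sample k from the start of its read, rounded to whole nanoseconds.
k = np.arange(POINTS)
offsets = pd.to_timedelta(np.round(k * 1e9 / HZ).astype("int64"), unit="ns")

# Two hypothetical reads whose poll times are 1 second apart instead of 2.
read_a = pd.Timestamp("2020-02-21 22:48:59") + offsets
read_b = pd.Timestamp("2020-02-21 22:49:00") + offsets

stamps = read_a.append(read_b)
n_dup = int(stamps.duplicated().sum())
print(n_dup)
```

Under these assumptions the second half of the first read lands on exactly the same timestamps as the first half of the second read, which is why the combined key stops being unique.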

In [6]:
dfTempLongNoNaN['DateTimeHz']=pd.to_datetime(dfTempLongNoNaN['offsetDate'].astype(str)+" "+dfTempLongNoNaN['offsetTime'])+ (dfTempLongNoNaN['variable']*pow(10,9)/240).round().astype(int).apply(pd.offsets.Nano)
dfTempLongNoNaN[dfTempLongNoNaN['DateTimeHz'].duplicated()==True]

  dfTempLongNoNaN['DateTimeHz']=pd.to_datetime(dfTempLongNoNaN['offsetDate'].astype(str)+" "+dfTempLongNoNaN['offsetTime'])+ (dfTempLongNoNaN['variable']*pow(10,9)/240).round().astype(int).apply(pd.offsets.Nano)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfTempLongNoNaN['DateTimeHz']=pd.to_datetime(dfTempLongNoNaN['offsetDate'].astype(str)+" "+dfTempLongNoNaN['offsetTime'])+ (dfTempLongNoNaN['variable']*pow(10,9)/240).round().astype(int).apply(pd.offsets.Nano)


Unnamed: 0,csID,csBedID,offsetDate,offsetTime,mgname,variable,mgwave,DateTimeHz
1446,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,0,-2,2020-02-21 22:49:01.000000000
122892,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,1,1,2020-02-21 22:49:01.004166667
244338,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,2,-2,2020-02-21 22:49:01.008333333
365784,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,3,-8,2020-02-21 22:49:01.012500000
487230,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,4,-11,2020-02-21 22:49:01.016666667
...,...,...,...,...,...,...,...,...
28659072,987842478180,1228360646657,2020-02-25,17:08:11,ecg-I#1,235,64,2020-02-25 17:08:11.979166667
28780518,987842478180,1228360646657,2020-02-25,17:08:11,ecg-I#1,236,70,2020-02-25 17:08:11.983333333
28901964,987842478180,1228360646657,2020-02-25,17:08:11,ecg-I#1,237,77,2020-02-25 17:08:11.987500000
29023410,987842478180,1228360646657,2020-02-25,17:08:11,ecg-I#1,238,85,2020-02-25 17:08:11.991666667


So, I know Date-Time is unique, and I know that Date-Time-Hz is not unique due to the 1 second gaps and overlaps.
The next thing I want to do is make a list of datetimes to identify gaps. A gap is when the next point is more than 3 seconds away: a 2 second step is expected, and a 1-3 second step accounts for that wavering second. So I want to identify cases where the next datetime is more than 3 seconds away.

In [7]:
#First I want to demonstrate that time differences are 1-3 seconds or are quite long. 
dfTempLongNoNaN['DateTime']=pd.to_datetime(dfTempLongNoNaN['offsetDate'].astype(str)+" "+dfTempLongNoNaN['offsetTime'])
dfuniDateTime=pd.DataFrame({'DateTime':dfTempLongNoNaN['DateTime'].sort_values().unique()})
dfuniDateTime['dt2']=dfuniDateTime['DateTime'].shift(1)
dfuniDateTime['ddt']=(dfuniDateTime.DateTime- dfuniDateTime.dt2)
dfuniDateTime['ddt'].dt.seconds.unique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfTempLongNoNaN['DateTime']=pd.to_datetime(dfTempLongNoNaN['offsetDate'].astype(str)+" "+dfTempLongNoNaN['offsetTime'])


array([      nan, 2.000e+00, 1.000e+00, 3.000e+00, 3.602e+03, 9.540e+02,
       2.600e+01, 2.240e+02])
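
One caveat with the check above: `.dt.seconds` returns only the seconds component of a timedelta (0 to 86399), with whole days held separately in `.dt.days`. So a gap longer than a day would show up here as a small number. A short sketch with two made-up gaps shows the difference from `.dt.total_seconds()`:

```python
import pandas as pd

# Two hypothetical gaps: a normal 2 second step, and 1 day plus 2 seconds.
gaps = pd.Series(pd.to_timedelta(["0 days 00:00:02", "1 days 00:00:02"]))

component = gaps.dt.seconds.tolist()      # seconds component only
total = gaps.dt.total_seconds().tolist()  # full duration in seconds
print(component, total)
```

I can't tell from the printed array whether any gap in this dataset is actually longer than a day, so this may not matter here, but `total_seconds()` would be the safer test for "more than 3 seconds".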

The issue is, now that I've demonstrated that there are big and small gaps, I need to use those gaps to restart time.
What I'm thinking is that when the value is NaN (the first row) or there is a big gap (a gap bigger than 3 seconds), I need to re-initialize time. This is because I plan to assert that datetime + Hz represents an order in a series rather than an absolute time. Then I want to take that order and assign an absolute time.

So the next thing I need to calculate is the anchor time, and the position in the series relative to that anchor time.
An alternative approach is to correct time directly: in cases where the gap is 1 second I can set it to 2, and in cases where it's 3 I can set it to 2. That's probably much easier, so my first thought was to try that. On second thought, I don't really know which direction the truth lies in, and I'm a bit worried about the problem cascading. But I really would prefer to solve it in this way.

I like how I solved it below. We loop through the dataframe, and if it's the first row or a row with a time gap we restart a counter, then add 2 seconds per position in the counter to the initial datetime. I understand that this is probably a bad idea after removing the NaNs, but I won't be removing the NaNs in the final version so it's fine for now.

In [8]:
n=0
newTime=[]
for index, row in dfuniDateTime.iterrows():
    if row['ddt'].seconds > 3:
        n=0
    if n==0:
        lastAnchorDateTime=row['DateTime']
    newTime.append(lastAnchorDateTime+datetime.timedelta(seconds=n*2))
    n=n+1
dfuniDateTime['newTime']=newTime
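
The same anchor logic can also be written without a Python loop, which may matter once this runs on more than one signal. This is a sketch on a made-up list of read times, not a drop-in replacement: the column names `run`, `order` and `anchor` are invented here, and it uses `total_seconds()` rather than `.seconds` for the gap test.

```python
import pandas as pd

# Hypothetical read times: normal steps, a 1 s and a 3 s wobble, then a long gap.
df = pd.DataFrame({"DateTime": pd.to_datetime([
    "2020-02-21 22:45:01",
    "2020-02-21 22:45:03",
    "2020-02-21 22:45:04",
    "2020-02-21 22:45:07",
    "2020-02-21 23:45:09",
    "2020-02-21 23:45:11",
])})

gap_s = df["DateTime"].diff().dt.total_seconds()
# A new run starts on the first row (NaN gap) or after a gap longer than 3 s.
new_run = gap_s.isna() | (gap_s > 3)
df["run"] = new_run.cumsum()
# Position of each read inside its run, and the run's first timestamp.
df["order"] = df.groupby("run").cumcount()
df["anchor"] = df.groupby("run")["DateTime"].transform("first")
# Every read is assumed to cover exactly 2 seconds.
df["newTime"] = df["anchor"] + pd.to_timedelta(df["order"] * 2, unit="s")
print(df[["DateTime", "newTime"]])
```

In this toy example the 22:45:04 read is moved to 22:45:05, and the run restarts at 23:45:09 after the hour-long gap.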

Now I need to merge back on the original table, 
Regenerate DateTimeHz
And validate there are no duplicates

In [12]:
dfTempCorrectedTime = pd.merge(dfTempLongNoNaN, dfuniDateTime, how="inner", on="DateTime")

In [13]:
#dfTempCorrectedTime=dfTempCorrectedTime[["csID","csBedID","mgname","newTime","variable","mgwave"]]
dfTempCorrectedTime

Unnamed: 0,csID,csBedID,offsetDate,offsetTime,mgname,variable,mgwave,DateTimeHz,DateTime,dt2,ddt,newTime
0,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,120,-12474,2020-02-21 22:45:01.500000000,2020-02-21 22:45:01,NaT,NaT,2020-02-21 22:45:01
1,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,121,-12835,2020-02-21 22:45:01.504166667,2020-02-21 22:45:01,NaT,NaT,2020-02-21 22:45:01
2,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,122,-13874,2020-02-21 22:45:01.508333333,2020-02-21 22:45:01,NaT,NaT,2020-02-21 22:45:01
3,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,123,8003,2020-02-21 22:45:01.512500000,2020-02-21 22:45:01,NaT,NaT,2020-02-21 22:45:01
4,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,124,12985,2020-02-21 22:45:01.516666667,2020-02-21 22:45:01,NaT,NaT,2020-02-21 22:45:01
...,...,...,...,...,...,...,...,...,...,...,...,...
56665459,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,235,-1,2020-02-25 17:40:01.979166667,2020-02-25 17:40:01,2020-02-25 17:39:59,0 days 00:00:02,2020-02-25 17:40:16
56665460,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,236,0,2020-02-25 17:40:01.983333333,2020-02-25 17:40:01,2020-02-25 17:39:59,0 days 00:00:02,2020-02-25 17:40:16
56665461,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,237,0,2020-02-25 17:40:01.987500000,2020-02-25 17:40:01,2020-02-25 17:39:59,0 days 00:00:02,2020-02-25 17:40:16
56665462,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,238,0,2020-02-25 17:40:01.991666667,2020-02-25 17:40:01,2020-02-25 17:39:59,0 days 00:00:02,2020-02-25 17:40:16


In [14]:
dfTempCorrectedTime['DateTimeHz']=dfTempCorrectedTime['newTime']+(dfTempCorrectedTime['variable']*pow(10,9)/240).round().astype(int).apply(pd.offsets.Nano)
dfTempCorrectedTime[dfTempCorrectedTime['DateTimeHz'].duplicated()==True]
dfTempCorrectedTime.iloc[57359:57373]

  dfTempCorrectedTime['DateTimeHz']=dfTempCorrectedTime['newTime']+(dfTempCorrectedTime['variable']*pow(10,9)/240).round().astype(int).apply(pd.offsets.Nano)


Unnamed: 0,csID,csBedID,offsetDate,offsetTime,mgname,variable,mgwave,DateTimeHz,DateTime,dt2,ddt,newTime
57359,987842478180,1228360646657,2020-02-21,22:48:59,ecg-I#1,479,-6,2020-02-21 22:49:00.995833333,2020-02-21 22:48:59,2020-02-21 22:48:57,0 days 00:00:02,2020-02-21 22:48:59
57360,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,0,-6,2020-02-21 22:49:01.000000000,2020-02-21 22:49:01,2020-02-21 22:48:59,0 days 00:00:02,2020-02-21 22:49:01
57361,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,0,-2,2020-02-21 22:49:01.000000000,2020-02-21 22:49:01,2020-02-21 22:48:59,0 days 00:00:02,2020-02-21 22:49:01
57362,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,1,-5,2020-02-21 22:49:01.004166667,2020-02-21 22:49:01,2020-02-21 22:48:59,0 days 00:00:02,2020-02-21 22:49:01
57363,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,1,1,2020-02-21 22:49:01.004166667,2020-02-21 22:49:01,2020-02-21 22:48:59,0 days 00:00:02,2020-02-21 22:49:01
57364,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,2,-2,2020-02-21 22:49:01.008333333,2020-02-21 22:49:01,2020-02-21 22:48:59,0 days 00:00:02,2020-02-21 22:49:01
57365,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,2,-2,2020-02-21 22:49:01.008333333,2020-02-21 22:49:01,2020-02-21 22:48:59,0 days 00:00:02,2020-02-21 22:49:01
57366,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,3,-1,2020-02-21 22:49:01.012500000,2020-02-21 22:49:01,2020-02-21 22:48:59,0 days 00:00:02,2020-02-21 22:49:01
57367,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,3,-8,2020-02-21 22:49:01.012500000,2020-02-21 22:49:01,2020-02-21 22:48:59,0 days 00:00:02,2020-02-21 22:49:01
57368,987842478180,1228360646657,2020-02-21,22:49:01,ecg-I#1,4,-1,2020-02-21 22:49:01.016666667,2020-02-21 22:49:01,2020-02-21 22:48:59,0 days 00:00:02,2020-02-21 22:49:01


Well, that's unfortunate. DateTimeHz rebuilt from newTime still isn't a primary key, though it's only for 0.04% of the data.
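
When chasing duplicates like these it can help to see both members of each pair rather than only the second one. A small sketch on a made-up three-row frame: `duplicated(keep=False)` flags every row that shares its key with another row, and the mean of the default `duplicated()` mask gives the duplicated share.

```python
import pandas as pd

# Hypothetical frame: the first two rows collide on DateTimeHz.
demo = pd.DataFrame({
    "DateTimeHz": pd.to_datetime([
        "2020-02-21 22:49:01.000",
        "2020-02-21 22:49:01.000",
        "2020-02-21 22:49:01.250",
    ]),
    "mgwave": [-6, -2, -5],
})

both = demo[demo["DateTimeHz"].duplicated(keep=False)]  # every colliding row
share = demo["DateTimeHz"].duplicated().mean()          # fraction flagged as repeats
print(both)
print(share)
```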

In [34]:
dfD=dfWave[dfWave["offsetDate"].astype(str)=="2020-02-21"]

In [37]:
dfDT=dfD[dfD["offsetTime"].astype(str)=="22:48:59"]

In [38]:
dfDT[dfDT["mgname"]=="ecg-I#1"]

Unnamed: 0,csID,csBedID,offsetDate,offsetTime,mgname,mgGain,mgHZ,mgwave,mguom,mgsite,mgscale,mginvalid,mgmissing,mgPoints,mgPointsBytes,mgMin,mgMax,mgOffset
1126666,987842478180,1228360646657,2020-02-21,22:48:59,ecg-I#1,1.0,none,-9^-8^-8^-6^-5^-4^-5^-6^-9^-9^-8^-5^-3^-2^-3^-5^-7^-7^-6^-4^0^0^0^-2^-4^-5^-5^-2^-1^0^0^0^1^2^1^-1^-1^-1^0^0^-1^0^1^1^0^0^2^9^20^35^50^62^66^62^48^23^-9^-40^-62^-66^-55^-40^-30^-25^-19^-13^-7^-4^-3^-1^1^5^5^3^0^-1^-1^0^2^3^4^5^7^9^13^13^11^7^3^2^5^9^13^13^12^12^14^18^23^25^25^26^27^30^32^31^29^26^25^26^25^20^15^11^11^13^12^10^7^6^5^3^-1^-6^-10^-11^-11^-12^-13^-16^-16^-14^-11^-8^-6^-8^-10^-11^-13^-13^-15^-17^-18^-17^-14^-12^-14^-17^-18^-16^-11^-9^-7^-10^-11^-12^-13^-12^-10^-7^-7^-9^-11^-14^-14^-12^-9^-5^-3^-4^-6^-8^-8^-7^-6^-3^0^7^19^35^56^75^84^81^64^36^-1^-38^-63^-69^-59^-44^-33^-26^-19^-9^-1^1^-2^-4^-7^-5^-3^1^2^4^5^5^5^3^1^2^5^9^8^7^7^9^11^11^11^11^12^14^14^13^14^17^20^21^22^25^29^30^30^29^30^33^34^31^25^21^18^17^15^13^11^7^3^0^-1^-1^-2^-6^-9^-9^-9^-8^-7^-7^-9^-11^-12^-14^-13^-11^-9^-9^-8^-8^-7^-5^-5^-5^-8^-9^-10^-9^-8^-9^-10^-13^-13^-10^-6^-2^-1^-3^-6^-7^-6^-5^-4^-4^-6^-6^-5^-3^-3^-2^0^2^3^3^1^0^-2^-3^-5^-6^-5^-4^-3^-5^-6^-5^-2^2^2^2^0^-2^0^2^2^0^-1^-2^-1^1^2^2^2^2^1^0^0^-1^-1^-2^-1^0^3^6^8^7^5^4^4^5^5^7^12^24^41^57^71^79^82^75^56^23^-14^-43^-56^-54^-44^-33^-22^-13^-5^3^9^11^11^9^8^9^12^15^15^13^12^12^13^13^13^13^14^14^14^13^12^14^17^18^19^16^15^17^23^27^29^29^28^27^27^28^26^24^21^19^19^20^19^17^14^12^14^15^16^14^11^8^5^3^1^-2^-6^-10^-12^-11^-11^-10^-10^-11^-11^-10^-11^-10^-10^-10^-13^-16^-17^-16^-13^-11^-11^-11^-13^-12^-13^-12^-10^-9^-8^-6^-3^-2^-1^-1^-1^0^1^1^1^1^1^1^1^3^3^1^-2^-7^-9^-9^-8^-6,uV,,2.44,"-32768,-32766,-32765,-32764,-32760,-32759,-32758,-32757,-32756,-32755,-32754","-32767,-32763,-32762,-32761,-32753,-32752",480,none,none,none,0


In [33]:
dfWave[dfWave["mgname"]=="ecg-I#1"]["mgPoints"].value_counts()

480    120927
240       115
360        83
120        78
180        73
300        73
60         49
420        47
660         1
Name: mgPoints, dtype: int64
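
A quick bit of arithmetic on the counts printed above, just to size the problem. The dictionary is copied by hand from that output, so treat it as a transcription rather than something recomputed from the data.

```python
# Counts transcribed from the value_counts() output above.
counts = {480: 120927, 240: 115, 360: 83, 120: 78, 180: 73,
          300: 73, 60: 49, 420: 47, 660: 1}

total = sum(counts.values())
partial = total - counts[480]   # reads that are not exactly 480 points
share = partial / total

# Every observed size is a multiple of 60 points.
all_multiples_of_60 = all(points % 60 == 0 for points in counts)

print(total, partial, round(100 * share, 2), all_multiples_of_60)
```

So a bit under half a percent of ecg-I#1 reads are not 480 points long, and every size that does appear is a multiple of 60 points.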

So, this is the root of the problem. In a small subset of cases we see what looks like a change in frequency: mgPoints is not 480. This is because some of the previous read is attached to the next read or vice versa. To fix this I need to know in which direction there is a hole, and to work that out I need to start from the top.

In [5]:
dftemp=dfT.mgwave.apply(pd.Series) 
dfTempLong=dftemp.merge(dfT, right_index = True, left_index = True).drop(["mgwave","mgGain","mgHZ","mguom","mgsite","mgscale","mginvalid","mgmissing","mgPointsBytes","mgMin","mgMax","mgOffset"], axis = 1).melt(id_vars = ['csID','csBedID','offsetDate','offsetTime','mgname','mgPoints'], value_name = "mgwave").sort_values(by=['offsetDate','offsetTime','variable'])
dfTempLongNaN=dfTempLong.replace(["-32768","-32766","-32765","-32764","-32760","-32759","-32758","-32757","-32756","-32755","-32754","-32767","-32763","-32762","-32761","-32753","-32752"],np.nan)
dfTempLongNoNaN=dfTempLongNaN.dropna()
dfTempLongNoNaN['DateTime']=pd.to_datetime(dfTempLongNoNaN['offsetDate'].astype(str)+" "+dfTempLongNoNaN['offsetTime'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfTempLongNoNaN['DateTime']=pd.to_datetime(dfTempLongNoNaN['offsetDate'].astype(str)+" "+dfTempLongNoNaN['offsetTime'])


In [7]:
dfuDTHZ=dfTempLongNoNaN.groupby(['DateTime','mgPoints']).size().reset_index().rename(columns={0:'count'})
dfuDTHZ=dfuDTHZ.sort_values('DateTime')
dfuDTHZ['mgPointsNext']=dfuDTHZ['mgPoints'].shift(-1)
dfuDTHZ['mgPointsPrev']=dfuDTHZ['mgPoints'].shift(1)
dfuDTHZ['mgPointsNext2']=dfuDTHZ['mgPoints'].shift(-2)
dfuDTHZ['mgPointsPrev2']=dfuDTHZ['mgPoints'].shift(2)
dfuDTHZ['dt2']=dfuDTHZ['DateTime'].shift(1)
dfuDTHZ['ddt']=(dfuDTHZ.DateTime- dfuDTHZ.dt2)
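orderMod=[]
#Mod is a tie-breaker for reads that share a DateTime: 0 for a full read,
#1 if a partial read pairs up with one of the next two reads, -1 if it pairs up with one of the previous two
for index, row in dfuDTHZ.iterrows():
    if row['mgPoints'] == "480":
        orderMod.append([index,0])
    if row['mgPoints'] != "480":
        try:
            if ((int(row['mgPoints'])+int(row['mgPointsNext']))%480==0 or (int(row['mgPoints'])+int(row['mgPointsNext2']))%480==0):
                orderMod.append([index,1])
            elif ((int(row['mgPoints'])+int(row['mgPointsPrev']))%480==0 or (int(row['mgPoints'])+int(row['mgPointsPrev2']))%480==0):
                orderMod.append([index,-1])
            #Note: a partial read that matches neither case gets no entry, so the merge below drops it
        except (ValueError, TypeError):
            #int() fails on the NaN that shift() leaves at the edges of the table; treat those rows as unshifted
            orderMod.append([index,0])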
dfuDTHZ=dfuDTHZ.merge(pd.DataFrame(orderMod, columns=['index2','Mod']), left_index = True, right_on = 'index2')
dfuDTHZ=dfuDTHZ.sort_values(by=["DateTime",'Mod'])
n=0
newTime=[]
for index, row in dfuDTHZ.iterrows():
    if row['ddt'].seconds > 3:
        n=0
    if n==0:
        lastAnchorDateTime=row['DateTime']
    newTime.append([lastAnchorDateTime,n,index])
    n=n+1
dfuDTHZ=dfuDTHZ.merge(pd.DataFrame(newTime, columns=['AnchorTime','AnchorOrder','index3']), left_index = True, right_on = 'index3')
dfuDTHZ=dfuDTHZ[["DateTime","mgPoints","AnchorTime","AnchorOrder"]] #Initially I thought I still needed mod here, but mod has already been handled by AnchorOrder
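
The pairing rule in the first loop above is easier to test when it is pulled out into a small function. The sketch below is a variant, not an exact copy: the name `order_mod` is made up, neighbours are passed as plain ints (or None at the edges of the table), missing neighbours are skipped instead of raising, and a partial read with no matching neighbour returns 0 instead of being dropped.

```python
def order_mod(points, prev2, prev1, next1, next2, full=480):
    """Return 0 for a full read, 1 if a partial read pairs with a later
    read, -1 if it pairs with an earlier read, and 0 if nothing pairs.
    (Hypothetical helper; neighbours may be None at the table edges.)
    """
    if points == full:
        return 0
    # Later neighbours are checked first, as in the loop above.
    for neighbour in (next1, next2):
        if neighbour is not None and (points + neighbour) % full == 0:
            return 1
    for neighbour in (prev1, prev2):
        if neighbour is not None and (points + neighbour) % full == 0:
            return -1
    return 0


# A 360-point read followed by the 120 points that complete it.
print(order_mod(360, None, None, 120, 480))
# The matching 120-point read looks backwards to find its partner.
print(order_mod(120, 480, 360, 480, 480))
# The single 660-point read does not pair with full reads on either side.
print(order_mod(660, 480, 480, 480, 480))
```

The last case is worth a look on the real data: a 660-point read surrounded by 480-point reads satisfies neither condition, so in the loop above it would get no Mod entry at all.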

So at this point I have dfuDTHZ, and I need to use DateTime and mgPoints to link back to the original dataset.
Now that I have AnchorTime and AnchorOrder, I can generate a waveform.

In [10]:
dfuOrdered=pd.merge(dfTempLongNoNaN, dfuDTHZ, how="inner", on=["DateTime","mgPoints"]).sort_values(by=['AnchorTime','AnchorOrder','variable'])

In [11]:
dfuOrdered

Unnamed: 0,csID,csBedID,offsetDate,offsetTime,mgname,mgPoints,variable,mgwave,DateTime,AnchorTime,AnchorOrder
0,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,360,120,-12474,2020-02-21 22:45:01,2020-02-21 22:45:01,0
1,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,360,121,-12835,2020-02-21 22:45:01,2020-02-21 22:45:01,0
2,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,360,122,-13874,2020-02-21 22:45:01,2020-02-21 22:45:01,0
3,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,360,123,8003,2020-02-21 22:45:01,2020-02-21 22:45:01,0
4,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,360,124,12985,2020-02-21 22:45:01,2020-02-21 22:45:01,0
...,...,...,...,...,...,...,...,...,...,...,...
56661859,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,360,235,-1,2020-02-25 17:40:01,2020-02-24 23:35:10,32554
56661860,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,360,236,0,2020-02-25 17:40:01,2020-02-24 23:35:10,32554
56661861,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,360,237,0,2020-02-25 17:40:01,2020-02-24 23:35:10,32554
56661862,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,360,238,0,2020-02-25 17:40:01,2020-02-24 23:35:10,32554


In [20]:
totalOrder=[]
anchortime=0
n=0
for index, row in dfuOrdered[['AnchorTime','AnchorOrder']].iterrows():
    if anchortime!=row['AnchorTime']:
        n=0
        anchortime=row['AnchorTime']
    totalOrder.append([index,n])
    n=n+1
    


In [22]:
dfuOrdered=dfuOrdered.merge(pd.DataFrame(totalOrder, columns=['index3','SeqNumber']), left_index = True, right_on = 'index3')
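
Since the frame is already sorted by AnchorTime, the running counter could probably also come from `groupby(...).cumcount()` without a row-by-row loop. A sketch on a made-up five-row frame, assuming (as in the sorted data) that all rows of one AnchorTime are contiguous:

```python
import pandas as pd

# Hypothetical miniature of dfuOrdered: two anchors with 3 and 2 samples.
demo = pd.DataFrame({
    "AnchorTime": pd.to_datetime(
        ["2020-02-21 22:45:01"] * 3 + ["2020-02-21 23:45:09"] * 2
    ),
    "AnchorOrder": [0, 0, 1, 0, 0],
    "variable": [0, 1, 0, 0, 1],
})

demo = demo.sort_values(["AnchorTime", "AnchorOrder", "variable"])
# Running sample number that restarts at 0 for every anchor.
demo["SeqNumber"] = demo.groupby("AnchorTime").cumcount()
print(demo)
```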

So, now that I'm here, what do I really want to achieve? If I want to make the dataset as simple as possible, I want every row of the dataset to be: Patient, Bed, Start datetime, source, Hz, unit of measure, Array.

What does the library need to perform the conversion?
1: File name (some combo of csID, csBedID, and date)
2: fs (Hz)
3: units (from uom)
4: fmt (I don't totally understand this but I'm sure I'll get to it eventually)
5: gain (from gain)
6: baseline (unsure what to include here)
7: samps_per_frame (if there are different frequencies per frame)
8: counter_freq (unsure)
9: base_counter (this is 0)
10-11: base_time, base_date (from base date, time)
12: comments (my dataset doesn't have this)
13: signame (this is a list of mgnames)
A bunch of other stuff that probably doesn't matter much.
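
Before worrying about the library call itself, the "one row per continuous run" shape described above can be sketched with a groupby. Everything here is illustrative: the frame `ordered` is made up, the `fs` and `units` values are assumptions (240 from the /240 used earlier, uV from the mguom value in the sample row), and no library-specific function is called.

```python
import pandas as pd

# Hypothetical miniature of the ordered long table: two runs, one signal.
ordered = pd.DataFrame({
    "csID": [987842478180] * 5,
    "csBedID": [1228360646657] * 5,
    "mgname": ["ecg-I#1"] * 5,
    "AnchorTime": pd.to_datetime(
        ["2020-02-21 22:45:01"] * 3 + ["2020-02-21 23:45:09"] * 2
    ),
    "SeqNumber": [0, 1, 2, 0, 1],
    "mgwave": [-2, 1, -2, 64, 70],
})

keys = ["csID", "csBedID", "mgname", "AnchorTime"]
segments = (
    ordered.sort_values(keys + ["SeqNumber"])
    .groupby(keys)["mgwave"]
    .agg(list)          # one list of samples per continuous run
    .reset_index()
)
segments["fs"] = 240       # assumed sampling rate in Hz
segments["units"] = "uV"   # assumed unit of measure
print(segments)
```

Each row of `segments` then carries the pieces from the list above (start datetime, signal name, fs, units) next to the sample array, which should make it easier to map onto whatever the conversion function expects.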

In [62]:
dfuOrdered=dfuOrdered.sort_values(by=['AnchorTime','SeqNumber'])

In [68]:
dfuOrdered

Unnamed: 0,csID,csBedID,offsetDate,offsetTime,mgname,mgPoints,variable,mgwave,DateTime,AnchorTime,AnchorOrder,index3,SeqNumber
0,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,360,120,-12474,2020-02-21 22:45:01,2020-02-21 22:45:01,0,0,0
1,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,360,121,-12835,2020-02-21 22:45:01,2020-02-21 22:45:01,0,1,1
2,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,360,122,-13874,2020-02-21 22:45:01,2020-02-21 22:45:01,0,2,2
3,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,360,123,8003,2020-02-21 22:45:01,2020-02-21 22:45:01,0,3,3
4,987842478180,1228360646657,2020-02-21,22:45:01,ecg-I#1,360,124,12985,2020-02-21 22:45:01,2020-02-21 22:45:01,0,4,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
56661859,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,360,235,-1,2020-02-25 17:40:01,2020-02-24 23:35:10,32554,56661859,15621727
56661860,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,360,236,0,2020-02-25 17:40:01,2020-02-24 23:35:10,32554,56661860,15621728
56661861,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,360,237,0,2020-02-25 17:40:01,2020-02-24 23:35:10,32554,56661861,15621729
56661862,987842478180,1228360646657,2020-02-25,17:40:01,ecg-I#1,360,238,0,2020-02-25 17:40:01,2020-02-24 23:35:10,32554,56661862,15621730


In [59]:
import tarfile
with tarfile.open('Allergies.parquet.tar.gz', "r:gz") as tar:
    tar.extractall(path="Allergies.parquet")
