#### This example notebook shows how to read a Parquet file from the processed PSG output
#### This sample contains only readings from the EEG Fpz-Cz [uV] sensor
#### Each "row" of the dataset is of the form (subject, label, [reading1, reading2, reading3, reading4, ..., reading n])
##### For example: 
<code>(SC4001E1,1,[1.07838827838828, 2.86007326007326, 4.92307692307692, 5.95457875457875, 6.23589743589744, 6.61098901098901, 8.67399267399267, -1.07838827838828, -5.2043956043956, 0.797069597069597, 6.23589743589744, 6.42344322344322, 0.234432234432234, 0.140659340659341, 1.92234432234432, 6.14212454212454, 4.07912087912088, 2.95384615384615, 1.07838827838828, -3.23516483516484, -7.45494505494506, -1.64102564102564, -0.328205128205128, 6.14212454212454, 1.64102564102564, 2.67252747252747, 10.174358974359, 13.5501831501832, 14.6754578754579, 15.6131868131868, 13.8315018315018, 13.1750915750916, 9.23663003663004, 8.39267399267399, 2.95384615384615, 6.42344322344322, 5.95457875457875, -0.60952380952381, -5.2981684981685, -1.73479853479853, -0.797069597069597, -1.07838827838828, 1.73479853479853, 4.07912087912088, 2.01611721611722, -0.703296703296703, -7.17362637362637, -6.04835164835165, -8.01758241758242, -4.92307692307692, -6.04835164835165, -6.23589743589744, -5.11062271062271, -2.39120879120879, 3.7040293040293, 3.23516483516484, 3.42271062271062, 0.60952380952381, -0.0468864468864469, -2.203663003663, -5.2981684981685, -6.23589743589744, -8.67399267399267, -5.2043956043956, -3.61025641025641, 3.51648351648352, 7.54871794871795, 9.98681318681319, 11.7684981684982, 12.3311355311355, 15.8007326007326, 13.1750915750916, 6.51721611721612, 1.82857142857143, 0.890842490842491, 0.328205128205128, 3.98534798534799, -1.92234432234432, 4.54798534798535, 9.33040293040293, 11.1120879120879, 3.61025641025641, 2.95384615384615, 3.98534798534799, 5.11062271062271, 1.07838827838828, -3.04761904761905, -8.76776556776557, -4.64175824175824, -2.95384615384615, -3.23516483516484, -2.48498168498169, 2.203663003663, -1.64102564102564, -1.82857142857143, -6.23589743589744, -5.86080586080586, -4.45421245421245, 1.26593406593407, 0.234432234432234, 1.17216117216117, 1.92234432234432, 1.07838827838828, 1.17216117216117, 1.64102564102564])</code>

To capture these relationships properly we use the Parquet file format, which also provides better performance and compression than standard CSV or text file formats.
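As a rough illustration of why a plain text format loses the structure, the sketch below round-trips a one-row table through CSV. The frame `toy` and its column names are hypothetical stand-ins, not the real dataset; the point is only that the array comes back as a string.

```python
import io

import pandas as pd

# Hypothetical one-row table in the (subject, label, readings) layout.
toy = pd.DataFrame({
    "subject": ["SC4001E1"],
    "label": ["1"],
    "readings": [[1.078, 2.860, 4.923]],
})

# Write to an in-memory CSV and read it straight back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_tripped = pd.read_csv(buffer, dtype={"label": str})

# The list survives only as its text representation, not as numbers.
value = round_tripped.loc[0, "readings"]
print(type(value).__name__, repr(value))
```

Parsing that string back into numbers is possible but slow and error-prone, which is one reason a format with native list columns is convenient here.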

First import the pandas module. We need pandas version 0.21 or later, but the most recent compatible version is preferred. The corresponding "pyarrow" module must also be installed (pip install pyarrow).

In [1]:
import pandas as pd
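A quick way to check the installed versions before going further is sketched below. The helper names `version_tuple` and `meets_minimum` are made up for this example, and the parsing only looks at the leading major.minor numbers.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Return the leading (major, minor) numbers of a version string."""
    major, minor = re.match(r"(\d+)\.(\d+)", version).groups()
    return int(major), int(minor)


def meets_minimum(version, minimum=(0, 21)):
    """True when `version` is at least `minimum` (0.21 is the version named above)."""
    return version_tuple(version) >= minimum


# Report what is installed without failing if a package is missing.
for package in ("pandas", "pyarrow"):
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        print(f"{package}: not installed")
    else:
        print(f"{package}: {installed}")
```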

A single Parquet file can be read into a pandas DataFrame with the "read_parquet" function

In [2]:
df = pd.read_parquet("sample.snappy.parquet")
df

Unnamed: 0,subject,annotation,EEG_Fpz-Cz_uV
0,SC4001E0,W,"[19.6454212454212, 14.1128205128205, 22.927472..."
1,SC4001E0,W,"[-41.5882783882784, -36.5245421245421, -41.494..."
2,SC4001E0,W,"[-74.4087912087912, -73.5648351648352, -77.690..."
3,SC4001E0,W,"[-58.1860805860806, -40.181684981685, -39.1501..."
4,SC4001E0,W,"[-42.0571428571429, -37.8373626373626, -37.837..."
...,...,...,...
228,SC4001E0,1,"[-29.8666666666667, 1.35970695970696, -7.92380..."
229,SC4001E0,2,"[-2.39120879120879, 0.984615384615385, 1.07838..."
230,SC4001E0,2,"[-0.0468864468864469, 1.17216117216117, 1.8285..."
231,SC4001E0,2,"[-12.4249084249084, -17.7699633699634, -19.926..."


Because we are working with arrays, the DataFrame can then easily be converted to a NumPy array

In [3]:
num = df.to_numpy()
num

array([['SC4001E0', 'W',
        array([ 19.64542125,  14.11282051,  22.92747253, ..., -25.27179487,
       -28.83516484, -32.21098901])],
       ['SC4001E0', 'W',
        array([-41.58827839, -36.52454212, -41.49450549, ..., -85.28644689,
       -85.00512821, -78.53479853])],
       ['SC4001E0', 'W',
        array([-74.40879121, -73.56483516, -77.69084249, ..., -44.58901099,
       -26.3032967 , -45.52673993])],
       ['SC4001E0', 'W',
        array([-58.18608059, -40.18168498, -39.15018315, ..., -44.87032967,
       -45.05787546, -38.02490842])],
       ['SC4001E0', 'W',
        array([-42.05714286, -37.83736264, -37.83736264, ..., -16.26959707,
       -19.36410256, -11.58095238])],
       ['SC4001E0', 'W',
        array([-9.7992674 , -8.11135531, -7.64249084, ..., 48.71501832,
       38.21245421, 39.24395604])],
       ['SC4001E0', 'W',
        array([ 37.64981685,  39.9003663 ,  34.46153846, ..., -46.74578755,
       -69.34505495, -67.93846154])],
       ['SC4001E0', 'W',
        

However, most processing results in multiple Parquet files. By convention, these are all written to a single directory. We glob that directory, read each file into pandas, concatenate the results, and then convert to a NumPy array

In [4]:
from pathlib import Path


data_dir = Path('output')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)
full_df = full_df.to_numpy()

Each "row" is one of the scored samples

We can see this dataset contains 1442 rows of readings, each of which corresponds to 30 seconds of scored data

In [5]:
print(len(full_df))

1442


Each row contains a label and an array of data readings.  This example shows an epoch that was scored as stage 3

In [6]:
print(full_df[800])

['SC4001E0' '3'
 array([ 32.49230769,  38.21245421,  52.46593407, ..., -17.48864469,
       -17.3010989 , -12.04981685])
 array([ 0.02783883, -2.08351648, -6.01831502, ..., 18.74212454,
       18.55018315, 19.7018315 ])
 array([-24.88620269, -29.32136752, -23.9006105 , ...,   4.68156288,
        -2.21758242,   3.6959707 ])]
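Since the label sits in column 1 of each row, epochs of a given stage can be picked out with a boolean mask. The sketch below uses a hypothetical five-row array `toy` with a single readings column; the real rows follow the same (subject, label, readings) layout.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the (subject, label, readings) rows shown above.
toy = pd.DataFrame({
    "subject": ["SC4001E0"] * 5,
    "annotation": ["W", "1", "3", "3", "2"],
    "EEG_Fpz-Cz_uV": [np.full(4, float(i)) for i in range(5)],
}).to_numpy()

labels = toy[:, 1]            # column 1 holds the sleep-stage label
stage3 = toy[labels == "3"]   # boolean mask keeps only the stage-3 epochs

# Count how many epochs carry each label.
values, counts = np.unique(labels.astype(str), return_counts=True)
label_counts = dict(zip(values.tolist(), counts.tolist()))

print(len(stage3), label_counts)
```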


Each feature array contains 30 seconds of data sampled at 100 Hz, 3000 readings in total

In [7]:
print(len(full_df[800][2]))

3000
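Because every epoch holds the same number of readings, the per-row arrays can be stacked into one dense 2-D matrix, which is often easier to work with than the object array that `to_numpy()` returns for mixed columns. This is a minimal sketch on random stand-in data; `toy_df` and the constant names are assumptions, and only the 100 Hz and 30 second figures come from the text above.

```python
import numpy as np
import pandas as pd

SAMPLE_RATE_HZ = 100   # sampling rate stated above
EPOCH_SECONDS = 30     # epoch length stated above
SAMPLES_PER_EPOCH = SAMPLE_RATE_HZ * EPOCH_SECONDS  # 3000

# Hypothetical stand-in for the real data: 4 epochs of random readings.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "subject": ["SC4001E0"] * 4,
    "annotation": ["W", "W", "1", "2"],
    "EEG_Fpz-Cz_uV": [rng.normal(size=SAMPLES_PER_EPOCH) for _ in range(4)],
})

# to_numpy() on mixed columns gives an object array of shape (rows, columns).
as_objects = toy_df.to_numpy()

# Stacking the array column gives a float matrix of shape (epochs, samples).
signal = np.stack(toy_df["EEG_Fpz-Cz_uV"].to_numpy())
labels = toy_df["annotation"].to_numpy()

# Time axis in seconds from the start of an epoch, e.g. for plotting one row.
t = np.arange(SAMPLES_PER_EPOCH) / SAMPLE_RATE_HZ

print(as_objects.dtype, as_objects.shape)
print(signal.dtype, signal.shape)
```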


#### Read data from the full, partitioned dataset
##### Assumes data has been downloaded to directory "output_full_partitioned" in same dir as notebook

Now let's read from a partitioned Parquet table of all subjects<br/>
Note 1: The data is sorted by increasing epoch start time<br/>
Note 2: I had to raise Jupyter's output data rate limit by starting it with this command: jupyter notebook --NotebookApp.iopub_data_rate_limit=1e10

In [8]:
all_data = pd.read_parquet('output_full_partitioned')
all_data

Unnamed: 0,timestamp,annotation,EEG_Fpz-Cz_uV,EEG_Pz-Oz_uV,EOG_horizontal_uV,subject
0,1989-04-25 00:13:30,W,"[8.111355311355311, 17.488644688644687, 21.239...","[-1.6996336996336874, -1.8915750915750793, -4....","[19.46544566544572, 44.59804639804644, 33.7565...",SC4001
1,1989-04-25 00:14:00,W,"[-10.736996336996336, -11.393406593406594, -4....","[-1.2197802197802075, -1.6036630036629915, -2....","[-25.378998778998724, -47.062026862026805, -37...",SC4001
2,1989-04-25 00:14:30,W,"[61.37435897435897, 38.68131868131868, 49.5589...","[2.9069597069597193, 10.104761904761915, 15.09...","[46.07643467643473, 70.22344322344328, 6.65274...",SC4001
3,1989-04-25 00:15:00,W,"[27.803663003663, 24.146520146520142, 24.14652...","[-4.098901098901087, -0.06813186813185584, 0.1...","[36.71330891330896, 6.159951159951213, 79.0937...",SC4001
4,1989-04-25 00:15:30,W,"[69.15750915750915, 74.59633699633699, 79.0036...","[4.3465201465201595, 2.0432234432234555, -1.50...","[56.42515262515267, 71.70183150183156, 11.0879...",SC4001
...,...,...,...,...,...,...
194594,1991-09-27 08:32:30,W,"[49.2888888888889, 57.73333333333334, 64.31111...","[3.7772893772893683, 14.891575091575081, 15.36...","[132.23882783882783, 126.4161172161172, 132.23...",SC4822
194595,1991-09-27 08:33:00,W,"[-84.04444444444444, -85.73333333333333, -75.4...","[-6.8620268620268705, -18.926251526251534, -8....","[-74.95262515262515, -76.8935286935287, -77.37...",SC4822
194596,1991-09-27 08:33:30,W,"[4.044444444444451, 2.2666666666666737, 6.4444...","[1.2124542124542033, -3.822222222222231, 1.592...","[71.10036630036628, 83.71623931623931, 89.0537...",SC4822
194597,1991-09-27 08:34:00,W,"[-43.95555555555555, -63.42222222222221, -70.8...","[-13.226617826617835, -8.38192918192919, -3.82...","[-63.30720390720391, -91.93553113553112, -129....",SC4822


Now let's only open a single subject's data<br/>
Note: Subject is not part of the dataframe when manually specifying which partition file to open, but the data is still sorted by epoch start time.

In [9]:
sc4001_data = pd.read_parquet('output_full_partitioned/subject=SC4001/part-00042-tid-3307149152373805409-acfc9857-a6f3-4910-9106-ade3c42e68e2-1993-1.c000.snappy.parquet')
sc4001_data

Unnamed: 0,timestamp,annotation,EEG_Fpz-Cz_uV,EEG_Pz-Oz_uV,EOG_horizontal_uV
0,1989-04-25 00:13:30,W,"[8.111355311355311, 17.488644688644687, 21.239...","[-1.6996336996336874, -1.8915750915750793, -4....","[19.46544566544572, 44.59804639804644, 33.7565..."
1,1989-04-25 00:14:00,W,"[-10.736996336996336, -11.393406593406594, -4....","[-1.2197802197802075, -1.6036630036629915, -2....","[-25.378998778998724, -47.062026862026805, -37..."
2,1989-04-25 00:14:30,W,"[61.37435897435897, 38.68131868131868, 49.5589...","[2.9069597069597193, 10.104761904761915, 15.09...","[46.07643467643473, 70.22344322344328, 6.65274..."
3,1989-04-25 00:15:00,W,"[27.803663003663, 24.146520146520142, 24.14652...","[-4.098901098901087, -0.06813186813185584, 0.1...","[36.71330891330896, 6.159951159951213, 79.0937..."
4,1989-04-25 00:15:30,W,"[69.15750915750915, 74.59633699633699, 79.0036...","[4.3465201465201595, 2.0432234432234555, -1.50...","[56.42515262515267, 71.70183150183156, 11.0879..."
...,...,...,...,...,...
836,1989-04-25 07:11:30,W,"[-21.70842490842491, 11.299633699633699, -18.1...","[3.866666666666679, -6.114285714285702, -6.402...","[-126.89499389499385, -98.31282051282047, -92...."
837,1989-04-25 07:12:00,W,"[4.4542124542124535, 5.485714285714286, 4.7355...","[-1.0278388278388157, -2.275457875457863, 4.44...","[64.30989010989016, 63.817094017094064, 71.209..."
838,1989-04-25 07:12:30,W,"[-9.330402930402931, -2.672527472527473, -2.48...","[-4.866666666666654, -5.6344322344322215, 0.50...","[83.52893772893779, 87.9641025641026, 85.99291..."
839,1989-04-25 07:13:00,W,"[24.709157509157507, 22.27106227106227, 27.991...","[13.463736263736276, 13.655677655677668, 0.123...","[-34.2493284493284, -31.292551892551838, -39.1..."
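When a single partition file is opened this way, the subject can still be recovered from the `subject=...` directory name in the path. The helper `partition_values` below is a hypothetical sketch, and the file name in `example` is a shortened placeholder rather than a real file.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Collect key=value directory components from a partitioned path."""
    values = {}
    for part in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        key, sep, value = part.partition("=")
        if sep:  # only directories of the form key=value
            values[key] = value
    return values


# Hypothetical, shortened file name; the directory layout matches the one above.
example = "output_full_partitioned/subject=SC4001/part-00042.snappy.parquet"
print(partition_values(example))
```

The returned value could then be assigned back as a new column, for example `sc4001_data["subject"] = partition_values(path)["subject"]`, where `path` is the string passed to `read_parquet`.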


Now we will open another copy of the full dataset, but only load the subject, timestamp and EEG_Fpz-Cz_uV columns
##### Note how much faster this loads!

In [10]:
all_data_3_cols = pd.read_parquet('output_full_partitioned', columns=['subject','timestamp','EEG_Fpz-Cz_uV'])
all_data_3_cols

Unnamed: 0,subject,timestamp,EEG_Fpz-Cz_uV
0,SC4001,1989-04-25 00:13:30,"[8.111355311355311, 17.488644688644687, 21.239..."
1,SC4001,1989-04-25 00:14:00,"[-10.736996336996336, -11.393406593406594, -4...."
2,SC4001,1989-04-25 00:14:30,"[61.37435897435897, 38.68131868131868, 49.5589..."
3,SC4001,1989-04-25 00:15:00,"[27.803663003663, 24.146520146520142, 24.14652..."
4,SC4001,1989-04-25 00:15:30,"[69.15750915750915, 74.59633699633699, 79.0036..."
...,...,...,...
194594,SC4822,1991-09-27 08:32:30,"[49.2888888888889, 57.73333333333334, 64.31111..."
194595,SC4822,1991-09-27 08:33:00,"[-84.04444444444444, -85.73333333333333, -75.4..."
194596,SC4822,1991-09-27 08:33:30,"[4.044444444444451, 2.2666666666666737, 6.4444..."
194597,SC4822,1991-09-27 08:34:00,"[-43.95555555555555, -63.42222222222221, -70.8..."


Converting this smaller dataframe to a NumPy array is then a very efficient operation

In [11]:
np_arr = all_data_3_cols.to_numpy()

In [12]:
np_arr.shape

(194599, 3)