# Some insights about the data

To build an application that can usefully visualize the provided data, we first need to understand how that data is structured. In this notebook, we'll do some rough data cleaning and visualization, then attempt to build a useful graph. After that, we'll verify the assumptions made about the first file using a second CSV file.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import plotly.express as px

<font color="red">The provided example .csv file appears to have been anonymized by removing the name row. If production files do contain that row, the hard-coded row numbers in the current implementation will be off, so verify this!</font>
<a id="#concern_one"></a>

In [None]:
raw_df = pd.read_csv("validation_data/HART.csv", delimiter=",", header=None)
raw_df.head(15)

Rows 0 up to and including 5 contain general information about the measurements. Rows 6 and 7 are blank. Rows 8 and below belong to the actual measurements. It is useful to split these two parts into two separate dataframes:

In [None]:
# Copy the slice so the in-place set_index below does not operate on a view of raw_df.
information_df = raw_df.loc[:5, :2].copy()
information_df.set_index(0, inplace=True)
information_df

<font color="red">Row 10 is currently skipped because it is NaN in the provided file. Make sure that is the case in all files before actual deployment.</font>

We currently lack the subject expertise to say exactly what these fields mean. The deployment version will require more insight.
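
The check asked for above could look roughly like the sketch below. The helper name `separator_row_is_blank` and the toy frame are made up for illustration; the only assumption about the real files is that they are read with `header=None`, as in this notebook.

```python
import pandas as pd


def separator_row_is_blank(raw_df: pd.DataFrame, row: int = 10) -> bool:
    """Return True when every cell in the given row is NaN (hypothetical helper)."""
    return bool(raw_df.loc[row].isna().all())


# Toy frame standing in for a raw file: row 2 is the blank separator row.
toy_df = pd.DataFrame(
    [["name_a", "name_b"], ["unit_a", "unit_b"], [None, None], ["1", "2"]]
)
print(separator_row_is_blank(toy_df, row=2))
print(separator_row_is_blank(toy_df, row=3))
```

On the real data this would be called as `separator_row_is_blank(raw_df)` before slicing from row 11 onwards.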

In [None]:
measurement_df = raw_df.loc[11:, :1].copy()

# Column names cannot be set while slicing, so build them afterwards
# by combining the name row (8) and the unit row (9) as "name(unit)".
measurement_df.columns = [f"{name}({unit})" for name, unit in zip(raw_df.loc[8, :1], raw_df.loc[9, :1])]
measurement_df.reset_index(drop=True, inplace=True)
measurement_df

We've now isolated the required numeric data in a separate pd.DataFrame. Next, let's look at the datatypes: because the first rows of the file contain text, Pandas most probably interpreted these columns as strings when reading the entire file at once.

In [None]:
measurement_df.dtypes

That's not quite right, but it's fixable.

In [None]:
# Convert column by column (the default axis); axis=1 would convert row by row, which is far slower.
measurement_df = measurement_df.apply(pd.to_numeric)
measurement_df.head(5)

In [None]:
measurement_df.dtypes

The datatypes are alright. It would also be useful to check that there are no NaN values.

In [None]:
measurement_df.isna().any()

It appears that the second column contains NaN values. How many does it contain?

In [None]:
f"{(measurement_df['Afleiding I(µV)'].isna().sum() / measurement_df.shape[0]) * 100}% of Afleiding I(µV) is NaN"

We'll attempt to visualize the timeframe where NaN values occur in the column:

In [None]:
# Encode the NaN mask as 0/1 so it can be plotted against the "No"/"Yes" ticks below.
col2nd_nans = measurement_df["Afleiding I(µV)"].isna().astype(int)

In [None]:
_, ax = plt.subplots()
ax.plot(col2nd_nans)
ax.set_xlabel("Measurement")
ax.set_ylabel("Presence of NaN-value")
ax.set_yticks([0, 1])
ax.set_yticklabels(["No", "Yes"])

At first glance, they seem to be distributed without any clear pattern.

Before we investigate this further, it is useful to visually examine how both columns behave:
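
One way to look past the plot is to measure how long each consecutive run of NaN values is. The sketch below is only an illustration: `nan_run_lengths` is a made-up helper name and `toy_mask` is invented data, not taken from the file.

```python
from itertools import groupby


def nan_run_lengths(is_nan):
    """Return the length of every consecutive run of True values in a boolean sequence."""
    return [sum(1 for _ in group) for value, group in groupby(is_nan) if value]


# Invented mask: True marks a NaN measurement.
toy_mask = [False, True, True, False, False, True, False, True, True, True]
print(nan_run_lengths(toy_mask))  # [2, 1, 3]
```

On the real data this could be called as `nan_run_lengths(measurement_df["Afleiding I(µV)"].isna())`. Many short runs would point to scattered dropouts, while a few long runs would point to interrupted recording.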

In [None]:
comparison_df = measurement_df.iloc[:1000]

In [None]:
comparison_df

In [None]:
fig, axes = plt.subplots(comparison_df.shape[1], 1)
fig.subplots_adjust(hspace=1)

for i, col in enumerate(comparison_df.columns):
    axes[i].plot(comparison_df[col], c="r")
    axes[i].set_xlabel("Observation number")
    axes[i].set_ylabel("Value")
    axes[i].set_title(f"Value against time for {col}")


The product owner has expressed the desire to specifically visualize the top graph, so we won't investigate the column shown in the bottom graph further.

### Fixing timescale

The dataset contains measurements taken at a fixed frequency. That means we can use this frequency to add a time column, so that graphs can show the actual time since the start of the recording on the x-axis.

The information dataframe contains the following information on measurement frequencies:

In [None]:
information_df.loc["Meetfrequentie"]

_511_ and _023_ hertz...

We saw 3 heartbeats in the first 1000 measurements of the earlier graph, which gives

$$ T = \frac{\textup{1000 measurements}}{\textup{3 heartbeats}} = 333 \frac{1}{3}\textup{ frames per heartbeat} $$

At a frequency of 23 hertz this translates to

$$ \frac{333\frac{1}{3}\textup{ frames per heartbeat}}{23\textup{ Hertz}} \approx 14.5\textup{ s per heartbeat} $$

That wouldn't make sense, because we know the patient was not in a medical crisis. At 511 hertz the same calculation gives roughly 0.65 s per heartbeat (about 92 beats per minute), which is plausible, so we use the first value below.
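
The arithmetic above can be double-checked for both candidate frequencies with a few lines of Python. This is a rough sanity check only: the 3 heartbeats were counted by eye from the graph, and `seconds_per_heartbeat` is a helper name made up for this sketch.

```python
def seconds_per_heartbeat(n_samples, n_beats, sampling_rate_hz):
    """Average duration of one heartbeat, given a beat count over n_samples samples."""
    samples_per_beat = n_samples / n_beats
    return samples_per_beat / sampling_rate_hz


for rate in (23, 511):
    period = seconds_per_heartbeat(1000, 3, rate)
    print(f"{rate} Hz: {period:.2f} s per heartbeat, {60 / period:.0f} beats per minute")
```

Only the 511 Hz result falls within a plausible resting heart rate.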


In [None]:
measurement_df["Time"] = measurement_df.index / int(information_df.loc["Meetfrequentie", 1])
measurement_df

### Attempting to make a graph

In [None]:
fig = px.line(
    measurement_df,
    x="Time",
    y="Afleiding(Eenheid)",
    title="Personal device ECG measurements over time",
    labels={
        "Time": "Seconds since start",
        "Afleiding(Eenheid)": "Measurement amplitude (unit unknown)"
    }
)

In [None]:
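# Very slow, improve!
# Major vertical grid lines every 0.2 s.
major_x_line_interval = .2
major_x_line_pos = np.arange(0, measurement_df["Time"].max(), major_x_line_interval)

# A plain loop instead of np.vectorize: without otypes, vectorize calls the
# function an extra time on the first element, which draws the first line twice.
for x in major_x_line_pos:
    fig.add_vline(x, line_color="rgba(255, 0, 0, 0.5)", line_width=.9)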

In [None]:
# Adding the minor lines this way takes an inordinate amount of time. Probably better to leave them out
# and offer a button that renders a matplotlib figure for a specified area instead.
# minor_x_line_interval = .04
# minor_x_line_pos = np.arange(0, np.max(measurement_df["Time"]), minor_x_line_interval)

# np.vectorize(lambda x: fig.add_vline(x, line_color="rgba(255, 0, 0, 0.8)", line_width=.3))(minor_x_line_pos)
# None
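
A possible way around the slowness noted in the comments is to build all grid lines as plain shape dictionaries first and hand them to Plotly in a single `fig.update_layout(shapes=...)` call, instead of calling `fig.add_vline` once per line. The sketch below only builds the dictionaries, so it runs without a figure. `vertical_line_shapes` is a made-up helper, and whether this is fast enough for the minor lines has not been measured here.

```python
import math


def vertical_line_shapes(t_max, interval, color, width):
    """Build Plotly-style line shapes at 0, interval, 2 * interval, ... below t_max."""
    n_lines = math.ceil(t_max / interval)
    return [
        {
            "type": "line",
            "xref": "x",        # x positions are in data coordinates (seconds)
            "yref": "paper",    # y spans the full plot height, from 0 to 1
            "x0": i * interval,
            "x1": i * interval,
            "y0": 0,
            "y1": 1,
            "line": {"color": color, "width": width},
        }
        for i in range(n_lines)
    ]


shapes = vertical_line_shapes(t_max=1.0, interval=0.25, color="rgba(255, 0, 0, 0.5)", width=0.9)
print(len(shapes), [shape["x0"] for shape in shapes])
```

With the figure from this notebook, the call would be something like `fig.update_layout(shapes=vertical_line_shapes(measurement_df["Time"].max(), 0.2, "rgba(255, 0, 0, 0.5)", 0.9))`. Note that this replaces any shapes already on the figure, including lines added earlier with `add_vline`.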

In [None]:
fig.update_traces(line_color="#000000", line_width=1)
fig.update_layout(xaxis_rangeslider_visible=True, xaxis_range=[0, 2])

## Validating row numbers

[As noted before](#concern_one), the row numbers in the CSV may differ between files and should be validated. For that purpose, a second CSV file has been provided.

In [None]:
raw_df_2 = pd.read_csv("validation_data/VOORBEELD2.csv", delimiter=",", header=None)
raw_df_2.head(15)

The information portion of this CSV does appear to be 2 rows longer, as expected. This means the actual measurement data, and its headers, start 2 rows lower as well. This should be taken into account when preparing the data for use in a graph.
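
Rather than hard-coding row numbers, the offsets could be derived from the blank rows that separate the blocks. The sketch below assumes the layout observed in both files: information rows, then one or more fully blank rows, then a name row, a unit row, optionally a blank row, then the measurements. `locate_blocks` and `toy_raw_frame` are made-up names, and the toy frames are invented stand-ins for the real CSVs.

```python
import pandas as pd


def locate_blocks(raw_df: pd.DataFrame) -> dict:
    """Find the row offsets of the information, header, and measurement blocks."""
    blank = raw_df.isna().all(axis=1).tolist()
    if True not in blank:
        raise ValueError("no blank separator row found")
    info_end = blank.index(True)  # information rows are [0, info_end)
    name_row = info_end
    while name_row < len(blank) and blank[name_row]:
        name_row += 1  # skip the blank rows after the information block
    unit_row = name_row + 1
    data_start = unit_row + 1
    while data_start < len(blank) and blank[data_start]:
        data_start += 1  # skip the optional blank row before the measurements
    return {"info_end": info_end, "name_row": name_row,
            "unit_row": unit_row, "data_start": data_start}


def toy_raw_frame(n_info_rows: int) -> pd.DataFrame:
    """Invented stand-in for a raw file with a variable number of information rows."""
    rows = [[f"key{i}", f"value{i}"] for i in range(n_info_rows)]
    rows += [[None, None], [None, None]]
    rows += [["name_a", "name_b"], ["unit_a", "unit_b"], [None, None]]
    rows += [["1", "2"], ["3", "4"]]
    return pd.DataFrame(rows)


print(locate_blocks(toy_raw_frame(6)))
print(locate_blocks(toy_raw_frame(8)))
```

With 6 information rows this reproduces the offsets used for the first file (names on row 8, units on row 9, data from row 11). With 8 information rows everything shifts down by 2.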