---
title: "Finding Outliers by Index"
cdt: 2024-09-06T15:07:04
description: "Identification of outlier samples based on their time span
conclusion: samples 29 and 87 were determined to be runs to be discarded"
project: "total_dataset_EDA"
conclusion: "A table 'excluded_samples' was created to note bad samples to be excluded from downstream analyses. The table consists of three columns: 'sample_num', the primary key from 'pbl.sample_metadata'; 'comment', a user-friendly comment describing why it is excluded; and 'proof', the file name of a notebook containing more information. Samples 29 and 87 were excluded based on their signal."
---

# Summary

- outlier detection through observation of length of signal
  - To investigate the range of idx/mins, we observed the sample lengths, defined as the number of rows of 'mins_corrected' per sample 'id'.
  - We found 10 unique lengths ranging from 2087 to 15600, with one sample at 2087 and one at 15600: samples 29 and 87, respectively.
  - sample 29
    - sample 29 was determined to be an outlier because its runtime was cut too short and baseline was high relative to its neighbours
  - sample 87
    - sample 87 was determined to be an outlier because its observation frequency was 5 Hz, double the 2.5 Hz of the rest of the dataset.
- dataset observation frequency
  - Total dataset observation frequency, bar outliers and minor errors, was 2.5 Hz.
- outcomes
  - dataset is ready for resampling to common time coordinates.


In [None]:
import duckdb as db
import polars as pl

db_path = "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/wines.db"

con = db.connect(db_path, read_only=True)


## What is the Typical Length of a Sample?

To find outliers, we need a standard to compare against. What lengths are present in the set?


In [None]:
lengths = con.sql(
    """
    SELECT
        id,
        min(idx) as min_idx,
        max(idx) as max_idx,
        count(idx) as length
    FROM
        dataset_eda.nm_254
    GROUP BY
        id
    ORDER BY
        id, length
    """
).pl()

lengths.head()


And how many distinct lengths?

In [None]:
pl.Config.set_tbl_rows(1000)

unique_lengths = con.sql(
    """--sql
    SELECT
        DISTINCT length,
        min_idx,
        max_idx,
    FROM
        lengths
    ORDER BY
        length
    """
)

unique_lengths.pl()


In [None]:
(
    unique_lengths.pl()
    .describe()
)


There are 10 unique lengths, with a minimum of 2087, a maximum of 15600, a median of 6600, a mean of 7019.10 and a standard deviation of 3404.82. There are correspondingly 10 unique 'max_mins' values, ranging from 13.9 to 52, with a median of 43.99, a mean of 41.59 and a standard deviation of 11.16.



From experience we know that a 2.5 Hz run requires ~30 mins for full separation, so any run shorter than that is highly suspect.
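
As a rough check of that rule of thumb, the row counts reported above can be converted to run times. This is only a sketch: `runtime_mins` and `expected_rows` are hypothetical helpers, and a perfectly constant sampling rate is assumed.

```python
# Hypothetical helpers: convert between row count and runtime,
# assuming a perfectly constant sampling rate.


def runtime_mins(n_rows: int, hz: float) -> float:
    """Runtime in minutes of a run with `n_rows` observations at `hz`."""
    return n_rows / hz / 60


def expected_rows(mins: float, hz: float) -> int:
    """Number of observations in a run of `mins` minutes at `hz`."""
    return round(mins * 60 * hz)


min_rows = expected_rows(30, 2.5)          # rows needed for a ~30 min run
shortest = runtime_mins(2087, 2.5)         # shortest sample
median = runtime_mins(6600, 2.5)           # median length
longest_at_2_5 = runtime_mins(15600, 2.5)  # longest sample, if it were 2.5 Hz

print(min_rows, round(shortest, 1), round(median, 1), round(longest_at_2_5, 1))
```

Under that assumption the shortest sample ran for about 13.9 mins, which matches the smallest 'max_mins', while the longest would have run for 104 mins, twice the reported maximum of 52. That mismatch already hints that the longest sample was not recorded at 2.5 Hz.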

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# break points for the uneven binning used in the next cell
bin_breaks = [2000, 3000, 5000, 6000, 7000, 8000, 10000, 15000, 16000]

# seaborn's `binwidth` overrides `bins`, so only the even 1000-row width is passed here
sns.histplot(lengths, x='length', binwidth=1000)
plt.title('Histogram of sample lengths, bin width = 1000')


In [None]:
lengths = lengths.with_columns(pl.col('length').cut(breaks=bin_breaks).alias('bin_range'))
lengths = lengths.with_columns(pl.col('bin_range').rank('dense').alias('bin')).sort('bin')
lengths.head()


In [None]:
bin_counts = lengths.group_by('bin','bin_range').len().sort('bin')
bin_counts


As we can see, there is 1 sample in the range 2000 to 3000, 67 in the range 5000 to 6000, 10 in the range 6000 to 7000, and 1 sample in the range 15000 to 16000. The first thing to do is to investigate the two single-sample bins, which hold the outliers.
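
To make the binning rule explicit, here is a minimal standard-library sketch of how a length maps to an interval and how single-sample bins are flagged. It assumes right-closed intervals `(lo, hi]`, which I believe is the default behaviour of `cut`. `bin_label` is a hypothetical helper, and the example lengths are made up apart from the two reported extremes.

```python
from bisect import bisect_left
from collections import Counter

bin_breaks = [2000, 3000, 5000, 6000, 7000, 8000, 10000, 15000, 16000]


def bin_label(length: int, breaks: list[int]) -> str:
    """Label of the right-closed interval (lo, hi] that contains `length`."""
    i = bisect_left(breaks, length)  # number of breaks strictly below `length`
    lo = "-inf" if i == 0 else str(breaks[i - 1])
    hi = "inf" if i == len(breaks) else str(breaks[i])
    return f"({lo}, {hi}]"


# Made-up lengths, apart from the reported extremes 2087 and 15600
example_lengths = [2087, 5400, 5400, 6600, 6600, 15600]
counts = Counter(bin_label(n, bin_breaks) for n in example_lengths)

# Bins holding exactly one sample are the outlier candidates
singletons = [label for label, n in counts.items() if n == 1]
print(singletons)
```

With these inputs only the bins containing 2087 and 15600 are reported as singletons.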


## What are the Outliers?

In [None]:
pl.Config.set_fmt_str_lengths(99999)
con.sql(
    """--sql
    SELECT
        mta.wine,
        mta.sample_num,
        len,
        length,
    FROM
        (FROM lengths) as len
    JOIN
        (FROM bin_counts) as bc
    ON
        len.bin=bc.bin
    JOIN
        pbl.sample_metadata as mta
    USING
        (id)
    WHERE
        bc.len = 1
    """
).pl()


Samples 29 and 87 are the outliers, with lengths of 2087 and 15600 rows, respectively.


We start with sample 29.

### Sample 29


In [None]:
con.sql("""--sql
SELECT
    nm_254.*,
    mta.sample_num,
    mta.wine
FROM
    dataset_eda.nm_254
JOIN
    pbl.sample_metadata as mta
USING
        (id)
WHERE
    mta.sample_num in (29, 28, 32)
""").pl().plot(x='idx', y='absorbance', title='sample 29 @ 254 nm', by='wine')


Compared with neighbouring samples 28 and 32, the profile of sample 29 is not outrageous: the baseline is a little high, and the run simply looks aborted. Nevertheless, it will be isolated from the rest of the set.

In [None]:
# a list of excluded samples by 'sample_num'
excluded_samples = [29,]


### Sample 87


In [None]:
con.sql("""--sql
SELECT
    nm_254.*,
    mta.sample_num,
    mta.wine
FROM
    dataset_eda.nm_254
JOIN
    pbl.sample_metadata as mta
USING
        (id)
WHERE
    mta.sample_num in (86, 87, 88)
""").pl().plot(x='idx', y='absorbance', title='sample 87 @ 254 nm', by=['wine', 'sample_num'], legend=None, fontscale=0.5)


So not only does it have over twice as many data points as the other samples, the overall signal also appears more spread out, with a much lower baseline. According to log entry "2023-10-24_dtw_has_proved_be_somewhat_red", I had determined that sample 87 was recorded at twice the sampling rate. Can we confirm this?


In [None]:
con.sql("SELECT * FROM dataset_eda.nm_254").columns


In [None]:
sample_87 = con.sql(
    """--sql
    SELECT
        sample_num,
        idx,
        secs_corrected
    FROM
        dataset_eda.nm_254
    WHERE
        sample_num = 87
    ORDER BY
        idx
    """
).pl()

# confirm that the secs column matches ordering expectations
con.sql(
    """--sql
    SELECT
        first(idx) as first_idx,
        first(secs_corrected) as min_secs,
        last(idx) as last_idx,
        last(secs_corrected) as max_secs
    FROM
        sample_87
    """
).pl()


Find the frequency in hertz, defined as 1/diff:

In [None]:
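# keep every row: the mean in the next cell must cover the whole sample, not just the head
finite_diff_hz_87 = con.sql(
"""
SELECT
  sample_num,
  idx,
  secs_corrected - lag(secs_corrected) OVER (ORDER BY idx) as diff,
  1/diff as hz
FROM
  sample_87
order by
  idx
"""
).pl()

finite_diff_hz_87.head()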
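
The same finite-difference idea can be sketched in plain Python on synthetic timestamps. This is an illustration only: `finite_diff_hz` is a hypothetical helper and the timestamps are made up, spaced at exactly 0.2 s and 0.4 s.

```python
from statistics import mean


def finite_diff_hz(secs: list[float]) -> list[float]:
    """Instantaneous frequency 1/diff between consecutive timestamps."""
    return [1 / (b - a) for a, b in zip(secs, secs[1:])]


# Synthetic timestamps: 0.2 s spacing (5 Hz) and 0.4 s spacing (2.5 Hz)
secs_5hz = [i * 0.2 for i in range(6)]
secs_2_5hz = [i * 0.4 for i in range(6)]

mean_5 = mean(finite_diff_hz(secs_5hz))
mean_2_5 = mean(finite_diff_hz(secs_2_5hz))
print(round(mean_5, 3), round(mean_2_5, 3))
```

In the SQL version `lag` has no previous value for the first row, so `diff` and `hz` are NULL there. The sketch avoids that by starting from the second observation.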


And the mean:

In [None]:
con.sql(
    """
    FROM
        finite_diff_hz_87
    SELECT
        mean(diff) as mean_diff,
        mean(hz) as mean_hz
    """
).pl()


In [None]:
del finite_diff_hz_87
del sample_87


#### Sample Observation Frequency


As we can see, sample 87 has a mean frequency of 5 Hz. How does that compare to the full set?


In [None]:
nm_254_diff_hz = con.sql(
    """--sql
    WITH with_secs_corrected AS (
    
    SELECT
        *,
        (mins - first(mins) OVER (PARTITION BY id ORDER BY idx))*60 as secs_corrected_x,
        --1/diff as hz
    FROM
        dataset_eda.nm_254
    )
    SELECT
        *,
        secs_corrected_x - lag(secs_corrected_x) OVER ( PARTITION BY id ORDER BY idx) as diff,
        1/diff as hz
    FROM
        with_secs_corrected

    """
)
nm_254_diff_hz.pl().head()


In [None]:
sns.scatterplot(nm_254_diff_hz.pl().cast({'sample_num': str}), x='idx', y='hz', hue='sample_num', legend=False)
plt.title("hz vs. idx per sample")


As can be seen, the vast majority of observations fall on either the 5 Hz or the 2.5 Hz line, with isolated deviations. So how many samples are at 5 Hz?


In [None]:
nm_254_diff_hz_means = con.sql(
    """--sql
    SELECT
        sample_num,
        mean(hz) as mean_hz
    FROM
        nm_254_diff_hz
    GROUP BY
        sample_num
    """
)

nm_254_diff_hz_means.pl().head()


In [None]:
nm_254_diff_hz_means.pl().shape


In [None]:
binned_hz = con.sql("""--sql
SELECT
    sample_num,
    mean_hz,
    CASE
        WHEN mean_hz <= 2.45 THEN 0
        WHEN mean_hz > 2.45 AND mean_hz < 2.6 THEN 1
        WHEN mean_hz >= 2.6 THEN 2
    END as bin
FROM
    nm_254_diff_hz_means
""").pl()

con.sql("""--sql
SELECT
    bin,
    count(sample_num) as count_sample_num,
FROM
    binned_hz
GROUP BY
    bin
""").pl()


So all samples bar one have a mean frequency between 2.45 and 2.6 Hz, and the remaining sample has a mean frequency above that range.
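
For reference, the three-way binning can be mirrored in a few lines of Python. This is a sketch on made-up means: `hz_bin` is a hypothetical helper, and the sample numbers and values in `example_means` are invented for illustration.

```python
def hz_bin(mean_hz: float) -> int:
    """Mirror of the CASE expression: 0 = slow, 1 = ~2.5 Hz, 2 = fast."""
    if mean_hz <= 2.45:
        return 0
    if mean_hz < 2.6:
        return 1
    return 2


# Made-up per-sample means keyed by a hypothetical sample number
example_means = {1: 2.5, 2: 2.498, 3: 2.51, 4: 5.0}

# Anything outside bin 1 deviates from the nominal 2.5 Hz rate
off_rate = {s: hz for s, hz in example_means.items() if hz_bin(hz) != 1}
print(off_rate)
```

Only the hypothetical sample with a mean of 5.0 Hz is flagged here.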

In [None]:
con.sql("""--sql
SELECT
    sample_num,
    mean_hz
FROM
    nm_254_diff_hz_means
WHERE
    mean_hz > 2.6
""").pl()


Lo and behold, it's sample 87. Thus we can confidently state that sample 87 is an outlier to be excluded from the dataset, and that the dataset is ripe for resampling to smooth out irregularities once a common time cutoff is determined.

In [None]:
excluded_samples.append(87)


Now we'll add these two samples to a new table called 'excluded_samples', recording the sample number, a comment and the name of this notebook as proof:

In [None]:
# %%script true # comment this line to run

# close the read-only connection before reconnecting in write mode
con.close()
if input("warning: this will replace the table with the values here. press y to continue:") == 'y':
    with db.connect(db_path) as con:
        # named *_tbl so the `excluded_samples` list defined earlier is not overwritten
        excluded_samples_tbl = con.sql("""--sql
        CREATE OR REPLACE TABLE
            dataset_eda.excluded_samples (sample_num INTEGER PRIMARY KEY, comment VARCHAR, proof VARCHAR);
        INSERT INTO
            dataset_eda.excluded_samples
        BY NAME
            (SELECT 87 as sample_num, 'recorded at 5hz compared to rest of dataset @ 2.5Hz' as comment, 'finding_outliers_by_idx.ipynb' as proof );
        INSERT INTO
            dataset_eda.excluded_samples
        BY NAME
            (SELECT 29 AS sample_num, 'aborted run, high baseline' as comment, 'finding_outliers_by_idx.ipynb' as proof);
        SELECT * FROM dataset_eda.excluded_samples LIMIT 10
        """).pl()
    display(excluded_samples_tbl)
    print("replaced table")
else:
    print("did not execute")
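
Downstream analyses can then drop the excluded samples with an anti-join on 'sample_num'. The sketch below is a stand-in, not the project code: it uses the standard-library `sqlite3` in place of DuckDB so that it runs without the thesis database, and the `samples` table and its rows are made up.

```python
import sqlite3

# sqlite3 stands in for DuckDB here; the `samples` table and its rows are made up.
con_demo = sqlite3.connect(":memory:")
con_demo.executescript(
    """
    CREATE TABLE samples (sample_num INTEGER PRIMARY KEY, wine TEXT);
    INSERT INTO samples VALUES (28, 'wine a'), (29, 'wine b'), (87, 'wine c'), (88, 'wine d');

    CREATE TABLE excluded_samples (sample_num INTEGER PRIMARY KEY, comment TEXT, proof TEXT);
    INSERT INTO excluded_samples VALUES
        (29, 'aborted run, high baseline', 'finding_outliers_by_idx.ipynb'),
        (87, 'recorded at 5hz compared to rest of dataset @ 2.5Hz', 'finding_outliers_by_idx.ipynb');
    """
)

# Anti-join: keep only samples that do not appear in excluded_samples
kept = [
    row[0]
    for row in con_demo.execute(
        """
        SELECT sample_num
        FROM samples
        WHERE sample_num NOT IN (SELECT sample_num FROM excluded_samples)
        ORDER BY sample_num
        """
    )
]
con_demo.close()
print(kept)
```

The `NOT IN` subquery is plain SQL, so the same filter should carry over to a DuckDB query against 'dataset_eda.excluded_samples'.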


### Conclusion

The outliers are samples 29 and 87: 29 because it was an aborted run, and 87 because it was recorded at the wrong sampling frequency. All other samples were recorded at approximately 2.5 Hz, barring rare errors. The dataset is ready for resampling to remove those errors once a maximum cutoff is determined.
