# Working with data
-------------

In [None]:
import scipy.stats as spst
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Today's Goals:
* Common operations for reading in data
* Descriptions of data
* Presenting data
* Statistical measures

What is data?

What is metadata?

# An example data set
Let's work through some common operations involved with *reading* in and *displaying* data. These files were downloaded from [Kauring data](https://ecat.ga.gov.au/geonetwork/srv/eng/catalog.search#/metadata/70811).


Inside the KauringGravity folder are a few files:
 * A point-located [data file](KauringGravity/Kauring%20AG-AGG%20Ground%20Gravity%202009%20AGG%20Area%20Data.dat)
 * A [README](KauringGravity/Kauring%20AG-AGG%20Ground%20Gravity%202009%20AGG%20Area%20Data%20--%20README.txt)
 * A [header file](Lectures/Fundamentals/KauringGravity/Kauring%20AG-AGG%20Ground%20Gravity%202009%20AGG%20Area%20Data.dfn) describing the columns of the data files

You are all likely familiar with using pandas to read in data files, so let's stick with that.

In [None]:
import pandas as pd

In [None]:
data_file_name = 'KauringGravity/Kauring AG-AGG Ground Gravity 2009 AGG Area Data.dat'

In [None]:
pd.read_table(data_file_name)

That did something: the data is there, but not yet in a form we can actually use...

Let's give pandas some more information, starting with explicitly telling it the delimiter (separator):

In [None]:
pd.read_table(data_file_name, sep=r"\s+")  # columns are separated by runs of whitespace

Well that looks better, we have the expected number of columns!

What else is wrong though?

In [None]:
# don't interpret the first line as a header, it is actually data
pd.read_table(data_file_name, sep=r"\s+", header=None)

Now what?

This data is in a completely usable format, but it would be easier to work with if each column had an appropriate name...

The `.dfn` file has some good names! Let's make a list of them.

In [None]:
header_file_name = 'KauringGravity/Kauring AG-AGG Ground Gravity 2009 AGG Area Data.dfn'

In [None]:
with open(header_file_name) as f:
    for line in f:
        print(line, end="")  # each line already ends with a newline

Looks like there are descriptions of two data files in here:
1) The point data
2) The coordinate reference frame

Let's grab the column labels for the data frame from here.

An example line looks like:

`DEFN 23 ST=RECD,RT=;UNITS_OF_ELLIPSOID_GROUND_HGT:A7:NULL=9999`

How can we separate out the `UNITS_OF_ELLIPSOID_GROUND_HGT` from the above string?

In [None]:
example_line = "DEFN 23 ST=RECD,RT=;UNITS_OF_ELLIPSOID_GROUND_HGT:A7:NULL=9999\n"
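
One way to pull the label out, as a minimal sketch: split on `;` to drop the `DEFN ...` prefix, then split on `:` to drop the format and null fields. This assumes every definition line follows the same `prefix;NAME:FORMAT:...` layout as the example line.

```python
example_line = "DEFN 23 ST=RECD,RT=;UNITS_OF_ELLIPSOID_GROUND_HGT:A7:NULL=9999\n"

# Everything after the ';' describes the column: NAME:FORMAT:NULL=...
column_description = example_line.split(";")[1]

# The column name is the first ':'-separated field
name = column_description.split(":")[0]
print(name)
```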

In [None]:
labels = []
with open(header_file_name) as f:
    f.readline()  # skip the first line
    for _ in range(74):  # read the next 74 lines
        line = f.readline()
        # keep the text between the ';' and the first ':' that follows it
        name = line.split(";")[1].split(":")[0]
        labels.append(name)
labels

In [None]:
# Now let's also give the pandas reader the column names
dat = pd.read_table(data_file_name, sep=r"\s+", names=labels)
dat

## Key thoughts on reading in data files
* Data on its own is useless without metadata
* Make use of available data readers (`pandas` or `numpy` in Python)
    * Do not write your own code to read in data unless you absolutely have to
    * You'll likely need to pass specific parameters to these functions to make them work with your file.
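
As a small illustration of "passing specific parameters", here is a sketch that reads a whitespace-delimited table with no header row. The file contents, the column names, and the `-9999` null marker are all made up for this example; the real values for a data set come from its metadata.

```python
import io
import pandas as pd

# Hypothetical file contents: whitespace-delimited, no header row,
# with -9999 standing in for missing values.
text = """\
101  500000.0  6500000.0  -12.5
102  500100.0  6500000.0  -9999
103  500200.0  6500000.0  -11.8
"""

df = pd.read_csv(
    io.StringIO(text),                     # a file path works here too
    sep=r"\s+",                            # one or more whitespace characters
    header=None,                           # the first line is data, not names
    names=["station", "x", "y", "value"],  # hypothetical column names
    na_values=[-9999],                     # treat the null marker as missing
)
print(df)
```

Each keyword answers one of the questions the metadata should settle: how columns are separated, whether there is a header, what the columns are called, and how missing values are flagged.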

# Fundamental thoughts on displaying data
You want to be able to interpret the data you collect, often visually.

In Python, `matplotlib` is often a great place to start.

In [None]:
import matplotlib.pyplot as plt

In [None]:
lats = dat['LATITUDE']
lons = dat['LONGITUDE']

In [None]:
# creating a simple scatter plot
plt.scatter(lons, lats)

It looks like this data set also has UTM coordinates associated with it!

In [None]:
dat_x = dat['X_GDA94_MGA50']
dat_y = dat['Y_GDA94_MGA50']
plt.scatter(dat_x, dat_y)

What's missing from these plots?

How can we make them better?
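
One possible answer, sketched with synthetic coordinates rather than the Kauring data: label the axes (with units), add a title, and use an equal aspect ratio so distances are not distorted. The coordinate ranges and label text are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for projected station coordinates (not the real survey)
rng = np.random.default_rng(0)
easting = 480_000 + 5_000 * rng.random(200)
northing = 6_480_000 + 5_000 * rng.random(200)

fig, ax = plt.subplots()
ax.scatter(easting, northing, s=4)
ax.set_xlabel("Easting (m)")
ax.set_ylabel("Northing (m)")
ax.set_title("Station locations")
ax.set_aspect("equal")  # one metre east plots as long as one metre north
```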

## Plotting values associated with spatial locations

Q: What's difficult about plotting values associated with these locations?

* It's not a regular grid
* It's at an angle

Some options:
1) Color the scatter plot points

In [None]:
grav_v = dat['INFINITE_SLAB_BOUGUER_ANOMALY']
plt.scatter(dat_x, dat_y, c=grav_v, s=1)

2) Interpolate and plot (either onto a regular grid, or directly on a triangulation of the points, which is what `tripcolor` does)

In [None]:
# The easy way!
plt.tripcolor(dat_x, dat_y, grav_v)

In [None]:
# Change the shading: 'gouraud' interpolates the colors across each triangle
plt.tripcolor(dat_x, dat_y, grav_v, shading='gouraud')
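
`tripcolor` works directly on a triangulation of the scattered points. If you actually need values on a regular grid, one option is `scipy.interpolate.griddata`. The sketch below uses synthetic points and values (a simple plane) in place of the survey data, and the grid extent is chosen to sit inside the cloud of points.

```python
import numpy as np
from scipy.interpolate import griddata

# Synthetic scattered data: values lie on the plane 2x + 3y
rng = np.random.default_rng(0)
x = rng.random(500)
y = rng.random(500)
values = 2.0 * x + 3.0 * y

# Regular grid covering the interior of the point cloud
grid_x, grid_y = np.meshgrid(
    np.linspace(0.2, 0.8, 50),
    np.linspace(0.2, 0.8, 40),
)

# Linear interpolation on a triangulation of the scattered points
grid_values = griddata((x, y), values, (grid_x, grid_y), method="linear")
print(grid_values.shape)
```

The result can then go to any grid-based plotting function, e.g. `plt.pcolormesh(grid_x, grid_y, grid_values)`. Grid nodes outside the convex hull of the data come back as `nan` with linear interpolation.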

3) contour plot

In [None]:
plt.tricontour(dat_x, dat_y, grav_v, levels=20)

In [None]:
plt.tricontourf(dat_x, dat_y, grav_v, levels=20)

In [None]:
# or both filled and contoured!
plt.tricontourf(dat_x, dat_y, grav_v, levels=20)
plt.tricontour(dat_x, dat_y, grav_v, levels=20, colors='w', alpha=0.25, linestyles='--')

## A note on colormaps

In [None]:
def make_comparison(map1='viridis', map2='jet'):
    plt.figure(figsize=(8,4), dpi=250)
    plt.subplot(1,2,1)
    plt.tripcolor(dat_x, dat_y, grav_v, cmap=map1, shading='gouraud')
    plt.subplot(1,2,2)
    plt.tripcolor(dat_x, dat_y, grav_v, cmap=map2, shading='gouraud')

In [None]:
make_comparison()

### What's wrong with Jet?

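... a lot. It is not perceptually uniform: its brightness rises and then falls across the range, which creates apparent boundaries that are not in the data. It also loses its ordering when printed in grayscale, and it is hard to read with color vision deficiency.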
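
A rough way to check the brightness behaviour of a colormap, as a sketch: sample the colormap, convert each color to an approximate brightness, and count how often the brightness switches between rising and falling. The Rec. 709 weights applied directly to the RGB values are only an approximation of perceived lightness, and both helper functions are hypothetical names for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

def approx_luminance(cmap_name, n=256):
    """Approximate brightness along a colormap (Rec. 709 weights on RGB)."""
    rgb = plt.get_cmap(cmap_name)(np.linspace(0, 1, n))[:, :3]
    return rgb @ np.array([0.2126, 0.7152, 0.0722])

def count_reversals(values):
    """Number of times a sequence switches between rising and falling."""
    signs = np.sign(np.diff(values))
    signs = signs[signs != 0]  # ignore flat steps
    return int(np.sum(signs[1:] != signs[:-1]))

for name in ["viridis", "jet"]:
    lum = approx_luminance(name)
    print(name, "brightness reversals:", count_reversals(lum))
```

A colormap meant for ordered data should get steadily brighter (or darker) from one end to the other. Any reversal means two different data values can look equally bright.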

In [None]:
import numpy as np
from colorspacious import cspace_convert
from matplotlib.colors import ListedColormap

def make_simple(map1='viridis', map2='jet', colorblind=None):
    x = np.linspace(10, 20, 256)
    y = np.ones(100)
    v = x[None, :] * y[:, None]

    map1 = plt.get_cmap(map1)
    plt.figure(figsize=(8,4), dpi=250)
    plt.subplot(1,2,1)
    plt.imshow(v, cmap=map1)
    plt.axis('off')
    plt.title(map1.name)

    map2 = plt.get_cmap(map2)
    plt.subplot(1,2,2)
    plt.imshow(v, cmap=map2)
    plt.axis('off')
    plt.title(map2.name)

In [None]:
def as_colorblind(cmap, cvd_type="deuteranomaly", severity=100):
    assert cvd_type in {
        "deuteranomaly",
        "protanomaly",
        "tritanomaly"
    }

    # 1. Sample the colormap at evenly spaced points
    n_colors = 256
    original_cmap = plt.get_cmap(cmap, n_colors)
    rgb_colors = original_cmap(np.linspace(0, 1, n_colors))[:, :3]  # keep RGB, discard alpha
    # 2. Define the CVD simulation parameters
    cvd_space = {
        "name": "sRGB1+CVD",
        "cvd_type": cvd_type,
        "severity": severity,
    }
    # 3. Apply the CVD transformation and convert back to sRGB
    simulated_rgb_colors = cspace_convert(rgb_colors, cvd_space, 'sRGB1')
    
    # Ensure values are within valid RGB range [0, 1]
    simulated_rgb_colors = np.clip(simulated_rgb_colors, 0, 1)
    cmap = ListedColormap(simulated_rgb_colors, name=original_cmap.name + " " + cvd_type)
    return cmap

def compare_colorblind(cmap, cvd_type="deuteranomaly", severity=100):
    make_simple(cmap, as_colorblind(cmap, cvd_type, severity))

In [None]:
make_simple('jet', 'viridis')

### Colorblindness...
Check this out: [changing matplotlib's defaults](https://bids.github.io/colormap/)

Types of colorblindness
* deuteranomaly - red-green colorblind (reduced sensitivity to green)
* protanomaly - red-green colorblind (reduced sensitivity to red)
* tritanomaly - blue-yellow colorblind (reduced sensitivity to blue)

At `severity=100` the simulation corresponds to a complete loss of that sensitivity.

Red-green colorblindness affects about 1 in 12 men and 1 in 200 women, so it is fairly common.

Tritanomaly is **very** rare (<0.1%).

In [None]:
compare_colorblind('jet', cvd_type="deuteranomaly", severity=100)

In [None]:
cmap = 'jet'; make_comparison(cmap, as_colorblind(cmap))

### More information
Check out matplotlib's descriptions of colormaps
[here](https://matplotlib.org/stable/users/explain/colors/colormaps.html).

# Statistics

Q: When we take a measurement in the real world should we ever use an interpretation that exactly matches our measurement?

## Errors in data
Measurements always have sources of uncertainty.

What can we do to reduce them?

## Basic Idea:
If the noise is random, take many measurements and average them.

In [None]:
true = np.r_[1, 1]
rng = np.random.default_rng(4421)

obs = true + rng.standard_normal((3312, 2))

avg = np.mean(obs, axis=0)
avg

In [None]:
plt.scatter(obs[:, 0], obs[:, 1], alpha=0.2, marker='.')
plt.scatter(true[0], true[1])
plt.scatter(avg[0], avg[1])

# Probability distributions

Measurements are statistical in nature: their errors usually follow some unknown probability distribution.

Distributions are either:
* Continuous
* Discrete

## Common discrete distributions
* Binomial: Number of successes in a fixed number of independent trials, each with the same probability of success.
* Poisson: Number of events in a fixed interval of time, given a constant average rate (a limit of the Binomial distribution for many trials, each with a small probability of success).

In [None]:
# Parameters
n = 100          # Number of trials
p = 0.5         # Probability of success

x = np.arange(0, n+1)

pmf_values = spst.binom.pmf(x, n, p)

# Plotting
plt.bar(x, pmf_values, color='skyblue', edgecolor='black')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
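
The statement that the Poisson distribution is a limit of the Binomial can be checked numerically. In this sketch the expected number of events is held fixed while the number of trials grows and the per-trial probability shrinks; the rate of 3 and the count of 2 are arbitrary illustrative values.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """Probability of exactly k events when `rate` events are expected."""
    return math.exp(-rate) * rate**k / math.factorial(k)

rate = 3.0  # expected number of events (illustrative)
k = 2       # number of events we ask about (illustrative)

for n in [10, 100, 10_000]:
    p = rate / n  # keep n * p fixed while n grows
    print(n, binomial_pmf(k, n, p), poisson_pmf(k, rate))
```

As `n` grows, the Binomial value should approach the Poisson value.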

## Common continuous distributions
* Uniform
* Exponential
* Laplace (or double exponential)
* Normal (or Gaussian)
* Chi-squared
* Student's T-distribution

Many more are available through scipy; see [here](https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributions).

In [None]:
x = np.linspace(-4, 4, 256)

pdf = spst.uniform.pdf(x, -2, 4)  # loc=-2, scale=4: uniform on [-2, 2]

# Plotting
plt.plot(x, pdf)
plt.title('Uniform Distribution')
plt.xlabel('X')
plt.ylabel('Probability density')

In [None]:
x = np.linspace(-4, 4, 256)
pdf = spst.t.pdf(x, 2)  # 2 degrees of freedom

# Plotting
plt.plot(x, pdf)
plt.title("Student's T Distribution")
plt.xlabel('X')
plt.ylabel('Probability density')

In [None]:
x = np.linspace(0, 20, 256)
pdf = spst.chi2.pdf(x, 5)  # 5 degrees of freedom

# Plotting
plt.plot(x, pdf)
plt.title("Chi-Squared Distribution")
plt.xlabel('X')
plt.ylabel('Probability density')

In [None]:
x = np.linspace(0, 20, 256)
pdf = spst.lognorm.pdf(x, 0.5)  # shape parameter s: standard deviation of log(X)

# Plotting
plt.plot(x, pdf)
plt.title("Log-normal Distribution")
plt.xlabel('X')
plt.ylabel('Probability density')

In [None]:
x = np.linspace(-4, 4, 256)
pdf = spst.norm.pdf(x, 0)  # mean 0, standard deviation 1 (the default scale)

# Plotting
plt.plot(x, pdf)
plt.title("Normal Distribution")
plt.xlabel('X')
plt.ylabel('Probability density')

## Properties of sets of samples
* Sample: A draw (or draws) from a random process

Multiple samples are usually described by their
* Means ($\mu$)
* Standard deviations ($\sigma$) (or variances $\sigma^2$)

Q: If I take two sets of 10 measurements and average each individually, will I get the same value?

### Implies:
Sample averages (means) are also randomly distributed!

P.S. Sample standard deviations are also randomly distributed.
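
A quick way to see this, as a sketch: draw two sets of 10 values from the same distribution and compare their statistics. The normal distribution, its parameters, and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
true_mean, true_std = 1.0, 1.0  # illustrative values

# Two independent sets of 10 "measurements" of the same quantity
set_a = rng.normal(true_mean, true_std, size=10)
set_b = rng.normal(true_mean, true_std, size=10)

print("means:              ", set_a.mean(), set_b.mean())
print("standard deviations:", set_a.std(ddof=1), set_b.std(ddof=1))
```

Neither set reproduces the true mean or the true standard deviation exactly, and the two sets do not agree with each other.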

Sample Mean:
$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \neq \mu $$

(Population) Variance:
$$ \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2 $$


Sample Variance (note $\bar{x}$ instead of $\mu$, $n-1$ in the division, and $s^2$ instead of $\sigma^2$ to distinguish it from the population variance):
$$ s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2 $$
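
In NumPy the choice of divisor is controlled by the `ddof` argument. The sketch below compares the sample-variance formula written out by hand with `np.var`; the data values are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 1.5, 2.3, 3.1, 0.5])  # illustrative values
n = x.size
x_bar = x.mean()

# Sample variance written out: divide by n - 1
sample_var_manual = np.sum((x - x_bar) ** 2) / (n - 1)

# The same thing through NumPy: ddof=1 means "divide by n - 1"
sample_var_numpy = np.var(x, ddof=1)

# NumPy's default (ddof=0) divides by n instead
divide_by_n = np.var(x)

print(sample_var_manual, sample_var_numpy, divide_by_n)
```

Note that `np.var` and `np.std` divide by `n` unless `ddof=1` is passed explicitly.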

## The Central Limit Theorem:
Given enough samples, the distribution of the sample mean ($\bar{X}_n$) tends towards a normal distribution with mean $\mu$ (the population/true mean) and variance $\sigma^2/n$, **regardless of the distribution of $X$** (as long as its variance is finite).

P.S. For normally distributed data, the sample variance follows a (scaled) chi-squared distribution.
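
A numerical check of the central limit theorem, as a sketch: average many sets of draws from a strongly skewed distribution and look at how the averages are spread. The exponential distribution, the set size of 50, the number of trials, and the seed are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 50             # samples in each average
n_trials = 20_000  # number of averages

# Exponential with scale 1: mean 1, variance 1, far from normal
samples = rng.exponential(scale=1.0, size=(n_trials, n))
means = samples.mean(axis=1)

print("mean of the means:    ", means.mean())
print("variance of the means:", means.var(ddof=1))
print("sigma^2 / n:          ", 1.0 / n)
```

A histogram of `means` (e.g. `plt.hist(means, 50)`) should look close to a bell curve even though the individual samples do not.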

Why is this important?

Most of the time we do not take a single value to represent a measurement. Commonly we:
* Average a single location over time to reduce the effects of random noise, and to average out consistently varying noise (e.g. a sine-wave-like error source).
* Take many measurements close together in space (e.g. on an airborne system) and average them together.

## Estimating errors
So how do we estimate the error in our measurements (or at least a range of possible values)?

Say we take measurements of a value at a single location over time and obtain the following values.

In [None]:
my_data = np.array([1.0, 1.5, 2.3, 3.1, 0.5, 2.4, 3.2, 2.3, 2.1, 4.2, 0.0, 1.9])

In [None]:
plt.hist(my_data, 10);
plt.xlabel('X')
plt.ylabel('Counts')

As we take more and more values, we might expect the shape of this histogram to match the shape of the true distribution

We can calculate the mean and standard deviation of this data:

In [None]:
np.mean(my_data)

In [None]:
np.std(my_data, ddof=1)

These numbers define the mean and standard deviation of our samples.

But, what is the error in the mean?

It turns out that, by the central limit theorem, the mean tends towards a normal distribution with a variance of:
$$ \sigma_{mean}^2 = \frac{\sigma^2}{n}$$

* As we take more samples, the standard deviation of the mean, $\sigma_{mean} = \sigma/\sqrt{n}$, gets smaller
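
Applied to the measurements above, a minimal sketch looks like the following. It uses the sample standard deviation as a stand-in for the unknown true $\sigma$, which is an assumption that gets better as the number of measurements grows.

```python
import numpy as np

my_data = np.array([1.0, 1.5, 2.3, 3.1, 0.5, 2.4, 3.2, 2.3, 2.1, 4.2, 0.0, 1.9])

n = my_data.size
mean = my_data.mean()
s = my_data.std(ddof=1)      # sample standard deviation
std_of_mean = s / np.sqrt(n)  # estimated standard deviation of the mean

print(f"mean = {mean:.2f} +/- {std_of_mean:.2f}")
```

The spread of the individual measurements (`s`) and the uncertainty of their average (`std_of_mean`) are different quantities; only the second shrinks as more measurements are added.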

### Questions: 
I average 2 samples from a distribution whose standard deviation is $\sigma=20$:

1) What is the standard deviation of my average value?

In [None]:
print(np.sqrt(20**2 / 2))

2) If I instead average 4 measurements, what is the standard deviation?

In [None]:
print(np.sqrt(20**2 / 4))

**note**: averaging 4 measurements halves the standard deviation relative to a single measurement (from 20 to 10)

3) What about for 6, 8, 10, 12, 14, 16?

In [None]:
for n in [6, 8, 10, 12, 14, 16]:
    s_mean = np.sqrt(20**2/n)
    print(n, s_mean)

* Going from 1 to 4 measurements halved the error
* Going from 4 to 16 measurements halved it again

In general, when the errors are independent and random, the error in an average of $n$ measurements scales as
$$ \frac{1}{\sqrt{n}} $$

In [None]:
1 / np.sqrt(4096)
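
The same relation can be turned around to ask how many measurements are needed to reach a target error. This is a sketch that assumes independent random errors with a known single-measurement standard deviation; `measurements_needed` is a hypothetical helper, not a library function.

```python
import math

def measurements_needed(sigma, target):
    """Smallest n with sigma / sqrt(n) <= target, for independent errors."""
    return math.ceil((sigma / target) ** 2)

print(measurements_needed(20, 10))  # 4, matching question 2 above
print(measurements_needed(20, 5))   # 16
```

Because of the square, every further halving of the error costs four times as many measurements.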