## Course Information

#### Websites
##### Google Drive

* Readings
* Lectures

##### GitHub

* Homework assignments and starter code
* Tutorials
* Course Notes

#### Assignments

Homework will be weekly at first.

#### Final Project
format: Jupyter notebook
* Code, text, images
* Open ended, but using data that is interesting or related to your thesis project.


## Types of Data 
__Discrete__ - represented by an integer (whole number), e.g. counts, presence/absence <br>
__Continuous__ - represented by real<sup>1</sup> numbers (e.g. temperature, wave speed, length of an organism, concentration)  <br>
__Categorical__ - examples: species, sediment type, hair color, site # <br>
__Metadata__ - "data describing data"

<sup>1</sup> *Actually, vectors can be represented as complex numbers with an imaginary part*

## Types of Measurements
__Nominal__ - categories of equal rank<br>
* Species Description (phytoplankton types: diatoms, coccolithophores, etc.) 


__Ordinal__ - categories have a logically defined rank, but the steps aren't equal in size or quantifiable<br>

* How sediment grains are categorized: angularity and sphericity
* Hurricane scale: ranking is not equivalent to strength
* Beaufort Wind Scale: Mariner estimate of wind from wave climate, logically defined, but not quantifiable

__Scale__: __interval__ and __ratio__
	
* __interval scale__: constant successive intervals, but the reference point is arbitrary (e.g. temperature in degrees Celsius) <br>
* __ratio scale__: natural zero point (ex: length, mass)
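
As a rough illustration of why the distinction matters, the sketch below compares two temperatures in degrees Celsius (interval scale) and in kelvin (ratio scale). The two temperature values are made up for the example.

```python
# Hypothetical temperatures, chosen only for illustration
t1_celsius = 10.0
t2_celsius = 20.0

# Celsius has an arbitrary zero, so this ratio has no physical meaning
ratio_celsius = t2_celsius / t1_celsius

# Kelvin has a natural zero (absolute zero), so ratios are meaningful
t1_kelvin = t1_celsius + 273.15
t2_kelvin = t2_celsius + 273.15
ratio_kelvin = t2_kelvin / t1_kelvin

# Differences (intervals) are the same on both scales
diff_celsius = t2_celsius - t1_celsius
diff_kelvin = t2_kelvin - t1_kelvin

print("ratio in Celsius:", ratio_celsius)
print("ratio in kelvin: ", round(ratio_kelvin, 3))
print("differences:     ", diff_celsius, round(diff_kelvin, 2))
```

The ratio of 2 on the Celsius scale does not mean "twice as warm"; on the kelvin scale the same pair of temperatures differs by only a few percent.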

## Types of Error
__Systematic error__ - a repeatable bias, i.e. the errors follow some pattern rather than varying randomly <br>
__Random error__ - Imprecise, but unbiased. Noise.
![error](images/error_type.png)

__Measurement precision__<br>
You would not record a value from a ruler to 3.7567453 cm if your ruler only has mm hash marks

Rounding introduces error into your calculations, so in general it is better to carry all of the digits that you have through the calculation and round off to the last significant digit only when reporting the value.
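
A small sketch of this effect, using made-up length measurements: rounding each value before adding gives a different total than rounding once at the end.

```python
# Hypothetical length measurements in cm, for illustration only
values = [1.24, 1.24, 1.24, 1.24]

# Carry all digits through the sum, round only when reporting
total_round_late = round(sum(values), 1)

# Round every value first, then add
total_round_early = round(sum(round(v, 1) for v in values), 1)

print("round at the end:", total_round_late)
print("round first:     ", total_round_early)
```

The two totals differ by 0.2 cm even though each individual rounding step changed a value by only 0.04 cm.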

__Drift__ - a systematic error that changes over time
![Nitrate Profile](images/nitrate_profile_smooth.png)
Johnson, K. S., and L. J. Coletti, 2002: In situ ultraviolet spectrophotometry for high resolution and long-term monitoring of nitrate, bromide and bisulfide in the ocean. _Deep Sea Res. Part I Oceanogr. Res. Pap._, 49, 1291–1305, doi:10.1016/S0967-0637(02)00020-1.

Bottle data (squares): chemically derived values of nitrate (low resolution, high precision)<br>
ISUS nitrate sensor: high resolution, with systematic error (values underreported compared to bottle casts) and instrument noise (small-scale random error)

Smoothing reduces noise, but there is a trade-off: it lowers resolution.<br>
In this example, the bias is reduced by correcting for the temperature dependence of the sensor.
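
The smoothing trade-off can be sketched with synthetic data. The example below is not the processing used by Johnson and Coletti; it assumes a made-up profile with a sharp step, Gaussian noise, and a simple running mean (boxcar filter) with an arbitrarily chosen window of 11 points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" profile with a sharp step (hypothetical values)
depth = np.arange(200)
true_profile = np.where(depth < 100, 0.0, 10.0)
noisy = true_profile + rng.normal(0.0, 1.0, size=depth.size)

# Running mean over an odd-length window
window = 11
kernel = np.ones(window) / window
smoothed = np.convolve(noisy, kernel, mode="same")

# Noise level in a flat region, away from the step and the array edges
flat = slice(20, 80)
noise_raw = np.std(noisy[flat])
noise_smooth = np.std(smoothed[flat])

# Loss of resolution: the one-point step in the true profile is smeared
# over several points when the same filter is applied to it
smoothed_truth = np.convolve(true_profile, kernel, mode="same")[50:150]
n_transition = int(np.sum((smoothed_truth > 0.01) & (smoothed_truth < 9.99)))

print("noise before and after smoothing:", noise_raw, noise_smooth)
print("points spanned by the smeared step:", n_transition)
```

A wider window would reduce the noise further, at the cost of smearing the step over more points.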

__Normal Distribution__

<img src="images/normal_dist.png" width="500">

_Image Source:_ Tauxe, L, Banerjee, S.K., Butler, R.F. and van der Voo R, Essentials of Paleomagnetism, 4th Web Edition, 2016. 
[https://earthref.org/MagIC/books/Tauxe/Essentials/WebBook3ch11.html](https://earthref.org/MagIC/books/Tauxe/Essentials/WebBook3ch11.html)<br>

Bars: Fraction of samples that occur in each range, per bed thickness, normalized

Black Line: Probability density = fraction/$\Delta x$ <br>
Red line: Normal distribution (hypothetical); assuming this form allows statistical theory to be applied to the data <br>

Integral of area under the curve = 1 (all probability falls under the curve)
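
The relationship between the bars, the probability density, and the unit area can be checked numerically. This sketch uses synthetic normally distributed samples rather than the bed thickness data in the figure; the sample size and number of bins are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)  # synthetic data

# Bars: fraction of samples that fall in each bin
counts, edges = np.histogram(samples, bins=30)
dx = np.diff(edges)
fraction = counts / counts.sum()

# Probability density = fraction / bin width
density = fraction / dx

# The area under the density histogram is 1
area = np.sum(density * dx)

# Normal distribution with the same mean and standard deviation,
# evaluated at the bin centers for comparison
centers = 0.5 * (edges[:-1] + edges[1:])
mean, std = samples.mean(), samples.std(ddof=1)
normal_pdf = np.exp(-0.5 * ((centers - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

print("area under density histogram:", area)
print("largest difference from normal curve:", np.max(np.abs(density - normal_pdf)))
```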


## Descriptive statistics

### Mean ###

With a finite number of $N$ samples, the __true mean__ of a population, $\mu$ ("mu"), can be _estimated_ by the sample mean $\bar{x}$,

$$ \bar{x} = \frac{1}{N}\sum_{i=1}^{N}{x_i} = \frac{1}{N}(x_1+x_2+x_3+....+x_N) $$


### Variance ###
__Variance__ describes the spread of the data. The sample variance is equal to the sample __standard deviation__ squared ($s^2$). The sample variance is an estimate of the true variance ($\sigma^2$)

$$s^2 = \frac{1}{N-1}\sum_{i=1}^N(x_i-\bar{x})^2,$$

where $N-1$ is the __degrees of freedom__. Degrees of freedom are the number of independent pieces of information. The variance of one sample ($N=1$) is essentially meaningless because the deviation from the mean is always equal to zero, no matter what the value of the sample is. In this case, there are no degrees of freedom. Calculating the variance with $N-1$ in the denominator gives the __unbiased__ estimate of the true variance $\sigma^2$.

The variance is __positive definite__ because it is the sum of squared values and therefore cannot be negative.

### Standard Error ###

The __standard error__ describes how well the sample mean describes the true mean,

$$ SE = \frac{s}{\sqrt{N}}. $$

The standard error does not describe the spread of the data; it describes how well the sample mean, $\bar{x}$, represents the true mean, $\mu$. The standard error can be thought of as the standard deviation of $(\bar{x} - \mu)$ after many repeated experiments with $N$ samples each. This assumes that the data are normally distributed.
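
The "many repeated experiments" interpretation can be sketched with a simulation. The true mean, true standard deviation, sample size, and number of experiments below are all hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma = 5.0, 2.0     # hypothetical true mean and standard deviation
N = 25                   # samples per experiment
n_experiments = 20_000   # number of repeated experiments

# Each row is one experiment with N normally distributed samples
samples = rng.normal(mu, sigma, size=(n_experiments, N))
sample_means = samples.mean(axis=1)

# Standard deviation of (x_bar - mu) across all experiments
spread_of_means = np.std(sample_means - mu)

# Standard error computed from the true standard deviation
theoretical_se = sigma / np.sqrt(N)

# In practice only one experiment is available, so s replaces sigma
se_single = samples[0].std(ddof=1) / np.sqrt(N)

print("spread of sample means:", spread_of_means)
print("sigma / sqrt(N):       ", theoretical_se)
print("SE from one experiment:", se_single)
```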

![Normal Distribution Curve](images/norm_dist_rule.png)
[Wikipedia](https://en.wikipedia.org/wiki/Normal_distribution#/media/File:Empirical_Rule.PNG)

### Z-Score ###
The __z-score__ is the number of standard deviations that a data point lies above or below the mean. Strictly, it is defined relative to the population mean and standard deviation; here these are estimated by the sample mean $\bar{x}$ and the sample standard deviation $s$.

$$ z_i=\frac{x_i-\bar{x}}{s} $$

## Algorithms

__Algorithm__: A problem solving procedure - set of instructions to solve a problem

Finding the __Mean__ in plain English:
	1. Take Samples
	2. Add up all of the values in the set
	3. Count the total number of values in the set
	4. Divide step 2 by step 3
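
As a preview, the mean algorithm can be sketched in plain Python. The function name `sample_mean` and the sample values are made up for this example.

```python
def sample_mean(values):
    """Return the arithmetic mean of a sequence of numbers."""
    total = 0.0
    count = 0
    for v in values:
        total += v   # add up all of the values
        count += 1   # count the number of values
    return total / count  # divide the sum by the count


samples = [2.0, 4.0, 4.0, 5.0, 10.0]  # hypothetical samples
mean_value = sample_mean(samples)
print(mean_value)
```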

Finding __Variance__ in plain English:
	1. Take samples
	2. Find mean (as stated above)
	3. Subtract mean from each sample
	4. Square step 3
	5. Sum step 4
	6. Count the number of samples
	7. Subtract 1 from step 6
	8. Divide step 5 by step 7.
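
A minimal sketch of the variance algorithm in plain Python, using the same made-up samples as a check. The function name `sample_variance` is hypothetical.

```python
import math


def sample_variance(values):
    """Return the sample variance, with N - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n                            # find the mean
    squared_dev = [(v - mean) ** 2 for v in values]   # subtract the mean, then square
    return sum(squared_dev) / (n - 1)                 # sum, then divide by N - 1


samples = [2.0, 4.0, 4.0, 5.0, 10.0]  # hypothetical samples
variance = sample_variance(samples)
std_dev = math.sqrt(variance)
print(variance, std_dev)
```

Note that NumPy's `np.var` and `np.std` divide by $N$ by default; passing `ddof=1` gives the $N-1$ form used here.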
    
Finding the __Z-score__ in plain English:
    1. Take samples
    2. Find mean (as stated above)
    3. Subtract mean from each sample
    4. Find variance (as stated above)
    5. Take square root of step 4 (this is the standard deviation)
    6. Divide each value from step 3 by step 5
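
A minimal sketch of the z-score calculation, following the formula $z_i = (x_i - \bar{x})/s$: each sample gets its own z-score, so the deviations from the mean are divided by the standard deviation one by one rather than summed. The function name `z_scores` and the sample values are made up for this example.

```python
import math


def z_scores(values):
    """Return the z-score of every value in a sequence."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std_dev = math.sqrt(variance)
    # One z-score per sample: deviation from the mean in units of s
    return [(v - mean) / std_dev for v in values]


samples = [2.0, 4.0, 4.0, 5.0, 10.0]  # hypothetical samples
z = z_scores(samples)
print(z)
```

The z-scores always sum to zero (to within round-off), which is why summing the deviations before dividing would not give a useful result.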

We will come back to these algorithms later when we start writing code.

## Computers 
[__Raspberry Pi__](https://hackaday.io/projects/tag/raspberry%20pi): A small, cheap computer that usually runs Linux distributions

__CPU__: Central Processing Unit - Processes user instructions from an input

__RAM__: Random Access Memory - Stores data for the short term, where it can be quickly accessed. Limited capacity
Long-term Memory: Usually a hard drive, large capacity, data access is slower.

Computers are made of many tiny switches called __transistors__ that can be turned On or Off (1 or 0).
The number of transistors that can be placed on a CPU has grown exponentially over the past 40 years.

**Moore's Law**: 

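Gordon Moore, co-founder of Intel, proposed a law for the geometric growth of the number of transistors on a CPU, which translates to processing power,

$$ P_n = P_0 2^n, $$

where $P_0$ is the initial number of transistors and $n$ is the number of doubling periods that have elapsed (roughly two years each).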
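
As a rough worked example of this formula, the sketch below computes the growth factor over 40 years. It assumes a doubling period of two years, which is the commonly quoted figure and not something stated in the equation itself; the function name is hypothetical.

```python
def transistor_growth(p0, years, doubling_period_years=2.0):
    """Return P_n = P_0 * 2**n, where n is the number of doubling periods."""
    n = years / doubling_period_years  # assumed: one doubling every two years
    return p0 * 2 ** n


# Growth factor after 40 years, relative to a starting value of 1
growth_factor = transistor_growth(1, 40)
print(growth_factor)
```

Under these assumptions, 40 years corresponds to 20 doublings, or roughly a million-fold increase.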

With more processing power:
- Scientists can run models and collect data at higher resolution.
- Computation is cheap! More people have access to computers.
- Programming languages are built to be more accessible, but less efficient.

**Interpreted Languages** - e.g. Python; code is translated into machine instructions line by line as it runs

**Compiled Languages** - e.g. Java, C, Fortran; the entire code is compiled into object code, then into machine language. Can be much faster if you are reusing the code or making programs, but less practical for data analysis.

**Bit**: binary digit (1 or 0)

**Byte**: eight bits

#### Representing integers in binary

Numbers are stored on computers in binary representation. Positive integers can be represented as a combination of powers of two. For example, if we wanted to represent __the Answer to the Ultimate Question of Life, the Universe, and Everything__ in a byte of computer memory, this would be one way of doing it:

0 0 1 0 1 0 1 0

= (0·2<sup>7</sup>)+(0·2<sup>6</sup>)+(1·2<sup>5</sup>)+(0·2<sup>4</sup>)+(1·2<sup>3</sup>)+(0·2<sup>2</sup>)+(1·2<sup>1</sup>)+ (0·2<sup>0</sup>)

= 42

= (0·10<sup>2</sup>) + (4·10<sup>1</sup>) + (2·10<sup>0</sup>)

Computers use base 2 because they are made up of transistors (on-off switches). Humans use base 10 because we typically have ten fingers that we use for counting.
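
The conversion between binary and decimal can be sketched in Python, both by summing powers of two explicitly and with the built-in `int` and `format` functions.

```python
bits = "00101010"

# Sum bit * 2**power, where the rightmost bit has power 0
value = sum(int(b) * 2 ** power for power, b in enumerate(reversed(bits)))

# Python built-ins do the same conversion in both directions
value_builtin = int(bits, 2)          # binary string -> integer
bits_from_value = format(42, "08b")   # integer -> 8-digit binary string

print(value, value_builtin, bits_from_value)
```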

This is an example of an **unsigned byte**. A byte can have 2<sup>8</sup> = 256 possible combinations. An unsigned byte can have any value between 0-255. Trying to calculate a number greater than 255 would result in an "overflow" error.

A **signed byte** uses the first bit to store the sign of the number (positive or negative). 

The table below summarizes common ways of expressing integers in binary:


| Type          | range of values                 | number of bits | possible values |
| ------------- | ------------------------------- | -------------- | --------------- |
| unsigned byte | 0 to 255                        | 8              | 2^8             |
| signed byte   | -128 to 127                     | 8              | 2^8             |
| short integer | -32,768 to 32,767               | 16             | 2^16            |
| long integer  | -2,147,483,648 to 2,147,483,647 | 32             | 2^32            |
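
One way to check the ranges of these integer types is with NumPy's `np.iinfo`. The sketch below assumes NumPy's fixed-width types `int8`, `uint8`, `int16`, and `int32` as stand-ins for the signed byte, unsigned byte, short integer, and long integer. It also shows what happens on overflow: in NumPy array arithmetic the value wraps around silently rather than raising an error.

```python
import numpy as np

# Range and size of each fixed-width integer type
for dtype in (np.int8, np.uint8, np.int16, np.int32):
    info = np.iinfo(dtype)
    print(f"{np.dtype(dtype).name:>6}: {info.bits:2d} bits, {info.min} to {info.max}")

# Overflow: 255 + 1 does not fit in an unsigned byte
a = np.array([255], dtype=np.uint8)
b = np.array([1], dtype=np.uint8)
wrapped = a + b   # wraps around to the start of the range

print(wrapped)
```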




#### Representing decimals with floating point numbers

The binary integers described above cannot represent numbers with decimals. Real numbers, particularly irrational numbers like $\pi$, must be approximated. A real number *x* is represented in the following form:

$x = \pm m \times 2^E $

where $m$ is called the mantissa and $E$ is the exponent. This is the binary analogue of scientific notation, $\pm m \times 10^E$.

IEEE 32 bit standard: single precision
* 1 bit - sign
* 8 bits - exponent
* 23 bits - mantissa

IEEE 64 bit standard: double precision
* 1 bit - sign
* 11 bits - exponent
* 52 bits - mantissa
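
These bit counts can be inspected with NumPy's `np.finfo`, as sketched below. The comparison of $\pi$ in single and double precision is an added illustration of how the mantissa length limits precision.

```python
import numpy as np

# Bit layout and machine epsilon of single and double precision
for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{np.dtype(dtype).name}: {info.bits} bits total, "
          f"{info.nexp} exponent bits, {info.nmant} mantissa bits, "
          f"machine epsilon = {info.eps}")

# The same number stored at two precisions
pi_single = np.float32(np.pi)
pi_double = np.float64(np.pi)
single_error = abs(float(pi_single) - float(pi_double))

print("difference between single and double precision pi:", single_error)
```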

#### Round-off error

Round-off errors occur when trying to represent real numbers as floating point numbers with limited precision. This type of error can accumulate quickly when calculations are repeated many times.
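
A short sketch of round-off error in Python: the decimal value 0.1 has no exact binary representation, so adding it ten times does not give exactly 1.

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1   # each addition carries a tiny representation error

print(total == 1.0)
print(total)

# Comparing with a tolerance is safer than testing for exact equality
close_enough = math.isclose(total, 1.0)

# math.fsum keeps track of the lost digits while summing
accurate_total = math.fsum([0.1] * 10)

print(close_enough, accurate_total)
```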

#### Truncation error

Truncation errors occur when using a finite number of steps to approximate an infinite number of steps. For example, a function $f(x)$, in the neighborhood of $x=a$, can be represented as a Taylor series expansion about the point $x = a$

$$ f(x) = f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + ...$$

Approximating a function with the first few terms of the Taylor series expansion would be an example of a truncation error.
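
As an illustration, the sketch below approximates $e^x$ at $x = 1$ with the first few terms of its Taylor series about $a = 0$ and compares each result with `math.exp`. The choice of function, evaluation point, and numbers of terms is arbitrary, and the function name `exp_taylor` is hypothetical.

```python
import math


def exp_taylor(x, n_terms):
    """Approximate exp(x) with the first n_terms terms of its Taylor series about 0."""
    return sum(x ** k / math.factorial(k) for k in range(n_terms))


x = 1.0
# Truncation error for an increasing number of terms
errors = {n: abs(math.exp(x) - exp_taylor(x, n)) for n in (2, 4, 6, 10)}

for n, err in errors.items():
    print(f"{n:2d} terms: truncation error = {err:.2e}")
```

The truncation error shrinks as more terms are kept, but it only reaches zero in the limit of infinitely many terms.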