# Principles of data mining

## What is it?

[Merriam-Webster](https://www.merriam-webster.com/dictionary/data%20mining):

>**Data mining**
>
>*"the practice of searching through large amounts of computerized data to find useful patterns or trends"*
>
>**First known use: 1968**

(Another term first recorded in the same year is "error bar". Coincidence?)

## Differences vs. statistical modeling

- Focus on empirical/exploratory research into relationships in the data that are not known a priori
- Usually automated through machine learning
- Complex relationships and large numbers of variables are easier to model

### Terminology

- *Variable* (also *feature*)
    * "table column"
    * Can be continuous (e.g. temperature) or categorical (e.g. species).
- *Sample*
    * "table row"
    * A single measurement of one or more variables
- *Model*
    * A predictor for a variable given the other variables

## Types of data mining

- Classification
    * Prediction of categorical variables (e.g. species)
    
- Regression
    * Prediction of continuous variables (temperature, moisture, etc.)

- Clustering
    * Used to find related subgroups of data when the categories are not known beforehand.

- Anomaly detection
    * Automated detection of outliers from the data
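
As a small illustration of the last item, the sketch below flags outliers with a modified z-score based on the median. This is only one of many possible approaches; the function name `robust_outliers`, the readings, and the commonly quoted threshold of 3.5 are assumptions chosen for the example.

```python
import numpy as np


def robust_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # No spread at all: nothing can be flagged with this method.
        return np.zeros(values.shape, dtype=bool)
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # for normally distributed data.
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold


readings = np.array([5.1, 4.9, 5.3, 5.0, 25.0, 5.2, 4.8])  # made-up values
print(robust_outliers(readings))
```

Using the median rather than the mean keeps a single extreme value from hiding itself by inflating the spread estimate.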


[Flowchart for machine learning](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

# Example: Looking at the data

### Show, don't tell

These lecture notes are written in a Jupyter notebook, which lets us not only write text, but also run code (here, [Python](https://python.org)) and show the results:

In [1]:
# Code here
print('Result here')

Result here


#### Doing it yourself
If you are familiar with Jupyter notebooks and want to run this notebook yourself, take a look at the [readme](./README.md) for instructions.

First we'll have to read the data. [This](./Windermere_NBAS_data_1945_2013.txt) is a dataset from Lake Windermere, which looks like this:
```
sdate,variable,value,sign_if_LT_LOD
02-Jan-47,TEMP,5.6,
08-Jan-47,TEMP,5.1,
13-Jan-47,TEMP,4.95,
20-Jan-47,TEMP,5.1,
27-Jan-47,TEMP,4.6,
03-Feb-47,TEMP,4,
10-Feb-47,TEMP,3.4,
18-Feb-47,TEMP,2.8,
24-Feb-47,TEMP,1.6,
...
```

We'll use a library called [Pandas](https://pandas.pydata.org), which offers many tools for manipulating tabular data (similar to data frames in R). Its `read_csv` function reads .csv files directly.
We'll also drop the last column, since it is not needed in this example.

In [2]:
import pandas as pd
data = pd.read_csv('Windermere_NBAS_data_1945_2013.txt', parse_dates=True, index_col=0)
data = data.drop(columns='sign_if_LT_LOD')
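
The dates in this file use two-digit years, and how those are expanded to four digits depends on the parser, so it may be worth checking the result. The sketch below parses a few rows with an explicit format and then moves dates that land in the future back by a century. The cutoff is an assumption based on the file name, which suggests that the series ends in 2013, and the `%b` month names assume an English locale.

```python
import io

import pandas as pd

# A few rows in the same layout as the file shown above.
sample = io.StringIO(
    "sdate,variable,value,sign_if_LT_LOD\n"
    "02-Jan-47,TEMP,5.6,\n"
    "08-Jan-47,TEMP,5.1,\n"
    "13-Jan-47,TEMP,4.95,\n"
)
raw = pd.read_csv(sample)

# With an explicit format, %y maps 00-68 to 2000-2068 and 69-99 to 1969-1999.
parsed = pd.to_datetime(raw["sdate"], format="%d-%b-%y")

# Assumption: no measurement is later than 2013, so any later date is a century off.
cutoff = pd.Timestamp("2014-01-01")
fixed = parsed.where(parsed < cutoff, parsed - pd.DateOffset(years=100))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

A quick look at `data.index.min()` and `data.index.max()` after loading is a cheap way to spot this kind of problem.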

The data has multiple variables that have been measured at certain dates. Let's see what's there:

In [3]:
data.groupby('variable').describe()

| variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ALKA | 1804.0 | 10132.06153 | 2007.322631 | 6100.0 | 8600.0 | 10000.0 | 11500.0 | 17000.0 |
| NH4N | 1546.0 | 9.319599 | 8.372265 | 0.0 | 5.0 | 5.0 | 12.0 | 125.0 |
| NO3N | 2736.0 | 323.472953 | 145.980099 | 8.0 | 220.0 | 300.0 | 417.0 | 1196.0 |
| OXYG | 3242.0 | 97.804226 | 10.454674 | 61.0 | 90.4 | 98.0 | 105.0 | 134.3 |
| PH | 987.0 | 7.245064 | 0.42214 | 6.38 | 7.0 | 7.17 | 7.37 | 9.66 |
| PO4P | 2511.0 | 2.170315 | 2.90583 | 0.1 | 0.6 | 1.0 | 3.0 | 66.3 |
| SECC | 772.0 | 4.423316 | 1.092187 | 1.6 | 3.6 | 4.3 | 5.1 | 9.6 |
| SIO2 | 2771.0 | 1062.488271 | 656.17724 | -44.0 | 440.0 | 1100.0 | 1680.0 | 2700.0 |
| TEMP | 5472.0 | 10.825303 | 4.990691 | 0.6 | 6.215 | 10.3 | 15.3 | 23.7 |
| TOCA | 1860.0 | 5.564737 | 4.477227 | 0.29 | 1.65 | 4.735 | 8.06 | 28.76 |


It would be nicer if each measured variable were a column of its own. We can use `pivot_table` to do just that:

In [4]:
data = data.pivot_table(index='sdate', columns='variable', values='value')
data.head()

| sdate | ALKA | NH4N | NO3N | OXYG | PH | PO4P | SECC | SIO2 | TEMP | TOCA | TOTP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1968-01-02 | 9600.0 | 1.0 | 320.0 | NaN | NaN | 3.6 | NaN | 1680.0 | 5.7 | 0.75 | NaN |
| 1968-01-09 | 9800.0 | 3.0 | 340.0 | NaN | NaN | 3.4 | NaN | 1750.0 | 5.2 | 0.8 | NaN |
| 1968-01-16 | 9700.0 | 1.0 | 350.0 | NaN | NaN | 3.3 | NaN | 1720.0 | 4.95 | 0.8 | NaN |
| 1968-01-23 | 8500.0 | 7.0 | 360.0 | NaN | NaN | 2.9 | NaN | 1760.0 | 4.7 | 0.96 | NaN |
| 1968-01-30 | NaN | NaN | 330.0 | NaN | NaN | 3.0 | NaN | 1770.0 | 5.45 | 0.49 | NaN |
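
The reshaping from "long" to "wide" format is easier to follow on a tiny table. The sketch below uses a handful of rows, partly made up, to show two details of `pivot_table`: combinations that were never measured become `NaN`, and repeated measurements of the same variable on the same date are averaged, because the default aggregation is the mean.

```python
import pandas as pd

# Long format: one row per (date, variable) measurement. Values are illustrative.
long_format = pd.DataFrame(
    {
        "sdate": pd.to_datetime(
            ["1968-01-02", "1968-01-02", "1968-01-09", "1968-01-09"]
        ),
        "variable": ["TEMP", "NO3N", "TEMP", "TEMP"],
        "value": [5.7, 320.0, 5.0, 5.4],
    }
)

# Wide format: one row per date, one column per variable.
wide = long_format.pivot_table(index="sdate", columns="variable", values="value")
print(wide)
```

Here the two `TEMP` readings on 1968-01-09 collapse into their mean, and `NO3N` is missing for that date.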


Now we can easily study the correlations of different variables:

In [5]:
data.corr()

| variable | ALKA | NH4N | NO3N | OXYG | PH | PO4P | SECC | SIO2 | TEMP | TOCA | TOTP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ALKA | 1.0 | 0.206367 | 0.005054 | 0.201926 | 0.417583 | -0.074682 | -0.27177 | -0.400939 | 0.455592 | 0.320383 | -0.202437 |
| NH4N | 0.206367 | 1.0 | -0.049884 | 0.014153 | 0.007311 | -0.099261 | -0.025426 | -0.108779 | 0.158552 | 0.021559 | -0.15199 |
| NO3N | 0.005054 | -0.049884 | 1.0 | -0.189748 | -0.402358 | 0.348183 | 0.252787 | 0.355121 | -0.520163 | -0.437774 | 0.003584 |
| OXYG | 0.201926 | 0.014153 | -0.189748 | 1.0 | 0.45429 | -0.115428 | -0.249327 | -0.303016 | 0.268855 | 0.310691 | 0.105338 |
| PH | 0.417583 | 0.007311 | -0.402358 | 0.45429 | 1.0 | -0.439783 | -0.344579 | -0.535502 | 0.583771 | 0.562226 | 0.013258 |
| PO4P | -0.074682 | -0.099261 | 0.348183 | -0.115428 | -0.439783 | 1.0 | 0.287969 | 0.394254 | -0.395955 | -0.348194 | 0.201379 |
| SECC | -0.27177 | -0.025426 | 0.252787 | -0.249327 | -0.344579 | 0.287969 | 1.0 | 0.147791 | -0.185314 | -0.591275 | -0.165457 |
| SIO2 | -0.400939 | -0.108779 | 0.355121 | -0.303016 | -0.535502 | 0.394254 | 0.147791 | 1.0 | -0.849555 | -0.60162 | 0.02358 |
| TEMP | 0.455592 | 0.158552 | -0.520163 | 0.268855 | 0.583771 | -0.395955 | -0.185314 | -0.849555 | 1.0 | 0.569124 | -0.055489 |
| TOCA | 0.320383 | 0.021559 | -0.437774 | 0.310691 | 0.562226 | -0.348194 | -0.591275 | -0.60162 | 0.569124 | 1.0 | 0.172794 |
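
A full correlation matrix is hard to scan by eye, so it can help to list only the strongest pairs. The helper below is a sketch: the name `strongest_pairs` and the synthetic demo data are assumptions, not part of the notebook. Note that `corr()` computes each coefficient from the rows where both variables are present, so different cells of the matrix may be based on different numbers of samples.

```python
import numpy as np
import pandas as pd


def strongest_pairs(corr, n=3):
    """Return the n variable pairs with the largest absolute correlation."""
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (always 1.0) is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by absolute value, strongest first.
    order = np.argsort(-pairs.abs().to_numpy())
    return pairs.iloc[order[:n]]


# Synthetic demo data: "b" is almost exactly the negative of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame(
    {"a": a, "b": -a + 0.1 * rng.normal(size=200), "c": rng.normal(size=200)}
)
top = strongest_pairs(demo.corr(), n=1)
print(top)
```

Applied to the notebook's data, the call would be `strongest_pairs(data.corr())`.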


The same, but in color:

In [6]:
%matplotlib notebook

import matplotlib.pyplot as plt
plt.matshow(data.corr())
plt.colorbar()

[Figure: the correlation matrix drawn as a color-coded image with a colorbar]
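
By default `matshow` labels the axes with row and column numbers only. A possible refinement, sketched below, is to put the variable names on the ticks and to fix the color scale to the range from -1 to 1. The function name `plot_correlations`, the colormap, and the demo data are choices made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_correlations(corr):
    """Draw a correlation matrix with variable names on both axes."""
    fig, ax = plt.subplots()
    # Fix the color scale to [-1, 1] so that zero correlation sits mid-scale.
    image = ax.matshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    positions = np.arange(len(corr.columns))
    ax.set_xticks(positions)
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(positions)
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax)
    return fig, ax


# Synthetic demo data, for illustration only.
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x", "y", "z"])
fig, ax = plot_correlations(demo.corr())
```

With the notebook's data this would be called as `plot_correlations(data.corr())`.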

Let's try something a bit more involved and estimate a temperature trend by fitting a straight line to the temperature data:

In [57]:
import numpy as np

# Our target to fit
y = data['TEMP'].values[~np.isnan(data['TEMP'].values)]
# We don't need the actual dates, a numbering will be enough
X = np.arange(0, len(y))

model = np.polyfit(X, y, 1)
model

array([-2.64966253e-04,  1.11876447e+01])
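
The first coefficient above is the slope per sample, since the fit uses the sample number as x. If the samples are not evenly spaced in time, an alternative is to fit against elapsed time, which gives a slope per year. The sketch below assumes a series with a datetime index; the function name `linear_trend_per_year` and the demo data are made up for illustration.

```python
import numpy as np
import pandas as pd


def linear_trend_per_year(series):
    """Fit a straight line to a time-indexed series.

    Returns (slope per year, intercept at the first sample).
    """
    series = series.dropna()
    # Elapsed time since the first sample, in days and then in years
    # (365.25 days per year on average).
    elapsed_days = (series.index - series.index[0]) / pd.Timedelta(days=1)
    years = np.asarray(elapsed_days, dtype=float) / 365.25
    slope, intercept = np.polyfit(years, series.to_numpy(), 1)
    return slope, intercept


# Unevenly spaced demo data that rises by exactly 0.1 units per year.
dates = pd.to_datetime(["2000-01-01", "2000-07-01", "2002-01-01", "2005-01-01"])
elapsed_years = np.asarray((dates - dates[0]) / pd.Timedelta(days=365.25), dtype=float)
demo = pd.Series(10.0 + 0.1 * elapsed_years, index=dates)

slope, intercept = linear_trend_per_year(demo)
print(slope, intercept)
```

For the notebook's data the call would be `linear_trend_per_year(data['TEMP'])`, provided the index holds correctly parsed dates in chronological order.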

In [62]:
monthly = data.resample('1M').mean()
monthly.head()

| sdate | ALKA | NH4N | NO3N | OXYG | PH | PO4P | SECC | SIO2 | TEMP | TOCA | TOTP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1968-01-31 | 9400.0 | 3.0 | 340.0 | NaN | NaN | 3.24 | NaN | 1736.0 | 5.2 | 0.76 | NaN |
| 1968-02-29 | 9400.0 | 2.0 | 357.5 | NaN | NaN | 3.275 | NaN | 1770.0 | 4.1875 | 0.69 | NaN |
| 1968-03-31 | 9333.333333 | NaN | 367.5 | NaN | NaN | 2.55 | NaN | 1770.0 | 4.5375 | 1.07 | NaN |
| 1968-04-30 | 11266.666667 | 3.0 | 368.0 | NaN | NaN | 1.35 | NaN | 1472.0 | 7.32 | 4.132 | NaN |
| 1968-05-31 | 9200.0 | NaN | 315.0 | NaN | NaN | 0.225 | NaN | 243.75 | 10.2 | 9.39 | NaN |


In [65]:
%matplotlib notebook
monthly.rolling(window=12).mean().plot()

[Figure: 12-month rolling means of the monthly averages, one line per variable]
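
Why a 12-month window? A window that spans exactly one year always contains one full seasonal cycle, so the seasonality averages out and the slower changes remain. The sketch below demonstrates this on synthetic data: a constant level plus a yearly sine wave, made up for illustration. It groups by calendar month instead of calling `resample`, purely to avoid depending on a particular frequency alias.

```python
import numpy as np
import pandas as pd

# Synthetic daily "temperature": a constant level plus a yearly cycle.
days = pd.date_range("2000-01-01", "2003-12-31", freq="D")
cycle = np.sin(2 * np.pi * days.dayofyear.to_numpy() / 365.25)
daily = pd.Series(10.0 + 5.0 * cycle, index=days)

# Monthly means, grouped by calendar month.
monthly_mean = daily.groupby(daily.index.to_period("M")).mean()

# A 12-month window always covers one full cycle, so the seasonality cancels.
smoothed = monthly_mean.rolling(window=12).mean()

print("range of monthly means: ", monthly_mean.max() - monthly_mean.min())
print("range of 12-month means:", smoothed.max() - smoothed.min())
```

The first 11 values of `smoothed` are `NaN`, because a full window is not yet available there.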

Lesson 1: **Really** look at your data!

Summary statistics alone won't help you:
[The Datasaurus Dozen](https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/)
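
The same point can be reproduced in a few lines. The sketch below builds two synthetic samples, one with a single peak and one with two, and rescales both to mean 0 and standard deviation 1. The data and the helper name `standardize` are made up for illustration; this is not the Datasaurus data itself.

```python
import numpy as np


def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


rng = np.random.default_rng(42)
one_peak = standardize(rng.normal(size=1000))
two_peaks = standardize(
    np.concatenate([rng.normal(-2, 0.3, 500), rng.normal(2, 0.3, 500)])
)

for name, values in [("one peak", one_peak), ("two peaks", two_peaks)]:
    # Share of points that lie within 0.5 of the mean.
    near_mean = np.mean(np.abs(values) < 0.5)
    print(
        f"{name}: mean={values.mean():.2f}, std={values.std():.2f}, "
        f"share near mean={near_mean:.2f}"
    )
```

Both samples report the same mean and standard deviation, yet hardly any point of the two-peaked sample lies near its mean. A histogram shows the difference at a glance.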


### [Continue after the break...](./02-Outro.ipynb)