# Principles of data mining

## What is it?

[Merriam-Webster](https://www.merriam-webster.com/dictionary/data%20mining):

>**Data mining**
>
>*"the practice of searching through large amounts of computerized data to find useful patterns or trends"*
>
>**First known use: 1968**

(Another term first used in the same year is "error bar". Coincidence?)

### Terminology

- Variable
    * "table column"
    * Measured quantity (e.g. temperature)
- Sample
    * "table row"
    * A single measurement of one or more variables
- Model
    * A predictor for a variable given the other variables
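
To make the three terms concrete, here is a minimal sketch. The column names, the numbers, and the `predict_temp` function are all invented for illustration and are not part of the lecture dataset.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only.
samples = pd.DataFrame(
    {
        "depth_m": [0.0, 5.0, 10.0, 15.0, 20.0],   # a variable (table column)
        "temp_c": [16.0, 15.0, 14.0, 13.0, 12.0],  # another variable (table column)
    }
)

# Each row is one sample: a single measurement of both variables.
first_sample = samples.iloc[0]


def predict_temp(depth_m):
    """Toy model: predict temperature from depth with a straight line."""
    # Slope and intercept were picked by hand to match the invented data.
    return 16.0 - 0.2 * depth_m


# Apply the model to every sample and store the prediction as a new column.
samples["predicted_temp_c"] = samples["depth_m"].apply(predict_temp)
print(first_sample)
print(samples)
```

A real model would be fitted to the data rather than written by hand, but the roles are the same: columns are variables, rows are samples, and the model maps some variables to a prediction of another.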

### Show, don't tell
An example is the easiest way to see what data mining means, and the best kind of example is to do the calculations and look at the results.

These lecture notes are written as a Jupyter notebook, which lets us not only write text but also run code (here, [Python](https://python.org)) and show the results:

In [1]:
# Code here
print('Result here')

Result here


#### Doing it yourself
If you are familiar with Jupyter notebooks and want to run this notebook yourself, take a look at the [readme](./README.md) for instructions.

## Example: Looking at the data

We'll use a library called [Pandas](https://pandas.pydata.org), which has a lot of useful tools for manipulating tabular data (similar to data frames in R).

In [5]:
import pandas as pd

First we have to read the data. This is a dataset from Lake Windermere.

In [108]:
data = pd.read_csv('Windermere_NBAS_data_1945_2013.txt', parse_dates=True, index_col=0)
data.tail()

| sdate | variable | value | sign_if_LT_LOD |
|---|---|---|---|
| 2047-01-02 | TEMP | 5.6 | |
| 2047-01-08 | TEMP | 5.1 | |
| 2047-01-13 | TEMP | 4.95 | |
| 2047-01-20 | TEMP | 5.1 | |
| 2047-01-27 | TEMP | 4.6 | |
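
As a rough sketch of what `parse_dates=True` and `index_col=0` do, the snippet below reads a few hypothetical rows from a string instead of the real file. The rows are invented, and the real file's date format may differ. Printing the range of the parsed index is a cheap sanity check; it seems worth doing here, since the table above shows dates in 2047 although the file name suggests 1945 to 2013.

```python
import io

import pandas as pd

# Hypothetical rows using the same column names as the table above.
# The values are invented and the real file's date format may differ.
csv_text = """sdate,variable,value,sign_if_LT_LOD
2001-01-02,TEMP,5.6,
2001-01-02,OXYG,98.0,
2001-01-08,TEMP,5.1,
"""

# index_col=0 makes the first column (sdate) the index;
# parse_dates=True asks pandas to parse that index as dates.
sketch = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col=0)

# Sanity checks: is the index really a date index, and is its range plausible?
print(type(sketch.index).__name__)
print(sketch.index.min(), sketch.index.max())
```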


The data has multiple variables that have been measured at certain dates. Let's see what's there:

In [36]:
pd.unique(data.variable)

array(['TEMP', 'OXYG', 'PO4P', 'ALKA', 'NO3N', 'SIO2', 'TOTP', 'NH4N',
       'TOCA', 'PH  ', 'SECC'], dtype=object)

Next, let's look at summary statistics of the measured values for each variable:

In [109]:
data.groupby('variable').describe()

| variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ALKA | 1804.0 | 10132.06153 | 2007.322631 | 6100.0 | 8600.0 | 10000.0 | 11500.0 | 17000.0 |
| NH4N | 1546.0 | 9.319599 | 8.372265 | 0.0 | 5.0 | 5.0 | 12.0 | 125.0 |
| NO3N | 2736.0 | 323.472953 | 145.980099 | 8.0 | 220.0 | 300.0 | 417.0 | 1196.0 |
| OXYG | 3242.0 | 97.804226 | 10.454674 | 61.0 | 90.4 | 98.0 | 105.0 | 134.3 |
| PH | 987.0 | 7.245064 | 0.42214 | 6.38 | 7.0 | 7.17 | 7.37 | 9.66 |
| PO4P | 2511.0 | 2.170315 | 2.90583 | 0.1 | 0.6 | 1.0 | 3.0 | 66.3 |
| SECC | 772.0 | 4.423316 | 1.092187 | 1.6 | 3.6 | 4.3 | 5.1 | 9.6 |
| SIO2 | 2771.0 | 1062.488271 | 656.17724 | -44.0 | 440.0 | 1100.0 | 1680.0 | 2700.0 |
| TEMP | 5472.0 | 10.825303 | 4.990691 | 0.6 | 6.215 | 10.3 | 15.3 | 23.7 |
| TOCA | 1860.0 | 5.564737 | 4.477227 | 0.29 | 1.65 | 4.735 | 8.06 | 28.76 |
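
The table is stored in "long" format: each row holds one value of one variable on one date. If a "wide" layout with one column per variable is easier to work with, something like the following sketch could be used. The rows are hypothetical, and `pivot_table` is just one way to reshape; it averages duplicates that share the same date and variable.

```python
import pandas as pd

# Hypothetical long-format rows (values invented for illustration).
long_df = pd.DataFrame(
    {
        "sdate": pd.to_datetime(
            ["2001-01-02", "2001-01-02", "2001-01-08", "2001-01-08"]
        ),
        "variable": ["TEMP", "OXYG", "TEMP", "OXYG"],
        "value": [5.6, 98.0, 5.1, 97.0],
    }
)

# One row per date, one column per variable.
# pivot_table takes the mean if a (date, variable) pair occurs more than once.
wide_df = long_df.pivot_table(index="sdate", columns="variable", values="value")
print(wide_df)
```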


In [146]:
temps = data[data.variable == 'TEMP']
temps = temps.drop(columns='variable')
temps.tail()

| sdate | value | sign_if_LT_LOD |
|---|---|---|
| 2013-10-15 | 12.3 | |
| 2013-10-29 | 11.6 | |
| 2013-11-12 | 9.5 | |
| 2013-11-26 | 7.5 | |
| 2013-12-10 | 7.1 | |


In [160]:
monthly = temps.resample('1M').mean()
monthly.head()

| sdate | value |
|---|---|
| 1968-01-31 | 5.2 |
| 1968-02-29 | 4.1875 |
| 1968-03-31 | 4.5375 |
| 1968-04-30 | 7.32 |
| 1968-05-31 | 10.2 |


In [166]:
%matplotlib notebook
# 12-month rolling mean; DataFrame.rolling replaces the deprecated pd.rolling_mean
monthly.rolling(window=12).mean().plot()

(Plot: 12-month rolling mean of the monthly mean temperature.)
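
The last two cells resample the measurements to monthly means and then smooth them with a 12-month rolling mean. The sketch below shows the same two steps on synthetic data: a sine wave standing in for a yearly temperature cycle, invented purely for illustration. It uses the `"MS"` (month start) frequency alias, an assumption made here because it labels bins differently from the `'1M'` used above but computes the same monthly averages.

```python
import numpy as np
import pandas as pd

# Synthetic daily "temperatures": a yearly sine cycle, invented for illustration.
days = pd.date_range("2000-01-01", "2003-12-31", freq="D")
values = 10 + 5 * np.sin(2 * np.pi * days.dayofyear.to_numpy() / 365.25)
daily = pd.DataFrame({"value": values}, index=days)

# Average all samples that fall in the same calendar month.
# "MS" labels each bin by its first day; '1M' would label it by its last day.
monthly_sketch = daily.resample("MS").mean()

# 12-month rolling mean: each point averages the current and the 11 previous
# months, so the first 11 entries are NaN and the seasonal cycle is smoothed out.
smoothed = monthly_sketch.rolling(window=12).mean()

print(len(monthly_sketch))
print(smoothed.dropna().head())
```

Because the window spans a full year, the seasonal swing cancels out and what remains is the slower trend, which is the reason for choosing a window of 12 on monthly data.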

Lesson 1: **Really** look at your data!

Summary statistics alone won't show you what the data looks like:
[The Datasaurus Dozen](https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/)
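
The same point can be checked numerically with two of the four datasets from Anscombe's quartet, the classic example that the Datasaurus builds on. This is a small sketch using only the Python standard library: the two series have nearly identical mean and standard deviation, yet listing the points in order of x shows very different shapes.

```python
from statistics import mean, stdev

# Two of the four datasets from Anscombe's quartet; they share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# The summary statistics are almost indistinguishable.
for name, y in [("y1", y1), ("y2", y2)]:
    print(name, round(mean(y), 2), round(stdev(y), 2))

# Sorted by x, y2 traces a smooth curve while y1 scatters around a line,
# a difference that only shows up when the points are inspected or plotted.
for xi, a, b in sorted(zip(x, y1, y2)):
    print(xi, a, b)
```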

### Next: [Model validation](./02-validation.ipynb)