# Principles of data mining

## What is it?

[Merriam-Webster](https://www.merriam-webster.com/dictionary/data%20mining):

>**Data mining**
>
>*"the practice of searching through large amounts of computerized data to find useful patterns or trends"*
>
>**First known use: 1968**

(Another term first recorded in the same year is "error bar". Coincidence?)

## Differences vs. statistical modeling

- Focus on empirical/exploratory research into relationships in the data that are not known a priori
- Usually automated through machine learning
- Complex relationships and large numbers of variables are easier to model

### Terminology

- *Variable* (also *feature*)
    * "table column"
    * Can be continuous (e.g. temperature) or categorical (e.g. species).
- *Sample*
    * "table row"
    * A single measurement of one or more variables
- *Model*
    * A predictor for a variable given the other variables

## Types of data mining

- Classification
    * Prediction of categorical variables (e.g. species)
    
- Regression
    * Prediction of continuous variables (temperature, moisture, etc.)

- Clustering
    * Used to find related subgroups of data when the categories are not known beforehand. 
    * [Examples](./cluster_comparison.png)

- Anomaly detection
    * Automated detection of outliers from the data
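
To make the last item concrete, here is a minimal sketch of one very simple anomaly-detection rule: flag values that lie more than a chosen number of standard deviations from the mean. The numbers and the threshold of 2 are assumptions made for illustration. Real anomaly detection usually relies on more robust methods.

```python
import numpy as np

# Made-up measurements with one obviously suspicious value.
values = np.array([5.6, 5.1, 4.95, 5.1, 4.6, 50.0, 4.0, 3.4])

# z-score: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()

# Arbitrary threshold chosen for this example only.
threshold = 2.0
outliers = values[np.abs(z_scores) > threshold]
print(outliers)
```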


[Flowchart for machine learning](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

# Example: Looking at the data

### Show, don't tell

This lecture material is in a Jupyter notebook, which lets us not only write text but also run code (here, [Python](https://python.org)) and show the results:

In [None]:
# Code here
print('Result here')

#### Doing it yourself
If you are familiar with Jupyter notebooks and want to run this notebook yourself, take a look at the [readme](./README.md) for instructions.

First we'll have to read the data. This is a dataset from Lake Windermere from the [Centre for Ecology & Hydrology](https://eip.ceh.ac.uk/datais). The data is stored with one measurement per line, in the following columns:
```
sdate,variable,value,sign_if_LT_LOD
02-Jan-47,TEMP,5.6,
08-Jan-47,TEMP,5.1,
13-Jan-47,TEMP,4.95,
20-Jan-47,TEMP,5.1,
27-Jan-47,TEMP,4.6,
03-Feb-47,TEMP,4,
10-Feb-47,TEMP,3.4,
18-Feb-47,TEMP,2.8,
24-Feb-47,TEMP,1.6,
...
```

We'll use a library called [Pandas](https://pandas.pydata.org), which has a lot of useful tooling for manipulating tabular data (similar to data frames in R). Its `read_csv` function makes it easy to read .csv files.
We'll also remove the last column for this example, since it is not needed.

In [None]:
import pandas as pd
data = pd.read_csv('Windermere_NBAS_data_1945_2013.txt', parse_dates=True, index_col=0)
data = data.drop(columns='sign_if_LT_LOD')

The data has multiple variables that have been measured at certain dates. Let's see what's there:

In [None]:
data.groupby('variable').describe()

It would be nicer if each measured variable was a column by itself. We can use `pivot_table` to do just that:

In [None]:
data = data.pivot_table(index='sdate', columns='variable', values='value')
data.head()
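
To see what `pivot_table` does without the data file at hand, here is a minimal sketch on a few inline rows in the same layout. The `OXYG` variable name and all of its values are invented for this example.

```python
import io
import pandas as pd

# Made-up long-format sample: one measurement per line.
raw = io.StringIO(
    "sdate,variable,value\n"
    "02-Jan-47,TEMP,5.6\n"
    "02-Jan-47,OXYG,11.0\n"
    "08-Jan-47,TEMP,5.1\n"
    "08-Jan-47,OXYG,11.5\n"
)
long_format = pd.read_csv(raw)

# One row per date, one column per measured variable.
wide = long_format.pivot_table(index='sdate', columns='variable', values='value')
print(wide)
```

If the same date and variable appear more than once, `pivot_table` aggregates them, using the mean by default.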

Now we can easily study the correlations of different variables:

In [None]:
data.corr()
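
As a small sanity check of how to read such a table, the sketch below runs `corr` on made-up columns where the answer is known in advance: `b` rises almost linearly with `a`, and `c` falls exactly linearly.

```python
import pandas as pd

# Made-up columns with known relationships to 'a'.
frame = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.1, 3.9, 6.2, 7.8],   # roughly 2 * a
    'c': [4.0, 3.0, 2.0, 1.0],   # exactly 5 - a
})

# Pairwise Pearson correlation (the default method).
corr = frame.corr()
print(corr)
```

Values close to 1 or -1 indicate a strong linear relationship, and the diagonal is always 1 because every variable correlates perfectly with itself.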

The same, but in color:

In [None]:
%matplotlib notebook

import matplotlib.pyplot as plt
plt.matshow(data.corr())
plt.colorbar()

Let's try something a bit more involved and try to determine a temperature trend. Since the measurements are not taken at perfectly regular intervals, let's just look at the biannual (six-month) means for now.

In [None]:
biannual = data.resample('6M').mean()
biannual.head()
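
The sketch below shows the same idea on a handful of made-up, irregularly spaced measurements. It assumes the `'6MS'` alias (six-month bins labelled by their start date) rather than the `'6M'` used above, because recent pandas releases prefer `'ME'` over `'M'` for month-end bins.

```python
import pandas as pd

# Made-up, irregularly spaced measurements.
dates = pd.to_datetime(['1950-01-10', '1950-03-20', '1950-08-05', '1950-11-30'])
temps = pd.Series([1.0, 3.0, 10.0, 20.0], index=dates)

# Group into six-month bins and average within each bin.
# '6MS' is an assumption for this sketch, not the alias used in the notebook.
half_year = temps.resample('6MS').mean()
print(half_year)
```

The first two measurements fall into the first half of 1950 and the last two into the second half, so each bin holds the mean of two values.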

Now let us fit a line to these mean values. We'll use `polyfit` from the `numpy` package (used for almost anything numerical in Python).

In [None]:
import numpy as np

# Our target to fit; polyfit cannot handle NaN values, so drop them first
y = biannual['TEMP'].values[~np.isnan(biannual['TEMP'].values)]

# We don't need the actual dates; a running index is enough, since the resampled bins are evenly spaced
X = np.arange(0, len(y))

model = np.polyfit(X, y, 1)
model
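
To read the result, it helps to know that `polyfit` returns the coefficients from the highest power down, so for a degree 1 fit the first entry is the slope and the second is the intercept. A minimal sketch on made-up points that lie exactly on a known line:

```python
import numpy as np

# Made-up points lying exactly on y = 0.5 * x + 2.
x = np.arange(10)
y_line = 0.5 * x + 2.0

# Degree 1 fit: coefficients come back as [slope, intercept].
slope, intercept = np.polyfit(x, y_line, 1)

# polyval evaluates the fitted polynomial at the given points.
fitted = np.polyval([slope, intercept], x)
print(slope, intercept)
```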

It's showing a very slight downward trend. Let's plot the fitted line together with the actual data:

In [None]:
%matplotlib notebook
plt.plot(X, y, X, model[1] + model[0] * X)

We could've also plotted the original data. Let's do that now:

In [None]:
data.plot()

Lesson 1: **Really** look at your data!

Here's a better example of why summary statistics alone won't help you either:

[The Datasaurus Dozen](https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/)
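
The Datasaurus builds on Anscombe's quartet, and the effect can be sketched with two of Anscombe's four data sets. The sketch compares only the mean and the correlation. The variable names are chosen for this example.

```python
import numpy as np

# Two of the four data sets from Anscombe's quartet (Anscombe, 1973).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Compare the summary statistics of the two data sets.
for name, y in [('y1', y1), ('y2', y2)]:
    r = np.corrcoef(x, y)[0, 1]
    print(name, 'mean:', round(y.mean(), 2), 'corr with x:', round(r, 2))
```

The printed statistics are nearly identical, yet plotting `y1` and `y2` against `x` shows a noisy line in one case and a smooth curve in the other.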


### [Continue after the break...](./02-Outro.ipynb)