# Lecture 4, Data science in Neuroscience


## Plan for today

1. Review of last week's exercises: Reconstructing our mean waveforms
2. Overview of the next weeks
3. Data analysis project 1
4. Introduction to machine learning
5. Quiz on machine learning
6. Speed cell: a simple example of machine learning with a linear regression

***

## Review of last week's exercises

see `lecture_03.ipynb`

***


## Data analysis project 1

In the last step of the spike extraction procedure, we applied a k-means clustering algorithm to our spike waveforms. Was this the best choice for this problem?

In a Jupyter notebook, compare k-means clustering with at least 3 other clustering techniques available in [Scikit-learn](https://scikit-learn.org/stable/modules/clustering.html#clustering).

For each of the 4 clustering techniques (including k-means):
1. Describe briefly how the algorithm works
2. List a few advantages and disadvantages of the technique
3. Apply the technique to your waveforms
4. Display the results

As a conclusion, and based on what you learned, choose the technique you think is best for clustering our spike waveforms. Describe in a few points why you chose this algorithm.



***
## Overview of the next weeks

1. Machine learning introduction (25.11.2021)
2. Behavioral tracking with a deep neural network (Deeplabcut) (02.12.2021)
3. Behavioral correlates of firing activity (place cells and grid cells) (09.12.2021)
4. Presentation of data analysis project (16.12.2021, alternatively January 2022)


***
## Introduction to machine learning

What is machine learning?


**Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.**

**The computer learns from input data to achieve a specific objective.**

Examples: 

* A program learns to identify the nose, ears and tail of a mouse in images (complex model). 
* A program learns the relationship between the firing rate of a neuron and the running speed of an animal (simple model).


## Why should you care about machine learning as a Neuroscientist?


Machine learning is behind many modern tools used by neuroscientists.

* [Track behavior](https://www.nature.com/articles/s41593-018-0209-y)
* [Image segmentation (e.g., cell counting)](https://www.nature.com/articles/s41592-018-0261-2)
* [Spike extraction and clustering](https://www.biorxiv.org/content/10.1101/061481v1)

They make new experiments possible. 

These tools are state of the art in their respective fields.

## Objective during this course

* Understand what machine learning is.
* Get familiar with the terminology.
* Experiment with a few examples.

## Definition of machine learning

* Input: $X$ (single number or an array)
* Output: $Y$ (single number or an array)
* Unknown function or model: $f()$
* Random error: $\epsilon$

$$Y = f(X) + \epsilon $$

Machine learning refers to a set of approaches for estimating the best parameters in $f()$.

$f()$ can be the equation of a line or a deep neural network with millions of parameters.

***
## What is learning?

Learning can be defined as finding the best model parameters to solve a problem.

**Simple example**: Find the relationship between IQ and education with a linear regression model. Two parameters ($a$ and $b$).

$$y = a*x + b $$ 
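A minimal sketch of what "finding $a$ and $b$" can look like in practice. The data are simulated (they are not real IQ or education measurements), and the names `true_a`, `true_b`, `a_hat` and `b_hat` are chosen only for this illustration. For a line, a closed-form least-squares solution exists, which is what `np.polyfit` computes here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumption, not real measurements): y = a*x + b + noise
true_a, true_b = 2.0, 1.0
x = rng.uniform(0, 10, size=200)
y = true_a * x + true_b + rng.normal(0, 1.0, size=x.size)

# Least-squares estimate of the two parameters.
# np.polyfit returns the coefficients from the highest degree down,
# so for deg=1 we get the slope first and the intercept second.
a_hat, b_hat = np.polyfit(x, y, deg=1)

print(f"estimated a = {a_hat:.2f} (true value {true_a})")
print(f"estimated b = {b_hat:.2f} (true value {true_b})")
```

The estimates are close to, but not exactly equal to, the values used in the simulation because of the random error $\epsilon$.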

**Complex example**: Find a mouse in an image. Millions of parameters.


<div>
<img src="../images/deep-neural-network.png" width="500"/>
</div>



## The training loop (when there is no closed formula)

1. Start with random model parameters
2. Feed labeled data to your model.
3. Calculate the error of your model (loss).
4. Adjust the model parameters by a small amount to reduce the error.
5. Go back to 2.
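The five steps above can be sketched with gradient descent on a straight line. This is only an illustration under stated assumptions: the data are simulated, the loss is the mean squared error, and the learning rate and number of steps are arbitrary choices for this toy problem. A line does have a closed-form solution, so the loop is used here only because it is easy to follow.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated labeled data (assumption): inputs x with labels y
x = rng.uniform(0, 1, size=500)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, size=x.size)

# 1. Start with random model parameters
a, b = rng.normal(size=2)
learning_rate = 0.1  # size of each adjustment (arbitrary choice)
losses = []

for step in range(2000):
    # 2. Feed the labeled data to the model
    y_pred = a * x + b

    # 3. Calculate the error of the model (mean squared error loss)
    error = y_pred - y
    losses.append(np.mean(error ** 2))

    # 4. Adjust the parameters by a small amount to reduce the error,
    #    moving against the gradient of the loss
    grad_a = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    # 5. Go back to 2 (next iteration of the loop)

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}")
```

Deep neural networks are trained with the same kind of loop, only with many more parameters and with the gradients computed automatically.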

***
## Prediction versus inference

Why do we want to estimate $f$?


### Prediction

* We focus on predicting $Y$.
* $f$ is treated as a black box (a useful black box)

### Inference

* **Understand** how $Y$ is affected as $X$ changes.
* Which predictors are associated with the response?
* Is the relation between $Y$ and each predictor adequately summarized using a linear equation?


***
## Supervised versus unsupervised

### Supervised
* The training set contains labeled data.
* For each observation of the predictors $X_{i}, i = 1,...,n$ there is a known response measurement $y_{i}$.
* Example: linear regression

### Unsupervised
* Uncovering hidden patterns from unlabeled data.
* For each observation $i = 1,...,n$, we observed a vector of measurements $X_{i}$, but no response $y_{i}$.
* Examples: k-means clustering, PCA
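A small sketch of the difference, using Scikit-learn on simulated data (the data and the variable names are assumptions made for this example). The supervised model receives both `X` and `y`; the unsupervised model receives only the measurements.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Supervised: every observation X_i comes with a known response y_i
X = rng.uniform(0, 10, size=(100, 1))
y = 1.5 * X[:, 0] + rng.normal(0, 0.5, size=100)
reg = LinearRegression().fit(X, y)  # the labels y are used for fitting

# Unsupervised: only measurements, no response.
# Two simulated groups of points in 2D, stacked without any labels.
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X_unlabeled = np.vstack([group_a, group_b])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unlabeled)

print("estimated slope:", reg.coef_[0])
print("cluster labels found:", np.unique(kmeans.labels_))
```

Note that k-means still needs to be told how many clusters to look for (`n_clusters`), even though it never sees any labels.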


***
## Regression versus classification

* If $Y$ is a continuous variable, then it is a regression task.
* If $Y$ is a categorical variable, then it is a classification task.


***
## Training and test sets

A **training set** is the set of observed data points used to estimate $f$. Our training set has $n$ observations.

A **test set** is used to test how accurate our model is. It is not used for training!

Keeping data aside to test how well your model works is essential when using complex models.

Complex models can learn to perform very well on your training set but might generalize very poorly to new data. This is called **overfitting**.
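A rough sketch of how a test set can reveal overfitting. Everything here is an assumption made for illustration: the data are simulated from a line plus noise, the split is 20 training and 10 test observations, and a polynomial of degree 9 stands in for a "complex model".

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data (assumption): a linear relationship plus noise
x = rng.uniform(-1, 1, size=30)
y = 2.0 * x + rng.normal(0, 0.3, size=x.size)

# Keep a third of the observations aside as a test set
indices = rng.permutation(x.size)
train_idx, test_idx = indices[:20], indices[20:]


def mse(coefs, x, y):
    """Mean squared error of a polynomial model on the given data."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)


train_mse, test_mse = {}, {}
for degree in (1, 9):
    # The model is fitted on the training set only
    coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    train_mse[degree] = mse(coefs, x[train_idx], y[train_idx])
    test_mse[degree] = mse(coefs, x[test_idx], y[test_idx])
    print(f"degree {degree}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

The more flexible model cannot do worse than the line on the training set, since a line is a special case of a degree-9 polynomial. Its error on the test set is typically larger, which is the signature of overfitting.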


***
## Time for a quiz!

[Link](https://docs.google.com/forms/d/e/1FAIpQLSfmL_igF1P0sZ_6aorGTE71pwNEa34oSWklG34y5vMPXvEYTQ/viewform?usp=sf_link)

or

https://tinyurl.com/y3jhxgr6

You have 5 minutes to complete the questions.


***
# Using machine learning to characterize a speed cell

**Speed cell**: Neuron for which the firing rate is linearly correlated with the running speed of an animal.

[Speed cells in the medial entorhinal cortex (2015) Nature](https://www.nature.com/articles/nature14622)


<div>
<img src="../images/speed_kropff.png" width="1000"/>
</div>
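Before working with recorded data, the idea can be tried on a simulated neuron. This is a sketch under stated assumptions: the speed range, slope, baseline rate and noise level are invented for the example, and they are not taken from the paper or from our recordings.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated speed cell (assumption, not real data):
# firing rate = slope * speed + baseline + noise
speed = rng.uniform(0, 50, size=1000)  # running speed, cm/s
rate = 0.4 * speed + 5.0 + rng.normal(0, 3.0, size=speed.size)  # Hz

# Linear regression: estimate the slope and the baseline firing rate
slope, baseline = np.polyfit(speed, rate, deg=1)

# Pearson correlation between running speed and firing rate
r = np.corrcoef(speed, rate)[0, 1]

print(f"slope = {slope:.2f} Hz per cm/s, baseline = {baseline:.2f} Hz")
print(f"correlation between speed and rate: r = {r:.2f}")
```

The slope tells us how much the firing rate changes per unit of running speed, and the correlation coefficient summarizes how well a line describes the relationship.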

