# 1. Preliminaries and data

Welcome to this tutorial! In this notebook we will review:
- Information about this tutorial and the background knowledge you will need for it.
- Common Python libraries for performing machine learning analyses, outlining the ones we will use in this tutorial.
- The dataset used in some of the exercises.
---

# About this tutorial

This short interactive tutorial will show you 
how to use the [scikit-learn](https://scikit-learn.org/) 
Python package to perform basic machine learning analysis. 
It will also cover how to visualize your data with 
the [matplotlib](https://matplotlib.org/) 
and [seaborn](https://seaborn.pydata.org/) Python packages. 

## Assumed background knowledge
- For this tutorial, we assume you have:
    - Basic knowledge of machine learning concepts. 
    For example, you know the difference between supervised/unsupervised learning, 
    or the difference between classification, regression and clustering models.
    - Basic experience with Python.
- We also assume that you have watched the video 
["A tutorial on machine learning"](https://www.youtube.com/watch?v=pOAK6ynM11E&list=PLVso6Qs8PLCiciMyxyqxCzp38G5tEhdy6&index=6) 
by [Laura Suarez](https://twitter.com/LauraESuarez24).

### Machine Learning (ML) background resources
If you feel you lack some knowledge of or experience with ML, 
we recommend the following resources to fill the gap:
- [Introduction to Statistical Learning](https://www.statlearning.com/) by _Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani_
- [Machine Learning](https://www.coursera.org/learn/machine-learning) by _Andrew Ng_
- [StatQuest videos](https://statquest.org/video-index/) for a friendly introduction to some of the topics.


---

# Machine Learning software in Python

The best-known Python library for performing ML analysis is [__scikit-learn__](https://scikit-learn.org/). This library provides a [wide range](https://scikit-learn.org/stable/modules/classes.html) of classes and functions that implement the most common steps in ML. It is also known for its user-friendly and comprehensive [documentation](https://scikit-learn.org/stable/user_guide.html) and [examples](https://scikit-learn.org/stable/auto_examples/index.html).

In neuroscience research, other toolboxes have been developed specifically for carrying out ML analysis on neuroimaging data. For example, [nilearn](https://nilearn.github.io/) is very popular among fMRI researchers, while [mne](https://martinos.org/mne/stable/index.html) is best known in the M/EEG community. _Scikit-learn_ is the backbone of both _nilearn_ and _mne_, so in this tutorial we will explain how to use this more general toolbox in the hope that it will come in handy when using more neuroscience-specific tools.

Other Python libraries are also frequently needed when carrying out ML analysis. Of these, [NumPy](https://numpy.org/) provides numerical computing capabilities, while [pandas](https://pandas.pydata.org/) is a data analysis and manipulation library whose data structures are easy to use and visualize (if you are not familiar with _pandas_ you can read their tutorial ["10 minutes to pandas"](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)). We will use both _numpy_ and _pandas_ in this tutorial.

---

# The dataset

In each of the notebooks we will give examples of how to run ML analysis using either synthetic data generated by _scikit-learn_ or well-known datasets that can be retrieved through _scikit-learn_'s API.
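
As a minimal sketch of both options, the cell below generates a synthetic classification dataset with `make_classification` and loads the classic iris dataset with `load_iris`. The number of samples and features, and the choice of iris, are arbitrary picks for illustration and not necessarily what the later notebooks use.

```python
from sklearn.datasets import load_iris, make_classification

# Option 1: synthetic data generated by scikit-learn.
# n_samples and n_features are arbitrary values chosen for this sketch.
X_fake, y_fake = make_classification(n_samples=100, n_features=20, random_state=0)
print(X_fake.shape, y_fake.shape)

# Option 2: a well-known dataset retrieved through scikit-learn's API.
X_iris, y_iris = load_iris(return_X_y=True)
print(X_iris.shape, y_iris.shape)
```

In both cases we get back a feature matrix and a target vector as _numpy_ arrays.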

However, at the end of each notebook you will find exercises where you will need to practice what you have learned using a real neuroimaging dataset. For these exercises we will use a dataset from the [Autism Brain Imaging Data Exchange II](http://fcon_1000.projects.nitrc.org/indi/abide/abide_II.html) (ABIDE II) project. ABIDE is a long-running effort to advance understanding of autism by aggregating and sharing autism-related structural and functional imaging datasets from around the world.

Let's familiarize ourselves with this dataset. We will use a _pandas_ dataframe for this purpose:

In [None]:
import pandas as pd

# Load dataset into dataframe
abide_data = pd.read_csv("../data/abide2.tsv", sep="\t")
abide_data

Using _pandas_, we can quickly see that our dataset has 1004 rows and 1446 columns. Each row appears to store the brain recordings of one subject, measured at a specific neuroimaging center (indexed by the column `site`). The dataset also contains demographic data for every participant, specifically their `age` and `sex`.

Let's plot the age of the participants. Don't worry about the code that produces this plot yet. How to visualize data will be shown in [notebook 5](./05-visualization.ipynb).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot histogram of age
sns.histplot(abide_data["age"])
plt.show()

We can see that the majority of the participants are children.

In this tutorial the demographic data will not be relevant. In our exercises we are interested in being able to predict whether the participant has autism or not from their brain recordings. Thus, the column `group` encodes the variable we want to predict (1 = autism, 2 = control). 
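
Before modelling, it is often useful to check how many samples fall into each class. The sketch below shows one way to do this with `map` and `value_counts`. It uses a small made-up series (`toy_group`, a hypothetical stand-in) so that it runs on its own. On the real data you would use `abide_data["group"]` instead.

```python
import pandas as pd

# Hypothetical stand-in for abide_data["group"] (1 = autism, 2 = control).
toy_group = pd.Series([1, 2, 2, 1, 2], name="group")

# Replace the numeric codes with readable labels, then count each class.
labels = toy_group.map({1: "autism", 2: "control"})
counts = labels.value_counts()
print(counts)
```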

The columns starting with `fs` encode the brain recordings. These 1440 features are organized into 4 sets. Columns starting with `fsArea` encode the surface area of a specific region; those starting with `fsVol` encode its volume; those starting with `fsLGI` encode its local gyrification index; and those starting with `fsCT` encode its cortical thickness.
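
To check how many columns belong to each feature set, one option is to count the columns by prefix. The sketch below uses a small made-up dataframe (`toy_data`, whose column names are hypothetical and only mimic the naming pattern) so that it runs on its own. On the real data you would replace `toy_data` with `abide_data`.

```python
import pandas as pd

# Hypothetical miniature version of the dataset: only the prefixes match the real one.
toy_data = pd.DataFrame({
    "site": ["A", "B"],
    "age": [10.5, 12.0],
    "group": [1, 2],
    "fsArea_region1": [1.0, 2.0],
    "fsArea_region2": [1.5, 2.5],
    "fsVol_region1": [3.0, 4.0],
    "fsLGI_region1": [2.1, 2.2],
    "fsCT_region1": [2.8, 3.0],
})

# Count how many column names start with each feature-set prefix.
prefixes = ["fsArea", "fsVol", "fsLGI", "fsCT"]
counts = {p: int(toy_data.columns.str.startswith(p).sum()) for p in prefixes}
print(counts)
```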


This dataset has already been preprocessed by Tal Yarkoni in [this awesome ML tutorial](https://github.com/neurohackademy/nh2020-curriculum/blob/master/tu-machine-learning-yarkoni). Thus, it is ready to be used as the input of an ML analysis.

In a common ML analysis, our dataset is encoded in the following format:
- A matrix `X`, usually called __feature matrix__, stores the values of each __sample/observation__ (in our case subjects) for each of the __features__ (in our case brain data or demographic variables).
- The variable that we are trying to predict, also called our __outcome variable/target__, is stored in `y`. It can be a vector or a matrix, depending on whether we are trying to predict one or more variables (the latter is usually called a _multi-output_ or _multi-label_ problem).

The following image helps visualize these concepts and is taken from the [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook) by Jake VanderPlas:

<figure>
  <img src="../images/data_format.png" alt="kfoldcv" width="500"/>
</figure>
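
The same layout can be sketched with small _numpy_ arrays. The sizes and values below are made up for illustration. The point is only the shapes: `X` has one row per sample and one column per feature, while `y` has one entry (or one row) per sample.

```python
import numpy as np

# Arbitrary sizes for this sketch.
n_samples, n_features = 5, 3
rng = np.random.default_rng(0)

# Feature matrix: one row per sample, one column per feature.
X_example = rng.normal(size=(n_samples, n_features))

# Single target: a vector with one value per sample.
y_single = np.array([1, 2, 1, 2, 2])

# Two targets: a matrix with one row per sample and one column per target.
y_multi = np.column_stack([y_single, [10, 12, 9, 15, 11]])

print(X_example.shape, y_single.shape, y_multi.shape)
```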


How can we define `X` and `y` for the ABIDE dataset using _pandas_? With this library you can select a column by passing its name in square brackets, so we can define our target `y` as:

In [None]:
# Define y
y = abide_data["group"]
y

If we want `X` to be all brain recordings for each subject, we need to tell _pandas_ to select all columns starting with `fs`. We can do so using `.loc` (see the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)):

In [None]:
# Define X
X = abide_data.loc[:, abide_data.columns.str.startswith("fs")]
X

_scikit-learn_ expects both `X` and `y` to be stored as _numpy_ arrays or _pandas_ objects (a dataframe for `X`, a series or dataframe for `y`), so we have our data ready for ML analysis!
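
As a quick sanity check, the sketch below passes _pandas_ objects directly to a _scikit-learn_ estimator and also converts them to _numpy_ arrays with `to_numpy()`. It runs on a small made-up dataframe (`toy_data`, with hypothetical column names), and `DummyClassifier` is only a placeholder model chosen for this illustration, not one the tutorial relies on.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical miniature dataset mimicking the structure of abide_data.
toy_data = pd.DataFrame({
    "group": [1, 2, 2, 2],
    "age": [9.0, 11.0, 14.0, 10.0],
    "fsArea_region1": [1.0, 2.0, 1.5, 1.8],
    "fsCT_region1": [2.8, 3.0, 2.9, 3.1],
})

# Same selection logic as above: target column and all "fs" columns.
y_toy = toy_data["group"]
X_toy = toy_data.loc[:, toy_data.columns.str.startswith("fs")]

# pandas objects can be passed to an estimator as they are.
clf = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = clf.predict(X_toy)

# If plain numpy arrays are preferred, convert explicitly.
X_array = X_toy.to_numpy()
y_array = y_toy.to_numpy()
print(X_array.shape, y_array.shape)
```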

---
# Acknowledgements

This tutorial takes inspiration from:
- [scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/)
- [sklearn tutorial](https://github.com/jakevdp/sklearn_tutorial) by _Jake VanderPlas_
- [Machine learning for psychologists: A gentle introduction](https://github.com/tyarkoni/ML4PS) by _Tal Yarkoni_
- [Machine learning tutorial for NeuroHackademy](https://github.com/neurohackademy/nh2020-curriculum/blob/master/tu-machine-learning-yarkoni) by _Tal Yarkoni_
- [Machine Learning and Data Visualization](https://github.com/sina-mansour/OHBM-Brainhack-2021) by _Sina Mansour_