# 1. Preliminaries and data

Welcome to this tutorial! In this notebook we will review:
- How this tutorial is organized, and the background knowledge you will need for it.
- Common Python libraries for performing machine learning analyses, outlining the ones we will use in this tutorial.
- The dataset used for some of the exercises.
---

# About this tutorial

This short interactive tutorial will show you 
how to use the [scikit-learn](https://scikit-learn.org/) 
Python package to perform basic machine learning analysis. 
It will also cover how to visualize your results with 
the [Matplotlib](https://matplotlib.org/) 
and [seaborn](https://seaborn.pydata.org/) Python packages. 

## The contents
{TO-DO: enumerate notebooks}

## Assumed background knowledge
- For this tutorial, we assume you have:
    - Basic knowledge of Machine Learning concepts. 
    For example, you know the difference between supervised/unsupervised learning, 
    or the difference between classification, regression and clustering models.
    - Basic experience with Python.
- We also assume that you have watched the video 
["A tutorial on machine learning"](https://www.youtube.com/watch?v=pOAK6ynM11E&list=PLVso6Qs8PLCiciMyxyqxCzp38G5tEhdy6&index=6) 
by [Laura Suarez](https://twitter.com/LauraESuarez24).

If you feel you are missing some of this knowledge or experience, 
we recommend the following resources:

### ML background resources
- [Introduction to Statistical Learning](https://www.statlearning.com/)
- [Machine Learning](https://www.coursera.org/learn/machine-learning) by _Andrew Ng_

### Python background resources

{TO-DO: list}

---

# Machine Learning software in Python

The best-known Python library for performing machine learning (ML) analysis is [__scikit-learn__](https://scikit-learn.org/).
- (...) explain about the resources in scikit learn

In neuroscience research, other toolboxes have been developed specifically for ML analysis of neuroimaging data. For example, [nilearn](https://nilearn.github.io/) is very popular among fMRI researchers, while [mne](https://martinos.org/mne/stable/index.html) is best known in the M/EEG community. Scikit-learn is the backbone of both _nilearn_ and _mne_, so this tutorial focuses on the more general toolbox, in the hope that what you learn will carry over to the neuroscience-specific ones.

Other Python libraries are also needed when carrying out ML analysis. 


- (...) [explain that we will be using other software like pandas and numpy]

    - If you are not familiar with _pandas_ you can read their tutorial ["10 minutes to pandas"](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html).
---

# The datasets

In each notebook we will demonstrate how to run ML analyses using either synthetic data generated by _scikit-learn_ or well-known datasets that can be retrieved through the _scikit-learn_ API. We will not use real neuroimaging data in these examples.
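
As a rough sketch of what these two data sources look like, the snippet below generates a small synthetic classification problem with `make_classification` and loads the classic iris dataset with `load_iris`. The sample and feature counts and the `random_state` are arbitrary choices for illustration, not values used later in the tutorial.

```python
from sklearn.datasets import load_iris, make_classification

# Synthetic data: 100 samples, 5 features, 2 classes.
# These sizes and the seed are illustrative assumptions.
X_fake, y_fake = make_classification(
    n_samples=100,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)
print(X_fake.shape, y_fake.shape)

# A well-known dataset bundled with scikit-learn
X_iris, y_iris = load_iris(return_X_y=True)
print(X_iris.shape, y_iris.shape)
```

In both cases we get a feature matrix and a vector of labels with one entry per sample.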

However, at the end of each notebook you will find exercises where you will need to practice what you have learned using a real neuroimaging dataset. For these exercises, we will use a dataset from the [Autism Brain Imaging Data Exchange II](http://fcon_1000.projects.nitrc.org/indi/abide/abide_II.html) (ABIDE II) project. ABIDE is a long-running effort to advance understanding of autism by aggregating and sharing autism-related structural and functional imaging datasets from around the world.

Let's familiarize ourselves with this dataset. We will use [pandas](https://pandas.pydata.org/) for this purpose:

In [None]:
import pandas as pd

# Load dataset into dataframe
abide_data = pd.read_csv("../data/abide2.tsv", sep="\t")
abide_data

Using _pandas_, we can quickly see that our dataset has 1004 rows and 1446 columns. Each row appears to store the information for one subject, measured at a specific neuroimaging center (indicated by the column `site`). The dataset also contains demographic data for each participant, specifically their `age` and `sex`.
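
If you prefer to check these numbers programmatically rather than reading them off the table, a few _pandas_ calls are enough. The sketch below uses a tiny made-up table so that it runs on its own; the values and the two `fs_feature_*` column names are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Toy stand-in for the ABIDE II table. All values and the two
# "fs_feature_*" column names are invented for illustration.
toy = pd.DataFrame(
    {
        "site": ["A", "A", "B", "B"],
        "age": [8.5, 11.0, 9.2, 14.7],
        "sex": [1, 2, 1, 1],
        "group": [1, 2, 2, 1],
        "fs_feature_1": [0.1, 0.4, 0.3, 0.9],
        "fs_feature_2": [1.2, 0.8, 1.1, 0.7],
    }
)

# .shape returns (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# Number of participants recorded at each site
site_counts = toy["site"].value_counts()
print(site_counts)

# Summary statistics for age
print(toy["age"].describe())
```

Running the same calls on `abide_data` should report the row and column counts quoted above.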

Let's plot the age of the participants. Don't worry about the code that produces this plot yet; data visualization is covered in [notebook 5](./05-visualization.ipynb).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot histogram of age
sns.histplot(abide_data["age"])
plt.show()

We can see that the majority of the participants are children.

In this tutorial, the demographic data will not be relevant. In our exercises we are interested in predicting whether a participant has autism from their brain data. The column `group` encodes the variable we want to predict (1 = autism, 2 = control), and the columns starting with `fs` encode the brain data.
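
One possible way to separate the brain data from the variable to predict is to select columns by their name prefix. The sketch below does this on a tiny made-up table; the values and the `fs_feature_*` column names are hypothetical, and only the `group` coding (1 = autism, 2 = control) comes from the description above.

```python
import pandas as pd

# Toy stand-in for the ABIDE II table (invented values and
# hypothetical "fs_feature_*" column names).
toy = pd.DataFrame(
    {
        "site": ["A", "A", "B", "B"],
        "age": [8.5, 11.0, 9.2, 14.7],
        "group": [1, 2, 2, 1],
        "fs_feature_1": [0.1, 0.4, 0.3, 0.9],
        "fs_feature_2": [1.2, 0.8, 1.1, 0.7],
    }
)

# Brain data: every column whose name starts with "fs"
X = toy.loc[:, toy.columns.str.startswith("fs")]

# Variable to predict: the diagnostic group
y = toy["group"]

# Human-readable version of the labels, using the coding given above
y_named = y.map({1: "autism", 2: "control"})

print(X.shape, y.shape)
print(y_named.tolist())
```

The demographic columns (`site`, `age`) are left out of `X` because they do not start with `fs`.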

(...) Explain brain recordings:
- This dataset encodes brain measures derived from structural MRI in the columns starting with `fs`. 
- What ROIs are

The 1,440 features consist of 4 sets of 360. The 4 measures extracted by FreeSurfer are surface area, volume, cortical thickness, and local gyrification index. Each measure is computed for 360 regions, corresponding to the 360 parcels in the Human Connectome Project Multi-Modal Parcellation atlas (HCP-MMP1), which gives 4 × 360 = 1,440 features. The parcellation looks approximately like this:

(...) We are lucky, because this dataset is already pretty much prepared for performing machine learning analysis with it. A dataset ready for machine learning analysis:
- explain `X` (with samples (also called observations) and features (also called predictors)) 
    - In neuro features can represent channels, voxels, rois, etc.
- and `y` (labels, targets)
- explain the format of `X` {n_samples by n_features}
    - Add image
- In scikit-learn, `X` can be passed either as a pandas dataframe or as a NumPy array.
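
To illustrate the `X`/`y` format and the last point in the list, here is a minimal sketch that fits the same classifier once on a NumPy array and once on a pandas dataframe holding identical values. The random data, the column names `f1` to `f3`, and the choice of `LogisticRegression` are assumptions made only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# X has shape (n_samples, n_features): here 20 samples and 3 features.
# The values are random and carry no meaning.
rng = np.random.default_rng(0)
X_array = rng.normal(size=(20, 3))

# y holds one label per sample
y = np.array([0, 1] * 10)

# The same data wrapped in a dataframe with hypothetical column names
X_frame = pd.DataFrame(X_array, columns=["f1", "f2", "f3"])

# Fit the same model on both representations
model_array = LogisticRegression().fit(X_array, y)
model_frame = LogisticRegression().fit(X_frame, y)

# Both fits should learn the same coefficients
same_coef = np.allclose(model_array.coef_, model_frame.coef_)
print(X_array.shape, y.shape, same_coef)
```

When a dataframe is used, scikit-learn also records the column names, which can help keep track of which feature is which.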


# Acknowledgements

This tutorial takes inspiration from:
- [scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/)
- [sklearn tutorial](https://github.com/jakevdp/sklearn_tutorial) by _Jake VanderPlas_
- [Machine learning for psychologists: A gentle introduction](https://github.com/tyarkoni/ML4PS) by _Tal Yarkoni_
- [Machine learning tutorial for NeuroHackademy](https://github.com/neurohackademy/nh2020-curriculum/blob/master/tu-machine-learning-yarkoni) by _Tal Yarkoni_
- [Machine Learning and Data Visualization](https://github.com/sina-mansour/OHBM-Brainhack-2021) by _Sina Mansour_