# Welcome!

This short interactive tutorial will show you 
how to use the [scikit-learn](https://github.com/scikit-learn/scikit-learn) 
Python package to perform basic machine learning analysis. 
It will also cover how to visualize your results with 
the [Matplotlib](https://matplotlib.org/) 
and [seaborn](https://seaborn.pydata.org/) Python packages. 

## The contents of the tutorial
{TO-DO: enumerate notebooks}

## Assumed background knowledge
- For this tutorial, we assume you have:
    - Basic knowledge of Machine Learning concepts. 
    For example, you know the difference between supervised and unsupervised learning, 
    and between classification, regression, and clustering models.
    - Basic experience with Python
- We also assume that you have watched the video 
["A tutorial on machine learning"](https://www.youtube.com/watch?v=pOAK6ynM11E&list=PLVso6Qs8PLCiciMyxyqxCzp38G5tEhdy6&index=6) 
by [Laura Suarez](https://twitter.com/LauraESuarez24).

If you think you are lacking some of this knowledge/experience, 
we recommend the following resources to fill this gap:

### ML background resources
- Introduction to Statistical Learning
- {List of background resources for ML}

### Python background resources

{TO-DO: list}

# Machine Learning in Python

- Explain scikit-learn
- Mention that there are other toolboxes, such as Nilearn and MNE, that carry out ML analysis specifically on brain data.
    - scikit-learn is the foundation of both, so we will explain how to use it to build more general knowledge

# The dataset

- !! Description of the data we will be using
- Explain comprehension exercises

Let's familiarize ourselves with this dataset. We will use [pandas](https://pandas.pydata.org/) DataFrames to inspect the data in this and the following tutorials.

```python
import pandas as pd

# Load the tab-separated ABIDE II data into a DataFrame
abide_data = pd.read_csv("../data/abide2.tsv", sep="\t")
abide_data.head()  # show the first five rows of the dataset
```

| | site | subject | age | age_resid | sex | group | fsArea_L_V1_ROI | fsArea_L_MST_ROI | fsArea_L_V6_ROI | fsArea_L_V2_ROI | ... | fsCT_R_p47r_ROI | fsCT_R_TGv_ROI | fsCT_R_MBelt_ROI | fsCT_R_LBelt_ROI | fsCT_R_A4_ROI | fsCT_R_STSva_ROI | fsCT_R_TE1m_ROI | fsCT_R_PI_ROI | fsCT_R_a32pr_ROI | fsCT_R_p24_ROI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABIDEII-KKI_1 | 29293 | 8.893151 | 13.642852 | 2.0 | 1.0 | 2750.0 | 306.0 | 354.0 | 2123.0 | ... | 3.362 | 2.827 | 2.777 | 2.526 | 3.202 | 3.024 | 3.354 | 2.629 | 2.699 | 3.179 |
| 1 | ABIDEII-OHSU_1 | 28997 | 12.0 | 16.081732 | 2.0 | 1.0 | 2836.0 | 186.0 | 354.0 | 2261.0 | ... | 2.809 | 3.539 | 2.944 | 2.769 | 3.53 | 3.079 | 3.282 | 2.67 | 2.746 | 3.324 |
| 2 | ABIDEII-GU_1 | 28845 | 8.39 | 12.866264 | 1.0 | 2.0 | 3394.0 | 223.0 | 373.0 | 2827.0 | ... | 2.435 | 3.321 | 2.799 | 2.388 | 3.148 | 3.125 | 3.116 | 2.891 | 2.94 | 3.232 |
| 3 | ABIDEII-NYU_1 | 29210 | 8.3 | 13.698139 | 1.0 | 1.0 | 3382.0 | 266.0 | 422.0 | 2686.0 | ... | 3.349 | 3.344 | 2.694 | 3.03 | 3.258 | 2.774 | 3.383 | 2.696 | 3.014 | 3.264 |
| 4 | ABIDEII-EMC_1 | 29894 | 7.772758 | 14.772459 | 2.0 | 2.0 | 3080.0 | 161.0 | 346.0 | 2105.0 | ... | 2.428 | 2.94 | 2.809 | 2.607 | 3.43 | 2.752 | 2.645 | 3.111 | 3.219 | 4.128 |


Using pandas, we can quickly check that our dataset has 1004 rows and 1446 columns.
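
The row and column counts of a DataFrame can be read from its `shape` attribute. The sketch below uses a small made-up DataFrame (`toy_data`, with hypothetical column names and values) so that it runs without the data file. With the real data, the equivalent call would be `abide_data.shape`.

```python
import pandas as pd

# Hypothetical stand-in for abide_data: 3 rows, 4 columns (values are made up)
toy_data = pd.DataFrame(
    {
        "site": ["A", "B", "C"],
        "age": [8.9, 12.0, 8.4],
        "group": [1, 1, 2],
        "feature_1": [2750.0, 2836.0, 3394.0],
    }
)

# shape is a (number of rows, number of columns) tuple
n_rows, n_columns = toy_data.shape
print(f"{n_rows} rows, {n_columns} columns")
```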

We are lucky: this dataset is already largely prepared for machine learning analysis. A dataset ready for machine learning analysis (...)

To-Do:
- explain `X` (with observations and features, also called predictors) and `y` (labels, targets)
    - In neuroscience, features can be channels, voxels, ROIs, etc.
- explain the format of `X` {n_samples by n_features}
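
As a rough illustration of the `X`/`y` convention outlined in the list above, the sketch below splits a small made-up table into a feature matrix and a label vector. The column names (`feature_1`, `feature_2`, `label`) and all values are hypothetical and are not taken from the tutorial dataset.

```python
import pandas as pd

# Hypothetical toy table: two feature columns and one label column
toy = pd.DataFrame(
    {
        "feature_1": [0.1, 0.4, 0.35, 0.8],
        "feature_2": [1.0, 0.9, 0.2, 0.1],
        "label": [0, 0, 1, 1],
    }
)

# X is 2D: one row per observation, one column per feature
X = toy[["feature_1", "feature_2"]]
# y is 1D: one label per observation
y = toy["label"]

print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)
```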


- In scikit-learn, `X` can be passed either as a pandas DataFrame or as a NumPy array.
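
A minimal sketch of this point, assuming a tiny made-up dataset and a 1-nearest-neighbor classifier chosen only for illustration: the same estimator is fitted once on a DataFrame and once on the equivalent NumPy array, and the two sets of predictions are compared.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 4 observations, 2 features, 2 classes
X_df = pd.DataFrame(
    {
        "feature_1": [0.0, 0.1, 0.9, 1.0],
        "feature_2": [0.0, 0.2, 0.8, 1.0],
    }
)
y = np.array([0, 0, 1, 1])

# Fit one model on the DataFrame...
model_df = KNeighborsClassifier(n_neighbors=1).fit(X_df, y)

# ...and another on the same values as a NumPy array
X_array = X_df.to_numpy()
model_array = KNeighborsClassifier(n_neighbors=1).fit(X_array, y)

# Each model predicts on the same kind of input it was fitted on
pred_df = model_df.predict(X_df)
pred_array = model_array.predict(X_array)
print(np.array_equal(pred_df, pred_array))
```

Each model is asked to predict on the same input type it was fitted on, because scikit-learn warns when a model fitted with feature names later receives an array without them.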


To-do, going back to our dataset:
- explain the classes (which will be our `y`)
- explain the meaning of each feature
- explain `ROIs`

# Additional Resources
- https://inria.github.io/scikit-learn-mooc/
- https://github.com/jakevdp/sklearn_tutorial
- https://github.com/tyarkoni/ML4PS