# Description of the project

## Data science in health genomics
Most genetic association studies rely on small datasets, both in terms of participants and genetic resolution. Therefore, the outcome is often focused on particular genetic variants showing a strong difference in frequency between cases and controls. Typically, these variants are identified in Genome Wide Association Studies (GWAS), where frequencies of genetic variants are compared between cases and controls.

<img src="images/Popgen_pitch_gwas.png" />

However, most diseases involve mutations in multiple genes and, with these traditional methods, it is tremendously time-consuming to uncover all the gene combinations related to a given disease.

Only recently have trials with thousands of participants and millions of genetic variants enabled finer resolution. These new studies have two main goals:

* Create accurate patient profiles to provide precisely targeted treatment and improve prevention.
* Learn more about the complexity of human genes.

To achieve these goals, researchers have explored machine learning techniques, such as random forest, to reveal the most likely combinations of genetic variants related to a disease, allowing more accurate diagnosis and time-saving when investigating treatments.

* **Machine Learning** has already improved predictive abilities for complex diseases.
* **Machine Learning**, and **Deep Learning** in particular, make fewer assumptions about the signatures of selection.


These approaches are a core component of **Precision medicine**: medical treatments tailored to patients' characteristics.

**Machine Learning** should improve risk prediction models, and therefore prevention and the targeting of costly medical screening procedures.

## Novelty
A person often carries some genetic variants suggesting an increased risk for a given disease, while others suggest a decreased risk.
We also know that multiple genes are involved in diseases, but many are probably missed because their effect is not strong enough to be detected with current methods and datasets.
As such, predictions can be inaccurate.

But with:

1. more individuals,
2. complete coverage of the genome,
3. and models incorporating the complexity of genetic relations,

predictions should improve.

This implies a change in method, as we have so far relied on comparisons of frequencies and logistic regressions to identify individual genetic markers associated with a disease.

## Deliverables
My goal is to explore the potential of **Data Science**, and **Machine Learning** in particular, for the prediction of genetic diseases in clinical genomics.
1. I am hereby publishing my results and the tools I used in hope they will be useful to other researchers and clinicians.

2. I am also sharing my scripts and providing **Interactive Tools** to speed up data exploration.

## Project

Current studies rely on Genome Wide Association Studies (GWAS) to identify outlier genetic variants between case and control groups.

<img src="images/Popgen_w3_gwas.png" />

Taking advantage of the increasing resolution and sample size, recent studies have been exploring the benefits of Machine and Deep Learning to predict predispositions to diseases more accurately.

In addition, models have become more accurate and complex with the inclusion of population structure covariates.

Once the prediction has been validated on the testing set of the trial, it can be extended to the global population by taking advantage of the genetic data from the __[1000 Genomes Project](https://www.internationalgenome.org/home)__.

<img src="images/Popgen_w3_ml.png" />

## Data science tools 

### Packages
* `pandas`: the most common Python package for working with dataframes
https://pandas.pydata.org/
* `pandas_plink`: Python package to read plink files
https://pypi.org/project/pandas-plink/
* `scikit-learn`: the most common machine learning package in Python
https://scikit-learn.org/stable/index.html
* `seaborn`: data visualization package, oriented towards pandas dataframes
https://seaborn.pydata.org/

### Functions
* `SimpleImputer`: Imputation transformer for completing missing values. From `scikit-learn`
* `OneHotEncoder`: Encodes categorical values as numerical arrays. Can be used to encode each of the three possible genotypes of a SNP. From `scikit-learn`
* `Pipeline`: A pipeline consists of a sequential list of transformers followed by an estimator. From `scikit-learn`
* `train_test_split`: Splits the dataset (X and y) into training and testing sets, usually in a 70/30 or 80/20 ratio. This step was done in R with a similar function before the GWAS. From `scikit-learn`
* **n_jobs**: Controls the number of parallel jobs for machine learning. Used in `RandomizedSearchCV`
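
As a rough illustration of how the functions above fit together, here is a minimal sketch on a small random genotype matrix. The data, the matrix dimensions, the imputation strategy and the hyperparameter values are assumptions made for the example; they are not the settings used in the project notebooks.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 individuals x 10 SNPs, genotypes coded 0/1/2
X = rng.integers(0, 3, size=(200, 10)).astype(float)
y = rng.integers(0, 2, size=200)            # 0 = control, 1 = case
X[rng.random(X.shape) < 0.05] = np.nan      # about 5% missing genotypes

# 80/20 split, keeping the case/control ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    # Replace missing genotypes with the most frequent genotype of each SNP
    ("impute", SimpleImputer(strategy="most_frequent")),
    # One indicator column per genotype (0, 1, 2) and per SNP
    ("encode", OneHotEncoder(categories=[[0.0, 1.0, 2.0]] * X.shape[1])),
    # n_jobs=-1 lets the forest use all available cores
    ("model", RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)),
])
pipe.fit(X_train, y_train)

# Output of the preprocessing steps only (imputation + encoding)
encoded = pipe[:-1].transform(X_test)
```

Wrapping the preprocessing and the classifier in one `Pipeline` means the imputer and encoder are fitted on the training set only, which avoids leaking information from the testing set.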

### Machine learning using `scikit-learn`
* `RandomForestClassifier`: Ensemble classifier built from many decision trees. A number of parameters were optimized with `RandomizedSearchCV`
* `RandomizedSearchCV`: Randomized search to optimize the parameters of the estimator or pipeline (`RandomForestClassifier` in this case)
    * **n_snps**: The number of SNPs considered in the model (X)
    * **max_depth**: The maximum depth of the tree
    * **n_estimators**: The number of trees in the forest
    * **min_samples_split**: The minimum number of samples required to split an internal node
    * **min_samples_leaf**: The minimum number of samples required to be at a leaf node. 

* Accuracy and overfitting:
    * Cross-validation: set to 5 folds in `RandomizedSearchCV`
    * Iterations: 2000 iterations in `RandomizedSearchCV`
    * Accuracy
    * Recall
    * Confusion Matrix
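
The search and evaluation steps listed above can be sketched as follows. This is only an illustration on toy data: the genotype matrix, the candidate parameter values and `n_iter=5` are assumptions chosen to keep the example fast (the project used 2000 iterations), and **n_snps** is left out because it is not a `RandomForestClassifier` parameter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(1)

# Hypothetical toy data: 300 individuals x 20 SNPs coded 0/1/2;
# the phenotype depends on the first two SNPs plus random noise
X = rng.integers(0, 3, size=(300, 20))
score = X[:, 0] + X[:, 1] + rng.normal(0, 1, size=300)
y = (score > np.median(score)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Candidate values are illustrative, not the grid used in the notebooks
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

N_JOBS = 1  # set to -1 to use all available cores

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,            # number of random parameter combinations tried
    cv=5,                # 5-fold cross-validation on the training set
    scoring="accuracy",
    n_jobs=N_JOBS,
    random_state=1,
)
search.fit(X_train, y_train)

# Evaluate the best model on the held-out testing set
y_pred = search.predict(X_test)
acc = accuracy_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class

print(search.best_params_)
print(f"accuracy={acc:.2f} recall={rec:.2f}")
print(cm)
```

Comparing the cross-validated score (`search.best_score_`) with the accuracy on the testing set is one simple way to spot overfitting: a large gap suggests the model does not generalize well.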
    

### Simulations
Simulations were performed with __[plink](http://zzz.bwh.harvard.edu/plink/)__ for:
* 4000 individuals 
    * 2000 cases / 2000 controls


* 1 000 000 genetic variants
    * 100 of them simulate a disease


* Split training and testing set 80/20
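
The simulations for this project were run with plink. For readers who want to experiment without plink, the design above can be imitated at a much smaller scale in Python. The sketch below is a toy analogue only: the sample sizes, the allele frequency range and the frequency shift of 0.15 for disease variants are hypothetical values, not those of the plink simulation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Scaled-down design: 200 cases, 200 controls, 1000 SNPs, 10 disease SNPs
n_cases, n_controls, n_snps, n_causal = 200, 200, 1000, 10

# Allele frequencies in controls, drawn uniformly (assumption)
control_freq = rng.uniform(0.05, 0.5, size=n_snps)

# Disease SNPs get a higher allele frequency in cases (hypothetical shift)
case_freq = control_freq.copy()
causal_idx = rng.choice(n_snps, size=n_causal, replace=False)
case_freq[causal_idx] = np.clip(control_freq[causal_idx] + 0.15, 0.0, 0.95)

# Genotypes are the number of copies of the allele: 0, 1 or 2
controls = rng.binomial(2, control_freq, size=(n_controls, n_snps))
cases = rng.binomial(2, case_freq, size=(n_cases, n_snps))

X = np.vstack([cases, controls])
y = np.concatenate([np.ones(n_cases, dtype=int), np.zeros(n_controls, dtype=int)])

# 80/20 split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```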


### Genome Wide Association Study
The GWAS is the most common approach in clinical studies. In a GWAS, the frequency of each genetic variant is compared between case and control groups, and outliers are identified.

<img src="images/manhattan plot unadjusted assoc train.png" />

Manhattan plot used to identify genetic variants associated with a disease. The X axis represents the 23 human chromosomes, and the Y axis the p-value associated with each genetic variant.

Variants simulating the disease are marked in green.
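
To make the frequency comparison concrete, here is a minimal sketch of a per-variant association test in Python. It uses a plain allelic chi-square test from `scipy` on toy data, which is an assumption made for illustration: the GWAS in this project was run outside Python, and the helper name `allelic_test` is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency


def allelic_test(genotypes, is_case):
    """Return one p-value per SNP from a 2x2 table of allele counts."""
    genotypes = np.asarray(genotypes)
    is_case = np.asarray(is_case, dtype=bool)
    p_values = np.ones(genotypes.shape[1])
    for j in range(genotypes.shape[1]):
        g_case = genotypes[is_case, j]
        g_ctrl = genotypes[~is_case, j]
        # Rows: cases / controls; columns: counted allele / other allele
        table = np.array([
            [g_case.sum(), 2 * g_case.size - g_case.sum()],
            [g_ctrl.sum(), 2 * g_ctrl.size - g_ctrl.sum()],
        ])
        # A SNP with no variation cannot be tested; keep p = 1
        if (table.sum(axis=0) == 0).any():
            continue
        p_values[j] = chi2_contingency(table, correction=False)[1]
    return p_values


# Toy data: 200 cases and 200 controls, 50 SNPs coded 0/1/2.
# Only SNP 0 differs in allele frequency between the two groups.
rng = np.random.default_rng(7)
is_case = np.repeat([True, False], 200)
genotypes = rng.binomial(2, 0.2, size=(400, 50))
genotypes[is_case, 0] = rng.binomial(2, 0.5, size=200)

p_values = allelic_test(genotypes, is_case)
neg_log_p = -np.log10(p_values)   # scale commonly used for Manhattan plots
```

Plotting `neg_log_p` against the position of each variant gives a Manhattan plot, where associated variants stand out as peaks.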


### Machine Learning
The script and results for the machine learning part can be found in the following notebook:

__[Jupyter notebook on the Random Forest and Gradient Boosting Classifier](Scripts/RandomForest_HumanSimulation.ipynb)__

### Local adaptation of Ugandan Cattle
The script and results for this part of the project can be found in the following notebook:

__[Jupyter notebook on the Random Forest and Gradient Boosting Classifier](Scripts/UGBT.ipynb)__

## Resources
About Random Forest Classifiers
* https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/
* https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76
* https://www.datacamp.com/community/tutorials/random-forests-classifier-python


About overfitting and control
* https://elitedatascience.com/overfitting-in-machine-learning
* https://scikit-learn.org/stable/modules/cross_validation.html