## What is Machine Learning?
Machine learning is a subset of artificial intelligence (AI) that allows machines to learn from raw data. It enables systems to learn and improve automatically from experience, without being explicitly programmed and without prior knowledge of the learning environment.

As ever larger volumes of genomic data are generated, machine learning has become one of the tools scientists use to make sense of them. For example, the basecalling step in Oxford Nanopore sequencing relies on machine learning, specifically neural networks, to infer the bases from the raw signals.

### Applications of Machine Learning
Some applications of machine learning in genomics include:

1. Predicting the influence of genetic variation on gene regulatory mechanisms such as DNA receptiveness and splicing.
2. Supporting gene-editing studies by significantly reducing the time, cost and effort needed to identify an appropriate target sequence.
3. Direct-to-consumer genomic applications. Genomic tests can help determine the likelihood of developing a particular disease and reveal inherited genetic traits. Machine learning can be used to identify patterns, make predictions and model the progression or treatment of a disease.

In this training, we explore how machine learning can be used in genomics and, specifically, how to transform the data.

> Machine learning helps us answer questions. How do we define the question? (Delta Analytics)


## Modules
- What is machine learning?
- How do you define a research question?
- What are observations?
- What are features?
- What are outcome variables?
- Introduction to genomic data
- Model Selection and Evaluation 
- Linear Regression 
- Decision Trees
- Ensemble Algorithms
- Unsupervised Learning Algorithms
- Natural Language Processing Part 1
- Natural Language Processing Part 2

## Model 
All models have 3 key components: 
- **Task**: What is the problem we want our model to solve?
- **Learning methodology**: ML algorithms can be supervised or unsupervised. This determines the learning methodology.
- **Performance measure**: Quantitative measure we use to evaluate the model’s performance.

### 1. Task
What is the problem we want our model to solve?
- **Defining f(x)**: What function will map our x (input) as closely as possible to the true Y (output)?
    - The goal of f(x) is to predict a Y* as close to the true Y as possible.
    - Given the explanatory feature(s), the model (`f(x) + e = Y*`) leads to the predicted outcome (Y*).

e is the *irreducible error*. It captures error caused by factors such as measurement error and randomness in the data. No matter how well you optimize your model, it will never be reduced to 0. Error caused by an inappropriate model choice, by contrast, can be reduced by choosing a better model.
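The idea of irreducible error can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: `true_f`, the noise level `noise_sd`, and the simulated data are all hypothetical and chosen only for illustration. Even when predictions come from the true function itself, the mean squared error stays near the variance of the noise rather than dropping to 0.

```python
import random


def true_f(x):
    """Hypothetical 'true' relationship between x and Y (an assumption)."""
    return 2.0 * x + 1.0


def simulate(n, noise_sd, seed=0):
    """Draw n observations where the outcome is true_f(x) plus random noise e."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted outcomes."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


xs, ys = simulate(n=5000, noise_sd=1.5)

# Predict with the true function: the best any model could possibly do.
perfect_predictions = [true_f(x) for x in xs]
irreducible_mse = mse(ys, perfect_predictions)

# The remaining error is roughly the noise variance, noise_sd ** 2 = 2.25.
print(round(irreducible_mse, 2))
```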

- **Feature Engineering and Selection**: What is x? How do we decide what explanatory features to include in our model?
- **Is our f(x) correct for this problem?**: What assumptions does our model make about the data? Do we have to transform the data?

Building a model involves turning your research question into a machine learning question. A machine learning task has explanatory features and an outcome feature.

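An **outcome feature** is the feature we expect to change when the explanatory features are manipulated. For example, in a model that estimates the likelihood of developing a disease from genetic variants, the disease likelihood is the outcome feature and the variants are the explanatory features.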
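In code, a machine learning task is usually set up by separating the explanatory features from the outcome feature. The sketch below assumes a tiny, made-up dataset; the feature names (`gc_content`, `variant_count`, `expression_level`) and all values are hypothetical and serve only to show the structure.

```python
# Hypothetical observations: each dict is one sample (one row of data).
# The feature names and values are invented purely for illustration.
observations = [
    {"gc_content": 0.41, "variant_count": 3, "expression_level": 12.5},
    {"gc_content": 0.55, "variant_count": 0, "expression_level": 20.1},
    {"gc_content": 0.38, "variant_count": 5, "expression_level": 9.8},
]

explanatory_features = ["gc_content", "variant_count"]
outcome_feature = "expression_level"

# X holds the explanatory features (one row per observation);
# y holds the outcome feature the model should predict.
X = [[obs[name] for name in explanatory_features] for obs in observations]
y = [obs[outcome_feature] for obs in observations]

print(X)
print(y)
```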


### 2. Learning Methodology
Is our model supervised or unsupervised? How does that affect the learning process?
- How does our **ML model learn**? An overview of how the model teaches itself.
    - How the algorithm learns depends on the type of data you have.
    - Labelled data calls for supervised learning, while unlabelled data calls for unsupervised learning.
    - A model’s goal is to minimize the loss function.
- What is our **loss function**? Every supervised model has a loss function it wants to minimize.
    - A loss function quantifies how unhappy you would be if you used f(x) to predict Y* when the correct output is Y.
    - It is what we want to minimize.
    - The loss function quantifies how well our f(x) fits our data.
    - The choice of loss function depends on the type of task:
        - **Regression**: [absolute error (L1) and least squares error (L2)](http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/), including mean squared error (MSE) and root mean squared error (RMSE)
        - **Classification**: log loss and hinge loss
- **Optimization process**: How does the model minimize the loss function?
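The loss functions named above can be written in a few lines each. This is a minimal sketch using only the Python standard library, not any particular library's API. It assumes class labels of 0/1 for log loss and -1/+1 for hinge loss, and the example values are made up.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """L1 loss: average absolute difference."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """L2 loss: average squared difference (MSE)."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """RMSE: square root of the MSE, in the same units as the outcome."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


def log_loss(y_true, p_pred, eps=1e-15):
    """Log loss for labels in {0, 1} and predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def hinge_loss(y_true, scores):
    """Hinge loss for labels in {-1, +1} and raw model scores."""
    return sum(max(0.0, 1 - y * s) for y, s in zip(y_true, scores)) / len(y_true)


# Regression example (made-up values)
y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.0, 4.0]
print(mean_absolute_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
print(root_mean_squared_error(y_true, y_pred))

# Classification examples (made-up values)
print(log_loss([1, 0], [0.9, 0.2]))
print(hinge_loss([1, -1], [0.5, -2.0]))
```

Squared error penalizes large mistakes more heavily than absolute error, which is one reason the choice of loss depends on the task.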

The process of changing f(x) to reduce the loss function is called learning. It is what makes ordinary least squares (OLS) regression a machine learning algorithm.
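To make "changing f(x) to reduce the loss" concrete, the sketch below fits a straight line by gradient descent on the mean squared error. Note the assumptions: OLS is normally solved in closed form, so gradient descent is swapped in here only because it shows the learning loop explicitly. The function name, learning rate, step count, and data are all hypothetical.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ slope * x + intercept by repeatedly stepping against the MSE gradient."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current f(x) on every observation
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradient of the MSE with respect to each parameter
        grad_slope = (2 / n) * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = (2 / n) * sum(residuals)
        # Update f(x) in the direction that lowers the loss
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up data generated from y = 2x + 1 with no noise
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line_gradient_descent(xs, ys)
print(round(slope, 4), round(intercept, 4))
```

Each pass through the loop adjusts the parameters of f(x) slightly, so the fitted line moves toward the one that minimizes the loss.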

### 3. Performance measure
- Measures of performance: R², adjusted R², MSE
- Feature performance: Statistical significance, p-values 
- Ability to generalize to unseen data: Overfitting, underfitting, bias, variance
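The regression measures listed above can be computed directly from the true and predicted outcomes. This is a minimal sketch with made-up values; the function names are hypothetical, and `n_features` is the number of explanatory features in the model.

```python
def mean_squared_error(y_true, y_pred):
    """MSE: average squared difference between true and predicted outcomes."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R²: share of the outcome's variance explained by the model."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(y_true, y_pred, n_features):
    """Adjusted R²: R² penalized for the number of explanatory features."""
    n = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)


# Made-up outcomes and predictions from a model with one explanatory feature
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

print(mean_squared_error(y_true, y_pred))
print(r_squared(y_true, y_pred))
print(adjusted_r_squared(y_true, y_pred, n_features=1))
```

To judge generalization, these measures are best computed on data the model did not see during training; a large gap between training and held-out scores suggests overfitting.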

[Stats Book](https://bookdown.org/sbikienga/Intro_to_stat_book/)


## Some images

![Image](https://ars.els-cdn.com/content/image/1-s2.0-S2001037020303068-gr2_lrg.jpg)



```python
# pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install
import math
```