# An introduction to Machine Learning


## Purpose of the course

 1. Provide an overview of what Machine Learning is and the tools it uses.
 2. Provide the basics of statistics and data analysis.
 3. Show an overview of Machine Learning techniques.
 4. Provide best practices for data interpretation and visualisation.
 5. Provide the knowledge needed to understand the basic evaluation of Machine Learning algorithms.

## Some references

 - **Understanding Machine Learning: From Theory to Algorithms**
by Shai Shalev-Shwartz and Shai Ben-David

 - **Scikit-Learn Tutorial: Statistical-Learning for Scientific Data Processing**
by Andreas Mueller

 - **Building Machine Learning Systems with Python**
by Willi Richert and Luis Pedro Coelho

 - **An Introduction to Statistical Learning (with applications in R)**
by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

 - **Deep Learning**
by Ian Goodfellow, Yoshua Bengio and Aaron Courville

# Setting up the environment

## Installing Miniconda Python

Miniconda is a free minimal installer for conda. It is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others. Use the `conda install` command to install 720+ additional conda packages from the Anaconda repository.

Install scripts are available for Linux, Windows, and macOS. Since most of us use a Mac, I will only show how to install on this OS. Instructions for other platforms are available [here](https://conda.io/projects/conda/en/latest/user-guide/install/index.html).

### Get the installer
The macOS installer can be downloaded from [here](https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh).

Then run the installer from the directory where it was downloaded:
```bash
$ bash Miniconda3-latest-MacOSX-x86_64.sh
```

You can specify the installation directory, if you so wish. The default is `~/miniconda3` (on macOS, `/Users/$USER/miniconda3`).

# Some important libraries  (*there is a module for that...*)

## `numpy`

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

 - a powerful N-dimensional array object;
 - broadcasting functions;
 - tools for integrating C/C++ and Fortran code;
 - useful linear algebra, Fourier transform, and random number capabilities.

NumPy can be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. Because the low-level numerical routines in NumPy are implemented in C, operations on whole arrays run much faster than equivalent pure-Python loops.
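As a small illustration of the N-dimensional array and broadcasting mentioned above, here is a minimal sketch. The array values and variable names are made up for the example.

```python
import numpy as np

# A 2-D array (3 rows, 2 columns) and a 1-D array with one value per column.
# The numbers are arbitrary illustration data.
matrix = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
offsets = np.array([10.0, 20.0])

# Broadcasting: `offsets` is applied to every row of `matrix`,
# so no explicit Python loop is needed.
shifted = matrix + offsets

# Vectorised reduction: the mean of each column.
column_means = matrix.mean(axis=0)

print(shifted)
print(column_means)
```

The same pattern (operate on whole arrays, avoid loops) is what makes `pandas` operations fast as well.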

## `pandas`

`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.
`pandas` uses `numpy` under the hood to ensure all data operations are as fast as possible (if you use the right tools).

`pandas` is going to be the base of our data handling and we'll get to know it intimately.

## `matplotlib` and `seaborn`

Libraries to generate beautiful plots and graphs. We will use these to view our data.

## `scipy`

The SciPy ecosystem is built on top of Python and NumPy.

It includes:

 - The SciPy library, a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more.
 - Matplotlib, a mature and popular plotting package that provides publication-quality 2-D plotting, as well as rudimentary 3-D plotting.

On this base, the SciPy ecosystem includes general and specialised tools for data management and computation, productive experimentation, and high-performance computing.
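To give a flavour of the SciPy library's toolboxes, the sketch below uses the optimisation and statistics modules. The function being minimised and the sample data are invented for the example.

```python
import numpy as np
from scipy import optimize, stats

# Optimisation: find the minimum of a simple quadratic, f(x) = (x - 2)^2 + 1.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# Statistics: Pearson correlation between two small, made-up samples.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
correlation, p_value = stats.pearsonr(x, y)

print(result.x)     # location of the minimum, close to 2
print(correlation)  # close to 1 because y is a linear function of x
```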

## `scikit-learn`

Efficient library for data analysis and predictive modeling. It provides access to a plethora of Machine Learning algorithms through a simple and consistent API.

## `statsmodels`

Statistical modeling and hypothesis testing.

-------------

## Deep Learning toolkits

### `TensorFlow`

TensorFlow is an open source platform for Machine Learning developed by Google.

### `Keras`


### `PyTorch`

<img src="https://www.dataiku.com/static/img/learn/guide/getting-started/getting-started-with-python/logo-stack-python.png">

# Some basic syntax

## `pandas` - data manipulation

In [2]:
# standard import (usually aliased as `pd`)
import pandas as pd

#### Series

In [3]:
# basic structure
series = pd.Series([1, 2, 3, 4])

In [4]:
series

0    1
1    2
2    3
3    4
dtype: int64

#### DataFrame

A DataFrame is a collection of (named) Series that share the same index.

In [8]:
df = pd.DataFrame([[1, 2, 3], [2, 4, 6]], columns=["number", "doubles", "triples"])

In [9]:
df

|   | number | doubles | triples |
|---|--------|---------|---------|
| 0 | 1      | 2       | 3       |
| 1 | 2      | 4       | 6       |
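The idea that a dataframe is a collection of named series can be made concrete by building one directly from `Series` objects. This is only a sketch; the variable names are chosen for the example.

```python
import pandas as pd

# Two named Series sharing the same (default) index.
number = pd.Series([1, 2], name="number")
doubles = pd.Series([2, 4], name="doubles")

# Each Series becomes one column; the Series names become the column names.
frame = pd.concat([number, doubles], axis=1)

# Selecting a single column gives back a Series.
column = frame["doubles"]
print(type(column).__name__)
```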


In [10]:
# selecting columns
df["doubles"]

0    2
1    4
Name: doubles, dtype: int64

In [11]:
# subset of columns
df[["number", "triples"]]

|   | number | triples |
|---|--------|---------|
| 0 | 1      | 3       |
| 1 | 2      | 6       |


In [12]:
# masking
df["triples"] == 6

0    False
1     True
Name: triples, dtype: bool

In [13]:
# filtering using masks
df[df["triples"] == 6]

|   | number | doubles | triples |
|---|--------|---------|---------|
| 1 | 2      | 4       | 6       |
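Masks can also be combined. The sketch below uses a slightly larger, made-up frame (one extra row) so that the different combinations select different rows.

```python
import pandas as pd

# Illustration data: same columns as above, with one extra row.
frame = pd.DataFrame(
    [[1, 2, 3], [2, 4, 6], [3, 6, 9]],
    columns=["number", "doubles", "triples"],
)

# Combine masks with & (and), | (or) and ~ (not).
# The parentheses are required because these operators bind
# more tightly than comparisons such as > or ==.
both = frame[(frame["doubles"] > 2) & (frame["triples"] < 9)]
either = frame[(frame["number"] == 1) | (frame["triples"] == 9)]
negated = frame[~(frame["number"] == 2)]

print(both)
print(either)
print(negated)
```

Using the Python keywords `and`, `or` and `not` on masks raises an error, which is why the element-wise operators are used here.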


In [15]:
# selecting specific rows
df.loc[0]  # by index label

number     1
doubles    2
triples    3
Name: 0, dtype: int64

In [16]:
df.iloc[1] # by position

number     2
doubles    4
triples    6
Name: 1, dtype: int64
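With the default index (0, 1, 2, ...) labels and positions coincide, so `loc` and `iloc` look interchangeable. The sketch below uses a hypothetical non-default index to show where they differ.

```python
import pandas as pd

frame = pd.DataFrame(
    {"number": [1, 2, 3], "doubles": [2, 4, 6]},
    index=[10, 20, 30],  # hypothetical, non-default index labels
)

by_label = frame.loc[20]     # the row whose index label is 20
by_position = frame.iloc[0]  # the first row, whatever its label

# loc slices by label and includes both ends;
# iloc slices by position and excludes the end.
label_slice = frame.loc[10:20]
position_slice = frame.iloc[0:1]

print(by_label)
print(by_position)
```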

## `scikit-learn` - ML algorithms

`scikit-learn` follows a general API structure for all of its components. The most commonly used methods are `fit`, `transform`, `fit_transform`, and `predict`.

We will see more of this when we deal with the first Machine Learning examples.
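As a preview, here is a minimal sketch of how these four methods fit together. The choice of `StandardScaler` and `LogisticRegression`, the wine dataset, and the split settings are assumptions made for the example, not a recommended recipe.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features and class labels of the wine dataset bundled with scikit-learn.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Transformers expose fit / transform (or fit_transform in one step).
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training data, then scale it
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on unseen data

# Predictors expose fit / predict.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

print(predictions[:5])
```

Note that the scaler is fitted on the training data only and then applied to the test data, so no information from the test set leaks into the preprocessing.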

# Practice

In [18]:
from sklearn.datasets import load_wine

wine = load_wine()

df = pd.DataFrame(wine["data"], columns=wine["feature_names"])
target = wine["target"]

In [20]:
df.head()

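|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---------|------------|-----|-------------------|-----------|---------------|------------|----------------------|-----------------|-----------------|-----|------------------------------|---------|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |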


Find:
 - The average phenol content of all the wines.
 - How many wines have an alcohol content <12%.
 - The wine with the highest alcohol content.
 - The median content of malic acid for all wines that have more than 13.5% alcohol content.
 - The wines with alcohol content >12% and magnesium content >100.

In [31]:
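# answers, in the order of the questions above
df.total_phenols.mean()                        # average phenol content
df[df.alcohol < 12].shape[0]                   # number of wines with alcohol < 12%
df[df.alcohol == df.alcohol.max()]             # wine(s) with the highest alcohol content
df[df.alcohol > 13.5].malic_acid.median()      # median malic acid where alcohol > 13.5%
df[(df.alcohol > 12) & (df.magnesium > 100)]   # alcohol > 12% and magnesium > 100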

4