# MAT 388E, Lecture 13

# Review

## Data Science in relation to mathematics, statistics, and machine learning

This course was a Data Science course specifically designed for students working towards degrees in fundamental sciences such as mathematics, physics, chemistry, and biology. During the semester we used many tools from different disciplines such as linear algebra, statistics, analysis, machine learning, computer science, natural language processing, image processing etc. But the main ideas we used in this class came from three main disciplines:

1. Mathematics
2. Statistics
3. Machine Learning

### Mathematics

Mathematics is a vast field covering a large area from logic to differential equations, from linear algebra to topology and analysis. As such, we have used many ideas and tools from mathematics. The most visible of these tools were linear algebra, geometry, and real analysis. Many of the data analyses we made relied on the geometry of point clouds embedded in an affine space, and the computations we made involved very heavy use of (numerical) linear algebra.

### Statistics

In this course, our main object of study was data. There are many different types of data: numerical, vector, categorical, structured, unstructured etc. The correct theoretical framework for working with data is the study of random variables because all data can be considered as finite samples taken from random variables. Since Statistics is the study of randomness, every rigorous study of data must involve heavy doses of statistics.

### Machine Learning

Machine learning is the study of developing algorithms to solve problems intrinsically based only on data. The basic process is that we decide on a machine learning algorithm whose parameters are not known. Then, through a statistical analysis of the data, an optimization algorithm decides on the best parameters that solve the problem using the ML algorithm we decided on at the outset. As such, it may seem that data science is a subfield of machine learning. However, the meta-analysis of whether a given ML algorithm with the optimized parameters solves the given problem for the application at hand requires a holistic use of ML, statistics, mathematics, and even some deep domain knowledge from the field that the data is coming from such as physics, chemistry, biology, engineering etc.

## Data Science and Python

During our course, we relied heavily on the Python language and its vast library ecosystem. Python is a good choice since in the last decade it became the de facto standard in Data Science and Data Engineering applications. However, there are alternatives:

### Alternatives to python

* [R Statistical Language](https://www.r-project.org/)

```{R}
    df <- read.csv("file.csv")
    head(df, n=10)
    summary(df)
```

* [Julia Language](https://julialang.org/)

```{Julia}
    using CSV, DataFrames

    df = DataFrame(CSV.File("file.csv"))
    println(first(df, 10))
    describe(df)
```

* [Apache Spark](https://spark.apache.org/) and [Scala](https://www.scala-lang.org/)

```{Scala}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read.format("csv").option("header", "true").load("file.csv")
    df.show(10)
    df.describe().show()
```

## Demarcation Lines in Data Science

There are several ways to separate the different types of algorithms and approaches one can employ during the analysis of data. In this course we used the following two separations.

### Supervised vs Unsupervised Learning Models

1. Supervised Learning Algorithms: Algorithms of this type solve a very concrete type of problem. We have (input,output) pairs as data and we must come up with a function/algorithm that produces the output from the given input as best as it can.

2. Unsupervised Algorithms: Algorithms of this type are more difficult to specify since we only have inputs and we usually have no idea what the output is supposed to be. In most cases, we have a fit function that tells us how good a proposed solution is.

### Parametric vs Non-parametric Machine Learning Models

One can draw different demarcation lines for classes of Machine Learning algorithms. One such demarcation line is Supervised vs Unsupervised. Another line is Parametric vs Non-parametric models.

1. Parametric models: for such models we have a vector $\theta$ of parameters that we use to optimize an error or a cost function $E(D,\theta)$ on a finite sample of data points $D$. This framework is extremely convenient and most machine learning tasks (supervised or unsupervised) fall under this umbrella. Examples include regression models, $k$-means, SVM, neural network models, etc.

2. Non-parametric models: for such models, there are no parameters, and therefore, error functions or cost functions are of no use. One has to use other extrinsic measures to decide on the best possible model. Examples include hierarchical clustering models, $k$-nearest neighbors and decision tree algorithms.
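The contrast between the two kinds of models can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy dataset, the function names (`error`, `fit_line`, `knn_predict`), the learning rate and the number of steps are all made up for this example. The parametric model searches for $\theta=(a,b)$ minimizing $E(D,\theta)$, while the $k$-nearest neighbors predictor fits nothing and reads its answer directly off the stored data.

```python
# Toy data (an assumption for illustration): pairs lying roughly on y = 2x + 1.
D = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]


def error(D, theta):
    """Mean squared error E(D, theta) of the line y = a*x + b on the sample D."""
    a, b = theta
    return sum((a * x + b - y) ** 2 for x, y in D) / len(D)


def fit_line(D, steps=5000, lr=0.01):
    """Parametric model: minimize E(D, theta) over theta = (a, b) by gradient descent."""
    a, b = 0.0, 0.0
    n = len(D)
    for _ in range(steps):
        grad_a = sum(2 * (a * x + b - y) * x for x, y in D) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in D) / n
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b


def knn_predict(D, x_new, k=2):
    """Non-parametric model: no fitted theta; average the outputs of the k nearest inputs.

    Here k is chosen by hand, it is not found by minimizing an error function.
    """
    nearest = sorted(D, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


theta = fit_line(D)
print("fitted theta:", theta, "error:", error(D, theta))
print("k-NN prediction at x = 2.5:", knn_predict(D, 2.5))
```

Note that the parametric model can discard $D$ once $\theta$ is found, whereas the $k$-NN predictor has to keep the whole sample around.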

## Supervised Learning 

We defined *machine learning* as the study of developing algorithms intrinsically based only on the data. In the best case scenario, we have (input,output) pairs, and the *learning problem* is to come up with a function/algorithm that recreates the output by looking at the input. In other words, the algorithm has to *learn* what the output is going to be from the input alone. This procedure is called *supervised learning*.

We have seen many examples of such algorithms during the course:

1. K-Nearest Neighbors
2. Linear Regression
3. Logistic Regression
4. Decision Tree Regression
5. Decision Trees and Random Forests
6. Support Vector Machines
7. Boosting Algorithms
8. Neural Network Models

### Discrete vs Continuous Response Variables

In all supervised learning algorithms, since we know what the output is going to be, we have several good definitions of what the **error** function should be. We find the best parameters for the corresponding supervised learning algorithm by minimizing this error function. The choice of error function depends on whether the output is a continuous or a discrete random variable.

1. Discrete response variable: K-NN, logistic regression, decision tree, SVM, NN models.
2. Continuous response variable: regression, boosting algorithms, NN models.
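To make the dependence on the response type concrete, here is a minimal sketch of three commonly used error functions. The function names and the small example inputs are illustrative assumptions, not definitions taken from the course material: mean squared error for a continuous response, and the misclassification rate and the binary log loss for a discrete response.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Continuous response: average squared difference between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def misclassification_rate(labels_true, labels_pred):
    """Discrete response: fraction of points that received the wrong label."""
    return sum(t != p for t, p in zip(labels_true, labels_pred)) / len(labels_true)


def log_loss(labels_true, prob_pred):
    """Discrete (binary) response, where prob_pred holds predicted P(label = 1).

    Probabilities must lie strictly between 0 and 1 for the logarithms to exist.
    """
    total = sum(
        math.log(p) if t == 1 else math.log(1 - p)
        for t, p in zip(labels_true, prob_pred)
    )
    return -total / len(labels_true)


print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
print(misclassification_rate([0, 1, 1, 0], [0, 1, 0, 0]))
print(log_loss([0, 1, 1, 0], [0.1, 0.9, 0.4, 0.2]))
```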

### Cross-validation

We can calculate the *average error* $|E(Err(D,\theta))|$ made by the algorithm on the dataset $D$ for a given set of parameters $\theta$. The problem is that $D$ is a finite random sample from a random variable $\mathcal{D}$, and $\theta$ is another random variable whose value is decided by an optimization algorithm. As such, a single calculation cannot give us statistical confidence that $D$ is a generic sample and that $\theta$ is indeed the parameter that solves the problem with minimum error. We need a sample of calculations. For this purpose we use various cross-validation schemes. Here are the most common ones:

1. $k$-fold cross-validation: For a natural number $k$ we split the data into $k$ disjoint subsets $D_1,\ldots,D_k$. Then for each $i=1,\ldots,k$, we build a supervised machine learning model on $D\setminus D_i$ and evaluate the error $\epsilon_i$ of the model on $D_i$. We then report on the mean calculated error and the 95% confidence interval using the population $\epsilon_1,\ldots,\epsilon_k$.

2. Leave-one-out cross-validation: For each point $x\in D$, we build a model on $D\setminus\{x\}$ and then evaluate the error $\epsilon_x$ on the single point $x$. We then report on the mean calculated error and the 95% confidence interval using the population $\{\epsilon_x\mid x\in D\}$.

3. Simple train-test-split: we split the dataset into two disjoint subsets $D_{\text{train}}$ and $D_{\text{test}}$. We build the model on $D_{\text{train}}$ and then calculate the error $\epsilon$ on $D_{\text{test}}$. We repeat this a few times to obtain a population of errors $\epsilon_1,\ldots,\epsilon_m$, and we then report on the mean calculated error and the 95% confidence interval using this population.
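The $k$-fold scheme above can be sketched with the standard library alone. Everything concrete here is an assumption made for illustration: the model is a least-squares line, the data is synthetic noise around $y = 2x + 1$, the helper names are hypothetical, and the confidence interval uses a normal approximation (for small $k$, a $t$-distribution quantile would give a wider and more honest interval).

```python
import random
import statistics


def fit_line(train):
    """Least-squares line y = a*x + b fitted on the training pairs."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    a = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def mse(test, theta):
    """Mean squared error of the fitted line on held-out pairs."""
    a, b = theta
    return statistics.fmean([(a * x + b - y) ** 2 for x, y in test])


def k_fold_errors(D, k, seed=0):
    """Split D into k disjoint folds; train on D minus D_i and evaluate on D_i."""
    data = D[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        errors.append(mse(folds[i], fit_line(train)))
    return errors


def mean_and_ci95(errors):
    """Mean error with a normal-approximation 95% confidence interval."""
    m = statistics.fmean(errors)
    z = statistics.NormalDist().inv_cdf(0.975)
    half = z * statistics.stdev(errors) / len(errors) ** 0.5
    return m, (m - half, m + half)


# Synthetic sample: y = 2x + 1 plus Gaussian noise with standard deviation 0.5.
rng = random.Random(42)
D = [(i / 10, 2 * (i / 10) + 1 + rng.gauss(0, 0.5)) for i in range(50)]

errors = k_fold_errors(D, k=5)
mean_err, ci = mean_and_ci95(errors)
print("fold errors:", errors)
print("mean error:", mean_err, "95% CI:", ci)
```

Calling `k_fold_errors(D, k=len(D))` puts a single point in each fold, which is exactly leave-one-out cross-validation.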

## Unsupervised Learning

In supervised learning, a machine learns how to recreate a known output from a known input, since we have (input, output) pairs. In unsupervised learning, we only have inputs and no supposed output. This means it is impossible to solve whatever we are supposed to solve using an ML algorithm intrinsically, i.e. by looking at the data alone. We need some extrinsic knowledge on what a good solution is supposed to be.

In the case of supervised learning, since we know what the supposed output is for a given input, we have a good error function: the difference between the model's output and the real output. In the absence of a real output, the least we must have is a cost function: a function $C(D,\theta)$ that tells us the cost of the model $M(\theta)$ with the chosen parameter $\theta$ on the sample of data $D$ we have at hand. This is so that we can find the optimal model $M(\theta_0)$ with the optimal parameter $\theta_0$ by minimizing the cost function $C(D,\theta)$.
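As an illustration of such a cost function, the sketch below uses the usual $k$-means cost: the sum of squared distances from each point to its nearest centroid, with $\theta$ being the list of centroids. The toy points and the two candidate parameter choices are assumptions made for this example, and the sketch only evaluates $C(D,\theta)$; it does not perform the minimization.

```python
def kmeans_cost(D, centroids):
    """C(D, theta): sum of squared distances from each point to its nearest centroid."""
    return sum(
        min(
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        )
        for point in D
    )


# Two visibly separated groups of points in the plane; no labels are given.
D = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

good_theta = [(1 / 3, 1 / 3), (16 / 3, 16 / 3)]  # centroids at the two group means
bad_theta = [(0.0, 5.0), (5.0, 0.0)]             # centroids far from both groups

print("cost with good centroids:", kmeans_cost(D, good_theta))
print("cost with bad centroids: ", kmeans_cost(D, bad_theta))
```

The cost is much smaller for the centroids that sit at the group means, which is the sense in which the cost function ranks candidate models without any reference output.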

We have seen many examples of unsupervised learning algorithms in this course:

1. K-means Clustering
2. Hierarchical Clustering
3. Density based clustering
4. Autoencoders

### Clustering Algorithms and Measuring Quality of a Clustering Algorithm

With the exception of autoencoders, the unsupervised learning algorithms we have seen were examples of clustering algorithms. Supervised machine learning has a similar class of algorithms called classification algorithms. Both clustering and classification algorithms aim to split the data into a finite disjoint collection of subsets (a partition). Another way of saying this is that given a dataset $D$, we want a function $\ell\colon D\to \{1,\ldots,k\}$ whose values are discrete labels. In the case of classification, we know what the labels are, and we want to create a nice function by describing it more algorithmically/mathematically. In the case of clustering, we don't know what the labels mean. It is the job of the data scientist to sit down and sift through the dataset and the partition and come up with an intrinsic definition of what each label means.

However, there are some basic measures one can use to judge whether the results of a clustering algorithm are actually worth considering. The most commonly used ones are

1. The Rand score
2. The silhouette score
3. Mutual information
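The first two measures are simple enough to sketch directly from their standard definitions. The implementations below are unoptimized illustrations with hypothetical function names and toy inputs, not the code used in the course. The Rand score compares two labelings of the same points by counting the pairs on which they agree (both put the pair together, or both keep it apart). The silhouette score needs only the points and one labeling: for each point it compares the mean distance $a$ to its own cluster with the smallest mean distance $b$ to any other cluster.

```python
import math
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points; requires at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # The point itself contributes distance 0, hence the division by len(own) - 1.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))      # same partition, different names
print(silhouette_score(points, [0, 0, 1, 1]))      # labels follow the two groups
print(silhouette_score(points, [0, 1, 0, 1]))      # labels cut across the groups
```

The Rand score does not care how the clusters are named, only which points end up together. Mutual information is left out of this sketch.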


## A Cheat Sheet for Selecting Appropriate ML Algorithm for the Question at Hand

![Cheat_Sheet](./images/cheat-sheet.png)