<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is created by Zhuo Chen under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).<br />

For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />
___


# Machine Learning

**Description:** This notebook describes:
* What is Machine Learning
* The machine learning pipeline
* Supervised and unsupervised learning
* Training/validation/test data 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion Time:** 75 minutes

**Knowledge Required:** None

**Knowledge Recommended:** None

**Data Format:** None

**Libraries Used:** None

**Research Pipeline:** None

# What is Machine Learning

Machine learning, a branch of artificial intelligence, is becoming ubiquitous. It is used in image recognition, medical diagnosis, stock prediction, and many other areas. What matters to us is that it is also used in text analysis. 

What is machine learning? In one sentence: machine learning learns from data and produces a model to do a certain task. 

<a id='section2.1'></a>
<h2 id='section2.1'>Machine Learning Pipeline</h2>

The one-sentence definition of ML spells out the ML pipeline, which can be represented by the following diagram. We feed some data to an ML algorithm to produce a model that fits the input data. We derive some kind of intelligence from the model we build and use it to accomplish a certain task.

<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/NER_ML_pipeline.png' width=600></center>

Let's use a simple example to understand ML and the ML pipeline. Let's say you are interested in the relationship between the size and the price of a house in your neighborhood. Specifically, you would like to use the size of a house to predict its price. You go to Redfin/Zillow and look up the houses recently sold in your neighborhood. You note down their sizes and sale prices. 

|House addr|# of sqft|Price|
|---|---|---|
|1 Walnut St.|950|550k|
|10 Walnut St.|1040|500k|
|21 Walnut St.|1180|390k|
|36 Walnut St.|1240|510k|
|8 Hazelnut Rd.|1400|410k|
|18 Hazelnut Rd.|1450|510k|
|5 Chestnut Dr.|1480|505k|
|40 Chestnut Dr.|1520|450k|
|...|...|...|

You create a scatter plot to examine the data. 

<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/ML_ScatterPlot.png' width=250></center>

Now, you would like to derive a relationship between the house size and the house price. How do you do that? As an example, you can use linear regression to model the relationship. Essentially, this model fits a straight line to the data points. 

<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/ML_LinearReg.png' width=250></center>

The function for this line is y = ax + b (where y is the price and x is the # of sqft). Of course, you would not fit just any line to your data points. You want the line for which the difference between the actual house prices and the predicted house prices, i.e. the **error** of the model, is the smallest. The learning that your chosen ML method has to do, then, is to find the values of a and b in y = ax + b that make the **error** of the model on the given dataset as small as possible. Once the best-fitting line is identified, you can use it to make predictions about new data, i.e. houses that are not in the input data.
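
To make this concrete, here is a minimal Python sketch of that learning step. It uses only the eight houses visible in the table above; the table is truncated, so the numbers are illustrative and the fitted line need not match the one in the figure. The sketch assumes that the error is measured as the mean squared difference between actual and predicted prices, a common choice that this notebook does not prescribe. The names `fit_line`, `mean_squared_error` and `predict_price` are our own.

```python
# Sizes (sqft) and sale prices (in thousands of dollars) of the eight
# houses listed in the table above.
sizes = [950, 1040, 1180, 1240, 1400, 1450, 1480, 1520]
prices = [550, 500, 390, 510, 410, 510, 505, 450]


def fit_line(xs, ys):
    """Return the a and b in y = ax + b that minimize the sum of squared errors."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b


def mean_squared_error(a, b, xs, ys):
    """Average squared difference between predicted and actual prices (assumed error measure)."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# "Learning": find the a and b with the smallest error on the input data.
a, b = fit_line(sizes, prices)


def predict_price(sqft):
    """Use the learned line to predict a price (in thousands of dollars)."""
    return a * sqft + b


print(f"a = {a:.4f}, b = {b:.1f}")
print(f"Error on the input data: {mean_squared_error(a, b, sizes, prices):.1f}")
print(f"Predicted price of a 1300 sqft house: {predict_price(1300):.0f}k")
```

Try changing `a` or `b` by hand and recomputing the error: no other line gives a smaller mean squared error on these eight houses.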

<h2 style="color:red; display:inline">Questions </h2>

Going back to the ML pipeline with the above example in mind, can you answer the following questions?
<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/NER_ML_pipeline.png' width=500></center>

1. What is the input data in this example?

2. What is the ML method used in this example?

3. What is the intelligence we derive from the model we build? What task does it do?

## ML Methods in Text Analysis
One question you may ask is: how do we know which ML method to choose? Don't worry! Researchers who work on ML algorithms have developed ML methods for different kinds of tasks. The following table gives an overview$^{1}$.

|Task|ML method|
|---|---|
|[Sentiment analysis](./sentiment-analysis-with-vader.ipynb)|Naïve-Bayes classifier|
|[Topic modeling](./topic-modeling.ipynb)|Latent Dirichlet Allocation (LDA)|
|[Named Entity Recognition](./NER-3.ipynb)|Neural network|
|Document retrieval|k-nearest neighbor (k-NN)|
|Document clustering|k-means|
|...|...|

<font size='1'> 1. For each text analysis task, only one ML method is given in the table, but as a matter of fact, more than one ML method has been developed to tackle a certain task. Constellate classes will strive to keep up to date with the most recent ML method proposed for these text analysis tasks.</font>

# Supervised and Unsupervised Learning

Depending on the text analysis task at hand, machine learning takes one of two forms: supervised learning and unsupervised learning$^{2}$.
<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/ML_SuperUnsuper.png' width=400></center>
<font size='1'>2. More types of machine learning have emerged, including semi-supervised learning and reinforcement learning. In this notebook, we focus on the traditional division between supervised learning and unsupervised learning.</font>

## Supervised Learning

Supervised learning trains a model on a dataset with **known input and output data**. The derived model is then used to make predictions on new datasets.

In supervised learning, a dataset is divided into three groups: a training set, a validation set and a test set. There is no hard and fast rule for the division, but a common practice is:

* 60% as training set
* 20% as validation set
* 20% as test set
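
As a rough illustration, the 60/20/20 division can be done by shuffling the examples and slicing the shuffled list. This is only a sketch: the function name `split_dataset`, its parameters and the fixed random seed are our own choices, and the "dataset" here is just the numbers 0 to 99 standing in for 100 examples.

```python
import random


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the examples and cut them into training, validation and test sets."""
    shuffled = list(examples)
    # A fixed seed makes the shuffle (and so the split) reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    n_val = round(val_frac * len(shuffled))
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains
    return train_set, val_set, test_set


# Stand-in dataset: 100 example IDs.
train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before slicing matters: if the examples are sorted (for instance by street, as in the house table), slicing without shuffling would put different kinds of examples in each set.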

Let's get an understanding of each of these sets with the house-buying example we have seen before. 

### Training set

In the house-buying scenario, the dataset at hand contains information on the sizes and sale prices of the recently sold houses in your neighborhood. The ML model and input/output data are: 

* ML model: simple linear regression
* input data: the size of the houses
* output data: the sale price of the houses

The input and output data used to train the linear regression model are both known. Given a house in the dataset, you know how big it is in square feet and how much it was sold for. 

<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/ML_InputOutput.png' width=450></center>

You may notice that in [section 2.1](#section2.1) on the ML pipeline, we did not break the dataset down into three groups. We left that out to keep the introduction to the ML pipeline simple. Now you know that the training data are a subset of the dataset. 

The subset of the dataset used to train the ML model is the **training set**.

### Validation Set

In the house-buying example, you use the simple linear regression model to fit a straight line to your data. Looking at the graph, you may think that this line is not a very good fit, because so many data points fall away from it.


<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/ML_LinearReg.png' width=250></center>

What if you use a much more complex regression model to fit a polynomial curve that passes through every data point in the training data? A model that is perfect given the training data?


<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/ML_HigherOrderReg.png' width=250></center>

This curve performs perfectly on the training data because it predicts the price of every house in the training data correctly, i.e. it has an error of 0. However, because it is tuned so closely to the training data, it will very likely perform badly on new data that it has never seen!  
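
You can check this with the eight houses visible in the table from the ML pipeline section. The sketch below compares a straight line with the polynomial that passes through every point. To imitate "new data" with so few houses, each house takes a turn as the new house: the model is fitted to the other seven and asked to predict the one left out. This hold-one-out check, the mean squared error measure and all function names are our own illustrative choices.

```python
# Sizes (sqft) and prices (in thousands of dollars) from the house table.
sizes = [950, 1040, 1180, 1240, 1400, 1450, 1480, 1520]
prices = [550, 500, 390, 510, 410, 510, 505, 450]


def line_predict(xs, ys, x):
    """Fit y = ax + b to (xs, ys) by least squares, then predict y at x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    a = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
         / sum((xi - mean_x) ** 2 for xi in xs))
    return a * x + (mean_y - a * mean_x)


def wiggly_predict(xs, ys, x):
    """Evaluate at x the polynomial that passes through every point (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def training_error(predict):
    """Mean squared error on the same houses the model was fitted to."""
    return sum((predict(sizes, prices, x) - y) ** 2
               for x, y in zip(sizes, prices)) / len(sizes)


def held_out_error(predict):
    """Mean squared error when each house is predicted from the other seven."""
    total = 0.0
    for i in range(len(sizes)):
        other_sizes = sizes[:i] + sizes[i + 1:]
        other_prices = prices[:i] + prices[i + 1:]
        total += (predict(other_sizes, other_prices, sizes[i]) - prices[i]) ** 2
    return total / len(sizes)


for name, predict in [("straight line", line_predict), ("wiggly polynomial", wiggly_predict)]:
    print(f"{name}: training error = {training_error(predict):.1f}, "
          f"held-out error = {held_out_error(predict):.1f}")
```

The polynomial has a training error of 0, yet its error on the held-out houses is far larger than that of the straight line.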

The question is: how do we know how complex a model should be for a given task? This is where the validation set comes in. A model is trained on the training data and then applied to the validation data to check its performance. Since the model never sees the validation data during training, they act as a stand-in for new data. If we have candidate models of different complexity for a task, we train each one and calculate its error on the validation data. The model with the smallest error on the validation set wins. 

The parameters that determine the complexity of a model are called **hyperparameters**. The validation set, therefore, is used to tune the **hyperparameters**. 

### Test Set

The test set is held out from the dataset from the very beginning. It is not touched when tuning the parameters of a chosen model, a task reserved for the training data. It is not touched when tuning the hyperparameters of a model, a task reserved for the validation data. The test data are used only once, at the end, to measure the error of the final model: the model that was trained on the training data and whose complexity gave the smallest error on the validation data.

Ideally, the error of the trained model on the test data will approximate the true error the model would have on every data point it might ever encounter in real life. 
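
The roles of the three sets can be sketched end to end in a few lines of Python. Two things in this sketch are assumptions made for illustration. First, the data are randomly generated, not real house sales. Second, instead of polynomial regression it uses k-nearest-neighbor regression (predict the average price of the k training houses closest in size), chosen only because it needs no extra libraries. Its hyperparameter is k: a small k follows the training data very closely, a large k gives a smoother, simpler model.

```python
import random

rng = random.Random(42)

# Made-up data for illustration only (NOT real sales):
# price in thousands of dollars rises with size, plus random noise.
houses = []
for _ in range(200):
    sqft = rng.uniform(800, 2000)
    price = 200 + 0.2 * sqft + rng.gauss(0, 40)
    houses.append((sqft, price))

# 60/20/20 split into training, validation and test sets.
rng.shuffle(houses)
train, val, test = houses[:120], houses[120:160], houses[160:]


def knn_predict(sqft, k):
    """Average the prices of the k training houses closest in size."""
    nearest = sorted(train, key=lambda house: abs(house[0] - sqft))[:k]
    return sum(price for _, price in nearest) / k


def mean_squared_error(dataset, k):
    """Average squared prediction error on the given set."""
    return sum((knn_predict(sqft, k) - price) ** 2
               for sqft, price in dataset) / len(dataset)


# Tune the hyperparameter k using the validation set only.
candidate_ks = [1, 3, 5, 10, 20, 40]
val_errors = {k: mean_squared_error(val, k) for k in candidate_ks}
best_k = min(val_errors, key=val_errors.get)

# Touch the test set once, at the very end, to estimate the error on new data.
test_error = mean_squared_error(test, best_k)

rounded_errors = {k: round(error) for k, error in val_errors.items()}
print(f"Validation errors by k: {rounded_errors}")
print(f"Chosen k = {best_k}, test error = {test_error:.0f}")
```

Note that the test set plays no part in choosing k. That is what lets the test error stand in for the error on houses the model has never seen.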

<h2 style="color:red; display:inline">Questions </h2>

<img style="float: center;" src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/ML_twosplits.png' width=750> 

1. Think about why we don't just divide the dataset into a training set and a test set. If we train a model on the training data and select the complexity of the model so that it has the smallest error on the test data, how well do you think the resulting model will perform on new data that it has never seen before? Why?

# Unsupervised Learning

Unsupervised learning is used to uncover the hidden patterns in the given dataset without the data being labeled by humans. This is different from supervised learning where a model is trained on known input and output. 

The most common unsupervised learning task is clustering. For example, if we have a big dataset of documents, we can use unsupervised learning to discover the groups of articles within the dataset that are related to each other. 
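
Here is a minimal sketch of document clustering with k-means, the method listed for this task in the table of ML methods earlier. The six "documents" are made up and deliberately tiny, each document is turned into a vector of word counts, and the starting centroids are picked in a simplified, deterministic way. All function names are our own, and the code does not handle edge cases such as a cluster becoming empty.

```python
# Made-up toy documents: three about pets, three about finance.
documents = [
    "cat dog pet",
    "dog pet vet",
    "cat pet vet",
    "stock market trade",
    "market trade bank",
    "stock bank trade",
]

vocabulary = sorted({word for doc in documents for word in doc.split()})


def to_vector(text):
    """Bag-of-words vector: how often each vocabulary word occurs in the text."""
    words = text.split()
    return [words.count(term) for term in vocabulary]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def two_means(vectors, max_iterations=10):
    """Tiny k-means for k = 2; returns one cluster label (0 or 1) per vector."""
    # Simplified start: the first vector and the vector farthest from it.
    farthest = max(vectors, key=lambda v: squared_distance(v, vectors[0]))
    centroids = [vectors[0], farthest]
    labels = None
    for _ in range(max_iterations):
        # Step 1: assign every vector to its nearest centroid.
        new_labels = [min((0, 1), key=lambda c: squared_distance(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # nothing changed, so we are done
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of its assigned vectors.
        centroids = [mean_vector([v for v, label in zip(vectors, labels) if label == c])
                     for c in (0, 1)]
    return labels


labels = two_means([to_vector(doc) for doc in documents])
for doc, label in zip(documents, labels):
    print(f"cluster {label}: {doc}")
```

No labels such as "pets" or "finance" are given to the algorithm. It groups the first three documents together and the last three together purely from the words they share.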

Unsupervised learning is often applied before supervised learning in order to identify the underlying structure of the data, such as its natural classes and groupings. 

## Some real-life applications

Unsupervised learning is used in a lot of real-world tasks. The following are some examples from IBM's website:

**News Sections**: Google News uses unsupervised learning to categorize articles on the same story from various online news outlets. For example, the results of a presidential election could be categorized under their label for "US" news.
 
**Medical imaging**: Unsupervised machine learning provides essential features to medical imaging devices, such as image detection, classification and segmentation, used in radiology and pathology to diagnose patients quickly and accurately.

**Customer personas**: Defining customer personas makes it easier to understand common traits and business clients' purchasing habits. Unsupervised learning allows businesses to build better buyer persona profiles, enabling organizations to align their product messaging more appropriately.

Source: https://www.ibm.com/topics/unsupervised-learning
