# Session 1: Basics

In this Notebook, we will consider some basic concepts of Machine Learning, namely:
* Machine Learning vs. Artificial Intelligence
* Supervised vs. Unsupervised Learning
* Training vs. Inference
* Accuracy, Precision and Recall
* Some Popular Machine Learning Algorithms

You will get the chance to test a little bit of code in this Notebook. In future sessions, we will go into more depth about how to implement particular Machine Learning algorithms.


## Machine Learning vs. Artificial Intelligence

Machine Learning (ML) and Artificial Intelligence (AI) are closely related, but they aren't the same:
* **Machine Learning**: An approach to programming, where the computer learns how to perform a task from data rather than being told what to do.
* **Artificial Intelligence**: An aspiration of computer science, to create computers that effectively model, simulate or replicate intelligence.

Today, ML and AI are often used synonymously, because most AI researchers use ML methods to try to achieve AI. But this has not always been the case. From the 1940s to the 1980s, the dominant paradigm in AI research was 'symbolic AI'. During this period, AI researchers tried to replicate intelligence by programming computers to manipulate symbols according to pre-programmed logical rules. Only in the 1990s did machine learning become sufficiently successful to take over.

The cell below gives a very simple example of how early symbolic AI worked. In a symbolic AI program, you define different *objects* (or *terms* or *arguments*) and *predicates* (or *relations*). The computer can use this set of objects and predicates to describe and reason about the world:

In [None]:
class Animal(object):
    mortal = True

class Human(Animal):
    limbs = 4
    bipedal = True

def is_it(being, prop):
    if hasattr(being, prop):
        return getattr(being, prop)
    else:
        print(f'The property {prop} is not defined for this being.')
        return None

socrates = Human()
is_it(socrates, "mortal")

There is still some symbolic AI research going on today. The most famous example is [Cyc](https://cyc.com) (see also the [Wikipedia page](https://en.wikipedia.org/wiki/Cyc)).

But even Cyc relies on Machine Learning as well as Symbolic AI. This is because of a recent development in computer science, the rise of:

* **Big data**: Big data simply means *lots* of data, and the technology to process it.

As we will see in future weeks, computers learn extremely slowly and incrementally, so ML can only be used effectively when big data is available. [The internet is the main source of big data](https://www.visualcapitalist.com/wp-content/uploads/2019/04/data-generated-each-day-wide.html).

Some famous products that rely on big data to enable ML:

* [Google Translate](https://translate.google.com/)
* OpenAI's [GPT-3 Language Model](https://github.com/openai/gpt-3)
* The [Deep Dream Generator](https://deepdreamgenerator.com/)

## Supervised vs. Unsupervised Learning

This is the first big distinction in ML. As the machine learns, it creates a **model** which it uses to analyse input data and give an output. There are two main kinds of model:

* **Supervised Learning**: The computer is given some data points, and some correct answers. It creates a model that can predict the given answers based on the provided data.
* **Unsupervised Learning**: The computer is given a mass of unstructured data, and tries to find a way to cluster or classify the data without guidance.
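
To make the distinction concrete, here is a minimal sketch in plain Python (no libraries needed). The measurements, the labels `"small"` and `"large"`, and the helper functions are all invented for illustration; they are not part of the iris example used later. The supervised function is shown the correct answers, while the unsupervised function only sees the numbers and has to find the groups itself.

```python
# Toy data: petal lengths (cm) for six hypothetical flowers.
lengths = [1.3, 1.5, 1.4, 4.7, 4.5, 4.9]
labels = ["small", "small", "small", "large", "large", "large"]  # the "correct answers"

def fit_supervised(xs, ys):
    """Learn one average (centroid) per label from labelled examples."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def predict_supervised(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def fit_unsupervised(xs, n_rounds=10):
    """Split unlabelled numbers into two clusters (a 1-D, two-cluster k-means)."""
    centres = [min(xs), max(xs)]  # simple starting guess
    for _ in range(n_rounds):
        clusters = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1
            clusters[nearest].append(x)
        # Move each centre to the average of its cluster (keep it if the cluster is empty)
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres

# Supervised: the model was given the labels, so it can answer with a label
centroids = fit_supervised(lengths, labels)
supervised_guess = predict_supervised(centroids, 4.2)

# Unsupervised: the model never saw a label, so it can only answer with a cluster number
centres = fit_unsupervised(lengths)
unsupervised_guess = 0 if abs(4.2 - centres[0]) <= abs(4.2 - centres[1]) else 1

print(supervised_guess)
print(unsupervised_guess)
```

Notice that the unsupervised model can tell you *which group* a new flower belongs to, but it is up to you to decide what that group means.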



## Training vs. Inference

There are many stages to an ML project before the actual machine *learning* can begin: collecting the data, cleaning it, refining the question you want the computer to answer, choosing an algorithm, and setting hyperparameters (i.e. choosing settings for the algorithm). Once you have sorted out the general shape of the project, however, a final distinction remains:

* **Training**: The learning stage. The computer examines the data and updates its parameters accordingly.
* **Inference**: The application stage. The computer examines some fresh data and gives its output or result.

In practice, these two steps can be hard to distinguish. Most major AI systems (e.g. Facebook or Google Ads) *infer* things and also *train* themselves constantly. Facebook, for example, constantly predicts which ads and posts you want to see and serves them to you (**inference**), while also observing your responses and updating its models (**training**).
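
The interleaving of the two stages can be sketched with a deliberately tiny model. Everything here is hypothetical: the class `RunningMeanModel`, its `train` and `infer` methods, and the click data are made up for illustration, and real ad systems are vastly more complex. The point is only that the model makes a prediction (inference) and then updates itself from what actually happened (training), over and over.

```python
class RunningMeanModel:
    """A deliberately tiny model: it predicts the average of everything it has seen."""

    def __init__(self):
        # These two numbers are the model's parameters
        self.total = 0.0
        self.count = 0

    def train(self, value):
        """Training: update the parameters using a new observation."""
        self.total += value
        self.count += 1

    def infer(self):
        """Inference: use the current parameters to produce an output."""
        return self.total / self.count if self.count else 0.0

model = RunningMeanModel()
observed_clicks = [1, 0, 0, 1, 1]  # hypothetical data: 1 = the user clicked the ad

predictions = []
for click in observed_clicks:
    predictions.append(model.infer())  # inference: predict before seeing the outcome
    model.train(click)                 # training: learn from what actually happened

print(predictions)
print(model.infer())
```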

To give you a practical example of these two distinctions, we are now going to apply two different ML algorithms to the famous 'iris' dataset. This dataset comes preloaded with the `scikit-learn` Python package, which is automatically installed on Google CoCalc (if that is where you are accessing this Notebook).

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
print(iris['DESCR'])

In [None]:

print(f'Four features:\n{iris["feature_names"]}')
print(f'First ten rows:\n{iris["data"][:10]}')
print(f'First ten species: {[iris["target_names"][target] for target in iris["target"][:10]]}')

The first thing we need to do is chop up the data. We have 150 flowers in the dataset, divided into three main species. We will divide this data into a 'training set' that we will show the algorithm when it is learning, and a 'test set' that we will hide from the algorithm, and then use to see if it has really learnt something from the training data. 

In [None]:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(iris["data"], iris["target"], train_size=0.8, stratify=iris["target"])
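
If you are curious what a split like this involves, here is a rough plain-Python sketch of the idea. The function `simple_train_test_split` and the toy flowers are invented for illustration, and this is not how scikit-learn implements `train_test_split`. In particular, this sketch does not stratify, so unlike the call above it does not guarantee that each species is equally represented in both sets.

```python
import random

def simple_train_test_split(xs, ys, train_size=0.8, seed=0):
    """Shuffle the rows, then cut them into a training part and a test part."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)    # shuffle so the split is not ordered by species
    cut = round(len(indices) * train_size)  # number of rows that go into the training set
    train_idx, test_idx = indices[:cut], indices[cut:]
    train_x = [xs[i] for i in train_idx]
    test_x = [xs[i] for i in test_idx]
    train_y = [ys[i] for i in train_idx]    # labels stay paired with their rows
    test_y = [ys[i] for i in test_idx]
    return train_x, test_x, train_y, test_y

# Ten made-up flowers: one measurement each, with a species label (0 or 1)
toy_x = [[1.4], [1.3], [1.5], [1.7], [1.4], [4.7], [4.5], [4.9], [4.0], [4.6]]
toy_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

toy_train_x, toy_test_x, toy_train_y, toy_test_y = simple_train_test_split(toy_x, toy_y)
print(len(toy_train_x), len(toy_test_x))
```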

First, we will use an **unsupervised** method, called **k-means clustering**. We will tell the computer that the flowers fall into three groups, and see if it can work out what those three groups are.

In [None]:
from sklearn.cluster import KMeans
# Initialise model
kmeans_model = KMeans(n_clusters=3, random_state=1825)
# Train the model on the training data (NB: for this unsupervised method, you don't need the 'y' data)
kmeans_model.fit(train_x)
# Now use the model to cluster the training data:
train_x_predictions = kmeans_model.predict(train_x)

In [None]:
# How closely do these clusters match the actual species?
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

# We need to reduce the dimensionality in order to graph it
train_x_reduced = PCA(n_components=2).fit_transform(train_x)

# Now produce two graphs, one showing the clusters, and the other the actual species
plt.figure(1, figsize=[15,6.5])
plt.subplot(121)
plt.scatter(train_x_reduced[:,0], train_x_reduced[:,1], c=train_x_predictions)
plt.title("Kmeans clusters (training data)")
plt.subplot(122)
plt.scatter(train_x_reduced[:,0], train_x_reduced[:,1], c=train_y)
plt.title("Actual species (training data)")
plt.show()

In [None]:
# Your turn! To reproduce the above graphs, but for the *test* data, simply replace the Nones below with the appropriate code.

# First use the model's .predict() method to cluster the test data:
test_x_predictions = None

# Now use Principal Component Analysis to reduce the number of axes to 2, so we can plot the test data:
test_x_reduced = None

# Now plot the test data, so we can take a look:
plt.figure(1, figsize=[15,6.5])
plt.subplot(121)
plt.scatter(x=None, y=None, c=None)
plt.title("Kmeans clusters (test data)")
plt.subplot(122)
plt.scatter(x=None, y=None, c=None)
plt.title("Actual species (test data)")
plt.show()

Another option is to use a **supervised** method. In the next example, we apply the most popular supervised method of all, the Artificial Neural Network (also known as a Deep Neural Network, or simply as a Neural Network). In this supervised method, we let the computer know the species of each flower in the training set. It then works backward from this to learn to distinguish different species from one another.

In [None]:
# Google's TensorFlow is one of the most widely used libraries for neural networks
# It can be used with 'keras', a standard interface for Machine Learning,
# meaning that the code below looks very similar to the code for kmeans clustering above.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Initialise the model
neural_model = keras.Sequential([
    layers.Dense(10),
    layers.Dense(5),
    layers.Dense(3, activation="softmax") # <-- this must have 3 neurons, as there are three species of flower
])

# Train the model
neural_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]) # <-- we need this extra 'compile' step with TensorFlow; `metrics` is passed as a list
neural_model.fit(train_x, train_y, epochs=100) # <-- this time, we let it see which species each flower is (`train_y`)

In [None]:
# Let's see how it did:

import numpy as np

train_x_neural_probabilities = neural_model.predict(train_x)
train_x_neural_predictions = np.argmax(train_x_neural_probabilities, axis=-1)

plt.figure(1, figsize=[15,6.5])
plt.subplot(121)
plt.scatter(train_x_reduced[:,0], train_x_reduced[:,1], c=train_x_neural_predictions)
plt.title("Predicted species (train data)")
plt.subplot(122)
plt.scatter(train_x_reduced[:,0], train_x_reduced[:,1], c=train_y)
plt.title("Actual species (train data)")
plt.show()

In [None]:
# Your turn! How did it do on the test data?

# Remember, we have already defined the variables `test_x_reduced` and `test_y`, so you can use these again.

test_x_neural_probabilities = None
test_x_neural_predictions = None

plt.figure(1, figsize=[15,6.5])
plt.subplot(121)
plt.scatter(x=None, y=None, c=None)
plt.title("Predicted species (test data)")
plt.subplot(122)
plt.scatter(x=None, y=None, c=None)
plt.title("Actual species (test data)")
plt.show()

## Accuracy, Precision and Recall

We have judged the above two models by their accuracy: How many plants did they categorise correctly? This is only one way of measuring the 'fitness' of a model, however. In this section I introduce two other metrics that are often used for *binary classification problems*. A binary classification problem is one in which there are only two possible answers, e.g. Is this person positive for coronavirus? Can this person speak Twi? Did Shakespeare write this play? When you have such a binary, or yes-no question, then it is common to measure two other things alongside the accuracy:

* **Precision**: When the model gives a positive result, how likely is it that it is a true positive? (e.g. This test says a person has COVID. How likely is it that they actually do have COVID?)
* **Recall**: If a person really is positive, how likely is it that the model will detect this? (e.g. There are 20 COVID-positive people in this room of 100 people. How many of those 20 will the test detect if we test everyone?)
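
To see how the three numbers can diverge, here is a small sketch based on the room of 100 people mentioned above. The test outcomes (15 detected, 5 missed, 10 false alarms) are invented purely for illustration.

```python
# Hypothetical test outcomes for the room of 100 people (20 truly positive)
true_positives = 15   # positive people the test correctly flags
false_negatives = 5   # positive people the test misses
false_positives = 10  # negative people the test wrongly flags
true_negatives = 70   # negative people the test correctly clears

total = true_positives + false_negatives + false_positives + true_negatives

# Accuracy: what share of all answers were right?
accuracy = (true_positives + true_negatives) / total
# Precision: of the people flagged as positive, what share really are positive?
precision = true_positives / (true_positives + false_positives)
# Recall: of the people who really are positive, what share did the test flag?
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```

With these made-up numbers the test is right 85% of the time, yet only 60% of its positive results are genuine, and it catches 75% of the people who are actually positive.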

Below is a worked example to show you how, in a practical situation, **accuracy**, **precision** and **recall** can be quite distinct from one another. It is always worth thinking about which metrics are most appropriate to your situation.

The ten films: *Star Wars*, *Saw*, *Gone With the Wind*, *The Babadook*, *Under the Shadow*, *Citizen Kane*, *Dilwale Dulhania Le Jayenge*, *Police Story*, *Gol Maal*, *Carrie*.

Question: Is this film a horror film?

In [None]:
import numpy as np
actual_answers = np.array([0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
predicted_answers = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1])
n = 10

In [None]:
correct = actual_answers == predicted_answers
true_positive = (actual_answers == 1) & (predicted_answers == 1)
false_positive = None # When the true answer is negative, but the predicted answer is positive
false_negative = None # When the true answer is positive, but the predicted answer is negative

In [None]:
accuracy = correct.sum() / n
precision = None # TP / (TP + FP)
recall = None # TP / (TP + FN)


In [None]:
print(f'The model correctly predicted the genre of {correct.sum()} out of {n} films.')
print(f'The model\'s accuracy is therefore {accuracy}.')
print(f'But the model\'s precision is only {precision}')
print(f'And its recall is only {recall:.2f}')

## Some Popular Kinds of Machine Learning Algorithms

* **Bayesian Networks**: Probabilistic models that model how likely a particular outcome is given certain information. These are quite complex but very flexible and easy to interpret. We look at an example of a Bayesian Network in Session 2.
* **Deep Learning**: Another synonym for Neural Networks. There are many kinds of Neural Networks for different problems. In Sessions 3 and 4 we will look at two kinds that are particularly interesting to humanists, [Recurrent Neural Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network) for modelling and generating language, and a [Skip-Gram Model](https://en.wikipedia.org/wiki/Word2vec) for trying to mathematically represent the meaning of individual words.
* **Genetic Algorithms**: A family of algorithms that mimic natural selection. Different versions of a model are spawned and then compete with one another according to set rules.
* **Reinforcement Learning**: A kind of learning where an algorithm improves by trial and error, often by competing against itself. Reinforcement Learning is behind many of the newsworthy AI stories, such as AI grandmasters of [Go](https://www.youtube.com/watch?v=WXuK6gekU1Y) and [Starcraft II](https://www.youtube.com/watch?v=UuhECwm31dM). A related idea, in which two networks are trained against each other, underlies many [deepfakes](https://en.wikipedia.org/wiki/Deepfake).
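
As a rough illustration of the Genetic Algorithms entry above, here is a minimal sketch in plain Python. The task (evolving a list of 20 bits to contain as many 1s as possible), the population size, the mutation rate and all function names are arbitrary choices made for this example; real genetic algorithms vary widely in how they select, combine and mutate candidates.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
GENOME_LENGTH = 20
POPULATION_SIZE = 30

def fitness(genome):
    """Toy fitness rule: the more 1s, the fitter the genome."""
    return sum(genome)

def mutate(genome, rate=0.05):
    """Flip each bit with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in genome]

def crossover(parent_a, parent_b):
    """The child takes the start of one parent and the end of the other."""
    cut = rng.randrange(1, GENOME_LENGTH)
    return parent_a[:cut] + parent_b[cut:]

# Spawn a random starting population
population = [[rng.randint(0, 1) for _ in range(GENOME_LENGTH)]
              for _ in range(POPULATION_SIZE)]
initial_best = max(fitness(g) for g in population)

for generation in range(40):
    # Selection: the fitter half survives unchanged
    population.sort(key=fitness, reverse=True)
    survivors = population[:POPULATION_SIZE // 2]
    # Reproduction: survivors breed mutated children to refill the population
    children = [mutate(crossover(rng.choice(survivors), rng.choice(survivors)))
                for _ in range(POPULATION_SIZE - len(survivors))]
    population = survivors + children

final_best = max(fitness(g) for g in population)
print(initial_best, final_best)
```

Because the fittest candidates always survive into the next generation, the best score can never go down from one generation to the next.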

## Conclusion

This week, we have covered some basic Machine Learning concepts, and learnt the basic pattern of machine learning programming in Python. The rough template is:

```
# Split training and test data:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y)

# Initialise model, and choose settings
model = Model(...)

# Training: Fit model to training data
model.fit(train_x) # <-- unsupervised
model.fit(train_x, train_y) # <-- supervised

# Inference: Use model to make predictions, categorise inputs etc.
predictions = model.predict(test_x)
```
