# Maching learning

"Machine learning (ML) is an application of AI that allows machines to extract knowledge from data and learn from it autonomously." - Google

<img src='https://miro.medium.com/v2/resize:fit:870/1*aQJf4cz9_V25xIAMO1YwiA.png' title='ML within AI' width=500/>

For this course we will explore the intersection of machine learning and data science, but we won't be doing neural networks / deep learning here.

Our goal is to make predictions about new data:
1. What Tiktok videos / Spotify songs / Instagram adverts might you like, based on your history within the app?
2. Does this brain scan show cancer or not?
3. What will the price of Bitcoin be tomorrow?

Machine learning problems are split into two main types: classification and regression. Classification is when their are only a few options and we wish to select the most likely. Regression is when the outcome can take a range of values and we wish to make a sensible guess.

Classification examples:
1. Will you enjoy this new Netflix series: yes or no?
2. Does this brain scan show cancer: yes or no?
3. What animal is in this picture: cat, dog, cow, or horse?

Regression examples:
1. What is the price of a house in London with two bedrooms and one bathroom?
2. How many likes will the next MrBeast post get in the first week?
3. What is the probability of it raining tomorrow?

In machine learning we use data and algorithms to answer these questions. The data takes the form of multiple observations of the type of problem we wish to solve. 

Examples:
1. A database of 100,000 previous London house sales, each recording the price, number of bedrooms, and number of bathrooms.
2. 1,000 images of brains, each labelled "cancer" or "no cancer" by a doctor.

We adopt a common language for machine learning problems. The number of observations is called n, the target which we wish to predict is $y$, and the variables which we use to make this prediction are contained in a matrix $X$.

Examples:
1. $n=100,000$, $y$ is a vector of length 100,000, $X$ is a matrix with 100,000 rows and 2 columns. $$ y = \begin{bmatrix}£100,000\\ £250,000\\ £1,000,000\\ \vdots\end{bmatrix}, \quad X=\begin{bmatrix}1\text{ bedroom} & 1\text{ bathroom}\\ 2\text{ bedroom} & 1\text{ bathroom}\\ 3\text{ bedroom}& 2\text{ bathroom}\\ \vdots & \vdots\end{bmatrix}.$$
2. $n=1,000$, $y$ is a vector of length 1,000, $X$ is a matrix with 1,000 rows and one column per pixel in the images. 
$$ y=\begin{bmatrix}\text{cancer}\\ \text{no cancer}\\ \vdots\end{bmatrix}, \quad X=\begin{bmatrix} \text{grayscale value of pixel 1 in image 1} & \text{grayscale value of pixel 2 in image 1} & \cdots\\ \text{grayscale value of pixel 1 in image 2} & \text{grayscale value of pixel 2 in image 2} & \cdots\\ \vdots & \vdots & \ddots \end{bmatrix}. $$

# The whole game

1. Load the data.
2. Clean the data.
3. Split the data for training and testing.
4. Build a model.
5. Train the model.
6. Test the model.
7. Improve the model, repeating steps 4-6.

Let's look at the process on a very famous data set. HMS Titanic was a huge British passenger ship that sunk in 1912 after hitting an iceberg. Our goal is to predict which passengers survived. This is a classification problem (each passenger either survived or died).

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/1024px-RMS_Titanic_3.jpg' title='HMS Titanic' width=500/>

### Load the data

In [12]:
import pandas as pd # Import a standard toolbox for handling data
df = pd.read_csv('../data/titanic.csv')
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### Clean the data

In [3]:
# For now we will just use "Sex" and "Pclass" to predict "Survived", so let's drop the other variables.
df = df[['Survived', 'Sex', 'Pclass']]
df

Unnamed: 0,Survived,Sex,Pclass
0,0,male,3
1,1,female,1
2,1,female,3
3,1,female,1
4,0,male,3
...,...,...,...
886,0,male,2
887,1,female,1
888,0,female,3
889,1,male,1


In [5]:
# Lots of machine learning methods only work with numbers, not with words. So let's convert "male" --> 0 and "female" --> 1.

df.loc[:, 'Sex'] = 1*(df['Sex'] == 'female')
df

Unnamed: 0,Survived,Sex,Pclass
0,0,0,3
1,1,0,1
2,1,0,3
3,1,0,1
4,0,0,3
...,...,...,...
886,0,0,2
887,1,0,1
888,0,0,3
889,1,0,1


### Split the data for training and testing

In [6]:
# We have 891 data points. Let's use the first 80% to predict the others.

train_y = df.loc[0:712, "Survived"].reset_index(drop=True)
train_X = df.loc[0:712, ["Sex", "Pclass"]].reset_index(drop=True)
test_y = df.loc[713:, "Survived"].reset_index(drop=True)
test_X = df.loc[713:, ["Sex", "Pclass"]].reset_index(drop=True)

In [7]:
print(train_y)
print(train_X)

0      0
1      1
2      1
3      1
4      0
      ..
708    1
709    1
710    1
711    0
712    1
Name: Survived, Length: 713, dtype: int64
     Sex  Pclass
0      0       3
1      0       1
2      0       3
3      0       1
4      0       3
..   ...     ...
708    0       1
709    0       3
710    0       1
711    0       1
712    0       1

[713 rows x 2 columns]


### Build a model

In [8]:
from sklearn.linear_model import LogisticRegression # Load a simple model from a machine learning library.
model = LogisticRegression() # Initialise the model.

### Train the model

In [9]:
model.fit(X=train_X, y=train_y)

### Test the model

In [11]:
from sklearn.metrics import accuracy_score
predictions = model.predict(test_X)
print(accuracy_score(y_true=test_y, y_pred=predictions))

0.7359550561797753


Our survival predictions are over 70% accurate! We could improve this model by including more of the original features in the date (here we only used "Sex" and "Pclass"), or by trying different models (here we used a LogisticRegression classifier).

# Quiz

1. Pick one of the machine learning problems from the examples given.
2. What is the resonse / target / outcome $y$?
3. Is this a classification or regression problem?
4. What features / predictors / variables $X$ do you think would be helpful in predicting $y$?
5. Where could you find some data containing both $X$ and $y$ to help you solve the problem?
6. Think of your own machine learning problem, and answer questions 2-5.

# Bonus

Read the solution presented here (<https://www.kaggle.com/code/alexisbcook/titanic-tutorial/notebook>), and compare it to our approach.