# Module 1: Machine Learning Terminology

### Types of Machine Learning

* **Supervised learning** (Gmail spam filtering): training a model on input data and its corresponding labels so that it can predict labels for new examples.
* **Unsupervised learning** (Google News): training a model to find patterns in a dataset, typically an unlabeled dataset. No targets are given.
* **Reinforcement learning** (AlphaGo): a family of algorithms for finding suitable actions to take in a given situation in order to maximize reward.
* **Recommendation systems** (Amazon item recommendation system): predicting the "rating" or "preference" a user would give to an item.

In **supervised learning**, we are given a set of observations ($X$) and their corresponding targets ($y$), and we wish to find a model that relates $X$ to $y$.

In **unsupervised learning**, we are given a set of observations ($X$) and we wish to group similar things together in $X$.

### Supervised Learning

* **Classification**: predicting one of two or more categories, also known as classes (e.g. predict whether a patient has liver disease or not)
* **Regression**: predicting a continuous value (e.g. predict housing prices)

### Terminology

* **examples** = rows
* **features** = inputs
* **targets** = outputs
* **training** = learning = fitting
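
To make these terms concrete, here is a minimal sketch on a hypothetical three-row table (the column names `lab1`, `quiz1`, `quiz2` and the values are illustrative assumptions). Each row is one example, the columns kept in `X_toy` are features, and the column held out as `y_toy` is the target.

```python
import pandas as pd

# Hypothetical toy data: each row is one example (one student).
toy_df = pd.DataFrame(
    {
        "lab1": [92, 94, 78],                 # feature (input)
        "quiz1": [92, 91, 80],                # feature (input)
        "quiz2": ["A+", "not A+", "not A+"],  # target (output)
    }
)

X_toy = toy_df.drop(columns=["quiz2"])  # features
y_toy = toy_df["quiz2"]                 # target

print(X_toy.shape)  # (number of examples, number of features)
print(y_toy.shape)  # (number of examples,)
```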

### Scikit-learn

In [1]:
import pandas as pd

from sklearn.dummy import DummyClassifier

In [2]:
classification_df = pd.read_csv("https://raw.githubusercontent.com/UBC-MDS/DSCI_571_sup-learn-1/master/lectures/data/quiz2-grade-toy-classification.csv")
classification_df.head()

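|   | ml_experience | class_attendance | lab1 | lab2 | lab3 | lab4 | quiz1 | quiz2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 92 | 93 | 84 | 91 | 92 | A+ |
| 1 | 1 | 0 | 94 | 90 | 80 | 83 | 91 | not A+ |
| 2 | 0 | 0 | 78 | 85 | 83 | 80 | 80 | not A+ |
| 3 | 0 | 1 | 91 | 94 | 92 | 91 | 89 | A+ |
| 4 | 0 | 1 | 77 | 83 | 90 | 92 | 85 | A+ |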


In [3]:
X = classification_df.drop(columns=["quiz2"]) # features
y = classification_df["quiz2"] # target

In [4]:
dummy_clf = DummyClassifier(strategy="most_frequent")

In [5]:
dummy_clf.fit(X, y)

Let's see what it predicts for a single observation first:

In [6]:
single_obs = X.loc[[0]]
single_obs

|   | ml_experience | class_attendance | lab1 | lab2 | lab3 | lab4 | quiz1 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 92 | 93 | 84 | 91 | 92 |


In [7]:
dummy_clf.predict(single_obs)

array(['not A+'], dtype='<U6')

In [8]:
X

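|   | ml_experience | class_attendance | lab1 | lab2 | lab3 | lab4 | quiz1 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 92 | 93 | 84 | 91 | 92 |
| 1 | 1 | 0 | 94 | 90 | 80 | 83 | 91 |
| 2 | 0 | 0 | 78 | 85 | 83 | 80 | 80 |
| 3 | 0 | 1 | 91 | 94 | 92 | 91 | 89 |
| 4 | 0 | 1 | 77 | 83 | 90 | 92 | 85 |
| 5 | 1 | 0 | 70 | 73 | 68 | 74 | 71 |
| 6 | 1 | 0 | 80 | 88 | 89 | 88 | 91 |
| 7 | 0 | 1 | 95 | 93 | 69 | 79 | 75 |
| 8 | 0 | 0 | 97 | 90 | 94 | 99 | 80 |
| 9 | 1 | 1 | 95 | 95 | 94 | 94 | 85 |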


In [9]:
dummy_clf.predict(X)

array(['not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+'], dtype='<U6')

In [10]:
print("The accuracy of the model on the training data: %0.3f" %(dummy_clf.score(X, y)))

The accuracy of the model on the training data: 0.524


In [11]:
print("The error of the model on the training data: %0.3f" %(1 - dummy_clf.score(X, y)))

The error of the model on the training data: 0.476
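
The accuracy above is simply the share of the majority class: 0.524 is consistent with 11 of the 21 training examples being `not A+`. The sketch below checks this relationship on a hypothetical five-example dataset (the names `X_toy`, `y_toy` and the values are assumptions for illustration, not the course data).

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 3 of 5 examples are "not A+".
y_toy = pd.Series(["A+", "not A+", "not A+", "A+", "not A+"])
# The dummy model ignores feature values, so any features will do.
X_toy = pd.DataFrame({"quiz1": [92, 91, 80, 89, 85]})

clf = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

# Share of the most frequent class in the training targets.
majority_class = y_toy.value_counts().idxmax()
majority_share = (y_toy == majority_class).mean()

accuracy = clf.score(X_toy, y_toy)  # fraction of correct predictions
error = 1 - accuracy                # fraction of incorrect predictions
print(majority_class, majority_share, accuracy, error)
```

Because the model always predicts the majority class, its training accuracy equals `majority_share`, which makes it a useful baseline for judging real models.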


### Steps

1. Create your X and y objects
2. `clf = DummyClassifier()`: create a model
3. `clf.fit(X, y)`: train the model
4. `clf.score(X, y)`: assess the model
5. `clf.predict(Xnew)`: predict on some new data using the trained model
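
The five steps can be sketched end to end as follows. The data here is a hypothetical three-row table and `X_new` is an invented unseen example; only the `DummyClassifier` calls mirror the steps listed above.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Step 1: create X and y (hypothetical toy data).
df = pd.DataFrame(
    {
        "lab1": [92, 94, 78],
        "quiz1": [92, 91, 80],
        "quiz2": ["A+", "not A+", "not A+"],
    }
)
X = df.drop(columns=["quiz2"])
y = df["quiz2"]

# Step 2: create a model.
clf = DummyClassifier(strategy="most_frequent")

# Step 3: train the model.
clf.fit(X, y)

# Step 4: assess the model (accuracy on the training data).
train_accuracy = clf.score(X, y)

# Step 5: predict on new data with the same feature columns.
X_new = pd.DataFrame({"lab1": [85], "quiz1": [88]})
new_pred = clf.predict(X_new)

print(train_accuracy, new_pred)
```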

### Dummy Regression

In [12]:
regression_df = pd.read_csv("https://raw.githubusercontent.com/UBC-MDS/DSCI_571_sup-learn-1/master/lectures/data/quiz2-grade-toy-regression.csv")
regression_df

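|   | ml_experience | class_attendance | lab1 | lab2 | lab3 | lab4 | quiz1 | quiz2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 92 | 93 | 84 | 91 | 92 | 90 |
| 1 | 1 | 0 | 94 | 90 | 80 | 83 | 91 | 84 |
| 2 | 0 | 0 | 78 | 85 | 83 | 80 | 80 | 82 |
| 3 | 0 | 1 | 91 | 94 | 92 | 91 | 89 | 92 |
| 4 | 0 | 1 | 77 | 83 | 90 | 92 | 85 | 90 |
| 5 | 1 | 0 | 70 | 73 | 68 | 74 | 71 | 75 |
| 6 | 1 | 0 | 80 | 88 | 89 | 88 | 91 | 91 |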


In [13]:
X = regression_df.drop(columns=["quiz2"])
y = regression_df["quiz2"]

In [14]:
from sklearn.dummy import DummyRegressor

dummy_reg = DummyRegressor(strategy="mean")

In [15]:
dummy_reg.fit(X, y)

In [16]:
single_obs = X.loc[[2]]
single_obs

|   | ml_experience | class_attendance | lab1 | lab2 | lab3 | lab4 | quiz1 |
|---|---|---|---|---|---|---|---|
| 2 | 0 | 0 | 78 | 85 | 83 | 80 | 80 |


In [17]:
dummy_reg.predict(single_obs)

array([86.28571429])
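
The predicted value is the mean of the training targets, whatever the features are. A minimal check, with the `quiz2` values copied from the regression table above and a placeholder feature matrix (an assumption made to keep the sketch self-contained):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# quiz2 targets copied from the regression table above.
y = np.array([90, 84, 82, 92, 90, 75, 91])
# Placeholder features: DummyRegressor ignores their values.
X = np.zeros((len(y), 1))

reg = DummyRegressor(strategy="mean").fit(X, y)
pred = reg.predict(X[:1])

# The prediction should match the mean of y (604 / 7).
print(pred[0], y.mean())
```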

In [18]:
X

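|   | ml_experience | class_attendance | lab1 | lab2 | lab3 | lab4 | quiz1 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 92 | 93 | 84 | 91 | 92 |
| 1 | 1 | 0 | 94 | 90 | 80 | 83 | 91 |
| 2 | 0 | 0 | 78 | 85 | 83 | 80 | 80 |
| 3 | 0 | 1 | 91 | 94 | 92 | 91 | 89 |
| 4 | 0 | 1 | 77 | 83 | 90 | 92 | 85 |
| 5 | 1 | 0 | 70 | 73 | 68 | 74 | 71 |
| 6 | 1 | 0 | 80 | 88 | 89 | 88 | 91 |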


In [19]:
dummy_reg.predict(X)

array([86.28571429, 86.28571429, 86.28571429, 86.28571429, 86.28571429,
       86.28571429, 86.28571429])

In [20]:
print("The R^2 score of the model on the training data: %0.3f" %(dummy_reg.score(X, y))) # for regressors, score() returns R^2, not accuracy

The R^2 score of the model on the training data: 0.000
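
For regressors, `score` returns $R^2 = 1 - SS_{res} / SS_{tot}$ rather than accuracy. A model that always predicts the mean of `y` has $SS_{res} = SS_{tot}$, which is why the score is 0. The sketch below recomputes this by hand, using the `quiz2` values from the regression table and a placeholder feature matrix (an assumption for self-containment).

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# quiz2 targets copied from the regression table above.
y = np.array([90, 84, 82, 92, 90, 75, 91])
# Placeholder features: DummyRegressor ignores their values.
X = np.zeros((len(y), 1))

reg = DummyRegressor(strategy="mean").fit(X, y)
y_pred = reg.predict(X)

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, reg.score(X, y))
```

A real model should score above 0 on this scale. A negative $R^2$ means the model does worse than always predicting the mean.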
