# Introduction

The course DATA.ML.200 Pattern Recognition and Machine Learning assumes that you have basic knowledge of machine learning (e.g. DATA.ML.100), a solid grasp of engineering mathematics, and moderately good programming skills. The course is intended for MSc-level studies in engineering, particularly for students of computer science (information technology), electrical engineering, and robotics.

## Background

Before delving into the technical material, let's spend some time discussing the more philosophical questions about *artificial intelligence* (AI) that interest the (non-expert) general public and the press.

### Intelligence

### Consciousness

<div>
<img src="pictures/origins_of_life.png" width=600>
</div>

For more information:

 * https://www.pmfias.com/origin-evolution-life-earth-biological-evolution/



### Representation 

## Conventional machine learning

The vast majority of machine learning problems encountered in real life can be solved using the functionality provided by the

 * [scikit-learn](https://scikit-learn.org/stable/) machine learning library for Python

and therefore we next quickly go through the topics familiar from DATA.ML.100.

**Install Python**

Create a new Anaconda environment for this course
```bash
 $ conda update conda
 $ conda update --all
 $ conda create -n dataml200-24
 $ conda activate dataml200-24
 (dataml200-24) $ conda install python=3.11
 (dataml200-24) $ conda install scikit-learn
 (dataml200-24) $ conda install matplotlib
 (dataml200-24) $ conda install pandas
```

Install Jupyter notebook
```bash
 (dataml200-24) $ conda install -c conda-forge notebook
```

which lets you run the provided lecture notebooks
```bash
 (dataml200-24) $ jupyter notebook
```

### Classification

#### Demo: Classification of text files

20 Newsgroups is a dataset of user-written messages posted to 20 different discussion groups.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint

def size_mb(docs):
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6

data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

pprint(data_train.target_names)

print(f'Total of {len(data_train.data)} posts in the dataset and the total size is {size_mb(data_train.data):.2f}MB')

Print a few examples

In [None]:
sample_id = 0
sample_target = data_train.target[sample_id]
print(f'Class id: {sample_target} and class name {data_train.target_names[sample_target]}')
print(data_train.data[sample_id])

sample_id = 1
sample_target = data_train.target[sample_id]
print(f'Class id: {sample_target} and class name {data_train.target_names[sample_target]}')
print(data_train.data[sample_id])

The posts, which are strings of arbitrary length, must be converted to fixed-length *feature vectors*. The simplest such representation is called *Bag of Words* (BoW): each post is described by how many times each word of a fixed vocabulary occurs in it.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Simple example strings
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?'
          ]
vectorizer = CountVectorizer()

# This makes vocabulary from the given data
X = vectorizer.fit_transform(corpus)
pprint(vectorizer.get_feature_names_out())
pprint(corpus)
pprint(X.toarray())
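
To make the idea concrete, the following minimal sketch builds the same kind of count vectors by hand. It assumes a simplified tokenizer (lowercase, words of at least two characters) that mirrors but does not exactly reproduce the `CountVectorizer` defaults; the helper names `tokenize` and `bow_vector` are hypothetical and used only for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase words of 2+ characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bow_vector(text, vocabulary):
    # Count how often each vocabulary word occurs; unknown words are ignored
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

docs = ['This is the first document.',
        'This document is the second document.']

# The vocabulary is the sorted set of all words seen in the documents
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

print(vocabulary)
for doc in docs:
    print(bow_vector(doc, vocabulary))
```

Every document maps to a vector of the same length (the vocabulary size), regardless of how long the document is. Word order is lost, which is why the representation is called a "bag".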

Construct the vocabulary from the training data

In [None]:
# Form data
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(data_train.data)
print(f'Size of the vocabulary is {len(vectorizer.get_feature_names_out())}')

Transform the test data as well, using the vocabulary learned from the training data.

In [None]:
X_test = vectorizer.transform(data_test.data)
y_train, y_test = data_train.target, data_test.target
print(X_train.shape)
print(y_train.shape)

Pick one of the scikit-learn classifiers and train it using the training data.

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train, y_train)

Evaluate the classifier using ready-made functions in Scikit-Learn

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
pred = clf.predict(X_test)

# Classification accuracy
acc = accuracy_score(y_test, pred)
print(f'Classification accuracy {acc:.2f}')

# Confusion matrix
fig, ax = plt.subplots(figsize=(10, 5))
ConfusionMatrixDisplay.from_predictions(y_test, pred, ax=ax)
ax.xaxis.set_ticklabels(data_train.target_names)
ax.yaxis.set_ticklabels(data_train.target_names)
plt.xticks(rotation=90)
ax.set_title(f"Confusion Matrix for {clf.__class__.__name__}\non the original documents")
plt.show()
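
The two measures above are closely related: classification accuracy is the sum of the confusion matrix diagonal divided by the total number of samples. The sketch below checks this on a handful of made-up labels (the `*_toy` values are illustrative and not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for three classes (illustration only)
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Correct predictions are on the diagonal
acc_from_cm = np.trace(cm) / cm.sum()
print(f'Accuracy from the confusion matrix: {acc_from_cm:.2f}')
print(f'Accuracy from accuracy_score:       {accuracy_score(y_true_toy, y_pred_toy):.2f}')
```

The off-diagonal entries show which classes are confused with each other, which a single accuracy number cannot tell.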

### Detection

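Detection is a special case of classification, but **do not** confuse detection with classification, since the evaluation is very different: a detector is typically evaluated over a range of decision thresholds rather than by a single accuracy number.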

 * [Receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (Wikipedia)
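
As a rough illustration of threshold-based evaluation, the sketch below computes an ROC curve with scikit-learn. The synthetic data from `make_classification`, the choice of `LogisticRegression` as the detector, and all variable names are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic two-class data where the positive class is rare (about 10 %)
X_det, y_det = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.9, 0.1], random_state=0)
X_det_train, X_det_test, y_det_train, y_det_test = train_test_split(
    X_det, y_det, stratify=y_det, random_state=0)

# The detector outputs a score for the positive class, not just a label
detector = LogisticRegression(max_iter=1000).fit(X_det_train, y_det_train)
scores = detector.predict_proba(X_det_test)[:, 1]

# Each threshold gives one (false positive rate, true positive rate) pair
fpr, tpr, thresholds = roc_curve(y_det_test, scores)
auc = roc_auc_score(y_det_test, scores)
print(f'Number of thresholds: {len(thresholds)}')
print(f'Area under the ROC curve: {auc:.2f}')
```

Plotting `tpr` against `fpr` gives the ROC curve. With a rare positive class, a detector that never fires can still reach high accuracy, which is one reason accuracy alone is a poor measure for detection.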


#### Tracking

Tracking is a special case of detection.

 * Demo: Python/opencv_tracker_webcam.py


### Regression

#### Demo: House prices regression

A standard dataset for data analysis, loaded here as `house_prices` from OpenML (this is the Ames housing data, not the older Boston housing data). The dataset provides 80 input variables of various types and one output variable (the house selling price). Pandas provides many tools for data analysis, and scikit-learn supports Pandas data frames.

In [None]:
from sklearn.datasets import fetch_openml

df = fetch_openml(name="house_prices", as_frame=True, parser="pandas")
X = df.data
y = df.target
print(X.head())
print(y.head())
print(X.shape)
print(y.shape)

For simplicity, let's take a small subset of the available input features. We also skip all categorical features, as they would need to be converted to numerical values.

In [None]:
from sklearn.utils import shuffle
import numpy as np

features = [
        "YrSold",
#        "HeatingQC",
#        "Street",
        "YearRemodAdd",
#        "Heating",
#        "MasVnrType",
#        "BsmtUnfSF",
#        "Foundation",
        "MasVnrArea",
        "MSSubClass",
#        "ExterQual",
#        "Condition2",
        "GarageCars",
#        "GarageType",
        "OverallQual",
        "TotalBsmtSF",
        "BsmtFinSF1",
#        "HouseStyle",
#        "MiscFeature",
        "MoSold",
]

X = X.loc[:, features]
print(X.head())


X, y = shuffle(X, y, random_state=666)


X_train = X.iloc[:1000]
y_train = y.iloc[:1000]
X_test = X.iloc[1000:]
y_test = y.iloc[1000:]

X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
y_test = y_test.values

print(X_train.shape)
print(f'Mean price (training): {np.mean(y_train)}')

print(X_test.shape)
print(f'Mean price (test): {np.mean(y_test)}')
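
The categorical features skipped above could be converted to numerical values with one-hot encoding. The following is a minimal sketch using `pandas.get_dummies` on a tiny made-up data frame; the column names are borrowed from the feature list, but the rows are illustrative and not read from the dataset.

```python
import pandas as pd

# Tiny made-up frame with one numerical and one categorical column
toy = pd.DataFrame({
    "OverallQual": [7, 6, 8],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Each category becomes its own 0/1 column; numerical columns are untouched
encoded = pd.get_dummies(toy, columns=["Street"], dtype=int)
print(encoded)
```

One-hot encoding avoids imposing an artificial order on the categories, at the cost of one extra column per category.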

Sanity check baseline

In [None]:
from sklearn import metrics

y_pred = np.ones(y_test.shape)*np.mean(y_train)

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2:', metrics.r2_score(y_test, y_pred))
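
The same mean-prediction baseline is also available in scikit-learn as `DummyRegressor`. The sketch below uses made-up target values (the `*_toy` arrays are assumptions for illustration) to show that the baseline ignores the inputs and always predicts the training mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Made-up prices (illustration only); the inputs are irrelevant for the baseline
y_toy_train = np.array([100.0, 150.0, 200.0, 250.0])
y_toy_test = np.array([120.0, 180.0, 260.0])
X_toy_train = np.zeros((len(y_toy_train), 1))
X_toy_test = np.zeros((len(y_toy_test), 1))

# strategy="mean" always predicts the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_toy_train, y_toy_train)
y_toy_pred = baseline.predict(X_toy_test)

print(y_toy_pred)
print('R2:', r2_score(y_toy_test, y_toy_pred))
```

An R2 score near or below zero is what to expect from this baseline, so any useful regressor should score clearly above it.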

Baseline - linear regression (fails, but why?)

In [None]:
from sklearn import linear_model

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(np.squeeze(X_train), np.squeeze(y_train))

# Make predictions using the testing set
y_pred = regr.predict(X_test)

Let's locate the missing values, replace them Pandas-style with the column means, and redo the experiments.

In [None]:
print(np.argwhere(np.isnan(X_train)))
print(X_train[102,:])

In [None]:
X_pure = X.apply(lambda x: x.fillna(x.mean()), axis=0)

X_train = X_pure.iloc[:1000]
X_test = X_pure.iloc[1000:]

X_train = X_train.values
X_test = X_test.values

print(np.argwhere(np.isnan(X_train)))
print(X_train[102,:])
print(np.mean(X_train[:,2]))
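
Note that the replacement above computes the column means over all samples, including those later used for testing. An alternative, sketched below under the assumption that the test data should stay unseen during preprocessing, is scikit-learn's `SimpleImputer`, fitted on the training part only. The small arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with missing values in the first column
X_imp_train = np.array([[1.0, 10.0],
                        [np.nan, 20.0],
                        [3.0, 30.0]])
X_imp_test = np.array([[np.nan, 40.0]])

# Learn the column means from the training data only
imputer = SimpleImputer(strategy="mean")
X_imp_train_filled = imputer.fit_transform(X_imp_train)

# Reuse the training means to fill the gaps in the test data
X_imp_test_filled = imputer.transform(X_imp_test)

print(X_imp_train_filled)
print(X_imp_test_filled)
```

This follows the same fit-on-training, transform-on-test pattern that was used with `CountVectorizer` in the classification demo.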

Redo the baseline (linear regression)

In [None]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(np.squeeze(X_train), np.squeeze(y_train))

# Make predictions using the testing set
y_pred = regr.predict(X_test)

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2:', metrics.r2_score(y_test, y_pred))

Random forest regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor()
regr.fit(np.squeeze(X_train), np.squeeze(y_train))
y_pred = regr.predict(X_test)

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2:', metrics.r2_score(y_test, y_pred))

## Clustering & other unsupervised learning

Unsupervised ML is not included in this course, but it is discussed in DATA.ML.100 and in other data analysis courses.

## Reinforcement learning

Part of this course. We go beyond the basics covered in DATA.ML.100.

## References

 * DATA.ML.100