# What is Machine Learning?
Loosely following Chapter 1 in Python Machine Learning 3rd Edition, Raschka.

<img src='./diagrams/a-machine-learning.jpg'>

[Image source](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.informatec.com%2Fen%2Fmachine-learning&psig=AOvVaw1id7LiQnbAWJSImoPrFrnN&ust=1630778280221000&source=images&cd=vfe&ved=0CAsQjRxqFwoTCKDcxeKw4_ICFQAAAAAdAAAAABAD)

#### Definition 1:
>Vast amounts of data are being generated in many fields, and the statistician’s job is to make sense of it all: to extract important patterns and trends, and understand “what the data says.” We call this learning from data.
<br><br>Hastie et al., The Elements of Statistical Learning.

#### Definition 2:
> Machine learning is a subfield of computer science that is concerned with building algorithms which, to be useful, rely on a collection of examples of some phenomenon. These examples can come from nature, be handcrafted by humans or generated by another algorithm.
<br><br>Burkov, The Hundred-Page Machine Learning Book

#### Definition 3:
> Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
<br><br>A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. Some implementations of machine learning use data and neural networks in a way that mimics the working of a biological brain. In its application across business problems, machine learning is also referred to as predictive analytics.  
<br>Source: https://en.wikipedia.org/wiki/Machine_learning

#### Definition 4:
>A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Tom Mitchell)

# Machine Learning vs. Artificial Intelligence
<img src='./diagrams/ai-ml.jpeg'>

[Image source](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.bbntimes.com%2Fscience%2Fartificial-intelligence-vs-machine-learning-vs-artificial-neural-networks-vs-deep-learning&psig=AOvVaw2rw9Ou9dU_we-lUl0PjbBt&ust=1630849936510000&source=images&cd=vfe&ved=0CAsQjRxqFwoTCIDRhdC75fICFQAAAAAdAAAAABAO)


# Machine Learning Timeline
<img src='./diagrams/nvidia-ai-ml.png'>

[Image source](https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/)  

---

Neural networks and deep learning have accelerated in usage and popularity because of increases in available data AND exponential increases in computing power. Deep learning has benefited significantly from [Graphics Processing Units (GPUs)](https://developer.nvidia.com/deep-learning).

---

[Detailed timeline](https://en.wikipedia.org/wiki/Timeline_of_machine_learning)

#### Events of note:  
- [1959 - Arthur Samuel popularizes the term "machine learning" and teaches a computer to play checkers](https://en.wikipedia.org/wiki/Arthur_Samuel)  
- [1970s - AI Winter: project shutdowns, general lack of advancement](https://en.wikipedia.org/wiki/AI_winter)  
- [1986 - Rumelhart, Hinton, and Williams popularize backpropagation](https://en.wikipedia.org/wiki/Backpropagation)  
- 1990s - Support vector machines and neural networks gain traction  
- [2009 - ImageNet](https://en.wikipedia.org/wiki/ImageNet)  
- 2010s - Deep learning becomes popular  
- [2010 - Kaggle](https://www.kaggle.com)  
- [2011 - Watson Wins Jeopardy](https://en.wikipedia.org/wiki/Watson_(computer))  
- 2011 - 2014 Siri, Cortana, Alexa
- [2016 - AlphaGo](https://en.wikipedia.org/wiki/AlphaGo)  
- [2020 - AI to detect misinformation](https://ai.facebook.com/blog/heres-how-were-using-ai-to-help-detect-misinformation/)  
- [2020 - GPT3 Language Generation Model](https://en.wikipedia.org/wiki/GPT-3)
- [2022 - ChatGPT](https://en.wikipedia.org/wiki/ChatGPT)

# Additional Examples
[Google Search](https://www.google.com/?client=safari)
<br>[Apple Face ID](https://support.apple.com/en-us/HT208109)
<br>[Cancer Detection](https://www.nature.com/articles/d41586-020-00847-2)
<br>[Fake News Detection](https://arxiv.org/pdf/1805.08751.pdf)
<br>[Retirement Planners](https://www.aiplanner.com)
<br>[Creating New Flavors](https://www.ibm.com/blogs/research/2019/02/ai-new-flavor-experiences/)
<br>[Fraud Detection](https://aws.amazon.com/solutions/implementations/fraud-detection-using-machine-learning/)
<br>[Computer Vision](https://en.wikipedia.org/wiki/Computer_vision)
<br>[Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing)
<br>[Spam Filtering](https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)
<br>[DeepFakes](https://www.theguardian.com/technology/2020/jan/13/what-are-deepfakes-and-how-can-you-spot-them)

# Resources
[Neural Networks vs. Deep Learning](https://www.ibm.com/cloud/blog/ai-vs-machine-learning-vs-deep-learning-vs-neural-networks)
<br>[Raschka's GitHub](https://github.com/rasbt/python-machine-learning-book-3rd-edition)
<br>[Raschka's Website](https://sebastianraschka.com)

# Before machines could learn, we used rules: [rule-based models](https://en.wikipedia.org/wiki/Rule-based_modeling)

```python
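# Hypothetical observations about one animal (illustrative values only)
whiskers, eatsMice, sleepsAllDay = True, False, True
name = 'Rex'

# Hand-written rules: every case has to be anticipated by the programmer
dogOrCat = ''
if whiskers and eatsMice:
    dogOrCat = 'Cat'
elif name in ['Garfield', 'Felix the Cat']:
    dogOrCat = 'Cat'
elif sleepsAllDay:
    dogOrCat = 'Cat'
else:
    dogOrCat = 'Dog'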
```


# Explicit Programming vs. Machine Learning
<img src='./diagrams/programming-vs-learning.png'>

[Image source](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/01_overview/01-ml-overview__notes.pdf)

# Old School: Explicitly Defining Spam

In [None]:
import pandas as pd
import numpy as np
import string
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/week3/spam.csv', encoding='latin-1')
df.head()

In [None]:
df = df.iloc[:, :2]
df.columns = ['label', 'message']
df.head()

In [None]:
df['label'].value_counts().plot.barh()
plt.title('Distribution of Class Labels', loc='left')
plt.show()

In [None]:
counts = df['label'].value_counts()
counts / counts.sum()

__A subject matter expert (SME) reviewed messages and identified a key word list. The programmer takes the list and implements the first spam detector.__

In [None]:
# list of words that indicate the message contains spam - provided by a SME
spamWords = set(['free', 'text', 'winner', 'win', 'urgent', 
             'txt', 'charged', 'sms', 'prize', 'account', 
             'laid', 'freemsg', 'partner','bonus', 'congrats'
                ])

# split message into tokens
def split_words(x):
    x = x.lower()
    x = x.translate(str.maketrans('', '', string.punctuation))
    x = x.split(' ')
    return x

# run tokens through the spam list
def eval_spam(x, spamList = spamWords):
    is_spam = 'ham'
    for word in x:
        if word in spamList:
            is_spam = 'spam'
            break
    return is_spam
    
df['words'] = df['message'].apply(lambda x: split_words(x))
df['explicit'] = df['words'].apply(lambda x: eval_spam(x))
df.head()

In [None]:
sum(df['label'] == df['explicit']) / len(df)

In [None]:
df['dummy'] = 'ham'
sum(df['label'] == df['dummy']) / len(df)

__How does it perform? [Confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)__

<img src='./diagrams/cm.png' style="width: 400px;">

[Image source](https://www.google.com/url?sa=i&url=https%3A%2F%2Ftowardsdatascience.com%2Fconfusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826&psig=AOvVaw1BGzxd0qgbSOLBDGjHBsll&ust=1631231267101000&source=images&cd=vfe&ved=0CAsQjRxqFwoTCODV_ZfI8PICFQAAAAAdAAAAABAD)

<img src='./diagrams/fnfp.webp' style="width: 400px;">

[Image source](https://neeraj-kumar-vaid.medium.com/statistical-performance-measures-12bad66694b7)

In [None]:
explicitCM = df.pivot_table(index='explicit', columns='label', values='message', aggfunc='count')
explicitCM

# Metrics we can use to evaluate the performance:

#### Accuracy:
$$\frac{TP + TN}{TP + TN + FN + FP}$$

#### Recall:
$$\frac{TP}{TP + FN}$$

#### Precision:
$$\frac{TP}{TP + FP}$$
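
As a quick worked example of the three formulas above, here is a minimal sketch using made-up counts. The numbers (`tp`, `fp`, `fn`, `tn`) are hypothetical and do not come from the spam dataset; spam is treated as the positive class.

```python
# Hypothetical confusion-matrix counts (positive class = spam).
tp, fp, fn, tn = 40, 10, 20, 130

acc = (tp + tn) / (tp + tn + fn + fp)   # share of all messages classified correctly
rec = tp / (tp + fn)                    # share of actual spam that was caught
prec = tp / (tp + fp)                   # share of flagged messages that really were spam

print(f'Accuracy:  {acc:.2%}')
print(f'Recall:    {rec:.2%}')
print(f'Precision: {prec:.2%}')
```

With these counts, accuracy looks decent even though a third of the spam gets through, which is why we usually look at more than one metric.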

----

<img src='./diagrams/Precisionrecall.svg.png' style="width: 400px;">

[Image source](https://en.wikipedia.org/wiki/Precision_and_recall)

**Note**: We'll go over these in more detail later. There are many metrics and different methods for evaluating performance for classification problems.  

----

<img src='./diagrams/cm-metrics.png'>

[Image source](https://en.wikipedia.org/wiki/Template:Diagnostic_testing_diagram)

In [None]:
explicitCM

In [None]:
def accuracy(m):
    return (m.loc['ham', 'ham'] + m.loc['spam','spam']) / m.to_numpy().sum()

eA = accuracy(explicitCM)

print(f'Accuracy: {eA:.2%}')

In [None]:
def recall(m):
    return (m.loc['spam','spam'])/(m.loc['spam','spam'] + m.loc['ham','spam'])
    
def precision(m):
    return (m.loc['spam','spam'])/(m.loc['spam','spam'] + m.loc['spam','ham'])

eR = recall(explicitCM)
eP = precision(explicitCM)

print(f'Recall: {eR:.2%}')
print(f'Precision: {eP:.2%}')

In [None]:

print(f'Accuracy: {eA:.2%}')
print(f'Recall: {eR:.2%}')
print(f'Precision: {eP:.2%}')

## Pain points for this rule-based approach
- Need to know the words ahead of time.    
- Need to monitor and update lists.  
- Precision likely poor.  
- Complex relationships aren't possible.  
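
To make these pain points concrete, here is a small sketch with a deliberately short, hypothetical keyword list and three made-up messages (none of this is the SME list or the course data). It mirrors the tokenizing idea used above, but the names `toy_spam_words` and `toy_is_spam` are illustrative only.

```python
import string

# Hypothetical, much shorter keyword list than the SME list above.
toy_spam_words = {'free', 'winner', 'prize'}

def toy_is_spam(message, spam_words=toy_spam_words):
    """Flag a message as spam if any token matches the keyword list."""
    cleaned = message.lower().translate(str.maketrans('', '', string.punctuation))
    return any(token in spam_words for token in cleaned.split())

toy_messages = [
    'Are you free for lunch tomorrow?',   # harmless, but contains "free"
    'You won $10000 come and claim it',   # spam-like, but no listed keyword
    'WINNER!! Claim your prize now',      # spam-like and matches the list
]

toy_flags = [toy_is_spam(m) for m in toy_messages]
for m, flagged in zip(toy_messages, toy_flags):
    verdict = 'spam' if flagged else 'ham'
    print(f'{verdict:4} | {m}')
```

The first message is a false positive and the second is a false negative: the rules only know the words we thought of ahead of time.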

# Machine Learning identifies underlying patterns without manual coding
#### Machines can learn from the data  
#### Don't need prior knowledge of the signal drivers
Since we have messages and their spam/ham label, we can train models to learn the patterns within the data that explain the label.

Steps:
- [Convert text to a matrix of word counts](https://en.wikipedia.org/wiki/Bag-of-words_model)  
- [Fit a naive Bayes classifier model](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
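
Before using a library for the first step, it may help to see what a matrix of word counts looks like. The following is a hand-rolled sketch on a hypothetical three-message corpus, not scikit-learn's implementation; `CountVectorizer` applies its own tokenization rules, so its vocabulary can differ from this one.

```python
from collections import Counter

# Hypothetical three-message corpus.
toy_corpus = ['win a free prize', 'free lunch today', 'see you today']

# Vocabulary: every distinct token, sorted so the column order is stable.
toy_vocab = sorted({token for doc in toy_corpus for token in doc.split()})

# One row per message, one column per vocabulary word, cell = word count.
toy_matrix = []
for doc in toy_corpus:
    token_counts = Counter(doc.split())
    toy_matrix.append([token_counts[word] for word in toy_vocab])

print(toy_vocab)
for row in toy_matrix:
    print(row)
```

Each message becomes a row of numbers, which is the form a classifier such as naive Bayes can work with.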

In [None]:
X = df['message']
y = df['label']

In [None]:
X.shape

In [None]:
y.shape

In [None]:
X

In [None]:
# count words in each message
from sklearn.feature_extraction.text import CountVectorizer
counter = CountVectorizer()
X = counter.fit_transform(X)

X

In [None]:
# features = counter.get_feature_names_out()
# # features
# values = X.todense()

# pd.DataFrame(data=values, columns=features)

In [None]:
# fit a naive bayes model
from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB().fit(X, y)

# predict class from above model
preds = model.predict(X)
df['ml_prediction'] = preds
df.head(1)

## How does it perform?

In [None]:
machineCM = df.pivot_table(index='ml_prediction', columns='label', values='message', aggfunc='count')
machineCM

In [None]:
mA = accuracy(machineCM)
mR = recall(machineCM)
mP = precision(machineCM)

print(f'Accuracy: {mA:.2%}')
print(f'Recall: {mR:.2%}')
print(f'Precision: {mP:.2%}')

This looks great!

__not so fast__

In [None]:
# df.head(20).message[0]

In [None]:
X2 = counter.transform(['you won $10000 come and claim it'])
model.predict(X2)

## Takeaways:
- Big improvement with less effort and subject matter expertise.  
- Clearly SPAM is more complicated than containing simple key words. Machine learning can detect those more complicated patterns.  
- I could have no idea what words are correlated with spam, but if the dataset is labeled, the algorithms allow us to learn those patterns.  

>You'll still want to make sure you understand the data first and perform rigorous exploratory data analysis. 

>This was a pretty lazy example. We are evaluating the performance based on the data that we trained on instead of data that was withheld from the model. We'll discuss different evaluation methods in a few weeks.
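
To see why scoring on the training data can mislead, consider an extreme, hypothetical "model" that simply memorizes its training messages. The messages and the names `toy_train`, `toy_test`, and `toy_predict` below are made up for illustration; this is a sketch of the idea, not one of the evaluation methods we'll cover later.

```python
# Hypothetical labeled messages, split by hand into training and held-out sets.
toy_train = [
    ('win a free prize now', 'spam'),
    ('lunch at noon?', 'ham'),
    ('urgent: claim your bonus', 'spam'),
    ('see you tonight', 'ham'),
]
toy_test = [
    ('free bonus prize, claim now', 'spam'),
    ('running late, see you at noon', 'ham'),
]

# A "model" that only memorizes the training messages verbatim.
toy_memory = dict(toy_train)

def toy_predict(message):
    """Return the memorized label, or 'ham' for anything unseen."""
    return toy_memory.get(message, 'ham')

def toy_accuracy(pairs):
    """Fraction of (message, label) pairs predicted correctly."""
    return sum(toy_predict(msg) == label for msg, label in pairs) / len(pairs)

toy_train_acc = toy_accuracy(toy_train)
toy_test_acc = toy_accuracy(toy_test)
print(f'Training accuracy: {toy_train_acc:.2%}')
print(f'Held-out accuracy: {toy_test_acc:.2%}')
```

The memorizer is perfect on what it has seen and no better than always guessing ham on what it hasn't, so only the held-out number tells us anything useful.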

# Types of Machine Learning
We'll be focused on supervised and unsupervised learning in our course. The SPAM example above is a common example of supervised learning.

<img src='./diagrams/learning-types.png'  style="width: 600px;">

[Image Source](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch01/ch01.ipynb)

# Supervised Learning
[scikit-learn supervised learning](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)

## Goal is to learn a function that uses data (features) to explain a known phenomenon
### Requirements: label (categorical or continuous), matrix of input features that *may* explain the label

$y=f(x)$

- We observe an outcome (can be binary, categorical, continuous)
- We observe features $X_{i}=(X_{i1},\dots,X_{in})$ for each outcome $Y_{i}$
- We can model this as $Y_{i}=f(X_{i})+\epsilon_{i}$  
- $\epsilon_{i}$ is the error or noise that we won't be able to capture from the function (at least without overfitting). We want to minimize this!
- Our goal is to estimate the unknown function $f$ that is the true data-generating process
- We may choose different methods depending on the goal:
  - Prediction - care about predicting $Y$ to the best of our ability
  - Inference - care about the interpretation of our estimate of $f$ and how the inputs ($X$) relate to $Y$
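
The setup $Y_{i}=f(X_{i})+\epsilon_{i}$ can be simulated in a few lines. In this sketch the "true" function `true_f`, the noise level, and the sample size are all assumptions chosen for illustration; in a real problem $f$ is unknown and we only see the data.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def true_f(x):
    """Hypothetical data-generating function; unknown in a real problem."""
    return 3.0 + 2.0 * x

# Simulate observations: Y_i = f(X_i) + epsilon_i
sim_x = [rng.uniform(0, 10) for _ in range(200)]
sim_y = [true_f(x) + rng.gauss(0, 1) for x in sim_x]

# Estimate f with a straight line using the closed-form least-squares solution.
mean_x = sum(sim_x) / len(sim_x)
mean_y = sum(sim_y) / len(sim_y)
sim_slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sim_x, sim_y))
             / sum((x - mean_x) ** 2 for x in sim_x))
sim_intercept = mean_y - sim_slope * mean_x

# Residuals approximate the noise that the fitted function cannot explain.
sim_residuals = [y - (sim_intercept + sim_slope * x) for x, y in zip(sim_x, sim_y)]
sim_residual_var = sum(r ** 2 for r in sim_residuals) / (len(sim_residuals) - 2)

print(f'Estimated f(x) = {sim_intercept:.2f} + {sim_slope:.2f} * x')
print(f'Residual variance: {sim_residual_var:.2f}')
```

The estimate lands close to the assumed function, but the residual variance stays near the noise variance: that part is not something a better fit can remove.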

## Interpretability or Black box  
- Do you need to understand why a prediction is being made?  
- Some models will offer a degree of feature importance interpretation.  
- Some models are basically black boxes.

## Classification
A category of supervised learning. The goal is to predict the labels (usually categorical) of new, unlabeled examples based on training data that contains the correct labels. The email spam detector above is one example.

- Single/binary class (spam/ham) or multiclass (dog/cat/fish).  
- Looking to create a decision boundary with a function. In two dimensions it can be visualized, but it gets complicated in larger feature spaces.  

<img src='./diagrams/supervised-classification.png'  style="width: 400px;">

[Image source](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch01/ch01.ipynb)

### Example with the [Titanic Dataset](https://www.kaggle.com/c/titanic) using Logistic Regression

<img src='./diagrams/titantic.jpg'   style="width: 400px;">

[Image source](https://en.wikipedia.org/wiki/Titanic#/media/File:RMS_Titanic_3.jpg)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

titantic = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/week3/titantic.csv')
titantic.info()

In [None]:
titantic.head()

In [None]:
titantic['Survived'].value_counts().plot.barh()
plt.title('Distribution of Passenger Survival', loc='left')
plt.show()

In [None]:
# split into training/test
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

categorical = ['Sex', 'Pclass', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
cols = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard','Fare']

def gen_splits(dataframe, features, target, test_pct=0.2):
    # use the `features` argument rather than the global `cols`
    return train_test_split(dataframe[features], dataframe[target], test_size=test_pct)

X_train, X_test, y_train, y_test = gen_splits(titantic, cols, 'Survived')

In [None]:
# create pipeline with transformations
from sklearn.linear_model import LogisticRegression

def pipe(model):
    pipeline = Pipeline([('t', transformer), ('m', model)])
    return pipeline

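# Only the columns listed in `categorical` are encoded; ColumnTransformer drops the rest
# by default (remainder='drop'), so Age and Fare are not used by this model.
transformer = ColumnTransformer(transformers=[('ohe', OneHotEncoder(handle_unknown='ignore'), categorical)])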

lgr = pipe(LogisticRegression())
lgr.fit(X_train, y_train)
print('Model fit')

In [None]:
# Evaluate
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

labs = {0: 'No Survive', 1:'Survived'}
cm = confusion_matrix(y_test.map(labs), pd.Series(lgr.predict(X_test)).map(labs), labels=['No Survive','Survived'])
ConfusionMatrixDisplay(cm, display_labels=['No Survive','Survived']).plot()
plt.show()

In [None]:
print(f'Training accuracy: {lgr.score(X_train, y_train):.2%}')
print(f'Test accuracy: {lgr.score(X_test, y_test):.2%}')

# Classification Considerations
- May need to create a customized way to evaluate models.  
- If using the model for workload triaging, consider ranking cases by predicted probability. 
- What is the cost difference between a false-positive and a false-negative?
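
One way to think about the last two points is to attach a cost to each kind of mistake and see how the total changes with the decision threshold. Everything in this sketch is hypothetical: the scores, the labels, and the assumption that a false negative costs five times as much as a false positive.

```python
# Hypothetical predicted probabilities with the true labels (1 = positive class).
scored = [(0.95, 1), (0.80, 1), (0.65, 0), (0.55, 1), (0.30, 0), (0.10, 0)]

# Assumed business costs: a miss (FN) is 5x worse than a false alarm (FP).
COST_FP, COST_FN = 1, 5

def total_cost(threshold):
    """Total misclassification cost when flagging scores >= threshold."""
    cost = 0
    for prob, actual in scored:
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 0:
            cost += COST_FP
        elif predicted == 0 and actual == 1:
            cost += COST_FN
    return cost

# Ranking by probability lets reviewers start with the most likely cases.
ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)

for threshold in (0.25, 0.50, 0.75):
    print(f'threshold={threshold:.2f} -> cost={total_cost(threshold)}')
```

Under these assumed costs the strictest threshold is the most expensive, because the single missed positive outweighs several false alarms.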

# Regression
The other major class of supervised learning. Here the goal is to predict a real-valued number, based on training data that contains examples with the response of interest. For example, I may want to predict weight based on height. I'll have a dataset with weight and height, train a model to determine the relationship, and use that model to predict the weights of new examples where I only know the height.

<img src='./diagrams/supervised-regression.png'  style="width: 400px;">
     
[Image source](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch01/ch01.ipynb)


### Example with [ISLR's advertising dataset](https://www.kaggle.com/ishaanv/ISLR-Auto) Using Least-Squares Regression

In [None]:
adData = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/week3/advertising.csv')
adData.info()

In [None]:
adData.head()

In [None]:
pd.plotting.scatter_matrix(adData)
plt.show()

### We can hypothesize $Sales_{i}=f(TV_{i},Radio_{i},Newspaper_{i})+\epsilon_{i}$

In [None]:
import statsmodels.formula.api as smf

results = smf.ols('Sales ~ TV + Radio + Newspaper', data=adData).fit()
print(results.summary())
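
Written out, the formula string `'Sales ~ TV + Radio + Newspaper'` corresponds to a linear version of the hypothesis, with an intercept that the formula interface adds by default. The $\beta$ symbols are notation introduced here for illustration; they map to the coefficients reported in the summary table.

$$Sales_{i}=\beta_{0}+\beta_{1}TV_{i}+\beta_{2}Radio_{i}+\beta_{3}Newspaper_{i}+\epsilon_{i}$$

Least squares picks the coefficients that minimize the sum of squared errors over the $m$ samples:

$$\hat{\beta}=\arg\min_{\beta}\sum_{i=1}^{m}\left(Sales_{i}-\beta_{0}-\beta_{1}TV_{i}-\beta_{2}Radio_{i}-\beta_{3}Newspaper_{i}\right)^{2}$$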

>Increased newspaper budgets reduce sales? You'd probably get a question about that.

In [None]:
predictedSales = results.predict(adData)

plt.plot(adData['Sales'], predictedSales, 'bo')
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.show()

# Unsupervised Learning
Discovery of hidden structures in the data

We observe features $X_{i}=(X_{i1},\dots,X_{in})$ that may explain a *latent or unobserved $Y_{i}$*.
- We may have data on user behavior for a web application. We might be able to segment these into groups that correlate with actual demographics.  
- We may be able to extract topic groups for streams of different documents.  
- Sometimes we don't need to decompose into groups, but need to compress the dimensionality of the data without losing significant amounts of the underlying variance.

## Reducing/simplifying the dimensionality of the data
### For compressing the feature space and/or visualization
Reduce the number of features to increase efficiency, reduce noise, and further densify the data.

<img src='./diagrams/unsupervised-reduction.png'  style="width: 400px;">

[Image source](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch01/ch01.ipynb)

# [t-SNE Example](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) with [MNIST](http://yann.lecun.com/exdb/mnist/)

MNIST is a database of hand-written digits that is commonly used for benchmarking model performance.

#### t-SNE is [t-distributed stochastic neighbor embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding)

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, parser='auto')
mnist.keys()

#### The features are pixel intensities. Nothing we've looked at so far is going to be able to handle looking at 784 features at once.

In [None]:
mnist.data.shape

#### Target distribution (i.e., labels of the digits)

In [None]:
mnist.target.value_counts().sort_index()

In [None]:
mnist.target.value_counts().sort_index().plot(kind='bar')
plt.show()

#### Sample to decrease the computational needs and runtime

In [None]:
samples = 2500

mnist_sample = mnist.data.sample(samples)
# select targets by index label so they stay aligned with the sampled rows
mnist_sample_targets = mnist.target.loc[mnist_sample.index]

#### Fit the transform
> This could take a while on larger datasets and/or if your machine doesn't have great specifications. Test it out on smaller datasets to get a sense of runtime.

In [None]:
from sklearn.manifold import TSNE
import datetime

ts_start = datetime.datetime.now()
tsne = TSNE(n_components=2).fit_transform(mnist_sample)
ts_end = datetime.datetime.now()

print(f'Completed in {ts_end-ts_start}')

### Add transform and label into a DataFrame for visualization

In [None]:
tsne_df = pd.DataFrame(tsne)
tsne_df.index = mnist_sample.index.tolist()
tsne_df.columns = ['component1', 'component2']

tsne_df = pd.concat([tsne_df, mnist_sample_targets], axis=1)
tsne_df.head()

In [None]:
import seaborn as sns

sns.set(rc={'figure.figsize':(9,9)})
sns.scatterplot(x='component1', y='component2', hue='class', data=tsne_df)
plt.show()

### Considerations:
- Condensing from 784 dimensions (mostly sparse) to 2 dimensions.  
- Not uncommon for distortions to occur and clusters to materialize that aren't really related.  
- This could direct you to potential issues in the data, e.g., what's with the handful of zeros that look like sixes?

# [Principal Components](https://en.wikipedia.org/wiki/Principal_component_analysis)

<img src='./diagrams/pca.png'  style="width: 400px;">

[Image source](https://en.wikipedia.org/wiki/Principal_component_analysis)

In [None]:
from sklearn.decomposition import PCA

adX = adData.iloc[:, :3]
adXpca = PCA(n_components=2).fit(adX)

In [None]:
adXpca.explained_variance_ratio_
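
The explained variance ratio is each component's share of the total variance. As a rough sketch of where those numbers come from, the example below works through a hypothetical two-feature dataset by hand, using the closed-form eigenvalues of a 2x2 covariance matrix. The data and variable names are made up, and this is not how scikit-learn computes PCA internally.

```python
import math

# Hypothetical two-feature data where the features move almost in lockstep.
pc_x = [1.0, 2.0, 3.0, 4.0, 5.0]
pc_y = [1.1, 1.9, 3.2, 3.8, 5.0]
n_obs = len(pc_x)

mx, my = sum(pc_x) / n_obs, sum(pc_y) / n_obs

# Sample variances and covariance (the entries of the 2x2 covariance matrix).
var_x = sum((x - mx) ** 2 for x in pc_x) / (n_obs - 1)
var_y = sum((y - my) ** 2 for y in pc_y) / (n_obs - 1)
cov_xy = sum((x - mx) * (y - my) for x, y in zip(pc_x, pc_y)) / (n_obs - 1)

# Eigenvalues of the symmetric matrix [[var_x, cov_xy], [cov_xy, var_y]].
center = (var_x + var_y) / 2
spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
eigenvalues = [center + spread, center - spread]

# Each component's share of the total variance.
pc_ratios = [ev / sum(eigenvalues) for ev in eigenvalues]
print([round(r, 4) for r in pc_ratios])
```

Because the two features are nearly redundant, almost all of the variance sits in the first component, which is the situation where dropping the second one loses very little.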

#### Can help reduce the feature space and remove noise that may cause instability in modeling
>Won't guarantee better results, but yet another knob you can try to tune.

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(adXpca.transform(adX),adData['Sales'])
pcaPredictions = model.predict(adXpca.transform(adX))

plt.plot(pcaPredictions, adData['Sales'], 'bo')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

## Clustering
Clustering is one of the major categories of unsupervised learning. The goal is to group the data based on its feature set. We don't know or have class labels, but hope we can split out latent (hidden) groups. [scikit-learn clustering](https://scikit-learn.org/stable/modules/clustering.html#clustering)

#### [k-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
<img src='./diagrams/unsupervised-clustering.png'  style="width: 400px;">

[Image source](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch01/ch01.ipynb)

[scikit-learn Demo using the Digits data](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py)
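
The core loop behind k-means alternates between assigning points to the nearest center and moving each center to the mean of its points. Here is a minimal one-dimensional sketch of that loop with hypothetical points and starting centers; `kmeans_1d` is an illustrative helper, and scikit-learn's `KMeans` differs in several ways (for example, its default initialization is k-means++).

```python
def kmeans_1d(points, centers, n_iter=10):
    """Minimal Lloyd's-algorithm sketch for one-dimensional data."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:  # nothing moved, so we have converged
            break
        centers = new_centers
    return centers, groups

toy_points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
toy_centers, toy_groups = kmeans_1d(toy_points, centers=[0.0, 5.0])
print(toy_centers)
print(toy_groups)
```

Note that no labels were used: the two groups come purely from how close the points are to each other.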



### k-Means with iris

In [None]:
from sklearn import datasets
import pandas as pd
import numpy as np

iris = datasets.load_iris()
iris.keys()

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init='auto').fit(iris.data)
kmeans

In [None]:
kmeansPredict = kmeans.predict(iris.data)
kmeansPredict

In [None]:
irisClass = pd.Series(iris.target)
irisCluster = pd.Series(kmeansPredict)

from sklearn.metrics import confusion_matrix
confusion_matrix(irisClass, irisCluster)

__That's not too shabby!__

### [Hierarchical](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering)
<img src='./diagrams/unsupervised-hierarchy.png'  style="width: 400px;">

[Image source](https://en.wikipedia.org/wiki/Hierarchical_clustering)




In [None]:
from matplotlib import pyplot as plt
%matplotlib inline
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model = model.fit(iris.data)
model

In [None]:
# observation counts per merge are left at zero as placeholders
counts = np.zeros(model.children_.shape[0])
linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float)

dendrogram(linkage_matrix, no_labels=True)
plt.show()

# Reinforcement Learning (outside scope of course, just for awareness)
The goal is to get an agent (e.g., a dog) to perform a task (e.g., fetch a stick), receiving a reward (e.g., a treat) or a penalty (e.g., no treat) depending on the outcome. The agent usually learns a series of actions that maximizes the reward signal, which can be immediate or delayed. Think of it as intelligent trial and error. Chess is a popular example. [AlphaGo is another interesting example.](https://deepmind.com/research/case-studies/alphago-the-story-so-far)

<img src='./diagrams/dog.png'>

[Image source](https://www.mathworks.com/discovery/reinforcement-learning.html)

# You're a chef. The kitchen is fully stocked.
### (You need to know how to cook though)

#### Your job:
- Determine the type of problem you are trying to solve.  
- Evaluate the data you have been provided.  
- Determine how to structure an experiment to test the hypothesis.  
- Determine how to evaluate the results of your experiment.  

<img src='./diagrams/ml_map.png'  style="width: 700px;">

[Image source](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

# Parameters vs. Hyperparameters

- Many models are [parametric](https://en.wikipedia.org/wiki/Parametric_model), which means the algorithm learns weights (i.e., parameters) during the optimization process. Think of the coefficients in regression models.  
- Many models have [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) which are user-defined values that control parts of the learning process.   
- Finding the best set of hyperparameters can be time consuming.  
- Think of hyperparameters as knobs you need to tune, like on an old radio.  
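
To illustrate the difference, here is a small sketch in which the weight `w` is a parameter learned by gradient descent, while the learning rate is a hyperparameter we choose by trying a few values. The data, the candidate learning rates, and the helper names are all hypothetical, and real tuning would score each candidate on held-out data rather than on the training data.

```python
# Toy data generated from y = 2x, so the ideal weight is 2.
gd_x = [1.0, 2.0, 3.0, 4.0]
gd_y = [2.0, 4.0, 6.0, 8.0]

def mse(w):
    """Mean squared error of the model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(gd_x, gd_y)) / len(gd_x)

def fit_weight(learning_rate, n_steps=50):
    """Learn the parameter w by gradient descent, starting from w = 0."""
    w = 0.0
    for _ in range(n_steps):
        grad = sum(2 * x * (w * x - y) for x, y in zip(gd_x, gd_y)) / len(gd_x)
        w -= learning_rate * grad
    return w

# The learning rate is a hyperparameter: we pick it, the algorithm does not.
gd_results = {lr: fit_weight(lr) for lr in (0.001, 0.01, 0.2)}
best_lr = min(gd_results, key=lambda lr: mse(gd_results[lr]))

for lr, w in gd_results.items():
    print(f'learning_rate={lr}: w={w:.4g}, mse={mse(w):.4g}')
print(f'Best learning rate of those tried: {best_lr}')
```

Too small a learning rate leaves the weight short of its target, too large a rate makes it blow up, and the middle value gets close: that is the knob-turning in miniature.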


# General Data Notation

## As you would see it in a pandas DataFrame or numpy array:
<img src='./diagrams/data-and-labels.png' style="width: 600px;">

[Image source](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch01/ch01.ipynb)



## As you would see it expressed mathematically in matrix notation
Most machine learning is based on linear algebra and calculus. If it's been a while since you've seen either of those types of problems, a quick refresher may be helpful.

#### General form for the feature matrix (*m* samples and *n* features)
$$\begin{equation*}
A_{m,n} = 
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{m,1} & a_{m,2} & \cdots & a_{m,n} 
\end{pmatrix}
\end{equation*}
$$

#### iris (150 samples, 4 features)
$$\begin{equation*}
iris_{150,4} = 
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,4} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,4} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{150,1} & a_{150,2} & \cdots & a_{150,4} 
\end{pmatrix}
\end{equation*}
$$

#### Target variable general
$$\begin{equation*}
y_{m} = 
\begin{pmatrix}
y_{1} \\
y_{2} \\
\vdots  \\
y_{m} 
\end{pmatrix}
(y\in [class_{1}, class_{2}, ..., class_{z}])
\end{equation*}
$$

#### Target variable for iris (150 samples)
$$\begin{equation*}
species_{150} = 
\begin{pmatrix}
species_{1} \\
species_{2} \\
\vdots  \\
species_{150} 
\end{pmatrix}
(y\in [setosa, versicolor, virginica])
\end{equation*}
$$
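
The same layout in plain Python, as a minimal sketch: rows are samples, columns are features, and the target has one entry per row. The three rows below are illustrative values in the iris layout, not a load of the real dataset, and the names `toy_X` and `toy_y` are made up for this example.

```python
# Three illustrative rows in the iris layout: each row is a sample, each column a feature.
toy_X = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
toy_y = ['setosa', 'versicolor', 'virginica']

m_samples, n_features = len(toy_X), len(toy_X[0])
print(f'X has shape ({m_samples}, {n_features}); y has length {len(toy_y)}')

# a_{2,3} in the 1-indexed matrix notation is toy_X[1][2] with Python's 0-indexing.
print(toy_X[1][2])
```

The off-by-one between math notation and Python indexing is worth keeping in mind when translating formulas into code.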




# Typical Pipeline Flow
The algorithms/models are just one piece of the machine learning puzzle. There are many other components around organizing your data, defining your experiment, transforming your data into features, and evaluating your results.

### Sample Supervised Workflow
<img src='./diagrams/ml-pipeline2.png'>

# Best Practices  
- Preprocessing.  
- Determine how to design your experiment.  
- Training and selecting models.  
- Evaluating models.  
- Reproducibility and communicating results.

## No Free Lunch
Popularized by David Wolpert. See [*The Lack of A Priori Distinctions Between Learning Algorithms*, D.H. Wolpert, 1996](https://ieeexplore.ieee.org/document/6795940)

>I suppose it is tempting, if the only tool you have is a hammer, to treat everything like a nail.
<br><br>Abraham Maslow, 1966

### No single model will perform the best across all problems. 
### You should compare at least a couple of different models when you are trying to solve a problem.

# Must Bring Balance to the Force 
<img src='./diagrams/bias-error.png'  style="width: 700px;">

[Image Source - By Bigbossfarin - Own work, CC0](https://commons.wikimedia.org/w/index.php?curid=105307219)

# Algorithms We'll Explore in Future Classes

[Linear Regression](https://en.wikipedia.org/wiki/Linear_regression)
<br>[Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression)
<br>[Support Vector Machines](https://en.wikipedia.org/wiki/Support-vector_machine)
<br>[k-Nearest Neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
<br>[Random Forest](https://en.wikipedia.org/wiki/Random_forest)
<br>[Decision Trees](https://en.wikipedia.org/wiki/Decision_tree_learning)
<br>[Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
<br>[Neural Networks](https://en.wikipedia.org/wiki/Neural_network)
<br>[Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis)
<br>[Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
<br>[Agglomerative Clustering](https://en.wikipedia.org/wiki/Cluster_analysis#agglomerative_clustering)
<br>[Density-based Clustering](https://en.wikipedia.org/wiki/DBSCAN)
<br>[k-Means Clustering](https://en.wikipedia.org/wiki/K-means_clustering)

# Readings

[Raschka's Introduction to Machine Learning Notes](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/01_overview/01-ml-overview__notes.pdf)<br>
[Hundred-Page Machine Learning Book: Introduction to Machine Learning](https://www.dropbox.com/s/lrhtt1wkffnm4fe/Chapter1.pdf?dl=0)<br>
[Hundred-Page Machine Learning Book: Notation](https://www.dropbox.com/s/0cprdghmnzpck8h/Chapter2.pdf?dl=0)

# Resources
[Getting started with scikit-learn](https://scikit-learn.org/stable/getting_started.html)
<br>[Glossary](https://scikit-learn.org/stable/glossary.html)
<br>[Don't fall into the pit!](https://scikit-learn.org/stable/common_pitfalls.html)
<br>[Basic tutorial](https://scikit-learn.org/stable/tutorial/basic/tutorial.html)
<br>[Supervised learning](https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html)
<br>[Unsupervised learning](https://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html)
<br>[Putting it together](https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html)
