<center><h1>Course Wrap-Up</h1></center>

#1. A review of the 3 algorithms we have covered

##KNN

KNN remains popular and there is no shame in using it. It can be a fast way to get a ballpark accuracy score, which makes it a useful baseline when trying more complicated algorithms, e.g., ANNs.

What you need: a matrix. Fairly easy to get from a pandas table. Note this matrix is the same as you would feed to an ANN.
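To make the "matrix in, prediction out" idea concrete, here is a minimal KNN sketch in plain Python. The toy matrix, the labels, and the function name `knn_predict` are made up for illustration; Euclidean distance and a simple majority vote are assumed.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest rows."""
    # Euclidean distance from the query to every row of the training matrix.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    # Keep the k closest rows and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy matrix: each row is a sample, each column an already-normalized feature.
train_rows = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
train_labels = [0, 0, 1, 1]

prediction = knn_predict(train_rows, train_labels, [0.85, 0.75], k=3)
```

With a pandas table, the rows of the matrix would come from the table's numeric columns instead of a hand-written list.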

##Naive Bayes

A good contender for text problems. The main cost is building the bag-of-words table.

Its shortcomings, e.g., words with a zero count, can be dealt with using things like smoothing.

What you need: a word bag with "author" counts. spaCy gives you tools to parse words from text.
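As a rough sketch of how the word bag and smoothing fit together, the code below builds per-author word counts and scores a new sentence with add-one (Laplace) smoothing. Everything here is illustrative: the sample sentences and function names are invented, and `str.split` stands in for a real parser such as spaCy.

```python
import math
from collections import Counter, defaultdict

def build_word_bag(samples):
    """Count how often each word appears for each author."""
    word_bag = defaultdict(Counter)
    author_counts = Counter()
    for text, author in samples:
        author_counts[author] += 1
        word_bag[author].update(text.lower().split())
    return word_bag, author_counts

def predict_author(text, word_bag, author_counts):
    """Return the author with the highest smoothed log-probability."""
    vocabulary = {word for counts in word_bag.values() for word in counts}
    total_samples = sum(author_counts.values())
    scores = {}
    for author, counts in word_bag.items():
        total_words = sum(counts.values())
        # Start from the log of the prior: how common is this author?
        score = math.log(author_counts[author] / total_samples)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[author] = score
    return max(scores, key=scores.get)

samples = [
    ("the raven tapped at my chamber door", "poe"),
    ("a dreary midnight and a raven", "poe"),
    ("the monster opened its yellow eye", "shelley"),
    ("my creature the wretched monster", "shelley"),
]
word_bag, author_counts = build_word_bag(samples)
guess = predict_author("the raven at midnight", word_bag, author_counts)
```

Working with logs rather than raw probabilities avoids multiplying many tiny numbers together.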

##Artificial Neural Net

A good modeling technique on its own, and also the starting point for recent advances in machine learning, e.g., CNNs, RNNs, transfer learning.

One problem is the complexity of set-up. Finding the libraries you need, and learning how to use the methods in those libraries, is tedious. I suspect it will get better as the field matures.

Another problem is the large (huge) hyperparameter space. There are tools like grid search that support searching this space.
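A minimal sketch of what a grid search does, assuming you can score any one combination of hyperparameter values. The `toy_score` function is a hypothetical stand-in for "train a model with these values and return its validation accuracy"; the hyperparameter names and values are also made up.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination of values and keep the best-scoring one."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

def toy_score(learning_rate, hidden_nodes):
    # Hypothetical stand-in for training a model and measuring accuracy.
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(hidden_nodes - 16) / 100

param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_nodes": [8, 16, 32],
}
best_params, best_score = grid_search(param_grid, toy_score)
```

The number of combinations is the product of the list lengths (9 here), which is why the search gets expensive quickly as you add hyperparameters.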

A third problem is a complex error/loss space. Finding the true (global) error minimum versus a local minimum ends up being another search problem.
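To see the local-minimum issue in miniature, here is a sketch using a made-up one-dimensional loss with two valleys. Plain gradient descent ends up in whichever valley is downhill from its starting point, so trying several starts becomes a search of its own. The loss function, learning rate, and starting points are all assumptions chosen for illustration.

```python
def loss(x):
    # A made-up loss curve with one deep valley and one shallow valley.
    return x**4 - 3 * x**2 + x

def gradient(x):
    # Derivative of the loss above.
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=500):
    """Repeatedly step downhill from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Same algorithm, two starting points, two different answers.
starts = [-2.0, 2.0]
results = [gradient_descent(start) for start in starts]

# Keep whichever ending point has the lower loss.
best = min(results, key=loss)
```

Starting at 2.0 lands in the shallow valley; starting at -2.0 lands in the deeper one.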

What you need: a matrix à la KNN, plus a model architecture. Once you decide on values for all the hyperparameters, actually building a model is straightforward.

#2. Algorithms we did not cover

##Logistic regression

A close relative of linear regression that is well suited for (binary) classification problems. Shares some of the same ideas as ANNs, e.g., searching for the optimal values for a set of weights using gradient descent. In fact, you can implement logistic regression with an ANN!
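The link to ANNs can be sketched directly: logistic regression behaves like a single node with a sigmoid activation, trained by gradient descent on its weights. The code below is a bare-bones illustration on invented data (hours studied versus pass/fail); the learning rate and epoch count are arbitrary choices, not tuned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(row, weights, bias):
    """Weighted sum of the inputs, squashed into the range 0 to 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)

def train_logistic(rows, labels, learning_rate=0.1, epochs=5000):
    """Batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(rows, labels):
            # For log loss, the gradient at the node is (prediction - label).
            error = predict_proba(row, weights, bias) - label
            for i, x in enumerate(row):
                grad_w[i] += error * x
            grad_b += error
        # Step each weight a little way downhill.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Invented data: hours studied, and whether the student passed (1) or not (0).
rows = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic(rows, labels)
```

A prediction above 0.5 is read as class 1, below 0.5 as class 0.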

##Support Vector Machines

Powerful if you can find the right kernel. Can also use gradient descent to search for weight values.

<img src='https://miro.medium.com/proxy/1*3t_Gn5yuirT6fSC-sbxKAA.png' height=250>

##Decision Trees, Random Forests, Boosting

This remains a popular set of techniques. Breaking them out:

* **Decision tree**. Each node in the tree asks a yes/no question and shuffles you down one path or the other. Eventually you arrive at a prediction node (a leaf of the tree). Very good at certain types of non-linear problems.

<img src='https://i.pinimg.com/originals/96/2f/a1/962fa1e5b1e3072cbc911cb158915606.png' height = 250>

* **Random forest**. More recently, someone suggested building many decision trees and using them for a kind of crowd-sourcing. The trees are built randomly in terms of which questions get asked at each node. At prediction time, each tree gets one vote on the answer.

<img src='https://miro.medium.com/max/1170/1*58f1CZ8M4il0OZYg2oRN4w.png' height=250>

* **Boosting**. Even more recently, someone suggested that when building the forest, do it in a thoughtful way. Build the first tree and test it. Note what samples it gets wrong. Give those samples a higher weight. Build the next tree, which now pays attention to the weighted samples (i.e., the errors of the first tree). Hopefully the second tree will do better on the errors of the first. But it will produce new errors. Weight again and keep going until you get tired. You now have a forest where each tree is particularly good at something.

<img src='https://www.dropbox.com/s/3wxcfbg1f3rwjsb/Screenshot%202020-06-02%2011.20.02.png?raw=1' height=250>
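The reweighting loop described in the boosting bullet can be sketched as follows. This is an AdaBoost-style illustration, which is an assumption on my part since the notes do not name a specific boosting algorithm. Each "tree" is a one-question stump (`x <= threshold?`), labels are +1/-1, and the data is a toy set that no single stump can get fully right.

```python
import math

def stump_predict(x, threshold, left_label):
    """A one-question tree: answer left_label if x <= threshold, else the opposite."""
    return left_label if x <= threshold else -left_label

def best_stump(xs, ys, weights):
    """Find the stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for left_label in (-1, 1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, left_label) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, left_label)
    return best

def boost(xs, ys, rounds=3):
    """Build stumps one at a time, reweighting the samples after each."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, left_label = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        say = 0.5 * math.log((1 - error) / error)  # how much this stump's vote counts
        ensemble.append((say, threshold, left_label))
        # Misclassified samples get heavier, correct ones lighter.
        weights = [
            w * math.exp(-say * y * stump_predict(x, threshold, left_label))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote across all stumps."""
    score = sum(say * stump_predict(x, thr, left) for say, thr, left in ensemble)
    return 1 if score >= 0 else -1

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = boost(xs, ys, rounds=3)
```

Unlike a random forest, the votes here are weighted: a stump that made fewer weighted errors gets more say in the final answer.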

##Probabilistic graphical models

One of our faculty specializes in this area and teaches a course in it. I view it as a more human-understandable model than Naive Bayes.

<img src='https://miro.medium.com/max/1920/0*Ws_1F7ZFdbqph4Y8.' height=250>

If anyone is interested, this is the book the course uses. I saw it on sale for $20.
<hr>
<img src='https://www.dropbox.com/s/mo1grp5oe4z7ij9/Screenshot%202020-06-02%2013.58.25.png?raw=1' height=250>

#3. Data wrangling

This has become a debatable need for me. There are certain things that have to happen, including filling empties and normalizing.
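As a small sketch of those two must-do steps, the code below fills empty cells with the column mean and then rescales the column to the range 0 to 1. Mean filling and min-max normalization are just one common choice each, assumed here for illustration; the `ages` column is invented.

```python
def fill_empties(column):
    """Replace missing values (None) with the mean of the known values."""
    known = [value for value in column if value is not None]
    mean = sum(known) / len(known)
    return [mean if value is None else value for value in column]

def min_max_normalize(column):
    """Rescale values so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column carries no information; map it all to 0.0.
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]

ages = [20, None, 40, 60]
filled = fill_empties(ages)
normalized = min_max_normalize(filled)
```

Normalizing matters most for distance-based methods like KNN, where a column with a large range would otherwise dominate the distance.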

Prior to ANNs, there was lots of effort devoted to feature engineering: reducing features, combining features, etc. I view ANNs as doing this for us and doing a better job.

If you are not using ANNs and not using text (talking to you regression folks out there), then all the wrangling books will still be useful to you.


#4. Judging accuracy

##Unseen (future) data

The problem is that we have a set of labeled samples to work with now. We can compute accuracy with this data. But will that accuracy hold up as new, unlabeled data starts to arrive? Typically not.

The standard way to deal with this is the **holdout** method. Split the labeled samples into training and testing sets. Train your model (e.g., construct your KNN matrix, your NB word bag, your ANN) on the training set. Then run the test set through your model and compute accuracy from that. The result is typically much closer to reality.
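The holdout steps can be sketched in a few lines. The "model" here is a deliberately trivial threshold rule on invented data, used only so the split-train-test flow is visible; the 25% test fraction and the fixed seed are arbitrary assumptions.

```python
import random

def holdout_split(samples, test_fraction=0.25, seed=0):
    """Shuffle the labeled samples, then split into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Toy model: a cutoff halfway between the two class means."""
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def accuracy(threshold, samples):
    """Fraction of samples where the cutoff rule matches the label."""
    correct = sum(1 for x, label in samples if (x >= threshold) == (label == 1))
    return correct / len(samples)

# Invented labeled data: the label is 1 when x is 50 or more.
samples = [(x, 1 if x >= 50 else 0) for x in range(100)]

train, test = holdout_split(samples)
threshold = fit_threshold(train)          # the model only ever sees the training set
test_accuracy = accuracy(threshold, test)  # accuracy is measured on unseen samples
```

The key discipline is that nothing from the test set is used while fitting.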

##Overfitting

The tendency for all machine learning algorithms is to get very good at predicting on the labeled data they have to work with. Too good. They tend to use minutiae from the labeled data that do not generalize well.

The holdout method is the easiest way to catch this. Other tools include cross-validation and, from the work on ANNs, dropout layers.

#5. Machine learning in society


##Human bias
There are two ways that human bias can slip into a machine learning algorithm. First, the data that is collected can reflect bias. If we scan the web for text to use for training, then the biases seen on the web will be reflected in our data. Or if we collect hiring data from a big tech firm, then hiring biases (male preference) will be reflected in our data.

The second form of bias is in labeling. Humans typically label data. For offensive language in tweets, who decides to label a tweet good or bad? If you are focusing on racist language, what are the qualifications of the labeler of that data?

##Explainability

Given that a machine learning model may be used to make sensitive decisions, can we determine the rationale for a model's decision? This is a particular problem in ANNs, where the model can be boiled down to hundreds if not thousands of weight values. Can we trust a model that we do not understand?

Ramon Alvarado, a professor in our Philosophy Department, teaches courses on this topic, as well as bias and privacy. Ramon impresses me. He knows the technical side of machine learning and the philosophy/ethics side of placing the tech in larger society. He teaches courses that have few prereqs, so you might check them out.

<img src='https://www.dropbox.com/s/7e2duwlxkb4g1m8/Screenshot%202020-06-02%2014.18.27.png?raw=1'>

##The cost of errors

The confusion matrix breaks out 4 cases. It is possible that the 2 error cases (false positives and false negatives) have different societal costs. The Pima Diabetes study is a good example. Medical ethics can easily become involved.
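One way to make the cost of errors visible is to attach a cost to each of the 4 cases and total them up. The sketch below does that on invented predictions. The cost values are purely hypothetical (here a missed case is assumed to be 5 times as costly as a false alarm); choosing real values is exactly where the ethics come in.

```python
def confusion_matrix(actual, predicted):
    """Count the 4 cases for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # correctly flagged
        elif a == 0 and p == 1:
            counts["FP"] += 1   # false alarm
        elif a == 1 and p == 0:
            counts["FN"] += 1   # missed case
        else:
            counts["TN"] += 1   # correctly cleared
    return counts

# Invented labels and predictions.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]

# Hypothetical costs: a missed case hurts more than a false alarm.
costs = {"TP": 0, "TN": 0, "FP": 1, "FN": 5}

counts = confusion_matrix(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
total_cost = sum(counts[case] * costs[case] for case in counts)
```

Two models with the same accuracy can have very different total costs, depending on which kind of error they tend to make.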

#6. Where from here?


##From a practical view

On the Python side, I'm not sure I would recommend a straight Python course. You want Python for machine learning. What you really will need is familiarity with the Python libraries I was hiding from you in puddles functions, e.g., TensorFlow, scikit-learn. There is lots of online material on these.

On the machine learning side, it depends on what type of data you will end up working with. I would recommend ANNs for almost everything, including regression! But that is an ANN fanboy speaking :) Lots and lots of online courses and material on ANNs.

One note on Colab. It is possible to run a Jupyter server on your own laptop. Then you have no need for an Internet connection. You can run your notebooks for as long as you like. They won't die after a certain amount of time has elapsed. Plus your notebooks will have features not found in Colab, a spellchecker being the biggest one for me. I downloaded the Anaconda bundle that includes Python and a Jupyter server packaged together.



##From the 10,000-foot view

Some of the luminaries in the field are starting to question what we are doing. They note that humans do not take 1000s of samples over 1000s of epochs to learn. Humans can generalize from many fewer examples.

And there is the whole common-sense dilemma. A 5-year-old has more common sense than an ANN with a gazillion layers. We seem to be missing something fundamental in our algorithms.

It's an exciting time to be in AI :)