<a href="https://colab.research.google.com/github/jfogarty/machine-learning-intro-workshop/blob/master/syllabus-sep-2019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Base2 Solutions Machine Learning Workshop

## September 6th and 13th

The workshop runs on two Friday afternoons/evenings over a two-week period.  Please bring your own laptop and power cord.  Make sure your Google account is working and that you can access [Google's Colab](https://colab.research.google.com) before you get to class.

This hands-on workshop will be held at Base2 in Bellevue. The sessions will also be live-streamed over Zoom, and the contents will be available afterwards for those who can't attend one or both sessions.
 




# Session 1 - Friday September 6th


- Learning to manipulate and represent data
  - numpy arrays
  - using pandas and other tools


- Importing and running a model
  - Linear Regression (Or maybe something with MNIST?)
  - High level overview of basic syntax and concepts
  - Splitting data into training and test
  - Training the model
  - Test/Validate the model
  - Graph the model w/data
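
The split/train/test steps above can be sketched with plain NumPy. This is a minimal illustration on synthetic data, not the workshop's notebook code: the data, the 80/20 split, and the helper name `fit_line` are assumptions made for the example. Graphing is left out to keep the sketch free of extra dependencies.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
X = rng.uniform(0, 10, size=200)
y = 3.0 * X + 2.0 + rng.normal(0, 1.0, size=200)

# Shuffle the indices, then split 80% / 20% into training and test sets.
order = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = order[:split], order[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    A = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    return slope, intercept


# Train on the training set only.
slope, intercept = fit_line(X_train, y_train)

# Test on the held-out data using mean squared error.
test_mse = np.mean((slope * X_test + intercept - y_test) ** 2)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={test_mse:.2f}")
```

Because the test examples never influence the fit, the test error gives a fairer picture of how the model behaves on new data than the training error would.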


- Learning more about data
  - Intro: Linear, Logistic Regression and Perceptrons
  - Randomize Data
  - Normalizing Data
    - Scaling Feature values
       - We want to prevent some features from carrying more weight than they should just because they are recorded on a much larger scale (e.g. 0 to 500000)
       - Scale all values to a similar range (-1 to +1, -3 to +3, etc.)
       - We can use the Z score of the value
         - `scaled_value = (value - mean) / stddev`
         - Usually ends up between -3 and +3

    - Handling Extreme Outliers
       - We can take the log of the values, which is better but still leaves a tail
       - We can also clip feature values at a chosen number; any value beyond it becomes the clipping value (if we clip at 4, any value >4 becomes 4)

    - Mapping Categorical Values
       - We can map a list of string names to numeric values, which allows us to multiply the weights appropriately
       - This mapping is often not the right solution, however, because we usually don't want our categories directly linked. Street names shouldn't share one weight across the board, as there is usually no direct linkage between any two streets and home prices. Homes on Park Place may all be expensive, but a street like Baltic Avenue may have a completely different effect on price.
       - It is better to make each category its own binary feature. All houses where isOnParkPlace=True should have a higher weight for house price, but houses on Baltic Avenue should not be modified by the Park Place weight; instead they are weighted by their own feature, isOnBalticAvenue
       - Separating our data out like this also allows us to have homes in multiple categories (e.g. on the corner of a cross street)

    - Binning
       - Add values into a bin, and make each bin a boolean value
       - Things like latitude and longitude don't make sense as floating point values, as there is no good linear relationship (a higher latitude does not imply a higher house price)
       - Group the latitude into bins (latitude 31-33 may correspond to a higher house price, especially when crossed with longitude 52-54) (more on feature crosses later?)
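
The data preparation steps above (Z-score scaling, log transform, clipping, one binary feature per category, and binning) can be sketched as follows. The sample values, the bin edges, and the helper names `z_score`, `one_hot`, and `bin_one_hot` are assumptions chosen for illustration, not part of the workshop material.

```python
import numpy as np


def z_score(values):
    """Scale values to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def one_hot(labels):
    """Turn each distinct category into its own 0/1 column."""
    categories = sorted(set(labels))
    table = np.array([[int(label == c) for c in categories] for label in labels])
    return categories, table


def bin_one_hot(values, edges):
    """Assign each value to a bin, then one-hot encode the bin index."""
    idx = np.digitize(values, edges)  # bin indices run from 0 to len(edges)
    return np.eye(len(edges) + 1, dtype=int)[idx]


# Hypothetical sample data.
prices = [100_000, 150_000, 200_000, 250_000, 500_000]
streets = ["Park Place", "Baltic Avenue", "Park Place"]
latitudes = [31.5, 32.9, 34.2, 36.0]

scaled = z_score(prices)                 # comparable scale across features
logged = np.log(prices)                  # compresses the long tail
clipped = np.clip(scaled, -1.0, 1.0)     # values beyond the bounds are capped
categories, street_features = one_hot(streets)
lat_bins = bin_one_hot(latitudes, edges=[32, 33, 34])

print(categories)
print(street_features)
print(lat_bins)
```

Each row of `street_features` and `lat_bins` has exactly one 1, so the model can learn a separate weight per street or per latitude band instead of one weight for the raw value.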

- Cleaning Data
  - Search for duplicates, missing values, and outliers; compare the training and validation sets to confirm they are distinct yet similar enough
  - Feature values should appear with non-zero values more than a small handful of times in the dataset
  - Features should have a clear, obvious meaning
  - Features shouldn't have magic values (e.g. -1 denoting a special or undefined value); instead, add a separate boolean feature to show that the value is undefined
  - The definition of a feature shouldn't change over time
  - Features should not have outlier values
  - Unique identifiers make bad features because each one appears only once and adds no comparable information for the model
  - Scrubbing
    - We need to scrub our data for inconsistencies and errors
    - Omitted Values: Someone forgot to enter a field
    - Duplicate examples: we received a value twice
    - Bad labels: A person mislabeled a picture of a dog as a cat
    - Bad feature values: Someone typed in an extra digit accidentally
  - Histograms and summary statistics can help us realize when things are wrong
    - Maximum and minimum
    - Mean and median
    - Standard deviation
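
A small scrubbing pass along the lines described above might look like this. The records and field layout are hypothetical; the sketch only shows how duplicates, omitted values, and a suspicious feature value can surface through simple checks and summary statistics.

```python
import statistics

# Hypothetical raw records: (house_id, bedrooms); None marks an omitted value.
rows = [(1, 3), (2, 4), (2, 4), (3, None), (4, 2), (5, 40)]

# Duplicate examples: keep only the first occurrence of each row.
seen, unique_rows = set(), []
for row in rows:
    if row not in seen:
        seen.add(row)
        unique_rows.append(row)

# Omitted values: separate the rows that are missing the feature.
complete = [r for r in unique_rows if r[1] is not None]
missing = [r for r in unique_rows if r[1] is None]

# Summary statistics over the remaining values.
bedrooms = [b for _, b in complete]
summary = {
    "min": min(bedrooms),
    "max": max(bedrooms),
    "mean": statistics.mean(bedrooms),
    "median": statistics.median(bedrooms),
    "stddev": statistics.stdev(bedrooms),
}
print(summary)
```

A mean far above the median, together with a maximum of 40 bedrooms, hints that one value was probably mistyped and deserves a closer look.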

- Feature Crosses/Synthetic Features
  - Using feature crosses to get even better features
  - Cross multiple features to link the features together
    - The number of bedrooms in one city could mean something different than the number of bedrooms in another city; crossing the two features links that data
    - Be careful with feature crosses: some add unnecessary complexity to the model
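
As a rough sketch of the bedrooms-by-city cross described above, each (city, bedrooms) combination can become its own binary feature. The city names, bedroom counts, and the helper `cross_row` are assumptions made for illustration.

```python
from itertools import product

# Hypothetical categorical features for three houses.
cities = ["Seattle", "Bellevue", "Seattle"]
bedrooms = [2, 3, 3]

# Every (city, bedrooms) combination becomes its own binary feature.
cross_names = [
    f"{city}_x_{count}br"
    for city, count in product(sorted(set(cities)), sorted(set(bedrooms)))
]


def cross_row(city, n_bedrooms):
    """One-hot encode the crossed feature for a single house."""
    active = f"{city}_x_{n_bedrooms}br"
    return [int(name == active) for name in cross_names]


crossed = [cross_row(c, b) for c, b in zip(cities, bedrooms)]
print(cross_names)
print(crossed)
```

The number of crossed columns is the product of the category counts (2 x 2 = 4 here), which is why crossing features with many categories can add a lot of complexity to the model.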

- Exercise: Look at the dataset histogram and see if we find any data that needs cleaning/scrubbing (use a dataset with omitted or outlier values)
- Exercise: Take a floating-point feature and bin it, creating a new feature
- Exercise: Play with normalization functions and look at the data output
- Exercise/Discussion: What would a good feature cross look like over our sample dataset? Implement it?

- Importing and training our model (logistic regression? Maybe classification?)
  - Import model
- Hyperparameters
  - Step Size
- Gradient Descent
  - Backpropagation
  - How weights change
- Batch size
  - How much data do we train on?
    - Why don't we train on the entire set?
- \# of epochs
  - How long do we train?
  - Mention of early stopping/regularization
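
To make step size, batch size, and number of epochs concrete, here is a minimal mini-batch gradient descent sketch for a one-feature linear model. The synthetic data and the specific hyperparameter values are assumptions for illustration. With a single weight and bias there are no layers to backpropagate through, so the gradients are written out directly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumed for illustration): y = 2x + 1 plus a little noise.
x = rng.uniform(-1, 1, size=256)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, size=256)

# Hyperparameters.
step_size = 0.1   # how far each update moves the weights (the learning rate)
batch_size = 32   # examples used for each gradient estimate
epochs = 50       # full passes over the training data

w, b = 0.0, 0.0   # the weights start at zero
losses = []

for epoch in range(epochs):
    order = rng.permutation(len(x))  # reshuffle the data every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (w * x[idx] + b) - y[idx]
        # Gradients of mean squared error with respect to w and b.
        grad_w = 2.0 * np.mean(error * x[idx])
        grad_b = 2.0 * np.mean(error)
        # Move the weights against the gradient.
        w -= step_size * grad_w
        b -= step_size * grad_b
    # Record the loss over the full data set once per epoch.
    losses.append(float(np.mean((w * x + b - y) ** 2)))

print(f"w={w:.3f} b={b:.3f} final loss={losses[-1]:.4f}")
```

Each batch gives a cheap, noisy estimate of the full gradient, which is one reason not to compute every update on the entire set. Changing `step_size`, `batch_size`, or `epochs` and watching `losses` is a quick way to see how each hyperparameter affects training.
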
- Understanding of overfitting
  - How do we optimize our model to limit both loss and complexity? (regularization)
  - Regularization is a way to measure (and then penalize) model complexity
  - Be careful about overfitting: do not overtrain on your training data
  - Either stop early (not the best option, but often used) or…
  - Penalize model complexity
    - Avoid complexity where possible (structural risk minimization)
    - Prefer smaller weights in our model
  - Model complexity can be seen as either of the following
    - A function of the weights of all the features of the model
    - A function of the total number of features with nonzero weights
  - L2 Regularization
    - Defines the regularization term as the sum of the squares of all the feature weights, and penalizes that sum
    - Encourages weight values toward 0, but not exactly 0
    - Encourages the mean of the weights toward 0, with a normal distribution
  - Lambda (Regularization Rate)
    - Lets model developers tune the overall impact of the regularization term by multiplying its value by a scalar known as lambda
    - Increasing lambda strengthens the regularization effect
- Touch on other types of regularization as well?
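
The L2 term and the role of lambda can be illustrated with a few lines of Python. The weights, the data loss, and the lambda values below are made-up numbers, and `regularized_loss` is a hypothetical helper that simply adds lambda times the penalty to the data loss.

```python
def l2_penalty(weights):
    """Sum of the squared weights."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Total objective: data loss plus lambda times the L2 penalty."""
    return data_loss + lam * l2_penalty(weights)


weights = [0.2, -0.5, 5.0, 0.1]  # hypothetical weights; one is much larger
data_loss = 1.0                  # hypothetical loss on the training data

penalty = l2_penalty(weights)
weak = regularized_loss(data_loss, weights, lam=0.01)
strong = regularized_loss(data_loss, weights, lam=0.1)
print(penalty, weak, strong)
```

Almost all of the penalty comes from the single large weight, because squaring punishes large values far more than small ones. Raising lambda increases how much that penalty counts against the model.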
 
 
 




- Loss
  - Loss functions provide more than a static measure of how your model is performing; they are how your algorithms fit data in the first place. Most machine learning algorithms use some sort of loss function during optimization, that is, when finding the best parameters (weights) for your data.
  - Penalty between a model's predictions and the truth
  - Helps us understand how well a model fits our data
  - Explanation of the different ways loss is measured and what they mean
  - Key to training our model predictions
  - Mean Squared Error (MSE) is the workhorse of basic loss functions: it's easy to understand and implement, and it generally works pretty well. To calculate MSE, take the difference between each prediction and the ground truth, square it, and average the results across the whole dataset.
  - The likelihood function is also relatively simple and is commonly used in classification problems. The function takes the predicted probability for each input example and multiplies them together.
  - For example, consider a model that outputs probabilities of $[0.4, 0.6, 0.9, 0.1]$ for the ground truth labels of $[0, 1, 1, 0]$.
    - The likelihood loss would be computed as $(0.6) \times (0.6) \times (0.9) \times (0.9) = 0.2916$.
    - Since the model outputs probabilities for TRUE (or 1) only, when the ground truth label is 0 we take $(1-p)$ as the probability. In other words, we multiply together the probabilities the model assigned to the actual outcomes.
  - LogLoss:
    - $\text{LogLoss} = \sum_{(x,y) \in D} -y \log(y') - (1 - y) \log(1 - y')$
    - $(x, y) \in D$, where $D$ is the data set containing many labeled examples
    - $y$ is the label in a labeled example; it must be 0 or 1
    - $y'$ is the predicted value (somewhere between 0 and 1), given the set of features in $x$
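
The three loss measures above can be checked numerically. The sketch below reuses the probabilities and labels from the likelihood example. The log loss is assumed to be the standard summed form, $-y \log(y') - (1 - y) \log(1 - y')$ added up over all examples, which matches the symbol definitions above. The function names are illustrative.

```python
import math


def mse(predictions, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def likelihood(probs, labels):
    """Product of the probabilities assigned to the actual outcomes."""
    result = 1.0
    for p, y in zip(probs, labels):
        result *= p if y == 1 else 1.0 - p
    return result


def log_loss(probs, labels):
    """Sum of -y*log(p) - (1-y)*log(1-p) over all examples."""
    return sum(
        -y * math.log(p) - (1 - y) * math.log(1 - p)
        for p, y in zip(probs, labels)
    )


probs = [0.4, 0.6, 0.9, 0.1]
labels = [0, 1, 1, 0]

print(likelihood(probs, labels))
print(log_loss(probs, labels))
print(mse(probs, labels))
```

Summed log loss is the negative logarithm of the likelihood, so a higher likelihood always means a lower log loss. Working with the sum of logs avoids multiplying many small probabilities together.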


- Optimizers
  - Discuss different optimizers and how to choose between them.
    - Optimization functions usually calculate the gradient, i.e. the partial derivatives of the loss function with respect to the weights, and then modify the weights in the opposite direction of the gradient. This cycle repeats until we reach a minimum of the loss function.
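
The update cycle described above can be shown on a toy problem with a single weight. The loss function $(w - 3)^2$, the starting point, and the step size are assumptions chosen so the minimum (at $w = 3$) is known in advance. Real optimizers add refinements on top of this basic step.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0
step_size = 0.1
history = [loss(w)]

for _ in range(100):
    w -= step_size * gradient(w)  # move opposite to the gradient
    history.append(loss(w))

print(f"w={w:.6f} loss={history[-1]:.3e}")
```

The gradient is negative while `w` is below 3, so subtracting it pushes `w` upward, and each step lands closer to the minimum.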


- Logging/illustrating the training process
- Understanding of precision/recall
- Understanding of ROC Curve and AUC
- Explain the cycle of training->testing->training->testing
- Exercise: Adjust hyperparameters to try and get the lowest loss
- Exercise: Use a different loss function and see how things change
  - Does training take longer? What happens to the weights with different loss functions?
- Exercise: Use different optimizers?
- Other topics to cover (need to find a good place for these)
  - Activation functions
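
For the precision/recall item listed above, a small worked example may help. Precision is the fraction of predicted positives that are truly positive, and recall is the fraction of true positives that the model finds. The labels, scores, and thresholds below are invented for illustration.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth labels and model scores.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.7, 0.4]

results = {}
for threshold in (0.5, 0.25):
    predictions = [int(s >= threshold) for s in scores]
    results[threshold] = precision_recall(labels, predictions)
    print(threshold, results[threshold])
```

Lowering the threshold finds more of the true positives (higher recall) at the cost of more false alarms (lower precision). Sweeping the threshold across its whole range is the idea behind the ROC curve.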


# Session 2 - Friday September 13th

- CNN? Or another Deep Learning/NN example.
- RNNs
- GANs


# Other Workshop Resources

- [The Workshop Github Repository](https://github.com/jfogarty/machine-learning-intro-workshop)

- [Python Machine Learning - 2nd Edition](https://github.com/rasbt/python-machine-learning-book-2nd-edition)

### End of notes.