In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

# What is Machine Learning (ML) ?
<ul><h3>prediction</h3></ul>

<img src="images/W1_L1_Predicting_on_Unseen-data_AdobeStock_321987501.png" width="60%"/>
<br>
<ul><h3><b>informed</b> prediction</h3></ul>
    <ul>
    <ul>- prediction is better than random guess</ul>
    <ul>- method: learn from existing data</ul>
        <ul>- goal: Generalization. Predicting on new, unseen data</ul></ul>
    
 <img src="images/W1_L1_Informed-Prediction_AdobeStock_339315304.png" width="60%"/>   
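
A minimal sketch of "informed prediction", assuming scikit-learn is installed. The synthetic dataset, the choice of `LogisticRegression`, and every parameter value are illustrative assumptions, not part of the lecture. It compares a uniform random guess with a model that learned from existing data, and scores both on data that was held out from training.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data (an assumption, for illustration only)
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Hold out 25% of the examples: these play the role of "new, unseen data"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Random guess: pick a label uniformly at random, ignoring the features
random_guess = DummyClassifier(strategy="uniform", random_state=0)
random_guess.fit(X_train, y_train)

# Informed prediction: learn the feature/label relationship from training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Accuracy on the held-out data measures generalization
baseline_acc = random_guess.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
print(f"random guess accuracy: {baseline_acc:.2f}")
print(f"trained model accuracy: {model_acc:.2f}")
```

The random guess should land near 50% on a balanced two-class problem, so any useful model has to beat that on the held-out examples.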

# Where is ML used ?

**Everywhere !!**

<ul><h3>targeted advertising</h3></ul>
    <ul>
    <ul>- Why does Facebook seem to know what I'm thinking ?</ul>
    <img src="images/W1_L1_Targeted_advertising_AdobeStock_320775942.png" width="40%"/>
    </ul>
<br><br>
<ul><h3>spam detection</h3></ul>
    <ul>
    <ul>- You are a winner !</ul>
    <img src="images/W1_L1_Spam_detection_AdobeStock_354630313.png" width="35%"/>
    </ul>
<br><br>
<ul><h3>forecasting</h3></ul>
    <ul>
        <ul>- Sales</ul>
        <ul>- Logistics</ul>
        <ul>- Where's my Uber ?</ul>
    </ul>
<img src="images/W1_L1_Forecasting_sales_AdobeStock_307330168-[Converted].png" width="40%"/>
<br><br>
<ul><h3>anomaly detection</h3></ul>
    <ul>
        <ul>- credit card fraud</ul>
    </ul>
<img src="images/W1_L1_Anomaly-detection_AdobeStock_311611620.png" width="40%"/>
<br> 

## Uses in Finance
- model prices, risk
    - hedging
- Trading signals
- forecast sales
- predict defaults, pre-payments
<br>
<img src="images/W1_L1_Trading_signals_AdobeStock_188236125.png" width="60%"/>


## Not just numeric data !
- Images
    - Satellite: 
        - Counting cars in a parking lot to forecast sales   
        - How full is that oil tank ?<br>
 <img src="images/W1_L1_Satellite_image_cars_AdobeStock_130994721.png" width="50%">
   <br> - Did the CFO really mean what he said ? facial signals for confidence/evasiveness
<br>- Text
    - Twitter sentiment as a signal ?
    - SEC filings
    - Derive industry groups by clustering press releases

# What you need to succeed ("Pre-requisites")
- An inquiring mind
    -  <img src=images/emoji/female-scientist-type-4_1f469-1f3fd-200d-1f52c.png width=50 align=left> &nbsp; 
 Approach this topic like a Scientist
 <br>
 &nbsp; Find a problem, gather data, formulate a hypothesis, test.  Repeat.
 <br>
 <br>
- Technical skills

    - <img src=images/emoji/steam-locomotive_1f682.png width=50 align=left> &nbsp;
You are engineers !
<br>
&nbsp; Solid programming skills.

<br>

- Some math/statistics

    - <img src=images/emoji/graduation-cap_1f393.png width=50 align=left> &nbsp;
To be a successful data scientist, you need to understand the machinery.
<br>
&nbsp; It is not enough to know an API.

<br>

- Self-motivation and energy
    - willingness to pick up tools/skills outside of lectures
    - You are engineers, nothing is too hard !

## Technical prerequisites

- Python
    - Object oriented (OO) Python
    - Numpy
    - Pandas
    - Matplotlib
- Some statistics (e.g., regression)
- Some math
    - comfort with Matrix/Vector notation
    
Fear not ! The second half of this lecture will be a whirlwind tour of the Python topics listed above (except for OO)
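
As a rough illustration of what "comfort with Matrix/Vector notation" means in practice, here is a small NumPy sketch. The numbers, and the names `X`, `w`, `b`, are made up for this example. It computes a linear prediction $\hat{y} = Xw + b$ for several examples at once and checks it against an explicit loop.

```python
import numpy as np

# Feature matrix X: one row per example, one column per feature (made-up values)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

w = np.array([0.5, -1.0])  # one weight per feature (hypothetical)
b = 2.0                    # intercept (hypothetical)

# Matrix/vector form: all three predictions in one expression
y_hat = X @ w + b

# The same computation written as an explicit loop over examples and features
y_hat_loop = np.array([
    sum(X[i, j] * w[j] for j in range(X.shape[1])) + b
    for i in range(X.shape[0])
])

print(y_hat)       # [ 0.5 -0.5 -1.5]
print(y_hat_loop)  # same values
```

Most of the models in the course can be written compactly in this matrix form, which is why the notation matters.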

# Textbooks

## Python Data Science Handbook
[VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/)
<table><tr><td>
<h3>- Online !
- Solid foundation to acquire pre-requisites
    - Jupyter
    - Numpy
    - Pandas
    - Matplotlib
    - Quick view of models</h3>
    </td>
    <td>
        <img src="images/W1_L1_Textbook_VanderPlas.png" width="40%" />
             </td></tr></table>
        


## Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition 

[Geron](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
<table><tr><td>
<h3>- Assumes you know the pre-requisites
    - More detailed chapters on various models</h3></td>
    <td><img src="images/W1_L1_Textbook_Geron.jpeg" width="70%"></td></tr></table>
    

# Notebooks
Both textbooks have code repositories on GitHub
 - The VanderPlas "book" is actually a set of notebooks !
 
The real learning comes from active "doing" (play with notebooks) rather than passive "reading"

**Get the notebooks in the repos !**
- `mkdir -p ~/Notebooks && cd ~/Notebooks`
- `git clone https://github.com/jakevdp/PythonDataScienceHandbook.git`
- `git clone https://github.com/ageron/handson-ml2.git`


## Accessing a repository on GitHub

If the `git` command is not available on your machine you can either
- install it via the command
>`conda install git`
- Download the repo as a ZIP file.  Visit the given URL
    - click on the <span style="background:green; color:white;">Code</span> button (labeled "Clone or download" in older versions of the GitHub interface)
    - Choose "Download ZIP"

**Get the lecture notebooks**
- `git clone https://github.com/kenperry-public/ML_Spring_2020.git`
- periodically refresh: run `git pull` from the repository's top-level directory

# Machine Learning *using* Scikit-Learn (sklearn), TensorFlow (TF)

sklearn is a popular library for Machine Learning.  We will be using it in the first part of the course.
TensorFlow is another popular library that we will use in the second part of the course.

We are learning **Machine Learning**, not sklearn/TensorFlow !

Tools are a means, not an end
- Goal is to understand ML independent of the toolset
- You can be an expert in sklearn/TensorFlow and still not understand ML

# Teaching method

**Iterative**: visit the problem many times, at increasing levels of focus
- top-down vs bottom-up
- Motivate: very high level view
    - know WHAT we are trying to achieve
- Understand: medium level view
- Deep understanding
    - math, statistics

<div class="alert alert-block alert-success">
Bonus: advice from a practitioner
</div>    

# Landscape of ML


 <table><tr><td width="25%">
    <ul><h3 align = "right">Types of learning</h3></ul><br>
    <ul>- Types of targets
    <ul>- continuous</ul>
    <ul>- discrete</ul></ul>
    <ul>- Types of features
    <ul>- numerical</ul>
        <ul  align = "left">- categorical</ul>
        <ul  align = "left">- text</ul></ul></td><td><img src=images/W1_L3_S9_Big_picture_ML_taxonomy.png width="80%"></td></tr></table>
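
To make the target and feature types above concrete, here is a small sketch, assuming pandas and scikit-learn are installed. The table, its column names, and its values are hypothetical. It shows one numerical, one categorical, and one text feature, a continuous and a discrete target, and one common way to turn the non-numerical features into numbers.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy table: every column name and value is made up
df = pd.DataFrame({
    "income": [52.0, 87.5, 31.2],              # numerical feature
    "sector": ["tech", "energy", "tech"],      # categorical feature
    "headline": ["beats estimates",            # text feature
                 "misses guidance",
                 "announces buyback"],
    "price_change": [0.8, -1.3, 0.2],          # continuous target (regression)
    "defaulted": [0, 1, 0],                    # discrete target (classification)
})

# Categorical feature: one indicator column per category (one-hot encoding)
sector_onehot = pd.get_dummies(df["sector"], prefix="sector")
print(sector_onehot)

# Text feature: one column per distinct word, holding word counts
vectorizer = CountVectorizer()
headline_counts = vectorizer.fit_transform(df["headline"])
print(headline_counts.shape)  # (3 examples, 6 distinct words)
```

A continuous target such as `price_change` leads to a regression task, while a discrete target such as `defaulted` leads to a classification task.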


# Challenges of ML

- You need data to train, often a lot of it
    - not always easy to get
        - supervised: needs to be labelled
    - quality issues
    - Is the training data representative of "the real world" for which you are designing ?
- Overfitting and Underfitting
    - Overfit: good training accuracy, poor generalization
    - Underfit: poor accuracy even on the training data; a lost opportunity
- Engineering meaningful features is key 
    - Data transformations
        - create features that aid prediction
        - art and science
    - Deep Learning may view feature engineering as part of the problem, not the solution !
- Testing and validation
    - an honest test uses held-out data
    - training data is a precious resource; painful to hold some out
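
The overfitting, underfitting, and held-out data points above can be sketched with NumPy alone. This is an illustration under stated assumptions: the noisy sine data, the sample sizes, and the polynomial degrees 1, 3, and 12 are all arbitrary choices, not from the lecture. Each polynomial is fit on the training data only and then scored on both the training data and the held-out data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine curve (synthetic data, for illustration)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(30)   # used for fitting
x_test, y_test = make_data(200)    # held out: never used for fitting

def mse(model, x, y):
    """Mean squared error of the model's predictions on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit using the training data only
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, held-out MSE = {test_err:.3f}")
```

Training error can only go down as the degree grows, so it cannot reveal overfitting on its own. The straight line (degree 1) underfits and does poorly on both sets. For the highest degree, the held-out error is the number to watch: a held-out error well above the training error is the usual sign of overfitting.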



# ML in one slide
<img src=images/Cartoon_ML.jpg>

In [2]:
print("Done")

Done
