In [None]:
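# Configure the RISE ("livereveal") slideshow settings for the current profile.
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile

cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))

# One update call covers all settings; each key is merged into the saved config.
cm.update('livereveal', {
    'theme': 'sans-serif',
    'transition': 'zoom',
    'start_slideshow_at': 'selected',  # start from the selected cell
    'scroll': True,                    # let tall slides scroll
    'width': None,                     # no fixed slide width
    'height': None,                    # no fixed slide height
});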

# Agenda

* Introduction
* Data Science 101
* Where do we go from here?

# Data Science 101

&nbsp;
&nbsp;

## Mark Wicks

# Goals
* What is data science?

* Types of problems solved

* Tools used

* How to get started with some simple tools

* Important skill sets

# What is a data scientist?

&nbsp;

## Josh Wills:
> Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

# What is a data scientist?
## Drew Conway's Data Science Venn Diagram
<DIV ALIGN="CENTER"><IMG SRC="https://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=750w"/>
</DIV>

# What do they do?

### They use technology and their knowledge of statistics to

- Make predictions

- Turn data into information and information into insights.

- Find _correlations_ or patterns in data that can be _used for "lift" or leverage_ (i.e., allocating resources where they have the biggest effect)

# What types of problems do data scientists solve?
&nbsp;

## Let's start by looking at [Kaggle competitions](https://www.kaggle.com/competitions)!
&nbsp;

### Practice Contests:

- Given a photo, [predict whether it is a dog or a cat](https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition)
- [Determine whether two images were created by the same artist](https://www.kaggle.com/c/painter-by-numbers)
- [Identify a handwritten digit (MNIST data)](https://www.kaggle.com/c/digit-recognizer):
&nbsp;

<div align="center">
<img style="display:inline-block" src="images/very_odd_four.png" width="120"/>
<img style="display:inline-block" src="images/four_or_nine.png" width="120"/>
<img style="display:inline-block" src="images/four_or_nine_2.png" width="120"/>
<img style="display:inline-block" src="images/five_or_nine.png"  width="120"/>
<img style="display:inline-block" src="images/zero_or_nine.png"  width="120"/>
</div>


### Sponsored Kaggle Contests:
- $5k &ndash; [Predict whether a student will answer the next test question correctly](https://www.kaggle.com/c/WhatDoYouKnow)

- $100k &ndash; [Given the text of a student essay, predict the score a trained rater gave this essay](https://www.kaggle.com/c/asap-aes) using a rubric.

- $500k &ndash; [Identify which patients will be hospitalized in the next 12 months from historical claims](http://www.heritagehealthprize.com/c/hhp)

- $100k &ndash;  [Given an image of the retina, predict a clinician’s diagnosis of diabetic retinopathy](https://www.kaggle.com/c/diabetic-retinopathy-detection)

### More Sponsored Kaggle Contests:
- $70k &ndash; [Predict which customers will leave an insurance company within the next 12 months](https://www.kaggle.com/c/deloitte-churn-prediction)

- $50k &ndash; (Red Hat) [Predict the potential business value of prospective customers (individuals)](https://www.kaggle.com/c/predicting-red-hat-business-value) based on the activities completed by those individuals

- $10k &ndash; [Predict Parkinson’s disease progression from smartphone data (accelerometers)](https://www.kaggle.com/c/predicting-parkinson-s-disease-progression-with-smartphone-data)

- $30k &ndash; [Predict who is driving a vehicle based on vehicle path data](https://www.kaggle.com/c/axa-driver-telematics-analysis)  (acceleration profiles, trip length, types and number of turns, etc.)

- $1k &ndash; [Identify psychopaths from their Twitter usage](https://www.kaggle.com/c/twitter-psychopathy-prediction)

# Terminology
* *Machine Learning* is a catch-all phrase that encompasses many techniques where you build models from data
* Values used to make predictions may be called *predictors*, *covariates*, *features*, *explanatory variables*, *factors*, or *independent variables*
* The variable you are trying to predict may be called the *target*, *response*, or *dependent variable*
* Variables may be *continuous*, *categorical*, or *discrete*
* *Classification:* the target variable is *categorical*
* *Regression:* the target variable is normally *continuous*
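
To make the vocabulary concrete, here is a minimal sketch using made-up home-sales records (the field names and values are hypothetical, not from any real dataset). It separates the *features* from two possible *targets*: a continuous one (a regression problem) and a categorical one (a classification problem).

```python
# Hypothetical observations: one dict per home (all values are made up).
homes = [
    {"sqft": 1400, "bedrooms": 3, "zip": "20001", "price": 410000, "sold_in_30_days": "yes"},
    {"sqft": 2100, "bedrooms": 4, "zip": "20815", "price": 780000, "sold_in_30_days": "no"},
    {"sqft": 900,  "bedrooms": 2, "zip": "20001", "price": 305000, "sold_in_30_days": "yes"},
]

# Predictors / features / independent variables:
#   sqft     -> continuous
#   bedrooms -> discrete
#   zip      -> categorical
feature_names = ["sqft", "bedrooms", "zip"]
X = [[home[name] for name in feature_names] for home in homes]

# Regression: the target (price) is continuous.
y_regression = [home["price"] for home in homes]

# Classification: the target (sold within 30 days?) is categorical.
y_classification = [home["sold_in_30_days"] for home in homes]

for row, price, sold in zip(X, y_regression, y_classification):
    print(row, "->", price, "|", sold)
```

The same feature rows feed both problems; only the choice of target changes the type of model you would train.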

# "Supervised" learning &mdash; Classification/Regression

- Classification and regression predict an _unknown_ variable (often a future value) from correlated _known_ variables.
- Supervised means you provide examples with the "correct" answer and train the machine to find the pattern that produces that answer.

*Examples*:

 * Will this student get into college X?
 * Text classification: document classification, [spam filters](TextClassification.ipynb)
 * How much is this home worth (Zillow)?
 * Will this client renew?
 * Will this prospective client buy (i.e., lead scoring)?
 * What is the age of the person in this photo? [how-old.net](https://how-old.net/)

* Most common *regression* technique: *Linear Regression*
* Popular *classification* techniques:
  - Decision Trees (DT)
  - k-Nearest Neighbors (k-NN)
  - Logistic Regression (LR)
  - Naive Bayes (NB)
  - Neural Networks (NN)
  - Support Vector Machines (SVM)
  - Ensembles of some of the above
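
As an illustration of supervised classification, here is a minimal k-Nearest Neighbors sketch in plain Python. The training data (GPA and a scaled test score for the "will this student get in?" question) is invented for the example, and `knn_predict` is a hypothetical helper, not a library function; in practice you would more likely reach for scikit-learn.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up labeled examples: (GPA, test score scaled to 0-1) -> admission decision.
train_X = [(3.9, 0.95), (3.7, 0.90), (3.8, 0.85), (2.5, 0.60), (2.8, 0.55), (2.2, 0.65)]
train_y = ["admit", "admit", "admit", "reject", "reject", "reject"]

print(knn_predict(train_X, train_y, (3.6, 0.88)))
print(knn_predict(train_X, train_y, (2.4, 0.58)))
```

The "training" here is just storing the labeled examples; all the work happens at prediction time, which is what makes k-NN a convenient first technique to read.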
 

### Classification &mdash; one approach
<img src="images/example2_0.png">

### Classification &mdash; one approach
<img src="images/example2_1.png">

### Classification &mdash; one approach
<img src="images/example2_2.png">

# Example:

## Classification using a decision tree &mdash; [Predict who survived on the Titanic](titanic.ipynb) - (Kaggle Data)
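
For a feel of what a decision tree does at each node, here is a sketch of a one-level tree (a "decision stump"). It assumes a handful of invented passenger records rather than the actual Kaggle data, and it scores candidate splits by plain accuracy; real implementations typically use measures such as Gini impurity or entropy and keep splitting recursively. `fit_stump` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def fit_stump(rows, features, target):
    """Return (feature, rule, n_correct) for the single best feature to split on."""
    best = None
    for feature in features:
        # Count target values for each value of this feature.
        groups = defaultdict(Counter)
        for row in rows:
            groups[row[feature]][row[target]] += 1
        # Rule: predict the majority target within each feature value.
        rule = {value: counts.most_common(1)[0][0] for value, counts in groups.items()}
        correct = sum(counts[rule[value]] for value, counts in groups.items())
        if best is None or correct > best[2]:
            best = (feature, rule, correct)
    return best

# Invented records for illustration only (not the real Titanic data).
passengers = [
    {"sex": "female", "cls": "1st", "survived": "yes"},
    {"sex": "female", "cls": "3rd", "survived": "yes"},
    {"sex": "female", "cls": "2nd", "survived": "yes"},
    {"sex": "male",   "cls": "1st", "survived": "no"},
    {"sex": "male",   "cls": "3rd", "survived": "no"},
    {"sex": "male",   "cls": "2nd", "survived": "no"},
]

feature, rule, n_correct = fit_stump(passengers, ["sex", "cls"], "survived")
print(feature, rule, n_correct)
```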

# Linear regression example

Andreas Muller  &ndash; Education, income inequality, and mortality: a multiple regression analysis, BMJ 2002;324:23:
![test](images/F2.large.jpg)
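
The paper above uses multiple regression (several predictors). As a simpler sketch of the same idea, the following fits a least-squares line with a single predictor in plain Python. The data points are made up and lie exactly on a line so the result is easy to check by hand; `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)

# Use the fitted line to predict the target for an unseen x.
prediction = slope * 6.0 + intercept
print("prediction at x=6:", prediction)
```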

## "Unsupervised" learning &mdash; Clustering
&nbsp;
## Clustering doesn't "predict" anything; it finds structure in existing data:
* Market segmentation
* Anomaly detection (intrusion detection, fraud, impending failure, etc.)
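
One common clustering technique is k-means. The sketch below is a bare-bones version in plain Python, assuming invented customer data (monthly visits, average spend) and fixed starting centroids so the run is repeatable; `kmeans` is a hypothetical helper, and production code would normally use a library implementation with smarter initialization.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up customers: (visits per month, average spend).
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Start from two of the data points; note that no labels are supplied anywhere.
centroids, clusters = kmeans(customers, [customers[0], customers[3]])
for center, members in zip(centroids, clusters):
    print(center, members)
```

Unlike the supervised examples, nothing here says which segment a customer "should" belong to; the groups emerge from the distances alone.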

# Recommender Systems
&nbsp;
### Can be viewed as regression problems, but are more like matching problems
* Netflix - What movie will this person like?
* LinkedIn - Who should this person connect with?
* Amazon - People who buy X also buy Y.

- What's a good Active Match filter for this client?
- What's a good college for this student?
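
A very simple way to approximate "people who buy X also buy Y" is to count how often items appear together. The sketch below assumes a few invented shopping baskets, and `also_bought` is a hypothetical helper; real recommender systems use far richer signals, so treat this as an illustration of the matching idea only.

```python
from collections import Counter
from itertools import permutations

def also_bought(baskets):
    """For each item, count how often every other item shares a basket with it."""
    co_counts = {}
    for basket in baskets:
        # Every ordered pair of distinct items in the basket counts once.
        for item, other in permutations(set(basket), 2):
            co_counts.setdefault(item, Counter())[other] += 1
    return co_counts

# Made-up purchase histories.
baskets = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["tent", "lantern"],
    ["stove", "fuel"],
]

co_counts = also_bought(baskets)
top_for_tent = co_counts["tent"].most_common(1)
print("Customers who bought a tent also bought:", top_for_tent)
```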



# Natural Language Processing (text analytics)
- What type of document is this?
- What is the sentiment of the writer?
- What are the topics in this body of documents?
- How would a person describe this photo?
- How would a native speaker translate this phrase?
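
As a sketch of the sentiment question, here is a tiny bag-of-words Naive Bayes classifier in plain Python. The four training sentences are invented, the tokenizer just splits on whitespace, and the function names are hypothetical; it is meant to show the mechanics, not to be a usable sentiment model.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(labeled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training reviews.
reviews = [
    ("great movie loved it", "positive"),
    ("wonderful acting great story", "positive"),
    ("terrible plot hated it", "negative"),
    ("boring and terrible acting", "negative"),
]

model = train_naive_bayes(reviews)
print(classify("great story", model))
print(classify("terrible boring plot", model))
```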

# The Bleeding Edge &mdash; Deep Learning

Huge recent progress enabled by:
  - Breakthrough algorithms for training "deep" neural networks ([Geoff Hinton's 2006 work](http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf))
  - Computers finally having enough horsepower (GPUs), plus software and infrastructure to exploit it (Theano, Hadoop, etc.)
 

# Deep Learning Applications

* Predict how a person would describe these photos: ![](https://tctechcrunch2011.files.wordpress.com/2014/11/screen-shot-2014-11-17-at-2-11-11-pm.png?w=738)

## Deep Learning Applications
&nbsp;
[Colorizing B&W photos:](http://petapixel.com/2016/07/14/app-magically-turns-bw-photos-color-ones/)
<div align="center">
<img style="display:inline-block" src="images/dog.jpg" width="480"/>
<img style="display:inline-block" src="images/dog_colored.png" width="480"/>
</div>


  
# More Deep Learning Applications
* [Adding sound effects to silent films](http://mashable.com/2016/06/14/artificial-intelligence-sounds-turing/#TsFac2gCtOqG)
* Photo Search
* Language translation

# Tools
* Machine Learning and Predictive Models &mdash; [Python (pandas, statsmodels, scikit-learn)](titanic.ipynb), R (various packages), Apache Spark's MLlib (Python and Scala APIs), [Vowpal Wabbit](https://en.wikipedia.org/wiki/Vowpal_Wabbit), [Weka](http://www.cs.waikato.ac.nz/ml/weka/) (Java and a GUI), [XGBoost](http://xgboost.readthedocs.io/en/latest/model.html), [IBM Watson](http://www.ibm.com/watson/developercloud/)
* Deep Learning &mdash; TensorFlow, Theano (some assembly required)
* Visualization &mdash; Matplotlib (Python), D3 (JavaScript), R, plotly, Tableau
* Big Data Technologies (e.g., Hadoop) &mdash; HDFS (storage), MapReduce/Apache Spark (computation), HBase (NoSQL based on Google BigTable), Hive/Drill/Impala (SQL)

# Additional Resources

* [Kaggle](https://www.kaggle.com/competitions)
* [How to Learn Data Science by Vik Paruchuri](https://www.youtube.com/watch?v=Ura_ioOcpQI)
* [mathematicalmonk's YouTube Series on Machine Learning](https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA) (excellent, but fairly mathematical)
* Local meetups &mdash; If you're in DC there's Data Science DC, Data Driven DC, and Washington DC Area Apache Spark Interactive
* [Andrew Ng's Coursera Course](https://www.coursera.org/learn/machine-learning) (Ng specializes in deep learning)

# Thank you!