# The Data Science Taste Test - Overview
#### Hands-on Introduction to Data Science with Python

## Outline

### 1. The Purpose of this Talk
- Why I'm doing this
- Why data science, why Python?
- What you'll need to participate

### 2. Data Science Today
- Roles & Responsibilities
- State of the Market
- The Data Science Pipeline
 
### 3. The Data Science Workflow (General Assembly)
- Frame - Problems & Hypotheses 
- Prepare - Ingestion & Cleaning
- Analyze - Studying the Data
- Interpret - Inference & Prediction
- Communicate/Deploy - Enabling Decisions

### _~Learning Resources for YOU!~_

## 1 - Purpose

![Start with Why Diagram](./assets/why.png "Simon Sinek advises that we should first think about why we're trying to solve a particular problem.")

[Watch Simon Sinek's full presentation here](https://www.ted.com/talks/simon_sinek_how_great_leaders_inspire_action?language=en)

### Why I'm Doing This



![How I Started](./assets/phone.png "My career began in sales.")

In the olden days I was a Sales Development Representative, which is a telemarketer with even more syllables. The task was to call on hospitals and get them interested in the latest and greatest technology for engaging patients in their care.

Over time, that role evolved into operations management. This involved analyzing sales data to help the sales team make the best decisions concerning outreach, strategy, and what have you. It also reignited my college love affair with economics.

![books](./assets/books.png "I highly recommend Freakonomics, Everybody Lies, and The Book of Why for inspiration to go down this path")

Our generous organizer, Nick Sanford, taught me enough Excel to be dangerous, and books like these were a heavy influence before and during my transition into data science. I strongly encourage you to check them out!

As the focus shifted from sales into analytics, and as the team demanded more in terms of speed, quantity of data, and inferential/predictive capabilities, picking up data science skills & platforms simply made sense. 

![Eliud Kipchoge](./assets/eliud.png "Eliud Kipchoge, first human being to run 26.2 miles in under 2 hours")

The transition to data science using Python went exactly like this:
- "Hello, General Assembly, I've never coded before and I'm terrified, but I'd like to give this data science course a try"
- While taking the course, replaced Excel tasks with Python workflows
  - Built a classifier algorithm to identify the health systems the team should target
  - Used text mining to find millions of dollars in mislabeled revenue
  - Automated a major conversion effort between CRMs
- Prior company was sold, and I fell victim to a layoff
- Created a consultative data science position at a new organization
- Now work with healthcare systems on all sorts of interesting challenges as well as teach for General Assembly!

![Work](./assets/work.png "Shows the places I work, SymphonyRM and General Assembly")

![Black Hole](./assets/blackhole.png "Katie Bouman & team were experts in a field of artificial intelligence called 'Computer Vision,' which is concerned with getting computers to see.")

Now I'm here to tell you that if you're interested enough and passionate about the problems you're solving, you can do it too!

### Why Python?

What we're witnessing when we see marvels like self-driving cars, sub-2-hour marathons, or a visualized black hole is really the effort of a largely open-source community. It's a bunch of people trying stuff out and talking to each other. A key way they do this is through Jupyter Notebooks, such as the one you're using right now. In fact, the code used to visualize the black hole? It's right here: [VLBI Code](https://github.com/achael/eht-imaging).

This community has helped me do better work than I thought I could ever do, and to pay it forward, I want to share this opportunity with you. 

By running through this in Python, you're getting your hands on the most popular language for doing data science and putting models into production.

Even if you've never coded before, this talk is intended to give you some hands-on time so that you're not in completely unfamiliar territory should you decide to go down this path. Hands-on experience will make coding much less intimidating.

### What You'll Need

If you're reading this on your own computer you're pretty close!

#### The Basics
- Computer with at least 8 GB of RAM, preferably a MacBook Pro
- Anaconda - A distribution package that installs Python, Jupyter Lab, and the most common data science libraries: [Download it here.](https://www.anaconda.com/distribution/)

#### For Interactive Visualization with hvplot
Run these two lines in your command prompt:

- `conda install jupyterlab`
- `jupyter labextension install @pyviz/jupyterlab_pyviz`
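Once everything is installed, a quick check from inside a notebook can confirm that Python sees the libraries. This is only a minimal sketch: the package list in `REQUIRED` and the helper name `find_missing` are assumptions made for illustration, so adjust them to whatever you actually plan to use.

```python
import importlib.util

# Assumed list of packages for this talk; edit it to match your own setup.
REQUIRED = ["pandas", "numpy", "matplotlib", "hvplot"]


def find_missing(packages):
    """Return the names in `packages` that cannot be imported in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All set!")
```

If anything shows up as not installed, rerunning the Anaconda installer or the commands above is usually the first thing to try.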


## 2 - Data Science Today

![Data Science Venn Diagram](./assets/venn.png "This Venn Diagram shows data science as the intersection between hacking skills, substantive expertise, and math expertise")

You'll commonly see this or a similar Venn Diagram describing the role of data science. There are pros and cons to this view. 

The pro is that this is a great summary of what you need to offer value to a data science organization:
- **Hacking Skills**: The ability to write production-ready code that works well, is thoroughly commented and documented, and is robust to bugs & real-world challenges
- **Math & Statistics Knowledge**: Understanding the math and statistics well enough to know which algorithms to use in various situations and how they work
- **Substantive Expertise**: Deep knowledge of the field (chemistry, biology, healthcare, etc.) for which you are solving problems

The con is that it doesn't go deep enough into what a data scientist actually does.

Fortunately, Google's Chief Decision Scientist, [Cassie Kozyrkov](https://twitter.com/quaesita), has got you covered. She is one of the most prolific writers & speakers on introductory AI material, statistical material, and strategic thinking around managing AI teams. A GREAT person to follow.

From her [What on Earth is Data Science?](https://hackernoon.com/what-on-earth-is-data-science-eb1237d8cb37?source=post_page-----1e73b9c7682----------------------) article: 

![Data Science Decision Tree](./assets/decisions.png "Cassie Kozyrkov shows when to use Machine Learning vs Statistics vs Analytics")

"Data science is the discipline of making data useful."

This flow chart takes it a layer deeper to help understand what might be in a data scientist's day-to-day. Each of those paths terminates in one of _Data Analytics, Statistics, or Machine Learning:_
- **Data Analytics**: Goes deep in understanding, cleaning, and manipulating what's in the data. Finds patterns and posits hypotheses. Identifies opportunities for further research
- **Statistics**: Makes inferences and predictions using data. Best to be engaged early on in order to design experiments that truly identify causal relationships
- **Machine Learning**: Uses large data sets to train computers to make decisions such as which product to show a site visitor or which name to auto-tag to a Facebook photo
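
To make the three paths a little more concrete, here is a toy sketch that touches each one using only Python's standard library. Everything in it is an assumption made for illustration: the numbers are made up, the names `permutation_p_value` and `predict_buyer` are hypothetical, and a permutation test and a midpoint cutoff are just simple stand-ins for real statistical and machine learning methods. It also says nothing about causality, which takes a designed experiment.

```python
import random
import statistics

# Made-up toy data: minutes spent on a site by visitors who did or did not buy.
buyers = [12.0, 15.5, 14.0, 18.0, 16.5, 13.0]
browsers = [5.0, 7.5, 6.0, 9.0, 11.0, 4.5]

# Data analytics: describe what is in the data and look for a pattern.
mean_buyers = statistics.mean(buyers)
mean_browsers = statistics.mean(browsers)
observed_gap = mean_buyers - mean_browsers


# Statistics: ask whether a gap this large could plausibly be chance.
# A permutation test shuffles the group labels and recomputes the gap many times.
def permutation_p_value(a, b, trials=2000, seed=0):
    rng = random.Random(seed)
    pooled = a + b  # new list, so the inputs are not modified
    gap = statistics.mean(a) - statistics.mean(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        shuffled_gap = statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        if shuffled_gap >= gap:
            hits += 1
    return hits / trials


p_value = permutation_p_value(buyers, browsers)

# Machine learning: learn a rule from the data, then apply it to new cases.
# Here the "model" is just a cutoff halfway between the two group means.
threshold = (mean_buyers + mean_browsers) / 2


def predict_buyer(minutes):
    """Guess whether a new visitor will buy, based on minutes spent on the site."""
    return minutes > threshold


print("Average gap in minutes:", round(observed_gap, 2))
print("Share of shuffles with a gap at least this large:", p_value)
print("Would a 17-minute visitor be flagged as a buyer?", predict_buyer(17.0))
```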

In reality, different teams have different levels of specialization. You may have roles that demand more analytics-heavy players or people who are specifically machine learning engineers. You may also have roles where you're asked to do all three along with other tasks such as building data engineering pipelines or ETL! Defining data science can get slippery, but this is a great chart to keep in mind.

### State of the Market
![State of Market](./assets/stateofmarket.png "Chart shows rising interest over time for Data Science & Machine Learning")
![Key](./assets/key.png)
- Source: Google Trends

Interest in data science and machine learning continues to grow over time.

This is just conjecture, but hear me out...

![Global Datasphere](./assets/global_datasphere.png "The rate at which we produce data is growing exponentially.")

The amount of data we produce is growing exponentially, and leaders are now learning that data can help drive better decisions. The greater availability of data leads to more and more instances of Cassie's Data Science Adventure flowchart above. The data is growing, and there's a shortage of people who are able to turn it into insights, predictions, and decisions.

There's a need to vet & deliver all these data-driven analyses, recommendations, and machines. How do research institutions and companies put this into place?

### The Data Science Pipeline

There's no one data science pipeline universally accepted by all. This depends on organizational culture, but it's good to have some baselines in mind. For instance, General Assembly teaches this framework:

![General Assembly Approach](./assets/general_assembly_pipeline.png "General Assembly's Frame, Prepare, Analyze, Interpret, Communicate framework")

This is a great overall approach that can fit into any culture. The emphasis here is on presenting insights learned from data science techniques in order to make decisions. Other scenarios might call for getting data science models into a live production environment where they're making decisions for users in real time. Think about how Netflix anticipates what movies you'd like to watch or how your phone predicts messages. These might call for a different, albeit similar, framework.

![CRISP-DM Approach](./assets/CRISP-DM.jpg)

How are these approaches similar?

How are they different?

### Conclusion and Transition

It's time to get our hands dirty and dive into some data. This is a taste test. The notebook is designed for you to be able to follow along and to be able to play with parameters in the code even if you don't have any background. Yes, you'll break things and get headaches, but if you go down this path, you'll be doing that for the rest of your life. 

The next notebook will walk us through a pipeline similar to the General Assembly example above.
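
As a small preview, the five stages (Frame, Prepare, Analyze, Interpret, Communicate) can be sketched as one plain Python function each. This is a sketch under loud assumptions rather than General Assembly's own code: the function names, the question, and the rows of data are all made up for illustration.

```python
import statistics


def frame():
    """Frame: state the question we want the data to answer."""
    return "Do longer calls lead to more booked meetings?"


def prepare(raw_rows):
    """Prepare: ingest and clean, here by dropping rows with missing values."""
    return [row for row in raw_rows if None not in row.values()]


def analyze(rows):
    """Analyze: study the data, here by averaging call length per outcome."""
    booked = [r["minutes"] for r in rows if r["booked"]]
    not_booked = [r["minutes"] for r in rows if not r["booked"]]
    return {
        "booked": statistics.mean(booked),
        "not_booked": statistics.mean(not_booked),
    }


def interpret(summary):
    """Interpret: turn the numbers into a tentative conclusion."""
    return summary["booked"] > summary["not_booked"]


def communicate(question, supported):
    """Communicate: report the finding so someone can act on it."""
    verdict = "supports" if supported else "does not support"
    return f"{question} This toy data {verdict} that idea."


# Made-up rows for illustration only; one has a missing value on purpose.
raw = [
    {"minutes": 12, "booked": True},
    {"minutes": 3, "booked": False},
    {"minutes": None, "booked": True},
    {"minutes": 9, "booked": True},
    {"minutes": 4, "booked": False},
]

question = frame()
clean = prepare(raw)
summary = analyze(clean)
report = communicate(question, interpret(summary))
print(report)
```

Try changing the numbers in `raw` and rerunning the cell to see how the report changes. Real projects loop back through these stages many times, but the order of the hand-offs stays the same.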