> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, type the following in the console:
 
 
> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`
 
 
> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View > Cell Toolbar > None`.

For more help, check out [this tutorial](https://drive.google.com/open?id=17q01buf7YFuB4yF8cFjmnc_ZARQWOljGhlHs99i9X0c).

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# What is Data Science?
 
_Authors: Alexander Egorenkov (DC), Amy Roberts (NYC)_
 
---


### Lesson Guide
- Welcome and introductions
- [Activity: Data Science in the Real World (5 min)](#ds-real-world)
- [How to Ask a Question (10 min)](#question)
- [Data Science Workflow Through Ames Data (20 min)](#dswf)
- [Summary (5 min)](#summary1)
- [Common Machine Learning Definitions (15 min)](#common-ml-defs)
- [Activity: Quiz or Group (15 min)](#ml-activity)
- [Summary (5 min)](#summary2)
- [Course & Project Structure (5 min)](#course-info)

- Break (10 min)

- Survey (5–10 min)
- Common Python Types (10 min)
- Common Types Code-Along (20 min)
- Common Python Functions and Control Flow (10 min)
- Common Python Functions Code-Along (20 min)
- Recap and Requests (5–10 min)


## Learning Objectives

- Set up and confirm your development environment.
- Understand how the course will run.
- Define the Data Science Workflow and common Machine Learning concepts.
- Discuss the topics and goals of our course.
- Use types in Python correctly.
- Create basic functions in Python.

# Welcome to Data Science at GA!

### Instructor: Greg Baker

<img src="extra-materials/GregB.jpg" alt="Greg Baker" width="180"/>

- runs a consulting company specialising in natural language processing problems 
- previously worked at Atlassian, Google and CSIRO (where he nearly prevented WiFi from being invented)
- studying a PhD at Macquarie University
- book author, film music composer 
- wrote some of the terms of the USA - Australian free trade agreement. 
- General Assembly's Distinguished Faculty program.
- (the photo above is very out-of-date, but my children don't like being photographed)



# Welcome to Data Science at GA!

### Teaching Assistant: Prasanth Thangavel

<img src="extra-materials/PrasanthPhoto.png" alt="Prasanth" width="150"/>

- PhD Scholar (Using AI & ML for detecting brain disorders)
- Instructor / TA (Coding, ML, DA, DS)
- Co-founder (Chatbot based multi-channel job platform that serves blue-collar job market in SG)

<a id="typical-class"> </a>
# A typical class

- IAs put a message in slack asking to confirm attendance
- Open question time until about 8:02am IST
- Run "git pull" because I sometimes make corrections when I check the course content
- Overview of lesson objectives
- Some motivating problem and data
- Code-alongs
  - Sometimes I just run some code and discuss it
  - Sometimes I pick someone to run some existing code
  - Sometimes I pick someone to come up with some code
- Usually a little exercise to do on your own during the class
- A 15 minute break about half way (9:30am IST); I often use a timer
- Finish at 11:00am IST
- "Exit tickets"

# Schedule


_ | Monday | Wednesday 
---| ---| ---
Week 1 | Feb 24 | Feb 26
Week 2 | Mar 2 | Mar 4
Week 3 | Mar 9 | Mar 11
Week 4 | Mar 16 | Mar 18
Week 5 | Mar 23 | Mar 25
Week 6 | Mar 30 | Apr 1
Week 7 | Holiday | Apr 8
Week 8 | Apr 13 | Apr 15
Week 9 | Apr 20 | Apr 22
Week 10 | Apr 27 | Apr 29
Week 11 | May 4 |

(The last two days are presentation days)

<a id="asking-questions"> </a>
# I have a question / doubt!

Many options:

- Just ask it out loud; that's normal student behaviour in Australia
- Hand-up in the Zoom room
- Ask your question in the slack room #ind-datr-2-24
  - One of the instructor assistants might answer it there
  - They might ask me out loud on your behalf



# GitHub Enterprise (GHE)

- You have already created an account
- Share this on Slack with Prasanth and we will add you to the team
- Run `git clone https://git.generalassemb.ly/dat-ms-feb2020`

<a id="question"> </a>
# What is Data Science?

- Being scientific about business data

### What does it mean to use a scientific method?

Most practitioners apply a version of the scientific method in order to logically deconstruct and analyze an issue. At General Assembly, we call this the data science workflow, which we've broken down into a series of steps.

This problem-solving framework will help you produce results that are reliable (so that your findings will be more accurate) and reproducible (so that others can follow your steps and achieve the same results).

Note that, depending on the problem, this process is not always linear. You may require lots of iteration and repetition before any conclusions can be drawn!

### Asking a Good Question

**Let's get really scientific for a moment...**

A good 'question' is actually a **statement** which:
- might be wrong
- and could be rejected
    - within certain bounds of confidence

Such a statement is a **hypothesis**.
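To make "could be rejected within certain bounds of confidence" concrete, here is a minimal sketch using an exact two-sided binomial test. The statement being tested ("this coin is fair"), the observed counts, and the 5% threshold are all assumptions chosen for illustration; the function name is hypothetical.

```python
from math import comb


def two_sided_binomial_p_value(successes, trials, p_null=0.5):
    """Probability, if the hypothesis is true, of a result at least as unlikely as the one seen."""
    def pmf(k):
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # Add up every outcome that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))


ALPHA = 0.05  # assumed threshold: we accept a 5% risk of rejecting a true statement

# Hypothesis (a statement that might be wrong): "this coin is fair".
# Assumed observation: 9 heads in 10 flips.
p_value = two_sided_binomial_p_value(successes=9, trials=10)
reject = p_value < ALPHA

print(f"p-value = {p_value:.4f}, reject the hypothesis: {reject}")
```

With these made-up numbers the p-value is 22/1024 (about 0.021), so the statement would be rejected at the 5% level. With 6 heads in 10 flips it would not be.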

#### What Are Some Common Questions Asked in Data Science?

**Machine learning more or less asks the following questions:**

- Does X predict Y? (Where X is a set of data and Y is an outcome.)
- Are there any distinct groups in our data?
- What are the key components of our data?
- Is one of our observations “weird”?

**From a business perspective, we can ask:**

- What is the likelihood that a customer will buy this product?
- Is this a good or bad review?
- How much demand will there be for my service tomorrow?
- Is this the cheapest way to deliver my goods?
- Is there a better way to segment my marketing strategies?
- What groups of products are customers purchasing together?
- Can we automate this simple yes/no decision?

_This list may seem limited, but we rewrite most questions to fit this form._

<a id="dswf"></a>
## Introduction: The Data Science Workflow

---

Throughout this course and for our projects, we'll be following a general workflow. This workflow will help you produce *reliable* and *reproducible* results.

- **Reliable**: Accurate findings.
- **Reproducible**: Others can follow your steps and achieve the same results.

### Steps in the Data Science Workflow

- **Frame**: Develop a hypothesis-driven approach to your analysis.
- **Prepare**: Select, import, explore, and clean your data.
- **Analyze**: Structure, visualize, and complete your analysis.
- **Interpret**: Derive recommendations and business decisions from your data.
- **Communicate**: Present (edited) insights from your data to different audiences.

![](./assets/Data-Framework-White-BG.png)

#### Notes about GA's Data Workflow

_Remember, these steps are not hard-set rules; instead, think of them as problem-solving guidelines._


- Some projects may not require every step.
- These steps are iterative; it's normal to go back and repeat certain steps a few times in a row.
- The process is cyclical; after completing the process, you may restart it on new findings.

### Data acquisition

---
https://www.kaggle.com/mnoori/ames-housing-prices#AmesHousing.txt
- **What are some questions we should ask during the acquisition process?**

- Our Ames data set contains the following information:
    - [Ames Data Set Introduction PDF](./extra-materials/ames.pdf) (from the "Journal of Statistics Education")
    - "Data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010."

### Data Quality

---

- **What are some questions we should ask when checking the data for quality?**
  - [Ames Data Set Documentation](./extra-materials/ames_data_documentation.txt)

##  Prepare

---

Often, we are given *secondary data*, or data that were collected previously. In these cases, we have to learn as much as possible about our data using tools like data dictionaries and source documentation to determine how the set was gathered.

Here's an example of a data dictionary:

Variable | Description | Type of Variable
---| ---| ---
Square Footage | Floating Point | Continuous
Street Type | 1 - Gravel, 2 - Paved | Categorical
Neighborhood | String, e.g., 'Tenderloin' | Categorical
Number of Bedrooms | Integer | Discrete
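A data dictionary can also be written down in code and used to check incoming records. The sketch below is one possible encoding of the table above; the field names, the `find_problems` helper, and the sample records are hypothetical and not part of the Ames data set.

```python
# Hypothetical encoding of the data dictionary above.
DATA_DICTIONARY = {
    "square_footage": {"python_type": float, "kind": "continuous"},
    "street_type": {
        "python_type": int,
        "kind": "categorical",
        "allowed": {1: "Gravel", 2: "Paved"},
    },
    "neighborhood": {"python_type": str, "kind": "categorical"},
    "bedrooms": {"python_type": int, "kind": "discrete"},
}


def find_problems(record):
    """Return a list of ways in which a record disagrees with the data dictionary."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        expected = spec["python_type"]
        if not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
        elif "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{name}: {value!r} is not an allowed code")
    return problems


# Made-up records for illustration.
good = {"square_footage": 1850.0, "street_type": 2, "neighborhood": "Tenderloin", "bedrooms": 3}
bad = {"square_footage": "1850", "street_type": 3, "neighborhood": "Tenderloin"}

print(find_problems(good))
print(find_problems(bad))
```

The first record passes; the second one has a number stored as text, an unknown street code, and a missing field, which are the kinds of issues we look for when preparing data.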

**Common considerations when preparing our data include:**  

- **Ensuring data is clearly defined and structured**
- **Checking and cleaning data formatting as needed**

**Common considerations for cleaning include**:

- **Most data will _not_ come perfectly clean and ready to use. Cleaning data is normally the most time-consuming task a data scientist faces.**

---

As you can see, the "Prepare" phase of the data science workflow encompasses several steps: the act of reviewing, indexing, and cleaning your data. This normally consumes a great deal of time!

## Analyze

---

Data scientists often start with basic statistics, such as the mean, standard deviation, or specific frequency counts of their data. Statistics that we might expect for the earlier housing variables include:

Variable | Mean or Frequency (%)
---| ---
Square Footage | 2201.3
Street Type - Gravel | 8%
Street Type - Paved | 92%
Number of Bedrooms | 1.8

**What sort of questions do these types of statistics allow us to answer? Why would we do this?**
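Statistics like these take only a few lines to compute. The sketch below uses Python's standard library on a tiny made-up sample; the numbers are invented for illustration and are not taken from the Ames data.

```python
from collections import Counter
from statistics import mean, stdev

# Made-up sample of four houses (not real Ames values).
square_footage = [1500.0, 2100.0, 2400.0, 2800.0]
street_type = ["Paved", "Paved", "Gravel", "Paved"]

# Continuous variable: mean and sample standard deviation.
avg_sqft = mean(square_footage)
spread_sqft = stdev(square_footage)

# Categorical variable: frequency of each category as a percentage.
counts = Counter(street_type)
shares = {category: 100 * n / len(street_type) for category, n in counts.items()}

print(f"Mean square footage: {avg_sqft:.1f} (std dev {spread_sqft:.1f})")
print(f"Street type shares (%): {shares}")
```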

### Creating a Predictive Model 

We generate predictive models based on the SMART goal we decided upon earlier. Typically, we want to predict some value of interest (such as the price of a house given a fixed set of characteristics).

**What are some other business goals we can support as data scientists for this realty company? What are some values we would like to guess?**

**What do you think are the steps for model building?**

_We'll be spending much of our time in this course on data analysis and predictive modeling._

## Interpret

---

### Develop Recommendations and Decisions

**Now that you have a model, what are some things you should check?**

**Now that you have a model, can you convert your model's finding into a conclusion or next step for your employer?**

>**Instructor Note:** For things to check after a model, guide students toward the following questions:
- Did you reject or fail to reject your hypotheses?
    - What does this mean for your project?
    - What does this mean for your client?
- Were your questions answered?
    - Which ones?
    - What do you need to do to answer the ones that weren't?
- Do your findings support any business recommendations, actions, or decisions?
    - Is there further supportive analysis?
    - How do your data support these recommendations?
    
>**Instructor Note:** For the **Communication** phase, guide students toward the following questions:
- Reaching a conclusion:
    - Seek guidance/interaction with subject matter experts (SMEs).
    - If those are not available, check with the data — are you coming to reasonable conclusions and predictions given what you've seen?
    - Do the next steps that you envision have any dependencies or corollary steps?
- What are some conclusions you can draw?
    - Conclusion: "Customers from large companies were twice as likely to place another order with Planet Express than customers from small companies."
    - Recommendation:  "We should target more large companies to use our delivery service."
    - Conclusion: "Other than size of company, I found no significant evidence that any other feature affected the odds of customers reusing our delivery service."

## Communicate

---

#### Share the Results of Your Analysis  

Presentations are a critical part of your analysis. It doesn't matter how brilliant your model is or how illuminating your findings are — without effective communication, your work will not be used.

The most basic form of a data science presentation should include a simple sentence that describes your results:

_"Customers from large companies had twice (CI 1.9, 2.1) the odds of placing another order with Planet Express compared to customers from small companies."_
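For anyone wondering where a sentence like this comes from, here is a rough sketch of an odds ratio with a 95% confidence interval computed from a two-by-two table. The counts are invented, so the interval will not match the one quoted above, and the normal approximation on the log scale is just one common way to build the interval; the source does not say which method was used.

```python
from math import exp, log, sqrt


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% interval for a 2x2 table.

    a, b: reordered / did not reorder in the first group
    c, d: reordered / did not reorder in the second group
    """
    odds_ratio = (a / b) / (c / d)
    # Standard error of log(odds ratio) under the usual normal approximation.
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: large companies 400 reordered, 200 did not;
# small companies 300 reordered, 300 did not.
ratio, low, high = odds_ratio_with_ci(400, 200, 300, 300)
print(f"Odds ratio {ratio:.1f} (CI {low:.2f}, {high:.2f})")
```

A narrower interval, like the one in the example sentence, would need many more customers in each group.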

Data science presentations can also be far more complex and exciting, like some of the [research presented by Nate Silver's FiveThirtyEight blog](http://fivethirtyeight.com/burrito/#brackets-view).

When crafting a presentation, always consider your audience and make sure to practice your presentation beforehand. Consider the types of questions people might ask or — better yet — test your presentation on a few people and pay attention to their response. Clarify and refine your presentation accordingly.


**A Note About Iteration**

Iteration is an important part of *every step* in the data science workflow. At any given point in the process, you may find yourself repeating or going back and redoing steps in order to better understand your data, clarify your model, and refine your presentation.

**What are some things you may want to redo or iterate over after presenting your findings?**

<a id="summary1"></a>
# Summary

---

1) **Crafting good questions is key.** <br>
  - Without a thoughtful, targeted, and SMART question, it can be difficult to create an effective model.
2) **Use the data science workflow to iteratively develop solutions.** <br>
  - **Frame**: Develop a hypothesis-driven approach to your analysis.
  - **Prepare**: Select, import, explore, and clean your data.
  - **Analyze**: Structure, visualize, and complete your analysis.
  - **Interpret**: Derive recommendations and business decisions from your data.
  - **Communicate**: Present (edited) insights from your data to different audiences.
3) **Informed by your past work, continue to refine your findings and models.** <br>
  - While the data science workflow may appear to be linear, we consistently return to past steps to implement new findings.

<a id="ML"></a>

## Introduction: Machine Learning

---


## Examples of Machine Learning

<a id="common-ml-defs"> </a>
## Common Machine Learning Definitions

There are two main categories of machine learning: supervised learning and unsupervised learning.

**Supervised learning (a.k.a., “predictive modeling”):**  
_Classification and regression_
- Predicts an outcome based on input data.
    - Example: Predicts whether an email is spam or ham.
- Attempts to generalize.
- Requires past data on the element we want to predict (the target).

**Unsupervised learning:**  
_Clustering and dimensionality reduction_
- Extracts structure from data.
    - Example: Segmenting grocery store shoppers into “clusters” that exhibit similar behaviors.
- Attempts to represent.
- **Does not require** past data on the element we want to predict.

**HOT TIP: Stick with supervised where possible!!**

<a id="supervised"></a>
### Supervised Learning

Supervised learning tends to be the most frequent type of work that data scientists do and will be the main focus of this course. How does supervised learning work?

1) We train a **machine learning model** (more on that shortly) using **labeled data** (the "response" label from earlier). <br>
    - The “machine learning model” learns some kind of relationship between the features and the response.

2) We make predictions on **new data** for which the response is unknown. <br>

The primary goal of supervised learning is to build a model that “generalizes” — i.e., accurately predicts the **future** rather than the **past**!

### Practice: Classification vs. Regression

There are two categories of supervised learning:

**Regression**
- The outcome we are trying to predict is a continuous value.
    - **Can you think of anything we would want to predict like this?** 

**Classification**
- The outcome we are trying to predict is categorical (i.e., it comes in one of a set number of classes).
    - **Can you think of anything that we would want to predict like this?**

The type of supervised learning problem has nothing to do with the features; only the response matters!

>**Instructor Note:** Examples of regression targets include price, blood pressure, temperature, etc.

>**Instructor Note:** Examples of classification include spam/ham, cancer class of tissue sample, etc.

## Unsupervised Learning

#### Common Types of Unsupervised Learning

- **Clustering:** Groups “similar” data points together.
- **Dimensionality reduction:** Reduces the dimensionality of a data set by extracting features that capture most of the variance in the data.

**Steps for Clustering**

Imagine that we had a bunch of coins we wanted to automatically split into groups. An unsupervised learning technique would involve the following steps:

1) Clustering the coins based on “similarity”: this could be through the size, the material, or the language on the coins. <br>
2) Inspecting the grouping that the algorithm found. <br>

Hopefully this would put the coins into sets of related groups.
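The two steps above can be sketched with a very small version of k-means, one common clustering algorithm, applied to a single feature. Everything here is an assumption for illustration: the coin diameters are made up, the starting centroids are picked by hand, and real clustering would normally use several features and a library implementation.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group numbers around the nearest centroid, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value with its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up coin diameters in millimetres; no labels are provided.
diameters = [17.9, 24.2, 18.1, 24.3, 18.0, 24.1]

# Step 1: cluster the coins by similarity (here, diameter only).
centroids, clusters = k_means_1d(diameters, centroids=[17.0, 25.0])

# Step 2: inspect the grouping that the algorithm found.
for centre, members in zip(centroids, clusters):
    print(f"centre {centre:.1f} mm: {members}")
```

Note that the algorithm never learns what the groups mean; deciding that one cluster is "small coins" is left to the person inspecting the output.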



**Steps for Dimensionality Reduction**

Imagine that we had a huge number of features related to those coins: country of origin, size, weight, mass, density, condition, chemical makeup, etc. Moreover, say that we had thousands (or, in some cases, millions) of different features. Not all of these features are helpful, however! Unsupervised learning can help us by grouping features together automatically. It would involve the following:

1) The unsupervised learning technique groups or combines features that are similar or do the same thing into a smaller set, leading to a set of new features that's smaller in size. <br>

Hopefully, the algorithm would recognize something like:

$$\dfrac{\text{mass}}{\text{size}} = \text{density}$$

Here, density could take the place of two different features from before. 
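As a hand-written illustration of what "taking the place of two features" means, the sketch below replaces mass and size with a single density column. The field names and values are hypothetical, and the formula is supplied by us; an actual dimensionality reduction algorithm would have to find its own combinations from the data rather than being told which ones to use.

```python
# Made-up coin records with three features each.
coins = [
    {"country": "AU", "mass_g": 5.0, "size_cm3": 0.625},
    {"country": "SG", "mass_g": 3.0, "size_cm3": 0.5},
]


def combine_mass_and_size(coin):
    """Return a copy of the record where mass and size are replaced by density."""
    reduced = {key: value for key, value in coin.items() if key not in ("mass_g", "size_cm3")}
    reduced["density_g_per_cm3"] = coin["mass_g"] / coin["size_cm3"]
    return reduced


reduced_coins = [combine_mass_and_size(coin) for coin in coins]

print(f"Features before: {len(coins[0])}, after: {len(reduced_coins[0])}")
print(reduced_coins)
```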

Sometimes unsupervised learning is used as a “preprocessing” step for supervised learning. (Can you guess how?)

### Examples

**Supervised Learning: Coin Classifier**

- **Observations:** Coins.
- **Features:** Size and mass.
- **Response or target variable:** Hand-labeled coin type.

- Train a machine learning model using labeled data.
    - The model learns the relationship between the features and the coin type.

- Make predictions on new data for which the response is unknown.
    - Give the model a new coin and it will predict the coin type automatically.
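A minimal sketch of the coin classifier, assuming a 1-nearest-neighbour rule (chosen here only because it is short; the example above does not say which model is used). The coin measurements and the labels "small", "medium", and "large" are invented.

```python
from math import dist

# Labeled training data: (size in mm, mass in g) -> hand-labeled coin type.
# All values are made up for illustration.
labeled_coins = [
    ((19.0, 3.0), "small"),
    ((19.5, 2.8), "small"),
    ((24.0, 5.5), "medium"),
    ((23.5, 5.8), "medium"),
    ((28.5, 11.0), "large"),
]


def predict_coin_type(features, training_data):
    """Predict the label of the closest labeled coin (1-nearest neighbour)."""
    _, nearest_label = min(training_data, key=lambda item: dist(item[0], features))
    return nearest_label


# A new coin for which the response is unknown.
new_coin = (24.2, 5.6)
print(predict_coin_type(new_coin, labeled_coins))
```

Here "training" amounts to storing the labeled coins. In practice the features would usually be rescaled first, because size and mass are measured in different units.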
    
**Unsupervised Learning: Types of Customers at a Bar**

- **Observations:** Customers.
- **Features:** Drink purchases, people they interact with, etc.
- **Response or target variable:** There isn’t one — instead, we group similar customers together.

<a id = 'algorithm'></a>

## Algorithms

Regardless of whether it's supervised or unsupervised, the underlying engine driving a machine learning model is an algorithm. These algorithms are used to help identify trends, represent said trends, and explain the overall variance of the data.   

Let's say we are a real estate agent looking to price a house using only its square footage. We know there are other features that can strongly influence the price, but we are only focusing on square footage for now. We know that, as square footage increases, so does price. At this point, you may be thinking that a simple algebraic equation could be useful: one that helps us price the house by its square footage.

Recently, we sold a house whose square footage was 2,500 for about \$285,000. If we apply this information to a normal linear equation, $Y = mx + b$, we can create a simple _algorithm_ to help us predict a house's price.

$$285{,}000 = m \times 2{,}500 + b$$

$$ m = 114, b = 0 $$ 

_With only one sale we can solve for only one unknown, so the Y intercept $b$ has been set to zero for this example._

#### Final Algorithm

$$ Price = 114x $$



## Algorithms ...

#### Final Algorithm
$$ Price = 114x $$

This is an example of a model built with the intent of predicting price. The algorithm is simple and built off of limited information. Typically, our models will be more complex, and we'll consider a greater amount of prior data to help us develop a final algorithm.  

#### Algorithm Training 

In our example, we used previously known information to find our coefficients. This action is also referred to as "training." But, let's make something clear:

- Model building would be the task of constructing an actual algorithm.
    - This is the linear model of $ Y = mx + b $.
- Training involves figuring out the coefficient and the Y intercept the model uses for _our intended purpose_.  
    - The coefficients uncovered via training were $m= 114$ and $b=0$.
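The split between model building and training can be sketched in a few lines. This is a simplified illustration, not a recommended way to price houses: the function names are hypothetical, and the training step assumes a line forced through the origin ($b = 0$), fitted by least squares.

```python
def predict_price(square_feet, m, b=0.0):
    """Model building: the chosen form is the linear model Y = mx + b."""
    return m * square_feet + b


def train_slope(sales):
    """Training: find the slope m that best fits past sales, with b fixed at 0.

    Uses the least-squares formula for a line through the origin:
    m = sum(x * y) / sum(x * x).
    """
    numerator = sum(square_feet * price for square_feet, price in sales)
    denominator = sum(square_feet**2 for square_feet, _ in sales)
    return numerator / denominator


# The one sale from the example: 2,500 square feet sold for $285,000.
past_sales = [(2500, 285000)]

m = train_slope(past_sales)
print(f"Trained coefficient m = {m}")
print(f"Predicted price for 2,000 sq ft: {predict_price(2000, m)}")
```

With the single sale from the example, training recovers $m = 114$. Adding more past sales to `past_sales` would change the coefficient without changing the model's form.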


<a id="conclusion2"></a>
## Conclusion

---

Check to see if you can answer the following questions easily:

- What is data science?
- What is the data science workflow?
- What is the difference between supervised and unsupervised learning?
- What is the difference between regression and classification? 
- What is an algorithm?

<a id="course-info"> </a>
# Course Information
    
### GA offers a special learning environment.

- What you should know: GA is a global community of individuals and organizations empowered to pursue the work we love.
- Who we are: Meet your instructional team.
- How to provide feedback: exit tickets, mid-course survey, and end-of-course survey. We want to hear from you!

### Road to Success

- The emotional cycle of change: This course is fast and covers a lot of material. There will be times when you may feel discouraged or overwhelmed, but don't give up - this is natural (and part of the design). By the end of the course, you'll feel more confident in your ability to define problems, analyze data, and prototype solutions. 
- Student learning responsibility: Our lessons cover topic foundations, but there is always more to learn! You are responsible for your learning experience - but don't get overwhelmed! Instead, just make sure you follow along, practice as much as possible, and ask questions.
- GA requirements: Show up. Be on time. Participate. Submit your projects. Allow yourself to struggle. Read the docs. Have fun!
- Q/A.


### Course Outline and Project Due Dates

General Assembly's part-time Data Science materials are organized into **four** units.

| Unit   | Title  | Topics Covered  | Length | 
| ---    | ---    |  ---     | ---    |
| Unit 1 | Foundations       | Python Syntax, Development Environment | Lessons 1–4 |
| Unit 2 | Working with Data | Stats Review, Visualization, & EDA     | Lessons 5–9  | 
| Unit 3 | Data Modeling     | Regression, Classification, & KNN      | Lessons 10–14  | 
| Unit 4 | Applications      | Decision Trees, NLP, & Flex Topics     | Lessons 15–19  | 

> **Instructor Note:** If there is time, briefly walk through the entire `course-info` repository with your students. If not, refer them to it for class information.