> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, run the following cell:

> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`
 
> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View > Cell Toolbar > None`.

In [1]:
#! jupyter nbconvert AI_ML_Concepts.ipynb --to slides --post serve

<img src="./images/salesforce.svg" width="50" height="50" align="right"/>

<img src="./images/di.png" width="50" height="50" align="right"/>

# AI and ML Concepts




<a id="learning-objectives"></a>
## Learning Objectives
*In this lesson, we will go over the following:*

- Jupyter Notebooks
- Data science and the data science workflow.
- Test train split
- Over/underfitting concepts 
- Data Science Applications
- AI ethics

### What Are Some Common Questions Asked in Data Science?

**Machine learning more or less asks the following questions:**

- Does X predict Y? 
- Are there any distinct groups in our data?
- What are the key components of our data?
- Is one of our observations “weird”?

**From a business perspective, Data Science can help us with use cases such as:**

- What is the likelihood that a customer will buy this product?
- Is this a good or bad review?
- How much demand will there be for my service tomorrow?
- Is this the cheapest way to deliver my goods?
- Is there a better way to segment my marketing strategies?
- What groups of products are customers purchasing together?
- Can we automate this simple yes/no decision?


## Are AI and Machine Learning different things?

The AI onion 

- Artificial intelligence is an umbrella term that covers machine learning and deep learning 
- Deep learning and neural networks are also types of machine learning algorithms  
- Data scientist vs. machine learning engineer: related roles with different focuses


<img src="./images/onion.png" width="270" height="270" align="center"/>


## When did this whole thing start?
---
AI is nothing new:
> - It started in the 1950s.
> - It was followed by two "AI winters".

<img src="./images/aiml.png" width="500" height="500" align="center"/>
<img src="./images/aiwinters.png" width="500" height="500" align="center"/>

## Why now?

---

In the last few years, there have been many advances in the technologies that enable AI:
> - Compute Power 
> - Big Data
> - Powerful Algorithms 

<img src="./images/whynow.png" width="350" height="350" align="center"/>

Read more here: https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/an-executives-guide-to-ai

Can you mention examples of advancements in the above technologies? 

# Data collected every minute
<img src="./images/data.png" width="400" height="400" align="center"/>


## Why is AI powerful?
<img src="./images/ny-vs-sf.jpg" width="350" height="350" align="center"/>


<img src="./images/nysf.png" width="600" height="600" align="center"/>


Check the demo here:
http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

## Brain Vs. Computer 
<img src="./images/brainvscomputer.png" width="600" height="600" align="center"/>

<a id="dswf"></a>
## Introduction: The Data Science Workflow

---
- **Understand the Business Problem**: Develop a hypothesis-driven approach to your analysis.
- **Data Acquisition and Understanding**: Select, import, explore, and clean your data.
- **Build a Model**: Engineer your data, build models, evaluate them, and select the best model.
- **Deployment**: Deploy your model in production and deliver ROI!

<img src="./images/lifecycle.png" width="650" height="650" align="center"/>



# This is what data scientists do

> They take care of the above lifecycle 

Effective data scientists are able to identify relevant questions, collect data from a multitude of different data sources, organize the information, translate results into solutions, and communicate their findings in a way that positively affects business decisions. These skills are required in almost all industries, causing skilled data scientists to be increasingly valuable to companies.

<img src="./images/timewise.png" width="400" height="400" align="center"/>



### Data scientist vs. machine learning engineer
The two roles overlap, which is why some data scientists with software engineering backgrounds move into machine learning engineer roles. Data scientists focus on analyzing data, providing business insights, and prototyping models, while machine learning engineers focus on coding and deploying complex, large-scale machine learning products.

## What do data engineers, analysts, and architects do?

ETL and data cleaning are the most time-consuming steps.

> Data scientists work with machine learning engineers to move their models to production

<img src="./images/time.jpg" width="700" height="700" align="center"/>



**Times (week, month, ...) mentioned in the chart are relative.**

# Let's Review the Data Science Lifecycle Step by Step

# Step 1. Business Understanding 

---

- Identify the business/product objectives.
- Identify and hypothesize goals and criteria for success.
- Create a set of questions to help you identify the correct data set.

## An Example Use Case
We work for a real estate company interested in using data science to determine the best properties to buy and resell. Specifically, our company would like to identify the characteristics of residential houses that predict their sale price, and to estimate the cost-effectiveness of doing renovations.

> #### Identify the Business/Product Objectives

The customer tells us their business goals are to accurately predict prices for houses (so that they can sell them for as large a profit as possible) and to identify which kinds of features in the housing market would be more likely to lead to foreclosure and other abnormal sales (which could represent more profitable sales for the company).

> #### Identify and Hypothesize Goals and Criteria for Success

Ultimately, the customer wants us to:
* Deliver a presentation to the real estate team.
* Write a business report discussing results, procedures used, and rationales.
* Build an API that provides estimated returns.

> #### Create a Set of Questions to Help You Identify the Correct Data Set

* Can you think of questions that would help this customer deliver on their business goals? 
* What sort of features or columns would you want to see in the data?

> **Instructor Note:** before going to data acquisition, you can ask questions such as 
> * What would an ideal data set look like? 
> * Describe the dataset that you think would be ideal for this use case

# Step 2. Data Acquisition

**Ideal Data vs. Available Data**

Oftentimes, we'll start by identifying the *ideal data* we would want for a project.

Then, during the data acquisition phase, we'll learn about the limitations on the types of data actually available. We have to decide if these limitations will inhibit our ability to answer our question of interest or if we can work with what we have to find a reasonable and reliable answer.

For example, we provide a set of housing data for Ames, Iowa, which [includes](./extra-materials/ames_data_documentation.txt):

- 20 continuous variables indicating square footage.
- 14 discrete variables indicating number of each room type.
- 46 categorical variables containing 2–28 classes each, e.g., street type (gravel/paved) and neighborhood (city district name).

---



### **Review the Dataset**

Take a moment to look through the data description. How closely does the set match the ideal data that you envisioned? Would it be sufficient for our purposes? What limitations does it have?

<img src="./images/houses.png" width="800" height="400" align="center"/>

---

This is possibly the hardest step in the data science workflow. At this stage, it's common to realize that the problem you're trying to solve may not be solvable with the information available. The data could be incomplete, nonexistent, or unable to meet the criteria necessary to answer your question.

That said, you now have a better feel for the data that's available and the information they could contain. You can now identify a new, answerable question that ultimately helps you solve or better understand your problem.

> **Instructor Note**: During the **Framing** phases, guide students toward the following questions:
> - Where are the data coming from?
> - How do the data fit together?
> - Are there enough data?
> - Do our data appropriately align with the question/problem statement?
> - Can the data set be trusted? How was it collected?
> - Is this data set aggregated? Can we use the aggregation, or do we need to obtain it pre-aggregation?
> - What are necessary resources, requirements, assumptions, and constraints?
> - Can we import data from the web (Google Analytics, HTML, XML)?
> - Can we import data from a file (CSV, XML, TXT, JSON)?
> - Can we import data from a pre-existing database (SQL)?
> - Can we set up local or remote data structures?
> - What are the most appropriate tools for working with the data?
> - Do these tools align with the format and size of the data set?

## 2.1 Data Wrangling & Cleaning

This is by far the most time-consuming step of the data science lifecycle.

For the Ames housing dataset we discussed,
- What if the data are in different databases and we have to consolidate them?
- What if the values for some columns in the dataset are missing or in the wrong format?
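
The two questions above can be made concrete with a small sketch. The records and column names below are hypothetical (they are not the real Ames schema), and the cleaning choices (drop rows with no price, fill a missing area with the median) are only one reasonable option. In practice a library such as pandas is typically used for this kind of work; plain Python is used here to keep the steps visible.

```python
from statistics import median

# Hypothetical toy records from two separate sources (illustrative column names).
sales = [
    {"house_id": 1, "sale_price": "208,500"},   # number stored as text with a comma
    {"house_id": 2, "sale_price": "181500"},
    {"house_id": 3, "sale_price": None},        # missing value
]
features = [
    {"house_id": 1, "living_area_sqft": 1710},
    {"house_id": 2, "living_area_sqft": None},  # missing value
    {"house_id": 3, "living_area_sqft": 1786},
]

def parse_price(raw):
    """Convert a price such as '208,500' to an int; keep missing values as None."""
    if raw is None:
        return None
    return int(str(raw).replace(",", ""))

# 1. Consolidate the two sources by joining on the shared key.
features_by_id = {row["house_id"]: row for row in features}
merged = [{**row, **features_by_id.get(row["house_id"], {})} for row in sales]

# 2. Fix the format of the price column.
for row in merged:
    row["sale_price"] = parse_price(row["sale_price"])

# 3. Handle missing values: drop rows without a price (the value we want to
#    predict) and fill missing areas with the median of the known areas.
known_areas = [r["living_area_sqft"] for r in merged if r["living_area_sqft"] is not None]
area_median = median(known_areas)

cleaned = []
for row in merged:
    if row["sale_price"] is None:
        continue
    if row["living_area_sqft"] is None:
        row = {**row, "living_area_sqft": area_median}
    cleaned.append(row)

for row in cleaned:
    print(row)
```

Whether to drop or fill a missing value depends on the column and the use case, so treat the choices here as a starting point for discussion rather than a rule.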


<img src="./images/datac.png" width="400" height="400" align="center"/>


**We will review and practice the data cleaning process as part of this course.**

## AI/ML algorithms are picky eaters
### They like cookies more than flour (raw data)
<img src="./images/coockie.jpg" width="400" height="400" align="center"/>

# Step 3. Modeling
**What is a Model?**

- Using machine learning algorithms, we build a model from input data (images, text, ...).
> - For the housing data set discussed above, we can build a model that learns how to predict the price of a house.
- The resulting model is a compact representation of the data used for training.

<img src="./images/model.png" width="400" height="400" align="center"/>

> - The size of the output model can be a lot smaller than the training data.
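
As a rough illustration of what "building a model" means, the sketch below fits a straight line to a handful of made-up (area, price) pairs. The data, the function names, and the choice of a one-input least-squares line are assumptions made for this example. Note that the trained model is just two numbers, however many rows were used to train it.

```python
# Hypothetical training data: (living area in square feet, sale price in dollars).
training_data = [
    (1000, 150_000),
    (1500, 200_000),
    (2000, 250_000),
    (2500, 300_000),
]

def fit_line(points):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# "Training" produces the model: here, only a slope and an intercept.
slope, intercept = fit_line(training_data)

def predict_price(area_sqft):
    """Use the trained model to estimate the price of a house it has not seen."""
    return slope * area_sqft + intercept

print(f"model: price = {slope:.1f} * area + {intercept:.1f}")
print(f"predicted price for 1800 sq ft: {predict_price(1800):.0f}")
```

Real housing models use many more inputs and more flexible algorithms, but the pattern is the same: data goes in, a compact set of learned parameters comes out.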

## There are many algorithms that can be used to build a model

<img src="./images/modelS.png" width="700" height="500" align="center"/>

> - Depending on the use case, requirements and available data, a model will be selected!

## Data scientists use one of these available algorithms and tune it for their use case
> - Most of these algorithms are available in public, open-source libraries.

> - Most data scientists do not build their own algorithms; they customize and tune an existing algorithm.

<a id="common-ml-defs"> </a>
## 3.1 Supervised  vs. Unsupervised Learning 

There are two main categories of machine learning: supervised learning and unsupervised learning.

**Supervised learning (a.k.a., “predictive modeling”):**  
_Classification and regression_
- Predicts an outcome based on input data.
    - Example: Predicts whether an email is spam or ham.
- Attempts to generalize.
- Requires past data on the element we want to predict (the target).

**Unsupervised learning:**  
_Clustering and dimensionality reduction_
- Extracts structure from data.
    - Example: Segmenting grocery store shoppers into “clusters” that exhibit similar behaviors.
- Attempts to represent.
- **Does not require** past data on the element we want to predict.

<img src="./images/sup.png" width="700" height="500" align="center"/>


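Oftentimes, we may combine both types of machine learning in a project: an unsupervised step learns a better representation of the data, which reduces the amount of labeled data we need to collect for the supervised step. This idea is closely related to semi-supervised learning and to transfer learning, where a representation learned on one task is reused for another.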

Unsupervised learning tends to present more difficult problems because its goals are amorphous. Supervised learning has goals that are almost too clear and can lead people into the trap of optimizing metrics without considering business value.
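
To make the contrast concrete, here is a minimal sketch of both settings using toy data. Everything in it is an assumption for illustration: the single "suspicious word count" feature, the weekly spend figures, and the choice of a nearest-neighbor classifier and a two-cluster k-means. The point to notice is that the supervised part needs labels, while the unsupervised part only receives raw values.

```python
# --- Supervised: past examples come WITH the answer (the target label). ---
# Hypothetical data: (number of suspicious words in an email, label).
labeled_emails = [(0, "ham"), (1, "ham"), (2, "ham"),
                  (7, "spam"), (9, "spam"), (12, "spam")]

def classify(suspicious_words):
    """1-nearest-neighbor: copy the label of the most similar labeled example."""
    _, label = min(labeled_emails, key=lambda ex: abs(ex[0] - suspicious_words))
    return label

# --- Unsupervised: only raw values, no labels. ---
# Hypothetical weekly spend (in dollars) of grocery store shoppers.
weekly_spend = [12, 15, 18, 20, 95, 100, 110, 120]

def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2; assumes at least two distinct values."""
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(low_group) / len(low_group)
        high = sum(high_group) / len(high_group)
    return low, high

print(classify(10))             # predicts a label for a new email
print(two_means(weekly_spend))  # discovers two spending groups on its own
```

The classifier can be scored against known answers, whereas the clusters have no "correct" labels to check against, which is one reason unsupervised results are harder to evaluate.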

## 3.2 Feature Engineering 

#### Data Enrichment 

- Machine learning algorithms need the data to be engineered before they consume it
<img src="./images/garbage.png" width="300" height="300" align="center"/>

> - We need feature engineering to enrich the raw data 

Suppose we want to predict a customer's next purchase using a dataset that looks like this:
<img src="./images/f1.png" width="350" height="350" align="center"/>

How can we enrich this data?

<img src="./images/f3.png" width="380" height="380" align="center"/>
Here, creating the new feature “Age” is an example of feature engineering.
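
A minimal sketch of deriving an "Age" feature is shown below. It assumes the raw data holds a date-of-birth column, which is a guess about the table in the images; the records, the column names, and the fixed reference date are all hypothetical.

```python
from datetime import date

# Hypothetical raw records; "date_of_birth" is an assumed column name.
customers = [
    {"customer_id": 1, "date_of_birth": date(1990, 6, 15)},
    {"customer_id": 2, "date_of_birth": date(1975, 12, 1)},
]

def age_on(birth, reference):
    """Age in whole years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    return reference.year - birth.year - (0 if had_birthday else 1)

# A fixed reference date keeps the example reproducible.
reference_date = date(2020, 1, 1)

# Enrich each record with the engineered feature.
enriched = [{**c, "age": age_on(c["date_of_birth"], reference_date)} for c in customers]

for row in enriched:
    print(row["customer_id"], row["age"])
```

A raw date is hard for most algorithms to use directly, while a number such as age can relate to purchasing behavior in a way a model can pick up.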

Now, the steps to do feature engineering are as follows:

> - Brainstorm features.
> - Create features.
> - Check how the features work with the model.
> - Repeat from the first step until the features work well.


So here is another definition of feature engineering:

### Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

> **Instructor Note**: What are other feature engineering ideas for the housing dataset above?
> - Can we use dates to make an API call and get the interest rate at each point in time?
> - Can we use dates to calculate stock performance indexes and see if they affect the sale price of a house? 

You can get creative and create/engineer new features that can make your model stand out!

## 3.3 Overfitting and Underfitting


What is a good model? 
Arguably, Machine Learning models have one sole purpose: to generalize well.

> Generalization is the model’s ability to give sensible outputs to sets of input that it has never seen before.

### An Example
Let’s say we’re trying to build a Machine Learning model for the following data set.

<img src="./images/fit1.png" width="450" height="450" align="center"/>








### A fit model
Training the Linear Regression model in our example is all about minimizing the total distance (i.e. cost) between the line we’re trying to fit and the actual data points. This goes through multiple iterations until we find the relatively “optimal” configuration of our line within the data set. This is exactly where overfitting and underfitting occur.

> In Linear Regression, we would like our model to follow a line similar to the following:
<img src="./images/fit2.png" width="450" height="450" align="center"/>

>The line above could give a very likely prediction for the new input, as, in terms of Machine Learning, the outputs are expected to follow the trend seen in the training set.

### Overfitting
When we run our training algorithm on the data set, we allow the overall cost (i.e. distance from each point to the line) to become smaller with more iterations. Letting the training algorithm run for a long time drives the overall cost toward its minimum. However, this means that the line will be fit to all the points (including noise), catching secondary patterns that may not be needed for the generalizability of the model.
> Referring back to our example, if we leave the learning algorithm running for a long time, it could end up fitting the line in the following manner:
<img src="./images/over.png" width="450" height="450" align="center"/>

>This looks good, right? Yes, but is it reliable? Well, not really.

> If the model does not capture the dominant trend that we can all see (positively increasing, in our case), it can’t predict a likely output for an input that it has never seen before — defying the purpose of Machine Learning to begin with!

>Overfitting is the case where the overall cost is really small, but the generalization of the model is unreliable. This is due to the model learning “too much” from the training data set.

>> We always want to find the trend, not fit the line to all the data points.

### Underfitting
We want the model to learn from the training data, but we don’t want it to learn too much (i.e. too many patterns). One solution could be to stop the training earlier. However, this could lead the model to not learn enough patterns from the training data, and possibly not even capture the dominant trend. This case is called underfitting.

>Underfitting is the case where the model has “not learned enough” from the training data, resulting in poor generalization and unreliable predictions.

<img src="./images/under.png" width="450" height="450" align="center"/>



### Bias-variance trade-off

So what is the right balance? Depending on the model at hand, a performance that lies between overfitting and underfitting is most desirable. This trade-off is one of the most important aspects of Machine Learning model training. As we discussed, Machine Learning models fulfill their purpose when they generalize well. Generalization is bounded by two undesirable outcomes: high bias (underfitting) and high variance (overfitting). Detecting whether the model suffers from either one is the responsibility of the model developer.
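
The three situations can be compared side by side in a small sketch. The data is synthetic (a known trend plus alternating noise), and the three models are stand-ins chosen for illustration: a constant prediction for underfitting, a least-squares line for a good fit, and a model that memorizes the training points for overfitting. The unseen points are generated noise-free from the same trend to keep the arithmetic simple.

```python
# Training data: the trend y = 2x + 1 plus alternating noise of +3 / -3.
train = [(x, 2 * x + 1 + (3 if x % 2 == 0 else -3)) for x in range(10)]
# Unseen data drawn from the same trend (noise-free for simplicity).
test = [(x + 0.25, 2 * (x + 0.25) + 1) for x in range(9)]

def mse(model, points):
    """Mean squared error: the 'cost' of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

# Underfit (high bias): ignores x and always predicts the average y.
mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)

def constant_model(x):
    return mean_y

# Reasonable fit: a least-squares straight line that captures the trend.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def line_model(x):
    return slope * x + intercept

# Overfit (high variance): memorizes every training point, noise included.
def memorizing_model(x):
    _, y = min(train, key=lambda p: abs(p[0] - x))
    return y

for name, model in [("underfit", constant_model),
                    ("good fit", line_model),
                    ("overfit", memorizing_model)]:
    print(f"{name:9s} train MSE = {mse(model, train):6.2f}   "
          f"test MSE = {mse(model, test):6.2f}")
```

The memorizing model reaches zero cost on the training data yet does worse than the line on unseen points, while the constant model does poorly on both. Comparing training cost with cost on held-out data is a common way to spot which problem a model has.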

## 3.4 Test Train Split

Should we use all the data for training a model? 


> Data scientists usually hold out part of the data for testing the model performance.
<img src="./images/ttsplit.jpg" width="500" height="500" align="center"/>

> If we use all the data for training, then we have no way of evaluating how the model performs on data it has not seen.
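
A hold-out split can be sketched in a few lines. The helper name `split_train_test`, the 20% test fraction, and the seed are choices made for this example; scikit-learn offers a ready-made `train_test_split` function in `sklearn.model_selection` that is normally used instead.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then hold out the last test_fraction of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

rows = list(range(10))  # stand-in for 10 rows of a data set
train_rows, test_rows = split_train_test(rows)

print("train:", train_rows)
print("test: ", test_rows)
```

Shuffling before splitting matters when the rows are stored in a meaningful order (for example, sorted by date or price); otherwise the test set may not resemble the training set.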







### Cross Validation

> Why settle for one fixed train/test split when we can use different combinations of test and training data?

<img src="./images/cross.png" width="450" height="450" align="center"/>

Instead of using one fixed split of the data into test and training sets, we can use cross validation.
> - In cross validation, we use different parts of the data in turn for testing and training to evaluate the model performance.

> - The average performance across the different splits can then be used as the final performance estimate.
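
The idea can be sketched as k-fold cross validation: split the rows into k folds, let each fold serve as the test set once, and average the scores. The helper `k_fold_indices`, the toy values, and the deliberately trivial "model" (predict the training mean) are assumptions for illustration only.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

# Hypothetical target values; the "model" simply predicts the training mean.
values = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0]

fold_errors = []
for train_idx, test_idx in k_fold_indices(len(values), k=3):
    prediction = sum(values[i] for i in train_idx) / len(train_idx)
    # Mean absolute error on the fold that was held out.
    error = sum(abs(values[i] - prediction) for i in test_idx) / len(test_idx)
    fold_errors.append(error)

cv_error = sum(fold_errors) / len(fold_errors)
print("error per fold:", fold_errors)
print("cross-validated error:", cv_error)
```

Every row is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split. In practice the rows are usually shuffled first.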

# Step 4. Use Cases
**What are some of the use cases for AI/ML?**

- Nearly all occupations will be affected by automation
> - But only about 5 percent of occupations could be fully automated by currently demonstrated technologies.
- Many more occupations have portions of their constituent activities that are automatable: 
> - About 30 percent of the activities in 60 percent of all occupations could be automated.

<img src="./images/usecase.svg" width="500" height="500" align="center"/>


## 4.1  Example AI Use Cases 

> **Instructor Note**: This is a good section in which to provide your own work (or side project) experience as well! These are just a couple of options:
- [This Person is not real](https://thispersondoesnotexist.com/)
- [Google Quick Draw](https://quickdraw.withgoogle.com/)
- [Deep Dream Generator](https://deepdreamgenerator.com/)
- Add your own!

## AI Ethics

As an engineer or some other non-philosopher it can be very easy to forget about ethics and simply build systems for the sake of building cool things. We must, however, be aware of the potential outcomes of our build decisions when it comes to highly complex, sophisticated, and potentially impactful systems.


### Data-Biasing

The quality of your model is usually a direct result of the quality and quantity of your data. 

You can imagine a myriad of situations in which classification problems could go wrong because of bias in past data. From an ethical perspective, I think we can all agree that systems which discriminate against individuals on the basis of race, gender, age, ethnicity, etc. are unacceptable.

Some bad outcomes:
> Security systems trained to discriminate based on an individual’s race or gender.
 
> An AI-based resume review tool that weighs the gender of applicants

> Facial recognition systems that lack a diverse training set, resulting in only detecting the race for which they are trained
 
> Court systems (AI judges/juries) with past biased rulings against certain races as the training data

<img src="./images/ugly_ai.png" width="350" height="350" align="center"/>

### How to Avoid Bias?
Ultimately, the majority of these issues can be solved by some human-centered approaches to acquiring, cleaning, labeling, and annotating data. But this can be especially difficult. 

> Our AI systems are, in many ways, mirrors of the people who train them.


We need to develop approaches for identifying AI systems that are not performing within our ethical framework and are producing net bad outcomes for society.
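
One simple first check, sketched below under stated assumptions, is to compare how often a model says "yes" for different groups. The decisions and group labels are made up, and a gap in these rates is only a flag for closer review, not proof that a system is unfair.

```python
from collections import defaultdict

# Hypothetical model decisions: (group label, model said "yes").
decisions = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", False), ("B", True), ("B", False), ("B", False),
]

def selection_rates(records):
    """Fraction of positive decisions per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, selected in records:
        counts[group][0] += int(selected)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}

rates = selection_rates(decisions)
gap = max(rates.values()) - min(rates.values())

print("selection rate per group:", rates)
print("largest gap between groups:", gap)
```

A check like this is cheap to run on any classifier's output, which makes it a reasonable starting point before a deeper, human-led review of the data and labels.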

## Next Steps

- Install Anaconda (https://www.anaconda.com/distribution/)

> Make sure to install the latest (Python 3.7) version.

<img src="./images/anaconda.png" width="600" height="600" align="center"/>

- We will go over the Python programming language in the next session.