Summary

Software Engineering Overview for Data Scientists by Stephen Pettinato, Data Scientist at FabFitFun.com

There are lots of Software Engineering principles floating around that Data Scientists can leverage in their day-to-day work. This talk organizes these ideas to illustrate the _why_ of it all.

# Software Engineering for Data Scientists
## or
## Why can't I just build models?

### Stephen Pettinato
### 2020-02-10

# Bio

### Degrees: BA Mathematics, MA Statistics, MS Computer Science

### Work History: Software Engineer, Data Engineer, Data Scientist

## Currently a Senior Data Scientist at FabFitFun

<img src="https://static.fabfitfun.com/wp-content/uploads/2019/11/WI19_1572634027.8742_1572634028.4905.png" alt="drawing" style="width:400px;"/>


## This talk evolved over the last year from numerous questions around,

## The model is done, but now I have to produce predictions?

## Data Scientists produce predictions and insights

## So, it's like this, right?

<img src="https://images.pexels.com/photos/1181398/pexels-photo-1181398.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=375&w=630" alt="drawing">


## Oops, more like,

## Data Scientists produce _code_ that produces predictions and insights

<img src="https://www.jetbrains.com/pycharm/features/screenshots/survey-sciview.png" alt="drawing">


## Ok, so people care about the code we write?

## No - people only care about the **predictions** and **insights**

### Basically it's the same with Software Engineers,

### nobody cares about the code they write, they only care about **products** and **product features**

### Software Engineers follow best practices to allow them to produce products and product features, 
* easily
* consistently
* reproducibly

### Software Engineering process
1. Architecture
2. POC
3. Implementation
4. Productionalization
5. Support

### Data Scientist process
1. Analysis/Problem Definition
2. Modeling
3. Productionalization
4. Support

### The big difference here is that an SE can be pretty sure that their implementation will work, whereas a DS doesn't always know if the model will work well.

## Iteration
### Software Engineers have learned to iterate on problems.

### What's the smallest item that can be delivered in the next 2 weeks?

1. Architecture
2. POC
3. Implementation
4. Productionalization
5. Support

### Some considerations here are,
#### An SE approach can sometimes work for models with a high likelihood of effectiveness
#### Some DS work is universally useful such as tidying a dataset

### By comparison, DS work can be exceptionally varied

## Different Tasks require Different Skills

### Depending on the task, put on a different hat

1. If the task at hand is analysis, then put on your Analyst Hat
2. If the task at hand is modeling, then put on your Machine Learning Hat
3. etc.

### Identify the task at hand and act with the proper role

### Data Scientist process

<img src="table.png" alt="drawing">

## What are you trying to do?

## What skills should you lean into for this task?

* Product Manager
* Analyst
* Machine Learning Engineer
* Software Engineer

### It can be hard to stick to a role

### When doing productionalization, what if you find an anomaly in the dataset?

### This is one of the hardest problems in applied data science

> If you are doing analysis, should you also be doing software engineering?

> If you are doing productionalization, should you also be doing analysis?

### Switching is fine, but it's impossible to do both at the same time

# What can Data Scientists learn from Software Engineers _without_ becoming Engineers?

1. Generalization
2. Reproducibility
3. Automation

## Generalization
* As soon as this analysis on this time-range is finished, I'll be asked to analyze other time-ranges
* As soon as I've built a model, I'll need to run it over different inputs for testing/validation

```python
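# Hard-coded time range: has to be edited by hand for every new request
sql = """
    SELECT *
    FROM atable
    WHERE event_time BETWEEN '2020-01-10' AND '2020-01-20'
"""

# Then run analysis here
```

```python
# Time range pulled out into variables: the same query now works for any range
start_time = '2020-01-10'
end_time = '2020-01-20'
sql = f"""
    SELECT *
    FROM atable
    WHERE event_time BETWEEN '{start_time}' AND '{end_time}'
"""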

# Then run analysis here
```
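
Taking this one step further, the query can be wrapped in a function so that any time range is a single call. The following is a minimal sketch, not the talk's own code: the function name `events_between`, the in-memory SQLite database, and the toy rows in `atable` are all assumptions made so that the example runs on its own. It also swaps the f-string for bound parameters, which avoids quoting problems.

```python
import sqlite3


def events_between(conn, start_time, end_time):
    """Return rows of `atable` whose event_time lies in [start_time, end_time]."""
    sql = """
        SELECT *
        FROM atable
        WHERE event_time BETWEEN ? AND ?
        ORDER BY event_time
    """
    # Bound parameters (?) let the driver handle quoting of the dates.
    return conn.execute(sql, (start_time, end_time)).fetchall()


# Hypothetical toy data so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atable (event_time TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO atable VALUES (?, ?)",
    [("2020-01-05", 1), ("2020-01-12", 2), ("2020-01-18", 3), ("2020-01-25", 4)],
)

# The same analysis input for two different time ranges.
mid_january = events_between(conn, "2020-01-10", "2020-01-20")
late_january = events_between(conn, "2020-01-21", "2020-01-31")
```

With the range as an argument, a request for another time range becomes one more call rather than an edited copy of the query.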

## Reproducibility

## This is different depending on who you talk to.

## Mostly this means
* use a notebook with source control
* that runs from production datasets

## i.e. If an analysis is cobbled together from 10 bits of untracked code, then it can't be redone

## I've done this ☹️
1. SQL -> file to get 6 different versions of the file
2. Some bash/pandas to clean the file
3. Produce some plots
4. Store to a word doc

## Nowadays I do,

```python
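import pandas as pd

# `conn` is assumed to be an existing database connection (e.g. a SQLAlchemy engine)
myfile_df = pd.read_csv('s3://mybucket/myfile.csv')
mydb_df = pd.read_sql('SELECT * FROM atable', conn)

## transformations ##
# Plots and results!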
```

```bash
python build_model.py --input_data s3://mybucket/inputdata.csv

# instead of

python build_model.py
```
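
One way a script like `build_model.py` could accept its input as an argument is with the standard-library `argparse` module. This is only a sketch under assumptions: the `--input_data` flag comes from the command above, but the functions `parse_args` and `main` and the placeholder body are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments; `argv=None` means use sys.argv."""
    parser = argparse.ArgumentParser(description="Build a model from an input dataset.")
    parser.add_argument(
        "--input_data",
        required=True,
        help="Path or URI of the training data, e.g. s3://mybucket/inputdata.csv",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: the real script would load args.input_data and fit a model here.
    print(f"Building model from {args.input_data}")
    return args.input_data


# Called with an explicit argument list here so the sketch runs anywhere;
# a real script would call main() under `if __name__ == "__main__":`.
used_input = main(["--input_data", "s3://mybucket/inputdata.csv"])
```

Because the input is explicit, the same script can be rerun later on the same file, or pointed at a different one, without editing the code.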

## Automation

Once a model is built, it will have to be
* Periodically Re-trained
* Used to generate predictions

### Modeling doesn't mean building 1 model, it means building a system to build models
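
A rough sketch of what "a system to build models" can look like: one entry point that retrains and then scores, so a scheduler can call it repeatedly. Everything here is an assumption for illustration. `MeanModel` is a stand-in for a real model, and `train`, `predict`, and `retrain_and_predict` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MeanModel:
    """Stand-in model: always predicts the mean of the training targets."""
    value: float


def train(targets):
    """Fit the stand-in model on a list of numeric targets."""
    return MeanModel(value=mean(targets))


def predict(model, n_rows):
    """Produce one prediction per row to be scored."""
    return [model.value] * n_rows


def retrain_and_predict(targets, n_rows_to_score):
    """Single entry point covering both periodic retraining and prediction."""
    model = train(targets)
    return model, predict(model, n_rows_to_score)


model, predictions = retrain_and_predict([1.0, 2.0, 3.0], n_rows_to_score=2)
```

The point is the shape, not the model: once training and prediction are plain functions, rerunning them on new data is a function call instead of a manual notebook session.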

## Caveat - Avoid Over-Optimization

* Slow code is fine during Analysis and Modeling as rapid iteration is highly valued
* For Productionalization, code has to meet some base level of speed


### _It's easier to make correct code fast than to make fast code correct_

# Software Engineering Best Practices

## The Joel Test - for Software Engineers

1. Do you use source control?
2. Can you make a build in one step?
3. Do you make daily builds?
4. Do you have a bug database?
5. Do you fix bugs before writing new code?
6. Do you have an up-to-date schedule?
7. Do you have a spec?
8. Do programmers have quiet working conditions?
9. Do you use the best tools money can buy?
10. Do you have testers?
11. Do new candidates write code during their interview?
12. Do you do hallway usability testing?

https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/

### Data Scientists can learn a few items from the Joel Test
1. Do you use source control?
2. Can you make ~~a build~~ <span style="background-color: #FFFF00">both models and predictions</span> in one step?
3. Do you make daily ~~builds~~ <span style="background-color: #FFFF00">predictions</span>?
4. Do you have a bug database?
5. Do you fix bugs before writing new code?
6. Do you have an up-to-date schedule?
7. Do you have a spec<span style="background-color: #FFFF00">/goals/requirements/stakeholders/documentation</span>?
8. Do programmers have quiet working conditions?
9. Do you use the best tools money can buy <span style="background-color: #FFFF00">without extensive support</span>?
10. Do you ~~have testers~~ <span style="background-color: #FFFF00">evaluate predictions</span>?
11. Do new candidates write code during their interview?
12. Do you do ~~hallway usability testing~~ <span style="background-color: #FFFF00">collaborate</span>?

## So, what do we do?

## Tools
* GitHub
* Jira
* Confluence
* Domino, Databricks, AWS
* Airflow/AWS Lambda
* S3/HDFS
* Google Sheets/Slides

## Culture
* Testing
* Automation
* Documentation
* Collaboration

## GitHub
1. Do you use source control?

## Airflow/Databricks/Domino/AWS

2. Can you make <span style="background-color: #FFFF00">both models and predictions</span> in one step?
3. Do you make daily <span style="background-color: #FFFF00">predictions</span>?
9. Do you use the best tools money can buy <span style="background-color: #FFFF00">without extensive support</span>?

## Jira/Confluence
4. Do you have a bug database?
5. Do you fix bugs before writing new code?
6. Do you have an up-to-date schedule?
7. Do you have a spec<span style="background-color: #FFFF00">/goals/requirements/stakeholders/documentation</span>?

## Collaboration
8. Do programmers have quiet working conditions?
12. Do you <span style="background-color: #FFFF00">collaborate</span>?

## Testing
10. Do you <span style="background-color: #FFFF00">evaluate predictions</span>?

## Culture
11. Do new candidates write code during their interview?

## Culture

### Testing

* Does the code do what we think it does?
* Have we tested it with full datasets and done spot checks?
* Are corner cases documented and understood?
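
As a small illustration of the questions above, a test can pin down what a data-cleaning step does, including its corner cases. This is a sketch with hypothetical names: `fill_missing` stands in for whatever transformation the project actually uses.

```python
def fill_missing(values, default=0.0):
    """Replace None entries with `default` (hypothetical cleaning step)."""
    return [default if v is None else v for v in values]


def test_fill_missing():
    # Typical case: one missing value in the middle.
    assert fill_missing([1.0, None, 3.0]) == [1.0, 0.0, 3.0]
    # Corner case: empty input stays empty.
    assert fill_missing([]) == []
    # Corner case: everything missing, with a custom default.
    assert fill_missing([None, None], default=-1.0) == [-1.0, -1.0]


test_fill_missing()
```

Writing the corner cases down as assertions also documents them for whoever picks up the code next.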

## Culture
### Automation

* When the project is done, does it need automation?
* If it needs to run regularly, does it do so automatically?
* Does it do sanity checks during periodic runs?
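
A sanity check for a periodic run can be as simple as validating each batch of predictions before it is published. The sketch below assumes the predictions are probabilities in [0, 1]; the function name `check_predictions` and the specific rules are hypothetical.

```python
import math


def check_predictions(predictions, expected_count):
    """Raise ValueError if a batch of probability predictions looks wrong."""
    if len(predictions) != expected_count:
        raise ValueError(
            f"expected {expected_count} predictions, got {len(predictions)}"
        )
    # Flag anything that is NaN or outside the valid probability range.
    bad = [p for p in predictions if math.isnan(p) or not 0.0 <= p <= 1.0]
    if bad:
        raise ValueError(f"{len(bad)} predictions are NaN or outside [0, 1]")


# A healthy batch passes silently.
check_predictions([0.1, 0.9, 0.5], expected_count=3)
```

Failing loudly inside the scheduled job is usually better than quietly delivering a broken batch to stakeholders.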

## Culture
### Documentation

* Ye olde "hit by a bus" problem
* When I pick up the code in 6 months, how painful will it be?
* Can I easily communicate metrics, process and results to stakeholders?

## Culture
### Collaboration
* How easy would it be to work on this project with a colleague?
* Can they run the notebooks?
* Can they understand the problem and goals?

# Conclusions

* Focus on Predictions and Insights
* Ask yourself "What role does this task require?"
* Keep an eye on the future,
  1. Generalization
  2. Reproducibility
  3. Automation
  4. Collaboration
  


# Questions?