Summary

Software Engineering Overview for Data Scientists by Stephen Pettinato, Data Scientist at FabFitFun.com

There are lots of Software Engineering principles floating around that Data Scientists can leverage in their day-to-day work. This talk organizes these ideas to illustrate the _why_ of it all.

# Software Engineering Overview for Data Scientists
## or
## Why can't I just build models?

### Stephen Pettinato
### 2019-12-13

# Bio

### Degrees: BA Mathematics, MA Statistics, MS Computer Science

### Work History: Software Engineer, Data Engineer, Data Scientist

## Currently a Senior Data Scientist at FabFitFun

<img src="https://static.fabfitfun.com/wp-content/uploads/2019/11/WI19_1572634027.8742_1572634028.4905.png" alt="drawing" style="width:400px;"/>


This talk evolved over the last year from numerous questions like:

1. Why is my code slow?
2. Why doesn't this code work?
3. What is this code doing?

# Outline
1. Motivation
2. Library Versions
3. Version Control
4. Modules and Packages
5. Distributed, Parallel and GPUs
6. Object Oriented Programming
7. Functional Programming
8. Automation
9. Testing
10. Languages

In [16]:
import arrow
import pandas as pd
import unittest

In [6]:
df = pd.DataFrame({'name': ['Loretta', 'Oliver', 'Romolo', 'Loretta', 'Oliver'],
                   'product': ['towel', 'towel', 'magazine', 'spatula', 'towel']})

# Library Versions

In [12]:
# What exactly are you importing?
import pandas as pd

In [11]:
print("Hello World")  # Python 3

# print "Hello World"  # Python 2

Hello World


In [10]:
# As of Pandas 0.21
# With axis='columns' (or axis=1) the function receives one row at a time
display(
    df.apply(lambda row: row == 'towel', axis='columns')
)

# In all versions of Pandas >= 0.7
display(
    df.apply(lambda row: row == 'towel', axis=1)
)


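|   | name  | product |
|---|-------|---------|
| 0 | False | True    |
| 1 | False | True    |
| 2 | False | False   |
| 3 | False | False   |
| 4 | False | True    |


|   | name  | product |
|---|-------|---------|
| 0 | False | True    |
| 1 | False | True    |
| 2 | False | False   |
| 3 | False | False   |
| 4 | False | True    |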


Pandas Documentation for DataFrame.apply
* [0.21](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.apply.html)
* [0.22](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.apply.html)
* [0.7](https://pandas.pydata.org/pandas-docs/version/0.7.0/generated/pandas.DataFrame.apply.html)
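
One way to answer "what exactly are you importing?" is to ask the environment itself. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8) to report which versions are installed. The helper names `installed_version` and `describe_environment` are made up for this illustration.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(package_names):
    """Map 'python' and each requested package to its installed version."""
    versions = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in package_names:
        versions[name] = installed_version(name)
    return versions


if __name__ == "__main__":
    for name, version in describe_environment(["pandas", "numpy"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Printing this at the top of a notebook makes it easier to tell later which library versions produced a given result.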

## NumPy has a 2,000-word policy on Backwards Compatibility

https://numpy.org/neps/nep-0023-backwards-compatibility.html

* Aim not to break users’ code unnecessarily.
* Aim never to change code in ways that can result in users silently getting incorrect results from their previously working code.
* Backwards incompatible changes can be made, provided the benefits outweigh the costs.

## What version should I use?

* New Software - Newest version
* Existing Software - Whatever was used previously

## What if I have to upgrade a library?

* Option 1: Write unit tests
* Option 2: 

<img src="https://images-na.ssl-images-amazon.com/images/I/51ziSsJQOEL._SX376_BO1,204,203,200_.jpg" alt="drawing" style="width:300px;"/>
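
As a sketch of Option 1: before upgrading, write a small test that pins down the library behaviour your code depends on, then rerun it after the upgrade. The example assumes pandas is installed. The test class and the `run_tests` helper are hypothetical names, and the behaviour pinned here (row-wise `apply`) is just the one used earlier in this talk.

```python
import unittest

import pandas as pd


class ApplyAxisTest(unittest.TestCase):
    """Pins down behaviour we rely on, so an upgrade that changes it fails loudly."""

    def setUp(self):
        self.df = pd.DataFrame({"name": ["Loretta", "Oliver"],
                                "product": ["towel", "magazine"]})

    def test_row_wise_comparison(self):
        result = self.df.apply(lambda row: row == "towel", axis=1)
        # Compare plain Python values so the test does not depend on dtypes
        self.assertEqual(list(result.columns), ["name", "product"])
        self.assertEqual(result.to_numpy().tolist(),
                         [[False, True], [False, False]])


def run_tests():
    """Run the test case explicitly, which also works inside a notebook."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyAxisTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

If `run_tests().wasSuccessful()` is true both before and after the upgrade, that particular assumption survived it.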

# Version Control

Goes hand-in-hand with Library Versions

Sure, you are using Pandas 0.25.1, but what's the actual code powering it?

## Software Engineers use version control for
* Collaboration and code organization across team members
* Which code has which bug
* What code is running in which environment
  * Production
  * Staging
  * Development

They know exactly what code is running in each location, and can roll back when something breaks.
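
A minimal sketch of "knowing what code is running": ask git for the commit that is currently checked out and log it alongside your results. It assumes the `git` command-line tool is available and that the code lives in a git repository; `current_commit` is a hypothetical helper name.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the checked-out commit hash of repo_dir, or None if unavailable."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a git repository
        return None
    return completed.stdout.strip()


if __name__ == "__main__":
    print(f"Running code at commit: {current_commit() or 'unknown'}")
```

Storing this hash with a trained model or a report lets you check out the exact code that produced it later.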

### So what do Data Scientists use Version Control for?
* Collaboration and code organization across team members
* Which code has which bug
* What code is running in which environment
  * Production
  * Staging
  * Development

# Modules and Packages

Hey, I've written some code and want others to be able to run it in Notebooks. How do I do this?

`pip install your_code`

or

`pip install git+ssh://git@github.com/fabfitfun/your_code.git`
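
To show what "installable" means underneath: a module is a `.py` file that Python can find on `sys.path`. The sketch below writes a tiny module to a temporary directory, puts that directory on the path, and imports it by name. The module name `your_code_demo` and the function `count_towels` are invented for the illustration; `pip install` instead copies your code into a directory that is already on the path.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module contents; in practice this is the code you already wrote.
MODULE_SOURCE = '''
def count_towels(products):
    """Count how many entries in an iterable of product names are towels."""
    return sum(1 for product in products if product == "towel")
'''


def import_from_directory(module_name, source, directory):
    """Write source to <directory>/<module_name>.py and import it by name."""
    Path(directory, f"{module_name}.py").write_text(source)
    if directory not in sys.path:
        # pip avoids this step by installing into site-packages,
        # which is already on sys.path
        sys.path.insert(0, directory)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


package_dir = tempfile.mkdtemp()
your_code = import_from_directory("your_code_demo", MODULE_SOURCE, package_dir)
print(your_code.count_towels(["towel", "magazine", "towel"]))
```

Once the code is importable by name, any notebook in that environment can use it with a plain `import`.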

In [None]:
# TODO: more here

# Distributed, Parallel and GPUs

What's the difference?

## Parallel Processing
* Runs across multiple CPUs on a single machine
* Good for 
  * Small Datasets
  * If you are constrained to your laptop or a single machine
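
A minimal sketch of single-machine parallelism using only the standard library. `sum_of_squares` is a made-up stand-in for CPU-heavy work, and `run_in_parallel` is a hypothetical helper. The sketch assumes the code is saved and run as a script; functions defined directly in a notebook cell may not be importable by worker processes on some platforms.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(limit):
    """CPU-bound stand-in for real work: sum of i*i for every i below limit."""
    return sum(i * i for i in range(limit))


def run_in_parallel(limits, max_workers=None):
    """Evaluate sum_of_squares for every limit using a pool of worker processes."""
    # max_workers=None lets the executor pick a default based on the CPU count
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sum_of_squares, limits))


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import
    print(run_in_parallel([10, 100, 1000]))
```

Each input is handled by a separate process, so the work can spread across CPU cores. For tiny inputs like these, the cost of starting processes usually outweighs the gain.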

## GPU Processing
* Runs on a Graphics Card on a single machine
* Good for 
  * Small Datasets
  * If you are constrained to your laptop or a single machine
  * Neural Nets/Deep Learning

## Distributed Processing
* Runs on many machines
* Good for
  * Large Datasets
  * Small Datasets that need lots of parallel processing

Historically, all of these were complicated to set up, but there are nice data science libraries nowadays:
* Spark
* Dask
* PyTorch/rTorch
* TensorFlow

## It's a progression as your workflow evolves

On one end is a single processor on a single machine

On the other end there is Distributed Computing + GPU Processing

# Object Oriented Programming

Data Representation

Data Processing

In [36]:
# Data Representation
# Not particularly useful, since we mostly use DataFrames

class Customer:
    def __init__(self, account_code):
        self.account_code = account_code
        self.signup_date = arrow.get()  # with no arguments, the current UTC time

class SeasonalMember(Customer):
    def __init__(self, account_code, season):
        super().__init__(account_code)  # set the fields shared by every Customer
        self.season = season

Customer('myaccount');
SeasonalMember('myaccount', 1904);

In [37]:
# Data Processing
# Can be useful in some situations, mostly if you have a coding pattern that you want to follow

class 
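
The processing example is left open in the notebook. As one possible shape for "a coding pattern that you want to follow", here is a sketch in which every step validates its input and then transforms it. All class and function names (`ProcessingStep`, `KeepProduct`, `NamesOnly`, `run_pipeline`) are hypothetical, and plain lists of dicts stand in for DataFrames to keep the example dependency-free.

```python
from abc import ABC, abstractmethod


class ProcessingStep(ABC):
    """Hypothetical base class: every step validates its input, then transforms it."""

    def run(self, records):
        self.validate(records)
        return self.transform(records)

    def validate(self, records):
        if not isinstance(records, list):
            raise TypeError("records must be a list")

    @abstractmethod
    def transform(self, records):
        """Return a new list built from records."""


class KeepProduct(ProcessingStep):
    """Keep only the records for one product."""

    def __init__(self, product):
        self.product = product

    def transform(self, records):
        return [r for r in records if r["product"] == self.product]


class NamesOnly(ProcessingStep):
    """Reduce each record to the customer name."""

    def transform(self, records):
        return [r["name"] for r in records]


def run_pipeline(steps, records):
    """Feed the output of each step into the next one."""
    for step in steps:
        records = step.run(records)
    return records


orders = [
    {"name": "Loretta", "product": "towel"},
    {"name": "Romolo", "product": "magazine"},
    {"name": "Oliver", "product": "towel"},
]
print(run_pipeline([KeepProduct("towel"), NamesOnly()], orders))
```

The base class fixes the order of operations once, so each new step only has to supply its own `transform`.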




# Functional Programming

# Automation

# Testing

## Fibonacci example 

(Modified from https://www.python-course.eu/python3_tests.php)

In [13]:
def fib(n):
    """
    Calculates the n-th Fibonacci number iteratively.

    >>> fib(0)
    0
    >>> fib(1)
    1
    >>> fib(10)
    55
    >>> fib(15)
    610
    """
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

In [14]:
import unittest

class FibonacciTest(unittest.TestCase):

    def testCalculation(self):
        self.assertEqual(fib(0), 0)
        self.assertEqual(fib(1), 1)
        self.assertEqual(fib(5), 5)
        self.assertEqual(fib(10), 55)
        self.assertEqual(fib(20), 6765)
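
Defining tests is only half the job; they also need to be run. The sketch below shows one way to run both the docstring examples and a `unittest` test case explicitly, which works in a notebook as well as in a script. `run_doctests` and `run_unit_tests` are hypothetical helper names, and `fib` is repeated in shortened form so the block stands alone.

```python
import doctest
import unittest


def fib(n):
    """Calculate the n-th Fibonacci number iteratively.

    >>> fib(10)
    55
    """
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


class FibonacciTest(unittest.TestCase):
    def test_calculation(self):
        self.assertEqual(fib(20), 6765)


def run_doctests(func):
    """Run the examples in func's docstring and return the doctest summary."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(func, name=func.__name__,
                            globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


def run_unit_tests(test_case):
    """Load and run a single TestCase class, returning the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    print(run_doctests(fib))
    print(run_unit_tests(FibonacciTest).wasSuccessful())
```

In a project with many test files, a test runner such as `python -m unittest` discovers and runs them for you, so helpers like these are mainly useful for quick checks inside a notebook.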

## Should I test everything?

In [15]:
def count_products(df):
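    # Select the column with brackets: on a plain DataFrame, df.product is the
    # product() method, not the 'product' column
    return df.groupby('name')['product'].size()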
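
Even a one-line function encodes assumptions, here that the result is one count per customer name. A minimal sketch of a test for it, assuming pandas is installed; the sample data mirrors the DataFrame from the start of the talk, and `purchases` is a name chosen for the example.

```python
import pandas as pd


def count_products(df):
    """Number of purchase rows per customer name."""
    return df.groupby("name")["product"].size()


purchases = pd.DataFrame({
    "name": ["Loretta", "Oliver", "Romolo", "Loretta", "Oliver"],
    "product": ["towel", "towel", "magazine", "spatula", "towel"],
})

counts = count_products(purchases)
# Convert to a plain dict so the check does not depend on index or dtype details
assert counts.to_dict() == {"Loretta": 2, "Oliver": 2, "Romolo": 1}
```

Whether a test like this is worth keeping is a judgment call: it mostly re-checks pandas, but it also documents what the function is supposed to return.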

# Conclusion
1. There is a _lot_ of stuff to learn
2. Be iterative, learn what you have to in order to accomplish your current task
3. Keep learning!

# Questions?