# Data Science coding habits

Notes on how to better organize my analyses and code

refs: 
* https://www.thoughtworks.com/insights/blog/coding-habits-data-scientists
* https://github.com/zedr/clean-code-python


In [4]:
import numpy as np
import pandas as pd

from scipy import stats

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline 

import IPython
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


In [5]:
!pwd
!ls images

/home/leandroohf/Documents/leandro/ds_pragmatic_programming
data_frame.png				 pathlib_cheatsheet_p1.png
iris_petal_sepal.png			 pivot-table-datasheet.png
layers.jpeg				 refactor_notebooks.png
neuron_ANN.png				 resampling.png
non-linear_and_linear_decision_edge.png  smote.png
notebook_vs_code.png			 split-apply-combine.png
onehot.png				 tomek.png


## Jupyter notebook analysis workflow


A two-folder workflow for working with Jupyter notebooks as a data scientist. Once you have a model, refactor the code as soon as possible so it can go into a dev/production environment. See the images below.


1. dev/: quickly test your ideas (do not worry too much about code quality)

    * dev code should be quick to write, so do not worry too much about code quality
    * keep notebooks as short as possible. Do not try to address many questions in one notebook
    
1. report/notebook/presentation: 

    * keep only the important results and move them to the report folder (publish)

    * refactor the important results (code quality matters in this phase)

1. Notebook conventions

    * Notebook contents

        * top: what are the main questions and goals. Do not try to answer many questions at once in a notebook
        * top: main conclusions and findings
        * top: a short plan of what to try next

    * Names: e.g. 2019-12-14-lhof-short_description.ipynb
    * Try to keep notebooks short. **Large notebooks are hard to understand and maintain**.

```sh

# search notebooks by dates 
ls 2019-12-1*.ipynb

# search by authors
ls 2019*-lhof-*.ipynb

# search by keyword in description
ls 2019*_keyword*.ipynb


# search a keyword in jupyter notebooks contents
grep --include=\*.ipynb -rnw '.' -e "lstm"

``` 

1. As soon as you have a model, move the code out of the notebooks and into an API, ETL jobs, etc.

    * Refactor again and pay attention to code quality, error handling, etc. Try to follow coding best practices
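
The naming convention described above (date, author initials, short description) can be produced by a small helper so that every notebook follows it. This is only a sketch: the function name `build_notebook_name` and its parameters are assumptions made for illustration, not part of any library.

```python
from datetime import date
from typing import Optional


def build_notebook_name(author: str, description: str, day: Optional[date] = None) -> str:
    """Return a name such as 2019-12-14-lhof-short_description.ipynb."""
    # Resolve "today" at call time; a default of date.today() would be
    # evaluated only once, when the function is defined.
    if day is None:
        day = date.today()
    # Join the words with underscores so the description is a single token
    # that is easy to match with shell globs.
    slug = "_".join(description.lower().split())
    return f"{day.isoformat()}-{author}-{slug}.ipynb"


name = build_notebook_name("lhof", "short description", date(2019, 12, 14))
print(name)
```

Because the date comes first in ISO format, a plain `ls` lists the notebooks in chronological order.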
    

<div style="clear:both">
<img src="images/notebook_vs_code.png" style="float:left" width="400" align="left"/> 
</div>

<br><br><br><br>

<br>

<div style="clear:both">
<img src="images/refactor_notebooks.png" style="float:left" width="600" align="left"/>  
</div>


<br><br><br>

<div style="clear:both">


## Notes about best coding practices in jupyter notebooks


### Keep code clean

    * Don't expose your internals (keep implementation details hidden). Functions and classes are good for that
        * Ex: categorize_column, encode_label or split_train_n_test
    * Avoid print statements
        * Ex: even glorified print statements such as df.head(), df.describe(), df.plot()

    * Use good variable names: variable names should reveal intent


```python
loans = pd.read_csv('loans.csv')

monthly_loans = loans.groupby(['month']).sum()
monthly_loans_in_december = filter_loans(monthly_loans, month=12)

```
    
    * Avoid comments that only restate the code; let the names carry the meaning


```python
## BAD

# Check to see if employee is eligible for full benefits
# (`&` tests the bit flag; `and` would only check that both values are truthy)
if (employee.flags & HOURLY_FLAG) and (employee.age > 65):
    ...  # do something

## Better
if employee.is_eligible_for_benefits():
    ...  # do something
    
```

    * Avoid mental mapping

```python

# Bad
seq = ('Austin', 'New York', 'San Francisco')

for item in seq:
    do_stuff()
    do_some_other_stuff()
    # ...
    # Wait, what's `item` for again?
    dispatch(item)

# Good
locations = ('Austin', 'New York', 'San Francisco')

for location in locations:
    do_stuff()
    do_some_other_stuff()
    # ...
    dispatch(location)
```
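
Returning to the first point of this section (keep implementation details hidden), here is a minimal sketch of what a function like `categorize_column` could look like. The signature, the bin edges and the sample data are assumptions chosen for illustration; the point is only that the caller reads a name that states the intent, while `pd.cut` stays inside the function.

```python
import pandas as pd


def categorize_column(df: pd.DataFrame, column: str, bins: list, labels: list) -> pd.DataFrame:
    """Return a copy of df with a new '<column>_cat' column of binned values."""
    out = df.copy()
    # pd.cut is the implementation detail hidden behind the function name.
    out[f"{column}_cat"] = pd.cut(out[column], bins=bins, labels=labels)
    return out


# Made-up sample data, only to exercise the function.
calls = pd.DataFrame({"duration": [4.0, 34.0, 250.0]})
calls = categorize_column(
    calls, "duration", bins=[0, 10, 60, 600], labels=["short", "medium", "long"]
)
```

If the binning rule changes later, only the body of the function changes; the notebook cells that call it stay the same.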



### Use code abstraction (functions and classes)

    * Use functions to keep code “DRY” (Don’t Repeat Yourself)
    * Functions should do one thing
    * Function names are verbs
    * Use few arguments (try to keep to 2 or 3)

```python
## Bad
def create_menu(title, body, button_text, cancellable):
    ...

## Good
class Menu:
    def __init__(self, config: dict):
        title = config["title"]
        body = config["body"]
        # ...

menu = Menu(
    {
        "title": "My Menu",
        "body": "Something about my menu",
        "button_text": "OK",
        "cancellable": False
    }
)

## Also Good 
class MenuConfig:
    """A configuration for the Menu.

    Attributes:
        title: The title of the Menu.
        body: The body of the Menu.
        button_text: The text for the button label.
        cancellable: Can it be cancelled?
    """
    title: str
    body: str
    button_text: str
    cancellable: bool = False


def create_menu(config: MenuConfig):
    title = config.title
    body = config.body
    # ...


config = MenuConfig()
config.title = "My delicious menu"
config.body = "A description of the various items on the menu"
config.button_text = "Order now!"
# The instance attribute overrides the default class attribute.
config.cancellable = True

create_menu(config)
 
```
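
The configuration-object idea above can also be written with the standard library `dataclasses` module, which generates the constructor from the annotated fields. This is a sketch of that variant, not a replacement for the examples above; the string returned by `create_menu` is a placeholder invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class MenuConfig:
    """A configuration for the Menu."""
    title: str
    body: str
    button_text: str
    cancellable: bool = False


def create_menu(config: MenuConfig) -> str:
    # Placeholder body: a real implementation would build the menu here.
    return f"{config.title}: {config.body} [{config.button_text}]"


config = MenuConfig(
    title="My delicious menu",
    body="A description of the various items on the menu",
    button_text="Order now!",
)
menu = create_menu(config)
```

Fields without a default are required by the generated constructor, so a missing title fails immediately instead of surfacing later as an attribute error.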

    
    * Class names are nouns and method names are verbs
    
    
    * Use default arguments instead of short circuiting or conditionals
    
```python

import hashlib

## Bad
def create_micro_brewery(name):
    name = "Hipster Brew Co." if name is None else name
    slug = hashlib.sha1(name.encode()).hexdigest()
    # etc.
    
## Better
def create_micro_brewery(name: str = "Hipster Brew Co."):
    slug = hashlib.sha1(name.encode()).hexdigest()
    # etc.

```
    
Benefits of using functions

* Readability

    * The reader can focus on the what instead of the how.

* Testability (not really sure; it only makes sense when developing back-end code or an API)

    * We can easily write a unit test for a function.

* Reusability
</div>
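
As a sketch of the testability point, a small function such as `convert_to_minutes` (a variant of the helper defined later in this notebook) can be checked with a few assertions. The version below assumes the `duration` column is in seconds and returns a new frame instead of modifying its input; both are assumptions made for the example.

```python
import pandas as pd


def convert_to_minutes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with 'duration' converted from seconds to minutes."""
    return df.assign(duration=df["duration"] / 60.0)


def test_convert_to_minutes():
    calls = pd.DataFrame({"duration": [60.0, 90.0]})
    result = convert_to_minutes(calls)
    assert result["duration"].tolist() == [1.0, 1.5]
    # The input frame is left untouched.
    assert calls["duration"].tolist() == [60.0, 90.0]


# A test runner such as pytest would collect the test_ function;
# here it is simply called directly.
test_convert_to_minutes()
```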



In [6]:
data = pd.read_csv('./data/phone_data.csv')
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 7 columns):
index           830 non-null int64
date            830 non-null object
duration        830 non-null float64
item            830 non-null object
month           830 non-null object
network         830 non-null object
network_type    830 non-null object
dtypes: float64(1), int64(1), object(5)
memory usage: 45.5+ KB


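|   | index | date | duration | item | month | network | network_type |
|---|-------|------|----------|------|-------|---------|--------------|
| 0 | 0 | 15/10/14 06:58 | 34.429 | data | 2014-11 | data | data |
| 1 | 1 | 15/10/14 06:58 | 13.0 | call | 2014-11 | Vodafone | mobile |
| 2 | 2 | 15/10/14 14:46 | 23.0 | call | 2014-11 | Meteor | mobile |
| 3 | 3 | 15/10/14 14:48 | 4.0 | call | 2014-11 | Tesco | mobile |
| 4 | 4 | 15/10/14 17:27 | 4.0 | call | 2014-11 | Tesco | mobile |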


In [8]:
import functools

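# Shamelessly stolen from the comments of 
# https://www.thoughtworks.com/insights/blog/coding-habits-data-scientists
# (argument order adapted so the functions run in the order they are listed)
def compose(*functions):
    # compose(f, g)(x) == g(f(x)): the first function listed runs first.
    return functools.reduce(lambda f, g: lambda x: g(f(x)), functions, lambda x: x)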

# Examples of common functions
# The implementations are only illustrations

def encode_column(df: pd.DataFrame, col_name: str):
    
    col_name_out = col_name + '_enc'
    df[col_name_out] = df[col_name] + '_enc'
    
    
    return df


def add_categorical_column(df):
    
    df['cat'] = df['network']
    
    
    return df

def convert_to_minutes(df):
    
    df['duration'] = df['duration'] /60.00
    
    return df


def split_features_and_labels(df):
    
    # XXX: You can use train_test_split from scikit-learn.
    # But this example is enough to express the idea
    y = df['duration']
    X = df.iloc[:, df.columns != 'duration']

    
    return X,y


In [52]:
## Good example
# Work on a copy: the helper functions modify the frame they receive,
# so reusing `data` directly would convert the duration twice.
df = encode_column(data.copy(), col_name='item')
df = add_categorical_column(df)
df = convert_to_minutes(df)
X,y = split_features_and_labels(df)

## Better example
# Data processing reads like a sequence of events in a story
prepare_data = compose(functools.partial(encode_column, col_name='item'),
                       add_categorical_column,
                       convert_to_minutes
                      )

data_pre = prepare_data(data.copy())
X, y = split_features_and_labels(data_pre)
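
pandas also provides `DataFrame.pipe`, which expresses the same kind of step-by-step pipeline without a custom `compose` helper. The sketch below uses a tiny made-up frame and variants of the helpers above that return new frames through `DataFrame.assign` instead of modifying their input; those variants are assumptions for the example, not the notebook's own implementations.

```python
import pandas as pd


def encode_column(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    # Illustrative "encoding": append a suffix, as in the helper above.
    return df.assign(**{f"{col_name}_enc": df[col_name] + "_enc"})


def add_categorical_column(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(cat=df["network"])


def convert_to_minutes(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(duration=df["duration"] / 60.0)


# Made-up sample rows, only to exercise the pipeline.
calls = pd.DataFrame(
    {"duration": [120.0, 30.0], "item": ["call", "data"], "network": ["Tesco", "data"]}
)

# Each .pipe call passes the frame as the first argument of the function,
# so the steps run top to bottom in the order they are written.
prepared = (
    calls
    .pipe(encode_column, col_name="item")
    .pipe(add_categorical_column)
    .pipe(convert_to_minutes)
)
```

Because every step returns a new frame, `calls` keeps its original values and the cell can be rerun safely.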