# Data Science Project Architecture

In [2]:
%matplotlib inline

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time

In [4]:
import this # The Zen of Python 

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


## [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)

## Naming stuff

* lower_with_under – variables, functions, files, folders
* UPPER_WITH_UNDER – global constants
* PascalCase – class names
* camelCase – only to conform to existing conventions
* Notes
    * \_leading_underscore – marks a private variable
        * Not truly private, only a signal to developers not to mess with it
        * \_\_double_leading_underscore – “mangles” the name inside a class (stored as \_ClassName\_\_name)
    * \_\_double_underscores\_\_ – special variables or methods
        * \_\_name\_\_, \_\_doc\_\_, \_\_init\_\_, \_\_str\_\_, \_\_repr\_\_, \_\_len\_\_, etc.
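
A minimal sketch that puts these conventions side by side. The `Account` class, `MAX_WITHDRAWAL`, and every attribute name are hypothetical, made up only for illustration:

```python
MAX_WITHDRAWAL = 500  # UPPER_WITH_UNDER: global constant


class Account:  # PascalCase: class name
    def __init__(self, owner_name, balance):
        self.owner_name = owner_name  # lower_with_under: public attribute
        self._balance = balance       # single underscore: "private" by convention
        self.__pin = "0000"           # double underscore: stored as _Account__pin

    def __repr__(self):  # special ("dunder") method used by repr() and print()
        return f"Account(owner_name={self.owner_name!r})"


account = Account("Ada", 100)
print(account)                    # Account(owner_name='Ada')
print(account._balance)           # 100 (accessible, but we should not touch it)
print(hasattr(account, "__pin"))  # False: the name was mangled
print(account._Account__pin)      # 0000
```

Name mangling exists mainly to avoid accidental name clashes in subclasses; it does not make the attribute inaccessible.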

In [5]:
numbers = [2, 5, 18, -4]

In [6]:
type(numbers)

list

In [7]:
len(numbers)

4

In [8]:
dir(numbers)

['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

In [9]:
numbers.__len__

<method-wrapper '__len__' of list object at 0x000002303AF4F400>

In [10]:
numbers.__len__()

4

In [11]:
numbers.__str__() # string representation

'[2, 5, 18, -4]'

In [15]:
iterator = numbers.__iter__()

In [16]:
iterator.__next__()

2
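
The cells above call the special methods directly, but normally the built-ins `len()`, `str()`, `iter()` and `next()` call them for us. As a sketch of how this works for our own types, assuming a hypothetical `Playlist` class that is not part of the original notebook:

```python
class Playlist:
    """A hypothetical container that supports len(), str() and iteration."""

    def __init__(self, songs):
        self._songs = list(songs)

    def __len__(self):   # called by len(playlist)
        return len(self._songs)

    def __str__(self):   # called by str(playlist) and print(playlist)
        return ", ".join(self._songs)

    def __iter__(self):  # called by iter(playlist) and by for loops
        return iter(self._songs)


playlist = Playlist(["Intro", "Outro"])
print(len(playlist))         # 2
print(playlist)              # Intro, Outro
print(next(iter(playlist)))  # Intro
```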

## Readability

* Import packages and modules, not individual classes or functions
* Avoid global variables
    * They pollute the global scope
    * They can create subtle dependencies in the code
    * Try using function parameters (and / or classes) instead
* List comprehensions, lambdas, conditional expressions
    * Okay for simple, one-line cases

In [18]:
a = 5
print([x + 3 for x in range(3)])
sum_two_nums = lambda x, y: x + y
print("even" if a % 2 == 0 else "odd")

[3, 4, 5]
odd


* Lexical scoping (closures) – use very carefully

In [20]:
def summator(a): # Usage: summator(4)(5)
    def inner_summator(b):
        return a + b 
    return inner_summator
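
One reason closures need care is that the inner function looks up the captured variable when it is *called*, not when it is *created*. The sketch below reuses `summator` from the cell above; the names `add_four`, `broken_adders` and `working_adders` are made up for illustration:

```python
def summator(a):
    def inner_summator(b):
        return a + b
    return inner_summator


add_four = summator(4)
print(add_four(5))  # 9

# Pitfall: all three lambdas share the same loop variable `a`,
# which equals 2 by the time any of them is called.
broken_adders = [lambda b: a + b for a in range(3)]
print([adder(10) for adder in broken_adders])  # [12, 12, 12]

# Each call to summator creates a new scope with its own `a`.
working_adders = [summator(a) for a in range(3)]
print([adder(10) for adder in working_adders])  # [10, 11, 12]
```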

* Whitespace
    * **DO NOT** mix tabs and spaces!
        * Mixing them can create a lot of pain and sinister bugs
        * Prefer spaces (most editors can be set to insert 4 spaces when Tab is pressed)
    * 1-2 blank lines between definitions: 2 between top-level functions and classes, 1 between methods
    * Use typography rules (e.g. 1 space after comma)
* Comments
    * Avoid inline comments

In [22]:
x = 3
x = x + 1 # Increment x by 1

* Docstrings – string literals that document the code and stay attached to it at runtime (via \_\_doc\_\_)
    * More info [here](https://peps.python.org/pep-0257/)
* TODO comments: temporary code, short-term solution, or good enough but not perfect
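
As an illustration, here is a sketch of a docstring written in the `Args:` / `Returns:` layout described in the Google Python Style Guide, together with a TODO comment. The function `celsius_to_fahrenheit` is a hypothetical example, not something from the original notebook:

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit.

    Args:
        celsius: Temperature in degrees Celsius.

    Returns:
        The temperature in degrees Fahrenheit.
    """
    # TODO: Reject temperatures below absolute zero.
    return celsius * 9 / 5 + 32


# The docstring is stored on the function and is what help() displays.
print(celsius_to_fahrenheit.__doc__.splitlines()[0])
print(celsius_to_fahrenheit(100))  # 212.0
```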

## Object-Oriented Programming

* Python supports object-oriented programming (OOP)
* For most of our purposes, it's not necessary
    * We have used a lot of objects, but we didn't really need to create classes
* We generally prefer a combination of procedural and functional style
* If you're comfortable, feel free to use classes
    * All principles from other OOP languages apply
    * Once again, the goal is to create readable code, which is easier to maintain

## Notebook Structure

* Similar to scientific papers
* Imports – usually the first cell contains all imports
* Title, author(s)
* Abstract – not mandatory, but really good to have
* Data manipulation process
    * Divided into sections and subsections
    * Most commonly: getting data, transformations, visualization, modelling, etc.
* Conclusion(s)
* Tips
    * Make sections self-contained, reduce dependencies
    * Create functions when possible
        * To avoid creating too many global variables
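
To make the last tip concrete, the sketch below performs the same two steps twice: once as loose statements in a cell, and once wrapped in a function. The function name `keep_positive_and_scale` and the data are assumptions chosen for illustration:

```python
# Style 1: every intermediate result becomes a global variable.
raw_values = [2, 5, 18, -4]
positive_values = [value for value in raw_values if value > 0]
scaled_values = [value * 10 for value in positive_values]


# Style 2: the same steps wrapped in a function; only the result stays global.
def keep_positive_and_scale(values, factor):
    """Drop non-positive values and multiply the rest by factor."""
    positive = [value for value in values if value > 0]
    return [value * factor for value in positive]


result = keep_positive_and_scale([2, 5, 18, -4], factor=10)
print(result)  # [20, 50, 180]
```

The second style leaves one name behind instead of three, and the function can be reused or tested in another section without relying on earlier cells.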

## File Structure

* Usually, projects have one notebook
    * You may include many notebooks if you wish
    * You can also import code from notebooks
* Very long code can be moved into separate .py files
    * Not strongly recommended, but sometimes helps
        * E.g., if the file contains a lot of utility functions
* Usage: simply import the files
    * By file name (without the .py extension)
    * You can also create folders and import them
        * A single .py file is called a "module"; a folder of modules is called a "package"
    * We usually put all code in a separate folder, e.g., libs or utilities
* Data, images and other assets should also be in their own folders
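
The sketch below shows what importing from such a folder can look like. To stay self-contained it first creates a throwaway project in a temporary directory; the file name `stats_helpers.py` and the function `mean` are hypothetical, and in a real project the `utilities` folder would simply sit next to the notebook:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway layout:  <project_dir>/utilities/stats_helpers.py
project_dir = Path(tempfile.mkdtemp())
package_dir = project_dir / "utilities"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")  # marks the folder as a package
(package_dir / "stats_helpers.py").write_text(
    "def mean(values):\n"
    "    return sum(values) / len(values)\n"
)

# A notebook started inside project_dir finds the package automatically;
# here the folder has to be added to the import path by hand.
sys.path.insert(0, str(project_dir))
importlib.invalidate_caches()

from utilities.stats_helpers import mean

result = mean([2, 5, 18, -4])
print(result)  # 5.25
```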

## Debugging

* Hardest way: don't debug at all
* Easier: use print() statements at important places
* Best: use a debugger to trace the code execution
    * Most IDEs and editors (such as Visual Studio, VSCode, PyCharm, etc.) have one
* Most important concepts
    * Breakpoints
    * Step into, step over, step out
        * These usually have keyboard shortcuts assigned
    * Variable inspection
    * Interactive window; terminal
    * Call stack
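
Outside an IDE, two standard-library tools cover the same concepts: the built-in `breakpoint()` function pauses execution and opens the `pdb` debugger, and the `traceback` module can show the call stack. The sketch below is only an illustration; `normalize`, `outer` and `inner` are hypothetical functions, and the `breakpoint()` call is left commented out so the cell runs without stopping:

```python
import traceback


def normalize(values):
    """Scale values so they sum to 1 (hypothetical example function)."""
    total = sum(values)
    # Uncomment the next line to pause here and inspect `values` and `total`.
    # breakpoint()
    return [value / total for value in values]


def current_call_stack():
    """Return the names of the functions that are currently executing."""
    stack = traceback.extract_stack()
    return [frame.name for frame in stack[:-1]]  # skip this helper itself


def outer():
    return inner()


def inner():
    return current_call_stack()


print(normalize([1, 1, 2]))  # [0.25, 0.25, 0.5]
stack_names = outer()
print(stack_names[-2:])      # ['outer', 'inner']
```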

## Unit Testing

* Debugging and testing are very scientific processes
    * Intuitive for most people with math / science background
* Tests can reveal bugs in the code
    * They cannot show that the code is bug-free!
    * "Absence of evidence is not evidence of absence"
* Unit tests: pieces of code that test other pieces of code
* Unit test layout: AAA (Arrange, Act, Assert)

In [23]:
def sum_numbers(a, b):
    return a + b


def test_sum_with_zeros():
    # Arrange
    a = 0
    b = 0
    # Act
    result = sum_numbers(a, b)
    # Assert
    assert result == 0


test_sum_with_zeros()
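
A single test covers a single case, so in practice we write several. The sketch below adds two more tests in the same Arrange, Act, Assert layout and runs them with a tiny hand-written runner. The runner `run_tests` is a hypothetical helper written for this example; real projects usually rely on a test framework such as `unittest` or `pytest` instead:

```python
import math


def sum_numbers(a, b):
    return a + b


def test_sum_with_negative_numbers():
    # Arrange
    a, b = -2, -3
    # Act
    result = sum_numbers(a, b)
    # Assert
    assert result == -5


def test_sum_with_floats():
    # Arrange
    a, b = 0.1, 0.2
    # Act
    result = sum_numbers(a, b)
    # Assert: 0.1 + 0.2 is not exactly 0.3, so compare with a tolerance
    assert math.isclose(result, 0.3)


def run_tests(tests):
    """Run each test and collect the names of the ones that fail."""
    failed = []
    for test in tests:
        try:
            test()
        except AssertionError:
            failed.append(test.__name__)
    return failed


failed_tests = run_tests([test_sum_with_negative_numbers, test_sum_with_floats])
print(failed_tests)  # []
```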

## Other Types of Tests

* Unit testing checks that our functions behave as expected
* There are many more types of tests
    * Software: integration tests, regression tests, system tests, security tests, etc.
    * Data validation tests – these ensure correct formats of the data
* Statistical tests
    * Is my hypothesis (or data model) true?
    * Example: for linear regression ⇒ $R^2$
    * Another example: train / test set in machine learning
    * "Sanity checks"
    * Plotting graphs, comparisons, etc.
* It's absolutely important to check most (if not all) of our steps
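
As a small worked example of such a check, the coefficient of determination can be computed as $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below uses made-up numbers and hand-written "predictions" rather than a fitted model; the function name `r_squared` is an assumption, not part of the original notebook:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1, 3, 5, 7]                # hypothetical observations
perfect_predictions = [1, 3, 5, 7]   # a model that matches every point
mean_predictions = [4, 4, 4, 4]      # a model that always predicts the mean

print(r_squared(actual, perfect_predictions))  # 1.0
print(r_squared(actual, mean_predictions))     # 0.0
```

The two extreme cases double as a sanity check: a perfect model should score 1, and a model that only predicts the mean should score 0.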

## Performance Tests

* Test how fast the code executes
    * Better: test how the running time grows with different arguments
        * Possibly, plot the results
    * We can use the time module (or timeit, also in the standard library)
* Important: be careful how you check the time
    * Average execution over multiple trials to reduce random errors
    * Do not include initializations and "external" code
        * Code that you’re not interested in optimizing
* Do not optimize prematurely!

In [26]:
start = time.time()
for i in range(1000):
    sum(numbers)
stop = time.time()
print(stop - start)

0.0011661052703857422
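
The advice above (average over several trials, keep initialization out of the measurement, try different arguments) can be sketched with the standard-library `timeit` module. The helper `average_time` and the chosen input sizes are assumptions made for this example, and the printed numbers will differ from machine to machine:

```python
import statistics
import timeit


def average_time(function, argument, trials=5, calls_per_trial=200):
    """Average time of one call, in seconds, over several trials."""
    trial_times = timeit.repeat(
        lambda: function(argument), repeat=trials, number=calls_per_trial
    )
    return statistics.mean(trial_times) / calls_per_trial


timings = {}
for size in [100, 1_000, 10_000]:
    data = list(range(size))  # initialization happens outside the timed code
    timings[size] = average_time(sum, data)

for size, seconds in timings.items():
    print(f"{size:>6} elements: {seconds:.2e} s per call")
```

For manual measurements like the cell above, `time.perf_counter()` is generally a better clock than `time.time()`, because it is designed for measuring short durations.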
