<a href="https://colab.research.google.com/github/jpacilo/PythonWorkshop/blob/main/DSP%20Python%20Workshop%202022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Better Practices** in Python For Data Science
Author: Joshua Paolo Acilo <br>
Date: March 4, 2022 <br>
Time: 1:00 - 4:00 PM

## README

**Schedule**
- 1:00 - 2:30 PM Lecture
- 2:40 - 4:00 PM Homework

**Reminders**
- Feel free to ask questions anytime! You can leave a message in the chatbox or unmute yourself and speak.
- This is not an Introduction to Python. I expect everyone to know at least the basics of programming.
- You learn more by doing. Try to adopt these new concepts in your workflow next time!



⚠️ Please make a copy of this colab notebook first by clicking *File -> Save a Copy in Drive* on the menu bar <br>

## Setup Python

In [12]:
# check the current python version you have
import sys
sys.version

'3.7.12 (default, Jan 15 2022, 18:48:18) \n[GCC 7.5.0]'

In [6]:
# pendulum is a library to manipulate dates
!pip3 install pendulum

Collecting pendulum
  Downloading pendulum-2.1.2-cp37-cp37m-manylinux1_x86_64.whl (155 kB)

## Write Cleaner Code 

💡 Any fool can write code that a computer can understand. **Good programmers write code that humans can understand.** <br>

### VARIABLES

**DON'T(s)**
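- Thou shalt not start with a number. <br>
```4ever = True```
- Thou shalt not use special characters. <br>
```amountIn$ = 100```
- Thou shalt not use reserved keywords, and avoid shadowing built-ins. <br>
```class = "A"``` is a syntax error; ```id = 10012216``` runs, but it hides the built-in ```id()``` function.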

**DO(s)**
- PEP 8 recommends snake_case for variable names. <br>
```lower_case_with_underscores = True```
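
The rules above can be checked from Python itself. The following is a minimal sketch using only the standard library (`str.isidentifier`, `keyword.iskeyword`, and the `builtins` module). The helper name `check_variable_name` and its verdict strings are made up for illustration and are not part of the workshop material.

```python
import builtins
import keyword


def check_variable_name(name):
    """Return a short verdict on whether `name` is a good variable name."""
    if not name.isidentifier():
        # catches names that start with a digit or contain special characters
        return "invalid"
    if keyword.iskeyword(name):
        # reserved words such as `class` or `for` cannot be assigned to
        return "reserved keyword"
    if hasattr(builtins, name):
        # legal, but the assignment hides a built-in such as `id` or `list`
        return "shadows a built-in"
    return "ok"


for candidate in ("4ever", "amountIn$", "class", "id", "lower_case_with_underscores"):
    print(candidate, "->", check_variable_name(candidate))
```

Note that this only checks whether a name is legal and safe. Whether it is meaningful is still up to you.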

Use **meaningful and pronounceable variable names.** Let the variable speak for itself.

In [15]:
import pendulum

def start_pipeline(date):
    # do stuff
    pass

# this is bad: it is unpronounceable, vague, and non-descriptive
ymddt = pendulum.now().strftime("%Y-%m-%d")
start_pipeline(ymddt)

# this is good: it tells the reader that the current date controls the timing of the pipeline
current_date = pendulum.now().strftime("%Y-%m-%d")
start_pipeline(current_date)

Of course, there are exceptions, especially for **domain-specific jargon.**

In [5]:
import numpy as np

# you'll see this very often in the lake
pxn_dt = pendulum.parse(current_date).subtract(days=1)

# this is boilerplate ML, so it's okay too
X, y = np.arange(10).reshape((5, 2)), range(5)

It is a fact that *we will read more code than we will ever write.* It's important that **the code is readable and searchable.** Yes, the quick and dirty way gets the same result as the slower, cleaner way, but in the long run it will hurt your readers. 😓

In [9]:
def aggregate_features(window_duration):
    # do stuff
    pass

# i'm betting you'll forget this the next time you look at your code
aggregate_features(1440)

# instead, assign the value to a descriptive constant, written in capital letters
MINUTES_IN_A_DAY = 60 * 24
aggregate_features(MINUTES_IN_A_DAY)

Don't force the reader of your code to translate what the variable means. **Explicit is better than implicit.**

In [11]:
# this is bad, implicit
seq = ("Taguig", "Makati", "Mandaluyong")
for item in seq:
    # do stuff
    pass

# this is good, explicit
cities = ("Taguig", "Makati", "Mandaluyong")
for city in cities:
    # do stuff
    pass

### FUNCTIONS

**Write a manual for your function using docstrings.** This will help not only you in the future, but also your collaborators.

In [None]:
from math import radians, cos, sin, asin, sqrt

# this is good, write docstrings as much as possible to future-proof your work
def get_haversine_distance(lon1, lat1, lon2, lat2, r=6371):
    """Calculate the great circle distance (in kilometers) between two points on the earth.

    Args:
        lon1 (float): Longitude of Point 1
        lat1 (float): Latitude of Point 1
        lon2 (float): Longitude of Point 2
        lat2 (float): Latitude of Point 2
        r (int, optional): Radius of earth in kilometers. Defaults to 6371.

    Returns:
        float: Haversine distance between the two given coordinates.
    """

    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    
    return c * r
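
A docstring is more than a comment: Python stores it on the function, so tools can show it to your collaborators. The sketch below assumes a hypothetical helper, `minutes_to_hours`, invented here for illustration, and reads its docstring back with the standard library's `inspect.getdoc`.

```python
import inspect


def minutes_to_hours(minutes):
    """Convert a duration in minutes to hours.

    Args:
        minutes (float): Duration in minutes.

    Returns:
        float: The same duration expressed in hours.
    """
    return minutes / 60


# inspect.getdoc returns the docstring with its indentation cleaned up
docstring = inspect.getdoc(minutes_to_hours)

# the first line is the one-sentence summary that most tools display
print(docstring.splitlines()[0])
```

In a notebook, calling `help(minutes_to_hours)` shows the same text together with the function signature.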

Your Python **functions should accomplish one thing.** When a function does more than one thing, it is harder to compose, test, and reason about. When you can isolate a function to just one action, it can be refactored easily and your code will read much cleaner.
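
As a rough illustration of the idea, the sketch below splits one do-everything function into two single-purpose ones. All function names and the sample data are hypothetical and chosen only for this example.

```python
# this is bad, one function parses, filters, and aggregates
def get_total_from_raw(raw_amounts):
    total = 0.0
    for raw in raw_amounts:
        text = raw.strip()
        if text:
            total += float(text)
    return total


# this is good, each function does one thing and can be tested on its own
def parse_amounts(raw_amounts):
    """Turn raw strings into floats, skipping blank entries."""
    return [float(raw.strip()) for raw in raw_amounts if raw.strip()]


def sum_amounts(amounts):
    """Add up already-parsed amounts."""
    return sum(amounts)


raw_amounts = ["100", " 250.5 ", "", "49.5"]
amounts = parse_amounts(raw_amounts)
print(sum_amounts(amounts))
```

Both versions give the same total, but the second lets you reuse `parse_amounts` for other aggregations and test the parsing rules separately from the arithmetic.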

In [None]:
print("hello")

## Test Your Code
🤔 Just because you've counted all the trees **doesn't mean you've seen the forest.**