<a href="https://colab.research.google.com/github/jpacilo/PythonWorkshop/blob/main/Lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Better Practices** in Python For Data Science
⚠️ Please make a copy of this colab notebook first by clicking **File -> Save a Copy in Drive** on the menu bar <br>

## README

**Lecturer**
- Joshua Paolo Acilo
- Model Development Expert
- EDO Advanced Analytics

**Schedule**
- 2:00 - 3:30 PM Lecture
- 3:30 - 3:50 PM Quiz
- 3:50 - 4:00 PM Q&A

**Reminders**
- Feel free to ask questions anytime! You can leave a message in the chatbox or unmute yourself and speak. <br>
- This is not an Introduction to Python. I expect everyone to at least know the basics in programming. <br>
- You learn more by doing. Try to adopt these new concepts in your workflow next time!



## What to expect from me in this session

This afternoon we're going to talk about 🔬📈⚽♻️

## Setup Python

In [None]:
# check the current python version you have
import sys
print(sys.version)

In [None]:
# just to mute the warnings for deprecated methods
import warnings
warnings.filterwarnings("ignore")

In [None]:
%%capture
# pendulum is a library to manipulate dates
!pip3 install pendulum

In [None]:
%%capture
# geopandas is a library to manipulate spatial data
!pip3 install geopandas

In [None]:
%%capture
# leafmap is a library to visualize spatial data
!pip3 install leafmap

In [None]:
%%capture
# pytest runs the tests, pytest-sugar gives nicer output for them
!pip3 -q install pytest pytest-sugar

## Write Clean Code 

Any fool can write code that a computer can understand. **Good programmers write code that humans can understand.** 🤔

Recap of some things to note when writing Python variables. <br>

**DON'T(s)**
- Thou shall not start with a number. <br>
```4ever = True```
- Thou shall not use special characters. <br>
```amountIn$ = 100```
- Thou shall not use reserved keywords or shadow built-in names. <br>
```id = 10012216```

**DO(s)**
- PEP 8 recommends snake_case. <br>
```lower_case_with_underscores = True```

Use **meaningful and pronounceable variable names.** Let the variable speak for itself. 🤯

In [None]:
import pendulum

def start_pipeline(date):
    # do stuff
    pass

# this is bad: not only is it unpronounceable, it is also vague and non-descriptive
ymddt = pendulum.now().strftime("%Y-%m-%d")
start_pipeline(ymddt)

# this is good: it gives me a clue that the current date controls the timing of the pipeline
current_date = pendulum.now().strftime("%Y-%m-%d")
start_pipeline(current_date)

In [None]:
# check out current_date variable yourself
# YOUR CODE GOES HERE

Of course, there will be some exceptions, especially with **domain-specific jargon.** 🧐

In [None]:
import numpy as np
import pandas as pd

# you'll see this very often in the lake
# pxn_dt stands for partition date
pxn_dt = pendulum.parse(current_date).subtract(days=1)

# this is boilerplate ML, so it's okay too
# X is the input matrix (features)
# y is the output vector (target)
X, y = np.arange(10).reshape((5, 2)), range(5)

# this one also is widespread in DS
# df stands for dataframe 
df = pd.read_csv("sample_data/california_housing_train.csv")

In [None]:
df.head(1)

Here is a non-exhaustive list of jargon that isn't exactly pronounceable or meaningful (for people with no background in data science).
- abt => analytical base table
- lng => longitude (from lat, lng)
- std => standard deviation
- txn => transaction
- pxn => partition

✋ What about you? Can you think of a well-accepted variable name in some domains within the Python community **that is not pronounceable?**

It is a fact that *we will read more code than we will ever write*, so it's important that **the code is readable and searchable.** Yes, the quick and dirty way can give the same result as the slower, cleaner way, but in the long run it will hurt your readers. 😓

In [None]:
def aggregate_features(window_duration):
    # do stuff
    pass

# i'm betting you'll forget this the next time you look at your code
aggregate_features(1440)

# we can assign a descriptive constant instead denoted by capital letters 
MINUTES_IN_A_DAY = 60 * 24
aggregate_features(MINUTES_IN_A_DAY)

In [None]:
# create a constant following the UPPER_CAP_SNAKE_CASE format pertaining to the number of seconds in a day
# YOUR CODE GOES HERE

Don't force the reader of your code to translate what the variable means. **Explicit is better than implicit.** 🤔

In [None]:
# this is bad, implicit
seq = ("Taguig", "Makati", "Mandaluyong")
for item in seq:
    # do stuff
    pass

# this is good, explicit
cities = ("Taguig", "Makati", "Mandaluyong")
for city in cities:
    # do stuff
    pass

This tip can save you if you have a foreign collaborator who is not familiar with the city names in PH 😅

**Write a manual for your function using docstrings.** This will help not only you in the future, but also your future collaborators. 😉

```
"""
This is an example of Google style docstring.

Args:
    param1: This is the first param.
    param2: This is a second param.

Returns:
    This is a description of what is returned.

Raises:
    KeyError: Raises an exception.
"""
```

In [None]:
from math import radians, cos, sin, asin, sqrt

# this is good, write docstrings as much as possible to future proof your work
def get_haversine_distance(lon1, lat1, lon2, lat2, r=6371):
    """Calculate the great circle distance (in kilometers) between two points on the earth.

    Args:
        lon1 (float): Longitude of Point 1
        lat1 (float): Latitude of Point 1
        lon2 (float): Longitude of Point 2
        lat2 (float): Latitude of Point 2
        r (int, optional): Radius of earth in kilometers. Defaults to 6371.

    Returns:
        float: Haversine distance between the two given coordinates.
    """

    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    
    return c * r

Your Python **functions should accomplish one thing.** When functions do more than one thing, they are harder to compose, test, and reason about. When you isolate a function to just one action, it can be refactored easily and your code will read much cleaner. 🤓

In [None]:
mkdir data

In [None]:
cd data

In [None]:
from google.colab import files
files.upload();

In [None]:
cd /content

In [None]:
import pandas as pd

def load_data(filename, schema):
    df = pd.read_csv(filename)
    df = df.astype(schema, errors="ignore")
    return df

filename = "data/cafes_in_bgc.csv"
schema = {
    "cafe_name": str,
    "address": str,
    "latitude": float,
    "longitude": float
} 
df = load_data(filename, schema)
display(df)

Suppose you and your new DSP friends want to go coffee shop hopping in Bonifacio Global City today. Since you only have an hour for lunch break, you decided to only visit (n) shops for now. The task is to find the (n)-closest coffee shops to each other from the given data. 🧩 
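
The idea behind the task can be sketched in plain Python, under a few assumptions: the cafe coordinates below are made up for illustration (they are not taken from `cafes_in_bgc.csv`), "closeness" is measured as the length of the closed loop that visits the `n` cafes in the order listed, and `haversine_km`, `loop_length_km` and `closest_group` are hypothetical helper names. The geopandas cells that follow apply the same idea with polygons and their perimeters.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, r=6371):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * r * asin(sqrt(a))


def loop_length_km(points):
    """Length of the closed loop that visits the points in the given order."""
    pairs = zip(points, points[1:] + points[:1])
    return sum(haversine_km(*start, *end) for start, end in pairs)


def closest_group(cafes, n):
    """Return the names of the n cafes whose loop is the shortest."""
    best = min(
        combinations(cafes.items(), n),
        key=lambda group: loop_length_km([coords for _, coords in group]),
    )
    return tuple(name for name, _ in best)


# hypothetical (latitude, longitude) pairs, for illustration only
cafes = {
    "cafe_a": (14.5500, 121.0500),
    "cafe_b": (14.5501, 121.0501),
    "cafe_c": (14.5502, 121.0500),
    "cafe_d": (14.5600, 121.0600),
}
print(closest_group(cafes, 3))
```

Checking every combination is fine for a handful of cafes, but the number of combinations grows quickly with the number of cafes and with `n`.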

In [None]:
import leafmap
import itertools
import geopandas as gpd
from shapely.geometry import Polygon

In [None]:
THE_GLOBE_TOWER_COORDS = (14.553474948859346, 121.04989287111896)

In [None]:
# THIS IS BAD

def get_map(df, reference_point, n=3):

    # initialize map, set TGT as reference point for BGC
    map_select = leafmap.Map(
        center=reference_point, 
        zoom=16, 
        layers_control=True, 
        measure_control=False, 
        attribution_control=False
    )
    map_select.add_basemap("Stamen.TonerLite")

    # get points of interest df
    gdf_points = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.longitude, df.latitude, crs="EPSG:4326"))
    gdf_points = gdf_points.drop(columns=["address", "latitude", "longitude"])

    cols_gdf = [(f"cafe_name_{i}", f"geometry_{i}") for i in range(1, n+1)]
    cols_gdf = [item for sublist in cols_gdf for item in sublist]
    cols_geometry = [col for col in cols_gdf if "geometry" in col]

    # get all possible combinations of poi(s) e.g. cafe(s)
    points_combinations = list(itertools.combinations(gdf_points.values.tolist(), n))
    
    # get polygons df
    gdf_polygons = pd.DataFrame(columns=cols_gdf)
    for i, points_combination in enumerate(points_combinations):
        gdf_polygons.loc[i] = [item for sublist in points_combination for item in sublist]

    # add n-polygon geometry column based from the given points 
    gdf_polygons["geometry"] = gdf_polygons.apply(lambda x: Polygon([x[col] for col in cols_geometry]), axis=1)
    gdf_polygons = gdf_polygons.drop(columns=cols_geometry)

    # 4326 for viz, 3857 for distance related calculations
    gdf_polygons = gpd.GeoDataFrame(gdf_polygons, crs="EPSG:4326")
    gdf_polygons["polygon_perimeter_in_meters"] = gdf_polygons.to_crs(3857)["geometry"].length

    # add the points and polygons gdf
    map_select.add_gdf(gdf_polygons.sort_values(by="polygon_perimeter_in_meters", ascending=True).head(1), layer_name="Smallest Geom", fill_colors=["green"])
    map_select.add_gdf(gdf_polygons.sort_values(by="polygon_perimeter_in_meters", ascending=False).head(1), layer_name="Biggest Geom", fill_colors=["red"])
    map_select.add_gdf(gdf_points, layer_name="Cafes in BGC")

    return map_select


In [None]:
# THIS IS BETTER

def flatten_list(lst):
    flattened_list = [item for sublist in lst for item in sublist]
    return flattened_list

def get_column_names(n, geom):
    cols = flatten_list([(f"cafe_name_{i}", f"geometry_{i}") for i in range(1, n+1)])
    if geom:
        return [col for col in cols if "geometry" in col]
    else:
        return cols

def get_point_combinations(gdf_points, n):
    points_combinations = list(itertools.combinations(gdf_points.values.tolist(), n))
    return points_combinations

def get_gdf_points(df):
    gdf_points = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.longitude, df.latitude, crs="EPSG:4326"))
    gdf_points = gdf_points.drop(columns=["address", "latitude", "longitude"])
    return gdf_points

def get_gdf_polygons(gdf_points, n):
    
    # get column names
    cols_gdf = get_column_names(n, False)
    cols_geometry = get_column_names(n, True)

    # get all possible combinations of poi(s) e.g. cafe(s)
    points_combinations = get_point_combinations(gdf_points, n)

    # create polygons table
    gdf_polygons = pd.DataFrame(columns=cols_gdf)
    for i, points_combination in enumerate(points_combinations):
        gdf_polygons.loc[i] = flatten_list(points_combination)

    # add n-polygon geometry column based from the given points 
    gdf_polygons["geometry"] = gdf_polygons.apply(lambda x: Polygon([x[col] for col in cols_geometry]), axis=1)
    gdf_polygons = gdf_polygons.drop(columns=cols_geometry)
    
    # 4326 for viz, 3857 for distance related calculations
    gdf_polygons = gpd.GeoDataFrame(gdf_polygons, crs="EPSG:4326")
    gdf_polygons["polygon_perimeter_in_meters"] = gdf_polygons.to_crs(3857)["geometry"].length

    return gdf_polygons

def get_map(reference_point, gdf_points, gdf_polygons):

    # initialize map, set TGT as reference point for BGC
    map_select = leafmap.Map(
        center=reference_point, 
        zoom=16, 
        layers_control=True, 
        measure_control=False, 
        attribution_control=False
    )
    map_select.add_basemap("Stamen.TonerLite")

    # add the points and polygons gdf
    map_select.add_gdf(gdf_polygons.sort_values(by="polygon_perimeter_in_meters", ascending=True).head(1), layer_name="Smallest Geom", fill_colors=["green"])
    map_select.add_gdf(gdf_polygons.sort_values(by="polygon_perimeter_in_meters", ascending=False).head(1), layer_name="Biggest Geom", fill_colors=["red"])
    map_select.add_gdf(gdf_points, layer_name="Cafes in BGC")

    return map_select

In [None]:
gdf_points = get_gdf_points(df)
gdf_points.head(1)

In [None]:
gdf_polygons = get_gdf_polygons(gdf_points, 4)
gdf_polygons.head(1)

In [None]:
gdf_polygons.sort_values(by="polygon_perimeter_in_meters", ascending=True).head(1)

In [None]:
gdf_polygons.sort_values(by="polygon_perimeter_in_meters", ascending=False).head(1)

In [None]:
get_map(THE_GLOBE_TOWER_COORDS, gdf_points, gdf_polygons)

✋ What about you? Can you help me write a docstring for the get_map function below?

In [None]:
def get_map(reference_point, gdf_points, gdf_polygons):
    # YOUR CODE GOES HERE
    # YOUR CODE GOES HERE
    # YOUR CODE GOES HERE

    # initialize map, set a reference point for the map
    map_select = leafmap.Map(
        center=reference_point, 
        zoom=16, 
        layers_control=True, 
        measure_control=False, 
        attribution_control=False
    )
    map_select.add_basemap("Stamen.TonerLite")

    # add the points and polygons gdf
    map_select.add_gdf(gdf_polygons.sort_values(by="polygon_perimeter_in_meters", ascending=True).head(1), layer_name="Smallest Geom", fill_colors=["green"])
    map_select.add_gdf(gdf_polygons.sort_values(by="polygon_perimeter_in_meters", ascending=False).head(1), layer_name="Biggest Geom", fill_colors=["red"])
    map_select.add_gdf(gdf_points, layer_name="Points of Interest")

    return map_select

**WHAT WE'VE COVERED**
- How to write clean variables
    - Make use of meaningful and pronounceable variable names, if possible.
    - Make your code readable and searchable with the use of constants.
    - Make use of explicit variable names, especially in lists.
- How to write clean functions
    - Write docstrings containing the input/output args and description.
    - Break down your functions to accomplish one thing. Don't repeat yourself.

## Write Tested Code
Just because you've counted all the trees **doesn't mean you've seen the forest.** 🤔

Basically, you should write tests for your data science projects because it:
- allows collaborators to **understand your code better**
- confirms that the code is **working as expected**
- helps in detecting **edge cases** or scenarios


Suppose we have this function that identifies the sentiment of an English text. 🧐

In [None]:
from textblob import TextBlob

def extract_sentiment(text: str):
    """Extract text sentiments using textblob library
    Args:
        text (str): English text
    Returns:
        float: Polarity of the sentiment ranging from -1 to 1
    """

    text = TextBlob(text)
    sentiment = text.sentiment.polarity
    
    return sentiment

Since we will be using this library for the first time, we don't know how it reacts to different scenarios. We want to make sure that this tool or model is reliable, so **we will be testing it against multiple text inputs**, from the obvious scenarios to the rare ones or the edge cases.

In [None]:
extract_sentiment("The weather is beautiful today!")

In [None]:
extract_sentiment("I had a bad meeting yesterday.")

✋ What about you? Check out how TextBlob performs using your own sentiment for what you feel today.

In [None]:
# YOUR CODE GOES HERE

We want to be able to repeat this kind of testing, and it is better to do it in a modular way. So we will be using *pytest*, a **framework that makes it easy to write small, readable tests** and that can scale to support complex functional testing for applications and libraries.

In [None]:
mkdir src

In [None]:
mkdir tests

In [None]:
ls

In [None]:
%%file src/sentiment.py

from textblob import TextBlob

def extract_sentiment(text: str):
    """Extract text sentiments using textblob library
    Args:
        text (str): English text
    Returns:
        float: Polarity of the sentiment ranging from -1 to 1
    """

    text = TextBlob(text)
    sentiment = text.sentiment.polarity
    
    return sentiment

In [None]:
%%file tests/test_sentiment.py

import sys
import os.path
sys.path.append(
    os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir))
)
from src.sentiment import extract_sentiment

def test_extract_sentiment_positive():

    text = "I did well on the exam last week."
    sentiment = extract_sentiment(text)

    assert sentiment > 0

def test_extract_sentiment_negative():

    text = "This workshop is pretty basic and boring!"
    sentiment = extract_sentiment(text)

    assert sentiment < 0

def test_extract_sentiment_neutral():

    text = "..."
    sentiment = extract_sentiment(text)

    assert sentiment == 0

def test_extract_sentiment_filipino():

    text = "Nakakaengganyo pakinggan ang guro namin sa workshop"
    sentiment = extract_sentiment(text)

    assert sentiment > 0

We will be calling *pytest* from the terminal. It will go through our script and run the functions whose names start with **test**. 🤯

In [None]:
!python3 -m pytest -vv tests/test_sentiment.py

✋ What about you? Update the tests/test_sentiment.py and run your own test.

In [None]:
%%file tests/test_sentiment.py

import sys
import os.path
sys.path.append(
    os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir))
)
from src.sentiment import extract_sentiment

# YOUR CODE GOES HERE

In [None]:
!python3 -m pytest -vv tests/test_sentiment.py

From the pytest output shown, we can see the scenarios where the function fails (e.g. the positive and filipino test inputs) and where it succeeds. From this exercise, **we not only know whether our function works as expected but also know why it doesn't work.** Based on the result of the positive test input, we know that this sentiment identifier model from textblob isn't correct all the time. As the developers, we can now make an informed decision on what to do next. This shows the value of testing your work before using it in production. 🤩

We can also test multiple inputs using ```pytest.mark.parametrize```

In [None]:
%%file tests/test_sentiment.py

import sys
import os.path
sys.path.append(
    os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir))
)
import pytest
from src.sentiment import extract_sentiment

test_inputs_positive = [
    "I am blessed with a wonderful family.",
    "I am thankful for my company.",
    "I am grateful for my friends."
]

test_inputs_negative = [
    "I feel bad for leaving the party early last night.",
    "I am still disappointed from my performance last week.",
    "I am too sick to travel tomorrow."
]

@pytest.mark.parametrize("text", test_inputs_positive)
def test_extract_sentiment_positive(text):

    sentiment = extract_sentiment(text)

    assert sentiment > 0

@pytest.mark.parametrize("text", test_inputs_negative)
def test_extract_sentiment_negative(text):

    sentiment = extract_sentiment(text)

    assert sentiment < 0

In [None]:
!python3 -m pytest -vv tests/test_sentiment.py

There comes a time when the test cases in your script become lengthy and comprehensive. We can choose to run one specific test function at a time using this syntax ```pytest file.py::function_name``` 😮

In [None]:
!python3 -m pytest -vv tests/test_sentiment.py::test_extract_sentiment_positive

✋ What about you? Run the specific test function for test_extract_sentiment_negative

In [None]:
# YOUR CODE GOES HERE

We can also pass the same test input data to different test functions using ```pytest.fixture```

In [None]:
%%file tests/test_sentiment.py

import sys
import os.path
sys.path.append(
    os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir))
)
import pytest
from src.sentiment import extract_sentiment

@pytest.fixture
def sample_data():
    return "I had mixed feelings about the concert last night."

def test_extract_sentiment_positive(sample_data):

    sentiment = extract_sentiment(sample_data)

    assert sentiment > 0

def test_extract_sentiment_negative(sample_data):

    sentiment = extract_sentiment(sample_data)

    assert sentiment < 0

def test_extract_sentiment_neutral(sample_data):

    sentiment = extract_sentiment(sample_data)

    assert sentiment == 0

In [None]:
!python3 -m pytest -vv tests/test_sentiment.py

**WHAT WE'VE COVERED**
- How to structure a basic test project
- How to use pytest in automating tests
    - How to run pytest and understand its results
    - How to test multiple inputs using pytest.mark.parametrize
    - How to pass common data to different functions using pytest.fixture

## Write Performant Code
Efficiency is **doing better** what is already being done. 🤔

In the following sections, we will discuss some tips and tricks on **how to better optimize your code in terms of speed and memory utilization using the pandas library.** As developers, it pays off to read the official documentation of the packages we frequently use. It enables us to leverage their strengths and quirks, which improves the efficiency of our default processes. And who knows, maybe we'll discover something we can improve on in our future projects that we can share with everyone in the community! 😁

In [None]:
import pandas as pd
pd.__version__

The key data structure in pandas is called a ```DataFrame```. It is a two-dimensional table with rows and columns, which is similar to the tables in relational databases and R's dataframe. **One important thing to know is that pandas is column-major**, which means *consecutive elements in a column are stored next to each other in memory.* Since modern computers process sequential data more efficiently than non sequential data, **if a table is column-major, accessing its columns will be much faster than accessing its rows.** 🤯
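
To make "column-major" concrete, here is a small NumPy sketch. It is an analogy rather than a look at pandas internals, and the array shape and variable names are arbitrary. In a column-major (Fortran-ordered) array, stepping down a column moves to the very next element in memory, while stepping along a row jumps over a whole column.

```python
import numpy as np

# a 4-row, 3-column table of 8-byte integers, stored column by column
table = np.asfortranarray(np.arange(12, dtype=np.int64).reshape(4, 3))

# strides = number of bytes to skip to reach the next row / the next column
bytes_to_next_row, bytes_to_next_column = table.strides

print(bytes_to_next_row)     # step taken when moving down a column
print(bytes_to_next_column)  # step taken when moving along a row
print(table.flags["F_CONTIGUOUS"])
```

The smaller the step between consecutive elements, the more sequential the memory access, which is why reading a whole column is cheap in a column-major layout.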

To demonstrate this particular quirk of pandas, we will be using the **taxis dataset** that is readily available in the ```seaborn``` package.

In [None]:
import seaborn as sns
sns.get_dataset_names()

In [None]:
df_taxis = sns.load_dataset("taxis")
df_taxis.head()

In [None]:
print(len(df_taxis))

A column in a pandas ```DataFrame``` is called a ```Series```. Basically, a ```DataFrame``` is just a collection of ```Series``` stored next to each other in memory.

In [None]:
# Fetch the column `pickup`, 1k loops
%timeit -n1000 df_taxis["pickup"]

In [None]:
# Fetch the first row, 1k loops
%timeit -n1000 df_taxis.iloc[0]

For the seaborn taxi dataset, **accessing a row takes about 30-50x longer than accessing a column.**

✋ What about you? Try to time the process of fetching another column in the df_taxis `DataFrame`.

In [None]:
# YOUR CODE GOES HERE

Now let's look at a simple pandas operation that we can execute in different ways. Say we want to add a column `travel_time` to our `df_taxis` dataframe that holds the **total travel time of the passenger**. We can approach this problem in three different ways.
- Iterate over the rows in the `DataFrame` using `.iterrows()`
- Use a `lambda` function together with `.apply()` on the `DataFrame`
- Use the relevant `Series` and directly perform the operation onto them

In [None]:
df_taxis.info()

Notice from the schema shown above using ```.info()``` that the pickup and dropoff columns were identified as objects by default. We want to cast them to datetime so that we can perform accurate datetime calculations (e.g. subtraction for the elapsed time).

In [None]:
# cast the relevant fields to datetime
df_taxis["pickup"] = pd.to_datetime(df_taxis["pickup"])
df_taxis["dropoff"] = pd.to_datetime(df_taxis["dropoff"])

In [None]:
df_taxis.info()

In [None]:
# option #1: this is bad
def get_travel_time(df_taxis):
    travel_time = []
    for idx, row in df_taxis.iterrows():
        travel_time.append(row.dropoff-row.pickup)
    return pd.Series(travel_time)
%timeit -n10 df_taxis["travel_time_a"] = get_travel_time(df_taxis)

In [None]:
# option #2: this is good
%timeit -n10 df_taxis["travel_time_b"] = df_taxis.apply(lambda x: x["dropoff"]-x["pickup"], axis=1)

In [None]:
# option #3: this is better
%timeit -n10 df_taxis["travel_time_c"] = df_taxis["dropoff"] - df_taxis["pickup"]

Option #3 operates directly on whole columns (`Series`), so it takes advantage of pandas being column-major, making it a much speedier alternative to option #1. Option #2 is usually faster than option #1 as well, but it is slower than option #3 because ```.apply()``` with `axis=1` makes use of python loops under the hood, calling the `lambda` once per row. 💡
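
The three options can also be compared side by side on a tiny, self-contained table. This is a sketch with two made-up trips (not rows from the taxis dataset), and `to_seconds` is a hypothetical helper used only to compare the results. It checks that the approaches agree, not how fast they are.

```python
import pandas as pd

# two made-up trips, for illustration only
trips = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-23 20:21:09", "2019-03-04 16:11:55"]),
    "dropoff": pd.to_datetime(["2019-03-23 20:27:24", "2019-03-04 16:19:00"]),
})


def to_seconds(durations):
    """Convert a Series of Timedelta values to a plain list of seconds."""
    return [duration.total_seconds() for duration in durations]


# option 1: explicit Python loop over the rows
by_iterrows = pd.Series([row["dropoff"] - row["pickup"] for _, row in trips.iterrows()])
# option 2: one Python function call per row
by_apply = trips.apply(lambda row: row["dropoff"] - row["pickup"], axis=1)
# option 3: a single operation on two whole columns
by_columns = trips["dropoff"] - trips["pickup"]

print(to_seconds(by_iterrows), to_seconds(by_apply), to_seconds(by_columns))
```

Since all three produce the same values, the choice between them comes down to speed and readability, and the column-based version wins on both.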

In [None]:
# different approach, same result
df_taxis[[col for col in df_taxis.columns if "travel_time" in col]].head()

Now, suppose we want to access the **tip from the first row** of the seaborn taxis dataset.

In [None]:
# option #1: this is bad
%timeit -n1000 df_taxis.iloc[0]["tip"]

In [None]:
# option #2: this is good
%timeit -n1000 df_taxis.loc[0, "tip"]

In [None]:
# option #3: this is better
%timeit -n1000 df_taxis["tip"][0]

When performing multiple slice operations, **always do the column-based slicing first**, since it takes advantage of the column-major layout of pandas.

If you've used pandas before, most likely you've seen the `SettingWithCopyWarning` message when you try to assign values to a subset of the data. First, let's try to understand this warning message, then let's look for ways to address it and keep it from appearing in the future.

In [None]:
import warnings
warnings.filterwarnings("default")

In [None]:
df_taxis.tail(1)

Suppose we want to alter the color of the taxi in the last row of the pandas dataframe.

In [None]:
df_taxis["color"][len(df_taxis)-1] = "yellow"

In [None]:
df_taxis.tail(1)

It worked but pandas threw the `SettingWithCopyWarning` mentioned above.

Suppose we want to change the credit card payment category to mastercard.

In [None]:
df_taxis[df_taxis["payment"]=="credit card"]["payment"] = "mastercard"

In [None]:
df_taxis.tail(1)

It didn't work, and pandas threw the `SettingWithCopyWarning` mentioned above.

Pandas behaves this way because we're trying to make an assignment to a `Copy` instead of a `View`.

- `Copy` is a separate copy of the data selected from the `DataFrame`. Changes made to it are thrown away as soon as the operation is done.
- `View` points to the actual data in the `DataFrame` you want to work with, so changes made to it show up in the original.

To avoid this warning, we can use the `.loc[row_indexer, col_indexer]` operation in pandas.
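
Here is a minimal, self-contained sketch of the difference, using a made-up three-row table instead of the taxis data (`rides` and `is_card` are illustrative names). The warning is silenced only to keep the demo output clean.

```python
import warnings

import pandas as pd

# a made-up miniature of the taxis table
rides = pd.DataFrame({
    "payment": ["cash", "credit card", "credit card"],
    "tip": [0.0, 2.5, 1.0],
})
is_card = rides["payment"] == "credit card"

# chained indexing: rides[is_card] builds a new object first, so the
# assignment lands on that temporary copy and `rides` stays the same
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    rides[is_card]["payment"] = "mastercard"
after_chained = rides["payment"].tolist()
print(after_chained)

# .loc selects the rows and the column in one step and writes to `rides` itself
rides.loc[is_card, "payment"] = "mastercard"
after_loc = rides["payment"].tolist()
print(after_loc)
```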

In [None]:
# let's revert the changes from yellow to green
df_taxis.loc[len(df_taxis)-1, "color"] = "green"

In [None]:
df_taxis.tail(1)

In [None]:
# let's retry to update the credit card to mastercard
df_taxis.loc[df_taxis["payment"]=="credit card", "payment"] = "mastercard"

In [None]:
df_taxis.tail(1)

Now, we were able to make both updates to the dataframe without the `SettingWithCopyWarning` message!

✋ What about you? Can you update the null entries in payment with `rewards`?

In [None]:
df_taxis.loc[df_taxis.payment.isna()].head()

In [None]:
# YOUR CODE GOES HERE

In [None]:
affected_indices = [7, 445, 491, 545, 621]
df_taxis.loc[affected_indices]

As you can see from our demonstrations, there are many ways to solve a problem in pandas. Next time, we can leverage our knowledge of pandas being column-major **to speed up the computations in our data pipeline processes and exploratory data analysis.** 😁

**Sometimes, we need to process near- or larger-than-memory datasets ad hoc.** When pandas is the only option available, here are some tips and tricks on how to process large datasets efficiently without running out of memory or exhausting your compute resources.

In [None]:
!mkdir -p data

In [None]:
cd data

In [None]:
from google.colab import files
files.upload();

In [None]:
cd /content

When you're loading data into pandas, **load only the relevant columns**. Otherwise, the unused ones just occupy unnecessary space in memory.

In [None]:
# https://www.kaggle.com/shivamb/netflix-shows
FILEPATH_NETFLIX_TITLES = "data/netflix_titles.csv"

In [None]:
df_netflix = pd.read_csv(FILEPATH_NETFLIX_TITLES)
df_netflix.info(verbose=True, memory_usage="deep")

In [None]:
df_netflix.head()

In [None]:
df_netflix.nunique()

Suppose we're only tasked to **visualize the number of netflix movies released across the years since its inception.** If this dataset is heavy, we can opt not to load the unnecessary columns by using the ```usecols``` parameter of pandas' ```read_csv``` function and selecting only the relevant ones.

In [None]:
cols = ["show_id", "type", "release_year"]
df_netflix = pd.read_csv(FILEPATH_NETFLIX_TITLES, usecols=cols)
df_netflix.info(verbose=False, memory_usage="deep")

In [None]:
8.5e6 / 1.1e6

Hooray! We've been able to shrink the memory usage by **~7x** just by mindfully selecting the columns we need for analysis.
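
If you don't have the CSV at hand, the same effect can be reproduced with an in-memory file. The sketch below uses made-up CSV content (not the Kaggle data) and `io.StringIO` as a stand-in for a file on disk.

```python
import io

import pandas as pd

# made-up CSV content standing in for a wide file
csv_text = (
    "show_id,type,title,description,release_year\n"
    "s1,Movie,First Title,A long description we never use,2019\n"
    "s2,TV Show,Second Title,Another long description,2020\n"
)

# load everything vs. load only the columns needed for the analysis
full = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(io.StringIO(csv_text), usecols=["show_id", "type", "release_year"])

full_bytes = full.memory_usage(deep=True).sum()
slim_bytes = slim.memory_usage(deep=True).sum()

print(list(slim.columns))
print(full_bytes, slim_bytes)
```

The exact byte counts depend on your pandas version, but the frame loaded with `usecols` is always the smaller one because the skipped columns are never materialized.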

In [None]:
netflix_titles_over_the_yrs = df_netflix.groupby(["release_year", "type"])["show_id"].nunique().reset_index(drop=False)
netflix_titles_over_the_yrs.head()

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(18, 9))

sns.lineplot(data=netflix_titles_over_the_yrs, x="release_year", y="show_id", hue="type", ax=ax)
plt.ylabel("# of titles", fontsize=12)
plt.xlabel("release year", fontsize=12);

We're able to accomplish the task without the other columns, right? Of course, **err on the side of caution and do EDA and sanity checks first.** 😅

In [None]:
!mkdir -p data

In [None]:
cd data

In [None]:
from google.colab import files
files.upload();

In [None]:
cd /content

In the case where you need to load many fields, but you still want to optimize on memory consumption, you can do the following:
- Compress the categorical (usually a string) fields with the `category` dtype
- Compress the numerical fields with `int` or `float` dtypes that use fewer bits

In [None]:
# https://www.kaggle.com/uciml/mushroom-classification
FILEPATH_MUSHROOMS = "data/mushrooms.csv"

In [None]:
df_mushrooms = pd.read_csv(FILEPATH_MUSHROOMS)
df_mushrooms.info(verbose=True, memory_usage="deep")

By default, when loading the data into pandas, without passing a schema, **pandas just "guesses" the dtype of the fields.**

In [None]:
df_mushrooms.head()

In [None]:
df_mushrooms.nunique()

Upon inspection, we see that the columns can fit in the `category` dtype. Let's try passing a schema when loading the data.

In [None]:
df_mushrooms = pd.read_csv(FILEPATH_MUSHROOMS, dtype="category")
df_mushrooms.info(verbose=True, memory_usage="deep")

In [None]:
10.3e6 / 193.3e3 

Hooray! We've been able to shrink the memory usage by **~50x** just by changing the original object/string dtype to `category`.
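
A rough intuition for why this works, sketched on a made-up column (the labels and sizes are arbitrary): a `category` column stores each distinct label once and keeps only a small integer code per row.

```python
import pandas as pd

# a made-up column with only two distinct labels, repeated many times
labels = pd.Series(["edible", "poisonous"] * 5_000, name="class")
as_category = labels.astype("category")

string_bytes = labels.memory_usage(deep=True)         # plain string column
category_bytes = as_category.memory_usage(deep=True)  # integer codes plus the two labels

print(string_bytes, category_bytes)
print(as_category.cat.categories.tolist())
```

The saving shrinks as the number of distinct values approaches the number of rows, so this works best for low-cardinality fields.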

In [None]:
!mkdir -p data

In [None]:
cd data

In [None]:
from google.colab import files
files.upload();

In [None]:
cd /content

In [None]:
# https://www.kaggle.com/iabhishekofficial/mobile-price-classification
FILEPATH_MOBILE_PRICE = "data/mobile_price.csv"

In [None]:
df_mobile_price = pd.read_csv(FILEPATH_MOBILE_PRICE)
df_mobile_price.info(verbose=True, memory_usage="deep")

In [None]:
df_mobile_price.head()

In [None]:
df_mobile_price.nunique()

In [None]:
df_mobile_price.describe().loc[["min", "max"]]

In [None]:
# share of whole-number values per column, to spot the columns that need a float dtype

def is_whole_number(n):
    return n % 1 == 0

df_mobile_price.apply(is_whole_number).mean()

Let's recall some basic computing concepts:

For integers
- **int8** can store integers from -128 to 127.
- **int16** can store integers from -32768 to 32767.
- **int32** can store integers from -2147483648 to 2147483647.
- **int64** can store integers from -9223372036854775808 to 9223372036854775807.
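
You don't have to memorize these ranges. As a sketch, `np.iinfo` reports them, and `pd.to_numeric` with `downcast="integer"` picks the smallest signed integer dtype that fits the data. The `n_cores` values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# value ranges of the signed integer dtypes
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{info.dtype}: {info.min} to {info.max}")

# made-up values standing in for a column such as the number of cores
n_cores = pd.Series([1, 2, 4, 8])
print(n_cores.dtype)

# downcast="integer" asks pandas for the smallest signed integer dtype that fits
compact = pd.to_numeric(n_cores, downcast="integer")
print(compact.dtype)
```

Downcasting is only safe when future values will also fit the smaller range, so check the minimum and maximum of each column first, as done above with `.describe()`.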

In [None]:
schema = {
    "battery_power": "int16",
    "blue": bool,
    "clock_speed": "float32",
    "dual_sim": bool,
    "fc": "int8",
    "four_g": bool,
    "int_memory": "int8",
    "m_dep": "float32",
    "mobile_wt": "int16",
    "n_cores": "int8",
    "pc": "int8",
    "px_height": "int16",
    "px_width": "int16",
    "ram": "int16",
    "sc_h": "int8",
    "sc_w": "int8",
    "talk_time": "int8",
    "three_g": bool,
    "touch_screen": bool,
    "wifi": bool
}
df_mobile_price = pd.read_csv(FILEPATH_MOBILE_PRICE, dtype=schema)
df_mobile_price.info(verbose=True, memory_usage="deep")

In [None]:
468.9e3 / 90.9e3 

Hooray! We've been able to shrink the memory usage by **~5x** just by replacing the `int64` and `float64` dtypes that pandas inferred with narrower integer and float dtypes (and `bool` for the yes/no fields).

Imagine this approach on an actual working dataset. You'll be surprised how efficient this technique can be in your own data science workflow: **not only does it speed up the subsequent processes, it also allows you to do more**, since you use less of your compute resources. 😁

**WHAT WE'VE COVERED**
- How to leverage pandas `DataFrame`'s quirk of being a column-major data structure.
    - Use `Series` to speed up calculations as compared to a row-based approach.
    - Chain operations efficiently by starting with the column slicing.
- How to correctly address the notorious `SettingWithCopyWarning` of pandas.
    - Difference between a `Copy` and `View`.
    - How to mitigate the `SettingWithCopyWarning` using `.loc[]`.
- How to efficiently load a pandas dataframe into memory.
    - Use only relevant columns for the problem.
    - Make use of the category dtype in pandas for categorical data.
    - Make use of narrower float and integer dtypes for the data.


## What I expect from you after the session

Now that we've finished this session I expect you to 📚✍️👨‍🏫👩‍🏫👨‍🎓👩‍🎓

## QUIZ

https://docs.google.com/forms/d/e/1FAIpQLSdPVkeKfi7ehcKpRCSQ8D8RMZAAVNqm8LeTZ8FH6NsobqHEwQ/viewform?usp=sf_link

## REFERENCES

Here are some helpful references curated just for you!

Docs
- [Here is the PEP8 style guide in Python.](https://www.python.org/dev/peps/pep-0008/)
- [Here is pendulum's official documentation.](https://pendulum.eustace.io/docs/)
- [Here is pandas' official documentation.](https://pandas.pydata.org/docs/reference/index.html#api)
- [Here is geopandas' official documentation.](https://geopandas.org/en/stable/docs.html)
- [Here is pytest's official documentation.](https://docs.pytest.org/en/7.0.x/) 
- [Here is seaborn's official documentation.](https://seaborn.pydata.org/api.html)

Books
- [If you want to brush up on your python programming.](https://www.tomasbeuzen.com/python-programming-for-data-science/README.html)
- [If you want to learn more about inferential thinking.](https://inferentialthinking.com/chapters/intro.html)
- [If you want to learn more about geographic data science.](https://geographicdata.science/book/intro.html)

Blogs
- [If you want to learn more about test-driven development.](https://testdriven.io/blog/)