In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib_inline.backend_inline import set_matplotlib_formats

set_matplotlib_formats("svg")
sns.set_context("poster")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)
pd.set_option("display.max_rows", 8)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Lecture 2 – DataFrame Fundamentals

## DSC 80, Fall 2023

**Pull the repo from GitHub and open lec02.ipynb; we will be coding today!**

### Announcements 📣

- The [Welcome Survey](https://forms.gle/8vVAFAkqfW5rQdRq6) is due **tonight at 11:59pm**.
- Lab 1 is released, and is due **next Monday, Oct 9th at 11:59pm!**
    - See the [Tech Support](https://dsc80.com/tech_support/#replicating-the-gradescope-environment) page for instructions and watch [this video 🎥](https://www.youtube.com/watch?v=PPKXJqu2XmY) for tips on how to set up your environment and work on assignments.
    - Please try to set up your computer ASAP, since we have OH on Friday but not over the weekend to help debug your environment.
- You may use a slip day on Lab 1, in which case the due date will be Oct 10th.
- Discussion tomorrow will talk about **what a conda environment is**, and how to debug package import issues on your own.
- Lecture recordings are available [here](https://podcast.ucsd.edu/watch/fa23/dsc80_a00).

### Agenda

- Whirlwind review of `numpy` and `babypandas`.
- `pandas` DataFrame objects.
- Subsetting dataframes
    - `.loc`, `.iloc`, filtering/querying
    
Can't cover every single detail! The [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) will be your friend.

## Review: `numpy`

- NumPy stands for "numerical Python". It is a commonly used Python module that enables **fast** computation involving arrays and matrices.
- `numpy`'s main object is the **array**. In `numpy`, arrays are:
    - Homogeneous – all values are of the same type.
    - (Potentially) multi-dimensional.
- Computation in `numpy` is fast because:
    - Much of it is implemented in C.
    - `numpy` arrays are stored more efficiently in memory than, say, Python lists. 
- [This site](https://cloudxlab.com/blog/numpy-pandas-introduction/) provides a good overview of `numpy` arrays.

We used `numpy` in DSC 10 to work with sequences of data:


In [None]:
...

### ⚠️ The dangers of `for`-loops

- `for`-loops are slow when processing large datasets. **You will rarely write `for`-loops in DSC 80 (except for Lab 1 and Project 1), and may be penalized on assignments for using them when unnecessary!**
- One of the biggest benefits of `numpy` is that it supports **vectorized** operations. 
    - If `a` and `b` are two arrays of the same length, then `a + b` is a new array of the same length containing the element-wise sum of `a` and `b`.
- To illustrate how much faster `numpy` arithmetic is than using a `for`-loop, let's compute the squares of the integers from 0 up to (but not including) 1,000,000:
    - Using a `for`-loop.
    - Using vectorized arithmetic, through `numpy`.

In [None]:
%%timeit
squares = []
for i in range(1_000_000):
    squares.append(i * i)

In [None]:
%%timeit
squares = np.arange(1_000_000) ** 2

- Python: takes about 0.06 seconds per loop
- `numpy`: takes about 0.0004 seconds per loop, more than 100x faster!
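
As a small illustration of the element-wise behavior described above, here is a minimal sketch. The arrays `a` and `b` and their values are made up for this example.

```python
import numpy as np

# Two small arrays of the same length (illustrative values only).
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Element-wise sum: no explicit loop is needed.
elementwise_sum = a + b
print(elementwise_sum)  # [11 22 33]

# The squares example works the same way: ** is applied to every element.
first_squares = np.arange(5) ** 2
print(first_squares)  # [ 0  1  4  9 16]
```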

## Introduction to `pandas` 🐼

### Baby pandas

- A subset of `pandas` that is beginner-friendly.
<center><img src='imgs/babypanda.jpg' width=45%></center>

### pandas

- Everything that you learned in `babypandas` will carry over.
<center><img src='imgs/angrypanda.jpg' width=60%></center>

### `pandas`

<center><img src='imgs/pandas.png' width=200></center>

- `pandas` is **the** Python library for tabular data manipulation.
- Before `pandas` was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.
- Wes McKinney, the original developer of `pandas`, wanted a library which would allow everything to be done in Python.
    - Python is faster to develop in than Java, and is more general-purpose than R.

### `pandas` data structures

There are three key data structures at the core of `pandas`:
- DataFrame: 2-dimensional tables.
- Series: 1-dimensional array-like object, typically representing a column or row.
- Index: sequence of column or row labels.

<center><img src='imgs/example-df.png' width=600></center>
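
To make the three structures concrete, here is a minimal sketch on a toy DataFrame. The name `toy` and all of its values are invented for illustration and are not taken from the lecture's data.

```python
import pandas as pd

# Toy table with made-up values.
toy = pd.DataFrame(
    {"size": ["small", "large"], "longevity": [11.0, 9.5]},
    index=pd.Index(["Pug", "Boxer"], name="breed"),
)

column = toy["longevity"]  # a single column is a Series
row_labels = toy.index     # the row labels are an Index
col_labels = toy.columns   # the column labels are also an Index

print(type(toy).__name__, type(column).__name__, type(row_labels).__name__)
```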

### Importing `pandas` and related libraries

`pandas` is almost always imported in conjunction with `numpy`:

In [None]:
import pandas as pd
import numpy as np

### Example: Dog Breeds (woof!) 🐶

Data originally from the American Kennel Club, which was made into a [neat plot](https://informationisbeautiful.net/visualizations/best-in-show-whats-the-top-data-dog/):

![](https://infobeautiful4.s3.amazonaws.com/2014/11/IIB_Best-In-Show_1276x2.png)

### But...

The data are no longer available! One website has a slightly different version: https://tmfilho.github.io/akcdata/

We'll use the version that Sam saved while the data were still online.

In [None]:
all_dogs = pd.read_csv('data/all_dogs.csv')
all_dogs

#### Discussion Question

Let's refresh your DSC 10 knowledge! Find the most popular and least popular dog breeds using the `popularity_all` column.

In [None]:
# Fill in this cell

### A Smaller Dogs Dataframe

The `all_dogs` dataframe is a bit large, so we have a smaller version here to make it easier to show `pandas` functionality.

In [None]:
dogs = pd.read_csv('data/dogs43.csv')
dogs

### Review: `head`, `tail`, `shape`, `index`, `get`, `sort_values`

To extract the first or last few rows of a DataFrame, use the `head` or `tail` methods.

In [None]:
dogs.head(3)

In [None]:
dogs.tail(2)

The `shape` attribute returns the DataFrame's number of rows and columns.

In [None]:
dogs.shape

We know that we can use `.get()` to select out a few columns...

In [None]:
# This is review from DSC 10 but most people don't use .get() in practice.
# Will cover in just a few minutes...
...

To sort by a column, use the `sort_values` method. Like most DataFrame and Series methods, `sort_values` returns a new DataFrame, and doesn't modify the original.

In [None]:
dogs.sort_values(...)
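
The claim that `sort_values` returns a new DataFrame can be checked on a toy example. This is only a sketch: the `'weight'` column name and the values below are assumptions made for illustration.

```python
import pandas as pd

# Toy data with made-up weights.
toy = pd.DataFrame({
    "breed": ["Pug", "Maltese", "Beagle"],
    "weight": [16.0, 5.0, 22.0],
})

# Sort from heaviest to lightest; the result is a new DataFrame.
heaviest_first = toy.sort_values("weight", ascending=False)

print(heaviest_first["breed"].tolist())  # ['Beagle', 'Pug', 'Maltese']
print(toy["breed"].tolist())             # ['Pug', 'Maltese', 'Beagle'] (unchanged)
```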

### Setting the index

Think of each row's index as its **unique identifier** or **name**. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the `set_index` method.

In [None]:
# By reassigning dogs, our changes will persist.
dogs = dogs.set_index('breed')
dogs

In [None]:
# There used to be 7 columns, but now there are only 6!


### 💡 Pro-tip: Displaying more rows/columns

Sometimes, you just want `pandas` to display a lot of rows and columns. You can use this helper function to do that:

In [None]:
from IPython.display import display
def display_df(df, rows=pd.options.display.max_rows, cols=pd.options.display.max_columns):
    """Displays n rows and cols from df"""
    with pd.option_context("display.max_rows", rows,
                           "display.max_columns", cols):
        display(df)

In [None]:
display_df(dogs, rows=43)

## Selecting columns

### Selecting columns in `babypandas` 👶🐼

- In `babypandas`, you selected columns using the `.get` method.
- `.get` also works in `pandas`, but it is not **idiomatic** – people don't usually use it.

In [None]:
dogs

In [None]:
dogs.get('size')

In [None]:
# This doesn't error, but sometimes we'd like it to.
dogs.get('size oops!')

### Selecting columns with `[]`

* The standard way to select a column in `pandas` is by using the `[]` operator.
* Specifying a column name returns the column as a Series.
* Specifying a list of column names returns a DataFrame.
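
The sketch below illustrates these rules on a toy DataFrame (the values are made up). It also contrasts `[]` with `.get` for a column name that doesn't exist.

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame(
    {"size": ["small", "small"], "longevity": [11.0, 12.25]},
    index=pd.Index(["Pug", "Maltese"], name="breed"),
)

as_series = toy["size"]                # column name -> Series
as_frame = toy[["size", "longevity"]]  # list of names -> DataFrame
one_col_frame = toy[["size"]]          # a one-element list still gives a DataFrame

# .get quietly returns None for a missing column...
missing_with_get = toy.get("size oops!")

# ...whereas [] raises a KeyError, which is usually what we want.
try:
    toy["size oops!"]
    raised = False
except KeyError:
    raised = True
```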

In [None]:
dogs

### Useful Series methods

There are a variety of useful methods that work on Series. You can see the entire list [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.html). Many methods that work on a Series will also work on DataFrames, as we'll soon see.
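
A few of these methods are sketched below on a toy Series. The values are invented for illustration, and the selection of methods is just a sample.

```python
import pandas as pd

# Toy Series with made-up longevity values.
longevity = pd.Series(
    [11.0, 12.25, 9.5, 12.25],
    index=["Pug", "Maltese", "Boxer", "Beagle"],
    name="longevity",
)

average = longevity.mean()         # arithmetic mean of the values
largest = longevity.max()          # largest value
label_of_largest = longevity.idxmax()  # index label of the first largest value
counts = longevity.value_counts()  # how often each distinct value appears
summary = longevity.describe()     # count, mean, std, min, quartiles, max
```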

## Subsetting rows (and columns)

### Using `loc` to slice rows and columns using labels

In [None]:
# The first argument is the row label
#        ↓
dogs.loc[...]
#                  ↑
# The second argument is the column label

### 💡 Pro-Tip: Using Pandas Tutor

If you want, you can install `pandas_tutor` from pip (in your terminal):

    pip install pandas_tutor

Then, you can load the extension by adding the following line at the top of your notebook:

    %reload_ext pandas_tutor

After that, you can render visualizations with the `%%pandas_tutor` or `%%pt` cell magics:

In [None]:
# Pandas Tutor setup. You'll need to run `pip install pandas_tutor` in your terminal
# for this cell to work, but you can also ignore the error and continue onward.
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}

In [None]:
%%pt
dogs.loc['Pug', 'longevity']

### `.loc` is flexible

`.loc` will expand dimensions whenever an argument is a sequence:

In [None]:
dogs.loc[...]
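
To see how the result's dimensions depend on the arguments, here is a minimal sketch on a toy DataFrame. The values are made up, and the `'weight'` column name is an assumption.

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {
        "size": ["small", "small", "small"],
        "weight": [22.0, 16.0, 5.0],
        "longevity": [12.3, 11.0, 12.25],
    },
    index=pd.Index(["Beagle", "Pug", "Maltese"], name="breed"),
)

scalar = toy.loc["Pug", "longevity"]                          # label, label -> single value
row_part = toy.loc["Pug", ["size", "longevity"]]              # label, list  -> Series
col_part = toy.loc[["Pug", "Maltese"], "longevity"]           # list, label  -> Series
frame = toy.loc[["Pug", "Maltese"], ["size", "longevity"]]    # list, list   -> DataFrame

# Label slices include BOTH endpoints, unlike ordinary Python slices.
label_slice = toy.loc["Beagle":"Pug", :]
```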

### Review: Filtering (aka Querying)

- Filtering is the act of selecting rows in a DataFrame that satisfy certain condition(s).
- Comparisons with arrays (Series) result in Boolean arrays (Series).
- We can use comparisons along with the `loc` operator to **filter** a DataFrame.
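
The steps above can be sketched on a toy DataFrame. The values, the `'weight'` column name, and the threshold of 20 are all assumptions made for illustration.

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {"weight": [22.0, 16.0, 5.0], "longevity": [12.3, 11.0, 12.25]},
    index=pd.Index(["Beagle", "Pug", "Maltese"], name="breed"),
)

# A comparison produces a Boolean Series with the same index as the DataFrame.
is_light = toy["weight"] < 20

# Passing the Boolean Series to .loc keeps only the rows marked True.
light_dogs = toy.loc[is_light]

# Plain [] with a Boolean Series filters rows in the same way.
same_result = toy[is_light]
```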


In [None]:
dogs

Note that because we set the index to `'breed'` earlier, we can select rows based on dog breeds without having to query.

In [None]:
dogs

If `'breed'` were instead a column, then we'd need to query to access information about a particular breed.

In [None]:
dogs_reset = dogs.reset_index()
dogs_reset

In [None]:
# DataFrame!
dogs_reset[dogs_reset['breed'] == 'Maltese']

### Filtering with Multiple Conditions

Remember, you need parentheses around each condition. Also, you must use `&` and `|` instead of the `and` and `or` keywords. `pandas` makes weird decisions sometimes!

In [None]:
...
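
Here is a minimal sketch of filtering with two conditions, using a toy DataFrame with made-up values and thresholds. The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `<`, and the `and`/`or` keywords raise a `ValueError` on Series because the truth value of a whole Series is ambiguous.

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {"weight": [22.0, 16.0, 5.0], "longevity": [12.3, 11.0, 12.25]},
    index=pd.Index(["Beagle", "Pug", "Maltese"], name="breed"),
)

# Rows satisfying BOTH conditions.
light_and_long_lived = toy[(toy["weight"] < 20) & (toy["longevity"] > 12)]

# Rows satisfying AT LEAST ONE condition.
tiny_or_short_lived = toy[(toy["weight"] < 10) | (toy["longevity"] < 11.5)]
```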

### 💡 Pro-Tip: Using `.query` (optional)

`.query` is a convenient way to filter, since you don't need parentheses and you can use the `and` and `or` keywords. We'll use it during lecture, but you won't need to use it yourself unless you'd like to. (It won't be used in our exams.)
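
As a rough sketch of what this looks like, the example below runs `.query` on a toy DataFrame. The values, the `'weight'` column name, and the thresholds are assumptions made for illustration.

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {"weight": [22.0, 16.0, 5.0], "longevity": [12.3, 11.0, 12.25]},
    index=pd.Index(["Beagle", "Pug", "Maltese"], name="breed"),
)

# Column names are written directly inside the query string.
light_and_long_lived = toy.query("weight < 20 and longevity > 12")

# The equivalent Boolean-mask version, for comparison.
with_mask = toy[(toy["weight"] < 20) & (toy["longevity"] > 12)]
```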

### Don't forget `iloc`!

- `iloc` stands for "integer location".
- `iloc` is like `loc`, but it selects rows and columns based off of integer positions only.

In [None]:
dogs

`iloc` is often most useful when we sort first. For instance, to find the weight of the longest-living dog breed in the dataset:
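
One possible way to do this is sketched below on a toy DataFrame. The values are made up and the `'weight'` column name is an assumption; the lecture's own solution may differ.

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {"weight": [16.0, 22.0, 5.0], "longevity": [11.0, 12.3, 12.25]},
    index=pd.Index(["Pug", "Beagle", "Maltese"], name="breed"),
)

# Sort so that the longest-living breed is in the first row...
by_longevity = toy.sort_values("longevity", ascending=False)

# ...then use integer position 0 to grab its weight.
longest_lived_weight = by_longevity["weight"].iloc[0]

# iloc also accepts integer slices for rows and columns (end-exclusive).
bottom_right = toy.iloc[1:3, 0:2]
```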

### More Practice

Consider the DataFrame below.

In [None]:
jack = pd.DataFrame({1: ['fee', 'fi'], 
                     '1': ['fo', 'fum']})
jack

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own. [Here's a Pandas Tutor link](https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0A%0Ajack%20%3D%20pd.DataFrame%28%7B1%3A%20%5B'fee',%20'fi'%5D,%20%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'1'%3A%20%5B'fo',%20'fum'%5D%7D%29%0Ajack%5B1%5D&d=2023-10-05&lang=py&v=v1) to visualize these!

In [None]:
# jack[1]

In [None]:
# jack[[1]]

In [None]:
# jack['1']

In [None]:
# jack[[1, 1]]

In [None]:
# jack.loc[1]

In [None]:
# jack.loc[jack[1] == 'fo']

In [None]:
# jack[1, ['1', 1]]

In [None]:
# jack.loc[1,1]

## Adding and modifying columns

### Adding and modifying columns, using a copy

- To add a new column to a DataFrame, use the `assign` method.
    - To change the values in a column, add a new column with the same name as the existing column.
- Like most `pandas` methods, `assign` returns a new DataFrame.
    - **Pro** ✅: This doesn't inadvertently change any existing variables.
    - **Con** ❌: It is not very space efficient, as it creates a new copy each time it is called.
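
The sketch below shows both uses of `assign` on a toy DataFrame. The values and the `cost_per_year` column name are invented for illustration.

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {"lifetime_cost": [20000.0, 18000.0], "longevity": [10.0, 12.0]},
    index=pd.Index(["Boxer", "Pug"], name="breed"),
)

# Add a new column; the keyword name becomes the column name.
with_rate = toy.assign(cost_per_year=toy["lifetime_cost"] / toy["longevity"])

# "Modify" a column by assigning to an existing name.
in_thousands = toy.assign(lifetime_cost=toy["lifetime_cost"] / 1000)

# toy itself is untouched in both cases.
print(toy.columns.tolist())  # ['lifetime_cost', 'longevity']
```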

### 💡 Pro-Tip: Method chaining

I recommend chaining methods together instead of writing one long line:
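
A chain might look like the sketch below, which wraps the expression in parentheses so that each method sits on its own line. The toy values and the `cost_per_year` column are assumptions made for illustration.

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {
        "lifetime_cost": [20000.0, 18000.0, 19600.0],
        "longevity": [10.0, 12.0, 14.0],
    },
    index=pd.Index(["Boxer", "Pug", "Maltese"], name="breed"),
)

# Each step returns a new DataFrame, which the next method acts on.
cheapest_per_year = (
    toy
    .assign(cost_per_year=lambda df: df["lifetime_cost"] / df["longevity"])
    .sort_values("cost_per_year")
    .head(2)
)
```

Passing a `lambda` to `assign` lets the new column refer to the DataFrame at that point in the chain rather than to the original variable.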

### 💡 Pro-Tip: `assign` for column names with special characters

You can also use `assign` when the desired column name has spaces (and other special characters) by unpacking a dict:
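
A minimal sketch of the dict-unpacking trick, using a toy DataFrame with made-up values and a made-up column name:

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {"lifetime_cost": [20000.0, 18000.0], "longevity": [10.0, 12.0]},
    index=pd.Index(["Boxer", "Pug"], name="breed"),
)

# 'cost per year ($)' is not a valid Python keyword argument name,
# so we pass it as a dictionary key and unpack the dictionary with **.
with_rate = toy.assign(
    **{"cost per year ($)": toy["lifetime_cost"] / toy["longevity"]}
)
```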

### Adding and modifying columns, in-place

* You can assign a new column to a DataFrame **in-place** using `[]`.
    - This works like dictionary assignment.
    - This **modifies** the underlying DataFrame, unlike `assign`, which returns a new DataFrame.
* This is the more "common" way of adding/modifying columns. 
    - ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.

In [None]:
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
dogs_copy = dogs.copy()
dogs_copy.head(2)

Note that we never reassigned `dogs_copy` after creating it (that is, we never wrote `dogs_copy = ...` a second time), though it was still modified.

### Mutability

DataFrames, like lists, arrays, and dictionaries, are **mutable**. As you learned in DSC 20, this means that they can be modified after being created. 

Not only does this explain the behavior on the previous slide, but it also explains the following:

In [None]:
dogs_copy.head(2)

In [None]:
def cost_in_thousands():
    dogs_copy['lifetime_cost'] = dogs_copy['lifetime_cost'] / 1000

In [None]:
# What happens when we run this twice?
cost_in_thousands()

In [None]:
dogs_copy

### ⚠️ Avoid mutation when possible

Note that `dogs_copy` was modified, even though we didn't reassign it! These unintended consequences can **influence the behavior of test cases on labs and projects**, among other things! 

To avoid this, it's a good idea to avoid mutation when possible. If you must use mutation, include `df = df.copy()` as the first line in functions that take DataFrames as input.

Also, some methods let you use the `inplace=True` argument to mutate the original. **Don't use this argument, since future `pandas` releases plan to remove it.**
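
The copy-first pattern can be sketched as follows. `cost_in_thousands_safe` is a hypothetical, non-mutating variant of the earlier function, and the toy values are made up.

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {"lifetime_cost": [20000.0, 18000.0]},
    index=pd.Index(["Boxer", "Pug"], name="breed"),
)

def cost_in_thousands_safe(df):
    """Return a copy of df with 'lifetime_cost' expressed in thousands."""
    df = df.copy()  # work on a copy so the caller's DataFrame is untouched
    df["lifetime_cost"] = df["lifetime_cost"] / 1000
    return df

# Calling the function twice gives the same answer both times,
# because the input is never modified.
first = cost_in_thousands_safe(toy)
second = cost_in_thousands_safe(toy)
```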

### Replacing values

Instead of mutation, we recommend using `replace`, which returns a copy of the original DataFrame:

In [None]:
dogs.replace(...)
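
As a sketch of how `replace` behaves, the example below swaps values in a toy DataFrame. The values and the replacement labels `'S'` and `'L'` are invented for illustration.

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {"size": ["small", "large"], "longevity": [11.0, 9.5]},
    index=pd.Index(["Pug", "Boxer"], name="breed"),
)

# Replace one value everywhere it appears in the DataFrame.
one_swap = toy.replace("small", "S")

# Restrict replacements to one column with a nested dictionary:
# {column name: {old value: new value}}.
relabelled = toy.replace({"size": {"small": "S", "large": "L"}})

# The original DataFrame still holds the old labels.
print(toy["size"].tolist())  # ['small', 'large']
```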

## Axes

### Axes

- Both the rows and the columns of a DataFrame can be extracted as Series.
- The **axis** specifies the direction of a "slice" of a DataFrame.

<center><img src='imgs/axis.png' width=30%></center>

- Axis 0 refers to the index (rows).
- Axis 1 refers to the columns.

### DataFrame methods with `axis`

Consider the DataFrame `A` defined below using a dictionary.

In [None]:
A = pd.DataFrame({
    'A': [1, 4],
    'B': [2, 5],
    'C': [3, 6],
})
A

If we specify `axis=0`, `A.sum` will "compress" along axis 0, and keep the column labels intact.

If we specify `axis=1`, `A.sum` will "compress" along axis 1, and keep the row labels (index) intact.

<center><img src='imgs/axis-sum.png' width=600></center>

What's the default axis?

In [None]:
A

### DataFrame methods with `axis`

- In addition to `sum`, many other Series methods work on DataFrames.
- In such cases, the DataFrame method usually applies the Series method to every row or column.
- Many of these methods accept an `axis` argument; the default is usually `axis=0`.
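
To make the two directions concrete, the sketch below calls `sum` on the DataFrame `A` from earlier with each axis, and with no axis at all.

```python
import pandas as pd

A = pd.DataFrame({
    "A": [1, 4],
    "B": [2, 5],
    "C": [3, 6],
})

# axis=0 collapses the rows: one total per column label.
col_totals = A.sum(axis=0)   # A -> 5, B -> 7, C -> 9

# axis=1 collapses the columns: one total per row label.
row_totals = A.sum(axis=1)   # 0 -> 6, 1 -> 15

# With no axis argument, sum behaves like axis=0.
default_totals = A.sum()
```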

## `pandas` and `numpy`

<center><img src='imgs/python-stack.png' width=60%></center>

### `pandas` is built upon `numpy`

- A Series in `pandas` is a `numpy` array with an index.
- A DataFrame is like a dictionary of columns, each of which is a `numpy` array.
- Many operations in `pandas` are fast because they use `numpy`'s implementations.
- If you need to access the array underlying a DataFrame or Series, use the `to_numpy` method.
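
A short sketch of `to_numpy` on a toy DataFrame (the values and the `'weight'` column name are made up):

```python
import numpy as np
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {"weight": [22.0, 16.0, 5.0], "longevity": [12.3, 11.0, 12.25]},
    index=pd.Index(["Beagle", "Pug", "Maltese"], name="breed"),
)

weights = toy["weight"].to_numpy()  # Series -> 1-dimensional array
matrix = toy.to_numpy()             # DataFrame -> 2-dimensional array

# The index labels are dropped; only the values remain.
print(weights.shape, matrix.shape)  # (3,) (3, 2)
```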

### `pandas` data types

- Each Series (column) has a `numpy` data type, which refers to the type of the values stored within. Access it using the `dtype` attribute of a Series or the `dtypes` attribute of a DataFrame.
- A column's data type determines which operations can be applied to it.
- `pandas` tries to guess the correct data types for a given DataFrame, and is often wrong.
    - This can lead to incorrect calculations and poor memory/time performance.
- As a result, you will often need to explicitly convert between data types.

In [None]:
dogs.dtypes

### `pandas` data types

Notice that Python `str` types are `object` types in `numpy` and `pandas`.

|Pandas dtype|Python type|NumPy type|SQL type|Usage|
|---|---|---|---|---|
|int64|int|int_, int8,...,int64, uint8,...,uint64|INT, BIGINT| Integer numbers|
|float64|float|float_, float16, float32, float64|FLOAT| Floating point numbers|
|bool|bool|bool_|BOOL|True/False values|
|datetime64 or Timestamp|datetime.datetime|datetime64|DATETIME|Date and time values|
|timedelta64 or Timedelta|datetime.timedelta|timedelta64|NA|Differences between two datetimes|
|category|NA|NA|ENUM|Finite list of text values|
|object|str|string, unicode|NA|Text|
|object|NA|object|NA|Mixed types|

[This article](https://www.dataquest.io/blog/pandas-big-data/) details how `pandas` stores different data types under the hood.

[This article](https://mortada.net/can-integer-operations-overflow-in-python.html#Can-integers-overflow-in-python?) explains how `numpy`/`pandas` `int64` operations differ from vanilla `int` operations.

### Type conversion

You can change the data type of a Series using the `.astype` Series method.

For instance, we can change the data type of the `'lifetime_cost'` column in `dogs` to be `int64`:

In [None]:
dogs.head()

In [None]:
dogs.dtypes
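
A conversion along these lines might look like the sketch below. It uses a toy DataFrame with made-up values and assumes the column starts out as `float64`. Note that converting floats to integers drops any fractional part, and the conversion raises an error if the column contains missing values.

```python
import pandas as pd

# Toy data with made-up values, indexed by breed.
toy = pd.DataFrame(
    {"lifetime_cost": [20000.0, 18000.0]},
    index=pd.Index(["Boxer", "Pug"], name="breed"),
)

# astype returns a new Series; assign puts it back without mutating toy.
converted = toy.assign(lifetime_cost=toy["lifetime_cost"].astype("int64"))

print(toy.dtypes["lifetime_cost"])        # float64
print(converted.dtypes["lifetime_cost"])  # int64
```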

### 💡 Pro-Tip: Setting dtypes in `read_csv`

Usually, we prefer to set the correct dtypes in `read_csv`, since it can help `pandas` load in files more quickly:

In [None]:
dogs = pd.read_csv('data/dogs43.csv', dtype=...)
dogs

In [None]:
dogs.dtypes
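
To show the shape of the `dtype` argument without relying on the lecture's file, the sketch below reads a tiny in-memory CSV. The CSV contents and the chosen dtypes are assumptions made for illustration, not the settings used for `dogs43.csv`.

```python
import io

import pandas as pd

# A tiny CSV held in memory, with made-up values.
csv_text = (
    "breed,lifetime_cost,longevity\n"
    "Pug,18000,11.0\n"
    "Maltese,19600,12.25\n"
)

# Without dtype, pandas infers a type for each column.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, we map column names to the types we want.
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"breed": "category", "lifetime_cost": "float64"},
)

print(inferred.dtypes)
print(explicit.dtypes)
```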

## Putting it all together

Talk to your neighbor about a dog breed that you personally like or know the name of. Then, try to find a few other dog breeds that are similar in weight to yours in `all_dogs`. Which similar breeds have the lowest and highest `lifetime_cost`? `intelligence_rank`? Are there any similar breeds that you haven't heard of before?

As a bonus, look up these dog breeds on the [AKC website](https://www.akc.org/) to see how they look!

In [None]:
all_dogs

## Summary, next time

### Summary

- `pandas` is **the** library for tabular data manipulation in Python.
- There are three key data structures in `pandas`: DataFrame, Series, and Index.
- Refer to the lecture notebook and the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) for tips.
- `pandas` relies heavily on `numpy`. An understanding of how data types work in both will allow you to write more efficient and bug-free code.
- Series and DataFrames share many methods (refer to the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) for more details).
- Most `pandas` methods return copies of Series/DataFrames. Be careful when using techniques that modify values in-place.
- Next time: `groupby` and data granularity.