# `pandas` Part I - Basic usage, indexing

This document includes a sequence of notebooks to introduce data manipulation in Python using the `pandas` library.

## Basic building blocks
`pandas` works with tabular data (rows and columns) through `Series` and `DataFrame` objects.

In [None]:
import pandas as pd

meteor = pd.read_csv('../data/Meteorite_Landings.csv')

*Source: [NASA Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh/data_preview)*

In [None]:
type(meteor)

A `DataFrame` is the basic "spreadsheet" or "table" used in `pandas`. A `DataFrame` object is composed of one or more `Series` objects (columns), and its rows are labeled by an `Index`.

In [None]:
# .head() shows us the first several rows
meteor.head()

In [None]:
# investigate a column (note its type)
print(type(meteor.name))

meteor.name.head()

In [None]:
# investigate how the columns are labeled
meteor.columns

In [None]:
# investigate how the rows are indexed
meteor.index
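
The relationship between `DataFrame`, `Series`, and `Index` can also be seen on a small table built by hand. The sketch below uses made-up values (the `toy` frame and its contents are illustrative assumptions, not rows from the meteorite file) so that it runs without the CSV.

```python
import pandas as pd

# Hypothetical toy data, not taken from the meteorite file
toy = pd.DataFrame(
    {'name': ['A', 'B', 'C'], 'mass': [21.0, 720.0, 107000.0]}
)

column = toy['name']          # a single column is a Series
print(type(column).__name__)  # class name of the column object
print(toy.columns.tolist())   # column labels
print(toy.index)              # default index for the rows

# The column shares the row index of its parent DataFrame
print(column.index.equals(toy.index))
```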

## DataFrame sources

`DataFrame`s can be created by reading a file, scraping the web, or making API requests.

### Reading from a file

In [None]:
import pandas as pd

meteor = pd.read_csv('../data/Meteorite_Landings.csv')

### API requests (more details later)

In [None]:
import requests

response = requests.get(
    'https://data.nasa.gov/resource/gh4g-9sfh.json',
    params={'$limit': 50000}  # Depending on the API, there may be a default limit of records one can obtain
)  

In [None]:
response  # A 200 status code indicates success

*Tip:* A list of HTTP response status codes is available at https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

In [None]:
response.ok  # checks ok flag

In [None]:
# Extract data if request is successful
if response.ok:
    payload = response.json()
else:
    print(f'Request failed with status code {response.status_code}.')
    payload = None

In [None]:
# Load into DataFrame
meteor_json = pd.DataFrame(payload)
meteor_json.head()

In [None]:
# Removing auto-computed columns
mask = meteor_json.columns.str.contains('@computed_region', regex=True)

columns_to_drop = meteor_json.columns[mask]

In [None]:
meteor_json = meteor_json.drop(columns=columns_to_drop)
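
For readers without network access, the same load-and-clean steps can be tried on a hand-written payload. This is only a sketch: the `records` list is a hypothetical stand-in for what `response.json()` might return, and the column names in it are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical records shaped like a JSON payload (a list of dicts)
records = [
    {'name': 'A', 'mass': '21', ':@computed_region_x': '1'},
    {'name': 'B', 'mass': '720'},  # a missing key becomes NaN
]
df = pd.DataFrame(records)

# regex=False treats the pattern as a literal substring
mask = df.columns.str.contains('@computed_region', regex=False)
df = df.drop(columns=df.columns[mask])

# JSON values often arrive as strings, so convert where numbers are expected
df['mass'] = pd.to_numeric(df['mass'])
print(df.columns.tolist())
print(df.dtypes)
```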

In [None]:
# store the downloaded data in a file (uncomment to run)
# meteor_json.to_csv('meteor.csv')

## Basic inspection

### What type of data are available in the dataframe? Are there missing data?

In [None]:
meteor.info()

### How much data are available?

In [None]:
meteor.shape

## Subsetting and indexing

Effectively extracting data from a full dataset requires fluency in how the `DataFrame` can be subsetted and how it is indexed.

### Calling a column by attributes (if valid)

In [None]:
meteor.recclass

### Calling a column by keys

In [None]:
meteor['mass (g)']

### Multiple columns by name

In [None]:
meteor[['name', 'mass (g)']]

### Selecting rows

In [None]:
meteor[5:10]  # end-exclusive

### Indexing with `.loc[]` and `.iloc[]`

- `.loc[]` indexes by row and column labels; slices include both endpoints.
- `.iloc[]` indexes by integer position; slices exclude the end.

In [None]:
meteor.loc[0:4, 'name':'mass (g)']

In [None]:
meteor.iloc[0:4, 0:5]
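
With the default index, labels and positions coincide, which can hide the difference between the two indexers. The sketch below uses a hypothetical frame whose row labels are deliberately not 0, 1, 2, ... to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from row positions
toy = pd.DataFrame(
    {'name': ['A', 'B', 'C', 'D'], 'mass': [1, 2, 3, 4]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[10:30, 'name':'mass']  # labels, both ends included
by_position = toy.iloc[0:2, 0:2]          # positions, end excluded

print(by_label.shape)     # rows labeled 10, 20, 30
print(by_position.shape)  # rows at positions 0 and 1
```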

### Filtering or subsetting by condition
Selection by condition can be performed by creating a Boolean *mask* with True/False values to specify which rows/columns to select.

In [None]:
# select records of heavy meteorites (mass > 10^7 g) that were found (fall == 'Found')
mask = (meteor['mass (g)'] > 1e7) & (meteor.fall == 'Found')
mask

In [None]:
meteor[mask]

**Note:** Each condition is surrounded by parentheses, and we use bitwise operators (`&`, `|`, `~`) instead of logical operators (`and`, `or`, `not`).
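
The parentheses matter because `&` binds more tightly than comparison operators such as `>` and `==`, so an unparenthesized expression is grouped differently than intended. The following sketch, on a made-up three-row frame, shows the working form and what happens when `and` is used instead.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'mass': [5, 50, 500], 'fall': ['Found', 'Fell', 'Found']})

# Parentheses make each comparison evaluate before & combines them
mask = (toy['mass'] > 10) & (toy['fall'] == 'Found')
print(mask.tolist())

# `and` asks for a single truth value of a whole Series, which is ambiguous
try:
    (toy['mass'] > 10) and (toy['fall'] == 'Found')
except ValueError as err:
    print(f'ValueError: {err}')
```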

In [None]:
# negation of a mask
meteor[~mask]

**Note**: Boolean masks can be used with `.loc[]` as well. `.iloc[]` accepts a Boolean array (for example `mask.to_numpy()`) but not a Boolean `Series`.
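
A short sketch of masks combined with the indexers, using a hypothetical frame: `.loc[]` takes the Boolean `Series` directly and can select columns at the same time, while for `.iloc[]` the mask is first converted to a plain NumPy array because `.iloc[]` works by position.

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {'name': ['A', 'B', 'C'], 'mass': [5, 50, 500]},
    index=['x', 'y', 'z'],
)
mask = toy['mass'] > 10

# .loc accepts the Boolean Series directly, plus a column selection
heavy_names = toy.loc[mask, 'name']

# .iloc works by position, so pass a plain array rather than a Series
heavy_first_col = toy.iloc[mask.to_numpy(), 0]

print(heavy_names.tolist())
print(heavy_first_col.tolist())
```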

## Calculating summary statistics
This section discusses preliminary calculations before conducting further data analysis.

### How many of the meteorites were observed falling vs found?

In [None]:
meteor.fall.value_counts()

In [None]:
meteor.fall.value_counts(normalize=True)

### How is the mass of a meteorite distributed?

In [None]:
meteor['mass (g)'].mean()

In [None]:
meteor['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])

In [None]:
meteor['mass (g)'].max()

In [None]:
# sometimes helpful to locate the other information related to a particular entry
meteor.loc[meteor['mass (g)'].idxmax()]  # note the "index" of max()
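
`idxmax()` returns the row *label* of the maximum, which is why it pairs with `.loc[]` rather than `.iloc[]`. A minimal sketch on hypothetical data with non-numeric labels makes this explicit.

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {'name': ['A', 'B', 'C'], 'mass': [5.0, 500.0, 50.0]},
    index=['x', 'y', 'z'],
)

label = toy['mass'].idxmax()   # label of the row holding the largest mass
print(label)
print(toy.loc[label, 'name'])  # look up other fields of that row
```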

### How many unique classes are in this dataset?

In [None]:
meteor.recclass.nunique()

### General statistics
By default, the `.describe()` method summarizes only numeric columns. Here we force it to include all columns.

In [None]:
meteor.describe(include='all')


**Note**: `NaN` values signify missing data. For instance, the `fall` column contains strings, so there is no value for `mean`; likewise, `mass (g)` is numeric, so we don't have entries for the categorical summary statistics (`unique`, `top`, `freq`).
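
The pattern of `NaN` entries is easier to see on a tiny table. The sketch below uses a hypothetical two-column frame, one text column and one numeric column, and reads individual rows of the summary.

```python
import pandas as pd

# Hypothetical toy data: one string column, one numeric column
toy = pd.DataFrame(
    {'fall': ['Found', 'Fell', 'Found'], 'mass': [5.0, 50.0, 500.0]}
)
summary = toy.describe(include='all')

print(summary.loc['mean'])  # NaN for fall, 185.0 for mass
print(summary.loc['top'])   # 'Found' for fall, NaN for mass
```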

## Practice 1

The following command downloads NYC Yellow Taxi data as a `.parquet` file, a common storage format for moderate to large datasets.

In [None]:
!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

In [None]:
taxi = pd.read_XXXXXXX('yellow_tripdata_2024-01.parquet')

1. Examine the first five rows of the dataset.

2. How much data are included in this dataset?

3. Calculate summary statistics for the `fare_amount`, `tolls_amount`, and `tip_amount`.  Do they add up to the `total_amount`?

4. Find the longest trip by distance (`trip_distance`).

5. Compare the average `total_amount` for short versus long trips (short trip has `trip_distance` < 10).  Make sure we do not include zero-distance trips.

---

## Data cleaning
We will cover some common transformations that facilitate data analysis, including rearranging columns, type conversion, and sorting.

In [None]:
minitaxi = taxi.sample(10000)  # down-sample our dataset for illustration

In [None]:
minitaxi.shape

In [None]:
minitaxi.head()

### Dropping columns

In [None]:
# drop all id columns and the store_and_fwd_flag column
mask = minitaxi.columns.str.contains('ID$|store_and_fwd_flag', regex=True)
columns_to_drop = minitaxi.columns[mask]
columns_to_drop

In [None]:
minitaxi = minitaxi.drop(columns=columns_to_drop)
minitaxi.head()

### Renaming columns

In [None]:
minitaxi = minitaxi.rename(
    columns={
        'tpep_pickup_datetime': 'pickup', 
        'tpep_dropoff_datetime': 'dropoff'
    }
)
minitaxi.columns

### Checking and converting data types

In [None]:
minitaxi.dtypes

In [None]:
minitaxi.info()

In [None]:
# cast passenger_count to integers; the nullable Int64 dtype
# tolerates missing values, which plain int does not
minitaxi = minitaxi.astype({'passenger_count': 'Int64'})
minitaxi.dtypes
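
Integer casts interact with missing values, so it can help to try the conversion on a small example first. The sketch below assumes a hypothetical column with one missing entry and contrasts a plain `int` cast with the nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical column with one missing passenger count
toy = pd.DataFrame({'passenger_count': [1.0, 2.0, None]})

# Plain int has no representation for NaN, so this cast fails
try:
    toy['passenger_count'].astype(int)
except (ValueError, TypeError) as err:
    print(f'{type(err).__name__}: {err}')

# The nullable Int64 dtype keeps the gap as <NA>
toy = toy.astype({'passenger_count': 'Int64'})
print(toy.dtypes)
```

Dropping the missing rows first and then casting to `int` is an alternative when the gaps are not needed.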

### Dropping rows with NA values

In [None]:
minitaxi = minitaxi.dropna()

In [None]:
minitaxi.info()

## Creating new columns

Let's calculate the following for each row:

1. elapsed time of the trip
2. the tip percentage
3. the total taxes, tolls, fees, and surcharges
4. the average speed of the taxi

In [None]:
minitaxi = minitaxi.assign(
    elapsed_time=lambda x: x.dropoff - x.pickup,
    cost_before_tip=lambda x: x.total_amount - x.tip_amount,
    tip_pct=lambda x: x.tip_amount / x.cost_before_tip, 
    fees=lambda x: x.cost_before_tip - x.fare_amount, 
    avg_speed=lambda x: x.trip_distance.div(
        x.elapsed_time.dt.total_seconds() / 60 / 60
    )
)

In [None]:
minitaxi.head(2)

*Notes*:

- We used `lambda` functions to 1) avoid typing `minitaxi` repeatedly and 2) access the `cost_before_tip` and `elapsed_time` columns in the same method call that creates them.
- To create a single new column, we can also use `df['new_col'] = <values>`.
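
Both approaches from the notes can be compared on a small frame. In this sketch the data and the `is_generous` column are hypothetical, introduced only to illustrate the two styles.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'total_amount': [12.0, 30.0], 'tip_amount': [2.0, 5.0]})

toy = toy.assign(
    cost_before_tip=lambda x: x.total_amount - x.tip_amount,
    # x already contains cost_before_tip, created one line above
    tip_pct=lambda x: x.tip_amount / x.cost_before_tip,
)

# Bracket assignment adds one column at a time, modifying toy in place
toy['is_generous'] = toy['tip_pct'] >= 0.15
print(toy)
```

`assign()` returns a new `DataFrame`, which keeps it usable inside a method chain, whereas bracket assignment changes the existing object.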

## Sorting by values

In [None]:
# sort by descending passenger count and pickups from earliest to latest
minitaxi.sort_values(['passenger_count', 'pickup'], ascending=[False, True]).head()

In [None]:
# pick out the 3 trips with largest timespan
minitaxi.nlargest(3, 'elapsed_time')
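
How ties are handled in a multi-column sort is easiest to follow on a few rows. The sketch below uses made-up pickups and passenger counts; `nlargest()` is shown alongside as a shortcut comparable to sorting in descending order and taking the first rows.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'passenger_count': [1, 3, 3, 2],
    'pickup': pd.to_datetime(
        ['2024-01-01 09:00', '2024-01-01 08:00',
         '2024-01-01 07:00', '2024-01-01 06:00']
    ),
})

# Ties on passenger_count are broken by the earliest pickup
ordered = toy.sort_values(['passenger_count', 'pickup'], ascending=[False, True])
print(ordered.index.tolist())

# The two rows with the largest passenger_count
top2 = toy.nlargest(2, 'passenger_count')
print(top2.index.tolist())
```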

## Working with index
Currently the index simply holds the original row numbers. If we plan to work heavily with the pickup times, indexing by the datetime column may be more effective.

In [None]:
minitaxi.index

### Setting index

In [None]:
minitaxi = minitaxi.set_index('pickup')
minitaxi.head(3)

### Sorting by index

In [None]:
# sorting by indices
minitaxi = minitaxi.sort_index()
minitaxi.head()

Recall how we used `.loc[0:4]` to select the rows labeled 0 to 4. Now that the index holds datetimes, we can select by time range in the same way.

### Selecting by index

In [None]:
# selecting the taxi rides in the first 6 hours of the new year
minitaxi['2024-01-01 00:00':'2024-01-01 06:00']
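
Time-range selection can be checked on a handful of timestamps. This sketch assumes a hypothetical `fare` column and uses `.loc[]`, which makes it explicit that rows are being selected by label; it also shows that a partial date string selects a whole day.

```python
import pandas as pd

# Hypothetical toy data indexed by pickup time
toy = pd.DataFrame(
    {'fare': [10, 20, 30, 40]},
    index=pd.to_datetime(
        ['2024-01-01 00:30', '2024-01-01 05:59',
         '2024-01-01 06:00', '2024-01-01 12:00']
    ),
).sort_index()

# Label slices on a sorted DatetimeIndex include both endpoints
early = toy.loc['2024-01-01 00:00':'2024-01-01 06:00']
print(early['fare'].tolist())

# A partial date string selects everything in that period
whole_day = toy.loc['2024-01-01']
print(len(whole_day))
```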

### Resetting index
Resetting the index moves the index back into a regular column and restores default row numbers, **but** notice that after setting and resetting the index, the original row numbers are lost.

In [None]:
minitaxi = minitaxi.reset_index()
minitaxi.head()
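
If the original row numbers matter, one option is to copy them into a column before replacing the index. The sketch below is one possible approach on hypothetical data; the `row_id` name is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; the index stands in for the original row numbers
toy = pd.DataFrame(
    {'pickup': ['c', 'a', 'b'], 'fare': [30, 10, 20]},
    index=[100, 101, 102],
)

# set_index replaces the old index; 100..102 are not kept anywhere
roundtrip = toy.set_index('pickup').sort_index().reset_index()
print(roundtrip.index.tolist())  # fresh numbering starting at 0

# Naming the index and resetting it first keeps the numbers as a column
kept = (
    toy.rename_axis('row_id')
       .reset_index()
       .set_index('pickup')
       .sort_index()
       .reset_index()
)
print(kept['row_id'].tolist())
```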

## Practice 2

Using the `meteor` dataset, 

1. cast the `year` column to an integer column.
2. create a new column indicating whether the meteorite was observed falling before 1970.
3. set the index to the id column and extract all the rows with IDs between 10,036 and 10,040 (inclusive) with `loc[]`.
4. examine the `year` column to see if there are any data errors.