# Data manipulation with `pandas`

This document includes a sequence of notebooks to introduce data manipulation in Python using the `pandas` library.

## Basic building blocks
`pandas` works with tabular data (rows and columns) through `Series` and `DataFrame` objects.

In [None]:
import pandas as pd

meteor = pd.read_csv('../data/Meteorite_Landings.csv')

*Source: [NASA Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh/data_preview)*

In [None]:
type(meteor)

A `DataFrame` is the basic "spreadsheet" or "table" object in `pandas`. It is composed of one or more `Series` objects (the columns), and its rows are labeled by an `Index`.

In [None]:
# .head() shows us the first several rows
meteor.head()

In [None]:
# investigate a column (note its type)
print(type(meteor.name))

meteor.name.head()

In [None]:
# investigate how the columns are labeled
meteor.columns

In [None]:
# investigate how the rows are indexed
meteor.index
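To make the relationship between `DataFrame`, `Series`, and `Index` concrete without the CSV file, here is a minimal sketch on a tiny hand-made table. The rows and the name `toy` are invented for illustration and are not taken from the meteorite data.

```python
import pandas as pd

# Made-up rows, loosely modeled on the meteorite table
toy = pd.DataFrame({
    'name': ['Alpha', 'Beta', 'Gamma'],
    'mass (g)': [21.0, 720.0, 107000.0],
})

print(type(toy))            # the table itself is a DataFrame
print(type(toy['name']))    # a single column is a Series
print(toy.columns)          # column labels are stored in an Index
print(toy.index)            # default row labels: RangeIndex(start=0, stop=3, step=1)
```

When no row labels are supplied, `pandas` falls back to a `RangeIndex` counting from 0.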

## DataFrame sources

`DataFrame`s can be created by reading a file, scraping the web, or making API requests.

### Reading from a file

In [None]:
import pandas as pd

meteor = pd.read_csv('../data/Meteorite_Landings.csv')

### API requests (more details later)

In [None]:
import requests

response = requests.get(
    'https://data.nasa.gov/resource/gh4g-9sfh.json',
    params={'$limit': 50000}  # Depending on the API, there may be a default limit of records one can obtain
)  

In [None]:
response  # a 200 status code indicates success

*Tip:* A list of HTTP response status codes is available at https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

In [None]:
response.ok  # checks ok flag

In [None]:
# Extract data if request is successful
if response.ok:
    payload = response.json()
else:
    print(f'Request failed with status code {response.status_code}.')
    payload = None

In [None]:
# Load into DataFrame
meteor_json = pd.DataFrame(payload)
meteor_json.head()

In [None]:
# Removing auto-computed columns
mask = meteor_json.columns.str.contains('@computed_region', regex=True)

columns_to_drop = meteor_json.columns[mask]

In [None]:
meteor_json = meteor_json.drop(columns=columns_to_drop)
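The mask-then-drop pattern used above can be tried offline. The sketch below uses a small invented table whose column names (`:@computed_region_a`, `:@computed_region_b`) are hypothetical stand-ins for the auto-computed columns an API might return. It passes `regex=False` because the pattern here is a literal substring.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Alpha', 'Beta'],
    'mass': ['21', '720'],
    ':@computed_region_a': [1, 2],   # hypothetical auto-generated column
    ':@computed_region_b': [3, 4],   # hypothetical auto-generated column
})

# One True/False per column label
mask = toy.columns.str.contains('@computed_region', regex=False)
print(mask)

# Index the column labels with the mask, then drop those columns
cleaned = toy.drop(columns=toy.columns[mask])
print(list(cleaned.columns))
```

Because `drop()` returns a new `DataFrame`, `toy` itself is left unchanged unless the result is assigned back.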

In [None]:
# store the downloaded data in a file (uncomment to run)
# meteor_json.to_csv('meteor.csv')

## Basic inspection

### What type of data are available in the dataframe? Are there missing data?

In [None]:
meteor.info()

### How much data are available?

In [None]:
meteor.shape

## Subsetting and indexing

Effectively extracting data from a full dataset requires fluency in how the `DataFrame` can be subsetted and how it is indexed.

### Selecting a column by attribute (if the name is a valid Python identifier)

In [None]:
meteor.recclass

### Selecting a column by key

In [None]:
meteor['mass (g)']

### Multiple columns by name

In [None]:
meteor[['name', 'mass (g)']]

### Selecting rows

In [None]:
meteor[5:10]  # end-exclusive

### Indexing with `.loc[]` and `.iloc[]`

- `.loc[]` selects by row and column labels; slices include the end label
- `.iloc[]` selects by integer position; slices exclude the end position

In [None]:
meteor.loc[0:4, 'name':'mass (g)']

In [None]:
meteor.iloc[0:4, 0:5]
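The difference between label-based and position-based slicing is easiest to see when the row labels are not 0, 1, 2, and so on. The following sketch uses an invented table (`toy`) whose index starts at 10.

```python
import pandas as pd

toy = pd.DataFrame(
    {'name': list('abcdef'), 'mass': [1, 2, 3, 4, 5, 6]},
    index=[10, 11, 12, 13, 14, 15],   # labels deliberately differ from positions
)

# Label-based: rows labeled 10, 11, 12 (the end label is included)
by_label = toy.loc[10:12, 'name':'mass']

# Position-based: rows at positions 0 and 1 (the end position is excluded)
by_position = toy.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```

On the meteorite table the default labels happen to equal the positions, which is why `meteor.loc[0:4]` returns five rows while `meteor.iloc[0:4]` returns four.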

### Filtering or subsetting by condition
Selection by condition can be performed by creating a Boolean *mask* with True/False values to specify which rows/columns to select.

In [None]:
# select heavy meteorites (mass > 1e7 g) that were found (fall == 'Found')
mask = (meteor['mass (g)'] > 1e7) & (meteor.fall == 'Found')
mask

In [None]:
meteor[mask]

**Note:** Each condition is surrounded by parentheses, and we use the bitwise operators (`&`, `|`, `~`) instead of the logical operators (`and`, `or`, `not`).

In [None]:
# negation of a mask
meteor[~mask]

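**Note**: Boolean masks can be used with `.loc[]` as well. With `.iloc[]`, pass the mask as a plain array (for example, `mask.to_numpy()`), because `.iloc[]` does not accept a Boolean `Series`.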
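A short sketch of masks combined with `.loc[]` and `.iloc[]`, again on invented data. It also shows what happens when `and` is used in place of `&`.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'mass': [5.0, 2e7, 3e7, 1.0],
    'fall': ['Found', 'Found', 'Fell', 'Fell'],
})

mask = (toy['mass'] > 1e7) & (toy['fall'] == 'Found')

# .loc[] takes the Boolean Series directly and can pick columns at the same time
heavy_found = toy.loc[mask, ['name', 'mass']]

# .iloc[] is position-based, so hand it a plain Boolean array instead of a Series
same_rows = toy.iloc[mask.to_numpy(), [0, 1]]

# Python's `and` asks for a single truth value, which a Series cannot provide
try:
    (toy['mass'] > 1e7) and (toy['fall'] == 'Found')
except ValueError as err:
    print('ValueError:', err)
```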

## Calculating summary statistics
This section discusses preliminary calculations before conducting further data analysis.

### How many of the meteorites were observed falling vs found?

In [None]:
meteor.fall.value_counts()

In [None]:
meteor.fall.value_counts(normalize=True)
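As a quick check of what `normalize=True` does, the sketch below counts an invented four-element `fall` column. The proportions are simply the counts divided by the total.

```python
import pandas as pd

# Invented values, not the real column
fall = pd.Series(['Found', 'Found', 'Found', 'Fell'], name='fall')

counts = fall.value_counts()                 # absolute frequencies
shares = fall.value_counts(normalize=True)   # relative frequencies, summing to 1

print(counts)
print(shares)
```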

### How is the mass of a meteorite distributed?

In [None]:
meteor['mass (g)'].mean()

In [None]:
meteor['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])

In [None]:
meteor['mass (g)'].max()

In [None]:
# sometimes helpful to locate the other information related to a particular entry
meteor.loc[meteor['mass (g)'].idxmax()]  # note the "index" of max()
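The `idxmax()` and `.loc[]` combination can be sketched on a small invented table. Note that `idxmax()` returns a row *label*, which is why it pairs with `.loc[]` rather than `.iloc[]`.

```python
import pandas as pd

toy = pd.DataFrame(
    {'name': ['a', 'b', 'c'], 'mass': [5.0, 60.0, 7.0]},
    index=[101, 102, 103],   # made-up row labels
)

label = toy['mass'].idxmax()   # row label of the largest mass, not its position
heaviest = toy.loc[label]      # the full row, returned as a Series

print(label)
print(heaviest)
```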

### How many unique classes are in this dataset?

In [None]:
meteor.recclass.nunique()

### General statistics
By default, the `.describe()` method summarizes only numeric columns. Here we force it to include all columns.

In [None]:
meteor.describe(include='all')


**Note**: `NaN` values signify missing data. For instance, the `fall` column contains strings, so it has no mean; likewise, `mass (g)` is numeric, so it has no entries for the categorical summary statistics (`unique`, `top`, `freq`).
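The pattern of `NaN` entries described in the note can be reproduced on a two-column invented table, one column of strings and one of numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    'fall': ['Found', 'Found', 'Fell'],
    'mass': [10.0, 20.0, 30.0],
})

summary = toy.describe(include='all')
print(summary)

# Statistics that do not apply to a column are reported as NaN
print(pd.isna(summary.loc['mean', 'fall']))     # True: no mean for strings
print(pd.isna(summary.loc['unique', 'mass']))   # True: no 'unique' row for numbers
print(summary.loc['mean', 'mass'])              # 20.0
```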

## Practice 1

The following command downloads NYC Yellow Taxi data as a `.parquet` file, a common storage format for moderate to large datasets.

In [None]:
!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

In [None]:
taxi = pd.read_XXXXXXX('yellow_tripdata_2024-01.parquet')

1. Examine the first five rows of the dataset.

2. How much data are included in this dataset?

3. Calculate summary statistics for the `fare_amount`, `tolls_amount`, and `tip_amount`.  Do they add up to the `total_amount`?

4. Find the longest trip by distance (`trip_distance`).

5. Compare the average `total_amount` for short versus long trips (short trip has `trip_distance` < 10).  Make sure we do not include zero-distance trips.

---

## Data cleaning
We will cover some common transformations that facilitate data analysis, including rearranging columns, type conversion, and sorting.

In [125]:
minitaxi = taxi.sample(10000)  # down-sample our dataset for illustration

In [126]:
minitaxi.shape

(10000, 19)

In [127]:
minitaxi.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
1463982,2,2024-01-17 18:24:02,2024-01-17 18:38:47,1.0,1.92,1.0,N,113,148,1,14.9,2.5,0.5,2.25,0.0,1.0,23.65,2.5,0.0
279981,2,2024-01-04 17:05:34,2024-01-04 17:15:20,5.0,1.45,1.0,N,140,162,2,10.7,2.5,0.5,0.0,0.0,1.0,17.2,2.5,0.0
1520803,2,2024-01-18 11:13:37,2024-01-18 11:20:40,1.0,1.37,1.0,N,239,236,1,9.3,0.0,0.5,2.66,0.0,1.0,15.96,2.5,0.0
1661790,1,2024-01-19 18:34:09,2024-01-19 18:45:45,1.0,2.9,1.0,N,170,4,1,14.9,5.0,0.5,4.25,0.0,1.0,25.65,2.5,0.0
359292,2,2024-01-05 13:08:13,2024-01-05 13:11:19,1.0,0.57,1.0,N,142,239,2,5.1,0.0,0.5,0.0,0.0,1.0,9.1,2.5,0.0


### Dropping columns

In [130]:
# drop all id columns and the store_and_fwd_flag column
mask = minitaxi.columns.str.contains('ID$|store_and_fwd_flag', regex=True)
columns_to_drop = minitaxi.columns[mask]
columns_to_drop

Index(['VendorID', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID',
       'DOLocationID'],
      dtype='object')

In [131]:
minitaxi = minitaxi.drop(columns=columns_to_drop)
minitaxi.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
1463982,2024-01-17 18:24:02,2024-01-17 18:38:47,1.0,1.92,1,14.9,2.5,0.5,2.25,0.0,1.0,23.65,2.5,0.0
279981,2024-01-04 17:05:34,2024-01-04 17:15:20,5.0,1.45,2,10.7,2.5,0.5,0.0,0.0,1.0,17.2,2.5,0.0
1520803,2024-01-18 11:13:37,2024-01-18 11:20:40,1.0,1.37,1,9.3,0.0,0.5,2.66,0.0,1.0,15.96,2.5,0.0
1661790,2024-01-19 18:34:09,2024-01-19 18:45:45,1.0,2.9,1,14.9,5.0,0.5,4.25,0.0,1.0,25.65,2.5,0.0
359292,2024-01-05 13:08:13,2024-01-05 13:11:19,1.0,0.57,2,5.1,0.0,0.5,0.0,0.0,1.0,9.1,2.5,0.0


### Renaming columns

In [132]:
minitaxi = minitaxi.rename(
    columns={
        'tpep_pickup_datetime': 'pickup', 
        'tpep_dropoff_datetime': 'dropoff'
    }
)
minitaxi.columns

Index(['pickup', 'dropoff', 'passenger_count', 'trip_distance', 'payment_type',
       'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'congestion_surcharge',
       'Airport_fee'],
      dtype='object')

## Examining and correcting data types (type conversion)

In [135]:
minitaxi.dtypes

pickup                   datetime64[us]
dropoff                  datetime64[us]
passenger_count                 float64
trip_distance                   float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
Airport_fee                     float64
dtype: object

In [138]:
minitaxi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 1463982 to 871002
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   pickup                 10000 non-null  datetime64[us]
 1   dropoff                10000 non-null  datetime64[us]
 2   passenger_count        9562 non-null   float64       
 3   trip_distance          10000 non-null  float64       
 4   payment_type           10000 non-null  int64         
 5   fare_amount            10000 non-null  float64       
 6   extra                  10000 non-null  float64       
 7   mta_tax                10000 non-null  float64       
 8   tip_amount             10000 non-null  float64       
 9   tolls_amount           10000 non-null  float64       
 10  improvement_surcharge  10000 non-null  float64       
 11  total_amount           10000 non-null  float64       
 12  congestion_surcharge   9562 non-null   float64       
 13 

In [143]:
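# cast passenger_count to integers
# (NaN cannot be stored as int, so drop rows with a missing count first)
minitaxi = minitaxi.dropna(subset=['passenger_count'])
minitaxi['passenger_count'] = minitaxi['passenger_count'].astype(int)
minitaxi.dtypes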

pickup                   datetime64[us]
dropoff                  datetime64[us]
passenger_count                   int32
trip_distance                   float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
Airport_fee                     float64
dtype: object
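The `info()` output above shows missing values in `passenger_count`, and a plain integer cast does not tolerate them. The sketch below, on an invented three-element series, shows the failure and two possible ways around it: dropping the missing rows first, or using the nullable `Int64` dtype.

```python
import pandas as pd

counts = pd.Series([1.0, 2.0, None], name='passenger_count')

try:
    counts.astype(int)
except ValueError as err:               # NaN has no integer representation
    print('cast failed:', err)

dropped = counts.dropna().astype(int)   # option 1: drop missing rows, then cast
nullable = counts.astype('Int64')       # option 2: nullable integers keep <NA>

print(dropped)
print(nullable)
```

Which option fits depends on whether rows with a missing count should stay in the analysis.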

### Dropping rows with NA values

In [140]:
minitaxi = minitaxi.dropna()

In [144]:
minitaxi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9562 entries, 1463982 to 871002
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   pickup                 9562 non-null   datetime64[us]
 1   dropoff                9562 non-null   datetime64[us]
 2   passenger_count        9562 non-null   int32         
 3   trip_distance          9562 non-null   float64       
 4   payment_type           9562 non-null   int64         
 5   fare_amount            9562 non-null   float64       
 6   extra                  9562 non-null   float64       
 7   mta_tax                9562 non-null   float64       
 8   tip_amount             9562 non-null   float64       
 9   tolls_amount           9562 non-null   float64       
 10  improvement_surcharge  9562 non-null   float64       
 11  total_amount           9562 non-null   float64       
 12  congestion_surcharge   9562 non-null   float64       
 13  

## Creating new columns

Let's calculate the following for each row:

1. elapsed time of the trip
2. the tip percentage
3. the total taxes, tolls, fees, and surcharges
4. the average speed of the taxi

In [146]:
minitaxi = minitaxi.assign(
    elapsed_time=lambda x: x.dropoff - x.pickup,
    cost_before_tip=lambda x: x.total_amount - x.tip_amount,
    tip_pct=lambda x: x.tip_amount / x.cost_before_tip, 
    fees=lambda x: x.cost_before_tip - x.fare_amount, 
    avg_speed=lambda x: x.trip_distance.div(
        x.elapsed_time.dt.total_seconds() / 60 / 60
    )
)

In [147]:
minitaxi.head(2)

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
1463982,2024-01-17 18:24:02,2024-01-17 18:38:47,1,1.92,1,14.9,2.5,0.5,2.25,0.0,1.0,23.65,2.5,0.0,0 days 00:14:45,21.4,0.10514,6.5,7.810169
279981,2024-01-04 17:05:34,2024-01-04 17:15:20,5,1.45,2,10.7,2.5,0.5,0.0,0.0,1.0,17.2,2.5,0.0,0 days 00:09:46,17.2,0.0,6.5,8.90785


*Notes*:

- We used `lambda` functions to 1) avoid typing `minitaxi` repeatedly and 2) access the `cost_before_tip` and `elapsed_time` columns in the same `assign()` call that creates them.
- To create a single new column, we can also use `df['new_col'] = <values>`.
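The way later arguments of `assign()` can build on earlier ones is sketched below with two invented trips. The intermediate `hours` column is a hypothetical helper that does not appear in the taxi example. The second trip has zero duration and zero distance, which yields `NaN` for the speed, much like the blank `avg_speed` entries that appear for zero-length trips in the outputs further down.

```python
import pandas as pd

trips = pd.DataFrame({
    'pickup': pd.to_datetime(['2024-01-01 10:00:00', '2024-01-01 11:00:00']),
    'dropoff': pd.to_datetime(['2024-01-01 10:30:00', '2024-01-01 11:00:00']),
    'trip_distance': [5.0, 0.0],
})

trips = trips.assign(
    elapsed_time=lambda x: x.dropoff - x.pickup,
    # uses elapsed_time, created one line above in the same call
    hours=lambda x: x.elapsed_time.dt.total_seconds() / 3600,
    avg_speed=lambda x: x.trip_distance / x.hours,
)

print(trips[['elapsed_time', 'hours', 'avg_speed']])
```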

## Sorting by values

In [149]:
# sort by descending passenger count and pickups from earliest to latest
minitaxi.sort_values(['passenger_count', 'pickup'], ascending=[False, True]).head()

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
42387,2024-01-01 14:08:46,2024-01-01 14:34:09,6,11.1,1,46.4,5.0,0.5,10.93,0.0,1.0,65.58,0.0,1.75,0 days 00:25:23,54.65,0.2,8.25,26.237689
52534,2024-01-01 17:10:57,2024-01-01 17:24:38,6,2.65,1,15.6,0.0,0.5,2.0,0.0,1.0,21.6,2.5,0.0,0 days 00:13:41,19.6,0.102041,4.0,11.619976
65419,2024-01-01 21:42:16,2024-01-01 21:55:34,6,1.41,1,13.5,1.0,0.5,3.7,0.0,1.0,22.2,2.5,0.0,0 days 00:13:18,18.5,0.2,5.0,6.360902
73092,2024-01-02 06:45:41,2024-01-02 06:54:41,6,1.23,1,10.0,0.0,0.5,3.5,0.0,1.0,17.5,2.5,0.0,0 days 00:09:00,14.0,0.25,4.0,8.2
191261,2024-01-03 16:17:08,2024-01-03 16:44:49,6,4.0,2,26.1,2.5,0.5,0.0,0.0,1.0,32.6,2.5,0.0,0 days 00:27:41,32.6,0.0,6.5,8.669476


In [151]:
# pick out the 3 trips with largest timespan
minitaxi.nlargest(3, 'elapsed_time')

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
2283617,2024-01-26 12:54:56,2024-01-27 12:49:06,1,1.75,2,21.2,0.0,0.5,0.0,0.0,1.0,25.2,2.5,0.0,0 days 23:54:10,25.2,0.0,4.0,0.073213
2500524,2024-01-28 13:44:38,2024-01-29 13:31:56,5,0.52,2,7.2,0.0,0.5,0.0,0.0,1.0,11.2,2.5,0.0,0 days 23:47:18,11.2,0.0,4.0,0.021859
1284138,2024-01-15 18:00:08,2024-01-16 17:45:02,1,3.1,1,19.8,0.0,0.5,2.38,0.0,1.0,26.18,2.5,0.0,0 days 23:44:54,23.8,0.1,4.0,0.130535
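`nlargest()` can be read as a shortcut for sorting in descending order and taking the first rows. The sketch below checks this on an invented table, using a numeric `elapsed_min` column (a hypothetical name) in place of the timedelta column.

```python
import pandas as pd

toy = pd.DataFrame({
    'trip': ['a', 'b', 'c', 'd'],
    'elapsed_min': [10, 120, 45, 60],
})

top2 = toy.nlargest(2, 'elapsed_min')
via_sort = toy.sort_values('elapsed_min', ascending=False).head(2)

print(top2)
print(top2.equals(via_sort))
```

With no ties in the column, the two approaches return the same rows in the same order.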


## Working with index
Currently the index holds the row numbers carried over from the original `taxi` table. If we plan to work extensively with pickup times, indexing by that datetime column may be more convenient.

In [154]:
minitaxi.index

Index([1463982,  279981, 1520803, 1661790,  359292, 2439708,  109810,   60123,
        759802,  623340,
       ...
       1200054,  467157, 1864316,  100643, 2470600, 2081689, 1505987, 1923043,
        371985,  871002],
      dtype='int64', length=9562)

### Setting index

In [155]:
minitaxi = minitaxi.set_index('pickup')
minitaxi.head(3)

pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
2024-01-17 18:24:02,2024-01-17 18:38:47,1,1.92,1,14.9,2.5,0.5,2.25,0.0,1.0,23.65,2.5,0.0,0 days 00:14:45,21.4,0.10514,6.5,7.810169
2024-01-04 17:05:34,2024-01-04 17:15:20,5,1.45,2,10.7,2.5,0.5,0.0,0.0,1.0,17.2,2.5,0.0,0 days 00:09:46,17.2,0.0,6.5,8.90785
2024-01-18 11:13:37,2024-01-18 11:20:40,1,1.37,1,9.3,0.0,0.5,2.66,0.0,1.0,15.96,2.5,0.0,0 days 00:07:03,13.3,0.2,4.0,11.659574


### Sorting by index

In [157]:
# sorting by indices
minitaxi = minitaxi.sort_index()
minitaxi.head()

pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
2024-01-01 00:05:48,2024-01-01 00:14:19,1,1.71,1,11.4,1.0,0.5,4.1,0.0,1.0,20.5,2.5,0.0,0 days 00:08:31,16.4,0.25,5.0,12.046967
2024-01-01 00:08:20,2024-01-01 00:45:43,1,2.56,1,31.0,1.0,0.5,7.2,0.0,1.0,43.2,2.5,0.0,0 days 00:37:23,36.0,0.2,5.0,4.108783
2024-01-01 00:09:47,2024-01-01 00:15:40,1,1.02,2,7.9,1.0,0.5,0.0,0.0,1.0,12.9,2.5,0.0,0 days 00:05:53,12.9,0.0,5.0,10.402266
2024-01-01 00:14:19,2024-01-01 00:30:00,1,2.28,1,15.6,1.0,0.5,0.0,0.0,1.0,20.6,2.5,0.0,0 days 00:15:41,20.6,0.0,5.0,8.722635
2024-01-01 00:14:29,2024-01-01 00:14:29,1,0.0,2,3.0,3.5,0.5,0.0,0.0,1.0,8.0,2.5,0.0,0 days 00:00:00,8.0,0.0,5.0,


Recall how we used `.loc[0:4]` to select the rows labeled 0 through 4. Now that the index holds datetimes, we can select time ranges in the same way.

### Selecting by index

In [159]:
# selecting the taxi rides in the first 6 hours of the new year
minitaxi['2024-01-01 00:00':'2024-01-01 06:00']

pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
2024-01-01 00:05:48,2024-01-01 00:14:19,1,1.71,1,11.4,1.0,0.5,4.10,0.0,1.0,20.50,2.5,0.0,0 days 00:08:31,16.4,0.25,5.0,12.046967
2024-01-01 00:08:20,2024-01-01 00:45:43,1,2.56,1,31.0,1.0,0.5,7.20,0.0,1.0,43.20,2.5,0.0,0 days 00:37:23,36.0,0.20,5.0,4.108783
2024-01-01 00:09:47,2024-01-01 00:15:40,1,1.02,2,7.9,1.0,0.5,0.00,0.0,1.0,12.90,2.5,0.0,0 days 00:05:53,12.9,0.00,5.0,10.402266
2024-01-01 00:14:19,2024-01-01 00:30:00,1,2.28,1,15.6,1.0,0.5,0.00,0.0,1.0,20.60,2.5,0.0,0 days 00:15:41,20.6,0.00,5.0,8.722635
2024-01-01 00:14:29,2024-01-01 00:14:29,1,0.00,2,3.0,3.5,0.5,0.00,0.0,1.0,8.00,2.5,0.0,0 days 00:00:00,8.0,0.00,5.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-01-01 04:49:11,2024-01-01 05:08:49,1,5.30,1,24.7,1.0,0.5,5.94,0.0,1.0,35.64,2.5,0.0,0 days 00:19:38,29.7,0.20,5.0,16.196944
2024-01-01 05:02:37,2024-01-01 05:12:34,1,2.90,1,14.9,1.0,0.5,3.98,0.0,1.0,23.88,2.5,0.0,0 days 00:09:57,19.9,0.20,5.0,17.487437
2024-01-01 05:05:25,2024-01-01 05:28:31,1,4.03,4,27.5,1.0,0.5,0.00,0.0,1.0,32.50,2.5,0.0,0 days 00:23:06,32.5,0.00,5.0,10.467532
2024-01-01 05:42:22,2024-01-01 05:55:44,1,3.70,1,17.0,3.5,0.5,4.40,0.0,1.0,26.40,2.5,0.0,0 days 00:13:22,22.0,0.20,5.0,16.608479
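Slicing a datetime index with strings is label-based, so both endpoints are included, and it is safest on a sorted index. A minimal sketch with four invented rides:

```python
import pandas as pd

rides = pd.DataFrame(
    {'fare_amount': [11.4, 31.0, 7.9, 15.6]},
    index=pd.to_datetime([
        '2024-01-01 09:00:00',
        '2024-01-01 00:30:00',
        '2024-01-01 06:00:00',
        '2024-01-01 03:00:00',
    ]),
).sort_index()   # sort first so the time slice is well defined

# Rides picked up between midnight and 06:00, endpoints included
early = rides.loc['2024-01-01 00:00':'2024-01-01 06:00']
print(early)
```

The ride at exactly 06:00:00 is kept, while the one at 09:00:00 is not.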


### Resetting index
We can move the index back into a regular column and return to default row numbers. **Note** that after setting and then resetting the index, the original row numbers are lost.

In [161]:
minitaxi = minitaxi.reset_index()
minitaxi.head()

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
0,2024-01-01 00:05:48,2024-01-01 00:14:19,1,1.71,1,11.4,1.0,0.5,4.1,0.0,1.0,20.5,2.5,0.0,0 days 00:08:31,16.4,0.25,5.0,12.046967
1,2024-01-01 00:08:20,2024-01-01 00:45:43,1,2.56,1,31.0,1.0,0.5,7.2,0.0,1.0,43.2,2.5,0.0,0 days 00:37:23,36.0,0.2,5.0,4.108783
2,2024-01-01 00:09:47,2024-01-01 00:15:40,1,1.02,2,7.9,1.0,0.5,0.0,0.0,1.0,12.9,2.5,0.0,0 days 00:05:53,12.9,0.0,5.0,10.402266
3,2024-01-01 00:14:19,2024-01-01 00:30:00,1,2.28,1,15.6,1.0,0.5,0.0,0.0,1.0,20.6,2.5,0.0,0 days 00:15:41,20.6,0.0,5.0,8.722635
4,2024-01-01 00:14:29,2024-01-01 00:14:29,1,0.0,2,3.0,3.5,0.5,0.0,0.0,1.0,8.0,2.5,0.0,0 days 00:00:00,8.0,0.0,5.0,
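If the original row numbers matter, one possible approach is to copy them into a regular column before replacing the index. The sketch below uses an invented two-row table, and the column name `row_id` is a hypothetical choice.

```python
import pandas as pd

toy = pd.DataFrame(
    {'pickup': pd.to_datetime(['2024-01-02', '2024-01-01']), 'fare': [9.0, 12.5]},
    index=[1463982, 279981],   # made-up original row labels
)

# Setting and then resetting the index discards the original labels
round_trip = toy.set_index('pickup').reset_index()
print(list(round_trip.index))     # [0, 1]

# Naming the index and resetting it first keeps the labels as a column
kept = (
    toy.rename_axis('row_id')     # give the current index a name ...
       .reset_index()             # ... so it survives as a regular column
       .set_index('pickup')
)
print(list(kept['row_id']))       # [1463982, 279981]
```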


## Practice 2

Using the `meteor` dataset, 

1. cast the `year` column to an integer column.
2. create a new column indicating whether the meteorite was observed falling before 1970.
3. set the index to the id column and extract all the rows with IDs between 10,036 and 10,040 (inclusive) with `loc[]`.
4. examine the `year` column to see if there are any data errors.