# Your Info
__Name__:

__PDX Email__:

__Collaborators__:

# Pandas DataFrame Workout

This workout will cover creating, accessing, manipulating, and analyzing data within a Pandas DataFrame.

## Overview of the Dataset

For this workout, we will be examining a dataset from a CSV file.

This dataset contains information about individual taxi trips in New York City, sourced from [NYC Open Data](https://opendata.cityofnewyork.us/data/). 

Each row represents one ride and includes the following details:

__Key Columns:__

* `VendorID`: Identifies the taxi company operating the trip.	
* `tpep_pickup_datetime` & `tpep_dropoff_datetime`: The start and end times of the trip.
* `passenger_count`: Indicates the number of passengers. 	
* `trip_distance`: The elapsed trip distance in miles reported by the taximeter. 	
* `RatecodeID`: The final rate code in effect at the end of the trip.
    * `1` = Standard rate
    * `2` = JFK
    * `3` = Newark
    * `4` = Nassau or Westchester
    * `5` = Negotiated fare
    * `6` = Group ride
    * `99` = Null/unknown 
* `store_and_fwd_flag`: Indicates whether the trip record was held in vehicle memory before being sent to the vendor (“store and forward”) because the vehicle did not have a connection to the server.
    * `Y` = Store and forward trip
    * `N` = Not a store and forward trip
* `PULocationID` & `DOLocationID`: The TLC Taxi Zones in which the taximeter was engaged and disengaged, respectively.
* `payment_type`: A numeric code signifying how the passenger paid for the trip.
    * `0` = Flex Fare trip
    * `1` = Credit card
    * `2` = Cash
    * `3` = No charge
    * `4` = Dispute
    * `5` = Unknown
    * `6` = Voided trip 
* `fare_amount`: The time-and-distance fare calculated by the meter.
* `extra`, `mta_tax`, `tip_amount`, `tolls_amount`, `improvement_surcharge`, `congestion_surcharge`: These columns provide details on additional charges and payments.	
* `total_amount`: The total amount charged to passengers. Does not include cash tips.	

## Step 0 - Importing the Tools

* Import the `pandas` library using the `pd` alias

In [1]:
### Begin Solution
import pandas as pd

### End Solution

## Step 1 - Load the Dataset

__The Setup__

Your task is to create a Pandas DataFrame, named `taxi_df`, by loading data from the `CSV` file at the following path:

```bash
../data/nyc_taxi_2020-07.csv
```

In [2]:
### Begin Solution
FILE = "../data/nyc_taxi_2020-07.csv"
taxi_df = pd.read_csv(FILE, low_memory=False)

### End Solution
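
By default, `read_csv` loads timestamp columns as plain strings. The following is a minimal sketch of how the `parse_dates` parameter converts them while loading. It assumes a small inline sample in place of the real file; the `sample_csv` text and the `sample_df` name are illustrative and not part of the workout.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics a few columns of the real file.
sample_csv = io.StringIO(
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,total_amount\n"
    "1,2020-07-01 00:25:32,2020-07-01 00:33:39,9.3\n"
    "2,2020-07-01 00:15:11,2020-07-01 00:29:24,22.3\n"
)

# parse_dates converts the listed columns to datetime64 while loading.
sample_df = pd.read_csv(
    sample_csv,
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

print(sample_df.dtypes)
```

The same `parse_dates` argument could be passed alongside `low_memory=False` when reading the full file, at the cost of a somewhat slower load.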

## Step 2 - Skimming the Data

Display the first few rows of the DataFrame to see the column names and some sample data.

In [3]:
### Begin Solution
taxi_df.head()

### End Solution

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-07-01 00:25:32 | 2020-07-01 00:33:39 | 1.0 | 1.5 | 1.0 | N | 238 | 75 | 2.0 | 8.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 9.3 | 0.0 |
| 1 | 1.0 | 2020-07-01 00:03:19 | 2020-07-01 00:25:43 | 1.0 | 9.5 | 1.0 | N | 138 | 216 | 1.0 | 26.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 27.8 | 0.0 |
| 2 | 2.0 | 2020-07-01 00:15:11 | 2020-07-01 00:29:24 | 1.0 | 5.85 | 1.0 | N | 230 | 88 | 2.0 | 18.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 22.3 | 2.5 |
| 3 | 2.0 | 2020-07-01 00:30:49 | 2020-07-01 00:38:26 | 1.0 | 1.9 | 1.0 | N | 88 | 232 | 1.0 | 8.0 | 0.5 | 0.5 | 2.36 | 0.0 | 0.3 | 14.16 | 2.5 |
| 4 | 2.0 | 2020-07-01 00:31:26 | 2020-07-01 00:38:02 | 1.0 | 1.25 | 1.0 | N | 37 | 17 | 2.0 | 6.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 7.8 | 0.0 |


## Step 3 - Getting a Summary Overview

Obtain a concise summary of the DataFrame, including the number of rows and columns, data types, and non-null values.

In [4]:
### Begin Solution
taxi_df.info()

### End Solution

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800412 entries, 0 to 800411
Data columns (total 18 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   VendorID               737565 non-null  float64
 1   tpep_pickup_datetime   800412 non-null  object 
 2   tpep_dropoff_datetime  800412 non-null  object 
 3   passenger_count        737565 non-null  float64
 4   trip_distance          800412 non-null  float64
 5   RatecodeID             737565 non-null  float64
 6   store_and_fwd_flag     737565 non-null  object 
 7   PULocationID           800412 non-null  int64  
 8   DOLocationID           800412 non-null  int64  
 9   payment_type           737565 non-null  float64
 10  fare_amount            800412 non-null  float64
 11  extra                  800412 non-null  float64
 12  mta_tax                800412 non-null  float64
 13  tip_amount             800412 non-null  float64
 14  tolls_amount           800412 non-nu
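
The non-null counts in the summary imply that several columns contain missing values. As a sketch on a hypothetical toy frame (`toy_df` and its values are illustrative only), `isna().sum()` counts the missing entries per column directly:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries, standing in for the taxi data.
toy_df = pd.DataFrame(
    {
        "VendorID": [1.0, 2.0, np.nan, 2.0],
        "passenger_count": [1.0, np.nan, np.nan, 3.0],
        "trip_distance": [1.5, 9.5, 5.85, 1.9],
    }
)

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

Applied to the workout data as `taxi_df.isna().sum()`, the counts in the summary suggest that `VendorID` would show 800412 - 737565 = 62847 missing entries.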

## Step 4 - Descriptive Statistics for Numerical Columns

Calculate and display the descriptive statistics for the numerical columns in the dataset. This will give you insights into the central tendency, dispersion, and range of values.

In [5]:
### Begin Solution
taxi_df.describe()

### End Solution

| | VendorID | passenger_count | trip_distance | RatecodeID | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 737565.0 | 737565.0 | 800412.0 | 737565.0 | 800412.0 | 800412.0 | 737565.0 | 800412.0 | 800412.0 | 800412.0 | 800412.0 | 800412.0 | 800412.0 | 800412.0 | 800412.0 |
| mean | 1.622879 | 1.378401 | 4.304165 | 1.046801 | 160.061218 | 156.104931 | 1.352521 | 13.4384 | 1.003705 | 0.492124 | 1.789151 | 0.316874 | 0.296943 | 18.63146 | 2.019911 |
| std | 0.484666 | 1.03979 | 473.708961 | 1.203844 | 68.5634 | 72.990234 | 0.523148 | 13.675661 | 1.240157 | 0.078519 | 2.643472 | 1.533511 | 0.041673 | 15.060771 | 1.006693 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | -391.5 | -4.5 | -0.5 | -25.83 | -18.36 | -0.3 | -397.6 | -2.5 |
| 25% | 1.0 | 1.0 | 1.0 | 1.0 | 107.0 | 90.0 | 1.0 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 10.8 | 2.5 |
| 50% | 2.0 | 1.0 | 1.79 | 1.0 | 161.0 | 161.0 | 1.0 | 9.0 | 0.5 | 0.5 | 1.66 | 0.0 | 0.3 | 14.16 | 2.5 |
| 75% | 2.0 | 1.0 | 3.4 | 1.0 | 234.0 | 233.0 | 2.0 | 15.0 | 2.5 | 0.5 | 2.75 | 0.0 | 0.3 | 20.55 | 2.5 |
| max | 2.0 | 9.0 | 256069.13 | 99.0 | 265.0 | 265.0 | 4.0 | 1995.0 | 90.06 | 3.3 | 1001.0 | 126.12 | 0.3 | 1995.0 | 2.5 |


## Step 5 - Examining Specific Columns

Focus on the `passenger_count` and `total_amount` columns.

* Find the minimum, maximum, and average values for each of these columns.

In [6]:
### Begin Solution

passenger_count = [taxi_df["passenger_count"].min(),
                   taxi_df["passenger_count"].max(),
                   taxi_df["passenger_count"].mean()]

total_amount = [taxi_df["total_amount"].min(),
                taxi_df["total_amount"].max(),
                taxi_df["total_amount"].mean()]

passenger_count, total_amount
### End Solution

([np.float64(0.0), np.float64(9.0), np.float64(1.378400547748334)],
 [np.float64(-397.6), np.float64(1995.0), np.float64(18.631459910646026)])
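
The same three statistics can also be requested for both columns in one call. This is a sketch of that alternative using `DataFrame.agg` on a hypothetical toy frame (`toy_df` and its values are made up for illustration), not a replacement for the solution above:

```python
import pandas as pd

# Hypothetical values standing in for the two taxi columns.
toy_df = pd.DataFrame(
    {
        "passenger_count": [1.0, 0.0, 2.0, 5.0],
        "total_amount": [9.3, 27.8, -3.5, 14.4],
    }
)

# agg with a list of function names returns one row per statistic
# and one column per selected column.
summary = toy_df[["passenger_count", "total_amount"]].agg(["min", "max", "mean"])
print(summary)
```

The result is a small DataFrame, so a single value can be read with `summary.loc["max", "total_amount"]`.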

## Step 6 - Exploring Payment Types

Investigate the `payment_type` column to see the different categories of payment and how many times each occurs.

* Determine the frequency of each payment type.
* Calculate the proportion of each payment type relative to the total number of rides.

In [7]:
### Begin Solution
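# Count how many rides used each payment type (missing values are excluded).
payment_types = taxi_df["payment_type"].value_counts()

# Divide by the total number of rides, including rides with no recorded payment type.
total_rides = len(taxi_df["payment_type"])
proportions = payment_types / total_rides
proportions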
### End Solution

payment_type
1.0    0.613435
2.0    0.295389
3.0    0.008519
4.0    0.004138
Name: count, dtype: float64
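
The proportions above sum to about 0.92 rather than 1 because rides with a missing `payment_type` are part of the denominator but are not counted by `value_counts()`. The sketch below, on a hypothetical four-ride Series (`payments` is illustrative), contrasts this with the `normalize` and `dropna` parameters of `value_counts`:

```python
import numpy as np
import pandas as pd

# Hypothetical payment codes for four rides, one of them missing.
payments = pd.Series([1.0, 1.0, 2.0, np.nan], name="payment_type")

# Denominator includes the missing entry, as in the solution above.
share_of_all_rides = payments.value_counts() / len(payments)

# normalize=True divides by the number of non-missing entries only.
share_of_known = payments.value_counts(normalize=True)

# dropna=False keeps NaN as its own category, so the shares sum to 1.
share_with_missing = payments.value_counts(normalize=True, dropna=False)

print(share_of_all_rides, share_of_known, share_with_missing, sep="\n\n")
```

Which denominator is appropriate depends on the question: share of all rides, or share of rides with a recorded payment type.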

## Step 7 - Identifying Potentially Unusual Rides 

Use boolean indexing to filter the DataFrame and count the number of rides that meet each of the following criteria for being unusual:
* more than 6 passengers
* 0 passengers
* total amount of \$1000 or more
* total amount of \$0 or less

__Explanation__

Boolean Indexing in Pandas

Boolean indexing lets you select rows in a DataFrame based on whether a condition is `True` or `False`.

Syntax:

```python
df[df['column_name'] condition value]
```

* `df['column_name']`: Selects the column to check.
* `condition`: A comparison operator (`>`, `<`, `>=`, `<=`, `==`, `!=`).
* `value`: The value to compare against.

The inner comparison produces a boolean Series (a mask). Indexing the DataFrame with that mask keeps only the rows where it is `True`.
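
As a concrete illustration of the syntax, here is a small sketch on a hypothetical four-row frame (`rides` and its values are made up for this example). It also shows how two conditions can be combined:

```python
import pandas as pd

# Hypothetical rides used only to demonstrate the syntax.
rides = pd.DataFrame(
    {
        "passenger_count": [1, 0, 7, 2],
        "total_amount": [9.3, 12.0, 55.0, -3.5],
    }
)

# Single condition: the comparison yields a boolean Series (the mask).
many_passengers = rides["passenger_count"] > 6
print(rides[many_passengers])

# Combine conditions with & (and) or | (or); wrap each one in parentheses.
odd_count = (rides["passenger_count"] > 6) | (rides["passenger_count"] == 0)
print(rides[odd_count])

# Summing a mask counts its True values.
print(odd_count.sum())
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.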

In [8]:
### Begin Solution
mask = taxi_df["passenger_count"] > 6

print(len(taxi_df[mask]))

mask = taxi_df["passenger_count"] == 0

print(taxi_df[mask].shape[0])
### End Solution

8
19506
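
The two `total_amount` criteria can be counted in the same way. The following is a sketch only: the helper `count_unusual_amounts` and the `sample` frame are hypothetical names introduced here, and the sample values are not results from the taxi data.

```python
import pandas as pd


def count_unusual_amounts(df):
    """Count rides whose total_amount is $1000 or more, or $0 or less."""
    high = df["total_amount"] >= 1000
    low = df["total_amount"] <= 0
    # Summing a boolean mask counts the rows where the condition is True.
    return {
        "amount_1000_or_more": int(high.sum()),
        "amount_0_or_less": int(low.sum()),
    }


# Hypothetical sample; replace with the real DataFrame to get actual counts.
sample = pd.DataFrame({"total_amount": [9.3, 1995.0, 0.0, -397.6, 1000.0]})
counts = count_unusual_amounts(sample)
print(counts)
```

Calling `count_unusual_amounts(taxi_df)` would apply the same masks to the full dataset.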


## Useful References

__Core DataFrame Operations__:

* `pd.read_csv()`: Reads data from a CSV file into a DataFrame. 
* `df['column']`: Accesses a single column as a Series.
* `df.head(n)`: Displays the first n rows of the DataFrame, useful for initial inspection.
* `df.dtypes`: Shows the data type of each column in the DataFrame.
* `df.info()`: Provides a concise summary of the DataFrame, including data types and non-null values.
* `df.columns`: Returns the column names of the DataFrame.
* `df.index`: Returns the index of the DataFrame (row labels).
* `df.loc[row_label, column_label]`: Accesses data by label(s). Useful for selecting specific rows or columns by their names.
* `len(df)`: Returns the number of rows in the DataFrame. Useful for finding the total number of records or the number of rows after filtering.


__Adding and Modifying Data__:
* `df['new_column'] = value`: Adds a new column to the DataFrame with a specified value (scalar or a Series).

__Grouping and Aggregation__:
* `.count()`: Calculates the number of non-missing values within each group or Series.
* `.sum()`: Calculates the sum of values within each group or Series.
* `.mean()`: Calculates the average of values within each group or Series.
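
These aggregation methods are commonly paired with `groupby`. A minimal sketch, assuming a hypothetical toy frame (`toy_df` and its values are illustrative), that computes all three per payment type:

```python
import pandas as pd

# Hypothetical tips for five rides across two payment types.
toy_df = pd.DataFrame(
    {
        "payment_type": [1.0, 1.0, 2.0, 2.0, 2.0],
        "tip_amount": [2.0, 4.0, 0.0, 0.0, 0.0],
    }
)

# Group rows by payment type, then aggregate the tip column three ways.
by_payment = toy_df.groupby("payment_type")["tip_amount"].agg(["count", "sum", "mean"])
print(by_payment)
```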


__Value Counts and Proportions__:

* `df['column'].value_counts()`: Returns a Series containing the counts of unique values in a column. Useful for analyzing the distribution of passenger counts and payment types.

For more in-depth information on the topics we've discussed and many others, you can visit the official Pandas documentation here:

[https://pandas.pydata.org/docs/](https://pandas.pydata.org/docs/)