# Your Info
__Name__:

__PDX Email__:

__Collaborators__:

# Pandas DataFrame Workout

This workout will cover creating, accessing, manipulating, and analyzing data within a Pandas DataFrame.

## Overview of the Dataset

For this workout, we will be examining a dataset from a CSV file.

This dataset contains information about individual taxi trips in New York City, sourced from [NYC Open Data](https://opendata.cityofnewyork.us/data/). 

Each row represents one ride and includes the following details:

__Key Columns:__

* `VendorID`: Identifies the taxi company operating the trip.
* `tpep_pickup_datetime` & `tpep_dropoff_datetime`: The start and end times of the trip.
* `passenger_count`: Indicates the number of passengers.
* `trip_distance`: The elapsed trip distance in miles reported by the taximeter.
* `RatecodeID`: The final rate code in effect at the end of the trip.
    * `1` = Standard rate
    * `2` = JFK
    * `3` = Newark
    * `4` = Nassau or Westchester
    * `5` = Negotiated fare
    * `6` = Group ride
    * `99` = Null/unknown 
* `store_and_fwd_flag`: Indicates whether the trip record was held in vehicle memory before being sent to the vendor ("store and forward") because the vehicle did not have a connection to the server.
    * `Y` = Store and forward trip
    * `N` = Not a store and forward trip
* `PULocationID` & `DOLocationID`: TLC Taxi Zone in which the taximeter was engaged and disengaged respectively. 
* `payment_type`: A numeric code signifying how the passenger paid for the trip.
    * `0` = Flex Fare trip
    * `1` = Credit card
    * `2` = Cash
    * `3` = No charge
    * `4` = Dispute
    * `5` = Unknown
    * `6` = Voided trip 
* `fare_amount`: The time-and-distance fare calculated by the meter.
* `extra`, `mta_tax`, `tip_amount`, `tolls_amount`, `improvement_surcharge`, `congestion_surcharge`: These columns provide details on additional charges and payments.	
* `total_amount`: The total amount charged to passengers. Does not include cash tips.	

## Step 0 - Importing the Tools

* Import the `pandas` library using the `pd` alias

In [None]:
### Begin Solution


### End Solution

## Step 1 - Load the Dataset

__The Setup__

Your task is to create a Pandas DataFrame, named `df`, by loading data from the `CSV` file at the following path:

```bash
data/nyc_taxi_2020-07.csv
```

To work efficiently with the data relevant to our initial questions, when reading the `CSV` file, instruct Pandas to import only these columns:

* `passenger_count`
* `total_amount`
* `payment_type`

In [None]:
### Begin Solution








### End Solution

## Step 2 - Skimming the Data

Display the first few rows of the DataFrame to see the column names and some sample data.

In [None]:
### Begin Solution



### End Solution

## Step 3 - Getting a Summary Overview

Obtain a concise summary of the DataFrame, including the number of rows and columns, data types, and non-null values.

In [None]:
### Begin Solution


### End Solution

## Step 4 - Descriptive Statistics for Numerical Columns

Calculate and display the descriptive statistics for the numerical columns in the dataset. This will give you insights into the central tendency, dispersion, and range of values.

In [None]:
### Begin Solution


### End Solution

## Step 5 - Examining Specific Columns

Focus on the `passenger_count` and `total_amount` columns.

* Find the minimum, maximum, and average values for each of these columns.

In [None]:
### Begin Solution













### End Solution

## Step 6 - Exploring Payment Types

Investigate the `payment_type` column to see the different categories of payment and how many times each occurs.

In [None]:
### Begin Solution



### End Solution

## Step 7 - Identifying Potentially Unusual Rides 

Use boolean indexing to filter the DataFrame and count the number of rides that meet the criteria for being unusual:
* more than 6 passengers
* 0 passengers
* total amount of \$1000 or more
* total amount of \$0 or less.

__Explanation__

Boolean Indexing in Pandas

Boolean indexing lets you select rows in a DataFrame based on whether a condition is `True` or `False`.

Syntax:

```python
df[df['column_name'] condition value]
```

* `df['column_name']`: Selects the column to check.
* `condition`: A comparison operator (`>`, `<`, `>=`, `<=`, `==`, `!=`).
* `value`: The value to compare against.

The comparison inside the brackets produces a boolean Series (one `True` or `False` per row). Passing that Series to `df[...]` keeps only the rows where the value is `True`.
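
As a rough illustration of the two stages of boolean indexing, here is a minimal sketch on a small made-up DataFrame. The `toy` DataFrame and its `item` and `price` columns are hypothetical and are not part of the taxi dataset.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the taxi dataset.
toy = pd.DataFrame({
    "item": ["apple", "pear", "plum", "fig"],
    "price": [1.50, 0.00, 3.25, 12.00],
})

# Stage 1: the comparison yields a boolean Series, one True/False per row.
mask = toy["price"] > 2

# Stage 2: indexing with that Series keeps only the rows marked True.
expensive = toy[mask]

# len() on the filtered DataFrame gives the number of matching rows.
n_expensive = len(expensive)

print(mask.tolist())
print(expensive)
print(n_expensive)
```

Storing the mask in its own variable is optional; writing `toy[toy["price"] > 2]` in one step gives the same rows.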

In [None]:
### Begin Solution








### End Solution

## Useful References

__Core DataFrame Operations__:

* `pd.read_csv()`: Reads data from a CSV file into a DataFrame. 
* `df[['column1', 'column2']]`: Selects specific columns from a DataFrame.
* `df['column']`: Accesses a single column as a Series.
* `df.head(n)`: Displays the first n rows of the DataFrame, useful for initial inspection.
* `df.dtypes`: Shows the data type of each column in the DataFrame.
* `df.info()`: Provides a concise summary of the DataFrame, including data types and non-null values.
* `df.columns`: Returns the column names of the DataFrame.
* `df.index`: Returns the index of the DataFrame (row labels).
* `df.loc[row_label, column_label]`: Accesses data by label(s). Useful for selecting specific rows or columns by their names.
* `len(df)`: Returns the number of rows in the DataFrame. Useful for finding the total number of records or the number of rows after filtering.
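
The sketch below tries several of these operations on a tiny made-up DataFrame. The `toy` DataFrame, its `name` and `age` columns, and the row labels `a`, `b`, `c` are all hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data with explicit row labels.
toy = pd.DataFrame(
    {"name": ["Ada", "Grace", "Linus"], "age": [36, 45, 28]},
    index=["a", "b", "c"],
)

first_two = toy.head(2)          # first 2 rows, as a DataFrame
ages = toy["age"]                # single brackets: one column as a Series
subset = toy[["name", "age"]]    # double brackets: a DataFrame of chosen columns
grace_age = toy.loc["b", "age"]  # value at row label "b", column "age"
n_rows = len(toy)                # number of rows

print(toy.dtypes)                # data type of each column
print(list(toy.columns))         # column names
print(list(toy.index))           # row labels
toy.info()                       # summary: dtypes and non-null counts
```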


__Filtering Data__:

* __Boolean Indexing__
    * `df[df['column'] > value]`
    * `df[(df['col1'] == value) & (df['col2'] < other_value)]`
    
Selects rows based on conditions applied to column values.
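
When combining conditions, each comparison needs its own parentheses, and pandas uses `&` (and) and `|` (or) rather than Python's `and` and `or` keywords. The following is a minimal sketch, assuming a hypothetical `toy` DataFrame with made-up `city` and `temp_f` columns.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the taxi dataset.
toy = pd.DataFrame({
    "city": ["Portland", "Salem", "Portland", "Eugene"],
    "temp_f": [71, 95, 102, 88],
})

# AND: both conditions must be True for a row to be kept.
hot_portland = toy[(toy["city"] == "Portland") & (toy["temp_f"] > 90)]

# OR: a row is kept if either condition is True.
extreme = toy[(toy["temp_f"] < 75) | (toy["temp_f"] > 100)]

print(len(hot_portland))
print(len(extreme))
```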

__Adding and Modifying Data__:

* `df['new_column'] = value`: Adds a new column to the DataFrame with a specified value (a scalar or a Series).
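
A short sketch of both cases, a scalar and a Series, using a hypothetical `toy` DataFrame whose `width` and `height` columns are invented for this example:

```python
import pandas as pd

# Hypothetical toy data, unrelated to the taxi dataset.
toy = pd.DataFrame({"width": [2, 3, 4], "height": [5, 6, 7]})

# Scalar: the same value is repeated in every row.
toy["unit"] = "cm"

# Series: computed row by row from existing columns.
toy["area"] = toy["width"] * toy["height"]

print(toy)
```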

__Grouping and Aggregation__:

* `df.groupby('column')`: Groups rows based on the unique values in a specified column. Useful for analyzing data per category, such as per payment type.
* `.count()`: Calculates the number of non-missing values within each group or Series.
* `.sum()`: Calculates the sum of values within each group or Series.
* `.mean()`: Calculates the average of values within each group or Series.
* `.size()`: Returns the size of each group (including missing values).

__Value Counts and Proportions__:

* `df['column'].value_counts()`: Returns a Series containing the counts of unique values in a column. Useful for analyzing the distribution of passenger counts and payment types.
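
To see how `.count()`, `.size()`, `.mean()`, and `.value_counts()` differ, here is a minimal sketch on a hypothetical `toy` DataFrame. The `team` and `points` columns are made up, and one `points` value is deliberately missing.

```python
import pandas as pd

# Hypothetical toy data; one "points" value is missing on purpose.
toy = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "red"],
    "points": [10, 4, None, 6, 2],
})

grouped = toy.groupby("team")

# count() ignores missing values; size() counts every row in the group.
counts = grouped["points"].count()
sizes = grouped.size()

# mean() also skips missing values when averaging.
means = grouped["points"].mean()

# value_counts() tallies how often each unique value appears in a column.
team_counts = toy["team"].value_counts()

print(counts)
print(sizes)
print(means)
print(team_counts)
```

In this sketch the `red` group has three rows but only two non-missing `points` values, so `.size()` and `.count()` disagree for that group.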

For more in-depth information on the topics we've discussed and many others, you can visit the official Pandas documentation here:

[https://pandas.pydata.org/docs/](https://pandas.pydata.org/docs/)

## Submitting Your Assignment

Please follow these steps to submit your work:

1. Ensure your name and the names of any collaborators are clearly stated in the [Your Info](#Your-Info) section at the top of the notebook.
2. Download your completed notebook as a `.ipynb` file.
3. Upload this `.ipynb` file to the designated assignment submission area in Canvas.