# Data Visualization: Biketown

<center>
<img src="https://images.ctfassets.net/p6ae3zqfb1e3/2CeomZNihpBG2HdYlfr5nK/40135dfe151de3d26e0148b1eaea80dd/BIKETOWN_Homepage_Hero_2x.png?w=1000&q=60&fm=webp">
</center>


In [None]:
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Loading the Data

Load: `https://s3.amazonaws.com/biketown-tripdata-public/2018_05.csv`

Output the DataFrame's summary information to see what needs cleaning.
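As a rough sketch of the load-and-inspect pattern, the example below reads a tiny in-memory CSV instead of the real file. The three sample rows and the subset of column names are assumptions made for illustration; `pd.read_csv` accepts a URL in the same way it accepts the file-like object used here.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file
sample_csv = io.StringIO(
    "RouteID,PaymentPlan,Distance_Miles,Duration\n"
    "1,Subscriber,1.2,0:10:30\n"
    "2,Casual,,0:45:00\n"
    "3,Subscriber,0.8,\n"
)

# pd.read_csv accepts a URL, a file path, or a file-like object
sample_df = pd.read_csv(sample_csv)

# info() prints column names, dtypes, and non-null counts
sample_df.info()
```

Columns whose non-null count is lower than the number of rows contain missing values.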

In [None]:
# Load Data
file = "https://s3.amazonaws.com/biketown-tripdata-public/2018_05.csv"

biketown_df = 

# Output Info


## Copy the DataFrame

For cleaning purposes, copy the DataFrame into a new DataFrame using `df.copy()`.
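A small sketch of why the copy matters, using a made-up two-row DataFrame (the names `original_df` and `clean_df` are illustrative only): edits to the copy leave the original untouched.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Biketown file
original_df = pd.DataFrame({"Distance_Miles": [1.2, 0.8]})

# copy() creates an independent DataFrame
clean_df = original_df.copy()

# Changing the copy does not change the original
clean_df.loc[0, "Distance_Miles"] = 99.0
print(original_df.loc[0, "Distance_Miles"])
print(clean_df.loc[0, "Distance_Miles"])
```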

In [None]:
# Copy DataFrame for Cleaning
biketown_clean_df = 

## Drop: `TripType`

* `TripType` contains only 47 non-null values.
* This is a tiny fraction of the data.
* Let's drop the entire column.
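One way to drop a column is `df.drop(columns=[...])`. The sketch below uses a made-up three-row DataFrame rather than the real data.

```python
import pandas as pd

# Hypothetical toy data with a mostly empty TripType column
toy_df = pd.DataFrame({
    "TripType": [None, "A", None],
    "Distance_Miles": [1.2, 0.8, 2.5],
})

# drop() returns a new DataFrame, so assign the result back
toy_df = toy_df.drop(columns=["TripType"])
print(toy_df.columns.tolist())
```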

In [None]:
# Drop Trip Type Column



# Output Info


## Filling NaNs: Start and End Hubs

* `StartHub` and `EndHub` contain missing values.
* Use `df.fillna()` to replace the missing values with `Unknown`.
* How might you apply `df.fillna()` to multiple columns?
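One possible answer to the last question, sketched on made-up data: select the columns with a list and fill them together. The hub names below are placeholders.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Biketown file
toy_df = pd.DataFrame({
    "StartHub": ["Hub A", None, "Hub C"],
    "EndHub": [None, "Hub B", None],
    "Distance_Miles": [1.2, None, 2.5],
})

# Select both hub columns and fill their missing values in one step
hub_columns = ["StartHub", "EndHub"]
toy_df[hub_columns] = toy_df[hub_columns].fillna("Unknown")
print(toy_df)
```

Columns outside the list, such as `Distance_Miles` here, keep their missing values. Passing a dictionary, for example `toy_df.fillna({"StartHub": "Unknown", "EndHub": "Unknown"})`, is another way to target specific columns.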

In [None]:
# Fill Na










## Filling NaNs: Geographic Data

* Fill the `NaN` values in `StartLatitude`, `EndLatitude`, `StartLongitude`, and `EndLongitude` with the median value of each column.
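A minimal sketch of a per-column median fill, shown on two made-up columns for brevity. It relies on `fillna()` matching a Series of medians to the DataFrame's columns by name.

```python
import pandas as pd

# Hypothetical toy coordinates, not taken from the Biketown file
toy_df = pd.DataFrame({
    "StartLatitude": [45.50, None, 45.54],
    "StartLongitude": [-122.68, -122.66, None],
})

geo_columns = ["StartLatitude", "StartLongitude"]

# median() returns one value per column (NaN is skipped);
# fillna() then fills each column with its own median
column_medians = toy_df[geo_columns].median()
toy_df[geo_columns] = toy_df[geo_columns].fillna(column_medians)
print(toy_df)
```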

In [None]:
# Fill NaN: Geographic Data







# Output Info


## Time Conversion

* Extract the time delta from `Duration` into a new column: `Duration_Timedelta`
* This can be achieved with `pd.to_timedelta()`
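A hedged sketch of the conversion, assuming the `Duration` values are strings in `H:MM:SS` form (check the real column before relying on that). `errors="coerce"` turns anything that cannot be parsed into `NaT`.

```python
import pandas as pd

# Hypothetical toy durations; the H:MM:SS format is an assumption
toy_df = pd.DataFrame({"Duration": ["0:10:30", "1:05:00", None]})

# Parse the strings into timedelta values; missing entries become NaT
toy_df["Duration_Timedelta"] = pd.to_timedelta(toy_df["Duration"], errors="coerce")
print(toy_df)
```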

In [None]:
# Extract Time Delta


# Output Info


## Time Conversion: Seconds

* Extract the total seconds from `Duration_Timedelta` into a new column: `Duration_Seconds`.
* This can be achieved using `.dt.total_seconds()` on the `timedelta` object.
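For illustration, the sketch below builds a made-up timedelta column and converts it to seconds. `NaT` entries come out as `NaN`.

```python
import pandas as pd

# Hypothetical toy timedeltas, including one missing value
toy_df = pd.DataFrame({
    "Duration_Timedelta": pd.to_timedelta(["0:10:30", "1:05:00", None])
})

# total_seconds() returns floats; NaT becomes NaN
toy_df["Duration_Seconds"] = toy_df["Duration_Timedelta"].dt.total_seconds()
print(toy_df)
```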

In [None]:
## Time Conversion - Seconds


# Output Info


## Fill NaTs

* Determine the median value of the `Duration_Seconds` column.
* Fill the `NaN`/`NaT` values of the `Duration_Seconds` column with the median value.
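A short sketch of the two steps on made-up values. Note that `median()` skips missing values by default.

```python
import pandas as pd

# Hypothetical toy durations in seconds, with one missing value
toy_df = pd.DataFrame({"Duration_Seconds": [630.0, 3900.0, None, 900.0]})

# Step 1: compute the median (NaN is ignored)
median_seconds = toy_df["Duration_Seconds"].median()

# Step 2: fill the missing entries with that median
toy_df["Duration_Seconds"] = toy_df["Duration_Seconds"].fillna(median_seconds)
print(median_seconds)
print(toy_df)
```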

In [None]:
# Calculate Median


# Fill NaN with Median Value


# Output Info


## Drop `Duration_Timedelta` & `Duration`

* Drop the `Duration` and `Duration_Timedelta` columns.

In [None]:
# Drop Duration_Timedelta & Duration Columns










## DateTime Conversions

* Convert `StartDate` and `EndDate` to datetime objects (format: `%m/%d/%Y`).
* Convert `StartTime` and `EndTime` to time objects (use the `.dt.time` accessor to extract the time).
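A sketch of both conversions on made-up values. The date format comes from the instructions above; the `%H:%M` time format is an assumption and should be checked against the real `StartTime` and `EndTime` columns.

```python
import pandas as pd

# Hypothetical toy data; the HH:MM time format is an assumption
toy_df = pd.DataFrame({
    "StartDate": ["5/1/2018", "5/14/2018"],
    "StartTime": ["07:45", "18:05"],
})

# Dates: parse with the explicit month/day/year format
toy_df["StartDate"] = pd.to_datetime(toy_df["StartDate"], format="%m/%d/%Y")

# Times: parse to datetime first, then keep only the time component
toy_df["StartTime"] = pd.to_datetime(toy_df["StartTime"], format="%H:%M").dt.time
print(toy_df)
```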

In [None]:
# Convert Date Time




















## That one ride...

* Use a Boolean mask on `Distance_Miles` to remove rides longer than 30 miles.
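A Boolean mask is a Series of `True`/`False` values, one per row. Indexing the DataFrame with it keeps only the `True` rows. The distances below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy distances, including one implausible ride
toy_df = pd.DataFrame({"Distance_Miles": [1.2, 0.8, 2500.0, 4.1]})

# True for rides of 30 miles or less
reasonable_ride = toy_df["Distance_Miles"] <= 30

# Keep only the rows where the mask is True
toy_df = toy_df[reasonable_ride]
print(toy_df)
```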

In [None]:
# Remove Outlier Rides







# Output Info


## Preview Data

* Take a look at the first few rows of your clean DataFrame.

In [None]:
# Preview Data


# Visualizing Data 

Now that our data is prepared, we will follow a few simple steps to represent it graphically with `Seaborn`.

1. __Draw Figure__

The first step is to initialize the canvas where your plot will be drawn.

```python
plt.figure(figsize=(width_in_inches, height_in_inches))
```

2. __Generate the Plot__

The next step uses a plotting function from a library like Seaborn (`sns`) to render the data onto the figure.

```python
# Use a Seaborn function such as boxplot, scatterplot, histplot
sns.plotting_function(
    data=your_dataframe,
    x='X_VARIABLE_COLUMN',
    y='Y_VARIABLE_COLUMN',
    # Add optional styling arguments like hue, palette, etc.
    **kwargs
)
```

3. __Set Labels and Titles__

The final step uses `Matplotlib` functions to add context, readability, and polish to your plot.

```python
# Add a title to summarize the plot's content
plt.title('A Concise and Informative Plot Title')

# Label the horizontal axis
plt.xlabel('Descriptive Label for the X-Axis')

# Label the vertical axis
plt.ylabel('Descriptive Label for the Y-Axis')

# Optional: Add a legend if multiple data series are plotted
plt.legend()

# Optional: Turn on the background grid lines for readability
plt.grid(True)

# Final action: Display or save the plot
# plt.show()
# OR
# plt.savefig('my_final_plot.png', dpi=300)
```
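The three steps can be put together as in the sketch below. It uses `sns.countplot` on a made-up `PaymentPlan` column; the data, labels, and figure size are all illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data, not taken from the Biketown file
toy_df = pd.DataFrame({
    "PaymentPlan": ["Subscriber", "Casual", "Subscriber", "Casual", "Subscriber"]
})

# 1. Draw the figure
plt.figure(figsize=(6, 4))

# 2. Generate the plot (Seaborn returns the Matplotlib Axes it drew on)
ax = sns.countplot(data=toy_df, x="PaymentPlan")

# 3. Set labels and titles
plt.title("Rides per Payment Plan (toy data)")
plt.xlabel("Payment Plan")
plt.ylabel("Number of Rides")
```

In a notebook the figure renders when the cell finishes; in a script, add `plt.show()` as the last line.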

## Histogram

A Histogram is a type of graph used to show the distribution and frequency of a single continuous numerical variable.

* The data is divided into equal-sized ranges called bins.

* The height of each vertical bar shows the frequency (or count) of data points that fall within that bin's range.

* Bars are typically placed right next to each other to emphasize that the data is continuous.

__When to Use Histograms__

Histograms are used to answer questions about how often different values occur in a dataset:

* __To see the shape of the data:__ Is it symmetrical (like a bell curve), skewed (piled up on one side), or uniform?

* __To find the central tendency:__ Where does most of the data cluster?

* __To check variability:__ Is the data spread out across many bins or tightly packed into a few?
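To make the idea of bins concrete, the sketch below counts made-up distances into three equal-width bins with `np.histogram`. This is the same counting a histogram plot performs before drawing its bars.

```python
import numpy as np

# Hypothetical toy distances in miles
distances = np.array([0.5, 0.8, 1.1, 1.4, 1.9, 2.2, 2.7, 5.6])

# Split the range 0 to 6 into three equal-width bins and count the values in each
counts, bin_edges = np.histogram(distances, bins=3, range=(0, 6))

print(bin_edges)  # → [0. 2. 4. 6.]
print(counts)     # → [5 2 1]
```

Most values land in the first bin and one sits far to the right, which is what a right-skewed shape looks like in numbers.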


## Histogram: Trip Distance

What are the shape, central tendency, and variability of the trip distance data?

* Plot a histogram to visualize the distribution of `Distance_Miles`.

Syntax:

```python
sns.histplot(data = data,
             x = x,
             bins = bins,
             color = color,
             edgecolor = edgecolor)
```

In [None]:
# Setup Figure


# Generate Histogram





# Set Titles and Labels





# Show Figure


### Insights

* What stories is the histogram telling you?

## Scatter Plot

A Scatter Plot is used to visualize the relationship between two numerical variables.

* The data is represented as points on a standard Cartesian coordinate system.
* Each point on the graph represents a single observation.
* The point's horizontal position is the value of the first variable.
* The point's vertical position is the value of the second variable.

__When to use Scatter Plots:__

The primary goal of a scatter plot is to look for correlation (or association) between the two variables:

* __Positive Correlation:__ The points generally form a line sloping upward (as X increases, Y increases).

* __Negative Correlation:__ The points generally form a line sloping downward (as X increases, Y decreases).

* __No Correlation:__ The points appear randomly scattered with no obvious pattern.

* __Identifying Clusters/Outliers:__ They help easily spot groups of similar data points (clusters) or individual data points that fall far away from the main group (outliers).
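The visual impression of correlation can be backed up with a number. As a sketch, the Pearson correlation coefficient computed below on made-up values ranges from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive).

```python
import pandas as pd

# Hypothetical toy data where longer rides take more time
toy_df = pd.DataFrame({
    "Distance_Miles": [0.5, 1.0, 1.5, 2.0, 2.5],
    "Duration_Seconds": [300, 620, 880, 1250, 1500],
})

# Series.corr() computes the Pearson correlation by default
r = toy_df["Distance_Miles"].corr(toy_df["Duration_Seconds"])
print(round(r, 3))
```

A value close to +1, as here, matches points that slope upward along a line.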

## Scatter Plot: Distance and Duration

Is there a pattern between the distance and duration of a ride?

* Let's create a scatter plot to visualize the relationship between `Distance_Miles` and `Duration_Seconds`.

Syntax:

```python
sns.scatterplot(data = data,
                x = x,
                y = y)
```

In [None]:
# Scatter Plot

# Set Up Figure


# Generate Scatter Plot





# Set Titles and Labels









### Scatter Plot: Insights

* What stories is the Scatter Plot telling you?

### Filtering the Scatter Plot for Outliers

* Set a y-limit on the plot to limit the data to rides no longer than 3 hours.
* Does the story change?

Syntax:

```python
plt.ylim(min, max)
```
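A sketch of the limit in use, on made-up rides that include one very long outlier. Three hours is 3 × 60 × 60 = 10,800 seconds.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy rides; the last one lasts more than 11 hours
toy_df = pd.DataFrame({
    "Distance_Miles": [0.5, 1.0, 1.5, 2.0, 2.5],
    "Duration_Seconds": [300, 620, 880, 1250, 40000],
})

# Set up the figure and generate the scatter plot
plt.figure(figsize=(8, 5))
ax = sns.scatterplot(data=toy_df, x="Distance_Miles", y="Duration_Seconds")

# Limit the y-axis to rides of at most 3 hours
three_hours_in_seconds = 3 * 60 * 60
plt.ylim(0, three_hours_in_seconds)

# Set titles and labels
plt.title("Distance vs. Duration, Rides up to 3 Hours (toy data)")
plt.xlabel("Distance (miles)")
plt.ylabel("Duration (seconds)")
```

`plt.ylim()` only changes what is visible. The outlier row is still in the DataFrame, so any statistics computed from it are unaffected.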

In [None]:
# Scatter Plot

# Set Up Figure


# Generate Scatter Plot





# Set Limits



# Set Titles and Labels










## Scatter Plot: Start Hub

* Create a scatter plot using the `StartLongitude` and `StartLatitude` data.
* Consider setting limits on the x and y axes.
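A sketch with made-up coordinates: longitude goes on the x-axis and latitude on the y-axis so the points fall where they would on a map. The axis limits below are illustrative guesses around the toy points, not values derived from the real data.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy coordinates; the last point is far from the rest
toy_df = pd.DataFrame({
    "StartLongitude": [-122.68, -122.67, -122.66, -122.65, -120.00],
    "StartLatitude": [45.50, 45.52, 45.53, 45.54, 47.00],
})

# Set up the figure and generate the scatter plot
plt.figure(figsize=(6, 6))
ax = sns.scatterplot(data=toy_df, x="StartLongitude", y="StartLatitude")

# Zoom in on the main cluster (limits chosen by eye for this toy data)
plt.xlim(-122.75, -122.60)
plt.ylim(45.45, 45.60)

# Set labels and titles
plt.title("Ride Start Locations (toy data)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
```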

In [None]:
# Scatter Plot

# Set Figure


# Generate Scatter plot







# Set Limits



         
# Set Labels and Titles






## Box Plots

A Box Plot graphically represents the spread and center of numerical data.

<center><img src="../images/web/boxplot.png"></center>

image: [ArcGIS Pro](https://pro.arcgis.com/en/pro-app/latest/help/analysis/geoprocessing/charts/box-plot.htm)

__Breakdown:__

* __The Box:__ Represents the interquartile range (Q1 to Q3), which is the middle 50% of the data.
* __The Line in the Box:__ Represents the median (Q2) or 50th percentile, which is the center of the data.
* __The Whiskers:__ These lines extend from the box to the minimum and maximum values, excluding outliers.
* __The Points:__ Data points lying outside the whiskers are considered outliers.

__When to Use Box Plots:__

* Comparing the distribution of one or more groups.
* Identifying median values.
* Finding outliers.
* Assessing the symmetry of the data (if the median is not in the center, the data is skewed).
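The pieces of a box plot can be computed by hand. The sketch below uses made-up durations and the common 1.5 × IQR rule for the outlier cut-off, which is also the default in Matplotlib and Seaborn.

```python
import pandas as pd

# Hypothetical toy durations in seconds
durations = pd.Series([300, 420, 480, 600, 720, 900, 5400])

# The box: first quartile, median, third quartile
q1 = durations.quantile(0.25)
median = durations.quantile(0.50)
q3 = durations.quantile(0.75)
iqr = q3 - q1

# Values beyond these fences are drawn as individual outlier points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]

print(q1, median, q3)     # → 450.0 600.0 810.0
print(outliers.tolist())  # → [5400]
```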

## Box Plot: Duration and Payment Plan

Is there a difference in the ride duration between subscribers and casual riders?

* Create a box plot to visualize key information on the `Duration_Seconds` and `PaymentPlan` columns.


Syntax:

```python
sns.boxplot(data=data,
            x=x,
            y=y,
            hue=hue,
            palette=palette)
```
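A sketch of the syntax on made-up rides. The plan names `Subscriber` and `Casual`, the durations, and the `Set2` palette are assumptions for illustration; check the actual values in `PaymentPlan` before plotting the real data.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy rides; plan names and durations are made up
toy_df = pd.DataFrame({
    "PaymentPlan": ["Subscriber"] * 4 + ["Casual"] * 4,
    "Duration_Seconds": [300, 420, 480, 600, 900, 1500, 2100, 5400],
})

# Set up the figure
plt.figure(figsize=(8, 5))

# Generate the box plot: one box per payment plan
ax = sns.boxplot(
    data=toy_df,
    x="PaymentPlan",
    y="Duration_Seconds",
    hue="PaymentPlan",
    palette="Set2",
)

# Set labels and titles
plt.title("Ride Duration by Payment Plan (toy data)")
plt.xlabel("Payment Plan")
plt.ylabel("Duration (seconds)")
```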

In [None]:
# Box Plot

# Set Figure


# Generate Box Plot







# Set Limits and What Not


# Set Labels and What Not









### Box Plot - Insights

* What stories did the boxplot reveal to you?

## Box Plot: Distance and Payment Plan

* Create a box plot to visualize `PaymentPlan` and `Distance_Miles`.
* What stories might the visualization reveal?

In [None]:
# Set Figure


# Generate Box Plot







# Set Limits


# Set Labels and Titles








