<a href="https://colab.research.google.com/github/Joyakis/DATA_VIZ/blob/main/BOKEH_VIZ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PROJECT BY JOY AKINYI
## DATA VISUALISATION USING BOKEH


#### Introduction

In this analysis we investigate the NYC yellow taxi trip dataset (January 2022) and draw conclusions about the considerations that should be taken into account for statistical analysis, and their effects.

In [None]:
import pandas as pd

# Read the parquet file
df = pd.read_parquet('/content/drive/MyDrive/yellow_tripdata_2022-01.parquet')



In [None]:
#installing bokeh
!pip install bokeh
#importing bokeh
import bokeh


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Print the shape of the DataFrame
print('Shape:')
print(df.shape)

# Print the info of the DataFrame
print('Info:')
print(df.info())

Shape:
(2463931, 19)
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       


### The dataframe has 2463931 rows and 19 columns. The majority of the columns have the datatype float64.
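
The dtype breakdown can also be counted rather than read off the `info()` listing. The following is a minimal sketch that rebuilds an empty frame from the 18 dtypes printed above (the listing does not show `airport_fee`, so it is left out here) and counts the columns per dtype. The names `listed_dtypes` and `schema` are only illustrative.

```python
import pandas as pd

# Dtypes copied from the info() listing above.
# airport_fee is not shown in that listing, so it is omitted here.
listed_dtypes = {
    'VendorID': 'int64',
    'tpep_pickup_datetime': 'datetime64[ns]',
    'tpep_dropoff_datetime': 'datetime64[ns]',
    'passenger_count': 'float64',
    'trip_distance': 'float64',
    'RatecodeID': 'float64',
    'store_and_fwd_flag': 'object',
    'PULocationID': 'int64',
    'DOLocationID': 'int64',
    'payment_type': 'int64',
    'fare_amount': 'float64',
    'extra': 'float64',
    'mta_tax': 'float64',
    'tip_amount': 'float64',
    'tolls_amount': 'float64',
    'improvement_surcharge': 'float64',
    'total_amount': 'float64',
    'congestion_surcharge': 'float64',
}

# An empty frame with the same schema is enough to count dtypes.
schema = pd.DataFrame({name: pd.Series(dtype=dtype) for name, dtype in listed_dtypes.items()})

# Number of columns per dtype, most common first.
dtype_counts = schema.dtypes.astype(str).value_counts()
print(dtype_counts)
```

On the real frame the equivalent call would be `df.dtypes.value_counts()`.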

In [None]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,7.5,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,8.0,0.5,0.5,0.0,0.0,0.3,11.8,2.5,0.0
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,N,68,163,1,23.5,0.5,0.5,3.0,0.0,0.3,30.3,2.5,0.0


Let us start by investigating the dataset and cleaning it.

# **DATA** **CLEANING**

> Here we will check for tidiness of the data and aim to clean it



In [None]:
#Checking for null values
df.isnull().sum()

VendorID                     0
tpep_pickup_datetime         0
tpep_dropoff_datetime        0
passenger_count          71503
trip_distance                0
RatecodeID               71503
store_and_fwd_flag       71503
PULocationID                 0
DOLocationID                 0
payment_type                 0
fare_amount                  0
extra                        0
mta_tax                      0
tip_amount                   0
tolls_amount                 0
improvement_surcharge        0
total_amount                 0
congestion_surcharge     71503
airport_fee              71503
dtype: int64

`passenger_count`, `RatecodeID`, `store_and_fwd_flag`, `congestion_surcharge` and `airport_fee` each have 71503 missing values. We shall decide whether to impute these values or drop the affected rows.
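
Two quick checks can inform that decision: what share of the rows is affected, and whether the gaps fall on the same rows. The sketch below uses the counts reported above for the share, and a small made-up frame (its values are illustrative, not taken from the dataset) for the row check. `missing_rows_coincide` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

# Counts taken from the outputs above.
total_rows = 2463931
missing_per_column = 71503
missing_share = missing_per_column / total_rows
print(f"Share of rows with a gap in each affected column: {missing_share:.2%}")


def missing_rows_coincide(frame, columns):
    """Return True when every row is missing in all of `columns` or in none of them."""
    is_null = frame[columns].isnull()
    return bool((is_null.all(axis=1) == is_null.any(axis=1)).all())


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'passenger_count': [1.0, np.nan, 2.0, np.nan],
    'RatecodeID': [1.0, np.nan, 1.0, np.nan],
    'airport_fee': [0.0, np.nan, 0.0, np.nan],
})
same_rows = missing_rows_coincide(toy, ['passenger_count', 'RatecodeID', 'airport_fee'])
print(same_rows)
```

If the same check returned `True` on the real `df`, dropping rows would remove one set of 71503 rows for all five columns, which is a small share of the data.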

In [None]:
#Checking for duplicates
df.duplicated().sum()

0

This dataset has no duplicates







In [None]:
#Dropping missing values
df.dropna(inplace=True)


######  Base function

In [None]:
#Importing necessary libraries
import numpy as np
import pandas as pd
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource
from bokeh.models.tiles import WMTSTileSource

from bokeh.models import BoxZoomTool

output_notebook()
# Create a new plot with the tile provider as the background.
# Note: `tile_provider` is a global that must be defined before base_plot() is called.
def base_plot(tools='pan,wheel_zoom,reset', output_backend='webgl'):
    p = figure(tools=tools, width=750, height=625, min_border=0, min_border_left=0,
               min_border_right=0, outline_line_color=None, output_backend=output_backend)
    p.add_tile(tile_provider)
    # Hide the axes and grid lines so only the map tiles and markers are shown
    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    # Box zoom that keeps the aspect ratio of the map
    p.add_tools(BoxZoomTool(match_aspect=True))
    return p

# UNDERSAMPLING
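I have included geographical map data here with a sample of only 1000 pick-up and drop-off locations to show the impact of undersampling. Note that `PULocationID` and `DOLocationID` are location IDs rather than coordinates, so the circles show how the IDs pair up rather than true positions on the map.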


In [27]:
# Draw a random sample of 1000 trips (seeded for reproducibility)
trips = df.sample(n=1000, random_state=0)
# Use the pickup and drop-off IDs of the same sampled trips so each point is one trip
x = trips['PULocationID'].astype(float)
y = trips['DOLocationID'].astype(float)

# Create a ColumnDataSource
source = ColumnDataSource(data=dict(x=x, y=y))


# Define the tile provider for the map background
tile_provider = WMTSTileSource(url='http://tile.stamen.com/terrain/{Z}/{X}/{Y}.png',
                               attribution="Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL")

    
#Creating a circle marker plot and calling function
p = base_plot()
p.circle(x='x', y='y', source=source, size=10, color='blue')


# Display the output on the notebook
show(p)

#  OVERSAMPLING

Here I sampled 10000 points to see whether the representation of the geographical data changes significantly. There are more circles in the output, so more of the data is represented, but the points are too clustered together, making it difficult to distinguish individual points or patterns within the data. This is described as overplotting and is a common issue in data visualization.
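
One way to put a number on the overplotting is to count how many sampled points land on a position that another point already occupies. The sketch below does this on synthetic IDs drawn uniformly from 1 to 265 (an assumption made purely for illustration; real pickup and drop-off IDs are not uniform), so the figures it prints say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the trip data: IDs drawn uniformly, which real trips are not.
n_points = 10000
synthetic = pd.DataFrame({
    'PULocationID': rng.integers(1, 266, size=n_points),
    'DOLocationID': rng.integers(1, 266, size=n_points),
})

# Number of points plotted at each distinct (x, y) position.
points_per_position = synthetic.groupby(['PULocationID', 'DOLocationID']).size()

# A point is hidden when it shares its position with an earlier point.
hidden = n_points - len(points_per_position)
hidden_share = hidden / n_points

print(f"Distinct positions: {len(points_per_position)}")
print(f"Points drawn on top of another point: {hidden} ({hidden_share:.1%})")
```

The same `groupby(...).size()` call could be applied to a sample of the real frame to see which ID pairs carry the most stacked points.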

In [None]:
# Draw a random sample of 10000 trips (seeded for reproducibility)
trips = df.sample(n=10000, random_state=0)
# Use the pickup and drop-off IDs of the same sampled trips so each point is one trip
x = trips['PULocationID'].astype(float)
y = trips['DOLocationID'].astype(float)

# Create a ColumnDataSource
source = ColumnDataSource(data=dict(x=x, y=y))


# Define the tile provider for the map background
tile_provider = WMTSTileSource(url='http://tile.stamen.com/terrain/{Z}/{X}/{Y}.png',
                               attribution="Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL")
#Calling the function
p=base_plot()
#Creating a circle plot
p.circle('x', 'y', source=source)

# Display the output on the notebook
show(p)

> * EFFECTS OF ADDING ALPHA

To curb overplotting I decided to adjust the size of the circles, but at popular drop-off and pick-up locations overplotting will still occur. I therefore also reduced alpha (the opacity of each circle), so that multiple points have to overlap before the colour saturates. This can also show where drop-offs were more common.
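
How many points must overlap before the colour saturates can be estimated from standard alpha compositing: n markers of opacity a stacked on one spot give a combined opacity of 1 - (1 - a)^n. The sketch below applies that formula. It assumes simple source-over blending of identical markers, and the function names and the 95% threshold are illustrative choices.

```python
import math


def combined_opacity(alpha, n_overlapping):
    """Opacity of n identical markers stacked on one spot (source-over blending)."""
    return 1 - (1 - alpha) ** n_overlapping


def overlaps_needed(alpha, target_opacity):
    """Smallest number of stacked markers whose combined opacity reaches the target."""
    return math.ceil(math.log(1 - target_opacity) / math.log(1 - alpha))


for alpha in (0.1, 0.5):
    n = overlaps_needed(alpha, 0.95)
    print(f"alpha={alpha}: {n} overlapping points reach 95% opacity")
```

Under this model, `alpha=0.1` needs 29 stacked markers to reach 95% opacity, while `alpha=0.5` needs only 5, which is why a low alpha keeps sparse areas faint and lets busy areas stand out.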
 

In [None]:
options = dict(line_color=None, fill_color='blue', size=5, alpha=0.1)
#Sampling 10000 points
samples = df.sample(n=10000)
p = base_plot(output_backend='webgl')
p.circle(x=samples['PULocationID'], y=samples['DOLocationID'], **options)
show(p)

**Conclusions**

1. Effects of oversampling


*  Visual clutter: Overplotting can make a plot appear visually cluttered, making it difficult to read and interpret. This can also lead to user fatigue and make it harder to discern patterns and trends in the data.

*   Bias in analysis: Overplotting can also lead to biased analysis, particularly if certain regions of the plot are more densely populated than others. This can result in overemphasis on certain areas of the data and underemphasis on others, leading to skewed conclusions.

2. Effects of undersampling

*  Loss of information: Undersampling can result in a loss of information, particularly if important data points are excluded from the subset. This can result in incorrect conclusions being drawn from the data and can reduce the accuracy of models trained on the subset.

*  Overgeneralization: Undersampling can also result in overgeneralization, particularly if the subset is not representative of the population as a whole. This can result in inaccurate models that do not capture the true complexity of the data.

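
The loss of information from undersampling can be made concrete by counting how many distinct location IDs a sample of each size actually contains. The following sketch uses a synthetic, deliberately skewed population (invented for illustration, not derived from the taxi data) and takes the small sample as a subset of the large one. All names in it are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic population: 265 location IDs with skewed popularity.
location_ids = np.arange(1, 266)
weights = 1.0 / location_ids  # lower IDs are made more popular, purely for illustration
weights /= weights.sum()
population = pd.Series(rng.choice(location_ids, size=200_000, p=weights))

# Draw the larger sample, then take its first 1000 rows as the smaller sample.
large_sample = population.sample(n=10_000, random_state=0)
small_sample = large_sample.iloc[:1000]

for label, sample in [('n=1000', small_sample), ('n=10000', large_sample)]:
    print(f"{label}: {sample.nunique()} of {population.nunique()} location IDs present")
```

Because the small sample is a subset of the large one, it can never contain more distinct IDs, and rarely used IDs are the first to disappear from it.
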
#### Using Alpha
Alpha refers to the level of transparency or opacity of graphical elements, such as points or lines, used to represent data. Alpha is typically represented by a numerical value between 0 and 1, where 0 is completely transparent (invisible) and 1 is completely opaque.

> The overplotting effects above can be reduced by using alpha: areas where few points overlap stay faint, while areas where many points overlap build up colour, so dense and sparse regions can be told apart. This can help users identify patterns and trends in the data that overplotting would otherwise obscure.


>> Using alpha also makes the plot easier for readers to understand, which helps when communicating findings to others and supports data-driven decision-making.











 

