## Pedestrian Data Cleaning

This dataset covers pedestrian counts in New York City and may be used to predict foot traffic in Manhattan. Counts were taken from 7-9 am and 4-7 pm.

Source: https://www.nyc.gov/html/dot/html/about/datafeeds.shtml#trafficcounts

Metadata: https://www.nyc.gov/html/dot/downloads/pdf/bi-annual-ped-count-readme.pdf

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# Read the DataFrame from the CSV file
df = pd.read_csv('archive/PedCountLocationsMay2015.csv', keep_default_na=True, delimiter=',', skipinitialspace=True)

# Check the shape of the DataFrame (rows, columns)
df.shape

(114, 98)

In [4]:
# Check for fully duplicated rows

df.duplicated().sum()

# Result: no duplicate rows

0
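
Note that `df.duplicated()` with no arguments only flags rows that match in every column. A minimal sketch of a complementary check on a key column follows; it assumes `LOC` is meant to identify a location uniquely, which the metadata would need to confirm, and the `sample` frame is made-up illustration data rather than the real file.

```python
import pandas as pd

# Hypothetical miniature of the table; values are invented for illustration.
sample = pd.DataFrame({
    "LOC": [35, 36, 36],
    "Street_Nam": ["Broad Street", "Broadway", "Broadway"],
    "May07_AM": [3469.0, 3660.0, 3661.0],
})

# Full-row check: rows must match in every column to count as duplicates.
full_row_dupes = sample.duplicated().sum()

# Key-based check: flag every row whose LOC appears more than once.
location_dupes = sample.duplicated(subset=["LOC"], keep=False)

print(full_row_dupes)
print(sample[location_dupes])
```

Here the full-row check finds nothing, while the key-based check surfaces the two rows that share a `LOC` but differ in their counts.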

In [5]:
# Basic information on DataFrame columns and their dtypes

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114 entries, 0 to 113
Data columns (total 98 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   the_geom    114 non-null    object 
 1   OBJECTID    114 non-null    int64  
 2   LOC         114 non-null    int64  
 3   Borough     114 non-null    object 
 4   Street_Nam  114 non-null    object 
 5   From_Stree  114 non-null    object 
 6   To_Street   96 non-null     object 
 7   Index       114 non-null    object 
 8   May07_AM    112 non-null    float64
 9   May07_PM    112 non-null    float64
 10  May07_MD    112 non-null    float64
 11  Sept07_AM   113 non-null    float64
 12  Sept07_PM   113 non-null    float64
 13  Sept07_MD   113 non-null    float64
 14  May08_AM    113 non-null    float64
 15  May08_PM    113 non-null    float64
 16  May08_MD    113 non-null    float64
 17  Sept08_AM   112 non-null    float64
 18  Sept08_PM   112 non-null    float64
 19  Sept08_MD   112 non-null    float64
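
The `info()` output shows that several columns have fewer than 114 non-null entries. With 98 columns, a compact per-column summary of missing values can be easier to scan. A minimal sketch, using a made-up `sample` frame in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the pedestrian table; values are invented for illustration.
sample = pd.DataFrame({
    "To_Street": ["Exchange Place", None, "West 64th Street"],
    "May07_AM": [3469.0, np.nan, 1611.0],
    "Sept07_AM": [4214.0, 4507.0, 1805.0],
})

# Count missing entries per column and keep only the columns that have any.
missing_counts = sample.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

print(missing_counts)
```

The same two lines applied to `df` would list only the columns that need attention.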

In [7]:
# I drop only the_geom and Index, as these are the only columns not related to pedestrian counts or location

df = df.drop(columns=['the_geom', 'Index'])

In [12]:
# The Oct22 pedestrian counts are stored as object but represent continuous features, so they should be float64.
# OBJECTID and LOC are identifiers (categorical features), so they are cast to object.

#df['Oct22_AM'] = df['Oct22_AM'].astype(np.float64)
#df['Oct22_PM'] = df['Oct22_PM'].astype(np.float64)
#df['Oct22_MD'] = df['Oct22_MD'].astype(np.float64)

# Edit: casting to float raises errors because these columns contain string values. Will investigate further, or potentially drop them, if this data is chosen for the final project.

df['OBJECTID'] = df['OBJECTID'].astype(object)
df['LOC'] = df['LOC'].astype(object)
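
One way to investigate the failed float cast is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` so they can be listed. The sketch below is illustrative only: the strings in `counts` are invented stand-ins, since the actual offending values in the Oct22 columns are not known, and stripping thousands separators is just one possible fix.

```python
import pandas as pd

# Hypothetical stand-in for Oct22_AM; the real offending strings are unknown.
counts = pd.Series(["1,204", "987", "N/A", "1530"], name="Oct22_AM")

# Coerce to numeric: anything that cannot be parsed becomes NaN.
numeric = pd.to_numeric(counts, errors="coerce")

# Entries that were present but failed to parse.
bad_values = counts[numeric.isna() & counts.notna()]
print(bad_values.tolist())

# Assumed fix for this sample: remove thousands separators, then coerce again.
cleaned = pd.to_numeric(counts.str.replace(",", "", regex=False), errors="coerce")
print(cleaned)
```

Inspecting `bad_values` first makes it possible to decide whether the strings are recoverable numbers or true missing-data markers before anything is dropped.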

In [13]:
# Keep only the Manhattan rows

df = df[df.Borough == 'Manhattan']
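
An exact string comparison misses rows whose borough label differs in case or has stray whitespace. Whether the real file contains such variants is not known, so the following is a defensive sketch on made-up data: it inspects the distinct labels, then filters on a normalized copy.

```python
import pandas as pd

# Hypothetical rows; the label variants are invented to show the failure mode.
sample = pd.DataFrame({
    "Borough": ["Manhattan", "manhattan ", "Brooklyn", "Manhattan"],
    "LOC": [35, 36, 1, 37],
})

# Inspect the distinct labels before filtering.
print(sample["Borough"].value_counts())

# Compare on a stripped, case-folded copy so label variants still match.
is_manhattan = sample["Borough"].str.strip().str.casefold() == "manhattan"
manhattan = sample[is_manhattan].copy()

print(manhattan)
```

Taking a `.copy()` of the filtered frame also avoids chained-assignment warnings if columns are modified later.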

In [14]:
# Print the first and last 5 rows to check data formatting

print("First 5 rows:")
print(df.head())
print("Last 5 rows:\n")
print(df.tail())

First 5 rows:
   OBJECTID LOC    Borough       Street_Nam        From_Stree  \
34       35  35  Manhattan     Broad Street     Beaver Street   
35       36  36  Manhattan         Broadway     Morris Street   
36       37  37  Manhattan         Broadway  West 63rd Street   
37       38  38  Manhattan  Chambers Street     West Broadway   
38       39  39  Manhattan  Columbus Avenue  West 75th Street   

               To_Street  May07_AM  May07_PM  May07_MD  Sept07_AM  ...  \
34  South William Street    3469.0    3992.0     599.0     4214.0  ...   
35        Exchange Place    3660.0    8390.0    2361.0     4507.0  ...   
36      West 64th Street    1611.0    6764.0    4592.0     1805.0  ...   
37      Greenwich Street    7081.0    8512.0    2061.0     7192.0  ...   
38      West 76th Street    1071.0    3037.0    3500.0     1189.0  ...   

    May21_MD  Oct21_AM  Oct21_PM  Oct21_MD  May22_AM  May22_pM  May22_MD  \
34    1168.0    1736.0    2711.0    1279.0    2143.0    4002.0    1470.0  
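
The header above includes `May22_pM`, whose suffix casing differs from the other count columns. A minimal sketch of normalizing such names follows; the helper `normalize_count_column` is hypothetical, and the pattern assumes every count column is named as month letters, a two-digit year, an underscore, and AM, PM, or MD.

```python
import re

import pandas as pd

# Empty frame with a handful of column names taken from the output above.
sample = pd.DataFrame(
    columns=["Street_Nam", "Oct21_MD", "May22_AM", "May22_pM", "May22_MD"]
)


def normalize_count_column(name):
    """Uppercase the AM/PM/MD suffix of a count column; leave other names unchanged."""
    match = re.fullmatch(r"([A-Za-z]+\d{2})_(am|pm|md)", name, flags=re.IGNORECASE)
    if match is None:
        return name
    return f"{match.group(1)}_{match.group(2).upper()}"


# rename accepts a callable that is applied to each column label.
sample = sample.rename(columns=normalize_count_column)

print(list(sample.columns))
```

Consistent names matter if the columns are later selected by suffix, for example with `df.filter(regex="_PM$")`.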

In [16]:
# Save the cleaned DataFrame to CSV

df.to_csv('cleaned_pedestrian_df.csv', index=False)
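
CSV files do not store dtypes, so the `astype(object)` casts on `OBJECTID` and `LOC` are not preserved in the saved file; `read_csv` will infer integers again unless told otherwise. The sketch below demonstrates this with a made-up frame and an in-memory buffer standing in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical two-row frame with the same casts applied as in the notebook.
sample = pd.DataFrame({
    "OBJECTID": [35, 36],
    "LOC": [35, 36],
    "May07_AM": [3469.0, 3660.0],
})
sample["OBJECTID"] = sample["OBJECTID"].astype(object)
sample["LOC"] = sample["LOC"].astype(object)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default read: identifier columns are inferred as integers again.
buffer.seek(0)
default_read = pd.read_csv(buffer)
print(default_read.dtypes)

# Passing dtype keeps the identifiers as strings on reload.
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"OBJECTID": str, "LOC": str})
print(typed_read["OBJECTID"].tolist())
```

When reloading `cleaned_pedestrian_df.csv` in a later notebook, the same `dtype` argument can be passed to keep the identifiers categorical.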