# Data Wrangling with Python: Project (22nd November 2020)

## Overview / Context


As a Data Science Consultant, you have been tasked with recommending ways to reduce bus breakdowns in the city of New York.

The dataset that you have been provided comes from the Bus Breakdown and Delay system which collects information from school bus vendors operating out in the field in real time. Bus staff that encounter delays during the route are instructed to radio the dispatcher at the bus vendor’s central office. The bus vendor staff are then instructed to log into the Bus Breakdown and Delay system to record the event and notify OPT. 

OPT customer service agents use this system to inform parents who call with questions regarding bus service. The Bus Breakdown and Delay system is publicly accessible and contains real-time updates. All information in the system is entered by school bus vendor staff.



## Research Question

The specific research question is: <em>**How can we reduce the cases of breakdown of buses in the city of New York?**</em>

Metric for success: The main goal is **to deduce the factors affecting bus breakdowns**. We will not be making any predictions, so there is no specific metric on which to base success. Instead, success is determined by whether we can deduce useful relationships in the data that help us understand the factors influencing bus breakdowns in the city of New York.


## Project Deliverable

The expected deliverable for this project is a Jupyter notebook demonstrating data wrangling techniques in Python.

## Dataset Explained


The Bus Breakdown and Delays dataset is taken from [here](https://bit.ly/BusBreakdownDataset) and is provided by [NYC Open Data](https://data.cityofnewyork.us/Transportation/Bus-Breakdown-and-Delays/ez4e-fazm).


## Step 1. Pre-requisites

In [4]:
# Import pandas, which is used to read data from an external source
# and to manipulate it.
import pandas as pd

## Step 2. Preparing Dataset


### Step 2.1 Importing Dataset

We start by loading the dataset on bus breakdowns and delays.

In [5]:
# Load the bus breakdown dataset from an external CSV file and store it in a dataframe called bus_breakdown
bus_breakdown = pd.read_csv('https://bit.ly/BusBreakdownDataset')

Get the structure / glossary of the dataframe.

In [10]:
glossary = pd.read_csv('bus_breakdown_delays_glossary.csv', encoding='unicode_escape')
glossary

Unnamed: 0,Column Name,Description,Data Type
0,School_Year,Indicates the school year the record refers to...,Plain Text
1,Busbreakdown_ID,Unique ID of each record.,Number
2,Run_Type,Designates whether a breakdown or delay occurr...,Plain Text
3,Bus_No,The bus number is assigned by the bus vendor. ...,Plain Text
4,Route_Number,This refers to the unique identifier four (1 a...,Plain Text
5,Reason,Reason for delay as entered by staff employed ...,Plain Text
6,Schools_Serviced,OPT Codes of all transportation sites on the r...,Plain Text
7,Occurred_On,"Time/date the incident occurred, as entered by...",Date & Time
8,Created_On,Time/date the record was created in the OPT Br...,Date & Time
9,Boro,"Borough, county or state in which the delay oc...",Plain Text


### Step 2.2 Exploratory Data Analysis

In [15]:
# sampling first 10 records of the breakdown dataset

bus_breakdown.head(10)

Unnamed: 0,School_Year,Busbreakdown_ID,Run_Type,Bus_No,Route_Number,Reason,Schools_Serviced,Occurred_On,Created_On,Boro,Bus_Company_Name,How_Long_Delayed,Number_Of_Students_On_The_Bus,Has_Contractor_Notified_Schools,Has_Contractor_Notified_Parents,Have_You_Alerted_OPT,Informed_On,Incident_Number,Last_Updated_On,Breakdown_or_Running_Late,School_Age_or_PreK
0,2015-2016,1227538,Special Ed AM Run,2621,J711,Heavy Traffic,75003,11/05/2015 08:10:00 AM,11/05/2015 08:12:00 AM,New Jersey,"RELIANT TRANS, INC. (B232",,11,Yes,No,Yes,11/05/2015 08:12:00 AM,,11/05/2015 08:12:14 AM,Running Late,School-Age
1,2015-2016,1227539,Special Ed AM Run,1260,M351,Heavy Traffic,06716,11/05/2015 08:10:00 AM,11/05/2015 08:12:00 AM,Manhattan,HOYT TRANSPORTATION CORP.,20MNS,2,Yes,Yes,No,11/05/2015 08:12:00 AM,,11/05/2015 08:13:34 AM,Running Late,School-Age
2,2015-2016,1227540,Pre-K/EI,418,3,Heavy Traffic,C445,11/05/2015 08:09:00 AM,11/05/2015 08:13:00 AM,Bronx,"G.V.C., LTD.",15MIN,8,Yes,Yes,Yes,11/05/2015 08:13:00 AM,,11/05/2015 08:13:22 AM,Running Late,Pre-K
3,2015-2016,1227541,Special Ed AM Run,4522,M271,Heavy Traffic,02699,11/05/2015 08:12:00 AM,11/05/2015 08:14:00 AM,Manhattan,"RELIANT TRANS, INC. (B232",15 MIN,6,No,No,No,11/05/2015 08:14:00 AM,,11/05/2015 08:14:04 AM,Running Late,School-Age
4,2015-2016,1227542,Special Ed AM Run,3124,M373,Heavy Traffic,02116,11/05/2015 08:13:00 AM,11/05/2015 08:14:00 AM,Manhattan,"RELIANT TRANS, INC. (B232",,6,No,No,No,11/05/2015 08:14:00 AM,,11/05/2015 08:14:08 AM,Running Late,School-Age
5,2015-2016,1227543,Special Ed AM Run,HT1502,W796,Heavy Traffic,75407,11/05/2015 07:58:00 AM,11/05/2015 08:14:00 AM,Westchester,CHILDREN`S TRANS INC. (B2,30 min,1,Yes,Yes,Yes,11/05/2015 08:14:00 AM,,11/05/2015 08:14:15 AM,Running Late,School-Age
6,2015-2016,1227544,Special Ed AM Run,142,W633,Heavy Traffic,75670,11/05/2015 08:24:00 AM,11/05/2015 08:15:00 AM,Westchester,MAR-CAN TRANSPORT CO. INC,20MINS,3,Yes,No,No,11/05/2015 08:15:00 AM,,11/05/2015 08:16:53 AM,Running Late,School-Age
7,2015-2016,1227545,Special Ed AM Run,1417,M678,Heavy Traffic,03417,11/05/2015 08:15:00 AM,11/05/2015 08:16:00 AM,Manhattan,LEESEL TRANSP CORP (B2192,15,3,Yes,Yes,No,11/05/2015 08:16:00 AM,,11/05/2015 08:16:22 AM,Running Late,School-Age
8,2015-2016,1227546,Special Ed AM Run,56102,M126,Heavy Traffic,01450,11/05/2015 07:55:00 AM,11/05/2015 08:17:00 AM,Manhattan,CONSOLIDATED BUS TRANS. I,30 mins,5,Yes,Yes,Yes,11/05/2015 08:17:00 AM,,11/05/2015 08:17:07 AM,Running Late,School-Age
9,2015-2016,1227547,Special Ed AM Run,1344,M922,Heavy Traffic,02930,11/05/2015 08:16:00 AM,11/05/2015 08:17:00 AM,Manhattan,LEESEL TRANSP CORP (B2192,20,3,Yes,Yes,No,11/05/2015 08:17:00 AM,,11/05/2015 08:17:03 AM,Running Late,School-Age


In [14]:
# Check the last 5 records

bus_breakdown.tail()

Unnamed: 0,School_Year,Busbreakdown_ID,Run_Type,Bus_No,Route_Number,Reason,Schools_Serviced,Occurred_On,Created_On,Boro,Bus_Company_Name,How_Long_Delayed,Number_Of_Students_On_The_Bus,Has_Contractor_Notified_Schools,Has_Contractor_Notified_Parents,Have_You_Alerted_OPT,Informed_On,Incident_Number,Last_Updated_On,Breakdown_or_Running_Late,School_Age_or_PreK
281105,2016-2017,1338452,Pre-K/EI,9345,2,Heavy Traffic,C530,04/05/2017 08:00:00 AM,04/05/2017 08:10:00 AM,Bronx,"G.V.C., LTD.",15-20,7,Yes,Yes,No,04/05/2017 08:10:00 AM,,04/05/2017 08:10:15 AM,Running Late,Pre-K
281106,2016-2017,1341521,Pre-K/EI,0001,5,Heavy Traffic,C579,04/24/2017 07:42:00 AM,04/24/2017 07:44:00 AM,Bronx,"G.V.C., LTD.",20 MINS,0,Yes,Yes,No,04/24/2017 07:44:00 AM,,04/24/2017 07:44:15 AM,Running Late,Pre-K
281107,2016-2017,1353044,Special Ed PM Run,GC0112,X928,Heavy Traffic,09003,05/25/2017 04:22:00 PM,05/25/2017 04:28:00 PM,Bronx,G.V.C. LTD. (B2192),20-25MINS,0,Yes,Yes,Yes,05/25/2017 04:28:00 PM,90323827.0,05/25/2017 04:34:36 PM,Running Late,School-Age
281108,2016-2017,1353045,Special Ed PM Run,5525D,Q920,Won`t Start,24457,05/25/2017 04:27:00 PM,05/25/2017 04:30:00 PM,Queens,LITTLE RICHIE BUS SERVICE,,0,Yes,Yes,No,05/25/2017 04:30:00 PM,,05/25/2017 04:30:07 PM,Breakdown,School-Age
281109,2016-2017,1353046,Project Read PM Run,2530,K617,Other,21436,05/25/2017 04:36:00 PM,05/25/2017 04:37:00 PM,Brooklyn,"RELIANT TRANS, INC. (B232",45min,7,Yes,Yes,Yes,05/25/2017 04:37:00 PM,,05/25/2017 04:37:37 PM,Running Late,School-Age
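
The sample rows above show that `How_Long_Delayed` is free text with inconsistent units and spacing (`20MNS`, `15 MIN`, `15-20`, `45min`). A minimal sketch of one way to pull a number out of it follows. It assumes every number is in minutes and keeps only the first number of a range; both are simplifications for illustration, not rules taken from the dataset documentation, and the name `delay_minutes` is hypothetical.

```python
import pandas as pd

# A few How_Long_Delayed values copied from the sample rows, plus one missing entry.
raw_delay = pd.Series(["20MNS", "15MIN", "15 MIN", "30 min", "15-20", "20-25MINS", "45min", None])

# Extract the first run of digits; missing entries (or entries without digits) become NaN.
# Assumption: all numbers are minutes, and a range such as "15-20" is reduced to its lower bound.
delay_minutes = pd.to_numeric(raw_delay.str.extract(r"(\d+)", expand=False), errors="coerce")

print(delay_minutes.tolist())
```

Entries expressed in other units (for example hours) would need separate handling before the result could be trusted.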


In [13]:
# Get the total number of records and columns.

bus_breakdown.shape

(281110, 21)

In [20]:
# Check how many breakdowns were reported against running-late cases

bus_breakdown.Breakdown_or_Running_Late.value_counts()

Running Late    250053
Breakdown        31057
Name: Breakdown_or_Running_Late, dtype: int64
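
The two counts above can be turned into shares and a ratio to make the imbalance explicit. The sketch below re-enters the counts by hand so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"Running Late": 250053, "Breakdown": 31057})

# Share of each category in the whole dataset.
shares = counts / counts.sum()

# Number of running-late records per breakdown record.
late_per_breakdown = counts["Running Late"] / counts["Breakdown"]

print(shares.round(3))
print(round(late_per_breakdown, 2))
```

On the loaded dataframe, `bus_breakdown.Breakdown_or_Running_Late.value_counts(normalize=True)` returns the same shares directly.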

In [26]:
# Why do the breakdowns and delays happen?

bus_breakdown.Reason.value_counts()

Heavy Traffic                  173221
Other                           37579
Mechanical Problem              28162
Won`t Start                     12283
Flat Tire                        8411
Weather Conditions               6932
Late return from Field Trip      5704
Problem Run                      4068
Accident                         2526
Delayed by School                2222
Name: Reason, dtype: int64

In [16]:
# Check the data types of the fields.

bus_breakdown.dtypes

School_Year                        object
Busbreakdown_ID                     int64
Run_Type                           object
Bus_No                             object
Route_Number                       object
Reason                             object
Schools_Serviced                   object
Occurred_On                        object
Created_On                         object
Boro                               object
Bus_Company_Name                   object
How_Long_Delayed                   object
Number_Of_Students_On_The_Bus       int64
Has_Contractor_Notified_Schools    object
Has_Contractor_Notified_Parents    object
Have_You_Alerted_OPT               object
Informed_On                        object
Incident_Number                    object
Last_Updated_On                    object
Breakdown_or_Running_Late          object
School_Age_or_PreK                 object
dtype: object

<em><font color = lightblue> There are a total of `281,110` records of bus breakdowns and delays and `21` fields.

`Last_Updated_On`, `Informed_On`, `Occurred_On` and `Created_On` are stored as objects (text fields) even though they represent dates. For accurate analysis, they will need to be converted to datetime values.
Further, we notice that there are about 8 times as many running-late cases as breakdowns (250,053 versus 31,057), the main reason being heavy traffic.
<br><br>The top three reasons recorded across all cases are Heavy Traffic, Other (with no further detail given) and Mechanical Problem.</font> </em>
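
A minimal sketch of the date conversion mentioned above, using two rows copied from the sample output. It assumes every timestamp follows the month/day/year 12-hour pattern seen in those samples; values in another layout would raise an error with an explicit `format`.

```python
import pandas as pd

# Date fields from the first head() row and one tail() row shown earlier.
sample = pd.DataFrame({
    "Occurred_On": ["11/05/2015 08:10:00 AM", "05/25/2017 04:22:00 PM"],
    "Created_On": ["11/05/2015 08:12:00 AM", "05/25/2017 04:28:00 PM"],
    "Informed_On": ["11/05/2015 08:12:00 AM", "05/25/2017 04:28:00 PM"],
    "Last_Updated_On": ["11/05/2015 08:12:14 AM", "05/25/2017 04:34:36 PM"],
})

date_columns = ["Occurred_On", "Created_On", "Informed_On", "Last_Updated_On"]

# Assumption: all values match this pattern (12-hour clock with AM/PM marker).
for column in date_columns:
    sample[column] = pd.to_datetime(sample[column], format="%m/%d/%Y %I:%M:%S %p")

print(sample.dtypes)
```

Once converted, differences such as `sample["Created_On"] - sample["Occurred_On"]` become timedeltas, which makes reporting lag straightforward to study.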

## Step 3. Data Preparation
Our main interest is to interrogate the available data to deduce any useful relationships between the dataset features that can help us understand the factors that affect breakdowns and delayed trips for school buses. <br> Rephrasing this into a research question: **"How can we reduce the cases of breakdown of buses in the city of New York?"**

### 3.1 Data Cleaning

#### 3.1.1 Check to see if we have null entries for records.

In [25]:
#Check to see if we have null records.

bus_breakdown.isna().sum()

School_Year                             0
Busbreakdown_ID                         0
Run_Type                                3
Bus_No                                  9
Route_Number                            7
Reason                                  2
Schools_Serviced                        7
Occurred_On                             0
Created_On                              0
Boro                                13461
Bus_Company_Name                        0
How_Long_Delayed                    35608
Number_Of_Students_On_The_Bus           0
Has_Contractor_Notified_Schools         0
Has_Contractor_Notified_Parents         0
Have_You_Alerted_OPT                    0
Informed_On                             0
Incident_Number                    271627
Last_Updated_On                         0
Breakdown_or_Running_Late               0
School_Age_or_PreK                      0
dtype: int64

In [24]:
# Check for duplicated entries.

bus_breakdown.duplicated().any()

False

<em><font color = lightblue> We notice that there are quite a number of missing entries, especially for `Incident_Number`, which is missing in 271,627 of the 281,110 records. Dropping these rows would mean losing a significant number of records. It would be interesting to know why these cases do not have incident numbers attached to them (the same applies to `Boro` and `How_Long_Delayed`).
 <br> <br> We also note that there are no duplicated records in the dataset.

We further notice that the naming of the dataset columns is uniform, so no renaming is needed.</font> </em>
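
To put the missing values in proportion, the null counts can be expressed as a percentage of all records. The sketch below re-enters the non-zero counts from the output above so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

total_records = 281110

# Non-zero null counts copied from the isna().sum() output above.
null_counts = pd.Series({
    "Run_Type": 3,
    "Bus_No": 9,
    "Route_Number": 7,
    "Reason": 2,
    "Schools_Serviced": 7,
    "Boro": 13461,
    "How_Long_Delayed": 35608,
    "Incident_Number": 271627,
})

# Percentage of records missing each field, largest first.
null_pct = (null_counts / total_records * 100).sort_values(ascending=False).round(1)

print(null_pct)
```

On the loaded dataframe, `bus_breakdown.isna().mean() * 100` gives the same percentages without retyping the counts.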



#### 3.1.2 More focused questions to assist in handling our research question

**1. Which bus companies had the highest number of breakdowns?**

In [27]:
# Use value_counts() on the 'Bus_Company_Name' column to count the number of
# records for each bus company.

bus_breakdown.Bus_Company_Name.value_counts()

G.V.C., LTD.                          19394
LEESEL TRANSPORTATION CORP (B2192)    17200
RELIANT TRANS, INC. (B232             13741
PIONEER TRANSPORTATION CO             12017
BORO TRANSIT, INC.                    11953
                                      ...  
`                                         1
FORTUNA BUS COMPANY                       1
SMART PICK INC                            1
phillip bus service                       1
L&M Bus Corp.                             1
Name: Bus_Company_Name, Length: 117, dtype: int64

<em><font color = lightblue> The leading bus company by number of cases (breakdowns and delays combined) is `G.V.C., LTD.` with a total of 19,394 cases, followed closely by `LEESEL TRANSPORTATION CORP (B2192)` with 17,200 cases. </font> </em>
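
The counts above cover every record, including buses that were only running late. If the question is read strictly as breakdowns, the rows can be filtered first. The sketch below uses a tiny made-up frame; the company names and status labels come from the sample rows, but the mix of rows is invented for illustration.

```python
import pandas as pd

# Illustrative rows only; not real counts from the dataset.
sample = pd.DataFrame({
    "Bus_Company_Name": [
        "G.V.C., LTD.",
        "G.V.C., LTD.",
        "LITTLE RICHIE BUS SERVICE",
        "HOYT TRANSPORTATION CORP.",
        "LITTLE RICHIE BUS SERVICE",
    ],
    "Breakdown_or_Running_Late": [
        "Running Late",
        "Breakdown",
        "Breakdown",
        "Running Late",
        "Breakdown",
    ],
})

# Keep only the rows flagged as breakdowns, then count them per company.
is_breakdown = sample["Breakdown_or_Running_Late"] == "Breakdown"
breakdowns_by_company = sample.loc[is_breakdown, "Bus_Company_Name"].value_counts()

print(breakdowns_by_company)
```

The same two lines apply unchanged to `bus_breakdown` in place of `sample`.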



---



In [36]:
# The above can also be achieved with groupby(); count() returns the number of non-null values in each column.
#
breakdowns = bus_breakdown.groupby(['Bus_Company_Name']).count()
breakdowns

Bus_Company_Name,School_Year,Busbreakdown_ID,Run_Type,Bus_No,Route_Number,Reason,Schools_Serviced,Occurred_On,Created_On,Boro,How_Long_Delayed,Number_Of_Students_On_The_Bus,Has_Contractor_Notified_Schools,Has_Contractor_Notified_Parents,Have_You_Alerted_OPT,Informed_On,Incident_Number,Last_Updated_On,Breakdown_or_Running_Late,School_Age_or_PreK
1967,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1
1992,108,108,108,108,108,108,108,108,108,108,107,108,108,108,108,108,0,108,108,108
ACME BUS CORP. (B2321),1602,1602,1602,1602,1602,1602,1602,1602,1602,151,1415,1602,1602,1602,1602,1602,19,1602,1602,1602
ADDIES,26,26,26,26,26,26,26,26,26,26,21,26,26,26,26,26,0,26,26,26
ALINA SERVICES CORP.,534,534,534,534,534,534,534,534,534,534,459,534,534,534,534,534,0,534,534,534
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Y & M TRANSIT CORP (B2321),28,28,28,28,28,28,28,28,28,28,21,28,28,28,28,28,1,28,28,28
`,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1
alina,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1
gvc,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,0,5,5,5
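
`groupby(...).count()` reports the number of non-null values in every column, which is why the figures in the table above differ between columns such as `Boro` and `Incident_Number`. A closer match to `value_counts()` is `groupby(...).size()`, which counts rows. The sketch below shows the difference on a small made-up frame; the company labels `A`, `B` and `C` are placeholders.

```python
import pandas as pd

# Placeholder data: six records, only one of which has an incident number.
sample = pd.DataFrame({
    "Bus_Company_Name": ["A", "A", "B", "A", "C", "B"],
    "Incident_Number": [None, "90323827", None, None, None, None],
})

# size() counts rows per group, regardless of missing values elsewhere.
per_company = sample.groupby("Bus_Company_Name").size().sort_values(ascending=False)

# count() counts non-null cells, so a sparse column yields smaller numbers.
non_null_incidents = sample.groupby("Bus_Company_Name")["Incident_Number"].count()

print(per_company)
print(non_null_incidents)
```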


**2. How many students were on the buses when they broke down?**

In [8]:
# Count the records for each number of students on the bus.
#
bus_breakdown.groupby(['Number_Of_Students_On_The_Bus']).count()[['Busbreakdown_ID']].reset_index()

Unnamed: 0,Number_Of_Students_On_The_Bus,Busbreakdown_ID
0,0,170376
1,1,15704
2,2,16737
3,3,14837
4,4,12248
...,...,...
219,6209,1
220,6219,1
221,8547,1
222,9007,1
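
The tail of this table contains values such as 6209 and 9007 students on a single bus, which look more like data-entry errors than real loads. A hedged sketch for flagging them follows; the cutoff `MAX_PLAUSIBLE_STUDENTS = 100` is an assumption chosen purely for illustration and does not come from the dataset documentation.

```python
import pandas as pd

# Student counts visible in the output above (the smallest and the largest values).
students = pd.Series(
    [0, 1, 2, 3, 4, 6209, 6219, 8547, 9007],
    name="Number_Of_Students_On_The_Bus",
)

# Hypothetical upper bound for a believable number of students on one bus.
MAX_PLAUSIBLE_STUDENTS = 100

# Split the values into suspicious and believable entries.
implausible = students[students > MAX_PLAUSIBLE_STUDENTS]
plausible = students[students <= MAX_PLAUSIBLE_STUDENTS]

print(implausible.tolist())
```

Whether such rows should be dropped, capped or corrected depends on the analysis; the sketch only identifies them.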


In [30]:
# Check whether there are entries other than Yes / No in the Has_Contractor_Notified_Parents field.

bus_breakdown.Has_Contractor_Notified_Parents.value_counts()

Yes    209632
No      71478
Name: Has_Contractor_Notified_Parents, dtype: int64

In [9]:
# Confirm that the Has_Contractor_Notified_Parents field holds only two distinct values.

# The values themselves can be listed with `bus_breakdown.Has_Contractor_Notified_Parents.value_counts()`

bus_breakdown['Has_Contractor_Notified_Parents'].nunique()

2
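
`nunique()` confirms that there are two distinct values but not which ones. The sketch below checks the values against the expected set and, as an optional step, maps them to booleans. The helper name `has_only_yes_no` is hypothetical and the sample series is made up.

```python
import pandas as pd


def has_only_yes_no(column: pd.Series) -> bool:
    """Return True when every non-null entry is exactly 'Yes' or 'No'."""
    return set(column.dropna().unique()) <= {"Yes", "No"}


# Made-up sample standing in for Has_Contractor_Notified_Parents.
notified = pd.Series(["Yes", "No", "Yes", "Yes"])
print(has_only_yes_no(notified))

# Optional: convert to booleans so the field can be summed or averaged.
notified_flag = notified.map({"Yes": True, "No": False})
print(notified_flag.mean())
```

The same check can be repeated for `Has_Contractor_Notified_Schools` and `Have_You_Alerted_OPT`, which hold Yes/No values in the sample rows.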



---

