# What outliers, strange, and missing values did you find on the trips table? What would you do to deal with this data and why?

In [1]:
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Trips Table

In [2]:
#Importing trips table
trips_table = pd.read_csv("Datasets/trip.csv")
trips_table

Unnamed: 0,tripid,personid,trip_path_distance,speed_mph,Destination purpose,Primary Mode
0,1710000501001,1710000501,2.299694071,13.798164,"Conducted personal business (e.g., bank, post ...",Household vehicle 1
1,1710000501002,1710000501,1.122817397,13.473809,Went grocery shopping,Household vehicle 1
2,1710000501003,1710000501,3.263440492,19.580643,Went home,Household vehicle 1
3,1710000501004,1710000501,8.126289938,19.503096,Went to religious/community/volunteer activity,Household vehicle 1
4,1710000501005,1710000501,8.044890337,24.134671,Went home,Household vehicle 1
5,1710000502001,1710000502,2.299694071,13.798164,"Conducted personal business (e.g., bank, post ...",Household vehicle 1
6,1710000502002,1710000502,1.122817397,13.473809,Went grocery shopping,Household vehicle 1
7,1710000502003,1710000502,3.263440492,19.580643,Went home,Household vehicle 1
8,1710002401001,1710002401,0.541214141,3.247285,"Went to work-related place (e.g., meeting, sec...","Walk, jog, or wheelchair"
9,1710002401002,1710002401,0.541214141,3.247285,Went home,"Walk, jog, or wheelchair"


### Finding Missing Values Based on Distance of the Trip Path

In [5]:
missing_values_trip = trips_table[trips_table["trip_path_distance"] == " "]
missing_values_trip

Unnamed: 0,tripid,personid,trip_path_distance,speed_mph,Destination purpose,Primary Mode
14892,1711494102001,1711494102,,,"Went to work-related place (e.g., meeting, sec...",Airplane or helicopter
16459,1711625701001,1711625701,,,Went home,Airplane or helicopter
22942,1712399302001,1712399302,,,Went home,Airplane or helicopter
23793,1712469601001,1712469601,,,"Attended recreational event (e.g., movies, spo...",Airplane or helicopter
40677,1714124802001,1714124802,,,"Went to school/daycare (e.g., daycare, K-12, c...","Other mode (e.g., skateboard, kayak, motorhome..."
45650,1714618802001,1714618802,,,Went to primary workplace,"Urban rail (e.g., Link light rail, monorail)"
48788,1714997602001,1714997602,,,Went to primary workplace,Other household vehicle


Most of the trips with a missing trip path distance had a flight-related primary mode of transport, such as an airplane or a helicopter. Without a distance, the speed of the trip is usually unknown as well.
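One way to back up this observation is to count the primary modes among the rows with a missing distance. The sketch below uses a small made-up frame (`toy_trips`, a hypothetical stand-in for `trips_table`) and assumes, as in the cell above, that a missing distance is stored as a blank string.

```python
import pandas as pd

# Toy stand-in for trips_table; the rows are illustrative, not survey data
toy_trips = pd.DataFrame({
    "trip_path_distance": ["2.3", " ", " ", "0.5", " "],
    "Primary Mode": [
        "Household vehicle 1",
        "Airplane or helicopter",
        "Airplane or helicopter",
        "Walk, jog, or wheelchair",
        "Other household vehicle",
    ],
})

# Assumption: missing distances are blank strings, so strip and compare to ""
is_missing = toy_trips["trip_path_distance"].str.strip().eq("")

# Count how often each primary mode appears among the missing-distance rows
mode_counts = toy_trips.loc[is_missing, "Primary Mode"].value_counts()
print(mode_counts)
```

Run against the real table, the same `value_counts()` call would show how strongly flight modes dominate the missing distances.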

## Missing Values

### Finding Missing Values Based on Speed

In [19]:
missing_values_speed = pd.isnull(trips_table["speed_mph"])
ms_speed = trips_table[missing_values_speed]
ms_speed

Unnamed: 0,tripid,personid,trip_path_distance,speed_mph,Destination purpose,Primary Mode
7620,1710768901023,1710768901,0.564826414,,"Went to other shopping (e.g., mall, pet store)",Household vehicle 1
14892,1711494102001,1711494102,,,"Went to work-related place (e.g., meeting, sec...",Airplane or helicopter
16459,1711625701001,1711625701,,,Went home,Airplane or helicopter
19748,1712029601006,1712029601,1.856657122,,,
22373,1712340301016,1712340301,0.318763422,,Went grocery shopping,"Walk, jog, or wheelchair"
22942,1712399302001,1712399302,,,Went home,Airplane or helicopter
23793,1712469601001,1712469601,,,"Attended recreational event (e.g., movies, spo...",Airplane or helicopter
28335,1712919601011,1712919601,0.707120417,,,
28336,1712919601012,1712919601,0.784791816,,,
28463,1712930501010,1712930501,17.7022439,,"Dropped off/picked up someone (e.g., son at a ...",Household vehicle 1


Among the trips with a missing (NaN) speed but a recorded distance, most covered less than a mile, with one outlier of about 17.7 miles. The most common purpose for these trips was dropping off or picking up someone.
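The distance pattern among these rows can be checked directly rather than read off the printed table. This is a minimal sketch on a toy frame (`toy_trips` is hypothetical); its distances are rounded from the rows displayed above, plus two extra rows so that the filter has something to exclude.

```python
import numpy as np
import pandas as pd

# Toy frame: distances rounded from the rows shown above, not the full table
toy_trips = pd.DataFrame({
    "trip_path_distance": [0.565, np.nan, 1.857, 0.319, 0.707, 0.785, 17.702, 2.300],
    "speed_mph": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 13.798],
})

# Keep rows where speed is missing but distance is present
mask = toy_trips["speed_mph"].isna() & toy_trips["trip_path_distance"].notna()
dist = toy_trips.loc[mask, "trip_path_distance"]

# Share of those trips that are shorter than one mile
share_under_one_mile = (dist < 1).mean()
print(dist.describe())
print(share_under_one_mile)
```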

### Finding Missing Values Based on Destination Purpose

In [21]:
missing_values_destination = pd.isnull(trips_table["Destination purpose"])
trips_table[missing_values_destination]

Unnamed: 0,tripid,personid,trip_path_distance,speed_mph,Destination purpose,Primary Mode
996,1710083402033,1710083402,4.78455818,31.897055,,
997,1710083402034,1710083402,4.078680506,20.393403,,
998,1710083402035,1710083402,0.489640499,4.896405,,
1771,1710137502006,1710137502,0.171498449,3.429969,,
1799,1710137502034,1710137502,0.175226676,2.628400,,
2193,1710187302011,1710187302,2627.303423,489.559644,,
2200,1710187302018,1710187302,6.747469777,36.804381,,
2201,1710187302019,1710187302,20.6096397,112.416217,,
2203,1710187302021,1710187302,251.5695836,457.399243,,
2204,1710187302022,1710187302,807.6707031,598.274595,,


The vast majority of trips with a missing trip purpose also have a missing primary mode. A trip with a missing purpose but a recorded primary mode would therefore be the exception.
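The overlap between the two kinds of missingness can be counted with boolean masks. The sketch below is an illustration on a made-up frame (`toy_trips` is hypothetical), not a result from the survey data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for trips_table with a mix of missing patterns
toy_trips = pd.DataFrame({
    "Destination purpose": ["Went home", np.nan, np.nan, np.nan, "Went grocery shopping"],
    "Primary Mode": ["Household vehicle 1", np.nan, np.nan, "Bus (public transit)", np.nan],
})

purpose_missing = toy_trips["Destination purpose"].isna()
mode_missing = toy_trips["Primary Mode"].isna()

# Rows missing both fields, and rows missing only the purpose
both_missing = int((purpose_missing & mode_missing).sum())
purpose_only = int((purpose_missing & ~mode_missing).sum())
print(both_missing, purpose_only)
```

On the real table, a small `purpose_only` count relative to `both_missing` would support the claim that the two fields tend to be missing together.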

### Finding Missing Values Based on Primary Mode

In [28]:
missing_values_primary = pd.isnull(trips_table["Primary Mode"])
trips_table[missing_values_primary]

Unnamed: 0,tripid,personid,trip_path_distance,speed_mph,Destination purpose,Primary Mode
565,1710059001031,1710059001,0.778578104,9.342937,Went to restaurant to eat/get take-out,
566,1710059001032,1710059001,0.840093852,10.081126,Went home,
567,1710059001033,1710059001,2.614108606,15.684652,Went to religious/community/volunteer activity,
568,1710059001034,1710059001,2.515310586,16.768737,Went home,
569,1710059001035,1710059001,2.950891792,17.705351,"Attended social event (e.g., visit with friend...",
570,1710059001036,1710059001,2.691158634,12.420732,Went home,
571,1710059001037,1710059001,2.696129603,6.470711,Went to primary workplace,
572,1710059001038,1710059001,0.4,2.666667,"Went to medical appointment (e.g., doctor, den...",
573,1710059001039,1710059001,0.4,3.000000,Went to primary workplace,
574,1710059001040,1710059001,2.070408813,4.968981,Went home,


Much of the missing data for primary mode also had missing data for destination purpose. Where the destination purpose was recorded, it was most often a trip to return home.

If both the destination purpose and the primary mode of transportation are missing, I would remove the entire row from the dataset, since it lacks crucial data that the survey needed. If I wanted to know the primary mode of transportation for each trip, any trip with a missing value would be a hindrance when comparing modes of transportation.
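Dropping only the rows where both fields are missing can be sketched with `DataFrame.dropna`, using `subset` to name the two columns and `how="all"` so a row is removed only when every listed column is NaN. The frame below is a hypothetical toy example, not the survey data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for trips_table
toy_trips = pd.DataFrame({
    "tripid": [1, 2, 3, 4, 5],
    "Destination purpose": ["Went home", np.nan, np.nan, "Went grocery shopping", np.nan],
    "Primary Mode": ["Household vehicle 1", np.nan, "Bus (public transit)", np.nan, np.nan],
})

# Drop a row only if BOTH columns are missing; partial rows are kept
cleaned = toy_trips.dropna(
    subset=["Destination purpose", "Primary Mode"],
    how="all",
)
print(cleaned)
```

Rows with just one of the two fields missing stay in `cleaned`, so they can still be used for analyses that need only the field that is present.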

## Finding Outliers

### Finding Outliers Based on Distance of Trip

In [53]:
#Find the average distance of all trips
#replace() and dropna() return new Series, so trips_table is left unchanged
path_distance = trips_table["trip_path_distance"]

#Replacing blank strings with NaN values, then dropping them
path_distance = path_distance.replace(' ', np.nan).dropna()

#Convert string to numeric
numeric = pd.to_numeric(path_distance)

#Finding the average
avg = numeric.mean()
avg

9.679149494596356

In [65]:
#Checking for distances over 20 miles
over_20 = numeric[numeric > 20]
over_20.count()

2521

In [66]:
over_100 = numeric[numeric > 100]
over_100.count()

272

In [73]:
over_1000 = numeric[numeric > 1000]
over_1000.count()

83

In [74]:
over_3000 = numeric[numeric > 3000]
over_3000.count()


10

There are 272 trips that covered over 100 miles, and 83 that covered over 1,000 miles. Since the average distance of all trips is around 10 miles, even taking into account the other primary modes of transportation, trips that exceeded 100 miles seem to be outliers; they account for only about 0.5% of all trips. Strangely enough, 10 trips reported over 3,000 miles covered, with the longest covering over 6,000 miles in a single trip.
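The counts and shares above each threshold can be computed in one pass instead of one cell per cutoff. The sketch below uses a short made-up Series (`toy_distances` is hypothetical) and relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import pandas as pd

# Toy distances in miles; illustrative values only
toy_distances = pd.Series([0.5, 1.1, 2.3, 3.3, 8.1, 20.6, 251.6, 807.7, 2627.3, 6.7])

thresholds = [20, 100, 1000]

# For each cutoff: how many trips exceed it, and what share of all trips that is
summary = pd.DataFrame(
    {
        "count": [int((toy_distances > t).sum()) for t in thresholds],
        "share": [float((toy_distances > t).mean()) for t in thresholds],
    },
    index=pd.Index(thresholds, name="miles_over"),
)
print(summary)
```

Applied to `numeric`, the `share` column would give the percentage of trips above each cutoff directly, which makes figures such as the 0.5% above easy to reproduce.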

### Finding Outliers Based on Speed of Trip

In [75]:
#Find the average speed of all trips
#replace() and dropna() return new Series, so trips_table is left unchanged
path_speed = trips_table["speed_mph"]

#Replacing blank strings with NaN values, then dropping them
path_speed = path_speed.replace(' ', np.nan).dropna()

#Convert to numeric
numeric_speed = pd.to_numeric(path_speed)

#Finding the average
avg_speed = numeric_speed.mean()
avg_speed

17.21969107087986

In [77]:
#Checking for speeds under 1
under_1 = numeric_speed[numeric_speed < 1]
under_1.count()

703

In [79]:
#Checking for speeds over 50
over_50 = numeric_speed[numeric_speed > 50]
over_50.count()

674

### Checking Outliers in Purpose of Trip

In [85]:
destination_count = trips_table["Destination purpose"].value_counts()
destination_count

Went home                                                                            16857
Went to primary workplace                                                             6604
Went to restaurant to eat/get take-out                                                4336
Dropped off/picked up someone (e.g., son at a friend's house, spouse at bus stop)     3173
Went grocery shopping                                                                 3095
Went to exercise (e.g., gym, walk, jog, bike ride)                                    2945
Went to other shopping (e.g., mall, pet store)                                        2546
Went to work-related place (e.g., meeting, second job, delivery)                      2073
Attended social event (e.g., visit with friends, family, co-workers)                  1983
Conducted personal business (e.g., bank, post office)                                 1811
Attended recreational event (e.g., movies, sporting event)                            1252

In [89]:
destination_count_under_300 = destination_count[destination_count < 300]
destination_count_under_300

Other social/leisure (rMove only)      246
Went to other work-related activity    138
Went to the zoo                          1
Name: Destination purpose, dtype: int64

### Checking for Outliers in Primary Mode of Transportation

In [87]:
primary_mode_count = trips_table["Primary Mode"].value_counts()
primary_mode_count

Household vehicle 1                                          22328
Walk, jog, or wheelchair                                     10827
Household vehicle 2                                           6787
Bus (public transit)                                          3843
Friend/colleague's car                                        1261
Bicycle or e-bike                                             1199
Other household vehicle                                        724
Household vehicle 3                                            715
Urban rail (e.g., Link light rail, monorail)                   646
Car from work                                                  593
Other hired service (e.g., Lyft, Uber)                         563
Rental car                                                     370
School bus                                                     354
Other non-household vehicle                                    242
Private bus or shuttle                                        

In [90]:
primary_mode_count_under_100 = primary_mode_count[primary_mode_count < 100]
primary_mode_count_under_100

Household vehicle 4                93
Taxi (e.g., Yello Cab)             60
Commuter rail (Sounder, Amtrak)    52
Other rail (e.g., streetcar)       28
Paratransit                        17
Other motorcycle/moped/scooter     13
Household vehicle 6                 2
Hopping                             1
Skateboard                          1
Name: Primary Mode, dtype: int64