# Introductory workshop Machine Learning: Data cleaning
Welcome to this introductory workshop on machine learning.

During this workshop you will learn:
 - How to explore your data 
 - How to clean your data 
 - How to create a machine learning model to predict whether a bus is delayed or on time.
 
 
 
The raw dataset comes directly from Kolumbus and has not been processed in any way.
 
 
You will work with the following data:

| Name | Description | Example |
|----|:-----------|:--------|
| Delay | Total delay in seconds relative to the timetable. <br>A negative delay means the bus is ahead of the timetable, and a positive delay means it is behind. This delay comes from Kolumbus's real-time system.<br>The delay takes into account where the bus is along the route (even between stops) and is updated continuously.<br>Delay=-25 means the bus is 25 seconds ahead of the timetable at its current position. How this variable is calculated is a trade secret of the vendor of the real-time system. Kolumbus says it should be accurate and that it uses all available data in the calculation. | -25.0 |
| Line | Line is the short name of a route (the one displayed on the bus). Oslo-Stavanger and Stavanger-Oslo can both be Line 7. | 8 |
| LineID | LineID is a route. Oslo-Stavanger and Stavanger-Oslo have different LineIDs but the same Line. Note that two entries with the same LineID do not always drive exactly the same route. | 3022 |
| TripID | TripID is a specific departure. Oslo-Stavanger at 15:00 has a unique TripID but the same LineID as Oslo-Stavanger at 16:00. A TripID is used only once per day. | 30222006 |
| DirectionRef | A binary variable that is either 'go' or 'back'. Oslo – Stavanger vs. Stavanger – Oslo. | 'go' |
| TripHeadsign | The text shown on the front of the bus. | 'Tasta - Vardeneset – Randaberg' |
| DestinationAimedArrivalTime | When the bus is scheduled to be at the final stop of the current route. | 2017-03-06 00:19:00 |
| DestinationTerminalCode | The ID of the last stop. Every stop has a unique ID, but not necessarily a unique name. Two stops on opposite sides of the road often share a name but have different IDs. | 11275943 |
| To | The name of the final destination (not necessarily the name of the final stop). | 'Randaberg sentrum' |
| OriginAimedDepartureTime | When the bus departs from the first stop. | 2017-03-05 23:44:00 |
| OriginStopTerminalCode | The ID of the first stop. | 11034315 |
| From | The start destination (not necessarily the name of the first stop). | 'Stavanger hpl. 25' |
| DistanceBetweenStops | Metres remaining to the next stop. This is believed to be fairly accurate and to follow the actual route rather than the straight-line distance. | 20.0 |
| PercentageBetweenStops | Linked to DistanceBetweenStops. How far the bus has come along the route. At PercentageBetweenStops=0 the bus is at the previous stop, and at PercentageBetweenStops=100 it is at the next stop. | 5.0 |
| Heading | Rotation in degrees. Heading=0 is due north. | -80.2347288919 |
| Latitude | Map coordinates. | 58.98848 |
| Longitude | Map coordinates. | 5.673485 |
| NextStop | The name of the next stop. Several stops can share a name, e.g. on opposite sides of the road. | 'Eskelandstunet' |
| NextStopCode | All stops have unique IDs. Two stops on opposite sides of the road can have the same name but will always have a unique ID. E.g. the ID for Eskelandstunet. | 11031438 |
| RecordedAtTime | The time at which the bus sent the current row. It can be used together with OriginAimedDepartureTime and DestinationAimedArrivalTime to find the progress along the route. | 78.4 |
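
The description of RecordedAtTime suggests combining it with the two scheduled timestamps. A minimal sketch of one way to do that is shown here, assuming progress is measured as the share of the scheduled trip time that has elapsed. The column name `ScheduledProgressPct` and the RecordedAtTime value are made up for illustration.

```python
import pandas as pd

# Toy row: the two scheduled times are the examples from the table,
# the RecordedAtTime value is invented for this sketch.
row = pd.DataFrame({
    "OriginAimedDepartureTime": pd.to_datetime(["2017-03-05 23:44:00"]),
    "DestinationAimedArrivalTime": pd.to_datetime(["2017-03-06 00:19:00"]),
    "RecordedAtTime": pd.to_datetime(["2017-03-05 23:51:00"]),
})

# Scheduled duration of the whole trip and time elapsed since scheduled departure
scheduled = row["DestinationAimedArrivalTime"] - row["OriginAimedDepartureTime"]
elapsed = row["RecordedAtTime"] - row["OriginAimedDepartureTime"]

# Dividing two timedeltas gives a plain number, here scaled to percent
row["ScheduledProgressPct"] = 100 * elapsed / scheduled
print(row["ScheduledProgressPct"])
```

Note that this measures progress according to the timetable, not the physical position of the bus.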

In [None]:
import read_json
import pandas as pd
from test_dataCleaner import TestDataCleaner
cleaner = TestDataCleaner()
frame = read_json.get_dataframe()

# Inspecting the data
Below is a snippet of the data received from Kolumbus.

As you can see, the data contains many missing values (NaN).
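
To put numbers on how much is missing, you can count the NaN values per column. The sketch below uses a small invented frame (`toy` and its values are assumptions, not the Kolumbus data); the same two calls should work on `frame`.

```python
import numpy as np
import pandas as pd

# Invented toy data with some missing values
toy = pd.DataFrame({
    "Delay": [-25.0, np.nan, 40.0, np.nan],
    "TripId": ["30222006", None, "30222007", None],
    "Latitude": [58.98848, 58.97, 58.96, 58.95],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Share of missing values in each column (between 0 and 1)
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
```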

In [None]:
frame.head(5)

# Dealing with empty fields
Most of the real-time data from Kolumbus is missing route information.

Buses that are not in service continue to transmit their location.

Another issue is that the bus driver has to enter manually which route he or she is driving. As a result, some active buses are not tied to a route, or are tied to the wrong route.


The first task is to remove all rows with missing route information. 
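
As a reminder of how `dropna` behaves with a `subset`, here is a small sketch on invented data. The column names `route`, `delay` and `note` are hypothetical and are not columns in the Kolumbus data.

```python
import numpy as np
import pandas as pd

# Invented toy data; column names are hypothetical
toy = pd.DataFrame({
    "route": ["A", None, "B"],
    "delay": [-25.0, 10.0, np.nan],
    "note": [np.nan, "x", "y"],
})

# Only these columns are required to have a value
required = ["route", "delay"]

# Rows missing a value in any required column are removed.
# A missing value in "note" does not matter.
kept = toy.dropna(subset=required)
print(kept)
```

Only the first row survives: the second row has no `route`, and the third has no `delay`.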


In [None]:
# The columns listed here are required to have a value, or the entire row is removed.
# Example: columns = ['column_1', 'column_2', 'column_3']
columns = []
columns = ['DestinationAimedArrivalTime','OriginAimedDepartureTime','Delay','TripId'] # CORRECT ANSWER, TO BE REMOVED

test_frame = frame.dropna(subset=columns) 

cleaner.remove_null_rows(test_frame)

In [None]:
# Copy so the conversions below do not write into a slice of the original frame
frame = test_frame.copy()

# Parse the timestamp columns as datetimes
for column in ['RecordedAtTime', 'OriginAimedDepartureTime', 'DestinationAimedArrivalTime']:
    frame[column] = pd.to_datetime(frame[column])

# Inspecting the data II
With most of the missing values now removed, we can take a better look at the data and all the available fields.

In [None]:
frame.head(5)

In [None]:
frame[['Longitude','Latitude']].describe(include='all')

# Duplicate rows

The real-time transit system is updated about every second. The buses, however, only update their position a couple of times a minute. The data we received from Kolumbus therefore contains many duplicates, because the previous update is repeated until a new one is received.

Your task is to remove every duplicated row, that is, every row where all values in the specified columns are the same.

What might be used to identify a duplicated row? Ask yourself:

* Can there be several messages sent at the same time?
* Can buses leave or arrive at the same time?
* Could they be at the same place at the same time?
* How specific do you need to be with Line, LineID or TripID?
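
Before dropping anything, it can help to count how many rows would be treated as repeats for a given choice of key columns. The sketch below uses invented data with hypothetical column names (`vehicle`, `sent_at`, `lat`); which columns make a good key in the Kolumbus data is for you to decide.

```python
import pandas as pd

# Invented toy data; column names are hypothetical
toy = pd.DataFrame({
    "vehicle": ["A", "A", "A", "B"],
    "sent_at": pd.to_datetime([
        "2017-03-05 12:00:00",
        "2017-03-05 12:00:00",  # repeat of the row above
        "2017-03-05 12:00:20",
        "2017-03-05 12:00:00",  # same time, but a different vehicle
    ]),
    "lat": [58.97, 58.97, 58.98, 58.90],
})

key = ["vehicle", "sent_at"]

# duplicated() marks every row whose key has already been seen
n_repeats = toy.duplicated(subset=key).sum()

# drop_duplicates() keeps the first row for each key
deduplicated = toy.drop_duplicates(subset=key, keep="first")

print(n_repeats)
print(deduplicated)
```

If the key is too narrow (for example only `sent_at`), rows from different vehicles are wrongly removed. If it is too wide, real repeats slip through.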

#### Example
Study the updates from the bus below. Do you notice any similarities?

In [None]:
frame[frame['TripId'] == '10061098'].head(3) 

In [None]:
# Removes all duplicate rows where the columns listed here have identical values.
columns = ['column_1', 'column_2', 'column_3']
columns = ['RecordedAtTime', 'TripId'] # CORRECT ANSWER, TO BE REMOVED

test_frame = frame.drop_duplicates(subset=columns)
cleaner.remove_duplicate_entries(test_frame)

In [None]:
frame = test_frame

# Drop redundant, duplicate and hard coded columns



In this section you will remove three kinds of columns:

* Columns with hard-coded values, where every row holds the same value.
* Redundant columns.
* Columns with missing values.

In [None]:
frame.head(10)

In [None]:
frame[['Delay']].describe(include='all')

In [None]:
frame['Line'].value_counts().head(20)

### Columns with one repeating value

As you might have noticed, some of the columns in the Kolumbus data contain one repeating value for all rows.
Find these columns and remove them.
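
One way to spot such columns is to count the distinct values in each column. This is a minimal sketch on invented data with hypothetical column names (`mode`, `delay`, `visit`), not the answer to the exercise.

```python
import pandas as pd

# Invented toy data; column names are hypothetical
toy = pd.DataFrame({
    "mode": ["bus", "bus", "bus"],
    "delay": [-25.0, 10.0, 40.0],
    "visit": [1, 1, 1],
})

# A column with exactly one distinct value carries no information.
# dropna=False counts NaN as a value of its own.
constant_columns = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

trimmed = toy.drop(columns=constant_columns)
print(constant_columns)
print(trimmed.columns.tolist())
```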


In [None]:
columns = [] # Removes the columns listed here.
columns = ['NextStopVisitNumber', 'VehicleModes'] # CORRECT ANSWER, TO BE REMOVED

test_frame = frame.drop(columns=columns)
cleaner.remove_hard_coded_columns(test_frame)

In [None]:
frame = test_frame

### Columns with missing rows

Some of the columns still contain rows with missing data (NaN). In this case, these columns are not important and should be removed.
Your next task is to find these columns and remove them.
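
A short sketch of how columns with missing values can be listed, using invented data with hypothetical column names (`delay`, `heading`, `sign`):

```python
import numpy as np
import pandas as pd

# Invented toy data; column names are hypothetical
toy = pd.DataFrame({
    "delay": [-25.0, 10.0, 40.0],
    "heading": [np.nan, 90.0, 180.0],
    "sign": [None, None, "Randaberg"],
})

# isna().any() is True for each column that has at least one missing value
columns_with_missing = toy.columns[toy.isna().any()].tolist()

trimmed = toy.drop(columns=columns_with_missing)
print(columns_with_missing)
print(trimmed.columns.tolist())
```

Dropping a whole column is only sensible when the column is not needed later, as is the case in this exercise.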

In [None]:
columns = [] # Removes the columns listed here. 
columns = ['Heading','IsMonitored','TripHeadsign'] # CORRECT ANSWER, TO BE REMOVED

test_frame = frame.drop(columns,1) # CORRECT ANSWER, TO BE REMOVED
cleaner.remove_columns_with_null_values(test_frame)

In [None]:
frame = test_frame

### Redundant columns

Some columns might contain the same information as others. This is also the case for this dataset.

Use the tools you have seen so far to find them. Keep the columns that are most similar to the rest of the data.
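
One way to test a suspicion is to rebuild one column from the others and compare. The sketch below is only an illustration: the column names and the `"lat,lon"` text format of `coords` are assumptions made up for this example, and the real data may be formatted differently.

```python
import numpy as np
import pandas as pd

# Invented toy data; column names and the "lat,lon" format are assumptions
toy = pd.DataFrame({
    "lat": [58.97, 58.98],
    "lon": [5.73, 5.67],
    "coords": ["58.97,5.73", "58.98,5.67"],
})

# Split the text column into two numeric columns (labelled 0 and 1)
parsed = toy["coords"].str.split(",", expand=True).astype(float)

# If the parsed values match the existing columns, "coords" adds nothing new
is_redundant = bool(
    np.allclose(parsed[0], toy["lat"]) and np.allclose(parsed[1], toy["lon"])
)
print(is_redundant)
```

Here the numeric columns are the ones to keep, since they are easier to filter and feed to a model than text.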


In [None]:
columns = ['column_1'] # Removes the columns listed here.
columns = ['Position'] # CORRECT ANSWER, TO BE REMOVED

test_frame = frame.drop(columns=columns)
cleaner.remove_hard_coded_columns(test_frame)

In [None]:
frame = test_frame

# Outliers

In this step you remove outliers and rows with implausible values. Poor data quality as input results in poor predictions as output.

Hints:

* You have data from one day, or do you?
* All bus routes should be in the [Rogaland area](https://www.google.com/maps/place/Rogaland/@58.9350028,5.2741278,9z/data=!3m1!4b1!4m5!3m4!1s0x463a353f2adcd70b:0xe0061cba0b0cc0bc!8m2!3d59.1489544!4d6.0143432), but sometimes the GPS freaks out.
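
Both hints can be checked with a few lines of pandas. The sketch below uses invented rows, and the bounding box limits are rough assumptions for illustration, not official borders of Rogaland.

```python
import pandas as pd

# Invented toy data using the column names from the table
toy = pd.DataFrame({
    "RecordedAtTime": pd.to_datetime([
        "2017-03-05 12:00:00",
        "2017-03-05 12:00:20",
        "2017-03-05 12:00:40",
        "2017-03-07 08:00:00",  # a stray row from another day
    ]),
    "Latitude": [58.98848, 0.0, 58.97, 63.4],
    "Longitude": [5.673485, 0.0, 5.73, 10.4],
})

# Hint 1: how many rows are there per calendar day?
rows_per_day = toy["RecordedAtTime"].dt.date.value_counts()

# Hint 2: keep only positions inside a rough box (assumed limits)
LAT_MIN, LAT_MAX = 58.0, 60.0
LON_MIN, LON_MAX = 4.5, 7.5
inside = (
    toy["Latitude"].between(LAT_MIN, LAT_MAX)
    & toy["Longitude"].between(LON_MIN, LON_MAX)
)
cleaned = toy[inside]

print(rows_per_day)
print(cleaned)
```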
     
Study the Delay column and decide on a reasonable range. A large delay might indicate engine trouble or other abnormalities, so make sure to remove it!

Remember that a _positive_ 'Delay' is the number of seconds _behind_ schedule, i.e. a bus with Delay=600 is 10 minutes behind schedule.

However, if the bus is way ahead of schedule, that's not right either:

* What's the distance between two stops anyway?
* If a bus is driving towards its first station, that should not be counted as it being early.
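
A range filter on Delay could look like the sketch below. The data is invented, and the limits `LOWER` and `UPPER` are example values only; picking sensible limits from the real distribution is part of the exercise.

```python
import pandas as pd

# Invented toy data
toy = pd.DataFrame({"Delay": [-25.0, 600.0, -5000.0, 90000.0, 120.0]})

# Delay is in seconds; converting to minutes makes the values easier to judge
toy["DelayMinutes"] = toy["Delay"] / 60

# Example limits in seconds (assumptions, not recommended values)
LOWER, UPPER = -1000, 4000

# between() is inclusive at both ends by default
in_range = toy["Delay"].between(LOWER, UPPER)
cleaned = toy[in_range]

print(cleaned)
```

Looking at `describe()` or a histogram of Delay before choosing the limits helps avoid cutting away ordinary rush-hour delays.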

In [None]:


frame = frame[(frame['OriginAimedDepartureTime'] < frame['DestinationAimedArrivalTime'])]

In [None]:
(frame['Latitude'] > 61).sum()
(frame['Longitude'] > 7.3).sum()

In [None]:
frame['Delay'].describe()
frame['RecordedAtTime'].describe()

In [None]:
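frame = frame[frame['Delay'] > -1000] # Drop buses more than 1000 seconds ahead of schedule
frame = frame[frame['column_1'] < 4000] # Replace 'column_1' with the column you want to limit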

In [None]:
# Rows recorded before the scheduled departure from the first stop
frame[frame['RecordedAtTime'] < frame['OriginAimedDepartureTime']]