# Preliminaries: Jupyter Basics

[Jupyter](https://www.jupyter.org) notebooks are an interactive coding environment for Python (as well as other programming languages). You can write and run code, view results and graphs, and include descriptions all in one place. You can run individual blocks of code one at a time, jump around to run different blocks in the notebook, or run all of the code in the notebook from start to finish. This flexibility makes Jupyter a great tool for analysis.

All of the content and code in a notebook is organized into "cells" that can be edited and run individually. This paragraph is contained in what is called a "markdown" cell - just descriptive text with some basic formatting options. Try double clicking on this paragraph to change it into editing mode. Then press Ctrl+Enter to "run" the cell and format the text. You can also click "Run" on the toolbar at the top of the screen to execute a cell.

Next we will look at a "code" cell.

In [None]:
# This is a code cell. The hash mark at the beginning of this line means that this is a comment.
# Everything below this line is executable code. Select this cell and press Ctrl+Enter to run the calculation below.

5 * 3

Notice how the output appears just below the code cell. You can include multiple lines of code in a cell and execute them with a single command. Select the cell below and press Ctrl+Enter to execute the code. The output will be the result of the last line ($x + y$).

In [None]:
x = 10
y = 3 * 2
x + y

You can easily create and view plots. Run the cell below. Don't worry about understanding what this code is doing - we'll get to that later.

In [None]:
import plotly.express as px
fig = px.scatter(x=[0, 1, 2, 3, 4], y=[0, 1, 4, 9, 16])
fig.show()

You can also run multiple consecutive cells by highlighting them (click the first cell then hold the Shift key while you click the last cell you'd like to run) and pressing Ctrl+Enter. Try that on the cells above.

One word of caution: unlike "typical" programs that run code top to bottom, you can execute cells in any order in Jupyter. That means that if you run a cell, run other cells further down that change variable values, then jump back up to run the first cell again, you may get a different result!

Try it on the two cells below. When you run the first cell, you should get the same result that you got for $x + y$ when you ran it above. Run the second cell below to change the $x$ value and re-run the first cell. Your result will change!

In [None]:
x + y

In [None]:
x = x - 1

This simple example may not seem like a big deal. But as your notebooks become more complicated and you start to bounce around to try different ways of analyzing your data, this can lead to the very frustrating situation of being unable to replicate a result. One way to avoid this: every few cells, select Run All Above from the Cell dropdown menu at the top of the notebook. This will re-run the whole notebook up to that point, which will hopefully reset any variables. Try it on the cell above.

If you ever want to completely start a session over, select Restart & Clear Output from the Kernel dropdown above. 

You'll be getting more comfortable with Jupyter notebooks as you work through the exercises. Hopefully this gives you enough to get started. Now on with the show!

# Introduction

For this course, we are going to be analyzing flight data from the US Federal Aviation Administration (FAA). With this data, we are going to take you through what it might look like to ingest, process, and analyze data to derive insights. 

To give some background on the data: 

This data comes from two systems: the Airport Surface Detection Equipment (ASDE-X) and the Airport Surface Surveillance Capability (ASSC). These two systems essentially track the movement of aircraft on the surface of an airport. More information regarding these two systems can be found here: https://www.faa.gov/nextgen/programs/adsb/atc/assc/ and https://www.faa.gov/air_traffic/technology/asde-x/. We will be working with a simplified set of data that has been merged from these two sources and stripped down to a small subset of the available fields. 

The flights in this data operated from 1-1-2020 to 9-1-2020. Flights have been filtered to include only 4 airlines: American, Delta, United, and Southwest.

# Ingesting Data + Prep

Start by importing the pandas library. Pandas provides a data set structure called a "dataframe" - a tabular data structure of rows and columns - and provides many easy functions (commands) for manipulating data in dataframes. It is amazingly helpful and is a foundational tool for analysis in Python.

In [None]:
# Import Libraries

import pandas as pd

Now we need to load the data into the Jupyter environment. Pandas and Python provide connectors to a wide range of data formats and sources, but for today, we will just be using text files formatted as comma-separated values (CSV). To load the data, we will use the pandas *read_csv* function. If we simply tell this function where to find our CSV file, it will import the file into a dataframe. For more information about this function, you can go to the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). While this type of documentation might seem overwhelming at first, it becomes incredibly useful as you use Python more.

In [None]:
# Ingest Data

df = pd.read_csv(r'/home/jovyan/flight_data.csv')

The variable 'df' is now a dataframe containing the flight data that we read in from the file. Let's take a quick peek at what this variable holds.

In [None]:
df

Nice and neat - rows and columns of data. We will explore this data more below. 

But first, let's load up another data set that we will use later in the notebook. Why don't you give it a try? The file is located in the same folder as our other file and is called airport_data.csv

In [None]:
# Answer

df_airport = pd.read_csv(r'/home/jovyan/airport_data.csv')

Great! Now that we have the data imported, let's take a closer look!

# Basic Info

Whenever you receive a new dataset, you should start with getting a basic understanding of what kind of data you are dealing with. Let's look at how much data we have, what attributes we have, and what kind of data is stored in each attribute. 

The first thing we can do is look at how many observations we have and how many columns - or features - we have. Two dataframe attributes tell us this: size and shape. (They are attributes rather than methods, so they are written without parentheses.)

In [None]:
print("Size of DF: ", df.size)
print("Shape of DF: ", df.shape)

The size attribute gives the number of elements in a dataframe, while the shape attribute gives the number of rows (observations) and columns (features) in our dataframe. If you were to multiply the two numbers in shape, you should get the number that size reports.
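
To make that relationship concrete, here is a minimal sketch on a tiny made-up dataframe (the name `toy` and its values are illustrative assumptions, not the FAA data):

```python
import pandas as pd

# Toy dataframe with made-up values: 3 rows, 2 columns
toy = pd.DataFrame({
    "airport": ["KLAX", "KSFO", "KJFK"],
    "event": ["on", "off", "on"],
})

n_rows, n_cols = toy.shape           # shape is a (rows, columns) tuple
print(toy.shape)                     # (3, 2)
print(toy.size)                      # 6
print(n_rows * n_cols == toy.size)   # True
```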

Because we have another dataset here that contains airport attributes, let's try getting the size and shape of this dataset. This dataset is called df_airport.

In [None]:
# Answer

size = df_airport.size
shape = df_airport.shape

print("Size of Dataframe: ", size)
print("Shape of Dataframe: ", shape)

Returning to the original dataframe, 'df', we can learn about the type of data in each column with the dtypes attribute.

In [None]:
print("Column Types of DF: ", df.dtypes)

We can see that most of the columns we have to work with are of type object, which typically means that they can hold any data type; in this case, as we will see, these object columns are mostly strings. Are there any object fields that seem like they should have a different data type? If so, we may need to convert them to the correct type before using them in our analysis. There are also a few columns that seem to be numbers, namely track and stid.
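
As an illustration of what such a conversion might look like, the sketch below turns a text column into datetimes with `pd.to_datetime`. The sample values and their format are assumptions made for this example; the real time column may be formatted differently.

```python
import pandas as pd

# Hypothetical sample: 'time' is stored as text, 'track' as integers
toy = pd.DataFrame({
    "time": ["2020-01-01 08:15:00", "2020-01-01 09:40:00"],
    "track": [101, 102],
})
print(toy.dtypes)  # 'time' is still a text (object/string) column here

# Convert the text column to real datetimes
toy["time"] = pd.to_datetime(toy["time"])
print(toy.dtypes)  # 'time' is now a datetime column

# Datetime columns support date arithmetic, e.g. the gap between two events
print(toy["time"].max() - toy["time"].min())
```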

Next, let's actually look at a snippet of the data using a method called head(), which returns the first few rows of the data set.

In [None]:
df.head()

You'll notice that the method head() returns only 5 observations. This is the default behavior, but we can actually change the number of observations that it returns.

To do this, we need to change what is called a parameter. A parameter is a user-defined value that is fed into the method for the method to use. In the case of the head() method, the parameter that changes the number of observations is called 'n', and it should be an integer. Why don't you give it a try? Let's try to show 10 observations instead of 5. If you need help, this documentation will help: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html

In [None]:
# Answer

head = df.head(n = 10)
head

You may have noticed that you can use either df.head(10) or df.head(n = 10). Both work! However, it is good practice to name your parameters explicitly so that you aren't confused, particularly when a method takes more than 1 parameter.
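
A quick way to convince yourself that the two spellings are equivalent is to compare their results. This is a small sketch on a made-up dataframe (`toy` and its call signs are illustrative assumptions):

```python
import pandas as pd

# Toy dataframe with made-up call signs
toy = pd.DataFrame({"call_sign": ["AAL1", "DAL2", "UAL3", "SWA4"]})

positional = toy.head(2)   # the value is matched to 'n' by position
keyword = toy.head(n=2)    # the value is matched to 'n' by name

print(positional.equals(keyword))  # True
```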

Returning to the data, what do you notice from this sample?

Each row represents a single event for a given flight. The flights are identified by their call signs. From this snippet, we can see two kinds of events in the data set: on (wheels down - i.e., landing) and off (wheels up - i.e., takeoff). We'll need to check later on that those are the only values in the data set. The time field is the time that the event occurred, and the timestamp seems to be that time rounded to the nearest hour - another assumption to verify later.

The airport field appears to be the airport at which the event occurred, while the departure_airport and destination_airport define the flight's origin and destination. This provides another opportunity for validation: we would expect that the airport and departure_airport should match for an "off" event, and the airport and destination_airport should match for an "on" event. That checks out for the small sample above, but we'll want to vet that assumption against the full data set.
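
One possible way to run that kind of check is sketched below on made-up rows that reuse the column names from the data. The values are invented for illustration, and the comparison syntax is covered in the subsetting exercises later in this notebook.

```python
import pandas as pd

# Made-up rows; the third one deliberately breaks the expectation
toy = pd.DataFrame({
    "event": ["off", "on", "off"],
    "airport": ["KLAX", "KSFO", "KJFK"],
    "departure_airport": ["KLAX", "KLAX", "KATL"],
    "destination_airport": ["KSFO", "KSFO", "KJFK"],
})

# An "off" event should happen at the departure airport...
off_ok = (toy["event"] == "off") & (toy["airport"] == toy["departure_airport"])
# ...and an "on" event should happen at the destination airport
on_ok = (toy["event"] == "on") & (toy["airport"] == toy["destination_airport"])

consistent = off_ok | on_ok
print(toy[~consistent])  # rows that do not match the expectation
```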

The track and stid fields seem to be some sort of identifiers - good to know in case we have to use them to integrate with another data set later on, but probably not informative on their own.

Finally, it seems the "timestamp" column is the nearest hour of the "time" column. 

One other thing you might notice is that on the left side, there are numbers starting from 0. Together these are called an index, and they enumerate the rows of the dataframe. The numbering starts from 0 because Python is a zero-indexed programming language - the first row in Python is row 0.

Great! Now that we have finished looking at a snippet of the data, let's move on to cleaning it!

# Practice Exercises

We will be performing a lot of data manipulations throughout these notebooks. There are several basic operations that are going to be extremely helpful in doing this. In this section, we will go over these techniques and give you practice so that you are familiar with them.

In this section, we will cover 4 topics: types of data structures, how to subset them, how to manipulate them, and how to use basic functions on them. In this short introduction, we will barely scratch the surface, but there is nothing better than getting your feet a little wet before jumping head first into the pool!

### Data Structures

Data structures are ways to organize and store data so that they can be accessed and worked with efficiently. They define the relationship between data and make it easy for various operations to be performed on the data. While there are many types of data structures, they are generally broken up into two types: primitive data structures and non-primitive data structures. 

#### Primitive data structures


Primitive data structures are the basic building blocks of data manipulation and generally contain simple values of data. There are 4 primitive types:

    1) Integers - these are whole numbers, positive or negative, like 0, -4, or 1
    2) Floats  - these are numbers with a fractional part, written with a decimal point, such as 3.14 or 3.65
    3) Strings - these are collections of characters (letters, punctuation, etc.) such as 'cake' or '+c00kie+'
    4) Boolean - this is a logical data type that can only take the value of True or False. Booleans are useful in conditional and comparison expressions.
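
The four types above can be seen directly in Python with the built-in `type` function. The variable names in this short sketch are chosen purely for illustration:

```python
whole = 4          # integer
decimal = 3.14     # float
word = "cake"      # string
flag = whole > 1   # boolean, produced by a comparison

# Print each value next to the name of its type
for value in (whole, decimal, word, flag):
    print(value, type(value).__name__)
```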


#### Non-Primitive data structures

Non-primitive data structures are compound data structures - they collect primitive elements into a larger object. Think of a table in an Excel spreadsheet where each cell is a primitive (holding a single number, string, or boolean), and the table as a whole can be thought of as a non-primitive data structure.

There are many non-primitive data structures like arrays, tuples, etc., but for this notebook, we will briefly cover two that are pertinent to what we will be doing.

    1) Dataframes - 2-dimensional labeled data structure with columns of potentially different types
    2) Series - one-dimensional labeled array capable of holding any single data type
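
The two structures are closely related: selecting a single column of a dataframe gives back a series. A minimal sketch, using a made-up dataframe with approximate coordinates chosen only for illustration:

```python
import pandas as pd

# Toy dataframe: two columns of different types
toy = pd.DataFrame({
    "airport": ["KLAX", "KSFO"],
    "latitude": [33.94, 37.62],
})

one_column = toy["airport"]      # a single column is a Series
still_table = toy[["airport"]]   # a list of columns stays a DataFrame

print(type(toy).__name__)          # DataFrame
print(type(one_column).__name__)   # Series
print(type(still_table).__name__)  # DataFrame
```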

### Subsetting data

Going forward, we are primarily going to be looking at the dataframe, as this is the most important data structure for analysis. At points, we will use lists and arrays, but most of the notebooks will focus on manipulating and wrangling the data that is in the various dataframes that we have built.

One major part of working with dataframes is using only the parts of the data that are needed. In other words, we want to take slices of the data that pertain to the task at hand. To do this, we need to subset the data. There are many ways to slice and dice the data, but here we will just cover a few. 

### Selecting Data using Labels

The first way of selecting data is using labels (aka column headings) to select the columns of data. To do this, we use square brackets [] with the name of the desired column(s) in quotes. For example, using the dataset above, we can select the airport column as shown below.

In [None]:
# TIP: we will use the .head() method to make the output shorter
df_airport['airport'].head()

Perfect - we've selected just the *airport* column. You can also use the notation of a period followed by the unquoted column name as below. Compare the outputs of the previous command and this one - they should match.

In [None]:
# Method 2
df_airport.airport.head()

We can also pass a list of column labels/names to subset the data. This is also useful when we need to reorganize our data, as the order of columns in the output will match the order of the input field names.

In [None]:
df

In [None]:
# Input the list of fields directly...
df_airport[['airport','latitude','longitude']].head()

# ...or store the list as a variable and use that instead
# (named 'cols' rather than 'list' so that Python's built-in list is not overwritten)
cols = ['airport','latitude','longitude']
df_airport[cols].head()

Before moving on, let's do some exercises on how to subset the data. This time using the df dataframe, let's try to subset it. 

In [None]:
# Answer
# Select the airport and call_sign columns from the dataframe
df[['airport','call_sign']]

# What happens when you flip the order of the columns?
# the columns get reordered!
df[['call_sign','airport']]

# What happens when you ask for a column that doesn't exist? Try using the label time_to_land
# pandas raises a KeyError!
df['time_to_land']

### Slicing the data using Indexes

For selecting rows, we can subset the data using indexes. Indexes refer to the position within an iterable (more on this later). Essentially, indexes are numeric labels showing the position of an element or value in the data structure. In a dataframe, each row (or observation) is indexed. Thus, we can subset rows using their indexes.

One quick analogy that might help is jury duty, where each juror is assigned a number. That number is your index. So when they say, "Juror number 2", you know that they are referring to you.

Below we will demonstrate how to select rows and/or columns from a dataframe using its index. To slice out a set of rows, we will use the following syntax: *dataframe[start:stop]*. When slicing in pandas this way, the start boundary is included in the output while the stop boundary is one step beyond the last row you want.
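
Here is a minimal sketch of the start/stop rule on a five-row toy dataframe (the call signs are made up). Printing the index of each slice shows exactly which rows were kept, and that the rows keep their original index values:

```python
import pandas as pd

# Toy dataframe with made-up call signs, indexed 0 through 4
toy = pd.DataFrame({"call_sign": ["AAL1", "DAL2", "UAL3", "SWA4", "AAL5"]})

first_two = toy[0:2]    # start 0 is included, stop 2 is excluded
last_three = toy[-3:]   # negative start counts back from the end

print(first_two.index.tolist())   # [0, 1]
print(last_three.index.tolist())  # [2, 3, 4]
```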

In [None]:
# Selecting the first 2 rows
df[0:2]

# Selecting the last 3 rows
# Note 1: we are able to do this using negative numbers to count back from the end of the data
# Note 2: if you don't include a bound, the bound will go to the beginning or end of the data set
df[-3:]

You'll notice that each row's index is displayed on the very left of the dataframe. When we subset a dataframe, the row indexes do not change. 

Before moving on, let's do some more exercises.

In [None]:
# Answer
# Select the first 5 rows of the df dataframe.
df[:5]

# Select the last 10 rows of the df dataframe. 
df[-10:]

# Select rows 25 - 29 of the df dataframe.
df[25:30]

### Subsetting Rows and Columns

To select both rows and columns, we can use either label or integer-based indexing. There are generally two ways to do this:

1) loc which is primarily label based indexing

2) iloc which is primarily integer based indexing
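
One difference between the two is easy to miss: a slice in loc includes its stop label, while a slice in iloc excludes its stop position. The sketch below shows this on a made-up dataframe (the values are illustrative, not taken from the flight data):

```python
import pandas as pd

# Toy dataframe with the default 0..4 index
toy = pd.DataFrame({
    "airport": ["KLAX", "KSFO", "KJFK", "KATL", "KORD"],
    "event": ["on", "off", "on", "off", "on"],
})

by_label = toy.loc[1:3, ["airport"]]   # labels 1, 2, and 3 (stop label included)
by_position = toy.iloc[1:3, [0]]       # positions 1 and 2 (stop position excluded)

print(len(by_label))     # 3
print(len(by_position))  # 2
```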

Here are examples below.

In [None]:
# Select the first 3 columns (positions 0-2) and rows 3-5 using integer positions
df.iloc[3:6, :3]

# Select columns airport and call_sign and rows 10-15 (loc includes the stop label)
df.loc[10:15, ['airport','call_sign']]

# Selecting all columns for row 2
df.iloc[2, :]

Now some exercises for you to practice

In [None]:
# Answer
# Select the value at row position 6 and column position 3 (remember that counting starts at 0)
df.iloc[6,3]

# Select all the rows and column airport
df.loc[:, 'airport']

# Select rows 10-20 and columns time and airport
df.loc[10:20, ['time','airport']]

### Subsetting the data through criteria

Lastly, we can subset the data through criteria. This is called logical indexing.

For example, we can select rows for events from the Los Angeles airport or from either Los Angeles or San Francisco airports.

In [None]:
# Selecting just LAX observations
df[df.airport == 'KLAX']

# Selecting observations from either LAX or SFO. Note that "|" means "or"
# (the airport codes in this data carry the "K" prefix, so SFO is assumed to be 'KSFO')
df[(df.airport == 'KLAX') | (df.airport == 'KSFO')]

This is very powerful - now we can subset the data based on criteria on what the data contains or how values relate across a given row.
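
Under the hood, a criterion such as `df.airport == 'KLAX'` produces a series of True/False values (often called a mask), and the dataframe keeps only the rows where the mask is True. The sketch below shows this on made-up rows, along with `isin`, a pandas method that can stand in for several "or" comparisons on the same column:

```python
import pandas as pd

# Toy dataframe with made-up events
toy = pd.DataFrame({
    "airport": ["KLAX", "KSFO", "KJFK", "KLAX"],
    "event": ["on", "off", "on", "off"],
})

mask = toy["airport"] == "KLAX"   # one True/False value per row
print(mask.tolist())              # [True, False, False, True]
print(toy[mask])                  # only the rows where the mask is True

# isin checks membership in a list of allowed values
either = toy[toy["airport"].isin(["KLAX", "KSFO"])]
print(len(either))                # 3
```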

Now let's combine everything with some exercises.

In [None]:
# Answer
# Select all rows that are observations of the KLAX airport
df[df.airport == 'KLAX']

# Select all rows that are observations of the KLAX airport and that have an "on" event. Note that "&" means "and"
df[(df.airport == 'KLAX') & (df.event == 'on')]

### Manipulating Data

Now that we can subset the data, we need to learn how to manipulate it. This can involve sorting, dropping, and grouping the data, all of which can be done with what Python calls methods. Moreover, we can use basic functions in combination with these methods. Counting the number of rows or finding the min, max, or sum of the data are all common operations. Functions are simply blocks of code that run when called. Functions and methods are similar, but for simplicity's sake, think of methods as functions that are associated with an object (like a dataframe), while plain functions are not.
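
To illustrate the distinction, the sketch below calls one built-in function (`len`) and a few dataframe methods on a made-up dataframe; the names and values are assumptions for this example only:

```python
import pandas as pd

# Toy dataframe with made-up observations
toy = pd.DataFrame({
    "airport": ["KSFO", "KLAX", "KLAX"],
    "event": ["on", "off", "on"],
})

# A function stands on its own and takes the dataframe as an argument
print(len(toy))  # 3

# A method is attached to the dataframe and called with a dot
print(toy.sort_values("airport"))

# Methods can be chained: group by airport, count rows, then tidy the result
counts = toy.groupby("airport")["airport"].count().reset_index(name="count")
print(counts)
```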

In [None]:
# Sorting the data by time
df.sort_values('time')

# Finding the earliest flight
df.time.min()

# Counting airport observations
# Using reset_index to reset the index of the dataframe
df.groupby('airport')['airport'].count().reset_index(name = 'count')


Now let's do some exercises. 

In [None]:
# Exercises
# Let's find when the last observation occurred
df.time.max()

# Let's count the number of observations by event
df.groupby('event')['event'].count()

# Let's find the number of observations by airport and then sort them from smallest to largest
df.groupby('airport')['airport'].count().reset_index(name = 'count').sort_values('count')