#### The Jupyter Notebook is an interactive environment for writing and running code. 

#### This is a tutorial that introduces you to the Jupyter Notebook, the basics of Python, and working with/visualizing data. 

#### <mark>Yellow highlights indicate a small exercise or task for you to try out.</mark>

# The Jupyter Notebook

Let's get familiar with working in the Jupyter Notebook.

On the left you can navigate to other files and double click on them to open them in a new tab within the Jupyter Notebook. <mark>Open "airline-safety.csv" and leave the tab open. We will use this data set later.</mark>

You can also view multiple tabs at once in the same window. <mark>Open "airplane.interior.jpeg" in a new tab. Now drag the tab to this window and you will see a light blue shaded area where you can drop the tab. You can move the window around if you like. You can go back to having just one tab in the window by dragging the window back to the tabs at the top of your original window.</mark>

You can double click on any text field to edit it. In the drop-down menu above, you can choose whether a cell is for `code`, `markdown` (instructions, like this cell), or `raw`. In this notebook, all markdown cells are for you to read and follow the instructions. Code cells are for you to edit and enter some code to make something happen.

You can add a new cell by clicking the + in the menu at the top. You can move cells up or down by hovering your mouse to the left of the cell until you see the move cursor and dragging the cell. <mark>Try it out.</mark>

# Python 101

<b>Things to keep in mind:</b>\
You can copy and paste code from one cell to another. This is encouraged to avoid small mistakes.\
All code is case sensitive, and any small mistake (e.g., an extra space or a missing quotation mark) can cause an error.\
Remember to run all the relevant cells (Shift+Enter) or the subsequent cells might result in an error.

## Comments

Anything in a code cell that follows a hash `#` on the same line is ignored by the Python interpreter.

Comments are used to make the code easier for you and others to understand. A code cell like the one below indicates where you can write or copy/paste code to fulfill a task that is highlighted in yellow. These tasks are optional, but highly encouraged!

<mark>Try entering a comment in the code cell below `# testing out code cells`\
Hit enter and write `5+5`\
Hit Shift+Enter\
You should get the output `10`</mark>

In [None]:
# your code here

This illustrates one use for Python: a calculator. Feel free to add more code cells (the plus button in the top menu) and try out more calculations.

## Variables (identifiers), strings, and assigning values

Assigning a value to a variable allows us to later use the variable in commands. This is done using the `=` operator.

In [None]:
# assign the value 567 to the name 'x' and the value 4 to the name 'Y'
x=567
Y=4

You might notice that when you ran the above cell, nothing happened. This is because you didn't ask for anything to be displayed. The only instructions you gave were to assign some values to some names, and the Python interpreter has understood this.

In general, if you run a code cell and nothing happens, do not panic! This means the Python interpreter has understood the instructions you wrote but there is just no output. If you see an error, however, there is something that the interpreter did not understand.

Try using x and Y with an operator (e.g., `x+Y`). Don't forget to add a comment to explain what you are doing (e.g., `# add x and Y`). This is a good habit to develop when writing code.

In [None]:
# your code here 

Notice that the variables we assigned are x and Y (not y). <mark>What happens if you try to add x and y?</mark>

In [None]:
# your code here

Python identifiers (variable names) are case-sensitive and can be a combination of letters, numbers, and/or underscores, starting with a letter or underscore (i.e., not a number).
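
If you are unsure whether a name is allowed, one way to check is the built-in string method `isidentifier()`. The names in the sketch below are made up for illustration.

```python
# Candidate names to check (illustrative examples only).
candidates = ['x', 'Y', 'colors_list', '_hidden', '2fast', 'my-name']

# isidentifier() reports whether a piece of text follows Python's naming rules.
validity = {}
for name in candidates:
    validity[name] = name.isidentifier()
    print(name, validity[name])
```

Here `2fast` is rejected because it starts with a number, and `my-name` is rejected because a hyphen is not a letter, number, or underscore.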

You can also assign a string to a variable. A string is a sequence of characters (letters, digits, spaces, or symbols) enclosed within a pair of 'single' or "double" quotes.

In [None]:
# create a string and assign to 'colors_list'
colors_list = 'blue, green, orange, red, magenta, white, pink'
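
As a small sketch of how quotes change the meaning of a value (the variable names below are made up for this example): single and double quotes produce the same string, and digits inside quotes are treated as text rather than as a number.

```python
# Single and double quotes create the same string.
single = 'blue'
double = "blue"
same_text = single == double
print(same_text)

# Digits inside quotes are text, so + joins them instead of adding them.
joined_text = '567' + '4'
added_numbers = 567 + 4
print(joined_text)
print(added_numbers)
```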

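You can display the string by just writing its name (`colors_list`) in the code cell below. Note that, despite its name, `colors_list` is a single string and not yet a Python list. Lists are introduced further down.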

In [None]:
# your code here 

## Printing

The print function outputs the value of its comma-delimited arguments, separated by a single blank space.

In [None]:
print('blue', 'green', 'orange', 'red', 'magenta', 'white', 'pink', 4, 6, 9, 234)

Notice that numerical values do not need to be enclosed within quotes. <mark>What happens if you forget to enclose a string within quotes?</mark>

If you enter multiple commands in one code cell and hit Shift+Enter, only the result of the last line is displayed automatically.

In [None]:
4+5
6/7

If you want to print multiple lines, you will need to use the print function.

In [None]:
print(4+5)
6/7

## Lists

A list is an ordered sequence of 0 or more comma-delimited elements enclosed within square brackets `[]`. Python lists are heterogeneous, i.e., they can contain elements of different types.

In [None]:
# create a heterogeneous list and assign to name 'hetero_list'
hetero_list = ['blue', 'green', 'orange', 4, 6, 234, 'pink']

In [None]:
print(hetero_list)

Remember, you can also just print the list by writing the name of the list. <mark>Try it out below.</mark>

In [None]:
# your code here

You can combine lists using the `+` operator.

In [None]:
# create 3 lists
list_1 = [2, 3, 5, 6]
list_2 = ['blue', 'green', 'orange', 'red', 'magenta', 'white', 'pink']
list_3 = [12, 23]

In [None]:
# combine list_1, list_2, and list_3 into one list
print(list_1 + list_2 + list_3)

You can access a certain element within a list by specifying its position.\
Note: In Python, the first element is at position `0`.

In [None]:
# print 3rd element in hetero_list
print(hetero_list[2])
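
Because counting starts at `0`, the position you write is always one less than the element's place in the list. A short sketch with a made-up list (not one of the lists above):

```python
# A small example list (hypothetical values).
fruits = ['apple', 'banana', 'cherry', 'date']

first = fruits[0]    # position 0 is the 1st element
third = fruits[2]    # position 2 is the 3rd element
last = fruits[-1]    # negative positions count from the end of the list

print(first, third, last)
```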

<mark>Print the 5th element in list_2.</mark>

In [None]:
# your code here

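Slice notation can be used to access a range of elements within a list by specifying two index positions separated by a colon `:`. The element at the first position is included, but the element at the second position is not.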

In [None]:
# print the first 4 elements in list_2
print(list_2[0:4])
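
The following sketch, using a made-up list of letters, shows how the start and stop positions of a slice behave, including what happens when one of them is left out.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

first_three = letters[0:3]   # positions 0, 1, and 2; position 3 is not included
middle = letters[2:5]        # positions 2, 3, and 4
from_start = letters[:2]     # leaving out the start means "from the beginning"
to_end = letters[4:]         # leaving out the stop means "to the end"

print(first_three)
print(middle)
print(from_start)
print(to_end)
```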

<mark>Feel free to play around with creating new lists or variables, using operators, and the print function.</mark>

## Fun with Functions

A function is a block of code that is used to perform an action. Python, like most other programming languages, comes with a set of built-in functions.
We have already used a function: `print()`.

The `input()` function displays a message (a prompt), waits for the user to type something, and returns what was typed as a string.\
The `int()` function converts a value (e.g., a string of digits) to an integer.\
The `str()` function converts a value (e.g., a number) to a string.\
Let's create a program that uses these functions to tell the user the year in which they will be 100 years old. <mark>Try it out. Feel free to play around with the program by changing variables or values.</mark>

In [None]:
Name = input('Name: ')
Age = int(input('Age: '))
Current_year = int(input("Current year: "))
Year = str((Current_year - Age)+100)
print(Name + ', ' + 'you will be 100 years old in ' + Year + '.')
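
To see why the conversions are needed, here is a sketch of the same calculation without `input()`. The values `'30'` and `'2024'` are made-up stand-ins for what a user might type.

```python
# input() always returns text, so typed numbers arrive as strings like these.
age_text = '30'
year_text = '2024'

# int() turns the text into whole numbers so that arithmetic works.
age = int(age_text)
current_year = int(year_text)
year_turning_100 = (current_year - age) + 100

# str() turns the number back into text so it can be joined with +.
message = 'You will be 100 years old in ' + str(year_turning_100) + '.'
print(message)
```

Leaving out `str()` in the last step would cause an error, because `+` cannot join a string and an integer.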

# Working with Data

Before we work with data, we need to import some Python libraries, which are collections of ready-made functions.\
`Pandas` is a commonly used library for analyzing data.\
`Matplotlib` is a plotting library for visualizing data.\
`NumPy` is a library for scientific computing.\
`Seaborn` is a library for statistical data visualization.

In [None]:
# import Pandas library and call it 'pd'
import pandas as pd

# import matplotlib.pyplot and call it 'plt'
import matplotlib.pyplot as plt

# import numpy and call it 'np'
import numpy as np

# import seaborn and call it 'sns'
import seaborn as sns

As you will see below, we will "call upon" these libraries in commands by referring to these names.

## Import Data Set

Now that you are familiar with some Python basics, let's work with some actual data to understand some basic statistics.\
The data set we will work with shows, for each airline (`airline`), the number of available seat kilometers flown every week (`avail_seat_km_per_week`) and the number of incidents, fatal accidents, and fatalities in the period 1985 to 1999 (`incidents_85_99`, `fatal_accidents_85_99`, `fatalities_85_99`) and in the period 2000 to 2014 (`incidents_00_14`, `fatal_accidents_00_14`, `fatalities_00_14`).

In [None]:
# load the data into a pandas data frame and assign it to the name 'airline_data'
airline_data = pd.read_csv("airline-safety.csv")

<mark>You can see the entire data set by typing the assigned name of the data set below.</mark> This is the same data set you opened up earlier.

In [None]:
# your code here

Alternatively, you can see the first few or last few rows of data.

In [None]:
# show the first 5 rows of the data set
airline_data.head(5)

In [None]:
# show the last 5 rows of the data set
airline_data.tail(5)

## Measures of Central Tendency

Let's learn more about the data set by looking at some measures of central tendency. The `mean` (average) of a data set is the sum of all the values divided by the number of values. The `median` is the middle value when a data set is ordered from least to greatest. Both the mean and median give us values that represent the data.
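
Before applying these to the airline data, here is a sketch that works out both measures by hand on a short made-up list and compares them with Python's built-in `statistics` module.

```python
import statistics

# Made-up numbers, for illustration only.
values = [3, 1, 4, 10, 2]

# Mean: the sum of all the values divided by the number of values.
mean_by_hand = sum(values) / len(values)

# Median: order the values, then take the middle one.
# (This shortcut only works for an odd number of values.)
ordered = sorted(values)
median_by_hand = ordered[len(ordered) // 2]

print(mean_by_hand, statistics.mean(values))
print(median_by_hand, statistics.median(values))
```

With an even number of values there is no single middle value, and the median is the average of the two middle values.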

In [None]:
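# Calculate the mean (average) of each numeric column
# numeric_only=True skips the text column 'airline' (recent pandas versions raise an error without it)
airline_data.mean(axis=0, numeric_only=True)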

In [None]:
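# Calculate the median of each numeric column
# numeric_only=True skips the text column 'airline' (recent pandas versions raise an error without it)
airline_data.median(axis=0, numeric_only=True)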

If we just look at the means and medians for incidents, it would seem that incidents decreased from 1985-1999 to 2000-2014.

We can also visualize the incidents by airline in a graph.

In [None]:
# set figure size and resolution 
plt.rcParams['figure.figsize'] = [15, 6]
plt.rcParams['figure.dpi'] = 300

# plot bar graph of incidents by airline (1985-1999 and 2000-2014)
plot_incidents = airline_data.plot(x='airline', y=['incidents_85_99', 'incidents_00_14'], kind='bar', title='Number of incidents by airline (1985-1999 and 2000-2014)')


By looking at the graph, we can see that our earlier assumption (incidents decreased from 1985-1999 to 2000-2014) is not true for every airline; for some airlines, incidents stayed the same or even increased.

<mark>Plot two more graphs: one for the number of fatal accidents by airline and one for the number of fatalities by airline. What do they show?</mark>

In [None]:
# your code here (graph one)

In [None]:
# your code here (graph two)

Measures of central tendency are not always the best representation of a data set because they can be affected by outliers, which are values that are far from the majority of the values in the data set. The mean is especially sensitive to outliers, while the median is affected much less. Outliers could be a result of poor data collection or just variance in the data.
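
As a small illustration with made-up numbers (not the airline data), the sketch below adds a single extreme value to a list and compares how far the mean and the median move.

```python
import statistics

# Made-up values, with and without one extreme value added.
typical = [2, 3, 3, 4, 5]
with_outlier = typical + [60]

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

print('mean:  ', mean_before, '->', round(mean_after, 2))
print('median:', median_before, '->', median_after)
```

In this example the single outlier roughly quadruples the mean, while the median only moves from 3 to 3.5.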

Let's have a look at the outliers in our data by looking at box plots.

In [None]:
# set figure size and resolution 
plt.rcParams['figure.figsize'] = [10, 3]
plt.rcParams['figure.dpi'] = 100

# create box plot for incidents_85_99 and assign to 'incidents_85_99_boxplot'
incidents_85_99_boxplot = sns.boxplot(x=airline_data['incidents_85_99'])

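The box plot shows the middle 50% of the data (the blue box, which spans from the first quartile to the third quartile), the median (the line in the blue box), and the whiskers (which extend to the smallest and largest data points that are not outliers).
Outliers are drawn as individual points beyond the whiskers. There are 5 outliers between 15 and 80.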
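
A common rule for deciding which points count as outliers is to flag any value that lies more than 1.5 times the box width (the interquartile range, or IQR) beyond the box. To our knowledge this is also the default used by seaborn's box plot, but treat that as an assumption. The sketch below applies the rule to made-up numbers using Python's built-in `statistics` module, so the quartiles may differ slightly from those a plotting library computes.

```python
import statistics

# Made-up values, for illustration only (not from the airline data).
values = [0, 1, 2, 2, 3, 3, 4, 5, 6, 30]

# Split the ordered data into four parts: q1 and q3 are the edges of the box.
q1, q2, q3 = statistics.quantiles(values, n=4, method='inclusive')
iqr = q3 - q1

# Anything beyond these limits is treated as an outlier (1.5 x IQR rule).
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_limit or v > upper_limit]
print('box from', q1, 'to', q3, 'with median', q2)
print('outliers:', outliers)
```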

<mark>Create another box plot for incidents_00_14. Do you notice anything? Are there outliers? </mark>

In [None]:
# your code here

## Correlation

We can also look at outliers by looking at correlations. A correlation shows how strongly pairs of variables are related. If we are looking at airline incident data, we might want to look at the correlation between incidents in 1985-1999 and incidents in 2000-2014. This might tell us if there is a relationship between airline incidents from one time period to the next, i.e., whether the incident track record of an airline can be used to predict the current/future safety of the airline.

In [None]:
# plot the correlation between incidents in 1985-1999 and 2000-2014
airline_data.plot('incidents_85_99', 'incidents_00_14', kind='scatter')
plt.show()

From the scatter plot, there seems to be a positive correlation, i.e., a relationship between airline incidents from one time period to another. But let's take a look at whether this correlation means anything.

In [None]:
# calculate correlation coefficients for all variables
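# numeric_only=True skips the text column 'airline' (recent pandas versions raise an error without it)
airline_data.corr(method='pearson', numeric_only=True)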

In this table, we can see the correlation coefficients between all variables. The closer the number is to 1.0 (or to -1.0 for a negative relationship), the stronger the relationship, and the closer the number is to 0, the weaker the relationship.\
We can see the correlation between airline incidents in 1985-1999 and 2000-2014 is 0.4, which is moderate in statistical terms.
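
To get a feeling for where such a number comes from, here is a sketch that calculates a Pearson correlation coefficient by hand for two short made-up lists. This illustrates the standard formula and is not how pandas is implemented internally.

```python
import math

# Made-up paired values, for illustration only (not from the airline data).
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]

mean_x = sum(x_values) / len(x_values)
mean_y = sum(y_values) / len(y_values)

# How much the pairs vary together, and how much each list varies on its own.
covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
spread_x = sum((x - mean_x) ** 2 for x in x_values)
spread_y = sum((y - mean_y) ** 2 for y in y_values)

r = covariation / math.sqrt(spread_x * spread_y)
print(round(r, 2))
```

The result is about 0.77, which indicates a fairly strong positive relationship: when the values in the first list go up, the values in the second list tend to go up too.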

<mark>Try creating scatter plots between other variables. Do you notice anything? Are there outliers? Which variables are strongly related and which are not?</mark>

In [None]:
# your code here

Correlating variables in a large data set is a good starting point before further analyzing the data, but it is just a starting point. Correlations alone are not very informative for making predictions or building data models, but that topic is beyond the scope of this introduction. Feel free to check out any of the links on this page to learn more about Python and data science: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks#machine-learning-statistics-and-probability