# Data Science with Jupyter Notebook


## Part 1: Navigating Python - Jupyter Notebook

There are many different interfaces available for us to code in Python and the one we will be using for this course is Jupyter Notebook - an incredibly powerful tool for interactive computing and development of data science projects. It provides a web-based application suitable for developing, documenting and executing code, as well as communicating the results.

A lot of scientific institutions actually share their findings with Jupyter Notebooks in order to clearly explain exactly how they got their results. Not only can these Notebooks show us how they got their results, we can also reproduce those results ourselves within these Notebooks. 

During the next few lessons, we will be exploring this platform and its extensive library of tools. While we learn to navigate our way around the notebook, the ultimate goal is to use this platform to analyse our own data sets and develop our own projects!

### First things first - Installing Jupyter Notebook
Watch my tutorial video on 'Installing Jupyter Notebook' or skip this part if you already have Jupyter Notebook on your laptop :)

### So what is a 'Notebook'?

What you're reading right now is a notebook. As you can see, a notebook integrates code and its output into a single document, alongside text, visualisations, mathematical equations and other media. As you will see later, you can easily produce output in-line with your code.

### Notebook Interface

Let's take a look around the interface you have in front of you. Before we continue, there are two terms essential to understanding Jupyter:

* `kernel`: “computational engine” that executes the code contained in a notebook document
* `cell`: container for text to be displayed in the notebook or code to be executed by the notebook’s kernel

There are also different types of cells:
* `code cell`: contains code to be executed in the kernel
* `markdown cell`: contains text formatted using Markdown and displays its output in-place when it is run

### Shortcuts

There are also lots of shortcuts when working in Jupyter! You are not expected to pick them up immediately, but if you practise with them as you work through different tasks, they'll save you a lot of time!

* Toggle between *command* and *edit* mode with `Esc` and `Enter`, respectively.

* Once in *command* mode:
    * **Ctrl + Enter** to run the cell
    * **Shift + Enter** to run the cell and move to the cell below
    * Scroll up and down your cells with your **Up** and **Down** keys.
    * **Press A or B** to insert a new cell above or below the active cell.
    * **M** will transform the active cell to a Markdown cell.
    * **Y** will set the active cell to a code cell.
    * **D + D (D twice)** will delete the active cell.
    * **Z** will undo cell deletion.
    * **Hold Shift and press Up or Down** to select multiple cells at once.
    * With multiple cells selected, **Shift + M** will merge your selection.
    * **Ctrl + Shift + -**, in edit mode, will split the active cell at the cursor.
    * You can also **click and Shift + Click** in the margin to the left of your cells to select them.

There are many more in the Help drop-down menu in the top margin of the Jupyter web page! Check them out if you're interested ;)

### Guidelines

- You will be given instructions to complete tasks. 
- When asked to insert code, do so below the line which says: 

`# [YOUR CODE HERE]`
- When asked to comment on the code or the output, write the comments in a 'Code Cell' with a `#` sign at the beginning of each row, in the areas which begin: 

`# [INSERT YOUR ANSWER HERE]`

### How to debug?

When you work through this notebook, you will for sure encounter problems or produce errors, which can be very frustrating (trust me, we have all been there). But don't be discouraged! These errors are part of the learning process and can definitely help you become a more capable programmer. 

Here I will list a few tips that you can refer to when you do encounter problems. This is not an exhaustive list by any means, but it is a starting point for your debugging process. These are some of the most common mistakes I made when I first started coding, so hopefully they will be helpful to you in some way:

* Maybe an error is occurring because you've made a spelling mistake. Check if there is supposed to be an 's' or the word begins with a capital letter. Python is case sensitive.
* Check that your parentheses match. For example, do not forget to close off your brackets.
* Indents matter! You might not need to use indents much during this course, but in future remember correct indenting is important.
* If you aren't sure what the new variable you define looks like, print it! By typing the variable name and running the cell, you can see the columns/numbers in the variable.
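
To make the first and last tips concrete, here is a minimal sketch. The variable `total_apples` is made up purely for illustration; the `try`/`except` is only there so the cell keeps running after the error.

```python
# A made-up variable, used only to illustrate the tips above
total_apples = 12

try:
    # Wrong capitalisation: Python treats Total_Apples as a different name
    print(Total_Apples)
except NameError as error:
    message = str(error)
    print("NameError:", message)

# Printing a variable is a quick way to check what it contains
print(total_apples)
```

If you see a `NameError`, comparing the name in the error message with the name you defined, letter by letter, is usually the fastest fix.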

### Up for a Challenge?

Starred tasks are a bit more difficult than the others. Do not be scared to attempt these tougher questions! I have made tutorial videos and included hints for these questions if you ever get stuck and need a live demonstration - just watch the tutorial for the corresponding task. If you are up for learning about more complicated code, or more interesting output, or if you are simply looking for a challenge, then these are for you ;)

Now that all the housekeeping is out of the way, let's try and complete some tasks!

#### Task 1.1:
Use `print()` to print a sentence.

In [34]:
# [YOUR CODE HERE]


What do you notice in the margins? What does it mean?

In [None]:
# [YOUR COMMENT HERE]
# 

#### Task 1.2:

In [None]:
Change this code cell to a markdown cell using a shortcut.

#### Task 1.3:

Add a cell below this one using a shortcut, type and run the code in the new cell. 

`import time`

`time.sleep(3)`
 
What did you notice when the cell was running?

#### Task 1.4: 
Run the function `say_hello` below: 

In [27]:
def say_hello(recipient):
    return 'Hello, {}!'.format(recipient)

Do you remember how to call a function? Type in your name as input and run the function. (Hint: `say_hello('')`)

In [29]:
# [YOUR CODE HERE]


### Kernels

Behind every notebook runs a kernel. When you run a code cell, that code is executed within the kernel and any output is returned back to the cell to be displayed. The kernel’s state persists over time and between cells — it pertains to the document as a whole and not individual cells.

For example, if you import libraries or define variables in one cell, they will be available in another. Let’s try this out to get a feel for it. First, let's define a function.

In [8]:
def square(x):
    return x * x

After we execute the cell above, we can reference our function 'square' in any other cell.

Now let's define a few variables and print some statements to check if they make sense.

In [9]:
# define x as the number 6
x = 6

# define y as the output of our previously described function 'square'
y = square(x)

# print the statement below by running the current cell
print('%d squared is %d' % (x, y))

Does our output make sense?

This will work regardless of the order of the cells in your notebook, because what matters is the order in which the cells are run, not where they sit on the page. You can try it yourself: let's print out our variables again.

In [7]:
# run this cell
print('Is %d squared %d?' % (x, y))

Now let's change `y`. Type in any number other than 36 after the `=` and run the cell below.

In [12]:
y = 

Now, what do you think will happen if we run the previous `print` statement again? Try it!

Lastly, if you ever wish to reset things, there are several incredibly useful options from the Kernel menu:

1. **Restart:** restarts the kernel, thus clearing all the variables etc that were defined.
2. **Restart & Clear Output:** same as above but will also wipe the output displayed below your code cells.
3. **Restart & Run All:** same as above but will also run all your cells in order from first to last.

If your kernel is ever stuck on a computation and you wish to stop it, you can choose the Interrupt option.

## Part 2: Data Exploration

Now that we've looked at *what* a Jupyter Notebook is, it's time to look at *how* notebooks are used in practice. What better way to do that than to actually analyse a real-life dataset? The dataset that we will be working with is about the passengers who were on the Titanic during one of the most infamous shipwrecks in history. The question we will attempt to answer is: **What types of passengers were more likely to survive?**

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others. Before we continue, what do you think are the most important reasons passengers survived the Titanic sinking?

Before we get down to the analysis, let's first make sure we have the correct setup.

### Sharing and saving notebooks
First things first, rename this notebook and save a version of it in your own folders. You will be coding in this notebook from now on.

You can also create a brand new notebook anytime. Either from 

* `File → New` menu option from within an active notebook
* or from the dashboard `New → Python 3`

Feel free to do so if you would like to work with a brand new notebook and personalise your own project as you wish!

### Set up: Importing libraries

We will begin our project set up by importing the following libraries. Libraries are collections of functions and methods that allow you to perform many actions without writing the code yourself.

The ones we will be using are:

* `matplotlib.pyplot (plt)`: this is a 2D plotting library that makes plotting of graphs, histograms, scatterplots, bar charts etc possible
* `seaborn (sns)`: this is another powerful plotting library that helps us visualise our data
* `numpy (np)`: it allows us to work with mathematics
* `pandas (pd)`: this is extremely useful in data analysis. It allows you to work with data structures that are easy to manipulate such as tables, matrices


It is also good practice to import libraries under an alias, like so:

`import matplotlib.pyplot as plt`

#### Task 2.1:

Import `matplotlib` using the code mentioned above.

In [31]:
# [YOUR CODE HERE]


Now let's import the other three libraries mentioned above under their respective aliases, written in brackets.

In [None]:
# [YOUR CODE HERE]


### Set up: Loading data

Now that we've imported the necessary libraries, let us load our data. Before we do this, we need to ensure that **our files are saved in the same folder as this notebook**. Otherwise, the following code will not work.

The command `pd.read_csv` lets us automatically load our csv (comma separated values) data files into a pandas dataframe.

We will be working with two files: `train.csv` and `test.csv`.
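
To get a feel for what `pd.read_csv` does before loading the real files, here is a small sketch. The CSV text and the name `people` are made up for illustration; the text is wrapped in `io.StringIO` so that it behaves like a file, which means the example runs without any file on disk.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk:
# the first line holds the column names, each later line is one row
csv_text = "name,age\nAda,36\nGrace,45\n"

# pd.read_csv accepts a file name or a file-like object
people = pd.read_csv(io.StringIO(csv_text))

print(people)
```

With a real file you would pass the file name instead, as in the commented-out line in Task 2.2.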

#### Task 2.2

Uncomment the code below and run the cell to import the `train` data set.

In [39]:
# train = pd.read_csv('train.csv')

Now import `test.csv`:

In [None]:
# [YOUR CODE HERE]


#### Task 2.3
Can you try and make sense of the code that loaded our data?
Explain what the following pieces of code mean:

`pd` and `read_csv()`

In [None]:
# [YOUR COMMENT HERE]
# pd: 
# read_csv(): 

What we have done now is import all the data needed for our course as dataframes. We have also given these data sets names that are easy to remember. Of course, you can name your data sets anything you like, but in this case, let's use `train` and `test`, which are straightforward.

### Data Exploration

Now that we've loaded our dataset into Jupyter, we can use the following commands to check what these datasets consist of:

* `df.head()` - first five lines in the dataframe
* `df.tail()` - last five lines in the dataframe

Here `df` is the shorthand for dataframe and refers to the name of the dataframe. 

You can also specify the number of lines you want the code to display by entering a desired number in the brackets: `df.head(3)`
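
Here is a short sketch of both commands on a tiny made-up dataframe (`scores` and its columns are hypothetical and have nothing to do with the Titanic data):

```python
import pandas as pd

# A small made-up dataframe with seven rows
scores = pd.DataFrame({
    "student": ["A", "B", "C", "D", "E", "F", "G"],
    "score": [71, 85, 64, 90, 78, 55, 88],
})

# First three rows
first_three = scores.head(3)
print(first_three)

# Last two rows
last_two = scores.tail(2)
print(last_two)
```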

#### Task 2.4

Display the first 5 lines of `train` and the last 10 lines of `test`.

In [None]:
# train
# [YOUR CODE HERE]


In [None]:
# test
# [YOUR CODE HERE]


#### Task 2.5
Run the cell below. What does the output tell us?

In [None]:
# [YOUR COMMENT HERE]
# 
train.shape

Try this function. How is it different?

In [None]:
# [YOUR COMMENT HERE]
# 
len(train)

#### Task 2.6

How many data points and columns are there in `test`?

In [None]:
# [YOUR CODE HERE]


#### Task 2.7
Sometimes we also want a straightforward summary of our dataframe. Try using `df.info()` and display a summary of `train`.

In [41]:
# [YOUR CODE HERE]


What are your observations of this data set?

In [None]:
# [YOUR COMMENT HERE]
#

If you want to save all the column names of your dataframe in an array, you can do the following: 

`[variable name] = df.columns.values`
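
As a quick illustration on a made-up dataframe (the names `scores` and `column_names` are hypothetical):

```python
import pandas as pd

# A tiny made-up dataframe
scores = pd.DataFrame({"student": ["A", "B"], "score": [71, 85]})

# Save the column names in their own variable
column_names = scores.columns.values

print(column_names)
```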


#### Task 2.8
Define a variable containing all columns in `train`. (Take care when naming your variable)

In [None]:
# [YOUR CODE HERE]


Now that we have a list of columns (features) in our data set, it's important to understand what each feature represents. Luckily, this dataset has been labeled clearly and it is obvious what type of information most of the columns contain.

* PassengerId - unique ID
* Survived - target, what we are trying to predict
* Pclass - ticket class, (1-3 for 1st/2nd/3rd class)
* Name - text field for passenger name, including title
* Sex - passenger gender (male or female)
* SibSp - # of siblings or spouses onboard
* Parch - # of parents or children onboard
* Ticket - ticket number
* Fare - cost of ticket
* Cabin - cabin number
* Embarked - port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

More often than not, the data you collect may contain information you may not find useful for solving a particular problem. Think about the problem that we are trying to solve in our case. Can you identify which column is irrelevant?

In [48]:
# [YOUR COMMENT HERE]
# 

We can drop columns that are not useful in order to simplify the analysis process, using `df.drop()`. For example, the syntax to drop two useless columns is:

`new_name = df.drop(['useless_column_1','useless_column_2'],axis=1)`
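
The sketch below applies that syntax to a made-up dataframe (`pets` and its columns are hypothetical). It also shows that `df.drop()` returns a new dataframe and leaves the original untouched, which is why we save the result under a new name.

```python
import pandas as pd

# A made-up dataframe with four columns
pets = pd.DataFrame({
    "name": ["Rex", "Tom"],
    "species": ["dog", "cat"],
    "microchip_id": [1001, 1002],
    "age": [3, 5],
})

# axis=1 tells pandas that the labels in the list are columns, not rows
pets_small = pets.drop(["microchip_id", "name"], axis=1)

print(pets_small.columns.values)  # columns of the new dataframe
print(pets.columns.values)        # the original dataframe is unchanged
```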

#### Task 2.9

Drop all columns that you think are irrelevant in `train` and `test`, and name your simplified dataframes `new_train` and `new_test`.

In [None]:
# [YOUR CODE HERE]


Now check what `new_train` and `new_test` dataframes look like:

In [None]:
# [YOUR CODE HERE]


In [None]:
# [YOUR CODE HERE]


#### Task 2.10   
**Initial predictions?**

Now that you're more familiar with the dataset, can you predict which types of passengers are less likely to survive?

## Part 3: Data Analysis

The key learning objective in this section of the course is to use simple Python commands to extract more useful information and gain insight from our dataset, all of which is part of our problem solving process. We now have a rough idea of what this dataset consists of, and we can be comfortable proceeding to more analysis with our dataset. 

In order to be able to analyse our dataset in more detail, it is often necessary to 'isolate' columns of the data set. We can do so with `df['Column_Name']`.

#### Task 3.1
Try this code for any column in `new_train` below.

In [2]:
# [YOUR CODE HERE]


Here are a few simple commands we can use for analysis:
* `df.mean()`
* `df.max()`
* `df.min()`
* `df.median()`
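
Here is a minimal sketch of these four commands on one isolated column of a made-up dataframe (`heights` and `height_cm` are hypothetical names). By default pandas skips missing values when computing these statistics.

```python
import pandas as pd

# A made-up dataframe with a single numeric column
heights = pd.DataFrame({"height_cm": [150, 160, 170, 180, 200]})

# Isolate the column first, then apply each command to it
column = heights["height_cm"]

smallest = column.min()
largest = column.max()
average = column.mean()
middle = column.median()

print(smallest, largest, average, middle)
```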

#### Task 3.2

How can we use these commands on our Titanic dataset? Name the columns that are appropriate for us to use these commands on.

In [3]:
# [YOUR COMMENT HERE]
#

Now let's practice using these commands. 

#### Task 3.3
Can you find the minimum, maximum, mean and median of all ages of the passengers on the ship? (Hint: Which column do you need to isolate first?)

In [None]:
# youngest passenger
# [YOUR CODE HERE]


In [None]:
# oldest passenger
# [YOUR CODE HERE]


In [None]:
# average age 
# [YOUR CODE HERE]


In [None]:
# median age 
# [YOUR CODE HERE]


You might also want to find out more about individual passengers. For example, who a particular passenger was, whether they survived the shipwreck and which cabin they stayed in. Using `df.loc[condition]`, we can extract the data points that satisfy our 'condition'. 

You might want to find out more information about those with a particular age. We can use the following code: 

`df.loc[df['column_name'] == value]`

Let's break it down:

`df['column_name']`: here we are extracting a specific feature in question

`==`: this double equal sign compares two values

`value`: this will be the value of the feature that you want to extract

NOTE: the `value` you input after `==` needs to be in quotation marks (`''`) if it is not a number.
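
The breakdown above can be sketched on a made-up dataframe (`pets` and its columns are hypothetical). The comparison produces a column of `True`/`False` values, and `df.loc` keeps only the rows marked `True`.

```python
import pandas as pd

# A made-up dataframe
pets = pd.DataFrame({
    "name": ["Rex", "Tom", "Bella"],
    "species": ["dog", "cat", "dog"],
    "age": [3, 5, 3],
})

# Comparing a column with a value gives one True/False per row
is_dog = pets["species"] == "dog"   # text values need quotation marks
print(is_dog)

# df.loc keeps only the rows where the condition is True
dogs = pets.loc[is_dog]
print(dogs)

# Numbers do not need quotation marks
three_year_olds = pets.loc[pets["age"] == 3]
print(three_year_olds)
```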

#### Task 3.4

Find out more about the youngest passenger on the ship, by extracting the row. (Hint: `df` = train, `column_name` = Age, `value` = [minimum of age])

In [54]:
# [YOUR CODE HERE]


Try the same with the oldest passenger.

In [57]:
# [YOUR CODE HERE]


Sometimes we might want to know how many people survived the shipwreck, or how many males/females there were. To find these numbers, we need to count how many data points satisfy a condition, for example that the passenger is female.

There are many ways to do this and we have already seen the code that is needed to get this information. Can you think of ways to count data points with the same characteristics using the code you have already seen?

#### * Task 3.5
Find out how many males there were on the ship. 

Hint 1: Do we need to extract a certain column and compare values like we did before? Do we know how to compute the length of a dataframe?

Hint 2: You can always check what your variable looks like by typing the name of the variable and running the cell.


In [42]:
# Step 1: define a variable containing only males as `train_male`
# [YOUR CODE HERE]


In [None]:
# Step 2: compute the length of `train_male`
# [YOUR CODE HERE]


How can we check if the output is correct? Perhaps we can find the number of females on the ship?  

#### * Task 3.6
How many females were on the ship?

In [None]:
# [YOUR CODE HERE]


#### Task 3.7
Check and verify that our outputs are correct. (Hint: How many people were there in total on Titanic?)

In [None]:
# [YOUR CODE HERE]


Another, easier way is to use `df.value_counts()`. The output lists every category of the variable alongside its count.
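
As a small illustration on made-up data (the `species` column below is hypothetical), `value_counts()` is called on a single column and returns one count per category, largest first:

```python
import pandas as pd

# A made-up column of categories
species = pd.Series(["dog", "cat", "dog", "bird", "dog", "cat"])

# One count per category
counts = species.value_counts()
print(counts)

# The counts add up to the total number of data points
print(counts.sum())
```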

#### Task 3.8
Compute the number of males and females on the ship by typing `train['Sex']` in place of `df` in the code given above. 

In [None]:
# [YOUR CODE HERE]


Are these numbers the same as what you got previously?

Now, let's practice a bit more using `df.value_counts()`. 
#### * Task 3.9
Find out how many people there were in each `Pclass`.

In [None]:
# [YOUR CODE HERE]


How many people survived the crash?

In [None]:
# [YOUR CODE HERE]


How many males survived? (Hint: Use the variable `train_male` that is already defined, and then count the number of survivors)

In [64]:
# [YOUR CODE HERE]


Now that we know the number of people who survived and the number of people of each gender, we can combine the two statistics and find out the percentage of females/males who survived. How can we do that with the code we have used so far? 
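
One possible approach, sketched on a made-up dataframe rather than the Titanic data (`results`, `team` and `passed` are hypothetical names): filter down to one group, filter again to the rows of interest, then divide the two lengths.

```python
import pandas as pd

# A made-up dataframe: 1 means passed, 0 means did not pass
results = pd.DataFrame({
    "team": ["red", "red", "red", "blue", "blue"],
    "passed": [1, 0, 1, 1, 0],
})

# Step 1: keep only the rows for one group
red_team = results.loc[results["team"] == "red"]

# Step 2: within that group, keep only the rows that passed
red_passed = red_team.loc[red_team["passed"] == 1]

# Step 3: divide the two counts and multiply by 100 to get a percentage
red_pass_percentage = len(red_passed) / len(red_team) * 100
print(round(red_pass_percentage, 1))
```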

#### * Task 3.10
Compute the percentage of males who survived. (Hint: male survival % = (survived males / total males) × 100)

In [None]:
# [YOUR CODE HERE]


Percentage of females who survived.

In [None]:
# [YOUR CODE HERE]


Percentage of survival from each Pclass. (Hint: Follow the same procedures as before)

In [None]:
# Pclass 1
# [YOUR CODE HERE]


In [None]:
# Pclass 2
# [YOUR CODE HERE]


In [None]:
# Pclass 3
# [YOUR CODE HERE]


## Part 4: Visualisation and Communication

Perhaps the most interesting part of data analysis is that we are able to visualise our data in various ways and most importantly communicate our findings in a clear manner. Through visualisation, we can gain an intuitive understanding of our data and also spot certain patterns within it. Ultimately, we can tell a better story through visualising our data.

The two visualisation libraries we will be using in this section are `matplotlib` and `seaborn`, which you should've already imported in previous sections.

Before we begin, can you think of what graphs would be helpful for us to plot in order to understand our dataset better? 

In [None]:
# [YOUR COMMENT HERE]


### Bar Charts

Bar charts are a really good way to display categorical data. Let's see how we can plot bar graphs with our dataset.

The pandas method used for plotting bar graphs is:

`df.plot(kind='bar')`

Let's say we want to visualise the number of people who survived vs the number of people who did not. See what happens when we type the following:

In [4]:
# uncomment and run code below 
# train['Survived'].plot(kind='bar')
# plt.show()

What did the plot show us? What's wrong with the graph above? How can we plot a bar chart that gives us the information desired? 

#### Task 4.1
Let's instead try the following. Add a comment above every line of code explaining what it does, and then run the cell:

In [6]:
# [YOUR COMMENT HERE]
#
train_survived = train["Survived"].value_counts()

# [YOUR COMMENT HERE]
#
train_survived.plot(kind="bar")

# [YOUR COMMENT HERE]
#
plt.show()

Does this look better? How is this different from the previous plot?

In [None]:
# [YOUR COMMENT HERE]
# 

It looks like this plot is missing a few attributes. Any good graph includes a title and axis labels, so let's improve the plot by adding these items:

`plt.title('[title]')`

`plt.ylabel('[y_axis_label]')`

`plt.xlabel('[x_axis_label]')`

#### Task 4.2
Plot the same bar graph as in Task 4.1 with a title and axis labels. (Hint: `plt.show()` needs to be the last line of code) 

In [None]:
# [YOUR CODE HERE]


Now that we have a better looking bar chart, we can see clearly that fewer passengers survived than did not. Based on our calculations from before, check whether the numbers in the graph make sense.

Even though this graph told us the split between those who survived and those who did not, a more informative plot will tell us the survival rate per `Sex`, per `Pclass`, or per `Parch`.

Now let's first try to plot **survival rate by sex**. Sketch it first on a piece of paper to help you understand what information you need to make this plot.

Previously we used `matplotlib`; to make our lives simpler, we can now use the `seaborn` library instead. (It is important to check that we have imported `seaborn`.)

A useful command from the seaborn library is `sns.barplot(x='', y='', data= )`. 
* `x` = the column that we want the x-axis to be from `train`
* `y` = the column that we want the y-axis to be from `train`
* `data` = the data set that we want to plot from

NOTE: in the `sns` command, there is no need to extract a column from our dataframe like this `df['column']`. You can simply insert the column name in the quotation marks as `sns` lets you specify which dataframe you are plotting from.
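
It helps to know what the bar heights mean. By default, `sns.barplot` draws one bar per category of `x`, and the height of each bar is the mean of the `y` column within that category. For a column of 0s and 1s, that mean is a rate. The sketch below computes the same numbers without plotting, on a made-up dataframe (`results`, `team` and `passed` are hypothetical); `groupby` is a pandas method we have not covered in this notebook and is used here only to show the calculation.

```python
import pandas as pd

# A made-up dataframe: 1 means passed, 0 means did not pass
results = pd.DataFrame({
    "team": ["red", "red", "red", "blue", "blue"],
    "passed": [1, 0, 1, 1, 0],
})

# Mean of 'passed' within each team: these are the heights the bars would have
pass_rate_by_team = results.groupby("team")["passed"].mean()
print(pass_rate_by_team)
```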

#### * Task 4.3
Plot a bar chart illustrating survival rate by male/female using `sns.barplot()`. (Hint: Two variables you might need are `'Sex'`, `'Survived'`, which one should be on the x axis and which should be on the y axis?) 

In [None]:
# [YOUR CODE HERE]


Based on our previous calculations, does our graph reflect the same numbers?

#### * Task 4.4
Plot a bar chart of survival rate by `Pclass`.

In [None]:
# [YOUR CODE HERE]


#### * Task 4.5
Do the same for `Embarked` and `Parch`.

In [74]:
# [YOUR CODE HERE]


In [None]:
# [YOUR CODE HERE]


### Histograms

Histograms are very useful for understanding the distribution of our dataset. 

The command for plotting histograms is: `plt.hist()`. To plot a histogram of one specific feature, we need to plot only that specific column like so `plt.hist(column)`. We need to first isolate that column from our data set, do you remember how to do that?
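
To see what a histogram actually counts, here is a sketch that uses `np.histogram`, which does the counting without drawing anything. The list `values` is made up for illustration. The range of the data is split into equal-width bins and each bar's height is the number of values falling in that bin; `plt.hist()` accepts a `bins` argument in the same way and uses 10 bins by default.

```python
import numpy as np

# Made-up values, used only for illustration
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Split the range 1 to 9 into 4 equal-width bins and count the values in each
counts, bin_edges = np.histogram(values, bins=4)

print(bin_edges)  # where each bin starts and ends
print(counts)     # how many values fall in each bin
```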

Recall what a distribution is and let's first plot a histogram illustrating the distribution of the age of passengers on the ship. Use the previous commands I showed you to add a title and axes labels.

#### Task 4.6
Plot a histogram of the age of all passengers on Titanic. (Hint: `plt.hist(train['column'])`)

In [76]:
# histogram
# [YOUR CODE HERE]


# title and labels
# [YOUR CODE HERE]


plt.show()


What distribution does this look like? Does it make sense?

In [None]:
# [YOUR COMMENT HERE]
#

This is great! Now if we would like to modify the dimensions of our plot, we can write this line of code before we plot:

`plt.figure(figsize=(20,10))` 

where `(20,10)` gives the width and height of the plot in inches.

#### Task 4.7
Instead of (20,10) we can use any dimensions we'd like. Plot the same age distribution histogram and change the size of the plot. Try and tweak the numbers to get a comfortable size.

In [None]:
# [YOUR CODE HERE]


Let's now try plotting the distribution of ticket fare.

In [None]:
# [YOUR CODE HERE]


### Line Plots

Another really simple, yet useful plot is of course the line plot. 

Let us try and plot a line graph that presents the number of survivors per age group. Please comment above every line of code explaining what the code means:

In [10]:
# [YOUR COMMENT HERE]
#
Survived = train[train["Survived"] == 1]

# [YOUR COMMENT HERE]
#
Ages = Survived['Age']

# [YOUR COMMENT HERE]
#
Values = Ages.value_counts(sort=False)

# [YOUR COMMENT HERE]
#
plt.plot(range(len(Values)),Values)

plt.show()


In the ```plt.plot( , )``` function, what do the first and second arguments in the brackets mean? 

In [None]:
# [YOUR COMMENT HERE]
#

It is important to learn about the different types of graphs Python can help us plot. More importantly, we should treat those graphs as part of our tool box: choosing the right type of graph is what lets us draw out the relevant information. We have now seen how to plot bar charts, histograms and line graphs, but as we explore our dataset further, we may also find other plots such as scatter plots more useful. We may also want to overlay two plots on top of each other for comparison. 

I hope this series on Intro to Data Science can act as an inspiration for you to dig deeper and explore the many other exciting things that are possible within Python and Jupyter Notebook. Do use this opportunity (this notebook) to investigate other aspects of the dataset we have not touched on, and perhaps write a report on your findings. Do not limit yourselves to this notebook! Google any code that you did not fully understand, read about it in other tutorials, try and find this dataset on Kaggle.com, watch YouTube videos etc. This is only the beginning of your journey to becoming a great data scientist!

Hope you have enjoyed this tutorial and happy coding!