# Jupyter Notebooks

## What is Jupyter?

What you see here is a [Jupyter notebook](https://jupyter.org/). Like CodeExpert, Jupyter is an interactive programming environment for Python and other programming languages. While CodeExpert makes it easier for you to get started with programming, Jupyter notebooks are widely used in practice to share code with others and interactively process, visualize, and explore data.

Jupyter notebooks combine code, output, and comments into a fully editable document (with an `.ipynb` extension). This means you can change the code and the comments, and the output (print output, plots, etc.) changes accordingly when you run the cells again. Jupyter notebooks can also be exported as PDF or HTML files.

**In short:**
- an interactive programming environment for Python (and other programming languages)
- reproducible data analysis
- combines code, output, and comments into one document that can be shared with and edited by others



## Cells

A Jupyter notebook usually consists of several *cells*, of which code and markdown cells are the most important types.

* **Markdown cells** are used for comments, like the text you are reading. In Markdown cells, the text can be formatted, e.g. by adjusting the <span style="font-family:Comic Sans MS">font</span> and <span style="font-size:1.5em">font size</span>, and even links and images can be embedded. [Markdown syntax](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) can be used for this.

![An XKCD comic](https://imgs.xkcd.com/comics/python.png)

* **Code cells** contain Python code to be executed.

## Warmup

Below is a code cell defining two variables, `x` and `y`, and a function `compute_maximum` that returns the maximum of a given list of numbers:

In [None]:
x = 3
y = 7

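def compute_maximum(some_list):
    """Return the largest element of some_list, or None if the list is empty."""
    if len(some_list) == 0:
        print("List is empty")
        return None
    else:
        # Start with the first element and update whenever a larger one is found
        current_maximum = some_list[0]
        for element in some_list:
            if element > current_maximum:
                current_maximum = element
        return current_maximum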

***Exercise 1.0.*** Execute the code cell above using `CTRL + Enter` on Windows or `CMD + Enter` on macOS. You can now refer to the initialized variables and functions throughout the rest of the notebook.

***Exercise 1.1.*** Now, insert a *new* code cell below. Initialize a variable with a list containing `x + y`, `x * y`, and `x - y`; then, apply the function `compute_maximum` above to the list you have defined.

***Exercise 1.2.*** Execute the code cell below *multiple* times in a row and observe the output. What is happening and why?

In [None]:
x = x * 3
print(str(x))

***Exercise 1.3.*** Now, insert a new *Markdown* cell below. Format the cell to include a list and text in italic and/or bold typeface.

**Important:** Unlike CodeExpert, where you always run the entire script, a Jupyter notebook lets you run cells in *any* order. Before sharing a notebook with others, you should therefore restart the kernel and run all cells from top to bottom (via the `Kernel > Restart & Run All` menu item) to ensure that the results are *reproducible*.
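
The effect of execution order can be sketched in plain Python. In this minimal illustration, each function stands in for one notebook cell and a dictionary plays the role of the variables that the kernel keeps in memory. The names `cell_a`, `cell_b`, and `namespace` are hypothetical and not part of Jupyter.

```python
# Hypothetical stand-ins: each function mimics one notebook cell, and the
# dictionary mimics the variables that the kernel keeps between cell runs.

def cell_a(namespace):
    namespace["total"] = 10


def cell_b(namespace):
    namespace["total"] = namespace["total"] * 2


# Fresh kernel, every cell executed once from top to bottom.
clean_run = {}
cell_a(clean_run)
cell_b(clean_run)

# Same cells, but the second one was executed twice during interactive work.
interactive_run = {}
cell_a(interactive_run)
cell_b(interactive_run)
cell_b(interactive_run)

print(clean_run["total"])        # 20
print(interactive_run["total"])  # 40
```

Both runs use identical code, yet they end with different values. Restarting and running everything from the top corresponds to `clean_run`, which is the result another person would obtain when opening the notebook.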

# Data Analysis in Jupyter

## Introduction

Below, we show how data analysis can be conducted in a Jupyter notebook. We will develop a simple predictive model that classifies breast tumors as benign or malignant based on some numerical properties. We will explore the [Breast Cancer Wisconsin](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29) dataset, which is provided in the file `data_simplified.csv`.

We will almost exclusively use modules you already know from the lecture. The exception is the `sklearn` module, which is useful for building predictive models. We will use a simple statistical model, logistic regression, which is familiar to you from previous lectures. Next semester, you will learn more about the `sklearn` module in the Data Science for Medicine course.
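
As a brief reminder of what such a model computes, here is a minimal plain-Python sketch: logistic regression passes a weighted sum of the input features through the logistic function to obtain a probability between 0 and 1. The function names, the weight, and the intercept below are made up for illustration and are not fitted to any data.

```python
import math


def logistic(z):
    """Map a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, intercept):
    """Probability of the positive class for one observation."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Made-up parameters for a single hypothetical feature (not fitted to any data).
example_weights = [1.0]
example_intercept = -14.0

print(predict_probability([14.0], example_weights, example_intercept))        # 0.5
print(predict_probability([20.0], example_weights, example_intercept) > 0.5)  # True
```

A probability above 0.5 would typically be read as a prediction of the positive class. Fitting a model means estimating the weights and the intercept from data.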

## Loading the Modules

***Exercise 2.0.*** Load the `numpy` and `pandas` modules in the code cell below. In Jupyter notebooks, library imports work the same way as in CodeExpert.

In [None]:
# TODO: your code here

We will also need `matplotlib` for plotting and `LogisticRegression` from `sklearn` to fit a predictive model. 

In [None]:
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

## Loading the Data

We will now load the dataset as a `pandas.DataFrame` from the CSV file `data_simplified.csv`.

***Exercise 2.1.*** Read the documentation for the [`pandas.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function. Then, apply the function to load the Breast Cancer Wisconsin dataset below.
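
To illustrate the general shape of the call without touching the real file, the following sketch reads a tiny made-up CSV from an in-memory string. It assumes that `pandas` is installed. The three rows and the `id` column are invented for illustration and are not taken from `data_simplified.csv`.

```python
import io

import pandas as pd

# Invented toy data; the real dataset lives in data_simplified.csv.
csv_text = """id,diagnosis,radius_mean
1,B,12.1
2,M,18.4
3,B,11.3
"""

# read_csv accepts a file path or any file-like object such as StringIO.
toy_data = pd.read_csv(io.StringIO(csv_text))

print(type(toy_data).__name__)  # DataFrame
print(list(toy_data.columns))   # ['id', 'diagnosis', 'radius_mean']
```

For a file on disk, the path is passed as the first argument in place of the `StringIO` object.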

In [None]:
# TODO: complete this code cell
data = ...

***Exercise 2.2.*** How can we display just the first few rows of the loaded `pandas.DataFrame`? Implement your solution in the code cell below. *Hint*: you may find the [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) documentation useful.

In [None]:
# TODO: your code here

***Exercise 2.3.*** How many rows and columns does the dataset have? What do rows and columns correspond to? 

In [None]:
# TODO: your code here

Looking at the data above, we can already draw the following conclusions:
* Each row contains numeric information about the properties of a tumor, such as its radius, texture, etc.
* The column `diagnosis` tells us whether the tumor is benign (`B`) or malignant (`M`)

## Exploratory Analysis

In the following, we will perform a brief exploratory data analysis before finally turning to predictive modeling.

### Histograms

A histogram displays how often values within a given range (a *bin*) occur in the dataset (see the figure below for more terminology). Using histograms, we can visualize the distributions of numerical and categorical variables. We can also color subgroups of study subjects differently, for example by their clinical condition. To plot a histogram, we will use the [`matplotlib.pyplot.hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) function from the module `matplotlib`.

![histogram terminology](http://diagrammm.com/img/diagrams/histogram-terminology.svg)
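
The counting behind a histogram can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by `matplotlib`: the helper `histogram_counts` is hypothetical, it assumes equal-width bins spanning the minimum to the maximum value, and the radius values are made up.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lowest, highest = min(values), max(values)
    if lowest == highest:
        raise ValueError("all values are identical, so the bin width would be zero")
    width = (highest - lowest) / num_bins
    counts = [0] * num_bins
    for value in values:
        index = int((value - lowest) / width)
        # The maximum value would land one past the end; put it in the last bin.
        index = min(index, num_bins - 1)
        counts[index] += 1
    edges = [lowest + i * width for i in range(num_bins + 1)]
    return counts, edges


# Made-up radius values, chosen only to keep the arithmetic easy to follow.
radii = [10.0, 11.0, 12.0, 12.5, 13.0, 14.0, 17.0, 18.0, 20.0]
counts, edges = histogram_counts(radii, num_bins=5)

print(edges)   # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
print(counts)  # [2, 3, 1, 1, 2]
```

Each entry in `counts` corresponds to the height of one bar, and neighboring entries in `edges` mark where that bar starts and ends on the horizontal axis.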

***Exercise 2.4.*** Inspect the code cell below carefully. What does it do? Execute the code and observe its output. What conclusions can be drawn from the resulting plot?

In [None]:
b = plt.hist(data[data["diagnosis"] == "B"].radius_mean, alpha=0.7, label="Benign")
m = plt.hist(data[data["diagnosis"] == "M"].radius_mean, alpha=0.7, label="Malignant")
plt.legend()
plt.xlabel("Radius Mean Values")
plt.ylabel("Frequency")
plt.title("Histogram of the Tumor Radius")
plt.show()

### Correlation Analysis

TODO