# Machine Learning in Python - Workshop 1

As with any other programming language, the best way to learn Python and its machine learning libraries is to play with them, so follow the steps below and ask a tutor if you get stuck.

If you are reading this document then you have most likely been able to obtain the assignment from **nbgrader** on Noteable. Once you have completed the assignment, you should submit it using the Submit button on the Assignments tab within Noteable.

---

## 1. Jupyter notebooks

A **Jupyter notebook** is a literate programming tool that allows you to combine text, typeset maths, images, and code (and its output) together in one document. Jupyter notebooks are edited and viewed in a web browser.

A Jupyter notebook consists of several **cells**, which can be of 2 main types:
* **Markdown cells**, like this one, contain text formatted using Markdown. They can be edited by double-clicking on them. Markdown syntax is straightforward -- you can double-click on the Markdown cells in this notebook to view the source text. Markdown syntax also supports LaTeX typesetting for maths, both inline using `$...$`, e.g. $f: \mathbb{R}^2 \to \mathbb{R}$, and in display mode using `$$...$$`, e.g.

$$ \frac{\partial f}{\partial y} = 2e^{-x} \cos(y). $$

* **Code cells**, like the one below, in which we can type and run Python code interactively. They are indicated by `In [ ]:` on the left hand side. Note that cells will be executed in the order you run them (as indicated by the number in the square brackets on the left). To ensure the reproducibility of your document, it is always a good idea to select <kbd><samp>Kernel</samp></kbd>&raquo;<kbd><samp>Restart & Run All</samp></kbd> from the menu and check that you still get the correct final document.

In [None]:
print('This is a code cell!')

To **run** a cell (i.e. to typeset Markdown cells or to execute code cells), click the <kbd><samp>>| Run</samp></kbd> button on the toolbar above, or press <kbd>Shift</kbd> + <kbd>Enter</kbd>.

Some additional resources that might prove useful now or later on in the course:

* [Jupyter Notebook documentation](https://jupyter-notebook.readthedocs.io/en/stable/)
* [A gallery of interesting Jupyter notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)
* [Markdown cheat sheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

---

## 2. Pandas

This course will assume that you have some basic familiarity with the **pandas** library, 
and now is a good time to go back and review the relevant materials from Python Programming and the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/).

For this workshop we will review a small part of **pandas** by working with a sample of Airbnb listings in Edinburgh. These data are included in the `listings.csv` file, which should be available within your workshop 1 assignment folder, `mlp-week01`, on Noteable.

The data set includes the following variables:

* `id` - ID number of the listing
* `price` - Price, in GBP, for one night stay
* `neighbourhood` - Neighbourhood listing is located in
* `accommodates` - Number of people listing accommodates
* `bathrooms` - Number of bathrooms
* `bedrooms` - Number of bedrooms
* `beds` - Number of beds (which can be different than the number of bedrooms)
* `review_scores_rating` - Average rating of property
* `number_of_reviews` - Number of reviews
* `listing_url` - Listing URL


We will read in these data using pandas with the following code,

In [None]:
import pandas as pd

d = pd.read_csv("listings.csv")
d

Note that here we display the pandas data frame by returning it at the end of the cell. Generally, when we want to output something in a notebook it is better to use an explicit `print` function call, but in this case we want to take advantage of Jupyter's ability to nicely display the pandas data frame output.
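The difference between the two display routes can be sketched on a small, made-up data frame (the values below are illustrative and are not taken from `listings.csv`). `print` shows the plain-text form, while a data frame left as the last expression of a cell is rendered from its HTML representation, which pandas exposes through the `_repr_html_` method.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two display routes
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [45.0, 80.0, 120.0],
})

# Plain-text form: this is what print(toy) writes to the output area
as_text = str(toy)
print(as_text)

# HTML form: this is what Jupyter renders when `toy` ends a cell
as_html = toy._repr_html_()
print(as_html[:80])
```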

Below are a couple of quick exercises to re-familiarize yourself with pandas.

---

### &diams; Exercise 1

How many observations are included in this data set?

In [None]:
# Enter your code here

---

### &diams; Exercise 2

How many different neighbourhoods are represented in these data?

In [None]:
d["neighbourhood"].nunique()

In [None]:
d[["neighbourhood"]].describe()

In [None]:
d.groupby("neighbourhood")["neighbourhood"].count()

---

### &diams; Exercise 3

What are the mean and the median price per night of an Airbnb in Edinburgh?

In [None]:
# Enter your code here

---

### &diams; Exercise 4

Calculate a new column called `beds_per_bedroom` which is the number of beds divided by the number of bedrooms for a listing. For this new column report the 2.5th and 97.5th percentiles.

In [None]:
# Enter your code here

---

## 3. Visualization

For this course we will be using a combination of the libraries **seaborn** and **matplotlib** for the purposes of visualization. The former is actually built using the latter, and is designed to specifically provide a high-level interface for creating statistical graphics.

We will set up some initial configuration details using **matplotlib** to determine the size and resolution of the plots that will be shown in the notebook.

In [None]:
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 80

We can then use pandas and seaborn to visualize the Airbnb data.

### 3.1 Univariate plots

For example if we want to examine the distribution of the rental prices we can use pandas as follows,

In [None]:
d["price"].plot.hist(bins=30)

We can generate a similar plot using seaborn via the `distplot` function. However, if we attempt to plot the price column, as in the following code, we get an error.

In [None]:
sns.distplot(d["price"])

This occurs because the price data includes some missing values, encoded as `NaN`, that seaborn (and matplotlib) cannot handle. To correct this we just need to remove the offending values before attempting to plot, which we will do using pandas' `dropna`. Note that by default seaborn's `distplot` includes both a histogram and a kernel density estimate of the data. To create a plot similar to the plot above we can turn off the latter by setting `kde=False`.

In [None]:
sns.distplot(d["price"].dropna(), kde=False, bins=30)

We can also examine the distribution of categorical variables by creating a bar plot. This is possible with pandas but somewhat clunky as we have to take care of transforming the variable into the underlying counts of the levels before creating the bar plot.

In [None]:
d["neighbourhood"].value_counts().plot(kind="bar")
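The intermediate step can be made visible on a small example. The sketch below uses a made-up series of neighbourhood labels (not the workshop data) to show the table of counts that `value_counts` hands to the plotting method.

```python
import pandas as pd

# Hypothetical labels, standing in for d["neighbourhood"]
areas = pd.Series(
    ["Old Town", "Leith", "Old Town", "New Town", "Old Town", "Leith"],
    name="neighbourhood",
)

# One row per level, sorted from most to least frequent
counts = areas.value_counts()
print(counts)
```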

A similar plot can be created with seaborn using the `catplot` or `countplot` functions,

In [None]:
sns.countplot(x="neighbourhood", data=d)

Note that the x-axis labels are overplotting, making it nearly impossible to read them. One quick fix is to rotate the plot by putting the categories on the y-axis, which can be done as follows,

In [None]:
sns.countplot(y="neighbourhood", data=d)
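An alternative, if the categories should stay on the x-axis, is to rotate the tick labels themselves. This is a sketch rather than part of the original workshop: it uses matplotlib's `tick_params` on the axes that `countplot` returns, with a made-up data frame in place of `d`.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data, standing in for the Airbnb listings
toy = pd.DataFrame({
    "neighbourhood": ["Old Town", "Leith", "Old Town", "New Town", "Leith", "Old Town"]
})

# countplot returns the matplotlib Axes it drew on
ax = sns.countplot(x="neighbourhood", data=toy)

# Rotate the x tick labels by 90 degrees so long names do not overlap
ax.tick_params(axis="x", labelrotation=90)
```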

---

### &diams; Exercise 5

Create a plot and describe the distribution of the `review_scores_rating` variable.

In [None]:
# Enter your code here

---

### 3.2 Multivariate plots

Seaborn also includes a number of functions for visualizing bivariate and multivariate relationships within a data set. The two primary high level functions are `relplot` and `catplot` for plotting numeric or categorical variable relationships respectively.

For example to create a scatter plot of `price` vs `review_scores_rating` we can use `relplot` as follows,

In [None]:
sns.relplot(
    x = "price",
    y = "review_scores_rating",
    data = d,
    aspect = 1.5,
    alpha = 0.1
)

We use the `aspect` argument to adjust the aspect ratio of the plot, making it 1.5 times as wide as it is tall and the `alpha` argument to reduce issues with the over-plotting of points.

Note that `relplot` can also be used with categorical data; the function only determines the type of plot that will be created (i.e. a scatter or line plot).

In [None]:
sns.relplot(
    x = "price",
    y = "neighbourhood",
    data = d,
    aspect = 2
)

`catplot` alternatively deals with plots that involve at least one categorical variable (e.g. boxplots, swarm plots, bar plots, etc.). The type of plot is determined by the `kind` argument that is passed to the function. You can try changing this in the cell below and see how it affects the plot. Try values like: `"violin"`, `"bar"`, `"strip"`, or `"point"`.

In [None]:
sns.catplot(
    x = "price",
    y = "neighbourhood",
    kind = "box",
    data = d,
    aspect = 2
)

Just like `relplot`, there is no requirement that both `x` and `y` arguments be categorical variables, but note that when using two numeric variables the `x` variable will be treated as the categorical variable for plotting purposes.

In [None]:
sns.catplot(
    y = "price",
    x = "accommodates",
    data = d,
    aspect = 2
)

---

### &diams; Exercise 6

What happens if you rerun the cell above with the `x` and `y` arguments swapped? To make this behavior even more clear try changing the `kind` to `"box"` for both plots.

In [None]:
# Enter your code here

---

Finally, one other useful tool provided by seaborn is its ability to generate a pairs plot for examining the relationship between many numeric variables at the same time. Here we subset the original data to only include neighbourhoods in the city center and then create a pairs plot for the numeric variables. 

In [None]:
center = d.query('neighbourhood in ["New Town", "Old Town", "West End"]')

sns.pairplot(center.dropna(), hue="neighbourhood", markers=".")

*Hint* - if you get an error when running the above code make sure that you have not accidentally introduced `Inf` values when you constructed the `beds_per_bedroom` column in **Exercise 4**.
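How such values arise can be seen in a small example. The sketch below uses made-up numbers (not the workshop data): dividing by zero in pandas yields `inf` rather than raising an error, and `dropna` does not remove it. Replacing infinite values with `NaN` first is one possible remedy, offered here as an assumption about what suits your analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical counts; the second entry has a zero denominator
numer = pd.Series([2, 1, 3])
denom = pd.Series([1, 0, 2])

ratio = numer / denom
print(ratio.tolist())

# dropna leaves inf in place, so convert it to NaN before dropping
finite = ratio.replace([np.inf, -np.inf], np.nan).dropna()
print(finite.tolist())
```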

---

### &diams; Exercise 7

Pick several other neighbourhoods that are of interest to you and create a pairs plot for them. Is there anything interesting revealed by your plot?

In [None]:
# Enter your code here

---

Additional information, documentation, and examples can be found at the [seaborn website](https://seaborn.pydata.org/). The tutorial and gallery sections are of particular use for new users. 

## 4. Completing the worksheet

At this point you have hopefully been able to complete the preceding exercises. Now is a good time to check the reproducibility of this document by selecting <kbd><samp>Kernel</samp></kbd>&raquo;<kbd><samp>Restart & Run All</samp></kbd> from the menu above and checking that all of the code cells still run and give the correct results.

Once the notebook is completed to your satisfaction you can turn it in using the <kbd>Submit</kbd> button on the Assignments tab within Noteable. If you notice a mistake after submitting you should be able to resubmit as many times as you would like; only the last submission will be considered. For this iteration of the course we will not be marking any of the worksheets, but we will be using students' submitted work to evaluate and improve the worksheets for the course.