# Applied Data Science (MAST30034) 
## Welcome
Welcome to Applied Data Science for 2021 Semester 2! 

This is a capstone project subject, so expectations are higher than in most other subjects you will take in your undergraduate course. You are expected to have already completed the following subjects to a satisfactory level:
- Elements of Data Processing (COMP20008)
- Statistics (MAST20005)
- Machine Learning (COMP30027)
- Linear Statistical Models (MAST30025)

## Teaching Team
Your teaching staff will be as follows:
- Lecturer: Dr. Karim Seghouane (Assignment 1)
- Subject Instructor: Akira Wang (Project 1 and 2)
- Tutors: Yue You and Calvin Huang

## Tutorial Structure
- Tutorials are broken into Python and R streams to support students in whichever language they prefer.
- The first hour of the tutorial will be based on general programming how-to's and walkthroughs.
- The remainder of the tutorial will generally follow a consultation / free-for-all style. That is, we can cover a requested topic from the *Advanced Tutorials* module, answer project-related questions, or take questions about industry / applying for jobs.
- You are free to attend any tutorial time, and either half or the full 2 hours of the tutorial, depending on your interests. You are all classified as *experienced university veterans* so do what works for you.
- Finally, tutorial attendance is not marked for the duration of Project 1 and Assignment 1, but there is an expectation that you attend tutorials with your group for Project 2. More details will follow closer to the release date.

_________________


# Lab 1 Overview
## First Half
Using the JupyterHub server:
- https://jupyter.mast30034.science.unimelb.edu.au
- Please log in to verify you have access.
- If you would like to install packages, please use `pip3 install package_name`

Using GitHub Desktop vs Git CLI (Command Line Interface):
- Create a repository for your Project 1, push a commit, and ensure your repository accepts the changes.

Project 1 Tips:
- How to get started and what to look out for.
- Getting started with LaTeX on [Overleaf](https://www.overleaf.com).

## Second Half
Revision:
- Variable names, magic numbers, and constants.
- Docstrings and comments.
- Plotting geospatial maps the correct way.
- Jupyter Notebook Magic Cells.
- Data Serialisation.
- Downloading files using `urllib`.

Advanced (Optional):
- (Windows 10 Users) Installing WSL2 (Ubuntu 20.04) for a clean environment.
- Introduction to Apache Spark 3.0

## General Tips for Jupyter Notebook
Cell shortcuts:
- `shift + enter` : Run current cell (equivalent to pressing the Run button)
- `ctrl + enter` : Run selected cells

Command mode (press `esc` to enter):
- `m` : Makes the cell markdown
- `y` : Makes the cell into code
- `a` : Insert cell above
- `b` : Insert cell below
- double `d` : Delete current cell

Code Shortcuts:
- `shift + tab` : brings up the function arguments

Multiline Cursor:
- Hold down `ctrl` on Windows or `command` on Mac and click on the places you wish to edit all together.

_________________


## Using `git` on the VM

- Visit https://github.com/settings/tokens 
- Generate a personal access token (PAT) and set it to expire at the end of this year
- Add changes and commit as usual
- Now, after inputting your username, instead of entering your normal password, enter your generated PAT.
- Changes should be pushed.

**Cloning:** 
1. Open a terminal (this requires command-line `git`).
2. `git clone HTTPS` (where HTTPS is the HTTPS URL of your GitHub repo).
3. Enter your credentials.
4. Done.

**Pushing:** 
1. Change directories to inside your repository (`cd NAME_OF_REPO_FOLDER`).
2. `git add -A` (this stages all changed and untracked files for the next commit, except ignored files). You can use `git status` to check which files have changed before adding.
3. `git commit -m "message"` (make a commit with a message).
4. `git push`
5. Enter your credentials.
    - Here, use the same username
    - BUT, instead of your password, use the PAT you generated.
6. Done.

**Pulling:** 
1. Change directories to inside your repository (`cd NAME_OF_REPO_FOLDER`).
2. `git pull`
3. Done.

**Global Config**   
The first thing you should do when you install `git` is to set your user name and email address. This is important because every Git commit uses this information:
```bash
git config --global user.name "Your name as a string"
git config --global user.email some_email@example.com
```

_________________


# Readable Code
- We will be assessing the quality of your code and how you present it in your notebooks. 
- This is because there is no point writing code that cannot be easily interpreted. At the end of the day, clients are paying not only for your analysis but also for the corresponding code. 
- If your code is confusing or difficult to read, there is little chance your client will come back to you.

**Variable Names:**  
Either convention is fine as long as you are consistent. For example, commit to using one of:
- Snake Case: words are separated by underscores such as `variable_name`
- Camel Case: words are separated by capitals such as `variableName`

Your variables should be contextual and describe the code. That is, try to name your variables to be understandable **without comments**.
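
As a rough illustration of the point above, the sketch below replaces magic numbers with named constants. The bounds and the names `NYC_LATITUDE_MIN`, `NYC_LATITUDE_MAX`, `NYC_LONGITUDE_MAX`, and `is_within_nyc` are assumptions made up for this example, not values from the dataset documentation.

```python
# Hypothetical constants: a rough bounding box around NYC (assumed values).
NYC_LATITUDE_MIN = 40
NYC_LATITUDE_MAX = 41
NYC_LONGITUDE_MAX = -73


def is_within_nyc(latitude, longitude):
    """Return True if the coordinate falls inside the rough NYC bounding box."""
    return (NYC_LATITUDE_MIN < latitude < NYC_LATITUDE_MAX) and (longitude < NYC_LONGITUDE_MAX)


# Harder to read, because the reader has to guess what 40, 41 and -73 mean:
# keep = (40 < lat < 41) and (lon < -73)

times_square_is_valid = is_within_nyc(40.758, -73.9855)
null_island_is_valid = is_within_nyc(0.0, 0.0)
print(times_square_is_valid, null_island_is_valid)
```

With descriptive names, the function call reads almost like a sentence and needs no extra comment.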

**Comments and Docstrings (w.r.t Jupyter Notebook Cells):**  
Cells in Jupyter Notebook should aim to do one "block of logic" at a time (e.g. importing libraries, defining functions, filtering rows, etc.).
- If it takes a reader more than a few seconds to understand your cell, you need comments.
- Your functions need to have docstrings describing what they do.
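
A minimal sketch of what a documented function might look like. The function `kilometres_to_miles` and the constant `KM_PER_MILE` are hypothetical names chosen for illustration, and the docstring layout is just one common convention rather than a required format.

```python
KM_PER_MILE = 1.609344  # kilometres in one international mile


def kilometres_to_miles(distance_km):
    """
    Convert a distance from kilometres to miles.

    Parameters
    ----------
    distance_km : float
        Distance in kilometres.

    Returns
    -------
    float
        The same distance expressed in miles.
    """
    return distance_km / KM_PER_MILE


# the docstring is stored on the function and is what `help()` displays
print(kilometres_to_miles.__doc__)
```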

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# suppress all warnings (including deprecation warnings) to keep the notebook output clean
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("../data/sample.csv")

df.tail()

## Revision of Pandas Methods and Attributes

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
COORDS = ['pickup_latitude', 'pickup_longitude']

df[COORDS].describe()

In [None]:
df[COORDS].describe().loc[['min','max']]

`.loc[]` has probably seen some significant updates since you first used it.
1. How to correctly use it to *slice* a DataFrame.

```python
df.loc[CONDITION, COLUMNS]
```
where:
- `CONDITION` is a boolean mask (one `True`/`False` value per row) denoting the rows to keep.
- `CONDITION` can combine multiple conditions using `&` (and) and `|` (or), with each condition wrapped in parentheses.
- `COLUMNS` is either a single column name as a `string` or a list of column names.

In [None]:
df.loc[df['VendorID'] == 1, ['trip_distance', 'pickup_longitude','pickup_latitude']]

In [None]:
df.loc[(df['VendorID'] == 1) & (df['passenger_count'] > 0), 
       ['trip_distance', 'pickup_longitude','pickup_latitude']]
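
The same slicing pattern can be tried on a small self-contained example. The sketch below uses made-up values; only the column names follow the lab's dataset. Naming each mask first avoids the parentheses that inline conditions need, since `&` and `|` bind more tightly than `==` and `>`.

```python
import pandas as pd

# toy data with hypothetical values
trips = pd.DataFrame({
    "VendorID": [1, 2, 1, 2],
    "passenger_count": [0, 1, 3, 2],
    "trip_distance": [1.2, 0.5, 7.8, 3.1],
})

# each mask holds one True/False value per row
is_vendor_one = trips["VendorID"] == 1
has_passengers = trips["passenger_count"] > 0

# rows where both conditions hold, single column (returns a Series)
both = trips.loc[is_vendor_one & has_passengers, "trip_distance"]

# rows where at least one condition holds, list of columns (returns a DataFrame)
either = trips.loc[is_vendor_one | has_passengers, ["VendorID", "trip_distance"]]

# `~` negates a mask: rows where neither condition holds
neither = trips.loc[~(is_vendor_one | has_passengers)]
```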

2. How to correctly use it to update certain values in a DataFrame.

```python
df.loc[CONDITION, COLUMNS] = values
```
where:
- `values` is either a constant to assign or an array of values with matching dimensions.
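
To make the two kinds of `values` concrete, here is a small sketch on made-up data. The miles-to-kilometres conversion is only an illustrative assumption about the units of `trip_distance`, and `KM_PER_MILE` is a name introduced for this example.

```python
import pandas as pd

KM_PER_MILE = 1.609344  # assumed conversion, for illustration only

# toy data with hypothetical values
trips = pd.DataFrame({
    "pickup_longitude": [0.0, -73.98, -73.95, 0.0],
    "pickup_latitude": [0.0, 40.75, 40.78, 0.0],
    "trip_distance": [2.5, 1.0, 2.0, 4.0],
})

is_invalid = (trips["pickup_longitude"] == 0) & (trips["pickup_latitude"] == 0)
is_valid = ~is_invalid

# constant: every selected row receives the same value
trips.loc[is_invalid, "trip_distance"] = 0

# array: one value per selected row (two rows selected, two values assigned)
converted = trips.loc[is_valid, "trip_distance"].to_numpy() * KM_PER_MILE
trips.loc[is_valid, "trip_distance"] = converted
```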

In [None]:
df.loc[(df['pickup_longitude'] == 0) & (df['pickup_latitude'] == 0), 'trip_distance'] = 0

In [None]:
df.loc[(df['pickup_longitude'] == 0) & (df['pickup_latitude'] == 0), 'trip_distance']

This sets the trip distance to `0` for pickups whose latitude **and** longitude are both `0` (which is an invalid location).   
Since we are visualising data today, we will drop all invalid points (though you will have to justify this in your report).

In [None]:
df = df.loc[(df['pickup_latitude'] > 40) & (df['pickup_latitude'] < 41) & (df['pickup_longitude'] < -73)]

In [None]:
df[COORDS].describe()

## Plotting a geospatial map using `folium` and `bokeh`
Some terminology before we continue:
- Map Tiles / Tile Providers: The underlying map "style" (e.g. Google Maps vs OpenStreetMap)

Let's start off with `folium`.

In [None]:
import folium

In [None]:
# initialise a map
m = folium.Map(tiles="Stamen Terrain", zoom_start=10)

# save map
m.save('../plots/map.html')

# show map
m

For your Project 1, we will be working with NYC. Here's one method of setting it up.

In [None]:
# mcoords = the middle coordinates for the map
mcoords = df[COORDS].describe().loc[["50%"]].values[0]

# axis ranges
xRange = [df['pickup_longitude'].min(), df['pickup_longitude'].max()]
yRange = [df['pickup_latitude'].min(), df['pickup_latitude'].max()]

xRange, yRange

In [None]:
nyc_m = folium.Map(location=mcoords, tiles="Stamen Terrain", zoom_start=11)

nyc_m.save('../plots/folium_nyc.html')

nyc_m

Now let's do the equivalent in `bokeh`.

In [None]:
from bokeh.plotting import figure, show
from bokeh.tile_providers import get_provider, Vendors

# to display bokeh plots inside jupyter, we need to use output_notebook
from bokeh.io import reset_output, output_notebook

reset_output()
output_notebook()
# note below that it says "BokehJS 1.4.0 successfully loaded."

Bokeh requires axis coordinates in Web Mercator format and does not accept latitude/longitude directly:
- https://en.wikipedia.org/wiki/Web_Mercator_projection

You may use the functions below for your Project 1, but you must attribute the code accordingly.

In [None]:
def latitude_to_mercator(coords):
    """
    Function which converts an array of latitude coordinates 
    into its mercator coordinate representation
    """
    k = 6378137
    converted = list()
    for lat in coords:
        converted.append(np.log(np.tan((90 + lat) * np.pi/360.0)) * k)
    return converted

def longitude_to_mercator(coords):
    """
    Function which converts an array of longitude coordinates 
    into its mercator coordinate representation
    """
    k = 6378137
    converted = list()
    for lon in coords:
        converted.append(lon * (k * np.pi/180.0))
    return converted
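
As an optional alternative, the same formulas can be applied to a whole array at once with NumPy instead of looping. This is a sketch under the assumption that the inputs are in degrees; the names `EARTH_RADIUS_M`, `latitude_to_mercator_vectorised`, and `longitude_to_mercator_vectorised` are introduced here for illustration.

```python
import numpy as np

EARTH_RADIUS_M = 6378137  # same value as `k` in the functions above


def latitude_to_mercator_vectorised(latitudes):
    """Convert latitudes in degrees to Web Mercator y coordinates in metres."""
    latitudes = np.asarray(latitudes, dtype=float)
    return np.log(np.tan((90 + latitudes) * np.pi / 360.0)) * EARTH_RADIUS_M


def longitude_to_mercator_vectorised(longitudes):
    """Convert longitudes in degrees to Web Mercator x coordinates in metres."""
    longitudes = np.asarray(longitudes, dtype=float)
    return longitudes * (EARTH_RADIUS_M * np.pi / 180.0)


# hypothetical coordinates near NYC
x = longitude_to_mercator_vectorised([-74.0, -73.9])
y = latitude_to_mercator_vectorised([40.7, 40.8])
```

Because these accept any array-like input, a whole DataFrame column can be passed in one call rather than converting one value at a time.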

Let's view all the possible map tiles that you can use.

In [None]:
# for each map type in list of vendors available
for mapType in Vendors:
    # create plot with the coordinates we computed above
    p = figure(x_range=longitude_to_mercator(xRange), y_range=latitude_to_mercator(yRange),
           x_axis_type="mercator", y_axis_type="mercator")
    
    # add the underlying tile from our provider
    p.add_tile(get_provider(mapType))
    p.title.text = mapType
    
    # display the plot
    show(p)

You should decide on which tile to use for your plots. My suggestion is you stick to `STAMEN_TERRAIN_RETINA`.

(Advanced) If you would like to use Bokeh to plot Google Map tiles:
- https://docs.bokeh.org/en/latest/docs/user_guide/geo.html

Now, let's try plotting something over the map.

In [None]:
TILE = get_provider("STAMEN_TERRAIN_RETINA")

pickup_m = figure(x_range=longitude_to_mercator(xRange), y_range=latitude_to_mercator(yRange),
       x_axis_type="mercator", y_axis_type="mercator")
pickup_m.add_tile(TILE)
pickup_m.title.text = "Pickups in NYC"

In [None]:
df[COORDS]

In [None]:
df['pickupX'] = df['pickup_longitude'].apply(lambda x: longitude_to_mercator([x])[0])
df['pickupY'] = df['pickup_latitude'].apply(lambda x: latitude_to_mercator([x])[0])

In [None]:
df.head()

In [None]:
# for every source value, draw a small circle denoting a pickup
pickup_m.circle(x='pickupX', y='pickupY', 
         size=5, fill_color="blue", fill_alpha=0.5, 
         source=df[['pickupX','pickupY']])

In [None]:
show(pickup_m)

The equivalent for dropoffs.

In [None]:
# create map
dropoff = figure(x_range=longitude_to_mercator(xRange), y_range=latitude_to_mercator(yRange),
       x_axis_type="mercator", y_axis_type="mercator")
dropoff.add_tile(TILE)
dropoff.title.text = "Dropoff in NYC"

# convert to mercator
df['dropoffX'] = df['dropoff_longitude'].apply(lambda x: longitude_to_mercator([x])[0])
df['dropoffY'] = df['dropoff_latitude'].apply(lambda x: latitude_to_mercator([x])[0])

# plot circles (source = data source)
dropoff.circle(x='dropoffX', y='dropoffY', 
         size=5, color="pink", fill_color="red", fill_alpha=0.5, 
         source=df[['dropoffX','dropoffY']])

show(dropoff)

## Geospatial Inferences
- More pickups around central Manhattan, with more dropoffs in the surrounding boroughs.
- Pickup locations are easily divided into "hubs" (e.g. Manhattan, the airports, etc.).
- Dropoffs seem to be scattered across the map.

**IMPORTANT:** The above is at most *describing* the plot. Your project will require *analysis* and *research* on top of describing a plot. That is:
- *Why might there be more pickups around central Manhattan?*
- *Is there an explanation surrounding the "hubs"?*
- *Why are dropoffs scattered across the map?*

As a suggestion, have less description and more analysis. Your visualisations should be easy to read and interpret (e.g. suitable font size, colour, alpha, legend, etc.).

### Where to go from here
We have a simple visualisation of the pickups and dropoffs, but what factors might affect them?
- Perhaps we can take a look at the time, day of week, the weather conditions, events that are taking place, etc.
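
One possible starting point for the time-based ideas above is to derive the hour and day of week from a pickup timestamp. The sketch below uses made-up timestamps, and `pickup_datetime` is an assumed column name that may differ in your dataset.

```python
import pandas as pd

# hypothetical pickup timestamps; `pickup_datetime` is an assumed column name
trips = pd.DataFrame({
    "pickup_datetime": [
        "2021-08-01 23:15:00",
        "2021-08-02 08:05:00",
        "2021-08-02 17:40:00",
    ],
})

# parse the strings into proper datetime values
trips["pickup_datetime"] = pd.to_datetime(trips["pickup_datetime"])

# derive features that can be grouped on or joined to external data
trips["pickup_hour"] = trips["pickup_datetime"].dt.hour
trips["pickup_day_of_week"] = trips["pickup_datetime"].dt.day_name()

# number of pickups for each day of the week
pickups_per_day = trips.groupby("pickup_day_of_week").size()
```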

It is up to you to find an external dataset to answer these questions.

_________________


## Data Serialisation
Pickle: 
- Lightweight and super fast serialization for data.
- Python native and compatible with several data formats.
- High disk space usage, low read/write time.

Feather:
- Lightweight and super fast serialization for data using Apache Arrow.
- Python **and** R native, though not compatible with all data formats.
- Medium disk space usage, low read/write time.
- If you don't have it installed, use `pip3 install feather-format`

**Notes:**  
- The `feather` format does not support serialising a DataFrame with a non-default index.
- That is, it expects the default `0, 1, 2, ...` index (similar to an auto-incrementing row ID in a database), and the simplest way to ensure this is to reset the index before writing.
- As for `pickle`, it is built into Python, so you can just use `.to_pickle()`.
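
To see why the index needs resetting, the sketch below filters a toy DataFrame (made-up values) and inspects the index before and after. It stops short of writing a feather file, so it only relies on `pandas`; the variable names are hypothetical.

```python
import pandas as pd

# toy data with hypothetical values
trips = pd.DataFrame({"trip_distance": [1.2, 0.0, 7.8, 0.0]})

# filtering keeps the original row labels, so the index now has gaps (0, 2)
filtered = trips.loc[trips["trip_distance"] > 0]

# drop=True discards the old labels instead of keeping them as a new "index" column
ready_for_feather = filtered.reset_index(drop=True)

print(filtered.index.tolist(), ready_for_feather.index.tolist())
```

Calling `reset_index()` without `drop=True` also works, but it keeps the old labels as an extra `index` column in the saved file.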

In [None]:
df.reset_index().to_feather('../data/lab_specific/sample.feather')

In [None]:
df.to_pickle('../data/lab_specific/sample.pkl')

In [None]:
%%time
df_csv = pd.read_csv('../data/sample.csv')

In [None]:
%%time
df_feather = pd.read_feather('../data/lab_specific/sample.feather')

In [None]:
%%time
df_pickle = pd.read_pickle('../data/lab_specific/sample.pkl')

_________________


## Downloading Files

In [None]:
from os.path import getsize
from urllib.request import urlretrieve

output_dir = "../data/"
fname = "sample_downloaded.csv"

# it's this easy
url = "https://raw.githubusercontent.com/akiratwang/MAST30034_Python/main/data/sample.csv"
urlretrieve(url, f"{output_dir}/{fname}")

print(f"Done downloading {fname} to {output_dir} with size {getsize(f'{output_dir}/{fname}') / 1073741824:.2f}GB")
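
If you download several files, the steps above could be wrapped in a small helper. This is only a sketch: `download_file` and `BYTES_PER_GB` are hypothetical names, and the demonstration uses a local `file://` URL as a stand-in for a remote one so that it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

BYTES_PER_GB = 1024 ** 3  # 1073741824, named instead of left as a magic number


def download_file(url, output_dir, fname):
    """Download `url` to `output_dir/fname`, creating the directory if needed, and return the path."""
    output_path = Path(output_dir) / fname
    output_path.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(output_path))
    return output_path


# offline demonstration: a temporary local file plays the role of the remote file
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("VendorID,trip_distance\n1,2.5\n")

    saved = download_file(source.as_uri(), Path(tmp) / "data", "sample_downloaded.csv")
    size_gb = saved.stat().st_size / BYTES_PER_GB
    downloaded_text = saved.read_text()

print(f"Downloaded {saved.name} with size {size_gb:.2f}GB")
```

Building the path with `pathlib` also avoids the doubled `/` that appears when a directory string already ends in a slash.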

_________________


## WSL Environment for Windows 10
Refer to this guide to get a native Linux terminal in Windows 10:
- https://github.com/akiratwang/COMP20003
- Ignore all the `C` related parts, just get Ubuntu installed.

## Apache Spark 3.0 (PySpark) Installation
- Visit `MAST30034_Python/advanced_tutorials/Spark%20Installation.ipynb`