# Jupyter Notebook Example
An example of what a notebook might look like when you're working in a team and may need to hand it over to someone else.


Here we're going to be:
- loading up some data (stolen vehicles in NZ over a 6 month period)
- doing some basic analysis on the data
- seeing if a very simple prediction model will tell us which day of the week our car is likely to be stolen on

# Notes

**Data Sources**
- Stolen Vehicle Counts https://www.police.govt.nz/can-you-help-us/stolen-vehicles
- NZ Population Counts https://explore.data.stats.govt.nz/

**Notes on the data**
- some vehicle makes and models have been misclassified. No attempt has been made to correct this for the purposes of this demo
- the vehicle thefts data only spans a 6 month timeframe, so the dataset is too small to draw many conclusions from
- NZ population data has been compiled from various sources and needs verifying before being used for anything more than a POC

**Assumptions**
- the NZ population data was compiled with the assumption that the police regions specified in the vehicle thefts data match closely to the NZ DHB regions from the population data. This needs to be verified, but the assumption is that this is a close enough match for an initial POC

**Next Steps**
...

# Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
from datetime import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
in_colab = False

In [None]:
if in_colab:
    print("""
Instructions:
- open Google colab
- upload the example notebook
- on the left open the files menu
- upload the colab-upload.zip archive from the repo
- now you can run the next cell to unzip that file
"""
         )

In [None]:
if in_colab:
    !unzip -o colab-upload.zip

# Data Load

## Stolen Vehicles

In [None]:
thefts = pd.read_csv('data/stolen_vehicles.csv')
thefts

In [None]:
thefts.dtypes

In [None]:
thefts.describe()

In [None]:
thefts.describe(include='object')

In [None]:
def add_date_cols(df: pd.DataFrame) -> pd.DataFrame:
    df['date_stolen'] = pd.to_datetime(df['date_stolen'], format="%d/%m/%Y")
    df['month_stolen'] = df['date_stolen'].dt.month
    df['dow_stolen'] = df['date_stolen'].dt.strftime('%a')
    df['ndow_stolen'] = df['date_stolen'].dt.strftime('%w')
    # .dt.day gives the day of the month as an integer; strftime('%-d') is not supported on Windows
    df['dom_stolen'] = df['date_stolen'].dt.day
    df['woy_stolen'] = df['date_stolen'].dt.strftime('%U')
    df['is_weekend'] = df['dow_stolen'].isin(['Sat', 'Sun'])
    return df

thefts = add_date_cols(df=thefts)
thefts

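**Instruction**: go back and rerun the `thefts.dtypes` cell above. `add_date_cols` modified `thefts` in place, so `date_stolen` now shows as a datetime and the new columns appear.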
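
A minimal sketch of the same idea using pandas' numeric `.dt` accessors on a copy, so the input frame is left alone. The three-row `sample` frame and the `add_numeric_date_cols` name are made up for illustration. Note that `.dt.dayofweek` counts from Monday=0, whereas `strftime('%w')` counts from Sunday=0.

```python
import pandas as pd

# Hypothetical three-row sample in the same dd/mm/YYYY format as the thefts data.
sample = pd.DataFrame({"date_stolen": ["05/11/2021", "06/11/2021", "08/11/2021"]})


def add_numeric_date_cols(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with integer date features added."""
    out = df.copy()  # leave the caller's frame untouched
    out["date_stolen"] = pd.to_datetime(out["date_stolen"], format="%d/%m/%Y")
    out["dom_stolen"] = out["date_stolen"].dt.day  # day of the month as an int
    out["ndow_stolen"] = out["date_stolen"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = out["ndow_stolen"] >= 5  # Saturday or Sunday
    return out


features = add_numeric_date_cols(sample)
print(features)
```

Because the function works on a copy, rerunning the cell gives the same result each time, which makes it less sensitive to notebook state.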

## Vehicle Categories

In [None]:
vehicle_cats = pd.read_csv('data/vehicle_categories.csv')
vehicle_cats

## NZ Population Stats

In [None]:
nzpop = pd.read_csv('data/nz_location_populations_by_dhb.csv')
nzpop

In [None]:
nzpop = nzpop.groupby(['linking_region', 'region'])['population'].sum().reset_index()
nzpop.rename(columns={'population': 'linking_region_population'}, inplace=True)
nzpop['region_population'] = nzpop.groupby('region')['linking_region_population'].transform('sum')
nzpop

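**Instruction**: Rerun the above cell. It fails with a `KeyError` because the `population` column was renamed on the first run, which is a reminder that notebook state persists between cell runs.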

## Merged Data

In [None]:
merged = thefts.merge(nzpop, left_on="location", right_on="linking_region", how="left")
merged = merged.merge(vehicle_cats, on="type", how="left")
merged

# Data Investigation
We need to:
- get a feel for the data
- understand any issues that we see in the data that might impact any analysis or modelling that we want to do

In [None]:
# Check how successful the data merges were
merged[pd.isnull(merged['region'])]

In [None]:
merged[pd.isnull(merged['category'])]
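
One possible way to make the merge check more explicit is the `indicator` and `validate` arguments of `DataFrame.merge`. This is a sketch on made-up miniature tables (`thefts_demo` and `nzpop_demo` are hypothetical, and "Atlantis" is a deliberately unmatched location), not the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the thefts and population tables.
thefts_demo = pd.DataFrame(
    {"id": [1, 2, 3], "location": ["Auckland", "Wellington", "Atlantis"]}
)
nzpop_demo = pd.DataFrame(
    {"linking_region": ["Auckland", "Wellington"], "region": ["Auckland", "Wellington"]}
)

checked = thefts_demo.merge(
    nzpop_demo,
    left_on="location",
    right_on="linking_region",
    how="left",
    indicator=True,  # adds a _merge column: "both" or "left_only"
    validate="many_to_one",  # raises MergeError if linking_region has duplicates
)

# Rows from the left table that found no match on the right.
unmatched = checked[checked["_merge"] == "left_only"]
print(checked["_merge"].value_counts())
print(unmatched[["id", "location"]])
```

`validate` also guards against a left merge silently duplicating theft rows if the lookup table ever gains repeated keys.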

In [None]:
merged = merged[~pd.isnull(merged['make'])].copy()  # copy so later .loc edits don't act on a view

In [None]:
merged[pd.isnull(merged['type'])]

In [None]:
merged[merged['model'] == 'crv']

In [None]:
merged.loc[72, 'type'] = 'Caravan'
merged.loc[428, 'type'] = 'Mobile Machine'
merged.loc[819, 'type'] = 'Farm Bike'
merged.loc[1215, 'type'] = 'Moped'
merged.loc[1215, 'model'] = 'CICLONE'
merged.loc[3202, 'type'] = 'Hatchback'
merged.loc[3828, 'type'] = 'Motorbike'
merged.loc[4826, 'type'] = 'Farm Bike'
merged.loc[4968, 'type'] = 'Stationwagon'
merged.loc[4968, 'model'] = 'CRV'
merged = merged[~pd.isnull(merged['type'])]

In [None]:
merged[merged['id'].isin([72, 428, 819, 1215, 3202, 3828, 4826, 4968])]

In [None]:
# if you want to do some further investigation, check out things like:
# - the different types of Honda CIVIC in the dataset
# - what vehicles have been logged with the make of "Motorcycle" or "Moped"
# - what type of vehicles are those with the type "Mobile Machine"

merged[merged['year'] == 0]

What the end result would look like without all the investigative steps above:

In [None]:
def prepare_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df[~pd.isnull(df['make'])].copy()  # copy so the edits below don't act on a view

    # obviously this wouldn't work in a production pipeline, but can be used to prove out a POC
    df.loc[72, 'type'] = 'Caravan'
    df.loc[428, 'type'] = 'Mobile Machine'
    df.loc[819, 'type'] = 'Farm Bike'
    df.loc[1215, 'type'] = 'Moped'
    df.loc[1215, 'model'] = 'CICLONE'
    df.loc[3202, 'type'] = 'Hatchback'
    df.loc[3828, 'type'] = 'Motorbike'
    df.loc[4826, 'type'] = 'Farm Bike'
    df.loc[4968, 'type'] = 'Stationwagon'
    df.loc[4968, 'model'] = 'CRV'
    df = df[~pd.isnull(df['type'])]
    return df

data = thefts.copy()
data = prepare_data(df=data)
data = data.merge(nzpop, left_on="location", right_on="linking_region", how="left")
data = data.merge(vehicle_cats, on="type", how="left")
data
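
If the hard-coded fixes in `prepare_data` ever needed to outlive the POC, one option would be to keep them in a small corrections table keyed on a stable identifier rather than on the row index. The sketch below assumes `id` is such an identifier. The rows, ids, values and the `apply_corrections` helper are all made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the ids and values are invented for this example.
vehicles = pd.DataFrame(
    {
        "id": [101, 102, 103],
        "make": ["Honda", "Yamaha", "Toyota"],
        "model": ["crv", "XT", "HILUX"],
        "type": [None, None, "Utility"],
    }
)

# One row per fix; this could equally live in a reviewed CSV file.
corrections = pd.DataFrame(
    {
        "id": [101, 101, 102],
        "column": ["type", "model", "type"],
        "value": ["Stationwagon", "CRV", "Farm Bike"],
    }
)


def apply_corrections(df: pd.DataFrame, fixes: pd.DataFrame) -> pd.DataFrame:
    """Apply each (id, column, value) fix to a copy of df."""
    out = df.copy().set_index("id")
    for fix in fixes.itertuples(index=False):
        if fix.id not in out.index:
            # Fail loudly rather than silently appending a new row.
            raise KeyError(f"correction refers to unknown id {fix.id}")
        out.loc[fix.id, fix.column] = fix.value
    return out.reset_index()


fixed = apply_corrections(vehicles, corrections)
print(fixed)
```

Keying on `id` means the fixes still land on the right rows if the data is re-sorted or filtered before they are applied.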

# Data Visualisation

## Why is this important?
Let's look at the interesting Data Morph project by Stefanie Molin.

https://github.com/stefmolin/data-morph

(This project is based on the work completed in Data Morph (DOI: 10.5281/zenodo.7834197) and Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing by Justin Matejka and George Fitzmaurice (ACM CHI 2017)).
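
The same point can be sketched without Data Morph, using the first two datasets of Anscombe's quartet (a classic example, swapped in here as a smaller illustration of the idea). The `summarise` helper is a hypothetical name. The two datasets share the same `x` values and their summary statistics agree to two decimal places, yet one is roughly linear and the other is a smooth curve.

```python
import numpy as np

# Anscombe's quartet, datasets I and II (they share the same x values).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])


def summarise(xs: np.ndarray, ys: np.ndarray) -> dict:
    """Summary statistics rounded to two decimal places."""
    return {
        "mean_y": round(float(ys.mean()), 2),
        "std_y": round(float(ys.std(ddof=1)), 2),
        "corr": round(float(np.corrcoef(xs, ys)[0, 1]), 2),
    }


stats_one = summarise(x, y1)
stats_two = summarise(x, y2)
print(stats_one)
print(stats_two)
```

Plotting `y1` and `y2` against `x` shows the difference that the summary statistics hide.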

In [None]:
# useful for local files, if they are likely to be changing
%load_ext autoreload
%autoreload 2

from datamorph_example import generate_datamorph_example

In [None]:
if in_colab:
    %pip install data-morph-ai

In [None]:
generate_datamorph_example()

**Dataset One**

![Panda](./data_morph/examples/panda-to-star-image-start.png)

![Panda](https://drive.google.com/uc?export=view&id=1ilFTXUl8HKLfQ3I5WlH5bVRdHyINK3zi)

**Dataset Two**

![Star](./data_morph/examples/panda-to-star-image-end.png)

![Star](https://drive.google.com/uc?export=view&id=1KCKxJKFNnu90Rstv-lteSqVTgvL1_AF3)

**Morphing**

![Panda to Star](./data_morph/examples/panda_to_star-example.gif)

![Panda to Star](https://drive.google.com/uc?export=view&id=1N7bpmwX5iAiP5chMVFVOIeBVXSzj-0dG)

In [None]:
%pinfo generate_datamorph_example

In [None]:
generate_datamorph_example??

## Back to Data Visualisation

In [None]:
data.groupby('type').size().sort_values().plot(kind='barh')
plt.show()

In [None]:
data.groupby('category').size().sort_values().plot(kind='barh')
plt.show()

In [None]:
sns.countplot(data=data, x="dow_stolen")
plt.show()

In [None]:
sns.set_style("whitegrid")
chart = sns.countplot(data=data, x="dow_stolen", hue="is_weekend", legend=False, order=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"], palette="Set2")
chart.set(ylabel="Vehicles Stolen", xlabel="Day of the Week")
plt.show()

In [None]:
chart = sns.boxplot(data=data, x="category", y="year", hue="category")
plt.xticks(rotation=30)
plt.show()

In [None]:
subset = data[data['year'] > 0]
chart = sns.boxplot(data=subset, x="category", y="year", hue="category")
plt.xticks(rotation=30)
plt.show()

# Building a simple model
Before anyone says it, this model is _not_ a good one; I've just included it as an example to cover some of the things we might be thinking of when we create a model.

It's also true that not all research models are successful, and these failures still constitute valid outcomes of the research.

Hypothesis:
- based on my vehicle make, model, and location region, can I predict which day of the week it is most likely to get stolen on?

In [None]:
make = "Toyota"
model = "HILUX"
region = "Auckland"
print(data[(data['make'] == make) & (data['model'] == model) & (data['region'] == region)].groupby('dow_stolen')['id'].count())

In [None]:
model_cols = ['colour', 'make', 'model']
target_col = 'ndow_stolen'
print(data[target_col])
data[model_cols]

In [None]:
model_data_with_target = data[model_cols + [target_col]].dropna()
target_data = model_data_with_target[target_col]
model_data = model_data_with_target[model_cols]

le = LabelEncoder()
X_encoded = model_data.apply(le.fit_transform)
y_encoded = le.fit_transform(target_data)

X_encoded
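
As an alternative sketch, scikit-learn's `OrdinalEncoder` is intended for feature columns, while `LabelEncoder` is intended for the target. Keeping one fitted encoder for each also means the mappings are still available at prediction time. The `demo` frame and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical rows with the same kind of columns as the model data.
demo = pd.DataFrame(
    {
        "colour": ["Silver", "White", "Silver", "Black"],
        "make": ["Toyota", "Honda", "Toyota", "Mazda"],
        "ndow_stolen": ["1", "5", "0", "1"],
    }
)

# One encoder for all feature columns; unseen categories are mapped to -1.
feature_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_demo = feature_encoder.fit_transform(demo[["colour", "make"]])

# A separate encoder for the target, so predictions can be decoded later.
target_encoder = LabelEncoder()
y_demo = target_encoder.fit_transform(demo["ndow_stolen"])

# A new vehicle with a colour the encoder has never seen.
unseen = pd.DataFrame({"colour": ["Purple"], "make": ["Toyota"]})
print(feature_encoder.transform(unseen))
print(target_encoder.inverse_transform(y_demo))
```

Either way the encoded integers imply an ordering that colours and makes don't really have, which tree-based models tolerate better than linear ones.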

In [None]:
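# fix the split seed as well, so the reported accuracy is reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, random_state=42)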

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
y_pred = model.predict(X_test)
print(f"Model accuracy: {score:.2f}")

print("Feature importances:", model.feature_importances_)

**Conclusion**: The model performs about as well as a random guess (roughly 1 in 7, or 0.14, for a day-of-week target), so offers no benefit in this particular scenario.
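
One way to back up a "no better than guessing" conclusion is to score a trivial baseline next to the model. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data: random features and a perfectly balanced seven-class target, both invented for illustration. It does not use the thefts data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 random integer features, 7 balanced classes, no real signal.
rng = np.random.default_rng(42)
y = np.arange(700) % 7
X = rng.integers(0, 20, size=(700, 3))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_acc:.2f}")
print(f"Random forest accuracy: {forest_acc:.2f}")
```

If the real model's accuracy sits close to the baseline's, the features are probably not carrying much signal about the target.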

# Other Useful Magic Commands

In [None]:
%%time

print('hello')
for i in range(1000): np.random.random_sample()
time.sleep(5)
print('complete')

In [None]:
print('hello')
%time for i in range(1000): np.random.random_sample()
time.sleep(5)
print('complete')

In [None]:
%whos

In [None]:
%history -n

In [None]:
%recall 34-35

In [None]:
%%time

print('hello')
for i in range(1000): np.random.random_sample()
time.sleep(5)
print('complete')
print('hello')
%time for i in range(1000): np.random.random_sample()
time.sleep(5)
print('complete')

# Saving the Notebook
- recommend resetting the state of the notebook and running it end to end to check for any errors (usually state related)
- usual guidance is to not save the notebook output unless you really need to. Client data etc ending up in GitHub or GitLab is really not ideal
- use something like Jupytext to only commit the code portions of the notebook to Git, and make your PRs readable too
- anything that needs to be logged and kept for potential re-use / rerunning later should be saved and logged appropriately
  - e.g. for research experiment tracking you want to note the exact code commit hash, so that the code can be checked out and run exactly as it was at the time of the experiment. The datasets used should also be preserved, so that it's possible to recreate the experiment and its outputs exactly
- if you are saving your notebook, it can be useful to include a final datetime print at the end, so you can see when it was last run end to end:

In [None]:
str(datetime.now())
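
For the experiment tracking point above, a minimal sketch of recording the commit hash alongside the run time could look like the following. It assumes the notebook is run from inside a Git checkout with `git` on the path, and falls back to `None` otherwise. `current_commit` and `run_record` are hypothetical names.

```python
import subprocess
from datetime import datetime, timezone


def current_commit():
    """Return the current Git commit hash, or None if it can't be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None  # git is not installed
    # A non-zero return code means we're not inside a Git repository.
    return result.stdout.strip() if result.returncode == 0 else None


run_record = {
    "commit": current_commit(),
    "run_at": datetime.now(timezone.utc).isoformat(),
}
print(run_record)
```

The hash only describes committed code, so uncommitted changes in the working tree would need to be committed or logged separately for the record to be exact.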