# Montréal-Python Workshop: Data Manipulation & Machine Learning

This notebook contains my annotated solutions to the exercises from the Montréal-Python workshop held on 18 February 2023.

<i>Ce cahier contient mes solutions annotées pour l'atelier de Montréal-Python qui a eu lieu le 18 février 2023.</i>

## TOC
* [Exercise with Pandas / <i>Exercice de Pandas</i>](#exercice-de-pandas)
* [Exercise with NumPy / <i>Exercice de NumPy</i>](#exercice-de-numpy)
* [Machine Learning Exercise / <i>Exercice d'apprentissage automatique</i>](#exercice-d-apprentissage)

# Exercise 1 <a class="anchor" id="exercice-de-pandas"></a>

The goal of this [exercise](https://colab.research.google.com/drive/1QRNus1YCtKPAjlNT165i0NtJ3Qt2oZvr#scrollTo=kycZrfZPXa3b) was to identify the Montréal borough with the fewest birch trees. The City of Montréal makes its [data on public trees](https://donnees.montreal.ca/ville-de-montreal/arbres) freely available.

<i>Le but de cet [exercice](https://colab.research.google.com/drive/1QRNus1YCtKPAjlNT165i0NtJ3Qt2oZvr#scrollTo=kycZrfZPXa3b) était d'identifier l'arrondissement de Montréal ayant le moins de bouleaux. [Les données sur les arbres publics](https://donnees.montreal.ca/ville-de-montreal/arbres) sont fournies gratuitement par la ville.</i>

### Data Import and Cleaning

In [1]:
import pandas as pd

DATA_URL = r"https://data.montreal.ca/dataset/b89fd27d-4b49-461b-8e54-fa2b34a628c4/resource/64e28fe6-ef37-437a-972d-d1d3f1f7d891/download/arbres-publics.csv"

# Downloading the data takes some time (the file is roughly 100 MB), so cache it locally as a pickle to avoid downloading it on every run.

try:
    # File -> DataFrame
    tree_df = pd.read_pickle("./montreal-trees.pickle")
except FileNotFoundError:
    tree_df = pd.read_csv(DATA_URL)
    # DataFrame -> File
    # Avoid compressing file as space is generally cheaper than time.
    pd.to_pickle(tree_df, "./montreal-trees.pickle", compression=None)

#--- Basic EDA---#

# Check the number of rows and columns, as well as column names and data types

display(f"rows, columns: {tree_df.shape}")
display(tree_df.dtypes.to_frame(name = "col_type"))


# Correct inconsistent capitalisation of column names to make analysis easier

tree_df.rename(str.lower, axis = "columns", inplace = True)

# DataFrame -> Series -> DataFrame
# Count missing values for each column

(tree_df.isna()
    .sum()
    .to_frame(name = "missing_values"))



'rows, columns: (336634, 22)'

| column | col_type |
|---|---|
| INV_TYPE | object |
| EMP_NO | int64 |
| ARROND | int64 |
| ARROND_NOM | object |
| Rue | object |
| COTE | object |
| No_civique | float64 |
| Emplacement | object |
| Coord_X | float64 |
| Coord_Y | float64 |


| column | missing_values |
|---|---|
| inv_type | 0 |
| emp_no | 0 |
| arrond | 0 |
| arrond_nom | 0 |
| rue | 108312 |
| cote | 108312 |
| no_civique | 157695 |
| emplacement | 0 |
| coord_x | 0 |
| coord_y | 3 |
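
To make the two cleaning steps above concrete, here is a minimal sketch on a hypothetical three-row frame. The column names mirror a few of the real ones, but every value is invented for illustration and does not come from the city's dataset. It lower-cases the column names, then counts missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values are made up, only the column names mirror the real file.
toy_df = pd.DataFrame(
    {
        "ARROND_NOM": ["Verdun", "LaSalle", "Ville-Marie"],
        "Rue": ["rue Wellington", np.nan, np.nan],
        "Coord_Y": [5036000.0, 5033000.0, np.nan],
    }
)

# Lower-case the column names; rename returns a new frame, so reassign it.
toy_df = toy_df.rename(columns=str.lower)

# isna() yields a boolean frame; summing counts the True values per column.
missing = toy_df.isna().sum().to_frame(name="missing_values")
print(missing)
```

Reassigning the result of `rename` instead of passing `inplace = True` is only a stylistic choice here; both approaches give the same column names.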


### Main Solution

In [2]:
# DataFrame -> DataFrame
# Extract only the columns needed to determine which borough has the fewest birch trees,
# and lower-case every value so that string matching is case-insensitive

cols = ["arrond_nom", "essence_fr", "essence_latin", "essence_ang"]
tree_subset = tree_df.loc[:, cols].apply(lambda x: x.str.lower())

# Verify that classification as birch is consistent across columns.
# Doing so will allow simplification of the filter required to extract
# relevant data
# n.b. compare the boolean masks themselves: calling .index on a mask returns
# the full index of the frame, so comparing indexes would always succeed

import warnings

en_mask = tree_subset["essence_ang"].str.contains("birch", na = False)
fr_mask = tree_subset["essence_fr"].str.contains("bouleau", na = False)
latin_mask = tree_subset["essence_latin"].str.contains("betula", na = False)


if not (en_mask.equals(fr_mask) and fr_mask.equals(latin_mask)):
    warnings.warn("Classification is not consistent", stacklevel = 2)


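# Selecting a single column returns a Series of borough names, one entry per birch tree
birch_arrond = tree_subset.loc[tree_subset["essence_fr"].str.contains("bouleau", na = False), "arrond_nom"]

# Count birch trees per borough, fewest first
(birch_arrond.groupby(birch_arrond)
    .count()
    .sort_values()
    .to_frame(name = "count"))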

| arrond_nom | count |
|---|---|
| ville-marie | 12 |
| le plateau-mont-royal | 34 |
| saint-laurent | 54 |
| saint-léonard | 56 |
| villeray-saint-michel - parc-extension | 71 |
| verdun | 115 |
| le sud-ouest | 147 |
| lasalle | 199 |
| pierrefonds - roxboro | 220 |
| ahuntsic - cartierville | 246 |
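
The filter-then-count approach can be sketched on a small hypothetical frame. The rows below are invented and the borough counts have nothing to do with the real results. The sketch compares the French and Latin masks directly, then counts birch trees per borough with `value_counts`, which is one possible alternative to grouping a Series by itself.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the real ones, rows are invented.
toy = pd.DataFrame(
    {
        "arrond_nom": ["verdun", "verdun", "verdun", "lasalle",
                       "lasalle", "ville-marie", "lasalle", "ville-marie"],
        "essence_fr": ["bouleau gris", "bouleau jaune", "bouleau à papier", "bouleau gris",
                       "bouleau à papier", "bouleau gris", "érable rouge", "chêne rouge"],
        "essence_latin": ["betula populifolia", "betula alleghaniensis", "betula papyrifera",
                          "betula populifolia", "betula papyrifera", "betula populifolia",
                          "acer rubrum", "quercus rubra"],
    }
)

# Boolean masks: True where the row is classified as a birch in that language.
fr_mask = toy["essence_fr"].str.contains("bouleau", na=False)
latin_mask = toy["essence_latin"].str.contains("betula", na=False)

# The classification is consistent only if both masks agree row by row.
consistent = fr_mask.equals(latin_mask)

# Count birch trees per borough, fewest first.
birch_counts = toy.loc[fr_mask, "arrond_nom"].value_counts().sort_values()
fewest = birch_counts.idxmin()

print(consistent)
print(birch_counts)
print(fewest)
```

One caveat with this approach: a borough with no birch trees at all never passes the filter, so it is absent from the counts instead of appearing with a count of zero.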


# Exercise 2 <a class="anchor" id="exercice-de-numpy"></a>

This [exercise](https://colab.research.google.com/drive/1caSuGsZeiNHU-B_R07LNoXq7fVLYoTeK#scrollTo=Md3YAuk1lwpD) aimed to familiarise participants with the NumPy library.

<i>Cet [exercice](https://colab.research.google.com/drive/1caSuGsZeiNHU-B_R07LNoXq7fVLYoTeK#scrollTo=Md3YAuk1lwpD) visait à familiariser les participants avec la bibliothèque NumPy.</i>

In [4]:
import numpy as np

examen_1_liste = [67, 89, 73, 65, 75, 95, 62]
examen_2_liste = [65, 78, 87, 98, 67, 67, 71]
examen_3 = np.random.randint(55,101, size = 15)

# Convert both lists to 1d NumPy arrays, then create 2d array with NumPy functions

examen_1_liste, examen_2_liste = map(np.array, [examen_1_liste, examen_2_liste])
student_scores_np = np.hstack((examen_1_liste.reshape(7, 1), examen_2_liste.reshape(7, 1)))

# Alternatively, with indexing and block

student_scores_np_block = np.block([examen_1_liste[:, np.newaxis], examen_2_liste[:, np.newaxis]])

# One liner to create 2d NumPy array with list comprehension

student_scores_lc = np.array([[score1, score2] for score1, score2 in zip(examen_1_liste, examen_2_liste)])

# Confirm all arrays contain the same values and are the same shape.
# Use assert so that a mismatch is not silently discarded mid-cell.

assert np.array_equal(student_scores_np, student_scores_np_block)
assert np.array_equal(student_scores_np, student_scores_lc)

# Cleanup
del student_scores_np_block, student_scores_lc

# Mean for each student

student_scores_np.mean(axis=1)

# Mean and std for each exam

scores_transposed = student_scores_np.transpose()
np.vstack((scores_transposed.mean(axis = 1), scores_transposed.std(axis = 1)))

# Standardise each student's mean against the mean and std of all scores (a z-score-style measure)

(student_scores_np.mean(axis = 1) - scores_transposed.mean()) / scores_transposed.std()

# Add three to each score for the first exam

examen_1_liste + 3

# Return the maximum score for the first exam

examen_1_liste.max()

# Subtract a random value between 1 and 5 from each score for the second exam.
# n.b. randint is called without `size`, so a single value is drawn and subtracted from every score

examen_2_liste - np.random.randint(1, 6)

# Subtract random values between 1 and 15 from the exam 1 scores. Each value must be the same
# for every student, with a different column for each value subtracted.
# Subtract four values in total, resulting in a table with dimensions (7, 4)

rng = np.random.default_rng(12345)
# Draw the four values once; broadcasting (7, 1) - (4,) applies each value to every student
examen_1_liste[:, np.newaxis] - rng.integers(1, 16, size = 4)

# Calculate the average for scores in exam 1 greater than 70 

examen_1_liste[examen_1_liste > 70].mean()

# For exam 1, replace scores less than 75 with a score of 75. Do so without modifying the original array.

examen_1_cp = examen_1_liste.copy()
examen_1_cp[examen_1_cp < 75] = 75


# For exam 2, add 2 to any score higher than 60 but lower than 75. Do not modify the original results.
# n.b. assign through the boolean mask; `examen_2_cp[mask] + 2` alone would return only the selected scores, giving a shortened array

examen_2_cp = examen_2_liste.copy()
examen_2_cp[(examen_2_cp > 60) & (examen_2_cp < 75)] += 2

# Modify exam 3 so that even-indexed scores are equal to 75 and the last value is equal to 65

examen_3[:-1:2] = 75
examen_3[-1] = 65

examen_3

array([75, 77, 75, 94, 75, 67, 75, 80, 75, 91, 75, 72, 75, 75, 65])
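
Two techniques used in the cell above, broadcasting a column against a row and assigning through a boolean mask, are easier to follow with fixed numbers. The sketch below reuses the exam 1 scores but replaces the random draws with four hypothetical, hand-picked offsets so that the result is reproducible.

```python
import numpy as np

exam_1 = np.array([67, 89, 73, 65, 75, 95, 62])

# Hypothetical fixed offsets standing in for the four random values.
offsets = np.array([1, 5, 10, 15])

# Shape (7, 1) minus shape (4,) broadcasts to (7, 4):
# every student (row) has the same offset removed within a given column.
adjusted = exam_1[:, np.newaxis] - offsets

# Boolean-mask assignment on a copy leaves the original array untouched.
floored = exam_1.copy()
floored[floored < 75] = 75

print(adjusted.shape)
print(adjusted[0])
print(floored)
```

For the flooring step, `np.maximum(exam_1, 75)` should give the same result without an explicit copy; the mask version is shown because it matches the style of the solution above.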