# Montréal-Python Workshop: Data Manipulation & Machine Learning

This notebook contains my annotated solutions to the exercises discussed at the Montréal-Python workshop held on 18 February 2023.

<i>Ce cahier contient mes solutions annotées pour l'atelier de Montréal-Python qui a eu lieu le 18 février 2023.</i>

## TOC
* [Exercise with Pandas / <i>Exercice de Pandas</i>](#exercice-de-pandas)
* [Exercise with NumPy / <i>Exercice de NumPy</i>](#exercice-de-numpy)
* [Machine Learning Exercise / <i>Exercice d'apprentissage automatique</i>](#exercice-d-apprentissage)

# Exercise 1 <a class="anchor" id="exercice-de-pandas"></a>

The goal of this [exercise](https://colab.research.google.com/drive/1QRNus1YCtKPAjlNT165i0NtJ3Qt2oZvr#scrollTo=kycZrfZPXa3b) was to identify the Montréal neighbourhood with the fewest birch trees. [Data on public trees](https://donnees.montreal.ca/ville-de-montreal/arbres) is provided freely by the City of Montréal.

<i>Le but de cet [exercice](https://colab.research.google.com/drive/1QRNus1YCtKPAjlNT165i0NtJ3Qt2oZvr#scrollTo=kycZrfZPXa3b) était d'identifier l'arrondissement de Montréal ayant le moins de bouleaux. [Les données sur les arbres publics](https://donnees.montreal.ca/ville-de-montreal/arbres) sont fournies gratuitement par la ville.</i>

### Data Import and Cleaning

In [3]:
import pandas as pd

DATA_URL = r"https://data.montreal.ca/dataset/b89fd27d-4b49-461b-8e54-fa2b34a628c4/resource/64e28fe6-ef37-437a-972d-d1d3f1f7d891/download/arbres-publics.csv"

# Downloading the data takes some time (file size is approx. 100 MB). Serialize the data to avoid downloading it on every run.

try:
    # File -> DataFrame
    tree_df = pd.read_pickle("./montreal-trees.pickle")
except FileNotFoundError:
    tree_df = pd.read_csv(DATA_URL)
    # DataFrame -> File
    # Avoid compressing file as space is generally cheaper than time.
    pd.to_pickle(tree_df, "./montreal-trees.pickle", compression=None)

#--- Basic EDA---#

# Check the number of rows and columns, as well as column names and data types

display(f"rows, columns: {tree_df.shape}", "column types:", tree_df.dtypes)

# Correct inconsistent capitalisation of column names to make analysis easier

tree_df.rename(str.lower, axis = "columns", inplace = True)

# DataFrame -> DataFrame -> Series
# Count missing values for each column 

tree_df.isna().sum()



'rows, columns: (336649, 22)'

'column types:'

INV_TYPE            object
EMP_NO               int64
ARROND               int64
ARROND_NOM          object
Rue                 object
COTE                object
No_civique         float64
Emplacement         object
Coord_X            float64
Coord_Y            float64
SIGLE               object
Essence_latin       object
Essence_fr          object
ESSENCE_ANG         object
DHP                float64
Date_releve         object
Date_plantation     object
LOCALISATION        object
CODE_PARC           object
NOM_PARC            object
Longitude          float64
Latitude           float64
dtype: object

inv_type                0
emp_no                  0
arrond                  0
arrond_nom              0
rue                108314
cote               108314
no_civique         157699
emplacement             0
coord_x                 0
coord_y                 3
sigle                   0
essence_latin           0
essence_fr              0
essence_ang             0
dhp                   637
date_releve           637
date_plantation    165660
localisation       108613
code_parc          228335
nom_parc           228335
longitude               3
latitude                3
dtype: int64
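
The download-once-then-reuse pattern from the import cell can be wrapped in a small helper. This is a minimal sketch rather than part of the workshop material: `load_cached` and its parameter names are hypothetical, and it assumes the source is a CSV that `pd.read_csv` can read without extra arguments.

```python
from pathlib import Path

import pandas as pd


def load_cached(source, cache_path):
    """Return the CSV at `source` as a DataFrame, reusing a local pickle if present.

    `load_cached` is a hypothetical helper written for illustration.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: read the pickle and skip the slow download.
        return pd.read_pickle(cache_path)
    # Cache miss: read from the source (a URL or a local path).
    df = pd.read_csv(source)
    # Store an uncompressed pickle so the next run loads quickly.
    df.to_pickle(cache_path, compression=None)
    return df
```

With this helper, the import cell would reduce to something like `tree_df = load_cached(DATA_URL, "./montreal-trees.pickle")`. Checking for the file explicitly, instead of catching `FileNotFoundError`, keeps the fast path and the slow path visually separate; either approach behaves the same here.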

### Main Solution

In [13]:
# DataFrame -> DataFrame
# Extract only data required to determine which neighbourhood has the lowest number of Birch trees

cols = ['arrond_nom', "essence_fr", "essence_latin", "essence_ang"]
tree_subset = tree_df.loc[:, cols].apply(lambda x: x.str.lower())

# Verify that classification as birch is consistent across columns.
# Doing so will allow simplification of the filter required to extract
# relevant data.
# Compare the boolean masks themselves: the .index of a mask is the full
# index of the Series whether or not a row matches, so comparing indexes
# would always succeed.

import warnings

en_mask = tree_subset["essence_ang"].str.contains("birch")
fr_mask = tree_subset["essence_fr"].str.contains("bouleau")
latin_mask = tree_subset["essence_latin"].str.contains("betula")


if not (en_mask.equals(fr_mask) and fr_mask.equals(latin_mask)):
    warnings.warn("Classification is not consistent", stacklevel = 2)


df_birch = tree_subset.loc[tree_subset["essence_fr"].str.contains("bouleau"), "arrond_nom"]

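# Despite its name, df_birch is a Series of neighbourhood names, one entry per birch tree
type(df_birch)

# Series -> Series -> DataFrame
# Count birch trees per neighbourhood, sorted from fewest to most
(df_birch.groupby(df_birch)
    .count()
    .sort_values()
).to_frame(name = "count")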



| arrond_nom | count |
|---|---|
| ville-marie | 12 |
| le plateau-mont-royal | 34 |
| saint-laurent | 54 |
| saint-léonard | 56 |
| villeray-saint-michel - parc-extension | 71 |
| verdun | 115 |
| le sud-ouest | 147 |
| lasalle | 199 |
| pierrefonds - roxboro | 220 |
| ahuntsic - cartierville | 246 |
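
The counting step can also be written with `value_counts`, and `idxmin` returns the name of the neighbourhood with the fewest trees directly. The sketch below uses a small made-up frame (not rows from the city's dataset) and shows one caveat: a neighbourhood with no birch trees at all never appears in the filtered counts, so it has to be added back with a zero. Whether that case occurs in the real data is not checked here.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "arrond_nom": ["verdun", "lasalle", "verdun", "lasalle",
                   "verdun", "ville-marie", "outremont"],
    "essence_fr": ["bouleau gris", "bouleau jaune", "érable rouge", "bouleau gris",
                   "bouleau à papier", "bouleau gris", "érable rouge"],
})

# Boolean mask: True for rows whose French species name mentions "bouleau".
is_birch = toy["essence_fr"].str.contains("bouleau")

# Count birch trees per neighbourhood, fewest first.
birch_counts = toy.loc[is_birch, "arrond_nom"].value_counts(ascending=True)

# Neighbourhoods without any birch are missing from birch_counts;
# reindex over every neighbourhood and fill the gaps with 0.
all_counts = birch_counts.reindex(toy["arrond_nom"].unique(), fill_value=0)

fewest_among_present = birch_counts.idxmin()
fewest_overall = all_counts.idxmin()
```

In this toy frame the two answers differ: `fewest_among_present` only considers neighbourhoods that have at least one birch, while `fewest_overall` also considers the one with none.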
