See README.md for setup instructions and general information.

# Context

As part of the "Greening the City" program organized by the City of Paris, we are conducting an exploratory data analysis of the city's OpenData on the trees it manages. The goal is to help Paris become a "Smart City" by managing its trees as responsibly as possible, in particular by optimizing the routes required to maintain them.

## Tools Used

- Python
- Jupyter Notebook
- Pandas
- Matplotlib
- Seaborn


# Step 2: Perform a Naive Analysis of the Dataset

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load the dataset

In [2]:
# Relative path to the CSV file from the notebook with separator ;
file_path = '../data/P2-arbres-fr.csv'

# Load the data into a pandas DataFrame
df = pd.read_csv(file_path, sep=';')

## Display the number of rows and columns to get a general overview of the dataset

In [3]:
print(f'The dataset contains {df.shape[0]} rows and {df.shape[1]} columns.')

The dataset contains 200137 rows and 18 columns.


## Get general information on columns, data types, etc.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200137 entries, 0 to 200136
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   200137 non-null  int64  
 1   type_emplacement     200137 non-null  object 
 2   domanialite          200136 non-null  object 
 3   arrondissement       200137 non-null  object 
 4   complement_addresse  30902 non-null   object 
 5   numero               0 non-null       float64
 6   lieu                 200137 non-null  object 
 7   id_emplacement       200137 non-null  object 
 8   libelle_francais     198640 non-null  object 
 9   genre                200121 non-null  object 
 10  espece               198385 non-null  object 
 11  variete              36777 non-null   object 
 12  circonference_cm     200137 non-null  int64  
 13  hauteur_m            200137 non-null  int64  
 14  stade_developpement  132932 non-null  object 
 15  remarquable      

## Column descriptions
For each tree listed, we have the following information (per the dataset description):

- id: simple identifier for the tree (integer, e.g.: 99874)
- type_emplacement: type of location (text, e.g.: "Tree")
- domanialite: type of place the tree belongs to (text, e.g.: "Garden")
- arrondissement: the district of Paris where the tree is located (text, e.g.: "PARIS 7E ARRDT")
- complement_addresse: address complement (text, no example given)
- numero: address number (no example given; the column is entirely empty in this dataset, so pandas reads it as float64)
- lieu: tree's location address (text, e.g.: "MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E")
- id_emplacement: identifier of the location (text, e.g.: "19")
- libelle_francais: common (vernacular) name of the tree's species (text, e.g.: "Chestnut")
- genre: genus of the tree (text, e.g.: "Aesculus")
- espece: species of the tree (text, e.g.: "hippocastanum")
- variete: variety of the tree (text, no example given)
- circonference_cm: circumference of the tree in centimeters (integer, e.g.: 20)
- hauteur_m: height of the tree in meters (integer, e.g.: 5)
- stade_developpement: stage of development of the tree (text, e.g.: "A" for "Adult")
- remarquable: whether the tree is "remarkable" or not (boolean, e.g.: 0 for a "non-remarkable" tree)
- geo_point_2d_a: latitude of the tree's position (floating number, e.g.: 48.857620)
- geo_point_2d_b: longitude of the tree's position (floating number, e.g.: 2.320962)

We already observe from the first few entries that:
- a number of values are not provided (`NaN` = "Not a Number" = data not available)
- we can classify the variables by their type:
    - quantitative
        - discrete: `id`, `circonference_cm`, `hauteur_m`
        - continuous: `geo_point_2d_a`, `geo_point_2d_b`
    - qualitative
        - nominal: `type_emplacement`, `domanialite`, `arrondissement`, `complement_addresse`, `numero`, `lieu`, `id_emplacement`, `libelle_francais`, `genre`, `espece`, `variete`
        - ordinal: `stade_developpement`, `remarquable`
- variables can also be classified into three main categories, according to their meaning:
    - internal system metadata: `id`, `id_emplacement`, `type_emplacement`
    - location data: `arrondissement`, `complement_addresse`, `numero`, `lieu`, `geo_point_2d_a`, `geo_point_2d_b`
    - descriptive data: 
        - size: `circonference_cm`, `hauteur_m` and `stade_developpement`
        - type: `libelle_francais`, `genre`, `espece`, and `variete`
        - other: `remarquable`
- the `remarquable` variable is a boolean, which can be converted to a more readable format (e.g., "Yes" and "No")
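
The conversion of `remarquable` to a more readable format, mentioned in the last point, could be sketched as follows. This is a minimal illustration on a hypothetical four-row sample, not the notebook's actual code: it assumes the flag only takes the values 0 and 1 (or is missing), and the column name `remarquable_label` is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the `remarquable` column
# (assumption: values are 0.0, 1.0, or missing)
sample = pd.DataFrame({"remarquable": [0.0, 1.0, np.nan, 0.0]})

# Map the numeric flag to labels; values absent from the mapping
# (here, missing values) remain NaN
labels = {0.0: "No", 1.0: "Yes"}
sample["remarquable_label"] = sample["remarquable"].map(labels)

print(sample)
```

Keeping missing values as `NaN` rather than mapping them to "No" avoids silently treating "not recorded" as "not remarkable".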

## Glossary of Statistical Terms

- **Quantitative**: Data that can be counted or measured and expressed numerically.
    - **Discrete**: Quantitative data that can only take on specific, separate values (e.g., the number of trees in a garden).
    - **Continuous**: Quantitative data that can take on any value within a given range (e.g., the latitude or longitude of a tree).

- **Qualitative**: Data that describes qualities or characteristics and cannot be measured numerically.
    - **Nominal**: Qualitative data that represents categories without a natural order or ranking (e.g., tree species names).
    - **Ordinal**: Qualitative data that represents categories with a natural order or ranking (e.g., stages of tree development).

- **NaN** (Not a Number): A term used to denote a value that is undefined or unrepresentable, especially in cases where data is missing.

- **Metadata**: Data that provides information about other data. In the context of a dataset, it can refer to auxiliary information like identifiers and types of locations.

- **Location Data**: Data that provides information on the geographical position, such as districts or coordinates.

- **Descriptive Data**: Data that gives descriptive information, which can be about size (like circumference and height), type (such as genus and species), or other characteristics (like whether a tree is considered remarkable).
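
As a rough illustration of these terms, the sketch below splits columns into numeric and non-numeric groups and measures the share of missing values per column. It runs on a hypothetical three-row sample that reuses column names from the dataset. A dtype-based split is only a first approximation: numeric identifiers such as `id` would also land in the numeric group, so the result still needs manual review.

```python
import pandas as pd

# Hypothetical sample reusing column names from the dataset
sample = pd.DataFrame({
    "genre": ["Aesculus", "Taxus", None],
    "hauteur_m": [5, 8, 10],
    "stade_developpement": [None, "A", "A"],
    "geo_point_2d_a": [48.857620, 48.857656, 48.857705],
})

# First approximation: numeric dtypes vs. everything else
quantitative = sample.select_dtypes(include="number").columns.tolist()
qualitative = sample.select_dtypes(exclude="number").columns.tolist()

# Share of missing values (NaN/None) per column, between 0 and 1
missing_share = sample.isna().mean()

print("Quantitative:", quantitative)
print("Qualitative:", qualitative)
print(missing_share)
```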


## Display the first few rows of the DataFrame for an initial overview of the data

In [5]:
# Display the first few rows of the DataFrame for an initial overview of the data
df.head()

Unnamed: 0,id,type_emplacement,domanialite,arrondissement,complement_addresse,numero,lieu,id_emplacement,libelle_francais,genre,espece,variete,circonference_cm,hauteur_m,stade_developpement,remarquable,geo_point_2d_a,geo_point_2d_b
0,99874,Arbre,Jardin,PARIS 7E ARRDT,,,MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E,19,Marronnier,Aesculus,hippocastanum,,20,5,,0.0,48.85762,2.320962
1,99875,Arbre,Jardin,PARIS 7E ARRDT,,,MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E,20,If,Taxus,baccata,,65,8,A,,48.857656,2.321031
2,99876,Arbre,Jardin,PARIS 7E ARRDT,,,MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E,21,If,Taxus,baccata,,90,10,A,,48.857705,2.321061
3,99877,Arbre,Jardin,PARIS 7E ARRDT,,,MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E,22,Erable,Acer,negundo,,60,8,A,,48.857722,2.321006
4,99878,Arbre,Jardin,PARIS 17E ARRDT,,,PARC CLICHY-BATIGNOLLES-MARTIN LUTHER KING,000G0037,Arbre à miel,Tetradium,daniellii,,38,0,,,48.890435,2.315289


# Step 3: Conduct a Detailed Univariate Analysis of Each Variable in the Dataset

In [6]:
df.describe(include = 'all')

Unnamed: 0,id,type_emplacement,domanialite,arrondissement,complement_addresse,numero,lieu,id_emplacement,libelle_francais,genre,espece,variete,circonference_cm,hauteur_m,stade_developpement,remarquable,geo_point_2d_a,geo_point_2d_b
count,200137.0,200137,200136,200137,30902,0.0,200137,200137.0,198640,200121,198385,36777,200137.0,200137.0,132932,137039.0,200137.0,200137.0
unique,,1,9,25,3795,,6921,69040.0,192,175,539,436,,,4,,,
top,,Arbre,Alignement,PARIS 15E ARRDT,SN°,,PARC FLORAL DE PARIS / ROUTE DE LA PYRAMIDE,101001.0,Platane,Platanus,x hispanica,Baumannii',,,A,,,
freq,,200137,104949,17151,557,,2995,1324.0,42508,42591,36409,4538,,,64438,,,
mean,387202.7,,,,,,,,,,,,83.380479,13.110509,,0.001343,48.854491,2.348208
std,545603.2,,,,,,,,,,,,673.190213,1971.217387,,0.036618,0.030234,0.05122
min,99874.0,,,,,,,,,,,,0.0,0.0,,0.0,48.74229,2.210241
25%,155927.0,,,,,,,,,,,,30.0,5.0,,0.0,48.835021,2.30753
50%,221078.0,,,,,,,,,,,,70.0,8.0,,0.0,48.854162,2.351095
75%,274102.0,,,,,,,,,,,,115.0,12.0,,0.0,48.876447,2.386838


## Calculate Main Characteristics for Quantitative Variables

In [7]:
# Calculate the main characteristics for quantitative variables
# 'hauteur_m' and 'circonference_cm' are the two measured quantitative variables in the dataset
variables = ['hauteur_m', 'circonference_cm']

for variable in variables:
    print(f"Analysis of {variable}:")
    print(f"Mean: {df[variable].mean()}")
    print(f"Median: {df[variable].median()}")
    print(f"Standard Deviation: {df[variable].std()}")
    print(f"Quantiles: {df[variable].quantile([0.25, 0.5, 0.75])}\n")


Analysis of hauteur_m:
Mean: 13.110509301128728
Median: 8.0
Standard Deviation: 1971.2173865928366
Quantiles: 0.25     5.0
0.50     8.0
0.75    12.0
Name: hauteur_m, dtype: float64

Analysis of circonference_cm:
Mean: 83.38047937163043
Median: 70.0
Standard Deviation: 673.1902130032512
Quantiles: 0.25     30.0
0.50     70.0
0.75    115.0
Name: circonference_cm, dtype: float64
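
The same statistics can also be gathered into compact tables instead of printed one by one. The sketch below is one possible alternative, not the notebook's method; it uses a small sample built from the five rows shown by `df.head()` so that it runs without the CSV file.

```python
import pandas as pd

# Sample built from the first five rows of the dataset
sample = pd.DataFrame({
    "hauteur_m": [5, 8, 10, 8, 0],
    "circonference_cm": [20, 65, 90, 60, 38],
})
variables = ["hauteur_m", "circonference_cm"]

# One table for mean, median and standard deviation
summary = sample[variables].agg(["mean", "median", "std"])

# One table for the quartiles
quartiles = sample[variables].quantile([0.25, 0.5, 0.75])

print(summary)
print(quartiles)
```

On the full dataset, replacing `sample` with `df` should reproduce the figures printed above in tabular form.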


## Create Graphs for Each Variable

In [None]:
# Histogram with a KDE overlay, then a box plot, for 'hauteur_m'
plt.figure(figsize=(10, 6))
sns.histplot(df['hauteur_m'], kde=True)  # default stat is 'count', so the y-axis shows counts
plt.title('Histogram with KDE for Height')
plt.xlabel('Height (m)')
plt.ylabel('Count')
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(y=df['hauteur_m'])
plt.title('Box Plot for Height')
plt.ylabel('Height (m)')
plt.show()
# Create a histogram for the 'hauteur_m' variable
plt.figure(figsize=(12, 6))