See README.md for setup instructions and general information.

# Context

As part of the "Greening the City" program organized by the City of Paris, we conduct an exploratory data analysis of open data on the trees managed by the city. The goal is to help Paris become a "Smart City" by managing its trees as responsibly as possible, in particular by optimizing the routes required to maintain them.

## Tools Used

- Python: programming language
- Jupyter Notebook: web-based interactive computing notebook
- Pandas: data manipulation library
- Matplotlib: data visualization library
- Seaborn: data visualization library


# Step 2: Perform a Naive Analysis of the Dataset (before cleaning)

In [16]:
import pandas as pd # Import the pandas library used to manipulate the data


## Load the dataset

In [17]:
# Relative path to the CSV file containing the data
file_path = '../data/P2-arbres-fr.csv'

# Load the CSV file into a DataFrame, using ';' as the field separator
df = pd.read_csv(file_path, sep=';')
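The `sep=';'` argument matters because `read_csv` splits on commas by default. A minimal sketch of the effect, using a hypothetical three-row sample held in memory (the values are illustrative and do not come from the real file):

```python
import io

import pandas as pd

# Hypothetical sample mimicking the semicolon-separated layout of the file.
sample_csv = io.StringIO(
    "id;genre;hauteur_m\n"
    "1;Acer;8\n"
    "2;Taxus;10\n"
    "3;;5\n"
)

# With the default sep=',' every line would end up in a single column.
sample = pd.read_csv(sample_csv, sep=';')

print(sample.shape)                  # (number of rows, number of columns)
print(sample['genre'].isna().sum())  # empty fields are read as NaN
```

Empty fields between two separators are read as missing values, which is why several columns of the real dataset show fewer non-null entries than rows.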

## Display the number of rows and columns to get a general overview of the dataset

In [18]:
print(f'The dataset contains {df.shape[0]} rows (instances) and {df.shape[1]} columns (variables).') 

The dataset contains 200137 rows (instances) and 18 columns (variables).


## Get general information on columns, data types, etc.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200137 entries, 0 to 200136
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   200137 non-null  int64  
 1   type_emplacement     200137 non-null  object 
 2   domanialite          200136 non-null  object 
 3   arrondissement       200137 non-null  object 
 4   complement_addresse  30902 non-null   object 
 5   numero               0 non-null       float64
 6   lieu                 200137 non-null  object 
 7   id_emplacement       200137 non-null  object 
 8   libelle_francais     198640 non-null  object 
 9   genre                200121 non-null  object 
 10  espece               198385 non-null  object 
 11  variete              36777 non-null   object 
 12  circonference_cm     200137 non-null  int64  
 13  hauteur_m            200137 non-null  int64  
 14  stade_developpement  132932 non-null  object 
 15  remarquable      
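The non-null counts above can also be expressed as a share of missing values per column, which makes sparse columns easier to spot. The sketch below assumes a small toy frame (`toy` and its values are invented; only the column names are borrowed from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: 'numero' is entirely empty, 'variete' is mostly empty.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'numero': [np.nan, np.nan, np.nan, np.nan],
    'variete': ['Alba', None, None, None],
})

# isna() gives a boolean frame; its column-wise mean is the missing fraction.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Columns with no usable value at all are candidates for removal.
empty_columns = missing_pct[missing_pct == 100].index.tolist()
print(empty_columns)
```

Applied to the real `df`, the same two lines would rank the 18 columns by how incomplete they are.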

In [20]:
df.dtypes.value_counts() # Count the number of columns with each data type

object     11
float64     4
int64       3
Name: count, dtype: int64

### Remarks:

- 7 quantitative variables (4 `float64` columns and 3 `int64` columns)
- 11 qualitative (categorical) variables (the `object` columns)
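The split between quantitative and qualitative columns can be obtained programmatically with `select_dtypes`. This is a sketch on an invented toy frame; it uses `exclude='number'` for the non-numeric side, which is an assumption made here so the example does not depend on how a given pandas version labels text columns:

```python
import pandas as pd

# Toy frame with two integer columns, one float column and one text column.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'hauteur_m': [5, 8, 10],
    'geo_point_2d_a': [48.85, 48.86, 48.89],
    'genre': ['Acer', 'Taxus', 'Acer'],
})

# 'number' matches every integer and float dtype.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(f'{len(numeric_cols)} numeric columns: {numeric_cols}')
print(f'{len(other_cols)} non-numeric columns: {other_cols}')
```

A numeric dtype does not guarantee a truly quantitative variable: an identifier or a 0/1 flag is stored as a number but carries no magnitude.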

## Display the first few rows of the DataFrame for an initial overview of the data

In [21]:
# Display the first few rows of the DataFrame for an initial overview of the data
df.head()

| | id | type_emplacement | domanialite | arrondissement | complement_addresse | numero | lieu | id_emplacement | libelle_francais | genre | espece | variete | circonference_cm | hauteur_m | stade_developpement | remarquable | geo_point_2d_a | geo_point_2d_b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99874 | Arbre | Jardin | PARIS 7E ARRDT | NaN | NaN | MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E | 19 | Marronnier | Aesculus | hippocastanum | NaN | 20 | 5 | NaN | 0.0 | 48.85762 | 2.320962 |
| 1 | 99875 | Arbre | Jardin | PARIS 7E ARRDT | NaN | NaN | MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E | 20 | If | Taxus | baccata | NaN | 65 | 8 | A | NaN | 48.857656 | 2.321031 |
| 2 | 99876 | Arbre | Jardin | PARIS 7E ARRDT | NaN | NaN | MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E | 21 | If | Taxus | baccata | NaN | 90 | 10 | A | NaN | 48.857705 | 2.321061 |
| 3 | 99877 | Arbre | Jardin | PARIS 7E ARRDT | NaN | NaN | MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E | 22 | Erable | Acer | negundo | NaN | 60 | 8 | A | NaN | 48.857722 | 2.321006 |
| 4 | 99878 | Arbre | Jardin | PARIS 17E ARRDT | NaN | NaN | PARC CLICHY-BATIGNOLLES-MARTIN LUTHER KING | 000G0037 | Arbre à miel | Tetradium | daniellii | NaN | 38 | 0 | NaN | NaN | 48.890435 | 2.315289 |


# Step 3: Conduct a Detailed Univariate Analysis of Each Variable in the Dataset

## Statistical Indicators for Quantitative Variables

In [25]:
df.describe()

| | id | numero | circonference_cm | hauteur_m | remarquable | geo_point_2d_a | geo_point_2d_b |
|---|---|---|---|---|---|---|---|
| count | 200137.0 | 0.0 | 200137.0 | 200137.0 | 137039.0 | 200137.0 | 200137.0 |
| mean | 387202.7 | NaN | 83.380479 | 13.110509 | 0.001343 | 48.854491 | 2.348208 |
| std | 545603.2 | NaN | 673.190213 | 1971.217387 | 0.036618 | 0.030234 | 0.05122 |
| min | 99874.0 | NaN | 0.0 | 0.0 | 0.0 | 48.74229 | 2.210241 |
| 25% | 155927.0 | NaN | 30.0 | 5.0 | 0.0 | 48.835021 | 2.30753 |
| 50% | 221078.0 | NaN | 70.0 | 8.0 | 0.0 | 48.854162 | 2.351095 |
| 75% | 274102.0 | NaN | 115.0 | 12.0 | 0.0 | 48.876447 | 2.386838 |
| max | 2024745.0 | NaN | 250255.0 | 881818.0 | 1.0 | 48.911485 | 2.469759 |


## Glossary of Statistical Terms

- **Count**: The number of non-null entries in the column.
- **Mean**: The arithmetic average of the column's non-null values.
- **Std**: Short for "standard deviation," a measure of how widely the values are spread around the mean.
- **Min**: The smallest value in the column.
- **25%**: The 25th percentile, or first quartile: 25% of the values fall at or below it.
- **50%**: The 50th percentile, or median: half of the values fall at or below it.
- **75%**: The 75th percentile, or third quartile: 75% of the values fall at or below it.
- **Max**: The largest value in the column.
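To make these definitions concrete, the sketch below recomputes each indicator by hand on a tiny invented series and compares it with `describe()`. The values are illustrative only; the series name `hauteur_m` is borrowed from the dataset for readability.

```python
import pandas as pd

# Five valid heights and one missing value (illustrative data).
heights = pd.Series([5, 8, 8, 10, 12, None], name='hauteur_m')

summary = heights.describe()
print(summary)

count = heights.count()   # missing values are not counted
mean = heights.mean()     # sum of the valid values divided by count
std = heights.std()       # sample standard deviation (divides by count - 1)
q1, median, q3 = heights.quantile([0.25, 0.5, 0.75])

# The hand-computed indicators match the describe() output.
print(count == summary['count'], median == summary['50%'])
```

Note that pandas reports the sample standard deviation and interpolates linearly between neighbouring values when a percentile falls between two observations.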


### Observations:
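- 'id': identifier, unique values
- 'numero': contains only missing values (count of 0)
- 'circonference_cm': the standard deviation (673.19) is far larger than the mean (83.38) and the maximum reaches 250255 cm, which indicates outliers
- 'hauteur_m': the standard deviation (1971.22) is far larger than the mean (13.11) and the maximum reaches 881818 m, which indicates outliers
- 'remarquable': only 2 possible values (0 or 1), so it behaves as a categorical variable; only 137039 of the 200137 rows are filled in
- 'geo_point_2d_a': very low standard deviation (0.030), no apparent outliers
- 'geo_point_2d_b': very low standard deviation (0.051), no apparent outliers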
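One common heuristic for flagging such outliers is the interquartile-range (IQR) rule. It is not necessarily the method used later in this analysis. The sketch below applies it to a toy frame whose last row reuses the two maxima from the table above; the other values, the helper name `iqr_outliers` and the factor `k=1.5` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame: plausible values plus the two extreme maxima seen in describe().
toy = pd.DataFrame({
    'hauteur_m': [5, 8, 10, 12, 15, 881818],
    'circonference_cm': [30, 70, 90, 115, 140, 250255],
})


def iqr_outliers(series, k=1.5):
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


height_mask = iqr_outliers(toy['hauteur_m'])
girth_mask = iqr_outliers(toy['circonference_cm'])

# Rows flagged on at least one of the two measurements.
print(toy[height_mask | girth_mask])
```

Because quartiles are barely affected by a few extreme values, the fences stay close to the bulk of the data even when the mean and standard deviation are distorted. On real tree data the rule may also flag tall but genuine trees, so a domain-based threshold could be a better fit.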


## Statistical Indicators for Qualitative Variables

In [32]:
var_qualitatives = df.select_dtypes(include='object').columns # Select qualitative variables
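Before listing every distinct value, it can help to check how many each qualitative column has, since some columns hold thousands. A minimal sketch on an invented toy frame (column names are borrowed from the dataset, values are illustrative, and `exclude='number'` is used here instead of `include='object'` as a version-independent assumption):

```python
import pandas as pd

toy = pd.DataFrame({
    'type_emplacement': ['Arbre', 'Arbre', 'Arbre', 'Arbre'],
    'domanialite': ['Jardin', 'Alignement', 'Jardin', None],
    'hauteur_m': [5, 8, 10, 0],
})

qualitative = toy.select_dtypes(exclude='number')

# Number of distinct values per column (missing values are not counted).
cardinality = qualitative.nunique()
print(cardinality.sort_values())

# Frequency of each value, with missing values shown as their own row.
domanialite_counts = qualitative['domanialite'].value_counts(dropna=False)
print(domanialite_counts)

# A column with a single distinct value carries no information.
constant_cols = cardinality[cardinality == 1].index.tolist()
print(constant_cols)
```

Low-cardinality columns are good candidates for bar charts, while near-unique columns such as free-text addresses are better summarized by their count of distinct values.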

### List of Modalities (Distinct Values)


In [34]:
for var in var_qualitatives:
    print(f'{var}: {df[var].unique()}\n')

type_emplacement: ['Arbre']

domanialite: ['Jardin' 'Alignement' 'DJS' 'DFPE' 'CIMETIERE' 'DASCO' 'DAC'
 'PERIPHERIQUE' 'DASES' nan]

arrondissement: ['PARIS 7E ARRDT' 'PARIS 17E ARRDT' 'PARIS 16E ARRDT' 'PARIS 4E ARRDT'
 'PARIS 13E ARRDT' 'PARIS 12E ARRDT' 'PARIS 19E ARRDT' 'PARIS 14E ARRDT'
 'PARIS 15E ARRDT' 'PARIS 3E ARRDT' 'PARIS 20E ARRDT' 'PARIS 18E ARRDT'
 'PARIS 6E ARRDT' 'PARIS 11E ARRDT' 'PARIS 1ER ARRDT' 'PARIS 2E ARRDT'
 'PARIS 5E ARRDT' 'VAL-DE-MARNE' 'SEINE-SAINT-DENIS' 'HAUTS-DE-SEINE'
 'PARIS 9E ARRDT' 'PARIS 10E ARRDT' 'PARIS 8E ARRDT' 'BOIS DE BOULOGNE'
 'BOIS DE VINCENNES']

complement_addresse: [nan 'c 12' '12-36' ... 'au n.14' 'F2bis' '40face']

lieu: ['MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E'
 'PARC CLICHY-BATIGNOLLES-MARTIN LUTHER KING'
 'SQUARE ALEXANDRE ET RENE PARODI / 1 PLACE DE LA PORTE MAILLOT' ...
 'TERRAIN D EDUCATION PHYSIQUE / 49 RUE OLIVIER METRA' 'RUE EDOUARD QUENU'
 'RUE DU GENERAL NIESSEL']

id_emplacement: ['19' '20' '21' ... '720170153' '720170