# Project in Data Science: Maintaining Urban Trees in Paris.

### Notebook by [Nasr-edine DRAI](https://www.hackerrank.com/d_nasredine)



### [Openclassrooms](https://openclassrooms.com/en/)

## Verify Python Virtual Environments

#### Check the Version of the Python Interpreter

In [1]:
!python --version

Python 3.10.4


#### Verify that I'm using the right virtual environment

In [2]:
!pip -V

pip 22.3.1 from /Users/drainasr-edine/github/ingenieur_ia/P2_drai_nasr-edine/ia_project_2_env/lib/python3.10/site-packages/pip (python 3.10)


#### Check Installed Modules in Python

Run through this notebook to make sure my environment is properly set up. Be sure to launch Jupyter from inside the virtual environment.
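The `run_checks` helper used in this section lives in the project's `src/check_environment.py`, which is not reproduced in this notebook. As a rough idea of what such a check can look like, here is a minimal sketch based on the standard library's `importlib.util.find_spec`. The names `REQUIRED`, `check_modules`, and `report` are hypothetical and are not taken from the project's code.

```python
import importlib.util
import sys

# Hypothetical list of required top-level packages (an assumption, adjust as needed).
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "statsmodels", "folium"]


def check_modules(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def report(names=REQUIRED, min_python=(3, 10)):
    """Print one status line for the interpreter and one per module."""
    python_ok = sys.version_info >= min_python
    print(f"[{' OK ' if python_ok else 'FAIL'}] Python {sys.version.split()[0]}")
    for name, found in check_modules(names).items():
        print(f"[{' OK ' if found else 'FAIL'}] {name}")
```

Calling `report()` inside the activated virtual environment prints one status line per package, without importing any of them.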

In [3]:
import os, sys

parent = os.path.abspath('..')
sys.path.insert(1, parent)
print(parent)

/Users/drainasr-edine/github/ingenieur_ia/P2_drai_nasr-edine


In [4]:
from src.check_environment import run_checks
run_checks()

Using Python in /Users/drainasr-edine/github/ingenieur_ia/P2_drai_nasr-edine/ia_project_2_env:
[ OK ] Python is version 3.10.4 (main, Jul 17 2022, 13:52:49) [Clang 13.1.6 (clang-1316.0.21.2.5)]

[ OK ] jupyterlab
[ OK ] matplotlib
[ OK ] numpy
[ OK ] pandas
[ OK ] seaborn
[ OK ] statsmodels
[ OK ] folium


## Import Python Libraries for Data Analysis

In [5]:
# Data manipulation
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap

## Collecting the data

#### Creating a `DataFrame` by Reading in a CSV File

We will read in a CSV file using `pandas`. 

In [6]:
df = pd.read_csv('../data/p2_arbres_fr.csv', sep=';')  # Creating the dataframe
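The `sep=';'` argument matters because `read_csv` assumes a comma separator by default. The small sketch below uses an in-memory string with made-up rows (not the real file) to show what happens when the separator is left out.

```python
import io

import pandas as pd

# Made-up, semicolon-separated sample standing in for the real CSV file.
csv_text = "id;genre;hauteur_m\n1;Aesculus;5\n2;Taxus;8\n"

# Default separator (comma): each line is read as one single field.
wrong = pd.read_csv(io.StringIO(csv_text))

# Explicit semicolon separator: the three columns are split correctly.
right = pd.read_csv(io.StringIO(csv_text), sep=";")

print(wrong.shape)
print(right.shape)
```

A frame with a single column whose name contains semicolons is a quick sign that the wrong separator was used.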

## Quickly Examining the DataFrame

#### Is it empty?

In [7]:
df.empty

False

#### What are the dimensions?

We find the dimensions with the `shape` attribute as (rows, columns).

In [8]:
df.shape

(200137, 18)

#### What columns do we have?

In [9]:
df.columns

Index(['id', 'type_emplacement', 'domanialite', 'arrondissement',
       'complement_addresse', 'numero', 'lieu', 'id_emplacement',
       'libelle_francais', 'genre', 'espece', 'variete', 'circonference_cm',
       'hauteur_m', 'stade_developpement', 'remarquable', 'geo_point_2d_a',
       'geo_point_2d_b'],
      dtype='object')

#### What data types do we have?

We find the data types with the `dtypes` attribute:

In [10]:
df.dtypes

id                       int64
type_emplacement        object
domanialite             object
arrondissement          object
complement_addresse     object
numero                 float64
lieu                    object
id_emplacement          object
libelle_francais        object
genre                   object
espece                  object
variete                 object
circonference_cm         int64
hauteur_m                int64
stade_developpement     object
remarquable            float64
geo_point_2d_a         float64
geo_point_2d_b         float64
dtype: object
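In the listing above, `remarquable` looks like a 0/1 flag yet is stored as `float64`. A likely reason is that NumPy-backed integer columns cannot hold missing values, so pandas falls back to floats as soon as a `NaN` appears. The sketch below illustrates this on made-up series; the nullable `Int64` conversion is shown as one possible option, not as something this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up 0/1 flags without and with a missing value.
flags = pd.Series([0, 1, 0])
flags_with_gap = pd.Series([0, 1, np.nan])

print(flags.dtype)           # integer dtype
print(flags_with_gap.dtype)  # float dtype, because NaN is a float

# Optional: pandas' nullable integer dtype keeps integers and marks gaps as <NA>.
nullable = flags_with_gap.astype("Int64")
print(nullable.dtype)
```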

#### What does the data look like?
View rows from the top with `head()`:

In [11]:
df.head(2)

| | id | type_emplacement | domanialite | arrondissement | complement_addresse | numero | lieu | id_emplacement | libelle_francais | genre | espece | variete | circonference_cm | hauteur_m | stade_developpement | remarquable | geo_point_2d_a | geo_point_2d_b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99874 | Arbre | Jardin | PARIS 7E ARRDT | NaN | NaN | MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E | 19 | Marronnier | Aesculus | hippocastanum | NaN | 20 | 5 | NaN | 0.0 | 48.85762 | 2.320962 |
| 1 | 99875 | Arbre | Jardin | PARIS 7E ARRDT | NaN | NaN | MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E | 20 | If | Taxus | baccata | NaN | 65 | 8 | A | NaN | 48.857656 | 2.321031 |


View rows from the bottom with `tail()`.

In [12]:
df.tail(2)

| | id | type_emplacement | domanialite | arrondissement | complement_addresse | numero | lieu | id_emplacement | libelle_francais | genre | espece | variete | circonference_cm | hauteur_m | stade_developpement | remarquable | geo_point_2d_a | geo_point_2d_b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 200135 | 2024744 | Arbre | Jardin | BOIS DE VINCENNES | NaN | NaN | ARBORETUM DE L ECOLE DU BREUIL - ROUTE DE LA F... | 720170154 | Chêne | Quercus | n. sp. | NaN | 0 | 0 | NaN | 0.0 | 48.822522 | 2.455956 |
| 200136 | 2024745 | Arbre | Jardin | BOIS DE VINCENNES | NaN | NaN | ARBORETUM DE L ECOLE DU BREUIL - ROUTE DE LA F... | 720170155 | Raisinier | Hovenia | dulcis | NaN | 0 | 0 | NaN | 0.0 | 48.820445 | 2.454856 |


## Describing and Summarizing

#### Get summary statistics

In [13]:
df.describe()

| | id | numero | circonference_cm | hauteur_m | remarquable | geo_point_2d_a | geo_point_2d_b |
|---|---|---|---|---|---|---|---|
| count | 200137.0 | 0.0 | 200137.0 | 200137.0 | 137039.0 | 200137.0 | 200137.0 |
| mean | 387202.7 | NaN | 83.380479 | 13.110509 | 0.001343 | 48.854491 | 2.348208 |
| std | 545603.2 | NaN | 673.190213 | 1971.217387 | 0.036618 | 0.030234 | 0.05122 |
| min | 99874.0 | NaN | 0.0 | 0.0 | 0.0 | 48.74229 | 2.210241 |
| 25% | 155927.0 | NaN | 30.0 | 5.0 | 0.0 | 48.835021 | 2.30753 |
| 50% | 221078.0 | NaN | 70.0 | 8.0 | 0.0 | 48.854162 | 2.351095 |
| 75% | 274102.0 | NaN | 115.0 | 12.0 | 0.0 | 48.876447 | 2.386838 |
| max | 2024745.0 | NaN | 250255.0 | 881818.0 | 1.0 | 48.911485 | 2.469759 |


#### Getting extra info and finding nulls

Let’s print the full summary of the dataframe.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200137 entries, 0 to 200136
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   200137 non-null  int64  
 1   type_emplacement     200137 non-null  object 
 2   domanialite          200136 non-null  object 
 3   arrondissement       200137 non-null  object 
 4   complement_addresse  30902 non-null   object 
 5   numero               0 non-null       float64
 6   lieu                 200137 non-null  object 
 7   id_emplacement       200137 non-null  object 
 8   libelle_francais     198640 non-null  object 
 9   genre                200121 non-null  object 
 10  espece               198385 non-null  object 
 11  variete              36777 non-null   object 
 12  circonference_cm     200137 non-null  int64  
 13  hauteur_m            200137 non-null  int64  
 14  stade_developpement  132932 non-null  object 
 15  remarquable      

## Cleaning the data

#### Columns That Are Empty

In [15]:
df.isnull().sum()

id                          0
type_emplacement            0
domanialite                 1
arrondissement              0
complement_addresse    169235
numero                 200137
lieu                        0
id_emplacement              0
libelle_francais         1497
genre                      16
espece                   1752
variete                163360
circonference_cm            0
hauteur_m                   0
stade_developpement     67205
remarquable             63098
geo_point_2d_a              0
geo_point_2d_b              0
dtype: int64
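Raw null counts are easier to compare when expressed as a share of the rows. Because `isnull()` returns booleans, taking the column-wise mean gives the missing fraction directly. The sketch below uses a made-up frame (`toy`) rather than the real data.

```python
import numpy as np
import pandas as pd

# Made-up frame with a complete, a mostly empty and a fully empty column.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "variete": [np.nan, "x", np.nan, np.nan],
    "numero": [np.nan] * 4,
})

# Mean of a boolean mask = fraction of True values, here the share of nulls.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to `df`, the same expression would rank the columns from most to least incomplete.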

The `numero` column is entirely empty, so we can drop it.

In [16]:
print("column is empty ?:", df.numero.isnull().values.all())
df.drop('numero', axis=1, inplace=True)


column is empty ?: True


In [17]:
df.shape

(200137, 17)
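When the empty columns are not known in advance, `dropna` with `axis=1` and `how='all'` removes every column that contains only missing values. This is a generic alternative to dropping `numero` by name, sketched here on a made-up frame.

```python
import numpy as np
import pandas as pd

# Made-up frame: "numero" is fully empty, "variete" only partly.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "variete": [np.nan, "x", np.nan],
    "numero": [np.nan, np.nan, np.nan],
})

# how="all" drops a column only if every one of its values is missing.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```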

#### Columns That Contain a Single Value

In [18]:
df.nunique()

id                     200137
type_emplacement            1
domanialite                 9
arrondissement             25
complement_addresse      3795
lieu                     6921
id_emplacement          69040
libelle_francais          192
genre                     175
espece                    539
variete                   436
circonference_cm          531
hauteur_m                 143
stade_developpement         4
remarquable                 2
geo_point_2d_a         200107
geo_point_2d_b         200114
dtype: int64

In [19]:
# get number of unique values for each column 
counts = df.nunique()
# record columns to delete
to_del = [i for i,v in enumerate(counts) if v == 1]
for column in to_del:
    print(df.columns[column])

type_emplacement


In [20]:
# drop useless columns
df.drop(df.columns[to_del], axis=1, inplace=True)

In [21]:
df.shape

(200137, 16)
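The same selection can be written with a boolean mask on the column index instead of positional indices. One detail worth keeping in mind: `nunique()` ignores `NaN` by default, so a column holding one value plus missing entries also counts as 1. The frame below is made up for illustration.

```python
import pandas as pd

# Made-up frame: "type_emplacement" holds a single value.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "type_emplacement": ["Arbre", "Arbre", "Arbre"],
    "genre": ["Taxus", "Acer", None],
})

# Columns with exactly one distinct non-null value.
constant_cols = toy.columns[toy.nunique() == 1].tolist()
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```

Passing `dropna=False` to `nunique()` counts `NaN` as a value of its own, which keeps partially filled columns from being treated as constant.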

#### Columns That Have Very Few Values

In [22]:
# summarize the number of unique values in each column
for column in df.columns:
    num = df[column].nunique()
    percentage = num / df.shape[0] * 100
    if percentage < 1:
        print('%s, %d, %.1f%%' % (column, num, percentage))

domanialite, 9, 0.0%
arrondissement, 25, 0.0%
libelle_francais, 192, 0.1%
genre, 175, 0.1%
espece, 539, 0.3%
variete, 436, 0.2%
circonference_cm, 531, 0.3%
hauteur_m, 143, 0.1%
stade_developpement, 4, 0.0%
remarquable, 2, 0.0%
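The loop above can also be expressed without iteration, since `nunique()` already returns one count per column. The sketch below applies the same 1% threshold to a made-up frame.

```python
import pandas as pd

# Made-up frame: 1000 rows, a unique "id" and a two-valued "stade" column.
toy = pd.DataFrame({
    "id": range(1000),
    "stade": ["A", "J"] * 500,
})

# Unique values as a percentage of the number of rows, per column.
ratio = toy.nunique() / len(toy) * 100

# Keep only the columns below the 1% threshold used in the loop.
low_cardinality = ratio[ratio < 1]
print(low_cardinality)
```

A low ratio is only a hint: categorical columns such as `arrondissement` naturally have few distinct values and are still informative.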


In [23]:
df.remarquable.value_counts()

0.0    136855
1.0       184
Name: remarquable, dtype: int64

The `remarquable` column has very few unique values and a small variance.

In [24]:
df.remarquable.var()


0.0013408904555056327
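For a 0/1 column, the variance follows directly from the share of ones: with `p` the proportion of ones among `n` non-null values, the sample variance (the `ddof=1` default of `Series.var()`) is `p * (1 - p) * n / (n - 1)`. The sketch below checks this on a made-up series and then plugs in the counts from `value_counts()` (184 ones out of 136855 + 184 values), which gives roughly 0.00134, in line with the value above. The helper name `binary_sample_variance` is hypothetical.

```python
import pandas as pd


def binary_sample_variance(n_ones, n_total):
    """Sample variance (ddof=1) of a 0/1 variable, computed from counts."""
    p = n_ones / n_total
    return p * (1 - p) * n_total / (n_total - 1)


# Made-up series with 3 ones out of 10 values, to compare with pandas.
flags = pd.Series([0.0] * 7 + [1.0] * 3)
print(flags.var(), binary_sample_variance(3, 10))

# Counts taken from the value_counts() output of `remarquable`.
print(binary_sample_variance(184, 136855 + 184))
```

The small variance therefore mostly reflects how rare the value 1 is, not measurement noise.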

In [25]:
null_value = df.remarquable.isnull().sum()
percentage = float(null_value) / df.shape[0] * 100 
print('%s, %d, %.1f%%' % (df.remarquable.name, null_value, percentage))

remarquable, 63098, 31.5%


In [26]:
# drop useless columns
# df.drop('remarquable', axis=1, inplace=True)


#### Rows That Contain Duplicate Data

In [27]:
# calculate duplicates
dups = df.duplicated()
# report if there are any duplicates 
print(dups.any())

False
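Because `id` is unique for every row, a full-row comparison cannot flag anything. The `nunique()` output earlier shows slightly fewer distinct coordinate values than rows, so some trees may share a location. A possible follow-up, sketched here on a made-up frame, is to restrict the check to the coordinate columns with `subset`. Whether such rows are real duplicates is an assumption that would need to be verified on the data.

```python
import pandas as pd

# Made-up frame: rows 0 and 1 share the same coordinates but not the same id.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "geo_point_2d_a": [48.85, 48.85, 48.86],
    "geo_point_2d_b": [2.32, 2.32, 2.33],
})

# Full-row check: nothing is flagged because the ids differ.
full_dups = toy.duplicated().any()

# Coordinate-only check; keep=False marks every member of a duplicate group.
coord_dups = toy.duplicated(subset=["geo_point_2d_a", "geo_point_2d_b"], keep=False)

print(full_dups)
print(toy[coord_dups])
```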


#### Rows That Contain Values Outside Paris

Some values in `arrondissement` do not correspond to a Paris district.

In [28]:
arr_list = df['arrondissement'].unique()
arr_list.sort()
not_in_paris = []
for district in arr_list:
    if not (district.startswith("PARIS") or district.startswith("BOIS")):
        not_in_paris.append(district)
not_in_paris

['HAUTS-DE-SEINE', 'SEINE-SAINT-DENIS', 'VAL-DE-MARNE']

I remove the following values because they are outside Paris:

- HAUTS-DE-SEINE
- SEINE-SAINT-DENIS
- VAL-DE-MARNE

But I keep the following because they are the property of Paris:

- Bois de Boulogne
- Bois de Vincennes

Select only the rows located in Paris districts:

In [29]:
df = df[~df['arrondissement'].isin(not_in_paris)]
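The prefix test and the filter can also be combined into a single vectorized mask with the `.str` accessor. The sketch below uses made-up rows; `str.match` anchors the pattern at the start of each string, and `.copy()` is added so that later assignments on the filtered frame do not trigger pandas' chained-assignment warning.

```python
import pandas as pd

# Made-up sample of `arrondissement` values.
toy = pd.DataFrame({
    "arrondissement": [
        "PARIS 7E ARRDT",
        "BOIS DE VINCENNES",
        "VAL-DE-MARNE",
        "HAUTS-DE-SEINE",
    ]
})

# True where the value starts with "PARIS" or "BOIS"; missing values count as False.
in_paris = toy["arrondissement"].str.match(r"(?:PARIS|BOIS)", na=False)

# Keep the matching rows as an independent copy.
paris_only = toy[in_paris].copy()
print(paris_only["arrondissement"].tolist())
```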

