In [None]:
# Libraries
import pandas as pd
import numpy as np
from time import strptime
import datetime
import re

import matplotlib.pyplot as plt
import seaborn as sns

import chart_studio.plotly as py
import cufflinks as cf
%matplotlib inline
import ipywidgets as widgets
from plotly import tools
import plotly.graph_objs as go
import plotly.express as px
import warnings

import geopandas as gpd
from urllib.request import urlopen
import json

import copy
cf.go_offline()

token = 'pk.eyJ1IjoidmljdG9yYWl6cHVydWEiLCJhIjoiY2s5ajBzdWh6MDBkeTNrbm9ybjMzOWpmcCJ9.Y1XEigjlM3QDM4kCoyJF-A'

# Project 4: Analysis of pesticide use in different communes in France
-------



## Introduction

France is the biggest pesticide consumer in Europe.

While pesticides are measured in water (a mandatory European standard that has existed for more than 30 years) and in food products, there is no obligation in France or in Europe to measure their presence in the air, so the amount of pesticides in the air on any given day is unknown. To date, there is no national monitoring plan and no regulatory limit on the concentration of pesticides in the air (indoor or outdoor).

Pesticides can enter the air during application, but also after deposition, through volatilisation or erosion. The less stable pesticides can also undergo chemical or photochemical degradation and thus produce aerosols and secondary pollutants such as ozone.

Some pesticides persist in the environment for years after their ban.

Outdoor air is more polluted in the countryside than in towns. Concentrations are higher in rural areas, where there is a clear predominance of insecticides and fungicides.

Acute poisoning, linked to very high exposure over a short time, can cause skin or eye injuries in addition to general poisoning symptoms. It was for acute intoxication that the farmer Paul François started legal proceedings against the American company Monsanto, a case he won.

Chronic intoxication is linked to lower exposure over a longer time. It can cause many diseases such as asthma, diabetes, cancer, infertility, deformities or even neurological disorders (Alzheimer's, Parkinson's, autism).

Consumption (acute or chronic) of pesticides is linked by 80% to cancer. Pregnant women are a particularly at-risk population. Insect life is severely endangered by pesticides: around 75% of flying insect species have disappeared because of them. Pollination is also highly affected.

------------

## Dataset

Four datasets were used for this project. The main dataset comes from the Atmo France Federation (Government association); the data can be found here: https://www.data.gouv.fr/fr/organizations/atmo-france/

The focus of this dataset is the quantity of a substance ('**Substance active**') in the air ('**Concentration ng/m<sup>3</sup>**') in different communes ('**Code INSEE**') in France over the years ('**Annee**').

Alongside this data, other datasets were needed to get the complete geodata and to be able to plot the measurements on maps.

The first is the INSEE commune dataset for France, which provides the information related to the communes and regions (dataset: https://www.insee.fr/fr/information/3720946).

The second is the actual geodata for the communes of France, found on GitHub (https://github.com/gregoiredavid/france-geojson/blob/master/communes.geojson).

Finally, the geodata for Martinique was also needed (https://github.com/gregoiredavid/france-geojson/blob/master/regions/martinique/communes-martinique.geojson?short_path=3c3f1cd)


In [None]:
insee = pd.read_csv('../data/code_insee.csv', sep=';')
polution = pd.read_excel('../data/polution.xlsx')
with urlopen('https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/communes.geojson') as response:
    communes = json.load(response)
with urlopen('https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/regions/martinique/communes-martinique.geojson') as response:
    comm_martinique = json.load(response)
insee_copy = insee.copy()
pol_copy = polution.copy()

----------------
## Data Cleaning

First, we merged our main pesticide sample dataset with the INSEE dataset containing information about the communes and regions. We then manually filled in the region name (Rhône-Alpes) for a commune that had no value, which also avoids Pandas errors, renamed some columns, capitalised some values and removed some unnecessary columns.

One particular cleaning step was related to the communes themselves. The dataset contains measurements collected from 2002 all the way to 2017. During this period the structure of many communes has changed: some were merged with others and some were split, which changes their geo points and other information. The 'Martigné-Briand' commune was moved to a new commune called 'Terranjou' in 2016, and 'Sigolsheim' to 'Kaysersberg Vignoble', also in 2016.

We then observed that there were measurements in the 18th district of Paris. Since there was no specific geodata for the districts of Paris, we decided to assign them to the 'Paris' commune.

Some communes that appeared in the pollution dataset did not appear in the geojson dataset. After careful observation we realised that the missing communes belonged to a DOM (Département d'Outre Mer) of France, to be precise the island of Martinique (Fort de France, Gros Morne, Lamentin, Macouba, Rivière Salée, Sainte Anne). This is why the geojson dataset for Martinique was also needed.

As a last step, we realised when creating the maps that they took a long time to render, so we decided to filter both geojson datasets to include only the communes present in our pollution dataset, which sped up the map rendering enormously.
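
The check for communes that are missing from the geodata can be sketched as a set difference between the INSEE codes in the measurements and the `properties.code` values of the GeoJSON features. This is only an illustration: the helper name `codes_missing_from_geojson` and the toy inputs are assumptions and are not part of the original notebook.

```python
def codes_missing_from_geojson(data_codes, geojson):
    """Return the sorted INSEE codes in data_codes that have no GeoJSON feature."""
    geo_codes = {feature["properties"]["code"] for feature in geojson["features"]}
    return sorted(set(data_codes) - geo_codes)


# Toy GeoJSON (hypothetical): only two metropolitan communes are present.
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"code": "75056"}, "geometry": None},
        {"type": "Feature", "properties": {"code": "51454"}, "geometry": None},
    ],
}

# Toy measurement codes (hypothetical): the last one starts with 972 (Martinique).
toy_codes = ["75056", "51454", "97209"]

missing = codes_missing_from_geojson(toy_codes, toy_geojson)
print(missing)
```

In the notebook, the same function could be called with `merged['Code INSEE'].unique()` and `communes`, assuming the codes are stored as strings in both places.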

In [None]:
merged = pd.merge(pol_copy, insee_copy, on='Code INSEE', how='left')
drop_cols = ['xlamb93', 'ylamb93', 'Commune_y', 'geo_point_2d', 'geo_shape']
merged.drop(drop_cols, axis=1, inplace=True)
merged.rename(columns= {'Commune_x': 'Commune'}, inplace=True)

# assign the result back: in-place fillna on a selected column is unreliable in recent pandas
merged['Région'] = merged['Région'].fillna('Rhône-Alpes').str.capitalize()
merged['Commune'] = merged['Commune'].str.capitalize()

#transform Paris18e insee code into Paris insee code
merged = merged.replace(to_replace ="75118", value ="75056")
#transform old communes into new ones
merged = merged.replace(to_replace ="49191", value ="49086")
merged = merged.replace(to_replace ="68310", value ="68162")

# keep only the geojson features of communes present in the pollution data
comm_insee = set(merged['Code INSEE'].dropna().unique())  # set gives fast membership tests

comm_filtered = copy.deepcopy(communes)
comm_filtered['features'] = [
    commune
    for commune in communes['features'] + comm_martinique['features']
    if commune['properties']['code'] in comm_insee
]
        
substances_list = list(merged['Substance active'].unique())
substances_list.sort()
year_list = list(merged['Annee'].unique())
year_list.sort()
top_substances = merged.groupby('Substance active').agg({'Concentration (ng/m3)':'mean'}).sort_values('Concentration (ng/m3)', ascending=False).head(10).index.tolist()

In [None]:
merged[['Région', 'Code INSEE', 'Substance active', 'Annee']].nunique()

We can see that in total 326 different pesticides were evaluated throughout the dataset. These evaluations were done in 146 different communes across all 18 regions of France (our dataset reports 22 regions because some regions, like some communes, have been reorganised). The samples were taken over 16 years (from 2002 to 2017).

We will start by checking which substances have been analysed the most over the years.

In [None]:
annotations = []

fig = go.Figure()
# one bar per substance, counting how many times it was evaluated
fig.add_trace(
    go.Histogram(
        x=merged['Substance active'],
        name='Evaluations',
        marker_color='#37ced2',
        opacity=0.75,
    )
)

# Title Part
annotations.append(
    dict(
        xref='paper', 
        yref='paper', 
        x=0.0, 
        y=1.05,
        xanchor='left', 
        yanchor='bottom',
        text=f'Number of evaluations per substance',
        font=dict(
            family='Arial',
            size=30,
            color='rgb(37,37,37)'
         ),
         showarrow=False
    )
)

#Source Part
annotations.append(
    dict(
        xref='paper', 
        yref='paper', 
        x=0.5, 
        y=-0.26,
        xanchor='center', 
        yanchor='top',
        text='Source: Atmo France',
        font=dict(
            family='Arial',
            size=12,
            color='rgb(150,150,150)'
        ),
        showarrow=False
    )
)
fig.update_layout(
    bargap=0.2,
    bargroupgap=0.1,
    xaxis=dict(
        categoryorder='total descending',
        tickangle=45,
        range=(0,30)
    ),
    plot_bgcolor='white',
    annotations = annotations,
)
fig.show()

From this chart we can extract the top 10 substances that are of most interest to researchers, along with a brief description of their usage and side effects.

- Lindane: insecticide. It can cause nausea, restlessness, headaches, vomiting, shaking, ataxia, tonic-clonic seizures and changes in the EEG. Neurotoxic and carcinogenic. Prohibited in 1998.
- Pendimethaline: herbicide with low acute toxicity. It is slightly toxic by oral and eye exposure. There is not enough research on this substance.
- Fenpropimorphe: affects cell division and growth in plants. Products containing it may no longer be commercialised.
- Cyprodinil: significant health risks: acute, subchronic and chronic toxicity, carcinogenicity, reproductive and developmental toxicity, neurotoxicity and genotoxicity.
- Metholachlore: widely known toxicity. Induces cytotoxic and genotoxic effects in human lymphocytes.
- Chlorothalonil: fungicide that is highly toxic to health but still not prohibited.
- Folpel: prohibition is being discussed. Difficult to evaluate since it is eliminated quickly from the body. Stomach and oesophagus problems are linked to it.
- Metazachlore: herbicide, not enough information.
- Diflufenicanil: herbicide, not enough information.
- Kresoxim methyl: reproductive toxicity. Limits exist for the quantities in food that pose no risk, but no research has been done on its presence in the air.


We could also quickly mention the substances that were evaluated the least: Chlordane beta, Amidosulfuron, Metaldehyde, Aldicarbe sulfone, Oxamyl, Octachloronaphtalene, Propaquizafop, Eptc, Prohexadione calcium, Chloridazone.

It could be interesting to examine why these substances are not a focus of research. Is it because it has been proven that they have no detrimental effect on health or the environment? Have they been banned for a very long time? Could it be that the sensors needed to evaluate them are not widely available? Shedding some light on these questions could help refine research further.

-------

Now we will take a look at the substances with the highest concentration, in nanograms per cubic metre. In the next bar chart we can choose which substances to display and the minimum threshold a substance must reach in order to appear.

By default, all substances with a concentration >= 0.001 ng/m<sup>3</sup> appear on the chart except Folpel and Chlorothalonil. Their values are so much higher than the rest (Folpel is **32,021 ng/m<sup>3</sup>** and Chlorothalonil **2.566 ng/m<sup>3</sup>**) that including them from the beginning would make us lose discriminatory capacity when comparing the values of the other substances.

In [None]:
select_list = copy.deepcopy(substances_list)
select_list.remove('Folpel')
select_list.remove('Chlorothalonil')

@widgets.interact(
    subs = widgets.SelectMultiple(
        options=substances_list,
        value=select_list,
        description='Substances:',
        disabled=False
    ),
    threshold = widgets.FloatText(
        value=0.001,
        description='Limit:',
        disabled=False
    )
)

def chart(subs, threshold):
    annotations = []
    df = merged.groupby('Substance active') \
               .agg({'Concentration (ng/m3)':'mean', 'LD (ng/m3)': 'first', 'LQ (ng/m3)': 'first'}) \
               .sort_values('Concentration (ng/m3)', ascending=False).reset_index()
    
    df = df[(df['Substance active'].isin(subs)) & (df['Concentration (ng/m3)'] >= threshold)]
    fig = px.bar(df, x='Substance active', y='Concentration (ng/m3)', hover_data=['LD (ng/m3)', 'LQ (ng/m3)'])
    
    # Title Part
    annotations.append(
        dict(
            xref='paper', 
            yref='paper', 
            x=0.0, 
            y=1.05,
            xanchor='left', 
            yanchor='bottom',
            text=f'Concentration of ng/m3 per substance',
            font=dict(
                family='Arial',
                size=30,
                color='rgb(37,37,37)'
             ),
             showarrow=False
        )
    )

    #Source Part
    annotations.append(
        dict(
            xref='paper', 
            yref='paper', 
            x=0.5, 
            y=-0.26,
            xanchor='center', 
            yanchor='top',
            text='Source: Atmo France',
            font=dict(
                family='Arial',
                size=12,
                color='rgb(150,150,150)'
            ),
            showarrow=False
        )
    )
    fig.update_layout(
        bargap=0.2,
        bargroupgap=0.1,
        xaxis=dict(
            tickangle=45,
            range=(0,30)
        ),
        plot_bgcolor='white',
        annotations = annotations,
    )
    fig.show()

From this visualisation, we can list the 10 pesticides with the highest concentration:

- Folpel (described above).
- Chlorothalonil (described above).
- Prosulfocarbe: herbicide known for its toxicity to health.
- Chlorpyriphos ethyl: genotoxicity and developmental neurotoxicity. Considered dangerous to health by the EFSA (European Food Safety Authority) since 2019.
- Pendimethaline (described above).
- Captane: weak but visible impact on health.
- Spiroxamine: no health risks observed.
- Tolylfluanide: effects on the skeletal system (bones and teeth), liver and thyroid. Carcinogenic to humans.
- Lindane (described above).
- Cymoxanil: small toxic potential to humans and aquatic life.

By comparing the list of the 10 most evaluated substances with the list of the 10 substances with the highest concentration, we can see that some substances appear in both, which means that they are widely used in agriculture and also attract considerable attention from researchers.

These substances are Folpel, Chlorothalonil, Pendimethaline and Lindane. All of them have proven negative effects on health and some are even completely banned, but our data show that they are still widely used.
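
The comparison of the two rankings can be reproduced as a set intersection. The sketch below simply hard-codes the two top-10 lists given in this notebook; in practice they would be derived from `merged` (for example from `value_counts()` and from the mean concentration per substance), which is an assumption about how one might automate it.

```python
# Top 10 most evaluated substances, as listed in this notebook.
most_evaluated = [
    "Lindane", "Pendimethaline", "Fenpropimorphe", "Cyprodinil", "Metholachlore",
    "Chlorothalonil", "Folpel", "Metazachlore", "Diflufenicanil", "Kresoxim methyl",
]

# Top 10 substances by mean concentration, as listed in this notebook.
highest_concentration = [
    "Folpel", "Chlorothalonil", "Prosulfocarbe", "Chlorpyriphos ethyl", "Pendimethaline",
    "Captane", "Spiroxamine", "Tolylfluanide", "Lindane", "Cymoxanil",
]

# Substances present in both rankings, sorted alphabetically for a stable output.
overlap = sorted(set(most_evaluated) & set(highest_concentration))
print(overlap)
```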

The next chart shows the same bar chart of mean concentrations, restricted to a single selectable year.


In [None]:
@widgets.interact(
    year = widgets.Select(
        options=year_list,
        value=2012,
        description='Year:',
        disabled=False
    )
)

def chart(year):
    annotations = []
    df = merged[merged['Annee'] == year] \
        .groupby('Substance active') \
        .agg({'Concentration (ng/m3)':'mean', 'LD (ng/m3)': 'first', 'LQ (ng/m3)': 'first'}) \
        .sort_values('Concentration (ng/m3)', ascending=False).reset_index()
    
    fig = px.bar(df, x='Substance active', y='Concentration (ng/m3)', hover_data=['LD (ng/m3)', 'LQ (ng/m3)'])
    
    # Title Part
    annotations.append(
        dict(
            xref='paper', 
            yref='paper', 
            x=0.0, 
            y=1.05,
            xanchor='left', 
            yanchor='bottom',
            text=f'Concentration of nanograms/m3 per substance in {year}',
            font=dict(
                family='Arial',
                size=30,
                color='rgb(37,37,37)'
             ),
             showarrow=False
        )
    )

    #Source Part
    annotations.append(
        dict(
            xref='paper', 
            yref='paper', 
            x=0.5, 
            y=-0.26,
            xanchor='center', 
            yanchor='top',
            text='Source: Atmo France',
            font=dict(
                family='Arial',
                size=12,
                color='rgb(150,150,150)'
            ),
            showarrow=False
        )
    )
    fig.update_layout(
        bargap=0.2,
        bargroupgap=0.1,
        xaxis=dict(
            tickangle=45,
            range=(0,30)
        ),
        plot_bgcolor='white',
        annotations = annotations,
    )
    fig.show()

We can see that many substances have a value of 0. A value of 0 doesn't necessarily mean that the substance is completely absent from the sample; it could also be that the quantity is too small to be picked up by the sensor. Either way, the quantities are relatively low.

When you hover over the data, you can see the respective Detection Limit (LD) and Quantification Limit (LQ) values for every substance. The detection limit is the minimum quantity of the substance that must be present in the sample for it to be registered. The quantification limit is the lowest concentration that can be measured with acceptable precision, so a value between LD and LQ indicates that the substance was detected but could not be reliably quantified.
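
To make the role of the two limits concrete, here is a minimal sketch that classifies a measurement against its LD and LQ, assuming the usual analytical-chemistry reading of these limits. The function name `measurement_status` and the toy values are hypothetical and are not taken from the Atmo France data.

```python
def measurement_status(concentration, ld, lq):
    """Classify one measurement against its detection (LD) and quantification (LQ) limits."""
    if concentration < ld:
        return "not detected (below LD)"
    if concentration < lq:
        return "detected, not quantifiable (between LD and LQ)"
    return "quantified"


# Hypothetical values in ng/m3: (name, concentration, LD, LQ).
toy_rows = [
    ("substance_a", 0.0, 0.1, 0.5),
    ("substance_b", 0.3, 0.1, 0.5),
    ("substance_c", 2.0, 0.1, 0.5),
]

statuses = {name: measurement_status(conc, ld, lq) for name, conc, ld, lq in toy_rows}
for name, status in statuses.items():
    print(f"{name}: {status}")
```

Applied row by row to the 'Concentration (ng/m3)', 'LD (ng/m3)' and 'LQ (ng/m3)' columns, a classification like this would separate true low readings from values that are simply below what the instrument can register.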

-----------
Now we will look at the evolution of the concentration (ng/m<sup>3</sup>) of the substances over the years 2002-2017. By default the chart shows the 10 substances with the highest concentration, but the selection can be modified to include or remove any substance that might interest the user.

In [None]:
@widgets.interact(
    threshold = widgets.ToggleButtons(
        options=['Top 10', 'Any'],
        value='Top 10',
        description='Choose:',
        disabled=False,
        tooltips=['See Top 10 substances', 'See all substances'],
    ),
    subs = widgets.SelectMultiple(
        options=substances_list,
        value=top_substances,
        description='Substances',
        disabled=False
    )
)
def chart(threshold, subs):
    df = merged.groupby(['Substance active', 'Annee']).agg({'Concentration (ng/m3)':'mean'}).sort_values('Annee').reset_index()
    if threshold == 'Top 10':
        df = df[df['Substance active'].isin(top_substances)]
    else:
        df = df[df['Substance active'].isin(subs)]
    df = df.reset_index()
    fig = px.line(
        df, 
        x="Annee",
        y='Concentration (ng/m3)',
        color="Substance active",
    )

    annotations = []
    # Title Part
    annotations.append(
        dict(
            xref='paper', 
            yref='paper', 
            x=0.0, 
            y=1.05,
            xanchor='left', 
            yanchor='bottom',
            text=f'Evolution of use of substances over the years',
            font=dict(
                family='Arial',
                size=30,
                color='rgb(37,37,37)'
             ),
             showarrow=False
        )
    )

    #Source Part
    annotations.append(
        dict(
            xref='paper', 
            yref='paper', 
            x=0.5, 
            y=-0.16,
            xanchor='center', 
            yanchor='top',
            text='Source: Atmo France',
            font=dict(
                family='Arial',
                size=12,
                color='rgb(150,150,150)'
            ),
            showarrow=False
        )
    )

    fig.update_layout(
            bargap=0.2,
            bargroupgap=0.1,
            xaxis=dict(
                tickangle=45,
            ),
            plot_bgcolor='white',
            annotations = annotations,
        )
    fig.show()

Multiple substances have been on the rise over the last few years.

We can see that Prosulfocarbe, even though it is known to be toxic, has been increasing in recent years. Pendimethaline has some low risks associated with it but there is not enough research about it, and Pyrimethanil has restrictions on oral ingestion but no research has been done on its airborne toxicity.

This chart allows us to see the trends over the years, measure the use of pesticides after prohibition and see which others are on the rise, so that scientific research can perhaps focus on those.
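
Reading trends off a line chart can be complemented with a simple number. One possible approach, sketched below, is to fit a least-squares line to the yearly mean concentrations of each substance and to treat a positive slope as "on the rise". This is not the method used in this notebook; the function name `linear_slope`, the substance names and the values are all hypothetical.

```python
def linear_slope(years, values):
    """Least-squares slope of values against years (ng/m3 per year)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    denominator = sum((x - mean_x) ** 2 for x in years)
    return numerator / denominator


# Hypothetical yearly means, NOT values from the Atmo France data.
toy_series = {
    "substance_rising": ([2014, 2015, 2016, 2017], [1.0, 2.0, 3.0, 4.0]),
    "substance_falling": ([2014, 2015, 2016, 2017], [4.0, 3.0, 2.0, 1.0]),
}

slopes = {name: linear_slope(years, values) for name, (years, values) in toy_series.items()}
rising = [name for name, slope in slopes.items() if slope > 0]
print(slopes)
print(rising)
```

With the real data, the inputs could come from the per-substance, per-year means already computed for the line chart. A slope over only a few years is sensitive to single outlier years, so it should be read alongside the chart and not instead of it.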

---

Now we will see, on an interactive map, the communes of metropolitan France and Martinique that were evaluated over 2002-2017:

In [None]:
@widgets.interact(
    style = widgets.Select(
        options=['light', 'dark', 'basic', 'outdoors', 'carto-positron', 'carto-darkmatter'],
        value='light',
        # rows=10,
        description='Style:',
        disabled=False
    )
)

# number of samples per commune
def chart(style):
    df = merged.groupby('Code INSEE').agg({'Annee': 'count','Commune': 'first', 'Région': 'first'} ).reset_index()
    fig = px.choropleth_mapbox(
        df, 
        geojson=comm_filtered,
        locations="Code INSEE", 
        featureidkey="properties.code",
        color='Annee',
        color_continuous_scale="Bluered_r",
        labels={'Annee':'Number samples'},
        mapbox_style="carto-positron",
        center = {"lat": 46.71109, "lon": 1.7191036},
        hover_data=["Commune", "Région"],
        zoom=4,
    )
    fig.update_layout(
        margin={"r":0,"t":0,"l":0,"b":0},
        mapbox_accesstoken= token,
        mapbox_style=style )
    fig.show()
#Reims, Lille, Villers-les-nancy, Puxieux, Poitiers

We see that the communes of Reims and Lille are among the most evaluated in the dataset. A hypothesis for this could be that the region to which Lille belongs (Hauts-de-France) is the leading agricultural region in France, while Reims is a wealthy city close to Paris, in the former region of Champagne-Ardenne, where all the champagne in the world is produced.

Another highly evaluated commune is the city of Poitiers. This could be due to the fact that 69% of the whole department is dedicated to agriculture, so the need for pesticides should be higher.

---

To finish with our interactive visualisations, a final map shows the concentration of a substance in the communes where it was evaluated in a particular year, along with a line chart showing the general usage of the substance over the years. The user can dynamically choose the substance and the year.

In [None]:
@widgets.interact(
    style = widgets.Select(
        options=['light', 'dark', 'basic', 'outdoors', 'carto-positron', 'carto-darkmatter'],
        value='light',
        description='Style:',
        disabled=False
    ),
    subs = widgets.Select(
        options=substances_list,
        value='Folpel',
        description='Substance:',
        disabled=False
    ),
    year = widgets.Select(
        options=year_list,
        value=2017,
        description='Year:',
        disabled=False
    )
)
def chart(style, subs, year):
    df = merged.groupby(['Substance active', 'Annee', 'Code INSEE']).agg({'Concentration (ng/m3)':'mean', 'Commune': 'first', 'Région': 'first'} ).reset_index()

    fig = px.choropleth_mapbox(
        df[(df['Annee'] == year) & (df['Substance active'] == subs)], 
        geojson=comm_filtered, 
        locations="Code INSEE", 
        featureidkey="properties.code",
        color='Concentration (ng/m3)',
        color_continuous_scale="Bluered_r",
        center = {"lat": 46.71109, "lon": 1.7191036},
        hover_data=["Commune", "Région"],
        zoom=4,
    )
    fig.update_layout(
        margin={"r":0,"t":0,"l":0,"b":0},
        mapbox_style=style,
        mapbox_accesstoken= token,
    )
    fig.show()
    
    fig = px.line(
        df[(df['Substance active'] == subs)].groupby(['Substance active', 'Annee']).agg({'Concentration (ng/m3)': 'mean'}).reset_index(), 
        x="Annee",
        y='Concentration (ng/m3)',
        color="Substance active",
    )

    annotations = []
    # Title Part
    annotations.append(
        dict(
            xref='paper', 
            yref='paper', 
            x=0.0, 
            y=1.05,
            xanchor='left', 
            yanchor='bottom',
            text=f'Evolution of usage of {subs} over the years',
            font=dict(
                family='Arial',
                size=30,
                color='rgb(37,37,37)'
             ),
             showarrow=False
        )
    )

    #Source Part
    annotations.append(
        dict(
            xref='paper', 
            yref='paper', 
            x=0.5, 
            y=-0.16,
            xanchor='center', 
            yanchor='top',
            text='Source: Atmo France',
            font=dict(
                family='Arial',
                size=12,
                color='rgb(150,150,150)'
            ),
            showarrow=False
        )
    )

    fig.update_layout(
            bargap=0.2,
            bargroupgap=0.1,
            xaxis=dict(
                tickangle=45,
                title='Year'
            ),
            plot_bgcolor='white',
            annotations = annotations,
        )
    fig.show()


While the previous line chart showed the general trend in pesticide use over the years, the combination of the last map and line chart gives more detailed information: not only the general usage trend but also geographical insight. Even when the yearly overview is generally high or low, we can still see in detail which communes present a higher or lower concentration than others.

---

## Conclusions

- Some pesticides are still being used even after being banned.
- Some pesticides have been on the rise for the last few years (Prosulfocarbe, Pendimethaline, Pyrimethanil). For some of them there is already scientific evidence of their toxicity, while others need more research.
- Only one product among the most used/evaluated doesn't show any evident health risks (Spiroxamine).
- There is an abnormally high concentration of Folpel for the period 2003-2010.
- Due to the lack of research on pesticides in the air, there is a lack of clarity about the acceptable safe limits for these pesticides.

---

## Improvements

- Get professional insights and input on scientific questions or needs, to better tailor the visualisations.
- Compare the use of pesticides over different times of the year (spring and summer being the seasons in which they are most likely to be used).
- Dig deeper into the substances that are not being evaluated much.
- Work with other information in the dataset that is not used in this analysis (like date, days of sample, particle size, etc.).
- Check the surprisingly high values of Folpel, especially between 2003 and 2010.
- Link this information with comorbidity studies in areas with higher concentrations of the substances (check for asthma, allergies, cancer, general public health).