# Introduction to data exploration mainly using Pandas

In [None]:
%matplotlib inline
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 1. Pandas dataframe and series

The first step is to open a CSV file that contains the vote counts of a referendum in France.

In [None]:
filename_referendum = os.path.join('data', 'referendum.csv')
filename_referendum

The fields are separated by semicolons rather than commas.

In [None]:
df = pd.read_csv(filename_referendum, sep=';')
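To see why the `sep` argument matters, here is a minimal sketch on a tiny in-memory sample. The two rows and the town names are made up for illustration and are not taken from `data/referendum.csv`.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file (values are invented).
raw = (
    "Department code;Town name;Registered\n"
    "1;Town X;100\n"
    "2;Town Y;250\n"
)

# With the default sep=',' each line is read as one single field.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=';' the three fields are split as intended.
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)
print(right.shape)
```

A quick look at `df.shape` or `df.columns` right after loading is usually enough to catch a wrong separator.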

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.index

In [None]:
df.columns

It will be easier if we use the name of the city as the index.

In [None]:
df = df.set_index('Town name')

In [None]:
df.head()
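As a small illustration of what the new index allows, the sketch below builds a toy dataframe (the names and numbers are invented) and selects a row by its label.

```python
import pandas as pd

# Toy data with the same column names as the referendum file.
towns = pd.DataFrame({
    'Town name': ['Town X', 'Town Y'],
    'Registered': [100, 250],
})

# set_index returns a new dataframe; the column becomes the row labels.
towns = towns.set_index('Town name')

# Rows can now be selected by town name with .loc.
print(towns.loc['Town Y', 'Registered'])
```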

## 2. Simple analysis

* What is the city with the most registered people?

In [None]:
df.loc[:, 'Registered'].head()

In [None]:
col_registered = df.loc[:, 'Registered']

In [None]:
col_registered.max()

In [None]:
col_registered == col_registered.max()

In [None]:
mask_most_registered = col_registered == col_registered.max()

In [None]:
col_registered.loc[mask_most_registered]

In [None]:
df.loc[mask_most_registered]
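The boolean mask keeps every row that reaches the maximum, which matters when several towns are tied. A possible shortcut is `idxmax`, which returns only the first matching label. The sketch below compares the two on invented values.

```python
import pandas as pd

# Invented numbers; 'Town X' and 'Town Y' are tied for the maximum.
registered = pd.Series(
    [100, 250, 250, 40],
    index=['Town W', 'Town X', 'Town Y', 'Town Z'],
    name='Registered',
)

# Boolean mask: True wherever the value equals the maximum.
mask = registered == registered.max()
tied = registered.loc[mask]

# idxmax: label of the first occurrence of the maximum only.
first = registered.idxmax()

print(list(tied.index))
print(first)
```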

* What is the city with the fewest registered people?

In [None]:
mask_least_registered = col_registered == col_registered.min()

In [None]:
df.loc[mask_least_registered]

Let's go to `notebook.ipynb` to formalize the different aspects we have used so far.

## 3. Grouping information together

Let's now make a more advanced analysis. Instead of a micro-analysis by city, we would like to make a macro-analysis by department.

In [None]:
df.head()

Therefore, we would like to group the votes by department and add them up.

In [None]:
df.groupby('Department code').sum()

In [None]:
df_department = df.groupby(['Department code', 'Department name']).sum().reset_index()
df_department
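To make the grouping step concrete, here is a minimal sketch on a three-row toy dataframe. The department names and counts are invented; only the column names follow the referendum data.

```python
import pandas as pd

# Three towns spread over two hypothetical departments.
votes = pd.DataFrame({
    'Department code': ['1', '1', '2'],
    'Department name': ['Dept One', 'Dept One', 'Dept Two'],
    'Registered': [100, 250, 40],
    'Choice A': [30, 100, 10],
})

# Group on both keys, sum the numeric columns, then turn the
# group keys back into ordinary columns with reset_index.
by_dept = (
    votes.groupby(['Department code', 'Department name'])
    .sum()
    .reset_index()
)

print(by_dept)
```

Grouping on the code and the name together keeps the name as a readable column instead of trying to sum it.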

## 4. Merging information together

We would like to plot some of this information on a map.

In [None]:
import geopandas as gpd

In [None]:
gdf_department = gpd.read_file(os.path.join('data', 'departements.geojson'))

In [None]:
type(gdf_department)

In [None]:
gdf_department.head()

In [None]:
gdf_department.plot()

What we need at this stage is to merge the two dataframes.

In [None]:
df = gdf_department.merge(df_department, how='inner', left_on='code', right_on='Department code')

In [None]:
df.plot(column='Registered')
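An inner merge silently drops every row whose key has no match on the other side, so it is worth checking how many rows survive. The sketch below uses plain pandas dataframes with invented codes, assuming one side writes codes without a leading zero. The `indicator=True` option of `merge` adds a `_merge` column that shows where each row came from.

```python
import pandas as pd

# Hypothetical keys: the "geo" side uses two-character codes,
# the "votes" side does not pad single digits.
geo = pd.DataFrame({'code': ['01', '02', '10']})
votes = pd.DataFrame({
    'Department code': ['1', '2', '10'],
    'Registered': [350, 40, 90],
})

# Inner merge: only keys present on both sides are kept.
inner = geo.merge(votes, how='inner',
                  left_on='code', right_on='Department code')

# Outer merge with an indicator column to list the unmatched keys.
check = geo.merge(votes, how='outer',
                  left_on='code', right_on='Department code',
                  indicator=True)
unmatched = check[check['_merge'] != 'both']

print(len(inner))
print(unmatched[['code', 'Department code', '_merge']])
```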

## 5. Bug correction

In [None]:
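def prepend_zero(code):
    """Pad a single-character department code with a leading zero."""
    if len(code) == 1:
        return '0' + code
    return code

# Rewrite codes such as '1' as '01' before merging on `code` again.
df_department['Department code'] = df_department['Department code'].apply(prepend_zero)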
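For string codes, the same padding can also be written with the vectorized `str.zfill` method. The sketch below compares both approaches on a few invented codes; it assumes the codes are non-empty strings, as `prepend_zero` does.

```python
import pandas as pd


def prepend_zero(code):
    """Pad a single-character code with a leading zero."""
    if len(code) == 1:
        return '0' + code
    return code


# Invented sample of department-like codes, all stored as strings.
codes = pd.Series(['1', '2', '10', '2A'])

# Element-wise Python function, as in the cell above.
via_apply = codes.apply(prepend_zero)

# Vectorized alternative: left-pad with zeros up to width 2.
via_zfill = codes.str.zfill(2)

print(via_apply.tolist())
print(via_zfill.tolist())
```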

In [None]:
df = gdf_department.merge(df_department, how='inner', left_on='code', right_on='Department code')

In [None]:
df.head()

In [None]:
df.plot(column='Registered')

In [None]:
df.plot(column='Choice A')

In [None]:
df.plot(column='Choice B')

In [None]:
df_normalized = df.copy()

In [None]:
df_normalized['Choice A'] /= df[['Choice A', 'Choice B']].sum(axis=1)
df_normalized['Choice B'] /= df[['Choice A', 'Choice B']].sum(axis=1)
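The normalization turns raw counts into shares of the expressed votes, so that large and small departments become comparable. Here is a minimal sketch on invented counts; computing the total once from the raw counts ensures that both columns are divided by the same denominator.

```python
import pandas as pd

# Invented vote counts for two hypothetical departments.
votes = pd.DataFrame({
    'Choice A': [30, 100],
    'Choice B': [70, 300],
})

# Row-wise total of the two choices, taken from the raw counts.
total = votes[['Choice A', 'Choice B']].sum(axis=1)

# Work on a copy so the raw counts stay available.
shares = votes.copy()
shares['Choice A'] = votes['Choice A'] / total
shares['Choice B'] = votes['Choice B'] / total

print(shares)
```

After this step the two shares of each row add up to 1.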

In [None]:
df_normalized.plot(column='Choice B')

In [None]:
df_normalized.plot(column='Choice A')

## 6. Your turn

The goal is to repeat the analysis at the region scale. However, this is not as easy: the region information is not directly available and we will need to import it from another source.

* Open the `data/referendum.csv` file.

In [None]:
# %load solutions/24_solutions.py

* Show the first 5 rows.

In [None]:
# %load solutions/25_solutions.py

As you can see, there is no information about the regions. Before getting to that, let's correct the issue with the department numbering.

In [None]:
# %load solutions/26_solutions.py

* Load the information related to the regions from the file `data/regions.csv`. Show the first 5 rows.

In [None]:
# %load solutions/27_solutions.py

* Load the information related to the departments from the file `data/departments.csv`. Show the first 5 rows.

In [None]:
# %load solutions/28_solutions.py

* Find the column in the departments dataframe that is related to the `code` column of the regions dataframe. Merge both dataframes using this information. Show the first 5 rows of the resulting dataframe.

In [None]:
# %load solutions/29_solutions.py

* The previous dataframe has a column linked to the department code, which can be merged with our referendum data as we did previously. Since that dataframe already contains the region information, a new merge attaches the regions to the referendum data. Show the first 5 rows of the merged dataframe.

In [None]:
# %load solutions/30_solutions.py

* Group and add up the votes by region and show the resulting dataframe.

In [None]:
# %load solutions/31_solutions.py

* Following the previous example, plot the votes for "Choice A" and "Choice B". Use the file `regions.geojson` instead of `departements.geojson`.

In [None]:
# %load solutions/32_solutions.py

In [None]:
# %load solutions/33_solutions.py

In [None]:
# %load solutions/34_solutions.py