# Introduction to data exploration mainly using Pandas

In [1]:
%matplotlib inline
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Part 1: simple manipulation of dataframe and series

### Quick exploration of the format of our data

The first step is to open a CSV file that contains vote counts from a referendum in France.

In [2]:
filename_referendum = os.path.join('data', 'referendum.csv')

The fields are separated by semicolons rather than commas.

In [3]:
df = pd.read_csv(filename_referendum, sep=';')
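To see why the `sep` argument matters, here is a minimal sketch on a small in-memory sample. The two rows reuse values shown elsewhere in this notebook, but the reduced three-column layout and the names `text`, `df_default`, and `df_semicolon` are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the semicolon-separated layout of the file.
text = (
    "Department code;Town name;Registered\n"
    "1;Ambléon;105\n"
    "75;Paris;1253322\n"
)

# With the default separator (a comma), each line is read as a single field.
df_default = pd.read_csv(io.StringIO(text))

# With sep=';', the three columns are split as intended.
df_semicolon = pd.read_csv(io.StringIO(text), sep=';')

print(df_default.shape)
print(df_semicolon.shape)
```

A frame with a single, oddly named column is usually the first sign that the separator is wrong.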

In [4]:
df.head()

Unnamed: 0,Department code,Department name,Town code,Town name,Registered,Abstentions,Null,Choice A,Choice B
0,1,AIN,1,L'Abergement-Clémenciat,592,84,9,154,345
1,1,AIN,2,L'Abergement-de-Varey,215,36,5,66,108
2,1,AIN,4,Ambérieu-en-Bugey,8205,1698,126,2717,3664
3,1,AIN,5,Ambérieux-en-Dombes,1152,170,18,280,684
4,1,AIN,6,Ambléon,105,17,1,35,52


In [5]:
df.tail()

Unnamed: 0,Department code,Department name,Town code,Town name,Registered,Abstentions,Null,Choice A,Choice B
36786,ZZ,FRANCAIS DE L'ETRANGER,7,Europe centrale,89643,54981,318,17055,17289
36787,ZZ,FRANCAIS DE L'ETRANGER,8,"Europe du Sud, Turquie, Israël",109763,84466,292,9299,15706
36788,ZZ,FRANCAIS DE L'ETRANGER,9,Afrique Nord-Ouest,98997,59887,321,22116,16673
36789,ZZ,FRANCAIS DE L'ETRANGER,10,"Afrique Centre, Sud et Est",89859,46782,566,17008,25503
36790,ZZ,FRANCAIS DE L'ETRANGER,11,"Europe de l'est, Asie, Océanie",80061,42911,488,13975,22687


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36791 entries, 0 to 36790
Data columns (total 9 columns):
Department code    36791 non-null object
Department name    36791 non-null object
Town code          36791 non-null int64
Town name          36791 non-null object
Registered         36791 non-null int64
Abstentions        36791 non-null int64
Null               36791 non-null int64
Choice A           36791 non-null int64
Choice B           36791 non-null int64
dtypes: int64(6), object(3)
memory usage: 2.5+ MB
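The `Department code` column is reported as `object` rather than `int64`, most likely because some codes, such as `ZZ`, are not numeric. The sketch below reproduces this on a two-row sample. The sample and the name `df_codes` are illustrative assumptions, and recent pandas versions may report a dedicated string dtype instead of `object`.

```python
import io

import pandas as pd

# Hypothetical sample: one numeric-looking code and one non-numeric code.
sample = io.StringIO(
    "Department code;Registered\n"
    "1;592\n"
    "ZZ;89643\n"
)
df_codes = pd.read_csv(sample, sep=';')

# 'Department code' cannot be parsed as integers, so it stays textual;
# 'Registered' holds only integers and is parsed as int64.
print(df_codes.dtypes)
```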


In [7]:
df.index

RangeIndex(start=0, stop=36791, step=1)

In [8]:
df.columns

Index(['Department code', 'Department name', 'Town code', 'Town name',
       'Registered', 'Abstentions', 'Null', 'Choice A', 'Choice B'],
      dtype='object')

It will be easier if we use the town name as the index.

In [9]:
df = df.set_index('Town name')

In [10]:
df.head()

Town name,Department code,Department name,Town code,Registered,Abstentions,Null,Choice A,Choice B
L'Abergement-Clémenciat,1,AIN,1,592,84,9,154,345
L'Abergement-de-Varey,1,AIN,2,215,36,5,66,108
Ambérieu-en-Bugey,1,AIN,4,8205,1698,126,2717,3664
Ambérieux-en-Dombes,1,AIN,5,1152,170,18,280,684
Ambléon,1,AIN,6,105,17,1,35,52
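As a small sketch of what `set_index` changes, the toy frame below keeps only two columns, with values taken from this notebook. The names `toy`, `registered_paris`, and `toy_restored` are hypothetical.

```python
import pandas as pd

# Toy frame with a subset of the columns used in this notebook.
toy = pd.DataFrame({
    'Town name': ['Ambléon', 'Paris', 'Douaumont'],
    'Registered': [105, 1253322, 6],
})

# Move 'Town name' from a regular column to the row index.
toy = toy.set_index('Town name')

# Rows can now be selected by label with .loc.
registered_paris = toy.loc['Paris', 'Registered']
print(registered_paris)

# reset_index() turns the index back into a regular column.
toy_restored = toy.reset_index()
```

Note that `set_index` does not require the labels to be unique. If several towns share a name, `.loc` with that name returns all matching rows.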


### Let's answer some basic questions

* What is the city with the most registered people?

In [11]:
df.loc[:, 'Registered'].head()

Town name
L'Abergement-Clémenciat     592
L'Abergement-de-Varey       215
Ambérieu-en-Bugey          8205
Ambérieux-en-Dombes        1152
Ambléon                     105
Name: Registered, dtype: int64

In [13]:
col_registered = df.loc[:, 'Registered']

In [14]:
col_registered.max()

1253322

In [15]:
col_registered == col_registered.max()

Town name
L'Abergement-Clémenciat           False
L'Abergement-de-Varey             False
Ambérieu-en-Bugey                 False
Ambérieux-en-Dombes               False
Ambléon                           False
Ambronay                          False
Ambutrix                          False
Andert-et-Condon                  False
Anglefort                         False
Apremont                          False
Aranc                             False
Arandas                           False
Arbent                            False
Arbignieu                         False
Arbigny                           False
Argis                             False
Armix                             False
Ars-sur-Formans                   False
Artemare                          False
Asnières-sur-Saône                False
Attignat                          False
Bâgé-la-Ville                     False
Bâgé-le-Châtel                    False
Balan                             False
Baneins                       

In [16]:
mask_most_registered = col_registered == col_registered.max()

In [17]:
col_registered.loc[mask_most_registered]

Town name
Paris    1253322
Name: Registered, dtype: int64

In [18]:
df.loc[mask_most_registered]

Town name,Department code,Department name,Town code,Registered,Abstentions,Null,Choice A,Choice B
Paris,75,PARIS,56,1253322,248755,12093,506594,485880
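The steps above (take the maximum, build a boolean mask, select with `.loc`) can be condensed on a toy series as sketched below. `Series.idxmax` is shown as a possible shortcut. It is not what the notebook uses, and the variable names are assumptions.

```python
import pandas as pd

# Toy series indexed by town name, with values taken from this notebook.
registered = pd.Series(
    [105, 1253322, 6],
    index=pd.Index(['Ambléon', 'Paris', 'Douaumont'], name='Town name'),
    name='Registered',
)

# Boolean mask: True only where the value equals the maximum.
mask = registered == registered.max()
via_mask = registered.loc[mask]

# idxmax() returns the index label of the first maximum directly.
via_idxmax = registered.idxmax()

print(via_mask)
print(via_idxmax)
```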


* What is the city with the fewest registered people?

In [19]:
mask_least_registered = col_registered == col_registered.min()

In [20]:
df.loc[mask_least_registered]

Town name,Department code,Department name,Town code,Registered,Abstentions,Null,Choice A,Choice B
Douaumont,55,MEUSE,164,6,0,0,0,6
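The mask-based selection returns every row that reaches the minimum, which matters when several towns tie. The sketch below uses invented towns (`Town X`, `Town Y`, `Town Z`) and invented counts purely to illustrate the difference from `idxmin`. None of these values come from the referendum data.

```python
import pandas as pd

# Hypothetical frame in which two towns tie for the minimum.
toy = pd.DataFrame(
    {'Registered': [6, 105, 6]},
    index=pd.Index(['Town X', 'Town Y', 'Town Z'], name='Town name'),
)

# The mask keeps all rows equal to the minimum, so both tied towns appear.
mask_least = toy['Registered'] == toy['Registered'].min()
tied_towns = toy.loc[mask_least]

# idxmin() only reports the first label that reaches the minimum.
first_label = toy['Registered'].idxmin()

# nsmallest(n, column) returns the n rows with the smallest values.
two_smallest = toy.nsmallest(2, 'Registered')

print(tied_towns)
print(first_label)
```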


Let's go to `notebook.ipynb` to formalize the different aspects we have used so far.