In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

### 1.1 Reading data from a csv file

The `csv` format ([comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values)) is the most common file-based format for exchanging data. The format is not standardized, and you will encounter many varieties in practice - for instance, in how values are separated (not always by comma!), how `string` values are represented, or how dates are represented. It is important to know how to translate these different formats into a workable standard.

`pandas` can read data from a `csv` file using the `read_csv` function with numerous options to parse the file contents.

We will be using two `csv` files available [here](http://donnees.ville.montreal.qc.ca/dataset/velos-comptage) (in French) containing data on
1. the daily usage of 7 different bike paths in Montréal for the days available in 2015, and
2. the location of the stations that measured usage.

Let's first look at a version of the usage data from 2012 that requires some parsing. It is included in the 'data' subfolder of this repository.

In [2]:
bikes_2012 = pd.read_csv('../data/bike_usage_2012.csv')
bikes_2012.head(3)

Unnamed: 0,Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)
0,01/01/2012;35;;0;38;51;26;10;16;
1,02/01/2012;83;;1;68;153;53;6;43;
2,03/01/2012;135;;2;104;248;89;3;58;


In [3]:
!file -I ../data/bike_usage_2012.csv

../data/bike_usage_2012.csv: text/plain; charset=iso-8859-1


There are at least two issues:
- the cells are separated by semicolons rather than commas, and
- the text is garbled wherever there are 'special' characters (French accents), because the file is not encoded as `utf8`.
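To see why the accented names come out garbled, it can help to decode the same bytes in two ways. The sketch below is written for Python 3 and uses a made-up two-field header rather than the real file; the names `raw`, `as_utf8`, `as_latin1` and `fields` are illustrative only.

```python
# Hypothetical header fragment, encoded the way the 2012 file is (iso-8859-1).
raw = 'Brébeuf;Côte-Sainte-Catherine'.encode('iso-8859-1')

# Wrong codec: bytes that are not valid UTF-8 are replaced by U+FFFD.
as_utf8 = raw.decode('utf-8', errors='replace')

# Right codec: the accented characters are recovered.
as_latin1 = raw.decode('iso-8859-1')

# Splitting on ';' instead of ',' yields the individual cells.
fields = as_latin1.split(';')

print(as_utf8)
print(fields)
```

The same two choices (separator and codec) are what the `sep` and `encoding` options of `read_csv` control.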

We'll use some of the `read_csv` options to properly parse the file:

* `sep`: change the column separator to a `;`
* `encoding`: set to `'iso-8859-1'` (or `'latin1'` - the default is `'utf8'` - background [here](http://www.joelonsoftware.com/articles/Unicode.html))
* `parse_dates` and `dayfirst`: parse the 'Date' column, indicating that our dates have the day rather than the month first
* `index_col`: set the index to be the parsed 'Date' column
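The day-first option matters because a string like `02/01/2012` is ambiguous. A minimal sketch of the difference, using `pd.to_datetime` on a single illustrative string rather than the file itself:

```python
import pandas as pd

ambiguous = '02/01/2012'

# With dayfirst=True the string is read as 2 January 2012.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# Without it, pandas reads an ambiguous date month-first: 1 February 2012.
month_first = pd.to_datetime(ambiguous)

print(day_first.date(), month_first.date())
```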

In [4]:
bikes_2012_parsed = pd.read_csv('../data/bike_usage_2012.csv', 
                                sep=';', 
                                encoding='iso-8859-1', 
                                parse_dates=['Date'], 
                                dayfirst=True, 
                                index_col='Date')
bikes_2012_parsed.head(3)

Unnamed: 0_level_0,Berri 1,Brébeuf (données non disponibles),Côte-Sainte-Catherine,Maisonneuve 1,Maisonneuve 2,du Parc,Pierre-Dupuy,Rachel1,St-Urbain (données non disponibles)
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2012-01-01,35,,0,38,51,26,10,16,
2012-01-02,83,,1,68,153,53,6,43,
2012-01-03,135,,2,104,248,89,3,58,


`read_csv` can read from a file on disk or from a url. The data are included in the 'data' subfolder of this repo, but let's use the urls:

In [5]:
bike_usage_url = 'http://donnees.ville.montreal.qc.ca/dataset/f170fecc-18db-44bc-b4fe-5b0b6d2c7297/resource/64c26fd3-0bdf-45f8-92c6-715a9c852a7b/download/comptagesvelo2012.csv'
bike_station_url = 'http://donnees.ville.montreal.qc.ca/dataset/f170fecc-18db-44bc-b4fe-5b0b6d2c7297/resource/c7d0546a-a218-479e-bc9f-ce8f13ca972c/download/localisationcompteursvelo2015.csv'

In [6]:
bike_usage = pd.read_csv(bike_usage_url, index_col='Date')
bike_usage.head()

Unnamed: 0_level_0,Unnamed: 1,Berri1,Boyer,Brébeuf,CSC (Côte Sainte-Catherine),Maisonneuve_1,Maisonneuve_2,Maisonneuve_3,Notre-Dame,Parc,...,Pont_Jacques_Cartier,Rachel / Hôtel de Ville,Rachel / Papineau,René-Lévesque,Saint-Antoine,Saint-Laurent U-Zelt Test,Saint-Urbain,Totem_Laurier,University,Viger
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01/01/2015,00:00,58,12,4,17,33.0,49,21,16,16,...,,58,91,24,3,,17,78,21,6
02/01/2015,00:00,75,7,5,15,30.0,113,27,9,32,...,,109,177,32,13,,11,57,77,4
03/01/2015,00:00,79,7,3,7,30.0,107,36,12,18,...,,71,131,33,5,,14,174,40,5
04/01/2015,00:00,10,1,21,0,10.0,35,29,1,0,...,,6,11,6,1,,1,20,6,0
05/01/2015,00:00,42,0,2,0,27.0,90,21,1,1,...,,0,5,49,20,,0,41,56,10


Let's get rid of the 'Unnamed: 1' column (the file has the `hour` and `minute` part of the date in a separate column with no header, but we don't need this for daily data).

In [7]:
# dropping the first COLUMN, so need to use axis=1
# using inplace, so the original object is modified 
# so you'll get an error if you run this twice!
bike_usage.drop('Unnamed: 1', axis=1, inplace=True)
bike_usage.info()

<class 'pandas.core.frame.DataFrame'>
Index: 319 entries, 01/01/2015 to 15/11/2015
Data columns (total 21 columns):
Berri1                         319 non-null int64
Boyer                          319 non-null int64
Brébeuf                        319 non-null int64
CSC (Côte Sainte-Catherine)    319 non-null int64
Maisonneuve_1                  62 non-null float64
Maisonneuve_2                  319 non-null int64
Maisonneuve_3                  319 non-null int64
Notre-Dame                     319 non-null int64
Parc                           319 non-null int64
Parc U-Zelt Test               52 non-null float64
PierDup                        319 non-null int64
Pont_Jacques_Cartier           209 non-null float64
Rachel / Hôtel de Ville        319 non-null int64
Rachel / Papineau              319 non-null int64
René-Lévesque                  319 non-null int64
Saint-Antoine                  319 non-null int64
Saint-Laurent U-Zelt Test      50 non-null float64
Saint-Urbain                 
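The comment in the cell above warns that running the `drop` twice raises an error, because the column is already gone. One possible way around that, sketched here on a two-row stand-in for `bike_usage` (the frame `df` and its values are illustrative), is to return a new frame and pass `errors='ignore'`:

```python
import pandas as pd

# Tiny stand-in for the usage data, with the headerless time column.
df = pd.DataFrame({'Unnamed: 1': ['00:00', '00:00'], 'Berri1': [58, 75]})

# Returns a new frame; errors='ignore' makes a missing column a no-op.
cleaned = df.drop(columns='Unnamed: 1', errors='ignore')

# Running the same step again is now harmless.
cleaned_again = cleaned.drop(columns='Unnamed: 1', errors='ignore')

print(cleaned_again.columns.tolist())
```

The trade-off is that a misspelled column name would also be ignored silently, so this is best kept for cells that are expected to be re-run.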

In [None]:
bike_usage.head()

In [9]:
stations = pd.read_csv(bike_station_url, encoding='latin1')
stations.nom_comptage.sort_values().reset_index(drop=True)

0                        Berri1
1                         Boyer
2                       Brebeuf
3                           CSC
4                 Maisonneuve_1
5                 Maisonneuve_2
6                 Maisonneuve_3
7                    Notre-Dame
8                          Parc
9              Parc U-Zelt Test
10                      PierDup
11         Pont_Jacques-Cartier
12       Rachel/HÃ´tel de Ville
13              Rachel/Papineau
14              RenÃ©-LÃ©vesque
15                Saint-Antoine
16    Saint-Laurent U-Zelt Test
17                 Saint-Urbain
18                Totem_Laurier
19                   University
20                        Viger
Name: nom_comptage, dtype: object
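Names such as `Rachel/HÃ´tel de Ville` show the pattern that typically appears when UTF-8 bytes are decoded as Latin-1. That suggests (as an assumption about the station file, not something verified here) that `encoding='latin1'` may not be the right choice for it. A Python 3 sketch of undoing the damage on a single string:

```python
# 'Rachel/HÃ´tel de Ville', written with escapes so the bytes are unambiguous.
garbled = 'Rachel/H\u00c3\u00b4tel de Ville'

# Re-encode to recover the original bytes, then decode them as UTF-8.
repaired = garbled.encode('latin1').decode('utf-8')

print(repaired)
```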

### Creating a Map of Bike Usage

In [10]:
import folium
map_center = stations[['coord_Y', 'coord_X']].mean().tolist()
map = folium.Map(location=map_center)
map

In [11]:
avg_bike_usage = bike_usage.mean().to_frame('avg_usage')
avg_bike_usage.sort_index()

Unnamed: 0,avg_usage
Berri1,2915.3981
Boyer,2212.9091
Brébeuf,2859.4859
CSC (Côte Sainte-Catherine),1167.3887
Maisonneuve_1,89.9355
Maisonneuve_2,2208.0313
Maisonneuve_3,986.1379
Notre-Dame,1137.3166
Parc,1754.2571
Parc U-Zelt Test,2090.25
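Several counters have only a few non-null days, so it is worth knowing that `DataFrame.mean` skips missing values by default (`skipna=True`): each average is taken over the days that counter actually reported. A small sketch with made-up counts:

```python
import numpy as np
import pandas as pd

# Two illustrative counters; 'partial' is missing one day.
counts = pd.DataFrame({'full': [10, 20, 30],
                       'partial': [1.0, np.nan, 3.0]})

# Column-wise mean, NaN ignored, reshaped to a one-column frame.
avg = counts.mean().to_frame('avg_usage')

print(avg)
```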


In [12]:
stations_summary = stations.merge(avg_bike_usage, 
                left_on='nom_comptage', right_index=True, how='left').loc[:,
                  ['nom', 'nom_comptage', 'coord_X', 'coord_Y', 'avg_usage']]
stations_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 5 columns):
nom             21 non-null object
nom_comptage    21 non-null object
coord_X         21 non-null float64
coord_Y         21 non-null float64
avg_usage       15 non-null float64
dtypes: float64(3), object(2)
memory usage: 912.0+ bytes
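Only 15 of the 21 stations received an `avg_usage` value, which means six names did not match. One way to surface such rows directly from the merge is the `indicator` option; the sketch below uses three illustrative station names rather than the full data.

```python
import pandas as pd

# Stand-ins for `stations` and `avg_bike_usage`.
stations_demo = pd.DataFrame({'nom_comptage': ['Berri1', 'Brebeuf', 'CSC']})
usage_demo = pd.DataFrame({'avg_usage': [2915.3981, 2859.4859]},
                          index=['Berri1', 'Brébeuf'])

# indicator=True adds a '_merge' column describing where each row was found.
merged = stations_demo.merge(usage_demo, left_on='nom_comptage',
                             right_index=True, how='left', indicator=True)

# Rows present only on the left side had no matching usage column.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'nom_comptage'].tolist()

print(unmatched)
```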



In [13]:
missing_stations = stations[~stations.nom_comptage.isin(avg_bike_usage.index.tolist())].nom_comptage
print(missing_stations)

1                    Brebeuf
4            Rachel/Papineau
6                        CSC
7       Pont_Jacques-Cartier
14    Rachel/HÃ´tel de Ville
16           RenÃ©-LÃ©vesque
Name: nom_comptage, dtype: object



In [14]:
# alternative to achieve the same result
missing_stations = stations_summary[pd.isnull(stations_summary.avg_usage)].nom_comptage
{m:m.encode('ascii','ignore') for m in missing_stations}

{u'Brebeuf': 'Brebeuf',
 u'CSC': 'CSC',
 u'Pont_Jacques-Cartier': 'Pont_Jacques-Cartier',
 u'Rachel/H\xc3\xb4tel de Ville': 'Rachel/Htel de Ville',
 u'Rachel/Papineau': 'Rachel/Papineau',
 u'Ren\xc3\xa9-L\xc3\xa9vesque': 'Ren-Lvesque'}

In [15]:
# stations from 'usage' data that don't match
to_match = avg_bike_usage[~avg_bike_usage.index.isin(stations_summary.nom_comptage)].index.tolist()
{m:m.decode('ascii','ignore') for m in to_match}

{'Br\xc3\xa9beuf': u'Brbeuf',
 'CSC (C\xc3\xb4te Sainte-Catherine)': u'CSC (Cte Sainte-Catherine)',
 'Pont_Jacques_Cartier': u'Pont_Jacques_Cartier',
 'Rachel / H\xc3\xb4tel de Ville': u'Rachel / Htel de Ville',
 'Rachel / Papineau': u'Rachel / Papineau',
 'Ren\xc3\xa9-L\xc3\xa9vesque': u'Ren-Lvesque'}

In [16]:
# Testing string matching for all missing stations
from difflib import SequenceMatcher
missing_station_list = [m.encode('ascii','ignore') for m in missing_stations]
to_match_list = [m.decode('ascii','ignore') for m in to_match]
for missing_station in missing_station_list:
    # similarity (in percent) between this station name and every candidate
    scores = pd.Series({station: SequenceMatcher(None, missing_station, station).ratio() * 100
                        for station in to_match_list})
    print(scores.sort_values(ascending=False).to_frame(missing_station))

                            Brebeuf
Brbeuf                      92.3077
Ren-Lvesque                 33.3333
Rachel / Papineau           25.0000
Pont_Jacques_Cartier        14.8148
Rachel / Htel de Ville      13.7931
CSC (Cte Sainte-Catherine)  12.1212
                            Rachel/Papineau
Rachel / Papineau                   93.7500
Rachel / Htel de Ville              48.6486
Ren-Lvesque                         38.4615
Pont_Jacques_Cartier                34.2857
CSC (Cte Sainte-Catherine)          29.2683
Brbeuf                              19.0476
                                CSC
CSC (Cte Sainte-Catherine)  20.6897
Pont_Jacques_Cartier         8.6957
Ren-Lvesque                  0.0000
Rachel / Papineau            0.0000
Rachel / Htel de Ville       0.0000
Brbeuf                       0.0000
                            Pont_Jacques-Cartier
Pont_Jacques_Cartier                     95.0000
CSC (Cte Sainte-Catherine)               39.1304
Rachel / Htel de Ville                   
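The scores above come from `SequenceMatcher.ratio()`, which is defined as `2 * M / T`, where `M` is the number of matched characters and `T` is the combined length of both strings. A short sketch reproducing the top score for `Brebeuf` by hand:

```python
from difflib import SequenceMatcher

a, b = 'Brebeuf', 'Brbeuf'
matcher = SequenceMatcher(None, a, b)

# Sum the sizes of the matching blocks ('Br' and 'beuf').
matched = sum(block.size for block in matcher.get_matching_blocks())

# ratio = 2 * M / T
ratio_by_hand = 2.0 * matched / (len(a) + len(b))

print(matched, round(ratio_by_hand * 100, 4))
```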

In [17]:
# since the above works, let's wrap the matching in a function and apply it to every missing station
def get_match(missing_station):
    matches = {station: SequenceMatcher(None, 
          missing_station,station).ratio()
                     for station in to_match}
    return max(matches, key=matches.get)

matches = pd.concat([missing_stations, missing_stations.apply(lambda x: get_match(x))], axis=1)
matches.columns = ['missing', 'replacement']
matches


Unnamed: 0,missing,replacement
1,Brebeuf,Brébeuf
4,Rachel/Papineau,Rachel / Papineau
6,CSC,CSC (Côte Sainte-Catherine)
7,Pont_Jacques-Cartier,Pont_Jacques_Cartier
14,Rachel/HÃ´tel de Ville,Rachel / Hôtel de Ville
16,RenÃ©-LÃ©vesque,René-Lévesque
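As an alternative to taking the `max` over a dictionary of ratios, the standard library also offers `difflib.get_close_matches`, which ranks candidates by the same ratio. A Python 3 sketch; the helper `best_match` is a hypothetical name, and the candidate list is copied from the usage column names above.

```python
from difflib import get_close_matches

candidates = ['Brébeuf', 'CSC (Côte Sainte-Catherine)', 'Pont_Jacques_Cartier',
              'Rachel / Hôtel de Ville', 'Rachel / Papineau', 'René-Lévesque']

def best_match(name, choices, cutoff=0.0):
    """Return the most similar choice, or None if none reaches the cutoff."""
    found = get_close_matches(name, choices, n=1, cutoff=cutoff)
    return found[0] if found else None

for name in ['Brebeuf', 'Rachel/Papineau', 'Pont_Jacques-Cartier', 'CSC']:
    print(name, '->', best_match(name, candidates))
```

A `cutoff` of `0.0` always returns the top candidate. With the library default of `0.6`, a weak match such as `CSC` would come back as `None`, which can be a useful guard against silently accepting a wrong pairing.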


In [18]:
match_dict = dict(zip(matches.missing, matches.replacement))
stations.nom_comptage = stations.nom_comptage.apply(lambda x: match_dict.get(x, x))
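The `match_dict.get(x, x)` idiom keeps any name that has no replacement. `Series.replace` with a dictionary behaves the same way for exact values, which may read a little more directly. A sketch on two illustrative names:

```python
import pandas as pd

names = pd.Series(['Berri1', 'Brebeuf'])
match_dict_demo = {'Brebeuf': 'Brébeuf'}

# Values found in the dict are swapped; everything else is left untouched.
renamed = names.replace(match_dict_demo)

print(renamed.tolist())
```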

In [19]:
stations.nom_comptage

0                    Saint-Urbain
1                         Brébeuf
2                   Maisonneuve_1
3                   Maisonneuve_2
4               Rachel / Papineau
5                      University
6     CSC (Côte Sainte-Catherine)
7            Pont_Jacques_Cartier
8                         PierDup
9                   Saint-Antoine
10                          Viger
11                  Maisonneuve_3
12                     Notre-Dame
13                           Parc
14        Rachel / Hôtel de Ville
15                          Boyer
16                  René-Lévesque
17                  Totem_Laurier
18                         Berri1
19               Parc U-Zelt Test
20      Saint-Laurent U-Zelt Test
Name: nom_comptage, dtype: object

In [20]:
stations_summary = stations.merge(avg_bike_usage, 
                left_on='nom_comptage', right_index=True, how='left').loc[:,
                  ['nom', 'nom_comptage', 'coord_X', 'coord_Y', 'avg_usage']]
stations_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 5 columns):
nom             21 non-null object
nom_comptage    21 non-null object
coord_X         21 non-null float64
coord_Y         21 non-null float64
avg_usage       21 non-null float64
dtypes: float64(3), object(2)
memory usage: 912.0+ bytes



In [21]:
import folium
map_center = stations[['coord_Y', 'coord_X']].mean().tolist()
map = folium.Map(location=map_center, zoom_start=13)
map

In [22]:
for i, location in stations_summary.iterrows():
    folium.CircleMarker(location.loc[['coord_Y', 'coord_X']],
                    radius= np.sqrt(location.avg_usage/np.pi)*10,
                    popup=location.nom_comptage.decode('ascii', 'ignore'),
                    color='#3186cc',
                    fill_color='#3186cc',
                   ).add_to(map)
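The radius in the cell above is `sqrt(avg_usage / pi) * 10`, so it is the circle's area, not its radius, that grows in proportion to usage. A small check of that relationship, with the hypothetical helper `marker_radius` and made-up usage values:

```python
import math

def marker_radius(avg_usage, scale=10):
    """Radius such that the circle's area is proportional to avg_usage."""
    return math.sqrt(avg_usage / math.pi) * scale

r_small = marker_radius(1000)
r_large = marker_radius(4000)

# Four times the usage gives twice the radius, hence four times the area.
radius_ratio = r_large / r_small
area_ratio = (math.pi * r_large ** 2) / (math.pi * r_small ** 2)

print(round(radius_ratio, 6), round(area_ratio, 6))
```

Scaling the radius linearly with usage would instead exaggerate busy stations, since the eye tends to compare areas.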

In [23]:
map

In [None]:
stations_summary.to_excel('test.xlsx')