# Introduction to Pandas

This tutorial introduces the pandas DataFrame for reading and manipulating data. By the end of it, you should be able to:

- Read in .csv data.
- Select columns.
- Locate elements based on a boolean condition.


## Pandas Documentation

Like many popular Python libraries, pandas is widely used and well documented, so solutions to common errors are easy to find on Stack Overflow. If you are unsure about syntax at any point, search for what you'd like to do and you are likely to find a solution.

## Reading Data from .csv

Pandas makes reading data from .csv (comma-separated values) files easy. If you haven't seen a .csv file before, don't worry: it is simply a text file in which each line is a row and the values in a row are separated by commas. Open the bikes.csv file in the data-v1 folder from your desktop and see what it looks like.

Let's now open the bikes.csv file in pandas. 

In [16]:
# Importing the pandas library
import pandas as pd

# Read from csv
bike_df = pd.read_csv('../data-v1/bikes.csv')

bike_df

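| | Date | Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 35 | NaN | 0 | 38 | 51 | 26 | 10 | 16 | NaN |
| 1 | 2012-01-02 | 83 | NaN | 1 | 68 | 153 | 53 | 6 | 43 | NaN |
| 2 | 2012-01-03 | 135 | NaN | 2 | 104 | 248 | 89 | 3 | 58 | NaN |
| 3 | 2012-01-04 | 144 | NaN | 1 | 116 | 318 | 111 | 8 | 61 | NaN |
| 4 | 2012-01-05 | 197 | NaN | 2 | 124 | 330 | 97 | 13 | 95 | NaN |
| 5 | 2012-01-06 | 146 | NaN | 0 | 98 | 244 | 86 | 4 | 75 | NaN |
| 6 | 2012-01-07 | 98 | NaN | 2 | 80 | 108 | 53 | 6 | 54 | NaN |
| 7 | 2012-01-08 | 95 | NaN | 1 | 62 | 98 | 64 | 11 | 63 | NaN |
| 8 | 2012-01-09 | 244 | NaN | 2 | 165 | 432 | 198 | 12 | 173 | NaN |
| 9 | 2012-01-10 | 397 | NaN | 3 | 238 | 563 | 275 | 18 | 241 | NaN |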
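
If you do not have bikes.csv to hand, the same call can be tried on text held in memory. The sketch below is only an illustration: `csv_text` and `sample_df` are made-up names, and the three rows are a small excerpt of the values shown above, not the full file.

```python
import io

import pandas as pd

# Hypothetical stand-in for bikes.csv: a few rows and columns only.
csv_text = """Date,Berri 1,du Parc
2012-01-01,35,26
2012-01-02,83,53
2012-01-03,135,89
"""

# read_csv accepts a file path or any file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.head(2))  # first two rows
print(sample_df.dtypes)   # column types inferred by pandas
```

Checking the shape and the inferred types right after loading is a quick way to confirm that the file was parsed as expected.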


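## Getting a List of Columns

The column labels of a data frame can be retrieved using the code below. (The output shown was produced after Date had been made the index, a step covered later in this tutorial, so Date is not listed.)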

In [35]:
bike_df.columns.values

array(['Berri 1', 'Brébeuf (données non disponibles)',
       'Côte-Sainte-Catherine', 'Maisonneuve 1', 'Maisonneuve 2',
       'du Parc', 'Pierre-Dupuy', 'Rachel1',
       'St-Urbain (données non disponibles)'], dtype=object)

## Selecting a Column

The above data is organised in columns, with each column showing the number of bikes at one location. To retrieve a particular column, use the syntax below:

In [17]:
bike_df['Côte-Sainte-Catherine']

0         0
1         1
2         2
3         1
4         2
5         0
6         2
7         1
8         2
9         3
10        0
11        1
12        0
13        0
14        0
15        2
16        0
17        0
18        1
19        4
20        0
21        0
22        6
23        1
24        1
25        0
26        5
27        1
28        1
29        0
       ... 
280     660
281     880
282    2210
283    1537
284    1857
285    1460
286     802
287     287
288    1678
289    1858
290    1964
291    2292
292     597
293     748
294     609
295    1819
296    1997
297    1868
298    1815
299    1987
300     792
301     697
302    1458
303    1251
304    1294
305    1208
306     737
307     380
308     446
309    1170
Name: Côte-Sainte-Catherine, Length: 310, dtype: int64
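
Selecting with a single label returns a Series, as in the output above. As a small aside, passing a list of labels returns a DataFrame instead. The sketch below assumes a made-up data frame, `small_df`, with a few of the bike columns.

```python
import pandas as pd

# Hypothetical data frame with three of the bike-count columns.
small_df = pd.DataFrame({
    "Berri 1": [35, 83, 135],
    "Maisonneuve 1": [38, 68, 104],
    "du Parc": [26, 53, 89],
})

one_column = small_df["du Parc"]                # single label -> Series
two_columns = small_df[["Berri 1", "du Parc"]]  # list of labels -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.shape)
```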

## Setting the Index

The above data is *indexed* by meaningless integers. It would be more convenient to organise it by date. To do this, we can set the index to a particular column using the set_index method.

In [22]:
# We do not usually have to reload data. Done here for demonstration. 
bike_df = pd.read_csv('../data-v1/bikes.csv')

# Setting index
bike_df = bike_df.set_index('Date')

# Selecting column
bike_df['Côte-Sainte-Catherine']

Date
2012-01-01       0
2012-01-02       1
2012-01-03       2
2012-01-04       1
2012-01-05       2
2012-01-06       0
2012-01-07       2
2012-01-08       1
2012-01-09       2
2012-01-10       3
2012-01-11       0
2012-01-12       1
2012-01-13       0
2012-01-14       0
2012-01-15       0
2012-01-16       2
2012-01-17       0
2012-01-18       0
2012-01-19       1
2012-01-20       4
2012-01-21       0
2012-01-22       0
2012-01-23       6
2012-01-24       1
2012-01-25       1
2012-01-26       0
2012-01-27       5
2012-01-28       1
2012-01-29       1
2012-01-30       0
              ... 
2012-10-07     660
2012-10-08     880
2012-10-09    2210
2012-10-10    1537
2012-10-11    1857
2012-10-12    1460
2012-10-13     802
2012-10-14     287
2012-10-15    1678
2012-10-16    1858
2012-10-17    1964
2012-10-18    2292
2012-10-19     597
2012-10-20     748
2012-10-21     609
2012-10-22    1819
2012-10-23    1997
2012-10-24    1868
2012-10-25    1815
2012-10-26    1987
2012-10-27     792
2012-10
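
Note that set_index returns a new data frame rather than changing the original, which is why the result is assigned back to a variable. The sketch below illustrates this on made-up in-memory data (`csv_text`, `raw_df` and `by_date` are hypothetical names) and also shows the `index_col` parameter of `read_csv`, which sets the index while reading.

```python
import io

import pandas as pd

# Hypothetical stand-in for bikes.csv.
csv_text = """Date,Berri 1,du Parc
2012-01-01,35,26
2012-01-02,83,53
"""

raw_df = pd.read_csv(io.StringIO(csv_text))

# set_index returns a new data frame; raw_df keeps its integer index.
by_date = raw_df.set_index("Date")

# Alternative: choose the index column at read time.
by_date_direct = pd.read_csv(io.StringIO(csv_text), index_col="Date")

print(raw_df.index.tolist())          # default integer index
print(by_date.index.tolist())         # dates as index labels
print(by_date.equals(by_date_direct)) # compare the two approaches
```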

## Getting Index Values

A list of the index values for the data frame can be retrieved using the code below. The values are returned as an array, which can be stored for later use.

In [26]:
# Getting index values
indices = bike_df.index.values

# Looping through the list of values and printing each element
for index in indices:
    print(index)

2012-01-01
2012-01-02
2012-01-03
2012-01-04
2012-01-05
2012-01-06
2012-01-07
2012-01-08
2012-01-09
2012-01-10
2012-01-11
2012-01-12
2012-01-13
2012-01-14
2012-01-15
2012-01-16
2012-01-17
2012-01-18
2012-01-19
2012-01-20
2012-01-21
2012-01-22
2012-01-23
2012-01-24
2012-01-25
2012-01-26
2012-01-27
2012-01-28
2012-01-29
2012-01-30
2012-01-31
2012-02-01
2012-02-02
2012-02-03
2012-02-04
2012-02-05
2012-02-06
2012-02-07
2012-02-08
2012-02-09
2012-02-10
2012-02-11
2012-02-12
2012-02-13
2012-02-14
2012-02-15
2012-02-16
2012-02-17
2012-02-18
2012-02-19
2012-02-20
2012-02-21
2012-02-22
2012-02-23
2012-02-24
2012-02-25
2012-02-26
2012-02-27
2012-02-28
2012-02-29
2012-03-01
2012-03-02
2012-03-03
2012-03-04
2012-03-05
2012-03-06
2012-03-07
2012-03-08
2012-03-09
2012-03-10
2012-03-11
2012-03-12
2012-03-13
2012-03-14
2012-03-15
2012-03-16
2012-03-17
2012-03-18
2012-03-19
2012-03-20
2012-03-21
2012-03-22
2012-03-23
2012-03-24
2012-03-25
2012-03-26
2012-03-27
2012-03-28
2012-03-29
2012-03-30
2012-03-31
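
The array of index values can itself be accessed by position, starting from 0. The sketch below shows this on a small made-up data frame (`small_df` is a hypothetical name) indexed by date strings.

```python
import pandas as pd

# Hypothetical data frame indexed by date strings.
small_df = pd.DataFrame(
    {"du Parc": [26, 53, 89]},
    index=["2012-01-01", "2012-01-02", "2012-01-03"],
)

indices = small_df.index.values  # array of index labels

print(len(indices))  # number of rows
print(indices[0])    # first label
print(indices[-1])   # last label
```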

## Locating from Index

To locate a row in the data frame by its index label, we can use the .loc indexer.

```
df.loc['index']
```

In the example below, we first select an index at random from the list of indices retrieved earlier and then use it to retrieve the corresponding row.

In [30]:
import random

random_index = random.choice(indices)
print("The randomly selected index is", random_index)
bike_df.loc[random_index]

The randomly selected index is 2012-05-26


Berri 1                                4974.0
Brébeuf (données non disponibles)         NaN
Côte-Sainte-Catherine                  1622.0
Maisonneuve 1                          2936.0
Maisonneuve 2                          4991.0
du Parc                                2373.0
Pierre-Dupuy                           3455.0
Rachel1                                5443.0
St-Urbain (données non disponibles)       NaN
Name: 2012-05-26, dtype: float64
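
Beyond a single row label, .loc also accepts a row label together with a column label, or a slice of labels. The sketch below is a minimal illustration on a made-up data frame (`small_df` is a hypothetical name); note that label slices include both end points.

```python
import pandas as pd

# Hypothetical data frame indexed by date strings.
small_df = pd.DataFrame(
    {"Berri 1": [35, 83, 135], "du Parc": [26, 53, 89]},
    index=["2012-01-01", "2012-01-02", "2012-01-03"],
)

row = small_df.loc["2012-01-02"]                # one row, returned as a Series
value = small_df.loc["2012-01-02", "du Parc"]   # one cell: row label, column label
span = small_df.loc["2012-01-01":"2012-01-02"]  # label slice, both ends included

print(row)
print(value)
print(span.shape)
```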

**Exercise:** Using the data frame indexed by date, print every 20th element. Hint: the list _indices_ is indexed as 0, 1, 2, 3... You could loop through this.



## Locating Using Boolean Operators

We can also locate elements based on boolean operations. This means we can select rows only if they meet a certain condition. This is, again, best illustrated with an example.

In the code below, we use our data frame to create a subset of the data that contains only the rows where there were more than 200 bikes in 'du Parc'.

In [34]:
print("The original data frame has shape", bike_df.shape)
sub_df = bike_df.loc[bike_df['du Parc']>200]
print("The reduced data frame has shape", sub_df.shape)



The original data frame has shape (310, 9)
The reduced data frame has shape (258, 9)


By cutting out all rows in the data frame that have 200 or fewer bikes in 'du Parc', we are left with 258 rows instead of the original 310.
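
It can help to look at the condition on its own: the comparison produces a Series of True/False values, and .loc keeps the rows marked True. The sketch below uses a made-up three-row data frame (`small_df`, `mask`, `busy` and `both` are hypothetical names) and also shows how two conditions might be combined with `&`, wrapping each condition in parentheses.

```python
import pandas as pd

# Hypothetical data frame with a few rows of bike counts.
small_df = pd.DataFrame(
    {"Berri 1": [35, 244, 397], "du Parc": [26, 198, 275]},
    index=["2012-01-01", "2012-01-09", "2012-01-10"],
)

# The comparison yields a boolean Series, one value per row.
mask = small_df["du Parc"] > 200
print(mask)

# Keep only the rows where the mask is True.
busy = small_df.loc[mask]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
both = small_df.loc[(small_df["du Parc"] > 100) & (small_df["Berri 1"] < 300)]

print(busy.shape)
print(both.index.tolist())
```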