
<img src="./figures/pandas.png" alt="Indentation" width="30%" height="30%">


# 7 Pandas Python library tutorial
  

Pandas is a library specialized in data manipulation.
It provides a set of optimized functions for handling large datasets.
It can create and export tables of data from text files (delimited, .csv, fixed-width, compressed), binary files (HDF5 with PyTables), HTML, XML, JSON, MongoDB, SQL, and more.


Pandas introduces two main data structures: the one-dimensional Series and the two-dimensional DataFrame.
DataFrames have the following properties:


- a DataFrame is a table that can be created from dictionaries or lists
- DataFrames are built on NumPy arrays (ndarray)
- they can have column and row names
- they can mix data types: str, float, NaN, int ...
- they can be viewed as an Excel sheet, but they handle larger volumes of data and offer many more functions and attributes.
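
As a minimal sketch of the properties listed above, the following builds a small DataFrame from a dictionary of lists. The column names, row names and values are invented for illustration; only the `pd.DataFrame` constructor and its `index` argument come from pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each dictionary key becomes a column name.
data = {
    "city": ["Montréal", "Ottawa", "Toronto"],   # str
    "temperature": [11.5, np.nan, 12.0],         # float with a missing value (NaN)
    "stations": [4, 2, 5],                       # int
}

# Row labels (the index) are optional; here each row is named explicitly.
df = pd.DataFrame(data, index=["row_a", "row_b", "row_c"])

print(df)
print(df.dtypes)          # one data type per column
print(list(df.columns))   # column names
print(list(df.index))     # row names
```

Each column keeps its own data type, which is what allows text, numbers and missing values to sit side by side in the same table.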


## 7.1 Series introduction

A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index.


In [11]:
# first, we must import Pandas library
import pandas as pd      # aliasing as pd
import warnings; warnings.filterwarnings(action='ignore')

A pandas Series can be created using the following constructor:

In [12]:
serie = pd.Series([11,15,12,13,14])
print(serie)

0    11
1    15
2    12
3    13
4    14
dtype: int64


A Series holds both an index and values. The default integer index can be replaced by text labels with the index option. Be careful: the number of labels must match the number of values.

In [13]:
serie = pd.Series([11,15,12,13,14], index=["Montréal", "Ottawa", "Toronto", "Gatineau", "Québec"])

In [14]:
print(serie)

Montréal    11
Ottawa      15
Toronto     12
Gatineau    13
Québec      14
dtype: int64
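
The length constraint mentioned above can be checked with a short sketch. The labels "a", "b", "c" are placeholders chosen for this example; the point is that pandas raises a `ValueError` when the number of labels differs from the number of values.

```python
import pandas as pd

values = [11, 15, 12]

# Matching lengths: three values, three labels.
ok = pd.Series(values, index=["a", "b", "c"])

# Mismatched lengths: three values but only two labels.
try:
    pd.Series(values, index=["a", "b"])
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(ok)
print("Error raised:", mismatch_error)
```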


The describe() method computes summary statistics:

In [15]:
serie.describe()

count     5.000000
mean     13.000000
std       1.581139
min      11.000000
25%      12.000000
50%      13.000000
75%      14.000000
max      15.000000
dtype: float64

We can use an index label to access an element.

In [16]:
serie["Montréal"]

11

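We can also use the integer position. With a text index, recent pandas versions deprecate serie[3] for this purpose and expect the position to be passed through .iloc:

In [17]:
serie.iloc[3]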

13
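
To make the difference between label access and position access explicit, here is a small sketch that rebuilds the same Series and reads one element both ways. It assumes nothing beyond the `.loc` and `.iloc` indexers of pandas.

```python
import pandas as pd

serie = pd.Series(
    [11, 15, 12, 13, 14],
    index=["Montréal", "Ottawa", "Toronto", "Gatineau", "Québec"],
)

by_label = serie.loc["Gatineau"]   # access by index label
by_position = serie.iloc[3]        # access by integer position (0-based)
last_value = serie.iloc[-1]        # negative positions count from the end

print(by_label, by_position, last_value)
```

Using `.loc` for labels and `.iloc` for positions avoids any ambiguity when the index itself contains integers.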

We can select several elements by passing a list of labels.

In [18]:
serie[["Montréal", "Québec", "Toronto"]]

Montréal    11
Québec      14
Toronto     12
dtype: int64

A large number of methods compute descriptive statistics, such as min(), max(), sum() ...

In [19]:
serie.min()

11

In [20]:
serie.max()

15
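
A few more descriptive methods can be sketched on the same Series. The choice of `sum()`, `mean()`, `idxmax()` and `idxmin()` is ours; the last two return the label at which the extreme value occurs rather than the value itself.

```python
import pandas as pd

serie = pd.Series(
    [11, 15, 12, 13, 14],
    index=["Montréal", "Ottawa", "Toronto", "Gatineau", "Québec"],
)

total = serie.sum()             # sum of all values
average = serie.mean()          # arithmetic mean
label_of_max = serie.idxmax()   # label of the largest value
label_of_min = serie.idxmin()   # label of the smallest value

print(total, average, label_of_max, label_of_min)
```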

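We can apply comparison operators. A comparison returns a boolean Series, which can be used as a mask to keep only the values that satisfy the condition.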

In [21]:
serie[serie>12]

Ottawa      15
Gatineau    13
Québec      14
dtype: int64

In [22]:
serie>12 

Montréal    False
Ottawa       True
Toronto     False
Gatineau     True
Québec       True
dtype: bool
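
Boolean Series can also be combined. The sketch below assumes the same Series as above and uses the element-wise operators `&` (and), `|` (or) and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

serie = pd.Series(
    [11, 15, 12, 13, 14],
    index=["Montréal", "Ottawa", "Toronto", "Gatineau", "Québec"],
)

# Values strictly between 11 and 15.
between = serie[(serie > 11) & (serie < 15)]

# Values equal to either extreme.
extremes = serie[(serie == 11) | (serie == 15)]

# Negation of the mask used earlier: values that are NOT greater than 12.
not_above_12 = serie[~(serie > 12)]

print(between)
print(extremes)
print(not_above_12)
```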

## 7.2 Pandas Dataframes 

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It consists of three principal components: the data, the rows, and the columns.

### 7.2.1 Create a Dataframe

In practice, a DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, from a dictionary, from a list of dictionaries, etc. Here are some of the ways to create one:
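
Before turning to files, here is a minimal sketch of two of the in-memory constructions mentioned above: a list of rows and a list of dictionaries. The column names `station` and `start_year` are chosen for this example only.

```python
import pandas as pd

# From a list of rows, with the column names given separately.
rows = [["AGASSIZ", 1893], ["ATLIN", 1905]]
df_from_lists = pd.DataFrame(rows, columns=["station", "start_year"])

# From a list of dictionaries: keys become columns,
# and a key missing from one record becomes NaN in that row.
records = [
    {"station": "AGASSIZ", "start_year": 1893},
    {"station": "ATLIN"},                      # no start_year here
]
df_from_records = pd.DataFrame(records)

print(df_from_lists)
print(df_from_records)
```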

 <b> - a)</b> Using a Numpy array

In [23]:
import numpy as np  
stations = np.genfromtxt("./DATA/DATA_Barrage_1963_2017_5.csv", delimiter=",", dtype='float')   
stations

array([[  39.7       ,   39.09      ,   39.55645161,   23.23      ,
          22.5       ,   22.85903226, 2390.52612903, 3164.97      ,
        1673.8       ],
       [  41.16      ,   40.96      ,   41.08806452,   23.22      ,
          22.43      ,   22.7583871 , 2227.28290323, 3008.18      ,
        1697.87      ],
       [  41.15      ,   41.05      ,   41.11322581,   23.35      ,
          22.87      ,   23.15548387, 2851.27419355, 3231.8       ,
        2367.85      ],
       [  41.09      ,   40.61      ,   40.9683871 ,   23.78      ,
          22.7       ,   23.03258065, 2635.03774194, 3967.33      ,
        2069.65      ],
       [  41.09      ,   39.6       ,   40.26967742,   24.24      ,
          22.87      ,   23.64580645, 3924.23451613, 5407.32      ,
        2417.89      ]])

To create the DataFrame, we use the pandas <b>DataFrame()</b> constructor. We pass the NumPy array as input and define the names of our columns at this stage.

In [29]:
dataframe = pd.DataFrame(stations, columns=["Amont Max", "Amont Min", "Amont Mean", "Aval Max", "Aval Min", "Aval Mean", "Debit Mean", "Debit Max","Debit Min"])
dataframe

Unnamed: 0,Amont Max,Amont Min,Amont Mean,Aval Max,Aval Min,Aval Mean,Debit Mean,Debit Max,Debit Min
0,39.7,39.09,39.556452,23.23,22.5,22.859032,2390.526129,3164.97,1673.8
1,41.16,40.96,41.088065,23.22,22.43,22.758387,2227.282903,3008.18,1697.87
2,41.15,41.05,41.113226,23.35,22.87,23.155484,2851.274194,3231.8,2367.85
3,41.09,40.61,40.968387,23.78,22.7,23.032581,2635.037742,3967.33,2069.65
4,41.09,39.6,40.269677,24.24,22.87,23.645806,3924.234516,5407.32,2417.89


In [30]:
from tabulate import tabulate
print(tabulate(dataframe, headers='keys', tablefmt='pipe'))

|    |   Amont Max |   Amont Min |   Amont Mean |   Aval Max |   Aval Min |   Aval Mean |   Debit Mean |   Debit Max |   Debit Min |
|---:|------------:|------------:|-------------:|-----------:|-----------:|------------:|-------------:|------------:|------------:|
|  0 |       39.7  |       39.09 |      39.5565 |      23.23 |      22.5  |     22.859  |      2390.53 |     3164.97 |     1673.8  |
|  1 |       41.16 |       40.96 |      41.0881 |      23.22 |      22.43 |     22.7584 |      2227.28 |     3008.18 |     1697.87 |
|  2 |       41.15 |       41.05 |      41.1132 |      23.35 |      22.87 |     23.1555 |      2851.27 |     3231.8  |     2367.85 |
|  3 |       41.09 |       40.61 |      40.9684 |      23.78 |      22.7  |     23.0326 |      2635.04 |     3967.33 |     2069.65 |
|  4 |       41.09 |       39.6  |      40.2697 |      24.24 |      22.87 |     23.6458 |      3924.23 |     5407.32 |     2417.89 |


<b>- b)</b> Create Dataframe loading csv file: <b>read_table()</b> or <b>read_csv()</b> function

<b>read_table()</b> and <b>read_csv()</b> are the most useful pandas functions for reading text files and generating a DataFrame.

We will work with a dataset from a hydraulic dam.
Our csv file has 9 variables; the first line gives the names of the variables (or labels).

A csv document can be read with the <b>read_table()</b> function by setting the sep argument to ",".

In [27]:
barrage = pd.read_table("./DATA/DATA_EXTREME_Carillon_1963_2017_5.csv", sep=",")
barrage.head()

Unnamed: 0,Amont_max,Amont_min,Amont_moyen,Aval_max,Aval_min,Aval_moyen,Debit_Moyen,Debit_max,Debit_min
0,39.7,39.09,39.556452,23.23,22.5,22.859032,2390.526129,3164.97,1673.8
1,41.16,40.96,41.088065,23.22,22.43,22.758387,2227.282903,3008.18,1697.87
2,41.15,41.05,41.113226,23.35,22.87,23.155484,2851.274194,3231.8,2367.85
3,41.09,40.61,40.968387,23.78,22.7,23.032581,2635.037742,3967.33,2069.65
4,41.09,39.6,40.269677,24.24,22.87,23.645806,3924.234516,5407.32,2417.89


However, if we know that the file is a csv, we can use a simpler pandas function: <b>read_csv()</b>.

There is no need to pass the sep option: read_csv() uses the comma as its default separator.
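
The relationship between the two functions can be checked without any file on disk. The sketch below reads a small, made-up CSV text from memory through `io.StringIO`; the two column names are borrowed from the dam dataset, but the content is only illustrative.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk.
csv_text = "Amont_max,Amont_min\n39.70,39.09\n41.16,40.96\n"

# read_csv assumes a comma separator by default...
df_csv = pd.read_csv(io.StringIO(csv_text))

# ...whereas read_table defaults to a tab, so the comma must be given.
df_table = pd.read_table(io.StringIO(csv_text), sep=",")

print(df_csv)
print(df_csv.equals(df_table))
```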

In [16]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
    Read a comma-separated values (csv) file into DataFrame.
    
    Also supports option

Several options are available for the read_csv() function. It is worth knowing the possibilities offered by this simple command.


<table border="1" class="docutils">
<colgroup>
<col width="27%">
<col width="57%">
</colgroup>
<tbody valign="top">

<tr><td><tt class="docutils literal"><span class="pre"><b>filepath_or_buffer</b></span></tt></td>
<td>Path to our file</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>sep</b></span></tt></td>
<td>Delimiter like , ; | \t or \s+ for a variable number of spaces</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>header</b></span></tt></td>
<td>
default 'infer' (equivalent to 0 when no names are passed): the first line contains the names of the variables; if None, the names are generated or defined later</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>index_col</b></span></tt></td>
<td>Names or numbers of the columns to use as row index; several columns give a hierarchical index</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>names</b></span></tt></td>
<td>If header = None, list of variable names.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>nrows</b></span></tt></td>
<td>Useful for testing and limiting the number of lines to read</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>skiprows</b></span></tt></td>
<td>List of line numbers to skip, or number of lines to skip at the start of the file</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>skipfooter</b></span></tt></td>
<td>Number of lines to skip at the end of the file</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>na_values</b></span></tt></td>
<td>Code or codes signaling missing values. They
can be given as a dictionary to associate each variable with
its own missing-value codes</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>usecols</b></span></tt></td>
<td>Selects a list of variables to read to avoid reading
large or unnecessary fields or variables</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>skip_blank_lines</b></span></tt></td>
<td>If <b>True</b>, blank lines are skipped</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>thousands</b></span></tt></td>
<td>Thousands separator: "." or ","</td>
</tr>
</tbody>
</table>
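
Several of the options in the table can be combined in one call. The following sketch uses an invented, semicolon-separated text in which -999 stands for a missing value; the column names and data are assumptions made for this example, while the keyword arguments are the documented `read_csv` options.

```python
import io
import pandas as pd

# Hypothetical content: ';' as delimiter and -999 as the missing-value code.
csv_text = (
    "station;level;flow\n"
    "A;39.7;2390.5\n"
    "B;-999;2227.3\n"
    "C;41.2;2851.3\n"
    "D;41.1;2635.0\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                        # non-default delimiter
    usecols=["station", "level"],   # read only these columns
    index_col="station",            # use this column as the row index
    na_values=["-999"],             # treat -999 as a missing value
    nrows=3,                        # read only the first three data rows
)

print(df)
```

Limiting the read with `nrows` and `usecols` is a convenient way to test a call on a large file before loading it entirely.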


In [31]:
# Usage example

file1 = "./DATA/DATA_EXTREME_Carillon_1963_2017_5.csv"
col_names = ['Variable1', 'Variable2', 'Variable3']
df2 = pd.read_csv(file1, skiprows=1, usecols=[0, 1, 3], names=col_names)

In [32]:
df2.head()

Unnamed: 0,Variable1,Variable2,Variable3
0,39.7,39.09,23.23
1,41.16,40.96,23.22
2,41.15,41.05,23.35
3,41.09,40.61,23.78
4,41.09,39.6,24.24


<b>- c)</b> Create Dataframe loading an ASCII text file:

In [34]:
with open('./DATA/Daily_Precipitation_1963-2017.txt', 'r') as file:
    rows = file.read()

In [35]:
# read the whole file, then convert each whitespace-separated token to a float
with open('./DATA/Daily_Precipitation_1963-2017.txt', 'r') as file:
    rows = file.read()
dataset = [float(row) for row in rows.split()]
df3 = pd.DataFrame({"Precipitation" : dataset})

In [36]:
df3.head()

Unnamed: 0,Precipitation
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0


<b>- d)</b> Create Dataframe loading an Excel (.xls) file: <b>read_excel()</b> function

Here we open an Excel file (.xls extension). This file is a database containing information on all homogenized temperature stations of Environment and Climate Change Canada.

This database has 11 columns, with data starting at the 4th line.

We will define the first column, "Prov" (province), as the index of our DataFrame.

In [38]:
df4 = pd.read_excel("./DATA/Homog_Temperature_Stations.xls", index_col=0,skiprows = range(0, 3))
df4.head()

Unnamed: 0_level_0,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N
BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N
BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N
BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y
BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18,Y


### 7.2.2  Access data from DataFrames 

The first thing to do when opening a new dataset is to print out a few rows. We accomplish this with the <b>.head()</b> method:

In [41]:
dataframe = pd.read_excel("./DATA/Homog_Temperature_Stations.xls", skiprows = range(0, 3))
dataframe.head()

Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
0,BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N
1,BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N
2,BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N
3,BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y
4,BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18,Y


To see the last five rows, use the <b>.tail()</b> method. tail() also accepts a number of rows to print, for example .tail(2) for the bottom two rows; here we keep the default:

In [43]:
dataframe = pd.read_excel("./DATA/Homog_Temperature_Stations.xls", skiprows = range(0, 3))
dataframe.tail()

Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
333,NL,PORT AUXBASQUES,8402975,1909,2,2017,9,47.58,-58.97,40,N
334,NL,ST ANTHONY,8403389,1946,6,2017,12,51.37,-55.6,33,Y
335,NL,ST JOHN'S,8403505,1874,1,2017,12,47.62,-52.75,141,Y
336,NL,STEPHENVILLE,8403801,1895,6,2017,12,48.53,-58.55,26,Y
337,NL,WABUSH LAKE,8504177,1960,11,2017,12,52.93,-66.87,551,Y
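
The optional row count of head() and tail() can be illustrated on a toy table. The DataFrame below is a stand-in with ten numbered rows, not the station database.

```python
import pandas as pd

# Toy DataFrame with ten rows, indexed 0 to 9.
df = pd.DataFrame({"value": range(10)})

first_three = df.head(3)    # rows 0, 1, 2
last_two = df.tail(2)       # rows 8, 9
default_tail = df.tail()    # five rows when no number is given

print(first_three)
print(last_two)
print(len(default_tail))
```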


Before exploring a Dataframe, you can change its index to make the dataset easier to analyze. For this, we use the <b>.set_index()</b> method. It returns a new object, so we must assign the result to a variable.

In [45]:
dataframe_Prov_index = dataframe.set_index("Prov")

In [46]:
dataframe_Prov_index.head()

Unnamed: 0_level_0,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N
BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N
BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N
BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y
BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18,Y
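
The fact that set_index() returns a new object, leaving the original untouched, can be verified on a small example. The three-row table below is a hand-made extract; `reset_index()` is shown as the inverse operation.

```python
import pandas as pd

df = pd.DataFrame({
    "Prov": ["BC", "BC", "QC"],
    "station": ["AGASSIZ", "ATLIN", "AMOS"],
})

indexed = df.set_index("Prov")     # new DataFrame; df itself is unchanged
restored = indexed.reset_index()   # moves the index back into a regular column

print(list(df.columns))        # "Prov" is still a column of df
print(list(indexed.columns))   # "Prov" is now the index, not a column
print(indexed.index.name)
print(list(restored.columns))
```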


We can directly select a column from a Dataframe: 

In [48]:
dataframe_Prov_index['Nom de station'].head()

Prov
BC        AGASSIZ
BC          ATLIN
BC    BARKERVILLE
BC     BEAVERDELL
BC    BELLA COOLA
Name: Nom de station, dtype: object

Pandas supports multi-axis indexing to get a subset of a pandas object.
To access an element in a DataFrame, there are two indexers, both used with square brackets:

   - <b>iloc[]</b> to access data by integer position

   - <b>loc[]</b> to access data by label

   #### a- <b>iloc</b> indexer:

We can access data from a DataFrame using integer positions. As in NumPy, positions are 0-based.

In [49]:
# Example1: select specific row and specific column
dataframe_Prov_index.iloc[0,0]

'AGASSIZ'

In [50]:
# Example2: iloc: select the first 4 rows and all columns
dataframe_Prov_index.iloc[0:4,:]

Unnamed: 0_level_0,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N
BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N
BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N
BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y


In [52]:
# Example3: iloc: select all rows and the first 4 columns
dataframe_Prov_index.iloc[:,0:4].head()

Unnamed: 0_level_0,Nom de station,stnid,année déb.,mois déb.
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
BC,AGASSIZ,1100120,1893,1
BC,ATLIN,1200560,1905,8
BC,BARKERVILLE,1090660,1888,2
BC,BEAVERDELL,1130771,1939,1
BC,BELLA COOLA,1060841,1895,5


In [31]:
# Example4: iloc: select with lists of positions and with slices
print(dataframe_Prov_index.iloc[[1, 3, 5], [1, 3]])
print(dataframe_Prov_index.iloc[1:3, :])
dataframe_Prov_index.iloc[:,1:3].head()

        stnid  mois déb.
Prov                    
BC    1200560          8
BC    1130771          1
BC    1021480          7
     Nom de station    stnid  année déb.  mois déb.  année fin.  mois fin.  \
Prov                                                                         
BC            ATLIN  1200560        1905          8        2017         12   
BC      BARKERVILLE  1090660        1888          2        2015          3   

      lat (deg)  long (deg)  élév (m) stns jointes  
Prov                                                
BC        59.57     -133.70       674            N  
BC        53.07     -121.52      1265            N  


Unnamed: 0_level_0,stnid,année déb.
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1
BC,1100120,1893
BC,1200560,1905
BC,1090660,1888
BC,1130771,1939
BC,1060841,1895
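
The slicing rules used in the iloc examples above can be summarized on a toy table. The sketch assumes a four-row DataFrame with made-up labels; as with Python lists, the end of a positional slice is excluded and negative positions count from the end.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1, 2, 3, 4], "c": [5, 6, 7, 8]},
    index=["w", "x", "y", "z"],
)

block = df.iloc[1:3, 0:2]       # rows at positions 1 and 2, columns 0 and 1
last_row = df.iloc[-1]          # last row, returned as a Series
picked = df.iloc[[0, 2], [2]]   # rows 0 and 2, column 2 only

print(block)
print(last_row)
print(picked)
```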


   #### b- <b>loc</b> indexer:

This indexer is purely label-based.

.loc accepts several kinds of selectors:

   - a single scalar label

   - a list of labels

   - a slice object

   - a Boolean array

.loc takes two selectors (single label, list, or slice) separated by ','. The first one selects the rows and the second one selects the columns.
   

In [55]:
# Example1: loc: select all rows for a specific column
dataframe_Prov_index.loc[:,"Nom de station"].head()

Prov
BC        AGASSIZ
BC          ATLIN
BC    BARKERVILLE
BC     BEAVERDELL
BC    BELLA COOLA
Name: Nom de station, dtype: object

In [56]:
# Example2: loc: select all rows for a specific index name
dataframe_Prov_index.loc["QC",:].head()

Unnamed: 0_level_0,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
QC,AMOS,709CEE9,1913,6,2017,8,48.57,-78.13,305,Y
QC,BAGOTVILLE,7060400,1880,11,2017,12,48.33,-71.0,159,Y
QC,BAIE COMEAU,704S001,1965,1,2017,12,49.13,-68.2,130,Y
QC,BEAUCEVILLE,7027283,1913,8,2017,8,46.15,-70.7,168,Y
QC,BELLETERRE,7080600,1951,9,2004,4,47.38,-78.7,322,N


In [58]:
# Example3: select all rows for multiple columns, given as a list

dataframe_Prov_index.loc[:,["Nom de station", "année déb.", "année fin."]].head()

Unnamed: 0_level_0,Nom de station,année déb.,année fin.
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
BC,AGASSIZ,1893,2017
BC,ATLIN,1905,2017
BC,BARKERVILLE,1888,2015
BC,BEAVERDELL,1939,2006
BC,BELLA COOLA,1895,2017


In [35]:
# Example4: select a few row labels for multiple columns, both given as lists
dataframe_Prov_index.loc[['BC','QC'],["Nom de station", "année déb.", "année fin."]].head()

Unnamed: 0_level_0,Nom de station,année déb.,année fin.
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
BC,AGASSIZ,1893,2017
BC,ATLIN,1905,2017
BC,BARKERVILLE,1888,2015
BC,BEAVERDELL,1939,2006
BC,BELLA COOLA,1895,2017


In [60]:
# Example 5: build a boolean DataFrame from a comparison

(dataframe_Prov_index.loc['BC',["année déb."]]>1900).head()

Unnamed: 0_level_0,année déb.
Prov,Unnamed: 1_level_1
BC,False
BC,True
BC,False
BC,True
BC,False


In [37]:
# Example 6: select rows with a boolean array
dataframe_Prov_index.loc[dataframe_Prov_index["année fin."]>2015,:].head()

Unnamed: 0_level_0,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N
BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N
BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18,Y
BC,BLIND CHANNEL,1021480,1958,7,2016,2,50.42,-125.5,23,N
BC,BLUE RIVER,1160899,1946,9,2017,12,52.13,-119.28,683,Y


In [38]:
# Example 7: select rows with a boolean array, after restricting to one province
df2 = dataframe_Prov_index.loc["QC",:]
df2.loc[df2["année fin."]==2017,:].head()

Unnamed: 0_level_0,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
QC,AMOS,709CEE9,1913,6,2017,8,48.57,-78.13,305,Y
QC,BAGOTVILLE,7060400,1880,11,2017,12,48.33,-71.0,159,Y
QC,BAIE COMEAU,704S001,1965,1,2017,12,49.13,-68.2,130,Y
QC,BEAUCEVILLE,7027283,1913,8,2017,8,46.15,-70.7,168,Y
QC,CAUSAPSCAL,7051200,1913,11,2017,8,48.37,-67.23,168,N
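
Two points about loc are easy to miss: label slices include both ends, and several boolean conditions can be combined in one mask. The sketch below illustrates both on a toy table whose labels and values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1, 2, 3, 4], "c": [5, 6, 7, 8]},
    index=["w", "x", "y", "z"],
)

# Label slices include BOTH ends, unlike positional slices.
by_slice = df.loc["x":"z", "a":"b"]

# Boolean mask combining two conditions, then a list of columns.
mask = (df["a"] > 10) & (df["b"] < 4)
filtered = df.loc[mask, ["a", "c"]]

print(by_slice)
print(filtered)
```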


### 7.2.3  Change a Dataframe

   ### 7.2.3.1 Column Selection/Addition/Deletion
 
We will reuse our previous DataFrame.

In [39]:
dataframe_Prov_index.head()

Unnamed: 0_level_0,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N
BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N
BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N
BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y
BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18,Y


#### a- Create a new variable

We can combine columns from our DataFrame to create a new variable. In this example, we calculate the number of recording years for each station.

In [66]:
delta_year = (dataframe_Prov_index["année fin."] - dataframe_Prov_index["année déb."]) + 1

In [67]:
delta_year.head()

Prov
BC    125
BC    113
BC    128
BC     68
BC    123
dtype: int64

  #### b- Column Addition in a DataFrame

In [68]:
dataframe_Prov_index["total année"] = delta_year

In [69]:
dataframe_Prov_index.head()

Unnamed: 0_level_0,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,total année
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N,125
BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N,113
BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N,128
BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y,68
BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18,Y,123
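
As an alternative to assigning a column in place, the `assign()` method returns a new DataFrame with the extra column and leaves the original unchanged. The sketch below uses two stations with English column names chosen for this example; the year arithmetic is the same as above.

```python
import pandas as pd

stations = pd.DataFrame(
    {"start_year": [1893, 1905], "end_year": [2017, 2017]},
    index=["AGASSIZ", "ATLIN"],
)

# assign() computes the new column from the existing ones
# and returns a copy; 'stations' itself keeps its two columns.
with_total = stations.assign(
    total_years=lambda d: d["end_year"] - d["start_year"] + 1
)

print(with_total)
print(list(stations.columns))
```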


In [70]:
from tabulate import tabulate
print(tabulate(dataframe_Prov_index, headers='keys', tablefmt='pipe'))

| Prov   | Nom de station   | stnid   |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |   total année |
|:-------|:-----------------|:--------|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|--------------:|
| BC     | AGASSIZ          | 1100120 |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 | N              |           125 |
| BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 | N              |           113 |
| BC     | BARKERVILLE      | 1090660 |         1888 |           2 |         2015 |           3 |       53.07 |      -121.52 |       1265 | N              |           128 |
| BC     | BEAVERDELL       | 1130771 |         1939 |           1 |         2006 |           9 |       49.48 |      -119.05 |        8

   #### c- Column Deletion in a DataFrame
   
Columns can be removed from a DataFrame with the del statement, the pop() method, or the drop() method; the examples below show each one.

In [74]:
dataframe = pd.read_excel("./DATA/Homog_Temperature_Stations.xls", skiprows = range(0, 3))

In [75]:
# using the del statement
print("Deleting 'stns jointes' column using DEL function:")
del dataframe['stns jointes']
dataframe.head()

Deleting 'stns jointes' column using DEL function:


Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m)
0,BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15
1,BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674
2,BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265
3,BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838
4,BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18


In [76]:
from tabulate import tabulate
print(tabulate(dataframe, headers='keys', tablefmt='pipe'))

|     | Prov   | Nom de station   | stnid   |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) |
|----:|:-------|:-----------------|:--------|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|
|   0 | BC     | AGASSIZ          | 1100120 |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 |
|   1 | BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 |
|   2 | BC     | BARKERVILLE      | 1090660 |         1888 |           2 |         2015 |           3 |       53.07 |      -121.52 |       1265 |
|   3 | BC     | BEAVERDELL       | 1130771 |         1939 |           1 |         2006 |           9 |       49.48 |      -119.05 |        838 |
|   4 | BC     | BELLA COOLA      | 1060841 |         1895 |           5 |         2017 |          11 |       52.37 |      -

In [77]:
# using the pop method
print("Deleting 'stnid' column using POP function:")
dataframe = pd.read_excel("./DATA/Homog_Temperature_Stations.xls", skiprows = range(0, 3))
dataframe.pop('stnid')
dataframe.head()

Deleting 'stnid' column using POP function:


Unnamed: 0,Prov,Nom de station,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
0,BC,AGASSIZ,1893,1,2017,12,49.25,-121.77,15,N
1,BC,ATLIN,1905,8,2017,12,59.57,-133.7,674,N
2,BC,BARKERVILLE,1888,2,2015,3,53.07,-121.52,1265,N
3,BC,BEAVERDELL,1939,1,2006,9,49.48,-119.05,838,Y
4,BC,BELLA COOLA,1895,5,2017,11,52.37,-126.68,18,Y


In [78]:
from tabulate import tabulate
print(tabulate(dataframe, headers='keys', tablefmt='pipe'))

|     | Prov   | Nom de station   |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |
|----:|:-------|:-----------------|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|
|   0 | BC     | AGASSIZ          |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 | N              |
|   1 | BC     | ATLIN            |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 | N              |
|   2 | BC     | BARKERVILLE      |         1888 |           2 |         2015 |           3 |       53.07 |      -121.52 |       1265 | N              |
|   3 | BC     | BEAVERDELL       |         1939 |           1 |         2006 |           9 |       49.48 |      -119.05 |        838 | Y              |
|   4 | BC     | BELLA COOLA      |         1895 |           5 |         2017 |   

In [79]:
# using the drop method: it returns a new DataFrame and leaves the original unchanged
dataframe = pd.read_excel("./DATA/Homog_Temperature_Stations.xls", skiprows = range(0, 3))
dataframe.drop(["stns jointes"], axis=1).head()

Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m)
0,BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15
1,BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674
2,BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265
3,BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838
4,BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18
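
The three deletion techniques differ in what they modify and what they return. The toy DataFrame below, with placeholder column names, is a sketch of those differences: `del` and `pop()` change the DataFrame in place, while `drop()` returns a new one.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

del df["a"]                          # in place, returns nothing
popped = df.pop("b")                 # in place, returns the removed column as a Series
without_c = df.drop(columns=["c"])   # new DataFrame; df still contains "c"

print(list(df.columns))
print(popped.tolist())
print(list(without_c.columns))
```

Here `drop(columns=[...])` is equivalent to the `drop([...], axis=1)` form used above.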


In [81]:
from tabulate import tabulate
print(tabulate(dataframe.drop(["stns jointes"], axis=1), headers='keys', tablefmt='pipe'))

|     | Prov   | Nom de station   | stnid   |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) |
|----:|:-------|:-----------------|:--------|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|
|   0 | BC     | AGASSIZ          | 1100120 |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 |
|   1 | BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 |
|   2 | BC     | BARKERVILLE      | 1090660 |         1888 |           2 |         2015 |           3 |       53.07 |      -121.52 |       1265 |
|   3 | BC     | BEAVERDELL       | 1130771 |         1939 |           1 |         2006 |           9 |       49.48 |      -119.05 |        838 |
|   4 | BC     | BELLA COOLA      | 1060841 |         1895 |           5 |         2017 |          11 |       52.37 |      -

   ### 7.2.3.2 Row Selection/Addition/Deletion
   
We now look at row selection, addition and deletion through examples. Let us begin with selection.

 #### a-  Row Selection
 
Selection by label: rows can be selected by passing a row label to the loc indexer.

In [82]:
dataframe = pd.read_excel("./DATA/Homog_Temperature_Stations.xls", skiprows = range(0, 3)).set_index("Prov")
dataframe.head()

Unnamed: 0_level_0,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N
BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N
BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N
BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y
BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18,Y


In [83]:
from tabulate import tabulate
print(tabulate(dataframe, headers='keys', tablefmt='pipe'))

| Prov   | Nom de station   | stnid   |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |
|:-------|:-----------------|:--------|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|
| BC     | AGASSIZ          | 1100120 |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 | N              |
| BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 | N              |
| BC     | BARKERVILLE      | 1090660 |         1888 |           2 |         2015 |           3 |       53.07 |      -121.52 |       1265 | N              |
| BC     | BEAVERDELL       | 1130771 |         1939 |           1 |         2006 |           9 |       49.48 |      -119.05 |        838 | Y              |
| BC     | BELLA COOLA      | 1060841 |         1895 |    

In [84]:
dataframe.loc['BC'].head()

Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N
BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N
BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N
BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y
BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18,Y


In [85]:
from tabulate import tabulate
print(tabulate(dataframe.loc['BC'].head(), headers='keys', tablefmt='pipe'))

| Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |
|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|
| BC     | AGASSIZ          | 1100120 |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 | N              |
| BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 | N              |
| BC     | BARKERVILLE      | 1090660 |         1888 |           2 |         2015 |           3 |       53.07 |      -121.52 |       1265 | N              |
| BC     | BEAVERDELL       | 1130771 |         1939 |           1 |         2006 |           9 |       49.48 |      -119.05 |        838 | Y              |
| BC     | BELLA COOLA      | 1060841 |         1895 |    

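Rows can also be selected by passing a boolean mask (one True/False value per row) to .loc; only the rows where the mask is True are kept.

In [50]:
dataframe.loc[dataframe.index == 'BC'].head()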

Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N
BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N
BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N
BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y
BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18,Y


Rows can be selected by passing an integer position to the .iloc indexer. A single position returns the row as a Series.

In [51]:
dataframe.iloc[0]

Nom de station    AGASSIZ
stnid             1100120
année déb.           1893
mois déb.               1
année fin.           2017
mois fin.              12
lat (deg)           49.25
long (deg)        -121.77
élév (m)               15
stns jointes            N
Name: BC, dtype: object

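Multiple rows can be selected with slice notation (start:stop). The slice is positional and the stop position is excluded, so 2:4 returns the rows at positions 2 and 3.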

In [52]:
dataframe[2:4]

Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N
BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y


In [86]:
from tabulate import tabulate
print(tabulate(dataframe[2:4], headers='keys', tablefmt='pipe'))

| Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |
|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|
| BC     | BARKERVILLE      | 1090660 |         1888 |           2 |         2015 |           3 |       53.07 |      -121.52 |       1265 | N              |
| BC     | BEAVERDELL       | 1130771 |         1939 |           1 |         2006 |           9 |       49.48 |      -119.05 |        838 | Y              |

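The selection styles above can be compared side by side on a small frame. The sketch below is only an illustration: `stations` is a hypothetical four-row stand-in built from station names and elevations that appear in this tutorial, not the Excel file itself.

```python
import pandas as pd

# Hypothetical stand-in for the station table, indexed by province code.
stations = pd.DataFrame(
    {
        "Nom de station": ["AGASSIZ", "ATLIN", "BURWASH", "ATIKOKAN"],
        "élév (m)": [15, 674, 807, 442],
        "Prov": ["BC", "BC", "YT", "ON"],
    }
).set_index("Prov")

# Label selection: every row whose index label is 'BC'.
by_label = stations.loc["BC"]

# Boolean selection: rows where the mask is True.
by_mask = stations.loc[stations["élév (m)"] > 500]

# Position selection: the first row, returned as a Series.
by_position = stations.iloc[0]

# Positional slice: positions 1 and 2 (the stop position is excluded).
by_slice = stations.iloc[1:3]

print(by_label)
print(by_mask)
print(by_position)
print(by_slice)
```

Note that .loc works on labels while .iloc works on positions, which matters here because the province labels are repeated.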

#### b-  Row Addition

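Add new rows to a DataFrame with the <b>append()</b> method. It returns a new DataFrame with the rows added at the end; columns that are missing from the new rows are filled with NaN. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat() takes its place.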

In [87]:
dataframe = pd.read_excel("./DATA/Homog_Temperature_Stations.xls", skiprows = range(0, 3)).set_index("Prov")

In [88]:
df_new = pd.DataFrame({'Nom de station': ['station1', 'station2'], 'stnid': [8888, 9999], 'Prov': ['BC', 'QC']}).set_index("Prov")
df_new

Prov,Nom de station,stnid
BC,station1,8888
QC,station2,9999


In [89]:
from tabulate import tabulate
print(tabulate(df_new, headers='keys', tablefmt='pipe'))

| Prov   | Nom de station   |   stnid |
|:-------|:-----------------|--------:|
| BC     | station1         |    8888 |
| QC     | station2         |    9999 |


In [90]:
dataframe = dataframe.append(df_new)
dataframe.tail()

Prov,Nom de station,année déb.,année fin.,lat (deg),long (deg),mois déb.,mois fin.,stnid,stns jointes,élév (m)
NL,ST JOHN'S,1874.0,2017.0,47.62,-52.75,1.0,12.0,8403505,Y,141.0
NL,STEPHENVILLE,1895.0,2017.0,48.53,-58.55,6.0,12.0,8403801,Y,26.0
NL,WABUSH LAKE,1960.0,2017.0,52.93,-66.87,11.0,12.0,8504177,Y,551.0
BC,station1,,,,,,,8888,,
QC,station2,,,,,,,9999,,


In [91]:
from tabulate import tabulate
print(tabulate(dataframe.tail(), headers='keys', tablefmt='pipe'))

| Prov   | Nom de station   |   année déb. |   année fin. |   lat (deg) |   long (deg) |   mois déb. |   mois fin. |   stnid | stns jointes   |   élév (m) |
|:-------|:-----------------|-------------:|-------------:|------------:|-------------:|------------:|------------:|--------:|:---------------|-----------:|
| NL     | ST JOHN'S        |         1874 |         2017 |       47.62 |       -52.75 |           1 |          12 | 8403505 | Y              |        141 |
| NL     | STEPHENVILLE     |         1895 |         2017 |       48.53 |       -58.55 |           6 |          12 | 8403801 | Y              |         26 |
| NL     | WABUSH LAKE      |         1960 |         2017 |       52.93 |       -66.87 |          11 |          12 | 8504177 | Y              |        551 |
| BC     | station1         |          nan |          nan |      nan    |       nan    |         nan |         nan |    8888 | nan            |        nan |
| QC     | station2         |          nan |          nan 

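DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the append cell above only runs on older versions. A minimal sketch of the same row addition with pd.concat() follows; the frames `stations` and `new_rows` are hypothetical stand-ins, not the tutorial's Excel data.

```python
import pandas as pd

# Hypothetical stand-in for the tail of the station table.
stations = pd.DataFrame(
    {
        "Nom de station": ["ST JOHN'S", "STEPHENVILLE"],
        "stnid": ["8403505", "8403801"],
        "élév (m)": [141, 26],
        "Prov": ["NL", "NL"],
    }
).set_index("Prov")

# The new rows carry only some of the columns, like df_new in the text.
new_rows = pd.DataFrame(
    {
        "Nom de station": ["station1", "station2"],
        "stnid": [8888, 9999],
        "Prov": ["BC", "QC"],
    }
).set_index("Prov")

# pd.concat stacks the frames along axis=0 (rows).
# Columns missing from new_rows are filled with NaN.
combined = pd.concat([stations, new_rows])
print(combined)
```
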
#### c-  Row Deletion

Use an index label with drop() to delete rows from a DataFrame. If the label is duplicated, every row carrying that label is dropped.

In [92]:
dataframe = pd.read_excel("./DATA/Homog_Temperature_Stations.xls", skiprows = range(0, 3)).set_index("Prov")
dataframe.head()


Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N
BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N
BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N
BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y
BC,BELLA COOLA,1060841,1895,5,2017,11,52.37,-126.68,18,Y


In [93]:
from tabulate import tabulate
print(tabulate(dataframe.head(), headers='keys', tablefmt='pipe'))

| Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |
|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|
| BC     | AGASSIZ          | 1100120 |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 | N              |
| BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 | N              |
| BC     | BARKERVILLE      | 1090660 |         1888 |           2 |         2015 |           3 |       53.07 |      -121.52 |       1265 | N              |
| BC     | BEAVERDELL       | 1130771 |         1939 |           1 |         2006 |           9 |       49.48 |      -119.05 |        838 | Y              |
| BC     | BELLA COOLA      | 1060841 |         1895 |    

In [94]:
# Drop rows with label 'BC'
dataframe = dataframe.drop('BC')
dataframe.head()

Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
YT,BURWASH,2100181,1966,10,2017,12,61.37,-139.05,807,Y
YT,DAWSON,2100LRP,1901,1,2017,12,64.05,-139.13,370,Y
N YT,HAINES JUNCTIO,2100630,1944,10,2017,12,60.75,-137.5,596,N
YT,KOMAKUK BEACH,2100682,1958,7,2017,12,69.62,-140.2,13,Y
YT,MAYO,2100701,1924,10,2017,12,63.62,-135.87,504,Y


In [96]:
from tabulate import tabulate
print(tabulate(dataframe.head(), headers='keys', tablefmt='pipe'))

| Prov   | Nom de station   | stnid   |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |
|:-------|:-----------------|:--------|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|
| YT     | BURWASH          | 2100181 |         1966 |          10 |         2017 |          12 |       61.37 |      -139.05 |        807 | Y              |
| YT     | DAWSON           | 2100LRP |         1901 |           1 |         2017 |          12 |       64.05 |      -139.13 |        370 | Y              |
| N   YT | HAINES JUNCTIO   | 2100630 |         1944 |          10 |         2017 |          12 |       60.75 |      -137.5  |        596 | N              |
| YT     | KOMAKUK BEACH    | 2100682 |         1958 |           7 |         2017 |          12 |       69.62 |      -140.2  |         13 | Y              |
| YT     | MAYO             | 2100701 |         1924 |    

In [97]:
dataframe = pd.read_excel("./DATA/Homog_Temperature_Stations.xls", skiprows = range(0, 3)).set_index("Prov")
dataframe.loc['ON'].head()

Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes
ON,ATIKOKAN,6020LPQ,1917,1,2017,12,48.8,-91.58,442,Y
ON,BEATRICE,6110607,1878,1,2017,12,45.13,-79.4,297,Y
ON,BELLEVILLE,6150689,1921,1,2017,12,44.15,-77.4,76,N
ON,BIG TROUT LAKE,6010735,1939,2,2017,12,53.83,-89.87,224,Y
ON,BROCKVILLE,6100971,1915,7,2017,12,44.6,-75.67,96,Y


In [98]:
from tabulate import tabulate
print(tabulate(dataframe.loc['ON'].head(), headers='keys', tablefmt='pipe'))

| Prov   | Nom de station   | stnid   |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |
|:-------|:-----------------|:--------|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|
| ON     | ATIKOKAN         | 6020LPQ |         1917 |           1 |         2017 |          12 |       48.8  |       -91.58 |        442 | Y              |
| ON     | BEATRICE         | 6110607 |         1878 |           1 |         2017 |          12 |       45.13 |       -79.4  |        297 | Y              |
| ON     | BELLEVILLE       | 6150689 |         1921 |           1 |         2017 |          12 |       44.15 |       -77.4  |         76 | N              |
| ON     | BIG TROUT LAKE   | 6010735 |         1939 |           2 |         2017 |          12 |       53.83 |       -89.87 |        224 | Y              |
| ON     | BROCKVILLE       | 6100971 |         1915 |    

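Because drop() removes every row that shares a label, it can be useful to see the alternatives next to each other. The sketch below uses a hypothetical four-row `stations` frame (names and elevations taken from rows shown in this tutorial) and is only one way to do it.

```python
import pandas as pd

# Hypothetical stand-in for the station table, indexed by province code.
stations = pd.DataFrame(
    {
        "Nom de station": ["AGASSIZ", "ATLIN", "BURWASH", "ATIKOKAN"],
        "élév (m)": [15, 674, 807, 442],
        "Prov": ["BC", "BC", "YT", "ON"],
    }
).set_index("Prov")

# Dropping a duplicated label removes every row that carries it.
without_bc = stations.drop("BC")

# A list of labels drops several groups at once.
only_on = stations.drop(["BC", "YT"])

# To remove only some rows of a label, keep the complement of a boolean mask.
low_bc = (stations.index == "BC") & (stations["élév (m)"].to_numpy() < 100)
without_low_bc = stations[~low_bc]

print(without_bc)
print(only_on)
print(without_low_bc)
```
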
### 7.2.3.3  Merging/Joining Dataframe

Pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL.

     pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)


<table border="1" class="docutils">
<colgroup>
<col width="27%">
<col width="57%">
</colgroup>
<tbody valign="top">

<tr><td><tt class="docutils literal"><span class="pre">left</span></tt></td>
<td>DataFrame object</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">right</span></tt></td>
<td>Another DataFrame object</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">on</span></tt></td>
<td>Columns (names) to join on. Must be found in both the left and right DataFrame objects.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">left_on</span></tt></td>
<td>Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">right_on</span></tt></td>
<td>Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">left_index</span></tt></td>
<td>If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">right_index</span></tt></td>
<td>Same usage as left_index for the right DataFrame.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">how</span></tt></td>
<td>One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each option is illustrated below.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">sort</span></tt></td>
<td>Sort the result DataFrame by the join keys in lexicographical order. Defaults to False; leaving it at False is substantially faster in many cases.</td>
</tr>

</tbody>
</table>


Let us now create two different DataFrames and perform the merging operations on them.

In [99]:
left_dataframe = pd.DataFrame({
   'id':[1,2,3,4],
   'Nom de station': ['MONTREAL TAVISH', 'QUEBEC', 'TADOUSSAC','OKA'],
   'variable':['var1','var2','var6','var5']})

right_dataframe = pd.DataFrame(
   {'id':[1,2,3,4],
   'Nom de station': ['TORONTO', 'OTTAWA', 'KINGSTON','CHAPLEAU'],
   'variable':['var3','var1','var6','var5']})
left_dataframe

,id,Nom de station,variable
0,1,MONTREAL TAVISH,var1
1,2,QUEBEC,var2
2,3,TADOUSSAC,var6
3,4,OKA,var5


In [100]:
from tabulate import tabulate
print(tabulate(left_dataframe, headers='keys', tablefmt='pipe'))

|    |   id | Nom de station   | variable   |
|---:|-----:|:-----------------|:-----------|
|  0 |    1 | MONTREAL TAVISH  | var1       |
|  1 |    2 | QUEBEC           | var2       |
|  2 |    3 | TADOUSSAC        | var6       |
|  3 |    4 | OKA              | var5       |


In [101]:
right_dataframe

,id,Nom de station,variable
0,1,TORONTO,var3
1,2,OTTAWA,var1
2,3,KINGSTON,var6
3,4,CHAPLEAU,var5


In [102]:
from tabulate import tabulate
print(tabulate(right_dataframe, headers='keys', tablefmt='pipe'))

|    |   id | Nom de station   | variable   |
|---:|-----:|:-----------------|:-----------|
|  0 |    1 | TORONTO          | var3       |
|  1 |    2 | OTTAWA           | var1       |
|  2 |    3 | KINGSTON         | var6       |
|  3 |    4 | CHAPLEAU         | var5       |


#### a-  Merge Two DataFrames on a Key

In [104]:
data=pd.merge(left_dataframe,right_dataframe,on='id')

In [105]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|    |   id | Nom de station_x   | variable_x   | Nom de station_y   | variable_y   |
|---:|-----:|:-------------------|:-------------|:-------------------|:-------------|
|  0 |    1 | MONTREAL TAVISH    | var1         | TORONTO            | var3         |
|  1 |    2 | QUEBEC             | var2         | OTTAWA             | var1         |
|  2 |    3 | TADOUSSAC          | var6         | KINGSTON           | var6         |
|  3 |    4 | OKA                | var5         | CHAPLEAU           | var5         |


#### b-  Merge Two DataFrames on Multiple Keys

In [106]:
data=pd.merge(left_dataframe,right_dataframe,on=['id','variable'])

In [107]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|    |   id | Nom de station_x   | variable   | Nom de station_y   |
|---:|-----:|:-------------------|:-----------|:-------------------|
|  0 |    3 | TADOUSSAC          | var6       | KINGSTON           |
|  1 |    4 | OKA                | var5       | CHAPLEAU           |


#### c-  Merge Two DataFrames Using the 'how' Argument

The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or the right tables, the values in the joined table will be NA.
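
A compact way to see what each value of how does is to count the rows it produces and to ask pandas where each key came from. The sketch below is an illustration only: `left` and `right` are reduced, hypothetical copies of the two frames (just the id and variable columns), and indicator=True is a standard pd.merge option that adds a '_merge' column.

```python
import pandas as pd

# Reduced copies of the two example frames (id and variable columns only).
left = pd.DataFrame({"id": [1, 2, 3, 4], "variable": ["var1", "var2", "var6", "var5"]})
right = pd.DataFrame({"id": [1, 2, 3, 4], "variable": ["var3", "var1", "var6", "var5"]})

# Number of rows produced by each join type on the 'variable' key.
row_counts = {
    how: len(pd.merge(left, right, on="variable", how=how))
    for how in ["left", "right", "outer", "inner"]
}
print(row_counts)

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
outer = pd.merge(left, right, on="variable", how="outer", indicator=True)
print(outer[["variable", "_merge"]])
```

Keys marked left_only or right_only are the ones whose columns from the other table are filled with NA.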


In [108]:
# Left Join
data=pd.merge(left_dataframe, right_dataframe, on='variable', how='left')

In [109]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|    |   id_x | Nom de station_x   | variable   |   id_y | Nom de station_y   |
|---:|-------:|:-------------------|:-----------|-------:|:-------------------|
|  0 |      1 | MONTREAL TAVISH    | var1       |      2 | OTTAWA             |
|  1 |      2 | QUEBEC             | var2       |    nan | nan                |
|  2 |      3 | TADOUSSAC          | var6       |      3 | KINGSTON           |
|  3 |      4 | OKA                | var5       |      4 | CHAPLEAU           |


In [111]:
# right Join
data=pd.merge(left_dataframe, right_dataframe, on='variable', how='right')

In [112]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|    |   id_x | Nom de station_x   | variable   |   id_y | Nom de station_y   |
|---:|-------:|:-------------------|:-----------|-------:|:-------------------|
|  0 |      1 | MONTREAL TAVISH    | var1       |      2 | OTTAWA             |
|  1 |      3 | TADOUSSAC          | var6       |      3 | KINGSTON           |
|  2 |      4 | OKA                | var5       |      4 | CHAPLEAU           |
|  3 |    nan | nan                | var3       |      1 | TORONTO            |


In [113]:
# outer Join
data=pd.merge(left_dataframe, right_dataframe, on='variable', how='outer')

In [114]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|    |   id_x | Nom de station_x   | variable   |   id_y | Nom de station_y   |
|---:|-------:|:-------------------|:-----------|-------:|:-------------------|
|  0 |      1 | MONTREAL TAVISH    | var1       |      2 | OTTAWA             |
|  1 |      2 | QUEBEC             | var2       |    nan | nan                |
|  2 |      3 | TADOUSSAC          | var6       |      3 | KINGSTON           |
|  3 |      4 | OKA                | var5       |      4 | CHAPLEAU           |
|  4 |    nan | nan                | var3       |      1 | TORONTO            |


In [115]:
# inner Join
data=pd.merge(left_dataframe, right_dataframe, on='variable', how='inner')

In [116]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|    |   id_x | Nom de station_x   | variable   |   id_y | Nom de station_y   |
|---:|-------:|:-------------------|:-----------|-------:|:-------------------|
|  0 |      1 | MONTREAL TAVISH    | var1       |      2 | OTTAWA             |
|  1 |      3 | TADOUSSAC          | var6       |      3 | KINGSTON           |
|  2 |      4 | OKA                | var5       |      4 | CHAPLEAU           |


### 7.2.3.4  Dataframe Concatenation

Pandas provides various facilities for easily combining together DataFrame objects.

    pd.concat(objs,axis=0,join='outer',join_axes=None,ignore_index=False)
    

<table border="1" class="docutils">
<colgroup>
<col width="27%">
<col width="57%">
</colgroup>
<tbody valign="top">

<tr><td><tt class="docutils literal"><span class="pre">objs</span></tt></td>
<td>This is a sequence or mapping of Series, DataFrame objects.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">axis</span></tt></td>
<td> {0, 1, ...}, default 0. This is the axis to concatenate along.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">join</span></tt></td>
<td>{‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on other axis(es). Outer for union and inner for intersection.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">ignore_index</span></tt></td>
<td>boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre">join_axes</span></tt></td>
<td>List of Index objects. Specific indexes to use for the other (n-1) axes instead of performing inner/outer set logic. This argument was deprecated in pandas 0.25 and removed in pandas 1.0.</td>
</tr>

</tbody>
</table>    
    

In [117]:
dataframe1 = pd.DataFrame({
   'id':[1,2,3,4],
   'Nom de station': ['MONTREAL TAVISH', 'QUEBEC', 'TADOUSSAC','OKA'],
   'variable':['var1','var2','var6','var5']})

dataframe2 = pd.DataFrame(
   {'id':[1,2,3,4],
   'Nom de station': ['TORONTO', 'OTTAWA', 'KINGSTON','CHAPLEAU'],
   'variable':['var3','var1','var6','var5']})

data=pd.concat([dataframe1,dataframe2])

In [118]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|    |   id | Nom de station   | variable   |
|---:|-----:|:-----------------|:-----------|
|  0 |    1 | MONTREAL TAVISH  | var1       |
|  1 |    2 | QUEBEC           | var2       |
|  2 |    3 | TADOUSSAC        | var6       |
|  3 |    4 | OKA              | var5       |
|  0 |    1 | TORONTO          | var3       |
|  1 |    2 | OTTAWA           | var1       |
|  2 |    3 | KINGSTON         | var6       |
|  3 |    4 | CHAPLEAU         | var5       |


Suppose we want to associate a specific key with each of the pieces being concatenated. We can do this with the keys argument, which adds an outer index level:

In [119]:
data=pd.concat([dataframe1,dataframe2],keys=['QC','ON'])

In [120]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|           |   id | Nom de station   | variable   |
|:----------|-----:|:-----------------|:-----------|
| ('QC', 0) |    1 | MONTREAL TAVISH  | var1       |
| ('QC', 1) |    2 | QUEBEC           | var2       |
| ('QC', 2) |    3 | TADOUSSAC        | var6       |
| ('QC', 3) |    4 | OKA              | var5       |
| ('ON', 0) |    1 | TORONTO          | var3       |
| ('ON', 1) |    2 | OTTAWA           | var1       |
| ('ON', 2) |    3 | KINGSTON         | var6       |
| ('ON', 3) |    4 | CHAPLEAU         | var5       |


If we don't want duplicated index values, set ignore_index to True. The rows are then relabeled 0, ..., n - 1 and any keys passed are discarded.

In [121]:
data=pd.concat([dataframe1,dataframe2],keys=['QC','ON'],ignore_index=True)

In [122]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|    |   id | Nom de station   | variable   |
|---:|-----:|:-----------------|:-----------|
|  0 |    1 | MONTREAL TAVISH  | var1       |
|  1 |    2 | QUEBEC           | var2       |
|  2 |    3 | TADOUSSAC        | var6       |
|  3 |    4 | OKA              | var5       |
|  4 |    1 | TORONTO          | var3       |
|  5 |    2 | OTTAWA           | var1       |
|  6 |    3 | KINGSTON         | var6       |
|  7 |    4 | CHAPLEAU         | var5       |


If the two Dataframes need to be added along axis=1, then the new columns will be appended.

In [123]:
data=pd.concat([dataframe1,dataframe2],axis=1)

In [124]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|    |   id | Nom de station   | variable   |   id | Nom de station   | variable   |
|---:|-----:|:-----------------|:-----------|-----:|:-----------------|:-----------|
|  0 |    1 | MONTREAL TAVISH  | var1       |    1 | TORONTO          | var3       |
|  1 |    2 | QUEBEC           | var2       |    2 | OTTAWA           | var1       |
|  2 |    3 | TADOUSSAC        | var6       |    3 | KINGSTON         | var6       |
|  3 |    4 | OKA              | var5       |    4 | CHAPLEAU         | var5       |

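The join argument only matters when the frames do not share all their columns (for axis=0). As a small sketch under that assumption, the two frames below are hypothetical: `flag` is a made-up extra column that exists only in the second frame.

```python
import pandas as pd

# Two hypothetical frames with partly different columns.
qc = pd.DataFrame(
    {"id": [1, 2], "Nom de station": ["QUEBEC", "OKA"], "variable": ["var2", "var5"]}
)
on = pd.DataFrame(
    {"id": [1, 2], "Nom de station": ["TORONTO", "OTTAWA"], "flag": ["a", "b"]}
)

# join='outer' (the default) keeps the union of the columns and fills gaps with NaN.
outer = pd.concat([qc, on], join="outer", ignore_index=True)

# join='inner' keeps only the columns present in every frame.
inner = pd.concat([qc, on], join="inner", ignore_index=True)

print(outer)
print(inner)
```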

####  Concatenating Using append

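A useful shortcut to concat is the append instance method on DataFrame. It concatenates along axis=0, namely the index. It was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell below requires an older version.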

In [125]:
data=dataframe1.append(dataframe2)

In [126]:
from tabulate import tabulate
print(tabulate(data, headers='keys', tablefmt='pipe'))

|    |   id | Nom de station   | variable   |
|---:|-----:|:-----------------|:-----------|
|  0 |    1 | MONTREAL TAVISH  | var1       |
|  1 |    2 | QUEBEC           | var2       |
|  2 |    3 | TADOUSSAC        | var6       |
|  3 |    4 | OKA              | var5       |
|  0 |    1 | TORONTO          | var3       |
|  1 |    2 | OTTAWA           | var1       |
|  2 |    3 | KINGSTON         | var6       |
|  3 |    4 | CHAPLEAU         | var5       |


### 7.2.4  Basic Functionality on DataFrame

There are many built-in functions and methods:

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.html


We will present some useful attributes and methods by exploring a dataset.

In [127]:
dataframe = pd.read_csv("./DATA/Climato_Stations_ECCC_1981_2010_YEAR.csv", encoding='latin-1')
dataframe.head()

,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
0,0,BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N,,,,,
1,1,BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N,5.630427,18.903333,-3.46052,-19.235,860.083333
2,2,BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N,,,,,
3,3,BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y,12.628017,23.641667,4.017479,-3.646,1798.093333
4,4,BC,BLIND CHANNEL,1021480,1958,7,2016,2,50.42,-125.5,23,N,12.155616,20.200667,6.776893,1.094667,2518.68


In [128]:
from tabulate import tabulate
print(tabulate(dataframe.head(), headers='keys', tablefmt='pipe'))

|    |   Unnamed: 0 | Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |      Tmax |   Tmax90p |      Tmin |   Tmin10p |      DG0 |
|---:|-------------:|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|----------:|----------:|----------:|----------:|---------:|
|  0 |            0 | BC     | AGASSIZ          | 1100120 |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 | N              | nan       |  nan      | nan       | nan       |  nan     |
|  1 |            1 | BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 | N              |   5.63043 |   18.9033 |  -3.46052 | -19.235   |  860.083 |
|  2 |            2 | BC     | BARKERVILLE      | 109066

- <b>.shape</b> attribute:

Returns a tuple representing the dimensionality of the DataFrame. Tuple (a,b), where a represents the number of rows and b represents the number of columns.

In [129]:
dataframe.shape 

(289, 17)

- <b>.columns</b> attribute:

Returns names of the columns in our Dataframe.

In [130]:
dataframe.columns

Index(['Unnamed: 0', 'Prov', 'Nom de station', 'stnid', 'année déb.',
       'mois déb.', 'année fin.', 'mois fin.', 'lat (deg)', 'long (deg)',
       'élév (m)', 'stns jointes', 'Tmax', 'Tmax90p', 'Tmin', 'Tmin10p',
       'DG0'],
      dtype='object')

- <b>.empty</b> attribute:

Returns the Boolean value saying whether the Object is empty or not; True indicates that the object is empty.

In [131]:
dataframe.empty

False

- <b>.isnull</b> method:

To make missing values easier to detect (across different array dtypes), pandas provides the isnull() and notnull() methods. Both return a boolean object of the same shape as the input.



In [132]:
dataframe['Tmax'].isnull().head()

0     True
1    False
2     True
3    False
4    False
Name: Tmax, dtype: bool

We combine this method with .sum() to know the number of missing values.

In [133]:
dataframe['Tmax'].isnull().sum() 

117

- <b>.dropna</b> method:

To exclude missing values, use the dropna method, optionally with the axis argument. By default axis=0, which works row by row: if any value within a row is NA, the whole row is excluded.

In [134]:
dataframe_sans_NaN = dataframe.dropna() 

In [135]:
dataframe_sans_NaN['Tmax'].isnull().sum() 

0

In [136]:
dataframe_sans_NaN.shape # new dimensions of our table

(169, 17)
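
Dropping every row that has any missing value can discard a lot of data (here the table goes from 289 to 169 rows). The sketch below shows a few other standard options on a hypothetical three-row frame; the station names and temperatures are made-up toy values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two columns.
climate = pd.DataFrame(
    {
        "Nom de station": ["A", "B", "C"],
        "Tmax": [np.nan, 5.6, 12.6],
        "Tmin": [np.nan, -3.5, np.nan],
    }
)

# Number of missing values in every column at once.
missing_per_column = climate.isnull().sum()

# axis=0 (default): drop rows that contain any NaN.
rows_complete = climate.dropna()

# axis=1: drop columns that contain any NaN.
cols_complete = climate.dropna(axis=1)

# subset: only the listed columns are checked for NaN.
rows_with_tmax = climate.dropna(subset=["Tmax"])

print(missing_per_column)
print(rows_complete)
print(cols_complete)
print(rows_with_tmax)
```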

 - <b>sort_values</b> method:
 
sort_values() sorts a DataFrame by its values. Its 'by' argument takes the name of the column whose values determine the row order. The result is a new DataFrame; the original index labels are kept.
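
As a small sketch of the sorting options, the frame below is a hypothetical four-row extract (station names and start years as they appear in this tutorial's tables). The ascending argument is a standard sort_values option that accepts one boolean or one per sort column.

```python
import pandas as pd

# Hypothetical four-row extract of the station table.
stations = pd.DataFrame(
    {
        "Prov": ["BC", "AB", "BC", "AB"],
        "Nom de station": ["ATLIN", "EDMONTON", "AGASSIZ", "CALGARY"],
        "année déb.": [1905, 1880, 1893, 1885],
    }
)

# Sort by one column, oldest station first.
by_year = stations.sort_values(by="année déb.")

# Same column, newest station first.
newest_first = stations.sort_values(by="année déb.", ascending=False)

# Province A to Z, then newest station first within each province.
mixed = stations.sort_values(by=["Prov", "année déb."], ascending=[True, False])

print(by_year)
print(newest_first)
print(mixed)
```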


In [137]:
dataframe.head()

,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
0,0,BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N,,,,,
1,1,BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N,5.630427,18.903333,-3.46052,-19.235,860.083333
2,2,BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N,,,,,
3,3,BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y,12.628017,23.641667,4.017479,-3.646,1798.093333
4,4,BC,BLIND CHANNEL,1021480,1958,7,2016,2,50.42,-125.5,23,N,12.155616,20.200667,6.776893,1.094667,2518.68


In [138]:
from tabulate import tabulate
print(tabulate(dataframe.head(), headers='keys', tablefmt='pipe'))

|    |   Unnamed: 0 | Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |      Tmax |   Tmax90p |      Tmin |   Tmin10p |      DG0 |
|---:|-------------:|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|----------:|----------:|----------:|----------:|---------:|
|  0 |            0 | BC     | AGASSIZ          | 1100120 |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 | N              | nan       |  nan      | nan       | nan       |  nan     |
|  1 |            1 | BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 | N              |   5.63043 |   18.9033 |  -3.46052 | -19.235   |  860.083 |
|  2 |            2 | BC     | BARKERVILLE      | 109066

In [139]:
df_label_sorted = dataframe.sort_values(by="Prov")
df_label_sorted.head()

,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
127,127,AB,SLAVE LAKE,3065995,1922,8,2017,12,55.3,-114.78,583,Y,7.481366,22.464333,-3.422267,-20.380333,1131.03
106,106,AB,EDMONTON,3012216,1880,7,2017,12,53.57,-113.52,723,Y,8.982117,24.328,-3.117089,-19.190667,1095.933333
104,104,AB,COLD LAKE,3081680,1925,7,2017,12,54.42,-110.28,541,Y,7.724485,24.273667,-3.037595,-21.203667,1303.95
103,103,AB,CARWAY,3031402,1914,8,2017,12,49.0,-113.37,1354,Y,11.139694,24.925,-1.618858,-13.98,956.836667
102,102,AB,CAMROSE,3011240,1946,3,2017,12,53.03,-112.82,739,N,8.964966,24.175667,-3.714421,-20.588,1048.13


In [140]:
from tabulate import tabulate
print(tabulate(df_label_sorted.head(), headers='keys', tablefmt='pipe'))

|     |   Unnamed: 0 | Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |     Tmax |   Tmax90p |     Tmin |   Tmin10p |      DG0 |
|----:|-------------:|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|---------:|----------:|---------:|----------:|---------:|
| 127 |          127 | AB     | SLAVE LAKE       | 3065995 |         1922 |           8 |         2017 |          12 |       55.3  |      -114.78 |        583 | Y              |  7.48137 |   22.4643 | -3.42227 |  -20.3803 | 1131.03  |
| 106 |          106 | AB     | EDMONTON         | 3012216 |         1880 |           7 |         2017 |          12 |       53.57 |      -113.52 |        723 | Y              |  8.98212 |   24.328  | -3.11709 |  -19.1907 | 1095.93  |
| 104 |          104 | AB     | COLD LAKE        | 3081680 |

The 'by' argument can also take a list of column names. The rows are then sorted by the first column, and ties are broken by the following ones.
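
As a small self-contained sketch of sorting on several columns, the table below is typed by hand from a few rows shown in this section (the column name `start_year` and the variable names are made up for the illustration). It also assumes you want a different direction per column, which `ascending` accepts as a list.

```python
import pandas as pd

# Hand-typed sample rows; "start_year" is a hypothetical column name.
stations = pd.DataFrame({
    "Prov": ["QC", "AB", "QC", "AB"],
    "start_year": [1913, 1885, 1880, 1918],
    "name": ["AMOS", "CALGARY", "BAGOTVILLE", "ATHABASCA"],
})

# Sort by province (A to Z), then by start year (most recent first).
ordered = stations.sort_values(by=["Prov", "start_year"],
                               ascending=[True, False])
print(ordered)
```

The `ascending` list must have the same length as the `by` list.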

In [141]:
df_label_sorted = dataframe.sort_values(by=['Prov','année déb.'])
df_label_sorted.head()

Unnamed: 0.1,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
106,106,AB,EDMONTON,3012216,1880,7,2017,12,53.57,-113.52,723,Y,8.982117,24.328,-3.117089,-19.190667,1095.933333
110,110,AB,FORT CHIPEWYAN,3072655,1883,10,2017,12,58.77,-111.12,238,Y,4.28219,23.566,-6.544206,-28.629667,1135.113333
121,121,AB,MEDICINE HAT,3034485,1883,8,2017,12,50.02,-110.72,717,Y,12.648926,28.497333,-0.1559,-15.783333,1563.866667
99,99,AB,CALGARY,3031092,1885,1,2017,12,51.12,-114.02,1084,Y,10.834976,24.467333,-1.444606,-15.235,1172.05
97,97,AB,BANFF,3050519,1887,11,2017,12,51.2,-115.55,1397,Y,8.906518,23.132,-3.084884,-15.858333,750.37


In [142]:
from tabulate import tabulate
print(tabulate(df_label_sorted.head(), headers='keys', tablefmt='pipe'))

|     |   Unnamed: 0 | Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |     Tmax |   Tmax90p |     Tmin |   Tmin10p |     DG0 |
|----:|-------------:|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|---------:|----------:|---------:|----------:|--------:|
| 106 |          106 | AB     | EDMONTON         | 3012216 |         1880 |           7 |         2017 |          12 |       53.57 |      -113.52 |        723 | Y              |  8.98212 |   24.328  | -3.11709 |  -19.1907 | 1095.93 |
| 110 |          110 | AB     | FORT CHIPEWYAN   | 3072655 |         1883 |          10 |         2017 |          12 |       58.77 |      -111.12 |        238 | Y              |  4.28219 |   23.566  | -6.54421 |  -28.6297 | 1135.11 |
| 121 |          121 | AB     | MEDICINE HAT     | 3034485 |    

- <b>sort_index()</b> method:

The sort_index() method sorts a DataFrame by its labels rather than by its values. The axis argument selects row labels (axis=0) or column labels (axis=1), and the ascending argument sets the sort order. By default, sorting is done on row labels in ascending order.
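
The effect of both arguments is easier to see on a tiny table. This is only a sketch on made-up data (the column names `a`, `b` and the index values are arbitrary):

```python
import pandas as pd

# Made-up table with unordered row labels and column labels.
df = pd.DataFrame({"b": [1, 2, 3], "a": [4, 5, 6]}, index=[20, 10, 30])

by_row = df.sort_index()                       # rows ordered 10, 20, 30
by_row_desc = df.sort_index(ascending=False)   # rows ordered 30, 20, 10
by_column = df.sort_index(axis=1)              # columns ordered a, b

print(by_row)
print(by_row_desc)
print(by_column)
```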

In [143]:
df_label_sorted.sort_index().head()

Unnamed: 0.1,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
0,0,BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N,,,,,
1,1,BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N,5.630427,18.903333,-3.46052,-19.235,860.083333
2,2,BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N,,,,,
3,3,BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y,12.628017,23.641667,4.017479,-3.646,1798.093333
4,4,BC,BLIND CHANNEL,1021480,1958,7,2016,2,50.42,-125.5,23,N,12.155616,20.200667,6.776893,1.094667,2518.68


In [145]:
from tabulate import tabulate
print(tabulate(df_label_sorted.sort_index().head(), headers='keys', tablefmt='pipe'))

|    |   Unnamed: 0 | Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |      Tmax |   Tmax90p |      Tmin |   Tmin10p |      DG0 |
|---:|-------------:|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|----------:|----------:|----------:|----------:|---------:|
|  0 |            0 | BC     | AGASSIZ          | 1100120 |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 | N              | nan       |  nan      | nan       | nan       |  nan     |
|  1 |            1 | BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 | N              |   5.63043 |   18.9033 |  -3.46052 | -19.235   |  860.083 |
|  2 |            2 | BC     | BARKERVILLE      | 109066

In [146]:
df_label_sorted.sort_index(ascending=False).head()

Unnamed: 0.1,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
288,288,NL,WABUSH LAKE,8504177,1960,11,2017,12,52.93,-66.87,551,Y,2.339743,19.693667,-8.024778,-29.453667,792.913333
287,287,NL,STEPHENVILLE,8403801,1895,6,2017,12,48.53,-58.55,26,Y,8.806534,20.614,1.812748,-10.582,1710.17
286,286,NL,ST JOHN'S,8403505,1874,1,2017,12,47.62,-52.75,141,Y,9.073669,21.623667,1.54977,-8.919,1474.82
285,285,NL,ST ANTHONY,8403389,1946,6,2017,12,51.37,-55.6,33,Y,,,,,
284,284,NL,PORT AUXBASQUES,8402975,1909,2,2017,9,47.58,-58.97,40,N,,,,,


In [147]:
from tabulate import tabulate
print(tabulate(df_label_sorted.sort_index(ascending=False).head(), headers='keys', tablefmt='pipe'))

|     |   Unnamed: 0 | Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |      Tmax |   Tmax90p |      Tmin |   Tmin10p |      DG0 |
|----:|-------------:|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|----------:|----------:|----------:|----------:|---------:|
| 288 |          288 | NL     | WABUSH LAKE      | 8504177 |         1960 |          11 |         2017 |          12 |       52.93 |       -66.87 |        551 | Y              |   2.33974 |   19.6937 |  -8.02478 |  -29.4537 |  792.913 |
| 287 |          287 | NL     | STEPHENVILLE     | 8403801 |         1895 |           6 |         2017 |          12 |       48.53 |       -58.55 |         26 | Y              |   8.80653 |   20.614  |   1.81275 |  -10.582  | 1710.17  |
| 286 |          286 | NL     | ST JOHN'S        | 8

In [148]:
df_label_sorted.sort_index(axis=1).head()

Unnamed: 0.1,DG0,Nom de station,Prov,Tmax,Tmax90p,Tmin,Tmin10p,Unnamed: 0,année déb.,année fin.,lat (deg),long (deg),mois déb.,mois fin.,stnid,stns jointes,élév (m)
106,1095.933333,EDMONTON,AB,8.982117,24.328,-3.117089,-19.190667,106,1880,2017,53.57,-113.52,7,12,3012216,Y,723
110,1135.113333,FORT CHIPEWYAN,AB,4.28219,23.566,-6.544206,-28.629667,110,1883,2017,58.77,-111.12,10,12,3072655,Y,238
121,1563.866667,MEDICINE HAT,AB,12.648926,28.497333,-0.1559,-15.783333,121,1883,2017,50.02,-110.72,8,12,3034485,Y,717
99,1172.05,CALGARY,AB,10.834976,24.467333,-1.444606,-15.235,99,1885,2017,51.12,-114.02,1,12,3031092,Y,1084
97,750.37,BANFF,AB,8.906518,23.132,-3.084884,-15.858333,97,1887,2017,51.2,-115.55,11,12,3050519,Y,1397


In [149]:
from tabulate import tabulate
print(tabulate(df_label_sorted.sort_index(axis=1).head(), headers='keys', tablefmt='pipe'))

|     |     DG0 | Nom de station   | Prov   |     Tmax |   Tmax90p |     Tmin |   Tmin10p |   Unnamed: 0 |   année déb. |   année fin. |   lat (deg) |   long (deg) |   mois déb. |   mois fin. |   stnid | stns jointes   |   élév (m) |
|----:|--------:|:-----------------|:-------|---------:|----------:|---------:|----------:|-------------:|-------------:|-------------:|------------:|-------------:|------------:|------------:|--------:|:---------------|-----------:|
| 106 | 1095.93 | EDMONTON         | AB     |  8.98212 |   24.328  | -3.11709 |  -19.1907 |          106 |         1880 |         2017 |       53.57 |      -113.52 |           7 |          12 | 3012216 | Y              |        723 |
| 110 | 1135.11 | FORT CHIPEWYAN   | AB     |  4.28219 |   23.566  | -6.54421 |  -28.6297 |          110 |         1883 |         2017 |       58.77 |      -111.12 |          10 |          12 | 3072655 | Y              |        238 |
| 121 | 1563.87 | MEDICINE HAT     | AB     | 12.6489  |   28.49

- <b>.describe()</b> method:  

Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

In [150]:
dataframe['Tmax'].describe()

count    172.000000
mean       7.916314
std        5.524630
min      -15.366463
25%        7.364646
50%        8.988052
75%       11.018241
max       16.105511
Name: Tmax, dtype: float64

- <b>.dtypes</b> attribute:  

Returns the data type of each column. It is an attribute, not a method, so it is used without parentheses.

In [151]:
dataframe.dtypes

Unnamed: 0          int64
Prov               object
Nom de station     object
stnid              object
année déb.          int64
mois déb.           int64
année fin.          int64
mois fin.           int64
lat (deg)         float64
long (deg)        float64
élév (m)            int64
stns jointes       object
Tmax              float64
Tmax90p           float64
Tmin              float64
Tmin10p           float64
DG0               float64
dtype: object

### 7.2.5  DataFrame Function Application

To apply your own or another library’s functions to Pandas objects, there are three important methods to know. The appropriate one depends on whether your function expects to operate on an entire DataFrame, on rows or columns, or on individual elements.

- Table wise Function Application: <b>.pipe()</b>

- Row or Column Wise Function Application: <b>.apply()</b>

- Element wise Function Application: <b>.applymap()</b> (renamed <b>.map()</b> in pandas 2.1)

- <b>.apply()</b> method: 

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument. By default, the operation is performed column wise, and each column is passed to the function as a Series.
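
The three levels of application can be compared side by side on a small made-up table. This is a sketch rather than part of the original notebook: the values, the helper `add_offset` and the variable names are invented. Because the element-wise method is called `map` in pandas 2.1 and later and `applymap` in older versions, the example picks whichever one is available.

```python
import pandas as pd

# Made-up temperatures, chosen so the arithmetic is exact.
temps = pd.DataFrame({"Tmin": [-3.25, 4.0, 6.75],
                      "Tmax": [5.75, 12.75, 12.25]})

def add_offset(frame, offset):
    """Table-wise function: receives the whole DataFrame."""
    return frame + offset

# Table wise: equivalent to add_offset(temps, 10).
shifted = temps.pipe(add_offset, 10)

# Column wise (default axis=0): one result per column.
col_range = temps.apply(lambda col: col.max() - col.min())

# Row wise (axis=1): one result per row.
row_range = temps.apply(lambda row: row["Tmax"] - row["Tmin"], axis=1)

# Element wise: use DataFrame.map when it exists, else fall back to applymap.
elementwise = temps.map if hasattr(temps, "map") else temps.applymap
rounded = elementwise(lambda x: int(round(x)))

print(shifted, col_range, row_range, rounded, sep="\n\n")
```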



In [152]:
dataframe = pd.read_csv("./DATA/Climato_Stations_ECCC_1981_2010_YEAR.csv", encoding='latin-1')

In [155]:
dataframe = pd.read_csv("./DATA/Climato_Stations_ECCC_1981_2010_YEAR.csv", encoding='latin-1')
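# Note: these replacements produce the strings "NaN" and "1", not real missing values
dataframe["stns jointes"]=dataframe["stns jointes"].apply(lambda x: x.replace("N", "NaN"))
dataframe["stns jointes"]=dataframe["stns jointes"].apply(lambda x: x.replace("Y", "1"))
# dropna() therefore only removes rows that have a true NaN in another column (e.g. Tmax)
dataframe = dataframe.dropna() 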
dataframe.head()

Unnamed: 0.1,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
1,1,BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,,5.630427,18.903333,-3.46052,-19.235,860.083333
3,3,BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,1.0,12.628017,23.641667,4.017479,-3.646,1798.093333
4,4,BC,BLIND CHANNEL,1021480,1958,7,2016,2,50.42,-125.5,23,,12.155616,20.200667,6.776893,1.094667,2518.68
5,5,BC,BLUE RIVER,1160899,1946,9,2017,12,52.13,-119.28,683,1.0,10.440754,25.991667,-1.125014,-12.582333,987.363333
9,9,BC,COMOX,1021830,1935,11,2017,12,49.72,-124.9,26,1.0,13.737613,23.102333,6.42485,-0.526,2444.393333


In [156]:
from tabulate import tabulate
print(tabulate(dataframe.head(), headers='keys', tablefmt='pipe'))

|    |   Unnamed: 0 | Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) |   stns jointes |     Tmax |   Tmax90p |     Tmin |   Tmin10p |      DG0 |
|---:|-------------:|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|---------------:|---------:|----------:|---------:|----------:|---------:|
|  1 |            1 | BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 |            nan |  5.63043 |   18.9033 | -3.46052 | -19.235   |  860.083 |
|  3 |            3 | BC     | BEAVERDELL       | 1130771 |         1939 |           1 |         2006 |           9 |       49.48 |      -119.05 |        838 |              1 | 12.628   |   23.6417 |  4.01748 |  -3.646   | 1798.09  |
|  4 |            4 | BC     | BLIND CHANNEL    | 1021480 |     

In [157]:
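# round(x, 2) keeps two decimals
dataframe["Tmin"]=dataframe["Tmin"].apply(lambda x: round(x,2))
# int(x) truncates toward zero (5.63 becomes 5); it does not round
dataframe["Tmax"]=dataframe["Tmax"].apply(lambda x: int(x))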
dataframe.head()

Unnamed: 0.1,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
1,1,BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,,5,18.903333,-3.46,-19.235,860.083333
3,3,BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,1.0,12,23.641667,4.02,-3.646,1798.093333
4,4,BC,BLIND CHANNEL,1021480,1958,7,2016,2,50.42,-125.5,23,,12,20.200667,6.78,1.094667,2518.68
5,5,BC,BLUE RIVER,1160899,1946,9,2017,12,52.13,-119.28,683,1.0,10,25.991667,-1.13,-12.582333,987.363333
9,9,BC,COMOX,1021830,1935,11,2017,12,49.72,-124.9,26,1.0,13,23.102333,6.42,-0.526,2444.393333


In [158]:
from tabulate import tabulate
print(tabulate(dataframe.head(), headers='keys', tablefmt='pipe'))

|    |   Unnamed: 0 | Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) |   stns jointes |   Tmax |   Tmax90p |   Tmin |   Tmin10p |      DG0 |
|---:|-------------:|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|---------------:|-------:|----------:|-------:|----------:|---------:|
|  1 |            1 | BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 |            nan |      5 |   18.9033 |  -3.46 | -19.235   |  860.083 |
|  3 |            3 | BC     | BEAVERDELL       | 1130771 |         1939 |           1 |         2006 |           9 |       49.48 |      -119.05 |        838 |              1 |     12 |   23.6417 |   4.02 |  -3.646   | 1798.09  |
|  4 |            4 | BC     | BLIND CHANNEL    | 1021480 |         1958 |      

### 7.2.6  DataFrame GroupBy method

Any groupby operation involves one or more of the following steps on the original object:

- Splitting the object into groups

- Applying a function to each group

- Combining the results

In many situations, we split the data into sets and apply some functionality to each subset. In the apply step, we can perform the following operations:

- Aggregation: computing a summary statistic

- Transformation: performing some group-specific operation

- Filtration: discarding data based on some condition
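
Before moving to the real dataset, the three kinds of operation listed above can be sketched on a toy table. The data and variable names here are invented for the illustration:

```python
import pandas as pd

# Toy data: two stations in AB, two in BC, one in PE.
df = pd.DataFrame({
    "Prov": ["AB", "AB", "BC", "BC", "PE"],
    "Tmin": [-3.0, -1.0, 4.0, 6.0, 2.0],
})
grouped = df.groupby("Prov")

# Aggregation: one value per group.
mean_by_prov = grouped["Tmin"].mean()

# Transformation: one value per original row, aligned on df's index.
anomaly = grouped["Tmin"].transform(lambda s: s - s.mean())

# Filtration: keep the rows of the groups that satisfy a condition.
multi_station = grouped.filter(lambda g: len(g) >= 2)

print(mean_by_prov, anomaly, multi_station, sep="\n\n")
```

Note how the result shape differs in each case: aggregation is indexed by group, transformation keeps the original index, and filtration returns a subset of the original rows.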

Let us now create a DataFrame object and perform all of these operations on it:

In [159]:
dataframe = pd.read_csv("./DATA/Climato_Stations_ECCC_1981_2010_YEAR.csv", encoding='latin-1')
dataframe.head()

Unnamed: 0.1,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
0,0,BC,AGASSIZ,1100120,1893,1,2017,12,49.25,-121.77,15,N,,,,,
1,1,BC,ATLIN,1200560,1905,8,2017,12,59.57,-133.7,674,N,5.630427,18.903333,-3.46052,-19.235,860.083333
2,2,BC,BARKERVILLE,1090660,1888,2,2015,3,53.07,-121.52,1265,N,,,,,
3,3,BC,BEAVERDELL,1130771,1939,1,2006,9,49.48,-119.05,838,Y,12.628017,23.641667,4.017479,-3.646,1798.093333
4,4,BC,BLIND CHANNEL,1021480,1958,7,2016,2,50.42,-125.5,23,N,12.155616,20.200667,6.776893,1.094667,2518.68


In [160]:
from tabulate import tabulate
print(tabulate(dataframe.head(), headers='keys', tablefmt='pipe'))

|    |   Unnamed: 0 | Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |      Tmax |   Tmax90p |      Tmin |   Tmin10p |      DG0 |
|---:|-------------:|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|----------:|----------:|----------:|----------:|---------:|
|  0 |            0 | BC     | AGASSIZ          | 1100120 |         1893 |           1 |         2017 |          12 |       49.25 |      -121.77 |         15 | N              | nan       |  nan      | nan       | nan       |  nan     |
|  1 |            1 | BC     | ATLIN            | 1200560 |         1905 |           8 |         2017 |          12 |       59.57 |      -133.7  |        674 | N              |   5.63043 |   18.9033 |  -3.46052 | -19.235   |  860.083 |
|  2 |            2 | BC     | BARKERVILLE      | 109066

Looking at the DataFrame above, we can see at least three variables that could be used to group the dataset: the province (Prov), the year recording began (année déb.) or the year recording ended (année fin.).

We will use the pandas <b>groupby()</b> method to group our data.

- <b>.unique()</b> method : 

Returns the unique values of a column.

In [161]:
dataframe["Prov"].unique() 

array(['BC', 'YT', 'N   YT', 'NT', 'NU', 'AB', 'SK', 'MB', 'ON', 'QC',
       'NB', 'NS', 'PE', 'NL'], dtype=object)

#### Split Data into Groups:

In [162]:
dataframe.groupby('Prov')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000009D6CF98>

- To view groups:

In [163]:
dataframe.groupby('Prov').groups

{'AB': Int64Index([ 96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108,
             109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121,
             122, 123, 124, 125, 126, 127, 128, 129, 130],
            dtype='int64'),
 'BC': Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
             17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
             34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
             50],
            dtype='int64'),
 'MB': Int64Index([156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
             169, 170, 171, 172, 173, 174, 175],
            dtype='int64'),
 'N   YT': Int64Index([52], dtype='int64'),
 'NB': Int64Index([254, 255, 256, 257, 258, 259, 260, 261], dtype='int64'),
 'NL': Int64Index([275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287,
             288],
            dtype='int64'),
 'NS': Int64Index([262, 263, 264, 265, 266, 267,

- Group by with multiple columns:

In [165]:
dataframe.groupby(['Prov','année fin.']).groups

{('AB', 2011): Int64Index([116], dtype='int64'),
 ('AB', 2013): Int64Index([101], dtype='int64'),
 ('AB', 2016): Int64Index([100, 130], dtype='int64'),
 ('AB',
  2017): Int64Index([ 96,  97,  98,  99, 102, 103, 104, 105, 106, 107, 108, 109, 110,
             111, 112, 113, 114, 115, 117, 118, 119, 120, 121, 122, 123, 124,
             125, 126, 127, 128, 129],
            dtype='int64'),
 ('BC', 2006): Int64Index([3], dtype='int64'),
 ('BC', 2013): Int64Index([20], dtype='int64'),
 ('BC', 2014): Int64Index([21, 25], dtype='int64'),
 ('BC', 2015): Int64Index([2, 8, 11, 16], dtype='int64'),
 ('BC', 2016): Int64Index([4, 6, 41], dtype='int64'),
 ('BC',
  2017): Int64Index([ 0,  1,  5,  7,  9, 10, 12, 13, 14, 15, 17, 18, 19, 22, 23, 24, 26,
             27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 42, 43, 44,
             45, 46, 47, 48, 49, 50],
            dtype='int64'),
 ('MB', 2016): Int64Index([156], dtype='int64'),
 ('MB',
  2017): Int64Index([157, 158, 159, 160, 161, 162

#### Iterating through Groups:

In [167]:
grouped = dataframe.groupby('Prov')

for name,group in grouped:
   print(name)
   print(group)

AB
     Unnamed: 0 Prov   Nom de station    stnid  année déb.  mois déb.  \
96           96   AB        ATHABASCA  3060L20        1918          6   
97           97   AB            BANFF  3050519        1887         11   
98           98   AB      BEAVERLODGE  3070600        1913          4   
99           99   AB          CALGARY  3031092        1885          1   
100         100   AB           CALMAR  3011120        1915         11   
101         101   AB          CAMPSIE  3061200        1912          9   
102         102   AB          CAMROSE  3011240        1946          3   
103         103   AB           CARWAY  3031402        1914          8   
104         104   AB        COLD LAKE  3081680        1925          7   
105         105   AB       CORONATION  3011887        1924          4   
106         106   AB         EDMONTON  3012216        1880          7   
107         107   AB            EDSON  3062246        1914          2   
108         108   AB         ENTRANCE  306A009  

#### Select a group:

- <b>.get_group()</b> method:

We can select a single group.

In [168]:
grouped.get_group('QC').head()

Unnamed: 0.1,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
216,216,QC,AMOS,709CEE9,1913,6,2017,8,48.57,-78.13,305,Y,,,,,
217,217,QC,BAGOTVILLE,7060400,1880,11,2017,12,48.33,-71.0,159,Y,8.28139,25.297333,-1.833594,-20.811667,1528.413333
218,218,QC,BEAUCEVILLE,7027283,1913,8,2017,8,46.15,-70.7,168,Y,10.551494,26.078333,-1.376339,-19.628333,1509.36
219,219,QC,BROME,7020840,1890,9,2014,7,45.18,-72.57,206,N,11.140937,26.176667,-0.294151,-17.56,1676.226667
220,220,QC,CAUSAPSCAL,7051200,1913,11,2017,8,48.37,-67.23,168,N,,,,,


In [169]:
from tabulate import tabulate
print(tabulate(grouped.get_group('QC').head(), headers='keys', tablefmt='pipe'))

|     |   Unnamed: 0 | Prov   | Nom de station   | stnid   |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |      Tmax |   Tmax90p |       Tmin |   Tmin10p |     DG0 |
|----:|-------------:|:-------|:-----------------|:--------|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|----------:|----------:|-----------:|----------:|--------:|
| 216 |          216 | QC     | AMOS             | 709CEE9 |         1913 |           6 |         2017 |           8 |       48.57 |       -78.13 |        305 | Y              | nan       |  nan      | nan        |  nan      |  nan    |
| 217 |          217 | QC     | BAGOTVILLE       | 7060400 |         1880 |          11 |         2017 |          12 |       48.33 |       -71    |        159 | Y              |   8.28139 |   25.2973 |  -1.83359  |  -20.8117 | 1528.41 |
| 218 |          218 | QC     | BEAUCEVILLE      | 7

#### Aggregations

An aggregation function returns a single aggregated value for each group. Once the GroupBy object is created, several aggregation operations can be performed on the grouped data.

The most direct way is the aggregate method, or its equivalent alias agg:

In [170]:
import numpy as np       # needed for np.mean, np.min, ... below
dataframe = pd.read_csv("./DATA/Climato_Stations_ECCC_1981_2010_YEAR.csv", encoding='latin-1')
grouped = dataframe.groupby('Prov')

In [171]:
grouped['Tmin'].agg(np.mean)

Prov
AB        -3.050928
BC         1.578024
MB        -4.493900
N   YT          NaN
NB         0.341750
NL        -1.324334
NS         2.542629
NT        -8.518655
NU       -16.522658
ON         0.031970
PE         1.986827
QC        -1.992695
SK        -3.208984
YT        -8.995421
Name: Tmin, dtype: float64

#### Applying Multiple Aggregation Functions at Once
With a grouped Series, you can also pass a list of functions to aggregate with. The result is a DataFrame with one column per function:
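
As a complementary sketch on made-up data, functions can also be given by name as strings, and keyword arguments ("named aggregation", available since pandas 0.25) let you choose the output column names. The names `coldest` and `warmest` are arbitrary.

```python
import pandas as pd

# Made-up data: two stations per province.
df = pd.DataFrame({
    "Prov": ["AB", "AB", "BC", "BC"],
    "Tmin": [-3.0, -1.0, 4.0, 6.0],
})
grouped = df.groupby("Prov")

# A list of function names gives one output column per function.
stats = grouped["Tmin"].agg(["min", "mean", "max"])

# Keyword arguments choose the names of the output columns.
named = grouped["Tmin"].agg(coldest="min", warmest="max")

print(stats)
print(named)
```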

In [172]:
grouped['Tmin'].agg([np.min, np.mean, np.max, np.std])

Unnamed: 0_level_0,amin,mean,amax,std
Prov,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AB,-6.544206,-3.050928,-0.1559,1.401326
BC,-6.06858,1.578024,7.016292,3.896384
MB,-10.094295,-4.4939,-1.980265,2.618348
N YT,,,,
NB,-0.491043,0.34175,0.840796,0.598429
NL,-8.024778,-1.324334,1.812748,3.425846
NS,1.214131,2.542629,3.75946,0.982653
NT,-12.451845,-8.518655,-6.683226,2.16937
NU,-21.935075,-16.522658,-12.233232,3.049096
ON,-7.484845,0.03197,5.918716,3.647877


In [173]:
from tabulate import tabulate
print(tabulate(grouped['Tmin'].agg([np.min, np.mean, np.max, np.std]), headers='keys', tablefmt='pipe'))

| Prov   |       amin |        mean |       amax |        std |
|:-------|-----------:|------------:|-----------:|-----------:|
| AB     |  -6.54421  |  -3.05093   |  -0.1559   |   1.40133  |
| BC     |  -6.06858  |   1.57802   |   7.01629  |   3.89638  |
| MB     | -10.0943   |  -4.4939    |  -1.98027  |   2.61835  |
| N   YT | nan        | nan         | nan        | nan        |
| NB     |  -0.491043 |   0.34175   |   0.840796 |   0.598429 |
| NL     |  -8.02478  |  -1.32433   |   1.81275  |   3.42585  |
| NS     |   1.21413  |   2.54263   |   3.75946  |   0.982653 |
| NT     | -12.4518   |  -8.51865   |  -6.68323  |   2.16937  |
| NU     | -21.9351   | -16.5227    | -12.2332   |   3.0491   |
| ON     |  -7.48484  |   0.0319696 |   5.91872  |   3.64788  |
| PE     |   1.98683  |   1.98683   |   1.98683  | nan        |
| QC     |  -8.76805  |  -1.99269   |   1.62212  |   2.70668  |
| SK     |  -5.01656  |  -3.20898   |  -1.42885  |   1.03055  |
| YT     | -10.2177   |  -8.99542   |  -

#### Transformations

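A transformation on a group or a column returns an object that has the same index, and therefore the same size, as the object being grouped. The function passed to transform must therefore return a result of the same size as the group chunk it receives (or a single value, which is broadcast to the whole group).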

In [174]:
grouped = dataframe.groupby('Prov')
# standardize each value against the mean and standard deviation of its own province
standardize = lambda x: (x - x.mean()) / x.std()
grouped['Tmin'].transform(standardize).head()

0         NaN
1   -1.293133
2         NaN
3    0.626082
4    1.334280
Name: Tmin, dtype: float64

#### Filtration

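Filtration keeps or discards whole groups according to a given criterion and returns the corresponding subset of the data. The filter() method takes a function that is called on each group and must return True (keep the group) or False (drop it).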

In [175]:
dataframe.groupby('Prov').filter(lambda x: len(x) == 1)

Unnamed: 0.1,Unnamed: 0,Prov,Nom de station,stnid,année déb.,mois déb.,année fin.,mois fin.,lat (deg),long (deg),élév (m),stns jointes,Tmax,Tmax90p,Tmin,Tmin10p,DG0
52,52,N YT,HAINES JUNCTIO,2100630,1944,10,2017,12,60.75,-137.5,596,N,,,,,
274,274,PE,CHARLOTTETOWN,8300301,1872,11,2017,12,46.28,-63.13,49,Y,10.00633,23.768333,1.986827,-12.036667,1912.156667


In [176]:
from tabulate import tabulate
print(tabulate(dataframe.groupby('Prov').filter(lambda x: len(x) == 1), headers='keys', tablefmt='pipe'))

|     |   Unnamed: 0 | Prov   | Nom de station   |   stnid |   année déb. |   mois déb. |   année fin. |   mois fin. |   lat (deg) |   long (deg) |   élév (m) | stns jointes   |     Tmax |   Tmax90p |      Tmin |   Tmin10p |     DG0 |
|----:|-------------:|:-------|:-----------------|--------:|-------------:|------------:|-------------:|------------:|------------:|-------------:|-----------:|:---------------|---------:|----------:|----------:|----------:|--------:|
|  52 |           52 | N   YT | HAINES JUNCTIO   | 2100630 |         1944 |          10 |         2017 |          12 |       60.75 |      -137.5  |        596 | N              | nan      |  nan      | nan       |  nan      |  nan    |
| 274 |          274 | PE     | CHARLOTTETOWN    | 8300301 |         1872 |          11 |         2017 |          12 |       46.28 |       -63.13 |         49 | Y              |  10.0063 |   23.7683 |   1.98683 |  -12.0367 | 1912.16 |


With the filter condition above, we ask for the provinces that have only one station.

### 7.2.7  Save a DataFrame: 

To write a DataFrame to a file, use the <b>.to_csv()</b> method, which takes options similar to those of read_csv() seen previously.

In [177]:
dataframe.to_csv("./DATA/My_new_DataFrame.csv", index = False, header = True, sep = ',')
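
To check that what is written can be read back, here is a minimal round-trip sketch. It assumes a small made-up table and writes to an in-memory buffer (`io.StringIO`) instead of a file on disk, so it leaves nothing behind.

```python
import io
import pandas as pd

# Made-up table standing in for the real dataset.
df = pd.DataFrame({"Prov": ["AB", "BC"], "Tmin": [-3.05, 1.58]})

buffer = io.StringIO()                    # in-memory stand-in for a file
df.to_csv(buffer, index=False, header=True, sep=",")

buffer.seek(0)                            # rewind before reading back
restored = pd.read_csv(buffer, sep=",")
print(restored)
```

With index=False the row labels are not written, so read_csv() does not create an extra "Unnamed: 0" column like the one visible in the dataset used in this chapter.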

## 7.3 Date Functionality:

The <b>date_range()</b> function creates a series of dates from a start date, a number of periods and a frequency. By default, the frequency is one day.

In [178]:
pd.date_range('1/1/2011', periods=5)

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05'],
              dtype='datetime64[ns]', freq='D')

We can change the date frequency:

In [179]:
pd.date_range('1/1/2011', periods=5,freq='M')

DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30',
               '2011-05-31'],
              dtype='datetime64[ns]', freq='M')

<b>bdate_range()</b> stands for business date ranges. Unlike date_range(), it excludes Saturday and Sunday.

In [180]:
pd.bdate_range('1/1/2011', periods=10)

DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
               '2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14'],
              dtype='datetime64[ns]', freq='B')

Convenience functions like date_range and bdate_range utilize a variety of frequency aliases. The default frequency for date_range is a calendar day while the default for bdate_range is a business day.
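
A few other frequency aliases are sketched below. The choice of aliases is ours, not the notebook's; some aliases have been renamed in recent pandas releases, so check the documentation of your installed version if one of them raises a warning.

```python
import pandas as pd

# "W": weekly, anchored on Sundays (2011-01-01 is a Saturday).
weekly = pd.date_range("2011-01-01", periods=3, freq="W")

# "MS": month start, i.e. the first day of each month.
month_starts = pd.date_range("2011-01-01", periods=3, freq="MS")

# A multiple of a base frequency, between a start and an end date.
every_3_days = pd.date_range("2011-01-01", "2011-01-10", freq="3D")

print(weekly)
print(month_starts)
print(every_3_days)
```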

In [181]:
from datetime import datetime   # pd.datetime is deprecated and absent from recent pandas versions
start = datetime(2011, 1, 1)
end = datetime(2011, 1, 5)
pd.date_range(start, end)

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05'],
              dtype='datetime64[ns]', freq='D')

## 7.4 Format dates with the Datetime module:

Python provides many features to work with dates and time.

Datetime is a module that allows you to manipulate dates and times as objects. The idea is simple: you manipulate the object to do all your calculations, and when you need to display it, you format the object into a string.

https://docs.python.org/2/library/datetime.html

You can create a datetime object directly by passing the following parameters:

        datetime(year, month, day, hour, minute, second, microsecond, tzinfo)

The parameters "year", "month" and "day" are mandatory.

The datetime module provides the following classes:

<table border="1" class="docutils">
<colgroup>
<col width="27%">
<col width="57%">
</colgroup>
<tbody valign="top">
   <tr>
    <th>Class</th>
    <th>Description</th>
  </tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>datetime.date</b></span></tt></td>
<td>A date instance represents a date</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>datetime.datetime</b></span></tt></td>
<td>An instance of datetime represents a date and time according to the Gregorian calendar
</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>datetime.time</b></span></tt></td>
<td>An instance of time represents a time of day, independent of any particular date. </td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>datetime.timedelta</b></span></tt></td>
<td>The timedelta class represents a duration: the difference between two date, time or datetime objects.</td>
</tr>
<tr><td><tt class="docutils literal"><span class="pre"><b>datetime.tzinfo</b></span></tt></td>
<td>The tzinfo class is used to implement time zone support for time and datetime objects. </td>
</tr>
</tbody>
</table>
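
As a rough sketch of how the classes in the table relate to each other, the snippet below builds a date and a time, combines them into a datetime, and uses timedelta for the arithmetic. The dates are arbitrary.

```python
from datetime import date, datetime, time, timedelta

d = date(2019, 3, 1)            # a calendar date only
t = time(9, 35)                 # a time of day only
dt = datetime.combine(d, t)     # date + time -> datetime

# Subtracting two datetime objects gives a timedelta.
elapsed = dt - datetime(2019, 1, 1)

# Adding a timedelta to a datetime gives a new datetime.
one_week_later = dt + timedelta(weeks=1)

print(dt, elapsed, one_week_later, sep="\n")
```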



We will now see some examples of using the datetime module and its classes.


#### 1-  The <b> datetime </b> class of the datetime module

<b>a- Creating a datetime object</b>

In [182]:
from datetime import datetime
datetime(2019, 3, 1)       # instance of datetime

datetime.datetime(2019, 3, 1, 0, 0)

In [183]:
now = datetime.now()
now

datetime.datetime(2019, 12, 5, 13, 18, 44, 833365)

In [184]:
now = now.today()
now

datetime.datetime(2019, 12, 5, 13, 18, 44, 989374)

In [185]:
now = datetime.utcnow()
now

datetime.datetime(2019, 12, 5, 18, 18, 45, 344394)
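
The objects returned by `now()`, `today()` and `utcnow()` are "naive": they carry no time zone information. As a hedged sketch, the standard library's `timezone` class (a concrete `tzinfo`) can attach UTC or a fixed offset. The -5 hour offset below is only an assumed example value with no daylight-saving handling. Note also that `datetime.utcnow()` is deprecated since Python 3.12 in favor of `datetime.now(timezone.utc)`.

```python
from datetime import datetime, timedelta, timezone

# Aware current time in UTC.
now_utc = datetime.now(timezone.utc)

# Hypothetical fixed offset of -5 hours (illustration only, no DST rules).
fixed_offset = timezone(timedelta(hours=-5))
now_local = now_utc.astimezone(fixed_offset)

print(now_utc.tzinfo)          # UTC
print(now_local.utcoffset())   # -1 day, 19:00:00
print(now_utc == now_local)    # True: same instant, different wall-clock time
```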

When reading a csv or text file, the date and time of the measurements usually come as strings, such as "2018-11-01 15:20" or "2017/12/1 16:35:22".

These strings can be converted into datetime objects with strptime(), given a format that matches the layout of the string.

In [186]:
dt = datetime.strptime("2018/11/01 15:20", "%Y/%m/%d %H:%M")
dt

datetime.datetime(2018, 11, 1, 15, 20)

In [187]:
dt = datetime.strptime("2017/12/1 16:35:22", "%Y/%m/%d %H:%M:%S")
dt

datetime.datetime(2017, 12, 1, 16, 35, 22)

In [188]:
dt = datetime.strptime("01/11/19 10-35:22", "%d/%m/%y %H-%M:%S")
dt


datetime.datetime(2019, 11, 1, 10, 35, 22)

In [189]:
dt = datetime.strptime("1Mar 2019 à 09h35", "%d%b %Y à %Hh%M")
dt

datetime.datetime(2019, 3, 1, 9, 35)
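
When a file mixes several date layouts, `strptime` raises a `ValueError` for any string that does not match the given format. One possible approach, sketched here with a hypothetical helper `parse_date` and an assumed list of candidate formats, is to try each format in turn:

```python
from datetime import datetime

# Candidate formats (an assumption for this sketch; adapt them to your files).
FORMATS = ["%Y-%m-%d %H:%M", "%Y/%m/%d %H:%M", "%Y/%m/%d %H:%M:%S"]

def parse_date(text):
    """Return a datetime for the first matching format, or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue   # this format does not match, try the next one
    return None

print(parse_date("2018-11-01 15:20"))      # 2018-11-01 15:20:00
print(parse_date("2017/12/1 16:35:22 "))   # 2017-12-01 16:35:22
print(parse_date("not a date"))            # None
```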

<b>b- Manipulate datetime object</b>

From a datetime instance, you can retrieve each component of the date and time through its attributes.

In [190]:
#now.year
#now.month
#now.day
#now.hour
now.minute
#now.second
#now.microsecond
#now

18

The replace() method returns a copy of a datetime instance with some fields changed:

In [191]:
now.replace(year=1995) 

datetime.datetime(1995, 12, 5, 18, 18, 45, 344394)

In [192]:
now.replace(month=1)

datetime.datetime(2019, 1, 5, 18, 18, 45, 344394)
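
Note that `replace()` leaves the original object untouched, because datetime objects are immutable. A short sketch with a fixed date (chosen arbitrarily so the result is reproducible):

```python
from datetime import datetime

original = datetime(2019, 12, 5, 18, 18, 45)
changed = original.replace(year=1995, month=1)

print(original)   # 2019-12-05 18:18:45
print(changed)    # 1995-01-05 18:18:45

# An impossible date raises ValueError (2019 is not a leap year).
try:
    datetime(2020, 2, 29).replace(year=2019)
except ValueError as err:
    print("ValueError:", err)
```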

We can then convert a datetime instance to a string:

In [193]:
d = datetime.now(); print(d)
d

2019-12-05 13:18:48.514576


datetime.datetime(2019, 12, 5, 13, 18, 48, 514576)

In [194]:
d.strftime("%H:%M"), d.strftime("%Hh%Mmin")

('13:18', '13h18min')

In [195]:
d.strftime("%Y-%m %H:%M")

'2019-12 13:18'

In [196]:
'Today is {0:%d} {0:%B} and it is {0:%Hh%Mmin}'.format(d)

'Today is 05 December and it is 13h18min'
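
The same format codes also work inside f-strings, and `isoformat()` together with `fromisoformat()` (available from Python 3.7) gives a text form that converts back to the same object. A minimal sketch, using an arbitrary fixed date named `stamp`:

```python
from datetime import datetime

stamp = datetime(2019, 12, 5, 13, 18, 48)

# Format codes can be used directly inside an f-string.
label = f"{stamp:%Y-%m-%d} at {stamp:%Hh%M}"
print(label)    # 2019-12-05 at 13h18

# ISO 8601 text round-trips through isoformat() / fromisoformat().
text = stamp.isoformat()
print(text)     # 2019-12-05T13:18:48
print(datetime.fromisoformat(text) == stamp)   # True
```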

- <b>date</b> and <b>time</b> classes

Instances of these two classes can be combined to create a datetime instance.

In [197]:
from datetime import datetime, date, time
d = date(2005, 7, 14)
t = time(12, 30)
t

datetime.time(12, 30)

In [198]:
datetime.combine(d, t)

datetime.datetime(2005, 7, 14, 12, 30)

In [199]:
now = datetime.utcnow()
now.date()
now.time()


datetime.time(18, 18, 50, 389683)

- The <b> timedelta </b> class of the datetime module

In [200]:
from datetime import timedelta
delta = timedelta(days=3, seconds=100)    # we create our own timedelta

In [201]:
datetime.now()

datetime.datetime(2019, 12, 5, 13, 18, 50, 853709)

In [202]:
datetime.now() + delta

datetime.datetime(2019, 12, 8, 13, 20, 30, 996718)

In [203]:
datetime.now() + timedelta(days=2, hours=4, minutes=3, seconds=12)

datetime.datetime(2019, 12, 7, 17, 22, 3, 799764)

In [204]:
time_range = datetime(2010, 12, 31) - datetime(1981, 12, 31)
time_range

datetime.timedelta(days=10592)
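
A timedelta stores only days, seconds and microseconds internally; other units are obtained through `total_seconds()` or by dividing by another timedelta. A short sketch reusing the values from the cells above:

```python
from datetime import datetime, timedelta

span = datetime(2010, 12, 31) - datetime(1981, 12, 31)
print(span.days)              # 10592
print(span.total_seconds())   # 915148800.0

delta = timedelta(days=3, seconds=100)
print(delta.days, delta.seconds)   # 3 100
print(delta.total_seconds())       # 259300.0

# Dividing by another timedelta expresses the duration in that unit (a float).
hours = delta / timedelta(hours=1)
print(hours)
```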

- Example1: Calculate the year of birth from a given age

In [205]:
from datetime import datetime

age = 25            # age in years
birth_month = 10    # month of birth (1-12)

today = datetime.today()

# If the birthday month has not been reached yet this year, subtract one more year.
birth_year = today.year - age - (1 if birth_month > today.month else 0)
print(birth_year)

1994
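
The reverse calculation, an age from a birth date, follows the same idea. This is a sketch with a hypothetical helper `age_on` and arbitrarily chosen dates, comparing (month, day) pairs to decide whether the birthday has already occurred:

```python
from datetime import date

def age_on(birth, today):
    """Age in whole years on the day `today`."""
    # True counts as 1: subtract a year if the birthday is still to come.
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - before_birthday

print(age_on(date(1994, 10, 15), date(2019, 12, 5)))   # 25
print(age_on(date(1994, 12, 25), date(2019, 12, 5)))   # 24
```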


- Example2: Generate dates for a time series


We can generate dates for time series with an arbitrary time step:

In [206]:
from datetime import timedelta
dt = timedelta(days = 5, hours = 6, minutes = 25)
d0 = datetime(2000, 2, 21)
[str(d0 + i * dt) for i in range(10)]

['2000-02-21 00:00:00',
 '2000-02-26 06:25:00',
 '2000-03-02 12:50:00',
 '2000-03-07 19:15:00',
 '2000-03-13 01:40:00',
 '2000-03-18 08:05:00',
 '2000-03-23 14:30:00',
 '2000-03-28 20:55:00',
 '2000-04-03 03:20:00',
 '2000-04-08 09:45:00']
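
When the end date is known rather than the number of steps, a generator is one possible alternative. The helper `date_steps` below is hypothetical and only sketches the idea:

```python
from datetime import datetime, timedelta

def date_steps(start, stop, step):
    """Yield datetimes from start (inclusive) to stop (exclusive)."""
    current = start
    while current < stop:
        yield current
        current += step

six_hourly = list(date_steps(datetime(2000, 2, 21),
                             datetime(2000, 2, 22),
                             timedelta(hours=6)))
print([d.strftime("%H:%M") for d in six_hourly])
# ['00:00', '06:00', '12:00', '18:00']
```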