# Working with Pandas

This tutorial demonstrates a few of the many features Pandas offers for working with tabular data. We are only scratching the surface here.

To use the `pandas` library, we need to import it. Because we will call the library multiple times, we *alias* it during the import.

In [1]:
import pandas as pd

## Loading data from a file

Because reading data from (and writing data to) files is a very common task, Pandas tries to make this easy for you. It accepts [various file formats](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), like CSV (comma-separated values), TSV (tab-separated values), Microsoft Excel (.xls/.xlsx), OpenOffice/LibreOffice spreadsheets and more.

We will work with the text-based CSV format, so we use the `pd.read_csv()` function. This function tries to infer what it needs to read the file, such as the header row and the column data types, but there may be times when you need to help it by setting parameters explicitly.

Let's start by looking at the first few lines of the raw file.

In [31]:
auctions_file = '/Users/companjenba/Downloads/1572968493008-Lijsten_van_de_leverantie/CSV Veilingen_VOC_Zeeland 27-03-2017.csv'

# Print the first seven lines of the raw file: the header plus six data rows.
# 'utf-8-sig' strips the byte order mark (BOM) at the start of this file.
with open(auctions_file, encoding='utf-8-sig') as f:
    for line_number, line in enumerate(f):
        print(line, end='')  # each line already ends with a newline
        if line_number > 5:
            break

Archief ;toegang;inv. ;wiens verkoping ;jaar ;maand;dag ;soort product ;specifiek product ;meeteenheid;hoeveelheid;Achternaam;Voornaam;munteenheid;aantal (ponden/ guldens);aantal groten of stuivers
NL-HaNA;1.04.02;13377;VOC kamer Zeeland;1725;4;18;Peper;Bruijne;ponden;770459;Boursse;H. van;Vlaamse pond;45399;
NL-HaNA;1.04.02;13377;VOC kamer Zeeland;1725;4;18;Peper;Bruijne;ponden;35375;Huijen;Joost van ;Vlaamse pond;2063;12
NL-HaNA;1.04.02;13377;VOC kamer Zeeland;1725;4;18;Peper;Bruijne;ponden;25856;Pookx;Hendrik;Vlaamse pond;1523;5
NL-HaNA;1.04.02;13377;VOC kamer Zeeland;1725;4;18;Peper;Bruijne;ponden;282314;Ribaut;Casparus;Vlaamse pond;16625;6
NL-HaNA;1.04.02;13377;VOC kamer Zeeland;1725;4;18;Peper;Bruijne;ponden;47622;Six;Pieter;Vlaamse pond;2805;11
NL-HaNA;1.04.02;13377;VOC kamer Zeeland;1725;4;18;Peper;Bruijne;ponden;25788;Beukelaar;Jan ;Vlaamse pond;1519;4

So the delimiter turns out to be a semicolon, not a comma. `pd.read_csv()` assumes a comma unless told otherwise, and because a few values in this file contain commas as well, reading it with the defaults would not give a usable table. We therefore pass the delimiter explicitly. There is a header row, the first data row has a missing value (and it's not the only one!), and the columns hold different data types: strings and numbers.

The result of `pd.read_csv(<filename>)` is an object called a *DataFrame*. For now it's enough to know that a DataFrame is like a table, with columns and rows.

In [3]:
auctions_df = pd.read_csv(auctions_file, delimiter=';')
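
To see what the explicit delimiter changes, here is a minimal sketch on a small in-memory sample. The sample rows are made up to resemble the file's layout; they are not read from the auction file itself.

```python
import io

import pandas as pd

# Made-up sample in the same semicolon-separated layout (an assumption for
# illustration, not data taken from the real file).
sample = "jaar;hoeveelheid;Achternaam\n1725;770459;Boursse\n1725;35375;Huijen\n"

# With the default comma delimiter every line is treated as one single field.
wrong = pd.read_csv(io.StringIO(sample))

# With an explicit semicolon the three columns are split as intended.
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

`sep` and `delimiter` are interchangeable names for the same `pd.read_csv()` parameter, so the call above behaves like the `delimiter=';'` call used in this tutorial.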

## Inspecting the DataFrame

The `head()` method of a DataFrame shows the first rows, five by default.

The `dtypes` property of a DataFrame shows the data types of the columns in the DataFrame.

In [4]:
auctions_df.head()

| | Archief | toegang | inv. | wiens verkoping | jaar | maand | dag | soort product | specifiek product | meeteenheid | hoeveelheid | Achternaam | Voornaam | munteenheid | aantal (ponden/ guldens) | aantal groten of stuivers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NL-HaNA | 1.04.02 | 13377 | VOC kamer Zeeland | 1725 | 4 | 18 | Peper | Bruijne | ponden | 770459 | Boursse | H. van | Vlaamse pond | 45399.0 | NaN |
| 1 | NL-HaNA | 1.04.02 | 13377 | VOC kamer Zeeland | 1725 | 4 | 18 | Peper | Bruijne | ponden | 35375 | Huijen | Joost van | Vlaamse pond | 2063.0 | 12.0 |
| 2 | NL-HaNA | 1.04.02 | 13377 | VOC kamer Zeeland | 1725 | 4 | 18 | Peper | Bruijne | ponden | 25856 | Pookx | Hendrik | Vlaamse pond | 1523.0 | 5.0 |
| 3 | NL-HaNA | 1.04.02 | 13377 | VOC kamer Zeeland | 1725 | 4 | 18 | Peper | Bruijne | ponden | 282314 | Ribaut | Casparus | Vlaamse pond | 16625.0 | 6.0 |
| 4 | NL-HaNA | 1.04.02 | 13377 | VOC kamer Zeeland | 1725 | 4 | 18 | Peper | Bruijne | ponden | 47622 | Six | Pieter | Vlaamse pond | 2805.0 | 11.0 |


In [15]:
auctions_df.dtypes

Archief                       object
toegang                       object
inv.                           int64
wiens verkoping               object
jaar                           int64
maand                          int64
dag                            int64
soort product                 object
specifiek product             object
meeteenheid                   object
hoeveelheid                   object
Achternaam                    object
Voornaam                      object
munteenheid                   object
aantal (ponden/ guldens)     float64
aantal groten of stuivers     object
dtype: object
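
The `hoeveelheid` column is reported as `object` even though its values look like numbers. A single value written with a decimal comma is enough to cause this. The sketch below illustrates the effect on a made-up two-row column, and shows the `decimal` parameter of `pd.read_csv()` as one possible way to handle it at load time. Whether that parameter alone would suffice for the real file is an assumption that has not been checked here.

```python
import io

import pandas as pd

# Made-up samples: one with plain integers, one with a decimal comma.
clean_text = "hoeveelheid\n770459\n35375\n"
mixed_text = "hoeveelheid\n770459\n14307,5\n"

clean = pd.read_csv(io.StringIO(clean_text), sep=';')
mixed = pd.read_csv(io.StringIO(mixed_text), sep=';')

# Telling the parser that ',' is the decimal mark lets it read floats directly.
fixed = pd.read_csv(io.StringIO(mixed_text), sep=';', decimal=',')

print(clean['hoeveelheid'].dtype)  # an integer dtype
print(mixed['hoeveelheid'].dtype)  # a text dtype, because '14307,5' is not a number
print(fixed['hoeveelheid'].dtype)  # a float dtype
```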

In [18]:
auctions_df['hoeveelheid'].value_counts()

1          463
2          218
3          129
100        123
200        112
400         83
4           81
50          78
10000       75
80          60
4000        60
300         57
5           55
500         51
20000       49
5000        49
40          45
160         41
60          40
6           40
600         39
180         31
120         30
10          28
8           28
7           28
150         26
1000        25
700         23
800         22
          ... 
8197         1
23388        1
1708         1
120143       1
71390        1
20043        1
545          1
10346        1
4200         1
276530       1
4811         1
4991         1
7701         1
18641        1
14307,5      1
18578        1
21374        1
291956       1
143900       1
1863         1
25860        1
47086        1
98930        1
168798       1
178203       1
209092       1
3086         1
3857         1
2994         1
771206       1
Name: hoeveelheid, Length: 6571, dtype: int64

In [28]:
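# Replace the decimal comma with a point (literal match, not a regex), then convert;
# values that still cannot be parsed become NaN.
auctions_df['hoeveelheid_float'] = pd.to_numeric(auctions_df['hoeveelheid'].str.replace(',', '.', regex=False), errors='coerce')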

In [29]:
auctions_df['hoeveelheid_float'].isna().any()

True
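
A result of `True` does not say whether the missing values were already absent in the file or were produced by `errors='coerce'`. The following sketch, on a made-up Series, shows one way to tell the two apart. The value `'onbekend'` is a hypothetical placeholder, not something taken from the auction data.

```python
import pandas as pd

# Made-up values: a plain integer, a decimal comma, unparseable text, a missing value.
raw = pd.Series(['770459', '14307,5', 'onbekend', None])

as_float = pd.to_numeric(raw.str.replace(',', '.', regex=False), errors='coerce')

# Entries that had a value originally but could not be converted to a number.
failed = raw[as_float.isna() & raw.notna()]

print(as_float.tolist())
print(failed.tolist())
```

Inspecting `failed` before dropping or filling the `NaN` values makes it easier to decide whether the unparseable entries deserve a manual correction.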

In [30]:
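# Correlations are only defined for numeric columns, so select those first.
auctions_df.select_dtypes('number').corr()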

| | inv. | jaar | maand | dag | aantal (ponden/ guldens) | hoeveelheid_float |
|---|---|---|---|---|---|---|
| inv. | NaN | NaN | NaN | NaN | NaN | NaN |
| jaar | NaN | 1.0 | 0.168298 | -0.092449 | 0.031164 | -0.01202 |
| maand | NaN | 0.168298 | 1.0 | -0.142187 | -0.090642 | -0.074787 |
| dag | NaN | -0.092449 | -0.142187 | 1.0 | -0.02324 | -0.004669 |
| aantal (ponden/ guldens) | NaN | 0.031164 | -0.090642 | -0.02324 | 1.0 | 0.604201 |
| hoeveelheid_float | NaN | -0.01202 | -0.074787 | -0.004669 | 0.604201 | 1.0 |


In [6]:
auctions_df['meeteenheid'].value_counts()

ponden           10170
pesen             3227
coopen             434
leggers            239
aamen              170
lasten              72
halve leggers       54
kelders             40
bottels             20
halve aam            7
stroopen             4
ons                  3
ankers               2
flessen              2
kisten               2
zakken               1
ons                  1
vaten                1
fust                 1
Name: meeteenheid, dtype: int64

In [7]:
auctions_df['aantal groten of stuivers'].value_counts()

6      765
16     765
10     744
7      743
13     740
18     737
5      736
8      731
12     728
3      724
11     716
4      713
17     712
1      708
14     700
15     699
19     687
9      681
2      668
-        2
29       2
34       1
132      1
64       1
162      1
181      1
18\      1
26       1
24       1
Name: aantal groten of stuivers, dtype: int64

## Aggregations

In [8]:
transactions_by_buyer = auctions_df.groupby(['Achternaam','Voornaam'])
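
`groupby()` treats two keys as the same group only if they match exactly. The raw lines shown earlier include first names with a trailing space (for example `Joost van `), so one buyer could end up in more than one group. Here is a minimal sketch of the effect on made-up rows; the `bedrag` column name and all amounts are hypothetical.

```python
import pandas as pd

# Made-up rows: the same two buyers, written with and without stray whitespace.
toy = pd.DataFrame({
    'Achternaam': ['Huijen', 'Huijen', ' Six', 'Six'],
    'Voornaam': ['Joost van ', 'Joost van', 'Pieter', 'Pieter'],
    'bedrag': [10.0, 5.0, 2.0, 3.0],
})

# Grouping on the raw names yields four groups instead of two.
raw_totals = toy.groupby(['Achternaam', 'Voornaam'])['bedrag'].sum()

# Stripping leading and trailing whitespace first merges the duplicates.
cleaned = toy.assign(
    Achternaam=toy['Achternaam'].str.strip(),
    Voornaam=toy['Voornaam'].str.strip(),
)
clean_totals = cleaned.groupby(['Achternaam', 'Voornaam'])['bedrag'].sum()

print(len(raw_totals), len(clean_totals))
print(clean_totals)
```

Whether stripping whitespace is appropriate for the real data depends on how the names were recorded, so it is worth checking a few affected rows by hand first.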

In [12]:
transactions_by_buyer['aantal (ponden/ guldens)'].sum()

Achternaam        Voornaam          
 Hurgronje        Cornelia Machalina      126.0
 Sage             Wed. Benjamin le        402.0
 Tongeren         Willem Hendrik van     1831.0
Aantrekker        Daniel den             1058.0
Aartsen           Jacobus               10864.0
Abrahams          David                    50.0
                  Marcus                  113.0
Abrahams en Zoon  Levij                   196.0
Ackermans         Pieter                51998.0
Ackervelt         Charles               13861.0
Adriaansen        Adriaan                 418.0
                  Pieter                  589.0
Akeren            Adriaan van             655.0
                  Adriaan van            3846.0
AlSagoin          Martijn                 274.0
Alexander         Isaak                    40.0
Alffes            Joost Joan             5343.0
Alix              Jurriaan                241.0
Allen             Edward                   61.0
Allewaart         Hubertus                  2.0
Alv