# 1.4 Skills: Pandas 🐼

In this notebook we will cover how to:
- work with the two main data types in `pandas`: `DataFrame` and `Series`
- work with data types in `pandas`, especially strings and dates
- load data from JSON and CSV into a `DataFrame`
- manipulate the columns of a `DataFrame`
- access data in a `DataFrame` by means of indexes and slicing

## `pandas`' data structures

### `Series`

In `pandas`, series are the building blocks of dataframes.

Think of a series as a column in a table. A series collects *observations* about a given *variable*. 

In [67]:
from random import random
import pandas as pd
import numpy as np

#### Numerical series

In [82]:
# let's create a series containing 100 random numbers
# ranging between 0 and 1

s = pd.Series([random() for n in range(0, 100)])

Each observation in the series has an **index** and a **value**; all indexes and all values can be accessed via the homonymous properties `index` and `values`:

In [83]:
s.index

RangeIndex(start=0, stop=100, step=1)

In [35]:
list(s.index)

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99]

In [93]:
s.values

array([0.9265142 , 0.33241964, 0.10981713, 0.33386138, 0.55944698,
       0.45851811, 0.9735605 , 0.61157124, 0.50429009, 0.44705075,
       0.5390561 , 0.4514962 , 0.29768582, 0.71224308, 0.94038729,
       0.01984962, 0.81643989, 0.80722717, 0.99664057, 0.28640714,
       0.99091849, 0.8572804 , 0.31129551, 0.41555793, 0.15392392,
       0.06682554, 0.65105341, 0.63313276, 0.87916164, 0.18716589,
       0.59066625, 0.85594405, 0.94284962, 0.85666283, 0.26379451,
       0.62626225, 0.43220826, 0.46681309, 0.72752949, 0.40233313,
       0.43346754, 0.28099661, 0.13224313, 0.04720902, 0.05949737,
       0.16871828, 0.14963836, 0.90665397, 0.23026855, 0.57016545,
       0.20005496, 0.36194639, 0.23643252, 0.79020001, 0.98790445,
       0.32872495, 0.44483239, 0.79144555, 0.17946678, 0.34617632,
       0.96825739, 0.32274289, 0.30380846, 0.02509437, 0.58754865,
       0.51548366, 0.23509294, 0.57791476, 0.0637667 , 0.23129279,
       0.02515153, 0.93475642, 0.83315453, 0.56381529, 0.83680
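The index does not have to be a range of integers. As a minimal sketch (the labels and values below are made up for illustration), a series can carry string labels, which can then be used to look up individual observations:

```python
import pandas as pd

# hypothetical observations of one variable, labelled by city
temperatures = pd.Series(
    [21.5, 18.0, 25.3],
    index=["Lausanne", "Amsterdam", "Venice"],
)

print(temperatures.index)      # the labels
print(temperatures.values)     # the underlying NumPy array
print(temperatures["Venice"])  # look up one observation by its label
```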

The `head()` and `tail()` methods allow you to look at the beginning and end of a series:

In [37]:
s.head()

0    0.056150
1    0.950886
2    0.549265
3    0.774797
4    0.866951
dtype: float64

In [38]:
s.tail()

95    0.884579
96    0.426309
97    0.746223
98    0.430681
99    0.373570
dtype: float64

The `value_counts()` method returns a count of distinct values within a series.

In [76]:
s.value_counts()

0.771811    1
0.114470    1
0.698512    1
0.603252    1
0.760992    1
           ..
0.951077    1
0.773665    1
0.401493    1
0.237970    1
0.527201    1
Name: count, Length: 100, dtype: int64

Is there any number in `s` that occurs twice?

In [75]:
# a `Series` can be easily cast into a list

list(s.value_counts()).count(1)

100

Another way of verifying this:

In [77]:
s.is_unique

True

In [78]:
s.min()

0.03505657992551481

In [79]:
s.max()

0.9927444641218586

In [80]:
s.mean()

0.498802863531209

In [81]:
s.median()

0.5035268320708364
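The statistics computed one by one above can also be obtained in a single call. The sketch below uses `describe()` on a freshly generated series (named `numbers` here only to avoid overwriting `s`):

```python
from random import random
import pandas as pd

numbers = pd.Series([random() for n in range(100)])

# count, mean, std, min, quartiles and max in a single call
summary = numbers.describe()
print(summary)

# the result is itself a series, indexed by the name of each statistic
print(summary["mean"], summary["50%"])  # "50%" is the median
```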

#### Datetime series

In [94]:
from random import randint

In [122]:
# let's generate a list of random dates,
# one for each year from 1900 to 1949

dates = [
    pd.Timestamp(
        year,
        randint(1, 12),
        randint(1, 28) # try replacing with 31 and see what happens
    )
    for year in range(1900,1950)
]

In [123]:
s1 = pd.Series(dates)

In [124]:
s1

0    1900-09-01
1    1901-10-08
2    1902-04-14
3    1903-01-03
4    1904-06-09
5    1905-05-22
6    1906-07-27
7    1907-09-18
8    1908-03-03
9    1909-12-01
10   1910-06-03
11   1911-12-01
12   1912-10-04
13   1913-11-16
14   1914-08-16
15   1915-01-18
16   1916-11-23
17   1917-04-21
18   1918-12-20
19   1919-09-12
20   1920-09-16
21   1921-11-27
22   1922-12-02
23   1923-08-02
24   1924-06-15
25   1925-02-16
26   1926-03-01
27   1927-02-20
28   1928-05-17
29   1929-12-06
30   1930-03-25
31   1931-02-06
32   1932-12-08
33   1933-11-22
34   1934-04-17
35   1935-05-03
36   1936-01-19
37   1937-10-21
38   1938-04-13
39   1939-05-14
40   1940-11-21
41   1941-07-15
42   1942-07-06
43   1943-12-11
44   1944-03-07
45   1945-09-22
46   1946-04-09
47   1947-04-24
48   1948-05-19
49   1949-07-08
dtype: datetime64[ns]

In [125]:
type(s1[1])

pandas._libs.tslibs.timestamps.Timestamp

In [127]:
s1[1].day_name()

'Tuesday'

In [119]:
s1.min()

datetime.date(1900, 5, 12)

In [120]:
s1.max()

datetime.date(1949, 5, 19)

In [128]:
s1.mean()

Timestamp('1925-01-09 12:28:48')
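Dates often arrive as strings rather than as `Timestamp` objects. As a hedged sketch (the input values are invented), `pd.to_datetime()` can convert a series of strings into a datetime series, and its `errors="coerce"` option replaces unparseable values with `NaT` instead of raising an exception:

```python
import pandas as pd

raw_dates = pd.Series(["1900-09-01", "1901-10-08", "not a date"])

# errors="coerce" turns unparseable values into NaT (the datetime
# equivalent of NaN) instead of raising an exception
parsed = pd.to_datetime(raw_dates, errors="coerce")

print(parsed.dtype)
print(parsed.isna().tolist())
```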

### `DataFrame`


What is a `pandas.DataFrame`? Think of it as an in-memory spreadsheet that you can analyse and manipulate programmatically.

A `DataFrame` is a collection of `Series` that have the same length and whose indexes are in sync. In other words, each column of a dataframe is a series.

Let's create a toy `DataFrame` by hand. 

In [129]:
dates = [
    pd.Timestamp(
        year,
        randint(1, 12),
        randint(1, 28) # try replacing with 31 and see what happens
    )
    for year in range(1980,1990)
]

In [130]:
dates

[Timestamp('1980-01-14 00:00:00'),
 Timestamp('1981-09-20 00:00:00'),
 Timestamp('1982-06-26 00:00:00'),
 Timestamp('1983-08-23 00:00:00'),
 Timestamp('1984-08-12 00:00:00'),
 Timestamp('1985-09-02 00:00:00'),
 Timestamp('1986-03-13 00:00:00'),
 Timestamp('1987-01-03 00:00:00'),
 Timestamp('1988-07-08 00:00:00'),
 Timestamp('1989-05-04 00:00:00')]

In [131]:
counts = [
    randint(0, 10000)
    for i in range(0, 10)
]

In [132]:
event_types = ["fire", "flood", "car_crash", "plane_crash"]
events = [
    np.random.choice(event_types)
    for i in range(0, 10)
]

In [133]:
assert len(events) == len(counts) == len(dates)

In [136]:
toy_df = pd.DataFrame({
    "date": dates,
    "count": counts,
    "event": events
})

In [137]:
toy_df

Unnamed: 0,date,count,event
0,1980-01-14,1410,plane_crash
1,1981-09-20,7401,flood
2,1982-06-26,6483,flood
3,1983-08-23,4216,fire
4,1984-08-12,4700,plane_crash
5,1985-09-02,8833,flood
6,1986-03-13,9351,plane_crash
7,1987-01-03,7003,fire
8,1988-07-08,2845,plane_crash
9,1989-05-04,2583,fire


**Try out**: what happens if you change the length of one of the three lists? Try e.g. passing 20 dates instead of 10.

In [152]:
# instead of a dictionary of lists, you can pass
# a dictionary of `pandas.Series`. The result is the same.

toy_df = pd.DataFrame(
    {
        "date": pd.Series(dates),
        "count": pd.Series(counts),
        "event": pd.Series(events)
    }
)

In [140]:
toy_df

Unnamed: 0,date,count,event
0,1980-01-14,1410,plane_crash
1,1981-09-20,7401,flood
2,1982-06-26,6483,flood
3,1983-08-23,4216,fire
4,1984-08-12,4700,plane_crash
5,1985-09-02,8833,flood
6,1986-03-13,9351,plane_crash
7,1987-01-03,7003,fire
8,1988-07-08,2845,plane_crash
9,1989-05-04,2583,fire


In [141]:
toy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    10 non-null     datetime64[ns]
 1   count   10 non-null     int64         
 2   event   10 non-null     object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 368.0+ bytes


In [142]:
# a df is a collection of series
# each column is a series

type(toy_df.date)

pandas.core.series.Series
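To illustrate what "indexes in sync" means, here is a small sketch with two made-up series whose indexes only partly overlap. When they are combined into a dataframe, `pandas` aligns them on their labels and fills the gaps with missing values:

```python
import pandas as pd

# two hypothetical series whose indexes only partly overlap
counts = pd.Series([10, 20, 30], index=["a", "b", "c"])
events = pd.Series(["fire", "flood"], index=["b", "c"])

aligned_df = pd.DataFrame({"count": counts, "event": events})
print(aligned_df)

# row "a" has no matching label in `events`, so its value is missing
print(aligned_df["event"].isna().tolist())
```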

## Data manipulation in `pandas`

### Data types

The data types covered here are strings, datetimes (see above) and categorical data.

In `pandas`, categories behave very much like strings, yet they lead to better performance (faster operations, optimized storage).
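The storage benefit can be checked with `memory_usage()`. The following is a rough sketch on an invented column with many rows but only two distinct values; the exact numbers depend on your `pandas` version and platform:

```python
import pandas as pd

# a hypothetical column with many rows but only two distinct values
many_events = pd.Series(["fire", "flood"] * 5000)
as_category = many_events.astype("category")

# deep=True also counts the memory taken by the strings themselves
print(many_events.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
```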

Bottom-up approach:

In [144]:
# transforms a Series with strings into categories
# similar to R factors

toy_df.event.astype('category')

0    plane_crash
1          flood
2          flood
3           fire
4    plane_crash
5          flood
6    plane_crash
7           fire
8    plane_crash
9           fire
Name: event, dtype: category
Categories (3, object): ['fire', 'flood', 'plane_crash']

Top-down approach:

In [154]:
# here the list of categories is defined beforehand

from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(
    categories=["flood", "fire", "car_crash", "earth_quake", "plane_crash"],
    ordered=True
)

toy_df.event = toy_df.event.astype(cat_type)

In [155]:
toy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    10 non-null     datetime64[ns]
 1   count   10 non-null     int64         
 2   event   10 non-null     category      
dtypes: category(1), datetime64[ns](1), int64(1)
memory usage: 510.0 bytes


**Question**: what happens if you remove e.g. "plane_crash" from the list `categories`? Can you explain why?

##### How are categories represented?

In [156]:
toy_df.event.cat.codes

0    4
1    0
2    0
3    1
4    4
5    0
6    4
7    1
8    4
9    1
dtype: int8

In [157]:
toy_df.event.cat.categories

Index(['flood', 'fire', 'car_crash', 'earth_quake', 'plane_crash'], dtype='object')
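As the codes above suggest, each value is stored as the position of its category in the list of categories. A minimal sketch with made-up severity levels shows this, along with what `ordered=True` makes possible:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# hypothetical severity levels, listed from lowest to highest
severity_type = CategoricalDtype(
    categories=["low", "medium", "high"],
    ordered=True,
)
severity = pd.Series(["low", "high", "medium"]).astype(severity_type)

# each value is stored as the position of its category in the list
print(severity.cat.codes.tolist())

# because the categories are ordered, comparisons follow that order
print(severity.max())
print((severity > "low").tolist())
```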

In [158]:
toy_df.event = toy_df.event.cat.rename_categories({"plane_crash": "airplane_crash"})

In [159]:
toy_df.head()

Unnamed: 0,date,count,event
0,1980-01-14,1410,airplane_crash
1,1981-09-20,7401,flood
2,1982-06-26,6483,flood
3,1983-08-23,4216,fire
4,1984-08-12,4700,airplane_crash


In [160]:
toy_df.event = toy_df.event.cat.rename_categories({"airplane_crash": "plane_crash"})

In [161]:
toy_df.head()

Unnamed: 0,date,count,event
0,1980-01-14,1410,plane_crash
1,1981-09-20,7401,flood
2,1982-06-26,6483,flood
3,1983-08-23,4216,fire
4,1984-08-12,4700,plane_crash


### Accessor properties

For certain data types (string, datetime), `pandas` provides a number of common methods that can be called on any series containing values of that type. These methods are grouped under a property of the series, called an *accessor*, which is named after the data type:

- the `.dt` accessor contains methods to operate on `datetime` series
- the `.str` accessor contains methods to operate on `str` (string) series.

As you will see in a moment, these methods are very convenient when filtering rows of a dataset based on the value of a certain column.

#### `datetime` accessor

To work with datetime series, `pandas` provides a number of useful methods and properties: they are available through the `.dt` property of a datetime series.

They can be used to:
- convert from one timezone to another
- get the day/day name/month/year information from each date
- and much more (see the [documentation]())

In [163]:
s1.head()

0   1900-09-01
1   1901-10-08
2   1902-04-14
3   1903-01-03
4   1904-06-09
dtype: datetime64[ns]

In [164]:
s1.dt.day_of_week.head()

0    5
1    1
2    0
3    5
4    3
dtype: int32
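A few more `.dt` members, sketched on the first three dates of `s1` (re-created here as `sample_dates` so that the example runs on its own):

```python
import pandas as pd

sample_dates = pd.Series(
    pd.to_datetime(["1900-09-01", "1901-10-08", "1902-04-14"])
)

print(sample_dates.dt.year.tolist())
print(sample_dates.dt.day_name().tolist())

# the accessor's output can serve as a filter: keep only the dates
# falling in the last quarter of their year
last_quarter = sample_dates[sample_dates.dt.month >= 10]
print(last_quarter)
```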

#### `str` accessor

In [165]:
s = pd.Series(["One", "TWO", "tHrEE"])

Accessor methods can be used to filter a series by checking whether each value satisfies a certain condition, as is the case with `contains()`. Such methods return a boolean value (`True` or `False`) for each element of the series.

In [166]:
s.str.contains('o')

0    False
1    False
2    False
dtype: bool

In [167]:
s.str.contains('O')

0     True
1     True
2    False
dtype: bool

Other methods can be used, instead, to manipulate an entire series, e.g. `lower()` and `upper()`.

In [168]:
s.str.lower()

0      one
1      two
2    three
dtype: object
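Putting the two ideas together, the boolean output of `contains()` can be used to keep only the matching values. This sketch also uses the `case` parameter of `contains()` to ignore letter case (the series is re-created as `words` so that the example is self-contained):

```python
import pandas as pd

words = pd.Series(["One", "TWO", "tHrEE"])

# case=False makes the match case-insensitive
mask = words.str.contains("o", case=False)
print(mask.tolist())

# a boolean series can be used to keep only the matching values
print(words[mask].tolist())
```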

### Exploring a dataframe

The main methods for exploring a dataframe are `df.head()`, `df.tail()` and `df.info()`.

The method `info()` gives you information about a dataframe:
- how much space does it take in memory?
- what is the datatype of each column?
- how many records are there?
- how many `null` values does each column contain (!)?

In [169]:
toy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    10 non-null     datetime64[ns]
 1   count   10 non-null     int64         
 2   event   10 non-null     category      
dtypes: category(1), datetime64[ns](1), int64(1)
memory usage: 510.0 bytes


Alternatively, if you need to know only the number of columns and rows you can use the `.shape` property.

It returns a tuple with 1) number of rows, 2) number of columns.

In [170]:
toy_df.shape

(10, 3)
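The per-column null counts that `info()` reports can also be computed directly. A small sketch on an invented dataframe with a few gaps:

```python
import pandas as pd
import numpy as np

# hypothetical dataframe with a few missing values
gappy_df = pd.DataFrame({
    "count": [1410, np.nan, 6483],
    "event": ["fire", None, None],
})

# isna() marks missing cells with True; summing counts them per column
null_counts = gappy_df.isna().sum()
print(null_counts)

# same numbers as gappy_df.shape
print(len(gappy_df), len(gappy_df.columns))
```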

By default, `head()` prints the first five rows of a dataframe:

In [171]:
toy_df.head()

Unnamed: 0,date,count,event
0,1980-01-14,1410,plane_crash
1,1981-09-20,7401,flood
2,1982-06-26,6483,flood
3,1983-08-23,4216,fire
4,1984-08-12,4700,plane_crash


But the number of lines displayed is a parameter that can be changed:

In [172]:
toy_df.head(2)

Unnamed: 0,date,count,event
0,1980-01-14,1410,plane_crash
1,1981-09-20,7401,flood


`tail()` does the opposite, i.e. prints the last n rows in the dataframe:

In [173]:
toy_df.tail()

Unnamed: 0,date,count,event
5,1985-09-02,8833,flood
6,1986-03-13,9351,plane_crash
7,1987-01-03,7003,fire
8,1988-07-08,2845,plane_crash
9,1989-05-04,2583,fire


### Loading data

Dataframes can be created from scratch as we did above, but most often they are created by loading existing data by means of `pandas`' input/output methods.

#### From JSON

Loading data from a JSON file is very similar to creating a `DataFrame` from a `dict`.

This is how one would do it in pure Python:

In [174]:
import json
json_file_path = '../data/bl_books/sample/book_data_sample.json'

# JSON data gets read into a dictionary

with open(json_file_path, 'r') as jsonfile:
    json_data = json.load(jsonfile)
    
books_df = pd.DataFrame(json_data)

Since reading from files is a very common operation in any data analysis workflow, `pandas` provides methods to read from a variety of formats (JSON, CSV, clipboard, etc.)

The block of code above can be replaced by the following one-liner:

In [175]:
books_df = pd.read_json(json_file_path)


In [178]:
books_df.head(2)

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1841,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
1,1888,[British Library HMNTS 9025.cc.14.],Rivingtons,[A History of Greece. Part I. From the earlies...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Abbott, Evelyn']}",1888,{'1': 'lsidyv376da437'},4047,{},sample/full_texts/000004047_01_text.json,{'0': {'000257': ['11104648374']}}


In [179]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   datefield                  452 non-null    object
 1   shelfmarks                 452 non-null    object
 2   publisher                  452 non-null    object
 3   title                      452 non-null    object
 4   edition                    452 non-null    object
 5   flickr_url_to_book_images  452 non-null    object
 6   place                      452 non-null    object
 7   issuance                   452 non-null    object
 8   authors                    452 non-null    object
 9   date                       452 non-null    object
 10  pdf                        452 non-null    object
 11  identifier                 452 non-null    object
 12  corporate                  452 non-null    object
 13  fulltext_filename          452 non-null    object
 14  imgs      

**NB**: note the number of missing values in the `books_df.imgs` column (n=172).
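To experiment with `read_json()` without the sample file at hand, one option is to read from an in-memory text buffer. The records below are invented for illustration (the titles are shortened versions of those shown above, and the `year` key is hypothetical):

```python
from io import StringIO
import pandas as pd

# a made-up JSON document: a list of records with the same keys
json_text = """
[
    {"title": "A History of Greece", "place": "London", "year": 1888},
    {"title": "The Poetical Aviary", "place": "Calcutta", "year": 1841}
]
"""

# read_json() accepts a path or a file-like object
sample_df = pd.read_json(StringIO(json_text))
print(sample_df)
```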

#### From CSV

Similarly to `pandas.read_json()`, `pandas.read_csv()` is there to make your life easier when it comes to loading CSV data into a dataframe (and that happens very often!).

Let's import one of the CSV files from the "Venice Apprenticeship" dataset (`../data/apprenticeship_venice/`).

In [180]:
csv_file_path = '../data/apprenticeship_venice/professions_data.csv'

In [181]:
garzoni_df = pd.read_csv(csv_file_path)

ParserError: Error tokenizing data. C error: Expected 7 fields in line 36, saw 8


Why did it not work?

Let's have a look at the file first...

In [182]:
!head -n 2 ../data/apprenticeship_venice/professions_data.csv

page_title;register;annual_salary;a_profession;profession_code_strict;profession_code_gen;profession_cat;corporation;keep_profession_a;complete_profession_a;enrolmentY;enrolmentM;startY;startM;length;has_fled;m_profession;m_profession_code_strict;m_profession_code_gen;m_profession_cat;m_corporation;keep_profession_m;complete_profession_m;m_gender;m_name;m_surname;m_patronimic;m_atelier;m_coords;a_name;a_age;a_gender;a_geo_origins;a_geo_origins_std;a_coords;a_quondam;accommodation_master;personal_care_master;clothes_master;generic_expenses_master;salary_in_kind_master;pledge_goods_master;pledge_money_master;salary_master;female_guarantor;period_cat;incremental_salary
Carlo Della sosta (Orese) 1592-08-03;asv, giustizia vecchia, accordi dei garzoni, 114, 155;NA;orese;orese;orefice;orefice;Oresi;1;1;1592;08;1592;08;3;0;orese;orese;orefice;orefice;Oresi;1;1;1;Zuan Battista;Amigoni;;;0, 0;Carlo Della sosta;17;1;;;0, 0;1;0;1;1;1;0;0;0;0;0;NA;0


Rather than comma-separated values, these look like semicolon-separated values...

In [183]:
# the `sep` input parameter
# allows us to specify which character/symbol is used
# to separate column values

garzoni_df = pd.read_csv(
    csv_file_path,
    sep=';'
)

**NB**: There may be invalid lines in the data you are reading in. `read_csv()` puts you in full control of that: by setting the parameter `on_bad_lines` to `"warn"` we tell `pandas` to warn us about bad lines and skip them (see [docs](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)). If no warning is raised, then your CSV is well-formed.

In [190]:
garzoni_df = pd.read_csv(
    csv_file_path,
    sep=';',
    on_bad_lines="warn",
)
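The effect of `sep` and `on_bad_lines` can be tried out on a tiny, made-up dataset read from an in-memory buffer. In this sketch the third line has one field too many, and `on_bad_lines="skip"` drops it without a warning:

```python
from io import StringIO
import pandas as pd

# made-up data: the third line has one field too many
csv_text = "name;age\nCarlo;17\nAntonio;15;unexpected\nZuane;14\n"

clean_df = pd.read_csv(
    StringIO(csv_text),
    sep=";",
    on_bad_lines="skip",  # silently drop malformed lines
)
print(clean_df)
```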

<div class="alert alert-info">
    <b>More format readers</b>
    <p></p>
    Pandas supports more formats than just CSV and JSON. See the library's <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html">documentation</a> for the full list of supported formats.
</div>

### Working with columns

#### Exploring values

In [192]:
garzoni_df.head(5)

Unnamed: 0,page_title,register,annual_salary,a_profession,profession_code_strict,profession_code_gen,profession_cat,corporation,keep_profession_a,complete_profession_a,...,personal_care_master,clothes_master,generic_expenses_master,salary_in_kind_master,pledge_goods_master,pledge_money_master,salary_master,female_guarantor,period_cat,incremental_salary
0,Carlo Della sosta (Orese) 1592-08-03,"asv, giustizia vecchia, accordi dei garzoni, 1...",,orese,orese,orefice,orefice,Oresi,1,1,...,1,1,1,0,0,0,0,0,,0
1,Antonio quondam Andrea (squerariol) 1583-01-09,"asv, giustizia vecchia, accordi dei garzoni, 1...",12.5,squerariol,squerariol,lavori allo squero,lavori allo squero,Squerarioli,1,1,...,0,0,1,0,0,0,1,0,1.0,0
2,Cristofollo di Zuane (batioro in carta) 1591-0...,"asv, giustizia vecchia, accordi dei garzoni, 1...",,batioro,batioro,battioro,fabbricatore di foglie/fili/cordelle d'oro o a...,Battioro,1,1,...,0,0,0,0,0,0,0,0,,0
3,Illeggibile (marzer) 1584-06-21,"asv, giustizia vecchia, accordi dei garzoni, 1...",,marzer,marzer,marzer,merciaio,Merzeri,1,1,...,0,0,0,0,0,0,0,0,,0
4,Domenico Morebetti (spechier) 1664-09-13,"asv, giustizia vecchia, accordi dei garzoni, 1...",7.0,marzer,marzer,marzer,merciaio,Merzeri,1,1,...,0,0,1,0,0,0,1,0,1.0,0


In [193]:
garzoni_df.a_profession.value_counts()

a_profession
spechier                       979
orese                          542
marzer                         506
marangon                       483
tagiapiera                     338
                              ... 
arte del saltar                  1
tentor da fustagni et tella      1
dalla dalla malvasia             1
biavariol , salumier             1
vender camisolle e calze         1
Name: count, Length: 826, dtype: int64

In [194]:
garzoni_df.annual_salary.value_counts()

annual_salary
4.000000     824
5.000000     750
3.000000     597
6.000000     576
2.000000     536
            ... 
1.071429       1
8.727273       1
18.792453      1
7.500000       1
6.769231       1
Name: count, Length: 434, dtype: int64

Mind that `value_counts()` automatically disregards `NaN`s, unless you explicitly opt to keep them with `dropna=False`.

In [195]:
garzoni_df.annual_salary.value_counts(dropna=False)

annual_salary
NaN         1783
4.000000     824
5.000000     750
3.000000     597
6.000000     576
            ... 
5.894737       1
4.695652       1
2.307692       1
8.250000       1
6.769231       1
Name: count, Length: 435, dtype: int64
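When proportions are more informative than raw counts, `value_counts()` also accepts a `normalize` parameter. A short sketch on invented salaries:

```python
import pandas as pd
import numpy as np

# hypothetical salaries, one of which is missing
salaries = pd.Series([4.0, 5.0, 4.0, np.nan])

# relative frequencies, computed over the non-missing values only
shares = salaries.value_counts(normalize=True)
print(shares)
```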

In [196]:
garzoni_df.shape

(9653, 47)

#### Missing values

The series' methods `isna()` and `notna()` can be used as a way of filtering rows containing missing values (`NaN`).

In [197]:
garzoni_df[garzoni_df.annual_salary.isna()].shape

(1783, 47)

In [198]:
garzoni_df[garzoni_df.annual_salary.notna()].shape

(7870, 47)

The `DataFrame` method `dropna()` removes rows containing missing values, either in any of the columns or in a selection of columns (via the `subset` parameter).

In [199]:
garzoni_df.dropna().shape

(46, 47)

In [200]:
garzoni_df.dropna(subset=['annual_salary']).shape

(7870, 47)
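The difference between the two calls is easier to see on a small example. The dataframe below is made up; it only borrows two column names from the dataset:

```python
import pandas as pd
import numpy as np

# hypothetical dataframe with gaps in both columns
small_df = pd.DataFrame({
    "annual_salary": [12.5, np.nan, 7.0, 4.0],
    "a_age": [17, 15, np.nan, 14],
})

no_gaps = small_df.dropna()                           # no NaN in any column
with_salary = small_df.dropna(subset=["annual_salary"])  # no NaN in that column

print(no_gaps.shape)
print(with_salary.shape)
print(len(small_df))  # the original dataframe is unchanged
```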

#### Casting

We call *casting* the operation of changing the data type of one or more variables.

In [201]:
# we define a string with value "10"
number_str = "10"

In [202]:
# we change its type from string (`str`)
# to integer (`int`). This is called casting

number_int = int(number_str)

In [203]:
# the types of the two variables are indeed different

type(number_str) == type(number_int)

False

`pandas` objects like `Series` and `DataFrame` provide the method `astype()` to apply casting on their contents.

In [204]:
garzoni_df.head(3)

Unnamed: 0,page_title,register,annual_salary,a_profession,profession_code_strict,profession_code_gen,profession_cat,corporation,keep_profession_a,complete_profession_a,...,personal_care_master,clothes_master,generic_expenses_master,salary_in_kind_master,pledge_goods_master,pledge_money_master,salary_master,female_guarantor,period_cat,incremental_salary
0,Carlo Della sosta (Orese) 1592-08-03,"asv, giustizia vecchia, accordi dei garzoni, 1...",,orese,orese,orefice,orefice,Oresi,1,1,...,1,1,1,0,0,0,0,0,,0
1,Antonio quondam Andrea (squerariol) 1583-01-09,"asv, giustizia vecchia, accordi dei garzoni, 1...",12.5,squerariol,squerariol,lavori allo squero,lavori allo squero,Squerarioli,1,1,...,0,0,1,0,0,0,1,0,1.0,0
2,Cristofollo di Zuane (batioro in carta) 1591-0...,"asv, giustizia vecchia, accordi dei garzoni, 1...",,batioro,batioro,battioro,fabbricatore di foglie/fili/cordelle d'oro o a...,Battioro,1,1,...,0,0,0,0,0,0,0,0,,0


To cast the type of the `profession_cat` column, we can directly use the `astype()` method of the series:

In [205]:
professions = garzoni_df.profession_cat.astype('category')

In [206]:
professions.cat.categories

Index([' . rilegatore di libri', 'acquaroli', 'acquavite',
       'acquavite . arrotino', 'acquavite . venditore di crusca', 'archibugi',
       'archibugi . ', 'arginatura canali', 'arrotino', 'ballerino',
       ...
       'venditori di profumi . pellicciaio', 'venditori di tele',
       'venditori di tele . cotone .  . fabbricatori di fustagni . merciaio . materassaio . rigattiere',
       'venditori di tele . fabbricatori di laccioli . merciaio',
       'venditori di tele . materassaio', 'venditori di tele . merciaio',
       'venditori di tele . merciaio . cotone .  . fabbricatori di fustagni',
       'venditori di tele . merciaio . fabbricatori di laccioli', 'vetraio',
       'vetraio . trasportatori di sabbia'],
      dtype='object', length=360)

Another way of doing this while operating on the dataframe is to use the dataframe's `astype()`:

In [207]:
from pandas.api.types import CategoricalDtype

In [208]:
profession_cat_type = CategoricalDtype(
    categories=garzoni_df.profession_cat[garzoni_df.profession_cat.notnull()].unique()
)

In [210]:
garzoni_df = garzoni_df.astype(
    {
        "profession_cat": profession_cat_type
    }
)

In [213]:
garzoni_df.profession_cat.value_counts()

profession_cat
specchiaio                                                                                        1033
falegname                                                                                          748
orefice                                                                                            640
merciaio                                                                                           613
fabbricatore di foglie/fili/cordelle d'oro o argento                                               380
                                                                                                  ... 
venditori di tele . cotone .  . fabbricatori di fustagni . merciaio . materassaio . rigattiere       1
venditori di frutta . fabbricazione corone                                                           1
calafato                                                                                             1
fabbricatore di sapone . venditore di generi in salamoia  
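Both ways of casting can be compared side by side on a toy example. This is only a sketch with invented data, where numbers are stored as strings, as often happens after loading a file:

```python
import pandas as pd

# numbers stored as strings, as often happens after loading a file
raw_df = pd.DataFrame({
    "year": ["1841", "1888"],
    "place": ["Calcutta", "London"],
})

# cast a single column through the series' astype()
years = raw_df["year"].astype(int)
print(years.sum())

# cast selected columns through the dataframe's astype()
typed_df = raw_df.astype({"year": int, "place": "category"})
print(typed_df.dtypes)
```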

#### Adding columns

Let's go back to our toy dataframe:

In [214]:
toy_df.head()

Unnamed: 0,date,count,event
0,1980-01-14,1410,plane_crash
1,1981-09-20,7401,flood
2,1982-06-26,6483,flood
3,1983-08-23,4216,fire
4,1984-08-12,4700,plane_crash


Using the column selector with the name of a column that does not exist yet adds that column to the dataframe and sets its value to the one specified in all rows.

In [215]:
toy_df['country'] = "UK"

In [225]:
toy_df.head(3)

Unnamed: 0,date,count,event,country
0,1980-01-14,1410,plane_crash,UK
1,1981-09-20,7401,flood,UK
2,1982-06-26,6483,flood,UK


But if the column already exists, its values are overwritten:

In [221]:
toy_df['country'] = "USA"

In [218]:
toy_df.head(3)

Unnamed: 0,date,count,event,country
0,1980-01-14,1410,plane_crash,USA
1,1981-09-20,7401,flood,USA
2,1982-06-26,6483,flood,USA
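A new column does not have to hold a constant: it can also be computed from existing columns. A minimal sketch on a two-row copy of the toy data (the names `year` and `count_thousands` are made up for the example):

```python
import pandas as pd

mini_df = pd.DataFrame({
    "date": pd.to_datetime(["1980-01-14", "1981-09-20"]),
    "count": [1410, 7401],
})

# a scalar is repeated on every row...
mini_df["country"] = "UK"
# ...while a series is assigned row by row
mini_df["year"] = mini_df["date"].dt.year
mini_df["count_thousands"] = mini_df["count"] / 1000

print(mini_df)
```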


#### Removing columns

The double square bracket notation ``[[...]]`` returns a dataframe having only the columns specified inside the inner brackets.

Given this, one way of removing a column is to select all the columns except the one to remove:

In [222]:
# here we removed the column country 

toy_df2 = toy_df[['date', 'count', 'event']]

In [223]:
# it worked!

toy_df2.head()

Unnamed: 0,date,count,event
0,1980-01-14,1410,plane_crash
1,1981-09-20,7401,flood
2,1982-06-26,6483,flood
3,1983-08-23,4216,fire
4,1984-08-12,4700,plane_crash


Alternatively, use the `drop()` method:

In [227]:
toy_df.drop(columns=['country'])

Unnamed: 0,date,count,event
0,1980-01-14,1410,plane_crash
1,1981-09-20,7401,flood
2,1982-06-26,6483,flood
3,1983-08-23,4216,fire
4,1984-08-12,4700,plane_crash
5,1985-09-02,8833,flood
6,1986-03-13,9351,plane_crash
7,1987-01-03,7003,fire
8,1988-07-08,2845,plane_crash
9,1989-05-04,2583,fire
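Note that `drop()` does not modify the dataframe it is called on. The following sketch, on a small invented dataframe, keeps the result in a new variable and compares it with the original:

```python
import pandas as pd

full_df = pd.DataFrame({
    "count": [1410, 7401],
    "event": ["plane_crash", "flood"],
    "country": ["USA", "USA"],
})

# drop() returns a new dataframe...
trimmed_df = full_df.drop(columns=["country"])
print(list(trimmed_df.columns))

# ...and leaves the original one untouched
print(list(full_df.columns))
```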


#### Setting a column as index

In [228]:
toy_df.set_index('date')

Unnamed: 0_level_0,count,event,country
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1980-01-14,1410,plane_crash,USA
1981-09-20,7401,flood,USA
1982-06-26,6483,flood,USA
1983-08-23,4216,fire,USA
1984-08-12,4700,plane_crash,USA
1985-09-02,8833,flood,USA
1986-03-13,9351,plane_crash,USA
1987-01-03,7003,fire,USA
1988-07-08,2845,plane_crash,USA
1989-05-04,2583,fire,USA


In [229]:
toy_df.head(3)

Unnamed: 0,date,count,event,country
0,1980-01-14,1410,plane_crash,USA
1,1981-09-20,7401,flood,USA
2,1982-06-26,6483,flood,USA


In [230]:
toy_df.set_index('date', inplace=True)

In [231]:
toy_df.head(3)

Unnamed: 0_level_0,count,event,country
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1980-01-14,1410,plane_crash,USA
1981-09-20,7401,flood,USA
1982-06-26,6483,flood,USA


**Q**: can you explain the effect of the `inplace` parameter by looking at the cells above?

### Accessing data

Rows can be accessed by label with `.loc`, by integer position with `.iloc`, with slicing, or by iterating over rows.

In [232]:
toy_df.head(3)

Unnamed: 0_level_0,count,event,country
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1980-01-14,1410,plane_crash,USA
1981-09-20,7401,flood,USA
1982-06-26,6483,flood,USA


#### Label-based indexing

In [234]:
toy_df.loc['1980':'1981']

Unnamed: 0_level_0,count,event,country
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1980-01-14,1410,plane_crash,USA
1981-09-20,7401,flood,USA
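`.loc` can select rows and columns at the same time, and it also accepts a boolean series as row selector. A sketch on a three-row copy of the toy data, re-created here so that the example runs on its own:

```python
import pandas as pd

dated_df = pd.DataFrame(
    {"count": [1410, 7401, 6483], "event": ["plane_crash", "flood", "flood"]},
    index=pd.to_datetime(["1980-01-14", "1981-09-20", "1982-06-26"]),
)

# rows by label (here: partial date strings), columns by name
early_counts = dated_df.loc["1980":"1981", "count"].tolist()
print(early_counts)

# a boolean series also works as a row selector
big_events = dated_df.loc[dated_df["count"] > 5000, "event"].tolist()
print(big_events)
```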


#### Integer-based indexing

In [235]:
# select a single row, the first one

toy_df.iloc[0]

count             1410
event      plane_crash
country            USA
Name: 1980-01-14 00:00:00, dtype: object

In [236]:
# select several rows by passing a list of positions

toy_df.iloc[[1,3,-1]]

Unnamed: 0_level_0,count,event,country
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1981-09-20,7401,flood,USA
1983-08-23,4216,fire,USA
1989-05-04,2583,fire,USA


In [237]:
# select a range of rows with slicing

toy_df.iloc[0:5]

Unnamed: 0_level_0,count,event,country
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1980-01-14,1410,plane_crash,USA
1981-09-20,7401,flood,USA
1982-06-26,6483,flood,USA
1983-08-23,4216,fire,USA
1984-08-12,4700,plane_crash,USA


In [238]:
toy_df.index

DatetimeIndex(['1980-01-14', '1981-09-20', '1982-06-26', '1983-08-23',
               '1984-08-12', '1985-09-02', '1986-03-13', '1987-01-03',
               '1988-07-08', '1989-05-04'],
              dtype='datetime64[ns]', name='date', freq=None)
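One difference between the two kinds of indexing is easy to overlook: position-based slices exclude their end point, whereas label-based slices include it. A short sketch with made-up string labels:

```python
import pandas as pd

labelled_df = pd.DataFrame(
    {"count": [1410, 7401, 6483, 4216]},
    index=["a", "b", "c", "d"],  # made-up labels
)

# position-based slices exclude the end position, as in Python lists
by_position = labelled_df.iloc[0:2].index.tolist()
print(by_position)

# label-based slices include the end label
by_label = labelled_df.loc["a":"c"].index.tolist()
print(by_label)

# rows and columns can be selected by position at the same time
last_count = labelled_df.iloc[-1, 0]
print(last_count)
```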

#### Iterating over rows

In [240]:
for index, row in toy_df.iterrows():
    print(index)

1980-01-14 00:00:00
1981-09-20 00:00:00
1982-06-26 00:00:00
1983-08-23 00:00:00
1984-08-12 00:00:00
1985-09-02 00:00:00
1986-03-13 00:00:00
1987-01-03 00:00:00
1988-07-08 00:00:00
1989-05-04 00:00:00


In [241]:
for index, row in toy_df.iterrows():
    print(index, row.event)

1980-01-14 00:00:00 plane_crash
1981-09-20 00:00:00 flood
1982-06-26 00:00:00 flood
1983-08-23 00:00:00 fire
1984-08-12 00:00:00 plane_crash
1985-09-02 00:00:00 flood
1986-03-13 00:00:00 plane_crash
1987-01-03 00:00:00 fire
1988-07-08 00:00:00 plane_crash
1989-05-04 00:00:00 fire
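`iterrows()` is not the only way to loop over rows. As an alternative sketch, `itertuples()` yields each row as a namedtuple, which is usually faster. The column name `total` is hypothetical and stands in for `count`, which would clash with a method that every tuple already has:

```python
import pandas as pd

events_df = pd.DataFrame(
    {"event": ["plane_crash", "flood"], "total": [1410, 7401]},
    index=pd.to_datetime(["1980-01-14", "1981-09-20"]),
)

# each row comes back as a namedtuple; the index is stored in `Index`
lines = [f"{row.Index.year}: {row.event}" for row in events_df.itertuples()]
print(lines)

# many row-wise computations need no loop at all
grand_total = events_df["total"].sum()
print(grand_total)
```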


In [245]:
toy_df

Unnamed: 0_level_0,count,event,country
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1980-01-14,1410,plane_crash,USA
1981-09-20,7401,flood,USA
1982-06-26,6483,flood,USA
1983-08-23,4216,fire,USA
1984-08-12,4700,plane_crash,USA
1985-09-02,8833,flood,USA
1986-03-13,9351,plane_crash,USA
1987-01-03,7003,fire,USA
1988-07-08,2845,plane_crash,USA
1989-05-04,2583,fire,USA


## ⏰ ✏️ Time to practice  

**Dataset**

For this exercise we will be working with one of the datasets published by the [*Shakespeare and Company project*](https://shakespeareandco.princeton.edu/), the *books dataset*, which can be downloaded from the following address: https://dataspace.princeton.edu/bitstream/88435/dsp01jm214s28p/2/SCoData_books_v1.2_2022-01.csv (file size = 1.34 MB)

**Steps**

Perform the following steps on the dataset:
- load it into a `pandas` dataframe
- how many records does it contain?
- keep only the following columns: `uri`, `format` and `borrow_count`
- remove all rows where `format` value is `NaN`
- how many records does it contain now?

**Try to answer the following questions**

- What's the format(s) of the **most borrowed** document(s)? How many times was it/were they borrowed?
- What's the format(s) of the **least borrowed** document(s)? How many times was it/were they borrowed?
