Let's load our data and do a little cleanup. We have one dirty record (the one filtered out below by its author value, 'Winter') and some empty records.

In [1]:
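import pandas as pd

DATA = '/kaggle/input/penguin-little-clothbound-classics/penguin-little-clothbound-classics.csv'
df = pd.read_csv(filepath_or_buffer=DATA)
# Concatenate a Series for 'The Star-Child'; to_frame() stores it as a single column labelled 0
df = pd.concat(axis='index', objs=[df, pd.Series(data={'title': 'The Star-Child', 'author': 'Oscar Wilde', 'season': 'Winter', 'year': 2022}).to_frame()])
# Drop the dirty record, whose author field holds 'Winter'
df = df[df['author'] != 'Winter']
# Drop the column labelled 0 that the concat introduced
df = df.drop(columns=[0])
# Keep only rows that have a year, then store the year as an integer
df = df[~df['year'].isna()]
df['year'] = df['year'].astype(int)
df.head()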

Unnamed: 0,title,author,season,year
0,Metamorphosis,Franz Kafka,Autumn,2022
1,Cosmicomics,Italo Calvino,Autumn,2022
2,Lolly Willows,Sylvia Townsend Warner,Autumn,2022
3,The Queen of Spades,Alexander Pushkin,Autumn,2022
4,Lady Susan,Jane Austen,Autumn,2022
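
The cell above builds the extra record with `Series.to_frame()`. As a minimal sketch on toy data (the one-row `toy` frame and the names `as_column` and `as_row` are assumptions for illustration, not the real CSV), `to_frame()` turns a Series into a single column labelled `0` with the keys as its index, whereas `to_frame().T` yields one row with the keys as columns. If the real data behaves like this sketch, that would explain why a column `0` has to be dropped afterwards, and it may be worth checking whether the new record survives the null-year filter.

```python
import pandas as pd

# Toy stand-in for the real dataset (assumption: same four columns)
toy = pd.DataFrame({'title': ['Metamorphosis'], 'author': ['Franz Kafka'],
                    'season': ['Autumn'], 'year': [2022]})
row = pd.Series(data={'title': 'The Star-Child', 'author': 'Oscar Wilde',
                      'season': 'Winter', 'year': 2022})

# Without transposing: the Series becomes one column named 0, so the concat
# adds four mostly empty rows and one extra column
as_column = pd.concat(objs=[toy, row.to_frame()], axis='index')

# With transposing: the Series becomes one row aligned with the existing columns
as_row = pd.concat(objs=[toy, row.to_frame().T], axis='index', ignore_index=True)

print(as_column.shape)
print(as_row.shape)
```

Note that the transposed row has object dtype, so the `year` column would still need the `astype(int)` conversion afterwards.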


How many volumes do we have?

In [2]:
len(df)

47

We don't have a lot of data. Let's see what we can learn.

We might expect some authors to have multiple books in this collection. Let's see if any do.

In [3]:
df['author'].value_counts().max()

1

We do not: no author appears more than once. How about multiple books with the same title?

In [4]:
df['title'].value_counts().max()

1
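
The two uniqueness checks above can also be written more directly. This is a small sketch on toy data (the three-row `toy` frame is an assumption, not the real dataset) using `Series.is_unique` for a yes/no answer and `Series.duplicated(keep=False)` to list any offending rows.

```python
import pandas as pd

# Toy stand-in for the real dataset (assumption: three distinct authors)
toy = pd.DataFrame({'author': ['Franz Kafka', 'Italo Calvino', 'Jane Austen'],
                    'title': ['Metamorphosis', 'Cosmicomics', 'Lady Susan']})

# True when no value in the column repeats
authors_unique = toy['author'].is_unique

# Same information as in the notebook: the largest count is 1 when all values are unique
max_count = toy['author'].value_counts().max()

# Rows whose author appears more than once (empty here)
repeated = toy[toy['author'].duplicated(keep=False)]

print(authors_unique, max_count, len(repeated))
```

The `duplicated` form is handy when the answer is not 1, because it shows which rows repeat rather than only how often.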

No, every title is unique as well. How are our volumes distributed across seasons?

In [5]:
df['season'].value_counts().to_frame().T

season,Autumn,Spring,Summer,Winter
count,12,12,12,11


Our volumes are distributed almost equally across seasons; Winter is one short. How are they distributed across years?

In [6]:
df['year'].value_counts().to_frame().T

year,2023,2022
count,24,23


How are the volumes distributed across seasons and years?

In [7]:
df[['season', 'year']].value_counts().to_frame().reset_index().sort_values(by='year').T

Unnamed: 0,0,3,1,2
season,Autumn,Winter,Spring,Summer
year,2022,2022,2023,2023
count,12,11,12,12
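
A season-by-year breakdown like the one above can also be laid out as a two-way table. The sketch below uses `pd.crosstab` on a toy frame rebuilt from the counts reported above (the `counts` dict and `toy` frame are stand-ins, not the real CSV), so empty season and year combinations show up explicitly as zeros.

```python
import pandas as pd

# Counts taken from the output above; the frame itself is a reconstruction
counts = {('Autumn', 2022): 12, ('Winter', 2022): 11,
          ('Spring', 2023): 12, ('Summer', 2023): 12}
toy = pd.DataFrame(
    [{'season': season, 'year': year}
     for (season, year), n in counts.items()
     for _ in range(n)]
)

# Rows are seasons, columns are years, cells are volume counts
table = pd.crosstab(index=toy['season'], columns=toy['year'])
print(table)
```

In this layout it is easy to see that each season falls entirely within one year.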
