## Library Data Analysis

The data analysis in this notebook aims to provide answers and solutions to a business problem libraries face: books that are checked out and returned late.

The analysis explores connections between the entities involved in this transaction (customers, books, and libraries) by looking into their relationships.
This part of the process is called EDA (Exploratory Data Analysis). If successful, it yields the features that best describe the problem, and these features will help us build a predictive model.

The main idea is that the model will help libraries improve their operations by anticipating late returns.

This notebook:
1. Loads the data
2. (Potentially) Adds columns to the existing data to provide more insight
3. Detects and visualizes the features
4. Explains how these features can help mitigate the risks libraries are facing

In [99]:
from datetime import datetime, timezone
from pathlib import Path

import hvplot.polars
import polars as pl
import seaborn as sns

In [100]:
RETURN_LIMIT = 28

### Paths and tables

Define the paths and load the tables, showing a sample row from each.

In [101]:
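# In a notebook, __name__ is "__main__", so Path(__name__) resolved correctly only by accident.
# Derive the project root from the current working directory instead (same resulting path).
ROOT_PATH = Path.cwd().parent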
DATA_PATH = ROOT_PATH / "data"

TABLE_PATH = DATA_PATH / "storage" / "silver"

The available datasets are small, so we can load them into memory with the `read_parquet` function. For larger data, we would use a `LazyFrame` instead, since lazy evaluation lets Polars optimize the operations further down the road.

Data is loaded from the "silver" layer, that is, the layer that holds clean data ready for analysis.

### Books table

In [127]:
# books

books_path = TABLE_PATH / "books.parquet"
books_df = pl.read_parquet(source=books_path)

books_df.head(1)

id,title,authors,publisher,published_date,categories,price,pages,published_year
str,str,list[str],str,str,list[str],f32,i32,i32
"""oObQOTJbpuoC""","""Web Advertising""","[""Anja Janoschka""]","""John Benjamins Publishing""","""2004-01-01""","[""Language Arts & Disciplines""]",330.0,635,2004


### Customers table

In [103]:
# customers

customers_path = TABLE_PATH / "customers.parquet"
customers_df = pl.read_parquet(source=customers_path)

customers_df.head(1)

id,name,street_address,city,state,zipcode,birth_date,gender,education,occupation
str,str,str,str,str,i32,date,str,str,str
"""65df33b14c09cad73919962bdc31d9…","""Matthew Hand""","""2406 Ne 19th Ave""","""Portland""","""Oregon""",97212,1962-10-31,"""male""","""High School""","""Education & Health"""


In [105]:
print(f"Number of customers: {len(customers_df)}")

Number of customers: 2000


### Libraries table

In [106]:
# libraries

libraries_path = TABLE_PATH / "libraries.parquet"
libraries_df = pl.read_parquet(source=libraries_path)

libraries_df.head(1)

id,name,street_address,city,region,postal_code
str,str,str,str,str,str
"""22d-222@5xc-kcy-8sq""","""Multnomah County Library Sellw…","""7860 Se 13th Ave""","""Portland""","""OR""","""97202"""


### Checkouts table

In [107]:
# checkouts

checkouts_path = TABLE_PATH / "checkouts.parquet"
checkouts_df = pl.read_parquet(source=checkouts_path)

checkouts_df.head(1)

id,patron_id,library_id,date_checkout,date_returned
str,str,str,date,date
"""7T9-BAAAQBAJ""","""b1b2bd87045d7a18ede00c39c77ca0…","""228-222@5xc-jtz-hwk""",2018-08-14,2119-04-15


### Taking a look at the tables

Looking at the tables, some additional columns could help with the analysis and are potential features.
For example, the `customers` table has the field `birth_date`. Creating age categories (Young, Adult, Middle-Age, and Old) can give additional insight into who rents books.

Also, the `checkouts` table has the columns `date_checkout` and `date_returned`, which together are used to derive our label.

In [108]:
customers_df = customers_df.with_columns(
    ((datetime.now(timezone.utc).date() - pl.col("birth_date")).dt.total_days() / 365)
    .floor()
    .cast(pl.Int32)
    .alias("age")
)

In [109]:
customers_df = customers_df.with_columns(
    pl.when((3 <= pl.col("age")) & (pl.col("age") <= 19))
    .then(pl.lit("Young"))
    .when((20 <= pl.col("age")) & (pl.col("age") <= 39))
    .then(pl.lit("Adult"))
    .when((40 <= pl.col("age")) & (pl.col("age") <= 59))
    .then(pl.lit("Middle-Age"))
    .when((60 <= pl.col("age")) & (pl.col("age") <= 99))
    .then(pl.lit("Old"))
    .otherwise(pl.lit("Undefined"))
    .alias("age_category")
)

In [110]:
customers_df.head(3)["age", "age_category"]

age,age_category
i32,str
62,"""Old"""
-82,"""Undefined"""
46,"""Middle-Age"""
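
The negative age in the sample output comes from a birth date that lies in the future. A minimal plain-Python sketch of the same idea is shown below: it counts completed years instead of dividing days by 365 and treats future or missing birth dates as unknown. The helper names `age_in_years` and `age_category` and the fixed reference date are hypothetical, chosen only to keep the example reproducible; the bins are the ones used in the notebook.

```python
from datetime import date


def age_in_years(birth_date, today):
    """Completed years between birth_date and today, or None if unusable."""
    if birth_date is None or birth_date > today:
        return None  # missing or future birth date: treat as unknown
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def age_category(age):
    """Same bins as the notebook's when/then chain."""
    if age is None:
        return "Undefined"
    if 3 <= age <= 19:
        return "Young"
    if 20 <= age <= 39:
        return "Adult"
    if 40 <= age <= 59:
        return "Middle-Age"
    if 60 <= age <= 99:
        return "Old"
    return "Undefined"


today = date(2024, 10, 30)  # hypothetical fixed reference date
almost_62 = age_in_years(date(1962, 10, 31), today)  # birthday is tomorrow
future_birth = age_in_years(date(2106, 1, 1), today)  # invalid record

print(almost_62, age_category(almost_62))
print(future_birth, age_category(future_birth))
```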


### Label calculation

Labels can be calculated by subtracting `date_checkout` from `date_returned`. The limit for returning a book is 28 days.

The return date must not be earlier than the checkout date. If the return date is `None`, the customer hasn't returned the book yet.

In [128]:
checkouts_df = checkouts_df.with_columns(
    pl.col("date_returned")
    .sub(pl.col("date_checkout"))
    .dt.total_days()
    .cast(pl.Int32)
    .alias("days_holding_book")
)

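# Label: 1 = returned within RETURN_LIMIT days, 0 = returned late or not returned yet,
# -1 = invalid record (return date earlier than checkout date).
checkouts_df = checkouts_df.with_columns(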
    pl.when(
        (pl.col("days_holding_book") <= RETURN_LIMIT)
        & (pl.col("days_holding_book") >= 0)
    )
    .then(pl.lit(1))
    .when(pl.col("days_holding_book") > RETURN_LIMIT)
    .then(pl.lit(0))
    .when(pl.col("date_returned").is_null())
    .then(pl.lit(0))
    .when(pl.col("days_holding_book") < 0)
    .then(pl.lit(-1))
    .alias("book_returned")
)

Negative values in the `days_holding_book` column mean that the return date precedes the checkout date; the label `-1` marks these rows as invalid.

In [112]:
checkouts_df.head(2)

id,patron_id,library_id,date_checkout,date_returned,days_holding_book,book_returned
str,str,str,date,date,i32,i32
"""7T9-BAAAQBAJ""","""b1b2bd87045d7a18ede00c39c77ca0…","""228-222@5xc-jtz-hwk""",2018-08-14,2119-04-15,36768,0
"""ygoFAAAAQAAJ""","""8ba1250ea887f492d296b8a64bcd75…","""224-222@5xc-jw2-t9z""",2119-10-18,2018-02-18,-37131,-1


### Entity analysis
Analysis of the customer and book entities.
TODO: Add more description

In [113]:
gender_df = customers_df.group_by(pl.col("gender")).len()
gender_df = gender_df.drop_nulls().rename(
    {"gender": "Gender", "len": "Number of customers"}
)
gender_df.hvplot.bar(
    x="Gender", y="Number of customers", rot=90, title="Library customers gender"
)

In [114]:
education_df = customers_df.group_by(pl.col("education")).len()
education_df = education_df.drop_nulls().rename(
    {"education": "Education", "len": "Number of customers"}
)
education_df.hvplot.bar(
    x="Education",
    y="Number of customers",
    rot=90,
    color="teal",
    title="Library customer education",
)

In [115]:
occupation_df = customers_df.group_by(pl.col("occupation")).len()
occupation_df = occupation_df.drop_nulls().rename(
    {"occupation": "Occupation", "len": "Number of customers"}
)
occupation_df.hvplot.bar(
    x="Occupation",
    y="Number of customers",
    rot=90,
    color="orange",
    title="Library customer occupation",
)

In [116]:
age_df = customers_df.group_by(pl.col("age_category")).len()
age_df = age_df.drop_nulls().rename(
    {"age_category": "Age Category", "len": "Number of customers"}
)
age_df = age_df.filter(pl.col("Age Category") != "Undefined")
age_df.hvplot.bar(
    x="Age Category",
    y="Number of customers",
    rot=90,
    color="purple",
    title="Library customer age group",
)

By analyzing the categorical columns of the customers table, we can conclude that:
1) Male and female customers are almost evenly represented. 10% of the gender values are `null` or unknown.
2) Education seems to be evenly distributed among customers.
3) The same applies to occupation.
4) Adults dominate the age groups, while young customers make up around 10%.

It would be interesting to see which of these groups return books late, or whether there is any connection at all.

In [129]:
books_df.hvplot.hist(
    y="pages", bins=50, bin_range=(-50, 1100), title="Book pages distribution"
)

In [118]:
books_df.hvplot.hist(
    y="price", bins=50, bin_range=(-150, 800), title="Book prices distribution"
)

In [119]:
book_categories_df = books_df.group_by(pl.col("categories")).len()
book_categories_df = book_categories_df.drop_nulls()
book_categories_df.hvplot.bar(title="Book categories distribution")

There are three dominant book categories, while the rest have fewer than 10 books each.

In [130]:
books_df.hvplot.hist(
    y="published_year", bins=30, title="Book publishing year distribution"
)

The number of books published in a given year might not be relevant, but we can check what is rented the most (taking the publishing year into account) and connect it to the customer data (age group, education, etc.).

In [121]:
book_returned_df = checkouts_df.group_by("book_returned").len()
book_returned_df = book_returned_df.drop_nulls().rename(
    {"book_returned": "Book returned", "len": "Number of customers"}
)
book_returned_df.hvplot.bar(
    x="Book returned",
    y="Number of customers",
    rot=90,
    stacked=True,
    color="purple",
    title="Customers returning/not returning books",
)

By analyzing the data about books, we see that:
1. The number of book categories and their distribution across all books is not relevant; it doesn't tell us much. A couple of categories are dominant.
2. The publishing years show a shortage of books published between 1940 and 1960. This information is not relevant to the problem we are trying to solve.
3. Both prices and page counts follow a normal distribution.
4. 20% of customers do not return rented books on time.
5. 10% of the rented-books data is invalid.

Adding the publisher column to the analysis doesn't accomplish anything; this column is not valuable as a potential feature.
The `Customers returning/not returning books` bar plot already shows an imbalance between positive and negative labels, which will make training a predictive model harder.
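
One common way to compensate for such an imbalance is to weight each class inversely to its frequency during training. The sketch below computes such weights in plain Python; the label list is a made-up 70/20/10 split that mirrors the proportions mentioned above, and dropping invalid rows before weighting is an assumption, not a step the notebook performs.

```python
from collections import Counter

# Hypothetical labels: 70% on time (1), 20% late (0), 10% invalid (-1).
labels = [1] * 14 + [0] * 4 + [-1] * 2

# Invalid rows carry no usable label, so they are left out here.
valid = [y for y in labels if y != -1]
counts = Counter(valid)
n_classes = len(counts)

# Inverse-frequency weighting: n_samples / (n_classes * class_count).
class_weights = {
    label: len(valid) / (n_classes * count) for label, count in counts.items()
}

print(class_weights)
```

With these weights the minority class (late returns) counts more per sample, so both classes contribute equally in total.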

### Connecting data together

Joining the customers and books data through checkouts will show which types of customers return late and which types of books are returned late.
The selected columns should be tested for correlation with the label.
After the correlation check, columns above the threshold can be used to train the predictive model.

In [131]:
customer_checkout_df = customers_df.rename({"id": "patron_id"}).join(
    checkouts_df, on="patron_id"
)
library_df = customer_checkout_df.join(books_df, on="id", how="left")

print(f"Number of data points: {len(library_df)}")

Number of data points: 2000


The number of data points matches the number of customers.

### Columns and the resulting table for analysis

In [136]:
library_df.head(3)

patron_id,name,street_address,city,state,zipcode,birth_date,gender,education,occupation,age,age_category,id,library_id,date_checkout,date_returned,days_holding_book,book_returned,title,authors,publisher,published_date,categories,price,pages,published_year
str,str,str,str,str,i32,date,str,str,str,i32,str,str,str,date,date,i32,i32,str,list[str],str,str,list[str],f32,i32,i32
"""b1b2bd87045d7a18ede00c39c77ca0…","""Jimmy Helton""","""651 Se 9th Ave""",,"""Oregon""",97214,,"""male""","""Others""","""Education & Health""",,"""Undefined""","""7T9-BAAAQBAJ""","""228-222@5xc-jtz-hwk""",2018-08-14,2119-04-15,36768,0,"""Fundamentals Of Financial Mana…","[""Eugene F. Brigham""]","""Cengage Learning""","""2015-01-01""","[""Business & Economics""]",407.0,774,2015
"""8ba1250ea887f492d296b8a64bcd75…","""Mark Sowada""","""6080 Ne Alderwood Rd""","""Portland""","""Oregon""",97218,1988-12-04,"""male""","""College""","""Admin & Support""",35.0,"""Adult""","""ygoFAAAAQAAJ""","""224-222@5xc-jw2-t9z""",2119-10-18,2018-02-18,-37131,-1,"""Mechanics, Hydrostatics, Pneum…","[""Ireland commissioners of nat. educ""]",,"""1861""",,353.0,658,1861
"""6afa3dfab26081b9478dd84e56e6a7…","""Kristin Davis""","""8359 Se Magnolia St""","""Portland""","""Oregon""",97267,1985-07-04,"""female""","""Graduate Degree""","""Blue Collar""",39.0,"""Adult""","""53j1DwAAQBAJ""","""zzw-222@5xc-knn-c5z""",2018-11-21,2018-11-26,5,1,"""The Big Book Of Engines (Thoma…","[""Random House""]","""Random House Books For Young R…","""2020-07-07""","[""Juvenile Fiction""]",388.0,572,2020


The table above shows sample rows with all columns listed. This mini table represents the joined data from customers, books, and checkouts.
By analyzing this dataframe, one can check which customers returned books (or not), as well as the kinds of books, their prices, page counts, categories, etc.
We can also add information about libraries and check which library has the most problems with late returns.

In [140]:
library_df = library_df.select(
    [
        "patron_id",
        "name",
        "gender",
        "education",
        "occupation",
        "age",
        "age_category",
        "days_holding_book",
        "title",
        "published_year",
        "categories",
        "price",
        "pages",
        "book_returned",
    ]
)
library_df.hvplot.table(
    columns=["name", "gender", "age", "title", "price", "book_returned"],
    sortable=True,
    selectable=True,
)