# Analysis Sample Instructions

## Data Scientist

### Data Science and Storytelling

#### Overview

**The Background Story**

We are being hired by a local library with a problem: their books are being checked out and then returned late way too often. They would love to understand the cause of the issue and what they can learn from the data to proactively monitor the situation going forward.

**The Mission**

(Should you choose to accept)

We'd like you to analyze the library data located [here](https://drive.google.com/drive/folders/12Rx8fqey6TSvBhg-CsgB0mDq6mQStSo5?usp=sharing) and help us build a model to predict the likelihood of a late return of any book at checkout time. Are there any factors you can find that are connected with late returns? What would you recommend the library do to mitigate the risks you find? How would you present your findings to them to get buy-in? The data has the following schema, with each table represented by one CSV file with the matching name.

Good luck and happy analyzing!

## Data Analyst

### Requirements

**Before Starting...**

- Take a moment to think about how long you think this will take to get done.
- Send an email to the address in the footer of this document with your estimated completion time.

**Submitting Your Work**

- Post your full source code/notebook to a public repo on GitHub or your preferred source control website.
- Send an email to the address in the footer of this document with a link to the repository.

**General**

- Use R or Python for the analysis.
- Include credits in your source for any resources pulled from the internet (if applicable).
- Books are considered late if they are not returned within 28 days of checkout.
- Please don’t share the data or include it in your repo.

**Hints**

- Ask questions to clarify as needed.
- Answer the business questions posed above.
- Ensure that your notebook is a good representation of your style.
- Clearly document your thought process and any conclusions you reach.

**Bonus**

- Do something fun or creative with your analysis!
- What other stories can you tell with this data?
- Compare multiple models and showcase their strengths and weaknesses.


# Proposed Solution

This section provides the proposed solution to the presented problem. 

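IMPORTANT: Before running the notebook, extract the folder with the CSV files (`Data Challenge`) so that the relative paths used below, such as `Data Challenge/books.csv`, resolve.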

## Setup

Installing and importing necessary packages.

In [52]:
!pip install numpy pandas plotly scikit-learn xgboost kneed




In [53]:
import ast
import numpy as np
import pandas as pd
import plotly.express as px

from datetime import datetime
from itertools import chain
from kneed import KneeLocator
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

In [54]:
max_checkout_days = 28

## Data Exploration

Loading data from CSVs to pandas data frames.

### Books


In [55]:
df_books = pd.read_csv('Data Challenge/books.csv')
df_books

Unnamed: 0,id,title,authors,publisher,publishedDate,categories,price,pages
0,hVFwAAAAQBAJ,Ogilvy on Advertising,['David Ogilvy'],Vintage,2013-09-11,['Social Science'],72.99,320
1,bRY9AAAAYAAJ,Foreign Publications for Advertising American ...,['United States. Bureau of Foreign and Domesti...,,1913,['Advertising'],469.99,654
2,ZapAAAAAIAAJ,Advertising and the Public Interest,"['John A. Howard', 'James Hulbert']",,1973,['Advertising'],372.0,784
3,A-HthMfF5moC,Profitable Advertising,,,1894,['Advertising'],240.99USD,559
4,4Z9JAAAAMAAJ,Report of the Federal Trade Commission on Dist...,['United States. Federal Trade Commission'],,1944,['Government publications'],539.0,757
...,...,...,...,...,...,...,...,...
235,W58mAQAAIAAJ,Political and Commercial Control of the Minera...,['United States. Dept. of the Interior'],,1918,,153.0,503
236,frzDCQAAQBAJ,Water Resources Management IV,"['C.A. Brebbia', 'A. Kungolos']",WIT Press,2007-05-08,['Nature'],563.5,780
237,mQTxAAAAMAAJ,"Department Publications - State of California,...",['California. Dept. of Water Resources'],,1995,['Hydrology'],216.5|,748
238,lMkmAQAAMAAJ,Technical Report - South Carolina Marine Resou...,['South Carolina. Marine Resources Division'],,1979,['Marine resources'],11.5,236


### Customers

In [56]:
df_customers = pd.read_csv('Data Challenge/customers.csv')
df_customers

Unnamed: 0,id,name,street_address,city,state,zipcode,birth_date,gender,education,occupation
0,df83ec2d0d409395c0d8c2690cfa8b67,Cynthia Barnfield,44 NE Meikle Pl,Portland,Oregon,97213.0,2009-09-10,female,High School,
1,6aec7ab2ea0d67161dac39e5dcabd857,Elizabeth Smith,7511 SE Harrison St,Portland,Oregon,97215.0,1956-12-15,female,College,Blue Collar
2,0c54340672f510fdb9d2f30595c1ab53,Richard Pabla,1404 SE Pine St,Portland,Oregon,97214.0,1960-12-18,male,College,Education & Health
3,f0d9ce833ddc1f73c1e0b55bdebf012e,Charles Baker,12271 N Westshore Dr,Portland,Oregon,97217.0,2105-07-19,male,Graduate Degree,SALES
4,3720379163f6b46944db6c98c0485bfd,Ronald Lydon,5321 NE Skyport Way,,Oregon,97218.0,1961-03-14,male,Graduate Degree,Blue Collar
...,...,...,...,...,...,...,...,...,...,...
1995,ae55f0b71b8b8e91945cd9a91b6e45ee,JOE Roberts,7331 NE Killingsworth St,Portland,,97218.0,1955-05-23,male,Others,Business & Finance
1996,07fe407cc889ea21a8bdc04c305960b1,Matthew Coniglio,1908 NW Harborside Dr,Vancouver,washington,98660.0,1975-11-10,male,Others,Business & Finance
1997,9a2194fcd4f0f326f0ca334450e16a93,Earl Grier,22 NE graham ST,Portland,OREGON,97212.0,2007-10-02,male,Others,Education & Health
1998,01a598a05c48fdd18461d6411f51a109,Rogelio Richmann,7000 NE Airport Way,Portland,OREGON,97218.0,2001-02-19,male,College,Business & Finance


### Libraries


In [57]:
df_libraries = pd.read_csv('Data Challenge/libraries.csv')
df_libraries

Unnamed: 0,id,name,street_address,city,region,postal_code
0,226-222@5xc-kc4-fpv,Multnomah County Library Capitol Hill,10723 SW capitol Hwy,Portland,OR,97219
1,23v-222@5xc-jv7-v4v,Multnomah County Library Northwest,2300 NW Thurman St,,or,
2,222-222@5xc-jvf-skf,Multnomah County Library St Johns,7510 N Charleston Ave,portland,or,97203
3,227-222@5xc-jww-btv,Multnomah County Library Hillsdale,1525 SW Sunset blvd,Portland,or,-97239
4,22d-222@5xc-kcy-8sq,Multnomah County Library Sellwood Moreland,7860 SE 13th AVE,Portland,OR,97202
5,223-222@5xc-jxr-tgk,MULTNOMAH County Library Woodstock,6008 se 49TH AVE,Portland,OR,-97206
6,zzw-224@5xc-jwv-2rk,Multnomah County Library Central,801 SW 10th Ave,Portland,,97205
7,zzw-223@5xc-jv7-ct9,Friends OF the multnomah COUNTY Library,522 SW 5th Ave,,OR,97204
8,226-222@5xc-jxj-7yv,Multnomah County Library Belmont,1038 SE CESAR E CHAVEZ blvd,Portland,OR,97214
9,zzw-222@5xc-knn-c5z,Multnomah County Library Holgate,7905 SE Holgate Blvd,Portland,OR,


### Checkouts


In [58]:
df_checkouts = pd.read_csv('Data Challenge/checkouts.csv')
df_checkouts

Unnamed: 0,id,patron_id,library_id,date_checkout,date_returned
0,-xFj0vTLbRIC,b071c9c68228a2b1d00e6f53677e16da,225-222@5xc-jtz-hkf,2019-01-28,2018-11-13
1,HUX-y4oXl04C,8d3f63e1deed89d7ba1bf6a4eb101373,223-222@5xc-jxr-tgk,2018-05-29,2018-06-12
2,TQpFnkku2poC,4ae202f8de762591734705e0079d76df,228-222@5xc-jtz-hwk,2018-11-23,2019-01-24
3,OQ6sDwAAQBAJ,f9372de3c8ea501601aa3fb59ec0f524,23v-222@5xc-jv7-v4v,2018-01-15,2018-04-25
4,7T9-BAAAQBAJ,2cf3cc3b9e9f6c608767da8d350f77c9,225-222@5xc-jtz-hkf,2018-12-31,1804-01-23
...,...,...,...,...,...
1995,rNbuDwAAQBAJ,91871955f3641857832766ac3f5a0b95,222-222@5xc-jv5-nt9,2018-07-19,2018-08-12
1996,rcrCAgAAQBAJ,ad08956eb20efb746af650f906d439cf,22d-222@5xc-kcy-8sq,2018-03-07,2018-03-13
1997,F44fAQAAMAAJ,026262cc3454149303074c4113b5f118,226-222@5xc-jxj-7yv,2018-06-17,2018-06-27
1998,Ci1HAQAAMAAJ,08b29865e58e9b2aabff9684a703acf0,223-222@5xc-jxr-tgk,2018-12-10,2018-12-29


## Feature Engineering

Preparing features for building ML models.

### Parsing

Parsing dates and floats. The formats were obtained by analyzing values with ChatGPT. 

In [59]:
def parse_date(date):
    formats = [
        '%Y',            # Year only
        '%Y-%m-%d',      # Standard ISO format
        '%Y/%m/%d',      # ISO format with slashes
        '%Y.%m.%d',      # ISO format with dots
        '%Y%m%d',        # Compact format without separators
        '%d%m%Y',        # Compact format with day first
        '%d%m%y',        # Compact format with short year
        '%d-%m-%Y',      # Day first with dashes
        '%d/%m/%Y',      # Day first with slashes
        '%d.%m.%Y',      # Day first with dots
        '%d %m %Y',      # Day with spaces
        '%Y %b %d',      # Year with abbreviated month
        '%d %B %Y',      # Day with full month name
        '%d %b %Y',      # Day with abbreviated month
        '%Y|%m|%d',      # Custom format with pipes
        '%Y-%m-%d%%',    # ISO format with trailing percent sign (a literal % is written %%)
        '%y-%m-%d',      # Short year format
        '%y%m%d'         # Short year compact format
    ]
    
    for fmt in formats:
        try:
            return datetime.strptime(str(date), fmt)
        except ValueError:
            continue
    return pd.NaT


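# Each pattern is a character class: every listed character is stripped before the conversion to float.
df_books['price'] = df_books['price'].replace({r'[*,$USD|]': ''}, regex=True).astype(float)
df_books['pages'] = df_books['pages'].replace({r'[*,|^#]': ''}, regex=True).astype(float)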
df_books['publishedDate'] = df_books['publishedDate'].apply(parse_date)
df_customers['birth_date'] = df_customers['birth_date'].apply(parse_date)
df_checkouts['date_checkout'] = df_checkouts['date_checkout'].apply(parse_date)
df_checkouts['date_returned'] = df_checkouts['date_returned'].apply(parse_date)
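
One detail of the format list is easy to miss: in a `strptime` format, a literal percent sign has to be written as `%%`. A format that ends in a single `%` is itself invalid and raises `ValueError`, which a try/except loop like the one in `parse_date` would skip without any warning. The sketch below illustrates this with the standard library only; `try_parse` is a hypothetical helper written for this illustration, not part of the notebook.

```python
from datetime import datetime


def try_parse(value: str, fmt: str):
    """Return the parsed datetime, or None when strptime rejects the value or the format."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


escaped = try_parse('2018-05-29%', '%Y-%m-%d%%')   # '%%' matches one literal '%'
unescaped = try_parse('2018-05-29%', '%Y-%m-%d%')  # stray '%' makes the format itself invalid
year_only = try_parse('1913', '%Y')                # year-only values default to January 1

print(escaped, unescaped, year_only)
```

Year-only values such as `1913` are parsed as January 1 of that year, which matters later when ages are derived from these dates.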

### Merging

Merging books, customers, libraries, and checkouts into one data frame for further analysis.

In [60]:
df_merged = pd.merge(df_checkouts, df_books.rename(columns={c: f'book_{c}' for c in df_books.columns}), left_on='id', right_on='book_id')
df_merged = pd.merge(df_merged, df_customers.rename(columns={c: f'customer_{c}' for c in df_customers.columns}), left_on='patron_id', right_on='customer_id')
df_merged = pd.merge(df_merged, df_libraries.rename(columns={c: f'library_{c}' for c in df_libraries.columns}), left_on='library_id', right_on='library_id')
df_merged

Unnamed: 0,id,patron_id,library_id,date_checkout,date_returned,book_id,book_title,book_authors,book_publisher,book_publishedDate,...,customer_zipcode,customer_birth_date,customer_gender,customer_education,customer_occupation,library_name,library_street_address,library_city,library_region,library_postal_code
0,-xFj0vTLbRIC,b071c9c68228a2b1d00e6f53677e16da,225-222@5xc-jtz-hkf,2019-01-28,2018-11-13,-xFj0vTLbRIC,Blood Engines,['T.A. Pratt'],Spectra,2007-09-25,...,97212.0,NaT,female,,Tech,MULTNOMAH County Library,216 ne Knott st,,OR,
1,HUX-y4oXl04C,8d3f63e1deed89d7ba1bf6a4eb101373,223-222@5xc-jxr-tgk,2018-05-29,2018-06-12,HUX-y4oXl04C,Indian Financial System 5E,['Khan'],Tata McGraw-Hill Education,2006-06-01,...,97202.0,1965-01-24,female,graduate DEGREE,Tech,MULTNOMAH County Library Woodstock,6008 se 49TH AVE,Portland,OR,-97206
2,TQpFnkku2poC,4ae202f8de762591734705e0079d76df,228-222@5xc-jtz-hwk,2018-11-23,2019-01-24,TQpFnkku2poC,Advertising Management,"['C. L. Tyagi', 'Arun Kumar']",Atlantic Publishers & Dist,2004-01-01,...,97212.0,1963-11-04,male,Graduate Degree,Education & Health,Multnomah County Library,205 NE Russell St,,,97212-
3,OQ6sDwAAQBAJ,f9372de3c8ea501601aa3fb59ec0f524,23v-222@5xc-jv7-v4v,2018-01-15,2018-04-25,OQ6sDwAAQBAJ,New Technologies for Emission Control in Marin...,"['Masaaki Okubo', 'Takuya Kuwahara']",Butterworth-Heinemann,2019-08-29,...,97227.0,2119-02-10,male,Graduate DEGREE,Sales,Multnomah County Library Northwest,2300 NW Thurman St,,or,
4,7T9-BAAAQBAJ,2cf3cc3b9e9f6c608767da8d350f77c9,225-222@5xc-jtz-hkf,2018-12-31,1804-01-23,7T9-BAAAQBAJ,Fundamentals of Financial Management,"['Eugene F. Brigham', 'Joel F. Houston']",Cengage Learning,2015-01-01,...,97218.0,2103-05-19,female,Others,Business & Finance,MULTNOMAH County Library,216 ne Knott st,,OR,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,rNbuDwAAQBAJ,91871955f3641857832766ac3f5a0b95,222-222@5xc-jv5-nt9,2018-07-19,2018-08-12,rNbuDwAAQBAJ,American Folk Medicine,['Wayland D. Hand'],,2021-01-08,...,97214.0_,2120-08-25,male,Graduate Degree,Education & Health,Multnomah County Library North Portland,512 N Killingsworth St,Portland,OR,#97217
1996,rcrCAgAAQBAJ,ad08956eb20efb746af650f906d439cf,22d-222@5xc-kcy-8sq,2018-03-07,2018-03-13,rcrCAgAAQBAJ,Mechanics,['J. P. Den Hartog'],Courier Corporation,2013-03-13,...,97267.0,1967-10-17,female,High School,Education & Health,Multnomah County Library Sellwood Moreland,7860 SE 13th AVE,Portland,OR,97202
1997,F44fAQAAMAAJ,026262cc3454149303074c4113b5f118,226-222@5xc-jxj-7yv,2018-06-17,2018-06-27,F44fAQAAMAAJ,Michigan Manufacturer & Financial Record,,,1916-01-01,...,97218.0,1812-03-13,female,High School,Education & Health,Multnomah County Library Belmont,1038 SE CESAR E CHAVEZ blvd,Portland,OR,97214
1998,Ci1HAQAAMAAJ,08b29865e58e9b2aabff9684a703acf0,223-222@5xc-jxr-tgk,2018-12-10,2018-12-29,Ci1HAQAAMAAJ,Modern Electric Railway Practice: Power statio...,,,1909-01-01,...,97266.0,1980-08-23,male,Graduate Degree,,MULTNOMAH County Library Woodstock,6008 se 49TH AVE,Portland,OR,-97206
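
`pd.merge` performs an inner join by default, so a checkout whose book, customer, or library identifier has no match would be dropped without notice. A minimal sketch of how this could be audited, using toy tables with hypothetical values rather than the challenge data:

```python
import pandas as pd

# Toy stand-ins for the checkout and book tables (hypothetical values).
checkouts = pd.DataFrame({'id': ['b1', 'b2', 'b3'], 'patron_id': ['p1', 'p2', 'p3']})
books = pd.DataFrame({'book_id': ['b1', 'b2'], 'book_title': ['First', 'Second']})

# A left join keeps every checkout; `indicator=True` adds a `_merge` column
# whose value is 'both' for matched rows and 'left_only' for unmatched ones.
audit = pd.merge(checkouts, books, left_on='id', right_on='book_id', how='left', indicator=True)
unmatched = int((audit['_merge'] == 'left_only').sum())

# The default inner join keeps only the matched checkouts.
inner = pd.merge(checkouts, books, left_on='id', right_on='book_id')

print(f'{unmatched} unmatched checkout(s); inner join keeps {len(inner)} of {len(checkouts)} rows')
```

Running the same check on each of the three merges would show how many checkouts, if any, are lost before feature engineering starts.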


### Featurization

Preparing features that an ML model can learn from as follows:
1. Selecting numeric features, such as book price and pages.
2. Selecting categorical features whose number of distinct values is neither too high nor too low, and transforming them into binary indicator columns.
3. Extending the feature set with date differences, such as the age of customers and books.
4. Marking late returns by calculating the difference between the checkout date and the return date.

In [61]:
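def unique_category(category):
    # NOTE: replace("  ", "") deletes each double space outright, so "a  b" becomes "ab" rather than "a b".
    return str(category).lower().strip().replace("  ", "")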
    
    
def one_hot_encode(df, column):
    if column not in df.columns:
        return df
    
    df_encoded = pd.get_dummies(df[column].apply(unique_category)).astype(int)
    df_encoded.columns = [f'{column} - {_}' for _ in df_encoded.columns]
    df = pd.merge(df, df_encoded, left_index=True, right_index=True)
    df.drop(columns=[column], inplace=True)
    return df


def multi_hot_encode(df, column):
    if column not in df.columns:
        return df
    
    df_exploded = df[column].apply(lambda _: ast.literal_eval(_) if pd.notna(_) else _).explode()
    df_encoded = pd.get_dummies(df_exploded.apply(unique_category))
    df_encoded = df_encoded.groupby(df_encoded.index).sum()
    df_encoded.columns = [f'{column} - {_}' for _ in df_encoded.columns]
    df_result = pd.merge(df, df_encoded, left_index=True, right_index=True)
    df_result.drop(columns=[column], inplace=True)
    return df_result


df_xy = pd.DataFrame(df_merged[[
    'date_checkout',
    'date_returned',
    'book_authors', 
    'book_categories',
    'book_publishedDate',
    'book_price',
    'book_pages',
    'customer_birth_date',
    'customer_gender',
    'customer_education',
    'customer_occupation',
    'library_name',
]])

df_xy = df_xy[(datetime(2000, 1, 1) < df_xy['date_checkout']) & (df_xy['date_checkout'] < datetime.now())] # Selecting valid dates.
df_xy['checkout_days'] = (df_xy['date_returned'] - df_xy['date_checkout']).dt.days # For returned books.
df_xy.loc[df_xy['date_returned'].isna(), 'checkout_days'] = (datetime.now() - df_xy['date_checkout']).dt.days # For not yet returned books.
df_xy = df_xy[~((df_xy['date_returned'].isna()) & (df_xy['checkout_days'] <= max_checkout_days))] # Removing unreturned books that can still be returned on time.
df_xy = df_xy[df_xy['checkout_days'] > 0] # Removing rows where the number of checkout days is zero or negative.
df_xy['late_return'] = (df_xy['checkout_days'] > max_checkout_days).astype(int) # Whether the book was not returned on time.
df_xy['customer_age'] = (datetime.now() - df_xy['customer_birth_date']).dt.days / 365 # Transforming birth date to age.
df_xy['book_age'] = (datetime.now() - df_xy['book_publishedDate']).dt.days / 365 # Transforming published date to age.
df_xy.drop(columns=['date_returned', 'customer_birth_date', 'book_publishedDate'], inplace=True)
df_xy = multi_hot_encode(df_xy, 'book_authors')
df_xy = multi_hot_encode(df_xy, 'book_categories')
df_xy = one_hot_encode(df_xy, 'customer_gender')
df_xy = one_hot_encode(df_xy, 'customer_education')
df_xy = one_hot_encode(df_xy, 'customer_occupation')
df_xy = one_hot_encode(df_xy, 'library_name')
df_xy.drop(columns=[_ for _ in df_xy.columns if _.endswith(' - nan') or _.endswith(' - others')], inplace=True) # Drop undefined categories.
df_xy.drop(columns=[_ for _ in df_xy.columns if len(df_xy[_].unique()) == 1], inplace=True) # Drop constants.
df_xy.dropna(inplace=True)
df_xy

Unnamed: 0,date_checkout,book_price,book_pages,checkout_days,late_return,customer_age,book_age,book_authors - a. kungolos,book_authors - ahmed f. el-sayed,book_authors - akira ohata,...,library_name - multnomah county library midland,library_name - multnomah county library northwest,library_name - multnomah county library st johns,library_name - multnomah county library woodstock,library_name - multnomah countylibrary sellwoodmoreland,library_name - multnomahcounty library,library_name - multnomahcounty library albina,library_name - multnomahcounty library central,library_name - multnomahcountylibrary hollywood library,library_name - multnomahcountylibrarynorth portland
1,2018-05-29,416.99,752.0,14.0,0,59.652055,18.273973,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,2018-11-23,217.00,790.0,62.0,1,60.876712,20.690411,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,2018-01-15,190.50,597.0,100.0,1,-94.495890,5.021918,0,0,0,...,0,1,0,0,0,0,0,0,0,0
6,2018-01-10,414.50,561.0,25.0,0,40.134247,27.693151,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,2018-06-23,149.00,530.0,21.0,0,47.978082,30.695890,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,2018-07-19,302.00,668.0,24.0,0,-96.035616,3.657534,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1996,2018-03-07,506.99,493.0,6.0,0,56.923288,11.487671,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1997,2018-06-17,371.00,751.0,10.0,0,212.621918,108.750685,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1998,2018-12-10,484.00,635.0,19.0,0,44.063014,115.753425,0,0,0,...,0,0,0,1,0,0,0,0,0,0
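
Column names such as `library_name - multnomahcounty library` in the output above suggest that some spellings of the same category end up as separate columns, because `unique_category` deletes double spaces instead of collapsing them. A possible alternative is sketched below; `normalize_category` is a hypothetical replacement, and the sample strings are made up for illustration.

```python
def normalize_category(category) -> str:
    """Lower-case the value and collapse any run of whitespace into one space."""
    return ' '.join(str(category).lower().split())


def delete_double_spaces(category) -> str:
    """Same expression as `unique_category` above, repeated here for comparison."""
    return str(category).lower().strip().replace("  ", "")


# Three spellings of one library name (made-up examples).
raw = ['Multnomah County Library', 'MULTNOMAH  County Library', ' multnomah county  library ']

collapsed = {normalize_category(v) for v in raw}
deleted = {delete_double_spaces(v) for v in raw}

print(len(collapsed), 'category after collapsing whitespace')
print(len(deleted), 'categories after deleting double spaces')
```

Switching to this normalization would change the set of encoded columns, so the tables shown in this notebook would need to be regenerated.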


## Feature Analysis

### Correlation Analysis

The following plot shows the Spearman correlation of each feature with late returns. For readability, the binary columns produced from each categorical feature are grouped under the original feature name, and the plot shows only the minimum and maximum correlation within each group. For features that were not expanded in this way, the minimum and maximum coincide.

The correlation is weak in all cases, which suggests exploring additional features, such as holidays, weather, or local events, that might influence book returns.

In [82]:
def plot_features(df, name_col, value_col):
    df_result = pd.DataFrame(df[[name_col, value_col]])
    df_result.columns=['feature', 'value']
    df_result['feature'] = df_result['feature'].apply(lambda _: _.split(' - ')[0])
    df_result = df_result.groupby('feature', sort=False).aggregate(['min', 'max']).reset_index()
    df_result.columns = ['feature', 'min', 'max']
    return px.bar(df_result, x='feature', y=['min', 'max'], barmode='group')
    
    
def plot_corr(df, target, method='spearman'):
    df_corr = df.corr(method)[target].reset_index()
    df_corr = df_corr[df_corr['index'] != target]
    df_corr.sort_values(target, ascending=False, inplace=True)
    return plot_features(df_corr, 'index', target)


plot_corr(df_xy.drop(columns=['date_checkout', 'checkout_days']), 'late_return').write_html('correlation_analysis.html')

### Trend Analysis

The following plot shows the number of late returns by checkout date. The periodic fluctuations over time suggest that calendar features, such as day of week and month of year, could be useful for predicting late returns.



In [63]:
px.line(df_xy.groupby('date_checkout')['late_return'].sum().reset_index(), x='date_checkout', y='late_return')

## Feature Augmentation

### Time Encoders

Preparing classes that encode calendar features as clock positions (2D Cartesian coordinates on a unit circle). Unlike one-hot encoding, this encoding preserves both the distance between feature values and their periodicity in Euclidean space, so the last value of a cycle lies next to the first one.
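
The idea can be checked numerically with a small standalone sketch. The function `clock_position` below is a hypothetical helper that mirrors the formula used by the encoders (sine and cosine rescaled to [0, 1]); it is an illustration, not the notebook's implementation.

```python
import numpy as np


def clock_position(index, length):
    """Map cycle positions 0..length-1 to points on a circle, rescaled to [0, 1]."""
    angle = 2 * np.pi * np.asarray(index) / length
    return np.column_stack(((np.sin(angle) + 1) / 2, (np.cos(angle) + 1) / 2))


months = clock_position(np.arange(12), 12)  # row 0 is January, row 11 is December

jan_feb = np.linalg.norm(months[0] - months[1])
dec_jan = np.linalg.norm(months[11] - months[0])
jan_jul = np.linalg.norm(months[0] - months[6])

print(f'Jan-Feb: {jan_feb:.3f}, Dec-Jan: {dec_jan:.3f}, Jan-Jul: {jan_jul:.3f}')
```

December and January are as close to each other as January and February, while January and July sit on opposite sides of the circle. A plain integer month (1 to 12) would instead place December and January at the two extremes.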

In [64]:
class TimeEncoder:
    def __init__(self, *features: str):
        """
        Abstract time feature encoder.

        :param features: Names of the output columns produced by the encoder.
        """
        self.features = features

    def encode(self, t: pd.Series) -> pd.DataFrame:
        """Transforms the specified timestamps into numeric values."""
        raise NotImplementedError

    def __call__(self, t: pd.Series) -> pd.DataFrame:
        """Transforms the specified timestamps into numeric values."""
        return self.encode(t)


class PeriodicTimeEncoder(TimeEncoder):
    def __init__(self):
        t = type(self).__name__
        super().__init__(f'{t} - X', f'{t} - Y')

    def length(self) -> int:
        """Returns the cycle length."""
        raise NotImplementedError

    def index(self, t: pd.Series) -> pd.Series:
        """Extracts the index of the timestamp position on the cycle."""
        raise NotImplementedError

    def encode(self, t: pd.Series) -> pd.DataFrame:
        # Angle on the cycle, with sine and cosine rescaled from [-1, 1] to [0, 1].
        clock_position = 2 * np.pi * self.index(t) / self.length()
        x_name, y_name = self.features
        return pd.DataFrame({
            x_name: (np.sin(clock_position) + 1) / 2,
            y_name: (np.cos(clock_position) + 1) / 2
        })


class MonthOfYear(PeriodicTimeEncoder):
    """Encodes month of year as 2D Cartesian coordinates on a unit circle (clock positions)."""

    def length(self) -> int:
        return 12

    def index(self, t: pd.Series) -> pd.Series:
        return t.dt.month - 1


class DayOfMonth(PeriodicTimeEncoder):
    """Encodes day of month as 2D Cartesian coordinates on a unit circle (clock positions)."""

    def length(self) -> int:
        return 31

    def index(self, t: pd.Series) -> pd.Series:
        return t.dt.day - 1


class DayOfWeek(PeriodicTimeEncoder):
    """Encodes day of week as 2D Cartesian coordinates on a unit circle (clock positions)."""

    def length(self) -> int:
        return 7

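    def index(self, t: pd.Series) -> pd.Series:
        # `dt.dayofweek` is already zero-based (Monday = 0); the extra -1 only rotates the circle by one day.
        return t.dt.dayofweek - 1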

  
class TimeEncoders(TimeEncoder):
    def __init__(self, *encoders: TimeEncoder):
        """Encodes multiple time features using the specified time encoders."""
        super().__init__(*list(chain.from_iterable([_.features for _ in encoders])))
        self.encoders = encoders

    def encode(self, t: pd.Series) -> pd.DataFrame:
        return pd.concat([_.encode(t) for _ in self.encoders], axis=1)

### Time Encoding

Encoding month of year, day of month, and day of week as clock positions.

In [65]:
time_encoders = TimeEncoders(MonthOfYear(), DayOfMonth(), DayOfWeek())
df_encoded_time = time_encoders(df_xy['date_checkout'])
df_xy = pd.concat((df_xy, df_encoded_time), axis=1)
df_xy

Unnamed: 0,date_checkout,book_price,book_pages,checkout_days,late_return,customer_age,book_age,book_authors - a. kungolos,book_authors - ahmed f. el-sayed,book_authors - akira ohata,...,library_name - multnomahcounty library albina,library_name - multnomahcounty library central,library_name - multnomahcountylibrary hollywood library,library_name - multnomahcountylibrarynorth portland,MonthOfYear - X,MonthOfYear - Y,DayOfMonth - X,DayOfMonth - Y,DayOfWeek - X,DayOfWeek - Y
1,2018-05-29,416.99,752.0,14.0,0,59.652055,18.273973,0,0,0,...,0,0,0,0,0.933013,0.250000,0.214366,0.910382,0.500000,1.000000
2,2018-11-23,217.00,790.0,62.0,1,60.876712,20.690411,0,0,0,...,0,0,0,0,0.066987,0.750000,0.015961,0.374674,0.716942,0.049516
3,2018-01-15,190.50,597.0,100.0,1,-94.495890,5.021918,0,0,0,...,0,0,0,0,0.500000,1.000000,0.649682,0.022930,0.109084,0.811745
6,2018-01-10,414.50,561.0,25.0,0,40.134247,27.693151,0,0,0,...,0,0,0,0,0.500000,1.000000,0.984039,0.374674,0.890916,0.811745
8,2018-06-23,149.00,530.0,21.0,0,47.978082,30.695890,0,0,0,...,0,0,0,1,0.750000,0.066987,0.015961,0.374674,0.283058,0.049516
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,2018-07-19,302.00,668.0,24.0,0,-96.035616,3.657534,0,0,0,...,0,0,0,1,0.500000,0.000000,0.257349,0.062827,0.987464,0.388740
1996,2018-03-07,506.99,493.0,6.0,0,56.923288,11.487671,0,0,0,...,0,0,0,0,0.933013,0.750000,0.968876,0.673653,0.890916,0.811745
1997,2018-06-17,371.00,751.0,10.0,0,212.621918,108.750685,0,0,0,...,0,0,0,0,0.750000,0.066987,0.449416,0.002565,0.012536,0.388740
1998,2018-12-10,484.00,635.0,19.0,0,44.063014,115.753425,0,0,0,...,0,0,0,0,0.250000,0.933013,0.984039,0.374674,0.109084,0.811745


### Correlation Analysis With New Features

The following plot compares the newly added features (month of year, day of month, and day of week) with the other features and suggests that they could be useful for predicting late returns.

In [66]:
plot_corr(df_xy.drop(columns=['date_checkout', 'checkout_days']), 'late_return')

## Model Selection

Evaluating traditional (shallow learning) classifiers: logistic regression, decision tree, random forest, extreme gradient boosting, and support vector machines. For simplicity, the classifiers are used with default hyperparameters and evaluated with commonly used classification metrics utilizing cross validation. 

NOTE: The evaluation process can be extended to include hyperparameter optimization and other models.

### Data Splitting and Scaling

Preparing scaled features (x), targets (y), and weights (w).

In [67]:
df_x = df_xy.drop(columns=['date_checkout', 'checkout_days', 'late_return'])
x = MinMaxScaler().fit_transform(df_x.values)
y = df_xy['late_return'].values
w = df_xy['checkout_days'].values / max_checkout_days # Weights for handling imbalance in late returns. 
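
Fitting `MinMaxScaler` on the full data set before cross validation lets the held-out folds influence the scaling range. One common way to avoid this is to put the scaler and the classifier into a scikit-learn pipeline, so the scaler is refitted inside every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for `x` and `y`; sample weights are left out because routing them to a pipeline step needs extra configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and the late-return labels.
x_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# The pipeline refits the scaler on the training part of every fold,
# so the held-out part never influences the scaling range.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
scores = cross_validate(model, x_demo, y_demo, cv=5, scoring=['accuracy', 'f1'])

print(scores['test_accuracy'].mean(), scores['test_f1'].mean())
```

For min-max scaling the effect of this leakage is often small, but the pipeline form keeps the reported scores closer to what the model would achieve on unseen checkouts.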

### Cross Validation

Evaluating classification models with 5-fold cross validation using the following metrics:
1. Accuracy
2. Precision
3. Recall
4. F1 Score 
5. ROC AUC (Receiver Operating Characteristic - Area Under the Curve)
6. Balanced Accuracy
7. Average Precision
8. Matthews Correlation Coefficient (MCC)

The following descriptions were generated by ChatGPT.

#### 1. Accuracy
**Definition:** The ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances.  
**Formula:**  
$$
\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Instances}}
$$  
**Range:** [0, 1]  
**Interpretation:**
- **0:** No correct predictions (worst case).
- **1:** All predictions are correct (best case).
- **Usefulness:** Accuracy is a useful general measure, but it can be misleading in cases of imbalanced datasets where the majority class dominates.

#### 2. Precision
**Definition:** The ratio of correctly predicted positive instances (true positives) to the total predicted positives (true positives + false positives).  
**Formula:**  
$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$  
**Range:** [0, 1]  
**Interpretation:**
- **0:** No true positives, only false positives (worst case).
- **1:** All predicted positives are true positives, no false positives (best case).
- **Usefulness:** Precision is crucial when the cost of false positives is high (e.g., in spam detection).

#### 3. Recall
**Definition:** The ratio of correctly predicted positive instances to the total actual positives (true positives + false negatives).  
**Formula:**  
$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$  
**Range:** [0, 1]  
**Interpretation:**
- **0:** No true positives are captured; all actual positives are missed (worst case).
- **1:** All actual positives are captured by the model (best case).
- **Usefulness:** Recall is important when the cost of false negatives is high (e.g., in medical diagnostics).

#### 4. F1 Score
**Definition:** The harmonic mean of precision and recall.  
**Formula:**  
$$
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$  
**Range:** [0, 1]  
**Interpretation:**
- **0:** Either precision or recall (or both) is zero (worst case).
- **1:** Both precision and recall are perfect (best case).
- **Usefulness:** The F1 Score is useful when you need to balance precision and recall, especially in imbalanced datasets.

#### 5. ROC AUC (Receiver Operating Characteristic - Area Under the Curve)
**Definition:** AUC measures the area under the ROC curve, which plots the true positive rate against the false positive rate at various threshold settings.  
**Range:** [0, 1]  
**Interpretation:**
- **0.5:** The model performs no better than random guessing.
- **1:** The model perfectly distinguishes between the classes.
- **Usefulness:** ROC AUC is useful to evaluate the model’s ability to distinguish between the positive and negative classes across all threshold values.

#### 6. Balanced Accuracy
**Definition:** The average of recall obtained on each class, particularly useful for imbalanced datasets.  
**Formula:**  
$$
\text{Balanced Accuracy} = \frac{1}{2} \left( \frac{\text{True Positives}}{\text{Actual Positives}} + \frac{\text{True Negatives}}{\text{Actual Negatives}} \right)
$$  
**Range:** [0, 1]  
**Interpretation:**
- **0:** The model performs as badly as possible on both classes.
- **1:** The model perfectly predicts both classes.
- **Usefulness:** Balanced accuracy gives a more truthful measure of performance for imbalanced datasets by equally weighing the accuracy of each class.

#### 7. Average Precision
**Definition:** The average of precision scores calculated at different thresholds, weighted by the increase in recall from the previous threshold.  
**Range:** [0, 1]  
**Interpretation:**
- **0:** No precision; the model predicts only false positives.
- **1:** Perfect precision at all thresholds.
- **Usefulness:** Average precision provides a single-number summary of the precision-recall curve, useful in cases with imbalanced classes.

#### 8. Matthews Correlation Coefficient (MCC)
**Definition:** MCC is a measure of the quality of binary classifications, considering all four confusion matrix categories (true positives, false positives, true negatives, and false negatives).  
**Formula:**  
$$
\text{MCC} = \frac{(\text{True Positives} \times \text{True Negatives}) - (\text{False Positives} \times \text{False Negatives})}{\sqrt{(\text{True Positives} + \text{False Positives}) \times (\text{True Positives} + \text{False Negatives}) \times (\text{True Negatives} + \text{False Positives}) \times (\text{True Negatives} + \text{False Negatives})}}
$$  
**Range:** [-1, 1]  
**Interpretation:**
- **-1:** Total disagreement between predicted and actual values.
- **0:** Predictions are no better than random.
- **1:** Perfect prediction.
- **Usefulness:** MCC is a balanced metric even for imbalanced classes, providing a comprehensive view of prediction performance.
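
To make the formulas concrete, the following sketch evaluates the threshold-based metrics by hand for one small confusion matrix. The counts are hypothetical (10 checkouts, 3 of them actually late) and are not taken from the library data.

```python
from math import sqrt

# Hypothetical confusion-matrix counts for 10 checkouts (3 actually late).
tp, fn, fp, tn = 1, 2, 1, 6

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (tp / (tp + fn) + tn / (tn + fp)) / 2
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}')
print(f'f1={f1:.3f} balanced_accuracy={balanced_accuracy:.3f} mcc={mcc:.3f}')
```

Accuracy comes out at 0.7 even though only one of the three late returns is caught (recall of 1/3), which is why the imbalance-aware metrics are reported alongside it.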


In [68]:
def evaluate_classifier(c, x_samples, y_samples, w_samples):
    """Cross-validate classifier `c` and return one row of scores per fold."""
    # Display name -> scikit-learn scorer string
    scoring = {
        'Accuracy': 'accuracy',
        'Precision': 'precision',
        'Recall': 'recall',
        'F1': 'f1',
        'ROC_AUC': 'roc_auc',
        'Balanced_Accuracy': 'balanced_accuracy',
        'Average_Precision': 'average_precision',
        'MCC': 'matthews_corrcoef',
    }

    # 5-fold cross-validation; the sample weights are passed to the estimator's fit()
    scores = cross_validate(c, x_samples, y_samples, params={'sample_weight': w_samples}, cv=5, scoring=scoring, return_train_score=True)
    df_result = pd.DataFrame(scores).reset_index()
    df_result.rename(columns={'index': 'partition'}, inplace=True)
    # Label rows with the class name, e.g. RandomForestClassifier -> RandomForest
    df_result['classifier'] = type(c).__name__.replace('Classifier', '')
    return df_result


classifiers = [
    LogisticRegression(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    XGBClassifier(),
    SVC(kernel='linear', probability=True)
]

# Score every classifier, then average the folds per classifier
df_scores = pd.concat([evaluate_classifier(clf, x, y, w) for clf in classifiers])
df_scores.drop(columns=['partition']).groupby('classifier', sort=False).mean().T


Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.





| classifier | LogisticRegression | DecisionTree | RandomForest | XGB | SVC |
|---|---|---|---|---|---|
| fit_time | 0.119403 | 0.897334 | 1.64147 | 0.317706 | 0.950411 |
| score_time | 0.126845 | 0.05154 | 0.111242 | 0.02949 | 0.065683 |
| test_Accuracy | 0.298991 | 0.700264 | 0.81116 | 0.675986 | 0.457089 |
| train_Accuracy | 0.348173 | 1.0 | 1.0 | 0.979614 | 0.552033 |
| test_Precision | 0.193851 | 0.185853 | 0.3 | 0.252408 | 0.197655 |
| train_Precision | 0.221399 | 1.0 | 1.0 | 0.903224 | 0.29301 |
| test_Recall | 0.859507 | 0.170537 | 0.011393 | 0.363716 | 0.605878 |
| train_Recall | 0.972548 | 1.0 | 1.0 | 1.0 | 0.967808 |
| test_F1 | 0.316322 | 0.176963 | 0.021953 | 0.297418 | 0.297875 |
| train_F1 | 0.360622 | 1.0 | 1.0 | 0.948983 | 0.449706 |
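
The "Precision is ill-defined" warning in the output above appears when, in some fold, a classifier predicts no late returns at all, so precision becomes 0/0. The pure-Python sketch below reproduces that edge case; the function and the toy labels are illustrative and not part of the notebook. scikit-learn's `precision_score` exposes the same choice through its `zero_division` parameter.

```python
def precision(y_true, y_pred, zero_division=0.0):
    """Precision = TP / (TP + FP); returns `zero_division` if nothing is predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(1 for p in y_pred if p == 1)
    if predicted_pos == 0:
        # 0/0 case: the value is a convention, not a measurement
        return zero_division
    return true_pos / predicted_pos


# Toy fold (illustrative): the model predicts "on time" for every checkout
y_true = [0, 0, 1, 0, 1]
no_positives = precision(y_true, [0, 0, 0, 0, 0])

# Toy fold (illustrative): two predicted late returns, one of them correct
some_positives = precision(y_true, [0, 1, 1, 0, 0])

print(no_positives, some_positives)
```

A fold scored this way pulls the mean test precision toward 0, which is worth keeping in mind when reading the averaged table.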


### Reducing Overfitting

The winner is the XGB model, which offers the best balance across the evaluation metrics, especially F1 score and ROC-AUC. However, the large gap between its training and test scores (for example, an accuracy of about 0.98 on the training folds versus 0.68 on the test folds) indicates overfitting. Simple ways to reduce overfitting include simplifying the model (e.g., by reducing the maximum tree depth) and selecting features (e.g., by keeping only the features above the elbow of the feature-importance curve). The following table shows that these approaches do close the gap between training and test scores but also lower accuracy, which suggests that more data could help the model generalize better.
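
One simple way to quantify overfitting is the difference between the mean training and test scores. The sketch below uses the XGB accuracy and F1 values reported in this notebook's result tables; the helper name `train_test_gap` is illustrative.

```python
# Mean cross-validation scores copied from the result tables in this notebook
xgb_full = {"train_Accuracy": 0.979614, "test_Accuracy": 0.675986,
            "train_F1": 0.948983, "test_F1": 0.297418}
xgb_reduced = {"train_Accuracy": 0.214771, "test_Accuracy": 0.206720,
               "train_F1": 0.322715, "test_F1": 0.316640}


def train_test_gap(scores):
    """Return (train - test) for every metric that has a train_ entry."""
    gaps = {}
    for name, value in scores.items():
        if name.startswith("train_"):
            metric = name[len("train_"):]
            gaps[metric] = round(value - scores["test_" + metric], 6)
    return gaps


gap_full = train_test_gap(xgb_full)
gap_reduced = train_test_gap(xgb_reduced)
print(gap_full)
print(gap_reduced)
```

A gap near zero only says the model behaves the same on seen and unseen data; the absolute test scores still have to be judged separately.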

In [69]:
# Rank the features by the importance assigned by a default XGBoost model
xgb_importance = XGBClassifier().fit(x, y, sample_weight=w).feature_importances_
df_xgb_importance = pd.DataFrame({'feature': df_x.columns, 'importance': xgb_importance})
df_xgb_importance.sort_values('importance', ascending=False, inplace=True)
# Locate the elbow of the sorted importance curve and keep the features ranked above it
elbow = KneeLocator(range(len(df_xgb_importance)), df_xgb_importance.importance, curve='convex', direction='decreasing').knee
df_x_reduced = df_xy[df_xgb_importance[:elbow].feature]
x_reduced = df_x_reduced.values

df_xgb_scores = evaluate_classifier(XGBClassifier(max_depth=4), x_reduced, y, w)
df_xgb_scores.drop(columns=['partition', 'classifier']).mean().T

fit_time                   0.184497
score_time                 0.051287
test_Accuracy              0.206720
train_Accuracy             0.214771
test_Precision             0.189075
train_Precision            0.192765
test_Recall                0.973512
train_Recall               0.990530
test_F1                    0.316640
train_F1                   0.322715
test_ROC_AUC               0.541976
train_ROC_AUC              0.564044
test_Balanced_Accuracy     0.500871
train_Balanced_Accuracy    0.512349
test_Average_Precision     0.210096
train_Average_Precision    0.223634
test_MCC                   0.012499
train_MCC                  0.055645
dtype: float64
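
The feature selection above relies on `KneeLocator` to find the elbow of the sorted importance curve. To make the idea concrete, here is a simplified stand-in that picks the point farthest below the straight line joining the first and last values. It is not the algorithm `KneeLocator` implements and may return a different index; the importance values and feature names are made up.

```python
def elbow_index(values):
    """Index of the point farthest below the chord from the first to the last value.

    `values` must be sorted in decreasing order.
    """
    n = len(values)
    span = values[0] - values[-1]
    if n < 3 or span == 0:
        return 0
    best_i, best_gap = 0, 0.0
    for i, v in enumerate(values):
        x = i / (n - 1)                  # position scaled to [0, 1]
        y = (v - values[-1]) / span      # value scaled to [0, 1]
        gap = (1.0 - x) - y              # vertical distance below the chord y = 1 - x
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


# Made-up importances, already sorted from most to least important
importances = [0.5, 0.3, 0.1, 0.04, 0.03, 0.02, 0.01]
features = ["f0", "f1", "f2", "f3", "f4", "f5", "f6"]

elbow = elbow_index(importances)
selected = features[:elbow]  # same slicing convention as the notebook cell
print(elbow, selected)
```

Slicing with `[:elbow]` excludes the feature at the elbow itself, so an off-by-one choice here changes how many features survive.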

### The Winner

The winning model is retrained on all data points, and its feature importance is shown below.

In [70]:
winner = XGBClassifier(max_depth=4).fit(x_reduced, y, sample_weight=w)

#### Feature Importance

For simplicity, categorical feature values that were previously transformed into binary indicator columns are grouped by category name, so the plot shows only the minimum and maximum importance for each group. For features that were not transformed in this way, the minimum and maximum are the same value.

In [71]:
df_winner_importance = pd.DataFrame({'feature': df_x_reduced.columns, 'importance': winner.feature_importances_})
df_winner_importance.sort_values('importance', ascending=False, inplace=True)
plot_features(df_winner_importance, 'feature', 'importance')
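
The min/max grouping described above can be sketched as follows. The sketch assumes, hypothetically, that encoded columns are named `category=value`; the notebook's `plot_features` helper may use a different naming scheme and grouping logic, and the feature names and importances here are made up.

```python
def group_importance_range(importances, sep="="):
    """Collapse encoded columns into (min, max) importance per category name.

    `importances` maps column name -> importance. Columns without `sep`
    form their own single-member group, so their min and max coincide.
    """
    groups = {}
    for column, value in importances.items():
        category = column.partition(sep)[0]
        low, high = groups.get(category, (value, value))
        groups[category] = (min(low, value), max(high, value))
    return groups


# Made-up importances with a hypothetical `category=value` naming scheme
importances = {
    "genre=fiction": 0.12,
    "genre=history": 0.03,
    "genre=science": 0.07,
    "patron_age": 0.20,
}
ranges = group_importance_range(importances)
print(ranges)
```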

#### Top 10 Most Important Features

In [72]:
px.bar(df_winner_importance[:10], x='feature', y='importance')

## Final Result

The final result from the winning model is the predicted likelihood that a book will be returned late, shown here as a histogram.

In [73]:
p = winner.predict_proba(x_reduced)[:,1]
px.histogram(pd.DataFrame({'Likelihood of a Late Return': p}))

## Recommendations

The following strategies could be implemented based on the prepared model:
1. Proactively identify checkouts with a higher likelihood of a late return and inform the customer.
2. Send more frequent reminders to customers whose checkouts have a higher likelihood of a late return.
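
As a minimal sketch of how the two strategies could be driven by the model's predicted probabilities, the function below maps a likelihood to an action. The thresholds (0.5 and 0.8), the function name, and the action labels are hypothetical and would need to be tuned with the library.

```python
def reminder_plan(p_late, notify_at=0.5, frequent_at=0.8):
    """Map a predicted late-return probability to an action (thresholds are illustrative)."""
    if not 0.0 <= p_late <= 1.0:
        raise ValueError("p_late must be a probability in [0, 1]")
    if p_late >= frequent_at:
        return "notify at checkout and send frequent reminders"
    if p_late >= notify_at:
        return "notify at checkout"
    return "standard reminders"


# Made-up probabilities standing in for winner.predict_proba(...)[:, 1]
plans = [reminder_plan(p) for p in (0.15, 0.55, 0.92)]
print(plans)
```

Lower thresholds catch more late returns at the cost of contacting more customers who would have returned on time.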