Hi All,

In this notebook I am trying to understand which parameters are correlated with each other and with library activity.

Thanks for checking this!

Billur

In [None]:
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import numpy as np

In [None]:
data = pd.read_csv("../input/Library_Usage.csv")
data.head(1)

# What attributes are most associated with library activity (# of checkouts, # of renewals)?

I suspect that the year of enrolment is correlated with library activity.

 - First, let's look at the distribution of member enrolments per year.

In [None]:
years = list(range(2003,2017))
member_count = pd.DataFrame({'count' : data.groupby(["Year Patron Registered"]).size()}).reset_index()
ax = sns.barplot(x = "Year Patron Registered",y = "count", data = member_count, order = years, palette = "YlGnBu")
ax = plt.xticks(rotation = 45)
ax = plt.title("Number of Registrations per Year", fontsize = 18)

The number of enrolled patrons increases over the years, with two outliers: 2003 and 2016.
The enrolment boom in 2003 is most probably due to the transition to digital archiving: existing members' registration year may have been set to 2003. The enrolment drop in 2016 may be explained by the records being incomplete.

Let's see:

In [None]:
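# Prefix for each month so that it can be combined with the year into one label
month_dict = {"January":"1_", "February":"2_", "March":"3_", "April":"4_", "May":"5_", "June":"6_", "July":"7_", "August":"8_",
              "September":"9_","October":"10_", "November":"11_", "December":"12_"}

# Build a "month_year" label such as "7_2016" (the year is handled as a string here)
data["Circulation Active Date"] = data["Circulation Active Month"].map(month_dict)  + data["Circulation Active Year"]
# List the months of 2016 that have any circulation activity
data.loc[data["Circulation Active Year"] == "2016", "Circulation Active Month"].unique()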

We were right: the 2016 records only go up to July.
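
One way to make this check explicit is to treat the month names as an ordered category and take the maximum. The sketch below is illustrative only: `last_active_month` and the tiny `sample` frame are hypothetical names and made-up values, not part of the dataset.

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def last_active_month(df, year):
    """Return the latest calendar month with circulation activity in `year`."""
    months = df.loc[df["Circulation Active Year"] == year, "Circulation Active Month"]
    # Values that are not month names become missing and are skipped by max()
    ordered = pd.Categorical(months, categories=MONTHS, ordered=True)
    return ordered.max()

# Made-up rows, only to show the call
sample = pd.DataFrame({
    "Circulation Active Year": ["2016", "2016", "2015", "2016"],
    "Circulation Active Month": ["March", "July", "December", "January"],
})
print(last_active_month(sample, "2016"))
```

On the real data the call would be `last_active_month(data, "2016")`, assuming the year column holds strings as in the cell above.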

 - Then let's see how the library activity is distributed over the years.

In [None]:
ax = sns.stripplot(x = "Year Patron Registered", y = "Total Checkouts", data = data, jitter=True)
ax = plt.xticks(fontsize = 12,color="steelblue", alpha=0.8, rotation = 45)
ax = plt.yticks(fontsize = 12,color="steelblue", alpha=0.8)
ax = plt.xlabel("Registration Year", fontsize = 15)
ax = plt.ylabel("Total Checkout #", fontsize = 15)
ax = plt.title("Total Checkout vs Registration Year", fontsize = 18)

In [None]:
ax = sns.stripplot(x = "Year Patron Registered", y = "Total Renewals", data = data, jitter=True)
ax = plt.xticks(fontsize = 12,color="steelblue", alpha=0.8, rotation = 45)
ax = plt.yticks(fontsize = 12,color="steelblue", alpha=0.8)
ax = plt.xlabel("Registration Year",fontsize = 15)
ax = plt.ylabel("Total Renewal #",fontsize = 15)
ax = plt.title("Total Renewal vs Registration Year",fontsize = 18)

As the two plots above and the correlation grid below show, there is:

 - a correlation of -0.36 between Total Checkouts and Registration Year
 - a correlation of -0.29 between Total Renewals and Registration Year

Even though the majority of the members have relatively low library activity, the correlations get weaker when we filter out the less active members (such as members with renewals < 2000 xor checkouts < 5000). That's why we will keep all members.
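
A minimal sketch of how such a comparison could be computed, assuming a simple "keep rows at or above a threshold" filter (the exact filter used for the numbers above is not shown in this notebook). The helper `corr_with_year` and the `sample` values are hypothetical.

```python
import pandas as pd

def corr_with_year(df, activity_col, min_activity=0):
    """Pearson correlation between registration year and an activity column,
    computed only on rows whose activity is at least `min_activity`."""
    kept = df[df[activity_col] >= min_activity]
    return kept["Year Patron Registered"].corr(kept[activity_col])

# Made-up rows, only to show the call
sample = pd.DataFrame({
    "Year Patron Registered": [2003, 2005, 2008, 2011, 2014, 2016],
    "Total Checkouts":        [900,  700,  400,  350,  100,  20],
})
all_members = corr_with_year(sample, "Total Checkouts")
active_only = corr_with_year(sample, "Total Checkouts", min_activity=300)
print(all_members, active_only)
```

Running the same helper on `data` with and without a threshold would show whether the filtered correlation is indeed weaker.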

Let's assign a representative (numerical) age to each "Age Range" and check the correlation between every pair of parameters in our dataset.

In [None]:
dict_age = {'0 to 9 years' : 5, '10 to 19 years' : 15, '20 to 24 years' : 22, '25 to 34 years' : 30, \
            '35 to 44 years': 40, '45 to 54 years' : 50, '55 to 59 years' : 57,'60 to 64 years' : 62, '65 to 74 years' : 70,\
            '75 years and over': 80}
data["Age"] = data["Age Range"].map(dict_age)

def display_corr(values, size):
    sns.set(style="white")

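    # Correlation matrix over the numeric and boolean columns only
    # (text columns cannot be correlated and raise an error in recent pandas)
    corr = values.select_dtypes(include=["number", "bool"]).corr()

    # Generate a mask for the upper triangle
    # (np.bool was removed from NumPy; the built-in bool is equivalent)
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True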

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(size, size))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, annot=True,cmap=cmap, vmax=.3,
            square=True,  ax=ax)
    ax = plt.xticks(fontsize = 12,color="steelblue", alpha=0.8, rotation=90)
    ax = plt.yticks(fontsize = 12,color="steelblue", alpha=0.8)
    
display_corr(data, 8)

- As expected, the most correlated pair of parameters is "Total Checkouts" and "Total Renewals".
- Apart from that, Age is highly (negatively) correlated with Year Patron Registered, which makes sense: older people may have registered earlier than toddlers, for example, or families may have registered later than seniors.
- Age is (weakly) correlated with more or less everything. Patron Type groups are defined by age, and seniors have had more chances to check out or renew books than their younger peers.
- "Outside of County" is correlated with Year Patron Registered. This one deserves a closer look:

In [None]:
sns.violinplot(y="Year Patron Registered", data=data[data["Outside of County"] == True], split = True, palette="Set3")

The artificial 2003 increase in the number of patrons registered is probably due to the initial recording of existing patrons.

We observe a dramatic increase in the number of "Outside of County" registrations after 2011. This year corresponds to the "end of recession" and the "economic recovery". As the charts linked here show, [the home prices](http://www.lao.ca.gov/reports/2015/3305/Bay-Area-Home-Prices-Outpaced-the-State.png) [1], [job trends](http://www.spur.org/sites/default/files/wysiwyg/u150/bay-area-job-trends.png) [2] and [the number of new constructions completed](https://cdn3.vox-cdn.com/uploads/chorus_asset/file/4558025/5-15_SF-New-Housing-Units-Completed_since-1995.0.jpg) all start to increase. These changes may point to increased migration to the San Francisco area and thus may explain the increased number of "Outside of County" patrons.

References:
 1. http://www.lao.ca.gov/Publications/Report/3305
 2. http://www.spur.org/news/2014-07-23/new-data-shows-bay-area-and-state-economies-are-booming
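
To check whether the post-2011 rise is more than a side effect of overall enrolment growth, one could look at the share of "Outside of County" patrons per registration year rather than the raw counts. This is a sketch under the assumption that the column holds booleans; `outside_county_share` and the `sample` rows are hypothetical.

```python
import pandas as pd

def outside_county_share(df):
    """Fraction of patrons flagged "Outside of County" for each registration year."""
    # The mean of a boolean column is the fraction of True values
    return df.groupby("Year Patron Registered")["Outside of County"].mean()

# Made-up rows, only to show the call
sample = pd.DataFrame({
    "Year Patron Registered": [2010, 2010, 2012, 2012, 2012, 2012],
    "Outside of County":      [False, False, True, False, True, True],
})
share = outside_county_share(sample)
print(share)
```

A share that climbs after 2011 on the real data would support the reading above; a flat share would suggest the increase simply tracks total registrations.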



In [None]:
plt.figure(figsize=(10,8))
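# Count members for each (patron type, age) pair, then pivot into a type-by-age grid
incidence_count_matrix_long = pd.DataFrame({'count' : data.groupby( [ "Patron Type Definition","Age"] ).size()}).reset_index()
# Recent pandas versions require keyword arguments for pivot
incidence_count_matrix_pivot = incidence_count_matrix_long.pivot(index="Patron Type Definition", columns="Age", values="count")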
ax = sns.heatmap(incidence_count_matrix_pivot, annot=True,  linewidths=1, square = False,cbar = False, cmap="Blues") 
ax = plt.xticks(fontsize = 12,color="steelblue", alpha=0.8)
ax = plt.yticks(fontsize = 12,color="steelblue", alpha=0.8)
ax = plt.xlabel("Age", fontsize = 24, color="steelblue")
ax = plt.ylabel("Type", fontsize = 24, color="steelblue")
ax = plt.title("Patron Type and Age Distributions", fontsize = 24, color="steelblue")

The above heatmap represents the count of members for the corresponding Age and Member Type.
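
The same count table can probably be built in one step with `pd.crosstab`, shown here as an alternative sketch on made-up rows (the `sample` frame and its patron type labels are illustrative, not taken from the dataset). Unlike the pivot above, `crosstab` fills combinations that never occur with 0 instead of leaving them missing.

```python
import pandas as pd

# Made-up rows, only to show the call
sample = pd.DataFrame({
    "Patron Type Definition": ["ADULT", "ADULT", "SENIOR", "JUVENILE", "ADULT"],
    "Age": [30, 40, 70, 5, 30],
})

# Rows are patron types, columns are ages, cells are member counts
counts = pd.crosstab(sample["Patron Type Definition"], sample["Age"])
print(counts)
```

The resulting frame can be passed to `sns.heatmap` in the same way as `incidence_count_matrix_pivot`.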