# PG Bookshelves
Because the subject tags are very messy, I am exploring Project Gutenberg's bookshelves, which may be more carefully curated.

See the [bookshelf category on the PG wiki](http://www.gutenberg.org/wiki/Category:Bookshelf).

### Getting the data
I scraped the PG wiki; the saved pages are in `data/bookshelves/`.

In [1]:
import numpy as np
import pandas as pd
import glob
import lxml.html

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
BS_paths = glob.glob("../data/bookshelves/*Bookshelf*")
BS = [path.split("/")[-1] for path in BS_paths]

In [3]:
BS_dict = {}
for path in BS_paths:
    bs = path.split("/")[-1]
    BS_dict[bs] = []
    with open(path, "r") as f:
        dom = lxml.html.fromstring(f.read())
        # collect the href of every <a> tag (link)
        for link in dom.xpath('//a/@href'):
            # keep links to ebooks, skipping search links
            if "ebooks" in link and "search" not in link:
                PGid = "PG" + link.split("/")[-1]
                BS_dict[bs].append(PGid)

    # delete empty bookshelves
    if len(BS_dict[bs]) == 0:
        del BS_dict[bs]
    
# recompose list of BSs
BS = list(BS_dict.keys())

# list of unique PGids
PGids = list(set(np.concatenate(list(BS_dict.values()))))
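
The link filter used in the loop above can be tried in isolation. The following is a minimal sketch using only plain strings; the helper name `href_to_pgid` and the example hrefs are hypothetical and are not taken from the scraped pages.

```python
def href_to_pgid(href):
    """Return 'PG<id>' for an ebook link, or None for any other link."""
    # same rule as in the loop: must mention "ebooks" and must not be a search
    if "ebooks" in href and "search" not in href:
        return "PG" + href.split("/")[-1]
    return None


# hypothetical hrefs, for illustration only
examples = [
    "//www.gutenberg.org/ebooks/1342",
    "/ebooks/search/?query=austen",
    "/wiki/Category:Bookshelf",
]
pgids = [href_to_pgid(h) for h in examples]
print(pgids)
```

Only the first href passes the filter, so a shelf page contributes one ID per ebook link. A link that appears twice on a page would be appended twice.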

In [31]:
# put in a DataFrame
df = pd.DataFrame(index = PGids, columns = BS)
for k,v in BS_dict.items():
    df.loc[v, k] = True
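
To make the layout of this DataFrame concrete, here is a minimal sketch on a toy dictionary. The shelf names and IDs (`toy_shelves`, `Shelf_A`, `PG1`, ...) are made up for illustration. It assumes the same construction as above: one row per book, one column per bookshelf, `True` where the book is on the shelf and `NaN` elsewhere.

```python
import pandas as pd

# toy stand-in for BS_dict (hypothetical shelves and IDs)
toy_shelves = {
    "Shelf_A": ["PG1", "PG2"],
    "Shelf_B": ["PG2", "PG3"],
}
toy_ids = sorted({pgid for ids in toy_shelves.values() for pgid in ids})

# empty frame (all NaN), then mark memberships with True
toy_df = pd.DataFrame(index=toy_ids, columns=list(toy_shelves))
for shelf, ids in toy_shelves.items():
    toy_df.loc[ids, shelf] = True

# boolean view: True where a book is on a shelf, False where it was NaN
membership = toy_df.notna()
shelf_sizes = membership.sum()             # books per shelf
shelves_per_book = membership.sum(axis=1)  # shelves per book
print(shelf_sizes)
print(shelves_per_book)
```

Converting to a boolean view with `notna()` before summing avoids relying on how `sum` treats `NaN` in object columns, which has differed between pandas versions.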

For now, the bookshelf DataFrame is saved in `../data/`.

In [6]:
df.to_pickle("../data/bookshelves.p")

### Bookshelves are almost non-overlapping

In [10]:
# Bookshelves with more than 100 books
sdf = df.loc[:, df.sum()>100]
# keep only books that appear in at least one of these bookshelves
sdf = sdf.loc[sdf.notna().any(axis=1)]

In [27]:
# overlaps are small
from itertools import combinations

# for every pair of large bookshelves, compute the number of shared
# books divided by the size of the smaller bookshelf
ratios = []
for s1, s2 in combinations(sdf.columns, 2):
    l1 = len(BS_dict[s1])
    l2 = len(BS_dict[s2])
    l3 = len(np.intersect1d(BS_dict[s1], BS_dict[s2]))
    intratio = l3/min(l1, l2)
    ratios.append([intratio, s1, s2])
intersetions_df = pd.DataFrame(data=ratios, columns = ["intersection", "bs1", "bs2"])

In [29]:
(intersetions_df.intersection<0.05).mean()

0.96617336152219868

That is, about 97% of bookshelf pairs share fewer than 5% of the books in the smaller of the two bookshelves.
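
The ratio used here, shared books divided by the size of the smaller shelf, is often called the overlap coefficient. Below is a minimal sketch of it on plain Python sets; the function name `overlap_coefficient` and the toy shelves are hypothetical. Unlike the cell above, which uses list lengths, this version deduplicates each shelf first, so the two can differ when a shelf lists the same book more than once.

```python
def overlap_coefficient(a, b):
    """Shared items divided by the size of the smaller collection."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # convention chosen here for empty shelves
    return len(a & b) / min(len(a), len(b))


# toy shelves, for illustration only
shelf_a = ["PG1", "PG2", "PG3", "PG4"]
shelf_b = ["PG3", "PG4"]          # fully contained in shelf_a
shelf_c = ["PG5", "PG6", "PG7"]   # disjoint from shelf_a

subset_ratio = overlap_coefficient(shelf_a, shelf_b)
disjoint_ratio = overlap_coefficient(shelf_a, shelf_c)
print(subset_ratio, disjoint_ratio)
```

A value of 1.0 means the smaller shelf is entirely contained in the larger one, which is how a pair of shelves can reach an intersection of 1.0 without being identical.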

In [30]:
intersetions_df.sort_values(by="intersection", ascending=False).head(n=25)

| | intersection | bs1 | bs2 |
|---|---|---|---|
| 724 | 1.0 | Banned_Books_(Bookshelf) | Banned_Books_from_Anne_Haight's_list_(Bookshelf) |
| 539 | 0.939655 | Animal_(Bookshelf) | Animals-Wild_(Bookshelf)-Mammals |
| 108 | 0.836935 | Bestsellers,_American,_1895-1923_(Bookshelf) | Bestsellers,_American,_1900-1923_(Bookshelf) |
| 229 | 0.362832 | United_States_(Bookshelf) | Children's_History_(Bookshelf) |
| 312 | 0.151786 | Classical_Antiquity_(Bookshelf) | Harvard_Classics_(Bookshelf) |
| 869 | 0.134454 | Movie_Books_(Bookshelf) | Best_Books_Ever_Listings_(Bookshelf) |
| 880 | 0.132184 | Banned_Books_from_Anne_Haight's_list_(Bookshelf) | Best_Books_Ever_Listings_(Bookshelf) |
| 725 | 0.132184 | Banned_Books_(Bookshelf) | Best_Books_Ever_Listings_(Bookshelf) |
| 192 | 0.109244 | Historical_Fiction_(Bookshelf) | Movie_Books_(Bookshelf) |
| 732 | 0.109195 | Banned_Books_(Bookshelf) | Harvard_Classics_(Bookshelf) |
