# PG Bookshelves
The subject tags are very messy, so I am exploring the Project Gutenberg (PG) bookshelves, which might be more carefully curated.

See the [Bookshelf category on the PG wiki](http://www.gutenberg.org/wiki/Category:Bookshelf).

### Getting the data
I scraped the PG wiki; the saved pages are in `data/bookshelves/`.

In [19]:
import numpy as np
import pandas as pd
import glob
import lxml.html

import matplotlib.pyplot as plt
%matplotlib inline

In [20]:
BS_paths = glob.glob("../data/bookshelves/*Bookshelf*")
BS = [path.split("/")[-1] for path in BS_paths]

In [21]:
BS_dict = {}
for path in BS_paths:
    bs = path.split("/")[-1]
    BS_dict[bs] = []
    with open(path, "r") as f:
        dom = lxml.html.fromstring(f.read())
        # select the URL in the href attribute of every <a> tag (link)
        for link in dom.xpath('//a/@href'):
            # keep links to ebooks, skipping search links
            if "ebooks" in link and "search" not in link:
                # the last path segment of the link is the book's PG id
                PGid = "PG" + link.split("/")[-1]
                BS_dict[bs].append(PGid)

    # delete empty BSs
    if len(BS_dict[bs])==0:
        del BS_dict[bs]
    
# recompose list of BSs
BS = list(BS_dict.keys())

# list of unique PGids
PGids = list(set(np.concatenate(list(BS_dict.values()))))
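
The link filter used in the loop above can be tried in isolation. The following is a minimal sketch, not part of the original notebook: the helper name `href_to_pgid` and the sample hrefs are made up for illustration, and it only reproduces the rule "contains `ebooks`, does not contain `search`".

```python
def href_to_pgid(href):
    """Return 'PG<id>' for an ebook link, or None for any other link."""
    if "ebooks" in href and "search" not in href:
        # the last path segment is taken as the book id
        return "PG" + href.split("/")[-1]
    return None


# hypothetical hrefs, of the kind a bookshelf page might contain
sample_hrefs = [
    "/ebooks/1342",
    "/ebooks/search/?query=austen",
    "/wiki/Main_Page",
    "http://www.gutenberg.org/ebooks/84",
]

pgids = [p for p in map(href_to_pgid, sample_hrefs) if p is not None]
print(pgids)  # ['PG1342', 'PG84']
```

Note that the rule does not check that the last path segment is numeric, so an unusual ebook link could yield a malformed id.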

In [22]:
# membership matrix: rows are books, columns are bookshelves;
# True marks membership, all other cells stay NaN
df = pd.DataFrame(index = PGids, columns = BS)
for k,v in BS_dict.items():
    df.loc[v, k] = True
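
To make the shape of this table concrete, here is a small sketch on toy data. The shelf names and ids are invented, and unlike the cell above it fills the table with `False` instead of leaving NaN, which is an assumption made here so that the columns keep a boolean dtype and sums count members directly.

```python
import pandas as pd

# toy stand-in for BS_dict (hypothetical shelves and ids)
toy_shelves = {
    "Shelf_A": ["PG1", "PG2"],
    "Shelf_B": ["PG2", "PG3"],
}

# sorted list of unique book ids across all shelves
book_ids = sorted({pgid for members in toy_shelves.values() for pgid in members})

# start from an all-False table, then mark memberships
membership = pd.DataFrame(False, index=book_ids, columns=list(toy_shelves))
for shelf, members in toy_shelves.items():
    membership.loc[members, shelf] = True

shelf_sizes = membership.sum()             # number of books per shelf
shelves_per_book = membership.sum(axis=1)  # number of shelves per book
print(membership)
```

Column sums give the size of each shelf and row sums give the number of shelves a book appears on.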

In [24]:
df.to_pickle("../data/bookshelves.p")

In [17]:
# Bookshelves with more than 100 books
sdf = df.loc[:, df.notna().sum() > 100]
# keep only the books that belong to at least one of these bookshelves
sdf = sdf.loc[sdf.notna().any(axis=1)]
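
The overlap check in the next cell divides the size of the intersection of two shelves by the size of the smaller shelf. A minimal sketch of that ratio on toy lists follows; the function name and data are hypothetical. One difference from the notebook cell is an assumption of this sketch: both inputs are converted to sets first, so an id listed twice on a shelf is counted once.

```python
def overlap_ratio(shelf_a, shelf_b):
    """Size of the intersection divided by the size of the smaller shelf."""
    a, b = set(shelf_a), set(shelf_b)
    if not a or not b:
        return 0.0  # avoid dividing by zero for an empty shelf
    return len(a & b) / min(len(a), len(b))


# toy shelves; "PG3" is deliberately listed twice on the first one
shelf_x = ["PG1", "PG2", "PG3", "PG3"]
shelf_y = ["PG3", "PG4"]

ratio = overlap_ratio(shelf_x, shelf_y)
print(ratio)  # 0.5
```

A ratio of 1 means the smaller shelf is entirely contained in the larger one; 0 means the shelves share no books.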

In [18]:
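# overlap between each pair of large bookshelves, as a fraction of the
# smaller shelf; the overlaps turn out to be small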
from itertools import combinations
for s1, s2 in combinations(sdf.columns, 2):
    l1 = len(BS_dict[s1])
    l2 = len(BS_dict[s2])
    l3 = len(np.intersect1d(BS_dict[s1], BS_dict[s2]))
    intratio = l3/min(l1, l2)
    if intratio>0:
        print(intratio, s1, s2)

0.013071895424836602 Science_Fiction_(Bookshelf) Mathematics_(Bookshelf)
0.0038461538461538464 Science_Fiction_(Bookshelf) Animal_(Bookshelf)
0.10084033613445378 Science_Fiction_(Bookshelf) Movie_Books_(Bookshelf)
0.01834862385321101 Science_Fiction_(Bookshelf) Best_Books_Ever_Listings_(Bookshelf)
0.023809523809523808 Humor_(Bookshelf) Bestsellers,_American,_1895-1923_(Bookshelf)
0.008849557522123894 Humor_(Bookshelf) United_States_(Bookshelf)
0.02976190476190476 Humor_(Bookshelf) Children's_Picture_Books_(Bookshelf)
0.011904761904761904 Humor_(Bookshelf) World_War_I_(Bookshelf)
0.023809523809523808 Humor_(Bookshelf) Bestsellers,_American,_1900-1923_(Bookshelf)
0.03571428571428571 Humor_(Bookshelf) Best_Books_Ever_Listings_(Bookshelf)
0.009009009009009009 Humor_(Bookshelf) Detective_Fiction_(Bookshelf)
0.005952380952380952 Humor_(Bookshelf) United_Kingdom_(Bookshelf)
0.008064516129032258 Humor_(Bookshelf) Christmas_(Bookshelf)
0.011787819253438114 Bestsellers,_American,_1895-1923_(Book