# HTRC: Easy classification from HathiTrust collections

In [None]:
from htrc import workset
import pandas as pd
from htrc_features import FeatureReader

The following two collections are sets of books about knitting and about sewing.

I'm using functionality from the [HTRC Python SDK](https://github.com/htrc/HTRC-PythonSDK) to grab volume ids from each collection.

In [None]:
knitids = workset.load_hathitrust_collection('https://babel.hathitrust.org/cgi/mb?a=listis&c=1174943610')
sewids = workset.load_hathitrust_collection('https://babel.hathitrust.org/cgi/mb?a=listis&c=973680817')
print("We have %d books about knitting and %d books about sewing" % (len(knitids), len(sewids)))

We have 106 books about knitting and 214 books about sewing


With the new online loading, HathiTrust ids are all you need to load the features for a volume. For example:

In [None]:
print("SAMPLE IDS", knitids[:10], "\nTITLES")
fr = FeatureReader(ids=knitids[:10])
for vol in fr.volumes():
    print(vol.title)

SAMPLE IDS ['mdp.39015060818443', 'pst.000016693067', 'mdp.39015056877270', 'inu.39000004664442', 'nyp.33433006775112', 'nnc1.cu56639260', 'caia.ark:/13960/t5n88cm6c', 'umn.31951000482982z', 'inu.30000100583941', 'loc.ark:/13960/t47p9n50g'] 
TITLES
A good yarn / Debbie Macomber.
A handy knitting library / [by Marti, pseud.
A history of hand knitting / Richard Rutt ; with foreword by Meg Swansen.
A history of hand knitting / Richard Rutt.
A manual of needlework, knitting and cutting out for evening continuation schools.
A text-book of needlework, knitting and cutting out, with methods of teaching.
A treatise on embroidery, crochet and knitting ... Edited by Miss Anna Grayson Ford [and others]. Comp. by George C. Perkins.
America's knitting book. Illustrated by Marjorie Tweed, Alan Howe, and Lyle Braden.
Anatolian knitting designs : Sivas stocking patterns / collected in an Istanbul shantytown by Betsy Harrell ; drawings by Betsy Harrell.
Art needlework.


## Example Use: classification

Collect the token frequencies for each of the knitting and sewing books, and concatenate those to a single `df` DataFrame.

In [None]:
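def get_clean_tokens(vol):
    # Only English-language volumes are used; fail loudly on anything else.
    if vol.language != 'eng':
        raise ValueError("Volume %s is not in English (language=%r)" % (vol.id, vol.language))
    # Volume-level, case-folded token counts without part-of-speech tags
    tl = (vol.tokenlist(case=False, pages=False, pos=False)
                .reset_index('section', drop=True)
                .reset_index()
           )
    # Record which volume each row came from
    tl['vol'] = vol.id
    return tl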

In [None]:
%%time
fr = FeatureReader(ids=knitids+sewids)
dfs = []
for vol in fr.volumes():
    dfs.append(get_clean_tokens(vol))
df = pd.concat(dfs)

CPU times: user 1min 35s, sys: 1.13 s, total: 1min 36s
Wall time: 2min 48s
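
Because `get_clean_tokens` raises on any non-English volume, a single such volume would stop the loop above. One possible alternative is to filter on language before extracting tokens. The sketch below is only an illustration: `FakeVolume` and `english_only` are hypothetical names, and `FakeVolume` models just the `id` and `language` attributes used in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for an htrc_features volume; only the two
# attributes used in this notebook (`id`, `language`) are modelled.
FakeVolume = namedtuple("FakeVolume", ["id", "language"])

def english_only(volumes, lang="eng"):
    """Yield only the volumes whose language code matches `lang`."""
    for vol in volumes:
        if vol.language == lang:
            yield vol

volumes = [
    FakeVolume("vol.1", "eng"),
    FakeVolume("vol.2", "fre"),
    FakeVolume("vol.3", "eng"),
]
kept_ids = [vol.id for vol in english_only(volumes)]
print(kept_ids)
```

Under that assumption, the loop header would become `for vol in english_only(fr.volumes()):`.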


In [None]:
df.sample(5)

Unnamed: 0,count,index,lowercase,section,vol
6534,1.0,,paraffin,,inu.30000108723713
514,1.0,,illustrator,,mdp.39015061342401
3701,1.0,,patfoftduca,,uc1.$b243421
1047,1.0,,academy,,wu.89055826556
2563,29.0,,were,,uma.ark:/13960/t2n614s14


Trim the tokens to words that show up more than 400 times across the corpus, are entirely alphabetical, include at least one lowercase character, and are more than 2 characters long.

Then, convert the long vol/token/count DataFrame to a wide one, where rows are documents, columns are tokens, and the cells show the count of each.

In [None]:
word_sums = df.groupby('lowercase')['count'].sum()
whitelist = word_sums[(word_sums > 400) & 
                      word_sums.index.str.isalpha() & 
                      word_sums.index.str.contains("[a-z]") &
                      (word_sums.index.to_series().apply(len) > 2)].index.values

filtered_df = df[df.lowercase.isin(whitelist)]
wide_df = filtered_df.pivot(index='vol', columns='lowercase', values='count').fillna(0)
wide_df.head()

vol,abbreviations,ability,able,about,above,accessories,according,account,accurate,accurately,...,your,yourself,zigzag,zipper,zippers,ﬁne,ﬁnish,ﬁnished,ﬁrst,ﬂat
aeu.ark:/13960/t3126jv35,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aeu.ark:/13960/t5t72tb3n,1.0,0.0,0.0,42.0,18.0,0.0,3.0,1.0,0.0,0.0,...,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
caia.ark:/13960/t03x9j98z,0.0,0.0,0.0,72.0,67.0,0.0,14.0,0.0,0.0,1.0,...,181.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
caia.ark:/13960/t13n3fp1h,0.0,1.0,3.0,8.0,11.0,0.0,3.0,0.0,0.0,2.0,...,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
caia.ark:/13960/t2r50x34b,0.0,0.0,0.0,33.0,34.0,0.0,7.0,0.0,0.0,0.0,...,49.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
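
The filter-then-pivot step above can be easier to follow on a toy table. This is a minimal sketch with made-up data and an arbitrary threshold of 3 rather than 400; the names `long_df`, `totals`, `keep`, and `wide` are illustrative only.

```python
import pandas as pd

# Toy long-format table: one row per (volume, token) pair.
long_df = pd.DataFrame({
    "vol": ["a", "a", "b", "b", "b"],
    "lowercase": ["knit", "purl", "knit", "hem", "xx"],
    "count": [5, 2, 1, 4, 9],
})

# Corpus-wide totals per token, then the same style of filter as above
# (the threshold of 3 is an arbitrary value chosen for this toy data).
totals = long_df.groupby("lowercase")["count"].sum()
keep = totals[(totals > 3)
              & totals.index.str.isalpha()
              & (totals.index.str.len() > 2)].index

# Long to wide: rows are volumes, columns are tokens, cells are counts.
wide = (long_df[long_df["lowercase"].isin(keep)]
        .pivot(index="vol", columns="lowercase", values="count")
        .fillna(0))
print(wide)
```

Note that `pivot` expects each (volume, token) pair to appear only once; duplicated pairs raise an error.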


In [None]:
labels = pd.DataFrame([knitids + sewids,["knit"] * len(knitids) + ["sew"] * len(sewids)], index=['volid', 'label']).T.set_index('volid')
labels.head()

volid,label
mdp.39015060818443,knit
pst.000016693067,knit
mdp.39015056877270,knit
inu.39000004664442,knit
nyp.33433006775112,knit


## Train Classifier

With the data in this wide format, it can easily be handed to any number of scikit-learn algorithms. Here, we build a classifier.

### Randomize data for training

Shuffle the rows of `wide_df`, then look up the label for each shuffled volume id so that `sample_labels` is in the same order as `sample`.

In [None]:
sample = wide_df.sample(frac=1)
sample_labels = labels.loc[sample.index]
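
The alignment matters because scikit-learn pairs rows and labels by position, not by index. A small sketch on invented data shows the idea; the frames, ids, and the fixed `random_state` are assumptions made only so the example is repeatable.

```python
import pandas as pd

features = pd.DataFrame({"knit": [3, 0, 1], "hem": [0, 4, 2]},
                        index=["v1", "v2", "v3"])
labels = pd.DataFrame({"label": ["knit", "sew", "sew"]},
                      index=["v1", "v2", "v3"])

# Shuffle the rows; random_state is fixed here only to make the sketch repeatable.
shuffled = features.sample(frac=1, random_state=0)
# Reorder labels by the shuffled index so row i of each frame is the same volume.
shuffled_labels = labels.loc[shuffled.index]

print(list(shuffled.index) == list(shuffled_labels.index))
```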

### Train a Naive Bayes classifier

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(sample[:-10], sample_labels.label[:-10])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Score the classifier accuracy on 10 held-out texts

This is not a particularly large test set, so treat it as a sanity check rather than a real evaluation. Perfect accuracy is a good sign, though.

In [None]:
clf.score(sample[-10:], sample_labels.label[-10:])

1.0
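
With only 10 held-out texts, the score can swing a lot from one shuffle to the next. One common way to get a steadier estimate is k-fold cross-validation. The sketch below is not part of the original analysis: it uses synthetic Poisson counts as a stand-in for `wide_df`, and the sizes, rates, and `cv=5` are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Synthetic count data standing in for wide_df: 60 "documents", 20 "tokens".
# Class 0 favours the first 10 tokens, class 1 the last 10.
rates = np.ones((2, 20))
rates[0, :10] = 5
rates[1, 10:] = 5
y = np.repeat([0, 1], 30)
X = rng.poisson(rates[y])

# Five stratified folds; each document is held out exactly once.
scores = cross_val_score(MultinomialNB(), X, y, cv=5)
print(scores.mean())
```

On the notebook's data, the equivalent call would presumably be `cross_val_score(MultinomialNB(), wide_df, labels.loc[wide_df.index].label, cv=5)`.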

## What words are 'knitting' vs 'sewing' words?

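Here, I inspect the words with the lowest probability under each class. Because `sort_values()` sorts in ascending order, the words printed under each heading are the ones least likely for that class: the list under 'Knitting' is in effect distinctively sewing vocabulary, and the list under 'Sewing' is distinctively knitting vocabulary.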

In [None]:
for i, word in enumerate(['Knitting', 'Sewing']):
    print("\n%s words" % word)
    print("\t".join(pd.Series(clf.feature_log_prob_[i], index=wide_df.columns).sort_values().index.values[:120]))


Knitting words
topstitch	valance	fax	overhanding	bastings	underlining	fusible	armseye	fastener	chiffon	hemstitching	interlining	belting	perforations	fasteners	weaves	notches	snaps	tailoring	drapery	gingham	allowances	taffeta	lapped	basted	layout	unlined	waists	markings	furnishings	linens	shears	draperies	cording	plaids	seamline	crisp	wrinkles	grades	tum	shirring	interfacing	lingerie	transparent	overhand	upholstery	tacks	butterick	ans	slash	tracing	laundering	alteration	envelope	chap	ruffles	dressmaker	straighten	meat	flounce	prints	screw	draft	notch	thimbles	faced	laundry	crotch	grain	glue	corded	alterations	zippers	furniture	stains	stain	fat	piping	par	dressmaking	tailored	pupils	tailor	economics	drafting	sheer	snap	boiling	quilted	damask	pupil	padding	corset	tests	shuttle	singer	ﬁne	varieties	applying	lawn	windows	cambric	elbow	plaid	overcasting	sugar	slot	scalloped	dull	location	dust	salt	connect	plaits	task	basting	blind	fly	ruffle	slipstitch

Sewing words
rnd	yfwd	bethanne	ribber
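
Ranking words within a single class says little about which words separate the two classes, since very common words are probable under both. One alternative, which is not the method used in this notebook, is to rank words by the difference between the two classes' log-probabilities. The following is a minimal sketch on an invented five-word vocabulary; the data and the name `lean` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

vocab = ["knit", "purl", "hem", "seam", "the"]
X = np.array([
    [9, 7, 0, 1, 20],   # knitting-like document
    [8, 6, 1, 0, 18],   # knitting-like document
    [0, 1, 9, 8, 21],   # sewing-like document
    [1, 0, 7, 9, 19],   # sewing-like document
])
y = ["knit", "knit", "sew", "sew"]

clf = MultinomialNB().fit(X, y)
# Rows of feature_log_prob_ follow clf.classes_, which is sorted: ['knit', 'sew'].
log_prob = pd.DataFrame(clf.feature_log_prob_, index=clf.classes_, columns=vocab)
# Positive values lean towards knitting, negative towards sewing.
lean = (log_prob.loc["knit"] - log_prob.loc["sew"]).sort_values(ascending=False)
print(lean.index.tolist())
```

Words at the top of `lean` are the most knitting-specific, words at the bottom the most sewing-specific, and shared words such as "the" fall in the middle.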