## 09_Create_Features_Dictionary

Author: Daniel Hui

License: MIT

This notebook takes the book code data dictionary and creates features from the information held within it.

In [2]:
import pandas as pd
from collections import defaultdict
pd.set_option('display.max_rows', 125)

### Load Book Inventory Data
We only need to load the book inventory to do some EDA. It is not necessary for creating features. It is useful to see how many books in the collection fall into each category.

In [3]:
inventory_df = pd.read_csv('../01_Data/03_Cleaned/Library_Collection_Inventory_jan_2018_clean.csv',index_col=0)


In [4]:
books_df = inventory_df[["BibNum","Title","ISBN","PublicationYear","Subjects","ItemCollection"]]
books_df = books_df.drop_duplicates()
books_df.head()

Unnamed: 0,BibNum,Title,ISBN,PublicationYear,Subjects,ItemCollection
0,3177276,Day of the Dead.,,2016.0,"Rock music 2011 2020, Rock music",naover
1,395432,Swan Lake / Ann Nugent.,812056744.0,1985.0,Swan lake Choreographic work,canf
2,123754,Best short stories of Jack London.,,1945.0,,cs3fic
3,193328,The comedy of errors.,,1962.0,,canf
4,1764894,Below the belt : play / by Richard Dresser.,573696306.0,1997.0,,canf


There are 567,995 unique records in the book inventory. A title may have multiple records because different copies of the book are held in different subcollections of the library.

In [5]:
len(books_df)  

567995

There are 397,147 unique titles (distinct `BibNum` values) in the collection.

In [6]:
len(books_df["BibNum"].drop_duplicates()) 

397147
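The difference between the two counts above (records versus titles) can be illustrated on a toy frame. This is a minimal sketch: `toy_books` and its rows are made up for illustration and are not taken from the inventory.

```python
import pandas as pd

# Toy stand-in for books_df; the values are illustrative only.
toy_books = pd.DataFrame({
    "BibNum": [1, 1, 2, 3, 3, 3],
    "ItemCollection": ["canf", "nanf", "cafic", "canf", "canf", "nanew"],
})

# Unique records: distinct rows after dropping exact duplicates.
records = len(toy_books.drop_duplicates())

# Unique titles: distinct BibNum values, regardless of subcollection.
titles = toy_books["BibNum"].nunique()

print(records, titles)
```

`Series.nunique()` gives the same number as `len(series.drop_duplicates())` when the column has no missing values.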

### Load Book Code Data Dictionary
This is the basis of what we will use to create features

In [7]:
dictionary_df = pd.read_csv('../01_Data/02_Truncated/Data_Dictionary_Book_Codes.csv',index_col=0)
dictionary_df.head(5)

Unnamed: 0,Code,Description,Code Type,Format Group,Format Subgroup,Category Group,Category Subgroup
0,cabob,CA-Books on Bikes,ItemCollection,Print,Book,,
1,caesl,CA1-ESL,ItemCollection,Print,Book,,
2,caesla,CA1-ESL Advanced,ItemCollection,Print,Book,,
3,caeslb,CA1-ESL Beginning,ItemCollection,Print,Book,,
4,caeslc,CA1-ESL Citizenship,ItemCollection,Print,Book,,


There are 212 unique book codes in the dictionary

In [15]:
len(dictionary_df)  

212

### EDA: How many categories does each book fall into? 
I'm curious how many different subcollections each book may fall into

In [9]:
books_df[["BibNum","ItemCollection"]].groupby(["BibNum"]).count().reset_index().sort_values(by="ItemCollection",ascending=False).head(5)

Unnamed: 0,BibNum,ItemCollection
288076,2919133,7
383996,3274363,7
107621,1923072,7
376481,3244913,6
391177,3297285,6
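One caveat on the count above: `count()` tallies rows, not distinct codes. Because `books_df` was de-duplicated on several columns (including `ISBN`), the same code could appear more than once for a title. A hedged sketch of the distinction, using a made-up `toy_books` frame:

```python
import pandas as pd

# Illustrative rows only: title 10 has two ISBNs filed under the same code.
toy_books = pd.DataFrame({
    "BibNum": [10, 10, 10, 20, 20, 30],
    "ISBN": ["a", "b", "a", "c", "c", "d"],
    "ItemCollection": ["canf", "canf", "nanf", "cafic", "nafic", "canf"],
})

grouped = toy_books.groupby("BibNum")["ItemCollection"]
rows_per_title = grouped.count()      # number of rows per title
codes_per_title = grouped.nunique()   # number of distinct codes per title

# Titles with the most distinct codes.
top = codes_per_title.nlargest(2)
print(rows_per_title.to_dict(), codes_per_title.to_dict())
```

If the two series agree on the real data, the row count is a safe proxy for the number of subcollections.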


The book with BibNum 2919133 is associated with 7 different codes. The cell below shows which book it is and which codes it carries:

In [10]:
excess_code_df = inventory_df[inventory_df["BibNum"]==2919133][["Title","BibNum","ItemCollection"]].drop_duplicates()
excess_code_df = excess_code_df.rename({"ItemCollection":"Code"},axis=1)
pd.merge(excess_code_df,dictionary_df,on="Code",how="left")

Unnamed: 0,Title,BibNum,Code,Description,Code Type,Format Group,Format Subgroup,Category Group,Category Subgroup
0,"March. Book one / [written by] John Lewis, And...",2919133,nanf,NA-Nonfiction,ItemCollection,Print,Book,Nonfiction,
1,"March. Book one / [written by] John Lewis, And...",2919133,nacomic,NA-Comics/Graphic Novels,ItemCollection,Print,Book,Fiction,
2,"March. Book one / [written by] John Lewis, And...",2919133,nycomic,NY-Teen - Comics & Graphic Novels,ItemCollection,Print,Book,Fiction,
3,"March. Book one / [written by] John Lewis, And...",2919133,pknf,Peak Picks-Nonfiction,ItemCollection,Print,Book,Nonfiction,
4,"March. Book one / [written by] John Lewis, And...",2919133,naaanf,AfAm - Nf,ItemCollection,Print,Book,Nonfiction,
5,"March. Book one / [written by] John Lewis, And...",2919133,nanew,NA-New Adult Books,ItemCollection,Print,Book,Fiction,
6,"March. Book one / [written by] John Lewis, And...",2919133,canf,CA-Nonfiction,ItemCollection,Print,Book,Nonfiction,


### EDA: How Many books are associated with each book code? 

In [16]:
categories_df = books_df.groupby(["ItemCollection"]).count()
categories_df = categories_df.sort_values(["BibNum"],ascending=False)
categories_df = categories_df.reset_index()[["ItemCollection","BibNum"]]
categories_df = categories_df.rename({"BibNum":"Books","ItemCollection":"Code"},axis=1)
len(categories_df)  # 104 book codes are used in the collection

104

Only 104 of the 212 codes in the dictionary are actually used in the collection. The table below joins each used code to its dictionary details and shows how many books carry it.

In [14]:
codes_df = pd.merge(categories_df,dictionary_df,on="Code",how="left")
codes_df.head(10)

Unnamed: 0,Code,Books,Description,Code Type,Format Group,Format Subgroup,Category Group,Category Subgroup
0,canf,192898,CA-Nonfiction,ItemCollection,Print,Book,Nonfiction,
1,nanf,57152,NA-Nonfiction,ItemCollection,Print,Book,Nonfiction,
2,cafic,29968,CA3-Fiction,ItemCollection,Print,Book,Fiction,
3,caln,22340,CA1-Language,ItemCollection,Print,Book,Language,
4,nafic,21169,NA-Fiction,ItemCollection,Print,Book,Fiction,
5,ncnf,14331,NC-Children's Nonfiction,ItemCollection,Print,Book,Nonfiction,
6,ccnf,12846,CC-Children's Nonfiction,ItemCollection,Print,Book,Nonfiction,
7,cab,12081,CA9-Biography,ItemCollection,Print,Book,Nonfiction,Biography
8,ncpic,11195,NC--Children's Picture Books,ItemCollection,Print,Book,,Picture
9,ccpic,10766,CC-Children's Picture Books,ItemCollection,Print,Book,,Picture
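A left merge like the one above silently leaves dictionary columns empty for any code that has no dictionary entry. One way to check for that is the `indicator` option of `pd.merge`. This is a sketch on made-up frames; the code `zzz` is hypothetical and only there to trigger a miss.

```python
import pandas as pd

# Illustrative stand-ins for categories_df and dictionary_df.
toy_categories = pd.DataFrame({"Code": ["canf", "nanf", "zzz"], "Books": [5, 3, 1]})
toy_dictionary = pd.DataFrame({
    "Code": ["canf", "nanf", "cafic"],
    "Description": ["CA-Nonfiction", "NA-Nonfiction", "CA3-Fiction"],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = pd.merge(toy_categories, toy_dictionary, on="Code", how="left", indicator=True)

# Codes used in the collection but missing from the dictionary.
unmatched = merged.loc[merged["_merge"] == "left_only", "Code"].tolist()
print(unmatched)
```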


### EDA: Examining the Category Subgroup Column

In [12]:
codes_df["Category Subgroup"].unique()   #all the categories

array([nan, 'Biography', 'Picture', 'Large Print', 'Holiday'],
      dtype=object)

How many biography books? 

In [13]:
codes_df[codes_df["Category Subgroup"] == "Biography"]

Unnamed: 0,Code,Books,Description,Code Type,Format Group,Format Subgroup,Category Group,Category Subgroup
7,cab,12081,CA9-Biography,ItemCollection,Print,Book,Nonfiction,Biography
39,nab,2494,NA-Biography,ItemCollection,Print,Book,Nonfiction,Biography
49,ncb,1443,NC-Children's Biographies,ItemCollection,Print,Book,Nonfiction,Biography
51,ccb,1308,CC-Children's Biographies,ItemCollection,Print,Book,Nonfiction,Biography
60,naaab,722,AfAm - Biographies,ItemCollection,Print,Book,Nonfiction,Biography
64,cyb,368,CY9-Teen Biography,ItemCollection,Print,Book,Nonfiction,Biography
68,nyb,278,NY-Teen - Biography,ItemCollection,Print,Book,Nonfiction,Biography


In [14]:
codes_df[codes_df["Category Subgroup"] == "Biography"]["Books"].sum()   #18,694 Biographies

18694

How many picture books?

In [15]:
codes_df[codes_df["Category Subgroup"] == "Picture"]

Unnamed: 0,Code,Books,Description,Code Type,Format Group,Format Subgroup,Category Group,Category Subgroup
8,ncpic,11195,NC--Children's Picture Books,ItemCollection,Print,Book,,Picture
9,ccpic,10766,CC-Children's Picture Books,ItemCollection,Print,Book,,Picture
44,ccrdr,1808,CC-Children's Readers,ItemCollection,Print,Book,Fiction,Picture
98,ncbb,2,NC-Children's Board Books,ItemCollection,Print,Book,,Picture


In [16]:
codes_df[codes_df["Category Subgroup"] == "Picture"]["Books"].sum()   #23,771 Picture Books

23771

How many Large Print books?

In [17]:
codes_df[codes_df["Category Subgroup"] == "Large Print"]

Unnamed: 0,Code,Books,Description,Code Type,Format Group,Format Subgroup,Category Group,Category Subgroup
24,calpfic,4704,CA3-Large Print Fiction,ItemCollection,Print,Book,Fiction,Large Print
25,nalpfic,3975,NA-Large Print Fiction,ItemCollection,Print,Book,Fiction,Large Print
53,cs1malp,1251,CS 1 - MOB Large Print Nonfiction,ItemCollection,Print,Book,Nonfiction,Large Print
55,calpnf,925,CA3-Large Print Nonfiction,ItemCollection,Print,Book,Nonfiction,Large Print
57,nalpnf,852,NA-Large Print Nonfiction,ItemCollection,Print,Book,Nonfiction,Large Print
82,cylp,102,CY3-Teen Large Print,ItemCollection,Print,Book,Fiction,Large Print
85,cclpfic,57,CC-Children's Large Print,ItemCollection,Print,Book,Fiction,Large Print
87,nylp,50,NY-Teen Large Print,ItemCollection,Print,Book,Fiction,Large Print
88,nclp,35,NC-Children's Large Print,ItemCollection,Print,Book,Fiction,Large Print
90,nalpnew,23,NA-Large Print New,ItemCollection,Print,Book,Fiction,Large Print


In [18]:
codes_df[codes_df["Category Subgroup"] == "Large Print"]["Books"].sum()   #11,974 Large Print Books

11974

How Many Holiday Books? 

In [19]:
codes_df[codes_df["Category Subgroup"] == "Holiday"]

Unnamed: 0,Code,Books,Description,Code Type,Format Group,Format Subgroup,Category Group,Category Subgroup
34,nchol,2723,NC-Children's Holiday,ItemCollection,Print,Book,Fiction,Holiday
45,cchol,1803,CC-Children's Holiday,ItemCollection,Print,Book,Fiction,Holiday
72,nahol,249,NA-Holiday - Adult,ItemCollection,Print,Book,Nonfiction,Holiday
84,ncholsk,63,NC-Children's Holiday-STK,ItemCollection,Print,Book,Fiction,Holiday


In [20]:
codes_df[codes_df["Category Subgroup"] == "Holiday"]["Books"].sum()   #4,838 Holiday Books

4838
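The per-subgroup totals computed one at a time above can also be produced in a single `groupby`. A minimal sketch, using only the Picture and Holiday rows shown earlier plus one code without a subgroup; `codes_sample` is a stand-in name for this excerpt of `codes_df`.

```python
import pandas as pd

# Excerpt of codes_df, copied from the tables above.
codes_sample = pd.DataFrame({
    "Code": ["ncpic", "ccpic", "ccrdr", "ncbb",
             "nchol", "cchol", "nahol", "ncholsk", "canf"],
    "Books": [11195, 10766, 1808, 2, 2723, 1803, 249, 63, 192898],
    "Category Subgroup": ["Picture"] * 4 + ["Holiday"] * 4 + [None],
})

# Rows with a missing subgroup are dropped by groupby by default.
totals = codes_sample.groupby("Category Subgroup")["Books"].sum()
print(totals)
```

The totals match the individual cells: 23,771 picture books and 4,838 holiday books.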

### EDA: Explore the Category Group Column

In [21]:
codes_df["Category Group"].unique()    # 3 category groups, plus NaN for codes without one

array(['Nonfiction', 'Fiction', 'Language', nan], dtype=object)

How Many Language Books?

In [22]:
codes_df[codes_df["Category Group"] == "Language"]

Unnamed: 0,Code,Books,Description,Code Type,Format Group,Format Subgroup,Category Group,Category Subgroup
3,caln,22340,CA1-Language,ItemCollection,Print,Book,Language,
12,ccln,8076,CC-Children's Languages,ItemCollection,Print,Book,Language,
17,naln,6200,NA-Languages Collection,ItemCollection,Print,Book,Language,
22,ncln,5551,NC-Children's Languages,ItemCollection,Print,Book,Language,
66,calnr,321,CA1-World Languages Reference,ItemCollection,Print,Book,Language,
101,cs1lew,1,CS 1 - Language,ItemCollection,Print,Book,Language,


In [23]:
codes_df[codes_df["Category Group"] == "Language"]["Books"].sum()   #42,489 Books

42489

How Many Non-Fiction Books?

In [24]:
codes_df[codes_df["Category Group"] == "Nonfiction"]["Books"].sum()  #328,129 Books

328129

How Many Fiction Books?

In [25]:
codes_df[codes_df["Category Group"] == "Fiction"]["Books"].sum()      #175,414 Books

175414

### EDA: Explore the other Columns

In [17]:
codes_df["Format Subgroup"].unique()   #Nothing to see here

array(['Book'], dtype=object)

In [18]:
codes_df["Format Group"].unique()      #Nothing to see here

array(['Print'], dtype=object)

In [19]:
codes_df["Description"].unique()        #lot of potential features here

array(['CA-Nonfiction', 'NA-Nonfiction', 'CA3-Fiction', 'CA1-Language',
       'NA-Fiction', "NC-Children's Nonfiction",
       "CC-Children's Nonfiction", 'CA9-Biography',
       "NC--Children's Picture Books", "CC-Children's Picture Books",
       "NC-Children's Fiction", 'CA3-Mystery', "CC-Children's Languages",
       "CC-Children's Fiction", 'NA-Mysteries',
       'CS 1 - MOB Large Print Fiction', 'CA3-New Adult Books',
       'NA-Languages Collection', 'NY-Teen - Comics & Graphic Novels',
       'NY-Teen - Fiction', 'NC-Easy Nonfiction', 'NA-New Adult Books',
       "NC-Children's Languages", 'AfAm - Nf', 'CA3-Large Print Fiction',
       'NA-Large Print Fiction', 'CA3-Science Fiction',
       "NC-Children's New Materials", 'CY-Teen Nonfiction',
       'CY3-Teen Comics & Graphic Novels', 'CY3-Teen Fiction',
       'NA-Sci-Fic/Fantasy', 'NA-Comics/Graphic Novels',
       'NY-Teen - Nonfiction', "NC-Children's Holiday", 'NA-Oversize',
       'NC-Easy Fiction', "NC-Children's Series

Looking at the list, I want to pull out categories such as teen, children, comic, African American, and mystery.

### Features - Process the Description Column

In [20]:
def search(row, description):
    """Return 1 if the substring `description` occurs in the string `row`, else 0."""
    return int(description in row)

In [21]:
genre_df = pd.DataFrame()
genre_df["Code"] = codes_df["Code"]
genre_df["Description"] = codes_df["Description"].apply(lambda x: x.lower())         #make lowercase
genre_df["Children"] = genre_df["Description"].apply(search, description="child")
genre_df["Teen"] = genre_df["Description"].apply(search, description="teen")
genre_df["AfAm"] = genre_df["Description"].apply(search, description="afam")
genre_df["Mystery"] = genre_df["Description"].apply(search, description="myster")
genre_df["Comic"] = genre_df["Description"].apply(search, description="comic")       #Comic Books
genre_df["Graphic"] = genre_df["Description"].apply(search, description="graphic")   #sometimes 'Graphic Novel'
genre_df["Comic"] = genre_df["Comic"] + genre_df["Graphic"]                          #Combine
genre_df["Comic"] = genre_df["Comic"].apply(lambda x: min(x, 1))                     #cap max at 1
genre_df = genre_df.drop(["Graphic"],axis=1)
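The same keyword flags could also be built with pandas string methods instead of a helper function. This is a sketch, not the notebook's method: it assumes the same keywords as the cell above and merges "comic" and "graphic" into one regular expression. The sample descriptions are taken from the `Description` output shown earlier.

```python
import pandas as pd

# A few descriptions from the dictionary, lowercased as in the cell above.
descriptions = pd.Series([
    "NY-Teen - Comics & Graphic Novels",
    "NA-Mysteries",
    "CC-Children's Fiction",
    "AfAm - Nf",
    "CA-Nonfiction",
])
lowered = descriptions.str.lower()

# Feature name -> pattern to look for; "|" means "either" in a regex.
keywords = {
    "Children": "child",
    "Teen": "teen",
    "AfAm": "afam",
    "Mystery": "myster",
    "Comic": "comic|graphic",
}

# str.contains returns booleans; astype(int) turns them into 0/1 flags.
flags = pd.DataFrame({
    name: lowered.str.contains(pattern).astype(int)
    for name, pattern in keywords.items()
})
print(flags)
```

Using a single pattern for `Comic` removes the need for the temporary `Graphic` column and the cap at 1.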

In [22]:
genre_df.tail(10)

Unnamed: 0,Code,Description,Children,Teen,AfAm,Mystery,Comic
94,cynew,cy3-teen new books,0,1,0,0,0
95,nypb,ny-teen paperbacks,0,1,0,0,0
96,cyser,cy3-teen series,0,1,0,0,0
97,ccnew,cc-children's new materials,1,0,0,0,0
98,ncbb,nc-children's board books,1,0,0,0,0
99,nahmwk,na-homework,0,0,0,0,0
100,cs1fic,cs 3 - fiction,0,0,0,0,0
101,cs1lew,cs 1 - language,0,0,0,0,0
102,navidg,na-video guides,0,0,0,0,0
103,nana,na-native american,0,0,0,0,0


### Features - Category Subgroup Dummies

In [23]:
subgroup_df = pd.get_dummies(codes_df["Category Subgroup"])
subgroup_df = subgroup_df.merge(codes_df, left_index=True, right_index=True,how='left')
subgroup_df = subgroup_df[["Code","Biography","Large Print","Picture"]]
subgroup_df.head()

Unnamed: 0,Code,Biography,Large Print,Picture
0,canf,0,0,0
1,nanf,0,0,0
2,cafic,0,0,0
3,caln,0,0,0
4,nafic,0,0,0
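Note that the cell above keeps three of the four subgroup values; `Holiday` is not selected. For reference, here is a hedged sketch of how `pd.get_dummies` behaves on a small sample, keeping every subgroup and requesting integer 0/1 columns explicitly (newer pandas versions return booleans by default). `codes_sample` is an illustrative excerpt, not the full `codes_df`.

```python
import pandas as pd

# One code per subgroup, plus one code without a subgroup.
codes_sample = pd.DataFrame({
    "Code": ["canf", "cab", "ncpic", "calpfic", "nchol"],
    "Category Subgroup": [None, "Biography", "Picture", "Large Print", "Holiday"],
})

# Missing subgroups become all-zero rows; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(codes_sample["Category Subgroup"], dtype=int)

# Both frames share the same index, so a column-wise concat lines them up.
subgroup_sample = pd.concat([codes_sample[["Code"]], dummies], axis=1)
print(subgroup_sample)
```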


### Features - Category Group Dummies

In [26]:
group_df = pd.get_dummies(codes_df["Category Group"])
group_df = group_df.merge(codes_df, left_index=True, right_index=True,how='left')
group_df = group_df[["Code","Fiction","Language","Nonfiction"]]
group_df["Category"] = codes_df["Category Group"]
group_df.head()

Unnamed: 0,Code,Fiction,Language,Nonfiction,Category
0,canf,0,0,1,Nonfiction
1,nanf,0,0,1,Nonfiction
2,cafic,1,0,0,Fiction
3,caln,0,1,0,Language
4,nafic,1,0,0,Fiction


### Merge it all Together

In [27]:
code_feature_df = codes_df.merge(group_df,on="Code",how="left")
code_feature_df = code_feature_df.merge(subgroup_df,on="Code",how="left")
code_feature_df = code_feature_df.merge(genre_df,on="Code",how="left")
code_feature_df = code_feature_df[["Code","Description_x","Category","Fiction","Language","Nonfiction",
                                   "Biography","Large Print","Picture","Children","Teen","Mystery","AfAm",
                                   "Comic"]]
code_feature_df = code_feature_df.rename({"Description_x":"Description"},axis=1)
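The `Description_x` column in the cell above appears because both `codes_df` and `genre_df` carry a `Description` column, so pandas adds suffixes during the merge. A possible alternative, sketched on made-up two-row frames, is to drop the duplicate column before merging and to ask pandas to verify that each code appears once on both sides.

```python
import pandas as pd

# Illustrative stand-ins for codes_df and genre_df.
codes_sample = pd.DataFrame({
    "Code": ["canf", "cafic"],
    "Description": ["CA-Nonfiction", "CA3-Fiction"],
})
genre_sample = pd.DataFrame({
    "Code": ["canf", "cafic"],
    "Description": ["ca-nonfiction", "ca3-fiction"],
    "Teen": [0, 0],
})

# Dropping the lowercased Description avoids the _x/_y suffixes;
# validate raises MergeError if "Code" is duplicated on either side.
merged = codes_sample.merge(
    genre_sample.drop(columns="Description"),
    on="Code",
    how="left",
    validate="one_to_one",
)
print(merged.columns.tolist())
```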

In [28]:
code_feature_df.head(5)

Unnamed: 0,Code,Description,Category,Fiction,Language,Nonfiction,Biography,Large Print,Picture,Children,Teen,Mystery,AfAm,Comic
0,canf,CA-Nonfiction,Nonfiction,0,0,1,0,0,0,0,0,0,0,0
1,nanf,NA-Nonfiction,Nonfiction,0,0,1,0,0,0,0,0,0,0,0
2,cafic,CA3-Fiction,Fiction,1,0,0,0,0,0,0,0,0,0,0
3,caln,CA1-Language,Language,0,1,0,0,0,0,0,0,0,0,0
4,nafic,NA-Fiction,Fiction,1,0,0,0,0,0,0,0,0,0,0


In [29]:
code_feature_df.to_csv("../01_Data/06_Features/Dictionary_Features.csv")