# Louisville Free Public Library


Example data analysis project using the Louisville Metro Open Data data set for the library book collection.


## Data Sources

This project uses data from the Louisville Metro Open Data site. You can find the main info page for this data set here: [Library Collection Inventory](https://data.louisvilleky.gov/datasets/LOJIC::louisville-metro-ky-library-collection-inventory-/about).

This project also scrapes data about the Young Adult book genre from Wikipedia using the Beautiful Soup library. The Wikipedia article is here: [title](https://github.com/jeff-dillon/library-collections/blob/main/url).


## Books

In [1]:
import pandas as pd

books = pd.read_csv("data/raw/books.csv.gz")

books.shape

(1190176, 10)

In [19]:
books

| | BibNum | Title | Author | ISBN | PublicationYear | ItemType | ItemCollection | ItemLocation | ItemPrice | ReportDate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 707409 | Jeff Immelt and the new GE way : innovation, t... | Magee, David, 1965- | 9780071605878.000 | 2009 | Book | Adult Non-Fiction | Main | 25.950 | 02/01/2023 00:00:00 |
| 1 | 707411 | Robin rescues dinner : 52 weeks of quick-fix m... | Miller, Robin, 1964- | 9780307451408.000 | 2009 | Book | Adult Non-Fiction | Southwest | 19.990 | 02/01/2023 00:00:00 |
| 2 | 707411 | Robin rescues dinner : 52 weeks of quick-fix m... | Miller, Robin, 1964- | 9780307451408.000 | 2009 | Book | Adult Non-Fiction | Southwest | 19.990 | 02/01/2023 00:00:00 |
| 3 | 707411 | Robin rescues dinner : 52 weeks of quick-fix m... | Miller, Robin, 1964- | 9780307451408.000 | 2009 | Book | Adult Non-Fiction | Remote Shelving - Main | 19.990 | 02/01/2023 00:00:00 |
| 4 | 707411 | Robin rescues dinner : 52 weeks of quick-fix m... | Miller, Robin, 1964- | 9780307451408.000 | 2009 | Book | Adult Non-Fiction | Remote Shelving - Main | 19.990 | 02/01/2023 00:00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1190171 | 2608597 | 25 ready-to-use sustainable living programs fo... | | 9780838936498.000 | 2022 | Book | Adult Non-Fiction | South Central | 63.690 | 02/01/2023 00:00:00 |
| 1190172 | 2608598 | Crypto basics : a nontechnical introduction to... | Gomzin, Slava | 9781484283202.000 | 2022 | Book | Adult Non-Fiction | Bon Air | 30.090 | 02/01/2023 00:00:00 |
| 1190173 | 2608598 | Crypto basics : a nontechnical introduction to... | Gomzin, Slava | 9781484283202.000 | 2022 | Book | Adult Non-Fiction | Newburg | 30.090 | 02/01/2023 00:00:00 |
| 1190174 | 2608599 | Data governance | Reichental, Jonathan | 9781119906773.000 | 2023 | Book | Adult Non-Fiction | Main | 24.340 | 02/01/2023 00:00:00 |


### Library Books Data Dictionary

| Column Name | Type | Description | Cleaning Notes |
| ----------- | ----- | ----------- | ------------- |
| BibNum | number | The unique identifier of a bibliographic record within our materials database. Materials with the same bibliographic # will generally have the same cataloging metadata, differing only in the barcode number, assigned location and anything else specific to the individual copy. | |
| Title | text | The name of the material. | |
| Author | text | The writer or creator of the material. | Format: LastName, FirstName, Dates. Needs to be converted to FirstName LastName to match the authors data set. |
| ISBN | number | The International Standard Book Number is a numeric commercial book identifier that is intended to be unique. Publishers purchase ISBNs from an affiliate of the International ISBN Agency. An ISBN is assigned to each separate edition and variation of a publication. | |
| PublicationYear | number | The year that the material was originally published. | Includes some incorrect values: 0, 9999, 2109 |
| ItemType | text | Describes the type of material of each item, including Books, Audiobooks, Serials, DVDs, Microforms, Three Dimensional Objects, Kits, and Printed Cartographic Materials. | |
| ItemCollection | text | Refers to the collection the material belongs to based on common themes, including but not limited to Adult Fiction, Adult Reference, Mystery, Children’s Fiction, etc. | Will need to create a new column based on this one for Young Adult. Can be empty. Can contain collections that are not books (DVD, etc.). |
| ItemLocation | text | The library location where the material was assigned at the time the report was run. | |
| ItemPrice | number | The price, in USD, that LFPL purchased the material for. | May need to round to 2 decimals |
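
The cleaning note for the author column calls for converting `LastName, FirstName, Dates` into `FirstName LastName`. A minimal sketch of that conversion follows. The function name `normalize_author` is hypothetical, and the rule that any comma-separated fragment containing a digit is a date is an assumption based on the sample rows, not a documented format.

```python
import re


def normalize_author(raw):
    """Convert 'LastName, FirstName, Dates' to 'FirstName LastName'."""
    # Missing authors arrive from pandas as NaN (a float), not a string.
    if not isinstance(raw, str):
        return None
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if not parts:
        return None
    last = parts[0]
    # Assumption: fragments containing digits (e.g. "1965-") are dates.
    given = [p for p in parts[1:] if not re.search(r"\d", p)]
    first = given[0] if given else ""
    return f"{first} {last}".strip()


print(normalize_author("Magee, David, 1965-"))  # David Magee
print(normalize_author("Gomzin, Slava"))        # Slava Gomzin
```

With pandas, this could be applied as `books["Author"].map(normalize_author)`. Names that do not follow the comma-separated pattern are returned largely unchanged and would need a manual check.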

In [4]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1190176 entries, 0 to 1190175
Data columns (total 10 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   BibNum           1190176 non-null  int64  
 1   Title            1190175 non-null  object 
 2   Author           1124225 non-null  object 
 3   ISBN             1153891 non-null  float64
 4   PublicationYear  1190176 non-null  int64  
 5   ItemType         1190176 non-null  object 
 6   ItemCollection   1190036 non-null  object 
 7   ItemLocation     1190176 non-null  object 
 8   ItemPrice        1190176 non-null  float64
 9   ReportDate       1190176 non-null  object 
dtypes: float64(2), int64(2), object(6)
memory usage: 90.8+ MB


In [5]:
books["ItemType"].describe()

count     1190176
unique          1
top          Book
freq      1190176
Name: ItemType, dtype: object

In [6]:
books["ItemCollection"].describe()

count               1190036
unique                   51
top       Adult Non-Fiction
freq                 371433
Name: ItemCollection, dtype: object

In [10]:
books["ItemCollection"].unique()

array(['Adult Non-Fiction', 'Adult Fiction', 'Mystery',
       'Older Teen Fiction', 'Younger Teen  Fiction', 'Adult Paperback',
       'Science Fiction', "Children's Fiction", 'Western',
       "Children's Picture Paperback", "Children's Paperback",
       "Children's Picture Book", 'International Collection',
       'ELL Collection', 'Teen Non-Fiction', "Children's Non-Fiction",
       'Holiday', 'Natural Resources', 'Kentucky History', 'Oversize',
       'Urban Fiction', 'Bestsellers', 'Storytime Collection',
       "Children's Board Book", "Children's Easy Reader",
       'Preschool  Picture Book', 'Adult Reference', 'Interlibrary Loan',
       nan, 'Adult Paperbacks Tall', "Children's Easy Reader Paperback",
       'Caldecott/Newbery', 'Laptop', 'Government Documents',
       'Large Print', 'Telereference', "Children's Non-Fiction Paperback",
       'Big Book', "Children's Reference", 'Teen Reference',
       'College Shop', 'Magazines and Newspaper',
       'Younger Teen  Paperba

In [2]:
books["ItemCollection"].value_counts()

ItemCollection
Adult Non-Fiction                   371433
Adult Fiction                       177604
Children's Non-Fiction               86356
Mystery                              60314
Children's Picture Book              59348
Preschool  Picture Book              51276
Children's Fiction                   48446
Adult Paperback                      45302
Children's Paperback                 45076
Children's Easy Reader               24511
Teen Non-Fiction                     24376
Older Teen Fiction                   23787
Children's Board Book                20057
Younger Teen  Fiction                17532
Kentucky History                     16962
Science Fiction                      16048
Children's Easy Reader Paperback     15959
Holiday                              15583
International Collection             15581
Adult Reference                      11197
Children's Picture Paperback          9731
Urban Fiction                         7601
Caldecott/Newbery                     6

In [13]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
books["ItemPrice"].describe()                                           

count   1190176.000
mean         18.451
std          15.998
min           0.000
25%          10.950
50%          15.990
75%          24.950
max        1077.000
Name: ItemPrice, dtype: float64

In [15]:
books["ItemLocation"].unique()

array(['Main', 'Southwest', 'Remote Shelving - Main', 'Newburg',
       'South Central', 'St Matthews', 'Fairdale', 'Bon Air',
       'Jeffersontown', 'Iroquois', 'Crescent Hill',
       'Remote Shelving - Shawnee', 'Northeast', 'Childrens Main Library',
       'Shively', 'Highlands - Shelby Park', 'Middletown', 'Portland',
       'Western', 'Main Teen', 'Shawnee', 'Childrens Bookmobile',
       'Content Management', 'Adult Bookmobile'], dtype=object)

In [3]:
books["ItemLocation"].value_counts()

ItemLocation
Remote Shelving - Main       139987
Northeast                    124473
Southwest                    122113
Main                         121439
South Central                115837
Bon Air                       74730
St Matthews                   69531
Jeffersontown                 56706
Iroquois                      52382
Highlands - Shelby Park       45539
Crescent Hill                 42837
Childrens Main Library        38994
Middletown                    33120
Shively                       23623
Newburg                       23586
Fairdale                      23149
Shawnee                       22906
Western                       21648
Portland                      13334
Childrens Bookmobile           9129
Remote Shelving - Shawnee      9083
Main Teen                      6024
Content Management                4
Adult Bookmobile                  2
Name: count, dtype: int64

In [20]:
books["PublicationYear"].unique()

array([2009, 2010, 2011, 2002, 2005, 2003, 2006, 2008, 2017, 2016, 1991,
       2012, 2015, 1989, 2014, 1990, 2007, 2021, 2022, 2013, 2000, 1994,
       1993, 2004, 1992, 2001, 1995, 1997, 1996, 1999, 1985, 1986, 1998,
       1974, 1982, 1983,    0, 2023, 1952, 1973, 1979, 1981, 1988, 1984,
       1976, 1980, 1929, 1975, 1968, 2019, 1977, 1930, 1935, 1954, 1962,
       1895, 2020, 1978, 1987, 2018, 1965, 1951, 1966, 1956, 1969, 1971,
       1950, 1964, 1963, 1957, 1972, 1958, 1953, 1970, 1961, 1955, 1941,
       1851, 1907, 1932, 1959, 1948, 1923, 1960, 1915, 1945, 1967, 1938,
       1949, 1939, 1916, 1913, 1924, 1911, 1937, 1876, 1942, 1898, 1940,
       1947, 1934, 1946, 1912, 1890, 1878, 1880, 1936, 1933, 1931, 1928,
       1892, 1918, 1917, 1919, 1914, 1944, 1943, 1900, 1909, 1903, 1908,
       1861, 1902, 1920, 1926, 1904, 1899, 1901, 1922, 1921, 1889, 1870,
       1887, 1905, 1884, 1906, 1883, 1897, 1893, 1877, 1829, 1925, 1875,
       1886, 1885, 1856, 1867, 1874, 1910, 1866, 19
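
The data dictionary flags 0, 9999, and 2109 as incorrect publication years, and the value 0 is visible in the output above. One possible way to handle them is to treat anything outside a plausible range as missing. The bounds below are assumptions: 2023 is taken from the report date, and 1500 is an arbitrary lower limit. The function names are hypothetical.

```python
def clean_year(year, min_year=1500, max_year=2023):
    """Return the year as an int, or None if it is missing or implausible."""
    try:
        value = int(year)
    except (TypeError, ValueError, OverflowError):
        # Covers None, NaN, infinity, and non-numeric strings.
        return None
    return value if min_year <= value <= max_year else None


def publication_decade(year):
    """Return the decade (e.g. 1990 for 1995), or None for invalid years."""
    value = clean_year(year)
    return None if value is None else (value // 10) * 10


for y in (2009, 1995, 0, 9999, 2109):
    print(y, clean_year(y), publication_decade(y))
```

Applied with `books["PublicationYear"].map(clean_year)`, the invalid entries become missing values instead of skewing averages or decade counts.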

## Authors

In [42]:
authors = pd.read_csv("data/raw/authors.csv", index_col=0)

authors.shape

(635, 1)

### Authors Data Dictionary

| Column Name | Type | Description | Cleaning Notes |
| ----------- | ----- | ----------- | ------------- |
| index | number | unique id |  none |
| Name | text | Author's name | Some names include parentheses that need to be removed. Example: (author, born 1954). Consider creating a last name field for easier matching. |
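
The cleaning note above could be handled along these lines. This is a rough sketch: `clean_name` and `last_name` are hypothetical helpers, the sample name in the first call is invented for illustration, and taking the final token as the last name is a simplification that will be wrong for some multi-word surnames.

```python
import re


def clean_name(name):
    """Remove parenthetical notes such as '(author, born 1954)'."""
    without_notes = re.sub(r"\s*\([^)]*\)", "", name)
    # Collapse any leftover runs of whitespace.
    return " ".join(without_notes.split())


def last_name(name):
    """Naive last name: the final whitespace-separated token."""
    tokens = clean_name(name).split()
    return tokens[-1] if tokens else None


print(clean_name("Jane Doe (author, born 1954)"))  # Jane Doe
print(last_name("Cecily von Ziegesar"))            # Ziegesar
```

With pandas, `authors["Name"].map(clean_name)` would apply the cleanup to the whole column. Names with three or more tokens deserve a closer look before relying on the last-name field for matching.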

In [43]:
authors

| | Name |
| --- | --- |
| 1 | Atia Abawi |
| 2 | Joan Abelove |
| 3 | Hailey Abbott |
| 4 | Faridah Àbíké-Íyímídé |
| 5 | Marguerite Abouet |
| ... | ... |
| 631 | Xiran Jay Zhao |
| 632 | Cecily von Ziegesar |
| 633 | Paul Zindel |
| 634 | Ibi Zoboi |


In [6]:
authors["Name"].str.split().apply(len).value_counts()

Name
2    484
3    141
4      7
5      3
Name: count, dtype: int64

## Questions

- Which location has the most books?
- Are the collections (genres) different by location?
- Which books are most popular?
- When were most YA books published?
- Which books are the least popular?
- How much did the library spend per Genre, Location?
- What is considered older teen vs. younger teen?
- What percentage of all YA authors are represented in the catalog?
- Are there foreign language books? (might have to look at titles)
- What is the average publication year by genre?
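
As an illustration of how the author-coverage question might be answered, the sketch below compares two collections of names. It assumes the catalog's author names have already been converted to the `FirstName LastName` format used in the authors data set, and that a case-insensitive exact match is good enough. The function name `author_coverage` is hypothetical.

```python
def author_coverage(ya_authors, catalog_authors):
    """Percentage of YA authors that appear at least once in the catalog."""
    # casefold() gives a case-insensitive comparison; non-strings (NaN, None) are skipped.
    ya = {name.casefold() for name in ya_authors if isinstance(name, str)}
    catalog = {name.casefold() for name in catalog_authors if isinstance(name, str)}
    if not ya:
        return 0.0
    return 100.0 * len(ya & catalog) / len(ya)


sample_ya = ["Paul Zindel", "Ibi Zoboi"]
sample_catalog = ["paul zindel", "David Magee", None]
print(author_coverage(sample_ya, sample_catalog))  # 50.0
```

Exact matching will miss authors whose names differ by accents, middle names, or initials, so the result is better read as a lower bound.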


## Fields Needed

- Young Adult field
- Author field in the books data set that matches the name format in the authors data set
- Publication Decade
- Split the ItemCollection field into: Genre, TargetAudience, Format, Accolades
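
A possible starting point for the Young Adult and target-audience fields is sketched below. It rests on two assumptions: that every collection whose name contains "Teen" counts as Young Adult, and that the audience can be read from the leading word of the collection name. The helper names are hypothetical. The whitespace normalization is there because some collection values contain double spaces (for example `Younger Teen  Fiction`).

```python
def normalize_collection(name):
    """Collapse repeated whitespace; return None for missing values."""
    if not isinstance(name, str):
        return None
    return " ".join(name.split())


def target_audience(name):
    """Guess the audience from the collection name (assumed rules)."""
    clean = normalize_collection(name)
    if clean is None:
        return None
    lowered = clean.lower()
    if "teen" in lowered:
        return "Teen"
    if lowered.startswith(("children", "preschool")):
        return "Children"
    if lowered.startswith("adult"):
        return "Adult"
    # Collections such as 'Mystery' or 'Western' carry no audience hint.
    return None


def is_young_adult(name):
    """True when the collection is assumed to be a Young Adult collection."""
    return target_audience(name) == "Teen"


print(normalize_collection("Younger Teen  Fiction"))  # Younger Teen Fiction
print(is_young_adult("Older Teen Fiction"))           # True
print(target_audience("Preschool  Picture Book"))     # Children
```

With pandas, `books["ItemCollection"].map(is_young_adult)` would produce the Young Adult flag. Genre, format, and accolades would each need their own rules once the full list of collection names has been reviewed.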