## Deal with data handling
- Data is (to me) central to this project, so I would start by implementing how it is handled in our specific case


<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

### My take  (unhide this cell by clicking to the left of My)

There are two data aspects:
- __loading__ the file into a data structure that we can then use to
- __search__ for the user's query
- note that this could be an exact search or a fuzzy search (e.g. matching some but not all words, or even matching misspelled words)

    
Notes:
- this assumes that the csv file is small enough to be loaded into memory
- for larger files, we would need to use a database (e.g. sqlite) and search it
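
To make the database option a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `books`, the helper `search_title` and the three sample rows (taken from books.csv) are just assumptions for illustration; a real version would load the whole csv file into an on-disk database instead of `:memory:`.

```python
import sqlite3

# A few sample rows (Subject, Title, Author) taken from books.csv
rows = [
    ("Animals", "Studies in Global Animal Law", "Anne Peters"),
    ("Architecture", "Everyday Streets", "Agustina Martire"),
    ("Poetry", "Unfold", "Ari B. Cofer"),
]

# In-memory database for the demo; a file path would make it persistent
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (subject TEXT, title TEXT, author TEXT)")
con.executemany("INSERT INTO books VALUES (?, ?, ?)", rows)


def search_title(con, term):
    """Return (title, author) tuples whose title contains term."""
    pattern = f"%{term}%"  # % is the SQL wildcard for "anything"
    cur = con.execute(
        "SELECT title, author FROM books WHERE title LIKE ?", (pattern,)
    )
    return cur.fetchall()


hits = search_title(con, "streets")
print(hits)
```

Note that `LIKE` only gives us a substring search (case-insensitive for plain ASCII letters), not a real fuzzy search.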


Data structures
- we could create a reader object using the `reader` function from the csv module
- to search we would loop over the rows (each row is a list) and check if a list element matches the query
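
As a rough sketch of what that csv-module approach could look like (the helper name `find_rows` is made up, and I'm using a small in-memory string with three rows from books.csv in place of the real file):

```python
import csv
import io

# Stand-in for the opened books.csv file (only three rows and three columns)
csv_text = """Subject,Title,Author
Animals,Studies in Global Animal Law,Anne Peters
Architecture,Everyday Streets,Agustina Martire
Poetry,Unfold,Ari B. Cofer
"""


def find_rows(file_obj, query):
    """Return every row (as a dict) that has a cell exactly equal to query."""
    reader = csv.reader(file_obj)
    header = next(reader)  # first line holds the column names
    matches = []
    for row in reader:  # each row is a list of strings
        if query in row:  # exact match against any cell in the row
            matches.append(dict(zip(header, row)))
    return matches


found = find_rows(io.StringIO(csv_text), "Unfold")
print(found)
```

With the real file you would pass `open("books.csv", newline="")` instead of the `io.StringIO` object.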

<p>

- but we can also use a __pandas DataFrame__, which is a 2D data structure with labeled axes (rows and columns)
- we can use the pandas read_csv function to read the csv file into a DataFrame
- we can then use the pandas query method to search specific columns 
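
Here is a small sketch of the `query` method, using a hand-made three-row DataFrame (`demo_df`, my own stand-in built from rows of books.csv) so that it runs without the csv file:

```python
import pandas as pd

# Tiny stand-in for the DataFrame that read_csv would give us
demo_df = pd.DataFrame(
    {
        "Subject": ["Animals", "Architecture", "Poetry"],
        "Title": ["Studies in Global Animal Law", "Everyday Streets", "Unfold"],
        "Author": ["Anne Peters", "Agustina Martire", "Ari B. Cofer"],
    }
)

# query() takes the condition as a string; column names are used directly
exact = demo_df.query('Title == "Unfold"')
poetry = demo_df.query('Subject == "Poetry"')

print(exact)
```

Both calls return a new DataFrame that contains only the matching rows.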


### read books.csv into a pandas data frame    
- complete the code cell below
- if you need a refresher, see _Python refresher 4 b) pandas and numpy_
- For solutions, see Solutions_for_A.py


In [20]:
# import pandas package, rename to pd
import pandas as pd

# Read books.csv file into a DataFrame (see Python refresher 4)
book_df = pd.read_csv("books.csv")
df = book_df  # second name for the same DataFrame, used by the fuzzy search cells below

# print out/display the dataframe  
print(book_df)

# print out the column names as a list 
print(list(book_df.columns))

          Subject                               Title                Author  \
0         Animals        Studies in Global Animal Law           Anne Peters   
1         Animals                  Life After Logging           E. Meijaard   
2    Architecture                    Everyday Streets     Agustina Martire    
3    Architecture  Erholung in siedlungsnahen Wäldern          Susanne Karn   
4    Architecture          Public Space in Transition             Dahae Lee   
..            ...                                 ...                   ...   
147        Poetry                   Pátria Invencível        Nelson Santrim   
148        Poetry                              Unfold          Ari B. Cofer   
149        Poetry                  Vetrikku Ore Vazhi              Multiple   
150        Poetry        Kanneerin micham kavithaigal          Karpagavalli   
151        Poetry               Memories and miracles  Leslin Sushma Dsouza   

             ISBN                                  

### Searching  

- we will only use the Title, Author and Summary columns here
- The user will be able to search by title. The result of a successful search will be the author, title and summary.
- Let's do a simple search by title  
- We can use == to search 
- again, see python refresher 4b, at the end of the pandas part 

In [16]:
# Simple search with == 

# a) search for the row where the Title is Unfold

new_df = book_df[book_df["Title"] == "Unfold"]

# print out the results using display()

display(new_df)

# You will notice that we still get the column titles printed out above the row.

# Now print out the Subject of the first search result (i.e. the value of a cell in the Subject column)

print(new_df["Subject"])

# Important: In order to print out just the value of the cell, you will need to use .values[0] (the first element of the values array)

print(new_df["Subject"].values[0])

|     | Subject | Title | Author | ISBN | Summary |
|-----|---------|-------|--------|------|---------|
| 148 | Poetry | Unfold | Ari B. Cofer | 177168285.0 | From the author of paper girl and the knives t... |


148    Poetry
Name: Subject, dtype: object
Poetry
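
One caveat with `.values[0]`: if the search finds nothing, the filtered DataFrame is empty and `.values[0]` raises an `IndexError`. A minimal sketch of how to guard against that (the helper `first_subject` and the two-row `demo_df` are made up for this example):

```python
import pandas as pd

# Two rows from books.csv as a stand-in for book_df
demo_df = pd.DataFrame(
    {
        "Subject": ["Architecture", "Poetry"],
        "Title": ["Everyday Streets", "Unfold"],
    }
)


def first_subject(df, title):
    """Return the Subject of the first row with this exact Title, or None."""
    hits = df[df["Title"] == title]
    if hits.empty:  # no row matched, so there is no values[0] to grab
        return None
    return hits["Subject"].values[0]


print(first_subject(demo_df, "Unfold"))
print(first_subject(demo_df, "unfold"))  # == is case-sensitive, so no hit
```

The second call also shows how strict `==` is: a lowercase "unfold" already counts as a different title.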


### What if we want to search for a book title that is not an exact match?
- Unless a title is very well known, it's unlikely that the user will know the exact title.
- This will make it frustrating for the user to search for a book, which good HCI needs to avoid!
- As one possible solution, we can use a fuzzy search that finds the best match for the user's query: instead of requiring an exact match, it works with partial matches, doesn't care about the order of words, can ignore case, etc.
- Ex: searching for "Egypt and the Classical World" we might type in: "egypt" or "Classic Egypt" or "egypt world" or "egypt and its world" or "egypt in classical world", etc.
- The following fuzzy search function will return the best match for the search term.
- I'm bringing this up now because when we use the google books API, it will do something similar internally to return the best match for our search term
- We will use the rapidfuzz module, which is similar to the [fuzzywuzzy](https://github.com/seatgeek/thefuzz) module (now called thefuzz)

In [17]:
%pip install rapidfuzz --upgrade

Collecting rapidfuzz
  Downloading rapidfuzz-3.9.7-cp312-cp312-win_amd64.whl.metadata (12 kB)
Downloading rapidfuzz-3.9.7-cp312-cp312-win_amd64.whl (1.7 MB)
Installing collected packages: rapidfuzz
Successfully installed rapidfuzz-3.9.7
Note: you may need to restart the kernel to use updated packages.




In [18]:
# a fuzzy search function
from rapidfuzz import fuzz, process

# Function to find the best match for a search term in the Title column
def fuzzy_find(df, search_term, n=1):
    # process.extract() returns a list of tuples (text, similarity score, index)
    matches = process.extract(search_term, df["Title"], scorer=fuzz.token_set_ratio, limit=n)
    first_match = matches[0] # grab only first match, ignore n for now
    text = first_match[0] # the text of the first match
    match_score = first_match[1] # the similarity score of the first match
    index = first_match[2] # the index in the data frame of the first match
    
    print(f"searched for {search_term}\nbest match: {text}, {index}, {match_score}") # DEBUG
    
    return text, index, match_score

- note that here we limit the results to the 1 (n=1) best result, but you could also request the top 5 or 10 best matches
- even with n=1, `process.extract()` will always return a __list__ of tuples, so we need to extract the first tuple with `[0]` and then the text, score and index with `[0]`, `[1]` and `[2]` respectively
- the index is needed so that the caller can use it as a row number to retrieve the data for the best match
- the match quality is given in % and the function simply returns the text of the best match (a string), its row index and the % match
- I've put in a DEBUG print statement to show the match and the % match

<p>

- Note that this will return a match even if the match quality is terrible, e.g. a 10% match!
- Therefore, we will need to set a threshold for the match quality, e.g. 80% or higher
- Which threshold to use will depend on you and how much you'd rather get a so-so match vs. no match at all
- I'm using 50% here

In [21]:
# fuzzy_find usage examples

# Good match
text, index, match = fuzzy_find(df, 'classic egypt')
if match > 50:
    row = df.iloc[index] # get the row of the best match
    print(row['Summary'][:500])
else:
    print('No good match found')

searched for classic egypt
best match: Egypt and the Classical World, 31, 52.38095238095238
Presenting dynamic research, this publication explores two millennia of cultural interactions between Egypt, Greece, and Rome. From Mycenaean weaponry found among the cargo of a Bronze Age shipwreck off the Turkish coast to the Egyptian-inspired domestic interiors of a luxury villa built in Greece during the Roman Empire, Egypt and the Classical World documents two millennia of cultural and artistic interconnectedness in the ancient Mediterranean. This volume gathers pioneering research from the


In [22]:
# fuzzy_find usage examples

# a bad match (match % is below 50%)
text, index, match = fuzzy_find(df, "Adapting Classics")
if match > 50:
    row = df.iloc[index] # get the row of the best match
    print(row['Summary'])
else:
    print('No good match found')

searched for Adapting Classics
best match: Climate Adaptation and Resilience Across Scales, 14, 46.875
No good match found


#### Create a command-line interface for fuzzy searching books.csv (3 pts)

- open  __book_search_CLI.py__ now and complete the code there. 
- this will complete part A of the Example project HW

<p>

Inside the `while True` loop and using only input(), write code that:

    1) ask the user for the title search term 
    2) ask for a quality threshold (our match %), with a default of 50. If the user just hits return, keep the existing value; otherwise update it
    3) search the Title column in the DataFrame df for the search term with the fuzzy search 
    4) if there is a match (> threshold!), print the book title, author, and summary but now nicely formatted (using an f-string), like this: `f"The summary for {title} by {author} is:\n{summary}"`
    5) if there is no good enough match, print a message saying so
    6) Finally ask the user if they want to search again. If they do, repeat the process. If not, exit the loop.
<p>

- Here's the output for an example search:
```
What title do you want to search for? Classic Reformations
What quality threshold do you want, currently 50 
match of 48.97959183673469 is lower than 50, no good match found!
Another Search? (y/n)y
What title do you want to search for? Classic Reformations
What quality threshold do you want, currently 50 40
The summary for Egypt and the Classical World by Jeffrey Spier is:
Presenting dynamic research, this publication explores two millennia of cultural interactions between Egypt, Greece, and Rome. From Mycenaean weaponry found among the cargo of a Bronze Age shipwreck off the Turkish coast to the Egyptian-inspired domestic interiors of a luxury villa built in Greece during the Roman Empire, Egypt and the Classical World documents two millennia of cultural and artistic interconnectedness in the ancient Mediterranean. This volume gathers pioneering research from the Getty scholars' symposium that helped shape the major international loan exhibition Beyond the Nile: Egypt and the Classical World (J. Paul Getty Museum, 2018). Generously illustrated essays consider a range of artistic and other material evidence, including archaeological finds, artworks, papyri, and inscriptions, to shed light on cultural interactions between Egypt, Greece, and Rome from the Bronze Age to the Late Period and Ptolemaic dynasty to the Roman Empire. 
Another Search? (y/n)n
```
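
Step 2 (keep the current threshold if the user just hits return) tends to be the trickiest part, so here is one possible sketch of just that step. The helper name `parse_threshold` is made up, and ignoring non-numeric input is my own choice, not a requirement of the HW.

```python
def parse_threshold(raw, current):
    """Return the new threshold, or current if raw is blank or not a number."""
    raw = raw.strip()
    if not raw:  # user just hit return
        return current
    try:
        return float(raw)
    except ValueError:  # e.g. the user typed "abc"
        return current


threshold = 50
threshold = parse_threshold("", threshold)    # stays at 50
threshold = parse_threshold("40", threshold)  # updated to 40.0
print(threshold)
```

In the CLI you would pass in what the user typed, e.g. `parse_threshold(input(f"What quality threshold do you want, currently {threshold} "), threshold)`.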

<p>

- Note: this is a good opportunity to practice using the debugger! Use it to step through the loop and verify that any results look reasonable. You can hover over the variables to see their values, or use the DEBUG CONSOLE to print out values. 
- You can also use the DEBUG CONSOLE to run code, like `df.head()` to see the first few rows of the data frame.