# Process - May 9 @ 12p

- I want to understand which attributes of a self-help book most strongly determine its price
- I have, in archive/merged_self_help_books_0410-1137_binned.csv, the following cols:
    - name (book title)
    - author_clean (name of author)
    - summary (description of book)
    - about_author (includes number of books written and followers on Goodreads; formatted like `{"name":"Malcolm Gladwell","num_books":118,"num_followers":"36119"}`)
    - genres (formatted like `["Art","Self Help","Nonfiction","Writing","Inspirational","Spirituality","Reference","Poetry","Psychology","Journal"]`)
    - star_rating (stars out of 5)
    - num_ratings (total number of ratings given to book)
    - num_reviews (number of reviews left for book)
    - year_published
    - kindle_price_clean (price of book as float)
    - spectrum_clean (either Secular/Scientific or Spiritual/Religious)
    - key_cat_primary (category of book), e.g.:
        - Underachievement & Stalled Potential
        - Trauma Recovery & PTSD
        - Stressful Life Transitions
        - Stress Management
        - Spiritual & Existential Crisis
        - Self-Sabotage & Bad Habits
        - Relationship Anxiety & Emotional Dependency
        - Procrastination & Time Management
        - Parenting Struggles & Family Tension
        - Narcissistic Abuse & Manipulative Dynamics
        - Leadership & Business Acumen
        - Lack of Direction & Goal-Setting
        - Issues of Religious Faith
        - Inadequacy & Perfectionism
        - Gender-based Insecurities
        - Finding Meaning in Metaphysics
        - Financial Hardship & Debt
        - Confidence & Assertiveness Issues
        - Chronic Health Issues & Pain
        - Career Dissatisfaction & Job-Related Stress
        - Body Image & Eating Disorders
        - Anxiety Disorders

    - More info:
        - DataFrame Shape: (22785, 12)

            Column Names and Data Types:
            name                   object
            author_clean           object
            summary                object
            about_author           object
            genres                 object
            star_rating           float64
            num_ratings           float64
            num_reviews           float64
            year_published        float64
            kindle_price_clean    float64
            spectrum_clean         object
            key_cat_primary        object
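
Since `about_author` and `genres` arrive as JSON-looking strings (dtype `object`), they need parsing before they can be used as features. The following is a minimal sketch, assuming every non-null cell follows the two example formats shown above; the helper names `parse_about_author` and `parse_genres` are hypothetical and not part of the original notebook.

```python
import json


def parse_about_author(raw):
    """Return (name, num_books, num_followers) from one about_author cell.

    Assumes the cell looks like the example in the column description;
    anything that fails to parse yields (None, None, None).
    """
    if not isinstance(raw, str):
        return None, None, None
    try:
        info = json.loads(raw)
    except json.JSONDecodeError:
        return None, None, None
    if not isinstance(info, dict):
        return None, None, None
    # num_followers is stored as a string in the example, so cast it.
    try:
        followers = int(info.get("num_followers"))
    except (TypeError, ValueError):
        followers = None
    return info.get("name"), info.get("num_books"), followers


def parse_genres(raw):
    """Return the genres cell as a Python list (empty list if unparseable)."""
    if not isinstance(raw, str):
        return []
    try:
        genres = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return genres if isinstance(genres, list) else []


example_author = '{"name":"Malcolm Gladwell","num_books":118,"num_followers":"36119"}'
example_genres = '["Art","Self Help","Nonfiction"]'
print(parse_about_author(example_author))
print(parse_genres(example_genres))
```

On the real data this could be applied column-wise, for example `df_orig['genres'].apply(parse_genres)`.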

- I also have, in archive/zeroshot_analysis_results.csv, the following cols:
    - name (same as above)
    - author_clean (same as above)
    - review_text (written review for book)
    - predicted_label (indicates whether the reader found the book "Very Harmful", "Somewhat Harmful", "Somewhat Helpful", or "Very Helpful")
    - Zeroshot Analysis Results DataFrame:
        DataFrame Shape: (182535, 4)

        Column Names and Data Types:
        review_text        object
        name               object
        author_clean       object
        predicted_label    object
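
Both files share `name` and `author_clean`, so the pair can serve as a join key between books and reviews. The sketch below uses made-up rows to show why keying on the pair rather than on the title alone may matter: if two authors ever share a title (an assumption, not something verified in this data), grouping by `name` pools their reviews into one helpfulness figure.

```python
import pandas as pd

# Hypothetical toy reviews: two different books that happen to share a title.
toy_reviews = pd.DataFrame({
    "name": ["Start Here", "Start Here", "Start Here"],
    "author_clean": ["Author A", "Author A", "Author B"],
    "predicted_label": ["Very Helpful", "Somewhat Helpful", "Very Harmful"],
})

helpful_labels = {"Very Helpful", "Somewhat Helpful"}
toy_reviews["is_helpful"] = toy_reviews["predicted_label"].isin(helpful_labels)

# Grouping by title alone mixes both books into a single ratio.
by_title = toy_reviews.groupby("name")["is_helpful"].mean()

# Grouping by (title, author) keeps the two books apart.
by_book = toy_reviews.groupby(["name", "author_clean"])["is_helpful"].mean()

print(by_title)
print(by_book)
```

If duplicate titles do turn out to exist, the same two-column key could be passed to `merge(..., on=['name', 'author_clean'])`.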


I've already imported and cleaned these datasets as df_orig and df_reviews.

My goal ultimately is to:
- Input a fear or concern I'm having in my own life.
- Use all of this data to programmatically surface:
    - Which books could be most HELPFUL, and which reviews will help me make an educated decision
    - Which authors to look up as relevant and helpful, versus authors who are relevant but harmful.

Write the Python code to pull this off.

In [1]:
import pandas as pd

# Import the merged self help books data
df_orig = pd.read_csv('archive/merged_self_help_books_0410-1137_binned.csv')

# Keep only the specified columns
columns_to_keep = [
    'name',
    'author_clean', 
    'summary',
    'about_author',
    'genres',
    'star_rating',
    'num_ratings',
    'num_reviews',
    'year_published',
    'kindle_price_clean',
    'spectrum_clean',
    'key_cat_primary'
]
df_orig = df_orig[columns_to_keep]

# Print shape and column information
print("DataFrame Shape:", df_orig.shape)
print("\nColumn Names and Data Types:")
print(df_orig.dtypes)
print("\nUnique values in key_cat_primary:")
print(df_orig['key_cat_primary'].unique())


# Import the zeroshot analysis results
df_reviews = pd.read_csv('archive/zeroshot_analysis_results.csv')

# Drop the sentiment column if it exists
if 'sentiment' in df_reviews.columns:
    df_reviews = df_reviews.drop('sentiment', axis=1)

# Print column names and data types
print("\nZeroshot Analysis Results DataFrame:")
print("DataFrame Shape:", df_reviews.shape)
print("\nColumn Names and Data Types:")
print(df_reviews.dtypes)





DataFrame Shape: (22785, 12)

Column Names and Data Types:
name                   object
author_clean           object
summary                object
about_author           object
genres                 object
star_rating           float64
num_ratings           float64
num_reviews           float64
year_published        float64
kindle_price_clean    float64
spectrum_clean         object
key_cat_primary        object
dtype: object

Unique values in key_cat_primary:
['Underachievement & Stalled Potential' 'Trauma Recovery & PTSD'
 'Stressful Life Transitions' 'Stress Management'
 'Spiritual & Existential Crisis' 'Self-Sabotage & Bad Habits'
 'Relationship Anxiety & Emotional Dependency'
 'Procrastination & Time Management'
 'Parenting Struggles & Family Tension'
 'Narcissistic Abuse & Manipulative Dynamics'
 'Leadership & Business Acumen' 'Lack of Direction & Goal-Setting'
 'Issues of Religious Faith' 'Inadequacy & Perfectionism'
 'Gender-based Insecurities' 'Finding Meaning in Metaphysic

In [2]:
###############################################################################
# A LIGHT-WEIGHT RECOMMENDATION PIPELINE
#
# 1.  Vectorise every book (summary + genres + category) with TF–IDF.
# 2.  Aggregate review sentiment to obtain a "helpful_ratio" for each book
#     and each author ( % of reviews classified as Helpful ).
# 3.  Given a user-supplied fear / concern, return:
#        • The most relevant & helpful books + representative reviews
#        • Helpful authors vs. potentially harmful authors for that concern
###############################################################################

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# ---------------------------------------------------------------------------
# 0.   QUICK TEXT PRE-PROCESSING + FEATURE ENGINEERING FOR EVERY BOOK
# ---------------------------------------------------------------------------
def _prep(text):
    """Very small normaliser – lower-case & cast NaNs to an empty string."""
    return str(text).lower() if pd.notnull(text) else ""

# Assemble a single text field that captures the gist of each title
df_orig['combined_text'] = (
    df_orig['summary'].apply(_prep)          + " " +
    df_orig['genres'].apply(_prep)           + " " +
    df_orig['key_cat_primary'].apply(_prep)
)

# ---------------------------------------------------------------------------
# 1.   FIT A TF-IDF VECTORIZER ON ALL BOOKS
# ---------------------------------------------------------------------------
vectorizer = TfidfVectorizer(stop_words='english', max_features=50_000)
X_books   = vectorizer.fit_transform(df_orig['combined_text'])

# ---------------------------------------------------------------------------
# 2.   BUILD HELPFUL / HARMFUL METRICS FROM THE ZEROSHOT REVIEW LABELS
# ---------------------------------------------------------------------------
helpful_labels  = {'Very Helpful', 'Somewhat Helpful'}
harmful_labels  = {'Very Harmful', 'Somewhat Harmful'}

df_reviews['is_helpful'] = df_reviews['predicted_label'].isin(helpful_labels)
df_reviews['is_harmful'] = df_reviews['predicted_label'].isin(harmful_labels)

# --------  BOOK-LEVEL STATS  -------------------------------------------------
book_stats = (
    df_reviews.groupby('name')
    .agg(helpful_count=('is_helpful', 'sum'),
         harmful_count=('is_harmful', 'sum'),
         total_reviews=('predicted_label', 'count'))
    .reset_index()
)

book_stats['helpful_ratio'] = (
    book_stats['helpful_count'] / book_stats['total_reviews'].replace(0, np.nan)
)

book_stats['harmful_ratio'] = (
    book_stats['harmful_count'] / book_stats['total_reviews'].replace(0, np.nan)
)

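# Attach the book-level stats to df_orig. The left merge keeps df_orig's row
# order (book_stats has one row per title), so rows stay aligned with X_books.
# Books with no reviews get zeros instead of NaN.
df_orig = df_orig.merge(book_stats, on='name', how='left')
for col in ['helpful_ratio', 'harmful_ratio', 'total_reviews', 'helpful_count', 'harmful_count']:
    df_orig[col] = df_orig[col].fillna(0)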

# --------  AUTHOR-LEVEL STATS  ----------------------------------------------
author_stats = (
    df_reviews.groupby('author_clean')
    .agg(helpful_count=('is_helpful', 'sum'),
         harmful_count=('is_harmful', 'sum'),
         total_reviews=('predicted_label', 'count'))
    .reset_index()
)

author_stats['helpful_ratio'] = (
    author_stats['helpful_count'] / author_stats['total_reviews'].replace(0, np.nan)
)
author_stats['harmful_ratio'] = (
    author_stats['harmful_count'] / author_stats['total_reviews'].replace(0, np.nan)
).fillna(0)


# ---------------------------------------------------------------------------
# 3.   MAIN USER-FACING FUNCTIONS
# ---------------------------------------------------------------------------
def recommend_books(user_issue, top_n=10, reviews_per_book=3, min_reviews=10):
    """
    Parameters
    ----------
    user_issue : str
        A short description of the fear / concern you are facing.
    top_n : int
        Number of book suggestions to return.
    reviews_per_book : int
        How many helpful / harmful review snippets to surface per book.
    min_reviews : int
        Ignore books with fewer than this many total reviews (for robustness).

    Returns
    -------
    pd.DataFrame with columns:
        Book | Author | Similarity | Helpful_Ratio | Total_Reviews | Star_Rating
              Price | Helpful Reviews | Harmful Reviews
    """
    # --- similarity ---------------------------------------------------------
    query_vec = vectorizer.transform([user_issue.lower()])
    similarity = cosine_similarity(query_vec, X_books).ravel()
    df_temp = df_orig.copy()
    df_temp['similarity'] = similarity

    # --- candidate selection ------------------------------------------------
    candidates = (
        df_temp[df_temp['total_reviews'] >= min_reviews]
        .copy()
        .sort_values(['similarity', 'helpful_ratio'], ascending=False)
    )

    # --- scoring: blend topic similarity & helpfulness ----------------------
    candidates['helpful_ratio_filled'] = candidates['helpful_ratio'].fillna(0)
    candidates['score'] = (
        0.70 * candidates['similarity'] +
        0.30 * candidates['helpful_ratio_filled']
    )

    top_books = candidates.nlargest(top_n, 'score')

    # --- gather representative reviews -------------------------------------
    results = []
    for _, row in top_books.iterrows():
        name   = row['name']
        author = row['author_clean']

        # Filter this book's reviews once, then split into helpful / harmful pools
        book_reviews = df_reviews[df_reviews['name'] == name]
        helpful_pool = book_reviews[book_reviews['is_helpful']]
        harmful_pool = book_reviews[book_reviews['is_harmful']]

        # Sample up to `reviews_per_book` snippets from each pool (seeded for reproducibility)
        helpful_reviews = (
            helpful_pool
            .sample(min(reviews_per_book, len(helpful_pool)), random_state=42)
            ['review_text'].tolist()
        )

        harmful_reviews = (
            harmful_pool
            .sample(min(reviews_per_book, len(harmful_pool)), random_state=42)
            ['review_text'].tolist()
        )

        results.append({
            'Book'            : name,
            'Author'          : author,
            'Similarity'      : round(row['similarity'], 3),
            'Helpful_Ratio'   : round(row['helpful_ratio'], 3),
            'Total_Reviews'   : int(row['total_reviews']),
            'Star_Rating'     : row['star_rating'],
            'Price'          : row['kindle_price_clean'],
            'Helpful Reviews' : helpful_reviews,
            'Harmful Reviews' : harmful_reviews
        })

    return pd.DataFrame(results)


def recommend_authors(user_issue, top_n=10, min_reviews=30):
    """
    Return 2 DataFrames:
        • top helpful authors
        • top potentially harmful authors
    Both are ranked by how relevant the author is to the user_issue
    (max similarity across any of their books) blended with their helpfulness.

    An author must have at least `min_reviews` total reviews to be considered.
    """
    # Author relevance via book similarity
    query_vec = vectorizer.transform([user_issue.lower()])
    similarity = cosine_similarity(query_vec, X_books).ravel()

    similarity_df = pd.DataFrame({
        'author_clean': df_orig['author_clean'],
        'sim_to_issue': similarity
    })

    author_relevance = (
        similarity_df.groupby('author_clean')
        .agg(max_sim=('sim_to_issue', 'max'))
        .reset_index()
    )

    author_merged = author_relevance.merge(author_stats, on='author_clean', how='left')
    author_merged = author_merged[author_merged['total_reviews'] >= min_reviews].copy()
    author_merged['helpful_ratio'] = author_merged['helpful_ratio'].fillna(0)

    # blended score: 70% relevance, 30% helpfulness (same weighting as books)
    author_merged['score'] = (
        0.70 * author_merged['max_sim'] + 0.30 * author_merged['helpful_ratio']
    )

    # Helpful authors: helpful_ratio ≥ 0.5
    helpful_authors = (
        author_merged[author_merged['helpful_ratio'] >= 0.5]
        .nlargest(top_n, 'score')
        [['author_clean', 'helpful_ratio', 'total_reviews', 'max_sim']]
        .rename(columns={'max_sim': 'relevance'})
        .reset_index(drop=True)
    )

    # Potentially harmful authors: helpful_ratio < 0.5
    harmful_authors = (
        author_merged[author_merged['helpful_ratio'] < 0.5]
        .nlargest(top_n, 'score')
        [['author_clean', 'helpful_ratio', 'total_reviews', 'max_sim']]
        .rename(columns={'max_sim': 'relevance'})
        .reset_index(drop=True)
    )

    return helpful_authors, harmful_authors


# ---------------------------------------------------------------------------
# 4.   EXAMPLE USAGE
# ---------------------------------------------------------------------------
# Try the system with an example concern (`display` is available in Jupyter / IPython).

my_concern = "I'm a lonely teenager."

books_df = recommend_books(my_concern, top_n=5, reviews_per_book=2)
print("=== RECOMMENDED BOOKS ===")
display(books_df)

good_authors, risky_authors = recommend_authors(my_concern, top_n=5)
print("\n=== AUTHORS LIKELY TO BE HELPFUL ===")
display(good_authors)
print("\n=== AUTHORS YOU MAY APPROACH WITH CAUTION ===")
display(risky_authors)


=== RECOMMENDED BOOKS ===


| | Book | Author | Similarity | Helpful_Ratio | Total_Reviews | Star_Rating | Price | Helpful Reviews | Harmful Reviews |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Chicken Soup for the Soul: Teens Talk Tough Ti... | Jack Canfield | 0.286 | 0.929 | 14 | 4.1 | 9.99 | [I liked this book for many reasons but one of... | [This was a very emotional book to me. So many... |
| 1 | The Art of Being a Brilliant Teenager | Andy Cope | 0.301 | 0.857 | 28 | 4.11 | 11.0 | [Read this book when I was like 9 or at most 1... | [This book heavily stereotypes teenagers as la... |
| 2 | Connected: Curing the Pandemic of Everyone Fee... | Erin Davis | 0.227 | 0.952 | 21 | 4.01 | 7.99 | [I received a free copy of this book via NetGa... | [I've never written a review, or rated a book,... |
| 3 | What I Wish I'd Known in High School: A Crash ... | John Bytheway | 0.153 | 1.0 | 27 | 4.19 | 14.39 | [John Bytheway uses his fantastic sense of hum... | [] |
| 4 | The Solo Travel Handbook | Lonely Planet | 0.15 | 0.929 | 28 | 3.77 | 9.99 | [Useful enough. "Just as your parents have a r... | [this book terrifies and excites me all at onc... |
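
The first row has a lower similarity than the second yet ranks above it, which follows from the blended score in `recommend_books` (0.70 × similarity + 0.30 × helpful ratio). The short check below recomputes that score from the rounded values printed in the output above; because the inputs are rounded to three decimals, the results are approximate. The name `blended_score` is introduced here only for illustration.

```python
# (title, similarity, helpful_ratio) copied from the recommendation output above.
rows = [
    ("Chicken Soup for the Soul: Teens Talk Tough Ti...", 0.286, 0.929),
    ("The Art of Being a Brilliant Teenager", 0.301, 0.857),
    ("Connected: Curing the Pandemic of Everyone Fee...", 0.227, 0.952),
    ("What I Wish I'd Known in High School: A Crash ...", 0.153, 1.0),
    ("The Solo Travel Handbook", 0.15, 0.929),
]


def blended_score(similarity, helpful_ratio, w_sim=0.70, w_help=0.30):
    """Same weighting as recommend_books: topic similarity plus helpfulness."""
    return w_sim * similarity + w_help * helpful_ratio


scores = [blended_score(sim, ratio) for _, sim, ratio in rows]
for (title, _, _), score in zip(rows, scores):
    print(f"{score:.4f}  {title}")
```

The first two scores come out at roughly 0.479 and 0.468: the higher helpful ratio of the first book outweighs its slightly lower similarity, and the remaining rows follow in descending order.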



=== AUTHORS LIKELY TO BE HELPFUL ===


| | author_clean | helpful_ratio | total_reviews | relevance |
|---|---|---|---|---|
| 0 | Kevin Leman | 0.823293 | 249.0 | 0.392759 |
| 1 | Andy Cope | 0.901961 | 51.0 | 0.300972 |
| 2 | Jack Canfield | 0.892617 | 596.0 | 0.286422 |
| 3 | John Bytheway | 1.0 | 93.0 | 0.152879 |
| 4 | Iyanla Vanzant | 0.91358 | 162.0 | 0.147773 |



=== AUTHORS YOU MAY APPROACH WITH CAUTION ===


| | author_clean | helpful_ratio | total_reviews | relevance |
|---|---|---|---|---|
| 0 | Theodor Reik | 0.466667 | 30.0 | 0.0 |
| 1 | سعد سعود الكريباني | 0.466667 | 30.0 | 0.0 |
| 2 | Alan Ken Thomas | 0.433333 | 30.0 | 0.0 |
| 3 | Ping Fu | 0.433333 | 30.0 | 0.0 |
| 4 | Sahar Hashemi | 0.433333 | 30.0 | 0.0 |
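
Every author in the caution list has a relevance of 0.0. When `max_sim` is zero the blended score reduces to 0.30 × `helpful_ratio`, so among authors below the 0.5 threshold the ranking can favor the *most* helpful ones regardless of topic. One possible adjustment, sketched below as an assumption rather than as the notebook's method, is to require some topical overlap and rank the caution list by relevance first. The function name `relevant_risky_authors` and the `min_relevance` parameter are hypothetical.

```python
import pandas as pd


def relevant_risky_authors(author_merged, top_n=5, min_relevance=0.0):
    """Low-helpfulness authors with topical overlap, most relevant first."""
    risky = author_merged[
        (author_merged["helpful_ratio"] < 0.5)
        & (author_merged["max_sim"] > min_relevance)
    ]
    return (
        # Most relevant first; ties broken by the lowest helpful ratio.
        risky.sort_values(["max_sim", "helpful_ratio"], ascending=[False, True])
        .head(top_n)
        .reset_index(drop=True)
    )


# A few rows copied from the two author outputs above
# (the printed "relevance" column corresponds to max_sim).
recorded = pd.DataFrame({
    "author_clean": ["Kevin Leman", "Andy Cope", "Theodor Reik", "Alan Ken Thomas"],
    "helpful_ratio": [0.823293, 0.901961, 0.466667, 0.433333],
    "total_reviews": [249, 51, 30, 30],
    "max_sim": [0.392759, 0.300972, 0.0, 0.0],
})

print(relevant_risky_authors(recorded))
```

For these recorded rows the filtered list is empty, since none of the low-ratio authors shares any vocabulary with the query. Whether an empty caution list is preferable to an off-topic one is a design choice.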


In [6]:
df_reviews.to_csv('self_help_reviews.csv', index=False)

In [5]:
# Export the dataset to CSV for the Gradio demo
df_orig.to_csv('self_help_books.csv', index=False)



import gradio as gr

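def recommend_for_concern(concern, num_books=5, num_reviews=2):
    """Wrapper that formats book and author recommendations as text for Gradio."""
    # Slider values may arrive as floats; nlargest() and sample() expect integers
    num_books, num_reviews = int(num_books), int(num_reviews)
    books_df = recommend_books(concern, top_n=num_books, reviews_per_book=num_reviews)
    good_authors, risky_authors = recommend_authors(concern, top_n=num_books)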
    
    # Format book recommendations
    book_output = "=== RECOMMENDED BOOKS ===\n\n"
    for _, book in books_df.iterrows():
        book_output += f"📚 {book['Book']}\n"
        book_output += f"👤 Author: {book['Author']}\n"
        book_output += f"⭐ Rating: {book['Star_Rating']}\n"
        book_output += f"💰 Price: ${book['Price']}\n"
        book_output += f"📊 Helpful Ratio: {book['Helpful_Ratio']:.2f}\n"
        
        if book['Helpful Reviews']:
            book_output += "\n✅ Helpful Reviews:\n"
            for review in book['Helpful Reviews']:
                book_output += f"• {review}\n"
                
        if book['Harmful Reviews']:
            book_output += "\n⚠️ Critical Reviews:\n"
            for review in book['Harmful Reviews']:
                book_output += f"• {review}\n"
        
        book_output += "\n" + "-"*50 + "\n\n"

    # Format author recommendations
    author_output = "=== RECOMMENDED AUTHORS ===\n\n"
    author_output += "✅ Authors Likely to be Helpful:\n"
    for _, author in good_authors.iterrows():
        author_output += f"• {author['author_clean']} (Helpful ratio: {author['helpful_ratio']:.2f})\n"
    
    author_output += "\n⚠️ Authors to Approach with Caution:\n"
    for _, author in risky_authors.iterrows():
        author_output += f"• {author['author_clean']} (Helpful ratio: {author['helpful_ratio']:.2f})\n"

    return book_output + "\n\n" + author_output

# Create the Gradio interface
iface = gr.Interface(
    fn=recommend_for_concern,
    inputs=[
        gr.Textbox(label="What concern or fear would you like help with?", placeholder="e.g. I'm a lonely teenager"),
        gr.Slider(minimum=1, maximum=10, value=5, step=1, label="Number of recommendations"),
        gr.Slider(minimum=1, maximum=5, value=2, step=1, label="Reviews per book")
    ],
    outputs=gr.Textbox(label="Recommendations", lines=20),
    title="Self-Help Book Recommender",
    description="Get personalized book recommendations based on your concerns or fears.",
    examples=[
        ["I'm a lonely teenager", 5, 2],
        ["I'm worried about my career", 5, 2],
        ["I have anxiety about the future", 5, 2]
    ]
)

iface.launch()


Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.




In [None]:
!git clone https://huggingface.co/spaces/joshstrupp/Self-Help-Book-Recommendation-Engine