# Perfume Project Portfolio

## 🪞 Introduction
Exploring emotional expression, trends, and consumer behavior in perfume communities.

## 🌐 Data Collection
Posts and top-level comments from r/fragrance, collected through the Reddit API with PRAW.

## 🧹 Data Cleaning
Removing stopwords, symbols, formatting, emojis.
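
The cleaning step above could look roughly like the sketch below. It is only an illustration: `clean_text`, the regular expressions, and the tiny `STOPWORDS` set are assumptions made for this example, not part of the notebook, and a real run would likely use a fuller stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list; swap in a fuller one for real use.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "that",
             "of", "to", "in", "on", "for", "i", "my", "so"}

MD_LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")  # [label](target) -> label
URL_RE = re.compile(r"https?://\S+|www\.\S+")      # bare links
NON_WORD_RE = re.compile(r"[^\w'\s]|_")            # symbols, punctuation, emojis, underscores


def clean_text(text: str) -> str:
    """Lowercase, strip links, markup, emojis and symbols, then drop stopwords."""
    text = MD_LINK_RE.sub(r"\1", text)   # keep the visible label of markdown links
    text = URL_RE.sub(" ", text)
    text = text.lower()
    text = NON_WORD_RE.sub(" ", text)
    tokens = [tok.strip("'") for tok in text.split()]
    return " ".join(tok for tok in tokens if tok and tok not in STOPWORDS)
```

Because `\w` is Unicode-aware in Python, accented letters in names such as "Hermès" survive while emojis are dropped.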

## 🔍 Exploratory Data Analysis (EDA)
Word frequency, top perfumes mentioned, trends.
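
As a minimal sketch of the word-frequency part of this step, assuming the texts have already been cleaned and are whitespace-separated, the helper below counts terms across a collection of posts. The name `top_terms` and its `min_length` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter


def top_terms(texts, n=10, min_length=3):
    """Return the n most frequent tokens across texts, skipping very short tokens."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in text.lower().split() if len(tok) >= min_length)
    return counts.most_common(n)
```

The same counting approach works for perfume mentions if the tokens are replaced by matched product names.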

## ❤️ Sentiment Analysis
How do people *feel* about scents? Positive/negative/neutral or emotional categories (joy, longing, etc.).
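
One way to get positive/negative/neutral labels is to bucket VADER's compound score. The sketch below assumes the ±0.05 cutoffs that the VADER documentation suggests as a convention; `label_sentiment` is a hypothetical helper name, and emotional categories such as joy or longing would need a different model and are not covered here.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

Applied to a column of compound scores, for example with `Series.apply(label_sentiment)`, this yields one label per post or comment.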

## 🧠 Topic Modeling (Optional)
Uncover themes: first dates, summer scents, confidence boosters.

## 📊 Visualizations
Bar charts, word clouds, timelines.
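
For the timeline idea, the Unix timestamps that Reddit reports as `created_utc` first need to be bucketed into periods. The following sketch groups them by calendar month in UTC; `posts_per_month` is a hypothetical helper, and the choice of monthly buckets is an assumption.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_month(timestamps):
    """Count Unix timestamps (seconds, UTC) per 'YYYY-MM' bucket, in chronological order."""
    months = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        for ts in timestamps
    )
    return dict(sorted(months.items()))
```

The resulting dictionary can be passed to a line or bar plot, with the keys on the x-axis and the counts on the y-axis.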

## 💡 Insights & Takeaways
Key patterns or narratives. What surprised us?

## 🔚 Conclusion
Final thoughts. Future work. Possible brand applications.


In [None]:
# Perfume Analysis Query Engine
This notebook demonstrates how to search Reddit for a specific phrase (e.g., "Love" or "Underrated") and analyze the matching posts and their comments to extract sentiment and fragrance-related entities.


In [None]:
import praw
import pandas as pd
import time
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
import spacy
from collections import Counter
import matplotlib.pyplot as plt

# Download necessary NLTK data (if not already downloaded)
nltk.download('vader_lexicon')
# Load spaCy's English model (make sure you've installed it: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# Initialize VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()


In [None]:
# Configure Reddit API credentials. Replace the placeholders with your own values; you need a Reddit account and a registered app to obtain them.
reddit = praw.Reddit(
    client_id="your_actual_client_id",
    client_secret="your_actual_client_secret",
    user_agent="your_user_agent"
)

# Define query parameters. Any fragrance-related term works as the query; broader terms tend to work better.
query = "Love"
# r/fragrance is used for its popularity; change this if another subreddit suits your question better.
subreddit_name = "fragrance"
# Maximum number of posts to fetch and of top-level comments to keep per post.
# Larger limits take longer to collect.
post_limit = 100
comment_limit = 20

posts_data = []
comments_data = []

print(f"Searching for posts matching '{query}' in r/{subreddit_name}...")
for submission in reddit.subreddit(subreddit_name).search(query, limit=post_limit, sort="relevance"):
    post_details = {
        'post_id': submission.id,
        'title': submission.title,
        'body': submission.selftext,
        'upvotes': submission.score,
        'timestamp': submission.created_utc,
        'url': submission.url,
        'num_comments': submission.num_comments
    }
    posts_data.append(post_details)
    
    # Get top-level comments (ignore nested replies)
    submission.comments.replace_more(limit=0)
    count = 0
    for comment in submission.comments:
        if comment.parent_id.startswith("t3_"):
            comments_data.append({
                'post_id': submission.id,
                'comment_id': comment.id,
                'comment_body': comment.body,
                'comment_score': comment.score,
                'comment_timestamp': comment.created_utc
            })
            count += 1
            if count >= comment_limit:
                break
    time.sleep(2)  # Respect rate limits

posts_df = pd.DataFrame(posts_data)
comments_df = pd.DataFrame(comments_data)


In [None]:
# Combine title and body for posts
posts_df['full_text'] = posts_df['title'].fillna('') + " " + posts_df['body'].fillna('')
# Score each text with VADER's compound score, which ranges from -1 (most negative) to +1 (most positive)
posts_df['sentiment'] = posts_df['full_text'].apply(lambda text: sid.polarity_scores(text)['compound'])
comments_df['sentiment'] = comments_df['comment_body'].apply(lambda text: sid.polarity_scores(text)['compound'])

# Function to extract entities (ORG and PRODUCT labels)
def extract_entities(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ in ['ORG', 'PRODUCT']]

posts_df['entities'] = posts_df['full_text'].apply(extract_entities)

# Flatten entity list and count frequencies
all_entities = [entity for sublist in posts_df['entities'] for entity in sublist]
entity_counts = Counter(all_entities)
common_entities = entity_counts.most_common(20)

print("\n--- Sentiment Analysis ---")
print("Average sentiment for posts: ", posts_df['sentiment'].mean())
print("Average sentiment for comments: ", comments_df['sentiment'].mean())

print("\n--- Common Fragrance-Related Entities ---")
for entity, count in common_entities:
    print(f"{entity}: {count}")


In [None]:
# Visualization: Bar chart for top fragrance-related entities
entities, counts = zip(*common_entities)

plt.figure(figsize=(10, 6))
bars = plt.bar(entities, counts, color='mediumpurple')
plt.xlabel('Entity')
plt.ylabel('Frequency')
plt.title(f'Top 20 Fragrance-Related Entities for Query "{query}"')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Visualization: Histogram for sentiment distribution in posts
plt.figure(figsize=(10, 6))
plt.hist(posts_df['sentiment'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.title('Sentiment Distribution for Posts')
plt.xlim(-1, 1)
plt.show()


In [None]:
## Conclusion & Future Work

The analysis shows that when querying "Love" in r/fragrance:
- The average sentiment of posts (mean VADER compound score) is fairly positive.
- Specific fragrance-related entities (e.g., certain brands or product names) appear frequently.

Future improvements could include:
- Further refining the entity extraction to filter out irrelevant terms.
- Integrating this analysis into an interactive web application.
- Allowing users to drill down for more detailed sentiment analysis for each fragrance.
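
As a starting point for the first improvement, the sketch below normalizes extracted entity strings and drops obvious noise before counting. It assumes the input is a flat list of entity strings like the one built from spaCy's `ORG` and `PRODUCT` labels; `filter_entities` and the `NOISE_TERMS` blocklist are hypothetical and would need tuning against real output.

```python
from collections import Counter

# Hypothetical blocklist of terms that are tagged as entities but are not fragrances or brands.
NOISE_TERMS = {"reddit", "amazon", "youtube", "edt", "edp"}


def filter_entities(entities, noise_terms=NOISE_TERMS, min_length=3):
    """Normalize entity strings, drop noise and very short terms, and count the rest."""
    kept = []
    for entity in entities:
        # Collapse repeated whitespace and lowercase so "Dior" and "dior " count together.
        normalized = " ".join(entity.split()).lower()
        if len(normalized) < min_length or normalized in noise_terms:
            continue
        kept.append(normalized)
    return Counter(kept)
```

Calling `.most_common(20)` on the returned counter gives a cleaner input for the bar chart than the raw entity list.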
