# Sentiment Analysis and Personalized Messaging 
## Project Overview: 
This project demonstrates the application of ethical web scraping, sentiment analysis, and AI-powered personalized messaging to identify potential participants for clinical trials. By analyzing posts and comments from Reddit, the project gauges user sentiments and interest levels toward clinical trials. Using these insights, the project generates customized outreach messages to engage users, leveraging the OpenAI API for dynamic message creation. The goal is to provide a scalable and ethical framework for clinical trial recruitment.
## Notebook Overview: 
This notebook provides a comprehensive analysis of Reddit data to identify users who are interested in participating in clinical trials. It covers the following:

1. Reddit Post Filtering: Identifies posts relevant to clinical trials using a similarity score based on cosine similarity with a predefined query. The score is computed in `reddit_scrapping.py`. Posts focused on career discussions or unrelated topics are penalized.

2. EDA: 
- Sentiment Analysis: Analyzes the sentiment (positive, neutral, negative) of posts and comments, identifying trends, topics, and linguistic patterns associated with each sentiment category.
- Topic Modeling: Uses LDA to identify themes within posts and comments, distinguishing between sentiment-driven and general topics.
- Keyword Analysis with TF-IDF: Extracts significant words and phrases for each sentiment category to uncover linguistic patterns.
- Interest Scoring: Develops a scoring system to rank users' interest levels in clinical trials based on their activity, sentiment, and intent.

3. Personalized Messaging: Demonstrates how to generate tailored messages for users with high interest, using aggregated user content and extracted demographic data.

## Find Reddit Posts Related to Clinical Trials

The focus of this project is on posts from individuals who are interested in joining clinical trials or have prior experience participating in them. The aim is not to include posts that simply share news about clinical trials. For instance, a post in `r/science` discussing an article like "Lab-grown retinal eye cells make successful connections, opening the door for clinical trials to treat blindness" mentions clinical trials but does not involve individuals' personal experiences or intentions to join.


I have chosen to scrape data exclusively from `r/clinicalresearch` and `r/clinicaltrials` (see `reddit_scrapping.py`). While discussions about joining clinical trials might also occur in disease-specific subreddits, individuals seeking urgent advice or sharing their experiences often cross-post to broader communities like these. Additionally, due to time constraints, scraping all disease-specific subreddits is not feasible.


After scraping all posts from the subreddits above with `reddit_scrapping.py`, I assign each post a relevance score. I am only interested in posts where the user wants to participate in a clinical trial, has questions about participating, or has already taken part in one.

This score is critical because it lets us filter out posts that are not relevant to the project goal. It starts from the cosine similarity between the `defined_query` and the post's title combined with its body text, and is then adjusted up or down based on keywords: posts about career development are penalized, and posts about clinical trial experience are rewarded.
The `defined_query` may need further fine-tuning later in the project.

```python
defined_query = """
    I am looking for personal experiences in clinical trials or interest in joining one.
    I am interested in learning more about the process of clinical trials and how they work, how to find and participate in clinical trials.
    medicine, research, study, trials, drug, treatment, experimental, patient, clinical, health, participation, patient recruitment
    Looking for patient recruitment
    """
```
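
The actual scoring lives in `reddit_scrapping.py`, which is not shown in this notebook. As a rough, self-contained sketch of the idea (cosine similarity against the query, then a keyword-based adjustment), the following uses plain bag-of-words vectors. The keyword lists, the bonus and penalty sizes, and all helper names are illustrative assumptions, not the values used by the script.

```python
import math
import re
from collections import Counter

# Hypothetical keyword lists and weights, chosen only for illustration.
REWARD_KEYWORDS = {"participate", "participant", "enroll", "experience"}
PENALTY_KEYWORDS = {"career", "job", "salary", "interview"}
REWARD, PENALTY = 0.2, 0.3


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_similarity(query: str, title: str, body: str) -> float:
    """Base cosine similarity, plus a bonus and minus a penalty per matched keyword."""
    tokens = tokenize(f"{title} {body}")
    base = cosine_similarity(Counter(tokenize(query)), Counter(tokens))
    token_set = set(tokens)
    bonus = REWARD * len(token_set & REWARD_KEYWORDS)
    penalty = PENALTY * len(token_set & PENALTY_KEYWORDS)
    return base + bonus - penalty


query = "personal experience joining a clinical trial as a patient"
trial_post = weighted_similarity(
    query, "Should I enroll?", "I want to participate in a clinical trial for my condition."
)
career_post = weighted_similarity(
    query, "CRA career advice", "What salary should I expect in a clinical research job interview?"
)
print(trial_post > career_post)
```

Because the adjustment is added on top of the cosine value, the final score is not bounded by 1, which is consistent with the scores above 1 that appear in the data.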

In [1]:
import pandas as pd
import json
from openai import OpenAI
import numpy as np
from scipy.spatial.distance import cosine
from datetime import datetime
import os 
import re
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
current_wd = os.getcwd()
# Load the JSON file
file_path = f"{current_wd}/data/post_data.json"  # Replace with the path to your JSON file
with open(file_path, "r", encoding="utf-8") as f:
    data = json.load(f)

# Convert the JSON dictionary to a DataFrame
df = pd.DataFrame.from_dict(data, orient="index")

# rename columns: selftext -> body_text, weighted_similarity -> weighted_sim_score
df = df.rename(columns={"selftext": "body_text", "weighted_similarity":"weighted_sim_score"})

df = df.sort_values(by='weighted_sim_score', ascending=False)

If a post matches the topics of interest (clinical trial experience and interest in participating in one), its similarity score is positive, and strong matches score 0.6 or higher. Using 0.6 as the threshold, I narrow the data down to posts that match my objective for sentiment analysis.

In [3]:
# keep rows with weighted_sim_score >= 0.6
processed_df = df.copy()
processed_df = processed_df[processed_df['weighted_sim_score'] >= 0.6]
print(processed_df.shape)
processed_df.head()

(100, 12)


| | title | body_text | author | weighted_sim_score | upvote_ratio | created_utc | created_date | score | url | num_comments | subreddit | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d61idj | Types and phases of clinical trials | There are different types of clinical trials i... | gafaind | 4.895395 | 1.0 | 1568829000.0 | 2019-09-18 | 7 | https://www.reddit.com/r/clinicaltrials/commen... | 1 | clinicaltrials | [{'author': 'DeanOnDelivery', 'body': 'What an... |
| dd9irp | About Different Clinical Trials | Once on the market, the drug remains closely ... | canadianblog | 3.559457 | 1.0 | 1570206000.0 | 2019-10-04 | 1 | https://www.reddit.com/r/clinicaltrials/commen... | 0 | clinicaltrials | [] |
| 1gb033f | Trying to get a clinical trial started as a pr... | Hey, guys! I need some help, if possible. I ma... | empty-health-bar | 3.040946 | 0.43 | 1729768000.0 | 2024-10-24 | 0 | https://www.reddit.com/r/clinicalresearch/comm... | 71 | clinicalresearch | [{'author': 'vathena', 'body': 'You should get... |
| ol3ysd | 158 Recently Updated Clinical Trials - activel... | #158 Clinical Trials updated on 2021-07-14\n\n... | ClinicalTrialsBot | 1.832374 | 0.84 | 1626390000.0 | 2021-07-15 | 4 | https://www.reddit.com/r/clinicaltrials/commen... | 0 | clinicaltrials | [] |
| olfc72 | 122 Recently Updated Clinical Trials - activel... | #122 Clinical Trials updated on 2021-07-15\n\n... | ClinicalTrialsBot | 1.748216 | 1.0 | 1626436000.0 | 2021-07-16 | 1 | https://www.reddit.com/r/clinicaltrials/commen... | 0 | clinicaltrials | [] |


In [4]:
processed_df.columns

Index(['title', 'body_text', 'author', 'weighted_sim_score', 'upvote_ratio',
       'created_utc', 'created_date', 'score', 'url', 'num_comments',
       'subreddit', 'comments'],
      dtype='object')

## EDA: Sentiment Analysis
These are the sentiment analysis questions I want to answer with the data above:

1. What are the dominant sentiments (positive, neutral, negative) expressed in posts and comments about clinical trials?
2. Are there specific topics within the posts (e.g., patient experience, risks, benefits) that show stronger positive or negative sentiment?
3. What keywords or phrases are strongly associated with neutral, positive or negative sentiment in posts and comments?
4. How can we identify users who are most interested in clinical trials based on their activity, sentiment, and intent?


In [5]:
# Build a comments-only DataFrame
# Note: this iterates over `df` (all scraped posts), not the filtered `processed_df`
comments_data = []

# Iterate through each post and its comments
for post_id, row in df.iterrows():
    post_title = row['title']  # Get the post title
    post_author = row['author']  # Get the post author
    for comment in row['comments']:
        # Append each comment's data to the list
        comments_data.append({
            'post_id': post_id,
            'post_title': post_title,
            'comment_author': comment['author'],
            'comment_body': comment['body'],
            'comment_score': comment['score'],
            'comment_created_date': datetime.fromtimestamp(comment['created_utc']).strftime('%Y-%m-%d'),
        })

# Create a new DataFrame from the flattened comments data
comments_df = pd.DataFrame(comments_data)
print("Comments dataframe shape: ", comments_df.shape)
comments_df.head()

Comments dataframe shape:  (11570, 6)


| | post_id | post_title | comment_author | comment_body | comment_score | comment_created_date |
|---|---|---|---|---|---|---|
| 0 | d61idj | Types and phases of clinical trials | DeanOnDelivery | What an excellent post. About the only thing I... | 1 | 2019-12-23 |
| 1 | 1gb033f | Trying to get a clinical trial started as a pr... | vathena | You should get off reddit and go connect with ... | 107 | 2024-10-24 |
| 2 | 1gb033f | Trying to get a clinical trial started as a pr... | NewBenefit6035 | How many participants, funding, starting larg... | 24 | 2024-10-24 |
| 3 | 1gb033f | Trying to get a clinical trial started as a pr... | FuriousKittens | You might have some success petitioning the sp... | 19 | 2024-10-24 |
| 4 | 1gb033f | Trying to get a clinical trial started as a pr... | Gazorninplat6 | It's great when private citizens want to get i... | 6 | 2024-10-24 |


### Sentiment Analysis with Pre-trained Model for Posts and Comments data

In [6]:
# per post 
from transformers import pipeline
from huggingface_hub import login 
hugging_token = os.getenv("HUGGINGFACE_API_KEY")
# Do not print the token: notebook outputs are saved with the file and would leak the credential.
login(token=hugging_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to C:\Users\tata\.cache\huggingface\token
Login successful



I use the **"cardiffnlp/twitter-roberta-base-sentiment-latest"** model, a pre-trained sentiment analysis model based on the RoBERTa architecture. Although it was trained on tweets, their informal style is close enough to Reddit language to make the model a reasonable fit for this task.

To handle longer Reddit posts, I combine the title and body text, split the text into chunks of 512 characters, and analyze each chunk separately. The label that occurs most often across the chunks becomes the overall sentiment of the post, and its score is the average score of the chunks carrying that label. This keeps each input short enough for the model while still covering the whole post.

In [7]:
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
sentiment_pipeline = pipeline("sentiment-analysis", model=model_path, tokenizer= model_path)

def analyze_sentiment(pipeline, text):
    max_length = 512
    # Split text into smaller chunks to avoid token limit
    chunks = [text[i:i+max_length] for i in range(0, len(text), max_length)]
    # If only one chunk, return the pipeline result directly
    if len(chunks) == 1:
        result = pipeline(chunks[0])[0]
        return result['label'].lower(), result['score']

    # If multiple chunks, analyze sentiment for each
    sentiments = [pipeline(chunk)[0] for chunk in chunks]

    # Determine the predominant label
    label_counts = {}
    for sentiment in sentiments:
        label = sentiment['label'].lower()
        label_counts[label] = label_counts.get(label, 0) + 1
    predominant_label = max(label_counts, key=label_counts.get)
    # Calculate the average score for the predominant label
    avg_score = sum(s['score'] for s in sentiments if s['label'].lower() == predominant_label) / label_counts[predominant_label]

    return  predominant_label,  avg_score


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
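
The aggregation step in `analyze_sentiment` can be checked without downloading the model. The sketch below is a variant written for illustration, not the cell above: it splits on whitespace so words are not cut in half, returns a neutral default for empty text (an assumption; the original function raises on an empty string because `max()` receives an empty dictionary), and accepts any callable `classify` that returns a `(label, score)` pair. `fake_classifier` is a hypothetical stand-in for the Hugging Face pipeline.

```python
from collections import Counter


def chunk_text(text: str, max_chars: int = 512) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on whitespace."""
    chunks, current = [], ""
    for word in text.split():
        # A single word longer than max_chars is hard-split.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def majority_sentiment(text, classify, max_chars=512):
    """Return the most frequent chunk label and the mean score of chunks with that label."""
    chunks = chunk_text(text, max_chars)
    if not chunks:
        return "neutral", 0.0  # assumed default for empty input
    results = [classify(chunk) for chunk in chunks]
    counts = Counter(label for label, _ in results)
    top_label = counts.most_common(1)[0][0]
    scores = [score for label, score in results if label == top_label]
    return top_label, sum(scores) / len(scores)


def fake_classifier(chunk):
    """Hypothetical stand-in for the sentiment pipeline."""
    if "great" in chunk:
        return "positive", 0.9
    if "awful" in chunk:
        return "negative", 0.8
    return "neutral", 0.6


label, score = majority_sentiment("great great awful", fake_classifier, max_chars=5)
print(label, score)
```

To use this with the real model, `classify` could wrap `sentiment_pipeline` and return `(result['label'].lower(), result['score'])` for the first result.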


In [8]:
processed_df[['sentiment_label', 'sentiment_score']] = processed_df.apply(
    lambda row: pd.Series(analyze_sentiment(sentiment_pipeline, row['title'] + " " + row['body_text'])),
    axis=1)

In [9]:
processed_df['sentiment_label'].value_counts()

sentiment_label
neutral     75
positive    13
negative    12
Name: count, dtype: int64

Most posts (75 of 100) are neutral in sentiment. This makes sense, as people typically post to ask for information or to share information about clinical trials.

In [10]:
# sentiment analysis for comments
comments_df[['sentiment_label', 'sentiment_score']] = comments_df.apply(
    lambda row: pd.Series(analyze_sentiment(sentiment_pipeline, row['comment_body'])),
    axis=1)
print(comments_df.shape)
comments_df.head(5)


(11570, 8)


| | post_id | post_title | comment_author | comment_body | comment_score | comment_created_date | sentiment_label | sentiment_score |
|---|---|---|---|---|---|---|---|---|
| 0 | d61idj | Types and phases of clinical trials | DeanOnDelivery | What an excellent post. About the only thing I... | 1 | 2019-12-23 | positive | 0.870846 |
| 1 | 1gb033f | Trying to get a clinical trial started as a pr... | vathena | You should get off reddit and go connect with ... | 107 | 2024-10-24 | negative | 0.778827 |
| 2 | 1gb033f | Trying to get a clinical trial started as a pr... | NewBenefit6035 | How many participants, funding, starting larg... | 24 | 2024-10-24 | neutral | 0.68768 |
| 3 | 1gb033f | Trying to get a clinical trial started as a pr... | FuriousKittens | You might have some success petitioning the sp... | 19 | 2024-10-24 | neutral | 0.730971 |
| 4 | 1gb033f | Trying to get a clinical trial started as a pr... | Gazorninplat6 | It's great when private citizens want to get i... | 6 | 2024-10-24 | neutral | 0.56007 |


In [11]:
# find how many comments are positive, negative and neutral
comments_df['sentiment_label'].value_counts()

sentiment_label
neutral     5775
negative    3359
positive    2436
Name: count, dtype: int64

**Observations** 
- 75% of posts are neutral in sentiment, possibly reflecting a more informative or less subjective tone.
- Comments show a somewhat richer sentiment distribution. Even though the dominant sentiment is still neutral (~50%), negative comments (~29%) outnumber positive ones (~21%), suggesting that users may express concerns, criticisms, or negative experiences more frequently in the comment sections.
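
The percentages above follow directly from the two `value_counts()` outputs. A quick check, using the counts printed by those cells (the helper name `shares` is just for this example):

```python
# Counts copied from the value_counts() outputs for posts and comments.
post_counts = {"neutral": 75, "positive": 13, "negative": 12}
comment_counts = {"neutral": 5775, "negative": 3359, "positive": 2436}


def shares(counts):
    """Convert raw label counts to percentages rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


post_shares = shares(post_counts)
comment_shares = shares(comment_counts)
print(post_shares)
print(comment_shares)
```

In the notebook itself, `comments_df['sentiment_label'].value_counts(normalize=True)` returns the same proportions as fractions.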

### Topic Identification for Posts and Comments by Sentiment
This section examines whether specific topics (e.g., patient experiences, risks, benefits) are associated with positive, negative, or neutral sentiment in posts and comments.

Methodology
- General Topic Modeling:
    -   Apply LDA with 5 topics to all posts using CountVectorizer to preprocess the text.
- Sentiment-Specific Modeling:
    - Perform LDA separately on positive, negative, and neutral posts and comments to identify sentiment-driven themes.

The top 10 keywords for each topic are extracted, highlighting how themes vary across sentiments. These keywords can then reveal patterns related to clinical trials.


In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# General Topic Modeling without sentiment
vectorizer = CountVectorizer(max_df=0.9, min_df=10, stop_words='english')
dtm = vectorizer.fit_transform(processed_df['body_text'])

# Fit LDA model
lda = LatentDirichletAllocation(n_components=5, random_state=42)  # 5 topics
lda.fit(dtm)

# Extract topics
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}: {[vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-10:]]}")

Topic 0: ['time', 'like', 'treatment', 'patient', 'patients', 'research', 'trials', 'study', 'trial', 'clinical']
Topic 1: ['nyulangone', 'traumatic', 'symptoms', 'ptsd', 'participants', 'com', 'study', 'survey', 'https', 'org']
Topic 2: ['niaid', 'university', 'study', 'bethesda', '18', 'md', 'usa', 'clinicaltrials', 'gov', 'https']
Topic 3: ['countries', '60', 'paid', 'safety', 'age', 'factors', 'cov', 'brain', 'early', 'state']
Topic 4: ['treatment', 'cost', 'thanks', 'hospital', 'hi', 'interested', 'non', 'different', 'clinical', 'trials']


In [13]:

# Function to perform LDA and print topics for a given sentiment
def analyze_topics_for_sentiment(data, sentiment_label, n_topics=5, max_features=10):
    print(f"\nAnalyzing topics for sentiment: {sentiment_label}")
    
    # Filter data for the given sentiment
    sentiment_data = data[data['sentiment_label'] == sentiment_label]['combined_text']
    
    if sentiment_data.empty:
        print("No data available for this sentiment.")
        return
    
    # Vectorize the text data
    vectorizer = CountVectorizer(max_df=0.9, min_df=6, stop_words='english')
    dtm = vectorizer.fit_transform(sentiment_data)
    
    # Fit LDA model
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(dtm)
    
    # Extract and print topics
    for idx, topic in enumerate(lda.components_):
        print(f"Topic {idx}: {[vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-max_features:]]}")


In [14]:
#for Posts data 
processed_df['combined_text'] = processed_df['title'] + " " + processed_df['body_text']
print("Analyzing themes for Posts data")
# Analyze topics for each sentiment
for sentiment in processed_df['sentiment_label'].unique():
    analyze_topics_for_sentiment(processed_df, sentiment, n_topics=5, max_features=10)

Analyzing themes for Posts data

Analyzing topics for sentiment: neutral
Topic 0: ['nyulangone', 'traumatic', 'symptoms', 'participants', 'research', 'survey', 'https', 'tbi', 'ptsd', 'study']
Topic 1: ['niaid', 'university', 'study', 'bethesda', '18', 'md', 'usa', 'clinicaltrials', 'https', 'gov']
Topic 2: ['experience', 'help', 've', 'patients', 'patient', 'study', 'trials', 'research', 'trial', 'clinical']
Topic 3: ['site', 'study', 'phase', 'patient', 'new', 'trials', 'cancer', 'trial', 'clinical', 'treatment']
Topic 4: ['mailto', 'com', 'contact', 'brain', 'research', 'information', 'clinical', 'placebo', 'study', 'org']

Analyzing topics for sentiment: negative
Topic 0: ['study', 'clinical', 'just', 'patient', 'work', 'trials', 'trial']
Topic 1: ['patient', 'study', 'just', 'trial', 'work', 'trials', 'clinical']
Topic 2: ['study', 'trials', 'just', 'trial', 'clinical', 'patient', 'work']
Topic 3: ['trials', 'work', 'just', 'clinical', 'study', 'trial', 'patient']
Topic 4: ['clini

In [15]:
print("Analyzing themes for Comments data")
# Analyze topics for each sentiment for Comments data
comments_df['combined_text'] = comments_df['post_title'] + " " + comments_df['comment_body']
for sentiment in comments_df['sentiment_label'].unique():
    analyze_topics_for_sentiment(comments_df, sentiment, n_topics=5, max_features=10)

Analyzing themes for Comments data

Analyzing topics for sentiment: positive
Topic 0: ['trial', 'thank', 'trials', 'iqvia', 'small', 'new', 'good', 'thanks', 'research', 'clinical']
Topic 1: ['know', 'protocol', 'cros', 'people', 'good', 'cra', 'new', 'love', 'advice', 'thank']
Topic 2: ['work', 'life', 'job', 'industry', 'crc', 'years', 'cra', 'experience', 'clinical', 'research']
Topic 3: ['live', 'help', 'little', 'visits', 'icon', 'edc', 'like', 'thank', 'site', 'cra']
Topic 4: ['company', 'crc', 'people', 'good', 'just', 'sponsor', 'like', 'cro', 'job', 'work']

Analyzing topics for sentiment: negative
Topic 0: ['people', 'don', 'company', 'like', 'sponsor', 'just', 'new', 'medpace', 'cro', 'work']
Topic 1: ['thing', 'like', 'patients', 'want', 'research', 'document', 'don', 'just', 'study', 'patient']
Topic 2: ['don', 'people', 'cra', 'years', 'jobs', 'just', 'experience', 'clinical', 'research', 'job']
Topic 3: ['trial', 'just', 'patients', 'better', 'icon', 'job', 'know', 'indu

**Observations**
- Neutral Sentiment Dominance: Both posts and comments with neutral sentiment contain factual, logistics-focused discussions, reflecting the nature of clinical research communities.
- Repetition in Positive and Negative Sentiments: Positive and negative topics often repeat the same keywords, particularly in posts, which may indicate a lack of diversity in sentiment-specific discussions or simply the small sample (13 positive and 12 negative posts).
- Professional Focus in Comments: Comments across all sentiments show a strong emphasis on career-related terms such as "cra," "cro," and "sponsor." This suggests that the current method of removing career-focused or irrelevant posts is not accurate enough. Note also that `comments_df` is built from all scraped posts rather than the filtered set, so comments on career posts are still included. I might need to look into a more dynamic method that does not rely on a fixed set of words.

## Identify Most Common words or key phrases with TF-IDF
This section uses TF-IDF (Term Frequency-Inverse Document Frequency) to identify the most significant words and key phrases in posts and comments. The method highlights terms that are frequent in specific documents but uncommon across the entire dataset, making them more relevant to the content. By applying TF-IDF, common words are extracted for the overall dataset as well as for positive, negative, and neutral sentiment categories. This approach helps uncover distinct linguistic patterns and key phrases associated with each sentiment, providing deeper insights into the language used in clinical trial discussions.
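
To make the weighting concrete, here is a small hand-rolled sketch on three made-up documents. It uses the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization per document. The toy documents and helper names are assumptions for illustration only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): "clinical" appears in every document, "placebo" in one.
docs = [
    "clinical trial placebo side effects",
    "clinical trial recruitment survey",
    "clinical research job interview",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def smooth_idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(doc_tokens):
    """TF-IDF weights for one document, L2-normalized."""
    tf = Counter(doc_tokens)
    raw = {t: tf[t] * smooth_idf(t) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Within a single document, the term that appears everywhere ("clinical") gets the lowest weight and the rarer terms get the highest. Note that `apply_tfidf` below sums the weights over all documents, so a term that occurs in many documents can still end up at the top of the ranking.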

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

def apply_tfidf(df, text_column, sentiment_column=None, sentiment_value=None, max_features=100, ngram_range=(1, 1)):

    # Filter by sentiment if specified
    if sentiment_column and sentiment_value:
        filtered_df = df[df[sentiment_column] == sentiment_value]
        print(filtered_df.shape)
    else:
        filtered_df = df
    
    # Collect the text column as a list of documents (one per row) for TF-IDF
    text_data = filtered_df[text_column].dropna().tolist()
    
    # Apply TF-IDF
    vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range, stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(text_data)
    
    # Extract terms and scores
    terms = vectorizer.get_feature_names_out()
    scores = tfidf_matrix.sum(axis=0).A1  # Sum TF-IDF scores across all documents
    tfidf_df = pd.DataFrame({'term': terms, 'score': scores}).sort_values(by='score', ascending=False)
    
    return tfidf_df

In [17]:
processed_df['sentiment_label'].unique()

array(['neutral', 'negative', 'positive'], dtype=object)

In [18]:
# find overall tfidf for posts data
overall_tfidf_processed_df = apply_tfidf(processed_df, text_column='combined_text', max_features=10)
print("General TF-IDF:")
print(overall_tfidf_processed_df)

# Find the TF-IDF for each sentiment
# "positive" sentiment posts
post_tfidf_positive = apply_tfidf(processed_df, text_column='combined_text', sentiment_column='sentiment_label', sentiment_value='positive', max_features=10)
print("\nTF-IDF for Positive Sentiment:")
print(post_tfidf_positive)

# "negative" sentiment posts
post_tfidf_negative = apply_tfidf(processed_df, text_column='combined_text', sentiment_column='sentiment_label', sentiment_value='negative', max_features=10)
print("\nTF-IDF for Negative Sentiment:")
print(post_tfidf_negative)

# neutral sentiment posts
post_tfidf_neutral = apply_tfidf(processed_df, text_column='combined_text', sentiment_column='sentiment_label', sentiment_value='neutral', max_features=10)
print("\nTF-IDF for Neutral Sentiment:")
print(post_tfidf_neutral)

General TF-IDF:
             term      score
2        clinical  40.524944
8           study  27.140628
7        research  22.865330
5           https  17.831205
0              18   8.636337
4             gov   7.455405
9             usa   7.129458
3  clinicaltrials   6.563470
6              md   3.381821
1        bethesda   2.976502
(13, 15)

TF-IDF for Positive Sentiment:
       term     score
1  clinical  4.771542
8    trials  4.090183
7     trial  3.479967
6      time  2.212342
9      work  1.925192
3    people  1.920281
5      site  1.880980
4  research  1.792152
2      like  1.530756
0      best  1.389975
(12, 15)

TF-IDF for Negative Sentiment:
       term     score
8     trial  3.573427
4  patients  3.509911
0  clinical  3.504557
3   patient  3.326698
6     study  2.402698
2      just  1.679895
9      work  1.464485
5      said  1.094170
1       day  0.976687
7  subjects  0.610092
(75, 15)

TF-IDF for Neutral Sentiment:
             term      score
2        clinical  30.911476
7

In [19]:
# Most common words per sentiment for comments data 
# find overall tfidf for comments data
overall_tfidf_comments_df = apply_tfidf(comments_df, text_column='combined_text', max_features=10)
print("General TF-IDF:")
print(overall_tfidf_comments_df)

# Find the TF-IDF for each sentiment
# "positive" sentiment comments
cmt_tfidf_positive = apply_tfidf(comments_df, text_column='combined_text', sentiment_column='sentiment_label', sentiment_value='positive', max_features=10)
print("\nTF-IDF for Positive Sentiment:")
print(cmt_tfidf_positive)

# "negative" sentiment comments
cmt_tfidf_negative = apply_tfidf(comments_df, text_column='combined_text', sentiment_column='sentiment_label', sentiment_value='negative', max_features=10)
print("\nTF-IDF for Negative Sentiment:")
print(cmt_tfidf_negative)

# neutral sentiment comments
cmt_tfidf_neutral = apply_tfidf(comments_df, text_column='combined_text', sentiment_column='sentiment_label', sentiment_value='neutral', max_features=10)
print("\nTF-IDF for Neutral Sentiment:")
print(cmt_tfidf_neutral)

General TF-IDF:
         term        score
1         cra  1414.894306
9        work  1228.955042
7    research  1203.196994
0    clinical  1157.952053
5        just  1089.836702
4         job  1085.264334
6        like   982.789417
2         cro   970.992987
8        site   859.173779
3  experience   829.856807
(2436, 9)

TF-IDF for Positive Sentiment:
         term       score
1         cra  318.910229
7    research  287.190761
8       thank  268.807890
9        work  259.208195
0    clinical  257.695147
4        good  238.418979
5         job  219.308570
6        like  197.543760
2         cro  194.594254
3  experience  174.634921
(3359, 9)

TF-IDF for Negative Sentiment:
       term       score
5      just  403.965403
9      work  391.446134
1       cra  384.126386
4       job  381.352030
3       don  374.093040
6      like  335.887736
2       cro  289.411828
7    people  285.759602
8  research  280.845648
0  clinical  256.371900
(5775, 9)

TF-IDF for Neutral Sentiment:
         ter

**Observations**:
- Dominant terms like "clinical," "study," and "research" in posts indicate a strong emphasis on clinical trials and research participation. The data also reveals several posts discussing the latest clinical trials actively recruiting participants.
- In both posts and comments, positive sentiment emphasizes terms like "clinical," "trials," "research," and "thank," suggesting some optimism towards trials and research.
- Negative comments frequently use terms like "job," "work," "don," and "cra." As noted above, this means that my method for removing career-oriented posts is not foolproof.


### Identify Interest Level
This section addresses the question: "How can we identify users who are most interested in clinical trials based on their activity, sentiment, and intent?" To determine interest levels, a scoring system was created that combines multiple metrics, each with assigned weights, to rank users based on their likelihood of engaging in clinical trials.

Metrics Used in Scoring:

1. User Activity (30% weight):
    - Users who post or comment more frequently receive higher scores.
    - Posting/commenting activity is normalized, and both types are combined into a total activity score.

2. Sentiment-Based Weighting (20% weight):
    - Positive posts/comments add to the score.
    - Negative posts/comments detract from the score.
    - Neutral activity neither adds to nor subtracts from the score; it only pulls the average toward zero.

3. Intent Detection (50% weight):
    - Posts with strong intent phrases like "want to join," "looking for trials," or "how to participate" are scored highest.
    - Intent detection applies only to posts due to resource constraints and its strong impact on relevance.

A scoring function combines the three metrics into a single interest score using the formula:

$$\text{Interest Score} = w_1 \times \text{Activity Score} + w_2 \times \text{Sentiment Score} + w_3 \times \text{Intent Score}$$

Weights: $w_1 = 0.3$, $w_2 = 0.2$, $w_3 = 0.5$

**Notes**: 
- Activity Score: Calculated using normalized posting/commenting frequency.
- Sentiment Score: The number of positive posts/comments minus the number of negative ones, divided by the user's total activity, giving a value between -1 and 1.
- Intent Score: Determined by averaging the mapped values of intent levels for each user, where "strong" intent is scored as 1.0, "weak" intent as 0.5, and "no interest" as 0.0.
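
As a worked example of the formula, the sketch below scores a single user by hand. The helper names are made up for this example; the weights and the high/low cutoffs (above 0.7 and below 0.3) are the ones used in the notebook code, and the activity range of 1 to 320 is taken from the `user_stats` output shown later. The example user has one positive post with strong intent and two positive comments.

```python
W_ACTIVITY, W_SENTIMENT, W_INTENT = 0.3, 0.2, 0.5


def activity_score(total, min_total, max_total):
    """Min-max normalization of total activity to the range [0, 1]."""
    if max_total == min_total:
        return 0.0  # assumed fallback when every user has the same activity
    return (total - min_total) / (max_total - min_total)


def sentiment_score(positive, negative, total):
    """Net positive share of a user's posts and comments, in [-1, 1]."""
    return (positive - negative) / total if total else 0.0


def interest_score(activity, sentiment, intent):
    """Weighted sum of the three metrics."""
    return W_ACTIVITY * activity + W_SENTIMENT * sentiment + W_INTENT * intent


def interest_level(score):
    """Map a score to high / medium / low using the notebook's cutoffs."""
    if score > 0.7:
        return "high"
    if score < 0.3:
        return "low"
    return "medium"


activity = activity_score(total=3, min_total=1, max_total=320)
sentiment = sentiment_score(positive=3, negative=0, total=3)
score = interest_score(activity, sentiment, intent=1.0)
print(round(score, 6), interest_level(score))
```

Because intent carries half of the weight and activity is normalized against the most active user, a user with a single strong-intent positive post lands at exactly 0.7, which falls just short of the "high" cutoff.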

In [20]:
from pydantic import BaseModel
client= OpenAI()
# Define the structured output with Pydantic
class InterestIntent(BaseModel):
    intent: str  # Values: 'strong', 'weak', 'no interest'

# Function to parse response
def classify_intent(text: str):
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # Adjust for your GPT version
        messages=[
            {"role": "system", "content": "Classify the user's intent into strong, weak, or no interest based on the text"},
            {"role": "user", "content": text},
        ],
        response_format=InterestIntent,  # Use the Pydantic model
    )
    return completion.choices[0].message.parsed.intent

# Example Input
post_text = "I would love to join this clinical trial if it meets my conditions. Can you tell me more about the process?"

# Classify the intent
result = classify_intent(post_text)
result

'strong'

In [21]:
processed_df['intent'] = processed_df['combined_text'].apply(classify_intent)

In [22]:
from sklearn.preprocessing import MinMaxScaler
#calculate interest level 
# w1: activity weight, w2: sentiment weight, w3: intent weight
w1, w2, w3 = (0.3, 0.2, 0.5) 
    
# Aggregate posts data
posts_agg = processed_df.groupby('author').agg(
        post_count=('author', 'size'),
        positive_posts=('sentiment_label', lambda x: (x == 'positive').sum()),
        neutral_posts=('sentiment_label', lambda x: (x == 'neutral').sum()),
        negative_posts=('sentiment_label', lambda x: (x == 'negative').sum()),
        avg_intent_score=('intent', lambda x: x.map({'strong': 1.0, 'weak': 0.5, 'no interest': 0.0}).mean())
    ).reset_index()

# Aggregate comments data
comments_agg = comments_df.groupby('comment_author').agg(
    comment_count=('comment_author', 'size'),
    positive_comments=('sentiment_label', lambda x: (x == 'positive').sum()),
    neutral_comments=('sentiment_label', lambda x: (x == 'neutral').sum()),
    negative_comments=('sentiment_label', lambda x: (x == 'negative').sum())
).reset_index()


In [23]:
comments_agg.head()

| | comment_author | comment_count | positive_comments | neutral_comments | negative_comments |
|---|---|---|---|---|---|
| 0 | 000Jelly | 1 | 0 | 1 | 0 |
| 1 | 0123_456_789 | 1 | 0 | 1 | 0 |
| 2 | 100percentmillenial | 1 | 0 | 1 | 0 |
| 3 | 109genp_fully | 2 | 0 | 0 | 2 |
| 4 | 10brat | 1 | 0 | 0 | 1 |


In [24]:
posts_agg.head()

| | author | post_count | positive_posts | neutral_posts | negative_posts | avg_intent_score |
|---|---|---|---|---|---|---|
| 0 | AcanthisittaSea6459 | 1 | 0 | 1 | 0 | 0.5 |
| 1 | Accomplished_Cat5475 | 1 | 0 | 1 | 0 | 1.0 |
| 2 | AnonymousStrawb | 1 | 0 | 1 | 0 | 1.0 |
| 3 | Appropriate-Tear4783 | 1 | 0 | 1 | 0 | 1.0 |
| 4 | CNDLab | 2 | 0 | 2 | 0 | 0.75 |


In [25]:
# user stats: 
user_stats = pd.merge(posts_agg, comments_agg, left_on='author',right_on='comment_author', how='outer').fillna(0)

# Total activity: sum of posts and comments
user_stats['total_activity'] = user_stats['post_count'] + user_stats['comment_count']

# Normalize activity score
scaler = MinMaxScaler(feature_range=(0, 1))
user_stats['activity_score'] = scaler.fit_transform(user_stats[['total_activity']])

total_positive = user_stats['positive_posts'] + user_stats['positive_comments']
total_negative = user_stats['negative_posts'] + user_stats['negative_comments']
total_count = user_stats['total_activity']
user_stats['sentiment_score'] = ((total_positive - total_negative) / total_count).fillna(0)

# Calculate the interest score as a weighted sum of activity, sentiment, and intent
user_stats['interest_score'] = (
    w1 * user_stats['activity_score'] +
    w2 * user_stats['sentiment_score'] +
    w3 * user_stats['avg_intent_score']
)
# determine the interest level
user_stats['interest_level'] = user_stats['interest_score'].apply(lambda x: 'high' if x > 0.7 else 'low' if x < 0.3 else 'medium')
user_stats= user_stats.sort_values(by='interest_score', ascending=False).reset_index(drop=True)
user_stats.head()

| | author | post_count | positive_posts | neutral_posts | negative_posts | avg_intent_score | comment_author | comment_count | positive_comments | neutral_comments | negative_comments | total_activity | activity_score | sentiment_score | interest_score | interest_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | | 319.0 | 28.0 | 245.0 | 46.0 | 320.0 | 1.0 | -0.053125 | 0.789375 | high |
| 1 | JamieCFlores | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | JamieCFlores | 2.0 | 2.0 | 0.0 | 0.0 | 3.0 | 0.00627 | 1.0 | 0.701881 | high |
| 2 | OtherwiseSlip6516 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.7 | medium |
| 3 | lolita2805 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.7 | medium |
| 4 | mwrig2 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.7 | medium |
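
To make the scoring easier to check by hand, the sketch below recomputes the interest score for a single user outside pandas. The helper names `interest_score` and `interest_level` are hypothetical and not part of the notebook. The minimum and maximum activity (1 and 320) are read off the table above, where `activity_score` is 0.0 at `total_activity` 1 and 1.0 at 320. Note that `sentiment_score` lies in [-1, 1] rather than [0, 1], and that users who only commented receive `avg_intent_score` 0 because of the `fillna(0)` after the outer merge.

```python
# Weights copied from the scoring cell: activity, sentiment, intent
W_ACTIVITY, W_SENTIMENT, W_INTENT = 0.3, 0.2, 0.5


def interest_score(total_activity, positive, negative, avg_intent,
                   min_activity, max_activity):
    """Recompute one user's interest score (hypothetical helper)."""
    # Min-max scaling of a single column, as MinMaxScaler does with range (0, 1)
    activity = (total_activity - min_activity) / (max_activity - min_activity)
    # Net share of positive items among all of the user's posts and comments
    sentiment = (positive - negative) / total_activity
    return W_ACTIVITY * activity + W_SENTIMENT * sentiment + W_INTENT * avg_intent


def interest_level(score):
    """Same thresholds as the notebook: above 0.7 is high, below 0.3 is low."""
    return "high" if score > 0.7 else "low" if score < 0.3 else "medium"


# JamieCFlores in the table above: 1 post + 2 comments, all positive, intent 1.0
jamie = interest_score(3, positive=3, negative=0, avg_intent=1.0,
                       min_activity=1, max_activity=320)
print(round(jamie, 6), interest_level(jamie))
# 0.701881 high
```

Because intent carries half of the weight and activity is scaled against the most active account, a user with a single positive, strong-intent post scores about 0.7, which sits right at the boundary between medium and high.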


## Personalized Message based on Users 
We first select users with a medium or high interest level. The code below then shows how to generate a personalized message for a user from all of their prior posts and comments in the subreddit.

1. **Aggregate User Content:**  
   - For a given user name `user`, all text from the `combined_text` column in the `user_content` DataFrame is combined into a single string.  
   - The aggregated text covers the user's own posts, their comments, and the text of the post each comment replies to.

2. **Extract Demographic Data:**  
   - The aggregated text is processed using GPT and a predefined prompt to extract key user details:
     - Gender, age, research topics, health conditions, location, etc.
   - The extracted data is returned in a structured JSON format.

3. **Generate Personalized Message:**  
   - The extracted data is used to create a tailored message emphasizing the user’s suitability for a clinical trial.
   - Fallback values are applied to ensure the message remains complete if certain details are unavailable.

**Note:** Many users do not mention any personal details, so it is normal for most or all extracted fields to come back as "Null".
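
The fallback step can be sketched as a small lookup, shown here in isolation. This is an illustration only: the `FALLBACKS` table and the `apply_fallbacks` helper are hypothetical names, and the fallback phrases mirror the defaults used in this notebook's message template.

```python
# Hypothetical fallback phrases, one per field used in the message template
FALLBACKS = {
    "mentioned_research_topic": "medical research",
    "health_condition": "your health condition",
    "medication": "your current treatment",
    "location": "your area",
}


def apply_fallbacks(user_info: dict) -> dict:
    """Return the message fields, using a fallback when a value is missing or "Null"."""
    filled = {}
    for key, default in FALLBACKS.items():
        value = user_info.get(key)
        filled[key] = default if value in (None, "", "Null") else value
    return filled


extracted = {"health_condition": "Type 2 Diabetes", "medication": "Null", "location": "New York"}
message_fields = apply_fallbacks(extracted)
print(message_fields["health_condition"], "|", message_fields["medication"], "|",
      message_fields["mentioned_research_topic"])
# Type 2 Diabetes | your current treatment | medical research
```

Keeping the defaults in one table makes it harder for a `.get()` default and a "Null" replacement to drift apart.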

In [26]:
comments_tmp = comments_df.rename(columns={'combined_text': 'comment_combined_text'})
# first join comments and posts data
comments_with_post_info = pd.merge(
    processed_df[['combined_text']].reset_index(), 
    comments_tmp, 
    left_on='index', 
    right_on='post_id', 
    how='inner'
)

comments_with_post_info['comments_n_post_text'] = comments_with_post_info['combined_text'] + " " + comments_with_post_info['comment_combined_text']

In [27]:
# users with high or medium interest
interest_user = user_stats[user_stats['interest_level'].isin(['high', 'medium'])]

# filter only users with interest

comments_tmp = comments_df.rename(columns={'combined_text': 'comment_combined_text'})
comments_tmp = comments_tmp[comments_tmp['comment_author'].isin(interest_user['author'])]
processed_df_tmp = processed_df[processed_df['author'].isin(interest_user['author'])]

# first join comments and posts data
comments_with_post_info = pd.merge(
    processed_df_tmp[['combined_text']].reset_index(), 
    comments_tmp, 
    left_on='index', 
    right_on='post_id', 
    how='inner'
)
# join the comments and posts info together to create context for the comments
comments_with_post_info['comments_n_post_text'] = comments_with_post_info['combined_text'] + " " + comments_with_post_info['comment_combined_text']

# Filter posts and comments for the users in the list
posts = processed_df_tmp[['author', 'combined_text']].copy()
comments = comments_with_post_info[['comment_author', 'comments_n_post_text']].copy()

# Add the content_type column to distinguish the source
posts['content_type'] = 'post'
comments['content_type'] = 'comments'


# Rename columns for consistency before merging
posts.rename(columns={'author': 'user'}, inplace=True)
comments.rename(columns={'comment_author': 'user', 'comments_n_post_text':'combined_text'}, inplace=True)

# Merge the two DataFrames
user_content = pd.concat([posts, comments], ignore_index=True).reset_index(drop=True)

# Check the size of the combined posts-and-comments table
print("Data shape: ", user_content.shape)


Data shape:  (192, 3)


In [28]:
print("Number of unique users with medium or high interest in joining a study:", user_content['user'].nunique())

Number of unique users with medium or high interest in joining a study: 59


In [29]:
all_users_present=interest_user['author'].isin(user_content['user']).all()

# Check that every user in interest_user also appears in user_content
print("All users in interest_user are in user_content:", all_users_present)

All users in interest_user are in user_content: True


In [30]:
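from openai import OpenAI
from pydantic import BaseModel

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()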

prompt = """
You are tasked with extracting specific user information from a given set of posts and comments. Analyze the provided text and extract the following details:

1. **Gender:** Extract the user's gender if explicitly mentioned or can be inferred from pronouns, phrases, or statements in the text (e.g., "male", "female", "non-binary"). If not mentioned, return "Null".

2. **Age:** Extract the user's exact age or age range if explicitly mentioned in the text (e.g., "25 years old", "in my thirties"). Turn the text into integer, for example "25 years old" to 25. If not mentioned, return "Null".

3. **Mentioned Research Topics:** Identify any **specific fields or areas of research** that the user explicitly discusses or shows interest in through their posts or comments. These may include fields such as "neuroscience," "cancer," "cardiovascular health," or "dermatology."  
   - **Do not include generic terms** like "clinical trials," "research," "studies," or "research participation."
   - **Only extract meaningful research fields.** If none are explicitly mentioned, return "Null."

4. **Language:** Determine the user's language preference if explicitly mentioned in the text (e.g., "I prefer communicating in Spanish"). If not mentioned, return "Null".

5. **Health Condition:** Extract any specific health conditions or diseases mentioned by the user (e.g., "diabetes", "hypertension"). If none are mentioned, return "Null".

6. **Medication:** Extract any medications mentioned by the user (e.g., "metformin", "insulin"). If none are mentioned, return "Null".

7. **Treatment:** Identify any treatments the user mentions undergoing or considering (e.g., "radiation therapy", "physical therapy"). If not mentioned, return "Null".

8. **Location:** Extract any location details mentioned by the user (e.g., "Boston, MA", "United States"). If not mentioned, return "Null".

---

**Important Notes:**
- **Do not include generic terms** like "clinical trials," "research," "studies," or "research participation" in the `mentioned_research_topic` field.
- Only extract information explicitly stated in the text. Do not infer or assume any details not directly mentioned.
- Return the result in the following structured JSON format:

```json
{
  "gender": "Extracted or Null",
  "age": "Extracted or Null",
  "mentioned_research_topic": "Extracted topics or Null",
  "language": "Extracted or Null",
  "health_condition": "Extracted or Null",
  "medication": "Extracted or Null",
  "treatment": "Extracted or Null",
  "location": "Extracted or Null"
}
```
"""
# Send the user's text to the model and return its raw reply (a JSON string, usually inside a Markdown code fence)
def extract_user_info(text: str, prompt: str):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # Adjust for your GPT version
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ]
    )
    return completion.choices[0].message.content



In [31]:
# Example Input
post_text = "I am a 35-year-old male living in New York. I've been diagnosed with Type 2 Diabetes and currently take metformin. I’m interested in participating in clinical trials. I speak English."

# Extract the user information
result = extract_user_info(post_text, prompt)
result

'```json\n{\n  "gender": "male",\n  "age": 35,\n  "mentioned_research_topic": "Null",\n  "language": "English",\n  "health_condition": "Type 2 Diabetes",\n  "medication": "metformin",\n  "treatment": "Null",\n  "location": "New York"\n}\n```'
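
The raw reply above is a string wrapped in a Markdown code fence, so it has to be cleaned before it can be used. As a hedged sketch of one way to do this with the standard library only, the block below pulls out the JSON object, keeps the expected keys, and turns the prompt's "Null" sentinel into `None`. The `UserInfo` dataclass and `parse_user_info` function are hypothetical names introduced for illustration; they are not part of the notebook.

```python
import json
import re
from dataclasses import dataclass, fields
from typing import Optional, Union


@dataclass
class UserInfo:
    """Fields requested by the extraction prompt; None means 'not mentioned'."""
    gender: Optional[str] = None
    age: Optional[Union[int, str]] = None
    mentioned_research_topic: Optional[str] = None
    language: Optional[str] = None
    health_condition: Optional[str] = None
    medication: Optional[str] = None
    treatment: Optional[str] = None
    location: Optional[str] = None


def parse_user_info(reply: str) -> UserInfo:
    """Parse a model reply that may or may not be wrapped in a code fence."""
    # Take the outermost {...} span, which skips any fence or text around it
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    raw = json.loads(match.group(0))
    known = {f.name for f in fields(UserInfo)}
    # Drop unexpected keys and map the "Null" sentinel to None
    cleaned = {k: (None if v in ("Null", None, "") else v)
               for k, v in raw.items() if k in known}
    return UserInfo(**cleaned)


fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
reply = fence + 'json\n{"gender": "male", "age": 35, "mentioned_research_topic": "Null", "location": "New York"}\n' + fence
info = parse_user_info(reply)
print(info.age, info.mentioned_research_topic, info.location)
# 35 None New York
```

Searching for the braces avoids removing the substring "json" from inside the extracted values, which a plain string replacement would do.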

In [32]:
import json
import re

def create_personalized_message(result):
    # Remove the Markdown code fence that the model wraps around its JSON reply
    cleaned_result = re.sub(r"^```(?:json)?\s*|\s*```$", "", result.strip())
    user_info = json.loads(cleaned_result)

    # Process and extract information
    gender = user_info.get("gender", "participant")
    gender = "participant" if gender == "Null" else gender

    age = user_info.get("age", "your age group")
    age = "your age group" if age == "Null" else age

    location = user_info.get("location", "your area")
    location = "your area" if location == "Null" else location

    research_topic = user_info.get("mentioned_research_topic", "medical research")
    research_topic = "medical research" if research_topic == "Null" else research_topic

    health_condition = user_info.get("health_condition", "your health condition")
    health_condition = "your health condition" if health_condition == "Null" else health_condition

    medication = user_info.get("medication", "your current treatment")
    medication = "your current treatment" if medication == "Null" else medication

    # Generate the personalized message
    message = f"""
    Hi future participant,
    
    Thank you for your interest in advancing medical research! Based on your profile, we believe you might be an excellent candidate for a clinical trial focusing on {research_topic}.
    
    This trial is designed for individuals with {health_condition} in {location}. Your experience with {medication} could provide invaluable insights.
    
    If you’re interested in participating, please visit [trial link] or contact us at [contact info] to learn more.
    
    Your contribution could help shape the future of {research_topic} care and treatment.
    
    Best regards,
    Clinical Trial Team
    """
    return message.strip()

# Generate the personalized message
personalized_message = create_personalized_message(result)
print(personalized_message)


Hi future participant,
    
    Thank you for your interest in advancing medical research! Based on your profile, we believe you might be an excellent candidate for a clinical trial focusing on medical research.
    
    This trial is designed for individuals with Type 2 Diabetes in New York. Your experience with metformin could provide invaluable insights.
    
    If you’re interested in participating, please visit [trial link] or contact us at [contact info] to learn more.
    
    Your contribution could help shape the future of medical research care and treatment.
    
    Best regards,
    Clinical Trial Team


In [33]:
# Shape of the combined content table for users with medium or high interest
user_content.shape

(192, 3)

In [34]:
# Example: Aggregating and processing user data
user_name = "JamieCFlores" # high level of interest
user_context = " ".join(user_content[user_content['user'] == user_name]['combined_text'].astype(str))
print(user_context)



Clinical Trials for all Good evening Everyone,

I just recently joined the sub-reddit, and I've been more of a casual browser on Reddit than an active user. But I reach out to you today, because as a member of this channel I'm sure we all share at least one common interest, and that's Clinical Trials. 

I myself am based out of Tokyo, and am a rookie in the field, but I am really motivated and impassioned about the topic!

My objective is simple. With people being a little more in tune with clinical trial information (COVID pushing it into the public eye and radar a lot more), I feel like this is an ideal time to make some information accessible. I'm not talking about fantastic 3 hour seminars on YouTube that industry professionals would enjoy. I'm thinking a little more PR, layman, pop-science access to the massive mountain of information and complexity that is clinical trials.  It be cool to have non-corporate・non-PR industry perspectives, or fire side chat, or dialogue, or breakdown

In [35]:
# Extract demographic details
user_info = extract_user_info(user_context, prompt)
print(user_info)

# Generate personalized message
personalized_message = create_personalized_message(user_info)
print(personalized_message)

```json
{
  "gender": "Null",
  "age": "Null",
  "mentioned_research_topic": "Null",
  "language": "Null",
  "health_condition": "Null",
  "medication": "Null",
  "treatment": "Null",
  "location": "Tokyo"
}
```
Hi future participant,
    
    Thank you for your interest in advancing medical research! Based on your profile, we believe you might be an excellent candidate for a clinical trial focusing on medical research.
    
    This trial is designed for individuals with your health condition in Tokyo. Your experience with your current treatment could provide invaluable insights.
    
    If you’re interested in participating, please visit [trial link] or contact us at [contact info] to learn more.
    
    Your contribution could help shape the future of medical research care and treatment.
    
    Best regards,
    Clinical Trial Team


**Observation**

- Asking the model for a fixed JSON schema makes the extracted user information straightforward to parse. The approach still has two weaknesses. First, the model can hallucinate details that the user never stated. Second, the aggregated context is inefficient: each comment carries the full text of its parent post, so a user with several comments on one post sends the same post text to the model repeatedly.
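
One possible way to reduce the repetition is to group a user's comments by parent post and include each post's text only once. The sketch below is written in plain Python rather than against the notebook's DataFrames; `build_user_context` and the `(post_text, comment_text)` pair layout are assumptions made for illustration.

```python
def build_user_context(items):
    """Join one user's content, including each parent post's text only once.

    `items` is a list of (post_text, comment_text) pairs; comment_text is
    None when the item is the user's own post (hypothetical layout).
    """
    comments_by_post = {}  # dicts preserve insertion order in Python 3.7+
    for post_text, comment_text in items:
        comments = comments_by_post.setdefault(post_text, [])
        if comment_text:
            comments.append(comment_text)

    parts = []
    for post_text, comments in comments_by_post.items():
        parts.append(post_text)  # the post text appears once
        parts.extend(comments)   # followed by the user's comments on that post
    return " ".join(parts)


items = [
    ("Post A text.", "First comment."),
    ("Post A text.", "Second comment."),
    ("Post B text.", None),
]
print(build_user_context(items))
# Post A text. First comment. Second comment. Post B text.
```

With the notebook's current approach the same input would contain "Post A text." twice, and the duplication grows with every additional comment on the same post.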