#### Creating a dataframe of relevant posts from the Kaggle dataset

Since the final plan is to use the scraped data, this will not be used in the final project. However, in order to develop and benchmark models for sentiment analysis and consensus, it is helpful to have a large collection of posts which are relevant.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot  as plt

In [2]:
df1=pd.read_csv('../data/mental_disorders_reddit.csv')
df1=df1.dropna(how='any')
# Dropping the data points with null values 
df1 = df1.dropna(how = 'any', axis = 0)
# lowercasing the column names so it will be easier for access ^^
df1.columns = df1.columns.str.lower()

In [3]:
# Step 1: Changing to Lower Case
df1['selftext'] = df1['selftext'].str.lower()

# Step 2: Replacing the Repeating Pattern of '&#039;'
df1['selftext'] = df1['selftext'].str.replace("&#039;", "")

# Step 3: Removing All Special Characters
df1['selftext'] = df1['selftext'].str.replace(r'[^\w\d\s]', '')

# Step 4: Removing Leading and Trailing Whitespaces
df1['selftext'] = df1['selftext'].str.strip()

# Step 5: Replacing Multiple Spaces with Single Space
df1['selftext'] = df1['selftext'].str.replace(r'\s+', ' ')

In [4]:
# Assuming 'selftext' is one of the columns you expect in df1
# You should check the actual columns in your DataFrame
# Make sure to load your DataFrame properly before running these operations

# Check if 'selftext' is in the columns
if 'selftext' in df1.columns:
    # Drop rows where 'selftext' is '[removed]' or '\[removed\]'
    df1.drop(df1[(df1['selftext'] =='\\[removed\\]')].index, inplace=True)
    df1.drop(df1[(df1['selftext'] =='[removed]')].index, inplace=True)

    df1.drop(df1[(df1['selftext'] =='\\[deleted\\]')].index, inplace=True)
    df1.drop(df1[(df1['selftext'] =='[deleted]')].index, inplace=True)


    # Drop rows with missing values
    df1.dropna(inplace=True)


    # Randomly sample 2 rows
    print(df1.sample(2))
else:
    print("'selftext' column not found in DataFrame")

                                                    title  \
307043  I think I might kill myself. I can't take it a...   
257115  did he even love me or was it all a manic epis...   

                                                 selftext  created_utc  \
307043  idk if i will, but there's this bridge that i ...   1662091898   
257115  a few months ago, i started seeing someone. we...   1663631540   

        over_18   subreddit  
307043    False  depression  
257115    False     bipolar  


In [5]:
def utctodatetime(utc):
    return datetime.fromtimestamp(utc)

In [6]:
df1['date_created'] = pd.to_datetime(df1['created_utc'], unit='s')

In [7]:
df_bpd = df1[df1['subreddit'] == 'BPD']

In [12]:
start_date = '2014-01-01'
end_date = '2022-01-01'
df = df_bpd[(df_bpd['date_created'] >= start_date) & (df_bpd['date_created'] < end_date)]

In [14]:
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, make_scorer, accuracy_score
from sklearn.metrics import recall_score, precision_score
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

In [46]:
# Combine text columns into a single column
df['combined_text'] = df['title'] + ' ' + df['selftext']

# Punctuation removal
def remove_punctuation(text):
    return ''.join([char for char in text if char not in string.punctuation])

# Text preprocessing function
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    
    # Tokenization
    tokens = remove_punctuation(text).split()
    
    # Lowercase and remove stopwords
    tokens = [word.lower() for word in tokens if word.lower() not in stop_words]
    
    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return ' '.join(tokens)

# Apply preprocessing to the combined text column
df['processed_text'] = df['combined_text'].apply(preprocess_text)
df['processed_text'] = df['processed_text'].apply(remove_punctuation)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['combined_text'] = df['title'] + ' ' + df['selftext']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['processed_text'] = df['combined_text'].apply(preprocess_text)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['processed_text'] = df['processed_text'].apply(remove_punctuation)


#### Now we run the relevance model to find the relevant posts

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from scipy.sparse import hstack, csr_matrix
import numpy as np

class TextRelevanceModel:
    def __init__(self, keyword_categories, negative_keywords=None, model=None):
        self.keyword_categories = keyword_categories
        self.negative_keywords = negative_keywords if negative_keywords is not None else []
        self.vectorizer = TfidfVectorizer()
        self.model = model if model is not None else LogisticRegression(max_iter=1000, random_state=42)
    
    def keyword_count(self, text, keywords):
        return sum(1 for word in text.lower().split() if word in keywords)
    
    def prepare_data(self, df, text_column):
        df[text_column] = df[text_column].str.lower()

        keyword_counts = pd.DataFrame()
        
        for category, keywords in self.keyword_categories.items():
            keyword_counts[category + '_count'] = df[text_column].apply(lambda x: self.keyword_count(x, keywords))
        
        keyword_counts['negative_keyword_count'] = df[text_column].apply(lambda x: self.keyword_count(x, self.negative_keywords))
        
        X_text = self.vectorizer.fit_transform(df[text_column])
        X_keywords = keyword_counts.to_numpy()
        X = hstack([X_text, csr_matrix(X_keywords)])
        
        return X
    
    def train(self, X, y):
        keyword_present = np.any(X[:, -len(self.keyword_categories):].toarray(), axis=1)
        X_train = X[keyword_present]
        y_train = y[keyword_present]
        
        self.model.fit(X_train, y_train)
    
    def predict_proba(self, text):
        X_text_new = self.vectorizer.transform([text])
        keyword_counts_new = np.array([[self.keyword_count(text, keywords) for keywords in self.keyword_categories.values()]])
        negative_keyword_count_new = np.array([[self.keyword_count(text, self.negative_keywords)]])
        X_new = hstack([X_text_new, csr_matrix(keyword_counts_new), csr_matrix(negative_keyword_count_new)])
        
        if not np.any(keyword_counts_new):
            return 0.0
        
        return self.model.predict_proba(X_new)[0, 1]


In [17]:
# Example usage
if __name__ == "__main__":
    # Importing and dropping rows from Frame

    df_coded = pd.read_csv('../data/processed_and_coded_posts.csv')
    df2 = df_coded[['processed_text','highly_relevant']]
    
    #Importing keywords

    csv_file_path = '../keywords/medications.csv'

    # Read the CSV file
    df_med = pd.read_csv(csv_file_path)

    # Extract the first column as a list of keywords
    medications = df_med.iloc[:, 0].tolist()

    csv_file_path_2 = '../keywords/Treatment.csv'

    # Read the CSV file
    df_therapy = pd.read_csv(csv_file_path_2)

    # Extract the first column as a list of keywords
    therapy = df_therapy.iloc[:, 0].tolist()

    general_keywords = ['diagnose', 'diagnosed', 'dosage','dose', 'drug', 'drugs', 'harming', 'med', 'medication', 'medicine', 'medicines', 'meds', 'prescribe', 'prescribed', 'psychiatrist', 'psychiatrists', 'psychotherapy', 'recovery', 'session', 'therapist', 'therapists', 'therapy', 'treatment']

    # Define categories of keywords
    
    keyword_categories = {
    'general_keywords': general_keywords,
    'medications': medications,
    'therapy': therapy
}
    
    # Define negative keywords
    #negative_keywords = ['relationship', 'friend', 'together', 'fp', 'people', 'person', 'partner', 'dating']
    negative_keywords_2 = []


    # Create an instance of the model
    model = TextRelevanceModel(keyword_categories,negative_keywords_2)
    



    #Prepare data
    X = model.prepare_data(df2, text_column='processed_text')
    y = df2['highly_relevant']

    # Train the model
    model.train(X, y)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[text_column] = df[text_column].str.lower()


Now we run the model on the full set of data.

In [57]:
second_title = df.iloc[2]['processed_text']
print(second_title)

help loved one seek treatment therapy counseling made made decide take step towards treatment wife bpd trait im trying diagnose rough childhood marriage suffering trying place blame disregard fault issue bpd seems thing make sense gently suggested talking professional good term get disagreement anger make productive conversation impossible find option professional intervention refuse help refuse acknowledge past effect current life situation would like move forward


In [18]:
df['relevance_probability']=0.0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['relevance_probability']=0.0


In [34]:
model.predict_proba('This is a random string with dbt in it.')

0.15687799477401465

In [51]:
df_coded.processed_text[1]

'feel like i’m losing battle nothing helping i’ve struggled depression since 1213 year it’s progressively got worse lead bpd diagnosis recently i’ve really well really tough year thing expect someone go attempt losing friend addiction etc recently i’ve breaking much honestly don’t know parent scared feel like friend would don’t tell anyone anymore struggle it’s worth don’t want pity party boyfriend help way doesn’t provide comfort reassurance i’ll okay he’s straight thinking give logical reason problem issue know it’s way helping doesn’t provide give help ‘i love u it’s gonna okay u me’ would give feel like one take seriously genuinely fucking depressed don’t know much longer please help someone talk much going head it’s slowly killing though finally getting better'

In [52]:
df.head()

Unnamed: 0,title,selftext,created_utc,over_18,subreddit,date_created,combined_text,processed_text,relevance_probability
754,How Do I Not Ruin Relationships?,i started talking to someone and it already fe...,1640894272,False,BPD,2021-12-30 19:57:52,How Do I Not Ruin Relationships? i started tal...,ruin relationship started talking someone alre...,0.0
757,Is it ever a good idea to tell your FP that th...,"i’m not sure if i even want to do this, but i’...",1640892801,False,BPD,2021-12-30 19:33:21,Is it ever a good idea to tell your FP that th...,ever good idea tell fp fp i’m sure even want i...,0.0
759,How to help loved one seek treatment,those of you who who are in or have been in th...,1640891824,False,BPD,2021-12-30 19:17:04,How to help loved one seek treatment those of ...,help loved one seek treatment therapy counseli...,0.0
761,I’m proud of myself for ending a relationship ...,"yes, i did make a post the other day and yes, ...",1640891085,False,BPD,2021-12-30 19:04:45,I’m proud of myself for ending a relationship ...,i’m proud ending relationship wasn’t good yes ...,0.0
763,Do I have BPD?,how do i know which kind of personality disord...,1640889998,False,BPD,2021-12-30 18:46:38,Do I have BPD? how do i know which kind of per...,bpd know kind personality disorder want work n...,0.0


In [58]:
# Function to get the relevance probability for a given text
def get_relevance_probability(text):
    return model.predict_proba(text)

# Apply the function to each row in the 'processed_text' column and create the new column
df['relevance_probability'] = df['processed_text'].apply(get_relevance_probability)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['relevance_probability'] = df['processed_text'].apply(get_relevance_probability)


In [59]:
df.to_csv('../data/posts_with_relevance_scores.csv', index=False)

In [60]:
df=df[df['relevance_probability']>.05]

In [63]:
df.sample(20)

Unnamed: 0,title,selftext,created_utc,over_18,subreddit,date_created,combined_text,processed_text,relevance_probability
57668,I’m genuinely struggling to see myself in the ...,24yo asian working in management consulting an...,1624966564,False,BPD,2021-06-29 11:36:04,I’m genuinely struggling to see myself in the ...,i’m genuinely struggling see future 24yo asian...,0.108607
147698,I'm looking for some help for a friend,"long story short, i am helping a friend with s...",1575512032,False,BPD,2019-12-05 02:13:52,I'm looking for some help for a friend long st...,im looking help friend long story short helpin...,0.277355
218323,How to co-parent with someone with (possible) BPD,hi. my wife and i separated about 4 months ago...,1494819627,False,BPD,2017-05-15 03:40:27,How to co-parent with someone with (possible) ...,coparent someone possible bpd hi wife separate...,0.260959
176160,I made my gf mad for the first time in 5 years...,"hi, first time poster, please let me know if ...",1551022086,False,BPD,2019-02-24 15:28:06,I made my gf mad for the first time in 5 years...,made gf mad first time 5 year meltdown last ni...,0.252895
153807,I wanna feel emotional again I feel so numb,i remember crying a lot when i was a teen but ...,1599928573,False,BPD,2020-09-12 16:36:13,I wanna feel emotional again I feel so numb i ...,wanna feel emotional feel numb remember cry lo...,0.4827
97219,Do they ever think of you as much as you think...,i really do wonder. i’ve known this woman for ...,1623025813,True,BPD,2021-06-07 00:30:13,Do they ever think of you as much as you think...,ever think much think really wonder i’ve known...,0.103134
100418,Feeling like you’re a burden to your therapist?,i doubt this is strictly people with bpd but i...,1634079356,False,BPD,2021-10-12 22:55:56,Feeling like you’re a burden to your therapist...,feeling like you’re burden therapist doubt str...,0.115631
225612,Diagnosed at age 38 during the worst time in m...,hello fellow friends.\ni have been part of the...,1458255278,False,BPD,2016-03-17 22:54:38,Diagnosed at age 38 during the worst time in m...,diagnosed age 38 worst time life hello fellow ...,0.255463
71677,Cant remember what my therapist said to me,dae after a session with your therapist doesnt...,1611069541,False,BPD,2021-01-19 15:19:01,Cant remember what my therapist said to me dae...,cant remember therapist said dae session thera...,0.114512
187786,Trying not to blow up at someone right now and...,i feel so incredibly abandoned by someone righ...,1557200147,False,BPD,2019-05-07 03:35:47,Trying not to blow up at someone right now and...,trying blow someone right dont know deal feel ...,0.113385


<bound method DataFrame.info of                                                     title  \
759                  How to help loved one seek treatment   
763                                        Do I have BPD?   
768                         Not Sure How to Go About This   
775                                     What is DBT like?   
799     I'm starting a treatment (STEPPS) which is aim...   
...                                                   ...   
240474  Having extreme difficulty maintaining my emoti...   
240477  How do I broach the subject of suicidal though...   
240478                           I have always felt weird   
240481                     LONG pre-diagnosis question...   
240482                   I see a new therapist tomorrow..   

                                                 selftext  created_utc  \
759     those of you who who are in or have been in th...   1640891824   
763     how do i know which kind of personality disord...   1640889998   
768     i am 

In [65]:
df_sorted_desc = df.sort_values(by='relevance_probability', ascending=False)

In [None]:
threshold = .3

In [74]:
df_highquality = df_sorted_desc[df_sorted_desc['relevance_probability']>threshold]

In [75]:
df_highquality.tail(10)

Unnamed: 0,title,selftext,created_utc,over_18,subreddit,date_created,combined_text,processed_text,relevance_probability
94350,"When processing trauma, how should I really fe...",i have been going to therapy for several month...,1608620721,False,BPD,2020-12-22 07:05:21,"When processing trauma, how should I really fe...",processing trauma really feel growing complica...,0.300262
80454,Little rant about the UK's mental health service,i was talking with my friend about being on th...,1630731212,False,BPD,2021-09-04 04:53:32,Little rant about the UK's mental health servi...,little rant uk mental health service talking f...,0.30024
103711,Newly diagnosed with BPD. Any advice would be ...,i am 32 years old and male. back in the end of...,1627779120,False,BPD,2021-08-01 00:52:00,Newly diagnosed with BPD. Any advice would be ...,newly diagnosed bpd advice would great 32 year...,0.300231
224508,Difficulty Getting into DBT Groups,"hey. hi. how are you guys. \n\nso, my psych do...",1464011120,False,BPD,2016-05-23 13:45:20,Difficulty Getting into DBT Groups hey. hi. ho...,difficulty getting dbt group hey hi guy psych ...,0.300218
196052,DBT group not working?,i've been going to a dbt group but i don't fee...,1516336266,False,BPD,2018-01-19 04:31:06,DBT group not working? i've been going to a db...,dbt group working ive going dbt group dont fee...,0.300202
162088,“You just need to start having fun!”: I wish i...,“you just need to start having fun!”: i wish i...,1548810848,False,BPD,2019-01-30 01:14:08,“You just need to start having fun!”: I wish i...,“you need start fun” wish easy “you need start...,0.300165
111081,I think we might be okay.,i usually come to this sub when i'm not doing ...,1606585759,False,BPD,2020-11-28 17:49:19,I think we might be okay. i usually come to th...,think might okay usually come sub im well need...,0.300062
152291,How to move on from an realistic FP/Ex,"i cant take it anymore, and i do not feel isol...",1598566950,False,BPD,2020-08-27 22:22:30,How to move on from an realistic FP/Ex i cant ...,move realistic fpex cant take anymore feel iso...,0.300036
147636,CBT Apps help BPD?,i recently discovered there are a number of ap...,1575573840,False,BPD,2019-12-05 19:24:00,CBT Apps help BPD? i recently discovered there...,cbt apps help bpd recently discovered number a...,0.300014
143840,help in Australia,"hi yall, my names dani, im 20 (21 soon) and wa...",1602388348,False,BPD,2020-10-11 03:52:28,"help in Australia hi yall, my names dani, im 2...",help australia hi yall name dani im 20 21 soon...,0.300004


In [76]:
df_highquality.to_csv('../data/highly_relevant_posts_descending_threshold_30.csv', index=False)

In [77]:
df_highquality.info

<bound method DataFrame.info of                                                     title  \
116835  Informal survey: What (legal) drugs, if any, '...   
127361  What combination of meds finally worked for yo...   
233225  Does anyone have any experience(s) with any an...   
79644             My experience with lamictal/lamotrigine   
137784  How did antidepressants modify your behavior a...   
...                                                   ...   
162088  “You just need to start having fun!”: I wish i...   
111081                          I think we might be okay.   
152291             How to move on from an realistic FP/Ex   
147636                                 CBT Apps help BPD?   
143840                                  help in Australia   

                                                 selftext  created_utc  \
116835  hi! so i'm officially diagnosed, tried dbt but...   1602953594   
127361  right now i’m on:\nprozac 40mg\nwellbutrin xl ...   1593096447   
233225  i abu