# Identifying Women's Health Concerns From Online Forums

## Problem Statement

Citizen science is an emerging field of research in which members of the public volunteer to participate in scientific research [(1)](https://www.citizenscience.gov/about/#). One of the best-known crowd-sourced citizen science projects is [American Gut](https://msystems.asm.org/content/3/3/e00031-18), which was designed to better understand the human microbiome. Citizens interested in contributing paid $99 to receive a sample collection kit and instructions for submitting the sample [(2)](https://anesthesiology.duke.edu/?p=846744). Within approximately five years, American Gut was estimated to have received samples from over 11,000 people in 45 different countries [(2)](https://anesthesiology.duke.edu/?p=846744), illustrating both the willingness of individuals to participate in research and the power of citizen science to generate large datasets that can answer important research questions.

The field of developmental neurotoxicology is interested in understanding the developmental origins of the nervous system throughout the lifespan [(3)](https://www.dntshome.org). Within this field, a substantial amount of research is dedicated to understanding how events that occur during a woman's pregnancy (e.g., illness, treatment with medication, or psychological distress) affect the development of the baby's brain and behavior. Traditional research on maternal health and infant development has relied on retrospective maternal report of events that occurred during pregnancy, collected through regularly scheduled interviews with a trained research assistant or counselor. Although these interviews provide important information, the accuracy of reporting depends on the expecting mother's ability to recall important events that happened during her pregnancy.

To reduce gaps in reporting and to gain a more representative picture of a woman's pregnancy, we would like to plan a citizen science project in which women who are interested in participating can download a mobile application that allows them to log important events during their pregnancy in real time. However, because this will be a large time commitment for participating women, we would also like to provide them with specially curated resources related to women's health concerns.

To identify concerns that are relevant to women's health, a focus group was initially proposed. However, focus groups can be costly in terms of time and money, and the small number of women who can join a focus group may not be representative of the larger population of women who could opt to participate in the research. Therefore, an alternative method of identifying concerns was proposed.

In this project, we will use natural language processing and unsupervised clustering techniques to identify women's health concerns from posts in online forums. Specific concerns will be identified for general women's health, fertility and pregnancy, and early parenthood.
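As a rough illustration of the first step in such a pipeline, the sketch below turns a few made-up post titles into TF-IDF term weights using only the standard library. The toy titles, the `tokenize` and `tfidf` helpers, and the exact weighting formula are assumptions for illustration, not the method used in this project; in practice a library vectorizer would likely be used and its output passed to a clustering algorithm.

```python
import math
import re
from collections import Counter

# Made-up post titles standing in for scraped forum posts (illustrative only).
posts = [
    "irregular period after stopping birth control",
    "birth control side effects question",
    "baby bottle recommendations for newborn",
    "newborn sleep schedule and bottle feeding",
]

def tokenize(text):
    """Lowercase a post and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of posts in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

weights = tfidf(posts)
# Terms that appear in few posts get higher weights than widely shared terms.
top_terms = [max(w, key=w.get) for w in weights]
```

Posts that share highly weighted terms end up with similar vectors, which is what lets an unsupervised clustering step group them into candidate "concerns".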

## Imports

In [2]:
import pandas as pd

## Data

For each area of interest, 18 months of posts were scraped from three or more related subreddits. An 18-month window was chosen so that concerns before and after the onset of COVID-19 could be compared.

***Note:*** Although I had initially hoped to scrape from other online forums, there appeared to be a scarcity of forums with quality, scrapable data. Because Reddit has moderators, some of whom are OB/GYNs, I am hoping that the quality of this data will be higher.
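To make the planned pre- and post-COVID-19 comparison concrete, here is a minimal sketch of how posts might be labeled by date. The cutoff of 2020-03-11 (the day the WHO declared COVID-19 a pandemic), the `period` column name, and the toy rows are all assumptions for illustration; the notebook itself does not specify a cutoff.

```python
import pandas as pd

# Assumed cutoff: the WHO declared COVID-19 a pandemic on 2020-03-11.
COVID_CUTOFF = pd.Timestamp("2020-03-11")

# Made-up rows with the same `timestamp` column used in this notebook.
toy_posts = pd.DataFrame({
    "title": ["post a", "post b", "post c", "post d"],
    "timestamp": pd.to_datetime(
        ["2019-02-15", "2019-11-30", "2020-03-11", "2020-08-07"]
    ),
})

# Label each post by whether it was made before the cutoff or on/after it.
toy_posts["period"] = "pre_covid"
toy_posts.loc[toy_posts["timestamp"] >= COVID_CUTOFF, "period"] = "post_covid"

period_counts = toy_posts["period"].value_counts()
```

A different cutoff (for example, the start of local lockdowns) would shift which posts fall in each group, so the choice should be stated alongside any comparison.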

### Women's Health

In [None]:
womens_health = pd.read_csv('../data/womens_health.csv')

In [6]:
womens_health.drop(columns = 'Unnamed: 0', inplace = True)

In [7]:
womens_health.shape

(31385, 9)
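An `Unnamed: 0` column is usually what pandas produces when a DataFrame is saved with `to_csv` without `index=False` and then read back; whether that is what happened with these files is an assumption. As an alternative to dropping the column after loading, `read_csv` can read it as the index with `index_col=0`. The sketch below uses a small in-memory CSV with made-up rows rather than the project's files.

```python
import io
import pandas as pd

# Made-up CSV text mimicking a file saved with the default index column.
csv_text = ",title,score\n0,first post,1\n1,second post,3\n"

# Approach used in this notebook: load everything, then drop the index column.
dropped = pd.read_csv(io.StringIO(csv_text)).drop(columns="Unnamed: 0")

# Alternative: read the first column as the index so no drop is needed.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Both approaches leave the same data columns, so the reported shape would be unchanged.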

In [9]:
womens_health.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Been to the clinic twice and they don’t know w... | So I’ve been having problems with discharge an... | WomensHealth | 1596818251 | thecrazedbunny | 0 | 1 | True | 2020-08-07 |
| 1 | Period going on for 14 days today. Help!! | Hi guys. I'm getting a bit worried about my pe... | WomensHealth | 1596822599 | Help-Me-Already | 4 | 1 | True | 2020-08-07 |
| 2 | Question about birth control - advice apprecia... | Hi! So I am 18 and I have been on birth contro... | WomensHealth | 1596822743 | blueberrydoe | 2 | 1 | True | 2020-08-07 |
| 3 | Yeast infection vs. Cytolytic Vaginosis? | Hi everyone! Hoping someone here can help me.... | WomensHealth | 1596826397 | TheTinyOne23 | 1 | 1 | True | 2020-08-07 |
| 4 | Virgin concerned about transvaginal ultrasound | I went in for my well woman exam today and the... | WomensHealth | 1596829944 | BabyFaceIT | 4 | 1 | True | 2020-08-07 |
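The `created_utc` values look like Unix epoch seconds, and the `timestamp` column appears to be the corresponding calendar date. Assuming the dates were computed in UTC (the scraping code is not shown here, so this is a guess), the first row above can be checked with the standard library:

```python
from datetime import datetime, timezone

# `created_utc` of the first row shown above, read as Unix epoch seconds.
created_utc = 1596818251

# Convert to a timezone-aware datetime in UTC, then keep only the date.
created_at = datetime.fromtimestamp(created_utc, tz=timezone.utc)
post_date = created_at.date().isoformat()
print(post_date)  # 2020-08-07
```

The result agrees with the `timestamp` value in that row, which is consistent with the assumption but does not confirm how the column was actually built.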


### Fertility and Pregnancy

In [8]:
fert_and_preg = pd.read_csv('../data/fertility_and_pregnancy.csv')

In [10]:
fert_and_preg.drop(columns = 'Unnamed: 0', inplace = True)

In [11]:
fert_and_preg.shape

(98138, 9)

In [12]:
fert_and_preg.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Adding to the kitchen sink approach: I just bo... | This time I’m going to be using a menstrual cu... | TryingForABaby | 1596839749 | lastput1 | 7 | 1 | True | 2020-08-07 |
| 1 | Has anyone used/or currently an app to track t... | \nMy husband and I are new to TTC.\n\nWe have ... | TryingForABaby | 1596841178 | ParkingFrosting4 | 8 | 1 | True | 2020-08-07 |
| 2 | Invited to my first “virtual” baby shower | I don’t know why but I feel a type of way. Sit... | TryingForABaby | 1596841808 | MinimalTa1886 | 16 | 1 | True | 2020-08-07 |
| 3 | Constantly testing during the TWW? | I know it's hard not to test constantly during... | TryingForABaby | 1596842913 | amma017 | 6 | 1 | True | 2020-08-07 |
| 4 | When do I actually ovulate? | I’ve been tracking my cycles for two months no... | TryingForABaby | 1596842925 | bethrowinaway | 4 | 1 | True | 2020-08-07 |


### Postpartum

In [13]:
postpartum = pd.read_csv('../data/postpartum.csv')

In [15]:
postpartum.drop(columns = 'Unnamed: 0', inplace = True)

In [16]:
postpartum.shape

(51674, 9)

In [17]:
postpartum.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Question about bottles | I'm researching baby bottles and am trying to ... | BabyBumps | 1596850138 | All_Hail_CC | 4 | 1 | True | 2020-08-07 |
| 1 | Anyone else getting that generation gap judgem... | Aunt- I don’t eat much meat, but when I was pr... | BabyBumps | 1596850152 | waterfallsummer | 86 | 1 | True | 2020-08-07 |
| 2 | Insurance perks. | Hello All we are expecting our first in mid of... | BabyBumps | 1596850522 | mkokit1 | 9 | 1 | True | 2020-08-07 |
| 3 | 37 weeks and extra moody | Is just me or do these last few weeks just mak... | BabyBumps | 1596851772 | denisemescudi | 2 | 1 | True | 2020-08-07 |
| 4 | FTM Breastfeeding question | Hey folks!\n\nFTM mom here. There’s a lot of t... | BabyBumps | 1596853969 | malinjl36 | 18 | 1 | True | 2020-08-07 |


## *VERY* Light EDA

### Unique Authors

In [19]:
womens_health['author'].nunique()

20024

In [20]:
fert_and_preg['author'].nunique()

35517

In [21]:
postpartum['author'].nunique()

21821
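One way to read these counts is as an average number of posts per unique author. The short calculation below uses only the post and author counts reported in this notebook; the dictionary layout and variable names are illustrative.

```python
# Post and unique-author counts reported earlier in this notebook.
counts = {
    "womens_health": (31385, 20024),
    "fert_and_preg": (98138, 35517),
    "postpartum": (51674, 21821),
}

# Average number of posts per unique author in each dataset.
posts_per_author = {
    name: round(n_posts / n_authors, 2)
    for name, (n_posts, n_authors) in counts.items()
}
print(posts_per_author)
```

This works out to roughly 1.6 posts per author for women's health, 2.8 for fertility and pregnancy, and 2.4 for postpartum, so no single dataset appears to be dominated by a handful of accounts on average, although an average alone cannot rule out a few very heavy posters.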

### Dates of Posts

In [24]:
womens_health['timestamp'] = pd.to_datetime(womens_health['timestamp'])
fert_and_preg['timestamp'] = pd.to_datetime(fert_and_preg['timestamp'])
postpartum['timestamp'] = pd.to_datetime(postpartum['timestamp'])

In [25]:
womens_health['timestamp'].describe()

count                   31385
unique                    541
top       2020-08-03 00:00:00
freq                      103
first     2019-02-15 00:00:00
last      2020-08-08 00:00:00
Name: timestamp, dtype: object

In [26]:
fert_and_preg['timestamp'].describe()

count                   98138
unique                    541
top       2020-08-04 00:00:00
freq                      261
first     2019-02-15 00:00:00
last      2020-08-08 00:00:00
Name: timestamp, dtype: object

In [27]:
postpartum['timestamp'].describe()

count                   51674
unique                    542
top       2020-06-22 00:00:00
freq                      134
first     2019-02-15 00:00:00
last      2020-08-09 00:00:00
Name: timestamp, dtype: object
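The `unique`, `first`, and `last` values above can be used to check whether any calendar days are missing from a dataset. The sketch below shows the idea on made-up timestamps; the variable names are illustrative. It computes the minimum, maximum, and unique count directly rather than through `describe()`, partly because recent pandas versions may summarize datetime columns differently (reporting `min`, `max`, and `mean` instead of `unique`, `top`, `first`, and `last`).

```python
import pandas as pd

# Made-up daily timestamps standing in for one of the `timestamp` columns.
toy_timestamps = pd.Series(
    pd.to_datetime(["2019-02-15", "2019-02-16", "2019-02-16", "2019-02-18"])
)

first, last = toy_timestamps.min(), toy_timestamps.max()
n_unique_days = toy_timestamps.nunique()

# Number of calendar days in the range, counting both endpoints.
n_calendar_days = (last - first).days + 1

# If the two numbers match, every day in the range has at least one post.
has_gap = n_unique_days < n_calendar_days
```

Applied to the outputs above, 2019-02-15 through 2020-08-08 covers 541 calendar days and 2019-02-15 through 2020-08-09 covers 542, matching the reported `unique` counts, which suggests that each dataset has at least one post on every day of its range.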