# 2 Select Random Sample from the Total Negative Reviews

This notebook shows how to randomly select reviews from the total pool of negative reviews. I split 8,825 negative reviews from seven apps at a midpoint date, May 2, 2022, into two groups: "Before-Roe" and "Post-Roe." Using the pandas `sample()` method, I randomly selected about 5% of the reviews from each group, for a total of 453 samples (164 from Group 1 and 289 from Group 2). For the two apps with few reviews, <i>myPill Birth Control Reminder</i> and <i>Birth Control Pill Reminder</i>, I instead selected 50% of their reviews so that both apps are represented in the sample.


- **Goal:** to get review samples from the negative reviews pool (Note: negative reviews = "1 and 2-star reviews")
- **Input:** all_7apps_reviews.csv - raw reviews scraped from all seven selected apps.
- **Output:**
    - **Group 1: "Before-Roe" reviews**<br>
    1) group1_negative_reviews.csv - a separate CSV file that contains negative reviews dated **before** 2022-05-02<br>
    2) group1_samples.csv - random sample selected from "group1_negative_reviews.csv"<br>
    - **Group 2: "Post-Roe" reviews**<br>
    1) group2_negative_reviews.csv - a separate CSV file that contains negative reviews dated **on or after** 2022-05-02<br>
    2) group2_samples.csv - random sample selected from "group2_negative_reviews.csv"
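
The split rule behind the two groups can be illustrated on a handful of rows. This is a minimal sketch with made-up data (the `toy` DataFrame is hypothetical and not taken from all_7apps_reviews.csv); it only shows that a review timestamped exactly at midnight on 2022-05-02 lands in Group 2, because the comparison for Group 2 is `>=`.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the boundary behaviour
toy = pd.DataFrame({
    "rating": [1, 2, 5, 1, 2],
    "date": pd.to_datetime([
        "2022-05-01 23:59:59",  # one second before the midpoint -> Group 1
        "2022-05-02 00:00:00",  # exactly at the midpoint -> Group 2
        "2022-05-02 10:00:00",  # 5-star review, dropped by the rating filter
        "2022-06-24 08:30:00",  # after the midpoint -> Group 2
        "2021-01-15 12:00:00",  # well before the midpoint -> Group 1
    ]),
})

midpoint_date = pd.Timestamp("2022-05-02")

# Keep only negative reviews (1 and 2 stars), then split at the midpoint
negative = toy[toy["rating"].isin([1, 2])]
before = negative[negative["date"] < midpoint_date]
after = negative[negative["date"] >= midpoint_date]

print(len(before), len(after))
```

Because the two conditions are complementary, every negative review with a valid date falls into exactly one group.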

### 1) Check the number of each app's negative reviews before and after the midpoint date

In [46]:
import pandas as pd

# Load your CSV file into a DataFrame
df = pd.read_csv('all_7apps_reviews.csv')

# Convert the "date" column to datetime format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')

# Filter the DataFrame to include only negative reviews (1 and 2-star ratings)
negative_reviews = df[df['rating'].isin([1, 2])]

# Separate the negative reviews based on the "date" column
midpoint_date = pd.Timestamp('2022-05-02')
group_1_negative_reviews = negative_reviews[negative_reviews['date'] < midpoint_date]
group_2_negative_reviews = negative_reviews[negative_reviews['date'] >= midpoint_date]

# Group by the "app_name" column and count the number of negative reviews for each app before and after the midpoint
group_1_counts = group_1_negative_reviews.groupby('app_name').size().reset_index(name='count_before_midpoint')
group_2_counts = group_2_negative_reviews.groupby('app_name').size().reset_index(name='count_after_midpoint')

# Merge the counts for both groups
merged_counts = pd.merge(group_1_counts, group_2_counts, on='app_name', how='outer').fillna(0)

# Add a row at the end to count the total negative reviews for all apps before and after the midpoint
total_before_midpoint = merged_counts['count_before_midpoint'].sum()
total_after_midpoint = merged_counts['count_after_midpoint'].sum()

total_row = pd.DataFrame({'app_name': 'Total', 'count_before_midpoint': total_before_midpoint, 'count_after_midpoint': total_after_midpoint}, index=[len(merged_counts)])

# DataFrame.append is deprecated and was removed in pandas 2.0; pd.concat does the same job
merged_counts = pd.concat([merged_counts, total_row], ignore_index=True)
merged_counts


|   | app_name | count_before_midpoint | count_after_midpoint |
|---|---|---|---|
| 0 | birth-control-pill-reminder | 7 | 2 |
| 1 | clue-period-tracker-calendar | 864 | 1663 |
| 2 | flo-period-pregnancy-tracker | 1833 | 3441 |
| 3 | mypill-birth-control-reminder | 14 | 4 |
| 4 | natural-cycles-birth-control | 71 | 166 |
| 5 | nurx-birth-control-delivered | 255 | 420 |
| 6 | planned-parenthood-direct | 50 | 35 |
| 7 | Total | 3094 | 5731 |
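
The sample sizes quoted in the introduction (164 and 289) can be cross-checked against the counts above. The sketch below assumes that `sample(frac=...)` turns the fraction into a row count by rounding `frac * len(df)`, which is my understanding of how current pandas versions behave; the helper `expected_sample_size` is a hypothetical name introduced only for this check.

```python
# Negative-review counts per app, copied from the table above
before = {
    "birth-control-pill-reminder": 7,
    "clue-period-tracker-calendar": 864,
    "flo-period-pregnancy-tracker": 1833,
    "mypill-birth-control-reminder": 14,
    "natural-cycles-birth-control": 71,
    "nurx-birth-control-delivered": 255,
    "planned-parenthood-direct": 50,
}
after = {
    "birth-control-pill-reminder": 2,
    "clue-period-tracker-calendar": 1663,
    "flo-period-pregnancy-tracker": 3441,
    "mypill-birth-control-reminder": 4,
    "natural-cycles-birth-control": 166,
    "nurx-birth-control-delivered": 420,
    "planned-parenthood-direct": 35,
}

# Apps sampled at 50%; all other apps are sampled at 5%
apps_50percent = {"mypill-birth-control-reminder", "birth-control-pill-reminder"}


def expected_sample_size(counts, frac_large=0.05, frac_small=0.5):
    """Expected rows when each pool is sampled as a whole (assumes rounding)."""
    small = sum(n for app, n in counts.items() if app in apps_50percent)
    large = sum(n for app, n in counts.items() if app not in apps_50percent)
    return round(frac_large * large) + round(frac_small * small)


group1_expected = expected_sample_size(before)
group2_expected = expected_sample_size(after)
print(group1_expected, group2_expected, group1_expected + group2_expected)
```

Under that rounding assumption the result agrees with the totals reported in the introduction: 164 for Group 1, 289 for Group 2, and 453 overall.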


### 2) Separate the negative reviews based on the midpoint and save them as two csv files

In [47]:
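# Load your CSV file into a DataFrame
df = pd.read_csv('all_7apps_reviews.csv')

# Convert the "date" column to datetime so it can be compared with the midpoint
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')

# Filter the DataFrame to include only reviews with 1 and 2-star ratings
negative_reviews = df[df['rating'].isin([1, 2])]

# Separate the negative reviews based on the "date" column
midpoint_date = pd.Timestamp('2022-05-02')

# Group 1: reviews dated before the midpoint (before the Roe draft leaked)
group_1 = negative_reviews[negative_reviews['date'] < midpoint_date]
# Group 2: reviews dated on or after the midpoint (after the Roe draft leaked)
group_2 = negative_reviews[negative_reviews['date'] >= midpoint_date]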

# Save the separated negative reviews to new CSV files
# (uncomment on the first run; steps 3 and 4 read these files)
#group_1.to_csv('group1_negative_reviews.csv', index=False)
#group_2.to_csv('group2_negative_reviews.csv', index=False)

### 3) Select random samples from Group 1

In [48]:
# Load the group1_negative_reviews.csv file into a DataFrame
group1_df = pd.read_csv('group1_negative_reviews.csv')

# Define the app names for which to select samples
apps_5percent = ['nurx-birth-control-delivered', 'natural-cycles-birth-control','planned-parenthood-direct','flo-period-pregnancy-tracker', 'clue-period-tracker-calendar']
apps_50percent = ['mypill-birth-control-reminder', 'birth-control-pill-reminder']

# Select 5% of the pooled reviews from the apps in apps_5percent
# (the 5% is drawn across these apps together, not separately per app)
group1_samples_5percent = group1_df[group1_df['app_name'].isin(apps_5percent)].sample(frac=0.05, random_state=42)

# Select 50% of the pooled reviews from the apps in apps_50percent
group1_samples_50percent = group1_df[group1_df['app_name'].isin(apps_50percent)].sample(frac=0.5, random_state=42)

# Combine the two samples
group1_samples = pd.concat([group1_samples_5percent, group1_samples_50percent])
group1_samples

# Save the combined sample to a new CSV file
#group1_samples.to_csv('group1_samples.csv', index=False)
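
The cell above draws 5% from the five larger apps pooled together, so the share taken from any single app can drift from 5%. If a fixed fraction per app is wanted instead, one possible variant is to sample inside each app separately. This is an alternative sketch, not the method used in this notebook; the toy data, the app names `app-a`, `app-b`, `app-c`, and the `frac_by_app` mapping are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for group1_negative_reviews.csv
toy = pd.DataFrame({
    "app_name": ["app-a"] * 100 + ["app-b"] * 40 + ["app-c"] * 10,
    "review_id": range(150),
})

# Sampling fraction per app (assumed values mirroring the 5% / 50% rule)
frac_by_app = {"app-a": 0.05, "app-b": 0.05, "app-c": 0.5}

# Sample each app's rows on their own, then stack the pieces
parts = [
    rows.sample(frac=frac_by_app[app], random_state=42)
    for app, rows in toy.groupby("app_name")
]
stratified = pd.concat(parts)

per_app = stratified["app_name"].value_counts().sort_index().to_dict()
print(per_app)
```

With this variant every app contributes its own rounded share, at the cost of totals that can differ slightly from the pooled draw.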

### 4) Select random samples from Group 2

In [49]:
# Load the group2_negative_reviews.csv file into a DataFrame
group2_df = pd.read_csv('group2_negative_reviews.csv')

# Define the app names for which to select samples
apps_5percent = ['nurx-birth-control-delivered', 'natural-cycles-birth-control','planned-parenthood-direct','flo-period-pregnancy-tracker', 'clue-period-tracker-calendar']
apps_50percent = ['mypill-birth-control-reminder', 'birth-control-pill-reminder']

# Select 5% of the pooled reviews from the apps in apps_5percent
# (the 5% is drawn across these apps together, not separately per app)
group2_samples_5percent = group2_df[group2_df['app_name'].isin(apps_5percent)].sample(frac=0.05, random_state=42)

# Select 50% of the pooled reviews from the apps in apps_50percent
group2_samples_50percent = group2_df[group2_df['app_name'].isin(apps_50percent)].sample(frac=0.5, random_state=42)

# Combine the two samples
group2_samples = pd.concat([group2_samples_5percent, group2_samples_50percent])
group2_samples

# Save the combined sample to a new CSV file
#group2_samples.to_csv('group2_samples.csv', index=False)
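
Steps 3 and 4 run the same logic on two different files, so the sampling could be wrapped in one function. The sketch below is a possible refactoring under that assumption; `sample_group` is a hypothetical helper name, and the toy DataFrame stands in for the real group files.

```python
import pandas as pd

APPS_5PERCENT = [
    "nurx-birth-control-delivered",
    "natural-cycles-birth-control",
    "planned-parenthood-direct",
    "flo-period-pregnancy-tracker",
    "clue-period-tracker-calendar",
]
APPS_50PERCENT = ["mypill-birth-control-reminder", "birth-control-pill-reminder"]


def sample_group(group_df, seed=42):
    """Pooled 5% draw from the larger apps plus pooled 50% draw from the smaller ones."""
    large = group_df[group_df["app_name"].isin(APPS_5PERCENT)]
    small = group_df[group_df["app_name"].isin(APPS_50PERCENT)]
    return pd.concat([
        large.sample(frac=0.05, random_state=seed),
        small.sample(frac=0.5, random_state=seed),
    ])


# Hypothetical toy data: 200 reviews for one large app, 10 for one small app
toy = pd.DataFrame({
    "app_name": ["flo-period-pregnancy-tracker"] * 200
    + ["mypill-birth-control-reminder"] * 10,
    "review_id": range(210),
})

samples = sample_group(toy)
print(len(samples))
```

In the notebook this would be called once per group, for example `sample_group(pd.read_csv('group1_negative_reviews.csv'))`. Keeping `random_state` fixed means the same rows are returned on every run.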