## 4 OpenAI Classification Performance Evaluation

In this notebook, I evaluate how well GPT (gpt-4o) classifies app reviews by comparing the categories it predicted against the categories I tagged manually. I use the [F1-score](https://www.geeksforgeeks.org/f1-score-in-machine-learning/) as the metric to assess the model's effectiveness.
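
For reference, these are the standard definitions for a single category, where $TP$, $FP$, and $FN$ are the counts of true positives, false positives, and false negatives. Treating the manual tags as ground truth is an assumption of this note:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$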

Note: We can prompt GPT with [whichever model we want](https://platform.openai.com/docs/models). I chose the "gpt-4o" model, expecting it to offer recent, advanced performance at an affordable price.

#### **1) Collection 1**
- **combined_bc_samples.csv** - This file contains the randomly selected sample reviews from 5 birth control apps (not tagged).
- **combined_bc_tagged.csv** - This file contains the randomly selected sample reviews from 5 birth control apps that have been manually tagged (tagged).

#### **2) Collection 2**
- **combined_pt_samples.csv** - This file contains the randomly selected sample reviews from 2 period tracking apps (not tagged).
- **combined_pt_tagged.csv** - This file contains the randomly selected sample reviews from 2 period tracking apps that have been manually tagged (tagged).

## Collection 1 - Birth-Control-Oriented Apps (x5)

### Step 1: Set up the connection to OpenAI

We need to import from the `openai` package and give it our [API key](https://platform.openai.com/api-keys). It's probably best if you [hide the API key with python-dotenv](https://www.youtube.com/watch?v=YdgIWTYQ69A) but do whatever you want!

In [7]:
from dotenv import load_dotenv
import os

load_dotenv()  # load environment variables from the .env file

API_KEY = os.getenv("API_KEY")
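
If the `.env` file is missing or the variable name is misspelled, `os.getenv` returns `None` rather than raising, so the problem shows up later and less clearly. Here is a minimal sketch of a fail-fast check. The helper name `require_env` is hypothetical and is not part of `python-dotenv` or `openai`:

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value

# In this notebook the call would be: API_KEY = require_env("API_KEY")
```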

In [21]:
#!pip install openai
from openai import OpenAI
import math # We use `math.exp` to get the probability 

client = OpenAI(api_key=API_KEY)

# API_KEY is my OpenAI API key. For security reasons, a key cannot be viewed again in the OpenAI account after it is created, so store it somewhere safe.

### Step 2: Set up and test the prompt

It's a lot easier to build a nice, well-formatted prompt in Python than in Google Sheets! I'm using triple quotation marks so my text can span multiple lines.

prompt = """
Below is the text to a review. Classify it as one or more of the following categories:

- l1_inaccurate_cycle_prediction: This category suggests that the app's cycle prediction algorithm is inaccurate, sometimes leading to unplanned pregnancies.
- l2_delayed_customer_service: This category suggests that difficulty in contacting customer service and long wait times, which oftentimes result in late or inaccurate deliveries of prescriptions and medications.
- l3_poor_prescription_management: This category suggests users experience issues such as missing or incorrect prescriptions, incorrect birth control medications, inaccurate refill frequencies, late deliveries, and canceled medications.
- l4_problematic_billing_practices: This category suggests that users encounter unexpected charges including but not limited to auto-renewals without notification, and charges on old credit cards without refunds, or they fail to use the current insurance plan for insurance billing.

The review may be assigned to multiple categories. Please list all applicable categories based on the review content.

Review text:

{text}
"""

review = """
"This whole business is a scam. It’s an exhausting and consistently frustrating take on a crucial service. I absolutely love the concept of getting it delivered and being able to do it all online. First month was fine. Every month since they have tried charging my card anywhere from $8-$10 even though under my insurance all forms of birth control pills are free at no cost to me at all. Every time I’ve tried to explain that to the staff at Nurx their only response is to talk to my insurance… even if I’ve already told them that I have. I can explain it to them all day I only get automated responses, never any real help at all. Also the customer service is the worst I’ve seen on any online platform. It takes up to 6 days for someone to get back to you on anything. The earliest for me was 3 days. So you send them a message and by the time they respond you’re so close to running out of pills that you rush to get it taken care of. 
Also you can specify on your app settings that you want to be notified if insurance doesn’t cover it before you’re charged. At first I had it turned off then after the second month I turned it on. Well then they just charged me without any heads up the third month, completely ignoring my preference. Then, when I looked at the settings it was gone. 
If I hadn’t spent the money on the annual fee I would drop it. However any more of this insanity and I’ll just delete it all together. So sick of it."
"""

results = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {"role": "system", "content": "You are a review assistant. Be brief and consider multiple categories in your responses."},
    {"role": "user", "content": prompt.format(text=review)}
  ],
  temperature=0,
  logprobs=True,  # include token log probabilities
)

categories = results.choices[0].message.content
print(categories)

### Step 3: Create a function to read through all the reviews in the CSV file

Now we can use the pandas library to read in our reviews.

In [5]:
import pandas as pd 

bc_df = pd.read_csv("combined_bc_samples.csv")
#bc_df

We're going to move our content into a function called `categorize`. It *could* just take the review, but sometimes you might want to fill in multiple blanks to build your prompt, so we're going with the whole row to future-proof.

In [100]:
def categorize(row):
    prompt = """
        Below is the text to a review. Classify it as one or more of the following categories:

        - l1_inaccurate_cycle_prediction: This category suggests that the app's cycle prediction algorithm is inaccurate, sometimes leading to unplanned pregnancies.
        - l2_delayed_customer_service: This category suggests that difficulty in contacting customer service and long wait times, which oftentimes result in late or inaccurate deliveries of prescriptions and medications.
        - l3_poor_prescription_management: This category suggests users experience issues such as missing or incorrect prescriptions, incorrect birth control medications, inaccurate refill frequencies, late deliveries, and canceled medications.
        - l4_problematic_billing_practices: This category suggests that users encounter unexpected charges including but not limited to auto-renewals without notification, and charges on old credit cards without refunds, or they fail to use the current insurance plan for insurance billing.

        The review may be assigned to multiple categories. Please list all applicable categories based on the review content.

        Review text:

        {text}
        """
    
    results = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "You are a review assistant. Be brief and consider multiple categories in your responses."},
        {"role": "user", "content": prompt.format(text=row['review'])}
      ],
      temperature=0,
      logprobs=True,  # include token log probabilities
    )

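    # 'probability' is the probability of the first generated token only,
    # obtained by exponentiating its log probability
    return pd.Series({
        'content': results.choices[0].message.content,
        'probability': math.exp(results.choices[0].logprobs.content[0].logprob)
    })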

Let's try it with the first row of our dataset.

In [101]:
bc_df.iloc[0]

date                                                    12/19/20 15:29
developerResponse    {'id': 19882037, 'body': 'Thank you for your r...
review               Nurx informed me that my lab results came back...
rating                                                               1
isEdited                                                         False
userName                                                    dcladyboss
title                                             Fail. Get OneMedical
app_name                                  nurx-birth-control-delivered
app_id                                                      1213141301
Name: 0, dtype: object

In [102]:
categorize(bc_df.iloc[0])

content        - l2_delayed_customer_service\n- l3_poor_presc...
probability                                             0.996655
dtype: object
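
The `probability` value above is `math.exp` of the log probability of the first generated token only. For a response like `- l2_delayed_customer_service` that may be just the leading dash, so it says little about confidence in the full answer. One alternative is the joint probability of the whole response, obtained by summing the token log probabilities before exponentiating. This is a minimal sketch: the function name is hypothetical, and the input is assumed to be a plain list of floats such as `[t.logprob for t in results.choices[0].logprobs.content]`:

```python
import math

def sequence_probability(token_logprobs):
    """Joint probability of a token sequence: exp of the summed per-token log probabilities."""
    return math.exp(sum(token_logprobs))

# Three tokens with probabilities 0.9, 0.8 and 0.5
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5)]
print(round(sequence_probability(logprobs), 4))  # 0.36
```

Keep in mind that a joint probability shrinks as the response gets longer, so it is only comparable between responses of similar length.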

### Step 4: Add tqdm to pandas to get a progress bar showing how long this will take

In [103]:
from tqdm.auto import tqdm
tqdm.pandas()

In [39]:
bc_df[['category', 'probability']] = bc_df.progress_apply(categorize, axis=1)
bc_df

  0%|          | 0/89 [00:00<?, ?it/s]

In [104]:
#bc_df.to_csv("bc_categorized.csv", index=False)

### Step 5: Create a separate column for each category to organize the predicted results

In [55]:
bc_df = pd.read_csv("bc_categorized.csv")

categories = [
    'l1_inaccurate_cycle_prediction',
    'l2_delayed_customer_service',
    'l3_poor_prescription_management',
    'l4_problematic_billing_practices'
]

# Create columns for each category initialized to 0
for category in categories:
    bc_df[category] = 0

# Function to update the category columns based on the categorized column
def update_category_columns(row):
    if pd.notna(row['category']):  # Check if the categorized cell is not NaN
        for category in categories:
            if category in row['category']:
                row[category] = 1
    return row

# Apply the function to each row
new_bc_df = bc_df.apply(update_category_columns, axis=1)
new_bc_df

# Save the updated DataFrame
#new_bc_df.to_csv("bc_categorized_with_columns.csv", index=False)

Unnamed: 0,date,developerResponse,review,rating,isEdited,userName,title,app_name,app_id,category,probability,l1_inaccurate_cycle_prediction,l2_delayed_customer_service,l3_poor_prescription_management,l4_problematic_billing_practices
0,12/19/20 15:29,"{'id': 19882037, 'body': 'Thank you for your r...",Nurx informed me that my lab results came back...,1,False,dcladyboss,Fail. Get OneMedical,nurx-birth-control-delivered,1213141301,- l2_delayed_customer_service\n- l3_poor_presc...,0.996655,0,1,1,0
1,6/15/21 0:56,"{'id': 23421452, 'body': 'Hi Britt, we are dis...",TOTAL BAIT & SWITCH. I read the bad reviews be...,1,False,Britt.r.funnnn,BEWARE- will switch medication to generic w/ n...,nurx-birth-control-delivered,1213141301,- l3_poor_prescription_management: The review ...,0.986529,0,1,1,0
2,5/12/20 14:04,,I got this app almost immediately after learni...,2,False,Leighann breeze,"FDA approved, not the best",natural-cycles-birth-control,765535549,- l1_inaccurate_cycle_prediction,0.994495,1,0,0,0
3,1/30/22 18:50,"{'id': 27785221, 'body': ""Hello, we're sorry t...",I’ve gotten in contact with Nurx’s customer se...,2,False,☀️💁🏽,Bad Customer Service,nurx-birth-control-delivered,1213141301,- l2_delayed_customer_service\n- l3_poor_presc...,0.987476,0,1,1,0
4,1/21/22 17:37,"{'id': 27667407, 'body': 'Hello, and thank you...",I used the app for 6 months before falling pre...,1,False,ArcyDarcy94,Unplanned Pregnancy & Autorenewal Scam!,natural-cycles-birth-control,765535549,- l1_inaccurate_cycle_prediction\n- l2_delayed...,0.876135,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84,12/29/22 17:35,"{'id': 34025910, 'body': ""We're deeply saddene...",They take days to respond.,1,False,WadeTM,Zero customer service,nurx-birth-control-delivered,1213141301,- l2_delayed_customer_service,0.978373,0,1,0,0
85,3/2/23 18:54,"{'id': 35230364, 'body': 'Hi there, \n\nThank ...","Inaccurate, no way to exclude a cycle, expensi...",1,False,elizmart13,Not accurate or flexible,natural-cycles-birth-control,765535549,- l1_inaccurate_cycle_prediction\n- l2_delayed...,0.959048,1,1,0,0
86,1/3/24 0:32,"{'id': 12880885, 'body': 'Thank you for sharin...",I love this app don’t get me wrong I have neve...,1,True,nenanacolee,Love the app but…,mypill-birth-control-reminder,425632209,The review does not clearly fall into any of t...,0.600835,0,0,0,0
87,3/26/24 2:09,,The app is simple and and has a nice theme to ...,2,False,Ashe Marie C,Easy app but not worth,mypill-birth-control-reminder,425632209,- l4_problematic_billing_practices,0.484219,0,0,0,1


### Step 6: Caching the results

What happens if your computer shuts off partway through, or your internet disappears, or OpenAI decides you've run out of credits? You don't want to start all over! Instead, we'll use [joblib's caching function](https://joblib.readthedocs.io/en/latest/auto_examples/memory_basic_usage.html) to store the results of each function call.

We'll store the cached results in a folder called `cachedir` so we can see it and be happy about its presence.

In [30]:
from joblib import Memory

memory = Memory("cachedir", verbose=0)
@memory.cache
def categorize_cache(row):
    prompt = """
    Below is the text to a review. Classify it as one or more of the following categories:

    - l1_inaccurate_cycle_prediction: This category suggests that the app's cycle prediction algorithm is inaccurate, sometimes leading to unplanned pregnancies.
    - l2_delayed_customer_service: This category suggests that difficulty in contacting customer service and long wait times, which oftentimes result in late or inaccurate deliveries of prescriptions and medications.
    - l3_poor_prescription_management: This category suggests users experience issues such as missing or incorrect prescriptions, incorrect birth control medications, inaccurate refill frequencies, late deliveries, and canceled medications.
    - l4_problematic_billing_practices: This category suggests that users encounter unexpected charges including but not limited to auto-renewals without notification, and charges on old credit cards without refunds, or they fail to use the current insurance plan for insurance billing.
    The review may be assigned to multiple categories. Please list all applicable categories based on the review content.

    Review text:

    {text}
    """
    
    results = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a review assistant. Be brief in your responses."},
            {"role": "user", "content": prompt.format(text=row['review'])}
        ],
        temperature=0,
        logprobs=True
    )

    return pd.Series({
        'content': results.choices[0].message.content,
        'probability': math.exp(results.choices[0].logprobs.content[0].logprob)
    })

In [31]:
bc_df[['category', 'probability']] = bc_df.progress_apply(categorize_cache, axis=1)

  0%|          | 0/89 [00:00<?, ?it/s]
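
Because `categorize_cache` receives the whole row, the cache key depends on every column. As far as I understand joblib's argument hashing, a change in an unrelated column would therefore trigger a fresh API call. Below is a stdlib-only sketch of the underlying idea, keyed on the review text alone. It illustrates the concept and is not joblib's implementation; `cached_call`, `cache_file`, and the stand-in `fake_classify` are all hypothetical names:

```python
import hashlib
import json
import os
import tempfile

def cached_call(func, text, cache_path):
    """Return func(text), reusing a result stored on disk under the SHA-256 of `text`."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = func(text)  # the only place the expensive call happens
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
    return cache[key]

# Stand-in for the API call so the sketch runs offline
calls = []
def fake_classify(text):
    calls.append(text)
    return "- l2_delayed_customer_service"

cache_file = os.path.join(tempfile.mkdtemp(), "cache.json")
first = cached_call(fake_classify, "They take days to respond.", cache_file)
second = cached_call(fake_classify, "They take days to respond.", cache_file)
print(first == second, len(calls))  # True 1
```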

### Step 7: Evaluate GPT's performance against the manually tagged results

#### 1) Create a merged DataFrame that includes only the AI-generated results and the manually tagged results

In [108]:
# Load the manually tagged file and the AI-categorized file saved in Step 5
bc_manual_df = pd.read_csv("combined_bc_tagged.csv")
bc_ai_df = pd.read_csv("bc_categorized_with_columns.csv")

# Define the columns you want to extract, excluding 'review' to avoid duplication in merge
columns_to_keep = [
    'l1_inaccurate_cycle_prediction',
    'l2_delayed_customer_service', 
    'l3_poor_prescription_management',
    'l4_problematic_billing_practices'
]

# Prepare the manual and AI DataFrames by selecting the necessary columns
manual_selected = bc_manual_df[['review'] + columns_to_keep]
ai_selected = bc_ai_df[['review'] + columns_to_keep]

# Rename AI DataFrame columns for clarity in the merged file
ai_selected.columns = ['review'] + [f'AI_{col}' for col in columns_to_keep]

# Merge the DataFrames on the 'review' column
bc_merged_df = pd.merge(manual_selected, ai_selected, on='review', how='inner')
#bc_merged_df

Unnamed: 0,review,l1_inaccurate_cycle_prediction,l2_delayed_customer_service,l3_poor_prescription_management,l4_problematic_billing_practices,AI_l1_inaccurate_cycle_prediction,AI_l2_delayed_customer_service,AI_l3_poor_prescription_management,AI_l4_problematic_billing_practices
0,Nurx informed me that my lab results came back...,0,1,1,0,0,1,1,0
1,TOTAL BAIT & SWITCH. I read the bad reviews be...,0,1,1,0,0,1,1,0
2,I got this app almost immediately after learni...,1,0,0,0,1,0,0,0
3,I’ve gotten in contact with Nurx’s customer se...,0,1,1,0,0,1,1,0
4,I used the app for 6 months before falling pre...,1,0,0,1,1,1,0,1
...,...,...,...,...,...,...,...,...,...
84,They take days to respond.,0,1,0,0,0,1,0,0
85,"Inaccurate, no way to exclude a cycle, expensi...",1,0,0,0,1,1,0,0
86,I love this app don’t get me wrong I have neve...,0,0,0,0,0,0,0,0
87,The app is simple and and has a nice theme to ...,0,0,0,0,0,0,0,1
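
Merging on the `review` text assumes that every review string is unique in both files. If a review appears twice in each file, an inner merge yields four rows for it, which would inflate the agreement counts. Here is a small sketch of a pre-merge check. The function name is hypothetical; in the notebook the input could be `bc_manual_df['review'].tolist()`:

```python
from collections import Counter

def duplicated_values(values):
    """Return the values that occur more than once, with their counts."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}

reviews = ["They take days to respond.", "Too expensive", "They take days to respond."]
print(duplicated_values(reviews))  # {'They take days to respond.': 2}
```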


#### 2) Calculate and Compare the F1-score

In [107]:
# Define the categories
categories = [
    'l1_inaccurate_cycle_prediction',
    'l2_delayed_customer_service', 
    'l3_poor_prescription_management',
    'l4_problematic_billing_practices'
]

# List to hold data for analysis
data_for_analysis = []

# Loop through each category to calculate concordance and gather statistics
for category in categories:
    for index, row in bc_merged_df.iterrows():
        manual_label = row[category]
        ai_label = row[f'AI_{category}']
        concordance = 'Agree' if manual_label == ai_label else 'Disagree'
        data_for_analysis.append({'Category': category, 'Concordance': concordance, 'Count': 1})

# Create a DataFrame from the list
analysis_df = pd.DataFrame(data_for_analysis)

# Generate a pivot table to summarize concordance counts
pivot_table = analysis_df.pivot_table(index='Category', columns='Concordance', values='Count', aggfunc='sum', fill_value=0)
print(pivot_table)

# Prepare a DataFrame of agreement counts (values copied from the pivot table printed above)
bc_summary_data = {
    'Category': categories,
    'Agree': [88, 77, 74, 77],
    'Disagree': [1, 12, 15, 12]
}
bc_summary_df = pd.DataFrame(bc_summary_data)

# Note: 'Precision' and 'Recall' are both computed as Agree / (Agree + Disagree),
# i.e. the share of reviews where GPT and the manual tag agree (accuracy),
# so 'F1-Score' reduces to that same value
bc_summary_df['Precision'] = bc_summary_df['Agree'] / (bc_summary_df['Agree'] + bc_summary_df['Disagree'])
bc_summary_df['Recall'] = bc_summary_df['Agree'] / (bc_summary_df['Agree'] + bc_summary_df['Disagree'])
bc_summary_df['F1-Score'] = 2 * (bc_summary_df['Precision'] * bc_summary_df['Recall']) / (bc_summary_df['Precision'] + bc_summary_df['Recall'])

print(bc_summary_df[['Category', 'Precision', 'Recall', 'F1-Score']])


Concordance                       Agree  Disagree
Category                                         
l1_inaccurate_cycle_prediction       88         1
l2_delayed_customer_service          77        12
l3_poor_prescription_management      74        15
l4_problematic_billing_practices     77        12
                           Category  Precision    Recall  F1-Score
0    l1_inaccurate_cycle_prediction   0.988764  0.988764  0.988764
1       l2_delayed_customer_service   0.865169  0.865169  0.865169
2   l3_poor_prescription_management   0.831461  0.831461  0.831461
3  l4_problematic_billing_practices   0.865169  0.865169  0.865169
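
Precision, recall, and F1 are conventionally derived from true-positive, false-positive, and false-negative counts rather than from overall agreement. The following is a minimal sketch of that calculation from paired 0/1 labels, treating the manual tags as ground truth. The function name and the toy labels are hypothetical; in the notebook the inputs could be `bc_merged_df[category]` and `bc_merged_df[f'AI_{category}']`:

```python
def precision_recall_f1(manual, predicted):
    """Compute precision, recall and F1 for binary labels (1 = category applies)."""
    pairs = list(zip(manual, predicted))
    tp = sum(1 for m, p in pairs if m == 1 and p == 1)  # both say the category applies
    fp = sum(1 for m, p in pairs if m == 0 and p == 1)  # only the model says so
    fn = sum(1 for m, p in pairs if m == 1 and p == 0)  # only the manual tag says so
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

manual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0]
print(precision_recall_f1(manual, predicted))
```

When most reviews are true negatives for a category, agreement can be high even if F1 is low, so the two numbers are worth reporting separately.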


In [None]:
# All four categories score above 0.8.
# Because precision and recall were both computed as Agree / (Agree + Disagree),
# these scores measure how often GPT's labels agree with the manual tags (accuracy),
# not precision or recall in the usual TP/FP/FN sense.


## Collection 2 - Period-and-Fertility-Tracking Apps (x2)

- I repeated the same steps as in Collection 1 - Birth-Control-Oriented Apps (x5).

In [6]:
import pandas as pd 

pt_df = pd.read_csv("combined_pt_samples.csv")

In [78]:
pt_df.columns

Index(['date', 'developerResponse', 'review', 'rating', 'isEdited', 'userName',
       'title', 'app_name', 'app_id'],
      dtype='object')

In [79]:
def categorize(row):
    prompt = """
        Below is the text to a review. Classify it as one or more of the following categories:

        - l1_inaccurate_cycle_prediction: This category suggests that the app's cycle prediction algorithm is inaccurate, sometimes leading to unplanned pregnancies.
        - l2_unfair_functionality_charges: This category suggests that many users express frustration over unreasonable fees for basic functions, excessive ads or aggressive premium upgrade promotions, new updates that degrade app performance, removal of essential features for predicting ovulation, and late or missing reminder notifications, etc.
        - l3_user_data_privacy_concerns: This category suggests that users are worried their collected period data could be used against them in the future, especially following the overturn of Roe v. Wade.
        - l4_if_related_to_the_overturn: This category suggests that these reviews directly talk about the concerns of their experiences with the app due to the 2022 overturn of Roe v. Wade, with key words such as "Roe v. Wade" or "overturn" explicitly appearing in the reviews.

        The review may be assigned to multiple categories. Please list all applicable categories based on the review content.

        Review text:

        {text}
        """
    
    results = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "You are a review assistant. Be brief and consider multiple categories in your responses."},
        {"role": "user", "content": prompt.format(text=row['review'])}
      ],
      temperature=0,
      logprobs=True,  # include token log probabilities
    )

    return pd.Series({
        'content': results.choices[0].message.content,
        'probability': math.exp(results.choices[0].logprobs.content[0].logprob)
    })

In [80]:
print(pt_df.iloc[0])
print(categorize(pt_df.iloc[0]))

date                                                     10/23/20 0:03
developerResponse    {'id': 18876181, 'body': "Hello, we are terrib...
review               I’ve recently updated the app and I can no lon...
rating                                                               1
isEdited                                                         False
userName                                                 hi I'm catbug
title                                          crashing after updating
app_name                                  clue-period-tracker-calendar
app_id                                                       657189652
Name: 0, dtype: object
content        The review text does not fall into any of the ...
probability                                             0.583782
dtype: object


In [81]:
from tqdm.auto import tqdm
tqdm.pandas()

pt_df[['category', 'probability']] = pt_df.progress_apply(categorize, axis=1)
pt_df

  0%|          | 0/364 [00:00<?, ?it/s]

Unnamed: 0,date,developerResponse,review,rating,isEdited,userName,title,app_name,app_id,category,probability
0,10/23/20 0:03,"{'id': 18876181, 'body': ""Hello, we are terrib...",I’ve recently updated the app and I can no lon...,1,False,hi I'm catbug,crashing after updating,clue-period-tracker-calendar,657189652,The review text can be classified under the fo...,0.503775
1,12/22/21 0:53,"{'id': 27011516, 'body': ""Hi Shaygify! We're v...",This app cost too much for what it’s worth. I’...,1,False,Shaygify,Too expensive,flo-period-pregnancy-tracker,1038369065,- l2_unfair_functionality_charges,0.991688
2,2/6/21 18:36,"{'id': 21004233, 'body': 'Hello! Thanks for th...",My girlfriend used to use this app to share he...,1,False,johngbirds,Updates make app useless,clue-period-tracker-calendar,657189652,- l2_unfair_functionality_charges,0.673256
3,2/24/21 23:00,"{'id': 21376078, 'body': 'Hello, thanks for yo...",Bring back the fertile window!! Some users use...,1,False,seller32,Bring back fertile window,clue-period-tracker-calendar,657189652,- l2_unfair_functionality_charges,0.994868
4,12/1/20 4:08,"{'id': 19489707, 'body': 'Hi soffffyyyyyyyyy,\...","used to be great and so helpful!! but now, for...",2,False,soffffyyyyyyyyy,flo premium ruined it,flo-period-pregnancy-tracker,1038369065,- l2_unfair_functionality_charges,0.977993
...,...,...,...,...,...,...,...,...,...,...,...
359,4/6/24 1:51,"{'id': 43115059, 'body': 'Hey, thanks for gett...",I’m connected to the internet using cellular a...,1,False,M. C.....,Keeps saying I’m offline,clue-period-tracker-calendar,657189652,- l1_inaccurate_cycle_prediction,0.953489
360,6/12/23 17:08,"{'id': 37111732, 'body': ""Hi, Thanks for shari...",I’ve used this app for about 8 years now and I...,1,False,clwalzer,Minimal features and zero support,clue-period-tracker-calendar,657189652,- l2_unfair_functionality_charges,0.990573
361,4/27/24 0:39,"{'id': 43632413, 'body': 'Hi there! Thank you ...",IT SELLE YOUR DATA AND IS BEING SUED AGAIN USE...,1,False,micky macky bo ba boo,DO NOT DOWNLOAD,flo-period-pregnancy-tracker,1038369065,- l3_user_data_privacy_concerns,0.971701
362,2/18/23 13:46,"{'id': 35056532, 'body': ""Hi Escobar2003!\nWe'...",What is going on with the Flo app? it’s not wo...,2,False,Escobar2003,Not working,flo-period-pregnancy-tracker,1038369065,The review does not clearly fit into any of th...,0.571061


In [84]:
#pt_df.to_csv("pt_categorized.csv", index=False)

In [89]:
pt_df = pd.read_csv("pt_categorized.csv")

categories = [
    'l1_inaccurate_cycle_prediction',
    'l2_unfair_functionality_charges', 
    'l3_user_data_privacy_concerns',
    'l4_if_related_to_the_overturn',
]

# Create columns for each category initialized to 0
for category in categories:
    pt_df[category] = 0

# Function to update the category columns based on the categorized column
def update_category_columns(row):
    if pd.notna(row['category']):  # Check if the categorized cell is not NaN
        for category in categories:
            if category in row['category']:
                row[category] = 1
    return row

# Apply the function to each row
new_pt_df = pt_df.apply(update_category_columns, axis=1)
new_pt_df

Unnamed: 0,date,developerResponse,review,rating,isEdited,userName,title,app_name,app_id,category,probability,l1_inaccurate_cycle_prediction,l2_unfair_functionality_charges,l3_user_data_privacy_concerns,l4_if_related_to_the_overturn
0,10/23/20 0:03,"{'id': 18876181, 'body': ""Hello, we are terrib...",I’ve recently updated the app and I can no lon...,1,False,hi I'm catbug,crashing after updating,clue-period-tracker-calendar,657189652,The review text can be classified under the fo...,0.503775,0,1,0,0
1,12/22/21 0:53,"{'id': 27011516, 'body': ""Hi Shaygify! We're v...",This app cost too much for what it’s worth. I’...,1,False,Shaygify,Too expensive,flo-period-pregnancy-tracker,1038369065,- l2_unfair_functionality_charges,0.991688,0,1,0,0
2,2/6/21 18:36,"{'id': 21004233, 'body': 'Hello! Thanks for th...",My girlfriend used to use this app to share he...,1,False,johngbirds,Updates make app useless,clue-period-tracker-calendar,657189652,- l2_unfair_functionality_charges,0.673256,0,1,0,0
3,2/24/21 23:00,"{'id': 21376078, 'body': 'Hello, thanks for yo...",Bring back the fertile window!! Some users use...,1,False,seller32,Bring back fertile window,clue-period-tracker-calendar,657189652,- l2_unfair_functionality_charges,0.994868,0,1,0,0
4,12/1/20 4:08,"{'id': 19489707, 'body': 'Hi soffffyyyyyyyyy,\...","used to be great and so helpful!! but now, for...",2,False,soffffyyyyyyyyy,flo premium ruined it,flo-period-pregnancy-tracker,1038369065,- l2_unfair_functionality_charges,0.977993,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
359,4/6/24 1:51,"{'id': 43115059, 'body': 'Hey, thanks for gett...",I’m connected to the internet using cellular a...,1,False,M. C.....,Keeps saying I’m offline,clue-period-tracker-calendar,657189652,- l1_inaccurate_cycle_prediction,0.953489,1,0,0,0
360,6/12/23 17:08,"{'id': 37111732, 'body': ""Hi, Thanks for shari...",I’ve used this app for about 8 years now and I...,1,False,clwalzer,Minimal features and zero support,clue-period-tracker-calendar,657189652,- l2_unfair_functionality_charges,0.990573,0,1,0,0
361,4/27/24 0:39,"{'id': 43632413, 'body': 'Hi there! Thank you ...",IT SELLE YOUR DATA AND IS BEING SUED AGAIN USE...,1,False,micky macky bo ba boo,DO NOT DOWNLOAD,flo-period-pregnancy-tracker,1038369065,- l3_user_data_privacy_concerns,0.971701,0,0,1,0
362,2/18/23 13:46,"{'id': 35056532, 'body': ""Hi Escobar2003!\nWe'...",What is going on with the Flo app? it’s not wo...,2,False,Escobar2003,Not working,flo-period-pregnancy-tracker,1038369065,The review does not clearly fit into any of th...,0.571061,1,1,1,1
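
Row 362 above shows a limitation of substring matching: the response says the review does not fit any category, yet all four flags are 1, presumably because the explanation mentions each category name. Below is a rough sketch of a stricter parser that only counts a name when it starts a list item. The function name `parse_categories` and the line-based rule are my assumptions about the response format, not something the API guarantees:

```python
import re

CATEGORIES = [
    'l1_inaccurate_cycle_prediction',
    'l2_unfair_functionality_charges',
    'l3_user_data_privacy_concerns',
    'l4_if_related_to_the_overturn',
]

def parse_categories(response, categories=CATEGORIES):
    """Return 0/1 flags, counting a category only when a line starts with '- <name>'."""
    flags = {name: 0 for name in categories}
    for line in str(response).splitlines():
        # Optional indentation, a bullet, optional bold/backtick markup, then the name
        match = re.match(r"\s*[-*]\s*[*`]*(\w+)", line)
        if match and match.group(1) in flags:
            flags[match.group(1)] = 1
    return flags

print(parse_categories("- l2_unfair_functionality_charges\n- l3_user_data_privacy_concerns"))
print(parse_categories("The review does not fit l1_inaccurate_cycle_prediction or any other category."))
```

This would still be fooled if the model listed rejected categories as bullets. Asking for a fixed output format in the prompt (for example, a comma-separated list or `none`) would likely be more robust.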


In [90]:
# Save the updated DataFrame
#new_pt_df.to_csv("pt_categorized_with_columns.csv", index=False)

In [92]:
# Load your CSV files
pt_manual_df = pd.read_csv("combined_pt_tagged.csv")
pt_ai_df = pd.read_csv("pt_categorized_with_columns.csv")

# Define the columns you want to extract, excluding 'review' to avoid duplication in merge
columns_to_keep = [
    'l1_inaccurate_cycle_prediction',
    'l2_unfair_functionality_charges', 
    'l3_user_data_privacy_concerns',
    'l4_if_related_to_the_overturn',
]

# Prepare the manual and AI DataFrames by selecting the necessary columns
manual_selected = pt_manual_df[['review'] + columns_to_keep]
ai_selected = pt_ai_df[['review'] + columns_to_keep]

# Rename AI DataFrame columns for clarity in the merged file
ai_selected.columns = ['review'] + [f'AI_{col}' for col in columns_to_keep]

# Merge the DataFrames on the 'review' column
pt_merged_df = pd.merge(manual_selected, ai_selected, on='review', how='inner')
pt_merged_df


Unnamed: 0,review,l1_inaccurate_cycle_prediction,l2_unfair_functionality_charges,l3_user_data_privacy_concerns,l4_if_related_to_the_overturn,AI_l1_inaccurate_cycle_prediction,AI_l2_unfair_functionality_charges,AI_l3_user_data_privacy_concerns,AI_l4_if_related_to_the_overturn
0,I’ve recently updated the app and I can no lon...,0,0,0,0,0,1,0,0
1,This app cost too much for what it’s worth. I’...,0,1,0,0,0,1,0,0
2,My girlfriend used to use this app to share he...,0,0,0,0,0,1,0,0
3,Bring back the fertile window!! Some users use...,0,1,0,0,0,1,0,0
4,"used to be great and so helpful!! but now, for...",0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...
359,I’m connected to the internet using cellular a...,1,0,0,0,1,0,0,0
360,I’ve used this app for about 8 years now and I...,0,1,0,0,0,1,0,0
361,IT SELLE YOUR DATA AND IS BEING SUED AGAIN USE...,0,0,1,0,0,0,1,0
362,What is going on with the Flo app? it’s not wo...,0,0,0,0,1,1,1,1


In [94]:
# Define the categories
categories = [
    'l1_inaccurate_cycle_prediction',
    'l2_unfair_functionality_charges', 
    'l3_user_data_privacy_concerns',
    'l4_if_related_to_the_overturn',
]

# List to hold data for analysis
data_for_analysis = []

# Loop through each category to calculate concordance and gather statistics
for category in categories:
    for index, row in pt_merged_df.iterrows():
        manual_label = row[category]
        ai_label = row[f'AI_{category}']
        concordance = 'Agree' if manual_label == ai_label else 'Disagree'
        data_for_analysis.append({'Category': category, 'Concordance': concordance, 'Count': 1})

# Create a DataFrame from the list
analysis_df = pd.DataFrame(data_for_analysis)

# Generate a pivot table to summarize concordance counts
pivot_table = analysis_df.pivot_table(index='Category', columns='Concordance', values='Count', aggfunc='sum', fill_value=0)
print(pivot_table)

Concordance                      Agree  Disagree
Category                                        
l1_inaccurate_cycle_prediction     346        18
l2_unfair_functionality_charges    316        48
l3_user_data_privacy_concerns      348        16
l4_if_related_to_the_overturn      358         6


In [95]:
# Prepare a DataFrame of agreement counts (values copied from the pivot table printed above)
pt_summary_data = {
    'Category': categories,
    'Agree': [346, 316, 348, 358],
    'Disagree': [18, 48, 16, 6]
}
pt_summary_df = pd.DataFrame(pt_summary_data)

# Note: 'Precision' and 'Recall' are both computed as Agree / (Agree + Disagree),
# i.e. the share of reviews where GPT and the manual tag agree (accuracy),
# so 'F1-Score' reduces to that same value
pt_summary_df['Precision'] = pt_summary_df['Agree'] / (pt_summary_df['Agree'] + pt_summary_df['Disagree'])
pt_summary_df['Recall'] = pt_summary_df['Agree'] / (pt_summary_df['Agree'] + pt_summary_df['Disagree'])
pt_summary_df['F1-Score'] = 2 * (pt_summary_df['Precision'] * pt_summary_df['Recall']) / (pt_summary_df['Precision'] + pt_summary_df['Recall'])

print(pt_summary_df[['Category', 'Precision', 'Recall', 'F1-Score']])


                          Category  Precision    Recall  F1-Score
0   l1_inaccurate_cycle_prediction   0.950549  0.950549  0.950549
1  l2_unfair_functionality_charges   0.868132  0.868132  0.868132
2    l3_user_data_privacy_concerns   0.956044  0.956044  0.956044
3    l4_if_related_to_the_overturn   0.983516  0.983516  0.983516


In [None]:
# All four categories score above 0.8.
# Because precision and recall were both computed as Agree / (Agree + Disagree),
# these scores measure how often GPT's labels agree with the manual tags (accuracy),
# not precision or recall in the usual TP/FP/FN sense.
