## 4 Build and Train the Classifier 2.0 (Separate Classifier) ❌

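In this notebook, to address the class imbalance in the data, I attempted to train a separate classifier for each category and then combine the predicted results. However, this approach was still heavily affected by the data imbalance issue: for the first label in both collections, the model predicted only one class for every unlabeled review.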

### **1) Collection 1**
- **Training Data - combined_bc_tagged.csv** - This file contains all the manually tagged data from before and after the overturn, covering 5 Birth-Control-Oriented Apps (Collection 1).
- **Unlabeled Data - combined_bc_unlabeled.csv** - This file contains all the unlabeled data from before and after the overturn, covering 5 Birth-Control-Oriented Apps (Collection 1).


### **2) Collection 2**
- **Training Data - combined_pt_tagged.csv** - This file contains all the manually tagged data from before and after the overturn, covering 2 Period-and-Fertility-Tracking Apps (Collection 2).

- **Unlabeled Data - combined_pt_unlabeled.csv** - This file includes all the unlabeled data from before and after the overturn, covering 2 Period-and-Fertility-Tracking Apps (Collection 2).


## Collection 1 - Birth-Control-Oriented Apps (x5)

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load the datasets
tagged_data = pd.read_csv('combined_bc_tagged.csv')
unlabeled_data = pd.read_csv('combined_bc_unlabeled.csv')

# Preprocessing the text data
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tagged = vectorizer.fit_transform(tagged_data['review'])
X_unlabeled = vectorizer.transform(unlabeled_data['review'])

# Labels
labels = ['l1_inaccurate_cycle_prediction', 'l2_delayed_customer_service', 
          'l3_poor_prescription_management', 'l4_problematic_billing_practices']

# Initialize a DataFrame to hold all predictions for unlabeled data
predictions_df = pd.DataFrame(index=range(len(unlabeled_data)))

# Process each label with a separate classifier
for label in labels:
    print(f"Training classifier for {label}...")
    y_tagged = tagged_data[label]

    # Splitting the data
    X_train, X_test, y_train, y_test = train_test_split(
        X_tagged, y_tagged, test_size=0.2, random_state=42, stratify=y_tagged)

    # Classifier
    classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    classifier.fit(X_train, y_train)

    # Evaluate the classifier
    y_pred = classifier.predict(X_test)
    print(f"Classification report for {label}:")
    print(classification_report(y_test, y_pred))

    # Predict on the unlabeled data
    label_predictions = classifier.predict(X_unlabeled)
    predictions_df[label] = label_predictions

# Combine predictions with the unlabeled reviews
bc_unlabeled_data_with_predictions = pd.concat([unlabeled_data, predictions_df], axis=1)
bc_unlabeled_data_with_predictions

# Optionally, export to CSV
# bc_unlabeled_data_with_predictions.to_csv('bc_unlabeled_data_with_predictions.csv', index=False)

print("Completed predicting labels for all categories.")


Training classifier for l1_inaccurate_cycle_prediction...
Classification report for l1_inaccurate_cycle_prediction:
              precision    recall  f1-score   support

           0       0.94      1.00      0.97        17
           1       0.00      0.00      0.00         1

    accuracy                           0.94        18
   macro avg       0.47      0.50      0.49        18
weighted avg       0.89      0.94      0.92        18

Training classifier for l2_delayed_customer_service...


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Classification report for l2_delayed_customer_service:
              precision    recall  f1-score   support

           0       0.62      0.89      0.73         9
           1       0.80      0.44      0.57         9

    accuracy                           0.67        18
   macro avg       0.71      0.67      0.65        18
weighted avg       0.71      0.67      0.65        18

Training classifier for l3_poor_prescription_management...
Classification report for l3_poor_prescription_management:
              precision    recall  f1-score   support

           0       0.71      0.71      0.71         7
           1       0.82      0.82      0.82        11

    accuracy                           0.78        18
   macro avg       0.77      0.77      0.77        18
weighted avg       0.78      0.78      0.78        18

Training classifier for l4_problematic_billing_practices...
Classification report for l4_problematic_billing_practices:
              precision    recall  f1-score   support

In [3]:
bc_unlabeled_data_with_predictions

Unnamed: 0.1,Unnamed: 0,date,developerResponse,review,rating,isEdited,userName,title,app_name,app_id,l1_inaccurate_cycle_prediction,l2_delayed_customer_service,l3_poor_prescription_management,l4_problematic_billing_practices
0,16,2021-04-08 12:35:25,"{'id': 22156213, 'body': ""Hi Lynn, we are disa...",I first used Nurx a few years ago and it was a...,1,False,ALynnJ42,Used to be good,nurx-birth-control-delivered,1213141301,0,1,1,0
1,19,2021-01-05 06:10:19,,"I am not one to usually write reviews, but my ...",1,False,cp_2015,Avoid at all costs,nurx-birth-control-delivered,1213141301,0,1,1,0
2,33,2020-05-29 20:59:19,"{'id': 11157202, 'body': 'We understand your r...",First the bad. I thought with this app/service...,2,True,Jen316,Good and bad,nurx-birth-control-delivered,1213141301,0,1,1,0
3,38,2021-08-12 03:31:37,"{'id': 24534904, 'body': ""Hello Anna, thank yo...","If this worked well, I would love it. Unfortun...",1,False,anna.eliza,"Good idea, bad execution",nurx-birth-control-delivered,1213141301,0,0,0,0
4,39,2020-12-29 21:47:51,"{'id': 20112386, 'body': ""We are sorry to hear...","I hesitate to write negative reviews, but this...",1,False,Sacatu,Waste of time and $$,nurx-birth-control-delivered,1213141301,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
930,263,2024-02-01 20:20:43,"{'id': 41980499, 'body': 'Oh no! We sincerely ...","In miserable pain for a UTI, showing all sympt...",1,False,dessert enjoyer,Cancelled UTI prescription request,planned-parenthood-direct,1214393415,0,0,0,0
931,268,2023-12-07 15:27:51,,The app is not working! don't waste your time,1,False,paulineczka1212,The app is not working,planned-parenthood-direct,1214393415,0,0,0,0
932,278,2023-09-15 02:26:42,"{'id': 39037722, 'body': 'Hi there - we are so...",it’s been 6 days and i have yet to get my pack...,1,False,uraqtbaeee,"never received, nobody answered",planned-parenthood-direct,1214393415,0,1,0,0
933,279,2023-06-11 13:16:06,"{'id': 37257712, 'body': ""Oh no, that doesn't ...",dumb app,1,False,Destiny Amari Robinson,doesnt work,planned-parenthood-direct,1214393415,0,0,0,0


In [4]:
unique_values = bc_unlabeled_data_with_predictions[['l1_inaccurate_cycle_prediction', 'l2_delayed_customer_service', 'l3_poor_prescription_management', 'l4_problematic_billing_practices']].nunique()

# Print the number of unique values for each label column
print(unique_values)

l1_inaccurate_cycle_prediction      1
l2_delayed_customer_service         2
l3_poor_prescription_management     2
l4_problematic_billing_practices    2
dtype: int64
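
A single unique value for `l1_inaccurate_cycle_prediction` means the classifier assigned the same class to every unlabeled review. One way to see why is to count the positives per label in the tagged data before training. The sketch below uses a small made-up frame (`toy_tagged` is hypothetical, not the real `combined_bc_tagged.csv`) with two of the label columns from this notebook.

```python
import pandas as pd

label_cols = ["l1_inaccurate_cycle_prediction", "l2_delayed_customer_service"]

# Hypothetical tagged data: l1 is rare, l2 is roughly balanced
toy_tagged = pd.DataFrame({
    "l1_inaccurate_cycle_prediction": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "l2_delayed_customer_service":    [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
})

# For 0/1 labels, the column sum is the positive count and the mean is the positive rate
summary = pd.DataFrame({
    "positives": toy_tagged[label_cols].sum(),
    "rate": toy_tagged[label_cols].mean(),
})
print(summary)
```

Running the same two lines on the real tagged file would show which labels have too few positives for a stratified 80/20 split to leave more than one or two positive examples in the test set.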


## Collection 2 - Period-and-Fertility-Tracking Apps (x2)

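- Same pipeline as above, except that the random forest is wrapped in a `MultiOutputClassifier` instead of being trained in an explicit loop over the labels.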
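
`MultiOutputClassifier` fits one independent copy of the base estimator per label column, so it is still a separate-classifier setup rather than a joint model. A minimal sketch on made-up random features (the arrays `X` and `Y` are hypothetical and stand in for the TF-IDF matrix and the label columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Made-up data: 20 samples, 5 features, 2 binary label columns
rng = np.random.RandomState(0)
X = rng.rand(20, 5)
Y = np.column_stack([
    (X[:, 0] > 0.5).astype(int),
    (X[:, 1] > 0.5).astype(int),
])

base = RandomForestClassifier(n_estimators=10, random_state=42)
model = MultiOutputClassifier(base).fit(X, Y)

# One fitted forest per label column
n_fitted = len(model.estimators_)
# Predictions come back as one column per label
pred_shape = model.predict(X).shape
print(n_fitted, pred_shape)
```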

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report

# Load the datasets
tagged_data_2 = pd.read_csv('combined_pt_tagged.csv')
unlabeled_data_2 = pd.read_csv('combined_pt_unlabeled.csv')

# Preprocessing the text data
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Fitting the vectorizer on the tagged data reviews and transforming the text
X_tagged_2 = vectorizer.fit_transform(tagged_data_2['review'])
# Copy the transformed matrix so that it is writable at prediction time
X_unlabeled_2 = vectorizer.transform(unlabeled_data_2['review']).copy()

# Extracting the labels
y_tagged_2 = tagged_data_2[['l1_inaccurate_cycle_prediction', 'l2_unfair_functionality_charges', 'l3_user_data_privacy_concerns', 'l4_if_related_to_the_overturn']]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tagged_2, y_tagged_2, test_size=0.2, random_state=42)

# Initialize the MultiOutputClassifier with a RandomForest
forest = RandomForestClassifier(n_estimators=100, random_state=42)
multi_target_forest = MultiOutputClassifier(forest, n_jobs=1)  # Set n_jobs to 1 to avoid parallel processing issues

# Train the model
multi_target_forest.fit(X_train, y_train)

# Predict on the test portion of the tagged data
y_pred = multi_target_forest.predict(X_test)
# Rows 0-3 of the report correspond to the four label columns, in the order listed in y_tagged_2
print(classification_report(y_test, y_pred))

# Predict labels for unlabeled data
predictions_2 = multi_target_forest.predict(X_unlabeled_2)  # Use the copy for prediction

# Create a DataFrame with predictions
predicted_labels_2 = pd.DataFrame(predictions_2, columns=['l1_inaccurate_cycle_prediction', 'l2_unfair_functionality_charges', 'l3_user_data_privacy_concerns', 'l4_if_related_to_the_overturn'])

# Combine predictions with the unlabeled reviews
pt_unlabeled_data_with_predictions = pd.concat([unlabeled_data_2, predicted_labels_2], axis=1)
pt_unlabeled_data_with_predictions

# Export to CSV
#pt_unlabeled_data_with_predictions.to_csv('pt_unlabeled_data_with_predictions.csv', index=False)


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.76      0.91      0.83        45
           2       1.00      0.18      0.31        11
           3       1.00      0.33      0.50         3

   micro avg       0.77      0.73      0.75        60
   macro avg       0.69      0.36      0.41        60
weighted avg       0.80      0.73      0.70        60
 samples avg       0.58      0.56      0.57        60



Unnamed: 0.1,Unnamed: 0,date,developerResponse,review,rating,isEdited,userName,title,app_name,app_id,l1_inaccurate_cycle_prediction,l2_unfair_functionality_charges,l3_user_data_privacy_concerns,l4_if_related_to_the_overturn
0,7,2022-02-20 02:16:25,"{'id': 28190339, 'body': 'Hi kennaliz122,\n\nT...",I downloaded Flo when I was a sophomore in hig...,2,False,kennaliz122,Too much for too little,flo-period-pregnancy-tracker,1038369065,0,1,0,0
1,13,2020-10-02 23:29:22,"{'id': 18298178, 'body': 'Hi kathynicole,\n\nT...",I originally downloaded this app to track my p...,1,False,kathynicole,Extra features only available with premium,flo-period-pregnancy-tracker,1038369065,0,1,0,0
2,16,2021-04-12 09:16:17,"{'id': 22237904, 'body': 'Hi Jamieeeeeee21,\nT...",I only want notifications that tell me when to...,2,False,Jamieeeeeee21,Too many ads,flo-period-pregnancy-tracker,1038369065,0,1,0,0
3,22,2022-04-27 20:32:32,,The concept of the app is great actually! But ...,2,False,Giavanna Steiner,Developers- please read,flo-period-pregnancy-tracker,1038369065,0,1,0,0
4,23,2021-08-05 15:20:44,"{'id': 24399169, 'body': 'Hi there,\n\nWe unde...",I understand that Flo is run by A LOT of peopl...,2,False,3457811946,Disappointment as the app “grows”,flo-period-pregnancy-tracker,1038369065,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7432,4741,2022-09-07 14:09:11,"{'id': 31973009, 'body': 'Hey, thanks for reac...",Where did all my data go? I updated the app an...,1,False,GS/BZ,All my data disappeared,clue-period-tracker-calendar,657189652,0,0,0,0
7433,4742,2022-08-06 00:11:40,"{'id': 31642002, 'body': ""Again, we're sorry t...",Clue App Keeps Freezing and/Crashing,1,False,Afnoir,Clue App Keeps Freezing and/Crashing,clue-period-tracker-calendar,657189652,0,0,0,0
7434,4743,2022-06-25 00:21:51,"{'id': 30598266, 'body': 'Hey, thanks for reac...",DELETE YOUR DATA WITHIN THE APP BEFORE UNINSTA...,1,False,Krislynx,DELETE - THEY WILL SHARE YOUR DATA,clue-period-tracker-calendar,657189652,0,0,0,0
7435,4745,2022-05-21 16:46:22,"{'id': 29934599, 'body': 'Hey, thanks for reac...",Despite Clue’s privacy policy and their social...,1,False,eloise199,Data is stored on AWS cloud storage,clue-period-tracker-calendar,657189652,0,0,1,0


In [2]:
unique_values2 = pt_unlabeled_data_with_predictions[['l1_inaccurate_cycle_prediction', 'l2_unfair_functionality_charges', 'l3_user_data_privacy_concerns', 'l4_if_related_to_the_overturn']].nunique()

# Print the number of unique values for each label column
print(unique_values2)

l1_inaccurate_cycle_prediction     1
l2_unfair_functionality_charges    2
l3_user_data_privacy_concerns      2
l4_if_related_to_the_overturn      2
dtype: int64
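
The multi-label report for Collection 2 lists its rows as 0 to 3, which makes it hard to tell which label is which. A minimal sketch, assuming `classification_report` is given multi-label indicator arrays: passing `target_names` prints the column names instead of the indices. The arrays below are made up and use only two of the four labels.

```python
import numpy as np
from sklearn.metrics import classification_report

names = ["l1_inaccurate_cycle_prediction", "l2_unfair_functionality_charges"]

# Made-up multi-label indicator arrays: one row per review, one column per label
y_true = np.array([[0, 1], [0, 1], [1, 0], [0, 0]])
y_pred = np.array([[0, 1], [0, 0], [0, 0], [0, 0]])

# target_names replaces the row indices; zero_division=0 silences undefined-metric warnings
report = classification_report(
    y_true, y_pred, target_names=names, zero_division=0, output_dict=True
)
for name in names:
    print(name, report[name])
```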
