## 4 Build and Train the Classifier 1.0 ❌

In this notebook, [I categorized seven apps into two collections by their function](https://docs.google.com/document/d/1Mh4KAlSKspelD7M2ScHmYqfBRzzbVUd6yZL6CK7jwZM/edit?usp=sharing) for manual tagging. I then built two classifiers with the sklearn package to predict the categories of remaining unlabeled reviews, aiming to verify:

- **whether unlabeled reviews from Collection 1 fell into one or more of the following categories:** <br>l1_inaccurate_cycle_prediction, l2_delayed_customer_service, l3_poor_prescription_management, l4_problematic_billing_practices, l5_if_related_to_the_overturn

- **whether unlabeled reviews from Collection 2 fell into one or more of the following categories:** <br>l1_inaccurate_cycle_prediction, l2_unfair_functionality_charges, l3_user_data_privacy_concerns, l4_if_related_to_the_overturn

### **1) Collection 1**
- **Training Data - combined_bc_tagged.csv** - This file contains all the manually tagged data from before and after the overturn, covering 5 apps focused on birth control.
- **Unlabeled Data - combined_bc_unlabeled.csv** - This file contains all the unlabeled data from before and after the overturn, covering 5 Birth-Control-Oriented Apps (Collection 1).


### **2) Collection 2**
- **Training Data - combined_pt_tagged.csv** - This file contains all the manually tagged data from before and after the overturn, covering 2 Period-and-Fertility-Tracking Apps (Collection 2).

- **Unlabeled Data - combined_pt_unlabeled.csv** - This file contains all the unlabeled data from before and after the overturn, covering 2 Period-and-Fertility-Tracking Apps (Collection 2).


## Collection 1 - Birth-Control-Oriented Apps (x5)

### Step 1: Loading the Data

In [1]:
import pandas as pd

# Load the datasets
tagged_data = pd.read_csv('combined_bc_tagged.csv')
unlabeled_data = pd.read_csv('combined_bc_unlabeled.csv')

# Explore the first few rows of the datasets
#print(tagged_data.head())
#print(unlabeled_data.head())

# Check for missing values
#print(tagged_data.isnull().sum())
#print(unlabeled_data.isnull().sum())

### Step 2: Text Preprocessing

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocessing the text data
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
## TF-IDF Vectorizer: Converts text data into a matrix of TF-IDF features. 

# Fitting the vectorizer on the tagged data reviews and transforming the text
X_tagged = vectorizer.fit_transform(tagged_data['review'])
## fit_transform(): Learns the vocabulary and inverse document frequency weightings for tagged data.

X_unlabeled = vectorizer.transform(unlabeled_data['review'])  # Only transform for the unlabeled data
## transform(): Transforms the unlabeled data into the same feature space without fitting to avoid data leakage.
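
To make the difference between `fit_transform()` and `transform()` concrete, here is a minimal sketch on a made-up corpus. The review strings and the `toy_` names are illustrative assumptions, not data from the notebook. It shows that the second call reuses the vocabulary learned from the first, so words that appear only in the new text are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews (hypothetical), standing in for tagged_data['review']
train_reviews = [
    "billing charged me twice",
    "customer service never replied",
    "prescription refill was late",
]
# Toy review (hypothetical), standing in for unlabeled_data['review']
new_reviews = ["billing error and unknownword"]

toy_vectorizer = TfidfVectorizer(stop_words='english')
X_train_toy = toy_vectorizer.fit_transform(train_reviews)  # learns vocabulary and IDF weights
X_new_toy = toy_vectorizer.transform(new_reviews)          # reuses that vocabulary unchanged

# Both matrices share the same columns, one per learned term
print(X_train_toy.shape, X_new_toy.shape)
# Words unseen during fitting have no column and are ignored
print("unknownword" in toy_vectorizer.vocabulary_)
```

Because the column layout is fixed by the tagged reviews, a model trained on `X_tagged` can be applied directly to `X_unlabeled`.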


In [57]:
tagged_data.columns

Index(['date', 'developerResponse', 'review', 'rating', 'isEdited', 'userName',
       'title', 'app_name', 'app_id', 'l1_inaccurate_cycle_prediction',
       'l2_delayed_customer_service', 'l3_poor_prescription_management',
       'l4_problematic_billing_practices', 'all_text'],
      dtype='object')

In [68]:
tagged_data["l1_inaccurate_cycle_prediction"]

0     0
1     0
2     1
3     0
4     1
     ..
84    0
85    1
86    0
87    0
88    0
Name: l1_inaccurate_cycle_prediction, Length: 89, dtype: int64

### Step 3: Extract Labels

- This step only selects the label columns that the model will learn to predict. The actual learning happens later, during training, when the model picks up patterns in the review text that are associated with each label.

In [58]:
# Extracting the labels
y_tagged = tagged_data[['l1_inaccurate_cycle_prediction', 'l2_delayed_customer_service', 'l3_poor_prescription_management', 'l4_problematic_billing_practices']]


In [59]:
# Count the negatives (0) and positives (1) for each label column in y_tagged
print(y_tagged['l1_inaccurate_cycle_prediction'].value_counts())
print(y_tagged['l2_delayed_customer_service'].value_counts())
print(y_tagged['l3_poor_prescription_management'].value_counts())
print(y_tagged['l4_problematic_billing_practices'].value_counts())


0    82
1     7
Name: l1_inaccurate_cycle_prediction, dtype: int64
0    45
1    44
Name: l2_delayed_customer_service, dtype: int64
1    54
0    35
Name: l3_poor_prescription_management, dtype: int64
0    59
1    30
Name: l4_problematic_billing_practices, dtype: int64


### Step 4: Split Tagged Data for Training and Testing

This step divides the **tagged data** into two sets, one for training the model and one for testing it, so that performance is measured on reviews the model has not seen rather than on patterns it may have memorized from the training data.

- **Training Set:** The larger portion of the tagged data (e.g., 80%). The model learns from it by adjusting its internal parameters to predict the labels as accurately as possible.

- **Testing Set:** The smaller portion of the tagged data (e.g., 20%). It is not used for training; it serves as a proxy for unseen data when evaluating the model's prediction accuracy.

**Note on `random_state=42`:** The seed only fixes how the data is shuffled before splitting, which makes the split reproducible. If a model scores 85% accuracy with random_state=42 and 86% with random_state=100, the difference does not mean that 100 is inherently better than 42. The rows simply ended up grouped slightly differently in the training and testing sets for that run. With a large, well-distributed dataset these differences should be minor; with a small one (89 tagged reviews here) they can be larger.
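
As a rough illustration of the split sizes and of what the seed controls, the sketch below splits 89 placeholder row ids (the number of tagged reviews shown earlier) instead of the real feature matrix. The variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 89 placeholder row ids, standing in for the tagged reviews
row_ids = np.arange(89)

train_a, test_a = train_test_split(row_ids, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(row_ids, test_size=0.2, random_state=42)
train_c, test_c = train_test_split(row_ids, test_size=0.2, random_state=100)

# Sizes of the two partitions
print(len(train_a), len(test_a))
# The same seed reproduces the same split
print(np.array_equal(test_a, test_b))
# A different seed usually selects different rows for testing
print(sorted(test_a)[:5], sorted(test_c)[:5])
```

With so few test rows, a single review moving between the two sets can shift the reported scores noticeably.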


In [60]:
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tagged, y_tagged, test_size=0.1, random_state=42, stratify=y_tagged)
## X_tagged and y_tagged --> are the features (vectorized reviews) and labels, respectively.
## test_size=0.1 --> means 10% of the data is set aside for testing. The remaining 90% is used for training.
## random_state=42 --> ensures reproducibility of the results
## stratify=y_tagged --> asks the split to preserve the label distribution across all four label columns

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
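
A likely reading of this error: when `stratify` receives several label columns, scikit-learn treats every distinct combination of label values as its own class, and at least one combination occurs only once in the tagged data. The sketch below reproduces that situation on a made-up two-label frame (`y_toy`, `X_toy`, `label_a` and `label_b` are hypothetical) and shows one possible workaround, stratifying on a single column. It is a toy illustration, not a verified fix for this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical two-label frame: the combination (1, 1) occurs exactly once
y_toy = pd.DataFrame({
    'label_a': [0] * 10 + [1] * 10,
    'label_b': [0] * 19 + [1],
})
X_toy = pd.DataFrame({'feature': range(20)})

# Count how often each label combination occurs
combo_counts = y_toy.value_counts()
print(combo_counts)

# Stratifying on all columns fails because one combination has a single member
try:
    train_test_split(X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)
    stratify_all_ok = True
except ValueError as err:
    stratify_all_ok = False
    print("Stratifying on all columns failed:", err)

# Stratifying on one sufficiently populated column works on this toy data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy['label_a']
)
print(y_te['label_a'].value_counts())
```

Running `y_tagged.value_counts()` on the real labels would show which combinations are too rare to stratify on.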

In [67]:
from sklearn.model_selection import train_test_split

# X_tagged is the TF-IDF feature matrix built in Step 2,
# and y_tagged is the label DataFrame extracted in Step 3:
# y_tagged = tagged_data[['l1_inaccurate_cycle_prediction', 'l2_delayed_customer_service', 'l3_poor_prescription_management', 'l4_problematic_billing_practices']]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_tagged,  # Your features
    y_tagged,  # Your labels
    test_size=0.2,  # 20% of the data is set aside for testing
    random_state=42,  # Ensures reproducibility of the results
    stratify=y_tagged['l1_inaccurate_cycle_prediction']  # Stratifying based on 'l1_inaccurate_cycle_prediction'
)

# Optionally, to verify the distribution of classes in the train and test sets:
print("Training set distribution:")
print(y_train['l1_inaccurate_cycle_prediction'].value_counts())
print("Testing set distribution:")
print(y_test['l1_inaccurate_cycle_prediction'].value_counts())


TypeError: '<=' not supported between instances of 'list' and 'int'

### Step 5: Initialize and Train Classifier

- **Random Forest Classifier:** A robust ensemble machine learning algorithm used for classification.
- 🌟**MultiOutputClassifier🌟:** Handles multiple labels simultaneously by fitting one classifier per target.
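
To show what "one classifier per target" means in practice, here is a small sketch on random synthetic data. `X_demo`, `Y_demo` and `demo_model` are placeholder names and the data has no relation to the reviews.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X_demo = rng.random((40, 5))                # 40 samples, 5 synthetic features
Y_demo = rng.integers(0, 2, size=(40, 3))   # 3 synthetic binary labels

demo_model = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=10, random_state=0)
)
demo_model.fit(X_demo, Y_demo)

# One fitted forest per label column
print(len(demo_model.estimators_))
# Predictions have one column per label
print(demo_model.predict(X_demo[:4]).shape)
```

Since each label gets an independent forest, a rare label is learned only from its own few positive examples and does not benefit from the other labels.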

In [62]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Initialize the MultiOutputClassifier with a RandomForest
forest = RandomForestClassifier(n_estimators=100, random_state=42)
## n_estimators=100 --> the number of decision trees in the forest. More trees generally give more stable predictions but increase the computational cost.
## random_state=42 --> ensures reproducibility of the results

multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
## forest --> This is the RandomForest classifier I created; one copy is fitted per label.
## n_jobs=-1 --> uses all available CPU cores to fit the per-label classifiers in parallel.

# Train the model
multi_target_forest.fit(X_train, y_train)


### Step 6: Model Evaluation

- **Precision:** Indicates the accuracy of positive predictions. A high precision means that the model did not label many negative samples as positive.
- **Recall:** Reflects the model's ability to find all the relevant cases (positive samples). A high recall means that the model found most of the positive samples.
- **F1-Score:** The harmonic mean of precision and recall. A high F1-score suggests a good balance between the two.
- **Support:** The number of actual occurrences of the class in the test set. Scores computed on a very small support (e.g., a single sample) are unreliable.
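
The definitions above can be checked by hand on a tiny example. The ground truth and predictions below are invented for illustration and describe a single label.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions for one label
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0]

# Here: true positives = 2, false positives = 1, false negatives = 2
precision = precision_score(y_true, y_hat)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_hat)        # TP / (TP + FN) = 2/4
f1 = f1_score(y_true, y_hat)                # harmonic mean of the two = 4/7
support = sum(y_true)                       # actual positives in y_true = 4

print(round(precision, 2), round(recall, 2), round(f1, 2), support)
```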

In [63]:
from sklearn.metrics import classification_report

# Predict on the test portion of the tagged data
y_pred = multi_target_forest.predict(X_test)

# Print classification report --> to evaluate the quality of predictions made by a classifier.
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.88      0.88      0.88         8
           2       0.82      0.82      0.82        11
           3       0.50      0.20      0.29         5

   micro avg       0.81      0.68      0.74        25
   macro avg       0.55      0.47      0.49        25
weighted avg       0.74      0.68      0.70        25
 samples avg       0.54      0.48      0.50        25
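
For multi-label input, rows 0 to 3 of this report correspond to the label columns in the order they appear in `y_tagged`. A possible way to make the report easier to read is to pass the column names as `target_names`. The sketch below does this on invented toy arrays with only two labels.

```python
import numpy as np
from sklearn.metrics import classification_report

label_names = ['l1_inaccurate_cycle_prediction', 'l2_delayed_customer_service']

# Hypothetical multi-label ground truth and predictions (rows = reviews)
y_true_toy = np.array([[0, 1], [0, 1], [1, 0], [0, 0]])
y_pred_toy = np.array([[0, 1], [0, 0], [0, 0], [0, 1]])

report = classification_report(
    y_true_toy,
    y_pred_toy,
    target_names=label_names,
    zero_division=0,  # report 0.0 instead of warning when a score is undefined
)
print(report)
```

In the notebook itself the equivalent call would presumably use `target_names=list(y_tagged.columns)`.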





### Step 7: Predict Unlabeled Data and Export Results

In [64]:
# Predict labels for unlabeled data
predictions = multi_target_forest.predict(X_unlabeled)
## The X_unlabeled --> is the result of transforming the raw text data from unlabeled_data through this vectorizer. 
## This transformation converts text data into a numerical format (a feature vector) that the model can interpret and use for making predictions.

# Create a DataFrame with predictions
predicted_labels = pd.DataFrame(predictions, columns=['l1_inaccurate_cycle_prediction', 'l2_delayed_customer_service', 'l3_poor_prescription_management', 'l4_problematic_billing_practices'])

# Combine predictions with the unlabeled reviews
bc_unlabeled_data_with_predictions = pd.concat([unlabeled_data, predicted_labels], axis=1)
bc_unlabeled_data_with_predictions

# Export to CSV
#bc_unlabeled_data_with_predictions.to_csv('bc_unlabeled_data_with_predictions.csv', index=False)

Unnamed: 0.1,Unnamed: 0,date,developerResponse,review,rating,isEdited,userName,title,app_name,app_id,all_text,l1_inaccurate_cycle_prediction,l2_delayed_customer_service,l3_poor_prescription_management,l4_problematic_billing_practices
0,16,2021-04-08 12:35:25,"{'id': 22156213, 'body': ""Hi Lynn, we are disa...",I first used Nurx a few years ago and it was a...,1,False,ALynnJ42,Used to be good,nurx-birth-control-delivered,1213141301,I first used Nurx a few years ago and it was a...,0,1,1,1
1,19,2021-01-05 06:10:19,,"I am not one to usually write reviews, but my ...",1,False,cp_2015,Avoid at all costs,nurx-birth-control-delivered,1213141301,"I am not one to usually write reviews, but my ...",0,1,1,0
2,33,2020-05-29 20:59:19,"{'id': 11157202, 'body': 'We understand your r...",First the bad. I thought with this app/service...,2,True,Jen316,Good and bad,nurx-birth-control-delivered,1213141301,First the bad. I thought with this app/service...,0,0,1,0
3,38,2021-08-12 03:31:37,"{'id': 24534904, 'body': ""Hello Anna, thank yo...","If this worked well, I would love it. Unfortun...",1,False,anna.eliza,"Good idea, bad execution",nurx-birth-control-delivered,1213141301,"If this worked well, I would love it. Unfortun...",0,0,1,0
4,39,2020-12-29 21:47:51,"{'id': 20112386, 'body': ""We are sorry to hear...","I hesitate to write negative reviews, but this...",1,False,Sacatu,Waste of time and $$,nurx-birth-control-delivered,1213141301,"I hesitate to write negative reviews, but this...",0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
930,263,2024-02-01 20:20:43,"{'id': 41980499, 'body': 'Oh no! We sincerely ...","In miserable pain for a UTI, showing all sympt...",1,False,dessert enjoyer,Cancelled UTI prescription request,planned-parenthood-direct,1214393415,"In miserable pain for a UTI, showing all sympt...",0,0,1,0
931,268,2023-12-07 15:27:51,,The app is not working! don't waste your time,1,False,paulineczka1212,The app is not working,planned-parenthood-direct,1214393415,The app is not working! don't waste your time ...,0,0,0,0
932,278,2023-09-15 02:26:42,"{'id': 39037722, 'body': 'Hi there - we are so...",it’s been 6 days and i have yet to get my pack...,1,False,uraqtbaeee,"never received, nobody answered",planned-parenthood-direct,1214393415,it’s been 6 days and i have yet to get my pack...,0,0,0,0
933,279,2023-06-11 13:16:06,"{'id': 37257712, 'body': ""Oh no, that doesn't ...",dumb app,1,False,Destiny Amari Robinson,doesnt work,planned-parenthood-direct,1214393415,dumb app doesnt work,0,0,0,0
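
One detail worth keeping in mind with `pd.concat(..., axis=1)`: rows are matched by index label, not by position. In this notebook both frames have the default 0..n-1 index, so the rows line up. The sketch below uses made-up data to show what can happen if the unlabeled frame were filtered or reordered first, and how passing `index=` avoids it.

```python
import numpy as np
import pandas as pd

# Hypothetical unlabeled frame whose index is not 0..n-1 (e.g., after filtering)
reviews = pd.DataFrame({'review': ['a', 'b', 'c']}, index=[10, 11, 12])
preds = np.array([[0, 1], [1, 0], [0, 0]])

# Reusing the source index keeps each prediction on the row it belongs to
pred_df = pd.DataFrame(preds, columns=['label_a', 'label_b'], index=reviews.index)
combined = pd.concat([reviews, pred_df], axis=1)
print(combined)

# Without index=..., the index labels do not match and concat pads with NaN
misaligned = pd.concat(
    [reviews, pd.DataFrame(preds, columns=['label_a', 'label_b'])], axis=1
)
print(misaligned.shape)
```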


### Step 8: Check if each label has more than one value (1 & 0) in the prediction

In [66]:
unique_values = bc_unlabeled_data_with_predictions[['l1_inaccurate_cycle_prediction', 'l2_delayed_customer_service', 'l3_poor_prescription_management', 'l4_problematic_billing_practices']].nunique()

# Print the number of unique values for each label column
print(unique_values)

l1_inaccurate_cycle_prediction      1
l2_delayed_customer_service         2
l3_poor_prescription_management     2
l4_problematic_billing_practices    2
dtype: int64
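
A single unique value means the model predicted the same class for every unlabeled review on that label. Counting the predicted positives alongside `nunique()` makes this easier to see. The sketch below runs on invented predictions with placeholder column names.

```python
import pandas as pd

# Hypothetical predictions for two labels
toy_predictions = pd.DataFrame({
    'label_a': [0, 0, 0, 0, 0],
    'label_b': [1, 0, 1, 0, 0],
})

summary = pd.DataFrame({
    'n_unique': toy_predictions.nunique(),       # 1 means a constant prediction
    'n_positive': toy_predictions.sum(),         # number of reviews predicted as 1
    'share_positive': toy_predictions.mean(),    # fraction of reviews predicted as 1
})
print(summary)
```

Applied to the real predictions, a summary like this would confirm whether `l1_inaccurate_cycle_prediction` was never predicted as 1, which seems plausible given that only 7 of the 89 tagged reviews carry that label.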


## Collection 2 - Period-and-Fertility-Tracking Apps (x2)

- The same steps as above are repeated here in a single cell for the Collection 2 data.

In [98]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report

# Load the datasets
tagged_data_2 = pd.read_csv('combined_pt_tagged.csv')
unlabeled_data_2 = pd.read_csv('combined_pt_unlabeled.csv')

# Preprocessing the text data
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Fitting the vectorizer on the tagged data reviews and transforming the text
X_tagged_2 = vectorizer.fit_transform(tagged_data_2['review'])
X_unlabeled_2 = vectorizer.transform(unlabeled_data_2['review']).copy() 
# Making a copy to ensure the matrix is writable

# Extracting the labels
y_tagged_2 = tagged_data_2[['l1_inaccurate_cycle_prediction', 'l2_unfair_functionality_charges', 'l3_user_data_privacy_concerns', 'l4_if_related_to_the_overturn']]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tagged_2, y_tagged_2, test_size=0.2, random_state=42)

# Initialize the MultiOutputClassifier with a RandomForest
forest = RandomForestClassifier(n_estimators=100, random_state=42)
multi_target_forest = MultiOutputClassifier(forest, n_jobs=1)  # Set n_jobs to 1 to avoid parallel processing issues

# Train the model
multi_target_forest.fit(X_train, y_train)

# Predict on the test portion of the tagged data
y_pred = multi_target_forest.predict(X_test)
print(classification_report(y_test, y_pred))

# Predict labels for unlabeled data
predictions_2 = multi_target_forest.predict(X_unlabeled_2)  # Use the copy for prediction

# Create a DataFrame with predictions
predicted_labels_2 = pd.DataFrame(predictions_2, columns=['l1_inaccurate_cycle_prediction', 'l2_unfair_functionality_charges', 'l3_user_data_privacy_concerns', 'l4_if_related_to_the_overturn'])

# Combine predictions with the unlabeled reviews
pt_unlabeled_data_with_predictions = pd.concat([unlabeled_data_2, predicted_labels_2], axis=1)
pt_unlabeled_data_with_predictions

# Export to CSV
#pt_unlabeled_data_with_predictions.to_csv('pt_unlabeled_data_with_predictions.csv', index=False)




              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.76      0.91      0.83        45
           2       1.00      0.18      0.31        11
           3       1.00      0.33      0.50         3

   micro avg       0.77      0.73      0.75        60
   macro avg       0.69      0.36      0.41        60
weighted avg       0.80      0.73      0.70        60
 samples avg       0.58      0.56      0.57        60



Unnamed: 0.1,Unnamed: 0,date,developerResponse,review,rating,isEdited,userName,title,app_name,app_id,l1_inaccurate_cycle_prediction,l2_unfair_functionality_charges,l3_user_data_privacy_concerns,l4_if_related_to_the_overturn
0,7,2022-02-20 02:16:25,"{'id': 28190339, 'body': 'Hi kennaliz122,\n\nT...",I downloaded Flo when I was a sophomore in hig...,2,False,kennaliz122,Too much for too little,flo-period-pregnancy-tracker,1038369065,0,1,0,0
1,13,2020-10-02 23:29:22,"{'id': 18298178, 'body': 'Hi kathynicole,\n\nT...",I originally downloaded this app to track my p...,1,False,kathynicole,Extra features only available with premium,flo-period-pregnancy-tracker,1038369065,0,1,0,0
2,16,2021-04-12 09:16:17,"{'id': 22237904, 'body': 'Hi Jamieeeeeee21,\nT...",I only want notifications that tell me when to...,2,False,Jamieeeeeee21,Too many ads,flo-period-pregnancy-tracker,1038369065,0,1,0,0
3,22,2022-04-27 20:32:32,,The concept of the app is great actually! But ...,2,False,Giavanna Steiner,Developers- please read,flo-period-pregnancy-tracker,1038369065,0,1,0,0
4,23,2021-08-05 15:20:44,"{'id': 24399169, 'body': 'Hi there,\n\nWe unde...",I understand that Flo is run by A LOT of peopl...,2,False,3457811946,Disappointment as the app “grows”,flo-period-pregnancy-tracker,1038369065,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7432,4741,2022-09-07 14:09:11,"{'id': 31973009, 'body': 'Hey, thanks for reac...",Where did all my data go? I updated the app an...,1,False,GS/BZ,All my data disappeared,clue-period-tracker-calendar,657189652,0,0,0,0
7433,4742,2022-08-06 00:11:40,"{'id': 31642002, 'body': ""Again, we're sorry t...",Clue App Keeps Freezing and/Crashing,1,False,Afnoir,Clue App Keeps Freezing and/Crashing,clue-period-tracker-calendar,657189652,0,0,0,0
7434,4743,2022-06-25 00:21:51,"{'id': 30598266, 'body': 'Hey, thanks for reac...",DELETE YOUR DATA WITHIN THE APP BEFORE UNINSTA...,1,False,Krislynx,DELETE - THEY WILL SHARE YOUR DATA,clue-period-tracker-calendar,657189652,0,0,0,0
7435,4745,2022-05-21 16:46:22,"{'id': 29934599, 'body': 'Hey, thanks for reac...",Despite Clue’s privacy policy and their social...,1,False,eloise199,Data is stored on AWS cloud storage,clue-period-tracker-calendar,657189652,0,0,1,0


In [96]:
unique_values2 = pt_unlabeled_data_with_predictions[['l1_inaccurate_cycle_prediction', 'l2_unfair_functionality_charges', 'l3_user_data_privacy_concerns', 'l4_if_related_to_the_overturn']].nunique()

# Print the number of unique values for each label column
print(unique_values2)

l1_inaccurate_cycle_prediction     1
l2_unfair_functionality_charges    2
l3_user_data_privacy_concerns      2
l4_if_related_to_the_overturn      2
dtype: int64
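
Since both collections go through the same vectorize, fit and predict steps, the shared part could be wrapped in one function. The sketch below is a hypothetical helper (`fit_and_predict` is not part of the notebook), it skips the train/test evaluation step, and it is exercised on invented toy frames only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier


def fit_and_predict(tagged, unlabeled, label_cols, text_col='review'):
    """Hypothetical helper: vectorize text, fit one forest per label, predict."""
    vec = TfidfVectorizer(stop_words='english', max_features=5000)
    X_tagged = vec.fit_transform(tagged[text_col])   # learn vocabulary on tagged text
    X_unlabeled = vec.transform(unlabeled[text_col])  # reuse it for unlabeled text

    model = MultiOutputClassifier(
        RandomForestClassifier(n_estimators=100, random_state=42)
    )
    model.fit(X_tagged, tagged[label_cols])

    preds = pd.DataFrame(
        model.predict(X_unlabeled), columns=label_cols, index=unlabeled.index
    )
    return pd.concat([unlabeled, preds], axis=1)


# Toy usage with invented reviews and placeholder label names
toy_tagged = pd.DataFrame({
    'review': [
        'charged twice for premium',
        'predictions are wrong',
        'premium costs too much',
        'cycle dates were off',
    ],
    'billing': [1, 0, 1, 0],
    'accuracy': [0, 1, 0, 1],
})
toy_unlabeled = pd.DataFrame({'review': ['premium is expensive', 'wrong cycle dates']})

result = fit_and_predict(toy_tagged, toy_unlabeled, ['billing', 'accuracy'])
print(result)
```

With the real files, the call would take `tagged_data_2`, `unlabeled_data_2` and the four Collection 2 label names in place of the toy inputs.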
