# Classification for therapy app reviews

Sometimes it's too much work to read thousands and thousands of texts. Instead, you'll read a few of them and let a machine do the rest!

This was a [story for my class by Julia Ingram](https://juliaingram.github.io/therapy-apps/), which then turned into [a completely different story](https://twitter.com/juliaingram_/status/1552724931460403201), too! Give her a [follow on Twitter](https://twitter.com/juliaingram_) and a shout out for doing an amazing job.

## Reading in our data

We have a lot of reviews, and didn't read them all! Some of them are labeled as being about unfair charges, unresponsive therapists, or a bad therapist in general.

In [4]:
import pandas as pd
pd.options.display.max_colwidth = 400

df = pd.read_csv("partially-labeled.csv")
df.head(3)

Unnamed: 0,rating,review,app_name,unfair_charge,unresponsive_therapist,bad_therapist
0,1,"My last review got deleted. This app does not provide the service it appears to. Like the other reviews say its not a texting therapy app, you wait for hours and don't get responses. Talk space has a 1 star review on BBB which is pretty safe to say its a scam. Its a shame to profit off of people searching for help is such a dark time. Shame on you talk space. For the amount of money required t...",talkspace-therapy-counseling,1.0,1.0,0.0
1,1,"Just a blank white page after I enter my credentials and log in. Tried the typical troubleshooting routine and made sure iOS and the app are updated. I emailed customer service and messaged my “care team” via the desktop website, and they have either not responded after a number of days, or the person was seemingly even less qualified to answer my own question than I am. \n\nThis service has a...",cerebral-mental-health,0.0,0.0,0.0
2,2,You pay a monthly fee up front and it guarantees you a certain number of contacts. I’ve yet to receive any services. My provider missed her reply window 3 times in a row so I’m paying for a essentially a subscription with no benefit. I’m sure they are overwhelmed with clients but they need to stop enrolling people and taking their money if they can’t provide what’s promised. I’m only giving 2 ...,talkspace-therapy-counseling,1.0,1.0,0.0


## Split into labeled and unlabeled

We'll need to **train our model** on the labeled reviews, and then make predictions for the **unlabeled reviews**. To do that we'll filter based on whether the `unfair_charge` column is missing or not.

In [5]:
labeled_data = df[df.unfair_charge.notna()]
labeled_data.head(3)

Unnamed: 0,rating,review,app_name,unfair_charge,unresponsive_therapist,bad_therapist
0,1,"My last review got deleted. This app does not provide the service it appears to. Like the other reviews say its not a texting therapy app, you wait for hours and don't get responses. Talk space has a 1 star review on BBB which is pretty safe to say its a scam. Its a shame to profit off of people searching for help is such a dark time. Shame on you talk space. For the amount of money required t...",talkspace-therapy-counseling,1.0,1.0,0.0
1,1,"Just a blank white page after I enter my credentials and log in. Tried the typical troubleshooting routine and made sure iOS and the app are updated. I emailed customer service and messaged my “care team” via the desktop website, and they have either not responded after a number of days, or the person was seemingly even less qualified to answer my own question than I am. \n\nThis service has a...",cerebral-mental-health,0.0,0.0,0.0
2,2,You pay a monthly fee up front and it guarantees you a certain number of contacts. I’ve yet to receive any services. My provider missed her reply window 3 times in a row so I’m paying for a essentially a subscription with no benefit. I’m sure they are overwhelmed with clients but they need to stop enrolling people and taking their money if they can’t provide what’s promised. I’m only giving 2 ...,talkspace-therapy-counseling,1.0,1.0,0.0


In [6]:
unlabeled_data = df[df.unfair_charge.isna()]
unlabeled_data.head(3)

Unnamed: 0,rating,review,app_name,unfair_charge,unresponsive_therapist,bad_therapist
180,1,"Give me all of my money back. I don’t write app reviews, but there is no way to get any type of response from customer service. I was a brand new paying subscriber that needed assistance with a billing/technical issue. I’ve reached out via email multiple times for a week with no response, which is valuable time and money going down the drain as I wait on customer service to respond to what sho...",talkspace-therapy-counseling,,,
181,2,"I was initially matched with a therapist that specialized in a subject that I had no issues with nor did I specify I had issues with. I had to accept in order to try to change therapists, only one of which was female (which is what I wanted) so I had to accept that.. but I felt like I wished I had more choices to pick from. My therapist was nice. Very generic though, and didn’t know how to use...",talkspace-therapy-counseling,,,
182,1,"I paid hundreds of dollars just for my psychiatrist to meet with me once and then to never respond to me again, despite my many attempts to contact her. She prescribed me new medication and then never once followed up on it, which was incredibly unprofessional and unsafe. I then emailed TalkSpace support, who informed me that I couldn’t get a refund, but I could get rematched and spend more mo...",talkspace-therapy-counseling,,,


In [7]:
print("Labeled is", len(labeled_data))
print("Unlabeled is", len(unlabeled_data))

Labeled is 180
Unlabeled is 2507


## Building our model

First we need to **vectorize** our text, which means converting it from "normal" text into numbers a computer can understand. This is a fancy version of vectorizing, because we're also **stemming**. Stemming combines words like run/running/runs into `run`.

In [9]:
# Uncomment if you need to install these
# !pip install pystemmer
# !pip install scikit-learn

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

stemmer = Stemmer.Stemmer('en')

# A TfidfVectorizer that stems every word after the usual tokenizing
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

# Keep the 500 most frequent terms, and ignore terms found in more than 30% of reviews
vectorizer = StemmedTfidfVectorizer(max_features=500, max_df=0.30)
matrix = vectorizer.fit_transform(labeled_data.review)

In [12]:
# What are the words it learned?
vectorizer.get_feature_names_out()

array(['10', '200', '24', '30', '85', 'abl', 'about', 'absolut', 'access',
       'account', 'actual', 'ad', 'add', 'addit', 'address', 'adhd',
       'advertis', 'afford', 'after', 'again', 'agent', 'ahead', 'all',
       'allow', 'almost', 'alreadi', 'also', 'am', 'amount', 'an', 'ani',
       'annoy', 'anoth', 'answer', 'anxieti', 'anyon', 'anyth', 'anywher',
       'appoint', 'appt', 'are', 'around', 'as', 'ask', 'at', 'attempt',
       'autom', 'avail', 'aw', 'away', 'back', 'bad', 'bank', 'base',
       'basic', 'becaus', 'been', 'befor', 'best', 'better', 'betterhelp',
       'big', 'bill', 'bipolar', 'book', 'busi', 'by', 'call', 'can',
       'cancel', 'cannot', 'card', 'care', 'caus', 'cerebr', 'chang',
       'charg', 'chat', 'cheaper', 'check', 'choic', 'choos', 'clear',
       'client', 'come', 'communic', 'compani', 'complet', 'concern',
       'confirm', 'consist', 'constant', 'contact', 'convers', 'coordin',
       'cost', 'could', 'couldn', 'counsel', 'counselor', 'cou

In [13]:
# If we want to look at this in another way...
pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,10,200,24,30,85,abl,about,absolut,access,account,...,wors,worth,would,wouldn,write,wrong,year,yet,your,yourself
0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0
1,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.185492,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0
2,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.192021,0.000000,0.0
3,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.172191,0.0
4,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.126286,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.078105,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
175,0.0,0.0,0.0,0.000000,0.0,0.0,0.109859,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.090341,0.0
176,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.179293,0.0
177,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0
178,0.0,0.0,0.0,0.095882,0.0,0.0,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.107784,0.0,0.0,0.0,0.000000,0.000000,0.0
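
Where do numbers like these come from? Here's a rough sketch in plain Python. It assumes scikit-learn's default TF-IDF settings (smoothed IDF, then scaling each row to length 1), skips the stemming and `max_df` steps, and uses three made-up reviews instead of the real ones, so treat it as an illustration rather than a copy of what the vectorizer does internally.

```python
import math
from collections import Counter

# Three made-up reviews (not from the dataset), already lowercased
docs = [
    "refund refused charged again",
    "therapist never responded",
    "charged twice no refund",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for words in tokenized for word in words})
n_docs = len(tokenized)

# Document frequency: how many reviews contain each word
doc_freq = Counter(word for words in tokenized for word in set(words))

def tfidf_row(words):
    counts = Counter(words)
    # Word count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = [counts[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1) for w in vocab]
    # Scale the row so its length is 1 (L2 normalization)
    length = math.sqrt(sum(v * v for v in raw))
    return [v / length for v in raw]

rows = [tfidf_row(words) for words in tokenized]

# Scores for the first review, one per vocabulary word
for word, score in zip(vocab, rows[0]):
    print(f"{word:10s} {score:.3f}")
```

Words that show up in fewer reviews get a higher score, and words that aren't in a review get a zero, which is why the real table is mostly zeros.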


## Training the model

There are all kinds of models we could use to train this! To pick one at semi-random, we'll use a LinearSVC. If we were doing this for real, we might try a few different models and see which one performs the best.

In [14]:
from sklearn.svm import LinearSVC

# Unfair charge classifier
X = matrix
y = labeled_data.unfair_charge

clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)
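
If we did want to try a few different models, one possible way is cross-validation. This is only a sketch: the data is synthetic (from `make_classification`, sized to match our 180 labeled reviews), and the three candidate models and the F1 scoring are arbitrary picks, not what the original story used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in data; with the real reviews you'd use matrix and labeled_data.unfair_charge
X_toy, y_toy = make_classification(n_samples=180, n_features=20, random_state=0)

# Hypothetical shortlist of models to compare
candidates = {
    "LinearSVC": LinearSVC(class_weight="balanced"),
    "LogisticRegression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "RandomForest": RandomForestClassifier(class_weight="balanced", random_state=0),
}

mean_f1 = {}
for name, model in candidates.items():
    # Train on 4/5 of the data and score on the remaining 1/5, five times over
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="f1")
    mean_f1[name] = scores.mean()
    print(f"{name}: {mean_f1[name]:.2f}")
```

With only 180 labeled reviews the scores will bounce around a lot, so small differences between models probably don't mean much.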

## What words are important?

As a journalist, you want to know *why* the model is making a decision. This is called **explainability** or **interpretability**. The [eli5](https://eli5.readthedocs.io/en/latest/overview.html) Python package does a good job of showing which words the model weighs most heavily.

In [15]:
#!pip install eli5

In [16]:
import eli5

eli5.explain_weights(clf, feature_names=vectorizer.get_feature_names_out())

Weight?,Feature
+2.680,refund
+1.340,charg
+1.233,bill
+1.124,compani
+1.038,refus
+0.928,reach
+0.878,up
+0.856,cancel
+0.811,email
+0.806,been


## Making our predictions

Let's take our **unlabeled data** and predict whether they're about unfair charges or not.

In [20]:
X = vectorizer.transform(unlabeled_data.review)

# Work on a real copy so pandas doesn't raise a SettingWithCopyWarning
unlabeled_data = unlabeled_data.copy()
unlabeled_data['predicted'] = clf.predict(X)
unlabeled_data['predicted_proba'] = clf.decision_function(X)


We added two columns: `predicted` and `predicted_proba`.
    
* `predicted` just says, **YES!!!** it's about unfair charges, or **NO!!!** it's not
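* `predicted_proba` is the model's decision score, not a true probability: positive means "unfair charge," negative means "not," and the further it is from zero, the more certain the model is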

Depending on what we're doing with the output we might use one or the other.
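
To make the link between the two columns concrete: `predicted_proba` is filled from `clf.decision_function`, and for a two-class linear model like ours, `predict` is just checking whether that score is above zero. The scores and the stricter cutoff below are made up for illustration.

```python
# Made-up decision scores, in the same range as the notebook's output
scores = [1.94, 0.35, 0.02, -0.40, -1.58]

# What predict() does for a two-class linear model: a positive score means class 1
default_labels = [1.0 if s > 0 else 0.0 for s in scores]

# A hypothetical stricter cutoff: only flag reviews the model is more sure about
STRICT_CUTOFF = 0.3
strict_labels = [1.0 if s > STRICT_CUTOFF else 0.0 for s in scores]

print(default_labels)  # [1.0, 1.0, 1.0, 0.0, 0.0]
print(strict_labels)   # [1.0, 1.0, 0.0, 0.0, 0.0]
```

So `predicted` is handy for counting, while the score lets us rank reviews or pick our own cutoff.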

In [21]:
unlabeled_data.sort_values(by='predicted_proba', ascending=False)

Unnamed: 0,rating,review,app_name,unfair_charge,unresponsive_therapist,bad_therapist,predicted,predicted_proba
1879,1,"My fiancé has been contacting Cerebral for several days to cancel & get a refund. We were charged $115 and have not had even one session. No response. Called the only number we could find, only to get an automated message stating that if it is in regards to refunds that we must send an email. Second email was sent, no response. Considering this is a mental health company it would be superb to ...",cerebral-mental-health,,,,1.0,1.940808
2256,1,"DO NOT TRUST THIS COMPANY.\n\nI canceled on February 9th. I filled out all required paperwork. After signing up AND PAYING I was told I was ineligible due to family history with heart conditions and the pop up box said I would receive a full refund, which has STILL not happened. Not only have I not received a refund, but they have not discontinued my account at all.\n\nThere is absolutely NO c...",cerebral-mental-health,,,,1.0,1.669842
364,1,Sketchy and deceiving company. \n\nEnded up not using any of the services so I can not speak on the counseling services. I paid the original fee which was fine but then they charged my account $400 the following month after I didn’t use any of the original hours and time I paid for. I was informed the was a subscription. \n\nThe site made it like look like it was a one time payment at a discou...,talkspace-therapy-counseling,,,,1.0,1.611305
2107,1,My wife installed the app and was immediately bombarded to upgrade from the website. It’s very difficult to cancel and needed to email support to cancel. Then support told us the company does not offer refunds. \n\nThere’s very limited appointments from the app . I went to my credit card company to dispute the charges to get our refund. Do not install! \n\nShame on Simone Biles and other celeb...,cerebral-mental-health,,,,1.0,1.579943
2418,1,Theives. They refunded my original money paid and then charged me $325 3 days after telling my my cancellation has been “scheduled”.\n\nDon’t lose your money too. Don’t trust them. This app is a scam.,cerebral-mental-health,,,,1.0,1.577669
...,...,...,...,...,...,...,...,...
826,2,"It needs improvement. I tried to get matched with a therapist but if you don’t match right when you download it, you won’t be matched. I thought it would be better than better help because it seemed to have more professional concepts but I was wrong.",talkspace-therapy-counseling,,,,0.0,-1.497476
1272,1,I reached out for therapy for the first time ever. My matching agent asked what was going on and if I would be interested in how it works. I said yes and then I got what seemed like a copy and paste message about what it will cost. I looked over the plans and I told her I understand it is lower cost than “traditional” therapy but I simply cannot afford on my income. She sent back a message “I ...,talkspace-therapy-counseling,,,,0.0,-1.498484
2520,1,Asked a bunch of personal questions just to tel you that you should buy an insurance plan. What a cheap and lousy way to sell insurance for blue cross blue shield,cerebral-mental-health,,,,0.0,-1.577024
396,1,"Just because I’m 10, that’s stupid. I’ve been looking for therapy on the App Store because my parents won’t take me to real therapy and this is what I get. It’s dumb, my friends abuse me and I just want help, but I can’t get it no matter how hard I try, it’s stupid",talkspace-therapy-counseling,,,,0.0,-1.581016


For example, we might want to save the 1,000 reviews most likely to be about unfair charges so we can investigate them further.

In [None]:
unlabeled_data.sort_values(by='predicted_proba', ascending=False).head(1000).to_csv("research.csv")

Alternatively, we might want to quickly see which app is most likely to have reviews talking about unfair charges.

In [36]:
unlabeled_data.pivot_table(index='app_name', values='predicted', aggfunc='mean').round(2)

Unnamed: 0_level_0,predicted
app_name,Unnamed: 1_level_1
betterhelp-therapy,0.19
cerebral-mental-health,0.51
talkspace-therapy-counseling,0.3
woebot-your-self-care-expert,0.03
wysa-mental-health-support,0.12
youper-online-therapy,0.05


## Testing the model

How do we know if it did a good job, though? We can look at a **confusion matrix** if we *really* want to.

We'll do the following steps:

1. Split into two groups of reviews: We'll let it study 75% of the reviews and save 25% to quiz it on (the default for `train_test_split`)
2. Train it on the "training set"
3. Test it on the "test set"
4. How did it do???

In [39]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X = vectorizer.fit_transform(labeled_data.review)
y = labeled_data.unfair_charge

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Train
clf = LinearSVC(class_weight='balanced')
clf.fit(X_train, y_train)

# Test
y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

# How did it do?
label_names = pd.Series(['not unfair charge', 'unfair charge'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not unfair charge,Predicted unfair charge
Is not unfair charge,26,6
Is unfair charge,5,8


How do we feel about the results?
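
One way to put numbers on that feeling is to turn the four cells of the confusion matrix into accuracy, precision, and recall. This sketch just copies the counts from the table above; the variable names are ours.

```python
# Counts copied from the confusion matrix above
true_neg, false_pos = 26, 6   # row "Is not unfair charge"
false_neg, true_pos = 5, 8    # row "Is unfair charge"

total = true_neg + false_pos + false_neg + true_pos

# Share of all test reviews the model got right
accuracy = (true_pos + true_neg) / total
# Of the reviews it flagged as unfair charges, how many really were?
precision = true_pos / (true_pos + false_pos)
# Of the real unfair-charge reviews, how many did it catch?
recall = true_pos / (true_pos + false_neg)

print(f"Tested on {total} reviews")
print(f"Accuracy:  {accuracy:.2f}")   # 0.76
print(f"Precision: {precision:.2f}")  # 0.57
print(f"Recall:    {recall:.2f}")     # 0.62
```

Because `train_test_split` shuffles randomly, rerunning the notebook will give a somewhat different matrix and somewhat different numbers.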

## What comes out of it?

A [story for class](https://juliaingram.github.io/therapy-apps/), but then a [completely unrelated story](https://twitter.com/juliaingram_/status/1552724931460403201), too!