# Instructions

Use the "prodigyEnv" conda environment for this notebook.

To set up the Prodigy environment, download the wheel file linked in the Prodigy email (sent after you purchase a license).

Then run `pip install ./prodigy*.whl`.

Instructions: https://prodi.gy/docs/install

Database is stored at /
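Before running the cells below, it can help to confirm that the active kernel can see the package. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and the check only reports whether the module can be found, not whether the license or database is configured.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None

# 'prodigy' is the import name used later in this notebook
print('prodigy importable:', is_installed('prodigy'))
```

If this prints `False`, the notebook kernel is probably not running inside the conda environment where the wheel was installed.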

<br><br>

# Imports

In [55]:
from collections import defaultdict
import random
import re

import pandas as pd
from prodigy.components.db import connect

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='ticks', font_scale=1.2)

In [56]:
def sort_by_mean(df, by, column, rot=0):
    # use dict comprehension to create new dataframe from the iterable groupby object
    # each group name becomes a column in the new dataframe
    df2 = pd.DataFrame({col:vals[column] for col, vals in df.groupby(by)})
    # compute the mean of each group and sort the groups by it
    means = df2.mean().sort_values()
    # to plot instead, use the columns ordered by mean value and
    # return axes so changes can be made outside the function
#     return df2[means.index].boxplot(rot=rot, return_type="axes")
    return means
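`sort_by_mean` returns the per-group mean of `column`, sorted in ascending order. As a sketch, the same ranking should also be obtainable with a direct `groupby`; the toy data and the column names `Method` and `Score` below are hypothetical and only illustrate the shape of the result.

```python
import pandas as pd

# toy data; the column names 'Method' and 'Score' are hypothetical
toy_df = pd.DataFrame({
    'Method': ['pill', 'pill', 'iud', 'iud', 'implant'],
    'Score':  [1.0, 3.0, 5.0, 7.0, 4.0],
})

# mean of 'Score' within each 'Method', sorted ascending
group_means = toy_df.groupby('Method')['Score'].mean().sort_values()
print(group_means)
```

The result is a Series indexed by group name, so `group_means.index` gives the groups ordered from lowest to highest mean.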

<br><br>

# Connect to database

In [57]:
db = connect()

db.datasets # This will list all of your Prodigy datasets

['bc-reddit-posts',
 'bc-reddit-comments',
 'bc-twitter-posts',
 'bc-twitter-replies',
 'discourse-reddit-posts',
 'discourse-reddit-comments',
 'discourse-webmd-reviews']

In [45]:
# db.drop_dataset('discourse-reddit-posts')  # Only do this if you want to delete all your annotations!

<br><br>

# Explore REDDIT posts

In [31]:
examples = db.get_dataset('discourse-reddit-posts')

print(len(examples))

146


In [32]:
label_count_dict = defaultdict(int)
method_label_count_dict = defaultdict(lambda: defaultdict(int))
label_texts_dict = defaultdict(list)
for e in examples:
    for _label in e['accept']:
        label_count_dict[_label] += 1
        method_label_count_dict[e['meta']['Method']][_label] += 1
        label_texts_dict[_label].append(e['text'])
    if len(e['accept']) < 1:
        label_count_dict['NONE'] += 1
        label_texts_dict['NONE'].append(e['text'])

print('------------------------------------------------------')
print('total number of posts labeled')
print('------------------------------------------------------')
print()
for _label, _count in sorted(label_count_dict.items(), key=lambda x: x[1], reverse=True):
    print(_count, '\t', _label)

------------------------------------------------------
total number of posts labeled
------------------------------------------------------

46 	 SHARING EXPERIENCES
27 	 NONE
25 	 SHARING DECISION-MAKING PROCESSES
21 	 SEEKING INFORMATION
14 	 SHARING NEGATIVE EMOTIONS
10 	 SHARING FUTURE PLANS
9 	 SEEKING EXPERIENCES
7 	 SHARING PERSONAL BACKGROUND
6 	 SHARING OPINIONS AND PREFERENCES
5 	 SEEKING ADVICE
3 	 SEEKING NORMALITY
2 	 PROVIDING INFORMATION
2 	 SHARING SECONDHAND EXPERIENCES
2 	 SHARING POSITIVE EMOTIONS
1 	 PROVIDING ADVICE
1 	 META DISCUSSION
1 	 SEEKING EMOTIONAL SUPPORT
1 	 PROVIDING NORMALITY
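The same tallying loop is repeated for every dataset in this notebook. One way to avoid the repetition is to wrap it in a helper, sketched below. The function name `tally_labels` and the toy examples are hypothetical; the keys `accept`, `text`, and `meta['Method']` are the ones the cells above already read.

```python
from collections import Counter, defaultdict

def tally_labels(examples, method_key='Method'):
    """Count labels overall and per method, and collect the texts for each label."""
    label_counts = Counter()
    method_label_counts = defaultdict(Counter)
    label_texts = defaultdict(list)
    for e in examples:
        # an example with no accepted labels is counted once under 'NONE'
        labels = e['accept'] or ['NONE']
        for label in labels:
            label_counts[label] += 1
            label_texts[label].append(e['text'])
            # as in the cells above, 'NONE' is not counted per method
            if label != 'NONE':
                method_label_counts[e['meta'][method_key]][label] += 1
    return label_counts, method_label_counts, label_texts

# toy examples (hypothetical) in the shape the notebook reads from Prodigy
toy_examples = [
    {'text': 'a', 'accept': ['SHARING EXPERIENCES', 'SEEKING ADVICE'], 'meta': {'Method': 'pill'}},
    {'text': 'b', 'accept': ['SHARING EXPERIENCES'], 'meta': {'Method': 'iud'}},
    {'text': 'c', 'accept': [], 'meta': {'Method': 'iud'}},
]

counts, by_method, texts = tally_labels(toy_examples)
for label, count in counts.most_common():
    print(count, '\t', label)
```

`Counter.most_common()` replaces the explicit `sorted(..., reverse=True)` call when printing.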


In [33]:
for _method, _label_count_dict in method_label_count_dict.items():
    print('--------------------------------')
    print(_method)
    print('--------------------------------')
    for _label, _count in sorted(_label_count_dict.items(), key=lambda x: x[1], reverse=True):
        print(_count, '\t', _label)
    print()

--------------------------------
pill
--------------------------------
18 	 SHARING EXPERIENCES
7 	 SHARING DECISION-MAKING PROCESSES
7 	 SEEKING INFORMATION
4 	 SHARING PERSONAL BACKGROUND
2 	 SHARING FUTURE PLANS
2 	 SHARING NEGATIVE EMOTIONS
2 	 SEEKING ADVICE
1 	 PROVIDING ADVICE
1 	 PROVIDING INFORMATION
1 	 SEEKING EXPERIENCES
1 	 SEEKING EMOTIONAL SUPPORT
1 	 SEEKING NORMALITY
1 	 PROVIDING NORMALITY
1 	 SHARING OPINIONS AND PREFERENCES

--------------------------------
iud
--------------------------------
14 	 SHARING EXPERIENCES
11 	 SHARING DECISION-MAKING PROCESSES
7 	 SEEKING EXPERIENCES
7 	 SEEKING INFORMATION
6 	 SHARING FUTURE PLANS
4 	 SHARING NEGATIVE EMOTIONS
3 	 SHARING PERSONAL BACKGROUND
2 	 SHARING SECONDHAND EXPERIENCES
2 	 SEEKING ADVICE
2 	 SHARING OPINIONS AND PREFERENCES
1 	 META DISCUSSION
1 	 PROVIDING INFORMATION
1 	 SHARING POSITIVE EMOTIONS

--------------------------------
implant
--------------------------------
14 	 SHARING EXPERIENCES
8 	 SHARING NEG

In [34]:
label_percent_dict = {_label: _count/float(len(examples)) for _label, _count in label_count_dict.items()}

print('------------------------------')
print('percent of posts with label')
print('------------------------------')
print()
for _label, _percent in sorted(label_percent_dict.items(), key=lambda x: x[1], reverse=True):
    print(str(round(_percent*100, 1)) + '%', '\t', label_count_dict[_label], '\t', _label)

------------------------------
percent of posts with label
------------------------------

31.5% 	 SHARING EXPERIENCES
18.5% 	 NONE
17.1% 	 SHARING DECISION-MAKING PROCESSES
14.4% 	 SEEKING INFORMATION
9.6% 	 SHARING NEGATIVE EMOTIONS
6.8% 	 SHARING FUTURE PLANS
6.2% 	 SEEKING EXPERIENCES
4.8% 	 SHARING PERSONAL BACKGROUND
4.1% 	 SHARING OPINIONS AND PREFERENCES
3.4% 	 SEEKING ADVICE
2.1% 	 SEEKING NORMALITY
1.4% 	 PROVIDING INFORMATION
1.4% 	 SHARING SECONDHAND EXPERIENCES
1.4% 	 SHARING POSITIVE EMOTIONS
0.7% 	 PROVIDING ADVICE
0.7% 	 META DISCUSSION
0.7% 	 SEEKING EMOTIONAL SUPPORT
0.7% 	 PROVIDING NORMALITY
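Each percentage above divides a label's count by the number of examples, not by the total number of labels, so the column can sum to more than 100% when an example carries several labels. A small sketch of that arithmetic, using two counts from the output above (46 and 27 out of 146 examples); the helper name `label_percentages` is hypothetical.

```python
def label_percentages(label_counts, n_examples):
    """Percent of examples carrying each label, rounded to one decimal place."""
    return {label: round(100 * count / n_examples, 1)
            for label, count in label_counts.items()}

# counts taken from the Reddit posts output above
counts = {'SHARING EXPERIENCES': 46, 'NONE': 27}
percents = label_percentages(counts, 146)
print(percents)
```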


In [37]:
for _label, _texts in label_texts_dict.items():
    if _label == 'SHARING DECISION-MAKING PROCESSES':
        print('------------------------------------------')
        print(_label)
        print('------------------------------------------')
        print()
        for e in _texts:
            print(' '.join(e.split()))

------------------------------------------
SHARING DECISION-MAKING PROCESSES
------------------------------------------

Since I've been so stressed out, I decided that since I had forgot a few pills, I would go off the pill for a month or two and let my body rest during the Christmas university break.
It lasted for a few weeks but I figured it was because of the change .
It is honestly the best choice for me.
I thought I would be fine since I was on the depo so long beforehand, and didn't even realize this could possibly be a symptom of the implant.
This lasted for MONTHS, so I read somewhere online that vitamin e and zinc help with this, and it did stop the bleeding for a couple weeks, but I just started spotting again today.
I'm getting my IUD placed during my cycle, and debating whether I should wear my cup, a flex disc, or a pad.
But I have been reluctant to go back on the meds because I am not ok with the side effects.
I'm now doing a 5 month Accutane course as suggested by my de

<br><br>

# Explore REDDIT comments

In [48]:
examples = db.get_dataset('discourse-reddit-comments')

print(len(examples))

100


In [49]:
label_count_dict = defaultdict(int)
method_label_count_dict = defaultdict(lambda: defaultdict(int))
label_texts_dict = defaultdict(list)
for e in examples:
    for _label in e['accept']:
        label_count_dict[_label] += 1
        method_label_count_dict[e['meta']['Method']][_label] += 1
        label_texts_dict[_label].append(e['text'])
    if len(e['accept']) < 1:
        label_count_dict['NONE'] += 1
        label_texts_dict['NONE'].append(e['text'])

print('------------------------------------------------------')
print('total number of posts labeled')
print('------------------------------------------------------')
print()
for _label, _count in sorted(label_count_dict.items(), key=lambda x: x[1], reverse=True):
    print(_count, '\t', _label)

------------------------------------------------------
total number of posts labeled
------------------------------------------------------

23 	 SHARING EXPERIENCES
20 	 SHARING INFORMATION
17 	 NONE
9 	 SHARING ADVICE
7 	 SHARING OPINIONS AND PREFERENCES
4 	 SHARING CAUSAL REASONING / HYPOTHESIZING
4 	 SEEKING EXPERIENCES
4 	 META DISCUSSION
3 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
3 	 SHARING EMOTIONAL SUPPORT
3 	 SHARING FUTURE PLANS
3 	 SHARING SECONDHAND EXPERIENCES
3 	 SHARING NORMALITY
2 	 SEEKING INFORMATION
1 	 SHARING POSITIVE EMOTIONS
1 	 SHARING NEGATIVE EMOTIONS
1 	 SHARING PERSONAL BACKGROUND


In [50]:
for _method, _label_count_dict in method_label_count_dict.items():
    print('--------------------------------')
    print(_method)
    print('--------------------------------')
    for _label, _count in sorted(_label_count_dict.items(), key=lambda x: x[1], reverse=True):
        print(_count, '\t', _label)
    print()

--------------------------------
implant
--------------------------------
7 	 SHARING INFORMATION
6 	 SHARING EXPERIENCES
3 	 SHARING CAUSAL REASONING / HYPOTHESIZING
2 	 SEEKING EXPERIENCES
2 	 META DISCUSSION
2 	 SHARING SECONDHAND EXPERIENCES
1 	 SHARING OPINIONS AND PREFERENCES
1 	 SHARING ADVICE
1 	 SHARING FUTURE PLANS
1 	 SEEKING INFORMATION
1 	 SHARING NORMALITY
1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH

--------------------------------
iud
--------------------------------
8 	 SHARING INFORMATION
8 	 SHARING EXPERIENCES
5 	 SHARING OPINIONS AND PREFERENCES
2 	 SEEKING EXPERIENCES
2 	 SHARING FUTURE PLANS
2 	 META DISCUSSION
1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
1 	 SHARING ADVICE
1 	 SHARING SECONDHAND EXPERIENCES
1 	 SHARING EMOTIONAL SUPPORT
1 	 SHARING PERSONAL BACKGROUND
1 	 SHARING NORMALITY

--------------------------------
pill
--------------------------------
9 	 SHARING EXPERIENCES
7 	 SHARING ADVICE
5 	 SHARING INFORMATION
2 	 SHARING EMOTIONAL SUPPORT
1 	 SHARI

In [51]:
label_percent_dict = {_label: _count/float(len(examples)) for _label, _count in label_count_dict.items()}

print('------------------------------')
print('percent of posts with label')
print('------------------------------')
print()
for _label, _percent in sorted(label_percent_dict.items(), key=lambda x: x[1], reverse=True):
    print(str(round(_percent*100, 1)) + '%', '\t', label_count_dict[_label], '\t', _label)

------------------------------
percent of posts with label
------------------------------

23.0% 	 SHARING EXPERIENCES
20.0% 	 SHARING INFORMATION
17.0% 	 NONE
9.0% 	 SHARING ADVICE
7.0% 	 SHARING OPINIONS AND PREFERENCES
4.0% 	 SHARING CAUSAL REASONING / HYPOTHESIZING
4.0% 	 SEEKING EXPERIENCES
4.0% 	 META DISCUSSION
3.0% 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
3.0% 	 SHARING EMOTIONAL SUPPORT
3.0% 	 SHARING FUTURE PLANS
3.0% 	 SHARING SECONDHAND EXPERIENCES
3.0% 	 SHARING NORMALITY
2.0% 	 SEEKING INFORMATION
1.0% 	 SHARING POSITIVE EMOTIONS
1.0% 	 SHARING NEGATIVE EMOTIONS
1.0% 	 SHARING PERSONAL BACKGROUND


In [54]:
for _label, _texts in label_texts_dict.items():
    if _label == 'SHARING CAUSAL REASONING / HYPOTHESIZING':
        print('------------------------------------------')
        print(_label)
        print('------------------------------------------')
        print()
        for e in _texts:
            print(' '.join(e.split()))

------------------------------------------
SHARING CAUSAL REASONING / HYPOTHESIZING
------------------------------------------

I expect they can, if you don't have insurance coverage and are paying for it yourself I don't know why they would care where you currently reside.
That just sounds like ovulation to me...
Eek, no! Sounds like it wasn’t placed correctly.
I don't really know what I'm talking about, but it could be possible that she had implanon, and got a nexplanon put in instead?


<br><br>

# Explore TWITTER posts

In [381]:
examples = db.get_dataset('discourse-twitter-posts')

print(len(examples))

200


In [382]:
label_count_dict = defaultdict(int)
method_label_count_dict = defaultdict(lambda: defaultdict(int))
label_texts_dict = defaultdict(list)
for e in examples:
    for _label in e['accept']:
        label_count_dict[_label] += 1
        method_label_count_dict[e['meta']['Method']][_label] += 1
        label_texts_dict[_label].append(e['text'])
    if len(e['accept']) < 1:
        label_count_dict['NONE'] += 1
        label_texts_dict['NONE'].append(e['text'])

print('------------------------------------------------------')
print('total number of posts labeled')
print('------------------------------------------------------')
print()
for _label, _count in sorted(label_count_dict.items(), key=lambda x: x[1], reverse=True):
    print(_count, '\t', _label)

------------------------------------------------------
total number of posts labeled
------------------------------------------------------

38 	 providing information (educational)
38 	 narrating personal experiences
33 	 NONE
29 	 other discourse
13 	 humor
11 	 seeking information (educational)
11 	 negative self-disclosure
9 	 seeking experiences
8 	 politics
7 	 providing other experiences
5 	 positive self-disclosure
5 	 weighing options
4 	 providing information (advice)
1 	 providing personal experiences
1 	 seeking information (advice)


In [383]:
for _method, _label_count_dict in method_label_count_dict.items():
    print('--------------------------------')
    print(_method)
    print('--------------------------------')
    for _label, _count in sorted(_label_count_dict.items(), key=lambda x: x[1], reverse=True):
        print(_count, '\t', _label)
    print()

--------------------------------
pill
--------------------------------
21 	 providing information (educational)
14 	 other discourse
9 	 humor
3 	 seeking information (educational)
3 	 narrating personal experiences
1 	 politics
1 	 seeking experiences
1 	 providing other experiences
1 	 providing information (advice)

--------------------------------
implant
--------------------------------
25 	 narrating personal experiences
9 	 providing information (educational)
8 	 negative self-disclosure
7 	 seeking experiences
6 	 providing other experiences
5 	 seeking information (educational)
2 	 positive self-disclosure
2 	 politics
2 	 other discourse
1 	 providing information (advice)
1 	 humor
1 	 seeking information (advice)
1 	 weighing options

--------------------------------
iud
--------------------------------
13 	 other discourse
10 	 narrating personal experiences
8 	 providing information (educational)
5 	 politics
4 	 weighing options
3 	 seeking information (educational)
3 	 n

In [384]:
label_percent_dict = {_label: _count/float(len(examples)) for _label, _count in label_count_dict.items()}

print('------------------------------')
print('percent of posts with label')
print('------------------------------')
print()
for _label, _percent in sorted(label_percent_dict.items(), key=lambda x: x[1], reverse=True):
    print(str(round(_percent*100, 1)) + '%', '\t', label_count_dict[_label], '\t', _label)

------------------------------
percent of posts with label
------------------------------

19.0% 	 providing information (educational)
19.0% 	 narrating personal experiences
16.5% 	 NONE
14.5% 	 other discourse
6.5% 	 humor
5.5% 	 seeking information (educational)
5.5% 	 negative self-disclosure
4.5% 	 seeking experiences
4.0% 	 politics
3.5% 	 providing other experiences
2.5% 	 positive self-disclosure
2.5% 	 weighing options
2.0% 	 providing information (advice)
0.5% 	 providing personal experiences
0.5% 	 seeking information (advice)


In [385]:
for _label, _texts in label_texts_dict.items():
    if _label == 'providing other experiences':
        print('------------------------------------------')
        print(_label)
        print('------------------------------------------')
        print()
        for e in _texts:
            print(' '.join(e.split()))

------------------------------------------
providing other experiences
------------------------------------------

Mom solves daughter’s mystery illness:
Woman forced young girl to get birth control implant in her arm, police say https://t.co/YTq9CvSdI9
I understand the pill causes negative affects on some women but it did amazing things for me.
My wife just got a birth control implant that has no hormones and will last over 10 years because her boyfriend was tired of pulling out.
#News Update: Contraceptive Implant Gets Stuck In Woman And Doctors Still Can't Find It https://t.co/Bhwb1eh2Fe
Nexplanon has been good for me so far, seen ppl say they hate it
Woman has contraceptive implant 'lost' in her arm for two years - now she needs surgery https://t.co/5LdQtqf5uP


<br><br>

# Explore TWITTER replies

In [418]:
examples = db.get_dataset('discourse-twitter-replies')

print(len(examples))

200


In [419]:
label_count_dict = defaultdict(int)
method_label_count_dict = defaultdict(lambda: defaultdict(int))
label_texts_dict = defaultdict(list)
for e in examples:
    for _label in e['accept']:
        label_count_dict[_label] += 1
        method_label_count_dict[e['meta']['Method']][_label] += 1
        label_texts_dict[_label].append(e['text'])
    if len(e['accept']) < 1:
        label_count_dict['NONE'] += 1
        label_texts_dict['NONE'].append(e['text'])

print('------------------------------------------------------')
print('total number of posts labeled')
print('------------------------------------------------------')
print()
for _label, _count in sorted(label_count_dict.items(), key=lambda x: x[1], reverse=True):
    print(_count, '\t', _label)

------------------------------------------------------
total number of posts labeled
------------------------------------------------------

72 	 narrating personal experiences
38 	 NONE
31 	 providing information (educational)
21 	 other discourse
10 	 weighing options
9 	 providing information (advice)
8 	 politics
8 	 seeking experiences
7 	 humor
7 	 providing other experiences
4 	 negative self-disclosure
3 	 seeking information (educational)
3 	 providing personal experiences
2 	 providing emotional support
1 	 seeking information (advice)


In [420]:
for _method, _label_count_dict in method_label_count_dict.items():
    print('--------------------------------')
    print(_method)
    print('--------------------------------')
    for _label, _count in sorted(_label_count_dict.items(), key=lambda x: x[1], reverse=True):
        print(_count, '\t', _label)
    print()

--------------------------------
pill
--------------------------------
15 	 other discourse
13 	 providing information (educational)
6 	 humor
5 	 narrating personal experiences
3 	 politics
2 	 providing personal experiences
2 	 seeking experiences
1 	 negative self-disclosure
1 	 providing information (advice)
1 	 seeking information (educational)

--------------------------------
implant
--------------------------------
37 	 narrating personal experiences
8 	 providing information (educational)
5 	 providing information (advice)
4 	 weighing options
3 	 negative self-disclosure
3 	 providing other experiences
3 	 seeking experiences
2 	 other discourse
1 	 providing personal experiences
1 	 providing emotional support
1 	 seeking information (educational)

--------------------------------
iud
--------------------------------
30 	 narrating personal experiences
10 	 providing information (educational)
6 	 weighing options
5 	 politics
4 	 other discourse
4 	 providing other experienc

In [421]:
label_percent_dict = {_label: _count/float(len(examples)) for _label, _count in label_count_dict.items()}

print('------------------------------')
print('percent of posts with label')
print('------------------------------')
print()
for _label, _percent in sorted(label_percent_dict.items(), key=lambda x: x[1], reverse=True):
    print(str(round(_percent*100, 1)) + '%', '\t', label_count_dict[_label], '\t', _label)

------------------------------
percent of posts with label
------------------------------

36.0% 	 narrating personal experiences
19.0% 	 NONE
15.5% 	 providing information (educational)
10.5% 	 other discourse
5.0% 	 weighing options
4.5% 	 providing information (advice)
4.0% 	 politics
4.0% 	 seeking experiences
3.5% 	 humor
3.5% 	 providing other experiences
2.0% 	 negative self-disclosure
1.5% 	 seeking information (educational)
1.5% 	 providing personal experiences
1.0% 	 providing emotional support
0.5% 	 seeking information (advice)


In [422]:
for _label, _texts in label_texts_dict.items():
    if _label == 'rant':
        print('------------------------------------------')
        print(_label)
        print('------------------------------------------')
        print()
        for e in _texts:
            print(' '.join(e.split()))

<br><br><br><br>

# Explore WebMD reviews

In [64]:
examples = db.get_dataset('discourse-webmd-reviews')

print(len(examples))

100


In [65]:
label_count_dict = defaultdict(int)
method_label_count_dict = defaultdict(lambda: defaultdict(int))
label_texts_dict = defaultdict(list)
for e in examples:
    for _label in e['accept']:
        label_count_dict[_label] += 1
        method_label_count_dict[e['meta']['Method']][_label] += 1
        label_texts_dict[_label].append(e['text'])
    if len(e['accept']) < 1:
        label_count_dict['NONE'] += 1
        label_texts_dict['NONE'].append(e['text'])

print('------------------------------------------------------')
print('total number of posts labeled')
print('------------------------------------------------------')
print()
for _label, _count in sorted(label_count_dict.items(), key=lambda x: x[1], reverse=True):
    print(_count, '\t', _label)

------------------------------------------------------
total number of posts labeled
------------------------------------------------------

75 	 SHARING EXPERIENCES
22 	 SHARING OPINIONS AND PREFERENCES
7 	 SHARING PERSONAL BACKGROUND
6 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
6 	 SHARING FUTURE PLANS
3 	 SHARING NEGATIVE EMOTIONS
3 	 SHARING CAUSAL REASONING / HYPOTHESIZING
2 	 NONE
2 	 META DISCUSSION
1 	 SHARING INFORMATION
1 	 SHARING POSITIVE EMOTIONS
1 	 SHARING ADVICE


In [66]:
for _method, _label_count_dict in method_label_count_dict.items():
    print('--------------------------------')
    print(_method)
    print('--------------------------------')
    for _label, _count in sorted(_label_count_dict.items(), key=lambda x: x[1], reverse=True):
        print(_count, '\t', _label)
    print()

--------------------------------
pill
--------------------------------
23 	 SHARING EXPERIENCES
3 	 SHARING OPINIONS AND PREFERENCES
2 	 SHARING PERSONAL BACKGROUND
1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
1 	 META DISCUSSION
1 	 SHARING FUTURE PLANS

--------------------------------
iud
--------------------------------
25 	 SHARING EXPERIENCES
7 	 SHARING OPINIONS AND PREFERENCES
4 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
4 	 SHARING PERSONAL BACKGROUND
2 	 SHARING NEGATIVE EMOTIONS
2 	 SHARING CAUSAL REASONING / HYPOTHESIZING
1 	 SHARING INFORMATION
1 	 SHARING POSITIVE EMOTIONS
1 	 SHARING FUTURE PLANS

--------------------------------
implant
--------------------------------
27 	 SHARING EXPERIENCES
12 	 SHARING OPINIONS AND PREFERENCES
4 	 SHARING FUTURE PLANS
1 	 SHARING CAUSAL REASONING / HYPOTHESIZING
1 	 META DISCUSSION
1 	 SHARING PERSONAL BACKGROUND
1 	 SHARING ADVICE
1 	 SHARING NEGATIVE EMOTIONS
1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH



In [69]:
label_percent_dict = {_label: _count/float(len(examples)) for _label, _count in label_count_dict.items()}

print('------------------------------')
print('percent of posts with label')
print('------------------------------')
print()
for _label, _percent in sorted(label_percent_dict.items(), key=lambda x: x[1], reverse=True):
    print(str(round(_percent*100, 1)) + '%', '\t', label_count_dict[_label], '\t', _label)

------------------------------
percent of posts with label
------------------------------

75.0% 	 75 	 SHARING EXPERIENCES
22.0% 	 22 	 SHARING OPINIONS AND PREFERENCES
7.0% 	 7 	 SHARING PERSONAL BACKGROUND
6.0% 	 6 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
6.0% 	 6 	 SHARING FUTURE PLANS
3.0% 	 3 	 SHARING NEGATIVE EMOTIONS
3.0% 	 3 	 SHARING CAUSAL REASONING / HYPOTHESIZING
2.0% 	 2 	 NONE
2.0% 	 2 	 META DISCUSSION
1.0% 	 1 	 SHARING INFORMATION
1.0% 	 1 	 SHARING POSITIVE EMOTIONS
1.0% 	 1 	 SHARING ADVICE


In [68]:
for _label, _texts in label_texts_dict.items():
    if _label == 'SHARING/DESCRIBING ADDITIONAL RESEARCH':
        print('------------------------------------------')
        print(_label)
        print('------------------------------------------')
        print()
        for e in _texts:
            print(' '.join(e.split()))

------------------------------------------
SHARING/DESCRIBING ADDITIONAL RESEARCH
------------------------------------------

So if anyone is wondering, I have read all of these comments and they all apply.bloating, nausea, weight/appetite gain (it's been 12 days) extreme rage, breat tenderness and lumps.
So i went back again, gave me different meds, which i couldnt take upset stomach along with my regular symptoms then i kept having this pain, and hardness in my lower left abdomen, painful sex, doctor said my cervix and uterus were swollen, cramping, just horrible pain and very annoying.
While doing my research I read quite a few reviews about how much it hurt.
Before getting this form of birth control, I read tons of reviews.
But that was all explained to me before I chose to get it.
Also hearing some woman got pregnant while on Nexplanon is scary.


<br><br>

# Back up labels to a CSV

In [435]:
reddit_post_examples = db.get_dataset('discourse-reddit-posts')
reddit_comment_examples = db.get_dataset('discourse-reddit-comments')
twitter_post_examples = db.get_dataset('discourse-twitter-posts')
twitter_replies_examples = db.get_dataset('discourse-twitter-replies')

In [436]:
len(reddit_post_examples), len(reddit_comment_examples), len(twitter_post_examples), len(twitter_replies_examples)

(200, 200, 200, 200)

In [448]:
label_dicts = []
for e in reddit_post_examples + reddit_comment_examples + twitter_post_examples + twitter_replies_examples:
    for _label in e['accept']:
        label_dicts.append({'Source': e['meta']['Source'],
                            'ID': e['meta']['ID'],
                            'Label': _label,
                            'Text': e['text']})
    if len(e['accept']) == 0:
        label_dicts.append({'Source': e['meta']['Source'],
                            'ID': e['meta']['ID'],
                            'Label': 'NONE',
                            'Text': e['text']})
label_df = pd.DataFrame(label_dicts)

In [449]:
len(label_df)

895

In [480]:
label_df['Label'].value_counts()

narrating personal experiences         275
providing information (educational)    125
NONE                                   121
other discourse                         58
negative self-disclosure                44
seeking information (educational)       43
seeking experiences                     42
weighing options                        39
providing information (advice)          31
providing other experiences             22
humor                                   20
providing personal experiences          19
seeking information (advice)            19
politics                                16
positive self-disclosure                11
providing emotional support              7
seeking emotional support                3
Name: Label, dtype: int64

In [450]:
label_df.sample(3)

Unnamed: 0,Source,ID,Label,Text
677,twitter-replies,408246293442469900,providing information (educational),IUD removal takes about two seconds &amp; is m...
877,twitter-replies,332926197128364000,politics,Judge Slams 'Frivolous' Obama Defense Of Birth...
267,reddit-comments,e2vbh8x,narrating personal experiences,The doctor who removed mine at PP listened int...


In [451]:
for i, r in label_df[label_df['Label'] == 'NONE'].sample(10).iterrows():
    print(' '.join(r['Text'].split()))

Contraceptive pill being available?
She only dressed quickly into her underwear and slid the dress over her head.
Perhaps he confused them with IUD's?
In two months I'll be 18
And I probably shouldnt have phrased that the way I did without providing more context.
Haha, nothing really.
Ughhhh this sucks
What's going on?
It's always a surprise!
Im into distribution, i'm like atlantic.


In [452]:
label_df.to_csv('/Volumes/Passport-1/data/birth-control/labeling/label-sentences/labeled_by_maria.all.csv')

# Try training a simple model

In [482]:
data_directory_path = '/Volumes/Passport-1/data/birth-control'
test_df = pd.read_csv(data_directory_path + '/labeling/label-sentences/sampled-sentences.test.csv')
len(test_df)

11993

In [483]:
test_df.sample(3)

Unnamed: 0.1,Unnamed: 0,text,meta
3859,3859,The only thing that works that late is the cop...,"{'ID': 'eyx0ilp', 'Source': 'reddit-comments',..."
11745,11745,i have been on this for 7 months and i bleed f...,"{'ID': 'w11392', 'Source': 'webmd-reviews', 'M..."
10011,10011,The entire time was awful.,"{'ID': 'w12188', 'Source': 'webmd-reviews', 'M..."
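The `meta` column in this preview appears to hold Python dict literals saved as strings. That is an assumption based on the truncated cells shown; only the `ID` and `Source` keys are fully visible. If so, `ast.literal_eval` can turn a cell back into a dict, whereas `json.loads` would reject the single quotes. A minimal sketch on a hypothetical cell value:

```python
import ast

# hypothetical cell value built from the keys visible in the preview
meta_cell = "{'ID': 'w12188', 'Source': 'webmd-reviews'}"

# literal_eval only parses Python literals, so it is safer than eval()
meta = ast.literal_eval(meta_cell)
print(meta['Source'])
```

On the full dataframe this could be applied column-wise, for example with `test_df['meta'].apply(ast.literal_eval)`.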


In [484]:
len(label_df.index)

895

In [465]:
label_df.sample(3)

Unnamed: 0,Source,ID,Label,Text
516,twitter-posts,300336102865248260,narrating personal experiences,getting the contraceptive implant in possibly ...
96,reddit-posts,gjld30,narrating personal experiences,Although sometimes I feel like I am gonna have...
39,reddit-posts,c2gb30,seeking information (advice),Can I take these steri-strips (butterfly stitc...


In [458]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [460]:
def binarize_label(label, target_label):
    if label == target_label:
        return 1
    return 0

In [510]:
for _target_label in label_df['Label'].unique():

    _binarized_df = label_df.copy()
    _binarized_df['Label'] = label_df['Label'].apply(lambda x: binarize_label(x, _target_label))
    _positive_ids = _binarized_df[_binarized_df['Label'] == 1]['ID'].tolist()
    _binarized_df = _binarized_df[~((_binarized_df['ID'].isin(_positive_ids)) & (_binarized_df['Label'] == 0))]

    _binarized_df = _binarized_df.groupby('Label').sample(n=len(_binarized_df[_binarized_df['Label'] == 1]), random_state=1)

    if len(_binarized_df.index) > 50:

        _train_df, _test_df = train_test_split(_binarized_df, test_size=0.33, random_state=42)

        _train_texts = _train_df['Text']
        _train_labels = _train_df['Label']
        _test_texts = _test_df['Text']
        _test_labels = _test_df['Label']

        _vectorizer = TfidfVectorizer()
        _X_train = _vectorizer.fit_transform(_train_texts)
        _X_test = _vectorizer.transform(_test_texts)

        _model = LogisticRegression(C=10).fit(_X_train, _train_labels)
        _predictions = _model.predict(_X_test)

        print(_target_label)
        print(classification_report(_test_labels, _predictions))

negative self-disclosure
              precision    recall  f1-score   support

           0       0.77      0.59      0.67        17
           1       0.59      0.77      0.67        13

    accuracy                           0.67        30
   macro avg       0.68      0.68      0.67        30
weighted avg       0.69      0.67      0.67        30

narrating personal experiences
              precision    recall  f1-score   support

           0       0.82      0.82      0.82        88
           1       0.83      0.83      0.83        94

    accuracy                           0.82       182
   macro avg       0.82      0.82      0.82       182
weighted avg       0.82      0.82      0.82       182

seeking experiences
              precision    recall  f1-score   support

           0       0.86      0.80      0.83        15
           1       0.79      0.85      0.81        13

    accuracy                           0.82        28
   macro avg       0.82      0.82      0.82        2
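To read these reports, recall that precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The sketch below recomputes the class-1 row of the `negative self-disclosure` report. The counts (10 true positives, 7 false positives, 3 false negatives) are inferred from the rounded figures and the supports of 13 and 17; the notebook does not print them.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# counts inferred from the rounded report above, not taken from the notebook
tp, fp, fn = 10, 7, 3
p, r, f = precision_recall_f1(tp, fp, fn)
print(round(p, 2), round(r, 2), round(f, 2))
```

With only 28 to 30 test sentences for the rarer labels, one prediction moves these scores by several points, so the smaller reports are rough indications at best.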

In [478]:
_binarized_df['Label'].value_counts()

0    16
1    16
Name: Label, dtype: int64
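The binarize, filter, and balance steps in the training loop can be hard to follow in their pandas form. The plain-Python sketch below mirrors them on toy rows. The function name `balanced_binary_rows` and the toy data are hypothetical. One deliberate difference: it samples `min(positives, negatives)` rows per class, whereas the loop above always samples as many rows as there are positives.

```python
import random

def balanced_binary_rows(rows, target_label, seed=1):
    """Return (text, 0/1) pairs with equal numbers of positives and negatives."""
    positive_ids = {r['ID'] for r in rows if r['Label'] == target_label}
    positives = [(r['Text'], 1) for r in rows if r['Label'] == target_label]
    # drop rows that are negative here but share an ID with a positive:
    # the same sentence would otherwise appear with both labels
    negatives = [(r['Text'], 0) for r in rows
                 if r['Label'] != target_label and r['ID'] not in positive_ids]
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)

# toy rows (hypothetical); sentence 'a' carries two labels
toy_rows = [
    {'ID': 'a', 'Label': 'humor', 'Text': 't1'},
    {'ID': 'a', 'Label': 'politics', 'Text': 't1'},
    {'ID': 'b', 'Label': 'politics', 'Text': 't2'},
    {'ID': 'c', 'Label': 'NONE', 'Text': 't3'},
    {'ID': 'd', 'Label': 'humor', 'Text': 't4'},
]

balanced = balanced_binary_rows(toy_rows, 'humor')
print(sorted(balanced))
```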

In [502]:
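def process_string(text):
    text = text.lower()
    # collapse every run of digits into a single placeholder token
    text = re.sub('[0-9]+', 'NUM', text)
    # pad punctuation with spaces so each mark becomes its own token;
    # the capturing group and raw replacement string make \1 a backreference
    text = re.sub(r'([^\sA-Za-z0-9À-ÖØ-öø-ÿЀ-ӿ/])', r' \1 ', text)
    text = ' '.join(text.split())
    return text

In [505]:
process_string('Does this work? Hmmm,how about this???')

'does this work ? hmmm , how about this ? ? ?'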
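In `re.sub`, `\1` in the replacement only acts as a backreference when the pattern contains a capturing group and the replacement is a raw string (or the backslash is escaped). In a normal string literal, `'\1'` is the control character `\x01`. The sketch below shows both behaviours on a simplified character class, not the full one used in `process_string`.

```python
import re

text = 'how about this???'

# no capturing group, non-raw replacement: '\1' is the character \x01
no_group = re.sub(r'[^\sA-Za-z0-9]', ' \1 ', text)

# capturing group plus raw replacement: \1 re-inserts the matched punctuation
with_group = re.sub(r'([^\sA-Za-z0-9])', r' \1 ', text)

print(repr(' '.join(no_group.split())))
print(repr(' '.join(with_group.split())))
```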

In [511]:
for _target_label in label_df['Label'].unique():

    _binarized_df = label_df.copy()
    _binarized_df['Label'] = label_df['Label'].apply(lambda x: binarize_label(x, _target_label))
    _positive_ids = _binarized_df[_binarized_df['Label'] == 1]['ID'].tolist()
    _binarized_df = _binarized_df[~((_binarized_df['ID'].isin(_positive_ids)) & (_binarized_df['Label'] == 0))]

    _binarized_df = _binarized_df.groupby('Label').sample(n=len(_binarized_df[_binarized_df['Label'] == 1]), random_state=1)

    if len(_binarized_df.index) > 50:

        _train_texts = _binarized_df['Text']
        _train_labels = _binarized_df['Label']

        _test_texts = test_df['text']

        _train_texts_processed = [process_string(t) for t in _train_texts]
        _test_texts_processed  = [process_string(t) for t in _test_texts]

        _vectorizer = TfidfVectorizer()
        _X_train = _vectorizer.fit_transform(_train_texts_processed)
        _X_test = _vectorizer.transform(_test_texts_processed)

        _model = LogisticRegression(C=10).fit(_X_train, _train_labels)
        _predictions = _model.predict(_X_test)

        print('---------------------------------')
        print(_target_label)
        print('---------------------------------')
        print()

        _positive_texts = [_text for _prediction, _text in zip(_predictions, _test_texts) if _prediction == 1]
        _negative_texts = [_text for _prediction, _text in zip(_predictions, _test_texts) if _prediction == 0]

        print('POSITIVE')
        # guard against labels with fewer than 10 predicted positives
        for _text in random.sample(_positive_texts, min(10, len(_positive_texts))):
            print(' '.join(_text.split()))
        
        print()


---------------------------------
negative self-disclosure
---------------------------------

POSITIVE
the implanon has given me MAD mood swings fml
No weight gain either, just got my period a few times/year.
So, I started this pill a month and a half and it's been great , I haven't spotted at all or gotten any acne like some have said, but I have been very depressed since starting and it just occurred to me today to look up side effects because I have never felt like this ever
A day late isn’t really late.
THIS THING!!!good luck and give it a try!~
If it gets knocked out it wasnt put in right RT :
I had the Implanon and found it sent my sex drive plummeting whilst making me put on weight it was awful
Just watched a video of how the nexplanon implant is removed and now I wanna cry😫
The chance of an IUD dislodging is actually really low (despite horror stories on reddit).
I was looking on getting a implanon but the side effects scare me a bit lol

---------------------------------
narra