# Instructions

Use the `prodigyEnv` conda environment for this notebook.

To set up the Prodigy environment, download the wheel file from the Prodigy email (which you receive after purchasing a license).

Then run `pip install ./prodigy*.whl`.

Installation instructions: https://prodi.gy/docs/install

Database is stored at /

<br><br>

# Imports

In [153]:
from collections import defaultdict
import random
import re

import pandas as pd
from prodigy.components.db import connect

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='ticks', font_scale=1.2)

In [154]:
def sort_by_mean(df, by, column, rot=0):
    # use a dict comprehension to create a new dataframe from the iterable groupby object;
    # each group name becomes a column in the new dataframe
    df2 = pd.DataFrame({col: vals[column] for col, vals in df.groupby(by)})
    # compute the mean of each group and sort the groups by that mean
    means = df2.mean().sort_values()
    # `rot` is only used by the disabled boxplot call below, which would plot
    # the groups in mean order and return the axes for further changes
#     return df2[means.index].boxplot(rot=rot, return_type="axes")
    return means
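
As a usage sketch for `sort_by_mean`, the snippet below runs the same grouping logic on a small hypothetical dataframe. The column names `Source` and `Length` and all values are made up for illustration; they are not taken from the annotation data.

```python
import pandas as pd

def sort_by_mean(df, by, column):
    # One column per group, then the per-group mean, sorted ascending.
    df2 = pd.DataFrame({name: group[column] for name, group in df.groupby(by)})
    return df2.mean().sort_values()

# Hypothetical toy data (not from the Prodigy datasets).
toy_df = pd.DataFrame({
    'Source': ['reddit', 'reddit', 'twitter', 'twitter'],
    'Length': [10, 20, 1, 3],
})

means = sort_by_mean(toy_df, by='Source', column='Length')
print(means)
```

The group with the smallest mean comes first, so `means.index` can be used to order the columns of a plot.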

<br><br><br><br>

---

<br><br><br><br>


# Connect to database

In [155]:
db = connect()

db.datasets  # This will list all of your Prodigy datasets

['bc-reddit-posts',
 'bc-reddit-comments',
 'bc-twitter-posts',
 'bc-twitter-replies',
 'discourse-reddit-posts',
 'discourse-reddit-comments',
 'discourse-twitter-posts',
 'discourse-twitter-replies',
 'discourse-webmd-reviews']

In [156]:
# db.drop_dataset('discourse-webmd-reviews')  # WARNING: only run this if you want to delete all of this dataset's annotations!

<br><br><br><br>

---

<br><br><br><br>

# Explore REDDIT posts

In [157]:
examples = db.get_dataset('discourse-reddit-posts')

print(len(examples))

20


In [158]:
label_count_dict = defaultdict(int)
method_label_count_dict = defaultdict(lambda: defaultdict(int))
label_texts_dict = defaultdict(list)
for e in examples:
    for _label in e['accept']:
        label_count_dict[_label] += 1
        method_label_count_dict[e['meta']['Method']][_label] += 1
        label_texts_dict[_label].append(e['text'])
    if len(e['accept']) < 1:
        label_count_dict['NONE'] += 1
        label_texts_dict['NONE'].append(e['text'])

print('------------------------------------------------------')
print('total number of posts labeled')
print('------------------------------------------------------')
print()
for _label, _count in sorted(label_count_dict.items(), key=lambda x: x[1], reverse=True):
    print(_count, '\t', _label)

------------------------------------------------------
total number of posts labeled
------------------------------------------------------

6 	 NONE
3 	 SHARING PERSONAL EXPERIENCES
3 	 SEEKING INFORMATION
2 	 SEEKING EXPERIENCES
2 	 SHARING CAUSAL REASONING / HYPOTHESIZING
2 	 SHARING FUTURE PLANS
1 	 SHARING ADVICE
1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
1 	 SHARING PERSONAL BACKGROUND
1 	 SHARING INFORMATION
1 	 SHARING SECONDHAND EXPERIENCES
1 	 SEEKING EMOTIONAL SUPPORT
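
The tallying cell above is repeated for every dataset below, so it could be wrapped in a helper. The following is only a sketch: the function name `summarize_labels` and the toy records are hypothetical, and it assumes each record has the `accept`, `text`, and `meta['Method']` fields used above.

```python
from collections import Counter, defaultdict

def summarize_labels(examples):
    """Tally labels overall and per method for choice-style records."""
    label_counts = Counter()
    method_label_counts = defaultdict(Counter)
    label_texts = defaultdict(list)
    for example in examples:
        # Examples with no accepted label are counted once under 'NONE'.
        accepted = example['accept'] or ['NONE']
        for label in accepted:
            label_counts[label] += 1
            label_texts[label].append(example['text'])
            # As in the cell above, 'NONE' is not tallied per method.
            if label != 'NONE':
                method_label_counts[example['meta']['Method']][label] += 1
    return label_counts, method_label_counts, label_texts

# Hypothetical toy records, shaped like the annotated examples.
toy_examples = [
    {'text': 'Has anyone tried this?',
     'accept': ['SEEKING EXPERIENCES'], 'meta': {'Method': 'pill'}},
    {'text': 'I had cramps for a week. Is that normal?',
     'accept': ['SHARING PERSONAL EXPERIENCES', 'SEEKING INFORMATION'],
     'meta': {'Method': 'iud'}},
    {'text': 'Thanks in advance!', 'accept': [], 'meta': {'Method': 'iud'}},
]

label_counts, method_label_counts, label_texts = summarize_labels(toy_examples)
for label, count in label_counts.most_common():
    print(count, '\t', label)
```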


In [159]:
for _method, _label_count_dict in method_label_count_dict.items():
    print('--------------------------------')
    print(_method)
    print('--------------------------------')
    for _label, _count in sorted(_label_count_dict.items(), key=lambda x: x[1], reverse=True):
        print(_count, '\t', _label)
    print()

--------------------------------
pill
--------------------------------
1 	 SHARING ADVICE
1 	 SEEKING EXPERIENCES
1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
1 	 SHARING PERSONAL EXPERIENCES
1 	 SHARING PERSONAL BACKGROUND
1 	 SHARING CAUSAL REASONING / HYPOTHESIZING
1 	 SEEKING INFORMATION

--------------------------------
iud
--------------------------------
2 	 SEEKING INFORMATION
1 	 SEEKING EXPERIENCES
1 	 SHARING INFORMATION
1 	 SHARING SECONDHAND EXPERIENCES
1 	 SHARING FUTURE PLANS

--------------------------------
implant
--------------------------------
2 	 SHARING PERSONAL EXPERIENCES
1 	 SHARING FUTURE PLANS
1 	 SEEKING EMOTIONAL SUPPORT
1 	 SHARING CAUSAL REASONING / HYPOTHESIZING



In [160]:
label_percent_dict = {_label: _count/float(len(examples)) for _label, _count in label_count_dict.items()}

print('------------------------------')
print('percent of posts with label')
print('------------------------------')
print()
for _label, _percent in sorted(label_percent_dict.items(), key=lambda x: x[1], reverse=True):
    print(str(round(_percent*100, 1)) + '%', '\t', label_count_dict[_label], '\t', _label)

------------------------------
percent of posts with label
------------------------------

30.0% 	 6 	 NONE
15.0% 	 3 	 SHARING PERSONAL EXPERIENCES
15.0% 	 3 	 SEEKING INFORMATION
10.0% 	 2 	 SEEKING EXPERIENCES
10.0% 	 2 	 SHARING CAUSAL REASONING / HYPOTHESIZING
10.0% 	 2 	 SHARING FUTURE PLANS
5.0% 	 1 	 SHARING ADVICE
5.0% 	 1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
5.0% 	 1 	 SHARING PERSONAL BACKGROUND
5.0% 	 1 	 SHARING INFORMATION
5.0% 	 1 	 SHARING SECONDHAND EXPERIENCES
5.0% 	 1 	 SEEKING EMOTIONAL SUPPORT
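
Because one example can carry several labels, these percentages need not sum to 100%. A quick check using the counts printed above (the variable names are illustrative):

```python
# Label counts copied from the reddit-posts output above.
reddit_post_counts = {
    'NONE': 6,
    'SHARING PERSONAL EXPERIENCES': 3,
    'SEEKING INFORMATION': 3,
    'SEEKING EXPERIENCES': 2,
    'SHARING CAUSAL REASONING / HYPOTHESIZING': 2,
    'SHARING FUTURE PLANS': 2,
    'SHARING ADVICE': 1,
    'SHARING/DESCRIBING ADDITIONAL RESEARCH': 1,
    'SHARING PERSONAL BACKGROUND': 1,
    'SHARING INFORMATION': 1,
    'SHARING SECONDHAND EXPERIENCES': 1,
    'SEEKING EMOTIONAL SUPPORT': 1,
}
n_examples = 20  # len(examples) for this dataset

total_assignments = sum(reddit_post_counts.values())
total_percent = round(100 * total_assignments / n_examples, 1)
print(total_assignments, 'label assignments over', n_examples, 'examples')
print(str(total_percent) + '%')
```

Here 24 label assignments are spread over 20 examples, so the column adds up to 120%.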


In [161]:
for _label, _texts in label_texts_dict.items():
    if _label == 'SHARING CAUSAL REASONING / HYPOTHESIZING':
        print('------------------------------------------')
        print(_label)
        print('------------------------------------------')
        print()
        for e in _texts:
            print(' '.join(e.split()))

------------------------------------------
SHARING CAUSAL REASONING / HYPOTHESIZING
------------------------------------------

It lasted for a few weeks but I figured it was because of the change .
But I know it will probably take up to 6 months for my body to re-adjust.


<br><br>

# Explore REDDIT comments

In [162]:
examples = db.get_dataset('discourse-reddit-comments')

print(len(examples))

20


In [163]:
label_count_dict = defaultdict(int)
method_label_count_dict = defaultdict(lambda: defaultdict(int))
label_texts_dict = defaultdict(list)
for e in examples:
    for _label in e['accept']:
        label_count_dict[_label] += 1
        method_label_count_dict[e['meta']['Method']][_label] += 1
        label_texts_dict[_label].append(e['text'])
    if len(e['accept']) < 1:
        label_count_dict['NONE'] += 1
        label_texts_dict['NONE'].append(e['text'])

print('------------------------------------------------------')
print('total number of posts labeled')
print('------------------------------------------------------')
print()
for _label, _count in sorted(label_count_dict.items(), key=lambda x: x[1], reverse=True):
    print(_count, '\t', _label)

------------------------------------------------------
total number of posts labeled
------------------------------------------------------

6 	 NONE
6 	 SHARING INFORMATION
3 	 SHARING ADVICE
3 	 SHARING PERSONAL EXPERIENCES
2 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
1 	 SHARING OPINIONS AND PREFERENCES


In [164]:
for _method, _label_count_dict in method_label_count_dict.items():
    print('--------------------------------')
    print(_method)
    print('--------------------------------')
    for _label, _count in sorted(_label_count_dict.items(), key=lambda x: x[1], reverse=True):
        print(_count, '\t', _label)
    print()

--------------------------------
iud
--------------------------------
4 	 SHARING INFORMATION
1 	 SHARING PERSONAL EXPERIENCES
1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH

--------------------------------
implant
--------------------------------
2 	 SHARING INFORMATION
1 	 SHARING OPINIONS AND PREFERENCES
1 	 SHARING ADVICE
1 	 SHARING PERSONAL EXPERIENCES

--------------------------------
pill
--------------------------------
2 	 SHARING ADVICE
1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
1 	 SHARING PERSONAL EXPERIENCES



In [165]:
label_percent_dict = {_label: _count/float(len(examples)) for _label, _count in label_count_dict.items()}

print('------------------------------')
print('percent of posts with label')
print('------------------------------')
print()
for _label, _percent in sorted(label_percent_dict.items(), key=lambda x: x[1], reverse=True):
    print(str(round(_percent*100, 1)) + '%', '\t', label_count_dict[_label], '\t', _label)

------------------------------
percent of posts with label
------------------------------

30.0% 	 6 	 NONE
30.0% 	 6 	 SHARING INFORMATION
15.0% 	 3 	 SHARING ADVICE
15.0% 	 3 	 SHARING PERSONAL EXPERIENCES
10.0% 	 2 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
5.0% 	 1 	 SHARING OPINIONS AND PREFERENCES


In [166]:
for _label, _texts in label_texts_dict.items():
    if _label == 'SHARING/DESCRIBING ADDITIONAL RESEARCH':
        print('------------------------------------------')
        print(_label)
        print('------------------------------------------')
        print()
        for e in _texts:
            print(' '.join(e.split()))

------------------------------------------
SHARING/DESCRIBING ADDITIONAL RESEARCH
------------------------------------------

My doctor hasn't been concerned by it.
Reading/ watching experience stories has become my nightly routine lol


<br><br>

# Explore TWITTER posts

In [167]:
examples = db.get_dataset('discourse-twitter-posts')

print(len(examples))

20


In [168]:
label_count_dict = defaultdict(int)
method_label_count_dict = defaultdict(lambda: defaultdict(int))
label_texts_dict = defaultdict(list)
for e in examples:
    for _label in e['accept']:
        label_count_dict[_label] += 1
        method_label_count_dict[e['meta']['Method']][_label] += 1
        label_texts_dict[_label].append(e['text'])
    if len(e['accept']) < 1:
        label_count_dict['NONE'] += 1
        label_texts_dict['NONE'].append(e['text'])

print('------------------------------------------------------')
print('total number of posts labeled')
print('------------------------------------------------------')
print()
for _label, _count in sorted(label_count_dict.items(), key=lambda x: x[1], reverse=True):
    print(_count, '\t', _label)

------------------------------------------------------
total number of posts labeled
------------------------------------------------------

5 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
4 	 SHARING PERSONAL EXPERIENCES
3 	 NONE
2 	 SHARING FUTURE PLANS
2 	 META DISCUSSION
2 	 SHARING ADVICE
2 	 SHARING OPINIONS AND PREFERENCES
2 	 SHARING INFORMATION
1 	 SEEKING EMOTIONAL SUPPORT
1 	 SEEKING INFORMATION
1 	 SHARING SECONDHAND EXPERIENCES
1 	 SHARING PERSONAL BACKGROUND


In [169]:
for _method, _label_count_dict in method_label_count_dict.items():
    print('--------------------------------')
    print(_method)
    print('--------------------------------')
    for _label, _count in sorted(_label_count_dict.items(), key=lambda x: x[1], reverse=True):
        print(_count, '\t', _label)
    print()

--------------------------------
iud
--------------------------------
2 	 SHARING PERSONAL EXPERIENCES
1 	 SHARING FUTURE PLANS
1 	 META DISCUSSION
1 	 SHARING OPINIONS AND PREFERENCES
1 	 SHARING PERSONAL BACKGROUND
1 	 SHARING INFORMATION
1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH

--------------------------------
pill
--------------------------------
2 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
1 	 META DISCUSSION
1 	 SEEKING INFORMATION

--------------------------------
implant
--------------------------------
2 	 SHARING PERSONAL EXPERIENCES
2 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
2 	 SHARING ADVICE
1 	 SEEKING EMOTIONAL SUPPORT
1 	 SHARING FUTURE PLANS
1 	 SHARING OPINIONS AND PREFERENCES
1 	 SHARING SECONDHAND EXPERIENCES
1 	 SHARING INFORMATION



In [170]:
label_percent_dict = {_label: _count/float(len(examples)) for _label, _count in label_count_dict.items()}

print('------------------------------')
print('percent of posts with label')
print('------------------------------')
print()
for _label, _percent in sorted(label_percent_dict.items(), key=lambda x: x[1], reverse=True):
    print(str(round(_percent*100, 1)) + '%', '\t', label_count_dict[_label], '\t', _label)

------------------------------
percent of posts with label
------------------------------

25.0% 	 5 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
20.0% 	 4 	 SHARING PERSONAL EXPERIENCES
15.0% 	 3 	 NONE
10.0% 	 2 	 SHARING FUTURE PLANS
10.0% 	 2 	 META DISCUSSION
10.0% 	 2 	 SHARING ADVICE
10.0% 	 2 	 SHARING OPINIONS AND PREFERENCES
10.0% 	 2 	 SHARING INFORMATION
5.0% 	 1 	 SEEKING EMOTIONAL SUPPORT
5.0% 	 1 	 SEEKING INFORMATION
5.0% 	 1 	 SHARING SECONDHAND EXPERIENCES
5.0% 	 1 	 SHARING PERSONAL BACKGROUND


In [171]:
for _label, _texts in label_texts_dict.items():
    if _label == 'SHARING INFORMATION':
        print('------------------------------------------')
        print(_label)
        print('------------------------------------------')
        print()
        for e in _texts:
            print(' '.join(e.split()))

------------------------------------------
SHARING INFORMATION
------------------------------------------

It cause a lot of hormonal imbalances.
http://t.co/fWeL2X2M IUD Beats Pill at Preventing Pregnancy - WebMD: http://t.co/vpcDmI3X IUD Beats… http://t.co/UvM4gSDM


<br><br>

# Explore TWITTER replies

In [172]:
examples = db.get_dataset('discourse-twitter-replies')

print(len(examples))

20


In [173]:
label_count_dict = defaultdict(int)
method_label_count_dict = defaultdict(lambda: defaultdict(int))
label_texts_dict = defaultdict(list)
for e in examples:
    for _label in e['accept']:
        label_count_dict[_label] += 1
        method_label_count_dict[e['meta']['Method']][_label] += 1
        label_texts_dict[_label].append(e['text'])
    if len(e['accept']) < 1:
        label_count_dict['NONE'] += 1
        label_texts_dict['NONE'].append(e['text'])

print('------------------------------------------------------')
print('total number of posts labeled')
print('------------------------------------------------------')
print()
for _label, _count in sorted(label_count_dict.items(), key=lambda x: x[1], reverse=True):
    print(_count, '\t', _label)

------------------------------------------------------
total number of posts labeled
------------------------------------------------------

8 	 SHARING PERSONAL EXPERIENCES
5 	 NONE
3 	 SHARING PERSONAL BACKGROUND
3 	 META DISCUSSION
2 	 SHARING OPINIONS AND PREFERENCES
2 	 SHARING INFORMATION
1 	 SHARING FUTURE PLANS


In [174]:
for _method, _label_count_dict in method_label_count_dict.items():
    print('--------------------------------')
    print(_method)
    print('--------------------------------')
    for _label, _count in sorted(_label_count_dict.items(), key=lambda x: x[1], reverse=True):
        print(_count, '\t', _label)
    print()

--------------------------------
implant
--------------------------------
4 	 SHARING PERSONAL EXPERIENCES
3 	 SHARING PERSONAL BACKGROUND
1 	 SHARING OPINIONS AND PREFERENCES
1 	 SHARING INFORMATION
1 	 SHARING FUTURE PLANS

--------------------------------
pill
--------------------------------
3 	 SHARING PERSONAL EXPERIENCES
2 	 META DISCUSSION
1 	 SHARING OPINIONS AND PREFERENCES

--------------------------------
iud
--------------------------------
1 	 META DISCUSSION
1 	 SHARING INFORMATION
1 	 SHARING PERSONAL EXPERIENCES



In [175]:
label_percent_dict = {_label: _count/float(len(examples)) for _label, _count in label_count_dict.items()}

print('------------------------------')
print('percent of posts with label')
print('------------------------------')
print()
for _label, _percent in sorted(label_percent_dict.items(), key=lambda x: x[1], reverse=True):
    print(str(round(_percent*100, 1)) + '%', '\t', label_count_dict[_label], '\t', _label)

------------------------------
percent of posts with label
------------------------------

40.0% 	 8 	 SHARING PERSONAL EXPERIENCES
25.0% 	 5 	 NONE
15.0% 	 3 	 SHARING PERSONAL BACKGROUND
15.0% 	 3 	 META DISCUSSION
10.0% 	 2 	 SHARING OPINIONS AND PREFERENCES
10.0% 	 2 	 SHARING INFORMATION
5.0% 	 1 	 SHARING FUTURE PLANS


In [176]:
for _label, _texts in label_texts_dict.items():
    if _label == 'SHARING OPINIONS AND PREFERENCES':
        print('------------------------------------------')
        print(_label)
        print('------------------------------------------')
        print()
        for e in _texts:
            print(' '.join(e.split()))

------------------------------------------
SHARING OPINIONS AND PREFERENCES
------------------------------------------

oh helll nahhhhhh I had one with hormones and that shit was horrible.
Tht bc pill is a girl's bff, ha


<br><br><br><br>

# Explore WebMD reviews

In [177]:
examples = db.get_dataset('discourse-webmd-reviews')

print(len(examples))

20


In [178]:
label_count_dict = defaultdict(int)
method_label_count_dict = defaultdict(lambda: defaultdict(int))
label_texts_dict = defaultdict(list)
for e in examples:
    for _label in e['accept']:
        label_count_dict[_label] += 1
        method_label_count_dict[e['meta']['Method']][_label] += 1
        label_texts_dict[_label].append(e['text'])
    if len(e['accept']) < 1:
        label_count_dict['NONE'] += 1
        label_texts_dict['NONE'].append(e['text'])

print('------------------------------------------------------')
print('total number of posts labeled')
print('------------------------------------------------------')
print()
for _label, _count in sorted(label_count_dict.items(), key=lambda x: x[1], reverse=True):
    print(_count, '\t', _label)

------------------------------------------------------
total number of posts labeled
------------------------------------------------------

15 	 SHARING PERSONAL EXPERIENCES
5 	 SHARING OPINIONS AND PREFERENCES
3 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
1 	 SHARING PERSONAL BACKGROUND
1 	 NONE


In [179]:
for _method, _label_count_dict in method_label_count_dict.items():
    print('--------------------------------')
    print(_method)
    print('--------------------------------')
    for _label, _count in sorted(_label_count_dict.items(), key=lambda x: x[1], reverse=True):
        print(_count, '\t', _label)
    print()

--------------------------------
pill
--------------------------------
6 	 SHARING PERSONAL EXPERIENCES
1 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
1 	 SHARING PERSONAL BACKGROUND

--------------------------------
iud
--------------------------------
5 	 SHARING PERSONAL EXPERIENCES
3 	 SHARING OPINIONS AND PREFERENCES
2 	 SHARING/DESCRIBING ADDITIONAL RESEARCH

--------------------------------
implant
--------------------------------
4 	 SHARING PERSONAL EXPERIENCES
2 	 SHARING OPINIONS AND PREFERENCES



In [180]:
label_percent_dict = {_label: _count/float(len(examples)) for _label, _count in label_count_dict.items()}

print('------------------------------')
print('percent of posts with label')
print('------------------------------')
print()
for _label, _percent in sorted(label_percent_dict.items(), key=lambda x: x[1], reverse=True):
    print(str(round(_percent*100, 1)) + '%', '\t', label_count_dict[_label], '\t', _label)

------------------------------
percent of posts with label
------------------------------

75.0% 	 15 	 SHARING PERSONAL EXPERIENCES
25.0% 	 5 	 SHARING OPINIONS AND PREFERENCES
15.0% 	 3 	 SHARING/DESCRIBING ADDITIONAL RESEARCH
5.0% 	 1 	 SHARING PERSONAL BACKGROUND
5.0% 	 1 	 NONE


In [181]:
for _label, _texts in label_texts_dict.items():
    if _label == 'SHARING/DESCRIBING ADDITIONAL RESEARCH':
        print('------------------------------------------')
        print(_label)
        print('------------------------------------------')
        print()
        for e in _texts:
            print(' '.join(e.split()))

------------------------------------------
SHARING/DESCRIBING ADDITIONAL RESEARCH
------------------------------------------

So if anyone is wondering, I have read all of these comments and they all apply.bloating, nausea, weight/appetite gain (it's been 12 days) extreme rage, breat tenderness and lumps.
So i went back again, gave me different meds, which i couldnt take upset stomach along with my regular symptoms then i kept having this pain, and hardness in my lower left abdomen, painful sex, doctor said my cervix and uterus were swollen, cramping, just horrible pain and very annoying.
I did not research until after placement...


<br><br><br><br>

---

<br><br><br><br>

# Back up labels to a CSV

In [182]:
reddit_post_examples = db.get_dataset('discourse-reddit-posts')
reddit_comment_examples = db.get_dataset('discourse-reddit-comments')
twitter_post_examples = db.get_dataset('discourse-twitter-posts')
twitter_replies_examples = db.get_dataset('discourse-twitter-replies')
webmd_reviews_examples = db.get_dataset('discourse-webmd-reviews')

In [183]:
len(reddit_post_examples), len(reddit_comment_examples), len(twitter_post_examples), len(twitter_replies_examples), len(webmd_reviews_examples)

(20, 20, 20, 20, 20)

In [184]:
label_dicts = []
for e in reddit_post_examples + reddit_comment_examples + twitter_post_examples + twitter_replies_examples + webmd_reviews_examples:
    for _label in e['accept']:
        label_dicts.append({'Source': e['meta']['Source'],
                            'ID': e['meta']['ID'],
                            'Label': _label,
                            'Text': e['text']})
    if len(e['accept']) == 0:
        label_dicts.append({'Source': e['meta']['Source'],
                            'ID': e['meta']['ID'],
                            'Label': 'NONE',
                            'Text': e['text']})
label_df = pd.DataFrame(label_dicts)

In [185]:
len(label_df)

120
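
The 100 annotated examples produce 120 rows because each accepted label gets its own row, and an example with no accepted label gets a single `NONE` row. A minimal sketch of that flattening on hypothetical records (the IDs, texts, and the helper name `to_long_rows` are made up for illustration):

```python
def to_long_rows(examples):
    """Flatten multi-label records into one row per (example, label) pair."""
    rows = []
    for example in examples:
        # Fall back to a single 'NONE' label when nothing was accepted.
        labels = example['accept'] or ['NONE']
        for label in labels:
            rows.append({'Source': example['meta']['Source'],
                         'ID': example['meta']['ID'],
                         'Label': label,
                         'Text': example['text']})
    return rows

# Hypothetical toy records.
toy_examples = [
    {'text': 'I switched pills last month. Any advice?',
     'accept': ['SHARING PERSONAL EXPERIENCES', 'SEEKING INFORMATION'],
     'meta': {'Source': 'reddit-posts', 'ID': 'toy1'}},
    {'text': 'Thanks in advance!', 'accept': [],
     'meta': {'Source': 'reddit-posts', 'ID': 'toy2'}},
]

rows = to_long_rows(toy_examples)
print(len(rows))  # two examples become three rows
```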

In [186]:
label_df['Label'].value_counts()

SHARING PERSONAL EXPERIENCES                33
NONE                                        21
SHARING/DESCRIBING ADDITIONAL RESEARCH      11
SHARING INFORMATION                         11
SHARING OPINIONS AND PREFERENCES            10
SHARING ADVICE                               6
SHARING PERSONAL BACKGROUND                  6
SHARING FUTURE PLANS                         5
META DISCUSSION                              5
SEEKING INFORMATION                          4
SEEKING EXPERIENCES                          2
SHARING SECONDHAND EXPERIENCES               2
SHARING CAUSAL REASONING / HYPOTHESIZING     2
SEEKING EMOTIONAL SUPPORT                    2
Name: Label, dtype: int64

In [187]:
label_df['Source'].value_counts()

twitter-posts      26
webmd-reviews      25
reddit-posts       24
twitter-replies    24
reddit-comments    21
Name: Source, dtype: int64

In [188]:
label_df.sample(3)

| | Source | ID | Label | Text |
|---|---|---|---|---|
| 13 | reddit-posts | dcgtjd | SHARING FUTURE PLANS | I have an implant scheduled in three days and ... |
| 70 | twitter-posts | 299305881168904200 | NONE | I cannot remember the doctors name who placed ... |
| 95 | webmd-reviews | w19433 | SHARING/DESCRIBING ADDITIONAL RESEARCH | So if anyone is wondering, I have read all of ... |


In [189]:
for i, r in label_df[label_df['Label'] == 'NONE'].sample(10).iterrows():
    print(' '.join(r['Text'].split()))

Nexplanon Contraceptive Implants The New Alternative Ways of ...: 1 day 20 hr ago View in Crawl 4.
Nope not at all!
Thanks in advance!
I'd imagine not.
I usually can't do that.
but it didn't!!!
I want to reach out and offer help, but I’m not sure how to do that.
I didn’t know how much of a whiny baby I could be.
Yeah, I don't blame you.
Not sure though as they aren't in clear packaging.


In [190]:
label_df.to_csv('/Users/maria/Documents/data/birth-control/labeling/label-discourse/labeled_by_maria.all.csv')
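
Note that `to_csv` writes the dataframe index as an unnamed first column by default, which comes back as an `Unnamed: 0` column when the file is read without `index_col`. A small round-trip sketch on a hypothetical two-row frame, using an in-memory buffer instead of the real path:

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for label_df.
toy_df = pd.DataFrame({'Label': ['SHARING ADVICE', 'SEEKING INFORMATION'],
                       'Text': ['first text', 'second text']})

buffer = io.StringIO()
toy_df.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # index column is read back as data
print(list(naive.columns))

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # treat first column as the index
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is another option when the index carries no information.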

<br><br><br><br>

---

<br><br><br><br>

# Try training a simple model

In [191]:
data_directory_path = '/Users/maria/Documents/data/birth-control'
test_df = pd.read_csv(data_directory_path + '/labeling/label-discourse/sampled-sentences.test.csv')
len(test_df)

11993

In [192]:
test_df.sample(3)

| Unnamed: 0.1 | Unnamed: 0 | text | meta |
|---|---|---|---|
| 8772 | 8772 | It doesn't work in the hormones | {'ID': 1121445910016331783, 'Source': 'twitter... |
| 7566 | 7566 | I am breaking out really bad ever since switch... | {'ID': 862147357621841922, 'Source': 'twitter-... |
| 9070 | 9070 | Whoa I wonder if that's why they didn't work f... | {'ID': 1307317228455378944, 'Source': 'twitter... |


In [193]:
len(label_df.index)

120

In [194]:
label_df.sample(3)

| | Source | ID | Label | Text |
|---|---|---|---|---|
| 42 | reddit-comments | e3t46dh | SHARING PERSONAL EXPERIENCES | After that it lightens up for me |
| 118 | webmd-reviews | w9404 | SHARING PERSONAL EXPERIENCES | I was on the pill for 6 years and I'm excited ... |
| 33 | reddit-comments | doz2rtt | NONE | Yeah, I don't blame you. |


In [195]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [196]:
def binarize_label(label, target_label):
    if label == target_label:
        return 1
    return 0

In [197]:
for _target_label in label_df['Label'].unique():

    _binarized_df = label_df.copy()
    _binarized_df['Label'] = label_df['Label'].apply(lambda x: binarize_label(x, _target_label))
    _positive_ids = _binarized_df[_binarized_df['Label'] == 1]['ID'].tolist()
    _binarized_df = _binarized_df[~((_binarized_df['ID'].isin(_positive_ids)) & (_binarized_df['Label'] == 0))]

    _binarized_df = _binarized_df.groupby('Label').sample(n=len(_binarized_df[_binarized_df['Label'] == 1]), random_state=1)

    if len(_binarized_df.index) > 50:

        _train_df, _test_df = train_test_split(_binarized_df, test_size=0.33, random_state=42)

        _train_texts = _train_df['Text']
        _train_labels = _train_df['Label']
        _test_texts = _test_df['Text']
        _test_labels = _test_df['Label']

        _vectorizer = TfidfVectorizer()
        _X_train = _vectorizer.fit_transform(_train_texts)
        _X_test = _vectorizer.transform(_test_texts)

        _model = LogisticRegression(C=10).fit(_X_train, _train_labels)
        _predictions = _model.predict(_X_test)

        print(_target_label)
        print(classification_report(_test_labels, _predictions))

SHARING PERSONAL EXPERIENCES
              precision    recall  f1-score   support

           0       1.00      0.62      0.76        13
           1       0.64      1.00      0.78         9

    accuracy                           0.77        22
   macro avg       0.82      0.81      0.77        22
weighted avg       0.85      0.77      0.77        22
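
To make the report easier to read, the numbers can be reproduced by hand. The confusion counts below are not printed by the notebook; they are inferred from the supports (13 and 9) and recalls (0.62 and 1.00) in the report, so treat them as an assumption.

```python
# Inferred confusion counts for the positive class (label 1); an assumption
# derived from the supports and recalls in the report above.
tn, fp, fn, tp = 8, 5, 0, 9

precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# For the negative class (label 0), the roles of the counts swap.
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print('class 0:', round(precision_neg, 2), round(recall_neg, 2), round(f1_neg, 2))
print('class 1:', round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
print('accuracy:', round(accuracy, 2))
```

Under these counts the model never misses a positive example but labels 5 of the 13 negatives as positive, which is why recall for class 1 is perfect while its precision is only 0.64.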



In [198]:
_binarized_df['Label'].value_counts()

0    5
1    5
Name: Label, dtype: int64
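
Only one label produced a report because the balanced sample holds twice as many rows as there are positives, and the loop skips any label whose balanced sample has 50 rows or fewer. The `_binarized_df` inspected above is simply what the last loop iteration left behind. A sketch of the arithmetic, using the counts from `label_df['Label'].value_counts()` (the names `MIN_ROWS` and `eligible` are illustrative):

```python
import math

# Counts copied from label_df['Label'].value_counts() earlier in the notebook.
label_counts = {
    'SHARING PERSONAL EXPERIENCES': 33,
    'NONE': 21,
    'SHARING/DESCRIBING ADDITIONAL RESEARCH': 11,
    'SHARING INFORMATION': 11,
    'SHARING OPINIONS AND PREFERENCES': 10,
    'SHARING ADVICE': 6,
    'SHARING PERSONAL BACKGROUND': 6,
    'SHARING FUTURE PLANS': 5,
    'META DISCUSSION': 5,
    'SEEKING INFORMATION': 4,
    'SEEKING EXPERIENCES': 2,
    'SHARING SECONDHAND EXPERIENCES': 2,
    'SHARING CAUSAL REASONING / HYPOTHESIZING': 2,
    'SEEKING EMOTIONAL SUPPORT': 2,
}
MIN_ROWS = 50  # threshold used in the training loop

# Balanced sampling keeps n positives and n negatives, i.e. 2 * n rows.
eligible = [label for label, n_pos in label_counts.items() if 2 * n_pos > MIN_ROWS]
print(eligible)

# With 66 balanced rows and test_size=0.33, the held-out set has 22 rows,
# consistent with the total support in the report.
n_test = math.ceil(0.33 * 2 * label_counts['SHARING PERSONAL EXPERIENCES'])
print(n_test)
```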

In [199]:
def process_string(text):
    text = text.lower()
    text = re.sub('[0-9]+', 'NUM', text)
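    # pad each punctuation character with spaces so it becomes its own token;
    # the capturing group and raw replacement string make \1 refer to the match
    text = re.sub(r'([^\sA-Za-z0-9À-ÖØ-öø-ÿЀ-ӿ/])', r' \1 ', text)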
    text = ' '.join(text.split())
    return text

In [200]:
t = process_string('Does this work? Hmmm,how about this???')
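
The cell above stores the result without displaying it. As a sketch of the intended behavior, the standalone function below (named `pad_punctuation_sketch` here, a hypothetical name) lowercases the text, replaces digit runs with `NUM`, and pads each punctuation mark with spaces by using a capturing group and a raw replacement string.

```python
import re

def pad_punctuation_sketch(text):
    """Lowercase, mask numbers, and split punctuation into separate tokens."""
    text = text.lower()
    text = re.sub('[0-9]+', 'NUM', text)
    # The capturing group lets \1 re-insert the matched punctuation mark.
    text = re.sub(r'([^\sA-Za-z0-9À-ÖØ-öø-ÿЀ-ӿ/])', r' \1 ', text)
    return ' '.join(text.split())

example = pad_punctuation_sketch('Does this work? Hmmm,how about this???')
print(example)  # does this work ? hmmm , how about this ? ? ?

masked = pad_punctuation_sketch('It took 12 days')
print(masked)  # it took NUM days
```

Since `TfidfVectorizer` with default settings only keeps tokens of two or more word characters, the padded punctuation mainly matters if a different tokenizer is swapped in later.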

In [201]:
for _target_label in label_df['Label'].unique():

    _binarized_df = label_df.copy()
    _binarized_df['Label'] = label_df['Label'].apply(lambda x: binarize_label(x, _target_label))
    _positive_ids = _binarized_df[_binarized_df['Label'] == 1]['ID'].tolist()
    _binarized_df = _binarized_df[~((_binarized_df['ID'].isin(_positive_ids)) & (_binarized_df['Label'] == 0))]

    _binarized_df = _binarized_df.groupby('Label').sample(n=len(_binarized_df[_binarized_df['Label'] == 1]), random_state=1)

    if len(_binarized_df.index) > 50:

        _train_texts = _binarized_df['Text']
        _train_labels = _binarized_df['Label']

        _test_texts = test_df['text']

        _train_texts_processed = [process_string(t) for t in _train_texts]
        _test_texts_processed  = [process_string(t) for t in _test_texts]

        _vectorizer = TfidfVectorizer()
        _X_train = _vectorizer.fit_transform(_train_texts_processed)
        _X_test = _vectorizer.transform(_test_texts_processed)

        _model = LogisticRegression(C=10).fit(_X_train, _train_labels)
        _predictions = _model.predict(_X_test)

        print('---------------------------------')
        print(_target_label)
        print('---------------------------------')
        print()

        _positive_texts = [_text for _prediction, _text in zip(_predictions, _test_texts) if _prediction == 1]
        _negative_texts = [_text for _prediction, _text in zip(_predictions, _test_texts) if _prediction == 0]

        print('POSITIVE')
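        # sample at most 10 texts so this does not fail when fewer were predicted positive
        for _text in random.sample(_positive_texts, min(10, len(_positive_texts))):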
            print(' '.join(_text.split()))
        
        print()


---------------------------------
SHARING PERSONAL EXPERIENCES
---------------------------------

POSITIVE
Derek Skees has no idea how my IUD works.
So true, I had the nexplanon implant and it was the most torturous thing for my hormones and mental health
- I had my Implanon (Birth control)...
I like the fact that I don’t have to do anything and it’s minimally invasive.
i got my nexplanon replaced in december and had my period for three months straight y’all 3 day people are so blessed
and it worked for a little while.
The birth control pill was invented 50 years ago today...
: IED's - I heard IUD's too though.
She always having a damn dream too.
I had a IUD first and it fucked me up physically and mentally.

