<p> <font face = 'Nunito'>

## Data Manipulation </font></p>

<u>**requirements:**</u> 

* comments_raw.csv
* submissions.csv
* users.csv
* ref_sample_raw.csv
* comments_df_label.csv (comments classified with Davidson labels)
* submissions_label.csv (submissions classified with Davidson labels)

<u>**generates:**</u>
* comments.csv
* submissions.csv
* users.csv
* ref_sample.csv
* comments_sample.csv
* submissions_sample.csv
* users_sample.csv

<u>**manipulations:**</u>
* convert date
    * eliminate comments from before and after January
* drop duplicate comments
* drop comments created by bots
* filter out comments containing the most frequent promotion (spam)
* get gender
* get detoxify scores
* eliminate foreign language comments
* merge Davidson classification
* create samples
* get gender for January sample


##### Import libraries

In [None]:
!pip install gensim==4.1.2
!pip install detoxify



In [None]:
import pandas as pd
import numpy as np
from orange_functions import *
import datetime as dt

import re
import nltk

# !pip install transformers==4.17.0
from detoxify import Detoxify

from tqdm import tqdm

RANDOM_SEED = 697

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
nltk.download('stopwords')
stopwords=nltk.corpus.stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Load data

In [None]:
c = pd.read_csv('data/processed/comments_whole.csv')

In [None]:
comments_raw_path = 'data/raw/comments.csv'
submissions_path = 'data/raw/submissions.csv'
users_path = 'data/raw/users.csv'
jan_sample = 'data/raw/reference.csv'
c_davidson_path = 'data/interim/comments_df_label.csv'
s_davidson_path = 'data/interim/submissions_label.csv'

In [None]:
c = pd.read_csv(comments_raw_path)
s = pd.read_csv(submissions_path)
u = pd.read_csv(users_path)
jan_sample = pd.read_csv(jan_sample)
c_davidson_label = pd.read_csv(c_davidson_path)
s_davidson_label = pd.read_csv(s_davidson_path)

#### 1. Convert dates

In [None]:
c['created'] = pd.to_datetime(c.created_utc, unit='s')
c['retrieved'] = pd.to_datetime(c.retrieved_utc, unit='s')

s['created'] = pd.to_datetime(s.created_utc, unit='s')
u['created'] = pd.to_datetime(u.created_utc, unit='s')

#### 1b. Drop comments from before and after January

In [None]:
# trim the buffer on both ends so the final dataset contains only January comments
# c = c[c.created.dt.month==1]
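A minimal sketch of what the conversion and the (currently commented-out) January filter do, using a toy frame. The timestamps below are made up for illustration and are not taken from the dataset. Note that filtering on `dt.month` alone does not check the year.

```python
import pandas as pd

# Toy frame with hypothetical timestamps (seconds since the Unix epoch, UTC)
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'created_utc': [1640908800,   # 2021-12-31 00:00:00
                    1641038400,   # 2022-01-01 12:00:00
                    1643760000],  # 2022-02-02 00:00:00
})

# unit='s' tells pandas the integers are seconds rather than nanoseconds
toy['created'] = pd.to_datetime(toy.created_utc, unit='s')

# Keep only rows created in January (of any year)
january_only = toy[toy.created.dt.month == 1]
print(january_only.id.tolist())
```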

#### 2. Drop duplicate comments

In [None]:
c = c[~c.body.duplicated()]

#### 3. Drop comments created by bots 

In [None]:
c = c[(~c.body.duplicated()) & 
            (~c.body.str.contains(r'I.{,4}am.{,4}a.{,4}bot', regex=True)) & 
            (~c.body.str.contains('cumalloverus'))
            ]
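To see what the bot pattern catches, here is a small sketch that runs the same regular expression over a few invented strings. The example sentences are assumptions for illustration only. The pattern allows up to four arbitrary characters between the words, so markdown emphasis is tolerated, but it is case-sensitive and does not match the contraction "I'm".

```python
import re

# Same pattern as in the filter above
BOT_PATTERN = re.compile(r'I.{,4}am.{,4}a.{,4}bot')

# Invented examples, not taken from the dataset
examples = [
    "I am a bot, and this action was performed automatically.",
    "I'm a bot",
    "I **am** a *bot*.",
    "I think I am about to quit",
]

# pandas' str.contains uses a regex search, so re.search mirrors it
flags = [bool(BOT_PATTERN.search(text)) for text in examples]
for text, flag in zip(examples, flags):
    print(flag, '|', text)
```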

#### 4. Filter out comments containing the most frequent promotion (spam)

In [None]:
c = c[~c.body.str.contains('cumalloverus')]

##### New submissions.csv and users.csv 
Comments have been trimmed down and submissions and users need to be edited accordingly.

In [None]:
s = s.set_index('id').loc[list(c.submission_id.unique())].reset_index()
u = u.set_index('name').loc[list(c.author.unique())].reset_index()
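One thing to watch with the `.loc[list(...)]` approach: recent pandas versions raise a `KeyError` if any requested label is missing. Whether that can happen here (for example, an author that has no row in `users.csv`) is not known from this notebook, so the sketch below is only an illustration on toy data of an `isin`-based alternative that tolerates missing keys. Unlike `.loc`, it keeps the row order of the table being filtered.

```python
import pandas as pd

# Toy data; names and values are hypothetical
comments = pd.DataFrame({'submission_id': ['s1', 's1', 's3'],
                         'author': ['ann', '[deleted]', 'bob']})
users = pd.DataFrame({'name': ['ann', 'bob', 'cy'], 'karma': [1, 2, 3]})

# .loc with a list of labels fails when a label ('[deleted]') is absent
try:
    users.set_index('name').loc[list(comments.author.unique())]
    loc_failed = False
except KeyError:
    loc_failed = True

# isin simply keeps the rows whose key appears among the comment authors
kept = users[users.name.isin(comments.author.unique())].reset_index(drop=True)
print(loc_failed, kept.name.tolist())
```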

#### Extract valid submission selftext and valid submission title for next steps

In [None]:
s_selftext = s[(s.selftext != '[removed]') & 
                (s.selftext != '[deleted]') & 
                (s.selftext.notnull())].loc[:, ['id', 'selftext']]

s_title = s[(s.title != '[ Removed by Reddit ]') & 
            (s.title != '[deleted by user]') & 
            (s.title.notnull())].loc[:,['id', 'title']]

#### 5. Get gender

In [None]:
c['gender'] = c.body.apply(find_gender)

In [None]:
s_selftext['selftext_gender'] = s_selftext.selftext.apply(find_gender)
s_title['title_gender'] = s_title.title.apply(find_gender)

#### 6. Get Detoxify scores

In [None]:
def get_detoxify(df, col_to_classify):
    """Score every row of df[col_to_classify] with Detoxify's 'unbiased' model."""
    model = Detoxify('unbiased')
    
    # Fallback scores for rows the model cannot process
    nandict = {'toxicity': np.nan,
            'severe_toxicity': np.nan,
            'obscene': np.nan,
            'identity_attack': np.nan,
            'insult': np.nan,
            'threat': np.nan,
            'sexual_explicit': np.nan}

    # Earlier filtering leaves gaps in the index, so iterate over values
    # instead of looking rows up by label
    rows = []
    for text in tqdm(df[col_to_classify].tolist()):
        try:
            rows.append(model.predict(text))
        except Exception:
            rows.append(nandict)

    # Build the frame once and expose 'id' as a column for the merges below
    tpd = pd.DataFrame(rows)
    tpd.insert(0, 'id', df['id'].tolist())
    return tpd

In [None]:
c_detoxify = get_detoxify(c, 'body')
s_selftext_detoxify = get_detoxify(s_selftext, 'selftext')
s_title_detoxify = get_detoxify(s_title, 'title')
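Scoring one text per call can be slow on a large table. Detoxify's `predict` also accepts a list of strings and then returns a dict of lists, so batching is one possible speed-up. The sketch below is an assumption-laden illustration, not the notebook's method: `predict_in_batches` and `fake_predict` are hypothetical names, and the stand-in predictor is used so the example runs without downloading the model. Batching gives up the per-row error handling, so non-string values would need to be removed first.

```python
def predict_in_batches(texts, predict, batch_size=32):
    """Call predict on slices of texts and stitch the per-label lists together."""
    scores = {}
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        result = predict(batch)  # expected shape: {label: [score, ...]}
        for label, values in result.items():
            scores.setdefault(label, []).extend(values)
    return scores


# Stand-in for Detoxify('unbiased').predict (hypothetical, for illustration)
def fake_predict(batch):
    return {'toxicity': [float(len(text)) for text in batch]}


scores = predict_in_batches(['a', 'bb', 'ccc'], fake_predict, batch_size=2)
print(scores)
```

With the real model, `predict` would be `Detoxify('unbiased').predict`, and the resulting dict could be passed to `pd.DataFrame` alongside the `id` column.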

In [None]:
c = c.merge(c_detoxify, on='id', how='left')
s = s.merge(s_selftext.loc[:, ['id', 'selftext_gender']], on='id', how='left')
s = s.merge(s_title.loc[:, ['id', 'title_gender']], on='id', how='left')
s = s.merge(s_selftext_detoxify, on='id', how='left')
s = s.merge(s_title_detoxify, on='id', how='left')

#### 7. Eliminate foreign language comments

In [None]:
# A stopword share of 10-15% seems ideal for
# preserving potentially valuable multilingual posts;
# the threshold is set close to 10% to err on the side of caution
threshold = 0.11
c['english'] = filter_foreign_language_comments(c.body)
c = c[c.english > threshold]
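`filter_foreign_language_comments` comes from `orange_functions` and its implementation is not shown here. Judging from the threshold comment, it appears to return the share of English stopwords per comment. The sketch below shows one way such a score could be computed; `stopword_ratio` and the tiny stopword set are hypothetical stand-ins, not the project's actual helper.

```python
import re

# Tiny stand-in list; the notebook loads the full NLTK English stopword list
STOPWORDS = {'the', 'is', 'a', 'of', 'and', 'to', 'in', 'it'}


def stopword_ratio(text, stopwords=STOPWORDS):
    """Return the fraction of tokens in text that are English stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in stopwords for token in tokens) / len(tokens)


threshold = 0.11
english = stopword_ratio("The cat is in the garden")
spanish = stopword_ratio("El gato está en el jardín")
print(english > threshold, spanish > threshold)
```

Very short English comments with no stopwords ("lol", "nice") would also fall below the threshold under this scheme, which is a trade-off to keep in mind.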

#### 8. Merge Davidson classification

In [None]:
# merge returns a new DataFrame, so the result has to be assigned back
c = c.merge(c_davidson_label.loc[:, ['id', 'label']], on='id', how='left')
s = s.merge(s_davidson_label.loc[:, ['id', 'selftext_label', 'title_label']], on='id', how='left')
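`DataFrame.merge` never modifies the frame it is called on; it returns a new one. The toy example below (made-up ids and labels) illustrates this, and also shows the optional `validate` argument, which raises if the join keys are not unique. Whether `id` is unique in the Davidson label files is an assumption that would need checking.

```python
import pandas as pd

# Toy frames with hypothetical values
comments = pd.DataFrame({'id': ['c1', 'c2'], 'body': ['hi', 'yo']})
labels = pd.DataFrame({'id': ['c1'], 'label': [2]})

# Calling merge without assignment leaves the original untouched
comments.merge(labels, on='id', how='left')
unchanged = list(comments.columns)

# Assign the result; validate guards against accidental row duplication
merged = comments.merge(labels, on='id', how='left', validate='one_to_one')
print(unchanged, list(merged.columns))
```

Rows without a matching label end up with `NaN` in the `label` column after a left join.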

In [None]:
# relative paths, consistent with the other reads and writes in this notebook
c.to_csv('data/processed/comments.csv', index=False)
s.to_csv('data/processed/submissions.csv', index=False)
u.to_csv('data/processed/users.csv', index=False)

#### 9. Create samples for GitHub

In [None]:
# cs = c.sample(15000, random_state=RANDOM_SEED)
# cs.to_csv('data/processed/comments_sample.csv')

In [None]:
# ss = s.set_index('id').loc[list(cs.submission_id.unique())].reset_index()
# ss.to_csv('data/processed/submissions_sample.csv')

# us = u.set_index('name').loc[list(cs.author.unique())].reset_index()
# us.to_csv('data/processed/users_sample.csv')

In [None]:
# cs.loc[:, ['all_awardings', 'author', 'author_flair_type', 'author_fullname',
#        'author_premium', 'body', 'body_sha1', 'controversiality',
#        'created_utc', 'distinguished', 'edited', 'gildings', 'id',
#        'is_submitter', 'link_id', 'parent_id', 'permalink', 'retrieved_utc',
#        'score', 'status', 'subreddit', 'subreddit_id',
#        'subreddit_name_prefixed', 'subreddit_type', 'updated_body']
       
#        ].to_csv('data/processed/comments_raw_sample.csv')

### Reference sample
#### 10. Convert date

In [None]:
jan = pd.read_csv('data/raw/reference.csv')

In [None]:
jan['created'] = pd.to_datetime(jan.created_utc, unit='s')

#### 11. Drop duplicates, filter out bots, and filter out the outlier spammer found in the primary dataset

In [None]:
jan = jan[(~jan.body.duplicated()) & 
            (~jan.body.str.contains(r'I.{,4}am.{,4}a.{,4}bot', regex=True)) & 
            (~jan.body.str.contains('cumalloverus'))
            
            ]

#### 12. Get gender

In [None]:
jan['gender'] = jan.body.apply(find_gender)

In [None]:
jan.gender.value_counts()

none      28797
male       8675
female     3025
both       2585
Name: gender, dtype: int64

In [None]:
jan.to_csv('data/processed/reference.csv', index=False)
