# ISYE6740 ML Project Spring 2021
## Classifying Twitter Bots using ML
### Katherine Barthelson, Elizabeth Yates, and Dan Tylutki
This notebook handles data preprocessing, which includes data cleaning and feature enhancement. The original data features and their descriptions are provided for reference:

- author (object): the username of the account that created the post
- author_verified (float64): flag that shows if author has verified their account via email (1) or not (0)
- author_comment_karma (float64): the total karma that the author has received from other users on comments minus karma given
- author_link_karma (float64): the link karma for the user
- is_submitter (float64): Whether or not the comment author is also the author of the submission.
- body (object): the text of the comment
- num_comments (float64): the number of comments replying to this comment
- banned_by (float64): moderator that the subject user was banned by
- no_follow (float64): FROM WIKIPEDIA: the nofollow setting allows web site authors to indicate that the presence of a link is not an endorsement of the target site's importance.
- link_id (object): The submission ID that the comment belongs to.
- gilded (float64): Whether the user has been gifted a Reddit Gold award for their comment by another user.
- created_utc (float64): Time the comment was created, represented in Unix Time.
- score (float64): The number of upvotes for the comment.
- over_18 (float64): Whether the post is NSFW (1) or not graphic (0)
- ups (float64): The number of upvotes on the comment.
- downs (float64): The number of downvotes on the comment.
- num_reports (float64): The number of times other users have reported the comment.
- controversiality (float64): indicates if the post is controversial (1) or not (0) as determined by a roughly even ratio of upvotes to downvotes
- quarantine (float64): Whether the comment is quarantined (1) or not (0). Quarantined content is that which has been deemed extremely offensive or upsetting to the average redditor.
- is_bot (bool): Flag that categorizes the user as a bot (1) or not (0)
- is_troll (bool): Flag that categorizes the user as a troll (1) or not (0)

# Imports

In [1]:
### Load all necessary libraries
import string
import numpy as np
import pandas as pd
import emoji

# Load Data

In [10]:
## Import the dataset
### update the file path below to where you have data stored locally
### I did not put data in our Git project to avoid slow pulls/pushes
file_path = 'reddit_bot_train_unclean.csv'
data = pd.read_csv(file_path)
display(data.head())

Unnamed: 0,author,author_verified,author_comment_karma,author_link_karma,is_submitter,body,num_comments,banned_by,no_follow,link_id,...,created_utc,score,over_18,ups,downs,num_reports,controversiality,quarantine,is_bot,is_troll
0,ADHDbot,0.0,-6.0,1.0,0.0,"As per the rules in the side bar, yes or no qu...",1.0,,1.0,t3_2l5szg,...,1415027000.0,1.0,0.0,1.0,0.0,,0.0,0.0,True,False
1,ADHDbot,0.0,-6.0,1.0,0.0,Meme and image posts are not allowed on this s...,1.0,,1.0,t3_2l61gs,...,1415032000.0,1.0,0.0,1.0,0.0,,0.0,0.0,True,False
2,ADHDbot,0.0,-6.0,1.0,0.0,"As per the rules in the side bar, yes or no qu...",1.0,,1.0,t3_2l7ma8,...,1415060000.0,1.0,0.0,1.0,0.0,,0.0,0.0,True,False
3,ADHDbot,0.0,-6.0,1.0,0.0,"As per the rules in the side bar, yes or no qu...",1.0,,1.0,t3_2l7t5h,...,1415064000.0,1.0,0.0,1.0,0.0,,0.0,0.0,True,False
4,ADHDbot,0.0,-6.0,1.0,0.0,We cannot and will not diagnose anyone. You n...,1.0,,1.0,t3_2l900k,...,1415096000.0,1.0,0.0,1.0,0.0,,0.0,0.0,True,False


In [24]:
n_original = len(data)
n_original

834132

# Clean Data
## Data Cleaning Functions

In [2]:
## Define helper functions to classify text-correction behaviors for individual tokens
## Functions adapted from CSE6242 Project
import logging

logger = logging.getLogger(__name__)

## NOTE: twitter_artifacts and normalise_custom are referenced below but are never defined in this
## notebook. The two definitions that follow are placeholders (assumptions) so that the cell runs;
## swap in the real definitions if they are available.
twitter_artifacts = set()  # placeholder: no artifacts are filtered

def normalise_custom(token):
    return [token]  # placeholder: pass the token through unchanged

def is_hashtag(token):
    return token.startswith("#")

def normalize_hashtag(token):
    return token  # don't change it

def is_link(token):
    return token.startswith(("http://", "https://"))

def normalize_link(token):
    return token  # don't change it, Google NLP uses links in topic categorization

def is_mention(token):
    return token.startswith("@")

def normalize_mention(token):
    return ""  # remove mentions

def is_artifact(token, is_first_token):
    return is_first_token and (token in twitter_artifacts)

def normalize_artifact(token):
    return ""

def remove_punctuation(token: str):
    # map every punctuation character to a space
    replacement_dict = {key: " " for key in string.punctuation}
    replacement_table = str.maketrans(replacement_dict)
    return token.translate(replacement_table)

def tokenize_tweet(text):
    return text.split()

def normalize_text(tweet_text):
    tokens = tokenize_tweet(tweet_text)
    norm_tweet = []  # all the normed tokens from each token in the tweet

    is_first_token = True
    for token in tokens:
        norm_tokens = []  # the normed token(s) from this token in the tweet
        try:
            if is_hashtag(token):
                norm_tokens.append(normalize_hashtag(token))
            elif is_mention(token):
                norm_tokens.append(normalize_mention(token))
            elif is_link(token):
                norm_tokens.append(normalize_link(token))
            else:
                # if not a hashtag, mention or link, remove any punctuation and process the token(s)
                token_no_punc = remove_punctuation(token)
                sub_tokens = tokenize_tweet(token_no_punc)
                for sub_token in sub_tokens:
                    if is_artifact(token, is_first_token):
                        norm_tokens.append(normalize_artifact(sub_token))
                    else:
                        if token != "":
                            norm_tokens += normalise_custom(sub_token)
                            # if normalise just expanded the letters of an abbreviation then use the original token
                            if "".join(norm_tokens).replace(" ", "") == token:
                                norm_tokens = [token]
        except Exception as e:
            logger.debug(f"Exception encountered. Token: {token} Exception: {e}")
            norm_tokens = [token]  # use the original token if we failed to normalize it

        norm_tweet += norm_tokens
        is_first_token = False

    # rejoin the tokens to re-form the Tweet
    return " ".join(norm_tweet).strip()

def replace_bad_chars(text):
    """Replace characters that are uncommon or create difficulties for text processing
    with accepted equivalents."""
    # the parameter is named text so that it does not shadow the imported string module
    for char in ["’", "`", "‚", "‘"]:
        text = text.replace(char, "'")
    for char in ["“", "”", "„"]:
        text = text.replace(char, '"')
    for char in ["˜ "]:
        text = text.replace(char, "~")
    for char in ["›"]:
        text = text.replace(char, ">")
    for char in ["‹"]:
        text = text.replace(char, "<")
    for char in ["ˆ"]:
        text = text.replace(char, "^")
    return text

def convert_bool_to_int(value):
    value_str = str(value).lower()
    if value_str == 'true':
        value = int(1)
    elif value_str == 'false':
        value = int(0)
    else:
        value = int(value)
    return value
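
The single-character replacements in `replace_bad_chars` could also be done in one pass with a translation table. This is a minimal sketch rather than the notebook's method: the names `TYPOGRAPHIC_MAP` and `normalize_typography` are hypothetical, and the two-character `"˜ "` pattern is left out because `str.translate` only accepts single-character keys.

```python
# Hypothetical helper (not part of the original notebook): one-pass
# replacement of typographic characters using str.translate.
TYPOGRAPHIC_MAP = str.maketrans({
    "\u2019": "'",   # right single quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u201a": "'",   # single low-9 quotation mark
    "`": "'",        # backtick
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u201e": '"',   # double low-9 quotation mark
    "\u203a": ">",   # single right-pointing angle quotation mark
    "\u2039": "<",   # single left-pointing angle quotation mark
    "\u02c6": "^",   # modifier letter circumflex accent
})


def normalize_typography(text: str) -> str:
    """Replace curly quotes and similar characters with ASCII equivalents."""
    return text.translate(TYPOGRAPHIC_MAP)


sample = "\u201cIt\u2019s fine\u201d"
cleaned = normalize_typography(sample)
print(cleaned)
```

A translation table scans each string once, which may matter when the function is applied to several hundred thousand comment bodies.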

## Drop Duplicates

In [25]:
data.drop_duplicates(inplace=True, ignore_index=True)

In [27]:
n_after_dedupe = len(data)
print("Current number of data points:", n_after_dedupe)
print("Duplicate records dropped:", n_original - n_after_dedupe)

Current number of data points: 762896
Duplicate records dropped: 71236


## Drop or replace nulls

In [34]:
### Remove columns that are mostly null
for c in list(data.columns):  # iterate over a copy because columns are dropped inside the loop
    null_proportion = data[c].isnull().mean()  # fraction of rows where this column is null
    if null_proportion > 0.5:
        data.drop(columns=c, inplace=True)
        print(c, "---", null_proportion, "--- COLUMN DROPPED")
    else:
        print(c, "---", null_proportion)

author --- 0.0
author_verified --- 0.0013094838614961934
author_comment_karma --- 0.0
author_link_karma --- 0.0
is_submitter --- 0.0
body --- 7.864767936914074e-06
num_comments --- 9.044483127451186e-05
banned_by --- 1.0 --- COLUMN DROPPED
no_follow --- 0.0
link_id --- 0.0
gilded --- 0.0
created_utc --- 0.0
score --- 0.0
over_18 --- 0.0
ups --- 0.0
downs --- 0.0
num_reports --- 1.0 --- COLUMN DROPPED
controversiality --- 0.0
quarantine --- 0.0
is_bot --- 0.0
is_troll --- 0.0
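
The same null check can be written without an explicit loop. The sketch below runs on a small made-up frame (the `toy` data is illustrative, not taken from the dataset) and assumes the same 0.5 threshold used above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Reddit data (values are illustrative only).
toy = pd.DataFrame({
    "author": ["a", "b", "c", "d"],
    "banned_by": [np.nan, np.nan, np.nan, np.nan],
    "num_comments": [1.0, np.nan, 3.0, 4.0],
})

# isnull() yields booleans; the column-wise mean is the null proportion.
null_proportions = toy.isnull().mean()

# Drop every column whose null proportion exceeds 0.5.
mostly_null = null_proportions[null_proportions > 0.5].index
toy_reduced = toy.drop(columns=mostly_null)

print(null_proportions)
print(list(toy_reduced.columns))
```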


In [41]:
### num_comments never takes the value 0, so we can reasonably assume that NaN marks comments with 0 replies
### and convert those nulls to 0.
data.num_comments.value_counts()

1.0        44662
2.0        21694
3.0        18074
4.0        17792
5.0        16809
           ...  
11915.0        1
22409.0        1
11912.0        1
11902.0        1
4233.0         1
Name: num_comments, Length: 10902, dtype: int64

In [42]:
### Convert NaN values in num_comments field to 0
data['num_comments'] = data['num_comments'].fillna(0)  # assign back instead of a chained inplace fill

In [52]:
### drop remaining rows with null values
data.dropna(axis=0, inplace=True)

In [53]:
### Check number of records removed
n_after_null_drops = len(data)
print("Number of rows after dropping nulls:", n_after_null_drops)
print("Number of records with nulls dropped:", n_after_dedupe - n_after_null_drops)

Number of rows after dropping nulls: 761891
Number of records with nulls dropped: 1005


## Remove fields with little analytical value
The following fields will be removed:
- link_id: the link to the submission the comment was made on will not provide much value unless we use it to go gather more data on that submission, which we will not be doing.

In [57]:
data.drop(columns="link_id", inplace=True)

## Convert data types

### Convert label fields from boolean to int

In [62]:
data.is_bot.value_counts()

False    572642
True     189249
Name: is_bot, dtype: int64

In [63]:
data.is_troll.value_counts()

False    755338
True       6553
Name: is_troll, dtype: int64

In [66]:
for c in ['is_troll', 'is_bot']:
    data[c] = data[c].apply(lambda x: convert_bool_to_int(x))

In [67]:
data.is_bot.value_counts()

0    572642
1    189249
Name: is_bot, dtype: int64

In [68]:
data.is_troll.value_counts()

0    755338
1      6553
Name: is_troll, dtype: int64

### Convert all float fields to int
None of these original fields need to be floats as all values represent whole numbers or binary flags

In [76]:
for c in data.columns:
    if data[c].dtype == float:
        # request int64 explicitly; plain astype(int) can give int32 on some platforms
        data[c] = data[c].astype('int64')

### Ensure object fields contain only strings
Only fields remaining that are not int should be str ("author" and "body"). Note that after conversion from object type to str type, these fields still appear as object type when running data.dtypes

In [89]:
for c in data.columns:
    if data[c].dtype == object:
        data[c] = data[c].astype(str)

In [90]:
data.dtypes

author                  object
author_verified          int64
author_comment_karma     int64
author_link_karma        int64
is_submitter             int64
body                    object
num_comments             int64
no_follow                int64
gilded                   int64
created_utc              int64
score                    int64
over_18                  int64
ups                      int64
downs                    int64
controversiality         int64
quarantine               int64
is_bot                   int64
is_troll                 int64
dtype: object

## Other Data Cleaning

In [82]:
### Replace bad characters in comment body
data['body'] = data['body'].apply(lambda x: replace_bad_chars(x))

In [83]:
### Replace emojis with text representations: e.g. the rocket emoji becomes ":rocket:" (without quotes)
data['body'] = data['body'].apply(lambda x: emoji.demojize(x))

In [84]:
### Check for posts where author is deleted
data[data.author == '[deleted]']

Unnamed: 0,author,author_verified,author_comment_karma,author_link_karma,is_submitter,body,num_comments,no_follow,gilded,created_utc,score,over_18,ups,downs,controversiality,quarantine,is_bot,is_troll


In [85]:
### Check for posts where comment text has been removed
data[data.body == '[removed]']

Unnamed: 0,author,author_verified,author_comment_karma,author_link_karma,is_submitter,body,num_comments,no_follow,gilded,created_utc,score,over_18,ups,downs,controversiality,quarantine,is_bot,is_troll


In [18]:
### leave out comments where body is only space
data = data[~data.body.str.isspace()]

## Save cleaned data

In [92]:
# clean_data_path = 'C:/Users/Dan/Documents/GT/ISYE6740/project/data/reddit_bot_train_clean.csv'
# data.to_csv(clean_data_path, index=False)

# Feature Enhancement
New features to create:
- author_%_is_submitter (float): the proportion of the author's comments that are posted on their own submission
- author_avg_num_comments (float): the average number of comments that a user receives on their comments
- author_%_no_follow (float): the proportion of the author's comments where nofollow attribute is set to True
- author_%_gilded (float): the proportion of the author's comments that are gilded
- author_avg_score (float): the average score of the author's comments
- author_%_over_18 (float): the proportion of comments by the author that are marked NSFW
- author_avg_ups (float): the average number of up-votes that the user gets per comment
- author_avg_downs (float): the average number of down-votes that the user gets per comment
- author_%_controversiality (float): the proportion of the author's comments that are controversial
- author_%_quarantine (float): the proportion of the author's comments that have been quarantined
- sentiment (int): the predicted sentiment of the comment. 1 == Positive, 0 == Neutral, and -1 == Negative.
- author_avg_sentiment (float): the average sentiment of the user's recent comments
- author_avg_comment_similarity (float): the average similarity score of the user's recent comments

## Additional Imports

In [3]:
import subprocess
import sys
from itertools import combinations

import spacy
from tqdm import tqdm
from transformers import pipeline

model = 'en_core_web_md'
try:
    nlp = spacy.load(model)
except OSError:  # raised by spaCy when the model package is not installed
    subprocess.run([sys.executable, "-m", "spacy", "download", model], check=True)
    nlp = spacy.load(model)

sentiment = pipeline('sentiment-analysis')

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [4]:
file_path = 'reddit_bot_train_clean.csv'
data = pd.read_csv(file_path)
display(data.head(3))

Unnamed: 0,author,author_verified,author_comment_karma,author_link_karma,is_submitter,body,num_comments,no_follow,gilded,created_utc,score,over_18,ups,downs,controversiality,quarantine,is_bot,is_troll
0,ADHDbot,0,-6,1,0,"As per the rules in the side bar, yes or no qu...",1,1,0,1415026958,1,0,1,0,0,0,1,0
1,ADHDbot,0,-6,1,0,Meme and image posts are not allowed on this s...,1,1,0,1415031687,1,0,1,0,0,0,1,0
2,ADHDbot,0,-6,1,0,"As per the rules in the side bar, yes or no qu...",1,1,0,1415060465,1,0,1,0,0,0,1,0


## Functions

In [5]:
def retrieve_recent_comments(username, n=5, body_only=False):
    """Returns the n most recent comments by the given user."""
    recent_comments = data[data.author == username].sort_values('created_utc', ascending=False)
    if body_only:
        return list(recent_comments.body)[:n]
    return recent_comments.head(n)
    
def label_sentiment(text, score_threshold=0.999):
    """Uses the transformers sentiment-analysis pipeline to predict sentiment of a text then modifies the result to
    be a single number for easier analysis. The score is the amount of confidence the model has in its label. Texts
    with scores less than 0.999 are usually pretty neutral."""
    try:
        sentiment_result = sentiment(text)
        if sentiment_result[0]['score'] < score_threshold:
            return int(0)
        if sentiment_result[0]['label'] == 'POSITIVE':
            return int(1)
        else:
            return int(-1)
    except Exception:  # fall back to neutral if the pipeline raises on this text
        return int(0)
    
def calculate_comment_similarity(username):
    """Returns the average pairwise spaCy similarity of the user's most recent comments (up to 5)."""
    recent_comments = retrieve_recent_comments(username, n=5, body_only=True)
    if len(recent_comments) < 2:
        return 0.0  # fewer than two comments means there are no pairs to compare
    ### Calculate the similarity between every pair of recent comments
    comment_docs = [nlp(comment) for comment in recent_comments]
    similarities = [doc_i.similarity(doc_j) for doc_i, doc_j in combinations(comment_docs, 2)]
    ### Calculate and return the average similarity
    return sum(similarities)/len(similarities)
    
def calculate_average_recent_sentiment(username):
    ### the 'sentiment' column is needed, so fetch full rows rather than body text only
    recent_comments = retrieve_recent_comments(username, n=5, body_only=False)
    recent_sentiments = list(recent_comments['sentiment'])
    return sum(recent_sentiments)/len(recent_sentiments)

def calculate_author_average(field, username):
    field_average = data[field][data.author==username].mean()
    return field_average
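
The averaging step in `calculate_comment_similarity` can be illustrated without loading spaCy. In the sketch below, `token_jaccard` is a hypothetical stand-in for `Doc.similarity` (a simple word-overlap score, not what the notebook uses), and `average_pairwise_similarity` is likewise an illustrative name.

```python
from itertools import combinations


def token_jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): Jaccard overlap of lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def average_pairwise_similarity(texts, similarity=token_jaccard):
    """Average the similarity over every unordered pair of texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0  # zero or one text: nothing to compare
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


comments = ["no memes allowed", "no memes allowed", "please read the rules"]
print(average_pairwise_similarity(comments))
```

With five comments there are ten pairs, so an account that posts near-identical text scores close to 1.0, which is the pattern visible for several bot accounts in the averages computed later.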

In [6]:
### sentiment labeling test
print(label_sentiment("I hate this class."))
print(label_sentiment("This class is meh."))
print(label_sentiment("This class is great."))

-1
0
1


## Enhance Features

In [27]:
users = data.author.unique()

In [28]:
users

array(['ADHDbot', 'ALTcointip', 'AVR_Modbot', ..., 'lochjessmonstah',
       'rallymax', 'great_waldini'], dtype=object)

In [29]:
author_avgs = []
for user in tqdm(users):
    user_avgs = {
        'author': user,
        'author_%_is_submitter': calculate_author_average('is_submitter', user),
        'author_avg_num_comments': calculate_author_average('num_comments', user),
        'author_%_no_follow': calculate_author_average('no_follow', user),
        'author_%_gilded': calculate_author_average('gilded', user),
        'author_avg_score': calculate_author_average('score', user),
        'author_%_over_18': calculate_author_average('over_18', user),
        'author_avg_ups': calculate_author_average('ups', user),
        'author_avg_downs': calculate_author_average('downs', user),
        'author_%_controversiality': calculate_author_average('controversiality', user),
        'author_%_quarantine': calculate_author_average('quarantine', user),
        #'author_avg_sentiment': calculate_author_average('sentiment', user),
        'author_avg_comment_similarity': calculate_comment_similarity(user),
    }
    author_avgs.append(user_avgs)




 93%|████████████████████████████████████████████████████████████████████████▎     | 1245/1344 [09:24<00:48,  2.05it/s]
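
The per-author loop filters the full data frame once per field for every user. A possible alternative, sketched here on toy data, computes the same kind of column means with a single `groupby`. The `toy` frame is made up and only two of the ten averaged fields are shown; `author_avg_comment_similarity` is not covered, since it still needs the per-user spaCy comparison.

```python
import pandas as pd

# Toy comments (illustrative values only).
toy = pd.DataFrame({
    "author": ["a", "a", "b", "b", "b"],
    "is_submitter": [1, 0, 0, 0, 0],
    "score": [2, 4, 1, 1, 4],
})

# Source column -> engineered feature name, following the notebook's naming.
feature_names = {
    "is_submitter": "author_%_is_submitter",
    "score": "author_avg_score",
}

# One pass over the data: mean of each selected column per author.
author_means = (
    toy.groupby("author")[list(feature_names)]
    .mean()
    .rename(columns=feature_names)
    .reset_index()
)

print(author_means)
```

The result has one row per author and can be merged back on `author` in the same way as `author_avgs_df`.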

In [31]:
author_avgs_df = pd.DataFrame.from_dict(author_avgs, orient='columns')
author_avgs_df.head()

Unnamed: 0,author,author_%_is_submitter,author_avg_num_comments,author_%_no_follow,author_%_gilded,author_avg_score,author_%_over_18,author_avg_ups,author_avg_downs,author_%_controversiality,author_%_quarantine,author_avg_comment_similarity
0,ADHDbot,0.0,2.556886,1.0,0.0,0.971058,0.001996,0.971058,0.0,0.0,0.0,0.901214
1,ALTcointip,0.012048,258.448795,0.990964,0.0,1.294177,0.0,1.294177,0.0,0.003012,0.0,0.986591
2,AVR_Modbot,0.0,9.473684,0.842105,0.0,2.368421,0.0,2.368421,0.0,0.0,0.0,0.65063
3,A_random_gif,0.216561,376.095541,0.834395,0.0,2.401274,0.063694,2.401274,0.0,0.006369,0.0,0.897853
4,AltCodeBot,0.0,643.324074,0.953704,0.0,4.157407,0.027778,4.157407,0.0,0.0,0.0,0.997175


In [32]:
### merge author averages with data frame
data2 = data.merge(author_avgs_df, on='author')
data2.head()

Unnamed: 0,author,author_verified,author_comment_karma,author_link_karma,is_submitter,body,num_comments,no_follow,gilded,created_utc,...,author_avg_num_comments,author_%_no_follow,author_%_gilded,author_avg_score,author_%_over_18,author_avg_ups,author_avg_downs,author_%_controversiality,author_%_quarantine,author_avg_comment_similarity
0,ADHDbot,0,-6,1,0,"As per the rules in the side bar, yes or no qu...",1,1,0,1415026958,...,2.556886,1.0,0.0,0.971058,0.001996,0.971058,0.0,0.0,0.0,0.901214
1,ADHDbot,0,-6,1,0,Meme and image posts are not allowed on this s...,1,1,0,1415031687,...,2.556886,1.0,0.0,0.971058,0.001996,0.971058,0.0,0.0,0.0,0.901214
2,ADHDbot,0,-6,1,0,"As per the rules in the side bar, yes or no qu...",1,1,0,1415060465,...,2.556886,1.0,0.0,0.971058,0.001996,0.971058,0.0,0.0,0.0,0.901214
3,ADHDbot,0,-6,1,0,"As per the rules in the side bar, yes or no qu...",1,1,0,1415064198,...,2.556886,1.0,0.0,0.971058,0.001996,0.971058,0.0,0.0,0.0,0.901214
4,ADHDbot,0,-6,1,0,We cannot and will not diagnose anyone. You n...,1,1,0,1415096133,...,2.556886,1.0,0.0,0.971058,0.001996,0.971058,0.0,0.0,0.0,0.901214


In [None]:
col_order = [
    'author',
    'author_verified',
    'author_comment_karma',
    'author_link_karma',
    'author_%_is_submitter',
    'author_avg_num_comments',
    'author_%_no_follow',
    'author_%_gilded',
    'author_avg_score',
    'author_%_over_18',
    'author_avg_ups',
    'author_avg_downs',
    'author_%_controversiality',
    'author_%_quarantine',
    'author_avg_comment_similarity',
    'is_submitter',
    'body',
    'num_comments',
    'no_follow',
    'gilded',
    'created_utc',
    'score',
    'over_18',
    'ups',
    'downs',
    'controversiality',
    'quarantine',
    'is_bot',
    'is_troll'
]

In [None]:
data2 = data2[col_order]

In [34]:
data2.to_csv('reddit_bots_train_gold.csv', index=False)

### Add sentiment

In [None]:
### WARNING: Very long run time, estimated at 1-2 days by tqdm
#data['sentiment'] = data['body'].apply(lambda text: label_sentiment(text))

bodies = list(data.body)
sentiments = sentiment(bodies)
### output looks like:
### [{'label': 'POSITIVE', 'score': 0.9998360276222229}, {'label': 'NEGATIVE', 'score': 0.9997872114181519}, ...]

In [None]:
# sent_labels = []
# for sent in sentiments:
#     label = 
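
One way to reduce pipeline output of the form shown above to the -1/0/1 labels used by `label_sentiment` is sketched below. It is not a completion of the cell above: `to_sentiment_label` is a hypothetical helper, the first two example results are copied from the output comment, and the third is a made-up low-confidence result.

```python
def to_sentiment_label(result: dict, score_threshold: float = 0.999) -> int:
    """Map one pipeline result to 1 (positive), -1 (negative) or 0 (neutral)."""
    if result["score"] < score_threshold:
        return 0  # low confidence is treated as neutral
    return 1 if result["label"] == "POSITIVE" else -1


example_results = [
    {"label": "POSITIVE", "score": 0.9998360276222229},
    {"label": "NEGATIVE", "score": 0.9997872114181519},
    {"label": "NEGATIVE", "score": 0.62},  # made-up low-confidence result
]

sent_labels = [to_sentiment_label(r) for r in example_results]
print(sent_labels)
```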