# The Bot Importing and Sorting notebook

This notebook is fairly straightforward. We import the accounts with an overall raw bot score of 0.5 or higher, put them into a list, and filter that list by removing the accounts I manually checked and judged not to be bots, as explained in the thesis.

Libraries needed:

In [None]:
import json
import pandas as pd

### Import the bots from the .txt file and extract only the ones with an overall raw bot score of 0.5 or higher

In [4]:
# Open the .txt file, which holds one JSON object per line
with open("all_user_ids_botometer.txt", "r") as file:
    # List that collects the accounts with a high raw bot score
    accounts_with_high_raw_score = []

    # Go through every line in the file
    for line in file:
        # Parse the line as JSON
        data = json.loads(line)

        # Each parsed line maps user IDs to their Botometer data
        for user_id, user_data in data.items():
            # Keep the account only if it has a universal raw score of 0.5 or higher
            if "raw_scores" in user_data and "universal" in user_data["raw_scores"]:
                raw_score_overall = user_data["raw_scores"]["universal"]["overall"]
                if raw_score_overall >= 0.5:
                    screen_name = user_data["user"]["user_data"]["screen_name"]
                    # Add the account to accounts_with_high_raw_score
                    accounts_with_high_raw_score.append({
                        "user_id": user_id,
                        "screen_name": screen_name,
                        "overall_raw_score": raw_score_overall
                    })

# Turn the list into a DataFrame
bots = pd.DataFrame(accounts_with_high_raw_score)
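
The extraction logic above can be tried on a few hand-written lines before running it on the full file. The following is a minimal sketch, assuming the same line layout as the cell above. The helper name `extract_high_scores`, the `threshold` parameter, and the sample records are made up for illustration and are not part of the original notebook.

```python
import json

def extract_high_scores(lines, threshold=0.5):
    """Collect accounts whose universal overall raw score is >= threshold."""
    accounts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing in json.loads
        for user_id, user_data in json.loads(line).items():
            # .get() chains return None when a record has no universal raw score
            score = user_data.get("raw_scores", {}).get("universal", {}).get("overall")
            if score is None or score < threshold:
                continue
            accounts.append({
                "user_id": user_id,
                "screen_name": user_data["user"]["user_data"]["screen_name"],
                "overall_raw_score": score,
            })
    return accounts

# Hypothetical sample lines in the layout the notebook expects
sample_lines = [
    json.dumps({"111": {"raw_scores": {"universal": {"overall": 0.82}},
                        "user": {"user_data": {"screen_name": "example_a"}}}}),
    json.dumps({"222": {"raw_scores": {"universal": {"overall": 0.31}},
                        "user": {"user_data": {"screen_name": "example_b"}}}}),
    json.dumps({"333": {"error": "no timeline"}}),  # record without raw_scores
]

high = extract_high_scores(sample_lines)
print(high)
```

Only the first sample account passes the threshold. The second scores too low and the third has no raw score at all.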

In [6]:
# Cast user_id in bots to integer so it has the same type as user.id in the tweet dataset
bots["user_id"] = bots["user_id"].astype(int)
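
The cast matters because `isin` compares values exactly: IDs stored as strings never match integer IDs, so a later filter would silently remove nothing. A small sketch with made-up IDs shows the difference:

```python
import pandas as pd

ids_as_text = pd.Series(["111", "222"])  # JSON object keys are always strings
ids_in_tweets = [111]                    # hypothetical integer user.id values

# Comparing strings against integers finds no matches
matches_before = ids_as_text.isin(ids_in_tweets)

# After the cast, the matching ID is found
matches_after = ids_as_text.astype(int).isin(ids_in_tweets)

print(matches_before.tolist())
print(matches_after.tolist())
```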

### Remove not_bots from the bots list

In [7]:
# Open the .txt file with accounts that were in the bot list but were deemed not to be bots
with open('not_bots.txt', 'r') as file:
    not_bots = [line.strip() for line in file]

In [8]:
# Check how many accounts are in the not_bots list
len(not_bots)

519
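
The list comprehension above keeps every line, so a blank line in the file would end up as an empty string in `not_bots` and be counted. A minimal sketch of a slightly more defensive reader, assuming one screen name per line. The helper `read_screen_names` and the in-memory sample are hypothetical:

```python
import io

def read_screen_names(handle):
    """Return stripped, non-empty lines from an open text file."""
    return [line.strip() for line in handle if line.strip()]

# In-memory stand-in for a file with a blank line and stray whitespace
sample_file = io.StringIO("Example_A\n\n  example_b  \n")
names = read_screen_names(sample_file)
print(names)
```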

In [9]:
# Import the gaswinning_tweets_compleet file
groningen_complete = pd.read_feather("gaswinning_tweets_compleet.feather")

# Take only user.id and user.screen_name
user_ids = groningen_complete[["user.id", "user.screen_name"]]

To make sure every account identified as not a bot is removed, we cross-reference the screen names in the not_bots list with the gaswinning dataset and look up the corresponding user IDs. Matching on user ID matters because screen names can change over time, so a screen name in the not_bots list may no longer appear in its original form in the tweet dataset.

For the same reason, any account in the bot list whose screen name matches the not_bots list is also removed. The bot list is derived from Botometer's 2022 data, so it may not reflect username changes that happened at other times.

In [10]:
# Make both sets of screen names lowercase so the comparison is case-insensitive
# (assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
user_ids = user_ids.assign(screen_name_lower=user_ids['user.screen_name'].str.lower())
not_bots_lower = [name.lower() for name in not_bots]

# Find the user IDs in the tweet dataset that belong to the screen names in not_bots
related_user_ids = user_ids.loc[user_ids['screen_name_lower'].isin(not_bots_lower), 'user.id']

# Turn them into a list
related_user_ids_list = related_user_ids.tolist()

# Remove those user IDs from the bots DataFrame
bots = bots[~bots['user_id'].isin(related_user_ids_list)]
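
The filtering step can be tried on toy data. The sketch below mirrors the ID-based filter and, as an assumption, adds a direct case-insensitive screen-name filter on the bot list, which is one possible way to implement the extra removal described in the text. All frames and names here are made up.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's data
bots_demo = pd.DataFrame({
    "user_id": [1, 2, 3],
    "screen_name": ["OldName", "real_bot", "Human_B"],
})
tweets_demo = pd.DataFrame({
    "user.id": [1, 2, 3],
    "user.screen_name": ["NewName", "real_bot", "renamed_b"],
})
not_bots_demo = ["newname", "human_b"]

not_bots_demo_lower = [name.lower() for name in not_bots_demo]

# Step 1: look up user IDs via the screen names in the tweet dataset
id_mask = tweets_demo["user.screen_name"].str.lower().isin(not_bots_demo_lower)
ids_to_drop = tweets_demo.loc[id_mask, "user.id"].unique()

# Step 2 (assumed extra step): match screen names in the bot list itself
name_mask = bots_demo["screen_name"].str.lower().isin(not_bots_demo_lower)

# Keep only accounts that match neither by ID nor by screen name
filtered = bots_demo[~bots_demo["user_id"].isin(ids_to_drop) & ~name_mask]
print(filtered)
```

In this toy case, account 1 is removed through its user ID because the tweet dataset holds a different screen name than the bot list, while account 3 is removed through the screen name stored in the bot list.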


In [11]:
bots.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2103 entries, 0 to 2434
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   user_id            2103 non-null   int64  
 1   screen_name        2103 non-null   object 
 2   overall_raw_score  2103 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 65.7+ KB


### Save it to the final bot_accounts file used to filter the bots in the dataset

In [12]:
bots.to_feather("bot_accounts.feather")
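
After filtering, `bots` keeps its original row labels (2103 rows labelled 0 to 2434 in the `bots.info()` output). Some pandas versions document that `to_feather` requires a default index, so resetting the index before saving is a cautious option. A minimal sketch on a made-up frame, without writing any file:

```python
import pandas as pd

demo = pd.DataFrame({"user_id": [10, 20, 30, 40]})
demo = demo[demo["user_id"] != 20]          # filtering leaves a gap in the index

demo_to_save = demo.reset_index(drop=True)  # back to a default 0..n-1 index
print(demo.index.tolist(), demo_to_save.index.tolist())
```

In the notebook this would correspond to calling `bots.reset_index(drop=True)` before `to_feather`.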