This notebook opens several JSON files (the ones listed in `well_formatted_json_files.txt`) and combines them into a single dataframe. It adds two columns to this dataframe: `species` indicates whether the account is a human or a bot, and `source` records which dataset the account came from (for example, 'verified-2019').

The final dataset (`trimmed_raw_data`) is intended as a foundation for building an ML model. The data still needs reformatting before modeling, for example by one-hot encoding the categorical variables.

Here is a link to a data dictionary (which explains what the column names mean): https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object.

## Front matter

In [1]:
from pathlib import Path # For navigating to the datafiles
import json # For processing json files
import csv # For processing csv files

from functools import reduce # So Chris can pretend he's a CS major

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt # For simple graphing
import seaborn as sns # For sns.set_style("whitegrid"), I guess?

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")



## Load json data into raw dataframes

In [2]:
# Read "well_formatted_json_files.txt"

data_files_to_read_path = Path('well_formatted_json_files.txt')
with open(data_files_to_read_path) as file:
    lines = file.read().splitlines()
    # Remove blank lines and commented lines.
    # The blank-line check comes first so li.lstrip() is never indexed when empty.
    lines = [li for li in lines if li.strip() and not li.lstrip().startswith('#')]
    print(len(lines))

10
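
As an illustration of the filtering rule (skip blank lines and `#` comments), here is a minimal sketch that works on an in-memory string instead of the real file. The helper name `parse_dataset_list` and the sample contents are hypothetical; only the two dataset names are taken from the output shown later in this notebook.

```python
def parse_dataset_list(text):
    """Return dataset names from the text of a list file.

    Blank lines and lines whose first non-space character is '#' are skipped.
    """
    names = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith('#'):
            names.append(stripped)
    return names


# Hypothetical file contents, for illustration only
sample = (
    "# datasets to load\n"
    "verified-2019\n"
    "\n"
    "botometer-feedback-2019\n"
    "   # an indented comment\n"
)
dataset_names = parse_dataset_list(sample)
print(dataset_names)
```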


In [3]:
# Read in json and tsv files

data_path = Path('data')

l = list()
for li in lines:
    # Read into dataframes
    
    json_path = data_path / Path(li + '_tweets.json')
    with open(json_path) as file:
        very_raw_json_contents = json.load(file)
    raw_json_contents = [x['user'] for x in very_raw_json_contents]
    json_contents = pd.DataFrame(raw_json_contents)
    tsv_path = data_path / Path(li + '.tsv')
    tsv_contents = pd.read_csv(tsv_path, sep='\t')
    tsv_contents.columns = ['user_id','species']
    
    # Set indexes to user ids and remove duplicate indices
        
    json_contents = json_contents.set_index('id')
    tsv_contents = tsv_contents.set_index('user_id')
    json_contents = json_contents.loc[~json_contents.index.duplicated(keep='last')]
    tsv_contents = tsv_contents.loc[~tsv_contents.index.duplicated(keep='last')]
    
    # Merge
    
    merged_data = pd.concat([json_contents, tsv_contents], join='inner', axis = 1)
    merged_data = merged_data.assign(source=li)
    
    l.append(merged_data)


# pd.concat accepts the whole list, so no pairwise reduction is needed
cumulative_raw_data = pd.concat(l)
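
To make the merge step concrete, here is a small sketch of the same pattern on toy data: set the user id as the index, drop repeated ids (keeping the last row), then inner-join the two frames on the index. The frames `profiles` and `labels` and all their values are made up and only stand in for `json_contents` and `tsv_contents`.

```python
import pandas as pd

# Toy stand-ins for json_contents and tsv_contents (values are made up)
profiles = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'screen_name': ['a', 'b_old', 'b_new', 'c'],
})
labels = pd.DataFrame({
    'user_id': [2, 3, 4],
    'species': ['bot', 'human', 'bot'],
})

profiles = profiles.set_index('id')
labels = labels.set_index('user_id')

# Keep only the last row for each repeated index value
profiles = profiles.loc[~profiles.index.duplicated(keep='last')]

# Inner join on the index: only ids present in both frames survive
merged = pd.concat([profiles, labels], join='inner', axis=1)
print(merged)
```

In this toy case, id 1 has no label and id 4 has no profile, so both are dropped by the inner join.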

Note that some entries in the dataframe are JSON objects that were converted to Python dictionaries. Dictionaries are not hashable, so methods such as `df.drop_duplicates` can fail with a `TypeError` on these columns. One option is to remove the unhashable columns, or to exclude them from such operations.
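
One possible way to handle this, sketched on toy data: find the columns that hold dictionaries or lists, then deduplicate using only the remaining columns. The helper `unhashable_columns` is hypothetical (it is not part of pandas), and the toy `entities` values are made up.

```python
import pandas as pd


def unhashable_columns(df):
    """Return names of columns holding at least one dict or list value."""
    return [col for col in df.columns
            if df[col].map(lambda v: isinstance(v, (dict, list))).any()]


# Toy frame; 'entities' mimics a nested JSON field (values are made up)
toy = pd.DataFrame({
    'screen_name': ['a', 'a', 'b'],
    'entities': [{'url': {}}, {'url': {}}, {'url': {}}],
})

bad_cols = unhashable_columns(toy)

# Deduplicate using only the hashable columns
hashable_cols = [c for c in toy.columns if c not in bad_cols]
deduped = toy.drop_duplicates(subset=hashable_cols)
print(bad_cols, len(deduped))
```

Passing `subset=` keeps the nested columns in the result while leaving them out of the comparison. Dropping them entirely with `df.drop(columns=bad_cols)` is the simpler alternative when they are not needed.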

In [4]:
cumulative_raw_data

Unnamed: 0,follow_request_sent,has_extended_profile,profile_use_background_image,default_profile_image,profile_background_image_url_https,verified,translator_type,profile_text_color,profile_image_url_https,profile_sidebar_fill_color,...,url,created_at,contributors_enabled,time_zone,protected,default_profile,is_translator,species,source,withheld_in_countries
602249341,False,False,False,False,https://abs.twimg.com/images/themes/theme4/bg.gif,False,none,000000,https://pbs.twimg.com/profile_images/923924342...,000000,...,https://t.co/e5t6p9w7D8,Thu Jun 07 22:16:27 +0000 2012,False,London,False,False,False,human,botometer-feedback-2019,
797617218511060992,False,True,False,False,https://abs.twimg.com/images/themes/theme1/bg.png,False,none,000000,https://pbs.twimg.com/profile_images/855244571...,000000,...,,Sun Nov 13 01:48:58 +0000 2016,False,Pacific Time (US & Canada),False,False,False,bot,botometer-feedback-2019,
889925474,False,False,True,False,https://pbs.twimg.com/profile_background_image...,False,none,333333,https://pbs.twimg.com/profile_images/964079832...,DDEEF6,...,http://t.co/7gh2Iu1AT4,Thu Oct 18 23:19:38 +0000 2012,False,,False,False,False,human,botometer-feedback-2019,
96435556,False,True,True,False,https://abs.twimg.com/images/themes/theme6/bg.gif,False,none,333333,https://pbs.twimg.com/profile_images/311429969...,A0C5C7,...,,Sat Dec 12 22:53:04 +0000 2009,False,Rome,False,False,False,bot,botometer-feedback-2019,
16905397,False,False,True,False,https://pbs.twimg.com/profile_background_image...,False,regular,666666,https://pbs.twimg.com/profile_images/969705141...,252429,...,https://t.co/VRgsX8eVR2,Wed Oct 22 13:43:42 +0000 2008,False,Bern,False,False,False,human,botometer-feedback-2019,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76235237,False,True,False,False,https://abs.twimg.com/images/themes/theme14/bg...,True,none,000000,https://pbs.twimg.com/profile_images/969605999...,000000,...,https://t.co/XWFZ4aqcZ4,Tue Sep 22 03:10:00 +0000 2009,False,,False,False,False,human,verified-2019,
2159329092,False,False,False,False,https://abs.twimg.com/images/themes/theme1/bg.png,True,none,000000,https://pbs.twimg.com/profile_images/112290924...,000000,...,https://t.co/OJ69l0pxyk,Sun Oct 27 17:46:59 +0000 2013,False,,False,False,False,human,verified-2019,
25527618,False,True,True,False,https://abs.twimg.com/images/themes/theme10/bg...,True,none,3D1957,https://pbs.twimg.com/profile_images/110004956...,7AC3EE,...,https://t.co/IiGhu084D5,Fri Mar 20 16:12:36 +0000 2009,False,,False,False,False,human,verified-2019,
43654274,False,False,True,False,https://abs.twimg.com/images/themes/theme14/bg...,True,none,333333,https://pbs.twimg.com/profile_images/297271636...,EFEFEF,...,https://t.co/wBpiJ5kRRf,Sun May 31 06:37:34 +0000 2009,False,,False,False,False,human,verified-2019,


These should be the column names of the raw user data, before `species` and `source` are added:

* profile_sidebar_border_color
* name
* profile_background_image_url
* listed_count
* favourites_count
* created_at
* followers_count
* friends_count
* profile_background_color
* id_str
* is_translation_enabled
* translator_type
* profile_use_background_image
* notifications
* is_translator
* profile_sidebar_fill_color
* utc_offset
* follow_request_sent
* default_profile
* id
* verified
* default_profile_image
* profile_text_color
* entities
* profile_background_tile
* protected
* profile_background_image_url_https
* location
* profile_image_url
* following
* lang
* geo_enabled
* time_zone
* profile_image_url_https
* profile_link_color
* url
* statuses_count
* screen_name
* has_extended_profile
* description
* contributors_enabled

The data dictionary at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object describes each of these attributes. According to it, not all of them are still in use; some are deprecated and null in every entry. However, as the next section shows, most of the attributes in these datasets are populated after all, presumably because they were deprecated from the API sometime after this data was collected.

## Which attributes are mostly null?

In [5]:
def how_na_are_cols(df_arg):
    """Print the number of non-NA values in each column of df_arg."""
    print(f'total number of observations is {len(df_arg)}')
    print()
    for col_name in df_arg.columns:
        # Series.count() counts non-NA entries, not NA entries
        print('non-NA values in column {:35} : {}'.format(
              col_name, df_arg[col_name].count()))

In [6]:
how_na_are_cols(cumulative_raw_data)

total number of observations is 44595

non-NA values in column follow_request_sent                 : 44595
non-NA values in column has_extended_profile                : 44595
non-NA values in column profile_use_background_image        : 44595
non-NA values in column default_profile_image               : 44595
non-NA values in column profile_background_image_url_https  : 39647
non-NA values in column verified                            : 44595
non-NA values in column translator_type                     : 44595
non-NA values in column profile_text_color                  : 44595
non-NA values in column profile_image_url_https             : 44595
non-NA values in column profile_sidebar_fill_color          : 44595
non-NA values in column entities                            : 44595
non-NA values in column followers_count                     : 44595
non-NA values in column profile_sidebar_border_color        : 44595
non-NA values in column id_str                              : 44595
non-NA values in column profile_background_color            : 44595
...
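
Because `Series.count()` returns the number of non-NA entries, it can help to see non-NA counts, NA counts, and NA fractions side by side. The following is a minimal sketch on toy data; `na_summary` is a hypothetical helper and the values are made up.

```python
import pandas as pd


def na_summary(df):
    """Per-column non-NA count, NA count, and NA fraction, most-missing first."""
    summary = pd.DataFrame({
        'non_na': df.notna().sum(),
        'na': df.isna().sum(),
        'na_fraction': df.isna().mean(),
    })
    return summary.sort_values('na_fraction', ascending=False)


# Toy frame: 'url' is mostly missing, 'verified' is fully populated
toy = pd.DataFrame({
    'url': [None, 'x', None, None],
    'verified': [True, False, True, True],
})
summary = na_summary(toy)
print(summary)
```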

In [9]:
# Drop columns that have too few non-NA values

threshold = 40000

# Columns whose non-NA count is at or below the threshold
na_cols = [x for x in cumulative_raw_data.columns
               if (cumulative_raw_data[x].count()
               <= threshold)]

trimmed_raw_data = cumulative_raw_data.drop(na_cols, axis = 1)

print('dropped these columns with at most {} non-NA values:'.format(threshold))
print()
for x in na_cols: print(x)
print()
print()
print('These columns remain:')
print()
for x in trimmed_raw_data.columns: print(x)

dropped these columns with at most 40000 non-NA values:

profile_background_image_url_https
utc_offset
profile_banner_url
profile_background_image_url
url
time_zone
withheld_in_countries


These columns remain:

follow_request_sent
has_extended_profile
profile_use_background_image
default_profile_image
verified
translator_type
profile_text_color
profile_image_url_https
profile_sidebar_fill_color
entities
followers_count
profile_sidebar_border_color
id_str
profile_background_color
listed_count
is_translation_enabled
statuses_count
description
friends_count
location
profile_link_color
profile_image_url
following
geo_enabled
screen_name
lang
profile_background_tile
favourites_count
name
notifications
created_at
contributors_enabled
protected
default_profile
is_translator
species
source
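
As noted at the top of this notebook, the categorical variables still need to be one-hot encoded before modeling. Here is a minimal sketch of that step using `pd.get_dummies` on a toy stand-in for `trimmed_raw_data`. The choice of `translator_type` as the column to encode, the 0/1 coding of `species`, and all values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for trimmed_raw_data (values are made up)
toy = pd.DataFrame({
    'translator_type': ['none', 'regular', 'none'],
    'verified': [False, True, False],
    'species': ['human', 'bot', 'human'],
})

# One indicator column per category of translator_type
features = pd.get_dummies(toy.drop(columns='species'),
                          columns=['translator_type'])

# Binary target: 1 for bot, 0 for human
target = (toy['species'] == 'bot').astype(int)

print(features.columns.tolist())
print(target.tolist())
```

Free-text columns such as `description` or `screen_name` have too many distinct values for this approach and would need a different treatment.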
