This notebook opens several JSON files (those listed in `well_formatted_json_files.txt`) and combines them into a single dataframe. It adds two columns to this dataframe: `species` indicates whether the account is a human or a bot, and `source` records which dataset the account came from (for example, `verified-2019`).

The final dataset (`trimmed_raw_data`) is meant to be a foundation for building an ML model. The data still needs to be reformatted, for example by one-hot encoding the categorical variables.
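As a rough illustration of the one-hot encoding step mentioned above, here is a minimal sketch using `pd.get_dummies`. The `toy` dataframe and its values are made up for the example; only the column names `lang` and `verified` come from the Twitter user object.

```python
import pandas as pd

# Toy stand-in for a few rows of the real data (values are invented).
toy = pd.DataFrame({
    "lang": ["en", "es", "en", None],
    "verified": [True, False, False, True],
})

# One indicator column per category; dummy_na=True adds a separate
# indicator column for missing values instead of silently dropping them.
encoded = pd.get_dummies(toy, columns=["lang"], dummy_na=True)
print(encoded.columns.tolist())
```

Whether missing values deserve their own indicator column is a modeling choice; the sketch keeps them because several attributes in this data are missing for many accounts.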

Here is a link to a data dictionary (which explains what the column names mean): https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object.

These should be the column names provided by the Twitter API:

* profile_sidebar_border_color
* name
* profile_background_image_url
* listed_count
* favourites_count
* created_at
* followers_count
* friends_count
* profile_background_color
* id_str
* is_translation_enabled
* translator_type
* profile_use_background_image
* notifications
* is_translator
* profile_sidebar_fill_color
* utc_offset
* follow_request_sent
* default_profile
* id
* verified
* default_profile_image
* profile_text_color
* entities
* profile_background_tile
* protected
* profile_background_image_url_https
* location
* profile_image_url
* following
* lang
* geo_enabled
* time_zone
* profile_image_url_https
* profile_link_color
* url
* statuses_count
* screen_name
* has_extended_profile
* description
* contributors_enabled

According to the Twitter API data dictionary linked above, not all of these attributes are still in use; some are deprecated and null in every entry. However, as the next section shows, most of the attributes in these datasets are usable after all. (They were presumably deprecated from the API sometime after this data was collected.)

## Front matter

In [18]:
from pathlib import Path # For navigating to the datafiles
import json # For processing json files
import csv # For processing csv files

from functools import reduce # So Chris can pretend he's a CS major

import collections

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt # For simple graphing
import seaborn as sns # For sns.set_style("whitegrid"), I guess?

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

## Load json data into raw dataframes

In [3]:
# Read "well_formatted_json_files.txt"

data_files_to_read_path = Path('well_formatted_json_files.txt')
with open(data_files_to_read_path) as file:
    lines = file.read().splitlines()
    # Remove commented lines and blank lines:
    # (check for blank lines first so li[0] never hits an empty string)
    lines = [li for li in lines if li.strip() and li[0] != '#']
    print(len(lines))

10


In [4]:
# Read in json and tsv files

data_path = Path('data')

l = list()
for li in lines:
    # Read into dataframes
    
    json_path = data_path / Path(li + '_tweets.json')
    with open(json_path) as file:
        very_raw_json_contents = json.load(file)
    raw_json_contents = [x['user'] for x in very_raw_json_contents]
    json_contents = pd.DataFrame(raw_json_contents)
    tsv_path = data_path / Path(li + '.tsv')
    tsv_contents = pd.read_csv(tsv_path, sep='\t')
    tsv_contents.columns = ['user_id','species']
    
    # Set indexes to user ids and remove duplicate indices
        
    json_contents = json_contents.set_index('id')
    tsv_contents = tsv_contents.set_index('user_id')
    json_contents = json_contents.loc[~json_contents.index.duplicated(keep='last')]
    tsv_contents = tsv_contents.loc[~tsv_contents.index.duplicated(keep='last')]
    
    # Merge
    
    merged_data = pd.concat([json_contents, tsv_contents], join='inner', axis = 1)
    merged_data = merged_data.assign(source=li)
    
    l.append(merged_data)


# pd.concat accepts the whole list at once, so no pairwise reduce is needed
cumulative_raw_data = pd.concat(l)

Be aware that some entries in the dataframe are JSON objects that were turned into Python dictionaries. Python dictionaries are not hashable, so methods like `df.drop_duplicates` can fail on them. One option is to remove the unhashable columns.
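The hashability problem can be reproduced on a small example. This is a sketch on a made-up `toy` dataframe, not the real data, showing two possible workarounds: deduplicating on hashable columns only, or serializing the dictionaries to strings first.

```python
import json

import pandas as pd

# Toy data: the 'entities' column holds dicts, as in the real dataframe.
toy = pd.DataFrame({
    "id_str": ["1", "1", "2"],
    "entities": [{"url": {}}, {"url": {}}, {"description": {}}],
})

try:
    toy.drop_duplicates()
except TypeError as err:
    # Comparing rows requires hashing every value, which dicts do not support.
    print("drop_duplicates failed:", err)

# Workaround 1: compare rows using hashable columns only.
deduped_by_id = toy.drop_duplicates(subset=["id_str"])

# Workaround 2: turn each dict into a canonical JSON string, then deduplicate.
as_text = toy.assign(
    entities=toy["entities"].map(lambda d: json.dumps(d, sort_keys=True))
)
deduped_full = as_text.drop_duplicates()
```

The second workaround assumes every entry in the column is JSON-serializable; a column that mixes dictionaries with missing values would need extra handling.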

In [5]:
cumulative_raw_data

id,follow_request_sent,has_extended_profile,profile_use_background_image,default_profile_image,profile_background_image_url_https,verified,translator_type,profile_text_color,profile_image_url_https,profile_sidebar_fill_color,...,url,created_at,contributors_enabled,time_zone,protected,default_profile,is_translator,species,source,withheld_in_countries
602249341,False,False,False,False,https://abs.twimg.com/images/themes/theme4/bg.gif,False,none,000000,https://pbs.twimg.com/profile_images/923924342...,000000,...,https://t.co/e5t6p9w7D8,Thu Jun 07 22:16:27 +0000 2012,False,London,False,False,False,human,botometer-feedback-2019,
797617218511060992,False,True,False,False,https://abs.twimg.com/images/themes/theme1/bg.png,False,none,000000,https://pbs.twimg.com/profile_images/855244571...,000000,...,,Sun Nov 13 01:48:58 +0000 2016,False,Pacific Time (US & Canada),False,False,False,bot,botometer-feedback-2019,
889925474,False,False,True,False,https://pbs.twimg.com/profile_background_image...,False,none,333333,https://pbs.twimg.com/profile_images/964079832...,DDEEF6,...,http://t.co/7gh2Iu1AT4,Thu Oct 18 23:19:38 +0000 2012,False,,False,False,False,human,botometer-feedback-2019,
96435556,False,True,True,False,https://abs.twimg.com/images/themes/theme6/bg.gif,False,none,333333,https://pbs.twimg.com/profile_images/311429969...,A0C5C7,...,,Sat Dec 12 22:53:04 +0000 2009,False,Rome,False,False,False,bot,botometer-feedback-2019,
16905397,False,False,True,False,https://pbs.twimg.com/profile_background_image...,False,regular,666666,https://pbs.twimg.com/profile_images/969705141...,252429,...,https://t.co/VRgsX8eVR2,Wed Oct 22 13:43:42 +0000 2008,False,Bern,False,False,False,human,botometer-feedback-2019,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76235237,False,True,False,False,https://abs.twimg.com/images/themes/theme14/bg...,True,none,000000,https://pbs.twimg.com/profile_images/969605999...,000000,...,https://t.co/XWFZ4aqcZ4,Tue Sep 22 03:10:00 +0000 2009,False,,False,False,False,human,verified-2019,
2159329092,False,False,False,False,https://abs.twimg.com/images/themes/theme1/bg.png,True,none,000000,https://pbs.twimg.com/profile_images/112290924...,000000,...,https://t.co/OJ69l0pxyk,Sun Oct 27 17:46:59 +0000 2013,False,,False,False,False,human,verified-2019,
25527618,False,True,True,False,https://abs.twimg.com/images/themes/theme10/bg...,True,none,3D1957,https://pbs.twimg.com/profile_images/110004956...,7AC3EE,...,https://t.co/IiGhu084D5,Fri Mar 20 16:12:36 +0000 2009,False,,False,False,False,human,verified-2019,
43654274,False,False,True,False,https://abs.twimg.com/images/themes/theme14/bg...,True,none,333333,https://pbs.twimg.com/profile_images/297271636...,EFEFEF,...,https://t.co/wBpiJ5kRRf,Sun May 31 06:37:34 +0000 2009,False,,False,False,False,human,verified-2019,


## What entries are in each attribute?

First I counted how many non-NA values are in each column (`count()` returns the number of non-NA entries, so a column with no missing data shows the total number of observations):

In [6]:
def how_na_are_cols(df_arg):
    print(f'total number of observations is {len(df_arg)}')
    print()
    for col_name in df_arg.columns:
        # Series.count() counts the non-NA entries in the column
        print('non-NA values in column {:35} : {}'.format(
              col_name, df_arg[col_name].count()))

In [7]:
how_na_are_cols(cumulative_raw_data)

total number of observations is 44595

non-NA values in column follow_request_sent                 : 44595
non-NA values in column has_extended_profile                : 44595
non-NA values in column profile_use_background_image        : 44595
non-NA values in column default_profile_image               : 44595
non-NA values in column profile_background_image_url_https  : 39647
non-NA values in column verified                            : 44595
non-NA values in column translator_type                     : 44595
non-NA values in column profile_text_color                  : 44595
non-NA values in column profile_image_url_https             : 44595
non-NA values in column profile_sidebar_fill_color          : 44595
non-NA values in column entities                            : 44595
non-NA values in column followers_count                     : 44595
non-NA values in column profile_sidebar_border_color        : 44595
non-NA values in column id_str                              : 44595
non-NA values in column profile_background_color            : 44595
...
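For reference, `count()` reports non-NA entries, while `isna().sum()` reports the missing ones. The sketch below puts both in one table without a loop. It runs on a made-up `toy` dataframe; the name `summary` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data with one fully populated column and one with gaps.
toy = pd.DataFrame({
    "verified": [True, False, True],
    "utc_offset": [np.nan, -25200.0, np.nan],
})

# One row per column of `toy`: non-missing and missing counts side by side.
summary = pd.DataFrame({
    "non_na": toy.count(),
    "na": toy.isna().sum(),
})
print(summary)
```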

In [8]:
# Inspect how many null values are in each column
# I commented this out because I can do this better with value_counts(dropna=False)
'''
threshold = 40000

na_cols = [x for x in cumulative_raw_data.columns
               if (cumulative_raw_data[x].count()
               <= threshold)]

trimmed_raw_data = cumulative_raw_data.drop(na_cols, axis = 1)

print('dropped these columns with at most {} non-NA values:'.format(threshold))
print()
for x in na_cols: print(x)
print()
print()
print('These columns remain:')
print()
for x in trimmed_raw_data.columns: print(x)
'''

It turns out that the non-hashable entries in the dataframe cause problems, so I dropped the columns that contain them:

In [64]:
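from collections.abc import Hashable  # collections.Hashable was removed in Python 3.10

unhashable_cols = set()
# Iterating over a dataframe yields its column names, so no sampling is needed
for col in cumulative_raw_data.columns:
    if any(not isinstance(x, Hashable) for x in cumulative_raw_data[col]):
        print('column', col, 'not always hashable')
        unhashable_cols.add(col)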

column entities not always hashable
column withheld_in_countries not always hashable


In [61]:
hashable_raw_data = cumulative_raw_data.drop(unhashable_cols, axis=1)

print(f'{len(cumulative_raw_data.columns)-len(hashable_raw_data.columns)} columns dropped')

2 columns dropped


Now we're in a position to examine the most common entries in every column:

In [63]:
for col in hashable_raw_data.columns:
    print('~~~', col, '~~~')
    display(hashable_raw_data[col].value_counts(dropna=False).head(5))
    print()

~~~ follow_request_sent ~~~


False    44595
Name: follow_request_sent, dtype: int64


~~~ has_extended_profile ~~~


False    36158
True      8437
Name: has_extended_profile, dtype: int64


~~~ profile_use_background_image ~~~


True     40601
False     3994
Name: profile_use_background_image, dtype: int64


~~~ default_profile_image ~~~


False    44106
True       489
Name: default_profile_image, dtype: int64


~~~ profile_background_image_url_https ~~~


https://abs.twimg.com/images/themes/theme1/bg.png     27607
NaN                                                    4948
https://abs.twimg.com/images/themes/theme14/bg.gif     1513
https://abs.twimg.com/images/themes/theme9/bg.gif       973
https://abs.twimg.com/images/themes/theme10/bg.gif      579
Name: profile_background_image_url_https, dtype: int64


~~~ verified ~~~


False    37307
True      7288
Name: verified, dtype: int64


~~~ translator_type ~~~


none         43421
regular       1164
badged           8
moderator        2
Name: translator_type, dtype: int64


~~~ profile_text_color ~~~


333333    35194
000000     3472
666666     1140
3D1957      551
3C3940      435
Name: profile_text_color, dtype: int64


~~~ profile_image_url_https ~~~


https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png    489
https://pbs.twimg.com/profile_images/897733049370251265/_9TrrWY__normal.jpg         2
https://pbs.twimg.com/profile_images/437757958085414912/vmUt7tpc_normal.jpeg        2
https://pbs.twimg.com/profile_images/947789418328350720/YJ0hdsbA_normal.jpg         2
https://pbs.twimg.com/profile_images/627504688611610624/0Ej34t6a_normal.jpg         2
Name: profile_image_url_https, dtype: int64


~~~ profile_sidebar_fill_color ~~~


DDEEF6    30735
000000     3045
EFEFEF     1630
252429     1122
FFFFFF      577
Name: profile_sidebar_fill_color, dtype: int64


~~~ followers_count ~~~


0    1201
5     878
1     843
8     823
6     819
Name: followers_count, dtype: int64


~~~ profile_sidebar_border_color ~~~


C0DEED    28380
FFFFFF     5121
000000     4649
EEEEEE     1327
181A1E      770
Name: profile_sidebar_border_color, dtype: int64


~~~ id_str ~~~


304679484     3
46440103      2
46476526      2
2295217322    2
26534759      2
Name: id_str, dtype: int64


~~~ profile_background_color ~~~


C0DEED    24250
F5F8FA     4952
000000     3533
131516     1632
1A1B1F     1091
Name: profile_background_color, dtype: int64


~~~ listed_count ~~~


0    23635
1     3511
2     1655
3     1007
4      598
Name: listed_count, dtype: int64


~~~ is_translation_enabled ~~~


False    43942
True       653
Name: is_translation_enabled, dtype: int64


~~~ utc_offset ~~~


 NaN        36200
-25200.0     1520
-14400.0      956
-18000.0      947
-10800.0      910
Name: utc_offset, dtype: int64


~~~ statuses_count ~~~


28    354
25    340
24    327
23    321
27    318
Name: statuses_count, dtype: int64


~~~ description ~~~


                                                                                                                                                    6023
.                                                                                                                                                     11
Actor                                                                                                                                                 10
We #followback Right-Minded Patriots & Liberty Republicans from any State. Part of the MAGOP Project - Learn More On the Web About What We Do #2       9
We #followback Right-Minded Patriots & Liberty Republicans from any State. Part of the MAGOP Project - Learn More On the Web About What We Do #3       8
Name: description, dtype: int64


~~~ friends_count ~~~


0    19072
1      329
2      244
3      171
4      148
Name: friends_count, dtype: int64


~~~ location ~~~


                   24581
Los Angeles, CA      275
New York, NY         224
Los Angeles          173
United States        157
Name: location, dtype: int64


~~~ profile_link_color ~~~


1DA1F2    26882
0084B4     2969
009999     1207
2FC2EF      865
FF0000      638
Name: profile_link_color, dtype: int64


~~~ profile_image_url ~~~


http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png    489
http://pbs.twimg.com/profile_images/492878662446768128/BHP_9bkq_normal.jpeg        2
http://pbs.twimg.com/profile_images/896850928422141952/-spcGI8__normal.jpg         2
http://pbs.twimg.com/profile_images/676011489191301121/Cw6In7Xc_normal.jpg         2
http://pbs.twimg.com/profile_images/431276795216678912/K9GgLuad_normal.png         2
Name: profile_image_url, dtype: int64


~~~ following ~~~


False    44591
True         4
Name: following, dtype: int64


~~~ geo_enabled ~~~


False    32214
True     12381
Name: geo_enabled, dtype: int64


~~~ profile_banner_url ~~~


NaN                                                          7946
https://pbs.twimg.com/profile_banners/93719349/1508895371       2
https://pbs.twimg.com/profile_banners/25382962/1368400797       2
https://pbs.twimg.com/profile_banners/65773523/1488131239       2
https://pbs.twimg.com/profile_banners/33923443/1515039960       2
Name: profile_banner_url, dtype: int64


~~~ profile_background_image_url ~~~


http://abs.twimg.com/images/themes/theme1/bg.png     27607
NaN                                                   4948
http://abs.twimg.com/images/themes/theme14/bg.gif     1513
http://abs.twimg.com/images/themes/theme9/bg.gif       973
http://abs.twimg.com/images/themes/theme10/bg.gif      579
Name: profile_background_image_url, dtype: int64


~~~ screen_name ~~~


HamillHimself      3
JLo                2
DoveCameron        2
KatherynWinnick    2
justinbieber       2
Name: screen_name, dtype: int64


~~~ lang ~~~


en     26429
NaN     2683
es      2519
pt      1891
it      1440
Name: lang, dtype: int64


~~~ profile_background_tile ~~~


False    37046
True      7549
Name: profile_background_tile, dtype: int64


~~~ favourites_count ~~~


0      759
1      237
2      160
3      122
133    122
Name: favourites_count, dtype: int64


~~~ name ~~~


.       51
Alex    12
B       10
Mike     7
A        6
Name: name, dtype: int64


~~~ notifications ~~~


False    44595
Name: notifications, dtype: int64


~~~ url ~~~


NaN                        31969
https://t.co/mAixJrdlV1       13
http://t.co/ap5l34TcZV        11
http://t.co/jYvqmtPizF         6
http://t.co/jYvqmu7rNN         4
Name: url, dtype: int64


~~~ created_at ~~~


Tue May 24 22:52:49 +0000 2011    3
Mon Apr 20 18:36:09 +0000 2009    3
Tue Jan 11 18:30:34 +0000 2011    2
Sat Feb 14 05:56:28 +0000 2009    2
Thu Mar 19 20:53:02 +0000 2009    2
Name: created_at, dtype: int64


~~~ contributors_enabled ~~~


False    44595
Name: contributors_enabled, dtype: int64


~~~ time_zone ~~~


NaN                           36200
Pacific Time (US & Canada)     1553
Eastern Time (US & Canada)      890
Central Time (US & Canada)      527
Brasilia                        402
Name: time_zone, dtype: int64


~~~ protected ~~~


False    44595
Name: protected, dtype: int64


~~~ default_profile ~~~


True     26880
False    17715
Name: default_profile, dtype: int64


~~~ is_translator ~~~


False    44585
True        10
Name: is_translator, dtype: int64


~~~ species ~~~


bot      28389
human    16206
Name: species, dtype: int64


~~~ source ~~~


pronbots-2019        17881
cresci-stock-2018    13275
celebrity-2019        5917
gilani-2017           2483
verified-2019         1986
Name: source, dtype: int64




Looking at this, I think the following attributes would be worth using in a model. I included in square brackets any pre-processing we'd have to do.

* has_extended_profile
* profile_use_background_image
* profile_background_image_url_https [consider categorical: equal to theme1, theme14, NaN, or other]
* verified
* profile_text_color [consider categorical: either equal to 333333 or not]
* profile_sidebar_fill_color [consider categorical: either equal to DDEEF6 or not]
* followers_count
* profile_sidebar_border_color [consider categorical: either equal to C0DEED, FFFFFF, 000000, or other]
* profile_background_color [categorical: equal to C0DEED, F5F8FA, 000000, or other]
* listed_count
* utc_offset [categorical? UTC offset measured in seconds]
* statuses_count
* friends_count
* profile_link_color [categorical: 1DA1F2 or other]
* geo_enabled
* profile_background_image_url [categorical: theme1, NaN, or other]
* lang [categorical: en or other (maybe NaN, es, pt, it, too)]
* profile_background_tile
* favourites_count
* url [categorical: NaN or other]
* created_at [look at day of week, or at time of day]
* time_zone [categorical: NaN or other]
* default_profile [boolean]
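Two of the pre-processing ideas in the list above can be sketched as follows: collapsing a column to a few frequent values plus "other", and pulling the weekday and hour out of `created_at`. The helper `keep_top_values` and the new column names are hypothetical. The date format string is an assumption based on the `created_at` values shown earlier, and it assumes English day and month names.

```python
import pandas as pd

def keep_top_values(series, keep, other="other", missing="NaN"):
    """Keep values listed in `keep`, label the rest `other`, and label NA `missing`."""
    labeled = series.where(series.isin(keep), other)
    return labeled.mask(series.isna(), missing)

# Toy rows; the created_at strings are copied from the dataframe preview.
toy = pd.DataFrame({
    "profile_link_color": ["1DA1F2", "0084B4", "1DA1F2", None],
    "created_at": [
        "Thu Jun 07 22:16:27 +0000 2012",
        "Sun Nov 13 01:48:58 +0000 2016",
        "Thu Oct 18 23:19:38 +0000 2012",
        "Sat Dec 12 22:53:04 +0000 2009",
    ],
})

# Categorical: the default link color, anything else, or missing.
toy["link_color_cat"] = keep_top_values(toy["profile_link_color"], ["1DA1F2"])

# Parse the Twitter timestamp, then derive weekday (Monday=0) and hour of day.
created = pd.to_datetime(toy["created_at"], format="%a %b %d %H:%M:%S %z %Y")
toy["created_weekday"] = created.dt.dayofweek
toy["created_hour"] = created.dt.hour
```

The resulting categorical columns could then be one-hot encoded, and the weekday and hour treated as either numeric or categorical features depending on the model.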

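I think withheld_in_countries might also be useful, but I'm not sure. It was dropped above because its entries are not always hashable, so it would need to be converted to a hashable form first.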
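One possible way to make `withheld_in_countries` usable is to reduce each entry to a count. This is only a sketch. It assumes the entries are either missing or lists of country codes, which matches the Twitter data dictionary but has not been checked against this data. The helper `count_withheld` is hypothetical.

```python
import numpy as np
import pandas as pd

def count_withheld(value):
    """Return how many countries are listed; missing values count as 0."""
    return len(value) if isinstance(value, (list, tuple)) else 0

# Toy column mixing a missing value, an empty list, and a two-country list.
toy = pd.Series([np.nan, [], ["DE", "FR"]], name="withheld_in_countries")

# The result is a plain integer column, which is hashable and model-ready.
n_withheld = toy.map(count_withheld)
print(n_withheld.tolist())
```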