The goal of this notebook is to produce a dataframe that is fully formatted and ready for training a model, including one-hot encoding of the categorical attributes.

Here is a link to a data dictionary (which explains what the column names mean): https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object.

The idea is to keep most of the following attributes (identified in Which_attributes_are_usable) and format them as hinted in the square brackets.

* has_extended_profile
* profile_use_background_image
* profile_background_image_url_https [consider categorical: equal to theme1, theme14, NaN, or other]
* verified
* profile_text_color [consider categorical: either equal to 333333 or not]
* profile_sidebar_fill_color [consider categorical: either equal to DDEEF6 or not]
* followers_count
* profile_sidebar_border_color [consider categorical: either equal to C0DEED, FFFFFF, 000000, or other]
* profile_background_color [categorical: equal to C0DEED, F5F8FA, 000000, or other]
* listed_count
* utc_offset [categorical? UTC offset measured in seconds]
* statuses_count
* friends_count
* profile_link_color [categorical: 1DA1F2 or other]
* geo_enabled
* profile_background_image_url [categorical: theme1, NaN, or other]
* lang [categorical: en or other (maybe NaN, es, pt, it, too)]
* profile_background_tile
* favourites_count
* url (NaN or other)
* created_at (look at day of week, or at time of day)
* time_zone [categorical: NaN or other]
* default_profile [boolean]

I think withheld_in_countries might also be useful. Not sure.

## Front matter

In [5]:
from pathlib import Path # For navigating to the datafiles
import json # For processing json files
import csv # For processing csv files

from functools import reduce # So Chris can pretend he's a CS major

import collections

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt # For simple graphing
import seaborn as sns # For sns.set_style("whitegrid"), I guess?

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

## Load json data into raw dataframes

In [6]:
# Read "well_formatted_json_files.txt"

data_files_to_read_path = Path('well_formatted_json_files.txt')
with open(data_files_to_read_path) as file:
    lines = file.read().splitlines()
    # Remove commented lines and blank lines:
    # (check for blank lines first, so li[0] is never evaluated on an empty string)
    lines = [li for li in lines if li.strip() and li[0] != '#']
    print(len(lines))

10


In [7]:
# Read in json and tsv files

data_path = Path('data')

l = list()
for li in lines:
    # Read into dataframes
    
    json_path = data_path / Path(li + '_tweets.json')
    with open(json_path) as file:
        very_raw_json_contents = json.load(file)
    raw_json_contents = [x['user'] for x in very_raw_json_contents]
    json_contents = pd.DataFrame(raw_json_contents)
    tsv_path = data_path / Path(li + '.tsv')
    tsv_contents = pd.read_csv(tsv_path, sep='\t')
    tsv_contents.columns = ['user_id','species']
    
    # Set indexes to user ids and remove duplicate indices
        
    json_contents = json_contents.set_index('id')
    tsv_contents = tsv_contents.set_index('user_id')
    json_contents = json_contents.loc[~json_contents.index.duplicated(keep='last')]
    tsv_contents = tsv_contents.loc[~tsv_contents.index.duplicated(keep='last')]
    
    # Merge
    
    merged_data = pd.concat([json_contents, tsv_contents], join='inner', axis = 1)
    merged_data = merged_data.assign(source=li)
    
    l.append(merged_data)


# pd.concat accepts the whole list at once, so no pairwise reduce is needed
cumulative_raw_data = pd.concat(l)
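
To make the merge step above easier to follow, here is a minimal sketch of the same pattern on toy data: drop repeated user ids (keeping the last row), then concatenate column-wise with an inner join so that only ids present in both frames survive. The frames `profiles` and `labels` and all their values are made up for illustration; they only stand in for `json_contents` and `tsv_contents`.

```python
import pandas as pd

# Toy stand-ins for json_contents and tsv_contents (all values are made up).
profiles = pd.DataFrame(
    {"followers_count": [10, 11, 20, 30]},
    index=pd.Index([1, 1, 2, 3], name="id"),
)
labels = pd.DataFrame(
    {"species": ["bot", "human", "bot"]},
    index=pd.Index([1, 2, 4], name="user_id"),
)

# Keep only the last row for each repeated user id.
profiles = profiles.loc[~profiles.index.duplicated(keep="last")]

# join="inner" keeps only the ids that appear in both frames.
merged = pd.concat([profiles, labels], join="inner", axis=1)
print(merged)
```

In this toy case ids 3 and 4 are dropped because each appears in only one frame, and id 1 keeps its later row.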

In [52]:
columns_to_keep = ['has_extended_profile','profile_use_background_image','profile_background_image_url_https','verified','profile_text_color','profile_sidebar_fill_color','followers_count','profile_sidebar_border_color','profile_background_color','listed_count','utc_offset','statuses_count','profile_link_color','geo_enabled','profile_background_image_url','lang','profile_background_tile','favourites_count','url','created_at','time_zone','default_profile','friends_count']

# Boolean Series: True where the account is labelled as a bot
target = cumulative_raw_data['species'] == 'bot'
source = cumulative_raw_data['source']
cut_raw_data = cumulative_raw_data[columns_to_keep]

cut_raw_data.sample(3)

Unnamed: 0,has_extended_profile,profile_use_background_image,profile_background_image_url_https,verified,profile_text_color,profile_sidebar_fill_color,followers_count,profile_sidebar_border_color,profile_background_color,listed_count,...,geo_enabled,profile_background_image_url,lang,profile_background_tile,favourites_count,url,created_at,time_zone,default_profile,friends_count
273767254,False,False,https://abs.twimg.com/images/themes/theme1/bg.png,False,0,000000,241,FFFFFF,000000,8,...,True,http://abs.twimg.com/images/themes/theme1/bg.png,en,False,2534,,Tue Mar 29 03:42:41 +0000 2011,,False,494
3324596300,False,True,https://abs.twimg.com/images/themes/theme1/bg.png,False,333333,DDEEF6,18,C0DEED,C0DEED,0,...,False,http://abs.twimg.com/images/themes/theme1/bg.png,zh-TW,False,165,,Sat Aug 22 16:38:59 +0000 2015,,True,0
1649927018,False,True,https://abs.twimg.com/images/themes/theme1/bg.png,False,333333,DDEEF6,40,C0DEED,C0DEED,0,...,False,http://abs.twimg.com/images/themes/theme1/bg.png,fr,False,538,,Tue Aug 06 09:29:48 +0000 2013,,True,292


In [53]:
numerical_cols = ['favourites_count','statuses_count','friends_count','followers_count','listed_count']
boolean_cols = ['profile_background_tile','default_profile','geo_enabled','verified','has_extended_profile','profile_use_background_image']

assert set(numerical_cols) <= set(cut_raw_data.columns)
assert set(boolean_cols) <= set(cut_raw_data.columns)

# .copy() so that columns added later do not write into a view of cut_raw_data
cleaned_data = cut_raw_data[numerical_cols + boolean_cols].copy()

cleaned_data.sample(3, random_state=855)

Unnamed: 0,favourites_count,statuses_count,friends_count,followers_count,listed_count,profile_background_tile,default_profile,geo_enabled,verified,has_extended_profile,profile_use_background_image
1870916851,132,30,0,1,0,False,True,False,False,False,True
340052871,2497,7452,183,199,0,True,False,True,False,False,True
156031870,6635,19851,146,259,4,True,False,True,False,True,True


NUMERICAL:

* favourites_count
* statuses_count
* friends_count
* followers_count
* listed_count

BOOLEAN:

* profile_background_tile
* default_profile
* geo_enabled
* verified
* has_extended_profile
* profile_use_background_image
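
One possible way to turn the boolean attributes into 0/1 columns is sketched below. The rows are made up, and counting a missing flag as 0 is an assumption, not something decided in this notebook.

```python
import pandas as pd

# Made-up rows; the column names come from the BOOLEAN list above.
toy = pd.DataFrame(
    {
        "verified": [True, False, None],
        "geo_enabled": [False, True, True],
    }
)
boolean_cols = ["verified", "geo_enabled"]

# Assumption: a flag counts as 1 only when it is True; missing values become 0.
encoded = toy[boolean_cols].eq(True).astype(int)
print(encoded)
```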

CATEGORICAL:

* profile_background_image_url_https [consider categorical: equal to theme1, theme14, NaN, or other]
* url (NaN or other)
* lang [categorical: en or other (maybe NaN, es, pt, it, too)]
* profile_link_color [categorical: 1DA1F2 or other]
* profile_background_image_url [categorical: theme1, NaN, or other]
* profile_sidebar_border_color [consider categorical: either equal to C0DEED, FFFFFF, 000000, or other]
* profile_background_color [categorical: equal to C0DEED, F5F8FA, 000000, or other]
* profile_text_color [consider categorical: either equal to 333333 or not]
* profile_sidebar_fill_color [consider categorical: either equal to DDEEF6 or not]
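
The "X, NaN, or other" hints above could be implemented roughly as follows. This is only a sketch: `bucket` and `theme_of` are hypothetical helpers, the rows are made up, and it assumes colors are stored as uppercase hex strings and that default backgrounds have URLs of the form `.../themes/themeN/...`.

```python
import pandas as pd

def bucket(series, keep, other="other", missing="NaN"):
    """Keep the values listed in `keep`; relabel missing values and everything else."""
    out = series.where(series.isin(keep), other)
    return out.mask(series.isna(), missing)

def theme_of(urls):
    """Pull 'theme1', 'theme14', ... out of a default background URL (NaN if absent)."""
    return urls.str.extract(r"/themes/(theme\d+)/", expand=False)

# Made-up rows for illustration.
toy = pd.DataFrame(
    {
        "profile_link_color": ["1DA1F2", "FF0000", "1DA1F2"],
        "profile_background_image_url_https": [
            "https://abs.twimg.com/images/themes/theme1/bg.png",
            None,
            "https://example.com/custom.png",
        ],
    }
)

toy["link_color"] = bucket(toy["profile_link_color"], ["1DA1F2"])

# A non-theme URL becomes "other"; a truly missing URL stays missing until bucket().
urls = toy["profile_background_image_url_https"]
themes = theme_of(urls).fillna("other").mask(urls.isna())
toy["bg_image"] = bucket(themes, ["theme1", "theme14"])

# One-hot encode the bucketed columns as 0/1 integers.
one_hot = pd.get_dummies(toy[["link_color", "bg_image"]], dtype=int)
print(one_hot)
```

Bucketing before `pd.get_dummies` keeps the number of one-hot columns small, since every rare value collapses into a single `other` column.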

WEIRD/OTHER:

* utc_offset [categorical? UTC offset measured in seconds]
* created_at [look at day of week, or at time of day]
* time_zone [categorical: NaN or other]
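
A rough sketch of how these three attributes might be turned into features. The `created_at` strings are copied from the sample output earlier in the notebook; the `time_zone` and `utc_offset` values and the new column names are made up, and converting `utc_offset` to hours is just one option, not a decision.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "created_at": [
            "Tue Mar 29 03:42:41 +0000 2011",
            "Sat Aug 22 16:38:59 +0000 2015",
        ],
        "time_zone": [None, "Pacific Time (US & Canada)"],  # made-up values
        "utc_offset": [float("nan"), -25200.0],             # seconds, made-up values
    }
)

# Parse Twitter's timestamp format into timezone-aware datetimes.
created = pd.to_datetime(toy["created_at"], format="%a %b %d %H:%M:%S %z %Y")

features = pd.DataFrame(
    {
        "created_dayofweek": created.dt.dayofweek,  # Monday=0 ... Sunday=6
        "created_hour": created.dt.hour,            # hour of day in UTC
        "has_time_zone": toy["time_zone"].notna().astype(int),
        "utc_offset_hours": toy["utc_offset"] / 3600,  # stays NaN when missing
    }
)
print(features)
```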