# Creating `top-fibers` visualizations

### Purpose: This notebook walks through how to create the visualizations that Matt presented at the OSoMe tools meeting on 2022-11-17.

### Notes:
- This is not meant to be utilized for anything other than showcasing the [`altair`](https://altair-viz.github.io/) Python package. (Also see [this great reference](https://uwdata.github.io/visualization-curriculum/intro.html).)
- The functions in this notebook will eventually be built into the backend pipeline so that there is less (probably none at all) work to do on the front end.

In [1]:
import datetime
import pandas as pd
import altair as alt
alt.data_transformers.disable_max_rows() # Allows plotting with more than 5k data points.

In [2]:
def create_account_link(user_id):
    """Create a link to a Twitter user profile page"""
    return f"https://twitter.com/i/user/{user_id}"

def create_post_link(tweet_id):
    """Create a link to a Twitter post page"""
    return f"https://twitter.com/i/status/{tweet_id}"

def tstamp_to_datetime(tstamp):
    """Convert a Unix timestamp in seconds (string or int) to a local-time datetime object"""
    return datetime.datetime.fromtimestamp(int(tstamp))
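
As a quick check of the helpers above, the sketch below calls them on one sample user ID, tweet ID, and timestamp. Note that `datetime.datetime.fromtimestamp` returns local time when no timezone is given, so results can differ between machines. The `tstamp_to_utc` function is a hypothetical variant (not part of this notebook) that pins the result to UTC.

```python
import datetime


def create_account_link(user_id):
    """Create a link to a Twitter user profile page"""
    return f"https://twitter.com/i/user/{user_id}"


def create_post_link(tweet_id):
    """Create a link to a Twitter post page"""
    return f"https://twitter.com/i/status/{tweet_id}"


def tstamp_to_utc(tstamp):
    """Hypothetical variant: convert a Unix timestamp to a UTC-aware datetime"""
    return datetime.datetime.fromtimestamp(int(tstamp), tz=datetime.timezone.utc)


account_link = create_account_link(20545835)
post_link = create_post_link(1585501550314946560)
posted_at = tstamp_to_utc("1666848001")

print(account_link)
print(post_link)
print(posted_at.isoformat())
```

Whether local time or UTC is the better choice depends on how the dates should read to the audience of the plots; the point is only to make that choice explicit.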

# Scatter plot

### Load the data.
NOTE: If you are playing with this yourself, this will probably not work, as I am keeping the data out of the repository for the time being.

- `hitlist`: A data frame that contains each post by the worst misinformation spreaders. The "worst misinformation spreaders" are found as follows: take the 50 users with the highest FIB indices and the 50 users with the most total retweets of misinformation, then combine the two lists and keep only the unique users.
- `fib_indices`: A data frame that contains all users observed in the data, with one row per user. The important columns are `fib_index` and `total_retweets`.
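
The selection rule for the "worst misinformation spreaders" can be sketched in pandas as follows. This is an illustration on made-up toy data with a cutoff of 2 instead of 50. The function name `select_worst_spreaders` and its parameter `n` are hypothetical, and the actual pipeline may implement this step differently.

```python
import pandas as pd


def select_worst_spreaders(fib_indices, n=50):
    """Hypothetical helper: union of the top-n users by FIB index and by total retweets"""
    top_fib = fib_indices.nlargest(n, "fib_index")["user_id"]
    top_rts = fib_indices.nlargest(n, "total_retweets")["user_id"]
    # A set union keeps each user once, even if they appear in both lists
    return set(top_fib.tolist()) | set(top_rts.tolist())


# Toy data (made up for illustration only)
toy = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 5],
        "fib_index": [90, 80, 10, 5, 1],
        "total_retweets": [100, 5000, 9000, 50, 10],
    }
)

worst = select_worst_spreaders(toy, n=2)
print(sorted(worst))
```

Because the two top-`n` lists can overlap, the result holds between `n` and `2 * n` users.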

In [3]:
hitlist = pd.read_parquet("../results/2022_11_17__hitlist_rts.parquet")
fib_indices = pd.read_parquet("../results/2022_11_17__fib_indices.parquet")

In [4]:
hitlist.head()

Unnamed: 0,user_id,tweet_id,num_rts,timestamp
0,20545835,1585501550314946560,37,1666848001
1,20545835,1586810172529610752,68,1667160001
2,20545835,1585496534938185728,7,1666846805
3,20545835,1564327023128223744,6,1661799600
4,20545835,1554093344904531968,83,1659359701


In [5]:
fib_indices.head()

Unnamed: 0,user_id,username,fib_index,total_retweets
0,18856867,zerohedge,279,485060
1,1194770389634895873,BonginoReport,231,162708
2,523248016,jsolomonReports,202,156658
3,18266688,TomFitton,199,187716
4,375721095,Breaking911,182,150052


### Calculate the number of tweets sent by each user, the mean RTs they earn for each post, and create a link to their account

In [6]:
# Get a list of the user IDs
hitlist_ids = list(hitlist.user_id.unique())

# Use them to select only the rows from `fib_indices` for these users
hitlist_fib_indices = fib_indices[fib_indices.user_id.isin(hitlist_ids)].reset_index(drop=True)

# Calculate the total number of tweets sent by each user
hitlist_num_tweets = hitlist.groupby('user_id')['tweet_id'].count().to_frame("num_tweets").reset_index()

# Merge these frames on user ID to add the "num_tweets" column
hitlist_full_info = hitlist_fib_indices.merge(hitlist_num_tweets, on='user_id')

# Calculate the mean RTs
hitlist_full_info['mean_rts_per_post'] = hitlist_full_info['total_retweets'] / hitlist_full_info['num_tweets']
hitlist_full_info['mean_rts_per_post'] = hitlist_full_info['mean_rts_per_post'].round(2)

# Add account link for each user
hitlist_full_info['account_link'] = hitlist_full_info.user_id.map(create_account_link)

hitlist_full_info.head()

Unnamed: 0,user_id,username,fib_index,total_retweets,num_tweets,mean_rts_per_post,account_link
0,18856867,zerohedge,279,485060,4880,99.4,https://twitter.com/i/user/18856867
1,1194770389634895873,BonginoReport,231,162708,750,216.94,https://twitter.com/i/user/1194770389634895873
2,523248016,jsolomonReports,202,156658,1043,150.2,https://twitter.com/i/user/523248016
3,18266688,TomFitton,199,187716,341,550.49,https://twitter.com/i/user/18266688
4,375721095,Breaking911,182,150052,896,167.47,https://twitter.com/i/user/375721095


In [7]:
# This sorts by FIB index so that values are nice and sorted (at least at the beginning)
# It is common altair convention to call your data `source`, so I do that here
source = hitlist_full_info.sort_values("fib_index", ascending=False)

# Set the brush selection type
# This allows you to use your mouse to select an interval
# along both x and y axis (i.e., select a box to highlight in the figure)
############################
brush = alt.selection(type='interval')

# Build the scatter plot
############################
points = alt.Chart(source).mark_point().encode(
    x = alt.X('fib_index:Q', title = "FIB index"),
    y = alt.Y(
        'total_retweets:Q',
        title = "Total retweets",
        scale=alt.Scale(type='log')
    ),
#     size = alt.Size('mean_rts_per_post:Q'), # Uncomment this to size point bubbles by avg RTs
    color=alt.condition(brush, alt.value('steelblue'), alt.value('grey')),
    href = alt.Href("account_link:N"),
    tooltip = [
        alt.Tooltip("username:N", title="Username"),
        alt.Tooltip("fib_index:N", title="FIB index"),
        alt.Tooltip("total_retweets:N", title="Low cred. RTs"),
        alt.Tooltip("num_tweets:N", title="Low cred. posts"),
        alt.Tooltip("mean_rts_per_post:N", title="Avg RTs/post"),
    ]
).add_selection(brush).properties(width=600,height=600)


# Build the BASE chart for the data tables
############################
ranked_text = alt.Chart(source).mark_text(align='right', fontSize=14).encode(
    y=alt.Y('row_number:O',axis=None)
).transform_filter( # This filters the data down to whatever is selected by the user on the graph
    brush
).transform_window(
    row_number='row_number()'
).transform_filter(
    'datum.row_number < 25'  # `datum` is a vega convention
)

# Build the data tables
# Using the base chart built above, we can add text based on the selection (`brush`)
# Each of the below creates a single separate plot which is just a title with row items
# that are indicated in the `encode` function
############################
username_text = ranked_text.encode(
    text='username:N'
).properties(
    title=alt.TitleParams(
        text='Username',
        align='right',
        fontSize=14
    )
)
fib_text = ranked_text.encode(
    text='fib_index:N'
).properties(
    title=alt.TitleParams(
        text='FIB index',
        align='right',
        fontSize=14
    )
)
rt_text = ranked_text.encode(
    text=alt.Text(
        'total_retweets:N',
        format=","
    )
).properties(
    title=alt.TitleParams(
        text='Low cred. RTs',
        align='right',
        fontSize=14
    )
)
num_rts_text = ranked_text.encode(
    text=alt.Text(
        'num_tweets:N',
        format=","
    )
).properties(
    title=alt.TitleParams(
        text='Low cred. posts',
        align='right',
        fontSize=14
    )
)
mean_rts_text = ranked_text.encode(
    text=alt.Text(
        'mean_rts_per_post:N',
        format=","
    )
).properties(
    title=alt.TitleParams(
        text='Avg. RTs/post',
        align='right',
        fontSize=14
    )
)

# This horizontally concatenates all of the text figures we made into one table figure
text = alt.hconcat(username_text, fib_text, rt_text, num_rts_text, mean_rts_text)

# Then we can put the points from above next to this text table
fig = alt.hconcat(
    points,
    text
).resolve_legend(
    color="independent"
).configure_view(
    strokeWidth=0
).configure_axis( # Configure all axes title/label fontsize
    titleFontSize=15,
    labelFontSize=14
)

# This makes it so that when you click on an href embedded point, it opens in a new tab
fig['usermeta'] = {
    "embedOptions": {
        'loader': {'target': '_blank'}
    }
}

fig

### See `altair`'s page on saving figures [here](https://altair-viz.github.io/user_guide/saving_charts.html?highlight=saving)

The gist is that `altair` infers what to do based on the extension in the file name. If you pass `.html`, it will render an entire HTML page. If you pass `.png`, it will save an image file. If you pass `.json`, it will save the Vega-Lite JSON specification, which can then be loaded into your front end. `scale_factor` multiplies the pixel dimensions of image output (for example `.png`) so that you can have higher-resolution images.

In [8]:
!mkdir -p plots


In [9]:
fig.save('./plots/scatter.html', scale_factor=2)


# Create a plot for each individual user

In [13]:
# Save each figure in this list for later experimentation
all_figures = []

for username in hitlist_full_info.username:

    # Create a dataframe of only their posts
    user_id = hitlist_full_info.loc[hitlist_full_info.username == username, "user_id"]
    selected_user_df = hitlist[hitlist["user_id"]==user_id.item()].reset_index(drop=True)
    
    if len(selected_user_df) < 5:
        print(f"Skipping @{username}. Less than 5 posts.")
        continue

    # Convert the timestamps to datetimes and add post links
    selected_user_df['dts'] = selected_user_df.timestamp.map(tstamp_to_datetime)
    selected_user_df['post_link'] = selected_user_df.tweet_id.map(create_post_link)


    brush = alt.selection_interval(encodings=['x'])

    base = alt.Chart(
        title = f"@{username} low-credibility posts"
    ).mark_point().encode(
        alt.X('yearmonthdate(dts):T', title=None),
        alt.Y('num_rts:Q', title="Number of retweets"),
        href = alt.Href("post_link:N"),
        tooltip = [
            alt.Tooltip("num_rts:Q", title="Num. RTs", format=","),
            alt.Tooltip("dts:T", title="Date posted"),
        ]
    ).properties(
        width=650
    )

    bars = alt.Chart(
        title="select date range below to highlight above"
    ).mark_bar(opacity=1, color='black').add_selection(
        brush
    ).encode(
        alt.X('yearmonthdate(dts):T', title='When low cred. post was shared'),
        alt.Y('count():Q', title = ["Num.", "posts"])
    ).properties(
        width=650,
        height=50
    )

    user_fig = alt.vconcat(
        base.encode(alt.X('dts:T', title=None, scale=alt.Scale(domain=brush))),
        bars.properties(height=60),  # `brush` was already added to `bars` above
        data=selected_user_df
    )


    user_fig = user_fig.configure_axis(
        titleFontSize=15,
        labelFontSize=14
    ).configure_title(
        fontSize=16
    )


    user_fig['usermeta'] = {
        "embedOptions": {
            'loader': {'target': '_blank'}
        }
    }
    all_figures.append(user_fig)
    
    user_fig.save(f'./plots/{username}.json', scale_factor=2)


Skipping @weareoneEXO. Less than 5 posts.
Skipping @FLSurgeonGen. Less than 5 posts.
Skipping @NewYorkStateAG. Less than 5 posts.
Skipping @NCTsmtown. Less than 5 posts.
Skipping @LoganPaul. Less than 5 posts.
Skipping @FORTUNE_NEWSS. Less than 5 posts.
Skipping @RepLizCheney. Less than 5 posts.
Skipping @jsrailton. Less than 5 posts.
Skipping @mintamolly. Less than 5 posts.
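
The per-user preparation step in the loop above (filter the posts, skip users with too few, convert timestamps, build links) can also be written with vectorized pandas calls. The sketch below is one possible version, not the notebook's method: `prepare_user_posts` and `min_posts` are hypothetical names, and `pd.to_datetime(..., unit="s")` yields UTC-based timestamps, whereas `tstamp_to_datetime` yields local time. The sample rows are the ones shown in the `hitlist.head()` output.

```python
import pandas as pd


def prepare_user_posts(hitlist, user_id, min_posts=5):
    """Hypothetical helper: return one user's posts with `dts` and `post_link`, or None"""
    posts = hitlist[hitlist["user_id"] == user_id].reset_index(drop=True)
    if len(posts) < min_posts:
        return None
    return posts.assign(
        # Vectorized conversion from Unix seconds (UTC-based, timezone-naive)
        dts=pd.to_datetime(posts["timestamp"], unit="s"),
        post_link="https://twitter.com/i/status/" + posts["tweet_id"].astype(str),
    )


# Rows copied from the `hitlist.head()` output
sample = pd.DataFrame(
    {
        "user_id": [20545835] * 5,
        "tweet_id": [
            1585501550314946560,
            1586810172529610752,
            1585496534938185728,
            1564327023128223744,
            1554093344904531968,
        ],
        "num_rts": [37, 68, 7, 6, 83],
        "timestamp": [1666848001, 1667160001, 1666846805, 1661799600, 1659359701],
    }
)

prepared = prepare_user_posts(sample, 20545835)
print(prepared[["num_rts", "dts", "post_link"]])
```

Returning `None` for users below the threshold mirrors the `continue` in the loop, so the caller can skip them in the same way.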


In [14]:
all_figures[20]