# Brief

Scenario: You are part of a team tasked with developing findings on the spread of misinformation surrounding the US presidential elections on X. The team has been asked to develop work that explores the kinds of election misinformation narratives that are spreading on the platform. You have been asked to examine the dataset of Community Notes to find potential leads. Information on what the dataset contains is available here.

## Task 1: 
Do some simple initial exploratory work on the dataset to help the team understand its contents and potential directions for research. Your findings should be presented as a one-page document that:
1. Is understandable by a non-technical audience
2. Contains basic statistics that could help direct further work

You can use bullet points and visualizations if you wish. Please submit your notebook to show your working alongside the document. You should work in Python.  

## Task 2: 

Create a one-page plan for a two-week analysis project on the dataset that uses more complex methods. You can assume that you would have help from colleagues with this task. As part of this task, you should suggest some expected top-line findings for the final research piece that your analysis could provide. Please also state the specific Python modules you would use.

- Notes: Contains a table representing all notes
- Ratings: Contains a table representing all ratings
- Note Status History: Contains a table with metadata about notes including what statuses they received and when.
- User Enrollment: Contains a table with metadata about each user's enrollment state.

# 1 - Packages

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List
import tools as tl
import importlib
import plotly.express as px
import networkx as nx
from tqdm import tqdm
import spacy
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.neural_network import MLPClassifier

In [2]:
#!python -m spacy download en_core_web_sm

# 2 - Data load & cleanup

In [3]:
# URLS for dataset downloads

note_url = (
    "https://ton.twimg.com/birdwatch-public-data/2024/10/17/notes/notes-00000.tsv"
)
ratings_urls = [
    "https://ton.twimg.com/birdwatch-public-data/2024/10/17/noteRatings/ratings-00005.tsv",
    "https://ton.twimg.com/birdwatch-public-data/2024/10/17/noteRatings/ratings-00004.tsv",
    "https://ton.twimg.com/birdwatch-public-data/2024/10/17/noteRatings/ratings-00000.tsv",
    "https://ton.twimg.com/birdwatch-public-data/2024/10/17/noteRatings/ratings-00006.tsv",
    "https://ton.twimg.com/birdwatch-public-data/2024/10/17/noteRatings/ratings-00002.tsv",
    "https://ton.twimg.com/birdwatch-public-data/2024/10/17/noteRatings/ratings-00001.tsv",
    "https://ton.twimg.com/birdwatch-public-data/2024/10/17/noteRatings/ratings-00007.tsv",
    "https://ton.twimg.com/birdwatch-public-data/2024/10/17/noteRatings/ratings-00003.tsv",
]
status_history_url = "https://ton.twimg.com/birdwatch-public-data/2024/10/17/noteStatusHistory/noteStatusHistory-00000.tsv"
user_enrol_status_url = "https://ton.twimg.com/birdwatch-public-data/2024/10/17/userEnrollment/userEnrollment-00000.tsv"

In [None]:
# Uncomment if running for first time:
# tl.download_to_parquet(note_url, fname="notes")
notes = pd.read_parquet("data/notes.parquet")
notes = notes.dropna(subset="summary")
notes["datetime"] = pd.to_datetime(notes["createdAtMillis"], unit="ms")
notes["harmful"] = notes["harmful"].fillna("Null")
notes["believable"] = notes["believable"].fillna("Null")
notes.info()

In [None]:
# Uncomment the download loop if running for first time
# for idx, url in enumerate(ratings_urls):
#     tl.download_to_parquet(url, fname=f"ratings_{idx}")

# Read every ratings shard and stack them into a single frame
ratings = pd.concat(
    [pd.read_parquet(f"data/ratings_{idx}.parquet") for idx in range(len(ratings_urls))]
)

In [None]:
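# Keep only the first ratings shard for exploration; this overwrites the
# concatenated frame built in the previous cell
ratings = pd.read_parquet("data/ratings_0.parquet")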
ratings.info()

In [None]:
# tl.download_to_parquet(status_history_url, fname="stat_hist")
stat_hist = pd.read_parquet("data/stat_hist.parquet")
stat_hist.info()

In [None]:
# tl.download_to_parquet(user_enrol_status_url, fname="user_stat")
user_stat = pd.read_parquet("data/user_stat.parquet")
user_stat.rename(columns={"participantId": "noteAuthorParticipantId"}, inplace=True)
user_stat.info()

# 3 - Schema of data tables

Understand how information can be linked between these separate tables via matching columns (i.e. finding join keys).

In [6]:
all_cols = {
    "notes": notes.columns,
    "user_stats": user_stat.columns,
    "ratings": ratings.columns,
    "status_history": stat_hist.columns,
}
graph_colors = {
    "notes": "#ea5545",
    "user_stats": "#ef9b20",
    "ratings": "#87bc45",
    "status_history": "#f46a9b",
}

In [7]:
g = nx.Graph()
for table in all_cols.keys():
    table_graph = nx.Graph()
    table_graph.add_node(table, color=graph_colors[table])
    for node in list(all_cols[table]):
        table_graph.add_node(node, color="gainsboro")
        table_graph.add_edge(table, node)
    g = nx.compose(g, table_graph)

In [None]:
importlib.reload(tl)

In [None]:
tl.plot_single_graph(g, layout="neato", color_attr=True, figsize=(30, 30))

In [11]:
node_degree_dict = nx.degree(g)
g2 = nx.subgraph(g, [x for x in g.nodes() if node_degree_dict[x] > 1])

In [None]:
tl.plot_single_graph(g2, color_attr=True)

Tables can be joined on noteId, participantId, and createdAtMillis for full data on a given author or note.
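As a rough illustration of the joins described above, here is a minimal sketch on toy frames. The `*_demo` frames and their values are invented for illustration. The column names follow the notebook, including the `participantId` to `noteAuthorParticipantId` rename applied to the user table.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are invented for illustration
notes_demo = pd.DataFrame(
    {"noteId": [1, 2, 3], "noteAuthorParticipantId": ["a", "b", "a"]}
)
stat_hist_demo = pd.DataFrame(
    {
        "noteId": [1, 2, 3],
        "currentStatus": [
            "NEEDS_MORE_RATINGS",
            "CURRENTLY_RATED_HELPFUL",
            "NEEDS_MORE_RATINGS",
        ],
    }
)
user_demo = pd.DataFrame(
    {
        "noteAuthorParticipantId": ["a", "b"],
        "modelingPopulation": ["CORE", "EXPANSION"],
    }
)

# validate= makes pandas raise if a join key is unexpectedly duplicated,
# which guards against silent row multiplication
joined_demo = notes_demo.merge(
    stat_hist_demo, on="noteId", how="left", validate="one_to_one"
).merge(
    user_demo, on="noteAuthorParticipantId", how="left", validate="many_to_one"
)
print(joined_demo)
```

A left join keeps every note even when the author or status record is missing, which makes gaps visible as NaN rather than dropping rows.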

# 4 - Notes dataset EDA

In [None]:
notes["summary_len"] = notes["summary"].apply(lambda x: len(x.split()))
print(notes["summary_len"].mean())
print(notes["summary_len"].std())

We can examine whether there is any apparent relationship between the binary scores in the 'feature' columns (user tags on the characteristics of a post) and the post's harmfulness rating.

In [None]:
import itertools

feature_cols = [
    "misleadingOther",
    "misleadingFactualError",
    "misleadingManipulatedMedia",
    "misleadingOutdatedInformation",
    "misleadingMissingImportantContext",
    "misleadingUnverifiedClaimAsFact",
    "misleadingSatire",
    "notMisleadingOther",
    "notMisleadingFactuallyCorrect",
    "notMisleadingOutdatedButNotWhenWritten",
    "notMisleadingClearlySatire",
    "notMisleadingPersonalOpinion",
    "trustworthySources",
]


notes_sum = (
    notes.groupby(["harmful"])[feature_cols]
    .sum()
    .reset_index()
    .melt(id_vars=["harmful"], var_name="category", value_name="count")
)

notes_sum["proportion"] = None
for harm, reason in itertools.product(
    notes_sum["harmful"].unique(), notes_sum["category"].unique()
):
    count = notes_sum[
        (notes_sum["harmful"] == harm) & (notes_sum["category"] == reason)
    ]["count"].values[0]
    total_at_harm = notes_sum[(notes_sum["harmful"] == harm)]["count"].sum()
    notes_sum.loc[
        (notes_sum["harmful"] == harm) & (notes_sum["category"] == reason), "proportion"
    ] = (count / total_at_harm)

fig, ax = plt.subplots(figsize=(20, 10))
colors = ["firebrick", "khaki", "forestgreen"]
sns.barplot(
    ax=ax, x="category", y="proportion", hue="harmful", data=notes_sum, palette=colors
)
fig = plt.gcf()
fig.autofmt_xdate()
sns.despine()

- All categories (considerable_harm, little_harm, and null) have modal features: misleadingFactualError, misleadingMissingImportantContext, misleadingUnverifiedClaimAsFact, and trustworthySources.
- considerable_harm posts are more likely to be tagged with the 'misleading' modal features from this list, and less likely to have 'trustworthy sources'
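The proportions in the chart are normalised within each harm category, so the bars of one colour sum to 1. A minimal sketch of that normalisation on toy counts (the `demo` frame and its numbers are invented for illustration):

```python
import pandas as pd

# Toy long-format counts: one row per (harm category, tag); numbers are invented
demo = pd.DataFrame(
    {
        "harmful": [
            "CONSIDERABLE_HARM",
            "CONSIDERABLE_HARM",
            "LITTLE_HARM",
            "LITTLE_HARM",
        ],
        "category": ["misleadingFactualError", "trustworthySources"] * 2,
        "count": [30, 10, 5, 15],
    }
)

# Divide each count by the total for its harm category
demo["proportion"] = demo["count"] / demo.groupby("harmful")["count"].transform("sum")
print(demo)
```

Because each harm category is normalised separately, the chart compares tag profiles between categories rather than absolute volumes.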

# 5 - Notes data over time

In [15]:
# separate pre & post-takeover data

notes_pre = notes[notes["datetime"] < dt.strptime("2022-10-28", "%Y-%m-%d")]
notes_post = notes[notes["datetime"] > dt.strptime("2022-10-28", "%Y-%m-%d")]

In [None]:
fig = px.histogram(
    notes,
    x="datetime",
    color_discrete_sequence=["firebrick"],
    opacity=0.8,
    width=1600,
    height=450,
)
fig.update_layout(
    margin=dict(l=20, r=20, t=30, b=20),
    xaxis_title=None
)


fig.add_vline(
    x=dt.strptime("2022-10-28", "%Y-%m-%d").timestamp() * 1000,
    annotation_text="Change in management",
    annotation=dict(font_size=18),
    opacity=0.2,
)

fig.show()



In [None]:

fig = px.histogram(
    notes_pre,

    x="datetime",
    color="harmful",
    color_discrete_sequence=["firebrick", "khaki", "forestgreen"],
    title="Pre-takeover, colored by harm category",
)


fig.show()


fig = px.histogram(
    notes_pre,

    x="datetime",
    color="believable",
    color_discrete_sequence=["firebrick", "khaki", "forestgreen"],
    title="Pre-takeover, colored by believability category",
)


fig.show()

In [None]:
notes["classification"].value_counts()

# 6 - Classifying harmfulness level based on feature columns
We can use the pre-takeover flags of 'considerable' and 'little' harm as labels to train a neural-network classifier, using the feature columns as our training data. This may allow us to estimate the proportions of post-takeover posts in the 'considerable' and 'little' harm categories, even though these labels were discontinued post-takeover.

In [19]:
# separate train-test portion from prediction portion
tt_notes = notes[notes["datetime"] < dt.strptime("2022-10-28", "%Y-%m-%d")]


pred_notes = notes[notes["datetime"] > dt.strptime("2022-10-28", "%Y-%m-%d")]

In [None]:
# Train/test split from pre-takeover data
feature_cols = [
    "misleadingOther",
    "misleadingFactualError",
    "misleadingManipulatedMedia",
    "misleadingOutdatedInformation",
    "misleadingMissingImportantContext",
    "misleadingUnverifiedClaimAsFact",
    "misleadingSatire",
    "notMisleadingOther",
    "notMisleadingFactuallyCorrect",
    "notMisleadingOutdatedButNotWhenWritten",
    "notMisleadingClearlySatire",
    "notMisleadingPersonalOpinion",
    "trustworthySources",
]


X = tt_notes[feature_cols]



y = tt_notes["harmful"]



X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)



clf = MLPClassifier(
    hidden_layer_sizes=(100, 100, 100),
    max_iter=500,
    alpha=0.0001,
    solver="sgd",
    verbose=10,
    random_state=21,
    tol=0.000000001,
)



clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

Our overall performance (weighted avg) is 75% precision, 78% recall.

This is decent, but not fantastic. It could potentially be improved with some feature engineering in the future. We probably wouldn't want to use this model to predict at the level of individual posts, but the performance is sufficient for us to examine the overall trend of posts categorised as harmful after the takeover.
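One way to put these scores in context is a majority-class baseline. Weighted-average recall equals overall accuracy, so it can be compared directly with the accuracy of always predicting the most frequent label. A minimal sketch, where `majority_baseline_accuracy` is a hypothetical helper and the labels are invented:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Invented label distribution, for illustration only
demo_labels = ["CONSIDERABLE_HARM"] * 7 + ["LITTLE_HARM"] * 2 + ["Null"] * 1
baseline = majority_baseline_accuracy(demo_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

In the notebook this could be applied to `y_test`. If the baseline is close to the classifier's weighted recall, the model is adding little beyond the class imbalance.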

In [None]:
# Run prediction on post-takeover data
post_data = notes_post[feature_cols]
post_dts = notes_post["datetime"]

post_pred = clf.predict(post_data)
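# Pair predictions with timestamps by position; assigning post_dts as a column
# would align on index and produce NaT, because notes_post keeps the notes index
pred_df = pd.DataFrame({"harmful": post_pred, "datetime": post_dts.values})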

# Plot results

color_dict = {
    "CONSIDERABLE_HARM": "firebrick",
    "LITTLE_HARM": "forestgreen",
    "Null": "khaki",
}


pred_df = pred_df.sort_values(by="harmful")

fig = px.histogram(
    pred_df,
    x="datetime",
    color="harmful",
    color_discrete_map=color_dict,
    opacity=0.8,
    title="Predicted harm labels for post-takeover notes",
    width=900,
    height=450,
)
fig.update_layout(legend=dict(
    yanchor="top",
    y=0.99,
    xanchor="left",
    x=0.01
))

fig.show()

In [None]:
pred_df[pred_df["datetime"] > dt.strptime("2024-01-01", "%Y-%m-%d")][
    "harmful"
].value_counts(normalize=True)

In [None]:
tt_notes[tt_notes["datetime"] < dt.strptime("2024-01-01", "%Y-%m-%d")][
    "harmful"
].value_counts(normalize=True)

In [None]:
# Stat test on harmfulness proportions, pre vs post takeover
from scipy.stats import chisquare

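# chisquare expects counts (not percentages), with categories in the same order
# and matching totals for observed and expected
exp_props = tt_notes["harmful"].value_counts(normalize=True)
obs_counts = pred_df["harmful"].value_counts().reindex(exp_props.index, fill_value=0)
chisquare(f_obs=obs_counts.values, f_exp=(exp_props * obs_counts.sum()).values)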

Most of the recent, unlabelled notes are predicted to relate to 'considerable harm' content!

Let's build a network graph of commonly occurring named entities in the Community Notes summaries predicted as considerable harm.

In [28]:
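# map predicted harm labels onto post-takeover notes dataframe
# (post_pred is in notes_post row order; merging on datetime would clash with the
# existing "harmful" column and can duplicate rows that share a timestamp)
notes_post = notes_post.assign(harmful=post_pred)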

In [None]:
harmful_notes = notes_post[
    (notes_post["harmful"] == "CONSIDERABLE_HARM")
    & (notes_post["summary"].str.contains("election|voting|voter"))
].reset_index()
g = tl.entity_graph(harmful_notes, top_prop=0.1)


In [None]:
largest_cc = max(nx.connected_components(g), key=len)  # get largest component of graph
subg = g.subgraph(largest_cc).copy()
tl.plot_single_graph(subg, layout="neato", save_path="plots/election_network.png", figsize=(20, 12))

Lots of familiar names, and the network structure makes intuitive sense (big names in the centre)

In [None]:
degree_sequence = sorted((d for n, d in subg.degree()), reverse=True)
dmax = max(degree_sequence)

fig, axs = plt.subplots(nrows=1, ncols=2)

axs[0].plot(degree_sequence, "b-", marker="o")
axs[0].set_title("Degree Rank Plot")
axs[0].set_ylabel("Degree")
axs[0].set_xlabel("Rank")


axs[1].bar(*np.unique(degree_sequence, return_counts=True))
axs[1].set_title("Degree histogram")
axs[1].set_xlabel("Degree")
axs[1].set_ylabel("# of Nodes")

fig.tight_layout()
plt.show()

In [70]:
# Highest degree nodes:
node_degrees = nx.degree_centrality(subg)
sorted_degrees = dict(
    sorted(node_degrees.items(), key=lambda item: item[1], reverse=True)
)

In [None]:
print("Number of unique Authors:")
print(len(notes["noteAuthorParticipantId"].unique()))

print("Number of notes")
print(notes.shape[0])

In [None]:
notes["noteAuthorParticipantId"].value_counts(normalize=True).mul(100).reset_index()

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

notes["noteAuthorParticipantId"].value_counts().plot.hist(
    ax=axs[0], bins=50, log=True, alpha=0.8
)
axs[0].set_xlabel("note Author Participant Id counts in dataset")
axs[0].set_ylabel("Number")
axs[0].set_title("noteAuthor counts")

notes["noteAuthorParticipantId"].value_counts(normalize=True).mul(100).head(
    10
).plot.bar(ax=axs[1], alpha=0.8)
axs[1].set_xlabel("User")
axs[1].set_xticks([])
axs[1].set_ylabel("% of total notes")
axs[1].set_title("Contribution of 10 most active users")


sns.despine()
sns.set_context("notebook")
plt.show()

The vast majority of unique author IDs have very few entries in the dataframe. Very few users have more than 4000 entries, but those users account for a highly disproportionate share of the records.
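This concentration can be summarised as a single number: the share of all notes written by the top k authors. A minimal sketch, where `top_k_share` is a hypothetical helper and the author IDs are invented:

```python
from collections import Counter


def top_k_share(author_ids, k):
    """Fraction of all records contributed by the k most active authors."""
    counts = Counter(author_ids)
    top_total = sum(count for _, count in counts.most_common(k))
    return top_total / len(author_ids)


# Invented author IDs: one heavy contributor and a tail of light ones
demo_authors = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
share = top_k_share(demo_authors, k=1)
print(f"Top author wrote {share:.0%} of notes")
```

On the real data the same figure should be obtainable with `notes["noteAuthorParticipantId"].value_counts(normalize=True).head(k).sum()`.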

# 7 - User stats

In [72]:
def plots_from_value_counts(df, col, proportional=False, kind="bar", date_fmt=False):
    fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
    counts = df[col].value_counts(normalize=proportional)
    counts.plot(kind=kind, ax=axs[0], alpha=0.8)
    axs[0].set_xlabel(col)
    axs[0].set_ylabel("#")

    counts.plot(kind=kind, ax=axs[1], alpha=0.8, logy=True)
    axs[1].set_xlabel(col)
    axs[1].set_ylabel("# log scale")
    if date_fmt:
        fig = plt.gcf()
        fig.autofmt_xdate()
    sns.despine()
    sns.set_context("notebook")
    plt.show()

In [None]:
data_cols = [
    "enrollmentState",
    "successfulRatingNeededToEarnIn",
    "modelingPopulation",
    "modelingGroup",
    "numberOfTimesEarnedOut",
]

print(user_stat["enrollmentState"].value_counts(normalize=True))
plots_from_value_counts(user_stat, "enrollmentState", date_fmt=True)

The vast majority (~97%) of Community Notes contributors are either 'new' (58%) or 'earned in' (38%).

In [None]:
plots_from_value_counts(user_stat, "modelingPopulation", date_fmt=True)

The majority of users are in the 'core' population, which is supposedly used as a reliable baseline of longer-term contributors. Smaller proportions are in the 'expansion' and 'expansion plus' populations.

In [None]:
user_stat.head()

In [None]:
for population in ["CORE", "EXPANSION", "EXPANSION_PLUS"]:
    pop_stat = user_stat[user_stat["modelingPopulation"] == population]
    print(population)
    print(pop_stat["enrollmentState"].value_counts(normalize=True))
    plots_from_value_counts(pop_stat, "enrollmentState", date_fmt=True)

In [None]:
# Contingency table of raw counts (enrollmentState x modelingPopulation);
# a chi-square test needs observed frequencies rather than percentages
populations = ["CORE", "EXPANSION", "EXPANSION_PLUS"]
pop_stat = user_stat[user_stat["modelingPopulation"].isin(populations)]
df = pd.crosstab(pop_stat["enrollmentState"], pop_stat["modelingPopulation"])


In [95]:
from scipy.stats import chi2_contingency
res = chi2_contingency(df.values, correction=True)

In [None]:
res

Populations look fairly similar w.r.t. user privileges. Might be worth looking at their actual ratings/contributions to see if they differ or align vs the core population?
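A caveat when judging similarity with a chi-square test: the Pearson statistic scales with sample size, so it should be computed from raw counts rather than percentages. The sketch below computes the uncorrected statistic by hand on two invented tables with identical proportions. `chi2_statistic` is a hypothetical helper, and it omits the continuity correction that `scipy.stats.chi2_contingency` applies to 2x2 tables by default.

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Same proportions, ten times the sample size (invented numbers)
small = [[30, 20], [20, 30]]
large = [[300, 200], [200, 300]]
print(chi2_statistic(small), chi2_statistic(large))
```

The larger table gives a statistic ten times bigger for the same proportions, which is why rescaling counts to percentages changes the test result.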

In [None]:
notes_post["noteAuthorParticipantId"] = notes_post["noteAuthorParticipantId"].astype(
    str
)

In [55]:
notes_post = notes_post.merge(
    user_stat[["noteAuthorParticipantId", "modelingPopulation"]],
    on="noteAuthorParticipantId",
)

In [None]:
fig = px.histogram(
    notes_post,
    x="datetime",
    color="modelingPopulation",
    color_discrete_sequence=["forestgreen", "khaki", "firebrick"],
    opacity=0.8,
)
fig.show()

# MISC User status

In [None]:
user_stat.info()

In [None]:
print(user_stat["numberOfTimesEarnedOut"].value_counts(normalize=True))
plots_from_value_counts(user_stat, "numberOfTimesEarnedOut")

In [None]:
problem_users = user_stat[user_stat["numberOfTimesEarnedOut"] >= 10][
    "noteAuthorParticipantId"
].unique()

len(problem_users)

In [None]:
user_stat[user_stat["noteAuthorParticipantId"].isin(problem_users)]["modelingPopulation"].value_counts()

In [None]:
problem_user_notes = notes[
    notes["noteAuthorParticipantId"].isin(problem_users)
].reset_index()

problem_user_notes.shape

Here we have 2939 notes authored by 3 'problem users' (users who have been 'earned out' 10 or more times)

In [None]:
g = tl.entity_graph(problem_user_notes)

In [None]:
largest_cc = max(nx.connected_components(g), key=len)  # get largest component of graph
subg = g.subgraph(largest_cc).copy()
tl.plot_single_graph(subg)

In [127]:
notes["problem"] = False
notes.loc[notes["noteAuthorParticipantId"].isin(problem_users), "problem"] = True

In [None]:
problem_notes = notes[notes["problem"]==True]
print(problem_notes[["noteId", "summary"]].head().to_markdown())

Potentially interesting: are these repeatedly earned-out users posting bad information again and again, or just unpopular takes? This could be an interesting thread, especially if they're posting bad political/election content and not being permanently banned.

# MISC helpfulness ratings of notes

In [None]:
import itertools

feature_cols = [
    "helpfulInformative",
    "helpfulClear",
    "helpfulEmpathetic",
    "helpfulGoodSources",
    "helpfulUniqueContext",
    "helpfulAddressesClaim",
    "helpfulImportantContext",
    "helpfulUnbiasedLanguage",
    "notHelpfulOther",
    "notHelpfulIncorrect",
    "notHelpfulSourcesMissingOrUnreliable",
    "notHelpfulOpinionSpeculationOrBias",
    "notHelpfulMissingKeyPoints",
    "notHelpfulOutdated",
    "notHelpfulHardToUnderstand",
    "notHelpfulArgumentativeOrBiased",
    "notHelpfulOffTopic",
    "notHelpfulSpamHarassmentOrAbuse",
    "notHelpfulIrrelevantSources",
    "notHelpfulOpinionSpeculation",
    "notHelpfulNoteNotNeeded",
]


ratings_sum = (
    ratings.groupby(["helpfulnessLevel"])[feature_cols]
    .sum()
    .reset_index()
    .melt(id_vars=["helpfulnessLevel"], var_name="category", value_name="count")
)

ratings_sum["proportion"] = None
for helpfulness, reason in itertools.product(
    ratings_sum["helpfulnessLevel"].unique(), ratings_sum["category"].unique()
):
    count = ratings_sum[
        (ratings_sum["helpfulnessLevel"] == helpfulness)
        & (ratings_sum["category"] == reason)
    ]["count"].values[0]
    total_at_hfulness = ratings_sum[(ratings_sum["helpfulnessLevel"] == helpfulness)][
        "count"
    ].sum()
    ratings_sum.loc[
        (ratings_sum["helpfulnessLevel"] == helpfulness)
        & (ratings_sum["category"] == reason),
        "proportion",
    ] = (
        count / total_at_hfulness
    )

fig, ax = plt.subplots(figsize=(20, 10))
colors = ["forestgreen", "firebrick", "khaki"]
sns.barplot(
    ax=ax,
    x="category",
    y="proportion",
    hue="helpfulnessLevel",
    data=ratings_sum,
    palette=colors,
)
fig = plt.gcf()
fig.autofmt_xdate()
sns.despine()

# MISC - Free text cleanup

In [None]:
importlib.reload(tl)
focus_df = notes[(notes["datetime"] > "2024-01-01")].reset_index()
print(focus_df.shape)

Two-stage cleanup/filter for English-language content:
1. Keep ASCII-only text (a rough proxy for Latin script)
2. Keep only text detected as English
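A quick sketch of what the first stage keeps and drops, using invented sample sentences. Note that `str.isascii()` also rejects English text containing typographic characters such as curly quotes, so stage 1 may discard some valid English notes.

```python
# Invented sample summaries, for illustration only
samples = [
    "Polling places close at 8pm.",
    "Les bureaux de vote ferment \u00e0 20h.",  # French, contains an accented letter
    "The claim is \u201cfalse\u201d.",  # English, but with curly quotes
]

# Stage 1: keep only strings made up entirely of ASCII characters
ascii_only = [text for text in samples if text.isascii()]
print(ascii_only)
```

If losing such notes matters, one option would be to normalise typographic quotes and dashes before applying the ASCII check.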

In [None]:
focus_df = focus_df.dropna(subset=["summary"])
focus_df["asci"] = focus_df["summary"].apply(lambda x: x.isascii())
focus_df = focus_df[focus_df["asci"] == True]
focus_df.reset_index(inplace=True)
print(focus_df.shape)

In [None]:
from spacy_langdetect import LanguageDetector
from spacy.language import Language


def get_lang_detector(nlp, name):
    return LanguageDetector()


nlp = spacy.load("en_core_web_sm")
Language.factory("language_detector", func=get_lang_detector)
nlp.add_pipe("language_detector", last=True)

In [None]:
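def eng_check(text):
    """Return True if the text is detected as English with a score above 0.6."""
    doc = nlp(text)
    lang = doc._.language
    # Check the detected language as well as the confidence score
    return lang["language"] == "en" and lang["score"] > 0.6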


"""
for idx, row in tqdm(focus_df.iterrows(), total=focus_df.shape[0]):
    text = row["summary"]
    doc = nlp(text)
    lang = doc._.language
    if lang["score"] > 0.6:
        focus_df.at[idx, "language"] = lang["language"]

"""
tqdm.pandas()
focus_df["eng"] = focus_df["summary"].progress_apply(eng_check)


# save if first time running

# notes.to_parquet("notes_fmt.parquet")

In [27]:
focus_df.to_parquet("data/focus_df.parquet")
# focus_df = pd.read_parquet("data/focus_df.parquet")

In [None]:
# subselect to English only
focus_df = focus_df[focus_df["eng"] == True]
focus_df.shape

In [None]:
# Keyword subselection
focus_df = focus_df[(focus_df["summary"].str.contains("election|vote|voting"))]
focus_df.shape

In [None]:
# Graph network of entities in election posts, keeping top 50% of entities (by occurrence)
g = tl.entity_graph(focus_df, top_prop=0.5)
largest_cc = max(nx.connected_components(g), key=len)  # get largest component of graph
subg = g.subgraph(largest_cc).copy()

In [39]:
nx.write_graphml_lxml(subg, "data/focus_g.graphml")
# subg = nx.read_graphml("data/focus_g.graphml")

In [None]:
tl.plot_single_graph(subg, layout="neato", figsize=(30, 30))

The graph is still very big and needs some cleanup (e.g. duplicates, single name entities), but we can already see some structure here:
- There are apparent clusters relating to political activity in several countries (e.g. US, UK, India)
- 'hub' nodes for some important names (e.g. Narendra Modi, Boris Johnson, Trump/Harris/Biden/Musk)
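The 'hub' ranking used below relies on degree centrality, which networkx defines as a node's degree divided by the number of other nodes (n - 1). A minimal sketch of that calculation on a toy edge list: the names come from the hubs noted above, but the edges are invented and `degree_centrality` is a hypothetical stand-in for `nx.degree_centrality`.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree of each node divided by (n - 1), for an undirected edge list."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    n = len(neighbours)
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}


# Invented co-occurrence edges, for illustration only
demo_edges = [
    ("Trump", "Biden"),
    ("Trump", "Harris"),
    ("Trump", "Musk"),
    ("Biden", "Harris"),
]
centrality = degree_centrality(demo_edges)
print(sorted(centrality.items(), key=lambda item: item[1], reverse=True))
```

A value of 1.0 means the entity co-occurs with every other entity in the graph, so the scores are comparable across graphs of different sizes.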

In [None]:
degree_sequence = sorted((d for n, d in subg.degree()), reverse=True)
dmax = max(degree_sequence)

fig, axs = plt.subplots(nrows=1, ncols=2)

axs[0].plot(degree_sequence, "b-", marker="o")
axs[0].set_title("Degree Rank Plot")
axs[0].set_ylabel("Degree")
axs[0].set_xlabel("Rank")


axs[1].bar(*np.unique(degree_sequence, return_counts=True))
axs[1].set_title("Degree histogram")
axs[1].set_xlabel("Degree")
axs[1].set_ylabel("# of Nodes")

fig.tight_layout()
plt.show()

In [47]:
# Highest degree nodes:
node_degrees = nx.degree_centrality(subg)
sorted_degrees = dict(
    sorted(node_degrees.items(), key=lambda item: item[1], reverse=True)
)

In [None]:
degree_sequence[0:5]

In [None]:
hub_nodes = {k: sorted_degrees[k] for k in list(sorted_degrees)[:10]}
hub_list = list(hub_nodes.keys())
plt.bar(range(len(hub_nodes)), list(hub_nodes.values()), align="center")
plt.xticks(range(len(hub_nodes)), list(hub_nodes.keys()))
plt.ylabel("Degree centrality")
fig = plt.gcf()
fig.autofmt_xdate()
sns.despine()
plt.show()

# Status history



In [None]:
stat_hist.head()

In [None]:
stat_hist.shape

In [None]:
# .copy() so the derived columns below are added to an independent frame
rated_stats = stat_hist[stat_hist["currentStatus"] != "NEEDS_MORE_RATINGS"].copy()
rated_stats.shape

In [None]:
rated_stats["hr_to_first_stat"] = (
    rated_stats["timestampMillisOfFirstNonNMRStatus"] - rated_stats["createdAtMillis"]
) / 3600000

rated_stats["hr_to_current_stat"] = (
    rated_stats["timestampMillisOfCurrentStatus"] - rated_stats["createdAtMillis"]
) / 3600000

rated_stats["hr_between_first_and_current_stat"] = (
    rated_stats["timestampMillisOfCurrentStatus"]
    - rated_stats["timestampMillisOfFirstNonNMRStatus"]
) / 3600000

In [None]:
# 24 hours per day
rated_stats["days_to_current_stat"] = rated_stats["hr_to_current_stat"] / 24

In [None]:
sns.histplot(data=rated_stats, x="days_to_current_stat", hue="currentStatus")

In [None]:
rated_stats.value_counts(["firstNonNMRStatus", "currentStatus"])