# RQ3 Link Analysis: 2020 US Elections 

In [28]:
import json
import pandas as pd
import numpy as np
import networkx as nx

Import downloaded tweets

In [2]:
with open("data/final_project.json", "rb") as f:
    data = f.readlines()
    data = [json.loads(str_) for str_ in data]

In [3]:
df_tweets = pd.DataFrame.from_records(data)

In [4]:
print("The total number of tweets is: {}" .format(len(df_tweets)))

The total number of tweets is: 111925


From the downloaded tweets, generate the retweet graph. The final directed graph G = (V, E) consists of all users who retweeted at least once, and an edge (u, v) means that user u retweeted at least one tweet posted by user v. In this part, we test several contact recommendation algorithms to predict the next retweets:

a. Generate the retweet graph

In [5]:
# Keep only the tweets that are retweets (their text starts with "RT")
df_retweets = df_tweets[df_tweets["text"].str.startswith("RT")]

In [6]:
# Remove retweets whose retweeted_status is corrupted (i.e. not a dict)
is_valid = df_retweets["retweeted_status"].apply(lambda status: isinstance(status, dict))
removed_tweets = int((~is_valid).sum())

# Keep the valid rows as an independent copy, so later changes do not touch a slice of df_tweets
df_retweets = df_retweets[is_valid].copy()

print("The number of removed tweets is: {}".format(removed_tweets))

The number of removed tweets is: 12


In [7]:
df_graph = pd.DataFrame(columns=["source", "destination"])

# add source-nodes
df_graph["source"] = df_retweets["user"].apply(lambda x: x["screen_name"]) #Retrieve name of user retweeting

# add destination-nodes
df_graph["destination"] = df_retweets["retweeted_status"].apply(lambda x: x["user"]["screen_name"]) #Retrieve name of user that is being retweeted

In [8]:
df_graph.drop_duplicates(inplace=True)
df_graph.head()

Unnamed: 0,source,destination
0,wabisabine,Amy_Siskind
1,nikriv2,AdamParkhomenko
2,FoxfireStronomy,tara_atrandom
3,haschke_sam,realDonaldTrump
4,LisPG168,realTrumpForce
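
To make the edge direction explicit, here is a minimal sketch of the same graph as a plain adjacency mapping. The user names and the helper `build_retweet_graph` are hypothetical; only the (source, destination) convention comes from the cells above.

```python
from collections import defaultdict

# Hypothetical edge list in the same (source, destination) form as df_graph
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("alice", "bob"),  # repeated retweet, collapses into a single edge
]

def build_retweet_graph(edge_list):
    """Map each retweeting user to the set of users they retweeted."""
    successors = defaultdict(set)
    for source, destination in edge_list:
        successors[source].add(destination)
    return dict(successors)

graph = build_retweet_graph(edges)
num_edges = sum(len(targets) for targets in graph.values())
print(sorted(graph["alice"]), num_edges)
```

With NetworkX, the same structure can presumably be obtained from the DataFrame with `nx.from_pandas_edgelist(df_graph, "source", "destination", create_using=nx.DiGraph)`.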


b. Split the edges into train/test sets, where the test set contains 30% of the edges of the graph. These edges are sampled at random from the whole edge list.

In [27]:
# Generate the test set with 30% of the edges (df_graph holds one row per edge)
df_retweets_test = df_graph.sample(frac=0.3, random_state=1)

# Generate the train set with the remaining edges (70%)
df_retweets_train = df_graph.drop(df_retweets_test.index)
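
As a sanity check on what an edge-level split should guarantee, the sketch below splits a toy edge list and leaves the two parts disjoint. The helper `split_edges` and the toy edges are assumptions for illustration, not part of the notebook.

```python
import random

def split_edges(edge_list, test_fraction=0.3, seed=1):
    """Randomly split the unique edges into disjoint train and test lists."""
    unique_edges = sorted(set(edge_list))  # fixed order before shuffling, for reproducibility
    rng = random.Random(seed)
    rng.shuffle(unique_edges)
    n_test = round(len(unique_edges) * test_fraction)
    return unique_edges[n_test:], unique_edges[:n_test]

# Hypothetical edge list with 10 distinct edges
edges = [("u%d" % i, "v%d" % (i % 4)) for i in range(10)]
train_edges, test_edges = split_edges(edges)
print(len(train_edges), len(test_edges))
```

Splitting the deduplicated edges, rather than the raw tweets, keeps the same (u, v) pair from appearing in both train and test.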

c. Network-based predictions: Train 4 different algorithms to predict the edges in the test set. The predictions should use only the edge information and be run only over candidate edges toward nodes at distance 2 (friends of friends) from the source nodes contained in the test set.

The list of potential recommendations is given only by the friends of friends of the source nodes included in the test set.

The training phase is instead applied to the remaining 70% of the edges.
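
The candidate set described above can be sketched as follows. This assumes that "distance 2" follows the direction of the retweet edges and that users already linked to the source are excluded; the toy graph and the helper `friends_of_friends` are hypothetical.

```python
def friends_of_friends(successors, source):
    """Return the nodes two hops from `source` that are not already linked to it."""
    direct = successors.get(source, set())
    candidates = set()
    for friend in direct:
        candidates |= successors.get(friend, set())
    # Drop existing links and the source itself
    return candidates - direct - {source}

# Hypothetical training graph: user -> set of users they retweeted
train_graph = {
    "alice": {"bob"},
    "bob": {"carol", "dave"},
    "carol": {"alice"},
}
test_sources = ["alice", "carol"]
candidates = {user: friends_of_friends(train_graph, user) for user in test_sources}
print(candidates)
```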

c.1. Adamic-Adar

In [29]:
# Keep only the (source, destination) columns of the training rows:
# add_edges_from expects 2-tuples or 3-tuples, not whole tweet records
train_edges = df_graph.loc[df_graph.index.intersection(df_retweets_train.index)]

G_train = nx.DiGraph()
G_train.add_edges_from(train_edges[["source", "destination"]].itertuples(index=False, name=None))

# adamic_adar_index is not implemented for directed graphs,
# so score the pairs on an undirected view of the training graph
preds = nx.adamic_adar_index(G_train.to_undirected())

In [None]:
import networkx as nx

# Toy example of adamic_adar_index on a complete graph with 5 nodes
G = nx.complete_graph(5)
preds = nx.adamic_adar_index(G, [(0, 1), (2, 3)])
for u, v, p in preds:
    print("(%d, %d) -> %.8f" % (u, v, p))

# Expected output:
# (0, 1) -> 2.16404256
# (2, 3) -> 2.16404256
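
For reference, the Adamic-Adar score of a pair (u, v) is the sum of 1 / log(degree(w)) over their common neighbors w. The sketch below computes it by hand on undirected neighbor sets; the helper names are hypothetical and edge direction is ignored, as an assumption.

```python
import math
from collections import defaultdict

def undirected_neighbors(edge_list):
    """Build neighbor sets that ignore edge direction and skip self-loops."""
    neighbors = defaultdict(set)
    for u, v in edge_list:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    return neighbors

def adamic_adar(neighbors, u, v):
    """Sum 1 / log(degree(w)) over the common neighbors w of u and v."""
    if u == v:
        return 0.0
    common = neighbors[u] & neighbors[v]
    # Every common neighbor is linked to both u and v, so its degree is at least 2
    return sum(1.0 / math.log(len(neighbors[w])) for w in common)

# Complete graph on 5 nodes: each pair has 3 common neighbors of degree 4
k5_edges = [(i, j) for i in range(5) for j in range(i + 1, 5)]
score = adamic_adar(undirected_neighbors(k5_edges), 0, 1)
print("%.8f" % score)
```

On this graph the score is 3 / log(4), which agrees with the value 2.16404256 shown in the cell above.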

c.2. Alternating Least Squares
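
A minimal sketch of the idea, assuming the training edges are encoded as a dense 0/1 matrix R with R[u, v] = 1 when u retweeted v. This simplified version treats zeros as observed values; it is not the weighted implicit-feedback variant, and the function `als` and its parameters are hypothetical.

```python
import numpy as np

def als(R, rank=2, reg=0.1, iterations=20, seed=0):
    """Factorise R (users x items) as U @ V.T by alternating ridge regressions."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iterations):
        # Fix V and solve the regularised least-squares problem for all user rows
        U = np.linalg.solve(V.T @ V + ridge, V.T @ R.T).T
        # Fix U and solve the regularised least-squares problem for all item rows
        V = np.linalg.solve(U.T @ U + ridge, U.T @ R).T
    return U, V

# Hypothetical retweet matrix with two communities
R = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
U, V = als(R)
scores = U @ V.T  # predicted affinity of each user for each potential retweet target
print(np.round(scores, 2))
```

Candidate edges can then be ranked by their entry in `scores`, restricted to the friends-of-friends candidates of each source.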

c.3. PageRank
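
One way to turn PageRank into a per-user recommender is to personalise it, restarting the random walk at the source node. The sketch below is a plain power iteration under that assumption; the function name, the toy graph, and the choice to send dangling mass back to the source are all illustrative.

```python
def personalized_pagerank(successors, source, alpha=0.85, iterations=50):
    """Power iteration for PageRank with all restart mass placed on `source`."""
    nodes = set(successors)
    for targets in successors.values():
        nodes |= set(targets)
    rank = {node: 0.0 for node in nodes}
    rank[source] = 1.0
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in nodes}
        for node, score in rank.items():
            targets = successors.get(node)
            if targets:
                share = alpha * score / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: send its walk mass back to the source
                new_rank[source] += alpha * score
        new_rank[source] += 1.0 - alpha  # restart probability
        rank = new_rank
    return rank

# Hypothetical training graph: user -> set of users they retweeted
train_graph = {"alice": {"bob"}, "bob": {"carol", "dave"}, "carol": {"alice"}}
rank = personalized_pagerank(train_graph, "alice")
print(sorted(rank.items(), key=lambda item: -item[1]))
```

In NetworkX this should correspond to `nx.pagerank(G, alpha=0.85, personalization={source: 1})`; the scores of the friends-of-friends candidates can then be used for ranking.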

c.4. Node2vec
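
Node2vec learns node embeddings from biased second-order random walks, which are then fed to a skip-gram model. The sketch below covers only the walk generation, with the return parameter p and the in-out parameter q; the function name and the toy graph are hypothetical, and the embedding step is left out.

```python
import random

def node2vec_walk(neighbors, start, length, p=1.0, q=1.0, rng=None):
    """Second-order random walk: p penalises returning, q penalises moving outward."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        options = sorted(neighbors.get(current, ()))
        if not options:
            break  # dead end
        if len(walk) == 1:
            walk.append(rng.choice(options))
            continue
        previous = walk[-2]
        weights = []
        for candidate in options:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in neighbors.get(previous, ()):
                weights.append(1.0)      # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(options, weights=weights, k=1)[0])
    return walk

# Hypothetical undirected neighbor sets
neighbors = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
walk = node2vec_walk(neighbors, "a", length=6, p=0.5, q=2.0)
print(walk)
```

Once embeddings are trained on many such walks, a candidate edge (u, v) can be scored, for example, by the similarity of the two embedding vectors.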

d. Generate a top-10 list of recommendations for all the source nodes present in the test-set.
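
Given a score for every candidate of a source node, the top-10 list is just the ten best-scoring candidates. A minimal sketch, with hypothetical helper names and scores, and ties broken alphabetically as an assumption:

```python
def top_k(scores, k=10):
    """Return the k highest-scoring candidates, best first; ties broken alphabetically."""
    return sorted(scores, key=lambda candidate: (-scores[candidate], candidate))[:k]

def recommend(scores_by_source, k=10):
    """Build one ranked list of recommendations per source node."""
    return {source: top_k(scores, k) for source, scores in scores_by_source.items()}

# Hypothetical candidate scores produced by any of the algorithms above
scores_by_source = {
    "alice": {"carol": 0.9, "dave": 0.4, "erin": 0.9},
    "bob": {"alice": 0.1},
}
recommendations = recommend(scores_by_source, k=2)
print(recommendations)
```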

RQ 3A - Which is the best algorithm among the 4 selected in terms of accuracy? HINT - Use nDCG plus one other measure of your choice to compare the results.

In [None]:
# evaluation
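
A possible shape for the evaluation, assuming binary relevance (a recommended user is relevant if the edge is in the test set) and precision@k as the second measure. The function names and the example data are hypothetical.

```python
import math

def dcg_at_k(ranked, relevant, k=10):
    """Discounted cumulative gain with binary relevance."""
    return sum(1.0 / math.log2(position + 2)
               for position, item in enumerate(ranked[:k]) if item in relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """DCG normalised by the DCG of an ideal ranking of the relevant items."""
    ideal_hits = min(len(relevant), k)
    if ideal_hits == 0:
        return 0.0
    ideal = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg_at_k(ranked, relevant, k) / ideal

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the k recommendation slots filled with relevant items."""
    return sum(item in relevant for item in ranked[:k]) / k

# Hypothetical ranked list for one source node and its held-out test edges
ranked = ["bob", "carol", "dave"]
relevant = {"carol"}
print(ndcg_at_k(ranked, relevant, k=3), precision_at_k(ranked, relevant, k=3))
```

Averaging both measures over all source nodes in the test set gives one number per algorithm to compare.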

Now, exploiting other features, such as the text of the tweets or other user information, try to answer the next question.

RQ 3B - Propose a new strategy to predict the links in the test set. What is the accuracy of the new algorithm compared with the previous ones? Explain in detail the strategy you used and prove its effectiveness.
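
One possible direction, given only as an illustration and not as the answer to the question, is a content-based score: represent each user by the words in their tweets and rank candidates by cosine similarity. The tokenisation rule, the helper names, and the sample texts below are all assumptions.

```python
import math
import re
from collections import Counter

def token_profile(texts):
    """Bag-of-words counts over all tweets of a user (lowercased tokens)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9_#@]+", text.lower()))
    return counts

def cosine(profile_a, profile_b):
    """Cosine similarity between two token-count profiles."""
    dot = sum(count * profile_b[token] for token, count in profile_a.items())
    norm_a = math.sqrt(sum(count * count for count in profile_a.values()))
    norm_b = math.sqrt(sum(count * count for count in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical tweet texts per user
alice = token_profile(["Counting votes in #Georgia", "#Georgia recount update"])
bob = token_profile(["#Georgia recount is underway"])
carol = token_profile(["New pasta recipe tonight"])
print(cosine(alice, bob), cosine(alice, carol))
```

Such a score could be combined with a network-based score and evaluated with the same measures as the other four algorithms.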