# Extract data from MongoDB with pymongo

### Nodes

set_mongo_connection() sets up a connection to a MongoDB database from a connection string, a database name, and a collection name. It returns the client, db, and col objects, which can be used to perform further operations on the database.

get_node_users() retrieves information about Twitter users from a MongoDB database, either from tweets that include user mentions or from tweets that include certain keywords or hashtags. It creates a dictionary representing each user found, including the user's ID, username, number of followers (if available), and the date the tweet was created. It then combines the dictionaries into a single list, removes any duplicate users based on their ID, and returns the resulting list sorted by the date the tweet was created.
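The deduplicate-and-sort step described above can be sketched as follows. This is a minimal illustration, not the actual `get_node_users()` code: the helper name `dedupe_users` is hypothetical, the record keys are borrowed from the sample output shown later in this notebook, and keeping the most recent record per ID and sorting newest-first are assumptions based on that output.

```python
from datetime import datetime


def dedupe_users(user_records):
    """Keep one record per user ID and return the result sorted newest-first."""
    latest_by_id = {}
    for record in user_records:
        current = latest_by_id.get(record["id"])
        # Assumption: when a user appears more than once, keep the most recent record.
        if current is None or record["info_last_updated"] > current["info_last_updated"]:
            latest_by_id[record["id"]] = record
    return sorted(
        latest_by_id.values(),
        key=lambda record: record["info_last_updated"],
        reverse=True,
    )


# Hypothetical records: user "1" appears twice, user "2" has no follower count.
records = [
    {"id": "1", "username": "alice", "followers": 10, "info_last_updated": datetime(2023, 1, 20)},
    {"id": "2", "username": "bob", "followers": None, "info_last_updated": datetime(2023, 1, 25)},
    {"id": "1", "username": "alice", "followers": 12, "info_last_updated": datetime(2023, 1, 24)},
]
unique_users = dedupe_users(records)
```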


get_node_tweets() retrieves information about tweets from the specified collection. It searches for documents that contain a key called "includes.tweets" and extracts the tweet ID, author ID, creation time, and number of retweets. It also checks for referenced tweets and adds them to a list if present. The function returns a list of tweet objects.
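As a rough sketch of how one tweet could be flattened into such a node, the hypothetical helper `build_tweet_node` below assumes Twitter API v2 field names (`id`, `author_id`, `created_at`, `public_metrics.retweet_count`, `referenced_tweets`). The `None` placeholder for tweets without references mirrors the sample output later in this notebook; the real function may build it differently.

```python
def build_tweet_node(tweet):
    """Flatten one tweet object into a tweet-node dictionary."""
    referenced = [
        {"type": ref.get("type"), "tweet_id": ref.get("id")}
        for ref in tweet.get("referenced_tweets", [])
    ]
    if not referenced:
        # Assumption: original tweets carry a single placeholder entry.
        referenced = [{"type": None, "tweet_id": None}]
    return {
        "tweet_id": tweet["id"],
        "author_id": tweet["author_id"],
        "created_at": tweet["created_at"],
        "retweet_count": tweet.get("public_metrics", {}).get("retweet_count", 0),
        "referenced_tweets": referenced,
    }


# Hypothetical tweets: one original, one retweet.
original = build_tweet_node({
    "id": "100",
    "author_id": "7",
    "created_at": "2023-01-25T23:59:30.000Z",
    "public_metrics": {"retweet_count": 0},
})
retweet = build_tweet_node({
    "id": "101",
    "author_id": "8",
    "created_at": "2023-01-25T23:59:16.000Z",
    "public_metrics": {"retweet_count": 15},
    "referenced_tweets": [{"type": "retweeted", "id": "90"}],
})
```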

get_node_hashtags() retrieves a list of unique hashtags from the first tweet in each document in the specified collection. It uses the collection.find() method to search for documents that contain a key called "includes.tweets.0.entities.hashtags", which indicates that the first tweet in the document has at least one hashtag. It then loops through each matching document, extracts the hashtag tags from the first tweet, and adds them to a set to remove duplicates. Finally, the function converts the set to a list and returns it.
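The set-based deduplication can be illustrated with plain dictionaries standing in for MongoDB documents. The helper name `collect_unique_hashtags` is hypothetical, and the `tag` field name is an assumption taken from the Twitter API v2 entity format; `$exists` is the standard MongoDB operator for testing whether a key is present.

```python
# Filter for documents whose first tweet has a hashtags entry.
HASHTAG_QUERY = {"includes.tweets.0.entities.hashtags": {"$exists": True}}


def collect_unique_hashtags(documents):
    """Collect the distinct hashtag tags of the first tweet in each document."""
    unique_tags = set()
    for doc in documents:
        first_tweet = doc["includes"]["tweets"][0]
        for hashtag in first_tweet["entities"]["hashtags"]:
            unique_tags.add(hashtag["tag"])  # the set drops duplicates
    return list(unique_tags)


# Hypothetical documents shaped like the matches of HASHTAG_QUERY.
docs = [
    {"includes": {"tweets": [{"id": "1", "entities": {"hashtags": [{"tag": "python"}, {"tag": "django"}]}}]}},
    {"includes": {"tweets": [{"id": "2", "entities": {"hashtags": [{"tag": "python"}]}}]}},
]
hashtags = collect_unique_hashtags(docs)
```

With a live collection, the same helper could be fed `collection.find(HASHTAG_QUERY)`. Because a set has no defined order, the order of the returned list may vary between runs.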


get_node_urls() retrieves a list of unique URLs from the first tweet in each document in the specified collection. It uses the collection.find() method to search for documents that contain a key called "includes.tweets.0.entities.urls", which indicates that the first tweet in the document has at least one URL. It then loops through each matching document, extracts the URLs from the first tweet, and adds them to a set to remove duplicates. Finally, the function converts the set to a list and returns it.



### Relations

get_relationship_tweeted() takes a MongoDB collection as input and returns a list of dictionaries containing information about users who have tweeted. It first retrieves all documents that do not have any referenced tweets. For each document, it extracts the user ID, tweet ID, and created_at_converted. It then checks if the user ID is already in a dictionary called user_tweeted_tweet. If the user ID is not in the dictionary, it adds a new entry with the user ID, an empty list for tweeted tweets, and an empty list for the created_at_converted values. It then appends the tweet ID and created_at_converted value to the corresponding lists in the dictionary. Finally, it returns a list of the values in the user_tweeted_tweet dictionary.

get_relationship_retweeted() takes a MongoDB collection as input and returns a list of dictionaries containing information about users who have retweeted tweets. It first retrieves all documents that have at least one referenced tweet. For each document, it extracts the user ID, tweet ID, created_at_converted, and referenced tweet type. If the referenced tweet type is "retweeted", it checks if the user ID is already in a dictionary called user_retweeted_tweet. If the user ID is not in the dictionary, it adds a new entry with the user ID, an empty list for retweeted tweets, and an empty list for the created_at_converted values. It then appends the tweet ID and created_at_converted value to the corresponding lists in the dictionary. Finally, it returns a list of the values in the user_retweeted_tweet dictionary.

get_relationship_quoted() takes a MongoDB collection as input and returns a list of dictionaries containing information about users who have quoted tweets. It first retrieves all documents that have at least one referenced tweet. For each document, it extracts the user ID, tweet ID, created_at_converted, and referenced tweet type. If the referenced tweet type is "quoted", it checks if the user ID is already in a dictionary called user_quoted_tweet. If the user ID is not in the dictionary, it adds a new entry with the user ID, an empty list for quoted tweets, and an empty list for the created_at_converted values. It then appends the tweet ID and created_at_converted value to the corresponding lists in the dictionary. Finally, it returns a list of the values in the user_quoted_tweet dictionary.

get_relationship_replied_to() takes a MongoDB collection as input and returns a list of dictionaries containing information about users who have replied to tweets. It first retrieves all documents that have at least one referenced tweet. For each document, it extracts the user ID, tweet ID, created_at_converted, and referenced tweet type. If the referenced tweet type is "replied_to", it checks if the user ID is already in a dictionary called user_replied_to_tweet. If the user ID is not in the dictionary, it adds a new entry with the user ID, an empty list for replied tweets, and an empty list for the created_at_converted values. It then appends the tweet ID and created_at_converted value to the corresponding lists in the dictionary. Finally, it returns a list of the values in the user_replied_to_tweet dictionary.
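The four functions above share one grouping pattern: filter by referenced tweet type, then collect tweet IDs and timestamps per user. The sketch below shows that pattern on flattened records. The helper `group_tweets_by_user` and the record keys `author_id`, `tweet_id`, and `ref_type` are hypothetical; the actual functions read these values from the MongoDB documents.

```python
from datetime import datetime


def group_tweets_by_user(records, relation, ref_type=None):
    """Group tweet IDs and timestamps per user for one relation type.

    ref_type=None selects tweets without a referenced tweet ("tweeted").
    """
    grouped = {}
    for record in records:
        if record["ref_type"] != ref_type:
            continue
        user_id = record["author_id"]
        if user_id not in grouped:
            # First time this user is seen: start with empty lists.
            grouped[user_id] = {"id": user_id, relation: [], "created_at_converted": []}
        grouped[user_id][relation].append(record["tweet_id"])
        grouped[user_id]["created_at_converted"].append(record["created_at_converted"])
    return list(grouped.values())


# Hypothetical flattened records.
records = [
    {"author_id": "u1", "tweet_id": "t1", "ref_type": "retweeted", "created_at_converted": datetime(2023, 1, 25)},
    {"author_id": "u2", "tweet_id": "t2", "ref_type": "replied_to", "created_at_converted": datetime(2023, 1, 24)},
    {"author_id": "u1", "tweet_id": "t3", "ref_type": "retweeted", "created_at_converted": datetime(2023, 1, 23)},
    {"author_id": "u1", "tweet_id": "t4", "ref_type": None, "created_at_converted": datetime(2023, 1, 22)},
]
retweeted_sketch = group_tweets_by_user(records, "retweeted", ref_type="retweeted")
tweeted_sketch = group_tweets_by_user(records, "tweeted")
```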

get_relationship_has_hashtag() retrieves the relationship between tweets and hashtags in the specified collection. It searches for documents that contain a key called "includes.tweets.0.entities.hashtags" and extracts the tweet ID and the associated hashtags. The function returns a list of tweet objects with their associated hashtags.

get_relationship_has_url() retrieves the relationship between tweets and URLs in the specified collection. It searches for documents that contain a key called "includes.tweets.0.entities.urls" and extracts the tweet ID and the associated URLs. The function returns a list of tweet objects with their associated URLs.
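Both tweet-to-entity relations follow the same shape, so a single sketch can cover them. The helper `tweet_entity_relation` is hypothetical, and the entity value fields (`tag` for hashtags, `url` for URLs) are assumptions based on the Twitter API v2 entity format.

```python
def tweet_entity_relation(documents, entity_kind, value_field):
    """Pair the first tweet of each document with its hashtags or URLs."""
    relations = []
    for doc in documents:
        first_tweet = doc["includes"]["tweets"][0]
        entities = first_tweet.get("entities", {}).get(entity_kind, [])
        if entities:  # skip tweets without this kind of entity
            relations.append({
                "tweet_id": first_tweet["id"],
                entity_kind: [entity[value_field] for entity in entities],
            })
    return relations


# Hypothetical documents: the second one has no URLs.
docs = [
    {"includes": {"tweets": [{"id": "t1", "entities": {"urls": [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]}}]}},
    {"includes": {"tweets": [{"id": "t2", "entities": {"hashtags": [{"tag": "gm"}]}}]}},
]
has_url_sketch = tweet_entity_relation(docs, "urls", "url")
has_hashtag_sketch = tweet_entity_relation(docs, "hashtags", "tag")
```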

get_relationship_used_hashtag() retrieves all tweets that contain hashtags. For each tweet, the function adds any new hashtag found in the tweet to the user's existing list of hashtags and removes any duplicates. Once all documents have been processed, the function returns a list of dictionaries, where each dictionary represents a user and the hashtags they have used in their tweets.

get_relationship_used_urls() retrieves all tweets that contain URLs. For each tweet, the function adds any new URL found in the tweet to the user's existing list of URLs and removes any duplicates. Once all documents have been processed, the function returns a list of dictionaries, where each dictionary represents a user and the URLs they have used in their tweets.
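The per-user accumulation can be sketched as below. The helper `user_entity_relation` is hypothetical, and reading the user from the first tweet's `author_id` is an assumption about the document layout rather than something this notebook confirms.

```python
def user_entity_relation(documents, entity_kind, value_field):
    """Accumulate the distinct hashtags or URLs each user has used."""
    values_by_user = {}
    for doc in documents:
        first_tweet = doc["includes"]["tweets"][0]
        seen = values_by_user.setdefault(first_tweet["author_id"], [])
        for entity in first_tweet.get("entities", {}).get(entity_kind, []):
            value = entity[value_field]
            if value not in seen:  # keep each value once per user
                seen.append(value)
    return [
        {"id": user_id, entity_kind: values}
        for user_id, values in values_by_user.items()
        if values
    ]


# Hypothetical documents: user "u1" repeats the "python" hashtag.
docs = [
    {"includes": {"tweets": [{"author_id": "u1", "entities": {"hashtags": [{"tag": "python"}, {"tag": "django"}]}}]}},
    {"includes": {"tweets": [{"author_id": "u2", "entities": {"hashtags": [{"tag": "gm"}]}}]}},
    {"includes": {"tweets": [{"author_id": "u1", "entities": {"hashtags": [{"tag": "python"}, {"tag": "java"}]}}]}},
]
used_hashtag_sketch = user_entity_relation(docs, "hashtags", "tag")
```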


get_relationship_mentioned() builds a list of dictionaries representing the relationships between users based on mentions in their tweets. It first queries the collection for documents with tweets that mention other users. For each document, it extracts the user ID and iterates over each mention in the tweet's "entities" field, counting how many times the user mentioned each account. Finally, it returns a list of dictionaries, where each dictionary contains a user ID and a list of mentions with their counts.
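The counting step can be sketched with `collections.Counter`. The helper `count_mentions` is hypothetical and works on plain tweet dictionaries; the `author_id` key and the `id` field of each mention are assumptions based on the Twitter API v2 format, and the output shape follows the sample shown later in this notebook.

```python
from collections import Counter, defaultdict


def count_mentions(tweets):
    """Count, per author, how often each other user is mentioned."""
    counts_by_user = defaultdict(Counter)
    for tweet in tweets:
        for mention in tweet.get("entities", {}).get("mentions", []):
            counts_by_user[tweet["author_id"]][mention["id"]] += 1
    return [
        {
            "id": user_id,
            "mentions": [{"id": mentioned_id, "count": n} for mentioned_id, n in counts.items()],
        }
        for user_id, counts in counts_by_user.items()
    ]


# Hypothetical tweets: user "a" mentions "x" twice and "y" once.
tweets = [
    {"author_id": "a", "entities": {"mentions": [{"id": "x"}, {"id": "y"}]}},
    {"author_id": "a", "entities": {"mentions": [{"id": "x"}]}},
    {"author_id": "b", "entities": {"mentions": [{"id": "x"}]}},
]
mention_counts = count_mentions(tweets)
```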

In [16]:
from python_utils.mongo_python import (
    set_mongo_connection,
    get_node_users,
    get_node_tweets,
    get_node_hashtags,
    get_node_urls,
    get_relationship_tweeted,
    get_relationship_retweeted,
    get_relationship_quoted,
    get_relationship_replied_to,
    get_relationship_has_hashtag,
    get_relationship_has_url,
    get_relationship_used_hashtag,
    get_relationship_used_urls,
    get_relationship_mentioned,
)

Examples:

In [2]:
client,database,collection = set_mongo_connection(
    "mongodb://localhost:27017/",
    "local",
    "test"
)

# nodes
users = get_node_users(collection)
tweets = get_node_tweets(collection)
hashtags = get_node_hashtags(collection)
urls = get_node_urls(collection)

# relationships
tweeted = get_relationship_tweeted(collection)
retweeted = get_relationship_retweeted(collection)
quoted = get_relationship_quoted(collection)
replied_to = get_relationship_replied_to(collection)
has_hashtag = get_relationship_has_hashtag(collection)
has_url = get_relationship_has_url(collection)
used_hashtag = get_relationship_used_hashtag(collection)
used_urls = get_relationship_used_urls(collection)
mentioned = get_relationship_mentioned(collection)


## Sample output of each function

### User nodes

In [3]:
print("------ User nodes --------")
for user in users[:5]:
    print(user)

------ User nodes --------
{'id': '860025761167572993', 'followers': 35, 'username': 'Sukhmanjeetkau2', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 59, 30)}
{'id': '1389705763208237056', 'followers': 12552, 'username': 'LauraDoodlesToo', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 59, 16)}
{'id': '1495625777605488642', 'followers': 7701, 'username': 'AnastasiaNFTart', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 59, 16)}
{'id': '1613967463', 'followers': 439, 'username': 'esquivel_gifted', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 58)}
{'id': '959355565', 'followers': 342, 'username': 'DevonGarritt', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 58)}


### Tweet nodes

In [4]:
print("------ Tweet nodes --------")
for tweet in tweets[:5]:
    print(tweet)

------ Tweet nodes --------
{'tweet_id': '1618398184312639491', 'author_id': '860025761167572993', 'created_at': '2023-01-25T23:59:30.000Z', 'retweet_count': 0, 'referenced_tweets': [{'type': None, 'tweet_id': None}]}
{'tweet_id': '1618398126364299264', 'author_id': '1389705763208237056', 'created_at': '2023-01-25T23:59:16.000Z', 'retweet_count': 15, 'referenced_tweets': [{'type': 'retweeted', 'tweet_id': '1618170911366184962'}]}
{'tweet_id': '1618397809140715524', 'author_id': '1613967463', 'created_at': '2023-01-25T23:58:00.000Z', 'retweet_count': 0, 'referenced_tweets': [{'type': 'replied_to', 'tweet_id': '1618315710299803649'}]}
{'tweet_id': '1618397053452963840', 'author_id': '1267581258424557568', 'created_at': '2023-01-25T23:55:00.000Z', 'retweet_count': 12, 'referenced_tweets': [{'type': 'retweeted', 'tweet_id': '1618354372135747586'}]}
{'tweet_id': '1618397050256904193', 'author_id': '24725968', 'created_at': '2023-01-25T23:54:59.000Z', 'retweet_count': 1, 'referenced_tweets':

### Hashtag nodes

In [5]:
print("------ Hashtag nodes --------")
for hashtag in hashtags[:5]:
    print(hashtag)

------ Hashtag nodes --------
samaaupdates
femalelife
kdrama
womenincrypto
womeningrss


### URL nodes

In [6]:
print("------ Url nodes --------")
for url in urls[:5]:
    print(url)

------ Url nodes --------
https://t.co/PrXaQ6IyJL
https://t.co/YGX6684LHE
https://t.co/O3KSmRoF3R
https://t.co/MBkAoDibQt
https://t.co/kWGY9l6Axc


### User tweeted a tweet relation

In [7]:
print("------ User tweeted a tweet --------")
for user_tweet_relation in tweeted[:2]:
    print(user_tweet_relation)

------ User tweeted a tweet --------
{'id': '860025761167572993', 'tweeted': ['1618398184312639491', '1616586245395648512'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 59, 30), datetime.datetime(2023, 1, 20, 23, 59, 30)]}
{'id': '820359116212142081', 'tweeted': ['1618396803979788289', '1617599317564325911', '1616873521614884865', '1616511130158997504'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 54, 1), datetime.datetime(2023, 1, 23, 19, 5, 5), datetime.datetime(2023, 1, 21, 19, 1, 2), datetime.datetime(2023, 1, 20, 19, 1, 1)]}


### User retweeted a tweet relation

In [8]:
print("------ User retweeted a tweet --------")
for user_retweet_relation in retweeted[:3]:
    print(user_retweet_relation)

------ User retweeted a tweet --------
{'id': '1389705763208237056', 'retweeted': ['1618398126364299264', '1617651217836167169'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 59, 16), datetime.datetime(2023, 1, 23, 22, 31, 19)]}
{'id': '1267581258424557568', 'retweeted': ['1618397053452963840'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 55)]}
{'id': '24725968', 'retweeted': ['1618397050256904193', '1617715226211921925', '1616547691047043072', '1616097000478449664', '1616096893960081414', '1615756643668746240', '1615509814880178176', '1615451679385608211', '1615439765817462784', '1615439720313413650', '1615026052048486400'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 54, 59), datetime.datetime(2023, 1, 24, 2, 45, 40), datetime.datetime(2023, 1, 20, 21, 26, 18), datetime.datetime(2023, 1, 19, 15, 35, 25), datetime.datetime(2023, 1, 19, 15, 34, 59), datetime.datetime(2023, 1, 18, 17, 2, 57), datetime.datetime(2023, 1, 18, 0, 42, 9), dateti

### User quoted a tweet relation

In [9]:
print("------ User quoted a tweet --------")
for user_quoted_relation in quoted[:5]:
    print(user_quoted_relation)

------ User quoted a tweet --------
{'id': '2305580258', 'quoted': ['1618395216091951104'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 47, 42)]}
{'id': '1105501302056980487', 'quoted': ['1618388634025234432'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 21, 33)]}
{'id': '1275952135683923970', 'quoted': ['1618387011932676096', '1618386859104800769', '1618386762719719425', '1618386625704361984', '1618386217867055104', '1618385577308745730'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 15, 6), datetime.datetime(2023, 1, 25, 23, 14, 30), datetime.datetime(2023, 1, 25, 23, 14, 7), datetime.datetime(2023, 1, 25, 23, 13, 34), datetime.datetime(2023, 1, 25, 23, 11, 57), datetime.datetime(2023, 1, 25, 23, 9, 24)]}
{'id': '1676817841', 'quoted': ['1618386195893059584'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 11, 51)]}
{'id': '1315511783784734720', 'quoted': ['1618383805093654528'], 'created_at_converted': [datetime.datetime(20

### User replied to a tweet relation

In [10]:
print("------ User replied to a tweet -------")
for user_replied_to_relation in replied_to[:4]:
    print(user_replied_to_relation)

------ User replied to a tweet -------
{'id': '1613967463', 'replied_to': ['1618397809140715524'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 58)]}
{'id': '16677451', 'replied_to': ['1618393931364065281'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 42, 36)]}
{'id': '61746508', 'replied_to': ['1618386434796453888'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 23, 12, 48)]}
{'id': '1373680554420416513', 'replied_to': ['1618371504294735873', '1617982923961176064', '1617212709204033536', '1616796940749996034', '1616529674473033728', '1616165881108643867', '1615764790118318116', '1615413379568484353', '1615039508340432896'], 'created_at_converted': [datetime.datetime(2023, 1, 25, 22, 13, 29), datetime.datetime(2023, 1, 24, 20, 29, 24), datetime.datetime(2023, 1, 22, 17, 28, 50), datetime.datetime(2023, 1, 21, 13, 56, 43), datetime.datetime(2023, 1, 20, 20, 14, 42), datetime.datetime(2023, 1, 19, 20, 9, 7), datetime.datetime(2023, 1, 18, 17, 35, 

### Tweet has hashtag relation

In [11]:
print("------ Tweet has hashtag --------")
for tweet_hashtag_relation in has_hashtag[:5]:
    print(tweet_hashtag_relation)

------ Tweet has hashtag --------
{'tweet_id': '1618398184312639491', 'hashtags': ['leetcode', 'django', 'python', 'java', 'codinglife', 'softwaredeveloper', 'coding', '100daysofcode', 'womenintech', 'womenwhocode', '100daysofcodechallenge']}
{'tweet_id': '1618398126364299264', 'hashtags': ['gm']}
{'tweet_id': '1618397050256904193', 'hashtags': ['avtweeps']}
{'tweet_id': '1618396952034693121', 'hashtags': ['transgirl', 'tsgirl', 'trans']}
{'tweet_id': '1618396803979788289', 'hashtags': ['womeninspringwomen', 'iwintech', 'womenintech']}


### Tweet has URL relation

In [12]:
print("------ Tweet has url --------")
for tweet_url_relation in has_url[:5]:
    print(tweet_url_relation)

------ Tweet has url --------
{'tweet_id': '1618396803979788289', 'urls': ['https://t.co/Vx1u1HWryP', 'https://t.co/u4dNamohFg', 'https://t.co/n3FZa3UrjS']}
{'tweet_id': '1618395216091951104', 'urls': ['https://t.co/B3zelTTchl', 'https://t.co/siff9VQxin', 'https://t.co/HkyhbGBfNH']}
{'tweet_id': '1618395108868608001', 'urls': ['https://t.co/cKAzCbhK5C']}
{'tweet_id': '1618393537044942848', 'urls': ['https://t.co/aJvRhVApmH']}
{'tweet_id': '1618393472859508739', 'urls': ['https://t.co/aJvRhVApmH']}


### User used hashtag relation

In [13]:
print("----- User used hashtag ------")
for user_hashtag_relation in used_hashtag[:5]:
    print(user_hashtag_relation)

----- User used hashtag ------
{'id': '860025761167572993', 'hashtags': ['leetcode', '100daysofcodechallenge', 'backtracking', 'java', 'inheritance', 'codinglife', 'softwaredeveloper', 'lookingforjob', 'coding', '100daysofcode', 'womenwhocode', 'womenintech', 'python', 'django']}
{'id': '1389705763208237056', 'hashtags': ['gm']}
{'id': '24725968', 'hashtags': ['womeninav', 'avtweeps', 'mondaymotivation', 'avixawomen', 'befierce', 'thefutureisfemale', 'dei', 'wods', 'womeninstem', 'wavit', 'womenintech', 'digitalsignage', 'womeninit', 'ripplesmakewaves', 'womenwhocode', 'avpluspod', 'bebold', 'proav']}
{'id': '921741324579155968', 'hashtags': ['transgirl', 'tsgirl', 'trans']}
{'id': '820359116212142081', 'hashtags': ['womeninspringwomen', 'heritagefoundationofpakistan', 'iwintech', 'womeninstem', 'womenintech', 'stem', 'womeninengineering']}


### User used URL relation

In [14]:
print("----- User used url -----")
for user_url_relation in used_urls[:5]:
    print(user_url_relation)

----- User used url -----
{'id': '820359116212142081', 'urls': ['https://t.co/yHTOWzEkWU', 'https://t.co/8UNsZxfr27', 'https://t.co/LLYjgvz4LU', 'https://t.co/h6OH2BAtvo', 'https://t.co/rKZjaVZO2g', 'https://t.co/n3FZa3UrjS', 'https://t.co/Vx1u1HWryP', 'https://t.co/FPyJQF1v78', 'https://t.co/u4dNamohFg']}
{'id': '2305580258', 'urls': ['https://t.co/HkyhbGBfNH', 'https://t.co/5gDVFjUyqn', 'https://t.co/USZiKgmBl0', 'https://t.co/ElnK4JE7OV', 'https://t.co/3qiMagXsw1', 'https://t.co/B3zelTTchl', 'https://t.co/S3bcGUiNc3', 'https://t.co/G5UMoeuyaK', 'https://t.co/x9viNy3ZEi', 'https://t.co/siff9VQxin']}
{'id': '182004865', 'urls': ['https://t.co/w64H55TqHj', 'https://t.co/oZewMwrGKN', 'https://t.co/cKAzCbhK5C', 'https://t.co/s54OWUEtpe', 'https://t.co/UHOOlO6Qdm', 'https://t.co/hEOt7qitkf']}
{'id': '545453835', 'urls': ['https://t.co/aJvRhVApmH']}
{'id': '890402868288737280', 'urls': ['https://t.co/aJvRhVApmH']}


### User mentioned user relation

In [15]:
print("---- User mentioned user -----")
for user_mention_relation in mentioned[:5]:
    print(user_mention_relation)

---- User mentioned user -----
{'id': '1389705763208237056', 'mentions': [{'id': '1495625777605488642', 'count': 1}, {'id': '29335769', 'count': 1}, {'id': '1549195125204480000', 'count': 1}, {'id': '913423000820682762', 'count': 1}]}
{'id': '1613967463', 'mentions': [{'id': '959355565', 'count': 1}, {'id': '430995737', 'count': 1}, {'id': '23483274', 'count': 1}]}
{'id': '1267581258424557568', 'mentions': [{'id': '29335769', 'count': 1}, {'id': '3056658161', 'count': 1}]}
{'id': '24725968', 'mentions': [{'id': '1458538450571907073', 'count': 4}, {'id': '1562815722455109632', 'count': 8}, {'id': '18138503', 'count': 1}, {'id': '1446159865328685059', 'count': 2}, {'id': '204837082', 'count': 2}, {'id': '15074647', 'count': 3}, {'id': '28601072', 'count': 2}, {'id': '18167325', 'count': 2}, {'id': '777967633', 'count': 1}, {'id': '624844467', 'count': 3}, {'id': '171675792', 'count': 1}, {'id': '811296326197579782', 'count': 4}, {'id': '17158776', 'count': 2}, {'id': '2636082375', 'count

### Next, these functions will be used to load the data into Neo4j