# Extract data from MongoDB with pymongo

### Nodes

set_mongo_connection() sets up a connection to a MongoDB database using a connection string, a database name, and a collection name. It returns the client, db, and col objects, which can be used to perform further operations on the database.
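As a rough illustration, a function with this behaviour could look like the sketch below. It is not the actual implementation from python_utils.mongo_python; it only assumes the standard pymongo pattern of creating a MongoClient and indexing it by database and collection name.

```python
def set_mongo_connection(connection_string, db_name, collection_name):
    """Return (client, db, col) for the given connection string and names."""
    # Imported inside the function so the sketch can be loaded even where
    # pymongo is not installed (a choice made for this example only).
    from pymongo import MongoClient

    client = MongoClient(connection_string)  # the connection is opened lazily
    db = client[db_name]                     # database handle
    col = db[collection_name]                # collection handle
    return client, db, col
```

Because the client connects lazily, an unreachable server only raises an error on the first real operation, such as a find() call.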

get_node_users() retrieves information about Twitter users from a MongoDB database, either from tweets that include user mentions or from tweets that include certain keywords or hashtags. It creates a dictionary representing each user found, including the user's ID, username, number of followers (if available), and the date the tweet was created. It then combines the dictionaries into a single list, removes any duplicate users based on their ID, and returns the resulting list sorted by the date the tweet was created.
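The de-duplication and sorting step can be sketched on plain dictionaries. The helper name dedupe_and_sort_users is hypothetical, the key names follow the sample output at the end of this page, and two choices are assumptions rather than confirmed behaviour of get_node_users(): keeping the first record seen per ID, and sorting newest first.

```python
from datetime import datetime


def dedupe_and_sort_users(users):
    """Keep one record per user ID, then sort by date (newest first)."""
    unique = {}
    for user in users:
        # setdefault keeps the first record seen for each ID (an assumption).
        unique.setdefault(user["id"], user)
    return sorted(
        unique.values(),
        key=lambda user: user["info_last_updated"],
        reverse=True,
    )


users = [
    {"id": "1", "username": "a", "followers": 10,
     "info_last_updated": datetime(2023, 1, 25, 23, 58)},
    {"id": "2", "username": "b", "followers": 5,
     "info_last_updated": datetime(2023, 1, 25, 23, 59, 30)},
    {"id": "1", "username": "a", "followers": 10,
     "info_last_updated": datetime(2023, 1, 25, 23, 59)},
]
result = dedupe_and_sort_users(users)
```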


get_node_tweets() retrieves information about tweets from the specified collection. It searches for documents that contain a key called "includes.tweets" and extracts information such as tweet ID, author ID, creation time, and number of retweets. It also checks for any referenced tweets and adds them to a list if present. The function returns a list of tweet objects.
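The field extraction might look roughly like the following. The helper tweet_to_node is hypothetical, and the input layout ("id", "public_metrics.retweet_count", "referenced_tweets" as a list of dictionaries with an "id") is assumed from the Twitter API v2 tweet object, not taken from the library code. The output keys match the sample output at the end of this page.

```python
def tweet_to_node(tweet):
    """Map one raw tweet dictionary to a flat tweet node."""
    referenced = tweet.get("referenced_tweets")
    return {
        "tweet_id": tweet["id"],
        "author_id": tweet["author_id"],
        "created_at": tweet["created_at"],
        # Assumed location of the retweet counter (Twitter API v2 layout).
        "retweet_count": tweet.get("public_metrics", {}).get("retweet_count"),
        # None when nothing is referenced, otherwise a list of tweet IDs.
        "referenced_tweets": [ref["id"] for ref in referenced] if referenced else None,
    }


node = tweet_to_node({
    "id": "100",
    "author_id": "7",
    "created_at": "2023-01-25T23:59:30.000Z",
    "public_metrics": {"retweet_count": 0},
})
```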

get_node_hashtags() retrieves a list of unique hashtags from the first tweet in each document in the specified collection. It uses the collection.find() method to search for documents that contain a key called "includes.tweets.0.entities.hashtags", which indicates that the first tweet in the document has at least one hashtag. It then loops through each matching document, extracts the hashtag tags from the first tweet, and adds them to a set to remove duplicates. Finally, the function converts the set to a list and returns it.
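The set-based de-duplication can be sketched without a database by passing the documents in directly. The helper name unique_hashtags is hypothetical, and the "tag" key inside each hashtag entry is assumed from the Twitter API v2 entity format.

```python
def unique_hashtags(documents):
    """Collect the distinct hashtag tags of the first tweet in each document."""
    # Against MongoDB the documents would presumably come from something like
    # collection.find({"includes.tweets.0.entities.hashtags": {"$exists": True}})
    # (the exact filter used by the library is an assumption).
    tags = set()
    for doc in documents:
        first_tweet = doc["includes"]["tweets"][0]
        for hashtag in first_tweet.get("entities", {}).get("hashtags", []):
            tags.add(hashtag["tag"])
    return list(tags)


docs = [
    {"includes": {"tweets": [{"entities": {"hashtags": [{"tag": "NFT"}, {"tag": "art"}]}}]}},
    {"includes": {"tweets": [
        {"entities": {"hashtags": [{"tag": "NFT"}]}},
        {"entities": {"hashtags": [{"tag": "ignored"}]}},  # not the first tweet
    ]}},
]
hashtags = unique_hashtags(docs)
```

Since a set is unordered, the order of the returned list is not guaranteed; sort it if a stable order matters.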


get_node_urls() retrieves a list of unique URLs from the first tweet in each document in the specified collection. It uses the collection.find() method to search for documents that contain a key called "includes.tweets.0.entities.urls", which indicates that the first tweet in the document has at least one URL. It then loops through each matching document, extracts the URLs from the first tweet, and adds them to a set to remove duplicates. Finally, the function converts the set to a list and returns it.



### Relations

get_relationship_tweeted() takes a MongoDB collection as input and returns a list of dictionaries containing information about users who have tweeted. It first retrieves all documents that do not have any referenced tweets. For each document, it extracts the user ID, tweet ID, and created_at_converted. It then checks if the user ID is already in a dictionary called user_tweeted_tweet. If the user ID is not in the dictionary, it adds a new entry with the user ID, an empty list for tweeted tweets, and an empty list for the created_at_converted values. It then appends the tweet ID and created_at_converted value to the corresponding lists in the dictionary. Finally, it returns a list of the values in the user_tweeted_tweet dictionary.
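The grouping logic described above can be sketched as follows. The helper name group_tweeted and the key names "user_id", "tweeted", "created_at", "author_id", and "id" are assumptions made for illustration; only user_tweeted_tweet and created_at_converted come from the description.

```python
def group_tweeted(documents):
    """Group original tweets (no referenced tweets) by their author."""
    user_tweeted_tweet = {}
    for doc in documents:
        if doc.get("referenced_tweets"):
            continue  # skip retweets, quotes, and replies
        user_id = doc["author_id"]
        # Create the entry on first sight of this user, then reuse it.
        entry = user_tweeted_tweet.setdefault(
            user_id, {"user_id": user_id, "tweeted": [], "created_at": []}
        )
        entry["tweeted"].append(doc["id"])
        entry["created_at"].append(doc["created_at_converted"])
    return list(user_tweeted_tweet.values())


docs = [
    {"id": "t1", "author_id": "u1", "created_at_converted": "2023-01-25"},
    {"id": "t2", "author_id": "u1", "created_at_converted": "2023-01-26"},
    {"id": "t3", "author_id": "u2", "created_at_converted": "2023-01-26",
     "referenced_tweets": [{"type": "retweeted", "id": "t1"}]},
]
tweeted = group_tweeted(docs)
```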

get_relationship_retweeted() takes a MongoDB collection as input and returns a list of dictionaries containing information about users who have retweeted tweets. It first retrieves all documents that have at least one referenced tweet. For each document, it extracts the user ID, tweet ID, created_at_converted, and referenced tweet type. If the referenced tweet type is "retweeted", it checks if the user ID is already in a dictionary called user_retweeted_tweet. If the user ID is not in the dictionary, it adds a new entry with the user ID, an empty list for retweeted tweets, and an empty list for the created_at_converted values. It then appends the tweet ID and created_at_converted value to the corresponding lists in the dictionary. Finally, it returns a list of the values in the user_retweeted_tweet dictionary.

get_relationship_quoted() takes a MongoDB collection as input and returns a list of dictionaries containing information about users who have quoted tweets. It first retrieves all documents that have at least one referenced tweet. For each document, it extracts the user ID, tweet ID, created_at_converted, and referenced tweet type. If the referenced tweet type is "quoted", it checks if the user ID is already in a dictionary called user_quoted_tweet. If the user ID is not in the dictionary, it adds a new entry with the user ID, an empty list for quoted tweets, and an empty list for the created_at_converted values. It then appends the tweet ID and created_at_converted value to the corresponding lists in the dictionary. Finally, it returns a list of the values in the user_quoted_tweet dictionary.

get_relationship_replied_to() takes a MongoDB collection as input and returns a list of dictionaries containing information about users who have replied to tweets. It first retrieves all documents that have at least one referenced tweet. For each document, it extracts the user ID, tweet ID, created_at_converted, and referenced tweet type. If the referenced tweet type is "replied_to", it checks if the user ID is already in a dictionary called user_replied_to_tweet. If the user ID is not in the dictionary, it adds a new entry with the user ID, an empty list for replied tweets, and an empty list for the created_at_converted values. It then appends the tweet ID and created_at_converted value to the corresponding lists in the dictionary. Finally, it returns a list of the values in the user_replied_to_tweet dictionary.
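The three functions above differ only in the referenced tweet type they filter on, so the shared pattern can be sketched with a single hypothetical helper, group_by_reference_type. The key names are illustrative, and matching on any referenced tweet of the given type (rather than, say, only the first one) is an assumption.

```python
def group_by_reference_type(documents, ref_type):
    """Group tweets that reference another tweet with the given type by author."""
    grouped = {}
    for doc in documents:
        refs = doc.get("referenced_tweets") or []
        if not any(ref.get("type") == ref_type for ref in refs):
            continue  # not a retweet / quote / reply of the requested kind
        user_id = doc["author_id"]
        entry = grouped.setdefault(
            user_id, {"user_id": user_id, "tweets": [], "created_at": []}
        )
        entry["tweets"].append(doc["id"])
        entry["created_at"].append(doc["created_at_converted"])
    return list(grouped.values())


docs = [
    {"id": "t1", "author_id": "u1", "created_at_converted": "d1",
     "referenced_tweets": [{"type": "retweeted", "id": "x"}]},
    {"id": "t2", "author_id": "u1", "created_at_converted": "d2",
     "referenced_tweets": [{"type": "quoted", "id": "y"}]},
    {"id": "t3", "author_id": "u2", "created_at_converted": "d3"},
]
retweeted = group_by_reference_type(docs, "retweeted")
quoted = group_by_reference_type(docs, "quoted")
replied_to = group_by_reference_type(docs, "replied_to")
```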

get_relationship_has_hashtag() retrieves information about the relationship between tweets and hashtags in the specified collection. It searches for documents that contain a key called "includes.tweets.0.entities.hashtags" and extracts the tweet ID and associated hashtags. The function returns a list of tweet objects with their associated hashtags.

get_relationship_has_url() retrieves information about the relationship between tweets and URLs in the specified collection. It searches for documents that contain a key called "includes.tweets.0.entities.urls" and extracts the tweet ID and associated URLs. The function returns a list of tweet objects with their associated URLs.
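Both tweet-to-entity relations follow the same shape, which the hypothetical helper below illustrates on in-memory documents. The entity field names ("tag" for hashtags, "url" for URLs) are assumed from the Twitter API v2 entity format, and the output keys are illustrative only.

```python
def tweet_entity_relations(documents, entity, field):
    """Pair the first tweet of each document with its hashtags or URLs."""
    relations = []
    for doc in documents:
        tweet = doc["includes"]["tweets"][0]
        items = tweet.get("entities", {}).get(entity, [])
        if items:  # mirror the "key exists" filter of the query
            relations.append(
                {"tweet_id": tweet["id"], entity: [item[field] for item in items]}
            )
    return relations


docs = [
    {"includes": {"tweets": [{
        "id": "t1",
        "entities": {
            "hashtags": [{"tag": "NFT"}],
            "urls": [{"url": "https://t.co/abc"}],
        },
    }]}},
    {"includes": {"tweets": [{"id": "t2", "entities": {"hashtags": [{"tag": "art"}]}}]}},
]
has_hashtag = tweet_entity_relations(docs, "hashtags", "tag")
has_url = tweet_entity_relations(docs, "urls", "url")
```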

get_relationship_used_hashtag() retrieves all tweets that contain hashtags. For each tweet, the function adds any new hashtag found in the tweet to the user's existing list of hashtags and removes any duplicates. Once all documents have been processed, the function returns a list of dictionaries, where each dictionary represents a user and the hashtags they have used in their tweets.

get_relationship_used_urls() retrieves all tweets that contain URLs. For each tweet, the function adds any new URL found in the tweet to the user's existing list of URLs and removes any duplicates. Once all documents have been processed, the function returns a list of dictionaries, where each dictionary represents a user and the URLs they have used in their tweets.


get_relationship_mentioned() extracts a list of dictionaries containing information about users and the other users they mentioned in their tweets. The function starts by querying the collection to find documents where the "entities.mentions" field exists within the first tweet in the "includes.tweets" array. It then loops through each document, extracts the user ID and the tweet information, and adds the mentioned users to a set. If the user has not been encountered before, a new dictionary is created with the user's ID and a list of mentioned user IDs. If the user has been encountered before, the mentioned user IDs are added to the existing list and duplicates are removed. Finally, the function converts the dictionary of user information to a list of dictionaries and returns it.
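A minimal sketch of the merge-and-de-duplicate step, again on in-memory documents. The helper name group_mentions is hypothetical, as are the assumptions that the author is read from the first tweet's "author_id", that each mention carries an "id", and that the merged list is sorted (done here only to make the result deterministic).

```python
def group_mentions(documents):
    """Map each author to the distinct user IDs they mentioned."""
    mentioned_by_user = {}
    for doc in documents:
        tweet = doc["includes"]["tweets"][0]
        mentions = tweet.get("entities", {}).get("mentions", [])
        mentioned_ids = {mention["id"] for mention in mentions}
        user_id = tweet["author_id"]
        entry = mentioned_by_user.setdefault(
            user_id, {"user_id": user_id, "mentioned": []}
        )
        # Merge with what is already stored and drop duplicates.
        entry["mentioned"] = sorted(set(entry["mentioned"]) | mentioned_ids)
    return list(mentioned_by_user.values())


docs = [
    {"includes": {"tweets": [{
        "author_id": "u1",
        "entities": {"mentions": [{"id": "u2"}, {"id": "u3"}]},
    }]}},
    {"includes": {"tweets": [{
        "author_id": "u1",
        "entities": {"mentions": [{"id": "u3"}, {"id": "u4"}]},
    }]}},
]
mentioned = group_mentions(docs)
```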

In [1]:
from python_utils.mongo_python import (
    set_mongo_connection,
    get_node_users,
    get_node_tweets,
    get_node_hashtags,
    get_node_urls,
    get_relationship_tweeted,
    get_relationship_retweeted,
    get_relationship_quoted,
    get_relationship_replied_to,
    get_relationship_has_hashtag,
    get_relationship_has_url,
    get_relationship_used_hashtag,
    get_relationship_used_urls,
    get_relationship_mentioned,
)

Examples:

In [2]:
client,database,collection = set_mongo_connection(
    "mongodb://localhost:27017/",
    "local",
    "test"
)

#nodes
users = get_node_users(collection)
tweets = get_node_tweets(collection)
hashtags = get_node_hashtags(collection)
urls = get_node_urls(collection)

#relationships
tweeted = get_relationship_tweeted(collection)
retweeted = get_relationship_retweeted(collection)
quoted = get_relationship_quoted(collection)
replied_to = get_relationship_replied_to(collection)
has_hashtag = get_relationship_has_hashtag(collection)
has_url = get_relationship_has_url(collection)
used_hashtag = get_relationship_used_hashtag(collection)
used_urls = get_relationship_used_urls(collection)
mentioned = get_relationship_mentioned(collection)



# Print the first five items of each result
print("------NODES--------")
for user in users[:5]:
    print(user)
print("--------------")
for tweet in tweets[:5]:
    print(tweet)
print("--------------")
for hashtag in hashtags[:5]:
    print(hashtag)
print("--------------")
for url in urls[:5]:
    print(url)
print("------RELATIONS--------")
for user_tweet_relation in tweeted[:5]:
    print(user_tweet_relation)
print("--------------")
for user_retweet_relation in retweeted[:5]:
    print(user_retweet_relation)
print("--------------")
for user_quoted_relation in quoted[:5]:
    print(user_quoted_relation)
print("--------------")
for user_replied_to_relation in replied_to[:5]:
    print(user_replied_to_relation)
print("--------------")
for tweet_hashtag_relation in has_hashtag[:5]:
    print(tweet_hashtag_relation)
print("--------------")
for tweet_url_relation in has_url[:5]:
    print(tweet_url_relation)
print("--------------")
for user_hashtag_relation in used_hashtag[:5]:
    print(user_hashtag_relation)
print("--------------")
for user_url_relation in used_urls[:5]:
    print(user_url_relation)
print("--------------")
for user_mention_relation in mentioned[:5]:
    print(user_mention_relation)


------NODES--------
{'id': '860025761167572993', 'followers': 35, 'username': 'Sukhmanjeetkau2', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 59, 30)}
{'id': '1389705763208237056', 'followers': 12552, 'username': 'LauraDoodlesToo', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 59, 16)}
{'id': '1495625777605488642', 'followers': 7701, 'username': 'AnastasiaNFTart', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 59, 16)}
{'id': '1613967463', 'followers': 439, 'username': 'esquivel_gifted', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 58)}
{'id': '959355565', 'followers': 342, 'username': 'DevonGarritt', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 58)}
--------------
{'tweet_id': '1618398184312639491', 'author_id': '860025761167572993', 'created_at': '2023-01-25T23:59:30.000Z', 'retweet_count': 0, 'referenced_tweets': None}
{'tweet_id': '1618398126364299264', 'author_id': '1389705763208237056', 'created_at': '2023-01-25T23:59:16.000Z', 