# Extract data from MongoDB with pymongo

### Nodes

set_mongo_connection() sets up a connection to a MongoDB database using a connection string, a database name, and a collection name. It returns the client, db, and col objects, which can be used to perform further operations on the database.

get_node_users() retrieves information about Twitter users from a MongoDB database, either from tweets that include user mentions or from tweets that include certain keywords or hashtags. It creates a dictionary representing each user found, including the user's ID, username, number of followers (if available), and the date the tweet was created. It then combines the dictionaries into a single list, removes any duplicate users based on their ID, and returns the resulting list sorted by the date the tweet was created.
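The de-duplication and sorting step described above could look roughly like the sketch below. It is not the actual implementation: the helper name dedupe_and_sort_users is hypothetical, the field names 'id' and 'info_last_updated' are taken from the sample output further down, and keeping the first dictionary seen for each ID and sorting newest first are assumptions based on that output.

```python
from datetime import datetime


def dedupe_and_sort_users(user_dicts):
    """Keep one dictionary per user ID, then sort by 'info_last_updated'."""
    unique = {}
    for user in user_dicts:
        # Assumption: the first dictionary seen for an ID wins.
        unique.setdefault(user["id"], user)
    # Assumption: newest first, as the sample output suggests.
    return sorted(
        unique.values(),
        key=lambda user: user["info_last_updated"],
        reverse=True,
    )


sample_users = [
    {"id": "1", "info_last_updated": datetime(2023, 1, 25, 23, 58)},
    {"id": "2", "info_last_updated": datetime(2023, 1, 25, 23, 59, 30)},
    {"id": "1", "info_last_updated": datetime(2023, 1, 25, 23, 59)},
]
print(dedupe_and_sort_users(sample_users))
```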


get_node_tweets() retrieves information about tweets from the specified collection. It searches for documents that contain a key called "includes.tweets" and extracts information such as the tweet ID, author ID, creation time, and number of retweets. It also checks for referenced tweets and adds them to a list if any are present. The function returns a list of tweet objects.

get_node_hashtags() retrieves a list of unique hashtags from the first tweet in each document in the specified collection. It uses the collection.find() method to search for documents that contain a key called "includes.tweets.0.entities.hashtags", which indicates that the first tweet in the document has at least one hashtag. It then loops through each matching document, extracts the hashtag tags from the first tweet, and adds them to a set to remove duplicates. Finally, the function converts the set to a list and returns it.
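As an illustration of this lookup, here is a minimal sketch that works on plain dictionaries, so it runs without a database. The names HASHTAG_FILTER and unique_hashtags are hypothetical, and the "tag" field on each hashtag entry is an assumption about the stored tweet layout. The `$exists` filter is standard MongoDB query syntax, but whether the real function uses exactly this filter is also an assumption.

```python
# Filter that could be passed to collection.find() to match documents
# whose first tweet carries a hashtags entry (assumed form).
HASHTAG_FILTER = {"includes.tweets.0.entities.hashtags": {"$exists": True}}


def unique_hashtags(documents):
    """Collect the distinct hashtag tags of the first tweet of each document."""
    tags = set()
    for doc in documents:
        first_tweet = doc["includes"]["tweets"][0]
        for hashtag in first_tweet["entities"]["hashtags"]:
            # Assumption: each hashtag entry stores its text under "tag".
            tags.add(hashtag["tag"])
    return list(tags)


# With a live collection this would be:
#   unique_hashtags(collection.find(HASHTAG_FILTER))
sample_docs = [
    {"includes": {"tweets": [{"entities": {"hashtags": [{"tag": "nft"}, {"tag": "art"}]}}]}},
    {"includes": {"tweets": [{"entities": {"hashtags": [{"tag": "nft"}]}}]}},
]
print(sorted(unique_hashtags(sample_docs)))
```

Because the result comes from a set, the order of the returned list is not guaranteed; sort it if a stable order matters.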


get_node_urls() retrieves a list of unique URLs from the first tweet in each document in the specified collection. It uses the collection.find() method to search for documents that contain a key called "includes.tweets.0.entities.urls", which indicates that the first tweet in the document has at least one URL. It then loops through each matching document, extracts the URLs from the first tweet, and adds them to a set to remove duplicates. Finally, the function converts the set to a list and returns it.



### Relations

get_relationship_tweeted() takes a MongoDB collection as input and returns a list of dictionaries. Each dictionary represents a user who has tweeted at least once and holds the user ID, a list of the IDs of the tweets the user has posted, and a list of the corresponding tweet creation dates. The function first queries the collection for documents in which the first tweet exists, then iterates through these documents. For each document, it extracts the user ID, tweet ID, and creation date of the first tweet. If the user ID has not yet been encountered, a new dictionary is created for that user, with the user ID as the 'id' value and empty lists for tweet IDs and creation dates. Otherwise, the tweet ID and creation date are appended to the corresponding lists for that user. Finally, the function returns a list of the user dictionaries.
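The grouping described above can be sketched as follows, again on plain dictionaries. The helper name group_tweets_by_author and the keys 'tweet_ids' and 'created_at' are hypothetical, and the tweet fields "id", "author_id", and "created_at" are assumptions about the stored layout. Only the 'id' key comes from the description. This sketch appends every tweet, including the first one seen for a user, which may differ from the real function.

```python
def group_tweets_by_author(documents):
    """Group the first tweet of each document under its author."""
    by_user = {}
    for doc in documents:
        tweet = doc["includes"]["tweets"][0]
        # Create the user's entry on first sight, then reuse it.
        entry = by_user.setdefault(
            tweet["author_id"],
            {"id": tweet["author_id"], "tweet_ids": [], "created_at": []},
        )
        entry["tweet_ids"].append(tweet["id"])
        entry["created_at"].append(tweet["created_at"])
    return list(by_user.values())


sample_docs = [
    {"includes": {"tweets": [{"id": "t1", "author_id": "u1", "created_at": "2023-01-25T23:59:30.000Z"}]}},
    {"includes": {"tweets": [{"id": "t2", "author_id": "u2", "created_at": "2023-01-25T23:59:16.000Z"}]}},
    {"includes": {"tweets": [{"id": "t3", "author_id": "u1", "created_at": "2023-01-25T23:58:00.000Z"}]}},
]
for relation in group_tweets_by_author(sample_docs):
    print(relation)
```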

get_relationship_has_hashtag() retrieves information about the relationship between tweets and hashtags in the specified collection. It searches for documents that contain a key called "includes.tweets.0.entities.hashtags" and extracts the tweet ID and associated hashtags. The function returns a list of tweet objects with their associated hashtags.

get_relationship_has_url() retrieves information about the relationship between tweets and URLs in the specified collection. It searches for documents that contain a key called "includes.tweets.0.entities.urls" and extracts the tweet ID and associated URLs. The function returns a list of tweet objects with their associated URLs.

get_relationship_used_hashtag() retrieves all tweets that contain hashtags. For each tweet, the function adds any new hashtag found in the tweet to the user's existing list of hashtags, skipping duplicates. Once all documents have been processed, the function returns a list of dictionaries, where each dictionary represents a user and the hashtags they have used in their tweets.
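A minimal sketch of this per-user accumulation, under stated assumptions: the helper name hashtags_used_by_user and the output key 'hashtags' are hypothetical, only the first tweet of each document is read (as in the node functions), and the documents are assumed to be pre-filtered so that the hashtags entry is always present.

```python
def hashtags_used_by_user(documents):
    """Map each author to the distinct hashtags used in their tweets."""
    used = {}
    for doc in documents:
        tweet = doc["includes"]["tweets"][0]
        tags = used.setdefault(tweet["author_id"], [])
        for hashtag in tweet["entities"]["hashtags"]:
            # Only add hashtags this user has not used before.
            if hashtag["tag"] not in tags:
                tags.append(hashtag["tag"])
    return [{"id": user_id, "hashtags": tags} for user_id, tags in used.items()]


sample_docs = [
    {"includes": {"tweets": [{"author_id": "u1", "entities": {"hashtags": [{"tag": "a"}, {"tag": "b"}]}}]}},
    {"includes": {"tweets": [{"author_id": "u1", "entities": {"hashtags": [{"tag": "b"}, {"tag": "c"}]}}]}},
    {"includes": {"tweets": [{"author_id": "u2", "entities": {"hashtags": [{"tag": "a"}]}}]}},
]
for relation in hashtags_used_by_user(sample_docs):
    print(relation)
```

A list with a membership check keeps the hashtags in first-seen order; a set would be faster for users with many hashtags but would lose that order.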

get_relationship_used_urls() retrieves all tweets that contain URLs. For each tweet, the function adds any new URL found in the tweet to the user's existing list of URLs, skipping duplicates. Once all documents have been processed, the function returns a list of dictionaries, where each dictionary represents a user and the URLs they have used in their tweets.

In [1]:
from python_utils.mongo_python import (
    set_mongo_connection,
    get_node_users, get_node_tweets, get_node_hashtags, get_node_urls,
    get_relationship_tweeted, get_relationship_has_hashtag, get_relationship_has_url,
    get_relationship_used_hashtag, get_relationship_used_urls,
)

Examples:

In [2]:
client,database,collection = set_mongo_connection(
    "mongodb://localhost:27017/",
    "local",
    "test"
)

#nodes
users = get_node_users(collection)
tweets = get_node_tweets(collection)
hashtags = get_node_hashtags(collection)
urls = get_node_urls(collection)

#relationships
tweeted = get_relationship_tweeted(collection)
has_hashtag = get_relationship_has_hashtag(collection)
has_url = get_relationship_has_url(collection)
used_hashtag = get_relationship_used_hashtag(collection)
used_urls = get_relationship_used_urls(collection)



#print heads of data
print("------NODES--------")
for user in users[:5]:
    print(user)
print("--------------")
for tweet in tweets[:5]:
    print(tweet)
print("--------------")
for hashtag in hashtags[:5]:
    print(hashtag)
print("--------------")
for url in urls[:5]:
    print(url)
print("------RELATIONS--------")
for user_tweet_relation in tweeted[:5]:
    print(user_tweet_relation)
print("--------------")
for tweet_hashtag_relation in has_hashtag[:5]:
    print(tweet_hashtag_relation)
print("--------------")
for tweet_url_relation in has_url[:5]:
    print(tweet_url_relation)
print("--------------")
for user_hashtag_relation in used_hashtag[:5]:
    print(user_hashtag_relation)
print("--------------")
for user_url_relation in used_urls[:5]:
    print(user_url_relation)


------NODES--------
{'id': '860025761167572993', 'followers': 35, 'username': 'Sukhmanjeetkau2', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 59, 30)}
{'id': '1389705763208237056', 'followers': 12552, 'username': 'LauraDoodlesToo', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 59, 16)}
{'id': '1495625777605488642', 'followers': 7701, 'username': 'AnastasiaNFTart', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 59, 16)}
{'id': '1613967463', 'followers': 439, 'username': 'esquivel_gifted', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 58)}
{'id': '959355565', 'followers': 342, 'username': 'DevonGarritt', 'info_last_updated': datetime.datetime(2023, 1, 25, 23, 58)}
--------------
{'tweet_id': '1618398184312639491', 'author_id': '860025761167572993', 'created_at': '2023-01-25T23:59:30.000Z', 'retweet_count': 0, 'referenced_tweets': None}
{'tweet_id': '1618398126364299264', 'author_id': '1389705763208237056', 'created_at': '2023-01-25T23:59:16.000Z', 