# Task Description

- Use at least one NoSQL DB (Docker, Docker Compose)
- Lessons learned matter more than an optimal solution (What would I have done differently?)

## Deliverables:
- Dockerfile / Docker Compose file per DB
- PDF of the data model -> system architecture
- Script / program for loading data into the DB
- Queries for the scenarios
- PDF with lessons learned (`lessons-learned.pdf`)

## DB:
- Should run on multiple containers / nodes
  - If that is not possible, explain why it did not work / what would be needed

## App (Optional)
- Load initial data
- Run queries
- Display content

## System Requirements
- There are:
  - Follower relationships
  - Posts by celebrities
- Tasks:
  - Distribute the celebrity posts across the 100 IDs with the most followers (influencers)
  - Posts can be liked (which user liked a post of which other user)
    - Randomly generated
- Queries:
  - List the posts assigned to an account
  - List the 100 accounts with the most followers (influencers)
  - List the 100 accounts that follow the most influencers
  - Home page for an arbitrary account (influencers are a good choice here):
    - Number of followers
    - Number of followed accounts
    - 25 posts:
      - Newest
      - Most likes, from followed accounts
    - Caching of posts for the home page:
      - Fan-out into every follower's cache when a new post is written (queries do not hit a central table; each account has its own tweets feed)
    - List the 25 posts that contain a given word:
      - (Optional: AND-linked words)
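
The fan-out-on-write caching requirement can be illustrated without a database. The following is a minimal in-memory sketch, not the MongoDB implementation used later in this notebook; the class `FeedCache` and its methods `follow`, `publish`, and `home_feed` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime


class FeedCache:
    """In-memory stand-in for per-account feeds (hypothetical helper)."""

    def __init__(self):
        self.followers = defaultdict(set)  # followed_id -> {follower_id, ...}
        self.feeds = defaultdict(list)     # account_id -> [post, ...]

    def follow(self, follower_id, followed_id):
        self.followers[followed_id].add(follower_id)

    def publish(self, author_id, content, timestamp):
        """Write path: copy the new post into the feed of every follower."""
        post = {"author_id": author_id, "content": content, "timestamp": timestamp}
        for follower_id in self.followers[author_id]:
            self.feeds[follower_id].append(post)
        return post

    def home_feed(self, account_id, n=25):
        """Read path: only the account's own feed is read, newest first."""
        feed = self.feeds[account_id]
        return sorted(feed, key=lambda p: p["timestamp"], reverse=True)[:n]


cache = FeedCache()
cache.follow(1, 100)
cache.follow(2, 100)
cache.publish(100, "first", datetime(2016, 1, 1))
cache.publish(100, "second", datetime(2016, 1, 2))
newest_for_1 = [p["content"] for p in cache.home_feed(1, n=1)]
```

The trade-off is that writes become more expensive for accounts with many followers, while reading a home page touches only one feed.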


# Setup

## Import required libraries

In [1]:
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure
from datetime import datetime

import pandas as pd

import csv
import random

## Define the MongoDB server details

In [20]:
# Define the MongoDB server details
host = 'localhost'
port = 27017
username = 'devroot'  # Replace with your MongoDB username
password = 'devroot'  # Replace with your MongoDB password

# Create the connection string
connection_string = f'mongodb://{username}:{password}@{host}:{port}'
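
The connection string above works because `devroot` contains no reserved characters. If a password contained characters such as `@`, `:` or `/`, it would need to be percent-escaped before being placed in the URI. A minimal sketch, assuming a hypothetical helper `build_mongo_uri` and a made-up password:

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host="localhost", port=27017):
    """Build a MongoDB URI with percent-escaped credentials (hypothetical helper)."""
    return f"mongodb://{quote_plus(username)}:{quote_plus(password)}@{host}:{port}"


# The password here is invented purely to show the escaping
uri = build_mongo_uri("devroot", "p@ss:word/1")
print(uri)
```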

## Connect to DB and test connection

In [21]:
# Connect to the MongoDB server
client = MongoClient(connection_string)

In [22]:
try:
    # Verify connection
    client.admin.command('ping')
    print("Connected successfully to MongoDB")
    
    # List all databases
    databases = client.list_database_names()
    print("Databases:", databases)
        
except ConnectionFailure as e:
    print(f"Could not connect to MongoDB: {e}")

Connected successfully to MongoDB
Databases: ['admin', 'config', 'local', 'social_network']


# Global Definitions

## Collection definition

In [23]:
# Select the database and collections
db = client['social_network']
users_collection = db['users']
followers_collection = db['followers']
posts_collection = db['posts']
likes_collection = db['likes']
feeds_collection = db['feeds']

## Helper functions

In [9]:
# Function to add a user
def add_user(user_id):
    user = {"_id": user_id, "following_count": 0, "followers_count": 0}
    users_collection.insert_one(user)

# Function to check if user already exists
def user_exists(user_id):
    return users_collection.count_documents({"_id": user_id}) > 0

# Function to follow a user
def follow_user(follower_id, followed_id):
    relationship = {"follower_id": follower_id, "followed_id": followed_id}
    followers_collection.insert_one(relationship)
    users_collection.update_one({"_id": follower_id}, {"$inc": {"following_count": 1}})
    users_collection.update_one({"_id": followed_id}, {"$inc": {"followers_count": 1}})

# Function to add a post
def add_post(user_id, content, date):
    post = {
        "user_id": user_id,
        "content": content,
        "timestamp": date,
        "likes": 0
    }
    post_id = posts_collection.insert_one(post).inserted_id
    propagate_post_to_followers(post_id, user_id, content, date)
    return post_id

# Function to propagate a post to all followers' feeds
def propagate_post_to_followers(post_id, user_id, content, date):
    followers = followers_collection.find({"followed_id": user_id})
    feed_entries = [
        {
            "user_id": follower["follower_id"],
            "post_id": post_id,
            "poster_id": user_id,
            "content": content,
            "timestamp": date,
            "likes": 0
        }
        for follower in followers
    ]
    # One round trip for all followers; insert_many rejects an empty list,
    # so posts by users without followers are skipped here
    if feed_entries:
        feeds_collection.insert_many(feed_entries)

# Function to like a post
def like_post(user_id, post_id):
    like = {"user_id": user_id, "post_id": post_id}
    likes_collection.insert_one(like)
    posts_collection.update_one({"_id": post_id}, {"$inc": {"likes": 1}})
    feeds_collection.update_many({"post_id": post_id}, {"$inc": {"likes": 1}})

def get_top_influencers(n):
    return users_collection.find().sort("followers_count", -1).limit(n)

def get_user_posts(user_id):
    return posts_collection.find({"user_id": user_id})

# NOTE: this ranks users by their total following_count. The task asks for the number of
# followed influencers instead, so this function still needs to be changed accordingly.
def get_top_follower_users(n):
    return users_collection.find().sort("following_count", -1).limit(n)

def get_user_profile(user_id, n):
    user = users_collection.find_one({"_id": user_id})
    followers_count = user["followers_count"]
    following_count = user["following_count"]
    
    recent_posts = posts_collection.find({"user_id": user_id}).sort("timestamp", -1).limit(n)
    popular_posts = posts_collection.find({"user_id": user_id}).sort("likes", -1).limit(n)
    feed = feeds_collection.find({"user_id": user_id}).sort("timestamp", -1).limit(n)
    
    profile = {
        "user_id": user_id,
        "followers_count": followers_count,
        "following_count": following_count,
        "recent_posts": list(recent_posts),
        "popular_posts": list(popular_posts),
        "feed": list(feed)
    }
    return profile

def print_user_profile(user_profile):
    print("User profile with id:", user_profile["user_id"])
    print("Followers count:", user_profile["followers_count"])
    print("Following count:", user_profile["following_count"])
    print("Recent posts:")
    for post in user_profile["recent_posts"]:
        print(post["content"], "date:", post["timestamp"])
    print("Popular posts:")
    for post in user_profile["popular_posts"]:
        print(post["content"], "likes", post["likes"])
    print("Feed:")
    for post in user_profile["feed"]:
        print(post["content"], "date:", post["timestamp"])

# Case-insensitive substring match: searching for "test" also returns posts containing "PROTEST"
def get_posts_with_word(word, n):
    return posts_collection.find({"content": {"$regex": word, "$options": "i"}}).limit(n)

# Function to read the CSV file and process the data
def process_csv(file_path):
    data_map = {}  # Dictionary to store the concatenated string and date

    with open(file_path, mode='r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        
        # Iterate through each row in the CSV file
        for index, row in enumerate(reader):
            author = row['author']
            content = row['content']
            date_time_str = row['date_time']
            
            # Concatenate author and content
            concatenated = f"{author}: {content}"
            
            # Convert the date_time string to a datetime object
            date_time = datetime.strptime(date_time_str, '%d/%m/%Y %H:%M')
            
            # Store the concatenated string and date in the dictionary
            data_map[index] = {'tweet': concatenated, 'date': date_time}
    
    return data_map

# Function to get a random user ID from the users collection
def get_random_user_id():
    count = users_collection.count_documents({})
    if count > 0:
        random_index = random.randint(0, count - 1)
        random_user = users_collection.find().skip(random_index).limit(1)
        for user in random_user:
            return user['_id']
    else:
        return None
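
The comment on `get_top_follower_users` notes that the ranking should only count followed influencers. One possible approach is an aggregation over the `followers` collection that keeps only edges pointing at influencers and counts them per follower. The sketch below only builds the pipeline; the function name `influencer_followers_pipeline` and the output field `influencers_followed` are hypothetical, and it assumes each follower relationship is stored exactly once.

```python
def influencer_followers_pipeline(influencer_ids, n=100):
    """Aggregation stages ranking accounts by how many of the given influencers they follow."""
    return [
        # Keep only relationships that point at an influencer
        {"$match": {"followed_id": {"$in": list(influencer_ids)}}},
        # Count those relationships per follower
        {"$group": {"_id": "$follower_id", "influencers_followed": {"$sum": 1}}},
        # Highest count first; _id as tie-breaker for a stable order
        {"$sort": {"influencers_followed": -1, "_id": 1}},
        {"$limit": n},
    ]


pipeline = influencer_followers_pipeline([40981798, 43003845], n=100)
# With a live connection this would be run as:
# results = followers_collection.aggregate(pipeline)
```

In the notebook, the influencer IDs could be taken from the result of `get_top_influencers(100)`.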

# Insert Data into DB

## Insert users and following relationships to db

In [7]:
from pymongo import InsertOne, UpdateOne

file_path = './InputData/twitter_combined.txt'

# Read file content
with open(file_path, 'r') as file:
    lines = file.readlines()

# Prepare data
user_pairs = [tuple(map(int, line.strip().split())) for line in lines]

# Get a unique set of all users involved
all_users = {user for pair in user_pairs for user in pair}

# Check which users already exist in the database
existing_users = set(users_collection.distinct("_id", {"_id": {"$in": list(all_users)}}))

# Identify new users
new_users = all_users - existing_users

# Prepare bulk operations for new users
user_bulk_operations = [
    InsertOne({"_id": user_id, "following_count": 0, "followers_count": 0})
    for user_id in new_users
]

# Execute bulk insert for new users
if user_bulk_operations:
    users_collection.bulk_write(user_bulk_operations)

# Prepare bulk operations for relationships and updating counts
relationship_bulk_operations = []
user_update_operations = []

for user1, user2 in user_pairs:
    relationship_bulk_operations.append(InsertOne({"follower_id": user1, "followed_id": user2}))
    user_update_operations.append(UpdateOne({"_id": user1}, {"$inc": {"following_count": 1}}))
    user_update_operations.append(UpdateOne({"_id": user2}, {"$inc": {"followers_count": 1}}))

# Execute bulk insert for relationships
if relationship_bulk_operations:
    followers_collection.bulk_write(relationship_bulk_operations)

# Execute bulk update for user counts
if user_update_operations:
    users_collection.bulk_write(user_update_operations)
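
The cell above queues two `UpdateOne` operations per relationship. A possible refinement is to count the edges per user first, so that each user needs only one update. The sketch below works on plain tuples rather than the database; `count_follow_edges` is a hypothetical helper and the sample pairs are invented.

```python
from collections import Counter


def count_follow_edges(user_pairs):
    """Return (following_counts, followers_counts) for (follower, followed) pairs."""
    following_counts = Counter(follower for follower, _ in user_pairs)
    followers_counts = Counter(followed for _, followed in user_pairs)
    return following_counts, followers_counts


pairs = [(1, 2), (3, 2), (1, 3)]
following, followers = count_follow_edges(pairs)

# Each user then needs a single $inc document instead of one per edge
inc_documents = {
    user_id: {"$inc": {"following_count": following[user_id],
                       "followers_count": followers[user_id]}}
    for user_id in set(following) | set(followers)
}
```

Each entry of `inc_documents` could then be wrapped in one `UpdateOne({"_id": user_id}, ...)` operation.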


In [7]:
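# Alternative to the bulk load above: inserts row by row, which is slower.
# Run only one of the two cells, otherwise every relationship is inserted twice.
file_path = './InputData/twitter_combined.txt'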

with open(file_path, 'r') as file:
    lines = file.readlines()
    user_pairs = [tuple(map(int, line.strip().split())) for line in lines]

    for user1, user2 in user_pairs:
        if not user_exists(user1):
            add_user(user1)
        
        if not user_exists(user2):
            add_user(user2)
        
        follow_user(user1, user2)

## Find out the top 100 most followed users (Influencers)

In [24]:
top_influencers = get_top_influencers(100)
top_influencers_list = list(top_influencers)
print("Top influencers:")
for influencer in top_influencers_list:
    print("user id:", influencer["_id"], "Follower count", influencer["followers_count"])

Top influencers:
user id: 40981798 Follower count 8660
user id: 43003845 Follower count 7700
user id: 22462180 Follower count 7623
user id: 34428380 Follower count 7558
user id: 115485051 Follower count 4798
user id: 15913 Follower count 4337
user id: 3359851 Follower count 3986
user id: 11348282 Follower count 3850
user id: 7861312 Follower count 3712
user id: 27633075 Follower count 3655
user id: 31331740 Follower count 3623
user id: 18996905 Follower count 3255
user id: 7860742 Follower count 3197
user id: 813286 Follower count 3172
user id: 22784458 Follower count 2974
user id: 17868918 Follower count 2904
user id: 10671602 Follower count 2874
user id: 117674417 Follower count 2858
user id: 48485771 Follower count 2725
user id: 34068984 Follower count 2693
user id: 18927441 Follower count 2680
user id: 83943787 Follower count 2678
user id: 15853668 Follower count 2634
user id: 1183041 Follower count 2593
user id: 238260874 Follower count 2560
user id: 8088112 Follower count 2539
us

## Assign the input tweets to the influencers

In [46]:
# Path to the CSV file
file_path = './InputData/tweets.csv'

# Function to add a post and propagate it to followers
def add_post(user_id, content, date):
    post = {
        "user_id": user_id,
        "content": content,
        "timestamp": date,
        "likes": 0
    }
    post_id = posts_collection.insert_one(post).inserted_id
    propagate_post_to_followers(post_id, user_id, content, date)
    return post_id

# Function to propagate a post to all followers' feeds (one feed document per follower)
def propagate_post_to_followers(post_id, user_id, content, date):
    feed_entry = {"poster_id": user_id, "post_id": post_id, "content": content, "timestamp": date}
    followers = followers_collection.find({"followed_id": user_id})
    for follower in followers:
        # Upsert creates the follower's feed document on first use;
        # $addToSet skips the entry if an identical one is already in the feed
        feeds_collection.update_one(
            {"user_id": follower["follower_id"]},
            {"$addToSet": {"feed": feed_entry}},
            upsert=True
        )

df = pd.read_csv(file_path)
tweets_map = df.to_dict(orient='index')

tweets_ids = []

for idx, tweet_data in tweets_map.items():
    random_influencer_document = random.choice(top_influencers_list)
    influencer_id = random_influencer_document['_id']
    date = datetime.strptime(tweet_data['date_time'], '%d/%m/%Y %H:%M')
    content = tweet_data['content']
    temp_id = add_post(influencer_id, content, date)
    tweets_ids.append(temp_id)


ValueError: time data '27/12/2016 04:04' does not match format '%m/%d/%Y %H:%M'
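
The traceback names the month-first format `'%m/%d/%Y %H:%M'`, while the timestamps in the CSV are day-first (`27/12/2016 04:04`). The cell as shown already uses `'%d/%m/%Y %H:%M'`, so the error presumably comes from an earlier run. A small sketch of a parsing helper that makes such mismatches easier to spot; `parse_tweet_timestamp` and `DAY_FIRST_FORMAT` are hypothetical names:

```python
from datetime import datetime

DAY_FIRST_FORMAT = "%d/%m/%Y %H:%M"


def parse_tweet_timestamp(value, fmt=DAY_FIRST_FORMAT):
    """Parse a timestamp string, naming value and format in the error on failure."""
    try:
        return datetime.strptime(value.strip(), fmt)
    except ValueError as exc:
        raise ValueError(f"cannot parse {value!r} with format {fmt!r}") from exc


parsed = parse_tweet_timestamp("27/12/2016 04:04")
print(parsed.isoformat())
```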

In [9]:
# Path to the CSV file
file_path = './InputData/tweets.csv'

# Parse the CSV into a map of index -> {'tweet', 'date'}
tweets_map = process_csv(file_path)
tweets_ids = []

# Assign each tweet to a randomly chosen influencer
for key, value in tweets_map.items():
    random_influencer_document = random.choice(top_influencers_list)
    influencer_id = random_influencer_document['_id']
    temp_id = add_post(influencer_id, value['tweet'], value['date'])
    tweets_ids.append(temp_id)


KeyboardInterrupt: 

## Create random likes

In [27]:
import random
from collections import Counter
from pymongo import InsertOne, UpdateMany, UpdateOne

likes_count = 1000

# Fetch all user IDs once
user_ids = list(users_collection.distinct('_id'))

# Draw all random users and posts up front
random_tweet_ids = random.choices(tweets_ids, k=likes_count)
random_user_ids = random.choices(user_ids, k=likes_count)

# One like document per (user, post) pair
like_operations = [
    InsertOne({"user_id": user_id, "post_id": post_id})
    for user_id, post_id in zip(random_user_ids, random_tweet_ids)
]

# Number of new likes per post, so each post is incremented only once
likes_per_post = Counter(random_tweet_ids)

# Execute bulk insert for likes
if like_operations:
    likes_collection.bulk_write(like_operations)

# Execute bulk update for posts
post_updates = [UpdateOne({"_id": post_id}, {"$inc": {"likes": count}}) for post_id, count in likes_per_post.items()]
if post_updates:
    posts_collection.bulk_write(post_updates)

# A post is copied into many feeds, so every matching feed entry must be updated
# (UpdateMany; UpdateOne would only touch the first match)
feed_updates = [UpdateMany({"post_id": post_id}, {"$inc": {"likes": count}}) for post_id, count in likes_per_post.items()]
if feed_updates:
    feeds_collection.bulk_write(feed_updates)
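
Drawing users and posts independently can produce the same (user, post) pair more than once, which would count as a user liking the same post twice. If that is not wanted, distinct pairs can be sampled first. A minimal sketch on plain lists; `sample_unique_likes` is a hypothetical helper and the sample IDs are invented:

```python
import random


def sample_unique_likes(user_ids, post_ids, k, seed=None):
    """Draw k distinct (user_id, post_id) pairs without building the full cross product."""
    users = list(dict.fromkeys(user_ids))  # drop duplicate IDs, keep order
    posts = list(dict.fromkeys(post_ids))
    if k > len(users) * len(posts):
        raise ValueError("k exceeds the number of possible (user, post) pairs")
    rng = random.Random(seed)
    pairs = set()
    # Redraw until enough distinct pairs exist; cheap while k is far below the maximum
    while len(pairs) < k:
        pairs.add((rng.choice(users), rng.choice(posts)))
    return list(pairs)


likes = sample_unique_likes([1, 2, 3], ["a", "b"], k=4, seed=0)
```

Rejection sampling like this is only a good fit while `k` is small compared to the number of possible pairs, as is the case for a few thousand likes over the full user set.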


In [10]:
likes_count = 2000

for i in range(likes_count):
    random_tweet_id = random.choice(tweets_ids)
    random_user = get_random_user_id()
    like_post(random_user, random_tweet_id)

KeyboardInterrupt: 

# Request data from DB

## Find out top 100 most followed

In [None]:
# Get and print top influencers
top_influencers = get_top_influencers(2)
print("Top influencers:")
for influencer in top_influencers:
    print("user id:", influencer["_id"], "Follower count", influencer["followers_count"])

## Find out top 100 influencer followers

In [None]:
# Get and print top followers
top_followers = get_top_follower_users(2)
print("Top followers:")
for follower in top_followers:
    print("user id:", follower["_id"], "Accounts followed", follower["following_count"])

## Show user profile

In [28]:
# Get and print user profile
user_id = 40981798
tweets_to_show = 25
user_profile = get_user_profile(user_id, tweets_to_show)
print_user_profile(user_profile)

User profile with id: 40981798
Followers count: 8660
Following count: 1676
Recent posts:
katyperry: This #Thanksgiving #IStandWithStandingRock text WATER to 82623 to help now! Sign the petition here: https://t.co/GFhQKfsB0i date: 2016-11-24 22:31:00
katyperry: #Rio2016 ARE YOU READY TO #RISE https://t.co/QtFBSCQAPY date: 2016-08-05 21:17:00
katyperry: @hotncolds swallowing lots of rain water? date: 2016-08-05 09:22:00
katyperry: @diddleurskittle @Katysanalien omg! U r first! I'm flustered. date: 2016-07-23 07:17:00
katyperry: @perryx84 omg this is one of my fav moments on teevee. Let's be friends? date: 2016-07-15 05:45:00
katyperry: I may have just finished lint rolling my face. Goodnight.                                 (to be fair it's the best way to get glitter off) date: 2016-07-11 07:40:00
katyperry: Every1 else havin a lit Friday night &amp; I'm here examining the nose pore cleaner strip I just peeled off &amp; I must say, nothing can top this. date: 2016-07-02 04:49:00
katyper

## Posts containing word

In [13]:
# Get and print posts with a given word
word = "test"
posts_with_word = get_posts_with_word(word, 2)
print(f"Posts containing '{word}':")
for post in posts_with_word:
    print(post["content"])

Posts containing 'test':
katyperry: 🙏🏼🇺🇸PEACEFULLY PROTEST🇺🇸🙏🏼
katyperry: No! 1. It's not happening and 2. Always stay and peacefully protest.  https://t.co/3bYoUbOh1g
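
The results show that the plain regex matches substrings ("test" inside "PROTEST"). For the optional AND-linked word search, one possible approach is a pattern with one lookahead per word, combined with word boundaries. The sketch below builds such a pattern and checks it locally with Python's `re` module; `all_words_pattern` is a hypothetical helper, and it assumes the search words are plain alphanumeric words and that MongoDB's regex engine accepts the same lookahead syntax.

```python
import re


def all_words_pattern(words):
    """Regex matching text that contains every word as a whole word, in any order."""
    lookaheads = "".join(rf"(?=.*\b{re.escape(word)}\b)" for word in words)
    return f"^{lookaheads}"


pattern = all_words_pattern(["peacefully", "protest"])
# Query document for the posts collection ("i": ignore case, "s": dot matches newlines)
query = {"content": {"$regex": pattern, "$options": "is"}}

# Local check with Python's re module
matcher = re.compile(pattern, re.IGNORECASE | re.DOTALL)
single_word = re.compile(all_words_pattern(["test"]), re.IGNORECASE | re.DOTALL)
```

With a live connection, `posts_collection.find(query).limit(25)` would return up to 25 matching posts.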


# Cleanup

In [6]:
db.users.drop()
db.followers.drop()
db.posts.drop()
db.likes.drop()
db.feeds.drop()
print("Database cleared.")

Database cleared.


In [19]:
# Close the connection
client.close()