# Latam Challenge!

The first step is to set the file path. In this example, the file is in the same folder as this notebook.

In [None]:
file_path = "farmers-protest-tweets-2021-2-4.json"

Install the package needed to measure memory usage. The following line is commented out on the assumption that memory_profiler is already installed.

In [None]:
#%pip install memory_profiler

Load the memory_profiler extension.

In [None]:
%load_ext memory_profiler

Now let's import the challenge files. Each one is brought into the notebook with the %load magic, which loads an external file into a cell.

First challenge:
The top 10 dates with the most tweets. Mention the user (username) with the most posts for each of those days.

In [None]:
# %load q1_time.py
import json
from datetime import datetime
from collections import defaultdict
from typing import List, Tuple

def q1_time(file_path: str) -> List[Tuple[datetime.date, str]]:
    date_tweet_count = defaultdict(int)
    date_user_tweets = defaultdict(lambda: defaultdict(int))

    with open(file_path, 'r', encoding='utf-8') as json_file: # import json content
        for line in json_file:
            tweet = json.loads(line)
            date_str = tweet["date"][:10]  # extract date part from the datetime string
            date_tweet_count[date_str] += 1 # count tweets
            date_user_tweets[date_str][tweet["user"]["username"]] += 1

    top_dates_users = []
    for date, user_tweets in date_user_tweets.items():
        top_user = max(user_tweets, key=user_tweets.get)
        top_dates_users.append((datetime.strptime(date, '%Y-%m-%d').date(), top_user))

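    # date_tweet_count is keyed by 'YYYY-MM-DD' strings, so convert each date back before the lookup
    return sorted(top_dates_users, key=lambda x: date_tweet_count[x[0].isoformat()], reverse=True)[:10]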

The following code runs the function, stores the returned list in result, and prints it.

In [None]:
# Example usage:
file_path = "/Users/marlonoliveira/Downloads/farmers-protest-tweets-2021-2-4.json"
result = q1_time(file_path)
print(result)

Time Optimizations:

* Used a single sorted call on top_dates_users with a custom key function to prioritize dates by tweet count.
* Used max directly on user_tweets to find the top user, avoiding unnecessary conversion to items.
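
The ranking step can be tried on a small in-memory sample. The sketch below is an alternative formulation rather than the code in q1_time.py: it uses Counter.most_common to pick the busiest dates first, so the most active user is looked up only for those dates. The sample data and the helper name top_dates are invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical (date string, username) pairs standing in for parsed tweets.
sample = [
    ("2021-02-12", "alice"),
    ("2021-02-12", "alice"),
    ("2021-02-12", "bob"),
    ("2021-02-13", "carol"),
    ("2021-02-14", "dave"),
    ("2021-02-14", "dave"),
    ("2021-02-14", "erin"),
    ("2021-02-14", "erin"),
    ("2021-02-14", "erin"),
]

date_tweet_count = Counter()
date_user_tweets = defaultdict(Counter)
for date_str, username in sample:
    date_tweet_count[date_str] += 1
    date_user_tweets[date_str][username] += 1


def top_dates(n):
    """Return (date, most active user) for the n dates with the most tweets."""
    ranked = []
    for date_str, _count in date_tweet_count.most_common(n):
        user_tweets = date_user_tweets[date_str]
        top_user = max(user_tweets, key=user_tweets.get)
        ranked.append((datetime.strptime(date_str, "%Y-%m-%d").date(), top_user))
    return ranked


top_two = top_dates(2)
print(top_two)
```

Both dictionaries are keyed by the same date string here, which avoids mixing string keys and date objects when ranking.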

Note the code below: the %prun command profiles the call and reports the execution time broken down by function. The other commands are commented out because they perform similar tasks.

In [None]:
%prun q1_time(file_path)
#%timeit q1_time(file_path)
#%time q1_time(file_path)
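
Outside IPython, a similar per-function report can be produced with the standard library's cProfile and pstats modules. This is a minimal sketch: busy_work is a hypothetical stand-in for q1_time so that the example runs without the dataset.

```python
import cProfile
import io
import pstats


def busy_work(n: int) -> int:
    """Hypothetical stand-in for q1_time: sums squares so there is something to profile."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
total = busy_work(100_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries only.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```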

The code below measures memory usage during the execution of the q1_time function. It serves only as a baseline, to show that the optimized version reduces memory usage.

In [None]:
%memit q1_time(file_path)

As said before, the following code contains memory usage optimization:

In [None]:
# %load q1_memory.py
import json
from datetime import datetime
from collections import defaultdict
from typing import List, Tuple

def q1_memory(file_path: str) -> List[Tuple[datetime.date, str]]:
    date_tweet_count = defaultdict(int)
    date_user_tweets = defaultdict(lambda: defaultdict(int))

    with open(file_path, 'r', encoding='utf-8') as json_file:
        for line in json_file:
            tweet = json.loads(line)
            date_str = tweet["date"][:10]  # Extract date part from the datetime string
            date_tweet_count[date_str] += 1
            date_user_tweets[date_str][tweet["user"]["username"]] += 1

    top_dates_users = []
    for date, user_tweets in date_user_tweets.items():
        top_user = max(user_tweets, key=user_tweets.get)
        top_dates_users.append((datetime.strptime(date, '%Y-%m-%d').date(), top_user))

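    # date_tweet_count is keyed by 'YYYY-MM-DD' strings, so convert each date back before the lookup
    return sorted(top_dates_users, key=lambda x: date_tweet_count[x[0].isoformat()], reverse=True)[:10]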

Optimizations made to reduce memory usage:

* Process the JSON file line by line, minimizing the amount of data loaded into memory at once.
* Only store necessary information (date, tweet count, and user tweets) in dictionaries.
* These optimizations help reduce memory usage while still achieving the desired functionality.
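
The effect of line-by-line processing can also be checked with the standard library's tracemalloc module, which tracks Python allocations. The sketch below is an illustration under assumptions: the sample file, its record layout, and the helper names are invented, and the absolute numbers will differ from what %memit reports for the real dataset.

```python
import json
import os
import tempfile
import tracemalloc
from collections import Counter


def write_sample(path, n_lines):
    """Write n_lines of hypothetical tweet-like JSON records, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for i in range(n_lines):
            record = {"date": f"2021-02-{12 + i % 3:02d}T00:00:00+00:00", "content": "x" * 200}
            f.write(json.dumps(record) + "\n")


def count_loading_all(path):
    # Parses every record into a list before counting.
    with open(path, "r", encoding="utf-8") as f:
        tweets = [json.loads(line) for line in f]
    return Counter(t["date"][:10] for t in tweets)


def count_streaming(path):
    # Keeps only the running counts; each parsed record is discarded immediately.
    counts = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line)["date"][:10]] += 1
    return counts


def peak_bytes(func, path):
    """Run func(path) and return its result with the peak traced memory in bytes."""
    tracemalloc.start()
    result = func(path)
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json")
    write_sample(sample_path, 5000)
    loaded_counts, loaded_peak = peak_bytes(count_loading_all, sample_path)
    streamed_counts, streamed_peak = peak_bytes(count_streaming, sample_path)

print(f"load all: {loaded_peak} bytes, streaming: {streamed_peak} bytes")
```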

The code below runs the function, saves the return value in a variable called 'result', and prints it.

In [None]:
file_path = "/Users/marlonoliveira/Downloads/farmers-protest-tweets-2021-2-4.json"
result = q1_memory(file_path)
print(result)

The code below measures the execution time, although time optimization is not the goal of this version.

In [None]:
%time q1_memory(file_path)

In [None]:
%memit q1_memory(file_path)

Second challenge:
The top 10 most used emojis with their respective counts. 

In [None]:
# %load q2_time.py
import json
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

def extract_emojis(text: str) -> List[str]:
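    # Python 3 strings hold each emoji as a single code point, so surrogate-pair escapes never match;
    # these are the same ranges written as code points.
    emoji_pattern = r'[\U0001F300-\U0001F3FF]|[\U0001F400-\U0001F5FF]|[\U0001F900-\U0001F9FF]|[\u2600-\u2B55]'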
    emojis = [match.group() for match in re.finditer(emoji_pattern, text)]
    return emojis

def process_tweet(line):
    tweet = json.loads(line)
    emojis = extract_emojis(tweet["content"])
    return emojis

def q2_time(file_path: str) -> List[Tuple[str, int]]:
    emoji_counter = Counter()

    with open(file_path, 'r', encoding='utf-8') as json_file:
        with ThreadPoolExecutor() as executor:
            emojis_lists = list(executor.map(process_tweet, json_file))

    for emojis in emojis_lists:
        emoji_counter.update(emojis)

    # Get the top 10 most used emojis
    top_emojis = emoji_counter.most_common(10)

    return top_emojis

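This version uses ThreadPoolExecutor to process tweets concurrently and extract emojis, which can improve the overall execution time, especially when dealing with a large number of tweets. Because CPython threads share the global interpreter lock, the gain for CPU-bound work such as JSON parsing may be small, so it is worth measuring.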
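
The pattern of mapping a worker function over lines and merging the per-line results can be sketched on a few in-memory records. The sample lines and the count_words helper are hypothetical; word counting stands in for emoji extraction so the example stays small.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical JSON lines standing in for the tweet file.
lines = [json.dumps({"content": text}) for text in ["a b", "b c", "c c"]]


def count_words(line):
    """Parse one JSON line and count the words in its content field."""
    return Counter(json.loads(line)["content"].split())


# executor.map yields results in input order; list() collects them all.
with ThreadPoolExecutor(max_workers=4) as executor:
    partial_counts = list(executor.map(count_words, lines))

threaded_total = sum(partial_counts, Counter())

# The same work done sequentially, for comparison.
sequential_total = Counter()
for line in lines:
    sequential_total.update(count_words(line))

print(threaded_total.most_common(1))
```

Both approaches produce the same counts, so timing them against each other on the real file (for example with %timeit) shows whether the threads actually help.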

In [None]:
# Example usage:
file_path = "/Users/marlonoliveira/Downloads/farmers-protest-tweets-2021-2-4.json"
result = q2_time(file_path)
print(result)

The following code shows the execution times for the function.

In [None]:
%prun q2_time(file_path)

The code below shows the same function, now optimized for memory usage:

In [None]:
# %load q2_memory.py
import json
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

def extract_emojis(text: str) -> List[str]:
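    # Python 3 strings hold each emoji as a single code point, so surrogate-pair escapes never match;
    # these are the same ranges written as code points.
    emoji_pattern = r'[\U0001F300-\U0001F3FF]|[\U0001F400-\U0001F5FF]|[\U0001F900-\U0001F9FF]|[\u2600-\u2B55]'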
    emojis = [match.group() for match in re.finditer(emoji_pattern, text)]
    return emojis

def process_tweet(line, emoji_counter):
    tweet = json.loads(line)
    emojis = extract_emojis(tweet["content"])
    emoji_counter.update(emojis)

def q2_memory(file_path: str) -> List[Tuple[str, int]]:
    emoji_counter = Counter()

    with open(file_path, 'r', encoding='utf-8') as json_file:
        with ThreadPoolExecutor() as executor:
            executor.map(lambda line: process_tweet(line, emoji_counter), json_file)

    top_emojis = emoji_counter.most_common(10) # get the top 10 most used emojis

    return top_emojis

In this version, we process each line individually, and as soon as we extract the emojis from a tweet, we update the Counter. This way, we avoid keeping a large list of emojis in memory.
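
Updating the Counter as each text is processed can be sketched without threads. In this illustration the code-point ranges are an assumption and do not cover every emoji, and the sample texts are invented.

```python
import re
from collections import Counter

# Assumed ranges: a broad pictograph block plus miscellaneous symbols and dingbats.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Hypothetical tweet texts, written with escapes to keep the source ASCII-only.
texts = [
    "Support farmers \U0001F33E\U0001F33E",
    "Thank you \U0001F64F",
    "Sunny \u2600 day \U0001F33E",
]

emoji_counter = Counter()
for text in texts:
    # Each text's emojis go straight into the counter; no list of all emojis is kept.
    emoji_counter.update(EMOJI_PATTERN.findall(text))

top_emoji = emoji_counter.most_common(1)
print(top_emoji)
```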

The following code shows the memory usage. Now we are interested in seeing how the optimization improves the memory usage. 

In [None]:
%memit q2_memory(file_path)

Third challenge:
The historical top 10 users (username) most influential based on the count of mentions (@) they register. 

In [None]:
# %load q3_time.py
import json
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

def extract_mentions(text: str) -> List[str]:
    mention_pattern = r'@(\w+)'  # Adjust the regex pattern as needed
    mentions = re.findall(mention_pattern, text)
    return mentions

def process_tweet(line, mention_counter):
    tweet = json.loads(line)
    mentions = extract_mentions(tweet["content"])
    mention_counter.update(mentions)

def q3_time(file_path: str) -> List[Tuple[str, int]]:
    mention_counter = Counter()

    with open(file_path, 'r', encoding='utf-8') as json_file:
        with ThreadPoolExecutor() as executor:
            executor.map(lambda line: process_tweet(line, mention_counter), json_file)

    # Get the top 10 most mentioned users
    top_mentions = mention_counter.most_common(10)

    return top_mentions

In [None]:
# This version uses ThreadPoolExecutor to parallelize the processing of tweets. The process_tweet function extracts mentions from each tweet, and the ThreadPoolExecutor efficiently distributes the workload across multiple threads.

# Note: The effectiveness of parallelization depends on the number of available CPU cores and the nature of the processing tasks. Adjust the regex pattern (mention_pattern) based on your specific dataset.

In [None]:
# Example usage:
result = q3_time(file_path)
print(result)

The following code shows the execution times for the function.

In [None]:
%prun q3_time(file_path)

The next code optimizes the memory usage of the third challenge.

In [None]:
# %load q3_memory.py
import json
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

def extract_mentions(text: str) -> List[str]:
    mention_pattern = r'@(\w+)'  # Adjust the regex pattern as needed
    mentions = re.findall(mention_pattern, text)
    return mentions

def process_tweet(line, mention_counter):
    tweet = json.loads(line)
    mentions = extract_mentions(tweet["content"])
    mention_counter.update(mentions)

def q3_memory(file_path: str) -> List[Tuple[str, int]]:
    mention_counter = Counter()

    with open(file_path, 'r', encoding='utf-8') as json_file:
        with ThreadPoolExecutor() as executor:
            executor.map(lambda line: process_tweet(line, mention_counter), json_file)

    # Get the top 10 most mentioned users
    top_mentions = mention_counter.most_common(10)

    return top_mentions

To optimize memory usage, I made some modifications to reduce the memory footprint. Specifically, the code avoids storing the entire list of mentions in memory and updates the counter directly as each line is processed.
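
The mention extraction and the direct counter update can be tried on a few sample strings. The texts and handles below are invented, and the stricter pattern is only one possible adjustment: it shows why the regex may need tuning, since the simple pattern also matches the domain part of an e-mail address.

```python
import re
from collections import Counter

SIMPLE_PATTERN = re.compile(r"@(\w+)")          # same pattern as in the functions above
STRICT_PATTERN = re.compile(r"(?<!\w)@(\w+)")   # assumed variant: '@' must not follow a word character

# Hypothetical tweet texts.
texts = [
    "Thanks @alice and @bob_92 for the support",
    "@alice is right",
    "Contact: team@example.org",
]

simple_counter = Counter()
strict_counter = Counter()
for text in texts:
    # Counters are updated per text, so no combined list of mentions is built.
    simple_counter.update(SIMPLE_PATTERN.findall(text))
    strict_counter.update(STRICT_PATTERN.findall(text))

print(simple_counter.most_common(1))
print(strict_counter)
```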

The following code shows the memory usage. Now we are interested in seeing how the optimization improves the memory usage. 

In [None]:
%memit q3_memory(file_path)

The following code makes a POST request.

In [None]:
import requests

# Define the URL and JSON 
url = "https://advana-challenge-check-api-cr-k4hdbggvoq-uc.a.run.app/data-engineer"
payload = {
    "name": "Marlon Oliveira",
    "mail": "oliwer.marlon@gmail.com",
    "github_url": "https://github.com/marlondcu/latam-challenge.git"
}

response = requests.post(url, json=payload)  # Make the POST request

if response.status_code == 200: # Check if the request was successful (status code 200)
    print("POST request was successful!")
    print("Response:", response.text)
else:
    print("POST request failed with status code:", response.status_code)
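
Passing json=payload to requests.post serializes the dictionary to JSON and sets the Content-Type header to application/json. The sketch below builds an equivalent request with the standard library's urllib so that it can be inspected without sending anything. The URL and payload values are placeholders, not the ones used above.

```python
import json
import urllib.request

# Hypothetical placeholders; nothing is sent over the network in this sketch.
url = "https://example.invalid/data-engineer"
payload = {
    "name": "Jane Doe",
    "mail": "jane@example.org",
    "github_url": "https://example.invalid/repo.git",
}

body = json.dumps(payload).encode("utf-8")
request = urllib.request.Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Inspect what would be sent.
print(request.get_method())
print(request.get_header("Content-type"))  # urllib stores header names capitalized this way
print(request.data.decode("utf-8"))
```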