# Exercise: CRM campaign_Teste

In this notebook we will test our system on the data generated in the notebook `Generate_Data`.

In [1]:
!pip install memory-profiler



In [2]:
import os 
import pandas as pd
from datetime import datetime, timedelta
from memory_profiler import memory_usage

In [3]:
# Read one daily log file into a DataFrame

def read_daily_log_file(log_file):
    # The log is pipe-delimited with no header row: sng_id|user_id|country
    df = pd.read_csv(log_file, sep='|', header=None, names=['sng_id', 'user_id', 'country'])
    return df

In [4]:
# To compute top_songs:

def compute_top_songs(data, output_file):
   
    # Group by 'country' and 'sng_id' and count the streams.
    # `size()` counts every row in each group, including rows where another
    # column (e.g. 'user_id') is missing; `count()` would skip those values.
    grouped = data.groupby(['country', 'sng_id']).size().reset_index(name='streams')
   
    # Sort the data (top songs per country)
    sorted_grouped = grouped.sort_values(['country', 'streams'], ascending=[True, False])

    # Create a dictionary to store the top_50 songs for each country
    top_50_songs = {}
    for country, group in sorted_grouped.groupby('country'):
        top_50_songs[country] = group.head(50)

    # Write the top 50 songs for each country to the output file
    with open(output_file, 'w') as file:
        for country, group in top_50_songs.items():
            songs_str = ','.join(f"{sng_id}:{streams}" for _, sng_id, streams in group.itertuples(index=False))
            file.write(f"{country}|{songs_str}\n")
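
To make the counting and ranking step concrete, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and only mirror the three-column layout of the logs; `head(1)` stands in for `head(50)` to keep the result small.

```python
import pandas as pd

# Toy data in the same three-column layout as the daily logs (made-up values)
toy = pd.DataFrame({
    "sng_id": [1, 1, 2, 3, 3, 3],
    "user_id": [10, 11, 10, 12, 13, None],  # one missing user_id on purpose
    "country": ["FR", "FR", "FR", "BR", "BR", "BR"],
})

# Count streams per (country, song); size() also counts the row with a missing user_id
counts = toy.groupby(["country", "sng_id"]).size().reset_index(name="streams")

# Sort by country, then by streams descending, and keep the top song per country
top = (counts.sort_values(["country", "streams"], ascending=[True, False])
             .groupby("country")
             .head(1))
print(top)
```

Song 3 in `BR` ends up with 3 streams even though one of its rows has no `user_id`, which is the behaviour the `size()` comment in `compute_top_songs` describes.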

In [5]:
def main(input_folder, output_folder, start_date):
    # Convert the start_date to a datetime object
    start_date = datetime.strptime(start_date, '%Y-%m-%d')
    date_range = [start_date - timedelta(days=i) for i in range(1, 8)]

    data_list = []  # List to store data for the past 7 days

    for day in date_range:
        log_file = f"listen-{day.strftime('%Y-%m-%d')}.txt"
        log_file_path = os.path.join(input_folder, log_file)

        if os.path.isfile(log_file_path):
            data = read_daily_log_file(log_file_path)
            data_list.append(data)
            print(f"Processed log file: {log_file}")
        else:
            print(f"Log file not found: {log_file}")
            

    if data_list:
        # Concatenate data from the past 7 days into a single DataFrame
        combined_data = pd.concat(data_list, ignore_index=True)
        output_file = os.path.join(output_folder, f"country_top50_{start_date.strftime('%Y-%m-%d')}.txt")
        compute_top_songs(combined_data, output_file)
        print(f"Top 50 Songs for the past 7 days saved in {output_file}")

> Note: the code above differs from the one in my final exercise. The final version works with the current date, whereas here I pass the start date manually so that we can check that the code works with the data I generated.
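
As a quick check of the date window used in `main`, the sketch below reproduces only the date arithmetic for a manually chosen start date. The variable names `start`, `window` and `names` are illustrative and not part of the notebook.

```python
from datetime import datetime, timedelta

# Same logic as in main(): the 7 days that precede the start date
start = datetime.strptime("2021-12-08", "%Y-%m-%d")
window = [start - timedelta(days=i) for i in range(1, 8)]

# File names that main() would look for in the input folder
names = [f"listen-{d.strftime('%Y-%m-%d')}.txt" for d in window]
print(names[0])   # most recent day in the window
print(names[-1])  # oldest day in the window
```

The start date itself is excluded: the window runs from the day before it back to seven days earlier.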

In [None]:
# Run the system once and measure its peak memory usage

if __name__ == "__main__":
    input_folder = "data"
    output_folder = "output_top-50-songs"
    start_date = '2021-12-08'

    # memory_usage() calls main() itself, so main() is not called separately
    # (calling it first would run the whole pipeline twice)
    memory_usage_result = memory_usage((main, (input_folder, output_folder, start_date)))
    print(f"Peak memory usage: {max(memory_usage_result)} MiB")

Processed log file: listen-2021-12-07.txt
Processed log file: listen-2021-12-06.txt
Processed log file: listen-2021-12-05.txt
Processed log file: listen-2021-12-04.txt
Processed log file: listen-2021-12-03.txt
Processed log file: listen-2021-12-02.txt
Processed log file: listen-2021-12-01.txt


> **Note**: Before I combined the data from the 7 files into one DataFrame, the code was working as expected. Now that the 7 files are combined into a single DataFrame, the code takes much longer to run (it has been running for more than 2 hours). Previously we were calculating the top 50 songs per country and per day, which was much lighter than the combined DataFrame the code is dealing with now. This is my solution: it takes some time but works. With more time, the next step would be to find a way to make the computation faster.
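
One possible direction, sketched here under assumptions and not measured on the real data, is to reduce each day to per-song counts first and then sum those counts across the 7 days, so that the raw rows of all days never need to sit in one DataFrame. The helper names `daily_counts` and `top_n_from_daily` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

def daily_counts(df):
    """Collapse one day's raw log into (country, sng_id, streams) counts."""
    return df.groupby(["country", "sng_id"]).size().reset_index(name="streams")

def top_n_from_daily(daily_frames, n=50):
    """Sum the per-day counts, then keep the n most-streamed songs per country."""
    total = (pd.concat(daily_frames, ignore_index=True)
               .groupby(["country", "sng_id"], as_index=False)["streams"]
               .sum())
    total = total.sort_values(["country", "streams"], ascending=[True, False])
    return total.groupby("country").head(n)

# Two made-up "days" in the same layout as the logs
day1 = pd.DataFrame({"sng_id": [1, 1, 2], "user_id": [7, 8, 7], "country": ["FR", "FR", "FR"]})
day2 = pd.DataFrame({"sng_id": [2, 2, 2], "user_id": [7, 8, 9], "country": ["FR", "FR", "FR"]})

result = top_n_from_daily([daily_counts(day1), daily_counts(day2)], n=1)
print(result)
```

Summing the full per-day counts gives the same totals as counting over the combined data. Merging only the per-day top 50 lists would not, because a song can miss every daily top 50 and still rank in the weekly one.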