# Generate Data

In this notebook we generate test data modeled on the provided sample: 7 files, one per day, stored in the `data` folder. These files let us test the system and check that it works as expected. The data follows the specification given in the exercise:

- Objective: "(...)suggest a system that computes on a daily basis, the top 50 songs the most listened in each country on the last 7 days."

- "(...)consider that we are receiving each day in a folder, a text file named listen-YYYYMMDD.log that contains the logs of all listening streams made on Deezer on that date."

- "There is a row per stream (1 listening)."

- "Each row is in the following format: sng_id|user_id|country"

- "sng_id: Unique song identifier, an integer. For your information, Deezer catalog contains more than 80M songs, a number that is constantly increasing."

- "user_id: Unique user identifier, an integer. Deezer currently has millions of users, a number that is constantly increasing."

- "country: 2 characters string upper case that matches the country ISO code (Ex: FR, GB, BE, ...). There are 249 existing country codes, this number rarely changes (only when there is massive geopolitical change)."

- "(...) we are considering that the daily number of streams is around 30M."

- "(...) the file contains occasionally corrupted rows"
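
The row format quoted above can be checked with a small parser. This is only a minimal sketch: `ROW_PATTERN` and `parse_row` are hypothetical names, not part of the exercise, and the sketch assumes that IDs are non-negative integers written with ASCII digits and that any row not matching the format counts as corrupted.

```python
import re

# Assumed format: sng_id|user_id|country, with a 2-letter upper-case country code.
ROW_PATTERN = re.compile(r"([0-9]+)\|([0-9]+)\|([A-Z]{2})")


def parse_row(line):
    """Return (sng_id, user_id, country), or None if the row is corrupted."""
    match = ROW_PATTERN.fullmatch(line.rstrip("\r\n"))
    if match is None:
        return None
    sng_id, user_id, country = match.groups()
    return int(sng_id), int(user_id), country
```

A reader of the daily file could call `parse_row` on every line and skip the rows for which it returns `None`.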

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

In [2]:
data_path = 'data/sample_listen-2021-12-01.txt'

df = pd.read_csv(data_path, sep='|', header=None, names=['sng_id', 'user_id', 'country'])

df

| | sng_id | user_id | country |
|---|---|---|---|
| 0 | 29569957 | 7788196 | BL |
| 1 | 27818575 | 9642856 | GN |
| 2 | 14684680 | 8316482 | BN |
| 3 | 21133485 | 6802606 | ST |
| 4 | 11751494 | 3748041 | PW |
| ... | ... | ... | ... |
| 95 | 8458924 | 5951329 | NG |
| 96 | 1770944 | 1833529 | CL |
| 97 | 22468228 | 3060444 | CW |
| 98 | 11716006 | 8067420 | BI |


In [3]:
# Number of days to generate data for (one file per day)
num_days = 7

# Countries to generate data for: the distinct country codes found in the provided sample
countries = df['country'].drop_duplicates().tolist()

In [4]:
# Generate one file per day

num_streams = 30_000_000  # about 30M streams per day
start_date = datetime(2021, 12, 1)  # date of the first generated file

for day in range(num_days):
    # Draw random IDs; the upper bound of np.random.randint is exclusive
    sng_ids = np.random.randint(1, 80_000_000, num_streams)  # catalog of about 80M songs
    user_ids = np.random.randint(1, 10_000_000, num_streams)  # millions of users
    countries_data = np.random.choice(countries, num_streams)

    # Assemble the generated rows
    df_generated = pd.DataFrame({'sng_id': sng_ids, 'user_id': user_ids, 'country': countries_data})

    # Save the generated data in the sng_id|user_id|country format
    date_str = (start_date + timedelta(days=day)).strftime('%Y-%m-%d')
    output_file = f'data/listen-{date_str}.txt'
    df_generated.to_csv(output_file, sep='|', index=False, header=False)

    print(f"Data for day {day+1} generated and saved in {output_file}")

print("Data generation complete.")

Data for day 1 generated and saved in data/listen-2021-12-01.txt
Data for day 2 generated and saved in data/listen-2021-12-02.txt
Data for day 3 generated and saved in data/listen-2021-12-03.txt
Data for day 4 generated and saved in data/listen-2021-12-04.txt
Data for day 5 generated and saved in data/listen-2021-12-05.txt
Data for day 6 generated and saved in data/listen-2021-12-06.txt
Data for day 7 generated and saved in data/listen-2021-12-07.txt
Data generation complete.
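
The generator above draws song IDs uniformly and never writes a malformed row. With about 30M streams spread over roughly 80M songs, most songs get at most a handful of listens, so a top 50 computed on this data is close to arbitrary, and the corrupted-row case from the exercise is never exercised. A possible variant, shown only as a sketch, skews song popularity and truncates a small share of rows. Everything here is an assumption rather than part of the exercise: the function name `generate_day`, the Zipf exponent `1.5`, the default `corrupt_rate`, and the choice of "missing country field" as the only kind of corruption.

```python
import numpy as np


def generate_day(num_streams, countries, num_songs=80_000_000,
                 num_users=10_000_000, corrupt_rate=0.001, seed=None):
    """Return a list of 'sng_id|user_id|country' rows, a few of them corrupted."""
    rng = np.random.default_rng(seed)

    # Zipf-distributed song IDs: low IDs are drawn far more often than high ones.
    # The modulo folds the rare, very large draws back into the catalog range.
    sng_ids = (rng.zipf(1.5, num_streams) - 1) % num_songs + 1
    user_ids = rng.integers(1, num_users + 1, num_streams)
    country_col = rng.choice(countries, num_streams)

    rows = [f"{s}|{u}|{c}" for s, u, c in zip(sng_ids, user_ids, country_col)]

    # Corrupt a random subset of rows by dropping the country field.
    for i in np.flatnonzero(rng.random(num_streams) < corrupt_rate):
        rows[i] = rows[i].rsplit("|", 1)[0]

    return rows
```

The returned rows can be written to a file with `"\n".join(rows)`. Building the rows in a Python list is slow for 30M streams, so this sketch is better suited to small test files than to full-size daily logs.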
