Generators #2

Open

kaburia opened this issue Aug 19, 2023 · 2 comments
kaburia commented Aug 19, 2023

Use generators in the multiple_measurements function to reduce memory usage.
Alternatively, find a more optimal approach than plain for loops.
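
A minimal sketch of what a generator-based version could look like, assuming retrieve_data returns a pandas DataFrame per station (the method name measurements_generator and the streaming-to-disk usage are illustrative assumptions, not part of the codebase):

import pandas as pd

def measurements_generator(self, stations_list, startDate, endDate, variables,
                           dataset='controlled', aggregate=True):
    # Yield one station's DataFrame at a time instead of holding them all in memory.
    for station in stations_list:
        df = self.retrieve_data(station, startDate, endDate, variables, dataset, aggregate)
        if isinstance(df, pd.DataFrame):
            yield station, df

Hypothetical usage, streaming each station's data straight to disk:

# for station, df in obj.measurements_generator(stations, '2023-01-01', '2023-01-31', ['pr']):
#     df.to_csv(f'{station}.csv')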

kaburia commented Oct 6, 2023

Resorted to multiprocessing:
# Module-level imports required by this method
import multiprocessing as mp

import pandas as pd
from tqdm import tqdm

def multiple_measurements(self, stations_list, csv_file, startDate, endDate, variables, dataset='controlled', aggregate=True):
    """
    Retrieves measurements for multiple stations and saves the aggregated data to a CSV file.

    Parameters:
    -----------
    - stations_list (list): A list of strings containing the names of the stations to retrieve data from.
    - csv_file (str): The name of the CSV file to save the data to (without the '.csv' extension).
    - startDate (str): The start date for the measurements, in the format 'yyyy-mm-dd'.
    - endDate (str): The end date for the measurements, in the format 'yyyy-mm-dd'.
    - variables (list): A list of strings containing the names of the variables to retrieve.
    - dataset (str): The name of the dataset to retrieve the data from. Default is 'controlled'.
    - aggregate (bool): Passed through to retrieve_data. Default is True.

    Returns:
    -----------
    - df (pandas.DataFrame): A DataFrame containing the aggregated data for all stations.

    Raises:
    -----------
    - ValueError: If stations_list is not a list.
    """
    if not isinstance(stations_list, list):
        raise ValueError('Pass in a list')

    pool = mp.Pool(processes=mp.cpu_count())  # Use all available CPU cores

    try:
        results = []
        # Dispatch one retrieve_data task per station and update the progress bar as each finishes.
        with tqdm(total=len(stations_list), desc='Retrieving data for stations') as pbar:
            for station in stations_list:
                results.append(pool.apply_async(
                    self.retrieve_data,
                    args=(station, startDate, endDate, variables, dataset, aggregate),
                    callback=lambda _: pbar.update(1)))

            pool.close()
            pool.join()

        # Collect each worker's result once and keep only those that came back as DataFrames.
        station_frames = [result.get() for result in results]
        df_stats = [frame for frame in station_frames if isinstance(frame, pd.DataFrame)]

        if len(df_stats) > 0:
            df = pd.concat(df_stats, axis=1)
            df.to_csv(f'{csv_file}.csv')
            return df
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        pool.terminate()
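
For reference, a hypothetical call (the client object, station IDs, and variable code below are illustrative, not taken from the codebase):

stations = ['TA00001', 'TA00002', 'TA00003']
df = client.multiple_measurements(stations, 'rainfall_2023', '2023-01-01', '2023-01-31',
                                  ['pr'], dataset='controlled', aggregate=True)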

kaburia commented Jul 20, 2024

The method is well optimized for requesting a single variable across a list of stations; it may not work as well when requesting multiple variables together with a list of multiple stations.
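
One possible workaround, not part of the current implementation: call multiple_measurements once per variable and concatenate the per-variable results column-wise (the helper name and the column-wise merge are assumptions for illustration):

import pandas as pd

def measurements_per_variable(client, stations, csv_file, startDate, endDate, variables):
    # Request one variable at a time, reusing the existing multi-station code path.
    frames = []
    for variable in variables:
        df = client.multiple_measurements(stations, f'{csv_file}_{variable}',
                                          startDate, endDate, [variable])
        if isinstance(df, pd.DataFrame):
            frames.append(df)
    # Merge the per-variable DataFrames side by side, if any were returned.
    if frames:
        return pd.concat(frames, axis=1)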
