This notebook gives you the context you need to use the [Beam API](https://docs.predicthq.com/resources/beam) effectively.

[Beam](https://www.predicthq.com/beam) is PredictHQ's automated correlation engine to accurately reveal the events that drive demand for your business.  It not only highlights the correlation between events and your demand data but also offers a decomposition of your demand data, aiding in enhancing the precision of your demand forecasts. For a deeper dive into Beam, explore the [Beam Overview](https://www.predicthq.com/support/beam-overview).

This notebook walks you through the following use cases supported by the Beam API:

1. [Upload demand data to Beam](#part-1-upload-demand-data-to-beam): Upload your data to Beam for analysis.
2. [Generate Beam analysis results](#part-2-generate-beam-analysis-results): Generate Beam analysis results between events and your demand data using Beam.
3. [Plot and interpret Beam output](#part-3-plot-and-interpret-beam-output): Visualize and interpret the output from Beam for better insights.
4. [Identify relevant features using Feature Importance](#part-4-identify-relevant-features-using-feature-importance): Explore Feature Importance to identify and extract relevant features for your forecasting model.
5. [(Optional) Create Groups for the Analyses](#part-5optional-create-groups-for-the-analyses): Organize your locations into groups using Beam to enhance your analysis by aggregating similar patterns.

Using Beam's decomposition together with Feature Importance can substantially improve the accuracy of your forecasts. If you have not been decomposing your data for forecasting purposes, Beam provides a robust framework for splitting your data into baseline demand and remainder components and for identifying the event features that matter most for an accurate forecasting model. The latest feature, grouping, helps you manage and analyze your data more efficiently by aggregating demand across multiple stores or locations.
For additional insights into the Beam API, please refer to our [technical documentation](https://docs.predicthq.com/api/beam).

If you are using Google Colab, uncomment the following code block. It downloads a repo that contains the sample data used in this notebook and installs the required packages.

In [1]:
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/beam-api-notebook
# !pip install pandas==2.0.2 requests==2.24.0 plotly==5.3.0 predicthq==2.0.6 numpy==1.24.3

Alternatively, if you're running this notebook on a local machine, you can set up a Python environment using the requirements.txt file shared alongside the notebook.

You can install the necessary requirements by executing the command below.

In [2]:
# !pip install -r requirements.txt

In [2]:
import pandas as pd 
import os 
import requests
import plotly.graph_objects as go
from predicthq import Client
import collections
import numpy as np
from datetime import datetime, date, timedelta
from itertools import chain

## Part 1. Upload demand data to Beam

Beam works by creating an analysis for each location.  A location typically refers to the physical premise of a business like a store or a hotel. For instance, in a restaurant chain, each restaurant would represent a distinct location. See more details on how to prepare demand data [here](https://www.predicthq.com/support/uploading-your-demand-data-to-beam).

When running Beam, you execute it over a set of locations and create a Beam analysis for each location. Then, for each location, you upload your historical demand data, such as historical retail sales data or historical hotel room booking data. This is the demand data that Beam correlates with event data. For each location, you need the latitude and longitude. To search for events that can impact your business, Beam also needs a suitable radius around the business location. Our [Suggested Radius API](https://docs.predicthq.com/resources/suggested-radius) will calculate the radius for you.

The example code below loops over a list of input locations, creates a Beam analysis_id for each location, and uploads demand data for each location. If you are adapting this for your context, you may be loading demand data from a database or an API to upload into Beam.

You can use this approach to upload data for hundreds or thousands of locations. 

### 1.1 Get Suggested Radius for all locations

An Access Token is required to query the API.

The following link will guide you through creating an account and an access token. Please ensure that your API token is updated to access the latest features of Beam.
- https://docs.predicthq.com/guides/quickstart/

In [1]:
# Replace the placeholder with your own access token; avoid committing real tokens to shared notebooks
ACCESS_TOKEN = 'REPLACE_WITH_ACCESS_TOKEN'

In [3]:
# Read in a CSV file with latitude, longitude and industry for each location
location_sample = pd.read_csv('location_sample.csv')
# Set urls for API requests
SUGGESTED_RADIUS_URL="https://api.predicthq.com/v1/suggested-radius/"
BEAM_URL = "https://api.predicthq.com/v1/beam"
FEATURES_API_URL = "https://api.predicthq.com/v1/features"

In [4]:
location_sample

Unnamed: 0,location,lat,lon,timezone,industry
0,store_0,40.751583,-73.981559,America/New_York,restaurants
1,store_1,40.745592,-73.994521,America/New_York,restaurants
2,store_2,33.971192,-118.164362,America/Los_Angeles,restaurants


In [5]:
def get_suggested_radius(lat_lon, industry, radius_unit):
    """
    Returns the suggested radius for a given latitude and longitude.

    Args:
        lat_lon (str): The latitude and longitude of the location in the format "lat,lon".
        industry (str): The industry of interest that the radius will be calculated for. 
        radius_unit (str): Unit in which the suggested radius will be returned.
        
    Returns:
        float: The suggested radius in your preferred unit.
    """
    # Set the url for the API call
    url = SUGGESTED_RADIUS_URL
    # Set the query parameters for the API call
    params = {
        "location.origin": lat_lon,
        "industry": industry,
        "radius_unit": radius_unit,
    }
    # Set the headers for the API call (including the access token)
    headers = {
        "Authorization": "Bearer " + ACCESS_TOKEN,
        "Accept": "application/json",
    }
    # Make the API call and get the JSON response
    response = requests.get(url, params=params, headers=headers)
    if response.status_code == 200:
        # Extract the radius from the JSON response and return it
        return response.json()['radius']
    # Report the failure instead of silently returning None
    print(f"Error getting suggested radius for {lat_lon}: HTTP {response.status_code}")
    return None

In [6]:
rads = []
# Please specify your preferred unit here
radius_unit = 'mi' # radius unit in mile
# Loop through each row in the location_sample dataframe to generate the suggested radius for each location
for index, location in location_sample.iterrows():
    # Call the get_suggested_radius function to get the suggested radius for each location
    rad = get_suggested_radius(f"{location['lat']},{location['lon']}", location['industry'], radius_unit)
    # Append the suggested radius to the list
    rads.append(rad)

# Add a new column to the location_sample dataframe to store the suggested radii
location_sample["radius"] = rads
# Add a new column to the location_sample dataframe to store the unit of the suggested radii
location_sample["unit"] = radius_unit

In [7]:
location_sample

Unnamed: 0,location,lat,lon,timezone,industry,radius,unit
0,store_0,40.751583,-73.981559,America/New_York,restaurants,1.21,mi
1,store_1,40.745592,-73.994521,America/New_York,restaurants,1.27,mi
2,store_2,33.971192,-118.164362,America/Los_Angeles,restaurants,1.56,mi


### 1.2 Get an analysis_id for each analysis

To set parameters for the Beam API, you can refer to our documentation available at [Beam API](https://docs.predicthq.com/api/beam/create-an-analysis). It provides detailed information on how to configure the API according to your requirements.

In [8]:
# Please specify a rank threshold for the Beam analysis. It will also be used to extract event features from Features API
RANK_THRESHOLD = 50

In [9]:
def generate_analysis_ids(location, lat, lon, radius, radius_unit, timezone, rank_threshold = RANK_THRESHOLD):
    """
    Generates an analysis ID for a given location, latitude, longitude, radius and its unit.

    Args:
        location (str): The unique ID of the location to generate an analysis for.
        lat (float): The latitude of the location.
        lon (float): The longitude of the location.
        radius (float): The radius to use for the analysis.
        radius_unit (str): The unit of measurement used for radius.
        timezone (str): The timezone of the location, e.g. "America/New_York".
        rank_threshold (int): The minimum rank threshold for the analysis.
    Returns:
        str: The analysis ID generated by the API.
    """
    # Set the URL and JSON payload for the API call
    url = f"{BEAM_URL}/analyses"
    json = {
        "name": f"{location}_analysis",
        "location": {
            "geopoint": {
                "lat": lat,
                "lon": lon,
            },
            "radius": radius,
            "unit": radius_unit,
        },
        "rank": {
            "type": "phq",
            "levels": {
                "phq": {
                    "min": rank_threshold,
                },
            },
        },
        "tz": timezone,
    }
    # Make a POST request to the API to generate the analysis ID
    response = requests.post(
            url = url,
            headers={
                "Authorization": "Bearer " + ACCESS_TOKEN,
                "Accept": "application/json"},
            json = json)
    # Extract the analysis ID from the JSON response and return it if the status code is 201 (Created)
    if response.status_code == 201:
        return response.json()['analysis_id']
    else:
        print(response.status_code)
        print(f"Error generating analysis ID for {location}")

In [10]:
analysis_ids = []
# Loop through each row in the locations dataframe
for index, location in location_sample.iterrows():
    # Generate a Beam analysis id for each location
    analysis_id = generate_analysis_ids(location['location'], str(location['lat']), str(location['lon']), location['radius'], location['unit'], location['timezone'])
    # Add the analysis ids to the list
    analysis_ids.append(analysis_id)
location_sample['analysis_id'] = analysis_ids

In [11]:
location_sample

Unnamed: 0,location,lat,lon,timezone,industry,radius,unit,analysis_id
0,store_0,40.751583,-73.981559,America/New_York,restaurants,1.21,mi,jHKkjAQVIm4
1,store_1,40.745592,-73.994521,America/New_York,restaurants,1.27,mi,CNsDRuJZAeM
2,store_2,33.971192,-118.164362,America/Los_Angeles,restaurants,1.56,mi,0bhDu9k33SQ


### 1.3 Upload demand data for each analysis_id

The example below reads demand data for multiple business locations from a single CSV file and uploads the demand data for each location to its corresponding analysis_id.

When you use this in the context of your business, you could read the demand data from a database, an API or an internal product, for example. Your demand data needs to be aggregated to daily values for each location, where `date` is in YYYY-MM-DD format and `demand` is a numeric value. Please ensure that your demand data file contains both columns: `date` and `demand`. See the example files for how the data should be formatted, and see Upload Demand Data to an Analysis in the [Beam API](https://docs.predicthq.com/resources/beam) documentation for more details.

Please also ensure that `location` is a unique key in both the demand data and the location data.
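
If your raw data is more granular than daily, it has to be aggregated first. The snippet below is a minimal sketch of that step, assuming a hypothetical raw table with `timestamp` and `sales` columns; these names, the sample values and the `to_daily_demand` helper are illustrative and not part of the sample files.

```python
import pandas as pd

# Hypothetical raw records: several transactions per day for one location
raw = pd.DataFrame({
    "location": ["store_0"] * 4,
    "timestamp": ["2017-01-02 09:15", "2017-01-02 18:40",
                  "2017-01-03 12:05", "2017-01-03 20:30"],
    "sales": [120.0, 80.5, 200.0, 50.0],
})

def to_daily_demand(df, time_col="timestamp", value_col="sales"):
    """Aggregate raw records to one row per location and day,
    with the `date` (YYYY-MM-DD) and `demand` columns Beam expects."""
    out = df.copy()
    out["date"] = pd.to_datetime(out[time_col]).dt.strftime("%Y-%m-%d")
    daily = (
        out.groupby(["location", "date"], as_index=False)[value_col]
        .sum()
        .rename(columns={value_col: "demand"})
    )
    return daily

daily_demand = to_daily_demand(raw)
print(daily_demand)
```

Summing is only one possible aggregation; for a demand measure such as occupancy, a mean or maximum per day may be more appropriate.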

In [12]:
# Read in the data and loop through the locations_ids
demand_data_sample = pd.read_csv('demand_data_sample.csv')
demand_data_by_location = demand_data_sample.groupby('location')

In [13]:
def upload_demand_data(analysis_id, demand_data_for_location):
    """
    Uploads demand data for a specific analysis_id to Beam.

    Args:
        analysis_id (str): The unique ID of a Beam analysis.
        demand_data_for_location (DataFrame): The demand data for the location.

    Returns:
        str: A message indicating the result of the data upload process.
    """
    try:   
        # Only get date and demand from demand_data_for_location
        individual_demand = demand_data_for_location[['date', 'demand']]
        # Convert individual_demand to json format
        individual_demand_json = individual_demand.to_json(orient='records')
        
        # Upload the individual_demand data to Beam
        response = requests.post(
            url=f"{BEAM_URL}/analyses/{analysis_id}/sink",
            headers={
                "Authorization": "Bearer " + ACCESS_TOKEN,
                "Content-Type": "application/json"
            },
            data=individual_demand_json
        )
        
        # Check the status code of the response to see if the request has been accepted
        if response.status_code == 202:
            return 'The request has been accepted for processing.'
        else:
            return f"Failed to upload demand data for analysis_id {analysis_id}. Error: {response.content}"
    except Exception as e:
        return f"An error occurred while uploading demand data for analysis_id {analysis_id}. Error: {str(e)}"

In [14]:
# upload demand data for each location
for location, demand_data_for_location in demand_data_by_location:
    # Get the corresponding analysis_id for each location
    analysis_id = location_sample[location_sample['location'] == location]['analysis_id'].values[0]
    print(upload_demand_data(analysis_id, demand_data_for_location))

The request has been accepted for processing.
The request has been accepted for processing.
The request has been accepted for processing.


### 1.4 Check the readiness_status

Now, we can make a request to the Beam API to retrieve the Beam analysis output for each location. The "readiness_status" indicates whether the data has been successfully uploaded and processed. Please ensure that the readiness_status is "ready" before you continue.

In [15]:
def check_analysis_readiness(analysis_id):
    """
    Checks the readiness status of an analysis for a given analysis_id.
    Args:
        analysis_id (str): The analysis_id to check readiness status for.
    Returns:
        str: A message indicating the analysis's readiness status or an error message.
    """
    try:
        response = requests.get(
            url=f"{BEAM_URL}/analyses/{analysis_id}",
            headers={
                "Authorization": "Bearer " + ACCESS_TOKEN,
                "Accept": "application/json"
            }
        )
        if response.status_code == 200:
            analysis_name = response.json().get('name')
            readiness_status = response.json().get('readiness_status')
            status_message = f"{analysis_name} (ID: {analysis_id}): {readiness_status}"
            if readiness_status != "ready":
                status_message += f"\nBeam analysis {analysis_name} (ID: {analysis_id}) is not ready yet, please wait until the readiness_status is 'ready'."
            return status_message
        else:
            return f"Error checking readiness status for analysis ID {analysis_id}: HTTP {response.status_code}"
    except Exception as e:
        return f"An error occurred: {str(e)}"

In [17]:
# Please ensure that the readiness_status is "ready" before you continue, this might take a couple of minutes. 
# Run this cell to check the readiness status of each analysis.
for location, demand_data_for_location in demand_data_by_location:
    # Get the corresponding analysis_id for each store
    analysis_id = location_sample[location_sample['location'] == location]['analysis_id'].values[0] 
    status_message = check_analysis_readiness(analysis_id)
    print(status_message)

store_0_analysis (ID: jHKkjAQVIm4): ready
store_1_analysis (ID: CNsDRuJZAeM): ready
store_2_analysis (ID: 0bhDu9k33SQ): ready
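
Rather than re-running the cell by hand, the check can be wrapped in a polling loop. The following is a rough sketch only: `wait_until_ready` is a hypothetical helper, `get_status` is assumed to be any function that returns the `readiness_status` string for an analysis, and the `"pending"` value used by the stand-in function is invented for illustration.

```python
import time

def wait_until_ready(get_status, analysis_id, poll_seconds=30, max_attempts=20):
    """Poll get_status(analysis_id) until it returns 'ready' or attempts run out.

    Returns True if the analysis became ready, False otherwise.
    """
    for attempt in range(max_attempts):
        if get_status(analysis_id) == "ready":
            return True
        # Sleep between attempts, but not after the final one
        if attempt < max_attempts - 1:
            time.sleep(poll_seconds)
    return False

# Stand-in status function for illustration: not ready twice, then ready
_fake_responses = iter(["pending", "pending", "ready"])

def fake_status(analysis_id):
    return next(_fake_responses)

is_ready = wait_until_ready(fake_status, "example_id", poll_seconds=0)
print(is_ready)
```

In this notebook, `get_status` could be a small function that calls the same endpoint as `check_analysis_readiness` and returns only `response.json().get('readiness_status')`.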


## Part 2.  Generate Beam analysis results

Correlation results show your decomposed demand data as well as the event impact data for each location. This is the same data that you can see in the Beam UI - see [Viewing the Time Series Impact Analysis](https://www.predicthq.com/support/viewing-the-time-series-impact-analysis). Correlation shows up on dates where large remainder values coincide with significant event impact values.

In [18]:
def generate_beam_result(analysis_id, demand_data_for_location):
    """
    Correlates demand data using analysis_id and generates Beam decomposition outputs.

    Args:
        analysis_id (str): The unique ID of a Beam analysis.
        demand_data_for_location (DataFrame): The demand data for the location.
    Returns:
        DataFrame/str: A DataFrame with Beam results for the location or an error message.
    """
    try:
        
        # Only get date and demand from demand_data_for_location
        individual_demand = demand_data_for_location[['date', 'demand']]
        
        # Extract min and max dates for each store
        min_date = individual_demand['date'].min()
        max_date = individual_demand['date'].max()
        
        # Set parameters for Beam API request
        url = f"{BEAM_URL}/analyses/{analysis_id}/correlate"
        headers = {
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Accept": "application/json"
        }
        params = {
            "date.gte": min_date,
            "date.lte": max_date,
            "limit": 2000
        }

        response = requests.get(url=url, headers=headers, params=params)
        if response.status_code == 200:
            df = pd.DataFrame(response.json()["dates"]).loc[:, ["date", "actual_demand", "baseline_demand", "remainder", "phq_attendance_sum"]]
            # Add analysis_id column to the DataFrame
            df['analysis_id'] = analysis_id
            # reorder the columns
            df = df[['analysis_id', 'date', 'actual_demand', 'baseline_demand', 'remainder', "phq_attendance_sum"]]
            return df
        else:
            return f"Error processing request for analysis_id {analysis_id}: HTTP {response.status_code}"

    except Exception as e:
        return f"An error occurred while generating beam data output for analysis_id {analysis_id}. Error: {str(e)}"

In [19]:
# Generate Beam results for each location
all_locations_beam = []
for location, demand_data_for_location in demand_data_by_location:
    analysis_id = location_sample[location_sample['location'] == location]['analysis_id'].values[0]
    beam_output = generate_beam_result(analysis_id, demand_data_for_location)
    # generate_beam_result returns an error string on failure, so only keep DataFrames
    if isinstance(beam_output, pd.DataFrame):
        all_locations_beam.append(beam_output)
    else:
        print(beam_output)
# Concatenate all the dataframes
all_locations_beam_df = pd.concat(all_locations_beam)

In [20]:
all_locations_beam_df

Unnamed: 0,analysis_id,date,actual_demand,baseline_demand,remainder,phq_attendance_sum
0,jHKkjAQVIm4,2017-01-02,3350.804294,6688.798587,-3337.994292,60604
1,jHKkjAQVIm4,2017-01-03,7974.534129,7720.345753,254.188375,61907
2,jHKkjAQVIm4,2017-01-04,7274.021429,8003.666169,-729.644740,36210
3,jHKkjAQVIm4,2017-01-05,7504.479021,8433.525995,-929.046974,14397
4,jHKkjAQVIm4,2017-01-06,7091.141396,6512.486925,578.654472,36590
...,...,...,...,...,...,...
693,0bhDu9k33SQ,2018-11-26,2567.725141,2543.414081,24.311060,0
694,0bhDu9k33SQ,2018-11-27,2480.132093,2368.503978,111.628114,0
695,0bhDu9k33SQ,2018-11-28,2716.229757,2520.617594,195.612163,0
696,0bhDu9k33SQ,2018-11-29,3010.198718,2595.433854,414.764864,0


## Part 3. Plot and interpret Beam output

Beam's decomposition process splits the demand data into baseline demand and remainder components. Additionally, Beam correlates your demand with PHQ events, giving you customized insight into the relationships between events and your demand.

Now that we have the Beam output, we can take the results for one analysis as an example and extract them into their own dataframe.

In [21]:
# take the first analysis_id as an example
first_analysis_id = location_sample['analysis_id'].iloc[0]
beam_analysis = all_locations_beam_df[all_locations_beam_df['analysis_id'] == first_analysis_id]

Please note the demand time series is decomposed into baseline demand time series and remainder time series. 

`Baseline demand`:  The baseline demand time series represents the estimated demand and contains information about trends and seasonality within the demand time series.

`PHQ attendance sum`: For events belonging to attended categories, the corresponding daily total attendance represents the event impact.

`Remainder`: The remainder is the difference between the demand time series and the baseline demand time series.
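
As a quick sanity check of that definition, the remainder can be recomputed as actual demand minus baseline demand. The sketch below uses the first three store_0 rows from the Part 2 output; the comparison tolerance is an assumption chosen to absorb the rounding of the displayed values.

```python
import pandas as pd

# First three rows of the store_0 output shown in Part 2
sample = pd.DataFrame({
    "date": ["2017-01-02", "2017-01-03", "2017-01-04"],
    "actual_demand": [3350.804294, 7974.534129, 7274.021429],
    "baseline_demand": [6688.798587, 7720.345753, 8003.666169],
    "remainder": [-3337.994292, 254.188375, -729.644740],
})

# remainder = actual_demand - baseline_demand
sample["recomputed_remainder"] = sample["actual_demand"] - sample["baseline_demand"]

# The displayed values are rounded to six decimals, so compare with a small tolerance
max_abs_diff = (sample["recomputed_remainder"] - sample["remainder"]).abs().max()
print(max_abs_diff < 1e-5)
```

A negative remainder, as on 2017-01-02, means actual demand fell below the baseline for that day.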

In [22]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(x=beam_analysis.date, y=beam_analysis.actual_demand, name='actual_demand',mode='lines')
)
fig.add_trace(
    go.Scatter(x=beam_analysis.date, y=beam_analysis.baseline_demand, name="baseline_demand",mode='lines')
)

fig.add_trace(go.Scatter(
    x=beam_analysis.date, y=beam_analysis.phq_attendance_sum, name="phq_attendance_sum",
    yaxis="y2",mode='lines'
))

# Create axis objects
fig.update_layout(
    xaxis=dict(
        domain=[0.0, 0.9]
    ),
    yaxis=dict(
        title="demand"
    )
    ,
    yaxis2=dict(
        title="impact",
        anchor="x",
        overlaying="y",
        side="right",
    )
)
fig.show()

## Part 4.  Identify relevant features using Feature Importance & Features API

Feature Importance provides feature importance results for an existing analysis, and returns a list of feature groups with associated Features API features and group p-values.

These values represent each group of features' statistical significance when it comes to impacting observable incremental/decremental changes in demand.

You might need to refresh an analysis to generate insights on the most relevant features for a location if the analysis was created before October 2023.

In [None]:
# Uncomment and run this if you created your analysis before October 2023
# for index, location in location_sample.iterrows():
#   response = requests.post(
#       f"{BEAM_URL}/analyses/{location['analysis_id']}/refresh",
#       headers={"Authorization": "Bearer " + ACCESS_TOKEN},
#   )
#   if response.status_code == 202:
#     print(f"The analysis {location['analysis_id']} has been refreshed.")

### 4.1 Get Feature Importance

In [23]:
def fetch_feature_importance_for_analysis(analysis_id):
    """
    Fetches feature importance data for a given analysis ID from the Beam API.

    Args:
        analysis_id (str): The unique ID of a Beam analysis.  
    Returns:
        list or None: The feature importance data as a list if the request is successful, None otherwise.
    """
    # Set the URL for the API call
    url = f"{BEAM_URL}/analyses/{analysis_id}/feature-importance"
    response = requests.get(
        url=url,
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Accept": "application/json"
        })
    if response.status_code == 200:
        print('The request has been accepted for processing.')
        feature_importance = response.json().get('feature_importance', [])
        return feature_importance
    else:
        print(f"Error fetching feature importance for analysis ID {analysis_id}: HTTP {response.status_code}")
        return None

In [24]:
# Fetch feature importance for each location
feature_importance_list = []
for index, location in location_sample.iterrows():
    feature_importance = fetch_feature_importance_for_analysis(location['analysis_id'])
    feature_importance_list.append(feature_importance)
location_sample['feature_importance_list'] = feature_importance_list

The request has been accepted for processing.
The request has been accepted for processing.
The request has been accepted for processing.


In [25]:
location_sample

Unnamed: 0,location,lat,lon,timezone,industry,radius,unit,analysis_id,feature_importance_list
0,store_0,40.751583,-73.981559,America/New_York,restaurants,1.21,mi,jHKkjAQVIm4,"[{'feature_group': 'performing-arts', 'feature..."
1,store_1,40.745592,-73.994521,America/New_York,restaurants,1.27,mi,CNsDRuJZAeM,"[{'feature_group': 'public-holidays', 'feature..."
2,store_2,33.971192,-118.164362,America/Los_Angeles,restaurants,1.56,mi,0bhDu9k33SQ,"[{'feature_group': 'public-holidays', 'feature..."


### 4.2 Feature Importance interpretation

`feature_group` is the name of the group, which typically refers to an event category, such as concerts, conferences, etc.

`features` is the names of the features in the feature group. These refer directly to features available in Features API.

`p_value`: The p-value associated with this feature group for this analysis. It indicates how important the features in the group are in terms of demand. The lower the p-value, the more important the feature group is.
- p-value < 0.05: The impact is very high.
- 0.05 <= p-value < 0.075: The impact is high.
- 0.075 <= p-value < 0.1: The impact is moderate.
- p-value >= 0.1: There is no impact.

`important`: A true or false value indicating whether the feature group is considered important for this analysis. Equivalent to p_value < 0.1. We suggest using this value to determine whether or not to include this group of features in your modeling.
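
The thresholds above can be expressed as a small lookup. This is only an illustrative sketch: `impact_label` and `is_important` are hypothetical helpers, not part of the Beam API, and they assume the bands are 0.05, 0.075 and 0.1.

```python
def impact_label(p_value):
    """Map a feature-group p-value to the impact bands described above."""
    if p_value < 0.05:
        return "very high"
    if p_value < 0.075:
        return "high"
    if p_value < 0.1:
        return "moderate"
    return "none"

def is_important(p_value):
    """Mirror the `important` flag, which is equivalent to p_value < 0.1."""
    return p_value < 0.1

for p in [0.01, 0.06, 0.09, 0.3]:
    print(p, impact_label(p), is_important(p))
```

In practice the `important` flag returned by the API can be used directly; the helper is mainly useful if you want to rank or filter groups by band.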

In [26]:
# Add a column to store categories where 'important' is True
location_sample['important_categories'] = location_sample['feature_importance_list'].apply(lambda x: [item['feature_group'] for item in x if item['important']])

# Add a column to store the important features (flattened across all important feature groups)
location_sample['important_features'] = location_sample['feature_importance_list'].apply(
    lambda x: list(chain.from_iterable([item['features'] for item in x if item['important']]))
)

In [27]:
location_sample

Unnamed: 0,location,lat,lon,timezone,industry,radius,unit,analysis_id,feature_importance_list,important_categories,important_features
0,store_0,40.751583,-73.981559,America/New_York,restaurants,1.21,mi,jHKkjAQVIm4,"[{'feature_group': 'performing-arts', 'feature...","[performing-arts, public-holidays, expos, spor...","[phq_attendance_performing_arts, phq_rank_publ..."
1,store_1,40.745592,-73.994521,America/New_York,restaurants,1.27,mi,CNsDRuJZAeM,"[{'feature_group': 'public-holidays', 'feature...","[public-holidays, school-holidays, sports, per...","[phq_rank_public_holidays, phq_attendance_scho..."
2,store_2,33.971192,-118.164362,America/Los_Angeles,restaurants,1.56,mi,0bhDu9k33SQ,"[{'feature_group': 'public-holidays', 'feature...","[public-holidays, school-holidays, academic, f...","[phq_rank_public_holidays, phq_attendance_scho..."


### 4.3 Get important features from Features API
Once you have identified the important features for each location, the next step is to extract these features. This section of the notebook provides step-by-step instructions on how to do it. We also have a [Feature Engineering Guide](https://github.com/predicthq/phq-data-science-docs/blob/master/feature-engineering-guide/feature_engineering_guide.ipynb) notebook that guides you in creating events-based machine learning features. The notebook also provides guidance on selecting varying radii for different features.

In [28]:
def get_analysis(analysis_id):
    """
    Fetches analysis data for a given analysis ID from Beam.
    Parameters:
    - analysis_id (str): The unique identifier for the analysis to retrieve.
    Returns:
    - dict: A dictionary containing the response data from the server in JSON format.
    """
    response = requests.get(
        url=f"{BEAM_URL}/analyses/{analysis_id}",
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Accept": "application/json",
        },
    )

    return response.json()

In [29]:
DATE_FORMAT = "%Y-%m-%d"

phq = Client(access_token=ACCESS_TOKEN)

def get_date_groups(start, end):
    """
    Features API allows a range of up to 90 days, so we have to do several requests
    """

    def _split_dates(s, e):
        capacity = timedelta(days=90)
        interval = 1 + int((e - s) / capacity)
        for i in range(interval):
            yield s + capacity * i
        yield e

    dates = list(_split_dates(start, end))
    for i, (d1, d2) in enumerate(zip(dates, dates[1:])):
        if d2 != dates[-1]:
            d2 -= timedelta(days=1)
        yield d1.strftime(DATE_FORMAT), d2.strftime(DATE_FORMAT)
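
To make the windowing idea concrete, here is a simplified standalone sketch of splitting a date range into consecutive windows of at most 90 days. It is an alternative formulation for illustration, not the `get_date_groups` function above; `split_into_windows` and its `max_days` parameter are hypothetical names, and it assumes the limit counts both end dates.

```python
from datetime import date, timedelta

def split_into_windows(start, end, max_days=90):
    """Yield (window_start, window_end) pairs that cover start..end inclusive,
    each spanning at most max_days calendar days."""
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=max_days - 1), end)
        yield current, window_end
        # The next window starts the day after this one ends
        current = window_end + timedelta(days=1)

windows = list(split_into_windows(date(2017, 1, 2), date(2017, 7, 15)))
for w_start, w_end in windows:
    print(w_start.isoformat(), w_end.isoformat())
```

Each window can then be sent as one Features API request and the results concatenated, which is what the loops in the following functions do with the output of `get_date_groups`.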

#### 4.3.1  Attendance-based features
Because the API call for various types of events differs slightly, we utilize different functions for extracting event features.

In [30]:
def get_important_features_api_attended_data(lat, lon, start, end, radius, unit, important_categories_attended, rank_threshold = RANK_THRESHOLD):
    """Get attendance based features using features API"""

    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": f"{radius}{unit}"},
        }

        query.update({f"{f}__stats": ["sum"] for f in important_categories_attended})
        query.update(
            {f"{f}__phq_rank": {"gte": rank_threshold} for f in important_categories_attended}
        )

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                elif k in important_categories_attended:
                    record[k] = v.get("stats", {}).get("sum")
            result.append(record)

    return result

#### 4.3.2  Rank-based features

In [31]:
def get_important_features_api_rank_data(lat, lon, start, end, radius, unit, important_categories_rank):
    """Get rank based features using features API"""

    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": f"{radius}{unit}"},
        }

        query.update({f"{f}": True for f in important_categories_rank})

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                elif k in important_categories_rank:
                    record[f"{k}"] = sum(
                        [
                            int(rank_level) * int(level_count)
                            for rank_level, level_count in v.get("rank_levels", {}).items()
                        ]
                    )

            result.append(record)

    return result
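
The rank-based value computed above is a weighted sum: each rank level is multiplied by the number of events at that level. A tiny worked example, assuming a `rank_levels` mapping of level to event count (the counts here are invented for illustration):

```python
# Illustrative rank_levels payload for one date (values invented for this example):
# keys are rank levels, values are the number of events at that level
rank_levels = {"1": 0, "2": 1, "3": 0, "4": 2, "5": 1}

# Same formula as in get_important_features_api_rank_data
weighted_rank = sum(int(level) * int(count) for level, count in rank_levels.items())
print(weighted_rank)  # 2*1 + 4*2 + 5*1 = 15
```

Weighting by level means that a few high-rank events contribute more to the feature than the same number of low-rank events.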

#### 4.3.3  Impact-based features
**Severe Weather features**

Please note that impact-based features (severe weather features) are for the retail industry only. The severe weather features use demand impact patterns. Demand impact patterns calculate the impact duration of a severe weather event and are based on industry-specific information. Our severe weather features are currently designed and tested on data for the retail segment only. If your business is in an industry segment other than retail (e.g. accommodation or travel), then these features may not work for you or may be less effective.

Severe weather events are represented using polygons rather than points to depict the geographical area affected, necessitating a distinct radius setting. 
Learn more about [polygons](https://www.predicthq.com/features/polygons). 
For additional insights on configuring the radius setting, refer to our [Feature Engineering Guide](https://github.com/predicthq/phq-data-science-docs/blob/master/feature-engineering-guide/feature_engineering_guide.ipynb).

In [32]:
def get_important_features_api_impact_data(lat, lon, start, end, important_categories_impact, rank_threshold=RANK_THRESHOLD):
    """Get impact-based features using the Features API."""
    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": '3ft'},
        }

        query.update({f"{f}__stats": ["max"] for f in important_categories_impact})
        query.update(
            {f"{f}__phq_rank": {"gte": rank_threshold} for f in important_categories_impact}
        )

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                else:
                    record[k] = v.get("stats", {}).get("max")

            result.append(record)

    return result
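
To make the request parameters easier to inspect, here is a minimal sketch that only assembles the query dictionary built in the function above, without calling the API. `build_impact_query` is a hypothetical helper, `phq_impact_example_feature` is a placeholder feature name, and the coordinates, dates and rank threshold are illustrative values.

```python
def build_impact_query(lat, lon, start, end, impact_features, rank_threshold, radius="3ft"):
    """Assemble keyword arguments for an impact-based Features API request."""
    query = {
        "active__gte": start,
        "active__lte": end,
        "location__geo": {"lat": lat, "lon": lon, "radius": radius},
    }
    for name in impact_features:
        # Request the daily maximum impact and filter events by PHQ rank.
        query[f"{name}__stats"] = ["max"]
        query[f"{name}__phq_rank"] = {"gte": rank_threshold}
    return query


example_query = build_impact_query(
    lat=40.0,
    lon=-74.0,
    start="2017-01-01",
    end="2017-03-31",
    impact_features=["phq_impact_example_feature"],  # placeholder name
    rank_threshold=30,  # illustrative threshold
)
for key, value in example_query.items():
    print(key, "=", value)
```

Printing the dictionary before passing it to `phq.features.obtain_features(**query)` is a simple way to confirm which statistics and rank filters are being requested.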

#### 4.3.4  Combine demand and Attendance-based, Rank-based and Impact-based features

##### 4.3.4.1  Combine Attendance-based, Rank-based and Impact-based features

We can define a function that consolidates the attendance-based, rank-based, and impact-based features for each Beam analysis and aggregates all the event features into a DataFrame.

In [33]:
def get_features_api_data(analysis_id, features):
    """
    Fetches the important event features for a given Beam analysis from the Features API.
    Args:
        analysis_id (str): The unique ID of the Beam analysis.
        features (list): The list of important features to fetch data for.
    Returns:
        dataframe: a dataframe that stores all relevant important event features,
            or None if no attendance-, rank- or impact-based features were requested.
    """
    # Get the analysis details (location, radius, rank threshold and date range)
    analysis = get_analysis(analysis_id)
    lat = analysis["location"]["geopoint"]["lat"]
    lon = analysis["location"]["geopoint"]["lon"]
    radius = analysis["location"]["radius"]
    radius_unit = analysis["location"]["unit"]
    min_phq_rank = analysis["rank"]["levels"]["phq"]["min"]
    start = analysis["readiness_checks"]["date_range"]["start"]
    end = analysis["readiness_checks"]["date_range"]["end"]

    categories_attended_important_features = [feature for feature in features if "attendance" in feature]
    categories_rank_important_features = [feature for feature in features if "rank" in feature]
    categories_impact_important_features = [feature for feature in features if "impact" in feature]

    attended_df = None
    rank_df = None
    impact_df = None
    if categories_attended_important_features:
        attended_data = get_important_features_api_attended_data(
            lat, lon, start, end, radius, radius_unit, categories_attended_important_features, min_phq_rank
        )
        attended_df = pd.DataFrame(attended_data)
    if categories_rank_important_features:
        rank_data = get_important_features_api_rank_data(
            lat, lon, start, end, radius, radius_unit, categories_rank_important_features
        )
        rank_df = pd.DataFrame(rank_data)

    if categories_impact_important_features:
        impact_data = get_important_features_api_impact_data(
            lat, lon, start, end, categories_impact_important_features, min_phq_rank
        )
        impact_df = pd.DataFrame(impact_data)
        
    # Outer-merge whichever feature dataframes were built on the shared "date" column
    df_all_features = None
    for df in [attended_df, rank_df, impact_df]:
        if df is None:
            continue
        if df_all_features is None:
            df_all_features = df
        else:
            df_all_features = df_all_features.merge(df, on="date", how="outer")

    return df_all_features
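
The function above routes each feature to the attendance, rank or impact helper based on a substring of its name. The sketch below isolates that routing step so it can be checked without any API calls. `split_features_by_type` is a hypothetical helper and `phq_impact_example_feature` is a placeholder name; the other feature names are taken from the sample output in this notebook.

```python
def split_features_by_type(features):
    """Bucket feature names by the type keyword they contain."""
    buckets = {"attendance": [], "rank": [], "impact": []}
    for name in features:
        for kind in buckets:
            # Same substring test as the list comprehensions in the function above.
            if kind in name:
                buckets[kind].append(name)
    return buckets


sample_features = [
    "phq_attendance_sports",
    "phq_rank_public_holidays",
    "phq_attendance_concerts",
    "phq_impact_example_feature",  # placeholder name
]
buckets = split_features_by_type(sample_features)
for kind, names in buckets.items():
    print(kind, names)
```

A feature name that matches none of the three keywords is silently skipped, which mirrors the behavior of the original list comprehensions.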

You can have the important features organized as a dataframe for each location and consolidate them into a single dataframe.

In [34]:
all_locations_features_df = []
for location, demand_data_for_location in demand_data_by_location:
    analysis_id = location_sample[location_sample['location'] == location]['analysis_id'].values[0]
    all_important_features = location_sample[location_sample['location'] == location]['important_features'].values[0]
    event_features_df = get_features_api_data(analysis_id, all_important_features)
    event_features_df['location'] = location
    event_features_df['analysis_id'] = analysis_id
    # reorder the columns
    event_features_df = event_features_df[['location', 'analysis_id', 'date'] + all_important_features]
    all_locations_features_df.append(event_features_df)

# Concatenate all the dataframes
all_locations_features_df = pd.concat(all_locations_features_df)

In [35]:
all_locations_features_df.head()

Unnamed: 0,location,analysis_id,date,phq_attendance_performing_arts,phq_rank_public_holidays,phq_attendance_expos,phq_attendance_sports,phq_attendance_festivals,phq_attendance_community,phq_rank_observances,phq_rank_academic_exam,phq_rank_academic_holiday,phq_attendance_concerts,phq_attendance_school_holidays,phq_attendance_conferences
0,store_0,jHKkjAQVIm4,2017-01-02,2465.0,5,0.0,19812.0,0.0,6704.0,0.0,0.0,0.0,31623.0,,
1,store_0,jHKkjAQVIm4,2017-01-03,5574.0,0,0.0,18006.0,0.0,6704.0,0.0,0.0,0.0,31623.0,,
2,store_0,jHKkjAQVIm4,2017-01-04,9694.0,0,0.0,19812.0,0.0,6704.0,3.0,0.0,0.0,0.0,,
3,store_0,jHKkjAQVIm4,2017-01-05,5581.0,0,0.0,0.0,0.0,6704.0,0.0,0.0,0.0,2112.0,,
4,store_0,jHKkjAQVIm4,2017-01-06,5574.0,0,0.0,19500.0,0.0,9480.0,3.0,0.0,0.0,2036.0,,


##### 4.3.4.2  Combine Demand data and event features

In [36]:
# Combine Beam demand data (all_locations_beam_df) with event features (all_locations_features_df)
all_locations_df = all_locations_beam_df.merge(all_locations_features_df, on=['analysis_id', 'date'], how='left')
# clean the data by removing the rows with missing actual demand 
all_locations_df = all_locations_df.dropna(subset=['actual_demand'])

In [37]:
all_locations_df.head()

Unnamed: 0,analysis_id,date,actual_demand,baseline_demand,remainder,phq_attendance_sum,location,phq_attendance_performing_arts,phq_rank_public_holidays,phq_attendance_expos,phq_attendance_sports,phq_attendance_festivals,phq_attendance_community,phq_rank_observances,phq_rank_academic_exam,phq_rank_academic_holiday,phq_attendance_concerts,phq_attendance_school_holidays,phq_attendance_conferences
0,jHKkjAQVIm4,2017-01-02,3350.804294,6688.798587,-3337.994292,60604,store_0,2465.0,5,0.0,19812.0,0.0,6704.0,0.0,0.0,0.0,31623.0,,
1,jHKkjAQVIm4,2017-01-03,7974.534129,7720.345753,254.188375,61907,store_0,5574.0,0,0.0,18006.0,0.0,6704.0,0.0,0.0,0.0,31623.0,,
2,jHKkjAQVIm4,2017-01-04,7274.021429,8003.666169,-729.64474,36210,store_0,9694.0,0,0.0,19812.0,0.0,6704.0,3.0,0.0,0.0,0.0,,
3,jHKkjAQVIm4,2017-01-05,7504.479021,8433.525995,-929.046974,14397,store_0,5581.0,0,0.0,0.0,0.0,6704.0,0.0,0.0,0.0,2112.0,,
4,jHKkjAQVIm4,2017-01-06,7091.141396,6512.486925,578.654472,36590,store_0,5574.0,0,0.0,19500.0,0.0,9480.0,3.0,0.0,0.0,2036.0,,
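
A quick sanity check on the combined data: in the five sample rows shown above, `remainder` matches `actual_demand - baseline_demand` up to rounding. This is an observation from the displayed output rather than a documented guarantee, and `remainder_mismatches` is a hypothetical helper. The sketch below repeats the check on those rows using only the standard library.

```python
# (date, actual_demand, baseline_demand, remainder) copied from the output above.
rows = [
    ("2017-01-02", 3350.804294, 6688.798587, -3337.994292),
    ("2017-01-03", 7974.534129, 7720.345753, 254.188375),
    ("2017-01-04", 7274.021429, 8003.666169, -729.64474),
    ("2017-01-05", 7504.479021, 8433.525995, -929.046974),
    ("2017-01-06", 7091.141396, 6512.486925, 578.654472),
]


def remainder_mismatches(rows, tol=1e-5):
    """Return the dates where remainder differs from actual - baseline by more than tol."""
    return [
        date
        for date, actual, baseline, remainder in rows
        if abs((actual - baseline) - remainder) > tol
    ]


mismatches = remainder_mismatches(rows)
print(mismatches)  # an empty list means every sample row is consistent
```

The tolerance allows for the six-decimal rounding of the displayed values.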


## Part 5[Optional].  Create Groups for the Analyses

While single Analyses are ideal, offering insights tailored to each store or location, grouping Analyses can present a more practical approach as it provides a manageable, aggregated view across multiple stores or locations. This can be useful for businesses, such as retail chains, that manage operations at a regional or state level. Group Analyses support these operations, like marketing campaigns, inventory management, and demand forecasting, by delivering aggregated insights consolidated across all stores. Please refer to this [page](https://www.predicthq.com/support/grouping-analyses-in-beam) for more information on grouping analyses. 

### 5.1 Create groups with analysis ids

To efficiently analyze data, group your locations with shared characteristics. For example, assign store_0 and store_1 to 'group_A' to analyze them together, indicating similar demand patterns or operational strategies. Store_2, with differing attributes, goes into 'group_B'.

In [38]:
# read in the group_sample data 
group_sample = pd.read_csv('group_sample.csv')

In [39]:
def create_group(group_name, analysis_ids):
    """
    Creates a Beam analysis group.

    Parameters:
    - group_name (str): The name of the group to be created.
    - analysis_ids (list): The IDs of the analyses that should be associated with this group.

    Returns:
    - str: The ID of the new group, or None if the request failed.
    """
    response = requests.post(
        url=f"{BEAM_URL}/analysis-groups",
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Accept": "application/json",
        },
        json={"name": group_name, "analysis_ids": analysis_ids},
    )

    if response.status_code == 201:
        return response.json()['group_id']

    print(response.status_code)
    print(f"Error creating group for {group_name}")
    return None

In [40]:
# Merge the two DataFrames on the 'location' column to combine the analysis IDs with group names
merged_df = pd.merge(location_sample, group_sample, on='location')
# Group the merged DataFrame by 'group_name' and aggregate the 'analysis_id' into lists
group_with_analysis_ids = merged_df.groupby('group_name')['analysis_id'].apply(list).reset_index()
group_with_location = merged_df.groupby('group_name')['location'].apply(list).reset_index()

In [41]:
# merge the group_with_analysis_ids with group_with_location 
group_with_analysis_ids = pd.merge(group_with_analysis_ids, group_with_location, on='group_name')
# rename analysis_id to analysis_ids to be more descriptive
group_with_analysis_ids.rename(columns={'analysis_id': 'analysis_ids'}, inplace=True)
# rename location to locations to be more descriptive
group_with_analysis_ids.rename(columns={'location': 'locations'}, inplace=True)

In [42]:
group_with_analysis_ids

Unnamed: 0,group_name,analysis_ids,locations
0,group_A,"[jHKkjAQVIm4, CNsDRuJZAeM]","[store_0, store_1]"
1,group_B,"[CNsDRuJZAeM, 0bhDu9k33SQ]","[store_1, store_2]"
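
For readers who want to see the grouping step without pandas, the sketch below reproduces the table above with plain dictionaries. The contents of `group_rows` are an assumption about what `group_sample.csv` holds (one `location`, `group_name` pair per row), inferred from the output; `analysis_ids_by_group` is a hypothetical helper.

```python
from collections import defaultdict

# Location to analysis_id mapping, as shown in the outputs of this notebook.
location_to_analysis = {
    "store_0": "jHKkjAQVIm4",
    "store_1": "CNsDRuJZAeM",
    "store_2": "0bhDu9k33SQ",
}

# Assumed rows of group_sample.csv: (location, group_name).
group_rows = [
    ("store_0", "group_A"),
    ("store_1", "group_A"),
    ("store_1", "group_B"),
    ("store_2", "group_B"),
]


def analysis_ids_by_group(group_rows, location_to_analysis):
    """Collect the analysis IDs of every location assigned to each group."""
    grouped = defaultdict(list)
    for location, group_name in group_rows:
        grouped[group_name].append(location_to_analysis[location])
    return dict(grouped)


grouped = analysis_ids_by_group(group_rows, location_to_analysis)
print(grouped)
```

Note that a location can appear in more than one group, as `store_1` does here.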


In [43]:
# add a new column to store the group_id
group_ids = []
for index, group in group_with_analysis_ids.iterrows():
    group_id = create_group(group['group_name'], group['analysis_ids'])
    group_ids.append(group_id)
group_with_analysis_ids['group_id'] = group_ids

In [44]:
group_with_analysis_ids

Unnamed: 0,group_name,analysis_ids,locations,group_id
0,group_A,"[jHKkjAQVIm4, CNsDRuJZAeM]","[store_0, store_1]",yXLpj0yCTE4
1,group_B,"[CNsDRuJZAeM, 0bhDu9k33SQ]","[store_1, store_2]",p6kDpkeLuIU


#### 5.1.1 Check group status

Now that you have created a Beam group for each demand data group, the next step is to check the status of the groups. Each group needs to be ready and its Feature Importance processing needs to be completed before you proceed to the next steps. Refresh as needed to get the latest status.

In [45]:
def group_status(group_id):
    """
    Checks the status of a specific analysis group by its ID.

    This function makes an API call to retrieve the current status of the analysis group.
    It checks if the group is ready and if the feature importance processing has been completed.

    Parameters:
    - group_id (str): The unique identifier for the analysis group to check.

    Outputs:
    - Prints a message indicating whether the analysis group and feature importance processing are ready.
    """
    response = requests.get(
        url=f"{BEAM_URL}/analysis-groups/{group_id}",  
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN, 
            "Accept": "application/json",       
        },
    )
    result = response.json()
    readiness_status = result.get("readiness_status")
    feature_importance = result.get("processing_completed", {}).get("feature_importance")

    # The group is not ready if readiness_status is not 'ready' or feature importance processing has not completed
    if readiness_status != "ready" or not feature_importance:
        print(f"Beam analysis group creation or feature importance processing for {group_id} is not ready yet, please wait until it's finished.")
    else:
        print(f"Beam analysis group creation and feature importance processing for {group_id} is ready.")


In [46]:
# check the status of each group
for index, group in group_with_analysis_ids.iterrows():
    group_status(group['group_id'])

Beam analysis group creation and feature importance processing for yXLpj0yCTE4 is ready.
Beam analysis group creation and feature importance processing for p6kDpkeLuIU is ready.
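
Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. The sketch below is one possible approach, not an official client: `is_group_ready` and `wait_until_ready` are hypothetical helpers, the status fetcher is passed in as a callable so the example runs without network access, and the `"pending"` value is a placeholder rather than a confirmed API status.

```python
import time


def is_group_ready(status):
    """Apply the same readiness test as the status check above to a parsed response."""
    feature_importance = status.get("processing_completed", {}).get("feature_importance")
    return status.get("readiness_status") == "ready" and bool(feature_importance)


def wait_until_ready(fetch_status, max_attempts=10, delay_seconds=30.0, sleep=time.sleep):
    """Poll fetch_status() until the group is ready; return the attempt number or None."""
    for attempt in range(1, max_attempts + 1):
        if is_group_ready(fetch_status()):
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)
    return None


# Stand-in for the real GET request: two not-ready responses, then a ready one.
fake_responses = iter([
    {"readiness_status": "pending", "processing_completed": {"feature_importance": False}},
    {"readiness_status": "ready", "processing_completed": {"feature_importance": False}},
    {"readiness_status": "ready", "processing_completed": {"feature_importance": True}},
])
attempts_needed = wait_until_ready(
    lambda: next(fake_responses), delay_seconds=0, sleep=lambda seconds: None
)
print(attempts_needed)  # 3
```

In practice, `fetch_status` would wrap the `requests.get(...)` call and return `response.json()`.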


#### 5.1.2 Fetch relevant analysis_ids for existing groups

If you have an existing group_id, you can fetch the analysis_ids from the group to get relevant analysis_ids and event features. 

In [47]:
def get_group(group_id):
    """
    Retrieves the name and analysis_ids of a specific analysis group by its ID.
    Analyses listed as excluded in the group's processing results are left out.

    Parameters:
    - group_id (str): The unique identifier for the analysis group to check.

    Returns:
    - dict: The group name and the list of analysis_ids that were not excluded.
    """
    response = requests.get(
        url=f"{BEAM_URL}/analysis-groups/{group_id}",
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Accept": "application/json",
        },
    )

    data = response.json()
    name = data["name"]
    excluded_analysis_ids = {
        entry["analysis_id"]
        for entry in data.get("processing_completed", {}).get("excluded_analyses", [])
    }
    # Keep only the analyses that were not excluded from the group
    analysis_ids = [
        analysis_id
        for analysis_id in data.get("analysis_ids", [])
        if analysis_id not in excluded_analysis_ids
    ]

    return {"name": name, "analysis_ids": analysis_ids}

In [None]:
group_id = 'SAMPLE_GROUP_ID'  # replace with the group_id you want to get the analysis_ids for
print(get_group(group_id))
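
To show what the exclusion filter in `get_group` does, the sketch below applies the same logic to a hand-written payload. The payload shape is inferred from the keys the function reads, the IDs are placeholders, and `active_analysis_ids` is a hypothetical helper.

```python
def active_analysis_ids(group_payload):
    """Return the group's analysis IDs minus those listed under excluded_analyses."""
    excluded = {
        entry["analysis_id"]
        for entry in group_payload.get("processing_completed", {}).get("excluded_analyses", [])
    }
    return [
        analysis_id
        for analysis_id in group_payload.get("analysis_ids", [])
        if analysis_id not in excluded
    ]


# Hypothetical response body with one excluded analysis.
sample_payload = {
    "name": "group_A",
    "analysis_ids": ["analysis_1", "analysis_2", "analysis_3"],
    "processing_completed": {"excluded_analyses": [{"analysis_id": "analysis_2"}]},
}

remaining = active_analysis_ids(sample_payload)
print(remaining)  # ['analysis_1', 'analysis_3']
```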

### 5.2 Identify relevant group features using Feature Importance & Features API

#### 5.2.1 Get group feature importance

Category Importance at the group level is interpreted in much the same way as for single Analyses. It is a weighted aggregation of the Category Importance from each contributing Analysis, where the weights are proportional to each Analysis's average daily demand. This gives more influence to Analyses with a larger share of the overall group demand.

As with single Analyses, the important categories highlight key drivers of demand for your stores or locations, though with a more generalized view. Below is how you can get Category Importance at the group level.

In [48]:
def get_group_feature_importance(group_id):
    """
    Retrieves the feature importance data for a specified analysis group.
    Parameters:
    - group_id (str): The unique identifier of the analysis group for which to retrieve feature importance data.

    Returns:
    - dict: A dictionary containing the feature importance data in JSON format.
    """
    response = requests.get(
        url=f"{BEAM_URL}/analysis-groups/{group_id}/feature-importance",
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Accept": "application/json",
        },
    )
    return response.json()

In [49]:
# add important_categories and important_features to the group_with_analysis_ids dataframe
important_categories_all_groups = []
important_features_all_groups = []
for index, group in group_with_analysis_ids.iterrows():
    group_feature_importance = get_group_feature_importance(group['group_id'])['feature_importance']
    group_important_categories = [item['feature_group'] for item in group_feature_importance if item['important']]
    group_important_features = list(chain.from_iterable([item['features'] for item in group_feature_importance if item['important']]))
    important_categories_all_groups.append(group_important_categories)
    important_features_all_groups.append(group_important_features)
group_with_analysis_ids['group_important_categories'] = important_categories_all_groups
group_with_analysis_ids['group_important_features'] = important_features_all_groups
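
The extraction logic in the loop above can be tried on a small hand-written payload. The keys `feature_group`, `features` and `important` are the ones the loop reads; the `important` flags below are made up for illustration, and `select_important` is a hypothetical helper.

```python
from itertools import chain

# Hypothetical 'feature_importance' list with two important categories.
sample_feature_importance = [
    {"feature_group": "sports", "features": ["phq_attendance_sports"], "important": True},
    {"feature_group": "public-holidays", "features": ["phq_rank_public_holidays"], "important": True},
    {"feature_group": "expos", "features": ["phq_attendance_expos"], "important": False},
]


def select_important(feature_importance):
    """Return (important categories, flattened list of their features)."""
    important_items = [item for item in feature_importance if item["important"]]
    categories = [item["feature_group"] for item in important_items]
    features = list(chain.from_iterable(item["features"] for item in important_items))
    return categories, features


categories, features = select_important(sample_feature_importance)
print(categories)
print(features)
```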

In [50]:
group_with_analysis_ids

Unnamed: 0,group_name,analysis_ids,locations,group_id,group_important_categories,group_important_features
0,group_A,"[jHKkjAQVIm4, CNsDRuJZAeM]","[store_0, store_1]",yXLpj0yCTE4,"[performing-arts, public-holidays, sports, com...","[phq_attendance_performing_arts, phq_rank_publ..."
1,group_B,"[CNsDRuJZAeM, 0bhDu9k33SQ]","[store_1, store_2]",p6kDpkeLuIU,"[public-holidays, school-holidays, sports, per...","[phq_rank_public_holidays, phq_attendance_scho..."
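
The demand-weighted aggregation described at the start of this section can be illustrated with a toy calculation. This is only a sketch of a weighted average with weights proportional to average daily demand; the per-analysis scores and demand figures are invented, `demand_weighted_average` is a hypothetical helper, and Beam's actual computation may differ in its details.

```python
def demand_weighted_average(values, average_daily_demand):
    """Average the values using weights proportional to average daily demand."""
    total_demand = sum(average_daily_demand)
    if total_demand == 0:
        raise ValueError("total average daily demand must be non-zero")
    weighted_sum = sum(value * weight for value, weight in zip(values, average_daily_demand))
    return weighted_sum / total_demand


# Two analyses: an illustrative per-analysis score and average daily demand for each.
scores = [0.8, 0.2]
avg_daily_demand = [3000.0, 1000.0]

group_score = demand_weighted_average(scores, avg_daily_demand)
print(round(group_score, 4))  # (0.8*3000 + 0.2*1000) / 4000 = 0.65
```

The analysis with three times the demand pulls the group-level value toward its own score, which is the effect the weighting is meant to have.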


#### 5.2.2 Get important group-level features from Features API 

The Beam API provides a list of features for each important category, which can be incorporated into your models via the Features API. 

In [51]:
# Fetch features data for each group
group_features_data = []
for index, group in group_with_analysis_ids.iterrows():
    group_name = group['group_name']
    analysis_ids = group['analysis_ids'] 
    group_id = group['group_id']
    important_features = group['group_important_features']

    print(f"Getting features data for {group_name} ({group_id})...")

    for analysis_id in analysis_ids:  # Loop through each analysis_id for the group
        print(f"--- Getting features data for analysis ID {analysis_id}...")

        try:
            features = get_features_api_data(
                analysis_id=analysis_id, features = important_features
            )
            features['group_name'] = group_name  # Adding group_name as a new column
            features['analysis_id'] = analysis_id  # Adding analysis_id as a new column
            # reorder the columns
            features = features[['group_name', 'analysis_id', 'date'] + important_features]
            group_features_data.append(features)
        except Exception as e:
            print(f"An error occurred while fetching data for analysis ID {analysis_id}: {e}")
            continue

# concatenate all the dataframes containing features data for each group
group_features_df = pd.concat(group_features_data, ignore_index=True)

Getting features data for group_A (yXLpj0yCTE4)...
--- Getting features data for analysis ID jHKkjAQVIm4...
--- Getting features data for analysis ID CNsDRuJZAeM...
Getting features data for group_B (p6kDpkeLuIU)...
--- Getting features data for analysis ID CNsDRuJZAeM...
--- Getting features data for analysis ID 0bhDu9k33SQ...


Combine Demand data and group event features

In [52]:
# Combine Beam decomposition output (all_locations_beam_df) with group-level event features (group_features_df)
result_df = all_locations_beam_df.merge(group_features_df, on=['analysis_id', 'date'], how='left')
# clean the data by removing the rows with missing actual demand
result_df = result_df.dropna(subset=['actual_demand'])
# fill the missing values with 0
result_df.fillna(0, inplace=True)
result_df

Unnamed: 0,analysis_id,date,actual_demand,baseline_demand,remainder,phq_attendance_sum,group_name,phq_attendance_performing_arts,phq_rank_public_holidays,phq_attendance_sports,phq_attendance_community,phq_attendance_school_holidays,phq_rank_observances,phq_attendance_conferences
0,jHKkjAQVIm4,2017-01-02,3350.804294,6688.798587,-3337.994292,60604,group_A,2465.0,5,19812.0,6704.0,0.0,0.0,0.0
1,jHKkjAQVIm4,2017-01-03,7974.534129,7720.345753,254.188375,61907,group_A,5574.0,0,18006.0,6704.0,0.0,0.0,0.0
2,jHKkjAQVIm4,2017-01-04,7274.021429,8003.666169,-729.644740,36210,group_A,9694.0,0,19812.0,6704.0,0.0,3.0,0.0
3,jHKkjAQVIm4,2017-01-05,7504.479021,8433.525995,-929.046974,14397,group_A,5581.0,0,0.0,6704.0,0.0,0.0,0.0
4,jHKkjAQVIm4,2017-01-06,7091.141396,6512.486925,578.654472,36590,group_A,5574.0,0,19500.0,9480.0,0.0,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3975,0bhDu9k33SQ,2018-11-26,2567.725141,2543.414081,24.311060,0,group_B,0.0,0,0.0,0.0,0.0,0.0,0.0
3976,0bhDu9k33SQ,2018-11-27,2480.132093,2368.503978,111.628114,0,group_B,0.0,0,0.0,0.0,0.0,0.0,0.0
3977,0bhDu9k33SQ,2018-11-28,2716.229757,2520.617594,195.612163,0,group_B,0.0,0,0.0,0.0,0.0,0.0,0.0
3978,0bhDu9k33SQ,2018-11-29,3010.198718,2595.433854,414.764864,0,group_B,0.0,0,0.0,0.0,0.0,0.0,0.0
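
Because the merge above joins on `analysis_id` and `date`, an analysis that belongs to more than one group contributes one row per group for each date. The sketch below, using the group membership shown earlier in this notebook, lists the analyses for which this applies; `analyses_in_multiple_groups` is a hypothetical helper.

```python
from collections import Counter

# Group membership as shown in the group_with_analysis_ids output.
group_membership = {
    "group_A": ["jHKkjAQVIm4", "CNsDRuJZAeM"],
    "group_B": ["CNsDRuJZAeM", "0bhDu9k33SQ"],
}


def analyses_in_multiple_groups(group_membership):
    """Return the sorted analysis IDs that appear in more than one group."""
    counts = Counter(
        analysis_id
        for analysis_ids in group_membership.values()
        for analysis_id in set(analysis_ids)
    )
    return sorted(analysis_id for analysis_id, n in counts.items() if n > 1)


shared = analyses_in_multiple_groups(group_membership)
print(shared)  # ['CNsDRuJZAeM']
```

If you aggregate `result_df` across groups, keep this in mind so that demand from shared analyses is not counted twice.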
