This notebook is designed to provide you with the context you need to navigate the [Beam API](https://docs.predicthq.com/resources/beam) effectively.

[Beam](https://www.predicthq.com/beam) is PredictHQ's automated correlation engine to accurately reveal the events that drive demand for your business.  It not only highlights the correlation between events and your demand data but also offers a decomposition of your demand data, aiding in enhancing the precision of your demand forecasts. For a deeper dive into Beam, explore the [Beam Overview](https://www.predicthq.com/support/beam-overview).

This notebook guides you through the following use cases supported by the Beam API:

1. [Upload demand data to Beam](#part-1-upload-demand-data-to-beam): Upload your data to Beam for analysis.
2. [Generate Beam analysis results](#part-2-generate-beam-analysis-results): Retrieve the correlation results between events and your demand data.
3. [Plot and interpret Beam output](#part-3-plot-and-interpret-beam-output): Visualize and interpret the output from Beam for better insights.
4. [Identify relevant features using Feature Importance](#part-4-identify-relevant-features-using-feature-importance): Explore Feature Importance to identify and extract relevant features for your forecasting model.

Utilizing Beam's decomposition and the Feature Importance feature can substantially augment the accuracy of your forecasts. If you have not been decomposing your data for forecasting purposes, Beam’s functionalities provide a robust framework to dissect your data into baseline demand and remainder components, coupled with identifying critical features essential for an accurate forecasting model.

For additional insights into the Beam API, please refer to our [technical documentation](https://docs.predicthq.com/api/beam).

If you are using Google Colab, uncomment the following code block. It downloads the repository that contains the sample data used in this notebook and installs the required packages.

In [None]:
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/beam-api-notebook
# !pip install pandas==2.0.2 requests==2.24.0 plotly==5.3.0 predicthq==2.0.6 numpy==1.24.3

Alternatively, if you're running this notebook on a local machine, you can set up a Python environment using the requirements.txt file shared alongside the notebook.

You can install the necessary requirements by executing the command `pip install -r requirements.txt`.

In [None]:
import pandas as pd 
import os 
import requests
import plotly.graph_objects as go
from predicthq import Client
import collections
import numpy as np
from datetime import datetime, date, timedelta

## Part 1. Upload demand data to Beam

Beam works by creating an analysis for each location.  A location typically refers to the physical premise of a business like a store or a hotel. For instance, in a restaurant chain, each restaurant would represent a distinct location.

When running Beam, you execute it over a set of locations and create a Beam analysis for each location. Then, for each location, you upload your historical demand data, such as historical retail sales data or historical hotel room booking data. This is the demand data that Beam correlates with event data. For each location, you need the latitude and longitude. To search for events that can impact your business, you also need a suitable radius around the business location. Our [Suggested Radius API](https://docs.predicthq.com/resources/suggested-radius) will calculate the radius for you.

The example code below loops over a list of input locations, creates a Beam analysis_id for each location, and uploads demand data for each location. If you are adapting this for your context, you may be loading demand data from a database or an API to upload into Beam.

You can use this approach to upload data for hundreds or thousands of locations. 

### 1.1 Get Suggested Radius for all locations

An Access Token is required to query the API.

The following link will guide you through creating an account and an access token. Please ensure that your API token is updated to access the latest features of Beam.
- https://docs.predicthq.com/guides/quickstart/

In [None]:
ACCESS_TOKEN = 'ACCESS_TOKEN'

In [None]:
# Read in a CSV file with latitude, longitude and industry for each location
locations = pd.read_csv('location_sample.csv')
# Set urls for API requests
SUGGESTED_RADIUS_URL="https://api.predicthq.com/v1/suggested-radius/"
BEAM_URL = "https://api.predicthq.com/v1/beam/analyses"
FEATURES_API_URL = "https://api.predicthq.com/v1/features"

In [None]:
locations

Unnamed: 0,location_id,lat,lon,industry
0,store_0,37.784,-122.404,restaurants
1,store_1,47.611,-122.338,restaurants
2,store_2,40.735,-73.869,restaurants


In [None]:
def get_suggested_radius(lat_lon, industry, radius_unit):
    """
    Returns the suggested radius for a given latitude and longitude.

    Args:
        lat_lon (str): The latitude and longitude of the location in the format "lat,lon".
        industry (str): The industry of interest that the radius will be calculated for. 
        radius_unit (str): Unit in which the suggested radius will be returned.
        
    Returns:
        float: The suggested radius in your preferred unit.
    """
    # Set the url for the API call
    url = SUGGESTED_RADIUS_URL
    # Set the query parameters for the API call
    params = {
        "location.origin": lat_lon,
        "industry": industry,
        "radius_unit": radius_unit
    }
    # Set the headers for the API call (including the access token)
    headers = {
        "Authorization": "Bearer " + ACCESS_TOKEN,
        "Accept": "application/json"
    }
    # Make the API call and get the JSON response
    response = requests.get(url, params=params, headers=headers)
    if response.status_code == 200:
        # Extract the radius from the JSON response and return it
        return response.json()['radius']
    # Report failures instead of silently returning None
    print(f"Error getting suggested radius for {lat_lon}: {response.status_code}")

In [None]:
rads = []
# Please specify your preferred unit here
radius_unit = 'mi'  # radius unit in miles, matching the output shown below
# Loop through each row in the locations dataframe to generate the suggested radius for each location
for index, location in locations.iterrows():
    # Call the get_suggested_radius function with the origin in "lat,lon" format and the industry as a plain string
    r = get_suggested_radius(f"{location['lat']},{location['lon']}", location['industry'], radius_unit)
    # Append the suggested radius to the list
    rads.append(r)

# Add a new column to the locations dataframe to store the suggested radii
locations["radius"] = rads
# Add a new column to the locations dataframe to store the suggested radii units
locations["unit"] = radius_unit

In [None]:
locations

Unnamed: 0,location_id,lat,lon,industry,radius,unit
0,store_0,37.784,-122.404,restaurants,1.29,mi
1,store_1,47.611,-122.338,restaurants,1.31,mi
2,store_2,40.735,-73.869,restaurants,1.31,mi


### 1.2 Get an analysis_id for each analysis

To set parameters for the Beam API, you can refer to our documentation available at [Beam API](https://docs.predicthq.com/api/beam/create-an-analysis). It provides detailed information on how to configure the API according to your requirements.

In [None]:
# Please specify a rank threshold for the Beam analysis, this will also be used to extract event features from Features API
RANK_THRESHOLD = 50

In [None]:
def generate_analysis_ids(location_id, lat, lon, radius, radius_unit, rank_threshold = RANK_THRESHOLD):
    """
    Generates an analysis ID for a given location, latitude, longitude, radius and its unit.

    Args:
        location_id (str): The unique ID of the location to generate an analysis for.
        lat (float): The latitude of the location.
        lon (float): The longitude of the location.
        radius (float): The radius to use for the analysis.
        radius_unit(str): The unit of measurement used for radius. 
        rank_threshold (int): The minimum rank threshold for the analysis.

    Returns:
        str: The analysis ID generated by the API.
    """
    # Set the URL and JSON payload for the API call
    url = BEAM_URL
    payload = {
        "name": f"{location_id}_analysis",
        "location": {
            "geopoint": {
                "lat": lat,
                "lon": lon,
            },
            "radius": radius,
            "unit": radius_unit,
        },
        "rank": {
            "type": "phq",
            "levels": {
                "phq": {
                    "min": rank_threshold,
                },
            },
        },
        "tz": "UTC",
    }
    # Make a POST request to the API to generate the analysis ID
    response = requests.post(
        url=url,
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Accept": "application/json",
        },
        json=payload,
    )
    # Extract the analysis ID from the JSON response and return it if the status code is 201 (Created)
    if response.status_code == 201:
        return response.json()['analysis_id']
    else:
        print(response.status_code)
        print(f"Error generating analysis ID for {location_id}")

In [None]:
analysis_ids = []
# Loop through each row in the locations dataframe
for index, location in locations.iterrows():
    # Generate a Beam analysis id for each location
    r = generate_analysis_ids(location['location_id'], str(location['lat']), str(location['lon']), location['radius'], location['unit'])
    # Add the analysis ids to the list
    analysis_ids.append(r)
locations['analysis_id'] = analysis_ids

In [None]:
locations

Unnamed: 0,location_id,lat,lon,industry,radius,unit,analysis_id
0,store_0,37.784,-122.404,restaurants,1.29,mi,U8S2xAMmyRI
1,store_1,47.611,-122.338,restaurants,1.31,mi,-dYjPRzPVO4
2,store_2,40.735,-73.869,restaurants,1.31,mi,D_pQxmXIavM


### 1.3 Upload demand data for each analysis_id

The example below reads demand data for multiple business locations from a single CSV file and uploads the demand data for each location to its corresponding analysis_id.

When you use this in the context of your business, you could read the demand data from a database, an API or an internal product, for example. Your demand data needs to be aggregated to daily values for each location, where the date is in YYYY-MM-DD format and demand is a numeric value. Please ensure that your demand data file contains both of these columns: `date` and `demand`. See the example files for how the data should be formatted, and see Upload Demand Data to an Analysis in the [Beam API](https://docs.predicthq.com/resources/beam) documentation for more details.

Please also ensure that `location_id` is a unique key in both the demand data and the location data.
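
Before uploading, it can help to check that each location's demand really is one numeric value per day with ISO-formatted dates. The following is a minimal sketch, assuming a dataframe with the `location_id`, `date` and `demand` columns described above. The helper name `prepare_demand` and the sample values are hypothetical, and summing same-day rows is an assumption; use whichever aggregation suits your metric.

```python
import pandas as pd


def prepare_demand(df):
    """Aggregate raw demand rows to one numeric value per location per day."""
    out = df.copy()
    # Parse dates strictly; strings that are not YYYY-MM-DD raise an error
    out["date"] = pd.to_datetime(out["date"], format="%Y-%m-%d")
    # Coerce demand to numeric and fail loudly on non-numeric values
    out["demand"] = pd.to_numeric(out["demand"], errors="raise")
    # Sum multiple rows that fall on the same day for the same location
    out = out.groupby(["location_id", "date"], as_index=False)["demand"].sum()
    # Convert dates back to YYYY-MM-DD strings for the upload payload
    out["date"] = out["date"].dt.strftime("%Y-%m-%d")
    return out


# Hypothetical sample: two rows for store_0 fall on the same day
raw = pd.DataFrame({
    "location_id": ["store_0", "store_0", "store_1"],
    "date": ["2023-01-01", "2023-01-01", "2023-01-01"],
    "demand": [10, 5, 7],
})
daily = prepare_demand(raw)
print(daily)
```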

In [None]:
# Read in the data and loop through the locations_ids
demand_data = pd.read_csv('demand_data/demand_data_sample.csv')
grouped = demand_data.groupby('location_id')

In [None]:
for location_id, grouped_data in grouped:
    # Get the corresponding analysis_id for each store
    analysis_id = locations[locations['location_id'] == location_id]['analysis_id'].values[0]
    # Only get date and demand from grouped_data
    individual_demand = grouped_data[['date', 'demand']]
    # Convert individual_demand to JSON format
    individual_demand_json = individual_demand.to_json(orient='records')
    # Upload the individual_demand data to Beam
    response = requests.post(
        url=f"{BEAM_URL}/{analysis_id}/sink",
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Content-Type": "application/json"
        },
        data=individual_demand_json
    )
    # Check the status code of the response to see if the request has been accepted
    if response.status_code == 202:
        print('The request has been accepted for processing.') 
    else:
        print(response.content)

The request has been accepted for processing.
The request has been accepted for processing.
The request has been accepted for processing.


### 1.4 Check the readiness_status

Now, we can make a request to the Beam API to retrieve the Beam analysis output for each location. The "readiness_status" indicates whether the data has been successfully uploaded and processed. Please ensure that the readiness_status is "ready" before you continue.

In [None]:
# Please ensure that the readiness_status is "ready" before you continue, this might take a couple of minutes
for index, location in locations.iterrows():
    response = requests.get(
    url = f"{BEAM_URL}/{location['analysis_id']}",
    headers = {
      "Authorization": "Bearer " + ACCESS_TOKEN,
      "Accept": "application/json"
    })
    print(f"{response.json()['name']}: {response.json()['readiness_status']}")
    if response.json()['readiness_status'] != "ready":
        print(f"Beam {response.json()['name']} is not ready yet, please wait until the readiness_status is ready.")
    # print(response.json()) # uncomment this line if you want more information about the analyses

store_0_analysis: ready
store_1_analysis: ready
store_2_analysis: ready
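
Because processing can take a few minutes, you may prefer to poll rather than re-run the cell by hand. The following is a minimal sketch, assuming a caller-supplied `fetch_status` function (hypothetical) that returns the `readiness_status` string for an analysis, for example by wrapping the GET request above. The attempt count, the delay and the `"pending"` placeholder status are illustrative assumptions.

```python
import time


def wait_until_ready(fetch_status, analysis_id, max_attempts=20, delay_seconds=15):
    """Poll fetch_status(analysis_id) until it returns "ready" or attempts run out."""
    for attempt in range(max_attempts):
        if fetch_status(analysis_id) == "ready":
            return True
        # Sleep between attempts, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return False


# Offline demonstration with a stand-in for the API call (placeholder statuses)
_statuses = iter(["pending", "pending", "ready"])
is_ready = wait_until_ready(lambda _id: next(_statuses), "example_id", delay_seconds=0)
print(is_ready)
```

Passing the status lookup in as a function keeps the polling logic independent of the HTTP call, so it can be exercised without network access.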


In [None]:
locations

Unnamed: 0,location_id,lat,lon,industry,radius,unit,analysis_id
0,store_0,37.784,-122.404,restaurants,1.29,mi,U8S2xAMmyRI
1,store_1,47.611,-122.338,restaurants,1.31,mi,-dYjPRzPVO4
2,store_2,40.735,-73.869,restaurants,1.31,mi,D_pQxmXIavM


## Part 2.  Generate Beam analysis results

Correlation results show you your decomposed demand data as well as the event impact data for each location. This is the same data that you can see in the Beam UI - see [Viewing the Time Series Impact Analysis](https://www.predicthq.com/support/viewing-the-time-series-impact-analysis). You can see correlation where there are remainder values corresponding with significant event impact values.

In [None]:
date_col_name = 'date'
analyses = []
location_ids = []
analyses_dic = {}
for location_id, grouped_data in grouped:
    # Get the corresponding analysis id for each store
    analysis_id = locations[locations['location_id'] == location_id]['analysis_id'].values[0]
    # Only use the date and demand columns from grouped_data for the API request
    individual_demand = grouped_data[['date', 'demand']]
    # Extract min and max dates for each store
    min_date = individual_demand[date_col_name].min()
    max_date = individual_demand[date_col_name].max()
    # Set parameters for Beam API request
    url = f"{BEAM_URL}/{analysis_id}/correlate"
    headers = {
        "Authorization": "Bearer " + ACCESS_TOKEN,
        "Accept": "application/json"
    }
    params = {
        "date.gte": min_date,
        "date.lte": max_date
    }

    response = requests.get(url=url, headers=headers, params=params)
    
    if response.status_code == 200:
        print('The request has been accepted for processing.')
        analyses.append(response.json())
        location_ids.append(location_id)
for key, value in zip(location_ids, analyses):
  analyses_dic[key] = value

The request has been accepted for processing.
The request has been accepted for processing.
The request has been accepted for processing.


## Part 3. Plot and interpret Beam output

Beam's decomposition process splits the demand data into baseline demand and remainder components. Additionally, Beam correlates your demand with PredictHQ events, giving you an accurate, customized view of the relationships between events and your demand.

Now that we have the Beam output, we can take location_id == store_1 as an example and convert it into a dataframe.

In [None]:
beam_analysis = pd.DataFrame(analyses_dic['store_1']['dates'])
# Extract required columns
beam_analysis = beam_analysis[['date', 'actual_demand', 'baseline_demand', 'remainder', 'impact']]

Please note that the demand time series is decomposed into a baseline demand time series and a remainder time series.

`Baseline demand`: The baseline demand time series represents the estimated demand and contains information about trends and seasonality within the demand time series.

`Event impact`: For events belonging to attended categories, the corresponding daily total attendance represents the event impact.

`Remainder`: The remainder is the difference between the demand time series and the baseline demand time series.
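
As a quick check of the definitions above, the remainder can be recomputed from the other two series. The following is a minimal sketch with made-up numbers that uses the same column names as `beam_analysis`. The `remainder_check` column is a hypothetical name, and values returned by Beam may differ slightly because of rounding.

```python
import pandas as pd

# Made-up values for illustration only; real values come from the Beam response
example = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "actual_demand": [120.0, 95.0, 140.0],
    "baseline_demand": [100.0, 100.0, 110.0],
})
# Remainder = actual demand minus baseline demand
example["remainder_check"] = example["actual_demand"] - example["baseline_demand"]
print(example)
```

A positive remainder marks a day where demand exceeded the baseline, and a negative remainder marks a day where it fell short.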

In [None]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(x=beam_analysis.date, y=beam_analysis.actual_demand, name='actual_demand',mode='lines+markers')
)
fig.add_trace(
    go.Scatter(x=beam_analysis.date, y=beam_analysis.baseline_demand, name="baseline_demand",mode='lines+markers')
)

fig.add_trace(go.Scatter(
    x=beam_analysis.date, y=beam_analysis.impact, name="event_impact",
    yaxis="y2",mode='lines+markers'
    # ,fill='tozeroy'
))

# Create axis objects
fig.update_layout(
    xaxis=dict(
        domain=[0.0, 0.7]
    ),
    yaxis=dict(
        title="demand"
    ),
    yaxis2=dict(
        title="impact",
        anchor="x",
        overlaying="y",
        side="right",
    ),
    yaxis3=dict(
        title="public_holidays",
        anchor="free",
        overlaying="y",
        side="right",
        position=0.85
    )
)

# Update layout properties
# fig.update_layout(
#     width=1000,
# )

fig.show()

## Part 4.  Identify relevant features using Feature Importance

Feature Importance provides feature importance results for an existing analysis, and returns a list of feature groups with associated Features API features and group p-values.

These values represent each group of features' statistical significance when it comes to impacting observable incremental/decremental changes in demand.

You might need to refresh an analysis to generate insights on the most relevant features for a location if the analysis was created before October 2023.

In [None]:
# Uncomment and run this if you created your analysis before October 2023
# for index, location in locations.iterrows():
#     response = requests.post(
#         f"{BEAM_URL}/{location['analysis_id']}/refresh",
#         headers={"Authorization": "Bearer " + ACCESS_TOKEN},
#     )
#     if response.status_code == 202:
#         print(f"The analysis {location['analysis_id']} has been refreshed.")

### 4.1 Get Feature Importance

In [None]:
feature_importance_list = []
for index, location in locations.iterrows():
    url = f"{BEAM_URL}/{location['analysis_id']}/feature-importance"
    response = requests.get(
        url=url,
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Accept": "application/json"
        })
    if response.status_code == 200:
        print('The request has been accepted for processing.')
        feature_importance_list.append(response.json()['feature_importance'])
    else:
        # Keep the list aligned with the locations dataframe if a request fails
        print(f"Error getting feature importance for {location['location_id']}: {response.status_code}")
        feature_importance_list.append([])
locations['feature_importance_list'] = feature_importance_list

The request has been accepted for processing.
The request has been accepted for processing.
The request has been accepted for processing.


In [None]:
locations

Unnamed: 0,location_id,lat,lon,industry,radius,unit,analysis_id,feature_importance_list
0,store_0,37.784,-122.404,restaurants,1.29,mi,U8S2xAMmyRI,"[{'feature_group': 'conferences', 'features': ..."
1,store_1,47.611,-122.338,restaurants,1.31,mi,-dYjPRzPVO4,"[{'feature_group': 'severe-weather', 'features..."
2,store_2,40.735,-73.869,restaurants,1.31,mi,D_pQxmXIavM,"[{'feature_group': 'expos', 'features': ['phq_..."


### 4.2 Feature Importance interpretation

`feature_group` is the name of the group, which typically refers to an event category, such as concerts, conferences, etc.

`features` is the names of the features in the feature group. These refer directly to features available in Features API.

`p_value`: The p-value associated with this feature group for this analysis. It indicates how important the features in the group are in terms of demand. The lower the p-value, the more important the feature group is.
- p-value < 0.05: The impact is very high.
- 0.05 <= p-value < 0.075: The impact is high.
- 0.075 <= p-value < 0.1: The impact is moderate.
- p-value >= 0.1: There is no impact.

`important`: A true or false value indicating whether the feature group is considered important for this analysis. Equivalent to p_value < 0.1. We suggest using this value to determine whether or not to include this group of features in your modeling.
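
The bands above can be expressed as a small lookup. The following is a minimal sketch, taking the band boundaries as 0.05, 0.075 and 0.1. The function name `impact_level` and the sample entries are hypothetical.

```python
def impact_level(p_value):
    """Map a feature group's p-value to the impact bands listed above."""
    if p_value < 0.05:
        return "very high"
    if p_value < 0.075:
        return "high"
    if p_value < 0.1:
        return "moderate"
    return "none"


# Hypothetical feature-importance entries (only the fields needed here)
groups = [
    {"feature_group": "conferences", "p_value": 0.01},
    {"feature_group": "concerts", "p_value": 0.06},
    {"feature_group": "observances", "p_value": 0.3},
]
levels = {g["feature_group"]: impact_level(g["p_value"]) for g in groups}
print(levels)
```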

In [None]:
# Add a column to store categories where 'important' is True
locations['important_categories'] = locations['feature_importance_list'].apply(lambda x: [item['feature_group'] for item in x if item['important']])

# Add a column to store the first feature of each important feature group
locations['important_features'] = locations['feature_importance_list'].apply(lambda x: [item['features'][0] for item in x if item['important']])
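
Note that `item['features'][0]` keeps only the first feature of each group. If a group lists several features and you want all of them, a flattened comprehension is one option. The following is a minimal sketch on hand-written entries; the group contents and p-values are hypothetical and only mimic the shape used above.

```python
# Hypothetical feature-importance entries in the shape used above
feature_importance = [
    {"feature_group": "conferences", "features": ["phq_attendance_conferences"],
     "p_value": 0.01, "important": True},
    {"feature_group": "academic",
     "features": ["phq_rank_academic_session", "phq_rank_academic_exam"],
     "p_value": 0.03, "important": True},
    {"feature_group": "concerts", "features": ["phq_attendance_concerts"],
     "p_value": 0.4, "important": False},
]

# Collect every feature from each important group, not only the first one
all_important_features = [
    feature
    for group in feature_importance
    if group["important"]
    for feature in group["features"]
]
print(all_important_features)
```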

In [None]:
locations

Unnamed: 0,location_id,lat,lon,industry,radius,unit,analysis_id,feature_importance_list,important_categories,important_features
0,store_0,37.784,-122.404,restaurants,1.29,mi,U8S2xAMmyRI,"[{'feature_group': 'conferences', 'features': ...","[conferences, expos, sports, festivals, academ...","[phq_attendance_conferences, phq_attendance_ex..."
1,store_1,47.611,-122.338,restaurants,1.31,mi,-dYjPRzPVO4,"[{'feature_group': 'severe-weather', 'features...","[severe-weather, observances, public-holidays,...","[phq_impact_severe_weather_air_quality_retail,..."
2,store_2,40.735,-73.869,restaurants,1.31,mi,D_pQxmXIavM,"[{'feature_group': 'expos', 'features': ['phq_...","[expos, school-holidays, academic, sports]","[phq_attendance_expos, phq_attendance_school_h..."


### 4.3 Get important features from Features API
Once you have identified the important features for each location, the next step is to extract them. This section of the notebook shows how to do that step by step. We also have a [Feature Engineering Guide](https://github.com/predicthq/phq-data-science-docs/blob/master/feature-engineering-guide/feature_engineering_guide.ipynb) notebook that guides you in creating events-based machine learning features and in selecting varying radii for different features.

In [None]:
DATE_FORMAT = "%Y-%m-%d"


phq = Client(access_token=ACCESS_TOKEN)

def get_date_groups(start, end):
    """
    Features API allows a range of up to 90 days, so we have to do several requests
    """

    def _split_dates(s, e):
        capacity = timedelta(days=90)
        interval = 1 + int((e - s) / capacity)
        for i in range(interval):
            yield s + capacity * i
        yield e

    dates = list(_split_dates(start, end))
    for i, (d1, d2) in enumerate(zip(dates, dates[1:])):
        if d2 != dates[-1]:
            d2 -= timedelta(days=1)
        yield d1.strftime(DATE_FORMAT), d2.strftime(DATE_FORMAT)
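
To see what this kind of chunking produces, the sketch below reimplements the same idea in a standalone form and prints the windows for a 200-day span. It is an illustrative variant rather than the notebook's function: `chunk_dates` is a hypothetical name, the 90-day limit is taken from the docstring above, and treating that limit as inclusive of both end dates is an assumption.

```python
from datetime import date, timedelta


def chunk_dates(start, end, max_days=90):
    """Yield (first_day, last_day) windows of at most max_days covering start..end."""
    current = start
    while current <= end:
        # Last day of this window, capped at the overall end date
        window_end = min(current + timedelta(days=max_days - 1), end)
        yield current.isoformat(), window_end.isoformat()
        # The next window starts the day after this one ends
        current = window_end + timedelta(days=1)


windows = list(chunk_dates(date(2023, 1, 1), date(2023, 7, 19)))
for first, last in windows:
    print(first, last)
```

Each window starts the day after the previous one ends, so no date is requested twice and none is skipped.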

#### 4.3.1  Attendance-based features
Because the API call for various types of events differs slightly, we utilize different functions for extracting event features.

In [None]:
CATEGORIIES_ATTENDED = [
    "phq_attendance_sports",
    "phq_attendance_conferences",
    "phq_attendance_expos",
    "phq_attendance_concerts",
    "phq_attendance_festivals",
    "phq_attendance_performing_arts",
    "phq_attendance_community",
    "phq_attendance_school_holidays",
]
# Create a new column to only include important features that are attendance-based
locations['categories_attended_important_features'] = locations['important_features'].apply(lambda x: [item for item in x if item in CATEGORIIES_ATTENDED])

In [None]:
def get_important_features_api_attended_data(lat, lon, start, end, radius, unit, important_categories_attended, rank_threshold = RANK_THRESHOLD):
    """Get attendance based features using features API"""

    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": f"{radius}{unit}"},
        }

        query.update({f"{f}__stats": ["sum"] for f in important_categories_attended})
        query.update(
            {f"{f}__phq_rank": {"gte": rank_threshold} for f in important_categories_attended}
        )

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                elif k in CATEGORIIES_ATTENDED:
                    record[k] = v.get("stats", {}).get("sum")
            result.append(record)

    return result

In [None]:
date_col_name = 'date'
attended_data_list = []
location_ids = []
attended_data_dic = {}
for location_id, grouped_data in grouped:
    # Get the latitude and longitude for each store
    lat = locations[locations['location_id'] == location_id]['lat'].values[0]
    lon = locations[locations['location_id'] == location_id]['lon'].values[0]
    # Only use the date and demand columns from grouped_data for the API request
    individual_demand = grouped_data[['date', 'demand']]
    # Extract min and max dates for each store
    start = individual_demand[date_col_name].min()
    end = individual_demand[date_col_name].max()
    # Get suggested radius and its unit for each store
    radius = locations[locations['location_id'] == location_id]['radius'].values[0]
    unit = locations[locations['location_id'] == location_id]['unit'].values[0]
    # Get the important categories attended for each store
    categories_attended_important_features = locations[locations['location_id'] == location_id]['categories_attended_important_features'].values[0]

    attended_data_df = pd.DataFrame()
    if categories_attended_important_features:
        # Get attendance-based features from Features API
        attended_data = get_important_features_api_attended_data(lat, lon, start, end, radius, unit, categories_attended_important_features, RANK_THRESHOLD)
        attended_data_df = pd.DataFrame(attended_data)
    attended_data_dic[location_id] = attended_data_df

#### 4.3.2  Rank-based features

In [None]:
CATEGORIIES_RANK = [
     "phq_rank_health_warnings",
     "phq_rank_observances",
     "phq_rank_public_holidays",
     "phq_rank_school_holidays",
     "phq_rank_academic_session",
     "phq_rank_academic_exam",
     "phq_rank_academic_holiday"
]
# Create a new column to only include important features that are rank-based
locations['categories_rank_important_features'] = locations['important_features'].apply(lambda x: [item for item in x if item in CATEGORIIES_RANK])

In [None]:
def get_important_features_api_rank_data(lat, lon, start, end, radius, unit, important_categories_rank):
    """Get rank based features using features API"""

    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": f"{radius}{unit}"},
        }

        query.update({f"{f}": True for f in important_categories_rank})

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                elif k in important_categories_rank:
                    record[f"{k}"] = sum(
                        [
                            int(rank_level) * int(level_count)
                            for rank_level, level_count in v.get("rank_levels", {}).items()
                        ]
                    )

            result.append(record)

    return result
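
The function above collapses each feature's `rank_levels` counts into a single number by weighting each rank level by its event count. The following is a minimal sketch of that arithmetic on a hand-written dictionary. The counts are made up, and the level keys `"1"` to `"5"` are an assumption about the response shape.

```python
# Hand-written example in the shape the loop above expects (values are made up)
rank_levels = {"1": 0, "2": 1, "3": 0, "4": 2, "5": 0}

# Weight each rank level by how many events fall into it, then add them up
weighted_rank = sum(int(level) * int(count) for level, count in rank_levels.items())
print(weighted_rank)  # 2 * 1 + 4 * 2 = 10
```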

In [None]:
date_col_name = 'date'
rank_data_list = []
location_ids = []
rank_data_dic = {}
for location_id, grouped_data in grouped:
    # Get the latitude and longitude for each store
    lat = locations[locations['location_id'] == location_id]['lat'].values[0]
    lon = locations[locations['location_id'] == location_id]['lon'].values[0]
    # Only use the date and demand columns from grouped_data for the API request
    individual_demand = grouped_data[['date', 'demand']]
    # Extract min and max dates for each store
    start = individual_demand[date_col_name].min()
    end = individual_demand[date_col_name].max()
    # Get suggested radius and its unit for each store
    radius = locations[locations['location_id'] == location_id]['radius'].values[0]
    unit = locations[locations['location_id'] == location_id]['unit'].values[0]
    # Get the rank-based features for each store
    categories_rank_important_features = locations[locations['location_id'] == location_id]['categories_rank_important_features'].values[0]

    rank_data_df = pd.DataFrame()
    if categories_rank_important_features:
        # Get rank-based features from Features API
        rank_data = get_important_features_api_rank_data(lat, lon, start, end, radius, unit, categories_rank_important_features)
        rank_data_df = pd.DataFrame(rank_data)
    rank_data_dic[location_id] = rank_data_df

#### 4.3.3  Impact-based features
<b> Severe Weather features </b>

Please note that impact-based features (severe weather features) are for the retail industry only. The severe weather features use demand impact patterns. Demand impact patterns calculate the impact duration of a severe weather event and are based on industry-specific information. Our severe weather features are currently designed and tested on data for the retail segment only. If your business is in an industry segment other than retail (e.g. accommodation or travel), these features may not work for you or may be less effective.

In [None]:
CATEGORIES_IMPACT = {
    "phq_impact_severe_weather_air_quality_retail",
    "phq_impact_severe_weather_blizzard_retail",
    "phq_impact_severe_weather_cold_wave_retail",
    "phq_impact_severe_weather_cold_wave_snow_retail",
    "phq_impact_severe_weather_cold_wave_storm_retail",
    "phq_impact_severe_weather_dust_retail",
    "phq_impact_severe_weather_dust_storm_retail",
    "phq_impact_severe_weather_flood_retail",
    "phq_impact_severe_weather_heat_wave_retail",
    "phq_impact_severe_weather_hurricane_retail",
    "phq_impact_severe_weather_thunderstorm_retail",
    "phq_impact_severe_weather_tornado_retail",
    "phq_impact_severe_weather_tropical_storm_retail",
}
# Create a new column to only include important features that are impact-based
locations['categories_impact_important_features'] = locations['important_features'].apply(lambda x: [item for item in x if item in CATEGORIES_IMPACT])

Severe weather events are represented using polygons rather than points to depict the geographical area affected, necessitating a distinct radius setting. 
Learn more about [polygons](https://www.predicthq.com/features/polygons). 
For additional insights on configuring the radius setting, refer to our [ Feature Engineering Guide ](https://github.com/predicthq/phq-data-science-docs/blob/master/feature-engineering-guide/feature_engineering_guide.ipynb).

In [None]:
def get_important_features_api_impact_data(lat, lon, start, end, important_categories_impact, rank_threshold = RANK_THRESHOLD):
    " Get impact based features using features API"
    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": '1m'},
        }

        query.update({f"{f}__stats": ["max"] for f in important_categories_impact})
        query.update(
            {f"{f}__phq_rank": {"gte": rank_threshold} for f in important_categories_impact}
        )

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                else:
                    record[k] = v.get("stats", {}).get("max")

            result.append(record)

    return result
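To make the flattening step in the function above easier to follow, here is a small sketch that applies the same logic to a hand-written record. The shape of the sample dictionary (a `date` plus one entry per feature holding nested `stats`) is an assumption inferred from the loop above, not a documented response format, and `flatten_feature_record` is a hypothetical helper name.

```python
from datetime import date


def flatten_feature_record(feature_dict):
    """Flatten one feature record into {"date": str, feature_name: max_value}."""
    record = {}
    for key, value in feature_dict.items():
        if key == "date":
            record[key] = value.strftime("%Y-%m-%d")
        else:
            # Tolerate a missing or empty entry by falling back to None
            stats = (value or {}).get("stats") or {}
            record[key] = stats.get("max")
    return record


# Hand-written sample (assumed shape, illustrative values)
sample = {
    "date": date(2017, 1, 1),
    "phq_impact_severe_weather_flood_retail": {"stats": {"max": 55}},
    "phq_impact_severe_weather_tornado_retail": None,
}
flat = flatten_feature_record(sample)
print(flat)
```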

In [None]:
date_col_name = 'date'
impact_data_dic = {}
for location_id, grouped_data in grouped:
    # Look up this store's row once, then read its coordinates and important features
    location_row = locations[locations['location_id'] == location_id].iloc[0]
    lat = location_row['lat']
    lon = location_row['lon']
    categories_impact_important_features = location_row['categories_impact_important_features']
    # Extract min and max dates for each store
    start = grouped_data[date_col_name].min()
    end = grouped_data[date_col_name].max()

    # Default to an empty DataFrame when no impact-based feature is important for this store
    impact_data_df = pd.DataFrame()
    if categories_impact_important_features:
        # Get impact-based features from features API
        impact_data = get_important_features_api_impact_data(lat, lon, start, end, categories_impact_important_features, RANK_THRESHOLD)
        impact_data_df = pd.DataFrame(impact_data)
    impact_data_dic[location_id] = impact_data_df

#### 4.3.4 Combine Attendance-based, Rank-based and Impact-based features

In [None]:
combined_dict = {}

# Get all unique keys across the dictionaries
all_keys = set(attended_data_dic.keys()) | set(rank_data_dic.keys()) | set(impact_data_dic.keys())

# Iterate through each unique key
for key in all_keys:
    dfs_to_concat = [attended_data_dic.get(key), rank_data_dic.get(key), impact_data_dic.get(key)]
    # Remove any None values in case a key is not present in one of the dictionaries
    dfs_to_concat = [df for df in dfs_to_concat if df is not None]
    # Set 'date' column as index for each DataFrame
    dfs_to_concat = [df.set_index('date') if 'date' in df.columns else df for df in dfs_to_concat]
    # Concatenate the DataFrames along the columns (axis=1)
    combined_df = pd.concat(dfs_to_concat, axis=1)
    # Reset the index of the combined DataFrame to have 'date' back as a column
    combined_df.reset_index(inplace=True)
    # Store the combined DataFrame in the combined_dict
    combined_dict[key] = combined_df

You now have the important features organized as a dataframe for each location. Alternatively, you may choose to consolidate them into a single dataframe.
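The column-wise combination can be checked on toy data. The sketch below uses made-up values and shortened variable names, and it drops empty DataFrames explicitly before concatenating. That is an assumption about how you may want to handle locations with no impact-based features, not a change to the cell above.

```python
import pandas as pd

# Toy per-location frames (illustrative values only)
attended = pd.DataFrame(
    {"date": ["2017-01-01", "2017-01-02"], "phq_attendance_sports": [0.0, 1200.0]}
)
ranked = pd.DataFrame(
    {"date": ["2017-01-01", "2017-01-02"], "phq_rank_observances": [0.0, 1.0]}
)
impact = pd.DataFrame()  # no important impact-based features for this location

# Keep only non-empty frames and align them on the 'date' index
frames = [df.set_index("date") for df in (attended, ranked, impact) if not df.empty]

# Concatenate along the columns, then restore 'date' as a regular column
combined = pd.concat(frames, axis=1).reset_index()
print(combined)
```

Each row is one date, with one column per feature that was important for the location.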

#### 4.3.5 Combine features from all locations

In [None]:
dfs_to_concat = []
# Iterate through the location_ids in combined_dict
for location_id, combined_df in combined_dict.items():
    combined_df_copy = combined_df.copy()
    # Add the identifier column to the copy
    combined_df_copy['location_id'] = location_id
    # Append the copy to dfs_to_concat
    dfs_to_concat.append(combined_df_copy)
# Concatenate the DataFrames along the rows (axis=0)
result_df = pd.concat(dfs_to_concat, ignore_index=True)
# Move 'location_id' to the front. Use result_df's own columns so that features
# present only for some locations are not dropped.
cols = ['location_id'] + [col for col in result_df.columns if col != 'location_id']
result_df = result_df[cols]
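As an alternative sketch, `pd.concat` accepts a dictionary of DataFrames directly and can turn its keys into an index level, which then becomes the `location_id` column. The toy dictionary below stands in for `combined_dict`; its contents are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for combined_dict (illustrative values only)
toy_combined = {
    "store_1": pd.DataFrame({"date": ["2017-01-01"], "phq_attendance_sports": [100.0]}),
    "store_2": pd.DataFrame({"date": ["2017-01-01"], "phq_rank_observances": [1.0]}),
}

stacked = (
    # Dictionary keys become the outer index level, named 'location_id'
    pd.concat(toy_combined, names=["location_id"])
    # Move that level into a regular column, placed first
    .reset_index(level="location_id")
    # Replace the leftover per-location index with a fresh 0..n-1 index
    .reset_index(drop=True)
)
print(stacked)
```

Features that were important for only one location appear as `NaN` for the others, which matches the behavior of the row-wise concatenation above.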

In [None]:
result_df.head()

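| | location_id | date | phq_attendance_concerts | phq_attendance_conferences | phq_attendance_expos | phq_attendance_festivals | phq_attendance_performing_arts | phq_attendance_school_holidays | phq_attendance_sports | phq_rank_observances | phq_rank_academic_exam |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | store_2 | 2017-01-01 | NaN | NaN | 0.0 | NaN | NaN | 0.0 | 0.0 | NaN | 0.0 |
| 1 | store_2 | 2017-01-02 | NaN | NaN | 0.0 | NaN | NaN | 0.0 | 0.0 | NaN | 0.0 |
| 2 | store_2 | 2017-01-03 | NaN | NaN | 0.0 | NaN | NaN | 0.0 | 0.0 | NaN | 0.0 |
| 3 | store_2 | 2017-01-04 | NaN | NaN | 0.0 | NaN | NaN | 0.0 | 0.0 | NaN | 0.0 |
| 4 | store_2 | 2017-01-05 | NaN | NaN | 0.0 | NaN | NaN | 0.0 | 0.0 | NaN | 0.0 |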

