This notebook is designed to provide you with the context you need to get started with the [Beam API](https://docs.predicthq.com/resources/beam) and use it effectively.

[Beam](https://www.predicthq.com/beam) is PredictHQ's automated correlation engine, which accurately reveals the events that drive demand for your business. As well as showing you the correlation between events and your demand data, Beam can decompose your demand data, which can help improve your demand forecasting accuracy. For more information on Beam, see the [Beam Overview](https://www.predicthq.com/support/beam-overview).

This notebook covers three essential use cases for the Beam API: bulk data upload, retrieval of decomposed data, and identification of important features. Bulk upload lets you generate multiple analyses at once from your source data. Decomposed data retrieval lets you extract decomposition results through the Beam API after your data has been uploaded. Important feature identification shows you which features and event categories are highly likely to be relevant to your forecasting model.

Decomposing your demand data can improve the accuracy of your forecasts. If you do not currently decompose your data for forecasting, you can use Beam's decomposition functionality to obtain a breakdown of your data. Beam's decomposition process separates your data into baseline demand and remainder components.

1. [Uploading location and demand data to Beam](#part-1-uploading-location-and-demand-data-to-beam)
2. [Generating Beam correlation results](#part-2-generating-correlation-results)
3. [Identifying important features](#part-3-identifying-important-features)
4. [Plotting and interpretation](#part-4-plotting-and-interpretation) 

For more information on Beam API see our [technical documentation](https://docs.predicthq.com/resources/beam).

In [3]:
import pandas as pd 
import os 
import requests
import plotly.graph_objects as go
from predicthq import Client
import collections
import numpy as np
from datetime import datetime, date, timedelta

If you are using Google Colab, uncomment the following code block. It downloads a repository that contains the sample data used in this notebook.

In [4]:
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/beam-api-notebook

## Part 1. Uploading location and demand data to Beam.

Beam works by creating an analysis for each location. A location is usually associated with a physical business location such as a store, hotel, or some other type of office or business location. For example, in a restaurant chain, you may have a location for each restaurant.

When running Beam, you execute it over a set of locations and create a Beam analysis for each location. Then, for each location, you upload your historical demand data, such as historical retail sales data or historical hotel room booking data. This is the demand data that Beam correlates with event data. For each location, you need the latitude and longitude information. Our [Suggested Radius API](https://docs.predicthq.com/resources/suggested-radius) will calculate the radius for you.

The example code below loops over a list of input locations, creates a Beam analysis for each location, and uploads demand data for each location. If you are adapting this for your context, you may be loading demand data from a database or an API to upload into Beam.

You can use this approach to upload data for hundreds or thousands of locations. 

### 1.1 Upload a location file to get suggested radii with suggested radius API.

The provided ACCESS_TOKEN is limited to the demo example. The following link will guide you through creating an account and an access token. Please make sure your API token is up to date so that you can identify feature importance.
https://docs.predicthq.com/guides/quickstart/
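
If you would rather not paste your own token into the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name `PHQ_ACCESS_TOKEN` and the helper `load_access_token` are hypothetical and are not defined by the PredictHQ SDK.

```python
import os

def load_access_token(env_var="PHQ_ACCESS_TOKEN"):
    """Return the token stored in env_var (a hypothetical name), or raise if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your PredictHQ access token."
        )
    return token
```

With the variable set in your shell or Colab session, `ACCESS_TOKEN = load_access_token()` could replace the hard-coded assignment in the next cell.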

In [5]:
ACCESS_TOKEN = 'Uuv2xci7wNAU4NHPBZTmY5KqGiIjdw4ymTMlDc_g'

In [6]:
# Locate the directory where the data is stored
demand_data_wd = '/Users/charlottegao/Desktop/RND/Beam/beam_analyses/demand_data_wd'
# Read in a CSV file with latitude and longitude data for each location
locations = pd.read_csv(f'{demand_data_wd}/location_sample.csv')
# Set urls for API requests
SUGGESTED_RADIUS_URL="https://api.predicthq.com/v1/suggested-radius/"
BEAM_URL = "https://api.predicthq.com/v1/beam/analyses"
FEATURES_API_URL = "https://api.predicthq.com/v1/features"

In [7]:
locations

Unnamed: 0,location_id,lat,lon,industry
0,store_0,37.784,-122.404,restaurants
1,store_1,47.611,-122.338,restaurants
2,store_2,40.735,-73.869,restaurants


In [8]:
def get_suggested_radius(lat_lon, industry, radius_unit):
    """
    Returns the suggested radius for a given latitude and longitude.

    Args:
        lat_lon (str): The latitude and longitude of the location in the format "lat,lon".
        industry (str): The industry of interest that the radius will be calculated for. 
        radius_unit (str): Unit in which the suggested radius will be returned.
        
    Returns:
        float: The suggested radius in your preferred unit.
    """
    # Set the url for the API call
    url = SUGGESTED_RADIUS_URL
    # Set the query parameters for the API call
    params = {
        "location.origin": lat_lon,
        "industry": industry,
        "radius_unit": radius_unit
    }
    # Set the headers for the API call (including the access token)
    headers = {
        "Authorization": "Bearer " + ACCESS_TOKEN,
        "Accept": "application/json"
    }
    # Make the API call and get the JSON response
    response = requests.get(url, params=params, headers=headers).json()
    # Extract the radius from the JSON response and return it
    radius =  response['radius']
    return radius

In [9]:
# Initialize an empty list to store the suggested radius for each location
rads = []
# Please specify your preferred unit here
radius_unit = 'km'
# Loop through each row in the locations dataframe to generate the suggested radius for each location
for index, location in locations.iterrows():
    # Call the get_suggested_radius function to get the suggested radius for each location
    r = get_suggested_radius(f"{location['lat']},{location['lon']}", location['industry'], radius_unit)
    # Append the suggested radius to the list
    rads.append(r)

# Add a new column to the locations dataframe to store the suggested radii
locations["suggested_radius"] = rads
# Add a new column to the locations dataframe to store the suggested radii units
locations["unit"] = radius_unit

In [10]:
locations

Unnamed: 0,location_id,lat,lon,industry,suggested_radius,unit
0,store_0,37.784,-122.404,restaurants,2.08,km
1,store_1,47.611,-122.338,restaurants,2.11,km
2,store_2,40.735,-73.869,restaurants,2.1,km


### 1.2 Get an analysis_id with Beam API for each analysis.

To set parameters for the Beam API, you can refer to our documentation available at [Beam API](https://docs.predicthq.com/resources/beam). It provides detailed information on how to configure the API according to your requirements.

In [11]:
# Specify a rank threshold for the Beam analysis. It is also used to extract event features from the Features API.
RANK_THRESHOLD = 51

In [12]:
def generate_analysis_ids(location_id, lat, lon, radius, radius_unit, rank_threshold = RANK_THRESHOLD):
    """
    Generates an analysis ID for a given location, latitude, longitude, radius and its unit.

    Args:
        location_id (str): The unique ID of the location to generate an analysis for.
        lat (float): The latitude of the location.
        lon (float): The longitude of the location.
        radius (float): The radius to use for the analysis.
        radius_unit(str): The unit of measurement used for radius. 
        rank_threshold (int): The minimum rank threshold for the analysis.

    Returns:
        str: The analysis ID generated by the API.
    """
    # Set the URL and JSON payload for the API call
    url = BEAM_URL
    payload = {
        "name": f"{location_id}_analysis",
        "location": {
            "geopoint": {
                "lat": lat,
                "lon": lon,
            },
            "radius": radius,
            "unit": radius_unit,
        },
        "rank": {
            "type": "phq",
            "levels": {
                "phq": {
                    "min": rank_threshold,
                },
            },
        },
        "tz": "UTC",
    }
    # Make a POST request to the API to generate the analysis ID
    response = requests.post(
        url=url,
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Accept": "application/json",
        },
        json=payload,
    )
    # Extract the analysis ID from the JSON response and return it
    return response.json()['analysis_id']

In [13]:
analysis_ids = []
# Loop through each row in the locations dataframe
for index, location in locations.iterrows():
    # Generate a Beam analysis id for each location
    r = generate_analysis_ids(location['location_id'], str(location['lat']), str(location['lon']), location['suggested_radius'], location['unit'])
    # Add the analysis ids to the list
    analysis_ids.append(r)
locations['analysis_id'] = analysis_ids

In [14]:
locations

Unnamed: 0,location_id,lat,lon,industry,suggested_radius,unit,analysis_id
0,store_0,37.784,-122.404,restaurants,2.08,km,zqtAJYuG4tw
1,store_1,47.611,-122.338,restaurants,2.11,km,ute9ZAJJhBY
2,store_2,40.735,-73.869,restaurants,2.1,km,Pn7xfKyRDMk


### 1.3 Upload the corresponding demand data with the generated analysis_id to Beam.

The example below reads demand data for multiple business locations from a single CSV file. When you use this in the context of your business, you could read the demand data from a database, an API, or an internal product, for example. Your demand data needs to be aggregated to daily values for each location, where date is in YYYY-MM-DD format and demand is a numeric value. Please ensure that your demand data file contains both columns: date and demand. See the example files for how the data should be formatted, and see Upload Demand Data to an Analysis in the [Beam API](https://docs.predicthq.com/resources/beam) documentation for more details.
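
If your raw data is not yet daily, a small preparation step along these lines may help. This is a sketch under stated assumptions: the helper `prepare_daily_demand` is hypothetical, the sample rows are made up, and summing rows that fall on the same day is an assumption (an average may suit your data better).

```python
import pandas as pd

def prepare_daily_demand(raw, date_col="date", demand_col="demand"):
    """Validate raw demand rows and aggregate them to one row per day."""
    missing = {date_col, demand_col} - set(raw.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    df = raw[[date_col, demand_col]].copy()
    # Parse timestamps and keep only the calendar day as YYYY-MM-DD
    df[date_col] = pd.to_datetime(df[date_col]).dt.strftime("%Y-%m-%d")
    # Coerce demand to numbers; non-numeric values raise an error
    df[demand_col] = pd.to_numeric(df[demand_col])
    # Assumption: rows on the same day are summed
    daily = df.groupby(date_col, as_index=False)[demand_col].sum()
    return daily.rename(columns={date_col: "date", demand_col: "demand"})

# Made-up sample with two rows on the same day
raw = pd.DataFrame({
    "date": ["2023-01-01 09:00", "2023-01-01 18:30", "2023-01-02 12:00"],
    "demand": [10, 5, 7],
})
daily = prepare_daily_demand(raw)
print(daily.to_json(orient="records"))
```

The resulting frame has exactly the two columns the upload step expects, so it can be passed to `to_json(orient='records')` in the same way as `individual_demand` below.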

In [15]:
# Read in the data and loop through the location IDs
demand_data = pd.read_csv(f'{demand_data_wd}/demand_data_sample.csv')
grouped = demand_data.groupby('location_id')
for location_id, grouped_data in grouped:
    # Get the corresponding analysis id for each store
    analysis_id = locations[locations['location_id'] == location_id]['analysis_id'].values[0]
    # Only get date and demand from grouped_data
    individual_demand = grouped_data[['date', 'demand']]
    # Convert individual_demand to JSON format
    individual_demand_json = individual_demand.to_json(orient='records')
    # Upload the individual_demand data to Beam
    response = requests.post(
        url=f"{BEAM_URL}/{analysis_id}/sink",
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Content-Type": "application/json"
        },
        data=individual_demand_json
    )
    # Check the status code of the response to see if the request has been accepted
    if response.status_code == 202:
        print('The request has been accepted for processing.') 
    else:
        print(response.content)

The request has been accepted for processing.
The request has been accepted for processing.
The request has been accepted for processing.


### 1.4 Check the readiness_status before correlating

Now, we can make a request to the Beam API to retrieve the analysis data for each location. The "readiness_status" indicates whether the data has been successfully uploaded and processed. Please ensure that the readiness_status is "ready" before you continue.

In [22]:
# Please ensure that the readiness_status is "ready" before you continue. This might take a couple of minutes.
for index, location in locations.iterrows():
    response = requests.get(
    url = f"{BEAM_URL}/{location['analysis_id']}",
    headers = {
      "Authorization": "Bearer " + ACCESS_TOKEN,
      "Accept": "application/json"
    })
    print(f"{response.json()['name']}: {response.json()['readiness_status']}")
    # print(response.json()) # uncomment this line if you want more information about the analyses

store_0_analysis: ready
store_1_analysis: ready
store_2_analysis: ready
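
The check above runs once per location. If you prefer to wait automatically, a small polling helper is one option. This is a minimal sketch: `wait_until_ready`, the `fetch_status` callable, and the default intervals are assumptions for illustration; only the `"ready"` status value comes from this notebook.

```python
import time

def wait_until_ready(fetch_status, poll_seconds=30, timeout_seconds=600, sleep=time.sleep):
    """Call fetch_status() repeatedly until it returns 'ready' or the timeout elapses.

    fetch_status is any zero-argument callable that returns the current
    readiness_status string. Returns True when ready, False on timeout.
    """
    waited = 0
    while True:
        if fetch_status() == "ready":
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

With the names used in this notebook, `fetch_status` could be something like `lambda: requests.get(f"{BEAM_URL}/{analysis_id}", headers=headers).json()["readiness_status"]`, where `headers` holds the same Authorization and Accept values as in the cell above.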


In [23]:
locations

Unnamed: 0,location_id,lat,lon,industry,suggested_radius,unit,analysis_id
0,store_0,37.784,-122.404,restaurants,2.08,km,zqtAJYuG4tw
1,store_1,47.611,-122.338,restaurants,2.11,km,ute9ZAJJhBY
2,store_2,40.735,-73.869,restaurants,2.1,km,Pn7xfKyRDMk


## Part 2.  Generating correlation results

Correlation results show you your decomposed demand data as well as the event impact data for each location. This is the same data that you can see in the Beam UI - see [Viewing the Time Series Impact Analysis](https://www.predicthq.com/support/viewing-the-time-series-impact-analysis). You can see correlation where there are remainder values corresponding with significant event impact values.
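
To make the last point concrete, the sketch below looks at how remainder and impact move together on a tiny made-up series. The values are hypothetical, and the plain Pearson correlation used here is only an illustration; it is not a description of how Beam computes its results.

```python
import pandas as pd

# Hypothetical values, using the 'remainder' and 'impact' column names from the Beam output
sample = pd.DataFrame({
    "date": ["2023-03-01", "2023-03-02", "2023-03-03", "2023-03-04", "2023-03-05"],
    "remainder": [-2.0, 1.0, 14.0, 9.0, -1.0],
    "impact": [0.0, 0.0, 1200.0, 800.0, 0.0],
})

# Pearson correlation between the remainder and the event impact
correlation = sample["remainder"].corr(sample["impact"])

# Days where a non-zero event impact coincides with a positive remainder
event_days = sample[(sample["impact"] > 0) & (sample["remainder"] > 0)]

print(round(correlation, 3))
print(event_days["date"].tolist())
```

In this made-up series the two days with event impact are also the days with the largest remainders, which is the pattern to look for in your own results.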

In [24]:
date_col_name = 'date'
analyses = []
location_ids = []
analyses_dic = {}
for location_id, grouped_data in grouped:
    # Get the corresponding analysis id for each store
    analysis_id = locations[locations['location_id'] == location_id]['analysis_id'].values[0]
    # Only use the date and demand columns from grouped_data for the API request
    individual_demand = grouped_data[['date', 'demand']]
    # Extract min and max dates for each store
    min_date = individual_demand[date_col_name].min()
    max_date = individual_demand[date_col_name].max()
    # Set parameters for Beam API request
    url = f"{BEAM_URL}/{analysis_id}/correlate"
    headers = {
        "Authorization": "Bearer " + ACCESS_TOKEN,
        "Accept": "application/json"
    }
    params = {
        "date.gte": min_date,
        "date.lte": max_date
    }

    response = requests.get(
        url=url,
        headers=headers,
        params=params)
    
    if response.status_code == 200:
        print('The request has been accepted for processing.')
        analyses.append(response.json())
        location_ids.append(location_id)
for key, value in zip(location_ids, analyses):
  analyses_dic[key] = value

The request has been accepted for processing.
The request has been accepted for processing.
The request has been accepted for processing.


## Part 3. Identifying important features

If an analysis was created previously, you might need to refresh it to generate insights on the most relevant categories for its location.

In [19]:
# # Skip this if you're identifying important features for newly created analyses
# for index, location in locations.iterrows():
#   response = requests.post(
#     f"{BEAM_URL}/{location['analysis_id']}/refresh",
#     headers={"Authorization": "Bearer " + ACCESS_TOKEN})
#   if response.status_code == 202:
#     print(f"The analysis {location['analysis_id']} has been refreshed.")

### 3.1 Generate a feature importance list for each location

In [25]:
feature_importance_list = []
for index, location in locations.iterrows():
  url = f"{BEAM_URL}/{location['analysis_id']}/feature-importance"
  response = requests.get(
    url= url,
    headers={
      "Authorization": "Bearer " + ACCESS_TOKEN,
      "Accept": "application/json"
    })
  if response.status_code == 200:
    print('The request has been accepted for processing.')
    feature_importance = response.json()['feature_importance']
    feature_importance_list.append(feature_importance)
locations['feature_importance_list'] = feature_importance_list

The request has been accepted for processing.
The request has been accepted for processing.
The request has been accepted for processing.


### 3.2 Feature importance output interpretation
##### Feature_group: The name of the group, which typically refers to an event category such as concerts or conferences.
##### Features: The names of the features in the feature group. These refer directly to features available in the Features API.
##### P_value: The p-value associated with this feature group for this analysis. It indicates how important the features in the group are in terms of demand. The lower the p-value, the more important the feature group is.
`0 <= p-value <= 0.05: The impact is very high. ` <br>
`0.05 < p-value <= 0.075: The impact is high. ` <br>
`0.075 < p-value <= 0.1: The impact is moderate. ` <br>
##### Important: A true or false value indicating whether the feature group is considered important for this analysis. Equivalent to p_value < 0.1. We suggest using this value to determine whether or not to include this group of features in your modeling.
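
The p-value bands above can be expressed as a small lookup. This is a sketch for illustration: the helper `impact_label` is hypothetical, and the label `"not important"` for p-values above 0.1 is my own wording rather than a term from the Beam API.

```python
def impact_label(p_value):
    """Map a feature-group p-value to the impact bands listed above."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= 0.05:
        return "very high"
    if p_value <= 0.075:
        return "high"
    if p_value <= 0.1:
        return "moderate"
    # Assumption: anything above 0.1 falls outside the listed bands
    return "not important"

for p in (0.01, 0.06, 0.09, 0.4):
    print(p, impact_label(p))
```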

In [26]:
# Add a column to store categories where 'important' is True
locations['important_categories'] = locations['feature_importance_list'].apply(lambda x: [item['feature_group'] for item in x if item['important']])

# Add a column to store the first feature name of each feature group where 'important' is True
locations['important_features'] = locations['feature_importance_list'].apply(lambda x: [item['features'][0] for item in x if item['important']])
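
The cell above keeps only the first feature of each important group. If a group lists several features and you want all of them, a variant like the one below could be used. It is a sketch: the helper `all_important_features` is hypothetical, and the sample entries (including the group names and p-values) are made up to mirror the field names described in section 3.2.

```python
# Made-up entries using the field names feature_group, features, p_value and important
feature_importance = [
    {"feature_group": "concerts", "features": ["phq_attendance_concerts"],
     "p_value": 0.01, "important": True},
    {"feature_group": "academic", "features": ["phq_rank_academic_session", "phq_rank_academic_exam"],
     "p_value": 0.04, "important": True},
    {"feature_group": "sports", "features": ["phq_attendance_sports"],
     "p_value": 0.4, "important": False},
]

def all_important_features(entries):
    """Return every feature name from groups flagged important, in order and without duplicates."""
    names = []
    for entry in entries:
        if entry["important"]:
            for name in entry["features"]:
                if name not in names:
                    names.append(name)
    return names

print(all_important_features(feature_importance))
```

Applied to the notebook's dataframe, this would look like `locations['feature_importance_list'].apply(all_important_features)`.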

### 3.3 Get corresponding features from Features API
Once you have identified the important features for each location, the next step is to extract these features. This section of the notebook provides step-by-step instructions on how to do it. We also have a [Feature Engineering Guide](https://github.com/predicthq/phq-data-science-docs/blob/master/feature-engineering-guide/feature_engineering_guide.ipynb) notebook that guides you in creating events-based machine learning features. That notebook also provides guidance on selecting varying radii for different features.

In [27]:
DATE_FORMAT = "%Y-%m-%d"


phq = Client(access_token=ACCESS_TOKEN)

def get_date_groups(start, end):
    """
    The Features API allows a date range of up to 90 days per request, so longer ranges are split into several requests
    """

    def _split_dates(s, e):
        capacity = timedelta(days=90)
        interval = 1 + int((e - s) / capacity)
        for i in range(interval):
            yield s + capacity * i
        yield e

    dates = list(_split_dates(start, end))
    for i, (d1, d2) in enumerate(zip(dates, dates[1:])):
        if d2 != dates[-1]:
            d2 -= timedelta(days=1)
        yield d1.strftime(DATE_FORMAT), d2.strftime(DATE_FORMAT)
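
To see what this kind of date chunking produces, here is a standalone sketch of the same idea. It is not the function above: `split_date_range` is a hypothetical alternative that treats the limit as 90 calendar days inclusive (an assumption) and yields windows that never overlap.

```python
from datetime import date, timedelta

def split_date_range(start, end, max_days=90):
    """Yield (window_start, window_end) ISO date strings, inclusive, each at most max_days long."""
    if end < start:
        raise ValueError("end must not be before start")
    window_start = start
    while window_start <= end:
        # The window ends max_days - 1 days after it starts, or at the overall end date
        window_end = min(window_start + timedelta(days=max_days - 1), end)
        yield window_start.isoformat(), window_end.isoformat()
        # The next window starts the day after this one ends
        window_start = window_end + timedelta(days=1)

windows = list(split_date_range(date(2023, 1, 1), date(2023, 7, 1)))
for window in windows:
    print(window)
# ('2023-01-01', '2023-03-31')
# ('2023-04-01', '2023-06-29')
# ('2023-06-30', '2023-07-01')
```

Each pair can be used as the `active__gte` and `active__lte` values of one request.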

#### 3.3.1  Attendance based features
Because the API call for various types of events differs slightly, we utilize different functions for extracting event features.

In [28]:
categories_attended = [
    "phq_attendance_sports",
    "phq_attendance_conferences",
    "phq_attendance_expos",
    "phq_attendance_concerts",
    "phq_attendance_festivals",
    "phq_attendance_performing_arts",
    "phq_attendance_community",
    "phq_attendance_school_holidays",
]
# Create a new column to only include important features that are attendance based
locations['categories_attended_important_features'] = locations['important_features'].apply(lambda x: [item for item in x if item in categories_attended])

In [29]:
def get_important_features_api_attended_data(lat, lon, start, end, radius, unit, rank_threshold, important_categories_attended):
    "Get attendance based features using features API"
    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()
    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": f"{radius}{unit}"},
        }
        query.update({f"{f}__stats": ["sum"] for f in important_categories_attended})
        query.update(
            {f"{f}__phq_rank": {"gte": rank_threshold} for f in important_categories_attended}
        )

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                elif k in categories_attended:
                    record[k] = v.get("stats", {}).get("sum")
            result.append(record)

    return result

In [30]:
date_col_name = 'date'
grouped = demand_data.groupby('location_id')
attended_data_list = []
location_ids = []
attended_data_dic = {}
for location_id, grouped_data in grouped:
    # Get the latitude and longitude for each store
    lat = locations[locations['location_id'] == location_id]['lat'].values[0]
    lon = locations[locations['location_id'] == location_id]['lon'].values[0]
    # Only use the date and demand columns from grouped_data for the API request
    individual_demand = grouped_data[['date', 'demand']]
    # Extract min and max dates for each store
    start = individual_demand[date_col_name].min()
    end = individual_demand[date_col_name].max()
    # Get suggested radius and its unit for each store
    radius = locations[locations['location_id'] == location_id]['suggested_radius'].values[0]
    unit = locations[locations['location_id'] == location_id]['unit'].values[0]
    # Use the rank threshold specified for the Beam analysis
    rank_threshold = RANK_THRESHOLD
    # Get the important attendance-based categories for each store
    categories_attended_important_features = locations[locations['location_id'] == location_id]['categories_attended_important_features'].values[0]
    # Get attended data from the Features API
    attended_data = get_important_features_api_attended_data(lat, lon, start, end, radius, unit, rank_threshold, categories_attended_important_features)
    attended_data_list.append(attended_data)
    location_ids.append(location_id)
for key, value in zip(location_ids, attended_data_list):
  attended_data_dic[key] = value

#### 3.3.2  Rank based features

In [32]:
categories_rank = [
     "phq_rank_health_warnings",
     "phq_rank_observances",
     "phq_rank_public_holidays",
     "phq_rank_school_holidays",
     "phq_rank_academic_session",
     "phq_rank_academic_exam",
     "phq_rank_academic_holiday"
]
# Create a new column to only include important features that are rank based
locations['categories_rank_important_features'] = locations['important_features'].apply(lambda x: [item for item in x if item in categories_rank])

In [33]:
def get_important_features_api_rank_data(lat, lon, start, end, radius, unit, important_categories_rank):
    " Get rank based features using features API"
    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    # Collect one record per date, merging the results of the per-category requests
    records_by_date = {}
    for gte, lte in get_date_groups(start, end):
        for category in important_categories_rank:
            # use a different radius setting for phq_rank_observances and phq_rank_public_holidays
            if category in ['phq_rank_observances', 'phq_rank_public_holidays']:
                # Set a specific radius for the selected categories
                radius_for_category = "1mi"
            else:
                # Use the default radius and unit for other categories
                radius_for_category = f"{radius}{unit}"

            # Request only the current category so that its radius applies to it alone
            query = {
                "active__gte": gte,
                "active__lte": lte,
                "location__geo": {"lat": lat, "lon": lon, "radius": radius_for_category},
                category: True,
            }

            features = phq.features.obtain_features(**query)
            for feature in features:
                feature_dict = feature.to_dict()
                date_str = feature_dict["date"].strftime(DATE_FORMAT)
                record = records_by_date.setdefault(date_str, {"date": date_str})
                value = feature_dict.get(category)
                if value is not None:
                    # keep the sum of all rank levels only
                    record[category] = sum(
                        float(level_count)
                        for level_count in value.get("rank_levels", {}).values()
                    )
    return list(records_by_date.values())

In [34]:
date_col_name = 'date'
grouped = demand_data.groupby('location_id')
rank_data_list = []
location_ids = []
rank_data_dic = {}
for location_id, grouped_data in grouped:
    # Get the latitude and longitude for each store
    lat = locations[locations['location_id'] == location_id]['lat'].values[0]
    lon = locations[locations['location_id'] == location_id]['lon'].values[0]
    # Only use the date and demand columns from grouped_data for the API request
    individual_demand = grouped_data[['date', 'demand']]
    # Extract min and max dates for each store
    start = individual_demand[date_col_name].min()
    end = individual_demand[date_col_name].max()
    # Get suggested radius and its unit for each store
    radius = locations[locations['location_id'] == location_id]['suggested_radius'].values[0]
    unit = locations[locations['location_id'] == location_id]['unit'].values[0]
    # Get the important rank-based categories for each store
    categories_rank_important_features = locations[locations['location_id'] == location_id]['categories_rank_important_features'].values[0]
    # Get rank data from the Features API
    rank_data = get_important_features_api_rank_data(lat, lon, start, end, radius, unit, categories_rank_important_features)
    rank_data_list.append(rank_data)
    location_ids.append(location_id)
for key, value in zip(location_ids, rank_data_list):
  rank_data_dic[key] = value

#### 3.3.3  Impact based features
<b> Severe weather features </b>

Please note that impact-based features (severe weather features) are for the retail industry only. The severe weather features use demand impact patterns. Demand impact patterns calculate the impact duration of a severe weather event and are based on industry-specific information. Our severe weather features are currently designed and tested on data for the retail segment only. If your business is in an industry segment other than retail (e.g. accommodation or travel), these features may not work for you or may be less effective.

In [35]:
categories_impact = {
    "phq_impact_severe_weather_air_quality_retail",
    "phq_impact_severe_weather_blizzard_retail",
    "phq_impact_severe_weather_cold_wave_retail",
    "phq_impact_severe_weather_cold_wave_snow_retail",
    "phq_impact_severe_weather_cold_wave_storm_retail",
    "phq_impact_severe_weather_dust_retail",
    "phq_impact_severe_weather_dust_storm_retail",
    "phq_impact_severe_weather_flood_retail",
    "phq_impact_severe_weather_heat_wave_retail",
    "phq_impact_severe_weather_hurricane_retail",
    "phq_impact_severe_weather_thunderstorm_retail",
    "phq_impact_severe_weather_tornado_retail",
    "phq_impact_severe_weather_tropical_storm_retail",
}
# Create a new column to only include important features that are impact based
locations['categories_impact_important_features'] = locations['important_features'].apply(lambda x: [item for item in x if item in categories_impact])

In [36]:
def get_important_features_api_impact_data(lat, lon, start, end, rank_threshold, categories_impact):
    " Get impact based features using features API"
    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": "1mi"},
        }

        query.update({f"{f}__stats": ["max"] for f in categories_impact})
        query.update(
            {f"{f}__phq_rank": {"gte": rank_threshold} for f in categories_impact}
        )

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                elif k in categories_impact:
                    record[k] = v.get("stats", {}).get("max")

            result.append(record)

    return result

In [37]:
date_col_name = 'date'
grouped = demand_data.groupby('location_id')
impact_data_list = []
location_ids = []
impact_data_dic = {}
for location_id, grouped_data in grouped:
    # Get the latitude and longitude for each store
    lat = locations[locations['location_id'] == location_id]['lat'].values[0]
    lon = locations[locations['location_id'] == location_id]['lon'].values[0]
    # Only use the date and demand columns from grouped_data for the API request
    individual_demand = grouped_data[['date', 'demand']]
    # Extract min and max dates for each store
    start = individual_demand[date_col_name].min()
    end = individual_demand[date_col_name].max()
    # Use the rank threshold specified for the Beam analysis
    rank_threshold = RANK_THRESHOLD
    # Get the important impact-based categories for each store
    try:
        categories_impact_important_features = locations[locations['location_id'] == location_id]['categories_impact_important_features'].values[0]
    except IndexError:
        # Handle the case where categories_impact_important_features is empty or doesn't exist
        categories_impact_important_features = []  # You can set it to an empty list or handle it as needed
    # Get impact data from the Features API
    # Note: the call below requests every feature in categories_impact, not only the important ones
    impact_data = get_important_features_api_impact_data(lat, lon, start, end, rank_threshold, categories_impact)
    impact_data_list.append(impact_data)
    location_ids.append(location_id)
for key, value in zip(location_ids, impact_data_list):
    impact_data_dic[key] = value

#### 3.3.4  Combine the 3 different types of features

In [39]:
combined_dict = {}
# Iterate through the keys in attended_data_dic
for location_id in attended_data_dic.keys():
    combined_dict[location_id] = []  # Initialize an empty list for the combined data
    attended_data_dic_data = attended_data_dic[location_id]
    rank_data_dic_data = rank_data_dic.get(location_id, [])  # Use an empty list if the key doesn't exist in rank_data_dic
    impact_data_dic_data = impact_data_dic.get(location_id, [])  # Use an empty list if the key doesn't exist in impact_data_dic

    # Iterate through the data entries in attended_data_dic
    for entry1 in attended_data_dic_data:
        # Find the corresponding entry in rank_data_dic_data based on the date
        corresponding_entry2 = next((entry2 for entry2 in rank_data_dic_data if entry2['date'] == entry1['date']), None)
        # Find the corresponding entry in impact_data_dic_data based on the date
        corresponding_entry3 = next((entry3 for entry3 in impact_data_dic_data if entry3['date'] == entry1['date']), None)

        # Merge whichever entries exist for the same date; a missing entry contributes nothing
        combined_entry = {**entry1, **(corresponding_entry2 or {}), **(corresponding_entry3 or {})}
        combined_dict[location_id].append(combined_entry)
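
An alternative to looping over dictionaries is to let pandas join the three record lists on the date column. The sketch below is one way to do that, not the notebook's method: `combine_feature_records` is a hypothetical helper and the sample records are made up. It uses an outer join, so a date that appears in any list is kept and missing values become NaN.

```python
from functools import reduce

import pandas as pd

def combine_feature_records(*record_lists):
    """Outer-join lists of {'date': ..., feature: value} records on the 'date' column."""
    # Skip empty lists so that a feature type with no data does not break the join
    frames = [pd.DataFrame(records) for records in record_lists if records]
    if not frames:
        return pd.DataFrame(columns=["date"])
    merged = reduce(lambda left, right: left.merge(right, on="date", how="outer"), frames)
    return merged.sort_values("date").reset_index(drop=True)

# Made-up records in the same shape as the per-location lists built above
attended = [
    {"date": "2023-01-01", "phq_attendance_concerts": 500.0},
    {"date": "2023-01-02", "phq_attendance_concerts": 0.0},
]
rank = [{"date": "2023-01-01", "phq_rank_public_holidays": 1.0}]
impact = []

combined = combine_feature_records(attended, rank, impact)
print(combined)
```

This assumes the feature columns in the different lists do not share names other than `date`; shared names would receive pandas' merge suffixes.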

#### 3.3.5  Convert the output into DataFrames and store them in separate CSV files

In [42]:
# Iterate through the keys (location_ids) in combined_dict
for location_id, combined_data in combined_dict.items():
    # Create a dataframe for each store
    df = pd.DataFrame(combined_data)
    # Set the 'date' column as the index 
    df.set_index('date', inplace=True)
    # Store the dataframe in a csv file with the location_id specified in the file name
    df.to_csv(f'{demand_data_wd}/{location_id}_beam_analysis_with_features.csv')

## Part 4. Plotting and interpretation

Take location_id == store_0 as an example and convert its correlation results into a dataframe.

In [49]:
beam_analysis = pd.DataFrame(analyses_dic['store_0']['dates'])
# Extract required columns
beam_analysis = beam_analysis[['date', 'actual_demand', 'baseline_demand', 'remainder', 'impact']]

#### Please note that the demand time series is decomposed into a baseline demand time series and a remainder time series.
#### Baseline demand: The baseline demand time series represents the estimated demand and contains information about trends and seasonality within the demand time series.
#### Event impact: For events belonging to attended categories, the corresponding daily total attendance represents the event impact.
#### Remainder: The remainder is the difference between the demand time series and the baseline demand time series.
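
As a small worked example of the remainder definition, the sketch below recomputes it from made-up values. The numbers are hypothetical; only the column names come from the dataframe built above.

```python
import pandas as pd

# Hypothetical values using the column names selected above
decomposed = pd.DataFrame({
    "date": ["2023-03-01", "2023-03-02", "2023-03-03"],
    "actual_demand": [100.0, 130.0, 95.0],
    "baseline_demand": [102.0, 110.0, 98.0],
})

# Remainder = actual demand minus baseline demand
decomposed["remainder"] = decomposed["actual_demand"] - decomposed["baseline_demand"]
print(decomposed["remainder"].tolist())  # [-2.0, 20.0, -3.0]
```

A large positive remainder, such as the second day here, is the kind of value worth comparing against the impact series in the plot below.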

In [51]:
from IPython.display import HTML
fig = go.Figure()
fig.add_trace(
    go.Scatter(x=beam_analysis.date, y=beam_analysis.actual_demand, name='demand',mode='lines+markers')
)
fig.add_trace(
    go.Scatter(x=beam_analysis.date, y=beam_analysis.baseline_demand, name="estimated_demand",mode='lines+markers')
)

fig.add_trace(go.Scatter(
    x=beam_analysis.date, y=beam_analysis.impact, name="impact",
    yaxis="y2",mode='lines+markers'
    # ,fill='tozeroy'
))

# Create axis objects
fig.update_layout(
    xaxis=dict(
        domain=[0.0, 0.7]
    ),
    yaxis=dict(
        title="demand"
    ),
    yaxis2=dict(
        title="impact",
        anchor="x",
        overlaying="y",
        side="right",
    ),
    yaxis3=dict(
        title="public_holidays",
        anchor="free",
        overlaying="y",
        side="right",
        position=0.85
    )
)

# Update layout properties
fig.update_layout(
    width=1000,
)

fig.show()
# HTML(fig.to_html())