<div style="color:#00BFFF">

---

##### Creating Composite Indices


<div style="color:#FF7F50">

**Objective**

</div>

- **Reduce collinearity** among the 123 economic indicators in the dataset by creating composite indices. Many indicators are **components of one another** (totals alongside their granular breakdowns) or are different measures that carry essentially the same information.
- **Creating composite indices** is a preemptive step to **decrease collinearity** before model fitting, so that the final model can be fitted more effectively.
- Consolidating related indicators into single measures simplifies the dataset and should improve model interpretability and performance.
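To make the collinearity concern concrete, here is a minimal sketch on synthetic data. The column names and values are hypothetical and are not taken from the project dataset. It shows that a total stored next to its components is strongly correlated with them and makes the feature matrix rank-deficient.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two regional series and their total.
rng = np.random.default_rng(0)
northeast = rng.normal(100, 10, size=200)
south = rng.normal(300, 30, size=200)
toy = pd.DataFrame(
    {
        "starts_northeast": northeast,
        "starts_south": south,
        "starts_total": northeast + south,  # exact linear combination
    }
)

# Pairwise correlations between the total and its components.
corr = toy.corr()
print(corr.round(2))

# Three columns, but only two of them are linearly independent.
rank = np.linalg.matrix_rank(toy.to_numpy())
print("columns:", toy.shape[1], "rank:", rank)
```

A rank lower than the number of columns means at least one column adds no new information to a linear model, which is the situation the composite indices are meant to avoid.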

<div style="color:#FF7F50">

**Process**

</div>

**1. Importing Composite Indices Mapping**

- We import a pre-defined dictionary, `composite_indices_info`, from a module located in the `utils` subfolder.
- This dictionary maps each composite index name to the economic indicators it is built from and the aggregation method to use (`"mean"` or `"sum"`), indicating how individual data columns should be aggregated.
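The shape of the mapping can be inferred from how `merge_and_clean_data` unpacks it: each key is an index name and each value is a `(columns, method)` pair. The sketch below uses placeholder names only; the real entries live in the `utils` module and are not reproduced here.

```python
# Hypothetical mapping: index name -> (component columns, aggregation method).
# Index and column names are placeholders, not the project's real entries.
example_composite_indices_info = {
    "example_index_a": (["indicator_a1", "indicator_a2"], "mean"),
    "example_index_b": (["indicator_b1", "indicator_b2", "indicator_b3"], "sum"),
}

# Unpack each entry the same way merge_and_clean_data does.
summary = {
    index_name: (len(columns), method)
    for index_name, (columns, method) in example_composite_indices_info.items()
}
print(summary)
```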

**2. Merge and Clean Data**

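- The `merge_and_clean_data` function adds each composite index to the main dataset as a new column.
- It then removes the original indicators that were aggregated, reducing dataset complexity and potential multicollinearity.
- A composite index is created only when at least two of its component columns are present in the dataset; a group with a single available indicator is left untouched.
- Granular indices are dropped in favor of their total indices in a separate step, described in the next section.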
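The aggregation step can be sketched on a toy dataframe as follows. This is an illustration under assumed column names, not the project code. It builds a row-wise mean from two component columns and then drops the components.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one component has a missing value in the last row.
toy = pd.DataFrame(
    {
        "hours_goods": [40.0, 41.0, np.nan],
        "hours_manufacturing": [42.0, 43.0, 44.0],
        "unrelated": [1.0, 2.0, 3.0],
    }
)
components = ["hours_goods", "hours_manufacturing"]

# Row-wise mean; pandas skips NaN by default (skipna=True).
toy["hours_index"] = toy[components].mean(axis=1)

# Remove the components now represented by the composite column.
reduced = toy.drop(columns=components)
print(reduced)
```

Because `mean(axis=1)` skips missing values by default, the last row of the composite is based on the single available component. With `method="sum"` the same default means missing components contribute nothing to the row total, which may or may not be the intended behavior.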

**3. Dataframe Update**

- After merging, the aggregated columns are dropped from `joined_dataset`.
- The `defn` dataset is cleaned to exclude the removed indicators, so that it stays consistent with the columns remaining in `joined_dataset`.
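Keeping `defn` in sync amounts to filtering its `description` column against the list of dropped indicators. A minimal sketch, assuming a `defn` table with a `description` column as in the function below (the `group` column and all values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the definitions table.
defn_example = pd.DataFrame(
    {
        "description": ["indicator_a", "indicator_b", "indicator_c"],
        "group": [1, 1, 2],
    }
)
dropped = ["indicator_a", "indicator_b"]

# Keep only the definitions whose indicator is still in the dataset.
defn_cleaned = defn_example[~defn_example["description"].isin(dropped)]
defn_cleaned = defn_cleaned.reset_index(drop=True)
print(defn_cleaned)
```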


In [None]:
# def create_composite_index(dataframe, columns_to_combine, index_name, method="mean"):
#     """
#     Create a composite index by combining specified columns in a dataframe using a given method.
#     If columns are missing, they are ignored in the calculation.
#     """
#     # Filter out columns that are not in the dataframe
#     existing_columns = [col for col in columns_to_combine if col in dataframe.columns]

#     # If no valid columns are left, raise an error
#     if not existing_columns:
#         raise ValueError(
#             f"None of the specified columns {columns_to_combine} are present in the dataframe."
#         )

#     # Proceed with calculation using existing columns only
#     if method == "mean":
#         composite = dataframe[existing_columns].mean(axis=1)
#     elif method == "sum":
#         composite = dataframe[existing_columns].sum(axis=1)
#     else:
#         raise ValueError("Method must be 'mean' or 'sum'.")

#     return pd.DataFrame({index_name: composite})


# def merge_and_clean_data(joined_dataset, defn, composite_indices_info):
#     """
#     Merge the composite indices into the joined dataset and remove the old indicators,
#     while handling missing columns and avoiding creating indices from single indicators.
#     """
#     columns_to_drop = []

#     for index_name, (columns, method) in composite_indices_info.items():
#         # Filter columns to ensure they exist in the dataset
#         existing_columns = [col for col in columns if col in joined_dataset.columns]

#         # Skip index creation if no columns exist or only one column exists
#         if len(existing_columns) > 1:
#             composite_df = create_composite_index(
#                 joined_dataset, existing_columns, index_name, method
#             )
#             joined_dataset = pd.merge(
#                 joined_dataset,
#                 composite_df,
#                 left_index=True,
#                 right_index=True,
#                 how="left",
#             )
#             columns_to_drop.extend(
#                 existing_columns
#             )  # Add only existing columns to the drop list
#         # Groups with fewer than two existing columns are skipped and left untouched.

#     # Drop columns that were successfully combined into indices, ensuring they exist in the dataset
#     columns_to_drop = [col for col in columns_to_drop if col in joined_dataset.columns]
#     if columns_to_drop:
#         joined_dataset.drop(columns=columns_to_drop, inplace=True)

#     # Remove the definitions of the indicators that were dropped above
#     defn_cleaned = defn[~defn["description"].isin(columns_to_drop)]

#     return joined_dataset, defn_cleaned

In [None]:
# from utils.composite_index_mapping import composite_indices_info

# # Merge the composite indices into the dataset and clean both dataframes
# joined_dataset, defn = merge_and_clean_data(
#     joined_dataset, defn, composite_indices_info
# )

<div style="color:#00BFFF">

---

##### Drop Granular Indices and Keep Total Indices


<div style="color:#FF7F50">

**Dropping Granular Indicators for Enhanced Predictive Power**

</div>

We drop granular indicators in favor of their respective totals to streamline the predictive modeling process, improve model performance, and keep the dataset consistent and interpretable. The reasons are:

**1. Enhanced Predictive Power**

Granular indicators often represent detailed sub-components or sub-categories of a broader metric. While these details can provide insight into specific aspects, they can also introduce noise and redundancy into our predictive models. Keeping only the totals retains the main signal while removing that redundancy.

**2. Reduction of Dimensionality**

Including both granular indicators and their totals inflates the number of features in our dataset. High dimensionality increases computational cost and training time, and raises the risk of overfitting.

**3. Consistency and Interpretability**

Total indicators offer consistency in measurement units and interpretation. Granular indicators may have varying units or scales, making it challenging to compare and analyze them effectively.

**4. Focus on Key Drivers**

In many cases, it is the total figures that most directly affect the outcomes we want to predict, so little is lost by omitting the breakdowns.

</div>
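The scale differences mentioned above can be inspected by comparing per-column standard deviations. The sketch below is one possible way to flag high-variance features. The data, column names, and threshold are all hypothetical, and the source does not state how such features were actually identified.

```python
import pandas as pd

# Hypothetical toy data with very different scales.
toy = pd.DataFrame(
    {
        "rate_pct": [4.1, 4.3, 4.0, 4.2],
        "claims_thousands": [210.0, 450.0, 190.0, 600.0],
    }
)

# Sample standard deviation of each column.
stds = toy.std()

# Hypothetical cut-off; an appropriate value depends on the dataset.
threshold = 10.0
high_std = stds[stds > threshold].index.tolist()
print(high_std)
```

A raw standard deviation depends on the unit of measurement, so a comparison like this is more meaningful after the series have been put on comparable scales.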


In [None]:
# granular_indices_to_drop = [
#     # Granular housing data; keep the totals: 'Housing Starts: Total New Privately Owned', 'New Private Housing Permits (SAAR)'
# "Help-Wanted Index for United States",
# "Avg Weekly Hours : Goods-Producing",
# "Avg Weekly Hours : Manufacturing",
# # "Housing Starts: Total New Privately Owned",
# "Housing Starts, Northeast",
# "Housing Starts, Midwest",
# "Housing Starts, South",
# "Housing Starts, West",
# # "New Private Housing Permits (SAAR)",
# "New Private Housing Permits, Northeast (SAAR)",
# "New Private Housing Permits, Midwest (SAAR)",
# "New Private Housing Permits, South (SAAR)",
# "New Private Housing Permits, West (SAAR)",
# "S&P s Common Stock Price Index: Composite",
# "S&P s Common Stock Price Index: Industrials",
# "Moody s Aaa Corporate Bond Minus FEDFUNDS",
# "Moody s Baa Corporate Bond Minus FEDFUNDS",

# #Features with relatively high standard deviation values:
# "Civilians Unemployed - Less Than 5 Weeks",
# "Civilians Unemployed for 5-14 Weeks",
# "Civilians Unemployed - 15 Weeks & Over",
# "Civilians Unemployed for 15-26 Weeks",
# "Civilians Unemployed for 27 Weeks and Over",
# "Initial Claims",
# "New Orders for Nondefense Capital Goods",
# # "M1 Money Stock",
# "Total Reserves of Depository Institutions",
# "Reserves Of Depository Institutions",
# "S&P s Composite Common Stock: Price-Earnings Ratio",
# "PPI: Crude Materials",
# "Crude Oil, spliced WTI and Cushing",
# "PPI: Metals and metal products:"
# ]
# joined_dataset.drop(columns=granular_indices_to_drop, inplace=True)

# # Delete rows in defn whose description is in granular_indices_to_drop
# # defn = defn[~defn["description"].isin(granular_indices_to_drop)]
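`DataFrame.drop` raises a `KeyError` by default when a listed column is absent, for example if an earlier step already removed it. The following sketch shows a more defensive variant that drops only the columns that are present and keeps the definitions table in sync. The two small dataframes are hypothetical stand-ins for `joined_dataset` and `defn`.

```python
import pandas as pd

# Hypothetical stand-ins for joined_dataset and defn.
dataset = pd.DataFrame(
    {
        "Housing Starts, South": [1.0, 2.0],
        "Housing Starts: Total New Privately Owned": [3.0, 4.0],
    }
)
definitions = pd.DataFrame({"description": list(dataset.columns)})

# The second name is deliberately absent from the toy dataset.
to_drop = ["Housing Starts, South", "Housing Starts, West"]

# Split the requested names into those present and those missing.
present = [col for col in to_drop if col in dataset.columns]
missing = sorted(set(to_drop) - set(present))
if missing:
    print("Not found, skipped:", missing)

# Drop the columns that exist and remove their definitions.
dataset = dataset.drop(columns=present)
definitions = definitions[~definitions["description"].isin(present)]
definitions = definitions.reset_index(drop=True)
```

Passing `errors="ignore"` to `drop` would also avoid the exception, but reporting the missing names makes typos in the list easier to spot.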