# Problem

We’re building a simple model to forecast nutritional program intake using food insecurity and demographic data. The goal is to understand how population and economic context relate to participation over time.

# Loading Datasets

Load all three datasets and inspect their columns to find the keys they share for merging.

In [617]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

In [618]:
nutritional_df = pd.read_csv('./data/Nutritional_Programming_West.csv')


In [619]:
nutritional_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149557 entries, 0 to 149556
Data columns (total 15 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Sex                            111144 non-null  object 
 1   Marital Status                 86940 non-null   object 
 2   Year of Birth                  131425 non-null  float64
 3   Birth Province / Country Code  53135 non-null   object 
 4   Home Province Code             95691 non-null   object 
 5   Creation Date                  128297 non-null  object 
 6   Last Modified                  88018 non-null   object 
 7   employmentstatus_DC            21594 non-null   object 
 8   System_CU                      37245 non-null   object 
 9   City_CU                        29656 non-null   object 
 10  Program_CU                     37245 non-null   object 
 11  Citizenship_CU                 37245 non-null   object 
 12  Primary Diagnosis_CU          

In [620]:
demographic_df = pd.read_csv('./data/Food_insecurity_selected_demographic_characteristics.csv')

In [621]:
demographic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   REF_DATE                        156 non-null    int64  
 1   GEO                             156 non-null    object 
 2   DGUID                           156 non-null    object 
 3   Demographic characteristics     156 non-null    object 
 4   Household food security status  156 non-null    object 
 5   Statistics                      156 non-null    object 
 6   UOM                             156 non-null    object 
 7   UOM_ID                          156 non-null    int64  
 8   SCALAR_FACTOR                   156 non-null    object 
 9   SCALAR_ID                       156 non-null    int64  
 10  VECTOR                          156 non-null    object 
 11  COORDINATE                      156 non-null    object 
 12  VALUE                           152 

In [622]:
economic_df = pd.read_csv('./data/Food_insecurity_economic_family_type.csv')

In [623]:
economic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108 entries, 0 to 107
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   REF_DATE                        108 non-null    int64  
 1   GEO                             108 non-null    object 
 2   DGUID                           108 non-null    object 
 3   Economic family type            108 non-null    object 
 4   Household food security status  108 non-null    object 
 5   Statistics                      108 non-null    object 
 6   UOM                             108 non-null    object 
 7   UOM_ID                          108 non-null    int64  
 8   SCALAR_FACTOR                   108 non-null    object 
 9   SCALAR_ID                       108 non-null    int64  
 10  VECTOR                          108 non-null    object 
 11  COORDINATE                      108 non-null    object 
 12  VALUE                           108 
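
The common-column check described at the top of this section is not shown in the cells above. A minimal sketch of one way to do it, using empty toy frames whose column names are taken from the `info()` outputs (the helper name `common_columns` is hypothetical):

```python
import pandas as pd

def common_columns(*frames):
    """Return the sorted column names shared by every DataFrame passed in."""
    column_sets = [set(frame.columns) for frame in frames]
    return sorted(set.intersection(*column_sets))

# Toy stand-ins carrying a few of the column names listed above
demo = pd.DataFrame(columns=['REF_DATE', 'GEO', 'Demographic characteristics', 'VALUE'])
econ = pd.DataFrame(columns=['REF_DATE', 'GEO', 'Economic family type', 'VALUE'])

shared = common_columns(demo, econ)
print(shared)
```

On the real frames the same call would be `common_columns(demographic_df, economic_df)`. Adding `nutritional_df` would likely return an empty list, since the columns listed for it above have different names.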

# Data Transformations

Make the datasets compatible so they can be merged later.

In [624]:
scaler = StandardScaler()

### Nutritional programming dataset

In [625]:
nutritional_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149557 entries, 0 to 149556
Data columns (total 15 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Sex                            111144 non-null  object 
 1   Marital Status                 86940 non-null   object 
 2   Year of Birth                  131425 non-null  float64
 3   Birth Province / Country Code  53135 non-null   object 
 4   Home Province Code             95691 non-null   object 
 5   Creation Date                  128297 non-null  object 
 6   Last Modified                  88018 non-null   object 
 7   employmentstatus_DC            21594 non-null   object 
 8   System_CU                      37245 non-null   object 
 9   City_CU                        29656 non-null   object 
 10  Program_CU                     37245 non-null   object 
 11  Citizenship_CU                 37245 non-null   object 
 12  Primary Diagnosis_CU          
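
The nutritional dataset has no `Year` or `Region` column, so it cannot be joined to the other two as it stands. One possible way to make it compatible is sketched below. It assumes that `Creation Date` marks program intake and that `Home Province Code` can stand in for the region. Neither assumption is confirmed by the notebook, and the toy records and date format are invented for illustration.

```python
import pandas as pd

# Toy records using two column names from nutritional_df; the values are made up
records = pd.DataFrame({
    'Creation Date': ['2018-03-01', '2018-07-15', '2019-01-20', None],
    'Home Province Code': ['BC', 'BC', 'AB', 'BC'],
})

# Parse the date and keep only the year; unparseable or missing dates become NaN
records['Year'] = pd.to_datetime(records['Creation Date'], errors='coerce').dt.year

# Count records per year and province as a proxy for program intake
intake = (
    records.dropna(subset=['Year', 'Home Province Code'])
    .groupby(['Year', 'Home Province Code'])
    .size()
    .reset_index(name='Intake')
)
intake['Year'] = intake['Year'].astype(int)
print(intake)
```

The resulting `Year` column has the same integer type as the `Year` column derived from `REF_DATE` in the other two datasets.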

### Demographic dataset

In [626]:
demographic_cols_drop = [
    'DGUID', 'VECTOR', 'COORDINATE', 'STATUS', 'SYMBOL',
    'TERMINATED', 'UOM_ID', 'SCALAR_FACTOR', 'SCALAR_ID',
    'Statistics', 'UOM', 'DECIMALS'
]
demographic_df.drop(columns=[c for c in demographic_cols_drop if c in demographic_df.columns], inplace=True)

In [627]:
rename_demographic = {
    'REF_DATE': 'Year',
    'GEO': 'Region',
    'Demographic characteristics': 'Demographic',
    'Household food security status': 'FoodSecurityStatus',
    'VALUE': 'Value',
}
demographic_df.rename(columns=rename_demographic, inplace=True)

In [628]:
demographic_df['Year'] = pd.to_numeric(demographic_df['Year'], errors='coerce')
demographic_df = pd.get_dummies(demographic_df, columns=['Demographic', 'FoodSecurityStatus'], drop_first=True)

In [629]:
demographic_df.head()

Unnamed: 0,Year,Region,Value,Demographic_Females,Demographic_Indigenous population aged 15 years and over,Demographic_Males,Demographic_Persons 18 to 24 years,Demographic_Persons 25 to 34 years,Demographic_Persons 35 to 44 years,Demographic_Persons 45 to 54 years,Demographic_Persons 55 to 64 years,Demographic_Persons 65 years and over,Demographic_Persons under 18 years,Demographic_Recent immigrants (10 years or less) aged 15 years and over,Demographic_Visible minority population,"FoodSecurityStatus_Food insecure, moderate or severe"
0,2018,Canada,16.8,False,False,False,False,False,False,False,False,False,False,False,False,False
1,2019,Canada,15.9,False,False,False,False,False,False,False,False,False,False,False,False,False
2,2020,Canada,15.7,False,False,False,False,False,False,False,False,False,False,False,False,False
3,2021,Canada,18.4,False,False,False,False,False,False,False,False,False,False,False,False,False
4,2022,Canada,22.9,False,False,False,False,False,False,False,False,False,False,False,False,False
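
The all-`False` rows above are what `drop_first=True` produces for the baseline category. `get_dummies` drops the first category in sorted order, so rows in that category have no indicator set. A small sketch, where the category name `'All persons'` is an assumed placeholder for the dropped level:

```python
import pandas as pd

# 'All persons' is a hypothetical baseline label; it sorts before the other two
toy = pd.DataFrame({'Demographic': ['All persons', 'Females', 'Males']})

encoded = pd.get_dummies(toy, columns=['Demographic'], drop_first=True)
print(encoded.columns.tolist())

# The baseline row has every indicator switched off
baseline_is_all_false = not encoded.iloc[0].any()
print(baseline_is_all_false)
```

Dropping one level avoids perfectly collinear indicator columns. The cost is that the baseline category can only be identified by the absence of the others.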


### Economic dataset

In [630]:
economic_cols_drop = [
    'DGUID', 'VECTOR', 'COORDINATE', 'STATUS', 'SYMBOL',
    'TERMINATED', 'UOM_ID', 'SCALAR_FACTOR', 'SCALAR_ID',
    'Statistics', 'UOM', 'DECIMALS'
]
economic_df.drop(columns=[c for c in economic_cols_drop if c in economic_df.columns], inplace=True)


In [631]:
rename_economic = {
    'REF_DATE': 'Year',
    'GEO': 'Region',
    'Economic family type': 'EconomicType',
    'Household food security status': 'FoodSecurityStatus',
    'VALUE': 'Value',
}
economic_df.rename(columns=rename_economic, inplace=True)

In [632]:
economic_df['Year'] = pd.to_numeric(economic_df['Year'], errors='coerce')
economic_df = pd.get_dummies(economic_df, columns=['EconomicType', 'FoodSecurityStatus'], drop_first=True)

# Merging datasets

In [None]:
merged_df = pd.merge(
    demographic_df, economic_df,
    on=['Region', 'Year'],  # 'FoodSecurityStatus' was one-hot encoded above, so it is no longer a column
    how='left'
)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2808 entries, 0 to 2807
Data columns (total 26 columns):
 #   Column                                                                   Non-Null Count  Dtype  
---  ------                                                                   --------------  -----  
 0   Year                                                                     2808 non-null   int64  
 1   Region                                                                   2808 non-null   object 
 2   Value_x                                                                  2736 non-null   float64
 3   Demographic_Females                                                      2808 non-null   bool   
 4   Demographic_Indigenous population aged 15 years and over                 2808 non-null   bool   
 5   Demographic_Males                                                        2808 non-null   bool   
 6   Demographic_Persons 18 to 24 years                                      
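
The jump from 156 demographic rows to 2,808 merged rows is what a many-to-many join produces: 156 × 18 = 2808, which would fit each `Region`/`Year` key matching 18 economic rows. The sketch below reproduces the effect on made-up values and shows how the `validate` argument of `merge` can flag it.

```python
import pandas as pd

# Both sides repeat the same (Region, Year) key; the values are invented
left = pd.DataFrame({'Region': ['Canada'] * 3, 'Year': [2018] * 3, 'Value': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'Region': ['Canada'] * 2, 'Year': [2018] * 2, 'Value': [10.0, 20.0]})

# Every left row pairs with every matching right row: 3 x 2 = 6 rows
merged = left.merge(right, on=['Region', 'Year'], how='left')
print(len(merged), merged.columns.tolist())

# validate='many_to_one' raises if the right-hand keys are not unique
duplicate_keys_detected = False
try:
    left.merge(right, on=['Region', 'Year'], how='left', validate='many_to_one')
except pd.errors.MergeError:
    duplicate_keys_detected = True
print(duplicate_keys_detected)
```

Whether the cross product is intended depends on the modeling goal. If one row per `Region` and `Year` is wanted, each table would need to be aggregated or pivoted to that grain before merging.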

# Prepare for modeling

In [634]:
merged_df = merged_df.sort_values(['Year'])

In [635]:
merged_df['Intake_lag1'] = merged_df.groupby('Region')['Demographic_Persons 18 to 24 years'].shift(1)
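
For reference, this is how a grouped one-step lag behaves on a small panel with one row per region and year. The column name `Intake` and the numbers are hypothetical. In the notebook, the lag is taken over an indicator column with many rows per `Region` and `Year`, so `shift(1)` there returns the previous row rather than the previous year.

```python
import pandas as pd

# Toy panel: one row per (Region, Year), deliberately out of order
panel = pd.DataFrame({
    'Region': ['A', 'B', 'A', 'B', 'A'],
    'Year': [2019, 2018, 2018, 2019, 2020],
    'Intake': [12, 5, 10, 7, 15],
})

# Sort within region by year so that shift(1) means "previous year"
panel = panel.sort_values(['Region', 'Year']).reset_index(drop=True)
panel['Intake_lag1'] = panel.groupby('Region')['Intake'].shift(1)
print(panel)
```

The first year of each region has no predecessor, so its lag is `NaN`. Such rows are usually dropped or imputed before fitting.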

In [636]:
# Candidate numeric columns from the merge; this list is redefined in In [638] before it is used
feature_cols = ['Year', 'Value_x', 'Value_y']

In [637]:
train = merged_df[merged_df['Year'] <= 2019]
test  = merged_df[merged_df['Year'] >  2019]

In [638]:
# Keep the target out of the feature list so it cannot leak into the model
target_col = 'Demographic_Persons 18 to 24 years'
feature_cols = ['Year', 'Value_x', 'Intake_lag1']
X_train, y_train = train[feature_cols], train[target_col]
X_test,  y_test  = test[feature_cols], test[target_col]

In [639]:
# Scale numeric features
scaler.fit(X_train[feature_cols])

StandardScaler()


In [640]:

# Take explicit copies first so the assignments below do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
X_train[feature_cols] = scaler.transform(X_train[feature_cols])
X_test[feature_cols]  = scaler.transform(X_test[feature_cols])
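
The split-then-scale pattern used in this section can be sketched end to end on toy data. The values and the single `Value` column are invented. The scaler is fitted on the training years only and then applied to both parts, so no information from the test years enters the scaling statistics.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with made-up values
data = pd.DataFrame({
    'Year': [2018, 2018, 2019, 2019, 2020, 2021],
    'Value': [10.0, 14.0, 12.0, 16.0, 20.0, 30.0],
})
cols = ['Value']

# Time-based split; .copy() makes each part an independent DataFrame
train = data[data['Year'] <= 2019].copy()
test = data[data['Year'] > 2019].copy()

# Fit on the training rows only, then transform both parts with the same statistics
toy_scaler = StandardScaler().fit(train[cols])
train[cols] = toy_scaler.transform(train[cols])
test[cols] = toy_scaler.transform(test[cols])

print(toy_scaler.mean_)
print(test)
```

After scaling, the training column has zero mean and unit variance. The test values are expressed relative to the training mean, so they need not be centered.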
