### Author: Jianfeng Huang {.unnumbered}
## **Abstract** {.unnumbered}

This research tackles predicting household food insecurity in Canada using machine learning techniques. Specifically, we leveraged ensemble methods and regularisation to predict food insecurity using the Canadian Income Survey (CIS) 2018 dataset. The most effective model was the Regularised Random Forest, which outperformed others in recall rate and test accuracy. Notably, this model demonstrated consistency and reliability when applied to the 2019 CIS data. Feature importance measures revealed key factors influencing food insecurity, such as the age of the oldest household member, provincial tax credits, and education level of the primary earner. The study's findings provide a foundation for policy prioritisation and decision-making, emphasizing the potential of machine learning in predicting food insecurity and informing effective interventions in the context of developed countries such as Canada.

## **Introduction** {.unnumbered}

In 2021, about 16% of households spread across ten provinces in Canada, equivalent to roughly 5.8 million individuals - including close to 1.4 million children under 18 - faced some level of food insecurity. Despite continuous tracking of food insecurity starting from 2005, the issue has remained stagnant with little to no improvement from 2019 to 2021[1].  
The urgency of addressing food insecurity stems from its significant impact on public health, as it profoundly affects individuals' health and welfare. Individuals from food-insecure households exhibit a higher likelihood of being diagnosed with various chronic conditions, encompassing mental health disorders [2], [3], non-communicable diseases [4], [5], and infections [6]-[8]. Moreover, food-insecure individuals are less capable of managing chronic conditions, leading to unfavorable disease outcomes[9],[10], increased hospitalizations[11], and premature death[8], which further imposes a significant burden on the healthcare system, contributing to escalated expenditures[11],[12].  
Data mining and machine learning techniques have been used in several studies as efficient tools for predicting and identifying risk factors associated with food insecurity. However, most of these studies are conducted in the context of developing countries, where factors such as agriculture, weather, and geographical conditions play a crucial role [13]-[15]. Canada, as a developed nation, may have different predictors of food insecurity. Informed by [13] and [16], which emphasized the vital role of predictive models when primary data is scarce, and the promising accuracy of machine learning in determining food security status, this study adopts a similar approach in the Canadian context. Our main goal is to pinpoint key predictive variables that are both accessible and, importantly, relevant within the Canadian context. We also aim to evaluate the predictive performance of these significant variables for a robustness test.  
In Aotearoa New Zealand, even amidst abundant national food reserves and a robust welfare system, food insecurity still plagues a significant fraction of households. This scenario paints a reflective portrait of the circumstances in prosperous capitalist nations, illustrating that ample food supply provisions alone are not sufficient to fully eradicate food insecurity [17]. As highlighted by a Q-methodology study in Aotearoa [18], socio-economic factors, rather than food supply, significantly impact food insecurity. These findings emphasize the crucial role of state interventions, individual socio-economic circumstances, and household-level financial constraints. As a result, when developing a predictive model for the prevalence of food insecurity in Canada, it is imperative to prioritize socio-economic predictors over traditionally emphasized factors like weather and agricultural variables.  
Informed by [15]'s methodology, which uses household survey data to build a predictive model to identify households vulnerable to food insecurity, we employ the Canadian Income Survey (CIS) dataset, which encapsulates predictors identified by [15], enabling our machine learning model to understand food insecurity in Canada.  
Food insecurity is a multifaceted social determinant of health with considerable socio-economic implications in Canada. Prior research, such as [19], has contributed to this understanding, revealing that residence, income, education, aboriginal status, and household structure considerably affect the odds of food insecurity. Importantly, public policies can significantly impact these factors and thus, the prevalence of food insecurity. The complexity of income, social benefits, and various household characteristics in shaping food insecurity trends is unveiled in [20]. It indicates a higher risk of food insecurity for households reliant on social assistance compared to those reliant on wages or retirement incomes. Moreover, the Canada Child Benefit (CCB) was found to substantially alleviate severe food insecurity as a causal effect, particularly among the most vulnerable families [21]. Nevertheless, the characteristics of persistently food-insecure households evolve over time, creating a reciprocal relationship, as suggested by [22]. Considering the scarce resources allocated to safety net programs, it is of utmost importance to efficiently and accurately identify those at greatest risk. Homeownership status, housing debt, and housing expenditure all correlate with the risk of food insecurity [23]. Furthermore, even households reliant on employment income can experience food insecurity, especially when primary earners are less educated, earn lower incomes, or are part of racialized minority groups [24]. These findings underscore the necessity for our predictive model to consider demographics, household characteristics, income, and housing characteristics.  
This study draws upon the methodologies and findings of [13], [15]. We implemented SHapley Additive exPlanations (SHAP) values to discern key predictors in our model. Furthermore, considering the significant class imbalance in our dataset, we employed the Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC). This helped us manage the class imbalance effectively, which in turn enhanced the accuracy of our predictive model by reducing the model's bias towards the dominant class and increasing the predictive accuracy for the infrequent class [25]-[27].

## **Methodology** {.unnumbered}

### **Study Sample** {.unnumbered}

This study utilizes data derived from the 2018 Canadian Income Survey (CIS). The CIS is an annual, nationally representative cross-sectional survey, designed to collate extensive information on individual income, sources of income, and sociodemographic traits. However, it excludes populations residing in institutions, residents on First Nations reserves, and those situated in extremely remote areas, hence covering data for the 10 provinces. Measures of food insecurity have been incorporated into the CIS since 2018. As an income-centric survey, the CIS calibrates its sample weights to align with the T4 population totals for wages and salaries, as recorded by the Canada Revenue Agency. This method ensures that the weighted income distribution in the data, founded on wages and salaries, precisely mirrors that of the Canadian population. This results in a higher response rate and more accurate income information in comparison to traditional sources of national food security statistics such as the Canadian Community Health Survey.

### **Measures** {.unnumbered}

The measurement and surveillance of food insecurity in Canada emphasize a household’s experiences of food insecurity, which pertains to insufficient or unreliable access to food due to financial constraints. The Household Food Security Survey Module (HFSSM), Canada’s primary validated instrument for measuring food insecurity, comprises 18 questions intended to assess food insecurity arising from limited financial resources [27]. A household's responses facilitate the categorization of food insecurity into three levels, as presented in Table 1. All three levels are defined as food insecurity within this study.

**Table 1**: Description of Food Insecurity Level

| Food Insecurity Level | Description |
| --- | --- |
| Marginal food insecurity | Concerns about running out of food and/or limited food selection due to a lack of money for food. |
| Moderate food insecurity | Compromises in quality and/or quantity of food due to a lack of money for food. |
| Severe food insecurity | Skipping meals, reducing food intake, and, at the most extreme, going day(s) without food. |

### **Data processing** {.unnumbered}

In our study, the assessment of food insecurity was carried out at the household level, directing our focus toward variables evaluated at the level of the economic family. These variables included Income and Taxes, Government Assistance and Benefits, Employment, and Housing, as well as demographic and education-related factors pertaining to the primary income earner within the economic family. To comprehend the associations between each data feature, and also with the labels, we embarked upon an exploratory data analysis. Variables that exhibited high correlation and those with more than 40% missing data were eliminated, culminating in 37 predictors, as depicted in Table A.1. Despite this, the raw data presented two key challenges:

1) Missing Values:

In our dataset, four variables were identified as having missing values, characterized as either categorical or ordinal, as illustrated in Table 2. The distribution of these variables across different levels of food insecurity was uneven. Table A.2 and Figure A.1 provide an example of the variable 'Type of Dwelling'. Furthermore, given the presence of both categorical and ordinal variables (the latter being treatable as numerical variables in discrete quantities), we employed the Memory Efficient Multiple Imputation by Chained Equations (MICE) algorithm to impute the missing values.


**Table 2**: Variables with Missing Values

| Variable | Description | Variable Type | Count | Proportion |
| --- | --- | --- | --- | --- |
| dwltyp | Type of dwelling | Categorical | 2277 | 0.053656 |
| marstp | Marital status | Categorical | 1859 | 0.043806 |
| eftyp | Economic family type | Categorical | 859 | 0.020242 |
| hlev2g | Highest level of education of person | Ordinal | 1824 | 0.042981 |



2) Imbalance in the Weights of Food Insecurity Levels

The challenge of imbalance in the sample had to be addressed, due to the unequal representation of food insecure (FI) and non-FI households, as illustrated in Table 3. When building a predictive model with an imbalanced class dataset, there is a propensity for the model to disproportionately weight the dominant class [25]. This skew can lead to a reduced predictive accuracy for the less frequently represented class [26]. It is important to note that our predictors are of various types, including numerical, categorical, and ordinal. Therefore, we employ the Synthetic Minority Over-Sampling Technique for Nominal and Continuous (SMOTE-NC) in the process of splitting.


**Table 3**: Proportion of Severity of Food Insecurity and Presence of Food Insecurity

| fschhldm | Label                        | Count  | Proportion |
|----------|------------------------------|--------|------------|
| 0        | Food secure                  | 35394  | 83.4%      |
| 1        | Marginally food insecure     | 2001   | 4.7%       |
| 2        | Moderately food insecure     | 3091   | 7.3%       |
| 3        | Severely food insecure       | 1951   | 4.6%       |


| fschhldm | Label                     | Count  | Proportion |
|----------|---------------------------|--------|------------|
| 0        | Food secure               | 35394  | 83.4%      |
| 1, 2, 3  | Food insecure (combined)  | 7043   | 16.6%      |
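The oversampling idea described above can be sketched in a few lines. This is a simplified illustration of the SMOTE-NC principle, not the implementation used in the study: the function name `smote_nc_sketch`, the toy rows, and the unit penalty for categorical mismatches are all assumptions made here (the original algorithm uses a data-driven penalty instead). Numerical features are interpolated between a minority row and one of its nearest minority neighbours, while categorical features take the most frequent value among those neighbours.

```python
import random
from collections import Counter


def smote_nc_sketch(minority, num_idx, cat_idx, k=3, n_new=5, seed=0):
    """Generate n_new synthetic minority rows (simplified SMOTE-NC idea)."""
    rng = random.Random(seed)

    def distance(a, b):
        # Squared Euclidean distance on numerical features, plus a unit
        # penalty per mismatching categorical feature (a simplification).
        d = sum((a[i] - b[i]) ** 2 for i in num_idx)
        d += sum(1.0 for i in cat_idx if a[i] != b[i])
        return d

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda r: distance(base, r))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        new_row = list(base)
        for i in num_idx:
            # Numerical: a point on the segment between base and partner.
            new_row[i] = base[i] + gap * (partner[i] - base[i])
        for i in cat_idx:
            # Categorical: most frequent value among the k neighbours.
            new_row[i] = Counter(r[i] for r in neighbours).most_common(1)[0][0]
        synthetic.append(new_row)
    return synthetic


# Hypothetical minority-class rows: [income, family size, tenure].
toy_minority = [
    [12000.0, 1, "rent"],
    [15000.0, 2, "rent"],
    [18000.0, 3, "own"],
    [14000.0, 1, "rent"],
    [16000.0, 4, "rent"],
]
new_rows = smote_nc_sketch(toy_minority, num_idx=[0, 1], cat_idx=[2])
```

Because synthetic rows are built only from minority-class rows, every numerical value stays within the range observed for that class, and every categorical value is one that actually occurs in it.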


### **Models** {.unnumbered}

#### **Baseline Model (Non-Machine Learning):** {.unnumbered}

The chosen baseline model for this study is Logistic Regression. This statistical method is ideally suited to analyse the linear relationship between the input variables (features) and the log-odds of the output variable (target). The application of the logistic model, as used in this research, mirrors the precedent set in a prior study that focused on geographic and socio-demographic characteristics related to the prevalence of food insecurity [15].
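As a minimal sketch of the relationship described above, the logistic model can be written as follows. The symbols are assumptions introduced here for illustration: $p_i$ is the probability that household $i$ is food insecure, $x_{ij}$ is the value of predictor $j$ for that household, and $\beta_0, \dots, \beta_k$ are the coefficients to be estimated.

$$
\log\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}
$$

Each coefficient $\beta_j$ is the change in the log-odds of food insecurity for a one-unit increase in predictor $j$, holding the other predictors fixed.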

#### **Machine Learning Models:** {.unnumbered}

##### **Decision Tree:** {.unnumbered}

Contrary to logistic regression, a decision tree does not make any assumptions about the specific form of the relationship between the input and the output variables. The algorithm essentially divides the explanatory characteristics into pertinent and irrelevant groups. While doing this, it aims to minimize the misclassification cost, making it naturally adept at modeling non-linear and intricate relationships, as well as interactions between different variables.

##### **Ensemble Methods:** {.unnumbered}

**1. Random Forests:**

The Random Forest algorithm leverages multiple decision trees to make predictions, drawing on varying subsets of data and features. Its inherent resilience against overfitting ensures a reliable model that generalizes effectively to unseen data. Moreover, its ability to process non-linear relationships between features helps capture complex data patterns, thereby enhancing the model's predictive accuracy. An added advantage of using Random Forests lies in its inherent ability to estimate feature importance, providing a quantifiable measure of each predictor's contribution towards the final prediction.



**2. Gradient Boosting:**  

Gradient Boosting is another ensemble method that progressively minimizes bias by assembling a series of weak learners, often decision trees, into a robust model. It starts with a basic prediction and calculates the residuals - the differences between the predicted and actual values. Subsequently, a new tree is fit on these residuals, predicting the errors of the previous tree and thereby reducing the model's bias.
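The residual-fitting loop described above can be illustrated with a small sketch. It is a simplification under stated assumptions: it uses squared-error residuals on a single feature with one-split trees (stumps), whereas a classifier such as the one in this study would fit the gradients of a classification loss. The function names, the toy data, and the learning rate are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
        # Track the mean squared error after each boosting round.
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict, history


# Toy data: one feature, binary target treated as a numeric value.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [0, 0, 0, 1, 0, 1, 1, 1]
predict, history = gradient_boost(toy_x, toy_y)
```

The training error recorded in `history` never increases from one round to the next, which is the bias reduction the paragraph refers to; it says nothing about performance on unseen data.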

##### **Regularisation:** {.unnumbered}  

We applied regularisation to prevent overfitting and to enhance the selected top-performing model's generalizability. We adopted a two-step hyperparameter tuning strategy involving both RandomisedSearch and GridSearch. RandomisedSearch sampled a variety of hyperparameters, providing a comprehensive overview of the search space, while GridSearch facilitated a thorough investigation of specific regions. We concurrently employed these methods to evaluate their performances and establish the optimal set of hyperparameters.
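One possible reading of this tuning strategy is a coarse random search followed by a fine grid search around the best random candidate. The sketch below illustrates that pattern only; `toy_score` is a hypothetical stand-in for a cross-validated model score, and the two hyperparameter names and their ranges are assumptions, not the settings used in the study.

```python
import itertools
import random


def toy_score(max_depth, min_samples_leaf):
    """Hypothetical stand-in for a cross-validated score."""
    return 1.0 - 0.001 * (max_depth - 12) ** 2 - 0.002 * (min_samples_leaf - 5) ** 2


def random_search(n_iter=30, seed=0):
    """Sample hyperparameter pairs at random and keep the best one."""
    rng = random.Random(seed)
    candidates = [(rng.randint(2, 30), rng.randint(1, 20)) for _ in range(n_iter)]
    return max(candidates, key=lambda p: toy_score(*p))


def grid_search_around(centre, radius=3):
    """Exhaustively evaluate a small grid centred on a promising point."""
    depth0, leaf0 = centre
    depths = range(max(2, depth0 - radius), depth0 + radius + 1)
    leaves = range(max(1, leaf0 - radius), leaf0 + radius + 1)
    return max(itertools.product(depths, leaves), key=lambda p: toy_score(*p))


coarse_best = random_search()
fine_best = grid_search_around(coarse_best)
```

Since the grid always contains the point it is centred on, the refined result can never score worse than the random-search result on the same scoring function.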

### **Measures of Feature Importance:** {.unnumbered}

#### **Gini Importance** {.unnumbered}  

Gini Importance provides a global measure of feature importance by indicating a feature's overall influence within the model, without specifying the direction (positive or negative) of this influence. The importance of correlated features might be distributed among them. Notably, Gini Importance can sometimes be inconsistent, meaning that the importance of a feature may decrease if additional useful features are included in the model.
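The quantity behind Gini Importance is the reduction in Gini impurity achieved by a split. The sketch below computes it for one hypothetical split; the toy labels and the "renter" split are assumptions for illustration. In a tree ensemble, a feature's importance is typically obtained by accumulating such decreases over every node that splits on that feature and then normalising across features.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left)
    weighted_children += (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children


# Toy node: 1 = food insecure, 0 = food secure.
parent_labels = [1, 1, 1, 0, 0, 0, 0, 0]
renter_side = [1, 1, 1, 0]   # hypothetical split: renters
owner_side = [0, 0, 0, 0]    # hypothetical split: owners
decrease = impurity_decrease(parent_labels, renter_side, owner_side)
```

Note that the decrease is always non-negative and carries no sign, which is why this measure cannot say whether a feature raises or lowers the predicted risk.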

#### **SHAP Values** {.unnumbered}  

The SHAP framework was implemented for the top-performing model to interpret the predictions made by complex machine learning algorithms. This game-theory-based method determines the contributions of predictor features to the final predictions, treating predictor features as players in a coalitional game. The game payoff (or the predicted probability in our case) is distributed among the features based on Shapley concepts in game theory. Compared to Gini Importance, SHAP values provide local interpretability and overall feature importance. This approach offers a more accurate representation of feature importance when features are correlated. SHAP values also ensure a fair and consistent distribution of contribution among features, thanks to their grounding in cooperative game theory. A comparison between these two measures is detailed in Table A.3.
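The coalitional-game idea can be made concrete with an exact Shapley computation on a tiny example. This is an illustrative sketch, not the SHAP library's algorithm: the three "players" and the payoff function `toy_payoff` (a hypothetical predicted probability that depends on which features are known) are assumptions chosen so that the result is easy to check by hand.

```python
from itertools import permutations


def shapley_values(players, payoff):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += payoff(with_p) - payoff(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_payoff(coalition):
    """Hypothetical predicted probability of food insecurity."""
    value = 0.10  # baseline prediction with no features known
    if "renter" in coalition:
        value += 0.20
    if "low_income" in coalition:
        value += 0.10
    if "renter" in coalition and "low_income" in coalition:
        value += 0.06  # interaction, shared equally between the two
    if "older" in coalition:
        value -= 0.04
    return value


features = ["renter", "low_income", "older"]
phi = shapley_values(features, toy_payoff)
```

Here the contributions are 0.23 for `renter`, 0.13 for `low_income`, and -0.04 for `older`; they sum to the difference between the full prediction (0.42) and the baseline (0.10), and their signs show the direction of each feature's effect.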

## **Result** {.unnumbered}

### **Summary Statistics** {.unnumbered}

The data reveals that the presence of household food insecurity in Canada for the year 2018 stands at 16.6%. A detailed breakdown shows the following distribution: 4.7% are marginally food insecure, 7.3% moderately food insecure, and 4.6% severely food insecure (refer to Table 3 for details). The summary statistics for numerical, categorical and ordinal predictors are presented in Tables 4A and 4B.

**Table 4A**: Summary Statistics of Numerical Predictors

| Variable | Count | Mean | Std | Min | Max |
| --- | --- | --- | --- | --- | --- |
| efalimo | 42437 | 273.896 | 2906.95 | 0 | 135000 |
| efalip | 42437 | 276.472 | 2540.88 | 0 | 92500 |
| efcapgn | 42437 | 1505.52 | 15025.9 | 0 | 690000 |
| efccar | 42437 | 546.202 | 2405.51 | 0 | 46000 |
| efchtxb | 42437 | 1514.39 | 3863.37 | 0 | 40000 |
| efcpqpp | 42437 | 4127.67 | 6154.36 | 0 | 45750 |
| efearng | 42437 | 62094.8 | 74033.6 | -199500 | 1.37e+06 |
| efgi | 42437 | 667.49 | 2386.27 | 0 | 36000 |
| efgstxc | 42437 | 231.75 | 302.423 | 0 | 2250 |
| efgtr | 42437 | 12446.6 | 12232.1 | 0 | 136550 |
| efinva | 42437 | 4469.85 | 23121.7 | -100000 | 1.375e+06 |
| efoasgi | 42437 | 3456.59 | 5924.69 | 0 | 52950 |
| efogovtr | 42437 | 188.908 | 315.585 | 0 | 6100 |
| efothinc | 42437 | 1664.9 | 9408.24 | 0 | 400725 |
| efpen | 42437 | 8175 | 19353.7 | 0 | 330000 |
| efpenrec | 42437 | 1234.55 | 4844.61 | 0 | 77500 |
| efphpr | 42437 | 68.9504 | 231.081 | 0 | 2250 |
| efpvtxc | 42437 | 313.021 | 564.33 | 0 | 11000 |
| efrppc | 42437 | 1800.47 | 3854.43 | 0 | 40500 |
| efrspwi | 42437 | 630.26 | 4254.06 | 0 | 119000 |
| efsapis | 42437 | 811.476 | 3239.27 | 0 | 50750 |
| efsemp | 42437 | 3221.72 | 18179.8 | -205000 | 440500 |
| efsize | 42437 | 2.22297 | 1.26481 | 1 | 7 |
| efuiben | 42437 | 1408.35 | 4436.65 | 0 | 73000 |
| efwkrcp | 42437 | 394.422 | 3300.36 | 0 | 121500 |


**Table 4B**: Summary Statistics of Categorical and Ordinal Predictors

| Variable | Index | Label                                                      | Counts |
|:---------|:-----:|:-----------------------------------------------------------|-------:|
| eftyp    | 22  | Non-elderly couple with children                           |   8005 |
| eftyp    | 21  | Non-elderly couple with no children or relatives           |   6975 |
| eftyp    | 24  | Elderly couple with no children or other relatives         |   5372 |
| eftyp    | 13  | Non-elderly male not in an economic family                 |   5088 |
| eftyp    | 14  | Non-elderly female not in an economic family               |   4042 |
| eftyp    | 12  | Elderly female not in an economic family                   |   3566 |
| eftyp    | 23  | Non-elderly couple with other relatives, no children       |   2195 |
| eftyp    | 11  | Elderly male not in an economic family                     |   1800 |
| eftyp    | 31  | Female lone-parent family                                  |   1646 |
| eftyp    | 43  | Other family type - non-elderly male                       |   1129 |
| eftyp    | 44  | Other family type - non-elderly female                     |   1077 |
| eftyp    | 41  | Other family type - elderly male                           |    562 |
| eftyp    | 42  | Other family type - elderly female                         |    513 |
| eftyp    | 32  | Male lone-parent family                                    |    467 |
| prov     | 35    | Ontario                                                   |  11125 |
| prov     | 24    | Quebec                                                    |   8059 |
| prov     | 59    | British Columbia                                          |   4874 |
| prov     | 48    | Alberta                                                   |   4335 |
| prov     | 46    | Manitoba                                                  |   3321 |
| prov     | 47    | Saskatchewan                                              |   3083 |
| prov     | 12    | Nova Scotia                                               |   2466 |
| prov     | 13    | New Brunswick                                             |   2316 |
| prov     | 10    | Newfoundland and Labrador                                 |   1651 |
| prov     | 11    | Prince Edward Island                                      |   1207 |
| dwltyp   | 1   | Single detached house                                     |  27555 |
| dwltyp   | 3   | Apartment                                                 |   8949 |
| dwltyp   | 2   | Double, row or terrace, duplex                            |   5196 |
| dwltyp   | 4   | Other                                                     |    737 |
| dwtenr   | 1     | Owned by a member of the household                        |  30129 |
| dwtenr   | 2     | Not owned by a member of the household                    |  12308 |
| marstp   | 1  | Married                                                   |  19180 |
| marstp   | 4   | Single (never married)                                    |   9622 |
| marstp   | 3  | Separated, divorced or widowed                            |   9159 |
| marstp   | 2  | Common-law                                                |   4476 |
| efmjsi   | 1     | No income                                                 |      0 |
| efmjsi   | 2     | Wages and salaries                                        |  24885 |
| efmjsi   | 4     | Government transfers                                      |   9850 |
| efmjsi   | 6     | Retirement pensions                                       |   4730 |
| efmjsi   | 3     | Self-employment income                                    |   1496 |
| efmjsi   | 5     | Investment income                                         |   1158 |
| efmjsi   | 7     | Other income                                              |    318 |
| sex      | 1     | Male                                                      |  24562 |
| sex      | 2     | Female                                                    |  17875 |
| immst    | 2     | No                                                        |  35765 |
| immst    | 1     | Yes                                                       |   5961 |
| immst    | 9     | Unknown                                                   |    711 |
| uszgap   | 6     | Rural area outside CMAs or CAs                            |  12355 |
| uszgap   | 9     | CMA, population 500,000 and over                          |  11682 |
| uszgap   | 1     | Rural area or CA, population under 100,000                |   6299 |
| uszgap   | 2     | CA, population under 30,000                               |   3381 |
| uszgap   | 5     | CA, population under 100,000                              |   2404 |
| uszgap   | 3     | CA, population 30,000 to 99,999                          |   2399 |
| uszgap   | 4     | CA, pop. under 100,000 or CMA, pop. 100,000 to 499,999    |   2295 |
| uszgap   | 7     | CMA, population 100,000 to 499,999                        |    873 |
| uszgap   | 8     | CA, pop. 30,000 to 99,999 or CMA, pop. 100,000 to 499,999 |    749 |
| efagofmp | 15    | 70 years and over                                         |   9297 |
| efagofmp | 13    | 60 to 64 years                                            |   4495 |
| efagofmp | 12    | 55 to 59 years                                            |   4438 |
| efagofmp | 14    | 65 to 69 years                                            |   4195 |
| efagofmp | 11    | 50 to 54 years                                            |   3709 |
| efagofmp | 10    | 45 to 49 years                                            |   3367 |
| efagofmp | 8     | 35 to 39 years                                            |   3313 |
| efagofmp | 9     | 40 to 44 years                                            |   3168 |
| efagofmp | 7     | 30 to 34 years                                            |   2952 |
| efagofmp | 6     | 25 to 29 years                                            |   2247 |
| efagofmp | 5     | 18 to 24 years                                            |   1237 |
| efagofmp | 4     | Unknown                                                   |     19 |
| efagyfmp | 15    | 70 years and over                                         |   6578 |
| efagyfmp | 1     | 0 to 5 years                                              |   4475 |
| efagyfmp | 13    | 60 to 64 years                                            |   3809 |
| efagyfmp | 5     | 18 to 24 years                                            |   3603 |
| efagyfmp | 12    | 55 to 59 years                                            |   3399 |
| efagyfmp | 14    | 65 to 69 years                                            |   3366 |
| efagyfmp | 3     | 10 to 15 years                                            |   2982 |
| efagyfmp | 6     | 25 to 29 years                                            |   2705 |
| efagyfmp | 11    | 50 to 54 years                                            |   2435 |
| efagyfmp | 2     | 6 to 9 years                                              |   2191 |
| efagyfmp | 7     | 30 to 34 years                                            |   1914 |
| efagyfmp | 10    | 45 to 49 years                                            |   1614 |
| efagyfmp | 8     | 35 to 39 years                                            |   1262 |
| efagyfmp | 9     | 40 to 44 years                                            |   1183 |
| efagyfmp | 4     | 16 to 17 years                                            |    921 |
| hlev2g   | 3   | Non-university postsecondary certificate or diploma       |   14736 |
| hlev2g   | 4   | University degree or certificate                          |   12043 |
| hlev2g   | 2   | Graduated high school or partial postsecondary education |   9889 |
| hlev2g   | 1   | Less than high school graduation                          |   5769 |

**Table 5**: Evaluation Metrics for Performance of Classifier

| Metric | Equation | Interpretation |
| --- | --- | --- |
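| Recall (per class) | TP/(TP+FN) | Number of households correctly identified as belonging to a class out of all households actually in that class; reported separately for food-secure and food-insecure households in Table 6 |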
| Accuracy | (TP+TN)/(TP+FP+TN+FN) | How well the algorithm has classified positive and negative classes over total cases |
| ROC AUC | No closed form equation | The area under the Receiver Operating Characteristics curve. It measures the entire two-dimensional area underneath the entire ROC curve (from (0,0) to (1,1)) |
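The metrics in Table 5 can be computed directly from labels and predicted scores, as in the sketch below. The toy labels, scores, and the 0.5 decision threshold are assumptions for illustration. ROC AUC is computed here through its pairwise interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count TP, FN, TN, FP, treating label 1 (food insecure) as positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp, fn, tn, fp


def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 3 food-insecure (1) and 7 food-secure (0) households.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.15, 0.25, 0.05]

tp, fn, tn, fp = confusion_counts(toy_labels, toy_scores)
recall_insecure = tp / (tp + fn)          # recall for the positive class
recall_secure = tn / (tn + fp)            # recall for the negative class
accuracy = (tp + tn) / (tp + fp + tn + fn)
auc = roc_auc(toy_labels, toy_scores)
```

Reporting recall separately for each class matters with imbalanced data: a model can reach high accuracy by favouring the food-secure majority while still missing many food-insecure households.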

**Table 6**: Comparing Model Performances

| Model                          | Recall - Secure | Recall - Insecure | Training Accuracy | Test Accuracy | Train ROC AUC | Test ROC AUC |
|--------------------------------|-----------------|-------------------|-------------------|---------------|---------------|--------------|
| Logistic Regression            | 0.70            | 0.71              | 0.7000            | 0.6987        | 0.78          | 0.78         |
| Decision Tree                  | 0.84            | 0.37              | 1.00              | 0.7594        | 1.00          | 0.60         |
| Random Forest                  | 0.94            | 0.89              | 1.00              | 0.9169        | 1.00          | 0.97         |
| Regularised Random Forest   | 0.94            | 0.90              | 0.9999            | 0.9184        | 1.00          | 0.97         |
| Gradient Boosting              | 0.91            | 0.84              | 0.8765            | 0.8744        | 0.95          | 0.94         |


### **Model Validation** {.unnumbered}

The data was partitioned into a training set (75%) and a test set (25%) for model validation. The primary evaluation metrics considered were the balanced recall for each class and test accuracy. We referred to the training accuracy to identify potential overfitting, and additionally assessed the Mean Area Under the Curve (AUC) as an alternative performance measure. The equation and interpretation of each metric are provided in Table 5.

According to Table 6, the overall test accuracy of our selected models ranged from 0.6987 to 0.9184. The recall rate for food-insecure households spanned from 0.37 (for the decision tree model) to 0.90 (for the Regularised Random Forest model), with the latter demonstrating the best performance. The decision tree and random forest models matched the training set perfectly, and the Regularised Random Forest did so almost perfectly (training accuracy 0.9999), indicating possible overfitting. As depicted in Figure 1, the Gradient Boosting model might offer better generalizability, as suggested by the proximity of its train and test accuracy values. Notably, despite holding the third-highest test accuracy among the five models in Table 6, the Gradient Boosting model falls short in terms of interpretability. The Logistic Regression model, although it offers the highest interpretability, lags in predictive accuracy. Nevertheless, it's worth noting that its recall rate for food-insecure households surpasses that of the decision tree model.
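The gap between training and test accuracy, used above as an informal overfitting signal, can be tabulated directly from the values in Table 6. The sketch below does only that arithmetic; the dictionary name and layout are illustrative choices.

```python
# (training accuracy, test accuracy) as reported in Table 6.
table6_accuracy = {
    "Logistic Regression": (0.7000, 0.6987),
    "Decision Tree": (1.00, 0.7594),
    "Random Forest": (1.00, 0.9169),
    "Regularised Random Forest": (0.9999, 0.9184),
    "Gradient Boosting": (0.8765, 0.8744),
}

# Train-minus-test gap per model, rounded to four decimals.
accuracy_gaps = {
    model: round(train - test, 4)
    for model, (train, test) in table6_accuracy.items()
}

# Models ordered from smallest to largest gap.
models_by_gap = sorted(accuracy_gaps, key=accuracy_gaps.get)
```

On these figures, Logistic Regression and Gradient Boosting show gaps of about 0.001 and 0.002, while the tree-based models show gaps between roughly 0.08 and 0.24.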

**Figure 1**: Test Accuracy versus Training Accuracy among models  
![Test accuracy versus training accuracy among models](./images/Trainvstest.png)

### **Identifying Key Predictors for Presence of Food Insecurity** {.unnumbered}

#### **Gini importance** {.unnumbered}
Figure 2 displays the ten most influential features as indicated by Gini importance. 'Age group of oldest person in economic family' (efagofmp) emerges as the most critical feature, followed by 'Provincial tax credits of the economic family' (efpvtxc) and 'Highest level of education of major earner' (hlev2g). The subsequent features, including 'Earnings (employment income)' (efearng), 'Federal GST/HST Credit (excludes provincial sales taxes) of the economic family', and 'Age group of youngest person in economic family' (efagyfmp), exhibit nearly equivalent importance, each at approximately 83.3% of the Gini importance ascribed to 'Age group of oldest person in economic family' (efagofmp).

#### **SHAP values** {.unnumbered}

Figure 3 exhibits the 20 most significant features as measured by SHAP-values. The x-axis in the plot denotes the feature's impact on the model's prediction, with the color indicating the feature's value (red for high, blue for low). A positive SHAP-value implies the feature's presence enhances the model's output, thus increasing the projected likelihood of food insecurity.
Note that categorical variables enter the SHAP analysis as dummy variables (also known as one-hot encoding).  
For the age group of the oldest person in the economic family (efagofmp), the "bulb" shape around the 0.05 SHAP-value implies a substantial number of observations sharing a similar SHAP-value. The blend of purple and blue points suggests that both moderate and low age groups of the oldest person in a household slightly increase the predicted likelihood of food insecurity. The substantial stretch of negative SHAP-values on the plot's left side indicates that a higher age generally lowers food insecurity likelihood. The lack of a bulb shape implies a more linear relationship.  
Another key feature is the dwelling not being owned by a household member (i.e., renter status). A blue bulb with a small negative SHAP-value indicates that homeownership slightly decreases predicted food insecurity. In contrast, the majority of red points on the right suggests that renters have a stronger inclination towards food insecurity, corroborating existing literature.  
The highest level of education of the major earner (hlev2g) displays wide ranges of both positive and negative SHAP-values, indicating its diverse impact on the prediction. Most blue and purple points have small negative SHAP-values, suggesting that lower education levels have only a small impact on the food insecurity prediction. Considering that the highest hlev2g level represents a university degree or certificate, further investigation into the association between these qualifications and food insecurity is warranted.


**Figure 2**: Top 10 features ranked by Gini importance  
![Top 10 features by Gini importance](Gini.png)


**Figure 3**: Top 20 most important features measured by SHAP values. SHAP utilizes dummy variables (also known as one-hot encoding) for categorical or ordinal variables not recorded in binary format (i.e., 1 and 0).
![Top 20 features by SHAP values](SHAPv.png)



### **Robustness Test** {.unnumbered}

We further evaluated our 2018 model's performance on the 2019 dataset, including a variant using all 37 predictors with class imbalance adjustment via SMOTE (Table 7). This variant yielded an accuracy of 97.96% and an ROC AUC of 1.00.  
Our Random Forest model's performance across datasets and predictor counts supports its robustness in predicting food insecurity and the validity of our initial model. However, the 10-predictor model performed worse on the 2019 test set than on the 2018 test set (test accuracy 0.7617 versus 0.8362; recall for insecure households 0.66 versus 0.80), which calls for further investigation and continuous model refinement. The enhanced performance using all 37 predictors emphasizes the importance of considering a comprehensive set of variables for optimal prediction.

**Table 7**: Performance of Random Forest Model for Robustness Test

| Training Model                | Test Set       | Recall - Secure | Recall - Insecure | Test Accuracy | Test ROC AUC |
|-------------------------------|----------------|-----------------|-------------------|---------------|--------------|
| Random Forest (10 predictors) | Test Set 2018  | 0.87            | 0.80              | 0.8362        | 0.92         |
| Random Forest (10 predictors) | Test Set 2019  | 0.86 | 0.66 | 0.7617 | 0.85 |
| Random Forest (37 predictors) | Test Set 2019  | 0.99 | 0.97 | 0.9796 | 1.00 |
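
For reference, the column definitions in Table 7 can be sketched as follows. This is a small, dependency-free illustration on made-up labels and scores (not CIS data); the function names are hypothetical, and ROC AUC is computed through its rank interpretation, that is, the probability that a randomly chosen insecure household receives a higher score than a randomly chosen secure one.

```python
def recall_for_class(y_true, y_pred, cls):
    """Share of the true members of cls that were predicted as cls."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(1 for p in relevant if p == cls) / len(relevant)


def accuracy(y_true, y_pred):
    """Share of all observations that were predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = food insecure, 0 = food secure.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.9, 0.7, 0.45, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

print("Recall - Secure:", recall_for_class(y_true, y_pred, 0))
print("Recall - Insecure:", recall_for_class(y_true, y_pred, 1))
print("Test Accuracy:", accuracy(y_true, y_pred))
print("Test ROC AUC:", roc_auc(y_true, scores))
```

Reporting recall separately for each class matters with imbalanced data, because overall accuracy can stay high while recall for the minority (insecure) class is weak.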




## Conclusion {.unnumbered}
This study showcased the powerful potential of machine learning models, particularly the Regularised Random Forest, in predicting household food insecurity in Canada. Such models, informed by diverse socioeconomic and demographic indicators, provide robust, reliable predictions and meaningful insights into the key drivers of food insecurity.  
The Regularised Random Forest model's superior performance and consistent reliability across datasets validate its utility. It enables us to understand the critical features influencing food insecurity, such as age, provincial tax credits, and education level of the primary earner, which align with the existing literature[19],[23],[24].  
While these findings are significant, the model's enhanced performance with a comprehensive set of 37 predictors suggests that an extensive feature set could yield more accurate predictions. Future work should focus on integrating more granular and varied data to refine the predictive power of these models further.  
These models can also support missing-value imputation for future food insecurity surveys in Canada. Policymakers can use these findings to target interventions more effectively, mitigating the risk factors associated with food insecurity. This research also underscores the importance of diverse socioeconomic variables and their interactions in predicting food insecurity, providing a strong foundation for policy prioritisation and decision-making.

## References {.unnumbered}

[1] Tarasuk, V., Li, T., & Fafard St-Germain, A.A. (2022). Household food insecurity in Canada, 2021. Toronto: Research to identify policy options to reduce food insecurity (PROOF). Retrieved from https://proof.utoronto.ca/

[2] Jessiman-Perreault, G., & McIntyre, L. (2017). The household food insecurity gradient and potential reductions in adverse population mental health outcomes in Canadian adults. SSM - Population Health, 3, 464-472. doi:10.1016/j.ssmph.2017.05.013

[3] Tarasuk, V., Gundersen, C., Wang, X., et al. (2020). Maternal food insecurity is positively associated with postpartum mental disorders in Ontario, Canada. Journal of Nutrition, 150(11), 3033-3040. doi:10.1093/jn/nxaa240

[4] Tarasuk, V., Mitchell, A., McLaren, L., et al. (2013). Chronic physical and mental health conditions among adults may increase vulnerability to household food insecurity. Journal of Nutrition, 143(11), 1785-1793. doi:10.3945/jn.113.178483

[5] Tait, C., L’Abbe, M., Smith, P., et al. (2018). The association between food insecurity and incident type 2 diabetes in Canada: A population-based cohort study. PLoS One, 13(5), e0195962. doi:10.1371/journal.pone.0195962

[6] Bekele, T., Globerman, J., Watson, J., et al. (2018). Prevalence and predictors of food insecurity among people living with HIV affiliated with AIDS service organizations in Ontario, Canada. AIDS Care, 30(5), 663-671. doi:10.1080/09540121.2017.1394435

[7] Cox, J., Hamelin, A.M., McLinden, T., et al. (2016). Food insecurity in HIV-hepatitis C virus co-infected individuals in Canada: The importance of co-morbidities. AIDS and Behavior, 21(3), 792-802. doi:10.1007/s10461-016-1326-9

[8] Men, F., Gundersen, C., Urquia, M.L., et al. (2020). Association between household food insecurity and mortality in Canada: A population-based retrospective cohort study. Canadian Medical Association Journal, 192(3), E53-E60. doi:10.1503/cmaj.190385

[9] Gucciardi, E., DeMelo, M., Vogt, J., et al. (2009). Exploration of the relationship between household food insecurity and diabetes in Canada. Diabetes Care, 32, 2218-2224. doi:10.2337/dc09-0823

[10] Anema, A., Chan, K., Weiser, S., et al. (2013). Relationship between food insecurity and mortality among HIV-positive injection drug users receiving antiretroviral therapy in British Columbia, Canada. PLoS One, 8(5), e61277. doi:10.1371/journal.pone.0061277

[11] Men, F., Gundersen, C., Urquia, M.L., et al. (2020). Food insecurity is associated with higher health care use and costs among Canadian adults. Health Affairs, 39(8), 1377-1385. doi:10.1377/hlthaff.2019.01637

[12] Tarasuk, V., Cheng, J., de Oliveira, C., et al. (2015). Association between household food insecurity and annual health care costs. Canadian Medical Association Journal, 187(14), E429-E436. doi:10.1503/cmaj.150234

[13] Martini, G., Bracci, A., Riches, L., et al. (2022). Machine learning can guide food security efforts when primary data are not available. Nature Food, 3, 716–728. doi:10.1038/s43016-022-00587-8

[14] Gholami, S., Knippenberg, E., Campbell, J., Andriantsimba, D., Kamle, A., Parthasarathy, P., … Lavista Ferres, J. (2022). Food security analysis and forecasting: A machine learning case study in southern Malawi. Data & Policy, 4, e33. doi:10.1017/dap.2022.25

[15] Meerza, S. I. A., Meerza, S. I. A., & Ahamed, A. (2021). Food Insecurity Through Machine Learning Lens: Identifying Vulnerable Households. doi:10.22004/ag.econ.314072

[16] Abdul Razzaq, U., Ahmed, U. I., Hashim, S., Hussain, A., Qadri, S., Ullah, S., Shah, A. N., Imran, A., & Asghar, A. (2021). An Automatic Determining Food Security Status: Machine Learning based Analysis of Household Survey Data. International Journal of Food Properties, 24(1), 726-736. doi:10.1080/10942912.2021.1919703

[17] Reynolds, D., & Mirosa, M. (2021). Understandings of Food Insecurity in Aotearoa New Zealand: Considering Practitioners’ Perspectives in a Neoliberal Context Using Q Methodology. Sustainability, 14, 178. doi:10.3390/su14010178

[18] Reynolds, D., & Mirosa, M. (2022). Understandings of Food Insecurity in Aotearoa New Zealand: Considering Practitioners’ Perspectives in a Neoliberal Context Using Q Methodology. Sustainability, 14, 178. doi:10.3390/su14010178

[19] Tarasuk, V., Fafard St-Germain, A. A., & Mitchell, A. (2019). Geographic and socio-demographic predictors of household food insecurity in Canada, 2011–12. BMC Public Health, 19, 12. doi:10.1186/s12889-018-6344-2

[20] Men, F., Fafard St-Germain, A. A., Ross, K., Remtulla, R., & Tarasuk, V. (2023). Effect of Canada Child Benefit on Food Insecurity: A Propensity Score-Matched Analysis. American Journal of Preventive Medicine, 64(6), 844–852. doi:10.1016/j.amepre.2023.01.027

[21] Brown, E. M., & Tarasuk, V. (2019). Money speaks: Reductions in severe food insecurity follow the Canada Child Benefit. Preventive Medicine, 129, 105876. doi:10.1016/j.ypmed.2019.105876

[22] Weiser, S. D., Young, S. L., Cohen, C. R., Kushel, M. B., Tsai, A. C., Tien, P. C., Hatcher, A. M., Frongillo, E. A., & Bangsberg, D. R. (2011). Conceptual framework for understanding the bidirectional links between food insecurity and HIV/AIDS. The American Journal of Clinical Nutrition, 94(6), 1729S–1739S. doi:10.3945/ajcn.111.012070

[23] Fafard St-Germain, A. A., & Tarasuk, V. (2020). Homeownership status and risk of food insecurity: Examining the role of housing debt, housing expenditure and housing asset using a cross-sectional population-based survey of Canadian households. International Journal for Equity in Health, 19(1), 5. doi:10.1186/s12939-019-1114-z

[24] McIntyre, L., Bartoo, A. C., & Emery, J. C. (2014). When working is not enough: Food insecurity in the Canadian labour force. Public Health Nutrition, 17(1), 49–57. doi:10.1017/S1368980012004053

[25] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357. doi:10.1613/jair.953

[26] Manikas, I., Ali, B. M., & Sundarakani, B. (2023). A systematic literature review of indicators measuring food security. Agriculture & Food Security, 12(1), 10. doi:10.1186/s40066-023-00415-7

[27] Caron, N., & Plunkett-Latimer, J. (2022). Canadian Income Survey: Food insecurity and unmet health care needs, 2018 and 2019. Statistics Canada. Retrieved June 28, 2023, from https://www150.statcan.gc.ca/n1/pub/75f0002m/75f0002m2021009-eng.htm







# Appendix {.unnumbered}





## Additional Figures {.unnumbered}

**Figure A.1**: Distribution of Type of dwelling among each level of food insecurity

![Distribution of dwltyp](un.png)


## Additional Tables {.unnumbered}

**Table A.1**: Predictor Description

| Category | Variable | Description |
| --- | --- | --- |
| Demographics | `sex` | Sex |
| | `marstp` | Marital status |
| | `immst` | Flag - Person is a landed immigrant |
| | `hlev2g` | Highest level of education of person |
| | `prov` | Province |
| Economic Family Characteristics | `efagofmp` | Age group of oldest person in economic family |
| | `efagyfmp` | Age group of youngest person in economic family |
| | `efsize` | Number of economic family members |
| | `eftyp` | Economic family type |
| | `efmjsi` | Major source of income for the economic family |
| Income and Expenses | `efalimo` | EF - Support payments received |
| | `efalip` | EF - Support payments paid |
| | `efcapgn` | EF - Taxable capital gains |
| | `efccar` | EF - Child care expenses |
| | `efchtxb` | EF - Total federal and provincial child benefits |
| | `efcpqpp` | EF - CPP and QPP benefits |
| | `efearng` | EF - Earnings (employment income) |
| | `efgi` | EF - Guaranteed Income Supplement under federal |
| | `efgstxc` | EF - Federal GST/HST Credit (excludes provincial sales taxes) |
| | `efgtr` | EF - Government transfers, federal and provincial |
| | `efinva` | EF - Investment income |
| | `efoasgi` | EF - Total of Old Age Security benefits |
| | `efogovtr` | EF - Other government transfers |
| | `efothinc` | EF - Other income |
| | `efpen` | EF - Private retirement pensions (includes pension income splitting) |
| | `efpenrec` | EF - Elected split-pension amount |
| | `efphpr` | EF - Public health insurance premiums |
| | `efpvtxc` | EF - Provincial tax credits |
| | `efrppc` | EF - Registered pension plan contributions |
| | `efrspwi` | EF - RRSP withdrawals |
| | `efsapis` | EF - Social assistance benefits |
| | `efsemp` | EF - Self-employment net income |
| | `efuiben` | EF - Employment Insurance benefits |
| | `efwkrcp` | EF - Workers' compensation benefits |
| Housing | `dwltyp` | Type of dwelling |
| | `dwtenr` | Ownership of dwelling |
| | `uszgap` | Adjusted size of area of residence |


**Table A.2**: Distribution of type of dwelling (dwltyp) within each level of food insecurity (fschhldm); each row gives proportions that sum to 1

| fschhldm  | dwltyp 1.0 | dwltyp 2.0 | dwltyp 3.0 | dwltyp 4.0 | dwltyp NaN |
| --------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| 0         | 0.651466   | 0.108267   | 0.179550   | 0.014437   | 0.046279   |
| 1         | 0.521739   | 0.130935   | 0.254873   | 0.019990   | 0.072464   |
| 2         | 0.465222   | 0.151084   | 0.270463   | 0.018764   | 0.094468   |
| 3         | 0.371604   | 0.140953   | 0.352640   | 0.031266   | 0.103537   |
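
A row-normalised table of this kind can be produced as in the following minimal sketch. The records are made-up (not CIS data), `row_proportions` is a hypothetical helper, and missing dwelling types are represented by `None` and kept as their own category, mirroring the `NaN` column above.

```python
from collections import Counter, defaultdict


def row_proportions(records):
    """Share of each dwelling type within each food-insecurity level.

    records: iterable of (level, dwelling_type) pairs; None marks a missing type.
    """
    counts = defaultdict(Counter)
    for level, dwelling in records:
        counts[level][dwelling] += 1
    return {
        level: {dwelling: n / sum(c.values()) for dwelling, n in c.items()}
        for level, c in counts.items()
    }


# Made-up (fschhldm, dwltyp) pairs for illustration only.
records = [
    (0, 1.0), (0, 1.0), (0, 3.0), (0, None),
    (3, 1.0), (3, 3.0), (3, 3.0), (3, None),
]
table = row_proportions(records)
for level in sorted(table):
    print(level, table[level])
```

Normalising within each food-insecurity level, rather than over the whole sample, makes the dwelling mix comparable across levels of very different size.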


**Table A.3**: Differences between Gini importance and SHAP values

|                             | Gini Importance                                                     | SHAP Values                                                                |
|-----------------------------|---------------------------------------------------------------------|----------------------------------------------------------------------------|
| Global vs. Local            | Provides a global measure of feature importance across all instances | Provides local interpretability as well as a global measure of importance  |
| Correlated Features         | Can overestimate the importance of correlated features              | Distributes credit among correlated features by averaging marginal contributions over feature coalitions |
| Consistency                 | Can be inconsistent in certain situations                          | Consistent, feature importance does not decrease with additional useful features |
| Fair Contribution           | Does not offer a fair distribution of contribution among features   | Provides a fair distribution of contribution among features based on cooperative game theory |
| Direction of Importance     | Does not provide information about the direction of feature importance | Provides information about the direction of feature importance (i.e., positive or negative impact on output) |
| Computational Efficiency    | Less computationally intensive                                      | More computationally intensive                                            |


