# [Take-home Assessment] Food Crisis Early Warning 

Welcome to the assessment. You will showcase your modeling and research skills by investigating news articles (in English and Arabic) as well as a set of food insecurity risk factors. 

We suggest planning to spend **~6–8 hours** on this assessment. **Please submit your response by Monday, September 15th, 9:00 AM EST via email to dime_ai@worldbank.org**. Please document your code with comments and explanations of design choices. There is one question on reflecting upon your approach at the end of this notebook.

**Name:**  Jonas Nothnagel

**Email:** jonas.nothnagel@gmail.com

# Part 1: Technical Assessment


## Task:

We invite you to approach the challenge of understanding (and potentially predicting) food insecurity using the provided (limited) data. Your response should demonstrate how you tackle open-ended problems in data-scarce environments.

Some example questions to consider:
- What is the added value of geospatial data?
- How can we address the lack of ground-truth information on food insecurity levels?
- What are the benefits and challenges of working with multilingual data?
- ...

These are just guiding examples — you are free to explore any relevant angles to this topic/data.

**Note:** There is no single "right" approach. Instead, we want to understand how you approach and structure open-ended problems in data-scarce environments. Given the large number of applicants, we will preselect the most impressive and complete submissions. Please take care in structuring your response, as selection will depend on its depth and originality.


## Provided Data:

1. **Risk Factors:** A file containing 167 risk factors (unigrams, bigrams, and trigrams) in the `english_keywords` column and an empty `keywords_arabic` column. A separate file with the mapping of English risk factors to pre-defined thematic cluster assignments.


2. **News Articles:** Two files containing one month of news articles from the Mashriq region:
   - `news-articles-eng.csv`
   - `news-articles-ara.csv`
   - **Note:** You may work on a sample subset during development.
   
   
3. **Geographic Taxonomy:** Files containing the names of the countries, provinces, and districts for the subset of Mashriq countries covered by the news articles. Each file is a dictionary mapping a key to the geographic name.
   - `id_arabic_location_name.pkl`
   - `id_english_location_name.pkl`
   - **Note:** Each unique country/province/district is assigned a key (e.g. `iq`,`iq_bg` and `iq_bg_1` for country Iraq, province Baghdad, and district 1 in Baghdad respectively).
   - The key of a country is a two-character abbreviation, as follows:
       - 'iq': 'Iraq'
       - 'jo': 'Jordan'
       - 'lb': 'Lebanon'
       - 'ps': 'Palestine'
       - 'sy': 'Syria'
       
   - The key of a province is the two-character abbreviation of the country followed by a two-character abbreviation of the province, **`{country_abbreviation}_{province_abbreviation}`**, and the key of a district is **`{country_abbreviation}_{province_abbreviation}_{unique_number}`**.
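A minimal sketch of how such keys could be split into their administrative levels, assuming every key follows the underscore-separated pattern described above. `parse_location_key` is a hypothetical helper for illustration and is not part of the provided files.

```python
def parse_location_key(key):
    """Split a taxonomy key such as 'iq_bg_1' into its administrative levels."""
    levels = ("country", "province", "district")
    parts = key.split("_")
    # Assumption: keys have one to three non-empty, underscore-separated parts
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"Unexpected location key format: {key!r}")
    # The key of each level is the prefix of the full key up to that depth
    parsed = {
        name: "_".join(parts[: depth + 1])
        for depth, name in enumerate(levels[: len(parts)])
    }
    parsed["level"] = levels[len(parts) - 1]
    return parsed


for example_key in ("iq", "iq_bg", "iq_bg_1"):
    print(example_key, "->", parse_location_key(example_key))
```

Keeping the parent keys alongside the level makes it straightforward to roll district-level matches up to province or country level later on.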
       


## Submission Guidelines:

- **Code:** Follow best coding practices and ensure clear documentation. All notebook cells should be executed with outputs saved, and the notebook should run correctly on its own. Name your file **`solution_{FIRSTNAME}_{LASTNAME}.ipynb`**. If your solution relies on additional open-access data, either include it in your submission (e.g., as part of a ZIP file) or provide clear data-loading code/instructions as part of the notebook.
- **Report:** Submit a separate markdown file communicating your approach to this research problem. We expect you to detail the models, methods, or (additional) data you are using.

Good luck!


---


## Your Submission

In [9]:
# Import libraries 
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import re
import os
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [11]:
# Load the cleaned data from the previous steps, stored in the new_data folder
news_articles_english = pd.read_excel('new_data/english_news_clean.xlsx')
news_articles_arabic = pd.read_excel('new_data/arabic_news_clean.xlsx')
cluster_df = pd.read_excel('new_data/bilingual_clusters.xlsx')

In [13]:
# Create a reproducible development subset: a contiguous block of rows starting at a seeded random index.
# I use a contiguous block rather than a random row sample to keep the time series structure of the data intact
# (this assumes the rows are ordered by date).
n_rows_en = len(news_articles_english)
n_rows_ara = len(news_articles_arabic)
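# Cap the block size so the slicing below also works if a file has fewer than 10,000 rows
block_size = min(10000, n_rows_en, n_rows_ara)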

# set random seed
np.random.seed(42)

# pick a random start index so that the slice fits inside the dataframe
start_en = np.random.randint(0, n_rows_en - block_size + 1)
start_ara = np.random.randint(0, n_rows_ara - block_size + 1)

# slice the dataframe
subset_new_en = news_articles_english.iloc[start_en:start_en + block_size].reset_index(drop=True)
subset_new_ara = news_articles_arabic.iloc[start_ara:start_ara + block_size].reset_index(drop=True)

### Step 1 of Analysis - Match Clusters to articles:
- There are several ways to do this. The choice depends on how much precision vs. recall we want, and on whether we opt for open-source tools (yes, for reproducibility) or for more conservative approaches such as simple keyword matching. I will opt for sentence transformers to account for semantic similarity, but will probably set the threshold fairly high to favour precision. For this we need to embed both the article titles/bodies and the keywords/clusters.
- We could think of some sort of hierarchical semantic similarity matching that prioritises the title.
- BIG question: do we allow articles to enter multiple clusters or only one?
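The options in Step 1 can be sketched as follows. This is a minimal illustration, not the final implementation: it assumes the title, body, and cluster embeddings have already been produced by a multilingual sentence-transformer model, and uses toy 2-D vectors in their place. The function names, the `threshold` of 0.6, and the `title_weight` of 0.7 are illustrative assumptions, not tuned values.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Guard against division by zero for all-zero vectors
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a @ b.T


def match_articles_to_clusters(title_emb, body_emb, cluster_emb,
                               threshold=0.6, title_weight=0.7,
                               multi_label=True):
    """Assign cluster indices to articles from a title-weighted similarity score."""
    title_sim = cosine_similarity_matrix(title_emb, cluster_emb)
    body_sim = cosine_similarity_matrix(body_emb, cluster_emb)
    # Prioritise the title by giving it the larger share of the combined score
    score = title_weight * title_sim + (1.0 - title_weight) * body_sim

    assignments = []
    for row in score:
        if multi_label:
            # Keep every cluster above the threshold, best match first
            hits = np.flatnonzero(row >= threshold)
            hits = hits[np.argsort(-row[hits], kind="stable")]
            assignments.append(hits.tolist())
        else:
            # Keep only the single best cluster, if it clears the threshold
            best = int(np.argmax(row))
            assignments.append([best] if row[best] >= threshold else [])
    return score, assignments


# Toy stand-ins for real embeddings: 2 clusters, 3 articles
cluster_emb = [[1.0, 0.0], [0.0, 1.0]]
title_emb = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
body_emb = [[1.0, 0.1], [1.0, 1.0], [0.0, -1.0]]

score, multi = match_articles_to_clusters(title_emb, body_emb, cluster_emb)
_, single = match_articles_to_clusters(title_emb, body_emb, cluster_emb,
                                       multi_label=False)
print("multi-label :", multi)
print("single-label:", single)
```

The `multi_label` switch makes the open question explicit: with a high threshold, multi-label assignment keeps precision while still letting an article about, say, both conflict and food prices count towards both clusters.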



---

# Part 2: Reflection

Please outline (1) some of the limitations of your approach and (2) how you would tackle these if you had more time.