# IOM Missing Migrants Challenge

This Jupyter notebook is meant as a starting point to help you get familiar with the data.

# Initial Data Loading and Inspection

In [1]:
#load your libraries here
import pandas as pd
import numpy as np

In [3]:
#loading the raw data
df_raw = pd.read_csv('data/Missing_Migrants_Global_Figures_allData.csv')

- Perform a basic inspection of the data using .head(), .info(), .shape, and .describe() to understand its structure, the data types available, and summary statistics.

In [4]:
df_raw.head()

|  | Main ID | Incident ID | Incident Type | Region of Incident | Incident Date | Incident Year | Month | Number of Dead | Minimum Estimated Number of Missing | Total Number of Dead and Missing | ... | Region of Origin | Cause of Death | Country of Incident | Migration Route | Location of Incident | Coordinates | UNSD Geographical Grouping | Information Source | URL | Source Quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2014.MMP00001 | 2014.MMP00001 | Incident | North America | 2014-01-06 | 2014 | January | 1.0 |  | 1 | ... | Central America | Mixed or unknown | United States of America | US-Mexico border crossing | Pima Country Office of the Medical Examiner ju... | 31.650259, -110.366453 | Northern America | Pima County Office of the Medical Examiner (PC... | http://humaneborders.info/ | 5 |
| 1 | 2014.MMP00002 | 2014.MMP00002 | Incident | North America | 2014-01-12 | 2014 | January | 1.0 |  | 1 | ... | Latin America / Caribbean (P) | Mixed or unknown | United States of America | US-Mexico border crossing | Pima Country Office of the Medical Examiner ju... | 31.59713, -111.73756 | Northern America | Pima County Office of the Medical Examiner (PC... |  | 5 |
| 2 | 2014.MMP00003 | 2014.MMP00003 | Incident | North America | 2014-01-14 | 2014 | January | 1.0 |  | 1 | ... | Latin America / Caribbean (P) | Mixed or unknown | United States of America | US-Mexico border crossing | Pima Country Office of the Medical Examiner ju... | 31.94026, -113.01125 | Northern America | Pima County Office of the Medical Examiner (PC... |  | 5 |
| 3 | 2014.MMP00004 | 2014.MMP00004 | Incident | North America | 2014-01-16 | 2014 | January | 1.0 |  | 1 | ... | Central America | Violence | United States of America | US-Mexico border crossing | near Douglas, Arizona, USA | 31.506777, -109.315632 | Northern America | Ministry of Foreign Affairs Mexico, Pima Count... | http://bit.ly/1qfIw00 | 5 |
| 4 | 2014.MMP00005 | 2014.MMP00005 | Incident | Europe | 2014-01-16 | 2014 | January | 1.0 | 0.0 | 1 | ... | Northern Africa | Harsh environmental conditions / lack of adequ... | Russian Federation |  | Border between Russia and Estonia | 59.1551, 28 | Northern Europe | EUBusiness (Agence France-Presse) | http://bit.ly/1rTFTjR | 1 |


In [5]:
df_raw.shape

(15927, 25)

In [10]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15927 entries, 0 to 15926
Data columns (total 25 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Main ID                              15927 non-null  object 
 1   Incident ID                          15927 non-null  object 
 2   Incident Type                        15927 non-null  object 
 3   Region of Incident                   15927 non-null  object 
 4   Incident Date                        15912 non-null  object 
 5   Incident Year                        15927 non-null  int64  
 6   Month                                15927 non-null  object 
 7   Number of Dead                       15133 non-null  float64
 8   Minimum Estimated Number of Missing  1592 non-null   float64
 9   Total Number of Dead and Missing     15927 non-null  int64  
 10  Number of Survivors                  2540 non-null   float64
 11  Number of Females           

- We can see that the data has 15927 rows and 25 columns. We advise you to look into the codebook and get familiar with the different features.

# Data Preparation and Cleaning

- The Coordinates variable stores latitude and longitude as a single comma-separated string (latitude first). The next cell splits it into two numeric columns, lat and lon.

In [11]:
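# Split the "lat, lon" string into two separate columns
df_raw[['lat', 'lon']] = df_raw['Coordinates'].str.split(',', expand=True)
# Convert to floats; values that cannot be parsed become NaN
df_raw['lat'] = pd.to_numeric(df_raw['lat'], errors='coerce')
df_raw['lon'] = pd.to_numeric(df_raw['lon'], errors='coerce')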
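
A slightly more defensive variant of the split is sketched below on a toy sample. It is only an illustration: the `sample` frame and the `valid_coords` column are hypothetical names that are not part of the dataset, and the range check is an assumption about what counts as a plausible coordinate.

```python
import pandas as pd

# Toy sample mimicking the "lat, lon" strings in the Coordinates column
# (two values taken from the preview above, plus one missing entry).
sample = pd.DataFrame(
    {"Coordinates": ["31.650259, -110.366453", "59.1551, 28", None]}
)

# n=1 splits on the first comma only, so the result always has two columns.
parts = sample["Coordinates"].str.split(",", n=1, expand=True)

# Strip surrounding whitespace, then convert; unparseable values become NaN.
sample["lat"] = pd.to_numeric(parts[0].str.strip(), errors="coerce")
sample["lon"] = pd.to_numeric(parts[1].str.strip(), errors="coerce")

# Hypothetical sanity check: flag rows outside the valid coordinate ranges.
sample["valid_coords"] = (
    sample["lat"].between(-90, 90) & sample["lon"].between(-180, 180)
)
```

Rows with a missing or malformed Coordinates value end up with NaN in lat and lon and are flagged as not valid, which makes them easy to filter out before mapping.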

- Identify and handle missing values. Decide whether to fill them with data (e.g., mean, median) or remove the rows/columns entirely.

In [12]:
df_raw.isna().sum()

Main ID                                    0
Incident ID                                0
Incident Type                              0
Region of Incident                         0
Incident Date                             15
Incident Year                              0
Month                                      0
Number of Dead                           794
Minimum Estimated Number of Missing    14335
Total Number of Dead and Missing           0
Number of Survivors                    13387
Number of Females                      11951
Number of Males                         5732
Number of Children                     13104
Country of Origin                          8
Region of Origin                           1
Cause of Death                             0
Country of Incident                        0
Migration Route                         2478
Location of Incident                       0
Coordinates                                1
UNSD Geographical Grouping                 1
Informatio

- As you can see, some columns have a large number of missing values.
- The next cell removes the columns with more than 50% missing data (here: Minimum Estimated Number of Missing, Number of Survivors, Number of Females, and Number of Children). If you are specifically interested in those columns, work with df_raw instead.

In [8]:
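# Minimum number of non-null values a column needs in order to be kept (50% of the rows)
threshold = len(df_raw) * 0.5
# Drop every column that has fewer non-null values than the threshold
df_clean = df_raw.dropna(thresh=threshold, axis=1)
df_clean.isna().sum()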

Main ID                                0
Incident ID                            0
Incident Type                          0
Region of Incident                     0
Incident Date                         15
Incident Year                          0
Month                                  0
Number of Dead                       794
Total Number of Dead and Missing       0
Number of Males                     5732
Country of Origin                      8
Region of Origin                       1
Cause of Death                         0
Country of Incident                    0
Migration Route                     2478
Location of Incident                   0
Coordinates                            1
UNSD Geographical Grouping             1
Information Source                    13
URL                                 5644
Source Quality                         0
dtype: int64
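
To see how close each column is to the 50% cut-off, it can help to look at the share of missing values rather than the raw counts. The following is a minimal sketch on a hypothetical toy frame (`toy`, `missing_share`, and `cols_to_drop` are illustrative names). In the notebook you would apply the same calls to df_raw.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "Incident Year": [2014, 2014, 2015, 2015],
        "Number of Dead": [1.0, None, 3.0, 2.0],
        "Number of Survivors": [None, None, None, 5.0],
    }
)

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns with more than 50% missing values are candidates for removal.
cols_to_drop = missing_share[missing_share > 0.5].index.tolist()
toy_clean = toy.drop(columns=cols_to_drop)
```

Working with shares makes it easy to experiment with a different cut-off (for example 0.3 or 0.8) before deciding which columns to drop or impute.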

# Feature Engineering

- Create new columns that could be useful for analysis. For example, you could use outside data on countries of origin to derive additional features.
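
As a rough sketch of both ideas, the example below derives a date-based feature and joins a small outside table onto the incidents. Everything in it is an assumption made for illustration: the `incidents` and `country_info` frames are toy data, and `example_indicator` is a placeholder column, not a real statistic.

```python
import pandas as pd

# Hypothetical incident rows; in the notebook this would be df_clean.
incidents = pd.DataFrame(
    {
        "Incident Date": ["2014-01-06", "2014-01-16", None],
        "Country of Origin": ["Mexico", "Unknown", "Guatemala"],
    }
)

# Hypothetical outside data keyed by country name (placeholder values).
country_info = pd.DataFrame(
    {
        "Country of Origin": ["Mexico", "Guatemala"],
        "example_indicator": [1.0, 2.0],
    }
)

# Date-derived features: missing or unparseable dates become NaT.
incidents["incident_date"] = pd.to_datetime(
    incidents["Incident Date"], errors="coerce"
)
incidents["day_of_week"] = incidents["incident_date"].dt.day_name()

# A left join keeps every incident, even when no outside data matches.
enriched = incidents.merge(country_info, on="Country of Origin", how="left")
```

A left join is used so that incidents with an unmatched or unknown country of origin are kept, with NaN in the added columns, rather than silently dropped.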

# Exploratory Data Analysis (EDA)

- Utilize visualizations (histograms, scatter plots, box plots) to understand distributions, relationships, and potential outliers in the data.
- Use libraries like matplotlib, seaborn, and plotly to create visualizations.
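
Most plots start from a small aggregated table. The sketch below shows one possible way to prepare such tables with pandas. It uses a hypothetical toy frame with made-up numbers, so the results say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy rows that reuse column names from the dataset.
toy = pd.DataFrame(
    {
        "Incident Year": [2014, 2014, 2015, 2015, 2015],
        "Region of Incident": [
            "Europe", "North America", "Europe", "Europe", "North America",
        ],
        "Total Number of Dead and Missing": [1, 3, 2, 10, 1],
    }
)

# Total dead and missing per year: a natural input for a bar or line chart.
per_year = toy.groupby("Incident Year")["Total Number of Dead and Missing"].sum()

# Summary statistics per region hint at skew and outliers before plotting.
per_region = (
    toy.groupby("Region of Incident")["Total Number of Dead and Missing"]
    .describe()
)

# With matplotlib installed, per_year.plot(kind="bar") draws the yearly totals.
```

The same pattern applied to df_clean gives tables that can be passed directly to matplotlib, seaborn, or plotly.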


# Discussion Channels

- Join the discussion on our designated Discord channels to ask questions about the challenge: https://discord.gg/2V535gBRY4

# Ethical Considerations

- Remember to handle this data responsibly. Respect the sensitivity of the information and ensure your analyses do not harm individuals or communities.

# Idea Generation and Hypothesis Formation
- Based on your initial findings, brainstorm potential use cases or areas of interest for further exploration. This could range from visualizing the missing migrants numbers on an interactive map to building a scraper that finds new cases on the websites featured in the dataset.
- Formulate hypotheses or questions that you want your analysis to address.
- Have a look at the Global Migration Data Analysis Centre and their various projects to get inspired. 

In [13]:
# Now it's your turn to jump into the data!
# Happy hacking!