# 1. Define the Problem

From https://machinelearningmastery.com/process-for-working-through-machine-learning-problems/

I like to use a three step process to define the problem. I like to move quickly and I use this mini process to see the problem from a few different perspectives very quickly:

* Step 1: What is the problem? Describe the problem informally and formally and list assumptions and similar problems.
* Step 2: Why does the problem need to be solved? List your motivation for solving the problem, the benefits a solution provides and how the solution will be used.
* Step 3: How would I solve the problem? Describe how the problem would be solved manually to flush domain knowledge.


## 1.1 What is the problem? 
Given a set of metrics organised by type and country, we need to predict a subset of these metrics for one year ahead and five years ahead.
The data is quite sparse and very few entries have values for the complete time range of 1972-2007. It also appears that metric names may not be consistent between countries. The dataset will require a lot of cleaning before we can use it.

## 1.2 Why does the problem need to be solved?

### 1.2.1 From a global perspective
* The United Nations Millennium Development Goals are a way of measuring the level of development of countries around the world in terms of important metrics such as poverty levels and female empowerment. It would be very useful to be able to predict when a given country might reach the levels the UN has set as targets.

### 1.2.2 From a study group perspective
* This dataset gives us a chance to work with time-series data (which we have not addressed as a group before)
* The dataset is quite sparse, with many missing values and possibly inconsistent fields. A successful result with this will give us badly needed experience with data cleaning.

## 1.3 How would I solve the problem?
TBD

## 1.4 Current status 24/4/2019
There are two files in the dataset:
* Training data, which gives the values of particular metrics (AKA series) for a given country for the years 1972-2007 (many rows have years missing)
* Submission data, which gives us the row IDs we need to predict one year (2008) and five years (2012) into the future

What we have learned so far
* The series name and series code appear to map one-to-one, so we can drop the series name with no loss of information (this still needs to be proven, see next steps)
* The series codes may not be consistent. Some may include a country code and others not. The names of the series also appear to vary between countries. It's likely we'll need to do some work to come up with a consistent set of series codes across the dataset.
* The first column in the training and submission sets is a row ID that we can use to join the two datasets
* If we consider only the joined dataset, then we have a much higher proportion of year data per series than in the dataset as a whole. We also only need to predict values for 737 series/country combinations out of the 195402 present in the training set.
* The 737 desired rows may not be enough to train a model, so we can't necessarily discard the other 194665 (195402-737) rows
* We can frame this as a regression problem: given N years of data for a metric, predict the next year's value
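
The regression framing in the last point can be sketched as follows. This is only an illustration under stated assumptions, not a method the group has settled on: the helper `make_windows`, the window length `n_years=3` and the toy series are all made up. Each training example uses N consecutive yearly values as features and the following year's value as the target, and any window containing a missing year is skipped.

```python
import math


def make_windows(values, n_years=3):
    """Turn one yearly series into (features, target) pairs.

    Hypothetical helper: each example uses `n_years` consecutive values
    as features and the next year's value as the target. Windows that
    contain a missing value (None or NaN) are skipped.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    X, y = [], []
    for start in range(len(values) - n_years):
        window = values[start:start + n_years + 1]
        if any(is_missing(v) for v in window):
            continue
        X.append(window[:-1])  # the N years we condition on
        y.append(window[-1])   # the year we want to predict
    return X, y


# Made-up series for one country/metric; NaN marks a missing year.
series = [1.0, 2.0, 3.0, float("nan"), 5.0, 6.0, 7.0, 8.0]
X, y = make_windows(series, n_years=3)
print(X, y)  # [[5.0, 6.0, 7.0]] [8.0]
```

With sparse series like ours, a single missing year removes every window that overlaps it, which is one reason the choice of N matters.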

Next steps
* Convert year values from columns to rows, i.e. reshape the wide 2D table into a long country/series/year dataset. This will make it easier to drop empty values
* Prove the correlation between series name and series code
* Split up the series code and determine whether it correlates with country or is orthogonal to it
* Work out what data can be dropped
* Try framing as a regression problem: how many years do we need?
* Are metrics independent? Can we train on a metric/country basis, or do we need some method for training per metric (aggregating across countries)?
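
The first two steps could look roughly like the sketch below. It runs on a made-up miniature of the training table (the column names follow the `df.head()` output further down; the values and the names `toy`, `long_df` and `one_to_one` are invented for illustration). "Correlation" between series name and series code is interpreted here as a one-to-one mapping, which is an assumption about what we want to prove.

```python
import pandas as pd

nan = float("nan")

# Made-up miniature of the training table, indexed by row ID.
toy = pd.DataFrame(
    {
        "2006 [YR2006]": [nan, 1.5, 3.0],
        "2007 [YR2007]": [3.8, 1.7, nan],
        "Country Name": ["Afghanistan", "Afghanistan", "Albania"],
        "Series Code": ["allsi.bi_q1", "allsp.bi_q1", "allsi.bi_q1"],
        "Series Name": ["Benefits A", "Benefits B", "Benefits A"],
    },
    index=[0, 1, 7],
)

id_cols = ["Country Name", "Series Code", "Series Name"]
year_cols = [c for c in toy.columns if c not in id_cols]

# Step 1: wide -> long. One row per (row ID, year); empty years are dropped.
long_df = (
    toy.rename_axis("row_id")
    .reset_index()
    .melt(
        id_vars=["row_id"] + id_cols,
        value_vars=year_cols,
        var_name="year",
        value_name="value",
    )
    .dropna(subset=["value"])
)
# "2007 [YR2007]" -> 2007
long_df["year"] = long_df["year"].str.slice(0, 4).astype(int)

# Step 2: check that each code has exactly one name and vice versa.
names_per_code = toy.groupby("Series Code")["Series Name"].nunique()
codes_per_name = toy.groupby("Series Name")["Series Code"].nunique()
one_to_one = bool((names_per_code == 1).all() and (codes_per_name == 1).all())

print(long_df)
print("one-to-one:", one_to_one)
```

If `one_to_one` turns out to be `False` on the real data, `names_per_code[names_per_code > 1]` would list the codes that need attention.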

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/TrainingSet.csv', index_col=0)
sr = pd.read_csv('data/SubmissionRows.csv', index_col=0)

In [3]:
df.head()

Unnamed: 0,1972 [YR1972],1973 [YR1973],1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],1980 [YR1980],1981 [YR1981],...,2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007],Country Name,Series Code,Series Name
0,,,,,,,,,,,...,,,,,,,3.769214,Afghanistan,allsi.bi_q1,(%) Benefits held by 1st 20% population - All ...
1,,,,,,,,,,,...,,,,,,,7.027746,Afghanistan,allsp.bi_q1,(%) Benefits held by 1st 20% population - All ...
2,,,,,,,,,,,...,,,,,,,8.244887,Afghanistan,allsa.bi_q1,(%) Benefits held by 1st 20% population - All ...
4,,,,,,,,,,,...,,,,,,,12.933105,Afghanistan,allsi.gen_pop,(%) Generosity of All Social Insurance
5,,,,,,,,,,,...,,,,,,,18.996814,Afghanistan,allsp.gen_pop,(%) Generosity of All Social Protection


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 195402 entries, 0 to 286117
Data columns (total 39 columns):
1972 [YR1972]    64945 non-null float64
1973 [YR1973]    64443 non-null float64
1974 [YR1974]    64966 non-null float64
1975 [YR1975]    66973 non-null float64
1976 [YR1976]    67717 non-null float64
1977 [YR1977]    69735 non-null float64
1978 [YR1978]    69763 non-null float64
1979 [YR1979]    69906 non-null float64
1980 [YR1980]    75250 non-null float64
1981 [YR1981]    78034 non-null float64
1982 [YR1982]    79016 non-null float64
1983 [YR1983]    78982 non-null float64
1984 [YR1984]    79532 non-null float64
1985 [YR1985]    81017 non-null float64
1986 [YR1986]    81455 non-null float64
1987 [YR1987]    82752 non-null float64
1988 [YR1988]    83242 non-null float64
1989 [YR1989]    86331 non-null float64
1990 [YR1990]    106955 non-null float64
1991 [YR1991]    106991 non-null float64
1992 [YR1992]    112243 non-null float64
1993 [YR1993]    114553 non-null float64
1994 

In [5]:
df.shape

(195402, 39)

In [6]:
sr.head()

Unnamed: 0,2008 [YR2008],2012 [YR2012]
559,,
618,,
753,,
1030,,
1896,,


In [7]:
sr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 737 entries, 559 to 285811
Data columns (total 2 columns):
2008 [YR2008]    0 non-null float64
2012 [YR2012]    0 non-null float64
dtypes: float64(2)
memory usage: 17.3 KB


In [8]:
sr.shape

(737, 2)
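
Because both files were loaded with `index_col=0`, the row ID is the index of both frames, so the rows we need to predict can be picked out with an index join. The sketch below uses made-up stand-ins (`toy_df`, `toy_sr`) rather than the real files, and the names `to_predict` and `missing_ids` are illustrative only.

```python
import pandas as pd

nan = float("nan")

# Made-up stand-ins for `df` and `sr`; only the shared row-ID index matters.
toy_df = pd.DataFrame(
    {
        "2007 [YR2007]": [3.8, 7.0, 8.2, 12.9],
        "Country Name": ["Afghanistan"] * 4,
    },
    index=[0, 1, 2, 4],
)
toy_sr = pd.DataFrame(
    {"2008 [YR2008]": [nan, nan], "2012 [YR2012]": [nan, nan]},
    index=[1, 4],
)

# Inner join on the index keeps only the rows we are asked to predict.
to_predict = toy_df.join(toy_sr, how="inner")

# Sanity check: every submission row ID should exist in the training table.
missing_ids = toy_sr.index.difference(toy_df.index)

print(to_predict)
print("submission IDs missing from training:", list(missing_ids))
```

On the real data the same join would be `df.join(sr, how="inner")`, and we would expect it to return 737 rows if every submission ID is present in the training set.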