# Linear Regression on the World Population

Now that we know how to clean and explore a dataset, we will train a simple linear regression model. We'll use the world population data from the Analyse Supplementary Exam.

## Imports

In [13]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression

In [14]:
df_pop = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/world_population.csv')

In [15]:
df_pop.head()

| | Country Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABW | 54211.0 | 55438.0 | 56225.0 | 56695.0 | 57032.0 | 57360.0 | 57715.0 | 58055.0 | 58386.0 | ... | 101353.0 | 101453.0 | 101669.0 | 102053.0 | 102577.0 | 103187.0 | 103795.0 | 104341.0 | 104822.0 | 105264.0 |
| 1 | AFG | 8996351.0 | 9166764.0 | 9345868.0 | 9533954.0 | 9731361.0 | 9938414.0 | 10152331.0 | 10372630.0 | 10604346.0 | ... | 27294031.0 | 28004331.0 | 28803167.0 | 29708599.0 | 30696958.0 | 31731688.0 | 32758020.0 | 33736494.0 | 34656032.0 | 35530081.0 |
| 2 | AGO | 5643182.0 | 5753024.0 | 5866061.0 | 5980417.0 | 6093321.0 | 6203299.0 | 6309770.0 | 6414995.0 | 6523791.0 | ... | 21759420.0 | 22549547.0 | 23369131.0 | 24218565.0 | 25096150.0 | 25998340.0 | 26920466.0 | 27859305.0 | 28813463.0 | 29784193.0 |
| 3 | ALB | 1608800.0 | 1659800.0 | 1711319.0 | 1762621.0 | 1814135.0 | 1864791.0 | 1914573.0 | 1965598.0 | 2022272.0 | ... | 2947314.0 | 2927519.0 | 2913021.0 | 2905195.0 | 2900401.0 | 2895092.0 | 2889104.0 | 2880703.0 | 2876101.0 | 2873457.0 |
| 4 | AND | 13411.0 | 14375.0 | 15370.0 | 16412.0 | 17469.0 | 18549.0 | 19647.0 | 20758.0 | 21890.0 | ... | 83861.0 | 84462.0 | 84449.0 | 83751.0 | 82431.0 | 80788.0 | 79223.0 | 78014.0 | 77281.0 | 76965.0 |


## Questions

### Question 1

The world population data spans from 1960 to 2017. We'd like to build a predictive model that can give us the best guess at what the future population of a particular country might be. To do this, we're going to hold out the 2017 column from our training data and use it to test the accuracy of our prediction.

First, however, we need to format our data so that sklearn's `LinearRegression` class can train on it. To do this, we will write a function that takes a country code as input and returns a 2-d numpy array containing the year and the measured population.

_**Function Specifications:**_
* Should take a `str` as input and return a numpy `array` type as output.
* The array should only have two columns containing the year and the population, in other words, it should have a shape `(?, 2)` where `?` is the length of the data.
* The values within the array should be of type `int`.

_**Hint:**_
You'll first need to filter the DataFrame on the given country code. Once you do this, you'll notice the DataFrame has the years as columns. You can use `pd.melt(df)` and slice your data to get to the right format, before converting to an array with `.to_numpy()` (older pandas versions offered `.get_values()`, which was removed in pandas 1.0).
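
The reshaping described in the hint can be tried on a small made-up table first. The sketch below is only an illustration: the frame `toy`, the country code `XYZ`, and its numbers are invented, and it passes `id_vars` to `pd.melt` so that the country code stays out of the melted rows, which is one of several ways to get the same result.

```python
import pandas as pd

# Hypothetical wide-format table: one row per country, one column per year.
toy = pd.DataFrame({
    'Country Code': ['XYZ'],
    '1960': [100.0],
    '1961': [110.0],
})

# Melt to long format; id_vars keeps the country code as its own column.
long_df = pd.melt(toy, id_vars='Country Code',
                  var_name='year', value_name='population')

# Keep year and population only, then convert to an integer numpy array.
arr = long_df[['year', 'population']].astype(float).astype(int).to_numpy()
print(arr)
```

Each row of `arr` is a `[year, population]` pair, which is the layout the function specification asks for.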

In [16]:
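def get_year_pop(code):
    # Keep only the row for the requested country
    df = df_pop[df_pop['Country Code'] == code]
    # Reshape from wide (one column per year) to long (one row per year)
    melted = pd.melt(df)
    # Drop the first row, which holds the 'Country Code' entry rather than a year
    v = melted[1:]

    # Build [year, population] pairs as integers
    pop = [[int(year), int(value)] for year, value in zip(v['variable'], v['value'])]

    return np.array(pop)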

In [17]:
get_year_pop('ABW')

array([[  1960,  54211],
       [  1961,  55438],
       [  1962,  56225],
       [  1963,  56695],
       [  1964,  57032],
       [  1965,  57360],
       [  1966,  57715],
       [  1967,  58055],
       [  1968,  58386],
       [  1969,  58726],
       [  1970,  59063],
       [  1971,  59440],
       [  1972,  59840],
       [  1973,  60243],
       [  1974,  60528],
       [  1975,  60657],
       [  1976,  60586],
       [  1977,  60366],
       [  1978,  60103],
       [  1979,  59980],
       [  1980,  60096],
       [  1981,  60567],
       [  1982,  61345],
       [  1983,  62201],
       [  1984,  62836],
       [  1985,  63026],
       [  1986,  62644],
       [  1987,  61833],
       [  1988,  61079],
       [  1989,  61032],
       [  1990,  62149],
       [  1991,  64622],
       [  1992,  68235],
       [  1993,  72504],
       [  1994,  76700],
       [  1995,  80324],
       [  1996,  83200],
       [  1997,  85451],
       [  1998,  87277],
       [  1999,  89005],
        ...
       [  2016, 104822],
       [  2017, 105264]])


_**Expected Outputs:**_
```python
get_year_pop('ABW')
```
> ```
array([[  1960,  54211],
       [  1961,  55438],
       [  1962,  56225],
        ...
       [  2016, 104822],
       [  2017, 105264]])
```

```python
get_year_pop('ABW').shape == (58, 2)
```
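
For comparison, the same array can be built without `pd.melt`. The sketch below is an alternative under stated assumptions, not the intended solution: the helper name `year_pop_array` is hypothetical, it takes the DataFrame as an explicit argument, and it assumes every column other than `Country Code` is a year. The toy data is invented.

```python
import numpy as np
import pandas as pd

def year_pop_array(df, code):
    """Hypothetical helper: return an (n_years, 2) integer array for one country."""
    # Select the country's row and drop the non-year column.
    row = df.loc[df['Country Code'] == code].drop(columns='Country Code').iloc[0]
    years = row.index.astype(int).to_numpy()    # column labels are the years
    population = row.to_numpy().astype(int)     # row values are the populations
    return np.column_stack([years, population])

# Invented data in the same wide layout as the population table.
toy = pd.DataFrame({
    'Country Code': ['XYZ', 'QQQ'],
    '1960': [100.0, 7.0],
    '1961': [110.0, 8.0],
    '1962': [125.0, 9.0],
})

result = year_pop_array(toy, 'XYZ')
print(result)
```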

### Question 2

Now that we have our data, we need to split it into the observations we will train on and the observations we will make our predictions on. In this case, we're splitting the values such that we train on all but the last year in our dataset. We also need to split our data into the predictive features (denoted `X`) and the response (denoted `y`).

Write a function that will take as input a 2-d numpy array and return four variables in the form of `(X_train, y_train), (X_test, y_test)`, where `(X_train, y_train)` are the features / response of the training set, and `(X_test, y_test)` are the features / response of the testing set.

_**Function Specifications:**_
* Should take a 2-d numpy `array` as input.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.
* `(X_test, y_test)` should contain only the last entry of the given input. Each should be returned as an `array`, not as a single value.

In [18]:
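def feature_response_split(arr):
    # Train on every row except the last: column 0 is the year, column 1 the population
    X_train = arr[:-1, 0]
    y_train = arr[:-1, 1]
    # Hold out the last row, wrapped in arrays rather than returned as scalars
    X_test = np.array([arr[-1, 0]])
    y_test = np.array([arr[-1, 1]])

    return (X_train, y_train), (X_test, y_test)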

In [19]:
data = get_year_pop('ABW')
feature_response_split(data)

((array([1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
         1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
         1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
         1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
         2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
         2015, 2016]),
  array([ 54211,  55438,  56225,  56695,  57032,  57360,  57715,  58055,
          58386,  58726,  59063,  59440,  59840,  60243,  60528,  60657,
          60586,  60366,  60103,  59980,  60096,  60567,  61345,  62201,
          62836,  63026,  62644,  61833,  61079,  61032,  62149,  64622,
          68235,  72504,  76700,  80324,  83200,  85451,  87277,  89005,
          90853,  92898,  94992,  97017,  98737, 100031, 100832, 101220,
         101353, 101453, 101669, 102053, 102577, 103187, 103795, 104341,
         104822])),
 (array([2017]), array([105264])))

_**Expected Outputs:**_
```python
data = get_year_pop('ABW')
feature_response_split(data)
```
> ```
((array([1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
         1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
         1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
         1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
         2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
         2015, 2016]),
  array([ 54211,  55438,  56225,  56695,  57032,  57360,  57715,  58055,
          58386,  58726,  59063,  59440,  59840,  60243,  60528,  60657,
          60586,  60366,  60103,  59980,  60096,  60567,  61345,  62201,
          62836,  63026,  62644,  61833,  61079,  61032,  62149,  64622,
          68235,  72504,  76700,  80324,  83200,  85451,  87277,  89005,
          90853,  92898,  94992,  97017,  98737, 100031, 100832, 101220,
         101353, 101453, 101669, 102053, 102577, 103187, 103795, 104341,
         104822])),
 (array([2017]), array([105264])))
 ```

### Question 3

Now that we have formatted our data, we can fit a model using sklearn's `LinearRegression()` class. We'll write a function that will take as input the features and response variables that we created in the last question, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `LinearRegression` model.
* The returned model should be fitted to the data.

_**Hint:**_
You may need to reshape the data within the function. You can use `.reshape(-1, 1)` to do this.
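
To see what the reshape in the hint does, the short sketch below turns a 1-d array of years into a single-column 2-d array, which is the `(n_samples, n_features)` layout that sklearn estimators expect for features. The values are made up for illustration.

```python
import numpy as np

years = np.array([1960, 1961, 1962])
print(years.shape)        # (3,)

# -1 lets numpy infer the number of rows; 1 fixes a single column.
X = years.reshape(-1, 1)
print(X.shape)            # (3, 1)
print(X.tolist())         # [[1960], [1961], [1962]]
```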

In [20]:
def train_model(X_train, y_train):
    # Fit a simple linear regression to the training set.
    # sklearn expects a 2-d feature matrix, so reshape the years into a single column.
    regressor = LinearRegression()
    model = regressor.fit(np.array(X_train).reshape(-1, 1), y_train.reshape(-1, 1))
    return model

In [21]:
data = get_year_pop('ABW')
(X_train, y_train), _ = feature_response_split(data)

train_model(X_train, y_train).predict([[2017]])

array([[104770.18984962]])

_**Expected Outputs:**_
```python
train_model(X_train, y_train).predict([[2017]]) == array([[104770.18984962]])
```
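
Exact `==` comparisons on floating-point results can fail because of tiny rounding differences between platforms or library versions. A tolerance-based check is one possible alternative; the sketch below simulates such a difference by adding `1e-9` to the value shown above and compares the two approaches.

```python
import numpy as np

predicted = np.array([[104770.18984962]])   # value shown in the output above
recomputed = predicted + 1e-9               # simulated tiny floating-point difference

exact_match = bool((predicted == recomputed).all())    # strict equality
close_match = bool(np.allclose(predicted, recomputed)) # equality within a tolerance

print(exact_match, close_match)
```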

### Question 4

We would now like to test our model using the testing data that we produced in Question 2. This test should give the residual sum of squares, defined as
$$
RSS = \sum_{i=1}^N (p_i - y_i)^2,
$$
where $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.
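
As a quick check of the formula, the sketch below computes the RSS by hand for three made-up predictions and observed values (the numbers are purely illustrative and unrelated to the population data).

```python
import numpy as np

p = np.array([3.0, 5.0, 7.5])   # hypothetical predictions p_i
y = np.array([2.0, 5.0, 9.0])   # hypothetical observed values y_i

residuals = p - y                      # [1.0, 0.0, -1.5]
rss = float(np.sum(residuals ** 2))    # 1.0 + 0.0 + 2.25
print(rss)                             # 3.25
```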

_**Function Specifications:**_
* Should take a trained model and two `arrays` as input. This will be the `X_test` and `y_test` variables from Question 2. 
* Should return the residual sum of squares over the input from the predicted values of `X_test` as compared to values of `y_test`.
* The output should be a `float` rounded to 2 decimal places.

In [22]:
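def test_model(model, X_test, y_test):
    # Predict on the test features; flatten both sides so that predictions of
    # shape (N, 1) are not broadcast against a y_test of shape (N,)
    predictions = model.predict(np.array(X_test).reshape(-1, 1)).ravel()
    rss = np.sum((predictions - np.array(y_test).ravel()) ** 2)
    # Return a plain Python float rounded to 2 decimal places
    return round(float(rss), 2)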

In [23]:
data = get_year_pop('ABW')
(X_train, y_train), (X_test, y_test) = feature_response_split(data)
lm = train_model(X_train, y_train)

test_model(lm, X_test, y_test)

243848.46

_**Expected Outputs:**_
```python
test_model(lm, X_test, y_test) == 243848.46
```
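
One thing to watch if the test set ever holds more than one entry: a model fitted on a 2-d response returns predictions of shape `(N, 1)`, and subtracting a 1-d `y_test` of shape `(N,)` from that broadcasts to an `(N, N)` matrix, which silently inflates the sum. The sketch below shows the effect with made-up numbers and how flattening the predictions avoids it.

```python
import numpy as np

preds = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like 2-d model predictions
y = np.array([1.0, 2.0, 4.0])             # shape (3,)

wrong = preds - y            # broadcasts to shape (3, 3)
right = preds.ravel() - y    # element-wise residuals, shape (3,)

print(wrong.shape)           # (3, 3)
print(right.shape)           # (3,)

rss = float(np.sum(right ** 2))   # 0.0 + 0.0 + 1.0
print(rss)                        # 1.0
```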