<a href="https://colab.research.google.com/github/newton-c/python_for_IC/blob/main/fill_in_missing_linear_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Interpolation
## How to fill in missing data between two points

We often have to deal with incomplete data. In general, we can either ignore the missing data or make some assumptions and fill it in. Deleting the data may feel like the safe option, but it can create bias or unnecessarily limit the usable data we have.

Let's say we have a country that conducts a census infrequently. Here we only have population data for the years 2010 and 2022. Our options are:
1. only use data for 2010 and 2022, throwing out everything else, like homicides and cocaine seizures for 2011-2021
2. use the 2010 population data until 2022, then the 2022 data until another census is conducted
3. assume the population changed steadily and estimate the values between 2010 and 2022.

If we go with option 1, we have to throw out most of our data and can't show broader trends. With option 2, we're making an assumption that doesn't align with what we know about population growth: populations don't stay flat for over a decade and then jump in a single year. Option 3 makes the assumption that is most likely to be accurate, and will usually be the best choice. We should, however, be transparent about which values were filled in.

In [1]:
from scipy.interpolate import interp1d

We'll start by defining the years and populations we have as lists, `years` and `pop`.

In [2]:
years = [2010, 2022]
pop = [233111, 303910]

Then we'll make an array, `missing_years`, with the years we want to fill in.

In [3]:
missing_years = [2011, 2012, 2013, 2014, 2015, 2016,
                 2017, 2018, 2019, 2020, 2021]

Now we can pass the known values in `years` and `pop` to the `interp1d` function, then loop through our missing years to estimate each missing value and add it to an `interp_pop` list.

In [4]:
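def estimate_pop(x, y):
  # Build the linear interpolator once from the known points
  y_interp = interp1d(x, y)
  interp_pop = []
  # Relies on the `missing_years` list defined above
  for missing_year in missing_years:
    # int() truncates the estimate to a whole number of people
    interp_pop.append(int(y_interp(missing_year)))
  return interp_pop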

In [5]:
duran_pop = estimate_pop(years, pop)
print(duran_pop)

[239010, 244910, 250810, 256710, 262610, 268510, 274410, 280310, 286210, 292110, 298010]
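One detail worth noticing in `estimate_pop` is that `int()` drops the fractional part rather than rounding to the nearest whole number. The sketch below reproduces the 2011 estimate by hand (plain arithmetic, not the SciPy call) to show the difference; whether truncation or rounding is preferable is a judgment call that the notebook does not address.

```python
# Same endpoints as the example above
x0, y0 = 2010, 233111
x1, y1 = 2022, 303910

# Value of the straight line at 2011, before any conversion to an integer
raw_2011 = y0 + (2011 - x0) * (y1 - y0) / (x1 - x0)

truncated = int(raw_2011)   # drops the fractional part, as estimate_pop does
rounded = round(raw_2011)   # nearest whole number

print(raw_2011, truncated, rounded)
```

For populations of this size, the one-person difference is far smaller than the uncertainty of the estimate itself.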


This method works by drawing a straight line between the two points we have. For a missing year, we read off the value of the line at that year and use it to fill in the population. Obviously, populations do not increase at exact, steady rates, but this method will often get us closest to the actual value when we simply don't have more complete, accurate data.
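The straight-line idea can also be written out directly, without SciPy. The following is a minimal sketch; the function name `linear_estimate` and its parameters are made up for illustration and are not part of the notebook or of any library.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Value at x of the straight line through (x0, y0) and (x1, y1)."""
    # Fraction of the way from x0 to x1, applied to the total change in y
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


# Known census points from the example above
estimates = [
    int(linear_estimate(2010, 233111, 2022, 303910, year))
    for year in range(2011, 2022)
]
print(estimates)
```

With the same endpoints and the same `int()` truncation, this should give the same values as the `interp1d` version.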

We can test some other values to see how the output looks.

In [6]:
pop = [14225966, 16635076]
not_duran_pop = estimate_pop(years, pop)
print(not_duran_pop)

pop = [14459077, 16938986]
ec_duran_pop = estimate_pop(years, pop)
print(ec_duran_pop)

[14426725, 14627484, 14828243, 15029002, 15229761, 15430521, 15631280, 15832039, 16032798, 16233557, 16434316]
[14665736, 14872395, 15079054, 15285713, 15492372, 15699031, 15905690, 16112349, 16319008, 16525667, 16732326]


If we have a bunch of different starting and ending points, say for the 10 biggest cities in a country from 2010 to 2022, we can arrange all the data as pairs in a list. We can then loop through the list to interpolate the data for all 10 cities.

In [7]:
all_pops = [[2350278, 2746403],[233111, 303910],[2242615, 2679722],
            [507687, 596101], [367323, 441583], [329296, 370664],
            [281747, 322925], [245128, 306309], [225961, 271145],
            [226769, 260882]]

for pop in all_pops:
  print(estimate_pop(years, pop))

[2383288, 2416298, 2449309, 2482319, 2515330, 2548340, 2581350, 2614361, 2647371, 2680382, 2713392]
[239010, 244910, 250810, 256710, 262610, 268510, 274410, 280310, 286210, 292110, 298010]
[2279040, 2315466, 2351891, 2388317, 2424742, 2461168, 2497594, 2534019, 2570445, 2606870, 2643296]
[515054, 522422, 529790, 537158, 544526, 551894, 559261, 566629, 573997, 581365, 588733]
[373511, 379699, 385888, 392076, 398264, 404453, 410641, 416829, 423018, 429206, 435394]
[332743, 336190, 339638, 343085, 346532, 349980, 353427, 356874, 360322, 363769, 367216]
[285178, 288610, 292041, 295473, 298904, 302336, 305767, 309199, 312630, 316062, 319493]
[250226, 255324, 260423, 265521, 270620, 275718, 280816, 285915, 291013, 296112, 301210]
[229726, 233491, 237257, 241022, 244787, 248553, 252318, 256083, 259849, 263614, 267379]
[229611, 232454, 235297, 238140, 240982, 243825, 246668, 249511, 252353, 255196, 258039]
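The printed lists above don't say which city each row belongs to, or which values are census counts and which are estimates. One way to keep that information, in line with the earlier point about being transparent, is sketched below. The labels `city_a` and `city_b` and the helper `fill_series` are hypothetical, since the notebook does not name the cities, and the interpolation is done with plain arithmetic rather than `interp1d`.

```python
# Hypothetical labels; only the population pairs come from the notebook
city_pops = {
    "city_a": (2350278, 2746403),
    "city_b": (233111, 303910),
}


def fill_series(start_year, end_year, start_pop, end_pop):
    """Return (year, population, is_estimate) rows for every year in the span."""
    rows = []
    span = end_year - start_year
    for year in range(start_year, end_year + 1):
        value = start_pop + (year - start_year) * (end_pop - start_pop) / span
        # Only the two endpoint years are real census counts
        is_estimate = year not in (start_year, end_year)
        rows.append((year, int(value), is_estimate))
    return rows


series = {
    name: fill_series(2010, 2022, p0, p1)
    for name, (p0, p1) in city_pops.items()
}

for name, rows in series.items():
    for year, population, is_estimate in rows:
        flag = "estimated" if is_estimate else "census"
        print(name, year, population, flag)
```

Keeping the flag alongside each value makes it easy to mark estimated years in a table or chart later.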
