# IOT for Pandemics

This notebook is part of the [*Practical Data Science for IOT*](https://github.com/pablodecm/datalab_ml_iot) tutorial by Pablo de Castro.

## What can we do? (in addition to staying at home)

Given the ongoing COVID-19 pandemic (this was initially
written on the 25th of March in Spain), it is worth thinking about possible technological
solutions that could help mitigate or manage this crisis or future pandemics.

<br>

<div align="center">
  <img src="images/data_science_diagram.png" width="40%">
</div>


## Brainstorming Ideas

Let's think about different solutions that use IOT technologies and data science to help with the crisis, iterating over the following structure:



### WHY

The actual problem or challenge that we are trying to solve. <br>
Described in specific terms (not in general terms).  <br>
Could be sub-problems of a larger problem.

### HOW

Which technologies could be used to address the WHY. <br>
How will these technologies interact? Estimate human and economic costs.  <br>
Sketch the system components and how they play together.

### WHAT

Name or describe the solution. <br>
Could it really address the WHY? <br>
If yes, great! **Get feedback and/or try to build a PoC!** <br>
If not, do not worry, keep iterating!

## Be bold!

**Homework**: let's think individually or in groups about technological solutions based on Data Science and IOT that could help with the COVID-19 crisis, following the previous structure.



Send your ideas using this Google Form and we will discuss them tomorrow in class:

https://forms.gle/tZJh8hE3vLdwyyQW8

## Extra for tomorrow

Go to https://takeout.google.com/settings/takeout, download your own Location History data and keep it safe: in tomorrow's exercise we will use our own location data. If possible, try to get the location data of someone else as well, to study contact tracing between people.

<div align="center">
  <img src="images/google_takeout.png" width="50%">
</div>

## Example Idea


### WHY

The SARS-CoV-2 virus is very contagious due to the combined effect of a long incubation period, a large fraction of infected people developing only mild symptoms while still being contagious, and a high survivability on surfaces and in air droplets.

Uncontrolled transmission in the population can cause rapid growth, with an associated high mortality within risk groups, and easily overwhelm health systems. Strong confinement and social distancing seem to be the only effective way to stop the rapid spread once it is already out of control.

While country-wide confinement is required in the short term to stop the current transmission waves, it might not be sustainable in the long term from a social and economic perspective (vaccine production at scale is probably years away, and treatments are likely to help but not to be a definitive solution).

**Without confinement, even with massive testing, it is really hard to track the virus transmission chain at scale, i.e. to find out who is likely to have been infected by someone who has just tested positive. Better ways to trace transmissions within the population
would allow more targeted testing campaigns and containment through small-group confinement.**






### HOW

Modern technology is likely to help with the problem of tracking the transmission chain. We need a way to register potential virus transmissions between people so it could be used to trace the graph of possible infections once someone has tested positive.

This could be done with a powerful surveillance infrastructure, but most countries lack one and building it at the required scale is neither possible nor desirable. Alternatively, the solution could be built on people knowingly carrying personal devices, either voluntarily or as a requirement in public spaces until the crisis is under control.

Large fractions of the world population already own an internet-capable, sensor-rich smartphone, and additional smartphone-like devices could also be provided at scale to those who do not.

**A transmission trace system at national or multi-national scale can be based on location (e.g. cell phone tower triangulation or A-GPS logs), close distance between peers (e.g. ultrasound or Bluetooth) or a combination of all these technologies. Data already collected by companies could also be reused. A secure and scalable data collection and analysis infrastructure could be built rapidly in the cloud and managed by trusted parties.**
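
To make the idea of "tracing the graph of possible infections" more concrete, here is a minimal sketch in plain Python. It assumes that the system has already produced a list of contact pairs; the names in `contacts` and the helper functions `build_contact_graph` and `people_within_hops` are hypothetical and only serve as illustration.

```python
from collections import deque

# hypothetical contact log: each pair was registered as a close contact
contacts = [("ana", "luis"), ("luis", "marta"),
            ("marta", "pablo"), ("eva", "juan")]

def build_contact_graph(pairs):
    # undirected graph stored as {person: set of contacts}
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def people_within_hops(graph, start, max_hops):
    # breadth-first search: {person: number of hops from start}
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(person, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[person] + 1
                queue.append(neighbour)
    del hops[start]  # the positive case itself is not a contact
    return hops

graph = build_contact_graph(contacts)
# suppose "ana" tests positive: who should be tested first?
to_test = people_within_hops(graph, "ana", max_hops=2)
print(to_test)
```

Limiting the number of hops is one possible way to prioritise tests: direct contacts first, then contacts of contacts.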



### WHAT

A contact tracing system based on data collection by smartphones and other personal devices could be built in a short amount of time and could potentially address the problem of tracing the transmission chains, which in turn can be used to direct testing campaigns and enforce small-group confinement, keeping most societal and economic activity intact.

While promising, there are also unexplored concerns regarding data protection and privacy, the different capabilities of the technologies mentioned, whether usage should be enforced or voluntary usage would suffice, and how to put together the organisational, human and economic resources needed to create an effective solution in a short time.

**The best way to resolve some of these uncertainties is to iteratively build proof of concept (PoC) examples of the system. We are going to build a basic PoC of a tracing system based on Google Location History in the rest of this document.**


### Google Location History PoC

For this part we are going to use your own Google Location History (and optionally someone else's, if they have given you access to theirs).

The first step is to copy your own location history to the folder `google_location_history_data` in this directory.

In [None]:
!ls -lrth google_location_history_data/

In [None]:
import io
import os
import zipfile
from pathlib import Path
from typing import Dict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

loc_data_dir = Path("google_location_history_data/")

In [None]:
zip_file_list = list(loc_data_dir.glob("*.zip"))

zip_file_list

In [None]:
# inspect the contents of the first zip file found
# (index 0 also works when there is only one file)
zf = zipfile.ZipFile(zip_file_list[0])
zf.namelist()

In [None]:
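good_file_names = ['Takeout/Historial de ubicaciones/Historial de ubicaciones.json',
                   'Takeout/Location History/Location History.json']

# this is an example of how to read from a zipped file
# without decompressing it to disk
def load_dataframe_from_file(file_path: Path) -> pd.DataFrame:
    
    with zipfile.ZipFile(file_path) as zip_file:
        for file_name in zip_file.namelist():
            # the file name depends on the language of the Google account
            if file_name in good_file_names:
                return pd.read_json(io.BytesIO(zip_file.read(file_name)))
    # fail loudly instead of silently returning None
    raise FileNotFoundError(f"no location history JSON found in {file_path}")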
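
If you do not have a Takeout archive at hand, the same "read without decompressing" idea can be tried on a small archive created in memory. This is only a sketch using the standard library: the single location record is made up, and the archive layout simply mimics the English file name listed above.

```python
import io
import json
import zipfile

member_name = "Takeout/Location History/Location History.json"

# build a tiny zip archive in memory with one made-up location record
fake_history = {"locations": [{"timestampMs": "1585162800000",
                               "latitudeE7": 416523000,
                               "longitudeE7": -47245000,
                               "accuracy": 20}]}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf_out:
    zf_out.writestr(member_name, json.dumps(fake_history))

# read the JSON member directly from the archive, nothing is
# extracted to disk
buffer.seek(0)
with zipfile.ZipFile(buffer) as zf_in:
    with zf_in.open(member_name) as member:
        data = json.load(member)

n_locations = len(data["locations"])
print(n_locations, data["locations"][0])
```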


In [None]:
# let's load the data from one file to check that it works
df = load_dataframe_from_file(zip_file_list[0])

df.head()

In [None]:
# check column types
df.info()

In [None]:
df.loc[0,"locations"]

This is from the Google Location History guide:

```
The JSON Location History file describes device location signals and associated metadata collected while you were opted into Location History which you have not subsequently deleted.

locations: All location records.
timestampMs(int64): Timestamp (UTC) in milliseconds for the recorded location.
latitudeE7(int32): The latitude value of the location in E7 format (degrees multiplied by 10**7 and rounded to the nearest integer).
longitudeE7(int32): The longitude value of the location in E7 format (degrees multiplied by 10**7 and rounded to the nearest integer).
accuracy(int32): Approximate location accuracy radius in meters.
velocity(int32): Speed in meters per second.
heading(int32): Degrees east of true north.
altitude(int32): Meters above the WGS84 reference ellipsoid.
verticalAccuracy(int32): Vertical accuracy calculated in meters.
activity: Information about the activity at the location.
timestampMs(int64): Timestamp (UTC) in milliseconds for the recorded activity.
type: Description of the activity type.
confidence(int32): Confidence associated with the specified activity type.
```
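
As a quick check of the E7 and millisecond conventions described in the guide, the sketch below decodes a single record with the standard library only. The record values are made up and `decode_record` is a hypothetical helper, not part of any Google tooling.

```python
from datetime import datetime, timezone

# made-up record that follows the field names of the guide
record = {"timestampMs": "1585162800000",
          "latitudeE7": 416523000,
          "longitudeE7": -47245000,
          "accuracy": 20}

def decode_record(rec):
    return {
        # milliseconds since the epoch -> timezone-aware UTC datetime
        # (int() accepts both strings and integers)
        "time_utc": datetime.fromtimestamp(int(rec["timestampMs"]) / 1000,
                                           tz=timezone.utc),
        # E7 format: degrees multiplied by 10**7
        "latitude": rec["latitudeE7"] / 1e7,
        "longitude": rec["longitudeE7"] / 1e7,
        "accuracy_m": rec["accuracy"],
    }

decoded = decode_record(record)
print(decoded)
```

Keeping the timestamp explicitly in UTC avoids ambiguity when combining location histories recorded in different time zones.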


In [None]:
# we have to extract the different fields in the dict
# to create useful analysis variables

def timestamp_from_location_dict(location_dict: Dict,
                                 field_name: str = "timestampMs"
                                ) -> pd.Timestamp:
    # the timestamp is given in milliseconds; fromtimestamp
    # returns a naive timestamp in the local time zone
    timestamp_sec = int(location_dict[field_name])/1000.
    return pd.Timestamp.fromtimestamp(timestamp_sec)

def coordinate_from_location_dict(location_dict: Dict,
                                  coord_name: str) -> float:
    # E7 format: degrees multiplied by 10**7
    return location_dict[coord_name]/10.**7

def format_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    
    if "locations" in df:
        df["timestamp"] = df["locations"].map(
            timestamp_from_location_dict)
        df["latitude"] =  df["locations"].map(
            lambda l_d: coordinate_from_location_dict(l_d, 'latitudeE7'))
        df["longitude"] =  df["locations"].map(
            lambda l_d: coordinate_from_location_dict(l_d, 'longitudeE7'))

        df["accuracy"] = df["locations"].map(
            lambda l_d: l_d['accuracy'])

        del df["locations"]
    
    return df

In [None]:
# let's test if the formatting worked
format_dataframe(df)

In [None]:
available_dfs = {}

for zip_file in zip_file_list:
    df = load_dataframe_from_file(zip_file)
    available_dfs[zip_file.stem] = format_dataframe(df)
    print(f"{zip_file.stem}")
    print(f"  - latest data {df.timestamp.max()}")
    print(f"  - n entries total {len(df)}")

In [None]:
df.describe()

In [None]:
# filter out inaccurate data: "accuracy" is a radius in meters,
# so larger values mean less precise locations
min_accuracy = 30
for name in available_dfs:
    df = available_dfs[name]
    low_accuracy_filter = df.accuracy > min_accuracy
    n_entries_removed = low_accuracy_filter.sum()
    frac_entries_removed = n_entries_removed/float(len(df))
    # copy so that adding columns later does not act on a view
    available_dfs[name] = df.loc[~low_accuracy_filter].copy()
    print(f"{name} removed {n_entries_removed} ({frac_entries_removed:.1%}) entries")

In [None]:
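fig, ax = plt.subplots()

ax.set_title("Histograms of accuracy")
ax.set_xlabel("accuracy radius [m]")

bins =  np.linspace(0, min_accuracy, 16)
for name, df in available_dfs.items():
    # one normalised histogram per location history
    ax.hist(df.accuracy, bins=bins, density=True, histtype='step', label=name)

ax.legend()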

In [None]:
df.timestamp.tail(10)

In [None]:
# use strftime('%W') (weeks start on Monday) so that the week number
# can be parsed back later with the same '%W' format code
df["weekofyear"] = df.timestamp.dt.strftime('%W').astype(int)
df["year"] = df.timestamp.dt.year

In [None]:
df.head()

In [None]:
# how many datapoints per year
df.groupby(["year"]).count()

In [None]:
# we can also groupby two columns
df.groupby(["year", "weekofyear"]).count()

In [None]:
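# trick to generate the location history of many individuals
# from only a few: treat each week of data as a different person

# we will make all the data correspond to the week starting
# on Monday 2020-03-23 (year 2020, week 12, weekday 1)
fake_start_week = pd.to_datetime('2020121', format='%Y%W%w')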

many_df = {}
for name in available_dfs:
    
    df = available_dfs[name]
    df["weekofyear"] = df.timestamp.dt.strftime('%W').astype(int)
    df["year"] = df.timestamp.dt.year

    grouped_df = df.groupby(["year", "weekofyear"])
    for keys, group_df in grouped_df:
        year, weekofyear = keys
        
        # shift datetime so all the data start on fake_start_week
        date_str = f"{year:04d}{weekofyear:02d}"
        week_date = pd.to_datetime(date_str + '1', format='%Y%W%w')
        shift = fake_start_week - week_date
        new_df =  group_df.copy()
        new_df["timestamp"] = group_df.timestamp + shift
        
        # create new dataframe (use timestamp as index)
        new_name =  f"{name}_{year}_{weekofyear}"
        del new_df["weekofyear"]
        del new_df["year"]
        many_df[new_name] = new_df.set_index("timestamp")
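
The week-shifting trick can be checked on a single timestamp with the standard library, which uses the same `%Y`, `%W` and `%w` format codes. This is only a sketch: `week_start` and `shift_to_week` are hypothetical helper names and the example date is arbitrary.

```python
from datetime import datetime

def week_start(year, week):
    # Monday (weekday 1) of the given %W week of the given year
    return datetime.strptime(f"{year:04d} {week:02d} 1", "%Y %W %w")

def shift_to_week(ts, target_start):
    # move ts by the offset between the Monday of its own week
    # and the Monday of the target week
    source_start = week_start(ts.year, int(ts.strftime("%W")))
    return ts + (target_start - source_start)

target = week_start(2020, 12)            # same week as fake_start_week
original = datetime(2019, 7, 10, 8, 30)  # a Wednesday morning in 2019
shifted = shift_to_week(original, target)

print(target, original, shifted)
```

Because the shift is always the difference between two Mondays, the weekday and the time of day of every point are preserved.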

In [None]:
# we now have the data equivalent to several people during
# last week
len(many_df)

In [None]:
# it might be easier to use ordered lists
many_df_list = list(many_df.values())
many_df_names = list(many_df.keys())

In [None]:
import folium
import random
import matplotlib.colors as mcolors
import itertools

cycle_colors = itertools.cycle(mcolors.TABLEAU_COLORS.values())

coord_names = ["latitude", "longitude"]

location = (41.6523, -4.7245)

# sample up to 10 location histories (fewer if not enough are available)
example_loc_hists = random.sample(many_df_list, min(10, len(many_df_list)))

m = folium.Map(location=location,
               zoom_start=1)

for example_df, color in zip(example_loc_hists,
                             cycle_colors):
    
    points = example_df[coord_names].values
    folium.PolyLine(points, color=color).add_to(m)
    
    print(example_df.index.min(), example_df.index.max())

m

In [None]:
df = example_loc_hists[0]
df.head()

In [None]:
df_resample = df.resample('1min').mean()
df_interp = df_resample.interpolate('time')
df_interp.head()
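
To see what the time interpolation does, here is a plain Python sketch that places a short made-up track on a regular 1-minute grid using linear interpolation. It is not the pandas implementation: it skips the per-minute averaging done by `resample(...).mean()`, and `interpolate_track` is a hypothetical helper that expects at least two points with strictly increasing times.

```python
from datetime import datetime, timedelta

def interpolate_track(points, step=timedelta(minutes=1)):
    # points: list of (time, latitude, longitude) sorted by time
    grid = []
    last = len(points) - 1
    i = 0
    t = points[0][0]
    while t <= points[last][0]:
        # advance to the segment [points[i], points[i + 1]] containing t
        while i < last - 1 and points[i + 1][0] < t:
            i += 1
        (t0, lat0, lon0), (t1, lat1, lon1) = points[i], points[i + 1]
        frac = (t - t0) / (t1 - t0)  # fraction of the segment elapsed
        grid.append((t,
                     lat0 + frac * (lat1 - lat0),
                     lon0 + frac * (lon1 - lon0)))
        t += step
    return grid

# made-up track: a move lasting 4 minutes followed by 2 minutes at rest
track = [(datetime(2020, 3, 25, 10, 0), 41.0, -4.0),
         (datetime(2020, 3, 25, 10, 4), 41.4, -4.8),
         (datetime(2020, 3, 25, 10, 6), 41.4, -4.8)]

minute_track = interpolate_track(track)
for row in minute_track:
    print(row)
```

Having every person on the same time grid is what later allows comparing positions at a given minute.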

In [None]:
# we can apply the interpolation to all the dataframes
# to unify the treatment

# to keep only the detailed data we are going to consider
# only cases with more than 1000 rows
min_n_rows = 1000

formated_df_dict = {}
for name, df in many_df.items():
    if len(df) > min_n_rows:
        # average within each minute and fill the gaps by
        # interpolating linearly in time
        df_resample = df.resample('1min').mean()
        df_interp = df_resample.interpolate('time')
        
        formated_df_dict[name] = df_interp
    

In [None]:
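# show the last formatted dataframe (the last `name` of the loop
# may have been skipped by the row-count cut)
list(formated_df_dict.values())[-1]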

In [None]:
# create a single dataframe with one index level per person
single_df = pd.concat(formated_df_dict, names=["person_id"])
single_df

In [None]:
# this is a more useful grouping
time_single_df = single_df.swaplevel().sort_index()
time_single_df

In [None]:
example_time = "2020-03-25 19:00:00"
example_time_df = time_single_df.loc[example_time, coord_names]
example_time_df

In [None]:
from haversine import haversine_vector, Unit

coord_arr = example_time_df.values
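# comb=True computes the distance between all pairs of points
# (available in recent versions of the haversine package)
dist_matrix = haversine_vector(coord_arr, coord_arr, unit=Unit.METERS, comb=True)
# np.bool was removed from NumPy, the builtin bool is equivalent
upper_diag_filter = np.triu(np.ones_like(dist_matrix, dtype=bool), k=1)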
closer_than_matrix = dist_matrix < 60.0
print("possible pairs: ", np.sum(upper_diag_filter))
closer_pair_mask = upper_diag_filter & closer_than_matrix
print("closer pairs: ", np.sum(closer_pair_mask))
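
The pair search above can also be written without any external package, which may help to understand what the distance matrix contains. The sketch below uses the haversine formula from the standard `math` module; the three positions are made up and `haversine_m` is a hypothetical helper. A plain loop over pairs like this one is only reasonable for a small number of people.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters

def haversine_m(p, q):
    # great-circle distance in meters between two (lat, lon) points
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2)
         * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# made-up positions of three people at the same minute
positions = {"a": (41.65230, -4.72450),
             "b": (41.65250, -4.72460),
             "c": (41.70000, -4.80000)}

max_distance_m = 60.0  # same threshold as in the cell above
close_pairs = [(i, j)
               for i, j in combinations(sorted(positions), 2)
               if haversine_m(positions[i], positions[j]) < max_distance_m]
print(close_pairs)
```

Using `combinations` visits each unordered pair once, which plays the same role as the upper-triangle mask of the distance matrix.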

In [None]:
edges = np.array(np.where(closer_pair_mask)).T

edges

In [None]:
!pip install networkx

In [None]:
import networkx as nx

fig, ax = plt.subplots(figsize=(24,24))
gr = nx.Graph()
gr.add_edges_from(edges)
pos=nx.spring_layout(gr, k=0.3)
nx.draw_networkx(gr, pos=pos, node_size=200,ax=ax)

In [None]:
from folium.plugins import HeatMap

# calculate the midpoint of each close pair
centroid_coord = (coord_arr[edges[:,0]] + coord_arr[edges[:,1]])/2.

m = folium.Map(location=location,
               zoom_start=6) 

HeatMap(data=centroid_coord.tolist()).add_to(m)
m

In [None]:
# now suppose that someone has tested positive on the 29th
# find who else could he/she have infected last week by being in contact
# and where it could have happened

infected_person = random.sample(list(time_single_df.index.get_level_values("person_id").unique()),1)
print(infected_person)
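
One possible way to continue from here is sketched below in plain Python. It assumes that the close-pair search has already been repeated for every minute of the week and stored as a list of contact events; the `events` list, the person names and the `contacts_of` helper are all hypothetical.

```python
from collections import Counter
from datetime import datetime

# hypothetical contact events: (minute, person_a, person_b), one entry
# for each minute in which two people were closer than the threshold
events = [
    (datetime(2020, 3, 24, 9, 0), "p1", "p2"),
    (datetime(2020, 3, 24, 9, 1), "p1", "p2"),
    (datetime(2020, 3, 24, 9, 2), "p2", "p1"),
    (datetime(2020, 3, 26, 18, 30), "p3", "p1"),
    (datetime(2020, 3, 27, 12, 0), "p2", "p4"),
    (datetime(2020, 3, 15, 12, 0), "p1", "p5"),  # before the window
]

def contacts_of(events, infected, start, end, min_minutes=1):
    # count the minutes that each person spent close to `infected`
    minutes = Counter()
    for ts, a, b in events:
        if not (start <= ts <= end):
            continue
        if a == infected:
            minutes[b] += 1
        elif b == infected:
            minutes[a] += 1
    # keep only contacts that lasted long enough
    return {p: n for p, n in minutes.items() if n >= min_minutes}

week_start = datetime(2020, 3, 23)
test_date = datetime(2020, 3, 29)
all_contacts = contacts_of(events, "p1", week_start, test_date)
long_contacts = contacts_of(events, "p1", week_start, test_date,
                            min_minutes=2)
print(all_contacts, long_contacts)
```

The `min_minutes` cut is an arbitrary illustration of how short, probably irrelevant encounters could be discarded; a real system would need an epidemiologically motivated criterion.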