# Covid-19 Infection Sources and Tracking Technology Analysis

In [1]:
from jsonfinder import jsonfinder
import urllib.request
import pandas as pd

## Read data from DDC
The data published as government open data contains very little information about each case, so we scrape the network chart on the DDC (Department of Disease Control) COVID-19 public report website instead. The chart gives more detail about each case and how the person was infected.

In [2]:
with urllib.request.urlopen('https://covid19.ddc.moph.go.th/th/network') as fp:
    mybytes = fp.read()
    content = mybytes.decode("utf8")

In [3]:
l = []
for _, __, obj in jsonfinder(content, json_only=True):
    if len(obj) > 0 and isinstance(obj, list):
        if isinstance(obj[0], dict):
            l += obj

In [4]:
len(l)

725
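
To make the extraction step concrete, here is a minimal stdlib-only sketch of the same idea: scan the page text for embedded JSON arrays and keep those whose first element is an object. It is not how `jsonfinder` is implemented; the helper name `find_lists_of_dicts` and the sample HTML string are made up for illustration.

```python
import json

def find_lists_of_dicts(text):
    """Collect the dicts from every embedded JSON array of objects in text."""
    decoder = json.JSONDecoder()
    rows = []
    pos = 0
    while True:
        start = text.find('[', pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value starting at `start`
            # and returns it together with the index where it ended.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1
            continue
        if isinstance(obj, list) and obj and isinstance(obj[0], dict):
            rows += obj
        pos = end
    return rows

# Hypothetical page fragment, not taken from the DDC site.
sample_html = (
    '<script>var nodes = [{"id": 1, "rid": "5"}, {"id": 2, "rid": "6-7"}];'
    ' var sizes = [1, 2];</script>'
)
rows = find_lists_of_dicts(sample_html)
print(len(rows))
```

Arrays of plain numbers, such as `sizes` in the sample, are parsed but discarded, which mirrors the `isinstance(obj[0], dict)` check in the cell above.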

## EDA using pandas

There are four row types:
- case information (status != 1)
  - rid: report id of the new case (a single case, or a range covering multiple cases)
  - detail_th, detail_en: description of the case
- location (location_from_id > 0 and report_name is N/A)
  - location_from_id: location reference id
  - from_name, from_name_en: name of the location
  - Note that a "location" can be a specific place, a country, an area, or even a close contact
- link from case to case, i.e. direct contact (report_from_id > 0)
  - report_name: report id of the patient who was in contact with an existing case
- link from case to location (location_from_id > 0 and report_name is not N/A)
  - report_name: report id of the patient linked to the location
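
The rules above can be sketched as a small classifier in plain Python. This is only an illustration: the function name `row_kind`, the order in which the rules are checked, and the sample rows (including their `status` values) are assumptions, not something confirmed by the DDC data.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing, similar to pandas' isnull()."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def is_positive(value):
    return not is_missing(value) and value > 0

def row_kind(row):
    # Assumption: link and location rules are checked before the case rule.
    if is_positive(row.get('report_from_id')):
        return 'case_to_case_link'
    if is_positive(row.get('location_from_id')):
        if is_missing(row.get('report_name')):
            return 'location'
        return 'case_to_location_link'
    if row.get('status') != 1:
        return 'case'
    return 'other'

# Hypothetical rows, not taken from the DDC data.
rows = [
    {'rid': '12', 'status': 2, 'detail_en': 'Returned from abroad'},
    {'location_from_id': 5, 'from_name_en': 'Japan', 'report_name': None, 'status': 1},
    {'report_from_id': 3, 'report_name': '12', 'status': 1},
    {'location_from_id': 5, 'report_name': '12', 'status': 1},
]
kinds = [row_kind(r) for r in rows]
print(kinds)
```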

In [5]:
df = pd.DataFrame(l)

In [6]:
df.columns

Index(['id', 'rid', 'confirm_at', 'detail_th', 'detail_en', 'admin_id',
       'created_at', 'updated_at', 'report_id', 'location_from_id',
       'location_to_id', 'status', 'report_from_id', 'icon_id', 'report_name',
       'from_refid', 'from_name', 'from_name_en', 'to_name', 'to_name_en',
       'image'],
      dtype='object')

### Understand the mapping between location_from_id and its meaning

In [7]:
df[(df.location_from_id > 0) & (df.report_name.isnull())][['location_from_id', 'from_name']]

,location_from_id,from_name
683,1.0,สัมผัสผู้ป่วยไม่ระบุเคส
684,2.0,บุคลากรการแพทย์
685,3.0,ต่างประเทศ (ไม่ระบุ)
686,4.0,ไม่พบข้อมูล
687,5.0,ญี่ปุ่น
688,6.0,จีน
689,7.0,ไทย
690,8.0,เกาหลีใต้
691,40.0,เยอรมัน
692,9.0,สวิตเซอแลนด์

In English, the names above are: 1 = contact with a patient (case not specified), 2 = medical personnel, 3 = abroad (unspecified), 4 = no information found, 5 = Japan, 6 = China, 7 = Thailand, 8 = South Korea, 40 = Germany, 9 = Switzerland.


### Data transformation: grouping location codes into more meaningful types
We save these types in the 'type' column of the case information rows. The types include:
- 'closed_contacts': infected through close contact with an infected patient
- 'medical_services': infected while working in a medical facility
- 'abroad': arrived from abroad
- 'unknown': source of infection unknown
- 'local': infected in the community with no specific location identified (e.g. only the province is known)
- 'crowded': infected at a known, crowded place
- 'airports': infected while working at an airport
- 'public_transports': infected on public transport (e.g. taxi drivers)
- 'traveller': infected by a traveller from overseas
- 'friend_family': linked directly to another reported case (assigned from the case-to-case links)
- 'in_progress': still under investigation

In [8]:
location_mapping = {
    'closed_contacts': [1],
    'medical_services': [2],
    'abroad': [3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 34, 35, 40, 43, 45, 48],
    'unknown': [4],
    'local': [7, 21, 24, 25, 27, 41],
    'crowded': [26, 31, 32, 33, 39, 42, 44],
    'airports': [30],
    'public_transports': [46],
    'traveller': [47]
}
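
Since each location code should belong to exactly one group, it can help to invert the mapping into a code-to-type lookup and check for duplicates. The sketch below copies the dictionary from the cell above; the names `code_to_type` and `duplicates` are introduced here only for illustration.

```python
location_mapping = {
    'closed_contacts': [1],
    'medical_services': [2],
    'abroad': [3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
               22, 34, 35, 40, 43, 45, 48],
    'unknown': [4],
    'local': [7, 21, 24, 25, 27, 41],
    'crowded': [26, 31, 32, 33, 39, 42, 44],
    'airports': [30],
    'public_transports': [46],
    'traveller': [47],
}

code_to_type = {}
duplicates = []
for case_type, codes in location_mapping.items():
    for code in codes:
        if code in code_to_type:
            # The same code listed under two types would make the result
            # depend on loop order, so record it.
            duplicates.append(code)
        code_to_type[code] = case_type

print(code_to_type[40], duplicates)
```

A lookup like this also makes it easy to spot codes that appear in the data but are missing from the mapping.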

In [9]:
df['type'] = ''

In [10]:
for loc, codes in location_mapping.items():
    # report ids of the cases linked to any location code in this group
    linked = df[df.location_from_id.isin(codes) & df.report_name.notnull()].report_name.tolist()
    df.loc[df.rid.isin(linked), 'type'] = loc

In [11]:
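# cases with a direct link to another reported case; this overrides any location-based type set above
fl = df[df.report_from_id > 0].report_name.tolist()
df.loc[df.rid.isin(fl), 'type'] = 'friend_family'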

In [12]:
df.loc[df.detail_en.str.find('On examination process') >= 0, 'type'] = 'in_progress'

In [13]:
df.loc[df.detail_en.str.lower().str.find('medical') >= 0, 'type'] = 'medical_services'

In [14]:
df.loc[df.detail_en.str.lower().str.find('crowded') >= 0, 'type'] = 'crowded'

In [15]:
# 'แออัด' means "crowded"; only rows that still have no type are filled in
df.loc[(df.detail_th.str.find('แออัด') >= 0) & (df.type == ''), 'type'] = 'crowded'

In [16]:
# 'ใกล้ชิด' means "close", as in close contact
df.loc[(df.detail_th.str.find('ใกล้ชิด') >= 0) & (df.type == ''), 'type'] = 'closed_contacts'

In [17]:
# 'กลับจาก' = "returned from", 'มาจาก' = "came from", 'เดินทางจาก' = "travelled from"
df.loc[((df.detail_th.str.find('กลับจาก') >= 0) | (df.detail_th.str.find('มาจาก') >= 0) | (df.detail_th.str.find('เดินทางจาก') >= 0)) & (df.type == ''), 'type'] = 'abroad'

In [18]:
# 'ไม่ทราบสาเหตุ' means "cause unknown"
df.loc[(df.detail_th.str.find('ไม่ทราบสาเหตุ') >= 0) & (df.type == ''), 'type'] = 'unknown'
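
The Thai keyword cells act as a fallback: each one only fills rows whose type is still empty, so the first matching keyword wins. A plain-Python sketch of that rule follows. The function name `fallback_type` and the sample sentences are hypothetical; the keywords and their order are the ones used in the cells above.

```python
# (keyword, type) pairs in the same order as the notebook cells.
THAI_KEYWORDS = [
    ('แออัด', 'crowded'),            # "crowded"
    ('ใกล้ชิด', 'closed_contacts'),   # "close", as in close contact
    ('กลับจาก', 'abroad'),           # "returned from"
    ('มาจาก', 'abroad'),             # "came from"
    ('เดินทางจาก', 'abroad'),        # "travelled from"
    ('ไม่ทราบสาเหตุ', 'unknown'),     # "cause unknown"
]

def fallback_type(detail_th, current_type=''):
    """Return current_type if set, else the type of the first matching keyword."""
    if current_type:
        return current_type
    if not isinstance(detail_th, str):
        # Missing descriptions (None or NaN) never match a keyword.
        return ''
    for keyword, case_type in THAI_KEYWORDS:
        if keyword in detail_th:
            return case_type
    return ''

# Hypothetical descriptions, not taken from the DDC data.
print(fallback_type('เดินทางกลับจากญี่ปุ่น'))
print(fallback_type('ไปสถานที่แออัด', current_type='abroad'))
```

Note that the earlier English keyword cells ('On examination process', 'medical', 'crowded') have no such guard and overwrite whatever type was already set.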

## Counting Cases
Some report ids cover a range of cases. To count properly, we unroll each range: for example, report id '132-134' counts as 3 cases (132, 133, and 134).

In [19]:
all_cases = df.loc[df.type != '', ['rid', 'type']].to_dict('records')

### Unroll cases

In [20]:
cases = {}
max_rid = 0
for c in all_cases:
    rid = c['rid']
    if '-' in rid:
        # a range such as '132-134' covers every id from 132 to 134 inclusive
        ids = rid.split('-')
        rids = list(range(int(ids[0]), int(ids[1]) + 1))
    else:
        rids = [int(rid)]

    for r in rids:
        cases[r] = c['type']
    # track the highest case id seen so far
    max_rid = max(max_rid, *rids)

In [21]:
max_rid

2672
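
As a small worked example of the unrolling step, the sketch below expands report ids into individual case ids and counts them by type. The helper `expand_rid` and the sample records are hypothetical; only the 'first-last' range format comes from the data.

```python
from collections import Counter

def expand_rid(rid):
    """Return the integer case ids covered by a report id string."""
    if '-' in rid:
        first, last = rid.split('-', 1)
        return list(range(int(first), int(last) + 1))
    return [int(rid)]

# Hypothetical records in the same shape as all_cases.
records = [
    {'rid': '131', 'type': 'abroad'},
    {'rid': '132-134', 'type': 'crowded'},
]

case_types = {}
for record in records:
    for case_id in expand_rid(record['rid']):
        case_types[case_id] = record['type']

counts = Counter(case_types.values())
print(counts)
```

Keying the dictionary by case id means a case that appears in two records is counted once, with the later record's type winning, which is the same behavior as the loop above.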

In [22]:
cdf = pd.DataFrame.from_dict(cases, orient='index', columns=['type'])

### Counting by types

In [23]:
cdf.type.value_counts()

closed_contacts      791
in_progress          570
crowded              562
abroad               446
unknown              157
medical_services      82
local                 26
friend_family         15
traveller             14
airports               5
public_transports      4
Name: type, dtype: int64

### Tracking technology by type

To understand what each tracking technology can cover, we assign a possible tracking technology to each type.

In [24]:
cdf['tracking_type'] = ''

In [25]:
cdf.loc[cdf.type == 'closed_contacts', 'tracking_type'] = 'bluetooth'

In [26]:
cdf.loc[cdf.type == 'in_progress', 'tracking_type'] = 'unknown'

In [27]:
cdf.loc[cdf.type == 'crowded', 'tracking_type'] = 'qr'

In [28]:
cdf.loc[cdf.type == 'abroad', 'tracking_type'] = 'abroad'

In [29]:
cdf.loc[cdf.type == 'unknown', 'tracking_type'] = 'unknown'

In [30]:
cdf.loc[cdf.type == 'medical_services', 'tracking_type'] = 'qr'

In [31]:
cdf.loc[cdf.type == 'local', 'tracking_type'] = 'gps'

In [32]:
cdf.loc[cdf.type == 'friend_family', 'tracking_type'] = 'gps'

In [33]:
cdf.loc[cdf.type == 'traveller', 'tracking_type'] = 'gps'

In [34]:
cdf.loc[cdf.type == 'airports', 'tracking_type'] = 'qr'

In [35]:
cdf.loc[cdf.type == 'public_transports', 'tracking_type'] = 'gps'
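
The eleven assignments above amount to one lookup table. The following sketch gathers them in a dictionary and tallies cases per technology in plain Python, using the per-type counts from the `value_counts()` output earlier in this notebook. The names `TRACKING_BY_TYPE`, `type_counts`, and `tracking_counts` are introduced here for illustration.

```python
from collections import Counter

# Same assignments as the cells above, gathered in one lookup table.
TRACKING_BY_TYPE = {
    'closed_contacts': 'bluetooth',
    'in_progress': 'unknown',
    'crowded': 'qr',
    'abroad': 'abroad',
    'unknown': 'unknown',
    'medical_services': 'qr',
    'local': 'gps',
    'friend_family': 'gps',
    'traveller': 'gps',
    'airports': 'qr',
    'public_transports': 'gps',
}

# Case counts per type, copied from the value_counts() output.
type_counts = {
    'closed_contacts': 791, 'in_progress': 570, 'crowded': 562,
    'abroad': 446, 'unknown': 157, 'medical_services': 82, 'local': 26,
    'friend_family': 15, 'traveller': 14, 'airports': 5,
    'public_transports': 4,
}

tracking_counts = Counter()
for case_type, n in type_counts.items():
    tracking_counts[TRACKING_BY_TYPE[case_type]] += n

print(dict(tracking_counts))
```

With pandas, the same table could likely be applied in one step with `cdf['type'].map(TRACKING_BY_TYPE)`, which keeps the assignments in a single place.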

In [36]:
total = cdf.shape[0]
total

2672

In [37]:
all_types = cdf.tracking_type.value_counts()
all_types

bluetooth    791
unknown      727
qr           649
abroad       446
gps           59
Name: tracking_type, dtype: int64

In [38]:
tracking_only = pd.DataFrame(all_types.drop('abroad').drop('unknown'))

In [39]:
tracking_only['pct'] = tracking_only.tracking_type / tracking_only.tracking_type.sum()

In [40]:
tracking_only

,tracking_type,pct
bluetooth,791,0.527685
qr,649,0.432955
gps,59,0.03936
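
The `pct` column is a share of the cases assigned to one of the three technologies, not of all cases. The arithmetic below, a plain-Python sketch using the counts from the tally above, computes both denominators side by side; the variable names are introduced here for illustration.

```python
# Counts copied from the tracking_type tally above.
tracking_counts = {'bluetooth': 791, 'unknown': 727, 'qr': 649, 'abroad': 446, 'gps': 59}

total_cases = sum(tracking_counts.values())

# Keep only the three tracking technologies.
trackable = {k: v for k, v in tracking_counts.items() if k not in ('abroad', 'unknown')}
trackable_total = sum(trackable.values())

share_of_trackable = {k: v / trackable_total for k, v in trackable.items()}
share_of_all = {k: v / total_cases for k, v in trackable.items()}

for tech in trackable:
    print(f"{tech}: {share_of_trackable[tech]:.1%} of trackable cases, "
          f"{share_of_all[tech]:.1%} of all cases")
```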


# Conclusion
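Among the three tracking technologies, QR code check-in could help trace only about **43%** of the cases that map to a tracking technology (649 of 1,499), which is roughly 24% of all 2,672 cases. Thus, if there is a second wave, data from QR code check-ins alone will not be enough.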