# Intro

Reality usually differs from expectations. The same applies to "analytical" jobs in the real world, where there is no "cleaned and prepared" data and no well-understood stakeholders for your projects. Most of your daily tasks revolve around asking (or begging) for data and delivering multi-level explanations to get people on board. In this project, I tried to mimic those real-world challenges: forming a hypothesis, developing a framework to collect data, constructing a preprocessing pipeline, visualizing the results, and documenting the case.

# Motivation

As of June 2020, there are more than 402,000 active HCPs (health care professionals, i.e. doctors) in Germany<sup>[1](#ref1)</sup>, and the country has one of the highest densities of employed doctors (inhabitants per doctor) in the world. Most people search for information about doctors in a few steps: they usually start with Google Maps to find doctors nearby, move on to rating websites such as Jameda (which claims to be the most popular physician review site in Germany), and finally go to the pages of organizations such as the KBV - **Kassenärztliche Bundesvereinigung** (*National Association of Statutory Health Insurance Physicians*) to get lists of specialty associations.

The diagram below, from the [German Medical Association](https://www.bundesaerztekammer.de/), shows the distribution of HCPs in Germany:

![image.png](./graphs/de_hcp.png)

Companies in the medical field obtain consumer- and physician-related data through brokers<sup>[2](#ref2)</sup> such as [IQVIA](https://www.iqvia.com/). There is a huge industry behind collecting and analyzing the data, drawing up contracts, negotiating with parties, drafting legal descriptions, etc. Needless to say, acquiring this data, either as a one-time purchase or as a subscription, is not cheap: prices can range from a few hundred thousand to millions of euros (or dollars).

So I thought to myself: **"Why don't we build our own physician database from available data? After that, we could perform as many analytical tasks as we want on the data without spending a cent."**

And that was the start of this project.

# Project Overview

The diagram below shows the project at a glance:

![image.png](./graphs/project_overview_v1.PNG)

The project can be split into two main parts:

**1. Data Gathering:** collect and clean the data
   - **Basic information:** physicians' basic information such as name, gender, title, specialties, practice addresses, phone, website, etc.
   - **Reviews:** physician reviews (similar to product reviews on Amazon) from either patients or associations
   - **Research:** clinical trials and scientific publications data (which also provide information on "associations & co-authorships")
    
**2. Analytical Tasks:** visualize the results and build analytical models for different purposes
   - **Interactive dashboard:** presents the results of all the processed data in an interactive form
   - **Network analysis:** aims to identify **"influential"** physicians (e.g. highly connected ones) and to detect communities among physicians
   - **Potential analysis:** combines traffic flows and population density around a physician's office with reviews and opening-hours data to estimate which HCPs are more **"popular"** and have more **potential** for visits (by pharma sales reps)
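
To make the network analysis idea a bit more concrete, here is a minimal sketch in plain Python. It assumes the co-authorship data arrives as one author list per publication (the names and data below are made up), uses degree centrality as a simple proxy for "highly connected", and uses connected components as a much coarser stand-in for real community detection. It is an illustration under those assumptions, not the final model.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_graph(publications):
    """Map each author to the set of people they have co-authored with."""
    graph = defaultdict(set)
    for authors in publications:
        unique = set(authors)
        for name in unique:
            graph[name]  # touch the key so solo authors appear as isolated nodes
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def degree_centrality(graph):
    """Share of the other nodes each author is directly connected to."""
    n = len(graph)
    if n < 2:
        return {name: 0.0 for name in graph}
    return {name: len(neighbours) / (n - 1) for name, neighbours in graph.items()}


def connected_components(graph):
    """Group authors that can reach one another through co-authorships."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Made-up example data: each inner list is the author list of one paper.
publications = [
    ["Dr. A", "Dr. B", "Dr. C"],
    ["Dr. A", "Dr. D"],
    ["Dr. E", "Dr. F"],
]
graph = build_coauthor_graph(publications)
centrality = degree_centrality(graph)
most_connected = max(centrality, key=centrality.get)
components = connected_components(graph)
```

With ~300k profiles, a dedicated graph library and a proper community detection algorithm would likely be a better fit; the sketch only shows the shape of the data and the kind of question being asked.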

# Technical Setup

This is a living document. Currently, everything is set up offline on my workstation; if there is a plan to productionize the project in the future, this part will be updated accordingly.

I'm experimenting with the [Kedro framework](https://github.com/quantumblacklabs/kedro) for automation, but for now the code is split into separate offline modules. Everything is written in Python, with [JupyterLab](https://github.com/jupyterlab/jupyterlab) as the IDE (this document was also created in JupyterLab, with many extensions).

# Data Gathering

In this article, I'm starting the series with how to build a crawler for a specific site: https://www.arzt-auskunft.de/

## Arzt-Auskunft https://www.arzt-auskunft.de/

### Description 

According to the website:
>Arzt-Auskunft is the medical directory that lists all 280,000 registered doctors, dentists, psychological psychotherapists, clinics and chief physicians. If, for example, you are looking for a specialist in gynaecology and obstetrics, i.e. a gynaecologist, in Hamburg-Ottensen, you will find what you are looking for via the Arzt-Auskunft. The doctors' directory provides information on consultation hours, contact details, directions, accessibility and the specialisations of the doctors, among other things.

Below is my summary of the data source. It includes all the features I managed to obtain from the site with the crawler I designed:

![image.png](./graphs/data_arzt_auskunft.PNG)

Physician reviews come either directly from patients who experienced the physician's service first-hand or from health associations (FAQ: https://www.arzt-auskunft.de/arzt-auskunft/tipps-zur-suche/faq-zum-empfehlungspool.htm). Below is the list of associations in the physician recommendation pool:

![image.png](./graphs/arzt_auskunft_hcp_recommendation_pool.PNG)

**Note:** *The recommendation count should have been an important metric for weighting the reviews; however, since most profiles on arzt-auskunft only have reviews from the recommendation pool and no direct reviews from patients, I decided not to crawl the recommendation count.*


### Methodology

![image.png](./graphs/crawler_arzt_auskunft.PNG)

*(proxy swapping/ multi-threading throttler to avoid IP ban explanation)*
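
Until the full explanation is written, here is a rough sketch of what a multi-threaded throttler with proxy rotation could look like. Everything in it is an assumption for illustration: the class names `Throttler` and `ProxyPool`, the `crawl` helper, the proxy addresses, and the `fake_fetch` stand-in (a real crawler would make an HTTP request there). It is not the crawler used in this project.

```python
import itertools
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttler:
    """Allow at most one call every `min_interval` seconds across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.min_interval
        time.sleep(max(0.0, slot - now))


class ProxyPool:
    """Hand out proxies in round-robin order (thread-safe)."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            return next(self._cycle)


def crawl(urls, fetch, throttler, proxy_pool, max_workers=4):
    """Fetch every URL with a worker pool; `fetch(url, proxy)` does the real request."""

    def task(url):
        throttler.wait()
        return url, fetch(url, proxy_pool.next())

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, urls))


def fake_fetch(url, proxy):
    """Hypothetical stand-in for an HTTP request sent through `proxy`."""
    return f"{url} via {proxy}"


started = time.monotonic()
pages = crawl(
    urls=[f"https://example.org/page/{i}" for i in range(5)],
    fetch=fake_fetch,
    throttler=Throttler(min_interval=0.05),
    proxy_pool=ProxyPool(["proxy-a:8080", "proxy-b:8080"]),
)
elapsed = time.monotonic() - started
```

The throttler spaces requests globally rather than per thread, so adding workers does not increase the request rate seen by the site; it only hides the latency of slow responses.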

### Input
- Postal codes with geo coordinates: https://public.opendatasoft.com/explore/dataset/postleitzahlen-deutschland

### Output

- Phase 1 produces an intermediate output: a list of HCP profile links from all postal codes in Germany (~290k profile links)
- Phase 2 produces a full snapshot of the physician profiles (outpatient doctors' profiles only)
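
The two phases can be sketched roughly as follows. The `search` and `fetch_profile` callables, the link format, and the sample postal codes are hypothetical placeholders for the real requests, and the deduplication step assumes that the same profile can show up under more than one postal code.

```python
def collect_profile_links(postal_codes, search):
    """Phase 1: gather profile links for every postal code, without duplicates.

    `search(postal_code)` is a hypothetical callable returning the profile
    links found for one postal code.
    """
    first_seen = {}  # link -> postal code where it first appeared (keeps insertion order)
    for code in postal_codes:
        for link in search(code):
            first_seen.setdefault(link, code)
    return list(first_seen)


def snapshot_profiles(links, fetch_profile):
    """Phase 2: fetch each profile once and keep the result keyed by its link."""
    return {link: fetch_profile(link) for link in links}


# Made-up stand-ins for the real search and profile requests.
FAKE_RESULTS = {
    "10115": ["/arzt/1", "/arzt/2"],
    "10117": ["/arzt/2", "/arzt/3"],  # "/arzt/2" shows up under two postal codes
}
links = collect_profile_links(FAKE_RESULTS, lambda code: FAKE_RESULTS[code])
profiles = snapshot_profiles(links, lambda link: {"link": link})
```

Keeping the link list as a separate intermediate output means phase 2 can be restarted or re-run later without repeating the postal code search.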


### EDA Arzt-Auskunft

## Jameda
https://www.jameda.de/

![image.png](./graphs/data_jameda.PNG)
### Description 
*(Explain what exactly is the source)*

### Methodology
*(insert flowchart with crawler design)*

*(API setup & multi-threading throttler to avoid IP ban explanation)*

### Input
- Full name & address of physicians from Arzt-Auskunft

### Output
- Phase 1 collects all available Jameda profile links and cross-checks each one to confirm that it truly belongs to the searched physician.
- Phase 2 crawls all available data in the profile
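
One way the phase 1 cross-check could work is fuzzy matching on name and address. The sketch below uses `difflib.SequenceMatcher` from the standard library; the title list, the thresholds, the field names, and the sample records are all assumptions chosen for illustration and would need tuning against real data.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, non-exhaustive list of academic titles to ignore when comparing.
TITLES = {"prof", "dr", "med", "dent"}


def normalize(text):
    """Lowercase, drop titles, and sort tokens so word order does not matter."""
    tokens = re.split(r"[\s,]+", text.lower())
    kept = [t for t in tokens if t and t.rstrip(".") not in TITLES]
    return " ".join(sorted(kept))


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_same_physician(searched, candidate, name_threshold=0.85, address_threshold=0.75):
    """Accept a candidate profile only if both name and address are close enough."""
    return (
        similarity(searched["name"], candidate["name"]) >= name_threshold
        and similarity(searched["address"], candidate["address"]) >= address_threshold
    )


# Made-up records: one from the first source, two candidate profiles.
searched = {"name": "Dr. med. Anna Müller", "address": "Hauptstraße 12, 80331 München"}
match = {"name": "Anna Müller", "address": "Hauptstr. 12, 80331 München"}
other = {"name": "Dr. Anne Möller", "address": "Bahnhofstraße 3, 20095 Hamburg"}

same = is_same_physician(searched, match)
different = is_same_physician(searched, other)
```

Requiring both checks to pass keeps false matches low for common surnames, at the cost of missing physicians whose practice address differs between the two sites.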


## PubMed & Clinical Trials

PubMed: https://pubmed.ncbi.nlm.nih.gov/

Clinical trials: https://clinicaltrials.gov/

![image.png](./graphs/data_pubmed_clinicaltrial.PNG)

### Description 
*(Explain what exactly is the source)*

### Methodology
*(insert flowchart with crawler design)*

*(proxy swapping/ multi-threading throttler to avoid IP ban explanation)*

### Input

### Output



# Q&A
## 21.06.2020

**1. Who are the prospective users? (change from users --> viewers)**
 1. Common people in Germany: curious about their doctors, or searching for doctors in their area with good rankings but low patient traffic (tbd since *potential analysis* is not yet planned)
 2. Data enthusiasts (data analysts, software engineers, data scientists): come for the consolidated data (they might want to crawl it from us) and a good interactive dashboard
 3. Recruiters / potential interviewers: look at CVs --> go to the site (just an add-on) 


**2. What are some typical use cases?**
 1. Small scale:
     1. As mentioned above for the "common people" use case
     2. Curiosity of the crowd --> come to learn (we can write a series of blog posts on Medium about the development of the site to attract viewers, earn some income, and boost our portfolios)
     3. Doctors can search for themselves as well ==> **a single point of truth** = no need to go to multiple sites to get information.
     
 2. Grand scale:
     1. Sell collected data to pharmaceutical companies (will need pricing research), since they spend millions of euros to get crappy data (even though the data is publicly available). 
     2. Provide model results to medical companies either as a one-time purchase or on a subscription basis:
         1. *Potential analysis* to detect which HCPs to visit (low effort, high reward) and advertise products to.
         2. *Loyalty prediction (retention analysis)* = if an HCP is participating in multiple clinical trials or publishing articles that are sponsored by competitors ==> high risk of not prescribing the company's drugs
 
**3. What are the learning opportunities?**
 1. How to productionalize machine learning models (https://blog.usejournal.com/a-guide-to-deploying-machine-deep-learning-model-s-in-production-e497fd4b734a)
 2. How to write effective guides/tutorials to capture an audience (Medium or other blogging platforms)
 3. Set up a business case and expand horizons once the scale is large enough (TBD)
 
**4. Are there any similar products out there? Jameda is one of the data sources but the offering is quite different.**
To be honest, there might be some, but they are usually **not available to the public** (e.g. only sold to medical companies at horrendous prices). Many publicly available sources provide some info about a doctor, but none is as ambitious as ours (**360-degree view of the customer** = a single point of truth).

As for the machine learning modeling part, most projects available online are small and self-made (for learning purposes). As mentioned above, the larger offerings are usually not available to the public. 

You can take a look at this: https://www.skainfo.com/databases/physician-data

Here is a big fight between the two biggest providers, Veeva and IQVIA: https://www.veeva.com/no-data-restrictions/
*Background:* IQVIA's OneKey product is a global database of physician reference data, the data used by pharma sales and marketing teams around the world to communicate with prescribers. 
More on the lawsuit and its background: https://pharmaceuticalcommerce.com/latest-news/iqvia-veeva-litigation-over-customer-master-data-records-grinds-on/

**5. What sets this product apart?**
Already answered in 4 :). Honestly, I don't aim to compete with any data brokers/providers, just to throw some stones into the lake. Small and medium-sized startups or medical companies can benefit from using our platform ==> Google Ads and income from blogging could be enough to keep the site running

**6. To have a good estimate of the amount of work, a list of tentative specs would be helpful.**
 1. Interactive dashboard to display the information about the customer (will provide POC)
 2. Strategy to display the network sociograms of ~300k HCP profiles.
 3. The first phase can aim to build a static site (no refresh or data streaming): everything will be **computed offline** and pushed to the database. We can plan to refresh the data every 6 or 12 months (publications and clinical trials might need to be refreshed a bit more frequently if there is anything new). In that case, there is no need to build a complicated pipeline.
 4. Currently there is a plan to search for the physicians' social network handles, but for now we can ignore that :)


# References

1. <a name="ref1"></a>Statista. (2018). Germany: number of doctors 1990-2018 | Statista. [online] Available at: https://www.statista.com/statistics/582114/doctors-in-germany-number/ [Accessed 8 May 2020].

2. <a name="ref2"></a>Leetaru, K. (2018). How Data Brokers And Pharmacies Commercialize Our Medical Data | Forbes. [online] Available at: https://www.forbes.com/sites/kalevleetaru/2018/04/02/how-data-brokers-and-pharmacies-commercialize-our-medical-data/#5d287eb711a6 [Accessed 28 June 2020].
