# HOP Team Project - The Hindenburg Group
Tim Claytor

![Balloon](https://cdn.britannica.com/29/1229-004-0E10D8E6/Hindenburg-flames-Lakehurst-Naval-Air-Station-New-May-6-1937.jpg?s=750x300&q=85)


___

## Library imports

In [1]:
import pandas as pd
import sqlite3
from tqdm.notebook import tqdm

## Tasks
* ~~We want to eliminate "accidental" referrals, so filter the hop teaming data so that the transaction_count is at least 50 and the average_day_wait is less than 50.~~
* First, build a profile of providers referring patients to the major hospitals in Nashville. Are certain specialties more likely to refer to a particular hospital over the others?
* Determine which professionals Vanderbilt Hospital should reach out to in the Nashville area to expand their own patient volume. 
    - First, research which professionals are sending significant numbers of patients only to competitor hospitals (such as TriStar Centennial Medical Center).
    - Next, consider the specialty of the provider. If Vanderbilt wants to increase volume from Orthopedic Surgeons or from Family Medicine doctors, who should they reach out to in those areas?
* Finally, look for "communities" of providers in the Nashville/Davidson County CBSA. Make use of the Louvain community detection algorithm from Neo4j: https://neo4j.com/docs/graph-data-science/current/algorithms/louvain/.
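
The "only to competitor hospitals" task can be sketched in SQL as below. This is a minimal sketch under several assumptions: a table named `hop` with the `from_npi`, `to_npi` and `patient_count` columns, placeholder hospital NPIs (the real ones are not listed here), toy rows, and a hypothetical `min_patients` threshold standing in for "significant numbers".

```python
import sqlite3

# Placeholder NPIs: the real hospital NPIs are not given here, so these are made up.
VANDERBILT_NPI = 9000000001
COMPETITOR_NPIS = (9000000002, 9000000003)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hop (from_npi INTEGER, to_npi INTEGER, patient_count INTEGER)")
db.executemany(
    "INSERT INTO hop VALUES (?, ?, ?)",
    [
        (101, VANDERBILT_NPI, 80),       # refers to Vanderbilt only
        (102, COMPETITOR_NPIS[0], 120),  # refers to a competitor only
        (103, VANDERBILT_NPI, 60),       # refers to both
        (103, COMPETITOR_NPIS[1], 90),
        (104, COMPETITOR_NPIS[1], 10),   # competitor only, but low volume
    ],
)

def competitor_only_referrers(conn, min_patients=50):
    """Providers with referrals to competitors and none to Vanderbilt."""
    placeholders = ", ".join("?" for _ in COMPETITOR_NPIS)
    query = f"""
        SELECT from_npi, SUM(patient_count) AS competitor_patients
        FROM hop
        WHERE to_npi IN ({placeholders}, ?)
        GROUP BY from_npi
        HAVING SUM(CASE WHEN to_npi = ? THEN 1 ELSE 0 END) = 0
           AND SUM(patient_count) >= ?
        ORDER BY competitor_patients DESC
    """
    params = (*COMPETITOR_NPIS, VANDERBILT_NPI, VANDERBILT_NPI, min_patients)
    return conn.execute(query, params).fetchall()

targets = competitor_only_referrers(db)
print(targets)
```

Because the `HAVING` clause drops any provider with a Vanderbilt referral, the remaining `SUM(patient_count)` counts competitor patients only.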

# Smaller Import Dataframes (non-chunky) 

## Taxonomy table

In [116]:
# Taxonomy table
taxonomy = pd.read_csv(
    '../data/nucc_taxonomy_230.csv', 
    encoding = 'unicode_escape')
taxonomy.head()


| | Code | Grouping | Classification | Specialization | Definition | Notes | Display Name | Section |
|---|---|---|---|---|---|---|---|---|
| 0 | 193200000X | Group | Multi-Specialty | | A business group of one or more individual pra... | [7/1/2003: new] | Multi-Specialty Group | Individual |
| 1 | 193400000X | Group | Single Specialty | | A business group of one or more individual pra... | [7/1/2003: new] | Single Specialty Group | Individual |
| 2 | 207K00000X | Allopathic & Osteopathic Physicians | Allergy & Immunology | | An allergist-immunologist is trained in evalua... | Source: American Board of Medical Specialties,... | Allergy & Immunology Physician | Individual |
| 3 | 207KA0200X | Allopathic & Osteopathic Physicians | Allergy & Immunology | Allergy | A physician who specializes in the diagnosis, ... | Source: National Uniform Claim Committee | Allergy Physician | Individual |
| 4 | 207KI0005X | Allopathic & Osteopathic Physicians | Allergy & Immunology | Clinical & Laboratory Immunology | An allergy and immunology physician who specia... | Source: National Uniform Claim Committee, 2022... | Clinical & Laboratory Immunology (Allergy & Im... | Individual |


## Zip Tract Table

In [2]:
# Read in Zip Tract Table
all_zip = pd.read_excel(
    '../data/ZIP_CBSA_122021.xlsx',
    index_col = None,
    header = 0,
    dtype = {'zip': object})
# Nashville tract information
nashville_tract = all_zip[
    (all_zip['usps_zip_pref_city'] == 'NASHVILLE')
    & (all_zip['usps_zip_pref_state'] == 'TN')]
nashville_tract.head()

| | zip | cbsa | usps_zip_pref_city | usps_zip_pref_state | res_ratio | bus_ratio | oth_ratio | tot_ratio |
|---|---|---|---|---|---|---|---|---|
| 1856 | 37219 | 34980 | NASHVILLE | TN | 1.0 | 1.0 | 1.0 | 1.0 |
| 2863 | 37242 | 34980 | NASHVILLE | TN | 0.0 | 1.0 | 0.0 | 1.0 |
| 3332 | 37212 | 34980 | NASHVILLE | TN | 1.0 | 1.0 | 1.0 | 1.0 |
| 3333 | 37218 | 34980 | NASHVILLE | TN | 1.0 | 1.0 | 1.0 | 1.0 |
| 3334 | 37232 | 34980 | NASHVILLE | TN | 0.0 | 1.0 | 1.0 | 1.0 |
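
Reading `zip` with `dtype={'zip': object}` keeps ZIP codes as strings, which matters once they are matched against provider postal codes. The sketch below shows one way to bring both sides to a common five-digit form. It assumes the provider postal code field may hold nine-digit ZIP+4 values, and `normalize_zip` is a hypothetical helper, not part of the notebook.

```python
def normalize_zip(value):
    """Return a five-digit ZIP string from a str or int ZIP / ZIP+4 value.

    Floats are not handled: str(37212.0) would contribute an extra digit.
    """
    # Keep digits only, so '37212-1234' and '372121234' are treated alike
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    if len(digits) <= 5:
        # Plain ZIP, possibly stored as an int that lost its leading zeros
        return digits.zfill(5)
    # ZIP+4: restore leading zeros to nine digits, then keep the first five
    return digits.zfill(9)[:5]

print(normalize_zip("372121234"))
print(normalize_zip("37212-1234"))
print(normalize_zip(2134))
```

Applying the same helper to both the ZIP table and the provider table before a merge avoids silent mismatches between `37212` and `372121234`.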


# Chunky Dataset Imports 

## HOP Teaming Dataset

In [111]:
#Reading in DocGraph HOP Teaming data
hops_and_dreams = pd.read_csv(
        '../data/DocGraph_Hop_Teaming_2018.csv', 
            chunksize = 10000)
# List to hold output of loop
hop_chunks = []
# Loop for filtering criteria
for chunk in hops_and_dreams:
    # Keep referrals with transaction_count >= 50 and average_day_wait < 50
    chunk = chunk[(chunk.transaction_count >= 50) & (chunk.average_day_wait < 50)]
    # Append the filtered chunk to the hop_chunks list
    hop_chunks.append(chunk)
# Concatenate the filtered chunks into the hop dataframe
hop = pd.concat(hop_chunks, ignore_index = True)
hop.head()

| | from_npi | to_npi | patient_count | transaction_count | average_day_wait | std_day_wait |
|---|---|---|---|---|---|---|
| 0 | 1508085911 | 1730166125 | 58 | 67 | 23.925 | 43.923 |
| 1 | 1508167040 | 1730166125 | 51 | 51 | 28.196 | 52.876 |
| 2 | 1508863549 | 1730166125 | 340 | 391 | 18.302 | 42.422 |
| 3 | 1508867870 | 1730166125 | 50 | 79 | 12.658 | 26.402 |
| 4 | 1508011040 | 1730166224 | 132 | 145 | 8.579 | 28.053 |
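
The filter boundaries are easy to get backwards: `transaction_count` is kept at 50 and above, while `average_day_wait` must be strictly below 50. The sketch below applies the same rule row by row with the standard-library `csv` module instead of pandas chunks. The rows are toy values, and `keep_referral` and the two constants are hypothetical names.

```python
import csv
import io

MIN_TRANSACTIONS = 50  # inclusive lower bound on transaction_count
MAX_AVG_WAIT = 50      # exclusive upper bound on average_day_wait

def keep_referral(row):
    """Return True when a referral row passes both thresholds."""
    return (int(row["transaction_count"]) >= MIN_TRANSACTIONS
            and float(row["average_day_wait"]) < MAX_AVG_WAIT)

# Toy rows for illustration only, not taken from the HOP teaming file
sample = io.StringIO(
    "from_npi,to_npi,patient_count,transaction_count,average_day_wait,std_day_wait\n"
    "1,9,40,50,49.9,10.0\n"   # kept: exactly 50 transactions, wait just under 50
    "2,9,40,49,10.0,5.0\n"    # dropped: too few transactions
    "3,9,40,80,50.0,5.0\n"    # dropped: wait is not strictly below 50
)

kept = [row["from_npi"] for row in csv.DictReader(sample) if keep_referral(row)]
print(kept)
```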


## NPPES Dataset
![NPPES LOGO](https://nppes.cms.hhs.gov/images/NPPES%20logo.png) 

In [None]:
# The option below displays all columns (this dataset has 330).
# To display fewer, replace None with the number of columns to show, e.g. 10.
pd.set_option('display.max_columns', None)

In [None]:
def find_taxonomy(row):
    # Return the taxonomy code flagged as primary ('Y') among the 15 taxonomy slots
    for i in range(1, 16):
        taxonomy_switch = f'Healthcare Provider Primary Taxonomy Switch_{i}'
        taxonomy_value = f'Healthcare Provider Taxonomy Code_{i}'
        if row.get(taxonomy_switch) == 'Y':
            return row.get(taxonomy_value)
    # No slot is flagged as primary
    return 'no primary taxonomy'
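
Since `find_taxonomy` only relies on `.get`, it can be tried on a plain dict before running it over the full file. The sketch below repeats the function so it runs on its own. The `provider` record is a made-up example that reuses two codes from the taxonomy table above.

```python
def find_taxonomy(row):
    """Return the taxonomy code whose primary switch is 'Y', if any."""
    for i in range(1, 16):
        taxonomy_switch = f'Healthcare Provider Primary Taxonomy Switch_{i}'
        taxonomy_value = f'Healthcare Provider Taxonomy Code_{i}'
        if row.get(taxonomy_switch) == 'Y':
            return row.get(taxonomy_value)
    return 'no primary taxonomy'

# Made-up provider record: the second taxonomy slot is the primary one
provider = {
    'Healthcare Provider Taxonomy Code_1': '207K00000X',
    'Healthcare Provider Primary Taxonomy Switch_1': 'N',
    'Healthcare Provider Taxonomy Code_2': '207KA0200X',
    'Healthcare Provider Primary Taxonomy Switch_2': 'Y',
}

print(find_taxonomy(provider))  # 207KA0200X
print(find_taxonomy({}))        # no primary taxonomy
```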

In [None]:
%%capture
# Capture the cell output so that column dtype warnings are hidden; this avoids manually setting dtypes for dozens of columns.

db = sqlite3.connect('../data/hop_teaming_database.sqlite')

for chunk in tqdm(pd.read_csv('../data/npidata_pfile_20050523-20230212.csv', chunksize = 10000)):

    # First, extract the primary taxonomy for each row
    chunk['Primary Taxonomy'] = chunk.apply(find_taxonomy, axis=1)

    # Next, keep only the columns of interest and rename those containing parentheses
    chunk = (
        chunk 
        [['NPI',
        'Entity Type Code',
        'Provider Organization Name (Legal Business Name)',
        'Provider Last Name (Legal Name)',
        'Provider First Name',
        'Provider Middle Name',
        'Provider Name Prefix Text',
        'Provider Name Suffix Text',
        'Provider Credential Text',
        'Provider First Line Business Practice Location Address',
        'Provider Second Line Business Practice Location Address',
        'Provider Business Practice Location Address City Name',
        'Provider Business Practice Location Address State Name',
        'Provider Business Practice Location Address Postal Code',
        'Primary Taxonomy']]
        .rename(columns={'Provider Organization Name (Legal Business Name)': 'Provider Organization Name',
        'Provider Last Name (Legal Name)': 'Provider Last Name'})
    )

    # Then clean up the column names
    chunk.columns = [x.lower().replace(' ', '_') for x in chunk.columns]

    # Finally, append the chunk to the npidata_pfile table
    chunk.to_sql('npidata_pfile', db, if_exists = 'append', index = False)

In [None]:
db.execute('CREATE INDEX npi ON npidata_pfile(npi)')
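
One rough way to confirm that lookups by `npi` use the index is SQLite's `EXPLAIN QUERY PLAN`. The sketch below runs against an in-memory stand-in for `npidata_pfile` with a reduced set of columns and one made-up row. The `IF NOT EXISTS` clause is an addition here so the statement can be rerun without an error.

```python
import sqlite3

# In-memory stand-in for the real database, with a reduced set of columns
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE npidata_pfile "
    "(npi INTEGER, provider_last_name TEXT, primary_taxonomy TEXT)"
)
# Made-up row for illustration only
db.execute(
    "INSERT INTO npidata_pfile VALUES (?, ?, ?)",
    (9000000001, "EXAMPLE", "207K00000X"),
)
db.execute("CREATE INDEX IF NOT EXISTS npi ON npidata_pfile(npi)")

# The last column of each plan row holds the human-readable detail text
plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM npidata_pfile WHERE npi = ?",
    (9000000001,),
).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
print(plan_text)

db.close()
```

If the index is used, the plan text mentions `USING INDEX npi`. A plan that reports a scan of `npidata_pfile` instead suggests the index is missing.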

In [None]:
db.close()
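
With provider data in SQLite, the specialty profile from the first task can be sketched as a join. The sketch assumes the filtered referrals have also been written to the same database as a table named `hop`, which the cells above do not do. It uses toy rows and placeholder hospital NPIs (`901`, `902`).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE npidata_pfile (npi INTEGER, primary_taxonomy TEXT);
    CREATE TABLE hop (from_npi INTEGER, to_npi INTEGER, patient_count INTEGER);
""")
# Toy providers, reusing two taxonomy codes from the taxonomy table
db.executemany("INSERT INTO npidata_pfile VALUES (?, ?)", [
    (101, "207K00000X"),
    (102, "207K00000X"),
    (103, "207KA0200X"),
])
# Toy referrals to two placeholder hospital NPIs
db.executemany("INSERT INTO hop VALUES (?, ?, ?)", [
    (101, 901, 60),
    (102, 901, 40),
    (102, 902, 70),
    (103, 902, 55),
])

# Patients referred to each hospital, broken down by referring specialty
profile = db.execute("""
    SELECT h.to_npi, p.primary_taxonomy, SUM(h.patient_count) AS patients
    FROM hop AS h
    JOIN npidata_pfile AS p ON p.npi = h.from_npi
    GROUP BY h.to_npi, p.primary_taxonomy
    ORDER BY h.to_npi, patients DESC
""").fetchall()

for row in profile:
    print(row)

db.close()
```

Joining the result to the taxonomy table on `Code` would replace the raw codes with readable specialty names.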