## Create a Data Dictionary for the 2018-2019 Area Health Resources Files (AHRF)

This program builds a data dictionary that can be used to search for variables and to read data from the AHRF file. The AHRF data come as a fixed-width ASCII file with no variable names. Variable names, labels, and their positions in the data file can be extracted from either an Excel file or a SAS program supplied with the ASCII file. We use the SAS program to generate a CSV file with ready-to-read variable descriptions and positions.
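To illustrate what "fixed-width with no variable names" means in practice, here is a minimal sketch that cuts one record into fields from a list of start and end positions. The record and the three-field layout are made up for the example; they are not taken from the AHRF file.

```python
# Hypothetical 12-character record: 1 blank, a 5-character code, a 6-character name.
record = " 01001Autaug"

# Hypothetical layout: (field name, 1-based start, 1-based inclusive end)
layout = [("blank", 1, 1), ("fips", 2, 6), ("name", 7, 12)]


def slice_record(line, layout):
    """Cut one fixed-width line into a dict of named fields."""
    # Python slices are 0-based and end-exclusive, so only the start is shifted by one.
    return {name: line[start - 1:end] for name, start, end in layout}


fields = slice_record(record, layout)
print(fields)
```

The dictionary built in this notebook plays the role of `layout`: one start and end position per variable.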

The source data can be downloaded from https://data.hrsa.gov/data/download. Citation: Area Health Resources Files (AHRF) 2018-2019. US Department of Health and Human Services, Health Resources and Services Administration, Bureau of Health Workforce, Rockville, MD.


In [1]:
# Import relevant tools
import pandas as pd
from collections import Counter
from collections import defaultdict
import matplotlib.pyplot as plt
import sys
import matplotlib

In [2]:
# Set paths and file names
datapath='C:\\Users\\l_gas\\Documents\\Development\\SpringBoard_DataScience\\HealthInsuranceData'
head2019= datapath + "\\AHRF\\AHRF_2018-2019\\DOC\\AHRF2018-19.sas"

In [3]:
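# Read the SAS program line by line and store the lines in a one-column dataframe
# (the with-block closes the file once the lines are read)
with open(head2019) as ahrf19hd:
    varlist19 = ahrf19hd.readlines()
vardf19 = pd.DataFrame(varlist19)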


In [4]:
# Preview the raw lines of the SAS program
print(vardf19.head(10))
print("Shape of input file:", vardf19.shape)

# Keep only the lines that describe how to read each variable from the data file
# (these lines contain "@" followed by the variable's starting position)
locf19 = vardf19[vardf19[0].str.contains("@")]

# Keep only the lines with variable labels
# (each starts with the variable name, whose first letter is always "f")
labf19 = vardf19[vardf19[0].str[0] == "f"]
print("Number of variables listed:", len(labf19))

                                                   0
0                                       data ahrf;\n
1            infile 'c:\ahrf2019.asc' lrecl=31706;\n
2                                            input\n
3                                                 \n
4                        /*  AHRF2018-19 SAS FD */\n
5                                                 \n
6       @00001    f00001   $  01.  /*Blank       ...
7       @00002    f00002   $  05.  /*Header - FIP...
8       @00007    f00003   $  05.  /*Entity of Fi...
9       @00012    f00004   $  20.  /*Secondary En...
Shape of input file: (14305, 1)
Number of variables listed: 7147
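The preview shows that each location line follows the pattern `@start name [$] width.`. As an alternative to the approach taken in this notebook, the end position could be derived from the width on the same line. The sketch below is not the notebook's method. The regular expression, the helper name `parse_input_line`, and the sample line (whose comment text is reconstructed, since the preview truncates it) are assumptions, and numeric variables are assumed to use a plain `width.` or `width.decimals` informat.

```python
import re

# Matches lines such as "@00002    f00002   $  05.  /*...*/".
# "$" (character variable) is optional; the informat is a width, a dot,
# and optionally a number of decimals.
INPUT_LINE = re.compile(
    r"^\s*@(?P<start>\d+)\s+(?P<name>\w+)\s+(?P<char>\$)?\s*"
    r"(?P<width>\d+)\.(?P<decimals>\d*)"
)


def parse_input_line(line):
    """Return name, start, end and type flag for one line, or None if it does not match."""
    m = INPUT_LINE.match(line)
    if m is None:
        return None
    start = int(m.group("start"))
    width = int(m.group("width"))
    return {
        "Name": m.group("name"),
        "Pos_Start": start,
        "Pos_End": start + width - 1,  # inclusive end position
        "String": m.group("char") is not None,
    }


# Sample line with a reconstructed comment (assumption)
sample = "@00002    f00002   $  05.  /*Header - FIPS St and Cty Code */"
print(parse_input_line(sample))
print(parse_input_line("data ahrf;"))
```

For `f00002` this gives an end position of 6 (start 2, width 5). One advantage of reading the width directly is that the last variable also gets an end position.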


In [5]:
# Each line in locf19 and labf19 represents one variable
# Clean location information: split each line into start position, name, type flag and remainder
locf19 = locf19[0].str.split(n=3, expand=True)
locf19.columns = ['Pos_Start', 'Name', 'Type', 'Rest']
# Character variables are flagged with "$" in the SAS input statement
locf19['String'] = locf19['Type'] == "$"
# Drop the leading "@" and convert the start position to an integer
locf19['Pos_Start'] = locf19['Pos_Start'].str.lstrip('@').astype(int)
locf19 = locf19.drop(columns=['Type', 'Rest']).reset_index(drop=True)
# A variable ends one column before the next variable starts.
# The last variable has no successor, so its Pos_End is NaN (this makes the column float).
locf19['Pos_End'] = locf19['Pos_Start'].shift(-1) - 1
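Because each end position is taken from the next variable's start, the last variable has no end position, and the missing value turns `Pos_End` into a float column. A possible fix, sketched on a toy dataframe, is to fill the gap with the record length. The `lrecl=31706` value comes from the `infile` statement in the SAS program. That the last variable runs to the end of the record is an assumption, and the last row of the toy data is hypothetical.

```python
import pandas as pd

LRECL = 31706  # record length from the SAS infile statement

# Toy start positions; the last row is hypothetical.
toy = pd.DataFrame({
    "Name": ["f00001", "f00002", "f00003", "f_last"],
    "Pos_Start": [1, 2, 7, 31700],
})

# End of each field = start of the next field minus one (NaN for the last row).
toy["Pos_End"] = toy["Pos_Start"].shift(-1) - 1

# Assumption: the last field extends to the end of the record.
toy["Pos_End"] = toy["Pos_End"].fillna(LRECL).astype(int)
print(toy)
```

With no missing values left, `Pos_End` can be stored as an integer column.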


In [6]:
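# Clean label information and add it to the variable name and location data
# On each label line, the variable name is separated from the rest by a tab
labf19 = labf19[0].str.split("\t", n=1, expand=True)
# The label is the text between the first pair of double quotes
labf192 = labf19[1].str.split('"', n=2, expand=True)
labf19 = pd.concat([labf19[0], labf192[1]], axis=1)
labf19.columns = ["Name", "Label"]
# Inner join on the variable name: variables without a matching label are dropped
locf19 = pd.merge(left=locf19, right=labf19, on="Name")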
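`pd.merge` defaults to an inner join, so a variable whose name does not match in both tables disappears without warning. The sketch below shows one way to check for this on toy data, using the `validate` and `indicator` options of `pd.merge`. The toy tables and the missing label for `f00003` are made up for the example.

```python
import pandas as pd

# Toy location and label tables; f00003 deliberately has no label.
loc = pd.DataFrame({"Name": ["f00001", "f00002", "f00003"],
                    "Pos_Start": [1, 2, 7]})
lab = pd.DataFrame({"Name": ["f00001", "f00002"],
                    "Label": ["Blank", "Header - FIPS St and Cty Code"]})

# validate raises MergeError if a name occurs more than once on either side;
# indicator adds a "_merge" column showing where each row came from.
checked = pd.merge(loc, lab, on="Name", how="outer",
                   validate="one_to_one", indicator=True)

unmatched = checked.loc[checked["_merge"] != "both", "Name"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean that every variable has both a location and a label.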


In [7]:
# Save dictionary file

locf19.to_csv('AHRF2019_dict.csv')

In [8]:
locf19.head()

   Pos_Start    Name  String  Pos_End                          Label
0          1  f00001    True      1.0                          Blank
1          2  f00002    True      6.0  Header - FIPS St and Cty Code
2          7  f00003    True     11.0                 Entity of File
3         12  f00004    True     31.0       Secondary Entity Of File
4         32  f00005    True     35.0                   Date of File
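As a sketch of how the dictionary might be used, the example below searches the labels for a keyword and passes the matching positions to `pandas.read_fwf`. The dictionary rows repeat the first rows shown above. The two data records, the search pattern, and the variable names `ahrf_dict`, `wanted` and `fake_file` are made up for the example; with the real file, a path to the ASCII data would replace `fake_file`.

```python
import io
import pandas as pd

# First rows of the dictionary, as shown in the output above.
ahrf_dict = pd.DataFrame({
    "Name": ["f00001", "f00002", "f00003", "f00004"],
    "Pos_Start": [1, 2, 7, 12],
    "Pos_End": [1, 6, 11, 31],
    "Label": ["Blank", "Header - FIPS St and Cty Code",
              "Entity of File", "Secondary Entity Of File"],
})

# Search the labels (case-insensitive regular expression).
wanted = ahrf_dict[ahrf_dict["Label"].str.contains("fips|^entity", case=False)]

# read_fwf expects 0-based, end-exclusive column ranges,
# so only the 1-based start position is shifted by one.
colspecs = [(int(s) - 1, int(e))
            for s, e in zip(wanted["Pos_Start"], wanted["Pos_End"])]

# Two made-up 31-character records standing in for the ASCII data file.
records = [
    " " + "01001" + "AHRF " + "County".ljust(20),
    " " + "01003" + "AHRF " + "County".ljust(20),
]
fake_file = io.StringIO("\n".join(records) + "\n")

# dtype=str keeps leading zeros in codes such as FIPS.
df = pd.read_fwf(fake_file, colspecs=colspecs,
                 names=wanted["Name"].tolist(), header=None, dtype=str)
print(df)
```

Reading only the columns that are needed keeps memory use low, which matters for a file with several thousand variables per record.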
