<div style='padding:5px; border-bottom:0.25rem solid #f3f3f3; height:100px;'><img src='https://precision.heart.org/assets/images/aha-pmp-logo.png' alt='Drawing' style='width:135px; float:left;'><img src='https://precision.heart.org/sso/images/precision-medicine-platform-logo.png' alt='Drawing' style='width:140px; float:right;'></div>

<div style="font-size:32px; padding:5px;"><b>Loading and Summarizing GWTG Data in Python</b></div>

<div style="font-size:22px; padding-bottom:10px; padding-top:20px; color:#6D6E71"><b>Laura Stevens, Remy Poudel, Raakhee Iyer</b></div>
<div style="font-size:18px; font-style:italic; color:#6D6E71"><b>October 2020</b></div>


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-and-Software" data-toc-modified-id="Introduction-and-Software-0"><span style="width: 50% ; color: #c10e21 ; border-bottom: 0.25rem solid #f3f3f3">Introduction and Software</span></a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Software" data-toc-modified-id="Software-0.0.1">Software</a></span></li><li><span><a href="#Package-Installation" data-toc-modified-id="Package-Installation-0.0.2">Package Installation</a></span></li><li><span><a href="#Importing-Packages" data-toc-modified-id="Importing-Packages-0.0.3">Importing Packages</a></span></li></ul></li></ul></li><li><span><a href="#Load-and-View-Data" data-toc-modified-id="Load-and-View-Data-1"><span style="width: 50% ; color: #c10e21 ; border-bottom: 0.25rem solid #f3f3f3">Load and View Data</span></a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Registry-File-Organization" data-toc-modified-id="Registry-File-Organization-1.0.1">Registry File Organization</a></span></li><li><span><a href="#File-Paths" data-toc-modified-id="File-Paths-1.0.2">File Paths</a></span></li><li><span><a href="#Loading-SAS-files-in-Python" data-toc-modified-id="Loading-SAS-files-in-Python-1.0.3">Loading SAS files in Python</a></span></li><li><span><a href="#Viewing-the-data" data-toc-modified-id="Viewing-the-data-1.0.4">Viewing the data</a></span></li></ul></li></ul></li><li><span><a href="#Tables-and-Summary-Statistics" data-toc-modified-id="Tables-and-Summary-Statistics-2"><span style="width: 50% ; color: #c10e21 ; border-bottom: 0.25rem solid #f3f3f3">Tables and Summary Statistics</span></a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Tables-and-Frequencies" data-toc-modified-id="Tables-and-Frequencies-2.0.1">Tables and Frequencies</a></span></li><li><span><a href="#Summary-Statistics" data-toc-modified-id="Summary-Statistics-2.0.2">Summary 
Statistics</a></span></li></ul></li></ul></li><li><span><a href="#Need-Help?" data-toc-modified-id="Need-Help?-3"><span style="width: 50% ; color: #c10e21 ; border-bottom: 0.25rem solid #f3f3f3">Need Help?</span></a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Internet-access-in-a-PMP-workspace" data-toc-modified-id="Internet-access-in-a-PMP-workspace-3.0.1">Internet access in a PMP workspace</a></span></li><li><span><a href="#Billing,-Data,-or-Analysis-Questions?" data-toc-modified-id="Billing,-Data,-or-Analysis-Questions?-3.0.2">Billing, Data, or Analysis Questions?</a></span></li><li><span><a href="#Technical-Questions-or-Issues?" data-toc-modified-id="Technical-Questions-or-Issues?-3.0.3">Technical Questions or Issues?</a></span></li></ul></li></ul></li><li><span><a href="#References" data-toc-modified-id="References-4"><span style="width: 50% ; color: #c10e21 ; border-bottom: 0.25rem solid #f3f3f3">References</span></a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Helpful-Python-Packages" data-toc-modified-id="Helpful-Python-Packages-4.0.1">Helpful Python Packages</a></span></li><li><span><a href="#Python-Tutorials" data-toc-modified-id="Python-Tutorials-4.0.2">Python Tutorials</a></span></li><li><span><a href="#Academic-Publications-and-Guidelines" data-toc-modified-id="Academic-Publications-and-Guidelines-4.0.3">Academic Publications and Guidelines</a></span></li></ul></li></ul></li></ul></div>

# <span style='width:50%; color:#C10E21; border-bottom:0.25rem  solid #f3f3f3;'>Introduction and Software</span>

This notebook walks through loading, viewing, and summarizing a GWTG dataset composed of a sas7bdat data file and a sas7bcat formats file. The Tables and Summary Statistics section gives a brief introduction to calculating frequencies for categorical variables and distribution statistics for continuous variables. 

### Software

This tutorial is written in Python. The packages it imports are listed and installed in the sections below. These packages, along with additional packages that are useful for tabulating and summarizing data, are listed in the [References Section](#References).

### Package Installation 

The following packages are used for this tutorial. The pandas package is installed by default in the workspace; installation instructions for pandas are available [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html#installing-from-pypi).
Packages only need to be installed once. After the initial installation, the cell below can be skipped or commented out (using "#").

In [1]:
%%bash
sudo python3 -m pip install pyreadstat





### Importing Packages

Once all packages have been installed, they can be loaded with the <code>import</code> statement. The code below imports all packages for the tutorial.

In [2]:
import pandas as pd
import pyreadstat

# <span style='width:50%; color:#C10E21; border-bottom:0.25rem  solid #f3f3f3;'>Load and View Data</span>

### Registry File Organization
The root directory path in the workspace is <code>/mnt/workspace</code>. This is the directory for the files listed under the Jupyter Home tab as well as the default working directory for Jupyter, JupyterLab, RStudio, and SAS Studio. 

The data, documentation, and usage files for each release of the registry are contained in separate folders within the <code>/mnt/workspace/GWTG/</code> directory and structured as follows: 
   * Each AHA GWTG <b>registry</b> has its own folder within the GWTG directory (e.g. <code>Resuscitation</code>).
   * The data and accompanying documentation are stored in a directory named with the <b>version number followed by the year-month of the release for that version</b> (e.g. <code>v1_2020-08</code>). 
   * Each version of a registry release contains the following: 
       * A <code><b>data</b></code> folder with SAS data and format files
       * A <code><b>documentation</b></code> folder with the necessary documentation, including the data dictionary and coding manual.
   
For a more in-depth introduction to the registry data files available in a PMP workspace, **check the [GWTG Resuscitation Registry Researcher Resource](https://precision.heart.org/documentation/AHA-GWTG-Resus/index.html)**. **This website is available through the "PMP Documentation" bookmark in the favorites bar above under the data resources menu.**  

### File Paths 
<b><span style="background-color:pink;">This notebook can be used to load any version of the registry data by changing the file paths created in the cell below.</span></b> The code in the next cell creates a file path based on the version of the registry for the dataset being loaded. 

<span style="color:#C10E21;"><b><i>NOTE:</i></b></span> ***If needed, change the <code>version_release_date</code>, <code>dataset_folder</code>, and <code>data_file</code> values below to load a different dataset.***

In [5]:
registry = "Resuscitation/"
version_release_date = "v3_2021-06/"
data_directory =  "/mnt/workspace/GWTG/" + registry + version_release_date + "data/"
dataset_folder = "CPA Adult/"
data_file = data_directory + dataset_folder + 'v3_2021_06_gwtg_resus_cpa_adult.sas7bdat'
formats_file = data_directory + dataset_folder + 'formats/' + 'v3_2021_06_gwtg_resus_formats.sas7bdat'

***This notebook is currently set to load the following dataset.*** To load a different version of the data, change the values assigned to the <code>registry</code>, <code>version_release_date</code>, and <code>dataset_folder</code> variables above.

In [6]:
print(data_file)
print(formats_file)

/mnt/workspace/GWTG/Resuscitation/v3_2021-06/distribution/data/CPA Adult/v3_2021_06_gwtg_resus_cpa_adult.sas7bdat
/mnt/workspace/GWTG/Resuscitation/v3_2021-06/distribution/data/CPA Adult/formats/v3_2021_06_gwtg_resus_formats.sas7bdat
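As an alternative to string concatenation, the same paths could be assembled with Python's built-in <code>pathlib</code> module, which avoids missing or doubled slashes when folder names change. The sketch below is illustrative only: the helper name <code>build_gwtg_paths</code> and its parameters are hypothetical, and the folder layout is assumed to mirror the one built in the cell above.

```python
from pathlib import PurePosixPath


def build_gwtg_paths(registry, version_release_date, dataset_folder,
                     data_name, formats_name, root="/mnt/workspace/GWTG"):
    """Return (data_file, formats_file) as strings.

    Hypothetical helper: the layout (root/registry/version/data/dataset)
    is assumed to match the string concatenation used in this notebook.
    """
    data_directory = (PurePosixPath(root) / registry / version_release_date
                      / "data" / dataset_folder)
    data_file = data_directory / data_name
    formats_file = data_directory / "formats" / formats_name
    return str(data_file), str(formats_file)


data_file, formats_file = build_gwtg_paths(
    registry="Resuscitation",
    version_release_date="v3_2021-06",
    dataset_folder="CPA Adult",
    data_name="v3_2021_06_gwtg_resus_cpa_adult.sas7bdat",
    formats_name="v3_2021_06_gwtg_resus_formats.sas7bdat",
)
print(data_file)
print(formats_file)
```

Only the arguments need to change to point at another registry, release, or dataset folder.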


### Loading SAS files in Python
The <code>pyreadstat.read_sas7bdat()</code> command returns two objects: the data as a pandas data frame, and a metadata object that stores the metadata/documentation in a structure similar to a JSON object or Python dictionary. **The code below will load both the data as well as the formats/coding for the variables in the data.** 
***NOTE:*** *The <code>pd.read_sas()</code> command from pandas can also be used to load SAS data files (sas7b<u><b>d</b></u>at), but will not store any metadata/documentation.*

<span style="background-color:pink;">The code below stores the data in the <b>df</b> object; the definitions and formats for each variable in the data are stored in <b>meta</b>. The <b>formats</b> object stores the coding information for each format available in the data.</span> The first 5 rows of <b>df</b> and <b>formats</b> are shown below.

##### Mapping formats and coding
To map the coding labels for a categorical variable in the data, use <code>meta.original_variable_types[&lt;column name&gt;]</code> to get the coding format for the variable, then look the format up using the FMTNAME column in <b>formats</b>. An example is provided in the [Tables and Summary Statistics Section](#Tables-and-Frequencies).

In [7]:
df, meta = pyreadstat.read_sas7bdat(data_file)
formats,_ = pyreadstat.read_sas7bdat(formats_file)

In [8]:
df.shape

(520523, 326)

### Viewing the data


The following commands are helpful when viewing large datasets. 
* <code>head()</code> will show the first 5 rows. Similarly, <code>tail()</code> will display the last 5 rows of the dataset. 
* <code>meta.variable_to_label</code> is a dictionary (an attribute, not a function) that can be used to view the column names and the formats for the dataset.
* <code>meta.value_labels</code> is a dictionary (an attribute, not a function) that can be used to view the coding values and labels for the dataset.
* <code>dtypes</code> will give an overview of the data type for each variable (or field name) in the dataset. 
    * To view a single variable, use <code>dtype</code> on that variable (e.g. <code>df.SEX.dtype</code>). 
    
***NOTE:*** <code>dtypes</code> can be used on the entire dataset as well as a subset. Given the large size of the data, a subset can be used to limit the output displayed in the code cells. 
* <code>pd.set_option()</code> can be used to display all the rows and to display the complete column contents for a pandas dataframe.          
    * <code>pd.set_option('display.max_rows', None)</code>
    * <code>pd.set_option('display.max_colwidth', 2000)</code>
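The viewing commands above can be tried on a small example first. The sketch below uses a made-up three-row data frame (the column values are not registry data) to show <code>dtypes</code> on a subset of columns, <code>dtype</code> on a single variable, and <code>pd.option_context()</code>, which applies display options only inside a <code>with</code> block instead of changing them for the whole session as <code>pd.set_option()</code> does.

```python
import pandas as pd

# Toy stand-in for the registry data; the values are made up for illustration
toy = pd.DataFrame({
    "SEX": [1.0, 2.0, 1.0],
    "AGE_ADMY": [57.0, 77.0, 61.0],
    "NOTE": ["a", "b", "c"],
})

# Data types for a subset of columns only
subset_types = toy[["SEX", "AGE_ADMY"]].dtypes
print(subset_types)

# Data type of a single variable
print(toy.SEX.dtype)

# Temporarily lift the display limits; the previous settings are
# restored automatically when the with block ends
with pd.option_context("display.max_rows", None, "display.max_colwidth", 2000):
    print(toy.dtypes)
    rows_inside = pd.get_option("display.max_rows")

rows_after = pd.get_option("display.max_rows")
```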

In [9]:
df.head()

Unnamed: 0,AGE_CPAY_NEW,ADM_ID_FAN,SITE_ID,SITE_ID_FAN,FORMVER,COVIDDIAG,COVIDVACC,COVIDVACCDT,COVIDVACCDT_ND,COVIDVACCDOC,...,DURATION_SEC_ROC,DURATION_MIN_ROC,DURATION_SEC_TUBE,DURATION_MIN_TUBE,DURATION_SEC_TMAX,DURATION_MIN_TMAX,DURATION_SEC_VASVIA,DURATION_MIN_VASVIA,DURATION_SEC_VFPV,DURATION_MIN_VFPV
0,57,4097749.0,32987.0,32987.0,,,,,,,...,,,241920.0,4032.0,,,,,,
1,77,4097975.0,31192.0,31192.0,,,,,,,...,30840.0,514.0,,,41400.0,690.0,,,,
2,61,4098327.0,57203.0,57203.0,,,,,,,...,,,205260.0,3421.0,209340.0,3489.0,,,,
3,76,4102670.0,31192.0,31192.0,201106.0,,,,,,...,796440.0,13274.0,,,809520.0,13492.0,,,,
4,76,4102670.0,31192.0,31192.0,201106.0,,,,,,...,,,,,,,,,,


In [10]:
formats.head()

Unnamed: 0,FMTNAME,START,END,LABEL,MIN,MAX,DEFAULT,LENGTH,FUZZ,PREFIX,...,FILL,NOEDIT,TYPE,SEXCL,EEXCL,HLO,DECSEP,DIG3SEP,DATATYPE,LANGUAGE
0,ACTIVECF,1,1,1: Surface Cooling,1.0,48.0,48.0,48.0,1e-12,,...,,0.0,N,N,N,,,,,
1,ACTIVECF,2,2,2: Cold IV Saline Bolus,1.0,48.0,48.0,48.0,1e-12,,...,,0.0,N,N,N,,,,,
2,ACTIVECF,3,3,3: Intravascular device or catheter (continuous),1.0,48.0,48.0,48.0,1e-12,,...,,0.0,N,N,N,,,,,
3,ACTIVECF,4,4,4: Intranasal,1.0,48.0,48.0,48.0,1e-12,,...,,0.0,N,N,N,,,,,
4,ACTIVECF,5,5,5: Antipyretics,1.0,48.0,48.0,48.0,1e-12,,...,,0.0,N,N,N,,,,,


In [11]:
df_data_types = df.dtypes.to_frame(name='Data Type')  # view the data type of each column
df_data_types

Unnamed: 0,Data Type
AGE_CPAY_NEW,object
ADM_ID_FAN,float64
SITE_ID,float64
SITE_ID_FAN,float64
FORMVER,float64
...,...
DURATION_MIN_TMAX,float64
DURATION_SEC_VASVIA,float64
DURATION_MIN_VASVIA,float64
DURATION_SEC_VFPV,float64


# <span style='width:50%; color:#C10E21; border-bottom:0.25rem  solid #f3f3f3;'>Tables and Summary Statistics</span>

### Tables and Frequencies 

Getting the counts of one or more variables is easy in Python with the <code>value_counts()</code> function, which returns the number of times each value occurs. It can be used to get counts of categorical variables in the data, as shown in the code below, which makes a table of the number of patients by the <code>RACE</code> variable. The <code>dropna</code> argument controls whether null values are excluded (<code>True</code>, the default) or included (<code>False</code>). 

In [12]:
pd.DataFrame(df.RACE.value_counts(dropna = False)) #dropna = False will show counts for missingness

Unnamed: 0,RACE
1.0,344713
2.0,121159
5.0,35155
3.0,9194
6.0,4533
4.0,2649
,2077
7.0,1043
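Counts are often reported alongside percentages. A minimal sketch of one way to build such a table is shown here, using a made-up series of codes rather than the registry data; the column names <code>n</code> and <code>percent</code> are arbitrary choices.

```python
import pandas as pd

# Made-up codes standing in for a categorical registry variable;
# None marks a missing value
race = pd.Series([1.0, 1.0, 2.0, None, 1.0, 2.0, 5.0], name="RACE")

# Counts including missing values
counts = race.value_counts(dropna=False)

# Percent of all rows (missing included) that fall in each category
percents = counts / counts.sum() * 100

summary = pd.DataFrame({"n": counts, "percent": percents.round(1)})
print(summary)
```

Passing <code>normalize=True</code> to <code>value_counts()</code> returns proportions directly, which is an alternative to dividing by the total.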


##### Mapping coding labels to a categorical variable. 
To map the coding labels, the <code>meta.original_variable_types[&lt;column name&gt;]</code> command can be used to look up coding in the formats dataframe. The code below uses <code>.astype('category')</code> and <code>.cat.rename_categories()</code> to convert a variable to type category with the appropriate coding labels. <i><b>Note:</b> variables with missingness will have missing labels first, which should be removed or accounted for when mapping.</i>

In [13]:
#get coding for df variable RACE (first 7 labels of its format)
coding = formats[formats["FMTNAME"] == meta.original_variable_types["RACE"]]["LABEL"][0:7]
#change variable type to category with coding labels
df["RACE"]=df["RACE"].astype('category')
df["RACE"]=df["RACE"].cat.rename_categories(coding.to_list())
#tabulate variable with labels
pd.DataFrame(df.RACE.value_counts(dropna = False))

Unnamed: 0,RACE
1:White,344713
2:Black,121159
5:Unable to Det/Unk/NotD,35155
3:Asian,9194
6:Other,4533
4:AmerInd/Esk,2649
,2077
7:NatHaw/PacIsl,1043
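Renaming categories by position relies on the labels being listed in the same order as the sorted codes. An alternative sketch, shown here on toy data, builds an explicit code-to-label dictionary and applies it with <code>map()</code>, so missing values simply stay missing. It assumes that START holds the code and LABEL holds the label for each FMTNAME, as the <code>formats.head()</code> output suggests; the format names <code>RACEF</code> and <code>SEXF</code> and the helper <code>code_to_label</code> are made up for illustration.

```python
import pandas as pd

# Toy formats table with the same key columns as the real one (values made up)
formats = pd.DataFrame({
    "FMTNAME": ["RACEF", "RACEF", "RACEF", "SEXF", "SEXF"],
    "START": ["1", "2", "3", "1", "2"],
    "LABEL": ["1:White", "2:Black", "3:Asian", "1:Male", "2:Female"],
})

# Toy data column; None marks a missing value
df = pd.DataFrame({"RACE": [1.0, 2.0, None, 1.0, 3.0]})


def code_to_label(formats, format_name):
    """Build a {code: label} dictionary for one format (hypothetical helper)."""
    rows = formats[formats["FMTNAME"] == format_name]
    # START may be stored as text, so convert it to numbers before matching
    codes = pd.to_numeric(rows["START"], errors="coerce").astype(float)
    return dict(zip(codes, rows["LABEL"]))


race_labels = code_to_label(formats, "RACEF")

# Keep the original codes and store the labels in a new category column
df["RACE_LABEL"] = df["RACE"].map(race_labels).astype("category")
print(df["RACE_LABEL"].value_counts(dropna=False))
```

Writing the labels to a new column keeps the original numeric codes available for later checks.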


### Summary Statistics  
The <code>describe()</code> command can be used to view distribution statistics for all variables in the data (<code>df.describe()</code>) or for a single variable. For numeric variables, it displays the count, mean, standard deviation, min, quartiles (including the median), and max; category variables are excluded by default. 
    
***NOTE:*** *<code>describe()</code> can be used on the entire dataset. Given the large size of the data, the command is called on a single variable below to limit the output displayed.* 

In [29]:
df.AGE_ADMY.describe()

count    451791.000000
mean         64.999316
std          15.848684
min          18.000000
25%          55.000000
50%          67.000000
75%          77.000000
max         112.000000
Name: AGE_ADMY, dtype: float64
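When only a few statistics are needed, or when statistics should be split by a grouping variable, <code>agg()</code> and <code>groupby()</code> can be combined with <code>describe()</code>. The sketch below uses a small made-up data frame; the column names are borrowed from this notebook, but the values are not registry data.

```python
import pandas as pd

# Made-up values for illustration; None marks a missing age
toy = pd.DataFrame({
    "AGE_ADMY": [57.0, 77.0, 61.0, 76.0, None, 45.0],
    "SEX": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
})

# Selected statistics for one continuous variable; missing values are skipped
age_stats = toy["AGE_ADMY"].agg(["count", "mean", "median", "min", "max"])
print(age_stats)

# The same describe() output, computed separately for each value of SEX
by_sex = toy.groupby("SEX")["AGE_ADMY"].describe()
print(by_sex)
```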

# <span style='width:50%; color:#C10E21; border-bottom:0.25rem  solid #f3f3f3;'>Need Help?</span>
### Internet access in a PMP workspace
   * Google and other useful programming sites, including git and Stack Overflow, are available in the workspace. Simply add a tab by clicking the + on the blue browser bar above.


###  Billing, Data, or Analysis Questions?
   * If you have questions about your account, billing, the GWTG data, or would like to request analysis help from the AHA Data Science Team, please file a ticket with the data team by selecting [Contact Us](https://precision.heart.org/contact) on the PMP portal under the About tab.
    
    
### Technical Questions or Issues?
   * If you are having technical trouble installing software or packages, using the workspace or software in the workspace,  please file a ticket with support by selecting [technical support](https://precision.heart.org/technical-support) on the PMP portal under the About tab.  

# <span style='width:50%; color:#C10E21; border-bottom:0.25rem  solid #f3f3f3;'>References</span>
###  Helpful Python Packages
   * Loading SAS files in Python
        * [pandas-loading files](https://pandas.pydata.org/docs/user_guide/io.html)
   * Data Manipulation
        * [pandas](https://pandas.pydata.org/)
   * Tables and display
        * [pandas-styling](https://pandas.pydata.org/docs/user_guide/style.html)
   * Plotting and Graphs
        * [plotly](https://plotly.com/python/)
        * [matplotlib](https://matplotlib.org/)      
   
### Python Tutorials 
   * [Data Science with Python](https://realpython.com/tutorials/data-science/)
   * [Python documentation](https://docs.python.org/3/tutorial/index.html)
   
### Academic Publications and Guidelines
   * [Descriptives Explanation](https://www.socialresearchmethods.net/kb/statdesc.php)
   * [Reporting Descriptives in Clinical Studies](https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-016-1189-4)

<div style='padding-top:5px; border-top:0.25rem solid #f3f3f3;  font-size:1.5rem; color:#6D6E71; text-align:center;  padding-left:5%; padding-right:15%'>Use of the AHA Get With The Guidelines  <span>&#174;</span> Datasets in the Precision Medicine Platform is a strategic initiative of the American Heart Association's Institute for Precision Cardiovascular Medicine. The Platform is supported by Hitachi Vantara and powered by Amazon Web Services.
        <img src='https://www.hitachivantara.com/content/dam/public/en_us/images/sharing-graphic.jpg' alt='Drawing' style='height:75px; margin-top:-5px;  float:center'>
         <img src='https://a0.awsstatic.com/libra-css/images/logos/aws_logo_smile_1200x630.png'  alt='Drawing' style='height:75px; margin-top:-10px; float:center;'>
</div>