# This file is used to study CKD and diabetes, organized by age group

## Preliminary installation and file info

First we import the packages we use:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### File summary:
The files we use are organized in the `proj_files_csv` folder of this repository.
`proj_files_csv` has 5 folders: `democsv`, `biocsv`, `albucsv`, `a1ccsv`, `FPGcsv`. Each folder has csv files corresponding to the years we are interested in.

In [3]:
years = ['9900', '0102', '0304', '0506', '0708','0910','1112','1314','1516','1720']
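
Before loading anything, it can help to confirm that the expected files are actually present. The sketch below builds the path for each folder and year and lists any that are missing. It assumes the `<prefix><years>.csv` naming pattern used by the loading cells in this notebook (for example `proj_files_csv/democsv/demo9900.csv`), and it covers only the three folders whose prefixes appear there. The helper name `expected_paths` is hypothetical.

```python
from pathlib import Path

years = ['9900', '0102', '0304', '0506', '0708', '0910', '1112', '1314', '1516', '1720']

# Folder -> file-name prefix, following the pattern used by the loading cells
# in this notebook. Prefixes for a1ccsv and FPGcsv are not shown there, so
# they are left out rather than guessed.
prefixes = {'democsv': 'demo', 'biocsv': 'bio', 'albucsv': 'albu'}

def expected_paths(root='proj_files_csv'):
    """Return every csv path implied by the folder/year naming pattern."""
    return [Path(root) / folder / f'{prefix}{y}.csv'
            for folder, prefix in prefixes.items()
            for y in years]

paths = expected_paths()
missing = [p for p in paths if not p.exists()]
print(f'{len(paths)} files expected, {len(missing)} missing')
for p in missing:
    print('  missing:', p)
```

Running this from the repository root should report zero missing files if the layout matches the description above.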

#### Here is some information on which files are used for what. Note that all the csv files have a SEQN column, which is the identification number for each participant.

For both CKD and Diabetes:
-   `democsv` is the folder with the demographics data. We use the RIDAGEYR column for age and the RIAGENDR column for gender.

For only CKD:
-   `biocsv` is the folder with the standard biochemistry profile files. From these we use the LBXSCR column, which gives the serum creatinine in mg/dL.
-   `albucsv` is the folder with the files used for the albumin creatinine ratio. URXUMS is the urine albumin in mg/L. URXCRS is the urine creatinine in umol/L, which we divide by 1000 to get mmol/L. We then divide albumin by creatinine to get the ratio in mg/mmol.
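
The unit conversion for the albumin creatinine ratio can be checked on a couple of rows. This is a minimal sketch: the two rows use the values printed for SEQN 2 and 3 in the 1999-2000 albumin output of this notebook, and the variable names `example` and `creatinine_mmol_per_l` are only illustrative.

```python
import pandas as pd

# Two rows with the values printed for the 1999-2000 albumin file
# in this notebook (SEQN 2 and 3).
example = pd.DataFrame({
    'SEQN': [2.0, 3.0],
    'URXUMS': [9.1, 10.4],          # urine albumin, mg/L
    'URXCRS': [12818.0, 11138.0],   # urine creatinine, umol/L
})

# umol/L -> mmol/L
creatinine_mmol_per_l = example['URXCRS'] / 1000

# (mg/L) / (mmol/L) = mg/mmol
example['albcretratio'] = example['URXUMS'] / creatinine_mmol_per_l
print(example)
```

For SEQN 2 this gives 9.1 / 12.818, roughly 0.71 mg/mmol, which matches the `albcretratio` value printed by the loading cell.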

For only Diabetes:
-   `a1ccsv` is the folder with the files for the a1c. The DIQ010 and DIQ160 columns record whether participants report having been diagnosed with diabetes and prediabetes, respectively: 1 is yes, 2 is no, 3 is borderline, 7 is refused, 9 is don't know. DIQ280 is the column with the a1c value.
-   `FPGcsv` is the folder with the fasting plasma glucose files. The LBXGLU column has the fasting plasma glucose measured in mg/dL.
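
The response codes listed above are easier to read once they are mapped to labels. The sketch below shows one way to do that with `Series.map`. The data frame `toy` holds made-up responses for illustration only, and the `DIQ010_label` column name is an assumption, not something the notebook defines.

```python
import pandas as pd

# Code meanings as listed above for DIQ010 / DIQ160.
diq_labels = {1: 'yes', 2: 'no', 3: 'borderline', 7: 'refused', 9: "don't know"}

# Made-up responses for illustration only (not real survey data).
toy = pd.DataFrame({
    'SEQN': [101, 102, 103, 104],
    'DIQ010': [1, 2, 3, 9],
})

# Codes that are not in the dictionary (including missing values) become NaN.
toy['DIQ010_label'] = toy['DIQ010'].map(diq_labels)
print(toy)
```

Keeping the numeric column next to the label column makes it easy to filter on either one later.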



#### Here we load the demographic data, since it is used for both CKD and diabetes

In [4]:
demo_vars = []
for y in years:
    df  = pd.read_csv(f'proj_files_csv/democsv/demo{y}.csv', usecols=['SEQN', 'RIDAGEYR', 'RIAGENDR'])
    demo_vars.append(df)

#print(demo_vars)
#print(len(demo_vars))

print("The demo_vars list contains dataframes for each year. It has length: ", len(demo_vars))
print("\n The first dataframe in the list (for year 1999-2000) looks like this:") 
print(demo_vars[0].head())
print("Note RIAGENDR is gender (1 male, 2 female) and RIDAGEYR is age (in years).")
print("\n In addition, SEQN is a unique identifier for each participant in the survey.")

The demo_vars list contains dataframes for each year. It has length:  10

 The first dataframe in the list (for year 1999-2000) looks like this:
   SEQN  RIAGENDR  RIDAGEYR
0   1.0       2.0       2.0
1   2.0       1.0      77.0
2   3.0       2.0      10.0
3   4.0       1.0       1.0
4   5.0       1.0      49.0
Note RIAGENDR is gender (1 male, 2 female) and RIDAGEYR is age (in years).

 In addition, SEQN is a unique identifier for each participant in the survey.
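
If a single table is more convenient than a list of per-year dataframes, the frames can be stacked with a column recording which year each row came from. This is only a sketch of one possible approach: the `cycle` column name is an assumption, the `years` list is shortened, and the frames below are small stand-ins (the first uses the first two rows printed above, the second is made up).

```python
import pandas as pd

years = ['9900', '0102']  # shortened list for the sketch

# Stand-ins for two entries of demo_vars.
demo_vars = [
    pd.DataFrame({'SEQN': [1.0, 2.0], 'RIAGENDR': [2.0, 1.0], 'RIDAGEYR': [2.0, 77.0]}),
    pd.DataFrame({'SEQN': [9966.0], 'RIAGENDR': [1.0], 'RIDAGEYR': [39.0]}),  # made-up row
]

# Tag each frame with its year label, then stack them into one table.
demo_all = pd.concat(
    [df.assign(cycle=y) for df, y in zip(demo_vars, years)],
    ignore_index=True,
)
print(demo_all)

# SEQN should not repeat across years if it identifies participants uniquely.
print('SEQN unique:', demo_all['SEQN'].is_unique)
```

The uniqueness check is a cheap way to confirm that SEQN can be used as a key across all years.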


## CKD Data and analysis

#### We'll get the serum creatinine and urine albumin creatinine ratio data next.

This is the serum creatinine data. 

In [5]:
bio_vars = []
for y in years:
    df  = pd.read_csv(f'proj_files_csv/biocsv/bio{y}.csv', usecols=['SEQN', 'LBXSCR'])
    bio_vars.append(df)

print("\nThe bio_vars list contains dataframes for each year. It has length: ", len(bio_vars))
print("\nThe first dataframe in the list (for year 1999-2000) looks like this:")
print(bio_vars[0].head())
print("Note LBXSCR is the serum creatinine level (in mg/dL).")
print("\nIn addition, SEQN is a unique identifier for each participant in the survey.")

#print(bio_vars)
#print(len(bio_vars))


The bio_vars list contains dataframes for each year. It has length:  10

The first dataframe in the list (for year 1999-2000) looks like this:
   SEQN  LBXSCR
0   2.0     0.7
1   5.0     0.8
2   6.0     0.5
3   7.0     0.6
4   8.0     0.3
Note LBXSCR is the serum creatinine level (in mg/dL).

In addition, SEQN is a unique identifier for each participant in the survey.


Here is the albuminuria data for each year:

In [12]:
albu_vars = []
for y in years:
    df  = pd.read_csv(f'proj_files_csv/albucsv/albu{y}.csv', usecols=['SEQN', 'URXUMS', 'URXCRS'])
    # Convert urine creatinine from umol/L to mmol/L
    df['URXCRSnew'] = df['URXCRS']/1000
    # Albumin to creatinine ratio in mg/mmol
    df['albcretratio'] = df['URXUMS']/df['URXCRSnew']
    #df = df[['SEQN', 'albcretratio']]
    albu_vars.append(df)

print("\nThe albu_vars list contains dataframes for each year. It has length: ", len(albu_vars))
print("\nThe first dataframe in the list (for year 1999-2000) looks like this:")
print(albu_vars[0].head())
#print(albu_vars[0].columns)
print("Note URXUMS is the urine albumin level (in mg/L) and URXCRS is the urine creatinine level (in umol/L).\n URXCRSnew is the urine creatinine level converted to mmol/L.")

print("\nThe albcretratio is the albumin to creatinine ratio (in mg/mmol).")


#print(albu_vars)
#print(len(albu_vars))


The albu_vars list contains dataframes for each year. It has length:  10

The first dataframe in the list (for year 1999-2000) looks like this:
   SEQN  URXUMS   URXCRS  URXCRSnew  albcretratio
0   2.0     9.1  12818.0     12.818      0.709939
1   3.0    10.4  11138.0     11.138      0.933740
2   5.0     6.1  15205.0     15.205      0.401184
3   6.0     5.0  10962.0     10.962      0.456121
4   7.0     6.7  11315.0     11.315      0.592134
Note URXUMS is the urine albumin level (in mg/L) and URXCRS is the urine creatinine level (in umol/L).
 URXCRSnew is the urine creatinine level converted to mmol/L.

The albcretratio is the albumin to creatinine ratio (in mg/mmol).


Merging the dataframes, we will use SEQN, serum creatinine, and albumin to creatinine ratio as the main variables of interest.
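
A minimal sketch of such a merge for a single year is shown below, joining on SEQN. The rows are copied from the printed heads of the 1999-2000 dataframes above. The use of inner joins (keeping only participants present in all three tables) is an assumption, as are the names `demo`, `bio`, `albu`, and `ckd`.

```python
import pandas as pd

# Rows copied from the printed heads of the 1999-2000 dataframes above.
demo = pd.DataFrame({'SEQN': [2.0, 3.0, 5.0],
                     'RIAGENDR': [1.0, 2.0, 1.0],
                     'RIDAGEYR': [77.0, 10.0, 49.0]})
bio = pd.DataFrame({'SEQN': [2.0, 5.0, 6.0],
                    'LBXSCR': [0.7, 0.8, 0.5]})
albu = pd.DataFrame({'SEQN': [2.0, 3.0, 5.0],
                     'albcretratio': [0.709939, 0.933740, 0.401184]})

# Inner joins on SEQN: a participant is kept only if they appear in the
# demographics, serum creatinine, and albumin tables.
ckd = (demo.merge(bio, on='SEQN', how='inner')
           .merge(albu[['SEQN', 'albcretratio']], on='SEQN', how='inner'))
print(ckd)
```

With these rows, only SEQN 2 and 5 survive, since SEQN 3 has no serum creatinine value and SEQN 6 has no demographics or albumin row. Switching to `how='left'` would instead keep every participant in `demo` and leave missing measurements as NaN.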

### All ages

### Youth

### Middle aged


### Elderly

## Diabetes Data and Analysis

### All ages

### Youth

### Middle aged


### Elderly