# Organized notebook for studying CKD

## Preliminary installation and file info

First we import the packages we use:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### File summary:
The files we use are organized in the `proj_files_csv` folder of this repository.
`proj_files_csv` has 5 folders: `democsv`, `biocsv`, `albucsv`, `a1ccsv`, `FPGcsv`. Each of these contains csv files for the survey cycles (year pairs) we are interested in.

In [3]:
years = ['9900', '0102', '0304', '0506', '0708','0910','1112','1314','1516','1720']
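
Before loading anything, it can help to check that every expected file is present. This is a minimal sketch, assuming the naming pattern `<prefix><years>.csv` used by the `pd.read_csv` calls in this notebook (`demo`, `bio`, `albu`). The helper name `find_missing_files` is hypothetical, and the `a1ccsv` and `FPGcsv` folders are left out because their file names are not shown here.

```python
from pathlib import Path

years = ['9900', '0102', '0304', '0506', '0708', '0910', '1112', '1314', '1516', '1720']

# Folder name -> file prefix, as used by the pd.read_csv calls in this notebook
prefixes = {'democsv': 'demo', 'biocsv': 'bio', 'albucsv': 'albu'}

def find_missing_files(root='proj_files_csv'):
    """Return the expected csv paths that do not exist under root."""
    missing = []
    for folder, prefix in prefixes.items():
        for y in years:
            path = Path(root) / folder / f'{prefix}{y}.csv'
            if not path.is_file():
                missing.append(path)
    return missing

missing = find_missing_files()
print(f'{len(missing)} expected files are missing')
```

An empty list means all 30 files (3 folders, 10 cycles each) were found.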

#### Which files are used for what. Note that all the csv files have a SEQN column, which is the identification number for each participant.

For both CKD and Diabetes:
-   `democsv` is the folder with the demographics data. We use the RIDAGEYR column for age (in years) and the RIAGENDR column for gender (1 male, 2 female).

For only CKD:
-   `biocsv` is the folder with the standard biochemistry profile files. From these we use the LBXSCR column, which gives the serum creatinine in mg/dL.
-   `albucsv` is the folder with the files used for the albumin-to-creatinine ratio. URXUMS is the urine albumin in mg/L. URXCRS is the urine creatinine in umol/L, which we divide by 1000 to get mmol/L. We then divide albumin by creatinine to get the ratio in mg/mmol.

For only Diabetes:
-   `a1ccsv` is the folder with the files for the a1c. The DIQ010 and DIQ160 columns record whether participants report having been diagnosed with diabetes and prediabetes, respectively: 1 is yes, 2 is no, 3 is borderline, 7 is refused, 9 is don't know. DIQ280 is the column with the a1c value.
-   `FPGcsv` is the folder with the fasting plasma glucose files. The LBXGLU column has the fasting plasma glucose measured in mg/dL.
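
The response codes listed above can be turned into readable labels with a plain dictionary. This is a minimal sketch on made-up values, not data from the survey files; the names `diq_codes` and `DIQ010_label` are hypothetical.

```python
import pandas as pd

# Response codes for DIQ010 / DIQ160 as listed above
diq_codes = {1: 'yes', 2: 'no', 3: 'borderline', 7: 'refused', 9: "don't know"}

# Toy data standing in for one a1c file (values are made up for illustration)
toy = pd.DataFrame({'SEQN': [1, 2, 3, 4], 'DIQ010': [1, 2, 3, 9]})

# Series.map looks up each code in the dictionary; unknown codes become NaN
toy['DIQ010_label'] = toy['DIQ010'].map(diq_codes)
print(toy)
```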



#### Here we'll get the demographic data, since it will be used for both CKD and diabetes

In [4]:
demo_vars = []
for y in years:
    df  = pd.read_csv(f'proj_files_csv/democsv/demo{y}.csv', usecols=['SEQN', 'RIDAGEYR', 'RIAGENDR'])
    demo_vars.append(df)

#print(demo_vars)
#print(len(demo_vars))

print("The demo_vars list contains dataframes for each year. It has length: ", len(demo_vars))
print("\n The first dataframe in the list (for year 1999-2000) looks like this:") 
print(demo_vars[0].head())
print("Note RIAGENDR is gender (1 male, 2 female) and RIDAGEYR is age (in years).")
print( "\n In addition, SEQN is a unique identifier for each participant in the survey.")

The demo_vars list contains dataframes for each year. It has length:  10

 The first dataframe in the list (for year 1999-2000) looks like this:
   SEQN  RIAGENDR  RIDAGEYR
0   1.0       2.0       2.0
1   2.0       1.0      77.0
2   3.0       2.0      10.0
3   4.0       1.0       1.0
4   5.0       1.0      49.0
Note RIAGENDR is gender (1 male, 2 female) and RIDAGEYR is age (in years).

 In addition, SEQN is a unique identifier for each participant in the survey.


## CKD Data and analysis

#### We'll get the serum creatinine and urine albumin creatinine ratio data next.

This is the serum creatinine data. 

In [5]:
bio_vars = []
for y in years:
    df  = pd.read_csv(f'proj_files_csv/biocsv/bio{y}.csv', usecols=['SEQN', 'LBXSCR'])
    bio_vars.append(df)

print("\nThe bio_vars list contains dataframes for each year. It has length: ", len(bio_vars))
print("\nThe first dataframe in the list (for year 1999-2000) looks like this:")
print(bio_vars[0].head())
print("Note LBXSCR is the serum creatinine level (in mg/dL).")
print("\nIn addition, SEQN is a unique identifier for each participant in the survey.")

#print(bio_vars)
#print(len(bio_vars))


The bio_vars list contains dataframes for each year. It has length:  10

The first dataframe in the list (for year 1999-2000) looks like this:
   SEQN  LBXSCR
0   2.0     0.7
1   5.0     0.8
2   6.0     0.5
3   7.0     0.6
4   8.0     0.3
Note LBXSCR is the serum creatinine level (in mg/dL).

In addition, SEQN is a unique identifier for each participant in the survey.


Here is the albuminuria data for each year:

In [12]:
albu_vars = []
for y in years:
    df  = pd.read_csv(f'proj_files_csv/albucsv/albu{y}.csv', usecols=['SEQN', 'URXUMS','URXCRS' ])
    df['URXCRSnew'] = df['URXCRS']/1000
    df['albcretratio'] = df['URXUMS']/df['URXCRSnew']
    #df = df[['SEQN', 'albcretratio']]
    albu_vars.append(df)

print("\nThe albu_vars list contains dataframes for each year. It has length: ", len(albu_vars))
print("\nThe first dataframe in the list (for year 1999-2000) looks like this:")
print(albu_vars[0].head())
#print(albu_vars[0].columns)
print("Note URXUMS is the urine albumin level (in mg/L) and URXCRS is the urine creatinine level (in umol/L).\n URXCRSnew is the urine creatinine level converted to mmol/L.")

print("\nThe albcretratio is the albumin to creatinine ratio (in mg/mmol).")


#print(albu_vars)
#print(len(albu_vars))


The albu_vars list contains dataframes for each year. It has length:  10

The first dataframe in the list (for year 1999-2000) looks like this:
   SEQN  URXUMS   URXCRS  URXCRSnew  albcretratio
0   2.0     9.1  12818.0     12.818      0.709939
1   3.0    10.4  11138.0     11.138      0.933740
2   5.0     6.1  15205.0     15.205      0.401184
3   6.0     5.0  10962.0     10.962      0.456121
4   7.0     6.7  11315.0     11.315      0.592134
Note URXUMS is the urine albumin level (in mg/L) and URXCRS is the urine creatinine level (in umol/L).
 URXCRSnew is the urine creatinine level converted to mmol/L.

The albcretratio is the albumin to creatinine ratio (in mg/mmol).
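
One edge case is worth keeping in mind: if a urine creatinine value were ever 0, the division would produce `inf` (or `NaN` for 0/0), and a later `dropna()` removes only the `NaN`. This is a minimal sketch on made-up values, not values from the survey files, that turns infinite ratios into missing values first.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; the second and third rows have a creatinine of 0
toy = pd.DataFrame({'URXUMS': [9.1, 5.0, 0.0], 'URXCRS': [12818.0, 0.0, 0.0]})
toy['albcretratio'] = toy['URXUMS'] / (toy['URXCRS'] / 1000)

# Replace +/- infinity with NaN so that dropna() removes those rows as well
toy['albcretratio'] = toy['albcretratio'].replace([np.inf, -np.inf], np.nan)
clean = toy.dropna()
print(clean)
```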


Next we merge the dataframes on SEQN, keeping age, gender, serum creatinine, and the albumin-to-creatinine ratio as the main variables of interest. We also restrict the data to ages 12 to 84:

In [None]:
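demobio_vars = []
for i in range(len(years)):
    demot = demo_vars[i]
    biot = bio_vars[i]
    albut = albu_vars[i]
    # Inner joins on SEQN keep only participants present in all three files
    demobiot = pd.merge(demot, biot, on='SEQN')
    demobiot = pd.merge(demobiot, albut, on='SEQN')
    # Drop participants with any missing measurement
    demobiot = demobiot.dropna()
    demobiot = demobiot[['SEQN', 'RIDAGEYR', 'RIAGENDR', 'LBXSCR', 'albcretratio']]
    # Keep ages 12 to 84; .copy() lets us add columns later without a SettingWithCopyWarning
    demobiot = demobiot[(demobiot['RIDAGEYR'] >= 12) & (demobiot['RIDAGEYR'] < 85)].copy()
    demobio_vars.append(demobiot)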

print("\nThe demobio_vars list contains dataframes with serum creatinine and albu creatinine ratio for each year. It has length: ", len(demobio_vars))
print("\n Note this is ages 12 to 84 (all ages)")
print("\nThe first dataframe in the list (for year 1999-2000) looks like this:")
print(demobio_vars[0].head())


#print(len(demobio_vars))
#print(demobio_vars[3].head())
#print(demobio_vars[3].size)


The demobio_vars list contains dataframes with serum creatinine and albu creatinine ratio for each year. It has length:  10

 Note this is ages 12 to 84 (all ages)

The first dataframe in the list (for year 1999-2000) looks like this:
   SEQN  RIDAGEYR  RIAGENDR  LBXSCR  albcretratio
0   2.0      77.0       1.0     0.7      0.709939
1   5.0      49.0       1.0     0.8      0.401184
2   6.0      19.0       2.0     0.5      0.456121
3   7.0      59.0       2.0     0.6      0.592134
4   8.0      13.0       1.0     0.3      0.570307
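
`pd.merge` performs an inner join by default, so participants missing from any one file drop out silently. This is a minimal sketch on toy data (the values are made up) showing how the `validate` and `indicator` arguments of `pd.merge` can confirm that SEQN is unique and show how many rows come from each side.

```python
import pandas as pd

# Toy stand-ins for one demographics file and one biochemistry file
demo = pd.DataFrame({'SEQN': [1, 2, 3, 4], 'RIDAGEYR': [2, 77, 10, 49]})
bio = pd.DataFrame({'SEQN': [2, 4, 5], 'LBXSCR': [0.7, 0.8, 0.5]})

# validate raises MergeError if SEQN is duplicated on either side;
# indicator adds a _merge column saying where each row came from
check = pd.merge(demo, bio, on='SEQN', how='outer',
                 validate='one_to_one', indicator=True)
print(check['_merge'].value_counts())

# The rows that the default inner join would keep
kept = check[check['_merge'] == 'both']
```

Rows marked `left_only` or `right_only` are the participants an inner join would discard.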


We'll now make the following functions:
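-   A function called eGFR, which computes the estimated glomerular filtration rate from serum creatinine, age, and gender
-   A function called ckd_stage, which gives the predicted CKD stage based on the eGFR and albuminuria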


In [16]:
# Here is the function for eGFR (estimated Glomerular Filtration Rate).
# The constants match the 2021 CKD-EPI creatinine equation; LBXSCR is in mg/dL.
def eGFR(row):
    if row['RIAGENDR'] == 1:  # male
        return 142 * min(row['LBXSCR']/0.9, 1)**(-0.302) * max(row['LBXSCR']/0.9, 1)**(-1.200) * 0.9938**row['RIDAGEYR']
    elif row['RIAGENDR'] == 2:  # female
        return 142 * min(row['LBXSCR']/0.7, 1)**(-0.241) * max(row['LBXSCR']/0.7, 1)**(-1.200) * 0.9938**row['RIDAGEYR'] * 1.012

# Here is a function to determine the CKD stage, based on https://www.kidneyfund.org/all-about-kidneys/stages-kidney-disease
def ckd_stage(row):
    if row['eGFR'] >= 60 and row['albcretratio'] < 3.4:
        return 0
    elif row['eGFR'] >= 90 and row['albcretratio'] >= 3.4:
        return 1
    elif row['eGFR'] >= 60 and row['albcretratio'] >= 3.4:
        return 2
    elif row['eGFR'] >= 30:
        return 3
    elif row['eGFR'] >= 15:
        return 4
    else:
        return 5
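
Row-wise `apply` is fine at this scale, but the same formula can also be evaluated on whole columns at once. This is a minimal sketch, assuming the column names used in this notebook; `egfr_vectorized` is a hypothetical helper and not part of the notebook. Unlike `eGFR`, it returns `NaN` rather than `None` when RIAGENDR is neither 1 nor 2.

```python
import numpy as np
import pandas as pd

def egfr_vectorized(df):
    """Same formula as eGFR(row), applied to whole columns."""
    male = df['RIAGENDR'] == 1
    female = df['RIAGENDR'] == 2
    kappa = np.where(male, 0.9, 0.7)          # creatinine reference value
    alpha = np.where(male, -0.302, -0.241)    # exponent used below the reference
    sex_factor = np.where(male, 1.0, 1.012)
    ratio = df['LBXSCR'] / kappa
    egfr = (142 * np.minimum(ratio, 1) ** alpha
                * np.maximum(ratio, 1) ** (-1.200)
                * 0.9938 ** df['RIDAGEYR'] * sex_factor)
    # Leave eGFR missing where gender is not coded 1 or 2
    return egfr.where(male | female)

# First three rows of the merged 1999-2000 dataframe shown earlier
sample = pd.DataFrame({'RIDAGEYR': [77.0, 49.0, 19.0],
                       'RIAGENDR': [1.0, 1.0, 2.0],
                       'LBXSCR': [0.7, 0.8, 0.5]})
print(egfr_vectorized(sample))
```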


Applying the functions onto the dataframes:

In [18]:
for i in range(len(demobio_vars)):
    demobio_vars[i]['eGFR'] = demobio_vars[i].apply(eGFR, axis=1)
    demobio_vars[i]['ckdstage'] = demobio_vars[i].apply(ckd_stage, axis=1)
    #print(demobio_vars[i].shape)

print("\nThe first dataframe in the list (for year 1999-2000) with eGFR and CKD stage looks like this:")
print(demobio_vars[0].head())
print("\nNote eGFR is the estimated Glomerular Filtration Rate (in mL/min/1.73m^2) and ckdstage is the stage of Chronic Kidney Disease (0-5).")


The first dataframe in the list (for year 1999-2000) with eGFR and CKD stage looks like this:
   SEQN  RIDAGEYR  RIAGENDR  LBXSCR  albcretratio        eGFR  ckdstage
0   2.0      77.0       1.0     0.7      0.709939   94.901350         0
1   5.0      49.0       1.0     0.8      0.401184  108.489332         0
2   6.0      19.0       2.0     0.5      0.456121  138.473468         0
3   7.0      59.0       2.0     0.6      0.592134  103.334082         0
4   8.0      13.0       1.0     0.3      0.570307  182.501259         0

Note eGFR is the estimated Glomerular Filtration Rate (in mL/min/1.73m^2) and ckdstage is the stage of Chronic Kidney Disease (0-5).
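
The staging rules can be checked quickly on one hand-picked row per stage. This is a minimal sketch using `np.select`, which evaluates the conditions in order just like the `if`/`elif` chain in `ckd_stage`. The helper `ckd_stage_vectorized` and the toy values are hypothetical, chosen only to hit each branch.

```python
import numpy as np
import pandas as pd

def ckd_stage_vectorized(df, acr_cutoff=3.4):
    """Same rules as ckd_stage(row); the first matching condition wins."""
    egfr, acr = df['eGFR'], df['albcretratio']
    conditions = [
        (egfr >= 60) & (acr < acr_cutoff),    # stage 0: no CKD flagged
        (egfr >= 90) & (acr >= acr_cutoff),   # stage 1
        (egfr >= 60) & (acr >= acr_cutoff),   # stage 2
        egfr >= 30,                           # stage 3
        egfr >= 15,                           # stage 4
    ]
    # Anything that matches no condition falls through to stage 5
    return np.select(conditions, [0, 1, 2, 3, 4], default=5)

# One made-up row per stage
toy = pd.DataFrame({'eGFR': [95, 95, 70, 45, 20, 10],
                    'albcretratio': [0.7, 5.0, 5.0, 1.0, 1.0, 1.0]})
toy['ckdstage'] = ckd_stage_vectorized(toy)
print(toy)
```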


In [53]:
print("\nThe CKD stage counts for all the years are:")
ckdvaluecounts = pd.DataFrame()
for i in range(len(demobio_vars)):
    ckdvaluecounts = pd.concat([ckdvaluecounts, demobio_vars[i]['ckdstage'].value_counts().rename(years[i])], axis=1)
    ckdvaluecounts = ckdvaluecounts.sort_index(key=lambda x: x.astype(int))
print(ckdvaluecounts)
print("\nThe CKD stage counts for all the years, the left most column is the CKD stage and the top row is the year. \n The values are the counts of participants in each CKD stage for that year.")
print("\nThe same CKD stage amounts as percentages (%) for all the years are:")

ckdvaluecounts2 = pd.DataFrame()
for i in range(len(demobio_vars)):
    # normalize=True returns fractions instead of counts
    ckdvaluecounts2 = pd.concat([ckdvaluecounts2, demobio_vars[i]['ckdstage'].value_counts(normalize=True).rename(years[i])], axis=1)
    # Convert the fractions to percentages rounded to 2 decimal places
    ckdvaluecounts2[years[i]] = (ckdvaluecounts2[years[i]] * 100).round(2)
    ckdvaluecounts2.index = ckdvaluecounts2.index.astype(int)
    ckdvaluecounts2 = ckdvaluecounts2.sort_index()
print(ckdvaluecounts2)



The CKD stage counts for all the years are:
   9900  0102  0304  0506  0708  0910  1112  1314  1516  1720
0  5249  5716  5344  5230  5164  5764  4926  5414  5192  7719
1   542   483   371   464   449   390   418   426   405   638
2   109   166   188   175   240   206   188   207   211   333
3   117   264   284   318   359   382   327   361   319   549
4     7    23    24    23    33    29    35    31    26    51
5    12     5     3     9     7    18    17    12    12    18

The CKD stage counts for all the years, the left most column is the CKD stage and the top row is the year. 
 The values are the counts of participants in each CKD stage for that year.

The same CKD stage amounts as percentages (%) for all the years are:
    9900   0102   0304   0506   0708   0910   1112   1314   1516   1720
0  86.96  85.86  86.00  84.10  82.60  84.90  83.34  83.92  84.22  82.93
1   8.98   7.26   5.97   7.46   7.18   5.74   7.07   6.60   6.57   6.85
2   1.81   2.49   3.03   2.81   3.84   3.03   3.18

In [30]:
demobio_vars[2]['ckdstage'].value_counts()

ckdstage
0    5344
1     371
3     284
2     188
4      24
5       3
Name: count, dtype: int64

Now we will run the weighted linear regression and look at the graphs.

In [None]:
# Represent each survey cycle by a single numeric year so it can be used as the x-variable in the regression.
years2 = np.array([2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2019])
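
The notebook does not yet say which weights the regression will use. This is a minimal sketch, assuming each survey cycle is weighted by its number of participants (an assumption, not the author's stated method), of a weighted straight-line fit of the stage 3 percentage against `years2`. The counts are copied from the CKD stage count table above, and `totals` is the sum of all six stages in each column.

```python
import numpy as np

years2 = np.array([2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2019])

# Stage 3 counts and column totals, copied from the CKD stage count table
stage3 = np.array([117, 264, 284, 318, 359, 382, 327, 361, 319, 549])
totals = np.array([6036, 6657, 6214, 6219, 6252, 6789, 5911, 6451, 6165, 9308])
pct = 100 * stage3 / totals

# np.polyfit minimizes sum((w * residual)**2), so pass sqrt(n) to weight
# each squared residual by n (ASSUMED weighting: participants per cycle)
x = years2 - years2.mean()   # center the years for numerical stability
slope, intercept = np.polyfit(x, pct, deg=1, w=np.sqrt(totals))
print(f'slope: {slope:.4f} percentage points per year')
```

Because the years are centered, `intercept` is the fitted percentage at the mean of `years2`, not at year 0.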

### All ages

For all ages here we will use the dataset as is without modifications

### Youth

For youth we will look at ages 12 to 30 inclusive, so we will filter the dataframes to only include those ages.
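
This is a minimal sketch of that filter on made-up values; `filter_ages` and `youth_vars` are hypothetical names, and in the notebook the list passed in would be `demobio_vars`.

```python
import pandas as pd

# Toy stand-in for one entry of demobio_vars (values are made up)
toy = pd.DataFrame({'SEQN': [1, 2, 3, 4], 'RIDAGEYR': [12.0, 30.0, 31.0, 59.0]})

def filter_ages(frames, low, high):
    """Keep rows with low <= RIDAGEYR <= high in every dataframe of the list."""
    # Series.between is inclusive on both ends by default
    return [df[df['RIDAGEYR'].between(low, high)].copy() for df in frames]

youth_vars = filter_ages([toy], 12, 30)
print(youth_vars[0])
```

The same helper could serve the other age groups once their ranges are decided.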

### Middle aged


### Elderly

## Diabetes Data and Analysis

### All ages

### Youth

### Middle aged


### Elderly