
## 1. Clean dataset


### Attach STATA

In [18]:
pip install ipystata



In [2]:
import ipystata
from ipystata.config import config_stata
# Raw string so the backslashes in the Windows path are not read as escapes
config_stata(r'C:\Program Files (x86)\Stata15\StataSE-64.exe')

IPyStata is loaded in batch mode.
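If a `%%stata` cell reports 'Failed to open Stata', one thing worth ruling out is a malformed or missing executable path. The helper below is a minimal sketch for that check; `check_stata_executable` is a hypothetical name, not part of IPyStata, and it only inspects the path string and the file system.

```python
from pathlib import Path

# Characters that Windows does not allow inside file or folder names.
INVALID_PATH_CHARS = set('<>"|?*')


def check_stata_executable(path_str):
    """Return a short diagnostic for a candidate Stata executable path.

    Hypothetical helper for illustration; it does not start Stata.
    """
    bad = sorted(c for c in path_str if c in INVALID_PATH_CHARS)
    if bad:
        return "invalid character(s) in path: " + " ".join(bad)
    if not Path(path_str).is_file():
        return "file not found"
    return "ok"
```

Calling it on the path before passing that path to `config_stata` gives a quicker signal than waiting for a cell to fail.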


### Clean

EPIC-26 was previously scored in STATA, so we call STATA from IPython and reuse that scoring code here.

In [8]:
%%stata
display "Hello, I am printed in Stata."  

'Failed to open Stata'

In [13]:
%%stata

clear all
set more off
cap log close
global logpath `"C:/Users/jing.jiang1/2019 Lab Work/epic26 methodology apcari/log"'

'Failed to open Stata'

In [0]:
%%stata

global rawpath `"C:\Users\jing.jiang1\2019 Lab Work\epic26 methodology apcari\data raw"'
global cleanpath `"C:\Users\jing.jiang1\2019 Lab Work\epic26 methodology apcari\data clean"'

In [8]:
%%stata

cd "$logpath"
*epic26 methodology apcari Clean do file
*jing jiang
*20200205
*Part 1. Create a codebook for demographics
*Part 2. Codebook for epic26
*Part 3. Score epic26
***********************************
*1. Create a codebook for demographics*
***********************************

log using cdb20200205.log, replace
cd "$rawpath"
use apcari_full,clear

bysort internal_id:replace ic_000b= ic_000b[_n-1] if missing(ic_000b)
keep if ic_000b==1 //keep consent patients

codebook in_*, compact

**********************************
*2. Codebook for epic26 only 
**********************************

*keep useful vars
keep internal_id redcap_event_name ic_* in_* ep_*
keep if redcap_event_name=="baseline_arm_1" |redcap_event_name=="12_month_followup_arm_1"
*check at which events the epic26 questionnaire was administered
bysort internal_id: tab redcap_event_name ep_001 //baseline and every 12 months


**********************************
*3. Score epic26
**********************************

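*subgroup: 1 = baseline, 2 = 12-month follow-up
gen subgroup=.
replace subgroup = 1 if redcap_event_name == "baseline_arm_1"
replace subgroup = 2 if redcap_event_name == "12_month_followup_arm_1"

*drop patients who did not finish epic26 questionnaire
drop if subgroup==2 & ep_001==.
*sort within patient so the last record is the follow-up whenever one exists
bysort internal_id (subgroup): drop if redcap_event_name[_N]=="baseline_arm_1"
sort internal_id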

*score epic26
***************************************
*EPIC-26 scores using standardized value
***************************************
*recode missing-value codes (-55 and below) to Stata missing
foreach v of varlist ep_002-ep_027{
    replace `v'=. if `v' <= -55
}

*Count item missing data in each subgroup
foreach v of varlist ep_002-ep_027{
    tab `v' subgroup, missing
}
{
*calculate the missing data rate for each domain
foreach v of varlist ep_002-ep_027{
gen ms`v'=.
replace ms`v'=1 if `v'==.
replace ms`v'=0 if `v'!=.
}

gen msdm1=(msep_002+msep_003+msep_004+msep_005+msep_010)
tab msdm1
//miss:value=1; nonmiss: value=0 ==># of missing value for each ob is the sum
//the overall percentage of missing value is the mean of msdm1 among all obs
gen msdm2=(msep_006+msep_007+msep_008+msep_009)
tab msdm2

gen msdm3=(msep_011+msep_012+msep_013+msep_014+msep_015+msep_016)
tab msdm3

gen msdm4=(msep_017+msep_018+msep_019+msep_020+msep_021+msep_022)
tab msdm4

gen msdm5=(msep_023+msep_024+msep_025+msep_026+msep_027)
tab msdm5
//for every patient, every subgroup, there is a msdmi
}

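*Scoring each domain of EPIC-26
*note: a missing item makes the whole sum missing, so a domain score is
*only produced when every item in the domain is answered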
forvalues i=1/5{
gen ep_score_dm`i' = .
}

replace ep_score_dm1 = 1/5*(25*(ep_002 -1)+round((ep_003 -1)*33.3)+ ///
round(100*(3-ep_004)/3)+25*(4-ep_005)+100-25*(ep_010 -1)) if msdm1< 1
sum ep_score_dm1

replace ep_score_dm2 = 1/4*(25*(4-ep_006)+25*(4-ep_007)+25*(4-ep_008)+25*(4-ep_009)) ///
if msdm2==0
sum ep_score_dm2

replace ep_score_dm3 = 1/6*(25*(4-ep_011)+25*(4-ep_012)+ 25*(4-ep_013)+ 25*(4-ep_014)+ ///
25*(4-ep_015)+ 100-25*(ep_016 -1)) if msdm3< 1.2
sum ep_score_dm3

replace ep_score_dm4 = 1/6*(25*(ep_017 -1)+ 25*(ep_018 -1)+ round((ep_019 -1)*33.3)+ ///
25*(ep_020 -1)+ 25*(ep_021 -1)+ 100-25*(ep_022 -1)) if msdm4<1.2
sum ep_score_dm4

replace ep_score_dm5 = 1/5*(25*(4-ep_023)+ 25*(4-ep_024)+ 25*(4-ep_025)+ ///
25*(4-ep_026)+ 25*(4-ep_027)  ) if msdm5<1
sum ep_score_dm5

*convert original values to standardized values	   
foreach v of varlist ep_002 ep_017 ep_018 ep_020 ep_021{

gen sco`v'=25*(`v' -1)
}

foreach v of varlist ep_005-ep_009 ep_011-ep_015 ep_023-ep_027{
gen sco`v'=25*(4-`v')
}

foreach v of varlist ep_010 ep_016 ep_022{
gen sco`v'=100-25*(`v' -1)
}

foreach v of varlist ep_003 ep_019{
gen sco`v'= round((`v' -1)*33.3)
}

gen scoep_004= round(100*(3-ep_004)/3)


cd "$cleanpath"
save epic26_clean.dta,replace	 
log close
outsheet using datacleaned.xls,replace nol

'Failed to open Stata'

If STATA fails to open from IPython, copy the code from the cells above and run it as a STATA do-file.
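For readers without STATA, the item standardization and domain averaging in the scoring cell can be sketched in plain Python. This is an illustration under stated assumptions, not a replacement for the do-file: the names `standardize` and `domain_score` are hypothetical, the item-to-formula mapping is transcribed from the STATA code above, and a domain is scored only when every item is present, which mirrors how STATA propagates missing values through the sum.

```python
def _items(first, last):
    """Build item names such as ep_006 ... ep_009."""
    return [f"ep_{i:03d}" for i in range(first, last + 1)]


# Item groups, transcribed from the Stata scoring code (assumed complete).
ASCENDING_1_5 = {"ep_002", "ep_017", "ep_018", "ep_020", "ep_021"}  # 25*(v-1)
REVERSED_0_4 = set(_items(5, 9) + _items(11, 15) + _items(23, 27))  # 25*(4-v)
REVERSED_1_5 = {"ep_010", "ep_016", "ep_022"}                       # 100-25*(v-1)
ASCENDING_1_4 = {"ep_003", "ep_019"}                                # round((v-1)*33.3)

DOMAINS = {
    "dm1": ["ep_002", "ep_003", "ep_004", "ep_005", "ep_010"],
    "dm2": _items(6, 9),
    "dm3": _items(11, 16),
    "dm4": _items(17, 22),
    "dm5": _items(23, 27),
}


def standardize(item, value):
    """Map a raw item response to the 0-100 scale; None means missing."""
    if value is None or value <= -55:  # codes of -55 and below are missing
        return None
    if item in ASCENDING_1_5:
        return 25 * (value - 1)
    if item in REVERSED_0_4:
        return 25 * (4 - value)
    if item in REVERSED_1_5:
        return 100 - 25 * (value - 1)
    if item in ASCENDING_1_4:
        return round((value - 1) * 33.3)
    if item == "ep_004":
        return round(100 * (3 - value) / 3)
    raise KeyError(f"no scoring rule for {item}")


def domain_score(record, domain):
    """Mean of the standardized items, or None if any item is missing."""
    scores = [standardize(item, record.get(item)) for item in DOMAINS[domain]]
    if any(s is None for s in scores):
        return None
    return sum(scores) / len(scores)


# Made-up responses for one patient, for illustration only.
example = {"ep_002": 5, "ep_003": 4, "ep_004": 0, "ep_005": 0, "ep_010": 1}
print(domain_score(example, "dm1"))  # 100.0
```

Keeping the mapping in lookup sets rather than in one long expression per domain makes it easier to compare against the `sco*` variables the do-file generates.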

## 2. Differential Item Functioning (R)

Reference:


Method 1: 
According to the literature above, we will assess DIF across: cancer type, age group, body weight status, comorbidity...

Method 2:
A series of comparison models is formed by freeing one item at a time, and each is tested for DIF by comparing the resulting change in $G^2$ against a critical chi-square value. This approach is commonly referred to as the constrained-baseline or all-other method.
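The comparison in Method 2 is a likelihood-ratio test, which can be sketched as follows. This is a minimal illustration, assuming the log-likelihoods of the constrained baseline model and of the model with one item freed are already available from the fitted IRT models; `chi2_sf` and `lr_dif_test` are hypothetical helper names, and the numbers in the usage lines are made up.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite correction series.
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    tail = math.erfc(math.sqrt(half))
    return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total


def lr_dif_test(loglik_constrained, loglik_freed, df, alpha=0.05):
    """Change in G^2 between the two models, its p-value, and a DIF flag.

    df is the number of item parameters freed in the comparison model.
    """
    g2_change = -2.0 * (loglik_constrained - loglik_freed)
    p_value = chi2_sf(g2_change, df)
    return g2_change, p_value, p_value < alpha


# Hypothetical log-likelihoods for one item, for illustration only.
g2_change, p_value, flagged = lr_dif_test(-1000.0, -995.0, df=2)
print(g2_change, flagged)
```

Because one test is run per item, the significance level passed as `alpha` would normally be adjusted for multiple comparisons; which adjustment to use is left open here.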

### Attach R

In [1]:
pip install rpy2



In [0]:
%load_ext rpy2.ipython

In [0]:
import rpy2.rinterface

In [0]:
%%R -i python_df

### DIF

## 3. Export tables