# Sex differences in Autism Spectrum Disorder, a Comorbidity Pattern Analysis in National Scale Data: (v) PubMed publications for the PheCodes of interest

To check whether the PheCodes that we have found have been previously reported in the literature, we will run a PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) query for each PheCode. The steps followed are:
1. Map from PheCode to MESH term (https://www.ncbi.nlm.nih.gov/mesh) through UMLS.
2. Generate a PubMed query for each PheCode.

### From PheCodes to MESH
From our final results, we previously saved a data.frame that contains each PheCode and its description. 
Then, to map the PheCodes to MESH terms and count the publications, we:
 - Map from PheCode to ICD9-CM (https://phewascatalog.org/files/phecode_icd9_map_unrolled.csv.zip)
 - Query MRCONSO (UMLS) to extract the MESH terms for each ICD9-CM code
 - Query PubMed to extract the number of publications for each phenotype.
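
As a minimal sketch of the first mapping step, assuming a toy mapping table (the name `toyMap` and the PheCode values in it are placeholders for illustration, not taken from the PheWAS catalog file), the ICD9-CM codes of one PheCode can be selected as follows. Reading the codes as character keeps codes such as 299.0 and 299.00 distinct.

```r
# Toy PheCode to ICD9-CM mapping; the PheCode values are placeholders for illustration
toyMap <- read.csv( text = "ICD9,PheCode
299.0,313.3
299.00,313.3
367.2,367.2",
                    colClasses = "character" )

# Select the ICD9-CM codes of one PheCode, kept as character strings
icd9ForPhecode <- toyMap[ toyMap$PheCode == "313.3", "ICD9" ]
print( icd9ForPhecode )
```

If the codes were read as numbers instead, 299.0 and 299.00 would collapse into the same value and could no longer be matched against the `CODE` column in MRCONSO.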


In [None]:
#load the Phecodes of interest 
phecodesForStudy <- read.delim( "phenosForPubmedCheck.txt" )

#load the mapping file (PheCode to ICD9-CM)
phemapFile <- read.csv( "phecode_icd9_rolled.csv" )
phemapFile <- phemapFile[ , c( "ICD9", "PheCode" ) ]

#we generate the SQL query to extract the MESH terms for each ICD9-CM code
for( i in seq_len( nrow( phecodesForStudy ) ) ){
    
    icd9Selection <- as.character( phemapFile[ phemapFile$PheCode == phecodesForStudy$phecode[i], "ICD9" ] )
    
    #skip the PheCodes without ICD9-CM codes, otherwise the IN clause would be empty
    if( length( icd9Selection ) == 0 ){
        print( paste0( phecodesForStudy$description[i], "******* no ICD9-CM codes found" ) )
        next
    }
    
    #quote each ICD9-CM code and separate the codes with commas
    icd9List <- paste0( "'", icd9Selection, "'", collapse = "," )
    
    myquery <- paste0( "SELECT unique CUI, CODE, STR 
                 FROM UMLS.MRCONSO mrc 
                 WHERE SAB = 'MSH' AND CUI IN ( SELECT CUI FROM umls.mrconsO WHERE SAB LIKE 'ICD9CM' AND CODE IN (", 
                       icd9List, "));" )
    
    print( paste0( phecodesForStudy$description[i], "*******", myquery ) )
}

As a result, we generate one SQL query for each PheCode. 
For example, the query for Autism Spectrum Disorder will be:

In [None]:
# SELECT unique CUI, CODE, STR
# FROM 
# UMLS.MRCONSO mrc 
# WHERE 
# SAB = 'MSH' AND 
# CUI IN
# ( 
#   SELECT CUI FROM umls.mrconsO
#   WHERE SAB LIKE 'ICD9CM' AND
#   CODE IN ('299.0', '299.00','299.01', '299.8', '299.80', '299.81', '299.9', '299.90', '299.91')
# );

### PubMed queries

Executing this and the rest of the generated SQL queries produces one file per phenotype of interest, containing the list of MESH terms mapped to it. Using those files as input, we will create and run PubMed queries using the R package **'rentrez'**.
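
The code in the following sections reads each mapping file with `read.delim` and expects the MESH strings in a third column named `STR`. A minimal sketch of that file layout, assuming the SQL result is exported as a tab-separated file with the columns CUI, CODE and STR (the identifiers and the temporary file below are placeholders):

```r
# Placeholder rows standing in for the result of one SQL query
toyResult <- data.frame( CUI  = c( "CUI_1", "CUI_1", "CUI_2" ),
                         CODE = c( "CODE_1", "CODE_1", "CODE_2" ),
                         STR  = c( "Autistic Disorder", "Autism", "autism" ),
                         stringsAsFactors = FALSE )

# Write it as a tab-separated file, one file per phenotype
toyFile <- tempfile( fileext = ".dsv" )
write.table( toyResult, file = toyFile, sep = "\t", quote = FALSE, row.names = FALSE )

# Read it back the same way the notebook does and normalise the terms
phenoMesh <- read.delim( toyFile, header = TRUE, sep = "\t" )
meshTerms <- as.character( unique( tolower( phenoMesh$STR ) ) )
print( meshTerms )
```

Lower-casing before `unique` removes terms that differ only in capitalisation, so each MESH string enters the PubMed query once.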

#### Install and load R libraries

In [None]:
install.packages( "rentrez" )
library( "rentrez" )

In [None]:
#list the available Entrez databases and the searchable fields of the one we will use, pubmed
entrez_dbs()
entrez_db_searchable("pubmed")

#define the path where the files with the MESH terms for the PheCodes of interest are located
pth <- "./"

#### Extract publications supporting the ASD co-occurrence with each one of the phenotypes. 
For all the PubMed queries we will have a common part that will contain:
 - The autism MESH terms
 - The date range of publication 
 - The organism, in our case we are interested in Human research
 - The type of publications

Then, for each phenotype, the variable part of the query contains the list of MESH terms associated with that phenotype. 
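
As a rough sketch of how the fixed and variable parts fit together, the terms of each group can be joined with OR and the groups combined with AND. The helper `buildMeshClause` is a hypothetical name introduced only for this illustration, and the query is shortened to the organism filter plus the two term lists.

```r
# Hypothetical helper: join MeSH terms into one parenthesised OR clause
buildMeshClause <- function( meshTerms ){
    paste0( "(", paste0( meshTerms, "[MeSH Terms]", collapse = " OR " ), ")" )
}

asdClause   <- buildMeshClause( c( "autistic disorder", "autism" ) )
phenoClause <- buildMeshClause( "astigmatism" )

# Combine the fixed ASD part with the phenotype-specific part
exampleQuery <- paste( "humans[MeSH Terms]", asdClause, phenoClause, sep = " AND " )
print( exampleQuery )
```

Because `paste0` with `collapse` handles one term and many terms in the same way, no special case is needed for phenotypes that map to a single MESH term.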

In [None]:
#load the ASD MESH terms
autismMesh <- read.delim( "autism.dsv", header = TRUE, sep = "\t" )
autismMesh <-  as.character( unique( tolower(autismMesh$STR ) ) ) 

#define the fixed part of the query
commonPart <- "humans[MeSH Terms] AND (\"2009\"[Date - Publication] : \"3000\"[Date - Publication]) 
               AND ( Classical Article[Publication Type] OR Clinical Study[Publication Type] OR 
               Comparative Study[Publication Type] OR Randomized Controlled Trial[Publication Type] OR 
               Observational Study[Publication Type] OR Journal Article[Publication Type]) AND ("

#add the list of ASD MESH terms to the query, joined with OR
#(collapse handles one or many terms, so a single term is not repeated)
queryFirstPart <- paste0( commonPart, 
                          paste0( autismMesh, "[MeSH Terms]", collapse = " OR " ), 
                          " ) " )

#list the mapping files located in the path defined above
myfiles <- list.files( pth )

#generate an empty data frame to fill with the results 
mydfPublicationsASD <- as.data.frame( matrix( ncol = 5, nrow = length( myfiles ) ) )
colnames( mydfPublicationsASD ) <- c("phenotype", 
                                  "publications", 
                                  "MeSH term mapped to the phenotype", 
                                  "query", 
                                  "timeQuery" )

#for each phenotype, complete the query with the specific MESH terms
for( cont in seq_along( myfiles ) ){
    
    print( cont )
    phenoMesh <- read.delim( paste0( pth, myfiles[ cont ] ), header = TRUE, sep = "\t" )
    
    #files whose third column is not "STR" are not mapping files: print the name and skip them
    if( !isTRUE( colnames( phenoMesh )[3] == "STR" ) ){
        print( myfiles[ cont ] )
        next
    }
    
    phenoMesh <- as.character( unique( tolower( phenoMesh$STR ) ) )
    
    #phenotypes without any MESH term mapped cannot be queried
    if( length( phenoMesh ) == 0 ){
        mydfPublicationsASD[cont,] <- c( myfiles[ cont ], " ", length( phenoMesh ), " ", " " )
        next
    }
    
    #join the MESH terms of the phenotype with OR and append them to the fixed part
    querySecondPart <- paste0( " AND (", paste0( phenoMesh, "[MeSH Terms]", collapse = " OR " ), " ) " )
    finalQuery <- paste0( queryFirstPart, querySecondPart )
    
    #sometimes when the query is too long we can get some errors, so we will print the query to run it directly in 
    #the PubMed web page
    #otherwise we run the query from R applying the function "entrez_search"
    if( length( phenoMesh ) > 17 ){
        
        mydfPublicationsASD[cont,] <- c( myfiles[ cont ], 
                                         "TooLarge", 
                                         length( phenoMesh ), 
                                         finalQuery, 
                                         as.character( Sys.time() ) )
        print("###########################")
        print( finalQuery )
        print("###########################")
    }else{
        
        r_search <- entrez_search( db = "pubmed", term = finalQuery, retmax = 100, use_history = TRUE )
        mydfPublicationsASD[cont,] <- c( myfiles[ cont ], 
                                         r_search$count, 
                                         length( phenoMesh ), 
                                         finalQuery, 
                                         as.character( Sys.time() ) )
    }
}

colnames( mydfPublicationsASD ) <- c( "phenotype", "publicationsASD", 
                                     "MESHmapped", "queryASD", "timeQueryASD" )

#### Extract publications supporting the ASD co-occurrence with each one of the phenotypes and sex differences. 
For all the PubMed queries we will have a common part that will contain:
 - The autism MESH terms
 - The date range of publication 
 - The organism, in our case we are interested in Human research
 - The type of publications
 - The MESH terms that define "SEX DIFFERENCES"

Then, for each phenotype, the variable part of the query contains the list of MESH terms associated with that phenotype. 

In [None]:
#load the ASD MESH terms
autismMesh <- read.delim( "autism.dsv", header = TRUE, sep = "\t" )
autismMesh <-  as.character( unique( tolower(autismMesh$STR ) ) ) 

#define the fixed part of the query

commonPart <- "humans[MeSH Terms] AND (\"2009\"[Date - Publication] : \"3000\"[Date - Publication]) 
               AND (sex difference[MeSH Terms] OR Sex Factors[MeSH Terms] OR Sex[MeSH Terms] OR 
               Sex Characteristics[MeSH Terms]) AND ( Classical Article[Publication Type] OR 
               Clinical Study[Publication Type] OR Comparative Study[Publication Type] OR 
               Randomized Controlled Trial[Publication Type] OR Observational Study[Publication Type] OR 
               Journal Article[Publication Type]) AND ("


#add the list of ASD MESH terms to the query, joined with OR
#(collapse handles one or many terms, so a single term is not repeated)
queryFirstPart <- paste0( commonPart, 
                          paste0( autismMesh, "[MeSH Terms]", collapse = " OR " ), 
                          " ) " )

#list the mapping files located in the path defined above
myfiles <- list.files( pth )

#generate an empty data frame to fill with the results 
mydfPublicationsSexDiff <- as.data.frame( matrix( ncol = 5, nrow = length( myfiles ) ) )
colnames( mydfPublicationsSexDiff ) <- c("phenotype", 
                                  "publications", 
                                  "MeSH term mapped to the phenotype", 
                                  "query", 
                                  "timeQuery" )

#for each phenotype, complete the query with the specific MESH terms
for( cont in seq_along( myfiles ) ){
    
    print( cont )
    phenoMesh <- read.delim( paste0( pth, myfiles[ cont ] ), header = TRUE, sep = "\t" )
    
    #files whose third column is not "STR" are not mapping files: print the name and skip them
    if( !isTRUE( colnames( phenoMesh )[3] == "STR" ) ){
        print( myfiles[ cont ] )
        next
    }
    
    phenoMesh <- as.character( unique( tolower( phenoMesh$STR ) ) )
    
    #phenotypes without any MESH term mapped cannot be queried
    if( length( phenoMesh ) == 0 ){
        mydfPublicationsSexDiff[cont,] <- c( myfiles[ cont ], " ", length( phenoMesh ), " ", " " )
        next
    }
    
    #join the MESH terms of the phenotype with OR and append them to the fixed part
    querySecondPart <- paste0( " AND (", paste0( phenoMesh, "[MeSH Terms]", collapse = " OR " ), " ) " )
    finalQuery <- paste0( queryFirstPart, querySecondPart )
    
    #sometimes when the query is too long we can get some errors, so we will print the query to run it directly in 
    #the PubMed web page
    #otherwise we run the query from R applying the function "entrez_search"
    if( length( phenoMesh ) > 17 ){
        
        mydfPublicationsSexDiff[cont,] <- c( myfiles[ cont ], 
                                             "TooLarge", 
                                             length( phenoMesh ), 
                                             finalQuery, 
                                             as.character( Sys.time() ) )
        print("###########################")
        print( finalQuery )
        print("###########################")
    }else{
        
        r_search <- entrez_search( db = "pubmed", term = finalQuery, retmax = 100, use_history = TRUE )
        mydfPublicationsSexDiff[cont,] <- c( myfiles[ cont ], 
                                             r_search$count, 
                                             length( phenoMesh ), 
                                             finalQuery, 
                                             as.character( Sys.time() ) )
    }
}

colnames( mydfPublicationsSexDiff) <- c("phenotype", "publicationsSexDiff", 
                                        "MESHmapped", "querySexDiff", "timeQuerySexDiff" )

#### Combine both results in one table
As a result, we generate a table *(Supplementary table 3)* that contains, for each phenotype, the number of publications supporting ASD co-occurrence and the number of publications supporting both ASD co-occurrence and sex differences. 

Additionally, we also save the date and time at which each query was run, the number of MESH terms that mapped to each PheCode, and the specific PubMed query.

An example PubMed query would be: 

*humans[MeSH Terms] AND ("2009"[Date - Publication] : "3000"[Date - Publication]) AND ( Classical Article[Publication Type] OR Clinical Study[Publication Type] OR Comparative Study[Publication Type] OR Randomized Controlled Trial[Publication Type] OR Observational Study[Publication Type] OR Journal Article[Publication Type]) AND (infantile autism, early[MeSH Terms] OR autism[MeSH Terms] OR autism, infantile[MeSH Terms] OR autistic disorder[MeSH Terms] OR autism, early infantile[MeSH Terms] OR kanners syndrome[MeSH Terms] OR pervasive development disorders[MeSH Terms] OR early infantile autism[MeSH Terms] OR infantile autism[MeSH Terms] OR disorders, autistic[MeSH Terms] OR disorder, autistic[MeSH Terms] OR kanner's syndrome[MeSH Terms] OR kanner syndrome[MeSH Terms] )  AND (astigmatism[MeSH Terms])*
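
The two result tables share the `MESHmapped` column in addition to `phenotype`, so merging by `phenotype` alone keeps both copies. A small sketch with toy data frames (the file names and counts are placeholders, and only three of the five columns are reproduced) shows the resulting column names:

```r
# Toy versions of the two result tables, with placeholder counts
toyASD <- data.frame( phenotype       = c( "a.dsv", "b.dsv" ),
                      publicationsASD = c( "10", "3" ),
                      MESHmapped      = c( "2", "1" ),
                      stringsAsFactors = FALSE )
toySexDiff <- data.frame( phenotype           = c( "a.dsv", "b.dsv" ),
                          publicationsSexDiff = c( "1", "0" ),
                          MESHmapped          = c( "2", "1" ),
                          stringsAsFactors = FALSE )

# Merge by phenotype only, as in the notebook
toyMerged <- merge( toyASD, toySexDiff, by = "phenotype" )
print( colnames( toyMerged ) )
```

`merge` appends the suffixes `.x` and `.y` to the duplicated column, so the output table contains `MESHmapped.x` and `MESHmapped.y` holding the same values.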

In [None]:
finalPublicationTable <- merge( mydfPublicationsASD, mydfPublicationsSexDiff, by="phenotype" )


write.table( finalPublicationTable, 
             file      = "pubmedCheckOutput.txt", 
             col.names = TRUE,
             row.names = FALSE,
             quote     = FALSE, 
             sep       = "\t" )
