# Working with Autism Data (Project Outline) 

## Introduction

This project aims to determine whether parental age at the child's birth and demographic factors are linked to autism severity in children who have already been diagnosed.

The datasets used in this project were provided by the Early Childhood Partial Hospitalization Program (ECPHP) at the Resnick Psychiatric Institute at UCLA. ECPHP is a 10-week behavioral and psychiatric intervention program for preschool-aged children who have autism and other disorders that make it difficult for them to attend school with neurotypical children.

According to the 5th edition of the American Psychiatric Association's Diagnostic and Statistical Manual (DSM-5), the criteria for autism diagnosis are impairments in social communication and interaction as well as restricted, repetitive patterns of behavior.

Data are collected before and after preschool-aged children are admitted into the program. The children undergo a diagnostic test called the Autism Diagnostic Observation Schedule (ADOS), a semi-structured observational assessment for autism and autism spectrum disorder (ASD). The ADOS dataset includes four subtotals, A, B, C, and D, which are as follows:

- A = communication total
- B = communication & social interaction total
- C = play total
- D = stereotyped behaviors and restricted interests total

The higher the ADOS totals, the greater the likelihood of ASD and the greater the severity of autism. The ADOS dataset also includes the date the test was administered, which will be used as a proxy for the date of diagnosis.

Once admitted into ECPHP, parents fill out several questionnaires, one of which is a demographic information sheet. The demographic information includes mother's and father's date of birth, child's date of birth, parents' employment status, and socioeconomic status (measured via proxy of mother's highest level of education (MHLE)). 

The datasets come in three parts: the ADOSII (the most recent ADOS version), the ADOSI (the older version), and the demographic info sheet. The aim of this project is to see which factors correlate most strongly with autism severity. The data will be explored with linear regression, histograms, and simple scatter plots, and the R values of the linear regressions will be compared. Possible data visualizations:

- Father's age at child's birth (from father's DOB) vs. all four autism scores
    - research has indicated that the father's age may be a significant factor in the development of autism (Sandin)
- Mother's age at child's birth (from mother's DOB) vs. all four autism scores
    - for the same reason as above, but the mother's age is likely not as significant a factor (Sandin)
- MHLE vs. age of diagnosis (Does socioeconomic status influence how early a child receives a diagnosis?)
    - higher socioeconomic status is linked to earlier diagnosis (Thomas)
- Age of diagnosis vs. all four autism scores (Does severity influence age of diagnosis?)
    - earlier diagnosis results in a better prognosis (Fernell)
- Difference in age of parents vs. all four autism scores
    - a greater difference in the parents' ages is linked to a higher chance of autism (Sandin)
- Gender vs. age of diagnosis
    - females are diagnosed at much later ages than males (Kemp)

There is no identifiable patient information present in the datasets aside from ID numbers (this is only so that multiple datasets may be combined).

_________________________________________________________________________________________________________________________

## Cleaning the Datasets 

The datasets were provided as Excel files and were saved as CSV files.

The files were joined on their common field (RID#) using the `join` command in the shell. Executed code:

```shell
join --nocheck-order -t"," -a 1 -a 2 -e "NULL" -o "0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,2.12,2.13,2.14" ADOS.csv demo.csv > ADOSandDemo.csv
```
- `join` merges two files on a shared key field (by default the first field of each file)
- `--nocheck-order` skips the check that both inputs are sorted on the join field (the files were already sorted)
- `-t","` sets the field delimiter to a comma
- `-a 1` and `-a 2` also print the lines of the first and second file that have no match in the other file
- `-e "NULL"` fills output fields that are missing in the input with "NULL"
- `-o` lists the output fields in order: 0 (the join key), 1.1 (first field of the first file), 2.2 (second field of the second file), etc.; `-e` applies to the fields named here
- the two file names at the end are the input files
- `>` writes the output to a new file
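To make the effect of `-a`, `-e`, and `-o` concrete, the same kind of full outer join can be sketched in Python. This is only an illustration under assumptions, not the command that was run: `outer_join` and its parameters are hypothetical names, the key is assumed to be the first column of both tables, each key is assumed to appear at most once per table, and empty cells are assumed to be filled the same way as missing ones.

```python
def outer_join(left_rows, right_rows, left_width, right_width, fill="NULL"):
    """Full outer join of two tables on their first column (hypothetical helper).

    left_width and right_width are the number of non-key columns to emit
    from each side, playing the role of the field list given to -o.
    """
    left = {row[0]: row[1:] for row in left_rows}
    right = {row[0]: row[1:] for row in right_rows}

    def pad(fields, width):
        # Pad missing fields and replace empty cells (the role of -e "NULL").
        padded = (fields + [fill] * width)[:width]
        return [cell if cell != "" else fill for cell in padded]

    joined = []
    # Taking the union of keys keeps unmatched rows from both sides (-a 1 -a 2).
    for key in sorted(set(left) | set(right)):
        joined.append([key]
                      + pad(left.get(key, []), left_width)
                      + pad(right.get(key, []), right_width))
    return joined


# Made-up rows: key "1" exists only on the left, key "3" only on the right.
ados = [["1", "12", ""], ["2", "9", "4"]]
demo = [["2", "1980/01/01"], ["3", "1975/06/30"]]
for row in outer_join(ados, demo, left_width=2, right_width=1):
    print(",".join(row))
```

Unlike the shell command, this sketch emits the key only once; the `-o` list above also repeats it as field 1.1.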

Afterwards, fields that will not be used in the analysis were deleted for simplicity. Executed code:

```shell
cut -f 2,3,5,10,13,14,17,18,21,22 -d "," --complement ADOSandDemo.csv > ADOSandDemo2.csv #complement does the inverse selection since cut KEEPS the fields specified
```
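The inverse selection done by `cut --complement` (a GNU coreutils option) can be checked with a few lines of Python. A minimal sketch, assuming rows are already split into lists; `drop_columns` is a hypothetical helper, and the sample row simply numbers the 23 fields named in the `-o` list of the `join` command.

```python
def drop_columns(rows, drop_1_based):
    """Keep every column except the 1-based positions listed in drop_1_based."""
    drop = {position - 1 for position in drop_1_based}  # convert to 0-based
    return [[cell for index, cell in enumerate(row) if index not in drop]
            for row in rows]


# One made-up row whose cells are just their own 1-based field numbers.
row = [str(number) for number in range(1, 24)]
kept = drop_columns([row], [2, 3, 5, 10, 13, 14, 17, 18, 21, 22])[0]
print(len(kept), kept)
```

Thirteen fields remain, so the 0-based indices used later in the Python functions (for example `item[6]` and `item[11]`) refer to positions in this reduced file, not in the original join output.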


## Father's Age When Couple Had Child (*with* Output)

```python
# datetime is needed to turn the date strings into ages
from datetime import datetime


def ageoffather(filename):
    """Append a Father_birth_age column and write the result to output1.csv."""
    # read the whole input file, then close it
    with open(filename, "r") as fobj:
        data = fobj.readlines()
    # write every input line plus the new column to the output file
    with open("output1.csv", "w") as fobj:
        fobj.write(data[0].strip() + ",Father_birth_age\n")  # header with new column name
        for line in data[1:]:  # skip the header
            item = line.strip().split(",")  # delimiter is a comma
            fatherDOB = item[11]  # column holding the father's DOB
            childDOB = item[6]  # column holding the child's DOB
            if fatherDOB != "NULL" and childDOB != "NULL":  # both dates must be present
                F = datetime.strptime(fatherDOB, "%Y/%m/%d")  # parse date
                C = datetime.strptime(childDOB, "%Y/%m/%d")  # parse date
                delta = C - F  # father's age at the child's birth
                age = delta.days / 365.0  # approximate age in years (leap days ignored)
                fobj.write(line.strip() + "," + str(age) + "\n")
                print(age)
            else:
                # either date is missing: keep the line and write NULL as the age
                fobj.write(line.strip() + ",NULL\n")
                print("NULL")
# ages are not rounded because they will be plotted and accuracy is preferred


ageoffather("ADOSandDemo2.csv")
```

Output, with consecutive `NULL` rows collapsed into a count:

```
NULL (×56)
39.90958904109589
NULL (×58)
39.156164383561645
NULL
40.42739726027397
NULL
46.728767123287675
NULL (×3)
34.605479452054794
NULL
30.81917808219178
37.657534246575345
36.26027397260274
31.41917808219178
31.904109589041095
36.487671232876714
37.09315068493151
35.31232876712329
NULL
42.06301369863014
32.945205479452056
33.90684931506849
37.23561643835617
32.04383561643836
36.035616438356165
40.21369863013699
34.51780821917808
41.780821
```
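The age calculation divides a day count by 365.0, which ignores leap days and so slightly overstates each age (by roughly ten days over a 40-year span). A minimal sketch of the arithmetic, comparing that approximation with a calendar-based count of completed years; the function names and the dates are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime

DATE_FORMAT = "%Y/%m/%d"  # same format string as the functions in this outline


def approx_years(start, end):
    """Days between two dates divided by 365.0 (the approach used above)."""
    delta = datetime.strptime(end, DATE_FORMAT) - datetime.strptime(start, DATE_FORMAT)
    return delta.days / 365.0


def completed_years(start, end):
    """Whole calendar years elapsed between two dates."""
    s = datetime.strptime(start, DATE_FORMAT)
    e = datetime.strptime(end, DATE_FORMAT)
    # Subtract one year if the anniversary has not yet been reached in the end year.
    return e.year - s.year - ((e.month, e.day) < (s.month, s.day))


# Exactly 40 calendar years apart; the span contains 10 leap days.
print(approx_years("1970/06/15", "2010/06/15"))
print(completed_years("1970/06/15", "2010/06/15"))
```

For plotting, the small upward bias of the 365.0 approximation is unlikely to matter; dividing by 365.25 would be one simple alternative if it did.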

## Mother's Age When She Gave Birth to Child (*with* Output)

```python
# datetime is needed to turn the date strings into ages
from datetime import datetime


def ageofmother(filename):
    """Append a Mother_birth_age column and write the result to output2.csv."""
    # read the whole input file, then close it
    with open(filename, "r") as fobj:
        data = fobj.readlines()
    # write every input line plus the new column to the output file
    with open("output2.csv", "w") as fobj:
        fobj.write(data[0].strip() + ",Mother_birth_age\n")  # header with new column name
        for line in data[1:]:  # skip the header
            item = line.strip().split(",")  # delimiter is a comma
            motherDOB = item[9]  # column holding the mother's DOB
            childDOB = item[6]  # column holding the child's DOB
            if motherDOB != "NULL" and childDOB != "NULL":  # both dates must be present
                M = datetime.strptime(motherDOB, "%Y/%m/%d")  # parse date
                C = datetime.strptime(childDOB, "%Y/%m/%d")  # parse date
                delta = C - M  # mother's age at the child's birth
                age = delta.days / 365.0  # approximate age in years (leap days ignored)
                fobj.write(line.strip() + "," + str(age) + "\n")
                print(age)
            else:
                # either date is missing: keep the line and write NULL as the age
                fobj.write(line.strip() + ",NULL\n")
                print("NULL")
# ages are not rounded because they will be plotted and accuracy is preferred


# the output of ageoffather is the input here, so both age columns end up in output2.csv
ageofmother("output1.csv")
```

Output, with consecutive `NULL` rows collapsed into a count:

```
NULL (×56)
39.92328767123288
NULL (×58)
32.082191780821915
NULL
37.4986301369863
NULL
35.71232876712329
NULL (×3)
32.367123287671234
NULL
40.035616438356165
25.40821917808219
27.235616438356164
33.536986301369865
33.64931506849315
26.495890410958904
27.742465753424657
32.347945205479455
44.87945205479452
33.85205479452055
32.68219178082192
36.38356164383562
29.2986301369863
28.506849315068493
37.106849315068494
38.16712328767123
33.76438356164
```

## Age of First Diagnosis (*with* Output)

```python
# datetime is needed to turn the date strings into ages
from datetime import datetime


def diagnosisage(filename):
    """Append a Diagnosis_age column and write the result to output3.csv."""
    # read the whole input file, then close it
    with open(filename, "r") as fobj:
        data = fobj.readlines()
    # write every input line plus the new column to the output file
    with open("output3.csv", "w") as fobj:
        fobj.write(data[0].strip() + ",Diagnosis_age\n")  # header with new column name
        for line in data[1:]:  # skip the header
            item = line.strip().split(",")  # delimiter is a comma
            ADOSdate = item[1]  # column holding the ADOS administration date
            childDOB = item[6]  # column holding the child's DOB
            if ADOSdate != "NULL" and childDOB != "NULL":  # both dates must be present
                A = datetime.strptime(ADOSdate, "%Y/%m/%d")  # parse date
                C = datetime.strptime(childDOB, "%Y/%m/%d")  # parse date
                delta = A - C  # child's age when the ADOS was administered
                age = delta.days / 365.0  # approximate age in years (leap days ignored)
                fobj.write(line.strip() + "," + str(age) + "\n")
                print(age)
            else:
                # either date is missing: keep the line and write NULL as the age
                fobj.write(line.strip() + ",NULL\n")
                print("NULL")
# ages are not rounded because they will be plotted and accuracy is preferred


# the output of ageofmother is the input here, so output3.csv holds all three new columns
diagnosisage("output2.csv")
```

Output, with consecutive `NULL` rows collapsed into a count:

```
NULL (×32)
7.8493150684931505
4.947945205479452
NULL (×4)
5.438356164383562
NULL
5.471232876712329
NULL
8.753424657534246
NULL (×4)
6.156164383561644
NULL (×12)
6.019178082191781
NULL (×15)
5.178082191780822
NULL (×9)
3.03013698630137
NULL (×6)
4.624657534246575
NULL
4.504109589041096
NULL (×8)
3.263013698630137
NULL (×3)
6.372602739726028
NULL (×2)
4.446575342465754
NULL (×10)
3.663013698630137
NULL (×15)
4.654794520547945
NULL (×12)
2.8575342465753426
NULL (×4)
```
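The three functions above differ only in which two columns they subtract and in the name of the new column, so they could be folded into one helper. A minimal sketch under assumptions: `add_age_column` is a hypothetical name, it works on a list of CSV lines rather than on files so that it can be tried without the dataset, and the sample lines are made up.

```python
from datetime import datetime


def add_age_column(lines, start_col, end_col, column_name, fmt="%Y/%m/%d"):
    """Return CSV lines with one extra column: years from start_col to end_col."""
    out = [lines[0].strip() + "," + column_name]  # header plus the new column name
    for line in lines[1:]:
        fields = line.strip().split(",")
        start, end = fields[start_col], fields[end_col]
        if start != "NULL" and end != "NULL":
            delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
            value = str(delta.days / 365.0)  # same approximation as above
        else:
            value = "NULL"  # either date is missing
        out.append(line.strip() + "," + value)
    return out


# Made-up 13-column lines using the same column positions as the functions above.
lines = [
    "id,ados_date,c2,c3,c4,c5,child_dob,c7,c8,mother_dob,c10,father_dob,c12",
    "1,2016/01/01,x,x,x,x,2012/01/01,x,x,1980/01/01,x,1975/01/01,x",
    "2,NULL,x,x,x,x,2011/05/05,x,x,NULL,x,NULL,x",
]
step1 = add_age_column(lines, 11, 6, "Father_birth_age")
step2 = add_age_column(step1, 9, 6, "Mother_birth_age")
step3 = add_age_column(step2, 6, 1, "Diagnosis_age")
for line in step3:
    print(line)
```

Chaining the calls mirrors how `output1.csv` feeds `ageofmother` and `output2.csv` feeds `diagnosisage`; because new columns are appended at the end, the original column indices stay valid at every step.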

## Working in RStudio

Linear Regression Plots (example of Age of Father vs. ADOS Total A). Working code:

```R
# read the final CSV; the string "NULL" marks missing values
mydata <- read.csv("/home/eeb177-student/Desktop/eeb-177-final-project/output3.csv", na.strings = "NULL")
options(show.signif.stars = FALSE) # turn off significance stars before printing summaries
lm.out <- lm(ATotal ~ Father_birth_age, data = mydata) # regress ADOS total A on age of father; rows with NA are dropped
summary(lm.out) # coefficients, residual summary, R^2, etc.
anova(lm.out) # ANOVA table
plot(ATotal ~ Father_birth_age, data = mydata, main = "Age of Father vs ADOS Total A") # scatter plot with title
abline(lm.out, col = "red") # add the fitted regression line in red
```
![Age of Father vs ADOS Total A](ageoffathervsadostotalA.png)
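To show what the regression reports, the slope, intercept, and R value of a one-predictor fit can be computed by hand. A minimal Python sketch of ordinary least squares, assuming missing values are represented as `None` and skipped; `linear_fit` is a hypothetical helper and the numbers are made up, not taken from the dataset.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares fit y = intercept + slope * x, plus Pearson r."""
    # Drop any pair with a missing value, as a regression on complete rows would.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Made-up ages and scores; the pair with a missing age is ignored.
ages = [30.0, 35.0, 40.0, None, 45.0]
scores = [6.0, 7.0, 9.0, 5.0, 10.0]
slope, intercept, r = linear_fit(ages, scores)
print(slope, intercept, r, r ** 2)
```

With a single predictor, the squared `r` from this sketch corresponds to the "Multiple R-squared" line in R's `summary()` output, which is the quantity to compare across the plots listed below.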

Repeat with:

- Age of Father vs ADOS total B
- Age of Father vs ADOS total C
- Age of Father vs ADOS total D
- Age of Mother vs ADOS total A
- Age of Mother vs ADOS total B
- Age of Mother vs ADOS total C
- Age of Mother vs ADOS total D
- MHLE vs Age of Diagnosis
- Age of Diagnosis vs ADOS total A
- Age of Diagnosis vs ADOS total B
- Age of Diagnosis vs ADOS total C
- Age of Diagnosis vs ADOS total D
- Gender vs. Age of diagnosis

Maybe repeat with:
- Gender of child vs ADOS total A
- Gender of child vs ADOS total B
- Gender of child vs ADOS total C
- Gender of child vs ADOS total D



I will probably also plot the residuals of each regression, since the data are clustered around certain ages.

Histogram plots of: 
- Age of father vs. the ratio of scores above 8 (clinical) to the total number of scores, for each ADOS total
- Age of mother vs. the ratio of scores above 8 (clinical) to the total number of scores, for each ADOS total
- Age of diagnosis vs. the ratio of scores above 8 (clinical) to the total number of scores, for each ADOS total


The purpose is to see whether simply being above a certain threshold (that is, being diagnosed via the test) is more closely correlated with age of father than the continuous score is.
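The ratio behind these histograms can be sketched as a per-bin proportion. A minimal Python sketch under assumptions: `clinical_ratio_by_age` is a hypothetical helper, the 5-year bin width is an arbitrary choice for illustration, "above 8" is read as strictly greater than 8, and the values are made up.

```python
def clinical_ratio_by_age(ages, scores, bin_width=5, threshold=8):
    """Share of scores above threshold within each age bin.

    Returns a dict mapping the lower edge of each bin to a ratio in [0, 1].
    Pairs with a missing age or score (None) are skipped.
    """
    counts = {}
    for age, score in zip(ages, scores):
        if age is None or score is None:
            continue
        bin_start = int(age // bin_width) * bin_width
        total, above = counts.get(bin_start, (0, 0))
        counts[bin_start] = (total + 1, above + (1 if score > threshold else 0))
    return {start: above / total for start, (total, above) in sorted(counts.items())}


# Made-up ages and ADOS totals for illustration only.
ages = [31.4, 33.9, 36.2, 39.9, None]
scores = [9, 5, 10, 12, 7]
print(clinical_ratio_by_age(ages, scores))
```

Each bar of the histogram would then show one of these ratios, so bins with very few children should be read with caution.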

## References (temporarily just links):

Fernell: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3583438/

Kemp: http://www.aappublications.org/content/early/2015/04/28/aapnews.20150428-3

Sandin: http://www.nature.com/mp/journal/v21/n5/full/mp201570a.html

Thomas: https://www.ncbi.nlm.nih.gov/pubmed/21810908
