# Working with Autism Data (Project Outline) 

## Introduction

The datasets used in this project were provided by the Early Childhood Partial Hospitalization Program (ECPHP) at the Resnick Psychiatric Institute at UCLA. ECPHP is a 10-week behavioral and psychiatric intervention program for preschool-aged children who have autism and other disorders that make it difficult for them to attend school with neurotypical children. 

According to the 5th edition of the American Psychiatric Association's Diagnostic and Statistical Manual of Mental Disorders (DSM-5), the criteria for an autism diagnosis are impairments in social communication and interaction as well as restricted, repetitive patterns of behavior. 

Data collection occurs before and after preschool-aged children are admitted into the program. The children undergo a diagnostic test called the Autism Diagnostic Observation Schedule (ADOS), a semi-structured observational assessment for autism and autism spectrum disorder (ASD). The ADOS dataset includes four subtotals, A, B, C, and D, defined as follows:

- A = communication total
- B = communication & social interaction total
- C = play total
- D = stereotyped behaviors and restricted interests total

The higher the ADOS totals, the greater the likelihood of ASD and the greater the severity of autism. The ADOS dataset also includes the date on which the test was administered (which will be used as a proxy for the date of diagnosis).

Once a child is admitted into ECPHP, the parents fill out several questionnaires, one of which is a demographic information sheet. The demographic information includes the mother's and father's dates of birth, the child's date of birth, the parents' employment status, and socioeconomic status (measured via the proxy of the mother's highest level of education (MHLE)). 

The datasets come in three parts: the ADOSII (most recent ADOS version), the ADOSI (older version), and the demographic info sheet. The aim of this project is to see which factors correlate most strongly with autism severity. Each pair of variables will be fitted with a linear regression and plotted, and the R values will be compared. Possible data visualizations:

- Father's DOB vs. all four autism scores (research has indicated that the father's age may be a significant factor)
- Mother's DOB vs. all four autism scores (for the same reason as above, although the mother's age is likely not as significant a factor)
- MHLE vs age of diagnosis (Does socioeconomic status influence how early a child receives a diagnosis?)
- Age of diagnosis vs all four autism scores (Does severity influence age of diagnosis?)
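
The comparison of R values described above rests on a simple least-squares fit for each pair of variables. As a minimal sketch of how one such fit and its R value could be computed in plain Python (the function name `linear_fit` and all numbers are made up for illustration and are not taken from the ECPHP data):

```python
import math

def linear_fit(xs, ys):
    """Return (slope, intercept, r) for a least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r

# Made-up illustrative values, NOT real patient data
father_age = [28.0, 31.5, 35.0, 40.2, 44.8]
ados_total_a = [4, 5, 5, 7, 8]

slope, intercept, r = linear_fit(father_age, ados_total_a)
print("slope=%.3f intercept=%.3f R=%.3f R^2=%.3f" % (slope, intercept, r, r * r))
```

The sketch assumes both lists have the same length and that neither variable is constant (otherwise the division fails). The sign of R gives the direction of the correlation, and R^2 gives the share of variance explained by the line.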

There is no identifiable patient information present in the datasets aside from ID numbers (this is only so that multiple datasets may be combined).

## Cleaning the Datasets 

The datasets were provided as Excel files and have been saved as CSV files.

The files must be joined on their common field (RID#). The `join` command will be used in the shell; it expects both input files to be sorted on the join field. 
The code might look something like this: 

```shell
join -t"," -a 1 -a 2 -e "NULL" -o "0,1.1,1.2,1.3,1.4,1.5,2.1,2.2,2.3,2.4,2.5" ADOSfile.csv demographicfile.csv
```
- `join` merges two files on a single shared column (by default, the first field of each file)
- `-t","` sets the delimiter to a comma
- `-a 1 -a 2` also prints lines from the first and second file that have no match in the other file
- `-e "NULL"` replaces empty or missing output fields with "NULL" 
- `-o` lists the fields to print: 0 (the join key), 1.1 (first field of first file), 2.2 (second field of second file), etc.; `-e` applies to the fields listed here
- the final two arguments are the files to join
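
For comparison, the same kind of join can be sketched in Python, which avoids the need to pre-sort the files. This is only an illustration under stated assumptions: the function name `outer_join` is made up, each file is assumed to have a header row, and each ID is assumed to appear at most once per file.

```python
import csv

def outer_join(file1, file2, outfile, key=0, fill="NULL"):
    """Full outer join of two CSV files on the column at index `key`."""
    def load(path):
        # Returns the header row and a dict mapping key value -> full row
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        return rows[0], {row[key]: row for row in rows[1:]}

    def rest(row):
        # Every field except the key, with empty fields replaced by `fill`
        return [v if v else fill for v in row[:key] + row[key + 1:]]

    head1, rows1 = load(file1)
    head2, rows2 = load(file2)
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([head1[key]] + rest(head1) + rest(head2))
        # Union of IDs, so unmatched rows from either file are kept
        for rid in sorted(set(rows1) | set(rows2)):
            left = rest(rows1[rid]) if rid in rows1 else [fill] * (len(head1) - 1)
            right = rest(rows2[rid]) if rid in rows2 else [fill] * (len(head2) - 1)
            writer.writerow([rid] + left + right)
```

A call such as `outer_join("ADOSfile.csv", "demographicfile.csv", "joined.csv")` would write one row per ID, where `joined.csv` is a hypothetical output name. IDs are sorted as strings, so "10" sorts before "2".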

Afterwards, fields that will not be used in the data analysis will be deleted for the sake of simplicity. Pseudocode:

```shell
cut -f <specify fields which I do NOT want> -d "," --complement file #complement does the inverse selection since cut KEEPS the fields specified
```
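
The same column removal could also be done in Python by header name rather than by field number, which may be less error-prone once the joined file has many columns. A minimal sketch, where the function name `drop_columns` and the file and column names passed to it are hypothetical:

```python
import csv

def drop_columns(infile, outfile, unwanted):
    """Copy infile to outfile, omitting every column whose header is in `unwanted`."""
    with open(infile, newline="") as src, open(outfile, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        # Indexes of the columns to keep, in their original order
        keep = [i for i, name in enumerate(header) if name not in unwanted]
        writer.writerow([header[i] for i in keep])
        for row in reader:
            writer.writerow([row[i] for i in keep])
```

The sketch assumes the file has a header row and that every row has as many fields as the header.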

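Afterwards, lines with significant missing information will be deleted in Python. A sketch (the function name, the file arguments, and the `field` index are placeholders):

```python
def drop_null_lines(infile, outfile, field): #field = 0-based index of the column to check
    with open(infile) as src, open(outfile, "w") as dst:
        for line in src:
            #only this field is checked: lines with NULL elsewhere might still be usable
            if line.rstrip("\n").split(",")[field] != "NULL":
                dst.write(line)
```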



## Working Python Code (dummy files)
The following functions compute the father's age when the child was born, the mother's age when she gave birth, and the child's age at first diagnosis (the date of ADOS administration is used as a proxy for the date of diagnosis). 

### Mother's Age When She Gave Birth (without output)
``` python
#need datetime to deal with ages
from datetime import datetime
#define function
def ageofmother(filename):
    #open
    fobj = open(filename, "r")
    #read and skip 1st line
    data = fobj.readlines()[1:] #skipping header
    fobj.close() #close file now that its lines are in memory
    for line in data:
        item = line.split(",") #delimiter is comma
        motherDOB = item [3] #specifying which column is mother's DOB
        childDOB = item [1] #specifying which column is child's DOB
        M = datetime.strptime(motherDOB, "%m/%d/%Y") #read dates
        C = datetime.strptime(childDOB, "%m/%d/%Y") #read dates
        delta = C-M #difference in date
        age=delta.days/365.0 #dividing by 365 to get age in years
        print (age)
#decided not to round since I will be plotting data and would prefer accuracy
```
### Father's Age When Couple Had Child (WITH Output)

```python
#need datetime to deal with ages
from datetime import datetime
#define function
#read file but do not write
def ageoffather(filename): #defining function
    fobj = open(filename, "r") #opening file
    data = fobj.readlines() #reading file
    #close file
    fobj.close()
    #open output file and write everything in last file to it + new column with ages
    fobj = open("output.csv", "w") #opening and writing to new file with additional column of father's age
    fobj.write(data[0].strip()+ ",Father_birth_age\n") #column name
    for line in data[1:]: #skipping header
        item = line.split(",") #delimiter is comma
        print (line)
        fatherDOB = item [2] #specifiying which column is father's DOB
        childDOB = item [1] #specifying which column is child's DOB
        F = datetime.strptime(fatherDOB, "%m/%d/%Y") #read date
        C = datetime.strptime(childDOB, "%m/%d/%Y") #read date
        delta = C-F #difference in date
        age=delta.days/365.0 #dividing by 365 to get age in years
        fobj.write(line.strip() + "," + str(age)  + "\n") #newline so each record stays on its own row
        print(age)
    fobj.close()
#decided not to round since I will be plotting data and would prefer accuracy
```
### Age of First Diagnosis

```python
#using output from Age of Father function as input for Age of Diagnosis function, which writes to a brand new file
#need datetime to deal with ages
from datetime import datetime
#define function
#read file but do not write
def ageofdiagnosis(filename): #defining function
    fobj = open(filename, "r") #opening file
    data = fobj.readlines() #reading file
    #close file
    fobj.close()
    #open output file and write everything in last file to it + new column with ages
    #ASSUMPTION: hypothetical output name, chosen so that the input (output.csv) is not overwritten
    fobj = open("output_diagnosis.csv", "w") #opening and writing to new file with additional column of Age of Diagnosis
    fobj.write(data[0].strip()+ ",Ageofdiagnosis\n") #column name
    for line in data[1:]: #skipping header
        item = line.split(",") #delimiter is comma
        print (line)
        ADOSdate = item [4] #ASSUMPTION: column 4 holds the ADOS administration date; adjust to match the real file
        childDOB = item [1] #specifying which column is child's DOB
        A = datetime.strptime(ADOSdate, "%m/%d/%Y") #read date
        C = datetime.strptime(childDOB, "%m/%d/%Y") #read date
        delta = A-C #difference in date
        age=delta.days/365.0 #dividing by 365 to get age in years
        fobj.write(line.strip() + "," + str(age)  + "\n") #newline so each record stays on its own row
        print(age)
    fobj.close()
#decided not to round since I will be plotting data and would prefer accuracy
```
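
The three functions above share the same date arithmetic and differ only in which columns they read. As a hedged sketch of how that shared logic could be factored out (the names `years_between` and `add_age_column` are made up here, and dividing by 365.25 instead of 365.0 is a swapped-in choice that averages in leap years):

```python
from datetime import datetime

def years_between(start, end, fmt="%m/%d/%Y"):
    """Years elapsed from `start` to `end`, both given as date strings."""
    delta = datetime.strptime(end.strip(), fmt) - datetime.strptime(start.strip(), fmt)
    # ASSUMPTION: 365.25 days per year; the functions above divide by 365.0
    return delta.days / 365.25

def add_age_column(infile, outfile, start_col, end_col, new_name):
    """Append a column holding the years between two date columns of a CSV file."""
    with open(infile) as src:
        lines = src.read().splitlines()
    with open(outfile, "w") as dst:
        dst.write(lines[0] + "," + new_name + "\n")  # extended header
        for line in lines[1:]:
            if not line.strip():
                continue  # skip blank lines
            item = line.split(",")
            age = years_between(item[start_col], item[end_col])
            dst.write(line + "," + str(age) + "\n")
```

With the column layout used above, `add_age_column("dummy.csv", "output.csv", 2, 1, "Father_birth_age")` would reproduce the father's-age step, where `dummy.csv` is a hypothetical input name. The sketch assumes no field contains a comma and that every date field is filled in.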

## Working in RStudio

#### All Pseudocode and/or untested working code

Linear Regression Plots (example of Age of Father vs. ADOS Total A)

```R
finalfile <- read.csv("finalfile.csv") #reading the final CSV into a data frame
options(show.signif.stars=F) #turning off significance stars (set before printing the summary)
lm.out = lm(AgeofFather ~ Atotal, data = finalfile) #linear regression of Age of Father field vs ADOS total A field, naming output
summary(lm.out) #gives LOTS of info about output, including coefficients and R^2 (R is its square root, with the sign of the slope)
anova(lm.out) #gives ANOVA table
plot(AgeofFather ~ Atotal, data=finalfile, main="Age of Father vs ADOS Total A") #plotting the data and naming plot
abline(lm.out, col="red") #linear regression line red
```
Will look something like this (placeholder figure): 

<img src="placeholderfigure.png">

Repeat with:

- Age of Father vs ADOS total B
- Age of Father vs ADOS total C
- Age of Father vs ADOS total D
- Age of Mother vs ADOS total A
- Age of Mother vs ADOS total B
- Age of Mother vs ADOS total C
- Age of Mother vs ADOS total D
- MHLE vs Age of Diagnosis
- Age of Diagnosis vs ADOS total A
- Age of Diagnosis vs ADOS total B
- Age of Diagnosis vs ADOS total C
- Age of Diagnosis vs ADOS total D
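
Running the regressions listed above one at a time is repetitive, so the R values could also be collected by a short script. The following is a sketch only: the function names and the output column names are made up, the input is assumed to have a header row with numeric columns (including a numeric coding of MHLE), and missing values are assumed to be empty or "NULL".

```python
import csv
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def write_r_values(infile, outfile, pairs):
    """For each (x_column, y_column) pair, write R, R^2 and the direction."""
    with open(infile, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x", "y", "R", "R_squared", "direction"])
        for x_name, y_name in pairs:
            # Use only rows where both values are present
            points = [(float(row[x_name]), float(row[y_name])) for row in rows
                      if row[x_name] not in ("", "NULL")
                      and row[y_name] not in ("", "NULL")]
            xs, ys = zip(*points)
            r = pearson_r(xs, ys)
            direction = "positive" if r > 0 else "negative" if r < 0 else "none"
            writer.writerow([x_name, y_name, r, r * r, direction])
```

For example, `write_r_values("finalfile.csv", "Rvalues.csv", [("AgeofFather", "Atotal")])` would produce one row for the first regression, using the column names from the R example. Each pair needs at least two usable rows and some variation in both columns, otherwise the calculation fails.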

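Make a CSV file of all R and R^2 values, noting whether each correlation is positive or negative. Turn the CSV into a presentable table:

```R
Rvalues <- read.csv("Rvalues.csv") #reading the CSV of R and R^2 values
knitr::kable(Rvalues) #one option for a presentable table, assuming the knitr package is installed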
```
Will look something like this (placeholder figure):

<img src="placeholdertable.png">
