<font size=7>Create Input Dataset</font>

In this notebook we create the input files `GSE74923_annotations.txt` and `GSE74923_new.txt`, which are used to run the <font color="magenta">SingleR</font> example.

# Do some data cleaning

## Read Counts File

We load the RNA "Counts" data file. Some identifiers in its first column (`X`) appear more than once, so we will have to remove the duplicated rows.

In [None]:
data = read.table('GSE74923_old.txt', sep="\t", header=TRUE)

In [None]:
dim(data)

In [None]:
data[1:10,1:10]
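Before removing anything, it can help to see how many identifiers are repeated. The sketch below uses a small made-up table (`toy`, with invented gene names and sample columns `S1` and `S2`) in place of the real counts; only the column name `X` is taken from this notebook.

```r
# Toy stand-in for the counts table (hypothetical values, not the real data)
toy <- data.frame(
  X  = c("GeneA", "GeneB", "GeneA", "GeneC"),
  S1 = c(5, 0, 5, 2),
  S2 = c(1, 3, 1, 7),
  stringsAsFactors = FALSE
)

# Number of rows whose identifier already appeared earlier in the table
n_dup <- sum(duplicated(toy$X))

# The identifiers that occur more than once
dup_ids <- unique(toy$X[duplicated(toy$X)])

print(n_dup)
print(dup_ids)
```

Running the same two expressions on `data$X` should report how many rows the de-duplication step will drop.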

## Read types file

We check that the "types" data file has the right shape.

In [None]:
data2 = read.table('GSE74923_annotations.txt', sep="\t", header=TRUE)

In [None]:
dim(data2)
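What "the right shape" means depends on the layout of the annotations file. A minimal sketch, assuming one annotation row per sample and a single identifier column in the counts table; the tables and the column names `sample` and `type` are hypothetical.

```r
# Toy counts table: one identifier column (X) plus three sample columns
toy_counts <- data.frame(
  X  = c("GeneA", "GeneB"),
  S1 = c(5, 0),
  S2 = c(1, 3),
  S3 = c(4, 4)
)

# Toy annotations table: assumed to hold one row per sample
toy_types <- data.frame(
  sample = c("S1", "S2", "S3"),
  type   = c("T1", "T2", "T1")
)

# Sample columns = all columns except the identifier column
n_samples <- ncol(toy_counts) - 1

# Under the assumption above, the two numbers should agree
shapes_match <- nrow(toy_types) == n_samples
print(shapes_match)
```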

## Remove Duplicates from Counts

In [None]:
dataNew = data[!duplicated(data$X), ]

In [None]:
dim(dataNew)

In [None]:
row.names(dataNew)= as.character(dataNew$X)

In [None]:
dataNew$X = NULL

We check that everything looks good.

In [None]:
dataNew[1:10,1:10]
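Note that `duplicated()` marks only the second and later occurrences, so the first row for each identifier is the one that is kept. A small illustration with made-up values:

```r
# Toy table in which GeneA appears twice with different counts
toy <- data.frame(
  X  = c("GeneA", "GeneB", "GeneA"),
  S1 = c(5, 0, 9),
  stringsAsFactors = FALSE
)

# Same steps as above: drop repeated identifiers, then move X into the row names
toy_new <- toy[!duplicated(toy$X), ]
row.names(toy_new) <- toy_new$X
toy_new$X <- NULL

# The first GeneA row (S1 = 5) is kept; the later one (S1 = 9) is dropped
print(toy_new)
```

If the repeated rows carry different counts, keeping the first occurrence is a choice rather than a neutral clean-up, so it may be worth inspecting those rows first.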

## See Format of GSE74923_old.txt

We have to make sure that the format of our new file matches the format of the old file, `GSE74923_old.txt`. Let's take a look at what the old file looks like.

In [None]:
con <- file("GSE74923_old.txt","r")
first_lines <- readLines(con,n=3)
close(con)

In [None]:
class(first_lines)

In [None]:
length(first_lines)

We look at the first row of `GSE74923_old.txt`, i.e. the header.

In [None]:
print(first_lines[1])

We look at the first row of real data in `GSE74923_old.txt`.

In [None]:
print(first_lines[2])

## Create GSE74923_new.txt


We save our corrected version of the dataset to `GSE74923_new.txt`.

<font color="orange">Note: We comment out the line where we save the data. Remove the comment if you want to resave data.</font>

In [None]:
#write.table( dataNew, "GSE74923_new.txt", sep = "\t" )

In [None]:
con <- file("GSE74923_new.txt","r")
first_lines <- readLines(con,n=2)
close(con)

We then check that the text file we created is formatted correctly (compare with the results from the previous section).

In [None]:
print(first_lines[1])

In [None]:
print(first_lines[2])

Let's load the data back in to check that everything looks good.

In [None]:
dataCheck = read.table('GSE74923_new.txt', sep="\t", header=TRUE)

In [None]:
dataCheck[1:10,1:10]
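One way to compare file layouts is to count the tab-separated fields on the header line and on the first data line. With its defaults, `write.table` writes the row names as a first column but gives that column no header entry, so the header has one field fewer than the data lines; `col.names = NA` adds an empty header field instead. The sketch below shows this on a toy table written to a temporary file; `count_fields` is a helper defined here for illustration. Whether this difference matters for matching `GSE74923_old.txt` depends on that file's header.

```r
# Toy table with identifiers stored as row names (hypothetical values)
toy <- data.frame(S1 = c(5, 0), S2 = c(1, 3),
                  row.names = c("GeneA", "GeneB"))

tmp <- tempfile(fileext = ".txt")

# Defaults: row.names = TRUE, col.names = TRUE
write.table(toy, tmp, sep = "\t")
lines_default <- readLines(tmp)

# col.names = NA writes an empty header field above the row-name column
write.table(toy, tmp, sep = "\t", col.names = NA)
lines_na <- readLines(tmp)

unlink(tmp)

# Helper: number of tab-separated fields on one line
count_fields <- function(line) length(strsplit(line, "\t", fixed = TRUE)[[1]])

print(count_fields(lines_default[1]))  # header, default settings
print(count_fields(lines_default[2]))  # first data line
print(count_fields(lines_na[1]))       # header with col.names = NA
```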

## Create GSE74923_new2.txt


The first try didn't work, so let's format the data again. This time we keep the `X` column as a regular column instead of converting it to row names.

In [None]:
dataNew2 = data[!duplicated(data$X), ]

<font color="orange">Note: We comment out the line where we save the data. Remove the comment if you want to resave data.</font>

In [None]:
#write.table( dataNew2, "GSE74923_new2.txt", sep = "\t" )

Now we load the file we created and check what it looks like.

In [None]:
dataCheck2 = read.table('GSE74923_new2.txt', sep="\t", header=TRUE)

In [None]:
dataCheck2[1:10,1:10]