## Pre-processing omics genotype data for downstream analysis

This notebook explains how the mean genotype data for the HS rats with omics traits are obtained and pre-processed for downstream association studies.

In [3]:
using CSV, DelimitedFiles, DataFrames, Missings, XLSX # for manipulating raw data

#### First, get the sample IDs for the samples in the omics traits data

We need these IDs to select, from the large genotype file, only the subset of genotypes that corresponds to the samples we want.

In [4]:
@time omics_pheno_df = CSV.read("/home/zyu20/shareddata/HSNIH-Palmer/HSNIH-Rat-PL-RSeq-0818_nomissing.csv", DataFrame);

 12.750460 seconds (8.74 M allocations: 490.348 MiB, 2.36% gc time, 91.02% compilation time)


In [5]:
omics_pheno_df[1:10, 1:10]

| Row | id | ENSRNOG00000000001 | ENSRNOG00000000007 | ENSRNOG00000000008 | ENSRNOG00000000009 | ENSRNOG00000000010 | ENSRNOG00000000012 | ENSRNOG00000000017 | ENSRNOG00000000021 | ENSRNOG00000000024 |
|---|---|---|---|---|---|---|---|---|---|---|
|  | String15 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 00077E67B5 | 3.6379 | 11.9384 | 1.5656 | 0.142 | 8.1366 | 4.2163 | 2.9092 | 7.6822 | 9.5078 |
| 2 | 00077E76FE | 3.7742 | 11.9253 | 1.5245 | -0.068 | 8.2848 | 4.663 | 2.8339 | 7.8128 | 9.5016 |
| 3 | 00077E8336 | 3.554 | 12.2001 | 1.3987 | -0.0679 | 8.1853 | 4.5741 | 2.5846 | 7.705 | 9.4759 |
| 4 | 00077EA7E6 | 3.3301 | 11.9553 | 1.6367 | -0.0218 | 8.066 | 4.4902 | 2.5845 | 7.6575 | 9.3513 |
| 5 | 00078A0224 | 3.5547 | 12.0272 | 1.5649 | -0.0217 | 8.3169 | 4.5341 | 2.6773 | 7.8402 | 9.3225 |
| 6 | 00078A02CB | 3.8215 | 12.0395 | 1.3515 | -0.0679 | 8.3208 | 4.5923 | 2.6765 | 7.6363 | 9.4791 |
| 7 | 00078A0A43 | 3.6407 | 12.0575 | 1.4871 | -0.0678 | 8.2847 | 4.6763 | 2.7223 | 7.7997 | 9.4048 |
| 8 | 00078A18A7 | 3.6343 | 12.0777 | 1.4854 | -0.0679 | 8.0468 | 4.7543 | 2.7599 | 7.8119 | 9.3522 |
| 9 | 00078A193E | 3.6807 | 12.1188 | 1.5244 | -0.068 | 8.1551 | 4.8433 | 2.7173 | 7.9472 | 9.2675 |
| 10 | 00078A19A7 | 3.8224 | 12.0631 | 1.3515 | -0.0679 | 8.4357 | 4.5116 | 2.5344 | 7.7446 | 9.3388 |


Apart from the first column (`id`), the column names are the names of the omics traits:

In [8]:
size(omics_pheno_df)

(80, 32624)

In [14]:
names(omics_pheno_df)[1:6]

6-element Vector{String}:
 "id"
 "ENSRNOG00000000001"
 "ENSRNOG00000000007"
 "ENSRNOG00000000008"
 "ENSRNOG00000000009"
 "ENSRNOG00000000010"

The first column (`id`) holds the sample IDs. Extract the IDs of the samples that have omics traits...

In [11]:
sample_ids = omics_pheno_df[:, 1];

Get the numeric array of the traits data...

In [17]:
omics_pheno = omics_pheno_df[:, 2:end] |> Matrix{Float64}

80×32623 Matrix{Float64}:
 3.6379  11.9384  1.5656   0.142   …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.7742  11.9253  1.5245  -0.068      0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.554   12.2001  1.3987  -0.0679     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.3301  11.9553  1.6367  -0.0218     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.5547  12.0272  1.5649  -0.0217     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.8215  12.0395  1.3515  -0.0679  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.6407  12.0575  1.4871  -0.0678     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.6343  12.0777  1.4854  -0.0679     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.6807  12.1188  1.5244  -0.068      0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.8224  12.0631  1.3515  -0.0679     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.6883  12.1836  1.8251  -0.0679  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.4971  12.1465  1.3516  -0.0679     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 3.6853  11.9484  1.4855  -0.0218     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮                                 ⋱  

Notice that some traits are 0.0 for every sample. Remove these all-zero columns...

In [21]:
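# true for traits that are 0.0 in every sample (a column sum of 0.0 could also come from values that cancel out)
no_value_traits = vec(all(iszero, omics_pheno, dims = 1));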

In [22]:
omics_pheno_processed = omics_pheno[:, .!no_value_traits]; # keep only the traits that are not flagged
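The difference between "the column sums to zero" and "every entry is zero" can be seen on a small example. This is only a sketch on a made-up matrix (`toy_pheno` and the other `toy_*` names are hypothetical, not part of the notebook's data):

```julia
# Toy stand-in for `omics_pheno`: 3 samples × 4 traits (values are made up).
toy_pheno = [ 1.5  0.0  -2.0  0.0;
              0.5  0.0   2.0  0.0;
             -2.0  0.0   0.0  0.0]

# A column sum of zero does not imply an all-zero column:
# columns 1 and 3 contain nonzero values that cancel out.
sum_is_zero = vec(sum(toy_pheno, dims = 1) .== 0.0)

# Checking every entry flags only the columns that are entirely zero (2 and 4).
all_zero = vec(all(iszero, toy_pheno, dims = 1))

# Keep the traits that carry at least one nonzero value.
toy_processed = toy_pheno[:, .!all_zero]
```

With log-scale expression values an exact cancellation is unlikely, so both checks would probably agree on the real data, but the entry-wise check states the intent directly.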

#### Load and pre-process the HS genotype file:

First, we check on which line the data in the original genotype file actually start (line 9)...

In [25]:
line_count = 0; # not named `count`, to avoid shadowing `Base.count`
# defined before the loop so that the loop updates this variable
test_line = readline("/home/zyu20/shareddata/HSNIH-Palmer/HSNIH-Palmer_true.geno")

for line in eachline("/home/zyu20/shareddata/HSNIH-Palmer/HSNIH-Palmer_true.geno")

    line_count = line_count + 1;
    # println(line)

    test_line = line

    if line_count == 9 # the data start at line 9
        break
    end

end
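Instead of hard-coding the line number, the header could also be located by its content. The sketch below assumes that the header is the first line whose first tab-separated field is `Chr`, which matches the column names shown in this notebook but is otherwise an assumption about the file format. `find_header_line` and the demo file are hypothetical and not part of the original workflow.

```julia
# Hypothetical helper: return (line number, text) of the first line whose
# first tab-separated field equals `first_field`, or `nothing` if none is found.
function find_header_line(path::AbstractString; first_field::AbstractString = "Chr")
    open(path) do io
        for (line_no, line) in enumerate(eachline(io))
            first(split(line, '\t')) == first_field && return (line_no, line)
        end
        return nothing
    end
end

# Small demo on a temporary file with two made-up preamble lines.
demo_path, demo_io = mktemp()
write(demo_io, "# comment 1\n# comment 2\nChr\tLocus\tcM\tMb\tS1\tS2\n1\tchr1:55365\t0.055\t0.055\t2\t1\n")
close(demo_io)

header_line_no, header_line = find_header_line(demo_path)
```

Opening the file with a `do` block makes sure it is closed even though the search stops early.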

Line 9 is the first line we need, and it is the line of column names:

In [27]:
geno_colnames = split(test_line, '\t')

6151-element Vector{SubString{String}}:
 "Chr"
 "Locus"
 "cM"
 "Mb"
 "0007A0008B"
 "0007A00024"
 "0007A000DB"
 "0007A001C5"
 "0007A0059F"
 "0007A00263"
 "0007A00670"
 "0007A00716"
 "0007A01A7C"
 ⋮
 "0007929889"
 "0007929894"
 "0007929913"
 "0007929918"
 "0007929924"
 "0007929944"
 "0007929945"
 "0007929959"
 "0007929963"
 "0007929965"
 "0007929994"
 "0007929999"

We see from the column names that each row describes one marker: its chromosome (**Chr**), **Locus** name, position in **cM** and in **Mb**, followed by one genotype column per **sample ID**.

We do not want to load the entire genotype file (for all samples), since not every genotyped sample also has omics traits collected. **We therefore extract the genotype information only for the samples that appear in our traits data**.

In [29]:
@time cols_to_extract_from_fullgeno = map(x -> x in sample_ids, geno_colnames);
cols_to_extract_from_fullgeno[1:4] .= true; # also keep the information about the markers (the first 4 cols: Chr, Locus, cM, Mb)

  0.099610 seconds (553.44 k allocations: 33.283 MiB, 15.01% gc time, 51.30% compilation time)


In [16]:
sum(cols_to_extract_from_fullgeno)

84

In [30]:
cols_ids_in_geno = findall(cols_to_extract_from_fullgeno); # positions of the columns to keep
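The column selection can be illustrated on a toy header. This is a sketch with made-up sample IDs (`toy_colnames`, `toy_sample_ids` and the other names are hypothetical). It also shows one way to check whether any trait sample has no genotype column at all.

```julia
# Toy header and sample list standing in for `geno_colnames` and `sample_ids`.
toy_colnames   = ["Chr", "Locus", "cM", "Mb", "S1", "S2", "S3", "S4"]
toy_sample_ids = Set(["S2", "S4", "S9"])   # "S9" has no genotype column

n_marker_cols = 4                          # Chr, Locus, cM, Mb are always kept

# true for the marker columns and for every sample that has trait data
keep = [i <= n_marker_cols || name in toy_sample_ids for (i, name) in enumerate(toy_colnames)]

toy_col_ids = findall(keep)                # column positions to extract

# Trait samples that do not appear in the genotype header.
missing_from_geno = setdiff(toy_sample_ids, toy_colnames)
```

In the notebook the count of selected columns is 84, that is, the 4 marker columns plus all 80 trait samples, so the corresponding check on the real data would come back empty.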

The following cell selectively reads in the columns corresponding to the samples we want and writes them to a file...

In [35]:
#=
@time begin
    myfile = open("my_genofile.txt", "w")
    row_count = 0
    for line in eachline("/home/zyu20/shareddata/HSNIH-Palmer/HSNIH-Palmer_true.geno")

        row_count = row_count + 1;

        if row_count >= 9
            col_count = 0;
            words_in_curr_line = split(line, '\t');
            for word in words_in_curr_line
                col_count = col_count + 1;
                if col_count in cols_ids_in_geno
                    to_write = word * "\t";
                    write(myfile, to_write);
                end
            end
            write(myfile, "\n")
        end
    end

    close(myfile)
end
=#

The original genotype file is around 3 GB. Selectively reading and writing only the columns for the 80 samples we need, instead of loading the whole genotype file with all 6147 samples, reduces the runtime significantly.
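A more compact variant of the selective read is sketched below. It indexes the split fields directly instead of testing every column position, and it joins them without a trailing tab. `extract_columns` is a hypothetical helper and the demo file contents are made up; this is not the code that produced `my_genofile.txt`.

```julia
# Hypothetical helper: copy the columns `col_ids` of a tab-delimited file to
# `dest`, skipping every line before `first_line`.
function extract_columns(src::AbstractString, dest::AbstractString,
                         col_ids::AbstractVector{<:Integer}; first_line::Integer = 1)
    open(dest, "w") do out
        for (line_no, line) in enumerate(eachline(src))
            line_no < first_line && continue      # skip the preamble
            fields = split(line, '\t')
            println(out, join(view(fields, col_ids), '\t'))
        end
    end
end

# Demo on a small temporary file.
src_path, src_io = mktemp()
write(src_io, "# preamble\nChr\tLocus\tS1\tS2\tS3\n1\tchr1:55365\t2\t1\t0\n")
close(src_io)
dest_path = tempname()

extract_columns(src_path, dest_path, [1, 2, 4]; first_line = 2)
extracted = readlines(dest_path)
```

Because the lines written this way do not end in a tab, a file produced by this variant would not need its last column dropped after `readdlm`.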

In [104]:
@time omics_geno_full = readdlm("my_genofile.txt", '\t')[:, 1:(end-1)]

 17.776847 seconds (209.66 M allocations: 5.530 GiB, 25.90% gc time, 0.21% compilation time)


134919×84 Matrix{Any}:
   "Chr"  "Locus"             "cM"     …   "000789FFF8"   "000789FFF9"
  1       "chr1:55365"       0.055365     2              2
  1       "chr1:666374"      0.666374     2              2
  1       "chr1:666382"      0.666382     2              2
  1       "chr1:666394"      0.666394     2              2
  1       "chr1:669529"      0.669529  …  2              1
  1       "chr1:669562"      0.669562     1              2
  1       "chr1:671466"      0.671466     1              2
  1       "chr1:759319"      0.759319     1.979          0.007
  1       "chr1:1134030"     1.13403      1.993          1.998
  1       "chr1:1139210"     1.13921   …  1.993          1.997
  1       "chr1:1143318"     1.14332      0.01           0.003
  1       "chr1:1151014"     1.15101      1.997          1.998
  ⋮                                    ⋱                 
 20       "chr20:55641278"  55.6413       1.917          1.917
 20       "chr20:55644746"  55.6447       1.99           

Get only the numeric array from the genotype data...

In [None]:
omics_geno_full

In [106]:
@time omics_geno = omics_geno_full[2:end, 5:end]

  0.041490 seconds (5 allocations: 82.348 MiB)


134918×80 Matrix{Any}:
 2        2        1        2        …  2        1        2        2
 2        2        2        2           2        2        2        2
 2        2        2        1           2        2        2        2
 2        2        2        1           2        2        2        2
 2        2        2        1           2        2        2        1
 2        2        2        2        …  2        2        1        2
 2        2        2        2           2        2        1        2
 0.997    1.976    0.991    1.979       1.979    1.979    1.979    0.007
 1.941    1.559    1.006    1.994       1.993    1.994    1.993    1.998
 1.978    1.992    1.997    1.994       1.993    1.993    1.993    1.997
 0.031    0.39     0.99298  0.008    …  0.01     0.009    0.01     0.003
 1.997    1.997    1.998    1.997       1.997    1.997    1.997    1.998
 1.997    1.997    1.006    1.997       1.997    1.997    1.997    1.998
 ⋮                                   ⋱                  

#### Remove missing values from the genotype file:

Some missing values are still present, coded as "NA"...

In [107]:
NA_entries = findall(x -> x isa AbstractString, omics_geno) # string entries are the "NA" codes

1494-element Vector{CartesianIndex{2}}:
 CartesianIndex(21302, 1)
 CartesianIndex(21316, 1)
 CartesianIndex(35852, 1)
 CartesianIndex(35855, 1)
 CartesianIndex(78338, 1)
 CartesianIndex(78340, 1)
 CartesianIndex(78341, 1)
 CartesianIndex(78363, 1)
 CartesianIndex(95701, 1)
 CartesianIndex(95704, 1)
 CartesianIndex(95731, 1)
 CartesianIndex(95812, 1)
 CartesianIndex(21302, 2)
 ⋮
 CartesianIndex(95776, 80)
 CartesianIndex(95777, 80)
 CartesianIndex(95779, 80)
 CartesianIndex(95781, 80)
 CartesianIndex(95782, 80)
 CartesianIndex(95784, 80)
 CartesianIndex(95787, 80)
 CartesianIndex(95812, 80)
 CartesianIndex(109662, 80)
 CartesianIndex(124557, 80)
 CartesianIndex(124558, 80)
 CartesianIndex(130075, 80)

For example,

In [108]:
omics_geno[21302, 1]

"NA"

In [109]:
omics_geno[130075, 80]

"NA"

In [110]:
# row (marker) index of every "NA" entry; a marker is repeated once per sample in which it is missing
NA_markers = [idx[1] for idx in NA_entries];

In [111]:
NA_markers

1494-element Vector{Int64}:
  21302
  21316
  35852
  35855
  78338
  78340
  78341
  78363
  95701
  95704
  95731
  95812
  21302
      ⋮
  95776
  95777
  95779
  95781
  95782
  95784
  95787
  95812
 109662
 124557
 124558
 130075
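Marker indices repeat in this list, since the same marker can be missing in several samples (21302 and 95812 each appear more than once above). The number of distinct affected markers can be tallied as in the following sketch, which uses a made-up list of entries (`toy_na_entries` and the other `toy_*` names are hypothetical):

```julia
# Toy stand-in for `NA_entries`: (marker row, sample column) of each "NA".
toy_na_entries = [CartesianIndex(3, 1), CartesianIndex(5, 1),
                  CartesianIndex(3, 2), CartesianIndex(3, 4)]

# Row index of every entry, with repeats.
toy_na_rows = [idx[1] for idx in toy_na_entries]

# Distinct markers that have at least one missing genotype.
toy_unique_markers = sort(unique(toy_na_rows))

# Number of samples in which each of those markers is missing.
toy_na_per_marker = [sum(toy_na_rows .== r) for r in toy_unique_markers]
```

The sketch uses `sum` over a comparison rather than `count`, so it does not depend on the name `count` still referring to the Base function.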

In [112]:
omics_geno_full_no_colnames = omics_geno_full[2:end, :];

In [113]:
# keep only the markers (rows) that have no "NA" in any sample
omics_geno_nomissing = omics_geno_full_no_colnames[setdiff(1:size(omics_geno, 1), NA_markers), :];
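The same row filter can be written directly against the matrix, without building the index list first. The following is a minimal sketch on a made-up matrix (`toy_geno` is hypothetical), assuming as above that missing genotypes are the only string entries:

```julia
# Toy stand-in for `omics_geno`: numbers mixed with "NA" strings.
toy_geno = Any[2      1.979  "NA";
               2      2      1;
               "NA"   0.007  "NA";
               1.998  2      2]

# A marker (row) is incomplete if any of its entries is a string.
row_has_na = [any(x -> x isa AbstractString, row) for row in eachrow(toy_geno)]

toy_na_markers = findall(row_has_na)          # distinct row indices, in order
toy_complete   = toy_geno[.!row_has_na, :]    # markers with no missing genotype
```

Dropping whole markers is the simplest option. Imputing the missing genotypes would keep more markers but is outside the scope of this notebook.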

In [114]:
omics_geno_nomissing

134681×84 Matrix{Any}:
  1  "chr1:55365"       0.055365  …  2        1        2        2
  1  "chr1:666374"      0.666374     2        2        2        2
  1  "chr1:666382"      0.666382     2        2        2        2
  1  "chr1:666394"      0.666394     2        2        2        2
  1  "chr1:669529"      0.669529     2        2        2        1
  1  "chr1:669562"      0.669562  …  2        2        1        2
  1  "chr1:671466"      0.671466     2        2        1        2
  1  "chr1:759319"      0.759319     1.979    1.979    1.979    0.007
  1  "chr1:1134030"     1.13403      1.993    1.994    1.993    1.998
  1  "chr1:1139210"     1.13921      1.993    1.993    1.993    1.997
  1  "chr1:1143318"     1.14332   …  0.01     0.009    0.01     0.003
  1  "chr1:1151014"     1.15101      1.997    1.997    1.997    1.998
  1  "chr1:1151294"     1.15129      1.997    1.997    1.997    1.998
  ⋮                               ⋱  ⋮                          
 20  "chr20:55641278"  55.6413

In [115]:
omics_geno_colnames = omics_geno_full[1, :]

84-element Vector{Any}:
 "Chr"
 "Locus"
 "cM"
 "Mb"
 "00077E67B5"
 "00077E76FE"
 "00077E8336"
 "00077EA7E6"
 "00078A002C"
 "00078A0041"
 "00078A0058"
 "00078A0085"
 "00078A00AC"
 ⋮
 "00078A2315"
 "00078A2463"
 "00078A2496"
 "00078A2595"
 "00078A2667"
 "000789FF6E"
 "000789FF7D"
 "000789FF94"
 "000789FFD3"
 "000789FFF0"
 "000789FFF8"
 "000789FFF9"

In [117]:
omics_geno_df_nomissing = DataFrame(omics_geno_nomissing, String.(omics_geno_colnames))

Row,Chr,Locus,cM,Mb,00077E67B5,00077E76FE,00077E8336,00077EA7E6,00078A002C,00078A0041,00078A0058,00078A0085,00078A00AC,00078A00BF,00078A01A6,00078A01C0,00078A01D8,00078A01DB,00078A01FE,00078A02CB,00078A02DF,00078A07A2,00078A09B1,00078A021A,00078A022D,00078A087B,00078A096C,00078A0127,00078A0138,00078A0139,00078A0166,00078A0215,00078A0224,00078A0246,00078A0255,00078A0A43,00078A0AEA,00078A1A2B,00078A1A16,00078A1B05,00078A1F34,00078A16D3,00078A16DF,00078A17F7,00078A18A7,00078A18CF,00078A18F7,00078A19A7,00078A19B5,00078A19C0,00078A19D6,00078A22DF,00078A22EB,00078A179C,00078A181B,00078A186C,00078A192C,00078A193E,00078A194B,00078A261F,00078A1707,00078A1731,00078A1732,00078A1772,00078A1807,00078A1816,00078A1837,00078A1863,00078A1875,00078A1937,00078A1942,00078A1979,00078A2315,00078A2463,00078A2496,00078A2595,00078A2667,000789FF6E,000789FF7D,000789FF94,000789FFD3,000789FFF0,000789FFF8,000789FFF9
Unnamed: 0_level_1,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any
1,1,chr1:55365,0.055365,0.055365,2,2,1,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,1,2,2,2,1,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2
2,1,chr1:666374,0.666374,0.666374,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
3,1,chr1:666382,0.666382,0.666382,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
4,1,chr1:666394,0.666394,0.666394,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,2,2,2,2,2,1,2,1,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
5,1,chr1:669529,0.669529,0.669529,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,1,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1
6,1,chr1:669562,0.669562,0.669562,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,1,2
7,1,chr1:671466,0.671466,0.671466,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,1,1,1,1,2,2,1,2,2,2,2,2,2,2,2,2,2,1,2
8,1,chr1:759319,0.759319,0.759319,0.997,1.976,0.991,1.979,0.007,1.979,0.991,0.997,0.008,1.979,0.997,0.997,0.007,0.997,0.007,0.991,0.997,0.007,0.997,0.991,0.007,0.99,0.997,0.99,0.991,0.007,0.991,1.979,0.997,0.997,0.992,0.997,0.997,0.997,0.992,0.992,0.007,0.997,0.997,0.997,0.007,0.007,0.997,0.007,1.979,0.007,0.991,0.007,0.997,0.997,0.008,0.997,0.997,0.007,0.997,1.979,0.997,0.997,0.007,0.007,0.997,0.007,0.997,1.979,0.007,0.997,0.99603,0.99,0.991,0.997,1.978,0.007,0.991,0.997,0.007,0.007,1.979,1.979,1.979,0.007
9,1,chr1:1134030,1.13403,1.13403,1.941,1.559,1.006,1.994,1.998,1.994,1.846,1.951,1.4895,1.994,1.984,1.945,1.031,1.993,1.00494,1.989,1.992,1.998,1.989,1.994,1.022,1.49847,1.797,1.998,1.946,1.055,1.946,1.994,1.955,1.993,1.994,1.94696,1.951,1.989,1.926,1.495,1.998,1.991,1.994,1.952,1.997,1.998,1.94696,1.998,1.994,1.998,1.994,1.998,1.946,1.994,1.5045,1.994,1.994,1.998,1.977,1.993,1.935,1.993,1.998,1.998,1.994,1.023,1.994,1.994,1.998,1.993,1.912,1.997,1.994,1.993,1.919,1.998,1.994,1.994,1.044,1.031,1.993,1.994,1.993,1.998
10,1,chr1:1139210,1.13921,1.13921,1.978,1.992,1.997,1.994,1.997,1.993,1.912,1.985,1.996,1.994,1.959,1.985,1.037,1.986,1.006,1.14,1.991,1.997,1.14,1.992,1.026,1.964,1.858,1.997,1.984,1.06793,1.984,1.993,1.988,1.993,1.994,1.213,1.986,1.14,1.982,1.668,1.997,1.1181,1.994,1.986,1.996,1.997,1.984,1.997,1.993,1.997,1.992,1.997,1.984,1.994,1.995,1.994,1.993,1.997,1.927,1.993,1.981,1.992,1.997,1.997,1.994,1.029,1.994,1.993,1.997,1.993,1.99,1.866,1.992,1.993,1.992,1.997,1.992,1.993,1.05292,1.037,1.993,1.993,1.993,1.997
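Every column of this data frame has element type `Any`, because it was built from a `Matrix{Any}`. For numerical work the sample columns would typically be narrowed to `Float64`. The sketch below shows one way to do that on a made-up frame (`toy_df` and the column names `S1`, `S2` are hypothetical). It assumes that no "NA" strings remain in the sample columns.

```julia
using DataFrames

# Toy frame with `Any` columns, built the same way as in the cell above.
toy_mat = Any[1 "chr1:55365"  0.055365 2 1.979;
              1 "chr1:666374" 0.666374 1 0.007]
toy_df = DataFrame(toy_mat, ["Chr", "Locus", "cM", "S1", "S2"])

marker_cols = ["Chr", "Locus", "cM"]
sample_cols = setdiff(names(toy_df), marker_cols)

# Convert each sample column from Vector{Any} to Vector{Float64}.
for name in sample_cols
    toy_df[!, name] = Float64.(toy_df[!, name])
end

# Numeric genotype matrix (markers × samples) for downstream analysis.
toy_geno_mat = Matrix{Float64}(toy_df[:, sample_cols])
```

The conversion throws an error if a string is still present, which doubles as a check that the missing-value removal was complete.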


In [119]:
# CSV.write("/home/zyu20/shareddata/HSNIH-Palmer/HSNIH-Palmer_true_omics_geno_nomissing.csv", omics_geno_df_nomissing)

## Summary of issues of acquiring data from GN2: