## Loading gencode.v41.annotation.gff3
`gff["attributes"]` holds various pieces of information. The fields are separated by `;`, so each one can be extracted with `split(";")`.
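As a minimal sketch of the `split(";")` idea, the helper below turns one attributes string into a dictionary. The function name `parse_gff3_attributes` is hypothetical (it is not part of gffpandas), and the example string is a shortened copy of the BPY2C record that appears later in this notebook.

```python
def parse_gff3_attributes(attr_string):
    """Split a GFF3 attributes string into a {key: value} dict (hypothetical helper)."""
    fields = (field for field in attr_string.split(";") if "=" in field)
    # Split each "key=value" field on the first "=" only.
    return dict(field.split("=", 1) for field in fields)


# Shortened copy of a record shown later in this notebook.
example = (
    "ID=UTR3:ENST00000382287.5;Parent=ENST00000382287.5;"
    "gene_id=ENSG00000185894.8;gene_type=protein_coding;gene_name=BPY2C;"
    "tag=NMD_exception,basic,Ensembl_canonical,MANE_Select,appris_principal_1,CCDS"
)

attrs = parse_gff3_attributes(example)
print(attrs["gene_type"])
print(attrs["tag"].split(","))
```

Multi-valued fields such as `tag` stay as one comma-separated string, so they need a second `split(",")`.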

In [152]:
import pandas as pd
import numpy as np
import gffpandas.gffpandas as gffpd

In [154]:
annotation = gffpd.read_gff3("gencode.v41.annotation.gff3")
gff = annotation.df
gff = gff.rename(columns={"seq_id":"chr"}) # rename the chromosome column to chr

print(len(gff))

gff.head()

3373604


Unnamed: 0,chr,source,type,start,end,score,strand,phase,attributes
0,chr1,HAVANA,gene,11869,14409,.,+,.,ID=ENSG00000223972.5;gene_id=ENSG00000223972.5...
1,chr1,HAVANA,transcript,11869,14409,.,+,.,ID=ENST00000456328.2;Parent=ENSG00000223972.5;...
2,chr1,HAVANA,exon,11869,12227,.,+,.,ID=exon:ENST00000456328.2:1;Parent=ENST0000045...
3,chr1,HAVANA,exon,12613,12721,.,+,.,ID=exon:ENST00000456328.2:2;Parent=ENST0000045...
4,chr1,HAVANA,exon,13221,14409,.,+,.,ID=exon:ENST00000456328.2:3;Parent=ENST0000045...


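For protein-coding genes (`gene_type="protein_coding"`), the GFF file contains 3' UTR records (`type="three_prime_UTR"`), so extracting those is sufficient. For lncRNAs (`gene_type="lncRNA"`), the GFF file contains no 3' UTR records, so the full-length transcript records (`type="transcript"`) are extracted instead. The GFF file lists multiple transcripts per gene, and Ensembl's algorithm designates the most representative one with the `Ensembl_canonical` tag. Only these **canonical** transcripts are extracted.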
### Ensembl Canonical transcript
The Ensembl Canonical transcript is a single, representative transcript identified at every locus. For accurate analysis, we recommend that more than one transcripts at a locus may need to be considered, however, we designate a single Ensembl Canonical transcript per locus to provide consistency when only one transcript is required e.g. for initial display on the Ensembl (or other) website and for use in genome-wide calculations e.g. the Ensembl gene tree analysis.

For protein-coding genes, we aim to identify the transcript that, on balance, has the highest coverage of conserved exons, highest expression, longest coding sequence and is represented in other key resources, such as NCBI and UniProt. To identify this transcript, we consider, where available, evidence of functional potential (such as evolutionary conservation of a coding region, transcript expression levels), transcript length and evidence from other resources (such as concordance with the APPRIS 'principal isoform' and with the UniProt/Swiss-Prot 'canonical isoform').

First, define a function that extracts `gene_type`.

In [155]:
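def get_gene_type(attr_string):
    # attr_string is one value of the "attributes" column (a ";"-separated string)
    attributes = attr_string.split(";")
    output_type = [s for s in attributes if s.startswith("gene_type=")][0].replace("gene_type=", "")
    return output_type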

### (1) Extract protein-coding genes only
Extract records whose `gene_type` is `protein_coding` and that carry the `Ensembl_canonical` tag.

In [156]:
gff["gene_type"] = gff["attributes"].apply(get_gene_type)
gff = gff[gff["gene_type"] == "protein_coding"]

In [157]:
print(f"Number of rows: {len(gff)}")
gff.head()

Number of rows: 2988184


Unnamed: 0,chr,source,type,start,end,score,strand,phase,attributes,gene_type
57,chr1,HAVANA,gene,65419,71585,.,+,.,ID=ENSG00000186092.7;gene_id=ENSG00000186092.7...,protein_coding
58,chr1,HAVANA,transcript,65419,71585,.,+,.,ID=ENST00000641515.2;Parent=ENSG00000186092.7;...,protein_coding
59,chr1,HAVANA,exon,65419,65433,.,+,.,ID=exon:ENST00000641515.2:1;Parent=ENST0000064...,protein_coding
60,chr1,HAVANA,exon,65520,65573,.,+,.,ID=exon:ENST00000641515.2:2;Parent=ENST0000064...,protein_coding
61,chr1,HAVANA,CDS,65565,65573,.,+,0,ID=CDS:ENST00000641515.2;Parent=ENST0000064151...,protein_coding


In [158]:
gff = gff[gff["attributes"].str.contains("Ensembl_canonical")]

print(f"Number of rows: {len(gff)}")

Number of rows: 518563
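The filter above matches `Ensembl_canonical` as a substring anywhere in the attributes string. If an exact match against the `tag` list is preferred, something like the following sketch could be used; `has_tag` is a hypothetical helper name, and the example string is a shortened copy of a record from this notebook.

```python
def has_tag(attr_string, tag):
    """Return True if `tag` is one of the comma-separated values of the tag= field."""
    for field in attr_string.split(";"):
        if field.startswith("tag="):
            return tag in field[len("tag="):].split(",")
    return False  # no tag= field at all


example = (
    "ID=UTR3:ENST00000382287.5;gene_id=ENSG00000185894.8;gene_name=BPY2C;"
    "tag=NMD_exception,basic,Ensembl_canonical,MANE_Select,appris_principal_1,CCDS"
)

print(has_tag(example, "Ensembl_canonical"))
print(has_tag(example, "Ensembl"))
```

With a DataFrame this could be applied as `gff["attributes"].apply(has_tag, tag="Ensembl_canonical")` to build a boolean mask.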


Extract only the 3' UTR records (`type="three_prime_UTR"`). Use `HAVANA` as the source.

In [159]:
gff = gff[(gff["type"] == "three_prime_UTR") & (gff["source"] == "HAVANA")].copy()  # copy so a column can be added later without SettingWithCopyWarning
print(f"Number of rows: {len(gff)}")
gff.head()

Number of rows: 22243


Unnamed: 0,chr,source,type,start,end,score,strand,phase,attributes,gene_type
68,chr1,HAVANA,three_prime_UTR,70009,71585,.,+,.,ID=UTR3:ENST00000641515.2;Parent=ENST000006415...,protein_coding
1008,chr1,HAVANA,three_prime_UTR,944154,944574,.,+,.,ID=UTR3:ENST00000616016.5;Parent=ENST000006160...,protein_coding
1318,chr1,HAVANA,three_prime_UTR,944203,944693,.,-,.,ID=UTR3:ENST00000327044.7;Parent=ENST000003270...,protein_coding
1386,chr1,HAVANA,three_prime_UTR,965192,965719,.,+,.,ID=UTR3:ENST00000338591.8;Parent=ENST000003385...,protein_coding
1445,chr1,HAVANA,three_prime_UTR,974576,975865,.,+,.,ID=UTR3:ENST00000379410.8;Parent=ENST000003794...,protein_coding


For the BED file, two pieces of information are enough: `gene_name` and `gene_id`. Define a function that extracts them.

In [160]:
def get_attributes(attr_string):
    # attr_string is one value of the "attributes" column; returns "gene_name|gene_id"
    attributes = attr_string.split(";")
    gene_name = [s for s in attributes if s.startswith("gene_name=")][0].replace("gene_name=", "")
    gene_id = [s for s in attributes if s.startswith("gene_id=")][0].replace("gene_id=", "")
    return gene_name + "|" + gene_id

In [161]:
gff["gene"] = gff["attributes"].apply(get_attributes)

In [162]:
rows = len(gff)
unique_rows = len(gff["gene"].unique())

print(f"Number of rows: {rows}")
print(f"Unique rows: {unique_rows}")

Number of rows: 22243
Unique rows: 19487


There are 19,487 unique gene names but more rows than that, so some genes must appear more than once.
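One quick way to see which genes account for the extra rows is to count rows per gene. The sketch below uses a three-row toy table (`toy`, a hypothetical name) built from coordinates that appear in this notebook, not the full data.

```python
import pandas as pd

# Toy table: OR4F5 has one 3' UTR row, GNB1 has two.
toy = pd.DataFrame({
    "gene": [
        "OR4F5|ENSG00000186092.7",
        "GNB1|ENSG00000078369.18",
        "GNB1|ENSG00000078369.18",
    ],
    "start": [70009, 1787322, 1785286],
    "end": [71585, 1787330, 1787053],
})

rows_per_gene = toy["gene"].value_counts()
multi_row_genes = rows_per_gene[rows_per_gene > 1]
print(multi_row_genes)
```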

In [163]:
gff[gff["gene"].duplicated(keep=False)]

Unnamed: 0,chr,source,type,start,end,score,strand,phase,attributes,gene_type,gene
7813,chr1,HAVANA,three_prime_UTR,1787322,1787330,.,-,.,ID=UTR3:ENST00000378609.9;Parent=ENST000003786...,protein_coding,GNB1|ENSG00000078369.18
7814,chr1,HAVANA,three_prime_UTR,1785286,1787053,.,-,.,ID=UTR3:ENST00000378609.9;Parent=ENST000003786...,protein_coding,GNB1|ENSG00000078369.18
13647,chr1,HAVANA,three_prime_UTR,4774500,4774558,.,+,.,ID=UTR3:ENST00000378191.5;Parent=ENST000003781...,protein_coding,AJAP1|ENSG00000196581.11
13648,chr1,HAVANA,three_prime_UTR,4782545,4792534,.,+,.,ID=UTR3:ENST00000378191.5;Parent=ENST000003781...,protein_coding,AJAP1|ENSG00000196581.11
15128,chr1,HAVANA,three_prime_UTR,6106234,6106279,.,-,.,ID=UTR3:ENST00000262450.8;Parent=ENST000002624...,protein_coding,CHD5|ENSG00000116254.18
...,...,...,...,...,...,...,...,...,...,...,...
3372770,chrY,HAVANA,three_prime_UTR,25037992,25038097,.,-,.,ID=UTR3:ENST00000382287.5;Parent=ENST000003822...,protein_coding,BPY2C|ENSG00000185894.8
3372771,chrY,HAVANA,three_prime_UTR,25031317,25031441,.,-,.,ID=UTR3:ENST00000382287.5;Parent=ENST000003822...,protein_coding,BPY2C|ENSG00000185894.8
3372772,chrY,HAVANA,three_prime_UTR,25030901,25031222,.,-,.,ID=UTR3:ENST00000382287.5;Parent=ENST000003822...,protein_coding,BPY2C|ENSG00000185894.8
3373214,chrY,HAVANA,three_prime_UTR,57128472,57130289,.,+,.,ID=UTR3:ENST00000286448.12_PAR_Y;Parent=ENST00...,protein_coding,VAMP7|ENSG00000124333.16


3,919 genes share the same gene name and gene ID across multiple rows; the 3' UTR positions appear to differ. As an example, take a closer look at the gene BPY2C.

In [164]:
bpy2c = gff[gff["gene"] == "BPY2C|ENSG00000185894.8"]
bpy2c

Unnamed: 0,chr,source,type,start,end,score,strand,phase,attributes,gene_type,gene
3372770,chrY,HAVANA,three_prime_UTR,25037992,25038097,.,-,.,ID=UTR3:ENST00000382287.5;Parent=ENST000003822...,protein_coding,BPY2C|ENSG00000185894.8
3372771,chrY,HAVANA,three_prime_UTR,25031317,25031441,.,-,.,ID=UTR3:ENST00000382287.5;Parent=ENST000003822...,protein_coding,BPY2C|ENSG00000185894.8
3372772,chrY,HAVANA,three_prime_UTR,25030901,25031222,.,-,.,ID=UTR3:ENST00000382287.5;Parent=ENST000003822...,protein_coding,BPY2C|ENSG00000185894.8


In [165]:
bpy2c["attributes"].unique()

array(['ID=UTR3:ENST00000382287.5;Parent=ENST00000382287.5;gene_id=ENSG00000185894.8;transcript_id=ENST00000382287.5;gene_type=protein_coding;gene_name=BPY2C;transcript_type=protein_coding;transcript_name=BPY2C-201;exon_number=7;exon_id=ENSE00003506422.1;level=2;protein_id=ENSP00000371724.1;transcript_support_level=1;hgnc_id=HGNC:18225;tag=NMD_exception,basic,Ensembl_canonical,MANE_Select,appris_principal_1,CCDS;ccdsid=CCDS44030.1;havana_gene=OTTHUMG00000045199.3;havana_transcript=OTTHUMT00000104944.1',
       'ID=UTR3:ENST00000382287.5;Parent=ENST00000382287.5;gene_id=ENSG00000185894.8;transcript_id=ENST00000382287.5;gene_type=protein_coding;gene_name=BPY2C;transcript_type=protein_coding;transcript_name=BPY2C-201;exon_number=8;exon_id=ENSE00001764403.1;level=2;protein_id=ENSP00000371724.1;transcript_support_level=1;hgnc_id=HGNC:18225;tag=NMD_exception,basic,Ensembl_canonical,MANE_Select,appris_principal_1,CCDS;ccdsid=CCDS44030.1;havana_gene=OTTHUMG00000045199.3;havana_transcript=OTTHU

These three records differ in `exon_number`. Checking the structure in IGV suggests that the 3' UTR is spliced.
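To make the gaps between the segments concrete, the sketch below compares the summed length of the three BPY2C 3' UTR segments listed above with the length of the interval that spans them. It assumes GFF3 coordinates are 1-based and inclusive, so a segment's length is `end - start + 1`.

```python
# (start, end) of the three BPY2C 3' UTR records shown above
segments = [
    (25037992, 25038097),
    (25031317, 25031441),
    (25030901, 25031222),
]

# Length of each annotated segment and their sum.
segment_lengths = [end - start + 1 for start, end in segments]
utr_length = sum(segment_lengths)

# Length of the single interval running from the smallest start to the largest end.
span_start = min(start for start, _ in segments)
span_end = max(end for _, end in segments)
span_length = span_end - span_start + 1

print(segment_lengths, utr_length, span_length)
```

The spanning interval is much longer than the annotated segments combined, because it also covers the sequence between them.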

In [173]:
# Note: gff_dup is defined in a later cell (In [176]); run that cell first.
plcxd1 = gff_dup[gff_dup["gene"] == "PLCXD1|ENSG00000182378.15"]
plcxd1

Unnamed: 0,chr,start,end,gene,score,strand
21206,chrX,299336,303356,PLCXD1|ENSG00000182378.15,0,+
22172,chrY,299336,303356,PLCXD1|ENSG00000182378.15,0,+


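PLCXD1 appears twice because it is present on both the X and Y chromosomes.

For genes duplicated within the same chromosome, take the minimum `start` and the maximum `end` and use the merged interval as the 3' UTR range.

First, extract the columns needed for the BED file (`chr`, `start`, `end`, `gene`, `score`, `strand`).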
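Genes like this can be listed by counting the distinct chromosomes per gene. The following is a small sketch on a toy table (`toy`, a hypothetical name) with rows copied from this notebook.

```python
import pandas as pd

# Toy table: PLCXD1 is on chrX and chrY, BPY2C has three rows on chrY only.
toy = pd.DataFrame({
    "chr": ["chrX", "chrY", "chrY", "chrY", "chrY"],
    "start": [299336, 299336, 25037992, 25031317, 25030901],
    "end": [303356, 303356, 25038097, 25031441, 25031222],
    "gene": [
        "PLCXD1|ENSG00000182378.15",
        "PLCXD1|ENSG00000182378.15",
        "BPY2C|ENSG00000185894.8",
        "BPY2C|ENSG00000185894.8",
        "BPY2C|ENSG00000185894.8",
    ],
})

chr_per_gene = toy.groupby("gene")["chr"].nunique()
multi_chr_genes = chr_per_gene[chr_per_gene > 1].index.tolist()
print(multi_chr_genes)
```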

In [175]:
gff = gff[["chr", "start", "end", "gene", "score", "strand"]].drop_duplicates().reset_index(drop=True)
gff["score"] = 0

gff

Unnamed: 0,chr,start,end,gene,score,strand
0,chr1,70009,71585,OR4F5|ENSG00000186092.7,0,+
1,chr1,944154,944574,SAMD11|ENSG00000187634.13,0,+
2,chr1,944203,944693,NOC2L|ENSG00000188976.11,0,-
3,chr1,965192,965719,KLHL17|ENSG00000187961.15,0,+
4,chr1,974576,975865,PLEKHN1|ENSG00000187583.11,0,+
...,...,...,...,...,...,...
22238,chrY,25031317,25031441,BPY2C|ENSG00000185894.8,0,-
22239,chrY,25030901,25031222,BPY2C|ENSG00000185894.8,0,-
22240,chrY,25624528,25624902,CDY1|ENSG00000172288.8,0,+
22241,chrY,57128472,57130289,VAMP7|ENSG00000124333.16,0,+


Split the genes into those with duplicates and those without.

In [176]:
gff_unique = gff[~gff["gene"].duplicated(keep=False)]
gff_dup = gff[gff["gene"].duplicated(keep=False)]

In [178]:
gff_unique

Unnamed: 0,chr,start,end,gene,score,strand
0,chr1,70009,71585,OR4F5|ENSG00000186092.7,0,+
1,chr1,944154,944574,SAMD11|ENSG00000187634.13,0,+
2,chr1,944203,944693,NOC2L|ENSG00000188976.11,0,-
3,chr1,965192,965719,KLHL17|ENSG00000187961.15,0,+
4,chr1,974576,975865,PLEKHN1|ENSG00000187583.11,0,+
...,...,...,...,...,...,...
22219,chrY,22168542,22168819,RBMY1F|ENSG00000169800.14,0,-
22220,chrY,22417604,22417881,RBMY1J|ENSG00000226941.9,0,+
22221,chrY,22514071,22514637,PRY|ENSG00000169789.10,0,+
22229,chrY,24045793,24046065,CDY1B|ENSG00000172352.5,0,-


In [179]:
gff_dup

Unnamed: 0,chr,start,end,gene,score,strand
44,chr1,1787322,1787330,GNB1|ENSG00000078369.18,0,-
45,chr1,1785286,1787053,GNB1|ENSG00000078369.18,0,-
76,chr1,4774500,4774558,AJAP1|ENSG00000196581.11,0,+
77,chr1,4782545,4792534,AJAP1|ENSG00000196581.11,0,+
80,chr1,6106234,6106279,CHD5|ENSG00000116254.18,0,-
...,...,...,...,...,...,...
22237,chrY,25037992,25038097,BPY2C|ENSG00000185894.8,0,-
22238,chrY,25031317,25031441,BPY2C|ENSG00000185894.8,0,-
22239,chrY,25030901,25031222,BPY2C|ENSG00000185894.8,0,-
22241,chrY,57128472,57130289,VAMP7|ENSG00000124333.16,0,+


For genes duplicated within the same chromosome, take the minimum 3' UTR start and the maximum 3' UTR end and collapse them into a single row.

In [196]:
duplicated_genes = gff_dup["gene"].unique()
gff_dedup = pd.DataFrame(columns=["chr", "start", "end", "gene", "score", "strand"])

for gene in duplicated_genes:
    tmp_df = gff_dup[gff_dup["gene"] == gene]

    if tmp_df["chr"].nunique() == 1:
        # Same chromosome: collapse to one interval (min start, max end)
        tmp_chr = tmp_df["chr"].iloc[0]
        tmp_start = tmp_df["start"].min()
        tmp_end = tmp_df["end"].max()
        tmp_strand = tmp_df["strand"].iloc[0]
        add_row = pd.DataFrame([tmp_chr, tmp_start, tmp_end, gene, 0, tmp_strand], index=gff_dedup.columns).T
        gff_dedup = pd.concat([gff_dedup, add_row])

    else:
        # Present on more than one chromosome (e.g. X and Y): keep the rows as they are
        gff_dedup = pd.concat([gff_dedup, tmp_df])

gff_dedup

Unnamed: 0,chr,start,end,gene,score,strand
0,chr1,1785286,1787330,GNB1|ENSG00000078369.18,0,-
0,chr1,4774500,4792534,AJAP1|ENSG00000196581.11,0,+
0,chr1,6101787,6106279,CHD5|ENSG00000116254.18,0,-
0,chr1,6159430,6181467,ENSG00000285629|ENSG00000285629.1,0,-
0,chr1,10946475,10947662,C1orf127|ENSG00000175262.15,0,-
...,...,...,...,...,...,...
0,chrY,23285502,23291356,DAZ2|ENSG00000205944.12,0,+
0,chrY,24632011,24639207,BPY2B|ENSG00000183795.8,0,+
0,chrY,24763069,24768933,DAZ3|ENSG00000187191.16,0,-
0,chrY,24901176,24907040,DAZ4|ENSG00000205916.12,0,+
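The same collapse can also be written without an explicit loop. The sketch below groups by gene, chromosome and strand and takes the minimum start and maximum end per group. It is an alternative formulation, not the code used in this notebook, and it differs in one respect: a gene with several rows on each of two chromosomes would be collapsed per chromosome, whereas the loop keeps such rows unchanged. The toy table `toy_dup` is a hypothetical name; its rows are copied from this notebook.

```python
import pandas as pd

toy_dup = pd.DataFrame({
    "chr": ["chr1", "chr1", "chrX", "chrY"],
    "start": [1787322, 1785286, 299336, 299336],
    "end": [1787330, 1787053, 303356, 303356],
    "gene": [
        "GNB1|ENSG00000078369.18",
        "GNB1|ENSG00000078369.18",
        "PLCXD1|ENSG00000182378.15",
        "PLCXD1|ENSG00000182378.15",
    ],
    "score": [0, 0, 0, 0],
    "strand": ["-", "-", "+", "+"],
})

bed_columns = ["chr", "start", "end", "gene", "score", "strand"]

# One row per (gene, chr, strand): smallest start, largest end.
collapsed = (
    toy_dup.groupby(["gene", "chr", "strand"], sort=False, as_index=False)
    .agg(start=("start", "min"), end=("end", "max"), score=("score", "first"))
)[bed_columns]

print(collapsed)
```

Because the result gets a fresh 0..n-1 index, the repeated `0` index values seen in the loop output do not occur here.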


In [197]:
gff_merged = pd.concat([gff_unique, gff_dedup])
gff_merged

Unnamed: 0,chr,start,end,gene,score,strand
0,chr1,70009,71585,OR4F5|ENSG00000186092.7,0,+
1,chr1,944154,944574,SAMD11|ENSG00000187634.13,0,+
2,chr1,944203,944693,NOC2L|ENSG00000188976.11,0,-
3,chr1,965192,965719,KLHL17|ENSG00000187961.15,0,+
4,chr1,974576,975865,PLEKHN1|ENSG00000187583.11,0,+
...,...,...,...,...,...,...
0,chrY,23285502,23291356,DAZ2|ENSG00000205944.12,0,+
0,chrY,24632011,24639207,BPY2B|ENSG00000183795.8,0,+
0,chrY,24763069,24768933,DAZ3|ENSG00000187191.16,0,-
0,chrY,24901176,24907040,DAZ4|ENSG00000205916.12,0,+


Save the BED file.

In [198]:
gff_merged.to_csv("gencode.v41.customized.canonical.bed", sep="\t", index=False, header=False)
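One point worth checking before using this file: GFF3 coordinates are 1-based and inclusive, while the BED format uses a 0-based start and an exclusive end. The file written above keeps the GFF3 `start` values unchanged. If the downstream tool expects strict BED coordinates, a conversion along the following lines may be needed; this is a sketch on a single toy row (`bed_like`, a hypothetical name), and whether it is required depends on how the file is consumed.

```python
import pandas as pd

# One row in the same column layout as the table written above.
bed_like = pd.DataFrame({
    "chr": ["chr1"],
    "start": [70009],
    "end": [71585],
    "gene": ["OR4F5|ENSG00000186092.7"],
    "score": [0],
    "strand": ["+"],
})

# GFF3 (1-based, inclusive) -> BED (0-based start, exclusive end):
# subtract 1 from start and leave end as it is.
bed0 = bed_like.copy()
bed0["start"] = bed0["start"] - 1

print(bed0)
```

After the conversion, `end - start` equals the feature length, which is the usual BED convention.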

### Build a BED file that also includes lncRNAs

In [201]:
annotation = gffpd.read_gff3("gencode.v41.annotation.gff3")
gff = annotation.df
gff = gff.rename(columns={"seq_id":"chr"}) # rename the chromosome column to chr

print(len(gff))

gff.head()

3373604


Unnamed: 0,chr,source,type,start,end,score,strand,phase,attributes
0,chr1,HAVANA,gene,11869,14409,.,+,.,ID=ENSG00000223972.5;gene_id=ENSG00000223972.5...
1,chr1,HAVANA,transcript,11869,14409,.,+,.,ID=ENST00000456328.2;Parent=ENSG00000223972.5;...
2,chr1,HAVANA,exon,11869,12227,.,+,.,ID=exon:ENST00000456328.2:1;Parent=ENST0000045...
3,chr1,HAVANA,exon,12613,12721,.,+,.,ID=exon:ENST00000456328.2:2;Parent=ENST0000045...
4,chr1,HAVANA,exon,13221,14409,.,+,.,ID=exon:ENST00000456328.2:3;Parent=ENST0000045...


In [202]:
gff["gene_type"] = gff["attributes"].apply(get_gene_type)
gff = gff[gff["gene_type"] == "lncRNA"]

In [203]:
gff

Unnamed: 0,chr,source,type,start,end,score,strand,phase,attributes,gene_type
28,chr1,HAVANA,gene,29554,31109,.,+,.,ID=ENSG00000243485.5;gene_id=ENSG00000243485.5...,lncRNA
29,chr1,HAVANA,transcript,29554,31097,.,+,.,ID=ENST00000473358.1;Parent=ENSG00000243485.5;...,lncRNA
30,chr1,HAVANA,exon,29554,30039,.,+,.,ID=exon:ENST00000473358.1:1;Parent=ENST0000047...,lncRNA
31,chr1,HAVANA,exon,30564,30667,.,+,.,ID=exon:ENST00000473358.1:2;Parent=ENST0000047...,lncRNA
32,chr1,HAVANA,exon,30976,31097,.,+,.,ID=exon:ENST00000473358.1:3;Parent=ENST0000047...,lncRNA
...,...,...,...,...,...,...,...,...,...,...
3373355,chrY,HAVANA,exon,57208519,57208756,.,+,.,ID=exon:ENST00000483543.7_PAR_Y:4;Parent=ENST0...,lncRNA
3373356,chrY,HAVANA,gene,57201143,57203357,.,-,.,ID=ENSG00000185203.12_PAR_Y;gene_id=ENSG000001...,lncRNA
3373357,chrY,HAVANA,transcript,57201143,57203357,.,-,.,ID=ENST00000399966.9_PAR_Y;Parent=ENSG00000185...,lncRNA
3373358,chrY,HAVANA,exon,57203182,57203357,.,-,.,ID=exon:ENST00000399966.9_PAR_Y:1;Parent=ENST0...,lncRNA


Extract the records that are `Ensembl_canonical` and have `type` == `transcript`.

In [204]:
gff = gff[gff["attributes"].str.contains("Ensembl_canonical")]
gff = gff[gff["type"] == "transcript"].copy()  # copy so a column can be added later without SettingWithCopyWarning
gff

Unnamed: 0,chr,source,type,start,end,score,strand,phase,attributes,gene_type
29,chr1,HAVANA,transcript,29554,31097,.,+,.,ID=ENST00000473358.1;Parent=ENSG00000243485.5;...,lncRNA
40,chr1,HAVANA,transcript,34554,36081,.,-,.,ID=ENST00000417324.1;Parent=ENSG00000237613.2;...,lncRNA
75,chr1,HAVANA,transcript,92230,129217,.,-,.,ID=ENST00000477740.5;Parent=ENSG00000238009.6;...,lncRNA
93,chr1,HAVANA,transcript,89551,91105,.,-,.,ID=ENST00000495576.1;Parent=ENSG00000239945.1;...,lncRNA
106,chr1,HAVANA,transcript,139790,140339,.,-,.,ID=ENST00000493797.1;Parent=ENSG00000239906.1;...,lncRNA
...,...,...,...,...,...,...,...,...,...,...
3372838,chrY,HAVANA,transcript,25378300,25394719,.,-,.,ID=ENST00000427373.5;Parent=ENSG00000228786.5;...,lncRNA
3372887,chrY,HAVANA,transcript,25482908,25486705,.,+,.,ID=ENST00000306641.1;Parent=ENSG00000240450.1;...,lncRNA
3372986,chrY,HAVANA,transcript,25728490,25733388,.,+,.,ID=ENST00000417334.1;Parent=ENSG00000231141.1;...,lncRNA
3373351,chrY,HAVANA,transcript,57190738,57208756,.,+,.,ID=ENST00000483543.7_PAR_Y;Parent=ENSG00000270...,lncRNA


Extract the gene name and gene ID (and mark the entry as an lncRNA).

In [205]:
def get_attributes_lncRNA(attr_string):
    # attr_string is one value of the "attributes" column; returns "gene_name|gene_id|lncRNA"
    attributes = attr_string.split(";")
    gene_name = [s for s in attributes if s.startswith("gene_name=")][0].replace("gene_name=", "")
    gene_id = [s for s in attributes if s.startswith("gene_id=")][0].replace("gene_id=", "")
    return gene_name + "|" + gene_id + "|lncRNA"

In [206]:
gff["gene"] = gff["attributes"].apply(get_attributes_lncRNA)

In [207]:
gff

Unnamed: 0,chr,source,type,start,end,score,strand,phase,attributes,gene_type,gene
29,chr1,HAVANA,transcript,29554,31097,.,+,.,ID=ENST00000473358.1;Parent=ENSG00000243485.5;...,lncRNA,MIR1302-2HG|ENSG00000243485.5|lncRNA
40,chr1,HAVANA,transcript,34554,36081,.,-,.,ID=ENST00000417324.1;Parent=ENSG00000237613.2;...,lncRNA,FAM138A|ENSG00000237613.2|lncRNA
75,chr1,HAVANA,transcript,92230,129217,.,-,.,ID=ENST00000477740.5;Parent=ENSG00000238009.6;...,lncRNA,ENSG00000238009|ENSG00000238009.6|lncRNA
93,chr1,HAVANA,transcript,89551,91105,.,-,.,ID=ENST00000495576.1;Parent=ENSG00000239945.1;...,lncRNA,ENSG00000239945|ENSG00000239945.1|lncRNA
106,chr1,HAVANA,transcript,139790,140339,.,-,.,ID=ENST00000493797.1;Parent=ENSG00000239906.1;...,lncRNA,ENSG00000239906|ENSG00000239906.1|lncRNA
...,...,...,...,...,...,...,...,...,...,...,...
3372838,chrY,HAVANA,transcript,25378300,25394719,.,-,.,ID=ENST00000427373.5;Parent=ENSG00000228786.5;...,lncRNA,SEPTIN14P23|ENSG00000228786.5|lncRNA
3372887,chrY,HAVANA,transcript,25482908,25486705,.,+,.,ID=ENST00000306641.1;Parent=ENSG00000240450.1;...,lncRNA,CSPG4P1Y|ENSG00000240450.1|lncRNA
3372986,chrY,HAVANA,transcript,25728490,25733388,.,+,.,ID=ENST00000417334.1;Parent=ENSG00000231141.1;...,lncRNA,TTTY3|ENSG00000231141.1|lncRNA
3373351,chrY,HAVANA,transcript,57190738,57208756,.,+,.,ID=ENST00000483543.7_PAR_Y;Parent=ENSG00000270...,lncRNA,ENSG00000270726|ENSG00000270726.6|lncRNA


In [208]:
rows = len(gff)
unique_rows = len(gff["gene"].unique())

print(f"Number of rows: {rows}")
print(f"Unique rows: {unique_rows}")

Number of rows: 18041
Unique rows: 18027


Some lncRNAs are also duplicated, but this is because the same gene is present on both sex chromosomes (X and Y), so these are left as they are here.

In [213]:
gff = gff[["chr", "start", "end", "gene", "score", "strand"]].drop_duplicates().reset_index(drop=True)
gff["score"] = 0

In [214]:
gff

Unnamed: 0,chr,start,end,gene,score,strand
0,chr1,29554,31097,MIR1302-2HG|ENSG00000243485.5|lncRNA,0,+
1,chr1,34554,36081,FAM138A|ENSG00000237613.2|lncRNA,0,-
2,chr1,92230,129217,ENSG00000238009|ENSG00000238009.6|lncRNA,0,-
3,chr1,89551,91105,ENSG00000239945|ENSG00000239945.1|lncRNA,0,-
4,chr1,139790,140339,ENSG00000239906|ENSG00000239906.1|lncRNA,0,-
...,...,...,...,...,...,...
18036,chrY,25378300,25394719,SEPTIN14P23|ENSG00000228786.5|lncRNA,0,-
18037,chrY,25482908,25486705,CSPG4P1Y|ENSG00000240450.1|lncRNA,0,+
18038,chrY,25728490,25733388,TTTY3|ENSG00000231141.1|lncRNA,0,+
18039,chrY,57190738,57208756,ENSG00000270726|ENSG00000270726.6|lncRNA,0,+


For genes duplicated within the same chromosome, collapse the rows into the minimum start and maximum end position of the transcript.

In [217]:
gff_unique = gff[~gff["gene"].duplicated(keep=False)]
gff_dup = gff[gff["gene"].duplicated(keep=False)]

duplicated_genes = gff_dup["gene"].unique()
gff_dedup = pd.DataFrame(columns=["chr", "start", "end", "gene", "score", "strand"])

for gene in duplicated_genes:
    tmp_df = gff_dup[gff_dup["gene"] == gene]

    if tmp_df["chr"].nunique() == 1:
        # Same chromosome: collapse to one interval (min start, max end)
        tmp_chr = tmp_df["chr"].iloc[0]
        tmp_start = tmp_df["start"].min()
        tmp_end = tmp_df["end"].max()
        tmp_strand = tmp_df["strand"].iloc[0]
        add_row = pd.DataFrame([tmp_chr, tmp_start, tmp_end, gene, 0, tmp_strand], index=gff_dedup.columns).T
        gff_dedup = pd.concat([gff_dedup, add_row])

    else:
        # Present on more than one chromosome (e.g. X and Y): keep the rows as they are
        gff_dedup = pd.concat([gff_dedup, tmp_df])

In [218]:
gff_merged_lncRNA = pd.concat([gff_unique, gff_dedup])
gff_merged_lncRNA

Unnamed: 0,chr,start,end,gene,score,strand
0,chr1,29554,31097,MIR1302-2HG|ENSG00000243485.5|lncRNA,0,+
1,chr1,34554,36081,FAM138A|ENSG00000237613.2|lncRNA,0,-
2,chr1,92230,129217,ENSG00000238009|ENSG00000238009.6|lncRNA,0,-
3,chr1,89551,91105,ENSG00000239945|ENSG00000239945.1|lncRNA,0,-
4,chr1,139790,140339,ENSG00000239906|ENSG00000239906.1|lncRNA,0,-
...,...,...,...,...,...,...
17971,chrY,2612988,2615347,LINC00102|ENSG00000230542.6|lncRNA,0,-
17958,chrX,156004218,156022236,ENSG00000270726|ENSG00000270726.6|lncRNA,0,+
18039,chrY,57190738,57208756,ENSG00000270726|ENSG00000270726.6|lncRNA,0,+
17959,chrX,156014623,156016837,WASIR1|ENSG00000185203.12|lncRNA,0,-


In [219]:
gff_merged = pd.concat([gff_merged, gff_merged_lncRNA])
gff_merged

Unnamed: 0,chr,start,end,gene,score,strand
0,chr1,70009,71585,OR4F5|ENSG00000186092.7,0,+
1,chr1,944154,944574,SAMD11|ENSG00000187634.13,0,+
2,chr1,944203,944693,NOC2L|ENSG00000188976.11,0,-
3,chr1,965192,965719,KLHL17|ENSG00000187961.15,0,+
4,chr1,974576,975865,PLEKHN1|ENSG00000187583.11,0,+
...,...,...,...,...,...,...
17971,chrY,2612988,2615347,LINC00102|ENSG00000230542.6|lncRNA,0,-
17958,chrX,156004218,156022236,ENSG00000270726|ENSG00000270726.6|lncRNA,0,+
18039,chrY,57190738,57208756,ENSG00000270726|ENSG00000270726.6|lncRNA,0,+
17959,chrX,156014623,156016837,WASIR1|ENSG00000185203.12|lncRNA,0,-


Together, `protein_coding` and `lncRNA` yield information for 37,524 genes.

In [221]:
gff_merged.to_csv("gencode.v41.customized.canonical.with.lncRNA.bed", sep="\t", index=False, header=False)

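### Q. Do the 3' UTRs alone really cover all the reads?
Looking at the mapped BAM files in IGV, there are quite a few genes where reads map not only to the 3' UTR but also extend into the last exon (or even the exon before it). For such genes, counting only the reads that land on the 3' UTR underestimates the read count. If this accumulates, could it affect the size factors when running differential expression analysis with `DESeq2`?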
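One way to start exploring this question is with a simplified re-implementation of the median-of-ratios idea behind size factors. The sketch below is not DESeq2's code, and all counts are made-up toy numbers. It suggests that under-counting a gene by the same fraction in every sample cancels out in the ratios, whereas under-counting that differs between samples can shift the size factors.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of per-gene lists, one count per sample.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts:
        if min(gene_counts) <= 0:
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Made-up counts: 4 genes x 2 samples, sample 2 sequenced twice as deep.
full = [[100, 200], [50, 100], [30, 60], [80, 160]]

# Gene 0 under-counted by half in BOTH samples.
uniform_loss = [[50, 100], [50, 100], [30, 60], [80, 160]]

# Genes 0-2 under-counted by half in sample 2 ONLY.
sample_specific_loss = [[100, 100], [50, 50], [30, 30], [80, 160]]

sf_full = size_factors(full)
sf_uniform = size_factors(uniform_loss)
sf_specific = size_factors(sample_specific_loss)
print(sf_full, sf_uniform, sf_specific)
```

In this toy setting the uniform loss leaves the size factors unchanged, so the practical concern would be genes where the fraction of reads outside the 3' UTR varies between samples.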