# Bacterial genome analysis

This notebook uses the bash kernel. If the bash kernel is not installed, run the following in a terminal and then restart Jupyter Notebook:
```
pip install bash_kernel
python -m bash_kernel.install
```

Anaconda or Miniconda is assumed to be installed already.

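Running the commands in this file in order creates the working folders for the analysis under the current directory. (This differs from the procedure shown in the main text of the analysis textbook.)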

Press `Shift-Enter` to run the selected cell.

## Installing the analysis tools

### Installing FastQC

The `-y` option is specified so that conda does not ask for confirmation before installing.

In [9]:
conda install -y -c bioconda fastqc

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [5]:
fastqc -v

FastQC v0.11.8


### Installing fastp

In [8]:
conda install -y -c bioconda fastp

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/nigyta/miniconda3

  added / updated specs:
    - fastp


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    fastp-0.20.0               |       hd9629dc_0         220 KB  bioconda
    ------------------------------------------------------------
                                           Total:         220 KB

The following NEW packages will be INSTALLED:

  fastp              bioconda/osx-64::fastp-0.20.0-hd9629dc_0



Downloading and Extracting Packages
fastp-0.20.0         | 220 KB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


Display the version to confirm that the tool works.

In [10]:
fastp -v

fastp 0.20.0


### Installing SeqKit

In [11]:
conda install -y -c bioconda seqkit

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/nigyta/miniconda3

  added / updated specs:
    - seqkit


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    seqkit-0.11.0              |                0         6.7 MB  bioconda
    ------------------------------------------------------------
                                           Total:         6.7 MB

The following NEW packages will be INSTALLED:

  seqkit             bioconda/osx-64::seqkit-0.11.0-0



Downloading and Extracting Packages
seqkit-0.11.0        | 6.7 MB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


Display the version to confirm that the tool works.

In [13]:
seqkit version

seqkit v0.11.0

Checking new version...
You are using the latest version of seqkit


### Installing Platanus_B

Create a `tools` directory in the current directory and install Platanus_B there. Platanus_B is downloaded from the [developer's website](http://platanus.bio.titech.ac.jp/platanus-b).

In [17]:
mkdir -p tools
cd tools
curl "http://platanus.bio.titech.ac.jp/?ddownload=411" --output platanus_tmp.tar.gz
tar xvfz platanus_tmp.tar.gz
cp Platanus_B_v1.1.0_190607_macOS_bin/platanus_b ./
cp Platanus_B_v1.1.0_190607_macOS_bin/README ./platanus_README
rm -rf Platanus_B_v1.1.0_190607_macOS_bin platanus_tmp.tar.gz
cd ../

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1407k  100 1407k    0     0   685k      0  0:00:02  0:00:02 --:--:--  685k
x Platanus_B_v1.1.0_190607_macOS_bin/
x Platanus_B_v1.1.0_190607_macOS_bin/platanus_b
x Platanus_B_v1.1.0_190607_macOS_bin/README


Confirm that the files were installed under `tools`.

In [18]:
ls -l tools/

total 10256
-rw-r--r--  1 nigyta  staff     5587 12  2 21:44 platanus_README
-rwxr-xr-x  1 nigyta  staff  4775076 12  2 21:44 platanus_b


Confirm that the platanus_b version number and help message are displayed.

In [23]:
tools/platanus_b

Platanus version: 1.1.0
tools/platanus_b 

Usage: platanus_b Command [options]

Command: assemble, iterate, scaffold, gap_close, polish, merge


: 1

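To simplify the later steps, add the `tools` directory to the `PATH` environment variable.  
(`$PWD` expands to the directory in which the command is run, so if you do this outside Jupyter Notebook, it may not work correctly depending on the current directory.)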

In [24]:
export PATH=$PWD/tools:$PATH
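
Because `$PWD` is expanded at the moment the `export` runs, the line above only works from the directory that contains `tools`. A minimal sketch of a more defensive variant follows. `add_to_path` is a hypothetical helper name, not part of any tool used here; it resolves the directory to an absolute path, rejects directories that do not exist, and avoids adding the same entry twice.

```bash
# Hypothetical helper: prepend a directory to PATH only if it exists.
add_to_path() {
    local dir
    # Resolve to an absolute path; fail if the directory is missing.
    dir="$(cd -- "$1" 2>/dev/null && pwd)" || {
        echo "no such directory: $1" >&2
        return 1
    }
    case ":$PATH:" in
        *":$dir:"*) ;;               # already on PATH, nothing to do
        *) PATH="$dir:$PATH" ;;
    esac
    export PATH
}
```

For example, `add_to_path tools`, run from the directory that contains `tools`, should have the same effect as the `export` line above.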

Check again that the platanus_b version number and help message are displayed, this time without the `tools/` prefix.

In [26]:
platanus_b

Platanus version: 1.1.0
platanus_b 

Usage: platanus_b Command [options]

Command: assemble, iterate, scaffold, gap_close, polish, merge


: 1

## Obtaining and preprocessing the data

The following steps create a working directory named `analysis` in the current directory and carry out the analysis inside it.

In [34]:
mkdir analysis
cd analysis
pwd

/Users/nigyta/bact_genome/analysis


Create a directory named `reads` and download the read data into it.

In [36]:
mkdir reads
cd reads
pwd

/Users/nigyta/bact_genome/analysis/reads


__Downloading the test data__

This may take several minutes, depending on your network.

In [37]:
curl -O ftp://ftp.ddbj.nig.ac.jp//ddbj_database/dra/fastq/DRA002/DRA002643/DRX022186/DRR024501_1.fastq.bz2
curl -O ftp://ftp.ddbj.nig.ac.jp//ddbj_database/dra/fastq/DRA002/DRA002643/DRX022186/DRR024501_2.fastq.bz2

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  448M  100  448M    0     0  10.8M      0  0:00:41  0:00:41 --:--:-- 11.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  503M  100  503M    0     0  10.5M      0  0:00:47  0:00:47 --:--:-- 11.1M


In [38]:
bunzip2 *bz2

__Checking the downloaded data__

In [39]:
seqkit stats *fastq

file               format  type   num_seqs      sum_len  min_len  avg_len  max_len
DRR024501_1.fastq  FASTQ   DNA   2,971,310  745,798,810      251      251      251
DRR024501_2.fastq  FASTQ   DNA   2,971,310  745,798,810      251      251      251
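
The table shows the same `num_seqs` for both files, which is what paired-end data should look like. As a cross-check that does not need SeqKit, the sketch below counts reads with `awk`, assuming plain four-line FASTQ records (no wrapped sequence lines). `count_fastq_reads` and `same_read_count` are hypothetical helper names.

```bash
# Count reads in an uncompressed FASTQ file.
# Assumption: every record is exactly four lines.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Succeed only if both files of a pair hold the same number of reads.
same_read_count() {
    [ "$(count_fastq_reads "$1")" = "$(count_fastq_reads "$2")" ]
}
```

For example, `same_read_count DRR024501_1.fastq DRR024501_2.fastq && echo OK` could be run in the `reads` directory.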


__Running FastQC__

In [40]:
fastqc *fastq

Started analysis of DRR024501_1.fastq
Approx 5% complete for DRR024501_1.fastq
Approx 10% complete for DRR024501_1.fastq
Approx 15% complete for DRR024501_1.fastq
Approx 20% complete for DRR024501_1.fastq
Approx 25% complete for DRR024501_1.fastq
Approx 30% complete for DRR024501_1.fastq
Approx 35% complete for DRR024501_1.fastq
Approx 40% complete for DRR024501_1.fastq
Approx 45% complete for DRR024501_1.fastq
Approx 50% complete for DRR024501_1.fastq
Approx 55% complete for DRR024501_1.fastq
Approx 60% complete for DRR024501_1.fastq
Approx 65% complete for DRR024501_1.fastq
Approx 70% complete for DRR024501_1.fastq
Approx 75% complete for DRR024501_1.fastq
Approx 80% complete for DRR024501_1.fastq
Approx 85% complete for DRR024501_1.fastq
Approx 90% complete for DRR024501_1.fastq
Approx 95% complete for DRR024501_1.fastq
Analysis complete for DRR024501_1.fastq
Started analysis of DRR024501_2.fastq
Approx 5% complete for DRR024501_2.fastq
Approx 10% complete for DRR024501_2.fastq
...

__Removing adapter sequences__

In [41]:
fastp -i DRR024501_1.fastq -I DRR024501_2.fastq -o DRR024501_1.fastp.fastq -O DRR024501_2.fastp.fastq

Read1 before filtering:
total reads: 2971310
total bases: 745798810
Q20 bases: 696510031(93.3911%)
Q30 bases: 651512094(87.3576%)

Read2 before filtering:
total reads: 2971310
total bases: 745798810
Q20 bases: 630071152(84.4827%)
Q30 bases: 554509571(74.3511%)

Read1 after filtering:
total reads: 2779208
total bases: 676268617
Q20 bases: 647164103(95.6963%)
Q30 bases: 612568020(90.5806%)

Read2 aftering filtering:
total reads: 2779208
total bases: 676268617
Q20 bases: 599957379(88.7158%)
Q30 bases: 536918684(79.3943%)

Filtering result:
reads passed filter: 5558416
reads failed due to low quality: 369366
reads failed due to too many N: 14838
reads failed due to too short: 0
reads with adapter trimmed: 602436
bases trimmed due to adapters: 43092538

Duplication rate: 0.421494%

Insert size peak (evaluated by paired-end reads): 383

JSON report: fastp.json
HTML report: fastp.html

fastp -i DRR024501_1.fastq -I DRR024501_2.fastq -o DRR024501_1.fastp.fastq -O DRR024501_2.fastp.fastq 
fastp
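
The filtering summary can be cross-checked with a little arithmetic: 2 × 2,971,310 = 5,942,620 reads went in, 5,558,416 passed, and the three failure counts (369,366 + 14,838 + 0 = 384,204) account for the difference. The sketch below computes the pass rate from these figures. `pass_rate` is a hypothetical helper name.

```bash
# Percentage of reads that passed the filter.
# Arguments: reads passed, total reads.
pass_rate() {
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * p / t }'
}

total_reads=$((2971310 * 2))     # Read1 + Read2 before filtering
pass_rate 5558416 "$total_reads"
```

With the numbers above, about 93.5% of the reads are kept.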

__Running FastQC again__

In [42]:
fastqc *.fastp.fastq

Started analysis of DRR024501_1.fastp.fastq
Approx 5% complete for DRR024501_1.fastp.fastq
Approx 10% complete for DRR024501_1.fastp.fastq
Approx 15% complete for DRR024501_1.fastp.fastq
Approx 20% complete for DRR024501_1.fastp.fastq
Approx 25% complete for DRR024501_1.fastp.fastq
Approx 30% complete for DRR024501_1.fastp.fastq
Approx 35% complete for DRR024501_1.fastp.fastq
Approx 40% complete for DRR024501_1.fastp.fastq
Approx 45% complete for DRR024501_1.fastp.fastq
Approx 50% complete for DRR024501_1.fastp.fastq
Approx 55% complete for DRR024501_1.fastp.fastq
Approx 60% complete for DRR024501_1.fastp.fastq
Approx 65% complete for DRR024501_1.fastp.fastq
Approx 70% complete for DRR024501_1.fastp.fastq
Approx 75% complete for DRR024501_1.fastp.fastq
Approx 80% complete for DRR024501_1.fastp.fastq
Approx 85% complete for DRR024501_1.fastp.fastq
Approx 90% complete for DRR024501_1.fastp.fastq
Approx 95% complete for DRR024501_1.fastp.fastq
Analysis complete for DRR024501_1.fastp.fastq

__Checking the fastp results__

In [43]:
seqkit stats *fastq

file                     format  type   num_seqs      sum_len  min_len  avg_len  max_len
DRR024501_1.fastp.fastq  FASTQ   DNA   2,779,208  676,268,617       31    243.3      251
DRR024501_1.fastq        FASTQ   DNA   2,971,310  745,798,810      251      251      251
DRR024501_2.fastp.fastq  FASTQ   DNA   2,779,208  676,268,617       31    243.3      251
DRR024501_2.fastq        FASTQ   DNA   2,971,310  745,798,810      251      251      251


__Subsampling the data__

Extract 20% of the data so that the coverage is about 100x. (The genome size is assumed to be about 2.5 Mbp.)

In [44]:
seqkit sample -p 0.2 DRR024501_1.fastp.fastq > DRR024501_1.sampled.fastp.fastq
seqkit sample -p 0.2 DRR024501_2.fastp.fastq > DRR024501_2.sampled.fastp.fastq

[INFO] sample by proportion
[INFO] 556074 sequences outputted
[INFO] sample by proportion
[INFO] 556074 sequences outputted
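
The 20% figure can be checked against the numbers reported earlier. After fastp, each file holds 676,268,617 bases, so the pair holds about 1.35 Gbp. With the assumed 2.5 Mbp genome that is roughly 541x, and 20% of it is roughly 108x. A small sketch of that arithmetic follows. `sampling_proportion` is a hypothetical helper name, and the genome size is the assumption stated in the text.

```bash
# Proportion of reads to keep for a target coverage.
# Arguments: total bases, genome size (bp), target coverage.
sampling_proportion() {
    awk -v b="$1" -v g="$2" -v t="$3" 'BEGIN { printf "%.3f\n", t / (b / g) }'
}

total_bases=$((676268617 * 2))   # both files of the pair, after fastp
sampling_proportion "$total_bases" 2500000 100
```

The result is a little under 0.2, which is consistent with the `-p 0.2` used in the cell.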


__Checking the subsampled data__

In [None]:
seqkit stats *.sampled.fastp.fastq

## Assembly

Preparing the directory  
Create a working directory named `assembly`.

In [47]:
cd ../
mkdir assembly
cd assembly
pwd

/Users/nigyta/bact_genome/analysis/assembly


(Check) The read data are stored in the `reads` directory, one level up.

In [48]:
ls -l ../reads/

total 15816912
-rw-r--r--  1 nigyta  staff  1770380254 12  2 22:06 DRR024501_1.fastp.fastq
-rw-r--r--  1 nigyta  staff      231274 12  2 22:06 DRR024501_1.fastp_fastqc.html
-rw-r--r--  1 nigyta  staff      251019 12  2 22:06 DRR024501_1.fastp_fastqc.zip
-rw-r--r--  1 nigyta  staff  1938304424 12  2 21:56 DRR024501_1.fastq
-rw-r--r--  1 nigyta  staff   314019940 12  2 22:07 DRR024501_1.sampled.fastp.fastq
-rw-r--r--  1 nigyta  staff      239558 12  2 22:03 DRR024501_1_fastqc.html
-rw-r--r--  1 nigyta  staff      268951 12  2 22:03 DRR024501_1_fastqc.zip
-rw-r--r--  1 nigyta  staff  1770380254 12  2 22:06 DRR024501_2.fastp.fastq
-rw-r--r--  1 nigyta  staff      231004 12  2 22:07 DRR024501_2.fastp_fastqc.html
-rw-r--r--  1 nigyta  staff      253422 12  2 22:07 DRR024501_2.fastp_fastqc.zip
-rw-r--r--  1 nigyta  staff  1938304424 12  2 21:56 DRR024501_2.fastq
-rw-r--r--  1 nigyta  staff   314019940 12  2 22:08 DRR024501_2.sampled.fastp.fastq
-rw-r--r--  1 nigyta  staff      244115 12  2 22

__Assembly 1 (the assemble step of platanus_b)__

In [52]:
platanus_b assemble -t 2 -f ../reads/DRR024501_1.sampled.fastp.fastq ../reads/DRR024501_2.sampled.fastp.fastq

Platanus version: 1.1.0
platanus_b assemble -t 2 -f ../reads/DRR024501_1.sampled.fastp.fastq ../reads/DRR024501_2.sampled.fastp.fastq 

K = 32, saving kmers from reads...
AVE_READ_LEN=243.268

KMER_EXTENSION:
K=32, KMER_COVERAGE=90.322 (>= 7), COVERAGE_CUTOFF=7
K=42, KMER_COVERAGE=86.0669, COVERAGE_CUTOFF=7, PROB_SPLIT=10e-inf
K=52, KMER_COVERAGE=81.8118, COVERAGE_CUTOFF=7, PROB_SPLIT=10e-inf
K=62, KMER_COVERAGE=77.5567, COVERAGE_CUTOFF=7, PROB_SPLIT=10e-inf
K=72, KMER_COVERAGE=73.3016, COVERAGE_CUTOFF=7, PROB_SPLIT=10e-inf
K=82, KMER_COVERAGE=69.0465, COVERAGE_CUTOFF=7, PROB_SPLIT=10e-inf
K=92, KMER_COVERAGE=64.7914, COVERAGE_CUTOFF=7, PROB_SPLIT=10e-inf
K=102, KMER_COVERAGE=60.5363, COVERAGE_CUTOFF=7, PROB_SPLIT=10e-inf
K=112, KMER_COVERAGE=56.2812, COVERAGE_CUTOFF=7, PROB_SPLIT=10e-15.6536
K=122, KMER_COVERAGE=52.0261, COVERAGE_CUTOFF=7, PROB_SPLIT=10e-14.0625
K=132, KMER_COVERAGE=47.771, COVERAGE_CUTOFF=7, PROB_SPLIT=10e-12.4307
K=142, KMER_COVERAGE=43.5159, COVERAGE_CUTOFF=7, PROB

NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=39
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=13
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=13
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
extracting reads (containing kmer used in contig assemble)...
K = 132, loading kmers from contigs...
K = 132, saving additional kmers(not found in contigs) from reads...
COVERAGE_CUTOFF = 7
loading kmers...
connecting kmers...
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=35
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=35
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=8
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVE

K=172, KMER_COVERAGE=36.8684, COVERAGE_CUTOFF=5, PROB_SPLIT=10e-10.0356
K=182, KMER_COVERAGE=31.7667, COVERAGE_CUTOFF=3, PROB_SPLIT=10e-10.0245
K=192, KMER_COVERAGE=26.6651, COVERAGE_CUTOFF=1, PROB_SPLIT=10e-10.5391
K=195, KMER_COVERAGE=25.1346, COVERAGE_CUTOFF=1, PROB_SPLIT=10e-10.3138
K=196, KMER_COVERAGE=24.6244, COVERAGE_CUTOFF=1, PROB_SPLIT=10e-10.3932
K=197, KMER_COVERAGE=24.1143, COVERAGE_CUTOFF=1, PROB_SPLIT=10e-10.1717
loading kmers...
connecting kmers...
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=0
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=23486
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=23486
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
extracting reads (containing kmer used in contig assemb

K = 142, loading kmers from contigs...
K = 142, saving additional kmers(not found in contigs) from reads...
COVERAGE_CUTOFF = 12
loading kmers...
connecting kmers...
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=100
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=100
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=17
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=17
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
extracting reads (containing kmer used in contig assemble)...
K = 152, loading kmers from contigs...
K = 152, saving additional kmers(not found in contigs) from reads...
COVERAGE_CUTOFF = 9
loading kmers...
connecting kmers...
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVE

K=227, KMER_COVERAGE=25.7046, COVERAGE_CUTOFF=1, PROB_SPLIT=10e-10.3852
K=228, KMER_COVERAGE=24.216, COVERAGE_CUTOFF=1, PROB_SPLIT=10e-10.2159
loading kmers...
connecting kmers...
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=0
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=578
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=578
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
extracting reads (containing kmer used in contig assemble)...
K = 42, loading kmers from contigs...
K = 42, saving additional kmers(not found in contigs) from reads...
COVERAGE_CUTOFF = 180
loading kmers...
connecting kmers...
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=

BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=11
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=11
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
extracting reads (containing kmer used in contig assemble)...
K = 152, loading kmers from contigs...
K = 152, saving additional kmers(not found in contigs) from reads...
COVERAGE_CUTOFF = 66
loading kmers...
connecting kmers...
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=7
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=7
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=0
TOTAL_NUM_CUT=0
removing erroneous nodes...
NUM_REMOVED_NODES=10
NUM_REMOVED_NODES=0
TOTAL_NUM_REMOVED_NODES=10
removing branches...
BRANCH_DELETE_THRESHOLD=0.5
NUM_CUT=

If the command above does not run correctly, check that the `tools` directory is on `PATH`, and run
```
ls ../reads/DRR024501_1.sampled.fastp.fastq ../reads/DRR024501_2.sampled.fastp.fastq
```
to confirm that the files exist in the `reads` directory.
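
The two checks described above can be combined into one small pre-flight step. This is only a sketch; `check_tool` and `check_inputs` are hypothetical helper names.

```bash
# Succeed only if the named command is found on PATH.
check_tool() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "not on PATH: $1" >&2
        return 1
    }
}

# Succeed only if every listed file exists and is not empty.
check_inputs() {
    local status=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_tool platanus_b && check_inputs ../reads/DRR024501_1.sampled.fastp.fastq ../reads/DRR024501_2.sampled.fastp.fastq` reports which of the two conditions is not met.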

Checking the assembly result (out_contig.fa)

In [53]:
seqkit stats -a -G N out_contig.fa

file           format  type  num_seqs    sum_len  min_len  avg_len  max_len   Q1   Q2     Q3  sum_gap     N50  Q20(%)  Q30(%)
out_contig.fa  FASTA   DNA        369  2,445,694      228  6,627.9   88,589  245  393  7,138        0  28,870       0       0


__Assembly 2 (the iterate step of platanus_b)__

This step takes several tens of minutes to run.

In [54]:
platanus_b iterate -t 2 -c out_contig.fa -IP1 ../reads/DRR024501_1.sampled.fastp.fastq ../reads/DRR024501_2.sampled.fastp.fastq

Platanus version: 1.1.0
platanus_b iterate -t 2 -c out_contig.fa -IP1 ../reads/DRR024501_1.sampled.fastp.fastq ../reads/DRR024501_2.sampled.fastp.fastq 


#### PROCESS INFORMATION ####
VmPeak:           0.000 GByte
VmHWM:            0.000 GByte


Checking the assembly result (out_iterativeAssembly.fa)

In [55]:
seqkit stats -a -G N out_iterativeAssembly.fa

file                      format  type  num_seqs    sum_len  min_len   avg_len  max_len   Q1      Q2      Q3  sum_gap     N50  Q20(%)  Q30(%)
out_iterativeAssembly.fa  FASTA   DNA         55  2,365,587      246  43,010.7  257,852  709  14,430  69,725    2,619  96,297       0       0
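
Comparing the two tables, the iterate step reduced the number of sequences from 369 to 55 and raised N50 from 28,870 to 96,297. For readers unfamiliar with N50, the sketch below computes it under the common definition: sort the sequences from longest to shortest, add up their lengths, and report the length of the sequence at which the running total first reaches half of the total length. `fasta_lengths` and `n50` are hypothetical helper names, and the result is not guaranteed to match SeqKit in every edge case.

```bash
# Print the length of each sequence in a FASTA file, one per line.
fasta_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { if (seen) print len }' "$1"
}

# Read lengths from stdin and print N50.
n50() {
    sort -rn | awk '{ lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += lens[i]
                if (cum * 2 >= total) { print lens[i]; exit }
            }
        }'
}
```

For example, `fasta_lengths out_iterativeAssembly.fa | n50` could be compared with the N50 column above.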


## Annotation

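In the main text, the assembly result file (out_iterativeAssembly.fa) is uploaded to the [DFAST web service](https://dfast.nig.ac.jp) for annotation.  
After the main text was written, the local version of DFAST became installable through bioconda, so this section shows how to install and use the local version.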


### Installing the local version of DFAST

Because reference database files are downloaded during installation, this may take from a few minutes to about 10 minutes, depending on your network.

In [57]:
conda install -y -c bioconda dfast

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/nigyta/miniconda3

  added / updated specs:
    - dfast


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    aragorn-1.2.38             |                2         134 KB  bioconda
    barrnap-0.9                |                3         556 KB  bioconda
    bedtools-2.29.0            |       h37cfd92_3         860 KB  bioconda
    biopython-1.74             |   py37h1de35cc_0         2.0 MB
    blas-1.0                   |              mkl           6 KB
    blast-2.6.0                |      boost1.64_2       113.0 MB  bioconda
    bzip2-1.0.8                |       h1de35cc_0          75 KB
    dfast-1.2.3                |      py37pl526_0        12.8 MB  bioconda
    ghostx-1.3.7               |                0         274 KB  bioconda
    hmmer-3.2.1  

Check that it works.

In [None]:
dfast -v

### Running DFAST

The first run takes longer because the reference database files are indexed. Subsequent runs are faster.

In [None]:
dfast --genome out_iterativeAssembly.fa

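The result files are written to the `OUT` directory. The output includes, among others:
- genome.gbk (annotation results in GenBank format)
- genome.gff (annotation results in GFF format)
- protein.faa (amino acid sequences of the annotated genes, in FASTA format)
- cds.fna (CDS nucleotide sequences of the annotated genes, in FASTA format)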
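
Once DFAST has finished, a quick way to see how many protein sequences were written is to count the FASTA header lines in `protein.faa`. A minimal sketch, assuming the default `OUT` directory named above; `count_fasta_records` is a hypothetical helper name.

```bash
# Count the records (header lines starting with ">") in a FASTA file.
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_fasta_records OUT/protein.faa` and `count_fasta_records OUT/cds.fna` could be compared with each other.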

For detailed usage, see the help (`dfast -h`) or the documentation on the [official site](https://github.com/nigyta/dfast_core).