# Convert .xmap to .bed for ALLMAPS

This file explains how to convert a Bionano .xmap file to a .bed file that is suitable as input for ALLMAPS. I struggled a lot to find a script for this, and I found it hard to find information on the Bionano output. That is why this file explains the conversion step by step. For those who just want the job done, there is the file `xmap2allmaps.py`.

The **.xmap** file contains the following columns:

- #h XmapEntryID : the ID of the current row
- QryContigID : the ID of the query map. In the file used here this is the optical mapping Super_Scaffold name/number; this is how the rest of this document uses it.
- RefContigID : the ID of the reference map. In the file used here this is the NGS contig. Which contig name corresponds to which ID is specified somewhere in the Bionano output files (in my case the file had 'keys' in its name).
- QryStartPos : where on the optical mapping pseudomolecule the match starts
- QryEndPos : where on the optical mapping pseudomolecule the match ends
- RefStartPos : where on the NGS contig the match starts
- RefEndPos : where on the NGS contig the match ends
- Orientation : the orientation of the match ('+' or '-')
- Confidence : a confidence score
- HitEnum, QryLen, RefLen, LabelChannel, Alignment : these can be deleted; they do not contain information that belongs in the .bed file.

My input files are named 'ngsbased.xmap' and 'names.txt' (the code below still uses the original, long name of the .xmap file). There were several .xmap files in my Bionano output; I work with the one based on the NGS contigs. Its name contains something like 'NGS' rather than only 'BNG'. 'NGS' stands for 'Next generation sequencing', 'BNG' for 'Bionano Genomics'.

'names.txt' contains information on all Contigs, not just the ones used for optical mapping, so this list can be much longer than the xmap file.

In [1]:
import pandas as pd
import numpy as np
data = pd.read_table("GCTCTTC_EXP_REFINEFINAL1_bppAdjust_cmap_P_EXSERTA_contigs_v1_1_3_fasta_BNGcontigs_NGScontigs.xmap", header=8) # my file had 8 header rows before the column names; they are not needed for the conversion
names = pd.read_table("names.txt")
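
The `header=8` above is specific to my file. As a hedged alternative, the sketch below looks up the `#h` column line itself and lets pandas skip every other line that starts with `#`. The function name `read_xmap` and the shortened sample text are made up for illustration; the sketch assumes that the column names sit on the line starting with `#h` and that the columns are tab-separated.

```python
import io
import pandas as pd

def read_xmap(text):
    """Parse .xmap text into a DataFrame without hard-coding the header length.

    Assumes the column names are on the line starting with '#h' and that all
    other lines starting with '#' (including the '#f' type line) are comments.
    """
    header_line = next(line for line in text.splitlines() if line.startswith("#h"))
    columns = header_line[2:].strip().split("\t")
    # comment='#' makes pandas ignore every line that starts with '#'
    return pd.read_csv(io.StringIO(text), sep="\t", comment="#",
                       header=None, names=columns)

# Shortened, made-up sample with only a few of the real columns
sample = (
    "# XMAP File Version:\t0.2\n"
    "#h XmapEntryID\tQryContigID\tRefContigID\tRefStartPos\tRefEndPos\tOrientation\n"
    "#f int\tint\tint\tfloat\tfloat\tstring\n"
    "1\t9\t6\t10905.0\t94125.0\t-\n"
    "2\t1006\t11\t10163.0\t137359.0\t-\n"
)
xmap = read_xmap(sample)
print(xmap.dtypes)
```

With this reader the first column is called `XmapEntryID` instead of `#h XmapEntryID`, and the `#f` row never enters the data frame, so the numeric columns are parsed as numbers directly.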

In [2]:
data.head()

Unnamed: 0,#h XmapEntryID,QryContigID,RefContigID,QryStartPos,QryEndPos,RefStartPos,RefEndPos,Orientation,Confidence,HitEnum,QryLen,RefLen,LabelChannel,Alignment
0,#f int,int,int,float,float,float,float,string,float,string,float,float,int,string
1,1,9,6,1979830.2,1896561.4,10905.0,94125.0,-,13.24,6M1D2M1D1M1D3M,3091368.2,116084.0,1,"(1,235)(2,234)(3,233)(4,232)(5,231)(6,230)(7,2..."
2,2,1006,11,1133607.3,1006319.4,10163.0,137359.0,-,12.67,10M1D1M,2625477.9,180532.0,1,"(1,93)(2,93)(3,92)(4,91)(5,90)(6,89)(7,88)(8,8..."
3,3,862,21,3304915.3,3150323.3,5154.0,160399.0,-,21.40,15M1D1M,5362440.5,169247.0,1,"(1,302)(2,302)(3,301)(4,300)(5,299)(6,298)(7,2..."
4,4,705,40,1983671.9,1810779.2,2768.0,175088.0,-,23.47,6M1D5M1D2M1D5M1D2M1D1M,2799289.6,175494.0,1,"(1,176)(2,176)(3,175)(4,174)(5,173)(6,172)(7,1..."


In [3]:
names.head()

Unnamed: 0,CompntId,CompntName
0,1,Peex113Ctg00001
1,2,Peex113Ctg00002
2,3,Peex113Ctg00003
3,4,Peex113Ctg00004
4,5,Peex113Ctg00005


Delete unneeded columns and rows.

In [4]:
data = data.drop(['Alignment', 'HitEnum', 'QryLen', 'RefLen', 'LabelChannel'], axis=1)
data = data.drop(data.index[0]) # delete row 0, which only holds the '#f' column types

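The start and end columns of the .bed file cannot contain floats, so the reference positions need to be rounded and converted to type 'int'. The query positions stay floats because they are only used inside the name column.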

In [5]:
pos_cols = ['QryStartPos', 'QryEndPos', 'RefStartPos', 'RefEndPos']
data[pos_cols] = data[pos_cols].astype('float')
# round the reference positions up before converting them to int
# QryStartPos and QryEndPos are left out on purpose, they may stay floats
data[['RefStartPos', 'RefEndPos']] = np.ceil(data[['RefStartPos', 'RefEndPos']]).astype('int')
data.dtypes # shows the type of every column in the data frame

#h XmapEntryID     object
QryContigID        object
RefContigID        object
QryStartPos       float64
QryEndPos         float64
RefStartPos         int64
RefEndPos           int64
Orientation        object
Confidence         object
dtype: object

Rename the contig ID column so it matches the first column of 'names'.

In [6]:
# rename() leaves the frame unchanged if there is no 'RefContigID' column
data = data.rename(columns={'RefContigID': 'CompntId'})

Convert the 'CompntId' column of data to int so it can be merged with names.

In [7]:
data[['CompntId']] = data[['CompntId']].astype('int')

Merge data and names dataframes.

In [8]:
data = names.merge(data, on='CompntId')
data.head()

Unnamed: 0,CompntId,CompntName,#h XmapEntryID,QryContigID,QryStartPos,QryEndPos,RefStartPos,RefEndPos,Orientation,Confidence
0,6,Peex113Ctg00006,1,9,1979830.2,1896561.4,10905,94125,-,13.24
1,11,Peex113Ctg00011,2,1006,1133607.3,1006319.4,10163,137359,-,12.67
2,21,Peex113Ctg00021,3,862,3304915.3,3150323.3,5154,160399,-,21.4
3,40,Peex113Ctg00040,4,705,1983671.9,1810779.2,2768,175088,-,23.47
4,43,Peex113Ctg00043,5,267,697847.4,900790.6,8048,211715,+,25.72
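
The merge above uses the pandas default `how='inner'`, so alignments whose `CompntId` is missing from 'names' are dropped without any message. A minimal sketch of how such rows could be listed first, using two made-up miniature tables (the ID 999 and the variable names `alignments`, `checked` and `unmatched` are only for illustration):

```python
import pandas as pd

# Made-up miniature versions of the two tables used above
names = pd.DataFrame({"CompntId": [6, 11],
                      "CompntName": ["Peex113Ctg00006", "Peex113Ctg00011"]})
alignments = pd.DataFrame({"CompntId": [6, 11, 999],
                           "RefStartPos": [10905, 10163, 500]})

# how='left' keeps every alignment; indicator=True adds a '_merge' column
# that says whether a matching row was found in 'names'
checked = alignments.merge(names, on="CompntId", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["CompntId"].tolist())
```

If the printed list is empty, the inner merge loses nothing.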


## Create bedfile
Start creating columns of the new bed file named 'bed'.

In [9]:
bed = pd.DataFrame(data['CompntName'])
# start and end are the reference (NGS contig) positions
bed['start'] = data['RefStartPos']
bed['end'] = data['RefEndPos']
bed['name'] = "SuperScaffold" + data['QryContigID'] + ":" + data['QryStartPos'].astype(str)
bed['score'] = data['Confidence']
bed['orientation'] = data['Orientation']

In [10]:
bed.head()

Unnamed: 0,CompntName,start,end,name,score,orientation
0,Peex113Ctg00006,10905,94125,SuperScaffold9:1979830.2,13.24,-
1,Peex113Ctg00011,10163,137359,SuperScaffold1006:1133607.3,12.67,-
2,Peex113Ctg00021,5154,160399,SuperScaffold862:3304915.3,21.4,-
3,Peex113Ctg00040,2768,175088,SuperScaffold705:1983671.9,23.47,-
4,Peex113Ctg00043,8048,211715,SuperScaffold267:697847.4,25.72,+
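
To make the column mapping explicit, here is the same conversion for a single row, written without pandas. This is only a sketch: the helper name `xmap_row_to_bed_line` is made up, and the row is passed as a dict of strings holding the values of the first alignment shown above.

```python
import math

def xmap_row_to_bed_line(row, contig_name):
    """Build one tab-separated .bed line from one .xmap row (a dict of strings)."""
    # reference positions are rounded up and written as integers
    start = math.ceil(float(row["RefStartPos"]))
    end = math.ceil(float(row["RefEndPos"]))
    # name column: map scaffold name, ':' and the position on that scaffold
    name = "SuperScaffold" + row["QryContigID"] + ":" + row["QryStartPos"]
    return "\t".join([contig_name, str(start), str(end), name,
                      row["Confidence"], row["Orientation"]])

first_row = {
    "QryContigID": "9", "QryStartPos": "1979830.2",
    "RefStartPos": "10905.0", "RefEndPos": "94125.0",
    "Orientation": "-", "Confidence": "13.24",
}
bed_line = xmap_row_to_bed_line(first_row, "Peex113Ctg00006")
print(bed_line)
```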


Note: because start and end are taken from RefStartPos and RefEndPos, rows on the + and - strands are already formatted correctly. An earlier version used QryStartPos and QryEndPos. There, the start and end positions are swapped for rows with orientation -, and since start and end must be ascending in the .bed file, they had to be swapped back. The commented-out cell below is kept from that version.

In [85]:
#new = bed[bed.orientation == '-']
#new = new[['CompntName', 'end', 'start', 'name', 'score', 'orientation']]
#new.rename(columns={'end':'start', 'start':'end'}, inplace=True)
#bed[bed.orientation == '-'] = new

The .bed table is ready now; it only needs to be written to a tab-delimited file.

In [11]:
bed.to_csv('bionanonew.bed', sep='\t', header=False, index=False)
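
Before feeding the file to ALLMAPS, a quick sanity check of the written lines can save time. The sketch below tests only the points mentioned in this document (no empty lines, six tab-separated fields, integer start and end in ascending order). The function name `check_bed_lines` is made up, and the checks are my own assumptions, not an ALLMAPS validator.

```python
def check_bed_lines(lines):
    """Return a list of (line number, problem) tuples for the given .bed lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            problems.append((number, "empty line"))
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            problems.append((number, "expected 6 tab-separated fields"))
            continue
        start, end = fields[1], fields[2]
        if not (start.isdigit() and end.isdigit()):
            problems.append((number, "start or end is not an integer"))
        elif int(start) > int(end):
            problems.append((number, "start is larger than end"))
    return problems

example = [
    "Peex113Ctg00006\t10905\t94125\tSuperScaffold9:1979830.2\t13.24\t-\n",
    "Peex113Ctg00011 10163 137359 SuperScaffold1006:1133607.3 12.67 -\n",  # spaces, not tabs
    "\n",
    "Peex113Ctg00021\t5154.0\t160399\tSuperScaffold862:3304915.3\t21.4\t-\n",
    "Peex113Ctg00040\t175088\t2768\tSuperScaffold705:1983671.9\t23.47\t-\n",
]
problems = check_bed_lines(example)
for number, problem in problems:
    print(number, problem)
```

For the real file, the function can be called on the open file handle, e.g. `check_bed_lines(open('bionanonew.bed'))`.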

# Avoid bugs
I ran into several issues while reformatting the files. I list them here in case you adapt the script for your own purposes:

- The name column (column 4 in the .bed file) consists of a name and a location. The name must not contain any underscores \_ , dashes - or periods; the only special character allowed is ':'. E.g. don't change the name to 'Super_Scaffold_11:17683' but leave it as 'SuperScaffold11:17683'.
Dashes are especially problematic: in the merged .bed file (in case the optical mapping data is merged with other map data), the name changes to 'nameofmap-SuperScaffold11:17683'. If there is an additional dash, as in 'nameofmap-SuperScaffold-11:17683', ALLMAPS will interpret 11:17683 as a location. The resulting scaffolding output will then not contain any information from this map.

- Make sure the 'CompntName' in the .bed file matches the sequence names in the .fasta file you will feed into ALLMAPS.
- Make sure the .bed file is tab-delimited.
- If you have any other maps, like genetic maps, make sure the linkage group names do not contain any special characters: change a name like 'L.1' to something like 'L1'.
- An error like `AttributeError: 'ScaffoldOO' object has no attribute 'object'` means there is something wrong with the names. In my case I had special characters in the wrong places in my .bed file.
- I compared my own file a lot with the [sample .bed file](https://figshare.com/articles/ALLMAPS_supporting_data_Medicago_genome_assembly/1057745). In case you have more problems with the conversion, this might help you as well.
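
The naming rule from the first point of this list can be checked automatically. The regular expression below is my own reading of that rule (letters and digits, one ':', then a position), not something taken from ALLMAPS; `NAME_PATTERN` and `bad_marker_names` are made-up names.

```python
import re

# Letters and digits only, one ':', then a position such as 17683 or 1979830.2
NAME_PATTERN = re.compile(r"^[A-Za-z0-9]+:[0-9]+(\.[0-9]+)?$")

def bad_marker_names(names):
    """Return the entries of the name column that break the naming rule."""
    return [name for name in names if not NAME_PATTERN.match(name)]

bad = bad_marker_names([
    "SuperScaffold11:17683",
    "Super_Scaffold_11:17683",   # underscores
    "SuperScaffold-11:17683",    # dash
    "SuperScaffold9:1979830.2",
])
print(bad)
```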



### Conversion problems / things to consider (copied in without revision, TODO)
* .bed format: NGS-contig name, NGS-start position, NGS-end position, Name of optical mapping SuperScaffold and its starting position (e.g. SS40:340.2), score, strand
    * the map position needs to be given with the name of the map scaffold
    * RefStart and not QryStart (had QryStart first)
* .bed is tab-delimited
* 577 unique contigs in OM data
* names of Peex113Ctgs must match with scaffolds.fasta file (they do)
* no empty lines
* no '\_' and '.' together in file names, neither in Scaffold names (not sure if the second is necessary)
    * genetic map linkage group not 'L.1' but 'L1'
* no '-' in Scaffold names, dash is used to specify the map in ALLMAPS, e.g. if it is the opticalmap or the geneticmap

* Problem: only geneticmap information is used for allmaps, the optical mapping info appears in the "unmapped" file.
    * run break of chimeric contigs first?
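
One of the checks listed above, that the contig names in the .bed file also occur in the FASTA file, can be sketched as follows. The helper names and the tiny example inputs are made up; the sketch assumes that the sequence name is the first word after '>' on each FASTA header line.

```python
def fasta_names(lines):
    """Return the sequence names: the first word after '>' on each header line."""
    return {line[1:].split()[0] for line in lines
            if line.startswith(">") and line[1:].strip()}

def bed_sequence_names(lines):
    """Return the names in the first column of the non-empty .bed lines."""
    return {line.split("\t")[0] for line in lines if line.strip()}

fasta = [">Peex113Ctg00006 some description\n", "ACGT\n",
         ">Peex113Ctg00011\n", "GGCC\n"]
bed = ["Peex113Ctg00006\t10905\t94125\tSuperScaffold9:1979830.2\t13.24\t-\n",
       "Peex113Ctg00099\t1\t2\tSuperScaffold1:5.0\t10.0\t+\n"]

# names that are used in the .bed file but missing from the FASTA file
missing = sorted(bed_sequence_names(bed) - fasta_names(fasta))
print(missing)
```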