# Read and Standardize Mutation Information
This notebook reads a .csv or .tsv file with one mutation per line. It is a template that you can modify for your specific use case.

To prepare your data for subsequent analysis, you need to:

1. Read the file with your mutation information
2. Create a column 'var_id' with the genomic location using the [HGVS sequence variant nomenclature](http://varnomen.hgvs.org/recommendations/general/), e.g. chr5:g.149440497C>T
3. Filter out any variants that are not single-nucleotide variants (SNVs)
4. Save the file as 'mutations.csv'

The mutations.csv file is the input for the next step: MapTo3DStructures.

In [1]:
import pandas as pd

#### Input parameters (specify your input file name below)

In [2]:
input_file_name = "../data/example-grch37.tsv"

#input_file_name = <path to your input file> # mutation info (chromosome number and position required)

output_file_name = 'mutations.csv' # contains mutation info in standard format (e.g., chr5:g.149440497C>T)

## Read input file and remove mutations that are not SNVs

In [3]:
df = pd.read_csv(input_file_name, sep='\t')
pd.options.display.max_columns = None # show all columns
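The cell above hard-codes a tab separator, while the introduction mentions both .csv and .tsv input. A minimal sketch of one way to choose the separator from the file extension follows; `guess_separator` is a hypothetical helper, not part of this notebook, and it assumes the extension reflects the actual delimiter.

```python
from pathlib import Path


def guess_separator(file_name):
    """Return a tab for .tsv/.tab files and a comma for everything else.

    Hypothetical helper: it trusts the file extension and does not
    inspect the file contents.
    """
    suffix = Path(file_name).suffix.lower()
    return "\t" if suffix in {".tsv", ".tab"} else ","


# Possible use in this notebook (assumes pandas is imported as pd):
# df = pd.read_csv(input_file_name, sep=guess_separator(input_file_name))
```

If the extension is unreliable for your files, passing the separator explicitly is the safer choice.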

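Filter out any variants that are not SNVs. The filter below is commented out; the example file has no `ANN[*].EFFECT` column. Uncomment and adapt it if your input includes that annotation.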

In [4]:
# df = df[df['ANN[*].EFFECT'] == 'missense_variant']
df

Unnamed: 0,ID,#CHROM,POS,REF,ALT
0,rs147776857,6,52619766,C,T
1,rs121913460,9,133738358,A,T
2,rs34933751,11,5246945,G,T
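When no effect annotation is available, SNVs can also be selected from the REF and ALT columns alone. The sketch below is one possible approach, not the notebook's own method: `keep_snvs` is a hypothetical helper, and it assumes that an SNV is a row where REF and ALT are each a single, different A, C, G or T base. The rows in `example` are made up for illustration.

```python
import pandas as pd


def keep_snvs(frame, ref_col="REF", alt_col="ALT"):
    """Keep rows whose REF and ALT are each a single, different A/C/G/T base."""
    bases = ["A", "C", "G", "T"]
    ref = frame[ref_col].astype(str).str.upper()
    alt = frame[alt_col].astype(str).str.upper()
    is_snv = ref.isin(bases) & alt.isin(bases) & (ref != alt)
    return frame[is_snv].reset_index(drop=True)


# Made-up rows: one substitution, one deletion, one insertion
example = pd.DataFrame({
    "#CHROM": [6, 9, 11],
    "POS": [100, 200, 300],
    "REF": ["C", "AT", "G"],
    "ALT": ["T", "A", "GTT"],
})
snvs = keep_snvs(example)  # only the first row remains
```

Multi-allelic ALT values such as `A,T` are dropped by this check; split them into separate rows first if you need to keep them.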


## Create a new column `var_id` with a standard variant identifier

In [5]:
def var_id(chrom, pos, ref, alt):
    # Build an HGVS-style genomic identifier, e.g. chr5:g.149440497C>T
    return "chr" + str(chrom) + ":g." + str(pos) + ref + ">" + alt

In [6]:
df['var_id'] = df.apply(lambda x: var_id(x['#CHROM'], x['POS'], x['REF'], x['ALT']), axis=1)

In [7]:
df

Unnamed: 0,ID,#CHROM,POS,REF,ALT,var_id
0,rs147776857,6,52619766,C,T,chr6:g.52619766C>T
1,rs121913460,9,133738358,A,T,chr9:g.133738358A>T
2,rs34933751,11,5246945,G,T,chr11:g.5246945G>T
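For large files, a row-wise `apply` can be slow. As an alternative sketch, the same identifier can be built with column-wise string concatenation. This assumes REF and ALT are already strings with no missing values; the frame `example` reuses the three rows shown above.

```python
import pandas as pd

example = pd.DataFrame({
    "#CHROM": [6, 9, 11],
    "POS": [52619766, 133738358, 5246945],
    "REF": ["C", "A", "G"],
    "ALT": ["T", "T", "T"],
})

# Convert the numeric columns to strings, then concatenate whole columns at once
example["var_id"] = (
    "chr" + example["#CHROM"].astype(str)
    + ":g." + example["POS"].astype(str)
    + example["REF"] + ">" + example["ALT"]
)
```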


In [8]:
df.to_csv(output_file_name, index=False)
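Since the next step depends on the `var_id` format, a quick sanity check before moving on may be useful. The sketch below is an illustration only: `invalid_var_ids` and `VAR_ID_PATTERN` are hypothetical names, and the pattern assumes chromosomes 1 to 22, X, Y, M or MT and single-base substitutions.

```python
import pandas as pd

# Assumed pattern: chr + (1-22, X, Y, M or MT) + ":g." + position + REF>ALT
VAR_ID_PATTERN = r"^chr(?:[1-9]|1\d|2[0-2]|X|Y|MT?):g\.[1-9]\d*[ACGT]>[ACGT]$"


def invalid_var_ids(frame, column="var_id"):
    """Return the identifiers in `column` that do not match VAR_ID_PATTERN."""
    ok = frame[column].astype(str).str.match(VAR_ID_PATTERN)
    return frame.loc[~ok, column].tolist()


# Made-up identifiers: one valid, two malformed
check = pd.DataFrame({
    "var_id": ["chr6:g.52619766C>T", "chr9:g.133738358AT>T", "6:g.1C>T"],
})
bad = invalid_var_ids(check)
```

In this notebook, `invalid_var_ids(df)` should return an empty list if every row was formatted as expected.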

## Now run the next step
Map mutations to 3D Structure: [2-MapTo3DStructures.ipynb](./2-MapTo3DStructures.ipynb)