# Read and Standardize Mutation Information
This notebook reads a .csv file with one mutation per line. It is a template that you can modify for your specific use case.

For subsequent analysis, the input file must contain at least the following 3 fields:
* **var_id** with the genomic location using the [HGVS sequence variant nomenclature](http://varnomen.hgvs.org/recommendations/general/) but without the "chr" prefix, e.g., 5:g.149440497C>T
* **annotation** short label, e.g., cancer type
* **color** color for visualization ([list of colors](https://github.com/3dmol/3Dmol.js/blob/master/3Dmol/colors.js#L45-L192)), e.g., to color by cancer type
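
As a quick check of the requirement above, the following minimal sketch reports which of the three required columns are absent from a DataFrame. The helper name `missing_required_columns` and the one-row example frame are hypothetical and only serve as illustration.

```python
import pandas as pd

# Column names required by the subsequent analysis (taken from the list above)
REQUIRED_COLUMNS = ("var_id", "annotation", "color")


def missing_required_columns(df: pd.DataFrame) -> list:
    """Return the required column names that are absent from df."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# Illustrative frame that lacks the 'color' column
example = pd.DataFrame({"var_id": ["5:g.149440497C>T"], "annotation": ["example"]})
missing = missing_required_columns(example)
print(missing)
```

An empty list means the file has all the fields needed for the next step.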

This notebook is an example that standardizes the variant nomenclature:
1. Read the file with your mutation information
2. Create a column 'var_id' with the genomic location, e.g. 5:g.149440497C>T
3. Filter out any variations that are not SNVs
4. Save the file as 'mutations.csv'

The mutations.csv file is the input for the next step: MapTo3DStructures
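
If your input file stores the chromosome, position, and alleles in separate columns, step 2 could look roughly like the sketch below. The column names `Chromosome`, `Start_Position`, `Reference_Allele`, and `Tumor_Seq_Allele2` are assumptions (MAF-style names), as is the helper `build_var_id`; adapt them to your file.

```python
import pandas as pd

# Hypothetical input with separate columns; the column names are assumed
raw = pd.DataFrame({
    "Chromosome": ["chr5", "6"],
    "Start_Position": [149440497, 31628105],
    "Reference_Allele": ["C", "C"],
    "Tumor_Seq_Allele2": ["T", "A"],
})


def build_var_id(df: pd.DataFrame) -> pd.Series:
    """Assemble <chrom>:g.<pos><ref>><alt> strings, dropping a leading 'chr'."""
    chrom = df["Chromosome"].astype(str).str.replace(r"^chr", "", regex=True)
    return (
        chrom
        + ":g."
        + df["Start_Position"].astype(str)
        + df["Reference_Allele"]
        + ">"
        + df["Tumor_Seq_Allele2"]
    )


raw["var_id"] = build_var_id(raw)
print(raw["var_id"].tolist())
```

Anchoring the pattern with `^` removes "chr" only when it is a prefix, so other text in the column is left alone.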

In [1]:
import pandas as pd

#### Input parameters (specify your input file name below)

In [2]:
input_file_name = "../data/var_ids_to_map.csv"

#input_file_name = <path to your input file> # mutation info (chromosome number and position required)

output_file_name = 'mutations.csv' # contains mutation info in standard format (e.g., 5:g.149440497C>T)

## Read input file and remove mutations that are not SNVs

In [3]:
df = pd.read_csv(input_file_name)
pd.options.display.max_columns = None # show all columns

In [4]:
# Remove the "chr" prefix, e.g. chr5:g.149440497C>T -> 5:g.149440497C>T
df['var_id'] = df['var_id'].str.replace('chr', '', regex=False)

Add the color and annotation columns

In [5]:
df['color'] = 'red'  # use a single color for all mutations
df['annotation'] = df['Hugo_Symbol']  # label each mutation with its gene symbol

In [6]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | var_id | Hugo_Symbol | hgvs_protein | color | annotation |
|---|---|---|---|---|---|---|
| 0 | 0 | 2:g.215370366C>T | FN1 | p.Val2261Ile | red | FN1 |
| 1 | 1 | 6:g.31628105C>A | PRRC2A | p.Thr544Lys | red | PRRC2A |
| 2 | 2 | 6:g.31271324G>T | HLA-C | p.Ser123Tyr | red | HLA-C |
| 3 | 3 | 6:g.31271324G>T | HLA-C | p.Val117Val | red | HLA-C |
| 4 | 4 | 6:g.31356759T>A | HLA-B | p.Tyr91Phe | red | HLA-B |


Filter out any variants that are not SNVs
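
The save cell that follows writes `df` unchanged, so any filtering has to happen before it. A minimal sketch of such a filter is shown here, assuming that `var_id` has the form `<chrom>:g.<pos><ref>><alt>` with human chromosome names (1-22, X, Y, M, or MT). The pattern, the helper `keep_snvs`, and the deletion entry in the demo frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Single-nucleotide substitution without the "chr" prefix, e.g. 5:g.149440497C>T.
# The chromosome alternatives are an assumption (human genome).
SNV_PATTERN = r"^(?:[1-9]|1[0-9]|2[0-2]|X|Y|MT?):g\.\d+[ACGT]>[ACGT]$"


def keep_snvs(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose var_id is a single-nucleotide substitution."""
    is_snv = df["var_id"].astype(str).str.match(SNV_PATTERN)
    return df[is_snv].reset_index(drop=True)


# Demo: one SNV from this notebook and one made-up deletion
demo = pd.DataFrame({"var_id": ["5:g.149440497C>T", "5:g.149440497_149440498del"]})
snvs = keep_snvs(demo)
print(snvs["var_id"].tolist())
```

In this notebook the call would be `df = keep_snvs(df)`, placed before the file is saved.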

In [7]:
df.to_csv(output_file_name, index=False)

## Now run the next step
Map mutations to 3D Structure: [2-MapTo3DStructures.ipynb](./2-MapTo3DStructures.ipynb)