## Congenital CMV - NGS Data Analysis Python Scripts
***

### Repeated Regions in CMV _wild type_ Merlin Genome

- Get the CMV wild-type Merlin GenBank record (AY446894.2).
- Print a summary of the record.

In [1]:
# Load SeqIO class from Bio package
from Bio import SeqIO
# Create genbank object using read method of SeqIO class
genbank_object=SeqIO.read("./ref/cmv_merlin.gb", "genbank")
# Print summary of genbank record.
print('ID: ', genbank_object.id)
print('Name: ', genbank_object.name)
print('Description: ', genbank_object.description)
print('Genome Size (bp):',len(genbank_object.seq))

ID:  AY446894.2
Name:  AY446894
Description:  Human herpesvirus 5 strain Merlin, complete genome
Genome Size (bp): 235646


- Get unique feature types. 

In [25]:
# Get list of all feature types
feature_types=[i.type for i in genbank_object.features]
# Build an unordered collection of unique elements.
feature_types=set(feature_types)
# Convert set to list
feature_types=list(feature_types)

print(feature_types)

['CDS', 'variation', 'gene', 'misc_feature', 'regulatory', 'mRNA', 'ncRNA', 'polyA_site', 'source', 'repeat_region', 'intron', 'rep_origin']


- Number of repeat regions in AY446894.2

In [26]:
[i.type for i in genbank_object.features].count('repeat_region')

8
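
The two steps above (listing unique feature types and counting one of them) can also be done in a single pass with `collections.Counter`. The following is a minimal sketch, not the notebook's own method: `count_feature_types` is a hypothetical helper, and `demo_features` is made-up stand-in data that only mimics the `.type` attribute of Biopython features so the example runs without the GenBank file.

```python
from collections import Counter
from types import SimpleNamespace


def count_feature_types(features):
    """Return a Counter mapping each feature type to its number of occurrences."""
    return Counter(feature.type for feature in features)


# Hypothetical stand-ins for genbank_object.features (assumption: only .type is needed).
demo_features = [
    SimpleNamespace(type=t)
    for t in ("gene", "CDS", "repeat_region", "gene", "repeat_region")
]

type_counts = count_feature_types(demo_features)
print(sorted(type_counts))            # unique feature types, in a stable order
print(type_counts["repeat_region"])   # number of repeat_region features
```

With the real record, passing `genbank_object.features` to the helper should give the per-type counts directly. Sorting the keys avoids the arbitrary ordering that a plain `set` produces.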

- Writing repeat regions to a file 

In [46]:
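with open('repeat_regions_file.txt', 'w') as f:
    for feature in genbank_object.features:
        if feature.type=='repeat_region':
            # Qualifier values are lists; take the first item, or 'NA' if the qualifier is absent
            note=feature.qualifiers.get('note', ['NA'])[0]
            rpt_type=feature.qualifiers.get('rpt_type', ['NA'])[0]
            line=f"{feature.type}\t{note}\t{rpt_type}\t{feature.location}"
            f.write(line+'\n')
# No explicit f.close() is needed: the with block closes the file on exit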
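
Because Biopython stores qualifier values as lists, a small helper that joins them and substitutes a placeholder for missing qualifiers keeps the tab-separated output uniform. The sketch below uses the standard `csv` module for this. `repeat_region_rows`, the `"NA"` placeholder, and `demo_features` are illustrative assumptions, and the output goes to an in-memory buffer rather than to `repeat_regions_file.txt`.

```python
import csv
import io
from types import SimpleNamespace


def repeat_region_rows(features):
    """Yield [type, note, rpt_type, location] for every repeat_region feature."""
    for feature in features:
        if feature.type != "repeat_region":
            continue
        # Join list-valued qualifiers; fall back to "NA" when a qualifier is absent.
        note = "; ".join(feature.qualifiers.get("note", [])) or "NA"
        rpt_type = "; ".join(feature.qualifiers.get("rpt_type", [])) or "NA"
        yield [feature.type, note, rpt_type, str(feature.location)]


# Hypothetical stand-ins for genbank_object.features (made-up values, not from the record).
demo_features = [
    SimpleNamespace(type="repeat_region",
                    qualifiers={"note": ["example repeat"], "rpt_type": ["inverted"]},
                    location="[0:100](+)"),
    SimpleNamespace(type="gene", qualifiers={"gene": ["RL1"]}, location="[200:300](+)"),
    SimpleNamespace(type="repeat_region", qualifiers={}, location="[400:500](+)"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(repeat_region_rows(demo_features))
print(buffer.getvalue(), end="")
```

To write to disk instead, the buffer could be replaced by a file opened with `open(path, "w", newline="")`, which is the mode the `csv` documentation recommends.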

### Genes with a High Number of Non-synonymous Missense Mutations
Gene names were obtained from variant analysis.\
These genes have a mean of at least 10 mutations.

In [45]:

Gene_Names=['RL1','RL10','RL12','RL13','RL5A','UL1','UL11','UL116','UL119',
  'UL120','UL122','UL123','UL13','UL132','UL133','UL142','UL144','UL147',
  'UL150','UL150A','UL20','UL32','UL33','UL37','UL4','UL40','UL48','UL55',  
  'UL6','UL7','UL74','UL75','UL77','UL8','UL80','UL87','UL9','US27','US34','US7']

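# Set for fast membership tests on gene names
gene_name_set = set(Gene_Names)

with open(file='high_number_sps.txt', mode='w') as document:
    for feature in genbank_object.features:
        # Qualifier values are lists, e.g. ['UL8']; some features have no 'gene' qualifier
        gene = feature.qualifiers.get('gene', [None])[0]
        if feature.type == 'CDS' and gene in gene_name_set:
            print(gene,
                  feature.location,
                  feature.qualifiers['product'][0],
                  feature.qualifiers.get('note', ['None'])[0],
                  sep='\t',
                  file=document)
# No explicit close() is needed: the with block closes the file on exit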
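
Since the gene names come from a separate variant analysis, it may be worth checking that every name actually matches a CDS feature in the record, because a misspelled or renamed gene would otherwise be skipped silently. The following is a minimal sketch of such a check. `split_found_missing` is a hypothetical helper, and `demo_features` and `demo_gene_names` are made-up stand-ins for `genbank_object.features` and `Gene_Names`.

```python
from types import SimpleNamespace


def split_found_missing(features, gene_names):
    """Split gene_names into those with and without a matching CDS feature."""
    # Collect the gene name of every CDS feature that carries a 'gene' qualifier.
    cds_genes = {
        feature.qualifiers["gene"][0]
        for feature in features
        if feature.type == "CDS" and "gene" in feature.qualifiers
    }
    found = [name for name in gene_names if name in cds_genes]
    missing = [name for name in gene_names if name not in cds_genes]
    return found, missing


# Hypothetical stand-ins (assumption: only .type and .qualifiers are needed).
demo_features = [
    SimpleNamespace(type="gene", qualifiers={"gene": ["RL1"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["RL1"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["UL8"]}),
    SimpleNamespace(type="CDS", qualifiers={}),  # CDS without a gene qualifier
]
demo_gene_names = ["RL1", "UL8", "US7"]

found, missing = split_found_missing(demo_features, demo_gene_names)
print("found:", found)
print("missing:", missing)
```

An empty `missing` list for the real record would suggest that every gene in the list contributed at least one line to the output file.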

In [32]:
from Bio.Seq import Seq
from Bio import SeqIO
genbank_object=SeqIO.read("./ref/cmv_merlin.gb", "genbank")
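for feature in genbank_object.features:
    # Use .get() so CDS features without a 'gene' qualifier do not raise KeyError
    if feature.type=='CDS' and feature.qualifiers.get('gene')==['UL8']:
        UL8_prot = feature.qualifiers['translation']
# Qualifier values are lists; wrap the first element in a Seq object
UL8_prot = Seq(UL8_prot[0])

In [34]:
print(UL8_prot)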

MASDVGSHPLTVTRFRCRVHYVYNKLLILTLFAPVILESVIYVSGPQGGNVTLVSNFTSNISARWFRWDGNDSHLICFYKRGEGLSTPYVGLSLSCAANQITIFNLTLNDSGRYGAEGFTRSGENETFLWYNLTVKPKPLETTPASNVTTIVTTTSTVTDAKSNVTGNVSLAPQLRAVAGFSHQTPLENNTHLALGEGFVPTMTSSRLSASENYDGNYEFTETANTTRTNTSDWITLGSSASLLKSTETAVNLSNATTVIPQPVEYPAGGVQYQRAATHYSWMLIIVIILIIFIIICLRAPRKIYHHWKDSKQYGQVFMTDTEL
