### Processing DNA in a file

In [1]:
data = open('input.txt').read()
print(data)

ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
ATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT
ATTCGATTATAAGCATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC
ATTCGATTATAAGCACTATCGATGATCTAGCTACGATCGTAGCTGTA
ATTCGATTATAAGCACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA



In [40]:
input_file = open('input.txt')
output_file = open('output.txt', 'w')

for line in input_file:
    # line[14:] keeps everything after the first 14 characters, however long the line is
    output_line = line[14:]
    # the length of the sequence is len(output_line) - 1
    # because output_line still ends with the '\n' read from the file
    print(len(output_line) - 1)
    output_file.write(output_line)
input_file.close()
output_file.close()

42
37
48
33
47


In [8]:
data = open('output.txt').read()
print(data)

TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
ACTGATCGATCGATCGATCGATCGATGCTATCGTCGT
ATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC
ACTATCGATGATCTAGCTACGATCGTAGCTGTA
ACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA
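
The same trimming step can also be written with `with` blocks, which close both files automatically, and with `.rstrip('\n')`, so the reported length does not depend on whether a line ends with a newline. This is only a sketch: the function name `trim_file`, the constant `ADAPTER_LENGTH` and the temporary directory are assumptions made for illustration, and the sample lines are the first two sequences from `input.txt`.

```python
import os
import tempfile

# First two sequences copied from input.txt as printed above.
sample_lines = [
    "ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC",
    "ATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT",
]

ADAPTER_LENGTH = 14  # hypothetical name: number of leading characters to drop


def trim_file(input_path, output_path, skip=ADAPTER_LENGTH):
    """Copy each line without its first `skip` characters; return the trimmed lengths."""
    lengths = []
    # Both files are closed automatically when the with block ends.
    with open(input_path) as input_file, open(output_path, "w") as output_file:
        for line in input_file:
            # Remove the newline first, so len() counts only the bases.
            sequence = line.rstrip("\n")[skip:]
            lengths.append(len(sequence))
            output_file.write(sequence + "\n")
    return lengths


# Build a throwaway copy of the input so the sketch runs on its own.
work_dir = tempfile.mkdtemp()
input_path = os.path.join(work_dir, "input.txt")
output_path = os.path.join(work_dir, "output.txt")
with open(input_path, "w") as handle:
    handle.write("\n".join(sample_lines) + "\n")

trimmed_lengths = trim_file(input_path, output_path)
print(trimmed_lengths)
```

Returning the lengths as a list, rather than printing inside the loop, makes it easier to check them against the sequences afterwards.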



### Multiple exons from genomic DNA

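The data for this exercise is in `genomic_dna.txt` and `exons.txt`. The `genomic_dna.txt` file contains some DNA, all on a single line (terminated with `\n`). The `exons.txt` file contains the start and end positions of the exons. Each exon is described on a single line, with the start and end positions separated by a comma. The code in this section uses these positions directly as Python slice indices, so the start is counted from 0 and the end position itself is not included.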

In [1]:
genomic_dna_filename = 'genomic_dna.txt'
genomic_dna_file = open(genomic_dna_filename)
# .read() is a file object method: it reads the whole of the genomic DNA from genomic_dna_file and returns it as a single text string
# .rstrip('\n') strips (i.e. removes) the trailing '\n' from the end of the text string that .read() returns
genomic_dna = genomic_dna_file.read().rstrip('\n')

In [2]:
# show the genomic_dna
print(genomic_dna)

TCGATCGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGACTGATCGATCGATCGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGATCGATCATATGTCAGTCGATGCATCGTAGCATCGTATAGTAGCTACGTAGCTACGATCGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTAGCTAGTACGATCGCGTAGCTAGCATGCTACGTAGATCGATCGATGCATGCTAGCTAGCTAGCTACGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGTAGCTAGCTACGATCGATGCTACGTAGATCGATCGCTAGTAGATCGATCGCTAGCTAGCTGACTAGTACGCTGCTAGTAGTCAGCTAGATCGATGCTAGTCA


In [9]:
# first let us just examine the exons by printing them to the screen.
exons_filename = 'exons.txt'
exons_file = open(exons_filename)

# each line in exons_file is 'START,STOP\n' where START and STOP are numbers
for line in exons_file:
    exon_coordinates = line.split(',')
    # the START and STOP coordinates are numbers but we get them as text strings so we need to convert them to numbers with int()
    start = int(exon_coordinates[0])
    stop = int(exon_coordinates[1])
    exon_text = genomic_dna[start:stop]
    print("start: " + str(start) + "\tstop: " + str(stop) + "\t" + exon_text)

start: 5	stop: 58	CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCG
start: 72	stop: 133	CGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGA
start: 190	stop: 276	CGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTA
start: 340	stop: 398	CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGTAGCTAGCTACGATCG


In [13]:
# I am copying the code from above to this cell so that the whole solution is visible in a single cell
genomic_dna_filename = 'genomic_dna.txt'
genomic_dna_file = open(genomic_dna_filename)
genomic_dna = genomic_dna_file.read().rstrip('\n')
# the whole sequence is now in memory, so the input file can be closed
genomic_dna_file.close()

# now that we have verified that we can extract the exons from the file, let us write them to the output file
exons_filename = 'exons.txt'
exons_file = open(exons_filename)

# we can choose any suitable name for the output filename. here I choose 'coding_dna.txt'
output_filename = 'coding_dna.txt'
output_file = open(output_filename, 'w')
# each line in exons_file is 'START,STOP\n' where START and STOP are numbers
for line in exons_file:
    exon_coordinates = line.split(',')
    # the START and STOP coordinates are numbers but we get them as text strings so we need to convert them to numbers with int()
    start = int(exon_coordinates[0])
    stop = int(exon_coordinates[1])
    # genomic_dna is a text string containing the whole genomic sequence.
    # the exon sequence is extracted from it using the [] slicing operator
    exon_sequence = genomic_dna[start:stop]
    # remember that when we use .write() it updates the file pointer, so the next time we .write() we add to the text in the output_file
    output_file.write(exon_sequence)
# close both files once all the exons have been written
exons_file.close()
output_file.close()
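
One way to check this kind of solution is to wrap it in a function and run it on a tiny example where the expected answer is easy to see by eye. The sketch below does that. The function name `join_exons`, the toy sequence and the toy coordinates are all made up for illustration; they are not the contents of `genomic_dna.txt` or `exons.txt`.

```python
import os
import tempfile


def join_exons(genomic_dna_path, exons_path, output_path):
    """Concatenate the exon slices listed in exons_path and write them to output_path."""
    with open(genomic_dna_path) as genomic_dna_file:
        genomic_dna = genomic_dna_file.read().rstrip("\n")

    exon_sequences = []
    with open(exons_path) as exons_file:
        for line in exons_file:
            if not line.strip():
                continue  # skip blank lines, e.g. a trailing empty line
            start_text, stop_text = line.split(",")
            # int() ignores the trailing '\n' on the stop coordinate
            exon_sequences.append(genomic_dna[int(start_text):int(stop_text)])

    coding_dna = "".join(exon_sequences)
    with open(output_path, "w") as output_file:
        output_file.write(coding_dna)
    return coding_dna


# Toy data (hypothetical): two exons, AAA at 0-3 and GGG at 6-9.
work_dir = tempfile.mkdtemp()
dna_path = os.path.join(work_dir, "genomic_dna.txt")
exons_path = os.path.join(work_dir, "exons.txt")
coding_path = os.path.join(work_dir, "coding_dna.txt")
with open(dna_path, "w") as handle:
    handle.write("AAACCCGGGTTTAAACCC\n")
with open(exons_path, "w") as handle:
    handle.write("0,3\n6,9\n")

coding_dna = join_exons(dna_path, exons_path, coding_path)
print(coding_dna)  # prints AAAGGG
```

Collecting the exons in a list and joining them once gives the same file contents as calling `.write()` once per exon; it just also leaves the result available as a string for checking.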

### Processing DNA in a file - working notes

In [3]:
input_file = open('input.txt')

In [2]:
input_file.read()

'ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC\nATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT\nATTCGATTATAAGCATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC\nATTCGATTATAAGCACTATCGATGATCTAGCTACGATCGTAGCTGTA\nATTCGATTATAAGCACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA\n'

In [13]:
input_file = open('input.txt')

In [4]:
for peter in input_file:
    print(peter)

ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC

ATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT

ATTCGATTATAAGCATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC

ATTCGATTATAAGCACTATCGATGATCTAGCTACGATCGTAGCTGTA

ATTCGATTATAAGCACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA



In [5]:
for peter in input_file:
    print(peter)

In [7]:
word = 'banana'

In [8]:
for letter in word:
    print(letter)

b
a
n
a
n
a


In [9]:
for letter in word:
    print(letter)

b
a
n
a
n
a


In [10]:
input_file = open('input.txt')

In [11]:
input_file.read()

'ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC\nATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT\nATTCGATTATAAGCATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC\nATTCGATTATAAGCACTATCGATGATCTAGCTACGATCGTAGCTGTA\nATTCGATTATAAGCACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA\n'

In [12]:
input_file.read()

''
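
The empty string above, and the earlier `for` loop that printed nothing, have the same cause: a file object remembers how far it has been read, and after the first complete read it is positioned at the end of the file. Reopening the file, as done in these notes, is one fix. Another is `.seek(0)`, which moves the position back to the start. A small sketch, using a throwaway file named `demo.txt` (a name chosen only for this example):

```python
import os
import tempfile

# Create a small two-line file to experiment on.
work_dir = tempfile.mkdtemp()
demo_path = os.path.join(work_dir, "demo.txt")
with open(demo_path, "w") as handle:
    handle.write("ATCG\nGGCC\n")

with open(demo_path) as input_file:
    first_read = input_file.read()    # reads everything, position is now at the end
    second_read = input_file.read()   # nothing left to read, so this is ''
    input_file.seek(0)                # move the position back to the start
    lines_after_seek = [line for line in input_file]

print(repr(first_read))    # prints 'ATCG\nGGCC\n'
print(repr(second_read))   # prints ''
print(lines_after_seek)    # prints ['ATCG\n', 'GGCC\n']
```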

In [17]:
input_file = open('input.txt')
input_file.read()

'ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC\nATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT\nATTCGATTATAAGCATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC\nATTCGATTATAAGCACTATCGATGATCTAGCTACGATCGTAGCTGTA\nATTCGATTATAAGCACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA\n'

In [19]:
input_file = open('input.txt')

for peter in input_file:
    print(peter)

ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC

ATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT

ATTCGATTATAAGCATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC

ATTCGATTATAAGCACTATCGATGATCTAGCTACGATCGTAGCTGTA

ATTCGATTATAAGCACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA



In [20]:
input_file = open('input.txt')

for line in input_file:
    output_line = line[14:1000]
    print(output_line, end='')

TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
ACTGATCGATCGATCGATCGATCGATGCTATCGTCGT
ATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC
ACTATCGATGATCTAGCTACGATCGTAGCTGTA
ACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA


In [23]:
print('Hello World', end='')
print('Today is Monday')

Hello WorldToday is Monday
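
This also explains the blank lines between sequences in the earlier loops: each line read from a file already ends with `'\n'`, and `print()` adds a second one unless `end=''` is given. The sketch below captures what `print()` writes so the extra newline is visible; `redirect_stdout` and `StringIO` are used here only to inspect the output and are not needed in normal use.

```python
import io
from contextlib import redirect_stdout

line = "ATCG\n"  # a line as it comes from a file, newline included

buffer = io.StringIO()
with redirect_stdout(buffer):
    print(line)               # writes 'ATCG\n' plus print's own '\n'
    print(line, end="")       # writes 'ATCG\n' only
    print(line.rstrip("\n"))  # newline removed first, then print adds one back

captured = buffer.getvalue()
print(repr(captured))  # prints 'ATCG\n\nATCG\nATCG\n'
```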


In [31]:
input_file = open('input.txt')
output_filename = 'output.txt'
output_file = open(output_filename, 'w')

for line in input_file:
    output_line = line[14:1000]
    output_file.write(output_line)
#     print(output_line, end='')
#     print(str(len(line) - 14))
    print(str(len(output_line)))
output_file.close()



43
43
38
38
49
49
34
34
48
48


In [36]:
input_file = open('input2.txt')
output_filename = 'output.txt'
output_file = open(output_filename, 'w')

for line in input_file:
    output_line = line[14:1000]
    output_file.write(output_line)
    print(line, end='')
    print(output_line, end='')
#     print(str(len(line) - 14))
    # the length of the sequence is equal to 
    # len(output_line) - 1 because output_line ends with \n
    print(str(len(output_line) - 1))
output_file.close()



ATTCGATTATAAGCT
T
1
