# Intro to Strings 
## with DNA Sequences

`str` objects (strings) in Python represent text.  They can be created in several ways:

```python
>>> 'Hello'  # with apostrophes ("single-quotes")
'Hello'

>>> "Hello"  # with quotation marks ("double-quotes")
'Hello'

>>> """Hello,
... my name is
... Nick"""   # with triple-double-quotes (used for multi-line text and for docstrings)
'Hello,\nmy name is\nNick'

>>> str(32)  # using the str() function to convert another value into a string
'32'
```

Nucleotide sequences are often represented as strings:

```python
>>> seq = 'GCATTGGCT'
```

## String Operation Exercises

For each DNA sequence below, write a single line of code that does what is asked.  Operations, functions, and methods that may be used are:

### Operations
  - `'GTC' * 3`
  - `'GTC' + 'GTC'`
  - `'GTC'[0]`
  - `'GTC'[-1]`
  - `'GTC'[1:]`
  - `'GTC'[:-1]`
  - `'GTC'[::-1]   # Note: Reverses the sequence`
  - `'GTC' == 'GTC'`
  - `'GTC' != 'GTC'`

### Functions
  - `len('GTC')`

### Methods
  - `'GTC'.count('A')`
  - `'GtC'.upper()`
  - `'GTc'.lower()`
  - `'GTC'.isdigit()`
  - `'GTC'.index('T')`
  - `'GTC'.replace('G', 'C')`
  - `'GTC-CCA'.split('-')`
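
As a quick reference, the sketch below applies each of the listed operations, functions, and methods to a made-up sequence. The name `demo` and its value are assumptions chosen for illustration, not one of the exercise sequences; each comment shows the value the expression produces.

```python
# Hypothetical example sequence (not taken from the exercises)
demo = 'ACGTTA'

# Operations
repeated = demo * 2              # 'ACGTTAACGTTA'
joined = demo + 'GG'             # 'ACGTTAGG'
first = demo[0]                  # 'A'
last = demo[-1]                  # 'A'
without_first = demo[1:]         # 'CGTTA'
without_last = demo[:-1]         # 'ACGTT'
reversed_seq = demo[::-1]        # 'ATTGCA'
is_equal = demo == 'ACGTTA'      # True
is_different = demo != 'ACGTTA'  # False

# Functions
length = len(demo)               # 6

# Methods return a new value; the original string is never changed
n_t = demo.count('T')            # 2
lowered = demo.lower()           # 'acgtta'
uppered = lowered.upper()        # 'ACGTTA'
only_digits = demo.isdigit()     # False
position = demo.index('G')       # 2 (positions start at 0)
swapped = demo.replace('T', 'U') # 'ACGUUA'
parts = 'ACG-TTA'.split('-')     # ['ACG', 'TTA']
```

Because strings cannot be changed in place, a result has to be assigned to a name (or printed) to be kept.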



**Exercises**

Count the number of "G" nucleotides in the sequence

```python
seq = "GTGTCAGTCCCCATGAATCGATAG"
```

Count the number of "AT" repeats in the sequence

```python
seq = "GTGTCAGTCCCCATGAATCGATAG"
```

Concatenate the following two sequences (i.e. combine them into one sequence)

```python
seq1 = "GTGTCAGT"
seq2 = "TGAATCGATAG"
```

How long is the following sequence?

```python
seq = "GTGTCAGTCCCCATGAATCGATAG"
```

What is the 7th nucleotide in this sequence?

```python
seq = "GTGTCAGTCCCCATGAATCGATAG"
```

What is the 3rd-from-the-last nucleotide in this sequence?

```python
seq = "GTGTCAGTCCCCATGAATCGATAG"
```

Repeat the following sequence 13 times

```python
gc = "GC"
```

Replace the incorrect letter with an empty string (i.e. delete the letter)

```python
seq = "GTGXXGTXCCXCCATGXAATCGXATA"
```

Keep only the first six nucleotides in this sequence

```python
seq = "GTGTCAGTCCCCATGAATCGATAG"
```

Standardize the formatting of this sequence (e.g. make every letter uppercase)

```python
seq = "GtCGAaaCCgTaGcTAgc"
```

Split the following string at the spaces into a list of sequences

```python
seqs = "GTTCGAAAG GACCTGATTATAG AACCGATTTA"
```

Reverse this sequence

```python
seq = "GTGTCAGTCCCCATGAATCGATAG"
```

What percentage of the nucleotides in this sequence are strong (G or C)?

```python
seq = "GTGTCAGTCCCCATGAATCGATAG"
```

Is this sequence the same forwards and backwards (i.e. a palindrome)?

```python
seq = "TCGATCTAGCGCGAATATCGGAGAAGAGGCTATAAGCGCGATCTAGCT"
```

## Files

### Writing Strings to Files

Strings can be saved to text files by creating a file object with the `open()` function and writing the string to it.  Here are three ways to do it:

```python
my_file = open('myfile.txt', 'w')  # open in 'write' mode
my_file.write('This is my text')
my_file.close()
```

A shorter version of this, which also closes the file automatically, is:
```python
with open('myfile.txt', 'w') as my_file:
    my_file.write('This is my text')
```

Even shorter, using the `Path` object from the standard library's `pathlib` module:
```python
from pathlib import Path
Path('myfile.txt').write_text('This is my text')
```

### Reading Strings from Files

Reading works in a similar way:

```python
my_file = open('myfile.txt')
text = my_file.read()
my_file.close()
```

A shorter version of this is:
```python
with open('myfile.txt') as my_file:
    text = my_file.read()
```

Even shorter: 
```python
from pathlib import Path
text = Path('myfile.txt').read_text()
```
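
To see writing and reading work together, here is a minimal round-trip sketch. It assumes a temporary folder (created with the standard library's `tempfile` module) so that no existing file is overwritten; the names `folder`, `path`, `text_out`, `text_in`, and `also_text_in` are chosen for illustration only.

```python
from pathlib import Path
import tempfile

text_out = 'This is my text'

# Work inside a temporary folder so no existing file is touched
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / 'myfile.txt'   # hypothetical file location

    # 'w' mode creates the file, or replaces its contents if it exists
    with open(path, 'w') as my_file:
        my_file.write(text_out)

    # Read mode is the default, so no second argument is needed
    with open(path) as my_file:
        text_in = my_file.read()

    # The pathlib version reads the same file in one step
    also_text_in = path.read_text()

print(text_in == text_out)       # True
print(also_text_in == text_out)  # True
```

Note that opening an existing file in `'w'` mode discards whatever it contained before.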

**Exercises**

Write the following sequence to a file called "sequence1.txt":

```python
seq = "GTGTCAGTCCCCATGAATCGATAG"
```

Read the sequence from the file back into Python

## Lists

### Operations

  - `['a', 'b', 'c']`
  - `[1, 'a', True]`
  - `['a', 'b', 'c'][1]  # b`
  - `['a', 'b', 'c'][-1]  # c`
  - `['a', 'b', 'c'][:2]  # ['a', 'b']`
  - `['a', 'b'] * 2  # ['a', 'b', 'a', 'b']`
  - `[1, 2] + [4]  # [1, 2, 4]`
  - `my_list[2] = 99`



### Functions

  - `len(['a', 'b', 'c'])  # 3`
  - `max([10, 11, 12])  # 12`
  - `min([10, 11, 12])  # 10`
  - `sum([10, 11, 12])  # 33`

### Methods

  - `my_list.append('d')`
  - `my_list.extend(['d', 'e'])`
  - `my_list.insert(2, 'f')`
  - `my_list.index('b')`
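
The sketch below runs the list operations, functions, and methods on a made-up list so the effect of each is visible. The list contents and variable names are assumptions for illustration. Note that `insert` takes the position first and the value second, and that the position of an item is looked up with `index`.

```python
my_list = ['a', 'b', 'c']        # hypothetical example list

# Operations
second = my_list[1]              # 'b'
last = my_list[-1]               # 'c'
first_two = my_list[:2]          # ['a', 'b']
doubled = my_list * 2            # ['a', 'b', 'c', 'a', 'b', 'c']
combined = my_list + ['z']       # ['a', 'b', 'c', 'z'] (a new list)
my_list[2] = 'C'                 # my_list is now ['a', 'b', 'C']

# Functions
n_items = len(my_list)           # 3
total = sum([10, 11, 12])        # 33

# append, extend, and insert change the list in place and return None
my_list.append('d')              # ['a', 'b', 'C', 'd']
my_list.extend(['e', 'f'])       # ['a', 'b', 'C', 'd', 'e', 'f']
my_list.insert(2, 'x')           # ['a', 'b', 'x', 'C', 'd', 'e', 'f']
position = my_list.index('b')    # 1

print(my_list)
```

Unlike strings, lists can be changed in place, which is why `append`, `extend`, and `insert` do not need their result assigned to a name.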
  



**Exercises**

Replace 'b' with 'B'

```python
letters = ['A', 'b', 'C', 'D']
```

Get the first 3 letters from the list

```python
letters = ['a', 'b', 'c', 'd', 'e']
```

How many items are in this list?

```python
people = ['Rachel', 'Monica', 'Joey', 'Phoebe', 'Chandler']
```

*Bonus, non-Python question*: What name is missing from this list?

Reverse the list

```python
letters = ['a', 'b', 'c', 'd', 'e']
```

Find two different ways to make [3, 4, 5, 6]

```python
data = [3, 4, 5]
new_value = 6
```

```python
data = [3, 4, 5]
new_value = 6
```

Find two different ways to concatenate these two lists

```python
aa = [1, 2, 3]
bb = [4, 5, 6]
```

```python
aa = [1, 2, 3]
bb = [4, 5, 6]
```

**Exercises**  Without writing loops, combine string and list techniques to solve the following:

Uppercase both items in the list

```python
seqs = ['gcTTA', 'GttTTGgt']
```

Lowercase all items in the list

```python
seqs = ['gcTTA', 'GttTTGgt', 'GGTAATA']
```

Make a list of only the first codon (the first 3 nucleotides) in each sequence

```python
seqs = ['GGTATTA', 'GGCCAG', 'CCAGGATTAG']
```

Make a single list of all the sequences contained in these space-separated strings, without spaces
(e.g. `['ag ct', 'cc gg']` -> `['ag', 'ct', 'cc', 'gg']`)

```python
seq_pairs = ['aggg ctgt ggtta ggcaac cca', 'ccag gg ccagg aattag aacccgcgt']
```
