# Problem Set 1: Intro to Bash and Python
## Due Monday September 11th, 11:30AM
Please submit this assignment by uploading the completed Jupyter notebook to Canvas.  Before submitting, please make sure to run all cells so that reviewers do not have to run your code to see your results.  Also, please make sure to save your work before uploading to Canvas.  You can do this by clicking on `File -> Save` in the menu bar or by pressing `Ctrl-S` (`Cmd-S` on Mac).

### Problem 1: Bash Basics
First, make sure that you have downloaded and unzipped the contents of the [ion_channel_sequences folder](https://github.com/gofflab/Quant_mol_neuro_2022/tree/main/modules/module_1/pset/ion_channel_sequences) of genome sequence files. This dataset will also be included in Canvas for this problem set.

Then, answer each question using Bash scripting. Please note that the bash operations should be executed inside a `bash` code chunk like so:


In [None]:
%%bash

You'll need to have your folder of sequence files (`ion_channel_sequences`) in your working directory. Print your working directory and its contents. (Hint: use `pwd` and `ls`.)


In [None]:
%%bash
echo "Working directory:"
pwd
echo "Contents of working directory:"
ls -l

We'll start with `Kcna1.fa`, a FASTA file that contains the mRNA sequence of a mouse voltage-gated potassium channel. The first line begins with a greater-than sign (`>`), followed by a unique sequence identifier and a description. The actual sequence starts on the next line. See for yourself by printing the first three lines of the file. Note that FASTA files can contain multiple sequences, but this one has just one.


In [2]:
%%bash
head -3 ion_channel_sequences/Kcna1.fa

>ref|NM_010595.3|:1-8970 Mus musculus potassium voltage-gated channel, shaker-related subfamily, member 1 (Kcna1), mRNA
GGGGGCTCCTCAGAGGCTCCGCAGCGGTGGAAGGACTGGAGCTGCTGGCTGCCTCCTCCGGTGCAGCCTG
TATCCAGGTGCAGCGGCACTGGGGACGCGGTGCATATCCCTTGCTCAGACTGCCACTGTGACCCTTGCGC
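
As a side illustration of the FASTA layout described above (a `>` header line, then the sequence wrapped over one or more lines), here is a minimal Python sketch. The function name `parse_fasta` and the two toy records are hypothetical and made up for this example; they are not part of the dataset, and this does not replace the Bash answers requested in this problem.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into {identifier: sequence}.

    Assumption: the identifier is the first whitespace-delimited token
    after the '>' on each header line.
    """
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines belong to the most recent header
            records[current_id].append(line)
    # Join the wrapped sequence lines of each record into one string
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


# Toy input with two made-up records (hypothetical, for illustration only)
toy = ">seqA first toy record\nACGT\nTTGA\n>seqB second toy record\nGGCC\n"
records = parse_fasta(toy)
print(records)
```

Each header opens a new record, and every following line is appended to it until the next header appears, which is why a single-sequence file such as `Kcna1.fa` has exactly one `>` line.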


Print just the sequence (everything except the header line) into a new file called `Kcna1_sequence.txt`.

In [None]:
%%bash


How many lines are in Kcna1_sequence.txt? How many characters are in the file? If you subtract the number of lines from the number of characters, you should get the number of nucleotides in the Kcna1 gene. Why? Answer in a comment.


In [None]:
%%bash

Count the number of times each nucleotide appears in the Kcna1 gene.

Many voltage-gated potassium channels have a signature selectivity filter motif with the amino acid sequence `TVGYG`. The reverse translation of this AA sequence can be modeled using the following DNA codons:

`AC[TCAG] GT[TCAG] GG[TCAG] TA[TC] GG[TCAG]`

Note the bases in square brackets. This string uses [regular expressions](https://quickref.me/regex) to allow flexibility in the wobble base of each codon. For example, the first codon in this motif is written as `AC[TCAG]`, which, when used with a `grep` search, will match `AC` followed by _any_ one base in the set `[TCAG]`. The spaces between codons above are only for readability and must be removed before searching. Use the provided motif as the pattern argument to `grep` to search the Kcna1 gene for any matching instances.
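
To make the bracket notation concrete, the short sketch below uses Python's standard `re` module, which treats `[TCAG]` the same way `grep` does for this pattern. The toy sequences are made up for illustration and are not taken from `Kcna1.fa`.

```python
import re

# The motif from above, written without spaces between codons
tvgyg = "AC[TCAG]GT[TCAG]GG[TCAG]TA[TC]GG[TCAG]"

# 'AC[TCAG]' matches AC followed by exactly one of T, C, A, or G.
# The final 'ACN' is not matched because N is not in the bracketed set.
first_codon_hits = re.findall("AC[TCAG]", "ACTACCACAACGACN")
print(first_codon_hits)

# Toy sequence (hypothetical) containing one full match:
# TT + ACA GTC GGA TAT GGC + AA
match = re.search(tvgyg, "TTACAGTCGGATATGGCAA")
print(match.group(), match.span())

# Changing TAT to TAA breaks the match, since the fourth codon
# only allows T or C in its third position.
no_match = re.search(tvgyg, "TTACAGTCGGATAAGGCAA")
print(no_match)
```

The fourth codon is written `TA[TC]` rather than `TA[TCAG]` because `TAA` and `TAG` are stop codons and do not encode tyrosine.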

Confirm that the Kcna1 gene has a sequence that would encode these amino acids. _Hint: Your first step should be to remove newline characters from Kcna1_sequence.txt._

In [None]:
%%bash
tvgyg="AC[TCAG]GT[TCAG]GG[TCAG]TA[TC]GG[TCAG]"


Write a bash script that takes a FASTA file (with a single sequence) and a target sequence motif as input and outputs the number of times that the motif appears in the FASTA file.

Your script must (in order):
  1. Take a fasta file as the first argument
  2. Take a string (motif) as the second argument
  3. Remove the header row from the fasta sequence (i.e., select just the sequence of the gene)
  4. Trim the lines to remove newline characters ("\n")
  5. Search the trimmed gene sequence for the provided motif and count the number of occurrences.
  6. Return the number of motifs found.
  
After writing your script, check its file permissions and make sure it is executable.
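
If you want an independent way to sanity-check your Bash script's output, the numbered steps above can be prototyped in Python. This is only a sketch under stated assumptions: the function name `count_motif` and the toy FASTA text are hypothetical, and the count is of non-overlapping matches. Your submitted answer should still be a Bash script.

```python
import re


def count_motif(fasta_text, motif):
    """Count non-overlapping regex matches of `motif` in a single-sequence FASTA."""
    # Step 3: drop the header line (any line starting with '>')
    seq_lines = [ln.strip() for ln in fasta_text.splitlines()
                 if ln and not ln.startswith(">")]
    # Step 4: join the wrapped lines so no newlines remain
    sequence = "".join(seq_lines)
    # Steps 5 and 6: search the joined sequence and return the count
    return len(re.findall(motif, sequence))


tvgyg = "AC[TCAG]GT[TCAG]GG[TCAG]TA[TC]GG[TCAG]"

# Toy record (made up for illustration); the motif spans the line break
toy_fasta = ">toy_record hypothetical example\nTTACAGTC\nGGATATGGCAA\n"

print(count_motif(toy_fasta, tvgyg))

# Searching line by line misses the match that straddles the line break,
# which is why the newlines are removed before searching.
per_line_hits = sum(len(re.findall(tvgyg, ln)) for ln in toy_fasta.splitlines())
print(per_line_hits)
```

The contrast between the two printed counts shows why step 4 comes before step 5: a motif that is split across two wrapped lines is only found once the lines are joined.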


Use your new script here to determine which of the FASTA files in the ion_channel_sequences folder are likely to encode a voltage-gated potassium channel.
Construct a loop to check whether each of the FASTA files in the ion_channel_sequences folder contains the `TVGYG` motif. Print the names of just the files that do contain the motif.

### Problem 2: Basic Python

We are going to load a gene expression matrix from Yao 2021, with genes as rows and cells as columns.

In [2]:
import pandas as pd
import numpy as np

In [12]:
expData = pd.read_csv("Yao2021_ionChannels_expMat_genesByCells.csv",index_col=0)

In [13]:
expData.head()

|  | AAACCCAAGCTTCATG-1L8TX_181211_01_G12 | AAACCCAAGTGAGGTC-1L8TX_181211_01_G12 | AAACCCACACCAGCCA-1L8TX_181211_01_G12 | AAACCCAGTGAACGGT-1L8TX_181211_01_G12 | AAACCCAGTGGCATCC-1L8TX_181211_01_G12 | AAACCCATCTACCTTA-1L8TX_181211_01_G12 | AAACGAAAGCCTGGAA-1L8TX_181211_01_G12 | AAACGAACAACGATTC-1L8TX_181211_01_G12 | AAACGAAGTAACAGTA-1L8TX_181211_01_G12 | AAACGAATCGCCAGAC-1L8TX_181211_01_G12 | ... | TTTGTTGAGTTAGTGA-12L8TX_190430_01_G08 | TTTGTTGCAATGGCCC-12L8TX_190430_01_G08 | TTTGTTGCAGCGATTT-12L8TX_190430_01_G08 | TTTGTTGGTATGGAGC-12L8TX_190430_01_G08 | TTTGTTGGTGTGTACT-12L8TX_190430_01_G08 | TTTGTTGTCAGCATTG-12L8TX_190430_01_G08 | TTTGTTGTCATTGCGA-12L8TX_190430_01_G08 | TTTGTTGTCCCAACTC-12L8TX_190430_01_G08 | TTTGTTGTCTATGCCC-12L8TX_190430_01_G08 | TTTGTTGTCTTGGAAC-12L8TX_190430_01_G08 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kcnj10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Kcnk2 | 8 | 8 | 28 | 49 | 16 | 0 | 2 | 21 | 9 | 2 | ... | 2 | 2 | 7 | 4 | 6 | 4 | 11 | 19 | 26 | 5 |
| Scn3a | 4 | 5 | 6 | 6 | 3 | 4 | 14 | 4 | 6 | 4 | ... | 2 | 2 | 1 | 3 | 8 | 5 | 1 | 6 | 7 | 8 |
| Scn9a | 1 | 0 | 0 | 1 | 7 | 2 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Slc12a5 | 14 | 19 | 24 | 64 | 21 | 12 | 15 | 28 | 11 | 18 | ... | 13 | 6 | 9 | 10 | 6 | 13 | 16 | 24 | 13 | 13 |


In [19]:
expData.mode(axis=1,dropna=False)

|  | 0 |
|---|---|
| Kcnj10 | 0 |
| Kcnk2 | 2 |
| Scn3a | 4 |
| Scn9a | 0 |
| Slc12a5 | 11 |
| Kcnb1 | 11 |
| Kcna2 | 6 |
| Cacna1c | 7 |
| Kcna1 | 0 |
| Gabra1 | 17 |


In [10]:
cell_pheno = pd.read_csv("Yao2021_ionChannels_expMat_pheno.csv",index_col=0)

In [11]:
cell_pheno.head()

| cellID | cellClass | cellSubclass |
|---|---|---|
| AAACCCAAGCTTCATG-1L8TX_181211_01_G12 | Glutamatergic | L5 IT |
| AAACCCAAGTGAGGTC-1L8TX_181211_01_G12 | Glutamatergic | L5 IT |
| AAACCCACACCAGCCA-1L8TX_181211_01_G12 | Glutamatergic | L5 IT |
| AAACCCAGTGAACGGT-1L8TX_181211_01_G12 | Glutamatergic | L5 IT |
| AAACCCAGTGGCATCC-1L8TX_181211_01_G12 | GABAergic | Vip |
