# Project 2 - Noah Guzman

### Scientific Question: Which mutations in the human SCN1A (sodium voltage-gated channel alpha subunit 1) gene are most common and responsible for altering the protein in ways that lead to epilepsy?

SCN1A (sodium voltage-gated channel alpha subunit 1) is the gene of interest in my project. Voltage-gated sodium channels regulate the movement of sodium ions between the intracellular and extracellular spaces, which is crucial for generating and transmitting action potentials in neurons and muscle cells. SCN1A encodes the alpha subunit, which has four homologous domains, each containing several transmembrane regions. Variants of this gene have been associated with epilepsy, specifically febrile seizures and epileptic encephalopathy.

Because of its association with epilepsy, many researchers have studied the gene to better understand that connection. The data is sourced from NCBI (https://www.ncbi.nlm.nih.gov/gene/6323).

### Scientific Hypothesis: If specific mutations within the SCN1A gene are common and have consequences for the resulting protein, then these mutations could lead to future diagnoses of epilepsy.

The analyses done for this project were pairwise sequence alignment and protein expression analysis. For the plotting methods, I chose PCA and hierarchical clustering. Pairwise sequence alignment identifies regions of similarity between two protein or nucleic acid sequences. Protein expression analysis is used to understand how genes are transcribed and translated into functional products. Principal Component Analysis (PCA) is a dimensionality reduction method often used to reduce the number of dimensions in large data sets while keeping as much of their variance as possible. Hierarchical clustering groups similar objects into clusters and shows how distinct those clusters are from one another; it is often used to show where certain parts of a gene commonly show up or are highlighted. The data was downloaded from NCBI (https://www.ncbi.nlm.nih.gov/gene/6323).

### Loading in Packages

- Pysam: Pysam is a Python module that allows users to easily read and manipulate mapped sequence data, which is often stored in SAM or BAM files. It is a lightweight wrapper of the htslib C-API. Typically, you create an alignment file object, open the file, and read the mappings it contains. Here it is used to open the FASTA file for the pairwise sequence alignment. (https://pysam.readthedocs.io/en/latest/api.html#:~:text=Pysam%20is%20a%20python%20module,pysam%20followed%20by%20the%20API.)

- Matplotlib: Matplotlib is a library for creating visuals in Python, some of which can be interactive or even animated. With Matplotlib, one can create plots and graphs, build interactive figures that can be zoomed, export to many file formats, and customize visual styles. Within a graph, one can add labels, add lines, and change colors. Here Matplotlib is used with hierarchical clustering. (https://www.google.com/search?q=what+does+matplotlib+do+in+python&rlz=1C5CHFA_enUS797US797&oq=what+does+matplot&aqs=chrome.2.0i512j69i57j0i512l8.5414j0j4&sourceid=chrome&ie=UTF-8).

- Numpy: NumPy is an open source library used across science and engineering in Python. It is the foundation for working with numerical data in Python and is used by everyone from beginner coders to experienced researchers who code in their everyday work. NumPy provides multidimensional array and matrix data structures that let users work more efficiently with numerical data. Here NumPy is used in several ways to arrange my data (https://numpy.org/doc/stable/user/absolute_beginners.html#:~:text=NumPy%20can%20be%20used%20to,on%20these%20arrays%20and%20matrices.)

- Plotly: Plotly is another Python library for creating visuals. Most of the visuals created with Plotly are charts covering statistical, geographic, scientific, and even 3D use cases. It can also be used in non-web contexts such as desktop editors, which allows for more unique visuals. Here it was used for my PCA plot (https://plotly.com/python/getting-started/#:~:text=The%20plotly%20Python%20library%20is,3%2Ddimensional%20use%2Dcases.)

### Pairwise Sequence Alignment

In [2]:
import pandas as pd

In [3]:
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.pairwise2 import format_alignment

In [5]:
genes_df = pd.read_csv('data_table.tsv', sep='\t')

print(genes_df)

    gene_id gene_symbol                                   description  \
0      6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
1      6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
2      6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
3      6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
4      6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
5      6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
6      6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
7      6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
8      6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
9      6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
10     6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
11     6323       SCN1A  sodium voltage-gated channel alpha subunit 1   
12     6323       SCN1A  sodium voltage-gated chann

In [8]:
pip install pysam

Collecting pysam
  Downloading pysam-0.19.1-cp39-cp39-macosx_10_9_x86_64.whl (3.2 MB)
     |████████████████████████████████| 3.2 MB 1.6 MB/s eta 0:00:01
Installing collected packages: pysam
Successfully installed pysam-0.19.1
Note: you may need to restart the kernel to use updated packages.


In [9]:
from pysam import FastaFile

In [11]:
fasta = "gene1.fa"

sequences = FastaFile(fasta)

print(sequences)

<pysam.libcfaidx.FastaFile object at 0x7fa33ba8bc40>
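Printing the `FastaFile` object only shows its memory address, not the sequences. With pysam, the record names and sequences would normally be read through `sequences.references` and `sequences.fetch(name)`. As a reminder of what the file format itself looks like, here is a minimal pure-Python sketch that parses FASTA text. The helper name `parse_fasta` and the toy records are assumptions made for illustration; they are not taken from `gene1.fa`.

```python
from io import StringIO

def parse_fasta(handle):
    """Return {record_name: sequence} from an open FASTA text handle."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the record name is the first word after '>'
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may wrap, so collect them and join at the end
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

# Toy FASTA text (made-up records, not SCN1A sequence)
toy_fasta = StringIO(">record_a example\nTGTGA\nCTA\n>record_b\nCATGGTCA\n")
toy_records = parse_fasta(toy_fasta)
for rec_name, seq in toy_records.items():
    print(rec_name, len(seq), seq)
```

Sequences read this way could then be passed to the alignment step in place of hard-coded strings.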


In [14]:
seq1 = Seq("TGTGACTA")
seq2 = Seq("CATGGTCA")

alignments = pairwise2.align.globalxx(seq1, seq2)

for alignment in alignments:
    print(format_alignment(*alignment))

--TGTGACT-A
  || |  | |
CATG-G--TCA
  Score=5

--TGTGA-CTA
  || |  | |
CATG-G-TC-A
  Score=5

--TGTGACTA
  || |.| |
CATG-GTC-A
  Score=5

--TG-TGACTA
  || |  | |
CATGGT--C-A
  Score=5

--T-GTGACTA
  | ||  | |
CATGGT--C-A
  Score=5
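All five alignments share the same score because `globalxx` gives 1 point per matching character and charges nothing for mismatches or gaps, so the score is simply the largest number of characters that can be matched in order. The sketch below recomputes that score with a small dynamic-programming table in plain Python. The function name `globalxx_score` is a hypothetical helper for illustration and is not part of Biopython.

```python
def globalxx_score(a, b):
    """Best global alignment score with match=1, mismatch=0, gap=0."""
    # prev[j] is the best score for the already-processed prefix of a against b[:j]
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            # Align ch_a with ch_b (diagonal), or place a free gap in either sequence
            diagonal = prev[j - 1] + (1 if ch_a == ch_b else 0)
            curr.append(max(diagonal, prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

score = globalxx_score("TGTGACTA", "CATGGTCA")
print(score)
```

For the two example sequences this should agree with the `Score=5` reported above. Penalizing mismatches and gaps (for example with `pairwise2.align.globalms`) would usually leave fewer tied alignments.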



### Hierarchical Clustering

In [15]:
import matplotlib.pyplot as plt
import numpy as np

In [17]:
# Ten example 2-D points; X must be defined before it is plotted or clustered
X = np.array([[5,3],
    [10,15],
    [15,12],
    [24,10],
    [30,30],
    [85,70],
    [71,80],
    [60,78],
    [70,55],
    [80,91],])

In [18]:
# Scatter plot of the points, each annotated with its 1-based index
labels = range(1, 11)
plt.figure(figsize=(10, 7))
plt.subplots_adjust(bottom=0.1)
plt.scatter(X[:,0],X[:,1], label='True Position')

for label, x, y in zip(labels, X[:, 0], X[:, 1]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-3, 3),
        textcoords='offset points', ha='right', va='bottom')
plt.show()

In [19]:
from scipy.cluster.hierarchy import dendrogram, linkage

In [20]:
# Single-linkage clustering of X, drawn as a dendrogram
linked = linkage(X, 'single')

labelList = list(range(1, 11))

plt.figure(figsize=(10, 7))
dendrogram(linked,
            orientation='top',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=True)
plt.show()
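To make the `'single'` option concrete: single linkage repeatedly merges the two clusters whose closest members are nearest to each other. The sketch below reimplements that rule in plain Python on the same ten example points and prints the merge order. It is a slow illustrative version, not how SciPy implements `linkage`, and the function name `single_linkage` is an assumption.

```python
from math import dist

# The same ten example points as X, labelled 1 to 10
points = [(5, 3), (10, 15), (15, 12), (24, 10), (30, 30),
          (85, 70), (71, 80), (60, 78), (70, 55), (80, 91)]

def single_linkage(pts):
    """Return the merges as (cluster_a, cluster_b, distance) tuples."""
    clusters = [frozenset([i + 1]) for i in range(len(pts))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members
                d = min(dist(pts[p - 1], pts[q - 1])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = single_linkage(points)
for a, b, d in merges:
    print(sorted(a), "+", sorted(b), f"at distance {d:.2f}")
```

The height of each join in a dendrogram corresponds to the merge distance printed here, so the final merge is the tallest link.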

### Protein Expression Analysis

In [1]:
import numpy

def get_conditions_and_genes(work_counts):
    # work_counts maps condition -> {gene: count, ..., "Total": total count}
    # sorted() returns a list; dict views have no .sort() in Python 3
    conditions = sorted(work_counts.keys())
    all_genes = []
    for c in conditions:
        all_genes.extend(work_counts[c].keys())
    all_genes = sorted(set(all_genes))
    sizes = [work_counts[c]["Total"] for c in conditions]
    # "Total" is a bookkeeping entry, not a gene
    all_genes.remove("Total")
    return conditions, all_genes, sizes

def edger_matrices(work_counts):
    # Build the count matrix, group labels, and sizes for exactly two conditions
    conditions, all_genes, sizes = get_conditions_and_genes(work_counts)
    assert len(sizes) == 2
    groups = [1, 2]
    data = []
    final_genes = []
    for g in all_genes:
        cur_row = [int(work_counts[c][g]) for c in conditions]
        # Keep only genes with at least one count across the conditions
        if sum(cur_row) > 0:
            data.append(cur_row)
            final_genes.append(g)
    return (numpy.array(data), numpy.array(groups), numpy.array(sizes),
            conditions, final_genes)
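The functions above are defined but never called in this notebook. Judging from the code, `work_counts` maps each condition name to a dictionary of per-gene counts plus a `"Total"` entry. The sketch below uses made-up condition names, gene names, and counts (not real expression data) and plain lists instead of NumPy arrays to show which rows survive the zero-count filter. The helper `build_count_rows` is hypothetical, and unlike the original it treats a gene missing from one condition as zero rather than raising a `KeyError`.

```python
# Hypothetical input: two conditions, per-gene counts, and a "Total" per condition
toy_counts = {
    "control": {"GENE_A": 12, "GENE_B": 0, "GENE_C": 7, "Total": 19},
    "patient": {"GENE_A": 30, "GENE_B": 0, "Total": 30},
}

def build_count_rows(work_counts):
    """Return (conditions, kept_genes, rows, sizes) using plain lists."""
    conditions = sorted(work_counts)
    genes = sorted({g for c in conditions for g in work_counts[c]} - {"Total"})
    sizes = [work_counts[c]["Total"] for c in conditions]
    rows, kept_genes = [], []
    for g in genes:
        # Assumption: a gene absent from a condition counts as 0
        row = [int(work_counts[c].get(g, 0)) for c in conditions]
        # Drop genes with no counts in any condition
        if sum(row) > 0:
            rows.append(row)
            kept_genes.append(g)
    return conditions, kept_genes, rows, sizes

conditions, kept_genes, rows, sizes = build_count_rows(toy_counts)
for gene, row in zip(kept_genes, rows):
    print(gene, dict(zip(conditions, row)))
print("sizes:", sizes)
```

Each row holds one gene's counts in condition order, which is the layout `edger_matrices` returns as its first array.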

### PCA Plot

In [4]:
conda install -c plotly plotly_express

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/noahguzman/opt/anaconda3

  added / updated specs:
    - plotly_express


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.13.0               |   py39hecd8cb5_0         906 KB
    plotly-5.8.0               |             py_0         6.9 MB  plotly
    plotly_express-0.4.1       |             py_0           5 KB  plotly
    tenacity-8.0.1             |   py39hecd8cb5_0          33 KB
    ------------------------------------------------------------
                                           Total:         7.9 MB

The following NEW packages will be INSTALLED:

  plotly             plotly/noarch::plotly-5.8.0-py_0
  plotly_express     plotly/noarch::plotly_express-0.4.1-py_0
  tenacity           pkgs/main/osx-64::tenacity-8.0.1-py39hecd8cb5_0

The fo

In [2]:
import plotly.express as px
from sklearn.decomposition import PCA

ModuleNotFoundError: No module named 'plotly'

In [3]:
df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

pca = PCA()
components = pca.fit_transform(df[features])
labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    labels=labels,
    dimensions=range(4),
    color=df["species"]
)
fig.update_traces(diagonal_visible=False)
fig.show()

NameError: name 'px' is not defined
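The axis labels in the PCA cell are built from `pca.explained_variance_ratio_`, which is each eigenvalue of the data's covariance matrix divided by the sum of all eigenvalues. The sketch below works this out by hand for two-dimensional data, where the eigenvalues have a closed form. The function name and the toy points are assumptions made for illustration; they are not the iris data or SCN1A data.

```python
from math import sqrt

def explained_variance_ratio_2d(points):
    """Explained variance ratios of the two principal components of 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]] with divisor n - 1
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
    half_trace = (a + c) / 2
    spread = sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

# Made-up, nearly collinear points: almost all variance lies along one direction
toy_points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
ratios = explained_variance_ratio_2d(toy_points)
print([f"PC {i + 1} ({r * 100:.1f}%)" for i, r in enumerate(ratios)])
```

When the first ratio is close to 1, as with these toy points, a single principal component captures nearly all of the variation, which is the sense in which PCA reduces dimensionality.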