# Open Reading Frames & Regression

**1(a)**

``` r
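findStartCodons <- function(seq){
  startcodons <- numeric(0)
  # Stop 5 positions before the end so that a start codon at i (i to i+2)
  # still leaves room for one full codon (i+3 to i+5) after it
  for(i in 1:(length(seq)-5)){
    if (seq[i] == "a" && seq[i + 1] == "t" && seq[i + 2] == "g") {
      # Record the position of the first base of the start codon
      startcodons <- c(startcodons, i)
    }
  }
  return(startcodons)
}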

seq <- c("g", "t", "a", "a", "t", "g", "t", "a", "g", "t", "g", "a", "t", "t", "g", "t", "a", "g")
findStartCodons(seq)
```

    [1] 4

**1(b)**

``` r
findStopCodons <- function(seq){
  stopcodons <- numeric(0)
  k <- 1
  for(i in 1:(length(seq)-2)){
    if((seq[i] == "t" && seq[i+1] == "a" && seq[i+2] == "a") ||
      (seq[i] == "t" && seq[i+1] == "g" && seq[i+2] == "a") ||
      (seq[i] == "t" && seq[i+1] == "a" && seq[i+2] == "g")){
      stopcodons[k]<- i
      k <- k+1
    }
  }
  return(stopcodons)
}

seq<-c("g", "t", "a", "a", "t", "g", "t", "a", "g", "t", "g", "a", "t", "t", "g", "t", "a", "g")
findStopCodons(seq)
```

    [1]  2  7 10 16

**1(c)**

``` r
findStopCodons <- function(seq){
  stopcodons <- numeric(0)
  k <- 1
  for(i in 1:length(seq)){
    if((seq[i] == "t" && seq[i+1] == "a" && seq[i+2] == "a") ||
      (seq[i] == "t" && seq[i+1] == "g" && seq[i+2] == "a") ||
      (seq[i] == "t" && seq[i+1] == "a" && seq[i+2] == "g")){
      stopcodons[k]<- i
      k <- k+1
    }
  }
  return(stopcodons)
}

seq<-c("g", "t", "a", "a", "t", "g", "t", "a", "g", "t", "g", "a", "t", "t", "g", "t", "a", "g")
findStopCodons(seq)
```

    [1]  2  7 10 16

**1(d)** <span style="color:blue"> To ensure that an identified start
codon can be part of an open reading frame, we need to leave room for a
stop codon to occur after the start codon. By iterating over
1:(length(seq)-5) in 1(a), a start codon found at position i (occupying
i to i+2) is always followed by at least three more elements (i+3 to
i+5), which is enough room for one stop codon. On the other hand, in
1(b), when searching for stop codons, there is no need to leave
additional elements after the stop codon, since nothing has to follow
it. The stop codons can occur immediately after the start codon or after
any number of nucleotides within the ORF. Therefore, the loop iterates
over 1:(length(seq)-2) to cover every position where a complete
three-base codon can begin. </span>
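
One detail worth checking when writing these loop bounds is operator precedence: in R the colon operator binds more tightly than subtraction, so the parentheses in 1:(length(seq)-5) matter. The following is a minimal sketch using the example sequence from 1(a); the variable names `dna`, `unparenthesized`, and `parenthesized` are illustrative and not part of the lab code.

``` r
# Example sequence from 1(a); `dna` is an illustrative name
dna <- c("g", "t", "a", "a", "t", "g", "t", "a", "g",
         "t", "g", "a", "t", "t", "g", "t", "a", "g")
n <- length(dna)  # 18

# Without parentheses R builds 1:18 first, then subtracts 5 from every element
unparenthesized <- 1:n - 5

# With parentheses the upper bound is reduced before the sequence is built
parenthesized <- 1:(n - 5)

print(range(unparenthesized))  # -4 13
print(range(parenthesized))    #  1 13
```

The unparenthesized form would start the loop at negative indices, so the parenthesized form is the one the functions in 1(a) and 1(b) rely on.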

**2**

``` r
paste("Lacy", "loves", "the", "Friday BIOS20172", "Lab")
```

    [1] "Lacy loves the Friday BIOS20172 Lab"

**3** <span style="color:blue"> The line a \<- 23:33 creates a vector a
containing the integers 23 through 33, whereas the code b \<- c(40, 30,
12, 9, 27) creates a vector b with the specified values. The %in%
operator is used to check whether each element of a is present in b. In
other words, R compares each element of a against all the elements in b
to determine if there is a match. The output is a logical vector of the
same length as a, where each element indicates whether the corresponding
element in a is found in b. For example, the first element of a is 23.
When compared with all the elements of b, 23 is not found, so the first
element of the output is FALSE. The same check is repeated for every
remaining element of a. Each TRUE indicates that the corresponding
element in a is found in b, while each FALSE indicates that the
corresponding element in a is not present in b. </span>
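
As a quick check of the description above, the two vectors can be compared directly. This is a small sketch; the name `matches` is illustrative.

``` r
a <- 23:33
b <- c(40, 30, 12, 9, 27)

# One logical value per element of a
matches <- a %in% b
print(matches)

# The elements of a that were found in b, and their positions in a
print(a[matches])     # 27 30
print(which(matches)) # 5 8
```

Only 27 and 30 appear in both vectors, so the fifth and eighth entries of the result are TRUE and the other nine are FALSE.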

**4**

``` r
a <- c("xyz")
b <- letters

a%in%b
```

    [1] FALSE

**5**

``` r
for(i in 20:30){
  if(i==28){
    break
  }
  print(i)
}
```

    [1] 20
    [1] 21
    [1] 22
    [1] 23
    [1] 24
    [1] 25
    [1] 26
    [1] 27

<span style="color:blue"> 1. The loop starts with i being assigned 20
(the first value in the sequence) </span>

<span style="color:blue"> 2. The condition i == 28 is checked. Since i
is not equal to 28, the if statement condition evaluates to FALSE, and
the code inside the if block is not executed. </span>

<span style="color:blue"> 3. The value of i (which is 20) is printed
using the print() function. </span>

<span style="color:blue"> 4. The loop moves to the next iteration, and i
takes the next value in the sequence, 21. The loop continues as
previously stated, printing the value of i on each iteration until i
reaches 28. When i becomes 28, the condition i == 28 becomes TRUE, and
the code inside the if block is executed. The break statement is
encountered, which immediately exits the loop, stopping any further
iterations, so 28 itself is never printed. The output lists the numbers
20 through 27, inclusive. </span>

<span style="color:blue"> 5. The break statement is responsible for
ending the loop early as soon as i becomes 28 (as per the condition
i == 28). </span>

**6**

``` r
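findORF <- function(seq) {
  startcodon <- findStartCodons(seq)
  stopcodon <- findStopCodons(seq)

  usedStop <- numeric(0)
  ORFs <- character(0)
  k <- 1

  for (start in startcodon) {
    for (stop in stopcodon) {
      # Only the first in-frame stop codon downstream of this start matters
      if ((stop - start) %% 3 == 0 && stop > start) {
        if (stop %in% usedStop) {
          # This stop codon already closes an ORF from an earlier start codon
          break
        } else if (stop - start + 1 < 300) {
          # Too short: below the 300-nucleotide (about 100 codon) cutoff
          break
        } else {
          # stop is the first base of the stop codon, so the ORF ends at stop + 2
          ORFs[k] <- paste("ORF starting at", start, "and ending at", stop + 2)
          usedStop[k] <- stop
          k <- k + 1
          break
        }
      }
    }
  }
  return(ORFs)
}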
```

**7**

``` r
setwd("C:/Users/15ull/OneDrive/Desktop/Labs - BIOS 20172 (Spring)/Lab5-BIOS20172_Spring2023")

library(seqinr)
```

    Warning: package 'seqinr' was built under R version 4.2.3

``` r
covid <- read.fasta("covid19.fasta")[[1]]
ORFs <- findORF(covid)
print(ORFs)
```

    [1] "ORF starting at 266 and ending at 13483"  
    [2] "ORF starting at 13768 and ending at 21555"
    [3] "ORF starting at 21536 and ending at 25384"
    [4] "ORF starting at 25393 and ending at 26220"
    [5] "ORF starting at 26523 and ending at 27191"
    [6] "ORF starting at 27394 and ending at 27759"
    [7] "ORF starting at 27894 and ending at 28259"
    [8] "ORF starting at 28274 and ending at 29533"

**8**

Yes, it is surprising that there are only 8 ORFs found considering the
number of proteins coded by the virus. As per the article provided, the
SARS-CoV-2 genomic sequences have 12 ORFs that encode 27 proteins. In
the findORF function code, we limited results to ORFs with a minimum
length of 100 amino acids. This could account for the 4 ORFs that aren’t
reported from the covid19.fasta data.

A genome can encode more proteins than it has ORFs when a single ORF is
translated into a polyprotein that proteases then cleave into several
mature proteins, as described for the Zika virus in question 10. Once
proteins are formed, they can also undergo further modifications,
including phosphorylation, methylation, acetylation, etc. These
modifications can alter the protein’s structure, function, and
interaction with other proteins, but they do not by themselves increase
the number of distinct proteins encoded. Alternative splicing is
unlikely to contribute, because coronaviruses replicate in the
cytoplasm, away from the host’s splicing machinery.

**9**

``` r
library(seqinr)

zika <- read.fasta("zika.fasta")[[1]]
orfs <- findORF(zika)
print(orfs)
```

    [1] "ORF starting at 107 and ending at 10366"

**10**

<span style="color:blue"> Only one ORF was found using the findORF
function. This is surprising given the relatively large size of the Zika
virus genome (10,794 base pairs as per the article linked). According to
the article, the entire Zika virus genome serves as both the viral
genome and the mRNA. The genome is translated into a single polyprotein
that is 3,419 amino acids long. This polyprotein is then cleaved by both
host and viral proteases, yielding all the structural and non-structural
proteins required for viral replication. The strategy of having a single
ORF and producing a polyprotein is a common feature among many
positive-sense RNA viruses like the Zika virus. By encoding multiple
proteins within a single ORF, positive-sense RNA viruses can maximize
their coding capacity while minimizing their genome size. Furthermore,
this compact arrangement is advantageous as it allows for efficient
packaging of genetic material within the viral particle.
</span>
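
The ORF coordinates reported in question 9 can be checked against the polyprotein length quoted from the article. This is a small arithmetic sketch; it assumes the reported end position includes the stop codon (findORF reports stop + 2), and the variable names are illustrative.

``` r
orf_start <- 107
orf_end <- 10366

# Length of the ORF in nucleotides, counting both endpoints
orf_length_nt <- orf_end - orf_start + 1
print(orf_length_nt)  # 10260

# Number of codons, including the stop codon
codons <- orf_length_nt / 3
print(codons)  # 3420

# The stop codon does not code for an amino acid
amino_acids <- codons - 1
print(amino_acids)  # 3419
```

The result matches the 3,419 amino acids cited above, which supports reading the single ORF as the full polyprotein coding region.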

**11**

<span style="color:blue"> Null Hypothesis: There is no association or
relationship between the semester of birth and right-thumb or left-thumb
preference. (In other words, the two are independent.) Note: The wording
was changed from quarter to semester since the Q1/Q2 and Q3/Q4 columns
had to be lumped together to perform the chi-squared test, in order to
account for the values \<5 in the expected table. </span>

<span style="color:blue"> Alternative Hypothesis: There is an
association or relationship between the semester of birth and
right-thumb or left-thumb preference. (In other words, the two are not
independent.) </span>

**12**

``` r
x <- matrix(c(24.32,17.68,42,8.68,6.32,15,33,24,57), nrow = 3, byrow = TRUE)
colnames(x) <- c("S1", "S2", "Totals")
rownames(x) <- c("Left-Thumb", "Right Thumb", "Totals")
print(x)
```

                   S1    S2 Totals
    Left-Thumb  24.32 17.68     42
    Right Thumb  8.68  6.32     15
    Totals      33.00 24.00     57

``` r
T <- matrix(c(27,15,42, 6,9, 15, 33, 24, 57), nrow = 3, byrow = TRUE)
colnames(T) <- c("S1", "S2", "Totals")
rownames(T) <- c("Left-Thumb", "Right Thumb", "Totals")
print(T)
```

                S1 S2 Totals
    Left-Thumb  27 15     42
    Right Thumb  6  9     15
    Totals      33 24     57

``` r
# Expected count for each cell = (row total * column total) / grand total.
# Plain * is used because these are scalars; %*% is matrix multiplication.
n <- T[3,3]
e11 <- T[1,3] * T[3,1] / n  # Left-Thumb, S1
e12 <- T[1,3] * T[3,2] / n  # Left-Thumb, S2
e21 <- T[2,3] * T[3,1] / n  # Right Thumb, S1
e22 <- T[2,3] * T[3,2] / n  # Right Thumb, S2

expected <- matrix(c(e11,e12,42,e21,e22,15,33,24,57), nrow = 3, byrow = TRUE)
colnames(expected) <- c("S1", "S2", "Totals")
rownames(expected) <- c("Left-Thumb", "Right Thumb", "Totals")
print(expected)
```

                       S1        S2 Totals
    Left-Thumb  24.315789 17.684211     42
    Right Thumb  8.684211  6.315789     15
    Totals      33.000000 24.000000     57
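
The same expected counts can be obtained for all cells at once as the outer product of the row and column totals divided by the grand total. The sketch below is one possible way to do this, not the method required by the lab; the names `observed`, `expected_counts`, and `builtin` are illustrative.

``` r
# Observed counts only, without the Totals row and column
observed <- matrix(c(27, 15, 6, 9), nrow = 2, byrow = TRUE,
                   dimnames = list(c("Left-Thumb", "Right Thumb"),
                                   c("S1", "S2")))

# (row total * column total) / grand total for every cell
expected_counts <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected_counts, 2))

# chisq.test() stores the expected counts it uses in its $expected component
builtin <- chisq.test(observed)$expected
print(all.equal(expected_counts, builtin, check.attributes = FALSE))
```

Comparing against `chisq.test(observed)$expected` is a convenient cross-check for tables that were filled in by hand.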

**13**

``` r
chi_squared <- function(O, E){
  calculation <- sum((O-E)^2 / E)
  return(calculation)
}
```

**14**

``` r
Matrix1 <- matrix(c(27,15,6,9), nrow = 2, byrow = TRUE)
Matrix2 <- matrix(c(24.32,17.68,8.68,6.32), nrow = 2, byrow = TRUE)

O <- Matrix1
E <- Matrix2
result <- chi_squared(O, E)
print(result)
```

    [1] 2.665494

**15**

``` r
CI95 <- qchisq(p= 1-0.05, df = 1, ncp = 0, lower.tail = TRUE, log.p = FALSE)
print(CI95)
```

    [1] 3.841459

``` r
CI99 <- qchisq(p= 1-0.01, df = 1, ncp = 0, lower.tail = TRUE, log.p = FALSE)
print(CI99)
```

    [1] 6.634897

``` r
CI99.5 <- qchisq(p= 1-0.005, df = 1, ncp = 0, lower.tail = TRUE, log.p = FALSE)
print(CI99.5)
```

    [1] 7.879439

``` r
CI95 <- qchisq(p= 1-0.05, df = 3, ncp = 0, lower.tail = TRUE, log.p = FALSE)
print(CI95)
```

    [1] 7.814728

``` r
CI99 <- qchisq(p= 1-0.01, df = 3, ncp = 0, lower.tail = TRUE, log.p = FALSE)
print(CI99)
```

    [1] 11.34487

``` r
CI99.5 <- qchisq(p= 1-0.005, df = 3, ncp = 0, lower.tail = TRUE, log.p = FALSE)
print(CI99.5)
```

    [1] 12.83816

<span style="color:blue"> Note: When making the table of expected
values, I lumped columns together since values less than 5 were present
in multiple cells. For this question, I found the threshold chi-squared
values for the grouped and ungrouped tables. The grouped table has two
columns (semesters) and one degree of freedom. The ungrouped table has
four columns (quarters) and three degrees of freedom. </span>

<span style="color:blue"> The chi-squared value calculated in #14 is
smaller than all of the threshold chi-squared values. Therefore, we fail
to reject the null hypothesis at every significance level tested. The
data do not provide significant evidence of an association between the
semester of birth and right-thumb or left-thumb preference, and are
consistent with the two being independent of one another. </span>
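
Comparing the statistic with a threshold is equivalent to comparing its p-value with the significance level. As a minimal sketch, the p-value for the statistic from #14 can be read from the upper tail of the chi-squared distribution with `pchisq`; the variable names are illustrative.

``` r
statistic <- 2.665494  # chi-squared value from question 14
df <- 1                # grouped table: (2 - 1) * (2 - 1)

# Upper-tail probability of a value at least this large under the null
p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
print(round(p_value, 4))  # about 0.1025

# Same decision as comparing with the 0.05 threshold from qchisq
critical_95 <- qchisq(0.95, df = df)
print(statistic > critical_95)  # FALSE, so we fail to reject the null
```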

**16**

``` r
Matrix1 <- matrix(c(27,15, 6,9), nrow = 2, byrow = TRUE)
chisq.test(Matrix1)
```

        Pearson's Chi-squared test with Yates' continuity correction

    data:  Matrix1
    X-squared = 1.7707, df = 1, p-value = 0.1833

``` r
Matrix2 <- matrix(c(15,12,6,9,3,3,2,7),nrow=2, byrow = TRUE)
chisq.test(Matrix2)
```

    Warning in chisq.test(Matrix2): Chi-squared approximation may be incorrect

        Pearson's Chi-squared test

    data:  Matrix2
    X-squared = 3.688, df = 3, p-value = 0.2972

<span style="color:blue"> The result of this chi-squared test agrees
with the result from #14 in the sense that we fail to reject the null
hypothesis at every significance level tested, so there is no
significant evidence of an association between semester of birth and
right-thumb or left-thumb preference. The p-value associated with the
results of #14 is 0.1025, which is lower than the p-value of 0.1833
calculated here. The difference arises because chisq.test applies Yates’
continuity correction to 2x2 tables by default, which lowers the
statistic from about 2.67 to 1.77. In both cases the p-value is greater
than 0.05, so the result is not statistically significant; this reflects
a lack of evidence against the null hypothesis rather than evidence for
it. Degrees of freedom for both tables are the same. Columns were lumped
from quarters to semesters to account for values \<5 such that the
chi-squared test can be performed (rule of 5). </span>

<span style="color:blue"> Out of curiosity, I also ran the chi-squared
test on the original observed table. This table violates the rule of 5
when expected values are calculated, which is why R warns that the
chi-squared approximation may be incorrect. I wanted to see whether the
code would still run. Regardless, the p-value exceeds 0.05, so there is
again no significant evidence against the null hypothesis. The
calculated chi-squared value is less than all threshold values
calculated in #15. There are three degrees of freedom since the table
has two rows and four quarters: (2-1)(4-1) = 3.</span>
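
The gap between the hand calculation in #14 and the chisq.test output can be examined by switching off the continuity correction with the `correct` argument. This is a sketch for comparison only; the names `observed`, `uncorrected`, and `corrected` are illustrative.

``` r
observed <- matrix(c(27, 15, 6, 9), nrow = 2, byrow = TRUE)

# Pearson statistic without Yates' continuity correction
uncorrected <- chisq.test(observed, correct = FALSE)

# Default behaviour for a 2x2 table: Yates' correction applied
corrected <- chisq.test(observed)

print(unname(uncorrected$statistic))  # about 2.67, close to the value from #14
print(unname(corrected$statistic))    # about 1.77, as printed above
print(c(uncorrected = uncorrected$p.value, corrected = corrected$p.value))
```

The uncorrected statistic differs slightly from the 2.665 in #14 because #14 used expected counts rounded to two decimal places.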

**17**

``` r
bitter <- c("PTC Bitter" = 10, "PTC Not Bitter" = 7)
not <- c("PTC Bitter" = 27, "PTC Not Bitter" = 32)
kaleLeaf <- rbind("Plant Bitter" = bitter, "Plant Not Bitter" = not)

bitter <- c("PTC Not Bitter" = 7, "PTC Bitter" = 8)
not <- c("PTC Not Bitter" = 32, "PTC Bitter" = 29)
brusselSprouts <- rbind("Vegetable Not Bitter" = not, "Vegetable Bitter" = bitter)

bitter <- c("PTC Not Bitter" = 5, "PTC Bitter" = 8)
not <- c("PTC Not Bitter" = 34, "PTC Bitter" = 29)
broccoli <- rbind("Vegetable Not Bitter" = not, "Vegetable Bitter" = bitter)

bitter <- c("PTC Not Bitter" = 3, "PTC Bitter" = 7)
not <- c("PTC Not Bitter" = 36, "PTC Bitter" = 30)
rapiniStem <- rbind("Vegetable Not Bitter" = not, "Vegetable Bitter" = bitter)

bitter <- c("PTC Not Bitter" = 7, "PTC Bitter" = 16)
not <- c("PTC Not Bitter" = 32, "PTC Bitter" = 21)
rapiniLeaf <- rbind("Vegetable Not Bitter" = not, "Vegetable Bitter" = bitter)

bitter <- c("PTC Not Bitter" = 26, "PTC Bitter" = 31)
not <- c("PTC Not Bitter" = 13, "PTC Bitter" = 6)
arugulaLeaf <- rbind("Vegetable Not Bitter" = not, "Vegetable Bitter" = bitter)

print(kaleLeaf)
```

                     PTC Bitter PTC Not Bitter
    Plant Bitter             10              7
    Plant Not Bitter         27             32

``` r
print(brusselSprouts)
```

                         PTC Not Bitter PTC Bitter
    Vegetable Not Bitter             32         29
    Vegetable Bitter                  7          8

``` r
print(broccoli)
```

                         PTC Not Bitter PTC Bitter
    Vegetable Not Bitter             34         29
    Vegetable Bitter                  5          8

``` r
print(rapiniStem)
```

                         PTC Not Bitter PTC Bitter
    Vegetable Not Bitter             36         30
    Vegetable Bitter                  3          7

``` r
print(rapiniLeaf)
```

                         PTC Not Bitter PTC Bitter
    Vegetable Not Bitter             32         21
    Vegetable Bitter                  7         16

``` r
print(arugulaLeaf)
```

                         PTC Not Bitter PTC Bitter
    Vegetable Not Bitter             13          6
    Vegetable Bitter                 26         31

**18**

<span style="color:blue"> A chi-squared test can be used to test for a
possible association between PTC bitterness and vegetable bitterness.
All but one of the expected tables contain no values less than 5. The
exception is the rapini stem table, where one expected count is 4.87, so
the rule of 5 is slightly violated there. </span>

**19**

<span style="color:blue"> Null Hypothesis: There is no relationship or
association between PTC bitterness and vegetable bitterness. In other
words, PTC bitterness is independent of vegetable bitterness. </span>

<span style="color:blue"> Alternative Hypothesis: There is a
relationship or association between PTC bitterness and vegetable
bitterness. In other words, PTC bitterness and vegetable bitterness are
not independent of one another.</span>

``` r
# Expected count for each cell = (row total * column total) / grand total
bitter <- c("PTC Bitter" = 8.28, "PTC Not Bitter" = 8.72)
not <- c("PTC Bitter" = 28.72, "PTC Not Bitter" = 30.28)
kaleLeaf <- rbind("Plant Bitter" = bitter, "Plant Not Bitter" = not)

bitter <- c("PTC Not Bitter" = 7.70, "PTC Bitter" = 7.30)
not <- c("PTC Not Bitter" = 31.30, "PTC Bitter" = 29.70)
brusselSprouts <- rbind("Vegetable Not Bitter" = not, "Vegetable Bitter" = bitter)

bitter <- c("PTC Not Bitter" = 6.67, "PTC Bitter" = 6.34)
not <- c("PTC Not Bitter" = 32.33, "PTC Bitter" = 30.67)
broccoli <- rbind("Vegetable Not Bitter" = not, "Vegetable Bitter" = bitter)

bitter <- c("PTC Not Bitter" = 5.13, "PTC Bitter" = 4.87)
not <- c("PTC Not Bitter" = 33.89, "PTC Bitter" = 32.13)
rapiniStem <- rbind("Vegetable Not Bitter" = not, "Vegetable Bitter" = bitter)

bitter <- c("PTC Not Bitter" = 11.80, "PTC Bitter" = 11.20)
not <- c("PTC Not Bitter" = 27.20, "PTC Bitter" = 25.80)
rapiniLeaf <- rbind("Vegetable Not Bitter" = not, "Vegetable Bitter" = bitter)

bitter <- c("PTC Not Bitter" = 29.25, "PTC Bitter" = 27.75)
not <- c("PTC Not Bitter" = 9.75, "PTC Bitter" = 9.25)
arugulaLeaf <- rbind("Vegetable Not Bitter" = not, "Vegetable Bitter" = bitter)

print(kaleLeaf)
```

                     PTC Bitter PTC Not Bitter
    Plant Bitter           8.28           8.72
    Plant Not Bitter      28.72          30.28

``` r
print(brusselSprouts)
```

                         PTC Not Bitter PTC Bitter
    Vegetable Not Bitter           31.3       29.7
    Vegetable Bitter                7.7        7.3

``` r
print(broccoli)
```

                         PTC Not Bitter PTC Bitter
    Vegetable Not Bitter          32.33      30.67
    Vegetable Bitter               6.67       6.34

``` r
print(rapiniStem)
```

                         PTC Not Bitter PTC Bitter
    Vegetable Not Bitter          33.89      32.13
    Vegetable Bitter               5.13       4.87

``` r
print(rapiniLeaf)
```

                         PTC Not Bitter PTC Bitter
    Vegetable Not Bitter           27.2       25.8
    Vegetable Bitter               11.8       11.2

``` r
print(arugulaLeaf)
```

                         PTC Not Bitter PTC Bitter
    Vegetable Not Bitter           9.75       9.25
    Vegetable Bitter              29.25      27.75

<span style="color:blue"> I’m not sure what the question is referring to
when it asks for PTC-curly kale stem data. I calculated the expected
tables associated with the six observed tables provided in question 17,
using (row total x column total) / grand total for each cell. I asked a
TA, but never heard back. The R function chisq.test is used in question
20 (I also asked a TA about this via email and was told I was fine to
put the chisq.test under question 20). The only expected table that
violates the rule of 5 is the table for the rapini stem, where one
expected count is 4.87. In this case, Fisher’s exact test may be more
appropriate. All expected tables were calculated by hand. </span>
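
As a cross-check on the hand calculation, the expected counts and the rule-of-5 check for the rapini stem can be sketched as follows. The name `rapiniStemObserved` is illustrative (it holds the observed counts from question 17 under a separate name), and `fisher.test` is shown only as the alternative mentioned above.

``` r
# Observed rapini stem counts from question 17, under an illustrative name
rapiniStemObserved <- matrix(c(36, 30, 3, 7), nrow = 2, byrow = TRUE,
                             dimnames = list(c("Vegetable Not Bitter", "Vegetable Bitter"),
                                             c("PTC Not Bitter", "PTC Bitter")))

# Expected counts as computed by chisq.test(); the small-count warning is
# suppressed here because only the $expected component is of interest
expected <- suppressWarnings(chisq.test(rapiniStemObserved))$expected
print(round(expected, 2))

# Rule of 5: is any expected count below 5?
print(any(expected < 5))

# Fisher's exact test does not rely on the chi-squared approximation
fisher_p <- fisher.test(rapiniStemObserved)$p.value
print(fisher_p)
```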

**20**

``` r
chisq.test(kaleLeaf)
```

        Pearson's Chi-squared test with Yates' continuity correction

    data:  kaleLeaf
    X-squared = 0.24048, df = 1, p-value = 0.6239

``` r
chisq.test(brusselSprouts)
```

        Pearson's Chi-squared test with Yates' continuity correction

    data:  brusselSprouts
    X-squared = 5.1764e-32, df = 1, p-value = 1

``` r
chisq.test(broccoli)
```

        Pearson's Chi-squared test with Yates' continuity correction

    data:  broccoli
    X-squared = 5.4452e-30, df = 1, p-value = 1

``` r
chisq.test(rapiniStem)
```

    Warning in chisq.test(rapiniStem): Chi-squared approximation may be incorrect

        Pearson's Chi-squared test with Yates' continuity correction

    data:  rapiniStem
    X-squared = 8.02e-31, df = 1, p-value = 1

``` r
chisq.test(rapiniLeaf)
```

        Pearson's Chi-squared test with Yates' continuity correction

    data:  rapiniLeaf
    X-squared = 2.3831e-31, df = 1, p-value = 1

``` r
chisq.test(arugulaLeaf)
```

        Pearson's Chi-squared test

    data:  arugulaLeaf
    X-squared = 0, df = 1, p-value = 1

<span style="color:blue"> Note: question 19 stored the expected tables
under the same names as the observed tables from question 17, so the
chisq.test calls above were run on expected counts rather than observed
counts. That is why the printed statistics are at or near zero, with
p-values between 0.62 and 1; they do not test the observed data. The
interpretation below refers to chisq.test run on the observed tables
from question 17. </span>

<span style="color:blue"> Every test except for the chi-squared test
performed on the RapiniLeaf table gives a p-value that exceeds 0.05, so
the results are not statistically significant and we fail to reject the
null hypothesis. In those cases the data provide no significant evidence
of an association between PTC and vegetable bitterness. The p-value
calculated for the RapiniLeaf table is 0.0316, which is less than 0.05.
</span>

<span style="color:blue"> The threshold chi-squared value for one degree
of freedom and a significance level of 0.05 is approximately 3.84. All
calculated chi-squared values are less than 3.84 except for the
chi-squared value calculated for the RapiniLeaf. Therefore, we would
fail to reject the null hypothesis in every case except for the
RapiniLeaf. In the case of the RapiniLeaf, we reject the null and
conclude that there is an association between PTC and vegetable
bitterness. One degree of freedom is used for every contingency table
since each has two rows and two columns: (2-1)(2-1) = 1. </span>
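
The tests on the observed counts can be sketched as below, keeping the observed tables from question 17 in a separate list so that they cannot be confused with the expected tables. The helper `make_table` and the names `observed` and `p_values` are illustrative; row and column order does not affect the chi-squared statistic, so the labels are omitted.

``` r
# Build a 2x2 table from four counts given row by row
make_table <- function(counts) matrix(counts, nrow = 2, byrow = TRUE)

# Observed counts from question 17
observed <- list(
  kaleLeaf       = make_table(c(10, 7, 27, 32)),
  brusselSprouts = make_table(c(32, 29, 7, 8)),
  broccoli       = make_table(c(34, 29, 5, 8)),
  rapiniStem     = make_table(c(36, 30, 3, 7)),
  rapiniLeaf     = make_table(c(32, 21, 7, 16)),
  arugulaLeaf    = make_table(c(13, 6, 26, 31))
)

# chisq.test() with its default Yates' correction for 2x2 tables; warnings
# are suppressed because the rapini stem table has an expected count below 5
p_values <- sapply(observed, function(tab) suppressWarnings(chisq.test(tab))$p.value)
print(round(p_values, 4))

# Tables that are significant at the 0.05 level
print(names(p_values)[p_values < 0.05])
```

Run this way, only the rapini leaf table falls below 0.05, with a p-value of about 0.0316, in line with the interpretation above.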

**21**

<span style="color:blue"> The data largely show no significant
association between PTC and vegetable bitterness. Biologically, this
could be attributed to several factors including the following: genetic
variation, taste perception thresholds, compensatory taste mechanisms,
and psychological factors. </span>

<span style="color:blue"> Genetic Variation: Individuals may have
genetic variations that affect their ability to taste bitterness.
Genetic factors can influence taste receptor genes and their sensitivity
to bitter compounds. Therefore, even if vegetables contain varying
levels of bitterness, some individuals may not be able to perceive it
due to their genetic makeup.</span>

<span style="color:blue"> Taste Perception Thresholds: Taste perception
thresholds can vary among individuals. Some individuals may have higher
or lower thresholds for perceiving bitterness. Even if a vegetable is
objectively bitter, individuals with higher thresholds may not perceive
it as strongly, or vice versa. </span>

<span style="color:blue"> Compensatory Taste Mechanisms: The human sense
of taste is complex, and our perception of taste is influenced by
various factors. Compensatory taste mechanisms, such as the presence of
other flavors or textures in a meal, can modulate our perception of
bitterness. It is possible that individuals may find ways to mask or
counteract the bitterness of vegetables through other sensory cues,
making the perceived association between PTC and vegetable bitterness
weaker. </span>

<span style="color:blue"> Psychological Factors: Psychological factors,
such as expectations and preferences, can also influence taste
perception. If individuals have preconceived notions or strong
preferences for certain vegetables, it may affect their perception of
bitterness. For example, individuals who enjoy a particular vegetable
may perceive it as less bitter than individuals who dislike it,
regardless of its actual bitterness level. </span>