Bioinformatics

To clone this repository over HTTPS, run the command below in your terminal:

git clone https://github.com/heispv/bioinformatics.git

All the functions are collected in the Python file `functions.py`. 😉

This collection lists every notebook along with a short description and its publication date. Each notebook contains examples that show the results of its functions. 🙂

Notebook Name	Comment	Publish Date
Counting Words	To compute Count(Text, Pattern), we “slide a window” down Text, checking whether each k-mer substring of Text matches Pattern. We refer to the k-mer starting at position i of Text as Text(i, k).	21 July 2023
Best Common Patterns	To find the `best common patterns` (the most frequent k-mers) in a given sequence such as the ori region, we first create a function that builds a `frequency table` counting every k-mer of length k. A second function then selects the k-mers with the highest count in that table.	21 July 2023
Reverse Complement	In DNA, each strand is read in the 5' to 3' direction, and the two strands run in opposite directions. Therefore, to generate the complementary strand, we create a function that first reverses the original strand and then replaces each nucleotide with its complement.	22 July 2023
Pattern Index Finder V1 - Pattern Index Finder V2 (For Real Data)	Here we create a function that returns a list of the starting indices of a pattern in a sequence.	22 July 2023
Clump Finding	We define a k-mer as a "clump" if it appears many times within a short interval of the genome. More formally, given integers L and t, a k-mer Pattern forms an (L, t)-clump inside a (longer) string Genome if there is an interval of Genome of length L in which this k-mer appears at least t times. We created a function called `find_clumps(seq, k, L, t)`, which uses the function `freq_table(seq, k)` under the hood.	22 July 2023
Clump Finding (for Real Data)	In this notebook, we optimize the `find_clump(seq, k, L, t)` function so that it can detect clumps in real-world genomes, since the previous implementation was too slow for them. We also introduce a new function named `freq_index(seq, k)`, which contributes to the improved performance by providing a list of starting indices for each pattern.	23 July 2023
Skew Diagram	In this notebook, we create a function called `skew_diagram(seq)` to analyze a DNA sequence. The difference between the number of G and C nucleotides (the GC skew) increases along the forward half-strand, from the origin (ori) to the terminus (ter), and decreases along the reverse half-strand. To visualize these fluctuations, we score each nucleotide as follows: G as +1, C as -1, and A and T as 0, and plot the running total along the genome. The `skew_diagram` function generates this skew diagram, and the position of its minimum value marks the likely location of ori, where DNA replication begins.	25 July 2023
Hamming Distance	We say that position i in k-mers p1 … pk and q1 … qk is a mismatch if pi ≠ qi. For example, CGAAT and CGGAC have two mismatches. The number of mismatches between strings p and q is called the Hamming distance between these strings and is denoted `hamming_distance(p, q)`.	26 July 2023
Approximate Pattern Matching Problem	We say that a k-mer Pattern appears as a substring of Text with at most d mismatches if there is some k-mer substring Pattern' of Text having d or fewer mismatches with Pattern, i.e., `hamming_distance(Pattern, Pattern') ≤ d`. Our observation that a DnaA box may appear with slight variations leads to this generalization of the Pattern Matching Problem, which we address with the `approximate_pattern_matching(pattern, seq, d)` function.	26 July 2023
Generating the Neighborhood of a String	In this notebook, we define a function named `neighborhood(pattern, d)`. It uses the existing Hamming distance function to generate the set of all sequences whose Hamming distance from `pattern` is at most `d`.	28 July 2023
Frequent Word With Mismatch Problem	In this notebook, we implement a function named `freq_word_with_mismatch(seq, k, d)`. It takes a sequence, the k-mer length k, and the maximum allowed Hamming distance d, and finds the most frequent k-mers in the sequence while allowing up to d mismatches. This lets us identify frequently occurring patterns even when the sequence data contains slight variations or errors.	28 July 2023
Frequent Word With Mismatch and Reverse Complement	In this notebook, we implement a function named `freq_word_with_mismatch_reverse(seq, k, d)`. It differs from the previous one in that it also takes the reverse complement of each pattern into account.	4 August 2023
Finding the DnaA Boxes of Salmonella Enterica ✅	In this notebook, we use everything learned in the previous notebooks to locate the DnaA boxes of Salmonella enterica.	6 August 2023
A Brute Force Algorithm For Motif Finding	Brute force (also known as exhaustive search) is a general problem-solving technique that explores all possible solution candidates and checks whether each candidate solves the problem. Such algorithms require little effort to design and are guaranteed to produce a correct solution, but they may take an enormous amount of time when the number of candidates is large. A brute force approach to the Implanted Motif Problem is based on the observation that any (k, d)-motif must be at most d mismatches away from some k-mer appearing in the first string in Dna. Therefore, we can generate every k-mer within d mismatches of a k-mer in that first string and then check which of them are (k, d)-motifs. In this notebook, we use `motif_enumeration(dna_list, k, d)` to solve this problem.	7 August 2023
Distance Between Patterns and Strings	One step in solving the Median String Problem is to calculate the sum of the Hamming distances between a pattern and each string in a list of DNA strings, so I created a function called `patterns_strings_distance(pattern, dna_list)` for this step.	25 August 2023
Median String Problem	The runtime of an algorithm matters a great deal, especially for real-world examples containing millions of nucleotides, and the brute force algorithm is too slow for them. In this notebook, I wrote a function named `median_string(dna_list, k)` that finds a k-mer pattern minimizing the total Hamming distance between the pattern and the DNA sequences in `dna_list`.	25 August 2023
Greedy Motif Search	Given a profile matrix Profile, we can evaluate the probability of every k-mer in a string Text and find a Profile-most probable k-mer in Text, i.e., a k-mer that was most likely to have been generated by Profile among all k-mers in Text. For example, ACGGGGATTACC is the Profile-most probable 12-mer in GGTACGGGGATTACCT. Indeed, every other 12-mer in this string has probability 0. In general, if there are multiple Profile-most probable k-mers in Text, then we select the first such k-mer occurring in Text. So we can use the function `most_probable_kmer(seq, k, profile_matrix)` to address this problem.	27 August 2023
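
Several of the earlier notebooks in the table (Counting Words, Reverse Complement, Skew Diagram) rest on a single pass over the sequence. The sketch below is a minimal illustration of those three ideas, not the code from `functions.py`: the function names `pattern_count`, `reverse_complement`, `skew_values`, and `minimum_skew_positions` are hypothetical, and it assumes uppercase sequences over the alphabet A, C, G, T.

```python
def pattern_count(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    k = len(pattern)
    # Slide a window of length k down text and compare each k-mer to pattern.
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def reverse_complement(seq):
    """Reverse seq, then replace each nucleotide with its complement."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def skew_values(seq):
    """Running G-minus-C total: G scores +1, C scores -1, A and T score 0."""
    scores = {"G": 1, "C": -1}
    skew = [0]  # skew before reading any nucleotide
    for base in seq:
        skew.append(skew[-1] + scores.get(base, 0))
    return skew


def minimum_skew_positions(seq):
    """Positions where the skew reaches its minimum (candidate ori sites)."""
    skew = skew_values(seq)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


if __name__ == "__main__":
    print(pattern_count("GCGCG", "GCG"))
    print(reverse_complement("AAAACCCGGT"))
    print(minimum_skew_positions("CATGGGCATCGGCCATACGCC"))
```

Position `i` in the skew list is the skew after reading the first `i` nucleotides, so the list has one more entry than the sequence has characters.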

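The mismatch-based notebooks (Hamming Distance, Approximate Pattern Matching Problem, Generating the Neighborhood of a String) build on one another. The following is a rough sketch of how the three pieces can fit together. It reuses the function names mentioned in the table, but the bodies are an assumption for illustration and may differ from what `functions.py` contains. The recursive construction of the neighborhood is one common approach, chosen here for brevity.

```python
def hamming_distance(p, q):
    """Number of positions at which equal-length strings p and q differ."""
    if len(p) != len(q):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(p, q))


def approximate_pattern_matching(pattern, seq, d):
    """Start indices of k-mers in seq within d mismatches of pattern."""
    k = len(pattern)
    return [
        i
        for i in range(len(seq) - k + 1)
        if hamming_distance(pattern, seq[i:i + k]) <= d
    ]


def neighborhood(pattern, d):
    """Set of all strings within Hamming distance d of pattern."""
    if d == 0 or not pattern:
        return {pattern}
    if len(pattern) == 1:
        return {"A", "C", "G", "T"}
    neighbors = set()
    suffix = pattern[1:]
    for candidate in neighborhood(suffix, d):
        if hamming_distance(suffix, candidate) < d:
            # A mismatch is still available, so any first base is allowed.
            for base in "ACGT":
                neighbors.add(base + candidate)
        else:
            # All d mismatches are used up; keep the original first base.
            neighbors.add(pattern[0] + candidate)
    return neighbors


if __name__ == "__main__":
    print(hamming_distance("CGAAT", "CGGAC"))
    print(approximate_pattern_matching("AA", "AATAA", 1))
    print(sorted(neighborhood("AC", 1)))
```

A k-mer has 1 + 3k neighbors within distance 1 (itself plus three substitutions at each position), which is a quick sanity check on the output for small inputs.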