# Welcome to the first Python workshop: Bit by Bit

Kindly organised by SBS Biohackathon Committee

Instructors for the day: Sundara and Min Qi!


## Learning Objectives:

1. Understand Bits and Bitwise Operators
2. Build your own exact string matching algorithm
3. Build your own inexact string matching algorithm
4. Apply the Shift-Or algorithm to find genes of interest within a queried genetic sequence

# UNDERSTANDING BITS

## What is Binary Code (Bits)?
- The bit is the most basic unit of information in computing and digital communications. 
- The bit represents a logical state with one of two possible values (0 or 1)
- Each digit position represents an increasing power of 2
- Uses: flags, on/off switches, data storage, etc.
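
To make the "increasing power of 2" rule concrete, here is a minimal sketch that expands a binary string position by position. The helper name `binary_to_decimal` is made up for illustration and is not part of the workshop code.

```python
def binary_to_decimal(bits):
    """Add up the power of 2 for every position that holds a 1."""
    total = 0
    # Walk from the rightmost digit, whose weight is 2**0
    for position, digit in enumerate(reversed(bits)):
        if digit == "1":
            total += 2 ** position
    return total

# 1101 = 1*8 + 1*4 + 0*2 + 1*1
print(binary_to_decimal("1101"))  # 13
```

Working through one number by hand this way is a useful check before trying the practice questions.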

## Practice! Convert the following from binary to integer numbers

- 10101
- 100011
- 11000

## Practice! Convert the following from decimal integers to binary

- 18
- 36
- 57

## Binary Conversion in python
- You can convert a binary literal to a decimal integer using the int() function
- You can convert an integer to its binary representation using the bin() function, which returns a string
- Python recognises a binary literal by the 0b prefix in front of the digits

In [1]:
print(int(0b10101))
print(int(0b100011))
print(int(0b11000))

21
35
24


In [2]:
print(bin(18))
print(bin(36))
print(bin(57))

0b10010
0b100100
0b111001
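
Binary digits often arrive as text rather than as `0b` literals. The sketch below shows one way to handle that case with the built-in `int()` and `format()` functions; the variable name `value` is only illustrative.

```python
# Binary digits stored as text need an explicit base of 2
value = int("10101", 2)
print(value)              # 21

# format() returns the binary digits without the 0b prefix;
# "08b" pads with leading zeros to a width of 8
print(format(18, "b"))    # 10010
print(format(18, "08b"))  # 00010010
```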


# Bitwise Operators

## What are Bitwise Operators ?
- You can use bitwise operators to perform Boolean logic on individual bits.
- They work like Python's arithmetic operators (+, -, % etc.) but act on the individual bits of integers

## Bitwise AND 
- Denoted by: &
- Similar to the _and_ Boolean operator (e.g. True and False = False)
- For each pair of bits occupying the same position in the two numbers, it returns a 1 only when both bits are 1

In [None]:
a = 0b11001
b = 0b10101

print(bin(a))
print(bin(b))
print(bin(a&b))
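
A common use of AND is masking: keeping only the bits you care about and zeroing the rest. The sketch below is illustrative (the names `state`, `lowest_bit` and `low_three` are assumptions); the search functions later in this notebook use the same idea when they test `curColumn & 0x1`.

```python
state = 0b10110

# Keep only the rightmost bit: 0b1 acts as a mask
lowest_bit = state & 0b1
print(lowest_bit)         # 0

# Keep only the three rightmost bits
low_three = state & 0b111
print(bin(low_three))     # 0b110
```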


## Bitwise OR
- Denoted by: |
- Similar to the _or_ Boolean operator (e.g. True or False = True; True or True = True; False or False = False)
- For each pair of bits occupying the same position in the two numbers, it returns a 1 if at least one of the two bits is 1

In [None]:
c = 0b11001
d = 0b10101

print(bin(c))
print(bin(d))
print(bin(c|d))
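
OR is the usual way to switch individual bits on, which is how a flagging system works. This is a small sketch with hypothetical flag names (`HAS_START_CODON`, `HAS_STOP_CODON`, `IS_REVERSE`) chosen only for illustration.

```python
# Hypothetical flags: each one owns a single bit
HAS_START_CODON = 0b001
HAS_STOP_CODON = 0b010
IS_REVERSE = 0b100

flags = 0
flags = flags | HAS_START_CODON   # switch the first flag on
flags |= IS_REVERSE               # shorthand for the same operation
print(bin(flags))                 # 0b101

# AND checks whether a particular flag is set
print(bool(flags & HAS_STOP_CODON))  # False
```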


## Bitwise Left Shift
- Denoted by: <<
- Moves the bits of its first operand to the left
- Shifting a single bit to the left by one place doubles its value.


In [None]:
a = 0b10011

print(bin(a))
print(bin(a<<1))
print(bin(a<<2 ))
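
The doubling rule generalises: shifting left by n places multiplies the value by 2 to the power n. A quick sketch to check this, reusing the value from the cell above (the name `shifted` is illustrative):

```python
a = 0b10011  # 19 in decimal

for n in range(4):
    shifted = a << n
    # The last column confirms that a << n equals a * 2**n
    print(n, bin(shifted), shifted, shifted == a * 2 ** n)
```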

## Bitwise Right Shift
- Denoted by: >>
- Analogous to left shift; moves the bits of its first operand to the right
- Shifting to the right by one place drops the rightmost bit and halves the value (rounding down)


In [None]:
b = 0b1001101

print(bin(b))
print(bin(b>>1))
print(bin(b>>2))
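
Because a right shift discards bits, it cannot always be undone by shifting back. The sketch below illustrates this with the same value as the cell above; the names `halved` and `restored` are illustrative.

```python
b = 0b1001101              # 77, an odd number

halved = b >> 1            # the rightmost 1 is dropped
print(halved, b // 2)      # 38 38

# Shifting back does not restore the dropped bit
restored = halved << 1
print(bin(b), bin(restored))  # 0b1001101 0b1001100
```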

## Summary 

- AND (__&__) : Sets each bit to 1 only if both bits are 1
- OR (__|__) : Sets each bit to 1 if at least one of the two bits is 1
- LEFT SHIFT (__<<__) : Shifts left by pushing zeros in from the right. In fixed-width languages the leftmost bits fall off; Python integers grow instead, so no bits are lost
- RIGHT SHIFT (__>>__) : Shifts right and lets the rightmost bits fall off. For non-negative numbers, zeros come in from the left
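
Shift and OR are often combined to build a number one bit at a time: shift left to make room, then OR the new bit into the empty slot. The sketch below uses an illustrative list `bits`; the `_generateAlphabet` function in Step 1 relies on the same shift-then-OR pattern.

```python
bits = [1, 0, 1]

value = 0
for bit in bits:
    value = value << 1   # make room on the right
    value |= bit         # write the new bit into that slot
    print(bin(value))

print(value)  # 5
```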

# Shift-Or Algorithm

## Step 1: Create a dictionary of bitmasks


A bitmask is a binary number whose 0s and 1s record, for a single character, which positions of the queried pattern it matches.


0 denotes a match of the letter at that position

1 denotes a mismatch of the letter at that position


Every unique letter in the reference needs its own bitmask, including letters that never occur in the pattern (their bitmask is all 1s).


__TO NOTE:__
- `reference` refers to the text being searched
- `query` refers to the pattern searched for within the text

In [3]:
#Function: create bitmask dictionary

def _generateAlphabet(reference, query):
	alphabet = list(set(reference))
	bitap_dict = {} 
	for letter in alphabet:
		letterPositionInQuery = 0
		for symbol in query:
			letterPositionInQuery = letterPositionInQuery << 1
			letterPositionInQuery |= int(letter != symbol)
		bitap_dict[letter] = letterPositionInQuery
	return bitap_dict

In [4]:
#Try it yourself!

#reference: "ATTGCTCGATCG"
#pattern: "ATC"

for key,value in _generateAlphabet("ATTGCTCGATCG","ATC").items():
    print(key, value)

#Is this the desired output?

G 7
C 6
T 5
A 3


In [None]:
#Try it again

#reference: "ATTGCTCGATCG"
#pattern: "ATC"

for key,value in _generateAlphabet("ATTGCTCGATCG","ATC").items():
    print(key, bin(value))
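
Note that `bin()` drops leading zeros, so the mask for A prints as `0b11` even though the pattern has three positions. The sketch below pads each mask to the pattern length so that every position is visible. `build_masks` is a compact stand-in written for this example, not the workshop's `_generateAlphabet`, although it is meant to compute the same values.

```python
def build_masks(text, pattern):
    """Build one bitmask per letter of text: 0 = match, 1 = mismatch."""
    masks = {}
    for letter in sorted(set(text)):
        mask = 0
        for symbol in pattern:
            mask = (mask << 1) | int(letter != symbol)
        masks[letter] = mask
    return masks

pattern = "ATC"
masks = build_masks("ATTGCTCGATCG", pattern)
for letter, mask in masks.items():
    # Pad to len(pattern) digits so leading zeros stay visible
    print(letter, format(mask, "0{}b".format(len(pattern))))
```

Read from left to right, each digit lines up with one character of the pattern, so `011` for A says "A matches position 1 of ATC and nothing else".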

In [None]:
from collections import namedtuple

placeholder = namedtuple('placeholder','query seq start end mismatch')

## Step 2: Initialize the bitarray (D)

A bitarray, D, acts as a placeholder that records how much of the queried pattern has been matched by the letters of the text read so far

*D must have as many bits as the pattern has characters

In [None]:
#Try it yourself! Try changing the queryLen

queryLen = 3
D = (2 << (queryLen - 1)) - 1

print(D)
print(bin(D))

#Why do we perform a left shift on 2?
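
One way to check your answer: `2 << (queryLen - 1)` is the same as `1 << queryLen`, a 1 followed by `queryLen` zeros, and subtracting 1 turns that into `queryLen` ones. The loop below is a small sketch that verifies this for a few lengths (the names `query_len` and `d` are illustrative).

```python
for query_len in range(1, 6):
    d = (2 << (query_len - 1)) - 1
    # 2 << (n - 1) equals 1 << n, i.e. a 1 followed by n zeros
    assert d == (1 << query_len) - 1
    print(query_len, format(d, "b"))
```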

## Step 3: Exact String Matching

Algorithm to find the instances of sequences within the text that are identical to the pattern queried.

In [None]:
#Function: Algorithm to look for exact matches

def bitapexactSearch(reference, query): 
	referenceLen = len(reference)
	queryLen = len(query)
	exact_placeholder = namedtuple('placeholder','query seq start end')

	alphabet = _generateAlphabet(reference, query)

	matrix = [] 
	emptyColumn = (2 << (queryLen - 1)) - 1
    
	matrix.append(emptyColumn)
	gRNAs = []
	skip = []

            
	for columnNum in range(1, referenceLen + 1):
		prevColumn = (matrix[columnNum - 1]) >> 1
		letterPattern = alphabet[reference[columnNum - 1]] 
		curColumn = prevColumn | letterPattern
		matrix.append(curColumn)
        
		if (curColumn & 0x1) == 0:
			startPos = max(0, columnNum - queryLen)
			endPos = min(columnNum, referenceLen)
			place = reference[startPos:endPos]
			temp = exact_placeholder(query, place, startPos, endPos)
			gRNAs.append(temp)
	return gRNAs

In [None]:
#Try it yourself!

reference  = 'ATCGATC'
string_search = 'ATC'
gRNAs = bitapexactSearch(reference, string_search)
for i, g in enumerate(gRNAs):
	print (g)
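
To see why the function tests the rightmost bit, it helps to print the state after every letter. The sketch below is a self-contained trace written for illustration (`trace_exact` is a hypothetical helper, not part of the workshop code) that applies the same shift-right-then-OR update as `bitapexactSearch`.

```python
def trace_exact(text, pattern):
    """Return the state, as a padded binary string, after each letter of text."""
    width = len(pattern)
    masks = {}
    for letter in set(text):
        mask = 0
        for symbol in pattern:
            mask = (mask << 1) | int(letter != symbol)
        masks[letter] = mask

    state = (1 << width) - 1          # all ones: nothing matched yet
    states = []
    for letter in text:
        state = (state >> 1) | masks[letter]
        states.append(format(state, "0{}b".format(width)))
    return states

for letter, state in zip("ATCGATC", trace_exact("ATCGATC", "ATC")):
    # A rightmost 0 marks the end of a full match
    print(letter, state, "<- match ends here" if state.endswith("0") else "")
```

Each matching letter lets a 0 travel one place further to the right; a mismatch overwrites it with a 1. When a 0 reaches the rightmost position, every character of the pattern has matched in order.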

## Step 4: Inexact String Matching

Algorithm to find the instances of sequences within the text that differ from the pattern by at most a specified number of characters.

Possible uses of inexact matching:
- Looking for instances of mutated sequences of a specific gene of interest within a genome

In [None]:
#Function: Algorithm to look for exact and inexact matches

def bitapSearch(reference, query, mismatch = 1): 
	referenceLen = len(reference)
	queryLen = len(query)

	alphabet = _generateAlphabet(reference, query)

	matrix = [] 
	emptyColumn = (2 << (queryLen - 1)) - 1
    
	underground = [emptyColumn for i in range(referenceLen + 1)]
	matrix.append(underground)
	gRNAs = []
	skip = []

	for k in range(1, mismatch + 2): 
		matrix.append([emptyColumn])
        
        #Exact String Search
		for columnNum in range(1, referenceLen + 1):
			prevColumn = (matrix[k][columnNum - 1]) >> 1
			letterPattern = alphabet[reference[columnNum - 1]]
			curColumn = prevColumn | letterPattern

            #Inexact String Search
			if k > 1:
				## Mismatch 
				curColumn = curColumn & (matrix[k - 1][columnNum - 1] >> 1)
                
			matrix[k].append(curColumn)

			if (curColumn & 0x1) == 0:
				startPos = max(0, columnNum - queryLen)
				if startPos in skip: continue
				endPos = min(columnNum, referenceLen)
				place = reference[startPos:endPos]
				temp = placeholder(query, place, startPos, endPos, k - 1)
				gRNAs.append(temp)
				skip.append(startPos)
                
	return gRNAs


In [None]:
#Try it yourself

reference  = 'GGGCNCTGCTGAGAATGNACTGAATATAAACTTGTGGTAGTTGGANGCTGGTGGCGTAGGCTTGTGGTTGTGGGANGCTGGTGGCGAAGAGTGCCTTGACGATACAGNCTANATTNCAGAATNCATTTTGTGGNACGAATATGATCCANACAATAGNAGGATTC'
string_search = 'TTGTGGTAGTTGGANGCTGGTGGCG'
errors = 2
gRNAs = bitapSearch(reference, string_search, errors)
for i, g in enumerate(gRNAs):
	print (g)
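
A simple way to sanity-check the results is to compare them against a brute-force search that slides the pattern along the text and counts differing letters (the Hamming distance). This is a different and much slower technique than Shift-Or, shown here only as a cross-check; `hamming_matches` is a hypothetical helper, and it assumes mismatches are substitutions only, with no insertions or deletions.

```python
def hamming_matches(text, pattern, max_mismatch):
    """Brute-force reference: slide the pattern and count differing letters."""
    hits = []
    width = len(pattern)
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        distance = sum(1 for x, y in zip(window, pattern) if x != y)
        if distance <= max_mismatch:
            hits.append((start, start + width, distance))
    return hits

# Each hit is (start, end, number of mismatches)
print(hamming_matches("ATCGATGGTC", "ATC", 1))
# [(0, 3, 0), (4, 7, 1), (7, 10, 1)]
```

Running both approaches on the same short input and comparing the start and end positions is a quick way to gain confidence in the bitwise version.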