<a href="https://colab.research.google.com/github/nunososorio/bhs/blob/main/python_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Bioinformatics in Health Science Course - Python Fundamentals
## School of Medicine, University of Minho, Braga, Portugal

Welcome to our Python Fundamentals practical class! This session introduces the basics of Python programming.
At the end, you will be able to apply what you have learned in a hands-on coding playground and build your skills and confidence as a future Python programmer 🚀


# Why?

🤖 Even in the age of AI, learning Python is definitely worth it! 🐍

1. **Understanding**: To effectively use and understand the code generated by AI, you need to have a basic understanding of the programming language being used.

2. **Customization**: AI can generate code based on prompts, but for more complex, specific, or novel tasks, you might need to write or modify the code yourself.

3. **Debugging**: If there's an issue with the code, understanding Python will allow you to debug it.

4. **Career Opportunities**: Python is one of the most popular and in-demand programming languages. Knowing Python opens up many opportunities in fields like web development, data science, machine learning, etc.

5. **Versatility**: Python is used in many areas of software development and data analysis. It's a great tool for building web applications, analyzing data, automating tasks, and much more.

# 1. Variables and print statements
🐍 In Python, a variable is like a container used to store data. It's given a unique name (the variable name) that can be used to retrieve the stored data. Data stored in a variable can be of different types, the most common being:

- **Numbers**: These can be integers (`int`), which are whole numbers, or floats (`float`), which are numbers with a decimal point.
- **Text strings (`str`)**: In Python, the string data type is used to represent text data. It is a sequence of characters enclosed in single quotes, double quotes, or triple quotes. Strings are immutable in Python, which means that once a string is created, it cannot be modified.

The `print` function is a built-in function in Python that is used to output text or variables to the console. You can use the `print` function to print strings, numbers, or variables.
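As a minimal sketch of the data types described above, the built-in `type` function reports what kind of data a variable holds. The variable names `size_bp`, `size_mb`, `symbol`, and `symbol_lower` are illustrative only and are not used elsewhere in this notebook.

```python
# Illustrative variables (hypothetical names chosen for this sketch)
size_bp = 2300000      # int: a whole number
size_mb = 2.3          # float: a number with a decimal point
symbol = "DMD"         # str: text enclosed in quotes

# type() shows the data type stored in each variable
print(type(size_bp))
print(type(size_mb))
print(type(symbol))

# Strings are immutable: lower() returns a new string and leaves the original unchanged
symbol_lower = symbol.lower()
print(symbol, symbol_lower)
```

Note that `symbol` still holds `"DMD"` after calling `lower()`, because the method builds a new string instead of changing the existing one.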

⏭ In the code cell below, the first variable defined is named `genename1` and stores the string value `"DMD"`. `genename1` therefore acts as a label for the text `"DMD"`, and you can use this label to refer to the text anywhere in your code.

❎ Code preceded by `#` is ignored: any text following the `#` symbol on a line, even valid code, is treated as a comment and skipped by the Python interpreter. This is useful for adding notes or explanations within your code.

Let's start coding! 🐍

In [None]:
genename1 = "DMD"
gene1description = "gene that codes for the protein dystrophin, which is involved in muscle function. The DMD gene is located on the X chromosome, and it is the largest known gene in the human genome, spanning about 2.3 million base pairs."
print(genename1, ":", gene1description)

❓ Create new variables for the gene "BRCA1" and use `print` to write a description of the gene, similar to what was done above for "DMD". ⚠

In [None]:
# please fix the error(s)
genename1
gene1description = "gene that codes for a protein that acts as a tumor suppressor. The BRCA1 gene is located on chromosome 17 and spans about 81,000 base pairs. It plays a critical role in maintaining the stability of a cell's genetic information by repairing damaged DNA."
print(, ":",)


- You can also interpolate your variables inside a text for a more complex output. To do this, precede the text with `f'`, place the variables to be printed inside `{}`, and close the text with `'` at the end:

In [None]:
print(f'Our lab studies {genename1}, which is a {gene1description}')

❓ Now try it with the BRCA1 variables you created above: ⚠

In [None]:
# please fix the error(s)
print(fOur lab studies {genename1}, which is a {gene1description})

# 🎁 Bonus

The text you create can also be stored in a variable:

In [None]:
general_lab_desc = f'Our lab studies {genename1}, which is a {gene1description}'
print(general_lab_desc)

# 2. Basic arithmetic operations

📝 In Python, basic arithmetic operations include addition `+`, subtraction `-`, multiplication `*`, division `/`, floor division `//`, and modulus `%`.
- These operations can be performed on **numbers** (integers and floats) and can be used to perform mathematical calculations in your code.
- Python supports the use of parentheses to control the order of operations and the use of shorthand operators such as `+=`, `-=`, `*=`, and `/=` to perform arithmetic operations and assign the result back to a variable in a single step.  

📌 These basic arithmetic operations are the building blocks of any mathematical computation, so a good understanding of them is essential for any Python developer or advanced user.
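The operators listed above can be sketched with two small numbers. The names `reads`, `batch_size`, and `total` are hypothetical and exist only for this illustration.

```python
# Hypothetical values chosen only to illustrate the operators
reads = 17
batch_size = 5

print(reads + batch_size)    # addition
print(reads - batch_size)    # subtraction
print(reads * batch_size)    # multiplication
print(reads / batch_size)    # division: the result is a float
print(reads // batch_size)   # floor division: whole batches only
print(reads % batch_size)    # modulus: what is left over after the whole batches

# Parentheses control the order of operations
total = (reads + 3) * 2

# Shorthand operators update a variable in a single step
total += 10   # same as total = total + 10
total /= 4    # same as total = total / 4
print(total)
```

Floor division and modulus complement each other: 17 items fill 3 whole batches of 5, with 2 left over.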


In [None]:
DMD = 2.3  # Million base-pairs
BRCA1 = 0.08 # Million base-pairs
GENOME = 3000 # Million base-pairs

# DMD represents what % of the human genome?
print("The DMD gene represents", DMD / GENOME * 100, "% of the human genome")

❓ What percentage of the human genome does BRCA1 represent? ⚠

In [None]:
# please fix the error(s)
print("The BRCA1 gene represents",  / GENOME * 100, "% of the human genome")

# 3. Conditional statements

📝 Conditional statements are used to control the flow of a program based on certain conditions.
- They allow you to check if a certain condition is true or false, and execute different code depending on the result.
- The most common conditional statements are the `if`-`elif`-`else` statements.

📌 Conditional statements are a fundamental concept in programming: they let a program choose which code to run depending on the data it receives.


In [None]:
if DMD > BRCA1:
    print(f"DMD gene ({DMD} Mb) is larger than BRCA1 ({BRCA1} Mb)")
elif DMD == BRCA1:
    print(f"DMD gene ({DMD} Mb) is equal in size to BRCA1 ({BRCA1} Mb)")
else:
    print(f"DMD gene ({DMD} Mb) is smaller than BRCA1 ({BRCA1} Mb)")

# 4. Functions:

📝 Functions are blocks of reusable code that can be called by name.
- They are used to organize and structure your code, make it more readable, and promote code reusability.
- In Python, a function is defined using the `def` keyword, followed by the function name and a set of parentheses that may contain parameters.
- The code block inside the function is indented and is executed every time the function is called. Functions can also return a value using the `return` statement.
- They can be used to encapsulate complex logic and make it easier to test and debug your code.

📌Functions are a fundamental concept in Python, and they are widely used in various programming fields.

In [None]:
def calculate_genome_percentage(gene_size):
    percentage = gene_size / GENOME * 100
    return percentage

# Example usage for DMD and BRCA1 genes

print(f"The DMD gene represents {calculate_genome_percentage(DMD)}% of the human genome")
print(f"The BRCA1 gene represents {calculate_genome_percentage(BRCA1)}% of the human genome")


❓ What percentage of the genome is represented by a hypothetical gene with 90 million base pairs? ⚠

In [None]:
#please complete the code and fix the error
hip_gene = 90

print(f"The hypothetical gene has {calculate_genome_percentage()}% of the human genome")

# 🎁 Bonus

Numbers can be rounded to a chosen number of decimal places with the `round` function, passing the value and the number of decimals inside the parentheses:

In [None]:
def calculate_genome_percentage(gene_size):
    percentage = gene_size / GENOME * 100
    return round(percentage, 3)

# Example usage for DMD and BRCA1 genes

print(f"The DMD gene represents {calculate_genome_percentage(DMD)}% of the human genome")
print(f"The BRCA1 gene represents {calculate_genome_percentage(BRCA1)}% of the human genome")

# 5. Lists and list manipulation:

📝 In Python, a list is a built-in data structure that allows you to store and organize a collection of items.
- Lists are defined using square brackets and items within the list are separated by commas. Lists can store any type of data, such as numbers, strings, and even other lists.
- List manipulation is the act of modifying or manipulating lists in various ways. This can include adding, removing, or updating items in a list, as well as sorting, reversing or slicing a list.
- Python provides a variety of built-in methods and functions for manipulating lists. For example, `append()` adds an item to the end of a list, `remove()` removes the first matching item, and `sort()` sorts the items in ascending order (or in descending order with `sort(reverse=True)`). To create a new list from part of an existing one, use slice notation with `:`, as in `genes[0:2]`.

📌List manipulation is an important concept in Python, as lists are widely used in various programming fields. Understanding how to manipulate lists is crucial for any Python user as it allows you to organize and manage data in an efficient and flexible way. With the knowledge of list manipulation, you will be able to work with large sets of data and perform complex operations on them in a powerful and efficient way.

✋Remember that Python 🐍 uses zero-based indexing, so the first element is at index 0.
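The sorting, reversing, and slicing operations mentioned above can be sketched as follows. The list `sizes_mb` and its values are illustrative and separate from the variables used in the exercises.

```python
# Illustrative list of gene sizes in million base pairs (example values)
sizes_mb = [2.3, 0.08, 0.03]

# sort() orders the list in place, ascending by default
sizes_mb.sort()
print(sizes_mb)

# reverse=True sorts in descending order instead
sizes_mb.sort(reverse=True)
print(sizes_mb)

# reverse() flips the current order in place
sizes_mb.reverse()
print(sizes_mb)

# Slicing creates a new list; the end index is not included
first_two = sizes_mb[0:2]
print(first_two)

# Zero-based indexing: index 0 is the first element
print(sizes_mb[0])
```

Both `sort()` and `reverse()` change the list itself and return `None`, whereas slicing leaves the original list untouched.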



In [None]:
# create list
genes = ["DMD", "BRCA1", "Hypothetical"]

# access elements
print(genes[0])

# modify elements
genes[2] = "IRF1"
print(genes)

# add elements
genes.append("Hypothetical")
print(genes)

# remove elements
genes.remove("Hypothetical")
print(genes)

# sort elements
genes.sort()
print(genes)

# slice genes to include elements 0 and 1
genes_slice = genes[0:2]
print(genes_slice)

# 🎁 Bonus

What is the difference between a list, a tuple and an array?

- **Lists** are mutable, which means you can change their content (add, remove or modify items) after they are created. Lists are defined by enclosing the elements in square brackets `[]`. For example: `my_list = [1, 2, 3]`.

- **Tuples** are immutable, which means once they are created, you cannot change their content. This can be useful in situations where you want to ensure that certain values do not change. Tuples are defined by enclosing the elements in parentheses `()`. For example: `my_tuple = (1, 2, 3)`. You might choose to use a tuple when you have a collection of items that should not be changed.

- **Arrays**: An array is also a collection of elements like a list, but it can only store items of the same data type. Arrays are used when you need to perform mathematical operations on collections of numerical data, and they typically use less memory than lists. Python's standard library includes an `array` module that provides basic array functionality. However, for scientific computing tasks (like data analysis or machine learning), people often use arrays from the NumPy library instead. NumPy arrays have more features than basic Python arrays and are more efficient for numerical operations.
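The difference between mutable lists and immutable tuples can be sketched as below. The `try`/`except` construct is not covered in this class; it is used here only so the cell keeps running after the expected error, and `error_message` is a hypothetical name.

```python
my_list = [1, 2, 3]
my_tuple = (1, 2, 3)

# Lists are mutable: an item can be replaced after creation
my_list[0] = 10
print(my_list)

# Tuples are immutable: assigning to an item raises a TypeError
error_message = None
try:
    my_tuple[0] = 10
except TypeError as error:
    error_message = str(error)
    print("Cannot modify a tuple:", error_message)

# The tuple is unchanged
print(my_tuple)
```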

In [None]:
# To perform element-wise operations you can use NumPy arrays
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

c = a + b  # This will give array([5, 7, 9])

# Standard Python lists do not support element-wise operations directly; + concatenates them instead.
list1 = [1, 2, 3]
list2 = [4, 5, 6]

concatlists = list1 + list2 # This will give [1, 2, 3, 4, 5, 6]

print(c)
print(concatlists)

# 6. Dictionaries:

📝 In Python, a dictionary is a built-in data structure that allows you to store and organize a collection of items in a key-value format.
- Dictionaries are defined using curly braces `{}` and items within the dictionary are separated by commas.
- Each item in a dictionary is a key-value pair, where the key is a unique identifier for the value. Values can be of any type, such as numbers, strings, and even other dictionaries.

📌Dictionaries are very useful as they provide a way to store and retrieve data in a very efficient way. They are widely used in various programming fields. Python provides a variety of built-in methods and functions that can be used to manipulate dictionaries. With the knowledge of dictionaries, you will be able to work with large sets of data and perform complex operations on them, making your code more powerful and efficient.


In [None]:
# create dictionary
genesizes = {
  "DMD": 2.3,
  "BRCA1": 0.1,
  "IRF1": 0.03
}
print(genesizes)

# access elements
print(genesizes["IRF1"])
print(genesizes.keys())
print(genesizes.values())
print(genesizes.get("IRF1"))

# modify elements
genesizes["IRF1"] = 0.0333
print(genesizes)

# add elements
genesizes["Hypothetical"] = 99
print(genesizes)

# remove elements
del genesizes["Hypothetical"]
print(genesizes)


# 🎁 Bonus

In the previous code cell, the lines `print(genesizes["IRF1"])` and `print(genesizes.get("IRF1"))` give the same result. Here's why:

- `genesizes["IRF1"]` uses the bracket notation to access the value of the key `"IRF1"` in the `genesizes` dictionary. If the key is in the dictionary, it will return its value.

- `genesizes.get("IRF1")` uses the `get` method to achieve the same thing. The `get` method is a built-in Python method for dictionaries that returns the value for a given key if it exists in the dictionary.

So, both lines are accessing the value of `"IRF1"` in the `genesizes` dictionary, which is why they return the same result.

However, there's a subtle difference between these two methods:
- If you try to access a key that does not exist in the dictionary, `genesizes["nonexistent_key"]` will raise a `KeyError`.
- On the other hand, `genesizes.get("nonexistent_key")` will return `None` and will not raise an error.

So while they function the same with existing keys, their behavior differs when dealing with keys that don't exist in the dictionary.

In [None]:
print(genesizes["IRF2"])

In [None]:
print(genesizes.get("IRF2"))

🤯 So what?

A `KeyError` raised when you try to access a key that does not exist interrupts your program and, unless it is handled, causes it to stop.

On the other hand, when you use the `get` method on a dictionary and the key does not exist, it returns `None`. This does not cause your program to stop or raise an error. Instead, it allows your program to continue running, which can be beneficial in many cases.

However, if your program logic expects a certain value (other than `None`) and it gets `None`, it might lead to unexpected behavior or bugs in other parts of your program.
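One way to reduce the risk of an unexpected `None` is sketched below: check for the key with `in` before using it, or pass a second argument to `get` as the value to return when the key is missing. The dictionary `example_sizes` and the fallback text `"unknown"` are assumptions made for this illustration.

```python
# Illustrative dictionary, separate from the one used in the cells above
example_sizes = {"DMD": 2.3, "IRF1": 0.03}

# Option 1: test whether the key exists before accessing it
if "IRF2" in example_sizes:
    print(example_sizes["IRF2"])
else:
    print("IRF2 is not in the dictionary")

# Option 2: give get() a default value to return for missing keys
missing_size = example_sizes.get("IRF2", "unknown")
existing_size = example_sizes.get("IRF1", "unknown")
print(missing_size)
print(existing_size)
```

The default is only used when the key is absent, so existing keys still return their stored value.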

# 7. Loops:

📝 Loops are used to execute a block of code repeatedly. The most common type of loop is the `for` loop:
- A `for` loop is used to iterate over a sequence of items, such as a list or a range of numbers. The general syntax is `for variable in sequence:`, where the variable takes on each value in the sequence, one at a time, and the indented code block following the `for` statement is executed for each value.
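A minimal sketch of the two cases mentioned above, iterating over a list and over a range of numbers. The names `example_genes` and `visited` are hypothetical and used only here.

```python
# Illustrative list, separate from the variables used in the exercises
example_genes = ["DMD", "BRCA1", "IRF1"]

# Iterate over a list: gene takes each value in turn
visited = []
for gene in example_genes:
    print("Processing", gene)
    visited.append(gene)

# Iterate over a range of numbers: range(3) yields 0, 1, 2
for position in range(3):
    print(position, example_genes[position])
```

The indented lines form the body of the loop; the first line that is not indented marks where the loop ends.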

In [None]:
for gene, gene_size in genesizes.items():
    print(f"The {gene} gene represents {calculate_genome_percentage(gene_size)}% of the human genome.")


# 8. Importing modules:

📝 A module is a collection of code that can be imported and used in other code.
- Importing modules allows you to access pre-existing code and functionality, which can save you time and effort when writing your own code.
- Python has a large number of built-in modules that can be imported, as well as a vast number of third-party modules that can be installed using package managers such as pip.

To import a module, you use `import`  followed by the name of the module. Once a module is imported, you can access the functions and variables defined in the module by prefixing them with the name of the module.
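As a small sketch of the import syntax, the standard-library `math` module can be imported and its functions accessed with the `math.` prefix. The variable names `root` and `whole_part` are illustrative.

```python
# Import a module from Python's standard library
import math

# Access functions and constants by prefixing them with the module name
root = math.sqrt(16)
whole_part = math.floor(2.7)
print(root)
print(whole_part)
print(math.pi)

# A module can also be imported under a shorter alias with "as"
import math as m
print(m.sqrt(25))
```

The alias form is the same pattern used for `import numpy as np` earlier in this notebook.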

Several collections of modules (libraries) are widely used in bioinformatics. Some of the most popular ones are:
- **Pandas**: A powerful data manipulation library in Python. It provides data structures and functions needed to manipulate structured data, including functions for reading and writing data in a variety of formats. It is particularly well-suited for working with numerical tables and time series data.
- **Seaborn**: A Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn helps you explore and understand your data. Its plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.
- **Scanpy**: A scalable toolkit for analyzing single-cell gene expression data built jointly with anndata. It includes preprocessing, visualization, clustering, trajectory inference, and differential expression testing. The Python-based implementation efficiently deals with datasets of more than one million cells.
- **Biopython**: A collection of modules for bioinformatics, including tools for sequence analysis, structure analysis, and biological data parsing.
- **scikit-allel**: A Python package for exploratory analysis of large-scale genetic variation data. It is particularly useful for working with structured arrays of data, like Variant Call Format (VCF) files, which are often used in bioinformatics to store gene sequence variations.
- **PyBioMed**: A module that provides various bioinformatics tools, such as protein and DNA sequence analysis, molecular docking, and pharmacological prediction.
- **PyMOL**: A molecular visualization system that can be used to create high-quality images of molecules and visualize protein structures.
- **scikit-learn**: A machine learning module that has been widely used in bioinformatics, it contains a variety of tools for classification, regression, clustering and feature selection.

📌These are just a few examples of the many libraries available for bioinformatics, and new ones are constantly being developed. These provide powerful tools and functionality that can be used to analyze and visualize biological data, making it easier for bioinformaticians to work with large and complex datasets.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = sns.load_dataset('iris')

# Create a pair plot
sns.pairplot(iris, hue='species')

# Display the plot
plt.show()

MIT License

Copyright (c) 2023 Nuno S. Osório
