# Session 0: Introduction to python and jupyter

## Jupyter notebooks

Jupyter notebooks (.ipynb files) are specially formatted files, similar to R Markdown files, that allow Python code to be edited and executed interactively. Each notebook runs its code in its own instance of a Python environment (called a kernel), which has access to all of the packages installed in that environment. For scientific data analysis, Jupyter notebooks have several advantages over writing python scripts (.py files), including:
- More easily allowing individual blocks of code to be run rather than running an entire script at once 
- More readable organization and documentation of analysis steps beyond that possible with simple code comments 
- More interactive data exploration and pipeline testing
- Code output can be displayed in line with the code that generates it

Notebooks are constructed using cells that are either "code" or "markdown" cells. Code cells can be executed using the python kernel selected for the document, while markdown cells contain plain text that can be formatted to break apart a notebook into collapsible subsections.
 - Highlighted cells can be run individually (using the "run" button or by pressing shift+enter), or the entire notebook can be run sequentially using the "run all" button 
 - Double click on a cell to edit its contents

This is a markdown cell

In [3]:
print("this is a code cell")

this is a code cell


Markdown headings/titles are created using the number sign (#) followed by a space, with increasing numbers of '#' indicating a smaller heading, e.g. '###' indicates a smaller heading than '##'.

### This is a subheading

#### This is a smaller subheading

This is normal markdown text

Variables set in a code cell have global scope, i.e. are available to all other code cells in the notebook while the kernel is still running:

In [4]:
myVar = 23

In [5]:
myVar

23

Restarting the python kernel using the "restart" button for a notebook will clear all loaded libraries and variables.

In [1]:
myVar

NameError: name 'myVar' is not defined

For more advanced usage, command line (bash) commands can be run directly within a jupyter notebook by adding a '!' before the command in a code cell:

In [5]:
!ls -l # This is a bash command that lists all files in the current directory

total 12
-rw-rw-r-- 1 joe joe 3085 Mar 30 18:59 0_intro_to_python.ipynb
-rw-rw-r-- 1 joe joe    0 Mar 30 18:50 1_scanpy_data_preprocessing.ipynb
-rw-rw-r-- 1 joe joe    0 Mar 30 18:50 2_scvi_data_normalization.ipynb
-rw-rw-r-- 1 joe joe  405 Mar 30 18:59 3_downstream_analysis_methods.ipynb
-rw-rw-r-- 1 joe joe   17 Mar 30 18:22 README.md


## Google colab

Google colab is a free resource that allows jupyter notebooks to be edited and run on a google server. It can be integrated with google drive to load data files and save figures. However, the free tier is limited by session timeouts (the runtime you are using is shut down if it is inactive for too long) and by limited compute resources (RAM, CPU, and GPU). Colab sessions are essentially identical to local jupyter installations, except that markdown cells are called text cells, kernels are called runtimes, and most of the jupyter cell control buttons (e.g. run, restart, run all) are located within the Runtime dropdown menu.

A google colab session can be connected to your google drive using:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

This will give you a link to follow. After you log into your account, the link provides an authorization code that adds your google drive folder to the '/content/gdrive' path in your colab session. This will need to be rerun every time you connect to a new session, so it is best to include this cell at the top of all your colab notebooks.
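
If the same notebook is sometimes run outside of colab, the mount cell will fail there because the `google.colab` module only exists on colab. Below is a minimal sketch of one way to guard it. It assumes that `google.colab` is already present in `sys.modules` when a colab runtime starts (a commonly used check, not an official API), and the name `IN_COLAB` is introduced here purely for illustration:

```python
import sys

# Assumption: colab runtimes pre-load the google.colab module, so its presence
# in sys.modules is used here as a signal that the notebook is running on colab.
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount("/content/gdrive")  # same mount path as used in this section
else:
    print("Not running on colab, skipping the google drive mount")
```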

A GPU can be added to your colab session (required for training scVI models later on in the course) by going to Runtime > Change runtime type, then selecting 'GPU' from the hardware accelerator dropdown and clicking save

Note that any packages you want to load into your notebook kernel will have to be reinstalled into your colab session whenever it is restarted or times out, so it is best to include a cell installing all of the packages you will need at the top of your colab notebooks.
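
To see which packages are missing from a fresh session before installing them, something like the following sketch could be placed near the top of a notebook. The helper name `find_missing` and the example list of packages are illustrative assumptions. The check uses import names, which sometimes differ from the names used with `pip install`:

```python
import importlib.util

def find_missing(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in import_names if importlib.util.find_spec(name) is None]

# Example list of import names (adjust to the packages your notebook needs)
required = ["numpy", "pandas", "scanpy"]
missing = find_missing(required)
print("Missing packages:", missing)
```

Any names reported as missing can then be installed with a `!pip install <package name>` cell.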

## Setting up python environments

### Installing python packages

By default python includes basic functionality such as simple math functions, printing variable contents, and data type conversion. For most uses, additional functionality is needed, which can be obtained by installing external python packages created by other developers. These packages are typically published to a repository called PyPI (the python package index). 

Python packages from PyPI are added to your python environment using a tool called 'pip' which is typically installed by default when you install python. To use pip to install a package to your colab session run:

In [6]:
!pip install numpy # This runs a bash command to install a pip package with the syntax 'pip install <package name>'



Most popular python packages have websites with documentation on how to use the methods included in the package, as well as installation instructions that include the 'pip install' command for that package. However, if running pip install within a jupyter notebook (such as on google colab), then an '!' needs to be added to the beginning of the command to run it as a bash command rather than as python code.

Other useful pip commands:

In [1]:
!pip list # lists all the installed packages in the current python environment
!pip install numpy --upgrade # update an installed package to the latest version, if possible
!pip uninstall numpy -y # uninstall an installed package
!pip install numpy==1.23.5 # install a specific version of a package

Package                  Version
------------------------ -----------
absl-py                  1.4.0
aiohttp                  3.8.3
aiosignal                1.3.1
anndata                  0.8.0
annoy                    1.17.1
asttokens                2.0.5
async-timeout            4.0.2
attrs                    22.2.0
backcall                 0.2.0
beautifulsoup4           4.11.1
bleach                   5.0.1
cached-property          1.5.2
cachetools               5.2.1
certifi                  2022.12.7
charset-normalizer       2.1.1
chex                     0.1.5
click                    8.1.3
comm                     0.1.2
commonmark               0.9.1
contextlib2              21.6.0
contourpy                1.0.6
cycler                   0.11.0
Cython                   0.29.33
DateTime                 5.0
debugpy                  1.5.1
decorator                5.1.1
defusedxml               0.7.1
dm-tree                  0.1.8
docrep                   0.3.2
entrypoints           

In [None]:
# Package needed for this course (many other packages will be installed as dependencies):
!pip install scrnatools -q # the -q parameter runs this command quietly, with less verbose output

### Anaconda/miniconda virtual environments

For use cases where google colab is not sufficient, python can be installed locally on any laptop or desktop. While it is possible to do this directly from the Python Software Foundation website (python.org), it is **HIGHLY RECOMMENDED** to use a python virtual environment manager such as Anaconda instead.

Virtual environments act as "copies" of a base installation of python that allow python packages to be installed to that copy, while keeping the base environment a clean, vanilla python installation. Multiple virtual environments can be created so that one exists for each project/pipeline being worked on. This is important because it allows you to install **ONLY** the required packages for a particular project to a given virtual environment, with other projects that require different packages siloed to their own environment. This prevents issues that can arise from version collisions when all the packages needed for all projects are installed into a single environment.

Miniconda is a stripped down version of the Anaconda virtual environment manager that can be installed from: https://docs.conda.io/en/main/miniconda.html. After installation, all management of Miniconda environments is performed from a command line (the 'Terminal' application on macOS, or 'cmd' on Windows).

First, create a virtual environment:

In [None]:
# Run in terminal
conda create -n scrnaseq_course python=3.9 # creates a virtual environment with name 'scrnaseq_course' using python version 3.9

Then, activate the environment:

In [None]:
# Run in terminal
conda activate scrnaseq_course # Activates the 'scrnaseq_course' environment

Within the environment, python packages can now be installed:

In [None]:
# Run in terminal
conda install -c conda-forge jupyterlab # installs the components needed to run a local jupyter notebook
pip install scrnatools -q # installs the packages needed for this course

Now, a jupyter server can be launched to edit/create/run jupyter notebooks on your local machine:

In [None]:
# Run in terminal
jupyter lab # This will launch the jupyter lab interface in your default web browser

Note that the directory you launch 'jupyter lab' from in the terminal will dictate what files are accessible within the jupyter lab interface (the current folder and all of its subdirectories). To change to a different location on your computer:

In [None]:
# Run in terminal
cd path/to/directory # changes your directory in the terminal to the specified path
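
From inside a notebook, the folder the kernel is currently working in (normally the folder containing the notebook) can be checked with the standard library `pathlib` module. This is a small sketch rather than a required step. Relative file paths used in code cells are resolved against this folder:

```python
from pathlib import Path

current_dir = Path.cwd()  # the folder the kernel is currently working in
print(current_dir)

# List the notebook files located directly inside the current folder
for notebook in sorted(current_dir.glob("*.ipynb")):
    print(notebook.name)
```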

To shut down the jupyter server, either select File > Shut Down, or close the browser and terminal running the local jupyter server. To deactivate your conda environment:

In [None]:
# Run in terminal
conda deactivate # returns to the base python installation

Other useful conda commands:

In [None]:
# Run in terminal
conda env list # lists all available conda environments
conda env remove -n scrnaseq_course # deletes a conda environment with syntax 'conda env remove -n <env name>'
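
To confirm that a notebook or script is actually running in the intended environment, the interpreter can be inspected from python itself using the standard library `sys` module. This is a minimal sketch. For an environment named 'scrnaseq_course', both printed paths would be expected to contain that name, although the exact location depends on where Miniconda was installed:

```python
import sys

print(sys.executable)        # full path to the python interpreter running this code
print(sys.prefix)            # root folder of the active environment
print(sys.version_info[:3])  # python version as a (major, minor, micro) tuple
```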

## Python basics

### Importing installed packages for use

After a package has been installed to a python environment, for it to be used in a python script/jupyter notebook, it has to be imported. This must be done at the start of every script or whenever a kernel is started/restarted

In [4]:
import pandas

Methods within a package can then be accessed for use in other cells:

In [10]:
pandas.DataFrame() # The syntax for accessing a package's methods is generally <package name>.<method name>

Packages can also be imported with an alias so the whole package name doesn't have to be used when accessing its methods:

In [11]:
import pandas as pd
pd.DataFrame()

Finally, if only a single method is needed from a given package, it can be directly imported, removing the need to use the package name in method calls at all (note that python still loads the whole package behind the scenes, so this is a convenience rather than a way to save memory):

In [13]:
from pandas import DataFrame
DataFrame()
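
The three import styles above differ only in which names they make available. A minimal sketch using the standard library `math` package (chosen only so that the example runs without installing anything) shows that each name refers to the same underlying function:

```python
import math            # plain import: access through the package name
import math as m       # aliased import: same package under a shorter name
from math import sqrt  # direct import: 'sqrt' can be used on its own

print(math.sqrt(16))  # 4.0
print(m.sqrt(16))     # 4.0
print(sqrt(16))       # 4.0

# All three names point at the very same objects
print(sqrt is math.sqrt)  # True
print(m is math)          # True
```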

### General code syntax

Data that you load or generate within a notebook/script are stored as **variables**. A variable is created by assigning a value to a unique name using '=', so that the value can be accessed or operated on later within that script/notebook:

In [15]:
myVar = 23 # Stores a variable called 'myVar'
myVar = myVar + 10 # Adds 10 to 'myVar'
print(myVar) # Prints the current value of 'myVar'

33


**Methods** are python functions that take input parameters, perform a particular operation or set of operations based on those parameters, and return/store the result. These can be built-in methods (e.g. numerical operations), methods included in external python packages (e.g. plotting methods in matplotlib), or custom user-defined methods.

In [60]:
myList = [1,3,2,4,6,5]
myList.sort(reverse=False) # executing a method has syntax <method_name>(<parameter_1_name> = <parameter_1_value>, etc)
print(myList)
myList.sort(reverse=True)
print(myList)

[1, 2, 3, 4, 5, 6]
[6, 5, 4, 3, 2, 1]
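
Custom user-defined methods are created with the `def` keyword. The sketch below is a minimal illustration, and the function name `add_numbers` and its parameters are made up for this example:

```python
def add_numbers(first, second=10):
    """Return the sum of two numbers; 'second' defaults to 10 if it is not given."""
    return first + second

print(add_numbers(5))             # 15
print(add_numbers(5, second=20))  # 25
```

Parameters with a default value (here `second`) are optional when the method is called, which is why many package methods can be run with only a few of their parameters specified.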


Some methods operate directly on a variable (e.g. sort() above, which modifies the list in place), but most just take a list of parameters and return the result:

In [64]:
myList = [1,3,2,4,6,5]
myListSorted = sorted(myList, reverse=True)
print(myListSorted)

[6, 5, 4, 3, 2, 1]
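
The difference matters when storing results: a method that modifies a variable in place typically returns nothing (`None`), so assigning its result to a new variable is a common mistake. A short sketch comparing the two behaviours:

```python
myList = [1, 3, 2]
result = myList.sort()  # sorts myList in place and returns None
print(result)           # None
print(myList)           # [1, 2, 3]

original = [1, 3, 2]
sortedCopy = sorted(original)  # returns a new sorted list
print(original)                # [1, 3, 2] (unchanged)
print(sortedCopy)              # [1, 2, 3]
```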


Logical comparisons have their own symbolic operators:
- less than: <
- greater than: >
- equal: ==
- not equal: !=
- greater than or equal: >=
- less than or equal: <=

When comparing numbers/variables with these operators the result is returned as a boolean (True/False variable)

In [50]:
print(8 > 9)

varOne = 10
varTwo = 11
print(varOne != varTwo)

False
True


### Variable types

Python variables are dynamically typed (the type of data stored in a variable is determined at runtime):

In [2]:
numGenes = 1
type(numGenes)

int

In [3]:
numGenes = 1.0
type(numGenes)

float

In [4]:
numGenes = "one"
type(numGenes)

str
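
Values can also be converted from one type to another with the built-in `int()`, `float()`, and `str()` functions. This is a small sketch of the data type conversion mentioned earlier, useful for example when numbers have been read in as text:

```python
numGenes = "25"        # a number stored as text
print(type(numGenes))  # <class 'str'>

numGenes = int(numGenes)  # convert the text to an integer
print(numGenes + 1)       # 26

print(float(numGenes))           # 25.0
print(str(numGenes) + " genes")  # 25 genes
```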

Variables can also be a list of values:

In [35]:
myList = [0,1,2,3,3,4,5]
print(myList)
myList = ["list", "of", "strings"]
print(myList)
myList = [0,1, "two", 3, 4.0, 5] # Lists can contain mixed data types
print(myList)

[0, 1, 2, 3, 3, 4, 5]
['list', 'of', 'strings']
[0, 1, 'two', 3, 4.0, 5]


Elements within a list can be accessed using their numerical index (starting from 0):

In [43]:
print(myList)
print(myList[0])
print(myList[2])
print(myList[-1]) # Elements can be extracted from the end of a list using the syntax -<num of indices from end>
print(myList[-2])
print(myList[0:3]) # multiple elements can be extracted using the syntax [start index (inclusive):end index (exclusive)]

[0, 1, 'two', 3, 4.0, 5]
0
two
5
4.0
[0, 1, 'two']


A set of values (does not contain duplicates):

In [25]:
mySet = {0,1,2,2,3,3,4,4,4,4}
print(mySet)
mySet = {0,1,2,"two",3,3,4,4,4,4} # Sets can also contain mixed data types
print(mySet)

{0, 1, 2, 3, 4}
{0, 1, 2, 'two', 3, 4}
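
Because sets drop duplicates, converting a list to a set is a quick way to find its unique values. In this sketch the gene names are arbitrary example values:

```python
geneList = ["Aire", "Fezf2", "Aire", "Cd4", "Fezf2"]

uniqueGenes = set(geneList)   # duplicates are dropped
print(len(geneList))          # 5
print(len(uniqueGenes))       # 3
print("Aire" in uniqueGenes)  # True

# Sets have no guaranteed order, so sort them when a stable order is needed
print(sorted(uniqueGenes))    # ['Aire', 'Cd4', 'Fezf2']
```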


Or a dictionary of key-value pairs:

In [31]:
myDict = {"apples": 1, "oranges":10} # dictionaries also use '{}' syntax, but here keys are separated from values by ':' and key-value pairs are separated from each other by ','
print(myDict["apples"]) # The value for a specified key can be extracted
print(myDict.keys()) # Or all the keys can be extracted
print(myDict.values()) # Or all the values can be extracted

1
dict_keys(['apples', 'oranges'])
dict_values([1, 10])


Values in dictionaries can be any data type, while keys can be any immutable (hashable) type such as strings or numbers:

In [32]:
myDict = {"apples": [1,2,3], 10:{"oranges", "pears", "pears"}}
print(myDict["apples"])
print(myDict[10])

[1, 2, 3]
{'pears', 'oranges'}
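
One caveat is that dictionary keys must be hashable, which rules out mutable types such as lists. A minimal sketch of what works as a key and what does not:

```python
myDict = {}
myDict["apples"] = 1   # string key
myDict[10] = "ten"     # integer key
myDict[2.5] = [1, 2]   # float key (the value can still be a list)

try:
    myDict[[1, 2, 3]] = "list as key"  # lists are mutable, so this raises an error
except TypeError as error:
    print("TypeError:", error)

print(len(myDict))  # 3
```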


### Variable naming conventions

Variables in this course are usually named using camel case (first letter of each word is capitalized except for the first word): listOfFruits, listOfNumbers, etc.
<br> Alternatively, separating words by underscores is also acceptable (this is the style recommended by PEP 8, the official python style guide): list_of_fruits, list_of_numbers.
<br> **ALWAYS** use descriptive variable names
<br> It is also helpful to leave a code comment describing the data source/type if a variable is being populated with data loaded from a file

In [47]:
# Bad variable names
list1 = []
list2 = []
listofusersnames = []
testVariable = []
placeholder_var = []

In [48]:
# Good variable names
usernames = []
wtThymusDataset = []
target_genes = []

## Introduction to pandas

In [66]:
import pandas as pd

Beyond the built-in data types, more complex data structures that are built from the basic data types (called objects) can be added using external packages. One such package is pandas, which adds data frames as an available data structure. Pandas data frames are 2-dimensional data structures (i.e. a table) with named rows and columns. There are several ways to create a data frame.

1. By providing a list of lists, where each sublist is one row in the dataframe

In [65]:
pd.DataFrame([[1,2,3], [4,5,6]])

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |


2. By providing a dictionary, where keys are the column names and values are lists containing that column's values

In [67]:
dictionary = {"fruits": ["apples", "oranges", "pears"], "numbers":[1,2,3]}
pd.DataFrame(dictionary)

|  | fruits | numbers |
|---|---|---|
| 0 | apples | 1 |
| 1 | oranges | 2 |
| 2 | pears | 3 |


3. By using either approach 1 or 2, along with a list of row names (the 'index' parameter) and/or a list of column names (the 'columns' parameter)

In [73]:
dataframe = pd.DataFrame([[1,2,3], [4,5,6]], index=["row_1", "row_2"], columns=["column_1", "column_2", "column_3"])
display(dataframe) # instead of using print(), display() will show an easier to read representation of a DataFrame object

dictionary = {"fruits": ["apples", "oranges", "pears"], "numbers":[1,2,3]}
dataframe = pd.DataFrame(dictionary, index=["row_1", "row_2", "row_3"])
display(dataframe)

|  | column_1 | column_2 | column_3 |
|---|---|---|---|
| row_1 | 1 | 2 | 3 |
| row_2 | 4 | 5 | 6 |


|  | fruits | numbers |
|---|---|---|
| row_1 | apples | 1 |
| row_2 | oranges | 2 |
| row_3 | pears | 3 |


Portions of a data frame can be extracted by 'slicing' the data frame object

In [84]:
dictionary = {"fruits": ["apples", "oranges", "pears"], "numbers":[1,2,3], "shapes":["triangle", "square", "circle"]}
dataframe = pd.DataFrame(dictionary, index=["row_1", "row_2", "row_3"])

display(dataframe[["fruits", "shapes"]]) # Slicing a data frame based on a list of column names

display(dataframe.numbers) # single columns can also be extracted using the syntax <df name>.<column name>

|  | fruits | shapes |
|---|---|---|
| row_1 | apples | triangle |
| row_2 | oranges | square |
| row_3 | pears | circle |


row_1    1
row_2    2
row_3    3
Name: numbers, dtype: int64

Comparison logic can also be used to subset a dataframe to only the rows where the value in a certain column fulfills a condition:

In [87]:
dataframe[dataframe.fruits == "apples"]

|  | fruits | numbers | shapes |
|---|---|---|---|
| row_1 | apples | 1 | triangle |
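
Behind the scenes, the comparison produces one True/False value per row (often called a boolean mask), and only the rows marked True are kept. The sketch below rebuilds a smaller version of the example data frame so that it can be run on its own, and also shows how two conditions can be combined:

```python
import pandas as pd

dataframe = pd.DataFrame(
    {"fruits": ["apples", "oranges", "pears"], "numbers": [1, 2, 3]},
    index=["row_1", "row_2", "row_3"],
)

# The comparison returns one True/False value per row (a boolean mask)
mask = dataframe.numbers >= 2
print(mask.tolist())  # [False, True, True]

# Masks are combined with '&' (and) or '|' (or); each comparison needs its own parentheses
subset = dataframe[(dataframe.numbers >= 2) & (dataframe.fruits != "pears")]
print(subset.index.tolist())  # ['row_2']
```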


Subsetting on both rows and columns can be performed using '.loc' (when slicing with row/column labels) or '.iloc' (when slicing with row/column indices):

In [96]:
display(dataframe)

# Label based slicing
display(dataframe.loc["row_2", "fruits"]) # syntax is <df name>.loc[<row>, <column>]
display(dataframe.loc[["row_2", "row_1"], ["fruits", "numbers"]]) # columns and rows to subset on can also be provided as a list
display(dataframe.loc[:, ["fruits", "numbers"]]) # Selecting all rows (or columns) can be denoted by using ':' instead of a list of labels

|  | fruits | numbers | shapes |
|---|---|---|---|
| row_1 | apples | 1 | triangle |
| row_2 | oranges | 2 | square |
| row_3 | pears | 3 | circle |


'oranges'

|  | fruits | numbers |
|---|---|---|
| row_2 | oranges | 2 |
| row_1 | apples | 1 |


|  | fruits | numbers |
|---|---|---|
| row_1 | apples | 1 |
| row_2 | oranges | 2 |
| row_3 | pears | 3 |


In [101]:
display(dataframe)

# index based slicing
display(dataframe.iloc[1, 0]) # syntax is <df name>.iloc[<row index>, <column index>]
display(dataframe.iloc[[1, 0], [0, 1]]) # columns and rows to subset on can also be provided as a list
display(dataframe.iloc[:, [0,1]]) # Selecting all rows (or columns) can be denoted by using ':' instead of a list of indices

|  | fruits | numbers | shapes |
|---|---|---|---|
| row_1 | apples | 1 | triangle |
| row_2 | oranges | 2 | square |
| row_3 | pears | 3 | circle |


'oranges'

|  | fruits | numbers |
|---|---|---|
| row_2 | oranges | 2 |
| row_1 | apples | 1 |


|  | fruits | numbers |
|---|---|---|
| row_1 | apples | 1 |
| row_2 | oranges | 2 |
| row_3 | pears | 3 |


By providing an index before and after ':' (i.e. 'x:y'), rows/columns can also be subset on the range of indices [x,y):

In [102]:
display(dataframe.iloc[:, 0:2])

|  | fruits | numbers |
|---|---|---|
| row_1 | apples | 1 |
| row_2 | oranges | 2 |
| row_3 | pears | 3 |
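
One detail worth knowing is that range slicing behaves differently for labels: with '.iloc' the end index is excluded, while with '.loc' the end label is included. A short self-contained sketch comparing the two:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["row_1", "row_2"],
    columns=["col_1", "col_2", "col_3"],
)

# Index-based range: the end index (2) is excluded
print(df.iloc[:, 0:2].columns.tolist())  # ['col_1', 'col_2']

# Label-based range: the end label ('col_2') is included
print(df.loc[:, "col_1":"col_2"].columns.tolist())  # ['col_1', 'col_2']
```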


There are also several built in operations that can be performed on rows/columns (sum, mean, etc):

In [106]:
df = pd.DataFrame([[1,2,3], [4,5,6]], index=["row_1", "row_2"], columns=["col_1", "col_2", "col_3"])
display(df)
print(df.sum(axis=0)) # The axis parameter determines which axis of the data frame to sum along (0 = sum down the rows, giving one total per column; 1 = sum across the columns, giving one total per row)
print(df.sum(axis=1))

|  | col_1 | col_2 | col_3 |
|---|---|---|---|
| row_1 | 1 | 2 | 3 |
| row_2 | 4 | 5 | 6 |


col_1    5
col_2    7
col_3    9
dtype: int64
row_1     6
row_2    15
dtype: int64


The rows and columns of a dataframe can be transposed:

In [108]:
df = pd.DataFrame([[1,2,3], [4,5,6]], index=["row_1", "row_2"], columns=["col_1", "col_2", "col_3"])
display(df)

dfTransposed = df.T # '<df name>.T' transposes the dataframe and returns it as a new dataframe
display(dfTransposed)

|  | col_1 | col_2 | col_3 |
|---|---|---|---|
| row_1 | 1 | 2 | 3 |
| row_2 | 4 | 5 | 6 |


|  | row_1 | row_2 |
|---|---|---|
| col_1 | 1 | 4 |
| col_2 | 2 | 5 |
| col_3 | 3 | 6 |


## Code control

In order to allow developers to specify how/when specific lines of code are executed within their script, python includes several common code control expressions. The most relevant ones for this course are 'if' statements and 'for' loops. When using nested code control expressions it is very important to pay attention to indentation levels, as this is how python interprets what code belongs within a particular expression

### if statements

if statements are made up of a condition statement and a corresponding block of code that will only be executed if that condition is met:

In [113]:
varOne = 10
varTwo = 40

if varOne < varTwo: # The condition before the ':' can use any of the logical comparison operators
    varOne = varOne + 10 # Any lines of code indented beneath the if statement are only run if the condition is True
    varOne = varOne + 10
print(varOne)

if varOne < varTwo:
    varOne = varOne + 10
print(varOne)

if varOne < varTwo:
    varOne = varOne + 10
print(varOne)

30
40
40


If statements can also be nested:

In [119]:
varOne = 10
varTwo = 30
varThree = 15

if varOne < varTwo:
    varOne = varOne + 10
    if varOne < varThree: # pay attention to indentation levels when nesting
        varOne = varOne + 10
print(varOne)

20


Multiple conditions can be strung together in a single if statement:

In [116]:
varOne = 10
varTwo = 30
varThree = 100

if varOne < varTwo and varOne < varThree: # 'and' conditions will only be true if all comparisons are true
    varOne = varOne+10
print(varOne)
if varOne < varTwo and varOne > varThree: # here the second comparison is False, so the 'and' condition is False and the indented code is skipped
    varOne = varOne+10
print(varOne)

if varOne < varTwo or varOne > varThree: # 'or' conditions will be true if at least one of the comparisons is true
    varOne = varOne+10
print(varOne)

20
20
30


If there is also a block of code that you want to execute if the condition in an if statement is **False**, then the if statement can be followed by an else statement

In [117]:
varOne = 10
varTwo = 30

if varOne > varTwo:
    varOne = varOne + 10
else:
    varOne = varOne - 10

print(varOne)

0
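
When more than two outcomes are needed, python also provides `elif` (short for "else if"), which is only checked when all of the conditions above it were False. The variable names and thresholds in this sketch are arbitrary examples:

```python
numCells = 500

if numCells < 100:
    sizeLabel = "small"
elif numCells < 1000:  # only checked because the first condition was False
    sizeLabel = "medium"
else:                  # runs only if none of the conditions above were True
    sizeLabel = "large"

print(sizeLabel)  # medium
```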


### for loops

If there is a line/block of code that needs to be run repeatedly many times, it can be placed inside a for loop. For loops will iterate over a provided range of values/variables, allowing code within the loop to run during each iteration:

In [142]:
varOne = 0

for i in range(0,10): # here range(0,10) is used to create a range of numbers from 0 (inclusive) to 10 (exclusive). The for loop then iterates through this range.
    print(i)
    varOne = varOne + 10 # This code is run each time as the loop iterates through the range, resulting in it being run 10 times before the loop concludes
print()
print(varOne)

0
1
2
3
4
5
6
7
8
9

100


In [141]:
varOne = 0

for i in range(0,10): # the indexing variable 'i' (can be named anything you want) takes on the values within the range as the loop iterates through them
    varOne = varOne + i
print(varOne)

# This is equivalent to taking the sum of the numbers from 0-9 (each value in the range):
sum([0,1,2,3,4,5,6,7,8,9])

45


45
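
For loops and if statements can be nested inside one another, in which case the indentation level decides which expression each line belongs to. A minimal sketch with made-up values:

```python
values = [3, 12, 7, 20, 1]
numLarge = 0

for value in values:
    if value > 5:                # indented once: part of the for loop
        numLarge = numLarge + 1  # indented twice: only runs when the if condition is True
    print(value)                 # indented once: runs for every value in the list
print(numLarge)                  # not indented: runs once, after the loop has finished
```

Here the final line prints 3, since three of the values (12, 7, and 20) are greater than 5.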

For loops can also be used to iterate over more complex data structures, performing the same operation each time:

In [131]:
dictionary = {"column_1": [1,2,3], "column_2": [4,5,6], "column_3": [7,8,9]}
df = pd.DataFrame(dictionary)
display(df)

for column in df: # here the indexing variable 'column' takes on values equal to each of the column names in df as the loop iterates through each one
    print(sum(df[column])) # This allows us to extract that column from the dataframe using the indexing variable, and perform operations on it

|  | column_1 | column_2 | column_3 |
|---|---|---|---|
| 0 | 1 | 4 | 7 |
| 1 | 2 | 5 | 8 |
| 2 | 3 | 6 | 9 |


6
15
24


You can also iterate over dictionaries

In [146]:
display(dictionary)

for key,value in dictionary.items(): # here the loop iterates over each key-value pair in the dictionary, extracting them to the 'key' and 'value' indexing variables, respectively
    print(key)
    print(sum(value))
print()
# Alternatively, you can iterate over only the dictionary keys:
for key in dictionary.keys(): 
    print(key)
print()
# Or the dictionary values:
for value in dictionary.values(): 
    print(value)

{'column_1': [1, 2, 3], 'column_2': [4, 5, 6], 'column_3': [7, 8, 9]}

column_1
6
column_2
15
column_3
24

column_1
column_2
column_3

[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
