# <center>**Introduction to Python Workshop**</center>
###### <center>[Harvard Chan Bioinformatics Core](https://bioinformatics.sph.harvard.edu/)</center>
###### <center>2022-8-10</center>


---

# Pre-class preparation (10 min)
Welcome to the **Introduction to Python** workshop! Before we start, let's prepare a few things and get familiar with the platform (Google Colab) that we will use during this workshop. Colab is a free cloud service platform based on the [Jupyter notebook](https://jupyter.org/) environment. It does not require users to install Python locally, and the code is run entirely on the cloud server.

Please follow the instructions below to set up for this workshop:
1. **Download the [data](https://github.com/hbctraining/Training-modules/blob/master/Python/data_python_workshop.zip?raw=true) needed for this workshop**: Please save the file and unzip it on your computer (e.g. desktop).
2. **Create your own copy of this python notebook (the workshop materials)**: Click "File" at the menu bar -> Click "Save a copy in Drive". This notebook will be automatically named "Copy of Intro_to_Python_in_class_version" and copied in your Google Drive. You can rename it if you want.
3. **Double check where the material is located**: Click "File" again -> click "Locate in Drive". A new window should pop up, showing you its location in your Google Drive.
4. **Get familiar with the Colab user interface and terminology**: 
*   code cell
*   text cell
*   creating and deleting cell
*   adding comments
*   saving the document


---

# Section 1: Introduction to Python (10 min)

## What is Python?
Python is a powerful, open-source, general-purpose programming language with a wide variety of applications. Many [websites and apps](https://codeinstitute.net/blog/7-popular-software-programs-written-in-python/) that we are familiar with, for example YouTube, Instagram, Spotify etc, are built on Python. In the field of bioinformatics, Python is also widely used in computational programming, data analysis, and pipeline development. 

Below we have listed some examples where Python-based tools are used:

| **Example use case** | **Tool** |
| :---: | :---: |
| Pipeline development | [bcbio](https://bcbio-nextgen.readthedocs.io/en/latest/), [snakemake](https://snakemake.readthedocs.io/en/stable/) |
| Image analysis | [CellProfiler](https://cellprofiler.org/) |
| Molecular visualization | [PyMOL](https://pymol.org/2/) |
| Machine learning | [scikit-learn](https://scikit-learn.org/stable/) |

> Note: While many bioinformatics tools are written in Python, you may have used them (for example, through the command line) without knowing how they work internally. This is okay, but if you want to modify the script for customized usage, it is better to learn the syntax of the script.

Since it was first released in 1991, Python has been constantly developed and updated. The current major version is Python 3. Features of the Python programming language include:
- Easy-to-understand syntax
- Large number of built-in libraries for common tasks
- Succinct and readable format, friendly to both beginner and professional programmers
- Widely used programming language, with comprehensive documentation and plenty of tutorials

## What will you learn in this workshop?
To take this workshop, you don't need to have prior experience in Python (or any other programming language). We will learn basic Python syntax, data structures, functions, data manipulation, and data visualization. At the end of the workshop, students should be able to understand the structure of a simple Python script and perform basic data analysis.

## Additional tips
- `#` at the beginning of a line denotes that this line is a comment for the code - any line starting with a `#` won't be executed by Python. 
- Indentation is very important in Python syntax. Not only is it necessary for Python, it is also great for code readability. By making indentations essential, Python forces good code-writing.
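
As a small illustration of both tips, the sketch below combines comments with an indented block. The `if`/`else` statement is not covered in this workshop, and the variable names are made up for this example; the point is only that the indented lines belong to the statement above them.

```python
# This whole line is a comment, so Python skips it
temperature = 37  # a comment can also follow code on the same line

# The indented line under `if` only runs when the condition is true
if temperature > 30:
    message = 'warm'
else:
    message = 'cool'

print(message)
```

Removing the indentation before `message = 'warm'` would make Python stop with an `IndentationError`.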

---

# Section 2: Basic Python syntax (20 min)

## Variables
A "variable" is a temporary container that stores certain information. We use the `=` operator to assign some value to a variable. The naming of a variable in Python has to fulfill the following rules:
- must start with a letter or underscore character
- can contain only alpha-numeric characters or the underscore character (A-z, 0-9, _ )
- cannot be a reserved keyword in Python. A complete list can be found [here](https://docs.python.org/3.8/reference/lexical_analysis.html#keywords).
- is case-sensitive (e.g. `Year` and `year` are two different variable names)  
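
The naming rules above can be checked with a short sketch. The variable names are invented for illustration; `str.isidentifier()` and the standard `keyword` module are used here only as a convenient way to test a name, not as something you need in everyday code.

```python
import keyword

# Valid names: start with a letter or underscore, then letters, digits, or underscores
gene_count = 10
_sample2 = 'liver'

# Names are case-sensitive, so these are two separate variables
Year = 2021
year = 2022
print(Year, year)

# isidentifier() checks the character rules; iskeyword() checks reserved words
print('gene_count'.isidentifier())   # True
print('2genes'.isidentifier())       # False, starts with a digit
print(keyword.iskeyword('for'))      # True, 'for' is reserved
```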

In [None]:
# Assign two variables: 2 to x, and 5 to y
x = 2
y = 5

The "x" and "y" variables are now stored in the current Python computing environment. We can see the variables in the variable inspector, `{x}`, located at the left side panel.

We could also use the `print()` function, to print out the value of variables to the console. Alternatively, we could just type out the variable name. 
> If running multiple lines of code in a code cell, it is recommended to use the `print()` function. Otherwise, only the value of the last expression in the cell will be printed out.

In [None]:
# Print out the value of variables 'x' and 'y'
print(x)
print(y)

> Note: Within a line of code, Python is lenient about whether to use spaces or not. Using space could help readability, but it is a personal preference.

New variables can be generated by performing some mathematical calculations on existing variables. For example, we can calculate the mean of x and y. 

Python also supports a wide range of common mathematical calculations using operators and functions, e.g. power (`**`) and square root (`sqrt()`, available as `math.sqrt()` after `import math`).

In [None]:
# Calculate the mean of x and y, and store the value to a variable called 'mean'
mean = (x + y) / 2
mean
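
As a quick sketch of the operators and functions mentioned above: `**` works out of the box, while `sqrt()` lives in Python's standard `math` module and has to be imported first. The variable names `power` and `root` are chosen for this example only.

```python
import math

x = 2
y = 5

# Power: x raised to the power of y
power = x ** y
print(power)            # 32

# Square root: sqrt() is provided by the math module
root = math.sqrt(16)
print(root)             # 4.0

# Two more common operators
print(y // x)           # floor division: 2
print(y % x)            # remainder: 1
```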

## Data types
Data comes in different types. For example, the newly created variables `x` and `mean` are numeric. `x` is a whole number, so its data type is `int`, or "integer"; `mean`, on the other hand, is a number with decimal places, so its data type is `float`. 

> These are similar to the "integer" and "numeric" data types in R, respectively.

We can use the `type()` function to check what data type a given variable has.

In [None]:
# Check the data type for 'x' and 'mean'
print(type(x))
print(type(mean))
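
The following sketch shows how `int` and `float` interact. It relies only on built-in functions; the specific numbers are arbitrary.

```python
# Division with / always returns a float, even when the result is a whole number
print(type(4 / 2))       # <class 'float'>

# Mixing an int and a float gives a float
print(type(2 + 3.5))     # <class 'float'>

# int() and float() convert between the two types
print(int(3.9))          # 3 (the decimal part is dropped, not rounded)
print(float(2))          # 2.0
```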

Another commonly used data type is `str`. A string stores a sequence of characters, and can be created by enclosing characters inside single quotation marks `''` or double quotation marks `""`.

> This is similar to the "character" data type in R.

In [None]:
# Generate a str variable called 'text', with the value 'hello world!'. Check its data type
text = 'hello world!'
print(text)
print(type(text))

The last data type we introduce here is `bool`. A boolean value is either `True` or `False`, and indicates whether an expression is true or false. We will cover this data type in the conditional statement section.

> This is similar to the "logical" data type in R.

In [None]:
# Generate a boolean variable called 'z', which tests whether 10 is smaller than 8. Check its data type
z = 10 < 8
print(z)
print(type(z))
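
Boolean values usually come from comparisons. The sketch below lists a few other comparison operators and shows how `and` and `not` combine boolean values; the variable name `is_valid` is made up for this example.

```python
# Comparison operators return a bool
print(10 == 10)        # True  (equal to)
print(10 != 8)         # True  (not equal to)
print(10 >= 12)        # False (greater than or equal to)

# 'and' is True only when both sides are True; 'not' flips the value
is_valid = (5 > 2) and (5 < 4)
print(is_valid)        # False
print(not is_valid)    # True
```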

## Recap
In this section, we introduced some basic terms in Python. We learned **how to assign variables**, and what rules to follow. We also described several important **data types** - `int`, `float`, `str`, `bool`. Hope it has been fun so far!

| **Data type** | **Examples** |
| :---: | :---: |
| int (numeric) | 2 |
| float (numeric) | 3.5 |
| str | 'hello world!' |
| bool | True, False|

In the next section, we will focus on one important concept - Python lists. This will be something you use all the time in Python.

---

# Section 3: Python List (40 min)

## Create a List
Let's talk about data structures. In Python, data is stored in specific ways within variables. A frequently used "data structure" is `list`. A Python list is a collection of data stored within square brackets `[]`.  

A list has the following features:
- order of its elements matters
- can store mixed data types that we introduced above
- can even contain a sublist

> Note: There are other Python data structures, including `tuple`, `dict` (dictionary), and `set`. We will not cover those in this workshop, but they can be very useful in some situations. If you are interested in learning more about them, this [website](https://thomas-cokelaer.info/tutorials/python/data_structures.html) has more information.

In [None]:
# Create an empty list called 'empty'
empty = []
empty

In [None]:
# Create a list called 'species', containing three strings: ecoli, human, corn.
species = ['ecoli', 'human', 'corn']
species

In [None]:
# Create a list called 'glengths', containing three numeric values that correspond to genome length (in Mb): 4.6, 3000, 2500
glengths = [4.6, 3000, 2500]
glengths

In [None]:
# Create a list called 'combined', containing all three species and corresponding genome lengths as pairs
combined = ['ecoli', 4.6, 'human', 3000, 'corn', 2500]
combined

In [None]:
# Create a list called 'combined2', with each species and genome length pair as a sublist
combined2 = [['ecoli', 4.6],['human', 3000], ['corn', 2500]]
combined2

## Subsetting a single element from a list
Now that we created a list, how do we access the data from it? 

We can do so by specifying the "index" number - the location of the data within the list. **Python index starts from 0** (it is not intuitive, we know! Please just bear with it). 

The first element of a list is `list[0]`. Alternatively, we can also use `-` to access the data starting from the last element. The last element of a list is `list[-1]`. The image below illustrates the index for each element. 

<p align="center">
<img src="https://github.com/hbctraining/Training-modules/blob/master/Python/img/list1.png?raw=true" width="500"/>
</p>



In [None]:
# Get the 3rd element from list 'combined'
combined[2]

In [None]:
# Get the 3rd element from list 'combined2'. Notice that the result is a sublist!
combined2[2]

In [None]:
# Get the 3rd from the last element from the list 'combined'
combined[-3]
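
When a list contains sublists, two indexes can be chained: the first picks the sublist and the second picks an element inside it. This is a minimal sketch using the same `combined2` list; the names `last_pair`, `species_name` and `genome_length` are chosen for this example.

```python
combined2 = [['ecoli', 4.6], ['human', 3000], ['corn', 2500]]

# First index picks the sublist, second index picks the element inside it
last_pair = combined2[2]           # ['corn', 2500]
species_name = combined2[2][0]     # 'corn'
genome_length = combined2[2][1]    # 2500

print(last_pair)
print(species_name, genome_length)

# len() returns the number of elements; valid indexes run from 0 to len - 1
print(len(combined2))              # 3
```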

## Subsetting multiple elements from a list
Now, what if we want to access multiple elements in a list? 

Here we introduce the slicing `:` operator. The syntax of "slicing" is `[start:stop:step]`. *start* refers to the starting index of the slice. *stop* refers to the index of the first element just **after** the finish of our "slice". *step* refers to step value of the slice.
> Note: You don't have to specify all three slicing values; when one is not specified, Python will use its default value - **by default, a slice starts from the first element, continues through the last element, and uses a step of 1**.

<p align="center">
<img src="https://github.com/hbctraining/Training-modules/blob/master/Python/img/list2.png?raw=true" width="500"/>
</p>

In [None]:
# Get the first two elements from the list 'combined' - method 1: specify both start and stop position
combined[0:2]

In [None]:
# Get the first two elements from the list 'combined' - method 2: specify only stop position
combined[:2]

In [None]:
# Get the last two elements from the list 'combined' - method 1: use normal index
combined[4:]

In [None]:
# Get the last two elements from the list 'combined' - method 2: use negative index
combined[-2:]

In [None]:
# Get every other element from the list 'combined'
combined[::2]
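
To see all three slicing values at work together, here is a short sketch on the same `combined` list. The names `lengths` and `reversed_list` are made up for this example.

```python
combined = ['ecoli', 4.6, 'human', 3000, 'corn', 2500]

# start=1, stop=6, step=2: every other element beginning at index 1
lengths = combined[1:6:2]
print(lengths)          # [4.6, 3000, 2500]

# A negative step walks backwards, so this gives a reversed copy
reversed_list = combined[::-1]
print(reversed_list)    # [2500, 'corn', 3000, 'human', 4.6, 'ecoli']

# Slicing returns a new list; the original is unchanged
print(combined)
```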

## Recap
In this section, we introduced **what a list is** and **how to create a list in Python**. We learned **how to access one or more elements in a list**. Sometimes there are multiple ways to achieve this goal.

In the next section, we are going to introduce tools that make Python programming truly powerful - functions.

---

# Section 4: Functions (20 min)

## Built-in function
A function is a collection of reusable code that performs a particular task. Python has a set of built-in [functions](https://docs.python.org/3/library/functions.html). For example, `max()` returns the maximum value of a list of numbers.

In [None]:
# Use the max() function to return maximum value of the list 'glengths'
max(glengths)

Let's take another example - `round()` - this function rounds a numeric value to a given number of decimal places. By default, the output will be a whole number. 

In [None]:
# define a variable with the value of pi, and then output the corresponding whole number using the round() function 
pi = 3.14159
round(pi)

What if we want to round the value to specific number of decimal places? In that case, we would have to use additional *arguments* when using the function. 

To check the available arguments and usage information for a function, one can use the `help()` function. However, we would recommend that you search the web for the function you want to use. For instance, [this webpage](https://www.programiz.com/python-programming/methods/built-in/round) shows some nice examples for the `round()` function. You can easily find similar resources online for most other functions.

In [None]:
# Use the `help()` function to display the usage of the `round()` function
help(round)

We now know that we can specify number of digits using the `ndigits` argument within `round()`. Let's try that with pi!

In [None]:
# round the value of pi to 2 decimal places
round(pi, ndigits=2)

### Exercise
Another useful built-in function is `sorted()` - it returns the elements of a given list sorted in a specific order. Use this function to reorder the `glengths` list in **descending** order. Check [here](https://www.programiz.com/python-programming/methods/built-in/sorted) if you are not sure what argument to use.

In [None]:
# Sort the glengths list in descending order
#### Insert your code below ####
sorted(glengths, reverse = True)
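
One detail worth noticing is that `sorted()` returns a new list and leaves the original untouched. The sketch below shows this, along with two other built-in functions, `len()` and `min()`; the name `ascending` is chosen for this example.

```python
glengths = [4.6, 3000, 2500]

# sorted() returns a new list, in ascending order by default
ascending = sorted(glengths)
print(ascending)       # [4.6, 2500, 3000]

# The original list keeps its order
print(glengths)        # [4.6, 3000, 2500]

# Other built-in functions that work on a list
print(len(glengths))   # 3
print(min(glengths))   # 4.6
```

To keep a sorted result for later use, assign it to a variable as done with `ascending` here.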

## Object-specific function
Python has a lot of functionality beyond the basic built-in functions. Recall the data types and data structures that we learned earlier? They are all called Python **objects**. Depending on the object type, there are functions to perform object-specific tasks. 

Let's take a concrete example. One function for a Python string is `count`. `count` searches the substring in the given string and returns how many times the substring is present within the object. The syntax is `string.count(substring)`.

In [None]:
# Count the number of T in the DNA sequence 'TCAGTT'
DNA = "TCAGTT"
DNA.count("T")

Pretty handy, right? We have just touched the tip of the iceberg so far. There are many more [functions](https://docs.python.org/3/library/stdtypes.html#string-methods) for strings in Python. Below we list a few more functions that you will likely see or may use.

| **Function** | **Description** | **Example** | **Output** |
| :---: | :---: | :---: | :---: |
| capitalize | Converts the first character to upper case | 'atgc'.capitalize() | 'Atgc' |
| count | Returns the number of times a specified value occurs in a string | 'atgc'.count('c') | 1 |
| islower | Returns True if all characters in the string are lower case | 'atgc'.islower() | True |
| join | Joins the elements of an iterable into one string, using the string as the separator | ''.join(['a', 't', 'g', 'c']) | 'atgc' |
| replace | Returns a string where an old value is replaced with a new value | 'atgc'.replace('a', 'g') | 'gtgc' |
| split | Splits the string at the specified separator, and returns a list | 'hello world'.split() | ['hello', 'world'] |
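
A few of these string functions can be tried together in one cell. This is a minimal sketch; the sequence and gene names are invented for illustration, and `upper()` is one more string function not listed in the table.

```python
seq = 'atgcatgc'

print(seq.upper())              # ATGCATGC
print(seq.count('at'))          # 2
print(seq.replace('t', 'u'))    # augcaugc

# split() and join() are often used together
header = 'gene1,gene2,gene3'
genes = header.split(',')
print(genes)                    # ['gene1', 'gene2', 'gene3']
print(' | '.join(genes))        # gene1 | gene2 | gene3
```

Note that none of these calls changes `seq` itself; each returns a new value.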

## Recap
In this section, we introduced different types of functions: 
* **built-in**
* **object-specific**

We also walked through some examples of very practical tasks, using these functions. Another type of function is user-defined function, which we do not cover in this workshop. 

---

# Section 5: Packages (10 min)

A Python package contains a collection of pre-defined scripts for specific tasks. It allows us to directly use these scripts to accomplish a task of interest, without having to write everything from scratch.

We need to install a Python package if it is not already present. [pip](https://pip.pypa.io/en/stable/) is the package installer for Python. To install a package, we could use `!pip install package_name` command. For example, to install [scanpy](https://scanpy.readthedocs.io/en/stable/), a popular package for single-cell RNAseq analysis, we could use `!pip install scanpy`.

Most commonly used Python libraries are already installed on Colab, so we do not need to install any packages for now. However, we need to `import` a package before we can use it - this import step is required every time we start a new Python environment. As a result, we usually place the `import` statements at the beginning of a script. 

Sometimes, we give a package an alias, using the `import package_name as alias` syntax. This way, we just need to use the alias when calling a function from the package, which is convenient if we use the package often. For some popular packages, there are established conventions on which alias to use (e.g. `np` for numpy and `pd` for pandas).

In [None]:
# error when using numpy package without importing first
numpy.array([2, 3, 4, 5]) + numpy.array([1, 10, 100, 1000])

In [None]:
# import numpy library, and name it as np
import numpy as np

In [None]:
# use numpy package after importing (note: need to use np, instead of numpy, because np is the alias we set earlier)
np.array([2, 3, 4, 5]) + np.array([1, 10, 100, 1000])
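
To see why a package like numpy is useful, compare what `+` does to plain Python lists and to numpy arrays. This is a small sketch; the names `a`, `b` and `total` are chosen for this example.

```python
import numpy as np

a = [2, 3, 4, 5]
b = [1, 10, 100, 1000]

# With plain lists, + joins the two lists end to end
print(a + b)                # [2, 3, 4, 5, 1, 10, 100, 1000]

# With numpy arrays, + adds the values element by element
total = np.array(a) + np.array(b)
print(total.tolist())       # [3, 13, 104, 1005]
```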

---

# Section 6: Data manipulation (30 min)

Data manipulation is the process in which we organize and clean data to a required format, mostly for downstream analysis. In Python, [pandas](https://pandas.pydata.org/) is a powerful package to perform data manipulation. We will touch upon some basic usage of this package.

First, we need to load our data to the cloud server. Let's upload a tsv file, using the upload button on the left panel. We should see the file in our directory (another folder that Colab auto-generates is `sample_data`).

> Note: We need to upload the data every time we start a notebook. This is because the cloud server erases the data when the previous notebook is disconnected. There are other ways to connect to data (e.g. mounting the Google Drive), but we will not cover them in this workshop.

Once the data is loaded, we can read the data into the Python environment, using the [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function from pandas. By default, the function assumes the file to be comma-separated (`,`). However, our file is tab-separated (`\t`). So we need to add an additional argument, `sep`, to indicate that. 

In [None]:
# import pandas library, and name it as pd 
import pandas as pd

In [None]:
# create a variable called "data", and read in data using the read_csv function
data = pd.read_csv('functional_analysis.tsv', sep='\t')
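
To see what the `sep` argument changes, the sketch below reads a tiny made-up tab-separated table twice. The table contents are invented, and `io.StringIO` is only used so that the example runs without a file on disk.

```python
import io
import pandas as pd

# A tiny made-up table standing in for a tab-separated file
tsv_text = "term.id\tdomain\tp.value\nT1\tBP\t0.01\nT2\tMF\t0.20\n"

# With sep='\t', the three columns are recognized
toy = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(toy.shape)             # (2, 3)
print(list(toy.columns))     # ['term.id', 'domain', 'p.value']

# Without sep, each line is read as one single column
wrong = pd.read_csv(io.StringIO(tsv_text))
print(wrong.shape)           # (2, 1)
```

If a freshly loaded dataframe has only one column, a wrong separator is a likely cause.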

Our data is a two-dimensional dataframe - each row represents a functional process, and each column defines a feature. Below, we will show how to inspect this dataframe and extract data of interest.

The `head` function prints out the first few rows of the dataframe. This is a helpful way to glimpse the structure of the data, without having to print out all the data.

In [None]:
# print out the first few rows of the data
data.head()

The `shape` attribute holds the number of rows and the number of columns of the dataframe. The `columns` attribute holds all column names. Because they are attributes rather than functions, they are used without parentheses.

In [None]:
# print out the shape of the data (number of rows, number of columns)
data.shape

In [None]:
# print out the column names of the data
data.columns

Now that we are familiar with the data, how do we extract specific rows or columns of interest? 

The first method is to select rows or columns based on **position**. We use the [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) indexer from pandas. iloc accepts an integer, a list of integers, or a slice object inside square brackets. 

In [None]:
# select the 4th row
data.iloc[3]

In [None]:
# select the 4th and 5th row. Note that the slicing syntax we introduced in earlier section applies here as well.
data.iloc[3:5]

In [None]:
# select the 4th column
data.iloc[:,3]

In [None]:
# select the 4th and 5th column
data.iloc[:,3:5]

In [None]:
# select data from 4th and 6th row, 1st and 3rd column
data.iloc[[3,5], [0,2]]
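
Since the results above depend on the uploaded file, here is the same idea on a tiny dataframe built by hand, where each result is easy to check by eye. The dataframe `toy` and its values are made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'gene': ['g1', 'g2', 'g3', 'g4'],
    'count': [10, 25, 3, 48],
    'domain': ['BP', 'MF', 'BP', 'CC'],
})

# A single integer returns one row
print(toy.iloc[1])

# A slice excludes the stop position: rows at positions 1 and 2 only
print(toy.iloc[1:3])

# Row position first, column position second: one single value
print(toy.iloc[3, 1])     # 48

# Negative positions count from the end, as with lists
print(toy.iloc[-1, 0])    # g4
```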

The second method is to select based on **column names** or **boolean arrays**. We use the [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) indexer from pandas.

In [None]:
# select the column p.value
data.loc[:,'p.value']

In [None]:
# select multiple columns: term.id, term.name, p.value, query.size, overlap.size
data.loc[:,['term.id', 'term.name', 'p.value', 'query.size', 'overlap.size']]

In [None]:
# select where column "domain" is "BP"
idx = data['domain'] == 'BP'
data[idx]
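
Row filtering and column selection can also be combined in a single `loc` call, and two conditions can be joined with `&`. The sketch below uses a small made-up dataframe; the 0.05 cutoff and the names `is_bp`, `is_small` and `hits` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'term.name': ['repair', 'binding', 'division', 'transport'],
    'domain': ['BP', 'MF', 'BP', 'BP'],
    'p.value': [0.001, 0.2, 0.04, 0.3],
})

# Each comparison gives one True/False value per row
is_bp = toy['domain'] == 'BP'
is_small = toy['p.value'] < 0.05

# & keeps rows where both conditions are True; the list picks the columns
hits = toy.loc[is_bp & is_small, ['term.name', 'p.value']]
print(hits)
```

Here `hits` contains the rows for 'repair' and 'division', with only the two requested columns.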

Lastly, we use the [sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) function to sort values based on a column.

In [None]:
# sort value based on p.value
data.sort_values(by='p.value')
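
Two details of `sort_values` are worth a quick sketch: the `ascending` argument reverses the order, and the function returns a new dataframe rather than changing the original. The dataframe below is made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'term.name': ['repair', 'binding', 'division'],
    'p.value': [0.04, 0.2, 0.001],
})

# Ascending order is the default; ascending=False puts the largest value first
largest_first = toy.sort_values(by='p.value', ascending=False)
print(largest_first)

# The original dataframe keeps its row order
print(toy)

# Assign the result to keep it; head(2) then takes the two smallest p-values
top2 = toy.sort_values(by='p.value').head(2)
print(list(top2['term.name']))   # ['division', 'repair']
```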

---
# Section 7: Visualization (20 min)

Data visualization is an integral part of data analysis. In Python, there are multiple tools for data visualization. For example, pandas has its own [plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) function. Another popular visualization library is [Matplotlib](https://matplotlib.org/). In this lesson, we will demonstrate [seaborn](https://seaborn.pydata.org/), a library based on Matplotlib, but with extended features for informative graphics. 

Let's first import matplotlib and seaborn libraries, and load a metadata file that we want to plot some metrics from.

In [None]:
# import matplotlib.pyplot and seaborn library
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# read in metadata as a dataframe. Use index_col to indicate which column to be used as the row name
metadata = pd.read_csv('metadata.csv', index_col=0)

In [None]:
metadata

The [scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) function from seaborn plots the relationship between two variables. We add an additional argument, `hue`, to indicate the grouping variable. We then call `plt.show()` to display the plot.

In [None]:
# plot a scatterplot with "samplemeans" as y axis and "age_in_days" as x axis. Separate the "celltype" with different color.
sns.scatterplot(x = 'age_in_days', y = 'samplemeans', hue='celltype', data = metadata)
plt.show()

Similarly, we could create [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) and [violinplot](https://seaborn.pydata.org/generated/seaborn.violinplot.html).

In [None]:
# plot a boxplot with "samplemeans" as y axis and "genotype" as x axis. Separate the "celltype" with different color.
sns.boxplot(x = 'genotype', y = 'samplemeans', hue='celltype', data = metadata)
plt.show()

In [None]:
# plot a violinplot with "samplemeans" as y axis and "replicate" as x axis. Separate the "celltype" with different color.
sns.violinplot(x = 'replicate', y = 'samplemeans', hue="celltype", data = metadata)
plt.show()

---

# Section 8: Final exercise


In the final exercise, we want to analyze the tumor purity (the percentage of cancer cells present in a tumor sample) across different cancer types. The data is downloaded from this [paper](https://www.nature.com/articles/ncomms9971). Let's first read in the data, and check what the data is about.

In [None]:
# read in the tumor purity data, and name the variable "purity"
purity = pd.read_csv("tumor_purity.csv")

In [None]:
# Check the first few entries of the data
purity.head()

The data has four columns: Sample_ID, Cancer_type, IHC, and CPE. IHC refers to immunohistochemistry, an experimental method to measure tumor purity. CPE refers to consensus purity estimations, a computational method described in the paper to measure tumor purity.

1. How many samples are included in this data? Among them, how many are GBM cancer type?

In [None]:
# Check the size of the data
purity.shape

In [None]:
# Check the number of samples for GBM
sum(purity['Cancer_type']=='GBM')
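
The count above works because `True` is treated as 1 and `False` as 0 when summed. The sketch below shows this on a small made-up table, together with the pandas function `value_counts()`, which tallies every category at once. The sample IDs and cancer types are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Sample_ID': ['s1', 's2', 's3', 's4', 's5'],
    'Cancer_type': ['GBM', 'LUAD', 'GBM', 'BRCA', 'GBM'],
})

# Summing a boolean column counts the rows where the condition is True
n_gbm = (toy['Cancer_type'] == 'GBM').sum()
print(n_gbm)     # 3

# value_counts() returns the number of rows for each category
counts = toy['Cancer_type'].value_counts()
print(counts)
```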

2. The IHC and CPE values are currently between 0 and 1. For these two columns, change the values to percentages by multiplying the current values by 100. Overwrite the old values with the new values in these columns.

In [None]:
# Change the value to percentage
purity.loc[:,'IHC'] = purity.loc[:,'IHC'] * 100
purity.loc[:,'CPE'] = purity.loc[:,'CPE'] * 100

3. Plot a violinplot, with "Cancer_type" as the x axis and "CPE" as the y axis. 

In [None]:
# violinplot
sns.violinplot(x = 'Cancer_type', y = 'CPE', data = purity)
plt.show()

4. Let's make a few more modifications, to make the plot clearer:

- Rotate the cancer type labels by 45 degrees, so that they don't overlap. Use the `xticks` function from Matplotlib, with this [reference solution](https://stackoverflow.com/questions/10998621/rotate-axis-text-in-python-matplotlib).
- Remove the x label, and revise the y label to "Tumor purity (%)", using the `xlabel` and `ylabel` functions from Matplotlib. [Here](https://matplotlib.org/3.1.1/gallery/pyplots/pyplot_simple.html#sphx-glr-gallery-pyplots-pyplot-simple-py) is a simple example.

In [None]:
sns.violinplot(x = 'Cancer_type', y = 'CPE', data = purity)
plt.xticks(rotation=45)
plt.xlabel('')
plt.ylabel('Tumor purity (%)')
plt.show()

5 (Optional): The plot is clear now, but it would be even better if we sort the cancer types based on the mean value of tumor purity.
To achieve that, we need to:
- calculate the mean of purity grouped by each cancer type. Two useful functions for this step are [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) and [mean](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html).
- sort the purity from lowest to highest, and extract cancer type names
- plot violinplot with the new order

In [None]:
# Calculate the mean of the purity, grouped by each cancer type
purity_mean = purity.groupby(by=['Cancer_type'])['CPE'].mean()

In [None]:
# Sort the purity, from lowest to highest
purity_sort = purity_mean.sort_values()
# Extract the index (cancer type name) as our plotting order
new_order = purity_sort.index
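
If the group-then-sort steps feel abstract, the sketch below repeats them on a tiny made-up table where the means can be worked out by hand. The group names 'A', 'B', 'C' and the values are invented for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'Cancer_type': ['A', 'A', 'B', 'B', 'C', 'C'],
    'CPE': [80.0, 90.0, 40.0, 50.0, 60.0, 70.0],
})

# One mean per group: A is 85, B is 45, C is 65
group_mean = toy.groupby('Cancer_type')['CPE'].mean()
print(group_mean)

# Sorting the means reorders the group names along with the values
order = list(group_mean.sort_values().index)
print(order)     # ['B', 'C', 'A']
```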

In [None]:
# Re-generate the violin plot, by specifying the new order
sns.violinplot(x = 'Cancer_type', y = 'CPE', data = purity, order = new_order)
plt.xticks(rotation=45)
plt.xlabel('')
plt.ylabel('Tumor purity (%)')
plt.show()

---

# Section 9: Class survey (5 min)
Before we conclude with final remarks, please take some time to complete this class [survey](http://tinyurl.com/hbc-modules). We appreciate your comments and feedback.

If you have any suggestions or questions in the future, please contact us at [hbctraining (at) hsph.harvard.edu](mailto:hbctraining@hsph.harvard.edu).

---

# Section 10: Final remarks (5 min)

## Installing Python
In this workshop, we used [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb), an online cloud-based notebook. Colab allows us to easily write Python code and execute it just using our internet browser. It is rapidly gaining popularity for collaborative research and teaching (so, you may encounter it again soon!).

If you would like to use Python offline though, you will need to install Python on your computer. Python can be downloaded and installed from [here](https://www.python.org/downloads/). The latest release version is Python 3.10 (as of August 2022), and it is recommended that you install the latest version.

After Python is installed, there are multiple platforms where you can run Python code. The choice of the platform is a personal preference, and it does not affect code running - at the core we use the native Python environment. Below we introduce some other popular platforms (from among many other options):
1. Jupyter Notebook/JupyterLab: [Jupyter Notebook](https://jupyter.org/) is an open-source web application for interactive computing. It allows users to create and share documents that contain code, equations, plots and narrative text. Since Colab is based on the Jupyter notebook environment, we have checked it out in this workshop already. The extension of the notebook file is always `.ipynb`. JupyterLab is a next-generation version of Jupyter Notebook, just with some improvements. You can install them from [here](https://jupyter.org/install).
2. Spyder: [Spyder](https://www.spyder-ide.org/) is an open-source integrated development environment (IDE) for scientific programming in Python. It offers a combination of script editing, data analysis, debugging, and visualization. If you are already used to coding in an IDE interface, like RStudio or Matlab, you will find that using Spyder is very familiar and intuitive.
3. **(Recommended)** Anaconda: many Python users would use a one-stop-shop package manager like [Anaconda](https://docs.anaconda.com/anaconda/user-guide/getting-started/). Just to note, Anaconda allows users to launch applications and manage conda packages, environments and channels without using command-line commands. After installing Anaconda, you can access its desktop graphical user interface (GUI), the Anaconda Navigator. Applications like Jupyter Notebook, JupyterLab, and Spyder are by default installed and are available on it. Anaconda for Python can be downloaded from [here](https://www.anaconda.com/distribution/).

## Python vs R
Both Python and R are very popular programming languages, with ample training materials and community support. There are many online discussions about which one would be better for [bioinformatics](https://www.reddit.com/r/bioinformatics/comments/af7wjv/r_language_vs_python_which_is_the_most_necessary/), or for [data analysis](https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis), in general. We should keep in mind that both have strengths and weaknesses - one language is not better than the other. It all depends on what your use case is and which tools/packages within each language are available to best handle your task. Python is a general purpose programming language with easy-to-understand syntax and is a good start for learning basic programming. R is widely used among statisticians and has field-specific advantages, such as for [RNA-seq analysis](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) or for [data visualization](https://ggplot2.tidyverse.org/index.html). 

## Future learning
If you are interested in learning more about basics of Python programming, we listed a few additional resources below:
- [Python course on kaggle](https://www.kaggle.com/learn/python)
- [Python course on codecademy](https://www.codecademy.com/learn/learn-python)
- [Python course on software carpentry](https://swcarpentry.github.io/python-novice-inflammation/)
- [A Byte of Python](https://python.swaroopch.com/)
- [Python for Biologists](http://userpages.fu-berlin.de/digga/p4b.pdf)


---

**Authors**: Jihe Liu, Radhika Khetani

*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*