# Proposal: Applied Seqera Academy

Seqera is a company that operates predominantly in the bioscience space. In our 'Academy' onboarding process, we'd like to give you an intuition for the basic problems our customers are trying to solve, and some empathy with their motivations for solving them. We'd like to do this in a way that can be appreciated by new Seqerans of diverse backgrounds, whatever team you're joining, from People to Engineering.

## Introduction to Notebooks

For those without a programming background, it's worth giving a simple outline of what a notebook is in a programming context. If you're already familiar with notebooks, feel free to skip this section.

What you're looking at right now is a 'Jupyter Notebook'. Jupyter Notebooks are an innovative digital tool designed to make computing and data analysis accessible and interactive for both technical and non-technical users. Picture a digital notebook that allows you to write text, like explanations or instructions, and also execute code, all in one place. This is made possible through "cells," which are like individual blocks within the notebook. Some cells contain code, which can be run to perform computations or manipulate data, while others contain text, which can explain what the code does or present results. 

This structure makes Jupyter Notebooks an excellent teaching aid, as it allows educators to seamlessly integrate lessons and practical exercises. Students can easily run complex commands without needing to navigate the often intimidating environment of the terminal. By simplifying the execution of technical tasks, Jupyter Notebooks democratize access to computing and data analysis, enabling a broader audience to engage with and learn from technology.

### Markdown cells

This is a markdown cell (as are all the cells above). Click into this cell and either click the little 'play' button in the toolbar or press Ctrl-Enter (Cmd-Enter on a Mac). Nothing much will happen: you'll just get a rendered version of the text.

### Cells for running things on the command line

Below is a cell for running commands (the '!' means execute a terminal command). Run that cell in the same way and see what happens.

In [None]:
!ls

You should see a listing of the current directory in the Linux file system, produced by the 'ls' command. Without the '!', the contents of a cell are run as Python code. Try running the cell below.

### Cells for running Python

In [None]:
print('Hello world')

This runs the Python 'print' function with the specified text string.

We're going to use these constructs throughout the applied Academy training. You can edit the commands to try out different things, and sometimes we'll ask you to, but at the most basic level all you'll need to do is run each cell and examine the outputs.

## Introduction to Genomics Data 

In this next section we'll introduce you to some crucial genomics data analysis concepts. Don't worry if this sounds daunting: everything will be done with pre-made commands as described above.

If you don't come from a genomics or bioinformatics background, the first thing to do is go to https://learngenomics.dev/ and learn about the basic concepts. Then come back here and we'll learn something about the actual data.

### FASTQ files

The workhorse file format for genomics data is the FASTQ file. The best way to illustrate this is to download one and have a look at it, so let's do that. Run the cell below to get a compressed FASTQ file from the SRA (Sequence Read Archive).

In [None]:
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR131/002/SRR13191702/SRR13191702_1.fastq.gz

`wget` is just a command that 'gets' files from the internet. FASTQ files are often very large, so they are usually stored compressed (hence the `.gz` extension). For our purposes here, let's unzip this one:

In [None]:
!gunzip SRR13191702_1.fastq.gz
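As an optional aside, a compressed file does not have to be unzipped on disk before you can look inside it. The sketch below uses Python's standard `gzip` module to read the first few lines of a compressed text file. The helper name `peek_gzipped` and the tiny `demo.fastq.gz` file are made up for illustration, so the cell runs on its own and does not touch the file downloaded above.

```python
import gzip
import os
import tempfile


def peek_gzipped(path, n_lines=4):
    """Return the first n_lines of a gzip-compressed text file.

    The file is decompressed on the fly as it is read; nothing is
    written back to disk.
    """
    lines = []
    with gzip.open(path, "rt") as handle:  # "rt" means: read as text
        for line in handle:
            lines.append(line.rstrip("\n"))
            if len(lines) == n_lines:
                break
    return lines


# Build a tiny, made-up compressed file (not real sequencing data)
# so that this example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("@demo.1\nACGT\n+\nIIII\n@demo.2\nTTGA\n+\nDDDD\n")

first_record = peek_gzipped(demo_path)
print(first_record)
```

For large files this is often the more practical approach, since the uncompressed copy never needs to exist.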

Each record in a FASTQ file is 4 lines long. Run the following cell to see what the first 4 lines look like.

In [None]:
!head -n 4 SRR13191702_1.fastq

In the output you should see:

 * An identifier (SRR13191702.1 1/1)
 * A sequence (GATGAACGCT...)
 * A separator (+)
 * Encoded 'scores' representing 'quality' (DDDDDIIIII...)

Don't worry too much about that last one. Suffice it to say that those are not really letters: they are numeric values 'encoded' using those characters. Those quality values can be taken into account by the software used to work with these data.
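If you are curious how the encoding works, here is a minimal sketch. It assumes the widely used 'Phred+33' convention, in which each character's ASCII code minus 33 gives the quality score, and a score Q corresponds to an estimated error probability of 10^(-Q/10). The function names are made up for illustration, and the offset is an assumption about this file rather than something stated in it.

```python
def decode_quality(quality_string, offset=33):
    """Convert FASTQ quality characters into integer Phred scores.

    Assumes the Phred+33 convention: score = ASCII code - 33.
    """
    return [ord(char) - offset for char in quality_string]


def error_probability(score):
    """Estimated probability that a base call with this score is wrong."""
    return 10 ** (-score / 10)


# The first ten quality characters quoted in the list above.
scores = decode_quality("DDDDDIIIII")
print(scores)  # [35, 35, 35, 35, 35, 40, 40, 40, 40, 40]

# A score of 40 corresponds to roughly a 1 in 10,000 chance of error.
print(error_probability(40))
```

Higher scores mean more confident base calls, which is why 'I' (40) represents better quality than 'D' (35).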

We can use another command (`sed`) to have a look at the next 4 lines:

In [None]:
!sed -n '5,8p' SRR13191702_1.fastq

You will see the same four-line pattern described above, this time for a new record.
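The same four-line pattern can also be handled in Python. The following optional sketch groups lines into records, four at a time. It uses a small made-up sample (the `@example` identifiers are not real data) rather than the downloaded file, and it assumes every record occupies exactly four lines, as in the file we are looking at.

```python
from itertools import islice


def read_records(lines):
    """Yield (identifier, sequence, separator, quality) tuples.

    Takes any iterable of lines and consumes it four lines at a time.
    """
    iterator = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(iterator, 4)]
        if not chunk:
            return  # no more lines
        if len(chunk) != 4:
            raise ValueError("incomplete FASTQ record")
        yield tuple(chunk)


# Two made-up records, purely for illustration.
sample_lines = [
    "@example.1", "ACGTAC", "+", "IIIIII",
    "@example.2", "TTGACA", "+", "DDDIII",
]

records = list(read_records(sample_lines))
for identifier, sequence, _separator, quality in records:
    print(identifier, sequence, quality)
```

Because `read_records` accepts any iterable of lines, an open file handle could be passed in place of `sample_lines`.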

**Question:** What is the identifier of the third record? Use the same kind of `sed` command as above to get the next record (lines 9-12). Paste it into the box below and run it.