# Lab 1: next generation sequencing and mutation hunting

## Exercises 1 (10 pts)

The following exercises will help prepare you to complete this week's lab.
All lab assignments will be completed using Jupyter notebooks.
For our first exercise, we'll focus primarily on making sure you have access to the lab resources on JupyterHub.

Each week you will need to complete the following:
* **Exercises** focus on more theoretical aspects of the topic we're studying. These will typically consist of 20 pts total. 
* **Lab assignments** contain the main data analysis assignment. These will typically consist of 80 pts total. In the first several labs, these will have many prompts and code examples. In later labs, you will be given less guidance on specific commands to run.

All notebooks for each week are collected (virtually) on Mondays at 9:59am.
Each lab assignment is worth 10% of your total grade, graded out of 100 points.
Each notebook section will be clearly marked with the total number of possible points for that section.

Some reminders on lab policies:
* You are encouraged to work with a partner or group on the lab assignments, although each of you must complete your own Jupyter notebook.
* Google is your friend! This course is meant to expose you to what bioinformatics and research are like in real life. You are allowed, and encouraged, to google things as much as you like.

Please keep in mind these reminders when working on Jupyter assignments:
* Do not "copy" any cells! Instead, use "Insert" to make new cells to play around in. Copying cells creates notebook validation errors that are hard to fix.
* You can remove the `raise NotImplementedError` messages. These are meant as placeholders until you have finished answering each question.
* Once you are done, you must "Submit" your assignment before the deadline. This button should be visible on the "Assignments" tab on Jupyter.
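
As a rough illustration of the placeholder reminder above, the sketch below shows what an unanswered cell and a completed cell might look like. The function names `example_answer` and `example_answer_completed` and the return value are hypothetical and do not come from any assignment; the point is only that the `raise` line makes the cell fail until you replace it with an answer.

```python
def example_answer():
    """Hypothetical question cell as released (not from a real assignment)."""
    # your code here
    raise NotImplementedError  # placeholder: delete once you add your answer


def example_answer_completed():
    """The same hypothetical cell after the placeholder is replaced."""
    # your code here
    return 42


try:
    example_answer()
    placeholder_failed = False
except NotImplementedError:
    # Calling the unfinished function raises, so any test that uses it fails
    placeholder_failed = True

print(placeholder_failed)
print(example_answer_completed())
```

If you leave the `raise` line in place above your answer, the function stops there and your answer is never reached, so remove it once you are done.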

## 1. Jupyter Lab setup

Navigate to [datahub.ucsd.edu](http://datahub.ucsd.edu) and spawn CSE185. 

First, we'll make sure you can run a Jupyter notebook.
Click the "Assignments" tab. Then under released assignments, you should see "lab1-spring24".
Fetch the assignment.
You should now see it under "downloaded assignments".

Click on a notebook to begin!
We'll go over how to validate and submit assignments in lab this week.

Now, edit the python code directly in the cell below to make the `HelloWorld` function return the string "Hello world".

In [1]:
# Write some python code to return the string "Hello world" (5pts)
def HelloWorld():
    """Return the string Hello world"""
    # your code here
    return "Hello world"


In [2]:
"""Check that HelloWorld function works as planned"""
assert(HelloWorld() == "Hello world")

## 2. Course server login and basic UNIX navigation

Now we will access the terminal that our Jupyter notebook is running on top of.
Throughout the lab assignments, we will be switching back and forth between the terminal (where you will run command line tools to perform data analysis) and Jupyter notebooks (where you will complete the assignments and visualize your analysis results).

You can access the terminal in two ways:

* On the upper right corner of the main Jupyter page (the one that led you to this notebook), click "New" then choose "Terminal". This will open a new window with a terminal screen.
* Alternatively, edit your URL to be https://datahub.ucsd.edu/user/yourusername/lab (rather than /tree) to enter the JupyterLab environment. In the Launcher you should see an option to choose "Terminal". JupyterLab is also convenient for viewing the directory structure and files that are available.

Both of these will launch a terminal and put you in your home directory. Use `pwd` to print the current working directory:

```shell
pwd
```

Use `ls` (list) to see what’s in your home directory (it should be empty). 

```shell
ls
```

Besides your home directory, the other directory you need to know about is `~/public/`, which contains all the datasets needed for the assignments. To get to this directory, you can use the `cd` (change directory) command. The general format for this command is:

```shell
cd [directory]
```

To use the command, replace the part in brackets with the path to the directory that you’d like to change to. (We will use a similar format throughout the tutorials for code you will need to fill in.) If you type `cd` alone, the shell will take you to your home directory. When you specify a relative path in the `[directory]` part of the command, a single period (`.`) refers to the current directory and a double period (`..`) refers to its parent.
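
To see how single and double periods resolve, here is a minimal sketch that uses Python's standard `posixpath` module to simplify a few relative paths. The starting directory `/home/yourusername` and the helper `resolve` are made up for illustration; your actual home directory path on the server may look different. Note that this simplification works purely on the text of the path, whereas the shell also follows links on disk.

```python
import posixpath

# Hypothetical current directory, for illustration only
current = "/home/yourusername"


def resolve(relative):
    """Join a relative path onto `current` and simplify '.' and '..'."""
    return posixpath.normpath(posixpath.join(current, relative))


print(resolve("."))                    # the current directory itself
print(resolve(".."))                   # the parent directory
print(resolve("public/lab1"))          # a subdirectory two levels down
print(resolve("public/lab1/../lab2"))  # up one level from lab1, then into lab2
```

The last line corresponds to what `cd ../lab2` would do if you were already inside `lab1`.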

From the `~/public` directory, when you type `ls`, you should see multiple directories, including a directory for each lab (e.g. `lab1`, `lab2`, etc.) and a `genomes` directory with things like reference genomes. You can only read files in this directory; you have write access only to your home directory. We'll talk more about permissions later.

<blockquote>
**UNIX TIP**: The shell has an ‘autocomplete’ feature that will help you correctly type names and paths. If
you start typing the command below and then press the tab key, the shell will automatically fill in
the rest of the directory name, and you can just hit enter. Try it.
</blockquote>

Go ahead and navigate to the `lab1` folder:

```shell
cd lab1/
```

<blockquote>
**UNIX TIP CONTINUED**: If there are multiple options in a directory that start with the same letters (e.g. `lab1`
and `lab2`), when you press tab after you start typing, the shell will autocomplete the shared part,
then beep (if the sound is on) and wait for you to specify the rest. You can then keep typing and
tabbing.
</blockquote>

Use `ls` to see what’s in the `lab1` folder, and `pwd` to get the absolute path to this folder. Here you should see 6 files. Fastq files (`*.fq`) contain raw Illumina sequencing reads from our samples. This was a paired-end run, so 1 and 2 refer to the first and second reads in each pair.

While you are in the `lab1` folder, compare the sizes of these files using the `ls` command. The optional flag `-l` prints the output in "long format" and the `-h` flag makes the sizes human readable (*e.g.*, 1K, 234M, 2G instead of the number of bytes).

```shell
ls -lh
```
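
To make the `-h` flag more concrete, here is a minimal sketch of the kind of conversion it performs, turning a raw byte count into a short size with a unit suffix. The function `human_readable` is hypothetical and only approximates what `ls` prints; the exact rounding and formatting used by `ls` may differ.

```python
def human_readable(num_bytes):
    """Convert a byte count to a short string using powers of 1024."""
    size = float(num_bytes)
    for unit in ["B", "K", "M", "G", "T"]:
        if size < 1024 or unit == "T":
            # Show whole numbers for bytes, one decimal place otherwise
            return f"{size:.0f}{unit}" if unit == "B" else f"{size:.1f}{unit}"
        # Too large for this unit: move up to the next one
        size /= 1024


print(human_readable(512))
print(human_readable(1024))
print(human_readable(5 * 1024**2))
```

Each step up in unit divides the size by 1024, which is why a file with millions of bytes shows up as only a few "M".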

<blockquote>
**IMPORTANT NOTE**: Data analysis you do in the class will often be done in your own home directory.
**DO NOT COPY** the raw sequencing files from the public folder into your folder. These files are sometimes
very large, and the server space for each account and for the course as a whole is limited, so we won’t make copies unless we have to.
</blockquote>

## 3. Some Python intro

While lab assignments will not require writing complex Python programs, we will be using some basic Python quite a bit. If you have not used Python before, we strongly recommend going through the intro course on Stepik that was posted during week 1. We will also introduce some basic Python along the way.

Many data types, such as strings and ints, are similar to those in most other programming languages:

```python
string_variable = 'my string' # This is a string. It must be surrounded by quotes
int_variable = 13 # This is an int
```

In the question below, we will use a Python data type called a dictionary. A dictionary is simply a map of one thing (key) to another (value). For example, the dictionary below maps numbers (ints) to colors (strings).

```python
colordict = {1: 'red', 2: 'green'} # Create a dictionary

colordict[1] # This will return 'red'
colordict[3] = 'purple' # This line adds a new item to the dictionary

colordict[5] # This will return an error, since 5 is not a key in our dictionary.
```
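
If you are not sure whether a key exists, you can check before looking it up. The sketch below reuses the `colordict` example and shows two standard ways to avoid the error: the `in` operator and the dictionary's `get` method. The fallback string `'unknown'` is an arbitrary choice for illustration.

```python
colordict = {1: 'red', 2: 'green'}
colordict[3] = 'purple'

# Test for membership before looking up a key
has_five = 5 in colordict

# get() returns a default value instead of raising an error
color_five = colordict.get(5, 'unknown')

# Looking up a missing key directly raises a KeyError
try:
    colordict[5]
    raised = False
except KeyError:
    raised = True

# Loop over all key/value pairs in the dictionary
for number, color in colordict.items():
    print(number, color)
```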

You can try the code snippets above by opening the Terminal and typing `python` to start a Python shell.

**Question 2 (5 pts):** Use the command `ls -lh` to see the human readable file sizes of the fastq files in the `lab1/` directory. In the cell below, add the sizes (in MB) of the rest of the fastq files to the python dictionary `filesizes`, which maps file names to their size (in MB).

In [3]:
# Initialize a dictionary, called filesizes, mapping filename -> file size
filesizes = {
    'NA12878_child_1.fq': 13,
    'NA12878_child_2.fq': 13,
}
# We can now add items to the dictionary using the syntax below:
filesizes['NA12891_father_1.fq'] = 13

# your code here
filesizes['NA12891_father_2.fq'] = 13
filesizes['NA12892_mother_1.fq'] = 14
filesizes['NA12892_mother_2.fq'] = 14


In [4]:
# Check that the entered file sizes are correct.
# Hidden tests check the rest of the files.
assert(filesizes['NA12892_mother_1.fq'] == 14)

In [5]:
# Check that the completed dictionary has the expected number of keys and unique values
assert(len(filesizes.keys())==6)
assert(len(set(filesizes.values()))==2)

# Check all keys are present
keys = ['NA12878_child_1.fq', 'NA12878_child_2.fq', 'NA12891_father_1.fq', 'NA12891_father_2.fq', 
       'NA12892_mother_1.fq', 'NA12892_mother_2.fq']
for key in keys:
    assert(key in filesizes)

In [6]:
# Hidden tests to check dictionary values

Before turning in your assignment, it is a good idea to click the "Validate" button at the top. This will make sure you have answered at least the autograded questions, and will run all visible tests. (Note that some tests may be hidden from you, so just because the notebook validates doesn't mean all answers are correct.)
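
The visible tests in this notebook are written with Python's `assert` statement. As a minimal sketch of how such a test behaves, the hypothetical helper `check` below reports whether an assertion would pass; it is only for illustration and is not part of the grading machinery.

```python
def check(condition):
    """Return True if an assert on `condition` passes, False if it fails."""
    try:
        assert condition
        return True
    except AssertionError:
        # A failing assert raises AssertionError, which stops a test cell
        return False


print(check(1 + 1 == 2))
print(check("Hello world" == "hello world"))
```

A passing `assert` produces no output at all, so a test cell that runs silently is a good sign. The second check fails because string comparison is case sensitive.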