# Introduction to the command line

**Brief description:** This notebook will walk you through the
installation of Linux subsystems and Ubuntu OS and the steps for setting
up a working environment. You will learn basic Linux commands from
examples focused on analysing biological data, e.g. sequences and their
metadata. This is an introductory tutorial for people who are not
familiar with the command line.

1.  [**The Linux subsystem (Windows only)**](#install-linux)  
1.1. Installing the Linux subsystem  

2.  [**Conda environments**](#conda)  
2.1. Installing Conda  
2.2. Creating and deleting environments  
2.3. Creating environments from files  
2.4. Exporting environments  

3. [**Basic Linux commands**](#commands)  
3.1. Folder hierarchy  
3.2. Create, inspect, and delete files  
3.3. Matching, replacing, and counting patterns  
3.4. Variables, loops, and conditionals  
3.5. Compress and decompress files  
3.6. Connecting with other machines  

4. [**AWK**](#awk)  
4.1. Basic filtering  
4.2. Operations  

5. [**Summary**](#summary) 

---


# [**1. The Linux subsystem (Windows only)**](#install-linux) 

## 1.1. Installing the Linux subsystem

Follow these steps if you work on a Windows machine (most likely Windows 11, but they work on Windows 10 too). *If you already have macOS or Ubuntu/Linux running on your machine, skip this section.*


Right click on the Windows icon in the task bar, click search, search for 'turn windows features on or off'. Open the application.  

Look for the options 'windows subsystem for linux' and 'virtual machine platform' and tick the boxes.  

Hit OK and restart your machine.  

Go to the Windows icon again and search for bash. Open the application. If you see an error and a URL, open the link in a web browser and follow the instructions.

Check **Lecture 2** slides for instructions.

---

# [**2. Conda environments**](#conda)  

This section takes you through installing Conda and managing environments. Documentation about Conda is available
[here](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html).  

## 2.1. Installing Conda  

Download the Linux version of Conda [here](https://www.anaconda.com/download). Scroll down the page and you will find it. It is important that you get the Linux distribution, as you will be installing it in a **Linux subsystem (even within Windows)**.

Use the `wget` command to retrieve the file from conda's website.

The code below will not work easily if you are working under a VU VPN. If that is the case, try the manual download.

In [2]:
%%bash
wget --help # check the manual if you want

GNU Wget 1.20.3, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit
  -h,  --help                      print this help
  -b,  --background                go to background after startup
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE
  -a,  --append-output=FILE        append messages to FILE
  -d,  --debug                     print lots of debugging information
  -q,  --quiet                     quiet (no output)
  -v,  --verbose                   be verbose (this is the default)
  -nv, --no-verbose                turn off verboseness, without being quiet
       --report-speed=TYPE         output bandwidth as TYPE.  TYPE can be bits
  -i,  --input-file=FILE           download URLs found in local or external FILE
  

       --crl-file=FILE             file with bundle of CRLs
       --pinnedpubkey=FILE/HASHES  Public key (PEM/DER) file, or any number
                                   of base64 encoded sha256 hashes preceded by
                                   'sha256//' and separated by ';', to verify
                                   peer against
       --random-file=FILE          file with random data for seeding the SSL PRNG

       --ciphers=STR           Set the priority string (GnuTLS) or cipher list string (OpenSSL) directly.
                                   Use with care. This option overrides --secure-protocol.
                                   The format and syntax of this string depend on the specific SSL/TLS engine.
HSTS options:
       --no-hsts                   disable HSTS
       --hsts-file                 path of HSTS database (will override default)

FTP options:
       --ftp-user=USER             set ftp user to USER
       --ftp-password=PASS         set ftp password to 

Now, open the command line and check where in the folder hierarchy you are. You can do that with the `pwd` command: **print working
directory**. You can explore how to use it by adding the `--help` argument.

You are likely to be in a folder called **home**, which is your starting folder. In a Linux subsystem within Windows, however, that is bash's starting folder and not the home folder of your main disk.  

You need to **change directory** to your **Downloads** folder to find the conda installer.  

Then, **list** files to check if the installer is in that folder. The asterisk is used as a [wildcard](https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm).  

To run the bash `*sh` script, use the `bash` command. Follow the instructions, either pressing ENTER, pressing space to scroll down the page, or typing "yes".  
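A minimal practice sketch of these steps is shown below. It does not use the real installer: a temporary folder stands in for Downloads and `Demo_installer.sh` is a made-up script, both assumptions so that the commands can be tried safely.

```shell
# Practice run only: a temporary folder stands in for Downloads and a tiny
# made-up script stands in for the conda installer (both are assumptions).
practice_dir="$(mktemp -d)"
mkdir "$practice_dir/Downloads"
printf '#!/bin/bash\necho "installer would run here"\n' > "$practice_dir/Downloads/Demo_installer.sh"

pwd                           # print the current working directory
cd "$practice_dir/Downloads"  # change directory (on a Windows subsystem the real
                              # folder is usually under /mnt/c/Users/<your-username>/)
ls *.sh                       # list every file whose name ends in .sh
installer_output="$(bash Demo_installer.sh)"  # run the script with bash
echo "$installer_output"
```

With the real installer, the last step is the same `bash` command followed by the name of the file you downloaded.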

**Note:** always replace my username with your username e.g. mftorr should be replaced with your username.

Once you are done, close the bash command line interface and open a new one.

## 2.2. Creating and deleting environments

Conda allows you to create discrete, enclosed environments with specific versions of programs to ensure that they interact smoothly. If you only care about one particular program, conda will look for the versions of other programs and dependencies (programs that other programs require to work correctly) that work best together and avoid conflicts.  

Every time we run an analysis associated with a publication, we can export the environment to a file that other researchers can use to recreate the environment used for running that analysis. Better yet, if you need to re-install conda, you can quickly use that file to re-install your environments. Conda environments help increase the reproducibility and transparency of your research.  

Most of this section is taken from the [conda](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html) documentation, which you can check in more depth if you need to.  

Let's create an environment from scratch. We ask `conda` to `create` a new environment and name it with `-n bioevo`. When prompted, reply 'y' to confirm that you want to install the packages needed (for now none). Then, activate the environment using `conda activate`.  

Once active, you should see its name on the left side of the prompt. **Anything that you install whilst an environment is active will only be available for that environment.**

If you want to install a particular library through conda, e.g. [VCFtools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/), you can check for the correct channel (channels are the locations where packages are stored) on conda's repository website, e.g. [here](https://anaconda.org/bioconda/vcftools).  

The `conda config --add channels` commands tell conda to include the channels bioconda and conda-forge, the locations where most (but not all) bioinformatics libraries are hosted.

You can check the list of environments installed within conda and their folder location using the command `conda env list`.  

Once you see the folder location, you can remove the environment by deleting the entire folder using `rm` with the `-r` (recursive) argument. Depending on how your system is configured, `rm -r` may delete everything without asking for confirmation; add `-i` if you want to be asked before each deletion.  

**BE CAREFUL!** That command will remove folders, files, data, everything you have in there. Don't drink and `rm`. Alternatively (if you remember the name of your environment), you can use `conda remove` with the name of the environment and the `--all` argument, e.g. `conda remove -n bioevo --all`.
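The steps in this section could look roughly like the sketch below. It assumes conda is installed and on the `PATH` (otherwise it only prints a note), and it uses a made-up environment name, `bioevo_demo`, so that a real `bioevo` environment is left alone. `conda activate` is left out because it is normally typed in an interactive terminal.

```shell
# Sketch only: the conda steps run if conda is installed, otherwise they are skipped.
# "bioevo_demo" is a hypothetical name chosen so that a real "bioevo" is not touched.
if command -v conda >/dev/null 2>&1; then
    conda create -y -n bioevo_demo        # -y answers the confirmation prompt with "yes"
    conda env list                        # lists each environment and its folder
    conda remove -y -n bioevo_demo --all  # removes the environment by name
    conda_demo="ran"
else
    conda_demo="skipped"
fi
echo "conda demo: $conda_demo"
```

Removing the environment by name avoids typing a folder path into `rm -r`, which is where mistakes are most costly.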

## 2.3. Creating environments from files

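For reproducibility, it is sometimes desirable to install an environment from a file that someone else has exported. 
The process is as simple as using `conda env create` with the `-f` argument and the name of the file, e.g. `conda env create -f bioevo_environment.yml`.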


You can confirm that all packages and their dependencies were installed by listing every component within the environment with `conda list` and the name of the environment. The name of a single (or more, separated by spaces) package can be added at the end, to specifically check its installation and version.

In [9]:
%%bash
# check for packages
conda list -n bioevo
conda list -n bioevo blast

# packages in environment at /home/mft/anaconda3/envs/bioevo:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
alsa-lib                  1.2.8                h166bdaf_0    conda-forge
anyio                     3.7.1              pyhd8ed1ab_0    conda-forge
argon2-cffi               21.3.0             pyhd8ed1ab_0    conda-forge
argon2-cffi-bindings      21.2.0          py311hd4cff14_3    conda-forge
asttokens                 2.2.1              pyhd8ed1ab_0    conda-forge
attr                      2.5.1                h166bdaf_1    conda-forge
attrs                     23.1.0             pyh71513ae_1    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                pyhd8ed1ab_3    conda-forge
backports.functools_lru_cache 1.6.5              pyhd8ed1ab_0    c

libiconv                  1.17                 h166bdaf_0    conda-forge
libidn2                   2.3.4                h166bdaf_0    conda-forge
libjpeg-turbo             2.1.5.1              h0b41bf4_0    conda-forge
liblapack                 3.9.0           17_linux64_openblas    conda-forge
libllvm16                 16.0.6               h5cf9203_0    conda-forge
libnghttp2                1.52.0               h61bc06f_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libogg                    1.3.5                h27cfd23_1  
libopenblas               0.3.23          pthreads_h80387f5_0    conda-forge
libopus                   1.3.1                h7f98852_1    conda-forge
libpng                    1.6.39               h753d276_0    conda-forge
libpq                     15.3                 hbcd7760_1    conda-forge
libsndfile                1.2.0                hb75c966_0    conda-forge
libsodium                 1.0.18               h36c2ea0_

qtpy                      2.3.1              pyhd8ed1ab_0    conda-forge
readline                  8.2                  h8228510_1    conda-forge
referencing               0.29.1             pyhd8ed1ab_0    conda-forge
rfc3339-validator         0.1.4              pyhd8ed1ab_0    conda-forge
rfc3986-validator         0.1.1              pyh9f0ad1d_0    conda-forge
rpds-py                   0.8.10          py311h46250e7_0    conda-forge
samtools                  1.17                 hd87286a_1    bioconda
send2trash                1.8.2              pyh41d4057_0    conda-forge
setuptools                68.0.0             pyhd8ed1ab_0    conda-forge
sip                       6.7.9           py311hb755f60_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sniffio                   1.3.0              pyhd8ed1ab_0    conda-forge
soupsieve                 2.3.2.post1        pyhd8ed1ab_0    conda-forge
stack_data                0.6.2              pyhd8ed1a

## 2.4. Exporting environments

You can export and share your own environment. To do so, make sure the environment is active before using `conda env export`, e.g. `conda env export > bioevo_environment.yml`.  

Here, the `>` symbol re-directs the output to a new file called *bioevo_environment.yml*. Every time you run this command and re-direct the output to the file, the old file will be replaced by the new one (if both have the same name). This behaviour is typical of redirecting with `>` and we will cover that later.

---

# [**3.  Basic Linux commands**](#commands)  

We have set up an environment within which we can work. The aim now is to use biological data to learn basic Linux commands and how to apply them in biological data analysis.  

First, we will gather data from [NCBI](https://www.ncbi.nlm.nih.gov/), one of the most widely used databases "[of online resources for biological information and data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323993/)".  

We are also interested in exploring the data available for Lithuanian spiders on NCBI, using as a reference a species list published by [**Biteniekytė and Rėlys (2011)**](https://lmaleidykla.lt/ojs/index.php/biologija/article/view/1926/828).  

Our aim is to understand which spider species have data available, which don't, what type of data is available, and its quality. You are likely to use these commands often, following similar data processing steps, to get an initial idea of data availability before you embark on your own projects.

## 3.1. Folder hierarchy

Files are stored within folders (also called directories; the two terms are interchangeable) and folders are nested within other folders, following a hierarchy similar to a tree structure. Using the command line interface, we can navigate through folders by following the branches of the tree, up or down to different levels of the hierarchy.  

We can express the **path** from one folder to another in two ways: absolute and relative. **Absolute paths** explicitly indicate every folder name along the hierarchy, from the root of the file system to the destination folder or file. **Relative paths** express the steps along the hierarchy needed to reach the destination from the current folder, and assume that the user knows their current position *relative* to the rest of the structure.

For example, assuming the current folder is:

`C/Users/mftorr/Documents/RandomFolder/`

Then the absolute and relative paths to reach `C/Users/mftorr/Documents/biol4evol/Lectures/Lecture0` are:  

**Absolute:** `C/Users/mftorr/Documents/biol4evol/Lectures/Lecture0`  


**Relative:** `../biol4evol/Lectures/Lecture0`  

Thus, to move around folders, the first step is to know your **current working directory**, easily done with `pwd`.
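The absolute and relative routes can be compared with the sketch below. The folder tree is rebuilt inside a temporary directory (an assumption made so the example runs anywhere), so `$root` plays the role of `C/Users/mftorr`.

```shell
# Rebuild the example tree under a temporary root (stand-in for C/Users/mftorr).
root="$(mktemp -d)"
mkdir -p "$root/Documents/RandomFolder" "$root/Documents/biol4evol/Lectures/Lecture0"

# Absolute path: spelled out in full, works from any starting folder.
cd "$root/Documents/RandomFolder"
cd "$root/Documents/biol4evol/Lectures/Lecture0"
via_absolute="$(pwd)"

# Relative path: ".." means "one level up" from the current folder.
cd "$root/Documents/RandomFolder"
cd ../biol4evol/Lectures/Lecture0
via_relative="$(pwd)"

echo "$via_absolute"
echo "$via_relative"
```

Both `echo` lines print the same folder, which shows that the two kinds of path are just different ways of describing one destination.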

It is also helpful to understand which files are inside your current folder by **listing** its items. The command `ls` comes in handy, particularly when specific arguments are used. For example, `ls -l` will list the items within a folder, printing to screen the [permissions of each file](https://en.wikipedia.org/wiki/File-system_permissions) (first column from left to right), the size of the file (fifth column), and the last-modified date. To show file sizes in a "human readable" format (e.g. 29K or 23M), you can use `ls -lh`. If you want to explore more options for listing (or in general for other commands), you can use `ls --help`.

**Advanced:** try using `ls` and Linux wildcards to filter which files to list or move.

You can list the contents of a folder if you know the path. Once you confirm that a particular folder is the one you want to move to, you can **change directory** to that path using `cd`. You can also copy files across folders with `cp` (leaves the original behind) or move them with `mv` (does not leave the original behind, and replaces any file with the same name at the destination). Finally, you can create new folders with `mkdir` and the name of the new folder.
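As a small sketch of creating folders, copying, and moving, the commands below work inside a temporary folder. The names `backup`, `demo_a.txt`, and `demo_b.txt` are made up for illustration.

```shell
# Work in a throwaway folder so nothing important can be overwritten.
work="$(mktemp -d)"
cd "$work"

mkdir backup                # create a new folder
echo "demo" > demo_a.txt    # create a small file to play with
cp demo_a.txt backup/       # copy: the original stays where it was
mv demo_a.txt demo_b.txt    # move within the same folder, i.e. rename
ls -lh . backup             # list both folders to see the result
```

After these steps, `demo_a.txt` exists only inside `backup`, and the current folder holds `demo_b.txt` instead.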

In [10]:
%%bash
pwd # likely something like /home/user
ls
ls -lh

/mnt/c/Users/mftor/Documents/bioinfo4evol/practicals/lecture2
LT_spider_list.txt
Lecture2_command_line.ipynb
Zelote_coordinates.csv
Zelote_coordinates_sel.csv
bioevo_environment.yml
count_sequences.sh
genera.txt
genera_filtered.txt
sequence1.gb
sequence2.gb
species.txt
spiders.fasta
spiders.gb
total 37M
-rwxrwxrwx 1 mft mft  29K Jul 17 15:42 LT_spider_list.txt
-rwxrwxrwx 1 mft mft  55K Sep 13 13:04 Lecture2_command_line.ipynb
-rwxrwxrwx 1 mft mft  23M Jul 26 10:21 Zelote_coordinates.csv
-rwxrwxrwx 1 mft mft 5.1M Jul 26 14:03 Zelote_coordinates_sel.csv
-rwxrwxrwx 1 mft mft 9.4K Jul 17 15:42 bioevo_environment.yml
-rwxrwxrwx 1 mft mft  438 Aug  3 13:08 count_sequences.sh
-rwxrwxrwx 1 mft mft 1.9K Jul 19 15:09 genera.txt
-rwxrwxrwx 1 mft mft  104 Jul 19 15:14 genera_filtered.txt
-rwxrwxrwx 1 mft mft  50K Jul 18 11:07 sequence1.gb
-rwxrwxrwx 1 mft mft  18K Jul 18 09:47 sequence2.gb
-rwxrwxrwx 1 mft mft  12K Jul 19 14:41 species.txt
-rwxrwxrwx 1 mft mft 1.5M Jul 20 18:42 spiders.fasta
-rwxr

**What happened to file1 and file2?**

<div class="alert alert-block alert-success">
<b>TO DO:</b> Familiarise yourself with the folders and files in your system. No need to delete anything, just move around and get used to the structure within your machine.
</div>

## 3.2. Create, inspect, and delete files  

Creating, inspecting, and deleting files is relatively easy. To create a file that you are not writing to immediately (just creating it), `touch filename` is enough. That command will create (but not open) an empty file called *filename*.
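A quick way to see what `touch` does is to run it in a temporary folder (an assumption made here to keep the example tidy) and list the result:

```shell
# Create an empty file in a throwaway folder and inspect it.
work_dir="$(mktemp -d)"
cd "$work_dir"
touch filename   # creates an empty file if it does not exist yet
ls -l filename   # the size column (fifth column) shows 0
```

If the file already exists, `touch` leaves its contents alone and only updates its timestamp.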

If you want to create and modify a file manually, you can do it using `nano filename`.

[nano](https://en.wikipedia.org/wiki/GNU_nano) is a text editor for the command line interface and it is very useful for changing files in a non-systematic way. You can navigate through a file inside of nano using the keyboard arrows, the PgDn and PgUp keys, and the Home and End keys (and many more).

You can write and modify text as usual, just keep in mind that clicking with a mouse is not an option. Once you are happy with the changes, simply follow the sequence for saving the changes in a file with the same name and exiting the editor: `ctrl+x` (Save modified buffer?) `y` (File Name to Write: filename.file) `enter`.

<div class="alert alert-block alert-success">
<b>TO DO:</b> As an exercise, open the <i>answersL0_name.txt</i> in <i>nano</i>, write your name, save and close the file. <b>You will have to answer a few questions throughout this tutorial. For each question, open the <i>answersL0_name.txt</i> in <i>nano</i>, add the question and answers, save and close the file. You will submit your file by the end of the class.</b> </div>

Other useful commands for exploring files without modifying them are `less`, `head`, and `tail`.

`less` is the equivalent of opening the file "in view mode". You can navigate through the file with the keyboard arrows, PgUp and PgDn, and the spacebar, and use the `q` key to exit the view.

`head` and `tail` show the first and last lines of a file, respectively. By default, both commands show 10 lines, but it is possible to modify the number of lines with the `-n` argument, e.g. `-n 50` to show 50 lines.

For example, imagine a program throws an error that says "unrecognised character in line 928 of sequence1.gb". How can you check that line? You can use `head -n 928` and **pipe** the output to `tail`. **"Piping"** (`|`) passes the output of one process to a subsequent process as its input; it simplifies the analysis of large amounts of data and makes it more efficient.

In [11]:
%%bash
head -n 928 sequence1.gb | tail -n 1

                     /PCR_primers="fwd_seq: attcaaccaatcataaagatattgg, rev_seq:
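The same head-then-tail idea can be checked on a file where the answer is known in advance. The sketch below uses a made-up file of 1000 numbered lines instead of `sequence1.gb`.

```shell
# Build a file in which line N contains the number N.
numbered="$(mktemp)"
seq 1 1000 > "$numbered"

# Keep the first 928 lines, then keep only the last of those.
line_928="$(head -n 928 "$numbered" | tail -n 1)"
echo "$line_928"
```

Because `head` stops at line 928, the last line that `tail` receives is exactly the one we are after.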


Two more symbols that are very important and omnipresent are `>` and `>>` for creating and appending to files, respectively.

For example, `head -n 10 sequence1.gb > sequence1_n10.txt` creates a file called `sequence1_n10.txt` that contains the first 10 lines in the `sequence1.gb` file. If the file `sequence1_n10.txt` already exists, then **the command will replace the old file with the new one** conserving no information from the old file.

If, instead of creating a new `sequence1_n10.txt` file we want to **append** additional information to it, we can use `>>`. For example, `head -n 10 sequence2.gb >> sequence1_n10.txt` will append the first 10 lines from the file `sequence2.gb` to the already existing `sequence1_n10.txt` file.
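The difference between the two symbols is easy to verify on a scratch file. The sketch below writes to a temporary file (an assumption, so that none of the course files are overwritten).

```shell
# Scratch file that is safe to overwrite.
out="$(mktemp)"

echo "first"  > "$out"    # creates (or replaces) the file
echo "second" > "$out"    # replaces it again: "first" is gone
echo "third" >> "$out"    # appends a new line at the end

cat "$out"
line_count="$(wc -l < "$out" | tr -d ' ')"
echo "lines in file: $line_count"
```

Only two lines survive, because the second `>` discarded what the first one wrote.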

<div class="alert alert-block alert-success">
<b>TO DO:</b> What is the 500th line in <i>sequence1.gb</i>? And the 500th line in <i>sequence2.gb</i>? Append the two lines to your <i>answersL0_name.txt</i> file.</div>

We will now start working with data on Lithuanian spiders. We have a species list from [Biteniekytė and Rėlys (2011)](https://lmaleidykla.lt/ojs/index.php/biologija/article/view/1926/828).

The `LT_spider_list.txt` file contains the information as directly copy-pasted from the publication's PDF. Our aim is to transform the unorganised information in `LT_spider_list.txt` into a list of species with taxonomic IDs that we will use to gather published sequences available online.

Have a look at the `LT_spider_list.txt` file using `less`, `head`, and `tail`.

<div class="alert alert-block alert-success">
    <b>TO DO: </b>What is the 20th line in <i>LT_spider_list.txt</i>? Append the line to your <i>answersL0_name.txt</i> file</div>

## 3.3. Matching, replacing, and counting patterns  

You might have noticed a pattern from looking at the file. First you find a line starting with "FAMILY" for the spider family and then subsequent lines for every species in that family.

We can use `grep` to print the spider families represented in Lithuania. `grep` is a command to search for **matching patterns** in a file. `grep` has very useful arguments that will help you extract key information from very large files, for example: `-v`, `-A`, `-B`. Familiarise yourself with `grep` and its arguments by checking `--help`.

In [12]:
%%bash
grep --help

Usage: grep [OPTION]... PATTERNS [FILE]...
Search for PATTERNS in each FILE.
Example: grep -i 'hello world' menu.h main.c
PATTERNS can contain multiple patterns separated by newlines.

Pattern selection and interpretation:
  -E, --extended-regexp     PATTERNS are extended regular expressions
  -F, --fixed-strings       PATTERNS are strings
  -G, --basic-regexp        PATTERNS are basic regular expressions
  -P, --perl-regexp         PATTERNS are Perl regular expressions
  -e, --regexp=PATTERNS     use PATTERNS for matching
  -f, --file=FILE           take PATTERNS from FILE
  -i, --ignore-case         ignore case distinctions in patterns and data
      --no-ignore-case      do not ignore case distinctions (default)
  -w, --word-regexp         match only whole words
  -x, --line-regexp         match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline

Miscellaneous:
  -s, --no-messages         suppress error messages
  -v, --invert-match        select no
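The three arguments mentioned earlier (`-v`, `-A`, `-B`) can be tried on a tiny made-up file before using them on the real list. The file contents below are invented for illustration and only mimic the layout of `LT_spider_list.txt`.

```shell
# Made-up miniature of a family/species list.
demo_list="$(mktemp)"
printf 'FAMILY A\nspecies one\nspecies two\nFAMILY B\nspecies three\n' > "$demo_list"

grep -v 'FAMILY' "$demo_list"      # -v: print the lines that do NOT match
grep -A 1 'FAMILY' "$demo_list"    # -A 1: each match plus 1 line after it
grep -B 1 'FAMILY B' "$demo_list"  # -B 1: the match plus 1 line before it

non_family_lines="$(grep -v -c 'FAMILY' "$demo_list")"   # -c counts instead of printing
line_before_b="$(grep -B 1 'FAMILY B' "$demo_list" | head -n 1)"
echo "$non_family_lines lines lack FAMILY; the line before FAMILY B is: $line_before_b"
```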

First, let's check the families in the file.

Then, use `wc`, or the **word count** command for counting the lines that match "FAMILY".

Finally, you can count the number of unique lines that match "FAMILY" (i.e. the lines with the exact same characters, visible or "invisible") using `uniq -c`.

Beware, before checking for unique lines, you must `sort` the lines.

In [13]:
%%bash
# grep 'pattern' file
# use " instead of ' if the pattern is a variable (more on that later)
grep 'FAMILY' LT_spider_list.txt
grep 'FAMILY' LT_spider_list.txt | wc -l
grep 'FAMILY' LT_spider_list.txt | sort | uniq -c

FAMILY – PHOLCIDAE
FAMILY – SEGESTRIIDAE
FAMILY – DYSDERIDAE
FAMILY – MIMETIDAE
FAMILY – ERESIDAE
FAMILY – ULOBORIDAE
FAMILY – THERIDIIDAE
FAMILY – LINYPHIIDAE
FAMILY – TETRAGNATHIDAE
FAMILY – ARANEIDAE
FAMILY – LYCOSIDAE
FAMILY – PISAURIDAE
FAMILY – OXYOPIDAE
FAMILY – ZORIDAE
FAMILY – AGELENIDAE
FAMILY – CYBAEIDAE
FAMILY – HAHNIIDAE
FAMILY – DICTYNIDAE
FAMILY – AMAUROBIIDAE
FAMILY – MITURGIDAE
FAMILY – ANYPHAENIDAE
FAMILY – LIOCRANIDAE
FAMILY – CLUBIONIDAE
FAMILY – CORINNIDAE
FAMILY – GNAPHOSIDAE
FAMILY – SPARASSIDAE
FAMILY – PHILODROMIDAE
FAMILY – THOMISIDAE
FAMILY SALTICIDAE
29
      1 FAMILY SALTICIDAE
      1 FAMILY – AGELENIDAE
      1 FAMILY – AMAUROBIIDAE
      1 FAMILY – ANYPHAENIDAE
      1 FAMILY – ARANEIDAE
      1 FAMILY – CLUBIONIDAE
      1 FAMILY – CORINNIDAE
      1 FAMILY – CYBAEIDAE
      1 FAMILY – DICTYNIDAE
      1 FAMILY – DYSDERIDAE
      1 FAMILY – ERESIDAE
      1 FAMILY – GNAPHOSIDAE
      1 FAMILY – HAHNIIDAE
      1 FAMILY – LINYPHIIDAE
      1 FAMILY – LIO

I encourage you to check all the options for `wc`, `sort`, and `uniq`. These are commands that you will be using very frequently.
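To see why `sort` must come before `uniq`, here is a small sketch on made-up lines. `uniq` only collapses identical lines that are next to each other, so unsorted input is counted incorrectly.

```shell
# Five made-up lines in which "a" and "b" alternate.
demo_lines="$(mktemp)"
printf 'b\na\nb\na\nb\n' > "$demo_lines"

uniq -c "$demo_lines"          # unsorted: no neighbours are equal, nothing collapses
sort "$demo_lines" | uniq -c   # sorted: equal lines are adjacent and get counted

groups_unsorted="$(uniq "$demo_lines" | wc -l | tr -d ' ')"
groups_sorted="$(sort "$demo_lines" | uniq | wc -l | tr -d ' ')"
echo "groups without sort: $groups_unsorted, with sort: $groups_sorted"
```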

<div class="alert alert-block alert-success">
    <b>TO DO: </b>How many spider families are in <i>LT_spider_list.txt</i>? Append the line to your <i>answersL0_name.txt</i> file. What do <i>wc -l</i> and <i>uniq -c</i> do? Manually add your answer to <i>answersL0_name.txt</i> using nano.</div>

We can use grep to search **regular expressions** (known as [regex](https://en.wikipedia.org/wiki/Regular_expression)) to match patterns instead of exact words. For example, we can match a pattern that is made of two strings of alphabetic characters separated by a space character, each string can be of unlimited length as long as one alphabetic character comes after another.
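As a small sketch of that pattern, the commands below apply it to two lines taken from the species list and passed in with `printf`, rather than to the whole file. The `-o` argument prints only the matching part of each line.

```shell
# Two sample lines: a species entry and a continuation line with a synonym.
matches="$(printf '%s\n' \
    'Pholcus phalangioides (Fuesslin, 1775). [5, 16].' \
    '[24] (as Teutana).' |
    grep -o -E '[A-Za-z]+ [A-Za-z]+')"   # letters, one space, letters

echo "$matches"
match_count="$(printf '%s\n' "$matches" | wc -l | tr -d ' ')"
echo "number of matches: $match_count"
```

The second line produces a match too, although it is not a species name, which is why the full output needs cleaning afterwards.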

In [14]:
%%bash
grep -E '[[:alpha:]]+ [[:alpha:]]+ ' LT_spider_list.txt
grep -E '[A-Za-z]+ [A-Za-z]+' LT_spider_list.txt

# the following will only show the matches, one match per line (two matches in a single input line are in different output lines)
grep -o -E '[A-Za-z]+ [A-Za-z]+' LT_spider_list.txt

THE CHECKLIST OF SPECIES
Pholcus phalangioides (Fuesslin, 1775). [5, 16].
Segestria senoculata (Linnaeus, 1758). [3–4; 7, 17].
Harpactea rubicunda (C. L. Koch, 1838). [3–4; 16, 18];
Ero cambridgei Kulczynski, 1911. [20].
Ero furcata (Villers, 1789). [3, 14, 16, 19].
Eresus cinnaberinus (O. P.-Cambridge, 1872). [12, 21].
Hyptiotes paradoxus C. L.Koch, 1834. [12].
Crustulina guttata (Wider, 1834). [5, 9, 14].
Cryptachaea riparia (Blackwall, 1834). [12].
Enoplognatha ovata (Clerck, 1757). [3–4; 14, 16, 20];
Enoplognatha thoracica (Hahn, 1833). [14, 27].
Episinus angulatus (Blackwall, 1836). [5, 7].
Episinus truncatus Latreille, 1809. [14].
Euryopis flavomaculata (C. L. Koch, 1836). [5–6; 9,
Keijia tincta (Walckenaer, 1802). [3–4; 14, 23–25; 29–
Lasaeola prona (Menge, 1868). [5, 9] (as Dipoena).
Lasaeola tristis (Hahn, 1833). [5, 12].
Lessertia dentichelis (Simon, 1884). [12].
Neottiura bimaculata (Linnaeus, 1767). [3–4; 14, 20, 23]
Parasteatoda lunata (Clerck, 1757). [5, 16] (as Achaearan

Tenuiphantes alacris (Blackwall, 1853). [16]; [14] (as
Tenuiphantes cristatus (Menge, 1866). [6–7; 9, 14–15]
Tenuiphantes flavipes (Blackwall, 1854). [3–4; 26] (as
Tenuiphantes mengei (Kulczynski, 1887). [6, 9, 14, 27]
Tenuiphantes tenebricola (Wider, 1834). [14, 20] (as Lepthyphanthes).
Troxochrus scabriculus (Westring, 1851). [6, 11, 14].
Walckenaeria acuminata Blackwall, 1833. [7].
Walckenaeria alticeps (Denis, 1952). [7, 11, 14–15;
Walckenaeria antica (Wider, 1834). [6, 9, 11, 14, 27,
Walckenaeria atrotibialis (O. P.-Cambridge, 1878). [7, 9,
Walckenaeria capito Westring, 1861. [10].
Walckenaeria cucullata (C. L. Koch, 1836). [3–4; 7, 11,
Walckenaeria cuspidata Blackwall, 1833. [7, 9, 14].
Walckenaeria dysderoides (Wider, 1834). [7, 27].
Walckenaeria incisa (O. P.-Cambridge, 1871). [11].
Walckenaeria karpinskii (O. P.-Cambridge, 1873). [9].
Walckenaeria kochi (O. P.-Cambridge, 1872). [14, 27].
Walckenaeria mitrata (Menge, 1868). [27].
Walckeaneria nodosa O. P.-Cambridge, 1873. [7, 9

Agraecina striata (Kulczyn’ski, 1882). [8].
Agroeca brunnea (Blackwall, 1833). [3–4; 6–9; 14, 17,
Agroeca cuprea Menge, 1873. [3, 8, 17].
Agroeca dentigera Kulczyński, 1913. [8–9; 15, 31].
Agroeca lusatica (L.Koch, 1875). [8].
Agroeca proxima (O. P.-Cambridge, 1871). [5–9; 11, 14,
Scotina palliardi (L.Koch, 1881). [6–9, 11, 15, 31].
Clubiona caerulescens L.Koch, 1867. [3–4; 16–17]
Clubiona comta C. L.Koch, 1839. [3–4; 16–17].
Clubiona diversa O. P.-Cambridge, 1862. [3–4; 6, 9,
Clubiona frutetorum L.Koch, 1866. [3–4; 17].
Clubiona germanica Thorell, 1871. [3–4; 17, 22].
Clubiona juvenis Simon, 1878. [27].
Clubiona lutescens Westring, 1851. [3–4; 16–17; 20].
Clubiona marmorata L.Koch, 1866. [3–4; 17–18].
Clubiona neglecta O. P.-Cambridge, 1862. [ 3–4; 6, 10,
Clubiona norvegica Strand, 1900. [27].
Clubiona pallidula (Clerck, 1757). [3–4; 6, 16, 22, 26].
Clubiona phragmitis C. L. Koch, 1843. [3–4; 6, 16–17;
Clubiona reclusa O. P.-Cambridge, 1863. [3–4; 17].
Clubiona similis L.Koch, 1867. [

Robertus scoticus Jackson, 1914. [5, 14, 16, 20].
Robertus ungulatus Vogelsanger, 1944. [5, 14].
Simitidion simile (C. L. Koch, 1836). [3–4; 23–25] (as
Steatoda albomaculata (De Geer, 1778). [4, 6].
Steatoda bipunctata (Linnaeus, 1758). [3–4; 24] (as
Steatoda castanea (Clerck, 1757). [3–4] (as Asagena);
[24] (as Teutana).
Steatoda grossa (C. L.Koch, 1838). [5].
Steatoda phalerata (Panzer, 1801). [3–4; 26] (as Asagena).
Theonoe minutissima (O. P.-Cambridge, 1879). [7, 9,
Phyloneta impressa L. Koch, 1881. [3–4; 16, 19, 23–26]
(as Theridion).
Phyloneta sisyphia (Clerck, 1757). [4, 14] (as Theridion).
= notatum Linnaeus, 1758. [3, 23–25] (as Theridion).
Theridion mystaceum L.Koch, 1870. [9].
Theridion pictum (Walckenaer, 1802). [3–4; 19–20;
Theridion pinastri L.Koch, 1872. [3–4; 23].
Theridion varians Hahn, 1833. [3–4; 20, 22–25].
Abacoproeces saltum (L.Koch, 1872). [11].
Agyneta cauta (O. P.-Cambridge, 1902). [7, 9, 11, 14–
Agyneta conigera (O. P.-Cambridge, 1863). [11, 14–15].
Agyneta de

Walckenaeria karpinskii (O. P.-Cambridge, 1873). [9].
Walckenaeria kochi (O. P.-Cambridge, 1872). [14, 27].
Walckenaeria mitrata (Menge, 1868). [27].
Walckeaneria nodosa O. P.-Cambridge, 1873. [7, 9, 31].
Walckenaeria nudipalpis (Westring, 1851). [4, 6–7; 9,
26, 31]; [3] (as Trachynella).
Walckenaeria obtusa Blackwall, 1836. [14].
Walckenaeria unicornis O. P.-Cambridge, 1861. [6, 14].
Walckenaeria vigilax (Blackwall, 1853). [6].
Zornella cultrigera (L.Koch, 1879). [7].
Metellina mengei (Blackwall, 1870). [3, 7]; [4] (as Meta)
(Family Metidae).
= reticulata mengei Wiehle, 1931. [32] (as Meta).
= reticulata Wiehle, 1931. [26] (as Meta).
Metellina merianae (Scopoli, 1763). [3, 16]; [14, 32] (as
Meta); [4] (as Meta) (Family Metidae).
Metellina segmentata (Clerck, 1757). [3, 16]; [23–24; 26,
32] (as Meta); [4] (as Meta) (Family Metidae).
Pachygnatha clercki Sundevall, 1823. [3–4; 6, 9–10; 14,
Pachygnatha degeeri Sundevall, 1830. [3–4; 6–7; 9–10;
Pachygnatha listeri Sundevall, 1830. [3–4; 7,

Hahnia ononidum Simon, 1875. [20].
Archaeodictyna consecuta (O. P.-Cambridge, 1872). [4].
= sedilloti Simon, 1875. [3, 26] (as Dictyna).
Argenna patula (Simon, 1874). [4].
Argenna subnigra (O. P.-Cambridge, 1861). [5].
Cicurina cicur (Fabricius, 1793). [5–7; 9–10; 14].
Dictyna arundinacea (Linnaeus, 1758). [3–4; 7, 20,
Dictyna pusilla Thorell, 1856. [3–4; 23].
Dictyna uncinata Thorell, 1856. [4, 16].
Amaurobius fenestralis (StrÖm, 1768). [12].
Cheiracanthium erraticum (Walckenaer, 1802). (Family
Cheiracanthium montanum L.Koch, 1877. [18]. (Family
Cheiracanthium punctorium (Villers, 1789). (Family
Cheiracanthium virescens (Sundevall, 1833). (Family
= lapidicolens Simon, 1878. [17].
Anyphaena accentuata (Walckenaer, 1802). [3–4; 14, 17].
Agraecina striata (Kulczyn’ski, 1882). [8].
Agroeca brunnea (Blackwall, 1833). [3–4; 6–9; 14, 17,
Agroeca cuprea Menge, 1873. [3, 8, 17].
= pullata Thorell, 1875 [4].
Agroeca dentigera Kulczyński, 1913. [8–9; 15, 31].
Agroeca lusatica (L.Koch, 1875). [8]

Pholcus phalangioides
Segestria senoculata
Harpactea rubicunda
as Harpactocrates
Ero cambridgei
Ero furcata
Eresus cinnaberinus
Hyptiotes paradoxus
Crustulina guttata
Cryptachaea riparia
Enoplognatha ovata
as Theridium
redimitum Clerck
as Theridium
Enoplognatha thoracica
Episinus angulatus
Episinus truncatus
Euryopis flavomaculata
Keijia tincta
as Theridion
Lasaeola prona
as Dipoena
Lasaeola tristis
Lessertia dentichelis
Neottiura bimaculata
as Theridium
Parasteatoda lunata
as Achaearanea
Parasteatoda simulans
as Achaearanea
Pholcomma gibbum
Robertus arundineti
Robertus lividus
Robertus lyrifer
Robertus neglectus
Robertus scoticus
Robertus ungulatus
Simitidion simile
Steatoda albomaculata
De Geer
Steatoda bipunctata
Steatoda castanea
as Asagena
as Teutana
Steatoda grossa
Steatoda phalerata
as Asagena
Theonoe minutissima
Phyloneta impressa
as Theridion
Phyloneta sisyphia
as Theridion
notatum Linnaeus
as Theridion
Theridion mystaceum
Theridion pictum
Theridion pinastri
Theridion varians


Clubiona phragmitis
Clubiona reclusa
Clubiona similis
Clubiona stagnatilis
Clubiona subsultans
erratica C
Clubiona trivialis
Phrurolithus festivus
Family Micariidae
Family Liocranidae
Phrurolithus minimus
Family Liocranidae
Berlandina cinerea
Callilepis nocturna
Drassodes hypocrita
Drassodes pubescens
Drassodes villosus
Drassyllus lutetianus
Gnaphosa bicolor
Gnaphosa lugubris
Gnaphosa microps
Gnaphosa montana
Gnaphosa muscorum
Gnaphosa nigerrima
Haplodrassus cognatus
as Drassodes
Haplodrassus dalmatensis
Haplodrassus moderatus
Haplodrassus signifer
as Drassodes
Haplodrassus silvestris
Haplodrassus soerenseni
Haplodrassus umbratilis
as Drassodes
Micaria fulgens
Micaria lenzi
Micaria pulicaria
Family Micariidae
Micaria silesiaca
Micaria subopaca
Family Micariidae
albostriata L
Scotophaeus blackwalli
gotlandicus Thorell
An old
record as
gotlandicus in
was wrongly
assigned to
and was
followed in
Scotophaeus scutulatus
Sosticus loricatus
as Scotophaeus
Zelotes aeneus
Zelotes electus
Family 

The code above returns what is mostly "genus species" matches, but there are lines that correspond to words describing species in the original file. We need to remove those.

One obvious pattern that emerges is matches like "as speciesX". One option to deal with this is using the **stream editor** `sed`.

`sed` is a powerful command that replaces matched text and deletes lines by line number or by pattern. In the first example, we will use `sed` to remove lines matching the pattern `^as ` (where `^` indicates that the following characters must be at the beginning of the line).
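As a minimal sketch of both ways of deleting lines, the snippet below runs `sed` on three short lines instead of the tutorial file. The file name `sed_demo.txt` is a hypothetical name used only for this illustration.

```shell
# Small sample written to a hypothetical demo file (not LT_spider_list.txt)
printf 'Ero furcata\nas Theridium\nEro cambridgei\n' > sed_demo.txt

# Delete by pattern: drop every line that starts with "as "
sed '/^as /d' sed_demo.txt

# Delete by line number: drop line 2 only
sed '2d' sed_demo.txt
```

On this sample both commands print the same two lines, because the only line starting with `as ` happens to be line 2.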

In [18]:
%%bash
# sed pattern file, only in this case we don't specify the file - input is coming from the pipe
grep -o -E '[A-Za-z]+ [A-Za-z]+' LT_spider_list.txt | sed -r '/^as /d'

THE CHECKLIST
OF SPECIES
Pholcus phalangioides
Segestria senoculata
Harpactea rubicunda
Ero cambridgei
Ero furcata
Eresus cinnaberinus
Hyptiotes paradoxus
Crustulina guttata
Cryptachaea riparia
Enoplognatha ovata
redimitum Clerck
Enoplognatha thoracica
Episinus angulatus
Episinus truncatus
Euryopis flavomaculata
Keijia tincta
Lasaeola prona
Lasaeola tristis
Lessertia dentichelis
Neottiura bimaculata
Parasteatoda lunata
Parasteatoda simulans
Pholcomma gibbum
Robertus arundineti
Robertus lividus
Robertus lyrifer
Robertus neglectus
Robertus scoticus
Robertus ungulatus
Simitidion simile
Steatoda albomaculata
De Geer
Steatoda bipunctata
Steatoda castanea
Steatoda grossa
Steatoda phalerata
Theonoe minutissima
Phyloneta impressa
Phyloneta sisyphia
notatum Linnaeus
Theridion mystaceum
Theridion pictum
Theridion pinastri
Theridion varians
Abacoproeces saltum
Agyneta cauta
Agyneta conigera
Agyneta decora
Agyneta ramosa
Agyneta subtilis
Anguliphantes angulipalpis
Aphileta misera
Araeoncus humilis

followed in
Scotophaeus scutulatus
Sosticus loricatus
Zelotes aeneus
Zelotes electus
Family Drassidae
Zelotes exiguus
Zelotes clivicola
Zelotes latreillei
Zelotes longipes
Zelotes petrensis
Drassyllus praeficus
Drassylus pusillus
Zelotes subterraneus
Micromata virescens
Philodromus aureolus
Philodromus cespitum
Philodromus collinus
Philodromus emarginatus
Philodromus fuscomarginatus
De Geer
Philodromus histrio
Philodromus margaritatus
Philodromus poecilus
Thanatus arenarius
Thanatus formicinus
Thanatus striatus
Tibellus maritimus
Tibellus oblongus
Coriarachne depressa
Diaea dorsata
Misumena vatia
Misumenops tricuspidatus
Ozyptila atomaria
Ozyptila brevipes
Ozyptila praticola
Ozyptila scabricula
Ozyptila simplex
Ozyptila trux
Thomisus onustus
Xysticus audax
pini Hahn
Xysticus bifasciatus
Xysticus cristatus
viaticus Linnaeus
Xysticus erraticus
Xysticus kochi
Xysticus lanio
Xysticus lineatus
Xysticus luctuosus
Xysticus obscurus
Xysticus sabulosus
Xysticus ulmi
FAMILY SALTICIDAE
Aellurillu

However, that solution is not the best when lines have more than one match, i.e., more than one pair of alphabetic strings on the same line. Another solution is to use `sed` and a regex to match the first instance of the pattern and delete everything after it (line by line).
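Before applying the substitution to the whole file, it can help to try it on one or two lines. The sketch below uses two lines copied from the list; `clean_names` is a hypothetical helper name, and `-E` is used instead of `-r` on the assumption that it is accepted by both GNU and BSD/macOS `sed`.

```shell
# Hypothetical helper: keep only "genus species", then drop lines with punctuation
clean_names() {
  sed -E 's/^([A-Za-z]+) ([A-Za-z]+).+/\1 \2/' | sed '/[[:punct:]]/d'
}

# The first line starts with two words, so everything after them is removed.
# The second line starts with "=", so the substitution does not apply and
# the line is deleted by the punctuation filter instead.
printf '%s\n' 'Dictyna uncinata Thorell, 1856. [4, 16].' \
              '= pullata Thorell, 1875 [4].' | clean_names
```

Only `Dictyna uncinata` is printed.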

In [20]:
%%bash
# s stands for substitute (the command within sed)
# we define groups of patterns using parentheses
# we call back the groups by index (the order in which they occur), e.g. \1 and \2
# .+ matches one or more of any character (any character is symbolised by .)
# g stands for global - replace all matches in a line, not just the first
sed -r 's/^([A-Za-z]+) ([A-Za-z]+).+/\1 \2/g' LT_spider_list.txt

# better yet - we can delete lines matching "FAMILY" first, then remove anything after "genus species" matches
# and finally remove lines with any punctuation symbol
sed '/FAMILY/d' LT_spider_list.txt | sed -r 's/^([A-Za-z]+) ([A-Za-z]+).+/\1 \2/g' | sed '/[[:punct:]]/d'

THE CHECKLIST
Order – ARACHNIDA
FAMILY – PHOLCIDAE
Pholcus phalangioides
FAMILY – SEGESTRIIDAE
Segestria senoculata
FAMILY – DYSDERIDAE
Harpactea rubicunda
[17] (as Harpactocrates).
FAMILY – MIMETIDAE
Ero cambridgei
Ero furcata
FAMILY – ERESIDAE
Eresus cinnaberinus
FAMILY – ULOBORIDAE
Hyptiotes paradoxus
FAMILY – THERIDIIDAE
Crustulina guttata
Cryptachaea riparia
Enoplognatha ovata
[22–23] (as Theridium);
= redimitum Clerck, 1757. [24–26] (as Theridium).
Enoplognatha thoracica
Episinus angulatus
Episinus truncatus
Euryopis flavomaculata
14].
Keijia tincta
30] (as Theridion).
Lasaeola prona
Lasaeola tristis
Lessertia dentichelis
Neottiura bimaculata
(as Theridium).
Parasteatoda lunata
Parasteatoda simulans
Pholcomma gibbum
Robertus arundineti
Robertus lividus
14–16].
Robertus lyrifer
Robertus neglectus
Robertus scoticus
Robertus ungulatus
Simitidion simile
Theridion).
Steatoda albomaculata
Steatoda bipunctata
Asagena).
Steatoda castanea
[24] (as Teutana).
Steatoda grossa
Steatoda phaler

Pardosa prativaga
16, 20, 31].
= riparia O. P.-Cambridge, 1875. [19, 23–24, 26, 34–35]
(as Lycosa).
Pardosa pullata
20, 31]; [34–35] (as Lycosa).
Pardosa riparia
= cursoria (C. L.Koch, 1847). [34–35] (as Lycosa).
Pardosa schenkeli
= calida Dahl, 1908. [34–35] (as Lycosa).
Pardosa sphagnicola
15, 20, 31].
= riparia sphagnicola Dahl, 1908. [34] (as Lycosa).
Pirata hygrophilus
20, 26, 31, 34–35].
Pirata insularis
Pirata latitans
Pirata piraticus
35].
Pirata piscatorius
34–35].
Pirata tenuitarsis
Pirata uliginosus
20, 31, 34–35].
Trochosa robusta
= lapidicola (Dahl, 1927). [26, 34–35].
Trochosa ruricola
19, 24, 26, 34–35].
Trochosa spinipalpis
6–7; 9, 11, 14–15; 26, 31, 34–35].
Trochosa terricola
23–24; 26, 31, 34–35].
Xerolycosa miniata
Xerolycosa nemoralis
24, 34–35].
FAMILY – PISAURIDAE
Dolomedes fimbriatus
16, 19–21; 23–24; 31, 34–35]; (Family Dolomedidae.
[3]).
Dolomedes plantarius
(Family Dolomedidae. [3]).
Pisaura mirabilis
FAMILY – OXYOPIDAE
Oxyopes ramosus
19–20; 24–25].
FAMILY – 

Pelecopsis elongata
Pelecopsis parallela
Pityohyphantes phrygianus
Pocadicnemis pumila
Porrhomma microphthalmum
Porrhomma oblitum
Porrhomma pallidum
Porrhomma pygmaeum
Saaristoa abnormis
Savignia frontata
Silometopus reussi
Sintula cornigera
Stemonyphantes lineatus
Tallusia experta
Tapinocyba biscissa
Tapinocyba insecta
Tapinocyba pallens
Tapinocyba praecox
Tapinopa longidens
Taranucnus setosus
Thyreosthenius parasiticus
Tiso vagans
Tmeticus affinis
Tenuiphantes alacris
Tenuiphantes cristatus
Tenuiphantes flavipes
Tenuiphantes mengei
Tenuiphantes tenebricola
Troxochrus scabriculus
Walckenaeria acuminata
Walckenaeria alticeps
Walckenaeria antica
Walckenaeria atrotibialis
Walckenaeria capito
Walckenaeria cucullata
Walckenaeria cuspidata
Walckenaeria dysderoides
Walckenaeria incisa
Walckenaeria karpinskii
Walckenaeria kochi
Walckenaeria mitrata
Walckeaneria nodosa
Walckenaeria nudipalpis
Walckenaeria obtusa
Walckenaeria unicornis
Walckenaeria vigilax
Zornella cultrigera
Metellina mengei
M

We can confirm how good the processing of the file is by sorting the lines and looking at unique strings. In other words, sort the lines to look for things that do not look like "genus species".

In bioinformatics (and in life in general) it is always good to **check and double-check your results/output**, particularly if you use match/replace, since there will be instances that you did not expect.
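One way to double-check is to print only the lines that do not look like "Genus species". The sketch below is a heuristic under stated assumptions: `flag_suspicious` is a hypothetical helper name, and the pattern assumes a capitalised genus followed by a lowercase epithet. The sample lines are taken from the outputs shown in this notebook.

```shell
# Hypothetical helper: print lines that are NOT "Capitalised lowercase"
flag_suspicious() {
  grep -v -E '^[[:upper:]][[:lower:]]+ [[:lower:]]+$'
}

printf 'Agroeca cuprea\nDe Geer\nall records\nZelotes electus\nAn old\n' > check_demo.txt
flag_suspicious < check_demo.txt
```

This prints `De Geer` and `all records`. Note that `An old` is not flagged, because it has the same shape as a species name; a pattern-based check narrows down what to inspect but does not replace reading the sorted list.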

In [21]:
%%bash
# sort and count unique, then pass output to a file
sed '/FAMILY/d' LT_spider_list.txt | sed -r 's/^([A-Za-z]+) ([A-Za-z]+).+/\1 \2/g' | sed '/[[:punct:]]/d' | sort | uniq -c > species.txt

In [22]:
%%bash
head species.txt

      1 Abacoproeces saltum
      1 Acantholycoa lignaria
      1 Aculepeira ceropegia
      1 Aellurillus v
      1 Agalenatea redii
      1 Agelena labyrinthica
      1 Agraecina striata
      1 Agroeca brunnea
      1 Agroeca cuprea
      1 Agroeca dentigera


You will notice that `uniq -c` produces two columns, the first column is the frequency of a string and the second column is the string itself.

You can **cut** the columns in a file and look at specific ones using `cut` and specifying a field delimiter (like in Excel) using `cut -d" "`. In this case, the delimiter is a single space character; by default, `cut` splits on tabs.

You will also notice that `uniq -c` pads the count with leading spaces, so splitting on single spaces produces several empty fields before the first column. You can squeeze runs of repeated spaces into a single one by **translating** the strings using `tr -s`.

`cat` simply prints the file contents to the screen.
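The effect of squeezing the padding can be seen on a tiny sample. This is a sketch with three sample lines; it relies on `tr -s ' '` so that the field numbers do not depend on how many spaces `uniq -c` uses for padding, which is an assumption worth keeping in mind when moving between systems.

```shell
# Count duplicates in a three-line sample
printf 'Ero furcata\nEro furcata\nZora nemoralis\n' | sort | uniq -c

# After squeezing spaces: field 1 is empty, field 2 is the count,
# fields 3 and 4 are genus and species
printf 'Ero furcata\nEro furcata\nZora nemoralis\n' | sort | uniq -c \
  | tr -s ' ' | cut -d' ' -f3,4

# Only the counts
printf 'Ero furcata\nEro furcata\nZora nemoralis\n' | sort | uniq -c \
  | tr -s ' ' | cut -d' ' -f2
```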

In [23]:
%%bash
# squeeze repeated spaces into one, then split on single spaces
# after squeezing, field 1 is empty, field 2 is the count, and fields 3 and 4 are genus and species
cat species.txt | tr -s ' ' | cut -d' ' -f3,4

# the above transformation results in the same output as the code below
cut -d' ' -f8,9 species.txt

Abacoproeces saltum
Acantholycoa lignaria
Aculepeira ceropegia
Aellurillus v
Agalenatea redii
Agelena labyrinthica
Agraecina striata
Agroeca brunnea
Agroeca cuprea
Agroeca dentigera
Agroeca lusatica
Agroeca proxima
Agyneta cauta
Agyneta conigera
Agyneta decora
Agyneta ramosa
Agyneta subtilis
Alopecosa aculeata
Alopecosa barbipes
Alopecosa cuneata
Alopecosa cursor
Alopecosa fabrilis
Alopecosa inquilina
Alopecosa mariae
Alopecosa pinetorum
Alopecosa pulverulenta
Alopecosa trabalis
Amaurobius fenestralis
An old
Anguliphantes angulipalpis
Antistea elegans
Anyphaena accentuata
Aphileta misera
Araeoncus humilis
Araneus alsine
Araneus angulatus
Araneus diadematus
Araneus marmoreus
Araneus quadratus
Araneus sturmi
Araneus triguttatus
Araniella cucurbitina
Araniella displicata
Archaeodictyna consecuta
Arctosa alpigena
Arctosa cinerea
Arctosa leopardus
Arctosa perita
Arctosa stigmosa
Argenna patula
Argenna subnigra
Argiope bruennichi
Argyroneta aquatica
Asianellus festivus
Aulonia albimana
Ballu

Xysticus audax
Xysticus bifasciatus
Xysticus cristatus
Xysticus erraticus
Xysticus kochi
Xysticus lanio
Xysticus lineatus
Xysticus luctuosus
Xysticus obscurus
Xysticus sabulosus
Xysticus ulmi
Yllenus arenarius
Zelotes aeneus
Zelotes clivicola
Zelotes electus
Zelotes exiguus
Zelotes latreillei
Zelotes longipes
Zelotes petrensis
Zelotes subterraneus
Zora nemoralis
Zora silvestris
Zora spinimana
Zornella cultrigera
Zygiella atrica
Zygiella x
all records
and was
not be
were wrongly
Abacoproeces saltum
Acantholycoa lignaria
Aculepeira ceropegia
Aellurillus v
Agalenatea redii
Agelena labyrinthica
Agraecina striata
Agroeca brunnea
Agroeca cuprea
Agroeca dentigera
Agroeca lusatica
Agroeca proxima
Agyneta cauta
Agyneta conigera
Agyneta decora
Agyneta ramosa
Agyneta subtilis
Alopecosa aculeata
Alopecosa barbipes
Alopecosa cuneata
Alopecosa cursor
Alopecosa fabrilis
Alopecosa inquilina
Alopecosa mariae
Alopecosa pinetorum
Alopecosa pulverulenta
Alopecosa trabalis
Amaurobius fenestralis
An old
Ang

Trochosa spinipalpis
Trochosa terricola
Troxochrus scabriculus
Walckeaneria nodosa
Walckenaeria acuminata
Walckenaeria alticeps
Walckenaeria antica
Walckenaeria atrotibialis
Walckenaeria capito
Walckenaeria cucullata
Walckenaeria cuspidata
Walckenaeria dysderoides
Walckenaeria incisa
Walckenaeria karpinskii
Walckenaeria kochi
Walckenaeria mitrata
Walckenaeria nudipalpis
Walckenaeria obtusa
Walckenaeria unicornis
Walckenaeria vigilax
Xerolycosa miniata
Xerolycosa nemoralis
Xysticus audax
Xysticus bifasciatus
Xysticus cristatus
Xysticus erraticus
Xysticus kochi
Xysticus lanio
Xysticus lineatus
Xysticus luctuosus
Xysticus obscurus
Xysticus sabulosus
Xysticus ulmi
Yllenus arenarius
Zelotes aeneus
Zelotes clivicola
Zelotes electus
Zelotes exiguus
Zelotes latreillei
Zelotes longipes
Zelotes petrensis
Zelotes subterraneus
Zora nemoralis
Zora silvestris
Zora spinimana
Zornella cultrigera
Zygiella atrica
Zygiella x
all records
and was
not be
were wrongly


Similarly, to check the unique genera in the list, we can combine `cat`, `tr`, `cut`, `sort`, and `uniq`.

In [24]:
%%bash
cat species.txt | tr -s ' ' | cut -d' ' -f3 | sort | uniq > genera.txt

In [25]:
%%bash
head genera.txt

Abacoproeces
Acantholycoa
Aculepeira
Aellurillus
Agalenatea
Agelena
Agraecina
Agroeca
Agyneta
Alopecosa


<div class="alert alert-block alert-success">
    <b>TO DO: </b>What is happening? Is it working perfectly or not? How many genera are there? How could you process this file more efficiently? Manually add your answer to <i>answersL0_name.txt</i> using nano.</div>

To speed up the process of sequence downloads, we will select the first 10 genera.

<div class="alert alert-block alert-success">
    <b>TO DO: </b>Create a file called <i>genera_filtered.txt</i> with the first 10 genera from <i>genera.txt</i>. We will search for COI sequences for the genera in the list by processing one line at a time. We will use loops and variables to achieve it.</div>

## 3.4. Variables, loops, and conditionals  

A variable is a symbolic name that represents a value that can be assigned and re-assigned depending on the context in which it is used. Variables are useful for generalising programs, creating pipelines where only one input/argument changes, and making loops more efficient.

You can assign a value to the variable `NAME` like this:

In [26]:
%%bash
NAME='my_name'
AGE=15

# check the variable values using the echo function
echo $NAME
echo $AGE

# what happens if you miss the $ symbol?
echo NAME
echo AGE

my_name
15
NAME
AGE


A variable can also be the output of a command, for example:

In [27]:
%%bash
GENUS=$(head -n 10 species.txt | tail -n 1 | sed -r 's/([A-Za-z]+) .+/\1/g' | cut -d' ' -f8)
echo $GENUS

Agroeca


A **loop** is a structure that repeats a set of commands multiple times based on a certain condition. The condition could be an array of items being exhausted or a variable reaching a defined value.

Two common loops in bioinformatics are `for` and `while`. `for` runs a command on each item of an array (variables, files, numbers, etc.) until the array is exhausted. `while` keeps running for as long as its condition is true.

In [28]:
%%bash

for genus in Abacoproeces Agelena Alopecosa Gnaphosa Haplodrassus; do
  echo $genus;
done

# or 

for genus in Abacoproeces Agelena Alopecosa Gnaphosa Haplodrassus; do echo $genus; done
for genus in Abacoproeces Agelena Alopecosa Gnaphosa Haplodrassus; do echo "Genus: $genus"; done
# note above that a variable combined with text, or passed to grep or sed, needs double quotes so that the shell expands it

count=1
while [ $count -le 5 ]; do
    echo "Count is $count";
    ((count++));
done

# above, the -le argument stands for "less than or equal to"
# ((count++)) increases count by 1 on every pass through the loop, starting from count=1

Abacoproeces
Agelena
Alopecosa
Gnaphosa
Haplodrassus
Abacoproeces
Agelena
Alopecosa
Gnaphosa
Haplodrassus
Genus: Abacoproeces
Genus: Agelena
Genus: Alopecosa
Genus: Gnaphosa
Genus: Haplodrassus
Count is 1
Count is 2
Count is 3
Count is 4
Count is 5


Now, we will use the `genera_filtered.txt` file to search for all [COI - cytochrome c oxidase I](https://en.wikipedia.org/wiki/Cytochrome_c_oxidase_subunit_I) sequences for each genus available on NCBI.

We will use a couple of commands from NCBI's Entrez Direct (EDirect) suite, `esearch` and `efetch`. We will learn about these in more depth in another lecture but for now, `esearch` creates a search query and returns an NCBI URL with the results of the search. `efetch` then takes the URL and retrieves the results in the format chosen by the user, in our case, fasta format.

Note that we are reading `genera_filtered.txt` line by line and assigning the line value to the `$GENUS` variable. Then, we pass that variable to `esearch` and `efetch` to retrieve the sequences in fasta format (we will learn more about sequence formats later).

**Important note:** Keep in mind that different search terms will output different results. For example, searching for "cytochrome" will return sequences with either *cytochrome* or *cytochrome II* or *cytochrome I* in the title and will not return sequences without *cytochrome* but with *COI* in the title.

The code above creates the URL with the results from searching sequences from each genus and with *cytochrome* in the sequence title. Then, one line downloads the results in fasta format (only sequences) and in GenBank (gb) format.
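The loop described above can be sketched as follows. This is not the notebook's original cell but a minimal sketch under several assumptions: the `[ORGN]` and `[TITL]` field tags, the `nuccore` database, the helper `build_query`, the `RUN_DOWNLOAD` switch, and the `*_demo` file names are all hypothetical choices made for illustration. By default the sketch only prints the queries; it contacts NCBI only when `RUN_DOWNLOAD=yes` is set and `esearch` is installed.

```shell
# Hypothetical input so the sketch runs on its own; in the tutorial the
# genera come from genera_filtered.txt
printf 'Agelena\nAgroeca\n' > genera_demo.txt

# Build the search term for one genus (field tags are an assumption)
build_query() {
  printf '%s[ORGN] AND cytochrome[TITL]' "$1"
}

# Read the file one line at a time; each line is stored in GENUS
while IFS= read -r GENUS; do
  QUERY=$(build_query "$GENUS")
  echo "Searching: $QUERY"
  if [ "${RUN_DOWNLOAD:-no}" = "yes" ] && command -v esearch >/dev/null 2>&1; then
    # </dev/null guards against esearch reading the loop's own input
    esearch -db nuccore -query "$QUERY" </dev/null \
      | efetch -format fasta >> spiders_demo.fasta
    esearch -db nuccore -query "$QUERY" </dev/null \
      | efetch -format gb >> spiders_demo.gb
  fi
done < genera_demo.txt
```

Appending with `>>` keeps the sequences of all genera in one file; remember to delete the output files before re-running, otherwise the records are duplicated.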

In fasta files, the line containing the information about the sequence starts with the '\>' symbol whereas the lines containing the sequences will not have any symbol at all. Thus, you can count the number of sequences in the fasta file by using `grep` to find the '\>' symbol in the `spiders.fasta` file, then pipe the result to `wc` and count the number of lines.
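The counting approach can be tried on a small file first. The sketch below writes a hypothetical two-record file, `fasta_demo.fasta`, with invented sequences; it is not part of the downloaded data.

```shell
# Invented two-record fasta file, for illustration only
printf '>seq1 demo record\nATGC\nATGC\n>seq2 demo record\nGGCC\n' > fasta_demo.fasta

# The quotes around '>' matter: without them the shell treats > as a redirection
grep '>' fasta_demo.fasta | wc -l

# grep can also count matching lines itself; ^ restricts the match to the line start
grep -c '^>' fasta_demo.fasta
```

Both commands report 2 for this file, one per header line.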

<div class="alert alert-block alert-success">
    <b>TO DO: </b>How many sequences are there in the <i>spiders.fasta</i> file? Append the answer to your <i>answersL0_name.txt</i> using <i>>></i>.</div>

<div class="alert alert-block alert-success">
    <b>TO DO: </b>Use a <i>for</i> loop to grep each genus from the <i>genera_filtered.txt</i> file and count how many sequences were downloaded.</div>

<div class="alert alert-block alert-success">
    <b>TO DO: </b>Which spider genera do NOT have sequences available? Manually add your answer to your <i>answersL0_name.txt</i> using <i>nano</i>.</div>

To check the publications associated with the sequence submissions, you can use `grep` and include in the results one extra line for a bit of context.

Check for the country and `lat_lon` fields associated with every sequence in the `gb` (or GenBank) file. Where are the vouchers (the source of the sequence) from? Where were they collected? Is any voucher collected in Lithuania?
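Both searches can be sketched on a small stand-in file. The file `gb_demo.gb` below is hypothetical and contains placeholder values only; it merely imitates the layout of the `TITLE`, `/country`, and `/lat_lon` lines of a GenBank record so that the `grep` options can be tried safely.

```shell
# Placeholder lines imitating part of a GenBank record (values are invented)
cat > gb_demo.gb <<'EOF'
  TITLE     Placeholder title, first line
            placeholder title, second line
  JOURNAL   Unpublished
                     /country="Placeholder country"
                     /lat_lon="0.00 N 0.00 E"
EOF

# -A 1 prints one extra line After each match, useful for titles that wrap
grep -A 1 'TITLE' gb_demo.gb

# -E enables alternation with |, so either field is matched
grep -E '/country=|/lat_lon=' gb_demo.gb
```

On the real download, replace `gb_demo.gb` with the GenBank file you created.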

Last but not least, we want to understand the quality of our data.

We downloaded COI sequences from spiders. COI is a universally used DNA barcode that is amplified and sequenced using primers: oligonucleotides that match a conserved region near the sequence of interest and *prime* the Taq polymerase to start the synthesis of the complementary strand (after denaturation and annealing).

Primers are often sequenced together with the rest of the barcode fragment and it is advisable to remove their sequences from the data before proceeding to do phylogenetic analyses. However, sequences submitted to NCBI are not always curated and clean, so we must identify which sequences still contain the primers. This is easily done with `grep`.

In [30]:
%%bash

grep 'AGATATTGG' spiders.fasta # LCO 1490 (Miller et al., 2013)

# it is possible to search for the reverse complement of a primer in the downloaded sequences
# primer Lepidoptera Forward (Hebert et al., 2004)
# note that the line below gives three separate instructions instead of one pipeline
# each instruction is separated by a ;
# this is rarely done because the code becomes difficult to read
# -C n prints n lines of context around each match

REVSEQ=$(echo 'TAAAGATAT' | tr ACGTacgt TGCAtgca | rev ); echo $REVSEQ; grep -C 3 "$REVSEQ" spiders.fasta

ATAGATATTGGGACTTTATATTTAATTTTTGGGGCTTGGTCTGCTATAGTGGGGACAGCTATAAGAGTAT
ATAGATATTGGGACTTTATATTTAATTTTTGGGGCTTGGTCTGCTATAGTGGGGACAGCTATAAGAGTAT
TTTTCAACTAATCATAAAGATATTGGAAGTTTATATTTTATTTTTGGAGCTTGAGCGGCTATAGTTGGAA
AATCATAAAGATATTGGAACTTTATATTTAATGTTAGGTGTTTGATCGGCTATAATAGGGACTGCTATAT
ATAGATATTGGGACTTTATATTTACTGTTAGGTGTTTGATCGGCAATAATAGGTACTGCTATATCAGTGT
ATATCTTTA
TTATTATTGTTTATTTCTTCTATAGCTGAAATAGGAGTTGGTGCTGGTTGAACTGTTTATCCTCCTTTAG
CATCTAGAGTTGGTCATGTAGGTAGTGCTATGGATTTTGCTATTTTTTCTTTACATTTAGCAGGGGCATC
TTCAATTATAGGTGCTGTAAATTTCATTTCTACTATTATAAATATACGTTCTACTGGAATATCTATAGAG
AAGGTATCTTTGTTTGTTTGATCGGTATTTATTACTGCAGTATTATGATTAATATCTTTACCTGTATTAG
CAGGAGCAATTACCATATTATTGACTGATCGGAATTTTAATACTTCTTTTTTTGAACCTGCAAGGGGGGG
AGATCCTATTTGACATCAACATT
>MW996910.1 Agroeca inopina voucher UB-MD615 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial


Think about this. What implications exist if the primer sequence is at the beginning, at the end, or in the middle of the sequence?

## 3.5. Compress and decompress files

Compressing and decompressing files is a common task in bioinformatics. Compression [reduces the size of a file by reducing the bits needed to encode the same information](https://en.wikipedia.org/wiki/Data_compression).

Compressing files is useful because pipelines (a series of interconnected programs/analyses to process data, from raw to final) often generate large files that fill the disk space pretty quickly. It is also useful because smaller files are easier to transfer.

Common compression and decompression pairs of programs are (respectively) `gzip` and `gunzip`, and the parallelizable `pigz` and `pigz -dc`. Another popular program is `tar`, which bundles multiple files into a single archive and has compression options, so it is not limited to compressing or decompressing single files.

Explore the options for each of those programs, particularly how to deal with the output. Which one creates a file automatically? Which one sends the output to *standard output* ([stdout](https://en.wikipedia.org/wiki/Standard_streams)) instead?
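As a starting point for that exploration, here is a minimal sketch using `gzip`, `gunzip`, and `tar` on a small file. The `compress_demo*` names are hypothetical and the file content is invented.

```shell
# Small invented file to compress
printf 'ATGC\nATGC\n' > compress_demo.txt

# -c writes the compressed data to stdout, so the original file is kept
gzip -c compress_demo.txt > compress_demo.txt.gz

# gunzip -c decompresses to stdout and leaves the .gz file in place
gunzip -c compress_demo.txt.gz | head -n 1

# tar bundles a directory into one archive: c = create, z = gzip, f = archive name
mkdir -p compress_demo_dir
cp compress_demo.txt compress_demo_dir/
tar -czf compress_demo_dir.tar.gz compress_demo_dir

# t lists the archive contents without extracting them
tar -tzf compress_demo_dir.tar.gz
```

Try the same commands without `-c` and compare which files exist afterwards.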

## 3.6. Connecting with other machines

Sometimes we need to connect to other machines and send files through the command line. Personal computers do not always have enough computational capacity to handle the data, but we can connect from our machines to [High-Performance Computing systems (HPC)](https://en.wikipedia.org/wiki/High-performance_computing) that can handle larger data and more memory-intensive processes. In that case, we will need to transfer large files between the local machine (yours) and the remote machine (the other), something that is not possible through email or simple cloud services.

`ssh` is the most common command for securely logging in to a remote computer and executing commands on that machine. It generally looks like this:
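A minimal sketch of the general form follows. The user name and host are hypothetical placeholders, and the commands are printed with `echo` rather than executed, since the placeholder host does not exist; remove the leading `echo` once you have real login details.

```shell
# Hypothetical account and host; replace with the details of your own server
REMOTE_USER="your_username"
REMOTE_HOST="hpc.example.org"

# General form: ssh user@host (printed here, not executed)
echo ssh "${REMOTE_USER}@${REMOTE_HOST}"

# A single command can also be run remotely without opening a session
echo ssh "${REMOTE_USER}@${REMOTE_HOST}" "ls -lh"
```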

Once you are connected, you can run commands just as if you were on your own machine, minding the folder structure of the remote machine you are connected to.

`scp` allows you to securely transfer files from your local machine to the remote computer and vice versa. You need to keep in mind where you are (on your local machine, or already logged into the remote machine?) and the source and destination paths for the files.

`scp` is good enough when files are small to medium in size and internet connections are robust. However, `scp` is not enough for transferring larger files over internet connections that might fail, or when you need to check the integrity of the file after transfer.

For that, it is better to use `rsync`. Moreover, if the files exist in both the local and remote machines, `rsync` only transfers the changes between file copies and skips the rest, making the transfer faster.
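The general shape of `scp` and `rsync` commands can be sketched as below. The remote account, host, directory, and the file `results.txt` are hypothetical placeholders, and the commands are printed with `echo` instead of being run; remove the leading `echo` to execute them against a real server.

```shell
# Hypothetical remote account and directory
REMOTE="your_username@hpc.example.org"
REMOTE_DIR="/home/your_username/project"

# Local -> remote: source first, destination second
echo scp spiders.fasta "${REMOTE}:${REMOTE_DIR}/"

# Remote -> local: the destination "." is the current local directory
echo scp "${REMOTE}:${REMOTE_DIR}/results.txt" .

# rsync: -a keeps file attributes, -v is verbose, -z compresses in transit,
# --partial keeps partially transferred files so a failed transfer can resume
echo rsync -avz --partial data/ "${REMOTE}:${REMOTE_DIR}/data/"
```

All three commands are meant to be run from the local machine.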

---

# [**4.  AWK**](#awk)  

AWK is a programming language designed to process text efficiently, particularly tabular text. AWK takes as inputs a pattern, an action, and a file or text from the *standard input* ([stdin](https://en.wikipedia.org/wiki/Standard_streams)) to operate upon. The pattern is a regular expression or condition that selects lines, for example a text match or a test on a field of the input file. The action is a command that will be executed on each matching line.

AWK has pre-defined variables that can be used in patterns and actions. The most common ones refer to the entire line (`$0`) or to a particular field (think column), e.g. `$1` and `$2` for the first and second fields/columns of the file. This is very similar to what the command `cut` does. Because AWK splits each line into fields, you can set the field delimiter with `-F`.

We will apply AWK commands to `Zelote_coordinates.csv`, a tabular text file that contains geographic coordinates for the spider genus [*Zelotes*](https://en.wikipedia.org/wiki/Zelotes_subterraneus). The data was downloaded on July 26th, 2023 from GBIF and the DOI associated with the search is [here](https://doi.org/10.15468/dl.4ke7tf).
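The pattern, action, and field variables can be tried on a two-line sample. In this sketch, `awk_demo.tsv` is a hypothetical tab-separated file and the numbers in its third column are invented.

```shell
# Two tab-separated lines: genus, epithet, an invented number
printf 'Zelotes\telectus\t3\nZelotes\tlatreillei\t5\n' > awk_demo.tsv

# Action only: print the whole line ($0) for every line
awk -F '\t' '{ print $0 }' awk_demo.tsv

# Print fields 1 and 2; the comma joins them with a single space
awk -F '\t' '{ print $1, $2 }' awk_demo.tsv

# Pattern + action: only lines whose second field starts with "lat"
awk -F '\t' '$2 ~ /^lat/ { print $2 }' awk_demo.tsv

# The cut equivalent of selecting fields 1 and 2 (tab is cut's default delimiter)
cut -f1,2 awk_demo.tsv
```

The third command prints only `latreillei`, which shows how the pattern restricts the lines on which the action runs.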

First, use `head`, `cut`, or other commands that you have learnt so far to explore the data within the file.

<div class="alert alert-block alert-success">
    <b>TO DO: </b>How many lines are in the file (each line represents a coordinate record)? How many different species are there in the file? Hint: you can use a combination of <i>head</i> and <i>cut</i> to figure out which field corresponds to the species epithet. Append your answer to <i>answersL0_name.txt</i> using <i>>></i>.</div>

## 4.1. Basic filtering


Now that you understand the structure of the file, the header and names of the columns, and the kind of information that it has, we will use AWK to extract information and clean records with flagged coordinates. It is possible to carry out the same actions on the file using e.g. Excel; however, there is a limit to the size of files that Excel can process, and command-line programs/languages like AWK are faster and more efficient, allowing you to process files that are [gigabytes](https://en.wikipedia.org/wiki/File_size) in size.

First, we will filter out the columns that contain information that is not relevant for us right now. We can use AWK's `print` and field variables to do it. Then, we can add another filter to only print the lines with non-empty `decimalLatitude` fields, i.e. printing records that are not missing the geographic coordinates.
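The two filters described above can be sketched on a small stand-in table. Everything in this example is an assumption made for illustration: the file `coords_demo.tsv`, its column order, the coordinate values, and the helper name `select_with_coords`. The column positions in the real GBIF file are different and should be checked with `head` first.

```shell
# Hypothetical 4-column table; the second record has no coordinates
printf 'gbifID\tspecies\tdecimalLatitude\tdecimalLongitude\n' > coords_demo.tsv
printf '1\tZelotes electus\t55.1\t23.9\n' >> coords_demo.tsv
printf '2\tZelotes latreillei\t\t\n' >> coords_demo.tsv
printf '3\tZelotes electus\t54.7\t25.3\n' >> coords_demo.tsv

# Keep the header (NR == 1) or any line with a non-empty third field,
# and print only columns 2-4, tab-separated (OFS sets the output separator)
select_with_coords() {
  awk -F '\t' -v OFS='\t' 'NR == 1 || $3 != "" { print $2, $3, $4 }' "$1"
}

select_with_coords coords_demo.tsv
```

The record without coordinates is dropped, leaving the header and two records.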

## 4.2. Operations

It is possible to carry out mathematical operations using AWK. The *individualCount* column of `Zelote_coordinates_sel.csv` contains the number of spider observations per record. It is common for specimens to be recorded as counts instead of creating a single record for each individual with the same metadata. Thus, counting lines as records is not enough to count the total number of individuals per species that have been observed. But we can do that with AWK.

AWK has associative arrays, something like a dictionary where the word is a key and the "meaning" is the value. Assuming that we will sum up the individual counts across all the records within a species, then the value of the species column is the key and the individual counts will be the values.

First, familiarise yourself with the *species* and *individualCount* columns, then you can do the sums.

In [35]:
%%bash

# we need to skip the header line by keeping only records with a record number (NR) greater than one --> NR>1
# && means AND, so a line is kept only if species is not empty AND individualCount is not empty
# the output passed through the second pipe only has two columns, species and individualCount
# the last part creates an array called sum, keyed by species, and adds the value in column 2 to the running total of that species

# once all lines are read, the END block iterates through the array sum and prints each species with its total

awk '(NR>1)' Zelote_coordinates_sel.csv \
| awk -F '\t' '{if ($4 != "" && $6 != "") printf"%s\t%s\n", $4, $6;}' \
| awk -F '\t' '{ sum[$1] += $2 } END { for (group in sum) print group, sum[group] }'

Zelotes anglo 246
Zelotes tuobus 27
Zelotes zhaoi 1
Zelotes occidentalis 60
Zelotes laccus 75
Zelotes funestus 2
Zelotes tenuis 163
Zelotes gertschi 87
Zelotes discens 74
Zelotes electus 3433
Zelotes mundus 1
Zelotes exiguoides 133
Zelotes scrutatus 0
Zelotes perditus 8
Zelotes anthereus 9
Zelotes asiaticus 2
Zelotes azsheganovae 178
Zelotes gynethus 19
Zelotes fulvopilosus 19
Zelotes anatolyi 1
Zelotes duplex 86
Zelotes moestus 5
Zelotes bajo 3
Zelotes latreillei 3975
Zelotes sarawakensis 7
Zelotes gabriel 23
Zelotes hentzi 388
Zelotes monachus 47
Zelotes captator 1
Zelotes egregioides 21
Zelotes segrex 10
Zelotes flagellans 16
Zelotes hermani 37
Zelotes kimwha 1
Zelotes similis 37
Zelotes pseudogallicus 39
Zelotes plumiger 4
Zelotes pullus 37
Zelotes hardwar 1
Zelotes laetus 5
Zelotes potani 5
Zelotes foresta 3
Zelotes sula 87
Zelotes spadix 1
Zelotes aurantiacus 25
Zelotes pinos 2
Zelotes subterraneus 3136
Zelotes nannodes 173
Zelotes clivicola 1290
Zelotes gallicus 42
Zelotes longi

The code above takes a tabular list of geographic records from GBIF, removes lines with no species or individual count information, and finally sums the individual counts per species.
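The arithmetic is easy to verify by hand on a tiny table. The sketch below folds the three steps into a single `awk` call; the file `counts_demo.tsv` and its values are invented for illustration, and `sort` is added because the order in which `awk` walks an array is not fixed.

```shell
# Invented two-column table: header, then species and individualCount
printf 'species\tindividualCount\n' > counts_demo.tsv
printf 'Zelotes electus\t2\nZelotes latreillei\t1\n' >> counts_demo.tsv
printf 'Zelotes electus\t3\nZelotes latreillei\t\n' >> counts_demo.tsv

# Skip the header, skip lines with an empty field, add each count to the
# running total of its species, then print all totals at the end
awk -F '\t' 'NR > 1 && $1 != "" && $2 != "" { sum[$1] += $2 }
     END { for (sp in sum) printf "%s\t%d\n", sp, sum[sp] }' counts_demo.tsv | sort
```

The totals are 5 for *Zelotes electus* (2 + 3) and 1 for *Zelotes latreillei*, whose second record has an empty count and is skipped.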

# [**5. Summary**](#summary) 

**Navigating the folders:**  
`cd`: Change directory  
`ls`: List files and directories  
`pwd`: Print working directory  

**File and folder manipulation:**  
`mkdir`: Create a new directory  
`touch`: Create a new file  
`cp`: Copy files or directories  
`mv`: Move or rename files or directories  
`rm`: Remove files or directories  
`grep`: Search for a specific pattern in files  
`find`: Search for files and directories  

**Permissions:**  
`chmod`: Change file permissions  
`chown`: Change file ownership  

**Redirecting information and piping:**  
`>`: Redirect output to a file (overwrite)  
`>>`: Redirect output to a file (append)  
`|`: Pipe output of one command as input to another  

**Exploring and editing files:**  
`cat`: Concatenate and display file content  
`less`: View file content with pagination  
`head`: Display the beginning of a file  
`tail`: Display the end of a file  
`nano`: Text editor for modifying files  

**Data Manipulation:**  
`awk`: Text processing and data extraction  
`sed`: Stream editor for modifying text  

**Process management:**  
`ps`: View running processes  
`kill`: Terminate a process  

**File compression:**  
`tar`: Archive files and directories  
`gzip` or `gunzip`: Compress or decompress files  

**Remote connections and transfers:**  
`ssh`: Securely connect to a remote server  
`scp`: Securely copy files between local and remote machines  
`rsync`: Securely and efficiently transfer files, more flexible  
