# Working with Alphanumeric Data in Unix

## **Finding 'Alien Genes' in the Plant Pathogen *Streptomyces scabies***

*Streptomyces scabies* is a plant pathogen that causes necrosis in potatoes. Many of its virulence factors are located in regions with low GC content. Additionally, virulence genes are highly expressed during interaction with plant roots. In this notebook, you will analyze a file named `aliens_in_scabies.txt`, which contains over 1000 genes. For each gene, the file provides the gene name, its expression level (growth in rich medium vs. interaction with plant roots), and its GC content.

We will use Unix commands to:
1. Sort the table by gene expression levels.
2. Create a new file with the top 10 highest expressed genes.
3. Sort the top 10 genes by their GC content.
4. Generate a new table with the gene names, replacing 'SCAB' with 'SCABIES'.
5. Perform statistical analysis of the GC content and expression levels (min, max, average, median).

Download the file from the GitHub repository, display its contents, and count the number of lines.

In [None]:
!wget https://raw.githubusercontent.com/joscarhuguet/Bioinfomatics-for-Pythopathologists/master/learning_unix_in_colab/aliens_in_scabies.txt
!cat aliens_in_scabies.txt
!wc -l aliens_in_scabies.txt

### Task 1: Sort the table by expression levels (second column)

To better understand which genes are highly expressed during the interaction with plant roots, we will sort the data based on the second column, which contains expression levels. The `sort` command will help us rank genes by their expression levels in descending order.

In [None]:
# Sorting by the second column (expression levels) in descending order
!sort -k2,2nr aliens_in_scabies.txt > sorted_by_expression.txt

# Viewing the sorted file
!cat sorted_by_expression.txt
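
The `n` in `-k2,2nr` matters because expression levels can be negative or span several digits. The sketch below contrasts a plain reverse sort with a numeric reverse sort. It uses a made-up five-gene file, `demo_genes.txt`, whose names and values are illustrative only; it assumes three tab-separated columns and no header line, which may differ from the real file. `LC_ALL=C` is set only to make the ordering independent of the local language settings.

```shell
# Hypothetical sample: gene name, expression level, GC content (tab-separated)
printf 'SCAB001\t-5.2\t71.3\nSCAB002\t12.8\t58.4\nSCAB003\t3.1\t69.9\nSCAB004\t100.5\t62.0\nSCAB005\t7.4\t74.8\n' > demo_genes.txt

# Reverse sort on column 2 as text: values are compared character by character
LC_ALL=C sort -k2,2r demo_genes.txt > demo_lexical.txt

# Reverse sort on column 2 as numbers: -5.2 < 3.1 < 7.4 < 12.8 < 100.5
LC_ALL=C sort -k2,2nr demo_genes.txt > demo_numeric.txt

echo "Text sort:";    cat demo_lexical.txt
echo "Numeric sort:"; cat demo_numeric.txt
```

In a notebook, these lines could be run in a cell that starts with the `%%bash` magic, or one at a time with the `!` prefix.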

### Task 2: Extract the top 10 highest expressed genes

Next, we will extract the top 10 genes with the highest expression levels using the `head` command. This will allow us to focus on the most significant genes for further analysis.

In [None]:
# Extracting the top 10 highest expressed genes
!head -n 10 sorted_by_expression.txt > top_10_genes.txt

# Viewing the top 10 highest expressed genes
!cat top_10_genes.txt


### Task 3: Sort the top 10 genes by GC content (third column)


In [None]:
# Sorting the top 10 genes by the third column (GC content) in descending order
!sort -k3,3nr top_10_genes.txt > top_10_sorted_by_gc.txt

# Viewing the top 10 genes sorted by GC content
!cat top_10_sorted_by_gc.txt

### Task 4: Replace 'SCAB' with 'SCABIES' in gene names


In [None]:
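# Replacing "SCAB" with "SCABIES" in gene names (cut -f1 assumes tab-separated columns;
# the sed match is case-sensitive, so adjust the pattern if the file uses lowercase names)
!cut -f1 aliens_in_scabies.txt | sed 's/SCAB/SCABIES/g' > genes_replaced.txt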

# Viewing the gene names with "SCABIES"
!cat genes_replaced.txt
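
The command above keeps only the gene names. If the goal is a full table with renamed genes, one possible approach is to edit the first column in place with `awk` and print the whole row. This is a minimal sketch on a made-up file, `demo_genes.txt` (hypothetical names and values, tab-separated, no header); the pattern `^SCAB` assumes uppercase names that begin with `SCAB`.

```shell
# Hypothetical sample: gene name, expression level, GC content (tab-separated)
printf 'SCAB001\t-5.2\t71.3\nSCAB002\t12.8\t58.4\nSCAB003\t3.1\t69.9\n' > demo_genes.txt

# Replace the leading "SCAB" in column 1 only; the other columns are untouched.
# Setting FS and OFS to a tab keeps the output tab-separated after $1 is modified.
awk 'BEGIN { FS = OFS = "\t" } { sub(/^SCAB/, "SCABIES", $1); print }' demo_genes.txt > demo_renamed.txt

cat demo_renamed.txt
```

Anchoring the pattern with `^` restricts the replacement to the start of the name, so a stray `SCAB` elsewhere in a name would be left alone.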

### Task 5: Calculate basic statistics for expression levels and GC content

### Expression Levels Statistics

In [None]:
# Calculating min, max, and average of expression levels
!awk '{sum+=$2; count+=1; if(min==""){min=max=$2}; if($2>max){max=$2}; if($2<min){min=$2}} END {print "Min expression:", min; print "Max expression:", max; print "Average expression:", sum/count}' aliens_in_scabies.txt
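
The task list also mentions the median, which the `awk` one-liner above does not compute. One way to get it is to sort the values numerically first and then pick the middle one (or the mean of the two middle ones when the count is even). The sketch below does this for column 2 of a made-up file, `demo_genes.txt`; the names and values are hypothetical, and the file is assumed to be tab-separated with no header line.

```shell
# Hypothetical sample: gene name, expression level, GC content (tab-separated)
printf 'SCAB001\t-5.2\t71.3\nSCAB002\t12.8\t58.4\nSCAB003\t3.1\t69.9\nSCAB004\t100.5\t62.0\nSCAB005\t7.4\t74.8\n' > demo_genes.txt

# Sort rows by column 2 (ascending, numeric), store the values, pick the middle
LC_ALL=C sort -k2,2n demo_genes.txt |
  awk '{ v[NR] = $2 }
       END {
         if (NR == 0) exit                           # empty input: print nothing
         if (NR % 2) print v[(NR + 1) / 2]           # odd count: middle value
         else print (v[NR / 2] + v[NR / 2 + 1]) / 2  # even count: mean of the two middle values
       }' > demo_median.txt

echo "Median expression: $(cat demo_median.txt)"
```

Changing `-k2,2n` to `-k3,3n` and `$2` to `$3` would give the median GC content in the same way.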


### GC Content Statistics

Next, let's look at GC content across all genes in the file. We'll use the `awk` command to calculate the same basic statistics as before: minimum, maximum, and average GC content.

In [None]:
# Calculating min, max, and average of GC content
!awk '{sum+=$3; count+=1; if(min==""){min=max=$3}; if($3>max){max=$3}; if($3<min){min=$3}} END {print "Min GC content:", min; print "Max GC content:", max; print "Average GC content:", sum/count}' aliens_in_scabies.txt


### Task 6: Advanced Statistics - Range-Based Grouping

Now let's explore some more advanced statistical analysis. We'll group the genes based on ranges of expression levels and GC content. This will allow us to identify genes that belong to specific ranges, such as those with GC content between 60% and 70% or expression levels between -10 and 10.

In [None]:
# Find genes with GC content between 60% and 70%
!awk '$3 >= 60 && $3 <= 70' aliens_in_scabies.txt

In [None]:
# Find genes with expression levels between -10 and 10
!awk '$2 >= -10 && $2 <= 10' aliens_in_scabies.txt

In [None]:
# Find genes whose names start with "SCAB2" and have a positive expression level
!awk '$1 ~ /^SCAB2/ && $2 > 0' aliens_in_scabies.txt

In [None]:
# List genes with both high expression (> 10) and high GC content (> 70%)
!awk '$2 > 10 && $3 > 70' aliens_in_scabies.txt
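
The filters above each select one range at a time. To see how genes are distributed across all ranges at once, the values can be grouped into fixed-width bins and counted. The following is a minimal sketch that bins GC content into 10-point intervals; it runs on a made-up file, `demo_genes.txt` (hypothetical names and values, tab-separated, no header), and the bin width of 10 is an arbitrary choice for illustration.

```shell
# Hypothetical sample: gene name, expression level, GC content (tab-separated)
printf 'SCAB001\t-5.2\t71.3\nSCAB002\t12.8\t58.4\nSCAB003\t3.1\t69.9\nSCAB004\t100.5\t62.0\nSCAB005\t7.4\t74.8\n' > demo_genes.txt

# Map each GC value to the lower edge of its bin (e.g. 69.9 -> 60), count per bin,
# then sort the bins numerically because awk's "for (b in count)" order is unspecified.
awk '{ bin = int($3 / 10) * 10; count[bin]++ }
     END { for (b in count) printf "%d-%d\t%d\n", b, b + 10, count[b] }' demo_genes.txt |
  LC_ALL=C sort -n > demo_gc_bins.txt

cat demo_gc_bins.txt
```

Each bin includes its lower edge and excludes its upper edge, so a gene with a GC content of exactly 70 would be counted in `70-80`, unlike the inclusive `>= 60 && <= 70` filter used earlier.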