# Working with alphanumeric data in Unix

## **Finding "Alien Genes" in the Plant Pathogen Streptomyces scabies**

*Streptomyces scabies* is a plant pathogen that causes necrosis in potatoes. Most of its virulence factors are located in regions with low GC content, and virulence genes are highly expressed during the interaction with plant roots. The file "aliens_in_scabies" contains a table of more than 1000 genes with three columns: the gene name (first column), the change in expression level when comparing growth in rich medium versus interaction with roots (second column), and the GC content of the gene sequence (third column).

This analysis will focus on using Unix commands to:
1. Sort the table by gene expression levels.
2. Create a new file with the top 10 highest expressed genes.
3. Sort the top 10 genes by their GC content.
4. Generate a new table with only the gene names, replacing "SCAB" with "SCABIES".
5. Perform statistical analysis of the GC content and expression levels (min, max, average, and median).


Task 1: Sort the table by expression levels (second column)

In [None]:
# Sorting by the second column (expression levels) in descending order
!sort -k2,2nr aliens_in_scabies > sorted_by_expression.txt

# Viewing the sorted file
!cat sorted_by_expression.txt
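
The `n` in `-k2,2nr` matters: without it, `sort` compares the second column as text rather than as numbers. The sketch below illustrates the difference on a small invented table (the file name `toy_sort.tsv` and all values are hypothetical, not taken from `aliens_in_scabies`). In a notebook, these lines can be run in a `%%bash` cell.

```shell
# Hypothetical toy table (gene, expression change, GC %); all values are invented
printf 'geneA\t9.5\t70.1\ngeneB\t10.2\t65.3\ngeneC\t-2.0\t58.8\n' > toy_sort.tsv

# Numeric, descending sort on column 2 only (LC_ALL=C pins "." as the decimal point)
numeric_top=$(LC_ALL=C sort -k2,2nr toy_sort.tsv | head -n 1 | cut -f1)

# Same key without "n": values are compared as text, character by character
lexical_top=$(LC_ALL=C sort -k2,2r toy_sort.tsv | head -n 1 | cut -f1)

echo "numeric sort puts first: $numeric_top"
echo "text sort puts first:    $lexical_top"
```

With the numeric key, the gene with the value 10.2 comes first; with the text key, "9.5" sorts ahead of "10.2" because the character "9" is greater than "1".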

Task 2: Extract the top 10 highest expressed genes

In [None]:
# Extracting the top 10 highest expressed genes
!head -n 10 sorted_by_expression.txt > top_10_genes.txt

# Viewing the top 10 highest expressed genes
!cat top_10_genes.txt


Task 3: Sort the top 10 genes by GC content (third column)


In [None]:
# Sorting the top 10 genes by the third column (GC content) in descending order
!sort -k3,3nr top_10_genes.txt > top_10_sorted_by_gc.txt

# Viewing the top 10 genes sorted by GC content
!cat top_10_sorted_by_gc.txt
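
Tasks 1 to 3 can also be chained into a single pipeline when the intermediate files are not needed. This is a minimal sketch on an invented four-gene table (`toy_top.tsv` is a hypothetical file, and it keeps the top 2 rows instead of the top 10 so the result is easy to check by eye). In a notebook, it can be run in a `%%bash` cell.

```shell
# Hypothetical toy table (gene, expression change, GC %); all values are invented
printf 'geneA\t9.5\t70.1\ngeneB\t10.2\t65.3\ngeneC\t-2.0\t58.8\ngeneD\t3.3\t61.0\n' > toy_top.tsv

# Sort by expression (descending), keep the top 2 rows,
# then re-sort only those rows by GC content (descending)
top_by_gc=$(LC_ALL=C sort -k2,2nr toy_top.tsv | head -n 2 | LC_ALL=C sort -k3,3nr)

echo "$top_by_gc"
```

For the real table, `head -n 2` would become `head -n 10` and `toy_top.tsv` would become `aliens_in_scabies`.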

Task 4: Replace 'SCAB' with 'SCABIES' in gene names


In [None]:
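# Extracting the gene names (first column) and replacing "SCAB" with "SCABIES"
# sed is case-sensitive, so the pattern must match the case used in the file
!cut -f1 aliens_in_scabies | sed 's/SCAB/SCABIES/g' > genes_replaced.txt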

# Viewing the gene names with "SCABIES"
!cat genes_replaced.txt
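
Because `sed` patterns are case-sensitive, it is worth checking which names a substitution actually changes. The sketch below uses invented identifiers (`toy_names.txt` and the names in it are hypothetical; the real gene names may look different) and anchors the pattern with `^` so that only names beginning with uppercase `SCAB` are rewritten.

```shell
# Hypothetical gene identifiers; the real names in aliens_in_scabies may differ
printf 'SCAB1234\nSCAB5678\nscab9999\n' > toy_names.txt

# "^" anchors the match to the start of the line; lowercase "scab" is left untouched
renamed=$(sed 's/^SCAB/SCABIES/' toy_names.txt)

echo "$renamed"
```

Note that running the same substitution a second time on its own output would change `SCABIES1234` again, since that name still begins with `SCAB`, so the command should be applied to the original names only.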

Task 5: Calculate basic statistics for expression levels and GC content

Expression Levels Statistics

In [None]:
# Calculating min, max, and average of expression levels
!awk '{sum+=$2; count+=1; if(min==""){min=max=$2}; if($2>max){max=$2}; if($2<min){min=$2}} END {print "Min expression:", min; print "Max expression:", max; print "Average expression:", sum/count}' aliens_in_scabies


GC Content Statistics


In [None]:
# Calculating min, max, and average of GC content
!awk '{sum+=$3; count+=1; if(min==""){min=max=$3}; if($3>max){max=$3}; if($3<min){min=$3}} END {print "Min GC content:", min; print "Max GC content:", max; print "Average GC content:", sum/count}' aliens_in_scabies
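
The task list also asks for the median, which the cells above do not compute. One possible approach, sketched here on an invented table (`toy_stats.tsv` and its values are hypothetical), is to sort the column numerically and pick the middle value, averaging the two middle values when the number of rows is even. In a notebook, it can be run in a `%%bash` cell.

```shell
# Hypothetical toy table (gene, expression change, GC %); all values are invented
printf 'g1\t5.0\t71.2\ng2\t-3.0\t65.0\ng3\t12.5\t58.4\ng4\t0.5\t69.9\n' > toy_stats.tsv

# Median of column 2: extract it, sort numerically, then read the middle of the sorted list
median_expr=$(cut -f2 toy_stats.tsv | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }                      # store the sorted values by position
    END {
        if (NR % 2) {                   # odd count: the single middle value
            m = v[(NR + 1) / 2]
        } else {                        # even count: mean of the two middle values
            m = (v[NR / 2] + v[NR / 2 + 1]) / 2
        }
        print m
    }')

echo "Median expression: $median_expr"
```

Replacing `cut -f2` with `cut -f3` gives the median GC content in the same way.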


**Slightly more complex statistics**

Range-Based Grouping (Numerical Tasks)

1. Group genes by expression level or GC content ranges: create bins based on ranges of expression levels or GC content (e.g., genes with GC content between 50% and 60%, or expression levels between -10 and 10).
2. Find extreme values: identify genes with extreme GC content or expression levels (e.g., the top 1%, the lowest 1%, or a specific percentile).

In [None]:
# Find genes with GC content between 60% and 70%
!awk '$3 >= 60 && $3 <= 70' aliens_in_scabies

In [None]:
# Find genes with expression levels between -10 and 10
!awk '$2 >= -10 && $2 <= 10' aliens_in_scabies
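
The range filters above select one range at a time. To count genes in every range at once, each value can be mapped to a bin. The following is a sketch under assumptions: the table `toy_gc.tsv` is invented, and the bin width of 10 percentage points is an arbitrary choice.

```shell
# Hypothetical toy table (gene, expression change, GC %); all values are invented
printf 'g1\t1.0\t55.2\ng2\t2.0\t61.7\ng3\t3.0\t68.9\ng4\t4.0\t72.3\ng5\t5.0\t64.0\n' > toy_gc.tsv

# Map each GC value to the lower edge of its 10-point bin (61.7 -> 60) and count genes per bin
bin_counts=$(awk '
    { b = int($3 / 10) * 10; n[b]++ }
    END { for (b in n) printf "%d-%d\t%d\n", b, b + 10, n[b] }
' toy_gc.tsv | sort -n)

echo "$bin_counts"
```

`int()` truncates toward zero, so this binning is only suitable for non-negative values such as GC content; expression changes, which can be negative, would need a floor-style calculation instead.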

In [None]:
# Find genes whose names start with "SCAB2" and that have a positive expression level
# "^" anchors the match to the start of the name; awk patterns are case-sensitive
!awk '$1 ~ /^SCAB2/ && $2 > 0' aliens_in_scabies

In [None]:
# List genes with both a high expression level (> 10) and a high GC content (> 70%)
!awk '$2 > 10 && $3 > 70' aliens_in_scabies
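
Fixed thresholds such as `> 10` do not adapt to the data. The "extreme values" idea (for example, the top 1% of genes by expression) can be sketched by deriving the number of rows to keep from the table size. Everything here is illustrative: `toy_pct.tsv` is a generated ten-row table, and 20% is used instead of 1% so the result is visible on such a small file.

```shell
# Generate a hypothetical 10-row table (gene, expression change, GC %); values are invented
awk 'BEGIN { for (i = 1; i <= 10; i++) printf "g%d\t%d\t%d\n", i, i * 3 - 15, 60 + i }' > toy_pct.tsv

pct=20                                        # percentage of genes to keep (use 1 for the top 1%)
total=$(awk 'END { print NR }' toy_pct.tsv)   # number of rows in the table
k=$(( (total * pct + 99) / 100 ))             # rows to keep, rounded up to a whole row

# Sort by expression (descending) and keep the first k rows
top_rows=$(LC_ALL=C sort -k2,2nr toy_pct.tsv | head -n "$k")

echo "$top_rows"
```

Switching the sort key to `-k2,2n` (ascending) returns the lowest values instead, and `-k3,3nr` applies the same idea to GC content.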