# Unix Command Practice for Text Parsing and Formatting in Colab

In this notebook, we will practice using Unix commands to parse, manipulate, and format plain text files directly within Google Colab.
We will work with a pathogen dataset that includes the Genus, Species, and Disease (in parentheses).
You will learn to use powerful Unix commands like `sort`, `cut`, `uniq`, `grep`, and `sed` to process the data and generate meaningful reports.


In [None]:
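# Step 1: Download the file and view its contents to understand its structure
# (-O sets the output name, so re-running this cell overwrites the file instead of saving pathogens.txt.1)
!wget -O pathogens.txt https://raw.githubusercontent.com/joscarhuguet/Bioinfomatics-for-Pythopathologists/master/learning_unix_in_colab/pathogens.txt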
!cat pathogens.txt


In [None]:
# Step 2: Sort the file alphabetically. sort compares whole lines, so entries are ordered by pathogen name
!sort pathogens.txt > sorted_pathogens.txt
# View the sorted text
!cat sorted_pathogens.txt
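
Because `sort` compares whole lines by default, ordering by a different field needs the `-k` option. The sketch below is a minimal illustration on three sample lines written in the same `Genus species (Disease)` layout; the lines are an assumption for demonstration and are not taken from `pathogens.txt`. In Colab it could be run in a `%%bash` cell.

```shell
# Illustrative sample lines (an assumption, not the contents of pathogens.txt)
sample=$(mktemp)
printf '%s\n' \
  'Xanthomonas citri (Citrus canker)' \
  'Erysiphe necator (Powdery mildew)' \
  'Fusarium oxysporum (Fusarium wilt)' > "$sample"

# Default: whole-line comparison, so the genus decides the order
by_genus=$(sort "$sample" | head -n 1)

# -k 2,2 restricts the sort key to the second whitespace-separated field (species)
by_species=$(sort -k 2,2 "$sample" | head -n 1)

echo "first by genus:   $by_genus"
echo "first by species: $by_species"
```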

In [None]:
# Step 3: Use `cut` to find out which separator the fields use
# Check whether the fields are separated by tabs or by spaces. cut splits on tabs by default.
# The symbol "|" (pipe) sends the output of one command to the next one, here head,
# which displays only the first 5 lines to save space
!cut -f 1 sorted_pathogens.txt | head -n 5

Note that the fields were not separated: `cut` prints each line unchanged when it finds no delimiter, so the tab is not the separator in this file.

In [None]:
# Check whether the space is the separator
!cut -d ' ' -f 1 sorted_pathogens.txt | head -n 5

Note that now you obtain the first column (field), which shows that the separator in this file is the space.
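
The behavior of `cut` with a missing delimiter can be checked on a single line. This is a minimal sketch using one made-up sample line in the same layout (an assumption, not a line copied from the file); the `-s` option, which is not used elsewhere in this notebook, tells `cut` to drop lines that lack the delimiter instead of printing them whole.

```shell
# Illustrative sample line (an assumption, not taken from pathogens.txt)
line='Erysiphe necator (Powdery mildew)'

# Default delimiter is TAB; the line has none, so cut prints it unchanged
tab_result=$(printf '%s\n' "$line" | cut -f 1)

# With -s, lines that lack the delimiter are suppressed instead
tab_strict=$(printf '%s\n' "$line" | cut -s -f 1)

# With a space delimiter, field 1 is the genus
space_result=$(printf '%s\n' "$line" | cut -d ' ' -f 1)

echo "tab:         [$tab_result]"
echo "tab with -s: [$tab_strict]"
echo "space:       [$space_result]"
```

An empty result with `-s` is a quick sign that the delimiter guess was wrong.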

In [None]:
# Step 4: Use sed to replace the opening parenthesis "(" with a tab so we can divide the file into
# two fields (columns): pathogen and disease.
# A second sed deletes the closing parenthesis ")".
# Then we format the file with headers, using echo to write the header line and '>' to save the output.


# Replace " (" with a tab; the space before the parenthesis is dropped too,
# so the pathogen field has no trailing space. Then delete ")".
!sed 's/ *(/\t/g' sorted_pathogens.txt | sed 's/)//g' | cut -f 1,2 > tab_separated_pathogens.txt

#Add headers using echo
!echo -e "Pathogen\tDisease" > formatted_report_with_headers.txt

#Append the formatted data to the file
!cat tab_separated_pathogens.txt >> formatted_report_with_headers.txt

#View the formatted report
!cat formatted_report_with_headers.txt
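
Tabs and spaces look alike on screen, so it can help to confirm that a conversion like this really produced two tab-separated fields. The following is a minimal sketch on one made-up sample line (an assumption, not a line from the file). It builds the tab with `printf` rather than writing `\t` inside `sed`, which is a design choice for portability and not the method used in the cell above.

```shell
# Illustrative sample line (an assumption, not taken from pathogens.txt)
line='Erysiphe necator (Powdery mildew)'

# A literal tab character; a portable alternative to \t inside sed
tab=$(printf '\t')

# Replace " (" with a tab and delete ")"
converted=$(printf '%s\n' "$line" | sed -e "s/ *(/$tab/" -e 's/)//')

# Count the tab-separated fields and extract each one
n_fields=$(printf '%s\n' "$converted" | awk -F '\t' '{print NF}')
pathogen=$(printf '%s\n' "$converted" | cut -f 1)
disease=$(printf '%s\n' "$converted" | cut -f 2)

echo "fields: $n_fields | pathogen: [$pathogen] | disease: [$disease]"
```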

In [None]:
# Step 5: Filter diseases related to a specific keyword using `grep` (e.g., "mildew")
!grep "mildew" formatted_report_with_headers.txt
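
`grep` is case-sensitive by default, so a search for "mildew" skips a line that spells it "Mildew". A minimal sketch of the difference, using a small made-up report (the rows and the capitalized spelling are assumptions for illustration only):

```shell
# Illustrative two-column report (an assumption, not the real report file)
report=$(mktemp)
{
  printf 'Pathogen\tDisease\n'
  printf 'Erysiphe necator\tPowdery mildew\n'
  printf 'Plasmopara viticola\tDowny Mildew\n'
  printf 'Xanthomonas citri\tCitrus canker\n'
} > "$report"

# Case-sensitive match: only the lowercase spelling is counted
exact=$(grep -c 'mildew' "$report")

# -i ignores case, so both spellings are counted
any_case=$(grep -ci 'mildew' "$report")

echo "case-sensitive: $exact, case-insensitive: $any_case"
```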


## **Exercise 1: Count the Number of Pathogens for Each Disease Type**

In [None]:

# Task: Count the number of pathogens associated with a specific disease (e.g., "wilt")

# Step 1: Use grep to find pathogens related to "wilt"
!grep "wilt" formatted_report_with_headers.txt

# Step 2: Count the number of lines that contain "wilt"
!grep -c "wilt" formatted_report_with_headers.txt
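
The `grep -c` approach counts one keyword at a time. To count every disease type in a single pass, one option is to combine `cut`, `sort`, and `uniq -c`. The sketch below assumes a tab-separated report with a header line, like the one built earlier, but runs on a small made-up sample rather than the real file.

```shell
# Illustrative two-column report (an assumption, not the real report file)
report=$(mktemp)
{
  printf 'Pathogen\tDisease\n'
  printf 'Erwinia tracheiphila\tBacterial wilt\n'
  printf 'Fusarium oxysporum\tFusarium wilt\n'
  printf 'Ralstonia solanacearum\tBacterial wilt\n'
  printf 'Erysiphe necator\tPowdery mildew\n'
} > "$report"

# Skip the header (tail -n +2), keep the disease column (cut -f 2),
# group identical values (sort), count them (uniq -c), largest count first (sort -rn)
counts=$(tail -n +2 "$report" | cut -f 2 | sort | uniq -c | sort -rn)
echo "$counts"

# For comparison: number of lines containing the keyword "wilt"
wilt_lines=$(grep -c 'wilt' "$report")
echo "lines containing wilt: $wilt_lines"
```

Note that `grep -c` counts matching lines, so here it adds up both kinds of wilt, while `uniq -c` reports each exact disease name separately.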


## **Exercise 2: Find and Extract Specific Genus Using `grep` and `awk`**

In [None]:

# Task: Extract all entries related to a specific genus (e.g., "Xanthomonas")

# Step 1: Find all pathogens related to "Xanthomonas"
!echo "result with grep"
!grep "^Xanthomonas" formatted_report_with_headers.txt
!echo "..............................................."
!echo "result with awk"
!awk '$1 == "Xanthomonas"' formatted_report_with_headers.txt
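
The two commands are not always equivalent: `grep "^Name"` matches any line that starts with those letters, while `awk '$1 == "Name"'` requires the whole first field to be equal. The sketch below illustrates this with two genera where one name is a prefix of the other. The sample rows are an assumption chosen for illustration and are not taken from the report file.

```shell
# Illustrative sample where one genus name is a prefix of another (an assumption)
sample=$(mktemp)
{
  printf 'Puccinia graminis\tStem rust\n'
  printf 'Pucciniastrum americanum\tLate leaf rust\n'
} > "$sample"

# Prefix match: counts both Puccinia and Pucciniastrum
prefix_hits=$(grep -c '^Puccinia' "$sample")

# Exact comparison of the first whitespace-separated field
exact_hits=$(awk '$1 == "Puccinia" {n++} END {print n+0}' "$sample")

# grep can be made stricter by including the space that follows the genus
strict_grep_hits=$(grep -c '^Puccinia ' "$sample")

echo "grep prefix: $prefix_hits, awk exact: $exact_hits, grep with space: $strict_grep_hits"
```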

## **Exercise 3: Reformat the Sorted Pathogen List into CSV Format**

We will reformat the data so that Genus, Species, and Disease are separated by commas using `sed`. CSV format is commonly used in data science applications.

In [None]:

# Task: Reformat the sorted pathogen list into CSV format

# Step 1: Use sed to output Genus, Species, and Disease in CSV format
# " (" becomes the comma before the disease, ")" is deleted, and the first remaining
# space (between genus and species) becomes a comma

!sed -e 's/ (/,/g' -e 's/)//g' -e 's/ /,/' sorted_pathogens.txt > pathogens.csv

# Step 2: View the CSV-formatted data

!cat pathogens.csv
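
For comparison, the same three-column CSV can be produced with `awk`. This is a minimal sketch, assuming every line has the form `Genus species (Disease)` with a single-word species; the sample line is made up for illustration.

```shell
# Illustrative sample line (an assumption, not taken from sorted_pathogens.txt)
line='Xanthomonas citri (Citrus canker)'

# $1 is the genus and $2 the species; the disease is whatever sits inside the parentheses
csv=$(printf '%s\n' "$line" | awk '{
  disease = $0
  sub(/^[^(]*\(/, "", disease)   # drop everything up to and including "("
  sub(/\)$/, "", disease)        # drop the closing ")"
  print $1 "," $2 "," disease
}')

# Count the comma-separated fields to confirm there are three columns
n_fields=$(printf '%s\n' "$csv" | awk -F ',' '{print NF}')

echo "$csv ($n_fields fields)"
```

A disease name that itself contains a comma would need quoting, which this sketch does not handle.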




## **Exercise 4: Extract Unique Genera Using `cut` and `sort`**

In [None]:

# Task: Extract and list all unique genera from the dataset

# Step 1: Use cut to extract the first field (Genus) from the file
!cut -d ' ' -f 1 sorted_pathogens.txt > genera.txt

# Step 2: Sort and remove duplicates to get the unique genera

!sort genera.txt | uniq > unique_genera.txt
# Count how many unique genera there are
!wc -l < unique_genera.txt

# Step 3: Show how many entries each genus has

!sort genera.txt | uniq -c
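
The `sort` step matters because `uniq` only collapses duplicates that sit on adjacent lines. A minimal sketch of the difference, using a made-up, deliberately unsorted list of genera (an assumption for illustration); `sort -u` is shown as a shorter equivalent of `sort | uniq`.

```shell
# Illustrative unsorted list with repeated genera (an assumption)
genera=$(mktemp)
printf '%s\n' Xanthomonas Fusarium Xanthomonas Erysiphe Fusarium > "$genera"

# uniq alone: no duplicates are adjacent, so nothing is removed
unsorted_count=$(uniq "$genera" | awk 'END {print NR}')

# sort first, then uniq: duplicates become adjacent and are collapsed
sorted_count=$(sort "$genera" | uniq | awk 'END {print NR}')

# sort -u does both steps in one command
short_count=$(sort -u "$genera" | awk 'END {print NR}')

echo "uniq only: $unsorted_count, sort | uniq: $sorted_count, sort -u: $short_count"
```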
