# File Input and Output in Python

In bioinformatics and computational biology, handling large datasets is a crucial task. Many types of biological data, such as AMR gene profiles, whole-genome sequencing results, variant annotations, and expression datasets, are stored in files. These files can be CSV (Comma-Separated Values), FASTA (for sequences), TSV (Tab-Separated Values), or plain text files.

Python provides built-in tools and external libraries that allow us to efficiently read, process, and write these files. Rather than handling large datasets by hand, we can automate these operations in Python, which makes it easier to analyze thousands of genes, filter resistance markers, extract sequence motifs, or generate reports.

---

## **Why Work with Files?**
- Data in **CSV (Comma-Separated Values)** format is commonly used for **storing structured data** (e.g., AMR gene profiles, sequencing results).
- Instead of manually entering data, we can **read from a file** and process thousands of records automatically.
- File handling allows **storing results** in a structured way after analysis.

---

## **Key Operations in File Handling**
Python provides built-in functions to **read, process, and write files** efficiently:
- **Reading a file (`open()`, `read()`, `readline()`, `readlines()`)**
- **Processing data line by line**
- **Using `csv` and `pandas` for structured file handling**
- **Writing processed data back into a file (`write()`, `csv.writer()`, `DataFrame.to_csv()`)**
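
To make the difference between the reading functions listed above concrete, here is a minimal sketch. It writes a small throwaway file first so that it can run anywhere; the file name `gene_list.txt` and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "gene_list.txt")  # hypothetical file name
with open(demo_path, "w") as handle:
    handle.write("blaCTX-M\nmecA\ntetM\n")

# read(): the whole file as one string
with open(demo_path) as handle:
    whole_text = handle.read()

# readline(): one line per call, newline character included
with open(demo_path) as handle:
    first_line = handle.readline()

# readlines(): a list with one string per line
with open(demo_path) as handle:
    all_lines = handle.readlines()

# Looping over the file object reads one line at a time,
# which keeps memory use low for large files.
genes = []
with open(demo_path) as handle:
    for line in handle:
        genes.append(line.strip())

print(repr(whole_text))
print(repr(first_line))
print(all_lines)
print(genes)
```

For very large files, the line-by-line loop is usually the safer choice, because `read()` and `readlines()` load everything into memory at once.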

---

## **File Types Commonly Used in Bioinformatics**
| File Type | Description | Example Usage |
|-----------|------------|--------------|
| `.csv` | Comma-Separated Values | AMR gene datasets, metadata tables |
| `.txt` | Plain text files | FASTA headers, gene lists |
| `.fasta` | Sequence data format | DNA/protein sequences |
| `.tsv` | Tab-Separated Values | Expression data, genome annotations |
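
FASTA is the one format in this table that is not a simple table of rows and columns, so a small sketch may help. The function `read_fasta` and the two example records below are made up for illustration; for real projects a dedicated library such as Biopython is usually the better choice.

```python
import io


def read_fasta(handle):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the sequence lines collected for each record
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up example data; in practice, pass an open file instead of StringIO
fasta_text = ">seq1 example\nATGC\nGGTA\n>seq2\nTTAA\n"
sequences = read_fasta(io.StringIO(fasta_text))
print(sequences)
```

Each record starts with a `>` header line, and the sequence may be wrapped over several lines, which is why the lines are collected first and joined at the end.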

---




## **Reading an AMR Dataset (Example)**
```python
import csv
import pandas as pd

# One way to open and read a CSV file
with open("/home/analysis/Desktop/PHA4GE_Training_Materials/data/Synthetic_AMR_Dataset.csv", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # Prints each row as a list
```
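
If you prefer to access values by column name rather than by position, the standard library also offers `csv.DictReader`. The sketch below uses a few rows copied from the dataset shown in this notebook, held in a string so it runs without the file; how the missing MIC value is stored in the real file is an assumption.

```python
import csv
import io

# A few rows copied from the dataset printed in this notebook.
# The empty MIC field for Sample_3 is an assumption about how
# the missing value is stored in the file.
csv_text = (
    "Sample_ID,Gene,Resistance_Class,Detected,MIC_µg/mL\n"
    "Sample_1,blaCTX-M,Beta-lactam,True,4.0\n"
    "Sample_2,mecA,Methicillin,True,16.0\n"
    "Sample_3,tetM,Tetracycline,False,\n"
)

reader = csv.DictReader(io.StringIO(csv_text))
detected_genes = []
for row in reader:
    # Every value is read as a string, so compare against "True"
    if row["Detected"] == "True":
        detected_genes.append(row["Gene"])

print(detected_genes)
```

Note that the `csv` module never converts types for you: numbers and booleans arrive as text and have to be converted explicitly.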


In [5]:
# The following part is new.
# Here we import other modules into our cell, so that we can refer to them
# by short names (such as pd for pandas) in the following code.

# As always, there is an easier way to do it!

import csv
import pandas as pd


amr_df = pd.read_csv("/home/analysis/Desktop/PHA4GE_Training_Materials/data/Synthetic_AMR_Dataset.csv")

print(amr_df)

   Sample_ID      Gene Resistance_Class  Detected  MIC_µg/mL
0   Sample_1  blaCTX-M      Beta-lactam      True        4.0
1   Sample_2      mecA      Methicillin      True       16.0
2   Sample_3      tetM     Tetracycline     False        NaN
3   Sample_4     aadA1   Aminoglycoside      True        8.0
4   Sample_5      ermB        Macrolide     False        NaN
5   Sample_6    blaTEM      Beta-lactam      True        2.0
6   Sample_7      vanA       Vancomycin      True       32.0
7   Sample_8    blaOXA      Beta-lactam     False        NaN
8   Sample_9      mphA        Macrolide      True        1.0
9  Sample_10      sul1      Sulfonamide      True        2.0


In [6]:
# Can I print the DataFrame again without re-reading the CSV file?

# Let's try.

print(amr_df)

   Sample_ID      Gene Resistance_Class  Detected  MIC_µg/mL
0   Sample_1  blaCTX-M      Beta-lactam      True        4.0
1   Sample_2      mecA      Methicillin      True       16.0
2   Sample_3      tetM     Tetracycline     False        NaN
3   Sample_4     aadA1   Aminoglycoside      True        8.0
4   Sample_5      ermB        Macrolide     False        NaN
5   Sample_6    blaTEM      Beta-lactam      True        2.0
6   Sample_7      vanA       Vancomycin      True       32.0
7   Sample_8    blaOXA      Beta-lactam     False        NaN
8   Sample_9      mphA        Macrolide      True        1.0
9  Sample_10      sul1      Sulfonamide      True        2.0


In [7]:
# We've done it! So let's go deeper now.
print("First 5 rows of the dataset:")
print(amr_df.head())


First 5 rows of the dataset:
  Sample_ID      Gene Resistance_Class  Detected  MIC_µg/mL
0  Sample_1  blaCTX-M      Beta-lactam      True        4.0
1  Sample_2      mecA      Methicillin      True       16.0
2  Sample_3      tetM     Tetracycline     False        NaN
3  Sample_4     aadA1   Aminoglycoside      True        8.0
4  Sample_5      ermB        Macrolide     False        NaN


In [8]:
# Get the total number of rows and columns
num_rows, num_cols = amr_df.shape
print(f"\nTotal Rows: {num_rows}")
print(f"Total Columns: {num_cols}")


Total Rows: 10
Total Columns: 5


In [None]:
# CODE PLAYGROUND

In [9]:
# Display column names
print("\nColumn Names:")
print(amr_df.columns.tolist())


Column Names:
['Sample_ID', 'Gene', 'Resistance_Class', 'Detected', 'MIC_µg/mL']


In [None]:
# CODE PLAYGROUND

In [10]:
# Check data types of each column
print("\nData Types:")
print(amr_df.dtypes)


Data Types:
Sample_ID            object
Gene                 object
Resistance_Class     object
Detected               bool
MIC_µg/mL           float64
dtype: object


In [None]:
# CODE PLAYGROUND

In [11]:
# Check for missing values in each column
print("\nMissing Values Per Column:")
print(amr_df.isnull().sum())


Missing Values Per Column:
Sample_ID           0
Gene                0
Resistance_Class    0
Detected            0
MIC_µg/mL           3
dtype: int64


In [None]:
# CODE PLAYGROUND

In [12]:
# Display dataset summary
print("\nDataset Summary:")
print(amr_df.info())


Dataset Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Sample_ID         10 non-null     object 
 1   Gene              10 non-null     object 
 2   Resistance_Class  10 non-null     object 
 3   Detected          10 non-null     bool   
 4   MIC_µg/mL         7 non-null      float64
dtypes: bool(1), float64(1), object(3)
memory usage: 458.0+ bytes
None


In [None]:
# CODE PLAYGROUND

In [15]:
filtered_amr_df = amr_df[["Gene", "Detected"]]
print(filtered_amr_df)

       Gene  Detected
0  blaCTX-M      True
1      mecA      True
2      tetM     False
3     aadA1      True
4      ermB     False
5    blaTEM      True
6      vanA      True
7    blaOXA     False
8      mphA      True
9      sul1      True
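
Selecting columns is only half of filtering; rows can be selected with a boolean condition in much the same way. The sketch below uses a small stand-in DataFrame (`demo_df`, a name introduced here) built from rows shown in the output above, so it runs without the CSV file.

```python
import pandas as pd

# Small stand-in for amr_df, using rows shown in the output above
demo_df = pd.DataFrame({
    "Gene": ["blaCTX-M", "mecA", "tetM", "aadA1"],
    "Detected": [True, True, False, True],
    "MIC_µg/mL": [4.0, 16.0, None, 8.0],
})

# Keep only rows where the gene was detected
detected_df = demo_df[demo_df["Detected"]]

# Combine conditions with & and wrap each one in parentheses
high_mic_df = demo_df[(demo_df["Detected"]) & (demo_df["MIC_µg/mL"] >= 8)]

print(detected_df)
print(high_mic_df)
```

Rows with a missing MIC value never satisfy the `>= 8` comparison, so they drop out of the second result automatically.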


In [None]:
# To save this table to a file, we can use the following code.
# Be careful with the file path: the target folder must already exist.

filtered_amr_df.to_csv("/mnt/data/filtered_amr_data.csv", index=False)
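
For comparison, the same kind of table can be written without pandas by using `csv.writer` from the standard library. This is only a sketch: the rows are a hand-typed subset of the table above, and the file goes into a temporary folder so nothing on your system is overwritten.

```python
import csv
import os
import tempfile

rows = [
    ["Gene", "Detected"],
    ["blaCTX-M", True],
    ["mecA", True],
    ["tetM", False],
]

# Write into a temporary folder (illustrative location only)
out_path = os.path.join(tempfile.mkdtemp(), "filtered_amr_data.csv")

# newline="" lets the csv module control line endings itself
with open(out_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerows(rows)

# Read the file back to check what was written
with open(out_path, newline="") as handle:
    written = list(csv.reader(handle))

print(written)
```

Reading the file back shows that everything comes out as text, including the `True`/`False` values.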

In [None]:
# CODE PLAYGROUND

In [16]:
# Now let's look at another file type: VCF (Variant Call Format)

# Let's designate our target file_path
file_path = "data/test.vcf"

# Let's open it.
with open(file_path, "r") as file:
    for _ in range(10):  # and print first 10 lines
        print(file.readline().strip())
        
# What else can we do with this?

##fileformat=VCFv4.0
##fileDate=20220125
##source=lofreq call -f freyja/data/NC_045512_Hu-1.fasta -o test.vcf freyja/data/test.bam
##reference=freyja/data/NC_045512_Hu-1.fasta
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth">
##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency">
##INFO=<ID=SB,Number=1,Type=Integer,Description="Phred-scaled strand bias at this position">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=CONSVAR,Number=0,Type=Flag,Description="Indicates that the variant is a consensus variant (as opposed to a low frequency variant).">


### VCF File Inspection

The first 10 lines of the VCF file contain metadata and headers, confirming it follows the VCF format (v4.0).
These metadata lines (`##`) describe:

- The file format and source (`VCFv4.0`, `lofreq call`).
- The reference genome used (`NC_045512_Hu-1.fasta`).
- Various INFO fields, such as:
  - **DP** → Raw sequencing depth.
  - **AF** → Allele frequency.
  - **SB** → Phred-scaled strand bias.
  - **DP4** → Counts for ref-forward, ref-reverse, alt-forward, alt-reverse bases.
  - **INDEL** → Flag for insertion/deletion variants.
  - **CONSVAR** → Flag for consensus variants.
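
The INFO definitions can also be collected programmatically instead of being read by eye. The following is a rough sketch using a regular expression on a few header lines copied from the output above; the pattern assumes the `ID, Number, Type, Description` order seen in this file and is not a general VCF header parser.

```python
import re

# Header lines copied from the output above
header_lines = [
    '##fileformat=VCFv4.0',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth">',
    '##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency">',
    '##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">',
]

# Capture the ID, Type and quoted Description of each INFO line
info_pattern = re.compile(
    r'^##INFO=<ID=([^,]+),Number=[^,]+,Type=([^,]+),Description="(.*)">$'
)

info_fields = {}
for line in header_lines:
    match = info_pattern.match(line)
    if match:  # lines such as ##fileformat do not match and are skipped
        field_id, field_type, description = match.groups()
        info_fields[field_id] = (field_type, description)

for field_id, (field_type, description) in info_fields.items():
    print(f"{field_id}\t{field_type}\t{description}")
```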


In [19]:
import pandas as pd

# Define the file path
vcf_file_path = "data/test.vcf"  # Update this to match your file path

# Extract relevant variant data from the VCF file
vcf_data = []

# Open and read the VCF file line by line
with open(vcf_file_path, "r") as file:
    for line in file:
        if not line.startswith("#"):  # Skip metadata/header lines
            columns = line.strip().split("\t")
            chrom = columns[0]  # Chromosome
            pos = columns[1]  # Position
            ref = columns[3]  # Reference allele
            alt = columns[4]  # Alternate allele
            qual = columns[5]  # Quality score

            # Store extracted data
            vcf_data.append([chrom, pos, ref, alt, qual])

# Convert extracted data into a DataFrame
vcf_df = pd.DataFrame(vcf_data, columns=["Chromosome", "Position", "Reference", "Alternate", "Quality"])

# Save the extracted data as a CSV file
# vcf_df.to_csv("extracted_variants.csv", index=False)

# Print confirmation
# print("Extracted VCF data saved to 'extracted_variants.csv'")


In [26]:
vcf_data[3]

['NC_045512.2', '441', 'G', 'A', '105']

In [27]:
vcf_data[3:8]

[['NC_045512.2', '441', 'G', 'A', '105'],
 ['NC_045512.2', '1055', 'G', 'A', '1592'],
 ['NC_045512.2', '1191', 'C', 'T', '11391'],
 ['NC_045512.2', '1267', 'C', 'T', '9741'],
 ['NC_045512.2', '2184', 'A', 'C', '154']]

In [28]:
vcf_data[8:]

[['NC_045512.2', '2973', 'C', 'T', '1395'],
 ['NC_045512.2', '3037', 'C', 'T', '13537'],
 ['NC_045512.2', '4897', 'C', 'T', '159'],
 ['NC_045512.2', '5184', 'C', 'T', '3943'],
 ['NC_045512.2', '5457', 'C', 'T', '949'],
 ['NC_045512.2', '6996', 'T', 'C', '80'],
 ['NC_045512.2', '8782', 'C', 'T', '32957'],
 ['NC_045512.2', '8883', 'T', 'C', '72'],
 ['NC_045512.2', '9203', 'G', 'A', '10829'],
 ['NC_045512.2', '9678', 'T', 'C', '1593'],
 ['NC_045512.2', '11005', 'C', 'A', '4808'],
 ['NC_045512.2', '11414', 'C', 'T', '246'],
 ['NC_045512.2', '12235', 'A', 'C', '117'],
 ['NC_045512.2', '12583', 'T', 'A', '67'],
 ['NC_045512.2', '14256', 'A', 'C', '75'],
 ['NC_045512.2', '14408', 'C', 'T', '1357'],
 ['NC_045512.2', '16247', 'C', 'T', '99'],
 ['NC_045512.2', '16255', 'C', 'T', '77'],
 ['NC_045512.2', '17000', 'C', 'T', '115'],
 ['NC_045512.2', '17474', 'C', 'A', '106'],
 ['NC_045512.2', '17496', 'A', 'G', '6152'],
 ['NC_045512.2', '18060', 'C', 'T', '14317'],
 ['NC_045512.2', '18096', 'T', 'C'

In [30]:
vcf_df["Chromosome"]

0     NC_045512.2
1     NC_045512.2
2     NC_045512.2
3     NC_045512.2
4     NC_045512.2
         ...     
68    NC_045512.2
69    NC_045512.2
70    NC_045512.2
71    NC_045512.2
72    NC_045512.2
Name: Chromosome, Length: 73, dtype: object

In [31]:
vcf_df["Position"]

0       210
1       241
2       372
3       441
4      1055
      ...  
68    28881
69    29402
70    29512
71    29581
72    29742
Name: Position, Length: 73, dtype: object
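
Notice that `Position` has `dtype: object`: the values are still text, because `split("\t")` always produces strings. Numeric comparisons or sorting need real numbers first. One possible way to convert them is `pd.to_numeric`, sketched here on a small stand-in table (`demo_vcf_df`, a name introduced for this example) built from rows that appear in this notebook's output.

```python
import pandas as pd

# Stand-in for vcf_df with a few rows from this notebook's output;
# the values are strings, as produced by split("\t")
demo_vcf_df = pd.DataFrame(
    [
        ["NC_045512.2", "210", "G", "T", "48519"],
        ["NC_045512.2", "241", "C", "T", "49314"],
        ["NC_045512.2", "372", "A", "C", "110"],
    ],
    columns=["Chromosome", "Position", "Reference", "Alternate", "Quality"],
)

# Convert the text columns to numbers; errors="coerce" turns
# unparseable entries (e.g. "." for a missing QUAL) into NaN
demo_vcf_df["Position"] = pd.to_numeric(demo_vcf_df["Position"], errors="coerce")
demo_vcf_df["Quality"] = pd.to_numeric(demo_vcf_df["Quality"], errors="coerce")

print(demo_vcf_df.dtypes)
print(demo_vcf_df[demo_vcf_df["Quality"] >= 1000])
```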

In [37]:
vcf_df[["Reference", "Alternate"]]

   Reference Alternate
0          G         T
1          C         T
2          A         C
3          G         A
4          G         A
..       ...       ...
68         G         T
69         G         T
70         T         C
71         T         C
72         G         T

[73 rows x 2 columns]


In [38]:
# Selecting only Reference and Alternate columns
ref_alt_df = vcf_df[["Reference", "Alternate"]]

# Display the first few rows
print(ref_alt_df.head())

  Reference Alternate
0         G         T
1         C         T
2         A         C
3         G         A
4         G         A


In [40]:
# To compare the reference and alternate bases, we can keep only the rows where they differ.

mutations_df = vcf_df[vcf_df["Reference"] != vcf_df["Alternate"]]
print(mutations_df)

     Chromosome Position Reference Alternate Quality
0   NC_045512.2      210         G         T   48519
1   NC_045512.2      241         C         T   49314
2   NC_045512.2      372         A         C     110
3   NC_045512.2      441         G         A     105
4   NC_045512.2     1055         G         A    1592
..          ...      ...       ...       ...     ...
68  NC_045512.2    28881         G         T    9326
69  NC_045512.2    29402         G         T   30263
70  NC_045512.2    29512         T         C   44232
71  NC_045512.2    29581         T         C     227
72  NC_045512.2    29742         G         T    1184

[73 rows x 5 columns]


In [None]:
# To save the result for future analysis, use the code below.

mutations_df.to_csv("filtered_mutations.csv", index=False)
print("Filtered mutations saved to 'filtered_mutations.csv'")



### Why Use try-except in File Handling?

When working with VCF files or any large biological dataset, errors may occur due to:

- **File not found** → The file might not exist or be in the wrong directory.
- **Corrupted data** → The file may have missing or unexpected values.
- **Incorrect data type** → Parsing issues when converting strings to numbers.

Using `try`-`except`, we can catch these errors and provide custom error messages or fallback solutions.



In [None]:
try:
    with open("test.vcf", "r") as file:
        vcf_data = file.readlines()
    print("File loaded successfully!")
except FileNotFoundError:
    print("Error: The file 'test.vcf' was not found. Please check the file path.")


Here the code tries to open the file. If that succeeds, it prints the success message. Otherwise Python raises an exception (here a `FileNotFoundError`), execution jumps straight to the `except` block, and the error message is printed instead.
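
The same pattern extends to the other error types listed earlier (corrupted data and incorrect data types). The sketch below is one possible arrangement, not the only one: the helper name `load_qualities` and the file name `no_such_file.vcf` are made up for illustration, and the function simply gives up and returns an empty list on the first problem it meets.

```python
def load_qualities(path):
    """Return the QUAL column of a VCF file as floats, or [] on failure."""
    qualities = []
    try:
        with open(path, "r") as handle:
            for line in handle:
                if line.startswith("#"):
                    continue  # skip metadata/header lines
                columns = line.rstrip("\n").split("\t")
                qualities.append(float(columns[5]))
    except FileNotFoundError:
        print(f"Error: '{path}' was not found. Please check the file path.")
        return []
    except IndexError:
        print("Error: a line has fewer columns than expected.")
        return []
    except ValueError:
        print("Error: a quality value could not be converted to a number.")
        return []
    return qualities


result = load_qualities("no_such_file.vcf")  # hypothetical, non-existent path
print(result)
```

Listing each exception type separately lets every failure produce its own message, which makes it much easier to tell a wrong path from a damaged file.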

In [None]:
# CODE PLAYGROUND

In [41]:
# Let's try something else...

import pandas as pd

# Load the extracted VCF data. This file only exists if the to_csv() line
# in the extraction cell was uncommented and run.
vcf_df = pd.read_csv("extracted_variants.csv")

# Convert values safely: define a function that never raises on bad input
def safe_convert(value):
    try:
        return float(value)
    except (ValueError, TypeError):
        return None  # Return None if conversion fails

# Apply safe conversion to the 'Quality' column. The extracted file has no
# 'Allele_Frequency' column, so asking for one raises KeyError: 'Allele_Frequency'.
vcf_df["Quality"] = vcf_df["Quality"].apply(safe_convert)

print("Data processed successfully!")

In [None]:
### Code Playground!


# Problem 1: Extracting High-Quality Variants from a VCF File
### Scenario
You are analyzing a **Variant Call Format (VCF) file** containing genetic variants. Some variants have low **quality scores**, and you only want to keep those with **quality ≥ 1000**.

### Task
Write a Python program that:
1. **Reads a VCF file** and extracts relevant variant data (Chromosome, Position, Reference, Alternate, Quality).
2. **Filters variants** where `Quality ≥ 1000`.
3. **Writes the filtered variants** to a new CSV file.


Your VCF file is in the `data/` folder.


In [None]:
### Code Playground!



---

# Problem 2: Handling Missing MIC Values in an AMR Dataset
### Scenario
You are working with an **Antimicrobial Resistance (AMR) dataset** containing **Minimum Inhibitory Concentration (MIC) values**. Some MIC values are **missing (NaN)**, and you need to handle them.

### Task
Write a Python program that:
1. **Reads an AMR dataset (CSV file)**.
2. **Finds missing MIC values** in the `MIC_µg/mL` column.
3. **Fills missing values** with `"Not Available"` and saves the cleaned dataset.

### Example Input (AMR Dataset Sample)


In [None]:
### Code Playground!


In [None]:

## **Solution for Problem 1: Extracting High-Quality Variants**

import pandas as pd

# Define the VCF file path
vcf_file_path = "data/test.vcf"

# Extract relevant data
vcf_data = []

# Open and read the VCF file
with open(vcf_file_path, "r") as file:
    for line in file:
        if not line.startswith("#"):  # Ignore metadata lines
            columns = line.strip().split("\t")
            chrom, pos, ref, alt, qual = columns[0], columns[1], columns[3], columns[4], columns[5]

            # Store extracted data
            vcf_data.append([chrom, pos, ref, alt, int(qual)])

# Convert to DataFrame
vcf_df = pd.DataFrame(vcf_data, columns=["Chromosome", "Position", "Reference", "Alternate", "Quality"])

# Filter variants with Quality >= 1000
filtered_variants = vcf_df[vcf_df["Quality"] >= 1000]

# Save to CSV
filtered_variants.to_csv("filtered_variants.csv", index=False)

print("Filtered variants saved to 'filtered_variants.csv'")


In [None]:
### Code Playground!

In [None]:
## **Solution for Problem 2: Handling Missing MIC Values**


import pandas as pd


# Load the AMR dataset
amr_df = pd.read_csv("data/amr_dataset.csv")

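# Find the missing MIC values
print("Missing MIC values:", amr_df["MIC_µg/mL"].isnull().sum())

# Fill missing MIC values with "Not Available".
# Assigning the result back avoids relying on inplace=True on a selected column,
# and astype(object) lets the column hold both numbers and text.
amr_df["MIC_µg/mL"] = amr_df["MIC_µg/mL"].astype(object).fillna("Not Available")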

# Save the cleaned dataset
amr_df.to_csv("cleaned_amr_data.csv", index=False)

print("Cleaned AMR dataset saved to 'cleaned_amr_data.csv'")


In [None]:
### Code Playground!


### Any Questions?