Chapter 4: Handling Files
-----------------------------

# Reading Files

In [1]:
file_handle = open('samples/readme.txt', 'r')

In [2]:
file_handle

<_io.TextIOWrapper name='samples/readme.txt' mode='r' encoding='cp1252'>
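The repr above reports `encoding='cp1252'` because no encoding was passed, so `open()` used the platform default. The following is a minimal sketch of passing the encoding explicitly so the file is decoded the same way on every system. It assumes a scratch file in a temporary directory; the name `demo.txt` and its content are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical scratch file; the name and content are for illustration only.
demo_path = Path(tempfile.mkdtemp()) / 'demo.txt'

# Write and read with an explicit encoding instead of the platform default.
with open(demo_path, 'w', encoding='utf-8') as fh:
    fh.write('caf\u00e9')

with open(demo_path, 'r', encoding='utf-8') as fh:
    text = fh.read()
    used_encoding = fh.encoding  # the encoding this handle actually uses

raw_bytes = demo_path.read_bytes()  # the accented letter takes two bytes in UTF-8
print(repr(text), used_encoding, raw_bytes)
```

Reading the same bytes with a different encoding would either raise an error or produce wrong characters, which is why stating the encoding is usually the safer habit.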

In [3]:
file_handle = open('samples/seqA.fas', 'r')
file_handle.read()

'>O00626|HUMAN Small inducible cytokine A22.\nMARLQTALLVVLVLLAVALQATEAGPYGANMEDSVCCRDYVRYRLPLRVVKHFYWTSDSCPRPGVVLLTFRDKEICADPRVPWVKMILNKLSQ\n'

In [4]:
file_handle = open('samples/readme.txt', 'r')
# do something with the file
file_handle.read()
file_handle.close()

In [5]:
with open('samples/readme.txt', 'r') as file_handle:
    print(file_handle.read())


It is a fake text! 



In [6]:
with open('samples/seqA.fas', 'r') as file_handle:
    print(file_handle.read())

>O00626|HUMAN Small inducible cytokine A22.
MARLQTALLVVLVLLAVALQATEAGPYGANMEDSVCCRDYVRYRLPLRVVKHFYWTSDSCPRPGVVLLTFRDKEICADPRVPWVKMILNKLSQ



In [7]:
from pathlib import Path

file_path = Path('samples/readme.txt')
content = file_path.read_text()
print(content)


It is a fake text! 



In [8]:
binary_content = file_path.read_bytes()
binary_content

b'It is a fake text! \n'

#### Reading a FASTA File (Extracting the Name and Sequence)

- 🔹 First Approach: 
Using read() 
(Not Recommended for Large Files)

In [12]:
with open('samples/prot.fas') as fh:
    my_file = fh.read()  # Loads the whole file into memory at once
temp = my_file.split('\n')  # Split once and reuse the list
name = temp[0][1:]  # Header line without the leading '>'
sequence = ''.join(temp[1:])
print('temp:\n', temp)
print(f'\nThe name is:\n{name}')
print(f'\nThe seq. is :\n{sequence}')

temp:
 [">sp|Q8RW96|2A5G_ARATH Serine/threonine protein phosphatase 2A 59 kDa regulatory subunit B' gamma isoform OS=Arabidopsis thaliana GN=B'GAMMA PE=1 SV=2", 'MIKQIFGKLPRKPSKSSHNDSNPNGEGGVNSYYIPNSGISSISKPSSKSSASNSNGANGT', 'VIAPSSTSSNRTNQVNGVYEALPSFRDVPTSEKPNLFIKKLSMCCVVFDFNDPSKNLREK', 'EIKRQTLLELVDYIATVSTKLSDAAMQEIAKVAVVNLFRTFPSANHESKILETLDVDDEE', 'PALEPAWPHLQVVYELLLRFVASPMTDAKLAKRYIDHSFVLKLLDLFDSEDQREREYLKT', 'ILHRIYGKFMVHRPFIRKAINNIFYRFIFETEKHNGIAELLEILGSIINGFALPLKEEHK', 'LFLIRALIPLHRPKCASAYHQQLSYCIVQFVEKDFKLADTVIRGLLKYWPVTNSSKEVMF', 'LGELEEVLEATQAAEFQRCMVPLFRQIARCLNSSHFQVAERALFLWNNDHIRNLITQNHK', 'VIMPIVFPAMERNTRGHWNQAVQSLTLNVRKVMAETDQILFDECLAKFQEDEANETEVVA', 'KREATWKLLEELAASKSVSNEAVLVPRFSSSVTLATGKTSGS', '']

The name is:
sp|Q8RW96|2A5G_ARATH Serine/threonine protein phosphatase 2A 59 kDa regulatory subunit B' gamma isoform OS=Arabidopsis thaliana GN=B'GAMMA PE=1 SV=2

The seq. is :
MIKQIFGKLPRKPSKSSHNDSNPNGEGGVNSYYIPNSGISSISKPSSKSSASNSNGANGTVIAPSSTSSNRTNQVNGVYEALPSFRDVPTSEKPNLFIK

- 🔹 Optimized Approach: 
Using readline() 
(Recommended)


In [35]:
with open('samples/seqA.fas', 'r') as fh:
    name = fh.readline().strip()[1:]
    sequence = ''.join(line.strip() for line in fh)

print(f'The name is: {name}')
print(f'The sequence is: {sequence}')


The name is: O00626|HUMAN Small inducible cytokine A22.
The sequence is: MARLQTALLVVLVLLAVALQATEAGPYGANMEDSVCCRDYVRYRLPLRVVKHFYWTSDSCPRPGVVLLTFRDKEICADPRVPWVKMILNKLSQ


- 🔹 Alternative: Using pathlib.Path (read_text() requires Python 3.5+)


In [5]:
from pathlib import Path

file_path = Path('samples/seqA.fas')
lines = file_path.read_text().splitlines()

name = lines[0][1:]
sequence = ''.join(lines[1:])

print(f'The name is: {name}')
print(f'The sequence is: {sequence}')


The name is: O00626|HUMAN Small inducible cytokine A22.
The sequence is: MARLQTALLVVLVLLAVALQATEAGPYGANMEDSVCCRDYVRYRLPLRVVKHFYWTSDSCPRPGVVLLTFRDKEICADPRVPWVKMILNKLSQ
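The readline-based approach can be wrapped in a small reusable function. This is only a sketch: the function name `read_single_fasta`, the header check, and the scratch file `demo.fas` are assumptions made for illustration, and it handles files with exactly one record.

```python
import tempfile
from pathlib import Path


def read_single_fasta(path):
    """Return (name, sequence) for a FASTA file holding exactly one record."""
    with open(path, 'r') as fh:
        header = fh.readline().strip()
        if not header.startswith('>'):
            raise ValueError(f'{path} does not start with a FASTA header')
        # Iterating over fh reads one line at a time, so memory use stays small.
        sequence = ''.join(line.strip() for line in fh)
    return header[1:], sequence


# Hypothetical record written to a scratch file for the demo.
demo_path = Path(tempfile.mkdtemp()) / 'demo.fas'
demo_path.write_text('>demo|TEST Example protein\nMARLQ\nTALLV\n')

name, sequence = read_single_fasta(demo_path)
print(name)
print(sequence)
```

Raising an error on a missing `>` is a design choice for this sketch; it makes a wrong input file fail early instead of returning a truncated name.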


In [1]:
with open('samples/prot.fas', 'r') as fh:
    name = fh.readline()
    print(name)

>sp|Q8RW96|2A5G_ARATH Serine/threonine protein phosphatase 2A 59 kDa regulatory subunit B' gamma isoform OS=Arabidopsis thaliana GN=B'GAMMA PE=1 SV=2



In [2]:
with open('samples/prot.fas', 'r') as fh:
    name1 = fh.read()
    print(name1)

>sp|Q8RW96|2A5G_ARATH Serine/threonine protein phosphatase 2A 59 kDa regulatory subunit B' gamma isoform OS=Arabidopsis thaliana GN=B'GAMMA PE=1 SV=2
MIKQIFGKLPRKPSKSSHNDSNPNGEGGVNSYYIPNSGISSISKPSSKSSASNSNGANGT
VIAPSSTSSNRTNQVNGVYEALPSFRDVPTSEKPNLFIKKLSMCCVVFDFNDPSKNLREK
EIKRQTLLELVDYIATVSTKLSDAAMQEIAKVAVVNLFRTFPSANHESKILETLDVDDEE
PALEPAWPHLQVVYELLLRFVASPMTDAKLAKRYIDHSFVLKLLDLFDSEDQREREYLKT
ILHRIYGKFMVHRPFIRKAINNIFYRFIFETEKHNGIAELLEILGSIINGFALPLKEEHK
LFLIRALIPLHRPKCASAYHQQLSYCIVQFVEKDFKLADTVIRGLLKYWPVTNSSKEVMF
LGELEEVLEATQAAEFQRCMVPLFRQIARCLNSSHFQVAERALFLWNNDHIRNLITQNHK
VIMPIVFPAMERNTRGHWNQAVQSLTLNVRKVMAETDQILFDECLAKFQEDEANETEVVA
KREATWKLLEELAASKSVSNEAVLVPRFSSSVTLATGKTSGS



- File Handling Example: Calculating the Net Charge of a Protein from a FASTA File

In [3]:
sequence = ''
charge = -0.002
amino_acid_charge = {'C': -0.045, 'D': -0.999, 'E': -0.998, 'H': 0.091,
                     'K': 1, 'R': 1, 'Y': -0.001}
with open('samples/prot.fas') as fh:
    fh.readline()  # Skip header
    for line in fh:
        sequence += line.strip().upper()  # Read sequence
print(f'seq: {sequence}')
for aa in sequence:
    charge += amino_acid_charge.get(aa, 0)  # Calculate charge
print(f'Net charge: {charge}')

seq: MIKQIFGKLPRKPSKSSHNDSNPNGEGGVNSYYIPNSGISSISKPSSKSSASNSNGANGTVIAPSSTSSNRTNQVNGVYEALPSFRDVPTSEKPNLFIKKLSMCCVVFDFNDPSKNLREKEIKRQTLLELVDYIATVSTKLSDAAMQEIAKVAVVNLFRTFPSANHESKILETLDVDDEEPALEPAWPHLQVVYELLLRFVASPMTDAKLAKRYIDHSFVLKLLDLFDSEDQREREYLKTILHRIYGKFMVHRPFIRKAINNIFYRFIFETEKHNGIAELLEILGSIINGFALPLKEEHKLFLIRALIPLHRPKCASAYHQQLSYCIVQFVEKDFKLADTVIRGLLKYWPVTNSSKEVMFLGELEEVLEATQAAEFQRCMVPLFRQIARCLNSSHFQVAERALFLWNNDHIRNLITQNHKVIMPIVFPAMERNTRGHWNQAVQSLTLNVRKVMAETDQILFDECLAKFQEDEANETEVVAKREATWKLLEELAASKSVSNEAVLVPRFSSSVTLATGKTSGS
Net charge: 3.046999999999999
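The charge calculation above can also be written as a function, which makes it easy to check on a short sequence by hand. This is a sketch that reuses the same charge table and starting offset; the function name `net_charge` and the test peptide `KKDE` are hypothetical.

```python
AMINO_ACID_CHARGE = {'C': -0.045, 'D': -0.999, 'E': -0.998, 'H': 0.091,
                     'K': 1, 'R': 1, 'Y': -0.001}


def net_charge(sequence, base_charge=-0.002):
    """Sum the per-residue charges, starting from the same offset as above."""
    # Residues missing from the table contribute 0.
    return base_charge + sum(AMINO_ACID_CHARGE.get(aa, 0) for aa in sequence.upper())


short_peptide = 'KKDE'  # hypothetical test sequence: 1 + 1 - 0.999 - 0.998 - 0.002
print(round(net_charge(short_peptide), 3))
```

Because the values are floats, results such as 3.046999999999999 are expected; rounding before printing is usually enough for display.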


# Writing Files

In [37]:
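fh = open('samples/newfile.txt', 'w')  # 'w': create the file, or empty it if it already exists
fh.close()

In [38]:
fh = open('samples/error.log', 'a')  # 'a': keep existing content and append at the end
fh.close()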

- Write numbers to a file.

In [39]:
with open('samples/numbers.txt','w') as fh:
    fh.write('1\n2\n3\n4\n5')
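The difference between the 'w' and 'a' modes is easiest to see by writing to the same file several times. A small sketch, assuming a scratch file in a temporary directory (the name `demo.log` is hypothetical):

```python
import tempfile
from pathlib import Path

log_path = Path(tempfile.mkdtemp()) / 'demo.log'  # hypothetical scratch file

with open(log_path, 'w') as fh:   # 'w' creates the file or empties an existing one
    fh.write('first\n')
with open(log_path, 'w') as fh:   # opening with 'w' again discards 'first'
    fh.write('second\n')
with open(log_path, 'a') as fh:   # 'a' keeps the content and writes at the end
    fh.write('third\n')

lines = log_path.read_text().splitlines()
print(lines)
```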

- Writing Computation Results to a File
(Stores the computed protein net charge in out.txt.)

In [4]:
sequence = ''
charge = -0.002
aa_charge = {'C': -.045, 'D': -.999, 'E': -.998, 'H': .091, 'K': 1, 'R': 1, 'Y': -.001}

with open('samples/prot.fas') as fh:
    next(fh)  # Skip the first line (FASTA header)
    for line in fh:
        sequence += line.strip().upper()  # Remove trailing characters and convert to uppercase

for aa in sequence:
    charge += aa_charge.get(aa, 0)  # Calculate the net charge of the protein

with open('out.txt', 'w') as file_out:
    file_out.write(str(charge))  # Save the result to a file
    print('file created')


file created


- Writing files (using pathlib)

In [6]:
from pathlib import Path
file_path = Path('output3.txt')
file_path.write_text("Hello, World!")
print('file created!!!!!!!!!!!!!!!!')

file created!!!!!!!!!!!!!!!!


# CSV FILES

- 🔹 Reading a CSV File in Python (Without csv Module)

In [7]:
total_len = 0
with open('samples/B1.csv') as fh:
    next(fh)  # Skip the header row
    for n, line in enumerate(fh, start=1):
        data = line.strip().split(',')
        print(f'data: {data}')
        total_len += int(data[1])
    print(total_len / n)

data: ['TKO001', '119', 'AG(12)']
data: ['TKO002', '255', 'TC(16)']
data: ['TKO003', '121', 'AG(5)']
data: ['TKO004', '220', 'AG(9)']
data: ['TKO005', '238', 'TC(17)']
190.6
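Splitting on commas works for the file above because no field contains a comma. The sketch below shows where the manual approach breaks and the csv module does not. The row and the `Note` column are hypothetical, and the data is kept in memory with `io.StringIO` so the example needs no file.

```python
import csv
import io

# Hypothetical row: the third field contains a comma, so it is quoted.
raw = 'MarkerID,LenAmp,Note\nTKO001,119,"AG(12), verified"\n'

# Manual parsing splits inside the quoted field.
naive_rows = [line.split(',') for line in raw.splitlines()]

# csv.reader understands the quoting rules.
csv_rows = list(csv.reader(io.StringIO(raw, newline='')))

print(naive_rows[1])
print(csv_rows[1])
```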


- Reading data from a CSV file, using csv module

In [9]:
import csv

with open('samples/data.csv', newline='', encoding='utf-8') as file:
    rows = list(csv.reader(file, delimiter = '\t'))  # Convert CSV reader to a list
    header = rows[0]  # First row (header)
    data = rows[1:]  # All rows except the first one

print("Header:", header)
print("First data row:", data[0])


Header: ['Entry', 'Entry name', 'Status', 'Protein names', 'Gene names', 'Organism', 'Length']
First data row: ['Q8BG02', '2ABG_MOUSE', 'reviewed', 'Serine/threonine-protein phosphatase 2A 55 kDa regulatory subunit B gamma isoform (PP2A subunit B isoform B55-gamma) (PP2A subunit B isoform PR55-gamma) (PP2A subunit B isoform R2-gamma) (PP2A subunit B isoform gamma)', 'Ppp2r2c', 'Mus musculus (Mouse)', '447']


- Reading data from a CSV file, using pandas module (tab separator)

In [10]:
import pandas as pd

df = pd.read_csv('samples/data.csv', delimiter='\t')  # or sep='\t'
print(df.head())


    Entry   Entry name    Status  \
0  Q8BG02   2ABG_MOUSE  reviewed   
1  P24815  3BHS1_MOUSE  reviewed   
2  P55194   3BP1_MOUSE  reviewed   
3  P28334  5HT1B_MOUSE  reviewed   
4  P35363  5HT2A_MOUSE  reviewed   

                                       Protein names    Gene names  \
0  Serine/threonine-protein phosphatase 2A 55 kDa...       Ppp2r2c   
1  3 beta-hydroxysteroid dehydrogenase/Delta 5-->...  Hsd3b1 Hsd3b   
2               SH3 domain-binding protein 1 (3BP-1)   Sh3bp1 3bp1   
3  5-hydroxytryptamine receptor 1B (5-HT-1B) (5-H...   Htr1b 5ht1b   
4  5-hydroxytryptamine receptor 2A (5-HT-2) (5-HT...    Htr2a Htr2   

               Organism  Length  
0  Mus musculus (Mouse)     447  
1  Mus musculus (Mouse)     373  
2  Mus musculus (Mouse)     601  
3  Mus musculus (Mouse)     386  
4  Mus musculus (Mouse)     471  


In [16]:
import pandas as pd

df = pd.read_csv('samples/B1.csv')
print(df.head())

  MarkerID  LenAmp MotifAmpForSeq
0   TKO001     119         AG(12)
1   TKO002     255         TC(16)
2   TKO003     121          AG(5)
3   TKO004     220          AG(9)
4   TKO005     238         TC(17)


In [11]:
import csv
total_len = 0

with open('samples/B1.csv', newline='') as fh:  # The file is closed automatically
    lines = csv.reader(fh)
    print(lines)
    next(lines)  # Skip the header row
    for n, line in enumerate(lines, start=1):
        total_len += int(line[1])
print(total_len / n)

<_csv.reader object at 0x000002A0D3AC6580>
190.6


In [15]:
with open('samples/B1.csv', newline='') as fh:
    data = list(csv.reader(fh))  # Read all rows while the file is open
print(data[0][2])
print(data[1][1])
print(data[1][2])
print(data[3][0])

MotifAmpForSeq
119
AG(12)
TKO003


In [17]:
import csv

with open('samples/data.csv', newline='', encoding='utf-8') as file:
    sample = file.read(1024)  # Read the first 1024 characters (text mode counts characters, not bytes)
    file.seek(0)  # Reset the file pointer to the beginning

    rows = csv.reader(file, delimiter='\t')  # Manually set the delimiter as tab
    print(next(rows))  # Print the first row (header)
    print(next(rows))  # Print the second row (first data entry)


['Entry', 'Entry name', 'Status', 'Protein names', 'Gene names', 'Organism', 'Length']
['Q8BG02', '2ABG_MOUSE', 'reviewed', 'Serine/threonine-protein phosphatase 2A 55 kDa regulatory subunit B gamma isoform (PP2A subunit B isoform B55-gamma) (PP2A subunit B isoform PR55-gamma) (PP2A subunit B isoform R2-gamma) (PP2A subunit B isoform gamma)', 'Ppp2r2c', 'Mus musculus (Mouse)', '447']


### Changing Delimiters
The default delimiter is a comma (,). It can be changed (e.g., to : or \t).

In [9]:
rows = csv.reader(open('/etc/passwd'), delimiter=':')

Different CSV structures require different dialects

In [7]:
rows = csv.reader(open('samples/data.csv'), dialect='excel')

- Automatically Detecting CSV Format (🔍 Sniffer detects the correct format)

In [11]:
with open('samples/data.csv', newline='') as file:
    dialect = csv.Sniffer().sniff(file.read(1024))
    file.seek(0)
    rows = csv.reader(file, dialect=dialect)

    print(next(rows))
    print(next(rows))

['Entry', 'Entry name', 'Status', 'Protein names', 'Gene names', 'Organism', 'Length']
['Q8BG02', '2ABG_MOUSE', 'reviewed', 'Serine/threonine-protein phosphatase 2A 55 kDa regulatory subunit B gamma isoform (PP2A subunit B isoform B55-gamma) (PP2A subunit B isoform PR55-gamma) (PP2A subunit B isoform R2-gamma) (PP2A subunit B isoform gamma)', 'Ppp2r2c', 'Mus musculus (Mouse)', '447']
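The Sniffer step can be tried without a file. This sketch feeds a small tab-separated sample to `csv.Sniffer().sniff()` and checks which delimiter it guessed. The two-column sample is an assumption built from values shown earlier, and restricting the candidates through the `delimiters` argument is a choice made here to keep the guess predictable.

```python
import csv
import io

# Hypothetical in-memory sample with tab-separated columns.
sample = 'Entry\tLength\nQ8BG02\t447\nP24815\t373\n'

# Restricting the candidate delimiters makes the guess less error-prone.
dialect = csv.Sniffer().sniff(sample, delimiters='\t,;')
rows = list(csv.reader(io.StringIO(sample, newline=''), dialect=dialect))

print(repr(dialect.delimiter))
print(rows)
```

Sniffing is a heuristic, so for files whose format is known it is usually simpler to pass `delimiter` directly.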


### Reading an Excel File (.xlsx) with openpyxl


In [18]:
from openpyxl import load_workbook

iedb = {}
wb = load_workbook('samples/sampledata.xlsx', data_only=True)  # Load the Excel file
sh = wb.active  # Get the active sheet

# Iterate through rows, starting from the second row (skip headers)
for row in sh.iter_rows(min_row=2, values_only=True):  
    iedb[int(row[0])] = row[2]  # Store column A as key and column C as value

print(iedb)


{6273: 'CGAELNHFL', 14101: 'ERYLKDQQL', 22030: 'GRFKLIVLY', 25569: 'IDFPKTFGW', 26070: 'IFFPKTFGW', 26790: 'IKFPKTFGW', 27049: 'ILFPKTFGW', 27636: 'INFPKTFGW', 28419: 'IRYPKTFGW', 33140: 'KRGILTLKY', 33170: 'KRKKAYADF', 33260: 'KRYKSIVKY', 55565: 'RRFVNVVPTF', 55785: 'RRYQKSTEL', 58781: 'SKADVIAKY', 60636: 'SRDKTIIMW', 63789: 'TGASIQTTL', 144753: 'QRSPMFEGTL', 144784: 'SKFPKMRMG', 226822: 'AKFPGMKKSK', 504020: 'NQFNGGCLLV'}


In [19]:
from openpyxl import Workbook

wb = Workbook()  # Create a new workbook
ws = wb.active  # Get the active sheet

# Write column headers
ws.append(["Column A", "Column B", "Column C", "Column D"])  

# Write four rows of data
ws.append([1, 230, 0, 5])
ws.append([2, 238, 0, 5])
ws.append([3, 454, 0, 5])
ws.append([4, 234, 0, 5])
# Save the workbook as "mynewfile5.xlsx"
wb.save("mynewfile5.xlsx")
print('done')

done


In [20]:
from openpyxl import load_workbook

iedb = {}
wb = load_workbook('mynewfile5.xlsx', data_only=True)  # Load the Excel file
sh = wb.active  # Get the active sheet

# Iterate through rows, starting from the second row (skip headers)
for row in sh.iter_rows(min_row=2, values_only=True):  
    iedb[int(row[0])] = row[1]  # Store column A as key and column B as value

print(iedb)


{1: 230, 2: 238, 3: 454, 4: 234}


### Reading CSV and Excel (.xlsx) Files with pandas


In [21]:
import pandas as pd

df1 = pd.read_csv('samples/data.csv', delimiter='\t')  # delimiter= is an alias for sep=
df2 = pd.read_csv('samples/data.csv', sep='\t')  # Equivalent call using sep= (more common)
print(df2.head())  # Display the first 5 rows


    Entry   Entry name    Status  \
0  Q8BG02   2ABG_MOUSE  reviewed   
1  P24815  3BHS1_MOUSE  reviewed   
2  P55194   3BP1_MOUSE  reviewed   
3  P28334  5HT1B_MOUSE  reviewed   
4  P35363  5HT2A_MOUSE  reviewed   

                                       Protein names    Gene names  \
0  Serine/threonine-protein phosphatase 2A 55 kDa...       Ppp2r2c   
1  3 beta-hydroxysteroid dehydrogenase/Delta 5-->...  Hsd3b1 Hsd3b   
2               SH3 domain-binding protein 1 (3BP-1)   Sh3bp1 3bp1   
3  5-hydroxytryptamine receptor 1B (5-HT-1B) (5-H...   Htr1b 5ht1b   
4  5-hydroxytryptamine receptor 2A (5-HT-2) (5-HT...    Htr2a Htr2   

               Organism  Length  
0  Mus musculus (Mouse)     447  
1  Mus musculus (Mouse)     373  
2  Mus musculus (Mouse)     601  
3  Mus musculus (Mouse)     386  
4  Mus musculus (Mouse)     471  


- Selecting Specific Columns

In [22]:
import pandas as pd
df = pd.read_excel('mynewfile5.xlsx', usecols= ["Column A", "Column C"])
print(df)

   Column A  Column C
0         1         0
1         2         0
2         3         0
3         4         0


In [23]:
import pandas as pd
df = pd.read_excel('samples/sampledata.xlsx')
print(df.head())

   Epitope ID     Object Type Description  Starting Position  Ending Position
0        6273  Linear peptide   CGAELNHFL              379.0            387.0
1       14101  Linear peptide   ERYLKDQQL              584.0            592.0
2       22030  Linear peptide   GRFKLIVLY                NaN              NaN
3       25569  Linear peptide   IDFPKTFGW                NaN              NaN
4       26070  Linear peptide   IFFPKTFGW                NaN              NaN


In [24]:
import pandas as pd
df = pd.read_excel('samples/sampledata.xlsx', usecols= ["Epitope ID", "Description"])
print(df)

    Epitope ID Description
0         6273   CGAELNHFL
1        14101   ERYLKDQQL
2        22030   GRFKLIVLY
3        25569   IDFPKTFGW
4        26070   IFFPKTFGW
5        26790   IKFPKTFGW
6        27049   ILFPKTFGW
7        27636   INFPKTFGW
8        28419   IRYPKTFGW
9        33140   KRGILTLKY
10       33170   KRKKAYADF
11       33260   KRYKSIVKY
12       55565  RRFVNVVPTF
13       55785   RRYQKSTEL
14       58781   SKADVIAKY
15       60636   SRDKTIIMW
16       63789   TGASIQTTL
17      144753  QRSPMFEGTL
18      144784   SKFPKMRMG
19      226822  AKFPGMKKSK
20      504020  NQFNGGCLLV


- Converting an Excel Column to a Dictionary

In [28]:
df = pd.read_excel("mynewfile5.xlsx", engine="openpyxl")
print(df.head())
# Convert "Column A" to keys and "Column B" to values
iedb = dict(zip(df["Column A"], df["Column B"]))
print(iedb)


   Column A  Column B  Column C  Column D
0         1       230         0         5
1         2       238         0         5
2         3       454         0         5
3         4       234         0         5
{1: 230, 2: 238, 3: 454, 4: 234}


- Writing a DataFrame to an Excel File

In [29]:
df = pd.DataFrame({
    'colA' : [1,2,3,4],
    'colB' : ['A','B','C','D']
})

df.to_excel('mynewfile6.xlsx', index = False)
print('done')

done


In [30]:
df = pd.read_excel("mynewfile6.xlsx")
print(df.head())

   colA colB
0     1    A
1     2    B
2     3    C
3     4    D


### PICKLE: storing and retrieving the contents of variables

In [31]:
import pickle

# Define a dictionary
sp_dict = {'one': 'uno', 'two': 'dos', 'three': 'tres'}

# Save the dictionary to a file
with open('spdict.data', 'wb') as fh:
    pickle.dump(sp_dict, fh)


In [32]:
import pickle

# Load the dictionary from the file
with open('spdict.data', 'rb') as fh:
    loaded_dict = pickle.load(fh)

print(loaded_dict)  # Output: {'one': 'uno', 'two': 'dos', 'three': 'tres'}


{'one': 'uno', 'two': 'dos', 'three': 'tres'}
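Pickle is not limited to dictionaries of strings. The sketch below round-trips a record that contains a set and a tuple, using `pickle.dumps` and `pickle.loads` to stay in memory. The record itself is hypothetical. Note that unpickling can execute arbitrary code, so only load pickle data from sources you trust.

```python
import pickle

# Hypothetical record mixing types that JSON cannot represent directly.
record = {'ids': {6273, 14101}, 'span': (379, 387)}

payload = pickle.dumps(record)    # serialize to bytes in memory
restored = pickle.loads(payload)  # rebuild an equal object

print(type(restored['ids']).__name__, type(restored['span']).__name__)
print(restored == record)
```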


### JSON

In [None]:
{
  "contactPoint": {
    "fn": "PREUSCH, PETER\u00a0",
    "hasEmail": "mailto:preuschp@nigms.nih.gov"
  },
  "description": "<p>The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids found in all organisms</p>\n",
  "identifier": "d9f3932a-9c55-41b3-ad3a-0b4e18ee4752",
  "keyword": ["national-institutes-of-health-nih"],
  "language": ["en"],
  "license": "http://opendefinition.org/licenses/odc-odbl/",
  "modified": "2016-07-18",
  "programCode": ["009:000"],
  "publisher": {
    "@type": "org:Organization",
    "name": "National Institutes of Health (NIH)"
  },
  "title": "Protein Data Bank (PDB)"
}


In [33]:
import json
sp_dict = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
with open('spdict.json', 'w') as fh:
    json.dump(sp_dict, fh)


In [35]:
sp_dict

{'one': 'uno', 'two': 'dos', 'three': 'tres'}

In [34]:
with open('spdict.json') as fh:
    sp_dict = json.load(fh)

In [36]:
import json
with open('test.json', 'w') as fh:  # json.dump writes str, so use text mode ('w'), not 'wb'
    json.dump({1, 2, 3, 4}, fh)

# Raises TypeError: Object of type set is not JSON serializable.
# JSON cannot serialize sets and some other Python objects.
# Types that json can serialize by default:
# int, float, str, list, tuple, dict, True, False, and None.

TypeError: Object of type set is not JSON serializable
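Two common ways around this TypeError are sketched here: convert the set to a list before dumping, or pass a fallback function through the `default` parameter of `json.dumps`. The helper name `encode_set` is hypothetical. Either way the value comes back as a list, because JSON has no set type.

```python
import json

ids = {3, 1, 2}

# Option 1: convert explicitly; sorting gives a stable order.
as_list = json.dumps(sorted(ids))


# Option 2: let json call a fallback for objects it cannot encode.
def encode_set(obj):
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')


nested = json.dumps({'ids': ids}, default=encode_set)

print(as_list)
print(nested)
print(json.loads(nested)['ids'])
```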

### FILE HANDLING: OS, OS.PATH, SHUTIL, AND PATHLIB MODULES

In [4]:
#os.chdir('C:\\Users\\paris\\bio4py')

In [5]:
import os
os.getcwd()  # Windows output: 'C:\\Users\\paris\\bio4py'
# On Linux/macOS the path uses forward slashes and has no drive letter

'C:\\Users\\paris\\bio4py'

In [6]:
os.chdir('samples')
os.getcwd()  # Output: 'C:\\Users\\paris\\bio4py\\samples'

'C:\\Users\\paris\\bio4py\\samples'

In [7]:
os.chdir('..')  # Return to the higher directory
os.getcwd()

'C:\\Users\\paris\\bio4py'

In [8]:
path = 'C:\\Users\\paris\\bio4py'
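os.path.isfile(path)  # False: the path is a directory, not a file
os.path.isdir(path)   # True (only the value of the last expression is displayed)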


True

In [9]:
os.listdir('C:\\Users\\paris\\bio4py')
# Returns a list with the names of the entries in the directory


['.ipynb_checkpoints',
 'Chapter 5 - Code Modularizing.ipynb',
 'Chapter1- Introduction.ipynb',
 'Chapter2 - Basic Programming - Data Types.ipynb',
 'Chapter3 - Programming - Flow Control.ipynb',
 'Chapter4 - Handling Files.ipynb',
 'Common-Dictionary-Methods.png',
 'compares-datatype.png',
 'mynewfile1.xlsx',
 'mynewfile5.xlsx',
 'mynewfile6.xlsx',
 'newfile.txt',
 'out.txt',
 'prot.fas',
 'PythonStringMethodsOverview.png',
 'samples',
 'Untitled.ipynb']

In [10]:
os.path.join(os.getcwd(), 'images')
# On Windows: '\\', on Linux/macOS: '/'


'C:\\Users\\paris\\bio4py\\images'
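The same join can be written with pathlib, where the `/` operator plays the role of `os.path.join`. A short sketch comparing the two on the current working directory:

```python
import os
from pathlib import Path

base = os.getcwd()

joined_with_os = os.path.join(base, 'images')   # returns a str
joined_with_pathlib = Path(base) / 'images'     # returns a Path object

print(joined_with_os)
print(joined_with_pathlib)
print(str(joined_with_pathlib) == joined_with_os)
```

Both pick the separator of the current operating system; the pathlib version additionally gives access to attributes such as `.name` and `.parent`.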

In [1]:
# pathlib has been part of the standard library since Python 3.4, so no installation is needed.
# Avoid `pip install pathlib`: the PyPI package of that name is an obsolete backport.

In [11]:
from pathlib import Path

In [12]:
f = Path('C:\\Users\\paris\\bio4py\\newfile.txt')
f.touch()  # Create the file if it doesn't exist

In [23]:
f.is_file()  # True if it is a file

False

In [24]:
f = Path('/home/sb/xx.py')
print(f.suffix)        # '.py'
print(f.name)          # 'xx.py' (a str, not a Path)
print(f.parent)        # Path('/home/sb'); printed with backslashes on Windows
print(f.parent.parent) # Path('/home')


.py
xx.py
\home\sb
\home
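The backslashes in the output above appear because `Path` follows the rules of the operating system it runs on (Windows here). The sketch below uses the pure path classes, which only manipulate the text of a path and never touch the disk, to show both behaviours on any system.

```python
from pathlib import PurePosixPath, PureWindowsPath

p = PurePosixPath('/home/sb/xx.py')

print(p.suffix)   # file extension, including the dot
print(p.stem)     # file name without the extension
print(p.parent)   # parent directory, printed with forward slashes
print(p.parts)    # tuple of path components

# The same text parsed with Windows rules prints backslashes instead.
w = PureWindowsPath('/home/sb/xx.py')
print(w.parent)
```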


In [38]:
# Filter files with specific pattern (e.g. .txt)
d = Path('C:\\Users\\paris\\bio4py\\samples')
txt_files = list(d.glob('*.txt'))
print(txt_files)


[WindowsPath('C:/Users/paris/bio4py/samples/newfile-checkpoint.txt'), WindowsPath('C:/Users/paris/bio4py/samples/newfile.txt'), WindowsPath('C:/Users/paris/bio4py/samples/output3-checkpoint.txt'), WindowsPath('C:/Users/paris/bio4py/samples/output3.txt'), WindowsPath('C:/Users/paris/bio4py/samples/readme-checkpoint.txt'), WindowsPath('C:/Users/paris/bio4py/samples/readme.txt')]


In [29]:
from pathlib import Path

d = Path('C:/Users/paris/bio4py')

# List all files (not directories)
files = [f for f in d.iterdir() if f.is_file()]
print(files)


[WindowsPath('C:/Users/paris/bio4py/Chapter 5 - Code Modularizing.ipynb'), WindowsPath('C:/Users/paris/bio4py/Chapter1- Introduction.ipynb'), WindowsPath('C:/Users/paris/bio4py/Chapter2 - Basic Programming - Data Types.ipynb'), WindowsPath('C:/Users/paris/bio4py/Chapter3 - Programming - Flow Control.ipynb'), WindowsPath('C:/Users/paris/bio4py/Chapter4 - Handling Files.ipynb'), WindowsPath('C:/Users/paris/bio4py/Common-Dictionary-Methods.png'), WindowsPath('C:/Users/paris/bio4py/compares-datatype.png'), WindowsPath('C:/Users/paris/bio4py/mynewfile1.xlsx'), WindowsPath('C:/Users/paris/bio4py/mynewfile5.xlsx'), WindowsPath('C:/Users/paris/bio4py/mynewfile6.xlsx'), WindowsPath('C:/Users/paris/bio4py/newfile.txt'), WindowsPath('C:/Users/paris/bio4py/out.txt'), WindowsPath('C:/Users/paris/bio4py/prot.fas'), WindowsPath('C:/Users/paris/bio4py/PythonStringMethodsOverview.png'), WindowsPath('C:/Users/paris/bio4py/Untitled.ipynb')]


In [26]:
# List all directories
dirs = [f for f in d.iterdir() if f.is_dir()]
print(dirs)


[WindowsPath('C:/Users/paris/bio4py/.ipynb_checkpoints'), WindowsPath('C:/Users/paris/bio4py/samples')]


In [31]:
# Filter files with specific pattern (e.g. .pdf)
d = Path('C:/Users/paris/bio4py/samples')
pdf_files = list(d.glob('*.pdf'))
print(pdf_files)


[WindowsPath('C:/Users/paris/bio4py/samples/test.pdf')]


In [36]:
# Walk through all files in directory and subdirectories
d = Path('C:/Users/paris/bio4py')
for file in d.rglob('*'):
    if file.is_file():
        print(file)


C:\Users\paris\bio4py\Chapter 5 - Code Modularizing.ipynb
C:\Users\paris\bio4py\Chapter1- Introduction.ipynb
C:\Users\paris\bio4py\Chapter2 - Basic Programming - Data Types.ipynb
C:\Users\paris\bio4py\Chapter3 - Programming - Flow Control.ipynb
C:\Users\paris\bio4py\Chapter4 - Handling Files.ipynb
C:\Users\paris\bio4py\Common-Dictionary-Methods.png
C:\Users\paris\bio4py\compares-datatype.png
C:\Users\paris\bio4py\mynewfile1.xlsx
C:\Users\paris\bio4py\mynewfile5.xlsx
C:\Users\paris\bio4py\mynewfile6.xlsx
C:\Users\paris\bio4py\newfile.txt
C:\Users\paris\bio4py\out.txt
C:\Users\paris\bio4py\outfile.fasta
C:\Users\paris\bio4py\prot.fas
C:\Users\paris\bio4py\PythonStringMethodsOverview.png
C:\Users\paris\bio4py\Untitled.ipynb
C:\Users\paris\bio4py\.ipynb_checkpoints\Chapter1- Introduction-checkpoint.ipynb
C:\Users\paris\bio4py\.ipynb_checkpoints\Chapter2 - Basic Programming - Data Types - part1-checkpoint.ipynb
C:\Users\paris\bio4py\.ipynb_checkpoints\Chapter2 - Basic Programming - Data Typ

In [39]:
from pathlib import Path

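# Search recursively for all .txt files under the samples directory
my_path = Path('C:/Users/paris/bio4py/samples')
#pdfs = Path.home().rglob('*.pdf')

# Create a folder to store the collected files
output_dir = my_path / 'FinalReports'
output_dir.mkdir(exist_ok=True)

# Collect the matches first, skipping files already inside FinalReports,
# so that moving files does not interfere with the directory walk
txt_files = [p for p in my_path.rglob('*.txt') if output_dir not in p.parents]

# Move all found .txt files to the FinalReports folder
for txt in txt_files:
    target = output_dir / txt.name
    txt.replace(target)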


- Combine all .fasta DNA sequence files from a given directory and its subdirectories into a single file called outfile.fasta: the contents of each .fasta file are read and written to outfile.fasta.

In [39]:
from pathlib import Path

# Set the directory that contains the input .fasta files
d = Path('C:/Users/paris/bio4py/samples')

# Open the output file to consolidate all sequences
with open('./outfile.fasta', 'w') as f_out:
    for file in d.rglob('*.fasta'):  # Recursively find all .fasta files (in the directory and all subdirectories)
        with open(file, 'r') as f_in:
            f_out.write(f_in.read())
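One detail worth guarding against when concatenating FASTA files is an input that does not end with a newline, because the next header would then be glued to the last sequence line. The sketch below builds a small hypothetical directory tree in a temporary folder (file names and sequences are made up) and adds the missing newline when needed.

```python
import tempfile
from pathlib import Path

# Hypothetical input tree: two small FASTA files, one in a subdirectory.
root = Path(tempfile.mkdtemp())
(root / 'sub').mkdir()
(root / 'a.fasta').write_text('>seqA\nACGT')          # no trailing newline
(root / 'sub' / 'b.fasta').write_text('>seqB\nTTGA\n')

# The output is kept outside root so that rglob cannot pick it up as input.
out_path = Path(tempfile.mkdtemp()) / 'outfile.fasta'

with open(out_path, 'w') as f_out:
    for fasta in sorted(root.rglob('*.fasta')):   # sorted for a reproducible order
        text = fasta.read_text()
        f_out.write(text)
        if not text.endswith('\n'):
            f_out.write('\n')   # keep the next header on its own line

combined = out_path.read_text()
print(combined)
```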


### Practical Example: Collecting All Your Project Reports


Scenario: It's the end of the semester, and you need to submit all your .pdf reports to your professor. But they are scattered across your computer in folders like Downloads, Documents, Desktop, etc. Let’s write a Python script to find all .pdf files and move them to one clean folder.

In [44]:
from pathlib import Path

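# Search recursively for all PDF files under the samples directory
# (use Path.home() instead to search your whole home directory)
my_path = Path('C:/Users/paris/bio4py/samples')
#pdfs = Path.home().rglob('*.pdf')

# Create a folder to store the final reports
output_dir = my_path / 'FinalReports'
output_dir.mkdir(exist_ok=True)

# Collect the matches first, skipping files already inside FinalReports,
# so that moving files does not interfere with the directory walk
pdfs = [p for p in my_path.rglob('*.pdf') if output_dir not in p.parents]

# Move all found PDFs to the FinalReports folder
for pdf in pdfs:
    target = output_dir / pdf.name
    pdf.replace(target)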
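`Path.replace()` silently overwrites a file with the same name in the target folder, and it can fail when source and target are on different drives. The sketch below is one possible variant, not the only way to do it: it uses `shutil.move` and adds a numeric suffix when a name is already taken. The folder layout, file names, and the `_1` naming scheme are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical folder layout with two reports that share a file name.
root = Path(tempfile.mkdtemp())
(root / 'Downloads').mkdir()
(root / 'Documents').mkdir()
(root / 'Downloads' / 'report.pdf').write_bytes(b'first')
(root / 'Documents' / 'report.pdf').write_bytes(b'second')

output_dir = root / 'FinalReports'
output_dir.mkdir(exist_ok=True)

# Collect the matches before moving anything, and skip the target folder.
pdfs = sorted(p for p in root.rglob('*.pdf') if output_dir not in p.parents)

for pdf in pdfs:
    target = output_dir / pdf.name
    counter = 1
    while target.exists():   # avoid silently overwriting an earlier report
        target = output_dir / f'{pdf.stem}_{counter}{pdf.suffix}'
        counter += 1
    shutil.move(str(pdf), str(target))   # also works across drives

collected = sorted(p.name for p in output_dir.iterdir())
print(collected)
```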


#### Theoretical Questions:

1. What is the difference between the “w” and “a” modes if both allow you to write files?
2. Why must we close all files that are no longer in use?
3. Why do we open files using with?
4. Is it possible to parse CSV files without the csv module? If so, how is it done?
5. Why is it not recommended to read a large file using read()?
6. What is the most efficient way to walk through a file line by line?
7. What is Pickle in Python?
8. Explain what JSON is and what limitations it has compared with Pickle.

#### Code-Related Questions:

9. Make a program that asks for a name, and then writes it to a file called MyName.txt.
10. Make a program that reads all the numbers from the second column of an Excel file and prints the average of these values.
11. Write a Python program that asks the user to enter a DNA sequence and then writes it to a file called dna_sequence.txt.
12. Given a CSV file named genes.csv, where the second column contains the lengths of genes, write a Python program that reads all gene lengths from the second column and calculates their average. Do not use the csv module.
- Sample genes.csv file content:
- Gene,Length
- BRCA1,1863
- TP53,1183
- MYH7,2005
- CFTR,1480