The original file is `original.txt`. Below, we read it and re-save it in several different ways.

In [1]:
import csv
import pandas as pd
import numpy as np
import sys


# Print Python version
print("Python version:", sys.version)

# Print Pandas version
print("Pandas version:", pd.__version__)

original_file = "original.txt"
test1_file = "test1_normal.txt"
test2_file = "test2_read_csv.txt"
test3_file = "test3_to_csv.txt"
test4_file = "test4_both_read_write.txt"

Python version: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:37:49) [MSC v.1916 64 bit (AMD64)]
Pandas version: 1.3.5


### Method 1: read and write using the normal way (plain file I/O)

In [2]:
with open(original_file, 'r') as f:
    with open(test1_file, 'w') as w:
        for line in f:
            line = line.replace("\n", "").split("\t")
            line = "\t".join(line)
            w.write(line + "\n")

### Method 2: read using pd.read_csv

In [3]:
df = pd.read_csv(original_file, sep='\t', encoding='utf-8-sig')
header = df.columns.tolist()
header = '\t'.join(header)
with open(test2_file, 'w') as w:
    w.write(header + "\n")
    for i in range(len(df)):
        row = df.loc[i, :].values.tolist()
        row = [str(value) for value in row]  # avoid shadowing the loop index i
        w.write('\t'.join(row) + "\n")

### Method 3: write using csv.writer

In [4]:
def write_row_to_file(file_name, row, delimiter='\t'):
    with open(file_name, 'a+', newline="", encoding='utf-8') as f:
        writer = csv.writer(f, delimiter=delimiter)
        writer.writerow(row)

In [5]:
with open(original_file, 'r') as f:
    for line in f:
        line = line.replace("\n", "").split("\t")
        write_row_to_file(test3_file, line, delimiter='\t')

### Method 4: read and write using pandas

In [6]:
df4 = pd.read_csv(original_file, sep='\t', encoding='utf-8-sig')
# `line_terminator` is the spelling in pandas 1.3.5; pandas 1.5+ renamed it to `lineterminator`.
df4.to_csv(test4_file, sep='\t', index=False, header=True, encoding='utf-8-sig', line_terminator='\n')

### Test the length of each line

In [7]:
def print_line_length(file_name):
    print(f"\n{file_name}:")
    # Use a context manager so the file is closed after reading.
    with open(file_name, 'r', encoding='utf-8') as _file:
        for i in range(2):
            _len = len(_file.readline())
            print(f"* line {i}: length is {_len}")

In [8]:
print_line_length(original_file)
print_line_length(test1_file)
print_line_length(test2_file)
print_line_length(test3_file)
print_line_length(test4_file)


original.txt:
* line 0: length is 172
* line 1: length is 79

test1_normal.txt:
* line 0: length is 172
* line 1: length is 79

test2_read_csv.txt:
* line 0: length is 172
* line 1: length is 78

test3_to_csv.txt:
* line 0: length is 172
* line 1: length is 79

test4_both_read_write.txt:
* line 0: length is 173
* line 1: length is 78


From the results, we can see that `test2_read_csv.txt` and `test4_both_read_write.txt` differ from the original. Let's look at the first character of each line:

In [9]:
with open(original_file, 'r', encoding='utf-8') as f1, open(test4_file, 'r', encoding='utf-8') as f2:
    for i, (line1, line2) in enumerate(zip(f1, f2)):
        print(f"\nFor original_file, the length of row {i} is {len(line1)}. The first character is {line1[0]}")
        print(f"For test4_file, the length of row {i} is {len(line2)}. The first character is {line2[0]}")
        if i == 1: break


For original_file, the length of row 0 is 172. The first character is c
For test4_file, the length of row 0 is 173. The first character is ﻿

For original_file, the length of row 1 is 79. The first character is R
For test4_file, the length of row 1 is 78. The first character is R
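
The first character of row 0 in `test4_file` prints as nothing visible. A minimal sketch of how such a character can be made visible with `repr`, assuming it is a byte order mark written by the `utf-8-sig` codec. The header text and the file name `bom_demo.txt` are hypothetical and only serve the illustration:

```python
import os
import tempfile

# Hypothetical header text, used only for this illustration.
header = "col_a\tcol_b\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bom_demo.txt")
    # "utf-8-sig" prepends a BOM (U+FEFF) when writing.
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(header)

    # Read as plain UTF-8: the BOM survives as the first character.
    with open(path, "r", encoding="utf-8") as f:
        with_bom = f.readline()
    # Read as "utf-8-sig": the BOM is stripped.
    with open(path, "r", encoding="utf-8-sig") as f:
        without_bom = f.readline()

print(repr(with_bom[0]), len(with_bom))        # '\ufeff' 13
print(repr(without_bom[0]), len(without_bom))  # 'c' 12
```

Printing `repr(line[0])` instead of `line[0]` is what exposes the invisible character.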


### Compare the two lines

In [10]:
import difflib
def compare_strings_with_difflib(str1, str2):
    d = difflib.Differ()
    diff = list(d.compare(str1, str2))
    return diff
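
When `Differ.compare` is given two strings instead of two lists of lines, it compares them character by character. A small sketch of how to read its output, using the two values that differ in this experiment. The `changes` filter is an addition for illustration, not part of the notebook:

```python
import difflib

# Differ.compare treats each string as a sequence of characters here.
diff = list(difflib.Differ().compare("0.00730", "0.0073"))

# Keep only entries that differ: "- " is only in the first string,
# "+ " is only in the second.
changes = [entry for entry in diff if entry.startswith(("- ", "+ "))]
print(changes)  # ['- 0']
```

Filtering like this avoids scrolling through a long list of unchanged characters.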

In [11]:
with open(original_file, 'r', encoding='utf-8') as f1, open(test4_file, 'r', encoding='utf-8') as f2:
    for i, (line1, line2) in enumerate(zip(f1, f2)):
        if i == 0:
            s1 = line1[0:5]
            s2 = line2[0:5]
            print(f"\nFor Row {i}: s1 is {s1}, s2 is {s2}")
            differences = compare_strings_with_difflib(s1, s2)
            print('\n'.join(differences))
        if i == 1: 
            s1 = line1
            s2 = line2
            print(f"\nFor Row {i}: s1 is {s1}, s2 is {s2}")
            differences = compare_strings_with_difflib(s1, s2)
            print('\n'.join(differences))
            break


For Row 0: s1 is conti, s2 is ﻿cont
+ ﻿
  c
  o
  n
  t
- i

For Row 1: s1 is R2_55_2	7	AGCTG	2	t	1	113.21	3.874	0.00730	AGCTG	117.44	3.55	-1.06	30669	30691
, s2 is R2_55_2	7	AGCTG	2	t	1	113.21	3.874	0.0073	AGCTG	117.44	3.55	-1.06	30669	30691

  R
  2
  _
  5
  5
  _
  2
  	
  7
  	
  A
  G
  C
  T
  G
  	
  2
  	
  t
  	
  1
  	
  1
  1
  3
  .
  2
  1
  	
  3
  .
  8
  7
  4
  	
  0
  .
  0
  0
  7
  3
- 0
  	
  A
  G
  C
  T
  G
  	
  1
  1
  7
  .
  4
  4
  	
  3
  .
  5
  5
  	
  -
  1
  .
  0
  6
  	
  3
  0
  6
  6
  9
  	
  3
  0
  6
  9
  1
  



### Why are they different?

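If you use `pd.read_csv`, pandas infers that the column is numeric and parses `0.00730` as a float. When the value is converted back to text it becomes `0.0073`, so the trailing zero is lost and line 1 is one character shorter (78 instead of 79).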
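
The same loss can be reproduced without pandas. This is a minimal sketch that assumes the column ends up as ordinary 64-bit floats, which behave like Python's built-in `float`:

```python
# The field as it appears in the file, and after a float round trip.
raw_field = "0.00730"
parsed = float(raw_field)     # the numeric value; the trailing zero is not stored
written_back = str(parsed)    # shortest text that reproduces the float

print(written_back)                         # 0.0073
print(len(raw_field) - len(written_back))   # 1
```

The one-character difference matches the gap between 79 and 78 seen above.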

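If you use `df.to_csv` with `encoding='utf-8-sig'`, pandas writes a byte order mark (BOM, U+FEFF) at the start of the file. It is not a blank, and it is unrelated to `index=False`. Reading the file back with `encoding='utf-8'` keeps the BOM as an invisible first character, which is why `len(file.readline())` reports 173 instead of 172 for the header line.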
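
One possible way to get an unchanged round trip with pandas is to read every column as text and write without a BOM. The sketch below is an assumption about how that could look, not the notebook's own method. The two-line sample and the file names `sample_in.txt` and `sample_out.txt` are hypothetical; the real `original.txt` has more columns:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-line sample standing in for original.txt.
sample_text = "name\tvalue\nR1\t0.00730\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "sample_in.txt")
    dst_path = os.path.join(tmp_dir, "sample_out.txt")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write(sample_text)

    # dtype=str keeps every field as text, so "0.00730" is not parsed as a float.
    # keep_default_na=False stops strings such as "NA" from becoming NaN.
    df = pd.read_csv(src_path, sep="\t", dtype=str, keep_default_na=False,
                     encoding="utf-8-sig")
    # Plain "utf-8" on output, so no BOM is prepended.
    df.to_csv(dst_path, sep="\t", index=False, encoding="utf-8")

    with open(dst_path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

print(round_tripped == sample_text)
```

Reading with `utf-8-sig` is harmless when the input has no BOM, so it can stay on the read side. Fields that contain tabs or quotes may still be quoted by `to_csv`, which this sketch does not cover.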