## Interacting with the OS and filesystem

The `os` module in Python provides many functions for interacting with the OS and the filesystem. Let's import it and try out some examples.

In [1]:
import os

In [2]:
os.makedirs('./data', exist_ok=True) # Create the directory if it doesn't already exist

In [3]:
'data' in os.listdir('.')

True
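
The `exist_ok=True` argument is what makes the `makedirs` call safe to re-run. A minimal sketch of the difference, using a temporary directory (an assumption made here so the example leaves the working directory untouched):

```python
import os
import tempfile

# Work inside a throwaway directory instead of the real ./data folder
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, 'data')

    # Creating the directory twice is fine when exist_ok=True
    os.makedirs(target, exist_ok=True)
    os.makedirs(target, exist_ok=True)

    # Without exist_ok, the call fails because the directory already exists
    try:
        os.makedirs(target)
        raised = False
    except FileExistsError:
        raised = True

    listing = os.listdir(tmp_dir)

print(raised)
print(listing)
```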

In [4]:
os.listdir('./data')

['.ipynb_checkpoints', 'loans1.txt', 'loans2.txt', 'loans3.txt']

In [5]:
url1 = 'https://hub.jovian.ml/wp-content/uploads/2020/08/loans1.txt'
url2 = 'https://hub.jovian.ml/wp-content/uploads/2020/08/loans2.txt'
url3 = 'https://hub.jovian.ml/wp-content/uploads/2020/08/loans3.txt'

In [6]:
import urllib.request

In [7]:
urllib.request.urlretrieve(url1, './data/loans1.txt')

('./data/loans1.txt', <http.client.HTTPMessage at 0x24c8865de48>)

In [8]:
urllib.request.urlretrieve(url2, './data/loans2.txt')

('./data/loans2.txt', <http.client.HTTPMessage at 0x24c88671bc8>)

In [9]:
urllib.request.urlretrieve(url3, './data/loans3.txt')

('./data/loans3.txt', <http.client.HTTPMessage at 0x24c8867d788>)

In [10]:
os.listdir('./data')

['.ipynb_checkpoints', 'loans1.txt', 'loans2.txt', 'loans3.txt']

### Reading from a file 

To read the contents of a file, we first need to open the file using the built-in `open` function. The `open` function returns a file object, which provides several methods for interacting with the contents of the file. It also accepts a `mode` argument that specifies how the file is opened, e.g. `'r'` for reading (the default) or `'w'` for writing.

In [11]:
file1 = open('data/loans1.txt', mode='r')
file1_contents = file1.read()
print(file1_contents)

amount,duration,rate,down_payment
100000,36,0.08,20000
200000,12,0.1,
628400,120,0.12,100000
4637400,240,0.06,
42900,90,0.07,8900
916000,16,0.13,
45230,48,0.08,4300
991360,99,0.08,
423000,27,0.09,47200
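
The `mode` argument passed to `open` controls what can be done with the file. A short sketch of the three most common text modes, written against a hypothetical file in a temporary directory so that the downloaded data is not modified:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'notes.txt')  # hypothetical file name

    # 'w' creates the file (or empties an existing one) for writing
    with open(path, mode='w') as f:
        f.write('first line\n')

    # 'a' keeps the existing contents and appends to the end
    with open(path, mode='a') as f:
        f.write('second line\n')

    # 'r' opens the file for reading
    with open(path, mode='r') as f:
        contents = f.read()

print(contents)
```

The `with` statement used here closes the file automatically when the block ends, so no explicit call to `close` is needed.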


### Reading a file line by line


File objects provide a `readlines` method, which returns the contents of a file as a list of lines.

In [12]:
file1.close()

In [13]:
with open('data/loans3.txt', 'r') as file3:
    file3_lines = file3.readlines()
file3_lines

['amount,duration,rate,down_payment\n',
 '883000,16,0.14,\n',
 '45230,48,0.07,4300\n',
 '100000,12,0.1,\n',
 '728400,120,0.12,100000\n',
 '3637400,240,0.06,\n',
 '82900,90,0.07,8900\n',
 '316000,16,0.13,\n',
 '15230,48,0.08,4300\n',
 '991360,99,0.08,\n',
 '323000,27,0.09,4720010000,36,0.08,20000\n',
 '528400,120,0.11,100000\n',
 '8633400,240,0.06,\n',
 '12900,90,0.08,8900\n']
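
A file object can also be looped over directly, yielding one line at a time, which avoids holding the whole file in a list. A small sketch with a hypothetical temporary file standing in for `loans3.txt`:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('amount,duration,rate,down_payment\n883000,16,0.14,\n')

    # readlines returns every line at once, as a list
    with open(path, 'r') as f:
        all_lines = f.readlines()

    # Looping over the file object yields the same lines one by one
    looped_lines = []
    with open(path, 'r') as f:
        for line in f:
            looped_lines.append(line)

print(all_lines)
print(all_lines == looped_lines)
```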

### Processing data from files

Before performing any operations on the data stored in a file, we need to convert the contents of the file from one large string into Python data types. For the file `loans1.txt` containing information about loans in a CSV format, we can do the following:

* Read the file line by line
* Parse the first line to get a list of the column names or headers
* Split each remaining line and convert each value into a float
* Create a dictionary for each loan using the headers as keys
* Create a list of dictionaries to keep track of all the loans

Since we will perform the same operations for multiple files, it would be useful to define a function `read_csv` to do this. We'll also define some helper functions to build up the functionality step by step.

Let's start by defining a function `parse_headers` which takes a line as input and returns a list of column headers.

In [14]:
def parse_headers(header_line):
    return header_line.strip().split(',')

The `strip` method removes any leading and trailing whitespace, including the newline character `\n`, and the `split` method breaks a string into a list using the given separator (`,` in this case).

In [15]:
file3_lines[0]

'amount,duration,rate,down_payment\n'

In [16]:
headers = parse_headers(file3_lines[0])

In [17]:
headers

['amount', 'duration', 'rate', 'down_payment']
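
To see what `strip` and `split` each contribute, here is a small sketch that applies them to literal strings copied from the outputs above:

```python
line = 'amount,duration,rate,down_payment\n'

# Without strip, the newline stays attached to the last item
without_strip = line.split(',')

# With strip, the trailing newline is removed before splitting
with_strip = line.strip().split(',')

# A trailing comma produces an empty string as the last item
trailing = '883000,16,0.14,'.split(',')

print(without_strip)
print(with_strip)
print(trailing)
```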

Next, let's define a function `parse_values` which takes a line containing some data, and returns a list of floating point numbers.

In [18]:
def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        values.append(float(item))
    return values

In [19]:
file3_lines[1]

'883000,16,0.14,\n'

In [20]:
file3_lines[2]

'45230,48,0.07,4300\n'

In [21]:
parse_values(file3_lines[2])

[45230.0, 48.0, 0.07, 4300.0]
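
The line above parses cleanly because every field is filled in. A minimal sketch of what happens with a line whose last field is empty, using the first version of `parse_values` and the text of `file3_lines[1]` copied from the output above; the `try`/`except` wrapper is added here only so the example can print the error instead of stopping:

```python
def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        values.append(float(item))
    return values

line = '883000,16,0.14,\n'  # same text as file3_lines[1]

error = None
try:
    parse_values(line)
except ValueError as e:
    # float('') cannot be evaluated, so the last field triggers the error
    error = e
    print('ValueError:', e)
```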

Calling `parse_values(file3_lines[1])` leads to a `ValueError` because the last field of that line is empty, and the empty string `''` cannot be converted to a float. We can enhance the `parse_values` function to handle this *edge case*.

In [22]:
def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        if item == '':
            values.append(0.0)
        else:
            values.append(float(item))
    return values

In [23]:
file3_lines[2]

'45230,48,0.07,4300\n'

In [24]:
parse_values(file3_lines[2])

[45230.0, 48.0, 0.07, 4300.0]
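
The unchanged result above shows that complete lines still parse as before. A quick sketch of the case the change was made for, with the text of `file3_lines[1]` copied from the earlier output:

```python
def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        if item == '':
            values.append(0.0)
        else:
            values.append(float(item))
    return values

# The empty last field is now replaced by 0.0 instead of raising an error
missing = parse_values('883000,16,0.14,\n')
print(missing)
```

Note that after this change a missing value and an explicit zero both become `0.0`, so the two can no longer be told apart in the parsed data.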

Next, let's define a function `create_item_dict` which takes a list of values and a list of headers as inputs, and returns a dictionary with the values associated with their respective headers as keys.

In [25]:
def create_item_dict(values, headers):
    result = {}
    for value, header in zip(values, headers):
        result[header] = value
    return result

In [26]:
file3_lines[1]

'883000,16,0.14,\n'

In [27]:
values1 = parse_values(file3_lines[1])
create_item_dict(values1, headers)

{'amount': 883000.0, 'duration': 16.0, 'rate': 0.14, 'down_payment': 0.0}

In [28]:
file3_lines[2]

'45230,48,0.07,4300\n'

In [29]:
values2 = parse_values(file3_lines[2])
create_item_dict(values2, headers)

{'amount': 45230.0, 'duration': 48.0, 'rate': 0.07, 'down_payment': 4300.0}

As expected, the values and headers are combined to create a dictionary with the appropriate key-value pairs.
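
Because `zip` stops at the shorter of its inputs, `create_item_dict` silently ignores any values beyond the number of headers. A sketch of that behavior, using the over-long line that appears in the `loans3.txt` output earlier:

```python
def create_item_dict(values, headers):
    result = {}
    for value, header in zip(values, headers):
        result[header] = value
    return result

headers = ['amount', 'duration', 'rate', 'down_payment']

# Seven values but only four headers: the last three values are dropped
values = [323000.0, 27.0, 0.09, 4720010000.0, 36.0, 0.08, 20000.0]
item = create_item_dict(values, headers)
print(item)

# The loop is equivalent to building a dict from zipped pairs
print(item == dict(zip(headers, values)))
```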

We are now ready to put it all together and define the `read_csv` function.

In [30]:
def read_csv(path):
    result = []
    # Open the file in read mode
    with open(path, 'r') as f:
        # Get a list of lines
        lines = f.readlines()
        # Parse the header
        headers = parse_headers(lines[0])
        # Loop over the remaining lines
        for data_line in lines[1:]:
            # Parse the values
            values = parse_values(data_line)
            # Create a dictionary using values & headers
            item_dict = create_item_dict(values, headers)
            # Add the dictionary to the result
            result.append(item_dict)
    return result

Let's try it out!

In [31]:
with open('data/loans2.txt') as file2:
    print(file2.read())

amount,duration,rate,down_payment
828400,120,0.11,100000
4633400,240,0.06,
42900,90,0.08,8900
983000,16,0.14,
15230,48,0.07,4300



In [32]:
read_csv('data/loans2.txt')

[{'amount': 828400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0},
 {'amount': 4633400.0, 'duration': 240.0, 'rate': 0.06, 'down_payment': 0.0},
 {'amount': 42900.0, 'duration': 90.0, 'rate': 0.08, 'down_payment': 8900.0},
 {'amount': 983000.0, 'duration': 16.0, 'rate': 0.14, 'down_payment': 0.0},
 {'amount': 15230.0, 'duration': 48.0, 'rate': 0.07, 'down_payment': 4300.0}]

The file is read and converted to a list of dictionaries, as expected. The `read_csv` function is generic enough that it can parse any simple CSV file containing only numeric values, with any number of rows or columns. Here's the full code for `read_csv` along with the helper functions:

In [33]:
def parse_headers(header_line):
    return header_line.strip().split(',')

def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        if item == '':
            values.append(0.0)
        else:
            values.append(float(item))
    return values

def create_item_dict(values, headers):
    result = {}
    for value, header in zip(values, headers):
        result[header] = value
    return result

def read_csv(path):
    result = []
    # Open the file in read mode
    with open(path, 'r') as f:
        # Get a list of lines
        lines = f.readlines()
        # Parse the header
        headers = parse_headers(lines[0])
        # Loop over the remaining lines
        for data_line in lines[1:]:
            # Parse the values
            values = parse_values(data_line)
            # Create a dictionary using values & headers
            item_dict = create_item_dict(values, headers)
            # Add the dictionary to the result
            result.append(item_dict)
    return result
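
For comparison, Python's standard library also ships a `csv` module that handles the header and splitting steps. The sketch below is an alternative, not the approach used in this tutorial: `read_csv_stdlib` is a hypothetical name, and an in-memory `io.StringIO` object stands in for an opened file.

```python
import csv
import io

def read_csv_stdlib(f):
    """Hypothetical helper: parse an open CSV file into a list of dictionaries."""
    result = []
    # DictReader uses the first line as keys and yields one dict per row
    for row in csv.DictReader(f):
        item = {}
        for header, value in row.items():
            # Same convention as parse_values: empty fields become 0.0
            item[header] = 0.0 if value == '' else float(value)
        result.append(item)
    return result

# In-memory stand-in for a file, with two rows copied from loans2.txt
sample = io.StringIO(
    'amount,duration,rate,down_payment\n'
    '828400,120,0.11,100000\n'
    '4633400,240,0.06,\n'
)
loans = read_csv_stdlib(sample)
print(loans)
```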

In [34]:
import math

def loan_emi(amount, duration, rate, down_payment=0):
    """Calculates the equal montly installment (EMI) for a loan.
    
    Arguments:
        amount - Total amount to be spent (loan + down payment)
        duration - Duration of the loan (in months)
        rate - Rate of interest (monthly)
        down_payment (optional) - Optional initial payment (deducted from amount)
    """
    loan_amount = amount - down_payment
    try:
        emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    except ZeroDivisionError:
        emi = loan_amount / duration
    emi = math.ceil(emi)
    return emi
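
The `try`/`except` block in `loan_emi` exists for interest-free loans: with a rate of zero, the denominator becomes zero, so the function falls back to dividing the loan amount evenly across the months. A sketch exercising both branches; the function body is repeated (without the docstring) so the example runs on its own, and the interest-free figures are made up for illustration:

```python
import math

def loan_emi(amount, duration, rate, down_payment=0):
    loan_amount = amount - down_payment
    try:
        emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    except ZeroDivisionError:
        # Rate of zero: spread the loan amount evenly over the duration
        emi = loan_amount / duration
    emi = math.ceil(emi)
    return emi

# Hypothetical interest-free loan: 120000 over 12 months
interest_free = loan_emi(120000, 12, 0)

# First loan from loans2.txt, with the yearly rate converted to a monthly rate
with_interest = loan_emi(828400, 120, 0.11/12, 100000)

print(interest_free, with_interest)
```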

In [35]:
loans2 = read_csv('data/loans2.txt')
loans2

[{'amount': 828400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0},
 {'amount': 4633400.0, 'duration': 240.0, 'rate': 0.06, 'down_payment': 0.0},
 {'amount': 42900.0, 'duration': 90.0, 'rate': 0.08, 'down_payment': 8900.0},
 {'amount': 983000.0, 'duration': 16.0, 'rate': 0.14, 'down_payment': 0.0},
 {'amount': 15230.0, 'duration': 48.0, 'rate': 0.07, 'down_payment': 4300.0}]

In [36]:
for loan in loans2:
    loan['emi'] = loan_emi(loan['amount'], 
                           loan['duration'], 
                           loan['rate']/12, # the CSV contains yearly rates
                           loan['down_payment'])
    
loans2

[{'amount': 828400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0,
  'emi': 10034},
 {'amount': 4633400.0,
  'duration': 240.0,
  'rate': 0.06,
  'down_payment': 0.0,
  'emi': 33196},
 {'amount': 42900.0,
  'duration': 90.0,
  'rate': 0.08,
  'down_payment': 8900.0,
  'emi': 504},
 {'amount': 983000.0,
  'duration': 16.0,
  'rate': 0.14,
  'down_payment': 0.0,
  'emi': 67707},
 {'amount': 15230.0,
  'duration': 48.0,
  'rate': 0.07,
  'down_payment': 4300.0,
  'emi': 262}]

You can see that each loan now has a new key `emi`, which provides the EMI for the loan. We can extract this logic into a function, so that it can be used for other files too.

In [37]:
def compute_emis(loans):
    for loan in loans:
        loan['emi'] = loan_emi(
            loan['amount'], 
            loan['duration'], 
            loan['rate']/12, # the CSV contains yearly rates
            loan['down_payment'])

### Writing to files

Now that we have performed some processing on the data, it would be a good idea to write the results back to a file in the CSV format. We can do this by creating/opening a file in write mode with `open` and using the `.write` method of the file object. The string `format` method will be useful for building each line of the file.

In [38]:
loans2 = read_csv('data/loans2.txt')

In [39]:
compute_emis(loans2)
loans2

[{'amount': 828400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0,
  'emi': 10034},
 {'amount': 4633400.0,
  'duration': 240.0,
  'rate': 0.06,
  'down_payment': 0.0,
  'emi': 33196},
 {'amount': 42900.0,
  'duration': 90.0,
  'rate': 0.08,
  'down_payment': 8900.0,
  'emi': 504},
 {'amount': 983000.0,
  'duration': 16.0,
  'rate': 0.14,
  'down_payment': 0.0,
  'emi': 67707},
 {'amount': 15230.0,
  'duration': 48.0,
  'rate': 0.07,
  'down_payment': 4300.0,
  'emi': 262}]

In [40]:
with open('data/emis2.txt', 'w') as f:
    for loan in loans2:
        f.write('{},{},{},{},{}\n'.format(
            loan['amount'], 
            loan['duration'], 
            loan['rate'], 
            loan['down_payment'], 
            loan['emi']))

In [41]:
os.listdir('data')

['.ipynb_checkpoints', 'emis2.txt', 'loans1.txt', 'loans2.txt', 'loans3.txt']

In [42]:
with open('data/emis2.txt', 'r') as f:
    print(f.read())

828400.0,120.0,0.11,100000.0,10034
4633400.0,240.0,0.06,0.0,33196
42900.0,90.0,0.08,8900.0,504
983000.0,16.0,0.14,0.0,67707
15230.0,48.0,0.07,4300.0,262



Great, looks like the loan details (along with the computed EMIs) were written into the file.

Let's define a generic function `write_csv` which takes a list of dictionaries and writes it to a file in CSV format. We will also include the column headers in the first line.

In [43]:
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")
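
The standard library's `csv.DictWriter` can produce the same layout. The sketch below is an alternative shown for comparison, not the tutorial's method: `write_csv_stdlib` is a hypothetical name, and it writes to an in-memory `io.StringIO` buffer rather than a path so that the result can be inspected directly.

```python
import csv
import io

def write_csv_stdlib(items, f):
    """Hypothetical helper: write a list of dictionaries to an open file object."""
    # Nothing to write for an empty list
    if len(items) == 0:
        return
    headers = list(items[0].keys())
    # lineterminator='\n' matches write_csv; the csv module defaults to '\r\n'
    writer = csv.DictWriter(f, fieldnames=headers, lineterminator='\n')
    writer.writeheader()
    writer.writerows(items)

buffer = io.StringIO()
write_csv_stdlib([{'amount': 15230.0, 'emi': 262},
                  {'amount': 42900.0, 'emi': 504}], buffer)
output = buffer.getvalue()
print(output)
```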

In [44]:
loans3 = read_csv('data/loans3.txt')

In [45]:
compute_emis(loans3)

In [46]:
write_csv(loans3, 'data/emis3.txt')

In [47]:
with open('data/emis3.txt', 'r') as f:
    print(f.read())

amount,duration,rate,down_payment,emi
883000.0,16.0,0.14,0.0,60819
45230.0,48.0,0.07,4300.0,981
100000.0,12.0,0.1,0.0,8792
728400.0,120.0,0.12,100000.0,9016
3637400.0,240.0,0.06,0.0,26060
82900.0,90.0,0.07,8900.0,1060
316000.0,16.0,0.13,0.0,21618
15230.0,48.0,0.08,4300.0,267
991360.0,99.0,0.08,0.0,13712
323000.0,27.0,0.09,4720010000.0,-193751447
528400.0,120.0,0.11,100000.0,5902
8633400.0,240.0,0.06,0.0,61853
12900.0,90.0,0.08,8900.0,60



With just 4 lines of code, we can now read each downloaded file, calculate the EMIs, and write the results back to new files:

In [48]:
for i in range(1,4):
    loans = read_csv('data/loans{}.txt'.format(i))
    compute_emis(loans)
    write_csv(loans, 'data/emis{}.txt'.format(i))
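
The `range(1, 4)` in the loop above yields 1, 2 and 3, and `format` substitutes each number into the file name. A small sketch of just that step, with the equivalent f-string shown for comparison:

```python
# File names built with str.format, as in the loop above
paths_format = ['data/loans{}.txt'.format(i) for i in range(1, 4)]

# The same names built with f-strings
paths_fstring = [f'data/loans{i}.txt' for i in range(1, 4)]

print(paths_format)
print(paths_format == paths_fstring)
```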

In [49]:
os.listdir('data')

['.ipynb_checkpoints',
 'emis1.txt',
 'emis2.txt',
 'emis3.txt',
 'loans1.txt',
 'loans2.txt',
 'loans3.txt']

Once all the functions are defined, we can calculate EMIs for thousands or even millions of loans across many files with just a few lines of code, in a few seconds. Now we're starting to see the real power of using a programming language like Python for processing data!