In [1]:
import os

We can check the present working directory using the `os.getcwd` function.

In [2]:
os.getcwd()

'/home/jovyan'

To get the list of files in a directory, use `os.listdir`. You pass an absolute or relative path of a directory as the argument to the function.

In [5]:
help(os.listdir)

Help on built-in function listdir in module posix:

listdir(path=None)
    Return a list containing the names of the files in the directory.
    
    path can be specified as either str, bytes, or a path-like object.  If path is bytes,
      the filenames returned will also be bytes; in all other circumstances
      the filenames returned will be str.
    If path is None, uses the path='.'.
    On some platforms, path may also be specified as an open file descriptor;\
      the file descriptor must refer to a directory.
      If this functionality is unavailable, using it raises NotImplementedError.
    
    The list is in arbitrary order.  It does not include the special
    entries '.' and '..' even if they are present in the directory.



In [6]:
os.listdir('.') # relative path

['.bash_logout',
 '.profile',
 '.bashrc',
 '.ipynb_checkpoints',
 '.ipython',
 'python-os-and-filesystem.ipynb',
 '.local',
 '.cache',
 '.jupyter',
 '.jovian',
 '.config',
 '.conda',
 '.wget-hsts',
 '.jovianrc',
 '.git',
 'work',
 '.npm']

In [7]:
os.listdir('/usr') # absolute path

['lib',
 'src',
 'bin',
 'lib32',
 'share',
 'include',
 'libx32',
 'games',
 'sbin',
 'local',
 'lib64']
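
Note that `os.listdir` returns bare names for both files and subdirectories. The following is a minimal sketch of telling them apart, assuming a throwaway directory created with `tempfile` (the names `notes.txt` and `sub` are made up for illustration):

```python
import os
import tempfile

# Build a scratch directory so the example does not depend on your machine
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'sub'))
    with open(os.path.join(root, 'notes.txt'), 'w') as f:
        f.write('hello')

    # os.listdir returns names only, so join each one with the parent path
    names = sorted(os.listdir(root))
    dirs = [n for n in names if os.path.isdir(os.path.join(root, n))]
    files = [n for n in names if os.path.isfile(os.path.join(root, n))]

print(names)  # ['notes.txt', 'sub']
print(dirs)   # ['sub']
print(files)  # ['notes.txt']
```

The list is sorted here only to make the output predictable, since `os.listdir` returns entries in arbitrary order.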

You can create a new directory using `os.makedirs`. Let's create a new directory called `data`, where we'll later download some files.

In [8]:
os.makedirs('./data', exist_ok=True)

Can you figure out what the argument `exist_ok` does? Try using the `help` function or [read the documentation](https://docs.python.org/3/library/os.html#os.makedirs).

Let's verify that the directory was created and is currently empty.

In [9]:
'data' in os.listdir('.')

True

In [10]:
os.listdir('./data')

[]
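
To see what `exist_ok` changes, here is a small sketch that runs inside a temporary directory (an assumption made so that your own `data` directory is left alone). Without `exist_ok=True`, creating a directory that already exists raises `FileExistsError`:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, 'data')

    os.makedirs(target)                 # first call creates the directory
    os.makedirs(target, exist_ok=True)  # no error although it already exists

    try:
        os.makedirs(target)             # exist_ok defaults to False
        raised = False
    except FileExistsError:
        raised = True

print(raised)  # True
```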

Let us download some files into the `data` directory using the `urllib` module. The variables `url1`, `url2`, and `url3` are assumed to hold the URLs of the files to download.

In [12]:
from urllib.request import urlretrieve

In [13]:
urlretrieve(url1, './data/loans1.txt')

('./data/loans1.txt', <http.client.HTTPMessage at 0x7f285c602910>)

In [14]:
urlretrieve(url2, './data/loans2.txt')

('./data/loans2.txt', <http.client.HTTPMessage at 0x7f285c608df0>)

In [15]:
urlretrieve(url3, './data/loans3.txt')

('./data/loans3.txt', <http.client.HTTPMessage at 0x7f285c617520>)

Let's verify that the files were downloaded.

In [16]:
os.listdir('./data')

['loans2.txt', 'loans1.txt', 'loans3.txt']
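
If you want to experiment with `urlretrieve` without network access, the sketch below uses a local `file://` URL as a stand-in for a remote one. The file names `source.txt` and `loans_copy.txt` are hypothetical, and everything happens inside a temporary directory:

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

with tempfile.TemporaryDirectory() as root:
    # Stand-in for a remote file: a small local CSV exposed as a file:// URL
    source = Path(root) / 'source.txt'
    source.write_text('amount,duration,rate,down_payment\n100000,36,0.08,20000\n')
    local_url = source.as_uri()

    os.makedirs(os.path.join(root, 'data'), exist_ok=True)
    destination = os.path.join(root, 'data', 'loans_copy.txt')

    # urlretrieve returns the destination path and the response headers
    saved_path, headers = urlretrieve(local_url, destination)

    with open(saved_path) as f:
        downloaded = f.read()

print(downloaded)
```

As in the cells above, the first element of the returned tuple is the path the file was saved to.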

You can also use the [`requests`](https://docs.python-requests.org/en/master/) library to download URLs, although you'll need to [write some additional code](https://stackoverflow.com/questions/44699682/how-to-save-a-file-downloaded-from-requests-to-another-directory) to save the contents of the page to a file.

To read the contents of a file, we first open it using the built-in `open` function, which returns a file object.

In [17]:
file1 = open('./data/loans1.txt', mode='r')

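The `open` function also accepts a `mode` argument that specifies how we can interact with the file. Commonly used modes include `'r'` (read, the default), `'w'` (write, truncating any existing contents), `'a'` (append), and `'b'` (binary, combined with another mode, as in `'rb'`).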
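
Here is a minimal sketch of the read, write, and append modes. It uses a temporary file so that nothing in your working directory is touched, and the file name `modes.txt` is made up for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, 'modes.txt')

    # 'w' creates the file, or truncates it if it already exists
    with open(path, 'w') as f:
        f.write('first line\n')

    # 'a' keeps the existing contents and appends at the end
    with open(path, 'a') as f:
        f.write('second line\n')

    # 'r' (the default) opens the file for reading only
    with open(path, 'r') as f:
        contents = f.read()

print(contents)
```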

To view the contents of the file, we can use the `read` method of the file object.

In [18]:
file1_contents = file1.read()

In [19]:
print(file1_contents)

amount,duration,rate,down_payment
100000,36,0.08,20000
200000,12,0.1,
628400,120,0.12,100000
4637400,240,0.06,
42900,90,0.07,8900
916000,16,0.13,
45230,48,0.08,4300
991360,99,0.08,
423000,27,0.09,47200


In [20]:
file1.close()

Once a file is closed, you can no longer read from it.

In [21]:
file1.read()

ValueError: I/O operation on closed file.

## Closing files automatically using `with`

To close a file automatically after you've processed it, you can open it using the `with` statement.

In [22]:
with open('./data/loans2.txt') as file2:
    file2_contents = file2.read()
    print(file2_contents)

amount,duration,rate,down_payment
828400,120,0.11,100000
4633400,240,0.06,
42900,90,0.08,8900
983000,16,0.14,
15230,48,0.07,4300


Once the statements within the `with` block are executed, the `.close` method on `file2` is automatically invoked. Let's verify this by trying to read from the file object again.
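
A minimal sketch of that check, using a temporary file (the name `sample.txt` is an assumption) instead of the downloaded data:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, 'sample.txt')
    with open(path, 'w') as f:
        f.write('amount,duration\n')

    with open(path) as sample_file:
        sample_file.read()

    # The with block has ended, so the file object is already closed
    is_closed = sample_file.closed

    try:
        sample_file.read()
        error_message = None
    except ValueError as e:
        error_message = str(e)

print(is_closed)      # True
print(error_message)
```

Reading from the closed file raises the same `ValueError` we saw after calling `close` manually.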

## Reading a file line by line


File objects provide a `readlines` method to read a file line-by-line. 

In [24]:
with open('./data/loans3.txt', 'r') as file3:
    file3_lines = file3.readlines()

In [25]:
file3_lines

['amount,duration,rate,down_payment\n',
 '45230,48,0.07,4300\n',
 '883000,16,0.14,\n',
 '100000,12,0.1,\n',
 '728400,120,0.12,100000\n',
 '3637400,240,0.06,\n',
 '82900,90,0.07,8900\n',
 '316000,16,0.13,\n',
 '15230,48,0.08,4300\n',
 '991360,99,0.08,\n',
 '323000,27,0.09,4720010000,36,0.08,20000\n',
 '528400,120,0.11,100000\n',
 '8633400,240,0.06,\n',
 '12900,90,0.08,8900']
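
`readlines` loads every line into a list at once. For larger files, you can also iterate over the file object directly, which yields one line at a time. A small sketch, assuming a made-up three-line file written to a temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, 'sample.txt')
    with open(path, 'w') as f:
        f.write('amount,duration\n100000,36\n200000,12')

    lines = []
    with open(path) as f:
        # Iterating over the file object yields one line at a time,
        # so the whole file never has to be held in memory at once
        for line in f:
            lines.append(line)

print(lines)  # ['amount,duration\n', '100000,36\n', '200000,12']
```

As with `readlines`, each line keeps its trailing `\n`, except the last one if the file does not end with a newline.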

In [26]:
def parse_headers(header_line):
    return header_line.strip().split(',')

The `strip` method removes leading and trailing whitespace, including the newline character `\n`. The `split` method breaks a string into a list using the given separator (`,` in this case).
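
To make the two steps concrete, here is a short sketch applied to one of the data lines from the file. Note how a trailing comma produces an empty string as the last field:

```python
line = '883000,16,0.14,\n'

stripped = line.strip()        # removes leading/trailing whitespace, including '\n'
fields = stripped.split(',')   # a trailing comma yields an empty last field

print(repr(stripped))  # '883000,16,0.14,'
print(fields)          # ['883000', '16', '0.14', '']
```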

In [27]:
file3_lines[0]

'amount,duration,rate,down_payment\n'

In [28]:
headers = parse_headers(file3_lines[0])

In [29]:
headers

['amount', 'duration', 'rate', 'down_payment']

Next, let's define a function `parse_values` that takes a line containing some data and returns a list of floating-point numbers.

In [30]:
def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        values.append(float(item))
    return values

In [31]:
file3_lines[1]

'45230,48,0.07,4300\n'

In [32]:
parse_values(file3_lines[1])

[45230.0, 48.0, 0.07, 4300.0]

The values were parsed and converted to floating point numbers, as expected. Let's try it for another line from the file, which does not contain a value for the down payment.

In [33]:
file3_lines[2]

'883000,16,0.14,\n'
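
Passing this line to the first version of `parse_values` fails: the trailing comma produces an empty string, and `float('')` raises a `ValueError`. A minimal sketch of that failure, written as a standalone snippet rather than a call to the function itself:

```python
fields = '883000,16,0.14,\n'.strip().split(',')

try:
    # The last field is '', which cannot be converted to a float
    values = [float(item) for item in fields]
except ValueError as e:
    values = None
    print('ValueError:', e)
```

The revised `parse_values` that follows handles this case by substituting `0.0` for empty fields.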

In [35]:
def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        if item == '':
            values.append(0.0)
        else:
            try:
                values.append(float(item))
            except ValueError:
                values.append(item)
    return values

In [36]:
file3_lines[2]

'883000,16,0.14,\n'

In [37]:
parse_values(file3_lines[2])

[883000.0, 16.0, 0.14, 0.0]

Next, let's define a function `create_item_dict` that takes a list of values and a list of headers as inputs and returns a dictionary with the values associated with their respective headers as keys.


In [38]:
def create_item_dict(values, headers):
    result = {}
    for value, header in zip(values, headers):
        result[header] = value
    return result

Can you figure out what the Python built-in function `zip` does? Try out an example, or [read the documentation](https://docs.python.org/3.3/library/functions.html#zip).

In [39]:
for item in zip([1,2,3], ['a', 'b', 'c']):
    print(item)

(1, 'a')
(2, 'b')
(3, 'c')


Let's try out `create_item_dict` with a couple of examples.

In [40]:
file3_lines[1]

'45230,48,0.07,4300\n'

In [41]:
values1 = parse_values(file3_lines[1])
create_item_dict(values1, headers)

{'amount': 45230.0, 'duration': 48.0, 'rate': 0.07, 'down_payment': 4300.0}

In [42]:
file3_lines[2]

'883000,16,0.14,\n'

In [43]:
values2 = parse_values(file3_lines[2])
create_item_dict(values2, headers)

{'amount': 883000.0, 'duration': 16.0, 'rate': 0.14, 'down_payment': 0.0}

As expected, the values and headers are combined to create a dictionary with the appropriate key-value pairs.
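
As an aside, the same pairing can be written more compactly by passing the result of `zip` straight to `dict`. This is only a sketch of an alternative, not a replacement for `create_item_dict`:

```python
headers = ['amount', 'duration', 'rate', 'down_payment']
values = [45230.0, 48.0, 0.07, 4300.0]

# dict() accepts an iterable of (key, value) pairs, which is what zip produces
item = dict(zip(headers, values))
print(item)

# zip stops at the shorter input, so a missing value silently drops a key
short = dict(zip(headers, [45230.0, 48.0]))
print(short)  # {'amount': 45230.0, 'duration': 48.0}
```

The second call shows a caveat that applies to `create_item_dict` as well: if a line has fewer values than there are headers, the extra headers are left out without any error.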

We are now ready to put it all together and define the `read_csv` function.

In [44]:
def read_csv(path):
    result = []
    # Open the file in read mode
    with open(path, 'r') as f:
        # Get a list of lines
        lines = f.readlines()
        # Parse the header
        headers = parse_headers(lines[0])
        # Loop over the remaining lines
        for data_line in lines[1:]:
            # Parse the values
            values = parse_values(data_line)
            # Create a dictionary using values & headers
            item_dict = create_item_dict(values, headers)
            # Add the dictionary to the result
            result.append(item_dict)
    return result

Let's try it out!

In [45]:
with open('./data/loans2.txt') as file2:
    print(file2.read())

amount,duration,rate,down_payment
828400,120,0.11,100000
4633400,240,0.06,
42900,90,0.08,8900
983000,16,0.14,
15230,48,0.07,4300


In [46]:
read_csv('./data/loans2.txt')

[{'amount': 828400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0},
 {'amount': 4633400.0, 'duration': 240.0, 'rate': 0.06, 'down_payment': 0.0},
 {'amount': 42900.0, 'duration': 90.0, 'rate': 0.08, 'down_payment': 8900.0},
 {'amount': 983000.0, 'duration': 16.0, 'rate': 0.14, 'down_payment': 0.0},
 {'amount': 15230.0, 'duration': 48.0, 'rate': 0.07, 'down_payment': 4300.0}]

The file is read and converted to a list of dictionaries, as expected. The `read_csv` function is generic enough to parse simple CSV files with any number of rows or columns, as long as the values themselves don't contain commas. Here's the full code for `read_csv` along with the helper functions:

In [47]:
def parse_headers(header_line):
    return header_line.strip().split(',')

def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        if item == '':
            values.append(0.0)
        else:
            try:
                values.append(float(item))
            except ValueError:
                values.append(item)
    return values

def create_item_dict(values, headers):
    result = {}
    for value, header in zip(values, headers):
        result[header] = value
    return result

def read_csv(path):
    result = []
    # Open the file in read mode
    with open(path, 'r') as f:
        # Get a list of lines
        lines = f.readlines()
        # Parse the header
        headers = parse_headers(lines[0])
        # Loop over the remaining lines
        for data_line in lines[1:]:
            # Parse the values
            values = parse_values(data_line)
            # Create a dictionary using values & headers
            item_dict = create_item_dict(values, headers)
            # Add the dictionary to the result
            result.append(item_dict)
    return result

Try to create small, generic, and reusable functions whenever possible. They will likely be useful beyond just the problem at hand and save you significant effort in the future.

In the [previous tutorial](https://jovian.ml/aakashns/python-functions-and-scope), we defined a function to calculate the equal monthly installments for a loan. Here's what it looked like:

In [48]:
import math

def loan_emi(amount, duration, rate, down_payment=0):
    """Calculates the equal monthly installment (EMI) for a loan.
    
    Arguments:
        amount - Total amount to be spent (loan + down payment)
        duration - Duration of the loan (in months)
        rate - Rate of interest (monthly)
        down_payment (optional) - Optional initial payment (deducted from amount)
    """
    loan_amount = amount - down_payment
    try:
        emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    except ZeroDivisionError:
        emi = loan_amount / duration
    emi = math.ceil(emi)
    return emi
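
The `except ZeroDivisionError` branch covers interest-free loans. The sketch below repeats the function (without its docstring) so that it runs on its own, and calls it with made-up figures to show the fallback:

```python
import math

def loan_emi(amount, duration, rate, down_payment=0):
    loan_amount = amount - down_payment
    try:
        emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    except ZeroDivisionError:
        emi = loan_amount / duration
    return math.ceil(emi)

# With rate == 0 the denominator is (1 ** duration) - 1 == 0, so the
# fallback spreads the principal evenly over the duration
print(loan_emi(1200, 12, 0))       # 100
print(loan_emi(1200, 12, 0, 600))  # 50
```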

We can use this function to calculate EMIs for all the loans in a file.

In [49]:
loans2 = read_csv('./data/loans2.txt')

In [50]:
loans2

[{'amount': 828400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0},
 {'amount': 4633400.0, 'duration': 240.0, 'rate': 0.06, 'down_payment': 0.0},
 {'amount': 42900.0, 'duration': 90.0, 'rate': 0.08, 'down_payment': 8900.0},
 {'amount': 983000.0, 'duration': 16.0, 'rate': 0.14, 'down_payment': 0.0},
 {'amount': 15230.0, 'duration': 48.0, 'rate': 0.07, 'down_payment': 4300.0}]

In [51]:
for loan in loans2:
    loan['emi'] = loan_emi(loan['amount'], 
                           loan['duration'], 
                           loan['rate']/12, # the CSV contains yearly rates
                           loan['down_payment'])

In [52]:
loans2

[{'amount': 828400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0,
  'emi': 10034},
 {'amount': 4633400.0,
  'duration': 240.0,
  'rate': 0.06,
  'down_payment': 0.0,
  'emi': 33196},
 {'amount': 42900.0,
  'duration': 90.0,
  'rate': 0.08,
  'down_payment': 8900.0,
  'emi': 504},
 {'amount': 983000.0,
  'duration': 16.0,
  'rate': 0.14,
  'down_payment': 0.0,
  'emi': 67707},
 {'amount': 15230.0,
  'duration': 48.0,
  'rate': 0.07,
  'down_payment': 4300.0,
  'emi': 262}]

You can see that each loan now has a new key `emi`, which provides the EMI for the loan. We can extract this logic into a function so that we can use it for other files too.

In [53]:
def compute_emis(loans):
    for loan in loans:
        loan['emi'] = loan_emi(
            loan['amount'], 
            loan['duration'], 
            loan['rate']/12, # the CSV contains yearly rates
            loan['down_payment'])

## Writing to files

Now that we have performed some processing on the data, it would be good to write the results back to a CSV file. We can create/open a file in `w` mode using `open` and write to it using the `.write` method. The string `format` method will come in handy here.

In [54]:
loans2 = read_csv('./data/loans2.txt')

In [55]:
compute_emis(loans2)

In [56]:
loans2

[{'amount': 828400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0,
  'emi': 10034},
 {'amount': 4633400.0,
  'duration': 240.0,
  'rate': 0.06,
  'down_payment': 0.0,
  'emi': 33196},
 {'amount': 42900.0,
  'duration': 90.0,
  'rate': 0.08,
  'down_payment': 8900.0,
  'emi': 504},
 {'amount': 983000.0,
  'duration': 16.0,
  'rate': 0.14,
  'down_payment': 0.0,
  'emi': 67707},
 {'amount': 15230.0,
  'duration': 48.0,
  'rate': 0.07,
  'down_payment': 4300.0,
  'emi': 262}]

In [57]:
with open('./data/emis2.txt', 'w') as f:
    for loan in loans2:
        f.write('{},{},{},{},{}\n'.format(
            loan['amount'], 
            loan['duration'], 
            loan['rate'], 
            loan['down_payment'], 
            loan['emi']))

Let's verify that the file was created and written to as expected.

In [58]:
os.listdir('data')

['loans2.txt', 'loans1.txt', 'emis2.txt', 'loans3.txt']

In [59]:
with open('./data/emis2.txt', 'r') as f:
    print(f.read())

828400.0,120.0,0.11,100000.0,10034
4633400.0,240.0,0.06,0.0,33196
42900.0,90.0,0.08,8900.0,504
983000.0,16.0,0.14,0.0,67707
15230.0,48.0,0.07,4300.0,262



Great, looks like the loan details (along with the computed EMIs) were written into the file.

Let's define a generic function `write_csv` which takes a list of dictionaries and writes it to a file in CSV format. We will also include the column headers in the first line.

In [60]:
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")

Do you understand how the function works? If not, try executing each statement line by line in a separate cell to figure out how it works.

Let's try it out!

In [61]:
loans3 = read_csv('./data/loans3.txt')

In [62]:
compute_emis(loans3)

In [63]:
write_csv(loans3, './data/emis3.txt')

In [64]:
with open('./data/emis3.txt', 'r') as f:
    print(f.read())

amount,duration,rate,down_payment,emi
45230.0,48.0,0.07,4300.0,981
883000.0,16.0,0.14,0.0,60819
100000.0,12.0,0.1,0.0,8792
728400.0,120.0,0.12,100000.0,9016
3637400.0,240.0,0.06,0.0,26060
82900.0,90.0,0.07,8900.0,1060
316000.0,16.0,0.13,0.0,21618
15230.0,48.0,0.08,4300.0,267
991360.0,99.0,0.08,0.0,13712
323000.0,27.0,0.09,4720010000.0,-193751447
528400.0,120.0,0.11,100000.0,5902
8633400.0,240.0,0.06,0.0,61853
12900.0,90.0,0.08,8900.0,60



With just four lines of code, we can now read each downloaded file, calculate the EMIs, and write the results back to new files:

In [65]:
for i in range(1,4):
    loans = read_csv('./data/loans{}.txt'.format(i))
    compute_emis(loans)
    write_csv(loans, './data/emis{}.txt'.format(i))

Next, let's try `read_csv` on a CSV file whose values contain commas and quotes.

In [67]:
movies_url = "https://gist.githubusercontent.com/aakashns/afee0a407d44bbc02321993548021af9/raw/6d7473f0ac4c54aca65fc4b06ed831b8a4840190/movies.csv"

In [68]:
urlretrieve(movies_url, 'data/movies.csv')

('data/movies.csv', <http.client.HTTPMessage at 0x7f285c1f53d0>)

In [69]:
movies = read_csv('data/movies.csv')

In [70]:
movies

[{'title': 'Fast & Furious', 'description': '"A movie'},
 {'title': 'The Dark Knight', 'description': '"Gotham'},
 {'title': 'Memento',
  'description': 'A guy forgets everything every 15 minutes'}]

As you can see above, the movie descriptions weren't parsed properly: `parse_values` splits each line on every comma, including the commas inside quoted descriptions.
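
The sketch below reproduces the problem on inline text that mirrors the movie data shown in this tutorial. It then parses the same text with the standard-library `csv` module, which understands quoted fields. Using `csv.DictReader` here is an illustration of quoting-aware parsing, not the approach this tutorial takes:

```python
import csv
import io

# Inline rows mirroring the movie data, so the sketch runs offline
raw = (
    'title,description\n'
    'Fast & Furious,"A movie, a race, a franchise"\n'
    'The Dark Knight,"Gotham, the ""Batman"", and the Joker"\n'
    'Memento,A guy forgets everything every 15 minutes\n'
)

# Naive splitting cuts the quoted description at its first comma
naive = raw.splitlines()[1].split(',')
print(naive[1])  # "A movie

# csv.DictReader honours the quotes and un-escapes doubled quotes
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]['description'])  # A movie, a race, a franchise
print(rows[1]['description'])  # Gotham, the "Batman", and the Joker
```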

To read this CSV properly, we can use the `pandas` library.

In [74]:
!pip install pandas --upgrade --quiet

In [75]:
import pandas as pd

The `pd.read_csv` function can be used to read the CSV file into a pandas data frame: a spreadsheet-like object for analyzing and processing data. We'll learn more about data frames in a future lesson.

In [76]:
movies_dataframe = pd.read_csv('data/movies.csv')

In [77]:
movies_dataframe

             title                                description
0   Fast & Furious               A movie, a race, a franchise
1  The Dark Knight        Gotham, the "Batman", and the Joker
2          Memento  A guy forgets everything every 15 minutes


A dataframe can be converted into a list of dictionaries using the `to_dict` method.

In [78]:
movies = movies_dataframe.to_dict('records')

In [80]:
movies

[{'title': 'Fast & Furious', 'description': 'A movie, a race, a franchise'},
 {'title': 'The Dark Knight',
  'description': 'Gotham, the "Batman", and the Joker'},
 {'title': 'Memento',
  'description': 'A guy forgets everything every 15 minutes'}]

If you don't pass the argument `'records'`, you get a dictionary of dictionaries instead: each column name maps to a dictionary of row index to value.

In [81]:
movies_dict = movies_dataframe.to_dict()

In [82]:
movies_dict

{'title': {0: 'Fast & Furious', 1: 'The Dark Knight', 2: 'Memento'},
 'description': {0: 'A movie, a race, a franchise',
  1: 'Gotham, the "Batman", and the Joker',
  2: 'A guy forgets everything every 15 minutes'}}

Let's try using the `write_csv` function to write the data in `movies` back to a CSV file.

In [83]:
write_csv(movies, 'movies2.csv')

In [84]:
!head movies2.csv

title,description
Fast & Furious,A movie, a race, a franchise
The Dark Knight,Gotham, the "Batman", and the Joker
Memento,A guy forgets everything every 15 minutes








As you can see above, the CSV file is not formatted properly. This can be verified by attempting to read the file using `pd.read_csv`.

In [85]:
pd.read_csv('movies2.csv')

Unnamed: 0,Unnamed: 1,title,description
Fast & Furious,A movie,a race,a franchise
The Dark Knight,Gotham,"the ""Batman""",and the Joker
Memento,A guy forgets everything every 15 minutes,,
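
The root cause is that `write_csv` joins values with commas without quoting values that themselves contain commas or quotes. As a point of comparison, here is a sketch using the standard-library `csv.DictWriter`, which adds the quoting automatically. It writes to an in-memory buffer rather than a real file, which is an assumption made to keep the example self-contained:

```python
import csv
import io

movies = [
    {'title': 'Fast & Furious', 'description': 'A movie, a race, a franchise'},
    {'title': 'The Dark Knight', 'description': 'Gotham, the "Batman", and the Joker'},
    {'title': 'Memento', 'description': 'A guy forgets everything every 15 minutes'},
]

buffer = io.StringIO()  # in-memory stand-in for a file opened for writing
writer = csv.DictWriter(buffer, fieldnames=list(movies[0].keys()), lineterminator='\n')
writer.writeheader()
writer.writerows(movies)
print(buffer.getvalue())

# Reading the text back recovers the original dictionaries
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip == movies)  # True
```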


To convert a list of dictionaries into a dataframe, you can use the `pd.DataFrame` constructor.

In [86]:
df2 = pd.DataFrame(movies)

In [87]:
df2

             title                                description
0   Fast & Furious               A movie, a race, a franchise
1  The Dark Knight        Gotham, the "Batman", and the Joker
2          Memento  A guy forgets everything every 15 minutes


It can now be written to a CSV file using the `.to_csv` method of a dataframe.

In [88]:
df2.to_csv('movies3.csv', index=None)

Can you guess what the argument `index=None` does? Try removing it and observing the difference in output.

In [89]:
!head movies3.csv

title,description
Fast & Furious,"A movie, a race, a franchise"
The Dark Knight,"Gotham, the ""Batman"", and the Joker"
Memento,A guy forgets everything every 15 minutes


The CSV file is formatted properly. You can verify this by reading it back with `pd.read_csv('movies3.csv')`.

In [97]:
urlretrieve(url1, 'data/write_csv_columnar.csv')

('data/write_csv_columnar.csv', <http.client.HTTPMessage at 0x7f2813729ac0>)

In [98]:
urlretrieve(url2, 'data/urlretrieve.csv')

('data/write_csv_columnar.csv', <http.client.HTTPMessage at 0x7f2813729eb0>)

In [99]:
urlretrieve(url3, 'data/write_csv_columnar.csv')

('data/write_csv_columnar.csv', <http.client.HTTPMessage at 0x7f2813729910>)

In [100]:
write_csv_columnar

NameError: name 'write_csv_columnar' is not defined

In [101]:
import os

In [102]:
from urllib.request import urlretrieve

In [106]:
os.makedirs('./data2',exist_ok=True)

In [107]:
os.listdir('.')

['.bash_logout',
 '.profile',
 '.bashrc',
 '.ipynb_checkpoints',
 'movies3.csv',
 '.ipython',
 'python-os-and-filesystem.ipynb',
 '.local',
 '.cache',
 '.jovianrc',
 'data2',
 'movies2.csv',
 '.jupyter',
 'data',
 '.jovian',
 '.config',
 '.conda',
 '.wget-hsts',
 '.git',
 'work',
 '.npm']

In [108]:
urlretrieve(url1,'./data2/file1.txt')

('./data2/file1.txt', <http.client.HTTPMessage at 0x7f2813d88c40>)

In [109]:
urlretrieve(url2,'./data2/file2.txt')

('./data2/file2.txt', <http.client.HTTPMessage at 0x7f2813d39a60>)

In [110]:
urlretrieve(url3,'./data2/file3.txt')

('./data2/file3.txt', <http.client.HTTPMessage at 0x7f2813729940>)

In [111]:
with open('./data2/file1.txt','r') as file1:
    print(file1.read())


amount,duration,rate,down_payment
100000,36,0.08,20000
200000,12,0.1,
628400,120,0.12,100000
4637400,240,0.06,
42900,90,0.07,8900
916000,16,0.13,
45230,48,0.08,4300
991360,99,0.08,
423000,27,0.09,47200


In [112]:
def read_csv_columnar(path):
    """Read a CSV file into a dictionary mapping each header to a list of column values."""
    with open(path, 'r') as file:
        file_lines = file.readlines()
    result = {}
    # The first line holds the column names, which become the dictionary keys
    keys = file_lines[0].strip().split(',')
    for header_num, key in enumerate(keys):
        values = []
        # Collect this column's value from every data line
        for line in file_lines[1:]:
            fields = line.strip().split(',')
            if fields[header_num] == '':
                # Treat a missing value as 0.0
                values.append(0.0)
            else:
                try:
                    values.append(float(fields[header_num]))
                except ValueError:
                    # Keep non-numeric values as strings
                    values.append(fields[header_num])
        result[key] = values
    return result

In [113]:
loans1=read_csv_columnar('./data2/file1.txt')

In [115]:
loans1

{'amount': [100000.0,
  200000.0,
  628400.0,
  4637400.0,
  42900.0,
  916000.0,
  45230.0,
  991360.0,
  423000.0],
 'duration': [36.0, 12.0, 120.0, 240.0, 90.0, 16.0, 48.0, 99.0, 27.0],
 'rate': [0.08, 0.1, 0.12, 0.06, 0.07, 0.13, 0.08, 0.08, 0.09],
 'down_payment': [20000.0,
  0.0,
  100000.0,
  0.0,
  8900.0,
  0.0,
  4300.0,
  0.0,
  47200.0]}

In [116]:
loans2=read_csv_columnar('./data2/file2.txt')

In [117]:
loans2

{'amount': [828400.0, 4633400.0, 42900.0, 983000.0, 15230.0],
 'duration': [120.0, 240.0, 90.0, 16.0, 48.0],
 'rate': [0.11, 0.06, 0.08, 0.14, 0.07],
 'down_payment': [100000.0, 0.0, 8900.0, 0.0, 4300.0]}
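
The columnar layout holds the same data as the list of dictionaries returned by `read_csv`, just transposed. As a rough sketch, a hypothetical helper `rows_to_columns` (not part of the notebook) can convert one form into the other:

```python
def rows_to_columns(rows):
    """Hypothetical helper: turn a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Create the column's list on first sight, then append to it
            columns.setdefault(key, []).append(value)
    return columns

rows = [
    {'amount': 828400.0, 'duration': 120.0},
    {'amount': 4633400.0, 'duration': 240.0},
]
print(rows_to_columns(rows))
# {'amount': [828400.0, 4633400.0], 'duration': [120.0, 240.0]}
```

This assumes every row has the same keys. If some rows were missing a key, the column lists would end up with different lengths.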

In [119]:
lloans2=read_csv_columnar('./data2/file2.txt')