# Importing files into Python

The aim of this notebook is to describe how to import a folder of files into a single csv or dataframe in Python.

I have created a very small example folder to practice with here - `GitHub\python-basics\test-folder-with-csv-files\`

There is plenty of stuff to review before we even get to the folder. There's some nice reading [here](https://dbader.org/blog/python-file-io) about the fundamentals of working with files in Python, which I'll loosely follow along with. First up - file types.

## Terms / abbreviations to know

- I/O - input/output
- `\n` - new line

## Binary vs Text

There are two types of files that Python can handle - binary and text.

Examples of **binary** files
- Image files (.jpg, .png, .gif, etc)
- Database files (.mdb, .frm, .sqlite, etc)
- Documents (.doc, .xlsx, .pdf, etc)

These files require specific software and handling to open.

**Text** files have no special format and can be opened by a standard text editor. There are a few rules for text files
- A text file has to be readable as is
- Data is organised in lines - each line is a distinct element, e.g. a line of instruction, a command, etc
- Each line ends with an unseen newline character (`\n`) to let the editor know there should be a new line

## Built in Python functions

The first function to know is `open()`

In [11]:
my_first_file = open(file = 'test-folder-with-csv-files/weights-2020-01-01.txt', 
                     mode = 'r+')

The `mode` parameter tells Python what to do with the file
- **'w'** - write mode (creates the file, or empties it if it already exists)
- **'r'** - read mode (the default)
- **'a'** - append mode
- **'r+'** - read/write mode
- **'a+'** - append and read mode

When opening a binary file, add a **b** to the end of the mode, like **'wb'**, **'r+b'**, etc.
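To see the difference the **b** makes, here is a minimal sketch that reads the same file in text mode and in binary mode. It writes to a scratch file in a temporary folder (the name `sample.txt` is made up for this example) so it doesn't touch the practice folder.

```python
import os
import tempfile

# Hypothetical scratch file, so the practice folder is left alone
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('Day|PersonID\n2020-01-01|1\n')

# Text mode ('r') decodes the file and returns a str
with open(path, 'r', encoding='utf-8') as f:
    as_text = f.read()

# Binary mode ('rb') returns the raw bytes, with no decoding
with open(path, 'rb') as f:
    as_bytes = f.read()

print(type(as_text))   # <class 'str'>
print(type(as_bytes))  # <class 'bytes'>
```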

To close a file, call `close()` on the file object -

In [4]:
my_first_file.close()

The best way to open and close files is to use `with`. It closes the file automatically at the end of the nested code block, even if an error is raised inside it.

In [12]:
with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'r+') as my_file:
    my_file.read()
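A quick way to check that `with` really does close the file is to look at the file object's `closed` attribute inside and outside the block. This is just a sketch on a scratch file (the path is made up for the example), not part of the practice folder.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')

with open(path, 'w') as my_file:
    my_file.write('Day|PersonID|Name|Age|Weight\n')
    print(my_file.closed)  # False - still inside the with block

print(my_file.closed)      # True - closed as soon as the block ends

# Using the file object after the block raises an error
try:
    my_file.write('more')
except ValueError as err:
    error_message = str(err)
    print(error_message)
```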

## Reading data from a file

By default, `read(size)` reads the entire file and returns it as a string (text mode) or as a bytes object (binary mode). It doesn't print anything by itself. If the file is larger than available memory, it can't be read all at once, so the `size` parameter can be used to read the file in chunks that memory can handle. **size** is the maximum number of characters (text mode) or bytes (binary mode) to return, starting from the current position in the file.

In [13]:
with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'r+') as my_file:
    print("The file name: ", my_file)
    line = my_file.read()
    print(line)

The file name:  <_io.TextIOWrapper name='test-folder-with-csv-files/weights-2020-01-01.txt' mode='r+' encoding='cp1252'>
Day|PersonID|Name|Age|Weight
2020-01-01|1|Tom|32|80
2020-01-01|2|Matt|33|78
2020-01-01|3|Alex|31|90
2020-01-01|4|Matt|32|76
2020-01-01|5|Pat|32|70


In [16]:
with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'r+') as my_file:
    print("The file name: ", my_file)
    line = my_file.read(8) # limit size of what is read
    print(line)

The file name:  <_io.TextIOWrapper name='test-folder-with-csv-files/weights-2020-01-01.txt' mode='r+' encoding='cp1252'>
Day|Pers
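Reading a file in chunks, as described above, could look something like the sketch below. The chunk size of 16 is arbitrary and the scratch file is made up for the example; a real large file would use a much bigger chunk size.

```python
import os
import tempfile

# Hypothetical scratch file with two lines of the sample data
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')
sample = 'Day|PersonID|Name|Age|Weight\n2020-01-01|1|Tom|32|80\n'
with open(path, 'w') as f:
    f.write(sample)

chunk_size = 16  # characters per read in text mode (arbitrary choice)
chunks = []

with open(path, 'r') as my_file:
    while True:
        chunk = my_file.read(chunk_size)
        if not chunk:  # read() returns an empty string at the end of the file
            break
        chunks.append(chunk)  # in practice, process the chunk here instead

# Joining the chunks back together gives the original contents
print(''.join(chunks))
```

Each call to `read(chunk_size)` carries on from where the previous one stopped, so only one chunk needs to be in memory at a time if you process it rather than store it.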


## Read data line-by-line

With no argument, `readline(size)` reads one line starting from the current position in the file, so the first call on a freshly opened file returns the first line.

In [24]:
with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'r+') as my_file:
    print("The file name: ", my_file)
    print(my_file.readline())
    # print(my_file.readline()) # use multiple times to call more lines

The file name:  <_io.TextIOWrapper name='test-folder-with-csv-files/weights-2020-01-01.txt' mode='r+' encoding='cp1252'>
Day|PersonID|Name|Age|Weight
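Calling `readline()` repeatedly walks through the file one line at a time, and it returns an empty string once the end is reached. A small sketch of that, using a made-up scratch file rather than the practice folder:

```python
import os
import tempfile

# Hypothetical scratch file with two lines of the sample data
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')
with open(path, 'w') as f:
    f.write('Day|PersonID|Name|Age|Weight\n2020-01-01|1|Tom|32|80\n')

lines = []
with open(path, 'r') as my_file:
    while True:
        line = my_file.readline()
        if line == '':  # empty string means end of file (a blank line would be '\n')
            break
        lines.append(line)

print(lines)
```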



`readlines()` returns all lines as a list. It is much less useful with binary files, since arbitrary binary data doesn't have meaningful line endings.

In [26]:
with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'r+') as my_file:
    print(my_file.readlines())

['Day|PersonID|Name|Age|Weight\n', '2020-01-01|1|Tom|32|80\n', '2020-01-01|2|Matt|33|78\n', '2020-01-01|3|Alex|31|90\n', '2020-01-01|4|Matt|32|76\n', '2020-01-01|5|Pat|32|70']


## Processing a text file line-by-line

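The easiest way is to use a loop. It is memory-efficient as it reads and processes each line individually. Each line keeps its trailing `\n` and `print()` adds another one, which is why the output below is double-spaced.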

In [47]:
with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'r+') as my_file:
    for line in my_file: # for each element in the object
        print(line)

Day|PersonID|Name|Age|Weight

2020-01-01|1|Tom|32|80

2020-01-01|7|Barney|32|74

2020-01-01|2|Matt|33|78

2020-01-01|3|Alex|31|90

2020-01-01|4|Matt|32|76

2020-01-01|5|Pat|32|70

2020-01-01|6|Jack|30|68
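In practice, processing each line usually means stripping the trailing `\n` and splitting on the delimiter. Here is a rough sketch that turns each line into a dictionary keyed by the header row. It uses a scratch file with a couple of rows copied from the example above, and the variable names are my own.

```python
import os
import tempfile

# Hypothetical scratch file with a few rows of the sample data
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')
with open(path, 'w') as f:
    f.write('Day|PersonID|Name|Age|Weight\n'
            '2020-01-01|1|Tom|32|80\n'
            '2020-01-01|2|Matt|33|78\n')

rows = []
with open(path, 'r') as my_file:
    # The first line holds the column names
    header = next(my_file).rstrip('\n').split('|')
    for line in my_file:
        fields = line.rstrip('\n').split('|')  # drop the newline, split on |
        rows.append(dict(zip(header, fields)))

print(rows[0]['Name'])  # Tom
```

Note that every value comes back as a string, so numbers like `Age` would still need converting with `int()`.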



## Writing to a file

The method to write to a file is `write(data)`. For example -

*Note - don't run this unnecessarily!*

In [33]:
with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'a+') as my_file:
    my_file.write('2020-01-01|6|Jack|30|68\n') # \n - new line

Anything to be written that isn't a string needs to be converted to a string first. Example -

In [37]:
# values = [123456, 234567, 345678]

# with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'a+') as my_file:
#     for value in values:
#         str_value = str(value)
#         my_file.write(str_value)
#         my_file.write('\n')
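Two other ways of doing the same conversion are f-strings and `print()` with its `file` argument. The sketch below writes to a scratch file (made up for the example) rather than the practice file, so it is safe to run.

```python
import os
import tempfile

# Hypothetical scratch file, so the practice file is not modified
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'numbers.txt')

values = [123456, 234567, 345678]

with open(path, 'w') as my_file:
    for value in values:
        # An f-string converts the number and adds the newline in one go
        my_file.write(f'{value}\n')
    # print() converts its arguments to strings and appends '\n' itself
    print(3.14, file=my_file)

with open(path, 'r') as my_file:
    written = my_file.read()

print(written)
```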

## Editing an existing text file

You can't just use `w+` to edit, as it will completely overwrite the file. Also, `a+` will always insert data at the end of the file. One way to do it is to read the file into a list, insert the new data, then join the list back together and write it to the file.

For `list.insert(i, x)`, `i` is an integer that indicates the index. The data `x` is placed before the item currently at index `i` in the list.
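A tiny example of `list.insert(i, x)` on its own, using a plain list of names:

```python
names = ['Tom', 'Matt', 'Alex']

# 'Barney' goes in before the item currently at index 1 ('Matt')
names.insert(1, 'Barney')
print(names)  # ['Tom', 'Barney', 'Matt', 'Alex']

# An index equal to the length of the list puts the item at the end
names.insert(len(names), 'Pat')
print(names)  # ['Tom', 'Barney', 'Matt', 'Alex', 'Pat']
```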

So, for example -

In [38]:
# open as read-only
with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'r') as my_file:
    my_file_contents = my_file.readlines()
    
my_file_contents.insert(2, '2020-01-01|7|Barney|32|74\n')

# re-open in write-only mode to overwrite old file
with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'w') as my_file:
    # join items together with nothing between them
    new_contents = ''.join(my_file_contents)
    my_file.write(new_contents)

## Remove a line from a file

Found this on Stack Overflow [here](https://stackoverflow.com/questions/4710067/how-to-delete-a-specific-line-in-a-file), so I adapted it to remove the extra stuff we've just written into the original file -

In [49]:
values_to_rm = ['123456', '234567', '345678', 
                '2020-01-01|7|Barney|32|74',
                '2020-01-01|6|Jack|30|68']

with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'r') as my_file:
    my_file_contents = my_file.readlines()

with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'w') as my_file:
    for line in my_file_contents:
        if line.strip("\n") not in values_to_rm:
            my_file.write(line)
            
# check it has worked
with open('test-folder-with-csv-files/weights-2020-01-01.txt', 'r') as my_file:
    for line in my_file:
        print(line)

Day|PersonID|Name|Age|Weight

2020-01-01|1|Tom|32|80

2020-01-01|2|Matt|33|78

2020-01-01|3|Alex|31|90

2020-01-01|4|Matt|32|76

2020-01-01|5|Pat|32|70



In [4]:
# import packages

import pandas as pd

To get help with a function in Python, use `help()`. For example - `help(pd.read_csv)`. Alternatively, use Google.

The first thing I'd want to do is view what is in a folder, so let's figure out how to do that.
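One way to view what is in a folder is the standard library's `pathlib`. The sketch below builds a throwaway folder with a few made-up file names (only `weights-2020-01-01.txt` matches the practice folder; the others are invented for illustration) and then lists it.

```python
import tempfile
from pathlib import Path

# Hypothetical throwaway folder standing in for the practice folder
folder = Path(tempfile.mkdtemp())
for name in ['weights-2020-01-01.txt', 'weights-2020-01-02.txt', 'notes.md']:
    (folder / name).write_text('placeholder\n')

# Everything in the folder
all_names = sorted(p.name for p in folder.iterdir())
print(all_names)

# Only the files matching a pattern
txt_names = sorted(p.name for p in folder.glob('*.txt'))
print(txt_names)
```

For the real folder, `folder = Path('test-folder-with-csv-files')` should work the same way, assuming the notebook is run from the directory that contains it.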

Let's open one of the csv files to have a look at it

In [9]:
testcsv = pd.read_csv('test-folder-with-csv-files/Weights 2020-01-01.txt', sep = '|')

In [10]:
testcsv.head()

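| Unnamed: 0 | Day | PersonID | Name | Age | Weight |
|---|---|---|---|---|---|
| 0 | 2020-01-01 | 1 | Tom | 32 | 80 |
| 1 | 2020-01-01 | 2 | Matt | 33 | 78 |
| 2 | 2020-01-01 | 3 | Alex | 31 | 90 |
| 3 | 2020-01-01 | 4 | Matt | 32 | 76 |
| 4 | 2020-01-01 | 5 | Pat | 32 | 70 |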
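Coming back to the aim of the notebook, here is a rough sketch of combining a folder of pipe-delimited files into a single csv using only the standard library. It is one possible approach, not the definitive one. The folder is a throwaway one, the second file and its values are invented for illustration, and it assumes every file has the same header row.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical throwaway folder; the second file's contents are invented
folder = Path(tempfile.mkdtemp())
(folder / 'weights-2020-01-01.txt').write_text(
    'Day|PersonID|Name|Age|Weight\n2020-01-01|1|Tom|32|80\n')
(folder / 'weights-2020-01-02.txt').write_text(
    'Day|PersonID|Name|Age|Weight\n2020-01-02|1|Tom|32|79\n')

header = None
combined_rows = []

# Read each file in name order, keeping the header from the first one only
for path in sorted(folder.glob('weights-*.txt')):
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter='|')
        file_header = next(reader)   # first row of each file is the header
        if header is None:
            header = file_header
        combined_rows.extend(reader)  # remaining rows are data

# Write everything out as one comma-separated file
out_path = folder / 'combined.csv'
with open(out_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(combined_rows)

print(out_path.read_text())
```

With pandas, the equivalent would be to read each file with `pd.read_csv(path, sep='|')` and pass the resulting list of dataframes to `pd.concat`, which gives a single dataframe instead of a file.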
