## **D3TOP - Tópicos em Ciência de Dados (IFSP Campinas)**
**Prof. Dr. Samuel Martins (@iamsamucoding @samucoding @xavecoding)** <br/>
xavecoding: https://youtube.com/c/xavecoding <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<hr/>

# Files

Python uses file objects to interact with external files on your computer. These can be any sort of file: an audio file, a text file, emails, Excel documents, etc. Note: you will probably need to install certain libraries or modules to work with some of these file types, but they are easily available (we will cover installing modules later in the course).

Python has a built-in `open()` function that allows us to open and work with basic file types.

## Python Opening a File

### Know Your File's Location

In [3]:
# show the absolute pathname from the folder that contains
# this current jupyter-notebook file

In [4]:
pwd

'/Users/hisamuka/Library/CloudStorage/OneDrive-ifsp.edu.br/IFSP/Cursos/ECD/D3TOP_2023.1/github/basics'

### Opening a file

Be sure you are passing the **correct pathname** for the file, which can be:
- the _relative pathname_ from the current folder of this jupyter-notebook file
- the _absolute_ pathname of the file

In [5]:
# incomplete pathname
myfile = open('star-wars-wikipedia.txt')

FileNotFoundError: [Errno 2] No such file or directory: 'star-wars-wikipedia.txt'

In [6]:
# correct relative pathname
myfile = open('demos/star-wars-wikipedia.txt')

In [7]:
myfile

<_io.TextIOWrapper name='demos/star-wars-wikipedia.txt' mode='r' encoding='UTF-8'>

In [None]:
# or correct absolute pathname

In [9]:
myfile = open('/Users/hisamuka/Library/CloudStorage/OneDrive-ifsp.edu.br/IFSP/Cursos/ECD/D3TOP_2023.1/github/basics/demos/star-wars-wikipedia.txt')

In [10]:
myfile

<_io.TextIOWrapper name='/Users/hisamuka/Library/CloudStorage/OneDrive-ifsp.edu.br/IFSP/Cursos/ECD/D3TOP_2023.1/github/basics/demos/star-wars-wikipedia.txt' mode='r' encoding='UTF-8'>
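Outside a notebook the `pwd` command is not available, so it can help to build and check pathnames in code. The following is a minimal sketch using the standard `pathlib` module; it reuses the `demos/star-wars-wikipedia.txt` pathname from the cells above and only assumes that the script runs from some working directory (the file itself does not need to exist).

```python
from pathlib import Path

# Current working directory (what `pwd` shows in the notebook)
current_dir = Path.cwd()
print(current_dir)

# A relative pathname is interpreted from the current working directory
relative_path = Path('demos') / 'star-wars-wikipedia.txt'

# Joining it to the current directory gives the absolute pathname
absolute_path = current_dir / relative_path

print(relative_path.is_absolute())   # False
print(absolute_path.is_absolute())   # True

# Check whether the file exists before trying to open it
print(relative_path.exists())
```

Checking `exists()` first is one way to avoid the `FileNotFoundError` shown earlier when the pathname is incomplete.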

The two parameters of the `open` function used most often are the `file pathname` and the `access mode`. The available `access modes` are:

- `"r"`: **Read** - Default value. Opens a file for reading, error if the file does not exist
- `"a"`: **Append** - Opens a file for appending, creates the file if it does not exist
- `"w"`: **Write** - Opens a file for writing, creates the file if it does not exist, and erases its content if it does
- `"x"`: **Create** - Creates the specified file, returns an error if the file exists

In addition you can specify if the file should be handled as **binary** or **text mode**:

- `"t"`: **Text** - Default value. Text mode
- `"b"`: **Binary** - Binary mode (e.g. images)

By default, files are opened in **text reading mode** - `rt`.
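The access modes listed above can be compared side by side in a short experiment. This is a sketch under assumptions: the file name `modes_demo.txt` is hypothetical, and a temporary folder is used so that no real file is touched.

```python
import os
import tempfile

x_failed = False

# Work in a throwaway folder so no real file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

    # 'x' creates the file; it fails if the file already exists
    f = open(path, 'x')
    f.write('first\n')
    f.close()

    try:
        open(path, 'x')
    except FileExistsError:
        x_failed = True

    # 'a' keeps the existing content and writes at the end
    f = open(path, 'a')
    f.write('second\n')
    f.close()

    # 'r' (same as 'rt') returns str; 'rb' returns bytes
    f = open(path, 'r')
    text_content = f.read()
    f.close()

    f = open(path, 'rb')
    binary_content = f.read()
    f.close()

    # 'w' empties the file as soon as it is opened
    f = open(path, 'w')
    f.close()
    size_after_w = os.path.getsize(path)

print(x_failed)               # True
print(repr(text_content))     # 'first\nsecond\n'
print(type(binary_content))   # <class 'bytes'>
print(size_after_w)           # 0
```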

In [11]:
# alternative one
myfile = open('demos/star-wars-wikipedia.txt')
myfile

<_io.TextIOWrapper name='demos/star-wars-wikipedia.txt' mode='r' encoding='UTF-8'>

In [12]:
# alternative two
myfile = open('demos/star-wars-wikipedia.txt', 'r')
myfile

<_io.TextIOWrapper name='demos/star-wars-wikipedia.txt' mode='r' encoding='UTF-8'>

In [13]:
# alternative three
myfile = open('demos/star-wars-wikipedia.txt', 'rt')
myfile

<_io.TextIOWrapper name='demos/star-wars-wikipedia.txt' mode='rt' encoding='UTF-8'>

## Closing a file
Once there is no further interaction with the file, it must be **closed**.

In [16]:
myfile.close()
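A file object exposes a `closed` attribute, and any read or write on a closed file raises a `ValueError`. The sketch below uses a hypothetical scratch file created with `tempfile`, and wraps the write in `try`/`finally` so the file is closed even if the write fails.

```python
import os
import tempfile

# Hypothetical scratch file created only for this demonstration
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

demo_file = open(path, 'w')
try:
    demo_file.write('some content\n')
finally:
    # runs even if the write above raises an exception
    demo_file.close()

print(demo_file.closed)  # True

error_message = None
try:
    demo_file.write('more content\n')
except ValueError as error:
    error_message = str(error)

print(error_message)

os.remove(path)
```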

## Reading the content of a file: `.read()`

In [17]:
myfile = open('demos/star-wars-wikipedia.txt')
myfile

<_io.TextIOWrapper name='demos/star-wars-wikipedia.txt' mode='r' encoding='UTF-8'>

The `.read()` command reads the *entire content* of the file.

In [18]:
content = myfile.read()

In [19]:
content

'Star Wars is an American epic space opera[1] multimedia franchise created by George Lucas, which began with the eponymous 1977 film[b] and quickly became a worldwide pop culture phenomenon. The franchise has been expanded into various films and other media, including television series, video games, novels, comic books, theme park attractions, and themed areas, comprising an all-encompassing fictional universe.[c] Star Wars is one of the highest-grossing media franchises of all time.\n\nThe original film (Star Wars), retroactively subtitled Episode IV: A New Hope (1977), was followed by the sequels Episode V: The Empire Strikes Back (1980) and Episode VI: Return of the Jedi (1983), forming the original Star Wars trilogy. Lucas later returned to the series to direct a prequel trilogy, consisting of Episode I: The Phantom Menace (1999), Episode II: Attack of the Clones (2002), and Episode III: Revenge of the Sith (2005). In 2012, Lucas sold his production company to Disney, relinquishing

Note that the file has three paragraphs separated by an empty line each. The content stores `\n` for each line break. 

<br/>

If you try to read the same file again:

In [20]:
content_again = myfile.read()
content_again

''

**There is no content!**

This happens because you can imagine the **reading "cursor"** is _at the end of the file_ after having read it. **So there is nothing left to read.**

When opening a file, the reading "cursor" is at the _beginning of it_.

<br/>

We can reset the reading "cursor" by using the command `.seek()`:

In [22]:
# Seek to the start of file (index 0)
myfile.seek(0)

0

This command **moves** the _reading "cursor"_ to the position of **byte 0**, that is, the beginning of the file.

In [23]:
content_again = myfile.read()
content_again

'Star Wars is an American epic space opera[1] multimedia franchise created by George Lucas, which began with the eponymous 1977 film[b] and quickly became a worldwide pop culture phenomenon. The franchise has been expanded into various films and other media, including television series, video games, novels, comic books, theme park attractions, and themed areas, comprising an all-encompassing fictional universe.[c] Star Wars is one of the highest-grossing media franchises of all time.\n\nThe original film (Star Wars), retroactively subtitled Episode IV: A New Hope (1977), was followed by the sequels Episode V: The Empire Strikes Back (1980) and Episode VI: Return of the Jedi (1983), forming the original Star Wars trilogy. Lucas later returned to the series to direct a prequel trilogy, consisting of Episode I: The Phantom Menace (1999), Episode II: Attack of the Clones (2002), and Episode III: Revenge of the Sith (2005). In 2012, Lucas sold his production company to Disney, relinquishing

<br/>

Now we can read the entire file again!

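We can choose any (valid) position to move the reading cursor. Positions are measured in bytes. In a text file that contains only ASCII characters, such as this one, each character takes exactly one byte; in UTF-8, characters outside ASCII (e.g., accented letters) take more than one byte.

Thus, if we pass 16 to the `seek()` command, we move the _reading cursor_ 16 bytes (here, also 16 characters) from the beginning of the file.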

In [24]:
myfile.seek(16)

16

In [25]:
content_again = myfile.read()
content_again

'American epic space opera[1] multimedia franchise created by George Lucas, which began with the eponymous 1977 film[b] and quickly became a worldwide pop culture phenomenon. The franchise has been expanded into various films and other media, including television series, video games, novels, comic books, theme park attractions, and themed areas, comprising an all-encompassing fictional universe.[c] Star Wars is one of the highest-grossing media franchises of all time.\n\nThe original film (Star Wars), retroactively subtitled Episode IV: A New Hope (1977), was followed by the sequels Episode V: The Empire Strikes Back (1980) and Episode VI: Return of the Jedi (1983), forming the original Star Wars trilogy. Lucas later returned to the series to direct a prequel trilogy, consisting of Episode I: The Phantom Menace (1999), Episode II: Attack of the Clones (2002), and Episode III: Revenge of the Sith (2005). In 2012, Lucas sold his production company to Disney, relinquishing his ownership o

<br/>

Note that the first 16 characters including the spaces were ignored: `'Star Wars is an '`

<br/>

To find out the current position of the reading cursor, we use the `.tell()` command:

In [26]:
# move the cursor to byte/character 16
myfile.seek(16)

16

In [27]:
# show the current position of the file
myfile.tell()

16
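The difference between bytes and characters only shows up with non-ASCII text. The following sketch, which assumes a hypothetical scratch file containing the string `'São Paulo'` encoded in UTF-8, compares the number of characters with the size of the file in bytes. Note that the Python documentation treats the value returned by `.tell()` in text mode as an opaque number, so in text mode it is safest to pass `.seek()` either 0 or a value previously returned by `.tell()`.

```python
import os
import tempfile

# Hypothetical scratch file created only for this demonstration
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

# 'ã' takes two bytes in UTF-8; every other character here takes one
demo_file = open(path, 'w', encoding='utf-8')
demo_file.write('São Paulo')
demo_file.close()

demo_file = open(path, 'r', encoding='utf-8')
text = demo_file.read()          # the cursor is now at the end of the file
demo_file.seek(0)                # back to the beginning
first_three = demo_file.read(3)  # read(n) counts characters, not bytes
demo_file.close()

size_in_bytes = os.path.getsize(path)

print(len(text))        # 9 characters
print(size_in_bytes)    # 10 bytes
print(first_three)      # São

os.remove(path)
```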

### `.readlines()`
You can read **all the lines** of a file at once using the `readlines()` method, which returns a list where each element is a line of the file (including its trailing `\n`).

**Use caution with *large files***, since everything will be held in memory.

In [28]:
animals_file = open('demos/animals.txt')

In [29]:
animals_file

<_io.TextIOWrapper name='demos/animals.txt' mode='r' encoding='UTF-8'>

In [30]:
animals_list = animals_file.readlines()

In [31]:
animals_list

['Canidae\n',
 'Felidae\n',
 'Cat\n',
 'Cattle\n',
 'Dog\n',
 'Donkey\n',
 'Goat\n',
 'Guinea pig\n',
 'Horse\n',
 'Pig\n',
 'Rabbit\n',
 'Fancy rat varieties\n',
 'laboratory rat strains\n',
 'Sheep breeds\n',
 'Water buffalo breeds\n',
 'Chicken breeds\n',
 'Duck breeds\n',
 'Goose breeds\n',
 'Pigeon breeds\n',
 'Turkey breeds\n',
 'Aardvark\n',
 'Aardwolf\n',
 'African buffalo\n',
 'African elephant\n',
 'African leopard\n',
 'Albatross\n',
 'Alligator\n',
 'Alpaca\n',
 'American buffalo (bison)\n',
 'American robin\n',
 'Amphibian\n',
 'list\n',
 'Anaconda\n',
 'Angelfish\n',
 'Anglerfish\n',
 'Ant\n',
 'Anteater\n',
 'Antelope\n',
 'Antlion\n',
 'Ape\n',
 'Aphid\n',
 'Arabian leopard\n',
 'Arctic Fox\n',
 'Arctic Wolf\n',
 'Armadillo\n',
 'Arrow crab\n',
 'Asp\n',
 'Ass (donkey)\n',
 'Baboon\n',
 'Badger\n',
 'Bald eagle\n',
 'Bandicoot\n',
 'Barnacle\n',
 'Barracuda\n',
 'Basilisk\n',
 'Bass\n',
 'Bat\n',
 'Beaked whale\n',
 'Bear\n',
 'list\n',
 'Beaver\n',
 'Bedbug\n',
 'Bee\n

In [32]:
len(animals_list)

521
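Two common variations on `readlines()` are sketched below: `read().splitlines()` returns the lines without the trailing `\n`, and a `for` loop over the file object reads one line at a time, which avoids holding a large file in memory. The scratch file and its three lines are assumptions made for the example; they are not the real `demos/animals.txt`.

```python
import os
import tempfile

# Hypothetical scratch file with three lines
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

names_file = open(path, 'w')
names_file.write('Canidae\nFelidae\nCat\n')
names_file.close()

# Option 1: read everything, then split on line breaks (no trailing '\n')
names_file = open(path)
names = names_file.read().splitlines()
names_file.close()

# Option 2: iterate lazily, keeping only one line in memory at a time
count = 0
names_file = open(path)
for line in names_file:
    count += 1
names_file.close()

print(names)   # ['Canidae', 'Felidae', 'Cat']
print(count)   # 3

os.remove(path)
```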

## Writing to a File

By default, the `open()` function will only allow us to read the file.

We need to pass the argument `'w'` to write to the file. It will create an ***empty text file*** to be written (or empty the file if it already exists).

PS: you can choose the extension you wish.

In [36]:
love_letter = open('out/love_letter.md', 'w')

<br/>

Now, use the `.write()` commands **to write** a string to the file.

PS: This method also accepts the _f-string_ convention.

In [37]:
love_letter.write('I walked alone on the street.\n')
love_letter.write('I spoke to the stars and the moon.\n')

35

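Each `.write()` call returns the number of characters written; the notebook only displays the value of the last call, so the 35 refers to the second line (the first call wrote 30 characters). <br/>
Now, open the file in a text editor and check it. See that the file is empty. This happens because we have not **closed** the file yet. The written content is kept in a buffer and is only guaranteed to reach the file once it is closed (or flushed).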

In [38]:
love_letter.close()

Now, open the file in a text editor again and check its content. The content is now there!
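If the content must reach the file before it is closed, the `.flush()` method can be called. The sketch below uses a hypothetical scratch file and checks its size on disk before and after flushing; the size before flushing is usually 0 for such a small write, but that depends on buffering, so it is printed without an expected value.

```python
import os
import tempfile

# Hypothetical scratch file created only for this demonstration
fd, path = tempfile.mkstemp(suffix='.md')
os.close(fd)

writer = open(path, 'w')
chars_written = writer.write('I walked alone on the street.\n')

# Size on disk while the text may still be sitting in the buffer
size_before_flush = os.path.getsize(path)

# flush() pushes the buffered text to the file without closing it
writer.flush()
size_after_flush = os.path.getsize(path)

writer.close()

print(chars_written)        # 30
print(size_before_flush)
print(size_after_flush)

os.remove(path)
```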

<br/>

If you open the file again in the **writing mode**, its **content will be overwritten** since the file already exists, that is, all previous content will be lost.

In [39]:
love_letter = open('out/love_letter.md', 'w')

Open the file in the text editor again and check it: it is now empty.

### Using `f-string`
The `.write()` command also accepts the `f-string` pattern:

In [40]:
song = 'I slept in the square'

In [41]:
love_letter.write(f'** Song: {song} **\n\n')

35

In [42]:
love_letter.write('I walked alone on the street.\n')
love_letter.write('I spoke to the stars and the moon.\n')

35

In [43]:
love_letter.close()

### Appending to a File
Passing the access mode `'a'` opens the file and puts the cursor **at the end**, so anything written is **appended**. If the file does not exist, one will be created.

In [44]:
love_letter = open('out/love_letter.md', 'a')
love_letter

<_io.TextIOWrapper name='out/love_letter.md' mode='a' encoding='UTF-8'>

In [45]:
love_letter.write('I lay down on the bench in the square, trying to forget you.\n')
love_letter.write('I fell asleep and dreamed of you.\n')

34

In [46]:
love_letter.close()

### `.writelines()`
We can write a list of strings to a file.

In [47]:
products = ['iPhone', 'Xbox', 'Playstation', 'Nintendo Switch', 'Fusca']

In [48]:
products

['iPhone', 'Xbox', 'Playstation', 'Nintendo Switch', 'Fusca']

In [49]:
products_file = open('out/products.txt', 'w')

In [50]:
products_file.writelines(products)

In [51]:
products_file.close()

Note that no line separator (e.g., `\n`) was added to each element.
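To get one element per line, the line break has to be added to each string before calling `.writelines()`. A minimal sketch, assuming the same `products` list and a hypothetical scratch file in place of `out/products.txt`:

```python
import os
import tempfile

products = ['iPhone', 'Xbox', 'Playstation', 'Nintendo Switch', 'Fusca']

# Hypothetical scratch file created only for this demonstration
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

products_file = open(path, 'w')
# add the line break to each element ourselves
products_file.writelines(f'{product}\n' for product in products)
products_file.close()

# read the file back to confirm there is one product per line
products_file = open(path)
lines = products_file.readlines()
products_file.close()

print(lines)

os.remove(path)
```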

## Aliases and Context Managers
You can assign _temporary variable_ names as **aliases**, and manage the opening and closing of files **automatically** using a **context manager** (`with`):

In [52]:
with open('demos/animals.txt', 'r') as txt:
    first_animal = txt.readlines()[0]

In [53]:
first_animal

'Canidae\n'

Note that the `with ... as ...:` *context manager* **automatically closed** `animals.txt` after assigning the first line of text to `first_animal`:

In [54]:
txt.read()

ValueError: I/O operation on closed file.
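The context manager also closes the file when an error happens inside the `with` block. The sketch below raises an exception on purpose (the `RuntimeError` and the scratch file are assumptions made for the example) and then checks the `closed` attribute.

```python
import os
import tempfile

# Hypothetical scratch file created only for this demonstration
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    with open(path, 'w') as txt:
        txt.write('Canidae\n')
        # simulate a failure while the file is still open
        raise RuntimeError('something went wrong while the file was open')
except RuntimeError:
    pass

# the file was closed on the way out of the `with` block
print(txt.closed)   # True

os.remove(path)
```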

## Iterating through a File

In [59]:
with open('demos/googleplaystore.csv', 'r') as playstore_file:
    for i, line in enumerate(playstore_file):
        print(i, line, end='')

0 App,Category,Rating,Reviews
1 Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159
2 Coloring book moana,ART_AND_DESIGN,3.9,967
3 Gas Prices (Germany only),AUTO_AND_VEHICLES,4.4,805
4 Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644
5 Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967
6 Paper flowers instructions,ART_AND_DESIGN,4.4,167
7 Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178
8 Infinite Painter,ART_AND_DESIGN,4.1,36815
9 Garden Coloring Book,ART_AND_DESIGN,4.4,13791
10 Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121
11 Text on Photo - Fonteee,ART_AND_DESIGN,4.4,13880
12 Name Art Photo Editor - Focus n Filters,ART_AND_DESIGN,4.4,8788
13 Tattoo Name On My Photo Editor,ART_AND_DESIGN,4.2,44829
14 Mandala Coloring Book,ART_AND_DESIGN,4.6,4326
15 3D Color Pixel by Number - Sandbox Art Coloring,ART_AND_DESIGN,4.4,1518
16 Learn To Draw Kawaii Characters,ART_AND_DESIGN,3.2,55
17 Photo Designer - Write your name with shapes,ART_AND_DESIGN

### Reading a CSV from scratch

In [60]:
# package for debugging
import pdb

In [67]:
playstore_dict = {
    'apps': [],
    'categories': [],
    'ratings': [],
    'reviews': []
}

with open('demos/googleplaystore.csv', 'r') as playstore_file:
    for line in playstore_file:
        # delete \n
        line = line.replace('\n', '')
        
        # split the columns on ','
        app, category, rating, review = line.split(',')
        
        playstore_dict['apps'].append(app)
        playstore_dict['categories'].append(category)
        playstore_dict['ratings'].append(rating)
        playstore_dict['reviews'].append(review)
        
        # pdb.set_trace() = set a debugging breakpoint
        # type c to continue the execution
        # type q to quit
        # pdb.set_trace()

In [68]:
from pprint import pprint

pprint(playstore_dict)

{'apps': ['App',
          'Photo Editor & Candy Camera & Grid & ScrapBook',
          'Coloring book moana',
          'Gas Prices (Germany only)',
          'Sketch - Draw & Paint',
          'Pixel Draw - Number Art Coloring Book',
          'Paper flowers instructions',
          'Smoke Effect Photo Maker - Smoke Editor',
          'Infinite Painter',
          'Garden Coloring Book',
          'Kids Paint Free - Drawing Fun',
          'Text on Photo - Fonteee',
          'Name Art Photo Editor - Focus n Filters',
          'Tattoo Name On My Photo Editor',
          'Mandala Coloring Book',
          '3D Color Pixel by Number - Sandbox Art Coloring',
          'Learn To Draw Kawaii Characters',
          'Photo Designer - Write your name with shapes',
          '350 Diy Room Decor Ideas',
          'FlipaClip - Cartoon animation',
          'ibis Paint X',
          'Logo Maker - Small Business',
          "Boys Photo Editor - Six Pack & Men's Suit",
          'Superheroes Wallpa

In [70]:
# number of elements including the header
len(playstore_dict['apps'])

31

Let's remove the _header_ (first element of each list):

In [71]:
playstore_dict['apps'].pop(0)
playstore_dict['categories'].pop(0)
playstore_dict['ratings'].pop(0)
playstore_dict['reviews'].pop(0)

'Reviews'

In [72]:
pprint(playstore_dict)

{'apps': ['Photo Editor & Candy Camera & Grid & ScrapBook',
          'Coloring book moana',
          'Gas Prices (Germany only)',
          'Sketch - Draw & Paint',
          'Pixel Draw - Number Art Coloring Book',
          'Paper flowers instructions',
          'Smoke Effect Photo Maker - Smoke Editor',
          'Infinite Painter',
          'Garden Coloring Book',
          'Kids Paint Free - Drawing Fun',
          'Text on Photo - Fonteee',
          'Name Art Photo Editor - Focus n Filters',
          'Tattoo Name On My Photo Editor',
          'Mandala Coloring Book',
          '3D Color Pixel by Number - Sandbox Art Coloring',
          'Learn To Draw Kawaii Characters',
          'Photo Designer - Write your name with shapes',
          '350 Diy Room Decor Ideas',
          'FlipaClip - Cartoon animation',
          'ibis Paint X',
          'Logo Maker - Small Business',
          "Boys Photo Editor - Six Pack & Men's Suit",
          'Superheroes Wallpapers | 4K Backgro
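Splitting on `','` works while no field contains a comma. As an alternative, the standard `csv` module handles quoted fields, and `next()` on the reader consumes the header so it does not need to be popped afterwards. This is a sketch on hypothetical sample data held in an `io.StringIO` object rather than the real `googleplaystore.csv`; the second app name is invented to show a quoted comma.

```python
import csv
import io

# Hypothetical sample in the same layout as googleplaystore.csv;
# the second app name is invented to show a quoted comma
sample = io.StringIO(
    'App,Category,Rating,Reviews\n'
    'Coloring book moana,ART_AND_DESIGN,3.9,967\n'
    '"Draw, Paint & Sketch",ART_AND_DESIGN,4.5,215644\n'
)

playstore_dict = {'apps': [], 'categories': [], 'ratings': [], 'reviews': []}

reader = csv.reader(sample)
header = next(reader)   # consume the header row instead of popping it later

for app, category, rating, review in reader:
    playstore_dict['apps'].append(app)
    playstore_dict['categories'].append(category)
    playstore_dict['ratings'].append(float(rating))   # convert text to a number
    playstore_dict['reviews'].append(int(review))

print(header)
print(playstore_dict['apps'])
print(playstore_dict['ratings'])
```

With a real file, the `io.StringIO` object would be replaced by a file opened with `open(..., newline='')`, as the `csv` documentation recommends.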

### Writing a CSV from scratch
Let's create a CSV file from scratch for the apps we read previously. Let's only consider the columns `'apps'` and `'ratings'`.

In [82]:
my_playstore_file = open('out/my_googleplaystore.csv', 'w')

In [83]:
# writing the header
my_playstore_file.write('app,rating\n')

11

In [84]:
for app, rating in zip(playstore_dict['apps'], playstore_dict['ratings']):
    my_playstore_file.write(f'{app},{rating}\n')

In [85]:
my_playstore_file.close()

In [86]:
import pandas as pd

pd.read_csv('out/my_googleplaystore.csv')

| | app | rating |
|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | 4.1 |
| 1 | Coloring book moana | 3.9 |
| 2 | Gas Prices (Germany only) | 4.4 |
| 3 | Sketch - Draw & Paint | 4.5 |
| 4 | Pixel Draw - Number Art Coloring Book | 4.3 |
| 5 | Paper flowers instructions | 4.4 |
| 6 | Smoke Effect Photo Maker - Smoke Editor | 3.8 |
| 7 | Infinite Painter | 4.1 |
| 8 | Garden Coloring Book | 4.4 |
| 9 | Kids Paint Free - Drawing Fun | 4.7 |
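Writing rows with an f-string produces a broken CSV if a value contains a comma. A possible alternative is `csv.writer`, which quotes such values automatically. The sketch below is an assumption-laden example: the data is hypothetical (the second app name is invented to include a comma), and an `io.StringIO` object stands in for a file opened in writing mode.

```python
import csv
import io

# Hypothetical data; the second app name is invented to include a comma
apps = ['Coloring book moana', 'Draw, Paint & Sketch']
ratings = ['3.9', '4.5']

# io.StringIO stands in for a file opened with open(..., 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')

writer.writerow(['app', 'rating'])       # header
writer.writerows(zip(apps, ratings))     # one row per (app, rating) pair

csv_text = buffer.getvalue()
print(csv_text)
```

The value with a comma is written between double quotes, so `pd.read_csv` or `csv.reader` can still split the row into two columns.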
