Go through the basics of creating a Python script, and then create a Python file for the script so you can run it in the terminal. In this practice notebook, you'll create the building blocks for a script that finds large files on the filesystem.

## Get the logic right 
Start by defining some of the requirements of the script. In this case, we need to:
- _Walk_ the filesystem looking at files, directories and sub-directories
- Capture file information: is it a file? a directory? what size? what path?
- Store that information in a suitable data structure
- Report the largest files by sorting the data stored in that structure

In [3]:
# The os module is perfect for filesystem operations like "walking" through directories and files
# Although there are many ways of achieving the same effect, a good way to loop over the filesystem is using `os.walk()`
import os
for root, directories, files in os.walk('.'):
    for _file in files:
        print(f"File found: {_file}")


File found: large_files.py
File found: .gitignore
File found: querying-databases.ipynb
File found: README.md
File found: sqlite-operations.ipynb
File found: large-files.ipynb
File found: generate_large_files.py
File found: index
File found: description
File found: ORIG_HEAD
File found: HEAD
File found: config
File found: packed-refs
File found: FETCH_HEAD
File found: main
File found: HEAD
File found: main
File found: pre-push.sample
File found: pre-applypatch.sample
File found: applypatch-msg.sample
File found: post-update.sample
File found: pre-rebase.sample
File found: fsmonitor-watchman.sample
File found: prepare-commit-msg.sample
File found: commit-msg.sample
File found: pre-commit.sample
File found: update.sample
File found: pre-receive.sample
File found: exclude
File found: 62a800dae9c576eb22d703673eb997ea1c171c
File found: 6a1e025adeb3b3ad305a3d0456bc3257e45899
File found: bc35ef788fad96454ad2eaa63127e8dbf2363c
File found: dc322ae5e0148613138b2f74a3b4bce0089b79
File found: e4761

In [4]:
# Update the loop so that it shows the full path of each file (relative to the starting directory), ignoring directory entries, which we aren't going to track
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        print(f"File found: {full_path}")


File found: ./large_files.py
File found: ./.gitignore
File found: ./querying-databases.ipynb
File found: ./README.md
File found: ./sqlite-operations.ipynb
File found: ./large-files.ipynb
File found: ./scripts/generate_large_files.py
File found: ./.git/index
File found: ./.git/description
File found: ./.git/ORIG_HEAD
File found: ./.git/HEAD
File found: ./.git/config
File found: ./.git/packed-refs
File found: ./.git/FETCH_HEAD
File found: ./.git/refs/heads/main
File found: ./.git/refs/remotes/origin/HEAD
File found: ./.git/refs/remotes/origin/main
File found: ./.git/hooks/pre-push.sample
File found: ./.git/hooks/pre-applypatch.sample
File found: ./.git/hooks/applypatch-msg.sample
File found: ./.git/hooks/post-update.sample
File found: ./.git/hooks/pre-rebase.sample
File found: ./.git/hooks/fsmonitor-watchman.sample
File found: ./.git/hooks/prepare-commit-msg.sample
File found: ./.git/hooks/commit-msg.sample
File found: ./.git/hooks/pre-commit.sample
File found: ./.git/hooks/update.sample

So now we have a few objectives completed:
- Files are detected
- Full paths are being collected

Next, we need to find size information. Python reports file sizes in bytes, so in addition to capturing the size, we'll need a way to convert bytes to megabytes or gigabytes to make the output easier to read.
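
As a rough illustration of the conversion mentioned above, here is a minimal sketch of a helper that turns a byte count into a readable string. The name `format_size` is hypothetical and not part of the notebook, and the sketch assumes binary units (1 KB = 1024 bytes); some tools divide by 1000 instead.

```python
def format_size(num_bytes):
    """Return a human-readable string for a size given in bytes."""
    size = float(num_bytes)
    # Assumption: binary units, so each step divides by 1024
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    # Anything that survives the loop is at least 1024 GB
    return f"{size:.1f} TB"


print(format_size(61))       # 61.0 B
print(format_size(275811))   # 269.3 KB
```

A helper like this could be applied to the value returned by `os.path.getsize()` at print time, so the stored sizes stay in bytes and remain easy to sort.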

In [5]:
# Update the loop to include the file size
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        print(f"Size: {size}b - File: {full_path}")

Size: 275811b - File: ./large_files.py
Size: 1799b - File: ./.gitignore
Size: 16667b - File: ./querying-databases.ipynb
Size: 61b - File: ./README.md
Size: 4447b - File: ./sqlite-operations.ipynb
Size: 8496b - File: ./large-files.ipynb
Size: 639b - File: ./scripts/generate_large_files.py
Size: 681b - File: ./.git/index
Size: 73b - File: ./.git/description
Size: 41b - File: ./.git/ORIG_HEAD
Size: 21b - File: ./.git/HEAD
Size: 268b - File: ./.git/config
Size: 112b - File: ./.git/packed-refs
Size: 107b - File: ./.git/FETCH_HEAD
Size: 41b - File: ./.git/refs/heads/main
Size: 30b - File: ./.git/refs/remotes/origin/HEAD
Size: 41b - File: ./.git/refs/remotes/origin/main
Size: 1348b - File: ./.git/hooks/pre-push.sample
Size: 424b - File: ./.git/hooks/pre-applypatch.sample
Size: 478b - File: ./.git/hooks/applypatch-msg.sample
Size: 189b - File: ./.git/hooks/post-update.sample
Size: 4898b - File: ./.git/hooks/pre-rebase.sample
Size: 3327b - File: ./.git/hooks/fsmonitor-watchman.sample
Size: 1492

In [6]:
# Persist the data in a dictionary. Since file paths are unique, you can use them as dictionary keys
file_metadata = {}
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        file_metadata[full_path] = size
print(file_metadata)

{'./large_files.py': 275811, './.gitignore': 1799, './querying-databases.ipynb': 16667, './README.md': 61, './sqlite-operations.ipynb': 4447, './large-files.ipynb': 8496, './scripts/generate_large_files.py': 639, './.git/index': 681, './.git/description': 73, './.git/ORIG_HEAD': 41, './.git/HEAD': 21, './.git/config': 268, './.git/packed-refs': 112, './.git/FETCH_HEAD': 107, './.git/refs/heads/main': 41, './.git/refs/remotes/origin/HEAD': 30, './.git/refs/remotes/origin/main': 41, './.git/hooks/pre-push.sample': 1348, './.git/hooks/pre-applypatch.sample': 424, './.git/hooks/applypatch-msg.sample': 478, './.git/hooks/post-update.sample': 189, './.git/hooks/pre-rebase.sample': 4898, './.git/hooks/fsmonitor-watchman.sample': 3327, './.git/hooks/prepare-commit-msg.sample': 1492, './.git/hooks/commit-msg.sample': 896, './.git/hooks/pre-commit.sample': 1642, './.git/hooks/update.sample': 3610, './.git/hooks/pre-receive.sample': 544, './.git/info/exclude': 240, './.git/objects/88/62a800dae9c5

**Exercise:** Now that the metadata is captured and stored in a suitable data structure like a dictionary, report back the results with only the five largest files. Try using other quantities to report on, like the 10 largest files instead of 5.

In [7]:
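items_shown = 0

# Sort the (path, size) pairs by size, largest first
for path, size in sorted(file_metadata.items(), key=lambda x: x[1], reverse=True):
    # items_shown starts at 0, so this stops once five items have been printed
    if items_shown > 4:
        break
    print(f"Size: {size} Path: {path}")
    items_shown += 1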


Size: 275811 Path: ./large_files.py
Size: 44320 Path: ./.git/objects/fe/be85cd6cf368f9ff91e9e1d511b7b3e13fc75a
Size: 16667 Path: ./querying-databases.ipynb
Size: 8496 Path: ./large-files.ipynb
Size: 8288 Path: ./.ipynb_checkpoints/large-files-checkpoint.ipynb
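
The counter-and-`break` loop above works, but the same kind of report can be written more compactly. The sketch below shows two alternatives: slicing the sorted list, and `heapq.nlargest()` from the standard library. The names `sample_metadata` and `limit` are illustrative only, and the sizes are copied from the sample output rather than read from disk.

```python
import heapq

# A small stand-in for the dictionary built by the walk
# (sizes copied from the sample output above).
sample_metadata = {
    "./large_files.py": 275811,
    "./README.md": 61,
    "./querying-databases.ipynb": 16667,
    "./large-files.ipynb": 8496,
}

limit = 2  # hypothetical name: how many files to report

# Option 1: sort everything, then keep the first `limit` entries.
top_sorted = sorted(sample_metadata.items(), key=lambda x: x[1], reverse=True)[:limit]

# Option 2: let heapq pick the `limit` largest entries directly.
top_heap = heapq.nlargest(limit, sample_metadata.items(), key=lambda x: x[1])

for path, size in top_sorted:
    print(f"Size: {size} Path: {path}")
```

Both options return a list of `(path, size)` tuples ordered from largest to smallest, so changing the number of reported files only requires changing `limit`.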


There is a lot happening in the previous cell. `sorted()` is a built-in function that sorts any iterable and returns a new list. Here the iterable is `file_metadata.items()`, which yields `(path, size)` pairs, and we need to sort by the _value_ (the size). This is done using the `key` parameter, which accepts a function; in this case that function is a `lambda`.
A `lambda` lets you write a small anonymous function inline, as a single expression, without a `def` statement. That `lambda` expression is the same as defining a function like:

```python
def by_value(x):
    return x[1]
```

`x` is a tuple holding two items, the path and the size. The function returns only the size (`x[1]`) because that is what we want to sort by. Try changing the `lambda` expression to use `x[0]` instead and see what happens.
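
To make the shape of `x` concrete, here is a small sketch on a toy dictionary. The paths and sizes in `sample` are made up for illustration; `by_value` mirrors the function shown above.

```python
# Hypothetical paths and sizes, used only to keep the example small
sample = {"./b.txt": 10, "./a.txt": 300, "./c.txt": 25}


def by_value(x):
    # x is a (path, size) tuple; index 1 is the size
    return x[1]


# dict.items() yields (key, value) tuples, which is what `key` receives
print(list(sample.items()))

# The lambda and the named function produce the same ordering
with_lambda = sorted(sample.items(), key=lambda x: x[1], reverse=True)
with_function = sorted(sample.items(), key=by_value, reverse=True)

# Index 0 is the path, so this sorts alphabetically instead of by size
by_path = sorted(sample.items(), key=lambda x: x[0])
```

Passing `by_value` without parentheses hands the function itself to `sorted()`, which then calls it once per item.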

**Exercise:** Try using a named function instead of a `lambda` to achieve the same result.

In [8]:
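items_shown = 0

# Sort by path (x[0]) instead of size; reverse=True gives reverse alphabetical order
for path, size in sorted(file_metadata.items(), key=lambda x: x[0], reverse=True):
    # items_shown starts at 0, so this stops once five items have been printed
    if items_shown > 4:
        break
    print(f"Size: {size} Path: {path}")
    items_shown += 1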

Size: 4447 Path: ./sqlite-operations.ipynb
Size: 639 Path: ./scripts/generate_large_files.py
Size: 16667 Path: ./querying-databases.ipynb
Size: 275811 Path: ./large_files.py
Size: 8496 Path: ./large-files.ipynb
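
With the building blocks in place, one possible way to assemble them into a file that runs from the terminal is sketched below. This is an assumption about how the pieces could fit together, not the notebook's final script: the function name `find_large_files`, its `top` and `limit` parameters, and the `OSError` handling are all illustrative additions.

```python
import os
import sys


def find_large_files(top=".", limit=5):
    """Return up to `limit` (path, size) tuples, largest first."""
    file_metadata = {}
    for root, directories, files in os.walk(top):
        for _file in files:
            full_path = os.path.join(root, _file)
            try:
                file_metadata[full_path] = os.path.getsize(full_path)
            except OSError:
                # For example a broken symlink, or a file removed mid-walk; skip it
                continue
    ranked = sorted(file_metadata.items(), key=lambda x: x[1], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    # Optional first argument: the directory to start from (defaults to ".")
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in find_large_files(start):
        print(f"Size: {size} Path: {path}")
```

Saved to a `.py` file, a script shaped like this could be run with `python` followed by the file name and, optionally, a directory to scan. Keeping the logic in a function and the printing under the `__main__` guard means the function can also be imported and reused without triggering a scan.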
