Go through the basics of creating a Python script, and then create a Python file for the script so you can run it from the terminal. In this practice notebook, you'll create the building blocks for a script that finds large files on the filesystem.

## Get the logic right 
Start by defining some of the requirements of the script. In this case, we need to:
- _Walk_ the filesystem looking at files, directories and sub-directories
- Capture file information: is it a file? a directory? what size? what path?
- Store that information in a suitable data structure
- Report the largest files by sorting the data stored in that structure

In [2]:
# The os module is well suited to filesystem operations like "walking" through directories and files
# Although there are many ways of achieving the same effect, a good way to loop over the filesystem is using `os.walk()`
import os
for root, directories, files in os.walk('.'):
    for _file in files:
        print(f"File found: {_file}")


File found: sqlite-operations.ipynb
File found: querying-databases.ipynb
File found: .gitignore
File found: large_files.py
File found: large-files.ipynb
File found: README.md
File found: description
File found: FETCH_HEAD
File found: HEAD
File found: index
File found: config
File found: packed-refs
File found: post-update.sample
File found: applypatch-msg.sample
File found: fsmonitor-watchman.sample
File found: update.sample
File found: pre-applypatch.sample
File found: commit-msg.sample
File found: pre-merge-commit.sample
File found: pre-commit.sample
File found: pre-rebase.sample
File found: pre-push.sample
File found: pre-receive.sample
File found: push-to-checkout.sample
File found: sendemail-validate.sample
File found: prepare-commit-msg.sample
File found: bc35ef788fad96454ad2eaa63127e8dbf2363c
File found: 3b70c25abcd803d00691ddf81d5a9cd39726e6
File found: 139b26c12f981afa3ea3781d3c66ee9fb06a91
File found: 62a800dae9c576eb22d703673eb997ea1c171c
File found: 4ac288bdf382b140cd4d505c

In [3]:
# Update the loop so that it shows the full path of each file (relative to the starting directory).
# Directories themselves are ignored because we aren't going to track them
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        print(f"File found: {full_path}")


File found: ./sqlite-operations.ipynb
File found: ./querying-databases.ipynb
File found: ./.gitignore
File found: ./large_files.py
File found: ./large-files.ipynb
File found: ./README.md
File found: ./.git/description
File found: ./.git/FETCH_HEAD
File found: ./.git/HEAD
File found: ./.git/index
File found: ./.git/config
File found: ./.git/packed-refs
File found: ./.git/hooks/post-update.sample
File found: ./.git/hooks/applypatch-msg.sample
File found: ./.git/hooks/fsmonitor-watchman.sample
File found: ./.git/hooks/update.sample
File found: ./.git/hooks/pre-applypatch.sample
File found: ./.git/hooks/commit-msg.sample
File found: ./.git/hooks/pre-merge-commit.sample
File found: ./.git/hooks/pre-commit.sample
File found: ./.git/hooks/pre-rebase.sample
File found: ./.git/hooks/pre-push.sample
File found: ./.git/hooks/pre-receive.sample
File found: ./.git/hooks/push-to-checkout.sample
File found: ./.git/hooks/sendemail-validate.sample
File found: ./.git/hooks/prepare-commit-msg.sample
File

So now we have a few objectives completed:
- Files are detected
- Full paths are being collected

Next, we need size information. `os.path.getsize()` reports sizes in bytes, so in addition to capturing the size, we'll need a way to convert bytes to megabytes or gigabytes to make the output easier to read.
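
One possible way to do that conversion is sketched below. The helper name `human_readable_size` is hypothetical (it is not part of the notebook or of the `os` module), and it assumes binary multiples, where 1 KB is 1024 bytes.

```python
def human_readable_size(size_in_bytes):
    """Convert a byte count into a short string such as '269.3 KB'.

    Hypothetical helper: assumes binary multiples (1 KB = 1024 bytes).
    """
    size = float(size_in_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        # Move up to the next unit
        size /= 1024
    return f"{size:.1f} TB"


print(human_readable_size(275811))  # 269.3 KB
```

A helper like this could be called inside the f-string in place of the raw `size` value once the loop collects sizes.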

In [4]:
# Update the loop to include the file size
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        print(f"Size: {size}b - File: {full_path}")

Size: 4447b - File: ./sqlite-operations.ipynb
Size: 16667b - File: ./querying-databases.ipynb
Size: 1799b - File: ./.gitignore
Size: 275811b - File: ./large_files.py
Size: 13219b - File: ./large-files.ipynb
Size: 61b - File: ./README.md
Size: 73b - File: ./.git/description
Size: 105b - File: ./.git/FETCH_HEAD
Size: 21b - File: ./.git/HEAD
Size: 681b - File: ./.git/index
Size: 460b - File: ./.git/config
Size: 112b - File: ./.git/packed-refs
Size: 189b - File: ./.git/hooks/post-update.sample
Size: 478b - File: ./.git/hooks/applypatch-msg.sample
Size: 4726b - File: ./.git/hooks/fsmonitor-watchman.sample
Size: 3650b - File: ./.git/hooks/update.sample
Size: 424b - File: ./.git/hooks/pre-applypatch.sample
Size: 896b - File: ./.git/hooks/commit-msg.sample
Size: 416b - File: ./.git/hooks/pre-merge-commit.sample
Size: 1649b - File: ./.git/hooks/pre-commit.sample
Size: 4898b - File: ./.git/hooks/pre-rebase.sample
Size: 1374b - File: ./.git/hooks/pre-push.sample
Size: 544b - File: ./.git/hooks/pr

In [5]:
# Persist the data into a dictionary. Since file paths are unique you can use those as dictionary keys
file_metadata = {}
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        file_metadata[full_path] = size
print(file_metadata)

{'./sqlite-operations.ipynb': 4447, './querying-databases.ipynb': 16667, './.gitignore': 1799, './large_files.py': 275811, './large-files.ipynb': 15776, './README.md': 61, './.git/description': 73, './.git/FETCH_HEAD': 105, './.git/HEAD': 21, './.git/index': 681, './.git/config': 460, './.git/packed-refs': 112, './.git/hooks/post-update.sample': 189, './.git/hooks/applypatch-msg.sample': 478, './.git/hooks/fsmonitor-watchman.sample': 4726, './.git/hooks/update.sample': 3650, './.git/hooks/pre-applypatch.sample': 424, './.git/hooks/commit-msg.sample': 896, './.git/hooks/pre-merge-commit.sample': 416, './.git/hooks/pre-commit.sample': 1649, './.git/hooks/pre-rebase.sample': 4898, './.git/hooks/pre-push.sample': 1374, './.git/hooks/pre-receive.sample': 544, './.git/hooks/push-to-checkout.sample': 2783, './.git/hooks/sendemail-validate.sample': 2308, './.git/hooks/prepare-commit-msg.sample': 1492, './.git/objects/ec/bc35ef788fad96454ad2eaa63127e8dbf2363c': 166, './.git/objects/c9/3b70c25ab

**Exercise:** Now that the metadata is captured and stored in a suitable data structure like a dictionary, report back the results with only the four largest files. Try using other quantities to report on, like the 10 largest files instead of 4.

In [12]:
items_shown = 0
    
for path, size in sorted(file_metadata.items(), key=lambda x:x[1], reverse=True):
    if items_shown > 9:
        break
    print(f"Size: {size} Path: {path}")
    items_shown += 1


Size: 275811 Path: ./large_files.py
Size: 50163 Path: ./.git/objects/pack/pack-79ca6e3e498b45aeec5d4913ff96152fd77c6228.pack
Size: 16667 Path: ./querying-databases.ipynb
Size: 15776 Path: ./large-files.ipynb
Size: 4898 Path: ./.git/hooks/pre-rebase.sample
Size: 4804 Path: ./.git/objects/c8/4ac288bdf382b140cd4d505c1bf1be9a786929
Size: 4726 Path: ./.git/hooks/fsmonitor-watchman.sample
Size: 4447 Path: ./sqlite-operations.ipynb
Size: 3650 Path: ./.git/hooks/update.sample
Size: 2783 Path: ./.git/hooks/push-to-checkout.sample


There is a lot happening in the previous cell. `sorted()` is a built-in function that returns a sorted list from any iterable. Here the iterable is `file_metadata.items()`, which yields `(path, size)` pairs, and we need to sort those pairs by the _value_ (the size). This is done with the `key` parameter, which accepts any function, including a `lambda`.
A `lambda` lets you write a small function in a single expression without giving it a name. That `lambda` expression is the same as defining a function like:

```python
def by_value(x):
    return x[1]
```

`x` is a tuple holding two items: the path at `x[0]` and the size at `x[1]`. The function returns only the size because that is what we want to sort by. Try changing the `lambda` expression to use `x[0]` instead and see what happens.
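
To see the effect of the `key` function in isolation, here is a minimal sketch on a small, made-up dictionary. The paths and sizes in `sample` are invented for illustration and do not come from the repository above.

```python
# Made-up sample data standing in for file_metadata
sample = {"./b.txt": 300, "./a.txt": 20, "./c.txt": 1000}

# Sort the (path, size) pairs by size, largest first
by_size = sorted(sample.items(), key=lambda x: x[1], reverse=True)

# Sort the same pairs by path, in alphabetical order
by_path = sorted(sample.items(), key=lambda x: x[0])

print(by_size)  # [('./c.txt', 1000), ('./b.txt', 300), ('./a.txt', 20)]
print(by_path)  # [('./a.txt', 20), ('./b.txt', 300), ('./c.txt', 1000)]
```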

**Exercise:** Try using a named function instead of a `lambda` expression to achieve the same result.

In [17]:
def by_value(x):
    return x[1]

items_shown = 0
    
for path, size in sorted(file_metadata.items(), key=by_value, reverse=True):
    if items_shown > 4:
        break
    print(f"Size: {size} Path: {path}")
    items_shown += 1


Size: 275811 Path: ./large_files.py
Size: 50163 Path: ./.git/objects/pack/pack-79ca6e3e498b45aeec5d4913ff96152fd77c6228.pack
Size: 16667 Path: ./querying-databases.ipynb
Size: 15776 Path: ./large-files.ipynb
Size: 4898 Path: ./.git/hooks/pre-rebase.sample
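
As a rough sketch of how these building blocks might be combined into a single Python file that runs from the terminal, consider the following. The function names `collect_file_sizes` and `largest_files`, the default of five results, and the choice to skip unreadable files are all assumptions made for illustration. Slicing the sorted list replaces the manual `items_shown` counter.

```python
import os
import sys


def collect_file_sizes(top):
    """Map each file path under `top` to its size in bytes."""
    file_metadata = {}
    for root, _directories, files in os.walk(top):
        for _file in files:
            full_path = os.path.join(root, _file)
            try:
                file_metadata[full_path] = os.path.getsize(full_path)
            except OSError:
                # Assumption: skip files that cannot be read (e.g. broken symlinks)
                continue
    return file_metadata


def largest_files(file_metadata, count=5):
    """Return up to `count` (path, size) pairs, largest first."""
    ranked = sorted(file_metadata.items(), key=lambda x: x[1], reverse=True)
    return ranked[:count]


if __name__ == "__main__":
    # Use the first command-line argument as the starting directory, else "."
    top = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in largest_files(collect_file_sizes(top)):
        print(f"Size: {size} Path: {path}")
```

Saved under a name of your choice, the file could be run as `python <your_file>.py <directory>`, where the directory argument is optional.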
