Go through the basics of creating a Python script, and then save the script to a Python file so you can run it from the terminal. In this practice notebook, you'll create the building blocks for a script that finds large files on the filesystem.

## Get the logic right 
Start by defining some of the requirements of the script. In this case, we need to:
- _Walk_ the filesystem looking at files, directories and sub-directories
- Capture file information: is it a file? a directory? what size? what path?
- Store that information in a suitable data structure
- Report the largest files by sorting the data in that structure

In [9]:
import os
import json

In [5]:
# Update the loop so that it shows the full path of each file (relative to the starting directory), ignoring directories, which we aren't going to track
COUNT = 0
for root, directories, files in os.walk('.'):
    print(f"pass: {COUNT}")
    COUNT += 1
    print(f"root: {root}")
    print(f"dirs: {directories}")
    print(f"files: {files}")
    for _file in files:
        full_path = os.path.join(root, _file)
        print(f"File found: {full_path}")


pass: 0
root: .
dirs: ['sample_data', 'scripts']
files: ['csv2json_excercise.ipynb', 'querying-databases.ipynb', 'large-files.ipynb', 'sqlite-operations.ipynb', 'looping-data-structures.ipynb', 'serializing-json.ipynb']
File found: ./csv2json_excercise.ipynb
File found: ./querying-databases.ipynb
File found: ./large-files.ipynb
File found: ./sqlite-operations.ipynb
File found: ./looping-data-structures.ipynb
File found: ./serializing-json.ipynb
pass: 1
root: ./sample_data
dirs: []
files: ['wine-ratings-small.csv', 'wine-ratings.csv', 'wine-ratings.json']
File found: ./sample_data/wine-ratings-small.csv
File found: ./sample_data/wine-ratings.csv
File found: ./sample_data/wine-ratings.json
pass: 2
root: ./scripts
dirs: []
files: ['generate_large_files.py', 'generate_sql.py']
File found: ./scripts/generate_large_files.py
File found: ./scripts/generate_sql.py


So now we have a few objectives completed:
- Files are detected
- Full paths are being collected

Next, we need to find size information. Python reports file sizes in bytes, so in addition to capturing the size, we'll need a way to convert bytes to megabytes or gigabytes to make the output easier to read.
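As a rough sketch of that conversion, the helper below turns a byte count into a short readable string. The name `human_readable` is hypothetical and not part of the notebook, and the sketch assumes binary units (1 KB = 1024 bytes); divide by 1000 instead if you prefer decimal units.

```python
def human_readable(size_in_bytes):
    """Convert a byte count into a short string such as '12.9 MB'.

    Assumes binary units: each step up divides by 1024.
    """
    size = float(size_in_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    # Anything that is still 1024 GB or more is reported in terabytes
    return f"{size:.1f} TB"


print(human_readable(549))
print(human_readable(13518834))
```

A helper like this could be called on the value returned by `os.path.getsize()` just before printing, leaving the stored sizes in bytes so they still sort correctly.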

In [6]:
# Update the loop to include the file size
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        print(f"Size: {size}b - File: {full_path}")

Size: 105196b - File: ./csv2json_excercise.ipynb
Size: 16667b - File: ./querying-databases.ipynb
Size: 8113b - File: ./large-files.ipynb
Size: 4447b - File: ./sqlite-operations.ipynb
Size: 24650b - File: ./looping-data-structures.ipynb
Size: 8833b - File: ./serializing-json.ipynb
Size: 314894b - File: ./sample_data/wine-ratings-small.csv
Size: 13518834b - File: ./sample_data/wine-ratings.csv
Size: 355744b - File: ./sample_data/wine-ratings.json
Size: 677b - File: ./scripts/generate_large_files.py
Size: 549b - File: ./scripts/generate_sql.py


In [10]:
# Persist the data into a dictionary. Since file paths are unique you can use those as dictionary keys
file_metadata = {}
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        file_metadata[full_path] = size
# json.dumps() produces the double-quoted output shown below
print(json.dumps(file_metadata))

{"./csv2json_excercise.ipynb": 105196, "./querying-databases.ipynb": 16667, "./large-files.ipynb": 8323, "./sqlite-operations.ipynb": 4447, "./looping-data-structures.ipynb": 24650, "./serializing-json.ipynb": 8833, "./sample_data/wine-ratings-small.csv": 314894, "./sample_data/wine-ratings.csv": 13518834, "./sample_data/wine-ratings.json": 355744, "./scripts/generate_large_files.py": 677, "./scripts/generate_sql.py": 549}


**Exercise:** Now that the metadata is captured and stored in a suitable data structure like a dictionary, report back the results with only the four largest files. Try using other quantities to report on, like the 10 largest files instead of 4.

In [12]:

for path, size in sorted(file_metadata.items(), key=lambda x: x[1], reverse=True):
    print(f"Size: {size} Path: {path}")


Size: 13518834 Path: ./sample_data/wine-ratings.csv
Size: 355744 Path: ./sample_data/wine-ratings.json
Size: 314894 Path: ./sample_data/wine-ratings-small.csv
Size: 105196 Path: ./csv2json_excercise.ipynb
Size: 24650 Path: ./looping-data-structures.ipynb
Size: 16667 Path: ./querying-databases.ipynb
Size: 8833 Path: ./serializing-json.ipynb
Size: 8323 Path: ./large-files.ipynb
Size: 4447 Path: ./sqlite-operations.ipynb
Size: 677 Path: ./scripts/generate_large_files.py
Size: 549 Path: ./scripts/generate_sql.py


There is a lot happening in the previous cell. `sorted()` is a built-in function that returns a sorted list from any iterable, here the `(path, size)` pairs produced by `file_metadata.items()`. In this case, we need to sort by the _value_. This is done using the `key` parameter, which accepts any function and is given a `lambda` here.
`lambda` creates a small anonymous function in a single expression, without a `def` statement. That `lambda` expression is equivalent to defining a function like:

```python
def by_value(x):
    return x[1]
```

`x` is a tuple holding two items, the path and the size. The function returns only the size because that is what we want to sort by. Try changing the `lambda` expression to use `x[0]` instead and see what happens.
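To make the difference between the two indexes concrete, here is a small sketch using a made-up dictionary (the paths and sizes in `sample` are hypothetical, chosen only for illustration). Index `1` sorts by size, while index `0` sorts alphabetically by path.

```python
# Hypothetical sample data, shaped like file_metadata (path -> size in bytes)
sample = {"./b.txt": 10, "./a.txt": 300, "./c.txt": 25}

# x[1] is the size: largest files first
by_size = sorted(sample.items(), key=lambda x: x[1], reverse=True)

# x[0] is the path: alphabetical order
by_path = sorted(sample.items(), key=lambda x: x[0])

print(by_size)  # [('./a.txt', 300), ('./c.txt', 25), ('./b.txt', 10)]
print(by_path)  # [('./a.txt', 300), ('./b.txt', 10), ('./c.txt', 25)]
```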

**Exercise:** Try using a named function instead of a `lambda` to achieve the same result.
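One possible approach, shown only as a sketch: pass a named function as `key` and slice the sorted list to keep the top entries. The names `by_size`, `largest_files`, and `sample_metadata` are hypothetical, and the sample reuses a few paths and sizes from the output above so the block runs on its own.

```python
def by_size(item):
    """Return the size from a (path, size) pair."""
    return item[1]


def largest_files(metadata, limit=4):
    """Return the `limit` largest (path, size) pairs, biggest first."""
    return sorted(metadata.items(), key=by_size, reverse=True)[:limit]


# A few entries taken from the output earlier in the notebook
sample_metadata = {
    "./serializing-json.ipynb": 8833,
    "./sample_data/wine-ratings.csv": 13518834,
    "./scripts/generate_sql.py": 549,
}

for path, size in largest_files(sample_metadata, limit=2):
    print(f"Size: {size} Path: {path}")
```

Changing `limit` to 10 reports the ten largest files. If the dictionary has fewer entries than `limit`, the slice simply returns everything available.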