# Working with `.py` files in and out of a notebook

## Motivation
- Most Python users start in Jupyter notebooks — great for exploration!
- `.py` scripts are essential for **reusable**, **shareable**, and **production-ready** code.

### Notebooks are great for:
- exploring data
- data visualization
- creating initial test versions of code (prototyping)
- code that requires regular human intervention or feedback
- teaching code

But notebooks can be memory-intensive, slow, and messy, and they can't easily be executed in the background.

### Scripts are a better choice for:
- memory intensive jobs
- parallel jobs
- long tasks
- creating data analysis pipelines
- running jobs in the background (and allowing you to do something else!)
- submitting code to run on a high-performance computer (e.g., Quest)
- writing software

But scripts may not be the best choice for viewing visualizations or other immediate output, and debugging can be a bit harder than in a notebook.

## What is a `.py` file?

- A plain text file containing Python code.
- Saved with the `.py` extension.
- Can be run as a program or imported into a notebook.

You can create a `.py` file using a text editor on your computer.  For instance, you could use VSCode, Notepad on a PC, TextEdit on a Mac, or any other plain text editor -- NOT Word.

Let's take a look at `example01a_helloworld.py`.  (If you're working in Colab, you will likely need to upload this file and the others from our GitHub repo by dragging and dropping the files from your computer into the Colab file explorer in the left bar.) To view and edit the file, you can use any of these plain text editors:

- If you are using Jupyter Lab on your own computer, you can double click the script from the file tree on the left. Jupyter Lab has its own text editor.
- If you are using Google Colab online, you can double click the script from the file tree on the left. Google Colab has its own text editor!
- You can also open a separate text editor on your own computer.

## Running a `.py` file

You can run a `.py` file from a terminal or from a notebook.

To run from a **terminal**, you have a few options:
1. Use Colab's built-in terminal utility
2. Use the Anaconda Prompt (or Anaconda PowerShell Prompt on Windows) within the same directory as your script.  This does require some minimal bash command-line skills.  
3. VSCode also contains a terminal, and there are other terminal apps that you may prefer.  These may require additional configuration to get them to recognize Python (and the version of Python that you intend to use).

(I will demonstrate methods 1 and 2 above.)  

If you need a refresher on basic bash commands, [here is a good starting point](https://www.w3schools.com/bash/bash_commands.php).

Once you have your terminal open and in the directory containing your script, you would type the following:

```bash
python example01a_helloworld.py
```

Let's run that together.

Alternatively, to run from a notebook (e.g., Jupyter or Colab):


In [None]:
import example01a_helloworld
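
One thing to keep in mind when importing from a notebook: Python only executes a script the first time it is imported in a session, so rerunning the import cell will not rerun the script. If you edit the file, `importlib.reload` picks up the changes. Here is a minimal sketch using a throwaway script (the file name `demo_hello.py` and its contents are made up for illustration and are not one of our workshop files):

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a tiny throwaway script to a temporary folder (hypothetical file name)
tmp_dir = tempfile.mkdtemp()
script = Path(tmp_dir) / "demo_hello.py"
script.write_text('message = "Hello World!"\nprint(message)\n')

# Make the folder importable, then import the script (this prints the message)
sys.path.insert(0, tmp_dir)
import demo_hello

# Importing again does nothing new: Python reuses the already-loaded module
import demo_hello

# After editing the file, reload the module to rerun it with the new contents
script.write_text('message = "Hello again, World!"\nprint(message)\n')
demo_hello = importlib.reload(demo_hello)
```

Restarting the notebook kernel and importing again has the same effect as reloading.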

## Scripting best practices
- Write **functions**, not just loose code.
- Add **comments** to explain your logic.
- Use **docstrings** for your functions.
- Organize reusable code using:
  ```python
  if __name__ == "__main__":
      # This code runs when the script is executed directly
  ```

## (Refresher) What are Python functions?

A function is a chunk of code that you give a name to.

After you name the function, whenever you want to use it, you can call it by name instead of typing or copying and pasting the entire chunk of code.

Using functions can help your code be **neater, faster, and reusable**.

If you have functions that you want to use in multiple scripts or notebooks, you can save them. Here are a few times when you might want to do that.

For reusability:
- You wrote code to create custom visualizations and want to apply it to multiple data sets
- You frequently work with the same file type, and you have code to clean it or to extract certain data from it
- You have code to check a file or dataset for certain qualities before you use it
- You coded equations or algorithms
- You find yourself frequently doing the exact same thing

For organization/presentation:
- Your notebook is getting long and slow and messy, and you want to clean it up by saving the function definitions in a different file
- You want to use your notebook to display visualizations to an advisor or colleague and they don't need to see all the code behind the scenes


By default, scripts run code from top to bottom. Of course, when you run a function definition:

In [None]:
def makeFancy(a_string):
    '''This function takes a string and returns a fancier string'''
    new_list = a_string.upper().split()
    new_string = "**".join(new_list)
    return new_string

... your computer only stores it in memory, ready to use it when it eventually gets called:

In [None]:
makeFancy("This workshop is awesome.")

Let's take a look at a Python script with a function we can import. Open the script `example01b_helloworld.py` in your text editor.

Let's walk through the file together and use it here in the notebook.

In [None]:
import example01b_helloworld

To see what functions are available in an imported module, we can use the `dir()` function:

In [None]:
dir(example01b_helloworld)
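
The list returned by `dir()` also includes names that begin with an underscore, which Python adds automatically. A small sketch of one way to filter those out is below; I use the standard `math` module as a stand-in, but the same idea should work for any module you import.

```python
import math  # stand-in for any imported module

# dir() returns every name in the module, including the underscore ones
all_names = dir(math)

# Keep only the names that do not start with an underscore
public_names = [name for name in all_names if not name.startswith("_")]
print(public_names)
```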

If you ever forget which arguments are required by a function, you can use `?`:

In [None]:
example01b_helloworld.greet?
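
Note that `?` only works in notebooks and IPython. In a plain Python script or terminal session, the built-in `help()` function and the standard `inspect` module give similar information. The sketch below uses a made-up `greet` function as a stand-in (the real one in `example01b_helloworld.py` may look different):

```python
import inspect

def greet(name):
    """Return a greeting for the given name (hypothetical stand-in)."""
    return f"Hello {name}!"

# Show the arguments the function expects
print(inspect.signature(greet))

# Show the docstring, cleaned of extra indentation
print(inspect.getdoc(greet))
```

Calling `help(greet)` prints the same signature and docstring together.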

Now let's use the `greet` function:

In [None]:
example01b_helloworld.greet("John Doe")

## Modules

When your script only contains functions for importing into other scripts and notebooks, you can refer to it as a **module**.

In general, when you import a script the Python interpreter will read it from top to bottom, storing functions along the way and executing any "loose" code.  Sometimes, like when you're importing individual functions from your script into a notebook, you may not want the script to produce output on import. Other times, you might have a script that you want to be able to use both ways - in its entirety from the command line AND calling individual functions into another script or notebook.

To get around this problem, we're going to include a special line at the end of our code:

  ```python
  if __name__ == "__main__":
      # This code runs when the script is executed directly
  ```

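This is a conditional statement that allows you to define code (indented inside the `if` block) that runs only when the file is executed as a script, not when it’s imported as a module.  It works because Python sets the special variable `__name__` to `"__main__"` when a file is run directly, and to the module's name when the file is imported.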
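
To see the two behaviors side by side, here is a rough sketch that writes a small throwaway script, imports it, and then runs it as a separate program. The file name `demo_guard.py` and its `greet` function are invented for this illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Contents of a throwaway script with a guarded block (hypothetical example)
code = '''
def greet(name):
    return f"Hello {name}!"

if __name__ == "__main__":
    print(greet("World"))
'''

tmp_dir = tempfile.mkdtemp()
script = Path(tmp_dir) / "demo_guard.py"
script.write_text(code)

# 1. Import it: the guarded block is skipped, so nothing is printed
sys.path.insert(0, tmp_dir)
import demo_guard
print("name when imported:", demo_guard.__name__)

# 2. Run it as a script: the guarded block runs and prints the greeting
result = subprocess.run(
    [sys.executable, str(script)], capture_output=True, text=True
)
print("output when run directly:", result.stdout.strip())
```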

Let's look at `example01c_helloworld.py`.  We can use it here, and also in the terminal.

In [None]:
# import the module (notice that there is no output)
import example01c_helloworld

In [None]:
# use the function here
example01c_helloworld.greet("John Doe")

Now run this in the terminal with:

```bash
python example01c_helloworld.py
```

(This should output `Hello World!`)

## Anatomy of a (good) script

Your script should have the following elements, in this order:
1. sufficient comments (e.g., a docstring) at the top of the code so that a user knows what the script does and what is included.  
2. any import statements
3. any hard-coded variables (e.g., file names); we will do this later
4. all functions (each should have its own descriptive docstring and comments)
5. the `__main__` section
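
As a rough template, a script following this order might look like the sketch below. Everything in it (the constant, the function name, and what it computes) is made up for illustration and is not taken from our example files.

```python
"""Scale a list of numbers by a fixed factor (hypothetical example script).

Run from the command line with:  python my_script.py
"""

# 2. import statements
import sys

# 3. hard-coded variables
SCALE_FACTOR = 2
DEFAULT_VALUES = [1, 2, 3]

# 4. functions
def scale_values(values, factor=SCALE_FACTOR):
    """Return a new list with each value multiplied by factor."""
    return [value * factor for value in values]

# 5. the __main__ section
if __name__ == "__main__":
    scaled = scale_values(DEFAULT_VALUES)
    print(f"Running {sys.argv[0]}: the scaled values are {scaled}.")
```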

Let's look at `example02_arithmetic.py`.


In [None]:
# import the arithmetic module
import example02_arithmetic 

In [None]:
# define a list and use the print_sum function
my_list = [1,2,3,4,5]
s, s_sentence = example02_arithmetic.print_sum(my_list)
print(s_sentence)

Note that each of my functions returns two values as a tuple: the numerical result and a sentence.
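
If returning two values at once is new to you, here is a small sketch of the pattern. The function `sum_with_sentence` is a hypothetical stand-in that I wrote for this illustration; it is not the code inside `example02_arithmetic.py`.

```python
def sum_with_sentence(values):
    """Return the sum of values and a sentence describing it (hypothetical)."""
    total = sum(values)
    return total, f"The sum of {values} is {total}."

# Without unpacking, the two values come back packed together in one tuple
result = sum_with_sentence([1, 2, 3])
print(type(result))

# With unpacking, each value gets its own name
total, sentence = sum_with_sentence([1, 2, 3])
print(sentence)

# "_" is a conventional name for a value you do not plan to use
total_only, _ = sum_with_sentence([1, 2, 3])
print(total_only)
```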

## <span style="color:#4E2A84">**Exercise:**</span>

Use the `dir()` function to see what this `example02_arithmetic` module contains.  Also use `?` to show information about the `print_mean` function.

Create your own list and use the `print_mean` function to print the mean value.

*Bonus*: multiply the numerical mean value by 2, and print that result.

In [None]:
# your code goes here

### Output in the terminal vs. notebook

Let's look at another example that will provide a description of a pandas DataFrame: `example03_describe.py`.

First, run the script from the command line.  (You should see output in the terminal.)

Next, let's run it from inside the notebook.

In [None]:
# note that we can also just import the function(s) rather than the entire module
from example03_describe import describe_data

# I am importing pandas here because I want to use it below.
# However, in this case it is not strictly necessary because example03_describe.py imports pandas.
# Still, it is good practice to be explicit in your coding.
import pandas as pd 

Notice how I imported `describe_data` directly.  Now I can use the `describe_data` function here without needing to type `example03_describe.describe_data`.  This shortens your code, but it also hides some information (i.e., the original script name), and, importantly, you could encounter naming conflicts if another script defined its own `describe_data` function.  In our case this is not a concern, but it is something to think about for your future use.

Also note that if there are multiple functions, you can import them all with 

```python
from example03_describe import *
```
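
The naming conflict mentioned above is easy to reproduce with two standard library modules that both define a function called `sqrt`. This is only an illustration with `math` and `cmath`; the same thing can happen with any two scripts that share a function name.

```python
import cmath
import math

from math import sqrt
print(sqrt(4))  # math's version: 2.0

from cmath import sqrt  # silently replaces the name "sqrt"
print(sqrt(4))  # cmath's version: (2+0j)

# Keeping the module name in front avoids the clash
print(math.sqrt(4), cmath.sqrt(4))
```

Python gives no warning when the second import replaces the first, which is why the explicit `module.function` form is often the safer choice.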

Finally, note that you should have seen output when you ran the code in the terminal, but you do not see output here.  This is because the output you saw from the terminal is contained within:

```python
if __name__ == "__main__":
```

Now let's create a basic DataFrame and use `describe_data`:

In [None]:
# Create a basic DataFrame for testing
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 92, 78]
}
df = pd.DataFrame(data)

# send that DataFrame to our function 
describe_data(df)

## <span style="color:#4E2A84">**Exercise:**</span>

Read in the `plots_of_land_1.csv` file with `pandas` to create a `DataFrame`, and use the `describe_data` function to print a description of your `DataFrame`.

In [None]:
# your code goes here


## Hardcoding vs. User Input


### Hardcoding

Sometimes you may want to hardcode a variable into your script.  In `example04a_acre.py` I hardcoded the file name.  Let's look at this script.

## <span style="color:#4E2A84">**Exercise:**</span>

Run `example04a_acre.py` from the command line. (You should see many lines printed to the terminal, each with an address and number of acres.)

### User Input from `sys`

Other times, it may make more sense to allow the user to specify the value of a variable from the command line (or another input method) rather than hardcoding.  This can be important if, for instance, the script is used within a pipeline or other task that requires the same function to be applied to many files.

Let's look at `example04b_acre.py`.  In this script, we use the `sys` module to get the user input from the command line.  To define the file for our script, write the file name after the script name in your command:  

```bash
python example04b_acre.py plots_of_land_1.csv
```

The file name will be stored in the `sys.argv` list.  We access the file name string inside our script with the Python code `sys.argv[1]`.
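
Here is a minimal sketch of this pattern, including a check for a missing argument. The function name `get_filename` and the messages are my own inventions for illustration; `example04b_acre.py` may handle this differently.

```python
import sys

def get_filename(argv):
    """Return the file name from the argument list, or None if it is missing."""
    # Fewer than two entries means no file name was supplied
    if len(argv) < 2:
        return None
    return argv[1]

if __name__ == "__main__":
    filename = get_filename(sys.argv)
    if filename is None:
        print("Please provide a file name after the script name.")
    else:
        print(f"The file to process is {filename}.")
```

This sketch is meant to be saved as a script and run from the terminal. Inside a notebook, `sys.argv` holds the arguments used to start the notebook kernel, not arguments of your own.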

Run the bash command above in your terminal to execute our `example04b_acre.py` script on the file `plots_of_land_1.csv`


## <span style="color:#4E2A84">**Exercise:**</span>

Try running the script without an argument and see what happens.  (You should see an error message that I defined in the script.)

Next, run the `example04b_acre.py` script on the file `plots_of_land_2.csv`.

## <span style="color:#4E2A84">**Exercise:**</span>

Write your own simple script that prints out the `sys.argv` list.  Run the script with any number of arguments to see the result.  What do you notice about the first element in the `sys.argv` list?

### User Input from `argparse`

The method above provides a basic way to send arguments to your script via the command line, but we can do better with the `argparse` module ([see documentation here](https://docs.python.org/3/library/argparse.html)).  Some important features of the module:
- you can easily provide help text for the user
- the user can provide arguments in any order
- you can define default values
- error handling is built in

We will not cover all the features of `argparse`, but I encourage you to read the documentation and experiment!

The basic syntax for `argparse` is as follows:
```python
# import the module
import argparse

# define the argument parser
parser = argparse.ArgumentParser(description="Use this space to describe the code.")
# add any arguments you want (here we have two)
parser.add_argument("--input_a", "-a", type=float, default=1.234, help="Input A")
parser.add_argument("--input_b", "-b", type=int, default=42, help="Input B")

# get the user-defined arguments
args = parser.parse_args()

# you can access your arguments with, e.g. 
# a = args.input_a

```
Notice how we can define both a short and a long method for the user to provide these arguments.
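
To try this out without leaving the notebook, you can pass a list of strings to `parse_args()`; the list stands in for what the user would type after the script name. The sketch below reuses the two arguments defined above to show the default values and that the order of the flags does not matter.

```python
import argparse

parser = argparse.ArgumentParser(description="Demonstrate defaults and argument order.")
parser.add_argument("--input_a", "-a", type=float, default=1.234, help="Input A")
parser.add_argument("--input_b", "-b", type=int, default=42, help="Input B")

# No arguments supplied: both values fall back to their defaults
defaults = parser.parse_args([])
print(defaults.input_a, defaults.input_b)

# Short and long forms can be mixed and given in any order
mixed = parser.parse_args(["-b", "7", "--input_a", "2.5"])
print(mixed.input_a, mixed.input_b)
```

Because of `type=float` and `type=int`, the values arrive as numbers, not strings.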


## <span style="color:#4E2A84">**Exercise:**</span>

Write your own script that uses `argparse` to define at least three different user inputs, parses them, and then prints them to the screen.  *Bonus*: perform some operation with these user inputs (e.g., if they are numeric, you could add them together) and print the result.

## <span style="color:#4E2A84">**Exercise:**</span>

Now let's take a look at `example04c_acre.py`, where I have replaced the `sys` method with `argparse`.

Run the following commands from your terminal (one at a time):

The following two commands will show the (same) help text:
```bash
python example04c_acre.py -h
```

```bash
python example04c_acre.py --help
```

The following two commands will send the file `plots_of_land_1.csv` to our script and print the result to the screen.
```bash
python example04c_acre.py -f plots_of_land_1.csv
```

```bash
python example04c_acre.py --filename plots_of_land_1.csv
```

This command will print an error.
```bash
python example04c_acre.py
```
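
One common way to get this kind of error with `argparse` is to mark an argument as required. The sketch below is an assumption about how such a parser could be set up, not a copy of `example04c_acre.py`; the description and help text are made up.

```python
import argparse

# Hypothetical parser with a required file name argument
parser = argparse.ArgumentParser(description="Read a CSV file of land plots.")
parser.add_argument("--filename", "-f", type=str, required=True, help="CSV file to read")

# With the argument supplied, parsing succeeds
args = parser.parse_args(["-f", "plots_of_land_1.csv"])
print(args.filename)

# With no arguments, argparse prints a usage message and stops the program
exit_code = None
try:
    parser.parse_args([])
except SystemExit as exit_info:
    exit_code = exit_info.code
print("argparse stopped with exit code", exit_code)
```

The `try`/`except` is only there so the notebook keeps running; in a real script you would normally let `argparse` exit on its own.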

## <span style="color:#4E2A84">**Exercise:**</span>

Run the `example04c_acre.py` script on the file `plots_of_land_2.csv`.

## <span style="color:#4E2A84">**Final Exercise:** Analyze a Word List</span>

Let's create a Python script that analyzes a list of words.

## Goal
Write a script that:
- Reads in a file which will contain a list of words, expecting one word per line.
- Prints out the list of words as part of a complete sentence.
- Finds and prints:
  1. The longest word
  2. The number of words that start with a vowel
  3. A list of all words in uppercase, printed one word per line.
- Each of the printed outputs should be written as a complete sentence.
- *BONUS*: Rather than printing all words in uppercase (task 3 above), add a user-defined argument for the number of words to print (e.g., if the user supplies 2, you only print the first 2 words in your list). 

## Instructions

1. Create a file called `word_analysis.py`.

2. In this file:
   - Define **at least one function** that returns the outputs listed above.
   - Use the `argparse` module to get the filename (and any other arguments you need) from the user.
   - Use `print()` statements in the `__main__` section to display results when the script is run directly.
   - Add **helpful comments** and **docstrings** to tell users about your script, what each function does and what inputs are required, and how to run your script.
   - Include the `if __name__ == "__main__":` section so that your script outputs the results when the user runs it from the command line.
3. Run your script from the **command line** to apply your function(s) to the file `words.txt`.
4. Run your script from a **notebook** to apply your function(s) to the file `words.txt`.


In [None]:
# Your code here