# The Life and Anatomy of a Python Script

Before we dive into the nitty-gritty of the Python programming language, we're going to discuss what Python code looks like, how you can write and run it, and what you can do with it.

The first few times that I tried to learn Python — yes, it took me multiple tries! — it felt like learning a bunch of made-up rules about an imaginary universe. Turns out Python *is* kind of like an imaginary universe with made-up rules, and that's part of what makes it fun. But it can also make learning Python difficult if you don't really know what the imaginary universe looks like, or how it functions, or how it relates to your universe and your specific goals — such as doing text analysis or making a Twitter bot or creating a network visualization.

We're going to demonstrate what Python looks like in action, so you can get a feel for its structure and flow. Don't get too bogged down in the details for now. Just try to get a sense — at an abstract level — of how Python works and how you might use it.

<img src="https://cdn.pixabay.com/photo/2017/01/31/23/21/animal-2028134_960_720.png" alt="The command line" width="100%">

# Example Python Code — Counting Words in a Text File

Below is a chunk of Python code. These lines, when put together, do something simple yet important. They count and display the most frequent words in a text file. The example below specifically counts and displays the 40 most frequent words in Charlotte Perkins Gilman's short story "The Yellow Wallpaper" (1892).

In [2]:
import re
from collections import Counter
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

filepath_of_text = "../texts/literature/The-Yellow-Wallpaper.txt"
nltk_stop_words = stopwords.words("english")
number_of_desired_words = 40

full_text = open(filepath_of_text).read()

all_the_words = split_into_words(full_text)
meaningful_words = [word for word in all_the_words if word not in nltk_stop_words]
meaningful_words_tally = Counter(meaningful_words)
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)

most_frequent_meaningful_words

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/melaniewalsh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[('john', 45),
 ('one', 33),
 ('said', 30),
 ('would', 27),
 ('get', 24),
 ('see', 24),
 ('room', 24),
 ('pattern', 24),
 ('paper', 23),
 ('like', 21),
 ('little', 20),
 ('much', 16),
 ('good', 16),
 ('think', 16),
 ('well', 15),
 ('know', 15),
 ('go', 15),
 ('really', 14),
 ('thing', 14),
 ('wallpaper', 13),
 ('night', 13),
 ('long', 12),
 ('course', 12),
 ('things', 12),
 ('take', 12),
 ('always', 12),
 ('could', 12),
 ('jennie', 12),
 ('great', 11),
 ('says', 11),
 ('feel', 11),
 ('even', 11),
 ('used', 11),
 ('dear', 11),
 ('time', 11),
 ('enough', 11),
 ('away', 11),
 ('want', 11),
 ('never', 10),
 ('must', 10)]

Maybe you could have guessed these metrics without the help of a computer. Maybe not. Maybe you could make an interesting argument about the story with these metrics. Maybe not.

Calculating word frequency is a very basic form of computational text analysis and often not terribly interesting on its own, especially with a single short text. However, calculating word frequency *is* important, and it's at the center of most text analysis approaches, even far more complicated ones. That's why we're starting with it.

> 🗣️Heads up! This is just *one* way to count words in a text file, not the one *right* way. There is no *right* way to count words in a text file or to do anything else in Python.

>Rather than asking "Is this code *right*?", you want to ask:
>- "Is this code *efficient*?"
>- "Is this code *readable*?"
>- "Does this code *help me accomplish my goal*?"

>Sometimes you'll prioritize one concern over another. Maybe your code isn't as efficient as humanly possible, but if it gets the job done, and you understand it, then you might not care about maximum efficiency. I wrote the code above to help illustrate a number of key Python concepts, but there are many other ways I could have written it.

# The Anatomy of a Python Script

## Import Libraries/Packages/Modules

In [133]:
import re
from collections import Counter
from nltk.corpus import stopwords

Ready for some great Python news? You don't have to code everything by yourself from scratch! Many other people have written Python code that you can `import` into your own code, which will save you time and often do a lot of complex, powerful work behind the scenes.

We call the code written and packaged up by other people a "library," "package," or "module." We'll talk more about them in a couple weeks. For now simply know that you `import` libraries/packages/modules at the very top of a Python script for later use.

- [`Counter`](https://stackabuse.com/introduction-to-pythons-collections-module/) will help me count words
- `re`, short for regular expressions, is basically a fancy find-and-replace that will help me split "The Yellow Wallpaper" into individual words and get rid of trailing punctuation
- [`nltk`](https://www.nltk.org/), a natural language processing library, will give me a list of "stopwords" (but it can do a *lot* more than that)
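To get a rough feel for what two of these imports do, here is a minimal sketch that uses `re` and `Counter` on a short sample phrase. The phrase and the variable names `pieces` and `tally` are made up for illustration and are not part of the script above.

```python
import re
from collections import Counter

# re.split breaks a string apart wherever the pattern matches.
# r"\W+" matches runs of characters that are not letters, digits, or underscores.
pieces = re.split(r"\W+", "creeping, creeping; creeping")
print(pieces)  # ['creeping', 'creeping', 'creeping']

# Counter tallies how many times each item appears in a list.
tally = Counter(pieces)
print(tally["creeping"])  # 3
```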

## Define Functions

In [124]:
def split_into_words(any_chunk_of_text):
    words = re.split(r"\W+", any_chunk_of_text.lower())
    return words 

After `import`ing modules and libraries, you typically `def`ine your "functions." Functions are a nifty way to bundle up code so that you can use it again later. Functions also keep your code neat and tidy.

Here we're making a function called `split_into_words`, which takes in any chunk of text, transforms that text to lower-case, and splits the text into a list of clean words without punctuation or spaces. We're not actually using the function yet. We're just setting it up. We'll talk more about functions in two weeks.
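To see what calling the function might look like, here is a small sketch that runs `split_into_words` on a single sample sentence. The names `sample_sentence` and `sample_words` are assumptions added for illustration.

```python
import re

def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

# A sample sentence, used here only to try the function out
sample_sentence = "It is the strangest yellow, that wall-paper!"
sample_words = split_into_words(sample_sentence)

print(sample_words)
# ['it', 'is', 'the', 'strangest', 'yellow', 'that', 'wall', 'paper', '']
```

Notice the empty string at the end of the list. Because the sentence ends with punctuation, `re.split` leaves an empty piece after the final "!". That is one of the little quirks you may want to clean up in a more careful script.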

## Define Filepaths and Assign Variables

In [9]:
filepath_of_text = "../texts/literature/The-Yellow-Wallpaper.txt"
nltk_stop_words = stopwords.words("english")
number_of_desired_words = 40

Here we establish some values and data that we're going to use later. We'll need the filepath of the short story in order to read it, so we make a variable called `filepath_of_text` and assign it the relative path of the text file, `"../texts/literature/The-Yellow-Wallpaper.txt"`. We also make a variable called `nltk_stop_words` and assign it the list of English stopwords from the `nltk` library. Lastly, we make a variable called `number_of_desired_words`, which will eventually tell the script how many words to display, and we assign it the value `40`. If we changed this value to `50`, then the script would display 50 words instead.

## Read in File

In [12]:
full_text = open(filepath_of_text).read()

The line above opens Charlotte Perkins Gilman's "The Yellow Wallpaper," reads in the short story, and then assigns its text to the variable `full_text`.
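Another common way to read a file is with a `with` statement, which closes the file automatically once the reading is done. The sketch below is one possible variation, not the way the script above does it. So that it can run anywhere, it first creates a small throwaway file; the file name `sample-story.txt` and its one-sentence contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway text file so this sketch runs on any computer
temp_directory = tempfile.mkdtemp()
sample_filepath = os.path.join(temp_directory, "sample-story.txt")
with open(sample_filepath, "w", encoding="utf-8") as file_object:
    file_object.write("John is a physician.")

# Open the file, read its contents into a string, and close it automatically
with open(sample_filepath, encoding="utf-8") as file_object:
    sample_text = file_object.read()

print(sample_text)  # John is a physician.
```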

## Manipulate and Analyze File

In [15]:
all_the_words = split_into_words(full_text)

To count the words in "The Yellow Wallpaper," we need to break the full text into individual words. Above, we call the function `split_into_words`, which we created earlier, and use it to split the `full_text` of the story into individual words. Then we put the words inside a variable called `all_the_words`.

In [16]:
meaningful_words = [word for word in all_the_words if word not in nltk_stop_words]

At this point, we don't really care about tiny words such as "the," "and," and "or," so we're going to remove them from our list. The line of code above makes a new list that includes each word in `all_the_words` only if it does *not* appear in `nltk_stop_words` (in other words, it nixes the stopwords).
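This one-line filter is called a list comprehension. As a rough illustration, the sketch below applies the same pattern to a tiny list and then writes out the longer `for` loop that does the same job. The short lists here, including `sample_stop_words`, are hypothetical stand-ins for the real story and the real `nltk` stopword list.

```python
# Tiny stand-in lists (assumptions for illustration only)
all_the_words = ["the", "paper", "and", "the", "pattern"]
sample_stop_words = ["the", "and", "or"]

# The one-line version, written the same way as in the script
meaningful_words = [word for word in all_the_words if word not in sample_stop_words]

# The same logic spelled out as a loop
meaningful_words_loop = []
for word in all_the_words:
    if word not in sample_stop_words:
        meaningful_words_loop.append(word)

print(meaningful_words)  # ['paper', 'pattern']
print(meaningful_words == meaningful_words_loop)  # True
```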

In [15]:
meaningful_words_tally = Counter(meaningful_words)

Now we're ready to count! We plug `meaningful_words` into our `Counter`, which gives us a tally of how many times each word appears in the story. Then we put the tally into a new variable called `meaningful_words_tally`. This kind of tally is a special type of "dictionary," which we will discuss two weeks from now.
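As a small preview of what "dictionary" means here, the sketch below builds a tally from a handful of sample words and then looks up individual counts by word. The word list and the names `sample_words` and `sample_tally` are made up for illustration.

```python
from collections import Counter

sample_words = ["paper", "pattern", "paper", "room", "paper"]
sample_tally = Counter(sample_words)

# Look up the count for a word, the way you would look up a key in a dictionary
print(sample_tally["paper"])  # 3

# A word that never appears gets a count of 0 rather than an error
print(sample_tally["john"])  # 0

# Converting to a plain dictionary shows each word paired with its count
print(dict(sample_tally))  # {'paper': 3, 'pattern': 1, 'room': 1}
```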

## Output Results

In [13]:
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)

Lastly, we pull out the 40 most frequently occurring words from our complete tally. We make one final variable and grab our top `number_of_desired_words`, which we previously set to `40`.
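Here is a minimal sketch of `most_common()` on a tiny tally, assuming we only want the top 2 words. The sample words are made up for illustration. The result is a list of (word, count) pairs, ordered from most to least frequent, which is the same shape as the output of the full script.

```python
from collections import Counter

sample_tally = Counter(["paper", "pattern", "paper", "room", "paper", "pattern"])
number_of_desired_words = 2

# Grab the 2 most frequent words along with their counts
sample_top_words = sample_tally.most_common(number_of_desired_words)

print(sample_top_words)  # [('paper', 3), ('pattern', 2)]
```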

In [16]:
print(most_frequent_meaningful_words)

[('john', 45), ('one', 33), ('said', 30), ('would', 27), ('get', 24), ('see', 24), ('room', 24), ('pattern', 24), ('paper', 23), ('like', 21), ('little', 20), ('much', 16), ('good', 16), ('think', 16), ('well', 15), ('know', 15), ('go', 15), ('really', 14), ('thing', 14), ('wallpaper', 13), ('night', 13), ('long', 12), ('course', 12), ('things', 12), ('take', 12), ('always', 12), ('could', 12), ('jennie', 12), ('great', 11), ('says', 11), ('feel', 11), ('even', 11), ('used', 11), ('dear', 11), ('time', 11), ('enough', 11), ('away', 11), ('want', 11), ('never', 10), ('must', 10)]


Then we display the results with `print()`, a built-in Python function for displaying data. 

## Comments

The cell below shows the script again with explanations. These are called comments. Lines that begin with a hash symbol `#` are excluded from the running code, so you can use them to write notes or instructions to yourself or others. If you want to write a multi-line comment, you can put it between triple quotation marks `""" """` (technically this makes a string that the script never uses, so it works like a comment).

In [7]:
"""
Example Python code for
calculating word frequency
in a text file
"""

# Import Libraries and Modules


import re
from collections import Counter
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Define Functions

def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

# Define Filepaths and Assign Variables

filepath_of_text = "../texts/literature/The-Yellow-Wallpaper.txt"
nltk_stop_words = stopwords.words("english")
number_of_desired_words = 40

# Read in File

full_text = open(filepath_of_text).read()

# Manipulate and Analyze File

all_the_words = split_into_words(full_text)
meaningful_words = [word for word in all_the_words if word not in nltk_stop_words]
meaningful_words_tally = Counter(meaningful_words)
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)

# Output Results

most_frequent_meaningful_words

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/melaniewalsh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[('john', 45),
 ('one', 33),
 ('said', 30),
 ('would', 27),
 ('get', 24),
 ('see', 24),
 ('room', 24),
 ('pattern', 24),
 ('paper', 23),
 ('like', 21),
 ('little', 20),
 ('much', 16),
 ('good', 16),
 ('think', 16),
 ('well', 15),
 ('know', 15),
 ('go', 15),
 ('really', 14),
 ('thing', 14),
 ('wallpaper', 13),
 ('night', 13),
 ('long', 12),
 ('course', 12),
 ('things', 12),
 ('take', 12),
 ('always', 12),
 ('could', 12),
 ('jennie', 12),
 ('great', 11),
 ('says', 11),
 ('feel', 11),
 ('even', 11),
 ('used', 11),
 ('dear', 11),
 ('time', 11),
 ('enough', 11),
 ('away', 11),
 ('want', 11),
 ('never', 10),
 ('must', 10)]

# The Life of a Python Script

## Jupyter Notebook / JupyterLab

The primary way that we're going to write and run Python in this class is through JupyterLab and Jupyter notebooks. As we've already covered, Jupyter notebooks are documents that can combine live code, explanatory text, and nice displays of data, which makes them great for teaching and learning. But they're also a fully functional way to run Python. By running a cell of Python code in a Jupyter notebook, you can read files from your computer and write files to your computer, make and save a bar chart, gather data from YouTube and Spotify, programmatically tweet from a Twitter bot account, and more!

In [29]:
def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

filepath_of_text = "../texts/literature/The-Yellow-Wallpaper.txt"
nltk_stop_words = stopwords.words("english")
number_of_desired_words = 40

full_text = open(filepath_of_text).read()

all_the_words = split_into_words(full_text)
meaningful_words = [word for word in all_the_words if word not in nltk_stop_words]
meaningful_words_tally = Counter(meaningful_words)
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)

with open("most-frequent-words-Yellow-Wallpaper.txt", "w") as file_object:
    file_object.write(str(most_frequent_meaningful_words))

By adding two lines of code at the bottom of our script, we can output the most frequent words from "The Yellow Wallpaper" into a text file.
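To check that a file really was written, you can read it back in. The sketch below does this with a short made-up list of results, and it writes to a temporary folder so that it doesn't clutter your computer. The names `sample_results`, `output_filepath`, and `saved_text` are assumptions for illustration.

```python
import os
import tempfile

# A short stand-in for most_frequent_meaningful_words
sample_results = [("john", 45), ("one", 33)]

# Write to a temporary folder rather than the current directory
output_filepath = os.path.join(tempfile.mkdtemp(), "most-frequent-words-sample.txt")

# "w" opens the file for writing; str() turns the list into text
with open(output_filepath, "w", encoding="utf-8") as file_object:
    file_object.write(str(sample_results))

# Read the file back in to confirm what was saved
with open(output_filepath, encoding="utf-8") as file_object:
    saved_text = file_object.read()

print(saved_text)  # [('john', 45), ('one', 33)]
```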

## Text Editor —> Command Line

You can also run a Python script by writing it in a text editor and then running it from the command line.

If you copy and paste the code above into a simple text editor (like TextEdit or Notepad) and name the file with the extension ".py" (the file extension for Python code), you should be able to run the script from your command line.

<img src="../images/Python-plain-text.png" width=100%>

All you need to do is call `python` with the name of the Python file (and also make sure that the script includes the correct file path location for the short story file).

In [8]:
!python word_frequency_Yellow_Wallpaper.py

[('john', 45), ('one', 33), ('said', 30), ('would', 27), ('get', 24), ('see', 24), ('room', 24), ('pattern', 24), ('paper', 23), ('like', 21), ('little', 20), ('much', 16), ('good', 16), ('think', 16), ('well', 15), ('know', 15), ('go', 15), ('really', 14), ('thing', 14), ('wallpaper', 13), ('night', 13), ('long', 12), ('course', 12), ('things', 12), ('take', 12), ('always', 12), ('could', 12), ('jennie', 12), ('great', 11), ('says', 11), ('feel', 11), ('even', 11), ('used', 11), ('dear', 11), ('time', 11), ('enough', 11), ('away', 11), ('want', 11), ('never', 10), ('must', 10)]


<img src="../images/Python-Atom.png" width=100%>

Though it's possible to write Python in TextEdit, it's not very common, because it's a pain. It's much more common to write Python code in a text editor like Atom, as shown above. You can see that there's all sorts of formatting and functionality that make writing code faster and easier.

You can also write Python scripts so that they work with different files, or with any file you want. With a few small alterations, our word frequency script can crunch numbers for *Grimms' Fairy Tales*...

In [3]:
!python word_frequency.py ../texts/literature/Grimms-Fairy-Tales.txt

[('said', 911), ('little', 498), ('one', 454), ('king', 438), ('went', 385), ('came', 359), ('go', 266), ('away', 239), ('old', 233), ('man', 225), ('good', 211), ('took', 210), ('two', 209), ('woman', 199), ('saw', 193), ('could', 193), ('come', 186), ('time', 184), ('day', 180), ('would', 178), ('well', 177), ('_', 163), ('home', 163), ('back', 161), ('shall', 159), ('eyes', 157), ('three', 153), ('daughter', 150), ('mother', 145), ('house', 142), ('thought', 139), ('must', 139), ('forest', 138), ('great', 136), ('cried', 134), ('take', 134), ('long', 133), ('door', 132), ('nothing', 130), ('let', 130)]


or Louisa May Alcott's *Little Women*...

In [4]:
!python word_frequency.py ../texts/literature/Little-Women.txt

[('jo', 1416), ('one', 903), ('said', 841), ('little', 773), ('meg', 704), ('amy', 670), ('laurie', 615), ('like', 607), ('beth', 496), ('good', 480), ('would', 441), ('see', 425), ('go', 402), ('old', 396), ('mother', 388), ('much', 377), ('never', 375), ('well', 372), ('could', 360), ('away', 343), ('mr', 334), ('time', 332), ('march', 326), ('know', 322), ('made', 305), ('home', 288), ('young', 286), ('girls', 282), ('come', 280), ('day', 280), ('think', 280), ('say', 277), ('came', 275), ('went', 275), ('dear', 266), ('face', 265), ('got', 260), ('make', 258), ('mrs', 257), ('asked', 256)]


or any other text your heart desires!
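What might those "few small alterations" look like? The sketch below is one guess at a `word_frequency.py`, not the actual file used above. It reads the filepath from `sys.argv`, which holds whatever you type after `python` on the command line. Two things here are assumptions made so that the sketch runs without any downloads: `SAMPLE_STOP_WORDS` is a tiny stand-in for the `nltk` stopword list, and the function name `most_frequent_words` is invented for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: a tiny stand-in stopword list so the sketch runs without nltk.
# The script in this chapter uses stopwords.words("english") instead.
SAMPLE_STOP_WORDS = {"the", "and", "or", "a", "of", "to", "in", "it", "is"}

def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

def most_frequent_words(filepath_of_text, number_of_desired_words=40):
    # Read in the file
    with open(filepath_of_text, encoding="utf-8") as file_object:
        full_text = file_object.read()
    # Split, drop stopwords (and any empty strings left by the split), and tally
    all_the_words = split_into_words(full_text)
    meaningful_words = [word for word in all_the_words
                        if word and word not in SAMPLE_STOP_WORDS]
    meaningful_words_tally = Counter(meaningful_words)
    return meaningful_words_tally.most_common(number_of_desired_words)

# sys.argv[0] is the name of the script; sys.argv[1] is the first thing typed after it.
# This part only runs when the script is given exactly one filepath.
if __name__ == "__main__" and len(sys.argv) == 2:
    print(most_frequent_words(sys.argv[1]))
```

Because the filepath now comes from the command line instead of being written into the script, the same file can be pointed at one text after another without any editing.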