# Word Frequency Counter

Given the following string as input data, complete the described tasks.

In [None]:
text = "Python is great and Python is powerful but Python requires practice and dedication to master Python programming"

## Convert the string into a list of words, all lowercase.

In [None]:
words = text.lower().split()
print(words)

## Count occurrences of each unique word.

In [None]:
word_counts = {}

for word in words:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

print(word_counts)
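
The counting loop above is worth writing by hand once. As a point of comparison, here is a minimal sketch of the same tally using `collections.Counter` from the standard library. The name `counter` and the lookup of `"java"` are illustrative only and are not part of the exercise.

```python
from collections import Counter

text = "Python is great and Python is powerful but Python requires practice and dedication to master Python programming"
words = text.lower().split()

# Counter is a dict subclass that tallies hashable items
counter = Counter(words)

# The manual loop from the cell above, for comparison
word_counts = {}
for word in words:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

print(counter["python"])  # count for a word that is present
print(counter["java"])    # a missing word reports 0 instead of raising KeyError
```

Both approaches produce the same mapping; `Counter` mainly saves the `if`/`else` bookkeeping.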

## Find words longer than 5 characters.

In [None]:
long_words = []
for word in words:
    if len(word) > 5:
        long_words.append(word)

print(long_words)

A very common pattern in programming: build a new container from an existing one by filtering its items, transforming them, or both.

It is so common that Python offers a shorthand for this loop structure, called a *comprehension*.

The cell above, rewritten as a list comprehension (with each word capitalized):

In [None]:
[word.title() for word in words if len(word) > 5]
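
To make the mapping between the two forms explicit, the sketch below puts the loop and the comprehension side by side on a shortened word list (an assumption made to keep the output small). It also shows that the same pattern with braces builds a set or a dict. The variable names are illustrative.

```python
words = "python is great and python is powerful".split()

# Loop form: filter, then transform, then append
long_titles_loop = []
for word in words:
    if len(word) > 5:
        long_titles_loop.append(word.title())

# Comprehension form: [transform  for item in source  if filter]
long_titles = [word.title() for word in words if len(word) > 5]

# The same pattern with braces builds a set or a dict
unique_long = {word for word in words if len(word) > 5}
lengths = {word: len(word) for word in words}

print(long_titles)
print(unique_long)
print(lengths)
```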

## Show the unique long words, sorted alphabetically

In [None]:
unique_long_words_sorted = sorted(set(long_words))  # sorted() accepts any iterable and returns a list
print(unique_long_words_sorted)

## Display word counts sorted alphabetically

In [None]:
keys = sorted(word_counts)  # iterating over a dict yields its keys

for key in keys:
    print(f"word: {key}\t\tcount: {word_counts[key]}")
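
One possible variation, sketched on a smaller dictionary (the sample counts are an assumption for brevity): sort the `(word, count)` pairs directly, and use an f-string width specifier instead of tabs so the columns line up regardless of word length.

```python
word_counts = {"python": 4, "is": 2, "and": 2, "great": 1}

# sorted() on items() orders the (word, count) tuples by word first
sorted_pairs = sorted(word_counts.items())

lines = []
for word, count in sorted_pairs:
    # :<10 left-aligns the word in a 10-character column
    lines.append(f"word: {word:<10} count: {count}")

print("\n".join(lines))
```

The width of 10 is arbitrary; `max(len(w) for w in word_counts)` would size the column to the longest word.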

## Display word counts sorted by frequency (tricky)

As always, there are many ways to solve this problem in Python. Here are a few.

### Build sorted list manually

Similar to the approach used for alphabetical sort, above.

In [None]:
# Find all unique counts and sort them
counts = sorted(set(word_counts.values()))

# Build list by iterating through counts in order
sorted_items = []
for count in counts:
    for word, word_count in word_counts.items():
        if word_count == count:
            sorted_items.append((word, count))

for word, count in sorted_items:
    print(f"word: {word}\t\tcount: {count}")

### Implement "bubble" sort

A classic CS algorithm. (Not something you need to know for this class.)

```text
BUBBLE SORT (sorting word-count pairs by count):

Repeat for each position in the list:
    Go through the list from start to the unsorted portion:
        Look at the current pair and the next pair
        If current pair's count is bigger than next pair's count:
            Swap them
        
    After this pass, the largest count has "bubbled" to its correct spot
    The sorted portion at the end grows by one
    
List is now sorted from smallest to largest count
```

In [None]:
items = list(word_counts.items())

# Bubble sort by frequency (second element of tuple)
for i in range(len(items)):
    for j in range(len(items) - 1 - i):
        if items[j][1] > items[j + 1][1]:  # Compare counts
            items[j], items[j + 1] = items[j + 1], items[j]

for word, count in items:
    print(f"word: {word}\t\tcount: {count}")
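
As a sanity check, the sketch below wraps the same bubble sort in a function and compares its result with the built-in `sorted`. The function name `bubble_sort_by_count`, the sample pairs, and the early-exit flag are additions for illustration and are not part of the cell above.

```python
def bubble_sort_by_count(pairs):
    """Return a new list of (word, count) pairs sorted by count, smallest first."""
    items = list(pairs)  # copy so the caller's list is left untouched
    for i in range(len(items)):
        swapped = False
        for j in range(len(items) - 1 - i):
            if items[j][1] > items[j + 1][1]:  # compare counts
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # a full pass with no swaps means the list is sorted
            break
    return items


pairs = [("python", 4), ("is", 2), ("great", 1), ("and", 2)]
result = bubble_sort_by_count(pairs)
print(result)

# Both sorts are stable, so pairs with equal counts keep their original order
print(result == sorted(pairs, key=lambda p: p[1]))
```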

This approach is not very ***Pythonic***.
What is Pythonic?

In [None]:
import this

### Swap tuple order and sort

Clever.

In [None]:
# Create list with count first, word second
swapped_items = [(count, word) for word, count in word_counts.items()]
swapped_items.sort()  # Tuples compare element by element: count first, then word breaks ties

for count, word in swapped_items:
    print(f"word: {word}\t\tcount: {count}")
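
The trick works because Python compares tuples element by element. A small sketch on sample data (the counts are an assumption for brevity) shows the tie-breaking, and how `reverse=True` puts the most frequent word first:

```python
word_counts = {"python": 4, "is": 2, "and": 2, "great": 1}

# Count first, word second; ties on count fall back to alphabetical order
swapped_items = sorted((count, word) for word, count in word_counts.items())
print(swapped_items)

# reverse=True flips the whole ordering, including the alphabetical tie-break
most_first = sorted(swapped_items, reverse=True)
print(most_first)
```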

### Use `key` option in `sort` with custom function

`key` lets us specify how to sort. For example:

In [None]:
# Example 1: Sort strings by their length using len()
# (named `animals` so the `words` list built earlier is not overwritten)
animals = ["cat", "elephant", "a", "dog", "butterfly"]

# Without key - sorts alphabetically
animals.sort()
print(animals)

# With key - sorts by length
animals = ["cat", "elephant", "a", "dog", "butterfly"]
animals.sort(key=len)
print(animals)

In [None]:
# Example 2: Sort numbers by absolute value using abs()
numbers = [-10, 5, -3, 8, -1]

# Without key - sorts normally
numbers.sort()
print(numbers)

# With key - sorts by absolute value
numbers = [-10, 5, -3, 8, -1]
numbers.sort(key=abs)
print(numbers)

`key` accepts any function, including one we write ourselves.

In [None]:
def by_freq(t):
    return t[1]

items = list(word_counts.items())
items.sort(key=by_freq)

for word, count in items:
    print(f"word: {word}\t\tcount: {count}")
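
For a one-off key function, two common alternatives to a named `def` are a `lambda` and `operator.itemgetter`. The sketch below uses sample counts (an assumption for brevity) and also shows a tuple-valued key that ranks the most frequent words first, with ties in alphabetical order.

```python
from operator import itemgetter

word_counts = {"python": 4, "is": 2, "and": 2, "great": 1}
items = list(word_counts.items())

# Equivalent to by_freq: pick out the count from each (word, count) pair
by_count_lambda = sorted(items, key=lambda t: t[1])
by_count_getter = sorted(items, key=itemgetter(1))

# Negating the count sorts largest first; the word breaks ties alphabetically
ranked = sorted(items, key=lambda t: (-t[1], t[0]))

for word, count in ranked:
    print(f"word: {word}\t\tcount: {count}")
```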

## Identify the most / least common words

In [None]:
least_common_count = min(word_counts.values())
most_common_count = max(word_counts.values())

least_common_words = []
most_common_words = []

for word, count in word_counts.items():
    if count == least_common_count:
        least_common_words.append(word)
    if count == most_common_count:
        most_common_words.append(word)

print("least common:", least_common_words)
print("most common: ", most_common_words)
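
For comparison, `collections.Counter` from the standard library provides `most_common`, which returns `(word, count)` pairs from most to least frequent. This is a sketch rather than a replacement for the loop above: `most_common(1)` returns a single pair even when several words tie for first place, so the loop is still the safer choice when ties matter.

```python
from collections import Counter

text = "Python is great and Python is powerful but Python requires practice and dedication to master Python programming"
counter = Counter(text.lower().split())

# The three highest counts, as (word, count) pairs
top_three = counter.most_common(3)
print(top_three)

# Every word that shares the lowest count
lowest = min(counter.values())
least_common_words = [word for word, count in counter.items() if count == lowest]
print(least_common_words)
```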