# Cleaning up code

* collect imports on top
* make sure the code runs from top to bottom (esp. in Jupyter Notebooks)
* odentify paragraphs of code that belong together (5-20 lines)
     * make functions out of these
* move all functions below import section

In [1]:
import requests
import re
from collections import Counter
from pprint import pprint

A function is a miniature program, a piece of reusable code. It has:

* input - arguments
* processing - body
* output - return

Two situations:

1. you already have code: start with the body
2. you don't have anything: start with input-output

In [64]:
def download_wikipedia_page(keyword, show_status=False):
    """retrieves the Wikipedia page and returns an HTML string""" #docstring 
    url = f"https://en.wikipedia.org/wiki/{keyword}"
    response = requests.get(url)
    # 2. check the status code
    if show_status:
        print(response.status_code)
    # 3. store the html contents of the page
    return response.text

def write_page(content, keyword, mode='w'):
    """write html contents to a file"""
    filename = f'{keyword}.html'
    with open (filename, mode) as f:
        f.write(content)

def clean_html(content):
    """removes html tags and returns a count of words"""
    tags = "</?\w+[^>]*>"
    text = re.sub(tags, " ", content, re.IGNORECASE)
    words = re.findall("\w+", text, re.IGNORECASE)
    return words

def count_words(content):
    """removes html tags and returns a count of words"""
    words = clean_html(content)
    count = len(words)
    return count

def get_top_words(content, n_words=20):
    words = clean_html(content)
    c = Counter(words)
    return c.most_common(n_words)

In [65]:
cities = ['Berlin', 'Paris', 'Helsinki']

In [66]:
for city in cities:   
    html = download_wikipedia_page(city, show_status=True)
    write_page(html, city)
    c = count_words(html)
    print(city, c)
    print(get_top_words(html, n_words=5))

200
Berlin 127927
[('a', 7168), ('class', 4102), ('href', 3726), ('span', 3492), ('title', 2886)]
200
Paris 192205
[('a', 11532), ('href', 5905), ('class', 5515), ('title', 4428), ('span', 4388)]
200
Helsinki 81949
[('a', 4211), ('class', 2486), ('href', 2178), ('span', 2123), ('title', 1799)]


___

## advanced function writing
### 1 - partial

In [69]:
def my_sum(a, b, c):
    return a + b + c

my_sum(2, 3, 5)

10

In [73]:
from functools import partial

# b is always 7
x = partial(my_sum, b=7)

x(a=3,c=4)

14

In [75]:
y = partial(my_sum, a=1, b=6, c=2)
y()

9

In [77]:
# practical example
from random import randint

D6 = partial(randint, 1, 6)

In [88]:
D6() # rolling a dice

5

### *args and **kwargs

rather useful for programs of at least 1000 lines of code. Makes documentation harder to read.

In [91]:
def print_everything(a, b, *args): # *args allows flexible extra arguments
    print('A', a)
    print('B', b)
    print('*args', args)
    for x in args:
        print('\t', x)

In [92]:
print_everything(77, 42, 11, 22, 33)

A 77
B 42
*args (11, 22, 33)
	 11
	 22
	 33


In [96]:
data = ('ABC', 'DEF', 'GHI')
list(zip(*data)) # poor persons version of df.transpose()

[('A', 'D', 'G'), ('B', 'E', 'H'), ('C', 'F', 'I')]

In [99]:
def print_everything(a, b, *args, **kwargs): # **kwargs allows extra keyword arguments
    print('A', a)
    print('B', b)
    print('*args', args)
    for x in args:
        print('\t', x)
    print('**kwargs', kwargs)
    print(kwargs['allspice'])

In [100]:
print_everything(77, 42, 11, 22, 33, allspice=99, thyme=15)

A 77
B 42
*args (11, 22, 33)
	 11
	 22
	 33
**kwargs {'allspice': 99, 'thyme': 15}
99


In [102]:
# contact point with **kwargs: documentation
import seaborn as sns
df = sns.load_dataset('flights')

In [None]:
df.plot()