## Python Functions

### How should we use them?
- define a docstring at the beginning of your function body 
- functions do only **one** thing (if you want to do two things, then write two functions for it)
            - input -> processing -> output
            
- do not reference/ use global variabes inside your function
- name your functions after verbs or actions

### Why should we use them at all?
- reusability: we don't have to write code multiple times (we don't have to fix a bug several times) - DON'T REPEAT YOURSELF!
- testability: functions are easy to test


In [1]:
import requests
from bs4 import BeautifulSoup

In [15]:
# TODO: scrape all links from google.com

def my_func():
    #implement scraping
    return None

### building up the logic snippets for my function:

In [16]:
test_url = "https://www.google.com/search?q=python+pandas+dataframe+to+csv"

In [17]:
resp = requests.get(test_url)

In [18]:
resp.status_code

200

In [19]:
soup = BeautifulSoup(resp.text)

In [20]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="de">
 <head>
  <meta charset="utf-8"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <title>
   python pandas dataframe to csv - Google Suche
  </title>
  <script nonce="SuXpiGHBVazOhFo+4vqmLg==">
   (function(){
document.documentElement.addEventListener("submit",function(b){var a;if(a=b.target){var c=a.getAttribute("data-submitfalse");a="1"==c||"q"==c&&!a.elements.q.value?!0:!1}else a=!1;a&&(b.preventDefault(),b.stopPropagation())},!0);document.documentElement.addEventListener("click",function(b){var a;a:{for(a=b.target;a&&a!=document.documentElement;a=a.parentElement)if("A"==a.tagName){a="1"==a.getAttribute("data-nohref");break a}a=!1}a&&b.preventDefault()},!0);}).call(this);(function(){
var a=window.performance;window.start=Date.now();a:{var b=window;if(a){var c=a.timing;if(c){var d=c.navigationStart,f=c.responseStart;if(f>d&&f<=window.start){window.start=f;b.wsrt=f-d;break a}}a.now&&(b.wsrt=Math.floor

In [21]:
raw_links = soup.find_all('a')

In [22]:
raw_links[0].get('href')

'/?sa=X&ved=0ahUKEwjPpubV0MzrAhXKE7kGHWA2CM0QOwgC'

In [23]:
links = [raw_link.get('href') for raw_link in raw_links]

### put everything in the function and write docstring

In [28]:
def get_google_soup(query):
    """scrapes google and return soup"""

    url='https://www.google.com/search'
    
    full_url = url + "?=" + query
    
    resp = requests.get(full_url)
    soup = BeautifulSoup(resp.text)
    
    return soup

In [44]:
def scrape_google_links(soup):
    """scrapes all links from google result page"""
    
    raw_links = soup.find_all('a')
    links = [raw_link.get('href') for raw_link in raw_links]
    
    return links

In [45]:
def scrape_google_headers(soup):
    return None # just an example for how to split it up into smaller chunks

In [46]:
soup = get_google_soup('pandas to csv')
links = scrape_google_links(soup)

In [42]:
links

['https://www.google.de/imghp?hl=de&tab=wi',
 'https://maps.google.de/maps?hl=de&tab=wl',
 'https://play.google.com/?hl=de&tab=w8',
 'https://www.youtube.com/?gl=DE&tab=w1',
 'https://news.google.com/?tab=wn',
 'https://mail.google.com/mail/?tab=wm',
 'https://drive.google.com/?tab=wo',
 'https://www.google.de/intl/de/about/products?tab=wh',
 'http://www.google.de/history/optout?hl=de',
 '/preferences?hl=de',
 'https://accounts.google.com/ServiceLogin?hl=de&passive=true&continue=https://www.google.com/webhp&ec=GAZAAQ',
 '/advanced_search?hl=de&authuser=0',
 'https://www.google.com/url?q=https://families.google.com/familylink/%3Futm_source%3DGoogle%26utm_medium%3DHPP%26utm_campaign%3Dback_to_school&source=hpp&id=19020078&ct=3&usg=AFQjCNERInh5f87Cz5IreA6a45WurDJTng&sa=X&ved=0ahUKEwjMgKKp1czrAhVxCWMBHUPXDO8Q8IcBCAU',
 '/intl/de/ads/',
 '/services/',
 '/intl/de/about.html',
 'https://www.google.com/setprefdomain?prefdom=DE&prev=https://www.google.de/&sig=K_JIhixRv9Auk3f94XJj7M-So-Q1Y%3D',
