# Python Project

In this part of the course we will use Python to predict poll outcomes for the 2016 Presidential Election. Sounds fun?

Here's what we're going to do:
1. use selenium to scrape search volume data from Google Trends
2. use data.world to download 538 data from polls in each state
3. use pandas to prepare the data
4. use scipy to do some data analysis
5. use XXX to visualize our results

Let's go!

## 1. Web scraping
Before starting to code your brains out, it's worth taking a look at [Google Trends](https://trends.google.com/trends). Familiarize yourself with the basic functionality of the website. Search for something and try to export the data as a CSV file.

Some things to think about:
- Google Trends normalizes the data so that the peak search interest in the results corresponds to a value of 100. You can't unnormalize the data, but maybe it can actually save you time! Ask: How would I normalize anyways?
- With this in mind, should you search for both candidates simultaneously or for one at a time?
- There are different options to search: search terms and concepts. Which should you use?
- We want to have a time series for each state. There are two main ways to accomplish this: export the time series for each state or export the data in the map for each time point. Which should you use?
- Can you use a static method or will you have to use a dynamic webdriver?
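
To make the normalization question concrete, here is a minimal sketch with made-up numbers. The volumes are purely illustrative (not real Google Trends data) and `normalize_to_peak` is a hypothetical helper. It assumes the scaling amounts to dividing by the peak and multiplying by 100, as described above; the real service may round or sample differently.

```python
def normalize_to_peak(values, peak=None):
    """Scale values so that the peak corresponds to 100."""
    if peak is None:
        peak = max(values)
    return [round(100 * v / peak) for v in values]

# Made-up raw search volumes, for illustration only
candidate_a = [20, 40, 80]
candidate_b = [8, 16, 40]

# Searched one at a time: each series is scaled by its own peak
separate_a = normalize_to_peak(candidate_a)
separate_b = normalize_to_peak(candidate_b)

# Searched together: both series are scaled by the joint peak
joint_peak = max(candidate_a + candidate_b)
joint_a = normalize_to_peak(candidate_a, joint_peak)
joint_b = normalize_to_peak(candidate_b, joint_peak)

print('separate:', separate_a, separate_b)
print('joint:   ', joint_a, joint_b)
```

Searched separately, both series peak at 100, so the relative level of the two candidates is lost. Searched together, the ratio between them survives the normalization.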

For this step-by-step guide we will compare both search terms simultaneously and export the map data for every day in the year leading up to the election. If the volume is very low, Google doesn't report anything. Although a [workaround](https://www.sciencedirect.com/science/article/pii/S0047272714000929) exists, we won't bother with it here.

As a first step, write a function `open_driver()` that imports and opens the selenium webdriver. Return the webdriver as an object. I suggest that you also import `By` and `Options`.

You could try changing the download location using Options, but it might not work on your OS. We'll rename the downloads anyway, so you can just move them to the correct location at that step. Don't worry, by now that shouldn't be a challenge for you!
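
If you want to try changing the download location, the sketch below shows one way it might look. The helper names `download_prefs` and `options_with_download_dir` are made up for this example, and whether Chrome honours the preference depends on your OS and Chrome version, so treat it as an experiment rather than a guarantee.

```python
import os

def download_prefs(directory):
    """Build the Chrome preferences that point downloads to `directory`."""
    return {
        # Chrome expects an absolute path here
        "download.default_directory": os.path.abspath(directory),
        # do not open a "Save as" dialog
        "download.prompt_for_download": False,
    }

def options_with_download_dir(directory):
    """Return Chrome Options with the download folder set (needs selenium)."""
    from selenium.webdriver.chrome.options import Options
    chrome_options = Options()
    chrome_options.add_experimental_option("prefs", download_prefs(directory))
    return chrome_options
```

The returned options object could then be passed to the webdriver in place of a plain `Options()`.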

In [1]:
def open_driver():
    """
    Function to open the chrome webdriver.
    """
    # ---
    # add your code here
    
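    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    
    chrome_options = Options()
    #chrome_options.add_argument("--headless") # doesn't work yet
    chrome_options.add_argument('--no-sandbox')
    # Selenium 4 syntax: the driver path goes into a Service object and
    # `options` replaces the deprecated `chrome_options` keyword
    service = Service('/usr/local/bin/chromedriver')
    driver = webdriver.Chrome(service=service, options=chrome_options)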
    
    # ---
    print('Chrome driver is good to go!')
    return driver

Make sure to test your function:

In [286]:
driver = open_driver()

Next, set up an example search for both candidates on election day (Nov 8, 2016). Take a look at your URL. While the date menu is relatively complicated to navigate, changing the date through the URL is straightforward, so that is what we're going to do.

`https://trends.google.com/trends/explore?hl=en&date=[START_DATE]%20[END_DATE]&geo=[REGION]&q=[SEARCH_TERMS]`

Note: I added the `hl=en` parameter to the URL, otherwise the labels might not be in English and you'll have trouble matching the data.

Write a function that takes a start date, end date, region, and a list of search terms as inputs and returns the URL. For fancy pants: Check if `searchterms` is a list and join it using commas if necessary. Replace spaces in search terms with `%20`.

In [2]:
def build_url(start, end, region, searchterms):
    """
    Function to construct the URL.
    """
    # ---
    # add your code here
    
    if isinstance(searchterms, list):
        # map(str, ...) guards against non-string items, which a plain ','.join(list) would reject
        searchterms = ','.join(map(str, searchterms))
    searchterms = searchterms.replace(' ','%20')
    
    url = 'https://trends.google.com/trends/explore?hl=en&date='+start+'%20'+end+'&geo='+region+'&q='+searchterms
    
    # ---
    return url

In [3]:
searchterms = ["Donald Trump", "Hillary Clinton"]

In [5]:
search_terms = ','.join(map(str, searchterms))
search_terms

'Donald Trump,Hillary Clinton'

Test the function:

In [283]:
url = build_url('2016-11-08','2016-11-08','US',['Donald Trump','Hillary Clinton'])
url

'https://trends.google.com/trends/explore?hl=en&date=2016-11-08%202016-11-08&geo=US&q=Donald%20Trump,Hillary%20Clinton'
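
Replacing spaces by hand covers our two search terms, but other characters (such as `&`) would also need escaping. As a hedged alternative, the sketch below uses `urllib.parse.quote` from the standard library. `build_url_quoted` is a made-up name for this variant, and I have not checked that Google Trends accepts every encoded character.

```python
from urllib.parse import quote

def build_url_quoted(start, end, region, searchterms):
    """Variant of build_url that percent-encodes the search terms with urllib."""
    if isinstance(searchterms, list):
        searchterms = ','.join(map(str, searchterms))
    # safe=',' keeps the comma that separates the search terms
    query = quote(searchterms, safe=',')
    return ('https://trends.google.com/trends/explore?hl=en'
            f'&date={start}%20{end}&geo={region}&q={query}')

example_url = build_url_quoted('2016-11-08', '2016-11-08', 'US',
                               ['Donald Trump', 'Hillary Clinton'])
print(example_url)
```

For our two candidates this gives the same URL as the manual version; the difference only shows up with search terms that contain special characters.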

Now comes the tricky part: Write a function that opens the url in the driver and downloads the csv file for the map. Be careful, as there are multiple download buttons!

Note: Google has a nasty habit of producing an error the first time you access the URL. A relatively reliable workaround uses the `time.sleep()` function to wait two seconds and try again. A more sophisticated solution would check if the download button is there and reload periodically until it is (although in my experience, either you get the page on the second attempt or you don't get it anytime soon).
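
A rough sketch of the "reload until the button is there" idea, with an upper limit on attempts so the loop cannot run forever. The helper `load_until_found` and its parameters are made up for illustration; it takes two plain callables so you can try it without a browser.

```python
import time

def load_until_found(load, find, max_attempts=5, delay=2):
    """Call load() and then find() until find() returns a non-empty result.

    Returns that result, or an empty list after max_attempts tries.
    """
    for attempt in range(1, max_attempts + 1):
        load()
        time.sleep(delay)  # give the page time to render
        found = find()
        if found:
            return found
        print(f'... ... attempt {attempt} failed, reloading...')
    return []
```

With selenium, `load` could be `lambda: driver.get(url)` and `find` a lambda that looks up the download button. An empty return value then tells you to skip this date and move on.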

Also: All the files will have the same name when downloaded. This can cause some problems, especially if you start a new download while there's still a previous file in the directory. I suggest that you first remove the old file (if it exists) and that you wait at the end until your download is complete. Both can easily be achieved using an `if os.path.exists("path/to/your/file"):` clause.
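
The two file checks could be sketched like this. `remove_if_exists` and `wait_for_file` are hypothetical helpers, and the timeout is an addition of mine so that a failed download does not block the script forever.

```python
import os
import time

def remove_if_exists(path):
    """Delete a stale download so the new file keeps its name."""
    if os.path.exists(path):
        os.remove(path)

def wait_for_file(path, timeout=30, poll=1):
    """Wait until `path` exists; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            return False
        time.sleep(poll)
    return True
```

Call `remove_if_exists` before clicking the download button and `wait_for_file` right after it.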

In [3]:
def download_csv(url, driver):
    """
    Function to download the csv file of the map.
    """
    # ---
    # add your code here
    
    import os
    import time
    from selenium.webdriver.common.by import By
    print('... start download...')
    
    # remove a leftover file from a previous download
    map_dl = '/Users/czuend/Downloads/geoMap.csv'
    if os.path.exists(map_dl):
        os.remove(map_dl)

    # reload the page until the map widget shows up
    export_map = []
    while len(export_map) == 0:
        print('... ... try loading the page...')
        driver.get(url)
        time.sleep(2)
        export_map = driver.find_elements(By.XPATH, '//*[@class="fe-multi-heat-map-generated fe-atoms-generic-container"]')
    
    # click the CSV button inside the map widget (there are several on the page)
    export_map[0].find_element(By.XPATH, './/*[@title="CSV"]').click()
    
    # wait until the file has arrived
    while not os.path.exists(map_dl):
        time.sleep(1)
    
    print('... download complete.')
    
    # ---
    return

... and test it:

In [284]:
download_csv(url, driver)

... start download...
... ... try loading the page...
... ... try loading the page...
... download complete.


We have to rename (and possibly move) the downloaded csv file. Name the file `"map_[searchterms]_[startdate]_[enddate]_[region].csv"`, to avoid accidentally overwriting existing files if you later explore other search specifications.

In [4]:
def rename_csv(start, end, region, searchterms):
    """
    Function to rename and move files.
    """
    # ---
    # add your code here
    
    import os
    
    if isinstance(searchterms, list):
        searchterms = ','.join(map(str, searchterms))
    searchterms = searchterms.replace(' ','%20')
    
    # named `data_dir` because `dir` would shadow the built-in dir()
    data_dir = 'data'
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    map_dl = '/Users/czuend/Downloads/geoMap.csv'
    map_name = data_dir+'/map_'+searchterms+'_'+start+'_'+end+'_'+region+'.csv'

    os.rename(map_dl, map_name)
    
    # ---
    return

... and test it:

In [71]:
rename_csv('2016-11-08','2016-11-08','US',['Donald Trump','Hillary Clinton'])
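
One caveat: `os.rename` can fail if the download folder and the target folder are on different file systems. A possible variant, sketched below, splits the naming from the moving and uses `shutil.move`, which falls back to copying in that case. `map_filename` and `move_download` are made-up names; the naming scheme is the one given above.

```python
import os
import shutil

def map_filename(start, end, region, searchterms, data_dir='data'):
    """Build the target path following the naming scheme above."""
    if isinstance(searchterms, list):
        searchterms = ','.join(map(str, searchterms))
    searchterms = searchterms.replace(' ', '%20')
    name = f'map_{searchterms}_{start}_{end}_{region}.csv'
    return os.path.join(data_dir, name)

def move_download(source, target):
    """Move the downloaded file, creating the target folder if needed."""
    os.makedirs(os.path.dirname(target) or '.', exist_ok=True)
    shutil.move(source, target)
```

Keeping the file name in its own function also lets you check whether a file for a given date already exists.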

Finally, we need a list of dates to loop over. There are many ways to do this. The internet is your friend here!

Note: I suggest you only scrape a month's worth of data and copy the rest from the GitHub repo, as web scraping is time-consuming and not universally appreciated (e.g. by Google's system admins).

In [12]:
def get_dates():
    """
    Function to produce a list of dates with YYYY-MM-DD from d1 to d2 (inclusive).
    """
    # ---
    # add your code here
    
    from datetime import date, timedelta
    d1 = date(2016,7,2) # use date(2015,11,8) for the full year before the election
    d2 = date(2016,11,8)
    dates = [str(d1 + timedelta(days=x)) for x in range((d2-d1).days + 1)]
    
    # ---
    return dates

One last test:

In [291]:
get_dates()
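
If you want to scrape only a month at a time, a version that takes the start and end date as arguments may be handy. The sketch below is one option; `date_range` is a made-up name and `date.fromisoformat` needs Python 3.7 or later.

```python
from datetime import date, timedelta

def date_range(start, end):
    """Return every date from start to end (inclusive) as YYYY-MM-DD strings."""
    d1 = date.fromisoformat(start)
    d2 = date.fromisoformat(end)
    return [str(d1 + timedelta(days=x)) for x in range((d2 - d1).days + 1)]

# The full year leading up to the election
year_before = date_range('2015-11-08', '2016-11-08')
print(len(year_before), year_before[0], year_before[-1])
```

Because 2016 was a leap year, the full range contains 367 dates.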

Congratulations, we're ready to put everything together!

In [13]:
def main():
    driver = open_driver()
    searchterms = ['Donald Trump','Hillary Clinton']
    region = 'US'
    dates = get_dates()
    for date in dates:
        print('Download data for: ', date)
        url = build_url(date, date, region, searchterms)
        download_csv(url, driver)
        rename_csv(date, date, region, searchterms)
    driver.quit()
    print('All data downloaded.')
    
if __name__ == '__main__':
    main()

Chrome driver is good to go!
Download data for:  2016-07-02
... start download...
... ... try loading the page...
... ... try loading the page...
... download complete.
Download data for:  2016-07-03
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-07-04
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-07-05
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-07-06
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-07-07
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-07-08
... start download...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
[output truncated]
... download complete.
Download data for:  2016-09-12
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-13
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-14
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-15
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-16
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-17
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-18
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-19
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-20
... start download...
... ... try loading the page...
... download complete.
[output truncated]

In [11]:
driver.quit()

NameError: name 'driver' is not defined
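
A long scraping run can be interrupted partway through, and you don't want to download everything again. One possible approach, sketched below, is to skip the dates that already have a file in the `data` folder. `pending_dates` is a hypothetical helper that assumes the file naming scheme from `rename_csv`.

```python
import os

def pending_dates(dates, searchterms, region, data_dir='data'):
    """Return the dates whose map file is not yet in data_dir."""
    if isinstance(searchterms, list):
        searchterms = ','.join(map(str, searchterms))
    searchterms = searchterms.replace(' ', '%20')
    todo = []
    for day in dates:
        name = f'map_{searchterms}_{day}_{day}_{region}.csv'
        if not os.path.exists(os.path.join(data_dir, name)):
            todo.append(day)
    return todo
```

In `main()`, you could then loop over `pending_dates(get_dates(), searchterms, region)` instead of all dates.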