# Python Project

In this part of the course, we will use Python to predict polling outcomes of the 2016 Presidential Election. Sounds fun?

Here's what we're going to do:
1. use selenium to scrape search volume data from Google Trends
2. use pandas to prepare the data
3. use data.world to download 538 data from polls in each state
4. use scipy to do some data analysis
5. use XXX to visualize our results

Let's go!

## 1. Web scraping
Before starting to code your brains out, it's worth taking a look at [Google Trends](https://trends.google.com/trends). Familiarize yourself with the basic functionalities of the website. Search for something and try to get the data in csv form.

Some things to think about:
- Google Trends normalizes the data so that the peak search interest in the results corresponds to a value of 100. You can't unnormalize the data, but maybe it can actually save you time! Ask: How would I normalize anyways?
- With this in mind, should you search for both candidates simultaneously or for one at a time?
- There are different options to search: search terms and concepts. Which should you use?
- We want to have a time series for each state. There are two main ways to accomplish this: export the time series for each state or export the data in the map for each time point. Which should you use?
- Can you use a static method or will you have to use a dynamic webdriver?

For this step-by-step guide, we will compare both search terms simultaneously and export the map data for every day in the year leading up to the election. If the volume is very low, Google doesn't report anything. Although there exists a [workaround](https://www.sciencedirect.com/science/article/pii/S0047272714000929), we won't bother with it here.

As a first step, write a function `open_driver()` that imports and opens the selenium webdriver. Return the webdriver as an object. I suggest that you also import `By` and `Options`.

You could try changing the download location using `Options`, but it might not work on your OS. We'll rename the downloads anyways, so you can just move them to the correct location at that step. Don't worry, by now that shouldn't be a challenge for you!

In [None]:
def open_driver():
    """
    Function to open the chrome webdriver.
    """
    # ---
    # add your code here

    
    
    # ---
    print('Chrome driver is good to go!')
    return driver

Make sure to test your function:

In [None]:
driver = open_driver()

Next, set up an example search for both candidates on election day (Nov 8, 2016). Take a look at your URL. While the date menu is relatively complicated to navigate, changing the date through the URL is straightforward, so that is what we're going to do.
https://trends.google.com/trends/explore?hl=en&date=[START_DATE]%20[END_DATE]&geo=[REGION]&q=[SEARCH_TERMS]

Note: I added the `hl=en` flag to the URL, otherwise the labels might not be in English and you'll have trouble matching the data.

Write a function that takes a start date, end date, region, and a list of search terms as inputs and returns the url. For fancy pants: Check if `searchterms` is a list and join it using commas if necessary. Replace spaces in search terms by `%20`.
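
Here is one possible sketch of the URL construction, in case you get stuck. The name `build_url_sketch` and the constant `BASE_URL` are made up for this illustration, and `urllib.parse.quote` is just one convenient way to turn spaces into `%20`.

```python
from urllib.parse import quote

# Base address taken from the URL pattern shown above.
BASE_URL = "https://trends.google.com/trends/explore"


def build_url_sketch(start, end, region, searchterms):
    """Return a Google Trends explore URL for the given dates, region, and terms."""
    # Accept a single string as well as a list of search terms.
    if isinstance(searchterms, str):
        searchterms = [searchterms]
    # quote() percent-encodes spaces as %20; the terms are then joined with commas.
    query = ",".join(quote(term) for term in searchterms)
    return f"{BASE_URL}?hl=en&date={start}%20{end}&geo={region}&q={query}"


url_sketch = build_url_sketch(
    "2016-11-08", "2016-11-08", "US", ["Donald Trump", "Hillary Clinton"]
)
print(url_sketch)
```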

In [None]:
def build_url(start, end, region, searchterms):
    """
    Function to construct the URL.
    """
    # ---
    # add your code here
    

    
    # ---
    return url

Test the function:

In [None]:
url = build_url('2016-11-08','2016-11-08','US',['Donald Trump','Hillary Clinton'])
url

Now comes the tricky part: Write a function that opens the url in the driver and downloads the csv file for the map. Be careful, as there are multiple download buttons!

Note: Google has a nasty habit of producing an error the first time you access the URL. A relatively reliable workaround uses the `time.sleep()` function to wait two seconds and try again. A more sophisticated solution would check if the download button is there and reload periodically until it is (although in my experience, either you get the page on the second attempt or you don't get it anytime soon).

Also: All the files will have the same name when downloaded. This can cause some problems, especially if you start a new download while there's still a previous file in the directory. I suggest that you first remove the old file (if it exists) and that you wait at the end until your download is complete. Both can easily be achieved using an `if os.path.exists("path/to/your/file"):` clause.
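
The two file checks could look roughly like this. The helper names `remove_if_exists` and `wait_for_file` are hypothetical, and the timeout and polling interval are arbitrary choices. You pass in whatever path your browser saves the download to.

```python
import os
import time


def remove_if_exists(path):
    """Delete a leftover download so the next one keeps its expected name."""
    if os.path.exists(path):
        os.remove(path)


def wait_for_file(path, timeout=30.0, poll=0.5):
    """Return True once `path` exists, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll)  # don't hammer the file system
    # One last check in case the file appeared during the final sleep.
    return os.path.exists(path)
```

Inside `download_csv()` you would call `remove_if_exists()` before clicking the download button and `wait_for_file()` right after.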

In [None]:
def download_csv(url, driver):
    """
    Function to download the csv file of the map.
    """
    # ---
    # add your code here
    
    
    
    # ---
    return

... and test it:

In [None]:
download_csv(url, driver)

We have to rename (and possibly move) the downloaded csv file. Name the file `"map_[searchterms]_[startdate]_[enddate]_[region].csv"`, to avoid accidentally overwriting existing files if you later explore other search specifications.
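
A minimal sketch of the naming and moving logic follows. The helper names are hypothetical. Joining the search terms with `-` and dropping their spaces is an assumption made here so that underscores only separate the fields of the file name.

```python
import os
import shutil


def build_filename(start, end, region, searchterms):
    """Return 'map_[searchterms]_[startdate]_[enddate]_[region].csv'."""
    if isinstance(searchterms, str):
        searchterms = [searchterms]
    # Assumption: drop spaces and join terms with '-' to keep '_' as the field separator.
    terms = "-".join(term.replace(" ", "") for term in searchterms)
    return f"map_{terms}_{start}_{end}_{region}.csv"


def move_download(source, target_dir, filename):
    """Move the downloaded file to `target_dir` under its new name."""
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, filename)
    shutil.move(source, target)
    return target
```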

In [None]:
def rename_csv(start, end, region, searchterms):
    """
    Function to rename and move files.
    """
    # ---
    # add your code here
    
    
    
    # ---
    return

... and test it:

In [None]:
rename_csv('2016-11-08','2016-11-08','US',['Donald Trump','Hillary Clinton'])

Finally, we need a list of dates to loop over. There are many ways to do this. The internet is your friend here!

Note: I suggest you only scrape a month's worth of data and copy the rest from the github repo as webscraping is time consuming and not universally appreciated (e.g. by Google's system admins).
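
One of the many ways uses only the standard library. The function name `date_range` is made up for this sketch.

```python
from datetime import date, timedelta


def date_range(start, end):
    """Return all dates from `start` to `end` (inclusive) as 'YYYY-MM-DD' strings."""
    n_days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(n_days + 1)]


all_dates = date_range(date(2015, 11, 8), date(2016, 11, 8))
print(len(all_dates), all_dates[0], all_dates[-1])
```

The list has 367 entries because both endpoints are included and 2016 was a leap year.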

In [None]:
def get_dates():
    """
    Function to produce a list of dates with YYYY-MM-DD from 2015-11-08 to 2016-11-08.
    """
    # ---
    # add your code here
    
    
    
    # ---
    return dates

One last test:

In [None]:
get_dates()

Congratulations, we're ready to put everything together!

In [None]:
def main():
    driver = open_driver()
    searchterms = ['Donald Trump','Hillary Clinton']
    region = 'US'
    dates = get_dates()
    for date in dates:
        print('Download data for: ', date)
        url = build_url(date, date, region, searchterms)
        download_csv(url, driver)
        rename_csv(date, date, region, searchterms)
    driver.quit()
    print('All data downloaded.')
    
if __name__ == '__main__':
    main()

In [None]:
driver.quit()

## 2. Prepare Google Trends data

Now that we have the data from Google, we're going to combine all the csv files in a large pandas dataframe. 

- use `os.listdir()` to generate a list of all the files
- write a loop to append all of them in a large pandas dataframe
- clean the data to remove missing values and normalize the data

First, use `os.listdir()` to generate a list `files` of all the files we want to load. If you saved other data in the same directory, you might need to filter the list and keep just the Google Trends data.
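
If you need the filter, a sketch could look like this. It assumes the files follow the `map_..._.csv` naming scheme from the scraping step, and `list_trend_files` is a hypothetical helper name.

```python
import os


def list_trend_files(directory):
    """Return the sorted names of all files that look like Google Trends map exports."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.startswith("map_") and name.endswith(".csv")
    )
```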

In [None]:
# ---
# add your code here



# ---

If done correctly, you should have 367 files (both Nov 8, 2015 and Nov 8, 2016 are included, and 2016 was a leap year). Let's see how many we got:

In [None]:
len(files)

Write a function `load_data()` to load a file into a pandas dataframe with columns `state`, `trump`, and `clinton`, and generate a column `date` that contains the date from the file name.
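
The sketch below shows one way to do this. It rests on two assumptions you should check against your own files: that the export starts with a category line and a blank line before the header row, and that the file name ends in `_[startdate]_[enddate]_[region].csv`. The name `load_data_sketch` and the `directory` argument are made up for this illustration.

```python
import os

import pandas as pd


def load_data_sketch(filename, directory="."):
    """Load one map export and tag every row with the start date from the file name."""
    df = pd.read_csv(
        os.path.join(directory, filename),
        skiprows=1,  # assumption: first line is a category label, not data
        header=0,    # the original header row is replaced by `names`
        names=["state", "trump", "clinton"],
    )
    # Split from the right so the search terms cannot shift the date position.
    df["date"] = filename.rsplit("_", 3)[1]
    return df
```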

In [None]:
def load_data(f):
    # ---
    # add your code here
    
    
    
    # ---
    return df

Test the function:

In [None]:
load_data(files[0])

Create an empty pandas dataframe `google_raw` and write a loop over `files` that applies `load_data()` on each file and appends it to `google_raw`.

In [None]:
# ---
# add your code here



# ---

Let's see whether it worked:

In [None]:
google_raw

Finally, we'll do some housekeeping:

1. create a copy of the 'google_raw' dataframe so that we don't have to rerun the loop if we mess up.
2. remove missing values
3. remove the `%` symbol from the percentages
4. normalize the data such that `candidate = candidate / (total searches)`
5. use `groupby` to order by `date` and `state`
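
The housekeeping steps can be sketched on a tiny made-up dataframe. The numbers below are invented for illustration only, and the variable names `raw_toy` and `clean_toy` are hypothetical.

```python
import pandas as pd

# Invented example rows; real data comes from `google_raw`.
raw_toy = pd.DataFrame({
    "state": ["Alaska", "Alabama", "Wyoming"],
    "trump": ["55%", "61%", None],
    "clinton": ["45%", "39%", None],
    "date": ["2016-11-08"] * 3,
})

# Steps 1 and 2: work on a copy and drop rows without search data.
clean_toy = raw_toy.copy().dropna(subset=["trump", "clinton"])

# Step 3: strip the percent sign and convert to numbers.
for col in ["trump", "clinton"]:
    clean_toy[col] = clean_toy[col].str.rstrip("%").astype(float)

# Step 4: divide by the total so the two shares sum to one.
total = clean_toy["trump"] + clean_toy["clinton"]
clean_toy["trump"] = clean_toy["trump"] / total
clean_toy["clinton"] = clean_toy["clinton"] / total

# Step 5: groupby sorts by its keys and makes (date, state) the index.
clean_toy = clean_toy.groupby(["date", "state"])[["trump", "clinton"]].mean()
print(clean_toy)
```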

In [None]:
# ---
# add your code here



# ---

Checking the result and copying it to `google_data`:

In [None]:
google_data = df.copy()
google_data

## 3. Prepare polling data

We're going to use polling data compiled by 538. However, instead of scraping that data too, we're going to use a shortcut. Interesting data has often been scraped by somebody else before, so you can save a lot of time by googling datasets and checking GitHub and other data repositories!

1. Go to [data.world](https://data.world) and register as a new user (unless you're an old user, duh!). They have some cool data, so it's worth it.
2. Download [presidential_polls_2016_fivethirtyeight.csv](https://data.world/databeats/2016-us-presidential-election/workspace/file?filename=presidential_polls_2016_fivethirtyeight.csv) to an appropriate location.

Note: If you feel like showing off, try creating an API key and downloading the data using the python module. [Here's some guidance.](https://data.world/integrations/python)

Load the poll data into a pandas dataframe called `polls`.

In [None]:
# ---
# add your code here



# ---

Let's have a look:

In [None]:
polls

To ensure that the data are comparable, check whether all the results are for the same election cycle, branch of government, type, forecastdate, and most importantly, the same matchup of candidates! Tabulate the data to make sure that this is indeed the case.

In [None]:
# ---
# add your code here



# ---

This seems mostly ok. But what's up with type? Turns out that each poll is included three times:

In [None]:
polls.groupby('poll_id')['cycle'].count()

We'll have to change that. The next step is filtering the data!

Use pandas' amazing data-slicing ability to generate `polls_filtered` such that:
- only `polls-only` are included
- only polls with information on sample size are included
- national polls (`state` = 'U.S.') are excluded
- Maine and Nebraska have split electoral votes. Make sure to include only polls for the entire state.
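
A sketch of the filter using boolean masks follows. It runs on a made-up dataframe and assumes that district-level polls are labelled with `CD-` in the `state` column (for example `Maine CD-1`). Check the actual labels in your data before relying on this.

```python
import pandas as pd

# Invented rows that mimic the relevant columns of `polls`.
polls_toy = pd.DataFrame({
    "type": ["polls-only", "polls-plus", "polls-only", "polls-only", "polls-only"],
    "state": ["Ohio", "Ohio", "U.S.", "Maine CD-1", "Maine"],
    "samplesize": [800.0, 800.0, 1200.0, 400.0, None],
})

keep = (
    (polls_toy["type"] == "polls-only")                      # one version per poll
    & polls_toy["samplesize"].notna()                        # sample size is known
    & (polls_toy["state"] != "U.S.")                         # no national polls
    & ~polls_toy["state"].str.contains("CD-", regex=False)   # assumption: district label
)
filtered_toy = polls_toy[keep]
print(filtered_toy)
```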

In [None]:
# ---
# add your code here



# ---

Let's see:

In [None]:
polls_filtered

Finally we have to aggregate the data so that we can merge it on the day and state. There are many possibilities and there's going to be an element of subjective judgement. Here's one way:

- `date`: There's a start date and an end date. We deal with this by calculating the number of days that the poll ran for and assuming that the same number of respondents participated on each day.
- `state`: There can be multiple polls in the same state and on the same day. We deal with this by calculating a weighted average based on the (estimated) sample size of each poll on that date.

Step by step:
1. copy `polls_filtered` to a new dataframe `df`. We don't want to make a mess!
2. calculate the number of days that the poll ran for from `startdate` and `enddate`.
3. calculate `samplesize_day` by dividing `samplesize` by the number of days.
4. expand the dataframe such that each observation is included `days` times.
5. generate a variable `date` with the fictitious polling date.
6. aggregate the data by date and state such that `rawpoll_clinton`, `rawpoll_trump`, `adjpoll_clinton` and `adjpoll_trump` are averages weighted by the sample size, and `samplesize` is the sum of the daily sample sizes.
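
The steps above can be sketched on two invented polls. The `M/D/YYYY` date format is an assumption about the file, the `adjpoll_*` columns are left out for brevity (they work the same way), and the helper columns `poll` and `offset` are made up for this sketch.

```python
import pandas as pd

# Two invented polls in the same state with overlapping field dates.
toy = pd.DataFrame({
    "state": ["Ohio", "Ohio"],
    "startdate": ["11/1/2016", "11/2/2016"],
    "enddate": ["11/2/2016", "11/2/2016"],
    "samplesize": [200.0, 300.0],
    "rawpoll_clinton": [40.0, 50.0],
    "rawpoll_trump": [45.0, 41.0],
})

df_toy = toy.copy()
for col in ["startdate", "enddate"]:
    df_toy[col] = pd.to_datetime(df_toy[col], format="%m/%d/%Y")

# Steps 2 and 3: poll duration (both endpoints count) and respondents per day.
df_toy["days"] = (df_toy["enddate"] - df_toy["startdate"]).dt.days + 1
df_toy["samplesize_day"] = df_toy["samplesize"] / df_toy["days"]

# Step 4: repeat each row `days` times and remember which poll it came from.
df_toy = df_toy.loc[df_toy.index.repeat(df_toy["days"])]
df_toy = df_toy.rename_axis("poll").reset_index()

# Step 5: the k-th copy of a poll gets the date startdate + k days.
df_toy["offset"] = df_toy.groupby("poll").cumcount()
df_toy["date"] = df_toy["startdate"] + pd.to_timedelta(df_toy["offset"], unit="D")

# Step 6: weighted average = sum(value * weight) / sum(weight) per date and state.
value_cols = ["rawpoll_clinton", "rawpoll_trump"]
weighted = df_toy[value_cols].multiply(df_toy["samplesize_day"], axis=0)
weighted["samplesize"] = df_toy["samplesize_day"]
weighted["date"] = df_toy["date"]
weighted["state"] = df_toy["state"]
agg_toy = weighted.groupby(["date", "state"]).sum()
agg_toy[value_cols] = agg_toy[value_cols].div(agg_toy["samplesize"], axis=0)
print(agg_toy)
```

On Nov 2 the first poll contributes 100 respondents and the second 300, so the second poll dominates that day's average.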

In [None]:
# ---
# add your code here




# ---

Let's check the results:

In [None]:
poll_data = df.copy()
poll_data

## 4. Analyze data
With our data at hand, we can finally test what Google searches reveal about political opinion. We combine the data (easy!) and use the scipy module to test some ideas. After that you're free to explore!

Combine Google and 538 data using `pd.concat()`.
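
A small sketch of how `pd.concat()` lines up two dataframes side by side follows. It assumes both are indexed by `date` and `state` with matching types, and all values are invented.

```python
import pandas as pd

index_names = ["date", "state"]

# Invented search shares for two days.
google_toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-11-07", "2016-11-08"]),
    "state": ["Ohio", "Ohio"],
    "trump": [0.55, 0.60],
}).set_index(index_names)

# Invented poll result for one of those days.
poll_toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-11-08"]),
    "state": ["Ohio"],
    "rawpoll_trump": [45.0],
}).set_index(index_names)

# axis=1 places the columns next to each other and aligns rows on the index;
# index entries missing from one side are filled with NaN.
combined_toy = pd.concat([google_toy, poll_toy], axis=1)
print(combined_toy)
```

If one side stores dates as strings and the other as timestamps, no rows will line up.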

In [None]:
# ---
# add your code here



# ---

As always, check the result! (If you have no complete observations, something went wrong with the matching. Most likely the date wasn't the same format.)

In [None]:
data.dropna()

Before we use scipy, it's worth keeping in mind that pandas has a lot of data science capability built in! Try to estimate the correlation between the relative search volume for Trump and his performance in the polls using `.corr()`:
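
For reference, this is how `.corr()` is called on two columns. The column name `trump_search` and all numbers are invented, so the resulting value says nothing about the real data.

```python
import pandas as pd

# Invented values: search share goes up while the poll number goes down.
corr_toy = pd.DataFrame({
    "trump_search": [0.50, 0.55, 0.60, 0.65],
    "rawpoll_trump": [44.0, 43.0, 42.5, 41.0],
})

# Series.corr() computes the Pearson correlation by default.
r = corr_toy["trump_search"].corr(corr_toy["rawpoll_trump"])
print(r)
```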

In [None]:
# ---
# add your code here



# ---

Seems like people didn't particularly like the candidates they googled...

Some additional ideas to test if you have enough time:
- Search data around the primaries might be misleading. Does the fit improve if we only look at data after the national conventions?