
## Question 1 - Python Bootcamp
Write a function `url_detector()` that loads a list of URLs from the file `urls.txt` (newline-separated) and keeps only the valid URLs: those that start with `https` and contain a link to a product ID. Although you could rely on [regular expressions](https://tilburgsciencehub.com/building-blocks/develop-your-coding-skills/learn-to-code/learn-regular-expressions/) to get the job done, simpler workarounds exist. How many URLs do you end up with?



In [19]:
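import pandas as pd

def url_detector(url_list):
    # keep URLs that start with "https" and contain a product ID path ("/dp/")
    return [url for url in url_list if url.startswith("https") and "/dp/" in url]

# urls.txt holds one URL per line
url_list = pd.read_csv("urls.txt", header=None)[0].to_list()
print(len(url_detector(url_list))) # 40 URLs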

40
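
The question mentions regular expressions as an alternative. A minimal sketch of that route is shown below. The pattern, the function name `url_detector_regex`, and the sample URLs are assumptions made for illustration; they are not taken from `urls.txt`.

```python
import re

# Assumed pattern: "https" at the very start and a "/dp/" path segment later on.
PRODUCT_URL = re.compile(r"^https.*/dp/")

def url_detector_regex(url_list):
    """Keep the URLs that match PRODUCT_URL."""
    return [url for url in url_list if PRODUCT_URL.search(url)]

# Made-up sample URLs, for illustration only
sample_urls = [
    "https://www.example.com/dp/B000000001",
    "http://www.example.com/dp/B000000002",
    "https://www.example.com/help",
]
matches = url_detector_regex(sample_urls)
print(matches)  # ['https://www.example.com/dp/B000000001']
```

For a rule this simple, the string checks in the solution above are arguably easier to read; a pattern pays off once the notion of a valid URL gets stricter.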


## Question 2 - Web Scraping
Scrape the top 1000 lifetime grossing movies (domestic) from [Box Office Mojo](https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW). Keep only the movies released in 2000 or later and export the rank, title, and lifetime gross of these movies to a CSV file.

In [78]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep

def seed_generator(url_base):
    return [url_base + str(offset) for offset in range(0, 1000, 200)]

def scrape_data(urls):
    dfs = pd.DataFrame()

    for url in urls: 
        request_object = requests.get(url)
        source_code = request_object.text
        soup = BeautifulSoup(source_code, "html.parser")

        ranks = soup.find_all(class_ = "mojo-field-type-rank")[1:]
        ranks_cleaned = [int(rank.get_text().replace(",", "")) for rank in ranks]

        titles = soup.find_all(class_ = "mojo-field-type-title")[1:]
        titles_cleaned = [title.find("a").get_text() for title in titles]

        years = soup.find_all(class_ = "mojo-field-type-year")[1:]
        years_cleaned = [int(year.get_text()) for year in years]

        moneys = soup.find_all(class_ = "mojo-field-type-money")[1:]
        money_cleaned = [int(money.get_text().replace(",", "").replace("$", "")) for money in moneys]

        df = pd.DataFrame({"rank": ranks_cleaned, "title": titles_cleaned, "years": years_cleaned, "gross_dollars": money_cleaned})
        dfs = pd.concat([dfs, df])

        sleep(1)
    return dfs


url_base = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW&offset="
urls = seed_generator(url_base)

data = scrape_data(urls)
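# keep movies released in 2000 or later and only the columns the question asks for
data_selection = data.loc[data.years >= 2000, ["rank", "title", "gross_dollars"]]
data_selection.to_csv("box_office_mojo.csv", index=False)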
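
Two small steps in the scraper can be checked without any network access: building the paginated seed URLs and converting a dollar string into an integer. The sketch below isolates them. The `total` and `page_size` parameters and the `parse_money` helper are names introduced here for illustration, and the dollar amount is a made-up value.

```python
def seed_generator(url_base, total=1000, page_size=200):
    """Build one URL per results page by appending the row offset."""
    return [url_base + str(offset) for offset in range(0, total, page_size)]

def parse_money(text):
    """Turn a string such as "$1,234,567" into the integer 1234567."""
    return int(text.replace("$", "").replace(",", ""))

url_base = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW&offset="
offsets = [url[len(url_base):] for url in seed_generator(url_base)]
print(offsets)                    # ['0', '200', '400', '600', '800']
print(parse_money("$1,234,567"))  # 1234567
```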

## Question 3 - APIs
As a researcher, you're interested in polarity in online communities and therefore collect data on the distribution of up and down votes on Reddit. Extract a random sample of at least 100 Reddit posts from each of the [`politics`](https://www.reddit.com/r/politics) and [`science`](https://www.reddit.com/r/science) communities and compare their upvote ratios. Comment on your findings.

In [97]:
def reddit_activity(subreddit, attribute):
    after = None
    headers = {'authority': 'www.reddit.com', 'cache-control': 'max-age=0', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
    activity = []

    while len(activity) < 100: 
        # we approximate a random sample by selecting posts by recency (rather than by popularity)
        url = f'https://www.reddit.com/r/{subreddit}/new.json'  
        response = requests.get(url, 
                                headers=headers, 
                                params={"after": after})
        json_response = response.json()
        after = json_response['data']['after']

        # loop over all items in a request
        for item in json_response['data']['children']:
            activity.append(item['data'][attribute])

        # stop when the listing has no further pages, to avoid looping forever
        if after is None:
            break
    return pd.Series(activity)

science = reddit_activity("science", "upvote_ratio")
politics = reddit_activity("politics", "upvote_ratio")

print(f"The mean upvote ratio in the science and politics subreddits is {round(science.mean(),2)} and {round(politics.mean(),2)} respectively")

The mean upvote ratio in the science and politics subreddits is 0.79 and 0.77 respectively
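
The function above pages through the listing by passing the `after` cursor of one response into the next request. A minimal offline sketch of that cursor logic follows. `collect_items` and `fake_fetch_page` are hypothetical names, and the stub returns placeholder integers instead of Reddit posts, so nothing here is real data.

```python
def collect_items(fetch_page, minimum=100):
    """Follow `after` cursors until `minimum` items are collected or the pages run out."""
    items, after = [], None
    while len(items) < minimum:
        page = fetch_page(after)
        items.extend(page["children"])
        after = page["after"]
        if after is None:  # no further pages
            break
    return items

# Hypothetical stand-in for the HTTP request: three pages of 25 placeholder values
def fake_fetch_page(after):
    start = 0 if after is None else int(after)
    next_cursor = str(start + 25) if start + 25 < 75 else None
    return {"children": list(range(start, start + 25)), "after": next_cursor}

collected = collect_items(fake_fetch_page)
print(len(collected))  # 75
```

The stub only has 75 items, so the loop ends on the missing cursor rather than on the 100-item target. Without that check, a short listing would keep restarting from the first page.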


## Question 4 - Workflow 

Review the following text in which a master's student describes the institutional background of the data collection. The thesis centers around the effect of hiding like counts on user behavior and thus proposes a methodology for sample construction. Describe how you would define the treatment and control groups, and how you would go about collecting data at the user level. Keep in mind the ethical and legal concerns of collecting and storing data.

*Late April 2019 Instagram announced that it would run an experiment among Canadian users in which the like counts were hidden (Constine 2019). Three months later, around mid-July, they expanded the treatment to users in various other countries including Australia, Canada, and Italy. Users located in these countries could not see the number of likes on media posted by others, whereas users living anywhere else could still view like counts (Loren 2020). Thus, treatment groups enter the treated pool of persons sequentially, and assignment to the treatment or control condition was dependent on users’ geography.* 


A good answer includes the following elements:
* Mechanism to identify country of origin (e.g., manual coding, selective niches, validating country of origin)
* Representative sample (across multiple influencer categories, removing business accounts, controlling for algorithmic biases)
* Sample criteria (activity prior to and after the intervention)
* Privacy concerns (e.g., only public users, data anonymization)
* Data cleaning (removing bots and inactive accounts)
* Practical considerations (Selenium or a prebuilt package; BeautifulSoup would not be a good choice; proxies to avoid getting blocked)
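
Two of these elements, group assignment by geography and data anonymization, can be sketched in a few lines. This is an illustration under stated assumptions rather than a prescribed method: the exact rollout days are placeholders (the text only says "late April" and "mid-July" 2019), Germany is used as a hypothetical example of a non-treated country, and `assign_group` and `pseudonymize` are names introduced here.

```python
import hashlib
from datetime import date

# Placeholder rollout dates: the text only says "late April" and "mid-July" 2019,
# so the exact days below are assumptions that would need to be verified.
ROLLOUT_START = {
    "Canada": date(2019, 5, 1),
    "Australia": date(2019, 7, 15),
    "Italy": date(2019, 7, 15),
}

def assign_group(country, observed_on):
    """Return "treatment" if like counts were hidden for `country` on `observed_on`."""
    start = ROLLOUT_START.get(country)
    if start is not None and observed_on >= start:
        return "treatment"
    return "control"

def pseudonymize(username, salt):
    """Replace a username with a salted SHA-256 digest before storing it."""
    return hashlib.sha256((salt + username).encode("utf-8")).hexdigest()

print(assign_group("Canada", date(2019, 6, 1)))     # treatment
print(assign_group("Australia", date(2019, 6, 1)))  # control
print(assign_group("Germany", date(2019, 8, 1)))    # control
print(len(pseudonymize("some_user", salt="project-secret")))  # 64
```

Because countries enter the treatment sequentially, the same user can be in the control group for early observations and in the treatment group for later ones, which is why the assignment takes the observation date as an input.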




### Question 1 (small coding task)


In [1]:
import requests
headers = {'authority': 'www.reddit.com', 'cache-control': 'max-age=10', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'navigate', 'sec-fetch-user': '?1', 'sec-fetch-dest': 'document', 'accept-language': 'en-GB,en;q=0.9'}

def get_posts(subreddit):
    url = f'https://www.reddit.com/r/{subreddit}.json'
    response = requests.get(url,
                            headers=headers)
    json_response = response.json()
    posts = []
    for item in json_response['data']['children']:
        posts.append({'subreddit name': item['data']['subreddit'],
                      'title': item['data']['title'],
                      'number of comments': item['data']['num_comments']})
    return posts

subreddits = ['marketing', 'digitalmarketing', 'socialmedia']

all_posts = [] # create empty list to hold final results

# loop through subreddits
for sub in subreddits:
    # use the `get_posts()` function to retrieve the posts for subreddit `sub`
    retrieved_posts = get_posts(sub)
    # loop through the posts and add each one to `all_posts`, the list holding the final result
    for post in retrieved_posts:
        all_posts.append(post)
all_posts