# Week 1

## Overview

As explained in the [*Before week 1* notebook](https://nbviewer.org/github/lalessan/comsocsci2022/blob/main/lectures/Before_week_1.ipynb), each week of this class is a Jupyter notebook like this one. **_In order to follow the class, you simply start reading from the top_**, following the instructions.

**Hint**: You can ask me or your amazing TA for help at any point if you get stuck!

## Today

This first lecture will go over a few different topics to get you started:

* You picked this course in **Computational Social Science** but... *What does that even mean?* The first thing we will do today is learn a bit more about it by reading some chapters of the book and listening to a short lecture by me.
* Then, we will focus on a more practical aspect. We will **learn about APIs** to get data from online platforms.
* Finally, we'll put what we have learned into practice by using an API to **download some data from Reddit**.



## Part 1: Computational Social Science


*What is Computational Social Science?* In the video below, I will give a short introduction to the topic. As they say, an example is worth a thousand words, so I will also present one really good example of a research study in Computational Social Science. The specific study focuses on the diffusion of misinformation online. If you are interested (yes, it's only optional), you can have a look at the whole work [in this scientific article published in Science in 2018](https://www.science.org/doi/10.1126/science.aap9559#).


> **_Video lecture_**: Watch the video below about Computational Social Science

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo("qoPk_C3buD8",width=600, height=337.5)


In this course, we are going to focus mostly on observational data collected online to address social science questions. So, I would like us to reflect a little bit more on what it means to use *ready-made* data in the social sciences, and to understand its advantages and challenges. This is something that you can read about in Sections 2.1 to 2.3 of the book _Bit by Bit_.

> *Reading*: [Bit by Bit, sections 2.1 to 2.3](https://www.bitbybitbook.com/en/1st-ed/observing-behavior/observing-intro/). I don't expect you to read all the details, but to get a general understanding of the advantages and challenges of large observational datasets (a.k.a. Big Data or ready-made data) for social science research.

## Part 2: Using APIs to download Reddit data

Let me repeat it one more time: in this class, we will work with *ready-made* data. The second thing we will learn today is how to get data from online platforms. We will do it using an API that gives access to Reddit data. If you are not at all familiar with Reddit, I suggest you take a short break now and familiarize yourself with the wonderful world of [Reddit](https://www.reddit.com/).

I made a short video to help you get familiar with the Pushshift API for accessing Reddit data. Check it out below.



> **_Video lecture_**: Watch the video below about the Reddit API

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo("eqBIFua00O4",width=600, height=337.5)


It's time for you to get to work. Take a look at the two texts below - just to get a sense of a more technical description of how the Pushshift API works.


> _Reading_ (just skim): [New to Pushshift? Read this! FAQ](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)  
> _Reading_ (just skim): [Pushshift Github Repository](https://github.com/pushshift/api)
> 

## Prelude to part 3: Pandas Dataframes


Before starting, we will also learn a bit about [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), a very user-friendly data structure that you can use to manipulate tabular data. Pandas is built on top of NumPy, whose core is in turn implemented in C, so DataFrames are quite efficient. You will find them quite useful :)

Pandas DataFrames should be intuitive to use. **I suggest you go through the [10 minutes to Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) to learn what you need to solve the next exercise.**
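
To give a flavour of what the tutorial covers, here is a minimal sketch on a tiny made-up table. The records, the column names (`id`, `author`, `created_utc`) and the variable names are assumptions chosen for illustration, not real Reddit data. It shows how to build a DataFrame from a list of dictionaries, convert Unix timestamps into dates, count unique values, and aggregate rows by day.

```python
import pandas as pd

# Hypothetical records, invented for illustration only.
records = [
    {"id": "a1", "author": "alice", "created_utc": 1577836800},  # 2020-01-01 00:00 UTC
    {"id": "a2", "author": "bob", "created_utc": 1577840400},    # 2020-01-01 01:00 UTC
    {"id": "a3", "author": "alice", "created_utc": 1577923200},  # 2020-01-02 00:00 UTC
]

# Each dictionary becomes a row; the keys become the column names.
df = pd.DataFrame(records)

# Convert Unix timestamps (seconds since 1970-01-01 UTC) into datetimes.
df["created"] = pd.to_datetime(df["created_utc"], unit="s")

# Number of distinct values in a column.
n_authors = df["author"].nunique()

# Number of rows per calendar day: resample needs a datetime index.
daily_counts = df.set_index("created").resample("D").size()

print(df)
print("unique authors:", n_authors)
print(daily_counts)
```

Resampling requires a datetime index, which is why the sketch calls `set_index("created")` first. A DataFrame like this can be written to disk with `df.to_csv(...)`.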

## Part 3: Getting data from the _r/wallstreetbets_ subreddit

There has been a lot of interest in the social platform Reddit last year, after investors from the [_r/wallstreetbets_](https://www.reddit.com/r/wallstreetbets/) subreddit managed to [give a huge boost](https://www.google.com/search?q=GME+price&oq=GME+price&aqs=chrome..69i57.1261j0j4&sourceid=chrome&ie=UTF-8) to the shares of the video game retailer GameStop (traded as "*GME*"), causing massive losses to professional investors and established hedge funds.

There was so much buzz about _GameStop_ because it was really something unprecedented! Online discussions about stocks on social media fuelled massive price moves that cannot be explained by traditional valuation metrics and that can seriously destabilize the established market. Many ordinary investors on Reddit coordinated to buy shares of a stock that had been losing value for a long time. __But how did this all happen?__


Today and in the following classes, we will try to answer precisely this question, by studying the social network of Redditors of _r/wallstreetbets_ throughout last year. We will focus on the period between Jan 1st, 2020 and Jan 25th, 2021. In this period the value of the GME stock [went up by about 1000%](https://www.nasdaq.com/market-activity/stocks/gme)... pretty crazy :)

The starting point will be to understand how to download data from Reddit using APIs. But before we start getting our hands dirty, if you feel like you don't know much about GameStop, I suggest you watch this short video summarizing the events. If you already know everything about it, feel free to skip it.

> 
> **_Video_**: [Stocks explained: What's going on with GameStop?](https://www.bbc.com/news/av/technology-55864312)
> 

> *Exercise 1*: **Download submissions of the [*r/wallstreetbets*](https://www.reddit.com/r/wallstreetbets/) subreddit using the [Pushshift API](https://github.com/pushshift/api)**
> 1. Use the [psaw Python library](https://pypi.org/project/psaw/) (a wrapper for the Pushshift API) to find all the submissions in the subreddit *r/wallstreetbets* related to either "*GME*" or "*Gamestop*" (**Hint**: Use the [``q``](https://github.com/pushshift/api) parameter to search text. To search for multiple words, you can separate them with the character "|"). Focus on the period **between Jan 1st, 2020 and Jan 25th, 2021**, where times must be provided as [Unix timestamps](https://www.unixtimestamp.com/). *Note: The Pushshift API returns at most 100 results per query, so you may need to divide your entire time period into small enough sub-periods.*
> 2. For each submission, find the following information: **title, id, score, date of creation, author, and number of comments** (**Hint**: access the dictionary with all attributes by typing ``my_submission.d_``). Store this data in a pandas DataFrame and save it to a file. (Downloading took me 30 minutes using two cores. While you wait for the results, you can start thinking about *Exercise 3*.)
> 3. Create a figure using [``matplotlib``](https://matplotlib.org/) and plot the total number of submissions per day (**Hint**: You can use the function [``datetime.datetime.utcfromtimestamp``](https://docs.python.org/3/library/datetime.html) to convert a timestamp into a date, and you can use the method [``DataFrame.resample``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html) to aggregate by day). What do you observe?
> 4. How many submissions have you downloaded in total? How many unique authors? 
> 5. *Optional*: How many unique authors are there each week in the period under study? 
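
The note about sub-periods in step 1 can be sketched as follows. This is only one possible approach: the helper names (`to_unix`, `time_windows`), the one-day window size, and the choice to treat Jan 25th, 2021 as included are all assumptions. The commented-out psaw call shows how the windows might be used. It is not executed here, and its arguments should be checked against the psaw documentation.

```python
from datetime import datetime, timedelta, timezone


def to_unix(dt):
    """Convert a timezone-aware datetime into an integer Unix timestamp."""
    return int(dt.timestamp())


def time_windows(start, end, step):
    """Yield (window_start, window_end) Unix timestamps covering [start, end)."""
    current = start
    while current < end:
        nxt = min(current + step, end)
        yield to_unix(current), to_unix(nxt)
        current = nxt


# Assumption: the end bound is exclusive, so Jan 26th keeps Jan 25th included.
START = datetime(2020, 1, 1, tzinfo=timezone.utc)
END = datetime(2021, 1, 26, tzinfo=timezone.utc)

# Assumption: one-day windows; busy days may need smaller windows.
windows = list(time_windows(START, END, timedelta(days=1)))
print(len(windows), "windows; first:", windows[0])

# Hypothetical usage with psaw (needs network access, not run here):
# from psaw import PushshiftAPI
# api = PushshiftAPI()
# for after, before in windows:
#     results = api.search_submissions(after=after, before=before,
#                                      subreddit="wallstreetbets",
#                                      q="GME|Gamestop")
```

Using timezone-aware datetimes avoids timestamps that silently depend on the local time zone of the machine running the notebook.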


> *Exercise 2*: **Download comments from the [*r/wallstreetbets*](https://www.reddit.com/r/wallstreetbets/) subreddit.** The second task for today is to download the comments associated with each submission, which we will use to build the social network of Redditors.
> 1. For each submission you found in *Exercise 1*, download all the comments (*Hint*: Use the [``search_comments``](https://github.com/pushshift/api) function to search comments. You can specify the parameter ``link_id``, which corresponds to the *id* of the submission for which you require comments).
> 2. For each comment, store the following information: **id, submission_id, score, date of creation, author, parent_id**. Note that the *submission id* can be retrieved by accessing the *link_id* attribute. Store this in a pandas DataFrame and save it to a file. We will use it in the next classes.
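
When storing the ids in step 2, it may help to know that Reddit identifies objects by "fullnames" made of a type prefix and an id, where `t1_` marks a comment and `t3_` marks a submission. The sketch below assumes that `link_id` and `parent_id` come back in that form, so check this on your own data. The helper names and the example comment are made up for illustration.

```python
# Reddit fullname prefixes: t1 = comment, t3 = submission (link).
KIND_NAMES = {"t1": "comment", "t3": "submission"}


def split_fullname(fullname):
    """Split a fullname such as 't3_abc123' into ('t3', 'abc123')."""
    kind, _, base36_id = fullname.partition("_")
    return kind, base36_id


def parse_comment(comment):
    """Extract bare ids from a comment dictionary (hypothetical helper)."""
    _, submission_id = split_fullname(comment["link_id"])
    parent_kind, parent_id = split_fullname(comment["parent_id"])
    return {
        "id": comment["id"],
        "submission_id": submission_id,
        "parent_id": parent_id,
        # A comment replies either to the submission or to another comment.
        "parent_type": KIND_NAMES.get(parent_kind, parent_kind),
    }


# Made-up comment, assuming ids are returned as prefixed fullnames.
example = {"id": "gk1xyz", "link_id": "t3_l0abcd", "parent_id": "t1_gk1abc"}
parsed = parse_comment(example)
print(parsed)
```

Keeping track of whether a parent is a comment or the submission itself makes it easier to decide later who replied to whom.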

> **Note**: It took me about a night to get the data for *Exercise 2*. If you experience extremely slow download times, reach out to me or your TA! If you are brave, you can also check out the Reddit API, which is wrapped by [praw](https://praw.readthedocs.io/en/latest/tutorials/comments.html). It works very much like psaw, but it requires you to first get credentials [here](https://www.reddit.com/prefs/apps) (click on *Create another app*).

> *Exercise 3*: **Ten characteristics of Big Data.** Consider the dataset you have just collected, and think of the *10 characteristics of Big Data* from the book [Bit by Bit section 2.3](https://www.bitbybitbook.com/en/1st-ed/observing-behavior/characteristics/). As usual, reach out to me if you have any doubts.
> * **Big**. How large is this data (approximately)? Could you collect the same amount of information via surveys?
> * **Always-on**. Can you keep collecting data over time?
> * **Non-reactive**. Is the dataset non-reactive?
> * **Incomplete**. Do you think the dataset captures entirely the unfolding of events leading to the GME stock rise in price? 
> * **Inaccessible**. Is the data accessible? 
> * **Non-representative**. Do you think that the conclusions we will draw by analyzing this dataset are specific to the GME events? Or could they instead help us understand social phenomena more in general? If yes, which phenomena could you think of? If not, what are the aspects that make this dataset non-representative?
> * **Drifting**. Is there any source of *drift* in this dataset (within the period observed)? 
> * **Algorithmically confounded**. Is the dataset algorithmically confounded? If yes, why?
> * **Dirty**. What aspect may make this dataset *dirty*?
> * **Sensitive**. Is there any sensitive information in the data?