---
---


# Observing Behaviors
## Bit by Bit: Social Research in the Digital Age


---
---

![chengjun.png](attachment:chengjun.png)


https://www.bitbybitbook.com/
    
![image.png](attachment:image.png)
    

## 2.1 Introduction
Now, in the digital age, the behaviors of billions of people are recorded, stored, and analyzable.

Because these types of data are a **by-product** of people’s everyday actions, they are often called **digital traces**. 

In addition to these traces held by businesses, there are also large amounts of incredibly rich data held by governments. Together, these _business_ and _government_ records are often called **big data**.

![bear.png](attachment:bear.png)

A first step to learning from big data is realizing that it is part of a broader category of data that has been used for social research for many years: **observational data**.

In addition to business and government records, observational data also includes things like the text of newspaper articles and satellite photos.

## 2.2 Big data
Big data are created and collected by companies and governments for purposes other than research. 

Using this data for research therefore requires **repurposing**.

Using big data for research requires adapting it slightly to suit this new purpose.

- First, increasingly, corporate big data sources come from digital devices in the physical world. 
- The second important source of big data missed by a narrow focus on online behavior is data created by governments--government administrative records.

- “3 Vs”: Volume, Variety, and Velocity. 
- “5 Ws”: Who, What, Where, When, and Why. 

Social scientists, who are accustomed to working with data designed for research, are typically quick to point out the problems with repurposed data, while ignoring its strengths. 

On the other hand, data scientists are typically quick to point out the benefits of repurposed data, while ignoring its weaknesses. 

Naturally, the best approach is a hybrid. 

## 2.3 Ten common characteristics of big data

- generally helpful for research: big, always-on, and nonreactive

- generally problematic for research: incomplete, inaccessible, nonrepresentative, drifting, algorithmically confounded, dirty, and sensitive

### 2.3.1 Big
Large datasets are a means to an end; they are not an end in themselves.

Three specific scientific ends of using big data:
- the study of rare events
- the study of heterogeneity
- to detect small differences

![image.png](attachment:image.png)

Jean-Baptiste Michel et al. (2011) Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331, 176.

![image.png](attachment:image.png)
 
Raj Chetty, et al. (2017) The fading American dream: Trends in absolute income mobility since 1940. Science. 356(6336):398-406. https://opportunityinsights.org/

Chetty and colleagues were able to use the tax records from 40 million people to estimate the heterogeneity in intergenerational mobility across regions in the United States. They found, for example, that the probability that a child reaches the top quintile of the national income distribution starting from a family in the bottom quintile is about 13% in San Jose, California, but only about 4% in Charlotte, North Carolina.

Finally, in addition to studying rare events and studying heterogeneity, large datasets also enable researchers to detect small differences.
- Reliably detecting the difference between 1% and 1.1% click-through rates on an ad can translate into millions of dollars in extra revenue.
- Picking the more effective intervention could end up saving thousands of additional lives.
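
To see why detecting such a small gap calls for a large dataset, here is a minimal sketch of a standard sample-size calculation for comparing two proportions. The 1% and 1.1% rates come from the example above; the significance level (0.05) and power (0.80) are assumed conventional defaults, and the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    p_bar = (p1 + p2) / 2                          # pooled rate under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Click-through rates from the text; alpha and power are assumed defaults.
n_small_gap = sample_size_two_proportions(0.010, 0.011)
# A hypothetical, much larger gap for comparison.
n_large_gap = sample_size_two_proportions(0.010, 0.020)
print(n_small_gap, n_large_gap)
```

Under these assumptions, the 1% versus 1.1% comparison needs well over a hundred thousand impressions per ad, while the hypothetical 1% versus 2% comparison needs only a few thousand.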

Although bigness is generally a good property when used correctly, I’ve noticed that it can sometimes lead to a conceptual error.
- While bigness does reduce the need to worry about random error, it actually increases the need to worry about systematic errors.
    - Social bots (Pury 2011; Back, Küfner, and Egloff 2011)

Big datasets seem to lead some researchers to ignore how their data was created, which can lead them to get a precise estimate of an unimportant quantity.
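
The distinction between random and systematic error can be made concrete with a small calculation. The sketch below assumes a hypothetical setting in which the true share of people holding some opinion is 50%, but 10% of the accounts in the data are bots that always express it. All numbers and names are illustrative assumptions, not values from any study.

```python
from math import sqrt


def error_components(n, true_share=0.50, bot_fraction=0.10):
    """Return (bias, standard error) of the naive share estimated from n accounts."""
    # Bots always count as "yes", so they pull the expected estimate upward.
    expected_estimate = (1 - bot_fraction) * true_share + bot_fraction * 1.0
    bias = expected_estimate - true_share                      # systematic error
    std_error = sqrt(expected_estimate * (1 - expected_estimate) / n)  # random error
    return bias, std_error


for n in (1_000, 1_000_000, 1_000_000_000):
    bias, std_error = error_components(n)
    print(f"n={n:>13,}  bias={bias:.3f}  standard error={std_error:.6f}")
```

The standard error shrinks as the dataset grows, but the bias stays the same at every sample size, so a bigger dataset only yields a more precise estimate of the wrong quantity.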

### 2.3.2 Always-on
Always-on big data enables the study of unexpected events and real-time measurement.

First, always-on data collection enables researchers to study unexpected events in ways that would not otherwise be possible. 

![image.png](attachment:image.png)

Design used by Budak and Watts (2015) to study the Occupy Gezi protests in Turkey in the summer of 2013. 

Studies of Unexpected Events Using Always-On Big Data Sources

<table>
  <tr>
    <th>Unexpected event</th>
    <th>Always-on data source</th>
      <th>References</th>
  </tr>
  <tr>
    <td>Occupy Gezi movement in Turkey</td>
    <td>Twitter</td>
    <td>Budak and Watts (2015)</td>
  </tr>
    <tr>
    <td>Umbrella protests in Hong Kong</td>
    <td>Weibo</td>
    <td>Zhang (2016)</td>
  </tr>
  
  <tr>
    <td>Shootings of police in New York City</td>
    <td>Stop-and-frisk reports</td>
    <td>Legewie (2016)</td>
  </tr>
    
  <tr>
    <td>Person joining ISIS</td>
    <td>Twitter</td>
    <td>Magdy, Darwish, and Weber (2016)</td>
  </tr>
    
  <tr>
    <td>September 11, 2001 attack</td>
    <td>livejournal.com</td>
    <td>Cohn, Mehl, and Pennebaker (2004)</td>
  </tr>
    
  <tr>
    <td>September 11, 2001 attack</td>
    <td>Pager messages</td>
    <td>Back, Küfner, and Egloff (2010), Pury (2011), Back, Küfner, and Egloff (2011)</td>
  </tr>

</table>


In addition to studying unexpected events, always-on big data systems also enable researchers to produce real-time estimates, which can be important in settings where policy makers—in government or industry—want to respond based on situational awareness. 

- social media data can be used to guide emergency response to natural disasters (Castillo 2016)
- a variety of different big data sources can be used to produce real-time estimates of economic activity (Choi and Varian 2012).

> I do not, however, think that always-on data systems are well suited for tracking changes over very long periods of time. That is because many big data systems are constantly changing—a process that I’ll call drift later.

### 2.3.3 Nonreactive

Measurement in big data sources is much less likely to change behavior.

One challenge of social research is that people can change their behavior when they know that they are being observed by researchers. Social scientists generally call this **reactivity** (Webb et al. 1966). 

For example, people can be more generous in laboratory studies than field studies because in the former they are very aware that they are being observed (Levitt and List 2007a).

One aspect of big data that many researchers find promising is that participants are generally not aware that their data are being captured or they have become so accustomed to this data collection that it no longer changes their behavior. 

For example, Stephens-Davidowitz (2014) used the prevalence of racist terms in search engine queries to measure racial animus in different regions of the United States. 

<u>Nonreactivity, however, does not ensure that these data are somehow a direct reflection of people’s behavior or attitudes.</u>

Even though some big data sources are nonreactive, they are not always free of **social desirability bias**, the tendency for people to want to present themselves in the best possible way.

The behavior captured in big data sources is sometimes impacted by the goals of platform owners, an issue of <u>algorithmic confounding</u>.

### 2.3.4 Incomplete

No matter how big your big data, it probably doesn’t have the information you want.

Big data tends to be missing three types of information useful for social research: 

- demographic information about participants, 
- behavior on other platforms, 
- and data to operationalize theoretical constructs.

Roughly, **theoretical constructs are abstract ideas** that social scientists study, and operationalizing a theoretical construct means proposing some way to capture that construct with observable data.

Social scientists call the match between theoretical constructs and data <big>**construct validity**</big> (Cronbach and Meehl 1955). 

How to test this **claim**: _people who are more intelligent earn more money._ 

- In order to test this claim, you would need to measure “intelligence.” 
    - But what is intelligence?
    Gardner (2011) argued that there are actually eight different forms of intelligence.

- The Raven Progressive Matrices Test
    - a well-studied test of analytic intelligence (Carpenter, Just, and Shell 1990)

In the first study, the researcher found that people who score well on the Raven Progressive Matrices Test have higher reported incomes on their tax returns.

In the second study, the researcher found that people on Twitter who used longer words are more likely to mention luxury brands.

As this example illustrates, more data does not automatically solve problems with construct validity.

You should doubt the results of the second study whether it involved a million tweets, a billion tweets, or a trillion tweets.






Examples of Digital Traces That Were Used to Operationalize Theoretical Constructs

<table>
  <tr>
    <th>Data source</th>
    <th>Theoretical construct</th>
    <th>References</th>
  </tr>
  <tr>
    <td>Email logs from a university (metadata only)</td>
    <td>Social relationships</td>
    <td>Kossinets and Watts (2006), Kossinets and Watts (2009), De Choudhury et al. (2010)</td>
  </tr>
  <tr>
    <td>Social media posts on Weibo</td>
    <td>Civic engagement</td>
    <td>Zhang (2016)</td>
  </tr>
  <tr>
    <td>Email logs from a firm (metadata and complete text)</td>
    <td>Cultural fit in an organization</td>
    <td>Srivastava et al. (2017)</td>
  </tr>
</table>

Three Solutions:
- The first solution is to actually collect the data you need;
- The second main solution is to do what data scientists call user-attribute inference and social scientists call imputation.
- A third possible solution is to combine multiple data sources. This process is sometimes called record linkage.
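
As a rough illustration of the third solution, the sketch below links two toy datasets by exact match on a normalized name and a birth date. The field names, records, and matching rule are all hypothetical; real record linkage usually has to cope with typos, missing fields, and ambiguous matches, which this sketch ignores.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(name.lower().split())


def link_records(platform_rows, admin_rows):
    """Exact-match linkage on (normalized name, birth date)."""
    admin_index = {
        (normalize_name(row["name"]), row["birth_date"]): row for row in admin_rows
    }
    linked = []
    for row in platform_rows:
        key = (normalize_name(row["name"]), row["birth_date"])
        match = admin_index.get(key)
        if match is not None:
            # Platform fields win on conflicts; administrative fields are added.
            linked.append({**match, **row})
    return linked


# Hypothetical records: one pair matches, the other differs in birth date.
platform_rows = [
    {"name": "Ana  Lopez", "birth_date": "1990-04-02", "posts": 120},
    {"name": "Ben Okafor", "birth_date": "1985-11-30", "posts": 15},
]
admin_rows = [
    {"name": "ana lopez", "birth_date": "1990-04-02", "region": "North"},
    {"name": "Ben Okafor", "birth_date": "1985-12-30", "region": "South"},
]
linked = link_records(platform_rows, admin_rows)
print(linked)
```

Only the first person is linked here, because the second pair disagrees on birth date. Choosing how strict the matching rule should be is a substantive research decision, not just a technical one.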

**A book of life** or **a database of ruin**? (Dunn 1946; Ohm 2010)

- Each person in the world creates **a Book of Life**. 
- This Book starts with birth and ends with death. 
- Its pages are made up of records of the principal events in life. 
- Record linkage is the name given to the process of assembling the pages of this book into a volume.



### 2.3.5 Inaccessible
Data held by companies and governments are difficult for researchers to access.

**The Utah Data Center**

In May 2014, the US National Security Agency opened a data center in rural Utah with an awkward name, the Intelligence Community Comprehensive National Cybersecurity Initiative Data Center.

One report alleges that it is able to store and process all forms of communication including “the complete contents of private emails, cell phone calls, and Google searches, as well as all sorts of personal data trails—parking receipts, travel itineraries, bookstore purchases, and other digital ‘pocket litter’” (Bamford 2012). 

These data are inaccessible not because people at companies and governments are stupid, lazy, or uncaring. Rather, there are serious legal, business, and ethical barriers that prevent data access.

**The story of Abdur Chowdhury**

In 2006, when he was the head of research at AOL, he intentionally released to the research community what he thought were anonymized search queries from 650,000 AOL users.

Reporters from the New York Times were able to identify someone in the dataset with ease (Barbaro and Zeller 2006).

Ultimately, Chowdhury was fired, and AOL’s chief technology officer resigned (Hafner 2006). 

Researchers can, however, sometimes gain access to data that is inaccessible to the general public. 

Four ingredients in successful partnerships: 
- researcher interest, 
- researcher capability, 
- company interest, 
- company capability. 



Even a successful partnership comes with downsides:

- First, you will probably not be able to share your data with other researchers, which means that other researchers will not be able to verify and extend your results. 
- Second, the questions that you can ask may be limited; companies are unlikely to allow research that could make them look bad. 
- Finally, these partnerships can create at least the appearance of a conflict of interest, where people might think that your results were influenced by your partnerships. 

In summary, lots of big data are inaccessible to researchers. There are serious legal, business, and ethical barriers that prevent data access, and these barriers will not go away as technology improves, because they are not technical barriers. 

### 2.3.6 Nonrepresentative
Nonrepresentative data are bad for out-of-sample generalizations, but can be quite useful for within-sample comparisons.


Some social scientists are accustomed to working with data that comes from a probabilistic random sample from a well-defined population, such as all adults in a particular country. This kind of data is called representative data because the sample “represents” the larger population.

At the most extreme, some skeptics seem to believe that nothing can be learned from nonrepresentative data. Fortunately, these skeptics are only partially right. 

**John Snow’s study of the 1853–54 cholera outbreak in London**

At the time, many doctors believed that cholera was caused by “bad air,” but Snow believed that it was an infectious disease, perhaps spread by sewage-laced drinking water. 

To test this idea, Snow took advantage of what we might now call a natural experiment. He compared the cholera rates of households served by two different water companies: Lambeth and Southwark & Vauxhall.

Lambeth moved its intake point upstream from the main sewage discharge in London, whereas Southwark & Vauxhall left their intake pipe downstream from the sewage discharge. When Snow compared the death rates from cholera in households served by the two companies, he found that customers of Southwark & Vauxhall (the company that was providing customers sewage-tainted water) were 10 times more likely to die from cholera. 

This result provides strong scientific evidence for Snow’s argument about the cause of cholera, even though it is not based on a representative sample of people in London.

The data from these two companies, however, would not be ideal for answering a different question: what was the prevalence of cholera in London during the outbreak? For that second question, which is also important, it would be much better to have a representative sample of people from London.

As Snow’s work illustrates, there are some scientific questions for which nonrepresentative data can be quite effective, and there are others for which it is not well suited. 

**The British Doctors Study** 

Richard Doll and A. Bradford Hill followed approximately 25,000 male doctors for several years and compared their death rates based on the amount that they smoked when the study began.

Doll and Hill (1954) found a strong exposure–response relationship: the more heavily people smoked, the more likely they were to die from lung cancer. 

Of course, it would be unwise to estimate the prevalence of lung cancer among all British people based on this group of male doctors, but the within-sample comparison still provides evidence that smoking causes lung cancer.

The generalization from a sample to the population from which it is drawn is largely a statistical issue, but the **transportability of patterns** found in one group to another group is largely a nonstatistical issue (Pearl and Bareinboim 2014; Pearl 2015).

**A study of the 2009 German parliamentary election by Andranik Tumasjan and colleagues (2010)** 

By analyzing more than 100,000 tweets, they found that the proportion of tweets mentioning a political party matched the proportion of votes that party received in the parliamentary election.

![image.png](attachment:image.png)

A follow-up paper by Andreas Jungherr, Pascal Jürgens, and Harald Schoen (2012) pointed out that the original analysis had excluded the political party that had received the most mentions on Twitter: the Pirate Party, a small party that fights government regulation of the Internet. 
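
The sensitivity of this kind of analysis to which parties are included can be shown with a toy calculation. The party names and mention counts below are hypothetical and are not taken from either paper; the sketch only shows how mention shares shift when the most-mentioned party is dropped.

```python
def mention_shares(mention_counts, include=None):
    """Share of mentions per party, optionally restricted to a subset of parties."""
    if include is None:
        selected = dict(mention_counts)
    else:
        selected = {p: c for p, c in mention_counts.items() if p in include}
    total = sum(selected.values())
    return {party: count / total for party, count in selected.items()}


# Hypothetical counts: the small party attracts the most mentions.
counts = {"Party A": 300, "Party B": 200, "Small Party": 500}

all_parties = mention_shares(counts)
without_small = mention_shares(counts, include={"Party A", "Party B"})
print(all_parties)
print(without_small)
```

With every party included, the small party dominates the mention shares. With it excluded, the remaining shares look very different, which is why the choice of which parties to count needs to be stated and justified.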

To conclude, many big data sources are not representative samples from some well-defined population. For questions that require generalizing results from the sample to the population from which it was drawn, this is a serious problem. But for questions about within-sample comparisons, nonrepresentative data can be powerful, so long as researchers are clear about the characteristics of their sample and support claims about transportability with theoretical or empirical evidence. In fact, my hope is that big data sources will enable researchers to make more within-sample comparisons in many nonrepresentative groups, and my guess is that estimates from many different groups will do more to advance social research than a single estimate from a probabilistic random sample.

### 2.3.7 Drifting
Population drift, usage drift, and system drift make it hard to use big data sources to study long-term trends.


To reliably measure change, the measurement system itself must be stable. In the words of sociologist Otis Dudley Duncan, “if you want to measure change, don’t change the measure” (Fischer 2011).

Unfortunately, many big data systems—especially business systems—are changing all the time, a process that is called drift. 

In particular, these systems change in three main ways: 
- population drift (change in who is using them), 
- behavioral drift (change in how people are using them), and 
- system drift (change in the system itself). 

Population drift: for example, during the US presidential election of 2012, the proportion of tweets about politics that were written by women fluctuated from day to day (Diaz et al. 2016). 

Behavioral drift: for example, during the 2013 Occupy Gezi protests in Turkey, protesters changed their use of hashtags as the protest evolved. 

System drift: for example, over time, Facebook has increased the limit on the length of status updates.

### 2.3.8 Algorithmically confounded

Behavior in big data systems is not natural; it is driven by the engineering goals of the systems.


The ways that the goals of system designers can introduce patterns into data is called algorithmic confounding. 

On Facebook there is an anomalously high number of users with approximately 20 friends, as was discovered by Johan Ugander and colleagues (2011).
- Facebook encouraged people with few connections on Facebook to make more friends until they reached 20 friends. 

In other words, the surprisingly high number of people with about 20 friends tells us more about Facebook than about human behavior.

There is an even trickier version of algorithmic confounding that occurs when designers of online systems are aware of social theories and then bake these theories into the working of their systems. 
- Social scientists call this **performativity**: 
    - when a theory changes the world in such a way that it brings the world more into line with the theory. 

One example of a pattern created by performativity is transitivity in online social networks.

However, the magnitude of transitivity in the Facebook social graph is partially driven by algorithmic confounding. 
- That is, data scientists at Facebook knew of the empirical and theoretical research about transitivity and then baked it into how Facebook works. Facebook has a “People You May Know” feature that suggests new friends, and one way that Facebook decides who to suggest to you is transitivity. 

The theory of transitivity brings the world into line with the predictions of the theory (Zignani et al. 2014; Healy 2015). 

Algorithmic confounding was one possible explanation for the gradual breakdown of Google Flu Trends.

We should be cautious about machines, social bots, and digital media.

### 2.3.9 Dirty
Big data sources can be loaded with junk and spam.


Cleaning big data sources seems to be more difficult than cleaning traditional survey data. 
- the ultimate source of this difficulty is that many of these big data sources were never intended to be used for research 
- they are not collected, stored, and documented in a way that facilitates data cleaning.

![image.png](attachment:image.png)

Back and colleagues’ (2010) study of the emotional response to the attacks of September 11, 2001.

Originally, Back, Küfner, and Egloff (2010) reported a pattern of increasing anger throughout the day. However, most of these apparently angry messages were generated by a single pager that repeatedly sent out the following message: 
> “Reboot NT machine [name] in cabinet [name] at [location]:CRITICAL:[date and time]”. 

With this message removed, the apparent increase in anger disappears (Pury 2011; Back, Küfner, and Egloff 2011). 

While some dirty data is created unintentionally, there are also some online systems that attract intentional spammers.

For example, political activity on Twitter seems to include at least some reasonably sophisticated spam, whereby some political causes are intentionally made to look more popular than they actually are (Ratkiewicz et al. 2011).

For example, many edits to Wikipedia are created by automated bots (Geiger 2014).

The best way to avoid being fooled by dirty data is to understand as much as possible about how your data were created.

### 2.3.10 Sensitive
Some of the information that companies and governments have is sensitive.

**The Netflix Prize**

In 2006, Netflix released 100 million movie ratings provided by almost 500,000 members and had an open call where people from all over the world submitted algorithms that could improve Netflix’s ability to recommend movies. 

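Arvind Narayanan and Vitaly Shmatikov (2008) showed that it was possible to learn about specific people’s movie ratings by linking the released data to other information about them, such as public ratings on the Internet Movie Database.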

In fact, in response to the release and re-identification of the data, a closeted lesbian woman joined a class-action suit against Netflix. Here’s how the problem was expressed in this lawsuit (Singel 2009):

> “[M]ovie and rating data contains information of a . . . highly personal and sensitive nature. The member’s movie data exposes a Netflix member’s personal interest and/or struggles with various highly personal issues, including sexuality, mental illness, recovery from alcoholism, and victimization from incest, physical abuse, domestic violence, adultery, and rape.”

In the digital age companies and governments are able to collect data at a scale that was not possible previously, but these data were not collected by researchers for researchers.
- 3 Pros: big, always-on, and nonreactive
- 7 Cons: incomplete, inaccessible, nonrepresentative, drifting, algorithmically confounded, dirty, and sensitive

Government data tends to be less nonrepresentative, less algorithmically confounded, and less drifting. On the other hand, business administrative records tend to be more always-on. 


---
---

## 2.4 Research Strategies With Big Data


---
---

- counting things
- forecasting things
- approximating experiments.


- motivation by absence (studying something simply because no one has counted it before)
    - does not usually lead to good research.
- a better strategy is to look for research questions that are important or interesting (or ideally both)

### 2.4.1 Counting things

Simple counting can be interesting if you combine a good question with good data.

One way to think about important research is that it has some measurable impact or feeds into an important decision by policy makers. 

- For example, measuring the rate of unemployment is important because it is an indicator of the economy that drives policy decisions. 

Counting in very particular settings can reveal important insights into more general ideas about how social systems work. 


> What makes these particular counting exercises interesting is not the data itself; the interest comes from these more general ideas.

![torch.gif](attachment:torch.gif)



**Henry Farber’s (2015) study of the behavior of New York City taxi drivers**

- Neoclassical models in economics predict that taxi drivers will work more on days when they have higher hourly wages. 
- Alternatively, models from behavioral economics predict exactly the opposite.

So, do drivers work more hours on days with higher hourly wages (as predicted by the neoclassical models) or more hours on days with lower hourly wages (as predicted by behavioral economic models)?



**Taxi meter data** 
- start time, start location, end time, end location, fare, and tip (if the tip was paid with a credit card)
- https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Farber found that most drivers work more on days when wages are higher, consistent with the neoclassical theory.

In addition, Farber was able to use the size of the data for a better understanding of heterogeneity and dynamics. 
- Over time, newer drivers gradually learn to work more hours on high-wage days (i.e., they learn to behave as the neoclassical model predicts). 
- And new drivers who behave more like target earners are more likely to quit being taxi drivers. 

Both of these more subtle findings, which help explain the observed behavior of current drivers, were only possible because of the size of the dataset. They were impossible to detect in earlier studies that used paper trip sheets from a small number of taxi drivers over a short period of time (Camerer et al. 1997).

> The key to Farber’s research was bringing an interesting question to the data, a question that has larger implications beyond just this specific setting.

![bear.png](attachment:bear.png)

**Online censorship research** by Gary King, Jennifer Pan, and Molly Roberts (2013)

Scholars actually have conflicting expectations about which kinds of posts are most likely to get deleted. 

- posts that are critical of the state
- posts that encourage collective behavior, such as protests.

- King and colleagues crawled more than 1,000 Chinese social media websites, found relevant posts, and then revisited these posts to see which were subsequently deleted.
- They obtained about 11 million posts on 85 different prespecified topics, of which about 2 million had been censored.
- They then needed a way to label their 11 million social media posts as (1) critical of the state, (2) supportive of the state, or (3) irrelevant or factual reports about the events. 

![image.png](attachment:image.png)

- the probability of a post being deleted was unrelated to whether it was critical of the state or supportive of the state.
- only three types of posts were regularly censored: pornography, criticism of censors, and those that had collective action potential (i.e., the possibility of leading to large-scale protests). 

By observing a huge number of posts that were deleted and posts that were not deleted, King and colleagues were able to learn how the censors work just by watching and counting. 
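
The "watching and counting" step reduces to comparing deletion rates across categories of posts. A minimal sketch, using a handful of hypothetical labeled posts rather than the real data, might look like this:

```python
from collections import defaultdict


def deletion_rate_by_label(posts):
    """Share of posts that were later deleted, computed separately for each label."""
    totals = defaultdict(int)
    deleted = defaultdict(int)
    for post in posts:
        totals[post["label"]] += 1
        deleted[post["label"]] += int(post["deleted"])
    return {label: deleted[label] / totals[label] for label in totals}


# Hypothetical labeled posts: (label, number of posts, number later deleted).
toy_counts = [("critical", 4, 1), ("supportive", 4, 1), ("collective_action", 4, 3)]
posts = [
    {"label": label, "deleted": i < n_deleted}
    for label, n_posts, n_deleted in toy_counts
    for i in range(n_posts)
]

rates = deletion_rate_by_label(posts)
print(rates)
```

In this made-up data, critical and supportive posts are deleted at the same rate while posts with collective action potential are deleted far more often, which mirrors the shape of the finding described above. The hard part of the real study was collecting the posts and labeling them, not the counting itself.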

### 2.4.2 Forecasting and nowcasting
Predicting the future is hard, but predicting the present is easier.

Each year, seasonal influenza epidemics cause millions of illnesses and hundreds of thousands of deaths around the world. 
- The 1918 influenza outbreak, for example, is estimated to have killed between 50 and 100 million people (Morens and Fauci 2007). 

The US Centers for Disease Control and Prevention (CDC) regularly and systematically collects information from carefully selected doctors.
- Although this system produces high-quality data, it has a reporting lag.

**Nowcasting**
- a term derived from combining “now” and “forecasting.” 

Rather than predicting the future, nowcasting attempts to use ideas from forecasting to measure the current state of the world: 
- it attempts to “predict the present” (Choi and Varian 2012). 

Nowcasting has the potential to be especially useful to governments and companies that require timely and accurate measures of the world.

**Google Flu Trends**
 
- Using data from 2003 to 2007, Ginsberg and colleagues estimated the relationship between the prevalence of influenza in the CDC data and the search volume for 50 million distinct terms. 
- The researchers found a set of 45 different queries that seemed to be most predictive of the CDC flu prevalence data. 
- Ginsberg and colleagues tested their model during the 2007–2008 influenza season. 
 

- Ginsberg and colleagues found that their procedures could indeed make useful and accurate nowcasts.
    

![image.png](attachment:image.png)

This apparent success, however, comes with two important caveats. First, the performance of Google Flu Trends was actually not much better than that of a simple model that estimates the amount of flu based on a linear extrapolation from the two most recent measurements of flu prevalence (Goel et al. 2010). And, over some time periods, Google Flu Trends was actually worse than this simple approach (Lazer et al. 2014). 
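
To make the comparison baseline concrete, here is a minimal sketch of one reading of "linear extrapolation from the two most recent measurements": extend the straight line through the last two available values by one step. The weekly values are hypothetical, and the exact model used by Goel et al. (2010) may differ from this simplification.

```python
def extrapolate_from_last_two(history):
    """Predict the next value by extending the line through the last two points."""
    if len(history) < 2:
        raise ValueError("need at least two past measurements")
    return history[-1] + (history[-1] - history[-2])


def mean_absolute_error(series):
    """Average one-step-ahead error of the extrapolation baseline over a series."""
    errors = [
        abs(extrapolate_from_last_two(series[:t]) - series[t])
        for t in range(2, len(series))
    ]
    return sum(errors) / len(errors)


# Hypothetical weekly flu prevalence (percent of doctor visits), not real CDC data.
weekly_prevalence = [1.0, 1.2, 1.5, 1.9, 2.4, 2.8, 3.0, 2.9]
baseline_mae = mean_absolute_error(weekly_prevalence)
print(round(baseline_mae, 3))
```

Any nowcasting model built on richer data should be compared against a baseline like this one. If it cannot beat the baseline's error, the extra data is adding little.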

The second important caveat about Google Flu Trends is that its ability to predict the CDC flu data was prone to short-term failure and long-term decay because of drift and algorithmic confounding. 

By using more careful methods, Lazer et al. (2014) and Yang, Santillana, and Kou (2015) were able to avoid these two problems.

### 2.4.3 Approximating experiments

We can approximate experiments that we can’t do. Two approaches that especially benefit from the digital age are 
- natural experiments 
- matching.

Highly recommend reading one of the many excellent books on causal inference 
- Imbens and Rubin 2015 
- Pearl 2009
- Morgan and Winship 2014

One approach to making causal estimates from non-experimental data is to look for an event that has randomly assigned a treatment to some people and not to others. These situations are called **natural experiments**.

**The research of Joshua Angrist (1990) measuring the effect of military service on earnings**

- During the war in Vietnam, the US government held a lottery to randomly call young men into service.
- Joshua Angrist (1990) combined the draft lottery with earnings data from the Social Security Administration to estimate the effect of military service on earnings.
- The earnings of veterans were about 15% less than the earnings of comparable non-veterans.

As this example illustrates, sometimes social, political, or natural forces create experiments that can be leveraged by researchers, and sometimes the effects of those experiments are captured in always-on big data sources. 

**random (or as if random) variation** + **always-on data** = natural experiment

The effect of working with productive colleagues on a worker’s productivity

Alexandre Mas and Enrico Moretti (2009) studied cashiers at a particular supermarket.
- each cashier had different co-workers at different times of day.
- a digital-age checkout system
- When a cashier was assigned co-workers who were 10% more productive than average, her productivity increased by 1.5%. 

In practice, researchers use two different strategies for finding natural experiments, both of which can be fruitful. 
- Some researchers start with an always-on data source and look for random events in the world; 
- others start with a random event in the world and look for data sources that capture its impact.

**Examples of Natural Experiments Using Big Data Sources**

| Substantive focus                      | Source of natural experiment | Always-on data source   | Reference                                |
| :------------------------------------- | :--------------------------- | :---------------------- | :--------------------------------------- |
| Peer effects on productivity           | Scheduling process           | Checkout data           | Mas and Moretti (2009)                   |
| Friendship formation                   | Hurricanes                   | Facebook                | Phan and Airoldi (2015)                  |
| Spread of emotions                     | Rain                         | Facebook                | Lorenzo Coviello et al. (2014)           |
| Peer-to-peer economic transfers        | Earthquake                   | Mobile money data       | Blumenstock, Fafchamps, and Eagle (2011) |
| Personal consumption behavior          | 2013 US government shutdown  | Personal finance data   | Baker and Yannelis (2015)                |
| Economic impact of recommender systems | Various                      | Browsing data at Amazon | Sharma, Hofman, and Watts (2015)         |
| Effect of stress on unborn babies      | 2006 Israel–Hezbollah war    | Birth records           | Torche and Shwed (2015)                  |
| Reading behavior on Wikipedia          | Snowden revelations          | Wikipedia logs          | Penney (2016)                            |
| Peer effects on exercise               | Weather                      | Fitness trackers        | Aral and Nicolaides (2017)               |

Angrist was interested in estimating the effect of military service on earnings. 

Unfortunately, military service was not randomly assigned; rather it was being drafted that was randomly assigned. 
- not everyone who was drafted served (there were a variety of exemptions), 
- not everyone who served was drafted (people could volunteer to serve). 

But Angrist didn’t want to know the effect of being drafted; he wanted to know the effect of serving in the military. 

To make this estimate, however, additional assumptions and complications are required. 

- **the exclusion restriction** assumption
    - researchers need to assume that <u>the only way that being drafted impacted earnings is through military service</u>
    
Even if the exclusion restriction holds, researchers can only estimate the effect on a specific subset of men called compliers (men who would serve when drafted, but would not serve when not drafted) (Angrist, Imbens, and Rubin 1996). Compliers, however, were not the original population of interest.
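The draft-lottery logic can be sketched with the Wald (ratio) estimator, the simplest instrumental-variables estimator: divide the effect of being drafted on earnings by the effect of being drafted on serving. This is only an illustrative sketch, not Angrist's actual analysis; the function name `wald_estimate` and the toy numbers are assumptions made up for this example.

```python
from statistics import mean


def wald_estimate(drafted, served, earnings):
    """Wald estimator: (effect of draft on earnings) / (effect of draft on service).

    Under the exclusion restriction (and related IV assumptions), this ratio
    estimates the effect of service for compliers only.
    """
    y_drafted = [y for z, y in zip(drafted, earnings) if z == 1]
    y_not = [y for z, y in zip(drafted, earnings) if z == 0]
    d_drafted = [d for z, d in zip(drafted, served) if z == 1]
    d_not = [d for z, d in zip(drafted, served) if z == 0]

    # Effect of being drafted on earnings (what randomization identifies directly).
    draft_effect_on_earnings = mean(y_drafted) - mean(y_not)
    # Effect of being drafted on the probability of serving.
    draft_effect_on_service = mean(d_drafted) - mean(d_not)

    if draft_effect_on_service == 0:
        raise ValueError("The draft did not change service rates; the ratio is undefined.")
    return draft_effect_on_earnings / draft_effect_on_service


# Hypothetical toy data (not real earnings): 1 = yes, 0 = no.
drafted = [1, 1, 1, 1, 0, 0, 0, 0]
served = [1, 1, 0, 0, 1, 0, 0, 0]   # some drafted men do not serve; one volunteer
earnings = [80, 80, 100, 100, 80, 100, 100, 100]

estimate = wald_estimate(drafted, served, earnings)
```

In this made-up data, drafted men earn 5 less on average and are 25 percentage points more likely to serve, so the ratio attributes a difference of 20 to service itself.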

The second strategy for making causal estimates from non-experimental data depends on **statistically adjusting** non-experimental data 
- in an attempt to account for preexisting differences between those who did and did not receive the treatment. 

In matching, the researcher looks through non-experimental data 
- to create pairs of people who are similar except that one has received the treatment and one has not. 
- to prune/discard cases where there is no obvious match. 
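The two steps above (pair similar cases, prune the rest) can be sketched as exact matching on a few covariates. This is a minimal sketch under stated assumptions: the field names (`age_group`, `city`, `treated`, `outcome`), the function `exact_match_effect`, and the data are all hypothetical, and each matched group is weighted equally, which is only one of several reasonable choices.

```python
from collections import defaultdict
from statistics import mean


def exact_match_effect(records, match_keys):
    """Compare treated and untreated cases that agree exactly on match_keys.

    Returns (average within-group difference, number of pruned cases).
    """
    groups = defaultdict(lambda: {"treated": [], "control": []})
    for record in records:
        key = tuple(record[k] for k in match_keys)
        arm = "treated" if record["treated"] else "control"
        groups[key][arm].append(record["outcome"])

    differences = []
    pruned = 0
    for arms in groups.values():
        if arms["treated"] and arms["control"]:
            # A usable match: both a treated and an untreated case exist.
            differences.append(mean(arms["treated"]) - mean(arms["control"]))
        else:
            # No obvious match: discard these cases.
            pruned += len(arms["treated"]) + len(arms["control"])

    if not differences:
        raise ValueError("No matched groups found.")
    return mean(differences), pruned


# Hypothetical records for illustration only.
records = [
    {"age_group": "20s", "city": "A", "treated": True, "outcome": 12},
    {"age_group": "20s", "city": "A", "treated": False, "outcome": 10},
    {"age_group": "30s", "city": "B", "treated": True, "outcome": 25},
    {"age_group": "30s", "city": "B", "treated": False, "outcome": 21},
    {"age_group": "40s", "city": "C", "treated": True, "outcome": 40},  # no match
]

effect, pruned = exact_match_effect(records, ["age_group", "city"])
```

Note that the returned estimate says nothing about the pruned case: it describes only the records that could be matched.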


The research on consumer behavior by Liran Einav and colleagues (2015)

- is one example of the power of matching strategies with massive non-experimental data sources;
- estimated the effect of auction starting price on auction outcomes, such as the sale price or the probability of a sale;
- took advantage of things similar to field experiments that had already happened on eBay. 


An example of a matched set. This is the exact same golf club (a Taylormade Burner 09 Driver) being sold by the exact same person (“budgetgolfer”), but some of these sales were performed under different conditions (e.g., different starting prices).

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

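Bias in the matching
- Results can be biased by anything not used for matching; for example, an apparent effect of starting price could be an artifact of seasonal variation in demand.
    - One way to address this concern is to try many different kinds of matching. 
    - Einav and colleagues repeated their analysis while varying the time window used for matching.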
- Estimates from matching only apply to matched data; they do not apply to the cases that could not be matched. 
    - By limiting their research to items that had multiple listings, they are focusing on professional and semi-professional sellers. 
        - Thus, when interpreting these comparisons we must remember that they only apply to this subset of eBay.
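The time-window robustness check described above can be sketched as follows: pair listings of the same item by the same seller that started within a given number of days of each other, then compare sale prices between the higher and lower starting price. This is only a sketch, not the procedure Einav and colleagues used; the field names, the functions `matched_pairs` and `mean_sale_price_gap`, and the listings themselves are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def matched_pairs(listings, window_days):
    """Pair same-seller, same-title listings that start within window_days
    of each other and differ in starting price. Returns (low, high) pairs."""
    by_item = defaultdict(list)
    for listing in listings:
        by_item[(listing["seller"], listing["title"])].append(listing)

    pairs = []
    for group in by_item.values():
        for a, b in combinations(group, 2):
            close_in_time = abs(a["day"] - b["day"]) <= window_days
            if close_in_time and a["start_price"] != b["start_price"]:
                low, high = sorted((a, b), key=lambda l: l["start_price"])
                pairs.append((low, high))
    return pairs


def mean_sale_price_gap(listings, window_days):
    """Average sale-price difference (higher minus lower starting price)."""
    pairs = matched_pairs(listings, window_days)
    if not pairs:
        return None  # nothing could be matched under this window
    gaps = [high["sale_price"] - low["sale_price"] for low, high in pairs]
    return sum(gaps) / len(gaps)


# Hypothetical listings; "day" counts days since the first listing.
listings = [
    {"seller": "seller_a", "title": "driver", "day": 0, "start_price": 10, "sale_price": 100},
    {"seller": "seller_a", "title": "driver", "day": 5, "start_price": 50, "sale_price": 110},
    {"seller": "seller_a", "title": "driver", "day": 200, "start_price": 10, "sale_price": 60},
    {"seller": "seller_b", "title": "putter", "day": 3, "start_price": 20, "sale_price": 40},
]

narrow = mean_sale_price_gap(listings, window_days=7)
wide = mean_sale_price_gap(listings, window_days=365)
```

In this toy data the wide window pairs a summer listing with an off-season one, so the estimated gap grows from 10 to 30. An estimate that moves this much with the window would suggest seasonal demand is leaking into the comparison. The single `seller_b` listing is never matched, mirroring the point that results apply only to items with multiple listings.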

Matching in massive data might be better than a small number of field experiments when 
- (1) heterogeneity in effects is important
- (2) the important variables needed for matching have been measured.

**Examples of Studies that Use Matching with Big Data Sources**

| Substantive focus                                      | Big data source                         | Reference                              |
| :----------------------------------------------------- | :-------------------------------------- | :------------------------------------- |
| Effect of shootings on police violence                 | Stop-and-frisk records                  | Legewie (2016)                         |
| Effect of September 11, 2001 on families and neighbors | Voting records and donation records     | Hersh (2013)                           |
| Social contagion                                       | Communication and product adoption data | Aral, Muchnik, and Sundararajan (2009) |



## 2.5 Conclusion

Big data sources are everywhere, but using them for social research can be tricky. In my experience, there is something like a “no free lunch” rule for data: 

> if you don’t put in a lot of work collecting it, then you are probably going to have to put in a lot of work thinking about it and analyzing it.

The big data sources of today—and likely tomorrow—will tend to have 10 characteristics. 
- Three of these are generally (but not always) helpful for research: big, always-on, and nonreactive. 
- Seven are generally (but not always) problematic for research: incomplete, inaccessible, nonrepresentative, drifting, algorithmically confounded, dirty, and sensitive. 

Many of these characteristics ultimately arise because big data sources were not created for the purpose of social research.

Based on the ideas in this chapter, I think that there are three main ways that big data sources will be most valuable for social research. 
- First, they can enable researchers to decide between competing theoretical predictions.  
- Second, big data sources can enable improved measurement for policy through nowcasting. 
- Finally, big data sources can help researchers make causal estimates without running experiments. 


Each of these approaches, however, tends to require researchers to bring a lot to the data, such as 
- `the definition of a quantity that is important to estimate` 
- `two theories that make competing predictions`. 


Big data sources can help researchers who can ask interesting and important questions.

- A rebalancing in the relationship between data and theory.

![image.png](attachment:image.png)