Building changes from commit c388f12
Adeel Cheema authored and Adeel Cheema committed Jan 14, 2019
1 parent c388f12 commit 06870f7
Showing 10 changed files with 28 additions and 34 deletions.
2 changes: 1 addition & 1 deletion _build/chapters/01/1/2/statistical-techniques.md
@@ -33,6 +33,6 @@ focusing too much attention on simplistic summaries such as average values.
Computers enable a family of methods based on resampling that apply to a wide
range of different inference problems, take into account all available
information, and require few assumptions or conditions. Although these
techniques have often been reserved for graduate courses in statistics, their
techniques have often been reserved for advanced courses in statistics, their
flexibility and simplicity are a natural fit for data science applications.

2 changes: 1 addition & 1 deletion _build/chapters/01/1/intro.md
@@ -44,7 +44,7 @@ Applying this approach requires learning to program a computer, and so this
text interleaves a complete introduction to programming that assumes no prior
knowledge. Readers with programming experience will find that we cover several
topics in computation that do not appear in a typical introductory computer
science curriculum. Data science also requires careful reasoning about
science curriculum. Data science also requires careful reasoning about numerical
quantities, but this text does not assume any background in mathematics or
statistics beyond basic algebra. You will find very few equations in this text.
Instead, techniques are described to readers in the same language in which they
4 changes: 2 additions & 2 deletions _build/chapters/01/2/why-data-science.md
@@ -13,7 +13,7 @@ Why Data Science?

Most important decisions are made with only partial information and uncertain
outcomes. However, the degree of uncertainty for many decisions can be reduced
sharply by public access to large data sets and the computational tools
sharply by access to large data sets and the computational tools
required to analyze them effectively. Data-driven decision making has already
transformed a tremendous breadth of industries, including finance, advertising,
manufacturing, and real estate. At the same time, a wide range of academic
@@ -25,7 +25,7 @@ their work, their scientific endeavors, and their personal decisions. Critical
thinking has long been a hallmark of a rigorous education, but critiques are
often most effective when supported by data. A critical analysis of any aspect
of the world, may it be business or social science, involves inductive
reasoning; conclusions can rarely be proven outright, only supported by
reasoning; conclusions can rarely be proven outright, but only supported by
the available evidence. Data science provides the means to make precise,
reliable, and quantitative arguments about any set of observations. With
unprecedented access to information and computing, critical thinking about
4 changes: 2 additions & 2 deletions _build/chapters/01/3/2/Another_Kind_Of_Character.md
@@ -43,7 +43,7 @@ chars_periods_little_women = Table().with_columns([
```


Here are the data for *Huckleberry Finn*. Each row of the table corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods, in general the shorter the chapter, the fewer sentences there tend to be, and vice versa. The relation is not entirely predictable, however, as sentences are of varying lengths and can involve other punctuation such as question marks.
Here are the data for *Huckleberry Finn*. Each row of the table corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods, in general: the shorter the chapter, the fewer sentences there tend to be, and vice versa. The relation is not entirely predictable, however, as sentences are of varying lengths and can involve other punctuation such as question marks.



@@ -159,7 +159,7 @@ chars_periods_little_women



You can see that the chapters of *Little Women* are in general longer than those of *Huckleberry Finn*. Let us see if these two simple variables – the length and number of periods in each chapter – can tell us anything more about the two books. One way for us to do this is to plot both sets of data on the same axes.
You can see that the chapters of *Little Women* are in general longer than those of *Huckleberry Finn*. Let us see if these two simple variables – the length and number of periods in each chapter – can tell us anything more about the two books. One way to do this is to plot both sets of data on the same axes.

In the plot below, there is a dot for each chapter in each book. Blue dots correspond to *Huckleberry Finn* and gold dots to *Little Women*. The horizontal axis represents the number of periods and the vertical axis represents the number of characters.
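
The two per-chapter variables described above can be computed with a few lines of plain Python. This is only a minimal sketch: the `chapters` list and the helper name `chars_and_periods` are hypothetical stand-ins, and the text's own code builds its tables with the `Table` type rather than plain lists.

```python
# Hypothetical chapter texts used purely for illustration;
# they are not taken from either novel.
chapters = [
    "It was a fine day. We went out. Did we stay long?",
    "Short one.",
]


def chars_and_periods(text):
    """Return (number of characters, number of periods) for one chapter."""
    return len(text), text.count(".")


# One (length, periods) pair per chapter, mirroring one table row per chapter.
rows = [chars_and_periods(chapter) for chapter in chapters]

for length, periods in rows:
    print(length, periods)
```

Note that the first sample chapter ends with a question mark, which is not counted as a period. That is one reason the relation between chapter length and number of periods is not entirely predictable.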

7 changes: 3 additions & 4 deletions _build/chapters/01/what-is-data-science.md
@@ -8,21 +8,20 @@ next_page:
title: 'Introduction'
comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /content***"
---
What is Data Science
What is Data Science?
====================

Data Science is about drawing useful conclusions from large and diverse data
sets through exploration, prediction, and inference. Exploration involves
identifying patterns in information. Prediction involves using information
we know to make informed guesses about values we wish we knew. Inference
involves quantifying our degree of certainty: will those patterns we found
also appear in new observations? How accurate are our predictions? Our primary
involves quantifying our degree of certainty: will the patterns that we found in our data also appear in new observations? How accurate are our predictions? Our primary
tools for exploration are visualizations and descriptive statistics, for
prediction are machine learning and optimization, and for inference are
statistical tests and models.

Statistics is a central component of data science because statistics
studies how to make robust conclusions with incomplete information. Computing
studies how to make robust conclusions based on incomplete information. Computing
is a central component because programming allows us to apply analysis
techniques to the large and diverse data sets that arise in real-world
applications: not just numbers, but text, images, videos, and sensor readings.
@@ -11,7 +11,7 @@ comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /con
Observation and Visualization: John Snow and the Broad Street Pump
------------------------------------------------------------------

One of the earliest examples of astute observation eventually leading to the
One of the most powerful examples of astute observation eventually leading to the
establishment of causality dates back more than 150 years. To get your mind into
the right timeframe, try to imagine London in the 1850’s. It was the world’s
wealthiest city but many of its people were desperately poor. Charles Dickens,
@@ -35,7 +35,7 @@ they were breathing the same air—and miasmas—as their neighbors, there was n
compelling association between bad smells and the incidence of cholera.

Snow had also noticed that the onset of the disease almost always involved
vomiting and diarrhea. He therefore believed that that infection was carried by
vomiting and diarrhea. He therefore believed that the infection was carried by
something people ate or drank, not by the air that they breathed. His prime
suspect was water contaminated by sewage.

@@ -44,9 +44,9 @@ London. As the deaths mounted, Snow recorded them diligently, using a method
that went on to become standard in the study of how diseases spread: *he drew a
map*. On a street map of the district, he recorded the location of each death.

Here is Snow’s original map. Each black bar represents one death. The black
Here is Snow’s original map. Each black bar represents one death. When there are multiple deaths at the same address, the bars corresponding to those deaths are stacked on top of each other. The black
discs mark the locations of water pumps. The map displays a striking
revelation: the deaths are roughly clustered around the Broad Street pump.
revelation: the deaths are roughly clustered around the Broad Street pump.
![Snow’s Cholera Map](../../../images/snow_map.jpg)

Snow studied his map carefully and investigated the apparent anomalies. All of
2 changes: 1 addition & 1 deletion _build/chapters/02/2/snow-s-grand-experiment.md
@@ -12,7 +12,7 @@ Snow’s “Grand Experiment”
-------------------------

Encouraged by what he had learned in Soho, Snow completed a more thorough
analysis of cholera deaths. For some time, he had been gathering data on cholera
analysis. For some time, he had been gathering data on cholera
deaths in an area of London that was served by two water companies. The Lambeth
water company drew its water upriver from where sewage was discharged into the
River Thames. Its water was relatively clean. But the Southwark and Vauxhall
14 changes: 6 additions & 8 deletions _build/chapters/02/3/establishing-causality.md
@@ -18,14 +18,13 @@ two groups were comparable to each other, apart from the treatment.

In order to establish whether it was the water supply that was causing cholera,
Snow had to compare two groups that were similar to each other in all but one
aspect: their water supply. Only then would he be able to ascribe the differences
aspect: their water supply. Only then would he be able to ascribe the differences
in their outcomes to the water supply. If the two groups had been different in
some other way as well, it would have been difficult to point the finger at the
water supply as the source of the disease. For example, if the treatment group
consisted of factory workers and the control group did not, then differences
between the outcomes in the two groups could have been due to the water supply,
or to factory work, or both, or to any other characteristic that made the groups
different from each other. The final picture would have been much more fuzzy.
or to factory work, or both. The final picture would have been much more fuzzy.

Snow’s brilliance lay in identifying two groups that would make his comparison
clear. He had set out to establish a causal relation between contaminated water
@@ -48,8 +47,8 @@ diseases.
Let us now return to more modern times, armed with an important lesson that we
have learned along the way:

In an observational study, if the treatment and control groups differ in ways
other than the treatment, it is difficult to make conclusions about causality.
**In an observational study, if the treatment and control groups differ in ways
other than the treatment, it is difficult to make conclusions about causality.**

An underlying difference between the two groups (other than the treatment) is
called a *confounding factor*, because it might confound you (that is, mess you
@@ -58,10 +57,9 @@ up) when you try to reach a conclusion.
**Example: Coffee and lung cancer.** Studies in the 1960’s showed that coffee
drinkers had higher rates of lung cancer than those who did not drink coffee.
Because of this, some people identified coffee as a cause of lung cancer. But
coffee does not cause lung cancer. The analysis contained a confounding factor –
smoking. In those days, coffee drinkers were also likely to have been smokers,
coffee does not cause lung cancer. The analysis contained a confounding factor—smoking. In those days, coffee drinkers were also likely to have been smokers,
and smoking does cause lung cancer. Coffee drinking was associated with lung
cancer, but it did not cause the disease.

Confounding factors are common in observational studies. Good studies take great
care to reduce confounding.
care to reduce confounding and to account for its effects.
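
The coffee example can be made concrete with a small table of counts. The sketch below is an illustration under stated assumptions: every number in `counts` is hypothetical and chosen only to show the pattern, not taken from the 1960’s studies, and `cancer_rate` is a helper name introduced here.

```python
# Hypothetical counts, invented for illustration only.
# Key: (is_smoker, drinks_coffee) -> (number of people, number with lung cancer)
counts = {
    (True, True): (800, 80),
    (True, False): (200, 20),
    (False, True): (200, 2),
    (False, False): (800, 8),
}


def cancer_rate(groups):
    """Lung cancer rate among the people in the given (smoker, coffee) groups."""
    people = sum(counts[g][0] for g in groups)
    cases = sum(counts[g][1] for g in groups)
    return cases / people


# Ignoring smoking, coffee drinkers appear to have a higher rate.
coffee_rate = cancer_rate([(True, True), (False, True)])
no_coffee_rate = cancer_rate([(True, False), (False, False)])

# Comparing within smokers and within non-smokers, the rates match.
smoker_coffee = cancer_rate([(True, True)])
smoker_no_coffee = cancer_rate([(True, False)])
nonsmoker_coffee = cancer_rate([(False, True)])
nonsmoker_no_coffee = cancer_rate([(False, False)])

print("overall:", coffee_rate, no_coffee_rate)
print("smokers:", smoker_coffee, smoker_no_coffee)
print("non-smokers:", nonsmoker_coffee, nonsmoker_no_coffee)
```

In these made-up counts, the overall gap between coffee drinkers and others comes entirely from the fact that most coffee drinkers are smokers. Once smokers are compared only with smokers, and non-smokers only with non-smokers, coffee makes no difference.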
16 changes: 7 additions & 9 deletions _build/chapters/02/5/endnote.md
@@ -11,7 +11,7 @@ comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /con
Endnote
-------

In the terminology of that we have developed, John Snow conducted an
In the terminology that we have developed, John Snow conducted an
observational study, not a randomized experiment. But he called his study a
“grand experiment” because, as he wrote, “No fewer than three hundred thousand
people … were divided into two groups without their choice, and in most cases,
@@ -26,7 +26,7 @@ quite a bit more complex. But every method of randomization consists of a
sequence of carefully defined steps that allow chances to be specified
mathematically. This has two important consequences.

1. It allows us to account, mathematically, for the possibility that randomization
1. It allows us to account, mathematically, for the possibility that randomization
produces treatment and control groups that are quite different from each
other.

@@ -37,10 +37,9 @@ mathematically. This has two important consequences.

In this course, you will learn how to conduct and analyze your own randomized
experiments. That will involve more detail than has been presented in this
section. For now, just focus on the main idea: to try to establish causality,
chapter. For now, just focus on the main idea: to try to establish causality,
run a randomized controlled experiment if possible. If you are conducting an
observational study, you might be able to establish association but not
causation. Be extremely careful about confounding factors before making
observational study, you might be able to establish association but it will be harder to establish causation. Be extremely careful about confounding factors before making
conclusions about causality based on an observational study.
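
One simple way to randomize is to shuffle the individuals and split the shuffled list in half. The sketch below assumes that scheme; the function name `randomize` and its `seed` parameter are hypothetical, and real experiments often use more elaborate designs.

```python
import random


def randomize(individuals, seed=None):
    """Randomly split individuals into a treatment group and a control group.

    Every ordering of the shuffled list is equally likely, so each individual
    has a well-defined chance of landing in either group.
    """
    rng = random.Random(seed)      # a seed makes the assignment repeatable
    shuffled = list(individuals)   # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


treatment, control = randomize(range(10), seed=0)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because the steps are fully specified, the chance of any particular split can be worked out mathematically, which is what makes it possible to account for groups that differ by chance alone.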

**Terminology**
@@ -76,8 +75,8 @@ conclusions about causality based on an observational study.
proof such as would be admitted in any scientific enquiry that there is any
such thing as contagion.”

3. A later RCT established that the conditions on which PROGRESA insisted
(children going to school, preventive health care) were not necessary to
3. A later RCT established that the conditions on which PROGRESA insisted (children
going to school, preventive health care) were not necessary to
achieve increased enrollment. Just the financial boost of the welfare
payments was sufficient.

@@ -90,7 +89,6 @@ published by our own University of California Press, reads like a whodunit. It
was one of the main sources for this section's account of John Snow and his
work. A word of warning: some of the contents of the book are stomach-churning.

[*Poor Economics*](http://www.pooreconomics.com), the best seller by Abhijit V.
Banerjee and Esther Duflo of MIT, is an accessible and lively account of ways to
[*Poor Economics*](http://www.pooreconomics.com), the best seller by Abhijit Banerjee and Esther Duflo of MIT, is an accessible and lively account of ways to
fight global poverty. It includes numerous examples of RCTs, including the
PROGRESA example in this section.
3 changes: 1 addition & 2 deletions _build/chapters/02/causality-and-experiments.md
@@ -32,8 +32,7 @@ group of individuals, a factor of interest called a *treatment*, and an

It is easiest to think of the individuals as people. In a study of whether
chocolate is good for the health, the individuals would indeed be people, the
treatment would be eating chocolate, and the outcome might be a measure of blood
pressure. But individuals in observational studies need not be people. In a
treatment would be eating chocolate, and the outcome might be a measure of heart disease. But individuals in observational studies need not be people. In a
study of whether the death penalty has a deterrent effect, the individuals could
be the 50 states of the union. A state law allowing the death penalty would be
the treatment, and an outcome could be the state’s murder rate.
