# 6.0002 Lecture 15: Statistical Sins and Wrap Up

**Speaker:** Prof. John Guttag

## Global Warming: fact or fiction
- recall: beware of charts where the y-axis doesn't start at zero (one of the "statistical sins")
    - scale is important
- a plot of temperature fluctuations looks very different when the y-axis is zoomed in on a few degrees around 57 than when it runs from 0 to 110 degrees

## Fever and the flu
- oral temperature does change over time when someone gets the flu, but zooming out from the small range of values near 99 degrees will make it look flat over time
    - no human has a temperature of 0 degrees, so extending the axis that far adds no information
- moral: truncate the y-axis to eliminate preposterous values
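
The effect of the axis range can be put into numbers. The sketch below is only an illustration: the temperature readings are made up, and `fraction_of_axis_used` is a hypothetical helper that reports how much of the plot height the data actually occupies for a given choice of y-limits.

```python
def fraction_of_axis_used(values, y_min, y_max):
    """Return the share of the y-axis height spanned by the data's range."""
    return (max(values) - min(values)) / (y_max - y_min)

# Hypothetical oral temperatures (degrees F) over a week of flu; not real data.
temps = [98.6, 99.4, 101.2, 102.0, 100.8, 99.5, 98.7]

wide = fraction_of_axis_used(temps, 0, 110)    # axis that starts at zero
tight = fraction_of_axis_used(temps, 97, 103)  # axis truncated to plausible values

print(f"0 to 110 axis: data spans {wide:.1%} of the plot height")
print(f"97 to 103 axis: data spans {tight:.1%} of the plot height")
```

On the wide axis the whole fever occupies only a few percent of the plot height, which is why the curve looks flat.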

## The myth of global warming
- we believe that climate change is something that happens over very long periods of time, so it would be silly to look at climate data over a very short period of time
    - whereas for something like a person's heart rate, a short interval is the appropriate one
- moral: don't confuse fluctuations with trends
    - in any time series data, there will always be fluctuations
- choose an interval consistent with the phenomenon being considered
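
The difference between a fluctuation and a trend can be sketched numerically. The series below is synthetic (a slow upward drift plus a regular oscillation, both chosen purely for illustration), and `least_squares_slope` is a hypothetical helper, not something from the lecture. Fitting a line to a short window picks up the oscillation, while fitting the whole series recovers something close to the drift.

```python
import math

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic series (an assumption for illustration): drift of 0.01 per step
# plus an oscillation with period 20 and amplitude 1.
times = list(range(200))
series = [0.01 * t + math.sin(2 * math.pi * t / 20) for t in times]

short_slope = least_squares_slope(times[5:16], series[5:16])  # one downswing
long_slope = least_squares_slope(times, series)               # ten full cycles

print(f"slope over steps 5 to 15: {short_slope:+.3f}")
print(f"slope over all 200 steps: {long_slope:+.3f}")
```

The short window reports a decline even though the underlying drift is upward.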

## But at least the Arctic ice isn't melting(?)
- One can choose the right day in 1989 and the right day in 2013 to show that there is more ice in 2013 than 1989
    - statistical sin!
    - only choosing two points
- this again relies on a statistical fluctuation as opposed to a trend
- moral: avoid cherrypicking data
    - i.e. choosing only data points that support what you believe
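
A minimal sketch of how two hand-picked points can contradict the overall picture. The "ice extent" series is synthetic (a steady decline plus an oscillation, invented for illustration and not real measurements).

```python
import math

# Synthetic series (illustrative only): decline of 0.1 per step
# plus an oscillation of amplitude 5 with period 10.
steps = list(range(60))
extent = [100 - 0.1 * t + 5 * math.sin(2 * math.pi * t / 10) for t in steps]

# Cherry-pick: the lowest value among the first ten steps
# and the highest value among the last ten steps.
early = min(range(0, 10), key=lambda t: extent[t])
late = max(range(50, 60), key=lambda t: extent[t])
print(f"step {early}: {extent[early]:.1f}, step {late}: {extent[late]:.1f}")
print("late value higher than early value:", extent[late] > extent[early])

# Using all the points instead: compare the averages of the same two windows.
first_avg = sum(extent[:10]) / 10
last_avg = sum(extent[50:]) / 10
print(f"mean of first 10 steps: {first_avg:.1f}, mean of last 10 steps: {last_avg:.1f}")
```

The two chosen points suggest growth, while the window averages show the decline that was built into the series.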

## A comforting statistic
- 99.8% of the firearms in the US will not be used to commit a violent crime in any given year
- How many privately owned firearms in the US?
    - ~300,000,000
    - 300,000,000 * 0.002 = 600,000 firearms used to commit a violent crime in a given year
- so it is not very meaningful to say that *most* guns are not used to commit a violent crime
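
The arithmetic behind this can be checked in a few lines. The figures are the approximate ones quoted above, and the variable names are only illustrative.

```python
firearms = 300_000_000     # approximate number of privately owned firearms
share_not_used = 0.998     # "99.8% ... will not be used"

# The complement of a reassuring percentage, applied to a large base.
used = round(firearms * (1 - share_not_used))
print(f"{used:,} firearms used in a violent crime in a given year")
```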

## A not so comforting statistic
- "Mexican health officials suspect that the swine flu outbreak has caused more than 159 deaths and roughly 2,500 illnesses." CNN, April 29, 2009
- How many deaths per year from seasonal flu in the U.S.?
- about 36,000
- moral: context matters!
    - a number without context doesn't mean anything

## Relative to what?
- skipping lectures increases your probability of failing 6.0002 by 50%
- from 0.5 to 0.75
    - large change
- from 0.005 to 0.0075
    - probably don't care
- moral: beware of percentage change when you don't know the denominator
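
A short sketch of why the baseline matters: the same 50% relative increase is applied to the two baselines from the notes. `apply_relative_increase` is a hypothetical helper name.

```python
def apply_relative_increase(baseline, relative_increase):
    """Return the new probability after a relative increase (0.5 means +50%)."""
    return baseline * (1 + relative_increase)

for baseline in (0.5, 0.005):
    new = apply_relative_increase(baseline, 0.5)
    print(f"{baseline:.4f} -> {new:.4f} (absolute change {new - baseline:.4f})")
```

The relative change is identical in both cases, but the absolute change differs by a factor of 100.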

## Cancer clusters
- a **cancer cluster** is defined by the CDC as "a greater-than-expected number of cancer cases that occurs within a group of people in a geographic area over a period of time"
- about 1000 "cancer clusters" per year are reported to health authorities in the US
- vast majority are deemed not significant

## A hypothetical example
- Massachusetts is about 10,000 square miles
- About 36,000 new cancer cases per year
- attorney partitioned state into 1000 regions of 10 square miles each, and looked at distribution of cases
    - expected number of cases per region: 36 per year
- discovered that region 111 had 143 new cancer cases over a 3 year period!
    - more than 32% greater than the 108 cases expected over 3 years
- how worried should residents be?
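
The expected count and the size of the excess follow directly from the numbers in the example. A small sketch, with illustrative variable names:

```python
cases_per_year = 36_000
regions = 1_000
years = 3
observed = 143

# 36 cases per region per year, so 108 per region over 3 years
expected = cases_per_year / regions * years
excess = observed / expected - 1
print(f"expected {expected:.0f} cases, observed {observed}, excess {excess:.1%}")
```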

## How likely is it just bad luck?

```python
import random

# Parameters of the hypothetical example
numCasesPerYear = 36000
numYears = 3
stateSize = 10000       # square miles
communitySize = 10      # square miles per region
numCommunities = stateSize // communitySize

numTrials = 100
numGreater = 0
for t in range(numTrials):
    # Scatter every case uniformly at random over the regions
    locs = [0]*numCommunities
    for i in range(numYears*numCasesPerYear):
        locs[random.randrange(numCommunities)] += 1
    # Count the trials in which the pre-chosen region reaches 143 cases
    if locs[111] >= 143:
        numGreater += 1
prob = round(numGreater/numTrials, 4)
print('Est. probability of region 111 having at least 143 cases =', prob)
```

```
Est. probability of region 111 having at least 143 cases = 0.0
```


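- seems unlikely (with only 100 trials, an estimate of 0.0 just means the probability is too small to show up)
- but wait... region 111 was singled out only after looking at the data; why should it matter more than any other region having at least 143 cases?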

```python
# look at any region (the right thing to do)
# reuses random and the parameters defined above
anyRegion = 0
for trial in range(numTrials):
    locs = [0]*numCommunities
    for i in range(numYears*numCasesPerYear):
        locs[random.randrange(numCommunities)] += 1
    # Count the trials in which at least one region reaches 143 cases
    if max(locs) >= 143:
        anyRegion += 1
print(anyRegion)
aProb = round(anyRegion/numTrials, 4)
print('Est. probability of some region having at least 143 cases =', aProb)
```

```
58
Est. probability of some region having at least 143 cases = 0.58
```


- a variant of cherry picking called **multiple hypothesis testing**
- the attorney didn't test one hypothesis (region 111 is bad); instead, 1000 hypotheses were examined and the one that gave the desired result was chosen
- a.k.a. Texas Sharpshooter Fallacy
    - see bunch of bulletholes near a target on the side of a barn
    - but what actually happened: farmer shot at random at wall of barn, then painted target over the bulletholes
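
The gap between "this particular region" and "some region" can also be estimated without simulation. The sketch below is not from the lecture. It assumes each of the 108,000 cases lands in a region uniformly at random, so the count in one pre-chosen region is binomial, and it then treats the 1000 regions as independent tests, which is only an approximation because the counts must sum to the total. `log_binomial_pmf` is a hypothetical helper that works in log space to avoid floating-point underflow.

```python
import math

def log_binomial_pmf(k, n, p):
    """Natural log of the binomial probability of exactly k successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

n_cases = 3 * 36_000       # cases over 3 years
n_regions = 1_000
p_region = 1 / n_regions   # each case equally likely to land in any region
threshold = 143

# P(one pre-chosen region has at least 143 cases). Terms far above the mean
# underflow to 0.0, which is harmless for the sum.
p_single = sum(math.exp(log_binomial_pmf(k, n_cases, p_region))
               for k in range(threshold, n_cases + 1))

# Approximation: treat the 1000 regions as independent tests.
p_any = 1 - (1 - p_single) ** n_regions

print(f"one pre-chosen region: {p_single:.5f}")
print(f"at least one of {n_regions} regions: {p_any:.2f}")
```

A per-region probability well below 1% still becomes a sizeable probability once 1000 regions are examined, which is the point of the multiple hypothesis testing warning.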

## The bottom line
- when drawing inference from data, skepticism is merited
- but remember, skepticism and denial are different
- "Doubt, indulged and cherished, is in danger of becoming denial, but if honest, and bent on thorough investigation, it may soon lead to full establishment of the truth." -- Ambrose Bierce

## 6.0002 major topics
- optimization problems
- stochastic thinking
- modeling aspects of the world
- becoming a better programmer
    - exposure to a few extra features of Python and some useful libraries
    - practice, practice, practice

## Optimization problems
- many problems can be formulated in terms of
    - objective function
    - set of constraints
- greedy algorithms often useful
    - but may not find optimal solution
- many optimization problems inherently exponential
    - but dynamic programming often works
    - and memoization is a generally useful technique (trading space for time by storing results in a lookup table)
- examples: knapsack problems, graph problems, curve fitting, clustering
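
A minimal sketch of the greedy versus optimal contrast on a 0/1 knapsack problem. The items and capacity are hypothetical, and the function names are illustrative rather than the course's own code. The greedy rule takes items by value-to-weight ratio, and the optimal solver uses memoized recursion via `functools.lru_cache`.

```python
from functools import lru_cache

# Hypothetical items as (value, weight) pairs; not from the lecture.
ITEMS = [(60, 10), (100, 20), (120, 30)]
CAPACITY = 50

def greedy_by_density(items, capacity):
    """Take items in order of value/weight ratio while they still fit."""
    total = 0
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if weight <= capacity:
            total += value
            capacity -= weight
    return total

def best_value(items, capacity):
    """Optimal 0/1 knapsack value via memoized recursion."""
    @lru_cache(maxsize=None)
    def solve(i, remaining):
        if i == len(items):
            return 0
        value, weight = items[i]
        skip = solve(i + 1, remaining)
        if weight > remaining:
            return skip
        return max(skip, value + solve(i + 1, remaining - weight))
    return solve(0, capacity)

print("greedy:", greedy_by_density(ITEMS, CAPACITY))   # 160
print("optimal:", best_value(ITEMS, CAPACITY))         # 220
```

Greedy commits to the two densest items and then has no room for the third, while the memoized search finds that the two heavier items together are worth more.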

## Stochastic thinking
- the world is (predictably) non-deterministic
- thinking in terms of probabilities is often useful
- randomness is a powerful tool for building computations that model the world
- random computations useful even for problems that do not involve randomness
    - e.g. integration
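
As a sketch of the integration example, the code below estimates the integral of x² over [0, 1] (exact value 1/3) by averaging the function at random points. The integrand, sample count, and seed are arbitrary choices made for illustration.

```python
import random

def monte_carlo_integral(f, a, b, num_samples, rng):
    """Estimate the integral of f over [a, b] by averaging f at random points."""
    total = sum(f(rng.uniform(a, b)) for _ in range(num_samples))
    return (b - a) * total / num_samples

rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 100_000, rng)
print(f"estimate = {estimate:.4f}, exact = {1/3:.4f}")
```

Nothing about the integral is random, yet random sampling gives a usable estimate whose accuracy improves as the number of samples grows.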

## Modeling the world
- models always inaccurate
    - provide abstractions of reality
- deterministic models, e.g. graph theoretic
- statistical models
    - simulation models: Monte Carlo simulation
    - models based on sampling
        - characterizing accuracy is critical
            - central limit theorem
            - empirical rule
        - machine learning
            - unsupervised and supervised
- presentation of data
    - plotting
    - good and bad practices
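
The role of the central limit theorem and the empirical rule in characterizing accuracy can be sketched with a small simulation. The population (uniform on [0, 1]), sample size, and number of samples are assumptions chosen for illustration. The sketch checks what fraction of sample means fall within 1.96 standard errors of the population mean.

```python
import random

rng = random.Random(0)   # fixed seed for repeatability
sample_size = 50
num_samples = 2000

# Population: uniform on [0, 1], with mean 0.5 and standard deviation sqrt(1/12).
pop_mean = 0.5
pop_sd = (1 / 12) ** 0.5
std_error = pop_sd / sample_size ** 0.5

# Draw many samples and record the mean of each one.
sample_means = [
    sum(rng.random() for _ in range(sample_size)) / sample_size
    for _ in range(num_samples)
]

# Fraction of sample means within 1.96 standard errors of the population mean.
within = sum(abs(m - pop_mean) <= 1.96 * std_error for m in sample_means) / num_samples
print(f"fraction within 1.96 standard errors: {within:.3f}")
```

The fraction should come out close to 0.95 even though the population itself is uniform rather than normal.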

## What's next for you
- other CS courses you are prepared to take
    - 6.009, 6.005, 6.006, 6.034, 6.036