<h1 style="font-family:Courier;font-weight:bold;font-size:40px;">Stats: Data & Models (De Veaux, Velleman & Bock)[Chapters 4-6]</h1>

<h1 style="font-family:Courier;font-weight:bold;font-size:30px;">Chapter 4: Comparing Distributions</h1>

-Sometimes, outliers can teach us a lot about the data.  Sometimes, they can indicate significant errors in data collection.

-It can be a good idea to report a dataset both with and without outliers to compare the difference

*<strong>Timeplot</strong>: A display of values against time.

-A smooth trace in a timeplot can highlight long-term patterns and help us see them through the more local variation

*<strong>Moving Average</strong>: This is typically how we smooth a timeplot.  Values are averaged around a point, over an interval called a "window".  To find the value for the next point, we move the window by one point in time and take the new average.  The size of the chosen window affects the level of 'smoothing'.
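
-A minimal sketch of the windowed average described above, using only the standard library.  The function name `moving_average`, the made-up series, and the choice to report one average per full window (rather than centering or padding the ends) are illustrative assumptions.

```python
from statistics import mean


def moving_average(values, window):
    """Average each run of `window` consecutive values, sliding one point at a time."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


# Made-up series, purely for illustration.
series = [10, 12, 11, 15, 14, 18, 17]

smooth3 = moving_average(series, 3)  # narrow window: follows local variation closely
smooth5 = moving_average(series, 5)  # wide window: smoother, but fewer points

print(smooth3)
print(smooth5)
```

-For this series, the 5-point window produces a flatter trace than the 3-point window, at the cost of fewer smoothed points.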

*<strong>Exponential Smoothing</strong>: A more sophisticated moving average function, in which more weight is given to recent past values than to older values
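
-One common form of this idea is simple exponential smoothing, sketched below under stated assumptions: the smoothing weight `alpha`, the function name, and the data are hypothetical, and the first smoothed value is simply set to the first observation.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing.

    Each smoothed value is alpha * (current observation)
    + (1 - alpha) * (previous smoothed value), so the weight on older
    observations shrinks geometrically.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [values[0]]  # assumption: start from the first observation
    for y in values[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


# Made-up series, purely for illustration.
series = [10, 20, 30]
print(exponential_smoothing(series, 0.5))
```

-A larger `alpha` tracks recent values more closely; a smaller `alpha` smooths more heavily.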

*<strong>Re-Expression or Transformation</strong>: A simple function can be used to make a skewed distribution more symmetric

-Variables that are skewed to the right often benefit from a re-expression by square roots, logs or reciprocals
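
-A small sketch of a log re-expression.  The data are made up (powers of two, so the result is exact), and comparing the mean with the median is only a rough, assumed stand-in for looking at a histogram.

```python
import math
from statistics import mean, median

# Made-up right-skewed values: a few large values pull the mean above the median.
skewed = [1, 2, 4, 8, 16, 32, 64]

# Re-express every value with the same function (here, log base 2).
logged = [math.log2(v) for v in skewed]

print(mean(skewed), median(skewed))  # mean is well above the median
print(mean(logged), median(logged))  # mean and median now agree
```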

<h1 style="font-family:Courier;font-weight:bold;font-size:30px;">Chapter 5: Standard Deviation as a Ruler & the Normal Model</h1>

-We can judge how unusual a data point is, by how many standard deviation units from the mean it is

-To standardize a value, we subtract the mean and then divide this difference by the standard deviation.  This standardized value is referred to as the <strong>Z-Score</strong>; its formula is as follows:
$$z=\frac{y - \bar y}{s}$$

-Data values below the mean have negative z-scores.  The further the data is from the mean, the more unusual it is.
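
-The z-score formula above can be sketched with the standard library.  The data are made up; `statistics.stdev` is the sample standard deviation (the $n-1$ version), which is assumed here to match $s$.

```python
from statistics import mean, stdev


def z_scores(data):
    """Standardize each value: subtract the mean, then divide by the standard deviation."""
    y_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    return [(y - y_bar) / s for y in data]


# Made-up values with mean 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(values)

for y, score in zip(values, z):
    print(y, round(score, 2))  # values below 5 get negative z-scores
```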



<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">Shifting & Scaling</h1>

-There are two steps to finding a z-score.  First, the data are shifted by subtracting the mean.  Then, they are rescaled by dividing by the standard deviation.

-When we shift the data by adding (or subtracting) a constant to each value, all measures of position (center, percentiles, min, max) will increase (or decrease) by the same constant.  This shift causes no change in the measures of spread (IQR, range, stdev).

<strong>Rescaling to Adjust the Scale</strong>:<br>
-If we wanted to change units, we would have to rescale the data.  To convert kg to lbs., we would multiply each value by about 2.2 (thus rescaling the data from kg to lbs.).

-Rescaling the data by multiplying (or dividing) every value by a constant changes both the measures of position and the measures of spread by exactly that scaling factor. (Figure 5.5, p.117 displays this nicely)

-Shifting data by a constant (addition or subtraction) does not affect the spread at all.  It only affects measures of position.

-When we subtract the mean of the data from every data value, we shift the mean to zero.  As we have seen, such a shift doesn't change the standard deviation.
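
-A short check of the shifting and rescaling rules, using made-up weights in kilograms.  The factor 2.2 is the usual approximate kg-to-lb conversion, and the 10 kg shift is an arbitrary illustrative constant.

```python
from statistics import mean, stdev

# Made-up weights in kilograms.
weights_kg = [60, 70, 80, 90]

shifted = [w - 10 for w in weights_kg]      # shift: subtract a constant
rescaled = [w * 2.2 for w in weights_kg]    # rescale: kg -> approx. lbs


def value_range(data):
    """Range = max - min, one of the measures of spread."""
    return max(data) - min(data)


# Shifting moves the center by the constant but leaves the spread alone.
print(mean(weights_kg), mean(shifted))
print(stdev(weights_kg), stdev(shifted))

# Rescaling multiplies both the center and the spread by the factor.
print(mean(rescaled), stdev(rescaled), value_range(rescaled))
```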

-Standardizing into z-scores does not change the <strong>shape</strong> of the distribution of a variable<br>
-Standardizing into z-scores changes the <strong>center</strong> by making the mean = 0<br>
-Standardizing into z-scores changes the <strong>spread</strong> by making the standard deviation 1

-How far from 0 does a z-score have to be to be interesting or unusual?  <strong>We know that 50% of the data lie between the quartiles.  For symmetric data, the standard deviation is usually a bit smaller than the IQR, and it's not uncommon for at least half of the data to have z-scores between -1 and 1</strong>

-To say more about how big we expect a z-score to be, we need to model the data's distribution.  A model will not match the physical world exactly, but will still be useful.

-<strong>The Normal Model</strong> is extremely useful and appropriate for distributions whose shapes are unimodal and roughly symmetric.  For these distributions, Normal models provide a measure of how extreme a z-score is.  <strong>There is a Normal model for every possible combination of mean and standard deviation</strong>.

-We write $N(\mu,\sigma)$ to represent a Normal model with a mean of $\mu$ and a stdev of $\sigma$.  These numbers don't come from the data; they come from the model.  Such numbers are considered <strong>parameters</strong> of the model.

-If we model data with a Normal model and standardize them using the corresponding $\mu$ and $\sigma$ we still call the value a z-score and write:

$$z = \frac{y-\mu}{\sigma}$$

-The Normal model with mean 0 and standard deviation 1 is called the <strong>Standard Normal Model</strong> or the <strong>Standard Normal Distribution</strong>.  When we use the Normal Model, we assume the distribution of the data set is Normal (<strong>Normality Assumption</strong>).

-To use the Normal Model, we use the <strong>Nearly Normal Condition</strong>:  The shape of the data's distribution is unimodal and symmetric

<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">The 68-95-99.7 Rule</h1>

-It turns out that <strong>in a Normal Model, about 68% of the values fall within 1 standard deviation of the mean, about 95% of the values fall within 2 standard deviations of the mean and about 99.7%--almost all--of the values fall within 3 standard deviations of the mean.</strong>
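
-The rule can be checked numerically with `statistics.NormalDist` from the standard library (Python 3.8+).  The helper name `fraction_within` is an illustrative assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def fraction_within(k):
    """Fraction of a Normal model lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
```

-The printed fractions round to the 68%, 95% and 99.7% quoted in the rule.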

*<strong>Inflection Point</strong>: The place where the bell shaped curve of the Normal Distribution changes from curving downward to curving back up.  This is exactly one standard deviation from the mean.

-What if we are dealing with a value that is not exactly 1, 2 or 3 standard deviations away from the mean?  We can look it up in a table of Normal percentiles.  First, the value must be converted to a z-score.
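
-In place of a printed table, `statistics.NormalDist().cdf` gives the same percentile.  The sketch below assumes hypothetical test scores modeled as $N(500, 100)$ and a score of 680; those numbers are illustrative, not taken from these notes.

```python
from statistics import NormalDist

# Assumed model parameters and observed value (hypothetical).
mu, sigma = 500, 100
score = 680

# Step 1: convert the value to a z-score.
z = (score - mu) / sigma

# Step 2: look up the fraction of the standard Normal model below that z-score.
percentile = NormalDist().cdf(z)

print(z, round(percentile, 3))
```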









In [1]:
import pandas as pd
import numpy as np
import statistics
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import Series, DataFrame
help(pd)

Help on package pandas:

NAME
    pandas

DESCRIPTION
    pandas - a powerful data analysis and manipulation library for Python
    
    See http://pandas.pydata.org/ for full documentation. Otherwise, see the
    docstrings of the various objects in the pandas namespace:
    
    Series
    DataFrame
    Panel
    Index
    DatetimeIndex
    HDFStore
    bdate_range
    date_range
    read_csv
    read_fwf
    read_table
    ols

PACKAGE CONTENTS
    _period
    _sparse
    _testing
    _version
    algos
    compat (package)
    computation (package)
    core (package)
    formats (package)
    hashtable
    index
    indexes (package)
    info
    io (package)
    json
    lib
    msgpack (package)
    parser
    rpy (package)
    sandbox (package)
    sparse (package)
    stats (package)
    tests (package)
    tools (package)
    tseries (package)
    tslib
    types (package)
    util (package)

SUBMODULES
    datetools
    offsets

DATA
    IndexSlice = <pandas.core.indexing._In

In [2]:
help(statistics)

Help on module statistics:

NAME
    statistics - Basic statistics module.

DESCRIPTION
    This module provides functions for calculating statistics of data, including
    averages, variance, and standard deviation.
    
    Calculating averages
    --------------------
    
    Function            Description
    mean                Arithmetic mean (average) of data.
    median              Median (middle value) of data.
    median_low          Low median of data.
    median_high         High median of data.
    median_grouped      Median, or 50th percentile, of grouped data.
    mode                Mode (most common value) of data.
    
    Calculate the arithmetic mean ("the average") of data:
    
    >>> mean([-1.0, 2.5, 3.25, 5.75])
    2.625
    
    
    Calculate the standard median of discrete data:
    
    >>> median([2, 3, 4, 5])
    3.5
    
    
    Calculate the median, or 50th percentile, of data grouped into class intervals
    centred on the data values provided. E.

<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">Normal Probability Plots</h1>

-There is a specialized graphical display that can help to decide whether or not the Normal Model is appropriate: the <strong>Normal Probability Plot</strong>

-If the distribution of the data is roughly Normal, the plot is roughly a diagonal straight line.  Deviations from a straight line indicate that the distribution is not Normal.
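
-A sketch of the points behind such a plot: each sorted data value is paired with the z-score a Normal model would expect at that position.  The plotting position $(i - 0.5)/n$ is one common convention and is an assumption here (software packages differ), as are the function names and data.  No drawing is done; the pairs could be passed to any plotting library.

```python
from statistics import NormalDist


def normal_scores(n):
    """Expected standard Normal z-scores for n ordered values (assumed (i - 0.5)/n positions)."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def probability_plot_points(data):
    """Pair each sorted data value with its Normal score: (normal score, data value)."""
    ordered = sorted(data)
    return list(zip(normal_scores(len(ordered)), ordered))


# Made-up measurements, purely for illustration.
sample = [4.1, 5.0, 5.2, 4.7, 6.3]
points = probability_plot_points(sample)

for score, value in points:
    print(round(score, 2), value)
```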

-Don't use a Normal Model when the distribution is not unimodal and symmetric.

*<strong>Nearly Normal Condition</strong>: Look at a picture of the data to see that it is unimodal and symmetric

-Rules: Don't use the mean and standard deviation when outliers are present.  Don't round results in the midst of a calculation.

<h1 style="font-family:Courier;font-weight:bold;font-size:30px;">Chapter 6: Scatterplots, Association and Correlation</h1>

*<strong>Scatterplot</strong>: One of the most common data displays, these plots can allow us to see patterns, trends, relationships, and occasional extraordinary values

-The first thing to look for is the <strong>direction</strong> of the association.  A pattern that runs from upper left to lower right is said to be negative.  A pattern running the other way (lower left to upper right) is said to be positive.

-The second thing to look for is a scatterplot's <strong>form</strong>.  A plot that appears as a cloud or swarm of points stretched out in a generally consistent, straight form is called linear.

-The third feature to look for is a scatterplot's <strong>strength</strong>.  At one extreme, do the points appear tightly clustered in a single stream?  At the other, do they form a vague cloud in which we can barely discern any trend?

-Finally, always look for the unexpected.  One example of such a surprise is an outlier standing away from the overall pattern of the scatterplot.

-You should also look for clusters or subgroups that stand away from the rest of the plot or that show a trend in a different direction.

<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">Roles for Variables</h1>

-The variable of interest can be described as the <strong>response variable</strong> and the other the <strong>explanatory</strong> or <strong>predictor variable</strong>.

-The roles that we choose for variables are more about how we think about them than about the variables themselves.


<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">Correlation</h1>

-Units shouldn't affect our measure of the strength of an association.  A natural way to remove the units is to standardize each variable and work with z-scores:

$$(z_x,z_y)=\left(\frac{x-\bar x}{S_x},\frac{y-\bar y}{S_y}\right)$$
 
<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">The Correlation Coefficient</h1>

$$r = \frac{\sum z_x z_y}{n - 1} $$

-We can summarize both the strength and direction of x and y associations, by summing the product of the z-scores.

-<strong>Dividing the sum by (n-1) serves two purposes</strong>:<br>
<ol>
    <li>It adjusts the strength for the number of points</li>
    <li>It makes the correlation lie between values of -1 and 1</li>
</ol>

-Formula Alternatives: 

$$r = \frac{\sum(x-\bar x)(y - \bar y)}{\sqrt{\sum(x - \bar x)^2\sum(y - \bar y)^2}}=\frac{\sum(x - \bar x)(y - \bar y)}{(n-1)S_xS_y} $$

-The z-score form tends to be the best for intuitively grasping what correlation means; the alternative forms are more convenient for computation.
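
-A sketch showing that the z-score form and the sum-of-products form of $r$ give the same number.  The function names and the data are made up for illustration; `statistics.stdev` is the sample standard deviation.

```python
from math import sqrt
from statistics import mean, stdev


def correlation_from_z(xs, ys):
    """r as the sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


def correlation_from_sums(xs, ys):
    """r from sums of deviations, without computing z-scores first."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Made-up paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_z = correlation_from_z(xs, ys)
r_sums = correlation_from_sums(xs, ys)
print(round(r_z, 4), round(r_sums, 4))
```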

<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">Assumptions and Conditions for Correlation</h1>

-Correlation measures the strength of linear association between two quantitative variables.  In order to discuss correlation, the variables must satisfy three conditions:<br>
<ol>
    <li><strong>Quantitative Variables Condition:</strong> Correlation is ONLY about quantitative variables</li>
    <li><strong>'Straight Enough Condition':</strong> The best check for the assumption that the variables are truly linearly related is to look at the scatterplot to see whether or not it looks relatively straight.</li>
    <li><strong>No Outliers Condition:</strong> Outliers can distort the correlation dramatically, making a weak association look strong or a strong association look weak</li>
</ol>

-The sign of a correlation coefficient gives the direction of the association

-Correlation is ALWAYS between -1 and +1.  A coefficient of exactly -1.0 or +1.0 would indicate that the points fall on a perfectly straight line.

-Correlation treats x and y symmetrically.  The correlation of x with y is the same as the correlation of y with x.

-Correlation has no units.  This fact can be especially appropriate when the data's units are somewhat vague to begin with.

-Correlation is not affected by changes in the center or scale of either variable.  Changing the units or baseline of either variable has no effect on the correlation coefficient.  Correlation depends only on the z-scores, and they are unaffected by changes in center or scale.

-Correlation measures the strength of the linear association between the two variables.  Variables can be strongly associated but still have a small correlation if the association isn't linear.

-Correlation is sensitive to outliers.  A single outlying value can make a small correlation large or make a large one small.
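
-Several of the properties listed above can be checked on small made-up data sets.  The helper `correlation`, the shift and scale constants, and the outlier point are illustrative assumptions.

```python
from math import sqrt
from statistics import mean


def correlation(xs, ys):
    """Pearson correlation coefficient r from sums of deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)

# Symmetry: swapping the roles of x and y gives the same r.
r_swapped = correlation(ys, xs)

# Changing center and scale (positive factor) of x leaves r unchanged.
r_rescaled = correlation([2.2 * x + 10 for x in xs], ys)

# A strong but non-linear association can still have r = 0.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = correlation(curve_x, curve_y)

# One outlying point can change r dramatically (here it even flips the sign).
r_outlier = correlation(xs + [20], ys + [0])

print(round(r, 3), round(r_swapped, 3), round(r_rescaled, 3))
print(r_curve, round(r_outlier, 3))
```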

<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">Measuring Trend: Kendall's Tau</h1>

*<strong>Kendall's Tau</strong>: A statistic designed to assess how close the relationship between two variables is to being monotone.  A monotone relationship is one that consistently increases or decreases, but not necessarily in a linear fashion.  Kendall's Tau measures monotonicity directly.  For each pair of points in a scatterplot, it records only whether the slope of a line between those two points is positive, negative or zero (if points have the same x-value, the slope between them is ignored).  <strong>Tau</strong> is the difference between the number of positive slopes and the number of negative slopes, divided by the total number of slopes between pairs.
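
-A direct sketch of the counting procedure described above.  It follows the definition in these notes (pairs with equal x-values are skipped, zero slopes count toward the total); statistical libraries may use variants that treat ties differently.  The function name and data are made up.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """(positive slopes - negative slopes) / (number of slopes between pairs)."""
    positive = negative = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # same x-value: slope undefined, pair is ignored
        total += 1
        direction = (x2 - x1) * (y2 - y1)  # same sign as the slope
        if direction > 0:
            positive += 1
        elif direction < 0:
            negative += 1
    if total == 0:
        raise ValueError("no pairs with distinct x-values")
    return (positive - negative) / total


# Monotone but clearly non-linear: every slope is positive.
print(kendall_tau([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))

# One pair out of six runs against the trend.
print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))
```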

<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">Nonparametric Association: Spearman's Rho</h1>

-One of the problems we have seen with the correlation coefficient is that it is very sensitive to violations of the Straight Enough Condition.  Both outliers and bends in the data make it impossible to interpret correlation.  <strong>Spearman's Rho</strong> ($\rho$) can deal with both of these problems.  Rho replaces the original data values with their ranks within each variable.  That is, it replaces the lowest value in x with the number 1, the next lowest with 2, and so on.

-Spearman's rho is the correlation of the two rank variables.  Because this is a correlation coefficient, it must be between -1 and 1.
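
-A sketch of the rank-then-correlate idea.  Giving tied values the average of their ranks is a common convention and is an assumption here, as are the function names and the made-up data.

```python
from math import sqrt
from statistics import mean


def ranks(values):
    """Rank values from 1 (lowest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Ordinary correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


def spearman_rho(xs, ys):
    """Correlation of the two rank variables."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data: consistently increasing, with one extreme y-value.
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 1000]

print(round(pearson(xs, ys), 3), round(spearman_rho(xs, ys), 3))
```

-Here the ranks line up perfectly, so rho is 1 even though the extreme value pulls r well below 1.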

**Both Spearman's Rho and Kendall's Tau have advantages over the correlation coefficient r.  For one, they can be used even when we know only the ranks.  Also, they measure the consistency of the trend between variables, even if the trend is not linear.  They are also not affected much by outliers.

-These are both examples of what are called nonparametric or distribution-free methods

-The Correlation Coefficient attempts to estimate a particular parameter in the Normal Model for two quantitative variables.

-Kendall's Tau and Spearman's Rho are less specific.  They measure association, but there is no parameter that they are tied to and no specific model they require.

<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">Correlation != Causation</h1>

-A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is referred to as a lurking variable.

-Regardless of the existence (or lack thereof) of a lurking variable, it is important to remember that <strong>correlation coefficients do not prove causation</strong>.

<h1 style="font-family:Courier;font-weight:bold;font-size:20px;">The Ladder of Powers</h1>

-We can raise each data value in a quantitative variable to the same power

<ul>
    <li>the 1/2 power is the same as taking the square root</li>
    <li>the -1 power is the reciprocal $y^{-1}=\frac {1}{y}$</li>
    <li>Putting these together: $y^{\frac{-1}{2}}=\frac{1}{\sqrt{y}}$
