# <center> Basic Statistics </center>

## <center> Measures of Central Tendency, Variability, and Distribution Shape for continuous variables </center>
### <center> Motor Trend Car Road Tests </center>

# The data

* **mpg**	Miles/(US) gallon
* **cyl**	Number of cylinders
* **disp**	Displacement (cu.in.)
* **hp**	Gross horsepower
* **drat**	Rear axle ratio
* **wt**	Weight (1000 lbs)
* **qsec**	1/4 mile time
* **vs**	Engine (0 = V-shaped, 1 = straight)
* **am**	Transmission (0 = automatic, 1 = manual)
* **gear**	Number of forward gears
* **carb**	Number of carburetors

In [4]:
myvars <- c("mpg", "hp", "wt")
head(mtcars[myvars])

| | mpg `<dbl>` | hp `<dbl>` | wt `<dbl>` |
|---|---|---|---|
| Mazda RX4 | 21.0 | 110 | 2.62 |
| Mazda RX4 Wag | 21.0 | 110 | 2.875 |
| Datsun 710 | 22.8 | 93 | 2.32 |
| Hornet 4 Drive | 21.4 | 110 | 3.215 |
| Hornet Sportabout | 18.7 | 175 | 3.44 |
| Valiant | 18.1 | 105 | 3.46 |


# Descriptive Statistics

In [6]:
summary(mtcars[myvars])

      mpg              hp              wt       
 Min.   :10.40   Min.   : 52.0   Min.   :1.513  
 1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
 Median :19.20   Median :123.0   Median :3.325  
 Mean   :20.09   Mean   :146.7   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
 Max.   :33.90   Max.   :335.0   Max.   :5.424  

# Defining your own Statistics with ```sapply()```

In [3]:
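# Create a function that returns n, mean, sd, skewness, and excess kurtosis
mystats <- function(x, na.omit=FALSE){
  if (na.omit)
    x <- x[!is.na(x)]              # optionally drop missing values
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x-m)^3/s^3)/n
  kurt <- sum((x-m)^4/s^4)/n - 3   # subtract 3 so a normal distribution scores 0
  return(c(n=n, mean=m, stdev=s, skew=skew, kurtosis=kurt))
}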

**Skewness** is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Positive values indicate a longer right tail; negative values indicate a longer left tail.

![](skew.png)

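**Kurtosis** is a statistical measure of how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values. The ```mystats()``` function above subtracts 3, so it reports excess kurtosis: 0 for a normal distribution, positive for heavier tails, and negative for lighter tails.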

![](kurt.png)
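Written out, the two shape statistics computed by ```mystats()``` are as follows. The symbols are chosen here for illustration and mirror the variable names in the code: $m$ is the sample mean, $s$ is the sample standard deviation returned by ```sd()```, and $n$ is the number of observations.

$$
\text{skew} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - m}{s}\right)^{3},
\qquad
\text{kurtosis} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - m}{s}\right)^{4} - 3
$$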

In [4]:
# Apply the function
sapply(mtcars[myvars], mystats)

| | mpg | hp | wt |
|---|---|---|---|
| n | 32.0 | 32.0 | 32.0 |
| mean | 20.090625 | 146.6875 | 3.21725 |
| stdev | 6.026948 | 68.5628685 | 0.97845744 |
| skew | 0.610655 | 0.7260237 | 0.42314646 |
| kurtosis | -0.372766 | -0.1355511 | -0.02271075 |


# Methods Coming from User-Contributed Libraries

In [5]:
install.packages('Hmisc')

also installing the dependencies 'deldir', 'RcppEigen', 'png', 'jpeg', 'interp', 'checkmate', 'Formula', 'latticeExtra', 'gridExtra', 'htmlTable', 'viridis'




package 'deldir' successfully unpacked and MD5 sums checked
package 'RcppEigen' successfully unpacked and MD5 sums checked
package 'png' successfully unpacked and MD5 sums checked
package 'jpeg' successfully unpacked and MD5 sums checked
package 'interp' successfully unpacked and MD5 sums checked
package 'checkmate' successfully unpacked and MD5 sums checked
package 'Formula' successfully unpacked and MD5 sums checked
package 'latticeExtra' successfully unpacked and MD5 sums checked
package 'gridExtra' successfully unpacked and MD5 sums checked
package 'htmlTable' successfully unpacked and MD5 sums checked
package 'viridis' successfully unpacked and MD5 sums checked
package 'Hmisc' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Nemo\AppData\Local\Temp\Rtmp8gI1y5\downloaded_packages


In [6]:
library(Hmisc)

Loading required package: lattice

Loading required package: survival

Loading required package: Formula

Loading required package: ggplot2


Attaching package: 'Hmisc'


The following objects are masked from 'package:base':

    format.pval, units




In [7]:
describe(mtcars[myvars]) # Gmd is Gini's mean difference: the mean absolute difference between any two distinct observations

mtcars[myvars] 

 3  Variables      32  Observations
--------------------------------------------------------------------------------
mpg 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32        0       25    0.999    20.09    6.796    12.00    14.34 
     .25      .50      .75      .90      .95 
   15.43    19.20    22.80    30.09    31.30 

lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
--------------------------------------------------------------------------------
hp 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32        0       22    0.997    146.7    77.04    63.65    66.00 
     .25      .50      .75      .90      .95 
   96.50   123.00   180.00   243.50   253.55 

lowest :  52  62  65  66  91, highest: 215 230 245 264 335
--------------------------------------------------------------------------------
wt 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32    
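The Gmd column can be checked by hand. The following is a minimal sketch that uses only base R; the variable names `x` and `gmd` are chosen here for illustration. It relies on the fact that ```dist()``` applied to a single numeric vector returns the absolute difference for every pair of observations.

```r
# Gini's mean difference for mpg, computed without Hmisc.
# dist() on one numeric vector yields |x_i - x_j| for every pair i < j,
# so the mean of those values is the mean absolute pairwise difference.
x <- mtcars$mpg
gmd <- mean(as.vector(dist(x)))
round(gmd, 3)   # compare with the Gmd value reported by describe() for mpg
```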

In [8]:
install.packages('pastecs')

package 'pastecs' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Nemo\AppData\Local\Temp\Rtmp8gI1y5\downloaded_packages


In [9]:
library(pastecs)

In [10]:
stat.desc(mtcars[myvars])

| | mpg `<dbl>` | hp `<dbl>` | wt `<dbl>` |
|---|---|---|---|
| nbr.val | 32.0 | 32.0 | 32.0 |
| nbr.null | 0.0 | 0.0 | 0.0 |
| nbr.na | 0.0 | 0.0 | 0.0 |
| min | 10.4 | 52.0 | 1.513 |
| max | 33.9 | 335.0 | 5.424 |
| range | 23.5 | 283.0 | 3.911 |
| sum | 642.9 | 4694.0 | 102.952 |
| median | 19.2 | 123.0 | 3.325 |
| mean | 20.090625 | 146.6875 | 3.21725 |
| SE.mean | 1.065424 | 12.1203173 | 0.1729685 |


In [11]:
install.packages('psych')

also installing the dependency 'mnormt'




package 'mnormt' successfully unpacked and MD5 sums checked
package 'psych' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Nemo\AppData\Local\Temp\Rtmp8gI1y5\downloaded_packages


In [12]:
library(psych)


Attaching package: 'psych'


The following object is masked from 'package:Hmisc':

    describe


The following objects are masked from 'package:ggplot2':

    %+%, alpha




In [13]:
describe(mtcars[myvars])

| | vars `<int>` | n `<dbl>` | mean `<dbl>` | sd `<dbl>` | median `<dbl>` | trimmed `<dbl>` | mad `<dbl>` | min `<dbl>` | max `<dbl>` | range `<dbl>` | skew `<dbl>` | kurtosis `<dbl>` | se `<dbl>` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 1 | 32 | 20.09062 | 6.0269481 | 19.2 | 19.696154 | 5.41149 | 10.4 | 33.9 | 23.5 | 0.610655 | -0.37276603 | 1.065424 |
| hp | 2 | 32 | 146.6875 | 68.5628685 | 123.0 | 141.192308 | 77.0952 | 52.0 | 335.0 | 283.0 | 0.7260237 | -0.13555112 | 12.1203173 |
| wt | 3 | 32 | 3.21725 | 0.9784574 | 3.325 | 3.152692 | 0.7672455 | 1.513 | 5.424 | 3.911 | 0.4231465 | -0.02271075 | 0.1729685 |


# Descriptive Statistics by Group ```aggregate()```

In [14]:
myvars <- c("mpg", "hp", "wt")

In [15]:
aggregate(mtcars[myvars], by=list(am=mtcars$am), mean)

| am `<dbl>` | mpg `<dbl>` | hp `<dbl>` | wt `<dbl>` |
|---|---|---|---|
| 0 | 17.14737 | 160.2632 | 3.768895 |
| 1 | 24.39231 | 126.8462 | 2.411 |


In [16]:
aggregate(mtcars[myvars], by=list(am=mtcars$am), sd)

| am `<dbl>` | mpg `<dbl>` | hp `<dbl>` | wt `<dbl>` |
|---|---|---|---|
| 0 | 3.833966 | 53.9082 | 0.7774001 |
| 1 | 6.166504 | 84.06232 | 0.6169816 |
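The same group summaries can also be requested through the formula interface of ```aggregate()```. This is a small sketch rather than a replacement for the calls above; the object name `group_means` is introduced here for illustration.

```r
# Group means with the formula interface to aggregate().
# cbind() on the left lists the variables to summarize;
# am on the right is the grouping variable.
group_means <- aggregate(cbind(mpg, hp, wt) ~ am, data = mtcars, FUN = mean)
group_means

# FUN may return a named vector, giving several statistics in one call.
aggregate(mpg ~ am, data = mtcars,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
```

The formula form keeps the column names of the grouping variable without wrapping it in ```list()```.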


# Descriptive Statistics by Group ```by()```

```by(data, INDICES, FUN)```

Recall the ```mystats()``` function defined earlier:
```r
mystats <- function(x, na.omit=FALSE){
  if (na.omit)
    x <- x[!is.na(x)]
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x-m)^3/s^3)/n
  kurt <- sum((x-m)^4/s^4)/n - 3
  return(c(n=n, mean=m, stdev=s, skew=skew, kurtosis=kurt))
}
```

In [17]:
dstats <- function(x)sapply(x, mystats)

In [18]:
myvars <- c("mpg", "hp", "wt")

In [19]:
by(mtcars[myvars], mtcars$am, dstats)

mtcars$am: 0
                 mpg           hp         wt
n        19.00000000  19.00000000 19.0000000
mean     17.14736842 160.26315789  3.7688947
stdev     3.83396639  53.90819573  0.7774001
skew      0.01395038  -0.01422519  0.9759294
kurtosis -0.80317826  -1.20969733  0.1415676
------------------------------------------------------------ 
mtcars$am: 1
                 mpg          hp         wt
n        13.00000000  13.0000000 13.0000000
mean     24.39230769 126.8461538  2.4110000
stdev     6.16650381  84.0623243  0.6169816
skew      0.05256118   1.3598859  0.2103128
kurtosis -1.45535200   0.5634635 -1.1737358

# Summary Statistics by Group using ```summaryBy()```

In [20]:
install.packages('doBy')
library(doBy)

also installing the dependencies 'rprojroot', 'diffobj', 'brio', 'desc', 'pkgload', 'praise', 'waldo', 'testthat', 'minqa', 'nloptr', 'lme4', 'Deriv', 'microbenchmark', 'pbkrtest'




package 'rprojroot' successfully unpacked and MD5 sums checked
package 'diffobj' successfully unpacked and MD5 sums checked
package 'brio' successfully unpacked and MD5 sums checked
package 'desc' successfully unpacked and MD5 sums checked
package 'pkgload' successfully unpacked and MD5 sums checked
package 'praise' successfully unpacked and MD5 sums checked
package 'waldo' successfully unpacked and MD5 sums checked
package 'testthat' successfully unpacked and MD5 sums checked
package 'minqa' successfully unpacked and MD5 sums checked
package 'nloptr' successfully unpacked and MD5 sums checked
package 'lme4' successfully unpacked and MD5 sums checked
package 'Deriv' successfully unpacked and MD5 sums checked
package 'microbenchmark' successfully unpacked and MD5 sums checked
package 'pbkrtest' successfully unpacked and MD5 sums checked
package 'doBy' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Nemo\AppData\Local\Temp\Rtmp8gI1y5\downloaded

The ```summaryBy()``` function has the format

```summaryBy(formula, data=dataframe, FUN=function)```

where the formula takes the form

```var1 + var2 + var3 + ... + varN ~ groupvar1 + groupvar2 + ... + groupvarN```

Variables on the left of the ~ are the numeric variables to be analyzed, and variables on the right are categorical grouping variables.

In [21]:
summaryBy(mpg+hp+wt~am, data=mtcars, FUN=mystats)

| | am `<dbl>` | mpg.n `<dbl>` | mpg.mean `<dbl>` | mpg.stdev `<dbl>` | mpg.skew `<dbl>` | mpg.kurtosis `<dbl>` | hp.n `<dbl>` | hp.mean `<dbl>` | hp.stdev `<dbl>` | hp.skew `<dbl>` | hp.kurtosis `<dbl>` | wt.n `<dbl>` | wt.mean `<dbl>` | wt.stdev `<dbl>` | wt.skew `<dbl>` | wt.kurtosis `<dbl>` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 19 | 17.14737 | 3.833966 | 0.01395038 | -0.8031783 | 19 | 160.2632 | 53.9082 | -0.01422519 | -1.2096973 | 19 | 3.768895 | 0.7774001 | 0.9759294 | 0.1415676 |
| 2 | 1 | 13 | 24.39231 | 6.166504 | 0.05256118 | -1.455352 | 13 | 126.8462 | 84.06232 | 1.35988586 | 0.5634635 | 13 | 2.411 | 0.6169816 | 0.2103128 | -1.1737358 |


# Frequency and Contingency Tables

In [22]:
install.packages('vcd')
library(vcd)

also installing the dependency 'lmtest'




package 'lmtest' successfully unpacked and MD5 sums checked
package 'vcd' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Nemo\AppData\Local\Temp\Rtmp8gI1y5\downloaded_packages


Loading required package: grid



In [23]:
head(Arthritis)

| | ID `<int>` | Treatment `<fct>` | Sex `<fct>` | Age `<int>` | Improved `<ord>` |
|---|---|---|---|---|---|
| 1 | 57 | Treated | Male | 27 | Some |
| 2 | 46 | Treated | Male | 29 | None |
| 3 | 77 | Treated | Male | 30 | None |
| 4 | 17 | Treated | Male | 32 | Marked |
| 5 | 36 | Treated | Male | 46 | Marked |
| 6 | 23 | Treated | Male | 58 | Marked |


* Treatment (Placebo, Treated)
* Sex (Male, Female)
* Improved (None, Some, Marked)

![](table_tables.png)

In [24]:
mytable <- with(Arthritis, table(Improved))
mytable

Improved
  None   Some Marked 
    42     14     28 

In [25]:
prop.table(mytable)

Improved
     None      Some    Marked 
0.5000000 0.1666667 0.3333333 

# Two way Tables

The ```xtabs()``` function allows you to create a contingency table using formula-style input. The format is

```mytable <- xtabs(~ A + B, data=mydata)```

where mydata is a matrix or data frame. In general, the variables to be cross-classified
appear on the right of the formula (that is, to the right of the ~) separated by + signs.

In [26]:
mytable <- xtabs(~ Treatment+Improved, data=Arthritis)
mytable

         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

In [27]:
# Marginal frequency
margin.table(mytable, 1)

Treatment
Placebo Treated 
     43      41 

In [28]:
# Proportions Tables
prop.table(mytable, 1)

         Improved
Treatment      None      Some    Marked
  Placebo 0.6744186 0.1627907 0.1627907
  Treated 0.3170732 0.1707317 0.5121951

The index (1) refers to the first variable in the ```xtabs()``` formula (Treatment), so the proportions in each row sum to 1. Looking at the
table, you can see that $51\%$ of treated individuals had marked improvement, compared
to $16\%$ of those receiving a placebo.

In [29]:
margin.table(mytable, 2)

Improved
  None   Some Marked 
    42     14     28 

In [30]:
prop.table(mytable, 2)

         Improved
Treatment      None      Some    Marked
  Placebo 0.6904762 0.5000000 0.2500000
  Treated 0.3095238 0.5000000 0.7500000

Here, the index (2) refers to the second variable in the ```xtabs()``` formula (Improved), so the proportions in each column sum to 1.
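The effect of the margin index can be verified directly. The sketch below re-enters the Treatment by Improved counts by hand so that it runs without the vcd package; the object names `counts`, `row_props`, and `col_props` are introduced here for illustration.

```r
# Rebuild the Treatment x Improved counts shown above as a plain table.
counts <- as.table(matrix(c(29, 7,  7,
                            13, 7, 21),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Treatment = c("Placebo", "Treated"),
                                          Improved  = c("None", "Some", "Marked"))))

row_props <- prop.table(counts, 1)   # proportions within each Treatment row
col_props <- prop.table(counts, 2)   # proportions within each Improved column

rowSums(row_props)   # every row of row_props adds up to 1
colSums(col_props)   # every column of col_props adds up to 1
```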

You can use the ```addmargins()``` function to add marginal sums to these tables.

In [56]:
addmargins(mytable)

| | None | Some | Marked | Sum |
|---|---|---|---|---|
| Placebo | 29 | 7 | 7 | 43 |
| Treated | 13 | 7 | 21 | 41 |
| Sum | 42 | 14 | 28 | 84 |


In [57]:
addmargins(prop.table(mytable))

| | None | Some | Marked | Sum |
|---|---|---|---|---|
| Placebo | 0.3452381 | 0.08333333 | 0.08333333 | 0.5119048 |
| Treated | 0.1547619 | 0.08333333 | 0.25 | 0.4880952 |
| Sum | 0.5 | 0.16666667 | 0.33333333 | 1.0 |


# Two-way table using ```CrossTable```

In [59]:
library(gmodels)
CrossTable(Arthritis$Treatment, Arthritis$Improved)

"package 'gmodels' was built under R version 3.6.3"


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  84 

 
                    | Arthritis$Improved 
Arthritis$Treatment |      None |      Some |    Marked | Row Total | 
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 | 
                    |     2.616 |     0.004 |     3.752 |           | 
                    |     0.674 |     0.163 |     0.163 |     0.512 | 
                    |     0.690 |     0.500 |     0.250 |           | 
                    |     0.345 |     0.083 |     0.083 |           | 
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 | 
                    |     2.744 |     0.004 |     3.935 |        

# Three-way contingency table 

In [62]:
mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis)
mytable

, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5


# Print in a More Convenient Way

In [64]:
ftable(mytable)

                 Improved None Some Marked
Treatment Sex                             
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5

In [65]:
# Marginal Frequencies
margin.table(mytable, 1) # or 2 or 3

Treatment
Placebo Treated 
     43      41 

In [66]:
margin.table(mytable, c(1, 3))

         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

In [67]:
ftable(prop.table(mytable, c(1, 2)))

                 Improved       None       Some     Marked
Treatment Sex                                             
Placebo   Female          0.59375000 0.21875000 0.18750000
          Male            0.90909091 0.00000000 0.09090909
Treated   Female          0.22222222 0.18518519 0.59259259
          Male            0.50000000 0.14285714 0.35714286

In [68]:
ftable(addmargins(prop.table(mytable, c(1, 2)), 3))

                 Improved       None       Some     Marked        Sum
Treatment Sex                                                        
Placebo   Female          0.59375000 0.21875000 0.18750000 1.00000000
          Male            0.90909091 0.00000000 0.09090909 1.00000000
Treated   Female          0.22222222 0.18518519 0.59259259 1.00000000
          Male            0.50000000 0.14285714 0.35714286 1.00000000

# Test of Independence

# Chi-square Test
The chi-square test of independence checks whether two categorical variables are associated, that is, whether the distribution of one variable differs across the levels of the other.

* A very small chi-square test statistic means that the observed counts are close to the counts expected under independence. In other words, there is little evidence of a relationship.
* A very large chi-square test statistic means that the observed counts differ substantially from the counts expected under independence. In other words, there is evidence of a relationship.

The p-value is the probability of obtaining results at least as extreme as those observed, assuming independence of the row and column variables in the population.
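To make the statistic concrete, here is a minimal sketch that computes it by hand for the Treatment by Improved counts from the two-way table section. The counts are typed in directly so the snippet needs only base R, and the names `observed`, `expected`, `x2`, `df`, and `p` are chosen here for illustration.

```r
# Observed Treatment x Improved counts, entered by hand from the two-way table.
observed <- matrix(c(29, 7,  7,
                     13, 7, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Treatment = c("Placebo", "Treated"),
                                   Improved  = c("None", "Some", "Marked")))

# Counts expected under independence: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic, degrees of freedom, and upper-tail p-value.
x2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p  <- pchisq(x2, df = df, lower.tail = FALSE)

round(c(X2 = x2, df = df, p = p), 4)
```

The result should agree with what ```chisq.test()``` reports for the same table.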

In [71]:
# Treatment vs. Improved: the variables appear to be dependent
library(vcd)
mytable <- xtabs(~Treatment+Improved, data=Arthritis)
chisq.test(mytable)


	Pearson's Chi-squared test

data:  mytable
X-squared = 13.055, df = 2, p-value = 0.001463


In [73]:
# Improved vs. Sex: no significant evidence of dependence
mytable <- xtabs(~Improved+Sex, data=Arthritis)
chisq.test(mytable)

"Chi-squared approximation may be incorrect"


	Pearson's Chi-squared test

data:  mytable
X-squared = 4.8407, df = 2, p-value = 0.08889


# Fisher's Exact Test
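Like the chi-square test, Fisher's exact test examines the relationship between the two dimensions of the table (classification into rows vs. classification into columns). The null hypothesis is that the row and column classifications are independent. Although the test is often presented for fourfold (2 by 2) tables, ```fisher.test()``` accepts any two-way table with two or more rows and columns, such as the 2 by 3 table used here.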

In [74]:
# Treatment vs. Improved: the variables appear to be dependent
mytable <- xtabs(~Treatment+Improved, data=Arthritis)
fisher.test(mytable)


	Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.001393
alternative hypothesis: two.sided


In [75]:
# Improved vs. Sex: no significant evidence of dependence
mytable <- xtabs(~Improved+Sex, data=Arthritis)
fisher.test(mytable)


	Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.1094
alternative hypothesis: two.sided


# Cochran–Mantel–Haenszel Test
The ```mantelhaen.test()``` function provides a Cochran–Mantel–Haenszel chi-square
test of the null hypothesis that two nominal variables are conditionally independent in
each stratum of a third variable. The following code tests the hypothesis that the
Treatment and Improved variables are independent within each level for Sex. The test
assumes that there’s no three-way (Treatment × Improved × Sex) interaction:

In [77]:
mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
mantelhaen.test(mytable)


	Cochran-Mantel-Haenszel test

data:  mytable
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647


The results suggest that the treatment received and the improvement reported aren’t
independent within each level of Sex (that is, treated individuals improved more than
those receiving placebos when controlling for sex).

#### <center> We have many more tests that we could conduct </center>

![](null.jpg)

# Types of Correlation
## Pearson, Spearman, and Kendall Correlations
The Pearson product-moment correlation assesses the degree of linear relationship between two quantitative variables. Spearman’s rank-order correlation coefficient assesses the degree of relationship between two rank-ordered variables. Kendall’s tau is also a nonparametric measure of rank correlation.

![](corr.png)

The ```cor()``` function produces all three correlation coefficients, whereas the ```cov()```
function provides covariances. There are many options, but a simplified format for
producing correlations is

```cor(x, use= , method= )```

In [78]:
states<- state.x77[,1:6]

In [79]:
cov(states)

| | Population | Income | Illiteracy | Life Exp | Murder | HS Grad |
|---|---|---|---|---|---|---|
| Population | 19931683.7588 | 571229.7796 | 292.8679592 | -407.8424612 | 5663.523714 | -3551.509551 |
| Income | 571229.7796 | 377573.3061 | -163.7020408 | 280.6631837 | -521.894286 | 3076.76898 |
| Illiteracy | 292.868 | -163.702 | 0.3715306 | -0.4815122 | 1.581776 | -3.235469 |
| Life Exp | -407.8425 | 280.6632 | -0.4815122 | 1.8020204 | -3.86948 | 6.312685 |
| Murder | 5663.5237 | -521.8943 | 1.5817755 | -3.8694804 | 13.627465 | -14.549616 |
| HS Grad | -3551.5096 | 3076.769 | -3.2354694 | 6.3126849 | -14.549616 | 65.237894 |


In [80]:
cor(states)

| | Population | Income | Illiteracy | Life Exp | Murder | HS Grad |
|---|---|---|---|---|---|---|
| Population | 1.0 | 0.2082276 | 0.1076224 | -0.06805195 | 0.3436428 | -0.09848975 |
| Income | 0.20822756 | 1.0 | -0.4370752 | 0.34025534 | -0.2300776 | 0.61993232 |
| Illiteracy | 0.10762237 | -0.4370752 | 1.0 | -0.58847793 | 0.7029752 | -0.65718861 |
| Life Exp | -0.06805195 | 0.3402553 | -0.5884779 | 1.0 | -0.7808458 | 0.5822162 |
| Murder | 0.34364275 | -0.2300776 | 0.7029752 | -0.78084575 | 1.0 | -0.48797102 |
| HS Grad | -0.09848975 | 0.6199323 | -0.6571886 | 0.5822162 | -0.487971 | 1.0 |
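The ```method=``` and ```use=``` arguments mentioned earlier were left at their defaults in the calls above. The sketch below shows how they might be set; the object names `spearman`, `kendall`, `pearson`, `x`, and `y` are introduced here for illustration.

```r
states <- state.x77[, 1:6]

# Rank-based alternatives to the default Pearson coefficient.
spearman <- cor(states, method = "spearman")
kendall  <- cor(states, method = "kendall")

# use= controls how missing values are handled; state.x77 has none,
# so this call gives the same matrix as cor(states).
pearson <- cor(states, use = "pairwise.complete.obs", method = "pearson")

# Correlate one set of variables with another (a non-square matrix).
x <- states[, c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[, c("Life Exp", "Murder")]
cor(x, y)
```

Passing two matrices is convenient when only the relationships between two groups of variables are of interest.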


# T-test
One of the most common activities in research is the comparison of two groups. For example, do patients receiving a new drug show greater improvement than patients using an existing medication?

We’ll use the UScrime dataset distributed with the MASS package.
It contains information about the effect of punishment regimes on crime rates in
47 US states in 1960. The outcome variables of interest will be Prob (the probability of
imprisonment), U1 (the unemployment rate for urban males ages 14–24), and U2 (the
unemployment rate for urban males ages 35–39). The categorical variable So (an indicator
variable for Southern states) will serve as the grouping variable.

# Independent T-test
Are you more likely to be imprisoned if you commit a crime in the South? The comparison
of interest is Southern versus non-Southern states, and the dependent variable
is the probability of incarceration.

A two-group independent t-test can be used to
test the hypothesis that the two population means are equal. Here, you assume that
the two groups are independent and that the data is sampled from normal populations.


In [82]:
library(MASS)

In [83]:
t.test(Prob ~ So, data=UScrime)


	Welch Two Sample t-test

data:  Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1 
     0.03851265      0.06371269 


You can reject the hypothesis that Southern states and non-Southern states have equal
probabilities of imprisonment $(p < .001)$.
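Note that the output above is labeled as a Welch test: by default ```t.test()``` does not assume equal variances in the two groups. The following sketch compares that default with two common variations; the object names `welch`, `pooled`, and `one_sided` are introduced here for illustration.

```r
library(MASS)   # provides the UScrime data

# Default: Welch's t-test, which does not assume equal variances.
welch <- t.test(Prob ~ So, data = UScrime)

# Classical Student's t-test with a pooled variance estimate.
pooled <- t.test(Prob ~ So, data = UScrime, var.equal = TRUE)

# One-sided alternative: is the mean for group 0 less than the mean for group 1?
one_sided <- t.test(Prob ~ So, data = UScrime, alternative = "less")

c(welch = welch$p.value, pooled = pooled$p.value, one_sided = one_sided$p.value)
```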

# Dependent T-test
As a second example, you might ask if the unemployment rate for younger males (14–24) is greater than for older males (35–39). In this case, the two groups aren’t independent. You wouldn’t expect the unemployment rate for younger and older males in Alabama to be unrelated. When observations in the two groups are related, you have a dependent-groups design. Pre-post or repeated-measures designs also produce dependent groups.

In [84]:
library(MASS)
sapply(UScrime[c("U1","U2")], function(x)(c(mean=mean(x),sd=sd(x))))

| | U1 | U2 |
|---|---|---|
| mean | 95.46809 | 33.97872 |
| sd | 18.02878 | 8.44545 |


In [85]:
with(UScrime, t.test(U1, U2, paired=TRUE))


	Paired t-test

data:  U1 and U2
t = 32.407, df = 46, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 57.67003 65.30870
sample estimates:
mean of the differences 
               61.48936 


The mean difference (61.5) is large enough to warrant rejection of the hypothesis
that the mean unemployment rate for older and younger males is the same. Younger
males have a higher rate. In fact, the probability of obtaining a sample difference this
large if the population means are equal is less than 0.00000000000000022 (that is,
2.2e–16).

# Nonparametric tests of group differences
If the two groups are independent, you can use the Wilcoxon rank sum test (more
popularly known as the Mann–Whitney U test) to assess whether the observations are
sampled from the same probability distribution (that is, whether the probability of
obtaining higher scores is greater in one population than the other).

In [87]:
with(UScrime, by(Prob, So, median))

So: 0
[1] 0.038201
------------------------------------------------------------ 
So: 1
[1] 0.055552

In [88]:
wilcox.test(Prob ~ So, data=UScrime)


	Wilcoxon rank sum test

data:  Prob by So
W = 81, p-value = 8.488e-05
alternative hypothesis: true location shift is not equal to 0


Again, you can reject the hypothesis that incarceration rates are the same in Southern
and non-Southern states (p < .001).

The Wilcoxon signed rank test provides a nonparametric alternative to the dependent
sample t-test. It’s appropriate in situations where the groups are paired and the
assumption of normality is unwarranted.
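As a sketch of how the signed rank test could be applied, the paired unemployment comparison from the dependent t-test section can be rerun with ```wilcox.test()``` and ```paired=TRUE```. The object name `signed_rank` is introduced here for illustration.

```r
library(MASS)   # provides the UScrime data

# Wilcoxon signed rank test on the paired unemployment rates U1 and U2.
# Ties in the data may trigger a warning that an exact p-value
# cannot be computed; the test then uses a normal approximation.
signed_rank <- with(UScrime, wilcox.test(U1, U2, paired = TRUE))
signed_rank
```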