# Chapter 22 - Sampling

## 22.1 Introduction
Sampling means that a number of observations are selected from the available data for analysis. From a table point of view, this means that certain rows are selected. 

In most cases sampling is the random selection of a subset of observations from the population. The sampling rate is the proportion of observations in the original table that are selected for the sample. Sampling is important in data mining analyses because the number of observations is usually large; reducing it improves performance.

Another reason for sampling is to reduce the number of observations that are used to produce graphics such as scatter plots or single plots. For example, creating a scatter plot with 500,000 observations takes a long time and would probably not give a better picture of the relationship than a scatter plot with 5,000 observations.

Sampling is also important in the case of longitudinal data if single plots (one line per subject) are to be produced. Plotting the course of a purchase amount over 18 months for 500,000 customers would take a long time, and 500,000 lines in one graph are far harder to read than 1,000 or 5,000 lines.



## 22.2 Sampling Methods

__Overview__

There are a number of different sampling methods. We will investigate the three most important for analytic data marts: 
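1. __The simple random sample__: 
    - For each observation it is decided, independently of all other observations, whether it is in the sample. 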
2. __The stratified sample__: 
    - In the stratified sample the distribution of a nominal variable is controlled. 
    - In the case of simple stratified sampling the distribution in the sample has to equal the distribution in the original data. 
    - In the case of oversampling the sample is drawn in such a way that the proportion of a certain category of a nominal variable in the sample is higher than in the original table.

    Oversampling is encountered mostly with binary target variables, where the proportion of events (or responders) is increased in the sample. If events are rare, all observations that have an event in the target variable are selected for the sample, together with a certain number of non-events.

3. __The clustered sample__:
    - If the data have a one-to-many relationship, a clustered sample is usually needed. 
    - In the case of a clustered sample, whether an observation is in the sample is determined by whether the underlying subject has been selected for the sample. This means that for a subject or cross-sectional group in the sample, all of its observations are in the sample. 
    - In market basket analysis, for example, it does not make sense to sample at the basket-item level; instead, a whole basket is either included in the sample or excluded from it. 
    - In the case of single plots or time series analysis, for example, it makes sense to include the whole series of measurements for a subject or to exclude the whole series.
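
A clustered sample can be sketched in a DATA step by making the selection decision once per subject and applying it to all rows of that subject. The following is only a minimal sketch under stated assumptions: the data set PURCHASES, its variables CustID, Month and Amount, the seed and the 50% rate are hypothetical and not part of this chapter's data.

```sas
/* Hypothetical longitudinal data: one row per customer and month */
data purchases;
   input CustID Month Amount;
   datalines;
1 1 20
1 2 35
1 3 30
2 1 10
2 2 15
3 1 50
3 2 40
3 3 45
4 1 25
4 2 20
;
run;

/* BY-group processing requires the data to be sorted by the subject key */
proc sort data=purchases;
   by CustID;
run;

data cluster_sample;
   set purchases;
   by CustID;
   retain selected;
   /* draw one random number per customer, on the first row of each BY group */
   if first.CustID then selected = (ranuni(123) < 0.5);
   /* keep every row of a selected customer, drop every row of the others */
   if selected then output;
   drop selected;
run;

proc print data=cluster_sample;
run;
```

Because the random number is drawn only on the first row of each BY group, a customer's series is either kept completely or dropped completely.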


## 22.3 Simple Sampling and Reaching the Exact Sample Count or Proportion

   - In the simple random sample it is decided for each observation whether the observation is in the sample or not. 
   - This decision is made independently of other observations. For each observation a uniformly distributed random number between 0 and 1 is created and compared with the sampling rate. 
   - All observations that have a random number lower than or equal to the sampling rate are selected.
    


In [11]:
/*************************************/
   /* create the initial data set       */
   /*************************************/
data admit;
   input ID $ 1-4 Name $ 6-19 Sex $ 21 Age 23-24
         Date 26-27 Height 29-30 Weight 32-34
         ActLevel $ 36-39 Fee 41-46;
   format fee 6.2;
   datalines;
2458 Murray, W      M 27  1 72 168 HIGH  85.20
2462 Almers, C      F 34  3 66 152 HIGH 124.80
2501 Bonaventure, T F 31 17 61 123 LOW  149.75
2523 Johnson, R     F 43 31 63 137 MOD  149.75
2539 LaMance, K     M 51  4 71 158 LOW  124.80
2544 Jones, M       M 29  6 76 193 HIGH 124.80
2552 Reberson, P    F 32  9 67 151 MOD  149.75
2555 King, E        M 35 13 70 173 MOD  149.75
2563 Pitts, D       M 34 22 73 154 LOW  124.80
2568 Eberhardt, S   F 49 27 64 172 LOW  124.80
;
run;

proc print data=admit; 
run;

Obs,ID,Name,Sex,Age,Date,Height,Weight,ActLevel,Fee
1,2458,"Murray, W",M,27,1,72,168,HIGH,85.2
2,2462,"Almers, C",F,34,3,66,152,HIGH,124.8
3,2501,"Bonaventure, T",F,31,17,61,123,LOW,149.75
4,2523,"Johnson, R",F,43,31,63,137,MOD,149.75
5,2539,"LaMance, K",M,51,4,71,158,LOW,124.8
6,2544,"Jones, M",M,29,6,76,193,HIGH,124.8
7,2552,"Reberson, P",F,32,9,67,151,MOD,149.75
8,2555,"King, E",M,35,13,70,173,MOD,149.75
9,2563,"Pitts, D",M,34,22,73,154,LOW,124.8
10,2568,"Eberhardt, S",F,49,27,64,172,LOW,124.8


In [38]:
proc freq data=admit; 
tables sex; 
run;

Sex,Frequency,Percent,Cumulative Frequency,Cumulative Percent
F,5,50.0,5,50.0
M,5,50.0,10,100.0


### 22.3.1 Unrestricted Random Sampling

Treating sampling as a random selection of observations means that each observation enters the sample with a given probability. If we want to draw a 10% sample of the base population and apply this as shown in the following code, we reach only approximately 10%, because unrestricted random selection does not guarantee that exactly 10% of the cases are selected.


    DATA SAMPLE;
     SET BASIS;
     IF RANUNI(123) < 0.1 THEN OUTPUT;
    RUN;

If the sample count is given as an absolute number, say 6,000 observations out of 60,000, we can write the code in the following way:


    DATA SAMPLE;
     SET BASIS;
     IF RANUNI(123) < 6000/60000 THEN OUTPUT;
    RUN;

Again, we will not necessarily reach exactly 6,000 observations in the sample.
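
The deviation can be checked with a small experiment. The sketch below builds a hypothetical BASIS table of 60,000 rows (the table contents and the variable `id` are assumptions for illustration) and counts the rows that end up in the sample. The count follows a binomial distribution with mean 6,000 and a standard deviation of about 73, so a result within roughly 100 to 150 rows of the target is to be expected.

```sas
/* Hypothetical base table with 60,000 rows */
data basis;
   do id = 1 to 60000;
      output;
   end;
run;

/* unrestricted random sample with a 10% sampling rate */
data sample;
   set basis;
   if ranuni(123) < 6000/60000 then output;
run;

/* the row count is typically close to, but not exactly, 6,000 */
proc sql;
   select count(*) as n_sample from sample;
quit;
```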


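In [8]:
/* Note: RANUNI uses the seed only on its first call in a DATA step (here i = 7); */
/* later seed values are ignored, so all 11 numbers come from one stream.         */
 data _null_; 
  do i = 7 to 17;
      x = RANUNI(i) ; 
      put x; 
   end;
run; 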



In [17]:
DATA _null_;
 SET admit;
 x=RANUNI(123);
  
 put ID= x= ; 
 
RUN;

In [47]:
DATA sample;
 SET admit;
 x=RANUNI(123);
 if x <.2 then output; 
 
RUN;

proc print data=sample; 
run;


Obs,ID,Name,Sex,Age,Date,Height,Weight,ActLevel,Fee,x
1,2501,"Bonaventure, T",F,31,17,61,123,LOW,149.75,0.17839
2,2563,"Pitts, D",M,34,22,73,154,LOW,124.8,0.12467
3,2568,"Eberhardt, S",F,49,27,64,172,LOW,124.8,0.18769


### 22.3.2 Restricted Random Sampling

If we want to force the sampling to exactly 6,000 observations or 10%, we have to amend the code as shown in the following example:


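    DATA SAMPLE_FORCE;
     SET BASIS;
     /* smp_count counts the observations selected so far; the sum   */
     /* statement below initializes it to 0 and retains it across rows */
     IF smp_count < 6000 THEN DO;
      /* observations still needed / observations not yet processed  */
      /* (the current row counts as not yet processed, hence the +1) */
      IF RANUNI(123) <= (6000 - smp_count)/(60000 - _N_ + 1) THEN DO;
          OUTPUT;
          smp_count+1;
      END;
     END;
    RUN;

In this example the probability for each observation to be in the sample is influenced by the number of observations that are in the sample so far. The probability is the number of observations still needed, `6000 - smp_count`, divided by the number of observations that have not yet been processed, `60000 - _N_ + 1`. It rises when the sample so far is below the desired proportion and falls when it is above, and it reaches 1 as soon as every remaining observation is needed, so exactly 6,000 observations are selected.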
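
As an alternative to hand-written DATA step logic, an exact sample size can also be requested from PROC SURVEYSELECT. This is a sketch under assumptions, not the method used in this chapter: it assumes that SAS/STAT is licensed, and the BASIS table built here, its variable `id` and the output name SAMPLE_SRS are hypothetical.

```sas
/* Hypothetical base table with 60,000 rows */
data basis;
   do id = 1 to 60000;
      output;
   end;
run;

proc surveyselect data=basis
                  method=srs      /* simple random sampling without replacement */
                  n=6000          /* exact number of rows to select */
                  seed=123        /* fixed seed for a reproducible sample */
                  out=sample_srs;
run;
```

Replacing `n=6000` with `samprate=0.1` requests the sample as a proportion instead of a count. Adding a STRATA statement, with the input sorted by the stratum variable, draws the sample within each stratum.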

- `_N_` is initially set to 1. Each time the DATA step loops past the DATA statement, the variable `_N_` increments by 1. The value of `_N_` represents the number of times the DATA step has iterated.

- `RANUNI(seed)`: `seed` is the initial starting point that random-number functions and CALL routines use to generate streams of pseudo-random numbers. A seed must be a nonnegative integer with a value less than 2^31 - 1 (or 2,147,483,647). If you use a positive seed, you can always replicate the stream of random numbers by using the same DATA step. 
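
The reproducibility of a positive seed can be checked with a minimal sketch such as the following. The data set names RUN1 and RUN2 and the variables `i` and `x` are chosen only for illustration.

```sas
/* two separate DATA steps that use the same positive seed */
data run1;
   do i = 1 to 5;
      x = ranuni(123);
      output;
   end;
run;

data run2;
   do i = 1 to 5;
      x = ranuni(123);
      output;
   end;
run;

/* PROC COMPARE reports whether the two data sets hold identical values */
proc compare base=run1 compare=run2;
run;
```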
     


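The following cell applies the same logic to the 10-row ADMIT data to draw exactly 5 observations. `ro` holds the selection probability that was applied to a selected row, `currentObs` its row number, and `smp_count` the number of rows selected before it. Note that the cell divides by `10 - _N_`, the number of rows after the current one; with this denominator the sample is always complete before the last row is read, so the last row can never be selected. Dividing by `10 - _N_ + 1` removes this bias.

In [57]: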
DATA SAMPLE_FORCE;
 SET admit;
 ro=0;
 currentobs=0;
 IF smp_count < 5 THEN DO;
  IF RANUNI(123) <= (5 - smp_count)/(10 - _N_) THEN DO;
       ro=(5 - smp_count)/(10 - _N_);
       currentObs=_N_;
       OUTPUT;
       Smp_count+1;
  END;
 END;
RUN;

proc print data=SAMPLE_FORCE; 
run;

Obs,ID,Name,Sex,Age,Date,Height,Weight,ActLevel,Fee,ro,currentobs,smp_count
1,2462,"Almers, C",F,34,3,66,152,HIGH,124.8,0.625,2,0
2,2501,"Bonaventure, T",F,31,17,61,123,LOW,149.75,0.57143,3,1
3,2539,"LaMance, K",M,51,4,71,158,LOW,124.8,0.6,5,2
4,2544,"Jones, M",M,29,6,76,193,HIGH,124.8,0.5,6,3
5,2555,"King, E",M,35,13,70,173,MOD,149.75,0.5,8,4


In [61]:
DATA SAMPLE_FORCE;
 SET admit;
 IF smp_count < 5 THEN DO;
 rnd=RANUNI(123);
  IF rnd <= .5 THEN DO;
      
      OUTPUT;
      Smp_count+1;
  END;
 END;
RUN;

proc print data=SAMPLE_FORCE; 
run;

Obs,ID,Name,Sex,Age,Date,Height,Weight,ActLevel,Fee,smp_count,rnd
1,2462,"Almers, C",F,34,3,66,152,HIGH,124.8,0,0.32091
2,2501,"Bonaventure, T",F,31,17,61,123,LOW,149.75,1,0.17839
3,2539,"LaMance, K",M,51,4,71,158,LOW,124.8,2,0.35712
4,2544,"Jones, M",M,29,6,76,193,HIGH,124.8,3,0.22111
5,2555,"King, E",M,35,13,70,173,MOD,149.75,4,0.39808


## 22.4 Oversampling: Small Event Rate

Oversampling is a common task in data mining if events are rare, that is, if the number of events is small compared to the total number of observations. In this case sampling is performed so that all or a certain proportion of the event cases are selected and a random sample of the non-events is drawn.

For example, if we have an event rate of 1.5% in the population, we might want to oversample the events and create a data set with 15% events.

### 22.4.1 Example of Oversampling
A simple oversampling can be performed with the following statements:


    DATA oversample1;
     SET sampsio.hmeq;
     IF bad = 1 or
        bad = 0 and RANUNI(44) <= 0.5 THEN OUTPUT;
    RUN;

With these statements we select all event cases and approximately 50% of the non-event cases.


### 22.4.2 Specifying an Event Rate
If we want to reach a certain proportion of event cases in the sample, we can specify a more complex condition:


    %let eventrate = 0.1995;
    DATA oversample2;
     SET sampsio.hmeq;
    IF bad = 1 or
       bad = 0 and (&eventrate*100)/(RANUNI(34)*(1-&eventrate)
                   +&eventrate) > 25 THEN OUTPUT;
    RUN;

Note that 0.1995 is the event rate in the SAMPSIO.HMEQ data set. Specifying the value 25 causes the condition to select just enough non-events that the sample has an event rate of approximately 25%. The distribution in the data set OVERSAMPLE2 is as follows:

    The FREQ Procedure

                                    Cumulative    Cumulative
    BAD    Frequency     Percent     Frequency      Percent
    -------------------------------------------------------
      0        3588       75.11          3588        75.11
      1        1189       24.89          4777       100.00
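
Why the condition yields the target rate can be seen by solving it for the random number. The symbols are introduced here only for illustration: $p$ is the event rate in the original data (`&eventrate`), $t$ the target event rate in percent (25 in the code above), and $u$ the uniform random number returned by RANUNI. A non-event is kept if

$$
\frac{100\,p}{u\,(1-p) + p} > t
\quad\Longleftrightarrow\quad
u < \frac{p\,(100 - t)}{t\,(1-p)} .
$$

Since $u$ is uniform on $(0,1)$, the right-hand side is the proportion of non-events that is kept, provided that $t \ge 100\,p$ so that it does not exceed 1. All events are kept, so the expected share of events in the sample is

$$
\frac{p}{p + (1-p)\,\dfrac{p\,(100 - t)}{t\,(1-p)}}
= \frac{1}{1 + \dfrac{100 - t}{t}}
= \frac{t}{100} .
$$

For $p = 0.1995$ and $t = 25$ the proportion of non-events kept is $0.1995 \cdot 75 / (25 \cdot 0.8005) \approx 0.748$, and the expected event rate is 25%, which agrees with the 24.89% observed above up to random variation.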

In [70]:
/*************************************/
   /* create the initial data set       */
   /*************************************/
data admit;
   input ID $ 1-4 Name $ 6-19 Sex $ 21 Age 23-24
         Date 26-27 Height 29-30 Weight 32-34
         ActLevel $ 36-39 Fee 41-46;
   format fee 6.2;
   datalines;
2458 Murray, W      M 27  1 72 168 HIGH  85.20
2462 Almers, C      M 34  3 66 152 HIGH 124.80
2501 Bonaventure, T F 31 17 61 123 LOW  149.75
2523 Johnson, R     F 43 31 63 137 MOD  149.75
2539 LaMance, K     F 51  4 71 158 LOW  124.80
2544 Jones, M       F 29  6 76 193 HIGH 124.80
2552 Reberson, P    F 32  9 67 151 MOD  149.75
2555 King, E        F 35 13 70 173 MOD  149.75
2563 Pitts, D       F 34 22 73 154 LOW  124.80
2568 Eberhardt, S   F 49 27 64 172 LOW  124.80
;
run;

proc freq data=admit; 
tables sex;
run;

Sex,Frequency,Percent,Cumulative Frequency,Cumulative Percent
F,8,80.0,8,80.0
M,2,20.0,10,100.0


In [77]:
%let eventrate = 0.2;
DATA oversample2;
 SET admit;
IF sex = 'M' or  sex = 'F' and (&eventrate*100)/(RANUNI(34)*(1-&eventrate)+&eventrate) >= 60 THEN OUTPUT;
RUN;

proc print data=oversample2; 
run;

Obs,ID,Name,Sex,Age,Date,Height,Weight,ActLevel,Fee
1,2458,"Murray, W",M,27,1,72,168,HIGH,85.2
2,2462,"Almers, C",M,34,3,66,152,HIGH,124.8
3,2555,"King, E",F,35,13,70,173,MOD,149.75


In [80]:
proc freq data=oversample2; 
tables sex;
run;

Sex,Frequency,Percent,Cumulative Frequency,Cumulative Percent
F,1,33.33,1,33.33
M,2,66.67,3,100.0


In [78]:
%let eventrate = 0.2;
DATA oversample3;
 SET admit;
IF sex = 'M' or
   sex = 'F' and RANUNI(34)<= .5 THEN OUTPUT;
RUN;

proc print data=oversample3; 
run;

Obs,ID,Name,Sex,Age,Date,Height,Weight,ActLevel,Fee
1,2458,"Murray, W",M,27,1,72,168,HIGH,85.2
2,2462,"Almers, C",M,34,3,66,152,HIGH,124.8
3,2539,"LaMance, K",F,51,4,71,158,LOW,124.8
4,2552,"Reberson, P",F,32,9,67,151,MOD,149.75
5,2555,"King, E",F,35,13,70,173,MOD,149.75
6,2568,"Eberhardt, S",F,49,27,64,172,LOW,124.8
