# Chapter 4: Data Cleaning Techniques for Numeric Data

## Introduction
This chapter introduces several techniques for identifying possible errors in numeric data for which range checking is feasible:
- Examine the distribution of each variable using a graphical approach (histogram).
- Look at the highest and lowest values in table form.
- Specify reasonable ranges for some numeric variables.

### <font color='blue'>  4.1. Using PROC MEANS to Detect Invalid and Missing Values</font>
By default, PROC MEANS lists the minimum and maximum values, along with the n, mean, and standard deviation.

__Program 2-1: Using PROC MEANS to Detect Invalid and Missing Values__

In [3]:
libname Clean '/folders/myfolders/lectures'; 

title "Checking numeric variables in the patients data set";
proc means data=Clean.patients n nmiss min max mean maxdec=3;
   var HR SBP DBP;
run;



Variable,Label,N,N Miss,Minimum,Maximum,Mean
HR,Heart Rate,27,3,10.000,900.000,110.556
SBP,Systolic Blood Pressure,26,4,20.000,400.000,145.077
DBP,Diastolic Blood Pressure,27,3,8.000,200.000,88.000


This program used the PROC MEANS options N, NMISS, MIN, MAX, and MAXDEC=3. 
- The N and NMISS options report the number of non-missing and missing observations for each variable, respectively. 
- The MIN and MAX options list the smallest and largest non-missing values for each variable. 
- The MAXDEC=3 option is used so that the minimum and maximum values will be printed to three decimal places. Because HR, SBP, and DBP are supposed to be integers, you might have thought to set the MAXDEC option to 0. However, you might want to catch any data errors where a decimal point was entered by mistake.
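
To see why MAXDEC=3 is the safer choice, here is a minimal sketch. It assumes a hypothetical three-row data set named Demo (not part of the Patients data) in which a heart rate of 72 was keyed as 7.2. With MAXDEC=0 the minimum is rounded to a whole number, which hides the stray decimal point. With MAXDEC=3 it is printed with its decimal places.

```sas
/* Hypothetical data: the second HR was keyed as 7.2 instead of 72 */
data Demo;
   input HR;
datalines;
68
7.2
80
;

title "MAXDEC=0: the minimum is rounded to a whole number";
proc means data=Demo n nmiss min max maxdec=0;
   var HR;
run;

title "MAXDEC=3: the stray decimal point is visible";
proc means data=Demo n nmiss min max maxdec=3;
   var HR;
run;
```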


### <font color='blue'>4.2. Using PROC UNIVARIATE to Examine Numeric Variables</font>

Before you run any statistical analysis on numeric variables, first use PROC UNIVARIATE to produce both tabular and graphical information about them. This section uses the numeric variables HR (heart rate), SBP (systolic blood pressure), and DBP (diastolic blood pressure) to demonstrate data cleaning techniques for numeric variables.

- ODS TRACE ON writes the name of each output object that a procedure creates to the SAS log.
- ODS SELECT output-object-name; limits the output to the objects you name (here, ExtremeObs and Quantiles).

__Program 4.1: Running PROC UNIVARIATE on HR, SBP, and DBP__


In [19]:
libname Clean '/folders/myfolders/lectures'; 
ods trace on;
title "Running PROC UNIVARIATE on HR, SBP, and DBP";
ODS Select ExtremeObs Quantiles;
proc univariate data=Clean.Patients;
   id Patno;
   var HR SBP DBP;
   histogram / normal;
run;
ods trace off;

Variable: HR (Heart Rate)

Quantiles (Definition 5)
Level,Quantile
100% Max,900
99%,900
95%,210
90%,208
75% Q3,88
50% Median,74
25% Q1,60
10%,48
5%,22
1%,10

Extreme Observations
Lowest,Lowest,Lowest,Highest,Highest,Highest
Value,PATNO,Obs,Value,PATNO,Obs
10,20,20,90,,1
22,23,22,101,4.0,7
48,22,21,208,17.0,18
58,19,19,210,8.0,11
58,3,6,900,321.0,29

Variable: SBP (Systolic Blood Pressure)

Quantiles (Definition 5)
Level,Quantile
100% Max,400
99%,400
95%,300
90%,240
75% Q3,166
50% Median,121
25% Q1,112
10%,40
5%,34
1%,20

Extreme Observations
Lowest,Lowest,Lowest,Highest,Highest,Highest
Value,PATNO,Obs,Value,PATNO,Obs
20,20,20,190,3,5
34,23,22,200,4,7
40,10,13,240,9,12
102,25,24,300,11,14
102,6,8,400,321,29

Variable: DBP (Diastolic Blood Pressure)

Quantiles (Definition 5)
Level,Quantile
100% Max,200
99%,200
95%,180
90%,120
75% Q3,100
50% Median,80
25% Q1,74
10%,64
5%,20
1%,8

Extreme Observations
Lowest,Lowest,Lowest,Highest,Highest,Highest
Value,PATNO,Obs,Value,PATNO,Obs
8,20,20,106,27,25
20,11,14,120,4,7
64,13,16,120,10,13
68,25,24,180,9,12
68,6,8,200,321,29


### <font color='blue'> 4.3. Using a PROC UNIVARIATE Option to List More Extreme Values </font>
If you have a large data set and you expect a lot of errors, you might elect to see more than the five lowest and highest observations. To change the number of extreme observations in the output, add the following procedure option:


    NEXTROBS=number

where number is the number of extreme observations to include in the list. For example, to see the 10 lowest and highest values for your variables, submit the following program. This program includes both the ODS SELECT statement as well as the NEXTROBS= option. 

__Program 4.2: Adding the NEXTROBS= Option to PROC UNIVARIATE__

In [4]:
title "Running PROC UNIVARIATE on HR, SBP, and DBP";
ods select ExtremeObs;
proc univariate data=Clean.Patients nextrobs=10;
   id Patno;
   var HR SBP DBP;
   histogram / normal;
run;

Variable: HR (Heart Rate)

Extreme Observations
Lowest,Lowest,Lowest,Highest,Highest,Highest
Value,PATNO,Obs,Value,PATNO,Obs
10,020,20,84,2.0,3
22,023,22,84,2.0,4
48,022,21,86,9.0,12
58,019,19,88,1.0,2
58,003,6,88,7.0,10
60,123,28,90,,1
60,012,15,101,4.0,7
66,028,26,208,17.0,18
68,XX5,30,210,8.0,11
68,011,14,900,321.0,29

Variable: SBP (Systolic Blood Pressure)

Extreme Observations
Lowest,Lowest,Lowest,Highest,Highest,Highest
Value,PATNO,Obs,Value,PATNO,Obs
20,020,20,148,7.0,10
34,023,22,148,15.0,17
40,010,13,150,28.0,26
102,025,24,166,27.0,25
102,006,8,190,,1
108,013,16,190,3.0,5
112,003,6,200,4.0,7
114,022,21,240,9.0,12
118,019,19,300,11.0,14
120,XX5,30,400,321.0,29

Variable: DBP (Diastolic Blood Pressure)

Extreme Observations
Lowest,Lowest,Lowest,Highest,Highest,Highest
Value,PATNO,Obs,Value,PATNO,Obs
8,20,20,88,15.0,17
20,11,14,90,28.0,26
64,13,16,100,,1
68,25,24,100,3.0,5
68,6,8,102,7.0,10
70,19,19,106,27.0,25
74,12,15,120,4.0,7
74,3,6,120,10.0,13
78,23,22,180,9.0,12
78,2,4,200,321.0,29
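
If you would rather have the extreme values in a data set than in a listing, the ODS OUTPUT statement can capture the same ExtremeObs output object. The following is a minimal sketch. It assumes the Clean libref is already assigned, and the data set name Extremes is a hypothetical choice.

```sas
/* Route the ExtremeObs output object to a data set.           */
/* Extremes is a hypothetical name; any valid name would work. */
ods output ExtremeObs=Extremes;
proc univariate data=Clean.Patients nextrobs=10;
   id Patno;
   var HR SBP DBP;
run;

title "Extreme observations captured as a data set";
proc print data=Extremes;
run;
```

Once the values are in a data set, you can sort, filter, or merge them like any other SAS data.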


### <font color='blue'> 4.5. A Program to List the Highest and Lowest Values by Percentage </font>
You have seen several ways to list the top and bottom n values of a variable. As an alternative, you might want to see the top and bottom n percent of the data values.

#### <font color='red'>Using PROC UNIVARIATE</font>
One approach uses PROC UNIVARIATE to output the cutoff values for the desired percentiles and then uses a DATA step to select the observations at or beyond those cutoffs.

__Program 4.5: Listing the Top and Bottom 5% for HR (Using PROC UNIVARIATE)__

In [5]:
libname Clean '/folders/myfolders/lectures'; 

PROC univariate data=clean.patients noprint; 
    var HR;
    id Patno;
    output  out=Tmp  pctlpts=5 95 pctlpre=Percent_; 
run; 

proc print data=Tmp; 
run;


Obs,Percent_5,Percent_95
1,22,210


In [6]:
data HighLowPercent;
    set Clean.patients (keep = Patno HR);
    * bring in upper and lower cutoffs for variable HR;
    if _n_ = 1 then set Tmp;
    if HR le Percent_5 and not missing(HR) then do;
        Range = 'Low ';
        output;
    end;
    else if HR ge Percent_95 then do;
        Range = 'High';
        output;
    end;
run;

proc sort data=HighLowPercent; 
by HR; 
run; 

proc print data=HighLowPercent; 
run;

Obs,PATNO,HR,Percent_5,Percent_95,Range
1,20,10,22,210,Low
2,23,22,22,210,Low
3,8,210,22,210,High
4,321,900,22,210,High


1. You are using PROC UNIVARIATE to compute the value of HR at the 5th and 95th percentiles. Because you only want an output data set and you do not want printed output, use the NOPRINT option.

2. Use an OUTPUT statement to have PROC UNIVARIATE create an output data set. You name this data set Tmp (this author's favorite temporary data set name) and use the keywords PCTLPTS (percentile points) and PCTLPRE (percentile prefix) to name the variables in the output data set that contains the value of HR at the 5th and 95th percentiles. Here's how this works. The numbers you place after PCTLPTS are the percentiles you want to output. You need to supply a name for the two percentiles that you are outputting. They can't be called 5 and 95 because SAS variable names must begin with a letter or underscore. The value you select following PCTLPRE becomes a prefix for the variable names in the output data set. Because the prefix you chose was Percent_, the two variables in data set Tmp will be named Percent_5 and Percent_95.

3. You need to add the two variables Percent_5 and Percent_95 to each observation in the Patients data set. First, use a SET statement to read each observation in the Patients data set.

4. This statement is known as a conditional SET statement. On the first iteration of the DATA step, _N_ is equal to 1, the IF statement is true, and the SET statement brings the values of Percent_5 and Percent_95 into the PDV (program data vector). On the next iteration, the second observation is read from the Patients data set, but _N_ is now 2, so the IF statement is false and Tmp is not read again. Because Percent_5 and Percent_95 came from a SAS data set, their values are not set back to missing, as they would be if you were reading raw data; they are retained. Thus, the values of Percent_5 and Percent_95 are added to every observation of the Patients data set.

5. You can now test whether the value of HR is less than or equal to the value at the 5th percentile and not missing. If so, you set Range equal to 'Low ' (with an extra space so that the length of Range is set to 4) and output an observation.

6. If the previous comparison is not true, you test whether the value of HR is greater than or equal to the value at the 95th percentile. If so, you set Range to 'High' and output the observation.

7. Sort the resulting data set, which contains only the observations where HR is at or below the 5th percentile (and not missing) or at or above the 95th percentile.

8. Use PROC PRINT to list the values of HR at or below the 5th percentile or at or above the 95th percentile.
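
The eight steps above can be wrapped in a macro so that the variable and the percentage become parameters. The following is a sketch under stated assumptions, not a definitive implementation: the macro name HighLowPcnt and its parameter names are hypothetical, Percent must be a whole number (because %EVAL does integer arithmetic), and the Clean libref is assumed to be assigned.

```sas
/* Hypothetical macro: list the lowest and highest n percent of a variable */
%macro HighLowPcnt(Dsn=, Var=, Percent=, Idvar=);
   /* Upper cutoff percentile, for example 95 when Percent=5 */
   %let Up_per = %eval(100 - &Percent);

   /* Steps 1-2: write the two cutoff values to data set Tmp */
   proc univariate data=&Dsn noprint;
      var &Var;
      output out=Tmp pctlpts=&Percent &Up_per pctlpre=Percent_;
   run;

   /* Steps 3-6: keep only observations at or beyond the cutoffs */
   data HighLowPercent;
      set &Dsn(keep=&Idvar &Var);
      if _n_ = 1 then set Tmp;
      length Range $ 4;
      if &Var le Percent_&Percent and not missing(&Var) then do;
         Range = 'Low';
         output;
      end;
      else if &Var ge Percent_&Up_per then do;
         Range = 'High';
         output;
      end;
   run;

   /* Steps 7-8: sort and list the selected observations */
   proc sort data=HighLowPercent;
      by &Var;
   run;

   title "Lowest and highest &Percent percent of &Var values";
   proc print data=HighLowPercent;
      id &Idvar;
      var &Var Range;
   run;
%mend HighLowPcnt;

%HighLowPcnt(Dsn=Clean.Patients, Var=HR, Percent=5, Idvar=Patno)
```

The LENGTH statement replaces the trailing-space trick from step 5: it sets the length of Range to 4 before the first assignment.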

### <font color='blue'> 4.6. Using Pre-Determined Ranges to Check for Possible Data Errors </font>
####  <font color='blue'>Using PROC PRINT with a WHERE Statement to List Invalid Data Values  </font>
- This section examines ways to detect possible data errors where you can determine reasonable ranges for each variable.
- This works quite well for variables such as heart rates and blood pressures, but may not be feasible for other types of variables, such as financial values that may take on a very large range of possible values.
- Suppose you want to check all the data for any patient having a heart rate outside the range of 40 to 100, a systolic blood pressure outside the range of 80 to 200, and a diastolic blood pressure outside the range of 60 to 120. For this example, missing values are not treated as invalid. 
- The PROC PRINT step in Program 2-12 (Cody, 2008) reports all patients with out-of-range values for heart rate, systolic blood pressure, or diastolic blood pressure.

__Program 2-12: Using a WHERE Statement with PROC PRINT to List Out-of-Range Data__

In [41]:
libname Clean '/folders/myfolders/lectures'; 
title "Out-of-range values for numeric variables";

proc print data=Clean.patients; 
    where ( HR not between 40 and 100 and HR is not missing) 
    or 
    (SBP not between 80 and 200 and SBP is not missing) 
    or 
    (DBP not between 60 and 120 and DBP is not missing); 
    id patno; 
    var HR SBP DBP; 
run; 




PATNO,HR,SBP,DBP
4,101,200,120
8,210,.,.
9,86,240,180
10,.,40,120
11,68,300,20
17,208,.,84
20,10,20,8
23,22,34,78
321,900,400,200


A disadvantage of this listing is that the entire observation is printed whenever at least one variable is outside its range, so valid values appear alongside the invalid ones (for example, patient 9 is listed with an HR of 86, which is within range). To obtain a more precise listing that shows only the data values outside the normal range, you can use a DATA step as described in the next section.

####  <font color='Magenta'> Activity </font>
Using a DATA step to check for out-of-range values: a simple DATA _NULL_ step can also be used to produce a report on out-of-range values.
- Complete the program below by adding out-of-range checks for SBP and DBP. Valid values for SBP are between 80 and 200; valid values for DBP are between 60 and 120.

__Program 2-13: Using a DATA _NULL_ Step to List Out-of-Range Data Values__

In [25]:
libname Clean '/folders/myfolders/lectures'; 

title "Listing of patient numbers and invalid data values";
data _null_;
   file print; ***send output to the output window;
   set clean.patients(keep=Patno HR SBP DBP);
   ***Check HR;
   if (HR lt 40 and not missing(HR)) or HR gt 100 then
      put 
      Patno= 
      HR=;
    ***Check SBP;
  
  
  
    ***Check DBP;



run;
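
One possible completion of the activity is sketched below; other solutions are equally valid. It repeats the HR pattern for SBP and DBP. SAS treats a missing numeric value as smaller than any number, so each lower-bound test also requires that the value is not missing. The sketch assumes the Clean libref is already assigned.

```sas
title "Listing of patient numbers and invalid data values";
data _null_;
   file print;   /* send output to the output window */
   set Clean.Patients(keep=Patno HR SBP DBP);

   /* Check HR: valid values are 40 to 100 */
   if (HR lt 40 and not missing(HR)) or HR gt 100 then
      put Patno= HR=;

   /* Check SBP: valid values are 80 to 200 */
   if (SBP lt 80 and not missing(SBP)) or SBP gt 200 then
      put Patno= SBP=;

   /* Check DBP: valid values are 60 to 120 */
   if (DBP lt 60 and not missing(DBP)) or DBP gt 120 then
      put Patno= DBP=;
run;
```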

#### <font color='blue'> Identifying Invalid Values versus Missing Values </font>

__Program 4.11: Identifying Invalid Numeric Data__

In [45]:
libname Clean '/folders/myfolders/lectures'; 

*Program to Demonstrate How to Identify Invalid Data;
title "Listing of Invalid Data for HR, SBP, and DBP";
data _null_;
   file print;
   input @1  Patno $3.
         @4  HR $3.
         @7  SBP $3.
         @10 DBP $3.;
   
   if notdigit(trimn(HR)) and not missing (HR) then 
       put "invalid value " HR  " for HR in patient " Patno ; 
    if notdigit(trimn(SBP)) and not missing (SBP) then 
       put "invalid value " SBP  " for SBP in patient " Patno ;
   if notdigit(trimn(DBP)) and not missing (DBP) then 
       put "invalid value " DBP  " for DBP in patient " Patno ;
   
datalines;
001080140 90
0029.0180 90
003abcdefghi
00490x120100
005       80
;

What if you have numeric values that include decimal points? The program above will flag those values as errors because a period is not a digit. If some of your numeric values contain decimal points, you will need to modify Program 4.11. The following program uses an interesting technique to detect invalid numeric values.

__Program 4.12: An Alternative to Program 4.11__

In [47]:
libname Clean '/folders/myfolders/lectures'; 

*Program to Demonstrate How to Identify Invalid Data;
title "Listing of Invalid Data for HR, SBP, and DBP";
data _null_;
   file print;
   input @1  Patno $3.
         @4  HR $3.
         @7  SBP $3.
         @10 DBP $3.;

   /* Try the character-to-numeric conversion. _ERROR_ becomes 1 if it fails. */
   X = input(HR, 3.);
   if _error_ then do;
      put "Invalid value " HR "for HR in patient " Patno;
      _error_ = 0;   /* reset before the next test */
   end;

   X = input(SBP, 3.);
   if _error_ then do;
      put "Invalid value " SBP "for SBP in patient " Patno;
      _error_ = 0;
   end;

   X = input(DBP, 3.);
   if _error_ then do;
      put "Invalid value " DBP "for DBP in patient " Patno;
      _error_ = 0;
   end;

datalines;
001080140 90
0029.0180 90
003abcdefghi
00490x120100
005       80
;

Comments: 

This program lets SAS do all the work. You read each variable as character and use the INPUT function to attempt a character-to-numeric conversion. When SAS encounters an error reading raw data with an INPUT statement or invalid values with an INPUT function, it sets the internal variable _ERROR_ to 1 (true). This program uses that feature to detect values that are not valid numeric values. Notice that _ERROR_ is set back to 0 after each INPUT function.
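
A variation on the same idea, offered as a sketch rather than as the method used above, places the ?? modifier in front of the informat. With ??, the INPUT function does not write invalid-data messages to the log and does not set _ERROR_, so there is nothing to reset. A value is flagged when the conversion returns a missing value although the character value itself is not blank. One difference from Program 4.12: a value consisting of a single period, which SAS reads as a missing value, would also be flagged here.

```sas
title "Listing of Invalid Data for HR, SBP, and DBP";
data _null_;
   file print;
   input @1  Patno $3.
         @4  HR    $3.
         @7  SBP   $3.
         @10 DBP   $3.;

   /* ?? suppresses the log messages and leaves _ERROR_ at 0 */
   if missing(input(HR, ?? 3.)) and not missing(HR) then
      put "Invalid value " HR "for HR in patient " Patno;
   if missing(input(SBP, ?? 3.)) and not missing(SBP) then
      put "Invalid value " SBP "for SBP in patient " Patno;
   if missing(input(DBP, ?? 3.)) and not missing(DBP) then
      put "Invalid value " DBP "for DBP in patient " Patno;
datalines;
001080140 90
0029.0180 90
003abcdefghi
00490x120100
005       80
;
```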

### Conclusions
To help ensure accurate numeric data:
- Use PROC UNIVARIATE (with or without the NEXTROBS= option) to look at the highest and lowest data values.
- Define reasonable ranges for your variables and list any values that fall outside them.
- If fixed ranges are not feasible for some variables, proceed to the next chapter to explore methods that use the distribution of data values to automatically identify possible data errors.