# Chapter 1: Working with Character Data

### Introduction:
- Basic methods for checking the validity of character data 

Two kinds of character data errors are typical:
- A variable called Gender has been defined as either 'M' or 'F'. A value of 'X' would clearly be a data error.
- A set of character values must adhere to a rule. For example, a variable such as ID might be stored as character, with a requirement that it contain only digits.

### 1. Loading patients.txt file
__program 1.1: Reading the Patients.txt File__

Reading using formatted input: 
This type of input is called formatted input: you use a column pointer (@n) to indicate the starting column for each variable and an informat to instruct the program how to read the value. This INPUT statement reads character data with the $n. informat, numeric data with the n. informat, and the visit date with the MMDDYY10. informat.
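To try formatted input without the external file, a minimal sketch can read a couple of in-line records with DATALINES. The data set name Demo_Patients is hypothetical, and the two records are laid out by hand to match the column positions used in Program 1.1, with values taken from the listing shown later in this section.

```sas
* Sketch: formatted input from in-line records (Demo_Patients is a hypothetical name);
data Demo_Patients;
   input @1  Patno  $3.
         @4  Gender $1.
         @5  Visit  mmddyy10.
         @15 HR     3.
         @18 SBP    3.
         @21 DBP    3.
         @24 Dx     $3.
         @27 AE     $1.;
   format Visit mmddyy10.;
datalines;
001M11/11/1998 88140 801  0
003X10/21/1998 681901003  1
;

proc print data=Demo_Patients noobs;
run;
```

Each column pointer moves the read position before the informat is applied, so the fields do not need delimiters between them.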

In [1]:
libname Clean '/folders/myfolders/ban110'; 




In [3]:
* program 1.1: Reading the Patients.txt File; 

DATA CLEAN.PATIENTS;
   INFILE "~/ban110/Patients.TXT" ;
   INPUT @1  PATNO    $3.
         @4  GENDER   $1.
         @5  VISIT    MMDDYY10.
         @15 HR       3.
         @18 SBP      3.
         @21 DBP      3.
         @24 DX       $3.
         @27 AE       $1.;

   LABEL PATNO   = "Patient Number"
         GENDER  = "Gender"
         VISIT   = "Visit Date"
         HR      = "Heart Rate"
         SBP     = "Systolic Blood Pressure"
         DBP     = "Diastolic Blood Pressure"
         DX      = "Diagnosis Code"
         AE      = "Adverse Event?";

   FORMAT VISIT MMDDYY10.;

RUN;

proc sort data=Clean.Patients; 
   by Patno Visit;

run;

proc print data=Clean.Patients ; 
   id Patno;
run;

PATNO,GENDER,VISIT,HR,SBP,DBP,DX,AE
,M,11/11/1998,90,190,100,,0
001,M,11/11/1998,88,140,80,1,0
002,F,11/13/1998,84,120,78,X,0
002,F,11/13/1998,84,120,78,X,0
003,X,10/21/1998,68,190,100,3,1
003,M,11/12/1999,58,112,74,,0
004,F,01/01/1999,101,200,120,5,A
006,,06/15/1999,72,102,68,6,1
006,F,07/07/1999,82,148,84,1,0
007,M,.,88,148,102,,0


### 2. Using PROC FREQ to Detect Character Variable Errors
__Program 1.2: Computing Frequencies Using PROC FREQ__


In [5]:
*1; 
libname Clean '/folders/myfolders/ban110';  

proc freq data=clean.Patients; 
   tables Gender; 
run;
/*
data Check_Char;
set Clean.Patients(keep=Patno Gender);  
run; 

title "Frequencies for Gender";
proc freq data=Check_Char; 
   tables Gender; 
run;
*/

Gender
GENDER,Frequency,Percent,Cumulative Frequency,Cumulative Percent
2,1,3.45,1,3.45
F,12,41.38,13,44.83
M,13,44.83,26,89.66
X,1,3.45,27,93.10
f,2,6.90,29,100.00
Frequency Missing = 1


1. The LIBNAME statement points to the folder /folders/myfolders/ban110, where the Patients data set is located. You can change this value to point to a valid location on your computer.
2. The SET statement (in the commented-out alternative) brings in the observations from the Patients data set. Note the use of the KEEP= data set option, which is more efficient than a KEEP statement. If you are not familiar with the distinction, pay close attention: with a KEEP statement, all the variables from data set Patients are brought into the PDV (the program data vector, the place in memory that holds the variables and values). With the KEEP= data set option, only the variables Patno and Gender are read from data set Patients. For data sets with a large number of variables, using a KEEP= option is much more efficient than using a KEEP statement.
3. Use PROC FREQ to compute frequencies.
4. The TABLES statement lists the variable (Gender) for which you want to compute frequencies. You can use the two options NOCUM and NOPERCENT to suppress the output of cumulative statistics and percentages.
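The difference between the KEEP= data set option and the KEEP statement can be sketched side by side. Both steps below produce a data set containing only Patno and Gender. The sketch assumes the Clean libref is already assigned, and the data set names Check_Option and Check_Statement are hypothetical.

```sas
* KEEP= data set option: only Patno and Gender are read into the PDV;
data Check_Option;
   set Clean.Patients(keep=Patno Gender);
run;

* KEEP statement: every variable is read, then only two are written out;
data Check_Statement;
   set Clean.Patients;
   keep Patno Gender;
run;

* NOCUM and NOPERCENT leave only the frequency counts in the table;
proc freq data=Check_Option;
   tables Gender / nocum nopercent;
run;
```

The two output data sets are identical; the difference is in how much data the DATA step has to move while building them.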

There are several invalid values for Gender ('2', 'X', and lowercase 'f') as well as one missing value.
- The value of 'f' for Gender needs special attention. Raw data files sometimes contain values in mixed case. If you want to accept either upper- or lowercase values for Gender, you have several choices: The first option is to use the UPCASE function to change all lowercase values to uppercase. Another interesting and efficient option is to use the $UPCASE informat to convert all lowercase values for Gender (or any other variables you choose) to uppercase.
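Both options for handling mixed case can be sketched briefly. The first step assumes the Clean libref is assigned. The second reads two illustrative in-line records by hand; the data set names Gender_Upper and Gender_Read are hypothetical.

```sas
* Option 1: UPCASE function applied to an existing variable;
data Gender_Upper;
   set Clean.Patients(keep=Patno Gender);
   Gender = upcase(Gender);
run;

* Option 2: $UPCASE informat converts the value as it is read;
data Gender_Read;
   input @1 Patno  $3.
         @4 Gender $upcase1.;
datalines;
010f
001M
;

proc freq data=Gender_Read;
   tables Gender / nocum nopercent;
run;
```

With the informat, the lowercase value never reaches the data set, so no later cleanup step is needed for case.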

#### <font color=magenta> Q1. Activity  </font> 
- Reuse Program 1.2 (PROC FREQ) to detect character errors in the variables Dx and AE.


In [6]:
* put your code here and run it; 
proc freq data=clean.Patients; 
   tables dx ae; 
run;

Diagnosis Code
DX,Frequency,Percent,Cumulative Frequency,Cumulative Percent
1,7,30.43,7,30.43
2,2,8.70,9,39.13
3,3,13.04,12,52.17
4,3,13.04,15,65.22
5,3,13.04,18,78.26
6,1,4.35,19,82.61
7,2,8.70,21,91.30
X,2,8.70,23,100.00
Frequency Missing = 7

Adverse Event?
AE,Frequency,Percent,Cumulative Frequency,Cumulative Percent
0,20,66.67,20,66.67
1,9,30.00,29,96.67
A,1,3.33,30,100.00


### 3. Changing the Case of All Character Variables in a Data Set
__Program 1.3: Programming Technique to Perform an Operation on All Character Variables in a Data Set__


In [11]:

libname Clean '/folders/myfolders/lectures'; 
data Clean.Patients_Caps;
   set Clean.Patients;
   array Chars[*] _character_; *1;
   do i = 1 to dim(Chars); *2;
      Chars[i] = upcase(Chars[i]); *3;
   end;
   drop i;
run;

title "Listing the First 10 Observations in Data Set Patients_Caps";
proc print data=clean.Patients_Caps(obs=10) noobs; *4;
run;

PATNO,GENDER,VISIT,HR,SBP,DBP,DX,AE
,M,11/11/1998,90,190,100,,0
001,M,11/11/1998,88,140,80,1,0
002,F,11/13/1998,84,120,78,X,0
002,F,11/13/1998,84,120,78,X,0
003,X,10/21/1998,68,190,100,3,1
003,M,11/12/1999,58,112,74,,0
004,F,01/01/1999,101,200,120,5,A
006,,06/15/1999,72,102,68,6,1
006,F,07/07/1999,82,148,84,1,0
007,M,.,88,148,102,,0


1. Use the keyword _CHARACTER_ to create an array of all the character variables in data set Patients. It is important to note that the keyword _CHARACTER_, when used in a DATA step, refers to all the character variables at that point in the DATA step. This is important, because had you placed the ARRAY statement before the SET statement, there would be no variables in the array.

2. Because you do not always know how many character variables are in the data set you want to process, you use an asterisk (*) instead of the actual number. The DIM function takes as its argument the name of an array and returns the number of elements (variables) in the array.

3. Each element of the array is converted to uppercase. You could substitute LOWCASE or PROPCASE (proper case) functions instead of UPCASE if you wish.

4. The data set option OBS= is used to print the first 10 observations in the Patients_Caps data set.

#### <font color=magenta> Q2. Short answer question </font> 
- How many variables are there in the array Chars[*]? 

Answer: 4

- What are the names of the variables in the array Chars[*]? 

Answer: Patno, Gender, Dx, and AE


### 4. A Summary of Some Character Functions (Useful for Data Cleaning)
read this page: https://library-books24x7-com.libaccess.senecacollege.ca/assetviewer.aspx?bookid=127895&chunkid=580960570&rowid=91&noteMenuToggle=0&leftMenuState=1
    
    UPCASE, LOWCASE, and PROPCASE
    NOTDIGIT, NOTALPHA, and NOTALNUM
    VERIFY
    COMPBL
    COMPRESS
    MISSING
    TRIMN and STRIP 
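As a quick orientation before reading the reference page, the sketch below applies each function to a literal value and writes the results to the log. The variable names are hypothetical and the literals are chosen only for illustration.

```sas
* Sketch: each function applied to a literal value, results written to the log;
data _null_;
   Upper    = upcase('f');               * F;
   Proper   = propcase('john smith');    * John Smith;
   Pos_Dig  = notdigit('12X');           * 3, the position of the first nondigit;
   Pos_Alp  = notalpha('AB1');           * 3, the position of the first nonletter;
   Pos_Aln  = notalnum('A#1');           * 2, the position of the first other character;
   Pos_Ver  = verify('MFX', 'MF');       * 3, first character not in the list MF;
   One_Sp   = compbl('New    York');     * multiple blanks reduced to one;
   No_Sp    = compress('1 2 3');         * 123;
   Is_Miss  = missing(' ');              * 1 means true;
   Trimmed  = trimn('M   ');             * trailing blanks removed;
   Stripped = strip('  M  ');            * leading and trailing blanks removed;
   put Upper= Proper= Pos_Dig= Pos_Alp= Pos_Aln= Pos_Ver=;
   put One_Sp= No_Sp= Is_Miss= Trimmed= Stripped=;
run;
```

The NOT functions and VERIFY return 0 when every character passes the test, which is why they work directly as conditions in IF and WHERE statements.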

In [23]:
Proc contents data=Clean.Patients; 
run; 

Data Set Name,CLEAN.PATIENTS,Observations,30
Member Type,DATA,Variables,8
Engine,V9,Indexes,0
Created,12/19/2018 13:14:17,Observation Length,40
Last Modified,12/19/2018 13:14:17,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,YES
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information
Data Set Page Size,65536
Number of Data Set Pages,1
First Data Page,1
Max Obs per Page,1632
Obs in First Data Page,30
Number of Data Set Repairs,0
Filename,/folders/myfolders/lectures/ch01/patients.sas7bdat
Release Created,9.0401M5
Host Created,Linux
Inode Number,6879

Alphabetic List of Variables and Attributes
#,Variable,Type,Len,Format,Label
8,AE,Char,1,,Adverse Event?
6,DBP,Num,8,,Diastolic Blood Pressure
7,DX,Char,3,,Diagnosis Code
2,GENDER,Char,1,,Gender
4,HR,Num,8,,Heart Rate
1,PATNO,Char,3,,Patient Number
5,SBP,Num,8,,Systolic Blood Pressure
3,VISIT,Num,8,MMDDYY10.,Visit Date

Sort Information
Sortedby,PATNO VISIT
Validated,YES
Character Set,ASCII


### 5. Using a DATA Step to Detect Character Data Errors
__Program 1.4: Using a DATA _NULL_ Step to Detect Invalid Character Data__


In [9]:

libname Clean '/folders/myfolders/lectures'; 
title "Listing of invalid patient numbers and data values";

data _null_;
    set clean.patients; 
     file print;
    * check Patno: report missing values separately from nondigit values;
    if missing(Patno) then put
        "Patno is missing for obs " _n_;
    else if notdigit(trim(Patno)) then put
        Patno= "contains a nondigit character";

    * check Gender;
    if missing(Gender) then put
        Patno= "has a missing value for Gender";
    else if Gender not in ('M', 'F') then put
        Patno= "has an invalid Gender value: "
        Gender= ;

    * check Dx;
    if missing(Dx) then put
        Patno= "has a missing value for Dx";
    else if notdigit(trim(Dx)) then put
        Patno= "has an invalid Dx value: "
        Dx= ;

    * check AE;
    if missing(AE) then put
        Patno= "has a missing value for AE";
    else if AE not in ('0', '1') then put
        Patno= "has an invalid AE value: "
        AE= ;
    
    
run;

Note that patient 002 appears twice in this output. This occurs because there is a duplicate observation for patient 002 (in addition to several other purposely included errors), so that the data set can be used for examples later in this book, such as the detection of duplicate IDs and duplicate observations.

#### <font color=magenta> Q3. Short answer question</font>  

Given the invalid values for Gender, specifically 
    
    PATNO=003 has an invalid gender value: GENDER=X 
What do you suggest doing to fix this error?

Answer: 
There are several options:
- Set the invalid value to missing and keep the record.
- Remove the record.
- Impute the Gender value from the distribution of Gender in the data set.


#### <font color=magenta> Q4. Short answer question</font>  

Given the following invalid value for gender:  
    
    PATNO=010 has an invalid gender value: GENDER=f  
What do you suggest doing to fix this error?

Answer: 
Convert all values of Gender to uppercase (f -> F, m -> M), for example with the UPCASE function.
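The answers to Q3 and Q4 can be combined in a single DATA step, sketched below. It assumes the Clean libref is assigned. The output data set name Patients_Fixed is hypothetical, and setting invalid codes to missing is only one of the possible choices for Q3.

```sas
data Patients_Fixed;
   set Clean.Patients;
   * Q4: convert lowercase codes such as f to F;
   Gender = upcase(Gender);
   * Q3: set any remaining value other than M or F to missing;
   if Gender not in ('M', 'F') then Gender = ' ';
run;

* Confirm that only F, M, and missing remain;
proc freq data=Patients_Fixed;
   tables Gender / missing;
run;
```

Writing to a new data set rather than overwriting Clean.Patients keeps the original values available for review.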


### 6. Using PROC PRINT with a WHERE Statement to Identify Data Errors
__Program 1.6: Using PROC PRINT with a WHERE Statement to Check for Data Errors__


In [25]:
libname Clean '/folders/myfolders/lectures';
title "Using PROC Print to Identify Data Errors";

proc print data=Clean.patients; 
    id Patno; 
    var Gender AE; 
    where notdigit(Patno) or 
        gender not in ('M', 'F') or 
        AE not in ('0', '1');
                   
run;


PATNO,GENDER,AE
,M,0
003,X,1
004,F,A
006,,1
010,f,0
013,2,0
023,f,0
XX5,M,0


In [5]:
libname Clean '/folders/myfolders/lectures';
title; 

proc print data=Clean.patients; 
    id Patno; 
     
    where patno='010';
                   
run;

PATNO,GENDER,VISIT,HR,SBP,DBP,DX,AE
010,f,10/19/1999,.,40,120,1,0


### 7. Using Formats to Check for Invalid Values
__Program 1.7: Using a User-Defined Format to Check for Invalid Values of Gender__

In [26]:
libname Clean '/folders/myfolders/lectures'; 
*Using formats to identify data errors;
title "Listing Invalid Values of Gender";

* define a gender format; 
proc format; 
 value $gender_check 'M', 'F'= 'valid'
                       ' '   = 'missing'
                       other = 'Error'; 
run; 

* use proc freq to compute frequencies on the formatted values; 
proc freq data=Clean.patients; 
    tables Gender / nocum nopercent missing; 
    format Gender $gender_check.; 
run; 

Gender
GENDER,Frequency
missing,1
Error,4
valid,25


Another way to check for invalid values of a character variable is with user-defined formats. Let's use a user-defined format to check for invalid values of Gender. Program 1.7 formats all values of Gender to 'Valid', 'Missing', or 'Error'. It then uses PROC FREQ to compute frequencies for each of these three values.

- Besides the two TABLES options NOCUM and NOPERCENT, this program includes the MISSING option as well. The MISSING option treats missing values as a valid category and places counts for missing values in the body of the table instead of at the bottom of the listing.

- Because the PROC FREQ step includes a FORMAT statement, it computes frequencies on the formatted values rather than on the raw values.

__Program 1.8: Using a User-Defined Format to Identify Invalid Values__

In [2]:
libname Clean '/folders/myfolders/lectures'; 

*Using formats to identify data errors;

title "Listing Invalid Values of Gender";
proc format;
   value $Gender_Check 'M','F' = 'Valid'
                       ' '     = 'Missing'
                       other   = 'Error';
run;

data _null_; 
    set clean.patients (keep=patno Gender); 
    file print; 
    if put(Gender, $gender_check.) = 'Missing' then 
    put 
        "Missing value for Gender for patient " Patno ; 
    else if put(Gender, $gender_check.) = 'Error' then 
    put 
        "Invalid value of Gender: " Gender "for patient " Patno; 
    
run; 

Before looking at the output from this program, let's review the PUT function. Start with the PUT statement: it takes a value (character or numeric) and writes out formatted values to a file or other location. The PUT function performs a similar operation. It takes a value, formats this value, and "writes" out the value to a character variable. Because format values are always character values, the result of a PUT function is always a character value. To help clarify this, take a look at the following SAS statement:


    Char_Var = put(Gender,$Gender_Check.);

Char_Var will have one of three values: 'Valid', 'Missing', or 'Error'. By the way, the length of a variable created with a PUT function is the length of the longest formatted value. In this example, the length of Char_Var is 7 (the length of the value 'Missing').
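To see the PUT function at work, the following sketch stores the formatted value in a new variable and reports its length. It assumes the Clean libref is assigned. The data set name Gender_Status and the variable Len are hypothetical.

```sas
proc format;
   value $Gender_Check 'M','F' = 'Valid'
                       ' '     = 'Missing'
                       other   = 'Error';
run;

data Gender_Status;
   set Clean.Patients(keep=Patno Gender);
   * Char_Var receives the formatted value as a character string;
   Char_Var = put(Gender, $Gender_Check.);
   * VLENGTH returns the defined length of a variable;
   Len = vlength(Char_Var);
run;

proc print data=Gender_Status(obs=5) noobs;
run;
```

Because Char_Var is an ordinary character variable, it can be used in IF statements, WHERE clauses, or BY groups like any other.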


Why would you use formats to check for invalid character values when this was easily done without all the added complication? There are several advantages to the format approach. 
1. First, if you have multiple studies with similar variables, you can create your formats and place them in a permanent format library. You can then use one or two SAS statements (using the PUT function) to test for invalid values. 
2. Another advantage to using formats is efficiency. Formats are stored in memory so if you have very large data sets, a program using formats should run faster than an alternative program that does not use formats.
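The first advantage can be sketched as follows: the format is stored once in a permanent catalog and then reused by any program that adds the library to the format search path. Using the Clean library as the storage location is an assumption made for illustration.

```sas
* Store the format in a catalog in the Clean library instead of WORK;
proc format library=Clean;
   value $Gender_Check 'M','F' = 'Valid'
                       ' '     = 'Missing'
                       other   = 'Error';
run;

* Tell SAS to search the Clean library when resolving format names;
options fmtsearch=(Clean);

* A later program needs only the PUT function to test the values;
data _null_;
   set Clean.Patients(keep=Patno Gender);
   file print;
   if put(Gender, $Gender_Check.) ne 'Valid' then put Patno= Gender=;
run;
```

Once the format is stored, later programs can omit the PROC FORMAT step and keep only the OPTIONS statement.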

### 8. Counting missing values for all character variables

__Program 7.2: Counting Missing Values for Character Variables__


In [30]:
libname Clean '/folders/myfolders/lectures'; 

title "Checking Missing Character Values";

Proc format;
    value $count_missing  ' '= 'missing'
                            other= 'Nonmissing'; 
run; 

proc freq data=Clean.patients; 
 tables _character_ / nocum missing; 
 format _character_  $count_missing. ; 
 run; 





Patient Number
PATNO,Frequency,Percent
missing,1,3.33
Nonmissing,29,96.67

Gender
GENDER,Frequency,Percent
missing,1,3.33
Nonmissing,29,96.67

Diagnosis Code
DX,Frequency,Percent
missing,7,23.33
Nonmissing,23,76.67

Adverse Event?
AE,Frequency,Percent
Nonmissing,30,100.00


#### <font color=magenta> Q5. Text completion question</font>  


A missing value for a categorical (character) variable is represented with ___________  

Answer: 
 A blank (' '). SAS also treats a null string ('') as a missing character value; in our example the format tests for a blank (' ').
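A short check of this answer, with hypothetical variable names C1 and C2, is sketched below. Both messages should be written to the log.

```sas
data _null_;
   length C1 C2 $3;
   C1 = ' ';
   C2 = '';
   * SAS stores a missing character value as blanks, so the two values compare as equal;
   if C1 = C2 then put 'A blank and a null string compare as equal';
   if missing(C1) and missing(C2) then put 'Both values are missing';
run;
```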


### Conclusions 
Most of the techniques for checking the validity of character data are straightforward. 
- For variables with only a few possible valid values (such as Gender or Race), PROC FREQ can determine if there are invalid values in the data set. 
- You can use a DATA _NULL_ program to identify which subjects have invalid values. 
- In some cases, PROC PRINT with a WHERE statement can identify errors.
- Another way to identify invalid character data is to create a user-defined format specifying what values are valid, invalid, or missing. 