##  Chapter 17: Transformations of Categorical Variables (Svolba)
## <font color=blue> 17.1 Introduction </font>

By categorical variables we mean binary, nominal, or ordinal variables. These variables are not used in calculations; instead, they define categories.

In this tutorial, we will investigate methods that can be used to create derived variables from categorical variables. The following topics are covered: 

- General considerations for categorical variables such as __formats and conversions__ between interval and categorical variables.

- __Derived variables__, where we will see which derived variables we can create from categorical information.

- __Dummy coding of categorical variables__, where we will show how categorical information can be used in analyses that allow only interval variables.




## <font color=blue> 17.2 General Considerations for Categorical Variables </font>

### 17.2.1 Numeric or Character Format
Categorical information can be stored in a variable either in the form of a category name itself or in the form of a category code. 

In many cases the category code corresponds to a category name, which is usually stored in a character variable.

The following table shows possible representations of the variable GENDER:

<table>
    <tr>
        
        <td> Category Name </td> <td> Character Code </td><td> Numeric Code </td>
    </tr>    
    <tr>
     <td> Male </td> <td> M </td><td> 1 </td>
    </tr> 
      <tr>
     <td> Female </td> <td> F </td><td> 0 </td>
    </tr>
    
    
</table>


### 17.2.2 SAS Formats

SAS formats are an advantageous way to assign a category name depending on the category code. They are efficient for storage considerations and allow easy maintenance of category names.

The following example shows how formats can be used to assign the category name to the category code. 
- A SAS format for the numeric-coded gender variable is created. A format for the character-coded variable would be defined analogously, with a format name that starts with `$`.
- These formats are used in PROC PRINT when the data set GENDER_EXAMPLE is printed.


In [3]:
proc format;
    value gender 1= 'MALE'
                0= 'FEMALE'; 
     
                    
run; 

DATA gender_example;
 INPUT Gender Gender2 $;
 DATALINES;
 1  M
 1  M
 0  F
 0  F
 1  M
 ;
RUN;

proc print data=gender_example;
    FORMAT Gender gender. ; 
run;

 

Obs,Gender,Gender2
1,1,M
2,1,M
3,0,F
4,0,F
5,1,M
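
In the example above, the character-coded variable Gender2 is printed without a format. A minimal sketch of a matching character format follows. The format name `$gender_c` is an assumption chosen for illustration, and the code relies on the `gender.` format and the GENDER_EXAMPLE data set created above.

```sas
/* Sketch: character format for the character-coded variable Gender2.
   The name $gender_c is hypothetical, not part of the original example. */
proc format;
    value $gender_c 'M' = 'MALE'
                    'F' = 'FEMALE';
run;

/* Apply the numeric format to Gender and the character format to Gender2 */
proc print data=gender_example;
    format Gender gender. Gender2 $gender_c.;
run;
```

Because formats only change how values are displayed, the stored codes (1/0 and M/F) remain unchanged in the data set.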


### 17.2.3 Converting between Character and Numeric Formats
For some data management tasks it is necessary to convert the format of a variable from character to numeric or vice versa. It is advisable to explicitly convert them by using the functions INPUT or PUT.

The conversion from <font color=red>__character to numeric__</font> in our preceding gender example is done with the following statement:


    NEW_VAR = INPUT(VAR, 2.);

The conversion from <font color=red>__numeric to character__</font> in our preceding gender example is done with the following statement:


    NEW_VAR = PUT(VAR, 2.);


In [11]:
data gender_example_new; 
    set gender_example; 
    gender_char=put(gender, 3.);
    gender_char_num= input(gender_char, 3.); 
run; 


proc contents data=gender_example_new;
run;


Data Set Name,WORK.GENDER_EXAMPLE_NEW,Observations,5
Member Type,DATA,Variables,4
Engine,V9,Indexes,0
Created,09/20/2019 16:05:29,Observation Length,32
Last Modified,09/20/2019 16:05:29,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,NO
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information
Data Set Page Size,65536
Number of Data Set Pages,1
First Data Page,1
Max Obs per Page,2038
Obs in First Data Page,5
Number of Data Set Repairs,0
Filename,/tmp/SAS_workD1D500006EB2_localhost.localdomain/gender_example_new.sas7bdat
Release Created,9.0401M6
Host Created,Linux
Inode Number,671644

Alphabetic List of Variables and Attributes
#,Variable,Type,Len
1,Gender,Num,8
2,Gender2,Char,8
3,gender_char,Char,3
4,gender_char_num,Num,8
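
When a character variable contains text that is not a valid number, INPUT returns a missing value and writes a note to the log. The sketch below, which uses hypothetical names (`convert_demo`, `raw`, `num`, `padded`), shows the `??` modifier that suppresses those notes, and a PUT call with the `Z3.` format that restores leading zeros.

```sas
/* Sketch only: convert_demo, raw, num, and padded are hypothetical names */
data convert_demo;
    input raw $3.;
    /* ?? suppresses the invalid-data note for text that is not a number;
       such values become missing instead of flagging an error */
    num = input(raw, ?? 3.);
    /* Z3. writes the number back as text with leading zeros */
    padded = put(num, z3.);
    datalines;
001
XX5
012
;
run;

proc print data=convert_demo;
run;
```

Here the value XX5 cannot be read as a number, so `num` is missing for that row; the other two rows convert to 1 and 12.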


#### <font color=magenta> Q1. Activity  </font> 
- Given the patients dataset, define a format that displays gender as numeric codes: 1 for male and 0 for female, matching the coding in Section 17.2.1.
- Use PROC PRINT to display the dataset, or PROC FREQ to check the formatted values.

In [2]:
libname Clean '/folders/myfolders/ban110'; 
proc format;
    value $gender_code 'M' = 1
                       'F' = 0;
run;
DATA CLEAN.PATIENTS;
   INFILE "~/ban110/Patients.TXT" ;
   INPUT @1  PATNO    $3.
         @4  GENDER   $1.
         @5  VISIT    MMDDYY10.
         @15 HR       3.
         @18 SBP      3.
         @21 DBP      3.
         @24 DX       $3.
         @27 AE       $1.;

   LABEL PATNO   = "Patient Number"
         GENDER  = "Gender"
         VISIT   = "Visit Date"
         HR      = "Heart Rate"
         SBP     = "Systolic Blood Pressure"
         DBP     = "Diastolic Blood Pressure"
         DX      = "Diagnosis Code"
         AE      = "Adverse Event?";

   FORMAT VISIT MMDDYY10. gender $gender_code.;

RUN;

title "Frequencies for gender";
proc freq data=clean.patients;
   tables gender / nocum nopercent;
run;


Gender,Gender
GENDER,Frequency
2,1
0,12
1,13
X,1
f,2
Frequency Missing = 1


## <font color=blue> 17.3 Derived Variables </font>

Derived variables for categorical data are usually obtained either from the frequency distribution of values or from the extraction and combination of hierarchical codes.
The most common derived variables are:

- Extracting elements of hierarchical codes
- Indicators for the most frequent group
- IN= variables in the SET or MERGE statement to create binary indicator variables


### 17.3.1 Extracting Elements of Hierarchical Codes

Different digits or characters of numeric or alphanumeric codes can have certain meanings. These codes can be hierarchical or multidimensional.
- In a hierarchical code different elements of the code can define different hierarchies. For example, a product code can contain the PRODUCTMAINGROUP code in the first character.

- In the case of multidimensional codes, different characters can contain different sub-classifications. For example, a medical disease code might contain the disease in the first two digits and a classification of the location in the body in the third and fourth digits.

By extracting certain digits from a code, derived variables can be created. The following example shows a Product Code that contains the ProductMainGroup in the first digit, and in the second and third digits the hierarchical underlying subproduct group. Extracting the code for the ProductMainGroup can be done as in the following example:



In [3]:
data codes;
   input ProductCode $ 3.;
datalines;
216              
305               
404               
233              
105               
311               
290              
;

data codes;
    set codes; 
    productmainGroup= SUBSTR(ProductCode, 1, 1) ; 
run; 


proc print data=codes; 
run; 





Obs,ProductCode,productmainGroup
1,216,2
2,305,3
3,404,4
4,233,2
5,105,1
6,311,3
7,290,2
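
The multidimensional case described in this section can be handled with the same SUBSTR technique. The following is a sketch under assumptions: the data set `disease_codes`, the four-digit layout, and the code values are all invented for illustration.

```sas
/* Sketch with hypothetical data: digits 1-2 = disease, digits 3-4 = body location */
data disease_codes;
    length Disease Location $2;     /* avoid inheriting the length of DiseaseCode */
    input DiseaseCode $4.;
    Disease  = substr(DiseaseCode, 1, 2);   /* first dimension  */
    Location = substr(DiseaseCode, 3, 2);   /* second dimension */
    datalines;
1203
1207
4503
;
run;

proc print data=disease_codes;
run;
```

Declaring the lengths explicitly keeps the derived variables at two characters; without the LENGTH statement, SUBSTR results take the length of the source variable.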


#### <font color=magenta > Q2. Activity </font> 
- Use the codes dataset and create a new variable named productSubgroup by extracting the second and third digits of the product code.



In [4]:
data codes;
    set codes; 
    productsubGroup= SUBSTR(ProductCode, 2, 2) ; 
run; 


proc print data=codes; 
run; 


Obs,ProductCode,productmainGroup,productsubGroup
1,216,2,16
2,305,3,05
3,404,4,04
4,233,2,33
5,105,1,05
6,311,3,11
7,290,2,90


#### <font color=magenta > Q3. Activity </font> 
- Use the patients dataset 
- Create day, month, and year columns based on visit.
- Use PROC PRINT to display the dataset.


Using functions to take apart date values: 
    
    day(date) returns the day of the month from a SAS date value (date)
    month(date) returns the month from a SAS date value (date)
    year(date) returns the year from a SAS date value (date)
ref: https://newonlinecourses.science.psu.edu/stat481/node/72/ 

In [5]:
data clean.patient_date_component;
    set clean.patients;
    day = day(visit);
    month =month(visit);
    year = year(visit);
run;

proc print data = clean.patient_date_component;
run;

Obs,PATNO,GENDER,VISIT,HR,SBP,DBP,DX,AE,day,month,year
1,001,1,11/11/1998,88,140,80,1,0,11,11,1998
2,002,0,11/13/1998,84,120,78,X,0,13,11,1998
3,003,X,10/21/1998,68,190,100,3,1,21,10,1998
4,004,0,01/01/1999,101,200,120,5,A,1,1,1999
5,XX5,1,05/07/1998,68,120,80,1,0,7,5,1998
6,006,,06/15/1999,72,102,68,6,1,15,6,1999
7,007,1,.,88,148,102,,0,.,.,.
8,,1,11/11/1998,90,190,100,,0,11,11,1998
9,008,0,08/08/1998,210,.,.,7,0,8,8,1998
10,009,1,09/25/1999,86,240,180,4,1,25,9,1999


### 17.3.2 Indicators for Special Categories
One way to create a derived variable is to compare the value for a subject with the distribution of values in the population.

Examples of these properties include the following:

- the most frequent group
- the group with the highest average for an interval variable

The creation process of such an indicator variable has two steps:

   1.  Identify the category. This is done by calculating a simple or advanced descriptive statistic.

   2. Create the indicator variable.

If we want to create an indicator variable for the most common ProductMainGroup of the preceding example, we first use PROC FREQ to create a frequency table:


In [3]:
proc freq data=codes order=freq; 
    table ProductMainGroup;
run; 

productmainGroup,Frequency,Percent,Cumulative Frequency,Cumulative Percent
2,3,42.86,3,42.86
3,2,28.57,5,71.43
1,1,14.29,6,85.71
4,1,14.29,7,100.0


Comments: 
-  From the output, we see that ProductMainGroup 2 is the most frequent. We can now create the indicator variable with the following statements:


        IF ProductMainGroup = '2' THEN ProductMainGroupMF = 1;
        ELSE ProductMainGroupMF =0;

With these derived variables we can create indicators that describe each subject in relation to other subjects.

In [4]:
data codes; 
    set codes; 
    IF ProductMainGroup = '2' THEN ProductMainGroupMF = 1;
    ELSE ProductMainGroupMF =0;
run; 

proc print data=codes; 
run;

Obs,ProductCode,productmainGroup,ProductMainGroupMF
1,216,2,1
2,305,3,0
3,404,4,0
4,233,2,1
5,105,1,0
6,311,3,0
7,290,2,1
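
The two steps above hard-code the value '2' after reading the frequency table by eye. As a sketch of one way to automate this (not the method used in the original), the most frequent category can be passed through a macro variable. The names `freq_main`, `top_group`, and `codes_mf` are hypothetical, and ties are broken arbitrarily by the sort.

```sas
/* Step 1: write the frequency table to a data set instead of printing it */
proc freq data=codes noprint;
    tables ProductMainGroup / out=freq_main;   /* contains COUNT and PERCENT */
run;

/* Put the most frequent category first */
proc sort data=freq_main;
    by descending count;
run;

/* Store the top category in the macro variable top_group */
data _null_;
    set freq_main(obs=1);
    call symputx('top_group', ProductMainGroup);
run;

/* Step 2: create the indicator from the stored category */
data codes_mf;
    set codes;
    ProductMainGroupMF = (ProductMainGroup = "&top_group");
run;
```

With this approach the indicator follows the data if the most frequent group changes, at the cost of an extra pass over the frequency table.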


### 17.3.3 Using IN Variables to Create Derived Variables

The logical IN= variables of the SET or MERGE statement in a DATA step make it easy to create derived variables that can be used for analysis. These variables are binary indicators because they record whether a subject has an entry in a certain table.

In the following coding example we have data from the call center records and from Web usage. Both tables are already aggregated per customer and have only one row per customer. These two tables are merged with the CUSTOMER_BASE table. In the resulting data set, variables that indicate whether a subject has an entry in the corresponding table or not are created.

All data sets must first be sorted by CustomerID:

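    /* Sort every input table by the merge key */
    proc sort data=customer_base out=customer_base;
        by CustomerID;
    run;

    proc sort data=Call_center_aggr out=Call_center_aggr;
        by CustomerID;
    run;

    proc sort data=Web_usage_aggr out=Web_usage_aggr;
        by CustomerID;
    run;

    /* Merge the tables; each IN= variable is 1 when the table contributes a row */
    DATA customer;
        MERGE customer_base    (IN=in1)
              Call_center_aggr (IN=in2)
              Web_usage_aggr   (IN=in3);
        BY CustomerID;
        HasCallCenterRecord = in2;   /* 1 = customer has call center records */
        HasWebUsage = in3;           /* 1 = customer has Web usage records   */
    RUN;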
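
The tables in the example above are not defined in this tutorial. The following self-contained sketch uses small invented tables (`base_demo`, `calls_demo`, `web_demo`, with hypothetical columns `Calls` and `Sessions`) to show what the IN= indicators look like. The `IF in1;` line is an added assumption that keeps only customers found in the base table.

```sas
/* Hypothetical demo tables, already sorted by CustomerID */
data base_demo;
    input CustomerID;
    datalines;
1
2
3
;
run;

data calls_demo;
    input CustomerID Calls;
    datalines;
1 4
3 1
;
run;

data web_demo;
    input CustomerID Sessions;
    datalines;
2 10
3 7
;
run;

data customer_demo;
    merge base_demo  (in=in1)
          calls_demo (in=in2)
          web_demo   (in=in3);
    by CustomerID;
    if in1;                       /* assumption: keep base-table customers only */
    HasCallCenterRecord = in2;    /* IN= variables are dropped automatically,   */
    HasWebUsage = in3;            /* so they are copied to permanent variables  */
run;

proc print data=customer_demo;
run;
```

In this sketch customer 1 has only a call center record, customer 2 has only Web usage, and customer 3 has both, so the two indicators are (1,0), (0,1), and (1,1) respectively.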

## <font color=blue> 17.5 Dummy Coding of Categorical Variables </font>
While categorical information in variables such as REGION or GENDER is very straightforward to understand for humans, statistical methods mostly can't deal with it. The simple reason is that statistical methods calculate measures such as estimates, weights, factors, or probabilities, and values such as MALE or NORTH CAROLINA can't be used in calculations.

To make use of this type of information in analytics, so-called dummy variables are built. A set of dummy variables represents the content of a categorical variable by creating an indicator variable for (almost) each category of the categorical variable.


For example, a linear regression of Weight on Height, Age, and a binary indicator for Sex can be written as:

\begin{align}
Weight & = \beta_0 + \beta_1*Height + \beta_2*Age +  \beta_3*Sex\_binary
\end{align}


In [80]:
 proc print data=sashelp.class;
 run;
 

Obs,Name,Sex,Age,Height,Weight
1,Alfred,M,14,69.0,112.5
2,Alice,F,13,56.5,84.0
3,Barbara,F,13,65.3,98.0
4,Carol,F,14,62.8,102.5
5,Henry,M,14,63.5,102.5
6,James,M,12,57.3,83.0
7,Jane,F,12,59.8,84.5
8,Janet,F,15,62.5,112.5
9,Jeffrey,M,13,62.5,84.0
10,John,M,12,59.0,99.5
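
The regression equation above can be fitted once Sex is recoded as a 0/1 variable. A minimal sketch follows, assuming that `Sex_binary` is coded 1 for male and 0 for female; the equation itself does not fix the direction of the coding, and the data set name `class_reg` is hypothetical.

```sas
/* Recode the character variable Sex as a numeric indicator */
data class_reg;
    set sashelp.class;
    Sex_binary = (Sex = 'M');   /* assumption: 1 = male, 0 = female */
run;

/* Fit Weight = b0 + b1*Height + b2*Age + b3*Sex_binary */
proc reg data=class_reg;
    model Weight = Height Age Sex_binary;
run;
quit;
```

Under this coding, the coefficient of `Sex_binary` is the estimated difference in Weight between male and female students of the same Height and Age.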


### 17.5.1 Preliminary Example
The variable employment status can take the values EMPLOYED, RETIRED, EDUCATION, UNEMPLOYED. The table below shows the resulting dummy variables for this variable. This type of coding of dummy variables is also called GLM coding. Each category is represented by one dummy variable.

<table align=left>
    <tr>             
        <td> Employment_Status </td> <td> Employed </td><td> Retired </td> <td> Education </td>  <td> Unemployed </td> 
    </tr>    
    <tr>
     <td> Employed </td> <td> 1 </td><td> 0 </td><td> 0 </td><td> 0 </td>
    </tr> 
      <tr>
     <td> Retired </td> <td> 0 </td><td> 1 </td><td> 0 </td><td> 0 </td>
    </tr>
     <tr>
     <td> Education </td> <td> 0 </td><td> 0 </td><td> 1 </td><td> 0 </td>
    </tr>
     <tr>
     <td> Unemployed </td> <td> 0 </td><td> 0 </td><td> 0 </td><td> 1 </td>
    </tr>
    
    
</table>





### 17.5.2 GLM Coding (Reference Coding)
In regression analysis, the important point with dummy variables is that if an intercept is estimated, one of the categories can be omitted in the dummy variables. Because the whole set of dummy variables for one subject would sum to 1 and an intercept is present in the design matrix, which is usually also represented by 1, one of k categories can be determined from the values of the other k-1 variables.

The same is true for binary variables, e.g., gender male/female, where only one binary dummy variable such as MALE (0/1) is needed to represent the information sufficiently.

In regression analysis, for one categorical variable with k categories, only k-1 dummy variables are used. The omitted category is referred to as the reference category, and the estimates for the dummy variables are interpreted as differences between the respective categories and the reference category.

A GLM coding, where one category is not represented by a dummy variable but is treated as the reference category, is also referred to as reference coding.

In our EMPLOYMENT_STATUS example this means that we provide only three dummy variables (EMPLOYED, UNEMPLOYED, and RETIRED) for the four categories. The coefficients of these three variables are interpreted as the difference from the category EDUCATION.

It is advisable to choose as the reference category the one that yields interpretable results. For example, for the variable CREDITCARDTYPE with the values NO_CARD, CREDIT_CARD, and GOLD_CARD, the reference category would presumably be NO_CARD, so that each coefficient is interpreted as the effect of holding a certain type of card.


### 17.5.4 Program Statements

Dummy variables for GLM or reference coding can be created in the same way:


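    DATA CUSTOMER;
     SET CUSTOMER;
      /* Initialize all dummies to 0 so that non-matching rows are 0, not missing */
      Employed = 0; Unemployed = 0; Education = 0; Retired = 0;
      SELECT (Employment_Status);
      WHEN ('Employed')   Employed  =1;
      WHEN ('Unemployed') Unemployed=1;
      WHEN ('Education')  Education =1;
      WHEN ('Retired')    Retired   =1;
      OTHERWISE;  /* avoids an error for unmatched or missing values */
      END;
    RUN;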

They can, however, also be created by the use of a Boolean expression.


    DATA CUSTOMER;
     SET CUSTOMER;
      Employed   = (Employment_Status = 'Employed');
      Unemployed = (Employment_Status = 'Unemployed');
      Retired    = (Employment_Status = 'Retired');
      Education  = (Employment_Status = 'Education');
    RUN;
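
Both programs above create a dummy for every category (GLM coding). For reference coding as described in Section 17.5.2, the dummy for the reference category is simply left out. The sketch below uses an invented data set `employment_demo` with one row per category and treats Education as the reference category.

```sas
/* Hypothetical data: one row per employment status */
data employment_demo;
    length Employment_Status $10;
    input Employment_Status $;
    /* Education is the reference category, so it gets no dummy variable */
    Employed   = (Employment_Status = 'Employed');
    Unemployed = (Employment_Status = 'Unemployed');
    Retired    = (Employment_Status = 'Retired');
    datalines;
Employed
Education
Retired
Unemployed
;
run;

proc print data=employment_demo;
run;
```

The Education row has all three dummies equal to 0, which is how the reference category is identified in the design matrix.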



#### <font color=magenta > Q4. Activity  </font> 

Use sashelp.class and create two dummy variables (male and female) for the Sex variable.


In [9]:
data work.class;
    set sashelp.class;
    male = (sex = 'M');
    female = (sex = 'F');
run;

proc print data=work.class;
run;

Obs,Name,Sex,Age,Height,Weight,male,female
1,Alfred,M,14,69.0,112.5,1,0
2,Alice,F,13,56.5,84.0,0,1
3,Barbara,F,13,65.3,98.0,0,1
4,Carol,F,14,62.8,102.5,0,1
5,Henry,M,14,63.5,102.5,1,0
6,James,M,12,57.3,83.0,1,0
7,Jane,F,12,59.8,84.5,0,1
8,Janet,F,15,62.5,112.5,0,1
9,Jeffrey,M,13,62.5,84.0,1,0
10,John,M,12,59.0,99.5,1,0
