### GWU STAT 4197/STAT 6197
##### Week 6 SAS Code Examples: Combining SAS Data Sets

[Tilanus, E.W. (2008). SET, MERGE and beyond. SAS Global Forum.](https://support.sas.com/resources/papers/proceedings/pdfs/sgf2008/167-2008.pdf)



##### Combining Data Using DATA Step
* SET statement
* MERGE statement
* UPDATE statement
* MODIFY statement

##### Concatenating SAS Data Sets Using the SET Statement
At compile time, SAS reads the descriptor portion of each data set named in the SET statement, in the order listed, and adds each new variable to the program data vector (PDV).

When a variable appears in more than one input data set, its length in the output data set is the length it has in the first data set that contains it.
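The length rule can be seen with a small sketch. The data sets `short`, `long`, `truncated`, and `kept` are hypothetical names made up for this illustration. A LENGTH statement placed before SET is one common way to avoid the truncation.

```sas
/* Hypothetical data sets: Name is $3 in SHORT and $10 in LONG */
data short;
  length Name $ 3;
  Name = 'Ann';
run;
data long;
  length Name $ 10;
  Name = 'Alexander';
run;

/* Name takes length 3 from SHORT (listed first), so 'Alexander' is stored as 'Ale' */
data truncated;
  set short long;
run;

/* Declaring the length before SET keeps the full value */
data kept;
  length Name $ 10;
  set short long;
run;

proc print data=truncated noobs; run;
proc print data=kept noobs; run;
```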



In [1]:
*Ex1_concat_interleave.sas ;
ods html close;
options nocenter nonumber nodate nosource nonotes;
DATA D1;
  INPUT CustID Month $ purchased_amount;
  DATALINES; 
  11 Jan 237.4 
  12 Jan 249.2 
  13 Jan 227.7 
  ;





In [2]:
ods html close;
options nocenter nonumber nodate nosource nonotes;
DATA D2;
 INPUT CustID Month $ purchased_amount;
 DATALINES;
 11 Feb 288.2 
 12 Feb 221.7 
 13 Feb 274.4  
 14 Feb 222.9
;





In [3]:
 DATA concat; 
   SET D1 D2; 
 run;
title 'Use the SET statement to concatenate data sets';
PROC PRINT DATA=concat noobs ; 
run;

CustID,Month,purchased_amount
11,Jan,237.4
12,Jan,249.2
13,Jan,227.7
11,Feb,288.2
12,Feb,221.7
13,Feb,274.4
14,Feb,222.9


In [None]:
*Ex1_concat_interleave.sas;
options nocenter nonumber nodate;
Data master; set D1; run;
Data add; set D2; run;
proc append base=master  data=add;
run;
title1 'Use PROC APPEND to concatenate data sets';
proc print data=master noobs;
run;


In [None]:
*Ex1_concat_interleave.sas (Part 3);
options nocenter nonumber nodate;

proc sql;
 create table concat_sql as
 select * from D1
   union 
 select * from D2
 order by Month desc;
 title1 'Vertical Joining Using PROC SQL';
 select * from concat_sql;
quit;

##### Interleaving the Data Sets

The observations in the new data set are arranged by the values of the BY variable or variables. Within each BY group, they appear in the order in which the data sets are listed in the SET statement. In other words, when a SET statement that names two or more data sets is followed by a BY statement, as shown below, the data sets are interleaved rather than simply concatenated. Each input data set must already be sorted (or indexed) by the BY variables.


In [4]:

*Ex1_concat_interleave.sas (Part 4);
options nocenter nonumber nodate;
*Interleave Data Sets D1 and D2;
* Sort in the same order that the BY statement in the DATA step expects;
PROC SORT DATA=D1 out=D1sorted; 
  BY CustID descending Month; 
run;
PROC SORT DATA=D2 out=D2sorted; 
  BY CustID descending Month; 
run;
DATA interleave;
 SET D1sorted D2sorted ; 
   BY CustID descending Month;
run;

PROC PRINT DATA=interleave noobs;
title1 'Data - Interleaved';
run;

CustID,Month,purchased_amount
11,Jan,237.4
11,Feb,288.2
12,Jan,249.2
12,Feb,221.7
13,Jan,227.7
13,Feb,274.4
14,Feb,222.9


With the UNION operator in PROC SQL, the rows from the intermediate result sets are concatenated. 

By default, UNION removes duplicate rows from the final result; UNION ALL keeps them.
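A minimal sketch of the difference between UNION and UNION ALL. The data sets `a` and `b` are hypothetical and share one identical row (2, 20).

```sas
/* Hypothetical inputs with one row in common */
data a;
  input x y;
  datalines;
1 10
2 20
;
data b;
  input x y;
  datalines;
2 20
3 30
;
proc sql;
  title1 'UNION: the shared row appears once';
  select * from a
    union
  select * from b;

  title1 'UNION ALL: the shared row appears twice';
  select * from a
    union all
  select * from b;
quit;
title;
```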



In [None]:
*Ex1_concat_interleave.sas (Part 5);
options nocenter nonumber nodate;
proc sql;
 create table concat_sql_i as
 select * from D1
   union 
 select * from D2
 order by CustID, Month desc;
  title1
 'Vertical Joining Using PROC SQL /Interleaved';
 select * from concat_sql_i;
quit;

Create a data set whose variables have the same attributes as those in an existing SAS data set (code idea from Martha Messineo, 2017).

In [None]:
*Ex1_concat_interleave.sas (Part 6);
options nocenter nonumber nodate;
  data class1 ;
   set sashelp.class;
  run; 
  
data class2;
  if (0) then set SASHELP.CLASS;
  input name sex age height weight;
  datalines;
  Kia F 13 62  102
  ; 
proc append base=class1 data=class2;
run;
title1 'Appending data sets (PROC APPEND)';
proc print data=class1; 
run;

##### Create Two Example Data Sets (Birth and Death Files) for Merging

In [5]:
*Ex2_match_merge_sql_outer.sas (Part 1);
options nocenter nodate nonumber;
DATA BIRTH;
  INPUT id $ dob : mmddyy.;
  FORMAT dob  mmddyy10.;
  DATALINES; 
03 03/31/1944 
04 08/11/1950
01 01/09/1954 
02 09/12/1959 
05 07/18/1941
;
PROC SORT data=BIRTH; by id; 
title1 'BIRTH File - Listing'; 
PROC PRINT data=BIRTH noobs;  run;


id,dob
1,01/09/1954
2,09/12/1959
3,03/31/1944
4,08/11/1950
5,07/18/1941


In [6]:
DATA DEATH;
input id $ dod : mmddyy.;
FORMAT dod mmddyy10.;
DATALINES;
07 12/31/2011 
08 02/14/2012
04 12/31/2010 
05 12/12/2012 
06 12/29/2011 
; 
PROC SORT data=DEATH; by id; 
title1 'DEATH File - Listing'; footnote;
PROC PRINT data=DEATH noobs;  run;

id,dod
4,12/31/2010
5,12/12/2012
6,12/29/2011
7,12/31/2011
8,02/14/2012


### Merging
* The MERGE statement joins observations from two or more SAS data sets into single observations.
* The BY statement specifies the common variables to match-merge observations.
* The variables in the BY statement must be common to all data sets.
* The data sets listed in the MERGE statement must be sorted in the order of the values of the variables that are listed in the BY statement, or they must have an appropriate index.
* Variable name, type, and length attributes are established by the first data set.
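* In a one-to-one or one-to-many merge, when the data sets share a non-BY variable with the same name, the value from the last data set listed that contributes to the observation overwrites the earlier values.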
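The overwrite behavior for same-named variables can be sketched as follows. The data sets `left`, `right`, and `both` and the variable `score` are hypothetical names used only for this illustration.

```sas
/* Hypothetical inputs that share the non-BY variable SCORE */
data left;
  input id score;
  datalines;
1 10
2 20
;
data right;
  input id score;
  datalines;
1 99
;
/* For id=1 the value from RIGHT (listed last) replaces the one from LEFT; */
/* id=2 has no match in RIGHT, so it keeps the value from LEFT             */
data both;
  merge left right;
  by id;
run;
proc print data=both noobs; run;
```

Renaming one of the variables with the RENAME= data set option is a common way to keep both values.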

In [7]:
*Ex2_match_merge_sql_outer.sas;
options nocenter nodate nonumber;
** DATA Step Merge (match-merge);
data match_merge;
 merge  BIRTH DEATH ; 
 by id;
 run;
title1 'DATA Step Merge (Match-Merge)';
proc print data=match_merge noobs;
run;

id,dob,dod
1,01/09/1954,.
2,09/12/1959,.
3,03/31/1944,.
4,08/11/1950,12/31/2010
5,07/18/1941,12/12/2012
6,.,12/29/2011
7,.,12/31/2011
8,.,02/14/2012


### Match-Merging
(Equivalent to Inner Joins in PROC SQL)

* The program writes observations for matches only.
* The IN= data set option creates a variable that can be used to identify matches and non-matches.

When you combine two data sets, you can use the IN= data set option to track which of the original data sets contributes to each observation in the new data set. 
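IN= variables are temporary and are not written to the output data set. A minimal sketch for keeping them, assuming the BIRTH and DEATH data sets created above. The names `source_flags`, `in_birth`, and `in_death` are hypothetical.

```sas
/* Copy the temporary IN= flags into ordinary variables so they are saved */
data source_flags;
  merge BIRTH(in=b) DEATH(in=d);
  by id;
  in_birth = b;   /* 1 if BIRTH contributed to this observation */
  in_death = d;   /* 1 if DEATH contributed to this observation */
run;
proc print data=source_flags noobs; run;
```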


In [8]:
*Ex2_match_merge_sql_outer.sas;
options nocenter nodate nonumber;
** DATA Step Merge (exact match);
data Exact_Match;
 merge  BIRTH (in=b) DEATH (in=d);
   by id;
 if b=d;  /* Matches only */
 run;
title1 'DATA Step Merge - Exact Match';
proc print data=Exact_Match noobs;
run;

id,dob,dod
4,08/11/1950,12/31/2010
5,07/18/1941,12/12/2012


### DATA Step Merge (Right Merge)

The following code is equivalent to Right Joins in PROC SQL.

In [9]:
*Ex2_match_merge_sql_outer.sas (Part 12);
options nocenter nodate nonumber;
** DATA Step Merge ;
data right_merge;
 merge  BIRTH DEATH (in=d); 
 by id;
 if d; /* All observations from the DEATH file */
 run;
title1 'DATA Step Merge (Right Merge)';
proc print data=right_merge noobs;
run;

id,dob,dod
4,08/11/1950,12/31/2010
5,07/18/1941,12/12/2012
6,.,12/29/2011
7,.,12/31/2011
8,.,02/14/2012
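For comparison, a sketch of a PROC SQL right join that should return the same rows, assuming the BIRTH and DEATH data sets created above. The table aliases `b` and `d` are arbitrary.

```sas
proc sql;
  title1 'PROC SQL Right Join';
  /* Keep every row of DEATH; dob is missing when BIRTH has no matching id */
  select d.id, b.dob, d.dod
  from birth as b
       right join
       death as d
  on b.id = d.id
  order by d.id;
quit;
title;
```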


#### Finding BIRTH IDs that are not in the DEATH file
(DATA step solution)

In [10]:
*Ex2_match_merge_sql_outer.sas (Part 14);
options nocenter nodate nonumber;
*** DATA Step Merge (nonmatch in the RIGHT data set) vs. PROC SQL subquery;
data Not_in_death;
 merge  BIRTH(in=b) DEATH (in=d); 
 by id;
 if b=1 & d ne 1; /* observations in the BIRTH file but not in the DEATH file */ 
 run;
title1 'DATA Step Merge - Finding BIRTH IDs that are not in the DEATH file';
proc print data=Not_in_death noobs;
run;

id,dob,dod
1,01/09/1954,.
2,09/12/1959,.
3,03/31/1944,.


#### Finding BIRTH IDs that are not in the DEATH file
(PROC SQL solution)

In [11]:
*Ex2_match_merge_sql_outer.sas (Part 15);
options nocenter nodate nonumber;
*PROC SQL subquery finding BIRTH IDs that are not in the DEATH file; 
proc sql;
title1 'SQL subquery - Finding BIRTH IDs that are not in the DEATH file';
  select id, dob
  from birth
  where id not in(select id from death);
quit;


id,dob
1,01/09/1954
2,09/12/1959
3,03/31/1944


#### Finding DEATH IDs that are not in the BIRTH file
(DATA step solution)

In [12]:
*Ex2_match_merge_sql_outer.sas (Part 16);
options nocenter nodate nonumber;
** DATA Step Merge (nonmatch in the LEFT data set);
data Not_in_birth;
 merge  BIRTH(in=b) DEATH (in=d); 
 by id;
 if d=1 & b ne 1; /* observations in the DEATH file but not in the BIRTH file */
 run;
title1 'DATA Step Merge - Finding DEATH IDs that are not in the BIRTH file';
proc print data=Not_in_birth noobs;
run;

id,dob,dod
6,.,12/29/2011
7,.,12/31/2011
8,.,02/14/2012


##### Updating a SAS data set
[Cochran, B. (2020). Urge to MERGE? Maybe You Should UPDATE Instead. SAS Global Forum.](https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/5145-2020.pdf)

In a DATA step, the UPDATE statement updates a master file (a SAS data set) by applying observations from another SAS data set (the transaction file). As mentioned in the article above, the UPDATE statement does the following:

* changes data values for variables in the master SAS data set
* adds observations to the master SAS data set
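One further difference from MERGE is that, by default (UPDATEMODE=MISSINGCHECK), a missing value in the transaction file does not overwrite the master value. A small sketch, using hypothetical data sets `m` and `t`:

```sas
/* Hypothetical master (m) and transaction (t) files */
data m;
  input id price;
  datalines;
1 5.00
2 6.00
;
data t;
  input id price;
  datalines;
1 .
2 6.50
3 7.00
;
/* UPDATE: id=1 keeps 5.00, id=2 becomes 6.50, id=3 is added */
data upd;
  update m t;
  by id;
run;
/* MERGE: id=1 becomes missing because the transaction value overwrites it */
data mrg;
  merge m t;
  by id;
run;
proc print data=upd noobs; run;
proc print data=mrg noobs; run;
```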

##### Create two example SAS data sets (master and transaction files)

In [13]:
*Ex3_merge_update_BY.sas (Part 4) - Subway Master File;
DATA master;
INFILE DATALINES DLM=',';
/* The colon modifier reads item up to the next comma (the only delimiter here) */
INPUT Id item : $14.  sub_6_inch footlong;
DATALINES;
1,Cold Cut Combo,3.50, 5.00
2,Pizza Sub,3.50, 5.00
3,Spicy Italian,3.50, 5.00
4,Veggie Delite, 3.50, 5.00
5,Turkey Breast, 4.00, 6.00
6,Tuna,             4.00, 6.00
7,Veggie Patty,     4.00, 6.00
8,Subway Club,     4.50, 7.00
9,Subway Melt,     4.50, 7.00
10,Steak & Cheese, 4.50, 7.25
11,Roast Beef,     4.50, 7.25
;
PROC SORT DATA=master; 
  BY id; 
run;
title1 'Master File - Subway Menu';
Proc print data=master noobs; run;



Id,item,sub_6_inch,footlong
1,Cold Cut Combo,3.5,5.0
2,Pizza Sub,3.5,5.0
3,Spicy Italian,3.5,5.0
4,Veggie Delite,3.5,5.0
5,Turkey Breast,4.0,6.0
6,Tuna,4.0,6.0
7,Veggie Patty,4.0,6.0
8,Subway Club,4.5,7.0
9,Subway Melt,4.5,7.0
10,Steak & Cheese,4.5,7.25
11,Roast Beef,4.5,7.25


In [14]:
*Ex3_merge_update_BY.sas (Part 5) - Subway Transaction File;
DATA Transact;
INFILE DATALINES DLM=',';
INPUT Id item : $14.  sub_6_inch footlong;
DATALINES;
9,Subway Melt,     5.50, 8.00
10,Steak & Cheese, 5.50, 9.25
11,Roast Beef,     5.50, 8.25
;
PROC SORT DATA=Transact; 
  BY id; 
run;
title1 'Transaction File - Subway Menu';
Proc print data=Transact noobs; run;


Id,item,sub_6_inch,footlong
9,Subway Melt,5.5,8.0
10,Steak & Cheese,5.5,9.25
11,Roast Beef,5.5,8.25


#### Update Statement

A DATA step that uses UPDATE writes a new data set that replaces the existing one, so the same step can also add, drop, or rename variables.

Rules for using the UPDATE statement (Cochran, 2020, p. 1):
 
* only two data sets can appear on the UPDATE statement
* the MASTER file must be listed first
* a BY statement containing the ID variable must be used
* both data sets must be sorted by the BY variable
* the MASTER file must have only one observation per unique value of the BY variable
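The last rule can be checked before running UPDATE. A minimal sketch, assuming the `master` data set and its `Id` variable created above. The column name `n_obs` is arbitrary.

```sas
proc sql;
  title1 'IDs that occur more than once in the master file';
  /* No rows are listed when every Id is unique */
  select Id, count(*) as n_obs
  from master
  group by Id
  having count(*) > 1;
quit;
title;
```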


In [15]:
*Ex3_merge_update_BY.sas (Part 6) - Subway Updated File;
DATA updated; 
 UPDATE master Transact; BY id;
run;
title 'Example of the UPDATE statement';
PROC PRINT DATA=updated noobs; 
run;
title;


Id,item,sub_6_inch,footlong
1,Cold Cut Combo,3.5,5.0
2,Pizza Sub,3.5,5.0
3,Spicy Italian,3.5,5.0
4,Veggie Delite,3.5,5.0
5,Turkey Breast,4.0,6.0
6,Tuna,4.0,6.0
7,Veggie Patty,4.0,6.0
8,Subway Club,4.5,7.0
9,Subway Melt,5.5,8.0
10,Steak & Cheese,5.5,9.25
11,Roast Beef,5.5,8.25


##### MODIFY Statement
* MODIFY performs an update in place by rewriting only those records that have changed, or by appending new records to the end of the file. 

In [None]:
DATA master_x; 
 SET master;
run;
DATA master_x; 
 MODIFY master_x Transact; 
BY id;
run;
title 'Example of the MODIFY statement';
PROC PRINT DATA=Master_x noobs; 
run;
title;
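When a transaction file may contain IDs that are not in the master, the automatic variable `_IORC_` can be tested to decide between rewriting and appending. The sketch below follows the pattern in the SAS documentation and assumes the `master` data set created above. The copy `master_y`, the transaction file `extra`, and the item with ID 12 are hypothetical and made up for this illustration.

```sas
/* Hypothetical working copy and transaction file; ID 12 is not in the master */
data master_y;
  set master;
run;
data extra;
  length item $ 14;
  Id = 9;  item = 'Subway Melt';  sub_6_inch = 5.50; footlong = 8.00; output;
  Id = 12; item = 'Meatball Sub'; sub_6_inch = 4.00; footlong = 6.50; output;
run;

data master_y;
  modify master_y extra;
  by Id;
  select (_iorc_);
    when (%sysrc(_sok)) replace;      /* match found: rewrite the record in place */
    when (%sysrc(_dsenmr)) do;        /* no match: append the transaction record  */
      _error_ = 0;                    /* reset the error flag set by the nonmatch */
      output;
    end;
    otherwise put 'Unexpected I/O condition: ' _iorc_=;
  end;
run;

title 'MODIFY with _IORC_ handling';
proc print data=master_y noobs;
run;
title;
```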