## GWU STAT 4197/STAT 6197
### Week 2, Part 1: Reading Raw Data into SAS Using Various Input Styles 

* Column Input
* Formatted Input
* List Input 
* Named Input

### Column Input
#### Fixed-column data that include
* standard numeric data values
* character data values, which can contain embedded blanks
* values that always occupy the same fixed columns in every record

In [1]:
*Ex1_Column_Input.sas;
options nocenter nonumber nodate;
data work.HAVE1;
 input id $ 1-3 name $ 5-16 
       score1 18-19 score2 21-22;
datalines;
001 Tim Dyson    74 87 
002 Sam Larson   96 82 
003 Jane Miller  91 88 
004 Bikas Das    90 87 
; 
title 'Column input style, no infile statement';
proc print data=work.HAVE1 noobs; 
run;
title;

id,name,score1,score2
1,Tim Dyson,74,87
2,Sam Larson,96,82
3,Jane Miller,91,88
4,Bikas Das,90,87


### Column Input
* INFILE statement (firstobs= option)
* The FIRSTOBS= option specifies a starting point (which is the 2nd record) for processing the data following the DATALINES statement.

In [2]:
*Ex1_Column_Input.sas;
* FIRSTOBS= option on the INFILE statement; 
options nocenter nonumber nodate;
data work.HAVE2;
 infile datalines firstobs=2;
 input id $ 1-3 name $ 5-16 
       score1 18-19 score2 21-22;
datalines;
1234567890123456789012
001 Tim Dyson    74 87 
002 Sam Larson   96 82 
003 Jane Miller  91 88 
004 Bikas Das    90 87 
; 
title 'Column input style';
proc print data=work.HAVE2 noobs; 
run;
title;


id,name,score1,score2
1,Tim Dyson,74,87
2,Sam Larson,96,82
3,Jane Miller,91,88
4,Bikas Das,90,87


### Column Input
#### Fixed column data in an external file that includes standard numeric data values and character data values

* With the PAD option on the INFILE statement, SAS pads the record from an external file with blanks to the length that is specified in the LRECL= option or implied by the column position. That way all data lines have the same length.

In [11]:
*Ex2_column_Input_PAD_Option.sas;
data HAVE;
 infile 'C:\Users\pmuhuri\SASCourse\Week2\Week2Data\short_records.txt'
         Lrecl=25 PAD;
 input id 1-3 name $ 5-16 
       score 18-19 @21 some_value 5.2;
proc print data=HAVE noobs; 
run;


id,name,score,some_value
1,Tim Dyson,74,87.24
2,Sam Larson,96,82.24
3,Jane Miller,91,.
4,Bikas Das,90,87.83
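
The effect of PAD can be checked without the external file. The following is a minimal sketch, not one of the course files: it writes one full-length and one short record to a temporary file and reads them back with PAD. The fileref `shortrec` and the data set `pad_demo` are hypothetical names chosen for illustration.

```sas
* Sketch only: the fileref SHORTREC and the data set PAD_DEMO are hypothetical names;
filename shortrec temp;

* Write one full-length record and one short record to a temporary file;
data _null_;
  file shortrec;
  put '001 Tim Dyson    74 87.24';
  put '003 Jane Miller  91';
run;

* PAD extends the short record with blanks to LRECL=25, so columns 21-25 read as missing;
data work.pad_demo;
  infile shortrec lrecl=25 pad;
  input id 1-3 name $ 5-16 score 18-19 @21 some_value 5.2;
run;

proc print data=work.pad_demo noobs;
run;
```

Under these assumptions, the second observation should show a missing `some_value`, mirroring the Jane Miller row in the output above.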


## Column Input
#### Fixed column data in an external file that only include character data values

* The TRUNCOVER option on the INFILE statement "causes the DATA step to assign the raw data value to the variable even if the value is shorter than expected by the INPUT statement".


In [10]:
*Ex3_column_Input_TRUCOVER_Option.sas;
*Additional Examples;
options nonotes nocenter nodate nonumber;
DATA test2;
  INFILE "C:\Users\pmuhuri\SASCourse\Week2\Week2Data\test_data.txt" 
          firstobs=2 truncover;
  INPUT lastn $1-10 Firstn $ 11-20
   Empid $21-30 Jobcode $31-40;
RUN;
title "TRUNCOVER option on the INFILE statement"; 
proc print data=test2; run;
title;

Obs,lastn,Firstn,Empid,Jobcode
1,LANGKAMM,SARAH,E0045,Mechanic
2,TORRES,JAN,E0029,Pilot
3,SMITH,MICHAEL,E0065,
4,LEISTNER,COLIN,E0116,Mechanic
5,TOMAS,HARALD,,
6,WADE,KIRSTEN,E0126,Pilot
7,WAUGH,TIM,E0204,Pilot
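
To see how TRUNCOVER differs from MISSOVER, the sketch below reads the same short records with each option. It is an illustration only: the fileref `nums`, the variable `testnum`, and the data set names are hypothetical.

```sas
* Sketch only: the fileref NUMS and the data set names are hypothetical;
filename nums temp;

* Records of 2, 3, 4, and 5 characters;
data _null_;
  file nums;
  put '22';
  put '333';
  put '4444';
  put '55555';
run;

* MISSOVER sets TESTNUM to missing whenever the record is shorter than 5 columns;
data work.missover_demo;
  infile nums missover;
  input testnum 5.;
run;

* TRUNCOVER keeps whatever characters are available in the short record;
data work.truncover_demo;
  infile nums truncover;
  input testnum 5.;
run;

proc print data=work.missover_demo; run;
proc print data=work.truncover_demo; run;
```

With MISSOVER, only the 5-character record should yield a nonmissing value; with TRUNCOVER, all four values should be read.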


## Formatted Input 

#### Fixed column data that include standard and nonstandard numeric data and character data values
#### Features of the formatted input style
* starting position (@n moves the pointer to column n)
* variable that contains nonstandard data values
* informat

With formatted input, the column pointer moves the number of columns specified by the informat width and stops at the next column. 
* An informat contains a w value that indicates the width of the raw data field.
* In a numeric informat, a period (.) separates the w value from the optional d value, which specifies the number of decimal places.



In [3]:
*Ex4_Formatted_Input.sas;
OPTIONS nocenter nonumber nodate;
data have1;  
infile datalines firstobs=3;   /*Read from the 3rd record? */
input @1 x_software $char5.     /* $CHARw. informat preserves leading blanks */
      @7 book_titles 3.         /*SAS moves the pointer to column 7*/
      @11 date_searched mmddyy10.; /*Informat specified*/ 
format date_searched mmddyy10.;    /*Format specified*/
datalines; 
http://r4stats.com/articles/popularity/
12345678901234567890
SAS   576 06/01/2015
SPSS  339 07/01/2015
R     240 08/01/2015
Stata  62 09/01/2015
;                     
title 'Formatted Input';
proc print data=Have1 noobs ; run;
title;
proc contents data=Have1 p; 
ods select position;
run;


x_software,book_titles,date_searched
SAS,576,06/01/2015
SPSS,339,07/01/2015
R,240,08/01/2015
Stata,62,09/01/2015

Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order
#,Variable,Type,Len,Format
1,x_software,Char,5,
2,book_titles,Num,8,
3,date_searched,Num,8,MMDDYY10.
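
The difference between the $w. and $CHARw. informats shows up only when a field has leading blanks. The sketch below, with hypothetical data set and variable names, reads the same five columns with both informats and uses the VERIFY function to report the position of the first nonblank character in each stored value.

```sas
* Sketch only: data set and variable names are hypothetical;
data work.char_demo;
  input @1 std_value $5. @1 char_value $char5.;
  * Position of the first nonblank character in each stored value;
  first_std  = verify(std_value, ' ');
  first_char = verify(char_value, ' ');
datalines;
  SAS
;
proc print data=work.char_demo noobs;
run;
```

Because $w. trims leading blanks and $CHARw. keeps them, `first_std` should be 1 and `first_char` should be 3 for this record.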


### Column Input and Formatted Input
#### Fixed-column data that include character data values as well as both standard and nonstandard numeric data values

In [7]:
*Ex5_formatted_column_input.sas;
* Use of absolute pointer control;
data work.apc;
infile datalines firstobs=2;
input id $1.        @3 x1 5. 
      @9 x2 dollar7.   @9 a_x2 comma7.
      @17 x3 dollarx7. @17 a_x3 commax7. 
      @25 x4 6.        @32 x5 percent7.;
format x2 dollar7.  a_x2 comma7. 
       x3 dollarx7. a_x3 commax7. 
       x5 percent7.;
datalines;
1234567890123456789012345678901234
A 12909 $12,909 $12.909 12.909 12%
;
title 'Use of absolute pointer control';
proc print data=work.apc noobs; run;
title;

id,x1,x2,a_x2,x3,a_x3,x4,x5
A,12909,"$12,909",12909,$12.909,12.909,12.909,12%


In [1]:
*Ex15_Absolute_Relative_Pointer_controls.sas;
data Example_formatted_column_input;
input id $ 1 x1 3-7
     @9 x2 dollar7. 
     +1 x3 dollarx7. 
     +1 x4 6. 
     +1 x5 percent7.;
format x2 dollar7. x3 dollarx7. x5 percent7.;
datalines;
A 12909 $12,909 $12.909 12.909 12%
;
title 'Use of absolute and relative pointer controls';
proc print data=Example_formatted_column_input noobs;
run;

id,x1,x2,x3,x4,x5
A,12909,"$12,909",$12.909,12.909,12%


## Formatted Input 
#### +(expression) moves the pointer the number of columns given by expression.
#### To read the same data field multiple times with different informats, move the pointer back: +(-7) moves the pointer 7 columns backward.

In [8]:
*Ex5_formatted_column_input.sas;
data work.rpc;
infile datalines firstobs=2;
input id $1.           +1 x1 5. 
      +1 x2 dollar7.   +(-7) a_x2 comma7.
      +1 x3 dollarx7.  +(-7) a_x3 commax7. 
      +1 x4 6.         +1 x5 percent7.;
format x2 dollar7.  a_x2 comma7. 
       x3 dollarx7. a_x3 commax7. 
       x5 percent7.;
datalines;
1234567890123456789012345678901234
A 12909 $12,909 $12.909 12.909 12%
;
title 'Use of relative pointer control';
proc print data=work.rpc noobs; run;
title;

id,x1,x2,a_x2,x3,a_x3,x4,x5
A,12909,"$12,909",12909,$12.909,12.909,12.909,12%


In [2]:
*Ex37_Formmatted_Input_Formatted_put (Part 1);
options nocenter nodate nonumber nosource;
data Have1;
input @1 date1 date9.  +(-9) date2  date9. 
      +(-9) date3  date9. +(-9) date4  date9.;
Format date1 date11. date2 date9. date3 yymmdd10. date4 comma7.;
datalines;
29JAN2019
;
proc print data=Have1;
var date:;
run;


Obs,date1,date2,date3,date4
1,29-JAN-2019,29JAN2019,2019-01-29,21578


### Formatted Input 
#### Fixed column data that include nonstandard numeric data and character data values
####  Both absolute and relative pointer controls are used in the INPUT statement.
#### n_date = input(c_date,anydtdte11.); converts the character date variable into a numeric date variable using the INPUT function



In [7]:
*Ex6_Formated_Input_Dates.sas (Part 1);
DATA work.Have1;
INPUT            
            @1  date1 date11. 
            +1  date2 ddmmyy6.
            +1  date3 mmddyy10. 
            +1  date4 yymmdd8.
            +1  date5 ddmmyy10.
            +1  date6 mmddyy8.
            @1  c_date $11.;   

   n_date = input(c_date,anydtdte11.);

FORMAT date1 date2 date3 date4 date5 date6 mmddyy10.   
       n_date date9. ; 
DATALINES;
14/JAN/2015 140115 01-14-2015 15 01 14 14.01.2015 01/14/15
;
title "Using the temporary formats in the PROC step";
proc print data=have1;
Format date1 date9. 
      date2 WORDDATE. 
      date3 WORDDATX. 
      date4 WEEKDATE. 
      date5 MONYY.  
      date6 DOWNAME.
      n_date mmddyy10.;
run;
title;



Obs,date1,date2,date3,date4,date5,date6,c_date,n_date
1,14JAN2015,"January 14, 2015",14 January 2015,"Wednesday, January 14, 2015",JAN15,Wednesday,14/JAN/2015,01/14/2015
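
The INPUT function can also be tried on its own, without reading raw data. This is a small sketch with hypothetical variable names; the literal values are taken from the examples in this section.

```sas
* Sketch only: data set and variable names are hypothetical;
data work.convert_demo;
  c_date   = '14/JAN/2015';
  c_amount = '$12,909';
  * INPUT(source, informat) returns a numeric value when the informat is numeric;
  n_date   = input(c_date, anydtdte11.);
  n_amount = input(c_amount, dollar7.);
  format n_date date9. n_amount comma7.;
run;

proc print data=work.convert_demo noobs;
run;
```

The character variables stay in the data set unchanged; `n_date` and `n_amount` are new numeric variables that can be used in date arithmetic and calculations.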



## Formatted Input 
#### Fixed column data that include nonstandard numeric data values (dates)
You can use the ANYDTDTE informat to read dates
 of different structures, including: 
 
* DATE, DATETIME, TIME, DDMMYY, 
* MMDDYY, and YYMMDD 
* JULIAN, MONYY, and YYQ 

You can also use the following INFORMATs to extract parts of dates:

* ANYDTDTE. Extracts the date portion 
* ANYDTDTM. Extracts the datetime portion 
* ANYDTTME. Extracts the time portion

Adapted from Venky Chakraborty's PharmaSUG2010 paper


In [11]:
*Ex6_Formated_Input_Dates.sas (Part 2);
title ' ';
data work.date_data;
input @1 mix_dates anydtdte.;
format mix_dates date9.;
datalines;
27Aug2018
27Aug2018 3:30:32.8
180827
08272018
SEP2018
18Q4
;
proc print data=date_data;
run;


Obs,mix_dates
1,27AUG2018
2,27AUG2018
3,27AUG2018
4,27AUG2018
5,01SEP2018
6,01OCT2018


### Formatted Input 
#### Reading data with a special informat

* The BZw.d informat reads numeric values, converts any trailing or embedded blanks to 0s, and ignores leading blanks.
    
* The BZw.d informat ignores blanks between a minus sign and a numeric value in an input field.


In [12]:
*Adapted from SAS Documentation ;
*Exa35_input_numeric_character_data.sas (Part 2);
options nocenter nodate nonumber nosource;
data Have2;
 input @1 some_numbers bz4.;
 datalines;
2 3      /*embedded blank in the data - COMMA. or BZ. informat*/
- 23     /*embedded blank in the data - COMMA. or BZ. informat*/
;
proc print data=Have2 noobs; run;


some_numbers
2030
-23
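
For comparison, the sketch below reads the same four columns with the standard w. informat and with BZw., using values that have trailing blanks. The data set and variable names are hypothetical.

```sas
* Sketch only: data set and variable names are hypothetical;
data work.bz_demo;
  * Read columns 1-4 twice: once with w. and once with BZw.;
  input @1 plain 4. @1 with_bz bz4.;
datalines;
23
-5
;
proc print data=work.bz_demo noobs;
run;
```

The w. informat ignores the trailing blanks (23 and -5), whereas BZ4. should turn them into zeros (2300 and -500).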


### List Input
#### Space delimited data that includes standard numeric data values

In [14]:
*Adapted from SAS Documentation ;
*Exa35_input_numeric_character_data.sas (Part 1);
options nocenter nodate nonumber nosource;
data Have1;
 input some_numbers;
 datalines;
   23    /*input right aligned*/
 23      /*input not aligned */
23       /*input left aligned*/
00023    /*input with leading zeros*/
23.0     /*input with decimal point*/
2.3E1    /*in E notation, 2.3x10**1*/
230E-1   /*in E notation, 230x10**(-1)*/ 
-23
;
proc print data=Have1 noobs; run;

some_numbers
23
23
23
23
23
23
23
-23


### List Input
#### Space-delimited data that include character data values and standard numeric data values

In [15]:
*Ex7_Simple_List_Input.sas (Part 1);
OPTIONS nodate nonumber ps=58 ls=98;
DATA Work.Have1;
    INPUT  st_name $ pop percent_pop18p ;
    DATALINES;
    Alabama 4833722          77
    ;
  PROC PRINT data=work.Have1 noobs; run;

st_name,pop,percent_pop18p
Alabama,4833722,77


### List Input 
#### Space-delimited data that include character data values (some of them are longer than 8 bytes)

* Use the $ option to read in character data
* Use the LENGTH statement to avoid unwanted truncation of the values of character variables 
    that are more than 8 characters long

In [16]:
*Ex7_Simple_List_Input.sas (Part 2);
DATA Work.Have2;   
   LENGTH st_name $ 10;                        
    INPUT st_name $ pop percent_pop18p ;
    DATALINES;
    Alabama            4833722  77
    California 38332521             76.1
  ;
 PROC PRINT data=work.Have2 noobs; run;
 proc contents data=Have2 varnum;
 ods select position;
 run;

st_name,pop,percent_pop18p
Alabama,4833722,77.0
California,38332521,76.1

Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order
#,Variable,Type,Len
1,st_name,Char,10
2,pop,Num,8
3,percent_pop18p,Num,8


### List Input
#### Space-delimited data that include character data values some of which are longer than 8 bytes and standard numeric values 
* For character variables, the INFORMAT statement has the same effect on variable length as the LENGTH statement.

In [17]:
*Ex7_Simple_List_Input.sas (Part 3);
DATA Work.Have3;   
   INFORMAT st_name $ 10.;                        
    INPUT st_name $ pop percent_pop18p ;
    DATALINES;
    Alabama 4833722  77
    California 38332521 76.1
  ;
 PROC PRINT data=work.Have3 noobs; run;
 proc contents data=Have3 varnum;
 ods select position;
 run;

st_name,pop,percent_pop18p
Alabama,4833722,77.0
California,38332521,76.1

Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order
#,Variable,Type,Len,Informat
1,st_name,Char,10,10.0
2,pop,Num,8,
3,percent_pop18p,Num,8,


### List Input 
#### Comma-delimited data that include character data values and standard numeric data values
* DLM = option on the INFILE statement

In [18]:
*Ex7_Simple_List_Input.sas (Part 4);
* Use the DLM= option to read in comma delimited data;
 DATA Work.Have4;   
   LENGTH  st_name $ 10; 
    infile datalines DLM=','; 
    INPUT st_name $ pop percent_pop18p ;
    DATALINES;
    Alabama, 4833722,  77
    California, 38332521, 76.1
  ;
  PROC PRINT data=work.Have4 noobs; run;

st_name,pop,percent_pop18p
Alabama,4833722,77.0
California,38332521,76.1
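
The DLM= option accepts delimiters other than a comma. As an illustration (the pipe-delimited records and the data set name are made up for this sketch), the same two states can be read from pipe-delimited data:

```sas
* Sketch only: the data set name and the pipe-delimited layout are assumptions;
data work.have_pipe;
  length st_name $ 10;
  * For tab-delimited data, the hexadecimal constant '09'x can be used instead;
  infile datalines dlm='|';
  input st_name $ pop percent_pop18p;
datalines;
Alabama|4833722|77
California|38332521|76.1
;
proc print data=work.have_pipe noobs;
run;
```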


## List Input 
#### Space-delimited data that include missing values

* Use a placeholder (a period) for a missing value in the middle of a record in a space-delimited file.

In [19]:
*Ex7_Simple_List_Input.sas (Part 5);
DATA Work.Have5;   
   LENGTH st_name $ 10;                        
    INPUT st_name $ pop percent_pop18p ;
    DATALINES;
    Alabama .    77
    California 38332521 76.1
  ;
 PROC PRINT data=work.Have5 noobs; run;

st_name,pop,percent_pop18p
Alabama,.,77.0
California,38332521,76.1


## List Input
#### Space-delimited data with missing data values at the end of the record 
* The MISSOVER option prevents SAS from loading a new record when the end of the current record is reached; variables without values are set to missing.

In [20]:
*Ex9_DLM_DSD_MISSOVER.sas (Part5);
* MISSOVER option on the INFILE statement;
data M_data;
   infile datalines missover;
   input id course $ class_size;
   datalines; 
   1 Stat4197 14
   2 Stat6207 
   3 Stat1028 22
   4 Stat6197 25
   ;
proc print data=M_data;
run;

Obs,id,course,class_size
1,1,Stat4197,14
2,2,Stat6207,.
3,3,Stat1028,22
4,4,Stat6197,25
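
To see why MISSOVER matters here, the sketch below reads the same records with the default behavior (FLOWOVER). The data set name is hypothetical.

```sas
* Sketch only: the data set name is hypothetical;
data work.flowover_demo;
  * No MISSOVER, so the default FLOWOVER behavior applies;
  infile datalines;
  input id course $ class_size;
datalines;
1 Stat4197 14
2 Stat6207
3 Stat1028 22
4 Stat6197 25
;
proc print data=work.flowover_demo;
run;
```

When the second record runs out of values, SAS moves to the next record to find `class_size` and reads the 3 there; the rest of that record is then skipped. The result should be three observations instead of four, and the log normally notes that SAS went to a new line when the INPUT statement reached past the end of a line.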


### List Input 
#### Space-delimited data values (more than one observation per line)
#### Use the double trailing @ (@@) line-hold specifier to read more than one observation per line.

In [21]:
*Ex7_Simple_List_Input.sas (Part 6);
DATA Work.Have6;   
   LENGTH  st_name $ 10;                        
    INPUT st_name $ pop percent_pop18p  @@;
    DATALINES;
    Alabama 4833722  77   California 38332521 76.1
  ;
  PROC PRINT data =work.Have6 noobs; run;

st_name,pop,percent_pop18p
Alabama,4833722,77.0
California,38332521,76.1


### List Input 
#### Space delimited data that include character data values and standard numeric data values
#### LABEL and FORMAT statements are added to the DATA step
 

In [22]:
title;
*Ex7_Simple_List_Input.sas (Part 7);
DATA Work.Have7;   
    LENGTH  st_name $ 10;                        
    INPUT st_name $ pop percent_pop18p ;
    FORMAT pop comma10. percent_pop18p 5.1;
    LABEL st_name='State Name'
          pop='Population Size'
          percent_pop18p='Percentage of Population Aged 18 Years and Older';
    DATALINES;
    Alabama 4833722  77
    California 38332521 76.1
  ;
  proc print data=work.Have7 noobs label; run;
  proc contents data=work.Have7 varnum;
  ods select position;
  run;
  title;

State Name,Population Size,Percentage of Population Aged 18 Years and Older
Alabama,4833722,77.0
California,38332521,76.1

Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order
#,Variable,Type,Len,Format,Label
1,st_name,Char,10,,State Name
2,pop,Num,8,COMMA10.,Population Size
3,percent_pop18p,Num,8,5.1,Percentage of Population Aged 18 Years and Older



### List Input 

#### Delimited data that include 
* character data values and standard numeric data values 
* semicolons in the lines of data
* DATALINES4 statement 

http://www.sascommunity.org/wiki/DATALINES4_statement

The DATALINES4 statement precedes any lines of data that 
are going to be read into the DATA step. The lines of data, 
which may contain semicolons, that immediately follow 
this statement end when four consecutive semicolons are 
encountered on a new line. If the data do not 
contain any semicolons, then the DATALINES statement 
can be used instead.



In [23]:
*Ex31_Datalines4;
data Have;
   input state_data $50. ;
   datalines4;
Alabama;  4833722; 77.0
    California;  38332521; 76.1
;;;;

  proc print data=work.HAVE noobs;
   run;

state_data
Alabama; 4833722; 77.0
California; 38332521; 76.1
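
The example above stores each line in a single character variable. If the semicolons are meant to be delimiters, one possible approach is to combine DATALINES4 with the DLM= option, as in the sketch below. The data set name is hypothetical, and the sketch assumes that INFILE DATALINES4 is used as the file specification that pairs with the DATALINES4 statement.

```sas
* Sketch only: the data set name is hypothetical;
data work.have_parsed;
  length st_name $ 10;
  * Treat the semicolon as the field delimiter;
  infile datalines4 dlm=';';
  input st_name $ pop percent_pop18p;
datalines4;
Alabama;  4833722; 77.0
California;  38332521; 76.1
;;;;
proc print data=work.have_parsed noobs;
run;
```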


### List Input 

#### Space delimited data that include character data values and standard numeric data values

* Use the LABEL and FORMAT statements in the PROC PRINT step to apply the labels and formats
  to the variables

* You must use a SPLIT= option with PROC PRINT to display descriptive column headings with split text

In [24]:
*Ex7_Simple_List_Input.sas (Part 8);
 DATA Work.Have8;   
    LENGTH  st_name $ 10;                        
    INPUT st_name $ pop percent_pop18p ;
    DATALINES;
    Alabama 4833722  77
    California 38332521 76.1
  ;
PROC PRINT data=work.Have8 noobs split='*';
    FORMAT pop comma10. percent_pop18p 5.1;
    LABEL st_name='State Name'
          pop='Population Size'
          percent_pop18p='Percentage*of Population* Aged 18 Years* and Older';
     
  run;

State Name,Population Size,Percentage of Population Aged 18 Years and Older
Alabama,4833722,77.0
California,38332521,76.1


### List Input
#### Reading data from multiple external files into a single SAS data set
#### Method 1

In [7]:
*Ex28_Reading_Multiple_Files.sas;

*Reading multiple raw data files into a single SAS data set;
*Method 1; 
FILENAME test ('C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile1.csv',
               'C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile2.csv',
			   'C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile3.csv');
data a; 
infile test DLM=','; 
input var1 $ var2 var3; 
run;
title 'Reading multiple raw data files into a single SAS data set (Method 1)';
proc print data=a noobs; run;

var1,var2,var3
A,123,345
B,456,456
C,789,678
D,456,334
E,456,223
F,876,456
G,456,334
H,456,223
I,876,456


### List Input
#### Reading data from multiple external files into a single SAS data set
#### Method 2

In [8]:
*Ex28_Reading_Multiple_Files.sas (Part 2);
* Method 2;
FILENAME test 'C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile*.csv'; 
data b; 
infile test DLM=','; 
input var1 $ var2 var3; 
run;
title 'Reading multiple raw data files into a single SAS data set (Method 2)';
proc print data= b noobs; run;

var1,var2,var3
A,123,345
B,456,456
C,789,678
D,456,334
E,456,223
F,876,456
G,456,334
H,456,223
I,876,456
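
Methods 1 and 2 do not record which file each observation came from. One way to keep that information is the FILENAME= option on the INFILE statement, sketched below. The variable names `current_file` and `source_file` and the length of 260 are assumptions made for this illustration.

```sas
* Sketch only: variable names and the length of 260 are assumptions;
FILENAME test 'C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile*.csv';
data work.b_with_source;
  length current_file source_file $ 260;
  * FILENAME= names a variable that holds the physical name of the file being read;
  infile test dlm=',' filename=current_file;
  input var1 $ var2 var3;
  * The FILENAME= variable is not written to the data set, so copy its value;
  source_file = current_file;
run;
proc print data=work.b_with_source noobs; run;
```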


### List Input
#### Reading data from multiple external files into a single SAS data set
#### Method 3 

* Use an INFILE statement with the FILEVAR= option
* FILEVAR=variable causes the INFILE statement 
   to close the current input file and open a new 
   input file whenever the value of variable changes
   (e.g., testfile1, testfile2, testfile3). 

* The END= option names a temporary variable whose value is 1 when the INPUT statement reads the last record of the current input file. At all other times, the value of the variable is 0. For example, with END=LASTFILE:

    * LASTFILE=0 when the current input data record is not the last record in the input file
    * LASTFILE=1 when the current input data record is the last record in the input file

Although the DATA step can use the END= variable, SAS does not add it to the resulting data set.


In [9]:
*Ex28_Reading_Multiple_Files.sas (Part 3);
data c;
 do i=1 to 3;
    add= "C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile" || put(i,1.)|| ".csv";
    do until (lastfile);
        infile dummy filevar=add end=lastfile DLM=',';
        filename=add;
        input var1 $ var2 var3;
      output;
     end;
  end;
  stop;
  run;
  title 'Reading multiple raw data files into a single SAS data set (Method 3)';
  proc print data=c noobs; run;

i,filename,var1,var2,var3
1,C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile1.csv,A,123,345
1,C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile1.csv,B,456,456
1,C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile1.csv,C,789,678
2,C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile2.csv,D,456,334
2,C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile2.csv,E,456,223
2,C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile2.csv,F,876,456
3,C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile3.csv,G,456,334
3,C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile3.csv,H,456,223
3,C:\Users\pmuhuri\SASCourse\Week2\Week2Data\testfile3.csv,I,876,456


### List Input
#### Reading first few data records from a raw data file
#### Some features of the SAS code in the next cell
* Infile option: firstobs=2
* The statement if \_N_=5 then stop; stops the DATA step during the fifth iteration, before that observation is output, so only four records are written.

In [14]:
*Ex25_read_from_text.sas;
Filename raw  
    "C:\Users\pmuhuri\SASCourse\Week2\Week2Data\pop2013_no_headers.txt";
data have1;
   infile raw  firstobs=2 truncover ;
   input record $80. ;
   if _n_=5 then stop;
run;
title 'Firstobs=2 and if _n_=5 then stop - Read observations from 2 through 5';
proc print data=have1; run;

Obs,record
1,"40,4,9,2,Alaska,735132,547000, 74.4"
2,"40,4,8,4,Arizona,6626624,5009810,75.6"
3,"40,3,7,5,Arkansas,2959373,2249507,76"
4,"40,4,9,6,California,38332521,29157644,76.1"


### List Input
#### Reading first few data records from a raw data file
* Infile option: firstobs=2 and obs=5

In [16]:
*Ex25_read_from_raw.sas;
Filename raw  
    "C:\Users\pmuhuri\SASCourse\Week2\Week2Data\pop2013_no_headers.txt";
data have2;
   infile raw  firstobs=2 obs=5 truncover ;
   input record $80.;
   run;
title '(Options on the INFILE statement: firstobs=2 obs=5) Read observations from 2 through 5';
proc print data=have2 noobs; run;
title;

record
"40,4,9,2,Alaska,735132,547000, 74.4"
"40,4,8,4,Arizona,6626624,5009810,75.6"
"40,3,7,5,Arkansas,2959373,2249507,76"
"40,4,9,6,California,38332521,29157644,76.1"


### List Input
#### Reading data that are in zipped files

* Download the zipped file (national data) from this site before running the SAS code below: https://www.ssa.gov/oact/babynames/limits.html

* The FILENAME statement assigns the fileref ZIPFILE to the zipped file that needs to be read.

* The SASZIPAM access method is used to decompress the file.

In [12]:
*Ex18_Read_Zipped_File2.sas;
Filename ZIPFILE SASZIPAM 'C:\Users\pmuhuri\SASCourse\Week2\Week2Data\names.zip';
DATA newdata;
  INFILE ZIPFILE("yob1920.txt") DLM=',';
  INPUT name $ gender $ number;   
RUN;





In [13]:
proc sort data=newdata; by gender descending number;
title " 5 most common girls' names";
proc print data=newdata (obs=5) noobs; 
var name number;
format number comma9.;
where gender='F';
run;
title " 5 most common boys' names";
proc print data=newdata (obs=5) noobs; 
var name number;
format number comma9.;
where gender='M';
run;
title ' ';

name,number
Mary,70975
Dorothy,36645
Helen,35097
Margaret,27997
Ruth,26101

name,number
John,56916
William,50152
Robert,48681
James,47913
Charles,28309
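
In SAS 9.4, the ZIP access method on the FILENAME statement is another way to read a member of a zipped file. The sketch below assumes the same zip file and member as above; the fileref `inzip` and the data set name are hypothetical.

```sas
* Sketch only: assumes SAS 9.4 or later with the ZIP access method available;
filename inzip zip 'C:\Users\pmuhuri\SASCourse\Week2\Week2Data\names.zip'
         member='yob1920.txt';

data work.newdata_zip;
  infile inzip dlm=',';
  input name $ gender $ number;
run;
```

As in the original example, `name` is read with simple list input, so values longer than 8 characters are truncated unless a LENGTH statement or an informat is added.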


### Modified List Input
#### Method 1
#### Comma-delimited data that include both standard and nonstandard numeric data values as well as character data values
* LENGTH statement
* INFORMAT statement
* DLM = option on the INFILE statement


In [31]:
*Ex8_List_Input_Modified_Input.sas (Part 1);
OPTIONS nodate nonumber ps=58 ls=98;

*List input style with LENGTH and INFORMAT statements; 
data work.Students_x; 
  length Id $6 Name $14 Address $16 City $20 State $2 zip $5 ; 
  informat Reg_date mmddyy10.;
  format Reg_date mmddyy10.;
  infile datalines dlm=',';
  input Id -- Reg_date;
datalines;
G009876, Ann Miller,2219 Pine St, Rockville,MD,28057, 08/20/2016
G008765, Rubi Tyson,6504 Spring St, Philadelphia,PA,19104,08/13/2016
;
PROC PRINT data=students_x noobs; 
RUN;

Id,Name,Address,City,State,zip,Reg_date
G00987,Ann Miller,2219 Pine St,Rockville,MD,28057,08/20/2016
G00876,Rubi Tyson,6504 Spring St,Philadelphia,PA,19104,08/13/2016


### Modified List Input
#### Method 2


* INFORMAT statement (no LENGTH statement added)
* DLM = option on the INFILE statement


In [32]:
*Ex8_List_Input_Modified_Input.sas (Part 2);
*List input style with INFORMAT statement; 
data students_y; 
informat Id $6. Name $14. Address $16. City $20. State $2. zip $5. 
         Reg_date mmddyy10.;
format Reg_date mmddyy10.;
infile datalines dlm=',';
input Id -- Reg_date;
datalines;
G009876, Ann Miller,2219 Pine St, Rockville,MD,28057, 08/20/2016
G008765, Rubi Tyson,6504 Spring St, Philadelphia,PA,19104,08/13/2016
;
PROC PRINT data=students_y noobs;  
RUN;

Id,Name,Address,City,State,zip,Reg_date
G00987,Ann Miller,2219 Pine St,Rockville,MD,28057,08/20/2016
G00876,Rubi Tyson,6504 Spring St,Philadelphia,PA,19104,08/13/2016



### Modified List Input
#### Method 3
* Use the colon (:) format modifier, which enables you to use list input and also specify an informat for either character or numeric data values.

* The CITY variable is read as a character variable using the $20. ($w.) informat. This informat tells SAS that the variable is character with a length of 20.
  
* The REG_DATE variable is read with the MMDDYYw. date informat. The date field occupies 10 columns, but with the colon modifier the w value can be omitted, as in the code below.
  
****

The default length of numeric variables is 8, so you don’t need to specify a w value to indicate the length of a numeric variable (e.g., Reg_date below) when reading delimited data. This is different from using a numeric informat with formatted input when reading fixed-column data. In that case, you must specify a w value to indicate the number of columns to be read (SAS Certification Preparation Guide: Base Programming for SAS® 9, Third Edition, page 552).


In [33]:

*Ex8_List_Input_Modified_Input.sas (Part 3);

data students_z; 
infile datalines dlm=',';
input Id :$6. Name :$14. Address :$16. City :$20. 
      State :$2. zip :$5. Reg_date :mmddyy.;
format Reg_date mmddyy10.;
datalines;
G009876, Ann Miller,2219 Pine St, Rockville,MD,28057, 08/20/2016
G008765, Rubi Tyson,6504 Spring St, Philadelphia,PA,19104,08/13/2016
;
PROC PRINT data=students_z noobs;  
RUN;

Id,Name,Address,City,State,zip,Reg_date
G00987,Ann Miller,2219 Pine St,Rockville,MD,28057,08/20/2016
G00876,Rubi Tyson,6504 Spring St,Philadelphia,PA,19104,08/13/2016


### Modified List Input 
#### Space-delimited data that include nonstandard date values
* YEARCUTOFF = on the options statement
* Colon modifier
* Ampersand modifier
* STRIP function

Scenario: Read dates that fall in the 18th century 
using the YEARCUTOFF= option, which defines the beginning 
of the 100-year period to which two-digit years are assigned.

In SAS 9.4, the SAS default value for this option is 1926.

You use this option when your date variable contains
a 2-digit year value (e.g., 78 instead of 1778) and 
the year values are outside of the 100-year span from
1926 to 2025 that is implied by the SAS default option 
YEARCUTOFF=1926. 

In the example below, we read into SAS 
the dates when four states joined the Union.  
Since these dates are outside of the default 100-year span
(1926-2025), we need to override the default option by 
using the option YEARCUTOFF=1720 to ensure that all the
dates we are reading fall in the years 1720 to 1819.

In [35]:
*Ex6_Formated_Input_Dates.sas (Part 7);
options yearcutoff=1720;
data yc;
   INPUT state_name  & $22. date_entry :mmddyy.; 
   FORMAT date_entry mmddyy10.;
DATALINES;
Delaware  12/07/87
Pennsylvania  12/12/87
New Jersey  12/18/87
South Carolina  05/23/88
;
proc print data=yc noobs; 
run;

state_name,date_entry
Delaware,12/07/1787
Pennsylvania,12/12/1787
New Jersey,12/18/1787
South Carolina,05/23/1788
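
The effect of YEARCUTOFF= can be checked quickly with the INPUT function. The sketch below converts the same two-digit-year date under two settings and writes the results to the log; the variable name `d` is arbitrary. It also resets the option at the end, since a system option stays in effect until it is changed or the session ends.

```sas
* Sketch only: the same text value is converted under two YEARCUTOFF settings;
options yearcutoff=1720;
data _null_;
  d = input('12/07/87', mmddyy8.);
  put 'YEARCUTOFF=1720: ' d date9.;
run;

* Restore the SAS 9.4 default so later steps are not affected;
options yearcutoff=1926;
data _null_;
  d = input('12/07/87', mmddyy8.);
  put 'YEARCUTOFF=1926: ' d date9.;
run;
```

The first step should write 07DEC1787 and the second 07DEC1987.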


### Modified List Input
#### Comma delimited data that include standard and nonstandard numeric data values as well as missing values

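* DSD option on the INFILE statement (the DLM= option is not needed, because the default delimiter with DSD is a comma)

The DSD option can 
* treat two consecutive delimiters as a missing value
* remove quotation marks from strings and treat any delimiter inside the quotation marks as a valid character

Note that the two-digit year in "05/14/14" is printed as 1814 in the output below because the option YEARCUTOFF=1720 from the previous example remains in effect for the rest of the SAS session.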



In [36]:
*Ex9_DLM_DSD_MISSOVER.sas (Part 3);
data DSD_data_X;
infile datalines  DSD;
input airport: $3. flight :8. airlines :$8. date :mmddyy10.; 
format date date9.;
datalines;
IDA, 972, Spirit, "05/14/14"
DCA,617,,"05/18/2018"
;
proc print data=DSD_data_X;
run;

Obs,airport,flight,airlines,date
1,IDA,972,Spirit,14MAY1814
2,DCA,617,,18MAY2018
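
The treatment of consecutive delimiters is easiest to see with a single numeric record. The sketch below reads the same line with DLM=',' alone and with DSD; the data set and variable names are hypothetical.

```sas
* Sketch only: data set and variable names are hypothetical;
* Without DSD, two consecutive commas count as one delimiter;
data work.no_dsd;
  infile datalines dlm=',' missover;
  input a b c;
datalines;
1,,3
;

* With DSD, two consecutive commas mark a missing value;
data work.with_dsd;
  infile datalines dsd missover;
  input a b c;
datalines;
1,,3
;
proc print data=work.no_dsd noobs; run;
proc print data=work.with_dsd noobs; run;
```

Without DSD the values should be a=1, b=3, c=missing; with DSD they should be a=1, b=missing, c=3.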


### Modified List Input

#### Comma delimited data that include character data values of more than 8 bytes and embedded blanks

* In the example code below, the ~ (tilde) format modifier enables you to read character values that contain delimiters within double quotation marks and to retain the quotation marks as part of the value. 

* The DSD option on the INFILE statement must be used to get the desired effect of this format modifier.

In [37]:
*Ex9_DLM_DSD_MISSOVER.sas (Part 4);
  DATA Work.Quotation_Surrounded_Values;   
    INFILE DATALINES DSD;
    INPUT st_name ~ $33. percent_pop18p ;
 DATALINES;
 "Alabama, The Yellowhammer State", 77.0
 "California, The Golden State",  76.1
 ;
 PROC PRINT;RUN;

Obs,st_name,percent_pop18p
1,"""Alabama, The Yellowhammer State""",77.0
2,"""California, The Golden State""",76.1
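
For contrast, the sketch below reads the same kind of record with the colon modifier instead of the tilde, so that DSD removes the quotation marks. The data set name is hypothetical.

```sas
* Sketch only: the data set name is hypothetical;
data work.quotes_removed;
  infile datalines dsd;
  * Colon modifier: DSD strips the quotation marks but keeps the embedded comma;
  input st_name :$33. percent_pop18p;
datalines;
"Alabama, The Yellowhammer State",77.0
"California, The Golden State",76.1
;
proc print data=work.quotes_removed noobs;
run;
```

Here `st_name` should contain the text without the surrounding quotation marks, for example Alabama, The Yellowhammer State.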


### Modified List Input

#### Space delimited data that include character data values (with embedded blanks) as well as both standard and nonstandard numeric data values

* In the example code below, the & modifier after the variable ST_NAME indicates that its value should be read until two consecutive blanks are encountered. 

* In the data, there are two blanks instead of one after each of the data values Alabama, California, and District of Columbia; the two blanks are required.

* Also note the COMMA. informat for the variable pop. The rule is that you do not need to specify a w value for a numeric informat when modifying list input with the colon (:) modifier (SAS Certification Preparation Guide: Base Programming for SAS® 9, Third Edition, page 552).


In [38]:
*Ex10_Modified_List_Input.sas;
OPTIONS nodate nonumber ps=58 ls=98;
  DATA work.Have1;   
    INPUT st_name & $20. pop :comma. percent_pop18p ;
    FORMAT pop comma10.;
     DATALINES;
    Alabama  4,833,722  77
    California  38,332,521 76.1
    District of Columbia  646,449 82.8
  ;
  PROC PRINT data=HAVE1 noobs;  RUN;

st_name,pop,percent_pop18p
Alabama,4833722,77.0
California,38332521,76.1
District of Columbia,646449,82.8


### Modified List Input
#### Comma-delimited data that include date values
* The single question mark (?) format modifier in the INPUT statement below suppresses the invalid data message.
* The second data record has the invalid data in the “date” field.

In [39]:
*Ex11_Question_marks.sas (Part 1);
data temp2;
   infile datalines DLM = ',';
   input date ? :mmddyy.  copay_amount;
    format date mmddyy10.;
datalines;
10/05/2004,25
02/29/2015,25
;
proc print data=temp2; run;

Obs,date,copay_amount
1,10/05/2004,25
2,.,25


### Modified List Input
#### Comma delimited data that include invalid date values
The ?? format modifier also suppresses the invalid data message and, in addition, prevents the automatic variable _ERROR_ from being set to 1 when invalid data are read. [See SAS® Documentation for details]

In [40]:
*Ex11_Question_marks.sas (Part 2);
data temp3;
   infile datalines DLM = ',';
   input date ?? :mmddyy.  copay_amount;
   format date mmddyy10.;
datalines;
10/05/2004,25
02/29/2015,25
;
proc print data=temp3; run;

Obs,date,copay_amount
1,10/05/2004,25
2,.,25
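
The ?? modifier can also be used inside the INPUT function. The sketch below is not part of the original example files: it reads the date field as text, converts it with `input(raw_date, ?? mmddyy10.)`, and flags the records whose date could not be converted. The names `temp4`, `raw_date`, and `bad_date` are made up for this illustration.

```sas
* Sketch: keep the raw text and flag dates that fail to convert;
data temp4;                               /* hypothetical data set name */
   infile datalines DLM = ',';
   input raw_date :$10. copay_amount;     /* read the date field as character */
   date = input(raw_date, ?? mmddyy10.);  /* ?? suppresses the log note and keeps _ERROR_ at 0 */
   bad_date = missing(date);              /* 1 when the conversion produced a missing value */
   format date mmddyy10.;
datalines;
10/05/2004,25
02/29/2015,25
;
proc print data=temp4 noobs; run;
```

Keeping the raw text next to the flag makes it easier to review which values were rejected.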


### Modified List Input
### Space delimited data with embedded blanks in the data fields (multiple records per observation)
* In the example code below, the & modifier after a variable name indicates that the variable's value should be read until two consecutive blanks are encountered.
* The / line-pointer control advances the pointer to column 1 of the next input record.


In [41]:
*Ex13_Line_Pointer_controls.sas (Part 1);
options nocenter ls=132 nodate nonumber;
data address1;
      input name  & $ 30.
          /subname  & $ 20.
          /st_address1  & $ 30.
          /st_address2  & $ 30.
          /phone $ 14.;
datalines;
Air Force Personnel Center
HQ AFPC/DPSSRP
550 C Street West
Randolph AFB, TX 78150
1-800-525-0102
Navy Personnel Command
(PERS-312E)
5720 Integrity Drive
Millington, TN 38055
901-874-4885
;
proc print data= address1 noobs; run;

name,subname,st_address1,st_address2,phone
Air Force Personnel Center,HQ AFPC/DPSSRP,550 C Street West,"Randolph AFB, TX 78150",1-800-525-0102
Navy Personnel Command,(PERS-312E),5720 Integrity Drive,"Millington, TN 38055",901-874-4885


### Modified List Input
### Data and note for the SAS code in the next cell
* Space delimited data with embedded blanks and multiple records per observation
* In the example code below, the & modifier after a variable name indicates that the variable's value should be read until two consecutive blanks are encountered.
* The #n line-pointer control moves the pointer to column 1 of record n. Because the INPUT statement below has no #2, the second record of each observation (the subname line) is not read into any variable.

In [42]:
*Ex13_Line_Pointer_controls.sas (Part 2);
*Multiple records per observation using the pound (#) sign;
data address2;
   infile datalines ;
   input name  & $ 30.
         #3 st_address1  & $ 30.
         #4 st_address2  & $ 30. 
         #5 phone $ 14.;
datalines;
Air Force Personnel Center
HQ AFPC/DPSSRP
550 C Street West
Randolph AFB, TX 78150
1-800-525-0102
Navy Personnel Command
(PERS-312E)
5720 Integrity Drive
Millington, TN 38055
901-874-4885
;
proc print data= address2 noobs; run;


name,st_address1,st_address2,phone
Air Force Personnel Center,550 C Street West,"Randolph AFB, TX 78150",1-800-525-0102
Navy Personnel Command,5720 Integrity Drive,"Millington, TN 38055",901-874-4885
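
As a further sketch of the #n line-pointer control, the step below reads only records 1, 2, and 5 of each five-record group. The highest #n value in the INPUT statement (here #5) determines how many records SAS reads per observation. The data set name `address3` is hypothetical.

```sas
* Sketch: read records 1, 2, and 5 and skip records 3 and 4;
data address3;                       /* hypothetical data set name */
   input #1 name    & $30.
         #2 subname & $20.
         #5 phone   $14.;            /* highest #n: five records per observation */
datalines;
Air Force Personnel Center
HQ AFPC/DPSSRP
550 C Street West
Randolph AFB, TX 78150
1-800-525-0102
Navy Personnel Command
(PERS-312E)
5720 Integrity Drive
Millington, TN 38055
901-874-4885
;
proc print data=address3 noobs; run;
```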


### Modified List Input

#### Space delimited data that include character data values as well as both standard and nonstandard numeric data values (with multiple observations per line)

* Use multiple INPUT statements. Normally, each INPUT statement in a DATA step reads a new data record into the input buffer; a trailing @ holds the current record for the next INPUT statement instead.

#### When you use a trailing @, the following things occur: 
* The pointer position does not change.
* No new record is read into the input buffer.
* The next INPUT statement for the same iteration of the DATA step continues to read the same record rather than a new one.


In [43]:
*Ex29_Multiple_Input_Statements.sas;
title ' ';
data work.HAVE(drop=i);
 input date: Anydtdte9. @;
 do i = 1 to 4;
 input name $ hours_studied @;
 label date= 'Date'
       name = "Student's name"
       hours_studied = 'Hours studied*for STAT 4197/6197';
 output;
 end;
datalines;
27Aug2018 Doris 5.5 Alice 4.0 Mike 2.0 James 1.0
28Jun2018 Doris 3.0 Alice 3.0 Mike 3.0 James 1.0
;
proc print data=work.HAVE noobs split='*';
Format date worddate.;
run;

Date,Student's name,Hours studied for STAT 4197/6197
"August 27, 2018",Doris,5.5
"August 27, 2018",Alice,4.0
"August 27, 2018",Mike,2.0
"August 27, 2018",James,1.0
"June 28, 2018",Doris,3.0
"June 28, 2018",Alice,3.0
"June 28, 2018",Mike,3.0
"June 28, 2018",James,1.0

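
A related technique, not used in the example above, is the double trailing @ (@@), which holds the record across iterations of the DATA step. The sketch below assumes data lines that contain only repeated name and hours pairs; the data set name `work.HAVE_pairs` is made up for this illustration.

```sas
* Sketch: double trailing @ reads several observations from each line;
data work.HAVE_pairs;                /* hypothetical data set name */
 input name $ hours_studied @@;      /* @@ holds the line across DATA step iterations */
datalines;
Doris 5.5 Alice 4.0 Mike 2.0
James 1.0 Doris 3.0
;
proc print data=work.HAVE_pairs noobs;
run;
```

Each DATA step iteration reads one name and hours pair, and SAS moves to the next line only when the current one is exhausted.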


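### Specifying the LENGTH Statement for Numeric Variables in the DATA Step
A variable's length (the number of bytes used to store it) is related to its type.

* Character variables can be up to 32,767 bytes long.
* All numeric variables have a default length of 8 bytes.
* By default, numeric values (no matter how many digits they contain) are stored as floating-point numbers in 8 bytes.
* A LENGTH statement can shorten the stored length of a numeric variable, but at the cost of precision. In the example below, y is stored in 3 bytes, which is too few to represent every integer from 9006 to 9010 exactly, so the odd values are truncated to the even value below them.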

In [1]:
*Ex23_Length.sas;
data temp;
length x 4 y 3 ;
     do x=9006 to 9010;
        y=x;
       output;
     end;
proc print data=temp noobs; run;

x,y
9006,9006
9007,9006
9008,9008
9009,9008
9010,9010
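
To see where the precision loss begins, the sketch below stores integers around 8,192 in a 3-byte numeric variable and writes the difference from the full-precision value to the log. It assumes a Windows or UNIX host, where 8,192 is the documented largest integer that a 3-byte numeric variable can represent exactly; other platforms have a different limit. The names `work.len_demo`, `y3`, and `diff` are hypothetical.

```sas
* Sketch: compare an 8-byte value with its 3-byte stored copy;
data work.len_demo;                  /* hypothetical data set name */
  length y3 3;                       /* y3 is stored in 3 bytes in the output data set */
  do x = 8190 to 8196;               /* x keeps the default length of 8 bytes */
    y3 = x;
    output;
  end;
run;

data _null_;
  set work.len_demo;
  diff = x - y3;                     /* nonzero where the stored y3 lost precision */
  put x= y3= diff=;
run;
```

The truncation happens when the value is written to the data set, not in the assignment itself, which is why the comparison is done in a second step.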


### Mix of Column Input, Formatted Input, and List Input Styles
#### Fixed column data that include character data values and nonstandard numeric data values
#### The INPUT statement below uses the following input styles to describe a record's values:
* column input
* formatted input
* list input

In [44]:
*Ex14_Column_Formatted_Input.sas;
data Mix_column_Formatted;    
input software $1-5 @9 date date9. amount :comma5.;     
format date date9. amount comma5.;      
datalines;                                                                                                                              
SAS     06jan1976   2,345       
Stata   05jan1998   1,560  
R       07jun1996   4,567  
;                                                                                                      
proc print data=Mix_column_Formatted noobs; 
run; 

software,date,amount
SAS,06JAN1976,2345
Stata,05JAN1998,1560
R,07JUN1996,4567


### Named Input
### Data records contain both variable names and values 
#### What to Specify on the Input Statement (Named Input)
* Each variable name must be followed by an equal sign on the INPUT statement
* Each variable name must be followed by an equal sign in the data as well
* Character variables must be indicated by a "$" following the equal sign on the INPUT statement
* Appropriate format modifiers and informats must be specified
* A "/" at the end of a data line tells SAS to continue reading the observation from the next data line


In [45]:
*Ex16_Named_Input.sas;
options nocenter nodate nonumber ls=132;
DATA TEST;
input name = & $ 30. address = & $ 30.
      city_zip  = & $ 30. phone= $ 14.
      Num_employees = ;
      FORMAT Num_employees comma7.;    
DATALINES;
name=Air Force Personnel Center /
address=550 C Street West /
city_zip=Randolph AFB, TX 78150 /
phone=1-800-525-0102 /
Num_employees=5876 
name= Navy Personnel Command /
address= 5720 Integrity Drive /
city_zip= Millington, TN 38055 /
phone= 901-874-4885 /
Num_employees=3987 
;
proc print data=TEST noobs; 
run; 

name,address,city_zip,phone,Num_employees
Air Force Personnel Center,550 C Street West,"Randolph AFB, TX 78150",1-800-525-0102,5876
Navy Personnel Command,5720 Integrity Drive,"Millington, TN 38055",901-874-4885,3987
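
When every variable fits on one data line, named input does not need the "/" continuation, and the variables can appear in any order in the data. The sketch below uses made-up variable names and values (`name`, `age`, `city`) and the hypothetical data set name `TEST2`.

```sas
* Sketch: named input with one record per observation;
DATA TEST2;                          /* hypothetical data set name */
input name=$ age= city=$;            /* character variables default to a length of 8 */
DATALINES;
name=Ana age=31 city=Boston
city=Denver name=Raj age=45
;
proc print data=TEST2 noobs;
run;
```

The order of the variables in the output data set follows the INPUT statement, not the order in which they appear in the data lines.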


[Turning external files into SAS® data sets: common problems and their solutions by Amber Elam - A must-read article from SAS Blogs](https://blogs.sas.com/content/sgf/2021/02/18/turning-text-files-into-sas-data-sets-6-common-problems-and-their-solutions/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+TheSasTrainingPost+%28The+SAS+Learning+Post+-%3E+SAS+Users%29)