# Assignment 1 
# Bank Marketing Case Study: loading and merging data
### Learning outcomes
1. Load data from input files in various formats to combine information from many data domains and sources
2. Rename columns and convert column types from character to numeric to prepare for merging
3. Merge SAS datasets to obtain a data warehouse ready for analysis

### Introduction 
The head of Marketing wants to know which customers have the highest propensity for buying a Certificate of Deposit (CD) from the institution. The goal of this assignment is to create part of an analytical data mart by combining information from many data domains and sources. 


#### Q1. Load data from customer_banking_info_promo.xlsx
- Define the library name "mylib" and specify its location using libname
- Use proc import with DATAFILE= to import customer_banking_info_promo.xlsx into a SAS dataset named customer_banking_info_promo under mylib
- Print the first five rows of the dataset by adding (obs=5) after the dataset name in proc print


In [1]:
libname mylib '/folders/myfolders/Assignments';

PROC IMPORT DATAFILE="data/customer_banking_info_promo.xlsx"
        OUT=mylib.customer_banking_info_promo
        DBMS=XLSX
        REPLACE;

RUN;

proc print data=mylib.customer_banking_info_promo (obs=5);
run;






Obs,customer_id2,contact,day,month,duration,campaign,pdays,previous,poutcome,y
1,122482,cellular,22,aug,229,2,-1,0,unknown,no
2,119725,cellular,7,aug,125,2,-1,0,unknown,no
3,103490,unknown,15,may,68,2,-1,0,unknown,no
4,126218,cellular,19,nov,517,2,187,3,failure,no
5,104835,unknown,20,may,165,2,-1,0,unknown,no


#### Q2. Examine the variable Customer ID. Check the type and format. 
- Use the proc contents procedure to examine the variables and their types. It also prints further details about the dataset.

ref: http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#p120panelmbpren1m0j2n77s9f67.htm

or 

https://www.cpc.unc.edu/research/tools/data_analysis/sastopics/contents
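
Besides proc contents, the type and length of each variable can also be looked up in the `dictionary.columns` table through proc sql. The following is a minimal sketch, assuming the dataset has already been imported into `mylib` as in Q1. Library and member names are stored in upper case in the dictionary tables, so the `where` clause uses upper-case literals.

```sas
/* Sketch: list name, type, length and format of every column.        */
/* Assumes mylib.customer_banking_info_promo exists (created in Q1).  */
proc sql;
    select name, type, length, format
    from dictionary.columns
    where libname = 'MYLIB'
      and memname = 'CUSTOMER_BANKING_INFO_PROMO';
quit;

/* The varnum option lists variables in creation order instead of alphabetically. */
proc contents data=mylib.customer_banking_info_promo varnum;
run;
```

In the proc sql result, `type` is reported as `char` or `num`, which is the attribute that matters for the later merge.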

In [2]:

PROC CONTENTS DATA=mylib.customer_banking_info_promo;
RUN;

Data Set Name,MYLIB.CUSTOMER_BANKING_INFO_PROMO,Observations,10578
Member Type,DATA,Variables,10
Engine,V9,Indexes,0
Created,09/20/2019 15:55:05,Observation Length,72
Last Modified,09/20/2019 15:55:05,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,NO
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information
Data Set Page Size,65536
Number of Data Set Pages,12
First Data Page,1
Max Obs per Page,908
Obs in First Data Page,864
Number of Data Set Repairs,0
Filename,/folders/myfolders/Assignments/customer_banking_info_promo.sas7bdat
Release Created,9.0401M6
Host Created,Linux
Inode Number,323

Alphabetic List of Variables and Attributes
#,Variable,Type,Len,Format,Informat,Label
6,campaign,Num,8,BEST.,,campaign
2,contact,Char,9,$9.,$9.,contact
1,customer_id2,Char,6,$6.,$6.,customer_id2
3,day,Num,8,BEST.,,day
5,duration,Num,8,BEST.,,duration
4,month,Char,3,$3.,$3.,month
7,pdays,Num,8,BEST.,,pdays
9,poutcome,Char,7,$7.,$7.,poutcome
8,previous,Num,8,BEST.,,previous
10,y,Char,3,$3.,$3.,y


#### Q3. Column deletion/renaming


Look at the description of the different columns here: https://archive.ics.uci.edu/ml/datasets/bank+marketing 

__duration__: last contact duration, in seconds (numeric). 
Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

- Within a data step, perform the following:
    - Keep the output dataset name the same as the input (customer_banking_info_promo)
    - Rename "customer_id2" to customer_id
    - Drop the column "duration" from the dataset
- Print the first 5 observations in the dataset

References:
- rename option: https://newonlinecourses.science.psu.edu/stat481/node/17/
- drop option: https://newonlinecourses.science.psu.edu/stat481/node/15/
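
When `rename=` and `drop=` are used together as dataset options, SAS applies `drop=` before `rename=`, so `drop=` (and `keep=`) must refer to the original variable names. A small self-contained sketch on made-up data follows; the dataset `work.toy` and its values are hypothetical and not part of the assignment data.

```sas
/* Hypothetical toy data, only for illustrating the options. */
data work.toy;
    input id_old $ duration campaign;
    datalines;
A1 120 2
B2 45 1
;
run;

/* drop= uses the original name (duration); rename= then maps id_old to id. */
data work.toy_clean;
    set work.toy (drop=duration rename=(id_old=id));
run;

proc print data=work.toy_clean;
run;
```

The printed dataset should contain only the columns `id` and `campaign`.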

In [3]:
DATA mylib.customer_banking_info_promo;
set mylib.customer_banking_info_promo(rename=(customer_id2=customer_id) drop=duration);
run; 

proc print data=mylib.customer_banking_info_promo (obs=5);
run;

Obs,customer_id,contact,day,month,campaign,pdays,previous,poutcome,y
1,122482,cellular,22,aug,2,-1,0,unknown,no
2,119725,cellular,7,aug,2,-1,0,unknown,no
3,103490,unknown,15,may,2,-1,0,unknown,no
4,126218,cellular,19,nov,2,187,3,failure,no
5,104835,unknown,20,may,2,-1,0,unknown,no


#### Q4. Load data from customer_banking_info.csv 

Load the data and print the first five rows.
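
For CSV files, proc import guesses each column's type and length from the rows it scans. The sketch below shows two optional statements that make this explicit: `getnames=yes` reads variable names from the first row, and `guessingrows=max` scans the whole file before deciding on types. The output name `work.banking_info_check` is a hypothetical scratch dataset, chosen so that the assignment dataset is not overwritten.

```sas
proc import datafile='data/customer_banking_info.csv'
    dbms=csv
    out=work.banking_info_check   /* hypothetical scratch dataset */
    replace;
    getnames=yes;       /* first row holds the column names */
    guessingrows=max;   /* scan all rows when guessing types and lengths */
run;
```

Scanning every row is slower on large files, but it reduces the chance that an ID column is truncated or typed incorrectly.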


In [1]:
PROC IMPORT DATAFILE='data/customer_banking_info.csv'
    DBMS=CSV
    OUT=mylib.customer_banking_info
    REPLACE;    
RUN;

proc print data = mylib.customer_banking_info (obs = 5) noobs;
run;





#### Q5. Renaming columns

- Use proc contents to examine the list of variables as before. You will see that customer_id1 is numeric with length 8. This is important to check because this column will be used to merge the datasets.
- Within a data step, perform the following:
    - Keep the output dataset name the same as the input dataset name (customer_banking_info)
    - Rename "customer_id1" to customer_id
- Print the first 5 observations in the dataset

In [5]:
* use proc contents here; 
PROC CONTENTS DATA=mylib.customer_banking_info;
RUN;

Data Set Name,MYLIB.CUSTOMER_BANKING_INFO,Observations,10578
Member Type,DATA,Variables,5
Engine,V9,Indexes,0
Created,09/20/2019 15:55:10,Observation Length,32
Last Modified,09/20/2019 15:55:10,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,NO
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information
Data Set Page Size,65536
Number of Data Set Pages,6
First Data Page,1
Max Obs per Page,2038
Obs in First Data Page,1962
Number of Data Set Repairs,0
Filename,/folders/myfolders/Assignments/customer_banking_info.sas7bdat
Release Created,9.0401M6
Host Created,Linux
Inode Number,337

Alphabetic List of Variables and Attributes
#,Variable,Type,Len,Format,Informat
3,balance,Num,8,BEST12.,BEST32.
1,customer_id1,Num,8,BEST12.,BEST32.
2,default,Char,3,$3.,$3.
4,housing,Char,3,$3.,$3.
5,loan,Char,3,$3.,$3.


In [6]:

* code to rename columns here and print; 
data mylib.customer_banking_info ;
    set mylib.customer_banking_info (rename=(customer_id1=customer_id))  ;
run;

proc print data = mylib.customer_banking_info (obs = 5) noobs;
run;

customer_id,default,balance,housing,loan
122482,no,347,no,no
119725,no,3462,no,no
103490,no,157,yes,no
126218,no,3689,yes,no
104835,no,0,yes,yes


#### Q6. SAS data from customer_demographics.sas7bdat

- Print the first 5 rows of customer_demographics.sas7bdat
- Use proc contents and examine the list of variables. What is the type of customer_id?
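
As an alternative to quoting the physical file name in a set statement, a SAS dataset file can be read through a library reference that points at its folder. This is a minimal sketch, assuming the file sits in the relative folder `data` as in the earlier questions; the libref `src` is a hypothetical name.

```sas
/* Hypothetical libref for the folder holding customer_demographics.sas7bdat. */
libname src 'data' access=readonly;

proc print data=src.customer_demographics (obs=5) noobs;
run;

proc contents data=src.customer_demographics;
run;
```

The `access=readonly` option guards the source file against accidental overwrites.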

In [7]:
* code to print here; 
DATA mylib.cus_demographic; 
  set 'data/customer_demographics.sas7bdat'; 
RUN; 
proc print data = mylib.cus_demographic (obs = 5) noobs;
run;


Education,customer_id,AGE,marital,JOB
secondary,100103,33,married,entrepreneur
tertiary,100106,35,married,management
primary,100118,57,married,blue-collar
primary,100119,60,married,retired
secondary,100121,28,married,blue-collar


In [8]:
* use proc contents here; 
proc contents data= mylib.cus_demographic;
run;

Data Set Name,MYLIB.CUS_DEMOGRAPHIC,Observations,10578
Member Type,DATA,Variables,5
Engine,V9,Indexes,0
Created,09/20/2019 15:55:13,Observation Length,48
Last Modified,09/20/2019 15:55:13,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,NO
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information
Data Set Page Size,65536
Number of Data Set Pages,8
First Data Page,1
Max Obs per Page,1360
Obs in First Data Page,1309
Number of Data Set Repairs,0
Filename,/folders/myfolders/Assignments/cus_demographic.sas7bdat
Release Created,9.0401M6
Host Created,Linux
Inode Number,341

Alphabetic List of Variables and Attributes
#,Variable,Type,Len,Format,Label
3,AGE,Num,8,F4.,AGE
1,Education,Char,9,$CHAR9.,Education
5,JOB,Char,14,$CHAR14.,JOB
2,customer_id,Num,8,,
4,marital,Char,8,$CHAR8.,marital


#### Q7. Convert from character to numeric type

Before merging multiple datasets, the common column between the datasets should be of the same type.  
In customer_banking_info_promo, customer_id is defined as character. You are given sample data step code to run:
- The output dataset is named customer_banking_info_promocv
- To convert customer_id to a numeric variable, we use the input function

reference: http://support.sas.com/kb/24/590.html 
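
The input function reads a character value with a numeric informat and returns a number. The sketch below illustrates this on made-up values; the dataset `work.ids` is hypothetical. The `??` modifier is an optional addition that suppresses the log messages and the error flag when a value cannot be read, in which case the result is missing.

```sas
/* Hypothetical toy data: two valid IDs and one invalid value. */
data work.ids;
    input id_char $;
    datalines;
122482
119725
12AB
;
run;

data work.ids_num;
    set work.ids;
    id_num = input(id_char, ?? 8.);   /* 12AB cannot be read and becomes missing (.) */
    type_of_id_num = vtype(id_num);   /* 'N' for numeric, 'C' for character */
run;

proc print data=work.ids_num;
run;
```

Counting missing values in the converted column afterwards is a quick way to confirm that every ID was converted.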

In [9]:
data mylib.customer_banking_info_promocv;
   set mylib.customer_banking_info_promo;
   num_cusId = input(customer_id, 8.);  
   drop customer_id; 
   rename  num_cusId=customer_id;    
run;
 

- Check the customer_id variable type again by using proc contents, or use proc means, which summarizes only the numeric variables

In [10]:
*check type again;
proc means data=mylib.customer_banking_info_promocv;
Run;

Variable,Label,N,Mean,Std Dev,Minimum,Maximum
day,day,10578,15.4758934,8.4137946,1.0000000,31.0000000
campaign,campaign,10578,2.4747589,2.6151781,1.0000000,50.0000000
pdays,pdays,10578,51.9548119,109.3471124,-1.0000000,854.0000000
previous,previous,10578,0.8525241,3.4721156,0,275.0000000
customer_id,,10578,127278.17,13660.22,100103.00,145309.00


#### Q8. Data Merging
- Join the three sources of data into a single SAS dataset.
    - Sort each of the datasets by customer_id
    - Merge the three datasets using the merge statement within a data step. Name the new dataset "customer_all"
    - Print the first five observations

Refer to
https://newonlinecourses.science.psu.edu/stat481/node/28/
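
The sort-then-merge pattern can be sketched on two small made-up datasets (`work.a` and `work.b` are hypothetical). The `in=` dataset option is an optional addition that flags which input contributed to each row, which helps spot customer IDs that are missing from one source.

```sas
/* Hypothetical toy data, deliberately unsorted and with non-matching IDs. */
data work.a;
    input customer_id balance;
    datalines;
3 50
1 200
2 75
;
run;

data work.b;
    input customer_id age;
    datalines;
2 41
1 33
4 58
;
run;

/* A match-merge requires both inputs to be sorted by the BY variable. */
proc sort data=work.a; by customer_id; run;
proc sort data=work.b; by customer_id; run;

data work.ab;
    merge work.a (in=in_a) work.b (in=in_b);
    by customer_id;
    from_a = in_a;   /* 1 if work.a contributed to this row, else 0 */
    from_b = in_b;   /* 1 if work.b contributed to this row, else 0 */
run;

proc print data=work.ab noobs;
run;
```

The result has one row per customer_id (1 to 4). ID 3 has a missing age and ID 4 has a missing balance, because a match-merge keeps rows that appear in only one input.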


In [14]:
* code for merging goes here; 
proc sort data=mylib.customer_banking_info out=mylib.cus_banking_info_sorted;
    by customer_id;
run;
proc sort data=mylib.customer_banking_info_promocv out=mylib.cus_bank_info_promocv_sorted;
    by customer_id;
run;
proc sort data=mylib.cus_demographic out=mylib.cus_demographic_sorted;
    by customer_id;
run;

* combine and name the new dataset as customer_all; 
data mylib.customer_all;
   merge mylib.cus_banking_info_sorted 
         mylib.cus_bank_info_promocv_sorted
         mylib.cus_demographic_sorted;
   by customer_id;
   
run;

In [13]:
* print data; 
proc print data=mylib.customer_all (obs=5);
run;

Obs,customer_id,default,balance,housing,loan,contact,day,month,campaign,pdays,previous,poutcome,y,Education,AGE,marital,JOB
1,100103,no,2,yes,yes,unknown,5,may,1,-1,0,unknown,no,secondary,33,married,entrepreneur
2,100106,no,231,yes,no,unknown,5,may,1,-1,0,unknown,no,tertiary,35,married,management
3,100118,no,52,yes,no,unknown,5,may,1,-1,0,unknown,no,primary,57,married,blue-collar
4,100119,no,60,yes,no,unknown,5,may,1,-1,0,unknown,no,primary,60,married,retired
5,100121,no,723,yes,yes,unknown,5,may,1,-1,0,unknown,no,secondary,28,married,blue-collar


End of Assignment 1