A Stata module that extends the default
import excel command and adds support on batch processing of excel files.
- load all excel files in the specified directory
- organize variable names and labels
- reshape the dataset if necessary
- integrate all files into a cleaned dataset.
The major improvement is on handling variable names and automating label generations. The module can manage most name exceptions in batch processing, including long variable names, multiline names, line breaks, special characters, etc. It generates variable names that can easily work with the
reshape command and provides
integrate options to automate the entire process.
net install excelclean, from("https://github.com/hanlulong/excelclean/raw/master")
Put all excel files of similar kinds into the same directory
excelclean datadir(string) sheet(string) cellrange(string) [, options]
After installing the package, run
excelclean datadir(string) sheet(string) cellrange(string) [, options] datadir(string) directory where excel files are stored. Please close and save all excel files under this directory before executing the command. e.g., datadir("c:/myplace/") sheet(string) the excel sheet in each excel file to be loaded into Stata. e.g., sheet("sheet1") or sheet("Results") cellrange(string) the range of data cells on each excel sheet to be loaded into Stata. e.g., cellrange("A1") to extract all cells; cellrange("B3") to extract cells starting from the second column and the third row.
integrate integrate all datasets into a single dta file. The default is to save each dta file separately using the name of the corresponding excel file. pivot reshape variables from a wide format to a long format. By default, the program recogonizes the last word from the formulated variable names as the time indicator. The program will detect the variables that need to be reshaped into a long format, e.g. Var2000, Var2001, Var2002 -> Var droplist(string) drop redundant variables from the dataset in the data integration process. It helps to reduce the file size and processing time. Separate variable names by a space " ". e.g., droplist("Var1 Var2 "). Remember to leave a space at the end of the last variable. resultdir(string) specify the directory where the results are saved. The default is the directory "datadir" where the excel files are stored. e.g., resultdir("C:/myresultdir/"). Always use the full path of the directory to aviod possible conflicts. extension(string) specify the extension of files to be included. The default is "xlsx". e.g., extension("xls") namerange(integer) specify the rows that record variables names. The default is the first row. e.g., namelines(4) to indicate that the first four rows contain information about variable names. Note that the first four lines will be deleted from the dataset after creating the variable names. namelines(string) select the rows to formulate variables names. The default is the first row. e.g., namelines("1 3") to specify the first and the third rows to be used to formulate variable names wordfilter(string) specify specific characters to be excluded from variable names. e.g., wordfilter("year quarter the"); to exclude the space before any word use wordfilter(`"" word1" " word2""')