I made a script for searching .xls, .xlsx, and .xlsm files by keywords/phrases. It scans the content of both old (.xls) and new (.xlsx, .xlsm) Excel formats for an unlimited number of strings/phrases/keywords.
- Fast - uses parallel processing; with 4 search keywords and ~1000 Excel files it took only 180 s for me
- Configurable - specify the minimum file size, input filename, output filename, and any number of keywords, all in a single config file :)
- Excel-friendly - creates a CSV file that opens in Excel with UTF-8 characters intact
- Reads the file list from files_to_search.efu, which is basically a CSV file (.efu is the Everything save-file format)
- Removes all files with no matched keywords from the results

This was built with Windows in mind, but it works on Unix too if you supply the file list in the specified format.
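The parallel search can be sketched roughly like this. This is a minimal illustration, not the script's actual code: `FAKE_FILES`, `search_one`, and `search_all` are hypothetical names, the file "contents" are in-memory stand-ins, and the real script may use processes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-in for parsed workbook text; the real script
# extracts this from each .xls/.xlsx/.xlsm file on disk.
FAKE_FILES = {
    "report.xlsx": "quarterly project summary",
    "notes.xls": "misc notes",
}

KEYWORDS = ["project", "maybe"]

def search_one(path):
    """Return (path, {keyword: found?}) for a single file."""
    text = FAKE_FILES[path].lower()
    return path, {kw: kw in text for kw in KEYWORDS}

def search_all(paths, workers=4):
    """Search many files in parallel and collect the per-file results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(search_one, paths))
```

Each worker handles one file at a time, so the total wall-clock time scales with the slowest files rather than the sum of all of them.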
Requirements:
- Python with pandas (in requirements.txt)
- Everything from voidtools, https://www.voidtools.com/support/everything/ (https://en.wikipedia.org/wiki/Everything_(software))

Create a "csv" file by dumping an Everything search with filenames and their sizes, using exactly this header: 'Filename,Size'.
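For illustration, that file list can be loaded and size-filtered with pandas like this. This is a sketch over in-memory data: the paths and the 5000-byte minimum are made-up examples; only the 'Filename,Size' header comes from the format above.

```python
import io

import pandas as pd

# In-memory stand-in for files_to_search.efu: a CSV whose first
# two columns are 'Filename,Size', as described above.
efu_text = (
    "Filename,Size\n"
    "A:\\docs\\report.xlsx,120000\n"
    "A:\\tmp\\scratch.xls,800\n"
)

df = pd.read_csv(io.StringIO(efu_text))
kept = df[df["Size"] >= 5000]  # drop files below the configured minimum size
```

In the real run you would pass the .efu path to `pd.read_csv` instead of a `StringIO` buffer.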
- Make sure Python and Everything are installed, then run
pip install -r requirements.txt
- Download the script and search_config.txt
- Open search_config.txt
- Add your key phrases after line 19, i.e. after the last // line (the // lines are comments)
- Open the Everything tool.
- Search for .xls (or \.xls if you have regex turned on) and sort the results by path.
  - This returns all kinds of Excel files: .xls, .xlsx, and .xlsm, plus some files with .xls in the middle of the filename, such as Excel autosaves and backups from Excel-like tools.
- Save the results as files_to_search.efu. You can open it in Notepad and delete the lines where there is definitely nothing of interest; the files are sorted by path and name.
- Put files_to_search.efu in the same folder as the program.
- Make sure you saved search_config.txt.
- (For Windows) Open the console by typing 'cmd' in the File Explorer address bar.
- Run the program from the console:
python .\search.py
- Wait for the execution to complete.
- Open output_results.csv in Excel or any text editor and work with the results.
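As an illustration of what happens for each file, a workbook can be checked for keywords with pandas roughly like this. This is a sketch, not the script's actual implementation; `check_workbook` is a hypothetical name.

```python
import pandas as pd

def check_workbook(path, keywords):
    """Flatten every sheet to lowercase text and test each keyword."""
    # sheet_name=None loads all sheets; header=None keeps every row as data.
    sheets = pd.read_excel(path, sheet_name=None, header=None)
    text = " ".join(
        str(cell).lower()
        for frame in sheets.values()
        for cell in frame.to_numpy().ravel()
    )
    return {kw: kw.lower() in text for kw in keywords}
```

Note that `pd.read_excel` needs openpyxl for .xlsx/.xlsm files and a legacy reader such as xlrd for old .xls files, which is why they are listed in requirements.txt-style dependencies.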
Example run, using the search_config.txt in the repo. My files_to_search.efu has 1057 entries (you can delete the ones in certain dirs, like trash). The whole search took about 184 s; in the log below, "Time total" is the total elapsed time in seconds:
```
reading from: files_to_search.efu, exporting to output_csv.csv
files bigger then 5000
with keywords: ['32', 'project', 'other project', 'third project', 'maybe']
removed 174 that are smaller then 5000
using 32 workers
1/883 files (0.00% done). Time total 0.08 $R59GSHE.xls
2/883 files (0.01% done). Time total 0.11 $RG6HYYA.xls
...
461/883 files (48.74% done). Time total 31.52 ***.xls
882/883 files (97.10% done). Time total 91.13 Розклад+.xls
883/883 files (100.00% done). Time total 183.74 темп.xlsx
search done
saving
done
Total files with hits 708, files with errors 5, total lines 713
```
The resulting CSV looks like this:

```
"sep=;"
Filepath;Error_msg;32;project;other project;third project;maybe;Found
A:\censored_name.xlsx;;True;True;False;False;False;2
C:\censored_name.xlsx;;True;True;False;False;False;2
A:\$RECYCLE.BIN\...\$R59GSHE.xls;;True;False;False;False;False;1
```
Errors look like this:

```
C:\Program Files\Microsoft Office\root\vfs\ProgramFilesX86\Microsoft Office\Office16\DCF\SyncFusion.XlsIO.Base.dll;File is not a zip file;False;False;False;False;False;0
```
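The Excel-friendly CSV shown above can be produced like this. A minimal sketch under assumed names: the row data and the reduced column list are illustrative; the real script writes one row per searched file with a column per keyword.

```python
import csv

# One illustrative result row, shaped like the sample output above.
rows = [
    {"Filepath": r"A:\example.xlsx", "Error_msg": "",
     "project": True, "Found": 1},
]

# utf-8-sig writes a BOM so Excel detects the encoding; the leading
# "sep=;" line tells Excel to split fields on semicolons.
with open("output_results.csv", "w", newline="", encoding="utf-8-sig") as f:
    f.write('"sep=;"\n')
    writer = csv.DictWriter(
        f, fieldnames=["Filepath", "Error_msg", "project", "Found"],
        delimiter=";")
    writer.writeheader()
    writer.writerows(rows)
```

Without the BOM, Excel on Windows tends to misread non-ASCII filenames (like the Cyrillic ones in the log above) as mojibake.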