# Filtering Down the Scheduled Task Data Overview

**Description of Dataset**
- Over 12,000 CSV files in directory
- Over 5 million rows of Scheduled Task Data
- 28 columns per row

**Objective:**
- Filter down the massive _Scheduled Task Dataset_ to a more manageable size

**Sections:**
1. Functions Used
2. Pre-Specified File Paths
3. Combining CSV Files
4. Counting Unique Tasks
5. Filtering Out Pre-Specified Tasks
6. Output / Verification

=========================================================================================================

### __SECTION 1:__ _Functions Used_ ###
__Note:__
- These functions are used throughout the script
- Can easily be repurposed for other projects
- Stored in Python as a _Function_

In [None]:
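## CREATES A LIST FROM A SINGLE FILE ##
def create_list_from_file(file_path):
    from csv import reader
    # 'with' closes the file even if reading fails; newline='' is what the csv module expects
    with open(file_path, newline='') as file:
        list_of_file = list(reader(file))[1:]  # drop the header row
    return list_of_file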

## USED TO FILTER THE DATASET FOR TASKS ##
def VanillaFilter(TaskName):
    for eachtask in VanillaTasks:
        if str(eachtask) in TaskName:
            return True
    return False

## CREATES A DATAFRAME FROM A DIRECTORY OF CSV FILES ##
def dataframe_from_directory(directory_path):
    import glob
    import pandas as pd
    files = glob.glob(directory_path)
    df = pd.concat([pd.read_csv(fp) for fp in files], ignore_index=True)
    return df

## COUNTS THE NUMBER OF INSTANCES OF EACH UNIQUE SCHEDULED TASK ##
def counting_unique_tasks(directory_path):
    # Reuses dataframe_from_directory (defined above) to load every CSV
    unfiltered_data = dataframe_from_directory(directory_path)
    # Drop stray header rows that were read in as data
    data = unfiltered_data[unfiltered_data.TaskName != 'TaskName']
    # value_counts sorts by count (descending); dropna=False keeps missing names
    task_counts = data['TaskName'].value_counts(dropna=False)
    print('Total Num. of Unique Tasks:', len(task_counts))
    for task_num, (key, count) in enumerate(task_counts.items(), start=1):
        print('Task', task_num, '---', 'Count:', count, '---', 'TaskName:', key)

=========================================================================================================

### __SECTION 2:__ _Pre-Specified File Paths (INPUT)_ ###
__Note:__
- These are the _only_ two inputs needed, other than determining what tasks to filter for
- This is the _only_ place where the user _must_ interact with the script

__Instructions:__ 
- Paste the filepaths directly between the apostrophes for 'directory_path' and 'file_path'
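- If the filepath is not working, try adding a lowercase letter 'r' in front of the first apostrophe
    - This makes it a raw string, so backslashes in Windows paths are not read as escape sequences (such as '\U' or '\t')
    - Example: directory_path = r'C:\Users\guest\Desktop\TEST2\*.csv'
    - Example: file_path = r'C:\Users\guest\Desktop\VanillaTasks.csv'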
- Make sure to include '*.csv' at the end of your filepath for directory (see above)
    - This ensures that all files ending in .csv are selected
- Stored in Python as a _String_

In [None]:
## DIRECTORY OF SCHEDULED TASK DATA ##
directory_path = ''

## CSV OF TASKS DEEMED 'VANILLA' OR 'SAFE' ##
file_path = '' 
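
Before running the rest of the notebook, it can help to confirm that both paths point at something. The sketch below is only an illustration: `check_input_paths` is a hypothetical helper (not part of the original script), and the demo folder, file names, and `HostName` column are made up so the example runs anywhere.

```python
import glob
import os
import tempfile

def check_input_paths(directory_path, file_path):
    """Hypothetical helper: count files matched by the pattern and check the single file exists."""
    matched_files = glob.glob(directory_path)
    return len(matched_files), os.path.isfile(file_path)

# Demo on a throwaway folder (made-up file names and columns).
with tempfile.TemporaryDirectory() as tmp:
    for name in ('host1.csv', 'host2.csv', 'notes.txt'):
        with open(os.path.join(tmp, name), 'w') as handle:
            handle.write('HostName,TaskName\n')
    demo_directory_path = os.path.join(tmp, '*.csv')         # pattern ends in *.csv
    demo_file_path = os.path.join(tmp, 'VanillaTasks.csv')   # deliberately not created
    csv_count, vanilla_exists = check_input_paths(demo_directory_path, demo_file_path)

print('CSV files matched:', csv_count)
print('Vanilla file found:', vanilla_exists)
```

A count of zero usually means the `*.csv` suffix is missing or the folder path is wrong.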

=========================================================================================================

### __SECTION 3:__ _Combining CSV Files_ ###
__Note:__
- Combines all CSV files within the specified directory into one massive dataset
- Prints the length of the combined dataset
    - Used to check for mistakes at the end of the script
- Stored in Python as a _List of Lists_

In [None]:
## COMBINES CSVs ##
import glob
fulldataset = []
for fname in glob.glob(directory_path):
    # extend() appends in place instead of rebuilding the whole list for every file
    fulldataset.extend(create_list_from_file(fname))
print(len(fulldataset))
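
To see why the printed length works as a later check, here is a small self-contained sketch of the same combine step on two made-up files. `combine_csv_rows`, the file names, and the column names are assumptions for illustration only; the point is that each file contributes its row count minus one header.

```python
import csv
import glob
import os
import tempfile

def combine_csv_rows(directory_path):
    """Read every file matched by the pattern, skip each header, and return all rows."""
    combined = []
    for fname in sorted(glob.glob(directory_path)):
        with open(fname, newline='') as handle:
            reader = csv.reader(handle)
            next(reader, None)       # skip the header row; tolerate empty files
            combined.extend(reader)  # remaining rows are lists of strings
    return combined

# Two made-up files with 2 and 3 data rows respectively.
with tempfile.TemporaryDirectory() as tmp:
    for name, row_count in (('a.csv', 2), ('b.csv', 3)):
        with open(os.path.join(tmp, name), 'w', newline='') as handle:
            writer = csv.writer(handle)
            writer.writerow(['HostName', 'TaskName'])
            for i in range(row_count):
                writer.writerow([name, f'Task{i}'])
    demo_rows = combine_csv_rows(os.path.join(tmp, '*.csv'))

print(len(demo_rows))
```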

=========================================================================================================

### __SECTION 4:__ _Counting Unique Tasks_ ###
__Note:__
- Prints the total number of unique Scheduled Tasks found within the entire, unfiltered dataset
- Prints the number of times each Scheduled Task occurs
- __Warning:__
    - If used on the entire dataset of 12,000+ files, you may have hundreds, if not thousands, of unique scheduled task names
    - During testing, a sample of 2,000+ entries returned about 350 unique tasks
- Stored in Python as a _Dictionary_

In [None]:
counting_unique_tasks(directory_path)
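
If the combined list from Section 3 is already in memory, the same counts could be produced without re-reading the files. The following is a sketch using `collections.Counter`, not the notebook's own method; `count_task_names` is a hypothetical helper, the sample rows are made up, and it assumes the task name sits at index 1 of each row, as Section 5 does.

```python
from collections import Counter

def count_task_names(rows, name_index=1):
    """Count how often each task name appears in a list of rows."""
    return Counter(row[name_index] for row in rows)

# Made-up sample rows; only column 1 (the task name) matters here.
sample_rows = [
    ['host1', r'\Adobe Flash Player Update'],
    ['host2', r'\Adobe Flash Player Update'],
    ['host2', r'\GoogleUpdateTaskMachineCore'],
]
task_counts = count_task_names(sample_rows)

print('Total Num. of Unique Tasks:', len(task_counts))
# most_common() lists the most frequent task names first
for task_num, (name, count) in enumerate(task_counts.most_common(), start=1):
    print('Task', task_num, '---', 'Count:', count, '---', 'TaskName:', name)
```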

=========================================================================================================

### __SECTION 5:__ _Filtering Out Pre-Specified Tasks_ ###
__Note:__
- No printing / output in this section
- _vanilla_dataset_ contains all entries whose TaskName matches a name or keyword in the list of tasks deemed "safe"
- _NeedsThreatHunting_dataset_ contains all remaining entries, which still need to be checked for potentially malicious tasks

- All tasks that match/contain the "keyword(s)" in VanillaTasks will be added to vanilla_dataset
- All tasks that DO NOT match/contain the "keyword(s)" in VanillaTasks will be added to NeedsThreatHunting_dataset

- Stored in Python as a _list_

In [None]:
## PARSING THROUGH COMBINED LIST OF ALL DATAPOINTS FOR SEARCH TERMS ##
vanilla_dataset = []  
NeedsThreatHunting_dataset = [] 

VanillaTasks1 = create_list_from_file(file_path)
VanillaTasks = []     
for eachtask in VanillaTasks1:
    VanillaTasks.append(eachtask[0])
    
for row in fulldataset:
    name = row[1]  # index 1 is treated as the TaskName column
    if VanillaFilter(name):
        vanilla_dataset.append(row)
    else:
        NeedsThreatHunting_dataset.append(row)
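
The split above can also be written as one function that takes the keyword list as an argument instead of reading a global. This is a sketch under assumptions: `partition_by_keywords` and the sample rows are hypothetical, and it skips blank keywords because an empty string is a substring of every task name and would mark every row as safe.

```python
def partition_by_keywords(rows, keywords, name_index=1):
    """Split rows into (matched, unmatched) by substring match on the task name."""
    # Drop blank keywords: '' is a substring of every string and would match everything.
    usable_keywords = [keyword for keyword in keywords if keyword]
    matched, unmatched = [], []
    for row in rows:
        if any(keyword in row[name_index] for keyword in usable_keywords):
            matched.append(row)
        else:
            unmatched.append(row)
    return matched, unmatched

# Made-up rows; column 1 holds the task name, as in the cell above.
sample_rows = [
    ['host1', r'\Adobe Flash Player Update'],
    ['host1', r'\UnknownUpdater'],
    ['host2', r'\Adobe Flash Player Update'],
]
safe_rows, remaining_rows = partition_by_keywords(
    sample_rows, ['Adobe Flash Player Update', '']
)
print(len(safe_rows), len(remaining_rows))
```

Because every row lands in exactly one of the two lists, their lengths always add up to the length of the input.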

=========================================================================================================

### __SECTION 6:__ _Output / Verification_ ###
__Note:__
- Prints:
    - 'Tasks Deemed Safe'
        - _What's filtered out_
    - 'Tasks Left to Check'
        - _Tasks whose TaskName did not contain any name or keyword from the VanillaTasks CSV_
    - 'Total'
        - _Tasks Deemed Safe plus Tasks Left to Check_
    - 'Expected Total'
        - _Number of rows in the combined dataset from Section 3_
- If the Total and Expected Total differ in any way, something broke during the sorting process
- Stored in Python as an _Integer_

In [None]:
## OUTPUT / VERIFICATION ##
print('Tasks Deemed Safe =', len(vanilla_dataset), '(AKA Vanilla Dataset)')        
print('Tasks Left to Check =', len(NeedsThreatHunting_dataset), '(AKA Threat Hunting Dataset)')
print('Total =', (len(vanilla_dataset) + len(NeedsThreatHunting_dataset)))
print('\n')
print('Expected Total =', len(fulldataset))

=========================================================================================================

# Additional Notes on Setup:
- In Section 2, you are required to input __two__ different paths, one for a directory and one for a file
    - The _directory path_ should lead to a folder that contains all CSV files with data to be inspected
    - The _file path_ should lead to a single CSV file that contains all names or keywords related to Scheduled Tasks data you would like to see filtered out of the dataset, hence the name "VanillaTasks"

- Setting up the file containing "Vanilla Tasks":
    - Create a new Excel document
    - In cell __A1__, add an arbitrary column header like 'Vanilla Tasks to Remove'
        - This will not be used to filter, but is required because the first row is skipped when the file is read
    - In cells __A2, A3, A4, etc.__, type the _names_ or _keywords_ of tasks you'd like filtered out
    - Click "File", "Save as" 
        - __File Name:__ _arbitrary_, __ex:__ VanillaTasks.csv
        - __File type:__ _'CSV (Comma delimited) (*.csv)'_
    
- Vanilla Tasks _keyword_ / _name_ Example:
    - If TaskName is _'\Adobe Flash Player Update'_, then the "keyword(s)" added to 'Vanilla Tasks' CSV could be:
        - _'Adobe Flash Player Update'_   <--- __Most__ exact, less likely to filter other tasks by accident
        - _'Adobe'_                       <--- __Least__ exact, more likely to filter other tasks by accident

- Setting up the directory containing the 12,000+ CSVs:
    - Two very simple steps:
        - Throw every CSV file you want to work with into a single folder
        - Copy and paste the file path of this folder into its designated spot within this script (Refer to Section 2)
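
As an alternative to Excel, the "Vanilla Tasks" file described above could be written with Python's `csv` module. The sketch below is illustrative only: it writes to a temporary folder, and the second task name is made up to show how a broad keyword such as 'Adobe' catches more tasks than an exact one.

```python
import csv
import os
import tempfile

keywords = ['Adobe Flash Player Update']

with tempfile.TemporaryDirectory() as tmp:
    vanilla_path = os.path.join(tmp, 'VanillaTasks.csv')
    with open(vanilla_path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['Vanilla Tasks to Remove'])  # header row (cell A1), skipped on read
        for keyword in keywords:
            writer.writerow([keyword])                # one keyword per row (A2, A3, ...)
    # Read it back the way the notebook does: skip the header, take column 0.
    with open(vanilla_path, newline='') as handle:
        loaded = [row[0] for row in list(csv.reader(handle))[1:] if row]

# First name is from the notes above; the second is a made-up example.
task_names = [r'\Adobe Flash Player Update', r'\Adobe Acrobat Update Task']
exact_hits = [name for name in task_names if any(k in name for k in loaded)]
broad_hits = [name for name in task_names if 'Adobe' in name]

print('Exact keyword matches:', len(exact_hits))
print('Broad keyword matches:', len(broad_hits))
```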