## Tool for making Sample Manager outputs more user-friendly

### Leila Barker, October 2024

#### Summary:
* Save Sample Manager output file to this location (W:\TD&R\Python Tools\Sample Manager data cleanup tool\Tool inputs)
    * Ensure that the file is saved as an Excel Workbook file (not an older version such as Excel 97-2003 Workbook)
    * The tool will import the most recently modified file in this folder if there is more than one file
* Run the following cells in order
* The output file will be saved in the folder W:\TD&R\Python Tools\Sample Manager data cleanup tool\Tool outputs as "input file name_today's date.xlsx" (e.g., "tmp2E63_2024-10-15.xlsx")
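
The "most recently modified file" rule in the list above can be sketched as a small function. This is only an illustration: `newest_workbook` is a hypothetical helper name, and restricting the search to `.xlsx` files and skipping Excel's `~$` lock files (created while a workbook is open) are assumptions that go slightly beyond what the tool itself does.

```python
import os
import tempfile

def newest_workbook(folder):
    """Hypothetical helper: path of the most recently modified .xlsx file in `folder`."""
    candidates = [
        os.path.join(folder, name)
        for name in os.listdir(folder)
        # Assumption: only .xlsx files count, and '~$' lock files are ignored
        if name.lower().endswith('.xlsx') and not name.startswith('~$')
    ]
    if not candidates:
        raise FileNotFoundError(f"No .xlsx files found in {folder}")
    return max(candidates, key=os.path.getmtime)

# Demonstration in a throwaway folder with empty placeholder files
with tempfile.TemporaryDirectory() as demo_folder:
    for i, name in enumerate(['old.xlsx', 'new.xlsx', '~$new.xlsx', 'notes.txt']):
        path = os.path.join(demo_folder, name)
        open(path, 'w').close()
        os.utime(path, (1_000_000 + i, 1_000_000 + i))  # Set explicit modification times
    picked = os.path.basename(newest_workbook(demo_folder))

print(picked)
```

Raising an error when the folder holds no workbook gives a clearer message than the `ValueError` that `max()` would raise on an empty list.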

#### Step 1: Import data, delete unauthorized data, clean up DateTimes

In [64]:
# Import the data from the most recent Excel file in the "Tool inputs" folder

import os
import pandas as pd
from datetime import datetime

folder_path = r'W:\TD&R\Python Tools\Sample Manager data cleanup tool\Tool inputs' # Raw string so the backslashes are not read as escape sequences

files = [f for f in os.listdir(folder_path) if f.endswith('.xlsx') or f.endswith('.xls')]
files_full_path = [os.path.join(folder_path, f) for f in files]

# Find the latest file by modification time
latest_file = max(files_full_path, key=os.path.getmtime)
latest_file_name = os.path.basename(latest_file) # Name of the latest file

# Load the latest file into a dataframe
raw_df = pd.read_excel(latest_file)

print(f"Input file path: {latest_file}")
print(f"Input file name: {latest_file_name}")

# Delete the first column (standard SM export; not needed)
raw_df = raw_df.drop(columns=raw_df.columns[0])
# Delete any data that does not have the status "Authorised" (i.e., has not passed final QA/QC)
raw_df = raw_df.loc[raw_df['Result Status'] == 'Authorised'].copy() # .copy() avoids a SettingWithCopyWarning when columns are added below

# Convert datetimes to a standard format; add a column for Date without time
# (values that do not match the expected format become NaT because of errors='coerce')

raw_df['DateTime'] = pd.to_datetime(raw_df['Sampled Date'], errors='coerce', format='%m/%d/%Y %I:%M %p')
raw_df['Date'] = raw_df['DateTime'].dt.strftime('%Y-%m-%d')

# Move DateTime and Date to the position of 'Sampled Date', then delete 'Sampled Date'
columns = raw_df.columns.tolist()
date_index = columns.index('Sampled Date')  # Get the index of the 'Sampled Date' column
columns.remove('Date')
columns.insert(date_index + 1, 'Date') # Insert 'Date' right after 'Sampled Date'
columns.remove('DateTime')
columns.insert(date_index + 1, 'DateTime') # Insert 'DateTime' between 'Sampled Date' and 'Date'
raw_df = raw_df[columns] # Reorder the DataFrame based on the new column order
raw_df = raw_df.drop(columns=['Sampled Date']) # Delete 'Sampled Date' column

raw_df.head()

Input file path: W:\TD&R\Python Tools\Sample Manager data cleanup tool\Tool inputs\tmp2E63.xlsx
Input file name: tmp2E63.xlsx


Unnamed: 0,Sample Number,Sampling Point,Sample Name,DateTime,Date,Analysis,Component,Qualifiers,Result,Units,Result Status,Comments
0,514119,NTS 20 - Cell 6 - Effluent,Cell 6 at wall,2024-06-06 12:52:00,2024-06-06,TSS,TSS (Report),,26.0,ppm_wt_v,Authorised,
1,514119,NTS 20 - Cell 6 - Effluent,Cell 6 at wall,2024-06-06 12:52:00,2024-06-06,TSS,TVSS (Report),,14.0,ppm_wt_v,Authorised,
2,514121,NTS 24 - Cell 4 - Effluent Weir Box,NTS 24 Cell 4,2024-06-06 13:00:00,2024-06-06,TSS,TSS (Report),,11.0,ppm_wt_v,Authorised,
3,514121,NTS 24 - Cell 4 - Effluent Weir Box,NTS 24 Cell 4,2024-06-06 13:00:00,2024-06-06,TSS,TVSS (Report),,5.0,ppm_wt_v,Authorised,
4,514120,NTS 23 - Cell 1 - Effluent Manhole,NTS 23 manhole,2024-06-06 13:07:00,2024-06-06,TSS,TSS (Report),,12.0,ppm_wt_v,Authorised,
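
Because Step 1 parses 'Sampled Date' with `errors='coerce'`, any value that does not match the expected format becomes `NaT` instead of raising an error. It may be worth counting those rows before moving on. A minimal sketch with made-up values; `count_unparsed_dates` is a hypothetical helper, not part of the tool:

```python
import pandas as pd

SM_DATE_FORMAT = '%m/%d/%Y %I:%M %p'  # Same format string as in Step 1

def count_unparsed_dates(df, column='Sampled Date'):
    """Hypothetical helper: number of values in `column` that fail to parse."""
    parsed = pd.to_datetime(df[column], errors='coerce', format=SM_DATE_FORMAT)
    return int(parsed.isna().sum())

# Made-up example data (not from a real Sample Manager export)
example_df = pd.DataFrame({
    'Sampled Date': ['06/06/2024 12:52 PM', '2024-06-06 13:00', None],
})

n_bad = count_unparsed_dates(example_df)
print(f"Rows with unparsed dates: {n_bad}")
```

Here the second value is in a different format and the third is missing, so both are counted.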


#### Step 2 (Optional): Delete qualified data and/or select qualifiers

This step is a placeholder for any modifications the user wants to make prior to exporting the data.

In [67]:
# Optional: delete qualified data (note: this deletes all data with something in the "Qualifiers" field;
# if desired, can modify to just delete data with certain qualifiers [e.g., U])

# We could also use this section to create a new field that indicates which parameters were flagged or commented on, and the content of those flags/comments, prior to the creation of the final table.

raw_df = raw_df[raw_df['Qualifiers'].isna()]
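
The comments in this cell mention two alternatives to dropping every qualified result: removing only certain qualifiers, and recording which results were flagged. A rough sketch of both on made-up data follows. The qualifier codes `U` and `J`, the `Flagged` column name, and the assumption that each "Qualifiers" cell holds a single code are illustrative only.

```python
import pandas as pd

# Made-up example data (not from a real Sample Manager export)
example_df = pd.DataFrame({
    'Component': ['TSS (Report)', 'TVSS (Report)', 'TP (Report)'],
    'Qualifiers': [None, 'U', 'J'],
    'Result': [26.0, 14.0, 0.113],
})

# Option A: drop only rows carrying specific qualifiers (here just 'U')
qualifiers_to_drop = ['U']
kept_df = example_df[~example_df['Qualifiers'].isin(qualifiers_to_drop)]

# Option B: keep every row but record whether it carries any qualifier
flagged_df = example_df.copy()
flagged_df['Flagged'] = flagged_df['Qualifiers'].notna()

print(kept_df)
print(flagged_df)
```

If a cell can hold several codes at once, an exact match with `isin` would miss them, and a string-based test would be needed instead.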

#### Step 3: Create a pivot table with each row representing a single sample

In [70]:
# Create a pivot table of the data with one row per sample (the export to Excel happens in Step 4)

pivot_df = raw_df.copy()

# Create a new field, "Parameter", combining the "Analysis", "Component", and "Units" fields
pivot_df['Parameter'] = pivot_df['Analysis'] + ' ' + pivot_df['Component'] + ' (' + pivot_df['Units'] + ')'

# Create a pivot table
pivot_df = pivot_df.pivot_table(index=['Sample Number', 'Date', 'Sampling Point', 'DateTime'], columns = 'Parameter', values='Result', aggfunc='first').reset_index()

# Simplify the parameter names
# Function to convert a single column name
def convert_column_name(col_name):
    if "(Report)" in col_name:
        # Extract the part just before "(Report)"
        part_before_report = col_name.split("(Report)")[0].strip().split()
        # Get the last word (the parameter) and the prefix (if any)
        parameter = part_before_report[-1]
        prefix = ' '.join(part_before_report[:-1])  # Join everything before the last word

        # Determine the concentration unit
        if "(ppm_wt_v)" in col_name:
            concentration_unit = " (mg/L)"
        elif "(ppb_wt_v)" in col_name:
            concentration_unit = " (ug/L)"
        else:
            concentration_unit = ""

        # Determine the suffix based on conditions
        if "-S" in prefix and prefix != "ICANION-S":
            return f"{parameter} (S){concentration_unit}"
        elif "-T" in prefix:
            return f"{parameter} (T){concentration_unit}"
        else:
            return f"{parameter}{concentration_unit}"

    return col_name  # Return unchanged if it doesn't match the pattern


# Apply the conversion function to all column names
pivot_df.columns = [convert_column_name(col) for col in pivot_df.columns]

pivot_df.head()


Unnamed: 0,Sample Number,Date,Sampling Point,DateTime,CBOD (mg/L),Chloroph-A-C (ug/L),Chloroph-A-U (ug/L),Pheoph-A (ug/L),COD (S) (mg/L),COD (T) (mg/L),...,Tl (T) (ug/L),V (T) (ug/L),Zn (T) (ug/L),NH3 (S) (mg/L),o-PO4 (S) (mg/L),NPOC (T) (mg/L),TKN (mg/L),TP (mg/L),TSS (mg/L),TVSS (mg/L)
0,512338,2024-06-07,Miscellaneous Sample,2024-06-07 00:00:00,,,,,,,...,,,,0.015,0.031,,0.396,0.113,,
1,512339,2024-06-10,NTS Effluent,2024-06-10 11:56:00,,27.8,33.8,5.97,,,...,,,,0.013,0.013,,1.84,0.191,17.0,7.0
2,512340,2024-06-10,NTS 00 - Lower Treatment Wetland,2024-06-10 10:15:00,,0.888,1.13,<0.500,,,...,,,,0.215,0.423,,1.62,0.514,0.8,0.8
3,512341,2024-06-10,NTS 01 - Upper Pool,2024-06-10 10:42:00,,14.4,17.9,3.50,,,...,,,,0.05,0.26,,1.71,0.356,6.4,2.4
4,512342,2024-06-10,NTS 23 - Cell 1 - Effluent Manhole,2024-06-10 11:09:00,,18.6,23.5,4.90,,,...,,,,0.115,0.051,,2.0,0.221,4.4,2.4
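
Because the pivot in Step 3 uses `aggfunc='first'`, only the first result is kept when a sample has more than one row for the same parameter, and the others are dropped without any message. The sketch below, on made-up data with shortened parameter names, shows the long-to-wide reshaping and one way such duplicates could be listed beforehand:

```python
import pandas as pd

# Made-up long-format data: sample 1 has two TSS results
long_df = pd.DataFrame({
    'Sample Number': [1, 1, 1, 2],
    'Parameter': ['TSS', 'TVSS', 'TSS', 'TSS'],
    'Result': [26.0, 14.0, 30.0, 11.0],
})

# List every row that shares its sample and parameter with another row
dupes = long_df[long_df.duplicated(subset=['Sample Number', 'Parameter'], keep=False)]

# Reshape to one row per sample, one column per parameter (as in Step 3)
wide_df = long_df.pivot_table(
    index='Sample Number', columns='Parameter', values='Result', aggfunc='first'
).reset_index()

print(dupes)
print(wide_df)
```

In this example the second TSS result for sample 1 (30.0) does not appear in `wide_df`, and sample 2 has an empty TVSS cell because no such result exists.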


#### Step 4: Export the data as an Excel workbook

In [75]:
# Get the current date in order to export the cleaned file with a date stamp
today = datetime.today().strftime('%Y-%m-%d')

# Export the file
output_directory = r'W:\TD&R\Python Tools\Sample Manager data cleanup tool\Tool outputs'
latest_file_name_no_ext = os.path.splitext(latest_file_name)[0]
filename = f'{latest_file_name_no_ext}_{today}.xlsx'
output_filepath = os.path.join(output_directory, filename)
pivot_df.to_excel(output_filepath)
print(f'File saved at: {output_filepath}')

File saved at: W:\TD&R\Python Tools\Sample Manager data cleanup tool\Tool outputs\tmp2E63_2024-10-15.xlsx
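
The naming rule used in Step 4 (input file name without its extension, an underscore, then the run date) can be sketched as a small function. `build_output_name` is a hypothetical name, and the date is passed in as an argument here so that the result is reproducible:

```python
import os
from datetime import date

def build_output_name(input_file_name, run_date):
    """Hypothetical helper: '<input name without extension>_<YYYY-MM-DD>.xlsx'."""
    stem = os.path.splitext(input_file_name)[0]
    return f"{stem}_{run_date.strftime('%Y-%m-%d')}.xlsx"

example_name = build_output_name('tmp2E63.xlsx', date(2024, 10, 15))
print(example_name)
```

With these inputs the result matches the file name shown in the output above.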
