# Conclusions From Analysis

## Exploration Conclusion

| Column | Values Removed | Reason |
| :-- | :-: | :-- |
| All columns | values <= 0 | These are either impossible (e.g. a negative amount of sludge in water, as in `InpA Sludge In Water [mg/l]`) or unwanted (e.g. maintenance periods) |
| Tank1 Content Height | values > 25 m | The tank is probably not taller than 25 m |
| Tank2 Content Height | values > 25 m | The tank is probably not taller than 25 m |
| Tank1 Sludge Recycle In Flow | values > 150 m3/h | The InpA and InpB flow rates never exceed 150 m3/h, so values of `Tank1 Sludge Recycle In Flow` above 150 m3/h are treated as outliers |
| Tank2 Sludge Recycle In Flow | values > 200 m3/h | Same reasoning as above |
| Exit N03 Dissolved | drop column | Too many missing values (56%) |
| Target | all NaN | Remove all rows without a target; they can't be used for modeling |
| All columns | rows with 40% or more missing | Remove rows in which at least 40% of the variables are missing |
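
The column-drop rule in the table can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's own code: the helper `drop_sparse_columns`, the toy column names, and the 0.5 cutoff are all assumptions (the table only states that 56% missing was considered too many).

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float) -> pd.DataFrame:
    """Return `df` without the columns whose NaN fraction exceeds `max_missing`."""
    # Per-column share of missing values, between 0 and 1
    missing_fraction = df.isna().mean(axis=0)
    keep = missing_fraction[missing_fraction <= max_missing].index
    return df.loc[:, keep]


# Toy data: the column names and the cutoff are hypothetical
toy = pd.DataFrame({
    "dense": [1.0, 2.0, 3.0, 4.0],
    "sparse": [np.nan, np.nan, np.nan, 4.0],  # 3 of 4 values missing
})
reduced = drop_sparse_columns(toy, max_missing=0.5)
```

Computing the missing fraction first makes it easy to inspect which columns are close to the cutoff before anything is dropped.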


### Additional Changes 
- Rows with at least one nonpositive (zero or negative) variable are removed entirely, as these are periods that we don't want to consider. The other removals shown in the table above are treated as outliers: we set the outlying values to `NaN`. NaN gaps that are not too long will be interpolated from neighboring 'good' values.
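
One way to interpolate only the short gaps is sketched below. It is an assumption-laden illustration rather than the method used later in this notebook: the helper `interpolate_short_gaps`, the linear method, and the `max_gap` value are hypothetical, since the text above does not say how long "not too long" is.

```python
import numpy as np
import pandas as pd


def interpolate_short_gaps(series: pd.Series, max_gap: int) -> pd.Series:
    """Linearly fill NaN runs of at most `max_gap` values; longer runs stay NaN."""
    is_nan = series.isna()
    # Give every run of consecutive NaN / non-NaN values its own id
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Length of the run that each element belongs to
    run_len = is_nan.groupby(run_id).transform("size")
    # Interpolate only between valid values (no extrapolation at the edges)
    filled = series.interpolate(method="linear", limit_area="inside")
    # Keep original values, and interpolated values only inside short gaps
    return filled.where(~is_nan | (run_len <= max_gap))


# Toy series with one gap of length 1 and one gap of length 3
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
result = interpolate_short_gaps(s, max_gap=2)
```

Measuring the gap length explicitly avoids a pitfall of `interpolate(limit=n)` on its own, which would also fill the first `n` values of a longer gap.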



In [None]:
import numpy as np

clean_data = dataset.copy()
# Remove rows where at least one variable is negative or zero
clean_data = clean_data.loc[~(clean_data <= 0).any(axis=1)]
print(f"Removing rows with at least one nonpositive variable: {dataset.shape[0] - clean_data.shape[0]} rows were removed.")

# Handle the Tank 1 content height outliers.
# Only the outlying value is set to NaN, not the whole row.
remove = (clean_data[('Tank 1', 'Content height')] > 25)
clean_data.loc[remove, ('Tank 1', 'Content height')] = np.nan
print(f"Setting values with Tank 1 Content height > 25 m to NaN: {remove.sum()} observations were affected.")

# Handle the Tank 2 content height outliers:
remove = (clean_data[('Tank 2', 'Content height')] > 25)
clean_data.loc[remove, ('Tank 2', 'Content height')] = np.nan
print(f"Setting values with Tank 2 Content height > 25 m to NaN: {remove.sum()} observations were affected.")

# Handle the Tank 1 sludge recycle in flow outliers:
remove = (clean_data[('Tank 1', 'Sludge recycle in flow')] > 150)
clean_data.loc[remove, ('Tank 1', 'Sludge recycle in flow')] = np.nan
print(f"Setting values with Tank 1 Sludge recycle in flow > 150 m3/h to NaN: {remove.sum()} observations were affected.")

# Handle the Tank 2 sludge recycle in flow outliers:
remove = (clean_data[('Tank 2', 'Sludge recycle in flow')] > 200)
clean_data.loc[remove, ('Tank 2', 'Sludge recycle in flow')] = np.nan
print(f"Setting values with Tank 2 Sludge recycle in flow > 200 m3/h to NaN: {remove.sum()} observations were affected.")

In [None]:
mask = clean_data[('Exit', 'Target')].isna()
clean_data = clean_data.loc[~mask]

print(f'{mask.sum()} timestamps have a missing Target value. These rows are removed entirely. {clean_data.shape[0]} rows remain.')

threshold = 0.4  # drop rows in which at least 40% of the variables are missing

# Fraction of missing variables per row (timestamp)
missings = clean_data.isna().sum(axis=1) / clean_data.shape[1]

clean_data = clean_data.loc[missings < threshold]
print(f'{(missings >= threshold).sum()} rows had a share of missing values at or above the threshold. {clean_data.shape[0]} rows remain.')

# Rebuild grouped_data and the group names
grouped_data = clean_data.groupby(level=1, axis=1)
# Unique group names, sorted
group_names = sorted(grouped_data.groups.keys(), key=str.casefold)
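
Recent pandas releases (2.1 and later) emit a deprecation warning for `DataFrame.groupby(..., axis=1)`. If that becomes a problem, the same group names can be derived directly from the column index. The sketch below shows the idea on a toy frame; the frame, its values, and the `groups` dictionary are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame with (location, variable) column labels; the values are made up
columns = pd.MultiIndex.from_tuples([
    ("Tank 1", "Content height"),
    ("Tank 2", "Content height"),
    ("Tank 1", "Sludge recycle in flow"),
    ("Exit", "Target"),
])
toy = pd.DataFrame(np.arange(8.0).reshape(2, 4), columns=columns)

# Unique level-1 labels, sorted case-insensitively
group_names = sorted(toy.columns.get_level_values(1).unique(), key=str.casefold)

# One sub-frame per group: every column that shares the same level-1 label
groups = {name: toy.xs(name, axis=1, level=1) for name in group_names}
```

Here `groups["Content height"]` holds one column per location, which is the same slice a column-wise group would provide.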