Steps to be followed:

1. calculate the missing data per column to decide whether it can be used - done
2. according to the bibliography and the available columns, decide which ones to keep - done
3. outliers? - done
4. convert everything to numerical values and normalize if need be
5. split into training, test, and evaluation sets

In [None]:
import pandas as pd 
import pathlib

#create dataframe from data csv file as df
df = pd.read_csv("db_measurements_v2.1.0.csv") 

To figure out which columns to use, the first step is to compute the percentage of NaN values in each one. Since the dataset is a collection of studies, each of which has different parameters, keeping the columns that are common among them is the easiest way to ensure some consistency.
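
As a small illustration of that first step, the sketch below computes the share of missing cells per column on a made-up three-column frame. The column names and values are invented for the example and are not taken from the dataset. `isnull().mean()` gives the fraction of NaN cells directly, so no separate row count is needed.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "ta":  [21.0, 22.5, np.nan, 23.0],
    "rh":  [40.0, np.nan, np.nan, 45.0],
    "met": [1.0, 1.2, 1.1, 1.0],
})

# isnull() marks NaN cells as True; the column mean of booleans is the NaN fraction.
nan_percent = toy.isnull().mean() * 100
print(nan_percent.round(2))
```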

In [5]:
#cell to find the percentage of NaNs per column and write it to a txt file

#create percentages
size = len(df) #number of rows
nan_array = df.isnull().sum() / size * 100 #creates a series of the percentages

#store in file
nan_array_string = ["%.2f" % i for i in nan_array] #turns percentages into strings

data = dict(zip(df.columns, nan_array_string)) #maps column name to percentage string
nan_df = pd.DataFrame(data.items())

path = pathlib.Path().resolve() / 'data.txt' #file in the current working directory
nan_df.to_csv(path, header=False, index=False, sep=' ')

Now, sorting the dataset's columns by their percentage of NaN values allows for an easy selection of the columns to keep for the analysis and the later prediction.

In [7]:
#sort the nan series and drop all columns with 50% NaN cells or more

nan_array_sorted = nan_array.sort_values(ascending=True) #sorts the series
nan_array_sorted = nan_array_sorted[nan_array_sorted<50.0] #only keeps columns with below 50% NaN cells

path = pathlib.Path().resolve() / 'data_sorted.txt' #stores file for future use
nan_array_sorted.to_csv(path, header=False, sep=' ')

Based on the file produced and the relevant bibliography, and keeping in mind that the ultimate goal of this project is to predict thermal comfort using MET and HRV, the parameters to be included in the final dataset are:

1. index - for practical purposes
2. building_id - to separate studies during outlier detection
3. ta - air temperature
4. rh - relative humidity
5. vel - air velocity
6. met - due to its relevance for this work
7. thermal_sensation - the value to be predicted

Regarding NaN values, since the data comes from different studies, the missing values cannot simply be filled in to conform to a general tendency. It was therefore decided to remove the rows that contain them.

In [None]:
#keeping only a few of the columns for the test in df_outliers dataframe
import matplotlib.pyplot as plt
df_outliers = df[['index', 'building_id', 'ta', 'rh', 'vel', 'met', 'thermal_sensation']]
#TODO: should I keep building_id?

#removing rows with NaN values
df_outliers = df_outliers.dropna()
size_new = len(df_outliers)
loss = 100 - size_new / len(df) * 100 #percentage of rows lost
print(loss)

According to this, removing the rows with NaN values costs about 23% of the database, which seems relatively acceptable, though this is still to be confirmed.
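
The loss figure is simply the share of rows that disappear after `dropna()`. A minimal sketch on invented data makes the arithmetic explicit; the helper name `row_loss_percent` is hypothetical and only used for this illustration.

```python
import numpy as np
import pandas as pd

def row_loss_percent(before: pd.DataFrame, after: pd.DataFrame) -> float:
    """Percentage of the rows in `before` that are no longer present in `after`."""
    return 100.0 - len(after) / len(before) * 100.0

# Made-up example: 2 of the 5 rows contain at least one NaN.
toy = pd.DataFrame({
    "ta":  [21.0, np.nan, 23.0, 22.0, 24.0],
    "met": [1.0, 1.2, np.nan, 1.1, 1.0],
})
cleaned = toy.dropna()
print(row_loss_percent(toy, cleaned))
```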

Now, for the outlier detection, three different methods are tried below.

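Z-scores : Measures how many standard deviations each value lies from the mean of its column. Applied to each of the study parameters, this technique flags the values that deviate most.
It is considered less effective here since it assumes a roughly normal distribution, and the mean and standard deviation it relies on are themselves distorted by outliers.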
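
For a single column, the z-score of a value `x` is `z = (x - mean) / std`. The sketch below applies this to invented temperature readings. The threshold of 3 is the one used in the z-score cell of this notebook; the data and variable names are illustrative only.

```python
import pandas as pd

# Invented air-temperature readings: 30 ordinary values and one suspicious one.
readings = pd.Series([21.5, 22.0, 22.5] * 10 + [45.0])

# ddof=0 gives the population standard deviation, the default of scipy.stats.zscore.
z = (readings - readings.mean()) / readings.std(ddof=0)

# Flag values more than 3 standard deviations away from the mean.
flagged = readings[z.abs() > 3]
print(flagged)
```

Note that the extreme value itself inflates both the mean and the standard deviation, which is one reason the method can miss outliers in small or skewed samples.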

IQR : Removes the values that lie above the 75th percentile or below the 25th percentile of the same column by more than some multiple of the interquartile range (the distance between those two percentiles).
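
The IQR rule can be sketched on a single invented column as follows. The helper `iqr_bounds` and the readings are hypothetical; the multiplier 2.0 is the one used in the IQR cell of this notebook, while 1.5 is the more common textbook default.

```python
import pandas as pd

def iqr_bounds(column: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q25, q75 = column.quantile([0.25, 0.75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Invented relative-humidity readings with one extreme value.
readings = pd.Series([38.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 90.0])

lower, upper = iqr_bounds(readings, k=2.0)
outliers = readings[(readings < lower) | (readings > upper)]
print(lower, upper)
print(outliers)
```

Unlike the z-score, the fences depend only on the quartiles, so a single extreme value barely moves them.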

Isolation forest : Algorithm that detects anomalies by how easily a datapoint can be separated from the rest through random splits of the data. Considered best here since it takes multiple parameters into consideration at once.
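
To see the multivariate behaviour on something small, the sketch below runs scikit-learn's `IsolationForest` on synthetic two-dimensional data. The data, the column names `x1` and `x2`, and the `contamination` value of 0.05 are all assumptions made for the illustration, not choices taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 points clustered around the origin plus 3 far-away points.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
toy = pd.DataFrame(np.vstack([cluster, far_points]), columns=["x1", "x2"])

# contamination is the expected share of outliers; 0.05 is an arbitrary choice here.
forest = IsolationForest(contamination=0.05, random_state=0)
toy["anomaly"] = forest.fit_predict(toy[["x1", "x2"]])  # -1 = outlier, 1 = inlier

print(toy.tail(3))
print((toy["anomaly"] == -1).sum(), "points flagged")
```

Because `contamination` fixes roughly what share of the points is labelled as outliers, some ordinary points at the edge of the cluster are flagged as well, so the parameter deserves some thought before rows are actually removed.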

Note: it would be useful to plot the data at this point, but PIL is currently not working.

In [None]:
#different outlier methods
#z-scores - using scipy, applied per column; counts cells more than 3 standard deviations from the mean
import scipy.stats as stats

#wrap the result in a DataFrame so columns can be accessed by name
#(stats.zscore may return a plain numpy array)
df_zscore = pd.DataFrame(
    stats.zscore(df_outliers, nan_policy='omit'),
    columns=df_outliers.columns,
    index=df_outliers.index,
)

def zfunc(column):
    #number of cells in the column with |z| > 3 (NaN cells compare as False)
    return int((df_zscore[column].abs() > 3).sum())

for col in df_zscore.columns:
    counter = zfunc(col)
    print(col, counter)


In [None]:
#different outlier methods
#iqr - counts, per column, the cells lying more than 2.0 * IQR beyond the quartiles
import numpy as np

df_iqr = df_outliers.copy() #copy so df_outliers itself is left unchanged

def iqr_func(column):
    q75, q25 = np.nanpercentile(column, [75, 25])
    iqr = q75 - q25
    valid = iqr * 2.0 #fence width: a multiple of the IQR
    #NaN cells compare as False, so they are never counted
    return int(((column > q75 + valid) | (column < q25 - valid)).sum())

for col in df_iqr.columns:
    counter = iqr_func(df_iqr[col])
    print(col, counter)

In [None]:
#different outlier methods
#Isolation forest
from sklearn.ensemble import IsolationForest

df_iso = df_outliers.copy() #copy so df_outliers itself is left unchanged

#fit on the measurements only; index and building_id are identifiers, not features
features = ['ta', 'rh', 'vel', 'met', 'thermal_sensation']

iso_forest = IsolationForest(contamination=0.1, random_state=42)
df_iso['anomaly'] = iso_forest.fit_predict(df_iso[features]) #-1 = outlier, 1 = inlier

counter = int((df_iso['anomaly'] == -1).sum())
print(counter)

#plotting is not possible at the moment because of the PIL issue, so for now
#the isolation forest is taken as the method for removing outliers; too much
#time has already gone into outliers without results, so this is to be
#revisited once the issue is solved

Whether the data needs to be normalized depends on the model that will be used, so this step is left open for now. Reminder: come back to it once the model has been chosen.
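
One possible shape for the two remaining steps, sketched under assumptions: scikit-learn is used, the split is 70/15/15, and standardization is what the eventual model needs. None of this is decided in the notes above, and the data below is random rather than real.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (random values, not real measurements).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ta": rng.normal(23, 2, 200),
    "rh": rng.normal(45, 10, 200),
    "vel": rng.uniform(0, 0.5, 200),
    "met": rng.uniform(1.0, 1.6, 200),
    "thermal_sensation": rng.integers(-3, 4, 200),
})
X = toy.drop(columns="thermal_sensation")
y = toy["thermal_sensation"]

# Assumed 70/15/15 split: carve off 30%, then halve it into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Fit the scaler on the training split only, so nothing leaks from the other splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
```

Tree-based models are largely insensitive to feature scale, whereas distance-based or gradient-based models usually benefit from it, which is why the scaling part is best decided together with the model.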