# Model predicting thermal sensation using given database

Link to database: https://github.com/CenterForTheBuiltEnvironment/ashrae-db-II.git

### Creating dataframe 

Using pandas

In [49]:
import pandas as pd 
import pathlib

#create dataframe from data csv file as df
df = pd.read_csv("db_measurements_v2.1.0.csv") 



### Handling NaN values

Because the dataset is a collection of different studies, each of which takes different parameters into consideration, the following code calculates the percentage of NaN values in each column of the dataframe. The aim is to find the parameters most commonly used among the studies, so that the final dataframe is as consistent as possible.

In [56]:
#cell to find the percentage of NaNs per column and write it to a txt file

#create percentages
size = df['index'].size + 1
nan_array = df.isnull().sum() / size * 100 #creates a series of the percentages

#store in file
nan_array_string = ["%.2f" % i for i in nan_array] #turns percentages into strings

data = {df.columns[col]: nan_array_string[col] for col in range(nan_array.size)} #makes dict and dataframe
nan_df = pd.DataFrame(data.items())

path = pathlib.Path().resolve() / 'data.txt' #builds the output path in a platform-independent way
nan_df.to_csv(path, header=None, index = None, sep = ' ')

Now, sorting the dataset's columns by their share of NaN values allows for an easy selection of the columns to keep for the analysis and the later prediction.

In [58]:
#sort through nan series and cut all percentages above 50%

nan_array_sorted = nan_array.sort_values(ascending=True) #sorts through the series
nan_array_sorted = nan_array_sorted[nan_array_sorted<50.0] #only keeps columns with below 50% NaN cells

path = pathlib.Path().resolve() / 'data_sorted.txt' #stores file for future use
nan_array_sorted.to_csv(path, header = None, sep = ' ')
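
The same percentages can also be obtained more compactly by taking the column-wise mean of the boolean NaN mask. The following is only a minimal sketch on a small made-up dataframe; the `toy` data and its values are hypothetical and do not come from the database.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real measurements table
toy = pd.DataFrame({
    "ta": [22.1, np.nan, 24.0, 23.5],
    "rh": [40.0, 45.0, np.nan, np.nan],
    "met": [1.0, 1.2, 1.1, 1.0],
})

# isna() gives a boolean mask; its column-wise mean is the NaN fraction
nan_pct = toy.isna().mean() * 100

# Keep only the columns with less than 50% NaN cells, sorted ascending
kept = nan_pct[nan_pct < 50.0].sort_values()

print(nan_pct.round(2).to_dict())
print(list(kept.index))
```

Since the threshold is strict, a column with exactly 50% NaN cells (`rh` in this toy data) is dropped.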

According to the file produced and the relevant bibliography, and keeping in mind that the ultimate goal of this project is to predict thermal comfort using MET and HRV, the parameters to be included in the final dataset are:

1. index - for practical purposes 
2. building_id - to separate studies during outlier detection 
3. ta - air temperature 
4. rh - relative humidity 
5. vel - air velocity 
6. met - metabolic rate, due to its relevance for this work 
7. thermal_sensation - the final predicted value  

Regarding NaN values: since the data comes from different studies, missing values cannot simply be filled in to conform to a general tendency, so it was decided to remove the rows that contain them. 

In [59]:
#keeping only a few of the columns for the test in df_outliers dataframe
df_outliers  = df[['index','building_id','ta', 'rh', 'vel', 'met', 'thermal_sensation']]

#removing NaN values
df_outliers = df_outliers.dropna()
size_new = df_outliers['index'].size + 1
loss = 100 - size_new / size * 100
print(loss)

23.175339802263522


According to this, removing the rows with NaN values costs about 23% of the database, which seems to be a relatively acceptable loss.
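
As a cross-check of the row loss, the calculation can be written with `len()` and `dropna(subset=...)`, so that NaNs in columns that are not selected do not matter. This is a minimal sketch under stated assumptions: the `toy` dataframe and the `notes` column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "notes" stands in for a column that is not selected
toy = pd.DataFrame({
    "ta": [22.1, np.nan, 24.0, 23.5],
    "rh": [40.0, 45.0, 50.0, np.nan],
    "thermal_sensation": [0.0, 1.0, -1.0, 0.5],
    "notes": [np.nan, np.nan, np.nan, "x"],
})

cols = ["ta", "rh", "thermal_sensation"]

# Drop a row only if one of the selected columns is NaN
clean = toy.dropna(subset=cols)

# Percentage of rows lost by the cleaning step
loss_pct = 100 * (1 - len(clean) / len(toy))
print(loss_pct)
```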

### Outlier detection

For the outlier detection, different methods are tried below. 

*Note: plotting the data would help here, but PIL is currently not working, so no final decision has been made.*

Z-scores: measures how many standard deviations each value lies from the mean of its column. Applied to each of the study parameters, this technique flags the most extreme values (here, those with |z| > 3). 
It is considered less effective, since it relies on the column having a meaningful mean, and the mean is itself sensitive to extreme values. 

In [None]:
#different outlier methods
#z-scores 
import scipy.stats as stats

#z-score every column separately; apply() keeps the result a dataframe
df_zscore = df_outliers.apply(stats.zscore, nan_policy = 'omit')

def zfunc(column):
    #counts the cells of a z-scored column that lie more than 3 standard deviations from the mean
    #comparisons with NaN are False, so NaN cells are never counted
    return int((df_zscore[column].abs() > 3).sum())

for col in df_zscore.columns:
    counter = zfunc(col)
    print(col, counter)
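
Since `building_id` was kept in order to separate studies during outlier detection, one possible variant is to compute the z-scores within each building instead of over the whole column. The sketch below only illustrates that idea; the `toy` data, the helper `zscore` and the column names `ta_z_global` and `ta_z_building` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data: two buildings with different typical temperatures
toy = pd.DataFrame({
    "building_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "ta": [21.0, 22.0, 23.0, 22.0, 30.0, 31.0, 29.0, 30.0],
})

def zscore(s):
    # Population z-score (ddof=0), the same convention scipy.stats.zscore uses by default
    return (s - s.mean()) / s.std(ddof=0)

# Global z-score: every row is compared to the mean of the whole column
toy["ta_z_global"] = zscore(toy["ta"])

# Per-building z-score: every row is compared to the mean of its own building
toy["ta_z_building"] = toy.groupby("building_id")["ta"].transform(zscore)

print(toy)
```

With the global version, a warm building as a whole drifts away from zero, while the per-building version only measures how unusual a value is within its own study.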


IQR: flags the values that lie above the 75th percentile or below the 25th percentile of their column by more than some multiple of the interquartile range, i.e. the distance between the two percentiles (here, 2.0 times that range). 

In [None]:
#different outlier methods 
#iqr  
import numpy as np 

df_iqr = df_outliers

def iqr_func(column):
    #counts the cells lying more than 2.0 * IQR outside the 25th-75th percentile range
    q75, q25 = np.nanpercentile(column, [75 ,25]) #ignores NaN cells
    iqr = q75 - q25
    valid = iqr*2.0
    #comparisons with NaN are False, so NaN cells are never counted
    return int(((column > q75+valid) | (column < q25-valid)).sum())

for col in df_iqr.columns: 
    counter = iqr_func(df_iqr[col])
    print(col, counter)
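
The cell above only counts the flagged values. To actually drop the affected rows, the same rule can be turned into a boolean mask. This is a minimal sketch, assuming a made-up `toy` dataframe and a hypothetical helper `iqr_mask`; the multiplier `k=2.0` mirrors the value used above.

```python
import pandas as pd

# Hypothetical toy data with one implausible temperature reading
toy = pd.DataFrame({
    "ta": [21.0, 22.0, 22.5, 23.0, 23.5, 24.0, 60.0],
    "rh": [40.0, 42.0, 41.0, 43.0, 44.0, 45.0, 46.0],
})

def iqr_mask(column, k=2.0):
    # True where the value lies within k * IQR of the 25th-75th percentile range
    q25, q75 = column.quantile([0.25, 0.75])
    spread = k * (q75 - q25)
    return column.between(q25 - spread, q75 + spread)

# A row is kept only if every one of its columns passes the test
keep = toy.apply(iqr_mask).all(axis=1)
filtered = toy[keep]

print(len(toy), len(filtered))
```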

Isolation forest: an algorithm that detects anomalies by isolating datapoints with random splits; points that can be separated from the rest in only a few splits are scored as anomalous. Considered best here since it takes multiple parameters into consideration at once. 

In [None]:
#different outlier methods 
#Isolation forest
from sklearn.ensemble import IsolationForest

df_iso = df_outliers.copy() #work on a copy so df_outliers stays unchanged

iso_forest = IsolationForest(contamination=0.1, random_state=42) #expects about 10% of the rows to be outliers
iso_forest.fit(df_outliers)
df_iso['anomaly'] = iso_forest.predict(df_outliers) #-1 marks an outlier, 1 an inlier

counter = int((df_iso['anomaly'] == -1).sum()) #number of rows flagged as outliers
print(counter)

#plotting is currently unavailable because of the PIL issue, so the isolation forest
#is used for now as the way to remove outliers; this is a provisional choice made
#after the time spent on outliers gave no clear results, to be revisited once the
#issue is solved
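
One thing worth checking is which columns the forest is fitted on: above, `index` and `building_id` are passed in together with the measurements, although they are identifiers rather than physical parameters. The sketch below shows a possible variant that fits on measurement columns only. It is an illustration under stated assumptions, not the method used in this notebook: the `toy` data is randomly generated and the `features` list is a hypothetical choice.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical, randomly generated data standing in for the cleaned dataframe
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "index": np.arange(200),
    "building_id": rng.integers(1, 5, size=200),
    "ta": rng.normal(23.0, 1.0, size=200),
    "rh": rng.normal(45.0, 5.0, size=200),
})
toy.loc[0, ["ta", "rh"]] = [60.0, 5.0]  # one obviously implausible row

# Fit only on the measurements, leaving the identifier columns out
features = ["ta", "rh"]
iso = IsolationForest(contamination=0.1, random_state=42)
toy["anomaly"] = iso.fit_predict(toy[features])  # -1 = outlier, 1 = inlier

n_outliers = int((toy["anomaly"] == -1).sum())
print(n_outliers)
```

Note that `contamination=0.1` makes the forest flag roughly 10% of the rows regardless of how clean the data is, so the number of flagged rows says little about how many outliers really exist.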

In [63]:
#using isolation forest to handle outliers 
#dropping the flagged rows for now; how to treat outliers is still an open question
#that has to be looked into 
size_before = df_iso['index'].size + 1
df_iso = df_iso[df_iso['anomaly'] != -1]
size_clear = df_iso['index'].size + 1
print(size_before)
print(size_clear)

df_final = df_iso
size_final = df_final['index'].size+1
print(size_final)

83765
75388
75388


### Predictive model

In [32]:
#this is not a time series, so regression seems to be the right approach
#the first thing to find out is how to implement the linear regression
#steps to follow: 
#1. separate the data into training, test and evaluation sets - a package can probably do this
#2. install tensorflow and keras - done 
#3. get the python environment working - done 
#4. run the training and the testing, evaluate and present the results 
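
The steps listed in the cell above (split the data, fit a regression, evaluate) can be sketched with scikit-learn, which this notebook already uses for the isolation forest. This swaps scikit-learn in for the planned TensorFlow/Keras setup and is only a rough sketch: the `toy` data is randomly generated, and the relationship between the inputs and `thermal_sensation` is invented purely so that the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical, randomly generated data standing in for df_final
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "ta": rng.normal(24.0, 3.0, n),
    "rh": rng.normal(45.0, 10.0, n),
    "vel": rng.uniform(0.0, 0.5, n),
    "met": rng.uniform(1.0, 2.0, n),
})
# Invented target: NOT a real thermal comfort model, only a placeholder
toy["thermal_sensation"] = (
    0.3 * (toy["ta"] - 24.0) + 1.0 * (toy["met"] - 1.2) + rng.normal(0.0, 0.3, n)
)

X = toy[["ta", "rh", "vel", "met"]]
y = toy["thermal_sensation"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit an ordinary least squares regression and evaluate it on the held-out rows
model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE on held-out rows: {mae:.3f}")
```

If a separate evaluation set is needed as well, `train_test_split` can be applied a second time to the training part.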

In [None]:
#UPDATE: tensorflow and keras have been installing for about 15 minutes 
#now, and it is not clear yet that they will actually work

#they turned out to be incompatible with this version of python, just like
#the other packages tried so far

#SECOND UPDATE: the problems come from python 3.11; avoid that version for this setup