# Model predicting thermal sensation using given database

Link to database: https://github.com/CenterForTheBuiltEnvironment/ashrae-db-II.git

Packages:
1. pandas
2. scipy
3. math (standard library, nothing to install)
4. numpy
5. scikit-learn
6. tensorflow and keras
7. scikeras

### So far: 

1. Preprocessing is rushed, to say the least.
2. Parameters are picked among those with the fewest NaN values, but instead of sampling, every row containing a NaN is dropped.
3. Outlier detection then drops a further batch of rows without a clear justification. A standard scaler is applied later on without comparing it against other scaling methods.
4. The main model is an ANN regressor whose results are hard to trust: MSE and MAE computed on scaled data say little about the error in actual thermal sensation units.
5. Outliers have not been plotted yet because the old environment would not work with Pillow.
6. To do: review how k-fold cross-validation works.

### Creating dataframe 

Using pandas

In [38]:
import pandas as pd 
import pathlib

#create dataframe from data csv file as df
df = pd.read_csv("db_measurements_v2.1.0.csv") 


### Handling NaN values

Since the dataset is a collection of different studies, each of which considers a different set of parameters, the following code calculates the percentage of NaN values in each column of the dataframe. The aim is to find the parameters most commonly recorded across the studies, so that the final dataframe is as consistent as possible.

In [37]:
#cell to find the percentage of NaNs per column and write it to a text file

#create percentages
size = len(df) #number of rows in the full dataset
nan_array = df.isnull().sum() / size * 100 #series of NaN percentages, indexed by column name

#store in file
nan_df = nan_array.map("{:.2f}".format).reset_index() #one row per column: name, percentage as string

path = pathlib.Path().resolve() / 'data.csv' #portable path in the current working directory
nan_df.to_csv(path, header=False, index=False, sep=' ')

Now, sorting the dataset's columns by their amount of NaN values can allow for an easy selection of columns to keep for the analysis and later prediction.

In [3]:
#sort the NaN series and drop every column with 50% NaN cells or more

nan_array_sorted = nan_array.sort_values(ascending=True) #sorts the series
nan_array_sorted = nan_array_sorted[nan_array_sorted < 50.0] #only keeps columns with below 50% NaN cells

path = pathlib.Path().resolve() / 'data_sorted.csv' #stores file for future use
nan_array_sorted.to_csv(path, header=False, sep=' ')

Based on the file produced and the relevant bibliography, and keeping in mind that the ultimate goal of this project is to predict thermal comfort using MET and HRV, the parameters to be included in the final dataset are:

1. index - for practical purposes
2. building_id - to separate studies during outlier detection
3. ta - air temperature
4. rh - relative humidity
5. vel - air velocity
6. met - due to its relevance for this work
7. thermal_sensation - the final predicted value

Regarding NaN values: since the data comes from different studies, missing values cannot simply be filled in to conform to a general tendency, so the rows containing them are removed.

In [None]:
#keeping only a few of the columns for the test in df_outliers dataframe
df_outliers = df[['index', 'building_id', 'ta', 'rh', 'vel', 'met', 'thermal_sensation']]

#removing rows with NaN values
df_outliers = df_outliers.dropna()
size_new = len(df_outliers) #rows left after dropping NaNs
loss = 100 - size_new / len(df) * 100 #percentage of rows lost
print(loss)

According to this, removing the rows with NaN values loses about 23% of the database, which seems acceptable, although this has not been checked any further.
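
One way to look closer at that loss is to check which columns cause it and whether it falls evenly on all buildings. The sketch below is only an illustration on a made-up toy frame: `nan_loss_report` is a hypothetical helper, not part of the notebook, and the values are invented. On the real data it would be called with `df` and the selected column names.

```python
import numpy as np
import pandas as pd

def nan_loss_report(frame, columns, group_col="building_id"):
    """Share of rows (in %) that dropna() would remove: overall, per column, per group."""
    missing = frame[columns].isna()
    lost = missing.any(axis=1)                      # True for rows that contain any NaN
    overall = lost.mean() * 100
    per_column = (missing.mean() * 100).sort_values(ascending=False)
    per_group = lost.groupby(frame[group_col]).mean() * 100
    return overall, per_column, per_group

# Toy stand-in for the real dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "building_id": [1, 1, 2, 2],
    "ta":  [22.0, 23.5, np.nan, 24.0],
    "rh":  [40.0, 41.0, np.nan, np.nan],
    "met": [1.0, 1.2, 1.1, 1.0],
})

overall, per_column, per_group = nan_loss_report(toy, ["ta", "rh", "met"])
print(overall)
print(per_column)
print(per_group)
```

In this toy case half of the rows are lost overall, but all of them come from building 2, which is the kind of imbalance a single overall percentage hides.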

### Outlier detection

Different outlier detection methods are tried below.

*Note: plotting the data would help here, but PIL is not working in the current environment, so no final decision on the method has been made.*

Z-scores : Measures how many standard deviations each value lies from the mean of its column. Applied to each of the study parameters, it flags the values that deviate the most (here, those with |z| > 3).
It is considered less effective since it depends on the mean and standard deviation, which are themselves pulled by the outliers.
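
The weakness of the z-score is easiest to see on a small sample. The following sketch uses ten made-up temperature readings, one of them clearly wrong. With the population standard deviation (`ddof=0`, the `scipy.stats.zscore` default), no single value in a sample of size n can reach a |z| above sqrt(n - 1), so with ten values the |z| > 3 rule cannot flag anything. The data is invented for illustration only.

```python
import numpy as np
from scipy import stats

# Nine plausible air temperatures plus one obviously wrong reading (made-up values)
ta = np.array([21.0, 21.5, 22.0, 22.0, 22.5, 23.0, 23.0, 23.5, 24.0, 60.0])

z = stats.zscore(ta)            # mean and population standard deviation (ddof=0)
n = len(ta)
bound = np.sqrt(n - 1)          # largest |z| a single value can reach in a sample of size n

flagged = int((np.abs(z) > 3).sum())
print("largest |z|:", np.abs(z).max(), "bound:", bound)
print("values flagged by |z| > 3:", flagged)
```

The 60-degree reading inflates the standard deviation so much that it hides itself. On the full dataset the effect is weaker, but it can matter if z-scores are computed per building on small groups.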

In [None]:
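#different outlier methods
#z-scores
import scipy.stats as stats

#z-score of every value relative to its own column (NaNs are ignored)
#the result is wrapped back into a dataframe so the column names are kept
df_zscore = pd.DataFrame(
    stats.zscore(df_outliers.to_numpy(dtype=float), axis=0, nan_policy='omit'),
    columns=df_outliers.columns,
    index=df_outliers.index,
)

#count the values per column that lie more than 3 standard deviations from the mean
zscore_counts = (df_zscore.abs() > 3).sum()
print(zscore_counts)
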


IQR : Flags the values that lie above the 75th percentile or below the 25th percentile of their column by more than some multiple of the interquartile range (the distance between the two percentiles). The multiple used here is 2.0.

In [None]:
#different outlier methods
#iqr
import numpy as np

df_iqr = df_outliers.copy() #copy, so later changes do not leak back into df_outliers

def iqr_func(column, k=2.0):
    """Count the values lying more than k * IQR outside the 25th-75th percentile range."""
    q25, q75 = np.nanpercentile(column, [25, 75])
    valid = (q75 - q25) * k #allowed margin beyond each quartile
    outliers = (column > q75 + valid) | (column < q25 - valid) #NaNs compare as False
    return int(outliers.sum())

for col in df_iqr.columns:
    counter = iqr_func(df_iqr[col])
    print(col, counter)
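
Since `building_id` was kept to separate studies during outlier detection, the IQR fences could also be computed per building rather than over the whole column. The sketch below is an assumption about how that might look, not what the notebook currently does: `iqr_outlier_mask` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(values, k=2.0):
    """Boolean mask: True where a value lies more than k * IQR outside the quartiles."""
    q25, q75 = np.nanpercentile(values, [25, 75])
    margin = k * (q75 - q25)
    return (values < q25 - margin) | (values > q75 + margin)

# Toy stand-in for df_outliers (made-up values): building 2 is simply warmer than building 1
toy = pd.DataFrame({
    "building_id": [1] * 6 + [2] * 6,
    "ta": [20.0, 20.5, 21.0, 21.5, 22.0, 35.0,
           29.0, 29.5, 30.0, 30.5, 31.0, 31.5],
})

# Fences computed over the whole column vs. separately for each building
toy["outlier_global"] = iqr_outlier_mask(toy["ta"]).astype(bool)
toy["outlier_per_building"] = (
    toy.groupby("building_id")["ta"].transform(iqr_outlier_mask).astype(bool)
)
print(toy)
```

In the toy data the 35.0 reading in building 1 passes the global fences, because building 2 widens the overall spread, but it is flagged once the fences are computed within building 1 only.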

Isolation forest : Algorithm that detects anomalies by repeatedly splitting the data at random; points that can be isolated from the rest in only a few splits are treated as anomalies. Considered best here since it takes multiple parameters into consideration at once.
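
A small sketch on made-up 2-D data shows what the `contamination` parameter does. It is an illustration only: the points are synthetic, and the 195/5 split is an assumption chosen to make the effect visible.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Made-up 2-D data: a dense cluster of 195 points plus 5 isolated points far away
cluster = rng.normal(loc=0.0, scale=1.0, size=(195, 2))
far_points = np.array([[8.0, 8.0], [-8.0, 7.0], [9.0, -8.0], [-9.0, -9.0], [0.0, 11.0]])
points = np.vstack([cluster, far_points])

forest = IsolationForest(contamination=0.1, random_state=42)
labels = forest.fit_predict(points)     # -1 = anomaly, 1 = normal
scores = forest.score_samples(points)   # lower score = easier to isolate = more anomalous

flagged_share = (labels == -1).mean()
print("flagged share:", flagged_share)
print("far points flagged:", bool((labels[-5:] == -1).all()))
```

Only 5 of the 200 points (2.5%) are planted anomalies, yet roughly 10% of the rows are flagged, because `contamination=0.1` sets the share of rows to flag on the training data. The number of rows dropped this way is therefore a choice, not a finding, and `scores` can be inspected to pick a less arbitrary cut-off.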

In [None]:
#different outlier methods
#isolation forest
from sklearn.ensemble import IsolationForest

df_iso = df_outliers.copy() #copy, so the 'anomaly' column is not added to df_outliers

#fit on the measured variables only: 'index' and 'building_id' are identifiers, not measurements
iso_cols = ['ta', 'rh', 'vel', 'met', 'thermal_sensation']

#contamination=0.1 means about 10% of the rows are flagged by construction
iso_forest = IsolationForest(contamination=0.1, random_state=42)
df_iso['anomaly'] = iso_forest.fit_predict(df_iso[iso_cols]) #-1 = outlier, 1 = inlier

counter = int((df_iso['anomaly'] == -1).sum())
print(counter)

#plotting is not possible yet because PIL does not work in this environment,
#so for now the isolation forest is used to remove outliers until that is solved;
#too much time has gone into outliers without results

In [None]:
#using isolation forest to handle outliers
#outliers are dropped for now; how best to treat them still has to be looked into
size_before = len(df_iso)
df_iso = df_iso[df_iso['anomaly'] != -1]
size_clear = len(df_iso)
print(size_before)
print(size_clear)

df_final = df_iso.copy()
size_final = len(df_final)
print(size_final)

### Predictive model

Only the second attempt (the ANN directly below) seems to work. Even there, errors computed on a scaled target are hard to interpret, so they need to be read in the original thermal sensation units before concluding anything.
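
The link between errors on a min-max scaled target and errors in original units can be checked on toy data. In the sketch below the data is made up, and `LinearRegression` is only a quick stand-in for the ANN so that the example runs without TensorFlow. Min-max scaling is an affine map, so the MAE in original units equals the scaled MAE times the target range; a naive "always predict the mean" reference gives the number something to be compared against.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Made-up data: one input and a target on a -3..+3 scale, like thermal sensation votes
X = rng.uniform(18.0, 30.0, size=(200, 1))
y = np.clip(0.5 * (X - 24.0) + rng.normal(0.0, 0.8, size=(200, 1)), -3.0, 3.0)

y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(y)            # target squeezed into [0, 1]

model = LinearRegression().fit(X, y_scaled)     # stand-in for the ANN
pred_scaled = model.predict(X)
pred = y_scaler.inverse_transform(pred_scaled)  # back to original units

mae_scaled = mean_absolute_error(y_scaled, pred_scaled)
mae_original = mean_absolute_error(y, pred)
target_range = y.max() - y.min()

# Naive reference: always predict the mean vote
mae_mean_baseline = mean_absolute_error(y, np.full_like(y, y.mean()))

print("MAE (scaled):", mae_scaled)
print("MAE (original units):", mae_original)
print("scaled MAE * range:", mae_scaled * target_range)
print("MAE of mean baseline:", mae_mean_baseline)
```

A scaled MAE that looks small can still be a large share of a seven-point scale, which is why comparing against the mean baseline in original units is more informative than the scaled number alone.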

In [None]:
#second attempt with ANN
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import MinMaxScaler

#create data
#'index' and 'building_id' are identifiers, so they are left out of the model inputs
data = df_final
X = data[['ta', 'rh', 'vel', 'met']]
y = data[['thermal_sensation']]

#separate into training and test datasets before scaling, so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#separate scalers for inputs and target, fitted on the training data only
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()
X_train_s = x_scaler.fit_transform(X_train)
X_test_s = x_scaler.transform(X_test)
y_train_s = y_scaler.fit_transform(y_train)

#create model
model = Sequential()
model.add(Input(shape=(X_train_s.shape[1],)))
model.add(Dense(40, kernel_initializer='random_uniform', activation='relu'))
model.add(Dense(40, kernel_initializer='random_uniform', activation='relu'))
model.add(Dense(40, kernel_initializer='random_uniform', activation='relu'))
#sigmoid output matches the [0, 1] range of the min-max scaled target
model.add(Dense(1, kernel_initializer='random_uniform', activation='sigmoid'))

model.summary()

model.compile(optimizer='adam', loss='mse', metrics=['mse', 'mae'])

#fit model
model.fit(X_train_s, y_train_s, batch_size=128, epochs=100)

#make predictions and bring them back to the original thermal sensation scale
y_pred = y_scaler.inverse_transform(model.predict(X_test_s))


In [None]:
#errors, in original thermal sensation units
import math
mse = mean_squared_error(y_test, y_pred)
print(math.sqrt(mse)) #RMSE

mae = mean_absolute_error(y_test, y_pred)
print(mae)

test_results = pd.DataFrame(data={'Predicted': y_pred.ravel(), 'Actual': y_test.to_numpy().ravel()})
#original vs scaled target values of the training set
comparison = pd.DataFrame(data={'Original': y_train.to_numpy().ravel(), 'New': y_train_s.ravel()})
path = pathlib.Path().resolve() / 'results.csv' #stores file for future use
path2 = pathlib.Path().resolve() / 'comparison.csv' #stores file for future use
test_results.to_csv(path, sep=' ')
comparison.to_csv(path2, sep=' ')

In [None]:
#first attempt: baseline MLP regressor in a pipeline, scored with k-fold cross-validation
#still to be verified; no final fit or test-set evaluation yet

from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

data = df_final

#'index' and 'building_id' are identifiers, so they are left out of the model inputs
X = data[['ta', 'rh', 'vel', 'met']]
y = data['thermal_sensation']

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=.2, random_state=42)

def baseline_model():
    model = Sequential()
    model.add(Input(shape=(4,)))
    model.add(Dense(13, kernel_initializer='random_normal', activation='relu'))
    model.add(Dense(6, kernel_initializer='random_normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='random_normal'))

    model.compile(loss='mean_squared_error', optimizer='adam')

    return model

#the scaler sits inside the pipeline, so it is refitted on the training part of every fold
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(model=baseline_model, epochs=50, batch_size=5, verbose=1)))

pipeline = Pipeline(estimators)

kfold = KFold(n_splits=10)
results = cross_val_score(pipeline, X_train, Y_train, cv=kfold, scoring='neg_mean_squared_error')

#the scorer returns negative MSE, so flip the sign for reporting
mse_scores = -results
print("Baseline: %.2f (%.2f) MSE" % (mse_scores.mean(), mse_scores.std()))
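
What `KFold` does in the cell above can be seen on ten made-up samples. The sketch below uses 5 splits instead of 10 only to keep the printout short.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten made-up samples, numbered 0..9

kfold = KFold(n_splits=5)         # no shuffling: folds are consecutive blocks of rows
test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print(fold, "train:", train_idx, "test:", test_idx)
    test_folds.append(test_idx)
```

Every sample lands in the validation part exactly once, and the model is refitted from scratch on the remaining rows each time. `cross_val_score` runs this loop internally and returns one score per fold, which is why the result is reported as a mean and a standard deviation.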

In [None]:
#random forest
from numpy import mean, std
from sklearn.model_selection import cross_val_score, RepeatedKFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = df_final

#'index' and 'building_id' are identifiers, so they are left out of the model inputs
X = data[['ta', 'rh', 'vel', 'met']]
y = data['thermal_sensation'] #1-D target, as expected by RandomForestRegressor

#no scaling: tree-based models do not depend on feature scale, and errors stay in original units

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestRegressor(random_state=42)

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)
n_scores = cross_val_score(model, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')

#the scorer returns negative MAE, so flip the sign for reporting
print('MAE: %.3f (%.3f)' % (mean(-n_scores), std(n_scores)))

#cross_val_score only fits clones of the model, so fit the model itself before predicting
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_pred)
print(y_test)

