# Preprocessing of Test Data

This notebook preprocesses the test data set.  
Before running it, change the paths in the second cell to point to where you store the CSV files.

The training data and the test data are preprocessed in two separate notebooks so that the data does not exhaust our laptop's memory.

In [1]:
from sklearn.impute import SimpleImputer
from sklearn import preprocessing

import pandas as pd
import numpy as np

### Import Files and do Preprocessing
First, we import the necessary datasets.

In [2]:
test = pd.read_csv("C:/Users/hclsa/Desktop/Fall2019/CMPE188/TeamProject/raw_data/test.csv")
building_metadata_preprocessed = pd.read_csv("C:/Users/hclsa/Desktop/Fall2019/CMPE188/TeamProject/processed_data/building_metadata_preprocessed.csv")
weather_test_preprocessed = pd.read_csv("C:/Users/hclsa/Desktop/Fall2019/CMPE188/TeamProject/processed_data/weather_test_preprocessed.csv")

Now we separate the row_id column from the other values (row_id will be added back later for the prediction).

In [3]:
# Keep row_id as a one-column DataFrame so it can be saved and re-attached to the predictions
testRow = test[['row_id']]
# Everything except row_id is used as model input
testValues = test.drop(['row_id'], axis=1)
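
As a rough illustration of how row_id can be re-attached once predictions exist, here is a minimal sketch on toy data. The `predictions` array and the `meter_reading` column name are assumptions made for this example and do not come from this notebook. It relies on the predictions being in the same row order as the test file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for test.csv (values are made up for illustration)
test = pd.DataFrame({
    "row_id": [0, 1, 2],
    "building_id": [0, 0, 1],
    "meter": [0, 1, 0],
})

# Split off the identifier, as in the cell above
test_row = test[["row_id"]]
test_values = test.drop(columns=["row_id"])

# Hypothetical model output, one value per row, in the original row order
predictions = np.array([12.5, 3.0, 40.1])

# Re-attach the identifier by position; "meter_reading" is an assumed column name
submission = test_row.copy()
submission["meter_reading"] = predictions
```

Because the rows are matched by position, any step that reorders or drops rows between the split and the prediction would break this pairing.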

Next, we combine testValues with the preprocessed building metadata and the preprocessed weather data.

In [4]:
test_and_building = testValues.merge(building_metadata_preprocessed, on=['building_id'], how='left')
test_building_weather = test_and_building.merge(weather_test_preprocessed, on=['site_id', 'timestamp'], how='left')
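
The two left merges should keep exactly one output row per test row, otherwise the saved row_id values no longer line up with the features. The sketch below shows one way to check this on toy data. The column `air_temperature` and all values are assumptions for illustration, and the `validate` argument is an optional safeguard that the cell above does not use.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are made up for illustration)
readings = pd.DataFrame({
    "building_id": [0, 0, 1],
    "meter": [0, 0, 1],
    "timestamp": ["2017-01-01 00:00:00", "2017-01-01 01:00:00", "2017-01-01 00:00:00"],
})
buildings = pd.DataFrame({"building_id": [0, 1], "site_id": [0, 1]})
weather = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": ["2017-01-01 00:00:00", "2017-01-01 01:00:00"],
    "air_temperature": [17.8, 17.2],
})

# validate="many_to_one" raises a MergeError if the right-hand keys are not unique,
# which would otherwise silently duplicate rows of the left table
merged = readings.merge(buildings, on="building_id", how="left", validate="many_to_one")
merged = merged.merge(weather, on=["site_id", "timestamp"], how="left", validate="many_to_one")

# With unique right-hand keys, a left merge preserves the number of rows
assert len(merged) == len(readings)

# Site 1 has no weather record, so its weather column is NaN after the merge
missing_weather = int(merged["air_temperature"].isna().sum())
```

Rows without a matching weather record are kept with NaN in the weather columns.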

Here, we convert the timestamp (date and time) into a single number giving the hour of the day (0 to 23), which we then scale to the range [0, 1].

In [6]:
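# Parse the timestamp and keep only the hour of day (0-23).
# The hour must be derived the same way for the training set.
# Casting to float avoids an integer-to-float conversion inside MinMaxScaler.
test_building_weather_time = pd.to_datetime(test_building_weather['timestamp']).dt.hour.astype(float).to_frame()

# Scale the hour to [0, 1] and append it as a new column named "hour"
minMaxScaler = preprocessing.MinMaxScaler()
test_building_weather_timeScaled = minMaxScaler.fit_transform(test_building_weather_time)
test_building_weather_timeScaled = pd.DataFrame(test_building_weather_timeScaled, columns = ["hour"])
test_building_weather_time_combined = pd.concat([test_building_weather, test_building_weather_timeScaled], axis=1)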

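
To make the hour feature concrete, here is a small sketch on three made-up timestamps, assuming they follow the form `YYYY-MM-DD HH:MM:SS`. MinMaxScaler maps the smallest observed value to 0 and the largest to 1, so with hours 0 and 23 both present, an hour h becomes h / 23.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up timestamps, assumed to follow "YYYY-MM-DD HH:MM:SS"
timestamps = pd.Series([
    "2017-01-01 00:00:00",
    "2017-01-01 13:00:00",
    "2017-01-01 23:00:00",
])

# Parse the strings and keep only the hour of day as a float column
hours = pd.to_datetime(timestamps).dt.hour.astype(float).to_frame(name="hour")

# Min-max scaling: (value - min) / (max - min), computed per column
scaler = MinMaxScaler()
hours_scaled = pd.DataFrame(scaler.fit_transform(hours), columns=["hour"])
```

Since the scaler is fitted on whatever data it is given, the result matches the training set only if both sets contain the same minimum and maximum hour.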


Since we converted the timestamp into hours, we no longer need it and can drop it.  
We also drop the building ID (as we did for the training set).

In [8]:
test_building_weather_reduced = test_building_weather_time_combined.drop(['timestamp', 'building_id'], axis=1)

The weather data has some missing timestamps (times at which no weather data was collected), which leaves missing values in the weather columns after the left merge.  
We impute those missing values with the mean of each column.

In [9]:
meanImputer = SimpleImputer(missing_values=np.nan, strategy='mean')
test_final = meanImputer.fit_transform(test_building_weather_reduced)
test_final = pd.DataFrame(test_final, columns = test_building_weather_reduced.columns)
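
The effect of mean imputation can be seen on a tiny made-up table. The column names and values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up feature table with one missing value per column
features = pd.DataFrame({
    "air_temperature": [10.0, np.nan, 14.0],
    "wind_speed": [2.0, 4.0, np.nan],
})

# Each NaN is replaced by the mean of the non-missing values in its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)
```

Here the missing air_temperature becomes 12.0 (the mean of 10.0 and 14.0) and the missing wind_speed becomes 3.0 (the mean of 2.0 and 4.0).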


### Store the preprocessed files as CSV
Finally, we create two files, 'test_preprocessed.csv' and 'test_row.csv', which can be imported into the Prediction notebook.  
We split the preprocessing of the training data and the test data across two notebooks so as not to overload our memory.

In [10]:
test_final.to_csv('../CSV/test_preprocessed.csv', index = False)

In [11]:
testRow.to_csv('../CSV/test_row.csv', index = False)