# precoding

In [0]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

In [0]:
# Code to read csv file into Colaboratory:

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Importing data from test.csv and train.csv.

In [0]:
train_link = 'https://drive.google.com/open?id=1-1u1z1Jwh-NHGwHLcvqDe99qLRNtU1ty' # The shareable link
test_link='https://drive.google.com/open?id=1HRUrR1J5sNMkPKga4GcHR1Lkmon6LdBn'

train_fluff, train_id = train_link.split('=')
test_fluff, test_id = test_link.split('=')


train_downloaded = drive.CreateFile({'id':train_id}) 
train_downloaded.GetContentFile('train.csv')  
train_df = pd.read_csv('train.csv')

test_downloaded = drive.CreateFile({'id':test_id}) 
test_downloaded.GetContentFile('test.csv')  
test_df = pd.read_csv('test.csv')

# Dataset is now stored in a Pandas Dataframe
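
The cell above takes the file id by splitting the shareable link on '='. As a hedged alternative, the id can be read from the query string, which keeps working if the link carries extra parameters. The helper name `drive_file_id` and the example link are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def drive_file_id(link):
    """Return the value of the 'id' query parameter of a shareable link."""
    query = parse_qs(urlparse(link).query)
    return query["id"][0]


# Hypothetical link, not one of the real files used in this notebook.
example_link = "https://drive.google.com/open?id=EXAMPLE_FILE_ID&usp=sharing"
example_id = drive_file_id(example_link)
```

The returned id can then be passed to `drive.CreateFile({'id': ...})` in the same way as `train_id` and `test_id`.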

# preprocessing

The data needs some preprocessing before it is suitable for a neural network. First we drop the train columns that we don't need (because they are not in the test data) and rename the remaining ones so that train and test use the same column names.

We also need to strip the year prefix, that is "1395/" and "1396/", from Log_Date.

In [0]:
result_df = test_df.copy()  # keep the untouched test rows for the final output
# Drop the train columns that do not exist in the test data.
train_df.drop(["Log_Time", "AL", "Departure_Time", "Departure_Date", "Price"], axis=1, inplace=True)
# Use the same column names in train and test.
train_df.rename(columns={'FROM': 'From', 'TO': 'To'}, inplace=True)
# Strip the year prefixes from the remaining values.
train_df = train_df.replace(to_replace="1395/", value='', regex=True)
train_df = train_df.replace(to_replace="1396/", value='', regex=True)
test_df = test_df.replace(to_replace="1396/", value='', regex=True)

We should handle missing values before passing the data to the model. Our solution is to drop every row that has a missing value in any of the three columns below, in both the train and test data frames.


In [0]:

train_df = train_df.dropna(subset=['Log_Date', 'From', 'To'])
test_df = test_df.dropna(subset=['Log_Date', 'From', 'To'])


We have to add a new column holding the number of records that share the same (date, from, to) features.

To do this we use the groupby method and store the resulting count in the Sales column.

In [0]:

train_df = train_df.groupby(['Log_Date', 'From', 'To']).size().reset_index()  # counting same (date,from,to) records
train_df.rename(columns={0: 'Sales'}, inplace=True)  # renaming
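
To make the counting step concrete, here is a small sketch on made-up records. The city codes and dates are placeholders, not values from the real data set.

```python
import pandas as pd

# Hypothetical records: two of them share the same (date, from, to) key.
records = pd.DataFrame({
    "Log_Date": ["01/05", "01/05", "01/05", "02/10"],
    "From": ["THR", "THR", "MHD", "THR"],
    "To": ["MHD", "MHD", "THR", "KIH"],
})

# size() counts the rows in each group; reset_index(name=...) names the count column.
counts = records.groupby(["Log_Date", "From", "To"]).size().reset_index(name="Sales")
```

Passing `name="Sales"` to `reset_index` gives the same column as the separate rename from `0` to `'Sales'` used above.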

If we strip the year prefix from the Log_Date column and then group by the result, the Sales value is wrong for the first 6 months of the year.
Our train data set covers a year and a half, so those months appear twice and their Sales value is about twice the real amount.
Our solution is to divide the Sales value of the first 6 months by 2.

In [0]:

# Halve Sales for the first six months, which appear in both years of the data.
for month in ['01/', '02/', '03/', '04/', '05/', '06/']:
    train_df.loc[train_df["Log_Date"].str.startswith(month), 'Sales'] /= 2
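
The effect of stripping the year can be seen on a tiny made-up example. This is only a sketch of the reasoning: the dates are hypothetical and the counting uses `value_counts` instead of the groupby above.

```python
import pandas as pd

# Hypothetical dates: month 03 is present in both years, month 09 only in 1395.
raw_dates = pd.Series(["1395/03/01", "1396/03/01", "1395/09/01"])

# Stripping the year merges the two "03/01" records into one key.
stripped = raw_dates.str.replace(r"^139[56]/", "", regex=True)
counts = stripped.value_counts().sort_index()

# Halve the counts of the first six months to approximate a single year.
first_half = ["01", "02", "03", "04", "05", "06"]
in_first_half = counts.index.str[:2].isin(first_half)
adjusted = counts.where(~in_first_half, counts / 2)
```

After the adjustment both keys have a count of 1, which matches one record per date per year.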


There are several encoding methods for every feature type.
One of the best-known encoding methods for time is sin/cos encoding.
This method extracts periodic features from time.
The code below shows how we implemented this kind of encoding, but we decided to use one-hot encoding instead because of performance issues.
The nature of this problem caused this exception.


In [0]:

# my_df.Log_Date = my_df.Log_Date.astype(int)
# my_df['sin_time'] = np.sin(2 * np.pi * my_df.Log_Date / 31)  # periodic nature of month
# my_df['cos_time'] = np.cos(2 * np.pi * my_df.Log_Date / 31)
# plt.scatter(my_df.sin_time,my_df.cos_time)
# plt.show()
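
For reference, a self-contained sketch of sin/cos encoding on day-of-month values. The function name `encode_cyclic` and the sample days are assumptions made for illustration; the period of 31 follows the commented-out cell above.

```python
import numpy as np


def encode_cyclic(values, period):
    """Map values onto the unit circle so that the period wraps around."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)


# Hypothetical day-of-month values.
days = np.array([1, 16, 31])
sin_day, cos_day = encode_cyclic(days, period=31)
```

Every encoded value lies on the unit circle, and values one full period apart map to the same point, which is what lets a model see the end of a month as close to the start of the next.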

An important part of preprocessing is encoding the data. As described before, we decided to use one-hot encoding for the date, from and to features.
If we encode test and train separately, the resulting columns differ because the two sets contain different values.
To solve this issue we combine the test and train data and encode the combined data frame. After encoding, we separate test and train again using a helper flag column called "train".

In [51]:

train_df['train'] = 1  
test_df['train'] = 0

combined = pd.concat([train_df, test_df])

combined = pd.get_dummies(combined, columns=['Log_Date', 'From', 'To'])   # one_hot encoding of given columns

train_df = combined[combined["train"] == 1].copy()  # copy so the in-place drops do not act on a slice

test_df = combined[combined["train"] == 0].copy()

train_df.drop(["train"], axis=1, inplace=True)
test_df.drop(["train"], axis=1, inplace=True)

test_df.drop(["Sales"], axis=1, inplace=True)
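
The column mismatch that motivates the combined encoding can be reproduced on a toy example. The frames and city codes below are made up; only the technique mirrors the cell above.

```python
import pandas as pd

# Hypothetical data: "MHD" only occurs in train, "KIH" only in test.
train = pd.DataFrame({"From": ["THR", "MHD"], "Sales": [3, 1]})
test = pd.DataFrame({"From": ["THR", "KIH"]})

# Encoding separately yields different dummy columns.
sep_train = pd.get_dummies(train, columns=["From"])
sep_test = pd.get_dummies(test, columns=["From"])

# Encoding the combined frame yields one shared set of columns.
train["train"] = 1
test["train"] = 0
combined = pd.get_dummies(pd.concat([train, test]), columns=["From"])

enc_train = combined[combined["train"] == 1].drop(columns=["train"])
enc_test = combined[combined["train"] == 0].drop(columns=["train", "Sales"])
```

Apart from the target column, `enc_train` and `enc_test` now have identical columns in the same order, so the model trained on one can be applied to the other.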


We can improve performance by shuffling the rows of the data frame before passing it to the model.
This helps because the groupby result is ordered by Log_Date, and this ordering has a bad effect on the learning process.

In [0]:
train_df = train_df.reindex(np.random.permutation(train_df.index))
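
One likely reason the ordering matters is that Keras takes the validation split from the last rows of the data it is given, so an unshuffled frame would be validated only on the latest dates. The sketch below shows the shuffle on a made-up frame; the fixed seed is an addition for repeatability and is not used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, ordered by date as the groupby result would be.
ordered = pd.DataFrame({
    "Log_Date": ["01/01", "01/02", "06/15", "12/28", "12/29"],
    "Sales": [4, 2, 7, 1, 3],
})

rng = np.random.RandomState(0)  # fixed seed only to make the sketch repeatable
shuffled = ordered.reindex(rng.permutation(ordered.index))
```

The shuffled frame holds exactly the same rows in a random order. `ordered.sample(frac=1)` is a common shorter way to do the same thing.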


Next, we need to split our dataset into inputs (train_x) and our target (train_y).

Our input is every column except ‘Sales’, because ‘Sales’ is what we are trying to predict; ‘Sales’ is therefore our target (train_y).

In [0]:

train_x = train_df.drop(columns=['Sales'])  # splitting data
train_y = train_df[['Sales']]

# building and compiling the model

The model type that we use is Sequential. With the ‘add()’ function we add 4 layers to our model. ‘Dense’ is the layer type: in a dense layer, all nodes in the previous layer connect to the nodes in the current layer. We found the model's best configuration experimentally.
The first layer needs an input shape, and the last layer is the output layer.

We use a linear activation function for the last layer because its output should not have lower or upper bounds.

Compiling the model takes two parameters: optimizer and loss. The optimizer controls the learning rate. We use ‘adam’. For the loss function, we use ‘mean_absolute_percentage_error’.

In [0]:

model = Sequential()
n_cols = train_x.shape[1]
model.add(Dense(100, activation='softmax', input_shape=(n_cols,)))
model.add(Dense(20, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1))

model.compile(optimizer='adam', loss='mean_absolute_percentage_error')
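
As a rough illustration of what the ‘mean_absolute_percentage_error’ loss measures, here is a NumPy sketch. The function name `mape` and the `eps` guard against division by zero are assumptions of this sketch, and the sample numbers are made up.

```python
import numpy as np


def mape(y_true, y_pred, eps=1e-7):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # eps keeps the denominator away from zero when a true value is 0.
    relative_error = np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), eps)
    return 100.0 * np.mean(relative_error)


# Relative errors of 20% and 25% average to 22.5%.
example_loss = mape([10, 20], [8, 25])
```

Because the error is relative, a miss of 2 on a true value of 10 costs as much as a miss of 20 on a true value of 100.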


# training the model

Now we train our model.

To train, we use the ‘fit()’ function on our training data frame.

We set the validation split to 0.2, which means that 20% of the training data we provide to the model is set aside for validating model performance.

The number of epochs is the number of times the model cycles through the data.
The default batch_size in Keras is 32, but a larger batch suits our problem better, so we set it to 128. This improves performance and training time.

In [55]:

early_stopping_monitor = EarlyStopping(patience=10)

model.fit(train_x, train_y, validation_split=0.2, batch_size=128, epochs=200, callbacks=[early_stopping_monitor])
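
The EarlyStopping callback watches the validation loss by default and stops training once it has not improved for `patience` epochs in a row. The function below is a simplified sketch of that rule, not the Keras implementation; the name `stopped_epoch` and the loss values are made up.

```python
def stopped_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # Never ran out of patience: training uses every epoch.
    return len(val_losses)


# Hypothetical losses: the best value is at epoch 3, then no improvement.
losses = [5.0, 4.0, 3.0] + [3.5] * 20
```

With these losses and a patience of 10, training stops at epoch 13. This is why the training log in this notebook ends at epoch 61 even though `epochs=200` was requested.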


Train on 48662 samples, validate on 12166 samples
Epoch 1/200
...
Epoch 61/200


<keras.callbacks.History at 0x7f9e5717ec18>

# predicting test values

Now our model is built and trained, so it is ready to produce output.

We call the predict method with the test data frame and the model returns the predictions.
The results are floats, so we round them to the nearest integer.

In [0]:


test_predictions = model.predict(test_df).round()
# print(test_predictions[0:10])
result_df['Sales']=test_predictions
result_df.to_csv(r'result.csv')
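
The rounding step can be tried on its own with made-up numbers. The array below stands in for the output of `model.predict`, which has one column per output unit. The column values and the cast to integers are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical raw predictions with shape (n_rows, 1).
raw_predictions = np.array([[2.4], [0.6], [2.5]])

# Hypothetical result frame with one row per prediction.
result = pd.DataFrame({"Log_Date": ["01/05", "01/06", "01/07"]})

# Round, flatten to one dimension, and store as whole numbers.
result["Sales"] = raw_predictions.round().ravel().astype(int)
```

Note that NumPy rounds exact halves to the nearest even number, so 2.5 becomes 2 here rather than 3.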