This notebook assumes the earlier notebook has been run to generate the FP-Growth rules and the server assignments, which are based on the number of servers and place IP pairs on the same server to the extent possible.

We now go back to our original dataframe and add two new columns, Src_Server and Dst_Server, which explicitly state where each IP address (or app) should have been scheduled if all other resources were available.

The number of apps per server is set in the 'Rules' notebook. In this instance we assume 20 apps per server.

In [1]:
import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import BytesIO
from mlxtend.preprocessing import TransactionEncoder
import random
from mlxtend.frequent_patterns import apriori
import pyfpgrowth
#from apyori import apriori 

In [39]:
#load data

client = boto3.client('s3')
obj = client.get_object(Bucket='manifolddata', Key='week1.csv')
df = pd.read_csv(BytesIO(obj['Body'].read()))

df=df.iloc[:,[0,1,3,4,5,6,7,8]]
df.columns=['Date', 'Duration', 'Src_IP', 'Src_pt', 'Dst_IP', 'Dst_pt','Packets', 'Bytes']
#add a date column rounded to the nearest hour, used as a timestep to see how frequently IP pairs occur in each hour
df['Date']=pd.to_datetime(df['Date'], format="%Y-%m-%d %H:%M:%S.%f", errors = 'coerce')
df['date_hr']=pd.Series(df['Date']).dt.round("H")


In [18]:
#df['Src_IP']=df['Src_IP'].astype('str')
#df['Dst_IP']=df['Dst_IP'].astype('str')

In [40]:
#load in the server assignment dataframe

server_rules=pd.read_csv('server_rules.csv')
#make the IP column a string so it can match the others
#server_rules['IP']=server_rules['IP'].astype('str')

In [41]:
df.head()

Unnamed: 0,Date,Duration,Src_IP,Src_pt,Dst_IP,Dst_pt,Packets,Bytes,date_hr
0,2017-08-02 00:00:00.419,0.003,192.168.210.55,44870,192.168.100.11,445.0,2,174,2017-08-02
1,2017-08-02 00:00:00.421,0.0,192.168.100.11,445,192.168.210.55,44870.0,1,108,2017-08-02
2,2017-08-02 00:00:02.593,0.004,192.168.220.47,55101,192.168.100.11,445.0,2,174,2017-08-02
3,2017-08-02 00:00:02.859,0.0,10000_34,443,192.168.210.54,59628.0,1,100,2017-08-02
4,2017-08-02 00:00:02.594,0.0,192.168.100.11,445,192.168.220.47,55101.0,1,108,2017-08-02


In [42]:
server_rules.tail()

Unnamed: 0.1,Unnamed: 0,IP,serverid
334,192.168.220.47,192.168.220.47,0.0
335,192.168.220.48,192.168.220.48,1.0
336,192.168.220.49,192.168.220.49,5.0
337,192.168.220.50,192.168.220.50,3.0
338,192.168.220.51,192.168.220.51,0.0


In [43]:
server_rules.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 339 entries, 0 to 338
Data columns (total 3 columns):
Unnamed: 0    339 non-null object
IP            339 non-null object
serverid      339 non-null float64
dtypes: float64(1), object(2)
memory usage: 8.0+ KB


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8185992 entries, 0 to 8185991
Data columns (total 9 columns):
Date        datetime64[ns]
Duration    float64
Src_IP      object
Src_pt      int64
Dst_IP      object
Dst_pt      float64
Packets     int64
Bytes       object
date_hr     datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(2), object(3)
memory usage: 562.1+ MB


In [53]:
#merge in the serverid for the source IP
#(server_rules.iloc[:,1:3] selects the IP and serverid columns)
df_servers=df.merge(server_rules.iloc[:,1:3], left_on='Src_IP', right_on='IP', how='left')
df_servers=df_servers.rename(columns={'serverid': 'Src_Server'})

In [54]:
df_servers=df_servers.merge(server_rules.iloc[:,1:3], left_on='Dst_IP', right_on='IP', how='left')
df_servers=df_servers.rename(columns={'serverid': 'Dst_Server'})
df_servers=df_servers.drop(['IP_x', 'IP_y'], axis=1)
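
The two left merges above amount to looking up the server for each IP twice, once for the source and once for the destination. As a minimal sketch on made-up toy data (the IPs and server ids below are illustrative and not taken from week1.csv), and assuming each IP appears only once in the rules table, the same lookup could be written with `Series.map`, which avoids the temporary `IP_x`/`IP_y` columns:

```python
import pandas as pd

# Toy stand-ins for df and server_rules; values are illustrative only.
toy_df = pd.DataFrame({
    'Src_IP': ['a', 'b', 'c'],
    'Dst_IP': ['b', 'a', 'z'],  # 'z' has no server assignment
})
toy_rules = pd.DataFrame({'IP': ['a', 'b', 'c'], 'serverid': [0.0, 0.0, 1.0]})

# Build an IP -> serverid lookup (assumes IPs are unique in the rules table).
ip_to_server = toy_rules.set_index('IP')['serverid']

toy_df['Src_Server'] = toy_df['Src_IP'].map(ip_to_server)
toy_df['Dst_Server'] = toy_df['Dst_IP'].map(ip_to_server)

# IPs missing from the rules table come back as NaN, as with how='left'.
print(toy_df)
```

Either approach leaves NaN for IPs that were never assigned a server, so it may be worth checking how many rows that affects before interpreting the results.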

In [60]:
#check to see if we get any pairs
df_servers[df_servers['Src_Server']==df_servers['Dst_Server']]

Unnamed: 0,Date,Duration,Src_IP,Src_pt,Dst_IP,Dst_pt,Packets,Bytes,date_hr,Src_Server,Dst_Server
1490,2017-08-02 00:23:54.113,0.997,192.168.100.20,59883,10011_132,25.0,2,148,2017-08-02 00:00:00,2.0,2.0
1497,2017-08-02 00:23:57.115,0.000,192.168.100.20,59883,10011_132,25.0,1,74,2017-08-02 00:00:00,2.0,2.0
1727,2017-08-02 00:28:54.340,0.998,192.168.100.20,59886,10011_132,25.0,2,148,2017-08-02 00:00:00,2.0,2.0
1736,2017-08-02 00:28:57.342,0.000,192.168.100.20,59886,10011_132,25.0,1,74,2017-08-02 00:00:00,2.0,2.0
3035,2017-08-02 00:58:53.915,0.999,192.168.100.20,59891,10011_132,25.0,2,148,2017-08-02 01:00:00,2.0,2.0
3038,2017-08-02 00:58:56.917,0.000,192.168.100.20,59891,10011_132,25.0,1,74,2017-08-02 01:00:00,2.0,2.0
4817,2017-08-02 01:33:54.518,0.999,192.168.100.20,59901,10011_132,25.0,2,148,2017-08-02 02:00:00,2.0,2.0
4820,2017-08-02 01:33:57.520,0.000,192.168.100.20,59901,10011_132,25.0,1,74,2017-08-02 02:00:00,2.0,2.0
5065,2017-08-02 01:38:53.788,1.002,192.168.100.20,59902,10011_132,25.0,2,148,2017-08-02 02:00:00,2.0,2.0
5066,2017-08-02 01:38:56.792,0.000,192.168.100.20,59902,10011_132,25.0,1,74,2017-08-02 02:00:00,2.0,2.0


Great, at least we get some pairs.

Now let's set the duration to 0 for every row where the source and destination servers match. This assumes that apps on the same machine communicate with zero latency.
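
The zero-latency assumption can be sketched on toy data as follows. The frame below is made up for illustration. The point it shows is that rows with an unknown server (NaN) never compare equal, so their durations are left unchanged rather than being zeroed.

```python
import numpy as np
import pandas as pd

# Toy frame; durations and server ids are illustrative only.
toy = pd.DataFrame({
    'Duration':   [0.5, 1.0, 2.0, 4.0],
    'Src_Server': [2.0, 2.0, np.nan, np.nan],
    'Dst_Server': [2.0, 3.0, 1.0, np.nan],
})

# True only where both servers are known and identical (NaN == NaN is False).
same_server = toy['Src_Server'] == toy['Dst_Server']

# Replace the duration with 0 where the pair is co-located, keep it otherwise.
toy['duration_pred'] = toy['Duration'].mask(same_server, 0)
print(toy)
```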

In [63]:
df_servers['duration_pred']=df_servers['Duration']
#use .loc with a boolean mask instead of chained indexing, which triggers a SettingWithCopyWarning
df_servers.loc[df_servers['Src_Server']==df_servers['Dst_Server'], 'duration_pred']=0


In [64]:
df_servers['Duration'].sum() #original total latency

2014119.5810000007

In [65]:
df_servers['duration_pred'].sum() #new updated latency with the co-located apps

1885068.8650000014

In [66]:
#fractional reduction in total latency (multiply by 100 for percent)

orig_time=df_servers['Duration'].sum()
pred_time=df_servers['duration_pred'].sum()

(orig_time-pred_time )/orig_time

0.06407301593082484
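
As a quick check of the arithmetic, plugging the two sums printed above into the same formula gives a reduction of roughly 6.4% in total duration under the zero-latency assumption:

```python
# Totals copied from the cell outputs above.
orig_time = 2014119.5810000007
pred_time = 1885068.8650000014

# Fraction of the original total duration removed by co-locating apps.
reduction = (orig_time - pred_time) / orig_time
print(f"{reduction:.2%}")
```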