# First Simple Model

This notebook demonstrates that I can load a minimal version of my data and format it for modeling.

In [1]:
import os

import pandas as pd
from sklearn.linear_model import LinearRegression

In [2]:
# The raw CSV lives two directories above the working directory, under data/raw
raw_data_path = os.path.join(os.pardir, os.pardir, "data", "raw", "online_meetings_v1.csv")
df = pd.read_csv(raw_data_path)

In [3]:
df.head()

Unnamed: 0,Platform,Date,Start Time,End Time,Duration,Participant Video on,Participant Mic On,Participant Screen Share,Others Video on,Others Screen Share,Window Minimized,Group,Download,Upload,Total
0,Zoom,7/8/2020,9:00:00 AM,10:00:00 AM,1:00:00,0,0,0,0,1,0,1,41.1,9.3,50.5
1,Zoom,7/8/2020,10:00:00 AM,11:50:00 AM,1:50:00,0,0,0,1,1,0,1,283.6,19.4,303.1
2,Google Meet,7/8/2020,12:00:00 PM,1:10:00 PM,1:10:00,1,0,0,1,1,0,1,145.7,8.8,154.5
3,Zoom,7/10/2020,10:00:00 AM,10:30:00 AM,0:30:00,0,0,0,1,1,0,1,99.1,3.1,102.3
4,Zoom,7/10/2020,5:00:00 PM,5:33:00 PM,0:33:00,0,1,0,0,0,0,1,20.4,16.1,36.5


This model's goal is to predict the total bandwidth usage of a given video call.

The target is the `Total` column. Because `Total` is composed of `Download` and `Upload`, using those two columns as features would leak the target, so I can't use them.
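The relationship between `Total` and its two components can be checked directly. The sketch below is only an illustration: it rebuilds the five rows shown by `df.head()` by hand instead of reading the CSV, and the names `sample`, `gap`, and `max_gap` are mine, not part of the notebook.

```python
import pandas as pd

# The five rows displayed by df.head(), reduced to the bandwidth columns
sample = pd.DataFrame({
    "Download": [41.1, 283.6, 145.7, 99.1, 20.4],
    "Upload": [9.3, 19.4, 8.8, 3.1, 16.1],
    "Total": [50.5, 303.1, 154.5, 102.3, 36.5],
})

# Absolute gap between the recorded total and the sum of its two parts
gap = (sample["Total"] - (sample["Download"] + sample["Upload"])).abs()
max_gap = gap.max()
print(max_gap)
```

On these five rows the gap never exceeds roughly 0.1, which looks like rounding in the recorded values. Running the same check on the full `df` would show whether that holds for every call.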

In [4]:
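# Total is the target; every other column is a candidate feature.
# Download and Upload are still in X here and are excluded when the model columns are chosen.
y = df['Total']
X = df.drop('Total', axis=1)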

In [5]:
X.describe()

Unnamed: 0,Participant Video on,Participant Mic On,Participant Screen Share,Others Video on,Others Screen Share,Window Minimized,Group,Download,Upload
count,115.0,115.0,115.0,115.0,115.0,115.0,115.0,115.0,115.0
mean,0.121739,0.591304,0.208696,0.373913,0.6,0.104348,0.73913,91.168417,35.873836
std,0.328415,0.493744,0.408155,0.485958,0.492042,0.307049,0.441031,134.820077,66.960607
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.768,0.0061
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21.45,3.45
50%,0.0,1.0,0.0,0.0,1.0,0.0,1.0,54.2,7.9
75%,0.0,1.0,0.0,1.0,1.0,0.0,1.0,120.7,30.05
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1250.0,286.1


In [6]:
X.columns

Index(['Platform', 'Date', 'Start Time', 'End Time', 'Duration',
       'Participant Video on', 'Participant Mic On',
       'Participant Screen Share', 'Others Video on', 'Others Screen Share',
       'Window Minimized', 'Group', 'Download', 'Upload'],
      dtype='object')

In [7]:
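# Use only the 0/1 call-setting columns; Download and Upload are left out because they make up Total
fsm_columns = [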
    'Participant Video on',
    'Participant Mic On',
    'Participant Screen Share',
    'Others Video on',
    'Others Screen Share',
    'Window Minimized',
    'Group'
]

X_fsm = X[fsm_columns]

In [8]:
fsm = LinearRegression()

In [9]:
fsm.fit(X_fsm, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [10]:
# score returns R^2, here computed on the same rows the model was fit on
fsm.score(X_fsm, y)

0.30192314840216994

More work is required to develop a true baseline model, but this demonstrates that my data can be modeled: a linear regression on the seven 0/1 call-setting columns reaches an R² of about 0.30 on the training data.
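One possible step toward a baseline is to score the model on rows it was not fit on. The following is a minimal sketch, assuming shuffled 5-fold cross-validation is an acceptable evaluation scheme for this data; the helper `cross_validated_r2` and its parameters are hypothetical and do not appear in the notebook above.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def cross_validated_r2(X, y, n_splits=5, random_state=0):
    """Return one R^2 score per fold of shuffled k-fold cross-validation."""
    # Shuffle before splitting so that any ordering of the rows in the file
    # (for example by date) does not determine which calls land in which fold.
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    # cross_val_score refits a fresh copy of the estimator on each training
    # fold and scores it on the held-out fold.
    return cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
```

With the objects defined above, `cross_validated_r2(X_fsm, y).mean()` would give a held-out counterpart to `fsm.score(X_fsm, y)`. With only 115 rows the per-fold scores may vary considerably, so the spread is worth inspecting alongside the mean.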