# Setting up the Notebook

## Dependencies
Loading in dependencies for the project.
- **pandas**: Used for DataFrame wrangling
- **os**: Used to build the file path for loading data
- **numpy**: Used for seeding and random sampling
- **sklearn tree**: Decision Tree Classifier function
- **sklearn.ensemble RandomForestClassifier**: Random Forest Classifier function

In [1]:
import pandas as pd
import os
import numpy as np
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

### Global Seed
We define the global seed that we later use to randomly select the training and testing groups.  
We used *2021*, but you can use any number you like to vary your results.

In [2]:
global_seed = 2021

In [3]:
np.random.seed(global_seed)
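
As a minimal sketch of why the seed matters: re-seeding NumPy's global generator with the same value reproduces the same draw. The user range mirrors the 29 users that appear later in this notebook; the names `first_draw` and `second_draw` are illustrative only.

```python
import numpy as np

global_seed = 2021
users = list(range(1, 30))  # user IDs 1..29

# Seed, then draw 5 users without replacement
np.random.seed(global_seed)
first_draw = np.random.choice(users, 5, replace=False)

# Re-seed with the same value and repeat the identical call
np.random.seed(global_seed)
second_draw = np.random.choice(users, 5, replace=False)

# The two draws match because the generator state was reset
print(np.array_equal(first_draw, second_draw))
```

Any NumPy random call made between the seed and the draw changes the generator state, so re-running cells out of order can change the selected users.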

## Load in our Machine Learning DataFrame
We have a DataFrame that was built specifically for machine learning and other analytics. 
We load that in here.

In [4]:
file_to_load1 = os.path.join("Output Data", "ml_df.csv")
ml_df = pd.read_csv(file_to_load1)
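
A quick check that the columns the split relies on are present can catch a wrong or stale file early. The sketch below uses a hypothetical two-row stand-in for `ml_df.csv` (the values are invented for illustration); only the column names `User`, `Team`, and `Outcome` come from this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of ml_df.csv
sample_csv = io.StringIO(
    "User,Team,Outcome,Game Length\n"
    "1,1,0,512\n"
    "2,0,1,431\n"
)
sample_df = pd.read_csv(sample_csv)

# Columns that the later splitting and modelling steps depend on
required_columns = {"User", "Team", "Outcome"}
missing_columns = required_columns - set(sample_df.columns)
if missing_columns:
    raise ValueError(f"Missing columns: {sorted(missing_columns)}")
```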

# Machine Learning
Now that we have all necessary fields in a usable format, we will classify **Outcome** using Decision Tree and Random Forest classifiers.

## Create our training and testing groups
To avoid using any one **User**'s game results to predict that same user's other game results, we randomly select 5 **User**s as the testing group.  
This causes an imbalance between training and testing data, which strays from a 4:1 ratio, and between the **Crewmate** and **Imposter** analyses. Given the explicit problem that splitting by individual game would cause, we use this method until a more appropriate one is determined.
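
One candidate for a more appropriate method is scikit-learn's `GroupShuffleSplit`, which keeps every row of a group on the same side of the split. This is a sketch on synthetic data, not the split this notebook uses; the three games per user and the placeholder feature matrix are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 29 users with 3 games each (assumed counts)
users = np.repeat(np.arange(1, 30), 3)
features = np.zeros((len(users), 1))  # placeholder feature matrix

# An integer test_size is interpreted as a number of groups (users)
splitter = GroupShuffleSplit(n_splits=1, test_size=5, random_state=2021)
train_idx, test_idx = next(splitter.split(features, groups=users))

train_users = set(users[train_idx].tolist())
test_users = set(users[test_idx].tolist())

# No user contributes rows to both sides of the split
print(train_users.isdisjoint(test_users))
```

Setting `n_splits` above 1 would give several user-level splits, which could be used to average scores rather than rely on a single draw of 5 users.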

### Create a list of users
We will use this list to split our data into training and testing groups.

In [5]:
user_list = np.arange(1, ml_df["User"].astype(int).max() + 1, 1).tolist()
user_list

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29]

### Define groups
We will define our test group first, specifying without replacement to avoid repetition.  
We then define our training group by what is not in our test group.

In [6]:
test_list = np.random.choice(user_list, 5, replace = False)
test_list

array([15, 18, 12,  5, 10])

In [7]:
train_list = []
for x in user_list:
    if x not in test_list:
        train_list.append(x)
train_list = np.array(train_list)
train_list

array([ 1,  2,  3,  4,  6,  7,  8,  9, 11, 13, 14, 16, 17, 19, 20, 21, 22,
       23, 24, 25, 26, 27, 28, 29])
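
The loop above can also be written as a set difference. A minimal sketch, reusing the test draw shown in the output above; `np.setdiff1d` returns the sorted unique values of its first argument that are absent from the second.

```python
import numpy as np

user_list = list(range(1, 30))
test_list = np.array([15, 18, 12, 5, 10])  # the draw shown above

# Users that are not in the test group form the training group
train_list = np.setdiff1d(user_list, test_list)

# Sanity checks: the groups are disjoint and together cover all users
assert np.intersect1d(train_list, test_list).size == 0
assert train_list.size + test_list.size == len(user_list)
```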

Using our test and training lists, we split our data.

In [8]:
test_group = ml_df.loc[ml_df["User"].astype(int).isin(test_list)]
train_group = ml_df.loc[ml_df["User"].astype(int).isin(train_list)]

Split our groups between crewmates and imposters and drop unnecessary columns.

In [9]:
crewmate_test_df = test_group.loc[test_group["Team"] == 1].reset_index(drop=True)
crewmate_test_df.drop(columns = ["Imposter Kills"], inplace = True)
crewmate_train_df = train_group.loc[train_group["Team"] == 1].reset_index(drop=True)
crewmate_train_df.drop(columns = ["Imposter Kills"], inplace = True)

imposter_test_df = test_group.loc[test_group["Team"] == 0].reset_index(drop=True)
imposter_test_df.drop(columns = ["Task Completed", "All Tasks Completed", "Sabotages Fixed", "Time to complete all tasks", "Murdered"], inplace = True)
imposter_train_df = train_group.loc[train_group["Team"] == 0].reset_index(drop=True)
imposter_train_df.drop(columns = ["Task Completed", "All Tasks Completed", "Sabotages Fixed", "Time to complete all tasks", "Murdered"], inplace = True)

In [10]:
crewmate_train_df.columns

Index(['User', 'Team', 'Outcome', 'Game Length', 'Task Completed',
       'All Tasks Completed', 'Time to complete all tasks', 'Sabotages Fixed',
       'Murdered', 'Ejected', 'Night', 'Morning', 'Afternoon', 'Evening'],
      dtype='object')

In [11]:
imposter_train_df.columns

Index(['User', 'Team', 'Outcome', 'Game Length', 'Imposter Kills', 'Ejected',
       'Night', 'Morning', 'Afternoon', 'Evening'],
      dtype='object')
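
The four filter-and-drop steps above follow one pattern, so they could be folded into a small helper. This is a sketch only: `split_team` is a hypothetical name, and the toy DataFrame carries just enough columns to show the behaviour.

```python
import pandas as pd

def split_team(group_df, team_value, drop_columns):
    """Keep rows for one team, reset the index, and drop unused columns."""
    team_df = group_df.loc[group_df["Team"] == team_value].reset_index(drop=True)
    return team_df.drop(columns=drop_columns)

# Toy data with invented values; Team 1 = Crewmate, Team 0 = Imposter as above
toy_group = pd.DataFrame({
    "User": [1, 1, 2, 2],
    "Team": [1, 0, 1, 0],
    "Outcome": [1, 0, 0, 1],
    "Imposter Kills": [0, 2, 0, 1],
    "Murdered": [1, 0, 0, 0],
})

crewmate_toy = split_team(toy_group, 1, ["Imposter Kills"])
imposter_toy = split_team(toy_group, 0, ["Murdered"])
print(list(crewmate_toy.columns))
print(list(imposter_toy.columns))
```

Returning a new DataFrame instead of dropping in place also keeps the original test and training groups untouched.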

## Machine Learning - Imposter
We start by defining an appropriate model to predict wins or losses for **Imposter**s.

### Define Model #1
We will have to specify what it is that we are trying to predict and with what data.
- **Target**: Outcome
- **Data**: Game Length, Imposter Kills, Ejected, Night, Morning, Afternoon, Evening

In [12]:
target_imposter_train = imposter_train_df["Outcome"]
data_imposter_train = imposter_train_df[["Game Length", "Imposter Kills", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

target_imposter_test = imposter_test_df["Outcome"]
data_imposter_test = imposter_test_df[["Game Length", "Imposter Kills", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

features_imposter1 = data_imposter_train.columns

### Decision Tree Classification - Model #1
We perform a decision tree classification in order to assess the validity of our features.

In [13]:
dtc_imposter1 = tree.DecisionTreeClassifier()
dtc_imposter1 = dtc_imposter1.fit(data_imposter_train, target_imposter_train)
dtc_imposter1.score(data_imposter_test, target_imposter_test)

0.6301369863013698

In [14]:
sorted(zip(dtc_imposter1.feature_importances_, features_imposter1), reverse=True)

[(0.5623060456191042, 'Game Length'),
 (0.2287844680453208, 'Ejected'),
 (0.0993935890391066, 'Imposter Kills'),
 (0.04277191763224575, 'Afternoon'),
 (0.03248067072065414, 'Morning'),
 (0.02270246153168223, 'Night'),
 (0.011560847411886253, 'Evening')]
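
Feature importances from a fitted tree sum to 1, so each value can be read as a share. The sketch below copies the values printed above into a pandas Series to make that explicit; `top_three_share` is an illustrative name.

```python
import pandas as pd

# Importances copied from the Decision Tree output above
importances = pd.Series({
    "Game Length": 0.5623060456191042,
    "Ejected": 0.2287844680453208,
    "Imposter Kills": 0.0993935890391066,
    "Afternoon": 0.04277191763224575,
    "Morning": 0.03248067072065414,
    "Night": 0.02270246153168223,
    "Evening": 0.011560847411886253,
}).sort_values(ascending=False)

# Share carried by the three strongest features
top_three_share = importances.iloc[:3].sum()
print(round(top_three_share, 3))
```

The three strongest features account for about 89% of the total, leaving roughly 11% for the four time-of-day flags combined.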

In [15]:
dtc_predict_imposter1_df = pd.DataFrame({
    "Prediction_i" : dtc_imposter1.predict(data_imposter_test),
    "Actual_i" : target_imposter_test
})
dtc_predict_imposter1_df["Equals_i"] = dtc_predict_imposter1_df["Prediction_i"].eq(dtc_predict_imposter1_df["Actual_i"])
dtc_predict_imposter1_df

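| | Prediction_i | Actual_i | Equals_i |
|---|---|---|---|
| 0 | 0 | 0 | True |
| 1 | 0 | 1 | False |
| 2 | 0 | 1 | False |
| 3 | 0 | 1 | False |
| 4 | 0 | 0 | True |
| ... | ... | ... | ... |
| 68 | 1 | 0 | False |
| 69 | 0 | 0 | True |
| 70 | 1 | 1 | True |
| 71 | 1 | 1 | True |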


In [16]:
dtc_predict_imposter1_df["Equals_i"].value_counts()

True     46
False    27
Name: Equals_i, dtype: int64
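
The True/False counts give overall accuracy but not which kind of error dominates. A cross-tabulation of actual against predicted values separates the two error types. This is a sketch on invented toy predictions, not the notebook's results.

```python
import pandas as pd

# Toy predictions with invented values, using the same column names as above
toy_predictions = pd.DataFrame({
    "Prediction_i": [0, 0, 0, 1, 1, 1],
    "Actual_i": [0, 1, 0, 0, 1, 1],
})

# Rows are actual outcomes, columns are predicted outcomes
confusion = pd.crosstab(
    toy_predictions["Actual_i"], toy_predictions["Prediction_i"]
)
print(confusion)
```

The diagonal cells count correct predictions; the off-diagonal cells count the cases where an actual 0 was predicted as 1 and the reverse.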

### Random Forest Classification - Model #1
We perform a random forest classification in order to assess the validity of our features.

In [17]:
rfc_imposter1 = RandomForestClassifier()
rfc_imposter1 = rfc_imposter1.fit(data_imposter_train, target_imposter_train)
rfc_imposter1.score(data_imposter_test, target_imposter_test)

0.7123287671232876
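
A random forest draws random numbers while fitting, so the score above can shift when cells are re-run in a different order. One way to pin the result is to pass the seed to the classifier's `random_state` parameter. The sketch below uses synthetic data (an assumption, not the notebook's data) to show that two forests built with the same `random_state` agree.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 60 rows, 3 features, noisy binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=60) > 0).astype(int)

global_seed = 2021

# Two independent fits that share the same random_state
forest_a = RandomForestClassifier(random_state=global_seed).fit(X_toy, y_toy)
forest_b = RandomForestClassifier(random_state=global_seed).fit(X_toy, y_toy)

same_predictions = np.array_equal(
    forest_a.predict_proba(X_toy), forest_b.predict_proba(X_toy)
)
print(same_predictions)
```

`tree.DecisionTreeClassifier` accepts the same `random_state` parameter.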

In [18]:
sorted(zip(rfc_imposter1.feature_importances_, features_imposter1), reverse=True)

[(0.6166183835200303, 'Game Length'),
 (0.2121020934185452, 'Ejected'),
 (0.11476197539069986, 'Imposter Kills'),
 (0.015487820504724298, 'Afternoon'),
 (0.01434570354969304, 'Morning'),
 (0.013565037297346157, 'Evening'),
 (0.01311898631896125, 'Night')]

In [19]:
rfc_predict_imposter1_df = pd.DataFrame({
    "Prediction_i" : rfc_imposter1.predict(data_imposter_test),
    "Actual_i" : target_imposter_test
})
rfc_predict_imposter1_df["Equals_i"] = rfc_predict_imposter1_df["Prediction_i"].eq(rfc_predict_imposter1_df["Actual_i"])
rfc_predict_imposter1_df

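| | Prediction_i | Actual_i | Equals_i |
|---|---|---|---|
| 0 | 0 | 0 | True |
| 1 | 0 | 1 | False |
| 2 | 1 | 1 | True |
| 3 | 0 | 1 | False |
| 4 | 0 | 0 | True |
| ... | ... | ... | ... |
| 68 | 1 | 0 | False |
| 69 | 0 | 0 | True |
| 70 | 1 | 1 | True |
| 71 | 1 | 1 | True |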


In [20]:
rfc_predict_imposter1_df["Equals_i"].value_counts()

True     52
False    21
Name: Equals_i, dtype: int64

## Machine Learning - Crewmate
Next, we define appropriate models to predict wins or losses for **Crewmate**s.

### Define Model #1
We will have to specify what it is that we are trying to predict and with what data.
- **Target**: Outcome
- **Data**: Game Length, All Tasks Completed, Sabotages Fixed, Murdered, Ejected, Night, Morning, Afternoon, Evening

In [21]:
target_crewmate1_train = crewmate_train_df["Outcome"]
data_crewmate1_train = crewmate_train_df[["Game Length", "All Tasks Completed", "Sabotages Fixed", "Murdered", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

target_crewmate1_test = crewmate_test_df["Outcome"]
data_crewmate1_test = crewmate_test_df[["Game Length", "All Tasks Completed", "Sabotages Fixed", "Murdered", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

features_crewmate1 = data_crewmate1_train.columns

### Decision Tree Classification - Model #1
We perform a decision tree classification in order to assess the validity of our features.

In [22]:
dtc_crewmate1 = tree.DecisionTreeClassifier()
dtc_crewmate1 = dtc_crewmate1.fit(data_crewmate1_train, target_crewmate1_train)
dtc_crewmate1.score(data_crewmate1_test, target_crewmate1_test)

0.5780730897009967

In [23]:
sorted(zip(dtc_crewmate1.feature_importances_, features_crewmate1), reverse=True)

[(0.6734855275533298, 'Game Length'),
 (0.08768017558122011, 'Sabotages Fixed'),
 (0.04738038850550324, 'Murdered'),
 (0.04384758086740087, 'All Tasks Completed'),
 (0.036958648764409276, 'Afternoon'),
 (0.03521100011929915, 'Evening'),
 (0.03291061216392996, 'Morning'),
 (0.02520449428099215, 'Night'),
 (0.017321572163915444, 'Ejected')]

In [24]:
dtc_predict_crewmate1_df = pd.DataFrame({
    "Prediction_c1" : dtc_crewmate1.predict(data_crewmate1_test),
    "Actual_c1" : target_crewmate1_test
})
dtc_predict_crewmate1_df["Equals_c1"] = dtc_predict_crewmate1_df["Prediction_c1"].eq(dtc_predict_crewmate1_df["Actual_c1"])
dtc_predict_crewmate1_df

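| | Prediction_c1 | Actual_c1 | Equals_c1 |
|---|---|---|---|
| 0 | 0 | 1 | False |
| 1 | 1 | 1 | True |
| 2 | 1 | 0 | False |
| 3 | 1 | 1 | True |
| 4 | 1 | 1 | True |
| ... | ... | ... | ... |
| 296 | 0 | 0 | True |
| 297 | 1 | 1 | True |
| 298 | 1 | 0 | False |
| 299 | 1 | 1 | True |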


In [25]:
dtc_predict_crewmate1_df["Equals_c1"].value_counts()

True     174
False    127
Name: Equals_c1, dtype: int64

### Random Forest Classification - Model #1
We perform a random forest classification in order to assess the validity of our features.

In [26]:
rfc_crewmate1 = RandomForestClassifier()
rfc_crewmate1 = rfc_crewmate1.fit(data_crewmate1_train, target_crewmate1_train)
rfc_crewmate1.score(data_crewmate1_test, target_crewmate1_test)

0.5116279069767442

In [27]:
sorted(zip(rfc_crewmate1.feature_importances_, features_crewmate1), reverse=True)

[(0.7915341715344302, 'Game Length'),
 (0.06933640050907076, 'Sabotages Fixed'),
 (0.054825823161697644, 'Murdered'),
 (0.02357477788631491, 'All Tasks Completed'),
 (0.018231946368486875, 'Ejected'),
 (0.01344326908852262, 'Evening'),
 (0.010365502750487966, 'Afternoon'),
 (0.009698607710739843, 'Night'),
 (0.008989500990249182, 'Morning')]

In [28]:
rfc_predict_crewmate1_df = pd.DataFrame({
    "Prediction_c1" : rfc_crewmate1.predict(data_crewmate1_test),
    "Actual_c1" : target_crewmate1_test
})
rfc_predict_crewmate1_df["Equals_c1"] = rfc_predict_crewmate1_df["Prediction_c1"].eq(rfc_predict_crewmate1_df["Actual_c1"])
rfc_predict_crewmate1_df

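| | Prediction_c1 | Actual_c1 | Equals_c1 |
|---|---|---|---|
| 0 | 0 | 1 | False |
| 1 | 0 | 1 | False |
| 2 | 1 | 0 | False |
| 3 | 1 | 1 | True |
| 4 | 1 | 1 | True |
| ... | ... | ... | ... |
| 296 | 0 | 0 | True |
| 297 | 0 | 1 | False |
| 298 | 0 | 0 | True |
| 299 | 1 | 1 | True |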


In [29]:
rfc_predict_crewmate1_df["Equals_c1"].value_counts()

True     154
False    147
Name: Equals_c1, dtype: int64
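
An accuracy near 0.51 is hard to judge without a reference point. A common one is the majority-class baseline: always predict the most frequent training outcome and score that on the test set. The class balance of **Outcome** is not reported here, so the sketch below uses invented toy outcomes.

```python
import pandas as pd

# Toy outcomes with invented values
toy_train_outcomes = pd.Series([1, 1, 1, 0, 0])
toy_test_outcomes = pd.Series([1, 0, 1, 1])

# Most frequent outcome in the training data
majority_class = toy_train_outcomes.mode().iloc[0]

# Accuracy obtained by always predicting that outcome
baseline_accuracy = (toy_test_outcomes == majority_class).mean()
print(majority_class, baseline_accuracy)
```

Applied to `target_crewmate1_train` and `target_crewmate1_test`, the same two lines would show whether the classifiers beat a constant guess.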

### Define Model #2
We will have to specify what it is that we are trying to predict and with what data.
- **Target**: Outcome
- **Data**: Game Length, Task Completed, Sabotages Fixed, Murdered, Ejected, Night, Morning, Afternoon, Evening

In [30]:
target_crewmate2_train = crewmate_train_df["Outcome"]
data_crewmate2_train = crewmate_train_df[["Game Length", "Task Completed", "Sabotages Fixed", "Murdered", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

target_crewmate2_test = crewmate_test_df["Outcome"]
data_crewmate2_test = crewmate_test_df[["Game Length", "Task Completed", "Sabotages Fixed", "Murdered", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

features_crewmate2 = data_crewmate2_train.columns
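
Model #2 differs from Model #1 in a single column: **Task Completed** replaces **All Tasks Completed**. A sketch that builds both feature lists from one shared list makes that difference explicit; `shared_features` and `with_task_feature` are hypothetical names.

```python
# Features common to both Crewmate models, in the order used above
shared_features = [
    "Game Length", "Sabotages Fixed", "Murdered", "Ejected",
    "Night", "Morning", "Afternoon", "Evening",
]

def with_task_feature(task_column):
    """Insert the task-related column right after Game Length."""
    return [shared_features[0], task_column] + shared_features[1:]

features_model1 = with_task_feature("All Tasks Completed")
features_model2 = with_task_feature("Task Completed")

# The two lists differ only in the second position
differences = [
    (a, b) for a, b in zip(features_model1, features_model2) if a != b
]
print(differences)
```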

### Decision Tree Classification - Model #2
We perform a decision tree classification in order to assess the validity of our features.

In [31]:
dtc_crewmate2 = tree.DecisionTreeClassifier()
dtc_crewmate2 = dtc_crewmate2.fit(data_crewmate2_train, target_crewmate2_train)
dtc_crewmate2.score(data_crewmate2_test, target_crewmate2_test)

0.5813953488372093

In [32]:
sorted(zip(dtc_crewmate2.feature_importances_, features_crewmate2), reverse=True)

[(0.5666822970244988, 'Game Length'),
 (0.17798527087455693, 'Task Completed'),
 (0.08743825821105715, 'Sabotages Fixed'),
 (0.04673942971256209, 'Murdered'),
 (0.032664698513737, 'Afternoon'),
 (0.02934848919378394, 'Night'),
 (0.024101092868279456, 'Morning'),
 (0.017953216531518813, 'Evening'),
 (0.017087247070005824, 'Ejected')]

In [33]:
dtc_predict_crewmate2_df = pd.DataFrame({
    "Prediction_c2" : dtc_crewmate2.predict(data_crewmate2_test),
    "Actual_c2" : target_crewmate2_test
})
dtc_predict_crewmate2_df["Equals_c2"] = dtc_predict_crewmate2_df["Prediction_c2"].eq(dtc_predict_crewmate2_df["Actual_c2"])
dtc_predict_crewmate2_df

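| | Prediction_c2 | Actual_c2 | Equals_c2 |
|---|---|---|---|
| 0 | 1 | 1 | True |
| 1 | 1 | 1 | True |
| 2 | 1 | 0 | False |
| 3 | 1 | 1 | True |
| 4 | 1 | 1 | True |
| ... | ... | ... | ... |
| 296 | 0 | 0 | True |
| 297 | 0 | 1 | False |
| 298 | 1 | 0 | False |
| 299 | 1 | 1 | True |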


In [34]:
dtc_predict_crewmate2_df["Equals_c2"].value_counts()

True     175
False    126
Name: Equals_c2, dtype: int64

### Random Forest Classification - Model #2
We perform a random forest classification in order to assess the validity of our features.

In [35]:
rfc_crewmate2 = RandomForestClassifier()
rfc_crewmate2 = rfc_crewmate2.fit(data_crewmate2_train, target_crewmate2_train)
rfc_crewmate2.score(data_crewmate2_test, target_crewmate2_test)

0.5581395348837209

In [36]:
sorted(zip(rfc_crewmate2.feature_importances_, features_crewmate2), reverse=True)

[(0.6450617484940954, 'Game Length'),
 (0.16149309125954164, 'Task Completed'),
 (0.07070932879799949, 'Sabotages Fixed'),
 (0.05514250268091906, 'Murdered'),
 (0.01859433967994544, 'Ejected'),
 (0.01446876742098246, 'Evening'),
 (0.013142437051076202, 'Afternoon'),
 (0.011362997864670613, 'Night'),
 (0.010024786750769812, 'Morning')]

In [37]:
rfc_predict_crewmate2_df = pd.DataFrame({
    "Prediction_c2" : rfc_crewmate2.predict(data_crewmate2_test),
    "Actual_c2" : target_crewmate2_test
})
rfc_predict_crewmate2_df["Equals_c2"] = rfc_predict_crewmate2_df["Prediction_c2"].eq(rfc_predict_crewmate2_df["Actual_c2"])
rfc_predict_crewmate2_df

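| | Prediction_c2 | Actual_c2 | Equals_c2 |
|---|---|---|---|
| 0 | 1 | 1 | True |
| 1 | 0 | 1 | False |
| 2 | 1 | 0 | False |
| 3 | 1 | 1 | True |
| 4 | 0 | 1 | False |
| ... | ... | ... | ... |
| 296 | 0 | 0 | True |
| 297 | 0 | 1 | False |
| 298 | 1 | 0 | False |
| 299 | 1 | 1 | True |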


In [38]:
rfc_predict_crewmate2_df["Equals_c2"].value_counts()

True     168
False    133
Name: Equals_c2, dtype: int64
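
To compare the six fits side by side, the True/False counts reported above can be collected into one table. The accuracy of each fit is its correct count divided by the total, which matches the value returned by `score`. The counts below are copied from the outputs in this notebook; the table layout is illustrative.

```python
import pandas as pd

# (model, classifier, correct, incorrect) copied from the value_counts outputs
results = pd.DataFrame(
    [
        ("Imposter Model #1", "Decision Tree", 46, 27),
        ("Imposter Model #1", "Random Forest", 52, 21),
        ("Crewmate Model #1", "Decision Tree", 174, 127),
        ("Crewmate Model #1", "Random Forest", 154, 147),
        ("Crewmate Model #2", "Decision Tree", 175, 126),
        ("Crewmate Model #2", "Random Forest", 168, 133),
    ],
    columns=["Model", "Classifier", "Correct", "Incorrect"],
)

# Accuracy = correct predictions / all predictions
results["Accuracy"] = results["Correct"] / (
    results["Correct"] + results["Incorrect"]
)

# Row with the highest accuracy
best = results.loc[results["Accuracy"].idxmax()]
print(results.round(4))
```

On these counts the Random Forest on the Imposter model scores highest, and for both Crewmate models the Decision Tree scores above the Random Forest.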