# Setting up the Notebook

## Dependencies
Loading in dependencies for the project.
- **pandas**: Used for DataFrame wrangling
- **os**: Used to define the filepath to load in data
- **datetime**: Used for date/time conversions
- **numpy**: Used for random sampling and array operations
- **sklearn tree**: Decision Tree Classifier function
- **sklearn.ensemble RandomForestClassifier**: Random Forest Classifier function
- **sklearn.preprocessing StandardScaler**: Used to standardize non-binary features
- **sklearn.model_selection GridSearchCV**: Used to fine tune our best predictive models

In [1]:
import pandas as pd
import os
import numpy as np
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

### Global Seed
We will define the global seed that we will later use to randomly select training and testing groups.  
We used *61518* but you can use whatever number you would like to vary your results.

In [2]:
global_seed = 61518

In [3]:
np.random.seed(global_seed)
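
As a small illustration of what seeding buys us, the sketch below reseeds NumPy's global generator before each draw and checks that the same seed gives the same selection. The helper name `draw_test_users` and its defaults are hypothetical and are not part of the notebook.

```python
import numpy as np

def draw_test_users(seed, n_users=29, n_test=5):
    """Seed NumPy's global generator, then draw n_test distinct user IDs."""
    np.random.seed(seed)
    users = np.arange(1, n_users + 1)
    return np.random.choice(users, n_test, replace=False)

first = draw_test_users(61518)
second = draw_test_users(61518)

print(np.array_equal(first, second))   # same seed, same draw
print(len(set(first.tolist())) == 5)   # sampling without replacement, no repeats
```

Changing the seed changes which users land in the test group, which is why results vary from seed to seed.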

## Load in our Machine Learning DataFrame
We have a DataFrame that was built specifically for machine learning and other analytics. 
We load that in here.

In [4]:
file_to_load1 = os.path.join("Output Data", "ml_df.csv")
ml_df = pd.read_csv(file_to_load1)

# Machine Learning
Now that we have all necessary fields in a usable format, we will look to classify **Outcome** using Decision Tree and Random Forest classifiers.

## Create our training and testing groups
To avoid using any one **User**'s game results to predict that same user's other game results, we randomly select 5 **User**s as the testing group.  
This causes the split between training and testing data to stray from a 4:1 ratio, and it creates an imbalance between the **Crewmate** and **Imposter** analyses. Given the explicit problem caused otherwise, we will use this method until a more appropriate one is determined.
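
For comparison, scikit-learn ships a splitter that holds out whole groups. The sketch below is an alternative to the manual approach used in this notebook, not the method used here; `demo_df` is a made-up stand-in for `ml_df` with 29 users and 10 games each.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical stand-in for ml_df: 29 users with 10 games each.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "User": np.repeat(np.arange(1, 30), 10),
    "Outcome": rng.integers(0, 2, size=290),
})

# An integer test_size asks for that many whole groups (users) in the test split.
splitter = GroupShuffleSplit(n_splits=1, test_size=5, random_state=61518)
train_idx, test_idx = next(splitter.split(demo_df, groups=demo_df["User"]))

train_users = set(demo_df.iloc[train_idx]["User"])
test_users = set(demo_df.iloc[test_idx]["User"])
print(len(test_users), len(train_users & test_users))
```

Either way, the property that matters is that no user appears on both sides of the split.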

### Create a list of users
We will use this list to split our data into training and test groups.

In [5]:
user_list = np.arange(1, ml_df["User"].astype(int).max() + 1, 1).tolist()
user_list

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29]

### Define groups
We will define our test group first, specifying without replacement to avoid repetition.  
We then define our training group by what is not in our test group.

In [6]:
test_list = np.random.choice(user_list, 5, replace = False)
test_list

array([24, 10, 27, 19, 26])

In [7]:
train_list = []
for x in user_list:
    if x not in test_list:
        train_list.append(x)
train_list = np.array(train_list)
train_list

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 11, 12, 13, 14, 15, 16, 17, 18,
       20, 21, 22, 23, 25, 28, 29])
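
The same complement can be computed in one call. This is a minimal sketch with `np.setdiff1d`, using the test users printed above; the variable names are chosen for illustration only.

```python
import numpy as np

user_ids = np.arange(1, 30)                  # users 1..29, as in user_list
held_out = np.array([24, 10, 27, 19, 26])    # the test users drawn above

# setdiff1d returns the sorted unique values of user_ids that are not in held_out.
remaining = np.setdiff1d(user_ids, held_out)
print(remaining.size)  # 24 users left for training
```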

Using our test and training lists, we split our data.

In [8]:
test_group = ml_df.loc[ml_df["User"].astype(int).isin(test_list)]
train_group = ml_df.loc[ml_df["User"].astype(int).isin(train_list)]

Split our groups between crewmates and imposters and drop unnecessary columns.

In [9]:
crewmate_test_df = test_group.loc[test_group["Team"] == 1].reset_index(drop=True)
crewmate_test_df.drop(columns = ["Imposter Kills"], inplace = True)
crewmate_train_df = train_group.loc[train_group["Team"] == 1].reset_index(drop=True)
crewmate_train_df.drop(columns = ["Imposter Kills"], inplace = True)

imposter_test_df = test_group.loc[test_group["Team"] == 0].reset_index(drop=True)
imposter_test_df.drop(columns = ["Task Completed", "All Tasks Completed", "Sabotages Fixed", "Time to complete all tasks", "Murdered"], inplace = True)
imposter_train_df = train_group.loc[train_group["Team"] == 0].reset_index(drop=True)
imposter_train_df.drop(columns = ["Task Completed", "All Tasks Completed", "Sabotages Fixed", "Time to complete all tasks", "Murdered"], inplace = True)

In [10]:
crewmate_train_df.columns

Index(['User', 'Team', 'Outcome', 'Game Length', 'Task Completed',
       'All Tasks Completed', 'Time to complete all tasks', 'Sabotages Fixed',
       'Murdered', 'Ejected', 'Night', 'Morning', 'Afternoon', 'Evening'],
      dtype='object')

In [11]:
imposter_train_df.columns

Index(['User', 'Team', 'Outcome', 'Game Length', 'Imposter Kills', 'Ejected',
       'Night', 'Morning', 'Afternoon', 'Evening'],
      dtype='object')

# Scaling

## Imposter Scaling

In [12]:
target_imposter_train_df = imposter_train_df["Outcome"]
data_imposter_train_df = imposter_train_df[["Game Length", "Imposter Kills", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

target_imposter_test_df = imposter_test_df["Outcome"]
data_imposter_test_df = imposter_test_df[["Game Length", "Imposter Kills", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

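In [13]:
# Standardize the training and test features separately; each call fits its own mean and standard deviation.
data_imposter_train_val = StandardScaler().fit_transform(data_imposter_train_df.values)
data_imposter_test_val = StandardScaler().fit_transform(data_imposter_test_df.values)

In [14]:
# Fit a second scaler on the already-standardized training values.
data_i_scaler = StandardScaler().fit(data_imposter_train_val)

In [15]:
# Apply that scaler to both standardized arrays.
scaled_data_imposter_train = data_i_scaler.transform(data_imposter_train_val)
scaled_data_imposter_test = data_i_scaler.transform(data_imposter_test_val)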

In [16]:
scaled_data_imposter_train_df = pd.DataFrame(scaled_data_imposter_train, index=data_imposter_train_df.index, columns=data_imposter_train_df.columns)
scaled_data_imposter_test_df = pd.DataFrame(scaled_data_imposter_test, index=data_imposter_test_df.index, columns=data_imposter_test_df.columns)

In [17]:
imposter_train_df = pd.DataFrame({
    'Outcome' : target_imposter_train_df,
    'Game Length' : scaled_data_imposter_train_df["Game Length"],
    'Imposter Kills' : scaled_data_imposter_train_df["Imposter Kills"],
    'Ejected' : imposter_train_df["Ejected"],
    'Night' : imposter_train_df["Night"],
    'Morning' : imposter_train_df["Morning"],
    'Afternoon' : imposter_train_df["Afternoon"],
    'Evening' : imposter_train_df["Evening"]
})

imposter_test_df = pd.DataFrame({
    'Outcome' : target_imposter_test_df,
    'Game Length' : scaled_data_imposter_test_df["Game Length"],
    'Imposter Kills' : scaled_data_imposter_test_df["Imposter Kills"],
    'Ejected' : imposter_test_df["Ejected"],
    'Night' : imposter_test_df["Night"],
    'Morning' : imposter_test_df["Morning"],
    'Afternoon' : imposter_test_df["Afternoon"],
    'Evening' : imposter_test_df["Evening"]
})
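
The cells above standardize the training and test features with separately fitted scalers. A common alternative, sketched here under assumptions and not taken from this notebook, is to fit one `StandardScaler` on the training rows and reuse its mean and standard deviation for the test rows, so both sets share a scale. The frames and values below are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature train/test frames with some of the Imposter columns.
train_demo = pd.DataFrame({
    "Game Length": [300.0, 600.0, 900.0, 1200.0],
    "Imposter Kills": [1.0, 2.0, 3.0, 2.0],
    "Ejected": [0, 1, 0, 1],
})
test_demo = pd.DataFrame({
    "Game Length": [450.0, 750.0],
    "Imposter Kills": [2.0, 1.0],
    "Ejected": [1, 0],
})

continuous_cols = ["Game Length", "Imposter Kills"]  # binary flags are left as-is

# Fit on the training rows only, then apply the same transform to both sets.
scaler = StandardScaler().fit(train_demo[continuous_cols])
train_scaled = train_demo.copy()
test_scaled = test_demo.copy()
train_scaled[continuous_cols] = scaler.transform(train_demo[continuous_cols])
test_scaled[continuous_cols] = scaler.transform(test_demo[continuous_cols])

print(test_scaled)
```

Tree-based models split on thresholds, so they are not sensitive to a feature's scale as long as training and test rows are transformed the same way.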

## Crewmate Scaling

In [18]:
target_crewmate_train_df = crewmate_train_df["Outcome"]
data_crewmate_train_df = crewmate_train_df[["Game Length", "Task Completed", "All Tasks Completed", "Sabotages Fixed", "Murdered", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

target_crewmate_test_df = crewmate_test_df["Outcome"]
data_crewmate_test_df = crewmate_test_df[["Game Length", "Task Completed", "All Tasks Completed", "Sabotages Fixed", "Murdered", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

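In [19]:
# Standardize the training and test features separately; each call fits its own mean and standard deviation.
data_crewmate_train_val = StandardScaler().fit_transform(data_crewmate_train_df.values)
data_crewmate_test_val = StandardScaler().fit_transform(data_crewmate_test_df.values)

In [20]:
# Fit a second scaler on the already-standardized training values.
data_c_scaler = StandardScaler().fit(data_crewmate_train_val)

In [21]:
# Apply that scaler to both standardized arrays.
scaled_data_crewmate_train = data_c_scaler.transform(data_crewmate_train_val)
scaled_data_crewmate_test = data_c_scaler.transform(data_crewmate_test_val)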

In [22]:
scaled_data_crewmate_train_df = pd.DataFrame(scaled_data_crewmate_train, index=data_crewmate_train_df.index, columns=data_crewmate_train_df.columns)
scaled_data_crewmate_test_df = pd.DataFrame(scaled_data_crewmate_test, index=data_crewmate_test_df.index, columns=data_crewmate_test_df.columns)

In [23]:
crewmate_train_df = pd.DataFrame({
    'Outcome' : target_crewmate_train_df,
    'Game Length' : scaled_data_crewmate_train_df["Game Length"],
    'Task Completed' : scaled_data_crewmate_train_df["Task Completed"],
    'All Tasks Completed' : scaled_data_crewmate_train_df["All Tasks Completed"],
    'Sabotages Fixed' : scaled_data_crewmate_train_df["Sabotages Fixed"],
    'Murdered' : crewmate_train_df["Murdered"],
    'Ejected' : crewmate_train_df["Ejected"],
    'Night' : crewmate_train_df["Night"],
    'Morning' : crewmate_train_df["Morning"],
    'Afternoon' : crewmate_train_df["Afternoon"],
    'Evening' : crewmate_train_df["Evening"]
})

crewmate_test_df = pd.DataFrame({
    'Outcome' : target_crewmate_test_df,
    'Game Length' : scaled_data_crewmate_test_df["Game Length"],
    'Task Completed' : scaled_data_crewmate_test_df["Task Completed"],
    'All Tasks Completed' : scaled_data_crewmate_test_df["All Tasks Completed"],
    'Sabotages Fixed' : scaled_data_crewmate_test_df["Sabotages Fixed"],
    'Murdered' : crewmate_test_df["Murdered"],
    'Ejected' : crewmate_test_df["Ejected"],
    'Night' : crewmate_test_df["Night"],
    'Morning' : crewmate_test_df["Morning"],
    'Afternoon' : crewmate_test_df["Afternoon"],
    'Evening' : crewmate_test_df["Evening"]
})

## Machine Learning - Imposter
We start by defining an appropriate model to predict wins or losses for **Imposter**s.

### Define Model #1
We will have to specify what it is that we are trying to predict and with what data.  
To start we will include all applicable features and adjust to improve accuracy.
- **Target**: Outcome
- **Data**: Game Length, Imposter Kills, Ejected, Night, Morning, Afternoon, Evening

In [24]:
target_imposter1_train = imposter_train_df["Outcome"]
data_imposter1_train = imposter_train_df[["Game Length", "Imposter Kills", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

target_imposter1_test = imposter_test_df["Outcome"]
data_imposter1_test = imposter_test_df[["Game Length", "Imposter Kills", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

features_imposter1 = data_imposter1_train.columns

### Decision Tree Classification - Model #1
We perform a decision tree classification in order to assess the validity of our features.

In [25]:
dtc_imposter1 = tree.DecisionTreeClassifier()
dtc_imposter1 = dtc_imposter1.fit(data_imposter1_train, target_imposter1_train)
dtc_imposter1.score(data_imposter1_test, target_imposter1_test)

0.6428571428571429

In [26]:
sorted(zip(dtc_imposter1.feature_importances_, features_imposter1), reverse=True)

[(0.6077808620934279, 'Game Length'),
 (0.18313495173502667, 'Ejected'),
 (0.11753200699568495, 'Imposter Kills'),
 (0.0327158563880437, 'Afternoon'),
 (0.023444401000083485, 'Evening'),
 (0.02232557970206204, 'Morning'),
 (0.01306634208567141, 'Night')]
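
The sorted list above pairs each importance with its feature name. An equivalent view, sketched here on synthetic data with placeholder column names, is a labeled pandas Series. The importances reported by a fitted tree are normalized, so they add up to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; the column names are placeholders, not the notebook's features.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
cols = ["f0", "f1", "f2", "f3"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Label each importance with its feature and sort from most to least important.
importances = pd.Series(clf.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
print(importances.sum())
```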

In [27]:
dtc_predict_imposter1_df = pd.DataFrame({
    "Prediction_i1_dtc" : dtc_imposter1.predict(data_imposter1_test),
    "Actual_i1_dtc" : target_imposter1_test
})
dtc_predict_imposter1_df["Equals_i1_dtc"] = dtc_predict_imposter1_df["Prediction_i1_dtc"].eq(dtc_predict_imposter1_df["Actual_i1_dtc"])
dtc_predict_imposter1_df

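| | Prediction_i1_dtc | Actual_i1_dtc | Equals_i1_dtc |
|---|---|---|---|
| 0 | 1 | 0 | False |
| 1 | 0 | 0 | True |
| 2 | 0 | 0 | True |
| 3 | 0 | 0 | True |
| 4 | 0 | 1 | False |
| ... | ... | ... | ... |
| 65 | 0 | 0 | True |
| 66 | 0 | 0 | True |
| 67 | 0 | 1 | False |
| 68 | 0 | 1 | False |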


In [28]:
dtc_predict_imposter1_df["Equals_i1_dtc"].value_counts()

True     45
False    25
Name: Equals_i1_dtc, dtype: int64
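
These counts tie back to the score reported earlier: accuracy is the number of correct predictions divided by the number of test games. A minimal sketch that rebuilds the ratio from the counts above:

```python
import pandas as pd

# Counts reported above for the Model #1 decision tree: 45 correct, 25 incorrect.
equals = pd.Series([True] * 45 + [False] * 25)

correct = int(equals.sum())      # True counts as 1
total = len(equals)
accuracy = correct / total       # correct predictions / all predictions
print(accuracy)
print(equals.mean())             # the mean of a boolean Series is the same ratio
```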

### Random Forest Classification - Model #1
We perform a random forest classification in order to assess the validity of our features.

In [29]:
rfc_imposter1 = RandomForestClassifier()
rfc_imposter1 = rfc_imposter1.fit(data_imposter1_train, target_imposter1_train)
rfc_imposter1.score(data_imposter1_test, target_imposter1_test)

0.6714285714285714
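
The forest above is created without a `random_state`, so its score depends on the state of NumPy's global generator when the cell runs. One way to pin a single model down, shown here as an optional sketch on synthetic data, is to pass `random_state` explicitly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

# Two forests built with the same random_state grow identical trees.
forest_a = RandomForestClassifier(n_estimators=25, random_state=61518).fit(X_train, y_train)
forest_b = RandomForestClassifier(n_estimators=25, random_state=61518).fit(X_train, y_train)

same_predictions = np.array_equal(forest_a.predict(X_test), forest_b.predict(X_test))
print(same_predictions)
```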

In [30]:
sorted(zip(rfc_imposter1.feature_importances_, features_imposter1), reverse=True)

[(0.646790528064293, 'Game Length'),
 (0.18130321214056458, 'Ejected'),
 (0.11693955321110565, 'Imposter Kills'),
 (0.01558925325201164, 'Night'),
 (0.01460442146592011, 'Evening'),
 (0.013620398689973489, 'Morning'),
 (0.011152633176131563, 'Afternoon')]

In [31]:
rfc_predict_imposter1_df = pd.DataFrame({
    "Prediction_i1_rfc" : rfc_imposter1.predict(data_imposter1_test),
    "Actual_i1_rfc" : target_imposter1_test
})
rfc_predict_imposter1_df["Equals_i1_rfc"] = rfc_predict_imposter1_df["Prediction_i1_rfc"].eq(rfc_predict_imposter1_df["Actual_i1_rfc"])
rfc_predict_imposter1_df

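| | Prediction_i1_rfc | Actual_i1_rfc | Equals_i1_rfc |
|---|---|---|---|
| 0 | 1 | 0 | False |
| 1 | 0 | 0 | True |
| 2 | 0 | 0 | True |
| 3 | 0 | 0 | True |
| 4 | 0 | 1 | False |
| ... | ... | ... | ... |
| 65 | 0 | 0 | True |
| 66 | 0 | 0 | True |
| 67 | 0 | 1 | False |
| 68 | 0 | 1 | False |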


In [32]:
rfc_predict_imposter1_df["Equals_i1_rfc"].value_counts()

True     47
False    23
Name: Equals_i1_rfc, dtype: int64

### Define Model #2
We will try dropping all **Time of Day** features as they appeared to contribute very little in our more accurate model (Random Forest Classification).
- **Target**: Outcome
- **Data**: Game Length, Imposter Kills, Ejected

In [33]:
target_imposter2_train = imposter_train_df["Outcome"]
data_imposter2_train = imposter_train_df[["Game Length", "Imposter Kills", "Ejected"]]

target_imposter2_test = imposter_test_df["Outcome"]
data_imposter2_test = imposter_test_df[["Game Length", "Imposter Kills", "Ejected"]]

features_imposter2 = data_imposter2_train.columns
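
Each model in this notebook repeats the same define, fit, and score steps with a different feature list. A possible way to factor that out is sketched below; `evaluate_features` is a hypothetical helper, and the tiny frames are made up purely to show the call shape.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def evaluate_features(train_df, test_df, features, target="Outcome", seed=0):
    """Fit a decision tree and a random forest on `features`; return test accuracies."""
    X_train, y_train = train_df[features], train_df[target]
    X_test, y_test = test_df[features], test_df[target]
    scores = {}
    for name, model in [
        ("dtc", DecisionTreeClassifier(random_state=seed)),
        ("rfc", RandomForestClassifier(random_state=seed)),
    ]:
        scores[name] = model.fit(X_train, y_train).score(X_test, y_test)
    return scores

# Tiny made-up frames, only to show how the helper is called.
train_demo = pd.DataFrame({
    "Game Length": [0.1, -1.2, 0.8, 1.5, -0.4, -0.9],
    "Imposter Kills": [1.0, -0.5, 0.3, 1.2, -1.0, 0.0],
    "Ejected": [0, 1, 0, 0, 1, 1],
    "Outcome": [1, 0, 1, 1, 0, 0],
})
test_demo = train_demo.iloc[:3]

result = evaluate_features(train_demo, test_demo, ["Game Length", "Imposter Kills", "Ejected"])
print(result)
```

Calling the helper once per candidate feature list would give one pair of scores per model.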

### Decision Tree Classification - Model #2
We perform a decision tree classification in order to assess the validity of our features.

In [34]:
dtc_imposter2 = tree.DecisionTreeClassifier()
dtc_imposter2 = dtc_imposter2.fit(data_imposter2_train, target_imposter2_train)
dtc_imposter2.score(data_imposter2_test, target_imposter2_test)

0.5571428571428572

In [35]:
sorted(zip(dtc_imposter2.feature_importances_, features_imposter2), reverse=True)

[(0.7123858043026532, 'Game Length'),
 (0.18697918395803603, 'Ejected'),
 (0.10063501173931079, 'Imposter Kills')]

In [36]:
dtc_predict_imposter2_df = pd.DataFrame({
    "Prediction_i2_dtc" : dtc_imposter2.predict(data_imposter2_test),
    "Actual_i2_dtc" : target_imposter2_test
})
dtc_predict_imposter2_df["Equals_i2_dtc"] = dtc_predict_imposter2_df["Prediction_i2_dtc"].eq(dtc_predict_imposter2_df["Actual_i2_dtc"])
dtc_predict_imposter2_df

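| | Prediction_i2_dtc | Actual_i2_dtc | Equals_i2_dtc |
|---|---|---|---|
| 0 | 1 | 0 | False |
| 1 | 0 | 0 | True |
| 2 | 0 | 0 | True |
| 3 | 0 | 0 | True |
| 4 | 0 | 1 | False |
| ... | ... | ... | ... |
| 65 | 0 | 0 | True |
| 66 | 0 | 0 | True |
| 67 | 0 | 1 | False |
| 68 | 0 | 1 | False |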


In [37]:
dtc_predict_imposter2_df["Equals_i2_dtc"].value_counts()

True     39
False    31
Name: Equals_i2_dtc, dtype: int64

### Random Forest Classification - Model #2
We perform a random forest classification in order to assess the validity of our features.

In [38]:
rfc_imposter2 = RandomForestClassifier()
rfc_imposter2 = rfc_imposter2.fit(data_imposter2_train, target_imposter2_train)
rfc_imposter2.score(data_imposter2_test, target_imposter2_test)

0.5714285714285714

In [39]:
sorted(zip(rfc_imposter2.feature_importances_, features_imposter2), reverse=True)

[(0.7257980196773609, 'Game Length'),
 (0.1848321851650571, 'Ejected'),
 (0.08936979515758196, 'Imposter Kills')]

In [40]:
rfc_predict_imposter2_df = pd.DataFrame({
    "Prediction_i2_rfc" : rfc_imposter2.predict(data_imposter2_test),
    "Actual_i2_rfc" : target_imposter2_test
})
rfc_predict_imposter2_df["Equals_i2_rfc"] = rfc_predict_imposter2_df["Prediction_i2_rfc"].eq(rfc_predict_imposter2_df["Actual_i2_rfc"])
rfc_predict_imposter2_df

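| | Prediction_i2_rfc | Actual_i2_rfc | Equals_i2_rfc |
|---|---|---|---|
| 0 | 1 | 0 | False |
| 1 | 0 | 0 | True |
| 2 | 0 | 0 | True |
| 3 | 0 | 0 | True |
| 4 | 0 | 1 | False |
| ... | ... | ... | ... |
| 65 | 0 | 0 | True |
| 66 | 0 | 0 | True |
| 67 | 0 | 1 | False |
| 68 | 0 | 1 | False |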


In [41]:
rfc_predict_imposter2_df["Equals_i2_rfc"].value_counts()

True     40
False    30
Name: Equals_i2_rfc, dtype: int64

### Define Model #3
Dropping all **Time of Day** features lowered our accuracy, so they provide some benefit.  
First we try adding back **Night** and **Evening**, the two **Time of Day** features with the highest weights in our initial Random Forest Classification model.
- **Target**: Outcome
- **Data**: Game Length, Imposter Kills, Ejected, Night, Evening

In [42]:
target_imposter3_train = imposter_train_df["Outcome"]
data_imposter3_train = imposter_train_df[["Game Length", "Imposter Kills", "Ejected", "Night", "Evening"]]

target_imposter3_test = imposter_test_df["Outcome"]
data_imposter3_test = imposter_test_df[["Game Length", "Imposter Kills", "Ejected", "Night", "Evening"]]

features_imposter3 = data_imposter3_train.columns

### Decision Tree Classification - Model #3
We perform a decision tree classification in order to assess the validity of our features.

In [43]:
dtc_imposter3 = tree.DecisionTreeClassifier()
dtc_imposter3 = dtc_imposter3.fit(data_imposter3_train, target_imposter3_train)
dtc_imposter3.score(data_imposter3_test, target_imposter3_test)

0.6

In [44]:
sorted(zip(dtc_imposter3.feature_importances_, features_imposter3), reverse=True)

[(0.6321958811676404, 'Game Length'),
 (0.18313495173502667, 'Ejected'),
 (0.09693212318504882, 'Imposter Kills'),
 (0.05925094168954822, 'Evening'),
 (0.028486102222735947, 'Night')]

In [45]:
dtc_predict_imposter3_df = pd.DataFrame({
    "Prediction_i3_dtc" : dtc_imposter3.predict(data_imposter3_test),
    "Actual_i3_dtc" : target_imposter3_test
})
dtc_predict_imposter3_df["Equals_i3_dtc"] = dtc_predict_imposter3_df["Prediction_i3_dtc"].eq(dtc_predict_imposter3_df["Actual_i3_dtc"])
dtc_predict_imposter3_df

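| | Prediction_i3_dtc | Actual_i3_dtc | Equals_i3_dtc |
|---|---|---|---|
| 0 | 1 | 0 | False |
| 1 | 0 | 0 | True |
| 2 | 0 | 0 | True |
| 3 | 0 | 0 | True |
| 4 | 0 | 1 | False |
| ... | ... | ... | ... |
| 65 | 0 | 0 | True |
| 66 | 0 | 0 | True |
| 67 | 0 | 1 | False |
| 68 | 0 | 1 | False |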


In [46]:
dtc_predict_imposter3_df["Equals_i3_dtc"].value_counts()

True     42
False    28
Name: Equals_i3_dtc, dtype: int64

### Random Forest Classification - Model #3
We perform a random forest classification in order to assess the validity of our features.

In [47]:
rfc_imposter3 = RandomForestClassifier()
rfc_imposter3 = rfc_imposter3.fit(data_imposter3_train, target_imposter3_train)
rfc_imposter3.score(data_imposter3_test, target_imposter3_test)

0.6714285714285714

In [48]:
sorted(zip(rfc_imposter3.feature_importances_, features_imposter3), reverse=True)

[(0.6559329342776339, 'Game Length'),
 (0.1793411972824306, 'Ejected'),
 (0.11735652182921855, 'Imposter Kills'),
 (0.02600410109924869, 'Evening'),
 (0.021365245511468216, 'Night')]

In [49]:
rfc_predict_imposter3_df = pd.DataFrame({
    "Prediction_i3_rfc" : rfc_imposter3.predict(data_imposter3_test),
    "Actual_i3_rfc" : target_imposter3_test
})
rfc_predict_imposter3_df["Equals_i3_rfc"] = rfc_predict_imposter3_df["Prediction_i3_rfc"].eq(rfc_predict_imposter3_df["Actual_i3_rfc"])
rfc_predict_imposter3_df

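| | Prediction_i3_rfc | Actual_i3_rfc | Equals_i3_rfc |
|---|---|---|---|
| 0 | 1 | 0 | False |
| 1 | 0 | 0 | True |
| 2 | 0 | 0 | True |
| 3 | 0 | 0 | True |
| 4 | 0 | 1 | False |
| ... | ... | ... | ... |
| 65 | 0 | 0 | True |
| 66 | 0 | 0 | True |
| 67 | 0 | 1 | False |
| 68 | 0 | 1 | False |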


In [50]:
rfc_predict_imposter3_df["Equals_i3_rfc"].value_counts()

True     47
False    23
Name: Equals_i3_rfc, dtype: int64

### Define Model #4
Our last model gave a significant increase in predictive accuracy over Model #2, but we'll try an alternative model.  
We will try swapping out **Night** for **Afternoon**, as **Afternoon** had the highest weight among the **Time of Day** features in our first decision tree classification model.
- **Target**: Outcome
- **Data**: Game Length, Imposter Kills, Ejected, Afternoon, Evening

In [51]:
target_imposter4_train = imposter_train_df["Outcome"]
data_imposter4_train = imposter_train_df[["Game Length", "Imposter Kills", "Ejected", "Afternoon", "Evening"]]

target_imposter4_test = imposter_test_df["Outcome"]
data_imposter4_test = imposter_test_df[["Game Length", "Imposter Kills", "Ejected", "Afternoon", "Evening"]]

features_imposter4 = data_imposter4_train.columns

### Decision Tree Classification - Model #4
We perform a decision tree classification in order to assess the validity of our features.

In [52]:
dtc_imposter4 = tree.DecisionTreeClassifier()
dtc_imposter4 = dtc_imposter4.fit(data_imposter4_train, target_imposter4_train)
dtc_imposter4.score(data_imposter4_test, target_imposter4_test)

0.6428571428571429

In [53]:
sorted(zip(dtc_imposter4.feature_importances_, features_imposter4), reverse=True)

[(0.6234192245761346, 'Game Length'),
 (0.1840811139426129, 'Ejected'),
 (0.11327490065910534, 'Imposter Kills'),
 (0.042864424494734495, 'Evening'),
 (0.03636033632741271, 'Afternoon')]

In [54]:
dtc_predict_imposter4_df = pd.DataFrame({
    "Prediction_i4_dtc" : dtc_imposter4.predict(data_imposter4_test),
    "Actual_i4_dtc" : target_imposter4_test
})
dtc_predict_imposter4_df["Equals_i4_dtc"] = dtc_predict_imposter4_df["Prediction_i4_dtc"].eq(dtc_predict_imposter4_df["Actual_i4_dtc"])
dtc_predict_imposter4_df

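| | Prediction_i4_dtc | Actual_i4_dtc | Equals_i4_dtc |
|---|---|---|---|
| 0 | 1 | 0 | False |
| 1 | 0 | 0 | True |
| 2 | 0 | 0 | True |
| 3 | 0 | 0 | True |
| 4 | 0 | 1 | False |
| ... | ... | ... | ... |
| 65 | 0 | 0 | True |
| 66 | 0 | 0 | True |
| 67 | 0 | 1 | False |
| 68 | 0 | 1 | False |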


In [55]:
dtc_predict_imposter4_df["Equals_i4_dtc"].value_counts()

True     45
False    25
Name: Equals_i4_dtc, dtype: int64

### Random Forest Classification - Model #4
We perform a random forest classification in order to assess the validity of our features.

In [56]:
rfc_imposter4 = RandomForestClassifier(n_estimators = 20, criterion = 'entropy')
rfc_imposter4 = rfc_imposter4.fit(data_imposter4_train, target_imposter4_train)
rfc_imposter4.score(data_imposter4_test, target_imposter4_test)

0.7

In [57]:
sorted(zip(rfc_imposter4.feature_importances_, features_imposter4), reverse=True)

[(0.6796233699639577, 'Game Length'),
 (0.15952139066979176, 'Ejected'),
 (0.11394373072584152, 'Imposter Kills'),
 (0.024481463572779617, 'Evening'),
 (0.022430045067629436, 'Afternoon')]
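
The `n_estimators` and `criterion` arguments passed to this forest are the kind of hyperparameters that `GridSearchCV` can search over. The sketch below runs on synthetic data with a hypothetical grid; the values are illustrative and are not claimed to be the ones searched in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical grid; the values are illustrative only.
param_grid = {
    "n_estimators": [10, 20, 50],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the data passed to fit
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With per-user data like ours, a group-aware splitter such as `GroupKFold` could be passed as `cv` so that one user's games do not land in both the fitting and validation folds.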

In [58]:
rfc_predict_imposter4_df = pd.DataFrame({
    "Game Length" : data_imposter_test_df["Game Length"],
    "Ejected" : data_imposter_test_df["Ejected"],
    "Imposter Kills" : data_imposter_test_df["Imposter Kills"],
    "Evening" : data_imposter_test_df["Evening"],
    "Afternoon" : data_imposter_test_df["Afternoon"],
    "Outcome" : target_imposter4_test,
    "Prediction" : rfc_imposter4.predict(data_imposter4_test)
})
rfc_predict_imposter4_df["Correct Prediction"] = rfc_predict_imposter4_df["Prediction"].eq(rfc_predict_imposter4_df["Outcome"])
rfc_predict_imposter4_df

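| | Game Length | Ejected | Imposter Kills | Evening | Afternoon | Outcome | Prediction | Correct Prediction |
|---|---|---|---|---|---|---|---|---|
| 0 | 336 | 0 | 2 | 0 | 1 | 0 | 1 | False |
| 1 | 904 | 0 | 3 | 0 | 1 | 0 | 0 | True |
| 2 | 660 | 1 | 1 | 0 | 1 | 0 | 0 | True |
| 3 | 1081 | 1 | 1 | 0 | 1 | 0 | 0 | True |
| 4 | 865 | 0 | 2 | 0 | 1 | 1 | 0 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65 | 503 | 1 | 2 | 0 | 1 | 0 | 0 | True |
| 66 | 107 | 1 | 1 | 0 | 1 | 0 | 0 | True |
| 67 | 202 | 0 | 2 | 0 | 1 | 1 | 0 | False |
| 68 | 196 | 0 | 2 | 0 | 1 | 1 | 0 | False |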


In [59]:
rfc_predict_imposter4_df["Correct Prediction"].value_counts()

True     49
False    21
Name: Correct Prediction, dtype: int64
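
To compare the four Imposter models side by side, the correct-prediction counts from the `value_counts` outputs above can be collected into one table. This is a small sketch; the frame layout and variable names are chosen for illustration.

```python
import pandas as pd

n_test_games = 70  # 45 + 25, the size of the Imposter test set

# Correct-prediction counts copied from the value_counts outputs above.
correct = pd.DataFrame(
    {"dtc": [45, 39, 42, 45], "rfc": [47, 40, 47, 49]},
    index=["Model 1", "Model 2", "Model 3", "Model 4"],
)

accuracy = correct / n_test_games

# Pick the classifier with the highest accuracy, then the model that achieved it.
best_classifier = accuracy.max().idxmax()
best_model = accuracy[best_classifier].idxmax()
print(accuracy.round(4))
print(best_model, best_classifier)
```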

### Model Conclusions
Through trial and error, we've been able to narrow down to the model that provides the most accurate predictions.  
Our best model is **Model 4**, which includes the Game Length, Imposter Kills, Ejected, Afternoon, and Evening features. In the runs shown above, this model reaches a predictive accuracy of 0.6428571428571429 for DTC and 0.7 for RFC. With the adjustment for the

## Machine Learning - Crewmate
We start by defining appropriate models to predict wins or losses for **Crewmate**s.

### Define Model #1
We will have to specify what it is that we are trying to predict and with what data.  
To start we will include all applicable features and adjust to improve accuracy.
- **Target**: Outcome
- **Data**: Game Length, Task Completed, All Tasks Completed, Sabotages Fixed, Murdered, Ejected, Night, Morning, Afternoon, Evening

In [60]:
target_crewmate1_train = crewmate_train_df["Outcome"]
data_crewmate1_train = crewmate_train_df[["Game Length", "Task Completed", "All Tasks Completed", "Sabotages Fixed", "Murdered", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

target_crewmate1_test = crewmate_test_df["Outcome"]
data_crewmate1_test = crewmate_test_df[["Game Length", "Task Completed", "All Tasks Completed", "Sabotages Fixed", "Murdered", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

features_crewmate1 = data_crewmate1_train.columns

### Decision Tree Classification - Model #1
We perform a decision tree classification in order to assess the validity of our features.

In [61]:
dtc_crewmate1 = tree.DecisionTreeClassifier()
dtc_crewmate1 = dtc_crewmate1.fit(data_crewmate1_train, target_crewmate1_train)
dtc_crewmate1.score(data_crewmate1_test, target_crewmate1_test)

0.5690376569037657

In [62]:
sorted(zip(dtc_crewmate1.feature_importances_, features_crewmate1), reverse=True)

[(0.5825727969776329, 'Game Length'),
 (0.11713520469809854, 'Task Completed'),
 (0.08409276854020513, 'Sabotages Fixed'),
 (0.043594194688478506, 'Murdered'),
 (0.036941172514074985, 'Afternoon'),
 (0.034375050854562286, 'Morning'),
 (0.0329122785121211, 'Evening'),
 (0.030465386113982557, 'All Tasks Completed'),
 (0.02191318973854975, 'Night'),
 (0.015997957362294295, 'Ejected')]

In [63]:
dtc_predict_crewmate1_df = pd.DataFrame({
    "Prediction_c1_dtc" : dtc_crewmate1.predict(data_crewmate1_test),
    "Actual_c1_dtc" : target_crewmate1_test
})
dtc_predict_crewmate1_df["Equals_c1_dtc"] = dtc_predict_crewmate1_df["Prediction_c1_dtc"].eq(dtc_predict_crewmate1_df["Actual_c1_dtc"])
dtc_predict_crewmate1_df

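| | Prediction_c1_dtc | Actual_c1_dtc | Equals_c1_dtc |
|---|---|---|---|
| 0 | 1 | 1 | True |
| 1 | 1 | 1 | True |
| 2 | 0 | 0 | True |
| 3 | 1 | 0 | False |
| 4 | 1 | 0 | False |
| ... | ... | ... | ... |
| 234 | 1 | 0 | False |
| 235 | 0 | 0 | True |
| 236 | 1 | 1 | True |
| 237 | 0 | 1 | False |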


In [64]:
dtc_predict_crewmate1_df["Equals_c1_dtc"].value_counts()

True     136
False    103
Name: Equals_c1_dtc, dtype: int64
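
A single True/False count hides which kind of mistake the model makes. A cross-tabulation of actual against predicted outcomes separates the two error types. The sketch below uses made-up labels and assumes that 1 marks a win and 0 a loss.

```python
import pandas as pd

# Hypothetical predictions and labels; assumed encoding: 1 = win, 0 = loss.
demo = pd.DataFrame({
    "Actual":     [1, 1, 0, 0, 0, 1, 0, 1],
    "Prediction": [1, 1, 0, 1, 1, 0, 0, 1],
})

# Rows are the true outcome, columns the predicted outcome.
matrix = pd.crosstab(demo["Actual"], demo["Prediction"])
print(matrix)
```

The diagonal cells hold the correct predictions; the off-diagonal cells hold wins predicted as losses and losses predicted as wins.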

### Random Forest Classification - Model #1
We perform a random forest classification in order to assess the validity of our features.

In [65]:
rfc_crewmate1 = RandomForestClassifier()
rfc_crewmate1 = rfc_crewmate1.fit(data_crewmate1_train, target_crewmate1_train)
rfc_crewmate1.score(data_crewmate1_test, target_crewmate1_test)

0.6150627615062761

In [66]:
sorted(zip(rfc_crewmate1.feature_importances_, features_crewmate1), reverse=True)

[(0.6445397581683949, 'Game Length'),
 (0.14356417238613303, 'Task Completed'),
 (0.06576053015844335, 'Sabotages Fixed'),
 (0.054699844790526156, 'Murdered'),
 (0.017843036041224773, 'All Tasks Completed'),
 (0.016914637865102784, 'Evening'),
 (0.016548463795283033, 'Ejected'),
 (0.014369827485476891, 'Afternoon'),
 (0.01382214761103475, 'Night'),
 (0.011937581698380354, 'Morning')]

In [67]:
rfc_predict_crewmate1_df = pd.DataFrame({
    "Prediction_c1_rfc" : rfc_crewmate1.predict(data_crewmate1_test),
    "Actual_c1_rfc" : target_crewmate1_test
})
rfc_predict_crewmate1_df["Equals_c1_rfc"] = rfc_predict_crewmate1_df["Prediction_c1_rfc"].eq(rfc_predict_crewmate1_df["Actual_c1_rfc"])
rfc_predict_crewmate1_df

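| | Prediction_c1_rfc | Actual_c1_rfc | Equals_c1_rfc |
|---|---|---|---|
| 0 | 1 | 1 | True |
| 1 | 1 | 1 | True |
| 2 | 0 | 0 | True |
| 3 | 1 | 0 | False |
| 4 | 1 | 0 | False |
| ... | ... | ... | ... |
| 234 | 1 | 0 | False |
| 235 | 1 | 0 | False |
| 236 | 1 | 1 | True |
| 237 | 1 | 1 | True |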


In [68]:
rfc_predict_crewmate1_df["Equals_c1_rfc"].value_counts()

True     147
False     92
Name: Equals_c1_rfc, dtype: int64

### Define Model #2
We take our previous model and attempt to make it more accurate by removing **Night**.  
This was one of the features with the lowest weight in our more accurate model (random forest classification).
- **Target**: Outcome
- **Data**: Game Length, Task Completed, All Tasks Completed, Sabotages Fixed, Murdered, Ejected, Morning, Afternoon, Evening

In [69]:
target_crewmate2_train = crewmate_train_df["Outcome"]
data_crewmate2_train = crewmate_train_df[["Game Length", "Task Completed", "All Tasks Completed", "Sabotages Fixed", "Murdered", "Ejected", "Morning", "Afternoon", "Evening"]]

target_crewmate2_test = crewmate_test_df["Outcome"]
data_crewmate2_test = crewmate_test_df[["Game Length", "Task Completed", "All Tasks Completed", "Sabotages Fixed", "Murdered", "Ejected", "Morning", "Afternoon", "Evening"]]

features_crewmate2 = data_crewmate2_train.columns

### Decision Tree Classification - Model #2
We perform a decision tree classification in order to assess the validity of our features.

In [70]:
dtc_crewmate2 = tree.DecisionTreeClassifier()
dtc_crewmate2 = dtc_crewmate2.fit(data_crewmate2_train, target_crewmate2_train)
dtc_crewmate2.score(data_crewmate2_test, target_crewmate2_test)

0.5774058577405857

In [71]:
sorted(zip(dtc_crewmate2.feature_importances_, features_crewmate2), reverse=True)

[(0.5846131684876373, 'Game Length'),
 (0.1195015298762009, 'Task Completed'),
 (0.08025546751934998, 'Sabotages Fixed'),
 (0.04917092064269182, 'Evening'),
 (0.0435941946884785, 'Murdered'),
 (0.040678510116754156, 'Afternoon'),
 (0.036250016443095855, 'Morning'),
 (0.029938234863497164, 'All Tasks Completed'),
 (0.015997957362294292, 'Ejected')]

In [72]:
dtc_predict_crewmate2_df = pd.DataFrame({
    "Prediction_c2_dtc" : dtc_crewmate2.predict(data_crewmate2_test),
    "Actual_c2_dtc" : target_crewmate2_test
})
dtc_predict_crewmate2_df["Equals_c2_dtc"] = dtc_predict_crewmate2_df["Prediction_c2_dtc"].eq(dtc_predict_crewmate2_df["Actual_c2_dtc"])
dtc_predict_crewmate2_df

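| | Prediction_c2_dtc | Actual_c2_dtc | Equals_c2_dtc |
|---|---|---|---|
| 0 | 1 | 1 | True |
| 1 | 1 | 1 | True |
| 2 | 0 | 0 | True |
| 3 | 1 | 0 | False |
| 4 | 1 | 0 | False |
| ... | ... | ... | ... |
| 234 | 1 | 0 | False |
| 235 | 0 | 0 | True |
| 236 | 1 | 1 | True |
| 237 | 0 | 1 | False |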


In [73]:
dtc_predict_crewmate2_df["Equals_c2_dtc"].value_counts()

True     138
False    101
Name: Equals_c2_dtc, dtype: int64

### Random Forest Classification - Model #2
We perform a random forest classification in order to assess the validity of our features.

In [74]:
rfc_crewmate2 = RandomForestClassifier()
rfc_crewmate2 = rfc_crewmate2.fit(data_crewmate2_train, target_crewmate2_train)
rfc_crewmate2.score(data_crewmate2_test, target_crewmate2_test)

0.6276150627615062

In [75]:
sorted(zip(rfc_crewmate2.feature_importances_, features_crewmate2), reverse=True)

[(0.6407048344322205, 'Game Length'),
 (0.1430348589732038, 'Task Completed'),
 (0.06549073543690553, 'Sabotages Fixed'),
 (0.05035684521982558, 'Murdered'),
 (0.025549034086443927, 'Evening'),
 (0.02292309722094385, 'Afternoon'),
 (0.018143558544316696, 'Ejected'),
 (0.01786035799797627, 'All Tasks Completed'),
 (0.01593667808816386, 'Morning')]

In [76]:
rfc_predict_crewmate2_df = pd.DataFrame({
    "Prediction_c2_rfc" : rfc_crewmate2.predict(data_crewmate2_test),
    "Actual_c2_rfc" : target_crewmate2_test
})
rfc_predict_crewmate2_df["Equals_c2_rfc"] = rfc_predict_crewmate2_df["Prediction_c2_rfc"].eq(rfc_predict_crewmate2_df["Actual_c2_rfc"])
rfc_predict_crewmate2_df

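| | Prediction_c2_rfc | Actual_c2_rfc | Equals_c2_rfc |
|---|---|---|---|
| 0 | 1 | 1 | True |
| 1 | 1 | 1 | True |
| 2 | 0 | 0 | True |
| 3 | 1 | 0 | False |
| 4 | 1 | 0 | False |
| ... | ... | ... | ... |
| 234 | 0 | 0 | True |
| 235 | 0 | 0 | True |
| 236 | 1 | 1 | True |
| 237 | 1 | 1 | True |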


In [77]:
rfc_predict_crewmate2_df["Equals_c2_rfc"].value_counts()

True     150
False     89
Name: Equals_c2_rfc, dtype: int64

### Define Model #3
It appears that keeping **Morning** led to a comparably accurate model, but we arrive at two options that we will test separately.  
We'll first keep **Ejected** and drop **All Tasks Completed**. In the next model, we will do the opposite.
- **Target**: Outcome
- **Data**: Game Length, Task Completed, Sabotages Fixed, Murdered, Ejected, Morning, Afternoon, Evening

In [78]:
target_crewmate3_train = crewmate_train_df["Outcome"]
data_crewmate3_train = crewmate_train_df[["Game Length", "Task Completed", "Sabotages Fixed", "Murdered", "Ejected", "Morning", "Afternoon", "Evening"]]

target_crewmate3_test = crewmate_test_df["Outcome"]
data_crewmate3_test = crewmate_test_df[["Game Length", "Task Completed", "Sabotages Fixed", "Murdered", "Ejected", "Morning", "Afternoon", "Evening"]]

features_crewmate3 = data_crewmate3_train.columns

### Decision Tree Classification - Model #3
We perform a decision tree classification in order to assess the validity of our features.

In [79]:
dtc_crewmate3 = tree.DecisionTreeClassifier()
dtc_crewmate3 = dtc_crewmate3.fit(data_crewmate3_train, target_crewmate3_train)
dtc_crewmate3.score(data_crewmate3_test, target_crewmate3_test)

0.602510460251046

In [80]:
sorted(zip(dtc_crewmate3.feature_importances_, features_crewmate3), reverse=True)

[(0.6145579376866471, 'Game Length'),
 (0.1263992215986235, 'Task Completed'),
 (0.08876481658981691, 'Sabotages Fixed'),
 (0.043594194688478506, 'Murdered'),
 (0.04187269704732493, 'Evening'),
 (0.0392475682264252, 'Afternoon'),
 (0.029565606800389624, 'Morning'),
 (0.015997957362294295, 'Ejected')]
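
The sorted list of tuples can also be held as a labeled pandas Series, which is easier to slice and compare between models. The sketch below uses synthetic data from `make_classification` and placeholder column names (`f0` to `f3`), not the crewmate features. It also checks that the impurity-based importances of a fitted tree add up to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the column names are placeholders.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Label each importance with its column name and sort descending.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
print("sum:", importances.sum())
```

Because the values are shares of a total, a low importance such as the one reported for **Ejected** means the tree rarely gained much from splitting on that column, not that the column was excluded.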

In [81]:
dtc_predict_crewmate3_df = pd.DataFrame({
    "Prediction_c3_dtc" : dtc_crewmate3.predict(data_crewmate3_test),
    "Actual_c3_dtc" : target_crewmate3_test
})
dtc_predict_crewmate3_df["Equals_c3_dtc"] = dtc_predict_crewmate3_df["Prediction_c3_dtc"].eq(dtc_predict_crewmate3_df["Actual_c3_dtc"])
dtc_predict_crewmate3_df

Unnamed: 0,Prediction_c3_dtc,Actual_c3_dtc,Equals_c3_dtc
0,1,1,True
1,1,1,True
2,1,0,False
3,1,0,False
4,1,0,False
...,...,...,...
234,1,0,False
235,0,0,True
236,1,1,True
237,1,1,True


In [82]:
dtc_predict_crewmate3_df["Equals_c3_dtc"].value_counts()

True     144
False     95
Name: Equals_c3_dtc, dtype: int64
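
These counts are the same information as the `score` value above: accuracy is the share of rows where the prediction equals the actual outcome. As a quick check, the sketch below rebuilds a boolean column with the printed counts (144 `True`, 95 `False`) rather than using the real DataFrame.

```python
import pandas as pd

# Rebuild a boolean column with the counts printed above: 144 True, 95 False.
equals = pd.Series([True] * 144 + [False] * 95)

correct = int(equals.sum())   # True counts as 1, False as 0
total = len(equals)
accuracy_from_counts = correct / total

# The mean of a boolean Series is the fraction of True values.
accuracy_from_mean = equals.mean()

print(accuracy_from_counts, accuracy_from_mean)
```

Both expressions give 144 / 239, which matches the decision tree score of 0.602510460251046 reported for this model.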

### Random Forest Classification - Model #3
We perform a random forest classification in order to assess the validity of our features.

In [83]:
rfc_crewmate3 = RandomForestClassifier()
rfc_crewmate3 = rfc_crewmate3.fit(data_crewmate3_train, target_crewmate3_train)
rfc_crewmate3.score(data_crewmate3_test, target_crewmate3_test)

0.5815899581589958
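
Because `RandomForestClassifier()` is created without a `random_state`, this score can change when the cell is re-run. One option, shown here as an assumption rather than as the notebook's method, is to pass the existing `global_seed` to the classifier. The data below is synthetic and `fit_and_score` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

global_seed = 61518  # same value the notebook defines earlier

# Synthetic stand-in data, not the crewmate DataFrame.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def fit_and_score(seed):
    """Fit a forest with a fixed seed and return its test accuracy."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    return model.fit(X_train, y_train).score(X_test, y_test)

first = fit_and_score(global_seed)
second = fit_and_score(global_seed)
print(first == second)  # same seed and data, so the two forests match
```

With a fixed seed, differences between models are easier to attribute to the feature set rather than to run-to-run noise.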

In [84]:
sorted(zip(rfc_crewmate3.feature_importances_, features_crewmate3), reverse=True)

[(0.672948813957116, 'Game Length'),
 (0.14915040933519644, 'Task Completed'),
 (0.0636539418922138, 'Sabotages Fixed'),
 (0.054068020768294846, 'Murdered'),
 (0.01689915884424858, 'Ejected'),
 (0.0165840804127642, 'Evening'),
 (0.015645376546108824, 'Afternoon'),
 (0.0110501982440573, 'Morning')]

In [85]:
rfc_predict_crewmate3_df = pd.DataFrame({
    "Prediction_c3_rfc" : rfc_crewmate3.predict(data_crewmate3_test),
    "Actual_c3_rfc" : target_crewmate3_test
})
rfc_predict_crewmate3_df["Equals_c3_rfc"] = rfc_predict_crewmate3_df["Prediction_c3_rfc"].eq(rfc_predict_crewmate3_df["Actual_c3_rfc"])
rfc_predict_crewmate3_df

Unnamed: 0,Prediction_c3_rfc,Actual_c3_rfc,Equals_c3_rfc
0,1,1,True
1,1,1,True
2,1,0,False
3,1,0,False
4,1,0,False
...,...,...,...
234,1,0,False
235,0,0,True
236,1,1,True
237,0,1,False


In [86]:
rfc_predict_crewmate3_df["Equals_c3_rfc"].value_counts()

True     139
False    100
Name: Equals_c3_rfc, dtype: int64

### Define Model #4
Now we'll do the inverse of **Model 3**, keeping **All Tasks Completed** and dropping **Ejected**, as **Model 3** proved to be less accurate than **Model 2**.
- **Target**: Outcome
- **Data**: Game Length, Task Completed, All Tasks Completed, Sabotages Fixed, Murdered, Morning, Afternoon, Evening

In [87]:
target_crewmate4_train = crewmate_train_df["Outcome"]
data_crewmate4_train = crewmate_train_df[["Game Length", "Task Completed", "All Tasks Completed", "Sabotages Fixed", "Murdered", "Morning", "Afternoon", "Evening"]]

target_crewmate4_test = crewmate_test_df["Outcome"]
data_crewmate4_test = crewmate_test_df[["Game Length", "Task Completed", "All Tasks Completed", "Sabotages Fixed", "Murdered", "Morning", "Afternoon", "Evening"]]

features_crewmate4 = data_crewmate4_train.columns

### Decision Tree Classification - Model #4
We perform a decision tree classification in order to assess the validity of our features.

In [88]:
dtc_crewmate4 = tree.DecisionTreeClassifier()
dtc_crewmate4 = dtc_crewmate4.fit(data_crewmate4_train, target_crewmate4_train)
dtc_crewmate4.score(data_crewmate4_test, target_crewmate4_test)

0.5355648535564853

In [89]:
sorted(zip(dtc_crewmate4.feature_importances_, features_crewmate4), reverse=True)

[(0.6014750378444914, 'Game Length'),
 (0.12360354458074953, 'Task Completed'),
 (0.08100071111643435, 'Sabotages Fixed'),
 (0.044880414545508805, 'Evening'),
 (0.04359419468847852, 'Murdered'),
 (0.04172718971304592, 'Afternoon'),
 (0.03611613032044552, 'Morning'),
 (0.02760277719084596, 'All Tasks Completed')]

In [90]:
dtc_predict_crewmate4_df = pd.DataFrame({
    "Prediction_c4_dtc" : dtc_crewmate4.predict(data_crewmate4_test),
    "Actual_c4_dtc" : target_crewmate4_test
})
dtc_predict_crewmate4_df["Equals_c4_dtc"] = dtc_predict_crewmate4_df["Prediction_c4_dtc"].eq(dtc_predict_crewmate4_df["Actual_c4_dtc"])
dtc_predict_crewmate4_df

Unnamed: 0,Prediction_c4_dtc,Actual_c4_dtc,Equals_c4_dtc
0,0,1,False
1,1,1,True
2,0,0,True
3,1,0,False
4,1,0,False
...,...,...,...
234,1,0,False
235,1,0,False
236,1,1,True
237,0,1,False


In [91]:
dtc_predict_crewmate4_df["Equals_c4_dtc"].value_counts()

True     128
False    111
Name: Equals_c4_dtc, dtype: int64

### Random Forest Classification - Model #4
We perform a random forest classification in order to assess the validity of our features.

In [92]:
rfc_crewmate4 = RandomForestClassifier()
rfc_crewmate4 = rfc_crewmate4.fit(data_crewmate4_train, target_crewmate4_train)
rfc_crewmate4.score(data_crewmate4_test, target_crewmate4_test)

0.5941422594142259

In [93]:
sorted(zip(rfc_crewmate4.feature_importances_, features_crewmate4), reverse=True)

[(0.6722059253253261, 'Game Length'),
 (0.1411638726266161, 'Task Completed'),
 (0.06751888008754144, 'Sabotages Fixed'),
 (0.053912778074309976, 'Murdered'),
 (0.01857189786410841, 'Evening'),
 (0.017718216440317763, 'Afternoon'),
 (0.015514575905034614, 'All Tasks Completed'),
 (0.013393853676745475, 'Morning')]

In [94]:
rfc_predict_crewmate4_df = pd.DataFrame({
    "Prediction_c4_rfc" : rfc_crewmate4.predict(data_crewmate4_test),
    "Actual_c4_rfc" : target_crewmate4_test
})
rfc_predict_crewmate4_df["Equals_c4_rfc"] = rfc_predict_crewmate4_df["Prediction_c4_rfc"].eq(rfc_predict_crewmate4_df["Actual_c4_rfc"])
rfc_predict_crewmate4_df

Unnamed: 0,Prediction_c4_rfc,Actual_c4_rfc,Equals_c4_rfc
0,0,1,False
1,1,1,True
2,0,0,True
3,1,0,False
4,1,0,False
...,...,...,...
234,0,0,True
235,1,0,False
236,1,1,True
237,0,1,False


In [95]:
rfc_predict_crewmate4_df["Equals_c4_rfc"].value_counts()

True     142
False     97
Name: Equals_c4_rfc, dtype: int64

### Define Model #5
It appears that **Model 3** is more accurate than **Model 4**, so we will keep the **Ejected** feature.  
To attempt to identify a more salient direction, we can try building off of **Model 1** again. One feature with consistently low importance is **All Tasks Completed**, so we'll try removing just that.
- **Target**: Outcome
- **Data**: Game Length, Task Completed, Sabotages Fixed, Murdered, Ejected, Night, Morning, Afternoon, Evening

In [96]:
target_crewmate5_train = crewmate_train_df["Outcome"]
data_crewmate5_train = crewmate_train_df[["Game Length", "Task Completed", "Sabotages Fixed", "Murdered", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

target_crewmate5_test = crewmate_test_df["Outcome"]
data_crewmate5_test = crewmate_test_df[["Game Length", "Task Completed", "Sabotages Fixed", "Murdered", "Ejected", "Night", "Morning", "Afternoon", "Evening"]]

features_crewmate5 = data_crewmate5_train.columns 

### Decision Tree Classification - Model #5
We perform a decision tree classification in order to assess the validity of our features.

In [97]:
dtc_crewmate5 = tree.DecisionTreeClassifier()
dtc_crewmate5 = dtc_crewmate5.fit(data_crewmate5_train, target_crewmate5_train)
dtc_crewmate5.score(data_crewmate5_test, target_crewmate5_test)

0.5690376569037657

In [98]:
sorted(zip(dtc_crewmate5.feature_importances_, features_crewmate5), reverse=True)

[(0.6016295985109898, 'Game Length'),
 (0.12810317543570562, 'Task Completed'),
 (0.08898739251059316, 'Sabotages Fixed'),
 (0.043594194688478506, 'Murdered'),
 (0.03452870747154555, 'Afternoon'),
 (0.03394267217240842, 'Evening'),
 (0.03131617465937671, 'Morning'),
 (0.021900127188607888, 'Night'),
 (0.015997957362294295, 'Ejected')]

In [99]:
dtc_predict_crewmate5_df = pd.DataFrame({
    "Prediction_c5_dtc" : dtc_crewmate5.predict(data_crewmate5_test),
    "Actual_c5_dtc" : target_crewmate5_test
})
dtc_predict_crewmate5_df["Equals_c5_dtc"] = dtc_predict_crewmate5_df["Prediction_c5_dtc"].eq(dtc_predict_crewmate5_df["Actual_c5_dtc"])
dtc_predict_crewmate5_df

Unnamed: 0,Prediction_c5_dtc,Actual_c5_dtc,Equals_c5_dtc
0,1,1,True
1,1,1,True
2,1,0,False
3,1,0,False
4,1,0,False
...,...,...,...
234,1,0,False
235,0,0,True
236,1,1,True
237,0,1,False


In [100]:
dtc_predict_crewmate5_df["Equals_c5_dtc"].value_counts()

True     136
False    103
Name: Equals_c5_dtc, dtype: int64

### Random Forest Classification - Model #5
We perform a random forest classification in order to assess the validity of our features.

In [101]:
rfc_crewmate5 = RandomForestClassifier(n_estimators = 50, criterion = 'entropy')
rfc_crewmate5 = rfc_crewmate5.fit(data_crewmate5_train, target_crewmate5_train)
rfc_crewmate5.score(data_crewmate5_test, target_crewmate5_test)

0.6150627615062761
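
This forest switches `criterion` from the default to `'entropy'`. As a rough illustration of what the two criteria measure, the sketch below computes Gini impurity and Shannon entropy for a binary node. The functions `gini` and `entropy` are hypothetical helpers written for this example and are not scikit-learn calls.

```python
import math

def gini(p):
    """Gini impurity of a binary node where a fraction p belongs to class 1."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy(p):
    """Shannon entropy (in bits) of the same node; empty classes contribute 0."""
    return sum(q * math.log2(1.0 / q) for q in (p, 1.0 - p) if q > 0)

# Compare an evenly mixed node, a mostly pure node, and a pure node.
for p in (0.5, 0.9, 1.0):
    print(p, round(gini(p), 4), round(entropy(p), 4))
```

Both measures are largest for an evenly mixed node and fall to zero for a pure one, so they often rank candidate splits similarly. That may be part of why the choice of criterion moves the scores here only slightly.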

In [102]:
sorted(zip(rfc_crewmate5.feature_importances_, features_crewmate5), reverse=True)

[(0.6773354238044523, 'Game Length'),
 (0.1519351318683869, 'Task Completed'),
 (0.06719794266501562, 'Sabotages Fixed'),
 (0.03916753564030174, 'Murdered'),
 (0.014589463095238437, 'Ejected'),
 (0.014348632342050275, 'Evening'),
 (0.013659713317453429, 'Afternoon'),
 (0.011476573046252393, 'Night'),
 (0.010289584220848894, 'Morning')]

In [103]:
rfc_predict_crewmate5_df = pd.DataFrame({
    "Game Length" : data_crewmate_test_df["Game Length"],
    "Task Completed" : data_crewmate_test_df["Task Completed"],
    "Sabotages Fixed" : data_crewmate_test_df["Sabotages Fixed"],
    "Murdered" : data_crewmate_test_df["Murdered"],
    "Ejected" : data_crewmate_test_df["Ejected"],
    "Evening" : data_crewmate_test_df["Evening"],
    "Afternoon" : data_crewmate_test_df["Afternoon"],
    "Night" : data_crewmate_test_df["Night"],
    "Morning" : data_crewmate_test_df["Morning"],
    "Outcome" : target_crewmate5_test,
    "Prediction" : rfc_crewmate5.predict(data_crewmate5_test)
})
rfc_predict_crewmate5_df["Correct Prediction"] = rfc_predict_crewmate5_df["Outcome"].eq(rfc_predict_crewmate5_df["Prediction"])
rfc_predict_crewmate5_df

Unnamed: 0,Game Length,Task Completed,Sabotages Fixed,Murdered,Ejected,Evening,Afternoon,Night,Morning,Outcome,Prediction,Correct Prediction
0,310,3,0.0,0,0,1,0,0,0,1,1,True
1,491,2,0.0,1,0,0,1,0,0,1,1,True
2,664,7,0.0,1,0,0,1,0,0,0,1,False
3,826,7,1.0,1,0,0,1,0,0,0,1,False
4,596,7,0.0,1,0,0,1,0,0,0,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
234,482,4,1.0,1,0,0,0,0,1,0,1,False
235,666,5,2.0,0,1,0,0,0,1,0,0,True
236,251,3,2.0,0,0,0,0,0,1,1,1,True
237,633,5,1.0,1,0,0,0,0,1,1,0,False


In [104]:
rfc_predict_crewmate5_df["Correct Prediction"].value_counts()

True     147
False     92
Name: Correct Prediction, dtype: int64

### Model Conclusions
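Although we attempted to fine-tune and improve our model, it appears that our best model was obtained somewhat randomly.
Our best model is **Model 5**, which includes the Game Length, Task Completed, Sabotages Fixed, Murdered, Ejected, Night, Morning, Afternoon, and Evening features. This model offers a predictive accuracy of 0.5481171548117155 for DTC and 0.6317991631799164 for RFC. The cells above show 0.5690376569037657 and 0.6150627615062761 instead, most likely because the classifiers are fit without a fixed `random_state`, so scores shift when cells are re-run.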
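
To compare Models 3 to 5 side by side, the accuracies can be recomputed from the `value_counts()` outputs printed in the cells above. The sketch below uses only those printed counts. The dictionary layout is an illustrative choice and not part of the notebook.

```python
# (correct, incorrect) counts copied from the value_counts() outputs above.
counts = {
    ("Model 3", "DTC"): (144, 95),
    ("Model 3", "RFC"): (139, 100),
    ("Model 4", "DTC"): (128, 111),
    ("Model 4", "RFC"): (142, 97),
    ("Model 5", "DTC"): (136, 103),
    ("Model 5", "RFC"): (147, 92),
}

# Accuracy is correct predictions divided by all predictions.
accuracy = {key: right / (right + wrong)
            for key, (right, wrong) in counts.items()}

# Print the models from most to least accurate.
for (model, kind), value in sorted(accuracy.items(),
                                   key=lambda item: item[1], reverse=True):
    print(f"{model} {kind}: {value:.4f}")

best = max(accuracy, key=accuracy.get)
print("best:", best)
```

On these recorded counts, the random forest for Model 5 ranks highest of the three, though the gap to the others is only a handful of games out of 239.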

# Model Refinement

## Imposter Refinement

In [105]:
param_grid_i = {
    'n_estimators' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150],
    'criterion' : ["gini", "entropy"]
}
grid_i = GridSearchCV(rfc_imposter4, param_grid_i, verbose = 3)

In [106]:
grid_i.fit(data_imposter4_train, target_imposter4_train) 

Fitting 5 folds for each of 30 candidates, totalling 150 fits
[CV] criterion=gini, n_estimators=10 .................................
[CV] ..... criterion=gini, n_estimators=10, score=0.750, total=   0.0s
[CV] criterion=gini, n_estimators=10 .................................
[CV] ..... criterion=gini, n_estimators=10, score=0.709, total=   0.0s
[CV] criterion=gini, n_estimators=10 .................................
[CV] ..... criterion=gini, n_estimators=10, score=0.671, total=   0.0s
[CV] criterion=gini, n_estimators=10 .................................
[CV] ..... criterion=gini, n_estimators=10, score=0.684, total=   0.0s
[CV] criterion=gini, n_estimators=10 .................................
[CV] ..... criterion=gini, n_estimators=10, score=0.557, total=   0.0s
[CV] criterion=gini, n_estimators=20 .................................
[CV] ..... criterion=gini, n_estimators=20, score=0.725, total=   0.0s
[CV] criterion=gini, n_estimators=20 .................................
[CV] ..... crit

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s


[CV] ..... criterion=gini, n_estimators=20, score=0.671, total=   0.0s
[CV] criterion=gini, n_estimators=20 .................................
[CV] ..... criterion=gini, n_estimators=20, score=0.595, total=   0.0s
[CV] criterion=gini, n_estimators=30 .................................
[CV] ..... criterion=gini, n_estimators=30, score=0.787, total=   0.0s
[CV] criterion=gini, n_estimators=30 .................................
[CV] ..... criterion=gini, n_estimators=30, score=0.734, total=   0.0s
[CV] criterion=gini, n_estimators=30 .................................
[CV] ..... criterion=gini, n_estimators=30, score=0.671, total=   0.0s
[CV] criterion=gini, n_estimators=30 .................................
[CV] ..... criterion=gini, n_estimators=30, score=0.696, total=   0.0s
[CV] criterion=gini, n_estimators=30 .................................
[CV] ..... criterion=gini, n_estimators=30, score=0.608, total=   0.0s
[CV] criterion=gini, n_estimators=40 .................................
[CV] .

[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed:   22.1s finished


GridSearchCV(estimator=RandomForestClassifier(criterion='entropy',
                                              n_estimators=20),
             param_grid={'criterion': ['gini', 'entropy'],
                         'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90,
                                          100, 110, 120, 130, 140, 150]},
             verbose=3)

In [107]:
print(grid_i.best_params_)
print(grid_i.best_score_)

{'criterion': 'gini', 'n_estimators': 130}
0.6993037974683545
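
The log line "Fitting 5 folds for each of 30 candidates, totalling 150 fits" follows directly from the grid: 15 values of `n_estimators` times 2 criteria gives 30 candidates, and each one is fit once per fold. The sketch below reproduces that count with plain Python and does not call scikit-learn.

```python
from itertools import product

param_grid = {
    "n_estimators": list(range(10, 151, 10)),  # 10, 20, ..., 150
    "criterion": ["gini", "entropy"],
}
n_folds = 5  # taken from the "Fitting 5 folds" line in the log

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

print(len(candidates))            # number of candidates
print(len(candidates) * n_folds)  # number of cross-validation fits
```

Adding a third parameter with, say, 4 values would multiply the number of fits by 4, so the grid is worth keeping small when each fit is slow.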


## Crewmate Refinement

In [108]:
param_grid_c = {
    'n_estimators' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150],
    'criterion' : ["gini", "entropy"]
}
grid_c = GridSearchCV(rfc_crewmate5, param_grid_c, verbose = 3)

In [109]:
grid_c.fit(data_crewmate5_train, target_crewmate5_train) 

Fitting 5 folds for each of 30 candidates, totalling 150 fits
[CV] criterion=gini, n_estimators=10 .................................
[CV] ..... criterion=gini, n_estimators=10, score=0.551, total=   0.0s
[CV] criterion=gini, n_estimators=10 .................................
[CV] ..... criterion=gini, n_estimators=10, score=0.574, total=   0.0s
[CV] criterion=gini, n_estimators=10 .................................
[CV] ..... criterion=gini, n_estimators=10, score=0.572, total=   0.0s
[CV] criterion=gini, n_estimators=10 .................................
[CV] ..... criterion=gini, n_estimators=10, score=0.569, total=   0.0s
[CV] criterion=gini, n_estimators=10 .................................
[CV] ..... criterion=gini, n_estimators=10, score=0.536, total=   0.0s
[CV] criterion=gini, n_estimators=20 .................................
[CV] ..... criterion=gini, n_estimators=20, score=0.567, total=   0.0s
[CV] criterion=gini, n_estimators=20 .................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s


[CV] ..... criterion=gini, n_estimators=20, score=0.557, total=   0.0s
[CV] criterion=gini, n_estimators=20 .................................
[CV] ..... criterion=gini, n_estimators=20, score=0.562, total=   0.1s
[CV] criterion=gini, n_estimators=20 .................................
[CV] ..... criterion=gini, n_estimators=20, score=0.559, total=   0.0s
[CV] criterion=gini, n_estimators=20 .................................
[CV] ..... criterion=gini, n_estimators=20, score=0.576, total=   0.0s
[CV] criterion=gini, n_estimators=30 .................................
[CV] ..... criterion=gini, n_estimators=30, score=0.570, total=   0.1s
[CV] criterion=gini, n_estimators=30 .................................
[CV] ..... criterion=gini, n_estimators=30, score=0.587, total=   0.1s
[CV] criterion=gini, n_estimators=30 .................................
[CV] ..... criterion=gini, n_estimators=30, score=0.569, total=   0.1s
[CV] criterion=gini, n_estimators=30 .................................
[CV] .

[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed:   31.2s finished


GridSearchCV(estimator=RandomForestClassifier(criterion='entropy',
                                              n_estimators=50),
             param_grid={'criterion': ['gini', 'entropy'],
                         'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90,
                                          100, 110, 120, 130, 140, 150]},
             verbose=3)

In [110]:
print(grid_c.best_params_)
print(grid_c.best_score_)

{'criterion': 'entropy', 'n_estimators': 60}
0.5729378774805867
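
Note that `best_score_` is the mean cross-validated accuracy on the training data, so it is not directly comparable to the earlier `.score(...)` values computed on the test group. One way to get a comparable number is to score `best_estimator_` on the held-out rows. The sketch below assumes synthetic data and a deliberately small grid, and it is not the notebook's crewmate data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, not the crewmate DataFrame.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A deliberately small grid so the sketch runs quickly.
param_grid = {"n_estimators": [10, 30], "criterion": ["gini", "entropy"]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid)
grid.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy on the training data.
print(grid.best_params_, grid.best_score_)

# best_estimator_ is refit on all training rows (refit=True by default),
# so it can be scored on the held-out test rows.
test_accuracy = grid.best_estimator_.score(X_test, y_test)
print(test_accuracy)
```

In the notebook, the equivalent call would presumably be `grid_c.best_estimator_.score(data_crewmate5_test, target_crewmate5_test)`, assuming those variables are still in scope.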


# Output Prediction Results

In [111]:
file_path1 = os.path.join("Output Data", "imposter_predictions.csv")
file_path2 = os.path.join("Output Data", "imposter_predictions.html")
file_path3 = os.path.join("Output Data", "crewmate_predictions.csv")
file_path4 = os.path.join("Output Data", "crewmate_predictions.html")

In [114]:
rfc_predict_imposter4_df.to_csv(file_path1)
rfc_predict_imposter4_df.to_html(file_path2)
rfc_predict_crewmate5_df.to_csv(file_path3)
rfc_predict_crewmate5_df.to_html(file_path4)
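
By default `to_csv` also writes the row index as an unnamed first column. When the file is read back, that column shows up as `Unnamed: 0` unless it is declared as the index. The sketch below shows the round trip with made-up rows and a temporary directory instead of the `Output Data` folder.

```python
import os
import tempfile
import pandas as pd

# Made-up prediction rows for illustration only.
predictions = pd.DataFrame({
    "Outcome": [1, 0, 1],
    "Prediction": [1, 1, 1],
})
predictions["Correct Prediction"] = predictions["Outcome"].eq(
    predictions["Prediction"]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "predictions.csv")
    predictions.to_csv(csv_path)  # row index is written as the first column

    naive = pd.read_csv(csv_path)                  # index becomes a data column
    restored = pd.read_csv(csv_path, index_col=0)  # index is read as the index

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is another option when the row index carries no meaning.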