# UFC Final Project Machine Learning Model

# Step 1: Data Preprocessing

This dataset contains nearly every match held under the UFC umbrella: 8,990 rows and 23 columns in total. The following preprocessing steps were taken to clean the data and prepare it for the machine learning model.

- Dropped columns not needed for the model (fighter_total, date_total, location_total)
- Dropped rows containing NaN values so the data can be processed (row count fell from 8,990 to 8,236)
- Reassigned the winner column so it is labeled 0 or 1 (1 for a win, 0 for a loss)
- Collected the names of columns with data type "object" into a list so they can be encoded for the model
- Used OneHotEncoder to encode those columns for the model

In [117]:
# Import dependencies
import pandas as pd
import statsmodels.api as sm
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Connect to the Postgres database
# (SQLAlchemy 1.4+ only accepts the "postgresql://" scheme, not "postgres://")
from sqlalchemy import create_engine
from config import db_password
db_string = f"postgresql://postgres:{db_password}@127.0.0.1:5432/UFC"
engine = create_engine(db_string)


In [118]:
# Read the SQL table into a DataFrame
UFC_df = pd.read_sql_query('select * from "final_joined_table"',con=engine)
UFC_df

Unnamed: 0,match_id,fighter_total,odds_total,ev_total,date_total,location_total,country_total,winner_total,weight_class_total,gender_total,...,total_win_by_ko_tko,total_win_by_submission,total_win_by_tko_doctor_stoppage,total_wins,total_stance,total_height,total_reach,total_weight,total_age,finish
0,1,Brandon Vera,215,215.000000,2010-03-21,"Broomfield, Colorado, USA",USA,0,Light Heavyweight,MALE,...,4,1,0,7,Orthodox,190.50,193.04,230,32,KO/TKO
1,2,Junior Dos Santos,-250,40.000000,2010-03-21,"Broomfield, Colorado, USA",USA,1,Heavyweight,MALE,...,4,0,0,4,Orthodox,193.04,195.58,238,26,KO/TKO
2,3,Cheick Kongo,-345,28.985507,2010-03-21,"Broomfield, Colorado, USA",USA,1,Heavyweight,MALE,...,4,0,1,7,Orthodox,193.04,208.28,240,34,KO/TKO
3,4,Alessio Sakara,-120,83.333333,2010-03-21,"Broomfield, Colorado, USA",USA,1,Middleweight,MALE,...,3,0,0,5,Orthodox,182.88,182.88,185,28,KO/TKO
4,5,Clay Guida,-420,23.809524,2010-03-21,"Broomfield, Colorado, USA",USA,1,Lightweight,MALE,...,1,1,0,5,Orthodox,170.18,177.80,155,28,SUB
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8985,8986,Kai Kamaka,-335,29.850746,2020-11-28,"Las Vegas, Nevada, USA",USA,0,Featherweight,MALE,...,0,0,0,1,Orthodox,170.18,175.26,145,25,
8986,8987,Anderson Dos Santos,135,135.000000,2020-11-28,"Las Vegas, Nevada, USA",USA,1,Bantamweight,MALE,...,0,0,0,0,Orthodox,165.10,177.80,135,35,
8987,8988,Rachael Ostovich,160,160.000000,2020-11-28,"Las Vegas, Nevada, USA",USA,0,Women's Flyweight,FEMALE,...,0,1,0,1,Orthodox,160.02,157.48,125,29,
8988,8989,Malcolm Gordon,250,250.000000,2020-11-28,"Las Vegas, Nevada, USA",USA,0,Flyweight,MALE,...,0,0,0,0,Switch,170.18,180.34,125,30,


In [119]:
# Determine shape (# rows and columns) of dataset
UFC_df.shape

(8990, 23)

In [120]:
# Determine the datatypes of the dataset
UFC_df.dtypes.value_counts()

int64      12
object      8
float64     3
dtype: int64

In [121]:
# Drop the non-beneficial columns "fighter_total", "date_total" and "location_total"; the model performed better with country left in
UFC_df = UFC_df.drop(["fighter_total", "date_total", "location_total"], axis=1)
UFC_df.head()

Unnamed: 0,match_id,odds_total,ev_total,country_total,winner_total,weight_class_total,gender_total,no_of_rounds_total,total_draw,total_losses,total_win_by_ko_tko,total_win_by_submission,total_win_by_tko_doctor_stoppage,total_wins,total_stance,total_height,total_reach,total_weight,total_age,finish
0,1,215,215.0,USA,0,Light Heavyweight,MALE,3,0,4,4,1,0,7,Orthodox,190.5,193.04,230,32,KO/TKO
1,2,-250,40.0,USA,1,Heavyweight,MALE,3,0,0,4,0,0,4,Orthodox,193.04,195.58,238,26,KO/TKO
2,3,-345,28.985507,USA,1,Heavyweight,MALE,3,0,4,4,0,1,7,Orthodox,193.04,208.28,240,34,KO/TKO
3,4,-120,83.333333,USA,1,Middleweight,MALE,3,0,5,3,0,0,5,Orthodox,182.88,182.88,185,28,KO/TKO
4,5,-420,23.809524,USA,1,Lightweight,MALE,3,0,5,1,1,0,5,Orthodox,170.18,177.8,155,28,SUB


In [122]:
# Count NaN values per column; they must be handled before using OneHotEncoder
UFC_df.isna().sum()

match_id                              0
odds_total                            0
ev_total                              0
country_total                         0
winner_total                          0
weight_class_total                    0
gender_total                          0
no_of_rounds_total                    0
total_draw                            0
total_losses                          0
total_win_by_ko_tko                   0
total_win_by_submission               0
total_win_by_tko_doctor_stoppage      0
total_wins                            0
total_stance                          0
total_height                          0
total_reach                           0
total_weight                          0
total_age                             0
finish                              754
dtype: int64

In [123]:
# Drop rows with NaN
UFC_df = UFC_df.dropna()
UFC_df.head()

Unnamed: 0,match_id,odds_total,ev_total,country_total,winner_total,weight_class_total,gender_total,no_of_rounds_total,total_draw,total_losses,total_win_by_ko_tko,total_win_by_submission,total_win_by_tko_doctor_stoppage,total_wins,total_stance,total_height,total_reach,total_weight,total_age,finish
0,1,215,215.0,USA,0,Light Heavyweight,MALE,3,0,4,4,1,0,7,Orthodox,190.5,193.04,230,32,KO/TKO
1,2,-250,40.0,USA,1,Heavyweight,MALE,3,0,0,4,0,0,4,Orthodox,193.04,195.58,238,26,KO/TKO
2,3,-345,28.985507,USA,1,Heavyweight,MALE,3,0,4,4,0,1,7,Orthodox,193.04,208.28,240,34,KO/TKO
3,4,-120,83.333333,USA,1,Middleweight,MALE,3,0,5,3,0,0,5,Orthodox,182.88,182.88,185,28,KO/TKO
4,5,-420,23.809524,USA,1,Lightweight,MALE,3,0,5,1,1,0,5,Orthodox,170.18,177.8,155,28,SUB


In [124]:
# Check remaining rows without NaN values
UFC_df.shape

(8236, 20)

In [125]:
# Create categorical lists in order to encode data
UFC_cat = UFC_df.dtypes[UFC_df.dtypes == "object"].index.tolist()
UFC_cat

['country_total',
 'weight_class_total',
 'gender_total',
 'total_stance',
 'finish']

In [126]:
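# Create a OneHotEncoder instance
# (scikit-learn 1.2+ uses sparse_output; older releases used sparse=False)
enc = OneHotEncoder(sparse_output=False)

# Fit and transform the OneHotEncoder using the categorical variable list
encode_df = pd.DataFrame(enc.fit_transform(UFC_df[UFC_cat]))

# Add the encoded variable names to the dataframe
# (get_feature_names was replaced by get_feature_names_out in scikit-learn 1.0+)
encode_df.columns = enc.get_feature_names_out(UFC_cat)
encode_df.head()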

Unnamed: 0,country_total_ Argentina,country_total_ Australia,country_total_ Brazil,country_total_ Canada,country_total_ Chile,country_total_ China,country_total_ Croatia,country_total_ Czech Republic,country_total_ Denmark,country_total_ Germany,...,total_stance_Southpaw,total_stance_Switch,total_stance_Switch.1,finish_DQ,finish_KO/TKO,finish_M-DEC,finish_Overturned,finish_S-DEC,finish_SUB,finish_U-DEC
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
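The encoded column names above include entries such as `country_total_ Argentina` (with a leading space) and a second `total_stance_Switch` column, which may stem from stray whitespace in the source values. A minimal sketch of stripping whitespace before encoding, using hypothetical toy values rather than the real table:

```python
import pandas as pd

# Hypothetical values mimicking the stray spaces seen in the encoded column names.
raw = pd.DataFrame({
    "country_total": [" Brazil", "USA", " Brazil", "USA "],
    "total_stance": ["Switch", "Switch ", "Orthodox", "Orthodox"],
})

# Strip surrounding whitespace from each text column before encoding.
text_cols = ["country_total", "total_stance"]
clean = raw.copy()
clean[text_cols] = clean[text_cols].apply(lambda col: col.str.strip())

# Fewer distinct categories means fewer one-hot columns.
print(raw["total_stance"].nunique(), clean["total_stance"].nunique())
```

If whitespace is indeed the cause, cleaning it first would merge the duplicate categories into one encoded column each.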


In [127]:
# Merge one-hot encoded features and drop the originals
# (the merge matches rows by index label, not by position)
UFC_df = UFC_df.merge(encode_df,left_index=True, right_index=True)
UFC_df = UFC_df.drop(columns=UFC_cat)
UFC_df.head()

Unnamed: 0,match_id,odds_total,ev_total,winner_total,no_of_rounds_total,total_draw,total_losses,total_win_by_ko_tko,total_win_by_submission,total_win_by_tko_doctor_stoppage,...,total_stance_Southpaw,total_stance_Switch,total_stance_Switch.1,finish_DQ,finish_KO/TKO,finish_M-DEC,finish_Overturned,finish_S-DEC,finish_SUB,finish_U-DEC
0,1,215,215.0,0,3,0,4,4,1,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,2,-250,40.0,1,3,0,0,4,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,3,-345,28.985507,1,3,0,4,4,0,1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,4,-120,83.333333,1,3,0,5,3,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,5,-420,23.809524,1,3,0,5,1,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
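One point worth checking in the merge above: `dropna()` leaves gaps in the index labels of `UFC_df`, while a DataFrame built from the encoder's array gets a fresh 0..n-1 index. Merging on index then pairs rows by label, which can drop rows and attach encodings to the wrong fight. The 7,679 observations reported by the regression further down, against 8,236 rows after `dropna()`, would be consistent with this. A minimal sketch on toy data (all names here are illustrative, not from the project):

```python
import pandas as pd

# Toy frame standing in for UFC_df; one row has a missing category.
df = pd.DataFrame({
    "odds": [215, -250, -345, -120, -420],
    "stance": ["Orthodox", None, "Southpaw", "Switch", "Orthodox"],
})
df = df.dropna()  # remaining index labels: 0, 2, 3, 4

dummies = pd.get_dummies(df["stance"])

# Built from a bare array: fresh index 0, 1, 2, 3 (positions, not labels).
fresh = pd.DataFrame(dummies.to_numpy(), columns=dummies.columns)
# Same values, but carrying the original index labels.
aligned = pd.DataFrame(dummies.to_numpy(), columns=dummies.columns, index=df.index)

# Merging on index keeps only labels present on both sides.
mismatched = df.merge(fresh, left_index=True, right_index=True)
matched = df.merge(aligned, left_index=True, right_index=True)

print(len(df), len(mismatched), len(matched))
print(mismatched[["stance", "Southpaw", "Switch"]])
```

In the mismatched result, the row labeled 2 has stance `Southpaw` but carries the `Switch` encoding. Passing `index=UFC_df.index` when building the encoded DataFrame, or calling `reset_index(drop=True)` after `dropna()`, are two possible ways to keep the rows aligned.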


# Step 2: Feature Engineering and Feature Selection

The next steps, feature engineering and feature selection, aim to find the subset of attributes that best explains the relationship between a fighter's characteristics (independent variables) and winning matches (target variable).

- y, the target variable, indicates whether or not the fighter won the match
- X, the independent variables, are the metrics the model uses to predict whether the fighter will win (e.g. gender, weight class, weight, reach, height, or odds)

In [128]:
# Define the features in the set (X and y variables)
y = UFC_df["winner_total"]
X = UFC_df.drop(["winner_total"], axis=1)

### Determining feature significance using linear regression:

In week 3 of the project, we wanted to improve on the 64% accuracy of our initial Random Forest model. To reduce noise in the model, we used linear regression to assess feature significance. The regression identified the following metrics as statistically significant, meaning they had a p-value under 0.05 (5%). In the next step, we drop the columns that were not statistically significant and rerun the Random Forest model to see whether accuracy improves.

- odds_total
- ev_total
- total_height
- total_reach
- total_age
- country_total_ China
- country_total_ Ireland
- country_total_ Uruguay
- country_total_United Arab Emirates
- weight_class_total_Bantamweight
- weight_class_total_Heavyweight
- weight_class_total_Welterweight
- gender_total_FEMALE
- gender_total_MALE
- finish_KO/TKO
- finish_M-DEC
- finish_S-DEC
- finish_SUB
- finish_U-DEC

In [129]:
# Drop match ID column
X = X.drop(['match_id'], axis = 1)

In [130]:
# Add constant for linear regression model
X = sm.add_constant(X, prepend=False)
# Storing model 
lr_model = sm.OLS(y, X)
# Fitting model
lr_results = lr_model.fit()
# Results
print(lr_results.summary())

                            OLS Regression Results                            
Dep. Variable:           winner_total   R-squared:                       0.150
Model:                            OLS   Adj. R-squared:                  0.143
Method:                 Least Squares   F-statistic:                     21.67
Date:                Tue, 12 Jan 2021   Prob (F-statistic):          5.16e-219
Time:                        21:07:11   Log-Likelihood:                -4949.0
No. Observations:                7679   AIC:                         1.002e+04
Df Residuals:                    7616   BIC:                         1.046e+04
Df Model:                          62                                         
Covariance Type:            nonrobust                                         
                                               coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------

In [131]:
# Add the statistically significant columns to a list to use for Random Forest model
features_to_keep = []
for key, value in dict(lr_results.pvalues).items():
    if value < 0.05 and key != 'const':
        print(key)
        features_to_keep.append(key)

odds_total
ev_total
total_height
total_reach
total_age
country_total_ China
country_total_ Ireland
country_total_ Uruguay
country_total_United Arab Emirates
weight_class_total_Bantamweight
weight_class_total_Heavyweight
weight_class_total_Welterweight
gender_total_FEMALE
gender_total_MALE
finish_KO/TKO
finish_M-DEC
finish_S-DEC
finish_SUB
finish_U-DEC


### Testing and training the dataset: 

The next steps split the dataset into training and testing sets.

- First, the encoded and preprocessed data is split into X and y training and testing sets and run through the StandardScaler
- The scaled training data is then used to fit the model, and the scaled testing data is used to evaluate it

In [132]:
# Define the features in the set (X and y variables)
y = UFC_df["winner_total"]
X = UFC_df.drop(["winner_total", "match_id"], axis=1)
X = X[features_to_keep]

In [148]:
# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

In [149]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

The model is then fit on the training data and evaluated on the testing data. We chose the Random Forest classification model. Random Forest is our preferred modeling tool because it:

- Runs efficiently on large data sets
- Works against overfitting
- Can be used to rank input variables
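To illustrate the last point, a fitted `RandomForestClassifier` exposes a `feature_importances_` attribute that can be sorted to rank inputs. The sketch below uses synthetic data with hypothetical column names (`informative`, `noise_a`, `noise_b`), not the UFC features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: one column drives the label, two are pure noise.
rng = np.random.default_rng(78)
n = 500
X_demo = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y_demo = (X_demo["informative"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=78)
forest.fit(X_demo, y_demo)

# Importances sum to 1; sorting them gives a ranking of the inputs.
ranking = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(ranking)
```

The same pattern could be applied to `rf_model` and the columns of `X` once the model has been fit, as an alternative view to the regression p-values.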

In [150]:
# Random Forest Classification model - first model to test 
rf_model = RandomForestClassifier(n_estimators=500, random_state=78) 

In [151]:
# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

In [152]:
# Making predictions using the testing data.
predictions = rf_model.predict(X_test_scaled)
predictions

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

# Step 3: Machine Learning Model Results

In [153]:
# Calculating the confusion matrix to determine how well the model works
cm = confusion_matrix(y_test, predictions)

# Create a DataFrame from the confusion matrix.
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"])

cm_df

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,591,335
Actual 1,366,628


In [154]:
# Calculating the accuracy score
acc_score = accuracy_score(y_test, predictions) 
acc_score

0.6348958333333333

In [155]:
# Displaying results

print("Confusion Matrix")
display(cm_df)
print(f"Accuracy Score : {acc_score}")
print("Classification Report")
print(classification_report(y_test, predictions))

Confusion Matrix


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,591,335
Actual 1,366,628


Accuracy Score : 0.6348958333333333
Classification Report
              precision    recall  f1-score   support

           0       0.62      0.64      0.63       926
           1       0.65      0.63      0.64       994

    accuracy                           0.63      1920
   macro avg       0.63      0.64      0.63      1920
weighted avg       0.64      0.63      0.64      1920
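As a cross-check, the accuracy, precision and recall in the report can be recomputed by hand from the four confusion-matrix counts. The sketch below also computes a majority-class baseline (always predicting the more common class), which is an added point of comparison and not part of the original analysis:

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 591, 335   # actual 0: predicted 0, predicted 1
fn, tp = 366, 628   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tn + tp) / total        # share of all predictions that were correct
precision_1 = tp / (tp + fp)        # of predicted wins, how many were wins
recall_1 = tp / (tp + fn)           # of actual wins, how many were found
precision_0 = tn / (tn + fn)        # of predicted losses, how many were losses
recall_0 = tn / (tn + fp)           # of actual losses, how many were found

# Baseline: always predict the more common class in the test set.
majority_baseline = max(tn + fp, fn + tp) / total

print(f"accuracy          {accuracy:.4f}")
print(f"precision (0, 1)  {precision_0:.2f}, {precision_1:.2f}")
print(f"recall    (0, 1)  {recall_0:.2f}, {recall_1:.2f}")
print(f"majority baseline {majority_baseline:.4f}")
```

The recomputed values match the report (accuracy of about 0.635), and the baseline of about 0.518 gives a reference point for judging how much the model adds.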



As the accuracy score shows, the model did not improve when we eliminated the statistically insignificant columns: accuracy stayed at roughly 63%. This suggests that there was not enough variation in our dataset to determine which UFC fighter would win a match.