# Random Forest Classification

The Males_players.csv dataset is used to train a Random Forest model. The dataset is first preprocessed by encoding the categorical label column as numerical values. Next, the model is trained on a training set consisting of roughly 2/3 of the dataset and evaluated on the remaining 1/3, the test set. Lastly, a classification report and a confusion matrix are generated to assess the accuracy and quality of the trained model.

In [22]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.mllib.evaluation import MulticlassMetrics
from sklearn.metrics import classification_report
import pandas as pd
from data_processing import sampled_data  # project-local helper module; only sampled_data() is used in this notebook

# Categorical Encoding

Before the random forest model is fitted and evaluated, the label_position column is encoded with StringIndexer, which maps each distinct text label to a numeric index while preserving its categorical meaning.

The tables below show one sample row per class after indexing. In this run, the label positions <b>"Forward"</b>, <b>"Midfielder"</b> and <b>"Defender"</b> are assigned the indices 0, 1 and 2, respectively. By default, StringIndexer orders labels by descending frequency, so the mapping depends on the class counts in the sampled data and can differ between runs.

In [21]:
df = sampled_data()
indexer = StringIndexer(inputCol="label_position", outputCol="label_position_index")
df = indexer.fit(df).transform(df)
# Collect the distinct class labels in alphabetical order
unique_values = [row['label_position'] for row in df.select('label_position').distinct().orderBy('label_position').collect()]
# Display one example row per class, including its assigned index
for value in unique_values:
    sample = df[df['label_position'] == value].head(1)
    sample = pd.DataFrame(sample, columns=sample[0].__fields__)
    display(sample)

Unnamed: 0,pace,shooting,passing,dribbling,defending,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,...,power_stamina,power_long_shots,mentality_interceptions,mentality_positioning,mentality_penalties,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,label_position,label_position_index
0,61,51,58,61,82,49,43,76,69,39,...,69,66,77,49,34,83,85,81,Defender,2.0


Unnamed: 0,pace,shooting,passing,dribbling,defending,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,...,power_stamina,power_long_shots,mentality_interceptions,mentality_positioning,mentality_penalties,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,label_position,label_position_index
0,75,62,52,62,38,49,63,62,59,57,...,68,59,30,59,64,44,31,28,Forward,0.0


Unnamed: 0,pace,shooting,passing,dribbling,defending,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,...,power_stamina,power_long_shots,mentality_interceptions,mentality_positioning,mentality_penalties,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,label_position,label_position_index
0,58,50,59,59,63,52,44,59,64,41,...,78,53,65,51,48,61,65,64,Midfielder,1.0
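
The frequency-based ordering described above can be illustrated without Spark. The following is a minimal sketch, not StringIndexer's implementation: the helper `frequency_desc_index` and the toy labels are hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically; this is an assumption of the sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up labels for illustration, not taken from the dataset
toy_labels = ["Forward", "Forward", "Forward", "Midfielder", "Midfielder", "Defender"]
mapping = frequency_desc_index(toy_labels)
encoded = [mapping[label] for label in toy_labels]
print(mapping)
print(encoded)
```

Because the index depends on class counts, it is safer to read the mapping from the fitted indexer than to assume a fixed order when interpreting later outputs.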


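To prepare the input for the random forest model, every column except <b>"label_position"</b> and <b>"label_position_index"</b> is collected into a list called <b>"list_of_features"</b>. VectorAssembler then combines these columns into a single vector column, <b>"indexed_features"</b>. The two label columns remain in the DataFrame; they are only excluded from the feature vector.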

In [6]:
list_of_features = df.drop("label_position").drop("label_position_index").columns
assembler = VectorAssembler(inputCols=list_of_features, outputCol="indexed_features")
df = assembler.transform(df)
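
Conceptually, the assembly step turns each row's feature columns into one numeric vector. The sketch below mimics that idea in plain Python for a single row; the helper `assemble` and the three-column toy row are hypothetical and only illustrate the shape of the result, not how VectorAssembler is implemented.

```python
def assemble(row, feature_names):
    """Collect the named feature values of one row into a list of floats."""
    return [float(row[name]) for name in feature_names]


# Hypothetical row with only three feature columns, for illustration
toy_row = {
    "pace": 61,
    "shooting": 51,
    "passing": 58,
    "label_position": "Defender",
    "label_position_index": 2.0,
}

# Exclude the label columns so the target does not leak into the features
excluded = {"label_position", "label_position_index"}
feature_names = [name for name in toy_row if name not in excluded]
features = assemble(toy_row, feature_names)
print(feature_names)
print(features)
```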

# Training Random Forest Model
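To train the Random Forest model, the dataset is randomly split into a training set (about 2/3 of the rows) and a test set (the remaining 1/3). A fixed seed makes the split reproducible.

The first table below displays ten rows of the training set, and the second displays ten rows of the test set. Both contain all features of the dataset, the target variable <b>"label_position_index"</b>, and the assembled <b>"indexed_features"</b> column.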

In [7]:
trainingData, testData = df.randomSplit([0.67, 0.33], seed=24)
rf = RandomForestClassifier(labelCol="label_position_index", featuresCol="indexed_features", numTrees=10, maxDepth=5)
sample = trainingData.take(10)
sample = pd.DataFrame(sample, columns=sample[0].__fields__)
display(sample)

Unnamed: 0,pace,shooting,passing,dribbling,defending,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,...,power_long_shots,mentality_interceptions,mentality_positioning,mentality_penalties,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,label_position,label_position_index,indexed_features
0,25,22,30,23,51,24,23,56,38,21,...,18,47,20,31,51,51,53,Defender,0.0,"[25.0, 22.0, 30.0, 23.0, 51.0, 24.0, 23.0, 56...."
1,31,23,45,34,62,27,20,64,62,20,...,19,65,22,34,61,61,59,Defender,0.0,"[31.0, 23.0, 45.0, 34.0, 62.0, 27.0, 20.0, 64...."
2,31,40,51,48,73,39,39,70,61,43,...,26,72,39,48,76,72,68,Defender,0.0,"[31.0, 40.0, 51.0, 48.0, 73.0, 39.0, 39.0, 70...."
3,31,42,50,45,68,27,38,78,63,36,...,38,72,33,34,68,65,62,Defender,0.0,"[31.0, 42.0, 50.0, 45.0, 68.0, 27.0, 38.0, 78...."
4,32,57,51,55,76,28,56,79,60,47,...,62,74,45,53,77,77,74,Defender,0.0,"[32.0, 57.0, 51.0, 55.0, 76.0, 28.0, 56.0, 79...."
5,33,27,33,29,61,19,22,59,47,29,...,19,59,26,25,60,63,61,Defender,0.0,"[33.0, 27.0, 33.0, 29.0, 61.0, 19.0, 22.0, 59...."
6,33,34,49,34,66,47,30,70,52,31,...,21,64,22,35,65,66,65,Defender,0.0,"[33.0, 34.0, 49.0, 34.0, 66.0, 47.0, 30.0, 70...."
7,33,44,51,48,60,40,42,59,60,45,...,40,60,32,30,61,61,58,Defender,0.0,"[33.0, 44.0, 51.0, 48.0, 60.0, 40.0, 42.0, 59...."
8,34,43,43,35,68,36,24,76,48,24,...,60,65,34,39,70,67,62,Defender,0.0,"[34.0, 43.0, 43.0, 35.0, 68.0, 36.0, 24.0, 76...."
9,34,57,67,63,66,70,45,67,70,50,...,69,67,60,62,65,66,67,Defender,0.0,"[34.0, 57.0, 67.0, 63.0, 66.0, 70.0, 45.0, 67...."


In [8]:
sample = testData.take(10)
sample = pd.DataFrame(sample, columns=sample[0].__fields__)
display(sample)

Unnamed: 0,pace,shooting,passing,dribbling,defending,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,...,power_long_shots,mentality_interceptions,mentality_positioning,mentality_penalties,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,label_position,label_position_index,indexed_features
0,33,35,39,45,74,37,36,80,52,45,...,21,73,33,46,75,73,72,Defender,0.0,"[33.0, 35.0, 39.0, 45.0, 74.0, 37.0, 36.0, 80...."
1,33,54,58,54,64,48,48,68,67,40,...,59,65,54,45,65,64,55,Defender,0.0,"[33.0, 54.0, 58.0, 54.0, 64.0, 48.0, 48.0, 68...."
2,35,47,49,49,74,39,36,74,66,44,...,44,74,32,48,75,77,65,Defender,0.0,"[35.0, 47.0, 49.0, 49.0, 74.0, 39.0, 36.0, 74...."
3,35,48,56,55,65,46,43,70,65,43,...,50,64,40,35,63,66,65,Defender,0.0,"[35.0, 48.0, 56.0, 55.0, 65.0, 46.0, 43.0, 70...."
4,37,26,39,40,60,23,26,66,42,25,...,20,58,44,24,58,61,57,Defender,0.0,"[37.0, 26.0, 39.0, 40.0, 60.0, 23.0, 26.0, 66...."
5,37,29,38,35,61,25,26,72,45,27,...,23,61,28,32,63,58,55,Defender,0.0,"[37.0, 29.0, 38.0, 35.0, 61.0, 25.0, 26.0, 72...."
6,37,41,49,51,64,50,39,61,56,32,...,43,61,43,44,66,64,62,Defender,0.0,"[37.0, 41.0, 49.0, 51.0, 64.0, 50.0, 39.0, 61...."
7,37,57,60,56,69,52,52,71,63,44,...,56,70,61,62,72,66,63,Defender,0.0,"[37.0, 57.0, 60.0, 56.0, 69.0, 52.0, 52.0, 71...."
8,42,29,38,43,63,29,24,68,51,20,...,30,66,23,35,61,62,60,Defender,0.0,"[42.0, 29.0, 38.0, 43.0, 63.0, 29.0, 24.0, 68...."
9,43,47,39,36,69,27,27,69,47,21,...,70,70,28,54,64,74,63,Defender,0.0,"[43.0, 47.0, 39.0, 36.0, 69.0, 27.0, 27.0, 69...."
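
A weighted random split of this kind assigns rows by independent random draws, so the resulting set sizes are close to the requested proportions rather than exact. The sketch below shows the idea in plain Python; `random_split` is a hypothetical helper and the integer list stands in for DataFrame rows. It is not Spark's implementation of `randomSplit`.

```python
import random


def random_split(rows, train_weight, seed):
    """Assign each row to train or test with one independent random draw per row."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for DataFrame rows
train, test = random_split(rows, train_weight=0.67, seed=24)
train_again, _ = random_split(rows, train_weight=0.67, seed=24)

print(len(train), len(test))   # sizes are approximately 2/3 and 1/3
print(train == train_again)    # the same seed reproduces the same split
```

Fixing the seed matters for the evaluation: without it, each run would score the model on a different test set and the reported metrics would not be comparable.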


In [14]:
model = rf.fit(trainingData)
predictions = model.transform(testData)

# Evaluation Metrics

To generate the classification report, the <b>"prediction"</b> and <b>"label_position_index"</b> columns are selected from the <b>"predictions"</b> DataFrame. The result is then converted to a pandas DataFrame so that it can be passed to the classification_report() function from the scikit-learn metrics module.

In [18]:
predictions_and_labels_pd = predictions.select("prediction", "label_position_index").toPandas()
# classification_report expects the true labels first, then the predictions
class_report_dict = classification_report(predictions_and_labels_pd['label_position_index'],
                                          predictions_and_labels_pd['prediction'], output_dict=True)
class_report_df = pd.DataFrame.from_dict(class_report_dict).transpose()
print("Classification Report:")
print(class_report_df)

Classification Report:
              precision    recall  f1-score     support
0.0            0.900000  0.923077  0.911392  156.000000
1.0            0.758427  0.918367  0.830769  147.000000
2.0            0.779817  0.590278  0.671937  144.000000
accuracy       0.814318  0.814318  0.814318    0.814318
macro avg      0.812748  0.810574  0.804699  447.000000
weighted avg   0.814726  0.814318  0.807739  447.000000


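To generate the confusion matrix, the <b>"prediction"</b> and <b>"label_position_index"</b> columns are selected from the <b>"predictions"</b> DataFrame and converted to an RDD, which is passed to MulticlassMetrics. Its confusionMatrix() method returns a matrix in which rows correspond to the true classes and columns to the predicted classes, both ordered by class index.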

In [19]:
predictions_and_labels = predictions.select("prediction", "label_position_index").rdd
metrics = MulticlassMetrics(predictions_and_labels)
confusion_matrix = metrics.confusionMatrix().toArray()
print("Confusion Matrix:")
print(confusion_matrix)



Confusion Matrix:
[[144.   0.  12.]
 [  0. 135.  12.]
 [ 16.  43.  85.]]
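
The confusion matrix and the classification report describe the same predictions, so the per-class scores can be recomputed from the matrix as a consistency check. The sketch below does this in plain Python using the values printed above; `per_class_metrics` is a hypothetical helper, and it assumes every class has at least one true and one predicted instance (otherwise a division by zero would occur).

```python
# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class
cm = [
    [144, 0, 12],
    [0, 135, 12],
    [16, 43, 85],
]


def per_class_metrics(cm):
    """Return {class_index: (precision, recall, f1)} for a square confusion matrix."""
    n = len(cm)
    result = {}
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum (support)
        precision = tp / predicted_k
        recall = tp / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        result[k] = (precision, recall, f1)
    return result


class_metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for k, (p, r, f1) in class_metrics.items():
    print(f"class {k}: precision={p:.6f} recall={r:.6f} f1={f1:.6f}")
print(f"accuracy={accuracy:.6f}")
```

The recomputed values agree with the classification report. The matrix also shows where the errors concentrate: 43 of the 144 samples of class 2 are predicted as class 1, which accounts for the low recall of class 2.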
