# Random Forest Notebook
Attempting to implement the random forest algorithm for research and performance
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
https://data36.com/random-forest-in-python/

### Some info from good ol' ChatGPT
Random Forest is an ensemble learning method that uses multiple decision trees to make predictions. Here's how it works:

    Building the trees: Random Forest builds a set of decision trees. Each tree is trained on a bootstrap sample of the data points (drawn with replacement), and at each split only a random subset of the input features is considered. A tree is grown by recursively splitting the data into subsets based on the values of the input features.

    Making predictions: To make a prediction for a new data point, Random Forest combines the predictions of all the trees in the forest: a majority vote for classification (scikit-learn averages the trees' class probabilities and picks the most probable class) and an average for regression.

    Handling missing data: Some tree implementations cope with missing values through surrogate splits, where a feature that is highly correlated with the missing one is used to make the split instead. scikit-learn's Random Forest does not use surrogate splits, and older releases reject NaN inputs outright, so missing values usually need to be imputed or removed before fitting.

    Handling imbalanced data: Random Forest can also handle imbalanced data by using class weights or resampling techniques to ensure that each class is represented in the training data.
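
The prediction step described above can be checked directly against scikit-learn. The following is a minimal sketch on synthetic data (the dataset from `make_classification` and all variable names are assumptions for illustration, not this notebook's data): it averages the per-tree class probabilities by hand and compares the result with the forest's own output. It also passes `class_weight="balanced"`, one of the class-weighting options mentioned for imbalanced data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem, used only for illustration.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# class_weight="balanced" reweights classes inversely to their frequency.
forest = RandomForestClassifier(n_estimators=25, max_depth=4,
                                class_weight="balanced", random_state=0)
forest.fit(X_tr, y_tr)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.stack([tree.predict_proba(X_te) for tree in forest.estimators_])
avg_proba = tree_probas.mean(axis=0)

# The forest predicts the class with the highest averaged probability.
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

print("probabilities match:", np.allclose(avg_proba, forest.predict_proba(X_te)))
print("predictions match:", bool((manual_pred == forest.predict(X_te)).all()))
```
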

Random Forest has several advantages over other machine learning algorithms. It is fairly robust to noise, can handle both continuous and categorical input features (in scikit-learn, categorical features must be encoded numerically first), and can be used for both regression and classification problems. However, there are some things to be aware of when using Random Forest:

    Interpretability: Random Forest can be difficult to interpret, as the resulting model is a combination of many individual decision trees. Feature importance scores give a rough picture of which input features matter most, but the reasoning behind an individual prediction is hard to trace.

    Overfitting: Individual trees overfit when they are grown too deep, and the forest inherits some of this. Limiting the depth of the trees or requiring more samples per leaf can help. Adding more trees does not by itself cause overfitting; it mainly increases training time, with accuracy gains that level off.

    Computationally expensive: Random Forest can be computationally expensive to train and evaluate, especially for large datasets with many input features. It is important to use efficient algorithms and data structures to reduce the training time and memory usage.
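
As a rough illustration of the points above, the sketch below fits a shallow and an unrestricted forest on synthetic data and prints train versus test accuracy, then lists the most important features via `feature_importances_`. The dataset and variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary problem, used only for illustration.
X, y = make_classification(n_samples=800, n_features=12, n_informative=4,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Compare a depth-limited forest with one whose trees grow until pure.
results = {}
for depth in (2, None):
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=1)
    model.fit(X_tr, y_tr)
    results[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")

# Impurity-based importances of the last fitted model; they sum to 1.
importances = model.feature_importances_
top_features = np.argsort(importances)[::-1][:3]
print("top feature indices:", top_features)
```

A large gap between train and test accuracy is the usual sign of overfitting. Training time can often be reduced by passing `n_jobs=-1`, which fits the trees in parallel.
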

Overall, Random Forest is a powerful and versatile algorithm that can be effective for many machine learning problems.

### Todo:
Understand and improve RF Model
Implement vectorisation model
Abstract data processing and filtering/wrangling to function

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import time

In [2]:
# Suppress FutureWarning messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# TODO: make sparse columns, fix everything
# TODO: scale the data for SVM

In [3]:
X_full_train = pd.read_csv(r"book_rating_train.csv", index_col = False, delimiter = ',', header=0)
X_train = X_full_train.iloc[:,:-1]
Y_train = X_full_train["rating_label"]

In [4]:
doc2vec_data_source = "book_text_features_doc2vec"
# Maybe rename the columns 0, 1, 2, ... to "word vector i"
# Note: random forest works with float64 features (scikit-learn casts them to float32 internally), so no int64 conversion is needed
authors_d2v_test = pd.read_csv(f"{doc2vec_data_source}/test_authors_doc2vec20.csv", index_col = False, delimiter = ',', header=None)
authors_d2v_train = pd.read_csv(f"{doc2vec_data_source}/train_authors_doc2vec20.csv", index_col = False, delimiter = ',', header=None)

desc_d2v_test = pd.read_csv(f"{doc2vec_data_source}/test_desc_doc2vec100.csv", index_col = False, delimiter = ',', header=None)
desc_d2v_train = pd.read_csv(f"{doc2vec_data_source}/train_desc_doc2vec100.csv", index_col = False, delimiter = ',', header=None)

name_d2v_test = pd.read_csv(f"{doc2vec_data_source}/test_name_doc2vec100.csv", index_col = False, delimiter = ',', header=None)
name_d2v_train = pd.read_csv(f"{doc2vec_data_source}/train_name_doc2vec100.csv", index_col = False, delimiter = ',', header=None)

In [5]:
# Keep only the numerical columns (string columns are dropped)
# X_string = X.select_dtypes(exclude="number")
X_train_numerical = X_train.select_dtypes(include="number")

In [6]:
def randomForest(x_train=None, x_test=None, y_train=None, y_test=None, num_trees=100, depth=2, state=0):
    """Fit a random forest on the training split and return its accuracy on the test split."""
    clf = rfc(n_estimators=num_trees, max_depth=depth, random_state=state)
    clf.fit(x_train, y_train)

    predictions = clf.predict(x_test)
    print(predictions)

    # Summary statistics of the predicted labels; std == 0 means a single class was predicted
    describe_data = pd.DataFrame(predictions)
    print(describe_data.describe())

    accuracy = accuracy_score(y_test, predictions)
    # print(accuracy)

    return accuracy

In [7]:

# start_time = time.time()

# X_train, X_test, y_train, y_test = train_test_split(X_concat_train, Y, test_size=0.2, random_state=42)

# acc_metrics = pd.DataFrame()

# for trees in range(0,400,20):
#     temp_acc_arr = []
#     for depth in range(0, 10, 2):
#         temp_acc_arr.append(randomForest(X_train, X_test, y_train, y_test, trees+1, depth+1))
#     temp_df = pd.DataFrame(temp_acc_arr, columns=[str(trees)])
#     acc_metrics = pd.concat([acc_metrics, temp_df], axis=1)

# end_time = time.time()
# elapsed_time = end_time - start_time
# print(f"Time taken to run: {elapsed_time:.4f} seconds")

# acc_metrics
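
The commented-out sweep above could also be written by collecting one record per (trees, depth) pair and pivoting once at the end, which avoids repeated `pd.concat` calls. This is only a sketch on synthetic data: the grid values are a small stand-in for the ranges above, and the dataset and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

tree_counts = [1, 21, 41]  # small stand-in for range(1, 401, 20)
depths = [1, 3, 5]         # small stand-in for range(1, 11, 2)

# One record per (trees, depth) combination.
rows = []
for n_trees in tree_counts:
    for depth in depths:
        model = RandomForestClassifier(n_estimators=n_trees, max_depth=depth,
                                       random_state=0)
        model.fit(X_tr, y_tr)
        rows.append({"trees": n_trees, "depth": depth,
                     "accuracy": accuracy_score(y_te, model.predict(X_te))})

# Rows are depths, columns are tree counts.
acc_metrics = pd.DataFrame(rows).pivot(index="depth", columns="trees",
                                       values="accuracy")
print(acc_metrics)
```
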

In [8]:
# Merging all data together
# doc2vec
start_time = time.time()

X_train_merged = pd.concat([X_train_numerical, authors_d2v_train, desc_d2v_train, name_d2v_train], axis=1)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Time taken to run: {elapsed_time:.4f} seconds")
X_train_merged

Time taken to run: 0.0238 seconds


Unnamed: 0,PublishYear,PublishMonth,PublishDay,pagesNumber,0,1,2,3,4,5,...,90,91,92,93,94,95,96,97,98,99
0,2005,6,1,48,0.359375,-0.096944,0.021326,0.304888,-0.084434,-0.138658,...,-0.172811,0.098389,-0.062941,0.118057,-0.065377,0.227973,0.218879,-0.151266,-0.048105,0.300822
1,1991,10,1,364,-0.074845,0.060063,0.132891,0.051957,0.127083,0.017997,...,0.245650,-0.049657,0.072740,-0.055925,-0.000046,0.140500,0.067133,-0.238091,0.109774,-0.156772
2,2005,3,31,32,-0.127589,-0.100911,0.158580,0.046532,-0.065661,-0.037972,...,-0.033781,0.093943,0.132654,0.030295,0.102714,0.154334,0.129325,-0.231493,0.007541,-0.098540
3,2004,9,1,293,-0.000472,-0.048197,0.106046,-0.100795,-0.147681,-0.017288,...,0.020762,-0.149720,0.150557,0.294355,0.001157,0.285179,0.049340,-0.037548,0.042920,0.176173
4,2005,7,7,352,-0.162106,-0.023212,0.189444,-0.042658,-0.117135,-0.075968,...,0.191644,0.044182,0.054631,-0.025782,0.049917,0.122052,-0.084216,-0.096424,-0.068681,-0.005293
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23058,1997,8,1,120,-0.194677,0.063026,0.125115,-0.041354,-0.122502,-0.207333,...,-0.000418,-0.062899,0.048064,0.029612,0.191065,0.096081,-0.100516,-0.190299,0.224559,0.086601
23059,2005,6,1,32,-0.115993,-0.003955,-0.027285,-0.032830,0.091905,-0.257285,...,0.150964,-0.029046,0.171029,-0.072123,-0.004459,0.247430,0.111973,0.019573,0.070569,-0.112066
23060,1989,2,15,132,-0.126878,-0.120418,0.198828,0.093403,-0.053232,-0.114909,...,0.193755,-0.118570,0.006740,-0.108623,-0.036143,0.168113,0.136478,0.087885,0.113180,0.000569
23061,1998,4,21,136,-0.134530,-0.061256,0.178935,0.057537,-0.045066,-0.088796,...,0.009007,0.154127,0.219128,-0.305824,-0.017904,-0.059886,0.108616,0.041879,-0.138893,-0.044187
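
Because the three doc2vec files are read with `header=None`, the merged frame contains repeated integer column names (0, 1, 2, ...) alongside the string names of the numerical columns. One possible way to keep the names unique is to prefix each block before concatenating. The helper below is a hypothetical sketch (the name `merge_features` and the tiny stand-in frames are not part of this notebook).

```python
import numpy as np
import pandas as pd

def merge_features(numerical, **vector_blocks):
    """Concatenate feature blocks column-wise, prefixing each vector block's columns.

    Hypothetical helper: rows are assumed to be in the same order in every block.
    """
    parts = [numerical.reset_index(drop=True)]
    for name, block in vector_blocks.items():
        # Columns 0, 1, 2, ... become e.g. "authors_0", "authors_1", ...
        parts.append(block.reset_index(drop=True).add_prefix(f"{name}_"))
    merged = pd.concat(parts, axis=1)
    merged.columns = merged.columns.astype(str)
    return merged

# Tiny stand-in frames, used only for illustration.
numerical = pd.DataFrame({"pagesNumber": [48, 364]})
authors = pd.DataFrame(np.zeros((2, 3)))
desc = pd.DataFrame(np.ones((2, 3)))

merged = merge_features(numerical, authors=authors, desc=desc)
print(list(merged.columns))
```

In this notebook the call might look like `merge_features(X_train_numerical, authors=authors_d2v_train, desc=desc_d2v_train, name=name_d2v_train)`.
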


In [9]:

# x_nan_removed = X.dropna()
# Big flaw: dropping NaN rows from X alone misaligns it with the labels and doc2vec features; the same rows need to be removed from each of them
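
One way to drop incomplete rows consistently is to build a single boolean mask from the feature frame and apply it to every aligned object. The helper below is a hypothetical sketch (the function name and the stand-in data are assumptions), and it relies on all inputs having their rows in the same order.

```python
import numpy as np
import pandas as pd

def drop_incomplete_rows(features, labels, *extra_blocks):
    """Drop rows with NaN in `features` from every aligned object.

    Hypothetical helper: all inputs must have the same length and row order.
    """
    keep = features.notna().all(axis=1).to_numpy()  # True for complete rows
    kept = [features.loc[keep].reset_index(drop=True),
            labels.loc[keep].reset_index(drop=True)]
    kept += [block.loc[keep].reset_index(drop=True) for block in extra_blocks]
    return kept

# Stand-in data, used only for illustration: the second row has a missing value.
features = pd.DataFrame({"pagesNumber": [48, np.nan, 32]})
labels = pd.Series([4.0, 3.0, 5.0], name="rating_label")
vectors = pd.DataFrame(np.arange(6).reshape(3, 2))

X_clean, y_clean, vec_clean = drop_incomplete_rows(features, labels, vectors)
print(len(X_clean), len(y_clean), len(vec_clean))
```
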

In [10]:
X_train_numerical.shape, authors_d2v_train.shape, desc_d2v_train.shape, name_d2v_train.shape

((23063, 4), (23063, 20), (23063, 100), (23063, 100))

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X_train_merged, Y_train, test_size=0.2, random_state=42)

randomForest(X_train, X_test, y_train, y_test, 100, 4)

[4. 4. 4. ... 4. 4. 4.]
            0
count  4613.0
mean      4.0
std       0.0
min       4.0
25%       4.0
50%       4.0
75%       4.0
max       4.0


0.7112508129200087

In [12]:
Y_test = pd.read_csv(r"book_rating_test.csv", index_col = False, delimiter = ',', header=0)

In [13]:
16208 / Y_train.shape[0]

0.7027706716385552
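
The constant 16208 appears to be the number of training rows carrying the most common label, which makes the ratio above a majority-class baseline. The sketch below shows how such a baseline could be computed without hard-coding the count, using `value_counts` and scikit-learn's `DummyClassifier`. The labels are hypothetical stand-ins; in this notebook `Y_train` would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical labels for illustration only.
labels = pd.Series([4.0] * 7 + [3.0] * 2 + [5.0] * 1, name="rating_label")

# Share of each class; the largest share is the accuracy of always
# predicting the most common class.
class_share = labels.value_counts(normalize=True)
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

# The same baseline via DummyClassifier (the features are ignored).
dummy_features = np.zeros((len(labels), 1))
dummy = DummyClassifier(strategy="most_frequent").fit(dummy_features, labels)
dummy_accuracy = dummy.score(dummy_features, labels)

print(majority_class, baseline_accuracy, dummy_accuracy)
```

A model whose accuracy sits close to this baseline, and whose predictions are all one class, has learned little beyond the class distribution.
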

In [15]:
rf_data = pd.read_csv(r"RF_accuracy_scores.csv", index_col = False, delimiter = ',', header=0)

<bound method NDFrame._add_numeric_operations.<locals>.max of     Unnamed: 0         0        20        40        60        80       100  \
0            0  0.708790  0.708790  0.708790  0.708790  0.708790  0.708790   
1            1  0.708790  0.708790  0.708790  0.708790  0.708790  0.708790   
2            2  0.708790  0.708790  0.708790  0.708790  0.708790  0.708790   
3            3  0.706842  0.708790  0.708790  0.708790  0.708790  0.708790   
4            4  0.705138  0.708790  0.708790  0.708790  0.708790  0.708790   
5            5  0.696616  0.708790  0.708790  0.708546  0.708790  0.708546   
6            6  0.698807  0.708546  0.708790  0.708790  0.708790  0.708546   
7            7  0.690772  0.708546  0.708790  0.708546  0.708790  0.708790   
8            8  0.682737  0.708790  0.708790  0.709277  0.709277  0.709033   
9            9  0.675676  0.710251  0.709033  0.710007  0.709277  0.709520   
10          10  0.671049  0.708546  0.709520  0.709520  0.710251  0.710251   
11

In [21]:
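# Note: this maximum includes the "Unnamed: 0" index column, so it returns the largest row index rather than the best accuracy
rf_data.max().max()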

19.0
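
The value 19.0 is almost certainly the largest entry of the `Unnamed: 0` column (the saved row index), since every accuracy is below 1. A possible fix is to read that column as the index with `index_col=0`, after which the maximum and its location refer to accuracies only. The sketch below uses a small in-memory CSV with made-up numbers as a stand-in for `RF_accuracy_scores.csv`.

```python
import io
import pandas as pd

# Made-up stand-in for RF_accuracy_scores.csv, used only for illustration.
csv_text = """,0,20,40
0,0.70,0.71,0.71
1,0.69,0.72,0.71
"""

# index_col=0 turns the unnamed first column into the row index.
scores = pd.read_csv(io.StringIO(csv_text), index_col=0)

best_accuracy = scores.max().max()
# stack() yields a Series keyed by (row label, column label).
best_row, best_col = scores.stack().idxmax()

print(best_accuracy, best_row, best_col)
```
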