In [9]:
import pandas as pd

df = pd.read_csv('goldprice.csv')
df.head()

       Date          SPX        GLD        USO     SLV   EUR/USD
0  1/2/2008  1447.160034  84.860001  78.470001  15.180  1.471692
1  1/3/2008  1447.160034  85.570000  78.370003  15.285  1.474491
2  1/4/2008  1411.630005  85.129997  77.309998  15.167  1.475492
3  1/7/2008  1416.180054  84.769997  75.500000  15.053  1.468299
4  1/8/2008  1390.189941  86.779999  76.059998  15.590  1.557099


The columns have the following meanings:

“GLD” is the target: the price of the SPDR Gold Shares ETF, which tracks gold. The remaining columns are the features used to predict it: “SPX” is the S&P 500 index, “USO” is the United States Oil Fund, “SLV” is the iShares Silver Trust (a silver ETF), and “EUR/USD” is the euro to US dollar exchange rate.

In [10]:
df.shape

(2290, 6)

The dataset has 2290 rows and 6 columns.

In [11]:
df.isnull().sum()

Date       0
SPX        0
GLD        0
USO        0
SLV        0
EUR/USD    0
dtype: int64

The output shows that no column contains null values.

In [12]:
# Drop the Date column; it is not used as a feature in this analysis

df = df.drop(columns=['Date'])
print(df.columns)

Index(['SPX', 'GLD', 'USO', 'SLV', 'EUR/USD'], dtype='object')


To inspect the pairwise correlations between the columns of the dataset, we plot the correlation matrix as a heatmap below.

In [15]:
import plotly.express as px

px.imshow(df.corr(),
          text_auto=True,
          color_continuous_scale='Viridis',
          title='Correlation Matrix',
          width=800,
          height=600,
          aspect='auto')

The heatmap shows that the SLV column is strongly correlated with the GLD column, with the highest coefficient of 0.86, followed by EUR/USD with USO at 0.82.
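
To read the correlations with the target without scanning the heatmap, the GLD column of the correlation matrix can be sorted. This is a minimal sketch on a small synthetic DataFrame: the values are made up for illustration and stand in for `df`, and `rank_by_target` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def rank_by_target(frame, target):
    """Return each feature's correlation with `target`, strongest first."""
    corr = frame.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in for df (illustrative values, not the gold dataset).
rng = np.random.default_rng(0)
gld = rng.normal(100, 10, 200)
demo = pd.DataFrame({
    "GLD": gld,
    "SLV": gld * 0.2 + rng.normal(0, 1, 200),  # built to track GLD closely
    "USO": rng.normal(50, 5, 200),             # independent noise
})

ranking = rank_by_target(demo, "GLD")
print(ranking)
```

On the real data the same call would be `rank_by_target(df, 'GLD')`.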

In [16]:
X = df.drop(columns=['GLD'])
y = df['GLD']

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=345)

In [18]:
X_train.shape, y_train.shape

((1832, 4), (1832,))

In [19]:
X_test.shape, y_test.shape

((458, 4), (458,))
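
These shapes follow from `test_size=0.2`: scikit-learn rounds the test count up and gives the remaining rows to the training set. A quick check of that arithmetic, where `split_sizes` is a hypothetical helper written for illustration:

```python
import math


def split_sizes(n_rows, test_size):
    """Row counts for a fractional test_size (test count rounded up)."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


n_train, n_test = split_sizes(2290, 0.2)
print(n_train, n_test)  # 1832 458
```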

1. RANDOM FOREST REGRESSOR

In [20]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()

rfr.fit(X_train, y_train)

rfr_prediction = rfr.predict(X_test)

In [21]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [22]:
mean_absolute_error(y_test, rfr_prediction)

1.3091553520087287

In [23]:
mean_squared_error(y_test, rfr_prediction)

6.685533263716892

In [24]:
r2_score(y_test, rfr_prediction)

0.9872199999547485

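The random forest regressor achieves an r2 score of about 0.987 on the test set, which is good: the model explains nearly all of the variance in GLD. An r2 score of 1.0 is a perfect fit and a score of 0 is no better than always predicting the mean, so the closer the score is to 1.0, the better the model performs.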
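
For reference, the three metrics used above can be written out by hand. This is a minimal sketch of the standard definitions in plain Python, not scikit-learn's implementation; the function names and the four sample values are made up for illustration.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average squared error, penalizes large misses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only, not taken from the gold dataset.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mae(y_true, y_pred))           # 0.5
print(mse(y_true, y_pred))           # 0.375
print(round(r2(y_true, y_pred), 4))  # 0.9486
```

MAE is in the same units as GLD, while MSE is in squared units, so taking its square root gives an error on the price scale.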

2. DECISION TREE REGRESSOR

In [25]:
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
dtr_prediction = dtr.predict(X_test)

In [26]:
mean_absolute_error(y_test, dtr_prediction)

1.4669212467248907

In [27]:
mean_squared_error(y_test, dtr_prediction)

9.825263030586024

In [28]:
r2_score(y_test, dtr_prediction)

0.981218123218837

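The decision tree regressor achieves an r2 score of about 0.981, which is also good and close to the random forest regressor's 0.987. Its errors are slightly higher, though: an MAE of 1.47 versus 1.31 and an MSE of 9.83 versus 6.69.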
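
Because neither model was given a `random_state`, the scores above may shift slightly between runs. The following is a minimal sketch of a side-by-side comparison with fixed seeds, run on synthetic data. The data, the seed value 0, `n_estimators=50`, and the `results` dictionary are all assumptions for illustration and not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and y: 4 features, target is a noisy linear mix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=345
)

# Fixed seeds make repeated runs produce the same scores.
models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "decision_tree": DecisionTreeRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, scores)
```

Collecting the metrics in one dictionary keeps both models on the same split and makes the comparison easier to read than separate cells.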