<a href="https://www.kaggle.com/code/manishkr1754/gold-price-prediction?scriptVersionId=143538085" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>Gold Price Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

The gold market is a dynamic and ever-changing landscape. Prices can fluctuate wildly based on a variety of factors including economic conditions, geopolitical events, and supply and demand. As a result, it can be difficult for investors to accurately predict the price of gold.

The goal of this project is **to develop a machine learning model that can predict the price of gold based on various features and factors**. Because gold prices are recorded over time, the task could be framed as **time series forecasting**. For now, however, we will drop the Date column and treat it as a **regression** problem, using standard regression techniques.

## 2) Understanding Data
---

The project uses **Gold Price Data**, which contains several independent variables and one dependent (outcome) variable called **GLD**, i.e. the gold price. The columns in the dataset are as follows:

- SPX
- GLD (Target Column)
- USO
- SLV
- EUR/USD
- Date

## 3) Getting System Ready
---
Importing required libraries


In [None]:
import pandas as pd
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

## 4) Data Eyeballing
---

### Loading Data

In [None]:
gold_price_df = pd.read_csv('Datasets/Day8_Gold_Price_Data.csv') 

In [None]:
gold_price_df

In [None]:
print('The size of Dataframe is: ', gold_price_df.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
gold_price_df.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in gold_price_df.columns if gold_price_df[feature].dtype != 'O']
categorical_features = [feature for feature in gold_price_df.columns if gold_price_df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
print('Missing values in each column of the DataFrame are as follows: ')
print('-'*100)
total=gold_price_df.isnull().sum().sort_values(ascending=False)
percent=(gold_price_df.isnull().sum()/gold_price_df.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
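
The same per-column summary can be cross-checked more compactly, because the mean of a boolean mask is the fraction of `True` values. The sketch below is a minimal illustration on a small made-up frame (`toy_df` and its numbers are invented stand-ins, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for gold_price_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "SPX": [1447.2, np.nan, 1416.2, 1390.2],
    "GLD": [84.9, 85.6, 85.1, 84.8],
    "SLV": [15.2, 15.3, np.nan, np.nan],
})

# isnull() gives a boolean mask: sum() counts missing cells per column,
# mean() gives the missing share per column (multiplied by 100 for a percentage).
missing_summary = pd.DataFrame({
    "Total": toy_df.isnull().sum(),
    "Percent": toy_df.isnull().mean() * 100,
}).sort_values("Total", ascending=False)

print(missing_summary)
```

Applied to `gold_price_df`, this should produce the same two columns as the cell above.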

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
gold_price_df.describe()

## 5) Data Cleaning & Preprocessing
---

### Dropping Date Column

In [None]:
gold_price_df = gold_price_df.drop(['Date'], axis=1)

In [None]:
gold_price_df

### Understanding Correlation with a Heatmap

In [None]:
correlation = gold_price_df.corr()

In [None]:
correlation

In [None]:
plt.figure(figsize = (8,8))
sns.heatmap(correlation, cbar=True, square=True, fmt='.1f',annot=True, annot_kws={'size':8}, cmap='Blues')

#### Inference

The correlation matrix shows the correlation coefficients between the 'GLD' (gold price) column and the other columns ('SPX', 'USO', 'SLV', and 'EUR/USD'). The key inferences based on these coefficients are:

1. **SPX (S&P 500 Index):**
   - Correlation Coefficient: 0.049345 (Positive, but very weak correlation)
   - There is a very weak positive correlation between the S&P 500 Index (equity market) and the price of gold. The level of the S&P 500 Index has almost no linear association with the gold price.

2. **USO (United States Oil Fund):**
   - Correlation Coefficient: -0.186360 (Negative, but weak correlation)
   - There is a weak negative correlation between the United States Oil Fund (representing oil prices) and the price of gold. This suggests a slight tendency for gold prices to be lower when oil prices are higher, but the relationship is not strong.

3. **SLV (Silver Price):**
   - Correlation Coefficient: 0.866632 (Strong positive correlation)
   - There is a strong positive correlation between the price of silver and the price of gold. This indicates that gold and silver prices tend to move together: high silver prices usually coincide with high gold prices.

4. **EUR/USD (Euro to US Dollar Exchange Rate):**
   - Correlation Coefficient: -0.024375 (Negative, but very weak correlation)
   - There is a very weak negative correlation between the Euro to US Dollar exchange rate and the price of gold. The exchange rate shows almost no linear association with gold prices.

Overall, the strongest correlation is observed between gold ('GLD') and silver ('SLV') indicating that these two precious metals tend to move in the same direction. The correlations with the S&P 500 Index, USO and EUR/USD are relatively weak, suggesting that other factors may have a greater influence on gold prices. Keep in mind that correlation does not imply causation and multiple factors can affect gold prices including economic conditions, geopolitical events, and investor sentiment.
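
To read the 'GLD' column of the matrix without scanning the heatmap, the coefficients can be ranked by absolute strength. The sketch below uses a synthetic frame (`demo_df`, random numbers built so that SLV tracks GLD) as a stand-in for `gold_price_df`, so the printed values will not match the coefficients quoted above:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for gold_price_df: only SLV is constructed to depend on GLD.
rng = np.random.default_rng(0)
n = 500
gld = rng.normal(120, 20, n)
demo_df = pd.DataFrame({
    "SPX": rng.normal(1600, 500, n),
    "GLD": gld,
    "USO": rng.normal(30, 15, n),
    "SLV": 0.15 * gld + rng.normal(0, 2, n),
    "EUR/USD": rng.normal(1.3, 0.1, n),
})

# Take the GLD column of the correlation matrix and drop its self-correlation of 1.
gld_corr = demo_df.corr()["GLD"].drop("GLD")

# Order by absolute value so strong negative correlations are not pushed to the bottom.
ranked = gld_corr.reindex(gld_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Replacing `demo_df` with `gold_price_df` (after the Date column has been dropped) should list SLV first, in line with the inference above.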

## 6) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [None]:
# Separating the features from the target
X = gold_price_df.drop(columns=['GLD'])  # Feature matrix
y = gold_price_df['GLD']  # Target variable

In [None]:
X

In [None]:
y

### Data Standardization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
standardized_data

In [None]:
X = standardized_data

In [None]:
X

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

In [None]:
print(y.shape, y_train.shape, y_test.shape)
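
One caveat: the scaler above was fitted on the full `X` before the split, so test-set statistics influence the transformation. A possible way to avoid this, sketched here on synthetic data, is to split the unscaled features first and wrap the scaler and the model in a scikit-learn pipeline, so the scaler is fitted on the training rows only. The names `X_raw`, `y_demo`, and `pipe` are illustrative assumptions, not part of the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled feature matrix and target.
rng = np.random.default_rng(45)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(300, 4))
y_demo = 3.0 * X_raw[:, 2] + rng.normal(scale=0.5, size=300)

# Split first, on the raw features.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_demo, test_size=0.2, random_state=45
)

# fit() runs StandardScaler.fit on X_tr only; the test rows are merely
# transformed with the training mean and standard deviation at predict time.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=45),
)
pipe.fit(X_tr, y_tr)

print("Test R2:", pipe.score(X_te, y_te))
```

For tree-based models scaling makes little difference, but for SVR and the regularized linear models it does, so the ordering of the split and the scaling step matters there.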

### Model Comparison : Training & Evaluation

In [None]:
# For Model Building
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
models = [LinearRegression, Lasso, Ridge, SVR, DecisionTreeRegressor, RandomForestRegressor]
mae_scores = []
mse_scores = []
rmse_scores = []
r2_scores = []

for model in models:
    regressor = model().fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    
    mae_scores.append(mean_absolute_error(y_test, y_pred))
    mse_scores.append(mean_squared_error(y_test, y_pred))
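    # RMSE as the square root of MSE; the `squared=False` argument is deprecated in newer scikit-learn releases
    rmse_scores.append(np.sqrt(mean_squared_error(y_test, y_pred)))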
    r2_scores.append(r2_score(y_test, y_pred))

In [None]:
regression_metrics_df = pd.DataFrame({
    "Model": ["Linear Regression", "Lasso", "Ridge", "SVR", "Decision Tree Regressor", "Random Forest Regressor"],
    "Mean Absolute Error": mae_scores,
    "Mean Squared Error": mse_scores,
    "Root Mean Squared Error": rmse_scores,
    "R-squared (R2)": r2_scores
})

regression_metrics_df.set_index('Model', inplace=True)
regression_metrics_df
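
For reference, the four columns of this table can be reproduced by hand. MAE is the mean of the absolute errors, MSE is the mean of the squared errors, RMSE is the square root of MSE, and R-squared is one minus the ratio of the squared errors to the squared deviations of the true values from their mean. The sketch below checks these definitions against scikit-learn on four made-up values (`y_true` and `y_hat` are illustrative, not model output):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true prices and predictions, for illustration only.
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_hat = np.array([102.0, 108.0, 123.0, 129.0])

errors = y_true - y_hat                      # [-2, 2, -3, 1]
mae = np.mean(np.abs(errors))                # (2 + 2 + 3 + 1) / 4 = 2.0
mse = np.mean(errors ** 2)                   # (4 + 4 + 9 + 1) / 4 = 4.5
rmse = np.sqrt(mse)                          # square root of 4.5
ss_res = np.sum(errors ** 2)                 # 18
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 500
r2 = 1 - ss_res / ss_tot                     # 1 - 18/500 = 0.964

# The hand-computed values agree with the scikit-learn functions used above.
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

Note that MAE and RMSE are in the same units as GLD, whereas MSE is in squared units and R-squared is unitless.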

### Inference

Based on the metrics for this test split:
- The **Support Vector Regressor (SVR)** and the **Random Forest Regressor** stand out as the top models for gold price prediction, with low prediction errors (MAE, MSE, RMSE) and a high share of explained variance (R-squared).
- The Decision Tree Regressor also performs well but falls slightly behind.
- Linear Regression, Lasso, and Ridge show comparatively higher errors and explain less variance, making them less suitable for accurate gold price prediction. SVR and the Random Forest Regressor are therefore the preferred choices here.
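
Because this ranking rests on a single 80/20 split, it may shift with a different `random_state`. One hedged way to check its stability is k-fold cross-validation. The sketch below uses synthetic non-linear data and only two of the models; the data-generating formula and the names `X_demo`, `y_demo`, `candidates`, and `cv_r2` are assumptions made for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic non-linear regression problem standing in for the gold price data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(300, 4))
y_demo = (
    10 * np.sin(2 * X_demo[:, 0])
    + X_demo[:, 1] ** 2
    + rng.normal(scale=0.2, size=300)
)

# Five shuffled folds; every row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=45)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=45),
}

cv_r2 = {}
for name, estimator in candidates.items():
    # The scaler sits inside the pipeline, so each fold is standardized
    # using only its own training portion.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

If the gap between models is larger than the fold-to-fold standard deviation, the ranking is less likely to be an artifact of one particular split.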