<a href="https://colab.research.google.com/github/mithun-mith/Assignment--Product-Dissection/blob/main/Capstone_Project_Capstone_End_to_End_Machine_Learning_Mithun.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Regression - Yes Bank Stock Closing Price Prediction



## **Project Type**    - Regression
## **Contribution**    - Individual
## **Team Member 1 -** - Mithun tulshidas Waghmare


# **Project Summary -**

# Business Context

Yes Bank is a well-known bank in the Indian financial sector. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. This makes it interesting to see how the case affected the company's stock price, and whether time series models or other predictive models can handle such situations. The dataset contains the bank's monthly stock prices since its inception, including each month's opening, closing, highest, and lowest prices. The main objective is to predict the stock's closing price for the month.

# Main Libraries to be Used:
* Pandas for data manipulation and aggregation
* Matplotlib and Seaborn for visualisation, including how features behave with respect to the target variable
* NumPy for computationally efficient operations
* scikit-learn for model training, model optimization, and metrics calculation

# Efficient EDA
* Understanding of how to prep the data and make it ready for training.
* Understanding the target feature and its distribution
* Assessing the target feature for skew and outliers (class imbalance does not apply to a regression target).
* Modeling - which algorithm to use?
* Evaluation while keeping the skew of the target in mind.
* Feature Importance and Conclusion
* Understanding how your project is useful to stakeholders

# **GitHub Link -**

Provide your GitHub Link here.

# ***Let's Begin !***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Loading the data from drive
data = pd.read_csv('/content/drive/MyDrive/data_YesBank_StockPrices (1).csv')


In [None]:
data.head()

In [None]:
# Check for missing values
print(data.isnull().sum())

In [None]:
# Count duplicated rows (note the call: .sum(), not .sum)
print(data.duplicated().sum())

In [None]:
# Converting 'Date' column to datetime format
data['Date'] = pd.to_datetime(data['Date'], format='%b-%y')
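
The format string `'%b-%y'` uses the standard strftime-style directives: `%b` is the abbreviated month name and `%y` is the two-digit year. As a minimal sketch using only the standard library, the snippet below parses one value of that shape. The sample string `"Jul-05"` is a hypothetical value chosen for illustration, not a row quoted from the dataset.

```python
from datetime import datetime

# Hypothetical sample in the 'Mon-YY' shape that the format string expects.
sample = "Jul-05"

# %b = abbreviated month name, %y = two-digit year; the day defaults to 1.
parsed = datetime.strptime(sample, "%b-%y")
print(parsed.date())
```

Because the day is not part of the string, each parsed date falls on the first day of its month.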

In [None]:
# Set the 'Date' column as the index
data.set_index('Date', inplace=True)

In [None]:
data.head()

In [None]:
# Summary statistics
print(data.describe())

In [None]:
data.info()

In [None]:
# Plotting closing prices over time
plt.figure(figsize=(14, 7))
plt.plot(data.index, data['Close'], label='Closing Price')
plt.title('Yes Bank Stock Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.xticks(rotation=90)
plt.legend()
plt.show()


## Insights

Overall, the stock price rose until 2018, when it peaked at around ₹350, and then fell to around ₹50 by 2020.

The price dropped sharply in 2012 and again in 2019. It has been relatively stable since 2020.
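
To put the fall in perspective, the approximate figures quoted above (around ₹350 at the peak and around ₹50 in 2020) can be turned into a percentage decline. This is a rough sketch: both numbers are read off the chart and are not exact values from the dataset.

```python
# Approximate figures quoted in the insight (read off the chart, not exact).
peak_price = 350.0   # around the 2018 peak, in rupees
later_price = 50.0   # around 2020, in rupees

# Percentage decline relative to the peak.
decline_pct = (peak_price - later_price) / peak_price * 100
print(f"Approximate decline from peak: {decline_pct:.1f}%")
```

With these rounded inputs the decline comes to roughly 86% of the peak value.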

In [None]:
# Visualize the distribution of the target feature 'Close'
plt.figure(figsize=(14, 7))
plt.hist(data['Close'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Closing Prices')
plt.xlabel('Closing Price')
plt.ylabel('Frequency')
plt.show()


## Insights

The distribution of closing prices is right-skewed: there are more months with low closing prices than months with high closing prices. The median closing price is therefore likely to be lower than the mean closing price.

A few months have very high closing prices, but most closing prices are clustered around the median.
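
The relationship between skew, mean, and median can be checked on a small example. The sketch below uses a made-up right-skewed sample (not values from the dataset) to show a few large values pulling the mean above the median.

```python
from statistics import mean, median

# Made-up right-skewed sample: mostly low values plus a few large ones.
prices = [12, 15, 18, 20, 25, 30, 40, 150, 300, 350]

sample_mean = mean(prices)
sample_median = median(prices)

print(f"mean:   {sample_mean:.1f}")
print(f"median: {sample_median:.1f}")
```

In the notebook itself, the same check could be run on the real data with `data['Close'].mean()`, `data['Close'].median()`, and `data['Close'].skew()`.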


For regression tasks, we typically use metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) to evaluate the model's performance.
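
As a minimal sketch of what these three metrics compute, the function below works through the arithmetic in plain Python. The name `regression_metrics` and the toy values are illustrative assumptions. The notebook itself uses scikit-learn's `mean_absolute_error` and `mean_squared_error`.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of numbers."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)   # mean of |error|
    mse = sum(e * e for e in errors) / len(errors)    # mean of error^2
    rmse = math.sqrt(mse)                             # same units as the target
    return mae, mse, rmse

# Toy values for illustration only (not taken from the dataset).
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

toy_mae, toy_mse, toy_rmse = regression_metrics(actual, predicted)
print(f"MAE: {toy_mae:.2f}  MSE: {toy_mse:.2f}  RMSE: {toy_rmse:.2f}")
```

Because MSE squares each error, the single 30-point miss dominates it. RMSE is therefore larger than MAE whenever the errors are not all the same size.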

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

In [None]:
# Assuming 'Close' is the target variable
X = data.drop('Close', axis=1)  # Features
y = data['Close']  # Target

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
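
The split above shuffles the rows before dividing them. Because each row is a monthly observation, a common alternative is a chronological hold-out, where the most recent months form the test set. The sketch below illustrates that idea in plain Python. The helper `chronological_split` is a hypothetical name, and this is not the split used for the results in this notebook.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split an ordered sequence so the final `test_fraction` is the test set."""
    n_test = int(round(len(rows) * test_fraction))
    split_at = len(rows) - n_test
    return rows[:split_at], rows[split_at:]

# Stand-in for ten months of observations, already sorted by date.
months = list(range(10))
train_rows, test_rows = chronological_split(months)

print("train:", train_rows)
print("test: ", test_rows)
```

With scikit-learn, passing `shuffle=False` to `train_test_split` keeps the row order in the same way, provided the data is sorted by date first.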

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Create the linear regression model object, lr
lr = LinearRegression()

In [None]:
# Train the model
lr.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred = lr.predict(X_test)

In [None]:
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

In [None]:
# Print the metrics
print("Linear Regression:")
print(f'MAE: {mae:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'MSE: {mse:.2f}')

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Create the Random Forest regression model object, rf_model
# (random_state is fixed so that the results are reproducible)
rf_model = RandomForestRegressor(random_state=42)

In [None]:
# Train the Random Forest model
rf_model.fit(X_train, y_train)

# Make predictions on the test set
rf_pred = rf_model.predict(X_test)

In [None]:
# Evaluate models
rf_mae = mean_absolute_error(y_test, rf_pred)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))


In [None]:
# Display results
print("Random Forest Regressor:")
print(f"MAE: {rf_mae:.2f}")
print(f"RMSE: {rf_rmse:.2f}")

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
# Create the Gradient Boosting regression model object, gb_model
# (random_state is fixed so that the results are reproducible)
gb_model = GradientBoostingRegressor(random_state=42)

In [None]:
# Train the Gradient Boosting model
gb_model.fit(X_train, y_train)

# Make predictions on the test set
gb_pred = gb_model.predict(X_test)

In [None]:
# Evaluate the model
gb_mae = mean_absolute_error(y_test, gb_pred)
gb_rmse = np.sqrt(mean_squared_error(y_test, gb_pred))


In [None]:
# Display results
print("Gradient Boosting Regressor:")
print(f"MAE: {gb_mae:.2f}")
print(f"RMSE: {gb_rmse:.2f}")

## Observations

Linear Regression has the lowest MAE and RMSE, indicating that, on average, its predictions are closer to the actual values compared to the other models.
Random Forest Regressor and Gradient Boosting Regressor have higher MAE and RMSE than Linear Regression, suggesting that their predictions have a higher average error.
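
A comparison like this can be made explicit by collecting each model's metrics in one place and selecting the lowest value. The sketch below shows one way to do that. The helper `best_model`, the model names, and the numbers are all placeholders for illustration and are not results from this notebook.

```python
def best_model(results, metric="rmse"):
    """Return the name of the model with the lowest value for `metric`."""
    return min(results, key=lambda name: results[name][metric])

# Placeholder numbers only; substitute the metrics computed in the notebook.
results = {
    "model_a": {"mae": 1.0, "rmse": 2.0},
    "model_b": {"mae": 1.5, "rmse": 2.5},
    "model_c": {"mae": 1.2, "rmse": 1.8},
}

best_by_mae = best_model(results, "mae")
best_by_rmse = best_model(results, "rmse")
print("lowest MAE: ", best_by_mae)
print("lowest RMSE:", best_by_rmse)
```

The placeholder values are chosen so that the two metrics disagree, which is why it helps to state the metric behind any "best model" claim.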

## Conclusion

Based on the provided evaluation metrics, the Linear Regression model appears to be the best fit for this task.

## **5. Solution to Business Objective**

In this case, the Linear Regression model appears to be the most accurate based on the provided evaluation metrics (MAE and RMSE). Therefore, the recommended solution is to deploy and use the Linear Regression model for predicting the stock's closing price.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***