<a href="https://colab.research.google.com/github/leandromercanti/Stock_Price_Prediction_Using_Linear_Regression/blob/main/Stock_Price_Prediction_Using_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project: Stock Price Prediction Using Linear Regression**

<br>

**Objective:** Build a simple model to predict stock prices using historical data and linear regression techniques.

<br>

**Skills:** Data cleaning, feature engineering, basic statistics, linear regression.

<br>

**Important:** This project was created for educational purposes only. Consult your financial advisor before making any investment decisions.


### Step 1: Set Up the Environment

Programming Language: Python

Tools/IDE: Google Colab, Jupyter Notebook, VS Code, or any Python IDE.

Libraries to Install:
pandas for data manipulation, numpy for numerical computations, matplotlib and seaborn for data visualization, scikit-learn for machine learning algorithms, yfinance to fetch historical stock price data.

In [None]:
# Install the required libraries using pip
# Uncomment the line below only if your environment does not already have these libraries installed.

# pip install pandas numpy matplotlib seaborn scikit-learn yfinance


### Step 2: Fetch Historical Stock Price Data

Objective: Gather the historical stock price data for the company of interest.

Using yfinance:
Import the necessary library and fetch data.

In [None]:
import yfinance as yf

# Define the stock ticker and the date range
stock_ticker = 'AAPL'  # Example: Apple Inc.
start_date = '2019-01-01'
end_date = '2024-08-01'

# Fetch the data
stock_data = yf.download(stock_ticker, start=start_date, end=end_date)

In [None]:
# Display the first and last few rows of the dataset
print(stock_data.head())
print(stock_data.tail())
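
Depending on the installed yfinance version, the downloaded frame may come back with two-level column labels (price field plus ticker) even for a single ticker, in which case `stock_data['Close']` would not be a plain Series. The sketch below shows one way to normalize the columns before continuing. `flatten_columns` is a hypothetical helper, not part of yfinance, and the demo frame is synthetic so the example runs without a network connection.

```python
import pandas as pd

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose columns are single-level, e.g. 'Close' instead of ('Close', 'AAPL')."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # Keep only the first level, assumed here to hold the price field names
        out.columns = out.columns.get_level_values(0)
    return out

# Synthetic stand-in for a single-ticker download with two-level columns
demo = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    columns=pd.MultiIndex.from_tuples([("Close", "AAPL"), ("Open", "AAPL")]),
)
flat = flatten_columns(demo)
print(list(flat.columns))
```

If the columns are already single-level, the helper returns the data unchanged, so calling `stock_data = flatten_columns(stock_data)` right after the download should be harmless either way.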

### Step 3: Data Preprocessing

Objective: Clean and prepare the data for modeling.

In [19]:
# Handling Missing Values:
stock_data = stock_data.dropna()

In [None]:
# Feature Engineering: You can create new features such as moving averages, returns, etc.
stock_data['Moving_Avg_30'] = stock_data['Close'].rolling(window=30).mean()
stock_data['Return'] = stock_data['Close'].pct_change()

print(stock_data.tail())
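
To see what these two features do, here is a minimal sketch on a handful of made-up prices. It uses a 3-row window instead of 30 so the effect is visible in a short series; the values and the `Moving_Avg_3` name are illustrative only.

```python
import pandas as pd

# Made-up closing prices, for illustration only
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="Close")

moving_avg_3 = close.rolling(window=3).mean()  # NaN until 3 observations are available
daily_return = close.pct_change()              # NaN for the first row (no previous price)

demo = pd.DataFrame({"Close": close, "Moving_Avg_3": moving_avg_3, "Return": daily_return})
print(demo)
print("Rows with NaN in Moving_Avg_3:", int(moving_avg_3.isna().sum()))
```

With a 30-day window, the first 29 rows of `Moving_Avg_30` are NaN for the same reason, which is why the feature matrix is passed through `dropna()` before modeling.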

In [21]:
# Selecting Relevant Features: Use Open, High, Low, Volume, and the engineered moving average as inputs; Close is the prediction target.
features = ['Open', 'High', 'Low', 'Volume', 'Moving_Avg_30']
target = 'Close'

In [None]:
# Splitting Data into Train and Test Sets:
from sklearn.model_selection import train_test_split

X = stock_data[features].dropna()
y = stock_data.loc[X.index, target]  # Align target with the rows that survived dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("The dataset has been successfully split into training (80%) and testing (20%) data")
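
The split above is random, so test days are scattered among training days. For time-ordered data, a common alternative is to hold out the most recent rows instead. The sketch below illustrates that idea on synthetic data; `chronological_split` is a hypothetical helper introduced here, not something the notebook uses.

```python
import numpy as np
import pandas as pd

def chronological_split(X, y, test_size=0.2):
    """Split so that every test row is later in time than every training row."""
    X = X.sort_index()
    y = y.loc[X.index]
    cut = int(len(X) * (1 - test_size))  # number of rows kept for training
    return X.iloc[:cut], X.iloc[cut:], y.iloc[:cut], y.iloc[cut:]

# Synthetic data with a business-day index, for illustration only
idx = pd.bdate_range("2024-01-01", periods=10)
X_demo = pd.DataFrame({"Open": np.arange(10.0)}, index=idx)
y_demo = pd.Series(np.arange(10.0) + 0.5, index=idx, name="Close")

X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo)
print(len(X_tr), len(X_te))
print(X_tr.index.max() < X_te.index.min())
```

Passing `shuffle=False` to `train_test_split` gives a similar result on data that is already sorted by date. Scores from a chronological split are usually a more conservative estimate of how the model would behave on future data.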

### Step 4: Building the Linear Regression Model

Objective: Create and train a linear regression model to predict stock prices.

In [None]:
# Import and Train the Model:
from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)
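
Once a linear model is fitted, its coefficients show how much each feature contributes to the prediction. The sketch below demonstrates this on synthetic data with a known relationship, so the recovered values can be checked by eye. The column names `a` and `b` and the variable `demo_model` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data where the true relationship is known: y = 2*a - 1*b + 5
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
y_demo = 2 * X_demo["a"] - 1 * X_demo["b"] + 5

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its column name for easier reading
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients.round(3))
print("Intercept:", round(float(demo_model.intercept_), 3))
```

Applied to this notebook, `pd.Series(model.coef_, index=features)` should give the same kind of table for the stock model, assuming `y_train` is a single column.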

In [None]:
# Model Evaluation: Evaluate the model using metrics like R-squared and Mean Squared Error (MSE).
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate R-squared and MSE
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f'R-squared: {r2}')
print(f'Mean Squared Error: {mse}')
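
A score is easier to judge next to a simple baseline. The sketch below computes the same two metrics for a naive forecast that predicts each day's close as the previous day's close. It runs on a synthetic random-walk series, not on the downloaded data, and the variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic random-walk "prices", for illustration only
rng = np.random.default_rng(42)
close = pd.Series(100 + rng.normal(0, 1, size=500).cumsum(), name="Close")

# Naive forecast: today's close is predicted to equal yesterday's close
naive_pred = close.shift(1).dropna()
actual = close.loc[naive_pred.index]

baseline_r2 = r2_score(actual, naive_pred)
baseline_mse = mean_squared_error(actual, naive_pred)
print(f"Naive baseline R-squared: {baseline_r2:.4f}")
print(f"Naive baseline MSE: {baseline_mse:.4f}")
```

On price levels that drift over time, even this naive forecast often scores a high R-squared, so comparing the model against it helps show how much the regression adds.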

### Step 5: Model Interpretation and Visualization

Objective: Understand and visualize the model's predictions.

In [None]:
import pandas as pd

# y_test is a pandas Series indexed by date; y_pred is a NumPy array in the same row order
comparison_df = pd.DataFrame({
    'Actual Price': y_test,
    'Predicted Price': y_pred
})

# Sort the dataframe by date in descending order and display the 10 most recent entries
recent_comparison_df = comparison_df.sort_index(ascending=False).head(10)

# Display the recent comparison
recent_comparison_df_rounded = recent_comparison_df.round(2)
print("\nComparison:")
print(recent_comparison_df_rounded)

In [None]:
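# Plotting Actual vs Predicted Prices:
import matplotlib.pyplot as plt
import pandas as pd

# The random split shuffles the test rows, so sort by date before drawing lines
plot_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}).sort_index()

plt.figure(figsize=(10, 6))
plt.plot(plot_df.index, plot_df['Actual'], label='Actual Price', color='b')
plt.plot(plot_df.index, plot_df['Predicted'], label='Predicted Price', color='r')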
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.title(f'{stock_ticker} Stock Price Prediction')
plt.legend()
plt.show()


In [None]:
# Residual Analysis: Plot the residuals to check for any patterns.
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_test.index, residuals)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Date')
plt.ylabel('Residuals')
plt.title('Residuals Analysis')
plt.show()
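
Alongside the scatter plot, a few summary numbers can help check for patterns. The sketch below reports the mean, the standard deviation, and the lag-1 autocorrelation of a residual series. `residual_summary` is a hypothetical helper, and the residuals here are synthetic noise, not output from the model above.

```python
import numpy as np
import pandas as pd

def residual_summary(residuals: pd.Series) -> dict:
    """Summarize residuals ordered by date: mean, spread, and lag-1 autocorrelation."""
    ordered = residuals.sort_index()
    return {
        "mean": float(ordered.mean()),
        "std": float(ordered.std()),
        "lag1_autocorr": float(ordered.autocorr(lag=1)),
    }

# Synthetic residuals that behave like pure noise, for illustration only
rng = np.random.default_rng(1)
idx = pd.bdate_range("2024-01-01", periods=200)
demo_residuals = pd.Series(rng.normal(0, 0.5, size=200), index=idx)

print(residual_summary(demo_residuals))
```

A mean near zero and an autocorrelation near zero are what unpatterned residuals would look like. With a random train/test split, neighboring test rows are not always consecutive trading days, so the autocorrelation figure is only a rough indicator.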

### Step 6: Conclusion and Further Steps

#### Model Performance Analysis

Based on the R-squared and MSE values computed in Step 4, the model scores exceptionally well on the test set:

<br>

**R-squared:**

Interpretation: R-squared, also known as the coefficient of determination, measures how well the independent variables explain the variance in the dependent variable. An R-squared value of 0.9999 indicates that the model explains approximately 99.99% of the variance in the stock prices. This suggests that the model fits the data very well and captures almost all the variability in the stock prices.
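
To make the definition concrete, the following sketch computes R-squared by hand as one minus the ratio of unexplained to total variation, then compares it with `r2_score`. The four actual and predicted values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 102.0, 101.0, 105.0])
y_hat = np.array([100.5, 101.5, 101.5, 104.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Because the denominator is the total spread of the target, a series that ranges widely over several years can yield a value close to 1 even when individual errors are not negligible.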

<br>

**Mean Squared Error (MSE):**

Interpretation: MSE measures the average squared difference between the predicted and actual values, so it is expressed in squared price units. A lower MSE indicates a better fit of the model to the data. In this case, an MSE of 0.3566 is low relative to the price level of the stock over this period, indicating that the model’s predictions on the test set are very close to the actual closing prices.
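
As a worked step, the reported MSE can be converted into the same units as the price by taking its square root (RMSE). The symbols below are introduced here for illustration: $n$ is the number of test rows, $y_i$ an actual closing price, and $\hat{y}_i$ the corresponding prediction.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{0.3566} \approx 0.597
$$

Read this way, the prediction error on the test set is roughly 0.60 in the currency the price is quoted in.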

#### Strengths of the Model

<br>

**High Accuracy:** The high R-squared value suggests that the linear regression model reproduces the closing prices in the test set very accurately from the selected features.

<br>

**Low Error:** The low MSE further supports the accuracy of the model, showing that the predictions are very close to the actual stock prices.

#### Potential Limitations

<br>

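**Overfitting Risk:** The exceptionally high R-squared value may suggest that the model is overfitting the training data, especially if the test data is not sufficiently diverse or if the model is too closely tailored to the specific dataset used. Overfitting occurs when a model captures noise or random fluctuations in the data rather than the underlying trend. Two design choices likely contribute to the high score as well: the same-day High and Low values bracket the Close by definition, and the random train/test split places test days between training days instead of after them.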
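
One way to probe this risk is time-aware cross-validation, where each fold trains on earlier rows and validates on the rows that follow. The sketch below uses scikit-learn's `TimeSeriesSplit` on synthetic data; the feature, the target, and the choice of five folds are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, date-ordered data, for illustration only
rng = np.random.default_rng(7)
idx = pd.bdate_range("2023-01-02", periods=300)
X_demo = pd.DataFrame({"feature": rng.normal(size=300).cumsum()}, index=idx)
y_demo = 3 * X_demo["feature"] + rng.normal(0, 0.5, size=300)

# Each fold trains on earlier rows and validates on the rows that follow
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")
print(scores.round(3))
```

Scores that vary widely between folds, or that fall well below the single test-set score, would be a sign that the model does not generalize as well as the headline number suggests.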

<br>

**Limited Feature Set:** While the model performs well with the chosen features, it may not generalize as effectively to different stocks or market conditions. Stock prices can be influenced by many factors not captured in the current model, such as economic indicators, global events, or company-specific news.
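
As a possible next step, an external series can be aligned with the daily price data and added as a feature. The sketch below forward-fills a monthly indicator onto a daily index so that each trading day carries the most recent published value. The `Indicator` series, its values, and the price column are all invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices on a business-day index, for illustration only
daily_idx = pd.bdate_range("2024-01-01", "2024-03-29")
prices = pd.DataFrame({"Close": np.linspace(180.0, 190.0, len(daily_idx))}, index=daily_idx)

# Hypothetical monthly indicator (values are invented)
indicator = pd.Series(
    [3.1, 3.2, 3.5],
    index=pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    name="Indicator",
)

# Forward-fill so each trading day carries the most recent published value
prices["Indicator"] = indicator.reindex(prices.index, method="ffill")
print(prices.loc["2024-01-30":"2024-02-02"])
```

Forward-filling uses only values that were already available on each day, which avoids leaking future information into the features.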