**STOCK PRICE PREDICTION - MINI PROJECT**

# Stock Price Prediction Project
## Step 1: Data Collection
This notebook collects historical stock price data for Apple (AAPL) using the `yfinance` library.

In [2]:
!pip install yfinance



In [3]:
import yfinance as yf

In [23]:
# Download Apple stock data from 2018 to 2023
data = yf.download('AAPL', start='2018-01-01', end='2023-01-01')

[*********************100%***********************]  1 of 1 completed


In [24]:
print(data.head())

Price           Close       High        Low       Open     Volume
Ticker           AAPL       AAPL       AAPL       AAPL       AAPL
Date                                                             
2018-01-02  40.479847  40.489249  39.774869  39.986364  102223600
2018-01-03  40.472782  41.017967  40.409337  40.543281  118071600
2018-01-04  40.660782  40.764179  40.437540  40.545634   89738400
2018-01-05  41.123722  41.210668  40.665487  40.757134   94640000
2018-01-08  40.970974  41.267063  40.872274  40.970974   82271200


In [25]:
# Save the data to a CSV file
data.to_csv('AAPL_stock_data.csv')
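
The two-level column header in the output above (`Price` over `Ticker`) is written to the CSV as two header rows, which makes the file awkward to reload. Below is a minimal sketch of flattening the header before saving. It uses a small hand-built frame in place of the real download, with values copied from the output above; `sample` and `flat` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hand-built stand-in for the yfinance result: columns are a (Price, Ticker) MultiIndex.
columns = pd.MultiIndex.from_product(
    [["Close", "High", "Low", "Open", "Volume"], ["AAPL"]],
    names=["Price", "Ticker"],
)
sample = pd.DataFrame(
    [
        [40.479847, 40.489249, 39.774869, 39.986364, 102223600],
        [40.472782, 41.017967, 40.409337, 40.543281, 118071600],
    ],
    index=pd.to_datetime(["2018-01-02", "2018-01-03"]).rename("Date"),
    columns=columns,
)

# Drop the Ticker level so each column has a single, plain name.
flat = sample.droplevel("Ticker", axis=1)
flat.columns.name = None  # remove the leftover "Price" label

print(flat.columns.tolist())
```

Saving `flat` instead of the raw download should give a CSV with a single header row and a `Date` index column.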

## Step 2: Preprocess the Data
Preprocessing involves cleaning, transforming, and organizing the data so that it can be used effectively by a model.


In [27]:
import pandas as pd


In [39]:
import numpy as np

In [40]:
import matplotlib.pyplot as plt

In [41]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [42]:
print(data.head())

   1       Date      Close       High        Low       Open     Volume  MA_7  \
0  2 2018-01-02  40.479847  40.489249  39.774869  39.986364  102223600   NaN   
1  3 2018-01-03  40.472782  41.017967  40.409337  40.543281  118071600   NaN   
2  4 2018-01-04  40.660782  40.764179  40.437540  40.545634   89738400   NaN   
3  5 2018-01-05  41.123722  41.210668  40.665487  40.757134   94640000   NaN   
4  6 2018-01-08  40.970974  41.267063  40.872274  40.970974   82271200   NaN   

      Target  
0  40.472782  
1  40.660782  
2  41.123722  
3  40.970974  
4  40.966278  


In [43]:
data.to_csv('AAPL_stock_data.csv')

In [44]:
from google.colab import files
files.download('AAPL_stock_data.csv')



In [18]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [19]:
data.to_csv('/content/drive/My Drive/AAPL_stock_data.csv')

In [45]:
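# Load the .xlsx file. This notebook never writes it, so it is assumed to have been
# prepared by hand from the CSV saved earlier. Reading .xlsx requires openpyxl.
data = pd.read_excel('AAPL_stock_data.xlsx')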

# Display the first few rows of the data
print(data.head())


   1       Date      Close       High        Low       Open     Volume
0  2 2018-01-02  40.479847  40.489249  39.774869  39.986364  102223600
1  3 2018-01-03  40.472782  41.017967  40.409337  40.543281  118071600
2  4 2018-01-04  40.660782  40.764179  40.437540  40.545634   89738400
3  5 2018-01-05  41.123722  41.210668  40.665487  40.757134   94640000
4  6 2018-01-08  40.970974  41.267063  40.872274  40.970974   82271200


In [46]:
data.to_csv('AAPL_stock_data_fixed.csv')


In [47]:
print(data.isnull().sum())  # Check for missing values
data = data.dropna()  # Drop rows with missing values


1         0
Date      0
Close     0
High      0
Low       0
Open      0
Volume    0
dtype: int64


In [48]:
# Example: Create a 7-day moving average
data['MA_7'] = data['Close'].rolling(window=7).mean()
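
A small sketch of what `rolling(window=7).mean()` produces, using a made-up eight-value series (`close` and `ma_7` are illustrative names). The point is that the first six positions have no full 7-value window, so they come out as `NaN`.

```python
import pandas as pd

# Toy closing prices: 1.0 through 8.0
close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Each value is the mean of the current row and the six rows before it.
ma_7 = close.rolling(window=7).mean()

# Positions 0-5 lack a full window and are NaN; position 6 averages 1..7.
print(ma_7.isna().sum())
print(ma_7.iloc[6], ma_7.iloc[7])
```

This is why the `MA_7` column in the real data starts with six missing values.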

In [49]:
data['Target'] = data['Close'].shift(-1)  # Next day's closing price

In [51]:
print(data.columns)

Index([1, 'Date', 'Close', 'High', 'Low', 'Open', 'Volume', 'MA_7', 'Target'], dtype='object')


In [52]:
# Convert column names to strings
data.columns = data.columns.astype(str)

# Check the column names again
print(data.columns)

Index(['1', 'Date', 'Close', 'High', 'Low', 'Open', 'Volume', 'MA_7',
       'Target'],
      dtype='object')


In [53]:
# Create the target variable (e.g., next day's closing price)
data['Target'] = data['Close'].shift(-1)

# Drop rows with missing values: the first six rows (no 7-day average yet)
# and the last row (no next-day target)
data = data.dropna()
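
To make the effect of `shift(-1)` concrete, here is a toy sketch with four made-up prices (`toy` is an illustrative name). Each row's `Target` is the `Close` of the following row, and the final row is dropped because it has no following day.

```python
import pandas as pd

# Four made-up closing prices
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]})

# shift(-1) moves values up by one row, so row i receives the close of row i + 1.
toy["Target"] = toy["Close"].shift(-1)

# The last row has no next-day value, so its Target is NaN and the row is removed.
toy = toy.dropna()

print(toy)
```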

In [55]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Drop the Date column; it is not used as a feature
if 'Date' in data.columns:
    data = data.drop(columns=['Date'])
    # Alternative: keep it as text instead of dropping it
    # data['Date'] = data['Date'].astype(str)

# Initialize the scaler
scaler = MinMaxScaler()

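# Select only numeric columns (excluding the target variable).
# Note: column '1' appears to be a leftover row counter from the spreadsheet rather
# than a market feature; it is kept here so the outputs below stay unchanged.
numeric_columns = data.select_dtypes(include=['number']).columns
numeric_columns = numeric_columns.drop('Target')  # Exclude target column

# Scale the features to the range [0, 1].
# Caveat: fitting on every row before the train/test split lets the test set's
# minima and maxima influence the scaling.
scaled_features = scaler.fit_transform(data[numeric_columns])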

# Convert back to DataFrame
scaled_data = pd.DataFrame(scaled_features, columns=numeric_columns, index=data.index)

# Add the target variable back
scaled_data['Target'] = data['Target'].values

# Display the first few rows
print(scaled_data.head())


           1     Close      High       Low      Open    Volume      MA_7  \
6   0.000000  0.048575  0.042762  0.047716  0.043743  0.154974  0.036085   
7   0.000799  0.050180  0.044690  0.050179  0.046058  0.100878  0.036805   
8   0.001599  0.053114  0.047719  0.052096  0.048632  0.169880  0.037965   
9   0.002398  0.051655  0.051008  0.052906  0.051417  0.212279  0.038719   
10  0.003197  0.056372  0.050782  0.051138  0.048583  0.261558  0.039697   

       Target  
6   41.189518  
7   41.614857  
8   41.403358  
9   42.087181  
10  42.124779  
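
The cell above fits the scaler on every row. One alternative, sketched here on synthetic data and assuming a chronological 80/20 split, is to fit `MinMaxScaler` on the training rows only and reuse it for the test rows. The frame `features` and the names `train_scaled` and `test_scaled` are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature table (ten rows in time order)
features = pd.DataFrame({
    "Close": np.arange(10, dtype=float),
    "Volume": np.arange(10, dtype=float) * 100,
})

# Chronological split: first 80% for training, last 20% for testing
split = int(len(features) * 0.8)
train, test = features.iloc[:split], features.iloc[split:]

# Fit on the training rows only, then apply the same transform to the test rows.
scaler = MinMaxScaler()
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)

print(train_scaled["Close"].max(), test_scaled["Close"].min())
```

With this approach, test values can fall outside [0, 1] when prices move beyond the training range, which is expected rather than a bug.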


In [56]:
# Save the scaled data to a CSV file
scaled_data.to_csv("scaled_stock_data.csv", index=False)
print("Scaled data saved successfully.")


Scaled data saved successfully.


In [57]:
from sklearn.model_selection import train_test_split

# Define input features (X) and target variable (y)
X = scaled_data.drop(columns=['Target'])  # All columns except target
y = scaled_data['Target']  # Target column

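# Split into training (80%) and testing (20%) sets.
# shuffle=False keeps the rows in date order, so the test set is the most recent 20%
# and no later prices end up in the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)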

# Print shapes to verify
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (1001, 7), y_train shape: (1001,)
X_test shape: (251, 7), y_test shape: (251,)


In [58]:
# Save training set
X_train.to_csv("X_train.csv", index=False)
y_train.to_csv("y_train.csv", index=False)

# Save testing set
X_test.to_csv("X_test.csv", index=False)
y_test.to_csv("y_test.csv", index=False)

print("Train and test datasets saved successfully.")


Train and test datasets saved successfully.
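
Because the files are written with `index=False`, the target CSV holds a single `Target` column and reloads as a one-column DataFrame. A minimal sketch of the round trip, using a temporary directory and made-up values (`y_demo` and `reloaded` are illustrative names):

```python
import os
import tempfile

import pandas as pd

# Made-up target values standing in for y_train
y_demo = pd.Series([41.5, 42.25, 43.0], name="Target")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "y_demo.csv")
    y_demo.to_csv(path, index=False)

    # read_csv returns a one-column DataFrame; squeeze turns it back into a Series.
    reloaded = pd.read_csv(path).squeeze("columns")

print(type(reloaded).__name__, reloaded.name)
```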
