# Stock Price Prediction Project

## 1. Project Concept and Scope
### Objective
To predict future stock prices of specific companies representing different market sectors, using historical data.

### Scope
The project will focus on PLUG (Energy), NIO (Automotive), NTLA (Healthcare), SNAP (Communication Services), and CHPT (Industrials).

In the first part of the project, we analyze the data; in the second part, we forecast stock prices.

### Overview of Selected Stocks

This notebook provides an overview of five distinct stocks, each representing different sectors and industries. The stocks covered are:

1. **PLUG (Plug Power Inc.)**: 
   - Sector: Energy
   - Industry: Electrical Equipment & Parts
   - Description: Plug Power is an innovator in hydrogen and fuel cell technology, providing comprehensive hydrogen fuel cell turnkey solutions.

2. **NIO (NIO Inc.)**:
   - Sector: Automotive
   - Industry: Auto Manufacturers
   - Description: NIO is a pioneer in China's premium electric vehicle market, specializing in designing, manufacturing, and selling electric vehicles.

3. **NTLA (Intellia Therapeutics Inc.)**:
   - Sector: Healthcare
   - Industry: Biotechnology
   - Description: Intellia Therapeutics is a leading biotechnology company developing therapies using a CRISPR/Cas9 gene-editing system.

4. **SNAP (Snap Inc.)**:
   - Sector: Communication Services
   - Industry: Internet Content & Information
   - Description: Snap Inc. is the parent company of Snapchat, a popular social media platform known for its ephemeral messaging and multimedia features.

5. **CHPT (ChargePoint Holdings Inc.)**:
   - Sector: Industrials
   - Industry: Specialty Industrial Machinery
   - Description: ChargePoint Holdings is at the forefront of electric vehicle charging infrastructure, offering a comprehensive array of charging solutions.

Each of these companies represents a unique investment opportunity within its respective sector, reflecting different aspects of technological and industrial advancement.


## 2. Data Collection
- Utilize Alpha Vantage API for historical stock price data.
- Gather comprehensive data including prices, volumes, and market indicators.

In [2]:
"""
This script imports necessary libraries for stock price prediction.
"""
import os
import pandas as pd
from alpha_vantage.timeseries import TimeSeries
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [3]:
symbols_list = ['PLUG', 'NIO', 'NTLA', 'SNAP', 'CHPT']

In [13]:
def retrieve_stock_data(symbols):
    """
    Retrieve historical stock data for a given list of symbols using Alpha Vantage API.
    Deletes old CSV files if newer data is found and downloaded.

    Parameters:
    symbols (list): A list of stock symbols to retrieve data for.

    Returns:
    None
    """
    # Read the API key from the file
    with open('AlphaVantage.txt', 'r') as file:
        api_key = file.read().strip()

    # Create a TimeSeries object with your API key
    ts = TimeSeries(key=api_key, output_format='pandas')

    # Loop through the symbols and retrieve the historical data
    for symbol in symbols:
        # Get the historical data for the symbol
        data, meta_data = ts.get_daily(symbol=symbol, outputsize='full')
        
        # Sort the data by index (date) just in case
        data.sort_index(inplace=True)

        # Get the first and last dates
        first_date = data.index[0].strftime('%Y-%m-%d')
        last_date = data.index[-1].strftime('%Y-%m-%d')

        # Generate the new file name
        new_file_name = f'{first_date}_{last_date}_{symbol}_historical_data.csv'

        # Check if a file for this symbol already exists
        # (the leading underscore avoids matching a longer symbol that ends with this one)
        existing_files = [f for f in os.listdir() if f.endswith(f'_{symbol}_historical_data.csv')]
        if existing_files:
            # Sort files to find the most recent one
            existing_files.sort()
            most_recent_file = existing_files[-1]

            # Extract dates from the most recent file name
            existing_first_date, existing_last_date, *_ = most_recent_file.split('_')

            # Compare dates (strings comparison works because of the YYYY-MM-DD format)
            if existing_first_date <= first_date and existing_last_date >= last_date:
                print(f"Data already up-to-date for {symbol}")
                continue
            else:
                # Remove older files
                for file in existing_files:
                    os.remove(file)
                    print(f"Old file {file} deleted for {symbol}")

        # Save the new data to a CSV file
        data.to_csv(new_file_name)
        print(f"New data saved for {symbol}: {new_file_name}")

# Example usage
retrieve_stock_data(symbols_list)


Data already up-to-date for PLUG


ValueError: Thank you for using Alpha Vantage! Our standard API rate limit is 25 requests per day. Please subscribe to any of the premium plans at https://www.alphavantage.co/premium/ to instantly remove all daily rate limits.
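The traceback above shows the daily request limit being reached partway through the loop. As written, `retrieve_stock_data` calls the API for every symbol before it looks at the cached CSV files, so each run spends one request per symbol even when the cache is current. One possible way to conserve requests is to inspect the cached file names first and call the API only when the cache looks stale. The following is a minimal sketch under the file-name convention used above; `cached_last_date` and `needs_refresh` are hypothetical helpers that are not part of the notebook, and the one-day staleness threshold is an arbitrary assumption that ignores weekends and market holidays.

```python
from datetime import date, timedelta

SUFFIX = "_historical_data.csv"


def cached_last_date(filenames, symbol):
    """Return the latest end date among cached CSVs for `symbol`, or None.

    Assumes names look like '<first>_<last>_<SYMBOL>_historical_data.csv'
    with ISO dates, as written by retrieve_stock_data.
    """
    last_dates = []
    for name in filenames:
        if not name.endswith(f"_{symbol}{SUFFIX}"):
            continue
        parts = name.split("_")
        try:
            last_dates.append(date.fromisoformat(parts[1]))
        except (IndexError, ValueError):
            continue  # skip names that do not follow the convention
    return max(last_dates) if last_dates else None


def needs_refresh(filenames, symbol, today, max_age_days=1):
    """True when no cached file exists or the newest one is older than max_age_days."""
    last = cached_last_date(filenames, symbol)
    return last is None or (today - last) > timedelta(days=max_age_days)


files = ["1999-11-01_2024-01-22_PLUG_historical_data.csv"]
print(needs_refresh(files, "PLUG", today=date(2024, 1, 23)))  # False
print(needs_refresh(files, "NIO", today=date(2024, 1, 23)))   # True
```

Inside the download loop, a check such as `if not needs_refresh(os.listdir(), symbol, date.today()): continue` placed before the `ts.get_daily` call would skip symbols whose cache is recent enough.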

In [4]:
import glob
import pandas as pd

def load_stock_data(symbols):
    """
    Load the most recent, up-to-date historical data CSV files into variables.
    The 'date' column in each CSV file is used as the DataFrame index and parsed as dates.

    Parameters:
    symbols (list): A list of stock symbols to load data for.

    Returns:
    dict: A dictionary containing the loaded data frames, with stock symbols as keys.
    """
    data_frames = {}

    for symbol in symbols:
        # Find the most recent CSV file for the symbol
        # (the underscore before the symbol avoids matching a longer symbol that ends with it)
        files = glob.glob(f'*_{symbol}_historical_data.csv')
        if files:
            files.sort()
            most_recent_file = files[-1]

            # Load the CSV file into a data frame with 'date' as the index column and parse dates
            data_frames[symbol] = pd.read_csv(most_recent_file, index_col='date', parse_dates=['date'])
            print(f"Data loaded for {symbol}: {most_recent_file}")
        else:
            print(f"No data found for {symbol}")

    return data_frames

# Example usage (symbols_list is defined in an earlier cell)
stock_data = load_stock_data(symbols_list)


Data loaded for PLUG: 1999-11-01_2024-01-22_PLUG_historical_data.csv
Data loaded for NIO: 2018-09-12_2024-01-22_NIO_historical_data.csv
Data loaded for NTLA: 2016-05-06_2024-01-22_NTLA_historical_data.csv
Data loaded for SNAP: 2017-03-02_2024-01-22_SNAP_historical_data.csv
Data loaded for CHPT: 2019-09-16_2024-01-22_CHPT_historical_data.csv


### Looking at the heads of our data

In [5]:
stock_data['PLUG'].head()

date,1. open,2. high,3. low,4. close,5. volume
1999-11-01,16.75,16.75,15.0,16.0,1506000.0
1999-11-02,16.44,20.0,16.38,17.88,1701000.0
1999-11-03,18.88,19.31,18.13,18.63,683000.0
1999-11-04,19.44,19.88,18.63,19.06,480000.0
1999-11-05,19.09,19.5,17.38,17.38,489000.0


In [5]:
stock_data['NIO'].head()

date,1. open,2. high,3. low,4. close,5. volume
2018-09-12,6.0,6.93,5.35,6.6,66848996.0
2018-09-13,6.62,12.69,6.52,11.6,158346488.0
2018-09-14,12.66,13.8,9.22,9.9,172473559.0
2018-09-17,9.61,9.75,8.5,8.5,56323875.0
2018-09-18,8.73,9.1,7.67,7.68,41827593.0


In [6]:
stock_data['NTLA'].head()

date,1. open,2. high,3. low,4. close,5. volume
2016-05-06,22.0,24.0,21.0,22.1,5025236.0
2016-05-09,22.9,24.24,22.7,24.0,778138.0
2016-05-10,24.58,26.0,24.5,25.75,658353.0
2016-05-11,26.1,26.25,25.06,25.25,377679.0
2016-05-12,25.29,25.9999,23.54,23.54,588352.0


In [7]:
stock_data['SNAP'].head()

date,1. open,2. high,3. low,4. close,5. volume
2017-03-02,24.0,26.05,23.5,24.48,217109769.0
2017-03-03,26.39,29.44,26.06,27.09,148227379.0
2017-03-06,28.17,28.25,23.77,23.77,72938848.0
2017-03-07,22.21,22.5,20.64,21.44,71899652.0
2017-03-08,22.03,23.43,21.31,22.81,49834423.0


In [8]:
stock_data['CHPT'].head()

date,1. open,2. high,3. low,4. close,5. volume
2019-09-16,9.65,9.76,9.65,9.76,600.0
2019-09-17,9.76,9.76,9.76,9.76,0.0
2019-09-18,9.69,9.72,9.69,9.72,200.0
2019-09-19,9.72,9.72,9.72,9.72,0.0
2019-09-20,9.84,9.85,9.84,9.85,911.0


In [9]:
stock_data['PLUG'].describe()

statistic,1. open,2. high,3. low,4. close,5. volume
count,6094.0,6094.0,6094.0,6094.0,6094.0
mean,8.725326,9.059263,8.346954,8.681892,6434427.0
std,15.138294,15.8744,14.281196,15.02567,13443030.0
min,0.12,0.125,0.1155,0.118,13700.0
25%,1.85,1.89,1.7825,1.84,359000.0
50%,3.69,3.81,3.545,3.67,1029350.0
75%,7.8,8.0275,7.53875,7.77,5554194.0
max,142.2,156.5,130.0,149.8,243272100.0


In [10]:
stock_data['PLUG'].columns

Index(['1. open', '2. high', '3. low', '4. close', '5. volume'], dtype='object')
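The numeric prefixes in these column names come from the API response and make later expressions a little awkward. If shorter names are preferred, they could be stripped with a small helper. This is only a sketch: `clean_column_name` is a hypothetical function, not part of the notebook, and it assumes every prefix has the form `N. ` shown above.

```python
import re


def clean_column_name(name):
    """Drop a leading 'N. ' prefix, e.g. '4. close' -> 'close'."""
    return re.sub(r"^\d+\.\s*", "", name)


columns = ['1. open', '2. high', '3. low', '4. close', '5. volume']
print([clean_column_name(c) for c in columns])
# ['open', 'high', 'low', 'close', 'volume']
```

With pandas, `stock_data['PLUG'].rename(columns=clean_column_name)` would return a renamed copy. The remaining cells in this notebook refer to the original names (for example `'4. close'`), so they would need updating if the renamed frame were used instead.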

In [11]:
stock_data['PLUG'].info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6094 entries, 1999-11-01 to 2024-01-22
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   1. open    6094 non-null   float64
 1   2. high    6094 non-null   float64
 2   3. low     6094 non-null   float64
 3   4. close   6094 non-null   float64
 4   5. volume  6094 non-null   float64
dtypes: float64(5)
memory usage: 285.7 KB


In [12]:
# Check for missing values

stock_data['PLUG'].isna().sum()

1. open      0
2. high      0
3. low       0
4. close     0
5. volume    0
dtype: int64

In [13]:
stock_data['NIO'].describe()

statistic,1. open,2. high,3. low,4. close,5. volume
count,1348.0,1348.0,1348.0,1348.0,1348.0
mean,17.222458,17.759463,16.630025,17.203023,62320850.0
std,15.13954,15.567361,14.625658,15.113029,56229170.0
min,1.19,1.45,1.19,1.32,5111018.0
25%,6.41,6.63375,6.1275,6.3975,32037890.0
50%,10.465,10.8175,10.2175,10.535,48738960.0
75%,23.765,24.56,22.7975,23.7675,73326410.0
max,64.95,66.99,62.19,62.84,579069900.0


In [14]:
stock_data['NIO'].info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1348 entries, 2018-09-12 to 2024-01-22
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   1. open    1348 non-null   float64
 1   2. high    1348 non-null   float64
 2   3. low     1348 non-null   float64
 3   4. close   1348 non-null   float64
 4   5. volume  1348 non-null   float64
dtypes: float64(5)
memory usage: 63.2 KB


In [15]:
stock_data['NIO'].isna().sum()

1. open      0
2. high      0
3. low       0
4. close     0
5. volume    0
dtype: int64

In [6]:
stock_data['NTLA'].describe()

statistic,1. open,2. high,3. low,4. close,5. volume
count,1940.0,1940.0,1940.0,1940.0,1940.0
mean,38.53918,39.871191,37.216529,38.491005,879489.2
std,33.5108,34.815905,32.32766,33.486071,978299.1
min,9.74,10.22,9.18,9.44,51719.0
25%,16.5975,17.00875,16.11,16.5475,439217.2
50%,25.46,26.3,24.43,25.315,721621.0
75%,46.04,47.6025,44.14,45.7725,1087951.0
max,175.7,202.73,170.4,176.78,23193670.0


In [18]:
stock_data['NTLA'].info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1940 entries, 2016-05-06 to 2024-01-22
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   1. open    1940 non-null   float64
 1   2. high    1940 non-null   float64
 2   3. low     1940 non-null   float64
 3   4. close   1940 non-null   float64
 4   5. volume  1940 non-null   float64
dtypes: float64(5)
memory usage: 90.9 KB


In [19]:
stock_data['NTLA'].isna().sum()

1. open      0
2. high      0
3. low       0
4. close     0
5. volume    0
dtype: int64

In [16]:
stock_data['SNAP'].describe()

statistic,1. open,2. high,3. low,4. close,5. volume
count,1734.0,1734.0,1734.0,1734.0,1734.0
mean,22.775956,23.334163,22.206448,22.773962,28551590.0
std,18.302069,18.736739,17.804848,18.276714,25850030.0
min,4.96,5.14,4.82,4.99,3285663.0
25%,11.0,11.3325,10.76,11.0125,16384530.0
50%,14.855,15.165,14.57,14.87,22018380.0
75%,23.985,24.419525,23.199875,23.7625,31507490.0
max,82.0,83.34,79.32,83.11,330993900.0


In [17]:
stock_data['SNAP'].info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1734 entries, 2017-03-02 to 2024-01-22
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   1. open    1734 non-null   float64
 1   2. high    1734 non-null   float64
 2   3. low     1734 non-null   float64
 3   4. close   1734 non-null   float64
 4   5. volume  1734 non-null   float64
dtypes: float64(5)
memory usage: 81.3 KB


In [19]:
stock_data['SNAP'].isna().sum()

1. open      0
2. high      0
3. low       0
4. close     0
5. volume    0
dtype: int64

In [20]:
stock_data['CHPT'].describe()

statistic,1. open,2. high,3. low,4. close,5. volume
count,1095.0,1095.0,1095.0,1095.0,1095.0
mean,14.95773,15.396813,14.448195,14.915188,7418417.0
std,8.87972,9.250031,8.398263,8.821867,7428137.0
min,1.64,1.76,1.56,1.65,0.0
25%,9.75,9.78,9.72,9.75,1992494.0
50%,11.86,12.22,11.39,11.93,7015886.0
75%,19.21,19.93,18.73,19.32,10346420.0
max,49.08,49.48,45.1247,46.1,102265700.0


In [21]:
stock_data['CHPT'].info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1095 entries, 2019-09-16 to 2024-01-22
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   1. open    1095 non-null   float64
 1   2. high    1095 non-null   float64
 2   3. low     1095 non-null   float64
 3   4. close   1095 non-null   float64
 4   5. volume  1095 non-null   float64
dtypes: float64(5)
memory usage: 51.3 KB


In [7]:
stock_data['CHPT'].isna().sum()

1. open      0
2. high      0
3. low       0
4. close     0
5. volume    0
dtype: int64
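The `describe()`, `info()`, and `isna()` checks above are repeated cell by cell for each symbol. The same basic checks could be collected into one table with a loop over the dictionary returned by `load_stock_data`. The sketch below is one possible shape for that: `summarize_frames` is a hypothetical helper, not part of the notebook, and it assumes each frame has a `DatetimeIndex` and a `'5. volume'` column. It also counts zero-volume days, since the CHPT output above shows a minimum volume of 0.0.

```python
import pandas as pd


def summarize_frames(frames):
    """Build one row of basic checks per symbol from a dict of DataFrames."""
    rows = {}
    for symbol, df in frames.items():
        rows[symbol] = {
            "rows": len(df),
            "first_date": df.index.min(),
            "last_date": df.index.max(),
            "missing_values": int(df.isna().sum().sum()),
            "zero_volume_days": int((df["5. volume"] == 0).sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Small demo frame; the values are copied from the first three CHPT rows shown earlier.
demo = {
    "CHPT_head": pd.DataFrame(
        {"4. close": [9.76, 9.76, 9.72], "5. volume": [600.0, 0.0, 200.0]},
        index=pd.to_datetime(["2019-09-16", "2019-09-17", "2019-09-18"]),
    )
}
print(summarize_frames(demo))
```

In the notebook, `summarize_frames(stock_data)` would produce one row per loaded symbol.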

In [23]:
"""
Create a histogram plot of the closing price distribution for the 'PLUG' stock.

Parameters:
- stock_data (DataFrame): The stock data containing the 'PLUG' stock information.
- nbins (int): The number of bins to use for the histogram.

Returns:
- None
"""
fig = px.histogram(
    stock_data['PLUG'], 
    x='4. close', 
    marginal='box',
    nbins=200,
    title='PLUG Closing Price Distribution'
)
fig.update_layout(bargap=0.1)
fig.show()

In [22]:
"""
Create a histogram plot of the closing price distribution for the 'PLUG' stock.

Parameters:
- stock_data (DataFrame): The stock data containing the 'PLUG' stock information.
- nbins (int): The number of bins to use for the histogram.

Returns:
- None
"""
fig = px.histogram(
    stock_data['PLUG'], 
    x='1. open', 
    marginal='box',
    color_discrete_sequence=['red'],
    nbins=200,
    title='PLUG Opening Price Distribution'
)
fig.update_layout(bargap=0.1)
fig.show()

In [24]:
fig = px.scatter(
    stock_data['PLUG'],
    x='1. open',
    y='4. close',
    opacity=0.8,
    title='Open vs. Close'
)
fig.update_traces(marker_size=5)
fig.show()

In [8]:
fig = px.line(stock_data['PLUG'], x=stock_data['PLUG'].index, y='4. close', title='PLUG Closing Prices')
fig.show()