#### Required Packages

In [5]:
from IPython.display import clear_output
#Recommended Python version: 3.11.9

!python --version
!python -c "import platform; print(platform.architecture())"
!pip install --upgrade pip
!git --version

# !pip install -q yfinance
# !pip install -q pandas
# !pip install -q matplotlib
# !pip install -q seaborn
# !pip install -q pandas_datareader
# !pip install -q setuptools
# !pip install -q scikit-learn
# !pip install -q keras
# !pip install -q jupyter pandas pmaw
# !pip install vaderSentiment
#!pip install -q plotly

# !pip install tensorflow
clear_output()

In [6]:
%cd back
!pip install -r requirements.txt
clear_output()

##### Application of the CRISP-DM Model to Analyze Stocks Using ML

# **1. Business Understanding**
#### The main goal of the stock analysis tool is to support investors (e.g., retail investors) in making informed investment decisions by providing accurate, data-driven analyses and predictions about stock prices, market trends, and potential investment opportunities.

**Business Objectives:**
- Increase returns: Assist users in identifying profitable investment opportunities.
- Risk minimization: Provide tools to assess risks (e.g., volatility, market risks).
- Time savings: Automate analyses that are typically performed manually by analysts.
- Competitive advantage: Develop a tool that stands out from existing solutions through accuracy, user-friendliness, and innovative features.

**Background:**
The stock market is complex and influenced by a variety of factors, including:
- Economic indicators,
- Company metrics,
- News,
- Geopolitical events,
- Market sentiment.
Many investors rely on traditional analysis methods (e.g., fundamental and technical analysis), which can be time-consuming and subjective. Data-driven tools leveraging machine learning and big data offer the opportunity to automate these processes and deliver more precise results.

**Current Challenges:**
- Data quality: Availability and reliability of financial data (e.g., stock prices, company reports).
- Market dynamics: Rapid market changes require real-time or near-real-time analyses.
- Complexity: Difficulty in integrating various data sources (e.g., social media, news, financial reports).
- Competition: Existing tools like Bloomberg Terminal, TradingView, or Robinhood set standards that must be surpassed.

**Opportunities:**
- Use of modern technologies such as machine learning, natural language processing (NLP), and sentiment analysis.
- Integration of alternative data sources (e.g., X posts, news, macroeconomic data).
- Customization for specific target groups (e.g., day traders, long-term investors, or institutional investors).

**Project Objectives:**
- Predictive model: Develop a model to forecast stock prices or market trends (e.g., price movements in the next days/weeks/months/years).
- Risk analysis: Provide risk assessments for individual stocks or portfolios.
- Market sentiment: Analyze sentiment based on news and social media (e.g., X posts).
- Automation: Automatically generate reports or actionable recommendations (buy, sell, hold).

**Success Criteria from a Data-Driven Perspective:**
- Prediction accuracy: 75% hit rate for price forecasts.
- Performance: Process large datasets in near real-time (e.g., analysis of stock prices and news within seconds).
- User-friendliness: Provide clear visualizations and understandable recommendations.

**Requirements:**
- Data sources: 
    - Historical stock prices,
    - Financial reports,
    - Macroeconomic data,
    - News,
    - Social media (e.g., X posts).
- Technology: 
    - Cloud computing for scalability,
    - Machine learning for predictions,
    - NLP for sentiment analysis.

**Assumptions:**
- High-quality data is available and can be obtained in sufficient quantities.
- Historical data is representative of future market conditions (with limitations due to unpredictable events such as crises).

## **2. Data Understanding**
#### 
**Required Data Sources:**
To achieve the business objectives (e.g., stock price predictions, risk analyses, market sentiment assessments), various data sources are needed. These can be divided into the following categories:

- Market data:
    - Historical stock prices (e.g., open, close, high, low, volume).
    - Indices (e.g., DAX, S&P 500, NASDAQ).
    - Commodity prices (e.g., oil, gold) that may influence stocks.
    - Exchange rates (e.g., EUR/USD, USD/JPY).
- Company data:
    - Financial reports (e.g., income statements, balance sheets, cash flow).
    - Metrics (e.g., P/E ratio, dividend yield, equity ratio).
    - Insider trading and stock buybacks.
- Macroeconomic data:
    - Interest rates (e.g., from central banks like ECB, Fed).
    - Inflation rates.
    - Labor market data (e.g., unemployment rate).
    - GDP growth.
- Alternative data:
    - News (e.g., press releases, market reports).
    - Social media data (e.g., X posts reflecting market sentiment).
    - Sentiment data (e.g., Fear and Greed Index).
- Technical data:
    - Technical indicators (e.g., moving averages, RSI, MACD).
    - Trading volume and liquidity metrics.

**Data Sources:**
- Financial data APIs: Yahoo Finance, Alpha Vantage, Quandl, Bloomberg API.
- News APIs: Reuters, NewsAPI.
- Social media: X API (for posts, hashtags, and trends).
- Macroeconomic data: OECD, World Bank, FRED (Federal Reserve Economic Data).
- Company data: SEC EDGAR database (for U.S. companies), company websites.

**Data Attributes:**
- Market data: Date, stock price (Open, Close, High, Low), volume.
- Company data: Revenue, profit, debt, equity, industry.
- Macroeconomic data: Interest rate, inflation, GDP.
- News/Social media: Text, date.

**Data Availability:**
- Market data: Widely available through APIs like Yahoo Finance or Alpha Vantage, sometimes paid for real-time data.
- Company data: Available via SEC EDGAR or financial data APIs, but with limitations for smaller companies or non-U.S. firms.
- News/Social media: Available through APIs (e.g., X API), but associated with costs and rate limits.
- Macroeconomic data: Free through public sources (e.g., FRED), but often delayed.

**Feasibility:**
- Technical feasibility: Data can be processed with Python libraries (e.g., Pandas, NumPy, Scikit-learn) and cloud computing (e.g., AWS).
- Data integration: The challenge lies in integrating structured (e.g., stock prices) and unstructured data (e.g., news).
- Modeling: Machine learning (e.g., time series models like ARIMA, LSTM) and NLP are suitable for achieving the business objectives.

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf
import os
import time
import plotly.graph_objects as go
import matplotlib.pyplot as plt


from datetime import datetime, timedelta
from IPython.display import clear_output
from scipy.signal import argrelextrema

#### **get_Ticker**

In [2]:
def get_new_ticker(file_existing_ticker, symbol, file_new_ticker, symbol_name):
    ticker = pd.read_csv(file_existing_ticker)
    lst_ticker = ticker[symbol].tolist()
    print(len(lst_ticker))

    new_tickers = pd.read_csv(file_new_ticker, encoding_errors="replace")
    lst = new_tickers[symbol_name].tolist()

    lst_ticker = lst_ticker + lst
    lst_ticker = pd.unique(lst_ticker).tolist()
    print(len(lst_ticker))

    df_symbols = pd.DataFrame({"ticker": lst_ticker})

    if os.path.exists(f'ticker_new.csv'):
        os.remove(f'ticker_new.csv')

    df_symbols.to_csv(f'ticker_new.csv', index=False)

In [3]:
# Activate if necessary
# file_existing_ticker = 'ticker_new.csv'
# symbol = 'ticker'

# file_new_ticker = 'Taiwan.csv'
# symbol_name = 'Ticker'

# get_new_ticker(file_existing_ticker, symbol, file_new_ticker, symbol_name)

#### **yfinance**

In [4]:
testing = True

In [32]:
def get_symbols(var, file_data):
    try:
        df = pd.read_csv(f'{var}_new.csv')
        if var not in df.columns:
            raise ValueError(f'CSV file must contain a {var} column')
        lst = df[var].tolist()
    except Exception as e:
        print(f"Error reading {var}.csv: {str(e)}")
        exit()


    if os.path.exists(file_data):
        existing = pd.read_csv(file_data)
        # Group existing_data by var
        dfs_dict = {symbol: group.sort_values('Date') for symbol, group in existing.groupby('Symbol')}
    else:
        existing = pd.DataFrame()
        dfs_dict = {}
    return df, existing, dfs_dict, lst

In [33]:
if testing:
    tech_list = ['AAPL', 'MSFT', 'GOOGL']
    indices = ['^GSPC', '^DJI', '^IXIC']
    cryptos = ['BTC-USD', 'ETH-USD', 'XRP-USD']

In [34]:
def get_info(info):
    static_data = {
        'country': info.get('country'),
        'industry': info.get('industry'),
        'industryKey': info.get('industryKey'),
        'sector': info.get('sector'),
        'sectorKey': info.get('sectorKey'),
        'auditRisk': info.get('auditRisk'),
        'boardRisk': info.get('boardRisk'),
        'compensationRisk': info.get('compensationRisk'),
        'shareHolderRightsRisk': info.get('shareHolderRightsRisk'),
        'overallRisk': info.get('overallRisk'),
        'dividendRate': info.get('dividendRate'),
        'dividendYield': info.get('dividendYield'),
        'payoutRatio': info.get('payoutRatio'),
        'trailingPE': info.get('trailingPE'),
        'forwardPE': info.get('forwardPE'),
        'priceToSalesTrailing12Months': info.get('priceToSalesTrailing12Months'),
        'priceToBook': info.get('priceToBook'),
        'enterpriseToRevenue': info.get('enterpriseToRevenue'),
        'enterpriseToEbitda': info.get('enterpriseToEbitda'),
        'beta': info.get('beta'),
        'bookValue': info.get('bookValue'),
        'debtToEquity': info.get('debtToEquity'),
        'quickRatio': info.get('quickRatio'),
        'currentRatio': info.get('currentRatio'),
        'returnOnAssets': info.get('returnOnAssets'),
        'returnOnEquity': info.get('returnOnEquity'),
        'grossMargins': info.get('grossMargins'),
        'ebitdaMargins': info.get('ebitdaMargins'),
        'operatingMargins': info.get('operatingMargins'),
        'totalRevenue': info.get('totalRevenue'),
        'netIncomeToCommon': info.get('netIncomeToCommon'),
        'ebitda': info.get('ebitda'),
        'totalDebt': info.get('totalDebt'),
        'totalCash': info.get('totalCash'),
        'totalCashPerShare': info.get('totalCashPerShare'),
        'freeCashflow': info.get('freeCashflow'),
        'operatingCashflow': info.get('operatingCashflow'),
        'earningsGrowth': info.get('earningsGrowth'),
        'revenueGrowth': info.get('revenueGrowth'),
        'grossProfits': info.get('grossProfits'),
        'targetHighPrice': info.get('targetHighPrice'),
        'targetLowPrice': info.get('targetLowPrice'),
        'targetMeanPrice': info.get('targetMeanPrice'),
        'targetMedianPrice': info.get('targetMedianPrice'),
        'recommendationMean': info.get('recommendationMean'),
        'recommendationKey': info.get('recommendationKey'),
        'numberOfAnalystOpinions': info.get('numberOfAnalystOpinions'),
        'averageAnalystRating': info.get('averageAnalystRating'),
        'trailingEps': info.get('trailingEps'),
        'forwardEps': info.get('forwardEps'),
        'priceEpsCurrentYear': info.get('priceEpsCurrentYear'),
        'earningsQuarterlyGrowth': info.get('earningsQuarterlyGrowth'),
        'lastSplitFactor': info.get('lastSplitFactor'),
        'lastSplitDate': info.get('lastSplitDate'),
        'sharesOutstanding': info.get('sharesOutstanding'),
        'floatShares': info.get('floatShares'),
        'heldPercentInsiders': info.get('heldPercentInsiders'),
        'heldPercentInstitutions': info.get('heldPercentInstitutions')
    }
    return static_data

In [35]:
global data_dict, tickers_to_delete_lst, name_list

In [36]:
def get_data(lst, file, dfs_dict, existing_data):
    global data_dict, tickers_to_delete_lst, name_list 
    data_dict = {}
    name_list = []
    tickers_to_delete_lst = []
    end_date = datetime.today().strftime('%Y-%m-%d')

    max_retries = 10
    initial_retry_delay = 10

    # Retrieve data for each symbol
    for symbol in lst:
        print(symbol)
        retries = 0
        success = False
        while retries < max_retries and not success:
            try:
                # Determine the start date
                if symbol in dfs_dict and not dfs_dict[symbol].empty:
                    start_date = dfs_dict[symbol]['Date'].iloc[-1].strftime('%Y-%m-%d')
                else:
                    start_date = '2010-01-02'
                    if testing:
                        start_date = '2020-01-02'
                
                if end_date == start_date:
                    success = True
                    continue
                
                # Retrieve data from Yahoo Finance
                ticker_yf = yf.Ticker(symbol)
                info = ticker_yf.info
                name = info.get('longName', symbol)  # Fallback to symbol if no name is available
                name_list.append({'symbol': symbol, 'name': name})
                
                df = yf.download(symbol, start=start_date, end=end_date, auto_adjust=False, prepost=True, actions=True)
                df.columns = df.columns.get_level_values(0)
                
                if not df.empty:
                    data_dict[symbol] = df
                else:
                    print(f"No data found for {symbol}")

                success = True
                clear_output(wait=True)
            
            except Exception as e:
                if "Too Many Requests" in str(e) or "Rate limited" in str(e):
                    # Exponential backoff: Double wait time with each retry
                    wait_time = initial_retry_delay * (2 ** retries)
                    print(f"Rate limit error for {symbol}. Attempt {retries + 1}/{max_retries} in {wait_time} seconds...")
                    time.sleep(wait_time)
                    retries += 1
                else:
                    print(f"Error retrieving data for {symbol}: {str(e)}")
                    tickers_to_delete_lst.append(symbol)
                    break  # Other errors: No retry
                
        if not success and retries >= max_retries:
            print(f"Maximum retries reached for {symbol}. Skipping...")
            tickers_to_delete_lst.append(symbol)

    # Merge data into a DataFrame
    df_list = []
    for symbol, data in data_dict.items():
        data['Symbol'] = symbol
        name = next(item['name'] for item in name_list if item['symbol'] == symbol)
        data['Name'] = name
        df_list.append(data)
    
    # Combine all new data
    if df_list:
        new_data = pd.concat(df_list, axis=0)
    else:
        print("No new data available to merge")
        new_data = pd.DataFrame()
    
    # Combine with existing data
    if not new_data.empty:
        combined_data = pd.concat([existing_data, new_data])
        combined_data = combined_data.sort_values(by=['Symbol', 'Date'])
    else:
        combined_data = existing_data
    
    # Save the data
    if not combined_data.empty:
        if os.path.exists(file):
            os.remove(file)
        combined_data.to_csv(file, index=True)
        print(f"Data saved to {file}")
    
    return combined_data, tickers_to_delete_lst

TO-DO: Modify the code to keep querying until a 'too many requests' response is received. Then save the last variable and restart after a time X.

In [37]:
def delete_tickers(lst, file, df_symbols):
    print(df_symbols)
    try:
        df_symbols = df_symbols[~df_symbols[file].isin(lst)]

        if os.path.exists(f'{file}_new.csv'):
            os.remove(f'{file}_new.csv')

        df_symbols.to_csv(f'{file}_new.csv', index=False)
    except Exception as e:
        print(f"Error processing the CSV file: {str(e)}")

In [38]:
tech_list

['AAPL', 'MSFT', 'GOOGL']

In [39]:
if testing:
    add = '_testing'
else:
    add = ''

file_stocks= f'stock_data{add}.csv'
file_indices= f'indices_data{add}.csv'
file_cryptos= f'cryptos_data{add}.csv'
file_stocks_symbol= 'ticker'
file_indices_symbol= 'indices'
file_cryptos_symbol= 'cryptos'


if not testing:
    df_tickers_symbol, existing_data_stocks, dfs_dict_stocks, tech_list =  get_symbols(file_stocks_symbol, file_stocks)
    df_indices_symbol, existing_data_indices, dfs_dict_indices, indices =  get_symbols(file_indices_symbol, file_indices)
    df_cryptos_symbol, existing_data_cryptos, dfs_dict_cryptos, cryptos =  get_symbols(file_cryptos_symbol, file_cryptos)


df_stocks, delt = get_data(tech_list, file_stocks, dfs_dict_stocks, existing_data_stocks)
delete_tickers(delt, file_stocks_symbol, df_tickers_symbol)

df_indices, delt = get_data(indices, file_indices, dfs_dict_indices, existing_data_indices)
delete_tickers(delt, file_indices_symbol, df_indices_symbol)

df_cryptos, delt = get_data(cryptos, file_cryptos, dfs_dict_cryptos, existing_data_cryptos)
delete_tickers(delt, file_cryptos_symbol, df_cryptos_symbol)

Data saved to cryptos_data_testing.csv
Empty DataFrame
Columns: [cryptos]
Index: []


## **3. Data Preparation**

In [124]:
if testing:
    df = pd.read_csv('stock_data_testing.csv', index_col='Date', parse_dates=True)
else:
    df = pd.read_csv('stock_data.csv', index_col='Date', parse_dates=True)
    df = df.drop(columns='Capital Gains')

In [125]:
def add_candle_patterns(data):
    data = data.copy()

    if len(data) < 2:
        # With only one entry, no patterns can be identified
        data['Doji'] = False
        data['Hammer'] = False
        data['Engulfing Bullish'] = False
        data['Engulfing Bearish'] = False
        return data

    # Doji: Open ≈ Close
    data['Doji'] = (abs(data['Open'] - data['Close']) <= 0.001 * data['Adj Close'])

    # Hammer: Small candle with a long lower shadow
    body = abs(data['Close'] - data['Open'])
    lower_shadow = data[['Open', 'Close']].min(axis=1) - data['Low']
    upper_shadow = data['High'] - data[['Open', 'Close']].max(axis=1)
    data['Hammer'] = (lower_shadow > 2 * body) & (upper_shadow < body)

    # Engulfing Bullish: Current candle larger and engulfs previous day (bullish)
    prev_open = data['Open'].shift(1)
    prev_close = data['Close'].shift(1)
    data['Engulfing Bullish'] = (
        (prev_close < prev_open) &
        (data['Close'] > data['Open']) &
        (data['Open'] < prev_close) &
        (data['Close'] > prev_open)
    )

    # Engulfing Bearish: Opposite
    data['Engulfing Bearish'] = (
        (prev_close > prev_open) &
        (data['Close'] < data['Open']) &
        (data['Open'] > prev_close) &
        (data['Close'] < prev_open)
    )

    return data

In [126]:
def add_technical_indicators(data):
    data = data.copy()

    # --- Basic Returns & Trends ---
    data['Daily Return'] = data['Adj Close'].pct_change()
    data['Cumulative Return'] = (1 + data['Daily Return']).cumprod()
    data['50 Day MA'] = data['Adj Close'].rolling(window=50).mean()
    data['200 Day MA'] = data['Adj Close'].rolling(window=200).mean()
    data['20 Day MA'] = data['Adj Close'].rolling(window=20).mean()
    data['20 Day STD'] = data['Adj Close'].rolling(window=20).std()

    # --- Bollinger Bands ---
    data['Upper Band'] = data['20 Day MA'] + (data['20 Day STD'] * 2)
    data['Lower Band'] = data['20 Day MA'] - (data['20 Day STD'] * 2)

    # --- RSI ---
    delta = data['Adj Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    data['RSI'] = 100 - (100 / (1 + rs))

    # --- MACD ---
    ema12 = data['Adj Close'].ewm(span=12, adjust=False).mean()
    ema26 = data['Adj Close'].ewm(span=26, adjust=False).mean()
    data['MACD'] = ema12 - ema26
    data['Signal Line'] = data['MACD'].ewm(span=9, adjust=False).mean()

    # --- Momentum & ROC ---
    data['Momentum_10'] = data['Adj Close'] - data['Adj Close'].shift(10)
    data['ROC_10'] = data['Adj Close'].pct_change(periods=10)

    # --- Stochastic Oscillator ---
    low_14 = data['Low'].rolling(window=14).min()
    high_14 = data['High'].rolling(window=14).max()
    data['%K'] = 100 * ((data['Adj Close'] - low_14) / (high_14 - low_14))
    data['%D'] = data['%K'].rolling(window=3).mean()

    # --- Williams %R ---
    data['Williams_%R'] = -100 * ((high_14 - data['Adj Close']) / (high_14 - low_14))

    # --- ATR ---
    tr1 = data['High'] - data['Low']
    tr2 = abs(data['High'] - data['Adj Close'].shift())
    tr3 = abs(data['Low'] - data['Adj Close'].shift())
    tr = pd.concat([tr1, tr2, tr3], axis=1).max(axis=1)
    data['ATR_14'] = tr.rolling(window=14).mean()

    # --- OBV ---
    data['OBV'] = (np.sign(data['Daily Return']) * data['Volume']).fillna(0).cumsum()

    # --- Relative Volume ---
    data['Relative Volume'] = data['Volume'] / data['Volume'].rolling(20).mean()

    # --- Dividends & Splits ---
    data['Cumulative Dividends'] = data['Dividends'].cumsum()
    data['Has Dividend'] = data['Dividends'] > 0
    data['Had Split'] = data['Stock Splits'] > 0

    # --- Trend Indicator ---
    data['MA_Trend'] = np.where(data['50 Day MA'] > data['200 Day MA'], 1, 0)

    return data

In [127]:
def add_trend_label(data, horizon=5, threshold=0.02):
    """
    horizon: How many days into the future to look?
    threshold: Threshold for price change (e.g., 2%)
    """
    data = data.copy()
    future_return = data['Adj Close'].shift(-horizon) / data['Adj Close'] - 1
    data['Trend'] = 0
    data.loc[future_return > threshold, 'Trend'] = 1  # Uptrend
    data.loc[future_return < -threshold, 'Trend'] = -1  # Downtrend
    return data

In [128]:
def plot_candlestick_with_bollinger(data, symbol=""):
    fig = go.Figure(data=[
        go.Candlestick(x=data.index,
                       open=data['Open'], high=data['High'],
                       low=data['Low'], close=data['Close'],
                       name='Candles'),
        go.Scatter(x=data.index, y=data['Upper Band'], name='Upper Band', line=dict(color='blue', width=1)),
        go.Scatter(x=data.index, y=data['Lower Band'], name='Lower Band', line=dict(color='blue', width=1)),
        go.Scatter(x=data.index, y=data['20 Day MA'], name='20MA', line=dict(color='orange', width=1)),
    ])
    fig.update_layout(title=f'{symbol} - Candlestick with Bollinger Bands',
                      xaxis_title='Date', yaxis_title='Price')
    fig.show()

def plot_rsi(data):
    plt.figure(figsize=(10,4))
    plt.plot(data['RSI'], label='RSI', color='purple')
    plt.axhline(70, color='red', linestyle='--')
    plt.axhline(30, color='green', linestyle='--')
    plt.title('Relative Strength Index (RSI)')
    plt.legend()
    plt.grid()
    plt.show()


In [129]:
def add_heikin_ashi(data):
    ha = data.copy()
    ha['HA_Close'] = (data['Open'] + data['High'] + data['Low'] + data['Close']) / 4
    ha['HA_Open'] = (data['Open'] + data['Close']) / 2
    for i in range(1, len(ha)):
        ha.iloc[i, ha.columns.get_loc('HA_Open')] = (ha.iloc[i-1]['HA_Open'] + ha.iloc[i-1]['HA_Close']) / 2
    ha['HA_High'] = ha[['High', 'HA_Open', 'HA_Close']].max(axis=1)
    ha['HA_Low'] = ha[['Low', 'HA_Open', 'HA_Close']].min(axis=1)
    return ha

def add_adx(data, period=14):
    delta_high = data['High'].diff()
    delta_low = data['Low'].diff()
    
    plus_dm = np.where((delta_high > delta_low) & (delta_high > 0), delta_high, 0)
    minus_dm = np.where((delta_low > delta_high) & (delta_low > 0), delta_low, 0)

    tr1 = data['High'] - data['Low']
    tr2 = abs(data['High'] - data['Close'].shift())
    tr3 = abs(data['Low'] - data['Close'].shift())
    tr = pd.concat([tr1, tr2, tr3], axis=1).max(axis=1)

    atr = pd.Series(tr).rolling(window=14).mean()
    plus_di = 100 * pd.Series(plus_dm).rolling(window=14).mean() / atr
    minus_di = 100 * pd.Series(minus_dm).rolling(window=14).mean() / atr
    dx = (abs(plus_di - minus_di) / (plus_di + minus_di)) * 100
    adx = dx.rolling(window=14).mean()

    data['ADX'] = adx
    return data

def add_ichimoku(data):
    high_9 = data['High'].rolling(window=9).max()
    low_9 = data['Low'].rolling(window=9).min()
    data['Tenkan-sen'] = (high_9 + low_9) / 2

    high_26 = data['High'].rolling(window=26).max()
    low_26 = data['Low'].rolling(window=26).min()
    data['Kijun-sen'] = (high_26 + low_26) / 2

    data['Senkou Span A'] = ((data['Tenkan-sen'] + data['Kijun-sen']) / 2).shift(26)

    high_52 = data['High'].rolling(window=52).max()
    low_52 = data['Low'].rolling(window=52).min()
    data['Senkou Span B'] = ((high_52 + low_52) / 2).shift(26)

    data['Chikou Span'] = data['Close'].shift(-26)
    return data

def add_zigzag(data, window=5):
    local_max = argrelextrema(data['Close'].values, np.greater_equal, order=window)[0]
    local_min = argrelextrema(data['Close'].values, np.less_equal, order=window)[0]
    data['ZigZag'] = np.nan
    data.iloc[local_max, data.columns.get_loc('ZigZag')] = data.iloc[local_max, data.columns.get_loc('Close')]
    data.iloc[local_min, data.columns.get_loc('ZigZag')] = data.iloc[local_min, data.columns.get_loc('Close')]
    # data.iloc[local_max, 'ZigZag'] = data.iloc[local_max, 'Close']
    # data.iloc[local_min, 'ZigZag'] = data.iloc[local_min, 'Close']
    return data


In [130]:
def preprocess_stock_data(data):
    data = add_technical_indicators(data)
    data = add_candle_patterns(data)
    data = add_trend_label(data)
    data = add_heikin_ashi(data)
    data = add_adx(data, period=14)
    data = add_ichimoku(data)
    data = add_zigzag(data, window=5)
    #plot_candlestick_with_bollinger(data, 'APPL')
    data = data.dropna(subset=['200 Day MA'])
    data = data.dropna(axis=1)
    return data

In [131]:
df_prepro = df.groupby('Symbol', group_keys=False).apply(preprocess_stock_data, include_groups=False)

## Possible Scaling

🎯 Goal: Machine Learning / Feature Engineering
✅ 1. Min-Max Scaling (0–1 Normalization)
For algorithms like KNN, neural networks, or SVMs:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df[['Open', 'High', 'Low', 'Close', 'Volume']])
```
Advantages: Preserves structure, bounded input.
Caution: Outliers can significantly distort scaling.

✅ 2. Z-Standardization (StandardScaler)
For models that rely on normal distribution (e.g., linear models):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[['Open', 'High', 'Low', 'Close', 'Volume']])
```
Advantages: Centers data at 0 with variance 1 – important for gradient-based methods.

✅ 3. Log Scaling
Helps to tame extreme differences (e.g., volume or highly volatile stock prices):

```python
import numpy as np
df['Log_Close'] = np.log1p(df['Close'])  # log(1 + x) to allow for 0
```
Typical for: Volume, market capitalization, price trends over decades.

✅ 4. Relative Price Changes (Returns)
Instead of absolute prices, percentage changes are considered:

```python
df['Return'] = df['Close'].pct_change()
```
Or cumulative:

```python
df['Cumulative Return'] = (1 + df['Return']).cumprod()
```
Particularly useful for: Time series models, portfolio comparisons, ML with temporal context.

✅ 5. Normalization to First Value
To make prices comparable (e.g., comparing multiple stocks synchronously):

```python
df['Norm_Close'] = df['Close'] / df['Close'].iloc[0]
```
All prices start at 1. Great for visualization & comparison!

🎯 Goal: Deep Learning / RNN / LSTM / TimeSeriesForecasting
Perform scaling separately for each stock (symbol)!

Fit scaling only on training data, then apply to validation/test data.

Caution with leakage! Never use future values when scaling.

## Available Signals
- Adj Close
- Close
- Dividends
- High
- Low
- Open
- Stock Splits
- Volume
- Name
- Daily Return
- Cumulative Return
- 50 Day MA
- 200 Day MA
- 20 Day MA
- 20 Day STD
- Upper Band
- Lower Band
- RSI
- MACD
- Signal Line
- Momentum_10
- ROC_10
- %K
- %D
- Williams_%R
- ATR_14
- OBV
- Relative Volume
- Cumulative Dividends
- Has Dividend
- Had Split
- MA_Trend
- Doji
- Hammer
- Engulfing Bullish
- Engulfing Bearish
- Trend
- HA_Close
- HA_Open
- HA_High
- HA_Low
- Tenkan-sen
- Kijun-sen
- Senkou Span A
- Senkou Span B

In [132]:
df_prepro.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3402 entries, 2020-10-15 to 2025-04-22
Data columns (total 45 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Adj Close             3402 non-null   float64
 1   Close                 3402 non-null   float64
 2   Dividends             3402 non-null   float64
 3   High                  3402 non-null   float64
 4   Low                   3402 non-null   float64
 5   Open                  3402 non-null   float64
 6   Stock Splits          3402 non-null   float64
 7   Volume                3402 non-null   int64  
 8   Name                  3402 non-null   object 
 9   Daily Return          3402 non-null   float64
 10  Cumulative Return     3402 non-null   float64
 11  50 Day MA             3402 non-null   float64
 12  200 Day MA            3402 non-null   float64
 13  20 Day MA             3402 non-null   float64
 14  20 Day STD            3402 non-null   float64
 15  Upp

In [50]:
df_prepro.columns

Index(['Adj Close', 'Close', 'Dividends', 'High', 'Low', 'Open',
       'Stock Splits', 'Volume', 'Name', 'Daily Return', 'Cumulative Return',
       '50 Day MA', '200 Day MA', '20 Day MA', '20 Day STD', 'Upper Band',
       'Lower Band', 'RSI', 'MACD', 'Signal Line', 'Momentum_10', 'ROC_10',
       '%K', '%D', 'Williams_%R', 'ATR_14', 'OBV', 'Relative Volume',
       'Cumulative Dividends', 'Has Dividend', 'Had Split', 'MA_Trend', 'Doji',
       'Hammer', 'Engulfing Bullish', 'Engulfing Bearish', 'Trend', 'HA_Close',
       'HA_Open', 'HA_High', 'HA_Low', 'ADX', 'Tenkan-sen', 'Kijun-sen',
       'Senkou Span A', 'Senkou Span B', 'Chikou Span', 'ZigZag'],
      dtype='object')

In [None]:
# Continue here. Consider whether additional indicators are necessary. Then save the current data and visually select the best indicators.
# Create an indicator that considers each sector, adds each stock in equal proportion, and determines the sector's movement.

4. Exploratory Data Analysis (EDA)
Goals of EDA:
- Understand the distribution and patterns in the data.
- Identify correlations and potential predictors for stock prices.
- Detect outliers or anomalies.
Methods:
Descriptive Statistics:
- Mean, median, standard deviation of stock prices and volume.
- Frequency of news or X posts per day.
Visualization:
- Time series plots of stock prices.
- Correlation matrix between stock prices, macroeconomic data, and sentiment.
- Boxplots to identify outliers (e.g., extreme price movements).
Correlations:
- Check whether stock prices correlate with macroeconomic indicators (e.g., interest rates) or sentiment data.
- Analyze the relationship between trading volume and price movements.

## **Notes**
####
- Create multiple dashboards:
    1. Dashboard: Price movement prediction
    2. Dashboard: Identify potential trends based on fundamental analysis, news, etc., and identify stocks as promising based on these data --> do not consider price data.