# 1. Project Title: Market Fear Regime Identification

#### Purpose: Data mining course final project
#### Description:
This project aims to identify periods of market fear using financial indicators and machine learning. It involves collecting historical financial data, preprocessing it, and applying clustering algorithms to classify market regimes by fear level.

#### Tools and Technologies:
- Programming Language: Python
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
- Data Sources: Yahoo Finance, Kaggle datasets, CoinGecko API
- Environment: Jupyter Notebook
- Visualization: Matplotlib, Seaborn
- Machine Learning Algorithms: K-Means Clustering
- Version Control: Git and GitHub
- Documentation: Markdown

#### Steps Involved:
1. **Data Collection**: Gather historical financial data, including coin prices.
2. **Data Preprocessing**: Clean and preprocess the data to handle missing values and normalize features.
3. **Feature Selection**: Identify relevant features that indicate market fear (TODO: specify the features).
4. **Clustering**: Apply K-Means clustering to classify market regimes based on selected features.
5. **Visualization**: Visualize the clustering results to interpret different market fear regimes.
6. **Analysis**: Analyze the identified regimes and their characteristics.
7. **Documentation**: Document the entire process and draft a report summarizing findings.
8. **Presentation**: Prepare a presentation to showcase the project results on 8th June 2026.
9. **Submission**: Submit the final report and code repository by 6th June 2026.
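
Steps 2 to 4 can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the two features (`ret_7d`, `vol_7d`), the choice of `n_clusters=3`, and the helper name `cluster_regimes` are all assumptions, since the feature set is still marked TODO in step 3.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_regimes(features: pd.DataFrame, n_clusters: int = 3, seed: int = 0) -> pd.Series:
    """Standardize the features and label each row with a K-Means cluster id."""
    clean = features.dropna()
    scaled = StandardScaler().fit_transform(clean)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(scaled)
    return pd.Series(labels, index=clean.index, name="regime")


# Synthetic price path, used only so the example runs offline.
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=200, freq="D")
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.03, size=200))), index=dates)

log_ret = np.log(price).diff()
features = pd.DataFrame({
    "ret_7d": log_ret.rolling(7).sum(),  # hypothetical feature: 7-day log return
    "vol_7d": log_ret.rolling(7).std(),  # hypothetical feature: 7-day volatility
})

regimes = cluster_regimes(features)
print(regimes.value_counts().sort_index())
```

Standardizing before K-Means matters because the algorithm is distance-based, so a feature on a larger scale would otherwise dominate the clusters. Cluster ids are arbitrary labels; deciding which one corresponds to "fear" still requires inspecting the cluster centers.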

#### Timeline:
- Week 1: Data Collection and Preprocessing
- Week 2: Feature Selection and Clustering
- Week 3: Visualization and Analysis
- Week 4: Documentation and Presentation Preparation
  
  

In [1]:
import yfinance as yf
import pandas as pd
import requests
import os

## Data Selection: Coins selected from CoinGecko API

In [2]:
# Be careful: yfinance needs the -USD suffix at the end of each coin ticker
core_coins = [
    'BTC-USD', 'ETH-USD', 'BNB-USD', 'SOL-USD', 'XRP-USD', 
    'ADA-USD', 'DOGE-USD', 'DOT-USD', 'LTC-USD', 'TRX-USD'
]

In [3]:
# Depending on need, expand the list later
supplementary_coins = [
    'AVAX-USD', 'MATIC-USD', 'LINK-USD', 'ATOM-USD', 'UNI-USD',
    'ETC-USD', 'XLM-USD', 'ALGO-USD', 'FIL-USD', 'APT-USD' 
]

In [4]:

all_tickers = core_coins + supplementary_coins
print(f"all coins assets ({len(all_tickers)} ): {all_tickers}")

all coins assets (20 ): ['BTC-USD', 'ETH-USD', 'BNB-USD', 'SOL-USD', 'XRP-USD', 'ADA-USD', 'DOGE-USD', 'DOT-USD', 'LTC-USD', 'TRX-USD', 'AVAX-USD', 'MATIC-USD', 'LINK-USD', 'ATOM-USD', 'UNI-USD', 'ETC-USD', 'XLM-USD', 'ALGO-USD', 'FIL-USD', 'APT-USD']


In [5]:
# time range
START_DATE = "2018-01-01"
END_DATE = pd.Timestamp.today().strftime('%Y-%m-%d') # to today

In [None]:
# data directories

RAW_DATA_PATH = "../data/row"
CLEAN_DATA_PATH = "../data/processed"

# print("cwd:", os.getcwd())
# print("RAW_DATA_PATH (abs):", os.path.abspath(RAW_DATA_PATH))
# print("RAW_DATA_PATH exists:", os.path.exists(RAW_DATA_PATH))
print("files in RAW_DATA_PATH:", os.listdir(RAW_DATA_PATH))



files in RAW_DATA_PATH: ['price_matrix.csv']
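
`os.listdir` raises `FileNotFoundError` when the directory is missing, for example on a fresh clone. One way to guard against that, sketched here under the assumption that creating the folders automatically is acceptable, is to call `os.makedirs(..., exist_ok=True)` first. The helper name `ensure_dirs` is hypothetical, and the demo runs in a temporary folder rather than in the project's real data directories.

```python
import os
import tempfile


def ensure_dirs(*paths: str) -> None:
    """Create each directory (and its parents) if it does not exist yet."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


# Demonstrated inside a temporary folder so nothing in the real project is touched.
with tempfile.TemporaryDirectory() as root:
    raw_dir = os.path.join(root, "data", "row")
    clean_dir = os.path.join(root, "data", "processed")
    ensure_dirs(raw_dir, clean_dir)
    created = sorted(os.listdir(os.path.join(root, "data")))
    print(created)
```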


## Fetching Raw Data and primary cleaning

Parameters of `yf.download`, for reference:

```python
def download(tickers, start=None, end=None, actions=False, threads=True,
             ignore_tz=None, group_by='column', auto_adjust=None, back_adjust=False,
             repair=False, keepna=False, progress=True, period=None, interval="1d",
             prepost=False, proxy=_SENTINEL_, rounding=False, timeout=10, session=None,
             multi_level_index=True) -> Union[_pd.DataFrame, None]:
    """
    Download yahoo tickers
    :Parameters:
        tickers : str, list
            List of tickers to download
        period : str
            Valid periods: 1d,5d,1mo,3mo,6mo,1y,2y,5y,10y,ytd,max
            Default: 1mo
            Either Use period parameter or use start and end
        interval : str
            Valid intervals: 1m,2m,5m,15m,30m,60m,90m,1h,1d,5d,1wk,1mo,3mo
            Intraday data cannot extend last 60 days
        start: str
            Download start date string (YYYY-MM-DD) or _datetime, inclusive.
            Default is 99 years ago
            E.g. for start="2020-01-01", the first data point will be on "2020-01-01"
        end: str
            Download end date string (YYYY-MM-DD) or _datetime, exclusive.
            Default is now
            E.g. for end="2023-01-01", the last data point will be on "2022-12-31"
        group_by : str
            Group by 'ticker' or 'column' (default)
        prepost : bool
            Include Pre and Post market data in results?
            Default is False
        auto_adjust: bool
            Adjust all OHLC automatically? Default is True
        repair: bool
            Detect currency unit 100x mixups and attempt repair
            Default is False
        keepna: bool
            Keep NaN rows returned by Yahoo?
            Default is False
        actions: bool
            Download dividend + stock splits data. Default is False
        threads: bool / int
            How many threads to use for mass downloading. Default is True
        ignore_tz: bool
            When combining from different timezones, ignore that part of datetime.
            Default depends on interval. Intraday = False. Day+ = True.
        rounding: bool
            Optional. Round values to 2 decimal places?
        timeout: None or float
            If not None stops waiting for a response after given number of
            seconds. (Can also be a fraction of a second e.g. 0.01)
        session: None or Session
            Optional. Pass your own session object to be used for all requests
        multi_level_index: bool
            Optional. Always return a MultiIndex DataFrame? Default is True
    """
    
```
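
Because the docstring describes `end` as exclusive, passing today's date as `END_DATE` yields data only up to the previous day. If the last day is meant to be included, the end date can be shifted by one day. The helper below is a small sketch; `inclusive_end` is a hypothetical name, and whether a still-incomplete bar for the current day is wanted is a separate decision.

```python
import pandas as pd


def inclusive_end(last_day: str) -> str:
    """Return the `end` string that makes `last_day` the final date in the range."""
    return (pd.Timestamp(last_day) + pd.Timedelta(days=1)).strftime("%Y-%m-%d")


# Mirrors the docstring example: end="2023-01-01" gives a last data point on 2022-12-31.
print(inclusive_end("2022-12-31"))
```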

In [None]:
def fetch_and_clean_crypto_data(tickers, start, end):
    raw = yf.download(tickers, start=start, end=end, progress=False)
    # 1) Series -> DataFrame
    if isinstance(raw, pd.Series):
        df = raw.to_frame(name=(tickers[0] if isinstance(tickers, (list, tuple)) else str(tickers)))
    else:
        cols = raw.columns
        # 2) MultiIndex
        if isinstance(cols, pd.MultiIndex):
            # try to find 'Adj Close' or 'Close'
            if 'Adj Close' in cols.get_level_values(0):
                df = raw['Adj Close']
            elif 'Close' in cols.get_level_values(0):
                df = raw['Close']
            elif 'Adj Close' in cols.get_level_values(1):
                df = raw.xs('Adj Close', axis=1, level=1)
            elif 'Close' in cols.get_level_values(1):
                df = raw.xs('Close', axis=1, level=1)
            else:
                df = raw.select_dtypes(include='number')
        else:
            # 3) single level columns
            if 'Adj Close' in cols:
                df = raw['Adj Close'] if raw['Adj Close'].ndim == 2 else raw[['Adj Close']]
            elif set(cols) & set(tickers):
                # columns are already tickers, i.e. the frame is already a price matrix
                df = raw.copy()
            elif 'Close' in cols:
                df = raw['Close'] if raw['Close'].ndim == 2 else raw[['Close']]
            else:
                df = raw.select_dtypes(include='number')

    # unify column names by removing the '-USD' suffix
    df.columns = [c.replace('-USD', '') for c in df.columns]
    # use a normalized (midnight) DatetimeIndex named 'date'
    df.index = pd.to_datetime(df.index).normalize()
    df.index.name = 'date'
    # forward-fill gaps; NaNs before a coin's first quote have nothing to fill from and stay NaN
    df_clean = df.ffill()
    return df_clean


price_matrix = fetch_and_clean_crypto_data(all_tickers, START_DATE, END_DATE)
print(price_matrix.head())

price_file_path = os.path.join(RAW_DATA_PATH, "price_matrix.csv")
price_matrix.to_csv(price_file_path)
print(f"file saved: {price_file_path}")




                 ADA  ALGO  APT  ATOM  AVAX       BNB           BTC      DOGE  \
date                                                                            
2018-01-01  0.728657   NaN  NaN   NaN   NaN   8.41461  13657.200195  0.008909   
2018-01-02  0.782587   NaN  NaN   NaN   NaN   8.83777  14982.099609  0.009145   
2018-01-03  1.079660   NaN  NaN   NaN   NaN   9.53588  15201.000000  0.009320   
2018-01-04  1.114120   NaN  NaN   NaN   NaN   9.21399  15599.200195  0.009644   
2018-01-05  0.999559   NaN  NaN   NaN   NaN  14.91720  17429.500000  0.012167   

            DOT        ETC         ETH        FIL      LINK         LTC  \
date                                                                      
2018-01-01  NaN  34.167900  772.640991  19.480200  0.733563  229.033005   
2018-01-02  NaN  34.917099  884.443970  20.110600  0.673712  255.684006   
2018-01-03  NaN  34.863400  962.719971  19.827499  0.681167  245.367996   
2018-01-04  NaN  36.318001  980.921997  20.417801  0.9843
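
The MultiIndex branch of `fetch_and_clean_crypto_data` can be exercised without a network call. The toy frame below imitates a two-level column layout with the field on level 0 and the ticker on level 1; the numbers are made up, and the layout is an assumption about what `yf.download` returns for several tickers.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=3, freq="D")
columns = pd.MultiIndex.from_product([["Close", "Volume"], ["BTC-USD", "ETH-USD"]])
raw = pd.DataFrame(
    [[100.0, 10.0, 5.0, 7.0],
     [np.nan, 11.0, 6.0, 8.0],   # BTC close missing on the second day
     [102.0, 12.0, 7.0, 9.0]],
    index=dates,
    columns=columns,
)

# Same steps as the function: pick 'Close', strip the suffix, name the index, forward-fill.
close = raw["Close"].copy()
close.columns = [c.replace("-USD", "") for c in close.columns]
close.index.name = "date"
close = close.ffill()
print(close)
```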

In [8]:
print(type(price_matrix))
print(price_matrix.columns)   
price_matrix.head()
price_matrix.tail()

<class 'pandas.core.frame.DataFrame'>
Index(['ADA', 'ALGO', 'APT', 'ATOM', 'AVAX', 'BNB', 'BTC', 'DOGE', 'DOT',
       'ETC', 'ETH', 'FIL', 'LINK', 'LTC', 'MATIC', 'SOL', 'TRX', 'UNI', 'XLM',
       'XRP'],
      dtype='object')


| date | ADA | ALGO | APT | ATOM | AVAX | BNB | BTC | DOGE | DOT | ETC | ETH | FIL | LINK | LTC | MATIC | SOL | TRX | UNI | XLM | XRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-11-22 | 0.404608 | 0.135694 | 0.000131 | 2.509819 | 13.22755 | 833.351074 | 84648.359375 | 0.140304 | 2.309507 | 13.499508 | 2767.607422 | 1.614149 | 12.172948 | 82.174629 | 0.216415 | 127.551201 | 0.27413 | 0.000163 | 0.230294 | 1.949751 |
| 2025-11-23 | 0.408549 | 0.143644 | 0.000131 | 2.492277 | 13.277553 | 843.224792 | 86805.007812 | 0.144831 | 2.255722 | 13.556331 | 2801.676025 | 1.609677 | 12.509099 | 82.977539 | 0.216415 | 130.705063 | 0.275073 | 0.000163 | 0.24703 | 2.046265 |
| 2025-11-24 | 0.427842 | 0.143671 | 0.000131 | 2.50112 | 13.890861 | 864.421143 | 88270.5625 | 0.151794 | 2.339354 | 14.159579 | 2952.713379 | 1.64108 | 12.964276 | 85.477417 | 0.216415 | 138.371353 | 0.274809 | 0.000163 | 0.254841 | 2.225878 |
| 2025-11-25 | 0.421617 | 0.146376 | 0.000131 | 2.465958 | 14.1699 | 862.108032 | 87341.890625 | 0.152972 | 2.294471 | 14.156011 | 2957.936279 | 1.662887 | 13.065638 | 85.282166 | 0.216415 | 138.891144 | 0.274355 | 0.000163 | 0.2519 | 2.198544 |
| 2025-11-26 | 0.435622 | 0.146284 | 0.000131 | 2.526135 | 14.936302 | 891.753357 | 90518.367188 | 0.154778 | 2.344442 | 14.134871 | 3027.812012 | 1.673222 | 13.455601 | 86.870491 | 0.216415 | 143.012192 | 0.276516 | 0.000163 | 0.258838 | 2.224285 |
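
In the tail shown above, MATIC, APT, and UNI each read the same value on all five dates, which is what a forward-filled column looks like once its source stops updating. A quick check, sketched here on made-up numbers, counts how many rows at the end of a column repeat its last value. `trailing_repeats` is a hypothetical helper, and a long run is only a hint that a series may be stale, not proof.

```python
import pandas as pd


def trailing_repeats(series: pd.Series) -> int:
    """Count how many rows at the end of the series equal its last value."""
    values = series.dropna()
    if values.empty:
        return 0
    changed = values.ne(values.iloc[-1])
    if not changed.any():
        return len(values)
    # Position of the last row that differs from the final value.
    last_change = changed.to_numpy().nonzero()[0][-1]
    return len(values) - 1 - int(last_change)


prices = pd.DataFrame({
    "LIVE": [1.0, 1.1, 1.2, 1.3, 1.4],
    "STALE": [2.0, 2.5, 2.5, 2.5, 2.5],
})
print(prices.apply(trailing_repeats))
```

Columns flagged this way could be dropped or re-sourced before clustering, since a constant price produces zero returns and zero volatility that do not reflect the market.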


## Fetching fear and greed index data and primary cleaning

In [9]:
def fetch_fear_greed_index():
    print(f" fetching Fear & Greed Index data from Alternative.me...")
    url = "https://api.alternative.me/fng/?limit=0"
    try:
        r = requests.get(url, timeout=10)
        data = r.json()['data']
        df = pd.DataFrame(data)
        
        # primary cleaning
        df['value'] = df['value'].astype(float)
        # timestamps may arrive as strings; cast to int so they are parsed as epoch seconds
        df['date'] = pd.to_datetime(df['timestamp'].astype('int64'), unit='s').dt.normalize()
        df = df[['date', 'value']].rename(columns={'value': 'fg_raw'})
        df.set_index('date', inplace=True)
        df.sort_index(inplace=True)
        
        print(f"finished")
        return df
    except Exception as e:
        print(f"Error fetching: {e}")
        return pd.DataFrame()

fg_data = fetch_fear_greed_index()


print(f"\n  Successfully fetched Fear & Greed, covering {len(fg_data)} days")


 fetching Fear & Greed Index data from Alternative.me...
finished

  Successfully fetched Fear & Greed, covering 2853 days
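
The parsing step can be checked without calling the API. The sketch below separates it into a hypothetical helper, `parse_fng`, and feeds it a hand-written sample. The field names `value` and `timestamp` are the ones the function above reads; that both arrive as strings is an assumption about the response format.

```python
import pandas as pd


def parse_fng(records: list) -> pd.DataFrame:
    """Turn Fear & Greed records into a date-indexed frame with one `fg_raw` column."""
    df = pd.DataFrame(records)
    df["fg_raw"] = df["value"].astype(float)
    # Cast to int first so string timestamps are parsed as epoch seconds.
    df["date"] = pd.to_datetime(df["timestamp"].astype("int64"), unit="s").dt.normalize()
    return df[["date", "fg_raw"]].set_index("date").sort_index()


# Hand-written sample, newest first; for illustration only.
sample = [
    {"value": "15", "timestamp": "1764115200"},
    {"value": "20", "timestamp": "1764028800"},
]
fg = parse_fng(sample)
print(fg)
```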




## Matching and merging all data sources

In [None]:

# 1. left-join the sentiment index onto the price matrix by date
full_market_matrix = price_matrix.join(fg_data, how='left')

# 2. if sentiment data is missing for a date, carry the last available value forward;
# sentiment is treated as a continuous daily series, so I think forward fill is acceptable
full_market_matrix['fg_raw'] = full_market_matrix['fg_raw'].ffill()


full_file_path = "../data/processed/full_market_matrix.csv"
full_market_matrix.to_csv(full_file_path)
print(f"save full market matrix to: {full_file_path}")


save full market matrix to: ../data/processed/full_market_matrix.csv
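
A toy version of the merge shows what the left join and forward fill do and do not cover. The data is made up: a left join keeps every price date, `ffill` closes gaps inside the sentiment series, and rows before the first sentiment observation stay `NaN` because there is nothing to carry forward.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=5, freq="D", name="date")
prices = pd.DataFrame({"BTC": [100.0, 101.0, 102.0, 103.0, 104.0]}, index=dates)

# Sentiment starts one day late and skips 2024-01-04.
fg = pd.DataFrame(
    {"fg_raw": [30.0, 35.0, 40.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-05"]),
)

merged = prices.join(fg, how="left")
merged["fg_raw"] = merged["fg_raw"].ffill()
print(merged)
print("rows still missing fg_raw:", int(merged["fg_raw"].isna().sum()))
```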


## Check data  

In [17]:
# full_market_matrix.head()
full_market_matrix.tail()

| date | ADA | ALGO | APT | ATOM | AVAX | BNB | BTC | DOGE | DOT | ETC | ... | FIL | LINK | LTC | MATIC | SOL | TRX | UNI | XLM | XRP | fg_raw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-11-22 | 0.404608 | 0.135694 | 0.000131 | 2.509819 | 13.22755 | 833.351074 | 84648.359375 | 0.140304 | 2.309507 | 13.499508 | ... | 1.614149 | 12.172948 | 82.174629 | 0.216415 | 127.551201 | 0.27413 | 0.000163 | 0.230294 | 1.949751 | 11.0 |
| 2025-11-23 | 0.408549 | 0.143644 | 0.000131 | 2.492277 | 13.277553 | 843.224792 | 86805.007812 | 0.144831 | 2.255722 | 13.556331 | ... | 1.609677 | 12.509099 | 82.977539 | 0.216415 | 130.705063 | 0.275073 | 0.000163 | 0.24703 | 2.046265 | 13.0 |
| 2025-11-24 | 0.427842 | 0.143671 | 0.000131 | 2.50112 | 13.890861 | 864.421143 | 88270.5625 | 0.151794 | 2.339354 | 14.159579 | ... | 1.64108 | 12.964276 | 85.477417 | 0.216415 | 138.371353 | 0.274809 | 0.000163 | 0.254841 | 2.225878 | 19.0 |
| 2025-11-25 | 0.421617 | 0.146376 | 0.000131 | 2.465958 | 14.1699 | 862.108032 | 87341.890625 | 0.152972 | 2.294471 | 14.156011 | ... | 1.662887 | 13.065638 | 85.282166 | 0.216415 | 138.891144 | 0.274355 | 0.000163 | 0.2519 | 2.198544 | 20.0 |
| 2025-11-26 | 0.435622 | 0.146284 | 0.000131 | 2.526135 | 14.936302 | 891.753357 | 90518.367188 | 0.154778 | 2.344442 | 14.134871 | ... | 1.673222 | 13.455601 | 86.870491 | 0.216415 | 143.012192 | 0.276516 | 0.000163 | 0.258838 | 2.224285 | 15.0 |


In [14]:
print("(Data Preparation Complete)")

print(f"1. Coin count: {len(price_matrix.columns)}")
print(f"2. Date range: {full_market_matrix.index.min().date()} to {full_market_matrix.index.max().date()}")
print(f"3. Data preview (Tail):\n{full_market_matrix[['BTC', 'ETH', 'SOL', 'fg_raw']].tail()}")


(Data Preparation Complete)
1. Coin count: 20
2. Date range: 2018-01-01 to 2025-11-26
3. Data preview (Tail):
                     BTC          ETH         SOL  fg_raw
date                                                     
2025-11-22  84648.359375  2767.607422  127.551201    11.0
2025-11-23  86805.007812  2801.676025  130.705063    13.0
2025-11-24  88270.562500  2952.713379  138.371353    19.0
2025-11-25  87341.890625  2957.936279  138.891144    20.0
2025-11-26  90518.367188  3027.812012  143.012192    15.0
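
The tail alone does not show where each coin's history begins, and the early rows printed in the price matrix contain `NaN` for coins that were not yet listed in 2018. A per-column coverage summary makes that visible. The sketch below uses made-up data, and `coverage_report` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd


def coverage_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per column: first and last dates with data, and the number of missing rows."""
    rows = {
        name: {
            "first_valid": col.first_valid_index(),
            "last_valid": col.last_valid_index(),
            "n_missing": int(col.isna().sum()),
        }
        for name, col in frame.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


dates = pd.date_range("2024-01-01", periods=4, freq="D", name="date")
toy = pd.DataFrame(
    {"OLD": [1.0, 2.0, 3.0, 4.0], "NEW": [np.nan, np.nan, 5.0, 6.0]},
    index=dates,
)
report = coverage_report(toy)
print(report)
```

Applied to `full_market_matrix`, a report like this would show which coins have too short a history for a given analysis window.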


In [15]:
print(full_market_matrix.tail())

                 ADA      ALGO       APT      ATOM       AVAX         BNB  \
date                                                                        
2025-11-22  0.404608  0.135694  0.000131  2.509819  13.227550  833.351074   
2025-11-23  0.408549  0.143644  0.000131  2.492277  13.277553  843.224792   
2025-11-24  0.427842  0.143671  0.000131  2.501120  13.890861  864.421143   
2025-11-25  0.421617  0.146376  0.000131  2.465958  14.169900  862.108032   
2025-11-26  0.435622  0.146284  0.000131  2.526135  14.936302  891.753357   

                     BTC      DOGE       DOT        ETC  ...       FIL  \
date                                                     ...             
2025-11-22  84648.359375  0.140304  2.309507  13.499508  ...  1.614149   
2025-11-23  86805.007812  0.144831  2.255722  13.556331  ...  1.609677   
2025-11-24  88270.562500  0.151794  2.339354  14.159579  ...  1.641080   
2025-11-25  87341.890625  0.152972  2.294471  14.156011  ...  1.662887   
2025-11-26  9051