# Part 1 Assessing Models with Alternative Data
---
## Q1. Data Understanding

### Types of Data Used in the Research

The study utilizes financial time series data from **Yahoo Finance** covering the period from **December 12, 2009, to January 1, 2020**. The raw data includes:

- **Price-related features**: 
  - Opening price (Open)
  - Highest price (High)
  - Lowest price (Low)
  - Closing price (Close)
  - Adjusted closing price (Adjusted Close)
- **Volume**: Number of shares traded per day
- **Fundamental events**: Dividend and capital gain distributions (reflected through Adjusted Close)

The authors focus on **three ETFs** representing different market dynamics:
- iShares MSCI Chile ETF (**ECH**)
- iShares MSCI Brazil ETF (**EWZ**)
- iShares Core S&P 500 ETF (**IVV**)

These datasets represent **emerging** and **developed** markets and are selected for cross-market comparison.


### Technical Indicator Derivation

To enrich the dataset, the authors computed **216 daily features** by applying the **Pandas TA (Technical Analysis)** library, a package designed to compute customizable financial indicators. These features fall into several categories:

- **Trend indicators**: e.g., TTM Trend, Correlation Trend Indicator (CTI)
- **Momentum indicators**: e.g., Williams %R (WILLR), Balance of Power (BOP)
- **Cycle indicators**: e.g., Even Better SineWave (EBSW)
- **Volume-based indicators**: e.g., Archer's On-Balance Volume (AOBV), Price Volume Rank (PVR)
- **Volatility indicators**: e.g., Bollinger Band Percent (BBP)
- **Statistical indicators**: e.g., Z-score
- **Boolean delta indicators**: e.g., Increasing (INC), Decreasing (DEC)

Each of these was computed using transformations of the six base price and volume attributes from Yahoo Finance. The formulae for selected indicators include:

- **Balance of Power (BOP)**:
  $$
  \text{BOP}_t = \frac{c_t - o_t}{h_t - l_t}
  $$

- **Williams %R (WILLR)**:
  $$
  \text{WILLR}_t = -100 \cdot \frac{\text{Highest High} - c_t}{\text{Highest High} - \text{Lowest Low}}
  $$

- **Z-score (Zs)**:
  $$
  \text{Zs}_t = \frac{c_t - SMA_t}{\sigma_t}
  $$

- **Decreasing (DEC)**:
  $$
  \text{DEC}_t = \begin{cases} 1, & \text{if } c_t - c_{t-1} < 0 \\ 0, & \text{otherwise} \end{cases}
  $$

- **Increasing (INC)**:
  $$
  \text{INC}_t = \begin{cases} 1, & \text{if } c_t - c_{t-1} > 0 \\ 0, & \text{otherwise} \end{cases}
  $$

- **Stochastic RSI (StochRSI)**:
  $$
  \text{StochRSI}_t = \frac{\text{RSI}_t - \min(\text{RSI})}{\max(\text{RSI}) - \min(\text{RSI})}
  $$

- **Bollinger Band Percent (BBP)**:
  $$
  \text{BBP}_t = \frac{o_t - \text{Lower Band}_t}{\text{Upper Band}_t - \text{Lower Band}_t}
  $$

- **On-Balance Volume (OBV)**:
  $$
  \text{OBV}_t = \begin{cases} \text{OBV}_{t-1} + v_t, & \text{if } c_t > c_{t-1} \\ \text{OBV}_{t-1} - v_t, & \text{if } c_t < c_{t-1} \\ \text{OBV}_{t-1}, & \text{otherwise} \end{cases}
  $$
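
The formulas above can be transcribed almost line for line with pandas. The sketch below is a direct reading of those formulas, not the Pandas TA implementation, whose defaults (window lengths, starting OBV value, standard-deviation convention) may differ. The price and volume numbers are made up, and the lookback `window = 3` is an assumption chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

# Tiny synthetic OHLCV frame (made-up numbers, for illustration only).
df = pd.DataFrame({
    "open":   [10.0, 10.5, 10.2, 10.8, 11.0, 10.7],
    "high":   [10.6, 10.9, 10.9, 11.2, 11.3, 11.0],
    "low":    [9.8, 10.1, 10.0, 10.6, 10.6, 10.4],
    "close":  [10.4, 10.3, 10.8, 11.0, 10.7, 10.9],
    "volume": [1000, 1200, 900, 1500, 1100, 1300],
})
window = 3  # assumed lookback length

# Balance of Power: (close - open) / (high - low)
df["bop"] = (df["close"] - df["open"]) / (df["high"] - df["low"])

# Williams %R using the highest high and lowest low of the lookback window
hh = df["high"].rolling(window).max()
ll = df["low"].rolling(window).min()
df["willr"] = -100 * (hh - df["close"]) / (hh - ll)

# Z-score of the close against its rolling mean and standard deviation
sma = df["close"].rolling(window).mean()
sd = df["close"].rolling(window).std()
df["zs"] = (df["close"] - sma) / sd

# Boolean delta indicators on the close
delta = df["close"].diff()
df["inc"] = (delta > 0).astype(int)
df["dec"] = (delta < 0).astype(int)

# On-Balance Volume: add volume on up days, subtract it on down days.
# The first day has no previous close, so it contributes 0 here.
df["obv"] = (np.sign(delta).fillna(0) * df["volume"]).cumsum()

print(df.round(3))
```

The rolling indicators are undefined (NaN) for the first `window - 1` rows, which is why rows with missing values have to be dropped before modeling.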


### Importance of Technical Indicators in Predicting Stock Movements

- **Dimensional Enrichment**: The transformation of raw price and volume data into derived indicators enables the neural network to capture **nonlinear patterns** and **hidden cycles** in the market data.
- **Feature Discrimination**: The study uses statistical methods (e.g., LASSO, Chi-squared, MAD, DR) to **rank and select features** that are most informative for classification. This ensures that only a small, optimized subset (about 5%) is used, thereby **reducing model complexity** without sacrificing accuracy.
- **Performance Enhancement**: Selected indicators lead to significant improvements in classification accuracy (up to **13.63%**) and reductions in training time (about **84.68%**) as shown in Table 6 of the study.
- **Market-Specific Insights**: The indicators reveal differences in market behavior. For example, emerging markets rely more on **cyclic and volume indicators** (e.g., AOBV, EBSW), while developed markets leverage **trend-following and momentum indicators** (e.g., TTM Trend) together with the volume-based PVR.

These findings highlight the **predictive power** of a curated subset of technical indicators and justify their role as key inputs in machine learning-based financial forecasting models.



---

## Q2. Security Understanding

### Selected ETF: iShares MSCI Chile ETF (ECH)

### Asset Type

- **Fund Name**: iShares MSCI Chile ETF
- **Ticker**: ECH
- **Asset Class**: Equity
- **Type**: Exchange-Traded Fund (ETF)
- **Region Focus**: Chile (Emerging Market)
- **Issuer**: BlackRock
- **Benchmark Index**: MSCI Chile IMI 25/50 Index
- **Expense Ratio**: ~0.57% (typical range as per ETF factsheets)
- **Inception Date**: November 20, 2007

This ETF offers exposure to large-, mid-, and small-cap Chilean equities. As a single-country ETF, it is more volatile than a diversified regional fund but suits investors seeking concentrated exposure to Chile's mining- and utilities-heavy economy.


### Historical Price Chart (2009–2020)

*The price chart referred to in the paper (Figure 2) plots daily opening prices of ECH from 2009 to 2020. The prices ranged roughly between \$29.30 and \$80.25.*

| Statistic        | Value (USD) |
|------------------|--------------|
| Minimum          | 29.30        |
| 1st Quartile     | 40.35        |
| Median           | 46.48        |
| Mean             | 50.10        |
| 3rd Quartile     | 59.84        |
| Maximum          | 80.25        |



### Key Statistics and Sector Breakdown

As per Table 1 in the paper, the top sector exposures for ECH are:

| Sector          | Weight (%) |
|------------------|------------|
| Financials       | 21.53      |
| Materials        | 21.28      |
| Utilities        | 18.73      |
| Consumer Staples| 13.92      |
| Energy           | 8.34       |
| **Total**        | **83.8**   |

This reflects Chile's dependence on **mining, utilities, and financials**, especially copper exports and public infrastructure.



### Why Classification Over Regression?

The paper uses a **classification model** to predict the **directional movement** of the ETF's price (up or down), rather than forecasting the exact price level.

- **Response Variable Definition**:
  $$
  \Gamma_t = \begin{cases}
    1, & \text{if } Open_t - Open_{t-1} > 0 \\
    0, & \text{otherwise}
  \end{cases}
  $$

- **Justification**:
  - A directional label depends only on the sign of the move, so it is less sensitive to the size of prediction errors than a regression target.
  - Trend direction is more actionable for traders/investors.
  - Emerging market data is noisier, making regression less stable.
  - Easier to calibrate trading strategies on binary outcomes (e.g., long vs. short).


### Alternative Label Definitions for Classification

1. **Volatility-Adjusted Trend Class**:
   $$
   \Gamma_t = \begin{cases}
     1, & \text{if } (Open_t - Open_{t-1}) > \theta \cdot \sigma_t \\
     0, & \text{otherwise}
   \end{cases}
   $$
   - Where $\sigma_t$ is rolling volatility and $\theta$ is a chosen threshold.
   - Captures significant moves only, avoiding noise.

2. **Return-Based 3-Class Label**:
   $$
   \Gamma_t = \begin{cases}
     1, & \text{if return} > \epsilon \\
     0, & \text{if } |return| \leq \epsilon \\
    -1, & \text{if return} < -\epsilon
   \end{cases}
   $$
   - A ternary classification capturing uptrend, no change, and downtrend.
   - Enables more nuanced decision models.

These alternative labels could improve model interpretability and control for market microstructure noise.
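
To make the three label definitions concrete, here is a minimal sketch that builds them side by side. The opening prices are made up, and `theta`, `vol_window`, and `eps` are assumed values, not parameters from the paper. Rolling volatility is taken here as the rolling standard deviation of daily open-to-open changes, which is one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Made-up opening prices, for illustration only.
open_ = pd.Series([50.0, 50.4, 50.3, 51.5, 51.5, 50.2, 50.25])

diff = open_.diff()          # Open_t - Open_{t-1}
ret = open_.pct_change()     # simple daily return of the open

# Binary label used in the paper: 1 if the open rose versus the previous day.
gamma = (diff > 0).astype(int)

# Volatility-adjusted label: the move must exceed theta * rolling volatility.
theta, vol_window = 1.0, 3   # assumed values
sigma = diff.rolling(vol_window).std()
gamma_vol = (diff > theta * sigma).astype(int)

# Three-class label with a dead band of width eps around a zero return.
eps = 0.005                  # assumed 0.5% band
gamma_3 = pd.Series(
    np.select([ret > eps, ret < -eps], [1, -1], default=0),
    index=open_.index,
)

# The first day has no previous open, so its labels are undefined; drop it.
labels = pd.DataFrame({
    "open": open_,
    "binary": gamma,
    "vol_adj": gamma_vol,
    "three_class": gamma_3,
}).iloc[1:]
print(labels)
```

Comparisons against NaN evaluate to False, so the volatility-adjusted label is 0 until the rolling window has filled. In practice those warm-up rows would usually be dropped rather than labeled.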



---
## Q3. Methodology Understanding

### Reorganized "Data" Section

#### 2.1 Data Collection

The authors used historical financial data from **Yahoo Finance** for three ETFs—ECH, EWZ, and IVV—spanning the period from **December 12, 2009, to January 1, 2020**. The raw data includes:

- Open, High, Low, Close, Adjusted Close prices
- Trading volume

This data was collected to reflect pre-pandemic market behavior, avoiding distortions caused by COVID-19.

#### 2.2 Preprocessing

- **Data Cleaning**: Rows with missing values (e.g., due to indicators like SMA requiring a lookback period) were dropped.
- **Normalization**: A min-max transformation was applied to each feature:

  $$
  x' = \frac{x - \min(x)}{\max(x) - \min(x)}
  $$

- **Class Assignment**: The response variable $\Gamma_t$ was created for binary classification:

  $$
  \Gamma_t = \begin{cases}
    1, & \text{if } Open_t - Open_{t-1} > 0 \\
    0, & \text{otherwise}
  \end{cases}
  $$
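
A short sketch of the min-max step follows. The source does not say whether the scaling constants were computed on the full sample or separately on each training fold; this example assumes the latter, since fitting on training rows only avoids leaking information from held-out data. The function names and numbers are hypothetical.

```python
import numpy as np

def min_max_fit(x_train):
    """Return the per-column minimum and range learned from training rows only."""
    lo = x_train.min(axis=0)
    span = x_train.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)  # constant columns map to 0
    return lo, span

def min_max_apply(x, lo, span):
    """Apply x' = (x - min) / (max - min) with previously fitted constants."""
    return (x - lo) / span

# Made-up feature matrix: 4 days, 2 features.
x = np.array([[10.0, 200.0],
              [12.0, 260.0],
              [11.0, 230.0],
              [14.0, 320.0]])

lo, span = min_max_fit(x[:3])        # first three rows act as the training split
scaled = min_max_apply(x, lo, span)  # the last row is held out
print(scaled)
```

With this choice, held-out values can fall outside [0, 1] (the last row here does), which is expected rather than a bug.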

#### 2.3 Technical Indicator Construction

Using the **Pandas TA** library, over **210 indicators** were computed from base price and volume data. These indicators fall into categories such as:

- Trend (e.g., TTM Trend)
- Momentum (e.g., Balance of Power, Williams %R)
- Volume (e.g., OBV, PVR)
- Volatility (e.g., Bollinger Band %)
- Cycles and Statistics (e.g., EBSW, Z-score)

The final feature set included **216 technical indicators**.


### Renamed Section 3: Methodology

#### 3.1 Descriptive Statistics: Pearson Correlation

Pearson correlation was used to evaluate linear relationships between features:

$$
r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n \cdot \sigma_x \sigma_y}
$$

This guided initial filtering of irrelevant features.
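
As a quick check of the formula, the sketch below computes $r_{xy}$ by hand and compares it with NumPy's built-in result. Note that the $n \cdot \sigma_x \sigma_y$ denominator corresponds to population standard deviations, which is NumPy's default for `std`. The data are made up.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation using population standard deviations (ddof=0)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    cov_sum = np.sum((x - x.mean()) * (y - y.mean()))
    return cov_sum / (n * x.std() * y.std())

# Made-up, nearly linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

r = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r, r_numpy)
```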

#### 3.2 Predictive Feature Selection Techniques

To reduce dimensionality, the following methods were applied:

- **Low Variance Filter**
- **Chi-Squared Test**
- **LASSO (Least Absolute Shrinkage and Selection Operator)**:

  $$
  \text{LASSO Loss} = \text{RSS} + \lambda \sum |w_j|
  $$

- **Tree-Based Models (Extra Trees Classifier)**
- **Principal Feature Analysis (PFA)**
- **Mean Absolute Difference (MAD)**
- **Dispersion Ratio (DR)**

Each method ranked the features by informativeness, and the top quartile of features from each method was kept as that method's subset; these subsets were then aggregated.
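
The filter-style methods in this list are simple enough to sketch directly. The example below scores features by variance, mean absolute difference, and dispersion ratio, then keeps the top quartile for each score. The definitions used (MAD as the mean absolute deviation from the column mean, DR as arithmetic mean over geometric mean) are the common textbook forms and are an assumption about what the paper used. The feature matrix is random and the names `f0` to `f7` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 200

# Hypothetical scaled features in (0, 1]; column 3 is made nearly constant.
X = rng.uniform(0.01, 1.0, size=(n_rows, 8))
X[:, 3] = 0.5 + 0.001 * rng.standard_normal(n_rows)
names = [f"f{i}" for i in range(X.shape[1])]

def mean_absolute_difference(X):
    """Mean absolute deviation of each column from its own mean."""
    return np.mean(np.abs(X - X.mean(axis=0)), axis=0)

def dispersion_ratio(X):
    """Arithmetic mean divided by geometric mean (needs strictly positive values)."""
    arithmetic = X.mean(axis=0)
    geometric = np.exp(np.log(X).mean(axis=0))
    return arithmetic / geometric

def top_quartile(scores, names):
    """Names of the features whose scores fall in the top 25%."""
    k = max(1, len(names) // 4)
    order = np.argsort(scores)[::-1][:k]
    return {names[i] for i in order}

subsets = {
    "variance": top_quartile(X.var(axis=0), names),
    "mad": top_quartile(mean_absolute_difference(X), names),
    "dr": top_quartile(dispersion_ratio(X), names),
}
print(subsets)
```

All three scores treat higher dispersion as more informative, so the nearly constant column never makes the cut. LASSO, Chi-squared, Extra Trees, and PFA would each contribute a further subset in the same form.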

#### 3.3 Model Training: Neural Network (MLP)

A **Multilayer Perceptron (MLP)** was trained using the selected features. The MLP architecture:

- Hidden Layer Size: $(\text{input features} + \text{output classes}) / 2$
- Activation: logistic
- Solver: L-BFGS
- Learning Rate: adaptive (initial = 0.03)
- Max Iterations: 5000

#### 3.4 Cross-Validation

A **10-fold cross-validation** strategy was applied:

- Split dataset into 10 partitions
- Rotate each as the test set, training on the remaining 9
- Average results for final accuracy metric



### Indicator Optimization Strategy

The feature selection process was iteratively refined:

1. Each ETF underwent separate preprocessing and feature selection.
2. Features present in at least **n = 5** statistical subsets were selected.
3. MLPs were trained on both full and reduced feature sets.
4. **Selected(5)** subset (~10 features) achieved:
   - Up to **13.63%** gain in accuracy
   - Over **84%** reduction in training time

This confirms that optimized technical indicators improve neural network performance by:
- Reducing overfitting
- Lowering computational load
- Improving generalization to unseen data

---


## Q4. Feature Understanding

### What Constitutes a Feature in the Study

In this study, a **feature** is a derived quantitative variable that characterizes some aspect of the price or volume behavior of an ETF. Examples include:

- **Raw features**: Open, High, Low, Close, Adjusted Close, Volume
- **Derived technical indicators**: Computed using formulas applied to raw features (e.g., Bollinger Band %, Balance of Power, Z-score)

Each feature is represented numerically and updated daily, forming a multivariate time series used as input for machine learning models.


### Difference Between Features, Methods, and Models

- **Features**: Independent variables (e.g., BBP, OBV) used as inputs for prediction
- **Methods**: Statistical or machine learning techniques used to select or transform features (e.g., LASSO, Chi-squared, PFA)
- **Models**: Algorithms that learn from features to predict the target variable (e.g., Multilayer Perceptron)

| Category  | Examples                             |
|-----------|--------------------------------------|
| Features  | WILLR, BBP, J, CTI, AOBV              |
| Methods   | Pearson correlation, MAD, LASSO       |
| Models    | MLP, Tree-based classifiers           |



### Categorization of Features Used

The 216 derived features are categorized as follows (based on Pandas TA):

| Feature Category | Examples                    | Selected in Final Subset? |
|------------------|-----------------------------|----------------------------|
| **Trend**        | TTM_TRND, CTI               | Yes                        |
| **Momentum**     | WILLR, BOP, KDJ             | Yes                        |
| **Volume**       | OBV, PVR, AOBV              | Yes                        |
| **Volatility**   | BBP                         | Yes                        |
| **Cycles**       | Even Better SineWave (EBSW) | Yes                        |
| **Statistics**   | Z-score                     | Yes                        |
| **Boolean**      | INC, DEC                    | Yes                        |

These features were engineered from historical price and volume series to capture specific trading patterns, cycle behaviors, or regime shifts.



### Optimization Strategy and Its Role in Neural Network Performance

The study adopts a **multi-metric feature selection pipeline**, which:

1. Applies statistical filters (e.g., LASSO, Chi-Squared, Pearson) to rank all 216 features.
2. Retains features appearing in at least **5 of the individual top-25% lists**, creating the `Selected(5)` subset.
3. Trains the MLP using only these reduced features.
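
The voting rule in step 2 can be sketched with a counter over the per-method subsets. The subsets below are hypothetical and only illustrate the mechanics; they are not the paper's actual selections.

```python
from collections import Counter

# Hypothetical top-quartile subsets, one per selection method.
subsets = {
    "low_variance": {"BOP", "BBP", "INC", "DEC", "CTI"},
    "chi2":         {"BOP", "BBP", "INC", "DEC", "EBSW"},
    "lasso":        {"BOP", "BBP", "INC", "AOBV", "J"},
    "extra_trees":  {"BOP", "BBP", "INC", "DEC", "WILLR"},
    "pfa":          {"BOP", "BBP", "DEC", "EBSW", "CTI"},
    "mad":          {"BOP", "INC", "DEC", "AOBV", "J"},
    "dr":           {"BOP", "BBP", "INC", "ZS", "WILLR"},
}

def selected(subsets, n):
    """Features that appear in at least n of the per-method subsets."""
    votes = Counter(feature for s in subsets.values() for feature in s)
    return sorted(f for f, count in votes.items() if count >= n)

print(selected(subsets, 5))
```

Raising `n` shrinks the subset toward features that every method agrees on, while lowering it admits features favored by only a few methods.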

**Why Optimization Improves Performance**:

- **Avoids Overfitting**: Irrelevant or redundant features introduce noise.
- **Speeds Training**: Smaller input space means faster convergence.
- **Increases Accuracy**: By focusing on the most informative signals, the model generalizes better.

In summary, feature optimization acts as a filter that amplifies meaningful patterns while suppressing noise, making it essential for improving neural network performance in high-dimensional financial forecasting tasks.

---

## Q5. Optimization Understanding

### Concept of Cross-Validation

**Cross-validation** is a resampling method used to evaluate the generalizability of a predictive model. It splits the dataset into multiple subsets (folds), trains the model on some, and tests it on the remaining ones. This estimates how the model performs on unseen data and helps detect **overfitting** (a large gap between training and held-out scores) and **underfitting** (poor scores on both).


### How k-Fold Cross-Validation Works

In **k-fold cross-validation**:

1. The dataset is split into $k$ folds of (approximately) equal size.
2. The model is trained $k$ times, each time:
   - Using $k-1$ folds for training
   - Using the remaining 1 fold for testing
3. The process rotates the test fold through all $k$ partitions.
4. The final performance metric is the average across all $k$ trials.

For this study, the authors used **10-fold cross-validation** ($k = 10$), which provides a balance between bias and variance in error estimation.
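
The rotation described above can be written in a few lines. The sketch below uses contiguous, unshuffled folds and a trivial majority-class scorer as a stand-in for the MLP; both are assumptions made to keep the example dependency-light, and the labels are made up.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_validate(score_fn, n_samples, k=10):
    """Average score_fn(train_idx, test_idx) over all k folds."""
    scores = [score_fn(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return float(np.mean(scores)), scores

# Made-up binary labels.
y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,
              1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1])

def majority_baseline(train_idx, test_idx):
    """Predict the majority class of the training fold for every test row."""
    majority = int(y[train_idx].mean() >= 0.5)
    return float(np.mean(y[test_idx] == majority))

mean_acc, fold_scores = cross_validate(majority_baseline, len(y), k=10)
print(mean_acc, fold_scores)
```

When the sample count is not divisible by $k$, `np.array_split` makes the first few folds one element larger, which is why the folds are only approximately equal in size.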


### Jaccard Distance and Comparison to Other Metrics

The **Jaccard distance** measures dissimilarity between two sets $A$ and $B$:

$$
J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}
$$

- **Range**: [0, 1]
  - 0: identical sets
  - 1: no overlap

#### Compared With:

1. **Euclidean Distance**:
   $$
   d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}
   $$
   - Measures absolute geometric distance.
   - Sensitive to scale and magnitude.

2. **Manhattan Distance**:
   $$
   d(x, y) = \sum_i |x_i - y_i|
   $$
   - Measures city-block (grid-based) distance.
   - Less sensitive to outliers than Euclidean.

**Jaccard** is preferred for **set similarity**, especially when working with **binary or categorical feature sets** like those selected by statistical methods.
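
To see how the three measures behave on the same input, the sketch below compares two hypothetical selected-feature sets, first as sets (Jaccard) and then as binary membership vectors (Euclidean and Manhattan). The feature sets are illustrative, not the paper's results.

```python
import numpy as np

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical selected-feature sets for two ETFs.
set_a = {"BOP", "BBP", "INC", "DEC", "EBSW"}
set_b = {"BOP", "BBP", "INC", "DEC", "TTM_TRND"}

d_jaccard = jaccard_distance(set_a, set_b)  # 4 shared out of 6 distinct features

# Encode each set as a 0/1 membership vector over the combined vocabulary.
vocab = sorted(set_a | set_b)
va = np.array([f in set_a for f in vocab], dtype=float)
vb = np.array([f in set_b for f in vocab], dtype=float)

d_euclidean = np.sqrt(np.sum((va - vb) ** 2))
d_manhattan = np.sum(np.abs(va - vb))

print(d_jaccard, d_euclidean, d_manhattan)
```

On binary vectors the Manhattan distance simply counts the features on which the sets disagree, and the Euclidean distance is its square root. Neither accounts for how large the sets are, whereas Jaccard normalizes by the size of the union.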

### Optimality Definition and Evaluation in the Paper

**Optimality** in this paper refers to the ability to select a **minimal set of features** that achieves **maximum classification accuracy** with **minimal computational cost**.

- The authors define **Selected(n)** as the set of features appearing in at least `n` statistical selection methods.
- For $n = 5$, the `Selected(5)` subset:
  - Contains only 9–10 features (≈5% of total)
  - Achieves near or better performance than full 216-feature set

### Evaluation Criteria:

- **Classification Accuracy** from 10-fold cross-validation
- **Training Time Reduction**
- **Jaccard Similarity** among ETFs (to compare robustness of selected features)

The result: **80%+ accuracy** with only **5% of the features**, confirming the optimality of the selection strategy.

**Conclusion**:
Optimization was achieved via feature pruning, validation loops, and dissimilarity metrics—balancing performance with parsimony.

---

## Step 1: Financial Problem

### Financial Problem Being Addressed

The study addresses the problem of **predicting daily trend direction (up or down) in the price of Exchange-Traded Funds (ETFs)** using optimized sets of technical indicators. Specifically, the authors focus on ETFs in emerging markets (ECH and EWZ) and compare their behavior with a developed market ETF (IVV).

- **Objective**: Improve the predictive performance and computational efficiency of machine learning models—particularly neural networks—by systematically selecting the most informative technical indicators.

- **Modeling Strategy**: Treat this as a **binary classification problem** using a Multilayer Perceptron (MLP), where the model forecasts whether the opening price will be higher or lower than the previous day.

- **Input Data**: 216 technical indicators derived from raw market data.

- **Response Variable**:
  $$
  \Gamma_t = \begin{cases}
    1, & \text{if } Open_t - Open_{t-1} > 0 \\
    0, & \text{otherwise}
  \end{cases}
  $$



### Influence of Emerging Market Dynamics on Model Design

Emerging markets like Chile and Brazil are:
- More volatile and less liquid
- Influenced by political risk, commodity prices, and exchange rate shocks

These characteristics introduce challenges:
- **Higher noise-to-signal ratio**: Makes regression-based forecasting less reliable.
- **Nonlinear, chaotic dynamics**: Favors models that can capture nonlinearities (e.g., neural networks).
- **Cyclic behavior**: Markets may respond to cyclical macroeconomic or political events.

**Design Adaptations**:
- Use of classification instead of regression
- Inclusion of **cycle-sensitive indicators** (e.g., EBSW)
- Data normalization and cleaning to manage outliers
- Emphasis on **feature selection** to isolate the most robust predictors under volatility

In contrast, developed markets like the U.S. (IVV) exhibit:
- Greater stability and liquidity
- Broader sectoral diversification

Thus, fewer indicators are needed, and **trend-following indicators** (e.g., TTM Trend), along with the volume-based PVR, perform better. The study finds that the optimal feature sets vary meaningfully between emerging and developed markets.


---

## Step 2: Application

### Main Result Takeaways

1. **Feature Optimization Pays Off**:
   - The `Selected(5)` subset, comprising ~10 features (≈5% of the full set), achieved **equal or better accuracy** compared to the full 216-feature set.
   - Improvement in accuracy: **up to 13.63%**
   - Average reduction in training time: **84.68%**

2. **Binary Classification Is Effective**:
   - The model successfully predicts trend direction (up or down) using just technical indicators derived from daily market data.

3. **Robust Cross-Validation**:
   - Results were validated using **10-fold cross-validation**, enhancing the reliability of performance estimates.

4. **Emerging vs. Developed Market Differences**:
   - Emerging market ETFs (ECH, EWZ) tend to benefit more from **cycle-sensitive** and **volume-based** features.
   - Developed market ETF (IVV) performs better with **trend-following** and **momentum** indicators.

5. **Model Used**: Multilayer Perceptron (MLP) with logistic activation and adaptive learning, fine-tuned for binary output.


### Most Useful Features Identified

The top features selected in the `Selected(5)` subset include:

| Feature         | Category     | Description                                      |
|----------------|--------------|--------------------------------------------------|
| BBP_5_2.0       | Volatility    | Bollinger Band Percent                          |
| BOP             | Momentum     | Balance of Power                                |
| DEC_1           | Boolean       | Decreasing close (boolean delta)                |
| INC_1           | Boolean       | Increasing close (boolean delta)                |
| J_9_3           | Momentum     | KDJ random index                                |
| AOBV_LR_2       | Volume        | Archer's On Balance Volume                      |
| CTI_12          | Trend         | Correlation Trend Indicator                     |
| EBSW_40_10      | Cycles        | Even Better SineWave                            |
| STOCHRSIk_14... | Momentum     | Stochastic RSI                                  |
| TTM_TRND_6      | Trend         | TTM Trend (IVV only)                             |

These features:
- Capture price dynamics (INC, DEC)
- Detect market pressure (BOP, AOBV)
- Identify cyclical turning points (EBSW, KDJ)
- Track volatility-adjusted price positions (BBP)

Together, they form a **parsimonious yet highly informative feature set** that drives the model’s success.

---

## Step 3: Replication (Mini Lab)

### ETF Chosen: iShares MSCI Chile ETF (ECH)

```python
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from IPython.display import display

# Download historical data for ECH
ech = yf.download("ECH", start="2009-12-12", end="2020-01-01")
# Recent yfinance releases may return MultiIndex columns (field, ticker);
# keep only the field level so that ech['Open'] is a plain Series
if isinstance(ech.columns, pd.MultiIndex):
    ech.columns = ech.columns.get_level_values(0)
ech = ech[['Open']].dropna().copy()

# Create the response variable Γ_t (1 if today's open exceeds yesterday's)
ech['Gamma'] = (ech['Open'].diff() > 0).astype(int)
# The first row has no previous open, so its label is undefined; drop it
# (dropna() would not catch it, because NaN > 0 evaluates to False)
ech = ech.iloc[1:].copy()

# Example metric: dispersion (14-day rolling standard deviation)
ech['Dispersion'] = ech['Open'].rolling(window=14).std()
ech = ech.dropna()  # removes the rolling-window warm-up rows

# Min-max normalize features.
# Note: scaling on the full sample before cross-validation lets each training
# fold see the test fold's range; kept here for simplicity.
ech['Open_norm'] = (ech['Open'] - ech['Open'].min()) / (ech['Open'].max() - ech['Open'].min())
ech['Dispersion_norm'] = (ech['Dispersion'] - ech['Dispersion'].min()) / (ech['Dispersion'].max() - ech['Dispersion'].min())

X = ech[['Open_norm', 'Dispersion_norm']].values
y = ech['Gamma'].values

# 10-fold cross-validation (stratified, unshuffled)
kf = StratifiedKFold(n_splits=10)
acc_scores = []

for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Hidden layer size (2 features + 2 classes) / 2 = 2; random_state fixed for reproducibility
    model = MLPClassifier(hidden_layer_sizes=(2,), activation='logistic', solver='lbfgs',
                          max_iter=5000, random_state=0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc_scores.append(accuracy_score(y_test, y_pred))

# Print raw scores
print("Cross-Validated Accuracy Scores:", acc_scores)
print("Mean Accuracy:", np.mean(acc_scores))

# Display results as a table
results_df = pd.DataFrame({
    'Model Variant': ['Simplified MLP Model'],
    'Features Used': ['Open + Dispersion'],
    'Accuracy (%)': [round(np.mean(acc_scores) * 100, 2)],
    'Notes': ['Binary classification using normalized features, 10-fold CV']
})

print("\nReproduced Accuracy Table:")
display(results_df)

# Plot the open price and its 14-day dispersion
plt.figure(figsize=(14, 6))
plt.plot(ech.index, ech['Open'], label='Open Price')
plt.plot(ech.index, ech['Dispersion'], label='14-day Dispersion')
plt.title('ECH Open Price vs. Dispersion')
plt.xlabel('Date')
plt.ylabel('USD')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
```

---

# Page 1: Introduction and Data Sources – Web Scraping in Financial Analysis

## Introduction

In the modern era of digital finance, **alternative data** has gained prominence as an indispensable tool for investors, analysts, and researchers seeking insights beyond traditional financial statements and market data. Among the most accessible and powerful forms of alternative data is **web scraping**, a method for systematically extracting unstructured data from online sources. This page sets the foundation for understanding how web scraping is used in financial contexts, particularly for deriving investment signals, identifying trends, and improving the accuracy of predictive models.

The digital footprint left by companies, financial institutions, consumers, and the media is vast and constantly evolving. This footprint, if mined effectively, can reveal market sentiment, product demand, regulatory risks, and competitive positioning. As structured datasets from traditional vendors become commoditized, firms now seek alpha-generating signals in less conventional places—web forums, news portals, regulatory filings, and social media.

## Role of Web Scraping in Finance

**Web scraping** involves automating the collection of publicly accessible data from websites using tools and scripts. It allows for near real-time data harvesting from a wide array of sources, ranging from breaking news headlines to customer reviews. The main advantages include:

- **Speed**: Automated scripts can collect data at frequencies far higher than manual methods.
- **Coverage**: Scraping allows data collection across diverse domains and geographies.
- **Customizability**: Scrapers can be tuned to extract specific content (e.g., earnings dates, CEO quotes, job postings).

Financial institutions deploy web scraping to support **quantitative trading, risk monitoring, and portfolio construction**. For instance, hedge funds scrape Twitter to gauge public sentiment on stocks, while equity analysts scrape company investor relations pages for press releases before they reach terminals.

## Examples of Financial Web Scraping Use Cases

1. **Earnings Announcements**: Scraping the IR (Investor Relations) sections of public company websites to detect earnings releases or call transcripts.
2. **Regulatory Disclosures**: Downloading filings such as 10-Ks, 10-Qs, and 8-Ks from the SEC’s EDGAR database.
3. **News Monitoring**: Real-time scraping of news headlines for market-moving developments.
4. **Consumer Sentiment**: Mining product reviews or social media commentary to assess demand for a company’s products.
5. **Job Postings**: Tracking hiring trends through company careers pages and LinkedIn to assess expansion.

## Key Sources of Scraped Financial Data

| Source Type        | Examples                                                | Potential Insight                             |
|--------------------|---------------------------------------------------------|------------------------------------------------|
| **News Media**     | CNBC, Bloomberg, Financial Times                        | Event-driven volatility, earnings reports     |
| **Social Media**   | Twitter, Reddit, StockTwits                             | Retail sentiment, buzz analysis               |
| **Official Filings**| SEC EDGAR, Companies House                             | Legal disclosures, executive turnover         |
| **Forums & Blogs** | Seeking Alpha, Yahoo Finance                           | Analyst opinions, rumors                      |
| **Corporate Sites**| Press releases, financial calendars                     | Forecast changes, product launches            |
| **E-commerce**     | Amazon reviews, eBay listings                           | Demand proxies, pricing trends                |

## Importance for Alpha Generation

Alternative datasets scraped from the web are prized not only for their informational value but also for their timeliness. While traditional fundamental data such as financial statements is backward-looking and typically updated quarterly, web data reflects current market activity and sentiment. If properly processed, this can lead to:

- **Early signal detection**: Identifying trends before they are priced in
- **Behavioral insights**: Understanding irrational market reactions through social chatter
- **Sentiment overlay**: Enhancing models by including soft information alongside hard metrics

In sum, web scraping serves as a bridge between the deluge of public digital information and actionable investment insights. However, this power also comes with ethical and legal constraints, which are explored on Page 3.

---

*Next: Page 2 will delve into the different types of data obtained through web scraping and the quality concerns associated with them.*



# Page 2: Types of Data and Quality Concerns in Web Scraping

## Overview

In the financial domain, web scraping serves as a powerful tool to gather alternative datasets that are not available through traditional financial data providers. However, the utility of scraped data depends heavily on its **type**, **granularity**, and **quality**. On this page, we explore the major categories of data collected through web scraping and highlight the primary quality-related challenges that must be addressed to ensure reliability and consistency in financial modeling.

---

## 1. Types of Data Scraped from the Web

Web scraping can extract multiple formats and structures of data, often unstructured in nature. The most relevant data types for finance include:

### a. Textual Data
- **News Headlines**: Often scraped from financial news portals such as Bloomberg, CNBC, and Reuters.
- **Articles & Blogs**: Longer narratives from forums (e.g., Seeking Alpha) or economic commentary blogs.
- **Filings & Disclosures**: Legal documents and investor updates from sources like EDGAR or company websites.
- **Company Descriptions**: Extracted from public profile pages, used in fundamental analysis.

### b. Sentiment Indicators
- **Social Media Posts**: Tweets, Reddit threads, StockTwits messages—valuable for gauging investor mood.
- **Review Scores**: Star ratings from e-commerce or app platforms as a proxy for product performance.
- **Opinion Mining**: NLP-based parsing of subjective statements (e.g., bullish/bearish tone).

### c. Event Data
- **Product Launch Announcements**
- **Executive Changes or Layoffs**
- **M&A Rumors**
- **Earnings Call Schedules**

### d. Count-Based Metrics
- **Mention Frequency**: How often a company or ticker is discussed online.
- **Upvotes or Likes**: Proxy for virality or popularity of a financial opinion.
- **Comment Volume**: Especially useful on earnings threads or stock forums.

### e. Commercial Metrics
- **Product Prices and Discounts**: Scraped from Amazon, Walmart, etc.
- **Availability & Stock-outs**: Indicates supply chain disruptions.
- **Competitor Monitoring**: Product rollouts and pricing strategy.

These data types are commonly used to build **nowcasting models**, **news-based volatility filters**, and **event-based alpha factors** in hedge funds and quant trading shops.

---

## 2. Data Quality Challenges in Web Scraping

Although the breadth of data available through web scraping is impressive, it introduces several **quality-related issues** that must be managed carefully before applying the data in predictive or trading models:

### a. Noise and Redundancy
- Duplicate content across news aggregators or reposted tweets can inflate signal strength.
- Scraped HTML often contains boilerplate material, ads, or irrelevant metadata.

### b. Inconsistent Formatting
- Web structures change frequently, which can break scrapers.
- Lack of standardized APIs leads to irregular data fields and encoding issues.

### c. Missing or Incomplete Data
- Headlines may be truncated.
- Dates or authors may be missing.
- Social media posts may be deleted or shadowbanned.

### d. Temporal Misalignment
- Data timestamp may reflect extraction time, not publication time.
- Real-time applications require proper synchronization across data sources.

### e. Source Reliability and Bias
- Forums like Reddit or Twitter are prone to **herding behavior**, **pump-and-dump schemes**, and **bot manipulation**.
- News sites may have **political or editorial biases** that affect the framing of information.

### f. Legal and Compliance Risks
- Some data is behind login walls or terms-of-service restrictions.
- Web scraping that ignores `robots.txt` or fair use principles may fall into a legal grey area.

---

## 3. Addressing Data Quality in Practice

To mitigate these issues, practitioners typically:
- **Apply cleaning pipelines**: Removing HTML tags, whitespace, duplicates, and invalid characters.
- **Normalize entities**: Mapping synonyms and ticker aliases to canonical names.
- **Use validation sets**: Benchmarking scraped sentiment or mentions against actual market reactions.
- **Monitor scraper health**: Automated alerting when a scraper fails or output quality declines.

Robust preprocessing is often the difference between noise and insight. The goal is to distill useful, investment-grade signals from the raw, chaotic web.
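
The cleaning and entity-normalization steps listed above could be sketched as follows, using only the standard library. The alias table and sample inputs are hypothetical, and the regex-based tag stripping is a simplification; a real pipeline would more likely rely on an HTML parser such as BeautifulSoup.

```python
import html
import re
import unicodedata

# Hypothetical alias table; a real one would be maintained per coverage universe.
TICKER_ALIASES = {
    "apple": "AAPL",
    "apple inc": "AAPL",
    "alphabet": "GOOGL",
    "google": "GOOGL",
}

TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")

def clean_text(raw):
    """Strip tags, decode HTML entities, and normalize Unicode and whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> space
    return WS_RE.sub(" ", text).strip()

def canonical_entity(name):
    """Map a company name or alias to a canonical ticker, or None if unknown."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()
    return TICKER_ALIASES.get(key)

def dedupe(texts):
    """Drop case-insensitive exact duplicates, keeping the first occurrence."""
    seen, unique = set(), []
    for text in texts:
        key = text.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

raw_items = [
    "<h3>Apple &amp; Google   rally</h3>",
    "<p>apple &amp; google rally</p>",
    "<div>Oil slips\u00a0again</div>",
]
cleaned = dedupe([clean_text(item) for item in raw_items])
print(cleaned)
print(canonical_entity("Apple Inc."))
```

Exact-match deduplication only catches verbatim reposts; lightly edited copies would need a fuzzy or hash-based similarity check, which is outside this sketch.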

---

*Next: Page 3 will examine the ethical and legal considerations around scraping and present example Python code to structure collected data.*

# Page 3: Ethical Considerations and Python Code for Web Scraping

## 1. Ethical and Legal Concerns in Web Scraping

Web scraping exists in a **legally ambiguous** and ethically nuanced space. While scraping publicly available information may seem harmless, it can cross ethical boundaries or even legal ones depending on context, source, and intent. Financial professionals, especially those operating in regulated environments, must carefully navigate this domain.

### a. Terms of Service (ToS) Violations
- Many websites explicitly disallow scraping in their ToS.
- Ignoring these restrictions can lead to legal consequences, including cease-and-desist letters or lawsuits.
- High-profile cases like *hiQ Labs v. LinkedIn* show that scraping even publicly visible data can end in litigation.

### b. Server Load and Resource Abuse
- Aggressive scraping scripts can **overwhelm servers**, disrupting the host site.
- Ethical scraping includes **rate limiting**, **respecting `robots.txt`**, and **avoiding denial-of-service behavior**.

### c. Data Ownership and Copyright
- Scraped content (e.g., analyst reports or editorial articles) may be protected by copyright law.
- Republishing such data or using it for commercial gain without attribution or license may constitute infringement.

### d. Privacy Risks
- Scraping forums or social media can inadvertently collect **personally identifiable information (PII)**.
- Examples include usernames, geolocations, or health-related disclosures.
- Responsible data use demands **de-identification**, **anonymization**, and **compliance with regulations like GDPR**.
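
A minimal de-identification sketch is shown below. It replaces usernames with a keyed hash and masks e-mail addresses and `@handles` in the post body. The secret key, the regular expressions, and the sample post are assumptions for illustration. Keyed hashing is pseudonymization rather than full anonymization, so the output may still count as personal data under GDPR.

```python
import hashlib
import hmac
import re

# Assumption: in practice the key is loaded from a secrets manager, never hard-coded.
SECRET_KEY = b"replace-with-a-secret-from-your-vault"

# Simplified patterns; real-world e-mail and handle formats vary more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
HANDLE_RE = re.compile(r"(?<!\w)@\w+")

def pseudonymize(username, key=SECRET_KEY):
    """Replace a username with a keyed hash so one author's posts can still be grouped."""
    digest = hmac.new(key, username.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def redact(text):
    """Mask e-mail addresses first, then @handles, inside a post body."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return HANDLE_RE.sub("[USER]", text)

# Hypothetical scraped post.
post = {"user": "TraderJoe42",
        "text": "Ping @bigshort or mail joe@example.com about $TSLA"}

safe_post = {"author_id": pseudonymize(post["user"]),
             "text": redact(post["text"])}
print(safe_post)
```

Using an HMAC instead of a plain hash means that someone without the key cannot confirm a guessed username by hashing it themselves.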

### e. Reputational and Compliance Risk
- Financial firms deploying scrapers must consider internal **risk policies**, **compliance audits**, and **client-facing disclosures**.
- Regulatory bodies may require documentation of data provenance and assurance that models are not based on unauthorized data.


## 2. Best Practices for Ethical Web Scraping

To conduct ethical web scraping in a financial context:

- Use **public APIs** when possible (e.g., the Twitter API, or the SEC's EDGAR data, which is served over HTTPS now that the legacy FTP service has been retired).
- Always **check and respect `robots.txt`** files to see what is allowed.
- **Throttle requests** (e.g., 1 request/second) to avoid overloading servers.
- **Log access patterns** and **implement exponential backoff** on errors.
- Avoid **scraping paywalled or account-protected content** unless permission is obtained.
- Maintain a record of **source licenses, ToS screenshots, and scraping logs**.

Adhering to these practices reduces risk while preserving the integrity of the data science process.
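
The practices above (checking `robots.txt`, throttling, and exponential backoff) might be combined roughly as in the sketch below. The user-agent name, the delay values, and the injected `fetch` callable are assumptions. Passing `fetch` in keeps the example independent of any HTTP library; in practice it could wrap `requests.get`.

```python
import time
from urllib import robotparser

USER_AGENT = "research-bot-example"  # hypothetical agent name

def is_allowed(robots_txt, url, agent=USER_AGENT):
    """Check a URL against the text of an already-downloaded robots.txt file."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

def fetch_with_backoff(fetch, url, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on a network error wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:  # covers ConnectionError, timeouts, and requests' exceptions
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# Hypothetical robots.txt content for demonstration.
robots = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(robots, "https://example.com/news/"))
print(is_allowed(robots, "https://example.com/private/report"))
```

A typical loop would call `Throttle.wait()` before each request and wrap the request itself in `fetch_with_backoff`, skipping any URL for which `is_allowed` returns `False`.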


## 3. Sample Python Code for Basic Scraping and Structuring

Below is a simplified example using `requests` and `BeautifulSoup` to scrape and structure headlines from a financial news website (e.g., CNBC):

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

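# Define the target URL and request headers
url = "https://www.cnbc.com/finance/"
headers = {"User-Agent": "Mozilla/5.0"}

# Fetch the page; time out after 10 seconds and raise on HTTP error codes
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()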

# Parse the HTML
soup = BeautifulSoup(response.content, 'html.parser')

# Extract headline elements (adjust based on site structure)
data = []
for article in soup.find_all('a', class_='Card-title'):
    headline = article.get_text(strip=True)
    link = article.get('href')
    data.append({"headline": headline, "url": link})

# Store in a structured format
headlines_df = pd.DataFrame(data)
print(headlines_df.head())
```

### Notes:
- This is **for educational purposes only**. Sites may block or throttle bots.
- Always refer to the site’s `robots.txt` (e.g., `https://www.cnbc.com/robots.txt`).
- For production-grade scraping, consider tools like `Selenium`, `Scrapy`, or `Playwright`.

---

*Next: Page 4 will showcase exploratory data analysis techniques applied to scraped financial text data, including sentiment and keyword analysis.*

# Page 4: Exploratory Data Analysis of Web-Scraped Financial Text

## Overview

Once financial data has been scraped and structured, the next critical step is **Exploratory Data Analysis (EDA)**. EDA helps uncover patterns, spot anomalies, and identify trends within the data, allowing analysts to evaluate signal strength before feeding it into predictive models. For textual data such as headlines or social media posts, EDA often involves **natural language processing (NLP)** techniques like sentiment analysis, keyword frequency, and topic modeling.


## 1. Data Preparation

Before performing EDA, the raw text needs to be cleaned and normalized:

### a. Preprocessing Steps
- **Lowercasing** all text
- **Removing punctuation**, numbers, and stopwords
- **Tokenization** (splitting text into words)
- **Lemmatization or stemming** to reduce words to their base forms
- **Handling duplicates** and missing values

```python
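import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time download of the NLTK resources used below
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()                    # lowercase
    text = re.sub(r'[^a-z\s]', '', text)   # drop punctuation and digits
    words = text.split()                   # whitespace tokenization
    words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return ' '.join(words)

# Start from the scraped headlines (headlines_df from Page 3),
# dropping missing and duplicate headlines first
df = headlines_df.dropna(subset=['headline']).drop_duplicates(subset='headline').copy()
df['cleaned_headline'] = df['headline'].apply(preprocess)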
```


## 2. Word Frequency and Visualization

After cleaning the text, a simple yet insightful step is analyzing word frequency using a **word cloud** or bar plot:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Create a single text blob
all_text = ' '.join(df['cleaned_headline'])

# Generate word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)

# Plot
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Most Frequent Words in Financial Headlines")
plt.show()
```

### Interpretation:
Frequent keywords like “earnings”, “inflation”, “cut”, or “growth” might reflect macroeconomic concerns or company-specific trends.
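
For the bar-plot alternative mentioned above, the word counts themselves can be computed with `collections.Counter`. The sketch below uses a few hypothetical cleaned headlines in place of `df['cleaned_headline']`.

```python
from collections import Counter

# Hypothetical cleaned headlines standing in for df['cleaned_headline'].
cleaned_headlines = [
    "fed signal rate cut inflation cool",
    "earnings beat lift tech stock",
    "inflation data weigh growth stock",
    "bank earnings miss rate cut hope",
]

def top_words(texts, n=5):
    """Return the n most common tokens across whitespace-tokenized texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)

top = top_words(cleaned_headlines, n=5)
for word, count in top:
    print(f"{word:<10} {count}")
```

The resulting `(word, count)` pairs can be unpacked into two sequences and passed to `plt.bar` to draw the bar plot, which often makes exact frequencies easier to compare than a word cloud does.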


## 3. Sentiment Analysis

Sentiment analysis provides a proxy for public or media mood around market topics. We can use `TextBlob` or `VADER` to derive sentiment polarity:

```python
from textblob import TextBlob

# Compute polarity score, ranging from -1 (negative) to +1 (positive)
df['sentiment'] = df['headline'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Plot sentiment distribution
plt.hist(df['sentiment'], bins=20, edgecolor='black')
plt.title("Distribution of Headline Sentiment")
plt.xlabel("Sentiment Polarity")
plt.ylabel("Frequency")
plt.show()
```

### Insight:
A skew toward positive sentiment may suggest optimism, while high variance in sentiment can reflect uncertainty or mixed market signals.
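
To put numbers on this reading of the histogram, a few summary statistics can be computed. The sketch below uses hypothetical polarity scores: the mean shows which way sentiment leans overall, the variance shows dispersion, and the skewness shows which tail of the distribution is longer.

```python
import statistics

# Hypothetical polarity scores in [-1, 1], e.g. taken from df['sentiment'].
scores = [0.5, 0.1, 0.0, -0.2, 0.8, 0.3, 0.0, -0.6, 0.4, 0.2]

def describe_sentiment(values):
    """Return the mean, population variance, and skewness of a list of scores."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    std = variance ** 0.5
    if std == 0:
        skewness = 0.0  # all scores identical: no asymmetry to measure
    else:
        # Fisher-Pearson skewness: third central moment divided by std cubed.
        third_moment = sum((v - mean) ** 3 for v in values) / len(values)
        skewness = third_moment / std ** 3
    return {"mean": mean, "variance": variance, "skewness": skewness}

stats = describe_sentiment(scores)
print(stats)
```

In this made-up sample the mean is positive while the skewness is negative, meaning the overall tone leans optimistic but the most extreme headlines are negative ones. The two statistics answer different questions and are worth reading together.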

## 4. Temporal Analysis (Optional)

If timestamp data is available, you can analyze **sentiment or keyword spikes over time**:

```python
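# Example: Daily average sentiment (assumes a 'date' column holding publication timestamps)
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
sentiment_by_date = df.groupby(df['date'].dt.date)['sentiment'].mean()
sentiment_by_date.plot(figsize=(10, 4), title="Daily Average Sentiment")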
plt.ylabel("Sentiment Score")
plt.xlabel("Date")
plt.show()
```
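
One simple way to flag sentiment spikes, rather than only eyeballing the plot, is a rolling z-score against recent history. The series values, the five-day window, and the threshold of 3 in the sketch below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical daily average sentiment; the last value is an artificial jump.
daily = pd.Series(
    [0.10, 0.12, 0.08, 0.11, 0.09, 0.10, 0.45],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
    name="sentiment",
)

window = 5  # assumed look-back length in days

# Shift by one day so each observation is compared only with earlier days.
baseline_mean = daily.shift(1).rolling(window).mean()
baseline_std = daily.shift(1).rolling(window).std()
zscore = (daily - baseline_mean) / baseline_std

# Assumed threshold: flag days more than 3 standard deviations from baseline.
spikes = zscore[zscore.abs() > 3]
print(spikes)
```

Excluding the current day from its own baseline keeps a large jump from inflating the standard deviation it is measured against.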


## 5. Conclusion

Through EDA, analysts can transform unstructured text into actionable insights. High-frequency keywords reveal thematic concentrations, while sentiment scores can serve as predictive features. Although EDA is not sufficient for final decision-making, it is an essential **first step** in validating whether scraped data has potential alpha-generating properties.

---

*Next: Page 5 will provide a short literature review of related work using web-scraped data in financial applications, along with references.*



# Page 5: Literature Review and References

## 1. Introduction

A robust body of academic and industry research supports the use of web-scraped data for financial analysis. This page provides a concise **literature review** of key studies on the predictive power and strategic value of alternative web data, particularly news articles, social media sentiment, and financial forum discussions. The studies cover stock price prediction, sentiment analysis, and market trend detection, and they illustrate the growing recognition of web-scraped data as a valuable data source in quantitative finance.

## 2. Selected Studies

### a. Bollen et al. (2011) – Twitter Mood Predicts the Stock Market
- **Key Contribution**: Introduced the concept of using aggregated Twitter sentiment (mood states) to predict the Dow Jones Industrial Average.
- **Findings**: Daily Twitter mood series, particularly the "Calm" dimension, improved predictions of the index, with a reported **87.6% accuracy** in predicting the daily up and down changes of its closing value.
- **Significance**: This was one of the earliest large-scale validations of using social sentiment as a market signal.

### b. Huang et al. (2018) – Sentiment and Asset Pricing
- **Journal**: *Journal of Financial Economics*
- **Key Contribution**: Examined how online investor sentiment influences stock returns.
- **Findings**: Sentiment extracted from Internet forums offered information beyond fundamentals, particularly for hard-to-value or highly volatile stocks.
- **Significance**: Empirical evidence supporting the inclusion of web-derived sentiment in pricing models.

### c. Nassirtoussi et al. (2015) – Systematic Review on Text Mining for Market Prediction
- **Scope**: Systematic review of 30+ studies using NLP for financial forecasting.
- **Findings**: Concluded that news articles and social media, when properly preprocessed, can enhance market prediction models.
- **Added Value**: Categorized methodologies (SVM, Decision Trees, Naive Bayes) and compared their performance.

### d. Pagolu et al. (2016) – Sentiment Analysis for Stock Movement Prediction
- **Conference**: International Conference on Computer Communication and Informatics (ICCCI)
- **Key Contribution**: Used `TextBlob` and Twitter data to classify stock movement as up or down.
- **Model Accuracy**: Achieved promising classification accuracy with minimal features.
- **Significance**: Reinforced the use of simple NLP tools for retail investors.


## 3. Emerging Trends in Web-Scraped Financial Data

- **Topic Modeling**: Use of LDA and BERT-based embeddings to identify emerging financial themes.
- **Multi-Modal Approaches**: Combining sentiment, volume, and price signals across different web platforms.
- **Real-Time Alpha**: Integration of streaming web data into high-frequency trading pipelines.
- **Event Detection**: Automated detection of exogenous shocks (e.g., bankruptcy filings, CEO scandals) before they hit mainstream news.



## 4. Research Implications

These studies support the **practical viability** of using web-scraped alternative data in asset management, trading, and market surveillance. They suggest that:
- **Investor sentiment**, even from noisy platforms, can carry predictive content.
- **NLP pipelines**, when properly tuned, can provide useful signals for trend detection.
- **Real-time scraping** and analysis can supplement or even replace some traditional data vendors.

However, successful application requires attention to **data quality**, **bias mitigation**, and **ethical use**, as discussed in earlier sections.


## 5. References (MLA Format)

- Bollen, Johan, Huina Mao, and Xiaojun Zeng. “Twitter Mood Predicts the Stock Market.” *Journal of Computational Science*, vol. 2, no. 1, 2011, pp. 1–8.
- Huang, Alan, Annie Zang, and Rong Zheng. “Evidence from the Internet: Investors’ Sentiment and Stock Returns.” *Journal of Financial Economics*, vol. 129, no. 2, 2018, pp. 315–338.
- Nassirtoussi, A.K., et al. “Text Mining for Market Prediction: A Systematic Review.” *Expert Systems with Applications*, vol. 42, no. 1, 2015, pp. 613–632.
- Pagolu, Venkata Subrahmanyam, et al. “Sentiment Analysis of Twitter Data for Predicting Stock Market Movements.” *2016 International Conference on Computer Communication and Informatics (ICCCI)*, IEEE, 2016.

---

*This completes the 5-page user guide on evaluating web scraping as a category of alternative data for financial applications.*