## 1. Structure & Guidelines

**Problem Statement & Dataset Decision**:
For this project, my partner and I have chosen to investigate whether incorporating social media sentiment (from Twitter/X and Reddit) into historical stock market data can enhance our trading models. Our dataset comprises two parts:
- **Historical Stock Data**: Daily OHLC and Volume CSV files stored in the `hist/` folder.
- **Social Media Sentiment**: Daily aggregated sentiment scores (generated via web scraping and NLP, e.g., using VADER) for selected tickers (stored in files like `sentiment_<TICKER>.csv`).
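
To make "daily aggregated sentiment" concrete, here is a minimal sketch of how post-level scores could be collapsed into one row per day. It assumes each scraped post already carries a VADER-style compound score in [-1, 1]. The `posts` frame and the column names `timestamp`, `compound`, and `n_posts` are hypothetical and not taken from our actual files.

```python
import pandas as pd

# Hypothetical post-level data: one row per scraped post, with a
# VADER-style compound score in [-1, 1]. Column names are assumptions.
posts = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-02 09:15", "2024-01-02 14:40", "2024-01-02 20:05",
        "2024-01-03 10:30", "2024-01-03 16:45",
    ]),
    "compound": [0.6, -0.2, 0.4, -0.5, -0.1],
})

# Collapse to one row per calendar day: mean score plus number of posts.
daily = (
    posts.groupby(posts["timestamp"].dt.normalize())["compound"]
    .agg(sentiment_score="mean", n_posts="count")
    .rename_axis("Date")
)
print(daily)
```

Keeping the post count next to the mean is a design choice: a daily score built from three posts deserves less trust than one built from three thousand.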

*Note: This problem statement will guide our entire analysis. Although we may refine the statement and dataset in future assignments, it serves as our starting point for now.*

---


## 2. Assignment Questions/Tasks

1) **Discuss & write down a problem statement**
   - Our focus: "Can social media sentiment improve short-term trading signal accuracy when combined with historical price data?"

2) **Find a Dataset(s) that will help you solve your problem**
   - We use historical stock CSV files (in `hist/`) and sentiment CSV files from Twitter/X and Reddit.

3) **EDA Study**
   - We will follow guides from Analytics Vidhya and the Python Data Science Handbook for our initial analysis.

4) **Start your EDA**
   - We will load our stock and sentiment data, perform data cleaning, and generate visualizations to form hypotheses.

5) **Use 5 additional visualizations/techniques**
   - We’ll include additional plots beyond the base example, such as time series overlays, scatter plots, and violin plots for sentiment.

6) **Write down insights**
   - We will summarize key findings and link them back to our problem statement.

---


## 3. Exploratory Data Analysis

Below we perform a detailed EDA on our combined dataset.
*Note: For illustration, we use a single ticker ("AAPL"). Adjust file names and parameters for additional tickers.*


### 1. Data Loading & Quick Overview

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# Load stock data (historical price data for AAPL)
df_price = pd.read_csv("hist/AAPL.csv")
df_price['Date'] = pd.to_datetime(df_price['Date'])
df_price.set_index('Date', inplace=True)

# Load sentiment data (daily aggregated sentiment scores for AAPL)
df_sent = pd.read_csv("sentiment_AAPL.csv")
df_sent['Date'] = pd.to_datetime(df_sent['Date'])
df_sent.set_index('Date', inplace=True)

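# Merge datasets on Date (left join keeps every trading day)
df = df_price.join(df_sent, how='left')

# Fill only the sentiment columns with 0 (neutral on a VADER-style scale);
# a blanket df.fillna(0) would also turn missing prices into zeros
sent_cols = df_sent.columns
df[sent_cols] = df[sent_cols].fillna(0)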

# Display first few rows
df.head()


#### Questions to Ponder:
1. Does the merged data capture both price and sentiment as expected?
2. Are the dates aligned properly between stock prices and sentiment?
3. Is there any immediate sign of missing or inconsistent data?
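
One way to check date alignment is an outer merge with pandas' `indicator=True`, which labels each date as present in both sources or only one. The sketch below uses small made-up frames in place of `df_price` and `df_sent`; the values are illustrative only.

```python
import pandas as pd

# Made-up stand-ins for df_price and df_sent (values are illustrative only).
price = pd.DataFrame(
    {"Close": [185.6, 184.3, 181.9, 181.2]},
    index=pd.DatetimeIndex(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"], name="Date"
    ),
)
sent = pd.DataFrame(
    {"sentiment_score": [0.21, -0.05, 0.33]},
    index=pd.DatetimeIndex(["2024-01-02", "2024-01-04", "2024-01-06"], name="Date"),
)

# An outer merge keeps every date from both sides; the _merge column says
# whether a date appears in both frames, only in prices, or only in sentiment.
check = price.reset_index().merge(
    sent.reset_index(), on="Date", how="outer", indicator=True
)
coverage = check["_merge"].value_counts()
print(coverage)

# Dates with sentiment but no price row (here a Saturday) are silently
# dropped by a left join on the price index.
print(check.loc[check["_merge"] == "right_only", "Date"])
```

Dates labeled `right_only` are typically weekends or holidays. Whether to discard that sentiment or roll it forward to the next trading day is a modeling decision we still have to make.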

---


### 2. Shape & Features

In [None]:

print(f"Dataset shape: {df.shape}")

# Display column names (e.g., Open, High, Low, Close, Volume, sentiment_score)
print("\nFeature Names:")
print(df.columns.tolist())


#### Questions to Ponder:
1. Is the dataset large enough for a robust analysis?
2. Do the features match our problem requirements?
3. Are there any columns that might need renaming or removal?

---


### 3. Data Types & Missing Values

In [None]:

print("Data Types:")
print(df.dtypes)

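# Sentiment values that were missing after the join were filled with 0 during
# loading, so those gaps no longer show up in this count
print("\nMissing Values Count:")
print(df.isnull().sum())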


#### Questions to Ponder:
- Should we drop or impute any missing values?
- Does the absence of sentiment on some dates provide insight?
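
If the absence of sentiment is itself informative, one option is to record it in a flag column before imputing. This is a minimal sketch on made-up data; the column names `sentiment_missing` and `sentiment_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up merged frame with two trading days lacking sentiment.
df_demo = pd.DataFrame(
    {
        "Close": [185.6, 184.3, 181.9, 181.2],
        "sentiment_score": [0.21, np.nan, 0.33, np.nan],
    },
    index=pd.DatetimeIndex(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"], name="Date"
    ),
)

# Record which days had no sentiment BEFORE imputing, so the information
# "nobody posted" is not confused with "posts were neutral".
df_demo["sentiment_missing"] = df_demo["sentiment_score"].isna()
df_demo["sentiment_filled"] = df_demo["sentiment_score"].fillna(0.0)

missing_share = df_demo["sentiment_missing"].mean()
print(f"Share of trading days without sentiment: {missing_share:.0%}")
```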

---


### 4. Summary Statistics & Outlier Detection

In [None]:

# Display summary statistics for numerical features
df.describe()


#### Questions to Ponder:
- Which features show extreme values or outliers?
- Are there any unexpected ranges in price or sentiment scores?
- Do outliers suggest market anomalies or data errors?
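
`describe()` only hints at outliers through the min, max, and quartiles. A common follow-up is the 1.5 × IQR rule, the same rule a boxplot uses by default to draw its whiskers. The helper `iqr_outliers` below is a hypothetical name, and the volume figures are made up.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up volume figures with one obvious spike.
volume = pd.Series([100, 102, 98, 101, 99, 103, 97, 500], name="Volume")
mask = iqr_outliers(volume)
print(volume[mask])
```

Applied to our data, the call would look like `iqr_outliers(df['Volume'])`. A flagged row still needs a manual look to decide whether it is a data error or a real market event.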

---


### 5. Univariate Analysis

In [None]:

# Define numerical features for visualization
num_features = ['Open', 'High', 'Low', 'Close', 'Volume', 'sentiment_score']

# Create histograms for numerical features
fig, ax = plt.subplots(2, 3, figsize=(18, 12))
for i, feature in enumerate(num_features):
    row, col = divmod(i, 3)
    sns.histplot(df[feature], kde=True, bins=30, ax=ax[row, col])
    ax[row, col].set_title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

# Boxplots for numerical variables
fig, ax = plt.subplots(2, 3, figsize=(18, 12))
for i, feature in enumerate(num_features):
    row, col = divmod(i, 3)
    sns.boxplot(y=df[feature], ax=ax[row, col])
    ax[row, col].set_title(f'Boxplot of {feature}')
plt.tight_layout()
plt.show()


#### Questions to Ponder:
- Are the distributions of stock prices and sentiment scores skewed?
- Which features show signs of potential outliers?

---


### 6. Bivariate Analysis

In [None]:

# Scatter plot: sentiment_score vs. Close
sns.scatterplot(data=df, x='sentiment_score', y='Close')
plt.title("Sentiment Score vs. Closing Price")
plt.show()

# Time series overlay: Closing price and sentiment
fig, ax1 = plt.subplots(figsize=(12, 6))
ax2 = ax1.twinx()
line1, = ax1.plot(df.index, df['Close'], color='blue', label='Close')
line2, = ax2.plot(df.index, df['sentiment_score'], color='red', label='Sentiment Score', alpha=0.7)
ax1.set_xlabel("Date")
ax1.set_ylabel("Closing Price", color='blue')
ax2.set_ylabel("Sentiment Score", color='red')
ax1.set_title("Closing Price & Sentiment Over Time")
# Pass handles from both axes; plt.legend() alone would list only the twin axis
ax2.legend(handles=[line1, line2], loc='upper left')
plt.show()

# Box plot: Compare sentiment by a price-level bin (e.g., high vs. low close)
df['price_bin'] = pd.qcut(df['Close'], 4, labels=["Low", "Medium", "High", "Very High"])
sns.boxplot(x='price_bin', y='sentiment_score', data=df)
plt.title("Sentiment Distribution Across Price Bins")
plt.show()


#### Questions to Ponder:
- Is there a visible relationship between sentiment and closing price?
- Do certain price bins show higher sentiment volatility?

---


### 7. Multivariate Analysis

In [None]:

# Correlation heatmap for numerical features
plt.figure(figsize=(10, 8))
sns.heatmap(df[num_features].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of Numerical Features")
plt.show()


#### Questions to Ponder:
- Which features are strongly correlated with price or sentiment?
- Do any correlations challenge our initial assumptions?
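
Price levels trend over time, so a correlation between `Close` and sentiment can look stronger or weaker than the day-to-day relationship really is. A minimal sketch of the usual remedy, correlating daily returns instead of levels, is shown below. The data is synthetic: a seeded random walk that is independent of sentiment by construction, so it only demonstrates the mechanics. The `return` column is a hypothetical name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 250
dates = pd.bdate_range("2023-01-02", periods=n)

# Synthetic inputs: sentiment is uniform noise, price is a random walk
# with drift that does not depend on sentiment at all.
sentiment = pd.Series(rng.uniform(-1, 1, n), index=dates, name="sentiment_score")
close = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0.001, 0.01, n))), index=dates, name="Close"
)

demo = pd.concat([close, sentiment], axis=1)
demo["return"] = demo["Close"].pct_change()  # daily percentage change

# Spearman rank correlation is less sensitive to extreme values than Pearson.
corr = demo[["Close", "return", "sentiment_score"]].corr(method="spearman")
print(corr.round(2))
```

On the real data, the same two lines (`pct_change()` followed by `corr(...)`) can be applied to `df` to compare the level-based and return-based correlations side by side.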

---


### 8. Next Steps & Additional Visualizations

#### Additional Visualizations/Techniques (choose at least 5):
1. **Rolling Mean & Standard Deviation**: Plot a rolling mean and rolling standard deviation for Close and sentiment_score.
2. **Violin Plots**: Visualize the distribution of sentiment scores per price_bin using violin plots.
3. **Lag Analysis**: Create scatter plots comparing today's sentiment with next-day price changes.
4. **Joint Plot**: Use a joint plot to examine the relationship between Volume and sentiment_score.
5. **Time Series Decomposition**: Decompose the Close price time series to inspect trend and seasonality.

*Add your code and findings for these techniques in your notebook.*
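
As a starting point for items 1 and 3 above, the sketch below computes rolling statistics and a next-day return column on a small made-up frame. The window length and the column names `close_roll_mean`, `close_roll_std`, and `next_day_return` are assumptions chosen for illustration.

```python
import pandas as pd

dates = pd.bdate_range("2024-01-01", periods=8)
demo = pd.DataFrame(
    {
        "Close": [100.0, 101.0, 99.0, 102.0, 103.0, 101.0, 104.0, 105.0],
        "sentiment_score": [0.2, -0.1, 0.4, 0.3, -0.2, 0.5, 0.1, 0.0],
    },
    index=dates,
)

# 1. Rolling mean and standard deviation (the first window-1 rows are NaN).
window = 3
demo["close_roll_mean"] = demo["Close"].rolling(window).mean()
demo["close_roll_std"] = demo["Close"].rolling(window).std()

# 3. Lag analysis: pair today's sentiment with TOMORROW's return.
# pct_change() gives the return realized on each day; shift(-1) moves
# tomorrow's return onto today's row, so no future data leaks into sentiment.
demo["next_day_return"] = demo["Close"].pct_change().shift(-1)
lagged = demo[["sentiment_score", "next_day_return"]].dropna()
print(lagged)
```

From here, `sns.scatterplot(data=lagged, x='sentiment_score', y='next_day_return')` would produce the lag scatter plot, and the rolling columns can be drawn with `demo[['Close', 'close_roll_mean']].plot()`.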

---


### 9. Insights & How It Relates Back to the Problem

**Key Insights**:
- The merged dataset successfully combines stock price and sentiment data.
- Preliminary EDA suggests that on days when sentiment is strongly positive (or negative), there might be noticeable shifts in closing prices.
- Some outliers in sentiment coincide with significant market events, suggesting that extreme sentiment could serve as a trading signal.
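- The correlation heatmap shows a moderate correlation between `sentiment_score` and `Close`. Because price levels trend over time, this should be confirmed on daily returns before we draw conclusions, and it warrants further model-based investigation.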

**Relation to the Problem Statement**:
- If sentiment scores can help predict short-term price movements, our trading model incorporating these features should outperform a model based solely on technical indicators.
- These insights will guide our feature engineering and modeling decisions in subsequent phases of the project.
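
To make "outperform" measurable, one simple yardstick is directional accuracy: how often the sign of today's sentiment matches the sign of the next day's return, compared with a naive baseline. The sketch below uses made-up numbers. The always-long baseline is an assumption standing in for a proper technical-indicator model, and the frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sentiment scores paired with next-day returns.
demo = pd.DataFrame(
    {
        "sentiment_score": [0.4, -0.3, 0.2, 0.5, -0.1, 0.3],
        "next_day_return": [0.01, -0.02, -0.005, 0.015, 0.004, 0.008],
    }
)

signal = np.sign(demo["sentiment_score"])   # +1 = expect up, -1 = expect down
outcome = np.sign(demo["next_day_return"])  # +1 = price rose, -1 = price fell

hit_rate = (signal == outcome).mean()  # share of days the signal was right
baseline = (outcome > 0).mean()        # share of up days (always-long guess)

print(f"Sentiment hit rate: {hit_rate:.2f}, always-long baseline: {baseline:.2f}")
```

A sentiment signal is only interesting if its hit rate clearly beats such a baseline on data that was not used to design it.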

---


## Resources
- [Analytics Vidhya - EDA using Python](https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/)
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [Elements of Data Science](https://allendowney.github.io/ElementsOfDataScience/)
- [VADER Sentiment Analysis Documentation](https://www.nltk.org/_modules/nltk/sentiment/vader.html)