<a href="https://colab.research.google.com/github/leomercanti/Beginner_Investing_with_AI/blob/main/Module_2_Data_Science_Fundamentals_for_Investing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Module 2 - Data Science Fundamentals for Investing**

- **Objective:** Acquire skills for handling and analyzing financial data.

- **Topics:**
  - **Data Collection:** Using APIs and web scraping.
  - **Data Cleaning:** Handling missing values, normalization.
  - **Exploratory Data Analysis (EDA):** Statistical summaries, visualizations.
  - **Feature Engineering:** Technical indicators, lag features, feature selection.

- **Readings:**
  - “Python for Data Analysis” by Wes McKinney.

### **2.1 Data Collection and Cleaning**

- **Objective:** Learn how to gather and prepare financial data for analysis.

#### **Data Collection**

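  - **APIs:** Many financial platforms offer APIs to access market data. For example, Alpha Vantage and Nasdaq Data Link (formerly Quandl) provide a range of financial data sets, and Yahoo Finance data can be retrieved through the community-maintained yfinance library.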
  
  - **Web Scraping:** Another method is to scrape financial data from websites. Libraries like BeautifulSoup and Scrapy can help with this.
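
As a rough illustration of the web scraping approach, the sketch below parses a small HTML table with BeautifulSoup. The HTML string, the `quotes` table id, and the tickers and prices are all made up for this example; a real page would have to be downloaded first and would have its own structure (and its own terms of use to check).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (made-up tickers and prices).
html = """
<table id="quotes">
  <tr><th>Ticker</th><th>Close</th></tr>
  <tr><td>AAA</td><td>101.50</td></tr>
  <tr><td>BBB</td><td>47.25</td></tr>
</table>
"""

# Parse with Python's built-in HTML parser, so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="quotes")

quotes = {}
for row in table.find_all("tr")[1:]:  # skip the header row
    ticker_cell, close_cell = row.find_all("td")
    quotes[ticker_cell.get_text(strip=True)] = float(close_cell.get_text(strip=True))

print(quotes)
```

The same pattern (locate the table, iterate over rows, convert cell text to numbers) carries over to real pages, although the selectors will differ from site to site.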

- **Hands-on Example:** Fetching Yahoo Finance Data with yfinance

In [None]:
# Install required libraries - only needed if you are running this code outside of Google Colab
!pip install yfinance

In [None]:
import yfinance as yf

In [None]:
# Fetch historical data for a stock
data = yf.download('AAPL', start='2021-01-01', end='2024-09-01')

In [None]:
# Display first few rows
print(data.head())

- **Explanation:** This code uses the yfinance library to download historical daily stock data for Apple (AAPL), covering January 1, 2021, up to but not including September 1, 2024 (the `end` date is exclusive).

#### **Data Cleaning**

  - **Handling Missing Values:** Financial data often contains missing values. Common strategies include forward-fill, backward-fill, or interpolation.
  - **Normalization:** Scale features to ensure they contribute equally to the analysis. Techniques include min-max scaling or z-score normalization.
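
To make the difference between the three fill strategies concrete, here is a minimal sketch on a made-up price series with two consecutive gaps. The values and dates are illustrative only, not real market data.

```python
import numpy as np
import pandas as pd

# Made-up prices with two missing days, for illustration only.
prices = pd.Series(
    [100.0, np.nan, np.nan, 106.0, 108.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="price",
)

comparison = pd.DataFrame({
    "raw": prices,
    "ffill": prices.ffill(),                             # carry the last known value forward
    "bfill": prices.bfill(),                             # pull the next known value backward
    "interpolate": prices.interpolate(method="linear"),  # straight line between known values
})
print(comparison)
```

Note that backward-fill and interpolation both use values from later dates, so forward-fill is usually the safer choice when the cleaned data feeds a backtest or a predictive model.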

- **Hands-on Example:** Cleaning Data

In [None]:
# Forward-fill missing values (DataFrame.ffill replaces the deprecated fillna(method='ffill'))
data = data.ffill()

In [None]:
# Display cleaned data
print(data.head())

- **Explanation:** This code forward-fills missing values in the dataset, ensuring that each missing value is replaced with the last valid observation.
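
The normalization techniques mentioned earlier in this section can be sketched in a few lines of pandas. The small DataFrame below uses made-up numbers, and its column names are chosen only to resemble price and volume data.

```python
import pandas as pd

# Made-up values on very different scales, for illustration only.
df = pd.DataFrame({
    "Close": [10.0, 12.0, 14.0, 16.0, 18.0],
    "Volume": [100.0, 300.0, 200.0, 500.0, 400.0],
})

# Min-max scaling: map each column onto the range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score normalization: zero mean and unit standard deviation per column.
# ddof=0 uses the population standard deviation; pandas defaults to ddof=1.
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(z_score)
```

When the scaled data is later used to train a model, the minimum, maximum, mean, and standard deviation should be computed on the training period only and then applied to later data, so that no future information leaks into the features.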

### **2.2 Exploratory Data Analysis (EDA)**

- **Objective:** Analyze and visualize financial data to uncover patterns and insights.

#### **Descriptive Statistics**

  - **Summary Statistics:** Calculate measures such as the mean, median, and standard deviation to understand the data's distribution.

- **Hands-on Example:** Descriptive Statistics

In [None]:
# Summary statistics
print(data.describe())

- **Explanation:** The `describe` method reports the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum of each numeric column.

#### **Data Visualization**

  - **Time Series Analysis:** Plot time-series data to observe trends, seasonality, and anomalies.

  - **Correlation Analysis:** Use scatter plots and heatmaps to understand relationships between variables.

- **Hands-on Example:** Time Series Plot

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Plot closing prices
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Close'], label='AAPL Closing Price')
plt.title('AAPL Closing Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.show()

- **Hands-on Example:** Correlation Heatmap

In [None]:
# Calculate correlation matrix
correlation_matrix = data.corr()

In [None]:
# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

- **Explanation:** The time series plot shows the trend in closing prices, while the heatmap visualizes the correlation between different features in the dataset.

### **2.3 Feature Engineering**

- **Objective:** Create and select features that are relevant for machine learning models.

#### **Creating Features**

  - **Technical Indicators:** Compute indicators like moving averages, relative strength index (RSI), and volatility.

  - **Lag Features:** Include past values as features to capture temporal dependencies.

- **Hands-on Example:** Creating Moving Averages

In [None]:
# Calculate moving averages
data['SMA_20'] = data['Close'].rolling(window=20).mean()
data['SMA_50'] = data['Close'].rolling(window=50).mean()

In [None]:
# Plot closing price and moving averages
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Close'], label='AAPL Closing Price')
plt.plot(data.index, data['SMA_20'], label='20-Day SMA', color='orange')
plt.plot(data.index, data['SMA_50'], label='50-Day SMA', color='green')
plt.title('AAPL Closing Price and Moving Averages')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.show()

- **Explanation:** This code calculates and plots 20-day and 50-day simple moving averages (SMA), which are commonly used technical indicators.
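
The other features listed at the start of this section (lag features, volatility, and RSI) can be sketched in the same way. The example below runs on a synthetic random-walk price series rather than the downloaded AAPL data, so it works offline. The window lengths (20 and 14 days) and column names are assumptions for illustration, and the RSI shown is a simplified variant that uses plain rolling means, whereas Wilder's original formulation uses a smoothed average.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices (not real market data), seeded for reproducibility.
rng = np.random.default_rng(0)
close = pd.Series(
    100 + rng.normal(0, 1, 120).cumsum(),
    index=pd.bdate_range("2024-01-01", periods=120),
    name="Close",
)

features = pd.DataFrame({"Close": close})

# Daily simple returns.
features["Return"] = close.pct_change()

# Lag features: returns from 1 and 5 trading days earlier.
features["Return_lag1"] = features["Return"].shift(1)
features["Return_lag5"] = features["Return"].shift(5)

# Rolling volatility: 20-day standard deviation of daily returns.
features["Volatility_20"] = features["Return"].rolling(window=20).std()

# Simplified 14-day RSI based on plain rolling means of gains and losses.
delta = close.diff()
avg_gain = delta.clip(lower=0).rolling(window=14).mean()
avg_loss = (-delta.clip(upper=0)).rolling(window=14).mean()
rs = avg_gain / avg_loss
features["RSI_14"] = 100 - 100 / (1 + rs)

print(features.dropna().tail())
```

The first rows of each rolling or lagged column are NaN because there is not yet enough history to fill the window, which is why the sketch drops them before printing.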

#### **Feature Selection**

  - **Correlation:** Use correlation coefficients to identify relevant features.

  - **Feature Importance:** Algorithms like Random Forest can rank feature importance.

- **Hands-on Example:** Feature Selection with Correlation

In [None]:
# Recompute the correlation matrix so it includes the new SMA columns,
# then rank every feature by its correlation with the closing price
high_corr_features = data.corr()['Close'].sort_values(ascending=False)
print(high_corr_features)

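- **Explanation:** This code ranks every feature by its correlation with the target variable (closing price). Features near the top of the list are the most strongly positively correlated with it, and `Close` itself always appears with a correlation of 1.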
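
The second approach listed above, ranking features with a Random Forest, might look like the following sketch. It assumes scikit-learn is installed and uses synthetic data with hypothetical feature names (`signal`, `weak_signal`, `noise`), where the target is constructed to depend mostly on `signal`. It is not fitted to the AAPL data from the earlier cells.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features with hypothetical names; the target depends mostly on "signal".
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "weak_signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 3.0 * X["signal"] + 0.5 * X["weak_signal"] + rng.normal(scale=0.1, size=n)

# Fit the forest and read the impurity-based importances (they sum to 1).
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Unlike a correlation coefficient, these importances can pick up non-linear relationships, but they only describe how the fitted model used each feature, so they are best read as a rough ranking rather than a precise measurement.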

### **2.4 Further Reading and Resources**

- **Books:**
  - “Python for Data Analysis” by Wes McKinney
  - “Data Science for Finance” by Michael Halls-Moore

- **Online Courses:**

  - Coursera’s “Data Science Specialization” by Johns Hopkins University
  - DataCamp’s “Introduction to Financial Concepts in Python”

- **Websites:**

  - [Kaggle](https://www.kaggle.com/) for datasets and competitions related to financial data.