# Correlation Analysis

## 1. Datasets 

We will use the following real-world data in this notebook.

- Daily movement of sectoral indices.
    - Banks
    - Metal
    - Healthcare
    - Tech
- For the period of **01 September 2022 to 14 March 2023**
- The daily price data can be downloaded from the **BSE India site**:

https://www.bseindia.com/indices/IndexArchiveData.html

## 2. Daily Data of Sensex 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
DATA_PATH = 'https://raw.githubusercontent.com/manaranjanp/MLCourseV1/main/Session_2/'

In [None]:
sensex_df = pd.read_csv(DATA_PATH+"Sensex.csv",
                        index_col=False, 
                        parse_dates=['Date'])

In [None]:
sensex_df.info()

In [None]:
sensex_df.head()

In [None]:
sensex_df = sensex_df.set_index('Date', drop=True)

### 2.a. Calculating daily gains

In [None]:
sensex_df['sensex_gain'] = (sensex_df.Close - sensex_df.Open)*100/sensex_df.Open

In [None]:
sensex_df[0:5]

## 3. Sectoral Indexes

For this analysis, we compare the daily movement of the following sectoral indexes with the Sensex.
   - Banks
   - Metal
   - Healthcare
   - Tech

### 3.a. Defining a method to load data and calculate gains

- Pass the file name; the method returns a DataFrame with the daily open and close prices and the gain percentage.

In [None]:
def get_sector_gain( filename ):

    # Read the csv file
    df = pd.read_csv(filename, index_col = False, parse_dates=['Date'])
    # Set the time index 
    df = df.set_index(['Date'], drop=True)

    # Sort the records based on time
    df.sort_index(ascending = True, inplace=True)

    # Daily gain: percentage change from the opening to the closing price
    df['gain'] = ((df['Close'] - df['Open']) * 100 /
                    df['Open'])

    # Keep only the Close, Open and gain columns for further analysis
    return df[['Close', 'Open', 'gain']]
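
To make the gain calculation concrete, here is a minimal sketch that applies the same formula to a tiny hand-made DataFrame. The prices and the name `toy_df` are made up for illustration and are not BSE data.

```python
import pandas as pd

# Hypothetical prices for illustration only (not real BSE data)
toy_df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2023-01-03", "2023-01-02"]),
        "Open": [200.0, 100.0],
        "Close": [190.0, 102.0],
    }
).set_index("Date").sort_index()

# Same formula as in get_sector_gain: change from open to close, in percent of the open
toy_df["gain"] = (toy_df["Close"] - toy_df["Open"]) * 100 / toy_df["Open"]
print(toy_df)
```

Note that this gain measures the move from open to close within a single day, not the change from one day's close to the next.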

### 3.b. Loading the data for various sectors

In [None]:
bankex_df = get_sector_gain(DATA_PATH+"BSE_BANKEX.csv")
metal_df = get_sector_gain(DATA_PATH+"BSE_Metal.csv")
healthcare_df = get_sector_gain(DATA_PATH+"BSE_Healthcare.csv")
tech_df = get_sector_gain(DATA_PATH+"BSE_Tech.csv")

### 3.c. Collecting the gains of the various sectors

In [None]:
sensex_df['bankex_gain'] = bankex_df['gain']
sensex_df['metal_gain'] = metal_df['gain']
sensex_df['healthcare_gain'] = healthcare_df['gain']
sensex_df['tech_gain'] = tech_df['gain']
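
Assigning a column from another DataFrame, as done here, relies on pandas aligning the rows by their `Date` index rather than by position. The sketch below illustrates this with small made-up frames; `market` and `sector` are hypothetical names and the values are not real data.

```python
import pandas as pd

# Hypothetical gains for illustration only
market = pd.DataFrame(
    {"sensex_gain": [0.5, -0.2, 0.1]},
    index=pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04"]),
)
sector = pd.DataFrame(
    {"gain": [0.3, 0.8]},
    # Unsorted, and 2023-01-03 is missing
    index=pd.to_datetime(["2023-01-04", "2023-01-02"]),
)

# Assignment aligns on the index labels, not on row position
market["sector_gain"] = sector["gain"]
print(market)

# Dates absent from the sector data become NaN
print(market["sector_gain"].isna().sum())
```

If a sector file is missing some trading days, the corresponding rows hold `NaN`. `DataFrame.corr()` excludes missing values pairwise, so such rows are simply left out of the correlation for that pair of columns.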

## 4. Insights

Correlation is often used in portfolio management to measure the amount of diversification among the assets contained in a portfolio. This analysis looks at the following questions:
1. Does a sectoral index go up when the market index goes up? How strong or weak is this co-movement?
2. Are there any sectors that have a weak relationship with the Sensex?
3. How can we select stocks, or stocks from specific sectors, to create a portfolio that reduces the overall risk?

## 5. What is correlation?

Correlation measures the extent to which two variables are linearly related (meaning they change together at a constant rate).

- We can observe the correlation using **scatter plot**.

<img src="correlation.png" alt="Scatter plots illustrating correlation" width="600"/>

Source: *https://www.investopedia.com/ask/answers/032515/what-does-it-mean-if-correlation-coefficient-positive-negative-or-zero.asp*
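
The word *linearly* matters: correlation can be close to zero even when two variables are perfectly related in a non-linear way. The following sketch, using synthetic values chosen only for illustration, compares an exact linear relationship with an exact quadratic one.

```python
import numpy as np

# Synthetic x values, symmetric around zero
x = np.linspace(-1.0, 1.0, 201)

y_linear = 3.0 * x + 1.0   # exact linear relationship
y_square = x ** 2          # exact, but non-linear, relationship

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_square = np.corrcoef(x, y_square)[0, 1]

print(round(r_linear, 3), round(r_square, 3))
```

The linear case gives a correlation of essentially 1, while the quadratic case gives essentially 0 even though `y_square` is fully determined by `x`.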

### 5.a. Scatter Plot between Sensex and Sectoral Indexes

In [None]:
plt.figure(figsize=(12, 6))
sn.scatterplot(data = sensex_df, x = 'sensex_gain', y = 'bankex_gain');

In [None]:
plt.figure(figsize=(12, 6))
sn.scatterplot(data = sensex_df, x = 'sensex_gain', y = 'tech_gain');

In [None]:
plt.figure(figsize=(12, 6))
sn.scatterplot(data = sensex_df, x = 'sensex_gain', y = 'healthcare_gain');

In [None]:
plt.figure(figsize=(12, 6))
sn.scatterplot(data = sensex_df, x = 'bankex_gain', y = 'metal_gain');


## 6. Strength of Correlation 

We describe correlations with a unit-free measure called the correlation coefficient which ranges from -1 to +1. The correlation is denoted by **r**.

- The closer the correlation is to zero, the weaker the linear relationship.
- Positive correlation values indicate a positive correlation, where the values of both variables tend to increase together.
- Negative correlation values indicate a negative correlation, where the values of one variable tend to increase when the values of the other variable decrease.

Source: https://www.jmp.com/en_in/statistics-knowledge-portal/what-is-correlation.html


Correlation is given by:

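$$r = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}
{(n-1)\,\sigma_{x}\sigma_{y}}$$

where $\overline{x}$ and $\overline{y}$ are the sample means, and $\sigma_{x}$ and $\sigma_{y}$ are the sample standard deviations of the two variables.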

This is also known as **Pearson Correlation**.
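
As a check on the definition, the sketch below computes the coefficient by hand (sum of products of deviations, divided by n - 1 and by the two sample standard deviations) and compares it with the value pandas returns. The input values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily gains (in percent), for illustration only
x = np.array([0.5, -0.2, 0.1, 0.9, -0.4])
y = np.array([0.7, -0.1, 0.0, 1.1, -0.6])

n = len(x)
# Sum of products of deviations from the means
cross = np.sum((x - x.mean()) * (y - y.mean()))
# Sample standard deviations (ddof=1 divides by n - 1)
r_manual = cross / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

# pandas computes the Pearson correlation by default
r_pandas = pd.Series(x).corr(pd.Series(y))

print(round(r_manual, 4), round(r_pandas, 4))
```

Both values should agree up to floating-point rounding.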

As a rule of thumb, the strength of the relationship can be read from the magnitude of r:

- | r | < 0.25 - No relationship
- 0.25 < | r | < 0.5 - Weak relationship
- 0.5 < | r | < 0.75 - Moderate relationship
- | r | > 0.75 - Strong relationship

The definition of a “weak” correlation can vary from domain to domain. Check the link below:
   - https://www.statology.org/what-is-a-weak-correlation/
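
The bands listed above can be wrapped in a small helper. This is only a sketch: the function name `correlation_strength` is hypothetical, and since the bands do not say where the exact boundary values belong, the code assumes they fall into the stronger band.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the rule-of-thumb bands."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    magnitude = abs(r)
    # Assumption: exact boundary values (0.25, 0.5, 0.75) go to the stronger band
    if magnitude < 0.25:
        return "No relationship"
    if magnitude < 0.5:
        return "Weak relationship"
    if magnitude < 0.75:
        return "Moderate relationship"
    return "Strong relationship"


for r in (0.1, -0.3, 0.6, -0.9):
    print(r, correlation_strength(r))
```

The sign is ignored on purpose: a coefficient of -0.9 is as strong as +0.9, only in the opposite direction.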

In [None]:
sensex_df[['sensex_gain', 'bankex_gain']].corr()

### 6.a. Creating a Heatmap

In [None]:
sector_corr = sensex_df[['sensex_gain', 
                         'bankex_gain', 
                         'metal_gain',
                         'healthcare_gain',
                         'tech_gain']].corr()
sector_corr

In [None]:
plt.figure(figsize=(8, 6))
sn.heatmap(sector_corr,
           annot = True,
           fmt = "0.2f",
           cmap = sn.diverging_palette(240, 10),
           vmin = -1.0, 
           vmax = 1.0);
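
To see which sectors move least with the market, the Sensex column of a correlation matrix can be ranked. The sketch below uses synthetic gains rather than the BSE files; the column names `sector_a_gain` and `sector_b_gain` and the noise levels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
market = rng.normal(size=200)

# Synthetic gains: each sector mixes the market move with its own noise
gains = pd.DataFrame({
    "sensex_gain": market,
    "sector_a_gain": market + 0.3 * rng.normal(size=200),  # little own noise
    "sector_b_gain": market + 2.0 * rng.normal(size=200),  # mostly own noise
})

corr = gains.corr()
# Correlation of each sector with the market, weakest first
ranked = corr["sensex_gain"].drop("sensex_gain").sort_values()
print(ranked)
```

The same two lines at the end could be applied to `sector_corr` to list the sectoral indexes from the weakest to the strongest relationship with the Sensex.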

### 6.b. How to use correlation to select stocks?


- The correlation coefficient can be used to select stocks in different industries that tend to move in tandem, or to select stocks with a negative coefficient so that if one stock fails, the other is likely to get a boost.

- Choosing a variety of stocks with different degrees and directions of correlation is one of the most common and effective diversification strategies. The result is a portfolio that displays a general upward trend, since, at any given time, at least one security should be doing well even if others are failing.

**Source**: https://www.investopedia.com/ask/answers/021716/how-does-correlation-affect-stock-market.asp
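
The effect of correlation on risk can be illustrated with the standard formula for the variance of a two-asset portfolio, w1²σ1² + w2²σ2² + 2·w1·w2·ρ·σ1·σ2. The sketch below is not part of the original analysis; the function name `two_asset_volatility`, the weights, and the volatilities are hypothetical.

```python
import math


def two_asset_volatility(w1, sigma1, sigma2, rho):
    """Volatility of a portfolio with weight w1 in asset 1 and 1 - w1 in asset 2."""
    w2 = 1.0 - w1
    variance = (
        (w1 * sigma1) ** 2
        + (w2 * sigma2) ** 2
        + 2 * w1 * w2 * rho * sigma1 * sigma2
    )
    return math.sqrt(variance)


# Hypothetical inputs: equal weights, both assets with 2% daily volatility
for rho in (1.0, 0.5, 0.0, -0.5):
    print(rho, round(two_asset_volatility(0.5, 2.0, 2.0, rho), 3))
```

With everything else held fixed, the portfolio volatility falls as the correlation between the two assets falls, which is why weakly or negatively correlated assets help diversification.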

### 6.c. Correlation is not Causation

Correlation measures the relationship between two variables. However, two variables moving together does not necessarily mean that one variable causes the other to occur or change.

**“Correlation does not imply causation.”**

#### Examples of spurious correlations:

  - https://www.tylervigen.com/spurious-correlations
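
A correlation without causation can also be simulated. In the sketch below, which uses synthetic data and assumed noise levels, a hidden common driver `z` influences both `x` and `y`, so the two are strongly correlated even though neither affects the other.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 5000

# A hidden common driver z influences both x and y
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

# x and y are strongly correlated although neither causes the other
r_xy = np.corrcoef(x, y)[0, 1]

# Once the common driver is subtracted, the remaining parts are nearly uncorrelated
r_resid = np.corrcoef(x - z, y - z)[0, 1]

print(round(r_xy, 2), round(r_resid, 2))
```

In market data the overall market move often plays the role of such a common driver for two sectors.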