# Topic 03 - Problem 4: Calculate Correlation and Covariance Between Two Variables

---

## 1. About the Problem

This problem asks me to calculate the **correlation** and **covariance** between two numerical variables in a dataset.  
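- **Covariance** indicates whether two variables tend to move together: a positive value means they tend to increase or decrease together, and a negative value means one tends to rise as the other falls. Its magnitude depends on the units of the variables, so it is hard to compare across datasets.  
- **Correlation** standardizes covariance by dividing it by the product of the two standard deviations, giving a scale-independent measure between -1 and 1.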

To solve this, I will compute the covariance with the population formula (dividing by the number of observations) and then divide it by the product of the two standard deviations to get the correlation.
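
For reference, the formulas used by the solution code can be written as follows. The symbols $n$ (number of observations), $\bar{x}$, $\bar{y}$ (means), and $\sigma_x$, $\sigma_y$ (standard deviations) are notation introduced here for illustration. The $1/n$ factor matches the division by `len(x)` in the code.

$$
\operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

$$
\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\sigma_y = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

$$
r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}
$$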

---


## 2. Solution Code

In [26]:
import math

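def calculate_covariance(x, y):
    # Population covariance: average product of paired deviations (divides by n)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / len(x)
    return covariance

def calculate_correlation(x, y):
    # Pearson correlation: covariance divided by the product of the standard deviations
    covariance = calculate_covariance(x, y)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Population standard deviations (divide by n, matching calculate_covariance)
    std_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / len(x))
    std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / len(y))

    correlation = covariance / (std_x * std_y)
    return correlation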

x = [45, 50, 38, 60]
y = [50000, 60000, 42000, 68000]

cov = calculate_covariance(x, y)
cor = calculate_correlation(x, y)

print("Covariance:", cov)
print("Correlation:", cor)


Covariance: 77750.0
Correlation: 0.9853472301247486
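
The functions above divide by `n`, which is the population form. Some libraries default to the sample form instead (for example, `numpy.cov` divides by `n - 1` unless told otherwise), so their covariance can differ from the value printed here. The sketch below is only an illustration: the `ddof` parameter and the function names `covariance` and `correlation` are hypothetical helpers, not part of the original solution.

```python
import math

def covariance(x, y, ddof=0):
    """Covariance with a hypothetical ddof switch: 0 = population, 1 = sample."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - ddof)

def correlation(x, y, ddof=0):
    # The same ddof is applied to the covariance and both variances, so it cancels out
    return covariance(x, y, ddof) / math.sqrt(covariance(x, x, ddof) * covariance(y, y, ddof))

x = [45, 50, 38, 60]
y = [50000, 60000, 42000, 68000]

pop_cov = covariance(x, y, ddof=0)      # divides by n
sample_cov = covariance(x, y, ddof=1)   # divides by n - 1
pop_cor = correlation(x, y, ddof=0)
sample_cor = correlation(x, y, ddof=1)

print("Population covariance:", pop_cov)
print("Sample covariance:", sample_cov)
print("Correlation (ddof=0):", pop_cor)
print("Correlation (ddof=1):", sample_cor)
```

The covariance changes with the choice of divisor, while the correlation should agree up to floating-point rounding, because the divisor appears in both the numerator and the denominator.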


---

## 3. Summary / Takeaways

By solving this problem, I learned how to calculate **covariance** and **correlation** between two variables.  
I understood how covariance measures the directional relationship between variables, but it can be influenced by their scale.  
Correlation provides a standardized measure, making it easier to interpret.  
These metrics are fundamental in **feature selection** and **understanding relationships** between features before applying machine learning models.  
Next, I want to explore **pairwise correlation** across multiple variables in a dataset.
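
The point about scale can be checked with a small sketch. It reuses the same `x` and `y` values and rescales `y` by a factor of 1000, as if its unit had changed. The rescaled list `y_thousands` and the helper names `population_cov` and `pearson` are assumptions made for this illustration only.

```python
import math

def population_cov(a, b):
    # Population covariance (divides by n), same formula as the solution code
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)

def pearson(a, b):
    # Correlation as covariance over the product of population standard deviations
    return population_cov(a, b) / math.sqrt(population_cov(a, a) * population_cov(b, b))

x = [45, 50, 38, 60]
y = [50000, 60000, 42000, 68000]
y_thousands = [v / 1000 for v in y]  # hypothetical change of unit, same data

cov_original = population_cov(x, y)
cov_rescaled = population_cov(x, y_thousands)
cor_original = pearson(x, y)
cor_rescaled = pearson(x, y_thousands)

print("Covariance:", cov_original, "vs rescaled:", cov_rescaled)
print("Correlation:", cor_original, "vs rescaled:", cor_rescaled)
```

Dividing `y` by 1000 should shrink the covariance by the same factor, while the correlation stays the same up to floating-point rounding.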


In [None]:
# Alternative way to find the covariance, step by step

x = [45, 50, 38, 60]
y = [50000, 60000, 42000, 68000]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Deviation of each value from its mean
total_x = [val - mean_x for val in x]
total_y = [v - mean_y for v in y]

# Accumulate the products of the paired deviations
total = 0
for dx, dy in zip(total_x, total_y):
    total += dx * dy

# Population covariance: divide by the number of observations
cov = total / len(x)
print(cov)

77750.0


In [None]:
# Alternative method for finding the correlation
# Reuses mean_x, mean_y, and cov from the previous cell, so run that cell first
import math

x = [45, 50, 38, 60]
y = [50000, 60000, 42000, 68000]

# Population variances (divide by n)
variance_x = sum((val - mean_x) ** 2 for val in x) / len(x)
variance_y = sum((v - mean_y) ** 2 for v in y) / len(y)

std_x = math.sqrt(variance_x)
std_y = math.sqrt(variance_y)

correlation = cov / (std_x * std_y)
print(correlation)
print(variance_x, variance_y)
print(std_x, std_y)

0.9853472301247486
64.1875 97000000.0
8.011710179481033 9848.857801796104
