# 1) Pandas correlation

- correlation is a statistical concept that quantifies how strongly two variables are related to each other
- in pandas it is computed with **corr()**, which is available on both DataFrame and Series


In [None]:
import pandas as pd

# create dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125],
}

df = pd.DataFrame(data)

# calculate correlation matrix
print(df.corr())

# explanation
# Here, the correlation coefficient between Temperature and Ice_Cream_Sales
# is 0.923401, which is positive. This indicates that as the temperature increases,
# the ice cream sales also increase.
# The coefficient value of 1.000000 along the diagonal represents
# the correlation of each column with itself.

                 Temperature  Ice_Cream_Sales
Temperature         1.000000         0.923401
Ice_Cream_Sales     0.923401         1.000000
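
To make the number in the matrix less of a black box, here is a minimal sketch that recomputes the Pearson coefficient from its definition and compares it with pandas. The helper name `pearson_by_hand` is hypothetical and exists only for this illustration; the data is the same as in the cell above.

```python
import math

import pandas as pd

temperature = [22, 25, 32, 28, 30]
ice_cream_sales = [105, 120, 135, 130, 125]


def pearson_by_hand(xs, ys):
    """Hypothetical helper: Pearson's r computed from deviations around the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of co-deviations divided by the product of the deviation norms
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


manual = pearson_by_hand(temperature, ice_cream_sales)
from_pandas = pd.Series(temperature).corr(pd.Series(ice_cream_sales))

print(round(manual, 6))       # 0.923401
print(round(from_pandas, 6))  # 0.923401
```

Both values agree with the off-diagonal entry of the matrix printed above.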


# 2) Positive and negative correlation

- **positive correlation** - the two variables tend to follow the same trend, i.e. when one increases, the other tends to increase as well, and vice versa.
  ![image.png](attachment:image.png)

- **negative correlation** - the two variables tend to follow opposite trends, i.e. when one increases, the other tends to decrease, and vice versa.
  ![image-2.png](attachment:image-2.png)
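
The two cases can be told apart by the sign of the coefficient. The sketch below reuses the temperature and ice cream data from these notes and adds a `Hot_Chocolate_Sales` column that is invented purely for illustration (it is an assumption, not data from the notes).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Temperature": [22, 25, 32, 28, 30],
        "Ice_Cream_Sales": [105, 120, 135, 130, 125],
        # Hypothetical column, made up for this sketch only
        "Hot_Chocolate_Sales": [90, 80, 40, 60, 55],
    }
)

# Same trend: the coefficient is positive
positive = df["Temperature"].corr(df["Ice_Cream_Sales"])
# Opposite trend: the coefficient is negative
negative = df["Temperature"].corr(df["Hot_Chocolate_Sales"])

print(f"Temperature vs Ice_Cream_Sales: {positive:.3f}")
print(f"Temperature vs Hot_Chocolate_Sales: {negative:.3f}")
```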


## 2.1) Correlation between 2 columns

- instead of computing the full correlation matrix, we can specify the two columns whose correlation we want to express


In [None]:
import pandas as pd

# create dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125],
}

df = pd.DataFrame(data)

# calculate correlation coefficient
correlation = df["Temperature"].corr(df["Ice_Cream_Sales"])

print(correlation)

0.9234007664064656


## 2.2) Missing values

- the **corr()** function ignores missing values (NaN): a row is left out of the calculation whenever either of the two values is missing


In [5]:
import pandas as pd
import numpy as np

# create a dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Coffee_Sales": [158, 145, np.nan, np.nan, 140],
}

df = pd.DataFrame(data)

# calculate correlation between Temperature and Coffee_Sales
correlation1 = df["Temperature"].corr(df["Coffee_Sales"])

print("With NaN values")
print(df)
print(f"correlation = {correlation1}")
print()

# remove rows with missing values
df = df.dropna()

# calculate correlation between Temperature and Coffee_Sales
correlation2 = df["Temperature"].corr(df["Coffee_Sales"])

print("Without NaN values")
print(df)
print(f"correlation = {correlation2}")
print()

With NaN values
   Temperature  Coffee_Sales
0           22         158.0
1           25         145.0
2           32           NaN
3           28           NaN
4           30         140.0
correlation = -0.923177938058926

Without NaN values
   Temperature  Coffee_Sales
0           22         158.0
1           25         145.0
4           30         140.0
correlation = -0.923177938058926
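
Because incomplete rows are dropped silently, a coefficient may rest on very few observations (three here). As a hedged sketch, the `min_periods` parameter of `Series.corr` can be used to demand a minimum number of complete pairs; the threshold of 4 is an arbitrary choice for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Temperature": [22, 25, 32, 28, 30],
        "Coffee_Sales": [158, 145, np.nan, np.nan, 140],
    }
)

# Default behaviour: rows with a NaN in either column are skipped
default_corr = df["Temperature"].corr(df["Coffee_Sales"])

# Require at least 4 complete pairs; only 3 exist, so the result is NaN
strict_corr = df["Temperature"].corr(df["Coffee_Sales"], min_periods=4)

print(f"default: {default_corr}")
print(f"min_periods=4: {strict_corr}")
```

Getting NaN back instead of a number is a signal that there was not enough complete data to trust the coefficient.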



# 3) Correlation methods in pandas

1. **Pearson method** - the default; evaluates the linear relationship between two continuous variables

2. **Kendall method** - measures the ordinal (rank-based) association between two measurable quantities

3. **Spearman method** - evaluates the monotonic relationship between two continuous or ordinal variables


In [None]:
# The "scipy" package apparently has to be installed (for the kendall and spearman methods)
import pandas as pd

# create dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125],
}

df = pd.DataFrame(data)

# calculate different correlation coefficients
pearson = df["Temperature"].corr(df["Ice_Cream_Sales"])
kendall = df["Temperature"].corr(df["Ice_Cream_Sales"], method="kendall")
spearman = df["Temperature"].corr(df["Ice_Cream_Sales"], method="spearman")

# display different correlation coefficient
print(f"Pearson's Coefficient: {pearson}")
print(f"Kendall's Coefficient: {kendall}")
print(f"Spearman's Coefficient: {spearman}")
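
The difference between "linear" and "monotonic" shows up on data that always rises but not along a straight line. The sketch below uses made-up data (`y = x ** 3`, an assumption for illustration) and computes Spearman's coefficient as Pearson's coefficient on the ranks, which is how the Spearman coefficient is defined.

```python
import pandas as pd

# Hypothetical data: y always grows with x, but not linearly
x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3

# Pearson measures how well a straight line fits, so it stays below 1 here
pearson = x.corr(y)

# Spearman's coefficient is Pearson's coefficient computed on the ranks
spearman_via_ranks = x.rank().corr(y.rank())

print(f"Pearson: {pearson:.4f}")
print(f"Spearman (via ranks): {spearman_via_ranks:.4f}")
```

Since the ordering of `x` and `y` is identical, the rank-based coefficient reaches 1 while Pearson's does not.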

# 4) Perfect, good, bad correlation

- **perfect correlation** - can be positive or negative. A perfect positive correlation means the variables change in exact proportion to each other, giving a coefficient of +1. A perfect negative correlation has a coefficient of -1.
  ![image.png](attachment:image.png)

- **good correlation** - a coefficient with a magnitude in the range 0.5 - 0.9 (positive or negative)
  ![image-2.png](attachment:image-2.png)

- **bad correlation** - a coefficient close to 0.
  ![image-3.png](attachment:image-3.png)
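
The ranges above can be turned into a small labelling function. This is only a sketch: `describe_correlation` is a hypothetical helper, the `near_zero` cutoff of 0.1 is an assumption (the notes only say "close to 0"), and values that fall outside the listed ranges are reported as unclassified rather than guessed.

```python
import math


def describe_correlation(r, near_zero=0.1):
    """Hypothetical helper: label a coefficient using the ranges from the notes.

    Assumption: `near_zero` is an arbitrary cutoff for "close to 0".
    """
    strength = abs(r)  # the sign only gives the direction, not the strength
    if math.isclose(strength, 1.0):
        return "perfect"
    if 0.5 <= strength <= 0.9:
        return "good"
    if strength <= near_zero:
        return "bad"
    # The notes do not name the remaining ranges
    return "unclassified"


for value in (1.0, -0.7, 0.02, 0.923401):
    print(value, describe_correlation(value))
```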
