### Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

* Collect the data: Obtain the study time and exam score values for each student.
* Calculate the means: Find the mean (average) of the study time values and the mean of the exam score values.
* Calculate the standard deviations: Determine the sample standard deviation (dividing by the number of data points - 1, to match the covariance formula below) of the study time values and of the exam score values.
* Calculate the covariance: Calculate the covariance between the study time and exam scores using the following formula:<br>
covariance = Σ((study time - mean of study time) * (exam score - mean of exam scores)) / (number of data points - 1)
* Calculate the Pearson correlation coefficient: Divide the covariance by the product of the standard deviations of the study time and exam scores.<br>
Pearson correlation coefficient = covariance / (standard deviation of study time * standard deviation of exam scores)
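
The steps above can be sketched in plain Python. The question supplies no data, so the study-time and exam-score values below are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import mean, stdev

# Hypothetical data (not from the question): hours studied and exam scores
study_time = [2, 4, 5, 7, 8, 10]
exam_score = [55, 62, 66, 74, 79, 88]

n = len(study_time)

# Step 1: means of both variables
mean_x = mean(study_time)
mean_y = mean(exam_score)

# Step 2: sample standard deviations (statistics.stdev divides by n - 1)
sd_x = stdev(study_time)
sd_y = stdev(exam_score)

# Step 3: sample covariance, also dividing by n - 1
covariance = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(study_time, exam_score)
) / (n - 1)

# Step 4: Pearson correlation coefficient
r = covariance / (sd_x * sd_y)
print(f"Pearson r = {r:.4f}")
```

With these made-up numbers the coefficient comes out close to 1, which would be read as a strong positive linear relationship between study time and exam score.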

Interpretation of the Pearson correlation coefficient:

* The Pearson correlation coefficient ranges from -1 to 1.
* If the coefficient is close to 1, it indicates a strong positive linear relationship between study time and exam scores. This means that as study time increases, exam scores tend to increase as well.
* If the coefficient is close to -1, it indicates a strong negative linear relationship. This implies that as study time increases, exam scores tend to decrease.
* If the coefficient is close to 0, it suggests a weak or no linear relationship between study time and exam scores.
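
The three cases can be illustrated with a small sketch. The data are made up for illustration, and `pearson_r` is a hypothetical helper written here, not a library function. The last case shows that a coefficient of 0 rules out a linear relationship only: the points lie exactly on a parabola.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x = [-2, -1, 0, 1, 2]

r_increasing = pearson_r(x, [2 * v + 1 for v in x])  # y = 2x + 1, perfect positive line
r_decreasing = pearson_r(x, [-3 * v for v in x])     # y = -3x, perfect negative line
r_parabola = pearson_r(x, [v ** 2 for v in x])       # y = x^2, strong but nonlinear

print(r_increasing, r_decreasing, r_parabola)
```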

### Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

```python
import pandas as pd
from scipy.stats import spearmanr

# Create a sample dataset
data = {
    'Sleep': [7, 6, 8, 5, 7],
    'Job Satisfaction': [8, 6, 9, 5, 7]
}

df = pd.DataFrame(data)

# Calculate Spearman's rank correlation coefficient and p-value
correlation, p_value = spearmanr(df['Sleep'], df['Job Satisfaction'])

# Print the correlation coefficient and p-value
print(f"Spearman's rank correlation coefficient: {correlation}")
print(f"P-value: {p_value}")
```

```
Spearman's rank correlation coefficient: 0.9746794344808963
P-value: 0.004818230468198566
```


Interpreting the result:

The Spearman's rank correlation coefficient between nightly sleep and job satisfaction is approximately 0.975. This indicates a very strong positive monotonic relationship: individuals who sleep more tend to report higher job satisfaction.

The p-value is approximately 0.0048, which is below the commonly used significance level of 0.05, so the correlation is statistically significant and we reject the null hypothesis of no monotonic association. Because the sample contains only five observations, the p-value is approximate and the result should be treated as illustrative rather than conclusive.
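
To see where the coefficient comes from, the sketch below recomputes it by hand: Spearman's coefficient is the Pearson correlation of the ranks, with tied values sharing the average of their positions. The helpers `average_ranks` and `pearson_r` are hypothetical names written for this illustration.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

sleep = [7, 6, 8, 5, 7]
satisfaction = [8, 6, 9, 5, 7]

sleep_ranks = average_ranks(sleep)        # the two 7s share rank 3.5
sat_ranks = average_ranks(satisfaction)
rho = pearson_r(sleep_ranks, sat_ranks)

print(sleep_ranks, sat_ranks)
print(f"Spearman rho = {rho:.4f}")
```

The tie in the sleep values (two individuals with 7 hours) is the reason the coefficient falls slightly short of 1 even though satisfaction never decreases as sleep increases.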

### Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Create a sample dataset (illustrative values; each list holds 51 observations,
# one more than the 50 participants stated in the question)
data = {
    'Exercise Hours': [3, 5, 2, 4, 6, 2, 1, 3, 4, 5, 1, 2, 3, 4, 2, 5, 6, 1, 4, 3, 5, 2, 3, 1, 4, 2, 5, 3, 2, 1, 6, 4, 3, 5, 2, 4, 1, 3, 2, 5, 4, 3, 1, 2, 4, 3, 5, 1, 2, 3, 4],
    'BMI': [22.5, 25.0, 21.8, 24.0, 26.2, 21.5, 20.0, 22.0, 23.6, 25.3, 20.2, 21.0, 22.1, 23.8, 21.3, 25.5, 26.8, 20.3, 23.5, 22.7, 25.1, 21.7, 22.4, 20.1, 24.1, 21.6, 25.4, 22.3, 21.2, 20.4, 26.3, 23.9, 22.8, 25.6, 21.4, 24.2, 20.5, 22.2, 21.1, 25.7, 23.7, 22.6, 20.6, 21.9, 24.3, 23.2, 25.8, 20.7, 22.9, 23.0, 24.4]
}

df = pd.DataFrame(data)

# Calculate the Pearson correlation coefficient and p-value
pearson_corr, pearson_p_value = pearsonr(df['Exercise Hours'], df['BMI'])

# Calculate the Spearman's rank correlation coefficient and p-value
spearman_corr, spearman_p_value = spearmanr(df['Exercise Hours'], df['BMI'])

# Print the correlation coefficients and p-values
print(f"Pearson correlation coefficient: {pearson_corr}")
print(f"Pearson p-value: {pearson_p_value}")
print()
print(f"Spearman's rank correlation coefficient: {spearman_corr}")
print(f"Spearman's rank p-value: {spearman_p_value}")
```

```
Pearson correlation coefficient: 0.9797571412654146
Pearson p-value: 6.820328725399555e-36

Spearman's rank correlation coefficient: 0.9731424718408446
Spearman's rank p-value: 6.44873527464813e-33
```


The Pearson correlation coefficient between exercise hours per week and BMI is approximately 0.980, which indicates a very strong positive linear relationship in this sample dataset. The p-value (about 6.8e-36) is extremely small, so we reject the null hypothesis of no correlation.

The Spearman's rank correlation coefficient is approximately 0.973, which indicates a very strong positive monotonic relationship. Its p-value (about 6.4e-33) is also extremely small, so the monotonic association is statistically significant as well.

Comparing the two, the coefficients are nearly identical, with Pearson slightly higher. This agreement suggests that the relationship in the sample is not only monotonic but also close to linear, and that no extreme values are distorting the Pearson coefficient. In this sample, more hours of exercise go together with a higher BMI. That direction is a property of the illustrative values used here, and a correlation by itself does not establish that one variable causes the other.
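
The two coefficients do not always agree this closely. As a minimal sketch with made-up data (y = x³, unrelated to the exercise study), a relationship that is perfectly monotonic but curved gives a Spearman coefficient of 1 while the Pearson coefficient stays below 1. The helpers `pearson_r` and `ranks` are hypothetical and written for this illustration only.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """1-based ranks; assumes no tied values, which holds for the data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

x = list(range(1, 9))      # 1, 2, ..., 8
y = [v ** 3 for v in x]    # strictly increasing, but not a straight line

pearson = pearson_r(x, y)
spearman = pearson_r(ranks(x), ranks(y))  # Spearman = Pearson on the ranks

print(f"Pearson:  {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
```

A noticeable gap between the two coefficients is therefore a hint that the relationship is monotonic but not linear, or that outliers are affecting the Pearson value.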

### Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

```python
import pandas as pd
from scipy.stats import pearsonr

# Create a sample dataset (illustrative values; each list holds 51 observations,
# one more than the 50 participants stated in the question)
data = {
    'TV Hours': [3, 5, 2, 4, 6, 2, 1, 3, 4, 5, 1, 2, 3, 4, 2, 5, 6, 1, 4, 3, 5, 2, 3, 1, 4, 2, 5, 3, 2, 1, 6, 4, 3, 5, 2, 4, 1, 3, 2, 5, 4, 3, 1, 2, 4, 3, 5, 1, 2, 3, 4],
    'Physical Activity': [2, 3, 1, 2, 4, 3, 2, 1, 4, 2, 5, 4, 3, 1, 2, 3, 4, 2, 5, 3, 2, 1, 4, 3, 5, 1, 2, 3, 4, 2, 5, 4, 3, 1, 2, 4, 3, 5, 1, 2, 3, 4, 3, 1, 2, 4, 3, 5, 1, 2, 3]
}

df = pd.DataFrame(data)

# Calculate the Pearson correlation coefficient and p-value
correlation, p_value = pearsonr(df['TV Hours'], df['Physical Activity'])

# Print the correlation coefficient and p-value
print(f"Pearson correlation coefficient: {correlation}")
print(f"P-value: {p_value}")
```

```
Pearson correlation coefficient: 0.15177887261419132
P-value: 0.2876816679754577
```


Interpreting the result:

The Pearson correlation coefficient between daily television hours and level of physical activity is approximately 0.152. This indicates a very weak positive linear relationship: in this sample, individuals who watch more television show only a slight tendency toward higher physical activity.

The p-value is approximately 0.288, which is greater than the commonly used significance level of 0.05, so the observed correlation is not statistically significant. We therefore fail to reject the null hypothesis of no correlation: the data do not provide sufficient evidence of a linear relationship between television hours and physical activity. This is not the same as showing that no relationship exists.
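
The significance test behind this p-value can be sketched by hand. Under the null hypothesis of no correlation, t = r * sqrt((n - 2) / (1 - r^2)) follows a Student's t distribution with n - 2 degrees of freedom. The sketch below uses the reported coefficient and n = 51, the number of pairs in the sample dataset in the code. The critical value 2.01 is an approximate table value quoted as an assumption, not something computed here.

```python
from math import sqrt

r = 0.15177887261419132  # Pearson r reported in the output above
n = 51                   # number of (TV Hours, Physical Activity) pairs in the code

dof = n - 2
t_stat = r * sqrt(dof / (1 - r ** 2))

# Approximate two-sided 5% critical value of Student's t with 49 degrees of
# freedom (standard table value, stated as an assumption)
t_crit = 2.01

print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
if abs(t_stat) > t_crit:
    print("Reject the null hypothesis of no correlation")
else:
    print("Fail to reject the null hypothesis of no correlation")
```

Because the t statistic is well below the critical value, the test reaches the same decision as the p-value comparison.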

### Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:
![image.png](attachment:6428be4e-c1ea-4157-8f94-7bea02ace374.png)

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a sample dataset
data = {
    'Age': [25, 42, 37, 19, 31, 28],
    'Soft drink Preference': ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']
}

df = pd.DataFrame(data)

# Encode the categorical variable using label encoding
# (codes follow alphabetical order: Coke = 0, Mountain Dew = 1, Pepsi = 2)
label_encoder = LabelEncoder()
df['Soft drink Preference'] = label_encoder.fit_transform(df['Soft drink Preference'])

# Calculate the (Pearson) correlation coefficient between age and the encoded preference
correlation = df['Age'].corr(df['Soft drink Preference'])

# Print the correlation coefficient
print(f"Correlation coefficient: {correlation}")
```

```
Correlation coefficient: 0.7691751415594736
```


Interpreting the result:

The correlation coefficient between age and the label-encoded soft drink preference is approximately 0.769. This value should be interpreted with caution. Soft drink preference is a nominal (unordered) variable, and `LabelEncoder` assigns integer codes in alphabetical order (Coke = 0, Mountain Dew = 1, Pepsi = 2), so the coefficient depends on an arbitrary ordering of the brands rather than on any real "amount" of preference. A different assignment of codes would produce a different coefficient.

What the data do suggest is that, in this small sample of six respondents, those who chose Coke tend to be younger than those who chose Pepsi or Mountain Dew. Correlation does not imply causation, and with so few observations the pattern may be coincidental or influenced by other factors.
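
Because brand is a nominal variable, one alternative that does not depend on how the brands are coded is the correlation ratio (eta squared): the share of the variance in age that is explained by brand membership. The sketch below is an illustration of that idea, not the method the question prescribes. With six respondents and a single Mountain Dew drinker it is indicative only.

```python
from collections import defaultdict
from statistics import mean

ages = [25, 42, 37, 19, 31, 28]
brands = ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']

# Group the ages by preferred brand
groups = defaultdict(list)
for age, brand in zip(ages, brands):
    groups[brand].append(age)

grand_mean = mean(ages)

# Total variation in age, and the part explained by differences between brand means
ss_total = sum((a - grand_mean) ** 2 for a in ages)
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())

eta_squared = ss_between / ss_total

for brand, g in groups.items():
    print(f"{brand}: n = {len(g)}, mean age = {mean(g):.1f}")
print(f"eta squared = {eta_squared:.3f}")
```

Unlike the label-encoded correlation, this quantity stays the same if the brands are renamed or reordered.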

### Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

```python
import pandas as pd
from scipy.stats import pearsonr

# Create a sample dataset
data = {
    'Sales Calls per Day': [30, 25, 28, 32, 27, 29, 31, 26, 24, 33, 28, 30, 29, 31, 27, 26, 30, 28, 29, 25, 31, 33, 27, 28, 26, 30, 32, 29, 27, 28],
    'Sales per Week': [15, 13, 14, 16, 13, 14, 15, 13, 12, 17, 14, 15, 14, 16, 13, 13, 15, 14, 15, 13, 16, 17, 14, 14, 13, 15, 16, 14, 13, 14]
}

df = pd.DataFrame(data)

# Calculate the Pearson correlation coefficient and p-value
correlation, p_value = pearsonr(df['Sales Calls per Day'], df['Sales per Week'])

# Print the correlation coefficient and p-value
print(f"Pearson correlation coefficient: {correlation}")
print(f"P-value: {p_value}")
```

```
Pearson correlation coefficient: 0.9614382678482712
P-value: 3.109010339632336e-17
```


Interpreting the result:

The Pearson correlation coefficient between the number of sales calls made per day and the number of sales made per week is approximately 0.961. This indicates a very strong positive linear relationship: sales representatives who make more calls per day tend to make more sales per week.

The p-value is approximately 3.1e-17, far below the commonly used significance level of 0.05, so the correlation is statistically significant. We reject the null hypothesis of no correlation and conclude that there is a significant positive linear relationship between sales calls per day and sales per week. A strong correlation alone does not show that making more calls causes more sales.
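
Beyond the p-value, it can help to report how precisely the coefficient is estimated. The sketch below computes an approximate 95% confidence interval using the Fisher z-transformation. It is an illustration that assumes the data are roughly bivariate normal, and it uses 1.96 as the approximate normal critical value.

```python
from math import atanh, sqrt, tanh

r = 0.9614382678482712  # Pearson r reported in the output above
n = 30                  # number of sales representatives

# Fisher z-transformation: z is approximately normal with standard error 1/sqrt(n - 3)
z = atanh(r)
se = 1 / sqrt(n - 3)
z_crit = 1.96  # approximate 97.5th percentile of the standard normal distribution

# Transform the interval endpoints back to the correlation scale
ci_low = tanh(z - z_crit * se)
ci_high = tanh(z + z_crit * se)

# Share of variance in weekly sales associated with daily calls
r_squared = r ** 2

print(f"95% CI for r: ({ci_low:.3f}, {ci_high:.3f})")
print(f"r squared = {r_squared:.3f}")
```

Under these assumptions the whole interval lies well above zero, which is consistent with the very small p-value.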