Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose
you have collected data on the amount of time students spend studying for an exam and their final exam
scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result



To calculate the Pearson correlation coefficient, we need to have paired data for both variables (study time and exam score) for each student. Once we have the paired data, we can use the following formula:

r = (nΣXY - ΣXΣY) / sqrt[(nΣX^2 - (ΣX)^2)(nΣY^2 - (ΣY)^2)]

where:

- r is the Pearson correlation coefficient
- n is the number of paired observations
- ΣXY is the sum of the products of the paired observations for each variable
- ΣX is the sum of the observations for the first variable
- ΣY is the sum of the observations for the second variable
- ΣX^2 is the sum of the squared observations for the first variable
- ΣY^2 is the sum of the squared observations for the second variable
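
The raw-sum formula above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `pearson_r` and the five study-time and exam-score pairs are hypothetical, since the question does not supply the actual data.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the raw-sum formula; xs and ys are paired lists."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical study hours and exam scores for five students (illustration only).
study_hours = [2, 4, 5, 7, 9]
exam_scores = [55, 62, 70, 78, 90]
r_study = pearson_r(study_hours, exam_scores)
print(f"r = {r_study:.3f}")
```

The function divides by zero if either variable is constant, which matches the fact that r is undefined in that case.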

Once we have calculated the Pearson correlation coefficient, we can interpret the result as follows:

- If r is positive, there is a positive linear relationship between the two variables. This means that as one variable increases, the other variable tends to increase as well. The closer r is to 1, the stronger the positive relationship.
- If r is negative, there is a negative linear relationship between the two variables. This means that as one variable increases, the other variable tends to decrease. The closer r is to -1, the stronger the negative relationship.
- If r is zero, there is no linear relationship between the two variables.

Note that the Pearson correlation coefficient only measures the strength and direction of the linear relationship between two variables. It does not indicate causation, and it cannot detect non-linear relationships. It is also sensitive to outliers and influential observations, and the usual significance tests for r additionally assume that the two variables are approximately bivariate normal.

Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables.
Suppose you have collected data on the amount of sleep individuals get each night and their overall job
satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two
variables and interpret the result.

To calculate Spearman's rank correlation coefficient, we first need to rank both variables from lowest to highest values. Then, we calculate the difference between the ranks for each observation and use these differences to calculate the correlation coefficient.

Suppose we have the following data:

|Individuals|Amount of sleep|Job satisfaction|
|---|---|---|
|1|7|8|
|2|6|6|
|3|8|7|
|4|5|4|
|5|7|9|

First, we rank the data:

|Individuals|Amount of sleep (X)|Rank(X)|Job satisfaction (Y)|Rank(Y)|d|
|---|---|---|---|---|---|
|1|7|3.5|8|4|-0.5|
|2|6|2|6|2|0|
|3|8|5|7|3|2|
|4|5|1|4|1|0|
|5|7|3.5|9|5|-1.5|

Here, individuals 1 and 5 are tied on amount of sleep (7 hours each). They would occupy ranks 3 and 4, so each receives the average rank 3.5. The column d is Rank(X) - Rank(Y).

Next, we square the rank differences and add them up:

Σd^2 = (-0.5)^2 + 0^2 + 2^2 + 0^2 + (-1.5)^2 = 6.5

Finally, we calculate the Spearman's rank correlation coefficient as:

r = 1 - (6Σd^2)/(n(n^2-1))
r = 1 - (6*6.5)/(5*(5^2-1))
r = 1 - 39/120
r = 1 - 0.325
r = 0.675

Because the data contain a tie, this simplified formula is an approximation. Computing Pearson's r on the ranks, which handles ties exactly, gives about 0.67.

Interpretation: The Spearman's rank correlation coefficient is about 0.68, indicating a moderately strong positive monotonic relationship between the amount of sleep individuals get each night and their overall job satisfaction level. This means that as the amount of sleep individuals get increases, their job satisfaction level tends to increase as well, but this relationship is not necessarily linear. With only five observations, the estimate is very imprecise.
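
The ranking-and-differences procedure for this five-person data set can be sketched in plain Python. The helper name `average_ranks` is an assumption made for this illustration. The sketch computes both the simplified Σd^2 formula and Pearson's r applied to the ranks, which is the usual way to handle ties.

```python
import math


def average_ranks(values):
    """Rank from lowest (1) to highest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


sleep = [7, 6, 8, 5, 7]
satisfaction = [8, 6, 7, 4, 9]
rank_x = average_ranks(sleep)
rank_y = average_ranks(satisfaction)

# Simplified formula based on squared rank differences.
n = len(sleep)
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
rho_simple = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Tie-robust version: Pearson's r applied to the ranks.
mean_rank = (n + 1) / 2
s_xy = sum((rx - mean_rank) * (ry - mean_rank) for rx, ry in zip(rank_x, rank_y))
s_xx = sum((rx - mean_rank) ** 2 for rx in rank_x)
s_yy = sum((ry - mean_rank) ** 2 for ry in rank_y)
rho_ranks = s_xy / math.sqrt(s_xx * s_yy)

print("Rank(X):", rank_x)
print("Rank(Y):", rank_y)
print("Sum of d^2:", sum_d2)
print("Simplified formula:", rho_simple)
print("Pearson on ranks:", rho_ranks)
```

The two results agree exactly when there are no ties and differ slightly when ties are present.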

Q3. Suppose you are conducting a study to examine the relationship between the number of hours of
exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables
for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation
between these two variables and compare the results.

To calculate the Pearson correlation coefficient and Spearman's rank correlation coefficient, we need to have paired data on both variables (exercise per week and BMI) for each participant. Once we have the paired data, we can use statistical software like R or Python to calculate the coefficients. 

The raw data are not listed here, so suppose for illustration that the Pearson correlation coefficient comes out as r = -0.62 and the Spearman's rank correlation coefficient as rho = -0.54.

Interpretation:

The Pearson correlation coefficient of -0.62 indicates a moderate negative linear relationship between the number of hours of exercise per week and BMI. This means that as the number of hours of exercise per week increases, the BMI tends to decrease, and vice versa.

The Spearman's rank correlation coefficient of -0.54 indicates a moderate negative monotonic relationship between the number of hours of exercise per week and BMI. This means that there is a consistent pattern in the relationship between the two variables, but it may not be perfectly linear.

Comparing the two coefficients, we see that they are similar in terms of direction and strength of the relationship. However, the Spearman's rank correlation coefficient may be a better choice when the relationship between the variables is not strictly linear, as it does not assume a linear relationship like the Pearson correlation coefficient does.
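
To illustrate why the two coefficients can differ, here is a small sketch using `scipy.stats.pearsonr` and `scipy.stats.spearmanr`. The data are a toy example made up for this purpose, not the exercise and BMI measurements from the study: y decreases steadily as x increases, but along a curve rather than a straight line.

```python
from scipy.stats import pearsonr, spearmanr

# Toy data (assumption, not from the study): y falls steadily as x rises,
# but along a curve rather than a straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1 / v for v in x]

r, r_p = pearsonr(x, y)
rho, rho_p = spearmanr(x, y)

print("Pearson r:   ", r)
print("Spearman rho:", rho)
```

Because the relationship is perfectly monotonic, Spearman's rho reaches -1, while Pearson's r is weaker in magnitude because the points do not lie on a straight line.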

Q4. A researcher is interested in examining the relationship between the number of hours individuals
spend watching television per day and their level of physical activity. The researcher collected data on
both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between
these two variables.

Let's assume the researcher has the following data:

Number of hours watching TV per day: [2, 3, 1, 4, 2, 3, 1, 2, 1, 3, 2, 4, 1, 2, 3, 1, 1, 2, 3, 2, 4, 2, 1, 3, 2, 1, 4, 3, 2, 1, 2, 3, 2, 2, 1, 3, 1, 2, 2, 3, 2, 1, 3, 4, 2, 1, 2, 3, 2, 4]

Level of physical activity: [5, 4, 6, 3, 5, 4, 6, 5, 6, 4, 5, 3, 6, 5, 4, 6, 6, 5, 4, 5, 3, 5, 6, 4, 5, 6, 3, 4, 5, 6, 5, 4, 5, 5, 6, 4, 6, 5, 6, 5, 4, 4, 5, 4, 3, 5, 6, 4, 6, 5, 4]

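We can use the scipy.stats module in Python to calculate the Pearson correlation coefficient as follows. Note that the physical-activity list as given contains 51 values while the TV list contains 50, and pearsonr requires paired inputs of equal length. Since it is not clear which value is extra, the code below pairs only the first 50 observations; this is an assumption, not a correction of the data.

In [None]:
from scipy.stats import pearsonr

tv_hours = [2, 3, 1, 4, 2, 3, 1, 2, 1, 3, 2, 4, 1, 2, 3, 1, 1, 2, 3, 2, 4, 2, 1, 3, 2, 1, 4, 3, 2, 1, 2, 3, 2, 2, 1, 3, 1, 2, 2, 3, 2, 1, 3, 4, 2, 1, 2, 3, 2, 4]
physical_activity = [5, 4, 6, 3, 5, 4, 6, 5, 6, 4, 5, 3, 6, 5, 4, 6, 6, 5, 4, 5, 3, 5, 6, 4, 5, 6, 3, 4, 5, 6, 5, 4, 5, 5, 6, 4, 6, 5, 6, 5, 4, 4, 5, 4, 3, 5, 6, 4, 6, 5, 4]

# The lists have different lengths (50 vs 51), which would make pearsonr raise
# a ValueError. Assumption: keep only the first n pairs so the inputs match.
n = min(len(tv_hours), len(physical_activity))

# pearsonr returns the correlation coefficient and a two-sided p-value.
corr, p_value = pearsonr(tv_hours[:n], physical_activity[:n])
print("Pearson correlation coefficient:", corr)
print("p-value:", p_value)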


Q5. A survey was conducted to examine the relationship between age and preference for a particular
brand of soft drink. The survey results are shown below:
    

There are two variables in this survey: age and preference for a particular brand of soft drink. The age variable is quantitative and the preference variable is categorical. To examine the relationship between these two variables, we can create a frequency table to show how many participants prefer each brand of soft drink within each age group. However, since there are missing values and an unclear age group for one participant, we need to clean the data first.

Cleaned data:

Age (Years)   | Preference
------------- | -------------
25            | Coke
42            | Pepsi
37            | Pepsi
19            | Mountain Dew
31            | Coke
28            | Mountain Dew
Unknown       | Coke
Unknown       | Pepsi
Unknown       | Coke

Now, we can create a frequency table to examine the relationship between age and preference for a particular brand of soft drink:

Age (Years)   | Coke | Pepsi | Mountain Dew
------------- | ---- | ----- | ------------
19            | 0    | 0     | 1
25            | 1    | 0     | 0
28            | 0    | 0     | 1
31            | 1    | 0     | 0
37            | 0    | 1     | 0
42            | 0    | 1     | 0
Unknown       | 2    | 1     | 0
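
The frequency table above can also be built programmatically. The following is a minimal sketch using only the Python standard library; the rows are copied from the cleaned data shown earlier, and the variable names are chosen for this illustration.

```python
from collections import Counter

# Rows copied from the cleaned table; "Unknown" marks a missing age.
responses = [
    (25, "Coke"), (42, "Pepsi"), (37, "Pepsi"), (19, "Mountain Dew"),
    (31, "Coke"), (28, "Mountain Dew"),
    ("Unknown", "Coke"), ("Unknown", "Pepsi"), ("Unknown", "Coke"),
]

counts = Counter(responses)  # maps (age, brand) -> number of respondents
brands = ["Coke", "Pepsi", "Mountain Dew"]

# Known ages in ascending order, with the "Unknown" group listed last.
known_ages = sorted({age for age, _ in responses if age != "Unknown"})
ages = known_ages + ["Unknown"]

# One row of brand counts per age group.
frequency = {age: [counts[(age, brand)] for brand in brands] for age in ages}

print("Age", brands)
for age, row in frequency.items():
    print(age, row)
```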

From the frequency table, we can see that among the respondents with known ages, those aged 19 and 28 prefer Mountain Dew, those aged 25 and 31 prefer Coke, and those aged 37 and 42 prefer Pepsi. However, each known age appears only once, so it's difficult to draw conclusions about the relationship between age and preference for a particular brand of soft drink due to the small sample size and missing age data.

Q6. A company is interested in examining the relationship between the number of sales calls made per day
and the number of sales made per week. The company collected data on both variables from a sample of
30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

As both the number of sales calls made per day and the number of sales made per week are quantitative variables, we can calculate the Pearson correlation coefficient to examine the linear relationship between the two variables. 

Let r denote the Pearson correlation coefficient between the two variables. We can use the following formula to calculate r:

r = [n(Σxy) − (Σx)(Σy)] / [sqrt(nΣx^2 − (Σx)^2) * sqrt(nΣy^2 − (Σy)^2)]

where n is the sample size, Σxy is the sum of the products of the paired x and y values, Σx and Σy are the sums of the x and y values, and Σx^2 and Σy^2 are the sums of their squares.

Dividing the numerator and denominator by n gives an equivalent form in terms of deviations from the means x̄ and ȳ, which is the form used in the calculation below:

r = Σ(x − x̄)(y − ȳ) / sqrt[Σ(x − x̄)^2 * Σ(y − ȳ)^2]

Let's assume the following data represents the number of sales calls made per day and the number of sales made per week for the 30 sales representatives:

|Sales Calls per Day (x)|Sales per Week (y)|
|---|---|
|20|4|
|16|3|
|25|5|
|30|7|
|18|3|
|24|4|
|21|5|
|19|4|
|27|6|
|23|5|
|22|5|
|15|2|
|28|6|
|17|3|
|26|6|
|20|4|
|24|5|
|16|2|
|29|7|
|22|5|
|20|4|
|21|5|
|19|3|
|25|6|
|17|3|
|28|7|
|26|6|
|18|3|
|23|5|
|21|5|
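
As a cross-check on the hand calculation, the deviation-based computation can be sketched in plain Python. The sketch assumes the 30 rows above are the complete data set; the variable names are chosen for this illustration.

```python
import math

# Data copied from the table above (30 sales representatives).
calls = [20, 16, 25, 30, 18, 24, 21, 19, 27, 23, 22, 15, 28, 17, 26,
         20, 24, 16, 29, 22, 20, 21, 19, 25, 17, 28, 26, 18, 23, 21]
sales = [4, 3, 5, 7, 3, 4, 5, 4, 6, 5, 5, 2, 6, 3, 6,
         4, 5, 2, 7, 5, 4, 5, 3, 6, 3, 7, 6, 3, 5, 5]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n

# Deviations of each observation from its variable's mean.
dev_x = [x - mean_x for x in calls]
dev_y = [y - mean_y for y in sales]

# Sum of products of deviations and sums of squared deviations.
s_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
s_xx = sum(dx ** 2 for dx in dev_x)
s_yy = sum(dy ** 2 for dy in dev_y)

r_sales = s_xy / math.sqrt(s_xx * s_yy)

print("mean of x:", mean_x)
print("mean of y:", mean_y)
print("Pearson r:", r_sales)
```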

First, we need to calculate the mean of x and y:

mean of x = (20+16+25+30+18+24+21+19+27+23+22+15+28+17+26+20+24+16+29+22+20+21+19+25+17+28+26+18+23+21)/30 = 660/30 = 22
mean of y = (4+3+5+7+3+4+5+4+6+5+5+2+6+3+6+4+5+2+7+5+4+5+3+6+3+7+6+3+5+5)/30 = 138/30 = 4.6

Next, we need to calculate the deviations of x and y from their respective means:

Deviation of x = x - mean of x
Deviation of y = y - mean of y

Using these deviations, we can calculate the sum of the products of the deviations, Σ(x − x̄)(y − ȳ), together with the sums of squared deviations Σ(x − x̄)^2 and Σ(y − ȳ)^2 that appear in the denominator of r. The plain sums of the deviations are always zero, so they are not needed:

Σ(x − x̄)(y − ȳ) = (20-22)*(4-4.6) + (16-22)*(3-4.6) + (25-22)*(5-4.6) + (30-22)*(7-4.6) + (18-22)*(3-4.6) + (24-22)*(4-4.6) + (21-22)*(5-4.6) + (19-22)*(4-4.6) + (27-22)*(6-4.6) + (23-22)*(5-4.6) + (22-22)*(5-4.6) + (15-22)*(2-4