## 22 March Assignment

## Feature Engineering-6

### Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.






To calculate the Pearson correlation coefficient between the time students spend studying and their final exam scores, use the following formula:

\[ \text{Pearson Correlation Coefficient (r)} = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2} \sum{(y_i - \bar{y})^2}}}\]

Where:
- \( x_i \) and \( y_i \) are the individual data points for the two variables.
- \( \bar{x} \) and \( \bar{y} \) are the means of the two variables.
- The summation is carried out over all data points.

You can calculate the Pearson correlation coefficient using Python and NumPy as follows:

In [1]:
import numpy as np

# Sample data for time spent studying and exam scores
time_spent = np.array([5, 8, 3, 6, 7])
exam_scores = np.array([75, 85, 60, 70, 80])

# Deviations of each observation from its variable's mean
time_dev = time_spent - np.mean(time_spent)
score_dev = exam_scores - np.mean(exam_scores)

# Calculate the Pearson correlation coefficient:
# sum of cross-products divided by the square root of the product of the sums of squares
numerator = np.sum(time_dev * score_dev)
denominator = np.sqrt(np.sum(time_dev**2) * np.sum(score_dev**2))
pearson_corr_coefficient = numerator / denominator

print("Pearson Correlation Coefficient:", pearson_corr_coefficient)

Pearson Correlation Coefficient: 0.9324324324324325
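
As a cross-check on the manual calculation, the same coefficient can be obtained from library routines. The sketch below assumes SciPy is installed alongside NumPy and reuses the sample data from the cell above; both values should agree with the manual result of about 0.932.

```python
import numpy as np
from scipy.stats import pearsonr

# Same sample data as in the manual calculation
time_spent = np.array([5, 8, 3, 6, 7])
exam_scores = np.array([75, 85, 60, 70, 80])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(time_spent, exam_scores)[0, 1]

# pearsonr returns the coefficient together with a two-sided p-value
r_scipy, p_value = pearsonr(time_spent, exam_scores)

print("np.corrcoef:", round(r_numpy, 4))
print("pearsonr:   ", round(r_scipy, 4))
```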


Interpretation of the Pearson correlation coefficient:
The Pearson correlation coefficient (\(r\)) ranges between -1 and 1. Here's how to interpret the result:

- \(r = 1\): A perfect positive linear correlation. As time spent studying increases, exam scores also increase.
- \(r > 0\): A positive linear correlation. More time spent studying tends to be associated with higher exam scores.
- \(r = 0\): No linear correlation. Time spent studying and exam scores are not linearly related.
- \(r < 0\): A negative linear correlation. More time spent studying tends to be associated with lower exam scores.
- \(r = -1\): A perfect negative linear correlation. As time spent studying increases, exam scores decrease.

In this example, \(r \approx 0.93\), which indicates a strong positive linear correlation between time spent studying and exam scores: students who studied longer tended to score higher.

Keep in mind that the Pearson correlation coefficient measures only linear relationships and may not capture nonlinear associations between variables. Additionally, correlation does not imply causation, so further analysis is required to understand the underlying factors that contribute to the observed relationship.
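
The limitation to linear relationships can be illustrated with a small sketch. The data here are hypothetical: `y` is fully determined by `x`, but through a U-shaped curve, so the Pearson coefficient comes out at essentially zero even though the variables are strongly related.

```python
import numpy as np

# Hypothetical data: y depends perfectly on x, but through a quadratic (U-shaped) curve
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

# The linear correlation misses the relationship entirely
r = np.corrcoef(x, y)[0, 1]
print("Pearson r for y = x^2:", round(abs(r), 4))
```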

### Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.


To calculate Spearman's rank correlation between two variables, "amount of sleep individuals get each night" and "overall job satisfaction level on a scale of 1 to 10," follow these steps:

1. Rank the data for both variables separately.
2. Calculate the differences between the ranks for each data point.
3. Square the rank differences.
4. Calculate the Spearman's rank correlation coefficient using the formula below, which is exact only when there are no tied ranks:

\[ \text{Spearman's Rank Correlation (ρ)} = 1 - \frac{6\sum{d_i^2}}{n(n^2-1)} \]

Where:
- \( d_i \) is the difference between the ranks of each data point.
- \( n \) is the number of data points.

Here's how you can calculate Spearman's rank correlation using Python and NumPy:

In [2]:
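import numpy as np

# Sample data for sleep and job satisfaction
sleep = [6, 7, 5, 8, 6]
job_satisfaction = [8, 7, 6, 9, 7]

# Calculate 0-based ordinal ranks for each variable.
# Note: argsort of argsort does not average tied ranks; the tied values
# (the two 6s in sleep, the two 7s in job_satisfaction) receive distinct ranks.
sleep_ranks = np.argsort(np.argsort(sleep))
job_satisfaction_ranks = np.argsort(np.argsort(job_satisfaction))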

# Calculate the rank differences
rank_diff = sleep_ranks - job_satisfaction_ranks

# Calculate the Spearman's rank correlation coefficient
n = len(sleep)
spearman_corr_coefficient = 1 - (6 * np.sum(rank_diff**2)) / (n * (n**2 - 1))

print("Spearman's Rank Correlation:", spearman_corr_coefficient)

Spearman's Rank Correlation: 0.6


Interpretation of the Spearman's rank correlation:
Spearman's rank correlation (\(ρ\)) ranges between -1 and 1. Here's how to interpret the result:

- \(ρ = 1\): Perfect monotonic positive correlation. As one variable increases, the other variable also increases monotonically.
- \(ρ > 0\): Monotonic positive correlation. Higher amounts of sleep tend to be associated with higher job satisfaction levels, but not necessarily linearly.
- \(ρ = 0\): No monotonic correlation. There's no consistent monotonic relationship between the variables.
- \(ρ < 0\): Monotonic negative correlation. Higher amounts of sleep tend to be associated with lower job satisfaction levels, but not necessarily linearly.
- \(ρ = -1\): Perfect monotonic negative correlation. As one variable increases, the other variable decreases monotonically.

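Here the calculated \(ρ\) is 0.6, which suggests a moderate positive monotonic correlation between the amount of sleep individuals get and their overall job satisfaction level: more sleep tends to go with higher job satisfaction, although the relationship need not be linear. Because the sample data contain tied values (two 6s for sleep, two 7s for job satisfaction) and the shortcut formula assumes no ties, this value is approximate; a tie-aware calculation that assigns average ranks, such as `scipy.stats.spearmanr`, gives about 0.76 on the same data.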
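
The sample data contain ties, which the \(d_i^2\) shortcut does not handle. A minimal sketch of a tie-aware calculation on the same data, assuming SciPy is available: `rankdata` gives tied values the average of the ranks they span, and Spearman's coefficient is then the Pearson correlation of those ranks. `spearmanr` does the same in one call, and both should come out near 0.76.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Same sample data as in the manual calculation
sleep = [6, 7, 5, 8, 6]
job_satisfaction = [8, 7, 6, 9, 7]

# rankdata (default method "average") gives tied values their mean rank
sleep_ranks = rankdata(sleep)                        # the two 6s both get rank 2.5
job_satisfaction_ranks = rankdata(job_satisfaction)  # the two 7s both get rank 2.5

# Spearman's rho is the Pearson correlation of the ranks
rho_from_ranks = np.corrcoef(sleep_ranks, job_satisfaction_ranks)[0, 1]

# spearmanr performs the same tie-aware calculation in one call
rho_scipy, p_value = spearmanr(sleep, job_satisfaction)

print("Tie-aware rho (ranks + Pearson):", round(rho_from_ranks, 4))
print("Tie-aware rho (spearmanr):      ", round(rho_scipy, 4))
```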

Spearman's rank correlation is particularly useful for ordinal data and for relationships that are monotonic but not linear, where the linearity assumption behind Pearson's coefficient is not met. It measures the strength and direction of monotonic relationships between variables.
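
To make the monotonic-versus-linear distinction concrete, here is a small sketch on hypothetical data where `y` grows exponentially with `x`. The ordering of the points is preserved perfectly, so Spearman's coefficient is 1, while Pearson's coefficient is noticeably lower because the points do not lie on a straight line.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y increases with x at every step, but along a curve
x = np.arange(1, 9, dtype=float)
y = np.exp(x)

r, _ = pearsonr(x, y)      # linear association
rho, _ = spearmanr(x, y)   # monotonic (rank-based) association

print("Pearson r:   ", round(r, 3))
print("Spearman rho:", round(rho, 3))
```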

### Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

In [3]:
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Sample data for exercise hours and BMI
exercise_hours = [2, 4, 3, 5, 1, 2, 3, 4, 2, 5,
                  0, 1, 2, 3, 4, 2, 3, 4, 5, 1,
                  2, 3, 4, 5, 0, 2, 3, 4, 2, 5,
                  1, 2, 3, 4, 2, 3, 4, 5, 0, 1,
                  2, 3, 4, 2, 3, 4, 5, 1, 2, 3]

bmi = [22.5, 24.7, 21.8, 26.5, 19.2, 20.1, 23.0, 25.8, 20.9, 27.3,
       18.6, 20.5, 23.2, 24.9, 26.1, 21.3, 23.4, 25.7, 28.2, 19.8,
       20.7, 24.3, 26.8, 28.5, 17.9, 21.0, 22.8, 25.1, 21.1, 27.9,
       19.7, 22.2, 24.6, 26.0, 20.8, 23.3, 24.8, 27.0, 18.5, 20.2,
       21.9, 24.1, 25.6, 21.2, 23.5, 25.3, 26.6, 19.5, 22.0, 23.9]

# Calculate Pearson correlation coefficient
pearson_corr_coefficient, _ = pearsonr(exercise_hours, bmi)

# Calculate Spearman's rank correlation
spearman_corr_coefficient, _ = spearmanr(exercise_hours, bmi)

print("Pearson Correlation Coefficient:", pearson_corr_coefficient)
print("Spearman's Rank Correlation:", spearman_corr_coefficient)

Pearson Correlation Coefficient: 0.9648199555800469
Spearman's Rank Correlation: 0.9657018541535369


Comparison of results:
The Pearson correlation coefficient measures linear relationships, while Spearman's rank correlation measures monotonic relationships. For this sample data the two are almost identical (\(r \approx 0.965\), \(ρ \approx 0.966\)), which indicates a very strong positive relationship that is both linear and monotonic: participants with more exercise hours tended to have a higher BMI. That direction is a property of the sample data used here.

More generally, comparing the two coefficients helps characterise the relationship:

If both correlation coefficients are similar and positive (e.g., both around 0.8):
This suggests a strong positive relationship that is close to linear. The variables rise together, and replacing the values by their ranks changes little.

If the Pearson correlation coefficient is noticeably higher than the Spearman's rank correlation:
This often points to a few extreme values that pull the points toward a line. Pearson is sensitive to such outliers, whereas ranks limit their influence.

If the Spearman's rank correlation is moderate or high while the Pearson correlation coefficient is noticeably lower:
This suggests a monotonic relationship that is not linear, for example a curve that flattens out or accelerates, or outliers that weaken the linear fit.

Keep in mind that both correlation coefficients have their own strengths and limitations, and their interpretation should be based on the context of the data and the research question.
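
The effect of outliers on the two coefficients can be sketched with hypothetical data: nine points with no clear trend plus one extreme point. The single outlier drives Pearson's coefficient close to 1, while the rank-based Spearman coefficient stays low.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: nine points without a clear trend, plus one extreme point (100, 100)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
y = np.array([9, 1, 8, 2, 7, 3, 6, 4, 5, 100], dtype=float)

r_outlier, _ = pearsonr(x, y)
rho_outlier, _ = spearmanr(x, y)

print("Pearson r with outlier:   ", round(r_outlier, 3))
print("Spearman rho with outlier:", round(rho_outlier, 3))
```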

### Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

In [4]:
import numpy as np
from scipy.stats import pearsonr

# Sample data for hours of TV watching and physical activity level
tv_hours = [3, 2, 4, 5, 4, 6, 3, 2, 1, 4,
            5, 2, 3, 1, 4, 6, 2, 3, 5, 4,
            4, 2, 3, 5, 3, 6, 4, 2, 1, 5,
            5, 2, 3, 4, 4, 6, 3, 2, 1, 4,
            5, 2, 3, 5, 4, 6, 3, 2, 1, 4]

physical_activity = [2, 4, 3, 1, 3, 1, 2, 4, 5, 3,
                     1, 4, 2, 5, 3, 1, 4, 2, 1, 3,
                     3, 4, 2, 1, 3, 1, 2, 4, 5, 3,
                     1, 4, 2, 5, 3, 1, 4, 2, 1, 3,
                     3, 4, 2, 1, 3, 1, 2, 4, 5, 3]

# Calculate Pearson correlation coefficient
pearson_corr_coefficient, _ = pearsonr(tv_hours, physical_activity)

print("Pearson Correlation Coefficient:", pearson_corr_coefficient)


Pearson Correlation Coefficient: -0.6971275728037658


Here \(r \approx -0.70\), which suggests a moderately strong negative linear correlation between the number of hours individuals spend watching television per day and their level of physical activity. Those who spend more time watching TV tend to have lower physical activity levels.
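
The cell above discards the second value returned by `pearsonr`, which is a two-sided p-value for the null hypothesis of no linear correlation. A minimal sketch of reading both values, using a small hypothetical sample rather than the survey data:

```python
from scipy.stats import pearsonr

# Hypothetical small sample: TV hours per day and physical activity level
tv_hours_demo = [1, 2, 3, 4, 5, 6]
activity_demo = [5, 4, 4, 3, 1, 1]

# pearsonr returns the coefficient and the two-sided p-value
r, p_value = pearsonr(tv_hours_demo, activity_demo)

print("Pearson r:", round(r, 3))
print("p-value:  ", round(p_value, 4))
```

A small p-value indicates that a correlation this strong would be unlikely if the two variables were uncorrelated; it does not say anything about causation.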

### Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

| Age (years) | Brand        |
|-------------|--------------|
| 25          | Coke         |
| 42          | Pepsi        |
| 37          | Mountain dew |
| 19          | Coke         |
| 31          | Pepsi        |
| 28          | Coke         |

### Calculate the Spearman rank correlation coefficient and interpret the result.

In [5]:
import numpy as np
from scipy.stats import spearmanr

# Sample data
ages = [25, 42, 37, 19, 31, 28]
brands = ['Coke', 'Pepsi', 'Mountain dew', 'Coke', 'Pepsi', 'Coke']

# Encode brand names as integers in alphabetical order
# (an arbitrary choice: brand is a nominal variable with no natural order)
brand_ranks = {brand: rank for rank, brand in enumerate(sorted(set(brands)), start=1)}
numerical_brands = [brand_ranks[brand] for brand in brands]

# Calculate Spearman's rank correlation
spearman_corr_coefficient, _ = spearmanr(ages, numerical_brands)

print("Spearman's Rank Correlation Coefficient:", spearman_corr_coefficient)

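Spearman's Rank Correlation Coefficient: 0.8332380897952965

With the alphabetical coding used here (Coke = 1, Mountain dew = 2, Pepsi = 3), \(ρ \approx 0.83\). In this small sample the three youngest respondents (19, 25 and 28) chose Coke, while the older respondents chose Pepsi or Mountain dew, which is what produces the high positive value. However, brand is a nominal variable with no natural order, so the coefficient depends on the arbitrary numeric coding and should not be read as a genuine monotonic relationship between age and brand preference.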
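
Because brand names have no inherent order, the integer codes assigned to them are arbitrary. The sketch below compares the alphabetical coding with a second, equally arbitrary coding (the `reordered` mapping is hypothetical and only for illustration); the coefficient changes sign, which shows that it reflects the coding as much as the data.

```python
from scipy.stats import spearmanr

# Survey data from the table above
ages = [25, 42, 37, 19, 31, 28]
brands = ['Coke', 'Pepsi', 'Mountain dew', 'Coke', 'Pepsi', 'Coke']

# Two equally arbitrary integer codings of the same nominal variable
alphabetical = {'Coke': 1, 'Mountain dew': 2, 'Pepsi': 3}  # alphabetical order
reordered = {'Pepsi': 1, 'Coke': 2, 'Mountain dew': 3}     # hypothetical alternative

rho_alphabetical, _ = spearmanr(ages, [alphabetical[b] for b in brands])
rho_reordered, _ = spearmanr(ages, [reordered[b] for b in brands])

print("rho with alphabetical coding:", round(rho_alphabetical, 3))
print("rho with reordered coding:   ", round(rho_reordered, 3))
```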


### Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [6]:
import numpy as np
from scipy.stats import pearsonr

# Sample data for sales calls per day and sales per week
sales_calls_per_day = [20, 15, 18, 25, 22, 24, 19, 17, 21, 23,
                       16, 14, 18, 26, 20, 21, 23, 22, 19, 15,
                       17, 20, 22, 16, 24, 18, 15, 23, 25, 21]

sales_per_week = [120, 90, 110, 160, 140, 150, 130, 100, 125, 135,
                  80, 70, 110, 165, 115, 125, 140, 130, 110, 85,
                  100, 120, 140, 95, 155, 105, 90, 140, 150, 130]

# Calculate Pearson correlation coefficient
pearson_corr_coefficient, _ = pearsonr(sales_calls_per_day, sales_per_week)

print("Pearson Correlation Coefficient:", pearson_corr_coefficient)


Pearson Correlation Coefficient: 0.9757740084876352


Here \(r \approx 0.976\), which indicates a very strong positive linear correlation between the number of sales calls made per day and the number of sales made per week. Sales representatives who make more sales calls tend to achieve higher sales numbers per week.
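
The two variables are measured over different periods (calls per day, sales per week), which does not affect the result: Pearson's coefficient is unchanged when a variable is rescaled by a positive constant. A small sketch using the first ten representatives from the data above, and assuming a hypothetical five-day working week to convert calls per day into calls per week:

```python
import numpy as np

# First ten representatives from the sample data above
calls_per_day = np.array([20, 15, 18, 25, 22, 24, 19, 17, 21, 23])
sales_per_week = np.array([120, 90, 110, 160, 140, 150, 130, 100, 125, 135])

# Assumption for illustration only: five working days per week
calls_per_week = calls_per_day * 5

r_daily = np.corrcoef(calls_per_day, sales_per_week)[0, 1]
r_weekly = np.corrcoef(calls_per_week, sales_per_week)[0, 1]

print("r using calls per day: ", round(r_daily, 4))
print("r using calls per week:", round(r_weekly, 4))
```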