## Feature Engineering 6 Assignment: Covariance and Correlation

Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose
you have collected data on the amount of time students spend studying for an exam and their final exam
scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

To calculate the Pearson correlation coefficient between study time and final exam scores, follow these steps:

1. **Collect Data:**
   - Gather data on the amount of time each student spent studying for the exam.
   - Collect corresponding final exam scores for each student.

2. **Organize Data:**
   - Set up two lists or arrays, one for the study time and another for the exam scores.

3. **Use a Statistical Tool or Python:**
   - A spreadsheet (such as Excel) or Python can compute the coefficient with a built-in function. In Python, use the `pearsonr` function from the `scipy.stats` module, which returns the coefficient together with a p-value.

4. **Interpret the Result:**
   - The Pearson correlation coefficient ranges from -1 to 1.
     - A coefficient of 1 indicates a perfect positive linear relationship (exam scores rise with study time along a straight line).
     - A coefficient of -1 indicates a perfect negative linear relationship (exam scores fall as study time rises along a straight line).
     - A coefficient near 0 indicates a weak or absent linear relationship; a nonlinear relationship may still exist.

   - Interpret the coefficient in the context of your study. For example:
     - A value close to 1 means that more study time is associated with higher exam scores.
     - A value close to -1 means that more study time is associated with lower exam scores.
     - A value close to 0 means that study time tells you little about exam scores in a linear sense.
   - In every case the coefficient describes association, not causation.
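
For reference, the coefficient interpreted in step 4 follows the standard sample Pearson formula. The symbols are chosen here for illustration and do not come from the assignment: $x_i$ is the study time and $y_i$ the exam score of student $i$, $\bar{x}$ and $\bar{y}$ are the sample means, and $n$ is the number of students.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator is proportional to the covariance of the two variables, and the denominator rescales it by their spreads, which is why $r$ is unitless and bounded by $-1$ and $1$.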

The example below uses placeholder data; replace it with your actual study-time and exam-score data to get meaningful results.

In [5]:
from scipy.stats import pearsonr

# Example data (replace this with your actual data)
study_time = [10, 20, 15, 25, 30]
exam_scores = [60, 80, 70, 90, 95]

# Calculate Pearson correlation coefficient
correlation_coefficient, _ = pearsonr(study_time, exam_scores)

# Print the correlation coefficient
print(f"Pearson Correlation Coefficient: {correlation_coefficient}")


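Pearson Correlation Coefficient: 0.9938837346736187

Interpretation: for this example data, r ≈ 0.99 indicates a very strong positive linear relationship: students who studied longer scored higher. With only five made-up observations this illustrates the calculation rather than supporting a general conclusion, and correlation alone does not show that studying causes higher scores.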
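
To make the calculation less of a black box, here is a minimal sketch that reproduces the coefficient from its definition using NumPy. It reuses the example data from the cell above; the names `dx`, `dy`, and `manual_r` are introduced only for this illustration.

```python
import numpy as np

# Same example data as the cell above (placeholder values, not real measurements)
study_time = np.array([10, 20, 15, 25, 30], dtype=float)
exam_scores = np.array([60, 80, 70, 90, 95], dtype=float)

# Deviation of each observation from its mean
dx = study_time - study_time.mean()
dy = exam_scores - exam_scores.mean()

# Pearson r = sum of cross-products / sqrt(product of the two sums of squares)
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(f"Manual Pearson r: {manual_r:.4f}")  # prints 0.9939
```

The result should agree with the `pearsonr` value to within floating-point rounding.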


Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables.
Suppose you have collected data on the amount of sleep individuals get each night and their overall job
satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two
variables and interpret the result.

In [6]:
from scipy.stats import spearmanr

# Example data (replace this with your actual data)
sleep_hours = [7, 5, 8, 6, 7]
job_satisfaction = [8, 5, 9, 7, 8]

# Calculate Spearman's rank correlation coefficient
correlation_coefficient, _ = spearmanr(sleep_hours, job_satisfaction)

# Print the correlation coefficient
print(f"Spearman's Rank Correlation Coefficient: {correlation_coefficient}")


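Spearman's Rank Correlation Coefficient: 1.0

Interpretation: a coefficient of 1.0 indicates a perfect monotonic relationship in this example data: ranking the individuals by hours of sleep gives exactly the same order as ranking them by job satisfaction, ties included.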
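
Spearman's coefficient is the Pearson coefficient computed on ranks instead of raw values. The sketch below shows this on the same example data, assuming `scipy.stats.rankdata` with its default handling of ties (tied values share the average of their ranks). The variable names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata

# Same example data as the cell above (placeholder values)
sleep_hours = [7, 5, 8, 6, 7]
job_satisfaction = [8, 5, 9, 7, 8]

# Convert raw values to ranks; the two 7s (and the two 8s) share rank 3.5
sleep_ranks = rankdata(sleep_hours)
satisfaction_ranks = rankdata(job_satisfaction)

# Pearson correlation of the ranks equals Spearman's coefficient
rho_from_ranks = np.corrcoef(sleep_ranks, satisfaction_ranks)[0, 1]

print("Sleep ranks:       ", sleep_ranks)
print("Satisfaction ranks:", satisfaction_ranks)
print(f"Spearman via ranks: {rho_from_ranks:.4f}")
```

Both variables produce the same rank vector, which is why the coefficient is 1 even though the raw values differ.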


Q3. Suppose you are conducting a study to examine the relationship between the number of hours of
exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables
for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation
between these two variables and compare the results.

In [7]:
from scipy.stats import pearsonr, spearmanr
import numpy as np

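# Placeholder data (replace this with your actual measurements).
# The two arrays are drawn independently and no seed is set,
# so the coefficients below change on every run.
hours_of_exercise = np.random.uniform(1, 10, 50)  # 50 values for weekly hours of exercise
bmi = np.random.normal(25, 5, 50)  # 50 BMI values (mean 25, standard deviation 5)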

# Calculate Pearson correlation coefficient
pearson_corr, _ = pearsonr(hours_of_exercise, bmi)

# Calculate Spearman's rank correlation coefficient
spearman_corr, _ = spearmanr(hours_of_exercise, bmi)

# Print the correlation coefficients
print(f"Pearson Correlation Coefficient: {pearson_corr}")
print(f"Spearman's Rank Correlation Coefficient: {spearman_corr}")


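Pearson Correlation Coefficient: 0.032825066425685986
Spearman's Rank Correlation Coefficient: 0.018391356542617046

Comparison: both coefficients are close to 0, so this sample shows neither a linear (Pearson) nor a monotonic (Spearman) relationship. That is expected here, because the two placeholder arrays are generated independently of each other. With real exercise and BMI data, a clear gap between the two coefficients would point to a nonlinear but monotonic relationship or to outliers.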
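
The random data above cannot show when the two coefficients diverge. As a hedged illustration, the sketch below uses synthetic, noise-free data in which `y` grows exponentially with `x`. This data is an assumption made for the demonstration and has nothing to do with exercise or BMI.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic data: y increases with x at every step, but not along a straight line
x = np.linspace(1, 10, 50)
y = np.exp(x)

pearson_demo, _ = pearsonr(x, y)
spearman_demo, _ = spearmanr(x, y)

print(f"Pearson:  {pearson_demo:.3f}")
print(f"Spearman: {spearman_demo:.3f}")
```

Spearman's coefficient is 1 because the ordering of `x` and `y` is identical, while Pearson's coefficient falls noticeably below 1 because the relationship is curved.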


Q4. A researcher is interested in examining the relationship between the number of hours individuals
spend watching television per day and their level of physical activity. The researcher collected data on
both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between
these two variables.

In [8]:
from scipy.stats import pearsonr
import numpy as np

# Placeholder data (replace this with your actual measurements).
# The two arrays are drawn independently and no seed is set,
# so the coefficient below changes on every run.
hours_of_tv = np.random.uniform(0, 5, 50)  # 50 values for daily hours of TV
physical_activity = np.random.normal(60, 10, 50)  # 50 activity levels (mean 60, standard deviation 10)

# Calculate Pearson correlation coefficient
pearson_corr, _ = pearsonr(hours_of_tv, physical_activity)

# Print the correlation coefficient
print(f"Pearson Correlation Coefficient: {pearson_corr}")


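Pearson Correlation Coefficient: 0.16734445769325637

Interpretation: r ≈ 0.17 is a weak positive correlation. Because the two placeholder arrays are generated independently, it reflects sampling noise rather than a real relationship between TV time and physical activity.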
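
The cell above discards the second value returned by `pearsonr`, which is a two-sided p-value. The sketch below keeps it and uses a seeded generator so the run is reproducible. The seed and the 0.05 threshold are conventional choices assumed here, not part of the assignment.

```python
import numpy as np
from scipy.stats import pearsonr

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)
hours_of_tv = rng.uniform(0, 5, 50)
physical_activity = rng.normal(60, 10, 50)

# pearsonr returns the coefficient and a two-sided p-value
r, p_value = pearsonr(hours_of_tv, physical_activity)

print(f"r = {r:.3f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Evidence of a linear association at the 5% level.")
else:
    print("No evidence of a linear association at the 5% level.")
```

A small |r| with a large p-value is what one would typically expect from independently generated data of this size.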


Q5. A survey was conducted to examine the relationship between age and preference for a particular
brand of soft drink. The survey results are shown below:

| Age (Years) | Drink        |
|-------------|--------------|
| 25          | Coke         |
| 42          | Pepsi        |
| 37          | Mountain Dew |
| 19          | Coke         |
| 31          | Pepsi        |
| 28          | Coke         |


In [9]:
import pandas as pd

# Example data
data = {'Age': [25, 42, 37, 19, 31, 28],
        'Drink': ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']}

# Create a DataFrame
df = pd.DataFrame(data)

# Create a cross-tabulation
cross_tab = pd.crosstab(df['Age'], df['Drink'])

# Display the cross-tabulation
print(cross_tab)




Drink  Coke  Mountain Dew  Pepsi
Age                             
19        1             0      0
25        1             0      0
28        1             0      0
31        0             0      1
37        0             1      0
42        0             0      1
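
Because every age occurs only once, the cross-tabulation above has a single count per row and is hard to read as a measure of association. One possible refinement, sketched below, groups ages into bands and then computes a chi-square statistic and Cramér's V. The band edges (18, 30, 45) and the choice of Cramér's V are assumptions made for illustration. With only six responses the expected counts are far too small for the chi-square approximation to be reliable, so treat the numbers as a demonstration of the mechanics only.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    'Age': [25, 42, 37, 19, 31, 28],
    'Drink': ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke'],
})

# Hypothetical age bands, chosen only for illustration: (18, 30] and (30, 45]
df['Age_Group'] = pd.cut(df['Age'], bins=[18, 30, 45],
                         labels=['18-30', '31-45']).astype(str)

# Counts of each drink within each age band
grouped_tab = pd.crosstab(df['Age_Group'], df['Drink'])
print(grouped_tab)

# Chi-square statistic, then Cramér's V (0 = no association, 1 = perfect association)
chi2, p_value, dof, expected = chi2_contingency(grouped_tab.to_numpy())
n = grouped_tab.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(grouped_tab.shape) - 1)))
print(f"Chi-square: {chi2:.2f}, Cramér's V: {cramers_v:.2f}")
```

In this tiny sample all three respondents aged 30 or under chose Coke and none of the older respondents did, so Cramér's V comes out at 1.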


Q6. A company is interested in examining the relationship between the number of sales calls made per day
and the number of sales made per week. The company collected data on both variables from a sample of
30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [10]:
import pandas as pd

# Example data
data = {'Sales_Calls_Per_Day': [20, 25, 30, 18, 22, 28, 15, 24, 29, 17, 21, 26, 19, 23, 27, 16, 20, 25, 18, 22, 27, 14, 23, 28, 16, 19, 24, 17, 21, 26],
        'Sales_Per_Week': [100, 120, 150, 90, 110, 140, 80, 130, 160, 85, 105, 135, 95, 115, 145, 75, 100, 125, 88, 108, 138, 70, 110, 130, 72, 92, 122, 80, 105, 130]}

# Create a DataFrame
df = pd.DataFrame(data)

# Calculate the Pearson correlation coefficient
pearson_corr = df['Sales_Calls_Per_Day'].corr(df['Sales_Per_Week'])

# Display the result
print(f"Pearson Correlation Coefficient: {pearson_corr}")


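Pearson Correlation Coefficient: 0.9804138391244757

Interpretation: r ≈ 0.98 indicates a very strong positive linear relationship in this example data: representatives who make more calls per day record more sales per week.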
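
As a sanity check, the pandas result can be compared with NumPy's `np.corrcoef`, and the `method` argument of `Series.corr` can switch to Spearman's coefficient on the same columns. The sketch below reuses the example data from the cell above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as the cell above (placeholder values)
df = pd.DataFrame({
    'Sales_Calls_Per_Day': [20, 25, 30, 18, 22, 28, 15, 24, 29, 17,
                            21, 26, 19, 23, 27, 16, 20, 25, 18, 22,
                            27, 14, 23, 28, 16, 19, 24, 17, 21, 26],
    'Sales_Per_Week': [100, 120, 150, 90, 110, 140, 80, 130, 160, 85,
                       105, 135, 95, 115, 145, 75, 100, 125, 88, 108,
                       138, 70, 110, 130, 72, 92, 122, 80, 105, 130],
})

calls = df['Sales_Calls_Per_Day']
sales = df['Sales_Per_Week']

pandas_r = calls.corr(sales)                       # Pearson is the default method
numpy_r = np.corrcoef(calls, sales)[0, 1]          # off-diagonal entry of the 2x2 matrix
spearman_rho = calls.corr(sales, method='spearman')

print(f"pandas Pearson:  {pandas_r:.4f}")
print(f"NumPy Pearson:   {numpy_r:.4f}")
print(f"pandas Spearman: {spearman_rho:.4f}")
```

The two Pearson values should match to within floating-point rounding.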
