The goal of regression analysis is to describe the relationship between one set of variables, called the dependent variables, and another set, called the independent or explanatory variables. When there is only one explanatory variable, the analysis is called simple regression.
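For a single explanatory variable, the fitted line is $y = \beta_0 + \beta_1 x$, and the least-squares slope is the sum of cross-deviations divided by the sum of squared deviations of $x$. The sketch below shows that calculation in plain Python. The helper name `fit_simple_ols` and the numbers are illustrative assumptions, not part of the teaching ratings data.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data for illustration: y is exactly 1 + 2 * x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)
```

In the cells that follow, `sm.OLS` performs the same fit and additionally reports standard errors, t-statistics and p-values.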

### Regression with T-test: Using the teachers' rating data set, does gender affect teaching evaluation scores?

*   $H_0: \beta_1 = 0$ (Gender has no effect on teaching evaluation scores)
*   $H_1: \beta_1 \neq 0$ (Gender has an effect on teaching evaluation scores)


In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import requests
import io

URL = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/teachingratings.csv'
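resp = requests.get(URL)
resp.raise_for_status()  # stop early if the download failed
# Decode as UTF-8 and strip the byte-order mark so the first column is named 'minority'
ratings_url = io.StringIO(resp.content.decode('utf-8-sig'))
ratings_df = pd.read_csv(ratings_url)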
print('Data downloaded and read into a dataframe!')
ratings_df.info()

Data downloaded and read into a dataframe!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 463 entries, 0 to 462
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   minority         463 non-null    object 
 1   age              463 non-null    int64  
 2   gender           463 non-null    object 
 3   credits          463 non-null    object 
 4   beauty           463 non-null    float64
 5   eval             463 non-null    float64
 6   division         463 non-null    object 
 7   native           463 non-null    object 
 8   tenure           463 non-null    object 
 9   students         463 non-null    int64  
 10  allstudents      463 non-null    int64  
 11  prof             463 non-null    int64  
 12  PrimaryLast      463 non-null    int64  
 13  vismin           463 non-null    int64  
 14  female           463 non-null    int64  
 15  single_credit    463 non-null    int64  
 16  upper_division   46

In [5]:
## X is the input variables (or independent variables)
X = ratings_df['female']
## y is the target/dependent variable
y = ratings_df['eval']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | eval | R-squared: | 0.022 |
| Model: | OLS | Adj. R-squared: | 0.02 |
| Method: | Least Squares | F-statistic: | 10.56 |
| Date: | Wed, 03 Jul 2024 | Prob (F-statistic): | 0.00124 |
| Time: | 19:01:48 | Log-Likelihood: | -378.5 |
| No. Observations: | 463 | AIC: | 761.0 |
| Df Residuals: | 461 | BIC: | 769.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| const | 4.0690 | 0.034 | 121.288 | 0.000 | 4.003 | 4.135 |
| female | -0.1680 | 0.052 | -3.250 | 0.001 | -0.270 | -0.066 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | 17.625 | Durbin-Watson: | 1.209 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 18.97 |
| Skew: | -0.496 | Prob(JB): | 7.6e-05 |
| Kurtosis: | 2.981 | Cond. No. | 2.47 |


Conclusion: As with the t-test, the p-value (0.001) is less than the significance level α = 0.05, so we reject the null hypothesis: there is evidence of a difference in mean evaluation scores by gender. The coefficient of -0.1680 means that female instructors score, on average, 0.168 points lower than male instructors.
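When the only predictor is a 0/1 indicator such as `female`, the intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means. This is why the regression reaches the same conclusion as a two-sample t-test. The sketch below checks this on a handful of made-up scores (hypothetical values, not taken from the data set).

```python
# Made-up evaluation scores for illustration; 0 = male, 1 = female.
female = [0, 0, 0, 1, 1, 1]
score = [4.0, 4.2, 4.1, 3.9, 4.0, 3.8]

n = len(female)
mean_x = sum(female) / n
mean_y = sum(score) / n

# Least-squares slope and intercept for score = b0 + b1 * female.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(female, score))
sxx = sum((x - mean_x) ** 2 for x in female)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Group means computed directly, for comparison with the regression output.
male_scores = [y for x, y in zip(female, score) if x == 0]
female_scores = [y for x, y in zip(female, score) if x == 1]
mean_male = sum(male_scores) / len(male_scores)
mean_female = sum(female_scores) / len(female_scores)

print("intercept:", round(intercept, 3), "mean (male):", round(mean_male, 3))
print("slope:", round(slope, 3), "mean difference:", round(mean_female - mean_male, 3))
```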

### Regression with ANOVA: Using the teachers' rating data set, does the beauty score for instructors differ by age?

State the Hypothesis:

*   $H_0: \mu_1 = \mu_2 = \mu_3$ (the three population means are equal)
*   $H_1:$ At least one of the means differs


In [6]:
ratings_df.loc[(ratings_df['age'] <= 40), 'age_group'] = '40 years and younger'
ratings_df.loc[(ratings_df['age'] > 40)&(ratings_df['age'] < 57), 'age_group'] = 'between 40 and 57 years'
ratings_df.loc[(ratings_df['age'] >= 57), 'age_group'] = '57 years and older'

from statsmodels.formula.api import ols
lm = ols('beauty ~ age_group', data = ratings_df).fit()
table= sm.stats.anova_lm(lm)
print(table)

              df      sum_sq    mean_sq          F        PR(>F)
age_group    2.0   20.422744  10.211372  17.597559  4.322549e-08
Residual   460.0  266.925153   0.580272        NaN           NaN


Conclusion: The ANOVA results match those obtained before. Since the p-value (4.32e-08) is less than 0.05, we reject the null hypothesis: there is significant evidence that at least one of the group means differs.
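The columns of the ANOVA table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and the F-statistic is the ratio of the two mean squares. A short check, using the values printed in the table above:

```python
# Values copied from the ANOVA table printed above.
ss_group, df_group = 20.422744, 2      # age_group row
ss_resid, df_resid = 266.925153, 460   # Residual row

ms_group = ss_group / df_group   # mean square between age groups
ms_resid = ss_resid / df_resid   # mean square within groups (residual)
f_stat = ms_group / ms_resid     # F = MS_between / MS_within

print(round(ms_group, 6), round(ms_resid, 6), round(f_stat, 4))
```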

### Correlation: Using the teachers' rating data set, is the teaching evaluation score correlated with the beauty score?

In [9]:
## X is the input variables (or independent variables)
X = ratings_df['beauty']
## y is the target/dependent variable
y = ratings_df['eval']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | eval | R-squared: | 0.036 |
| Model: | OLS | Adj. R-squared: | 0.034 |
| Method: | Least Squares | F-statistic: | 17.08 |
| Date: | Wed, 03 Jul 2024 | Prob (F-statistic): | 4.25e-05 |
| Time: | 19:08:10 | Log-Likelihood: | -375.32 |
| No. Observations: | 463 | AIC: | 754.6 |
| Df Residuals: | 461 | BIC: | 762.9 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| const | 3.9983 | 0.025 | 157.727 | 0.000 | 3.948 | 4.048 |
| beauty | 0.1330 | 0.032 | 4.133 | 0.000 | 0.070 | 0.196 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | 15.399 | Durbin-Watson: | 1.238 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 16.405 |
| Skew: | -0.453 | Prob(JB): | 0.000274 |
| Kurtosis: | 2.831 | Cond. No. | 1.27 |


**Conclusion:** Since the p-value is less than 0.05, there is evidence of a positive correlation between beauty and teaching evaluation scores.
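Two relationships hold in any regression with a single predictor, and they can be checked against the summary above: the F-statistic equals the square of the slope's t-statistic, and the magnitude of the correlation coefficient is the square root of R-squared (its sign follows the slope). The sketch uses the rounded values as printed, so the agreement is approximate.

```python
import math

# Values copied from the regression summary above (rounded as printed).
t_beauty = 4.133     # t-statistic for the beauty coefficient
f_stat = 17.08       # F-statistic of the model
r_squared = 0.036    # R-squared of the model
slope = 0.1330       # coefficient on beauty

# With one predictor, F is the squared t-statistic of the slope.
t_squared = t_beauty ** 2

# |r| = sqrt(R-squared); the sign of r matches the sign of the slope.
r = math.copysign(math.sqrt(r_squared), slope)

print(round(t_squared, 2), round(r, 2))
```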


### Question 1: Using the teachers' rating data set, does tenure affect beauty scores?

*   Use α = 0.05


In [11]:
## X is the input variables (or independent variables)
X = ratings_df['tenured_prof']
## y is the target/dependent variable
y = ratings_df['beauty']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | beauty | R-squared: | 0.0 |
| Model: | OLS | Adj. R-squared: | -0.002 |
| Method: | Least Squares | F-statistic: | 0.1689 |
| Date: | Wed, 03 Jul 2024 | Prob (F-statistic): | 0.681 |
| Time: | 19:10:31 | Log-Likelihood: | -546.45 |
| No. Observations: | 463 | AIC: | 1097.0 |
| Df Residuals: | 461 | BIC: | 1105.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| const | 0.0284 | 0.078 | 0.363 | 0.717 | -0.125 | 0.182 |
| tenured_prof | -0.0364 | 0.089 | -0.411 | 0.681 | -0.210 | 0.138 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | 23.184 | Durbin-Watson: | 0.461 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 23.229 |
| Skew: | 0.507 | Prob(JB): | 9.03e-06 |
| Kurtosis: | 2.583 | Cond. No. | 4.05 |


The p-value (0.681) is greater than 0.05, so we fail to reject the null hypothesis: there is no evidence that mean beauty scores differ between tenured and untenured instructors.

### Question 2: Using the teachers' rating data set, does being an English speaker affect the number of students assigned to professors?

*   Use "allstudents"
*   Use α = 0.05 and α = 0.1


Null Hypothesis: The mean number of students assigned to native English speakers is equal to the mean number assigned to non-native English speakers

Alternative Hypothesis: There is a difference in mean number of students assigned to native English speakers vs non-native English speakers

In [13]:
X = ratings_df['English_speaker']
y = ratings_df['allstudents']

## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | allstudents | R-squared: | 0.007 |
| Model: | OLS | Adj. R-squared: | 0.005 |
| Method: | Least Squares | F-statistic: | 3.476 |
| Date: | Wed, 03 Jul 2024 | Prob (F-statistic): | 0.0629 |
| Time: | 19:12:24 | Log-Likelihood: | -2654.2 |
| No. Observations: | 463 | AIC: | 5312.0 |
| Df Residuals: | 461 | BIC: | 5321.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| const | 29.6071 | 14.150 | 2.092 | 0.037 | 1.802 | 57.413 |
| English_speaker | 27.2158 | 14.598 | 1.864 | 0.063 | -1.471 | 55.902 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | 429.792 | Durbin-Watson: | 0.708 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 10527.126 |
| Skew: | 4.129 | Prob(JB): | 0.0 |
| Kurtosis: | 24.852 | Cond. No. | 8.01 |


At α = 0.05, the p-value (0.063) is greater than α, so we fail to reject the null hypothesis: there is no evidence that being a native or non-native English speaker affects the number of students assigned to an instructor.

At α = 0.1, the p-value is less than α, so we reject the null hypothesis: there is evidence of a difference in the mean number of students assigned to native versus non-native English speakers.
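The same p-value leads to different decisions because only the threshold changes. The sketch below applies the decision rule at both significance levels; `decide` is a hypothetical helper introduced here for illustration.

```python
def decide(p_value, alpha):
    """Return the test decision for a given p-value and significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


p_value = 0.063  # P>|t| for English_speaker, as printed in the summary above
for alpha in (0.05, 0.1):
    print(f"alpha = {alpha}: {decide(p_value, alpha)}")
```

This is why the significance level should be fixed before looking at the results rather than chosen afterwards.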

### Question 3: Using the teachers' rating data set, what is the correlation between the number of students who participated in the evaluation survey and evaluation scores?

*   Use "students" variable


In [16]:
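## X is the input variable (number of students who completed the evaluation survey)
X = ratings_df['students']
## y is the target/dependent variable
y = ratings_df['eval']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()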





R-squared is 0.001, so the correlation coefficient has magnitude r = √0.001 ≈ 0.03 (close to 0). There is a very weak correlation between the number of students who participated in the evaluation survey and evaluation scores.
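Taking the square root of R-squared works because, in simple regression, R-squared is exactly the square of the Pearson correlation coefficient. The sketch below computes both quantities independently on made-up numbers (hypothetical survey counts and scores, not the real data) and compares them. The helper names `pearson_r` and `r_squared` are assumptions for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(x, y):
    """R-squared of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Made-up data for illustration only.
students = [10, 20, 30, 40, 50]
evals = [4.1, 3.9, 4.3, 4.0, 4.2]

r = pearson_r(students, evals)
print(round(r, 3), round(r ** 2, 3), round(r_squared(students, evals), 3))
```

Unlike R-squared, `pearson_r` keeps the sign of the relationship, so it also shows whether the association is positive or negative.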