# OLS Regression

In [1]:
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('data.csv')

y = df['SUS']
x = df.drop(columns=['SUS', 'Unnamed: 6'])
x = sm.add_constant(x)

# Show OLS Regression Report
model = sm.OLS(y, x).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                    SUS   R-squared:                       0.593
Model:                            OLS   Adj. R-squared:                  0.571
Method:                 Least Squares   F-statistic:                     27.39
Date:                Wed, 01 Feb 2023   Prob (F-statistic):           5.25e-17
Time:                        22:14:39   Log-Likelihood:                -362.39
No. Observations:                 100   AIC:                             736.8
Df Residuals:                      94   BIC:                             752.4
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           93.0282      5.541     16.788   

Taking α = 0.05, the predictors significant for SUS are ASR_Error (p = 0.001) and Intent_Error (p reported as 0.000).
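The header statistics in the summary above are tied together by standard formulas, which gives a quick sanity check on the report. The sketch below recomputes adjusted R-squared, the F-statistic, AIC, and BIC from the printed values (R-squared 0.593, log-likelihood -362.39, 100 observations, 5 predictors plus a constant). It assumes the usual OLS definitions and uses the rounded figures from the table, so agreement is only expected to the printed precision.

```python
import math

# Values copied from the summary table above (rounded as printed).
n = 100            # No. Observations
k = 5              # Df Model (predictors, excluding the constant)
r2 = 0.593         # R-squared
loglik = -362.39   # Log-Likelihood

df_resid = n - k - 1                        # residual degrees of freedom
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid  # penalises R-squared for k
f_stat = (r2 / k) / ((1 - r2) / df_resid)   # overall F test of the model
aic = -2 * loglik + 2 * (k + 1)             # k + 1 estimated coefficients
bic = -2 * loglik + (k + 1) * math.log(n)

print(f"Df Residuals:   {df_resid}")
print(f"Adj. R-squared: {adj_r2:.3f}")
print(f"F-statistic:    {f_stat:.2f}")
print(f"AIC:            {aic:.1f}")
print(f"BIC:            {bic:.1f}")
```

The recomputed values agree with the table's 94, 0.571, 27.39, 736.8, and 752.4 at the printed precision.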

# Regression Analysis

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score,mean_squared_error

y = df['SUS']
x = df.drop(columns=['SUS', 'Unnamed: 6'])
print(x)

# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=4)

    Purchase  Duration  Gender  ASR_Error  Intent_Error
0          1       254       0          3             2
1          0       247       0          6             9
2          0       125       1          6             8
3          0        22       0         11             7
4          1       262       0          2             3
..       ...       ...     ...        ...           ...
95         0       358       0         13             7
96         1        71       0          3             0
97         0        34       1          0             9
98         1        49       1          4             1
99         1       213       0          1             4

[100 rows x 5 columns]


## Linear Regression

In [6]:
# Fit the linear model on the training split
lr = LinearRegression().fit(x_train, y_train)

# Predictions on each split (not used below: lr.score() predicts internally)
y_train_pred = lr.predict(x_train)
y_test_pred = lr.predict(x_test)

# Calculate the r squared score
print("The R square score of linear regression model is: ", lr.score(x_test, y_test))


The R square score of linear regression model is:  0.628117280347927


About 62.8% of the variance in the dependent variable (SUS) on the held-out test set is explained by the 5 independent variables (predictors) in the model.
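For reference, the score printed above is the coefficient of determination, R² = 1 - SS_res / SS_tot, evaluated on the test split. A minimal pure-Python sketch of that formula follows. The function name `r_squared` and the toy numbers are illustrative only and are not taken from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y_true.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

# Toy values for illustration only (not from data.csv).
y_true = [60.0, 70.0, 80.0, 90.0]
y_pred = [62.0, 68.0, 83.0, 87.0]
print(r_squared(y_true, y_pred))
```

Because SS_tot is measured around the mean of the evaluated split, a model that predicts worse than that mean scores below 0 on held-out data.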

## Quadratic Regression

In [4]:
quad = PolynomialFeatures(degree=2)

x_quad = quad.fit_transform(x)

# Split the dataset (no random_state, so this split differs from the
# linear model's split and changes between runs)
X_train, X_test, Y_train, Y_test = train_test_split(x_quad, y)

# Train the model
plr = LinearRegression().fit(X_train, Y_train)

Y_train_pred = plr.predict(X_train)
Y_test_pred = plr.predict(X_test)

print("The R square score of 2-order polynomial regression model is: ", plr.score(X_test, Y_test))

The R square score of 2-order polynomial regression model is:  0.6129411048927595
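When comparing this score with the linear model, it may help to keep the size of the expanded design matrix in mind. With the default `PolynomialFeatures` settings used above (bias column included, not interaction-only), a degree-d expansion of n features yields C(n + d, d) columns. The sketch below works this out for the 5 predictors here, and assumes `train_test_split`'s default 25% test fraction for the row counts.

```python
import math

n_features = 5   # Purchase, Duration, Gender, ASR_Error, Intent_Error
degree = 2
n_rows = 100

# Bias column plus all monomials of total degree <= 2 in 5 variables.
n_poly_columns = math.comb(n_features + degree, degree)

# train_test_split uses a 25% test fraction when test_size is not given.
n_test = math.ceil(0.25 * n_rows)
n_train = n_rows - n_test

print(n_poly_columns, n_train, n_test)
```

That is 21 columns fitted on 75 training rows. Since this split is also drawn without a `random_state`, the quadratic score is not a like-for-like comparison with the linear model's score.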


# Questions
2. What features are significant? What features are insignificant?
- Purchase, Duration, and Gender are insignificant because their p-values are all above α = 0.05. ASR_Error and Intent_Error are statistically significant because their p-values are below 0.05.
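The significance call above is a simple threshold on the p-values. As a minimal sketch, the p-values quoted in this notebook's write-up are typed into a plain dict here; with a fitted statsmodels result, `model.pvalues` would supply the same numbers as a pandas Series.

```python
ALPHA = 0.05

# p-values as quoted in this notebook's write-up (rounded as reported).
p_values = {
    "Purchase": 0.716,
    "Duration": 0.98,
    "Gender": 0.67,
    "ASR_Error": 0.001,
    "Intent_Error": 0.000,
}

# A predictor counts as significant when its p-value is below ALPHA.
significant = [name for name, p in p_values.items() if p < ALPHA]
insignificant = [name for name, p in p_values.items() if p >= ALPHA]

print("Significant:  ", significant)
print("Insignificant:", insignificant)
```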


3. Were the results what you expected? Explain why or why not, for each feature.
- Purchase: I predicted that this would have some significance because whether or not people were able to achieve their task (buying a ticket) might be important to the user experience. However, its p-value is 0.716, which is not significant.
- Duration: I didn't think duration would matter much, because a short session could mean the user got frustrated and quit early, while a long one could mean the user had many questions that Siri answered perfectly. It did come out as insignificant, with a p-value of 0.98.
- Gender: I didn't think this factor would be significant because I see no reason why gender would matter when buying a flight ticket. It also came out as insignificant, with a p-value of 0.67.
- ASR_Error: I did think this would be significant because it would be frustrating if Siri isn't able to understand what the user is saying. It came out as significant, with a p-value of 0.001.
- Intent_Error: For the same reason as ASR_Error, I also predicted that this would be a significant factor. Its p-value is reported as 0.000, which shows that whether or not Siri understands the intent of the speech properly has a strong relationship with SUS.


4. What does the model suggest is the most influential factor on SUS? Explain what tells you this is the most influential factor statistically.
- The model suggests that ASR_Error and Intent_Error are the most influential factors statistically, with p-values of 0.001 and 0.000 respectively.


5. What are the potential reasons for these factor(s) being significant predictors of SUS?
- I believe how well Siri can understand the user's speech is directly related to achieving the task of buying the flight ticket. If Siri can't help users progress through the steps of buying a ticket, the conversation gets unnecessarily long, and the user is likely to get tired of repeating themselves.