# Input

Our problem owner suggested looking into a polynomial regression model again, because it ranks the predictions, which can be useful in our case. I briefly explored the topic with some example code and then took it further in this experiment. My aim was to find the best conditions (number of classes and polynomial degree) for a regression model.

In [3]:
# Importing libraries 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PowerTransformer, PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

pd.options.mode.chained_assignment = None

In [4]:
# Load the dataset
df = pd.read_csv('/datc/nano/notebooks/Target variable & Features (V3).csv', index_col=0)

# Keep only the rows that use the 'yen' threshold method
# (copied so that new columns can be added to it safely)
df_yen = df[df['Threshold method'] == 'yen'].copy()

In [5]:
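# Define function

def polReg(n_classes, degree):
    """Return the 10-fold cross-validated mean MSE of a polynomial regression
    of the given degree, with 'User score' grouped into n_classes classes."""

    # Classes (function from Oscar)
    def split_classes(n_classes):
        class_labels = []
        # Sorted unique scores, split into n_classes groups of near-equal size
        current_classes = np.sort(df_yen['User score'].unique())
        split = np.array_split(current_classes, n_classes)

        # Label each row with the index of the group that contains its score
        for row in df_yen.iterrows():
            for label, class_ranges in enumerate(split):
                if row[1]['User score'] in class_ranges:
                    class_labels.append(label)
        return class_labels

    # Add the class labels as a new column
    df_yen['Class'] = split_classes(n_classes)

    # Creating x and y
    x = df_yen[['Threshold: separation', 'Threshold: border']]
    if n_classes == 10:
        # With 10 classes the raw user score is used as the target
        y = df_yen[['User score']]
    else:
        y = df_yen[['Class']]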

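    # Standardise the features (the scaler is fitted on the full dataset,
    # before the cross-validation split)
    scaler = StandardScaler()
    x_array = scaler.fit_transform(x)
    x = pd.DataFrame(x_array, index=x.index, columns=x.columns)

    # 10-fold cross-validation with shuffling
    mse_list = []
    kf = KFold(n_splits=10, random_state=42, shuffle=True)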
    for train_ind, valid_ind in kf.split(x):
        x_train = x.iloc[train_ind].to_numpy()
        x_test = x.iloc[valid_ind].to_numpy()
        y_train = y.iloc[train_ind].to_numpy()
        y_test = y.iloc[valid_ind].to_numpy()

        # Polynomial feature expansion, fitted on the training fold only
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        xp_train = poly.fit_transform(x_train)
        xp_test = poly.transform(x_test)

        # Model
        model = LinearRegression()
        model.fit(xp_train, y_train)

        # MSE on the held-out fold
        y_pred = model.predict(xp_test)
        mse = mean_squared_error(y_test, y_pred)
        mse_list.append(mse)

        # Running mean of the MSE over the folds so far (printed after every fold)
        mse_mean = sum(mse_list)/len(mse_list)
        print(mse_mean)

    # After the last fold this is the mean MSE over all 10 folds
    return mse_mean
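
In the function above, the scaler is fitted on the whole dataset before the folds are split, so the validation folds influence the scaling. Below is a minimal sketch of a variant that refits the scaler inside every fold by wrapping the steps in a scikit-learn pipeline. The helper name `cv_mse` and the synthetic data are hypothetical and are not part of the original experiment, so the numbers it prints say nothing about our dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def cv_mse(x, y, degree, n_splits=10, seed=42):
    """Mean MSE over K folds; scaler and polynomial expansion are refit per fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_mse = []
    for train_idx, valid_idx in kf.split(x):
        # A fresh pipeline per fold: nothing is fitted on validation rows
        model = make_pipeline(
            StandardScaler(),
            PolynomialFeatures(degree=degree, include_bias=False),
            LinearRegression(),
        )
        model.fit(x[train_idx], y[train_idx])
        y_pred = model.predict(x[valid_idx])
        fold_mse.append(mean_squared_error(y[valid_idx], y_pred))
    # The mean is computed once, after all folds have been scored
    return float(np.mean(fold_mse))


# Synthetic stand-in data (assumption: NOT the notebook's dataset)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 2))
y_demo = x_demo[:, 0] ** 2 + 0.5 * x_demo[:, 1] + rng.normal(scale=0.1, size=200)

mse_deg1 = cv_mse(x_demo, y_demo, degree=1)
mse_deg2 = cv_mse(x_demo, y_demo, degree=2)
print(mse_deg1, mse_deg2)
```

On this synthetic target, which is quadratic in the first feature, the degree-2 model should score a clearly lower MSE than the degree-1 model.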

In [22]:
classes = [2,3,4,5,6,7,8,9,10]
degrees = [1,2,3,4,5]
# Not used below: polReg always applies StandardScaler
scalers = [StandardScaler(), PowerTransformer()]

scores = pd.DataFrame(columns=('Classes','Degree','MSE'))

data_classes = []
data_degree = []
data_mse = []

# Evaluate every combination of class count and polynomial degree
for n_classes in classes:
    for degree in degrees:
        # First append values to a list
        data_classes.append(n_classes)
        data_degree.append(degree)
        data_mse.append(polReg(n_classes, degree))

0.21563873674016928
0.16028560480619067
0.17686186856716382
0.17093988321439463
0.18421105770529775
0.19997370755907773
0.22446131267034158
0.2295556411732221
0.23406038994334322
0.23177623333587558
0.2208167526010826
0.18783796592336588
0.20081803597061318
0.19067537509069327
0.20556269964419988
0.21226078697480574
0.22831533109185903
0.2284657826489871
0.2151785370645951
0.21513783670251932
0.29081155406516807
0.21822253580941314
0.22063569334617583
0.21047392753579675
0.2237542927953954
0.21757469738388258
0.2626328961665289
0.26797660012849617
1.1864019458845647
1.0906472649363645
0.2521009181987118
0.531939347308974
0.42016376679149353
0.36305608879702056
0.5191907929336046
0.4590147001009304
0.4238016434709589
0.44075174596942224
2.735517693060631
2.5119718028234574
0.16884467339767403
1.5504985087809031
1.0842516995647644
1.2031618911408097
71.57118008080785
59.78313208444826
54.11605924336645
47.46045925741294
2075.0665902971464
1868.158717457451
0.7093809606341583
0.6515344372

In [25]:
# Append the lists to the dataframe scores
scores['Classes']=data_classes
scores['Degree']=data_degree
scores['MSE']=data_mse

# Ten lowest-MSE combinations among those with more than 4 classes
scores[scores['Classes']>4].sort_values('MSE').head(10)

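|  | Classes | Degree | MSE |
|---|---|---|---|
| 15 | 5 | 1 | 2.126859 |
| 16 | 5 | 2 | 2.253968 |
| 20 | 6 | 1 | 2.362178 |
| 21 | 6 | 2 | 2.578489 |
| 25 | 7 | 1 | 3.832394 |
| 26 | 7 | 2 | 4.196855 |
| 30 | 8 | 1 | 5.682279 |
| 31 | 8 | 2 | 5.980538 |
| 22 | 6 | 3 | 6.538749 |
| 35 | 9 | 1 | 7.36916 |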
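
The grid of class counts and degrees, and the filtered ranking shown above, can also be collected in one pass. This is only a sketch of that bookkeeping: `fake_score` is a hypothetical stand-in for `polReg` that returns made-up numbers, so the values it produces are not experimental results.

```python
from itertools import product

import pandas as pd


def fake_score(n_classes, degree):
    """Hypothetical stand-in for polReg; returns a made-up, deterministic number."""
    return n_classes * 0.1 + degree ** 2 * 0.01


classes = range(2, 11)   # 2 .. 10 classes
degrees = range(1, 6)    # polynomial degree 1 .. 5

# One row per (classes, degree) combination, in the same order as the nested loops
rows = [
    {"Classes": c, "Degree": d, "MSE": fake_score(c, d)}
    for c, d in product(classes, degrees)
]
scores = pd.DataFrame(rows)

# Ten lowest-MSE combinations among those with more than 4 classes
top10 = scores[scores["Classes"] > 4].sort_values("MSE").head(10)
print(top10)
```

Building the dataframe from a list of row dictionaries keeps each combination and its score together, which removes the need for three parallel lists.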


# Output

We decided to drop the idea of a polynomial regression model after this experiment. Even though some results are not too inaccurate, it is not the type of model we are looking for: our problem fits all the criteria of a classification problem, not a regression problem. A regression model has no upper or lower limit on its predictions, and with the wrong polynomial degree or features the predictions are far off. It was useful to investigate this, but polynomial regression is not suitable as a final, reliable model.
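
The point about unbounded predictions can be illustrated with a small sketch. It uses synthetic, hypothetical data (one feature in [0, 1] with class labels 0 to 4), not our dataset, and shows what a fitted polynomial regression returns for inputs outside the range it was trained on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, hypothetical data: one feature in [0, 1], class labels 0..4
x_train = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y_train = np.minimum(np.floor(5 * x_train[:, 0]), 4)

# Inputs outside the range seen during training
x_new = np.array([[-1.0], [2.0]])

predictions = {}
for degree in (1, 5):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(x_train, y_train)
    # Nothing constrains these values to the label range 0..4
    predictions[degree] = model.predict(x_new)
    print(degree, predictions[degree])
```

Even the degree-1 model predicts values below 0 and above 4 here, whereas a classifier can only return one of the existing class labels.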