## This notebook is about learning a Linear Regression (LR) model.

### Algorithms from scikit-learn
Before writing our own algorithms, it made sense to use the existing implementations from libraries such as scikit-learn (`sklearn`).
This gives us a baseline for the performance of LR on our dataset.

### Preliminary Considerations
There were several considerations to make, the first regarding hyper-parameters and high-dimensional data.
It was important not to overthink the first few steps, so the bias-variance trade-off and parameter tweaking
were left for later.


In [None]:
import pandas as pd

In [None]:
# Open Dataset
data = pd.read_csv('dataset/GSMArena_dataset_2020.csv', index_col=0)

# Some Insight
data.info()
data.head()


### Issues so far
Before attempting to learn a regression model on the data, it is clear that some issues need to be
addressed.

Firstly, some rows have null values (N/A) and some features have categorical values.
Here, we have decided to drop the categorical features and then drop all rows that contain null values.
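
A minimal sketch of these two cleaning steps on a small made-up frame, not the GSMArena data. The values are illustrative assumptions, and the notebook's own `clean_data` helper may well work differently.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one categorical column and some nulls.
toy = pd.DataFrame({
    "oem": ["A", "B", "C", "D"],                 # categorical
    "display_size": [6.1, np.nan, 5.8, 6.5],     # numeric, one null
    "battery": [4000.0, 3500.0, np.nan, 5000.0], # numeric, one null
    "misc_price": [300.0, 250.0, 200.0, 450.0],
})

# Step 1: keep only numeric columns (this drops the categorical 'oem').
numeric_only = toy.select_dtypes(include="number")

# Step 2: drop every row that still contains a null value.
cleaned = numeric_only.dropna().reset_index(drop=True)

print(cleaned)
```

Dropping rows is the simplest option but discards data; whether that is acceptable depends on how many rows survive.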


In [None]:
# Load up data_cleaning script
# noinspection PyUnresolvedReferences
from auxiliary.data_clean2 import clean_data

# Keep only the features that seem necessary at this stage.
# NOTE: leaving out 'main_camera_dual', 'comms_nfc' and 'battery_charging', as they seem to conflict with other features or result in many null columns.
# 'selfie_camera_video' showed the same problem but is still selected below.
data_features = data[["oem", "launch_announced", "launch_status", "body_dimensions", "display_size", "comms_wlan", "comms_usb",
                "features_sensors", "platform_os", "platform_cpu", "platform_gpu", "memory_internal",
                "main_camera_single", "main_camera_video", "misc_price",
                "selfie_camera_video",
                "selfie_camera_single", "battery"]]

# Clean up the data into a trainable form.
df = clean_data(data_features)

### Preliminary plots

Consider the correlation between certain features (numerical & categorical) and the target, `misc_price`.
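
As a rough sketch of how such correlations can be ranked, the snippet below computes the Pearson correlation of each feature with the target on a made-up numeric frame. The values are invented for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Toy numeric frame (hypothetical values); 'misc_price' plays the target.
toy = pd.DataFrame({
    "memory_internal": [32, 64, 128, 256, 512],
    "battery": [3000, 4000, 3500, 5000, 4500],
    "display_size": [6.5, 6.4, 6.3, 6.2, 6.1],
    "misc_price": [100, 200, 300, 400, 500],
})

# Pearson correlation of every feature with the target,
# ordered by absolute strength (strongest first).
target_corr = (
    toy.corr()["misc_price"]
    .drop("misc_price")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value does not rule out a useful non-linear one.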

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation & heat map; For visuals & feature selection/validation

# NOTE: OEM was used as the index;

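# Drop remaining null rows and renumber the index.
# reset_index returns a new frame, so the result must be assigned.
df = df.dropna().reset_index(drop=True)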

plt.figure(figsize=(20,20))
sns.heatmap(df.corr())

### Now that the data is in the right format, we can train a simple LR model.

We start by investigating the performance of scikit-learn's batch-learned LR models.

In [None]:
# Now it's time to split the data
from sklearn.model_selection import train_test_split

y = df["misc_price"]
X = df.drop(["key_index", "misc_price"], axis=1)

# Train & test split. Seed = 120 for reproducing same shuffling of indices.
# Note 70-30 split for the preliminary split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=120, test_size=.3)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr_model = LinearRegression()

# Batch-train LR
lr_model.fit(X_train, y_train)

# Test the model & retrieve predictions
y_pred = lr_model.predict(X_test)

# The coefficients
print('Coefficients: \n', lr_model.coef_)

# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred))

# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, y_pred))

# plt.scatter(X_test, y_test, color='black')
# plt.plot(X_test, y_pred, color='blue', linewidth=3)

# plt.xticks()
# plt.yticks()

# plt.show()

### Performance of simple LR
As can be seen, the preliminary performance is very poor. This raises the question of whether the data is too noisy or in the wrong form. It could also mean that the underlying function is significantly non-linear, in which case a linear model would be a poor choice.
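
One way to put a poor score in context is to compare it with a trivial baseline that always predicts the training mean, and to check whether a simple non-linear feature helps. The sketch below does this on synthetic data (an assumption, not the notebook's dataset) where the target is quadratic by construction.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a quadratic target that a straight line cannot capture.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=120
)

# Trivial baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# Same linear model after adding a squared feature.
X_train_sq = np.hstack([X_train, X_train ** 2])
X_test_sq = np.hstack([X_test, X_test ** 2])
quadratic = LinearRegression().fit(X_train_sq, y_train)

r2_baseline = r2_score(y_test, baseline.predict(X_test))
r2_linear = r2_score(y_test, linear.predict(X_test))
r2_quadratic = r2_score(y_test, quadratic.predict(X_test_sq))
print(f"baseline R^2:  {r2_baseline:.2f}")
print(f"linear R^2:    {r2_linear:.2f}")
print(f"quadratic R^2: {r2_quadratic:.2f}")
```

If the plain linear model barely beats the baseline while a model with an added non-linear feature does much better, that points towards non-linearity rather than noise.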

### Plot of data & LR model
We now try to visualize the high-dimensional data, along with the correlations of specific feature combinations, to gain some idea of the nature of the fit.
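
Because a fit over many features cannot be drawn as a single line, one common option is a predicted-versus-actual scatter plot. The sketch below uses synthetic stand-in data (an assumption); in the notebook, `y_test` and `y_pred` could take the place of `y` and `y_hat`.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): 5 features, linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=0.5, size=150)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y, y_hat, s=10, color="black")

# Points on the dashed diagonal are predicted exactly.
lims = [min(y.min(), y_hat.min()), max(y.max(), y_hat.max())]
ax.plot(lims, lims, linestyle="--", color="blue")

ax.set_xlabel("actual target")
ax.set_ylabel("predicted target")
ax.set_title("Predicted vs. actual")
```

A tight cloud around the diagonal suggests a good fit; a curved or fanned-out cloud hints at non-linearity or non-constant noise.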

In [None]:
# plt

# sns


### Investigating Linear Regression in more detail
Now we investigate LR in more depth by learning our own models, tweaking parameters, normalizing the data, and comparing the results.
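
A minimal sketch of what a hand-written model could look like, assuming batch gradient descent on the mean squared error with standardized features. The function names `standardize` and `fit_linear_gd`, the learning rate, and the iteration count are illustrative choices, not the notebook's actual implementation.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    return (X - mean) / std, mean, std

def fit_linear_gd(X, y, learning_rate=0.1, n_iters=2000):
    """Fit weights w and bias b by batch gradient descent on the MSE loss."""
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y  # prediction error per sample
        w -= learning_rate * (2.0 / n_samples) * (X.T @ residual)
        b -= learning_rate * (2.0 / n_samples) * residual.sum()
    return w, b

# Synthetic check (assumption): noiseless linear target on features
# with very different scales.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0])
y = 4.0 * X_raw[:, 0] + 0.002 * X_raw[:, 1] + 7.0

X_std, mean, std = standardize(X_raw)
w, b = fit_linear_gd(X_std, y)
y_hat = X_std @ w + b
print("max abs error:", np.abs(y_hat - y).max())
```

Standardizing first matters here: with one feature a thousand times larger than the other, the same learning rate on the raw features would diverge.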

In [None]:
# Set up the function defs & ML algorithms


In [None]:
# Train our custom LR model


# Test variance -> validation set accuracy


# Perform 4-fold cross-validation on the datasets


# Compile results into table

### Plots & Analysis

So far, our LR model has been trained and tested via cross-validation. We now visualize the scores and analyze the
performance below.
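
For reference, a sketch of how 4-fold cross-validation scores can be gathered into a table with scikit-learn's `KFold` and `cross_validate`. The data is synthetic (an assumption) and scikit-learn's `LinearRegression` stands in for the custom model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption): linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.3, size=200)

# 4 folds, shuffled with a fixed seed so the split is reproducible.
cv = KFold(n_splits=4, shuffle=True, random_state=120)
scores = cross_validate(
    LinearRegression(), X, y, cv=cv,
    scoring=("r2", "neg_mean_squared_error"),
)

# One row per fold; flip the sign so MSE reads as a positive error.
results = pd.DataFrame({
    "fold": np.arange(1, 5),
    "r2": scores["test_r2"],
    "mse": -scores["test_neg_mean_squared_error"],
})
print(results)
print(results[["r2", "mse"]].mean())
```

The spread of the per-fold scores gives a rough sense of how sensitive the model is to the particular split.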

In [None]:
# matplotlib


# sns
