# Homework 6 - Regression
In this guide, we use regression as an introduction to artificial intelligence.
This week's assignment focuses on linear regression, using the data from the soccer database from assignment 4.

### Instructions
1. Follow the instructions in the class notes to set up your Python and Jupyter (or VSCode) environment and to clone or download our repository.
2. Import the soccer database using pandas.
3. Load the values from the attributes `gk_reflexes` and `gk_handling` from table `Player_Attributes`.
4. Use `gk_reflexes` (as x) and `gk_handling` (as y) as your data.
5. Drop the missing values from these two columns.
6. Scale the dataset using a standard scaler.
7. Split the data into training and testing sets with a test size of 0.3 (70% training, 30% testing).
8. Apply Linear Regression, Cross-Validation (with 5 splits), Ridge Regularization, and Lasso Regularization, and print the R2 score of each technique using `r2_score`. All of the functions for this last step are located in sklearn.
9. Answer the questions in the notebook through code.
10. Run the notebook and make sure everything works.
11. Export the notebook as HTML or PDF.
12. Submit the notebook through Canvas.

Remember to fill in the missing pieces of code in the provided notebook.

### Dataset Overview
The dataset contains information about soccer players in SQLite format. The file is called `fifa_soccer_dataset.sqlite.gz` and is located in the `Datasets` directory of this repository. **This is the same file from the previous homework (assignment 4).**

If you haven't decompressed the file, you may need to follow the instructions below to decompress it.

**IMPORTANT:** The database is compressed and must be decompressed before use. On Linux or macOS, you can do this by running the following command in your terminal:

```bash
gunzip Datasets/fifa_soccer_dataset.sqlite.gz
```

If you are using Windows, you can run the following commands in PowerShell:

```powershell
$sourceFile = "$PWD\Datasets\fifa_soccer_dataset.sqlite.gz"
$destinationFile = "$PWD\Datasets\fifa_soccer_dataset.sqlite"

$inputStream = [System.IO.File]::OpenRead($sourceFile)
$outputStream = [System.IO.File]::Create($destinationFile)
$gzipStream = New-Object System.IO.Compression.GzipStream($inputStream, [System.IO.Compression.CompressionMode]::Decompress)
$gzipStream.CopyTo($outputStream)

$gzipStream.Close()
$outputStream.Close()
$inputStream.Close()
```

Alternatively, you can extract the file using the GUI of your operating system.
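
If you would rather use one approach on every operating system, the same decompression can be sketched in Python with only the standard library. This is a minimal sketch: the helper name `decompress_gzip` is hypothetical, and the path assumes you run it from the repository root, as the commands above do.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(source: Path, destination: Path) -> Path:
    """Stream-decompress a .gz file to `destination` and return that path."""
    with gzip.open(source, "rb") as compressed, open(destination, "wb") as output:
        # copyfileobj copies in chunks, so the file is never fully held in memory
        shutil.copyfileobj(compressed, output)
    return destination


if __name__ == "__main__":
    # Path taken from the commands above; adjust it to where you run the script.
    source = Path("Datasets/fifa_soccer_dataset.sqlite.gz")
    if source.exists():
        # with_suffix("") drops the trailing ".gz" from the file name
        decompress_gzip(source, source.with_suffix(""))
```

Unlike `gunzip`, this sketch leaves the original `.gz` file in place.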


### Submission Guidelines

- Submit your completed notebook as an HTML export or a PDF file.

To export to HTML in Jupyter, select `File` > `Export Notebook As` > `HTML`.

In VSCode, use the `Jupyter: Export to HTML` command:
- Open the command palette (Ctrl+Shift+P, or Cmd+Shift+P on Mac).
- Search for `Jupyter: Export to HTML`.
- Save the HTML file to your computer and submit it via Canvas.

---

To begin, we'll need quite a few imports.

In [None]:
import pandas as pd
import numpy as np
import sqlite3
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

We're going to use the soccer data to run regressions. In the cell below, connect to the database.

In [None]:
# Input Code Here
dataset_path = "../../Datasets/fifa_soccer_dataset.sqlite" # Fix your path accordingly

conn = sqlite3.connect(dataset_path)
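
A quick way to confirm the connection points at the right file is to list the tables it contains. `sqlite3.connect()` normally creates a new, empty database when the path does not exist, so a wrong path tends to show up as an empty table list rather than an error. The sketch below uses an in-memory stand-in database so it runs anywhere; the helper `list_tables` and the name `demo_conn` are hypothetical, and in the notebook you would pass your own `conn`.

```python
import sqlite3


def list_tables(connection: sqlite3.Connection) -> list:
    """Return the names of all tables in the connected SQLite database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# In-memory stand-in so the sketch is self-contained; not the real dataset.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Player_Attributes (gk_reflexes REAL, gk_handling REAL)")

print(list_tables(demo_conn))
```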


To get started, let's write a query to grab all of the entries from the `Player_Attributes` table, and print the first 5 rows below.

In [None]:
player_attr_df = pd.read_sql("Your Query Here", _) # Input Code Here

# Display the first 5 rows of the table
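
For reference, here is a minimal sketch of how `pd.read_sql()` turns a query into a DataFrame. It runs against a made-up in-memory table called `demo` rather than the assignment's table, so the table name, column names, and values are all assumptions.

```python
import sqlite3

import pandas as pd

# Build a tiny throwaway table (hypothetical data, not the soccer database).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (id INTEGER, score REAL)")
demo_conn.executemany(
    "INSERT INTO demo VALUES (?, ?)", [(i, i * 1.5) for i in range(8)]
)

# read_sql takes the SQL string first and the open connection second.
demo_df = pd.read_sql("SELECT * FROM demo", demo_conn)

# head() returns the first 5 rows by default.
print(demo_df.head())
```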

We are going to work with two fields today: `gk_reflexes` as the independent feature and `gk_handling` as the dependent feature. They represent a player's goalkeeping reflexes and handling, respectively. Let's also drop the rows with missing values in these two columns.

In [None]:
# Input Code Here - Drop null values from the mentioned columns
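
One way to restrict the null check to specific columns is the `subset` argument of `DataFrame.dropna()`. The sketch below uses a small made-up DataFrame (the values and the `other` column are hypothetical) to show that missing values outside the listed columns do not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 2 are missing one of the two columns of interest.
demo_df = pd.DataFrame(
    {
        "gk_reflexes": [10.0, np.nan, 30.0, 40.0],
        "gk_handling": [12.0, 22.0, np.nan, 41.0],
        "other": [np.nan, 1.0, 2.0, 3.0],
    }
)

# Only rows with a null in the listed columns are removed.
cleaned = demo_df.dropna(subset=["gk_reflexes", "gk_handling"])

print(len(demo_df), "->", len(cleaned))
```

Row 0 survives even though `other` is missing there, because `other` is not in `subset`.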

Let's store those columns in their own variables for easy reading.

In [None]:
x = player_attr_df[['Column Here']].values # Input Code Here
y = player_attr_df[['Column Here']].values # Input Code Here
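
The double brackets in the cell above matter: they select a one-column DataFrame, so `.values` yields a 2D array, which is the shape scikit-learn estimators expect for their feature matrix. A minimal sketch on made-up data (the names `as_2d` and `as_1d` are hypothetical):

```python
import pandas as pd

demo_df = pd.DataFrame({"gk_reflexes": [10.0, 20.0, 30.0]})

as_2d = demo_df[["gk_reflexes"]].values  # list of columns -> shape (n, 1)
as_1d = demo_df["gk_reflexes"].values    # single column label -> shape (n,)

print(as_2d.shape, as_1d.shape)
```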

To perform and evaluate our linear regression, we need to split our data into training and test batches. We can do this with the `train_test_split()` function. In the cell below, call this function and pass it `x` and `y` as the data to split. The final parameter, `test_size`, indicates how big the test batch should be: in this case, 30% of the original dataset.

In [None]:
X_train, X_test, Y_train, Y_test = _(_, _, test_size = 0.3) #Input Code here
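
To see what the split returns, here is a small sketch on synthetic arrays (all names and values are made up). It also passes `random_state`, an optional argument not used in the cell above, which makes the shuffle reproducible between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples with one feature each.
demo_x = np.arange(10, dtype=float).reshape(-1, 1)
demo_y = 2.0 * demo_x + 1.0

x_tr, x_te, y_tr, y_te = train_test_split(
    demo_x, demo_y, test_size=0.3, random_state=0
)
print(x_tr.shape, x_te.shape)

# The same random_state reproduces the same split.
x_tr_again, x_te_again, _, _ = train_test_split(
    demo_x, demo_y, test_size=0.3, random_state=0
)
print(np.array_equal(x_te, x_te_again))
```

Without `random_state`, each run shuffles differently, so your R2 scores may vary slightly from run to run.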

We can now perform the fitting. Let's call the `fit()` function on our `lm` variable, passing the `X_train` and `Y_train` data as parameters.

In [None]:
lm = LinearRegression()
lm._(_, _) # Input Code Here

Great! Now we can use the predict function to see how the model performs against our test data set. Call the `predict()` function on `lm` and pass `X_test` as our input parameter. We'll then compute the R2 score to see how well the predictions match the actual values.

In [None]:
Y_predicted = lm._(_) # Input Code Here
rsquared = r2_score(_, _) # Input Code Here
print("R2 Score: " + str(rsquared))
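
To make the metric less of a black box, the sketch below computes R2 by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `r2_score`. The data is synthetic and every variable name is hypothetical. Note that `r2_score` takes the true values first and the predictions second; the metric is not symmetric, so the order matters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, roughly linear data (not the soccer dataset).
rng = np.random.default_rng(0)
demo_x = rng.uniform(0, 100, size=(200, 1))
demo_y = 0.9 * demo_x + rng.normal(0, 5, size=(200, 1))

model = LinearRegression().fit(demo_x, demo_y)
pred = model.predict(demo_x)

ss_res = np.sum((demo_y - pred) ** 2)           # unexplained variation
ss_tot = np.sum((demo_y - demo_y.mean()) ** 2)  # total variation around the mean
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2, r2_score(demo_y, pred))
```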

These values are pretty correlated! We can also use `StandardScaler` to transform our values before fitting our model. In the cell below, create a `StandardScaler()` and pass `x` and `y` to its `fit_transform()` method.

In [None]:
sc = _() # Input Code Here

x_scaled = sc.fit_transform(_) # Input Code Here
y_scaled = sc.fit_transform(_) # Input Code Here
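
As a sanity check on what the scaler does, this sketch compares `StandardScaler` with the manual formula: subtract the column mean, then divide by the column standard deviation. The data and names are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

demo_x = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(demo_x)

# StandardScaler uses the population standard deviation, which is np.std's default.
manual = (demo_x - demo_x.mean(axis=0)) / demo_x.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

Because `fit_transform()` refits the scaler on every call, a scaler reused for both `x` and `y` only remembers the statistics of the last array it saw. If you later need `inverse_transform()`, keeping two separate scaler objects is one option.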

Now we can run the model again as we did before. We'll need to split the scaled data into training and test batches again, then run a new `fit()`. Once the model is fitted, we can again use `predict()` and compute the R2 score.

In [None]:
# Input Code Here to split the scaled dataset

In [None]:
# Input Code Here to create and fit the model

In [None]:
Y_predicted = lm._(X_test) # Input Code Here
# Input Code Here - Grab the R2 Score like we did above
print("R2 Score: " + str(rsquared))

Finally, implement several models, `LinearRegression()`, `Ridge()`, and `Lasso()`, together with K-Fold cross-validation using 5 splits.
Use the unscaled data for this step.
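
Before filling in the cell, it may help to see how the two regularized models differ from plain linear regression. Both add a penalty on the size of the coefficients, controlled by `alpha`, which pulls the fitted slope toward zero. The sketch below uses synthetic data, and the `alpha` values are picked arbitrarily for illustration; they are not the assignment's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data with a true slope of 3 (not the soccer dataset).
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(100, 1))
demo_y = 3.0 * demo_x[:, 0] + rng.normal(scale=0.5, size=100)

coefs = {
    "linear": float(LinearRegression().fit(demo_x, demo_y).coef_[0]),
    "ridge": float(Ridge(alpha=10.0).fit(demo_x, demo_y).coef_[0]),  # L2 penalty
    "lasso": float(Lasso(alpha=0.5).fit(demo_x, demo_y).coef_[0]),   # L1 penalty
}
print(coefs)
```

With a single feature and the default `alpha=1.0`, the three models often score almost identically on a large dataset, so similar R2 values across the columns of your results table would not be surprising.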

In [None]:
# Apply Linear regression, ridge regularization, lasso regularization with cross validation

# Define models
model_lr = _
model_ridge = _
model_lasso = _

# Cross validation
kf = KFold(n_splits = _)
list_r2_score = [] # to keep the r2 score
# Split the train set:
for train_index, test_index in kf.split(_):
    X_train, X_test = _[train_index], _[test_index]
    y_train, y_test = y[train_index], y[test_index]
    k_fold_r2 = []
    for model in [_, _, _]:
        model._(X_train, y_train)
        pred = model.predict(_)
        k_fold_r2.append(r2_score(_, _))
    list_r2_score.append(k_fold_r2)
    
# Show the result - Add Mean and Standard Deviation of the R2-scores
list_r2_score.append(list(np.mean(list_r2_score, axis = _)))
list_r2_score.append(list(np.std(list_r2_score[:-1], axis = _)))
result_r2 = pd.DataFrame(list_r2_score)
result_r2.columns = ['Linear Regression', 'Ridge', 'Lasso']
result_r2.index = ['k1', 'k2', 'k3', 'k4', 'k5', 'average', 'std']

print('The result of r2 scores for k=5 cross validation')
display(result_r2)
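
To see what `kf.split()` yields on each iteration, here is a small sketch on ten synthetic samples (the names are hypothetical). Each fold holds out a different fifth of the row indices for testing and trains on the rest.

```python
import numpy as np
from sklearn.model_selection import KFold

demo_x = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)
folds = [(train_idx, test_idx) for train_idx, test_idx in kf.split(demo_x)]

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"k{i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

By default `KFold` does not shuffle, so the test folds are consecutive blocks of rows. If the rows of a dataset are ordered in some meaningful way, passing `shuffle=True` (optionally with `random_state`) is a common alternative.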

And that's basic linear regression with Python. Please turn in the completed notebook, with your outputs displayed, in HTML or PDF format.