# REGRESSION MODEL:
Building a predictive model is an iterative process: we go back and forth, by trial and error, between cleaning the data and testing different hyperparameter setups. For that reason, we decided to:

- Move all our helper functions to a separate file, so we can reach them from every notebook while keeping the notebooks cleaner.
- Use 2 notebooks to get our final model:
    - File `reg_data_proccessing.ipynb`:<br>
        &emsp;Reads and cleans the data and saves it to a SQLite database in the ``./database/models.db`` file.<br>
        &emsp;To test the cleanliness of the data, we'll use a random forest model.
    - File `reg_model_selection.ipynb` tests and compares different models on the cleaned dataset.
- Once the final model version is selected, it will be serialized after training and stored in the ``./trained_models`` folder.
- The trained model will then be deployed to a website built with Flask/Jinja to perform predictions on data entered by users.
***

### MODEL CREATION
With our data cleaned, we'll try different regression models to find the one to deploy on the website.<br>
We'll test:
- Random forest regressor
- KNN regressor
- Linear regression

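We use cross-validation to select the best configuration of each model type, then compare the three finalists on the held-out test set with each model's `score` method (which returns R² for scikit-learn regressors) to pick the final model.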
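
The `cross_val` helper used in the cells below lives in `myFunc` and is not shown in this notebook. As a rough idea of what such a helper can look like, here is a minimal sketch built on scikit-learn's `cross_val_score`. The name `cv_report`, its parameters, and the synthetic data are assumptions made for illustration; they are not the project's actual helper or dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def cv_report(model, X, y, folds=5):
    """Hypothetical helper: print and return the mean cross-validated R² as a percentage."""
    # cross_val_score clones the model and fits one copy per fold,
    # scoring each copy on the fold it did not see during training.
    scores = cross_val_score(model, X, y, cv=folds, scoring="r2")
    mean_pct = 100 * scores.mean()
    print(f"R² per fold: {scores.round(3)}")
    print(f"Mean R²: {mean_pct:.2f}%")
    return mean_pct


# Synthetic stand-in for the cleaned dataset (assumption, not the real data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)
demo_score = cv_report(RandomForestRegressor(n_estimators=50, random_state=7), X_demo, y_demo)
```

Note that `cross_val_score` fits clones of the model, so the object passed in stays unfitted; a helper that is followed by `model.score(...)` would also need to call `model.fit(...)` itself.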



In [1]:
from myFunc import *  # helper functions; the star import also supplies sqlite3, pd and the scikit-learn names used below
# pull the cleaned dataset
con = sqlite3.connect('./../database/models.db')
df = pd.read_sql_query('select * from reg_clean_data', con)
con.close()
# separate the feature matrix from the target
X = df.drop(['Weight'], axis=1)
y = df['Weight']
# hold out test data; we'll use it after tweaking hyperparameters in the different models
X1, Xtest, y1, ytest = train_test_split(X, y, test_size=0.1, random_state=7)

## Random Forest Regressor

In [2]:
rf1=RandomForestRegressor(n_estimators=55, bootstrap=False)
cross_val(rf1,X1,y1,'r')
rf2=RandomForestRegressor(n_estimators=150)
cross_val(rf2,X1,y1,'r')
rf3=RandomForestRegressor(n_estimators=200)
cross_val(rf3,X1,y1,'r')

-------------Cross Validation-----------------
Accuracy -val set: 68.8046487193262
Accuracy -test set: 72.56679794257205
-------------Cross Validation-----------------
Accuracy -val set: 84.78280469532905
Accuracy -test set: 82.47817240285156
-------------Cross Validation-----------------
Accuracy -val set: 84.50353385387017
Accuracy -test set: 83.27089208331955


## KNeighbors Regressor

In [3]:
knn1 = KNeighborsRegressor(n_neighbors=2)
cross_val(knn1,X1,y1,'r')
knn2 = KNeighborsRegressor(n_neighbors=10)
cross_val(knn2,X1,y1,'r')
knn3 = KNeighborsRegressor(n_neighbors=20)
cross_val(knn3,X1,y1,'r')

-------------Cross Validation-----------------
Accuracy -val set: 63.70527975498022
Accuracy -test set: 54.17059789122498
-------------Cross Validation-----------------
Accuracy -val set: 59.84907320465174
Accuracy -test set: 54.358570132422265
-------------Cross Validation-----------------
Accuracy -val set: 54.28952543103072
Accuracy -test set: 44.30862217925014


## Linear Regression

In [4]:
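# Note: n_jobs only controls parallelism and fit_intercept=True is the default,
# so these three configurations are the same model; identical scores are expected.
lr1=LinearRegression(fit_intercept=True, n_jobs=100)
cross_val(lr1,X1,y1,'r')
lr2=LinearRegression(n_jobs=10)
cross_val(lr2,X1,y1,'r')
lr3=LinearRegression(n_jobs=1)
cross_val(lr3,X1,y1,'r')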

-------------Cross Validation-----------------
Accuracy -val set: 46.75608455507852
Accuracy -test set: 41.379509201876175
-------------Cross Validation-----------------
Accuracy -val set: 46.75608455507852
Accuracy -test set: 41.379509201876175
-------------Cross Validation-----------------
Accuracy -val set: 46.75608455507852
Accuracy -test set: 41.379509201876175


In [5]:
# score() returns R² for regressors; evaluated here on the held-out 10% test split
print(rf3.score(Xtest,ytest))
print(knn2.score(Xtest,ytest))
print(lr2.score(Xtest,ytest))

0.845837128600671
0.5846203196520195
0.41128247538065243
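
The three values printed above are R² scores, since `score` on a scikit-learn regressor returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes that quantity by hand on a tiny made-up example to show how to read it; `r2_score_manual` and the sample numbers are illustrative assumptions, not part of the project code.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up target values (assumption, for illustration only).
y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_manual(y_true, [3.0, 5.0, 7.0, 9.0])   # exact predictions -> 1.0
baseline = r2_score_manual(y_true, [6.0, 6.0, 6.0, 6.0])  # always predicting the mean -> 0.0
```

Read this way, a score of 1.0 means perfect predictions and 0.0 means the model does no better than always predicting the mean of the target.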


### Serializing the winner!

In [6]:
jl_filedir = Path("./../trained_models")
jl_filedir.mkdir(parents=True,exist_ok=True)

jl_filepath=jl_filedir / 'reg_obesity.joblib'

joblib.dump(rf3,jl_filepath)

# rf3_jl=joblib.load(jl_filepath)
# rf3_jl.predict(Xtest)
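
Before deploying, it can be worth checking that a reloaded model reproduces the predictions of the in-memory one. The following is a minimal sketch of that round trip, assuming a small synthetic dataset and a temporary file; the variable names and the `demo_model.joblib` filename are made up for this example and do not touch `reg_obesity.joblib`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training data (assumption, not the real dataset).
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=7)
model = RandomForestRegressor(n_estimators=20, random_state=7).fit(X_demo, y_demo)

# Dump to a temporary folder and load the model back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_model.joblib"
    joblib.dump(model, demo_path)
    reloaded = joblib.load(demo_path)

# The reloaded model should give the same predictions as the original.
same_predictions = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

The scikit-learn documentation recommends loading a serialized model with the same scikit-learn version that was used to train it, which matters if the website runs in a different environment from the notebook.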