# **If you want to learn how to build your first machine learning model, the blog post below is a good guide.**
### [**How to Build Your First Machine Learning Model**](https://towardsdatascience.com/how-to-build-a-machine-learning-model-439ab8fb3fb1)

## **Data Loading**

In [None]:
# Use the pandas library to work with data frames.
import pandas as pd

# Create a data frame object that holds the dataset.
df = pd.read_csv('https://raw.githubusercontent.com/dataprofessor/data/master/delaney_solubility_with_descriptors.csv')
df
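
Before preparing the data, it can help to check its shape, column types, and missing values. The sketch below is only an illustration: it reads a small inline CSV instead of the remote file, using three rows copied from the printed output further down, and `sample_df` is a hypothetical name. The column order of the inline CSV is an assumption.

```python
import io

import pandas as pd

# Hypothetical inline sample; the column order is assumed, not taken from the real file.
sample_csv = io.StringIO(
    "MolLogP,MolWt,NumRotatableBonds,AromaticProportion,logS\n"
    "2.3765,133.405,0.0,0.0,-2.0\n"
    "3.2116,235.906,0.0,0.75,-4.07\n"
    "1.5776,129.384,0.0,0.0,-0.89\n"
)

# read_csv accepts a file-like object as well as a path or URL.
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # one dtype per column

# Count missing values per column; a non-zero count would need handling before training.
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

The same three calls (`shape`, `dtypes`, `isna().sum()`) should work unchanged on the full `df`.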

## **Data Preparation**

### **Separate X and Y**

In [None]:
# Create the y variable: the target/output column (logS) of the dataset.
y = df['logS']
y

In [None]:
# Create the x variable: the input features of the dataset.
# Drop the y column from the data frame.
# axis=1 tells drop() to remove a column; the default axis=0 would look for a row labelled 'logS'.
x = df.drop('logS', axis=1)
x
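
To make the feature/target separation concrete, here is a minimal sketch on a hypothetical three-row frame (`toy`, `toy_x`, and `toy_y` are illustrative names, not part of the notebook). It shows that `drop` returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy frame that mimics part of the dataset's column layout.
toy = pd.DataFrame({
    "MolLogP": [2.4, 1.6, 3.2],
    "MolWt": [133.4, 129.4, 235.9],
    "logS": [-2.0, -0.89, -4.07],
})

# The target is a single column (a Series).
toy_y = toy["logS"]

# The features are every column except the target; axis=1 means "drop a column".
toy_x = toy.drop("logS", axis=1)

print(list(toy_x.columns))  # ['MolLogP', 'MolWt']
print(toy_y.shape)          # (3,)

# drop() returned a copy, so the original frame still has the target column.
print("logS" in toy.columns)  # True
```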

### **Split the dataset into a training set and a test set**

In [37]:
from sklearn.model_selection import train_test_split

# 80% of the rows go to x_train and y_train; the remaining 20% go to x_test and y_test.
# random_state fixes the shuffle, so the same split is produced on every run.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=200)

print(x_train)
print(y_train)
print(x_test)
print(y_test)

      MolLogP    MolWt  NumRotatableBonds  AromaticProportion
1131  2.43100  430.934                6.0            0.000000
1     2.37650  133.405                0.0            0.000000
45    3.21160  235.906                0.0            0.750000
639   3.92420  333.266                6.0            0.260870
446   1.57760  129.384                0.0            0.000000
...       ...      ...                ...                 ...
1103 -0.62838  126.115                0.0            0.666667
844   1.57722  107.156                0.0            0.750000
810   0.17930   74.079                0.0            0.000000
784   3.10250  191.455                0.0            0.750000
617   3.27740  168.239                2.0            0.923077

[915 rows x 4 columns]
1131   -4.190
1      -2.000
45     -4.070
639    -6.124
446    -0.890
        ...  
1103   -1.506
844    -0.850
810     0.460
784    -3.210
617    -4.080
Name: logS, Length: 915, dtype: float64
      MolLogP    MolWt  NumRotatableBo
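
The split behaviour can be checked on a small example. The sketch below uses hypothetical toy arrays (10 samples, 2 features) rather than the solubility data. It illustrates three things: the 80/20 proportions, that features and targets stay aligned row by row, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i has features [2*i, 2*i + 1] and target i.
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=200
)
print(x_tr.shape, x_te.shape)  # (8, 2) (2, 2)

# Rows are shuffled, but each feature row keeps its own target.
rows_aligned = bool((x_tr[:, 0] == 2 * y_tr).all())

# Repeating the call with the same seed gives the same test rows.
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=200
)
same_split = bool(np.array_equal(x_te, x_te2))
print(rows_aligned, same_split)
```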

## **Model Building**

### **Linear Regression**

In [36]:
from sklearn.linear_model import LinearRegression

# Instantiate the model object.
lr_model = LinearRegression()

# Train the model.
lr_model.fit(x_train, y_train)

# Apply the trained model to make predictions on both subsets.
y_train_prediction = lr_model.predict(x_train)
y_test_prediction = lr_model.predict(x_test)

print(y_train_prediction)
print(y_test_prediction)


[ -4.34907471  -2.39577721  -4.03144864  -4.97254076  -1.76103331
   0.94283602  -3.21886302  -2.06686407  -3.30806515  -1.1535936
  -2.62509201  -2.65554314  -5.09876454  -3.02990686  -0.64841279
  -7.24330006  -3.79646953  -3.01660745  -2.9017818   -5.97743303
  -5.08511909  -2.13682071  -4.21541268  -4.67436456  -3.48088448
  -2.85778171  -2.08044431  -2.13602462  -7.89553353  -5.79389308
  -1.17788946  -4.24129412  -0.47883832  -4.59178629  -1.0882932
  -2.71417003  -7.91882451  -3.13212792  -3.01815927  -2.2678698
  -1.99259648  -1.58137217  -5.38619825  -5.63649498  -4.36623206
   2.10179355  -1.91320611  -3.21506128  -2.41420099  -2.05006392
  -4.55605599  -6.07362775  -7.99125286  -1.7508475   -3.1560336
  -3.81176265  -5.79368166  -2.60632972  -3.70872958  -5.52300006
  -2.93336555  -5.21724852  -2.73116456  -2.50562326  -1.60163363
  -1.51487684  -3.08838271  -4.2149886   -2.62678741  -1.49321391
  -3.45328833  -2.84444938  -3.02943318  -1.88189659  -2.40059364
  -2.99787121 
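
A fitted `LinearRegression` stores one coefficient per feature in `coef_` and the bias term in `intercept_`, and a prediction is simply the weighted sum of the features plus the intercept. The sketch below illustrates this on hypothetical noise-free synthetic data (not the solubility dataset), where the true weights are known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data generated from y = 3*x1 - 2*x2 + 1 (no noise).
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(50, 2))
toy_y = 3.0 * toy_x[:, 0] - 2.0 * toy_x[:, 1] + 1.0

toy_model = LinearRegression().fit(toy_x, toy_y)

# The learned parameters should be close to the generating weights.
print(toy_model.coef_, toy_model.intercept_)

# predict() is equivalent to computing the linear combination by hand.
manual_prediction = toy_x @ toy_model.coef_ + toy_model.intercept_
matches_predict = bool(np.allclose(manual_prediction, toy_model.predict(toy_x)))
print(matches_predict)
```

On the real data, `lr_model.coef_` and `lr_model.intercept_` could be inspected the same way to see how each descriptor contributes to the predicted `logS`.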

### **Evaluate Model Performance**

In [34]:
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate the performance on the training subset.
lr_train_mse = mean_squared_error(y_train, y_train_prediction)
lr_train_r2 = r2_score(y_train, y_train_prediction)

# Evaluate the performance on the test subset.
lr_test_mse = mean_squared_error(y_test, y_test_prediction)
lr_test_r2 = r2_score(y_test, y_test_prediction)

# Collect the results in a one-row data frame for display.
lr_result = pd.DataFrame(
    [['Linear Regression', lr_train_mse, lr_train_r2, lr_test_mse, lr_test_r2]],
    columns=['Method Type', 'Training MSE', 'Training R2', 'Test MSE', 'Test R2'],
)
lr_result

         Method Type  Training MSE  Training R2  Test MSE  Test R2
0  Linear Regression      0.971722     0.779728  1.168611  0.72891
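
To show what the two metrics measure, the sketch below recomputes them by hand in plain Python on four hypothetical values. MSE is the average squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The function names are illustrative; they mirror the standard definitions rather than scikit-learn's internals.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


def r2_score_manual(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean target."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]

toy_mse = mean_squared_error_manual(toy_true, toy_pred)  # (0.25 + 0 + 0.25 + 0) / 4
toy_r2 = r2_score_manual(toy_true, toy_pred)             # 1 - 0.5 / 5.0

print(toy_mse)  # 0.125
print(toy_r2)
```

Read this way, the results above say that the linear model's squared error is somewhat higher on the test subset than on the training subset (1.168611 versus 0.971722), and that it explains roughly 73% of the variance in the test targets.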
