# Machine Learning and scikit-learn Library

In Python, many commonly used machine learning models can be found in the scikit-learn library, which is imported under the name sklearn. In this notebook, you will learn the basics of using scikit-learn models.

Using a model involves the following steps:

- Prepare data
- Import a model
- Fit the model to the data
- Evaluate how well the model fits the data
- Use the model to predict

### Prepare data

The data must be examined carefully before modeling. Particular attention should be paid to:

- **Missing values**: Most models do not tolerate missing values. Rows containing missing values should either be deleted, or the missing values should be replaced in an appropriate way.
- **Categorical variables**: Most models require variables to be quantitative. Categorical variables can be used if they are converted into dichotomous variables, i.e. dummy variables.
- **Scaling**: If the independent variables differ in order of magnitude, it is usually a good idea to scale them to the same order of magnitude. This can be done, for example, by converting the values into standard scores (z-scores): the mean is subtracted from each value and the difference is divided by the standard deviation. The result tells how many standard deviations the value lies from the mean.
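The three preparation points above can be sketched roughly as follows. This is only an illustration: the column names and values are made up, replacing a missing value with the column mean is just one of several possible choices, and `StandardScaler` is one way to compute standard scores.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one missing value and one categorical column.
df = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0],
    "rooms": [2, 3, 3, 5],
    "city": ["Helsinki", "Espoo", "Helsinki", "Vantaa"],
})

# Missing values: either drop the affected rows ...
df_dropped = df.dropna()
# ... or replace them, here with the column mean (an assumption).
df_filled = df.fillna({"area": df["area"].mean()})

# Categorical variables: convert to dummy (0/1) variables.
df_dummies = pd.get_dummies(df_filled, columns=["city"], dtype=int)

# Scaling: convert the numeric columns to standard scores
# (mean 0, standard deviation 1).
numeric_cols = ["area", "rooms"]
scaler = StandardScaler()
df_dummies[numeric_cols] = scaler.fit_transform(df_dummies[numeric_cols])

print(df_dummies)
```

After these steps every column is numeric, there are no missing values, and the scaled columns are on the same order of magnitude.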

Supervised learning models require two data objects:

- Values of the independent variables (**feature matrix**, x variables). This dataframe is often called **X**.
- Values of the dependent variable, i.e. the variable to be predicted (**target**, labels, y variable). This dataframe or series is often called **y**.

In unsupervised models, only the feature matrix is required.
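As a small illustration, the feature matrix and the target could be separated from a single dataframe like this. The data and column names are hypothetical, and the target is stored here as a one-dimensional pandas Series, which is a common choice rather than a requirement.

```python
import pandas as pd

# Hypothetical data: two independent variables and one dependent variable.
df = pd.DataFrame({
    "area": [50, 80, 100, 120],
    "rooms": [2, 3, 4, 5],
    "price": [150, 230, 290, 350],
})

X = df[["area", "rooms"]]  # feature matrix: a two-dimensional DataFrame
y = df["price"]            # target: a one-dimensional Series

print(X.shape, y.shape)
```

Note the double brackets when selecting X: they keep the result two-dimensional even if only one column is selected.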

### Import a model

Import a model from the sklearn library. For example, the linear regression model can be imported in the following way:

<center>
    from sklearn.linear_model import LinearRegression
</center>

### Fit the model to data

Fit the model to the data with the **fit** method and save the resulting object as the value of a variable.

For example, the following fits linear regression to the data X (dataframe of independent variables) and y (values of the variable to be predicted).

<center>
    model = LinearRegression().fit(X, y)
</center>

The resulting object (model) contains a variety of information about the fitted model. In many examples found online and in the literature, the same thing is done in two steps:

<center>
<DL>
<DD> model = LinearRegression() </DD>
<DD> model.fit(X, y) </DD>
</DL>
</center>

The resulting model object is the same, whichever way it was formed.
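The equivalence of the two approaches can be checked with a small sketch. The data below is made up so that it follows y = 1 + 2x exactly; `coef_` and `intercept_` are the attributes in which a fitted `LinearRegression` stores the estimated slope and constant term.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 1 + 2*x exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# One step: fit() returns the fitted estimator itself.
model_a = LinearRegression().fit(X, y)

# Two steps: create the estimator, then fit it in place.
model_b = LinearRegression()
model_b.fit(X, y)

# Both models hold the same estimated slope and intercept.
print(model_a.coef_, model_a.intercept_)
print(model_b.coef_, model_b.intercept_)
```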

The model can be tuned with various additional parameters. Any additional parameters are written inside the parentheses. For example, in the following, a linear regression model is formed that does not include the constant term (intercept) at all.

<center>
    model = LinearRegression(fit_intercept=False).fit(X, y)
</center>

Modifying models with additional parameters requires good knowledge of the models.
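To see what such a parameter does in practice, the following sketch fits the same made-up data (y = 1 + 2x) with and without the constant term. The variable names are only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 1 + 2*x exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

with_intercept = LinearRegression().fit(X, y)
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

# With the constant term the line is recovered as it was generated.
print(with_intercept.intercept_, with_intercept.coef_)

# Without it the line is forced through the origin,
# so the slope has to compensate and the fit gets worse.
print(no_intercept.intercept_, no_intercept.coef_)
```

The parameters a model was created with can be inspected afterwards with `model.get_params()`.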

### Accuracy of the model

There are many methods for assessing the accuracy of a model. For regression models, for example, the command **model.score(X, y)** returns the coefficient of determination, $R^2$. The coefficient of determination indicates the proportion of the variation in the dependent variable that is explained by the model. See <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination">Coefficient of determination in Wikipedia</a>.
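As a rough check of what the score means, the sketch below compares `model.score(X, y)` with $R^2$ computed by hand as $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of y from its mean. The data is invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical, slightly noisy data.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)

# Coefficient of determination as reported by the model.
r2 = model.score(X, y)

# The same quantity computed by hand.
y_pred = model.predict(X)
ss_res = np.sum((y - y_pred) ** 2)    # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
r2_manual = 1 - ss_res / ss_tot

print(r2, r2_manual, r2_score(y, y_pred))
```

All three printed values should agree up to rounding.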

### Predict using the model

The key step in predictive analytics is calculating predictions for new data. If the dataframe X_new contains new values of the independent variables, then the predictions are obtained with the command:

<center>
    model.predict(X_new)
</center>
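A minimal end-to-end sketch of prediction, assuming made-up data in which the price is exactly three times the area. The important point is that X_new has the same columns as the X the model was fitted on.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data: price = 3 * area.
X = pd.DataFrame({"area": [50, 80, 100, 120]})
y = pd.Series([150, 240, 300, 360], name="price")

model = LinearRegression().fit(X, y)

# New observations with the same columns as X, but no known target values.
X_new = pd.DataFrame({"area": [60, 90]})

# predict() returns one prediction per row of X_new.
predictions = model.predict(X_new)
print(predictions)
```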

#### Further information

Source and origin of inspiration: <br />
Aki Taanila: Data-analytiikka Pythonilla: https://tilastoapu.wordpress.com/python/