---

<center><h1>📍 📍 Basics of Scikit Learn 📍 📍</h1></center>


---

- scikit-learn provides simple and efficient tools for preprocessing and predictive modeling.



![](images/sklearn.png)

---


***Steps to build a model in scikit-learn.***

---

1. Import the model.
2. Prepare the data set.
3. Separate the independent and target variables.
4. Create an object of the model.
5. Fit the model with the data.
6. Use the model to predict the target.
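
The six steps above can be sketched end to end as follows. This is only a minimal illustration: the numbers are made up (they follow `y = 2 * x + 1`), and `LinearRegression` is chosen as an example model, not because the steps require it.

```python
# 1. Import the model
from sklearn.linear_model import LinearRegression
import numpy as np

# 2. Prepare the data set (made-up numbers that follow y = 2 * x + 1)
raw = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, 9.0]])

# 3. Separate the independent and target variables
X = raw[:, [0]]   # 2D array, shape (n_samples, n_features)
y = raw[:, 1]     # 1D array, shape (n_samples,)

# 4. Create an object of the model
model = LinearRegression()

# 5. Fit the model with the data
model.fit(X, y)

# 6. Use the model to predict the target for an unseen value
prediction = model.predict(np.array([[5.0]]))
print(prediction)
```

Because the toy data is exactly linear, the prediction for `x = 5` should come out very close to 11.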

In [2]:
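# note: the library is published on PyPI as scikit-learn; the sklearn name used here is a
# deprecated alias that only pulls scikit-learn in as a dependency (see the log below)
!pip3 install sklearn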

Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.0-cp38-cp38-win_amd64.whl (7.2 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Building wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py): started
  Building wheel for sklearn (setup.py): finished with status 'done'
  Created wheel for sklearn: filename=sklearn-0.0-py2.py3-none-any.whl size=1317 sha256=bc4f139e79cc3700e2663baae9f435f53fbd5b9c35d7b11e53797d68dfcd4178
  Stored in directory: c:\users\mnr41\appdata\local\pip\cache\wheels\22\0b\40\fd3f795caaa1fb4c6cb738bc1f56100be1e57da95849bfc897
Successfully built sklearn
Installing collected packages: threadpoolctl, scikit-learn, sklearn
Successfully installed scikit-learn-1.0 sklearn-0.0 threadpoolctl-3.0.0


In [3]:
# import the scikit-learn library
import sklearn

***If you got an error while running the above cell, install the library using one of the following commands.***

If you are using Anaconda with Python 3: ***`!pip install scikit-learn`***

If you are using Jupyter with Python 3: ***`!pip3 install scikit-learn`***

---

In [4]:
# check the version 
sklearn.__version__

'1.0'

- ***We have seen in the pandas notebook that we have some missing values in our data.***
- ***We will impute those missing values using the scikit-learn `SimpleImputer`.***

---

In [6]:
# read the data set and check for the null values
import pandas as pd
data = pd.read_csv('Dataset/big_mart_sales.csv')
data.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [7]:
# import the SimpleImputer
from sklearn.impute import SimpleImputer

---

- For imputing the missing values, we will use `SimpleImputer`.
- First we will create an object of the Imputer and define the strategy.
- We will impute `Item_Weight` with the `mean` value and `Outlet_Size` with the `most_frequent` value.
- Fit the objects with the data.
- Transform the data

---

In [8]:
# create the object of the imputer for Item_Weight and Outlet_Size
impute_weight = SimpleImputer(strategy= 'mean')
impute_size   = SimpleImputer(strategy= 'most_frequent')

In [9]:
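# fit the Item_Weight imputer with the data and transform
# transform() returns a 2D array of shape (n_samples, 1), so flatten it before assigning
impute_weight.fit(data[['Item_Weight']])
data['Item_Weight'] = impute_weight.transform(data[['Item_Weight']]).ravel()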

In [10]:
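# fit the Outlet_Size imputer with the data and transform
# transform() returns a 2D array of shape (n_samples, 1), so flatten it before assigning
impute_size.fit(data[['Outlet_Size']])
data['Outlet_Size'] = impute_size.transform(data[['Outlet_Size']]).ravel()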

In [11]:
# check the null values.
data.isna().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64
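
To try the same imputation pattern without the CSV file, here is a small self-contained sketch. The three-row frame `toy` and its values are hypothetical; only the column names and the two strategies are taken from the cells above. It uses `fit_transform`, which combines the separate `fit` and `transform` calls into one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    'Item_Weight': [10.0, np.nan, 20.0],
    'Outlet_Size': ['Small', 'Small', np.nan],
})

weight_imputer = SimpleImputer(strategy='mean')
size_imputer = SimpleImputer(strategy='most_frequent')

# fit_transform() is fit() followed by transform() on the same data;
# the result is a 2D array, so flatten it before assigning to a column
toy['Item_Weight'] = weight_imputer.fit_transform(toy[['Item_Weight']]).ravel()
toy['Outlet_Size'] = size_imputer.fit_transform(toy[['Outlet_Size']]).ravel()

print(toy)
print(toy.isna().sum())
```

The missing weight is replaced by the mean of the observed weights (15.0 here), and the missing size by the most common observed size (`'Small'`).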

- ***After the preprocessing step, we separate the independent and target variables and pass the data to the model object to train the model.***
---
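Separating the variables could look like the sketch below. It assumes that `Item_Outlet_Sales` is the target, which this notebook does not state explicitly, and it uses a hypothetical three-row frame with made-up values instead of the real data.

```python
import pandas as pd

# Hypothetical three-row stand-in for the imputed data (values are made up)
toy = pd.DataFrame({
    'Item_Weight': [9.3, 5.92, 17.5],
    'Item_MRP': [249.8, 48.3, 141.6],
    'Item_Outlet_Sales': [3735.1, 443.4, 2097.3],
})

# Assumption: Item_Outlet_Sales is the target we want to predict
X = toy.drop(columns=['Item_Outlet_Sales'])  # independent variables
y = toy['Item_Outlet_Sales']                 # target variable

print(X.shape, y.shape)
```

`X` keeps every column except the target, and `y` holds the target column only, which is the shape of input that a model's `fit(X, y)` method expects.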

- ***If we have to identify the category of an object based on some features, for example whether a given picture shows a cat or a dog, we have a `classification problem`.***
- ***If we have to predict a continuous attribute, for example sales based on some features, we have a `regression problem`.***

---

***`scikit-learn` has tools that help you build regression models, classification models, and many others.***

---

In [12]:
# two of the basic models that scikit-learn provides
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
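
As a rough illustration of the difference between the two problem types, the sketch below fits both imported models on the same made-up feature. The data and the `'small'`/`'large'` labels are hypothetical and chosen only so the results are easy to check by eye.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Made-up feature: one number per sample
X = [[1], [2], [3], [10], [11], [12]]

# Regression: the target is a continuous number (here y = 10 * x)
y_continuous = [10.0, 20.0, 30.0, 100.0, 110.0, 120.0]
regressor = LinearRegression().fit(X, y_continuous)
print(regressor.predict([[5]]))

# Classification: the target is a category label
y_category = ['small', 'small', 'small', 'large', 'large', 'large']
classifier = DecisionTreeClassifier(random_state=0).fit(X, y_category)
print(classifier.predict([[2], [11]]))
```

The regressor returns a number (close to 50 for `x = 5`), while the classifier returns one of the labels it saw during training.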

---

After we have built the model, whenever new data points are added to the existing data, we need to perform the same preprocessing steps again before we can use the model to make predictions. This becomes a tedious and time-consuming process!

So, scikit-learn provides tools to create a pipeline of all those steps, which makes your work a lot easier.

---

In [13]:
from sklearn.pipeline import Pipeline

### A pipeline handles the whole flow from the minute the data comes in: preprocessing the data, building the machine learning model, and making predictions with it. It saves a lot of time.
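
A minimal sketch of such a pipeline, assuming a mean imputer followed by a linear regression, is shown here. The data is made up and the step names `'impute'` and `'model'` are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up training data with one missing feature value
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])

# Each step is a (name, estimator) pair; the step names are arbitrary
pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('model', LinearRegression()),
])

# fit() learns the imputation mean and then trains the model on the imputed data
pipe.fit(X_train, y_train)

# predict() applies the same imputation to new data before predicting
X_new = np.array([[3.0], [np.nan]])
predictions = pipe.predict(X_new)
print(predictions)
```

The new rows go through the same imputation that was learned during training, so there is no need to repeat the preprocessing by hand before calling `predict`.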

___

***We will study each of these steps in detail in the upcoming modules.***

---

***Learn more about scikit-learn here: https://scikit-learn.org/stable/index.html***

---