---

<center><h1>📍 📍 Basics of scikit-learn 📍 📍</h1></center>


---

- scikit-learn provides simple and efficient tools for data pre-processing and predictive modeling.



![](images/sklearn.png)

---


***Steps to build a model in scikit-learn.***

---

1. Import the model.
2. Prepare the data set.
3. Separate the independent and target variables.
4. Create an object of the model.
5. Fit the model with the data.
6. Use the model to predict the target.
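
The six steps above can be sketched end to end as follows. This is only a minimal illustration on a small synthetic data set: the numbers and the column names `x1`, `x2`, and `y` are made up for the example and are not part of the Big Mart data used later in this notebook.

```python
# 1. Import the model
import pandas as pd
from sklearn.linear_model import LinearRegression

# 2. Prepare the data set (toy data where y = 2*x1 + 3*x2 + 1)
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 5.0],
})
toy["y"] = 2 * toy["x1"] + 3 * toy["x2"] + 1

# 3. Separate the independent and target variables
X = toy[["x1", "x2"]]
y = toy["y"]

# 4. Create an object of the model
model = LinearRegression()

# 5. Fit the model with the data
model.fit(X, y)

# 6. Use the model to predict the target
predictions = model.predict(X)
print(predictions[:3])
```

Because the toy target is an exact linear function of the two features, the predictions should closely match `y`. Real data will not fit this neatly.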

In [1]:
# import the scikit-learn library
import sklearn

***If you got an error while running the above cell, install scikit-learn using one of the following commands.***

If you are using Anaconda with Python 3: ***`!pip install scikit-learn`***

If you are using Jupyter with Python 3: ***`!pip3 install scikit-learn`***

---

In [2]:
# check the version 
sklearn.__version__

'0.23.2'

- ***We saw in the pandas notebook that our data has some missing values.***
- ***We will impute those missing values using scikit-learn's `SimpleImputer`.***

---

In [14]:
# read the data set and check for null values
import pandas as pd
data = pd.read_csv('dataset/big_mart_sales.csv')
data.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [15]:
# import the SimpleImputer
from sklearn.impute import SimpleImputer

In [17]:
data.describe()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| count | 7060.0 | 8523.0 | 8523.0 | 8523.0 | 8523.0 |
| mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
| std | 4.643456 | 0.051598 | 62.275067 | 8.37176 | 1706.499616 |
| min | 4.555 | 0.0 | 31.29 | 1985.0 | 33.29 |
| 25% | 8.77375 | 0.026989 | 93.8265 | 1987.0 | 834.2474 |
| 50% | 12.6 | 0.053931 | 143.0128 | 1999.0 | 1794.331 |
| 75% | 16.85 | 0.094585 | 185.6437 | 2004.0 | 3101.2964 |
| max | 21.35 | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |


---

- To impute the missing values, we will use `SimpleImputer`.
- First, we create an imputer object for each column and define its strategy.
- We will impute `Item_Weight` with the `mean` value and `Outlet_Size` with the `most_frequent` value.
- Then we fit each imputer on its column.
- Finally, we transform the data to fill in the missing values.

---

In [19]:
# create the object of the imputer for Item_Weight and Outlet_Size
impute_weight = SimpleImputer(strategy= 'mean')
impute_size   = SimpleImputer(strategy= 'most_frequent')
print("Impute_weight",impute_weight)
print("Impute_size",impute_size)

Impute_weight SimpleImputer()
Impute_size SimpleImputer(strategy='most_frequent')


In [20]:
# fit the Item_Weight imputer with the data and transform
impute_weight.fit(data[['Item_Weight']])
# transform returns an (n, 1) array, so flatten it before assigning it back
data['Item_Weight'] = impute_weight.transform(data[['Item_Weight']]).ravel()

In [21]:
# fit the Outlet_Size imputer with the data and transform
impute_size.fit(data[['Outlet_Size']])
# transform returns an (n, 1) array, so flatten it before assigning it back
data['Outlet_Size'] = impute_size.transform(data[['Outlet_Size']]).ravel()
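
As a side note, the separate fit and transform calls can also be combined with `fit_transform`. The sketch below shows this on a tiny made-up frame rather than the Big Mart file. The values are invented, and only the two column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-in for the two columns with missing values
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5],
    "Outlet_Size": pd.Series(["Medium", "Medium", np.nan], dtype=object),
})

# fit_transform learns the fill value and applies it in one call;
# ravel() flattens the (n, 1) result so it fits back into a single column
weight_imputer = SimpleImputer(strategy="mean")
toy["Item_Weight"] = weight_imputer.fit_transform(toy[["Item_Weight"]]).ravel()

size_imputer = SimpleImputer(strategy="most_frequent")
toy["Outlet_Size"] = size_imputer.fit_transform(toy[["Outlet_Size"]]).ravel()

print(toy)
```

Keeping the fitted imputer objects around, as done here, is useful if the same fill values need to be applied to new data later.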

In [22]:
# check the null values.
data.isna().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

- ***Now, after the preprocessing step, we separate the independent and target variables and pass the data to the model object to train the model.***
---
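
Separating the independent and target variables might look like the sketch below. It assumes that `Item_Outlet_Sales` is the target, which this notebook does not state explicitly, and it uses a tiny made-up frame with only numeric columns so that it runs on its own.

```python
import pandas as pd

# Tiny stand-in for the imputed data (values are made up)
sample = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, 17.5],
    "Item_MRP": [249.81, 48.27, 141.62],
    "Item_Outlet_Sales": [3735.14, 443.42, 2097.27],
})

# Independent variables: every column except the assumed target
X = sample.drop(columns=["Item_Outlet_Sales"])

# Target variable: the column we want to predict
y = sample["Item_Outlet_Sales"]

print(X.shape, y.shape)
```

On the full data set, the text columns would also need to be encoded as numbers before most models can use them.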

- ***If we have to identify the category of an object based on some features, for example whether a given picture shows a cat or a dog, we have a `classification problem`.***
- ***If we have to predict a continuous attribute, such as sales, based on some features, we have a `regression problem`.***

---

***`scikit-learn` has tools that help you build regression models, classification models, and many others.***

---

In [23]:
# two of the basic models that scikit-learn provides
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
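
To make the difference between the two problem types concrete, here is a small sketch that fits both imported models on made-up data. The single feature, the labels, and the sales numbers are invented purely for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# One made-up feature, used for both tasks
X = [[1.0], [2.0], [3.0], [4.0]]

# Classification: the target is a category
labels = ["cat", "cat", "dog", "dog"]
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, labels)
predicted_label = clf.predict([[3.5]])[0]

# Regression: the target is a continuous number
sales = [10.0, 20.0, 30.0, 40.0]
reg = LinearRegression()
reg.fit(X, sales)
predicted_sales = reg.predict([[5.0]])[0]

print(predicted_label, predicted_sales)
```

Both models share the same `fit` and `predict` interface. Only the type of the target changes.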

---

After we have built the model, whenever new data points are added to the existing data, we need to perform the same preprocessing steps again before we can use the model to make predictions. This becomes a tedious and time-consuming process!

So, scikit-learn provides tools to create a pipeline of all those steps, which makes your work a lot easier.

---

In [24]:
from sklearn.pipeline import Pipeline
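
As a rough sketch of the idea, a pipeline can chain an imputer and a model so that new data goes through the same preprocessing automatically. The step names `impute` and `model` and all numbers below are made up for the example. The actual pipeline for this data set is covered in later modules.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up training data with one missing value
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])

# Chain the preprocessing step and the model into one object
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])

# fit runs the imputer first, then trains the model on the imputed data
pipe.fit(X_train, y_train)

# New data, including a missing value, is imputed with the mean
# learned from the training data before the model predicts
X_new = np.array([[np.nan], [3.0]])
new_predictions = pipe.predict(X_new)
print(new_predictions)
```

The benefit is that calling `pipe.predict` on new data repeats the preprocessing without any extra code.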

___

***We will study each of these steps in detail in the upcoming modules.***

---

***Learn more about scikit-learn here: https://scikit-learn.org/stable/index.html***

---