<a href="https://colab.research.google.com/github/kanishkraj-ops/Machine-Learning/blob/main/Scikit_Learn(Model_Creation).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scikit-Learn: Model Creation**


1. Getting the data ready
2. Choosing the right estimator/algorithm
3. Fit the model and use it to make predictions
4. Evaluating the model
5. Improve the model
6. Save and Load the trained model
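
The six steps above can be sketched end to end on a small synthetic dataset. This is only an illustration: `make_classification` stands in for a real table, and the hyperparameter values and in-memory pickling are assumptions chosen to keep the example self-contained.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Get the data ready (synthetic stand-in for a real table)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Choose an estimator
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 3. Fit it and make predictions
clf.fit(X_train, y_train)
y_preds = clf.predict(X_test)

# 4. Evaluate on data the model has not seen
test_accuracy = clf.score(X_test, y_test)

# 5. Improve: try another hyperparameter value and compare
bigger = RandomForestClassifier(n_estimators=200, random_state=0)
bigger_accuracy = bigger.fit(X_train, y_train).score(X_test, y_test)

# 6. Save and load (in memory here; writing to a file works the same way)
restored = pickle.loads(pickle.dumps(clf))

print(test_accuracy, bigger_accuracy)
```

The restored model should give the same predictions as the original, which is a quick way to check that saving and loading worked.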






# **Getting The Data Ready**

In [84]:
import pandas as pd
heart_disease = pd.read_csv("./data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


# **✅ What Are "Features"?**
In machine learning, a feature is simply a piece of information used to help the model make a decision.

Think of features as columns in a table:

Each row is a data sample (like one person, one email, or one house).

Each column is a feature (like age, price, or temperature).
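
To make the row/column picture concrete, here is a tiny hand-made table. The three patients and their values are made up for illustration; only the column names `age`, `chol` and `target` are borrowed from the heart disease data.

```python
import pandas as pd

# Hypothetical three-patient table: each row is a sample, each column a feature
patients = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "target": [1, 1, 0],  # the label we want to predict
})

features = patients.drop("target", axis=1)  # everything except the label
labels = patients["target"]

print(features.shape)             # (3, 2)
print(list(features.columns))     # ['age', 'chol']
```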

In [85]:
#create the feature matrix X (every column except the target)
X = heart_disease.drop("target",axis=1)

#create y (the labels)
y = heart_disease["target"]

In [86]:
#choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier() #clf ---> classifier

#we keep the default hyperparameters; get_params() lists them
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [87]:
#split the data, then fit the model to the training data
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

In [88]:
clf.fit(X_train,y_train);

In [89]:
#make a prediction
y_preds = clf.predict(X_test)
y_preds

array([0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1])

In [90]:
#Evaluate the model on the training and test data
clf.score(X_train,y_train)

1.0

In [91]:
clf.score(X_test,y_test)

0.819672131147541

In [92]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

print(classification_report(y_test,y_preds))

              precision    recall  f1-score   support

           0       0.74      0.83      0.78        24
           1       0.88      0.81      0.85        37

    accuracy                           0.82        61
   macro avg       0.81      0.82      0.81        61
weighted avg       0.83      0.82      0.82        61



In [93]:
accuracy_score(y_test,y_preds)

0.819672131147541
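
`confusion_matrix` is imported above but never called. A minimal sketch of what it returns, using made-up labels and predictions rather than the heart disease results:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()

print(cm)
# [[2 1]
#  [1 4]]
print(accuracy_score(y_true, y_pred))  # 0.75
```

Accuracy is the diagonal of the matrix divided by the total, here (2 + 4) / 8.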

In [94]:
#Improve the model
#Try different numbers of trees (n_estimators)
import numpy as np
np.random.seed(40)
for i in range(10,100,10):
  print(f"Trying model with {i} estimators...")
  clf = RandomForestClassifier(n_estimators=i).fit(X_train,y_train)
  print(f"Model accuracy on test set: {clf.score(X_test,y_test)}")


Trying model with 10 estimators...
Model accuracy on test set: 0.7377049180327869
Trying model with 20 estimators...
Model accuracy on test set: 0.7540983606557377
Trying model with 30 estimators...
Model accuracy on test set: 0.7540983606557377
Trying model with 40 estimators...
Model accuracy on test set: 0.7540983606557377
Trying model with 50 estimators...
Model accuracy on test set: 0.819672131147541
Trying model with 60 estimators...
Model accuracy on test set: 0.7868852459016393
Trying model with 70 estimators...
Model accuracy on test set: 0.819672131147541
Trying model with 80 estimators...
Model accuracy on test set: 0.8032786885245902
Trying model with 90 estimators...
Model accuracy on test set: 0.8032786885245902
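
The scores above move between roughly 0.74 and 0.82 without a clear trend, which is typical when each setting is judged on one small test split. One common way to get a steadier comparison is cross-validation. The sketch below uses `cross_val_score` on synthetic data; the dataset, the three `n_estimators` values and `cv=5` are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

results = {}
for n in (10, 50, 100):
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    # Five train/validation splits instead of one
    scores = cross_val_score(forest, X_demo, y_demo, cv=5)
    results[n] = scores.mean()
    print(f"{n} estimators: mean accuracy {results[n]:.3f}")
```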


In [95]:
import pickle

#save the model; the with block closes the file afterwards
with open("random_forest_model_1.pkl","wb") as f:
    pickle.dump(clf,f)

In [96]:
with open("random_forest_model_1.pkl","rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test,y_test)

0.8032786885245902

# **INTRO TO SCIKIT LEARN**

Three main things that we have to do:
1. Split the data into features and labels (usually 'X' and 'y')
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
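
The three steps above can be chained in a single preprocessing object. The sketch below is one possible arrangement, not the approach taken later in this notebook: it uses a hypothetical four-row table and combines `SimpleImputer` and `OneHotEncoder` through `Pipeline` and `ColumnTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical table with one missing value in each feature column
df = pd.DataFrame({
    "Make": pd.Series(["Honda", "BMW", np.nan, "Toyota"], dtype=object),
    "Odometer": [35431.0, np.nan, 84714.0, 154365.0],
    "Price": [15323, 19943, 28343, 13434],
})

# 1. Split into features and labels
X = df.drop("Price", axis=1)
y = df["Price"]

# 2 and 3. Fill the categorical column, then one-hot encode it;
# fill the numerical column with its mean
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("cat", categorical, ["Make"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer"]),
])

X_ready = prep.fit_transform(X)
# ColumnTransformer may return a sparse matrix; convert for easy inspection
X_ready = X_ready.toarray() if hasattr(X_ready, "toarray") else X_ready
print(X_ready.shape)  # (4, 5): four Make categories plus Odometer
```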

In [97]:
X = heart_disease.drop("target",axis=1)
y = heart_disease["target"]

In [98]:
X.head(), y.head()

(   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
 0   63    1   3       145   233    1        0      150      0      2.3      0   
 1   37    1   2       130   250    0        1      187      0      3.5      0   
 2   41    0   1       130   204    0        0      172      0      1.4      2   
 3   56    1   1       120   236    0        1      178      0      0.8      2   
 4   57    0   0       120   354    0        1      163      1      0.6      2   
 
    ca  thal  
 0   0     1  
 1   0     2  
 2   0     2  
 3   0     2  
 4   0     2  ,
 0    1
 1    1
 2    1
 3    1
 4    1
 Name: target, dtype: int64)

In [99]:
#split the data into training and test data

In [100]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)


The important thing to understand here is that a model has no idea what a house, a car or a dog is. It only works with numbers, so non-numerical features (columns) have to be converted to numerical ones before the model can be fitted.
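
As a quick illustration of turning text into numbers, `pandas.get_dummies` one-hot encodes a column in one call. This is a sketch on a made-up frame and an alternative to the scikit-learn encoder, shown only to make the idea visible.

```python
import pandas as pd

# Made-up frame with one text column and one numeric column
cars_demo = pd.DataFrame({
    "Colour": ["Red", "Green", "Blue", "Red"],
    "Doors": [4, 4, 5, 3],
})

# Each distinct colour becomes its own 0/1 column
encoded = pd.get_dummies(cars_demo, columns=["Colour"], dtype=int)
print(encoded.columns.tolist())
# ['Doors', 'Colour_Blue', 'Colour_Green', 'Colour_Red']
```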

In [101]:
#Make sure it's all numerical
car_sales = pd.read_csv("./data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [102]:
car_sales.dtypes

Unnamed: 0,0
Make,object
Colour,object
Odometer (KM),int64
Doors,int64
Price,int64


In [103]:
#turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features,)],
                                  remainder="passthrough")
transformed_X = transformer.fit_transform(car_sales)
transformed_X



array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

In [104]:
#What is one-hot encoding?
#It's simple; take a small example
car_num = pd.Series([0,1,2,3])
car_color=pd.Series(["Red","Green","Blue","Red"])
cars = pd.DataFrame({"Car":car_num,
                     "Colour":car_color})
cars


Unnamed: 0,Car,Colour
0,0,Red
1,1,Green
2,2,Blue
3,3,Red


In [105]:
#Colour is a categorical column, so let's one-hot encode it
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
enc = OneHotEncoder()
features = ["Colour"]
transforming = ColumnTransformer([("enc",
                                  enc,
                                  features)],
                                 remainder="passthrough"
                                 )
transformed_cars = transforming.fit_transform(cars)
transformed_cars


array([[0., 0., 1., 0.],
       [0., 1., 0., 1.],
       [1., 0., 0., 2.],
       [0., 0., 1., 3.]])

In [106]:
pd.DataFrame(transformed_cars)

Unnamed: 0,0,1,2,3
0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,1.0
2,1.0,0.0,0.0,2.0
3,0.0,0.0,1.0,3.0


In [107]:
pd.DataFrame(transformed_X)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0,15323.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0,19943.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0,28343.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0,13434.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0,14043.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0,32042.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0,5716.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0,31570.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0,4001.0


In [108]:
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

#split into training and test set

from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

#test_size=0.2 means that 20% of the data is held out for testing and the remaining 80% is used for training


In [109]:
#Building a model
from sklearn.ensemble import RandomForestRegressor
#we use a regressor here because we want to predict the price of a car,
#which is a number
#we use a classifier when we want to assign a sample to a class,
#for example whether a person has heart disease or not
model = RandomForestRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

ValueError: could not convert string to float: 'Toyota'

In [110]:
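#let's refit the model with the transformed data
#note: transformed_X was built from the whole car_sales frame, so Price is
#still in it as a passthrough column; that explains the near-perfect score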
np.random.seed(1)
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)
model.fit(X_train,y_train)

In [111]:
model.score(X_test,y_test)

0.9999047237335844

## **What If There Were Missing Values?**
1. Fill them with some values(called imputation)
2. Remove the samples with missing data altogether
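
The two options can be compared on a small made-up frame. The column names and values below are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column
demo = pd.DataFrame({
    "Odometer": [1000.0, np.nan, 3000.0, 5000.0],
    "Price": [10.0, 20.0, np.nan, 40.0],
})

# Option 1: imputation, here with each column's mean
filled = demo.fillna(demo.mean())

# Option 2: drop every row that has at least one missing value
dropped = demo.dropna()

print(len(filled), len(dropped))  # 4 2
```

Imputation keeps every row at the cost of inventing values; dropping keeps only real values at the cost of losing rows.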

In [112]:
car_sales_missing = pd.read_csv("./data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [113]:
car_sales_missing.isna().sum() #how many missing values are there in each column

Unnamed: 0,0
Make,49
Colour,50
Odometer (KM),50
Doors,50
Price,50


In [114]:
#create X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]
X.head(),y.head()
#we can't fit the model on these yet because they contain NaN values

(     Make Colour  Odometer (KM)  Doors
 0   Honda  White        35431.0    4.0
 1     BMW   Blue       192714.0    5.0
 2   Honda  White        84714.0    4.0
 3  Toyota  White       154365.0    4.0
 4  Nissan   Blue       181577.0    3.0,
 0    15323.0
 1    19943.0
 2    28343.0
 3    13434.0
 4    14043.0
 Name: Price, dtype: float64)

In [115]:
#Fill missing data with pandas
#assign the result back instead of using inplace=True on a selected column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,0
Colour,0
Odometer (KM),0
Doors,0
Price,50


In [116]:
car_sales_missing.dropna(inplace=True)
car_sales_missing.isna().sum(), len(car_sales_missing)

(Make             0
 Colour           0
 Odometer (KM)    0
 Doors            0
 Price            0
 dtype: int64,
 950)

In [117]:
X=car_sales_missing.drop("Price",axis=1)
y=car_sales_missing["Price"]

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features,)],
                                  remainder="passthrough")
transformed_X = transformer.fit_transform(car_sales_missing)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

In [118]:
#Filling missing data with scikit learn

In [119]:
car_sales_missing = pd.read_csv("./data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,49
Colour,50
Odometer (KM),50
Doors,50
Price,50


In [120]:
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,47
Colour,46
Odometer (KM),48
Doors,47
Price,0


In [121]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [122]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

#fill categorical values with "missing" and numerical values with mean
cat_imputer  = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

#define columns
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

#create an imputer (something that fills missing data)

imputer = ColumnTransformer([("cat_imputer", cat_imputer,cat_features),
                             ("door_imputer", door_imputer,door_feature),
                             ("num_imputer", num_imputer,num_features)])
filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [124]:
car_sales_filled = pd.DataFrame(filled_X)
car_sales_filled.head()

Unnamed: 0,0,1,2,3
0,Honda,White,4.0,35431.0
1,BMW,Blue,5.0,192714.0
2,Honda,White,4.0,84714.0
3,Toyota,White,4.0,154365.0
4,Nissan,Blue,3.0,181577.0


In [125]:
car_sales_filled.isna().sum()

Unnamed: 0,0
0,0
1,0
2,0
3,0


In [128]:
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.9998421058539825
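
Note that the cell above still trains on `transformed_X` from the pandas section, which was built from the full frame and so includes `Price`, rather than on the imputed `filled_X`. A sketch of one way to wire the imputed and encoded features into the regressor without the target leaking in is shown below. It runs on synthetic data with hypothetical column names, and it splits before fitting the preprocessing so the test rows do not influence the imputation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the car sales table
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Make": pd.Series(rng.choice(["Honda", "BMW", "Toyota"], size=n), dtype=object),
    "Odometer": rng.uniform(10_000, 250_000, size=n),
})
demo["Price"] = 30_000 - 0.08 * demo["Odometer"] + rng.normal(0, 1_000, size=n)

# Knock out some feature values to imitate missing data
demo.loc[::17, "Make"] = np.nan
demo.loc[::23, "Odometer"] = np.nan

# Features never contain the target
X = demo.drop("Price", axis=1)
y = demo["Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("cat", categorical, ["Make"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer"]),
])

# Preprocessing is learned from the training rows only
model = Pipeline([
    ("prep", prep),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=42)),
])
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)
```

With the target kept out of the features, the score reflects how well the remaining columns predict the price.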