# Introduction to scikit-learn

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

What we're going to cover:


0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problem
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together!

In [1]:
# Let's listify the contents
what_were_covering = [
    "0. An end-to-end Scikit-Learn workflow",
    "1. Getting the data ready",
    "2. Choose the right estimator/algorithm for our problems",
    "3. Fit the model/algorithm and use it to make predictions on our data",
    "4. Evaluating a model",
    "5. Improve a model",
    "6. Save and load a trained model",
    "7. Putting it all together!"]

## 0. An end-to-end Scikit-Learn workflow

In [None]:
# 1. Get the data ready
import pandas as pd
import numpy as np

heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()

In [None]:
# Create X (features matrix)
X = heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease["target"]

In [None]:
# 2. Choose the right model and hyperparameters
# This is a classification problem because we want to predict whether or not a patient has heart disease

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=90)

# We set n_estimators and keep the other hyperparameters at their defaults
clf.get_params()

In [None]:
# 3. Split the data, then fit the model to the training data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
clf.fit(X_train, y_train);
X_train.head()

In [None]:
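# Make a prediction
# clf.predict(np.array([0, 2, 3, 4])) raises a ValueError: the model expects a 2D input
# with the same columns it was trained on, so predict on one row of X_test instead
y_label = clf.predict(X_test.iloc[[0]])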

In [None]:
y_preds = clf.predict(X_test)

In [None]:
y_preds

In [None]:
y_test.head()

In [None]:
# 4. Evaluate the model on the training data and test data
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

In [None]:
confusion_matrix(y_test, y_preds)

In [None]:
accuracy_score(y_test, y_preds)

In [None]:
# 5. Improve a model
# Try different numbers of estimators (n_estimators)

np.random.seed(10)
for i in range(10, 100, 10):
    print("Trying model with {} estimators".format(i))
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print("Model accuracy on test set: {:.2f}%".format(clf.score(X_test, y_test) * 100))
    print()

In [None]:
# 6. Save a model and load it
import pickle

with open("random-forest-model1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [None]:
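# The file name must match the one used when saving (file names can be case-sensitive)
with open("random-forest-model1.pkl", "rb") as f:
    loaded_model = pickle.load(f)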
loaded_model.score(X_test, y_test)

# Retry again

In [None]:
heart_data = pd.read_csv("heart-disease.csv")
heart_data.head()

In [None]:
X = heart_data.drop("target", axis=1)
X

In [None]:
y = heart_data["target"]
y

In [None]:
clf = RandomForestClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
clf.fit(X_train, y_train);

In [None]:
y_pred = clf.predict(X_test)
y_pred

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

# Retry again

In [None]:
heart_data = pd.read_csv("heart-disease.csv")
X = heart_data.drop("target", axis=1)
y = heart_data["target"]

clf = RandomForestClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

clf.score(X_test, y_test)

In [None]:
what_were_covering

# 1. Getting your data ready

Three main things we have to do:

1. Split the data into features and labels (usually "X" and "y")
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values into numerical values (also called feature encoding)
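
As a rough sketch of these three steps, here they are on a tiny made-up table. The column names and values are hypothetical and chosen only for illustration; they are not taken from the CSV files used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one text column, one numeric column with a gap, one label
toy = pd.DataFrame({
    "Colour": ["Red", "Blue", "Red", "Green"],
    "Odometer (KM)": [35000.0, np.nan, 84000.0, 12000.0],
    "Price": [15000, 19000, 9000, 22000],
})

# 1. Split the data into features (X) and labels (y)
X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]

# 2. Fill (impute) the missing numeric value with the column mean
X_toy["Odometer (KM)"] = X_toy["Odometer (KM)"].fillna(X_toy["Odometer (KM)"].mean())

# 3. Convert the non-numerical column into numbers (one-hot encoding)
X_toy_encoded = pd.get_dummies(X_toy, columns=["Colour"])
print(X_toy_encoded)
```

`pd.get_dummies` is used here only to keep the sketch short; the cells below do the same job with Scikit-Learn's `OneHotEncoder`.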

In [None]:
heart_disease.head()

In [None]:
X = heart_disease.drop("target", axis=1)
X.head()

In [None]:
y = heart_disease["target"]
y.head()

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape # Important, make sure shapes match.

# 1.1 Make sure it's all numerical

In [None]:
car_sales = pd.read_csv("car-sales-extended.csv")
car_sales.head()

In [None]:
len(car_sales)

In [None]:
car_sales.dtypes

In [None]:
# Split the data into X and y
X = car_sales.drop("Price", axis=1)
X.head()

In [None]:
y = car_sales["Price"]
y.head()

In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Build machine learning model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
# Note: fitting is expected to raise a ValueError here, because "Make" and "Colour" are still strings
model.fit(X_train, y_train) # Fit on training data
model.score(X_test, y_test) # Evaluate on test data

In [None]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough")
transformed_x = transformer.fit_transform(X)
transformed_x

In [None]:
X.head()

In [None]:
pd.DataFrame(transformed_x).head()

In [None]:
# dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
# dummies.head()

In [None]:
# Let's try refitting the model
np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(transformed_x,
                                                   y,
                                                   test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
model.fit(X_train, y_train);

In [None]:
model.score(X_test, y_test)

# 1.2 What if there were missing values?
1. Fill them with some value (also known as imputation)
2. Remove the samples with missing data altogether
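
A minimal sketch of the two options on a made-up table follows. The values are hypothetical; the real car sales data is handled in the cells below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both a feature and the label
toy = pd.DataFrame({
    "Doors": [4.0, np.nan, 4.0, 5.0],
    "Price": [15000.0, 19000.0, np.nan, 22000.0],
})

# Option 1: fill the missing feature value (here with the most common value)
filled = toy.copy()
filled["Doors"] = filled["Doors"].fillna(filled["Doors"].mode()[0])

# Option 2: remove every row that has a missing value
dropped = toy.dropna()

print(filled)
print(len(toy), "rows before dropping,", len(dropped), "rows after")
```

Filling keeps every sample but invents a value, while dropping keeps only observed values but shrinks the dataset. Which trade-off is better depends on how much data is missing.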

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Import car sales missing data
missing_data = pd.read_csv("car-sales-extended-missing-data.csv")
missing_data.head()
missing_data.dtypes

In [None]:
missing_data.isna().sum()

In [None]:
# Create X and y
X = missing_data.drop("Price", axis=1)
X.head()

In [None]:
y = missing_data["Price"]
y.head()

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")

transformed_x = transformer.fit_transform(X)

In [None]:
X.isna().sum()

#### Option 1: Fill missing data with pandas

In [None]:
# Fill the "Make" column
missing_data["Make"] = missing_data["Make"].fillna("missing")

# Fill the "Colour" column
missing_data["Colour"] = missing_data["Colour"].fillna("missing")

# Fill missing "Odometer (KM)" with mean of Odometer
missing_data["Odometer (KM)"] = missing_data["Odometer (KM)"].fillna(missing_data["Odometer (KM)"].mean())

# Fill the "Doors" column with 4 (check value_counts() to see the most common value)
missing_data["Doors"].value_counts()
missing_data["Doors"] = missing_data["Doors"].fillna(4)

# Check our dataframe again
missing_data.isna().sum()

In [None]:
# Remove rows with missing price value
missing_data.dropna(inplace=True)
missing_data.isna().sum()

In [None]:
len(missing_data)

In [None]:
X = missing_data.drop("Price", axis=1)
y = missing_data["Price"]

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

In [None]:
import pandas as pd
import numpy as np

In [None]:
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()

In [None]:
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

In [None]:
heart_disease.isna().sum()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
clf.fit(X_train, y_train);

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

# Try again making data numerical and running ML

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Get the data ready
car_sales = pd.read_csv("car-sales-extended.csv")
car_sales.head(), len(car_sales), car_sales.dtypes

In [None]:
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

In [None]:
feature_data = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 feature_data)],
                               remainder="passthrough")

transformed_X = transformer.fit_transform(X)
X_pd = pd.DataFrame(data=transformed_X)
X_pd.head()
# model = RandomForestRegressor()
# transformed_X

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_pd, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
model.fit(X_train, y_train)
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

# Restart

In [None]:
what_were_covering

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Getting the data ready
missing_data = pd.read_csv("car-sales-extended-missing-data.csv")
missing_data.head(), len(missing_data), missing_data.dtypes

In [None]:
missing_data.isna().sum()

In [None]:
# Fill missing "Make" and "Colour" with "missing", "Odometer (KM)" with its mean and "Doors" with 4,
# then drop the rows with a missing "Price" (string data is made numerical further down)

missing_data["Make"] = missing_data["Make"].fillna("missing")
missing_data["Colour"] = missing_data["Colour"].fillna("missing")
missing_data["Odometer (KM)"] = missing_data["Odometer (KM)"].fillna(missing_data["Odometer (KM)"].mean())
missing_data["Doors"] = missing_data["Doors"].fillna(4)

missing_data.dropna(inplace=True)

In [None]:
missing_data.isna().sum()

In [None]:
# Now that all missing data has been handled, split the data into X and y
X = missing_data.drop("Price", axis=1)
y = missing_data["Price"]

In [None]:
X.head(), y.head()

In [None]:
# Turn the data into numerical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

feature_data = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 feature_data)],
                               remainder="passthrough",
                               sparse_threshold=0)

transformed_X = transformer.fit_transform(X)
X_pd = pd.DataFrame(transformed_X)
X_pd.head()

In [None]:
# Fit the model. We are trying to get an estimate of price
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=60)
X_train, X_test, y_train, y_test = train_test_split(X_pd, y, test_size=0.2)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Train the model
model.fit(X_train, y_train);

In [None]:
# Score the model on the training data
model.score(X_train, y_train)

In [None]:
# Score the model on the test data
model.score(X_test, y_test)

In [None]:
# Split once so every model is scored on the same test set
X_train, X_test, y_train, y_test = train_test_split(X_pd, y, test_size=0.2)
for i in range(10, 100, 25):
    model = RandomForestRegressor(n_estimators=i)
    model.fit(X_train, y_train)
    print("Testing {} estimators: Result = {}".format(i, model.score(X_test, y_test)))

#### Option 2: Fill missing values with Scikit-Learn

In [None]:
missing_data = pd.read_csv("car-sales-extended-missing-data.csv")
missing_data.isna().sum()

In [None]:
missing_data

In [None]:
# Drop the rows with no labels
missing_data.dropna(subset=["Price"], inplace=True)
missing_data.isna().sum()

In [None]:
missing_data

In [None]:
X = missing_data.drop("Price", axis=1)
y = missing_data["Price"]

In [None]:
missing_data

In [None]:
# Fill missing values with scikit-learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with "missing", doors with 4 and numerical values with the mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
cat_feature = ["Make", "Colour"]
door_feature = ["Doors"]
num_feature = ["Odometer (KM)"]

# Create an imputer (Something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_feature),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_feature)
])

# Transform the data
filled_X = imputer.fit_transform(X)
filled_X

In [None]:
X_pd = pd.DataFrame(filled_X,
                    columns=["Make", "Colour", "Doors", "Odometer (KM)"])
X_pd.head()

In [None]:
X_pd.isna().sum()

In [None]:
# Convert categorical values to numerical
from sklearn.preprocessing import OneHotEncoder


cat_data = ["Make", "Colour", "Doors"]
hot_one = OneHotEncoder()
transformer = ColumnTransformer([("hot_one", hot_one, cat_data)],
                                remainder="passthrough",
                               sparse_threshold=0)

transform_X = transformer.fit_transform(X_pd)
transform_X

In [None]:
# Now that our data is numerical and has no missing values, fit a model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transform_X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

# Retry

In [None]:
what_were_covering

In [None]:
# Getting the data ready
# Import the data
# Check for missing data
# Make categorical data numerical
# Choose the right model
# fit the model
# Make predictions

In [2]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

In [None]:
# 1. import the data
missing_data = pd.read_csv("car-sales-extended-missing-data.csv")
missing_data.head()

In [None]:
# 2. check for missing data
missing_data.isna().sum()

In [None]:
# Drop data with no labels
missing_data.dropna(subset=["Price"], inplace=True)
missing_data.isna().sum()

In [None]:
# Split the data
X = missing_data.drop("Price", axis=1)
y = missing_data["Price"]

In [None]:
X.isna().sum()

In [None]:
# Fix the missing data in X
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
mean_imputer = SimpleImputer(strategy="mean")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

cat_feature = ["Make", "Colour"]
mean_feature = ["Odometer (KM)"]
door_feature = ["Doors"]

imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_feature),
    ("mean_imputer", mean_imputer,mean_feature),
    ("door_imputer", door_imputer, door_feature)])

filled_X = imputer.fit_transform(X)
filled_X[:5]

In [None]:
# Put data back into a dataframe
X_pd = pd.DataFrame(filled_X, columns=["Make", "Colour", "Odometer (KM)", "Doors"])

In [None]:
X_pd.head()

In [None]:
# Change categorical data into numerical data

# "Odometer (KM)" is numerical, so it is passed through instead of being one-hot encoded
categorical_data = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_data)], remainder="passthrough", sparse_threshold=0)

transformed_X = transformer.fit_transform(X_pd)

In [None]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
model = RandomForestRegressor()
model.fit(X_train, y_train);

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

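# Fixing missing values with scikit-learn the recommended way
* Split the data into training and test sets first, then fit the imputers and encoders on X_train only and use them to transform both X_train and X_test. This keeps information from the test set out of the preprocessing step.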
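
A minimal sketch of the fit-on-train, transform-on-test idea is shown here on a made-up table, so it runs without the car sales CSV. The toy values and the `toy_train`/`toy_test` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data (not the car sales CSV)
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", np.nan, "Toyota", "Honda", "BMW", "Toyota", np.nan],
    "Odometer (KM)": [35000.0, np.nan, 84000.0, 12000.0, np.nan, 51000.0, 99000.0, 43000.0],
})
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

imputer = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"), ["Make"]),
    ("num_imputer", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

# Learn the fill values (such as the odometer mean) from the training split only
filled_train = imputer.fit_transform(toy_train)
# Reuse those learned values on the test split: transform(), not fit_transform()
filled_test = imputer.transform(toy_test)

# The stored mean was computed from the training rows alone
train_mean = imputer.named_transformers_["num_imputer"].statistics_[0]
print(train_mean, toy_train["Odometer (KM)"].mean())
```

Calling `fit_transform` on the test split would recompute the mean from test rows, which no longer simulates how the model would treat genuinely unseen data.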

In [None]:
what_were_covering

In [3]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In [None]:
# 1. Get the data ready
missing_data = pd.read_csv("car-sales-extended-missing-data.csv")
missing_data.head()

In [None]:
missing_data.isna().sum()

In [None]:
# Drop data with no labels
missing_data.dropna(subset=["Price"], inplace=True)
missing_data.isna().sum()

In [None]:
# Split the data
X = missing_data.drop("Price", axis=1)
y = missing_data["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Fill in the missing values
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
num_imputer = SimpleImputer(strategy="mean")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

cat_feat = ["Make", "Colour"]
num_feat = ["Odometer (KM)"]
door_feat = ["Doors"]

imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_feat),
    ("num_imputer", num_imputer, num_feat),
    ("door_imputer", door_imputer, door_feat)
], remainder="passthrough", sparse_threshold=0)

# Fit the imputers on the training data only, then apply them to the test data
transformed_X_train = imputer.fit_transform(X_train)
transformed_X_test = imputer.transform(X_test)

In [None]:
# Put transformed data into a dataframe
X_train_pd = pd.DataFrame(transformed_X_train, columns=["Make", "Colour", "Odometer (KM)", "Doors"])
X_test_pd = pd.DataFrame(transformed_X_test, columns=["Make", "Colour", "Odometer (KM)", "Doors"])

In [None]:
# No missing data
X_train_pd.isna().sum(), X_test_pd.isna().sum()

In [None]:
len(X_test_pd), len(y_test)

In [None]:
# Change X_train_pd and X_test_pd into numerical data
# handle_unknown="ignore" skips categories that only appear in the test data
one_hot = OneHotEncoder(handle_unknown="ignore")
cat_features = ["Make", "Colour", "Doors"]

transformer = ColumnTransformer([("one_hot", one_hot, cat_features)], remainder="passthrough", sparse_threshold=0)

# Fit the encoder on the training data only, then reuse it on the test data
transformed_X_train = transformer.fit_transform(X_train_pd)
transformed_X_test = transformer.transform(X_test_pd)

In [None]:
model = RandomForestRegressor()
model.fit(transformed_X_train, y_train)

In [None]:

model.score(transformed_X_test, y_test)

In [None]:
what_were_covering

# 2. Choose the right estimator/algorithm for our problem
Scikit-Learn uses "estimator" as another term for a machine learning model or algorithm.

* Classification - predicting whether a sample is one thing or another
* Regression - predicting a number
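
To make the distinction concrete, here is a small sketch using two datasets that ship with Scikit-Learn (`load_iris` and `load_diabetes`). They are stand-ins chosen so the example runs without any CSV files and are not the datasets used elsewhere in this notebook. The `clf_demo` and `reg_demo` names are only for illustration.

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.datasets import load_diabetes, load_iris
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: the labels are categories (three iris species coded as 0, 1, 2)
X_cls, y_cls = load_iris(return_X_y=True)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_cls, y_cls)

# Regression: the labels are continuous numbers (a disease progression score)
X_reg, y_reg = load_diabetes(return_X_y=True)
reg_demo = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_reg, y_reg)

print(is_classifier(clf_demo), is_regressor(reg_demo))
print(clf_demo.predict(X_cls[:3]))  # predicted class labels
print(reg_demo.predict(X_reg[:3]))  # predicted real-valued numbers
```

Both estimators share the same `fit`/`predict`/`score` interface; what changes is the kind of label they learn to predict.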

## 2.1 Picking a machine learning model for a regression problem

In [None]:
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

boston = pd.read_csv("HousingData.csv")
boston["target"] = target

In [None]:
boston.isna().sum()

In [None]:
# How many samples
len(boston)

In [None]:
boston.dropna(inplace=True)

In [None]:
boston.isna().sum()

In [2]:
# Let's try the ridge regression model
from sklearn.linear_model import Ridge

# Setup random seed
np.random.seed(42)

# Create the data
X = boston.drop("target", axis=1)
y = boston["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the ridge model
model = Ridge()
model.fit(X_train, y_train)

# Check the train data
model.score(X_train, y_train)


In [None]:
# Check the score of the ridge model on test data
model.score(X_test, y_test)

In [4]:
# Try the California housing dataset
from sklearn.linear_model import Ridge
from sklearn.datasets import fetch_california_housing
cali_data = fetch_california_housing()

cali_df = pd.DataFrame(cali_data["data"], columns=cali_data["feature_names"])
cali_df["target"] = pd.Series(cali_data["target"])

In [5]:
# Sort out the data
X = cali_df.drop("target", axis=1)
y = cali_df["target"]

# Split the data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the model
model = Ridge()

# Fit the model
model.fit(X_train, y_train)

# Check train score
model.score(X_train, y_train)

0.6091097708974573

In [6]:
# Check test score
model.score(X_test, y_test)

0.583476492324516

How do we improve this score?

What if Ridge isn't working?

Let's refer back to the map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [7]:
# Try it with the RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.8098934806633551

In [17]:
# The RandomForest did a better job
# Use the machine learning map

## 2.2 Choosing an estimator for a classification problem
* Refer to the map

In [8]:
# Choose an estimator for the heart disease csv
# Get the data ready

heart_disease = pd.read_csv("heart-disease.csv")
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Choose the correct estimator
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
clf.fit(X_train, y_train);

In [9]:
clf.score(X_test, y_test)

0.6721311475409836

In [29]:
# Let's compare to RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
round(model.score(X_test, y_test), 2)

0.85

* SGDClassifier was the wrong choice: the map recommends it for more than 100K samples, and we have fewer than 100K
* Use LinearSVC instead


In [30]:
import warnings 
warnings.filterwarnings("ignore")

In [31]:
from sklearn.svm import LinearSVC

np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate LinearSVC
clf = LinearSVC()
clf.fit(X_train, y_train)

# Evaluate the LinearSVC
round(clf.score(X_test, y_test), 2)

0.87

Tidbit:

1. If you have structured data, use ensemble methods
2. If you have unstructured data, use deep learning or transfer learning

In [32]:
what_were_covering

['0. An end-to-end Scikit-Learn workflow',
 '1. Getting the data ready',
 '2. Choose the right estimator/algorithm for our problems',
 '3. Fit the model/algorithm and use it to make predictions on our data',
 '4. Evaluating a model',
 '5. Improve a model',
 '6. Save and load a trained model',
 '7. Putting it all together!']

# 3. Fit the model/algorithm on our data and use it to make predictions

## 3.1 Fitting the model to the data

Different names for:
* "X" = features, feature variables, data
* "y" = labels, targets, target variables
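
As a quick illustration of what "X" and "y" look like in practice, the sketch below uses `load_breast_cancer`, a dataset bundled with Scikit-Learn, as a stand-in so it runs without the heart disease CSV. The `X_demo` and `y_demo` names are only for illustration.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects (a DataFrame and a Series)
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)

print(X_demo.shape)  # (n_samples, n_features): one row per sample, one column per feature
print(y_demo.shape)  # (n_samples,): one label per sample
```

Whatever names are used, the features form a 2D table and the labels form a 1D sequence with one entry per row of the features.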

In [36]:
# Import the model
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data
heart_disease = pd.read_csv("heart-disease.csv")
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the model
model = RandomForestClassifier()

# Fit the model to the data (Training machine learning model)
model.fit(X_train, y_train)

# Evaluate the model (Use the patterns the model has learnt)
model.score(X_test, y_test)

0.8524590163934426

In [37]:
X.head()

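|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |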


In [38]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

## 3.2 Make predictions using a machine learning model