# Quick Machine Learning

A simple machine learning workflow can be broken down into the following steps:
1. Data Collection
2. Exploratory Data Analysis
3. Data Preprocessing
4. Model Training
5. Model Evaluation
6. Model Refinement

This notebook skips data collection and walks through the remaining steps using the Iris dataset.

In [1]:
# Import the iris dataset
from sklearn.datasets import load_iris

iris = load_iris()
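
Before building a dataframe, it can help to look at what `load_iris()` returns. The sketch below only prints fields that the rest of this notebook relies on (`data`, `feature_names`, `target_names`); it is an illustrative check, not a required step.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The returned object exposes the feature matrix, the integer targets,
# and the human-readable names for both.
print(iris.data.shape)      # (number of samples, number of features)
print(iris.feature_names)   # column names for the feature matrix
print(iris.target_names)    # class names, indexed by the integer target
```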

In [2]:
# Exploratory Data Analysis
import pandas as pd

df = pd.DataFrame(iris.data, columns=iris.feature_names)

# With most machine learning problems, we will have a target variable that we are trying to predict.
# We use features to predict the target variable.

# In this case, the target variable is the species of the iris flower.
# We can add the target variable to the dataframe.
df['species'] = iris.target

# The target variable is a number, but we can convert it to the actual species name.
df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Let's take a look at the first 5 rows of the dataframe.
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
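
Looking at the first rows is only a starting point for exploratory data analysis. A minimal sketch of a few further checks follows: summary statistics, class balance, and per-species feature means. The variable names `summary`, `class_counts`, and `species_means` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the same dataframe as in the cell above so this block runs on its own.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Count, mean, spread, and quartiles for every numeric feature.
summary = df.describe()
print(summary)

# Number of samples per class; shows whether the classes are balanced.
class_counts = df['species'].value_counts()
print(class_counts)

# Mean of each feature per species; large gaps hint at a discriminative feature.
species_means = df.groupby('species').mean()
print(species_means)
```

Features whose per-species means are far apart relative to their spread, such as the petal measurements, are the ones a classifier is likely to lean on.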


In [3]:
# Data Preprocessing
# We need to separate the data into a feature matrix (X) and a target vector (y).
X = df.drop('species', axis=1)
y = df['species']

# We also need to split the data into training and testing sets.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
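
The split above is purely random, so the class proportions in the test set can drift from those in the full dataset (the confusion matrices further down show 19, 13, and 13 test samples per class). One possible alternative, shown here as a sketch and not as what this notebook uses, is to pass `stratify=y` so both subsets keep the original class balance. It loads the data with `return_X_y=True` for brevity, which yields integer labels instead of species names.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks for the same class proportions in the train and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Count how many test samples fall into each class.
test_counts = Counter(y_test.tolist())
print(test_counts)
```

Using this variant would change the test set, so the accuracy and confusion matrix values reported later in the notebook would no longer match exactly.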

In [4]:
# Model Training
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

model.fit(X_train, y_train)
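
Gaussian Naive Bayes models each feature within each class as a normal distribution, so fitting amounts to estimating class priors plus a per-class mean and variance for every feature. The sketch below inspects some of the fitted attributes; it refits a model on the same split so it can run on its own, using integer labels from `return_X_y=True` for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = GaussianNB().fit(X_train, y_train)

# Class labels seen during training, in the order used by the other attributes.
print(model.classes_)

# Prior probability of each class, estimated from training set frequencies.
print(model.class_prior_)

# Per-class mean of each feature: one row per class, one column per feature.
print(model.theta_)
```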

In [5]:
# Model Evaluation
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# We can also use a confusion matrix to see which classes the model confuses.
# Rows are the true classes and columns are the predicted classes.
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

Accuracy: 0.9777777777777777


array([[19,  0,  0],
       [ 0, 12,  1],
       [ 0,  0, 13]])
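
Accuracy is a single number, but the confusion matrix also yields per-class precision and recall. As a worked example, the sketch below derives them with NumPy from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[19, 0, 0],
               [0, 12, 1],
               [0, 0, 13]])

# Correct predictions sit on the diagonal.
true_positives = np.diag(cm)

# Recall: share of each true class that was predicted correctly (row sums).
recall = true_positives / cm.sum(axis=1)

# Precision: share of each predicted class that was actually correct (column sums).
precision = true_positives / cm.sum(axis=0)

# Overall accuracy: correct predictions over all predictions.
accuracy = true_positives.sum() / cm.sum()

print('Recall:   ', recall)
print('Precision:', precision)
print('Accuracy: ', accuracy)
```

The single misclassification is a versicolor sample predicted as virginica, which lowers recall for versicolor (12 of 13) and precision for virginica (13 of 14). `sklearn.metrics.classification_report(y_test, y_pred)` reports the same quantities directly from the predictions.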

In [6]:
# Model Refinement
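# We can try a different model and compare its test accuracy with the previous one.
# RandomForestClassifier is randomized; without a fixed random_state, results can vary slightly between runs.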
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')

confusion_matrix(y_test, y_pred)

Accuracy: 1.0


array([[19,  0,  0],
       [ 0, 13,  0],
       [ 0,  0, 13]])
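
With only 45 test samples, the gap between the two models comes down to a single prediction, so one train/test split is weak evidence that one model is better. A more robust comparison, sketched here as one possible approach and not part of the original notebook, is k-fold cross-validation. The choices of 5 folds and `random_state=42` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Shuffled, stratified folds so every fold contains all three species.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    'GaussianNB': GaussianNB(),
    'RandomForest': RandomForestClassifier(random_state=42),
}

mean_scores = {}
for name, estimator in candidates.items():
    # Each model is trained and scored once per fold; the default score is accuracy.
    scores = cross_val_score(estimator, X, y, cv=cv)
    mean_scores[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

If the mean scores differ by less than their spread across folds, the data does not clearly favor either model.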