<a href="https://colab.research.google.com/github/hniu-tamu/ds_energy/blob/main/intro_to_ml/sklearn_iris_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning with scikit-learn
Haoyu Niu, Texas A&M University

Updated on Feb 20, 2025

We will explore the Iris data set again, this time with scikit-learn, which ships with a clean copy of the data.
<img src="https://github.com/jtao/dswebinar/blob/master/sklearn/session2/images/petal_sepal.jpg?raw=1" />

### Import modules, load the Iris data set, and prepare the data set for training

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets

In [None]:
# load the data set
iris = datasets.load_iris()
n_samples, n_features = iris.data.shape
print(iris.keys())
print((n_samples, n_features))
print(iris.data.shape)
print(iris.target.shape)
print(iris.target_names)
print(iris.feature_names)
X, y = iris.data, iris.target

In [None]:
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])

In [None]:
df.head()

In [None]:
df.info()

In [None]:
#df = df.drop(['target'], axis=1)
sns.pairplot(df, kind="scatter");

In [None]:
df.corr()

In [None]:
sns.heatmap(df.corr(), annot=True, fmt=".3f");

In [None]:
# Split the data into training (70%) and test (30%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
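
With only 150 samples, a plain random split does not have to keep the three classes equally represented in each part. The sketch below compares the per-class counts of the split used here with an optional stratified variant. The `stratify=y` argument and the `*_plain` / `*_strat` names are assumptions added for illustration; the rest of this notebook keeps the plain split.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Plain random split, with the same arguments as in this notebook.
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Optional variant (an assumption, not used elsewhere in this notebook):
# stratify on the labels so each class keeps its share in both parts.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# np.bincount counts how many samples fall into each class label 0, 1, 2.
print("plain      train:", np.bincount(y_train_plain), "test:", np.bincount(y_test_plain))
print("stratified train:", np.bincount(y_train_strat), "test:", np.bincount(y_test_strat))
```

If the plain counts turn out noticeably uneven, the stratified split is one way to make the test accuracy less dependent on the random seed.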

In [None]:
# Fit the scaler on the training data only, then apply it to both splits
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))
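
As a sanity check, the standardization step can be reproduced by hand: each feature is centered with the training-set mean and divided by the training-set standard deviation, and the same statistics are reused for the test set. The sketch below is self-contained; `mu`, `sigma`, and `manual_test_std` are names introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

sc = StandardScaler().fit(X_train)

# Per-feature statistics computed from the training data only.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the training statistics to the test data by hand.
manual_test_std = (X_test - mu) / sigma

# Compare the manual result with the scaler's output.
print(np.allclose(manual_test_std, sc.transform(X_test)))

# The scaled training data has (numerically) zero mean and unit variance;
# the scaled test data generally does not, since it reuses training statistics.
print(sc.transform(X_train).mean(axis=0).round(6))
print(sc.transform(X_test).mean(axis=0).round(3))
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.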

### Classification - Perceptron
The Perceptron is a simple linear classification algorithm suitable for large-scale learning. It is a binary classifier at heart; scikit-learn extends it to the three Iris classes with a one-vs-rest scheme. By default:

- It does not require a learning rate.
- It is not regularized (penalized).
- It updates its model only on mistakes.

The last characteristic implies that the Perceptron is slightly faster to train than SGD with the hinge loss and that the resulting models are sparser.
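
To make the "updates only on mistakes" behavior concrete, here is a minimal NumPy sketch of the classic binary perceptron rule, trained on two of the three Iris classes (setosa vs. versicolor), which are linearly separable. It is an illustration under stated assumptions, not scikit-learn's implementation: the function `train_perceptron`, its parameters, and the -1/+1 label encoding are all introduced here.

```python
import numpy as np
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

# Keep only setosa (label 0) and versicolor (label 1), encoded as -1 / +1.
mask = y < 2
X2 = X[mask]
y2 = np.where(y[mask] == 0, -1, 1)

# Standardize the features so they share a common scale.
X2 = (X2 - X2.mean(axis=0)) / X2.std(axis=0)

def train_perceptron(X, y, eta=0.1, n_epochs=200, seed=0):
    """Hypothetical helper: classic perceptron rule for labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    errors_per_epoch = []
    for _ in range(n_epochs):
        errors = 0
        for i in rng.permutation(len(y)):
            # A sample is a mistake when it lies on the wrong side of
            # (or exactly on) the current decision boundary.
            if y[i] * (X[i] @ w + b) <= 0:
                # Only mistakes change the model.
                w += eta * y[i] * X[i]
                b += eta * y[i]
                errors += 1
        errors_per_epoch.append(errors)
        if errors == 0:  # a full pass without mistakes: stop early
            break
    return w, b, errors_per_epoch

w, b, errors_per_epoch = train_perceptron(X2, y2)
train_pred = np.where(X2 @ w + b > 0, 1, -1)
print("mistakes per epoch:", errors_per_epoch)
print("training accuracy:", (train_pred == y2).mean())
```

Correctly classified samples leave `w` and `b` untouched, which is why training gets cheaper as the model improves.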

In [None]:
from sklearn.linear_model import Perceptron
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)
ppn.fit(X_train_std, y_train)
y_pred = ppn.predict(X_test_std)
print('Misclassified samples: %d' % (y_test != y_pred).sum())

In [None]:
from sklearn.metrics import accuracy_score
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
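
The misclassification count and the accuracy are two views of the same quantity: accuracy is the fraction of test samples predicted correctly, that is, one minus the number of mistakes divided by the test-set size. The self-contained sketch below checks this relation and also prints a confusion matrix to show which classes are confused with each other. The names `n_wrong`, `acc`, and `cm` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
sc = StandardScaler().fit(X_train)

# Same model settings as in the Perceptron cell of this notebook.
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)
ppn.fit(sc.transform(X_train), y_train)
y_pred = ppn.predict(sc.transform(X_test))

n_wrong = int((y_test != y_pred).sum())
acc = accuracy_score(y_test, y_pred)

# Accuracy equals 1 - (mistakes / number of test samples).
print(acc, 1 - n_wrong / len(y_test))

# Rows are true classes, columns are predicted classes;
# off-diagonal entries are the misclassified samples.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```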

### Classification - Support vector classifier

In [None]:
from sklearn.svm import SVC

svm = SVC(kernel='linear')
svm.fit(X_train_std, y_train)

print('The accuracy of the svm classifier on training data is {:.2f} out of 1'.format(svm.score(X_train_std, y_train)))

print('The accuracy of the svm classifier on test data is {:.2f} out of 1'.format(svm.score(X_test_std, y_test)))
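
A fitted support vector classifier is determined by a subset of the training points, the support vectors, which lie on or inside the margin. As a rough sketch, the fitted attributes `n_support_` and `support_vectors_` of `SVC` show how many training samples the linear model actually relies on. The snippet repeats the data preparation so that it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)

svm = SVC(kernel='linear')
svm.fit(X_train_std, y_train)

# Number of support vectors for each class, in the order of svm.classes_.
print("support vectors per class:", svm.n_support_)

# The support vectors themselves: one row per vector, one column per feature.
print("support vector array shape:", svm.support_vectors_.shape)

# Fraction of the training set that ends up as support vectors.
print("fraction of training set: %.2f" % (svm.n_support_.sum() / len(y_train)))
```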

### Classification - K-nearest neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)

print('The accuracy of the knn classifier is {:.2f} out of 1 on training data'.format(knn.score(X_train_std, y_train)))
print('The accuracy of the knn classifier is {:.2f} out of 1 on test data'.format(knn.score(X_test_std, y_test)))
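
With `p=2` and `metric='minkowski'`, the classifier uses the ordinary Euclidean distance, and each test sample is assigned the majority class among its 5 nearest training samples. The sketch below spells this out in NumPy and compares the result with scikit-learn. The helper `knn_predict` is hypothetical and written only for illustration; its handling of ties may differ from scikit-learn's, so the two sets of predictions need not match on every sample.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

def knn_predict(X_train, y_train, X_query, k=5):
    """Hypothetical helper: majority vote among the k nearest training points."""
    n_classes = int(y_train.max()) + 1
    # Pairwise Euclidean distances, shape (n_query, n_train).
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each query point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Labels of those neighbors, shape (n_query, k).
    votes = y_train[nearest]
    # Most frequent label per row; ties go to the smallest label.
    return np.array([np.bincount(row, minlength=n_classes).argmax()
                     for row in votes])

manual_pred = knn_predict(X_train_std, y_train, X_test_std, k=5)

knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)
sk_pred = knn.predict(X_test_std)

print("manual test accuracy: %.2f" % (manual_pred == y_test).mean())
print("agreement with scikit-learn: %.2f" % (manual_pred == sk_pred).mean())
```

Because there is no training step beyond storing the data, all of the cost of k-nearest neighbors is paid at prediction time.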