# **Section 8: Machine learning basics**
<a href="https://colab.research.google.com/github/osuranyi/UdemyCourses/blob/main/NumpyStack/Section8_MachineLearningBasics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We use the scikit-learn package to learn the basic concepts of machine learning.

In [1]:
import numpy as np

## **36. Classification in code**

Here, we will classify tumours as malignant or benign.
First, we load a built-in dataset from sklearn:

In [2]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
type(data)

sklearn.utils.Bunch

This is a Bunch object, which works like a dictionary. Let's check its keys, shape, etc.:

In [3]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

These can be accessed as attributes:

In [4]:
data.data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [5]:
data.data.shape

(569, 30)

In [6]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [7]:
data.target.shape

(569,)

In [8]:
data.target_names

array(['malignant', 'benign'], dtype='<U9')

The targets correspond to the 'malignant' and 'benign' labels: following the order of target_names, 0 means malignant and 1 means benign.
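
To make the link between the integer targets and their names concrete, here is a minimal sketch that counts the samples per class and converts the first few targets into string labels. The variable names `counts` and `first_labels` are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Count how many samples fall into each integer class (0 and 1).
counts = np.bincount(data.target)
for name, count in zip(data.target_names, counts):
    print(f"{name}: {count} samples")

# Indexing target_names with the integer targets yields string labels.
first_labels = data.target_names[data.target[:5]]
print(first_labels)
```

Checking the class counts like this is a quick way to see whether the dataset is balanced before interpreting an accuracy score.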

We can also check what the features represent:

In [9]:
data.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

Normally, we want to divide our data into train and test sets:

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.33)
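
The split is random, so the scores reported in this notebook will change slightly from run to run. As a sketch of how one might make it reproducible, the variant below passes `random_state` (the value 42 is an arbitrary choice, not something used in this notebook) and `stratify`, which keeps the class proportions similar in both sets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# random_state fixes the shuffle; stratify preserves the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target,
    test_size=0.33, random_state=42, stratify=data.target,
)

print(X_train.shape, X_test.shape)
# Fraction of benign samples (label 1) in each subset.
print(y_train.mean(), y_test.mean())
```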

Import and instantiate the random forest classifier:

In [11]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

To train the model, we call its fit method:

In [12]:
model.fit(X_train, y_train)

RandomForestClassifier()

Let's check the train and test scores (accuracy of classification):

In [13]:
model.score(X_train,y_train)

1.0

In [14]:
model.score(X_test,y_test)

0.9521276595744681

How do we predict the output for new data? We pass the new samples to the predict method:

In [15]:
predictions = model.predict(X_test)
predictions

array([1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1])
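
When the new data is a single sample rather than a whole array, it has to be reshaped first, because `predict` expects a 2D input of shape (n_samples, n_features). The following is a minimal sketch under that assumption; it retrains a model with a fixed `random_state` (an arbitrary choice made here) so that it is self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# A single sample has shape (30,); reshape it to (1, 30) before predicting.
new_sample = X_test[0].reshape(1, -1)
predicted_class = model.predict(new_sample)
predicted_name = data.target_names[predicted_class[0]]

# predict_proba returns one probability per class for each sample.
class_probabilities = model.predict_proba(new_sample)
print(predicted_name, class_probabilities)
```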

Checking accuracy manually:

In [16]:
np.sum(predictions == y_test) / len(y_test)

0.9521276595744681
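
The manual check works because comparing two arrays elementwise gives a boolean array, and the fraction of True values is the accuracy. A small sketch with made-up labels (not taken from the dataset) shows this, along with the four confusion counts that accuracy summarizes:

```python
import numpy as np

# Toy labels, invented for illustration only.
y_true = np.array([0, 1, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1])

# Accuracy is the mean of the elementwise comparison.
accuracy = np.mean(y_true == y_pred)

# Break the predictions down into the four confusion counts.
true_pos = np.sum((y_true == 1) & (y_pred == 1))
true_neg = np.sum((y_true == 0) & (y_pred == 0))
false_pos = np.sum((y_true == 0) & (y_pred == 1))
false_neg = np.sum((y_true == 1) & (y_pred == 0))

print(accuracy)
print(true_pos, true_neg, false_pos, false_neg)
```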

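The same classification can be performed using a neural network (scikit-learn's MLPClassifier, a multilayer perceptron). Neural networks are sensitive to the scale of their inputs, so we standardize the features first: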

In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train2 = scaler.fit_transform(X_train)
X_test2 = scaler.transform(X_test)
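
Note that the scaler is fitted on the training set only, and the test set is transformed with the same statistics. The sketch below, using a small made-up array, illustrates what StandardScaler computes: each feature has the training mean subtracted and is divided by the training standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only.
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[4.0, 100.0]])

scaler = StandardScaler().fit(X_train)
scaled_test = scaler.transform(X_test)

# The same transformation written out by hand, using training statistics.
manual_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(scaled_test)
print(manual_test)
```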

Loading and fitting the model:

In [18]:
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(max_iter=500)
model.fit(X_train2,y_train)

MLPClassifier(max_iter=500)

And evaluate this model on the train and test sets as well:

In [19]:
model.score(X_train2,y_train)

0.9921259842519685

In [20]:
model.score(X_test2,y_test)

0.9680851063829787
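
Keeping separate scaled copies of the data is easy to get wrong. One possible alternative, sketched here and not used elsewhere in this notebook, is to chain the scaler and the classifier with `make_pipeline`, so that scaling is applied automatically in both `fit` and `score`. The `random_state` values are arbitrary choices made for reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=0)

# The pipeline fits the scaler on the training data, then the classifier.
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(max_iter=500, random_state=0),
)
pipeline.fit(X_train, y_train)

# score() scales X_test with the training statistics before predicting.
test_accuracy = pipeline.score(X_test, y_test)
print(test_accuracy)
```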

## **9. Regression in code**

We will predict how "loud" different airfoils (roughly, plane wings) are.
First, get the dataset:

In [25]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat

--2021-12-29 22:58:54--  https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 59984 (59K) [application/x-httpd-php]
Saving to: ‘airfoil_self_noise.dat’


2021-12-29 22:58:54 (458 KB/s) - ‘airfoil_self_noise.dat’ saved [59984/59984]



Import pandas and load the dataset:

In [30]:
import pandas as pd

df = pd.read_csv('airfoil_self_noise.dat',sep='\t', header=None)

print(df.head())
print(df.tail())

      0    1       2     3         4        5
0   800  0.0  0.3048  71.3  0.002663  126.201
1  1000  0.0  0.3048  71.3  0.002663  125.201
2  1250  0.0  0.3048  71.3  0.002663  125.951
3  1600  0.0  0.3048  71.3  0.002663  127.591
4  2000  0.0  0.3048  71.3  0.002663  127.461
         0     1       2     3         4        5
1498  2500  15.6  0.1016  39.6  0.052849  110.264
1499  3150  15.6  0.1016  39.6  0.052849  109.254
1500  4000  15.6  0.1016  39.6  0.052849  106.604
1501  5000  15.6  0.1016  39.6  0.052849  106.224
1502  6300  15.6  0.1016  39.6  0.052849  104.204


The first 5 columns (0 to 4) are the features, and the last column (5) is the target. Extracting these into arrays:

In [39]:
data = df[[0,1,2,3,4]].values
target = df[5].values
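
Integer column labels are hard to read. As a sketch, the columns could be given descriptive names when loading. The names below are hypothetical and assume the column order given in the UCI dataset description (frequency, angle of attack, chord length, free-stream velocity, suction side displacement thickness, scaled sound pressure level). To stay self-contained, the example parses the first three rows as printed (rounded) by `df.head()` instead of reading the downloaded file.

```python
import io
import pandas as pd

# Hypothetical column names, assumed from the UCI dataset description.
columns = [
    "frequency_hz", "angle_of_attack_deg", "chord_length_m",
    "free_stream_velocity_m_s", "displacement_thickness_m",
    "sound_pressure_level_db",
]

# First three rows as printed by df.head(), tab-separated like the file.
sample = (
    "800\t0.0\t0.3048\t71.3\t0.002663\t126.201\n"
    "1000\t0.0\t0.3048\t71.3\t0.002663\t125.201\n"
    "1250\t0.0\t0.3048\t71.3\t0.002663\t125.951\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=columns)

# All columns except the last are features; the last is the target.
X = df[columns[:-1]].values
y = df[columns[-1]].values
print(X.shape, y.shape)
```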

Doing train/test split:

In [40]:
X_train, X_test, y_train, y_test = train_test_split(data,target,test_size=0.33)

Import linear regression, fit and score:

In [43]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train,y_train)

print(model.score(X_train,y_train))
print(model.score(X_test,y_test))

0.5046327200590066
0.5344287925845795


The score here is not accuracy, but $R^2$.
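
For reference, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the predictions and $\bar{y}$ is the mean of the true targets. A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean. A minimal sketch with made-up numbers (not from the airfoil data):

```python
import numpy as np

# Toy targets and predictions, invented for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: error of the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
print(r2)
```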

Predicting output for new data:

In [44]:
model.predict(X_test)

array([121.71523272, 130.10711746, 128.34666992, 125.95245461,
       105.24984649, 132.69285542, 122.08409841, 133.53184442,
       130.97591605, 123.00240151, 131.79116726, 131.98386507,
       123.66073821, 116.13585062, 126.28959866, 128.99164672,
       125.1717576 , 114.88165479, 122.83014737, 114.31572531,
       121.55561264, 125.55302937, 123.73449858, 124.59183102,
       127.26833872, 127.75533014, 126.69398351, 124.49610559,
       123.65679387, 120.44940012, 126.64954663, 130.81282821,
       110.72536784, 120.53272055, 128.46776931, 123.80286858,
       119.9437145 , 126.42878194, 130.1166087 , 122.11341205,
       125.61527799, 125.94161278, 113.30438163, 130.15449151,
       125.96142099, 130.38719848, 129.69874284, 131.9027924 ,
       127.81297076, 126.90276581, 132.02969235, 125.97246241,
       128.43814944, 121.91872462, 120.76511779, 124.06846816,
       125.57075318, 124.41563571, 122.12335853, 126.52544873,
       128.01446507, 122.16619377, 126.59356968, 115.19

Finally, another algorithm: random forest regression.

In [45]:
from sklearn.ensemble import RandomForestRegressor

model2 = RandomForestRegressor()
model2.fit(X_train,y_train)

print(model2.score(X_train,y_train))
print(model2.score(X_test,y_test))

0.9895007165376153
0.9109300625046516
