***

# Machine Learning & Statistics Assessment - Scikit-Learn

***

Machine learning uses a variety of algorithms that learn from data. By fitting models to training data, we can build increasingly precise models that characterise the data, improve their output and predict outcomes.
The power of machine learning is that models can keep learning from new data and help predict future outcomes. “This powerful set of algorithms and models are being used across industries to improve processes and gain insights into patterns and anomalies within data” [2].
<br></br>
The open source Python package scikit-learn enables us to implement these algorithms. The aim of this notebook is to provide a clear and concise overview of scikit-learn. This will be achieved by researching the package and demonstrating its functionality through three algorithms of choice.

[Link to Scikit-Learn Documentation](https://scikit-learn.org/stable/tutorial/index.html)

### What is Scikit-Learn?

Scikit-learn is a Python library that provides many supervised and unsupervised learning algorithms. It was started by David Cournapeau in 2007 as a Google Summer of Code project. It is built upon technology you might already be familiar with, such as NumPy, SciPy and Matplotlib, and it works well alongside pandas [1]. Scikit-learn efficiently implements many machine learning algorithms [3]. The library is focused on modeling data and aims to bring machine learning to non-specialists, placing importance on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings [4].

#### Set Up

In [24]:
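# Importing libraries
import sklearn.datasets as sk  # bundled example datasets
import numpy as np  # numerical arrays
import pandas as pd  # tabular data (DataFrames)
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plots
import sklearn.model_selection as mod  # train/test splitting utilities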

### Supervised Learning

A supervised machine learning algorithm (as opposed to an unsupervised machine learning algorithm) is one that relies on labeled input data to learn a function that produces an appropriate output when given new unlabeled data.
Supervised machine learning algorithms are used to solve classification or regression problems.
<br></br>

Some of the most commonly used supervised learning algorithms are:

- Linear Regression
- Logistic Regression 
- K Nearest Neighbour
- Neural Networks
- Naive Bayes
- Random Forest
- Support Vector Machines

Classification and regression problems differ in the kind of output they produce.

A *classification problem* has a discrete value as its output, i.e. there is no middle ground: it’s a yes or no, true or false.

A *regression problem* has a real number (float) as its output.  
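
To make the distinction concrete, here is a minimal sketch that fits one model of each kind to a small made-up dataset. The toy arrays and the choice of `LogisticRegression` and `LinearRegression` are assumptions for illustration only; they are not part of this notebook's analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): one feature, six samples
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])             # discrete labels
y_reg = np.array([1.5, 3.1, 4.4, 6.2, 7.4, 9.1])   # real-valued targets

clf = LogisticRegression().fit(X, y_class)  # classification model
reg = LinearRegression().fit(X, y_reg)      # regression model

X_new = np.array([[2.5], [5.5]])
class_pred = clf.predict(X_new)  # each prediction is one of the known labels
reg_pred = reg.predict(X_new)    # each prediction is a float

print("Classification output:", class_pred)
print("Regression output:", reg_pred)
```

The classifier can only ever return one of the labels it saw during training, while the regressor can return any real number.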


### Unsupervised Learning

An unsupervised machine learning algorithm tries to learn the underlying structure of the data to give more insight into it. Unlike supervised learning, which tries to learn a function that allows us to make predictions given new unlabeled data, unsupervised learning makes use of input data without any labels. From this data, the algorithm discovers patterns which help solve clustering or association problems. This is particularly useful when subject matter experts are unsure of common properties within a data set.
<br></br>

Unsupervised machine learning mainly deals with finding patterns in a collection of uncategorized data. Clustering algorithms process the data and find natural clusters (groups) if they exist.
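
As a rough illustration of clustering, the sketch below runs scikit-learn's `KMeans` on a handful of made-up points. The data, the choice of K-Means, and the parameter values (`n_clusters=2`, `n_init=10`, `random_state=0`) are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up, unlabeled points forming two obvious groups
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # group near (1, 1)
    [8.0, 8.2], [8.1, 7.9], [7.8, 8.0],   # group near (8, 8)
])

# No labels are passed in; the algorithm groups the points by itself
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster assigned to each point:", labels)
print("Cluster centres:\n", kmeans.cluster_centers_)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a cluster.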

### K Nearest Neighbour

[K Nearest Neighbour documentation](https://scikit-learn.org/stable/modules/neighbors.html)

K Nearest Neighbour (KNN) works on the principle that data points lying close to each other tend to belong to the same class. In other words, it classifies a new data point based on similarity: the new point is assigned the most common class among its k closest training points. KNN is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.
KNN is typically used for recommendation engines and image recognition.
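
The idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions (Euclidean distance, a plain majority vote, made-up data), not scikit-learn's actual implementation; `knn_predict` is a hypothetical helper written for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up training data: two well-separated classes
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.5], [7.0, 6.0], [6.5, 7.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.8, 1.6]))
pred_b = knn_predict(X_train, y_train, np.array([6.4, 6.4]))
print(pred_a, pred_b)
```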

This section uses the Breast Cancer Wisconsin (Diagnostic) dataset, applying machine learning to diagnose cancer scans as benign (does not spread to the rest of the body) or malignant (spreads to the rest of the body).


#### Setup

In [2]:
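# Importing libraries
from sklearn.model_selection import train_test_split  # splits data into training and test sets
from sklearn.neighbors import KNeighborsClassifier  # the KNN classifier
import sklearn.neighbors as knn  # the neighbors module under a short alias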
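
For orientation, the following self-contained sketch shows how these imports are typically combined: split the data, fit the classifier on the training portion, and score it on the held-out portion. The parameter values (`test_size=0.25`, `random_state=42`, `stratify=y`, `n_neighbors=5`) are assumptions picked for the example rather than tuned choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load features (X) and labels (y) directly as arrays
X, y = load_breast_cancer(return_X_y=True)

# Hold out a quarter of the samples for testing (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit KNN on the training data only
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Mean accuracy on data the model has not seen
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

KNN relies on distances, so in practice the features are often scaled first; that step is left out here to keep the sketch short.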

#### Data

In [13]:
# Loading the data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

In [14]:
print(cancer.DESCR) 

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [11]:
type(cancer)

sklearn.utils.Bunch

These load functions (such as load_breast_cancer()) don’t return data in the tabular format we may expect. They return a Bunch object, which is a container object for datasets. It behaves like a dictionary that stores data as keys and values, and its keys can also be accessed as attributes (for example, cancer.data).

https://stackoverflow.com/questions/38105539/how-to-convert-a-scikit-learn-dataset-to-a-pandas-dataset/46379878#46379878
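
A short sketch of the two access styles a Bunch supports, dictionary-style keys and attributes. It reloads the dataset so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Dictionary-style and attribute-style access return the same object
features_by_key = cancer["data"]
features_by_attr = cancer.data
same_object = features_by_key is features_by_attr

print(same_object)
print(cancer.data.shape)      # (569, 30): 569 samples, 30 features
print(cancer.target_names)    # the class names behind the 0/1 labels
```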

In [15]:
print(cancer.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])


<br></br>
data holds all the features of the dataset (the attributes that help identify whether the tumor is malignant or benign).
target holds the target data (the variable you want to predict, i.e. whether the tumor is malignant or benign).

The remaining keys serve a descriptive purpose. It’s important to note that all of Scikit-Learn’s bundled datasets are divided into data and target. data represents the features, which are the variables that help the model learn how to predict. target includes the actual labels. In our case, the target is a single column that classifies each tumor as either 0 (malignant) or 1 (benign).
https://towardsdatascience.com/how-to-use-scikit-learn-datasets-for-machine-learning-d6493b38eca3
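
The label encoding can be checked directly by pairing `target_names` with a count of each label. This is a small sketch that reloads the dataset so it is self-contained.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Count how many samples carry each integer label (0 and 1)
counts = np.bincount(cancer.target)

# target_names[i] is the class name for label i
for label, (name, count) in enumerate(zip(cancer.target_names, counts)):
    print(f"{label} = {name}: {count} samples")
```

The position of each name in `target_names` matches its integer label, which is how 0 maps to malignant and 1 to benign.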

In [16]:
# Create the DataFrame, first using the feature data
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

In [17]:
# Add a target column, and fill it with the target data
df['target'] = cancer.target
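
As an alternative, the `frame` key listed above suggests a scikit-learn version that supports the `as_frame=True` option, which returns the features and target already combined in a pandas DataFrame. A minimal sketch, assuming scikit-learn 0.23 or later (the names `cancer_frame` and `df_alt` are made up for this example):

```python
from sklearn.datasets import load_breast_cancer

# Ask scikit-learn to build the DataFrame for us
cancer_frame = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a 'target' column
df_alt = cancer_frame.frame

print(df_alt.shape)
print(df_alt.columns[-1])
```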

In [19]:
# Having a look at the data
df

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [21]:
# Summary
df.describe()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

- This DataFrame consists of 569 rows and 31 columns.
- There are no missing (Null/NaN) values.
- All 30 feature columns are of float type; the target column holds integer class labels (0 or 1).
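
These observations can also be confirmed programmatically instead of by reading the output. The sketch below rebuilds the same DataFrame so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the DataFrame exactly as above
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

n_rows, n_cols = df.shape
n_missing = int(df.isnull().sum().sum())   # total missing values across all columns
dtype_counts = df.dtypes.value_counts()    # number of columns per dtype

print(f"Rows: {n_rows}, columns: {n_cols}")
print(f"Missing values: {n_missing}")
print(dtype_counts)
```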

## References

[1] https://www.codecademy.com/articles/scikit-learn
[2] https://www.ibm.com/downloads/cas/GB8ZMQZ3#:~:text=Machine%20learning%20is%20a%20form,describe%20data%2C%20and%20predict%20outcomes.
[3] https://books.google.ie/books?id=HnetDwAAQBAJ&printsec=frontcover&dq=what+is+scikit+learn&hl=en&sa=X&ved=2ahUKEwjK0I3TzcXzAhXBoVwKHYM4CHEQ6AF6BAgKEAI#v=onepage&q=what%20is%20scikit%20learn&f=false
[4] https://jmlr.org/papers/v12/pedregosa11a.html


## End