#### Python for Machine Learning
- 1. NumPy
- 2. SciPy
- 3. Matplotlib
- 4. pandas
- 5. scikit-learn

##### Scikit-learn

Scikit-learn, also known as sklearn, is a Python library for building machine learning and statistical models.

- Interoperates with NumPy and SciPy
- Well documented
- Easy to use

It covers most of the machine learning workflow: pre-processing of data, feature selection, feature extraction, train/test splitting, defining the algorithms, fitting models, tuning parameters, prediction, evaluation, and exporting the model.
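
The workflow steps listed above can be sketched end to end as follows. This is a minimal sketch, assuming scikit-learn's bundled breast cancer dataset (`load_breast_cancer`) stands in for real data. The default `SVC` settings and the fixed `random_state` are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for your own data.
X, y = load_breast_cancer(return_X_y=True)

# Train/test split (random_state is fixed only for reproducibility).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pre-processing and algorithm combined into one estimator.
# The scaler is fitted on the training data only.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# Prediction and evaluation on the held-out test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
```

Wrapping the scaler and the classifier in a pipeline keeps the test set out of the scaling statistics, which is one common way to avoid leaking information from test data into training.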

In [None]:
# pip install scikit-learn
# The preprocessing module in scikit-learn provides functions to preprocess and
# transform data before it is used to train machine learning models.
# Preprocessing is an essential step in the machine learning pipeline: it
# prepares the data and often improves model performance.
from sklearn import preprocessing
# X is assumed to be a numeric feature matrix of shape (n_samples, n_features).
X = preprocessing.StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
from sklearn import svm
clf = svm.SVC(gamma=0.01, C=90.0) #estimator instance

clf.fit(X_train, y_train)

clf.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix  # use metrics to evaluate model accuracy
y_pred = clf.predict(X_test)  # predictions for the held-out test set
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
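
To make the layout of the confusion matrix concrete, here is a small sketch with made-up labels (the `y_true` and `y_pred` values are hypothetical, not produced by the model above). With `labels=[1, 0]`, rows are true classes and columns are predicted classes, both in the order 1 then 0.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen only to illustrate the layout.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])

# Rows: true class (1, then 0). Columns: predicted class (1, then 0).
tp, fn = cm[0]  # true 1 predicted as 1, true 1 predicted as 0
fp, tn = cm[1]  # true 0 predicted as 1, true 0 predicted as 0
print(cm)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=2 FN=1 FP=1 TN=1
```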

In [None]:
# Save your model
import pickle
s = pickle.dumps(clf)
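
A saved model is only useful if it can be restored. The sketch below round-trips a classifier through `pickle.dumps` and `pickle.loads`. The tiny training set is hypothetical and exists only so the example runs on its own.

```python
import pickle

from sklearn import svm

# Hypothetical toy data so the example is self-contained.
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y = [0, 1, 0, 1]
clf = svm.SVC().fit(X, y)

s = pickle.dumps(clf)           # serialize the fitted model to bytes
clf_restored = pickle.loads(s)  # rebuild the model from those bytes

# The restored model should make the same predictions as the original.
same = bool((clf_restored.predict(X) == clf.predict(X)).all())
print(same)
```

Only unpickle data from a trusted source, and note that a pickled model is not guaranteed to load under a different scikit-learn version.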

##### Supervised vs Unsupervised

In [26]:
import pandas as pd
import numpy as np

cancer_data = pd.read_csv("cancer_data.csv")

In [27]:
cancer_data.sample(5)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
131,8670,M,15.46,19.48,101.7,748.9,0.1092,0.1223,0.1466,0.08087,...,26.0,124.9,1156.0,0.1546,0.2394,0.3791,0.1514,0.2837,0.08019,
284,8912284,B,12.89,15.7,84.08,516.6,0.07818,0.0958,0.1115,0.0339,...,19.69,92.12,595.6,0.09926,0.2317,0.3344,0.1017,0.1999,0.07127,
308,893526,B,13.5,12.71,85.69,566.2,0.07376,0.03614,0.002758,0.004419,...,16.94,95.48,698.7,0.09023,0.05836,0.01379,0.0221,0.2267,0.06192,
382,90250,B,12.05,22.72,78.75,447.8,0.06935,0.1073,0.07943,0.02978,...,28.71,87.36,488.4,0.08799,0.3214,0.2912,0.1092,0.2191,0.09349,
197,877159,M,18.08,21.84,117.4,1024.0,0.07371,0.08642,0.1103,0.05778,...,24.7,129.1,1228.0,0.08822,0.1963,0.2535,0.09181,0.2369,0.06558,


In [28]:
cancer_data.dtypes

id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst     

In [31]:
cancer_data["diagnosis"] = cancer_data["diagnosis"].astype("category")

In [32]:
cancer_data.dtypes

id                            int64
diagnosis                  category
radius_mean                 float64
texture_mean                float64
perimeter_mean              float64
area_mean                   float64
smoothness_mean             float64
compactness_mean            float64
concavity_mean              float64
concave points_mean         float64
symmetry_mean               float64
fractal_dimension_mean      float64
radius_se                   float64
texture_se                  float64
perimeter_se                float64
area_se                     float64
smoothness_se               float64
compactness_se              float64
concavity_se                float64
concave points_se           float64
symmetry_se                 float64
fractal_dimension_se        float64
radius_worst                float64
texture_worst               float64
perimeter_worst             float64
area_worst                  float64
smoothness_worst            float64
compactness_worst           

In [33]:
cancer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   id                       569 non-null    int64   
 1   diagnosis                569 non-null    category
 2   radius_mean              569 non-null    float64 
 3   texture_mean             569 non-null    float64 
 4   perimeter_mean           569 non-null    float64 
 5   area_mean                569 non-null    float64 
 6   smoothness_mean          569 non-null    float64 
 7   compactness_mean         569 non-null    float64 
 8   concavity_mean           569 non-null    float64 
 9   concave points_mean      569 non-null    float64 
 10  symmetry_mean            569 non-null    float64 
 11  fractal_dimension_mean   569 non-null    float64 
 12  radius_se                569 non-null    float64 
 13  texture_se               569 non-null    float64 
 14  perimeter_

In [34]:
cancer_data.columns[2:]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

- Attributes

In a dataset, attributes are usually represented by columns. Each column holds one specific type of information about the observations.
- Features

"Features" is often used interchangeably with "attributes". Features are the input variables (columns) of a dataset that are used to make predictions or classifications.
- Observations

Observations, also known as samples or data points, are the rows of a dataset. Each row corresponds to one specific instance or example.
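
These terms map directly onto a DataFrame. The sketch below uses a hypothetical three-row table that reuses three column names and a few values from the sample output above. The variable names `X` and `y` follow the usual scikit-learn convention.

```python
import pandas as pd

# Hypothetical miniature table; values copied from the sample rows shown earlier.
df = pd.DataFrame(
    {
        "diagnosis": ["M", "B", "B"],
        "radius_mean": [15.46, 12.89, 13.5],
        "texture_mean": [19.48, 15.7, 12.71],
    }
)

# Rows are observations, columns are attributes.
n_observations, n_attributes = df.shape

# Features are the input columns; the label column is the target.
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]

print(n_observations, n_attributes)
print(list(X.columns))
```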

##### Supervised Learning Techniques:

- Classification
- Regression
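
The difference between the two shows up in the target: classification predicts a discrete label, regression predicts a continuous value. A minimal sketch, assuming `LogisticRegression` and `LinearRegression` as example estimators and hypothetical toy data:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical one-feature toy data.
X = [[1.0], [2.0], [3.0], [4.0]]
y_class = [0, 0, 1, 1]          # discrete labels -> classification
y_value = [2.0, 4.0, 6.0, 8.0]  # continuous target -> regression

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_value)

label = clf.predict([[5.0]])[0]  # a class label
value = reg.predict([[5.0]])[0]  # a real number (about 10.0 here, since y = 2x)
print(label, value)
```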

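Supervised vs unsupervised

- Supervised learning trains on labeled data, where each observation has a known target; unsupervised learning works with unlabeled data.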

Dimensionality reduction, density estimation, market basket analysis, and clustering are among the most widely used unsupervised machine learning techniques.
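
Two of these techniques, clustering and dimensionality reduction, can be sketched in a few lines. This is an illustration on synthetic data, assuming `KMeans` and `PCA` as representative estimators. No labels are passed to either one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two synthetic, well-separated blobs in 3 dimensions (no labels).
blob_a = rng.normal(loc=0.0, scale=0.5, size=(20, 3))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(20, 3))
X = np.vstack([blob_a, blob_b])

# Clustering: group the observations into 2 clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project from 3 features down to 2.
X_2d = PCA(n_components=2).fit_transform(X)

print(cluster_ids)
print(X_2d.shape)
```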