In [None]:
%matplotlib inline 

The goal of this lab is to practice the basic features of the main Python libraries for machine learning.
To access the functionality we need, we first import these libraries.

In [1]:
import sklearn
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

### Loading an example dataset


For this introductory exercise, we will use a classic toy dataset, the Iris dataset, which is also built into the scikit-learn library.

The dataset was collected by botanist Edgar Anderson and made famous by Ronald Fisher, one of the most prolific statisticians in history. Anderson carefully measured the anatomical properties of samples of three different species of iris: Iris setosa, Iris versicolor, and Iris virginica.
The dataset can also be downloaded from this [repository](http://archive.ics.uci.edu/ml/machine-learning-databases/iris/).

The dataset contains 150 observations of iris flowers. Each observation records:

- four measurements: sepal length, sepal width, petal length and petal width
- the species: *Iris setosa*, *Iris versicolor* or *Iris virginica*

In [3]:
from sklearn import datasets

### Machine Learning Terminology

Each row is an observation (also known as: sample, example, instance, record).

Each column is a feature (also known as: predictor, attribute, independent variable, regressor, covariate).

#### Data as table

A basic table is a two-dimensional grid of data in which the rows represent individual elements of the dataset and the columns represent quantities related to each of these elements. In general, we will refer to the rows of the matrix as *samples* and to the number of rows as `n_samples`; likewise, the columns of the matrix are the *features* and the number of columns is `n_features`.

Features matrix: this table layout makes clear that the information can be thought of as a two-dimensional numerical array or matrix, called the *features matrix*, with shape `[n_samples, n_features]`. By convention we will usually call it *X*.

Target array: in addition to the features matrix *X*, we also generally work with a label or *target array*, which by convention we will usually call *y*. The target array is usually one-dimensional, with length `n_samples`, and is generally contained in a NumPy array or pandas Series.
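
To make the two shapes concrete, here is a minimal sketch with a made-up table. The numbers and the 0/1 label encoding are hypothetical and are not taken from the Iris data; only the shapes matter.

```python
import numpy as np

# Hypothetical toy measurements (not from the Iris data):
# 3 samples (rows), each described by 2 features (columns).
X = np.array([
    [5.0, 3.5],
    [6.1, 2.9],
    [7.2, 3.0],
])

# One integer label per sample; the 0/1 encoding is arbitrary.
y = np.array([0, 1, 1])

n_samples, n_features = X.shape
print(X.shape)  # features matrix: (n_samples, n_features)
print(y.shape)  # target array: (n_samples,)
```

The first dimension of `X` and the length of `y` must agree, since row `i` of `X` and entry `i` of `y` describe the same sample.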

The loader functions in `sklearn.datasets` return a dictionary-like `Bunch` object holding at least two items: the features matrix, a NumPy array of shape `(n_samples, n_features)` stored under the key `data`, and the target values, a NumPy array of length `n_samples` stored under the key `target`.

Both items are also available as attributes: the features through the `.data` member and the labels or response (dependent) variable through the `.target` member.


The datasets also contain a full description in their `DESCR` attribute, and some contain `feature_names` and `target_names`. See the [dataset descriptions](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris) for details.
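
The dictionary-like behaviour described above can be sketched as follows. The variable name `iris_bunch` is only illustrative; `return_X_y=True` is an optional argument of `load_iris` that returns the two arrays directly instead of a `Bunch`.

```python
from sklearn.datasets import load_iris

iris_bunch = load_iris()

# A Bunch behaves like a dictionary whose keys are also attributes.
print(sorted(iris_bunch.keys()))
same_array = iris_bunch["data"] is iris_bunch.data

# return_X_y=True skips the Bunch and returns (data, target) directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```

Key access and attribute access reach the same underlying array, so either style can be used.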



In [19]:
# save "bunch" object containing iris dataset and its attributes
iris = datasets.load_iris()
type(iris)

sklearn.utils.Bunch

In [5]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [23]:
# print the iris dataset and its shape
print(iris.data)
print(iris.data.shape)


[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

In [24]:
# print the names of the four features
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [None]:
# print the integers representing the species of each observation
print(iris.target)


In [None]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

Each value we are predicting is the response (also known as: target, outcome, label, dependent variable)

Requirements for working with data in scikit-learn:

1) Features and response are separate objects

2) Features and response should be numeric

3) Features and response should be NumPy arrays

4) Features and response should have specific shapes
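
As an illustration of these four requirements, the sketch below starts from a small, hypothetical pandas table with a string-valued response and turns it into numeric NumPy arrays. Using `pd.factorize` for the encoding is one possible choice among several, not the method prescribed by this lab.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table with a string-valued response column.
df = pd.DataFrame({
    "Sepal Length": [5.1, 7.0, 6.3],
    "Sepal Width": [3.5, 3.2, 3.3],
    "Class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Requirements 1 and 3: separate objects, both NumPy arrays.
X = df[["Sepal Length", "Sepal Width"]].to_numpy()

# Requirement 2: encode the string labels as integers.
# pd.factorize numbers the labels in order of first appearance.
y, class_names = pd.factorize(df["Class"])

# Requirement 4: X is 2-D (n_samples, n_features), y is 1-D (n_samples,).
print(X.shape, y.shape)
print(y, list(class_names))
```

Keeping `class_names` around makes it possible to map the integer codes back to the species names later.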

In [None]:
# Check the types of the features and response
print(type(iris.data))
print(type(iris.target))


In [None]:
# Extract the values for features and response (unpacking)
X, y = iris.data, iris.target


In [None]:
# subsetting the first 4 rows of the array (and all columns)
X[:4, :]

In [None]:
# subsetting the first 10 rows and the last 2 columns
X[:10, -2:]


### Reading the data in a pandas dataframe

pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical when working with data in Python. pandas builds on NumPy and provides easy-to-use data structures and data manipulation functions with integrated indexing. The main data structures pandas provides are `Series` and `DataFrame`.
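
The two data structures can be sketched on a tiny, hypothetical table (the values and the variable names `petal_length` and `flowers` are made up for illustration):

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
petal_length = pd.Series([1.4, 4.7, 6.0], name="Petal Length")

# A DataFrame is a two-dimensional table whose columns are Series
# that share one index.
flowers = pd.DataFrame({
    "Petal Length": petal_length,
    "Petal Width": [0.2, 1.4, 2.5],
})

# Selecting a single column gives back a Series.
width_column = flowers["Petal Width"]
print(type(width_column))

# Label-based indexing: row label 1, column "Petal Length".
print(flowers.loc[1, "Petal Length"])

# Boolean filtering keeps only the rows where the condition holds.
long_petals = flowers[flowers["Petal Length"] > 2.0]
print(long_petals.shape)
```

The integrated index is what lets the two columns line up row by row without any extra bookkeeping.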


In [28]:
# The file has no header row and five columns, so all five names are given.
# With only four names pandas uses the first column as the index,
# which shifts every label one column to the right.
iris_df = pd.read_csv("datasets-uci-iris.csv",
                      names=["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Class"])
iris_df.head()


,Sepal Length,Sepal Width,Petal Length,Petal Width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa



After loading the data with pandas, we should inspect its content and summary statistics using the following:

* dataframe.head()  -  returns the first 5 rows of the dataset by default (pass a number for more or fewer)
* dataframe.tail()  -  returns the last 5 rows of the dataset by default
* dataframe.describe() - returns a statistical summary of the numeric columns
* dataframe.sample(5) - returns 5 random rows from the dataset
* dataframe.isnull().sum()  - counts the missing values in each column

In [29]:
# subset the iris data frame on columns 'Sepal Length', 'Sepal Width' and store them in a new dataframe object X_
# .copy() makes X_ independent of iris_df
X_ = iris_df[["Sepal Length", "Sepal Width"]].copy()
X_.head()

,Sepal Length,Sepal Width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6


In [31]:
# change the value for 'Sepal Length' in the first record / data example to a chosen value
# a single .loc call avoids the chained indexing of X_.iloc[...][...]
X_.loc[0, "Sepal Length"] = 100
X_.head()

,Sepal Length,Sepal Width
0,100.0,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6


In [32]:
# check the original df: the edit to X_ did not change it
iris_df.head()

,Sepal Length,Sepal Width,Petal Length,Petal Width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [None]:
# check the df X_
X_.head()

In [33]:
# What's the range of Sepal Length?
min(iris_df["Sepal Length"]), "to", max(iris_df["Sepal Length"])

(4.3, 'to', 7.9)

In [35]:
# get summary statistics of the target variable
iris_df["Class"].describe()

In [37]:
# Checking species
labels = iris_df["Class"].unique()
print(labels)
type(labels)

In [38]:
# get a list with the values of the target variable
labels.tolist()

In [None]:
# filtering by label (species)
# extract a new data frame that contains all the records from species setosa
setosa_df = iris_df[iris_df["Class"] == "Iris-setosa"]
setosa_df.head()


### Plotting the data

In [None]:
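# plot the petal length attribute versus petal width for the first 10 examples
# (petal length on the x-axis, petal width on the y-axis)
plt.scatter(iris.data[:10, 2], iris.data[:10, 3])

# Label the axes
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])

# label the figure
plt.title('Petal length vs petal width (first 10 examples)')
plt.show()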



In [None]:
# introspection; get the doc string for scatter function from the pyplot module
plt.scatter?



In [None]:
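# Start with a plot figure of width 12 units and height 9 units
plt.figure(figsize=(12, 9))

# Create an array of three colours, one for each species.
colors = np.array(['red', 'green', 'blue'])

# Draw a scatter plot for Sepal Length vs Sepal Width
# subplot arguments: nrows=1, ncols=2, index=1
# http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot or plt.subplot?
plt.subplot(1, 2, 1)

# colors[iris.target] picks one colour per observation from its species code
# http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter
plt.scatter(iris.data[:, 0], iris.data[:, 1], c=colors[iris.target])
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('Sepal Length vs Sepal Width')
plt.show()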




More plotting examples at: http://matplotlib.org/examples/index.html

In [None]:
# plot the 'petal length' attribute versus 'petal width' using pandas plot function
iris_df.plot(kind='scatter', x='Petal Length', y='Petal Width')


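In [None]:
# loading data from a .csv file in a pandas dataframe

import pandas as pd
iris_filename = 'datasets-uci-iris.csv'
# the file has no header row, so the column names are supplied here
iris = pd.read_csv(iris_filename, header=None,
                   names=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class'])

### Train test dataset splitting

In [None]:
np.random.seed(42)  # fix the seed so the split is reproducible

# number of observations
n = len(iris)
print(n)
is_train = np.random.rand(n) < 0.7 # Create an array of the given shape and populate it with
                                   # random samples from a uniform distribution, and mask the numbers smaller than 0.7
                                   # the result is a boolean vector
print(sum(is_train))



In [None]:
feature_columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']
X_train = iris.loc[is_train, feature_columns]  # boolean masking
y_train = iris.loc[is_train, 'Class']
X_test = iris.loc[~is_train, feature_columns]  # ~ inverts the mask
y_test = iris.loc[~is_train, 'Class']
print('Size of training features matrix:', X_train.shape)
print('Size of training labels vector:', y_train.shape)
print('Size of test features matrix :', X_test.shape)
print('Size of test labels vector:', y_test.shape)

### Train test dataset splitting using the *model_selection* module from sklearn

In [None]:
from sklearn import model_selection

# Split-out validation dataset
array = iris.values
type(array)
array.shape
X = array[:, 0:4]  # the four measurements
Y = array[:, 4]    # the species label
validation_size = 0.20  # assumed value: hold out 20% of the rows for validation
seed = 7
X_train_2, X_validation_2, Y_train_2, Y_validation_2 = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

In [None]:
X_train_2.shape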