## Overview

In this tutorial, you will build a simple classification model that predicts the species of an iris flower. We will cover the following topics:

1. Introduction to the dataset
2. How to load the dataset using Pandas
3. How to split the dataset into training and validation sets
4. How to train the model in a couple of lines of code

We use **Logistic Regression** to train our model. Despite its name, logistic regression is a classification algorithm, which is why it fits the task of predicting a species. Because this is a beginner tutorial, we do not cover in detail how logistic regression works. Feel free to read more about it here: https://en.wikipedia.org/wiki/Logistic_regression

### Introduction to the iris dataset 

The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

The dataset has four features:
* sepal_length
* sepal_width
* petal_length
* petal_width

Each plant has been measured on those 4 features, and its species has been recorded.
The question is: if you see a new plant in the field, can you predict its species from these measurements?

The data is saved in a .csv file.

<img src="resources/Iris-image.png" width="500">
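The class balance described above can be checked directly. The sketch below is only an illustration: it assumes that the copy of the iris data bundled with scikit-learn (`load_iris`) matches the contents of the .csv file, and the names `iris_demo` and `class_counts` are made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for iris.csv.
bunch = load_iris()
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Build a DataFrame with the four features plus a species column.
iris_demo = pd.DataFrame(bunch.data, columns=columns)
iris_demo["species"] = [str(bunch.target_names[i]) for i in bunch.target]

# Count how many rows belong to each species.
class_counts = iris_demo["species"].value_counts()
print(iris_demo.shape)
print(class_counts)
```

If the two sources agree, the output should show 150 rows and an equal number of rows for each of the three species.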

### Import necessary libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

### Load dataset using Pandas

In [None]:
iris = pd.read_csv('iris.csv')

By default, the **head( )** method returns the first 5 rows of the dataset.

In [None]:
iris.head() # To see more rows, say 20, use iris.head(20)

The **info( )** method lists the data type of each column, the number of non-missing values, and the memory usage.

In [None]:
iris.info() 

In [None]:
iris.isnull().sum() # Count missing values per column; iris.dropna() returns a copy without the rows that contain nulls
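To see what `isnull().sum()` and `dropna()` do when values are actually missing, here is a minimal sketch on a made-up three-row table. The `toy` DataFrame and its values are hypothetical and are not taken from iris.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing measurements.
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3],
    "petal_length": [1.4, 4.5, np.nan],
    "species": ["setosa", "versicolor", "virginica"],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# dropna() returns a new DataFrame; toy itself is left unchanged.
cleaned = toy.dropna()

print(missing_per_column)
print(len(toy), len(cleaned))
```

Only the first row has no missing values, so it is the only row kept in `cleaned`.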

### Split the dataset into training and validation sets

#### Separate data into features (X) and label (y)

In [None]:
X = iris.iloc[:, 0:4] # all rows, and column 0 to column 3
y = iris.iloc[:, 4] # all rows, and column 4
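Selecting by position works as long as the column order never changes. As an alternative sketch, the same split can be written with column names. This assumes the file uses the four feature names listed earlier plus a label column called `species`; that last name is a guess and may differ in your copy of iris.csv.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real dataset.
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "sepal_width": [3.5, 3.2],
    "petal_length": [1.4, 4.7],
    "petal_width": [0.2, 1.4],
    "species": ["setosa", "versicolor"],  # assumed label column name
})

feature_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Select by name ...
X_by_name = toy[feature_columns]
y_by_name = toy["species"]

# ... and by position, as in the cell above.
X_by_position = toy.iloc[:, 0:4]
y_by_position = toy.iloc[:, 4]

print(X_by_name.equals(X_by_position))
print(y_by_name.equals(y_by_position))
```

Both selections return the same data here; the name-based version simply keeps working if columns are reordered.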

#### Check data type

In [None]:
print("X's type: " + str(type(X)))
print("y's type: " + str(type(y)))

#### Split 80% to training, 20% to validation

In [None]:
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
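A plain random split does not guarantee that every species is equally represented in the validation set. As an optional variation (not what this tutorial uses), `train_test_split` accepts a `stratify` argument that preserves the class proportions. The sketch below assumes scikit-learn's bundled iris data as a stand-in for iris.csv, and the variable names are made up for the example.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Count how many validation rows belong to each class.
test_counts = Counter(y_te.tolist())
print(X_tr.shape, X_te.shape)
print(test_counts)
```

With 150 rows and a 20% validation share, this yields 120 training rows and 30 validation rows, with the same number of validation rows for each class.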

#### Check data shape

In [None]:
print("X_train's shape: " + str(X_train.shape))
print("y_train's shape: " + str(y_train.shape))
print("X_test's shape: " + str(X_test.shape))
print("y_test's shape: " + str(y_test.shape))

### Train the model

In [None]:
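logreg = LogisticRegression() # default settings; try a larger max_iter if a convergence warning appears
logreg.fit(X_train, y_train) # learn from the training split
y_pred = logreg.predict(X_test) # predict the species of the held-out rows
print(metrics.accuracy_score(y_test, y_pred)) # fraction of correct predictions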
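Accuracy is a single number, so it can hide which species get confused with each other. The sketch below shows one way to look closer with a confusion matrix, and then predicts the species of a single new flower, which was the question posed in the introduction. It assumes scikit-learn's bundled iris data as a stand-in for iris.csv; `max_iter=200` and the measurements in `new_flower` are illustrative choices, not values from this tutorial.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for iris.csv.
bunch = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    bunch.data, bunch.target, test_size=0.2, random_state=0
)

# max_iter=200 is an illustrative choice that gives the solver more iterations.
model = LogisticRegression(max_iter=200)
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

accuracy = metrics.accuracy_score(y_te, y_hat)

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_te, y_hat, labels=[0, 1, 2])

# Hypothetical measurements: sepal length, sepal width, petal length, petal width.
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted = str(bunch.target_names[model.predict(new_flower)[0]])

print(accuracy)
print(cm)
print(predicted)
```

Off-diagonal entries in the matrix count the validation rows assigned to the wrong species.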

## Reference

* https://archive.ics.uci.edu/ml/datasets/iris
* Fabio Nelli, "Python Data Analytics: With Pandas, NumPy, and Matplotlib", 2nd ed.