# Practical Machine Learning and Deep Learning
## Lesson 1. Basic Machine Learning Application

## Introduction


The goal of this first lesson is to understand the workflow of a typical machine learning task. This includes going through the theory and following a set of instructions to complete the code.

## Task
In this lab, we will perform a basic classification on the Iris dataset.


### About Iris Dataset
The famous Iris dataset is a widely used dataset in machine learning and deep learning. It consists of 150 samples from three species of iris flowers (Iris setosa, Iris versicolor, and Iris virginica), with 50 samples from each species. Each sample includes four features: the lengths and widths of the sepals and petals, measured in centimeters. The dataset is often used for testing algorithms for classification and pattern recognition due to its simplicity and well-defined structure.
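The structure described above can be checked with a minimal sketch. It assumes the copy of Iris bundled with scikit-learn (`load_iris`), which is not necessarily identical to the competition's `train.csv`; column names and row order in the competition files may differ.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris copy, not the competition files.
iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus an integer 'target' column

print(df.shape)  # (150, 5)

# Map the integer targets back to species names and count samples per species.
id_to_name = {i: str(name) for i, name in enumerate(iris.target_names)}
species_counts = df["target"].map(id_to_name).value_counts()
print(species_counts)
```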


### Performing the Classification
To do so, you will need to:
- Obtain the data from the competition
- Create a Jupyter notebook that produces a file for submission
- Submit that file to the competition

## Importing Required Libraries

First, we need to import the necessary libraries:

[Pandas](https://pandas.pydata.org/) - For data analysis and manipulation

[NumPy](https://numpy.org/) - For working with arrays and matrices

[Warnings](https://docs.python.org/3/library/warnings.html) - To suppress warning messages so that the notebook output stays readable

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Preparing Data
Data in Machine Learning and Deep Learning usually consists of `train` and `test` splits. Sometimes there is a `validation` split as well.

The main purpose of the train-test split is to assess how well a machine learning model generalizes to unseen data. By splitting the dataset, we can train the model on one subset of the data (the training set) and test its performance on another subset (the test set).


The train-test split usually serves two purposes:
1.   **Avoiding Overfitting**: By using a separate test set, we ensure that the estimate of the model’s performance is not overly optimistic, because the model has not seen the test data during training. This helps to detect overfitting, where the model performs well on the training data but poorly on unseen data.

2.   **Model Validation**: It provides a straightforward way to validate the model, giving insight into how it might perform in real-world scenarios.
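In this lab the competition already provides separate `train` and `test` files, so no manual split is needed. For illustration only, here is a minimal sketch of how such a split could be produced with scikit-learn's `train_test_split`. The 80/20 ratio, the `random_state`, and the use of the bundled Iris copy are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris copy stands in for the competition data.
X_all, y_all = load_iris(return_X_y=True)

# Hold out 20% of the rows; stratify keeps the three species balanced in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=0
)

print(X_tr.shape, X_te.shape)  # (120, 4) (30, 4)
print(np.bincount(y_te))       # [10 10 10]
```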


Here, your goal is to train any appropriate ML model on the `train` split and run inference on the `test` split.



In [None]:
train_data = pd.read_csv('train.csv')
train_data.head()

## Preprocessing
In this section, we will preprocess the data. The main steps are to prepare:


1.   Training data - the feature values the model learns from
2.   Labels - the species that each row of features belongs to

The target variable y in the Iris dataset consists of categorical values representing different species of iris flowers (e.g., 'setosa', 'versicolor', 'virginica'). Many machine learning algorithms require numerical input, so these categorical labels need to be converted into numerical form. A common approach is to use label encoding, where each category is assigned a unique integer.
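As a small illustration of label encoding, the sketch below applies `LabelEncoder` to a hypothetical list of four species names (not the competition data). The encoder assigns integers in sorted order of the class names, and `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, used only to show the mapping.
labels = ["versicolor", "setosa", "virginica", "setosa"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)  # ['setosa' 'versicolor' 'virginica']
print(encoded)           # [1 0 2 0]

# The same encoder can map integers back to the original names.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```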

---

EXERCISE 1:

Use [label encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to encode the target variables before fitting the model.

Step 1: Import the Label Encoder
```
from sklearn.preprocessing import LabelEncoder

```
Step 2: Use the Encoder to encode values

```
preproc =  LabelEncoder()

```
---

In [None]:

y = train_data['species']

preproc = ...
y = preproc.fit_transform(y)

X = train_data.drop('species', axis=1).values


In [None]:
assert all(label in y for label in [0, 1, 2])
print("Success")


Success


## Fitting the Model

Fitting is the process of training a model on a dataset so that it learns the underlying patterns and relationships in the data. This involves adjusting the model's parameters so that its predictions closely match the actual target values. For many models, fitting means minimizing a loss function that measures the error between the predictions and the true outcomes. The result is a trained model that can make predictions on new, unseen data.
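To make the idea of adjusting parameters to reduce a loss more concrete, here is a minimal sketch of gradient descent on a toy one-dimensional regression problem. The data, learning rate, and step count are hypothetical choices for this example, and scikit-learn models do not necessarily fit themselves this way (decision trees, for instance, use a different procedure).

```python
import numpy as np

# Hypothetical toy data generated from y = 2x + 1 (not the Iris data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_true = 2.0 * x + 1.0

w, b = 0.0, 0.0   # model parameters, starting from zero
lr = 0.05         # learning rate (assumed value)
losses = []

for step in range(2000):
    y_pred = w * x + b
    error = y_pred - y_true
    losses.append(np.mean(error ** 2))   # mean squared error loss
    # Move each parameter against the gradient of the loss.
    w -= lr * 2.0 * np.mean(error * x)
    b -= lr * 2.0 * np.mean(error)

print(round(w, 4), round(b, 4))  # close to 2.0 and 1.0
print(losses[0], losses[-1])     # the loss shrinks during fitting
```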

Implement any appropriate ML model you like.

Since we have a classification problem, we can use any of the following models:


*  Decision Trees
*  k-Nearest Neighbors (k-NN)
*  Support Vector Machines (SVM)
*  Logistic Regression
*  Neural Networks
*  Random Forests
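All of these model families are available in scikit-learn behind the same `fit` / `predict` / `score` interface, so swapping one for another usually changes a single line. The sketch below tries three of them on scikit-learn's bundled Iris copy with a hypothetical 80/20 split; the hyperparameters shown are assumptions for this example, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled Iris copy and an illustrative 80/20 split.
X_demo, y_demo = load_iris(return_X_y=True)
X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Every classifier is trained and scored through the same two calls.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_fit, y_fit)
    scores[name] = clf.score(X_held, y_held)  # accuracy on held-out rows
    print(f"{name}: {scores[name]:.3f}")
```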


---
EXERCISE 2:

Create a model from the list above using sklearn

For example:


Step 1: Import the module
```
from sklearn.linear_model import LogisticRegression

```
Step 2: Use the model

```
model = LogisticRegression()

```
---

In [None]:

model = ...
model.fit(X, y)


### Inference

The primary goal of inference is to predict the output for new data. For example, in the context of the Iris dataset, inference might involve predicting the species of an iris flower based on its features (sepal length, sepal width, petal length, and petal width).
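A minimal sketch of inference on a single flower follows. It assumes a k-nearest-neighbors classifier trained on scikit-learn's bundled Iris copy; the names `clf` and `new_flower` are hypothetical and independent of the `model` used in the exercises.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled Iris copy, k-NN chosen only for illustration.
iris = load_iris()
clf = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)

# One flower: sepal length, sepal width, petal length, petal width (cm).
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

pred = clf.predict(new_flower)       # integer class id
print(iris.target_names[pred[0]])    # setosa
```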

Now that our model is trained on iris flower features, we would like to predict the species of each flower in the `test` split from the same four features.

Run your trained model on the `test` split.


In [None]:
test_data = pd.read_csv('test.csv')
test_data.head()


---
EXERCISE 3:

Prepare the test features and labels as we did for the training data.

---

In [None]:
X_test = ...
y_test = ...
y_test = preproc.transform(y_test)
predictions = model.predict(X_test)
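
When true labels are available for the test rows, predictions can be compared against them. The sketch below uses `accuracy_score` on two hypothetical label lists rather than the lab's `y_test` and `predictions`, so it runs on its own.

```python
from sklearn.metrics import accuracy_score

# Hypothetical encoded labels: three of the four predictions match.
true_labels = [0, 1, 2, 2]
predicted_labels = [0, 1, 1, 2]

acc = accuracy_score(true_labels, predicted_labels)
print(acc)  # 0.75
```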

### Save model predictions
Save the model predictions to `submission.csv` and submit the file to the competition.

In [None]:
preds = pd.DataFrame(columns=['id', 'species'])
preds['id'] = test_data.index
preds['species'] = preproc.inverse_transform(predictions)



---
EXERCISE 4:

Save the results to a file named "results.csv", without the index.

---

In [None]:
preds...