### Instructions

#### Goal of the Project

This project is designed for you to practice and solve activities based on the concepts covered in the following lessons:

- Support Vector Machines - Introduction

- Support Vector Machines - MNIST Digits Classification II



---

#### Getting Started:

1. Click on this link to open the Colab file for this project.

     https://colab.research.google.com/drive/1TNhwOp6IDpbbhBwd1b4GiyqNhLgbrRVq  

2. Create a duplicate copy of the Colab file as described below.

  - Click on the **File menu**. A new drop-down list will appear.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-0/0_file_menu.png' width=500>

  - Click on the **Save a copy in Drive** option. A duplicate copy will be created and will open in a new tab of your web browser.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-0/1_create_colab_duplicate_copy.png' width=500>

3. After creating the duplicate copy of the notebook, please rename it in the **YYYY-MM-DD_StudentName_Project87** format.

4. Now, write your code in the prescribed code cells.

---

### Problem Statement

In this project, you are going to perform multiclass classification on a synthetic dataset by creating an SVM model to classify the data.


---

### List of Activities

**Activity 1:** Dummy Dataset Creation

**Activity 2:**  Dataset Inspection

**Activity 3:**  Train-Test Split

**Activity 4:** Model Training and Prediction

**Activity 5:** Model Evaluation


---

#### Activity 1: Create Dummy Dataset

In this activity, you have to execute the code cell that creates a dummy dataset for multiclass classification using the `make_blobs()` function of the `sklearn.datasets` module.

**Syntax:** `make_blobs(n_samples, centers, n_features, random_state, cluster_std)`


In [None]:
# Run this code cell to generate dummy data using 'make_blobs()' function
from sklearn.datasets import make_blobs
import pandas as pd

features_array, target_array = make_blobs(n_samples = [200, 500, 700, 272, 333], n_features = 2, random_state = 42, centers=None,cluster_std=1)

# Creating Pandas DataFrame containing the items from the 'features_array' and 'target_array' arrays.
# A dummy dictionary
dummy_dict = {'col 1': [features_array[i][0] for i in range(features_array.shape[0])],
             'col 2': [features_array[i][1] for i in range(features_array.shape[0])],
             'target': target_array}


# Converting the dictionary into DataFrame
dummy_df = pd.DataFrame.from_dict(dummy_dict)

# Printing first five rows of the dummy DataFrame
dummy_df.head()

      col 1     col 2  target
0  5.941620  3.534681       1
1 -6.878422 -7.697198       2
2  5.867548  0.763529       1
3 -6.327327 -6.254479       2
4 -0.682091  3.531567       4


In the above code cell,

- A dummy dataset is created having two columns representing two independent variables and a third column representing the target.  

- The records are divided into 5 groups of sizes `[200, 500, 700, 272, 333]`, so the `target` column has 5 different labels `[0, 1, 2, 3, 4]`.  

- A dummy DataFrame is created from the two arrays using a Python dictionary. *(Learnt in "Logistic Regression - Decision Boundary" lesson)*
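
As a quick check of how the list passed to `n_samples` maps to the labels, the following sketch regenerates the blobs with the same arguments and counts each label. The use of NumPy's `bincount()` is an addition for illustration and is not part of the project's code cell.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same arguments as in the project cell: one entry in 'n_samples' per blob.
group_sizes = [200, 500, 700, 272, 333]
features_array, target_array = make_blobs(n_samples=group_sizes, n_features=2,
                                          centers=None, cluster_std=1, random_state=42)

# Label i receives the number of samples given at position i of 'n_samples'.
label_counts = np.bincount(target_array)
for label, count in enumerate(label_counts):
    print(f"label {label}: {count} samples")
```

The rows are shuffled by default, which is why the first five rows of the DataFrame show a mix of labels.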



**After this activity, the DataFrame should be created with two independent feature columns and one dependent target column.**

----

#### Activity 2: Dataset Inspection

In this activity, you have to look into the distribution of the labels in the `target` column of the DataFrame.

**1.** Print the number of occurrences of each label in the `target` column.

In [None]:
# Display the number of occurrences of each label in the 'target' column.
dummy_df['target'].value_counts()

2    700
1    500
4    333
3    272
0    200
Name: target, dtype: int64

**2.** Print the percentage of samples for each label in the `target` column.

In [None]:
# Get the percentage of count of each label samples in the dataset.
percent = dummy_df['target'].value_counts()/dummy_df.shape[0]*100
percent

2    34.912718
1    24.937656
4    16.608479
3    13.566085
0     9.975062
Name: target, dtype: float64
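
The same percentages can also be obtained with the `normalize` parameter of `value_counts()`. A minimal sketch on a small made-up Series (`toy_target` is hypothetical and is not the project data):

```python
import pandas as pd

# Hypothetical toy labels: one 0, three 1s and four 2s.
toy_target = pd.Series([0, 1, 1, 2, 2, 2, 2, 1])

# Manual approach, as used in the cell above.
manual_percent = toy_target.value_counts() / toy_target.shape[0] * 100

# 'normalize=True' returns fractions, so multiply by 100 to get percentages.
normalized_percent = toy_target.value_counts(normalize=True) * 100

print(normalized_percent)
```

Both approaches give the same numbers; `normalize=True` only saves the explicit division by the row count.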

**Q:** How many unique labels are present in the DataFrame? What are they?

**A:** There are 5 unique labels present in the DataFrame. They are 0, 1, 2, 3 and 4.

---

**After this activity, the labels to be predicted, i.e., the values of the `target` variable, and their distribution should be known.**

----

#### Activity 3: Train-Test Split

We need to predict the value of the `target` variable, using other variables. Thus, `target` is the dependent variable and other columns are the independent variables.

**1.** Split the dataset into the training set and test set such that the training set contains 67% of the instances and the remaining 33% of the instances become the test set (`test_size = 0.33`).

**2.** Set `random_state = 42`.

In [None]:
dummy_df.head()

      col 1     col 2  target
0  5.941620  3.534681       1
1 -6.878422 -7.697198       2
2  5.867548  0.763529       1
3 -6.327327 -6.254479       2
4 -0.682091  3.531567       4


In [None]:
# Import the 'train_test_split()' function
from sklearn.model_selection import train_test_split

# Create the features data frame holding all the columns except the last column
# and print first five rows of this dataframe
X = dummy_df[['col 1','col 2']]
print(X.head())
print('----'*10)

# Create the target series that holds last column 'target'
# and print first five rows of this series
y = dummy_df['target']
print(y.head())
print('----'*10)

# Split the train and test sets using the 'train_test_split()' function.
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.33,random_state = 42)

      col 1     col 2
0  5.941620  3.534681
1 -6.878422 -7.697198
2  5.867548  0.763529
3 -6.327327 -6.254479
4 -0.682091  3.531567
----------------------------------------
0    1
1    2
2    1
3    2
4    4
Name: target, dtype: int64
----------------------------------------


**3.** Print the number of rows and columns in the training and testing set.

In [None]:
# Print the shape of all the four variables i.e. 'X_train', 'X_test', 'y_train' and 'y_test'
print(f"Shape of X_train : {X_train.shape}")
print(f"Shape of X_test : {X_test.shape}")
print(f"Shape of y_train : {y_train.shape}")
print(f"Shape of y_test : {y_test.shape}")

Shape of X_train : (1343, 2)
Shape of X_test : (662, 2)
Shape of y_train : (1343,)
Shape of y_test : (662,)
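
The shapes above follow from how `train_test_split()` handles a fractional `test_size`: the test set gets the ceiling of `test_size * n_samples` and the training set gets the remaining rows. The sketch below illustrates this with a placeholder array of the same length (2005 rows); `placeholder_rows` is a hypothetical stand-in for the DataFrame.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 2005  # 200 + 500 + 700 + 272 + 333
placeholder_rows = np.arange(n_samples)  # hypothetical stand-in for the DataFrame rows

train_rows, test_rows = train_test_split(placeholder_rows, test_size=0.33, random_state=42)

# Expected sizes computed by hand from the fraction.
expected_test = math.ceil(0.33 * n_samples)
expected_train = n_samples - expected_test

print(f"Actual   : train = {len(train_rows)}, test = {len(test_rows)}")
print(f"Expected : train = {expected_train}, test = {expected_test}")
```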


**After this activity, the features and target data should be split into training and testing data.**

----

#### Activity 4: Model Training and Prediction

Implement SVM classification using the `sklearn` module in the following way:

**1.** Import the `SVC` class from the `sklearn.svm` module.

**2.** Create an object of the `SVC` class and pass `kernel = "linear"` as input to its constructor.

**3.** Call the `fit()` function of the `SVC` class on the object created and pass `X_train` and `y_train` as inputs to the function.

**4.** Call the `score()` function with `X_train` and `y_train` as inputs to check the accuracy score of the model.


In [None]:
# Build an SVM classification model using the 'sklearn' module.
from sklearn.svm import SVC

# 1. Create the SVC model and pass 'kernel=linear' as input.
svc_model = SVC(kernel= 'linear')


# 2. Call the 'fit()' function with 'X_train' and 'y_train' as inputs.
print(svc_model.fit(X_train,y_train))

# 3. Call the 'score()' function with 'X_train' and 'y_train' as inputs to check the accuracy score of the model.
print(svc_model.score(X_train,y_train))

SVC(kernel='linear')
0.9813849590469099


**5.** Make predictions on the train set using the `predict()` function.

In [None]:
# Make predictions on the train dataset by using the 'predict()' function.
train_predict_values = pd.Series(svc_model.predict(X_train))

# Print the occurrence of each label computed in the predictions.
print(train_predict_values.value_counts())

2    483
1    336
4    223
3    181
0    120
dtype: int64


**Q:** Does the model classify all the labels in the training set?

**A:** Yes

---

**6.** Make predictions on the test dataset by using the `predict()` function.


In [None]:
# Make predictions on the test dataset.
test_predict_values = pd.Series(svc_model.predict(X_test))
# Print the occurrence of each label computed in the predictions.
print(test_predict_values.value_counts())

2    217
1    164
4    110
3     91
0     80
dtype: int64


**Q:** Does the model classify all the labels in the test set?

**A:** Yes




**After this activity, an SVM model should be trained and the labels of the `target` column should be predicted for multiclass classification.**

----

#### Activity 5: Model Evaluation

**1.** Create a confusion matrix to calculate True Positives, False Positives, True Negatives and False Negatives for the test set to evaluate the SVC linear model.

In [None]:
# Create a confusion matrix for the test set.
# Import the libraries
from sklearn.metrics import confusion_matrix,classification_report

# Print the confusion matrix
print(confusion_matrix(y_test,test_predict_values))

[[ 80   0   0   0   0]
 [  0 158   0   0  11]
 [  0   0 217   0   0]
 [  0   0   0  91   0]
 [  0   6   0   0  99]]


**Q:** Does the confusion matrix indicate any misclassification?

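**A:** Yes. 11 samples of label 1 are predicted as label 4, and 6 samples of label 4 are predicted as label 1.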
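
In a multiclass confusion matrix, True Positives, False Positives, False Negatives and True Negatives are counted per label by treating that label as "positive" and every other label as "negative". A minimal sketch, assuming the matrix printed above is copied in by hand (rows are actual labels, columns are predicted labels):

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = np.array([[ 80,   0,   0,   0,   0],
               [  0, 158,   0,   0,  11],
               [  0,   0, 217,   0,   0],
               [  0,   0,   0,  91,   0],
               [  0,   6,   0,   0,  99]])

true_pos = np.diag(cm)                     # correctly predicted samples of each label
false_pos = cm.sum(axis=0) - true_pos      # predicted as the label, but actually another label
false_neg = cm.sum(axis=1) - true_pos      # actually the label, but predicted as another label
true_neg = cm.sum() - (true_pos + false_pos + false_neg)

for label in range(cm.shape[0]):
    print(f"label {label}: TP={true_pos[label]}, FP={false_pos[label]}, "
          f"FN={false_neg[label]}, TN={true_neg[label]}")
```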

---

**2.** Print the classification report to observe the recall, precision and f1-scores for the linear SVC model.


In [None]:
# Print the classification report for the actual and predicted data of the testing set
print(classification_report(y_test,test_predict_values))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        80
           1       0.96      0.93      0.95       169
           2       1.00      1.00      1.00       217
           3       1.00      1.00      1.00        91
           4       0.90      0.94      0.92       105

    accuracy                           0.97       662
   macro avg       0.97      0.98      0.97       662
weighted avg       0.97      0.97      0.97       662



**Q:** What are the f1-scores for all the labels?

**A:** The f1-scores for the labels 0, 1, 2, 3 and 4 are 1.00, 0.95, 1.00, 1.00 and 0.92 respectively.
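
The per-label values in the classification report can be recomputed from the confusion matrix of the test set. The sketch below assumes the matrix from Activity 5 is copied in by hand and uses the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN) and f1 = 2 * precision * recall / (precision + recall).

```python
import numpy as np

# Confusion matrix of the test set (rows: actual, columns: predicted).
cm = np.array([[ 80,   0,   0,   0,   0],
               [  0, 158,   0,   0,  11],
               [  0,   0, 217,   0,   0],
               [  0,   0,   0,  91,   0],
               [  0,   6,   0,   0,  99]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)   # column sums = TP + FP
recall = true_pos / cm.sum(axis=1)      # row sums = TP + FN
f1 = 2 * precision * recall / (precision + recall)

for label in range(cm.shape[0]):
    print(f"label {label}: precision={precision[label]:.2f}, "
          f"recall={recall[label]:.2f}, f1-score={f1[label]:.2f}")
```

Only labels 1 and 4 fall below 1.00 because all the misclassifications occur between these two labels.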



**After this activity, the model should be evaluated for the target columns using the test features set.**

---

**Write your interpretation of the results here.**

- Interpretation 1:

- ...

---

### Submitting the Project

1. After finishing the project, click on the **Share** button on the top right corner of the notebook. A new dialog box will appear.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/2_share_button.png' width=500>

2. In the dialog box, make sure that '**Anyone on the Internet with this link can view**' option is selected and then click on the **Copy link** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/3_copy_link.png' width=500>

3. The link of the duplicate copy (named as **YYYY-MM-DD_StudentName_Project87**) of the notebook will get copied.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/4_copy_link_confirmation.png' width=500>

4. Go to your dashboard and click on the **My Projects** option.
   
   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/5_student_dashboard.png' width=800>

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/6_my_projects.png' width=800>

5. Click on the **View Project** button for the project you want to submit.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/7_view_project.png' width=800>

6. Click on the **Submit Project Here** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/8_submit_project.png' width=800>

7. Paste the link to the project file named as **YYYY-MM-DD_StudentName_Project87** in the URL box and then click on the **Submit** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/9_enter_project_url.png' width=800>