
# Lesson 44: Implementing a Predictive Model - Random Forest Classifier

In this session, we will explore how to use the Random Forest Classifier to predict whether a star has a planet based on its flux values. Machine learning, a field within artificial intelligence, enables computers to learn patterns and features from data without explicit programming.

Our goal is to train a model using the training dataset, where it will learn the characteristics of stars with planets and those without. Once trained, the model will use this knowledge to analyze a new dataset and predict the presence of planets based on observed features.

**Understanding Machine Learning and Random Forest**

- Machine learning allows systems to learn and make predictions or decisions based on data. In this context, the system will identify patterns in the flux values of stars. When presented with a new set of flux values, it will predict whether these stars have planets.

- There are numerous machine learning algorithms for such predictions, and one widely used algorithm is the Random Forest Classifier. This algorithm is particularly effective for classification tasks, where the goal is to assign data points to specific categories based on their features. For example, it can classify sounds such as "Meow!" as coming from a cat, "Woof!" from a dog, and hissing from a snake.

**Random Forest Classifier for Star-Planet Prediction**

- The Random Forest Classifier operates by constructing multiple decision trees during training. Each tree makes a prediction, and the forest aggregates these predictions to make a final decision. Compared with a single decision tree, this ensemble approach usually improves accuracy and reduces the risk of overfitting.

- In our case, we will train the Random Forest Classifier to distinguish between stars with planets (labeled as 2) and stars without planets (labeled as 1). The model will learn from the training dataset, which includes flux values and corresponding labels. It will then apply this knowledge to the test dataset, predicting whether stars have planets based on their flux values.

**Important Points**

- Training the Model: The model will learn the properties of stars with and without planets from the training data.
- Making Predictions: Once trained, the model will predict the presence of planets in stars from new data based on learned properties.
- Random Forest Classifier: This algorithm uses multiple decision trees to make accurate classifications.
- Labels: Stars with planets are labeled as 2, and stars without planets are labeled as 1.
- Ensemble Learning: Random Forest's ensemble approach improves prediction accuracy by combining the outputs of multiple trees.


By the end of this lesson, you'll understand how to deploy the Random Forest Classifier for star-planet prediction, gaining insights into machine learning's practical application.






### Activity 1: Loading The Datasets

In [None]:
# Mounting Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Loading both the training and test datasets.
import pandas as pd

train_df=pd.read_csv("/content/drive/MyDrive/datasets/exoTrain.csv")
test_df = pd.read_csv("/content/drive/MyDrive/datasets/exoTest.csv")

In [None]:
train_df.shape

(5087, 3198)

In [None]:
test_df.shape

(570, 3198)

In [None]:
train_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,2,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67
3,2,326.52,347.39,302.35,298.13,317.74,312.7,322.33,311.31,312.42,...,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,2,-1107.21,-1112.59,-1118.95,-1095.1,-1057.55,-1034.48,-998.34,-1022.71,-989.57,...,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54


### Activity 2: The `value_counts()` Function


To compute how many times a value occurs in a series, we use the `value_counts()` function.
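Before applying it to the real data, here is a minimal sketch of `value_counts()` on a tiny made-up series. The name `toy_labels` and its values are illustrative assumptions and are not taken from the exoplanet dataset.

```python
import pandas as pd

# Hypothetical labels: 1 = no planet, 2 = planet (illustrative values only).
toy_labels = pd.Series([1, 1, 2, 1, 1, 2, 1])

# value_counts() returns each distinct value together with how often it
# occurs, with the most frequent value listed first.
toy_counts = toy_labels.value_counts()
print(toy_counts)
```

Here the value `1` occurs five times and the value `2` occurs twice, so `1` is listed first.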

Our prediction model should classify the stars either as `1` or `2`. Let's find out how many stars in the training and test datasets are labeled as `1` and `2`.

In [None]:
train_df["LABEL"].value_counts()

Unnamed: 0_level_0,count
LABEL,Unnamed: 1_level_1
1,5050
2,37


In [None]:
#Count the number of times a value occurs in a Pandas series
test_df['LABEL'].value_counts()

Unnamed: 0_level_0,count
LABEL,Unnamed: 1_level_1
1,565
2,5


In the test dataset, `565` stars are labeled as `1` and `5` stars are labeled as `2`, which means only `5` stars have a planet. Interestingly, if our prediction model mindlessly classifies every star as `1`, then it appears to be a very accurate model. Why?

Because the accuracy of a model is calculated as **a percentage of the correct predictions out of the total number of predictions**. In this case, the percentage of the correct predictions is

 $\frac{565\times100}{570} \approx 99.12$ %

Thus, without actually deploying a proper prediction model, we can label the stars with about 99% accuracy.

This is **WRONG**! We need to be careful here because the data is highly imbalanced. The ultimate goal of the Kepler space telescope is to detect exoplanets in outer space. Hence, a machine learning model trained on this data should also **correctly** detect the stars having planets. In other words, a prediction model is useful only if it correctly detects almost all the stars having a planet.

So, the prediction model which always labels every star as `1` is useless because it does not detect a single star having a planet.
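The argument above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are assumptions made for illustration, and the counts are the ones reported by `value_counts()` for the test dataset.

```python
# Class counts taken from the test dataset's value_counts() output.
stars_without_planet = 565  # label 1
stars_with_planet = 5       # label 2
total_stars = stars_without_planet + stars_with_planet

# A "model" that labels every star as 1 gets every label-1 star right
# and every label-2 star wrong.
correct_predictions = stars_without_planet
planets_detected = 0

baseline_accuracy = correct_predictions * 100 / total_stars
detection_rate = planets_detected * 100 / stars_with_planet

print(f"Accuracy: {baseline_accuracy:.2f} %")
print(f"Stars with a planet detected: {detection_rate:.2f} %")
```

The accuracy comes out at about 99.12 %, yet none of the five stars with a planet is detected.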

Now, we are going to deploy the Random Forest Classifier model so that it can detect all five (or at least three) of the stars having a planet.

### Activity 3: Random Forest Classifier

A Random Forest is an ensemble learning method consisting of numerous decision trees. A decision tree is a hierarchical model that splits data into subsets based on certain conditions. When a condition is met, it follows one path, and if not, it follows another.

You might start by asking if there is a decrease in the star's flux values. If the answer is no, then the star does not have a planet. If yes, you proceed to check if the decrease is periodic. A no answer here means no planet, while a yes indicates the presence of a planet.

This example demonstrates a basic decision tree. In reality, decision trees can become more intricate depending on the complexity of the problem.
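The two questions of this basic tree can be sketched as a small function. The name `has_planet` and its yes/no inputs are hypothetical; a tree trained by scikit-learn would instead learn numeric thresholds on the flux columns.

```python
def has_planet(flux_decreases: bool, decrease_is_periodic: bool) -> bool:
    """Follow the two questions of the example decision tree."""
    if not flux_decreases:
        # First question answered "no": the star does not have a planet.
        return False
    if not decrease_is_periodic:
        # Flux decreases, but not periodically: still no planet.
        return False
    # Flux decreases periodically: the star has a planet.
    return True


print(has_planet(False, False))
print(has_planet(True, False))
print(has_planet(True, True))
```

Only the last call, where both answers are yes, reports a planet.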

When you have a collection of N such decision trees, it forms a random forest. Each tree in the forest provides a predicted class (in this case, either class 1 or class 2).

- The final prediction from the random forest is determined by majority voting; the class that most trees agree upon is the final predicted class.

- Think of the Random Forest Classifier as a sophisticated tool that can classify data into different categories (in this case, either class 1 or class 2) by learning from the features of each class within a training dataset.
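Majority voting itself is simple enough to sketch in plain Python. The function `majority_vote` and the list of tree predictions below are illustrative assumptions. Note also that scikit-learn's `RandomForestClassifier` combines the trees by averaging their predicted class probabilities, so this is a simplified picture of the idea rather than the library's exact procedure.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    class_counts = Counter(tree_predictions)
    winning_class, _ = class_counts.most_common(1)[0]
    return winning_class


# Hypothetical predictions from a forest of 5 trees for a single star.
tree_predictions = [1, 2, 1, 1, 2]
final_class = majority_vote(tree_predictions)
print(final_class)
```

Three of the five trees predict class `1`, so the forest's final answer for this star is class `1`.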








### Random Forest Classifier

A Random Forest is a collection (a.k.a. ensemble) of many decision trees. A decision tree is a flow chart which separates data based on some condition. If a condition is true, you follow one path; otherwise, you follow another path.


For example, to find a star having a planet, you can construct the following decision tree:

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-14/decision-tree.png' width=600>

You could ask whether there is a decrease in the flux values of a star. If the answer is no, then the star does not have a planet. However, if the answer is yes, then you could ask another question to check whether the decrease is periodic or not. Again, if the answer is no, then the star does not have a planet. Otherwise, it has a planet.

This is one of the examples of a decision tree. Based on a problem, the decision tree could get more and more complex.

A collection of `N` number of trees is a random forest wherein each tree gives some predicted value (in this case either class `1` or class `2`).

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-14/rfc-image.jpg' width=800>

The final predicted value is the majority class, i.e., the class that is predicted by the largest number of decision trees in a random forest.

For the time being, just consider the Random Forest Classifier as some kind of a black-box which classifies data into different classes (in this case, either class `1` or class `2`) by learning the properties of every class through a training dataset.


### Activity 4: Importing `RandomForestClassifier`

We need to import a class called `RandomForestClassifier` from a module called `sklearn.ensemble`. The `sklearn` (or **scikit-learn**) library is a collection of many machine learning modules. Using the **scikit-learn** library, most common machine learning algorithms can be applied directly without implementing the underlying math yourself. It is kind of a plug-and-play device.



In [None]:

# Import the 'RandomForestClassifier' class from the 'sklearn.ensemble' module.
from sklearn.ensemble import RandomForestClassifier

### Activity 5: Target & Feature Variables Separation

The `RandomForestClassifier` class has a function called `fit()` which takes two inputs. The first input is the collection of feature variables.

*Feature variables are the variables that describe the properties of an entity.* In this case, `FLUX.1` to `FLUX.3197` are the feature variables. Hence, the values stored in these columns are the features of a star in the exoplanets dataset.

The second input is the target variable.

*The variable which needs to be predicted is called a target variable.* In this case, the `LABEL` is the target variable because the prediction model needs to predict which star belongs to which class in the test dataset. Hence, the values stored in the `LABEL` column are the target values.

So, we need to extract the target variable and the feature variables separately from the training dataset.

Let's store the feature variables in the `x_train` variable and the target variable in the `y_train` variable. We will separate the features using the `iloc[]` function.

We need all the rows from the training set. So, inside the `iloc[]` function, we will enter the colon (`:`) sign to get all the rows. We do not need the first column, i.e., the `LABEL` column. Therefore, inside the `iloc[]` function, as part of column indexing, enter `1` as the starting index followed by the colon (`:`) sign to include the rest of the columns from the training dataset.
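To see the slicing in isolation, here is a minimal sketch on a made-up three-row DataFrame with the same column layout (`LABEL` first, flux columns after it). The name `toy_df` and all of its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature dataset: 'LABEL' first, then two flux columns.
toy_df = pd.DataFrame({
    "LABEL": [2, 1, 1],
    "FLUX.1": [1.5, -0.5, 3.0],
    "FLUX.2": [1.0, -0.2, 2.5],
})

# ':' selects all rows; '1:' selects every column from index 1 onwards.
toy_features = toy_df.iloc[:, 1:]

# ':' selects all rows; '0' selects only the first column ('LABEL').
toy_target = toy_df.iloc[:, 0]

print(list(toy_features.columns))
print(toy_target.tolist())
```

The features keep only the two flux columns, while the target is the `LABEL` column on its own.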


In [None]:
train_df.iloc[5:10, 3:11]

Unnamed: 0,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10
5,179.16,187.82,188.46,168.13,203.46,178.65,166.49,139.34
6,33.3,9.63,37.64,20.85,4.54,22.42,10.11,40.1
7,277.8,190.16,180.98,123.27,103.95,50.7,59.91,110.19
8,-108.93,-72.25,-61.46,-50.16,-20.61,-12.44,1.48,11.55
9,-335.66,-450.47,-453.09,-561.47,-606.03,-712.72,-685.97,-753.97


In [None]:
#Extract the feature variables from the training dataset using the 'iloc[]' function.
x_train=train_df.iloc[:,1:]
x_train

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,93.85,83.81,20.10,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,-160.17,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,-73.38,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.70,6.46,16.00,19.93
2,532.64,535.92,513.73,496.92,456.45,466.00,464.50,486.39,436.56,484.39,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.80,-28.91,-70.02,-96.67
3,326.52,347.39,302.35,298.13,317.74,312.70,322.33,311.31,312.42,323.33,...,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,-1107.21,-1112.59,-1118.95,-1095.10,-1057.55,-1034.48,-998.34,-1022.71,-989.57,-970.88,...,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5082,-91.91,-92.97,-78.76,-97.33,-68.00,-68.24,-75.48,-49.25,-30.92,-11.88,...,139.95,147.26,156.95,155.64,156.36,151.75,-24.45,-17.00,3.23,19.28
5083,989.75,891.01,908.53,851.83,755.11,615.78,595.77,458.87,492.84,384.34,...,-26.50,-4.84,-76.30,-37.84,-153.83,-136.16,38.03,100.28,-45.64,35.58
5084,273.39,278.00,261.73,236.99,280.73,264.90,252.92,254.88,237.60,238.51,...,-26.82,-53.89,-48.71,30.99,15.96,-3.47,65.73,88.42,79.07,79.43
5085,3.82,2.09,-3.29,-2.88,1.66,-0.75,3.85,-0.03,3.28,6.29,...,10.86,-3.23,-5.10,-4.61,-9.82,-1.50,-4.65,-14.55,-6.41,-2.55



Now, let's get only the target variable from the training dataset.


In [None]:
train_df["LABEL"]

Unnamed: 0,LABEL
0,2
1,2
2,2
3,2
4,2
...,...
5082,1
5083,1
5084,1
5085,1


In [None]:
#  Using the 'iloc[]' function, retrieve only the first column, i.e., the 'LABEL' column from the training dataset.
y_train=train_df.iloc[:,0]
y_train

Unnamed: 0,LABEL
0,2
1,2
2,2
3,2
4,2
...,...
5082,1
5083,1
5084,1
5085,1


### Activity 6: Fitting The Model

Now that we have separated the feature and target variables for deploying the `RandomForestClassifier` model, let's train the model with the feature variables using the `fit()` function. The steps to be followed are described below.

1. First, create a `RandomForestClassifier` object with the inputs `n_jobs=-1` and `n_estimators=50`. Store the object in a variable named `rf_data`.



  ```
  rf_data = RandomForestClassifier(n_jobs=-1, n_estimators=50)
  ```

  The `n_estimators` parameter defines the number of decision trees in a Random Forest. Therefore, `n_estimators=50` means that the forest contains `50` decision trees. If the `n_estimators` parameter is not defined by a user, then by default, the forest contains `100` decision trees. The `n_jobs=-1` input tells scikit-learn to use all the available processor cores while training.

2. Call the `fit()` function with `x_train` and `y_train` as inputs.

  ```
  rf_data.fit(x_train, y_train)
  ```
3. Call the `score()` function with `x_train` and `y_train` as inputs to check the accuracy of the model on the training data. This step is optional; you can skip it if you wish.

  ```
  rf_data.score(x_train, y_train)
  ```


In [None]:
# Train the 'RandomForestClassifier' model using the 'fit()' function.
rf_data=RandomForestClassifier()

rf_data.fit(x_train,y_train)

rf_data.score(x_train,y_train)

1.0

In [None]:
# Train the 'RandomForestClassifier' model using the 'fit()' function.
rf_data=RandomForestClassifier(n_jobs=-1,n_estimators=50)

rf_data.fit(x_train,y_train)

rf_data.score(x_train,y_train)

0.9996068409671712

As you can see, the `RandomForestClassifier` model fits the training data with an accuracy of almost 100%. Keep in mind that this number only tells us how well the model reproduces labels it has already seen; it says nothing yet about new data.

For a `RandomForestClassifier` (RFC), the `score()` function returns the accuracy of the model on the data passed to it. Since we passed the training data, the score is the accuracy on the training dataset.

**What Does the Score Represent?**

Accuracy: The score typically indicates the proportion of correctly classified instances out of the total instances in the training data. If the score is 100%, it means the model perfectly classified every instance in the training set.
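A high training score on its own can be misleading. The sketch below is an illustration under stated assumptions: it uses randomly generated features and random labels (not the exoplanet data), and all the `demo_` names are hypothetical. Because the labels are pure noise, there is nothing real to learn, yet the forest can still memorize the training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Random features and random labels (1 or 2): there is no real pattern.
rng = np.random.default_rng(0)
x_demo_train = rng.normal(size=(200, 10))
y_demo_train = rng.integers(1, 3, size=200)
x_demo_test = rng.normal(size=(100, 10))
y_demo_test = rng.integers(1, 3, size=100)

# random_state is fixed only to make the sketch repeatable.
demo_clf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_clf.fit(x_demo_train, y_demo_train)

# Accuracy on data the model has seen versus data it has not seen.
train_score = demo_clf.score(x_demo_train, y_demo_train)
test_score = demo_clf.score(x_demo_test, y_demo_test)
print("Training accuracy:", train_score)
print("Held-out accuracy:", test_score)
```

The training accuracy should come out far higher than the held-out accuracy, which is why a model is always judged on data it was not trained on.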

### Activity 7: Target & Feature Variables From Test Dataset

Now we need to make predictions on the test dataset. So, we just need to extract feature variables from the test dataset using the `iloc[]` function.

In [None]:
#Using the 'iloc[]' function, extract the feature variables from the test dataset.
x_test=test_df.iloc[:,1:]
x_test

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,119.88,100.21,86.46,48.68,46.12,39.39,18.57,6.98,6.63,-21.97,...,14.52,19.29,14.44,-1.62,13.33,45.50,31.93,35.78,269.43,57.72
1,5736.59,5699.98,5717.16,5692.73,5663.83,5631.16,5626.39,5569.47,5550.44,5458.80,...,-581.91,-984.09,-1230.89,-1600.45,-1824.53,-2061.17,-2265.98,-2366.19,-2294.86,-2034.72
2,844.48,817.49,770.07,675.01,605.52,499.45,440.77,362.95,207.27,150.46,...,17.82,-51.66,-48.29,-59.99,-82.10,-174.54,-95.23,-162.68,-36.79,30.63
3,-826.00,-827.31,-846.12,-836.03,-745.50,-784.69,-791.22,-746.50,-709.53,-679.56,...,122.34,93.03,93.03,68.81,9.81,20.75,20.25,-120.81,-257.56,-215.41
4,-39.57,-15.88,-9.16,-6.37,-16.13,-24.05,-0.90,-45.20,-5.04,14.62,...,-37.87,-61.85,-27.15,-21.18,-33.76,-85.34,-81.46,-61.98,-69.34,-17.84
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
565,374.46,326.06,319.87,338.23,251.54,209.84,186.35,167.46,135.45,107.28,...,-123.55,-166.90,-222.44,-209.71,-180.16,-166.83,-235.66,-213.63,-205.99,-194.07
566,-0.36,4.96,6.25,4.20,8.26,-9.53,-10.10,-4.54,-11.55,-10.48,...,-12.40,-5.99,-17.94,-11.96,-12.11,-13.68,-3.59,-5.32,-10.98,-11.24
567,-54.01,-44.13,-41.23,-42.82,-39.47,-24.88,-31.14,-24.71,-13.12,-14.78,...,-0.73,-1.64,1.58,-4.82,-11.93,-17.14,-4.25,5.47,14.46,18.70
568,91.36,85.60,48.81,48.69,70.05,22.30,11.63,37.86,28.27,-4.36,...,2.44,11.53,-16.42,-17.86,21.10,-10.25,-37.06,-8.43,-6.48,17.60


Let's also extract the target variable from the test dataset so that we can compare the actual target values with the predicted values later.

In [None]:
#Using the 'iloc[]' function, extract the target variable from the test dataset.
y_test=test_df.iloc[:,0]
y_test

Unnamed: 0,LABEL
0,2
1,2
2,2
3,2
4,2
...,...
565,1
566,1
567,1
568,1


### Activity 8: The `predict()` Function

Now, let's make predictions on the test dataset by calling the `predict()` function with the feature variables of the test dataset as an input.

In [None]:
y_test

In [None]:
# Make predictions on the test dataset by using the 'predict()' function.
y_predicted=rf_data.predict(x_test)
print(y_predicted)
type(y_predicted)



[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 

numpy.ndarray

In [None]:
print(type(y_predicted))

<class 'numpy.ndarray'>


The `predict()` function returns a NumPy array of the predicted values. You can verify it using the `type()` function.

The actual target values are stored in a Pandas series. So, for the sake of consistency, let's convert the NumPy array of the predicted values into a Pandas series.

In [None]:
#Convert the NumPy array of predicted values into a Pandas series.
import pandas as pd
y_prediction=pd.Series(y_predicted)
y_prediction

Unnamed: 0,0
0,1
1,1
2,1
3,1
4,1
...,...
565,1
566,1
567,1
568,1


Now, let's count the number of stars classified as `1` and `2`.

In [None]:
#Using the 'value_counts()' function, count the number of times 1 and 2 occur in the predicted values.
y_prediction.value_counts()

Unnamed: 0,count
1,570


As you can see, we did not get the expected results. The model should have classified all the stars having a planet as `2`. Ideally, the Random Forest Classifier model should have classified `565` values as `1` and the remaining `5` values as `2`.

In this case, even though the accuracy of the prediction model is high, it does not give the result required by the problem statement. Hence, **accuracy alone is not the metric to test the efficacy of a prediction model.**

Generally, a classification model (in this case, the Random Forest Classifier) is evaluated using a **confusion matrix**.


### Activity 9: The Confusion Matrix
Let's first create a confusion matrix and then try to understand it.

To create a confusion matrix, first import the `confusion_matrix` function from the `sklearn.metrics` module. This module contains functions for evaluating a machine learning model. In addition to the `confusion_matrix` function, let's also import the `classification_report` function. We will use them later to evaluate our model.

In [None]:
#Import the 'confusion_matrix' and 'classification_report' functions from the 'sklearn.metrics' module.
from sklearn.metrics import confusion_matrix, classification_report

Now, create the confusion matrix using the `confusion_matrix()` function. It requires two inputs. The first input is actual target values (`y_test`) and the second input is predicted target values (`y_predicted`).

In [None]:
# Create a confusion matrix using the 'y_test' and 'y_predicted' values.
confusion_matrix(y_test,y_predicted)

array([[565,   0],
       [  5,   0]])

Now that we have got our confusion matrix, let's try to understand this concept.

**Confusion Matrix:**


It is a way of evaluating the performance of your machine learning algorithm.

**For Example:**

Suppose that you attempted an online exam with 100 questions, and you already know that you gave 75 correct answers and 25 incorrect answers.

However, the exam software did not assess the answers correctly: it marked many correct answers as incorrect and many incorrect answers as correct. Let us evaluate the performance of this software using a confusion matrix.

- There are two possible classes:
  1. Class `correct`.
  2. Class `incorrect`.

We need to find out how many correct answers were accurately assessed or predicted by the software.

Thus,
- positive outcome $\Rightarrow$ `correct` answer.
- negative outcome $\Rightarrow$ `incorrect` answer.

In technical terms, the desired outcome is called a **positive outcome**.


|Actual      |Predicted              |
|------------|-----------------------|
|Incorrect   |Incorrect             |
|Correct     |Correct                |
|Correct     |Incorrect              |
|Incorrect   |Correct                |



1. The first row, first column value indicates those `incorrect` answers which were <b><font color=green>accurately</font></b> assessed or predicted as `incorrect` by the software.
Such values are called **True Negative (TN)**.


2. The second row, second column value indicates those `correct` answers which were <b><font color=green>accurately</font></b> assessed or predicted as `correct` by the software.
Such values are called **True Positive (TP)**.

3. The second row, first column value indicates those `correct` answers which were <b><font color=red>inaccurately</font></b> assessed or predicted as `incorrect` by the software.
Such values are called **False Negative (FN)**.


4. The first row, second column value indicates those `incorrect` answers which were <b><font color=red>inaccurately</font></b> assessed or predicted as `correct` by the software.
Such values are called **False Positive (FP)**.





The resulting confusion matrix, obtained after evaluating the predicted values, is as follows:

||Predicted (Incorrect)|Predicted (Correct)|
|-|-|-|
|Actual (Incorrect)|TN|FP|
|Actual (Correct)|FN|TP|






Thus, the confusion matrix compares the actual values with the predicted values, which makes it very useful for evaluating the performance of your machine learning model.
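The four outcomes can be counted by hand with a short loop. This is a sketch under assumptions: the function `count_outcomes` and the six example answers are made up for illustration and do not come from the exam described above.

```python
def count_outcomes(actual, predicted, positive):
    """Count TN, FP, FN and TP for a two-class problem."""
    tn = fp = fn = tp = 0
    for actual_value, predicted_value in zip(actual, predicted):
        if actual_value == positive and predicted_value == positive:
            tp += 1  # positive, and predicted as positive
        elif actual_value == positive:
            fn += 1  # positive, but predicted as negative
        elif predicted_value == positive:
            fp += 1  # negative, but predicted as positive
        else:
            tn += 1  # negative, and predicted as negative
    return tn, fp, fn, tp


# Hypothetical assessment of six answers (illustrative values only).
actual_answers = ["correct", "correct", "incorrect",
                  "correct", "incorrect", "incorrect"]
predicted_answers = ["correct", "incorrect", "incorrect",
                     "correct", "correct", "incorrect"]

tn, fp, fn, tp = count_outcomes(actual_answers, predicted_answers,
                                positive="correct")
print(tn, fp, fn, tp)
```

For these six answers the counts are TN = 2, FP = 1, FN = 1 and TP = 2, which add up to the six answers assessed.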

---

Let us apply the concept of confusion matrix for our dataset.

There are two possible classes:
1. The class `1` values are stars **NOT** having a planet.
2. The class `2` values are stars having a planet.

So, after you deploy the classification model, there are 4 possible outcomes. They are:

1. Class `1` values predicted as class `1`.

2. Class `1` values predicted as class `2`.

3. Class `2` values predicted as class `1`.

4. Class `2` values predicted as class `2`.

These 4 possibilities can be reported in a confusion matrix.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|||
|Actual Class `2` (`y_test`)|||

where

- `y_test` contains the actual class `1` and class `2` values

- `y_predicted` contains the predicted class `1` and class `2` values

In this table,

- the values **predicted as class `1` and actually belonging to class `1`** are reported in the first row and first column.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565||
|Actual Class `2` (`y_test`)|||

- the values **predicted as class `1` but actually belonging to class `2`** are reported in the second row and first column.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565||
|Actual Class `2` (`y_test`)|5||

- the values **predicted as class `2` and actually belonging to class `2`** are reported in the second row and second column.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565||
|Actual Class `2` (`y_test`)|5|0|

- the values **predicted as class `2` but actually belonging to class `1`** are reported in the first row and second column.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565|0|
|Actual Class `2` (`y_test`)|5|0|

In this case, the class `1` values refer to the stars not having a planet whereas class `2` values refer to the stars having a planet.

**Positive Outcome**

Detecting a star having a planet is the desired outcome (positive outcome).
Thus,
- positive outcome $\Rightarrow$ class `2`.

- negative outcome $\Rightarrow$ class `1`.

So, here the positive outcome is the prediction of the stars having a planet, i.e., prediction of the class `2` values. Likewise, finding a star which does not have any planet is a *negative outcome*. So, here the negative outcome is the prediction of the class `1` values.

Observe the output of `confusion_matrix(y_test, y_predicted)` function.

```
array([[565,   0],
       [  5,   0]])
```




- `565` values are **True Negative (TN)** values because they are **truly** predicted as class `1` values.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565 (TN)||
|Actual Class `2` (`y_test`)|||


- `5` values are **False Negative (FN)** values because they are **falsely** predicted as class `1` values. They should have been predicted as class `2` values.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565 (TN)||
|Actual Class `2` (`y_test`)|5 (FN)||


- `0` values are **True Positive (TP)** values because they are **truly** predicted as class `2` values.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565 (TN)||
|Actual Class `2` (`y_test`)|5 (FN)|0 (TP)|


- `0` values are **False Positive (FP)** values because they are **falsely** predicted as class `2` values. They should have been predicted as class `1` values.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565 (TN)|0 (FP)|
|Actual Class `2` (`y_test`)|5 (FN)|0 (TP)|
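The four values can also be unpacked directly from the matrix. The sketch below rebuilds the situation described above with plain lists instead of the real `y_test` and `y_predicted` (the names `actual_labels` and `predicted_labels` are assumptions); passing `labels=[1, 2]` fixes the row and column order so that class `2` is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix

# 565 stars of class 1 and 5 stars of class 2, all predicted as class 1.
actual_labels = [1] * 565 + [2] * 5
predicted_labels = [1] * 570

# Rows are actual classes and columns are predicted classes, in the
# order given by 'labels'.
matrix = confusion_matrix(actual_labels, predicted_labels, labels=[1, 2])

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = matrix.ravel()
print(tn, fp, fn, tp)
```

This prints `565 0 5 0`, matching the TN, FP, FN and TP cells of the table.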

---

### Activity 10: Precision And Recall

A good prediction model provides a large number of true positive (TP) and true negative (TN) values and a very low number of false positive (FP) and false negative (FN) values.

**Precision:**

Based on the TP and FP values, we define a parameter called **precision**.
It is used to evaluate the number of correct positive predictions made.

It is the ratio of the TP values to the sum of TP and FP values, i.e.,


$$\text{precision} = \frac{\text{TP}}{\text{TP + FP}}$$

**Precision: When the model predicts YES, how often is it correct?**

**Recall (sensitivity): When the actual value is YES, how often does the model predict YES?**

**Accuracy: Overall, how often is the classifier correct? (TP + TN) / total**

**Misclassification rate: Overall, how often is the classifier wrong? (FP + FN) / total**


Ideally, the precision should be 1 for a good classifier model.

Let's calculate the precision for our Random Forest classifier model.
Currently, the model has given `0` TP values and `0` FP values. Therefore, the precision value is undefined because

$$\text{precision} = \frac{0}{0 + 0} = \text{undefined}$$

*In mathematics, the division by 0 is undefined (or not defined).*

**Recall:**

Based on the TP and FN values, we define another parameter called **recall**.
It is the ratio of the TP values to the sum of TP and FN values, i.e.,

$$\text{recall} = \frac{\text{TP}}{\text{TP + FN}}$$




The recall should be 1 for a good classifier model.

Let's calculate the recall for our Random Forest classifier model.

Currently, the model gives `0` TP and `5` FN values. Hence, the recall value is 0 because

$$\text{recall} = \frac{0}{0+5} = \text{0}$$

Imagine if the prediction model labels every star as `2`, i.e., every star has a planet. Then, the number of TP values will be the maximum, i.e., `5`, but the number of FP values will also be the maximum, i.e., `565`. In such a case, the precision value would be

$$\text{precision} = \frac{5}{5+565} = \frac{5}{570} \approx 0.009$$

which is very low.

Also, the model will give `0` FN values. Then, the recall value would be

$$\text{recall} = \frac{5}{5 + 0} = 1$$


So, even though the recall value would be equal to 1, the precision value would be close to 0. Hence, this would be a bad prediction model.
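Both scenarios can be reproduced with two small helper functions. This is a sketch: the names `precision` and `recall` are hypothetical helpers (not scikit-learn functions), and returning `None` for a division by zero is just one possible way to represent an undefined value.

```python
def precision(tp, fp):
    """TP / (TP + FP), or None when no positive predictions were made."""
    if tp + fp == 0:
        return None  # undefined: division by zero
    return tp / (tp + fp)


def recall(tp, fn):
    """TP / (TP + FN), or None when there are no actual positives."""
    if tp + fn == 0:
        return None  # undefined: division by zero
    return tp / (tp + fn)


# Model that labels every star as 1: TP = 0, FP = 0, FN = 5.
all_ones = (precision(0, 0), recall(0, 5))

# Model that labels every star as 2: TP = 5, FP = 565, FN = 0.
all_twos = (precision(5, 565), recall(5, 0))

print("Every star labeled 1:", all_ones)
print("Every star labeled 2:", all_twos)
```

The first model has an undefined precision and a recall of 0; the second has a recall of 1 but a precision of only about 0.009.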


Evidently, there is a trade-off. Pushing the recall value higher tends to lower the precision value and vice versa. Hence, we need to find an optimum point where both the precision and the recall values are acceptable.

---

### Activity 11: The `f1-score`

To find an optimum point where both the precision and recall values are high, we calculate another parameter called **f1-score**. It is the harmonic mean of the precision and recall values, i.e.,



$$\text{f1-score} = 2 \left( \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \right)$$




The f1-score will be high only when both precision and recall are high.

Let's calculate the f1-score for our Random Forest classifier model.

Based on the current predictions, the f1-score value is undefined because the precision value is undefined (the recall value is `0`).

$$\text{f1-score} = 2 \left( \frac{\text{undefined} \times 0}{\text{undefined} + 0} \right) =  \text{undefined}$$
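The harmonic mean can be sketched as a helper function. The name `f1_score_from` is hypothetical, `None` stands in for an undefined value, and the `0.8` inputs in the last call are made-up numbers used only to show what happens when both values are high.

```python
def f1_score_from(precision_value, recall_value):
    """Harmonic mean of precision and recall, or None if undefined."""
    if precision_value is None or recall_value is None:
        return None  # an undefined input gives an undefined f1-score
    if precision_value + recall_value == 0:
        return None  # avoid dividing by zero
    return (2 * precision_value * recall_value
            / (precision_value + recall_value))


# Current model: precision undefined, recall 0.
f1_all_ones = f1_score_from(None, 0.0)

# Model that labels every star as 2: precision 5/570, recall 1.
f1_all_twos = f1_score_from(5 / 570, 1.0)

# Hypothetical model with high precision and high recall.
f1_balanced = f1_score_from(0.8, 0.8)

print(f1_all_ones, f1_all_twos, f1_balanced)
```

A recall of 1 cannot rescue a very low precision: the f1-score for the model that labels every star as `2` stays below 0.02, while the hypothetical balanced model scores about 0.8.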

You can also get these values by calling a function called `classification_report()`. It takes two inputs: the actual target values and the predicted target values, i.e., `y_test` and `y_predicted`.

**Note:** You may get the following warning message after executing the code in the code cell below.

```
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
```

Ignore the warning.
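If you would rather not see the warning at all, recent scikit-learn versions accept a `zero_division` input in `classification_report()`. The sketch below assumes such a version is installed and uses plain lists in place of the real `y_test` and `y_predicted`; the `demo_` names are hypothetical.

```python
from sklearn.metrics import classification_report

# 565 stars of class 1 and 5 stars of class 2, all predicted as class 1.
demo_actual = [1] * 565 + [2] * 5
demo_predicted = [1] * 570

# zero_division=0 reports undefined metrics as 0.0 without a warning.
# output_dict=True returns the report as a dictionary instead of text.
report = classification_report(
    demo_actual, demo_predicted, zero_division=0, output_dict=True
)
print(report["2"])
print(report["accuracy"])
```

The entry for class `2` shows a precision, recall and f1-score of 0, while the overall accuracy is still about 0.99.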

In [None]:
# Print the 'precision', 'recall' and 'f1-score' values using the 'classification_report()' function.
print(classification_report(y_test,y_predicted))

              precision    recall  f1-score   support

           1       0.99      1.00      1.00       565
           2       0.00      0.00      0.00         5

    accuracy                           0.99       570
   macro avg       0.50      0.50      0.50       570
weighted avg       0.98      0.99      0.99       570





As you can see, the `precision` and `f1-score` values are reported as `0.00` for class `2` because they are actually undefined values.
  
Ideally, the above values for class `2` should also be close to `1.00`. Only then can we say that our prediction model is satisfactory. This shows that accuracy alone cannot tell whether a prediction model is making correct predictions or not.

In the next class, we will try to improve the model so that we get the desired precision, recall and f1-score values for class `2`.

