
# Lesson 45: Data Normalisation, Oversampling, XGBoost Classifier and Logistic Regression



In the last class, we evaluated the Random Forest Classifier model using a concept called the **confusion matrix**. We also calculated the precision, recall and f1-score values, which turned out to be undefined. Based on these three metrics, we concluded that our model needs a lot of improvement because it did not classify any star having a planet as `2`; instead, it labelled every star as `1`.

In this lesson, we will process the data before deploying a prediction model so that it can learn the properties of the different stars through the training dataset.

Now, there is no single right approach to data processing. It is an iterative process that comes through experience and domain knowledge. (*The term 'domain' means a particular field of industry or academics*). For example, if you are a banker, then you would have knowledge of the finance field. Similarly, if you are an astrophysicist, then you would have technical knowledge of astronomy, quantum mechanics, optics etc. So, based on the knowledge of the respective field, data should be processed before deploying a prediction model.

So for this dataset, we will perform the following data processing exercises:

1. Data Normalisation

2. Oversampling

Finally, we will check if the Random Forest Classifier model is still failing to provide the expected results. If so, then we will deploy a much stronger prediction model called the XGBoost Classifier. Generally, it doesn't require a lot of data pre-processing, but it can be computationally demanding on a dataset this wide (3197 `FLUX` columns), so it benefits from plenty of RAM and a multi-core CPU; a GPU is optional. Thanks to the Google Colab notebook, we have decent computation power to run the XGBoost Classifier algorithm.






---

#### Loading The Datasets


In [None]:
# Mounting Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
# Loading both the training and test datasets.
import pandas as pd

train_df=pd.read_csv("/content/drive/MyDrive/datasets/exoTrain.csv")
test_df = pd.read_csv("/content/drive/MyDrive/datasets/exoTest.csv")

---

#### Activity 1: Data Normalisation

After creating a DataFrame and inspecting data for the missing values, we can normalise the data.

**What is data normalisation?**

Data normalisation is a process of standardising data. It brings every single data-point on a uniform scale. Let us try to understand this with the help of an example:



If you look at both the `train_df` and `test_df` DataFrames, they contain highly varying `FLUX` values.

You can get the data summary using the `describe()` function to look at the variation of the data.

**Syntax:** `dataframe_name.describe()`

In [None]:
# Get the data description by calling the 'describe()' function.
train_df.describe()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
count,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,...,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0
mean,1.007273,144.5054,128.5778,147.1348,156.1512,156.1477,146.9646,116.838,114.4983,122.8639,...,348.5578,495.6476,671.1211,746.879,693.7372,655.3031,-494.784966,-544.594264,-440.2391,-300.536399
std,0.084982,21506.69,21797.17,21913.09,22233.66,23084.48,24105.67,24141.09,22906.91,21026.81,...,28647.86,35518.76,43499.63,49813.75,50871.03,53399.79,17844.46952,17722.339334,16273.406292,14459.795577
min,1.0,-227856.3,-315440.8,-284001.8,-234006.9,-423195.6,-597552.1,-672404.6,-579013.6,-397388.2,...,-324048.0,-304554.0,-293314.0,-283842.0,-328821.4,-502889.4,-775322.0,-732006.0,-700992.0,-643170.0
25%,1.0,-42.34,-39.52,-38.505,-35.05,-31.955,-33.38,-28.13,-27.84,-26.835,...,-17.6,-19.485,-17.57,-20.76,-22.26,-24.405,-26.76,-24.065,-21.135,-19.82
50%,1.0,-0.71,-0.89,-0.74,-0.4,-0.61,-1.03,-0.87,-0.66,-0.56,...,2.6,2.68,3.05,3.59,3.23,3.5,-0.68,0.36,0.9,1.43
75%,1.0,48.255,44.285,42.325,39.765,39.75,35.14,34.06,31.7,30.455,...,22.11,22.35,26.395,29.09,27.8,30.855,18.175,18.77,19.465,20.28
max,2.0,1439240.0,1453319.0,1468429.0,1495750.0,1510937.0,1508152.0,1465743.0,1416827.0,1342888.0,...,1779338.0,2379227.0,2992070.0,3434973.0,3481220.0,3616292.0,288607.5,215972.0,207590.0,211302.0


As you can see, the values in the `FLUX.1` column range between `-227,856.3` (minimum `FLUX.1` value) and `1,439,240` (maximum `FLUX.1` value).

**Note:** `-2.278563e+05` is equivalent to $-2.278563\times10^5$ and `1.439240e+06` is equivalent to $1.439240\times10^6$.

In the `FLUX.1` column, the difference in the maximum and minimum values, i.e.,

$$1,439,240 - (-227,856.3) = 1,667,096.3$$


is huge because of the $10^5$ and $10^6$ scales.

Similarly, the data in all the other `FLUX` columns also vary a lot because they lie on a huge scale. Big figures are less readable. For example, `122` (one hundred twenty two) is more readable than `1,439,240` (one million, four hundred thirty nine thousand, two hundred forty).

<b><font color=green>The data normalisation process lowers the scale and brings all the data-points on the same scale.</font></b>

**Why must data be normalised?**

Many machine learning models are quite sensitive to the scale of data. They tend to give more importance to the larger values while learning the properties of data. Hence, it becomes crucial for us to remove this bias by bringing all the data-points onto the same scale.


**How to normalise data?**

There are various methods of data normalisation.  For the time being, we will use the *mean normalisation* method. Let's understand the *mean normalisation* method.

Consider a series of numbers having the values

$$x_1, x_2, x_3, ... , x_N$$

where $N$ is the total number of values in a series.

Let

- $x_{mean}$ denote the mean (or average) value of a series

- $x_{min}$ denote the minimum value in a series and

- $x_{max}$ denote the maximum value in a series

The normalised value in a series is calculated as

$$x_{norm} = \frac{x_p - x_{mean}}{x_{max} - x_{min}}$$

where

$$x_p = x_1, x_2, x_3, ..., x_N$$

So after normalisation, the new values in the series will be

$$\left(\frac{x_1 - x_{mean}}{x_{max} - x_{min}}\right), \left(\frac{x_2 - x_{mean}}{x_{max} - x_{min}}\right), \left(\frac{x_3 - x_{mean}}{x_{max} - x_{min}}\right), ..., \left(\frac{x_N - x_{mean}}{x_{max} - x_{min}}\right)$$


Let's consider an example series:

$$[5, 192, 20019, 12, 209]$$








- The average value of the series is $x_{mean} = 4087.4$

- The minimum value in the series is $x_{min} = 5$

- The maximum value in the series is $x_{max} = 20019$

So, after normalisation, the new series would have the following numbers.

$$\left[ \left( \frac{5 - 4087.4}{20019 - 5} \right), \left( \frac{192 - 4087.4}{20019 - 5} \right), \left( \frac{20019 - 4087.4}{20019 - 5} \right), \left( \frac{12 - 4087.4}{20019 - 5} \right), \left( \frac{209 - 4087.4}{20019 - 5} \right) \right]$$

$$\Rightarrow \left[-0.203977, -0.194634, 0.796023, -0.203627, -0.193784 \right]$$

or

$$\left[-\frac{203,977}{1,000,000}, -\frac{194,634}{1,000,000}, \frac{796,023}{1,000,000}, -\frac{203,627}{1,000,000}, -
\frac{193,784}{1,000,000} \right]$$

As you can see, after normalisation all the new values are on the same scale: every value lies between $-1$ and $1$, and the difference between the largest and the smallest value is exactly $1$.

So, now let's create a function which normalises data in a series. It should take a Pandas series as an input and should return a normalised series as an output.





In [None]:
#  Create a function to normalise a Pandas series using the mean normalisation method.
import pandas as pd
def mean_normalise(pd_series):
  pd_series_normalised = (pd_series - pd_series.mean())/ (pd_series.max() - pd_series.min())
  return pd_series_normalised

Now, let's test the `mean_normalise()` function on the $[5, 192, 20019, 12, 209]$ series. If we get the desired output, then it means the function is working correctly.

In [None]:
#  Test the 'mean_normalise()' function on the '[5, 192, 20019, 12, 209]' series.
pd_s = pd.Series([5,192,20019,12,209])
pd_s = mean_normalise(pd_s)
pd_s

Unnamed: 0,0
0,-0.203977
1,-0.194634
2,0.796023
3,-0.203627
4,-0.193784
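
As an additional sanity check, mean normalisation should always produce a series whose mean is 0 and whose range (maximum minus minimum) is 1. The sketch below verifies both properties on the same example series. The function is repeated so that the block runs on its own, and the names `example` and `normalised` are chosen here only for illustration.

```python
import pandas as pd

def mean_normalise(pd_series):
    # Same formula as above: (x - mean) / (max - min).
    return (pd_series - pd_series.mean()) / (pd_series.max() - pd_series.min())

example = pd.Series([5, 192, 20019, 12, 209])
normalised = mean_normalise(example)

# The mean of the normalised values should be zero (up to rounding error).
print(abs(normalised.mean()) < 1e-12)

# The spread between the largest and smallest value should be one
# (again up to floating-point rounding).
print(abs((normalised.max() - normalised.min()) - 1) < 1e-12)
```

If either check prints `False`, the function is not implementing the mean normalisation formula correctly.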


Now, let's apply the `mean_normalise()` function on the `train_df` DataFrame to normalise only the `FLUX` values.

Using the `iloc[]` function, we will first exclude the `LABEL` column from the DataFrame and then will apply the `mean_normalise()` function on the `train_df` DataFrame using the `apply()` function.

**The `apply()` function:**
- It is used to apply a function to each row or column in the DataFrame.

**Syntax:** `dataframe.apply(function_name, axis)`



**Note:** Whenever you apply a function, say `function_name()` on a DataFrame using the `apply()` function, remove the brackets from the name of the function (i.e., `function_name`) to be applied. That's why the syntax is `dataframe.apply(function_name, axis)`

A DataFrame has two axes (axes is plural of axis).

- The first axis is the vertical axis. It is represented as `axis = 0`

- The second axis is the horizontal axis. It is represented as `axis = 1`

The DataFrame axes define whether an operation needs to be applied row-wise or column-wise.

1. If `axis = 0`, then it means the function needs to be applied **vertically**. In other words, the function will be applied on all the rows but only **one column at a time.** So, on the `train_df` DataFrame, if the `mean_normalise()` function is applied **vertically**, then it will be applied in the following order:

    - `train_df.iloc[:, 1]`, i.e., all the rows and the `FLUX.1` column at a time.
    
    - `train_df.iloc[:, 2]`, i.e., all the rows and the `FLUX.2` column at a time.

    - `train_df.iloc[:, 3]`, i.e., all the rows and the `FLUX.3` column at a time.

    ...

    - `train_df.iloc[:, 3197]`, i.e., all the rows and the `FLUX.3197` column at a time.


2. If `axis = 1`, then it means the function needs to be applied **horizontally**. This means the function will be applied on all the columns but only **one row at a time.** So, on the `train_df` DataFrame, if the `mean_normalise()` function is applied **horizontally**, then it will be applied in the following order:

    - `train_df.iloc[0, :]`, i.e., the first row and all the columns at a time.
    
    - `train_df.iloc[1, :]`, i.e., the second row and all the columns at a time.

    - `train_df.iloc[2, :]`, i.e., the third row and all the columns at a time.

      ...

    - `train_df.iloc[5086, :]`, i.e., the last row and all the columns at a time.
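
To see the difference between the two axes on something small, here is a minimal sketch on a made-up DataFrame. The names `toy_df` and `value_range` are hypothetical and are not part of the exoplanets dataset.

```python
import pandas as pd

toy_df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def value_range(series):
    # Difference between the largest and smallest value in the series.
    return series.max() - series.min()

# axis=0: the function receives one column at a time (one result per column).
per_column = toy_df.apply(value_range, axis=0)

# axis=1: the function receives one row at a time (one result per row).
per_row = toy_df.apply(value_range, axis=1)

print(per_column.tolist())  # [2, 20]
print(per_row.tolist())     # [9, 18, 27]
```

With `axis=0` there are two results (one per column), whereas with `axis=1` there are three results (one per row).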



We will apply the `mean_normalise()` function **horizontally** on the `train_df` DataFrame to normalise the `FLUX` values for a star at a time.


In [None]:
# Apply the 'mean_normalise' function horizontally on the training DataFrame.
norm_train_df = train_df.iloc[:,1:].apply(mean_normalise,axis=1)
# After applying the 'mean_normalise' function on the 'train_df' DataFrame, let's print the first 5 rows of the new DataFrame.
norm_train_df.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,0.053834,0.047391,0.00651,-0.023699,-0.031772,-0.08641,-0.093128,-0.068161,-0.05765,-0.109164,...,-0.056482,-0.071934,-0.071934,0.009738,0.024779,0.052993,0.018843,0.033024,-0.003127,-0.031759
1,-0.050411,-0.042317,-0.081922,-0.052351,-0.115212,-0.104794,-0.126816,-0.124861,-0.122681,-0.105708,...,0.006648,-0.039721,-0.039721,-0.027988,0.004116,0.013124,-0.006847,0.02226,0.03755,0.043849
2,0.243983,0.245509,0.235186,0.227365,0.208538,0.212981,0.212283,0.222467,0.199285,0.221536,...,-0.037161,0.002382,0.002382,-0.017715,-0.013523,-0.001456,-0.009299,-0.017259,-0.036384,-0.048782
3,0.518501,0.551177,0.480659,0.474051,0.504754,0.496863,0.511941,0.494687,0.496425,0.513506,...,0.016215,0.001435,0.001435,0.054324,0.038636,-0.012562,-0.006456,-0.019827,-0.019889,0.029163
4,-0.399904,-0.401872,-0.404199,-0.395473,-0.381734,-0.373293,-0.36007,-0.368986,-0.356861,-0.350022,...,-0.212262,-0.141752,-0.141752,-0.125499,-0.157156,-0.155246,-0.141038,-0.135528,-0.145458,-0.18159


You can see that all the data-points are on the same scale after mean normalisation. Notice that, as intended, we didn't normalise the `LABEL` data.
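
Applying a Python function with `apply(..., axis=1)` calls it once per row, which can be slow on a DataFrame with thousands of rows and columns. As an optional alternative, the same row-wise normalisation can be sketched with vectorised Pandas operations. The small random DataFrame `flux` below is a hypothetical stand-in for the `FLUX` columns, used only so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def mean_normalise(pd_series):
    return (pd_series - pd_series.mean()) / (pd_series.max() - pd_series.min())

# Hypothetical stand-in for the FLUX columns: 4 stars, 6 readings each.
rng = np.random.default_rng(0)
flux = pd.DataFrame(rng.normal(scale=1000, size=(4, 6)),
                    columns=[f"FLUX.{i}" for i in range(1, 7)])

# Row-wise statistics: one value per star.
row_mean = flux.mean(axis=1)
row_range = flux.max(axis=1) - flux.min(axis=1)

# 'axis=0' here aligns the per-star values with the row index,
# so each row is shifted and scaled by its own mean and range.
vectorised = flux.sub(row_mean, axis=0).div(row_range, axis=0)

# Compare with the row-by-row 'apply()' approach used in this lesson.
via_apply = flux.apply(mean_normalise, axis=1)
print(np.allclose(vectorised, via_apply))
```

Both approaches compute the same formula; the vectorised one simply lets Pandas do the arithmetic on whole columns at once.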

Now, let's insert the `LABEL` column to the `norm_train_df` DataFrame to get the full DataFrame with the normalised `FLUX` values.

We can obtain the `LABEL` column from the `train_df` DataFrame using `train_df['LABEL']`.

To insert a column in a DataFrame, use the `insert()` function.

**Syntax:** `dataframe.insert(loc=column_index, column=column_name, value=some_pandas_series)`



It takes three inputs.

- The first input should be the desired column index of the new column after its insertion.

- The second input should be the desired column name.

- The third input should be the values of the new column.
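
Here is a minimal sketch of `insert()` on a tiny made-up DataFrame. The names `features` and `labels` and the values are hypothetical.

```python
import pandas as pd

features = pd.DataFrame({"FLUX.1": [0.05, -0.05], "FLUX.2": [0.04, -0.04]})
labels = pd.Series([2, 1], name="LABEL")

# Insert 'LABEL' as the first column (column index 0).
# 'insert()' modifies the DataFrame in place and returns None.
features.insert(loc=0, column="LABEL", value=labels)

print(features.columns.tolist())  # ['LABEL', 'FLUX.1', 'FLUX.2']
```

Because `insert()` works in place, running the same cell twice raises an error saying that the column already exists.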





In [None]:
#Apply the 'insert()' function to add the 'LABEL' column to the 'norm_train_df' DataFrame.
norm_train_df.insert(loc=0,column='LABEL',value=train_df.iloc[:,0])
# After inserting the 'LABEL' column to the 'norm_train_df' DataFrame, print its first five rows.
norm_train_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,0.053834,0.047391,0.00651,-0.023699,-0.031772,-0.08641,-0.093128,-0.068161,-0.05765,...,-0.056482,-0.071934,-0.071934,0.009738,0.024779,0.052993,0.018843,0.033024,-0.003127,-0.031759
1,2,-0.050411,-0.042317,-0.081922,-0.052351,-0.115212,-0.104794,-0.126816,-0.124861,-0.122681,...,0.006648,-0.039721,-0.039721,-0.027988,0.004116,0.013124,-0.006847,0.02226,0.03755,0.043849
2,2,0.243983,0.245509,0.235186,0.227365,0.208538,0.212981,0.212283,0.222467,0.199285,...,-0.037161,0.002382,0.002382,-0.017715,-0.013523,-0.001456,-0.009299,-0.017259,-0.036384,-0.048782
3,2,0.518501,0.551177,0.480659,0.474051,0.504754,0.496863,0.511941,0.494687,0.496425,...,0.016215,0.001435,0.001435,0.054324,0.038636,-0.012562,-0.006456,-0.019827,-0.019889,0.029163
4,2,-0.399904,-0.401872,-0.404199,-0.395473,-0.381734,-0.373293,-0.36007,-0.368986,-0.356861,...,-0.212262,-0.141752,-0.141752,-0.125499,-0.157156,-0.155246,-0.141038,-0.135528,-0.145458,-0.18159


Now, you normalise the `FLUX` values in the `test_df` DataFrame using the `mean_normalise()` function. Make sure that you apply the function horizontally to normalise the `FLUX` values for a star at a time.

In [None]:
#  Apply the 'mean_normalise()' function on the testing DataFrame. Store the new DataFrame in the 'norm_test_df' variable.
norm_test_df = test_df.iloc[:,1:].apply(mean_normalise,axis=1)
# After applying the 'mean_normalise()' function on the 'test_df' DataFrame, print the first 5 rows of the new DataFrame.
norm_test_df.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,0.273347,0.228221,0.196676,0.110003,0.10413,0.08869,0.040926,0.014337,0.013534,-0.052079,...,0.031635,0.042578,0.031451,-0.005393,0.028904,0.102708,0.071576,0.080408,0.616438,0.130742
1,0.394038,0.39148,0.39268,0.390974,0.388955,0.386673,0.38634,0.382364,0.381035,0.374634,...,-0.047311,-0.075404,-0.092643,-0.118456,-0.134109,-0.150638,-0.164944,-0.171944,-0.166961,-0.14879
2,0.64815,0.627582,0.591444,0.519002,0.466046,0.385214,0.340496,0.281192,0.162553,0.11926,...,0.018179,-0.034769,-0.032201,-0.041117,-0.057967,-0.128412,-0.067972,-0.119374,-0.023437,0.027941
3,-0.232813,-0.233212,-0.238944,-0.235869,-0.208281,-0.220224,-0.222214,-0.208586,-0.197319,-0.188186,...,0.056186,0.047254,0.047254,0.039873,0.021893,0.025227,0.025075,-0.017912,-0.059585,-0.04674
4,-0.006994,0.003426,0.006382,0.00761,0.003316,-0.000167,0.010016,-0.009471,0.008195,0.016842,...,-0.006247,-0.016795,-0.001531,0.001095,-0.004439,-0.027127,-0.025421,-0.016852,-0.020089,0.002564


Now, you insert the `LABEL` column to the `norm_test_df` DataFrame with the corresponding `LABEL` values.

In [None]:
#Apply the 'insert()' function to add the 'LABEL' column to the 'norm_test_df' DataFrame.
norm_test_df.insert(loc=0,column="LABEL",value=test_df.iloc[:,0])
# After inserting the 'LABEL' column to the 'norm_test_df' DataFrame, print its first five rows.
norm_test_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,0.273347,0.228221,0.196676,0.110003,0.10413,0.08869,0.040926,0.014337,0.013534,...,0.031635,0.042578,0.031451,-0.005393,0.028904,0.102708,0.071576,0.080408,0.616438,0.130742
1,2,0.394038,0.39148,0.39268,0.390974,0.388955,0.386673,0.38634,0.382364,0.381035,...,-0.047311,-0.075404,-0.092643,-0.118456,-0.134109,-0.150638,-0.164944,-0.171944,-0.166961,-0.14879
2,2,0.64815,0.627582,0.591444,0.519002,0.466046,0.385214,0.340496,0.281192,0.162553,...,0.018179,-0.034769,-0.032201,-0.041117,-0.057967,-0.128412,-0.067972,-0.119374,-0.023437,0.027941
3,2,-0.232813,-0.233212,-0.238944,-0.235869,-0.208281,-0.220224,-0.222214,-0.208586,-0.197319,...,0.056186,0.047254,0.047254,0.039873,0.021893,0.025227,0.025075,-0.017912,-0.059585,-0.04674
4,2,-0.006994,0.003426,0.006382,0.00761,0.003316,-0.000167,0.010016,-0.009471,0.008195,...,-0.006247,-0.016795,-0.001531,0.001095,-0.004439,-0.027127,-0.025421,-0.016852,-0.020089,0.002564


In [None]:
norm_train_df["LABEL"].value_counts()

Unnamed: 0_level_0,count
LABEL,Unnamed: 1_level_1
1,5050
2,37


---

#### Activity 2: Oversampling For Classification Problems - SMOTE

Three common methods to synthesize artificial data points for a classification problem are:

1. Random oversampling

2. SMOTE

3. ADASYN

We will apply the SMOTE method to synthesize artificial data points in the training dataset. The term SMOTE stands for Synthetic Minority Over-sampling Technique. How the SMOTE technique works internally is beyond the scope of this course, but we will learn how to apply it to synthesize (or manufacture) artificial data points for a minority class.

Before applying the SMOTE method, let's understand why oversampling is needed. Generally, in classification problems such as this one, the data is highly imbalanced. We will retrieve the feature and `LABEL` data from the training and test DataFrames just before applying SMOTE.

**Imbalanced Dataset:**

In a highly imbalanced data, the number of data points for one class is very high compared to another class. The class having the most number of data points is called the **majority class** whereas the class having the least number of data points is called the **minority class**.




In the case of the exoplanets dataset, class `1` is the majority class because most of the stars in the dataset don't have a planet.


The `train_df` dataset has a total of `5087` stars, of which only `37` stars have a planet and the remaining `5050` stars don't. The percentage of stars having a planet is $\frac{37 \times 100}{5087} \approx 0.727\%$, which is very low. Hence, the training dataset is highly imbalanced.




The test dataset is also highly imbalanced: out of `570` data points, only `5` stars are labelled as class `2`. So, the percentage of class `2` data points is $\frac{5 \times 100}{570} \approx 0.877\%$, which is also very low.
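
The class percentages can also be computed directly with Pandas. The sketch below builds a hypothetical label series with the same class counts as the training dataset (`5050` and `37`), so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical label series with the same class counts as the training dataset.
labels = pd.Series([1] * 5050 + [2] * 37, name="LABEL")

# normalize=True returns fractions instead of counts;
# multiplying by 100 converts them to percentages.
class_percent = labels.value_counts(normalize=True) * 100

print(class_percent.round(3))
```

On the real data, the same result should come from `train_df['LABEL'].value_counts(normalize=True) * 100`.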

**Oversampling:**

- The major problem with imbalanced data is that a prediction model tends to be biased in favour of the majority class when making predictions. Recall that when we deployed the Random Forest Classifier model, it labelled every star in the test dataset as `1` even though the test dataset contains `5` stars belonging to class `2`.

- An oversampling technique synthesizes the artificial data points for the minority class data to balance a highly imbalanced dataset.
- An oversampling technique is required to remove the bias in favour of the majority class in a dataset.


Hence, using an oversampling technique, we can artificially synthesize the minority class data in a training dataset so that both the classes have equal representation in the dataset.


**Note:** The oversampling technique is applied only to the training dataset. It is never applied to the test dataset.
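
To make the idea concrete, here is a minimal sketch of *random oversampling*, the simplest of the three methods listed earlier. Note that this is not SMOTE: it only duplicates existing minority rows, whereas SMOTE synthesizes new ones. The tiny DataFrame `toy_train` and all the values in it are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy training set: 8 majority rows (class 1), 2 minority rows (class 2).
toy_train = pd.DataFrame({
    "FLUX.1": np.arange(10, dtype=float),
    "LABEL": [1] * 8 + [2] * 2,
})

counts = toy_train["LABEL"].value_counts()
minority_label = counts.idxmin()
n_extra = int(counts.max() - counts.min())

# Draw minority rows with replacement until both classes have the same size.
minority_rows = toy_train[toy_train["LABEL"] == minority_label]
extra_rows = minority_rows.sample(n=n_extra, replace=True, random_state=42)

balanced = pd.concat([toy_train, extra_rows], ignore_index=True)
print(balanced["LABEL"].value_counts().to_dict())
```

After resampling, both classes have 8 rows each. As stated in the note above, this would be done on the training data only.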



In [None]:
#  Get the 'x_n_train' and 'x_n_test' feature DataFrames from the 'norm_train_df' and 'norm_test_df' DataFrames respectively.
x_n_train= norm_train_df.iloc[:,1:]
x_n_test = norm_test_df.iloc[:,1:]

In [None]:
#  Get the 'y_train' and 'y_test' series from the 'norm_train_df' and 'norm_test_df' DataFrames respectively.
y_train= norm_train_df["LABEL"]
y_test = norm_test_df["LABEL"]

To apply the `SMOTE` method, we have to follow these steps:

1. From the `imblearn.over_sampling` module, import the `SMOTE` class.

2. Then, create a `SMOTE` object with `sampling_strategy=1` as an input (older versions of the library called this parameter `ratio`). The `sampling_strategy=1` denotes that after resampling the dataset, the data points for both the majority and minority class should be in equal numbers. In this case, class `1` has `5050` data points, so class `2` should also have `5050` data points.

3. Apply the `fit_resample()` function of the `SMOTE` object to synthesize data for the minority class (older versions of the library called this function `fit_sample()`).

**Note:** When the inputs are Pandas objects, as they are here, `fit_resample()` returns a Pandas DataFrame for the feature variables and a Pandas series for the target variable, as the `y_train_res` output below shows. With NumPy arrays as inputs, or with older versions of the library, it may return NumPy arrays instead, in which case only NumPy functions can be applied on them.

In [None]:
# Import the SMOTE class from the imblearn.over_sampling module
from imblearn.over_sampling import SMOTE

# Initialise SMOTE with sampling_strategy=1 (to balance the classes 1:1)
smote = SMOTE(sampling_strategy=1)

# Apply SMOTE to the training data
# The term 'res' stands for 'resampled'
x_train_res, y_train_res = smote.fit_resample(x_n_train, y_train)

In [None]:
# Display the resampled target values.
y_train_res

Unnamed: 0,LABEL
0,2
1,2
2,2
3,2
4,2
...,...
10095,2
10096,2
10097,2
10098,2


**SMOTE:** It's a technique used to generate synthetic samples for the minority class to balance the class distribution in your dataset.

1. `SMOTE(sampling_strategy=1)`: This parameter balances the minority and majority classes in a 1:1 ratio.
2. `fit_resample()`: This method is used to apply SMOTE to your training data, resulting in balanced `x_train_res` and `y_train_res` datasets by generating the artificial values for both the feature and target values



Also, we now have `10100` data points for the training dataset containing `5050` class `1` values and `5050` class `2` values.



In [None]:
type(y_train_res)

In [None]:
pd.Series(y_train_res).value_counts()

Unnamed: 0_level_0,count
LABEL,Unnamed: 1_level_1
2,5050
1,5050


---

As you can see, both the classes, i.e., `1` and `2`, appear an equal number of times in `y_train_res`.

Now, let's deploy the Random Forest Classifier prediction model again to see if the prediction model is able to identify the stars having a planet in the test dataset.

---

#### Activity 3: Importing The Required Libraries

Now, import the `RandomForestClassifier` class from the `sklearn.ensemble` module. Also, import the `confusion_matrix` and `classification_report` functions from the `sklearn.metrics` module.

In [None]:
#Import the required modules from the 'sklearn.ensemble' and 'sklearn.metrics' libraries.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

---

#### Activity 4: Applying The RandomForestClassifier Model

Now that we have processed the data to make our prediction model a little more robust, let's once again deploy the Random Forest Classifier model to see if it is able to detect the stars having a planet.

In [None]:
#  Deploy the random Forest Classifier prediction model.
rfc =RandomForestClassifier( n_jobs = -1, n_estimators= 50)

rfc.fit(x_train_res,y_train_res)

rfc.score(x_train_res,y_train_res)

1.0

In [None]:
y_pred = rfc.predict(x_n_test)
y_pred

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

Let's quickly make the confusion matrix and classification report to test the efficacy of the model.

---

#### Activity 5: The Confusion Matrix & Classification Report

Now create the confusion matrix and classification report for the model deployed to see if the model is able to detect the stars having a planet.

In [None]:
# Create the confusion matrix using the 'y_test' and 'y_pred' values as inputs.
cm=confusion_matrix(y_test,y_pred)
cm

array([[565,   0],
       [  5,   0]])

As you can see, the value in the second row and the second column is `0` which means the Random Forest Classifier model has failed to detect class `2` values. Thus, it failed to detect the stars having a planet.

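Hence, the precision for class `2` is undefined (no star was predicted as class `2`, so it is a `0/0` division) and the recall for class `2` is `0`. The classification report shows such undefined values as `0.00` together with a warning. Let's verify it by printing the classification report.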

In [None]:
# Print the classification report using the 'y_test' and 'y_pred' values as inputs.
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.99      1.00      1.00       565
           2       0.00      0.00      0.00         5

    accuracy                           0.99       570
   macro avg       0.50      0.50      0.50       570
weighted avg       0.98      0.99      0.99       570





So, even after processing the data with normalisation and oversampling, the Random Forest Classifier prediction model has failed to detect the stars having a planet. Note that it scored `1.0` on the resampled training data, so it fits the training set perfectly but does not carry that over to the test set. One of the possible reasons could be its inability to form decision trees that separate the two classes on unseen data (recall that a random forest is a collection of decision trees). This suggests that maybe we have to further process the data or we might have to apply a different prediction model.
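
The numbers in the classification report can be reproduced by hand from the confusion matrix, which also shows why an accuracy of `0.99` is misleading here. The following is a minimal sketch that takes class `2` as the positive class; the matrix values are copied from the output above.

```python
import numpy as np

# Confusion matrix from above: rows are actual classes, columns are predicted classes.
cm = np.array([[565, 0],
               [5, 0]])

# With class 2 as the positive class, the cells read TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)

# Precision is tp / (tp + fp). With no positive predictions this is 0 / 0,
# so it is marked as "not a number" instead of dividing.
precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")

print(round(accuracy, 4), recall, precision)
```

The accuracy is high only because `565` of the `570` test stars belong to class `1`; a model that always predicts `1` gets exactly this score while finding none of the planets.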

Let's deploy the **XGBoost Classifier** model to see if it can detect the stars having a planet. If it successfully detects the class `2` values, then it means the XGBoost Classifier model is a more appropriate model here to make prediction compared to the Random Forest Classifier model. If not, then we will have to further process the data and then deploy the classification models again.

---

#### Activity 6: The XGBoost Classifier Model

**How to deploy the XGBoost Classifier model?**

1. Import the `xgboost` library with `xg` as an alias.
2. Use the `XGBClassifier()` function of the `xgboost` library to initiate the model.
3. Call the `fit()` function with `x_train_res` and `y_train_res` as input to deploy the model.
4.  Call the `predict()` function with `x_n_test` data  as an input to get the predicted values.




**NOTE:** The XGBoost Classifier is a computationally heavy model. On a dataset with this many feature columns, it can need a lot of RAM and CPU time (a GPU is optional). It will take some time to learn the feature variables through the training data and then make predictions on the test data. Hence, use it ONLY if all the other lightweight (requiring less RAM and CPU) prediction models fail.

In [None]:
#  Deploy the XGBoost Classifier model to detect the stars having a planet.
import xgboost as xg

# Call the 'XGBClassifier()' function and store it in the 'model' variable.
model = xg.XGBClassifier()

# Call the 'fit()' function with 'x_train_res' and 'y_train_res' as inputs.
# Note: with the labels 1 and 2, this call raises the ValueError shown below.
model.fit(x_train_res, y_train_res)

# Make predictions on test data
y2_pred = model.predict(x_n_test)

# Output the predictions
y2_pred


ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1], got [1 2]

**Note:** The error above occurs because `XGBClassifier` expects the class labels to start from `0`. Our labels are `1` and `2`, so we subtract `1` from them to get `0` (star not having a planet) and `1` (star having a planet).

In [None]:
# Demonstrate shifting the labels 1 and 2 down to 0 and 1 with NumPy.
import numpy as np
a = np.array([1,2]) - 1
a

In [None]:
# Adjust the target labels to start from 0 instead of 1
y_train_res = y_train_res - 1

# Now, call the 'XGBClassifier()' function and store it in the 'model' variable.
model = xg.XGBClassifier()

# Fit the model using the adjusted 'y_train_res'
model.fit(x_train_res, y_train_res)

# Make predictions on test data
y2_pred = model.predict(x_n_test)

# Output the predictions
y2_pred
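
Since the model is now trained on the labels `0` and `1`, its predictions are also `0` and `1`. One way to report them in the original `1`/`2` encoding is to add `1` back, as the minimal sketch below shows. All the arrays here are made up for illustration; in this lesson, the equivalent step is subtracting `1` from `y_test` when evaluating.

```python
import numpy as np

# Hypothetical labels in the original encoding: 1 = no planet, 2 = planet.
original_labels = np.array([1, 2, 2, 1, 1])

# Shift to the 0-based encoding (0 = no planet, 1 = planet).
zero_based = original_labels - 1

# Suppose a model trained on 0-based labels returns these predictions (made-up values).
zero_based_pred = np.array([0, 1, 0, 0, 0])

# Add 1 to report the predictions in the original 1/2 encoding.
pred_in_original_encoding = zero_based_pred + 1

print(zero_based.tolist())                 # [0, 1, 1, 0, 0]
print(pred_in_original_encoding.tolist())  # [1, 2, 1, 1, 1]
```

Whichever direction is used, the actual and predicted values must be in the same encoding before they are compared.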

Now that we have got the predicted values, let's create a confusion matrix to check if the XGBoost Classifier model has detected any class `2` values in the test dataset.

In [None]:
# Create the confusion matrix using the 'y_test' and 'y2_pred' values as inputs.
c = confusion_matrix((y_test-1),y2_pred)
c

array([[565,   0],
       [  5,   0]])

In [None]:
#Print the classification report using the 'y_test' and 'y2_pred' values as inputs.
print(classification_report((y_test-1),y2_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       565
           1       0.00      0.00      0.00         5

    accuracy                           0.99       570
   macro avg       0.50      0.50      0.50       570
weighted avg       0.98      0.99      0.99       570






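The XGBoost Classifier model has also failed to detect any star having a planet (class `2`, encoded here as `1`), so it is not a better classification model for this data as it is currently processed.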



#### Activity 7: Logistic Regression

Logistic Regression is a type of **classification** algorithm which classifies or categorises a given set of data into different class labels.

Logistic Regression is used to predict the probability of an outcome for an event. It compares the predicted probability with a threshold probability value (`0.5` by default in scikit-learn). If the predicted probability of class `2` is less than the threshold probability, then logistic regression classifies that outcome as `1`, otherwise as `2`. You will learn the technical details in the subsequent classes, but for the time being, let's build a Logistic Regression model on the train set by following the steps listed below:

1. Import `LogisticRegression` class from the `sklearn.linear_model` module.
2. Create an object of the `LogisticRegression` class, say `log_reg` and pass `n_jobs = -1` as input to its constructor.
3. Call the `fit()` function of the `LogisticRegression` class on the object created and pass `x_train_res` and `y_train_res` as inputs to the function.

**Logistic Regression:** Although it has "regression" in its name, logistic regression is actually used for classification. It predicts the probability of a certain class (e.g., spam or not spam). If the probability is above a certain threshold (like 0.5), the model classifies the data into one class (e.g., spam), otherwise into another (e.g., not spam).
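
As a small preview of the thresholding idea, the sketch below converts some made-up scores into probabilities with the sigmoid function and then into class labels. The function name `sigmoid`, the `scores` values and the use of classes `1` and `2` are illustrative assumptions; this is not how scikit-learn is implemented internally.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a value between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores for four samples (made-up values).
scores = np.array([-3.0, -0.5, 0.4, 2.0])
probabilities = sigmoid(scores)

# Probabilities above the threshold go to class 2, the rest go to class 1.
threshold = 0.5
predicted_class = np.where(probabilities > threshold, 2, 1)

print(predicted_class.tolist())  # [1, 1, 2, 2]
```

Negative scores give probabilities below `0.5` and positive scores give probabilities above `0.5`, which is why the first two samples fall into class `1` and the last two into class `2`.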

In [None]:
# Re-run SMOTE so that 'y_train_res' again contains the original labels 1 and 2
# (it was shifted to 0 and 1 for the XGBoost Classifier model).
from imblearn.over_sampling import SMOTE

# Initialise SMOTE with sampling_strategy=1 (to balance the classes 1:1)
smote = SMOTE(sampling_strategy=1)

# Apply SMOTE to the training data
x_train_res, y_train_res = smote.fit_resample(x_n_train, y_train)

# The term 'res' stands for 'resampled'

In [None]:
# Deploy the 'LogisticRegression' model using the 'fit()' function.
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(n_jobs=-1)
log_reg.fit(x_train_res,y_train_res)
log_reg.score(x_train_res,y_train_res)

0.9993069306930693

In [None]:
#  Make predictions on the test dataset by using the 'predict()' function.
log_y_pred = log_reg.predict(x_n_test)
log_y_pred

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

Let's compute the confusion matrix to calculate recall, precision and f1-scores
to evaluate the logistic regression model.

In [None]:
# Display the confusion_matrix.
confusion_matrix(y_test,log_y_pred)

array([[565,   0],
       [  5,   0]])


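The cells of the confusion matrix are laid out as follows, taking class `2` (star having a planet) as the positive class:

| | Predicted `1` | Predicted `2` |
|-|-|-|
| Actual `1` | TN | FP |
| Actual `2` | FN | TP |

Here, TN = `565`, FP = `0`, FN = `5` and TP = `0`.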
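
A labelled version of this table can also be built with Pandas, which makes it harder to mix up rows and columns. The sketch below uses two hypothetical series, `actual` and `predicted`, that mirror the counts in the confusion matrix above.

```python
import pandas as pd

# Hypothetical series mirroring the counts in the confusion matrix above.
actual = pd.Series([1] * 565 + [2] * 5, name="Actual")
predicted = pd.Series([1] * 570, name="Predicted")

# 'crosstab()' only creates columns for classes that were actually predicted,
# so reindex to make sure both classes appear as rows and as columns.
labelled_cm = (
    pd.crosstab(actual, predicted)
    .reindex(index=[1, 2], columns=[1, 2], fill_value=0)
)
print(labelled_cm)

# Unpack the four cells, taking class 2 as the positive class.
tn, fp, fn, tp = labelled_cm.to_numpy().ravel()
print(tn, fp, fn, tp)  # 565 0 5 0
```

Without the `reindex()` step, the column for class `2` would be missing entirely, because the model never predicted that class.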


In [None]:
#  Display recall, precision and f1-score values.
print(classification_report(y_test,log_y_pred))

              precision    recall  f1-score   support

           1       0.99      1.00      1.00       565
           2       0.00      0.00      0.00         5

    accuracy                           0.99       570
   macro avg       0.50      0.50      0.50       570
weighted avg       0.98      0.99      0.99       570







You will soon get to learn how these models work behind the scenes, and then you will develop a sense of which classification model to use for different kinds of problem statements.



Let's stop here. In the next class, we will understand the working of the Logistic Regression algorithm using the sigmoid function.