
# Lesson 47: Logistic Regression - Heart Disease Prediction



In the previous few classes, you learnt how a logistic regression model classifies labels behind the scenes.

In this class, we will continue to build a multivariate logistic regression model to predict whether a patient has heart disease.

---

#### Recap

Run the code below

In [None]:
# Import the required modules and load the heart disease dataset. Also, display the first five rows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

csv_file = '/content/heart.csv'
df = pd.read_csv(csv_file)
df.info()  # info() prints its summary directly and returns None, so call it on its own
print("\n", df.head(), "\n")

# Print the number of records with and without heart disease
print("Number of records in each label are")
print(df['target'].value_counts())

# Print the percentage of each label
print("\nPercentage of records in each label are")
print(df['target'].value_counts() * 100 / df.shape[0])

# Split the training and testing data
X = df.drop(columns = 'target')
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

    age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187 

---

#### Activity 1: Multivariate Logistic Regression

Let's include all the features present in the heart disease dataset to build a multivariate logistic regression model using the `sklearn` module.

In [None]:
#  Create a multivariate logistic regression model. Also, predict the target values for the train set.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

lg_clf_1 = LogisticRegression()
lg_clf_1.fit(X_train, y_train)
lg_clf_1.score(X_train, y_train)

# Predict the target values for the train set.
y_train_pred = lg_clf_1.predict(X_train)

print(f"{'Train Set'.upper()}\n{'-' * 75}\nConfusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))

print("\nClassification Report:")
print(classification_report(y_train, y_train_pred))

In [None]:
help(lg_clf_1)

In [None]:
#Predict the target values for the test set.
y_test_pred = lg_clf_1.predict(X_test)

print(f"{'Test Set'.upper()}\n{'-' * 75}\nConfusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))

print("\nClassification Report")
print(classification_report(y_test, y_test_pred))

As you can see,
- The FP and FN values in the confusion matrix are low
- The precision and recall values are also good
- The f1-score is also greater than **0.7**
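
To make the FP and FN counts concrete, here is a minimal sketch on made-up labels (not the heart disease data). It assumes the usual `sklearn` layout of a binary confusion matrix and recomputes precision, recall and f1-score by hand for class `1`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (not taken from heart.csv).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels 0 and 1, sklearn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Metrics for the positive class (label 1), computed from the four counts.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

With these toy labels there is one false positive and one false negative, so precision, recall and f1-score all come out to 0.75.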



But this logistic regression model (the object stored in the `lg_clf_1` variable) is built using all the features (or independent variables). It is quite possible that not all features are important for classifying the labels in the `target` column. Therefore, we can still try to improve the model by reducing the number of features to obtain higher f1-scores.

---

#### Activity 2: Data Standardisation

As you may have noticed, fitting the logistic regression model produced the following warning message several times:
```
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
```


One of the reasons for this warning message is poorly scaled data. Here are a couple of ways to avoid the `ConvergenceWarning` message:

1. Increase the number of iterations, i.e. set the `max_iter` parameter in the `LogisticRegression` constructor to a value larger than its default of 100, e.g. `max_iter = 1000`.

2. Scale the data using one of the normalisation methods, say standard normalisation.

Therefore, let's create a function `standard_scaler()` to normalise the `X_train` and `X_test` data-frames using the standard normalisation method i.e.

$$x_{\text{std}} = \frac{(x_i - \mu)}{\sigma} $$



In [None]:
#Normalise the train and test data-frames using the standard normalisation method.
def standard_scaler(series):
  new_series = (series - series.mean()) / series.std()
  return new_series

norm_X_train = X_train.apply(standard_scaler, axis = 0)
norm_X_test = X_test.apply(standard_scaler, axis = 0)

norm_X_train.describe()

In [None]:
# Display descriptive statistics for the normalised values of the features for the test data-frames.
norm_X_test.describe()

As we can observe in the output, the data is normalised because the mean and standard deviation values for each column are 0 and 1 respectively.
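
The hand-written `standard_scaler()` above standardises the test set with the test set's own mean and standard deviation. A common alternative, sketched here as an assumption rather than as the notebook's method, is to learn the statistics on the train set only and reuse them for the test set with `StandardScaler` from `sklearn.preprocessing`. The toy data-frames `toy_train` and `toy_test` are hypothetical stand-ins for `X_train` and `X_test`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; in the notebook these would be X_train and X_test.
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({"age": rng.integers(30, 80, 50), "chol": rng.integers(150, 350, 50)})
toy_test = pd.DataFrame({"age": rng.integers(30, 80, 20), "chol": rng.integers(150, 350, 20)})

scaler = StandardScaler()
# Learn the mean and standard deviation from the train set only ...
scaled_train = pd.DataFrame(scaler.fit_transform(toy_train), columns=toy_train.columns)
# ... and reuse those same statistics to transform the test set.
scaled_test = pd.DataFrame(scaler.transform(toy_test), columns=toy_test.columns)

# StandardScaler divides by the population standard deviation (ddof=0),
# whereas pandas' Series.std() defaults to the sample version (ddof=1).
print(scaled_train.mean().round(3))
print(scaled_train.std(ddof=0).round(3))
```

With this approach the scaled test columns will generally not have a mean of exactly 0 or a standard deviation of exactly 1, because they are scaled with the train-set statistics.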

---

#### Activity 3: Feature Selection Using RFE

Our next task is to select, from all the features, the relevant ones that contribute to a person having heart disease. Irrelevant features do not help increase the accuracy of a prediction model, and they also increase its training time. You want neither too few nor too many features in your prediction model.

So, the question is **how to select features?**

One simple way is trial and error. You can pick **any one feature** at a time, build a prediction model and evaluate it.

Similarly, you pick **any two features** at a time, build a prediction model and evaluate it. For example
- 1, 2
- 1, 3
- 1, 4
etc.

Similarly, you pick **any three features** at a time, build a prediction model and evaluate it. For example
- 1, 2, 3
- 1, 2, 4
- 2, 3, 4
etc.

And so on. However, all this is a very time-consuming process to do manually. Instead, you can use the `RFE` (Recursive Feature Elimination) class of the `sklearn.feature_selection` module. It is a backward feature selection technique and is based on **feature importance**.
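
To see why the manual trial-and-error search is impractical, the following sketch counts how many distinct feature combinations exist for the 13 feature columns of the heart disease dataset (14 columns minus `target`). It only counts the subsets; it does not train any model.

```python
from math import comb

n_features = 13  # feature columns in the heart disease dataset (14 columns minus 'target')

# Number of distinct feature subsets of each size k.
subsets_per_size = {k: comb(n_features, k) for k in range(1, n_features + 1)}
total_subsets = sum(subsets_per_size.values())

print(subsets_per_size[1], subsets_per_size[2], subsets_per_size[3])  # 13 78 286
print(total_subsets)  # 8191, i.e. 2**13 - 1
```

Evaluating all 8191 combinations would mean training 8191 models, whereas the RFE-based loop used in this activity trains only a few models per candidate feature count.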

So let's try to find the optimal number of features required using RFE to build a logistic regression model to predict whether a person has heart disease. Here is the list of steps below that we will follow for this purpose:

1. Import the following modules
```
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
```

2. Create an empty dictionary and store it in a variable called `dict_rfe`.

3. Create a `for` loop that iterates through all the columns in normalised training data-frame. Inside the loop:
   
   - Create an object of `LogisticRegression` class and store it in a variable called `lg_clf_2`.
   
   - Create an object of `RFE` class and store it in a variable called `rfe`. Inside the `RFE()` constructor, pass the object of logistic regression and the number of features to be selected by RFE as inputs.
   
   - Call the `fit()` function of the `RFE` class to train a logistic regression model on the train set with `i` number of features where `i` goes from `1` to `len(X_train.columns)`.
   
   - The `support_` attribute is a boolean mask that marks the selected feature(s) with `True`, whereas the `ranking_` attribute holds the rank of every feature, where rank `1` denotes a selected (most important) feature.
   
   - Create a list to store the important features in a variable called `rfe_features`.
   
   - Create a new data-frame having the features selected by RFE and store it in a variable called `rfe_X_train`.
   
   - Create another `LogisticRegression` object, store it in a variable called `lg_clf_3` and build a logistic regression model using the `rfe_X_train` data-frame and `y_train` series.
   
   - Predict the target values for the normalised test set (containing the feature(s) selected by RFE) by calling the `predict()` function on `lg_clf_3` object.
   
   - Calculate f1-scores using the `f1_score()` function of the `sklearn.metrics` module, which returns a NumPy array containing f1-scores for both the classes. Store the array in a variable called `f1_scores_array`. The **syntax** for the `f1_score()` function is `f1_score(y_true, y_pred, average = None)`
     where `y_true` and `y_pred` are the actual and predicted labels respectively, and `average = None` parameter returns the scores for each class.

   - Add the number of selected features and corresponding features & f1-scores as key-value pairs in the `dict_rfe` dictionary.
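
As a small illustration of the `average = None` behaviour described in the steps above, here is a sketch on made-up labels (not the heart disease data). It shows that `f1_score()` returns one score per class, ordered by label.

```python
from sklearn.metrics import f1_score

# Made-up labels purely for illustration.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# average=None returns an array with one f1-score per class (class 0 first, then class 1).
per_class_f1 = f1_score(y_true, y_pred, average=None)
print(per_class_f1)
```

For these toy labels, class `0` has precision and recall of 1/2 each and class `1` has 2/3 each, so the array holds 0.5 and roughly 0.667.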

In [None]:
help(RFE)

In [None]:
print(norm_X_train.columns)

In [None]:

from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe_model = RFE(model,n_features_to_select=3)
rfe_model.fit(norm_X_train,y_train)
print(rfe_model.support_)
print(rfe_model.ranking_)
print(norm_X_train.columns)
features = list(norm_X_train.columns[rfe_model.support_])
print(features)

In [None]:
# Create a dictionary containing the different combination of features selected by RFE and their corresponding f1-scores.
# Import the libraries
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression

# Create the empty dictionary.
dict_rfe = {}

# Create a loop
for i in range(1, len(X_train.columns) + 1):
  lg_clf_2 = LogisticRegression()
  rfe = RFE(lg_clf_2, n_features_to_select=i) # 'i' is the number of features to be selected by RFE to fit a logistic regression model on norm_X_train and y_train.
  rfe.fit(norm_X_train, y_train)

  rfe_features = list(norm_X_train.columns[rfe.support_]) # A list of important features chosen by RFE.
  rfe_X_train = norm_X_train[rfe_features]

  # Build a logistic regression model using the features selected by RFE.
  lg_clf_3 = LogisticRegression()
  lg_clf_3.fit(rfe_X_train, y_train)

  # Predicting 'y' values only for the test set as generally, they are predicted quite accurately for the train set.
  y_test_pred = lg_clf_3.predict(norm_X_test[rfe_features])

  f1_scores_array = f1_score(y_test, y_test_pred, average = None)
  dict_rfe[i] = {"features": list(rfe_features), "f1_score": f1_scores_array} # 'i' is the number of features to be selected by RFE.

In the above code:

1. ```
   lg_clf_2 = LogisticRegression()
   rfe = RFE(lg_clf_2, n_features_to_select=i)
   rfe.fit(norm_X_train, y_train)
   ```
   part gets the most important features using RFE.

2. ```
   rfe_features = list(norm_X_train.columns[rfe.support_])
   rfe_X_train = norm_X_train[rfe_features]
   ```
   part creates a new data-frame containing the values of the most important feature(s) selected by RFE.

3. ```
   lg_clf_3 = LogisticRegression()
   lg_clf_3.fit(rfe_X_train, y_train)
   ```
   part builds a logistic regression model using the most important feature(s) selected by RFE.

4. ```
   y_test_pred = lg_clf_3.predict(norm_X_test[rfe_features])
   ```
   part predicts the target values on the test set only as generally a machine learning model performs well on the training set.

5. ```
   f1_scores_array = f1_score(y_test, y_test_pred, average = None)
   ```
   part calculates f1-scores

6. ```
   dict_rfe[i] = {"features": list(rfe_features), "f1_score": f1_scores_array}
   ```
   part adds the number of features, features and their corresponding f1-scores as key-value pairs to the dictionary stored in the `dict_rfe` variable.

Let's print the dictionary created.

In [None]:
# Print the dictionary created in the previous exercise.
dict_rfe

Let's convert the `dict_rfe` dictionary to a Pandas DataFrame using the `from_dict()` function of the `pandas` module. Pass the `orient = 'index'` parameter to the function to orient the DataFrame index-wise. Otherwise, the keys of the dictionary (i.e. 1 through 13) will become columns.
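
To see what the `orient` parameter changes, here is a small sketch with a hypothetical two-entry dictionary shaped like `dict_rfe`. The feature names and scores in `toy_dict` are placeholders, not results from the notebook.

```python
import pandas as pd

# Hypothetical two-entry dictionary shaped like dict_rfe (all values are placeholders).
toy_dict = {
    1: {"features": ["feat_a"], "f1_score": [0.5, 0.5]},
    2: {"features": ["feat_a", "feat_b"], "f1_score": [0.5, 0.5]},
}

# orient="index": the outer keys (1, 2) become row labels.
by_index = pd.DataFrame.from_dict(toy_dict, orient="index")
# orient="columns" (the default): the outer keys (1, 2) become column labels.
by_columns = pd.DataFrame.from_dict(toy_dict, orient="columns")

print(list(by_index.index), list(by_index.columns))
print(list(by_columns.index), list(by_columns.columns))
```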

Moreover, we need columns having larger width in the data-frame as the columns will contain lists and arrays as their values. To do this you can use the `max_colwidth` attribute.

**Syntax:** `pd.options.display.max_colwidth = W`

where `W` is the required column width.

Let's set the column widths to 100.


In [None]:
# Convert the dictionary to the dataframe
pd.options.display.max_colwidth = 100
f1_df = pd.DataFrame.from_dict(dict_rfe, orient = 'index')
f1_df

From the above data-frame, we can see that we already get high f1-scores for both the classes with 3 features, which are `cp`, `oldpeak` and `ca`. Beyond this point, the number of features increases but the f1-scores increase only marginally. Hence, it is best to use this many features to build a model that predicts whether a patient has heart disease.
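
The "marginal gain" reasoning can also be expressed in code. The sketch below is one possible rule, not the notebook's method: it picks the smallest number of features whose mean f1-score is within an assumed `tolerance` of the best mean f1-score. The dictionary `toy_rfe` contains hypothetical scores and feature names, not the actual results.

```python
import numpy as np

# Hypothetical scores shaped like dict_rfe; NOT the notebook's actual results.
toy_rfe = {
    1: {"features": ["a"], "f1_score": np.array([0.60, 0.70])},
    2: {"features": ["a", "b"], "f1_score": np.array([0.70, 0.78])},
    3: {"features": ["a", "b", "c"], "f1_score": np.array([0.80, 0.84])},
    4: {"features": ["a", "b", "c", "d"], "f1_score": np.array([0.80, 0.85])},
}

tolerance = 0.01  # assumed threshold for "only marginally better"

# Mean of the per-class f1-scores for each candidate number of features.
mean_f1 = {k: float(v["f1_score"].mean()) for k, v in toy_rfe.items()}
best = max(mean_f1.values())

# Smallest feature count whose mean f1-score is within `tolerance` of the best one.
chosen_k = min(k for k, score in mean_f1.items() if best - score <= tolerance)
print(chosen_k, toy_rfe[chosen_k]["features"])
```

In this toy example the fourth feature improves the mean f1-score by only 0.005, so the rule settles on 3 features.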

Let's now rebuild a logistic regression model with the ideal number of features to predict whether a person has a heart disease.

In [None]:
#  Logistic Regression with the ideal number of features.
lg_clf_4 = LogisticRegression()
rfe = RFE(lg_clf_4, n_features_to_select=3)

rfe.fit(norm_X_train, y_train)

rfe_features = norm_X_train.columns[rfe.support_]
print(rfe_features)
final_X_train = norm_X_train[rfe_features]

lg_clf_4 = LogisticRegression()
lg_clf_4.fit(final_X_train, y_train)

y_test_predict = lg_clf_4.predict(norm_X_test[rfe_features])
final_f1_scores_array = f1_score(y_test, y_test_predict, average = None)
print(final_f1_scores_array)

----

### Logistic Regression - Multiclass Classification I

So far you have learnt to build a logistic regression model for only two labels. In some cases, however, you have to classify more than two labels; this is called multiclass classification. To practise it, we are going to solve another problem statement in which we have to classify different types of glass based on their chemical and physical composition. Let's call this project glass-type classification.

Also, in this class we will learn to create graphs with Plotly.

**Dataset Description:**

The dataset used in this problem statement involves the classification of samples of different glasses based on their physical and chemical properties. They are as follows:

1. **RI:** Refractive Index

2. **Na:** Sodium

3. **Mg:** Magnesium

4. **Al:** Aluminum

5. **Si:** Silicon

6. **K:** Potassium

7. **Ca:** Calcium

8. **Ba:** Barium

9. **Fe:** Iron

The chemical compositions are measured as the weight per cent in their corresponding oxides such as $\text{Na}_2\text{O}$, $\text{Al}_2\text{O}_3$, $\text{Si}\text{O}_2$ etc.

There are seven types (classes or labels) of glass listed; they are:

* **Class 1:** used for making building windows (float processed)

* **Class 2:** used for making building windows (non-float processed)

* **Class 3:** used for making vehicle windows (float processed)

* **Class 4:** used for making vehicle windows (non-float processed)

* **Class 5:** used for making containers

* **Class 6:** used for making tableware

* **Class 7:** used for making headlamps

Float glass is named after the process used to make it: the molten glass is introduced into a bath of molten tin, on which it floats freely. These glasses are used to absorb heat and UV rays.

**Dataset Credits:** https://archive.ics.uci.edu/ml/datasets/Glass+Identification


**Citation:** Dua, D., & Graff, C. (2017). UCI Machine Learning Repository

---

#### Activity 1: Data Loading

So let's go through the routine steps before we build a logistic regression model and explore the dataset.



In [None]:
# Load the dataset.
# Import the necessary libraries.
import numpy as np
import pandas as pd

# Load the dataset.
file_path = '/content/glass-types.csv'
df = pd.read_csv(file_path)
df.head()

As you can see from the output, the data columns have strange headers (or titles). Let's load the dataset again without the column headers. For this, you can pass a parameter called `header` inside the `read_csv()` function of the `pandas` module and set its value equal to `None`.

**Syntax:** `pd.read_csv(file_path, header = None)`
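
The effect of `header = None` can be checked on a tiny in-memory CSV. The sketch below uses a hypothetical two-row file without a header line; the values in `raw_csv` are placeholders.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with no header line (values are placeholders).
raw_csv = "1,1.52,13.6\n2,1.51,13.9\n"

# Default behaviour: the first data row is consumed as the column names.
with_default = pd.read_csv(io.StringIO(raw_csv))
# header=None: every row is kept as data and the columns are labelled 0, 1, 2.
without_header = pd.read_csv(io.StringIO(raw_csv), header=None)

print(with_default.shape)            # (1, 3)
print(without_header.shape)          # (2, 3)
print(list(without_header.columns))  # [0, 1, 2]
```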

In [None]:
# Load the dataset again without the column headers.
df = pd.read_csv(file_path, header = None)
df.head()

It seems like the first column might contain the serial numbers for the samples of glasses collected. Let's display the last 10 rows of the first column (indicated by 0) of the dataset.

In [None]:
#  Display the last 10 rows of the first column (indicated by 0) of the dataset.
df[0].tail(10)

So our suspicion was correct. Let's drop this column because we don't need it to build a logistic regression model later.

In [None]:
# Drop the 0th column as it contains only the serial numbers.
df.drop(columns = 0, inplace = True)

# Get an array of the new set of columns.
df.columns

---

#### Activity 2: Renaming Column Headers

Now let's provide the suitable column headers to the dataset so that we know the values of each independent variable for each glass sample. For this, we need to

- Create a Python list containing the suitable column headers as string values. The desired column headers are `'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'GlassType'` in the same order.

- Create a Python dictionary containing the current column heads and the desired column headers as key-value pairs.

- Change the column headers by calling the `rename()` function on the `pandas` data frame object. The **syntax** to apply the `rename()` function is

  `data_frame_object.rename(python_dictionary, axis = 1)`

  where `python_dictionary` contains the elements as described in the second point and `axis = 1` tells `pandas` to rename the columns rather than the row labels.





In [None]:
#Create a Python list containing the suitable column headers as string values. Also, create a Python dictionary as described above.
column_headers = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'GlassType']

# Create the required Python dictionary.
columns_dict = {}
for i in df.columns:
  columns_dict[i] = column_headers[i - 1]

columns_dict

In [None]:
# Call the 'rename()' function on the data frame object to rename the columns.
df.rename(columns_dict, axis = 1, inplace = True)

# Display the first five rows of the data frame.
df.head()

As you can see, all the column headers are renamed as required.
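
As a side note, the mapping dictionary can also be built without an explicit loop. The sketch below is an alternative under the assumption that the new headers are listed in the same order as the current columns; `toy_df` and `new_headers` are hypothetical stand-ins for `df` and `column_headers`.

```python
import pandas as pd

# Hypothetical small frame with integer column labels, similar to what
# header=None produces after the serial-number column is dropped.
toy_df = pd.DataFrame([[1.52, 13.6, 1], [1.51, 13.9, 2]], columns=[1, 2, 3])
new_headers = ["RI", "Na", "GlassType"]

# Pair each current column label with its new name, in order.
mapping = dict(zip(toy_df.columns, new_headers))

# rename() returns a new data frame unless inplace=True is passed.
renamed = toy_df.rename(columns=mapping)
print(list(renamed.columns))  # ['RI', 'Na', 'GlassType']
```

Using `zip()` avoids relying on the column labels being consecutive integers starting at 1, which the `column_headers[i - 1]` indexing assumes.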

---

#### Activity 3: Dataset Inspection

Let's look at the kind of values each column has, the number of rows and columns in the dataset, and whether the dataset has any missing values.

In [None]:
#  Get the information about the dataset.
df.info()

Except for the last column, all the columns have floating-point values as we already observed. There are 214 rows and 10 columns. And there are no missing values in the dataset because all the columns contain 214 non-null values.

Now let's get the count of each glass-type samples in the dataset.

In [None]:
# Get the count of each glass-type samples in the dataset.
df['GlassType'].value_counts()

Notice that there is no count for glass-type `4`. This means the dataset does not have any sample of glass-type `4`.

Also, glass types `2` and `1` are the most common among all the samples and glass-type `6` is the least common. This suggests that the dataset is imbalanced and biased in favour of types `1` and `2`. Let's also calculate the percentage of these values.

In [None]:
# Get the percentage of count of each glass-type samples in the dataset.
round(df['GlassType'].value_counts() * 100 / df.shape[0], 2)

Through percentages, we can clearly see the imbalance in the dataset.
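
The same percentages can also be obtained with the `normalize` parameter of `value_counts()`, which returns relative frequencies instead of counts. The sketch below uses a made-up series `toy_labels`, not the glass dataset.

```python
import pandas as pd

# Made-up labels for illustration only.
toy_labels = pd.Series([1, 1, 1, 2, 2, 7])

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
pct = (toy_labels.value_counts(normalize=True) * 100).round(2)
print(pct)
```

Here label `1` accounts for 50.0%, label `2` for 33.33% and label `7` for 16.67% of the toy samples.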


---

#### Activity 4: Data Visualisation using Plotly

Plotly is another Python library used for data visualisation. We can create various kinds of graphs such as line plots, pie charts and scatter plots using Plotly as well.

**So why should we use Plotly over matplotlib or seaborn?** The reasons are:

- Its hover tool can be used to inspect individual data points, which helps in spotting anomalies among a large number of points.

- It offers extensive customisation options for making interactive visualisations, which can be displayed in Colab/Jupyter notebooks or saved as standalone HTML files.


Let's start by creating a count plot using Plotly.

**Steps:**

1. Import the `plotly.express` module for Plotly features.

2. Group the DataFrame `df` by the column `GlassType` without making it the index (`as_index = False`) and save the grouping object in a variable.

3. Compute the size of each group with the `size()` function.

>`glass_group_df = glass_group.size()`

> where `glass_group` is the grouping object and `size()` returns a DataFrame containing the number of records in each unique group, saved as `glass_group_df`.

4.  Create the count plot with the `bar()` function of the plotly library. The syntax for the `bar()` function is:

> **Syntax:**  `plotly.express.bar(data_frame, x, y, color)`

> where

  - `data_frame` : parameter requires the name of the dataframe with the distribution of values
  - `x` : parameter requires a column name / pandas series name / array name from where the values are used to position marks along the x axis.

  - `y` : parameter requires a column name / pandas series name / array name from where the values are used to position marks along the y axis.

  - `color` : parameter requires a column name / pandas series name / array name from where the values are used to assign color to marks.

5. Display the graph using the `show()` function.



Let's create a count plot with plotly to observe the distribution of types of glasses in the dataset.

In [None]:
# Create the count plot to observe distribution of glass types using Plotly.

# Import the Plotly library
import plotly.express as px

# Group the DataFrame by the 'GlassType' column
glass_group = df.groupby(by = "GlassType", as_index = False)

# Get the size of each glass type from the group object
glass_group_df = glass_group.size()
print(glass_group_df.head())

# Create the count plot using the 'bar()' function
fig = px.bar(data_frame = glass_group_df, x = "GlassType", y = "size", color = "GlassType")

fig.show()

In [None]:
import plotly.express as px

fig = px.colors.qualitative.swatches()
fig.show()

In [None]:
df["GlassType"].value_counts()

In [None]:
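# Alternative: build the same count plot directly from value_counts().
counts = df["GlassType"].value_counts().rename("count")  # name the counts column explicitly
counts_df = counts.to_frame()  # index holds the glass types, 'count' holds their frequencies
fig = px.bar(data_frame = counts_df, x = counts_df.index, y = "count", color = counts_df.index)
fig.show()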


**Note:** The `bar()` function accepts more parameters that can be passed to create more customised plots. You may refer to the following document:

https://plotly.com/python-api-reference/generated/plotly.express.bar.html

As can be observed, the count plot is created using Plotly. Also, if you hover the mouse over the bars, a pop-up appears with the `GlassType` and its size information.

We can also convert the plot to html with the `write_html()` function.

In [None]:
#Convert the plot to html file.

fig.write_html("Glass Distribution.html")

Check the file explorer on the left-hand side to verify that a new `.html` file has been created. We can download the graph file from the explorer.

Now, let's move ahead and create a scatter plot with Plotly using dummy data. The `plotly.express` module has the `scatter()` function to create a scatter plot. The syntax of the `scatter()` function is:

> **Syntax:**  `plotly.express.scatter(data_frame, x, y, color, size, hover_data, title)`

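> where

  - `data_frame` : the dataframe whose columns are referred to by name (it can be omitted when arrays are passed directly)

  - `x`, `y` : a column name / pandas series / array whose values position the marks along the x and y axes

  - `color` : a column name / pandas series / array whose values are used to assign color to the marks

  - `size` : a column name / pandas series / array whose values are used to assign sizes to the marks

  - `hover_data` : a list of column names whose values appear as extra information in the hover pop-up

  - `title` : the title of the figure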


Create a scatter plot and show the distribution across labels using the steps below:

**Steps:**

1. Create two NumPy arrays `x` and `y` with 10 random integers each, drawn from the ranges 1-10 and 1-100 respectively.

2. Create a NumPy array `labels` that randomly assigns one of three labels - `1`, `2`, `3` - to each of the above data points.

3. Create the scatter plot between `x` and `y` and show the distribution of data points by passing the `labels` array to the `color` parameter.

4. Display the plot.


In [None]:
#Create scatter plot between 'x' and 'y' and show the distribution using 'labels' array

# Import the module
import numpy as np

# Create the 'x' and 'y' arrays
x = np.random.randint(1,11,10, dtype = int)
y = np.random.randint(1,101,10, dtype = int)

# Create the 'labels' array
labels = np.random.randint(1,4,10)  # the upper bound is exclusive, so 4 gives labels 1, 2 and 3

print(f"Array x: {x}")
print(f"Array y: {y}")
print(f"Labels array: {labels}")

# Create the scatter plot
sc_plot = px.scatter(x = x, y = y, color = labels)

# Display the plot
sc_plot.show()

As can be observed, the scatter plot is created with the dummy data points and different classes are assigned different colors. Also, when you hover over the data points, the pop-up shows three pieces of information: `x`, `y` and `color`, where `color` refers to the class of the data point.

We can also observe the data points are really small. So, we can include the `size` parameter which should be an array of the same shape as values in `x`. The `size` parameter like `color` can be used to distinguish between different labels as well.

Now, let's create a scatter plot using Plotly for the column `Fe` to understand the distribution of glass types with respect to iron (Fe), using the guidelines below:

- `data_frame` will be `df`

- `x` will be a NumPy array of size `df.shape[0]` with evenly spaced values ranging from the minimum value of the column `Fe` to the maximum value + 1 of the column `Fe`.

- `y` will be the values in the column `Fe`

- `size` will be the values in the column `GlassType` such that the size of the points changes with the glass type

- `color` will also be the values in the column `GlassType` such that the color of the points changes with the glass type.

- `title` will be string representing the plot e.g. "Scatter plot between Fe and Glass Type"

- `color_continuous_scale` will be `px.colors.sequential.Viridis`. This parameter is used to create list of continuous color scale values when the column denoted by `color` contains numeric data.
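
To make the `x` guideline above concrete, here is a small sketch of how such an array can be generated with `np.linspace()`. The array `values` is a hypothetical stand-in for `df['Fe']`. As far as this construction goes, the x positions carry no information about the samples; they only spread the points out along the x axis.

```python
import numpy as np

# Hypothetical column values standing in for df['Fe'].
values = np.array([0.0, 0.1, 0.0, 0.3, 0.2])

# Evenly spaced x positions from min to max + 1, one per row of the column.
x_positions = np.linspace(values.min(), values.max() + 1, values.shape[0])
print(x_positions)
```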

In [None]:
# Create the scatter plot for the column 'Fe' values and display the distribution of glass types over the column values.


fig = px.scatter(df, x = np.linspace(df['Fe'].min(), df['Fe'].max() + 1, df.shape[0]), y = df['Fe'],
                  size = "GlassType", color = "GlassType",
                 color_continuous_scale = px.colors.sequential.Viridis, title = f'Scatter plot between Fe and Glass Type')

fig.show()

As can be observed, the scatter plot is created for the column `Fe`. We can distinguish the data points of different glass types using the color (with the color bar on the right) or the size (the smallest points represent label `1` and the largest represent label `7`).

The above scatter plot shows that glass of type `5` can have the highest amount of `Fe`.

**Note:** The different color scales can be observed in the `colors` sub modules of Plotly like `plotly.express.colors.sequential`, `plotly.express.colors.diverging` and `plotly.express.colors.cyclical`.

Let's create the Plotly scatter plot for all the columns to check the distribution of glass types.

In [None]:
# Create the scatter plot for all the columns in 'df' to observe the distribution of glass types.

for i in list(df.columns[:-1]):
  fig = px.scatter(df, x= np.linspace(df[i].min(), df[i].max() + 1, df.shape[0]), y = df[i],
                   hover_data = ['GlassType'], size = "GlassType",
                   color = "GlassType", color_continuous_scale = px.colors.sequential.Viridis,
                   title = f'Scatter plot between {i} and Glass Type')
  fig.show()

We can observe the scatter plots above to deduce various facts, for example that the refractive index (`RI`) is highest for glass type `2`.



---

#### Activity 5: Model Building

Let's build a logistic regression model first without balancing the dataset. If the model evaluation parameters suggest that the model is not classifying the labels correctly, then we will first deal with the imbalance and then build a logistic regression model again.

In [None]:
# Create separate data frames for training and testing the model.
from sklearn.model_selection import train_test_split

# Creating the features data frame holding all the columns except the last column
x = df.iloc[:, :-1]
print(f"First five rows of the features data frame:\n{x.head()}\n")

# Creating the target series that holds last column 'GlassType'
y = df['GlassType']
print(f"First five rows of the GlassType column:\n{y.head()}")

# Splitting the train and test sets using the 'train_test_split()' function.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)

In [None]:
#  Print the shape of all the four variables i.e. 'x_train', 'x_test', 'y_train' and 'y_test'
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

In [None]:
#  Build a logistic regression model using the 'sklearn' module.
from sklearn.linear_model import LogisticRegression

# 1. First, create an object of the 'LogisticRegression' class and store it in the 'lg_clf' variable.
lg_clf = LogisticRegression()

# 2. Call the 'fit()' function with 'x_train' and 'y_train' as inputs.
lg_clf.fit(x_train, y_train)

# 3. Call the 'score()' function with 'x_train' and 'y_train' as inputs to check the accuracy score of the model.
lg_clf.score(x_train, y_train)

**Note:** This is a preliminary model building step. Hence, we can ignore `ConvergenceWarning` completely.  

So the accuracy score is 61.75% which is not a good score.

In the case of binary classification, we generally create a confusion matrix and print the precision, recall and f1-score values. But in the case of multiclass classification, it is best to first check which labels the classification model identified or detected. For this, you can use either the `unique()` function or the `value_counts()` function.

In [None]:
# Get the target values predicted by the logistic regression model on the train set.
y_train_predict = lg_clf.predict(x_train)
y_train_predict = pd.Series(y_train_predict)

print("Classes or labels identified by the logistic regression model:\n", y_train_predict.unique())
print("\nCount of the labels identified by the logistic regression model:")
print(y_train_predict.value_counts())

As you can see, the logistic regression model failed to identify glass-type `3`.

Consequently, it does not make sense to create a confusion matrix here because the actual target set has all the labels but the predicted target set misses one label (glass-type `3`) among the available labels (the whole dataset does not have any glass-type `4` sample).

Hence, **in the case of multiclass classification, before creating a confusion matrix, always first check whether the predicted target set has all the labels**.
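
This check can be automated. The sketch below, on made-up labels rather than the glass dataset, lists the classes that never appear among the predictions. It also shows, as an aside, that if a confusion matrix is computed anyway, a never-predicted class appears as an all-zero column.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = np.array([1, 1, 2, 2, 3, 5])
y_pred = np.array([1, 2, 2, 2, 1, 5])

# Labels present in the actual targets but absent from the predictions.
missing = np.setdiff1d(y_true, y_pred).tolist()
print("Labels never predicted:", missing)

# Passing `labels` explicitly fixes the row/column order to the actual classes.
all_labels = np.unique(y_true)
cm = confusion_matrix(y_true, y_pred, labels=all_labels)
print(cm)
```

In this toy example label `3` is never predicted, so the third column of the matrix contains only zeros.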

Let's repeat the above exercise on the test set and find out all the classes identified by the logistic regression model.

In [None]:
#  Get the target values predicted by the logistic regression model on the test set.
y_test_predict = pd.Series(lg_clf.predict(x_test))

print("Classes or labels identified by the logistic regression model on the test set:\n", y_test_predict.unique())
print("\nCount of the labels identified by the logistic regression model on the test set:")
print(y_test_predict.value_counts())

On the test set, the logistic regression model failed to identify labels `3` and `6`. This is clearly a very bad classification model.

Let's stop here. In the next class, we will try to build a logistic regression model again so that it can identify all the different labels before we can evaluate its performance further using confusion matrix, precision, recall and f1-score values.

---