# Logistic Regression

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [2]:
# Reading in a CSV file containing data 
dataset = pd.read_csv('Social_Network_Ads.csv')

In [3]:
# Outputting information about the dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Age              400 non-null    int64
 1   EstimatedSalary  400 non-null    int64
 2   Purchased        400 non-null    int64
dtypes: int64(3)
memory usage: 9.5 KB


`info()` summarizes the DataFrame: the number of rows and columns, the column names, the non-null count for each column, and each column's data type. Here it reports 400 rows and three `int64` columns with no missing values, so no imputation or type conversion is needed before modelling.

In [4]:
# Generating descriptive statistics about the dataset
dataset.describe()

| | Age | EstimatedSalary | Purchased |
|---|---|---|---|
| count | 400.0 | 400.0 | 400.0 |
| mean | 37.655 | 69742.5 | 0.3575 |
| std | 10.482877 | 34096.960282 | 0.479864 |
| min | 18.0 | 15000.0 | 0.0 |
| 25% | 29.75 | 43000.0 | 0.0 |
| 50% | 37.0 | 70000.0 | 0.0 |
| 75% | 46.0 | 88000.0 | 1.0 |
| max | 60.0 | 150000.0 | 1.0 |


`describe()` generates descriptive statistics for each numeric column: the count, mean, standard deviation, minimum, maximum, and quartiles. This gives a quick view of each column's distribution and helps flag problems such as missing values (a count below the number of rows) or outliers. Here, the mean of `Purchased` is 0.3575, so about 36% of customers bought the product, and `Age` (18 to 60) and `EstimatedSalary` (15,000 to 150,000) sit on very different scales, which is why the features are standardized later.

In [5]:
# Displaying the first few rows of the dataset
dataset.head()

| | Age | EstimatedSalary | Purchased |
|---|---|---|---|
| 0 | 19 | 19000 | 0 |
| 1 | 35 | 20000 | 0 |
| 2 | 26 | 43000 | 0 |
| 3 | 27 | 57000 | 0 |
| 4 | 19 | 76000 | 0 |


`head()` displays the first rows of the DataFrame (five by default). It gives a quick look at the column names and a few sample values, and is a common check that the data was imported correctly.

We are trying to predict whether a customer purchased a product, which is recorded in the last column, `Purchased`. This is a discrete variable that takes the value 0 or 1.

We therefore split the data into the features used for prediction (`Age` and `EstimatedSalary`) and the target, `Purchased`.


`iloc` is a pandas indexer for selecting subsets of data by integer position.

`[:, :-1]` selects all rows (:) and all columns except the last one (:-1) as the input features.

`[:, -1]` selects all rows (:) and only the last column (-1) as the output labels.

`.values` returns the data as a NumPy array.

In [6]:
# Select all rows and all columns except the last one as the features (input data) and assign it to X
X = dataset.iloc[:, :-1].values

# Select all rows and only the last column as the labels (output data) and assign it to y
y = dataset.iloc[:, -1].values
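
The slicing above can be tried on a tiny frame. This is a minimal sketch: the three rows and the `demo`, `X_demo`, and `y_demo` names are made up for illustration and are not taken from the real CSV file.

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as the dataset
demo = pd.DataFrame({
    "Age": [19, 35, 26],
    "EstimatedSalary": [19000, 20000, 43000],
    "Purchased": [0, 0, 1],
})

X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo.iloc[:, -1].values   # all rows, only the last column

print(X_demo.shape)  # (3, 2): three rows, two feature columns
print(y_demo.shape)  # (3,): one label per row
```

`X_demo` is two-dimensional (rows by features) while `y_demo` is one-dimensional, which is the layout scikit-learn estimators expect.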

## Splitting the dataset into the Training set and Test set

In [7]:
from sklearn.model_selection import train_test_split

`train_test_split` is a function from the sklearn.model_selection module that splits input features and output labels into random train and test subsets.

The first two arguments (`X` and `y`) are the input features and output labels to be split.

`test_size = 0.25` specifies that 25% of the input features and output labels will be used for testing, and the remaining 75% will be used for training.

`random_state = 0` sets the random seed for reproducibility of the train/test split.

The function returns four arrays, in the order `X_train, X_test, y_train, y_test`: the training and testing subsets of the input features, followed by the training and testing subsets of the output labels.
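
One possible way to write the split is sketched below. It runs on stand-in arrays (`X_demo`, `y_demo`) that only mimic the dataset's shape of 400 rows and two features, so the values are assumptions; with the real data, `X` and `y` would take their place.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 400 rows and 2 features, matching the dataset's shape only
X_demo = np.arange(800).reshape(400, 2)
y_demo = np.arange(400) % 2

# 25% of the rows go to the test set; random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)  # (300, 2) (100, 2)
print(y_train.shape, y_test.shape)  # (300,) (100,)
```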

In [8]:
# Split the input features (X) and output labels (y) into training and testing sets using a 75/25 ratio and a fixed random state


In [9]:
# X_train.shape

In [10]:
# X_test.shape

In [11]:
# y_train.shape

In [12]:
# y_test.shape

## Feature Scaling

In [13]:
from sklearn.preprocessing import StandardScaler

`StandardScaler` is a class from the sklearn.preprocessing module that scales input features to have a mean of 0 and variance of 1.

`sc` is an instance of the `StandardScaler` class that will be used to scale the input features.

`fit_transform` is a method of the `StandardScaler` class that both fits the scaler to the training input features (i.e., calculates the mean and variance to be used for scaling) and applies the same transformation to the training input features.

`transform` is a method of the StandardScaler class that applies the same transformation (scaling) that was calculated from the training input features to the testing input features.

In [14]:
# Create a StandardScaler object to standardize (normalize) the input features

# Scale every feature column; fit the scaler on the training set only to prevent information leakage

# Fit the scaler to the training input features and apply the same transformation to the testing input features


Overall, this code standardizes the input features so that `Age` and `EstimatedSalary` are on a comparable scale, which can improve the stability and accuracy of the model. The scaler is fitted on the training set only and then reused on the test set; that separation, not the scaling itself, is what prevents information about the test set from leaking into training.
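
The fit-on-train, reuse-on-test pattern might look like the sketch below. The small arrays and the `_demo` names are hypothetical stand-ins for the real split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) rows standing in for the real train/test split
X_train_demo = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
X_test_demo = np.array([[30.0, 40000.0], [50.0, 80000.0]])

sc_demo = StandardScaler()
# fit_transform learns the per-column mean and standard deviation from the
# training rows and scales them: z = (x - mean) / std
X_train_scaled = sc_demo.fit_transform(X_train_demo)
# transform reuses the training statistics; the scaler is not refitted
X_test_scaled = sc_demo.transform(X_test_demo)

print(sc_demo.mean_)                 # per-column means learned from training data
print(X_train_scaled.mean(axis=0))   # approximately 0 for each column
print(X_train_scaled.std(axis=0))    # approximately 1 for each column
```

The scaled test rows need not have mean 0 or variance 1, because they are scaled with the training statistics rather than their own.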

## Training the Logistic Regression model on the Training set

In [15]:
from sklearn.linear_model import LogisticRegression

`LogisticRegression` is a class from the `sklearn.linear_model` module that implements logistic regression; here it is used for binary classification.


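`random_state = 0` fixes the random seed. In scikit-learn it only affects solvers that shuffle the data (`sag`, `saga`, and `liblinear`); it is set here so that results stay reproducible whichever solver is used.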

`classifier` is an instance of the LogisticRegression class that will be used to fit and predict with the logistic regression model.

`fit` is a method of the `LogisticRegression` class that trains (fits) the logistic regression model using the training input features (`X_train`) and output labels (`y_train`).

In [16]:
# Create a LogisticRegression object with a fixed random state

# Train (fit) the model using the training input features (X_train) and output labels (y_train)


Overall, this code creates a logistic regression model and trains it on the training input features and output labels. `random_state` keeps the fit reproducible for solvers that use randomness, while `fit` does the actual training. Once trained, the model can be used to make predictions on new data.
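
A minimal sketch of creating and fitting the model follows. It uses a made-up, already-scaled one-feature dataset (`X_train_demo`, `y_train_demo`) instead of the real training set, purely to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical scaled data: negatives sit below 0, positives above
X_train_demo = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_train_demo = np.array([0, 0, 0, 1, 1, 1])

classifier_demo = LogisticRegression(random_state=0)
classifier_demo.fit(X_train_demo, y_train_demo)

# The learned weight is positive: larger feature values favor class 1
print(classifier_demo.coef_, classifier_demo.intercept_)
print(classifier_demo.predict([[-3.0], [3.0]]))  # [0 1]
```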





## Predicting a new result

`predict` is a method of the `LogisticRegression` class that makes predictions on new data points using the trained logistic regression model.


`sc.transform` is used to scale (normalize) the input features of the new data point in the same way as the training data.

`[[30,87000]]` is a nested list containing the input features of the new data point: an age of 30 and a salary of 87000. The outer list is needed because `sc.transform` and `predict` expect a two-dimensional input with one row per data point.


`predict` returns an array of class labels, one per input row, so here it holds a single value (0 or 1). The predicted label itself is not shown in this notebook.

In [17]:
# Use the trained logistic regression classifier to make a prediction on a new data point
# Here, the age is 30 and the salary is 87000
# The input data is first transformed using the same scaling as the training data
# The output prediction is a binary value (0 or 1) indicating the predicted class label


Overall, this code is using the trained logistic regression classifier to make a prediction on a new data point with an age of 30 and a salary of 87000. The input data is first transformed using the same scaling as the training data, and the output prediction is a binary value indicating the predicted class label.
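
The scale-then-predict step can be sketched end to end as below. The six training rows and all `_demo` names are invented for illustration, so the printed label says nothing about what the real model would predict for this customer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Made-up (age, salary) rows and labels; not the real dataset
X_demo = np.array([[22, 25000], [25, 30000], [28, 40000],
                   [45, 90000], [50, 110000], [55, 130000]], dtype=float)
y_demo = np.array([0, 0, 0, 1, 1, 1])

sc_demo = StandardScaler()
classifier_demo = LogisticRegression(random_state=0)
classifier_demo.fit(sc_demo.fit_transform(X_demo), y_demo)

# predict expects a 2-D input: one row per sample, one column per feature
new_point = [[30, 87000]]
prediction = classifier_demo.predict(sc_demo.transform(new_point))
print(prediction)  # a length-1 array holding 0 or 1
```

Skipping `sc_demo.transform` would feed raw ages and salaries to a model trained on standardized values, so the prediction would not be meaningful.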

## Predicting the Test set results

In [18]:
# Use the test input features (X_test) to make predictions, and name the result y_pred

The function below is a utility function that can be used to create a DataFrame that compares predicted and actual values for a binary classification problem.

You do not need to understand this as it is mostly for show, but if you are curious:

The function first uses `reshape` to turn each of the `y_pred` and `y_test` arrays into a column vector.

It then uses `np.concatenate` to join the two columns along the second axis, producing a two-dimensional array named `data`.
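
The reshape-and-concatenate step can be seen in isolation in the sketch below, which uses two short hypothetical arrays (`y_pred_demo`, `y_test_demo`) rather than real predictions.

```python
import numpy as np

y_pred_demo = np.array([0, 1, 1])  # hypothetical predictions
y_test_demo = np.array([0, 0, 1])  # hypothetical true labels

# reshape(len(a), 1) turns a flat array of shape (3,) into a column of shape (3, 1)
pred_col = y_pred_demo.reshape(len(y_pred_demo), 1)
test_col = y_test_demo.reshape(len(y_test_demo), 1)

# axis=1 places the two columns side by side: predicted on the left, actual on the right
data = np.concatenate((pred_col, test_col), axis=1)
print(data.shape)     # (3, 2)
print(data.tolist())  # [[0, 0], [1, 0], [1, 1]]
```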

In [19]:
def predictions_vs_actual(y_pred, y_test):
    """Returns a DataFrame that compares predicted and actual values.

    Args:
        y_pred (array-like): An array of predicted values for a binary classification problem.
        y_test (array-like): An array of actual values for the same binary classification problem.

    Returns:
        pandas.DataFrame: A DataFrame with two columns named 'Predicted' and 'Actual'.

    """
    # Concatenate the predicted and actual values along the second axis to create a two-dimensional array
    data = np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1)
    # Create a DataFrame from the two-dimensional array with columns 'Predicted' and 'Actual'
    df = pd.DataFrame(data=data, columns=['Predicted', 'Actual'])
    # Return the DataFrame
    return df

In [20]:
# Use the above method to compare the predictions to the actual results

## Making the Confusion Matrix

In [21]:
from sklearn.metrics import confusion_matrix, accuracy_score

`confusion_matrix` is a function from the `sklearn.metrics` module that computes the confusion matrix for a classifier given the true labels and predicted labels.


`y_test` contains the true class labels for the testing data.

`y_pred` contains the predicted class labels for the testing data.

For binary classification, the confusion matrix is a 2×2 table with four entries: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). In scikit-learn, rows correspond to the true labels and columns to the predicted labels, so with labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`.

`TP` counts actual positives predicted as positive, `TN` counts actual negatives predicted as negative, `FP` counts actual negatives predicted as positive, and `FN` counts actual positives predicted as negative.

The diagonal of the confusion matrix shows the correct predictions (TP and TN), while the off-diagonal elements show the incorrect predictions (FP and FN).
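
A small worked case makes the four cells concrete. The label lists below (`y_test_demo`, `y_pred_demo`) are hypothetical; in scikit-learn's output the rows are the true labels and the columns are the predicted labels.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for seven test samples
y_test_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_test_demo, y_pred_demo)
print(cm.tolist())  # [[2, 1], [1, 3]], i.e. [[TN, FP], [FN, TP]]

# ravel() flattens the 2x2 matrix row by row
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 2 1 1 3
```

The diagonal (2 and 3) holds the five correct predictions; the off-diagonal entries are the one false positive and the one false negative.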

In [22]:
# Compute the confusion matrix and print it out
# A confusion matrix is a table used to evaluate the performance of a classifier
# The confusion matrix shows the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions
# TP: actual positive and predicted positive
# TN: actual negative and predicted negative
# FP: actual negative but predicted positive
# FN: actual positive but predicted negative
# The diagonal of the matrix shows the correct predictions (TP and TN)
# The off-diagonal elements show the incorrect predictions (FP and FN)
# Here, y_test contains the true labels of the testing data, and y_pred contains the predicted labels


`accuracy_score` is a function from the `sklearn.metrics` module that computes the accuracy of a classifier given the true labels and predicted labels.

The accuracy is the proportion of correct predictions (TP and TN) out of all predictions.
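
As a quick check of that definition, the sketch below computes accuracy by hand from the confusion-matrix counts and compares it with `accuracy_score`. The label lists are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for seven test samples
y_test_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_test_demo, y_pred_demo).ravel()

# Accuracy = correct predictions / all predictions
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
library_accuracy = accuracy_score(y_test_demo, y_pred_demo)

print(manual_accuracy, library_accuracy)  # both equal 5/7
```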

In [23]:
# Compute the accuracy of the classifier and print it out
# The accuracy is the proportion of correct predictions (TP and TN) out of all predictions
# Here, y_test contains the true labels of the testing data, and y_pred contains the predicted labels


Overall, this code is evaluating the performance of the logistic regression classifier using the testing data. The confusion matrix shows the number of true positives, false positives, true negatives, and false negatives, and the accuracy score shows the proportion of correct predictions out of all predictions. These metrics are useful for evaluating the performance of a classifier and for comparing the performance of different classifiers.

## Visualising the Training set results

This utility function is used to visualize the decision boundary of a logistic regression model along with the training or test data. The function takes in the feature array `X` and target array `y`, as well as a string `train_test` indicating whether the visualization is for the training or test set.

You do not need to understand this as it is mostly for show, but if you are curious:

The function first transforms `X` back to the original scale with `sc.inverse_transform` and assigns the result and `y` to new variables. It then builds a grid of points with `numpy.meshgrid` on which to plot the decision boundary. The `start`, `stop`, and `step` arguments of `numpy.arange` define the range and spacing of the grid along each axis.

The decision regions are drawn with `plt.contourf`. Every grid point is scaled with `sc.transform` and classified with `classifier.predict`, and the predictions are reshaped to the grid's shape so that each region can be filled with the color of its class. The `alpha` argument controls the transparency of the fill, and the `cmap` argument controls the color scheme.

The limits of the plot are set to the minimum and maximum values of `X1` and `X2` using `plt.xlim` and `plt.ylim`. The training or test data is plotted using `plt.scatter`, with points color-coded by class label. The title and axis labels of the plot are set using `plt.title`, `plt.xlabel`, and `plt.ylabel`. The legend is shown using `plt.legend`, and the plot is displayed using `plt.show`.
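
The grid construction is the least obvious part, so here it is in isolation. This is a sketch on a deliberately tiny, hypothetical grid (three ages, two salaries); the real function uses much wider ranges and a finer step.

```python
import numpy as np

# Coarse, hypothetical axes: three ages and two salaries
ages = np.arange(start=18, stop=21, step=1)            # 18, 19, 20
salaries = np.arange(start=1000, stop=2000, step=500)  # 1000, 1500

# meshgrid returns two 2-D arrays holding the age and salary of every grid point
X1, X2 = np.meshgrid(ages, salaries)
print(X1.shape)  # (2, 3): one row per salary, one column per age

# Flatten both grids and pair them up: one (age, salary) row per grid point,
# which is the 2-D layout that sc.transform and classifier.predict expect
grid_points = np.array([X1.ravel(), X2.ravel()]).T
print(grid_points.shape)         # (6, 2)
print(grid_points[:3].tolist())  # [[18, 1000], [19, 1000], [20, 1000]]
```

After prediction, reshaping the flat result back to `X1.shape` puts each predicted label at its grid position, which is what `plt.contourf` needs.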



In [24]:
from matplotlib.colors import ListedColormap

In [25]:
def visualize_data(X, y, train_test = 'Train'):
  """
  Function to visualize the data and decision boundary of the Logistic Regression model.

  Parameters:
  -----------
  X : numpy array
      The feature array.
  y : numpy array
      The target array.
  train_test : str, optional (default='Train')
      The string to indicate if the visualization is for 'Train' or 'Test' data.

  Returns:
  --------
  None
  """

  # Transform X to original scale and assign X and y to new variables
  X_set, y_set = sc.inverse_transform(X), y

  # Create a meshgrid of points to plot decision boundary
  X1, X2 = np.meshgrid(
      np.arange(
          start = X_set[:, 0].min() - 10, 
          stop = X_set[:, 0].max() + 10, 
          step = 0.25
      ),
      np.arange(
          start = X_set[:, 1].min() - 1000, 
          stop = X_set[:, 1].max() + 1000, 
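          # A 0.25 step over a salary range this wide yields a very large grid;
          # increase the step if plotting is slow or runs out of memory
          step = 0.25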
      )
  )

  # Plot the decision boundary by predicting the target variable 
  # for all points in the meshgrid
  plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
              alpha = 0.75, cmap = ListedColormap(('red', 'green')))
  
  # Set the limits of the plot to the minimum and maximum values 
  # of X1 and X2
  plt.xlim(X1.min(), X1.max())
  plt.ylim(X2.min(), X2.max())

  # Plot the data points, color-coded by class label
  for i, j in enumerate(np.unique(y_set)):
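      # Pass the RGBA tuple via `color`; `c` would trigger a matplotlib warning for a single RGBA sequence
      plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], color = ListedColormap(('red', 'green'))(i), label = j)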
  
  # Set the title and labels of the plot, and show the legend
  plt.title(f'Logistic Regression ({train_test}ing set)')
  plt.xlabel('Age')
  plt.ylabel('Estimated Salary')
  plt.legend()
  plt.show()

In [26]:
# Use the visualization function above to visualize the training results of logistic regression


## Visualising the Test set results

In [27]:
# Make a graph similar to above with the test set