# Logistic Regression in scikit-learn - Lab

## Introduction 

In this lab, you will fit a logistic regression model to a heart disease dataset. The `'target'` column indicates whether a patient has heart disease: 1 means the patient has heart disease, and 0 means the patient does not.

## Objectives

In this lab you will: 

- Fit a logistic regression model using scikit-learn 


## Let's get started!

Run the following cells that import the necessary functions and import the dataset: 

In [1]:
# Import necessary functions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [2]:
# Import data
df = pd.read_csv('heart.csv')
df.head()

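| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |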


## Define appropriate `X` and `y` 

Recall that the `'target'` column records whether or not a patient has heart disease. With that in mind, define appropriate `X` (predictors) and `y` (target) to model whether or not a patient has heart disease.

In [3]:
# Split the data into target and predictors
y = df['target']
X = df.drop('target', axis=1)
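
Before modeling, it can help to confirm that the target really was removed from the predictors and to look at the class balance. The sketch below uses a small made-up frame (`toy`, with hypothetical values) in place of `heart.csv`, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for heart.csv (values are made up)
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 45],
    "chol": [233, 250, 204, 236, 354, 210],
    "target": [1, 1, 0, 1, 0, 0],
})

# Same split as in the lab: the target column becomes y, everything else X
y_toy = toy["target"]
X_toy = toy.drop(columns="target")

# Sanity checks: the predictor matrix has no target column,
# and the share of each class is visible at a glance
print(X_toy.shape)
print(y_toy.value_counts(normalize=True))
```

Running the same two checks on the lab's `X` and `y` would show whether the classes are roughly balanced, which matters later when interpreting accuracy.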

## Normalize the data 

Normalize the data (`X`) prior to fitting the model. 

In [9]:
# Your code here
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the data and transform 
X = scaler.fit_transform(X)

# X now holds the standardized features as a NumPy array
# (each column has mean 0 and unit variance)
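
"Normalize" is sometimes used for min-max scaling to the range [0, 1], while `StandardScaler` standardizes each column to mean 0 and unit variance. The sketch below contrasts the two on a small made-up matrix (`X_demo` is hypothetical, not the lab data); either choice puts the features on comparable scales.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical two-feature matrix on different scales (made-up values)
X_demo = np.array([
    [63.0, 233.0],
    [37.0, 250.0],
    [41.0, 204.0],
    [56.0, 236.0],
])

# Standardization: subtract the column mean, divide by the column std
X_std = StandardScaler().fit_transform(X_demo)

# Min-max scaling: map each column's minimum to 0 and maximum to 1
X_minmax = MinMaxScaler().fit_transform(X_demo)

print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```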


## Train-test split

- Split the data into training and test sets 
- Assign 25% to the test set 
- Set the `random_state` to 0 

In [10]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
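
The sketch below shows how a 25% test split divides 303 rows, the total implied by the 227 training and 76 test observations reported later in this lab. It uses random stand-in data (`X_demo`, `y_demo` are hypothetical). It also illustrates a common variation on the lab's ordering: fitting the scaler on the training rows only, so that no statistics from the test rows influence preprocessing. This is an alternative, not the procedure the lab asks for.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data with the same number of rows as implied by the lab
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# 25% of 303 rows is 75.75; scikit-learn rounds the test size up
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)  # (227, 13) (76, 13)

# Variation: learn the scaling parameters from the training rows only,
# then apply the same transformation to the test rows
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```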

## Fit a model

- Instantiate `LogisticRegression`
  - Make sure you don't include the intercept  
  - set `C` to a very large number such as `1e12` 
  - Use the `'liblinear'` solver 
- Fit the model to the training data 

In [13]:
from sklearn.linear_model import LogisticRegression

# Instantiate the LogisticRegression model
logistic_regression_model = LogisticRegression(
    fit_intercept=False,  # Exclude the intercept
    C=1e12,               # Set C to a large number
    solver='liblinear'    # Use the 'liblinear' solver
)

# Fit the model to the training data
logistic_regression_model.fit(X_train, y_train)
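
In scikit-learn, `C` is the inverse of the regularization strength, so a very large value such as `1e12` effectively switches the penalty off. The sketch below illustrates this on synthetic data from `make_classification` (an assumption for illustration, not the heart dataset) by comparing the size of the fitted coefficients under a weak and a strong penalty. The helper `coef_norm` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, deliberately noisy data so the classes are not perfectly separable
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)

def coef_norm(C):
    """Fit the lab's model configuration with a given C; return the coefficient norm."""
    model = LogisticRegression(fit_intercept=False, C=C, solver="liblinear")
    model.fit(X_demo, y_demo)
    return float(np.linalg.norm(model.coef_))

weak_penalty = coef_norm(1e12)    # almost no regularization
strong_penalty = coef_norm(0.01)  # heavy regularization shrinks coefficients

print(weak_penalty, strong_penalty)
```

The coefficients fitted with `C=1e12` should be noticeably larger in magnitude, because nothing pulls them toward zero.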

## Predict
Generate predictions for the training and test sets. 

In [14]:
# Generate predictions
train_predictions = logistic_regression_model.predict(X_train)
test_predictions = logistic_regression_model.predict(X_test)
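
For a binary problem, `predict` returns the class whose estimated probability is higher, which amounts to thresholding the positive-class probability from `predict_proba` at 0.5. The sketch below checks this on synthetic data (an assumption for illustration, not the heart dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data with labels 0 and 1
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0,
)

model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# Column 1 of predict_proba is the estimated probability of class 1
proba_positive = model.predict_proba(X_demo)[:, 1]

# Hard labels from predict versus a manual 0.5 threshold
labels = model.predict(X_demo)
manual = (proba_positive > 0.5).astype(int)

print(np.array_equal(labels, manual))
```

Keeping the probabilities around makes it possible to try thresholds other than 0.5 when false negatives and false positives have different costs.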

## How many times was the classifier correct on the training set?

In [15]:
# Your code here
import numpy as np

# Calculate the number of correct predictions on the training set
correct_train_predictions = np.sum(train_predictions == y_train)

# Print the number of correct predictions
print("Number of correct predictions on the training set:", correct_train_predictions)

# Calculate and print the accuracy on the training set
accuracy_train = correct_train_predictions / len(y_train)
print("Accuracy on the training set:", accuracy_train)

Number of correct predictions on the training set: 192
Accuracy on the training set: 0.8458149779735683
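
The manual count above matches what `sklearn.metrics.accuracy_score` computes. The sketch below shows the equivalence on two short made-up label arrays (`y_true` and `y_pred` are hypothetical).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 6 of the 8 predictions agree with the truth
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Manual calculation, as in the lab
n_correct = int(np.sum(y_true == y_pred))
manual_accuracy = n_correct / len(y_true)

# Library equivalents: the proportion correct, and the raw count
library_accuracy = accuracy_score(y_true, y_pred)
library_count = accuracy_score(y_true, y_pred, normalize=False)

print(n_correct, manual_accuracy)  # 6 0.75
```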


## How many times was the classifier correct on the test set?

In [16]:
# Your code here
import numpy as np

# Calculate the number of correct predictions on the test set
correct_test_predictions = np.sum(test_predictions == y_test)

# Print the number of correct predictions
print("Number of correct predictions on the test set:", correct_test_predictions)

# Calculate and print the accuracy on the test set
accuracy_test = correct_test_predictions / len(y_test)
print("Accuracy on the test set:", accuracy_test)

Number of correct predictions on the test set: 63
Accuracy on the test set: 0.8289473684210527
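
A count of correct predictions does not show which kind of mistake the classifier makes. A confusion matrix splits the same count into true negatives, false positives, false negatives, and true positives. The sketch below uses short made-up label arrays (hypothetical, not the lab's predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# The diagonal holds the correct predictions, so its sum is the
# "number of times the classifier was correct"
n_correct = int(np.trace(cm))

print(cm)
print(n_correct)  # 6
```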


## Analysis
Describe how well you think this initial model is performing based on the training and test performance. Within your description, make note of how you evaluated performance as compared to your previous work with regression.

In [None]:
# Your analysis here
This analysis judges the initial logistic regression model by its accuracy on the training and test sets, then compares that approach with the metrics used in earlier regression work.

Training Performance:

- The model classified 192 of the 227 training observations correctly, an accuracy of about 84.6%.
- This shows that the model fits the data it was trained on reasonably well.
- High training accuracy alone does not guarantee good generalization to new, unseen data.

Test Performance:

- The test set serves as a proxy for unseen data, so test accuracy estimates how the model is likely to perform on new patients.
- The model classified 63 of the 76 test observations correctly, an accuracy of about 82.9%.
- This is slightly below the training accuracy, which is expected: the model is fit to the training data, so it usually scores a little worse on data it has not seen. The gap of under two percentage points gives little sign of overfitting.
- With only 76 test observations, each misclassification shifts test accuracy by about 1.3 percentage points, so the estimate is somewhat noisy.

Comparison to Regression:

- In regression tasks such as linear regression, the usual evaluation metrics are mean squared error (MSE) and R-squared (R²). Both measure how closely the predicted continuous values match the actual values.
- In classification tasks such as binary logistic regression, accuracy is a common metric. It is the proportion of predictions that are correct.
- The criteria differ because the tasks differ. Regression predicts continuous values, so evaluation measures the distance between predictions and actual values. Classification predicts class labels, so evaluation counts how often the predicted label is right.

Interpretation:

- The initial model shows promise: it classifies about 83% of the test patients correctly.
- Accuracy alone does not show which kinds of errors the model makes. Additional metrics such as precision, recall, F1-score, and the ROC curve give a fuller picture, especially if the classes are imbalanced.
- Trying other models, tuning hyperparameters, and running model diagnostics may improve performance further.

In summary, the initial logistic regression model is a good starting point, but its evaluation should go beyond accuracy before the model is judged suitable for this classification task.
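
The additional metrics mentioned in the analysis are all available in `sklearn.metrics`. The sketch below computes them on short made-up arrays (`y_true`, `y_pred`, and `y_score` are hypothetical). In the lab, the scores would come from `predict_proba` on the test set.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Made-up labels and positive-class scores for illustration
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3])

# Hard predictions from a 0.5 threshold on the scores
y_pred = (y_score > 0.5).astype(int)

# Precision: of the predicted positives, how many are truly positive
precision = precision_score(y_true, y_pred)
# Recall: of the true positives, how many the model found
recall = recall_score(y_true, y_pred)
# F1: harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)
# ROC AUC: threshold-free ranking quality, computed from the scores
auc = roc_auc_score(y_true, y_score)

print(precision, recall, f1, auc)
```

In a medical setting such as this one, recall on the positive class is often of particular interest, because a missed case of heart disease may be costlier than a false alarm.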

## Summary

In this lab, you practiced a standard data science pipeline: importing data, normalizing it, splitting it into training and test sets, and fitting a logistic regression model. In the upcoming labs and lessons, you'll continue to investigate how to analyze and tune these models for various scenarios.