# Week 06 - Classification Models Logistic Regression
The goal of this assignment is to introduce students to classification models, helping them understand how to apply machine learning algorithms to predict categories. Students will work with two datasets and will use a variety of classification models to compare their performance. The assignment will focus on using **Logistic Regression** in this Jupyter Notebook.
<br><br>
The libraries that you will use are listed in the code cell below. Please run the code cell. See the Getting Started module in the Canvas Classroom for the parameter options available to each class and method used.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay

# Reading in the Data set
You will use the Titanic data set to predict whether a person survives based on data from the data set.  The code below reads a dataset located on the Internet into a dataframe.

```
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
```
To see the parameters of pd.read_csv click this url: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
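
As a minimal sketch of how `pd.read_csv` behaves, the example below reads a small in-memory CSV instead of the URL, so it runs without a network connection. The three rows and the names `sample_csv` and `sample` are made up for illustration; only the column names mirror the Titanic file.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real data comes from the URL above.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "3,1,3,female,\n"
)

# read_csv accepts a URL, a file path, or any file-like object.
sample = pd.read_csv(sample_csv)

# head(n) returns the first n rows; the empty Age field is read as NaN.
print(sample.head(2))
print(sample["Age"].isna().sum())
```

Empty fields become missing values (NaN), which is why later steps need an imputer.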

# Question 1
1. Read the titanic data set into a dataframe variable called titanic.
2. Display the first 5 records in the dataframe.

In [2]:
# Question 1

# Data Structure and Summary Statistics
Display the data structure and the summary statistics using the info(), describe().T, and describe(include=['object']).T methods.

- info() parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info
- describe() parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe
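
The three calls can be sketched on a toy dataframe as follows. The frame `toy` and its values are hypothetical, and the string column is given the `object` dtype explicitly so that `include=['object']` selects it regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one string column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": pd.Series(["male", "female", "female", "male"], dtype="object"),
})

# info() prints column names, dtypes, and non-null counts.
toy.info()

# describe() summarizes numeric columns; .T puts one variable per row.
numeric_summary = toy.describe().T

# include=['object'] summarizes string columns: count, unique, top, freq.
string_summary = toy.describe(include=["object"]).T

print(numeric_summary)
print(string_summary)
```

The `count` column counts only non-missing values, so it is a quick way to spot columns with missing data.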



# Question 2
1. Use the appropriate dataframe variable and list the data structure using the info() method.
2. Use the appropriate dataframe variable and display the summary statistics for the numerical variables using describe().T.
3. Use the appropriate dataframe variable and display the summary statistics for the string variables using describe(include=['object']).T.

In [4]:
# Question 2 - 1)


In [5]:
# Question 2 - 2)


In [6]:
# Question 2 - 3)


# Drop variables that will not be used in the model
Some variables, such as PassengerId, Name, and Ticket, are not relevant to the model.  Cabin would be valuable if we knew that a missing value meant the passenger had no cabin, but we cannot make that assumption.  Because the Cabin column also has a large amount of missing data, we will delete it.

- drop parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html#pandas.DataFrame.drop
- dl = ['column1', 'column2', 'column3']  # list all the columns you want to drop in a list
- df.drop(columns=dl, inplace=True)  # columns= already targets the column axis, so axis=1 is not needed
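
The sketch below shows the drop pattern on a hypothetical toy frame, including why a second drop of the same columns fails. The names `toy`, `drop_cols`, and `trimmed` are placeholders, and `errors="ignore"` is shown only as one possible way to make a re-run harmless.

```python
import pandas as pd

# Hypothetical toy frame; the column names are placeholders.
toy = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "score": [0.5, 0.9]})

drop_cols = ["id", "name"]

# Returning a new frame (instead of inplace=True) leaves the original intact.
trimmed = toy.drop(columns=drop_cols)

# Dropping the same columns again raises KeyError because they are gone.
try:
    trimmed.drop(columns=drop_cols)
    second_drop_failed = False
except KeyError:
    second_drop_failed = True

# errors="ignore" skips labels that are not present.
trimmed_again = trimmed.drop(columns=drop_cols, errors="ignore")
print(list(trimmed_again.columns), second_drop_failed)
```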

# Question 3
1. Create a list of the column names you want to drop.  Run drop only once; running it a second time will generate an error because the columns no longer exist.
2. Use the drop method to drop the columns above from the titanic dataframe.  Put all code in one code cell.
3. Use info() method to display remaining columns.

In [10]:
# Question 3


# Determine balanced data
Is the data balanced?  Does each value of the target variable occur about equally often?

- df['column'].value_counts()
- value_counts method parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html#pandas.DataFrame.value_counts 
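
A minimal sketch of checking class balance on a made-up binary target. The series `target` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical binary target with 6 zeros and 4 ones.
target = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name="label")

# Absolute number of rows in each class, most frequent first.
counts = target.value_counts()

# normalize=True returns fractions that sum to 1 instead of counts.
shares = target.value_counts(normalize=True)

print(counts)
print(shares)
```

If the counts differ noticeably between classes, the target is imbalanced and the split and the metrics should account for that.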

# Question 4
1. Use the value_counts() method to determine whether the values of the target variable, Survived, occur equally often.

In [13]:
# Question 4


# Splitting the dataset
It is time to split the dataset into training and testing data sets.  Since the two classes in the Survived column are not equally represented, we will use a stratified random sample based on the Survived column.

- train_test_split parameters: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- train, test = train_test_split(df, test_size=0.2, random_state=33, stratify=df['target column'])
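
To illustrate what `stratify` does, the sketch below splits a hypothetical 60/40 imbalanced frame and checks that both parts keep the same class mix. The frame `toy`, the 75/25 split, and `random_state=0` are illustrative choices, not the values required by the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced frame: 60 rows of class 0 and 40 rows of class 1.
toy = pd.DataFrame({"feature": range(100), "label": [0] * 60 + [1] * 40})

# Passing a dataframe returns two dataframes: the training and testing parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["label"]
)

# With stratify, each part keeps the 60/40 class mix of the full frame.
print(toy["label"].value_counts(normalize=True))
print(toy_train["label"].value_counts(normalize=True))
print(toy_test["label"].value_counts(normalize=True))
```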

# Question 5
1. Split the data into train and test dataframes using the train_test_split function.
2. Split the data 70% in training and 30% in testing.
3. Use the random_state = 42. (seed for random number generator)
4. Stratify the data by the target variable.

In [16]:
# Question 5


# View the percentage in training and test set
We stratified the sample based on the target variable.  We can see the resulting distribution of the target variable in the training and test sets by using value_counts(normalize=True).

```
y.value_counts(normalize=True) # to get percentage in each number in Survived for training set
y_t.value_counts(normalize=True) # to get percentage in each number in Survived for test set
```
# Question 6
1. Determine the percentage for the target variable for the entire data set.
2. Determine the percentage for the target variable in the training data set.
3. Determine the percentage for the target variable in the testing data set.

In [18]:
# Question 6 - 1)


In [19]:
# Question 6 - 2)


In [20]:
# Question 6 - 3)


# Correlation of training data set
Create a correlation matrix of the training data set using the corr() method.
- corr() parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html#pandas.DataFrame.corr

```
df_t.corr(numeric_only=True)
```
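
A small sketch of how `corr(numeric_only=True)` behaves, using a hypothetical frame whose correlations are known in advance. The string column is left out of the result.

```python
import pandas as pd

# Hypothetical toy frame with known relationships between columns.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],               # rises exactly with x
    "z": [5, 4, 3, 2, 1],                # falls exactly as x rises
    "group": ["a", "b", "a", "b", "a"],  # string column, skipped
})

# Pearson correlation between every pair of numeric columns.
corr_matrix = toy.corr(numeric_only=True)
print(corr_matrix)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.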

# Question 7
1. Use the corr() method with numeric_only=True as the parameter, along with the appropriate dataframe, to find the correlation matrix of the training data set.


In [24]:
# Question 7


# Data Visualization
We will create a bar chart of the number of passengers who survived by Sex, and another of the number who survived by Embarked.  In Week 04 you learned to create a bar chart using the code below.

- groupby parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby
- dataframe bar plot parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html#pandas.DataFrame.plot.bar

```
dfg=train.groupby("x column", as_index=False)["target column"].sum()
dfg.plot.bar(x='x column', y='target column', rot=0)
```
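
The same groupby-then-plot pattern can be sketched on made-up data. The frame `toy` and the columns `team` and `won` are hypothetical stand-ins for the x column and the 0/1 target column.

```python
import pandas as pd

# Hypothetical toy frame: one categorical column and one 0/1 outcome column.
toy = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "blue"],
    "won": [1, 0, 1, 1, 0],
})

# Summing a 0/1 column within each group counts the rows where it equals 1.
wins = toy.groupby("team", as_index=False)["won"].sum()
print(wins)

# One bar per group; rot=0 keeps the x-axis labels horizontal.
ax = wins.plot.bar(x="team", y="won", rot=0)
ax.set_ylabel("number of wins")
```

Because `as_index=False` keeps the grouping column as a regular column, it can be passed directly as `x` to the plot.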

# Question 8
1. Group by Sex and total the number of survivors for each sex.
2. Create a bar plot of the grouped data for Sex and Survived.
3. Group by Embarked and total the number of survivors for each port of embarkation.
4. Create a bar plot of the grouped data for Embarked and Survived.

# Separate Features and Target Variables
It is now time to separate the Features and the target variables.

Use the drop('Target Variable', axis=1) method to drop the target variable.
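
A minimal sketch of the separation on a hypothetical frame. The names `toy_train`, `toy_X`, and `toy_y` and the column `label` are placeholders.

```python
import pandas as pd

# Hypothetical training frame with two features and a target column.
toy_train = pd.DataFrame({"f1": [1, 2, 3], "f2": [4, 5, 6], "label": [0, 1, 0]})

# The target is a single column (a Series).
toy_y = toy_train["label"]

# The features are everything except the target; drop returns a new frame.
toy_X = toy_train.drop("label", axis=1)

print(list(toy_X.columns), toy_y.tolist())
```

The same two steps apply to the testing dataframe.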

# Question 9
1. Using the appropriate dataframe for the training data assign the Target variable to y.
2. Using the appropriate dataframe for the training data assign the features to X by dropping the target variable.
3. Using the appropriate dataframe for the testing data assign the Target variable to y_test.
4. Using the appropriate dataframe for the testing data assign the features to X_test by dropping the target variable.

# Create numeric pipeline
It is now time to create the numeric pipeline.  Use what you learned in Week 04 to create the numeric pipeline.  This time replace missing values with the mean.

- **Pipeline** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline
- **SimpleImputer** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
- **StandardScaler** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
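
As a sketch of what an impute-then-scale pipeline does to a single numeric column, the example below runs one on a hypothetical array with a missing value. The step names `impute` and `scale` and the name `toy_num_pipeline` are arbitrary labels chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric column with one missing value.
toy_numeric = np.array([[1.0], [3.0], [np.nan], [5.0]])

# Steps run in order: fill missing values first, then standardize.
toy_num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

scaled = toy_num_pipeline.fit_transform(toy_numeric)

# The missing value was filled with the column mean (3.0), which scales to 0.
print(scaled.ravel())
```

After scaling, the column has mean 0 and standard deviation 1, which puts features on a comparable scale before they reach the model.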

<br>

# Question 10
1. Use the code learned in Week 04 to create a numeric pipeline.
2. Include SimpleImputer replacing the missing values with the mean (strategy='mean').
3. Include a StandardScaler.

# Create Categorical pipeline
It is now time to create the categorical pipeline.  Use what you learned in Week 04 to create the categorical pipeline.

- **make_pipeline** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html
- **SimpleImputer** parameters: see above
- **OneHotEncoder** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
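
The sketch below runs an impute-then-encode pipeline on a hypothetical one-column frame. The column `port` and its values are made up, and `handle_unknown="ignore"` is an optional setting shown here as an assumption rather than a requirement of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with one missing value.
toy_cat = pd.DataFrame({"port": ["S", "C", np.nan, "S"]}, dtype="object")

# make_pipeline names each step automatically after its class.
toy_cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),   # fills NaN with "S"
    OneHotEncoder(handle_unknown="ignore"),    # unseen categories encode as all zeros
)

# The encoder returns a sparse matrix by default; toarray() makes it dense.
encoded = toy_cat_pipeline.fit_transform(toy_cat).toarray()

# Columns follow the sorted categories: C, then S.
print(encoded)
```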

<br>

# Question 11
1. Use the code learned in Week 04 to create a categorical pipeline.
2. Include SimpleImputer replacing the missing values with the most frequent value (strategy='most_frequent').
3. Include a OneHotEncoder.

# Preprocessing Pipeline
Create a preprocessing pipeline to combine the numeric and categorical pipelines.

- **make_column_transformer** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer
- **make_column_selector** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector
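
One way the two functions fit together is sketched below on a hypothetical two-column frame. The columns `fare` and `port`, the name `toy_preprocessor`, and the choice to select columns by dtype are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame with one numeric and one string column.
toy = pd.DataFrame({
    "fare": [7.0, 71.0, np.nan, 8.0],
    "port": pd.Series(["S", "C", "S", np.nan], dtype="object"),
})

# Each tuple pairs a transformer with a selector that picks columns by dtype.
toy_preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
     make_column_selector(dtype_include=np.number)),
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     make_column_selector(dtype_include=object)),
)

transformed = toy_preprocessor.fit_transform(toy)

# One scaled numeric column plus two one-hot columns (C and S).
print(transformed.shape)
```

Selecting by dtype means the column lists do not have to be typed out by hand.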


# Question 12
1. Use what you learned in Week 04 to combine the numeric and categorical pipelines.

# Create model for Logistic Regression
The first model you will create is the Logistic Regression model, built in much the same way as the regression model you created earlier.  However, this model will include cross-validation within it.

- **LogisticRegressionCV** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV
- **Pipeline** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline

```
log_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegressionCV(cv=10, max_iter=200, solver='liblinear',
                                        random_state=42, class_weight='balanced')),
])
```
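
The fit-and-predict flow for such a pipeline can be sketched on synthetic data. Everything named `demo` here is hypothetical: the data comes from `make_classification`, and a plain `StandardScaler` stands in for the real preprocessor because the synthetic features are already numeric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data standing in for the preprocessed features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),  # placeholder for the real preprocessor
    ("classifier", LogisticRegressionCV(
        cv=5, max_iter=200, solver="liblinear",
        random_state=42, class_weight="balanced")),
])

# fit runs the cross-validation internally to choose the regularization strength.
demo_pipeline.fit(X_demo, y_demo)

# predict sends new data through the same preprocessing before classifying.
demo_pred = demo_pipeline.predict(X_demo)

# C_ holds the regularization strength selected by cross-validation.
print(demo_pipeline.named_steps["classifier"].C_)
```

Because preprocessing lives inside the pipeline, it is fitted only on the data passed to `fit`, and the same fitted steps are reused at prediction time.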

# Question 13
1. Create a pipeline named logistic_pipeline that includes a logistic regression with cross-validation.
2. The parameters for the LogisticRegressionCV class should be cv=5, max_iter=200, solver='liblinear', random_state=42, class_weight='balanced', and penalty='l2'.
3. Fit the model using the X and y variables.
4. Run the predict method using X and assign the result to y_pred.
5. Run the predict method using X_test and assign the result to y_pred_test.

In [None]:
# Question 13


In [None]:
# Question 13


# Determine the Metrics for Logistic Regression
Since the data is not balanced, accuracy alone can be misleading, so we will also compute the precision, recall, f1 score, and the confusion matrix.

- **accuracy_score** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
- **precision_score** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
- **recall_score** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
- **f1_score** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
- **confusion_matrix** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix

```
m1 = accuracy_score(y, y_pred)             # training set
m1t = accuracy_score(y_test, y_pred_test)  # testing set
print("Training accuracy score:", m1)
print("Testing accuracy score:", m1t)
m2 = precision_score(y, y_pred)
m3 = recall_score(y, y_pred)
m4 = f1_score(y, y_pred)
cm = confusion_matrix(y, y_pred)
cmn = confusion_matrix(y, y_pred, normalize='true')
print("Training Confusion Matrix\n", cm)
print("Training Confusion Matrix normalize \n", cmn)
```
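
To make the metrics concrete, the sketch below computes them on eight hand-made labels where the counts are easy to check: 3 true positives, 3 true negatives, 1 false positive, and 1 false negative. The lists `y_true_demo` and `y_pred_demo` are hypothetical.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: TP=3, TN=3, FP=1, FN=1.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true_demo, y_pred_demo)    # (TP + TN) / total
prec = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
rec = recall_score(y_true_demo, y_pred_demo)      # TP / (TP + FN)
f1 = f1_score(y_true_demo, y_pred_demo)           # harmonic mean of the two

# Rows are true classes and columns are predicted classes: [[TN, FP], [FN, TP]].
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# normalize='true' divides each row by its total, giving per-class rates.
cmn_demo = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")

print(acc, prec, rec, f1)
print(cm_demo)
print(cmn_demo)
```

With `normalize='true'`, the diagonal shows the fraction of each true class that was predicted correctly.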


# Question 14
1. Compute the accuracy score for the training data set and put the result in train_a.  Compute the accuracy score for the test data set and put the result in test_a.  Print both out.
2. Compute the precision score for the training and test data sets and put the results in train_p and test_p.  Print both out.
3. Compute the recall score for the training and test data sets and put the results in train_r and test_r.  Print both out.
4. Compute the f1_score for the training and test data sets and put the results in train_f1 and test_f1.  Print both out.
5. Compute the confusion matrix without normalize='true' and print it out for the training and testing data sets.
6. Compute the confusion matrix with normalize='true' and print it out for the training and testing data sets.  Put the results in train_cmn and test_cmn.


In [None]:
# Question 14  1)


In [None]:
# Question 14  2)


In [None]:
# Question 14  3)


In [None]:
# Question 14  4)


In [None]:
# Question 14  5)


In [None]:
# Question 14  6)


# Question 15
Is the model overfitting the data?  Answer yes or no.

In [None]:
ans=''