# Let's Get This Snowball Rolling

Now it's your turn to evaluate how a logistic regression model performs. This time, you'll do so while helping a fintech startup grow its user base more quickly. By applying the code you learned to customer data, you'll see how machine learning can accelerate a fintech firm's growth.

## Instructions

1. Read in the dataset about the current customers of the startup.

2. Split the data into X and y and then into testing and training sets.

3. Fit a logistic regression classifier.

4. Create the predicted values for the testing and the training data.

5. Print a confusion matrix for the training data.

6. Print a confusion matrix for the testing data.

7. Print the training classification report.

8. Print the testing classification report.

9. Answer the following question: How does the model performance compare between the training data and the testing data?


## Resources:

The following links point to the scikit-learn modules used in this activity:

[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

[classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

[confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)


In [2]:
# Import the required modules
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


## Step 1: Read in the dataset about the current customers of the startup.

In [4]:
# Read the usage_stats.csv file from the Resources folder into a Pandas DataFrame
customer_df = pd.read_csv(Path("usage_stats.csv"))
# YOUR CODE HERE

# Review the DataFrame
# YOUR CODE HERE
display(customer_df.head())
display(customer_df.tail())

|  | Usage Stats | Referral History | Customer Rank | target |
| --- | --- | --- | --- | --- |
| 0 | 1.054075 | -2.010163 | -0.918689 | 0 |
| 1 | 2.033251 | -0.212776 | -2.947451 | 0 |
| 2 | 1.049233 | -2.239878 | -0.77708 | 0 |
| 3 | 0.837035 | -1.926558 | -1.113686 | 0 |
| 4 | 1.19377 | -1.550953 | -1.539586 | 0 |


|  | Usage Stats | Referral History | Customer Rank | target |
| --- | --- | --- | --- | --- |
| 1205 | 1.97554 | -2.200099 | 0.345623 | 1 |
| 1206 | 2.093416 | -1.592133 | -1.300825 | 0 |
| 1207 | 2.010334 | -1.758225 | -1.173162 | 0 |
| 1208 | 4.451947 | -0.502815 | -2.35502 | 0 |
| 1209 | 2.141445 | -1.993869 | -0.946396 | 0 |


In [6]:
# Is the response (y, the "target" column) balanced?
customer_df["target"].value_counts()

0    1089
1     121
Name: target, dtype: int64

In [None]:
# No: the counts above show an imbalanced target (1,089 zeros vs. 121 ones)
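
To put the imbalance in numbers, the short sketch below turns the counts printed above into class shares. The counts are copied by hand from that output rather than read from `usage_stats.csv`, and the variable names (`counts`, `shares`, `baseline_accuracy`) are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 1089, 1: 121}, name="target")

# Share of each class in the data set
shares = counts / counts.sum()
print(shares)

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = shares.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Because a model that always predicts 0 would already be right about 90% of the time, accuracy alone says little here. The precision and recall for class 1 are the more informative numbers.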

## Step 2: Split the data into X and y and then into testing and training sets.

In [9]:
# Split the data into X (features) and y (target)

# The y variable should focus on the target column
y = customer_df["target"] 
# YOUR CODE HERE

# The X variable should include all features except the target
X = customer_df.drop(columns=["target"])
# YOUR CODE HERE
X

|  | Usage Stats | Referral History | Customer Rank |
| --- | --- | --- | --- |
| 0 | 1.054075 | -2.010163 | -0.918689 |
| 1 | 2.033251 | -0.212776 | -2.947451 |
| 2 | 1.049233 | -2.239878 | -0.777080 |
| 3 | 0.837035 | -1.926558 | -1.113686 |
| 4 | 1.193770 | -1.550953 | -1.539586 |
| ... | ... | ... | ... |
| 1205 | 1.975540 | -2.200099 | 0.345623 |
| 1206 | 2.093416 | -1.592133 | -1.300825 |
| 1207 | 2.010334 | -1.758225 | -1.173162 |
| 1208 | 4.451947 | -0.502815 | -2.355020 |


In [10]:
# Split into testing and training sets using train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)
# YOUR CODE HERE
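
With an imbalanced target, one optional variation is to pass `stratify=y` so that both splits keep roughly the same share of each class, and `random_state` so the split is repeatable. The sketch below is not the split used in this notebook. It runs on synthetic data from `make_classification` (an assumption standing in for the customer data), and the `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer data: 1,210 rows, 3 features, ~10% positives
X_demo, y_demo = make_classification(
    n_samples=1210,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    weights=[0.9, 0.1],
    random_state=1,
)

# Default test_size is 0.25; stratify keeps the class mix similar in both splits
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=1
)

print(len(X_train_demo), len(X_test_demo))
print(y_train_demo.mean(), y_test_demo.mean())
```

The two printed means are the share of class 1 in each split. With stratification they should be nearly equal.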


## Step 3: Fit a logistic regression classifier.

In [11]:
# Declare a logistic regression model.
# Apply a random_state of 9 to the model
logistic_regression_model = LogisticRegression(random_state=9)
# YOUR CODE HERE

# Fit and save the logistic regression model using the training data
lr_model = logistic_regression_model.fit(X_train, y_train)
# YOUR CODE HERE


## Step 4: Create the predicted values for the testing and the training data.

In [13]:
# Generate training predictions
training_predictions = lr_model.predict(X_train)
# YOUR CODE HERE

# Generate testing predictions
testing_predictions = lr_model.predict(X_test)
# YOUR CODE HERE
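
`predict` returns hard class labels. If you also want the estimated probabilities behind those labels, `predict_proba` provides them. The sketch below illustrates the relationship on synthetic data (an assumption, not the startup's data). The names `demo_model`, `labels`, and `prob_class_1` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the startup's customer data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=1
)
demo_model = LogisticRegression(random_state=9).fit(X_demo, y_demo)

# Hard class labels, as used for the confusion matrix and classification report
labels = demo_model.predict(X_demo)

# Estimated probability of class 1 for each row
prob_class_1 = demo_model.predict_proba(X_demo)[:, 1]

# For a binary model, predict() corresponds to thresholding that probability at 0.5
labels_from_proba = (prob_class_1 > 0.5).astype(int)
print(np.array_equal(labels, labels_from_proba))
```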


In [14]:
# Compute the accuracy on the training and testing data
train_score = lr_model.score(X_train, y_train)
test_score = lr_model.score(X_test, y_test)
display(train_score)
display(test_score)

0.9757442116868799

0.9702970297029703
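
For a classifier, `score` reports mean accuracy: the fraction of rows whose predicted label matches the true label. The sketch below checks that equivalence on synthetic data (an assumption standing in for the customer data). The names `demo_model`, `score_value`, and `manual_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (not the startup's customer data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=1
)
demo_model = LogisticRegression(random_state=9).fit(X_demo, y_demo)

# Accuracy as reported by the model's score() method
score_value = demo_model.score(X_demo, y_demo)

# The same quantity computed explicitly from the predictions
manual_accuracy = accuracy_score(y_demo, demo_model.predict(X_demo))

print(score_value, manual_accuracy)
```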

## Step 5: Print a confusion matrix for the training data.

In [16]:
# Import sklearn's confusion_matrix function
from sklearn.metrics import confusion_matrix
# YOUR CODE HERE

# Create and save the confusion matrix for the training data
training_matrix = confusion_matrix(y_train, training_predictions)
# YOUR CODE HERE

# Print the confusion matrix for the training data
# YOUR CODE HERE
training_matrix

array([[811,   7],
       [ 15,  74]])
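
In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so a binary matrix reads as `[[TN, FP], [FN, TP]]`. As a rough sketch, the snippet below re-enters the training matrix printed above by hand and labels each cell. The variable names are illustrative.

```python
import numpy as np

# Training confusion matrix copied from the output above
training_matrix_demo = np.array([[811, 7], [15, 74]])

# ravel() flattens row by row: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = training_matrix_demo.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Accuracy is the share of correct predictions (the diagonal) among all rows
train_accuracy = (tn + tp) / training_matrix_demo.sum()
print(f"Training accuracy: {train_accuracy:.4f}")
```

The accuracy computed this way should agree with the training score of about 0.9757 reported earlier.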

## Step 6: Print a confusion matrix for the testing data.

In [18]:
# Create and save the confusion matrix for the testing data
test_matrix = confusion_matrix(y_test, testing_predictions)
# YOUR CODE HERE

# Print the confusion matrix for the testing data
# YOUR CODE HERE
test_matrix

array([[267,   4],
       [  5,  27]])
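
The class 1 precision, recall, and F1 score can be worked out directly from these four cells. The sketch below does the arithmetic on the testing matrix printed above, re-entered by hand. The variable names are illustrative.

```python
import numpy as np

# Testing confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
test_matrix_demo = np.array([[267, 4], [5, 27]])
tn, fp, fn, tp = test_matrix_demo.ravel()

# Precision: of the rows predicted as class 1, how many really are class 1
precision_1 = tp / (tp + fp)

# Recall: of the rows that really are class 1, how many the model found
recall_1 = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"precision={precision_1:.2f}, recall={recall_1:.2f}, f1={f1_1:.2f}")
```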

## Step 7: Print the training classification report.

In [20]:
# Create and save the training classification report
training_report = classification_report(y_train, training_predictions)
# YOUR CODE HERE

# Print the training classification report
# YOUR CODE HERE
print(training_report)

              precision    recall  f1-score   support

           0       0.98      0.99      0.99       818
           1       0.91      0.83      0.87        89

    accuracy                           0.98       907
   macro avg       0.95      0.91      0.93       907
weighted avg       0.98      0.98      0.98       907



## Step 8: Print the testing classification report.

In [21]:
# Create and save the testing classification report
testing_report = classification_report(y_test, testing_predictions)
# YOUR CODE HERE

# Print the testing classification report
# YOUR CODE HERE
print(testing_report)

              precision    recall  f1-score   support

           0       0.98      0.99      0.98       271
           1       0.87      0.84      0.86        32

    accuracy                           0.97       303
   macro avg       0.93      0.91      0.92       303
weighted avg       0.97      0.97      0.97       303
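
To compare the two reports side by side, one option is `classification_report(..., output_dict=True)`, which returns the metrics as a dictionary instead of text. The sketch below is self-contained: it rebuilds label arrays that reproduce the two confusion matrices printed earlier, so it does not rely on the notebook's `y_train` or `y_test`. The helper `labels_from_matrix` and the other names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report


def labels_from_matrix(matrix):
    """Rebuild (y_true, y_pred) arrays that reproduce a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
    y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)
    return y_true, y_pred


# Confusion matrices copied from the outputs above
matrices = {
    "train": [[811, 7], [15, 74]],
    "test": [[267, 4], [5, 27]],
}

rows = {}
for split_name, matrix in matrices.items():
    y_true, y_pred = labels_from_matrix(matrix)
    report = classification_report(y_true, y_pred, output_dict=True)
    # Keep the class 1 metrics plus overall accuracy
    rows[split_name] = {
        "precision_1": report["1"]["precision"],
        "recall_1": report["1"]["recall"],
        "f1_1": report["1"]["f1-score"],
        "accuracy": report["accuracy"],
    }

comparison = pd.DataFrame(rows).T
print(comparison.round(2))
```

Laying the class 1 metrics next to each other makes it easier to judge whether the gap between training and testing performance is large enough to suggest overfitting.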



## Step 9: Answer the following question

**Question:** How does the model's performance compare between the training data and the testing data?

**Sample Answer:** # YOUR ANSWER HERE 