In [None]:
# Visualise the top 10 'daysincelastorder' values as a stacked bar plot
# Get the top 10 'daysincelastorder' values with the highest churn counts
top_10 = grouped.sum(axis=1).nlargest(10)

# Filter the original DataFrame for the top 10 'daysincelastorder' values
df_top_10 = df[df['daysincelastorder'].isin(top_10.index)]

# Create a pivot table to prepare data for a stacked bar plot
pivot_table = df_top_10.pivot_table(index='daysincelastorder', columns='churn', aggfunc='size', fill_value=0)

# Create a stacked bar plot
ax = pivot_table.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.xlabel('Days Since Last Order')
plt.ylabel('Count')
plt.title('Stacked Bar Plot of Churn vs. Days Since Last Order')
plt.legend(title='Churn', labels=['0- Not Churned', '1- Churned'])

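# Annotate the values on the bars.
# A pandas bar plot places bars at integer positions 0..n-1, not at the index values,
# so the x coordinate must be the bar position rather than 'daysincelastorder' itself.
for pos, (_, row) in enumerate(pivot_table.iterrows()):
    plt.annotate(f'0: {row[0]}\n1: {row[1]}', (pos, row[0] + row[1] / 2), ha='center')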

plt.show()

In [None]:
# Filter the DataFrame to include only churn (churn equals 1)
churned_df = df[df['churn'] == 1]
churned_df.head()

In [None]:
# Group by 'daysincelastorder' and calculate the count of churn values (churn equals 1)
grouped = churned_df.groupby('daysincelastorder')['churn'].count()

# Get the top 5 'daysincelastorder' values with the highest churn count
top_5 = grouped.nlargest(5)

print(top_5)

In [None]:
# Create a bar chart using Plotly (px.bar draws one bar per category, not a histogram)
fig = px.bar(
    x=top_5.index,
    y=top_5,
    title='Churn Counts for the Top 5 Days Since Last Order',
    labels={'x': 'Days Since Last Order', 'y': 'Churn Count'},
)
# Display the chart
fig.show()

In [None]:
# ROC Curve
# Predicted probability of the positive class, computed once and reused below
y_proba = logistic_model.predict_proba(X_test)[:, 1]
logistic_roc_auc = roc_auc_score(y_test, y_proba)
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {logistic_roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Feature selection
# Why perform feature selection?
# Feature selection is the process of selecting the most important features for a machine learning model.
# It is done to improve model performance, reduce model complexity,
# and make the model more interpretable.

# Information gain: This measure assesses how much knowing a feature reduces uncertainty (entropy) about the target variable.
#        Features with high information gain are typically more important for predicting the target variable.
# Mutual information: This is the same quantity viewed symmetrically. It measures the dependence between a feature and the target,
#       including nonlinear relationships that a linear correlation would miss. Features with high mutual information are typically more important for predicting the target variable.
# Recursive feature elimination: This method works by iteratively removing the least important feature from a model
#       until a stopping criterion is met. The stopping criterion may be based on the model performance, the number of features remaining, or some other measure.
# LASSO and Ridge regression: Both regularization methods shrink coefficients. LASSO (L1) can shrink the coefficients of unimportant features exactly to zero,
#       effectively removing them from the model. Ridge (L2) only shrinks coefficients toward zero, so it does not remove features by itself.
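
The three approaches above can be sketched with scikit-learn. This is a minimal illustration, not the notebook's actual pipeline: the data comes from `make_classification` as a synthetic stand-in for the churn DataFrame, and the choices `n_features_to_select=3` and `C=0.05` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption: not the notebook's df).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

# 1) Mutual information between each feature and the target.
mi_scores = mutual_info_classif(X, y, random_state=0)

# 2) Recursive feature elimination down to 3 features (arbitrary choice).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

# 3) L1 (LASSO-style) penalty: a strong penalty can set coefficients to exactly zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
l1_model.fit(X, y)

print("Mutual information:", mi_scores.round(3))
print("RFE keeps features:", rfe.support_)
print("L1 coefficients:", l1_model.coef_.ravel().round(3))
```

Comparing the three outputs side by side shows whether the methods agree on which features matter. Disagreement is common and is worth investigating before dropping columns.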

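Question: Is it possible for precision, recall, and accuracy to have the same value?

Yes. Precision equals recall whenever the number of false positives equals the number of false negatives. Accuracy matches both when, in addition, the number of true positives equals the number of true negatives, or when the model makes no errors at all (all three are then 1). A balanced dataset with equal numbers of true positives, true negatives, false positives, and false negatives is one special case, where all three metrics equal 0.5.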
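
A small worked check, using hypothetical confusion-matrix counts chosen so that false positives equal false negatives and true positives equal true negatives:

```python
# Hypothetical counts (not taken from the churn model): FP == FN and TP == TN.
tp, tn, fp, fn = 40, 40, 10, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, precision, recall)  # 0.8 0.8 0.8
```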

Meaning of the following evaluation metrics
Evaluation metrics are used to assess the performance of machine learning models, particularly in classification tasks. Here are some commonly used evaluation metrics, along with their formulas and explanations:

Accuracy:
    Formula: (TP + TN) / (TP + TN + FP + FN)
    Explanation: Accuracy measures the proportion of correctly predicted instances out of all instances. It is a straightforward metric but may not be suitable for imbalanced datasets.

Precision (Positive Predictive Value):
    Formula: TP / (TP + FP)
    Explanation: Precision measures the proportion of true positive predictions among all positive predictions. It is valuable when minimizing false positives is critical.

Recall (Sensitivity, True Positive Rate):
    Formula: TP / (TP + FN)
    Explanation: Recall measures the proportion of true positives among all actual positives. It is valuable when minimizing false negatives is critical.

F1 Score:
    Formula: 2 * (Precision * Recall) / (Precision + Recall)
    Explanation: The F1 score is the harmonic mean of precision and recall. It balances precision and recall and is useful when you want to find a compromise between the two.

Specificity (True Negative Rate):
    Formula: TN / (TN + FP)
    Explanation: Specificity measures the proportion of true negatives among all actual negatives. It is useful when you want to evaluate a model's performance in correctly identifying negatives.
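
The five formulas above can be computed directly from the confusion-matrix counts. The sketch below uses plain Python; the helper names `confusion_counts` and `classification_metrics` and the sample labels are illustrative assumptions, and undefined ratios (zero denominators) are returned as 0.0 by convention.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and specificity from labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here TP = 3, FN = 1, FP = 2, and TN = 4, so precision (3/5) and recall (3/4) differ, and the F1 score falls between them.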

ROC Curve (Receiver Operating Characteristic Curve):
    Formula: ROC is not a single numeric metric but a graphical representation of the true positive rate (Recall) against the false positive rate at various thresholds.
    Explanation: The ROC curve shows the trade-off between true positives and false positives at different classification thresholds. The area under the ROC curve (AUC-ROC) is often used as a metric, with higher values indicating better model performance.

AUC-ROC (Area Under the ROC Curve):
    Formula: Calculated as the area under the ROC curve.
    Explanation: AUC-ROC summarizes the model's ability to distinguish between positive and negative classes. A higher AUC-ROC indicates better model performance.
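
One way to build intuition for AUC-ROC: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it that way in plain Python. The function name `pairwise_roc_auc` and the sample scores are illustrative assumptions, and the quadratic loop is for clarity only; in practice `roc_auc_score` does the job.

```python
def pairwise_roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_roc_auc(y_true, scores)
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly (only 0.35 versus 0.4 is not), which gives 0.75.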

AUC-PR (Area Under the Precision-Recall Curve):
    Formula: Calculated as the area under the precision-recall curve.
    Explanation: AUC-PR summarizes the trade-off between precision and recall. It is useful for imbalanced datasets where the positive class is rare.

Log Loss (Logarithmic Loss, Cross-Entropy Loss):
    Formula: -(1/n) * Σ(y log(p) + (1 - y) log(1 - p)), where y is the actual class (0 or 1), p is the predicted probability of the positive class, and n is the number of samples.
    Explanation: Log Loss measures how close a classifier's predicted probabilities are to the actual labels, and it penalizes confident wrong predictions heavily. Lower log loss values indicate better probability estimates.
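
A minimal sketch of the log loss calculation in plain Python. It averages over samples (the summed form differs only by the factor 1/n), and it clips probabilities away from 0 and 1 to avoid `log(0)`. The clipping constant `eps` and the sample probabilities are assumptions made for illustration.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Average binary cross-entropy; probs are P(class == 1)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Same labels, two sets of hypothetical predictions.
confident = log_loss([1, 0, 1], [0.9, 0.1, 0.8])
hesitant = log_loss([1, 0, 1], [0.6, 0.4, 0.55])
print(f"confident: {confident:.3f}, hesitant: {hesitant:.3f}")
```

Both sets of predictions classify every sample correctly at a 0.5 threshold, yet the confident set gets the lower loss, which is the behavior accuracy alone cannot capture.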

Mean Absolute Error (MAE):
    Formula: (1/n) * Σ|actual - predicted|
    Explanation: MAE measures the average absolute difference between actual and predicted values. It is commonly used in regression tasks.

Mean Squared Error (MSE):
    Formula: (1/n) * Σ(actual - predicted)^2
    Explanation: MSE measures the average squared difference between actual and predicted values. It penalizes larger errors more than MAE and is also used in regression tasks.

Root Mean Squared Error (RMSE):
    Formula: √MSE
    Explanation: RMSE is the square root of MSE. It's in the same units as the target variable and provides a measure of the average magnitude of errors. RMSE is more sensitive to outliers compared to MAE.
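
The three regression metrics can be computed side by side to see how a single large error affects each one. This is a plain-Python sketch; the function name `regression_errors` and the sample values are made up for illustration.

```python
import math


def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    return mae, mse, math.sqrt(mse)


actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]
mae, mse, rmse = regression_errors(actual, predicted)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
```

The errors are 1, 0, -2, and 0. The error of 2 contributes four times as much to MSE as the error of 1, which is why RMSE (about 1.118) ends up above MAE (0.75).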

In [None]:
# Deploying model to cloud
- Create a web service: it is used to communicate between electronic devices over a network.