<a id='I' name="I"></a>
## [Introduction](#P0)

Workflow:

- Read the dataset information and details at the source.
- Download the dataset using pandas.
- List all the columns and try to interpret each one.
- Identify artifacts and other unusual things at first.
- Run the custom `analyze_dataframe` helper for a second, deeper pass (level 2).
- Exploratory data analysis
- Feature selection
- Scaling and normalization
- Data splitting: train, validation, and test sets

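The data-splitting step in the workflow mentions train, validation, and test sets. One way to get three sets is to call `train_test_split` twice, as in the minimal sketch below. The synthetic arrays, the `_demo` names, and the 60/20/20 ratio are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, 3 features, binary label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# First carve out 20% of the rows as the held-out test set
X_tmp, X_test_demo, y_tmp, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Then take 25% of the remainder (20% of the total) as the validation set
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```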
<a id='SU' name="SU"></a>
## [Set up](#P0)

### Magics

### Packages

In [1]:
# General
import pandas as pd
import numpy as np

# modelling
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt


from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder


In [2]:
plt.style.use("fivethirtyeight")

In [3]:
# check pycaret version
from pycaret.utils import version
version()

'3.3.1'

### Custom classes and functions

In [4]:
import warnings
warnings.filterwarnings("ignore")

import sys
sys.path.append('/Users/ayushyapare/Desktop/Ayushya/Snippets')

from DataFrame_Analysis import analyze_dataframe, eda

ModuleNotFoundError: No module named 'DataFrame_Analysis'

### Global Parameters Setting

## Data Retrieval and introduction


__Telecom customers__: A dataset describing customers of a telecom company. Based on their usage and other factors, the customers can be clustered into segments, and a model can predict whether a customer is going to churn (cancel the subscription).

#### Download Data

In [5]:
# import the dataset
df = pd.read_csv('../data/raw/telecom_users.csv', index_col=0)

#### Data exploration

In [None]:
# Basic:
# 1. Shape
# 2. Columns - look for artifacts in column name
# 3. Info - look for appropriate datatypes 
# 4. Describe - look for min max mean and std. 

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.head(5)

In [None]:
df.describe()

In [None]:
# Check for missing values in the DataFrame
df.isnull().sum()

In [None]:
# Count fully duplicated rows
df.duplicated().sum()


In [None]:
df.columns

In [None]:
# Advanced (Separate categorical and numerical)
# 1. value counts | Unique values | Missing values
# 2. Explore column of interest
#    1. Hist / Countplot
#    2. Boxplot

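The "advanced" checklist above (separate categorical and numerical columns, then look at unique values, missing values, and value counts) could be sketched as follows. The toy frame and its values are hypothetical; only the column names mirror the telecom dataset.

```python
import pandas as pd

# Toy frame with hypothetical values (assumption); not the real data
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
    "Churn": ["No", "No", "Yes", "No"],
})

# Separate numerical and categorical columns by dtype
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Per-column summary: number of unique values and of missing values
summary = pd.DataFrame({
    "n_unique": toy.nunique(),
    "n_missing": toy.isna().sum(),
})
print(summary)

# Frequency table for each categorical column
for col in cat_cols:
    print(toy[col].value_counts(dropna=False))
```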
#### Perform exploratory data analysis on each column

___Initial observations___   
_Columns that may carry little information:_
1. PaperlessBilling may be unimportant

_Some ambiguous column names can be changed:_   
1. Partner --> Married
2. Dependents --> Children

_TotalCharges must be numerical_

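A quick way to see why a charges column is not parsed as numeric is to coerce it and inspect the entries that fail. This is a minimal sketch on a hypothetical four-value series (the values are made up); `errors="coerce"` turns unparseable strings into `NaN` instead of raising.

```python
import pandas as pd

# Hypothetical sample (assumption): one entry is a blank string
raw = pd.Series(["29.85", " ", "1889.5", "108.15"], name="TotalCharges")

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

# Rows that were present in the raw data but could not be parsed
bad_rows = raw[converted.isna() & raw.notna()]
print(bad_rows)

# One possible policy: treat blanks as zero charges
cleaned = converted.fillna(0.0)
print(cleaned.dtype)
```

Whether blanks become 0 or stay missing is a modelling choice; the sketch only makes the affected rows visible before that choice is applied.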

In [None]:
# Rename columns for avoiding ambiguity
df.rename(columns={'Partner': 'Married', 'Dependents': 'Children'}, inplace=True)

In [None]:
# Replace blank strings (' ') with '0' so that columns such as TotalCharges can be converted to numbers

In [None]:
incorrect_values = [" "]
df.replace(incorrect_values, '0', inplace=True)


In [None]:
df['TotalCharges'] = pd.to_numeric(df.TotalCharges)

In [None]:
df.head(5)

In [None]:
# Replace Senior citizen 1 with 'yes' and 0 with 'no'
#df['SeniorCitizen'] = df['SeniorCitizen'].replace({1:'yes',0: 'no'})

# Replace gender with 1:Male and 0:Female
#df['gender'] = df['gender'].replace({'Male':1,'Female':0})

# Replace churn
#df['Churn'] = df['Churn'].replace({'Yes': 1, 'No': 0})


In [None]:
# Save the cleaned DataFrame to a CSV file
df.to_csv('../data/processed/telecom_users.csv', index=False)

In [None]:
# 1. Drop customerID since it carries no statistical information
# 2. Drop Churn, at least for the clustering

df.drop(columns=['customerID'], inplace=True)

In [None]:
# Perform EDA now
analyze_dataframe(df)

___Observations___
1. Customers categorized as 'No internet service' are ambiguous for the model.

Example:   
StreamingMovies    
No                     2356   
Yes                    2339   
No internet service    1291

The third category is irrelevant for the model, which only needs to know whether a customer streams movies or not.


Suggestions:   
- Group these customers into a new category.
- Remove them from this feature, as they are already separated out in the 'InternetService' column.
- Change the value 'No internet service' to 'No'. It is reasonable to assume that these customers do not stream movies or, if they do, not through our client's network, so they are categorized under 'No'.

In [None]:
# Recode 'No internet service' as 'No' in the internet-dependent features
internet_service_features = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

for feature in internet_service_features:
    df[feature] = df[feature].replace('No internet service', 'No')

#df['MultipleLines'] = df['MultipleLines'].replace('No phone service', 'No')

In [None]:
# Drop PaperlessBilling
df.drop(columns=['PaperlessBilling'], inplace=True)

In [None]:
analyze_dataframe(df)

TODOS:
1. Scaling and transform
2. Clustering - K-Means, K-Medoids, Hierarchical, DBSCAN
3. Silhouette and Elbow methods (Number of clusters)
4. Dimensionality reduction for visualization - PCA
5. Visualization

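The TODO list names hierarchical clustering and DBSCAN next to K-Means. As a minimal sketch of how those two differ in use, the example below runs both on synthetic, well-separated blobs. The data, `eps`, and `min_samples` are assumptions chosen for the toy data and would need tuning on the telecom features.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three well-separated blobs
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=9,
)
X_scaled = StandardScaler().fit_transform(X_demo)

# Hierarchical clustering is told how many clusters to produce
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled)

# DBSCAN infers the number of clusters from density; label -1 marks noise
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

n_hier_clusters = len(set(hier_labels))
n_db_clusters = len(set(db_labels) - {-1})
print("hierarchical clusters:", n_hier_clusters)
print("DBSCAN clusters:", n_db_clusters, "noise points:", int((db_labels == -1).sum()))
```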
In [None]:
# Separate the features (X) from the Churn label (y)
X = df.drop(columns='Churn')
y = df['Churn']

In [None]:
# Standard scaling for numerical columns
# One Hot Encoding for categorical columns

In [6]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

# Select numeric and categorical columns
numeric_cols = X.select_dtypes(include=[np.number]).columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns

# Define the preprocessor
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(), categorical_cols)
])

NameError: name 'X' is not defined

In [None]:
# Create and fit the KMeans model with preprocessing
def fit_kmeans(n_clusters, X):
    kmeans_pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("cluster", KMeans(n_clusters=n_clusters, random_state=9, verbose=0))
    ])
    kmeans_pipeline.fit(X)
    return kmeans_pipeline.named_steps["cluster"].inertia_

# Compute WCSS for different numbers of clusters
cluster_errors = []

for n_clusters in range(2, 11):
    wcsse = fit_kmeans(n_clusters,X)
    print('K = ', n_clusters, '\tWCSS Err. = ', wcsse)
    cluster_errors.append(wcsse)

# Plot the SSE for different numbers of clusters
plt.plot(range(2, 11), cluster_errors, "o-")
plt.xlabel("No. Clusters")
plt.ylabel("SSE")
plt.title("Elbow Method")
plt.show()

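The elbow plot relies on `inertia_`, which scikit-learn documents as the sum of squared distances of samples to their closest cluster centre. To make the y-axis concrete, the sketch below recomputes that quantity by hand on synthetic blobs (an assumption, not the telecom data) and compares it with the fitted attribute.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=9)

km = KMeans(n_clusters=3, n_init=10, random_state=9).fit(X_demo)

# Look up the centre assigned to every sample, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
manual_wcss = float(((X_demo - assigned_centers) ** 2).sum())

print("manual WCSS:", manual_wcss)
print("inertia_   :", km.inertia_)
```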
In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score


silhouette_s = []

for n_clusters in range(2, 11):
    kmeans_pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("cluster", KMeans(n_clusters=n_clusters, random_state=9, verbose=0))
    ])

    # Fit the pipeline and get the cluster labels
    cluster_labels = kmeans_pipeline.fit_predict(X)
    
    # Get the preprocessed data
    X_tr = kmeans_pipeline.named_steps["preprocessor"].transform(X)
    
    silhouette_avg = round(float(silhouette_score(X_tr, cluster_labels)), 4)
    print(f"For n_clusters = {n_clusters}, The average silhouette_score is : {silhouette_avg}")
    
    silhouette_s.append(silhouette_avg)

# Plot the Silhouette Scores for different numbers of clusters
plt.plot(range(2, 11), silhouette_s, "o-")
plt.xlabel("No. Clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Scores for Different Numbers of Clusters")
plt.show()

<pre>

| Range       | Interpretation                                |
|-------------|-----------------------------------------------|
| 0.71 - 1.00 | A strong structure has been found.            |
| 0.51 - 0.70 | A reasonable structure has been found.        |
| 0.26 - 0.50 | The structure is weak and could be artificial.|
| ≤ 0.25      | No substantial structure has been found.      |

</pre>

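The interpretation table can be turned into a small lookup so that each printed silhouette score comes with its label. The function name `interpret_silhouette` is hypothetical, and the cut-offs are one reading of the table's two-decimal ranges.

```python
def interpret_silhouette(score: float) -> str:
    """Map an average silhouette score to the table's interpretation."""
    # Cut-offs follow the table: above 0.70, above 0.50, above 0.25, otherwise none
    if score > 0.70:
        return "A strong structure has been found."
    if score > 0.50:
        return "A reasonable structure has been found."
    if score > 0.25:
        return "The structure is weak and could be artificial."
    return "No substantial structure has been found."


# Hypothetical scores, only to exercise each branch
for demo_score in (0.82, 0.55, 0.31, 0.12):
    print(demo_score, "->", interpret_silhouette(demo_score))
```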
In [None]:
# Visualization

In [None]:
df.columns

In [7]:
from sklearn.decomposition import PCA

In [None]:
# Dimensionality Reduction (PCA) before Clustering

In [None]:
from sklearn.decomposition import PCA

cluster_errors = []

for n_cluster in range(1, 14):
    pipe_pca_kmean = Pipeline(
        [
            ("preprocessor", preprocessor), 
            ("pca", PCA(0.90)), 
            ("cluster", KMeans(n_clusters=n_cluster, random_state=9))]
    )

    pipe_pca_kmean.fit_predict(X)
    cluster_errors.append(pipe_pca_kmean.named_steps["cluster"].inertia_) 

#plt.clf()
plt.plot(range(1, 14), cluster_errors, "o-")
plt.xlabel("n_clusters")
plt.ylabel("wss")
plt.show()

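Passing a float such as `PCA(0.90)` asks scikit-learn to keep the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below shows this on synthetic, correlated features (an assumption; the telecom features are not used here) by inspecting `n_components_` and `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): six features driven by two latent factors
rng = np.random.default_rng(9)
base = rng.normal(size=(300, 2))
mixed = base @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(300, 4))
X_demo = np.hstack([base, mixed])

X_scaled = StandardScaler().fit_transform(X_demo)

# Keep as many components as needed to explain at least 90% of the variance
pca = PCA(n_components=0.90).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print("components kept:", pca.n_components_)
print("cumulative explained variance:", pca.explained_variance_ratio_.cumsum())
```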
In [None]:
K = 3

# Define the pipeline with preprocessing, PCA, and KMeans clustering
pipe_pca_kmean_f = Pipeline([
    ("preprocessor", preprocessor), 
    ("pca", PCA(0.90)), 
    ("cluster", KMeans(n_clusters=K, random_state=9))
])

# Fit the pipeline and get the cluster labels
X['kmean_cluster'] = pipe_pca_kmean_f.fit_predict(X)

# Get the cluster inertia
cluster_errors = []
cluster_errors.append(pipe_pca_kmean_f.named_steps["cluster"].inertia_) 

X.head()

#### Segmentation

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(data=X, x='InternetService', hue='kmean_cluster', palette='viridis')
plt.xlabel('Internet Service')
plt.ylabel('Number of Customers')
plt.title('Number of Customers in Each Segment by Internet Service')
plt.legend(title='Customer Segment', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Filter the data for customers who stream movies
streaming_df = X[X['StreamingMovies'] == 'Yes']

# Plotting the number of customers from each segment who stream movies based on the internet service they use
plt.figure(figsize=(12, 6))
sns.countplot(data=streaming_df, x='InternetService', hue='kmean_cluster', palette='viridis')
plt.xlabel('Internet Service')
plt.ylabel('Number of Customers')
plt.title('Number of Customers Who Stream Movies in Each Segment by Internet Service')
plt.legend(title='Customer Segment', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
mean_total_charges = X.groupby(['kmean_cluster', 'Contract'])['TotalCharges'].mean().reset_index()

# Plotting the mean total charges by contract type for each segment
plt.figure(figsize=(14, 8))
sns.barplot(data=mean_total_charges, x='Contract', y='TotalCharges', hue='kmean_cluster', palette='viridis')
plt.xlabel('Contract Type')
plt.ylabel('Mean Total Charges')
plt.title('Mean Total Charges by Contract Type for Each Segment')
plt.legend(title='Customer Segment', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
mean_total_charges = X.groupby(['kmean_cluster', 'InternetService'])['TotalCharges'].mean().reset_index()

# Plotting the mean total charges by internet service type for each segment
plt.figure(figsize=(14, 8))
sns.barplot(data=mean_total_charges, x='InternetService', y='TotalCharges', hue='kmean_cluster', palette='viridis')
plt.xlabel('Internet Service')
plt.ylabel('Mean Total Charges')
plt.title('Mean Total Charges by Internet Service Type for Each Segment')
plt.legend(title='Customer Segment', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:

# Plot the results
plt.figure(figsize=(8, 4))
ax = sns.scatterplot(
    x='tenure',
    y='MonthlyCharges',
    hue='kmean_cluster',
    data=X,
    palette='viridis'
)
ax.legend(bbox_to_anchor=(1.04, 1.02), loc='upper left', fontsize='large')
plt.tight_layout()
plt.show()

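The segment plots can be complemented with a numeric profile of each cluster. The sketch below uses a hypothetical six-row frame (the values are made up; only the column names follow the notebook) to show a cross-tabulation of segment against internet service and the per-segment feature means.

```python
import pandas as pd

# Hypothetical labelled customers (assumption); not the real clustering output
toy = pd.DataFrame({
    "kmean_cluster": [0, 0, 1, 1, 2, 2],
    "InternetService": ["DSL", "DSL", "Fiber optic", "Fiber optic", "No", "No"],
    "tenure": [2, 4, 30, 34, 60, 70],
    "MonthlyCharges": [50.0, 60.0, 90.0, 100.0, 20.0, 22.0],
})

# Customers per internet service type and segment (the numbers behind a countplot)
counts = pd.crosstab(toy["InternetService"], toy["kmean_cluster"])
print(counts)

# Mean of the numeric features per segment
profile = toy.groupby("kmean_cluster")[["tenure", "MonthlyCharges"]].mean()
print(profile)
```

Reading the two tables side by side helps to describe each segment in words, which the plots alone do not always make easy.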

## Classification

In [None]:
# 1. define features and labels
# 2. choose features
# 3. train test split

In [None]:
df.columns

In [None]:
Xx = df.drop(columns='Churn')
yy = df['Churn']

In [None]:
# Train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(Xx, yy, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape

In [3]:
from pycaret.classification import *

RuntimeError: ('Pycaret only supports python 3.9, 3.10, 3.11. Your actual Python version: ', sys.version_info(major=3, minor=8, micro=18, releaselevel='final', serial=0), 'Please UPGRADE your Python version.')

In [None]:
%env MLFLOW_TRACKING_URI=https://2521-2a02-168-57c6-0-68e5-3a-ff22-7e75.ngrok-free.app

In [None]:

clsfr = setup(
    data=pd.concat([X_train, y_train], axis=1),
    target = 'Churn',
    session_id=9,
    #max_encoding_ohe=600, # columns with 600 or less categories will be One-hot encoded ELSE target encoding
    #rare_to_value=0.008, # Categories with less than 0.008 (0.8%) of the data will be grouped into a new category (Other)
    #rare_value='Other',
    fix_imbalance = True,
    fix_imbalance_method = 'SMOTE',
    transformation = True,
    transformation_method = 'yeo-johnson',
    experiment_name='Clsfctn_tel_cust_ayushya_(dm)',
    log_experiment = False,
    normalize=True,  # True, False
    normalize_method='zscore',  # 'zscore', 'minmax', 'maxabs', 'robust'
    n_jobs=-1)

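The `setup` call above requests a Yeo-Johnson transformation and z-score normalization. As a rough analogue only, the sketch below applies scikit-learn's `PowerTransformer` to a synthetic skewed feature. It is an assumption that PyCaret's pipeline behaves comparably; the sketch does not reproduce its internals.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (assumption), e.g. a charges-like column
rng = np.random.default_rng(9)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=(500, 1))

# Yeo-Johnson power transform followed by standardization to zero mean, unit variance
pt = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pt.fit_transform(skewed)

print("fitted lambda:", pt.lambdas_[0])
print("mean:", transformed.mean(), "std:", transformed.std())
```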
7. Train and compare models

In [None]:
best_models = compare_models(
    fold=5,
    n_select=1,
    sort='f1',
)

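The models are ranked by F1, the harmonic mean of precision and recall. The arithmetic can be worked through in plain Python on a small, made-up set of labels (1 = churn, 0 = no churn); the label lists are hypothetical.

```python
# Hypothetical true and predicted labels (assumption)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Count true positives, false positives, and false negatives
pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of real churners who were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```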
8. Save to MLflow and analyse

In [None]:
#!mlflow ui

9. Choose and analyse the best model

10. Tune the hyperparameters

In [None]:
lr_model = create_model('lr')

param_grid = {
    'C': [0.01,0.05, 0.1,0.5, 1, 10, 100]
}

# Tune the model
tuned_model = tune_model(lr_model, custom_grid=param_grid)

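For comparison, the same `C` grid can be searched with plain scikit-learn. This is a sketch of an analogous search on synthetic data, not a description of what `tune_model` does internally; whether PyCaret evaluates every grid value depends on its own search settings. The `_demo` names and the data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=9)

# Same regularization strengths as in the cell above
param_grid_demo = {"C": [0.01, 0.05, 0.1, 0.5, 1, 10, 100]}

# Exhaustive search over the grid with 5-fold cross-validation, scored by F1
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid_demo,
    scoring="f1",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```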
11. Analyse the performance of the model

In [None]:
plot_model(tuned_model)

In [None]:
plot_model(tuned_model,plot='learning')

In [None]:
plot_model(tuned_model,plot='error')

In [None]:
plot_model(tuned_model,plot='confusion_matrix')

In [None]:
plot_model(tuned_model,plot='feature')

12. Finalize, predict, and save the chosen model