In [0]:
# To run this notebook in Databricks Community Edition, follow these steps:

# 1. Create a new notebook: Log in to your Databricks Community Edition account, and from the homepage, click "Create" and then "Notebook". Name your notebook (e.g., "TelcoChurnAnalysis"), and select "Python" as the language. Click "Create" to create the notebook.

# 2. Check the libraries: PySpark ships with the Databricks Runtime, so no installation is normally needed. The call below is therefore left commented out.

## dbutils.library.installPyPI("pyspark")

# 3. Upload the dataset: Click on the "Data" tab in the left sidebar. Click on the "Add Data" button, then click on "Browse" to select and upload your dataset (telco_churn.csv). Once uploaded, you'll see a file path similar to /FileStore/tables/telco_churn.csv. Copy this path to use in the code.

# 4. Add and run the code: Copy the code cells below into your notebook, and replace the path passed to spark.read.csv with the actual path to your dataset from step 3. Run each cell by clicking the "Run" button.

# After running the code, you should see the performance metrics for both the Decision Tree and Logistic Regression models printed below the last code cell.

In [0]:
# PART 1: Decision Tree Analysis using Apache Spark - 40 points

In [0]:
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier, LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator


In [0]:
# Load the Telco Churn dataset
spark = SparkSession.builder.appName("TelcoChurnDecisionTree").getOrCreate()
data = spark.read.csv("dbfs:/FileStore/shared_uploads/jindalkalash298@gmail.com/poc1/v2/WA_Fn_UseC__Telco_Customer_Churn.csv", header=True, inferSchema=True)

In [0]:
# Convert TotalCharges to float; values that cannot be parsed (e.g. blank strings) become null
data = data.withColumn("TotalCharges", col("TotalCharges").cast("float"))
data = data.na.drop()  # Drop rows that contain a null in any column, including failed casts

In [0]:
# Prepare the data for modeling

categorical_features = [
    'gender',
    'SeniorCitizen',
    'Partner',
    'Dependents',
    'PhoneService',
    'MultipleLines',
    'InternetService',
    'OnlineSecurity',
    'OnlineBackup',
    'DeviceProtection',
    'TechSupport',
    'StreamingTV',
    'StreamingMovies',
    'Contract',
    'PaperlessBilling',
    'PaymentMethod'
]

numerical_features = [
    'tenure',
    'MonthlyCharges',
    'TotalCharges'
]

indexers = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in categorical_features]
assembler = VectorAssembler(inputCols=[column+"_index" for column in categorical_features] + numerical_features, outputCol="features")
label_indexer = StringIndexer(inputCol="Churn", outputCol="label")
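
In [0]:
# A minimal sketch, in plain Python rather than Spark, of the mapping that StringIndexer
# produces with its default ordering (most frequent value gets index 0.0). The helper name
# frequency_index and the sample values are hypothetical; alphabetical tie-breaking is an
# assumption based on recent Spark versions.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct string to a float index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties (assumed).
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(index) for index, value in enumerate(ordered)}


# Illustrative values for the Contract column, not taken from the real dataset.
contracts = [
    "Month-to-month", "Two year", "Month-to-month",
    "One year", "Month-to-month", "Two year",
]
mapping = frequency_index(contracts)
print(mapping)
```

# The indices are arbitrary codes, not quantities. A decision tree can split on them
# either way, while a linear model such as Logistic Regression will treat them as ordered numbers.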


In [0]:
# Create the Decision Tree model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

# Create the Logistic Regression model
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)


In [0]:
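# Split the data and define the evaluation metrics.
# weightedPrecision and weightedRecall average the per-class scores, weighted by class frequency,
# so weightedRecall always equals accuracy.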
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)
evaluator_accuracy = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
evaluator_precision = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedPrecision")
evaluator_recall = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedRecall")
evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")


In [0]:
# Fit each model in its own pipeline and report the weighted metrics on the test set
for model_name, model in [("Decision Tree", dt), ("Logistic Regression", lr)]:
    pipeline = Pipeline(stages=indexers + [assembler, label_indexer, model])
    model_fitted = pipeline.fit(train_data)
    predictions = model_fitted.transform(test_data)
    
    accuracy = evaluator_accuracy.evaluate(predictions)
    precision = evaluator_precision.evaluate(predictions)
    recall = evaluator_recall.evaluate(predictions)
    f1_score = evaluator_f1.evaluate(predictions)
    
    print(f"{model_name} Performance:")
    print(f"Accuracy: {accuracy:.3f}")
    print(f"Precision: {precision:.3f}")
    print(f"Recall: {recall:.3f}")
    print(f"F1 Score: {f1_score:.3f}")
    print("------------------------------------------------")

Decision Tree Performance:
Accuracy: 0.789
Precision: 0.776
Recall: 0.789
F1 Score: 0.779
------------------------------------------------
Logistic Regression Performance:
Accuracy: 0.811
Precision: 0.802
Recall: 0.811
F1 Score: 0.804
------------------------------------------------
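
In [0]:
# Recall matches Accuracy for both models above because the evaluator uses weightedRecall.
# The plain-Python sketch below, on made-up labels, shows why the two always coincide and how
# recall for the churn class alone can differ. The function names and data are hypothetical.

```python
import math
from collections import Counter


def accuracy(labels, preds):
    """Fraction of rows whose prediction matches the label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def class_recall(labels, preds, cls):
    """Fraction of rows with true class `cls` that were predicted as `cls`."""
    hits = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)
    return hits / sum(1 for l in labels if l == cls)


def weighted_recall(labels, preds):
    """Per-class recall averaged with class-frequency weights."""
    support = Counter(labels)
    total = len(labels)
    return sum(
        (support[cls] / total) * class_recall(labels, preds, cls)
        for cls in support
    )


# Made-up example: 6 non-churners (0) and 4 churners (1).
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
preds = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print(f"Accuracy:        {accuracy(labels, preds):.3f}")
print(f"Weighted recall: {weighted_recall(labels, preds):.3f}")
print(f"Churn recall:    {class_recall(labels, preds, 1):.3f}")
```

# Each class weight (support / total) cancels the denominator of that class's recall, leaving
# correct predictions / total rows, which is accuracy.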


In [0]:
# PART 3: Under what circumstances would you adopt the Logistic Regression analysis technique?

In [0]:
# Logistic Regression is a widely used statistical method for datasets in which the dependent
# variable (label) is categorical, most commonly binary. It predicts the probability of an outcome
# from the values of the independent variables (features). Linear regression, by contrast, predicts
# a continuous response.
#
# Circumstances in which you would adopt Logistic Regression include:
#
# 1. Binary classification problems: Logistic Regression is particularly suited for binary classification
# problems, i.e., when there are two possible outcomes (e.g., churn or no churn, spam or not spam).
#
# 2. Probabilistic interpretation: When you require not only a class label but also the probability of
# belonging to a particular class, logistic regression provides an interpretable probabilistic output.
#
# 3. Linear relationships between features and log-odds: Logistic Regression is appropriate when there
# is an approximate linear relationship between the features and the log-odds of the outcome.
#
# 4. Simplicity and interpretability: Logistic Regression is relatively simple, computationally efficient,
# and easy to interpret compared to some other machine learning techniques, such as deep learning or
# ensemble methods.
#
# 5. Regularization and feature selection: Logistic Regression supports L1 and L2 regularization,
# both of which help prevent overfitting. L1 can also shrink coefficients to exactly zero, which
# acts as a form of feature selection.
#
# Logistic Regression may not be the best choice when the relationship between the features and
# the log-odds is strongly non-linear or when there are complex interactions between features. In
# these cases, more flexible techniques like decision trees, random forests, or neural networks
# may be more appropriate.
# ----------------------------------------------------------------------------
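
In [0]:
# A minimal sketch of points 2 and 3 above: the model outputs a probability, and each
# coefficient is the change in log-odds per unit of its feature. The coefficients, intercept,
# and feature values are hypothetical, not fitted on the Telco data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, intercept):
    """Probability of the positive class for one row of numeric features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def log_odds(p):
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients for [tenure in months, monthly charges].
weights = [-0.05, 0.02]
intercept = -1.0

p_short = predict_proba([2, 80.0], weights, intercept)  # 2 months of tenure
p_long = predict_proba([3, 80.0], weights, intercept)   # 3 months of tenure

# One extra month of tenure shifts the log-odds by exactly the tenure coefficient.
delta = log_odds(p_long) - log_odds(p_short)
print(f"P(churn) at 2 months: {p_short:.3f}")
print(f"P(churn) at 3 months: {p_long:.3f}")
print(f"Change in log-odds:   {delta:.3f}")
```

# The same shift applies at any starting tenure, which is the linear-in-log-odds assumption.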