# Data Scientist Training (Formação Cientista de Dados)
### Feedback Project 4
### Predicting Customer Churn in Telecom Operators

### Gabriel Quaiotti - Apr 2020

Customer churn refers to a customer's decision to end a business relationship; the term also denotes the loss of customers itself. Customer loyalty (retention) and customer churn always sum to 100%: if a company has a retention rate of 60%, its churn rate is 40%. According to the 80/20 customer-profitability rule, 20% of customers generate 80% of revenue. It is therefore very important to predict which users are likely to leave and which factors influence their decision.
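The retention/churn relationship above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `churn_rate` and the customer counts are hypothetical and used only for illustration.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the fraction of customers lost during a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Retention and churn are complementary: they sum to 100%.
retention = 0.60
churn = 1.0 - retention
print(f"retention={retention:.0%} churn={churn:.0%}")

# Hypothetical counts: 1000 customers at the start of the period, 400 lost.
example_churn = churn_rate(1000, 400)
print(f"example churn rate: {example_churn:.0%}")
```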

In this project, you must predict customer churn at a telecom operator.

The training and test datasets are provided as attachments to this project. Your job is to build a machine learning model that predicts whether a customer will cancel their plan and with what probability. The dataset header describes the type of information in each column.

Using Python, we recommend building a Logistic Regression model to determine whether a customer will cancel their plan (Yes or No) and the probability of each outcome.

In [107]:
# Libraries
from pyspark.sql import SparkSession

from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType
from pyspark.sql.types import DoubleType
from pyspark.sql.types import IntegerType

from pyspark.sql.functions import col

from pyspark.ml.classification import LogisticRegression

from pyspark.ml.feature import VectorAssembler

from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.mllib.evaluation import MulticlassMetrics

In [108]:
# Spark Session - entry point for working with Spark
spSession = SparkSession.builder.master("local").appName("DSA-TELECOM-TRAIN").getOrCreate()

# TRAIN

In [109]:
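# Read train dataset as an RDD of text lines.
# `sc` is only predefined in some PySpark shells/kernels, so take the context from the session.
sc = spSession.sparkContext
# scaled_rdd = sc.textFile('../dataset/df_scaled_1.csv')
features_rdd = sc.textFile('../dataset/df_features.csv')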

In [110]:
features_rdd.take(5)

['id,international_plan,number_customer_service_calls,total_day_minutes,total_eve_charge,churn',
 '10.0,2.2014212203111105,-0.6613633377187209,1.0896917903776833,0.3117714141662552,0.0',
 '15.0,-0.4537898957205725,1.0880350100175813,-1.1588572700838589,1.9754405873059246,0.0',
 '23.0,-0.4537898957205725,-0.6613633377187209,-0.14301530875859367,-2.5996496388281667,0.0',
 '39.0,2.2014212203111105,0.32267323288294913,0.1977727649444245,-0.3592111805254348,0.0']

In [111]:
# Remove header and split by ','
header = features_rdd.first()
features_rdd2 = features_rdd.filter(lambda line: line != header).map(lambda line: line.split(","))
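To make the header-removal step concrete, here is a plain-Python sketch of the same filter-and-split logic, run on lines copied from the `features_rdd.take(5)` output above. It only mirrors the lambdas; it does not use Spark. Note that every field is still a string afterwards, which is why the notebook casts the columns to `DoubleType` later.

```python
# Lines copied from the features_rdd.take(5) output shown earlier.
lines = [
    "id,international_plan,number_customer_service_calls,total_day_minutes,total_eve_charge,churn",
    "10.0,2.2014212203111105,-0.6613633377187209,1.0896917903776833,0.3117714141662552,0.0",
    "15.0,-0.4537898957205725,1.0880350100175813,-1.1588572700838589,1.9754405873059246,0.0",
]

# Same logic as the RDD pipeline: drop every line equal to the header, then split on commas.
header = lines[0]
rows = [line.split(",") for line in lines if line != header]

for row in rows:
    print(row)
```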

In [112]:
# Define the dataFrame columns
features_fields = [StructField("id", StringType(), True), 
                   StructField("international_plan", StringType(), True),
                   StructField("number_customer_service_calls", StringType(), True),
                   StructField("total_day_minutes", StringType(), True),
                   StructField("total_eve_charge", StringType(), True),
                   StructField("churn", StringType(), True)]

In [113]:
# Define the dataFrame schema
features_schema = StructType( features_fields )

In [114]:
# Create dataFrame
features_ds = spSession.createDataFrame(features_rdd2, features_schema)

In [115]:
features_ds = features_ds.withColumn('id', col('id').cast(DoubleType()))
features_ds = features_ds.withColumn('international_plan', col('international_plan').cast(DoubleType()))
features_ds = features_ds.withColumn('total_day_minutes', col('total_day_minutes').cast(DoubleType()))
features_ds = features_ds.withColumn('total_eve_charge', col('total_eve_charge').cast(DoubleType()))
features_ds = features_ds.withColumn('number_customer_service_calls', col('number_customer_service_calls').cast(DoubleType()))
features_ds = features_ds.withColumn('churn', col('churn').cast(DoubleType()))

In [116]:
features_ds

DataFrame[id: double, international_plan: double, number_customer_service_calls: double, total_day_minutes: double, total_eve_charge: double, churn: double]

In [117]:
features_ds.toPandas().head()

|   | id | international_plan | number_customer_service_calls | total_day_minutes | total_eve_charge | churn |
|---|---|---|---|---|---|---|
| 0 | 10.0 | 2.201421 | -0.661363 | 1.089692 | 0.311771 | 0.0 |
| 1 | 15.0 | -0.45379 | 1.088035 | -1.158857 | 1.975441 | 0.0 |
| 2 | 23.0 | -0.45379 | -0.661363 | -0.143015 | -2.59965 | 0.0 |
| 3 | 39.0 | 2.201421 | 0.322673 | 0.197773 | -0.359211 | 0.0 |
| 4 | 45.0 | -0.45379 | -0.552026 | -0.615879 | 0.385304 | 0.0 |


In [118]:
# Convert predictor columns to vector
vector_ds = VectorAssembler(inputCols=features_ds.drop('id', 'churn').columns, outputCol="features").transform(features_ds)

In [119]:
(train_ds, test_ds) = vector_ds.randomSplit([0.7, 0.3])
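The split above is called without a seed, so the resulting partitions change from run to run, and the 70/30 weights are targets rather than exact sizes. `randomSplit` accepts an optional `seed` argument to make the split repeatable. The sketch below illustrates the idea in plain Python; the `bernoulli_split` helper is hypothetical and is not Spark's actual implementation, only an assumption about per-row random assignment.

```python
import random


def bernoulli_split(rows, weights, seed):
    """Assign each row independently to a partition according to normalized weights."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)  # cumulative upper bound for each partition

    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last partition catches any floating-point remainder.
            if u < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


rows = list(range(983))  # same total row count as train + test in this notebook
train, test = bernoulli_split(rows, [0.7, 0.3], seed=42)
train_again, test_again = bernoulli_split(rows, [0.7, 0.3], seed=42)

print(len(train), len(test))   # roughly 70/30, not exactly
print(train == train_again)    # same seed gives the same split
```

In the notebook itself, the equivalent change would be passing a seed, for example `vector_ds.randomSplit([0.7, 0.3], seed=42)`; the value 42 is arbitrary.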

In [120]:
train_ds.count()

676

In [121]:
test_ds.count()

307

In [122]:
# Fit Logistic Regression model
model = LogisticRegression(featuresCol = "features", labelCol = "churn").setMaxIter(100).fit(train_ds)

# TEST

In [123]:
# Generate predictions
prediction = model.transform(test_ds)

In [124]:
prediction.toPandas().head()

|   | id | international_plan | number_customer_service_calls | total_day_minutes | total_eve_charge | churn | features | rawPrediction | probability | prediction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.0 | 2.201421 | -0.661363 | 1.089692 | 0.311771 | 0.0 | [2.2014212203111105, -0.6613633377187209, 1.08... | [-2.2628824278833357, 2.2628824278833357] | [0.09424403093437508, 0.9057559690656248] | 1.0 |
| 1 | 23.0 | -0.45379 | -0.661363 | -0.143015 | -2.59965 | 0.0 | [-0.4537898957205725, -0.6613633377187209, -0.... | [1.6292350386570995, -1.6292350386570995] | [0.8360648197923921, 0.16393518020760786] | 0.0 |
| 2 | 53.0 | -0.45379 | -0.224014 | -0.092468 | -0.308658 | 0.0 | [-0.4537898957205725, -0.22401375078464533, -0... | [0.7380331037829547, -0.7380331037829547] | [0.6765656005433838, 0.3234343994566163] | 0.0 |
| 3 | 65.0 | -0.45379 | -0.224014 | -0.987648 | -0.841767 | 0.0 | [-0.4537898957205725, -0.22401375078464533, -0... | [1.5226854381407915, -1.5226854381407915] | [0.8209335846699297, 0.17906641533007023] | 0.0 |
| 4 | 104.0 | -0.45379 | -0.552026 | -0.925686 | -0.437339 | 0.0 | [-0.4537898957205725, -0.552025940985202, -0.9... | [1.7211304916150094, -1.7211304916150094] | [0.848274393499993, 0.15172560650000694] | 0.0 |
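The `rawPrediction`, `probability`, and `prediction` columns are related through the logistic (sigmoid) function. The sketch below reproduces the values from the first two rows of the output above in plain Python. It assumes the default decision threshold of 0.5, which the notebook does not override.

```python
import math


def sigmoid(margin: float) -> float:
    """Map a raw margin (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


# Second element of rawPrediction (the margin for churn = 1) in rows 0 and 1 above.
margins = [2.2628824278833357, -1.6292350386570995]

results = []
for margin in margins:
    p_churn = sigmoid(margin)                 # second element of `probability`
    p_stay = 1.0 - p_churn                    # first element of `probability`
    label = 1.0 if p_churn > 0.5 else 0.0     # assumed default threshold of 0.5
    results.append((p_stay, p_churn, label))
    print(f"margin={margin:+.4f} p_stay={p_stay:.4f} p_churn={p_churn:.4f} prediction={label}")
```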


# EVALUATE

In [125]:
# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol = "churn")
print(evaluator.getMetricName(), evaluator.evaluate(prediction))

areaUnderROC 0.8384125636672334
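One way to read `areaUnderROC` is as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counting as half. The plain-Python sketch below computes it that way on a small made-up sample; the labels and scores are hypothetical, and the evaluator's own computation may differ slightly in how it builds the curve.

```python
def auc_pairwise(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = churn) and model scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

auc = auc_pairwise(labels, scores)
print(round(auc, 4))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the 0.838 reported above indicates that the model ranks churners above non-churners most of the time.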


In [126]:
# View confusion matrix
total = prediction.count()
prediction.groupBy('churn', 'prediction').count().withColumn('%', col('count') / total).show()

+-----+----------+-----+-------------------+
|churn|prediction|count|                  %|
+-----+----------+-----+-------------------+
|  1.0|       1.0|  109| 0.3550488599348534|
|  0.0|       1.0|   30|0.09771986970684039|
|  1.0|       0.0|   46| 0.1498371335504886|
|  0.0|       0.0|  122| 0.3973941368078176|
+-----+----------+-----+-------------------+
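Accuracy alone can hide how the model treats the churn class, so it helps to derive precision and recall from the same counts. The sketch below uses the four counts from the table above, treating `churn = 1.0` as the positive class.

```python
# Counts taken from the groupBy table above (positive class: churn = 1.0).
tp = 109  # churn = 1, prediction = 1
fp = 30   # churn = 0, prediction = 1
fn = 46   # churn = 1, prediction = 0
tn = 122  # churn = 0, prediction = 0

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted churners, how many really churned
recall = tp / (tp + fn)      # of the real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way is consistent with the value reported by `metrics.accuracy` in this notebook.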



In [127]:
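# MulticlassMetrics expects (prediction, label) pairs, in that order
preds = prediction.select('prediction', 'churn').withColumnRenamed('churn', 'label')

In [128]:
metrics = MulticlassMetrics(preds.rdd)

In [129]:
metrics.accuracy

0.752442996742671

In [130]:
# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
metrics.confusionMatrix().toArray()

array([[122.,  30.],
       [ 46., 109.]])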

In [131]:
model.write().overwrite().save("../obj/logistic_regression.obj")