# Logistic regression

<font size='5'>
Logistic regression is used to model a dependent variable that is categorical.<br>
The model is fitted on the logit (log-odds) of the observed data rather than on the probability of Y directly: each observation contributes to the estimate of Y through a linear combination of its inputs. The function below then converts that logit into the conditional probability of Y given all the X (independent variables):
$$P(Y=1|X_{1},X_{2},...,X_{n})=\frac{1}{1+e^{-(\alpha + \sum^{n}_{i=1} \beta_{i} X_{i})}}$$
    <br><br>
Where:<br>
    $\alpha$ : intercept (bias term) of the model<br>
    $\beta_{i}$: coefficient of each variable $X_{i}$, measuring its contribution to the log-odds of Y
    
<br><br>
The fitted curve typically looks like the following:
<img src="rlog.png"> <br>


</font>
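
As a minimal sketch of the logistic function in its standard form, $1/(1+e^{-z})$ with $z = \alpha + \sum_i \beta_i X_i$, the snippet below computes the probability for a single observation. The function name `logistic_probability` and the coefficient values are hypothetical and chosen only for illustration; they are not taken from the fitted model in this notebook.

```python
import math

def logistic_probability(alpha, betas, xs):
    """Return P(Y=1 | xs) for a logistic model with intercept alpha."""
    # Linear predictor (the logit, or log-odds): alpha + sum(beta_i * x_i)
    z = alpha + sum(b * x for b, x in zip(betas, xs))
    # The logistic (sigmoid) function maps the logit to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only
alpha = -1.0
betas = [0.5, 0.25]

# Here the logit is -1.0 + 0.5 * 2.0 + 0.25 * 0.0 = 0, so the probability is 0.5
p = logistic_probability(alpha, betas, [2.0, 0.0])
print(p)  # 0.5
```

A logit of zero corresponds to even odds; positive logits push the probability toward 1 and negative logits toward 0.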

## Example

This example uses a dataset from a marketing company that is having trouble assigning sales agents (account managers) to its customers. As a result, customers are not using its services or are cancelling the ones they have. The goal is to model whether a customer leaves or stays (churn).
The fields are the following:

    Names: name of the latest company that had a relationship with the customer
    Age: age of the relationship
    Total_Purchase: total advertising purchased
    Account_Manager: binary, 0 = no agent, 1 = agent assigned
    Years: age of the company
    Num_Sites: number of websites that use the customer's service
    Onboard_date: date of the latest onboarding
    Location: address of the company
    Company: name of the customer
    

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('logregconsult').getOrCreate()

In [3]:
data = spark.read.csv('customer_churn.csv',inferSchema=True,
                     header=True)

In [4]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [5]:
data.describe().show()

+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-------------------+
|summary|        Names|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|            Location|             Company|              Churn|
+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-------------------+
|  count|          900|              900|              900|               900|              900|               900|                 900|                 900|                900|
|   mean|         null|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|                null|                null|0.16666666666666666|
| stddev|         null|6.127560416916251|2408.644531858096|0.4999208935073339|1.274449013194616|1.764835592035

In [6]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

In [7]:
from pyspark.ml.feature import VectorAssembler

In [8]:
assembler = VectorAssembler(inputCols=['Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites'],outputCol='features')

In [9]:
output = assembler.transform(data)

In [39]:
final_data = output.select('features','churn')

### Test Train Split

In [40]:
train_churn,test_churn = final_data.randomSplit([0.7,0.3])
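
The split above is random, so the two partitions are only approximately 70% and 30% of the rows, and they change from run to run unless a seed is fixed. The pure-Python sketch below illustrates that idea with a per-row random draw. It is a simplified illustration, not Spark's actual implementation, and the helper name `bernoulli_split` is hypothetical.

```python
import random

def bernoulli_split(rows, train_weight=0.7, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_weight
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(900))  # stand-in for the 900 rows of the dataset
train, test = bernoulli_split(rows, 0.7, seed=42)

# Sizes are close to 630/270 but usually not exact
print(len(train), len(test))
```

In PySpark, `randomSplit` accepts a `seed` argument for the same purpose, for example `final_data.randomSplit([0.7, 0.3], seed=42)`.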

### Fit the model

In [12]:
from pyspark.ml.classification import LogisticRegression

In [13]:
lr_churn = LogisticRegression(labelCol='churn')

In [14]:
fitted_churn_model = lr_churn.fit(train_churn)

In [15]:
training_sum = fitted_churn_model.summary

In [41]:
training_sum.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              churn|         prediction|
+-------+-------------------+-------------------+
|  count|                632|                632|
|   mean|0.16772151898734178|0.13924050632911392|
| stddev|0.37391474020622584| 0.3464715405857694|
|    min|                  0|                0.0|
|    max|                  1|                1.0|
+-------+-------------------+-------------------+
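
To read the summary above, it can help to turn the means back into counts. The sketch below only does arithmetic on the numbers printed in the table (632 rows, mean `churn`, mean `prediction`). The variable names are illustrative, and the "always predict 0" baseline is an added point of comparison, not something computed in the notebook.

```python
# Values copied from the describe() output above
n_rows = 632
churn_rate = 0.16772151898734178      # mean of the 0/1 churn label
predicted_rate = 0.13924050632911392  # mean of the 0/1 prediction

# The mean of a 0/1 column times the row count gives the number of ones
actual_churners = round(n_rows * churn_rate)
predicted_churners = round(n_rows * predicted_rate)
print(actual_churners, predicted_churners)  # 106 88

# Accuracy of a trivial model that always predicts "no churn"
baseline_accuracy = 1 - churn_rate
print(round(baseline_accuracy, 3))  # 0.832
```

Because churners are a minority, plain accuracy can look high even for a model that never predicts churn, which is one reason to look at a ranking metric such as AUC.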



### Evaluate results


In [17]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [18]:
pred_and_labels = fitted_churn_model.evaluate(test_churn)

In [42]:
pred_and_labels.predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[29.0,11274.46,1....|    0|[4.87277048314045...|[0.99240597473215...|       0.0|
|[30.0,8403.78,1.0...|    0|[6.62706699787450...|[0.99867770995491...|       0.0|
|[30.0,8874.83,0.0...|    0|[3.83233030863620...|[0.97880008629612...|       0.0|
|[31.0,5387.75,0.0...|    0|[3.24742811458119...|[0.96258058552664...|       0.0|
|[31.0,7073.61,0.0...|    0|[3.79911450433881...|[0.97809976923405...|       0.0|
|[31.0,11297.57,1....|    1|[0.79751152640735...|[0.68944192100551...|       0.0|
|[31.0,11743.24,0....|    0|[7.95951793845681...|[0.99965080051155...|       0.0|
|[31.0,12264.68,1....|    0|[3.77281170068563...|[0.97752920495855...|       0.0|
|[32.0,6367.22,1.0...|    0|[3.20017220414578...|[0.96084075703562...|       0.0|
|[32.0,8575.71,0

### AUC

In [24]:
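# Note: 'prediction' holds hard 0/1 labels, so the AUC below reflects a single
# decision threshold. Passing 'rawPrediction' (the evaluator's default column)
# would score the full ranking of customers instead.
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                           labelCol='churn')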

In [26]:
auc = churn_eval.evaluate(pred_and_labels.predictions)

In [43]:
auc

0.6866883116883117
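
The AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The pure-Python sketch below computes it that way on hypothetical toy data, to show how the value can differ depending on whether continuous scores or hard 0/1 predictions are passed in. The function `roc_auc` and the data are illustrative assumptions and do not come from the notebook or from Spark.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: true labels and predicted churn probabilities
labels = [0, 0, 0, 1, 1, 0]
probs = [0.10, 0.30, 0.45, 0.40, 0.80, 0.20]

# Hard predictions obtained with a 0.5 threshold
hard = [1.0 if p >= 0.5 else 0.0 for p in probs]

print(roc_auc(labels, probs))  # 0.875
print(roc_auc(labels, hard))   # 0.75
```

With hard predictions every customer on the same side of the threshold is tied, so the ranking information inside each group is lost and the two values generally differ.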

### Prediction
Using a new set of customers (with no `Churn` label), after refitting the model on all the labeled data

In [28]:
final_lr_model = lr_churn.fit(final_data)

In [29]:
new_customers = spark.read.csv('new_customers.csv',inferSchema=True,
                              header=True)

In [30]:
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



In [31]:
test_new_customers = assembler.transform(new_customers)

In [32]:
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



In [33]:
final_results = final_lr_model.transform(test_new_customers)

In [35]:
final_results.select('Company','prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+

