# Analyzing churn of an Internet and Phone Service Provider

This project analyzes customer churn for an internet and phone service provider, with the aim of identifying the factors behind subscription cancellations.

By examining patterns, behavior, and key indicators in the customer data, we want to understand which customers cancel and why. The findings are meant to help businesses reduce churn, improve customer retention strategies, and adjust their service offerings.

I picked the database of a Brazilian company as a case study to investigate why its customers are cancelling their subscriptions.

First, we need to import pandas and read the database for our analysis.

In [16]:
import pandas as pd

db=pd.read_csv('cancelamentos_sample.csv')

display(db)

Unnamed: 0,CustomerID,idade,sexo,tempo_como_cliente,frequencia_uso,ligacoes_callcenter,dias_atraso,assinatura,duracao_contrato,total_gasto,meses_ultima_interacao,cancelou
0,349936.0,23.0,Male,13.0,22.0,2.0,1.0,Standard,Annual,909.58,23.0,0.0
1,100634.0,49.0,Male,55.0,16.0,3.0,6.0,Premium,Monthly,207.00,29.0,1.0
2,301263.0,30.0,Male,7.0,1.0,0.0,8.0,Basic,Annual,768.78,7.0,0.0
3,119358.0,26.0,Male,40.0,5.0,3.0,8.0,Premium,Annual,398.00,12.0,1.0
4,130955.0,27.0,Female,17.0,30.0,5.0,6.0,Basic,Annual,507.00,15.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
49995,195680.0,62.0,Female,35.0,7.0,2.0,8.0,Basic,Annual,232.00,15.0,1.0
49996,43477.0,36.0,Male,43.0,21.0,2.0,30.0,Basic,Quarterly,928.00,30.0,1.0
49997,169273.0,55.0,Male,42.0,8.0,1.0,12.0,Basic,Monthly,326.00,27.0,1.0
49998,310693.0,40.0,Female,14.0,19.0,1.0,17.0,Premium,Quarterly,826.76,12.0,0.0


# DATABASE GUIDE 

idade: Customer's age

sexo: Customer's gender

tempo_como_cliente: How long the customer has been with the company

frequencia_uso: How frequently the customer uses the services

ligacoes_callcenter: How many times the customer called the company's customer service

dias_atraso: Number of days the customer's payment is overdue

assinatura: Type of the customer's plan

duracao_contrato: Length of the customer's contract

total_gasto: Total amount the customer has paid to the company, in Brazilian currency (Real, R$)

meses_ultima_interacao: Months since the customer's last call to the company's customer service

cancelou: Shows whether the customer cancelled (1) or not (0).
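For readers who prefer English column names, the guide above can be expressed as a rename mapping. This is only a sketch: the English names are hypothetical choices made here, and the rest of the notebook keeps the original Portuguese names.

```python
import pandas as pd

# Hypothetical English names, chosen here for illustration only.
# The notebook itself keeps the original Portuguese column names.
english_names = {
    "idade": "age",
    "sexo": "gender",
    "tempo_como_cliente": "tenure",
    "frequencia_uso": "usage_frequency",
    "ligacoes_callcenter": "support_calls",
    "dias_atraso": "payment_delay_days",
    "assinatura": "plan",
    "duracao_contrato": "contract_length",
    "total_gasto": "total_spent",
    "meses_ultima_interacao": "months_since_last_interaction",
    "cancelou": "cancelled",
}

# Empty frame with the original column names, standing in for `db`.
sample = pd.DataFrame(columns=list(english_names))

# rename() returns a new frame; the original is left untouched.
translated = sample.rename(columns=english_names)
print(list(translated.columns))
```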


# REMOVING IRRELEVANT DATA

Now we need to remove from our database the information that makes no difference to our analysis. In this study, we only need to remove the customer ID, because it has nothing to do with the customers' profile or behavior.

In [17]:
db=db.drop(columns='CustomerID')

display(db)

Unnamed: 0,idade,sexo,tempo_como_cliente,frequencia_uso,ligacoes_callcenter,dias_atraso,assinatura,duracao_contrato,total_gasto,meses_ultima_interacao,cancelou
0,23.0,Male,13.0,22.0,2.0,1.0,Standard,Annual,909.58,23.0,0.0
1,49.0,Male,55.0,16.0,3.0,6.0,Premium,Monthly,207.00,29.0,1.0
2,30.0,Male,7.0,1.0,0.0,8.0,Basic,Annual,768.78,7.0,0.0
3,26.0,Male,40.0,5.0,3.0,8.0,Premium,Annual,398.00,12.0,1.0
4,27.0,Female,17.0,30.0,5.0,6.0,Basic,Annual,507.00,15.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
49995,62.0,Female,35.0,7.0,2.0,8.0,Basic,Annual,232.00,15.0,1.0
49996,36.0,Male,43.0,21.0,2.0,30.0,Basic,Quarterly,928.00,30.0,1.0
49997,55.0,Male,42.0,8.0,1.0,12.0,Basic,Monthly,326.00,27.0,1.0
49998,40.0,Female,14.0,19.0,1.0,17.0,Premium,Quarterly,826.76,12.0,0.0


# REMOVING ROWS WITH MISSING VALUES

Now we need to check whether there are missing values in the columns being analyzed. We'll only analyze the rows that have every column filled in and delete the rows with at least one empty column.

First, let's see how many missing values we have:

In [18]:
db.info()  # info() prints its summary directly and returns None, so display() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   idade                   50000 non-null  float64
 1   sexo                    49997 non-null  object 
 2   tempo_como_cliente      49998 non-null  float64
 3   frequencia_uso          50000 non-null  float64
 4   ligacoes_callcenter     50000 non-null  float64
 5   dias_atraso             50000 non-null  float64
 6   assinatura              50000 non-null  object 
 7   duracao_contrato        50000 non-null  object 
 8   total_gasto             50000 non-null  float64
 9   meses_ultima_interacao  50000 non-null  float64
 10  cancelou                50000 non-null  float64
dtypes: float64(8), object(3)
memory usage: 4.2+ MB

The database has 50000 entries (rows). The column 'sexo' has 3 missing values and the column 'tempo_como_cliente' has 2.

We'll delete every row that has at least one missing value. As the output below shows, this removes 4 rows rather than 5, which means one row is missing both values.
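Missing cells and affected rows are not the same count, because a single row can be missing more than one value. The sketch below counts both on a small made-up frame (not the real data); `toy`, `missing_per_column` and `rows_with_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (not the real data): 5 rows, 3 missing cells in total.
toy = pd.DataFrame({
    "sexo": ["Male", None, "Female", None, "Male"],
    "tempo_como_cliente": [13.0, np.nan, 7.0, 40.0, 17.0],
})

# Missing cells in each column.
missing_per_column = toy.isna().sum()

# Rows with at least one missing cell; these are the rows dropna() removes.
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(missing_per_column)
print("rows that dropna() would remove:", rows_with_missing)

cleaned = toy.dropna()
print("rows left:", len(cleaned))
```

Here the second row is missing both values, so 3 missing cells correspond to only 2 removed rows.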

In [32]:
db = db.dropna()  # drop every row that has at least one missing value

db.info()  # info() prints its summary directly and returns None

display(db)

<class 'pandas.core.frame.DataFrame'>
Index: 49996 entries, 0 to 49999
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   idade                   49996 non-null  float64
 1   sexo                    49996 non-null  object 
 2   tempo_como_cliente      49996 non-null  float64
 3   frequencia_uso          49996 non-null  float64
 4   ligacoes_callcenter     49996 non-null  float64
 5   dias_atraso             49996 non-null  float64
 6   assinatura              49996 non-null  object 
 7   duracao_contrato        49996 non-null  object 
 8   total_gasto             49996 non-null  float64
 9   meses_ultima_interacao  49996 non-null  float64
 10  cancelou                49996 non-null  float64
dtypes: float64(8), object(3)
memory usage: 5.6+ MB

Unnamed: 0,idade,sexo,tempo_como_cliente,frequencia_uso,ligacoes_callcenter,dias_atraso,assinatura,duracao_contrato,total_gasto,meses_ultima_interacao,cancelou
0,23.0,Male,13.0,22.0,2.0,1.0,Standard,Annual,909.58,23.0,0.0
1,49.0,Male,55.0,16.0,3.0,6.0,Premium,Monthly,207.00,29.0,1.0
2,30.0,Male,7.0,1.0,0.0,8.0,Basic,Annual,768.78,7.0,0.0
3,26.0,Male,40.0,5.0,3.0,8.0,Premium,Annual,398.00,12.0,1.0
4,27.0,Female,17.0,30.0,5.0,6.0,Basic,Annual,507.00,15.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
49995,62.0,Female,35.0,7.0,2.0,8.0,Basic,Annual,232.00,15.0,1.0
49996,36.0,Male,43.0,21.0,2.0,30.0,Basic,Quarterly,928.00,30.0,1.0
49997,55.0,Male,42.0,8.0,1.0,12.0,Basic,Monthly,326.00,27.0,1.0
49998,40.0,Female,14.0,19.0,1.0,17.0,Premium,Quarterly,826.76,12.0,0.0


Now our database has no missing values and is ready to be analyzed.

# ANALYZING DATA

### Cancellation rate

First, we need to know the proportion of customers who cancelled versus those who stayed.

Note: the value '1.0' means the customer cancelled and the value '0.0' means the customer didn't cancel.

In [25]:
display(db['cancelou'].value_counts())
display(db['cancelou'].value_counts(normalize=True).map('{:.1%}'.format))

cancelou
1.0    28393
0.0    21603
Name: count, dtype: int64

cancelou
1.0    56.8%
0.0    43.2%
Name: proportion, dtype: object

#### Conclusion: 56.8% of the customers cancelled the service.
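The percentage follows directly from the two counts in the output above. A quick check of the arithmetic, using only those counts (the variable names are illustrative):

```python
cancelled = 28393   # rows with cancelou == 1.0 in the output above
stayed = 21603      # rows with cancelou == 0.0 in the output above
total = cancelled + stayed

cancel_rate = cancelled / total
stay_rate = stayed / total

print(total)
print(f"{cancel_rate:.1%}")
print(f"{stay_rate:.1%}")
```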

This is a very high cancellation rate, so we need to look at the data of the customers who cancelled the service. To do this, we'll exclude from our database the rows of those who are still customers.

In [31]:
cancel_base = db[db['cancelou'] == 1.0]  # keep only the customers who cancelled

display(cancel_base)

Unnamed: 0,idade,sexo,tempo_como_cliente,frequencia_uso,ligacoes_callcenter,dias_atraso,assinatura,duracao_contrato,total_gasto,meses_ultima_interacao,cancelou
1,49.0,Male,55.0,16.0,3.0,6.0,Premium,Monthly,207.00,29.0,1.0
3,26.0,Male,40.0,5.0,3.0,8.0,Premium,Annual,398.00,12.0,1.0
4,27.0,Female,17.0,30.0,5.0,6.0,Basic,Annual,507.00,15.0,1.0
10,47.0,Male,24.0,6.0,0.0,5.0,Premium,Monthly,752.00,28.0,1.0
11,43.0,Male,46.0,20.0,6.0,7.0,Standard,Annual,549.00,19.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
49994,63.0,Male,16.0,24.0,2.0,18.0,Standard,Quarterly,442.00,26.0,1.0
49995,62.0,Female,35.0,7.0,2.0,8.0,Basic,Annual,232.00,15.0,1.0
49996,36.0,Male,43.0,21.0,2.0,30.0,Basic,Quarterly,928.00,30.0,1.0
49997,55.0,Male,42.0,8.0,1.0,12.0,Basic,Monthly,326.00,27.0,1.0


### Analyzing cancellation by contract length

Now let's see how the cancelled contracts are distributed across contract lengths:

In [36]:
display(cancel_base['duracao_contrato'].value_counts())
display(cancel_base['duracao_contrato'].value_counts(normalize=True).map('{:.1%}'.format))

duracao_contrato
Monthly      9884
Annual       9354
Quarterly    9155
Name: count, dtype: int64

duracao_contrato
Monthly      34.8%
Annual       32.9%
Quarterly    32.2%
Name: proportion, dtype: object

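The shares of Monthly, Annual and Quarterly contracts among the cancelled customers are very similar, so this criterion, analyzed in isolation, is probably not relevant to the cancellations. These shares are computed only among customers who cancelled, so they also depend on how many customers hold each contract length overall.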
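One way to take group sizes into account is to compute the cancellation rate inside each contract length from the full table instead of from `cancel_base`. The sketch below does this on a few made-up rows (not the real data); `toy` and `rate_by_contract` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration only; the real analysis would use `db`.
toy = pd.DataFrame({
    "duracao_contrato": ["Monthly", "Monthly", "Annual", "Annual", "Quarterly", "Quarterly"],
    "cancelou": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
})

# Because cancelou is 0/1, its mean within a group is that group's cancellation rate.
rate_by_contract = toy.groupby("duracao_contrato")["cancelou"].mean()

print(rate_by_contract.map("{:.1%}".format))
```

If the rates per group turn out close to the overall rate, that supports the conclusion that the criterion is not relevant on its own.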

### Analyzing cancellation by plan type

Now let's check the cancellations by plan type:

In [37]:
display(cancel_base['assinatura'].value_counts())
display(cancel_base['assinatura'].value_counts(normalize=True).map('{:.1%}'.format))

assinatura
Standard    9552
Basic       9508
Premium     9333
Name: count, dtype: int64

assinatura
Standard    33.6%
Basic       33.5%
Premium     32.9%
Name: proportion, dtype: object

The shares of Standard, Basic and Premium plans among the cancelled customers are very similar, so this criterion, analyzed in isolation, is probably not relevant to the cancellations.
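A cross-tabulation normalized by row gives, for each plan, the share of customers who stayed and the share who cancelled. This is a sketch on made-up rows, not the real data, and `toy` and `plan_table` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration only; the real analysis would use `db`.
toy = pd.DataFrame({
    "assinatura": ["Basic", "Basic", "Basic", "Basic",
                   "Premium", "Premium", "Standard", "Standard"],
    "cancelou": [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
})

# normalize="index" makes each row (plan) sum to 1,
# so the 1.0 column is the cancellation rate of that plan.
plan_table = pd.crosstab(toy["assinatura"], toy["cancelou"], normalize="index")

print(plan_table)
```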

### Analyzing cancellation by gender

Now let's check the cancellations by the customers' gender:

In [38]:
display(cancel_base['sexo'].value_counts())
display(cancel_base['sexo'].value_counts(normalize=True).map('{:.1%}'.format))

sexo
Female    14575
Male      13818
Name: count, dtype: int64

sexo
Female    51.3%
Male      48.7%
Name: proportion, dtype: object

The shares of women and men among the cancelled customers are very similar, so this criterion, analyzed in isolation, is probably not relevant to the cancellations.
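Each criterion so far was examined in isolation. A possible follow-up, sketched here on made-up rows rather than the real data, is to look at two criteria together with a pivot table of cancellation rates; `toy` and `combined` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration only; the real analysis would use `db`.
toy = pd.DataFrame({
    "sexo": ["Female", "Female", "Female", "Female",
             "Male", "Male", "Male", "Male"],
    "duracao_contrato": ["Monthly", "Monthly", "Annual", "Annual",
                         "Monthly", "Monthly", "Annual", "Annual"],
    "cancelou": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# Mean of the 0/1 cancelou flag for every (gender, contract length) pair.
combined = toy.pivot_table(
    values="cancelou",
    index="sexo",
    columns="duracao_contrato",
    aggfunc="mean",
)

print(combined)
```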