<a href="https://colab.research.google.com/github/mauricioaalmeida/ONE-TelecomX/blob/main/TelecomX_BR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Telecom X - Customer Churn Analysis



You have been hired as a data analysis assistant at Telecom X and will take part in the "Customer Churn" project. The company is facing a high cancellation rate and needs to understand the factors that lead to customer loss.

Your challenge is to collect, clean, and analyze the data, using Python and its main libraries to extract valuable insights. Based on your analysis, the rest of the Data Science team will be able to move on to predictive models and develop strategies to reduce churn.

What you will practice:

✅ Importing and manipulating data from an API efficiently.

✅ Applying ETL (Extract, Transform, Load) concepts to data preparation.

✅ Creating strategic data visualizations to identify patterns and trends.

✅ Performing Exploratory Data Analysis (EDA) and producing a report with relevant insights.



## Medallion Architecture

This analysis uses the Medallion Architecture to keep the data better organized and of higher quality.
Three layers (Bronze, Silver, and Gold) preserve traceability throughout data processing.

- Bronze layer:
  This layer stores copies of the raw data as received from the source. It is the starting point of the data journey, where data is stored without significant modification.
- Silver layer:
  In the Silver layer, data is cleaned, transformed, and enriched, and redundant or invalid records are removed. This includes data validation, deduplication, and aggregation.
- Gold layer:
  The Gold layer is the highest level of refinement, where data is shaped into formats optimized for analysis and decision-making. Here the data is prepared for consumption by analytics and reporting tools.
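
As a minimal sketch of how the three layers relate, the toy example below keeps each layer as its own DataFrame. The sample records and the `to_silver` / `to_gold` helper names are hypothetical and only illustrate the idea; they are not the pipeline used in this notebook.

```python
import pandas as pd

# Bronze: raw records exactly as received (hypothetical sample data).
bronze = pd.DataFrame({
    "customerID": ["A1", "A1", "B2", "C3"],
    "Churn": ["No", "No", "Yes", ""],
})

def to_silver(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean the raw layer: drop duplicates and rows without a churn label."""
    clean = raw.drop_duplicates()
    clean = clean[clean["Churn"] != ""].copy()
    clean["Churn"] = clean["Churn"].map({"No": 0, "Yes": 1})
    return clean

def to_gold(clean: pd.DataFrame) -> pd.DataFrame:
    """Aggregate the clean layer into an analysis-ready summary."""
    return pd.DataFrame({
        "customers": [len(clean)],
        "churn_rate": [clean["Churn"].mean()],
    })

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Because every layer is derived from the previous one without modifying it, the raw `bronze` frame stays available for auditing any later transformation.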



## Environment setup

In [1]:
# Library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json

# 📌 Extraction - Bronze Layer

In [2]:
url = 'https://raw.githubusercontent.com/alura-cursos/challenge2-data-science/refs/heads/main/TelecomX_Data.json'


In [3]:
dados = pd.read_json(url)
dados

Unnamed: 0,customerID,Churn,customer,phone,internet,account
0,0002-ORFBO,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Part...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'One year', 'PaperlessBilling': '..."
1,0003-MKNFE,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
2,0004-TLHLJ,Yes,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
3,0011-IGKFF,Yes,"{'gender': 'Male', 'SeniorCitizen': 1, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
4,0013-EXCHZ,Yes,"{'gender': 'Female', 'SeniorCitizen': 1, 'Part...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
...,...,...,...,...,...,...
7262,9987-LUTYD,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Part...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'One year', 'PaperlessBilling': '..."
7263,9992-RRAMN,Yes,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
7264,9992-UJOEL,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
7265,9993-LHIEB,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'Two year', 'PaperlessBilling': '..."


## Normalizing the JSON data

In [4]:
# Columns that hold nested JSON objects
json_columns = ['customer','phone','internet', 'account']

# Flatten each nested column and concatenate the results with the remaining top-level columns
normalized_dfs = [pd.json_normalize(dados[col]) for col in json_columns]
df_bronze = pd.concat([dados.drop(columns=json_columns), *normalized_dfs], axis=1)

df_bronze

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,Yes,No,Yes,Yes,No,One year,Yes,Mailed check,65.60,593.3
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,No,No,No,No,Yes,Month-to-month,No,Mailed check,59.90,542.4
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,No,Yes,No,No,No,Month-to-month,Yes,Electronic check,73.90,280.85
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,...,Yes,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,98.00,1237.85
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,...,No,No,Yes,Yes,No,Month-to-month,Yes,Mailed check,83.90,267.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7262,9987-LUTYD,No,Female,0,No,No,13,Yes,No,DSL,...,No,No,Yes,No,No,One year,No,Mailed check,55.15,742.9
7263,9992-RRAMN,Yes,Male,0,Yes,No,22,Yes,Yes,Fiber optic,...,No,No,No,No,Yes,Month-to-month,Yes,Electronic check,85.10,1873.7
7264,9992-UJOEL,No,Male,0,No,No,2,Yes,No,DSL,...,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,50.30,92.75
7265,9993-LHIEB,No,Male,0,Yes,Yes,67,Yes,No,DSL,...,No,Yes,Yes,No,Yes,Two year,No,Mailed check,67.85,4627.65
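
The dotted column names such as `Charges.Monthly` are a result of how `pd.json_normalize` flattens nested dictionaries: nested keys are joined with a `.` separator by default. The sketch below illustrates this on invented records; the nesting of `Charges` inside the `account` column is inferred from the resulting column names, not shown in the truncated output above.

```python
import pandas as pd

# Hypothetical records shaped like the 'account' column (values invented).
sample = pd.Series([
    {"Contract": "One year", "Charges": {"Monthly": 10.0, "Total": "20.0"}},
    {"Contract": "Two year", "Charges": {"Monthly": 30.0, "Total": " "}},
])

# A plain list of dicts is the documented input; nested keys are joined with sep (default '.').
flat = pd.json_normalize(sample.tolist())
print(flat.columns.tolist())

# A different separator avoids dots, which need backticks inside DataFrame.query().
flat_underscore = pd.json_normalize(sample.tolist(), sep="_")
print(flat_underscore.columns.tolist())
```

Keeping the default separator matches the rest of this notebook; the `sep="_"` variant is only an option for anyone who prefers column names that work in `query` without backticks.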


# 🔧 Transformation - Silver Layer

## Initial data exploration

In [5]:
df_prata = df_bronze.copy()

### Data dictionary

  

    customerID: unique identifier of each customer
    Churn: whether or not the customer left the company
    gender: gender (male or female)
    SeniorCitizen: whether or not the customer is 65 years old or older
    Partner: whether or not the customer has a partner
    Dependents: whether or not the customer has dependents
    tenure: number of months the customer has been under contract
    PhoneService: phone service subscription
    MultipleLines: subscription to more than one phone line
    InternetService: subscription to an internet provider
    OnlineSecurity: additional online security subscription
    OnlineBackup: additional online backup subscription
    DeviceProtection: additional device protection subscription
    TechSupport: additional technical support subscription, with shorter waiting times
    StreamingTV: cable TV subscription
    StreamingMovies: movie streaming subscription
    Contract: contract type
    PaperlessBilling: whether the customer prefers to receive the bill online
    PaymentMethod: payment method
    Charges.Monthly: total of all the customer's services per month
    Charges.Total: total amount spent by the customer


In [6]:
df_prata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7267 entries, 0 to 7266
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7267 non-null   object 
 1   Churn             7267 non-null   object 
 2   gender            7267 non-null   object 
 3   SeniorCitizen     7267 non-null   int64  
 4   Partner           7267 non-null   object 
 5   Dependents        7267 non-null   object 
 6   tenure            7267 non-null   int64  
 7   PhoneService      7267 non-null   object 
 8   MultipleLines     7267 non-null   object 
 9   InternetService   7267 non-null   object 
 10  OnlineSecurity    7267 non-null   object 
 11  OnlineBackup      7267 non-null   object 
 12  DeviceProtection  7267 non-null   object 
 13  TechSupport       7267 non-null   object 
 14  StreamingTV       7267 non-null   object 
 15  StreamingMovies   7267 non-null   object 
 16  Contract          7267 non-null   object 


In [7]:
df_prata.describe()

Unnamed: 0,SeniorCitizen,tenure,Charges.Monthly
count,7267.0,7267.0,7267.0
mean,0.162653,32.346498,64.720098
std,0.369074,24.571773,30.129572
min,0.0,0.0,18.25
25%,0.0,9.0,35.425
50%,0.0,29.0,70.3
75%,0.0,55.0,89.875
max,1.0,72.0,118.75


In [8]:
df_prata.head(2)

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,Yes,No,Yes,Yes,No,One year,Yes,Mailed check,65.6,593.3
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,No,No,No,No,Yes,Month-to-month,No,Mailed check,59.9,542.4


### Step 1 - Convert numeric columns


In [9]:
try:
  df_prata['Charges.Total'] = df_prata['Charges.Total'].astype(np.float64)
except Exception as e:
  print('Error converting column: ', e)

Error converting column:  could not convert string to float: ' '


In [10]:
# Replace the blank string that causes the error with NaN and try again
df_prata['Charges.Total'] = df_prata['Charges.Total'].replace(' ', np.nan)
try:
  df_prata['Charges.Total'] = df_prata['Charges.Total'].astype(np.float64)
except Exception as e:
  print('Error converting column: ', e)
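
An alternative worth considering, shown here as a sketch on invented values rather than on the real column: `pd.to_numeric` with `errors="coerce"` turns any unparseable entry (including the blank string) into `NaN` in one step, and comparing the `NaN` counts before and after shows how many values were affected.

```python
import pandas as pd

# Invented values mimicking a numeric column stored as text with a blank entry.
total = pd.Series(["593.3", "542.4", " ", "280.85"])

# errors='coerce' converts anything that cannot be parsed into NaN.
total_num = pd.to_numeric(total, errors="coerce")

# Entries that were present as text but became NaN after conversion.
n_coerced = int(total_num.isna().sum() - total.isna().sum())
print(total_num.dtype, n_coerced)
```

The trade-off is that coercion hides every kind of bad value, not just the blank one, so checking `n_coerced` against expectations is a sensible safeguard.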

### Step 2 - Convert yes/no columns to integer

In [11]:
df_prata['Churn'].value_counts()

Unnamed: 0_level_0,count
Churn,Unnamed: 1_level_1
No,5174
Yes,1869
,224


In [12]:
df_prata.query('Churn == ""')

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
30,0047-ZHDTW,,Female,0,No,No,11,Yes,Yes,Fiber optic,...,No,No,No,No,No,Month-to-month,Yes,Bank transfer (automatic),79.00,929.30
75,0120-YZLQA,,Male,0,No,No,71,Yes,No,No,...,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Credit card (automatic),19.90,1355.10
96,0154-QYHJU,,Male,0,No,No,29,Yes,No,DSL,...,Yes,No,Yes,No,No,One year,Yes,Electronic check,58.75,1696.20
98,0162-RZGMZ,,Female,1,No,No,5,Yes,No,DSL,...,Yes,No,Yes,No,No,Month-to-month,No,Credit card (automatic),59.90,287.85
175,0274-VVQOQ,,Male,1,Yes,No,65,Yes,Yes,Fiber optic,...,Yes,Yes,No,Yes,Yes,One year,Yes,Bank transfer (automatic),103.15,6792.45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7158,9840-GSRFX,,Female,0,No,No,14,Yes,Yes,DSL,...,Yes,No,No,No,No,One year,Yes,Mailed check,54.25,773.20
7180,9872-RZQQB,,Female,0,Yes,No,49,No,No phone service,DSL,...,No,No,No,Yes,No,Month-to-month,No,Bank transfer (automatic),40.65,2070.75
7211,9920-GNDMB,,Male,0,No,No,9,Yes,Yes,Fiber optic,...,No,No,No,No,No,Month-to-month,Yes,Electronic check,76.25,684.85
7239,9955-RVWSC,,Female,0,Yes,Yes,67,Yes,No,No,...,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Bank transfer (automatic),19.25,1372.90


Since these records have no churn information and this column is central to our analysis, we will drop them from the Silver layer before converting the column to numeric (0 = No, 1 = Yes).


In [13]:
df_prata = df_prata.query('Churn != ""').copy()  # copy so later column assignments do not act on a view
df_prata['Churn'].value_counts()

Unnamed: 0_level_0,count
Churn,Unnamed: 1_level_1
No,5174
Yes,1869


In [14]:
df_prata['Churn'] = df_prata['Churn'].replace({'Yes': '1', 'No': '0'})
df_prata['Churn'] = df_prata['Churn'].astype(np.int64)

### Checking the remaining columns with Yes/No values (or similar values that can be converted to 0 and 1)

In [15]:
colunas = df_prata.columns
colunas

Index(['customerID', 'Churn', 'gender', 'SeniorCitizen', 'Partner',
       'Dependents', 'tenure', 'PhoneService', 'MultipleLines',
       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'Charges.Monthly',
       'Charges.Total'],
      dtype='object')

In [16]:
df_prata.head(3)

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
0,0002-ORFBO,0,Female,0,Yes,Yes,9,Yes,No,DSL,...,Yes,No,Yes,Yes,No,One year,Yes,Mailed check,65.6,593.3
1,0003-MKNFE,0,Male,0,No,No,9,Yes,Yes,DSL,...,No,No,No,No,Yes,Month-to-month,No,Mailed check,59.9,542.4
2,0004-TLHLJ,1,Male,0,No,No,4,Yes,No,Fiber optic,...,No,Yes,No,No,No,Month-to-month,Yes,Electronic check,73.9,280.85


In [17]:
colunas_binarias = ['Partner', 'Dependents', 'PhoneService', 'MultipleLines',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling']
# Check the values in each column
for coluna in colunas_binarias:
  print('-> Coluna: ',coluna)
  print(df_prata[coluna].value_counts())
  print('---')


-> Coluna:  Partner
Partner
No     3641
Yes    3402
Name: count, dtype: int64
---
-> Coluna:  Dependents
Dependents
No     4933
Yes    2110
Name: count, dtype: int64
---
-> Coluna:  PhoneService
PhoneService
Yes    6361
No      682
Name: count, dtype: int64
---
-> Coluna:  MultipleLines
MultipleLines
No                  3390
Yes                 2971
No phone service     682
Name: count, dtype: int64
---
-> Coluna:  OnlineSecurity
OnlineSecurity
No                     3498
Yes                    2019
No internet service    1526
Name: count, dtype: int64
---
-> Coluna:  OnlineBackup
OnlineBackup
No                     3088
Yes                    2429
No internet service    1526
Name: count, dtype: int64
---
-> Coluna:  DeviceProtection
DeviceProtection
No                     3095
Yes                    2422
No internet service    1526
Name: count, dtype: int64
---
-> Coluna:  TechSupport
TechSupport
No                     3473
Yes                    2044
No internet service    1526
Name:

I will use a regex to find the values that start with "No" (such as "No internet service" or "No phone service") and convert them all to 0.


In [18]:
for coluna in colunas_binarias:
  df_prata[coluna] = df_prata[coluna].str.replace('(?i)^no.*', '0',regex=True) # Regex matching 'no' at the start (case-insensitive) plus the rest of the text, replaced by '0'
  df_prata[coluna] = df_prata[coluna].str.replace('(?i)^yes.*', '1',regex=True)
  df_prata[coluna] = df_prata[coluna].astype(np.int64)
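
To see what the two patterns do, the sketch below applies them to a handful of invented values. Because `^no.*` matches the whole string, `No internet service` and `No phone service` collapse to `0` together with plain `No`. An explicit dictionary with `map` is a stricter alternative (not used in this notebook): any unexpected label becomes `NaN` instead of passing through unchanged.

```python
import pandas as pd

values = pd.Series(["Yes", "No", "No internet service", "No phone service"])

# Regex approach: anything starting with 'no' -> '0', anything starting with 'yes' -> '1'.
by_regex = (
    values.str.replace(r"(?i)^no.*", "0", regex=True)
          .str.replace(r"(?i)^yes.*", "1", regex=True)
          .astype("int64")
)

# Explicit mapping: labels missing from the dictionary would become NaN and can be detected.
mapping = {"Yes": 1, "No": 0, "No internet service": 0, "No phone service": 0}
by_map = values.map(mapping)

print(by_regex.tolist())
print(by_map.tolist())
```

Note that collapsing "No internet service" into `0` discards the distinction between "has internet but not this add-on" and "has no internet at all"; that information remains available through the `InternetService` column.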


### Step 3 - Convert categorical columns

In [19]:
df_prata['gender'].value_counts()

Unnamed: 0_level_0,count
gender,Unnamed: 1_level_1
Male,3555
Female,3488


In [20]:
df_prata['gender'] = df_prata['gender'].astype('category')

In [21]:
df_prata['Contract'].value_counts()

Unnamed: 0_level_0,count
Contract,Unnamed: 1_level_1
Month-to-month,3875
Two year,1695
One year,1473


In [22]:
df_prata['Contract'] = df_prata['Contract'].astype('category')

In [23]:
df_prata['PaymentMethod'].value_counts()

Unnamed: 0_level_0,count
PaymentMethod,Unnamed: 1_level_1
Electronic check,2365
Mailed check,1612
Bank transfer (automatic),1544
Credit card (automatic),1522


In [24]:
df_prata['PaymentMethod'] = df_prata['PaymentMethod'].astype('category')

In [25]:
df_prata['InternetService'].value_counts()

Unnamed: 0_level_0,count
InternetService,Unnamed: 1_level_1
Fiber optic,3096
DSL,2421
No,1526


In [26]:
df_prata['InternetService'] = df_prata['InternetService'].astype('category')
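
The four conversions above follow the same pattern, so they could also be written as a single `astype` call with a column-to-dtype dictionary. The sketch below uses invented rows; treating `Contract` as an ordered categorical is an optional assumption (that contract lengths have a natural order) which the notebook itself does not make.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Contract": ["One year", "Month-to-month", "Two year"],
})

# Optional: an ordered dtype so that comparisons and sorting follow contract length.
contract_dtype = CategoricalDtype(
    categories=["Month-to-month", "One year", "Two year"], ordered=True
)

# One call converts several columns, each to its own dtype.
df = df.astype({"gender": "category", "Contract": contract_dtype})

print(df.dtypes)
print(df["Contract"].min())
```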

In [27]:
df_prata.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 0 to 7266
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   customerID        7043 non-null   object  
 1   Churn             7043 non-null   int64   
 2   gender            7043 non-null   category
 3   SeniorCitizen     7043 non-null   int64   
 4   Partner           7043 non-null   int64   
 5   Dependents        7043 non-null   int64   
 6   tenure            7043 non-null   int64   
 7   PhoneService      7043 non-null   int64   
 8   MultipleLines     7043 non-null   int64   
 9   InternetService   7043 non-null   category
 10  OnlineSecurity    7043 non-null   int64   
 11  OnlineBackup      7043 non-null   int64   
 12  DeviceProtection  7043 non-null   int64   
 13  TechSupport       7043 non-null   int64   
 14  StreamingTV       7043 non-null   int64   
 15  StreamingMovies   7043 non-null   int64   
 16  Contract          7043 non-nu

In [28]:
df_prata.head(3)

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
0,0002-ORFBO,0,Female,0,1,1,9,1,0,DSL,...,1,0,1,1,0,One year,1,Mailed check,65.6,593.3
1,0003-MKNFE,0,Male,0,0,0,9,1,1,DSL,...,0,0,0,0,1,Month-to-month,0,Mailed check,59.9,542.4
2,0004-TLHLJ,1,Male,0,0,0,4,1,0,Fiber optic,...,0,1,0,0,0,Month-to-month,1,Electronic check,73.9,280.85


In [29]:
df_prata.describe()

Unnamed: 0,Churn,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,PaperlessBilling,Charges.Monthly,Charges.Total
count,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7032.0
mean,0.26537,0.162147,0.483033,0.299588,32.371149,0.903166,0.421837,0.286668,0.344881,0.343888,0.290217,0.384353,0.387903,0.592219,64.761692,2283.300441
std,0.441561,0.368612,0.499748,0.45811,24.559481,0.295752,0.493888,0.452237,0.475363,0.475038,0.453895,0.486477,0.487307,0.491457,30.090047,2266.771362
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.25,18.8
25%,0.0,0.0,0.0,0.0,9.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,35.5,401.45
50%,0.0,0.0,0.0,0.0,29.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,70.35,1397.475
75%,1.0,0.0,1.0,1.0,55.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,89.85,3794.7375
max,1.0,1.0,1.0,1.0,72.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,118.75,8684.8


Remove customers with tenure = 0 (months under contract)

In [30]:
df_prata.query('tenure == 0')['customerID'].count()

np.int64(11)

In [31]:
df_prata = df_prata.query('tenure > 0')

Check for records with Charges.Total = 0 (total charges)

In [32]:
df_prata.query('`Charges.Total` == 0')

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
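
The empty result is expected: the blank totals were replaced with `NaN` earlier rather than with `0`. The `describe()` output above counts 7032 non-null `Charges.Total` values against 7043 rows, the same gap of 11 as the number of `tenure == 0` customers. That suggests, although the notebook does not verify it, that the missing totals belong to the new customers that were just removed. A check along these lines, shown here on invented rows, would confirm it:

```python
import numpy as np
import pandas as pd

# Invented rows: new customers (tenure 0) have no total charge yet.
df = pd.DataFrame({
    "tenure": [0, 9, 0, 4],
    "Charges.Total": [np.nan, 593.3, np.nan, 280.85],
})

missing_total = df["Charges.Total"].isna()
zero_tenure = df["tenure"] == 0

# True when every missing total belongs to a zero-tenure row and vice versa.
same_rows = bool((missing_total == zero_tenure).all())
print(same_rows)

# After dropping zero-tenure rows, count the missing totals that remain.
remaining_missing = int(df.loc[df["tenure"] > 0, "Charges.Total"].isna().sum())
print(remaining_missing)
```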


In [33]:
# Save the Silver layer as a Parquet file for later analysis
df_prata.to_parquet('prata_TelecomX.parquet')

# 📊 Load and analysis - Gold Layer

In [34]:
df_ouro = pd.read_parquet('prata_TelecomX.parquet')


Let's load the data into a Gold layer and create a database to store it in a star schema.

In [35]:
import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table, inspect
engine = create_engine('sqlite:///:memory:')

In [36]:
colunas_cliente = ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents']
df_clientes = df_ouro[colunas_cliente].copy()
df_clientes.to_sql('dim_clientes', engine, index=True)

7032

In [37]:
colunas_servicos = ['PhoneService', 'MultipleLines',
       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies']


In [38]:
df_combinacoes = df_ouro[colunas_servicos].drop_duplicates().reset_index(drop=True)
df_combinacoes['tipo_servico'] = df_combinacoes.apply(lambda x: '-'.join(x.dropna().astype(str)), axis=1)
df_combinacoes['ID_services'] = df_combinacoes.index
df_combinacoes.set_index('ID_services', inplace=True)
df_combinacoes.head()

Unnamed: 0_level_0,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,tipo_servico
ID_services,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,1,0,DSL,0,1,0,1,1,0,1-0-DSL-0-1-0-1-1-0
1,1,1,DSL,0,0,0,0,0,1,1-1-DSL-0-0-0-0-0-1
2,1,0,Fiber optic,0,0,1,0,0,0,1-0-Fiber optic-0-0-1-0-0-0
3,1,0,Fiber optic,0,1,1,0,1,1,1-0-Fiber optic-0-1-1-0-1-1
4,1,0,Fiber optic,0,0,0,1,1,0,1-0-Fiber optic-0-0-0-1-1-0


In [39]:
print(df_combinacoes.columns)

Index(['PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'tipo_servico'],
      dtype='object')
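
The combination IDs above are built by joining the service values into a string key and later merging on it. As a possible alternative (not what this notebook does), `groupby(...).ngroup()` assigns one integer per distinct combination directly, without the intermediate string. The frame below is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "PhoneService": [1, 1, 1, 0],
    "InternetService": ["DSL", "Fiber optic", "DSL", "DSL"],
})
service_cols = ["PhoneService", "InternetService"]

# One integer per distinct combination of the service columns.
df["ID_services"] = df.groupby(service_cols).ngroup()

# The dimension table is the set of distinct combinations keyed by that ID.
dim = df[service_cols + ["ID_services"]].drop_duplicates().set_index("ID_services")

print(df[["customerID", "ID_services"]])
print(dim)
```

Customers with identical service combinations (`A1` and `C3` here) receive the same ID, which is the property the fact table relies on.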


In [40]:
df_servicos = df_ouro[['customerID']+colunas_servicos ].copy()
df_servicos['tipo_servico'] = df_servicos[colunas_servicos].apply(lambda x: '-'.join(x.dropna().astype(str)), axis=1)
df_servicos.drop(columns=colunas_servicos, inplace=True)
df_servicos = df_servicos.merge(df_combinacoes[['tipo_servico']].assign(ID_services=df_combinacoes.index), on='tipo_servico', how='left')
df_servicos.set_index('customerID', inplace=True)
df_servicos.drop(columns='tipo_servico', inplace=True)

# customerID is now the index of df_servicos, so selecting it as a column raises a KeyError;
# the join back into df_ouro is done in a later cell, matching on the index name.

In [None]:
df_servicos.head()

In [None]:
df_ouro = df_ouro.merge(df_servicos[['ID_services']], on='customerID', how='left')
df_ouro

In [None]:
df_combinacoes.to_sql('dim_tipos_servicos', engine, index=True)
df_servicos.to_sql('fato_servicos', engine, index=True)

In [None]:
colunas_contratos = ['customerID', 'Churn', 'tenure', 'Contract',
       'PaperlessBilling', 'PaymentMethod']
contratos = df_ouro[colunas_contratos].copy()
contratos.to_sql('dim_contratos', engine, index=True)

colunas_servicos = ['customerID', 'PhoneService', 'MultipleLines',
       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies' ]
servicos = df_ouro[colunas_servicos].copy()
servicos.to_sql('dim_servicos', engine, index=True)

colunas_gastos = ['customerID', 'Charges.Monthly', 'Charges.Total']
gastos = df_ouro[colunas_gastos].copy()
gastos.to_sql('fato_gasto', engine, index=True)

colunas_gastos = ['customerID', 'Charges.Total']
gastos = df_ouro[colunas_gastos].copy()
gastos.to_sql('fato_gasto_totais', engine, index=True)
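
Once the tables are loaded, the schema can be queried with plain SQL. The sketch below is self-contained: it uses an in-memory SQLite database through the standard-library `sqlite3` module and invented rows. The table and column names mirror the ones above, but the query is only an example of the kind of question the Gold layer can answer (churn rate per contract type).

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Invented rows standing in for the customer and contract tables.
pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "gender": ["Female", "Male", "Male", "Female"],
}).to_sql("dim_clientes", conn, index=False)

pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "Two year"],
    "Churn": [1, 0, 0, 0],
}).to_sql("dim_contratos", conn, index=False)

# Join the two tables on customerID and aggregate churn per contract type.
query = """
SELECT k.Contract,
       COUNT(*)     AS customers,
       AVG(k.Churn) AS churn_rate
FROM dim_contratos AS k
JOIN dim_clientes  AS c ON c.customerID = k.customerID
GROUP BY k.Contract
ORDER BY k.Contract
"""
churn_by_contract = pd.read_sql(query, conn)
print(churn_by_contract)
conn.close()
```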

# 📄 Final Report

In [None]:
# Map Service Combinations to IDs
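# Assumption: df_combinacoes (indexed by ID_services) is the service dimension; df_dim_servicos is never defined above
service_id_map = df_combinacoes.reset_index().set_index('tipo_servico')['ID_services'].to_dict()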
print(service_id_map)