Data Wrangling - Case 2

If you were a manager at a company, what would you do differently if you found out that some of your employees were thinking of leaving? Would you give them a raise, look for replacements, or try to find out what could be improved? Either way, knowing in advance would be useful and would avoid a lot of problems!

In today's case, that is exactly what we are going to work on!

Daily Happiness & Employee Turnover
Is There a Relationship Between Employee Happiness and Job Turnover?


Dataset: https://www.kaggle.com/harriken/employeeturnover?select=votes.csv


Importing the Libraries and the Dataset

In [2]:
import pandas as pd
import numpy as np

- <b>Enriched churn file (result of part 1 with additional columns)</b>:
Indicates whether each employee is still at the company, flagging those who have already left (<i>churn</i>). The employee ID is unique only within a company (distinct employees at different companies may share the same ID). It contains information about the votes the employee took part in and the comments they posted, plus the person's gender and the company type.
File: data/churn.csv

- <b>Comment interactions file</b>:
Describes the users who interacted with the posted comments by liking or disliking them. File: data/commentInteractions.csv

- <b>Anonymized comments file</b>:
Describes each comment written by the employees, with a summary of its number of likes and dislikes. File: data/comments_clean_anonimized.csv

In [3]:
df_churn_enriq = pd.read_csv('df_churn_enriq.csv')
df_comment_interactions = pd.read_csv('df_comment_interactions.csv')
df_comments_clean_anonimized = pd.read_csv('df_comments_clean_anonimized.csv')
df_churn_enriq.head()

Unnamed: 0,employee,companyAlias,numVotes,lastParticipationDate,stillExists,vote_mean,vote_mode,vote_min,vote_max,vote_count,...,likes_max,dislikes_mean,dislikes_min,dislikes_max,qt_dias_diff_comment_min,qt_dias_diff_comment_max,qt_dias_diff_comment_median,commentId_count,companyType,gender
0,512,56aec740f1ef260003e307d6,4,2017-02-23 11:48:04,1,2.0,1.0,1.0,3.0,3.0,...,9.0,1.0,0.0,2.0,2.0,30.0,16.0,2.0,Product,Male
1,2,56aec740f1ef260003e307d6,72,2017-03-17 00:00:00,1,2.239437,1.0,1.0,4.0,71.0,...,12.0,4.785714,0.0,12.0,43.0,399.0,267.0,14.0,Product,Male
2,487,56aec740f1ef260003e307d6,14,2016-11-19 14:02:14,0,3.181818,3.0,2.0,4.0,22.0,...,6.0,0.0,0.0,0.0,22.0,22.0,22.0,1.0,Product,Female
3,3,56aec740f1ef260003e307d6,22,2017-02-16 00:00:00,1,3.47619,4.0,2.0,4.0,21.0,...,14.0,0.888889,0.0,4.0,6.0,376.0,239.0,9.0,Product,Female
4,4,56aec740f1ef260003e307d6,195,2017-03-20 00:00:00,1,3.860825,4.0,1.0,4.0,194.0,...,29.0,1.0,0.0,4.0,33.0,271.0,107.0,10.0,Product,Female


Data Enrichment

a) Let's combine the likes_mean and dislikes_mean columns by computing the ratio of one to the other. This gives us an idea of the proportion of likes to dislikes on an employee's comments.

In [6]:
df_churn_enriq['rt_like_dislike'] = df_churn_enriq.likes_mean / df_churn_enriq.dislikes_mean
df_churn_enriq['rt_like_dislike'].head()

0    7.500000
1    1.134328
2         inf
3    3.750000
4    7.500000
Name: rt_like_dislike, dtype: float64

In [8]:
df_churn_enriq.shape

(4064, 26)

b) Because it is a division, this variable ended up with some odd values, such as nulls and infinities. A null appears when we divide 0 by 0. Infinity appears when there are some likes and 0 dislikes. Count the nulls and the infinities (np.inf).

In [9]:
df_churn_enriq[df_churn_enriq['rt_like_dislike'] == np.inf].shape   

(538, 26)

In [15]:
df_churn_enriq[df_churn_enriq['rt_like_dislike'].isna()].shape

(0, 26)
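The two edge cases described in step b can be reproduced on a toy frame. The sketch below uses made-up values (not rows from the dataset) and counts infinities with np.isinf and nulls with isna before anything is filled in.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only, not taken from the dataset
toy = pd.DataFrame({
    "likes_mean":    [7.5, 6.0, 0.0, 3.0],
    "dislikes_mean": [1.0, 0.0, 0.0, 2.0],
})

# x / 0 yields inf, 0 / 0 yields NaN
toy["rt_like_dislike"] = toy["likes_mean"] / toy["dislikes_mean"]

n_inf = int(np.isinf(toy["rt_like_dislike"]).sum())
n_null = int(toy["rt_like_dislike"].isna().sum())
print(n_inf, n_null)
```

The order of the cells matters here: once nulls are replaced with 1, isna() reports zero rows. The execution counters suggest the null count above (In [15]) ran after the fill (In [14]), which would be consistent with the (0, 26) result.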

c) Let's fill the nulls with the value 1, which indicates the same number of likes and dislikes

In [14]:
df_churn_enriq.loc[df_churn_enriq['rt_like_dislike'].isna(), 'rt_like_dislike'] = 1

d) Plot the histogram of this new variable

In [17]:
df_churn_enriq['rt_like_dislike'].describe()

count    4064.000000
mean             inf
std              NaN
min         0.000000
25%         5.466071
50%         5.611109
75%        11.167619
max              inf
Name: rt_like_dislike, dtype: float64

In [18]:
import plotly.express as px
fig = px.histogram(df_churn_enriq, 'rt_like_dislike')
fig.show()

e) Since this variable has some unwanted spikes and a large number of outliers, it makes sense in this case to apply equal-frequency discretization using 3 bins.

Hint: for the KBinsDiscretizer algorithm to work, first replace the infinities with a large finite value

In [19]:
from sklearn.preprocessing import KBinsDiscretizer

In [23]:
df_churn_enriq.loc[df_churn_enriq['rt_like_dislike'] == np.inf, 'rt_like_dislike'] = 50

In [24]:
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
discretizer.fit(df_churn_enriq[['rt_like_dislike']].values)
df_churn_enriq['rt_like_dislike'] = discretizer.transform(df_churn_enriq[['rt_like_dislike']].values)

f) Check the new distribution

In [25]:
df_churn_enriq["rt_like_dislike"].value_counts()

1.0    1673
2.0    1358
0.0    1033
Name: rt_like_dislike, dtype: int64

In [26]:
fig = px.histogram(df_churn_enriq, x='rt_like_dislike')
fig.show()
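To see what equal-frequency binning does to an outlier, here is a minimal sketch on toy values. It uses pd.qcut rather than the KBinsDiscretizer from the cells above; both place bin edges at quantiles, but this is a swapped-in alternative for illustration, not the notebook's method.

```python
import pandas as pd

# Toy values with one extreme outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 1000], dtype=float)

# labels=False returns the ordinal bin index (0, 1, 2) for each value
bins = pd.qcut(values, q=3, labels=False)

counts = bins.value_counts().sort_index()
print(counts.tolist())
```

The outlier simply lands in the top bin instead of stretching the scale. With real data that contains many tied values, the quantile edges can fall on a repeated value and the bins come out unequal, which may be why the counts above are 1673, 1358, and 1033 rather than three equal groups.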

Creating a New Dimension

Let's go back to the comment datasets! Since we don't know whether the comments are positive or not, we could compute a comment-level dimension that indicates whether the author of each comment ended up leaving the company. This may tell us whether an employee who likes a comment written by someone who left is more prone to churn as well.

a) Merge df_comments_clean_anonimized with df_churn_enriq, bringing in only the stillExists variable.

In [27]:
df_comments = df_comments_clean_anonimized.merge(df_churn_enriq[['employee', 'companyAlias', 'stillExists']], on=['employee', 'companyAlias'])
df_comments.head()

Unnamed: 0,employee,companyAlias,commentId,txt,likes,dislikes,commentDate,stillExists
0,307,56aec740f1ef260003e307d6,58d018d7e010990004e38070,**********************************************...,4.0,0.0,2017-03-20 18:00:17,1
1,307,56aec740f1ef260003e307d6,58c913dfbd760e00043f1695,**********************************************...,8.0,1.0,2017-03-15 10:12:39,1
2,307,56aec740f1ef260003e307d6,58c736e732f72a00046f5614,********************************,5.0,0.0,2017-03-14 00:18:23,1
3,307,56aec740f1ef260003e307d6,58c53f19307b1e0004ad41af,**********************************************...,8.0,0.0,2017-03-12 12:28:34,1
4,307,56aec740f1ef260003e307d6,58c3ccf161ca670004c032fb,**********************,5.0,0.0,2017-03-11 10:09:36,1


b) Now merge df_comment_interactions with this new table, bringing in only the stillExists variable.

In [28]:
df_comments_emp = df_comment_interactions.merge(df_comments[['commentId', 'stillExists']], on=['commentId'])
df_comments_emp.head()

Unnamed: 0,employee,companyAlias,liked,disliked,commentId,stillExists
0,307,56aec740f1ef260003e307d6,1,0,58d018d7e010990004e38070,1
1,36,56aec740f1ef260003e307d6,1,0,58d018d7e010990004e38070,1
2,276,56aec740f1ef260003e307d6,1,0,58d018d7e010990004e38070,1
3,24,56aec740f1ef260003e307d6,1,0,58d018d7e010990004e38070,1
4,382,56aec740f1ef260003e307d6,1,0,58d0179ae010990004e3806d,1
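When a merge is meant to bring in a single column, duplicated keys on the right side can silently multiply rows. The sketch below, on toy frames with made-up values, uses the validate argument of DataFrame.merge to assert the expected many-to-one relationship between comments and employees.

```python
import pandas as pd

# Toy frames for illustration only
employees = pd.DataFrame({
    "employee": [1, 2],
    "companyAlias": ["A", "A"],
    "stillExists": [1, 0],
})
comments = pd.DataFrame({
    "commentId": ["c1", "c2", "c3"],
    "employee": [1, 1, 2],
    "companyAlias": ["A", "A", "A"],
})

# Many comments map to one employee row; pandas raises MergeError if the
# right side has duplicated (employee, companyAlias) keys
merged = comments.merge(
    employees[["employee", "companyAlias", "stillExists"]],
    on=["employee", "companyAlias"],
    how="inner",
    validate="many_to_one",
)
print(merged)
```

Comparing len(merged) with len(comments) afterwards is a cheap extra check that no rows were lost or duplicated.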


c) Our goal now is to revisit the pivot table concept and compute the mean of stillExists for each <employee, company> pair, split into liked and disliked comments.

In [30]:
df_rate_liked_comments = pd.pivot_table(df_comments_emp, values='stillExists', index=['companyAlias', 'employee'],
                    columns=['liked','disliked'], aggfunc=np.mean, fill_value=0)
df_rate_liked_comments.head()

Unnamed: 0_level_0,liked,0,1
Unnamed: 0_level_1,disliked,1,0
companyAlias,employee,Unnamed: 2_level_2,Unnamed: 3_level_2
5370af43e4b0cff95558c12a,18,0.444444,0.589912
5370af43e4b0cff95558c12a,20,0.0,0.375
5370af43e4b0cff95558c12a,21,0.125,0.595238
5370af43e4b0cff95558c12a,22,0.5,0.333333
5370af43e4b0cff95558c12a,23,0.5,0.5


In [31]:
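# Caution: positional naming. Columns are (liked=0, disliked=1), (liked=1, disliked=0), so 'liked' labels the disliked-comments mean and vice versa.
df_rate_liked_comments.columns = list(df_rate_liked_comments.columns.names)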

In [33]:
df_rate_liked_comments = df_rate_liked_comments.reset_index().rename(columns={'liked':'mean_stillExists_liked_comments',
                                            'disliked':'mean_stillExists_disliked_comments'})
df_rate_liked_comments.head()

Unnamed: 0,index,companyAlias,employee,mean_stillExists_liked_comments,mean_stillExists_disliked_comments
0,0,5370af43e4b0cff95558c12a,18,0.444444,0.589912
1,1,5370af43e4b0cff95558c12a,20,0.0,0.375
2,2,5370af43e4b0cff95558c12a,21,0.125,0.595238
3,3,5370af43e4b0cff95558c12a,22,0.5,0.333333
4,4,5370af43e4b0cff95558c12a,23,0.5,0.5
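Flattening the two-level pivot columns by position makes it easy to attach a name to the wrong column. A minimal sketch on toy data, assuming the same column layout as above, maps each (liked, disliked) pair to an explicit name instead. The toy values and the names dictionary are illustrative, not part of the original notebook.

```python
import pandas as pd

# Toy interactions for illustration only
interactions = pd.DataFrame({
    "companyAlias": ["A", "A", "A", "A", "A"],
    "employee":     [1, 1, 1, 2, 2],
    "liked":        [1, 1, 0, 1, 0],
    "disliked":     [0, 0, 1, 0, 1],
    "stillExists":  [1, 0, 1, 1, 0],
})

rates = pd.pivot_table(
    interactions,
    values="stillExists",
    index=["companyAlias", "employee"],
    columns=["liked", "disliked"],
    aggfunc="mean",
    fill_value=0,
)

# Name each column from its (liked, disliked) key rather than its position
names = {
    (1, 0): "mean_stillExists_liked_comments",
    (0, 1): "mean_stillExists_disliked_comments",
}
rates.columns = [names[key] for key in rates.columns]
rates = rates.reset_index()
print(rates)
```

Calling reset_index() exactly once also avoids carrying a stray index column into later merges.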


d) Now let's merge with the initial dataset!

In [34]:
df_churn_enriq = df_churn_enriq.merge(df_rate_liked_comments, on=["employee","companyAlias"], how="left")
df_churn_enriq.head()

Unnamed: 0,employee,companyAlias,numVotes,lastParticipationDate,stillExists,vote_mean,vote_mode,vote_min,vote_max,vote_count,...,qt_dias_diff_comment_min,qt_dias_diff_comment_max,qt_dias_diff_comment_median,commentId_count,companyType,gender,rt_like_dislike,index,mean_stillExists_liked_comments,mean_stillExists_disliked_comments
0,512,56aec740f1ef260003e307d6,4,2017-02-23 11:48:04,1,2.0,1.0,1.0,3.0,3.0,...,2.0,30.0,16.0,2.0,Product,Male,2.0,1129.0,0.0,1.0
1,2,56aec740f1ef260003e307d6,72,2017-03-17 00:00:00,1,2.239437,1.0,1.0,4.0,71.0,...,43.0,399.0,267.0,14.0,Product,Male,0.0,841.0,1.0,0.979167
2,487,56aec740f1ef260003e307d6,14,2016-11-19 14:02:14,0,3.181818,3.0,2.0,4.0,22.0,...,22.0,22.0,22.0,1.0,Product,Female,2.0,1124.0,0.0,1.0
3,3,56aec740f1ef260003e307d6,22,2017-02-16 00:00:00,1,3.47619,4.0,2.0,4.0,21.0,...,6.0,376.0,239.0,9.0,Product,Female,0.0,842.0,1.0,1.0
4,4,56aec740f1ef260003e307d6,195,2017-03-20 00:00:00,1,3.860825,4.0,1.0,4.0,194.0,...,33.0,271.0,107.0,10.0,Product,Female,2.0,843.0,1.0,0.983607


Single-Variable Treatment

a) Plot the histograms of each numeric variable. Do the distributions look normal?

In [35]:
bx_columns=['numVotes','qt_dias_diff_vote_max',
       'qt_dias_diff_vote_median', 'qt_dias_diff_comment_min']

In [36]:
df_churn_enriq[bx_columns].describe()

Unnamed: 0,numVotes,qt_dias_diff_vote_max,qt_dias_diff_vote_median,qt_dias_diff_comment_min
count,4064.0,4064.0,4064.0,4064.0
mean,45.640994,148.462106,83.165108,33.290108
std,57.645434,125.680774,70.91659,46.409744
min,0.0,1.0,1.0,1.0
25%,6.0,45.0,26.0,10.0
50%,20.0,130.0,72.5,17.0
75%,65.0,207.0,117.125,28.0
max,273.0,563.0,336.0,215.0


In [37]:
fig = px.histogram(df_churn_enriq, x="qt_dias_diff_vote_median")
fig.show()

b) Many variables in the dataset have a very long tail in their distribution. Shall we improve that?

Apply the Box-Cox transformation to the following variables:

In [39]:
# Box-Cox requires strictly positive values; numVotes has a minimum of 0, so shift it by 1
df_churn_enriq['numVotes_pos'] = df_churn_enriq['numVotes'] + 1

In [40]:
from sklearn.preprocessing import PowerTransformer

bct_numVotes = PowerTransformer(method='box-cox', standardize=False)
bct_qt_dias_diff_vote_max = PowerTransformer(method='box-cox', standardize=False)
bct_qt_dias_diff_vote_median = PowerTransformer(method='box-cox', standardize=False)
bct_qt_dias_diff_comment_min = PowerTransformer(method='box-cox', standardize=False)

df_churn_enriq['numVotes_box-cox'] = bct_numVotes.fit_transform(df_churn_enriq[['numVotes_pos']])
df_churn_enriq['qt_dias_diff_vote_max_box-cox'] = bct_qt_dias_diff_vote_max.fit_transform(df_churn_enriq[['qt_dias_diff_vote_max']])
df_churn_enriq['qt_dias_diff_vote_median_box-cox'] = bct_qt_dias_diff_vote_median.fit_transform(df_churn_enriq[['qt_dias_diff_vote_median']])
df_churn_enriq['qt_dias_diff_comment_min_box-cox'] = bct_qt_dias_diff_comment_min.fit_transform(df_churn_enriq[['qt_dias_diff_comment_min']])

In [41]:
df_churn_enriq = df_churn_enriq.drop('numVotes_pos',axis=1)
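For intuition about what the transformation computes, here is a small NumPy sketch of the Box-Cox formula with a fixed lambda. PowerTransformer estimates lambda from the data; the value 0.5 below is picked by hand purely for illustration, and the box_cox helper is hypothetical, not a library function.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive x for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)          # limiting case as lambda approaches 0
    return (x ** lam - 1.0) / lam

num_votes = np.array([0, 3, 15, 99])   # toy counts; 0 is not allowed as-is
shifted = num_votes + 1                # same +1 shift as numVotes_pos
transformed = box_cox(shifted, 0.5)    # lambda chosen by hand, not estimated
print(transformed)
```

The gap between the largest and smallest value shrinks from 99 to 18, which is the tail-compressing effect the histograms are meant to show.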

c) Visualize again after the transformation

In [42]:
fig = px.histogram(df_churn_enriq, x="qt_dias_diff_vote_median_box-cox")
fig.show()

Categorical Variables

a) Run value_counts() on the categorical variables

In [43]:
categ_cols = ["gender", "companyType"]
for col in categ_cols:
    print(df_churn_enriq[col].value_counts(dropna=False))

Female    2491
Male      1573
Name: gender, dtype: int64
Service    3018
Product    1046
Name: companyType, dtype: int64


b) Apply one-hot encoding to each of them, using drop_first.

In [44]:
df_churn_enriq[["gender_Male","companyType_Service"]] = pd.get_dummies(df_churn_enriq[categ_cols], prefix=categ_cols, drop_first=True)
df_churn_enriq.head()

Unnamed: 0,employee,companyAlias,numVotes,lastParticipationDate,stillExists,vote_mean,vote_mode,vote_min,vote_max,vote_count,...,rt_like_dislike,index,mean_stillExists_liked_comments,mean_stillExists_disliked_comments,numVotes_box-cox,qt_dias_diff_vote_max_box-cox,qt_dias_diff_vote_median_box-cox,qt_dias_diff_comment_min_box-cox,gender_Male,companyType_Service
0,512,56aec740f1ef260003e307d6,4,2017-02-23 11:48:04,1,2.0,1.0,1.0,3.0,3.0,...,2.0,1129.0,0.0,1.0,1.65573,7.822055,6.833475,0.704922,1,0
1,2,56aec740f1ef260003e307d6,72,2017-03-17 00:00:00,1,2.239437,1.0,1.0,4.0,71.0,...,0.0,841.0,1.0,0.979167,4.630097,23.573827,17.893925,4.125844,1,0
2,487,56aec740f1ef260003e307d6,14,2016-11-19 14:02:14,0,3.181818,3.0,2.0,4.0,22.0,...,2.0,1124.0,0.0,1.0,2.840827,10.418957,6.302382,3.334595,0,0
3,3,56aec740f1ef260003e307d6,22,2017-02-16 00:00:00,1,3.47619,4.0,2.0,4.0,21.0,...,0.0,842.0,1.0,1.0,3.314402,22.864652,20.0386,1.871862,0,0
4,4,56aec740f1ef260003e307d6,195,2017-03-20 00:00:00,1,3.860825,4.0,1.0,4.0,194.0,...,2.0,843.0,1.0,0.983607,5.798258,22.603556,12.207508,3.81025,0,0
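With drop_first, each binary category collapses to a single 0/1 column, and the dropped level becomes the baseline. The sketch below, on toy rows, lets get_dummies generate the column names and attaches them with join, so the names never depend on a hand-written list matching the output order.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "companyType": ["Product", "Service", "Product"],
})
categ_cols = ["gender", "companyType"]

# drop_first removes the alphabetically first level of each column
# (Female and Product), which then act as the baseline categories
dummies = pd.get_dummies(toy[categ_cols], prefix=categ_cols,
                         drop_first=True, dtype=int)
toy = toy.join(dummies)
print(toy)
```

Passing dtype=int keeps the 0/1 representation shown in the table above; recent pandas versions otherwise return boolean columns.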


Validation & Final Cleanup

a) Check the correlation between each variable and the target variable

In [49]:
for col in df_churn_enriq.select_dtypes(include=np.number).columns:
    print(col, df_churn_enriq[col].corr(df_churn_enriq['stillExists']))

employee -0.13061930593362
numVotes 0.08261231402977402
stillExists 1.0
vote_mean 0.03660244375244784
vote_mode 0.024327597708567215
vote_min -0.019976048116231233
vote_max 0.04955892091679814
vote_count 0.051693958745364255
qt_dias_diff_vote_min 0.017745344080851926
qt_dias_diff_vote_max 0.04458513068028744
qt_dias_diff_vote_median 0.07389656088690946
likes_mean -0.0046716420603598
likes_min 0.08659450789337486
likes_max 0.023660705460176183
dislikes_mean -0.04865795353413266
dislikes_min 0.0634297097249172
dislikes_max 0.03957882960319471
qt_dias_diff_comment_min 0.07984348456655171
qt_dias_diff_comment_max -0.014643999691361625
qt_dias_diff_comment_median 0.011666737355124758
commentId_count 0.03857497478620653
rt_like_dislike 0.0463161959215407
index 0.27292281861793455
mean_stillExists_liked_comments 0.07396052856940599
mean_stillExists_disliked_comments 0.3379784293482079
numVotes_box-cox 0.1091111216309417
qt_dias_diff_vote_max_box-cox 0.08378458353986959
qt_dias_diff_vote_media
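A long printed list is hard to scan, so it can help to rank the features by absolute correlation with the target. This is a minimal sketch on toy columns (the feature names are made up) using DataFrame.corrwith.

```python
import pandas as pd

# Toy frame for illustration only
df = pd.DataFrame({
    "feature_up":   [1.0, 2.0, 3.0, 4.0],
    "feature_flat": [1.0, 2.0, 2.0, 1.0],
    "label_txt":    list("abcd"),        # non-numeric, excluded below
    "stillExists":  [0, 0, 1, 1],
})

numeric = df.select_dtypes(include="number")
corr_with_target = (
    numeric.drop(columns="stillExists").corrwith(numeric["stillExists"])
)

# Order by strength of association, regardless of sign
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Identifier-like columns such as employee or index can show up with a non-trivial correlation in a ranking like this, which is a reason to exclude them from modeling rather than a sign that they carry signal.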

b) Check the distributions and types of the remaining variables again

In [46]:
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
df_churn_enriq.describe()

Unnamed: 0,employee,numVotes,stillExists,vote_mean,vote_mode,vote_min,vote_max,vote_count,qt_dias_diff_vote_min,qt_dias_diff_vote_max,qt_dias_diff_vote_median,likes_mean,likes_min,likes_max,dislikes_mean,dislikes_min,dislikes_max,qt_dias_diff_comment_min,qt_dias_diff_comment_max,qt_dias_diff_comment_median,commentId_count,rt_like_dislike,index,mean_stillExists_liked_comments,mean_stillExists_disliked_comments,numVotes_box-cox,qt_dias_diff_vote_max_box-cox,qt_dias_diff_vote_median_box-cox,qt_dias_diff_comment_min_box-cox,gender_Male,companyType_Service
count,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0,3007.0,3007.0,3007.0,4064.0,4064.0,4064.0,4064.0,4064.0,4064.0
mean,192.518455,45.640994,0.852608,2.949092,2.981053,1.859744,3.765748,45.355315,9.157972,148.462106,83.165108,6.621822,1.675689,13.477362,1.135128,0.103346,3.330217,33.290108,145.930856,102.120325,8.565207,1.07997,1604.256734,0.657053,0.916686,3.222967,13.403065,9.647534,3.03849,0.387057,0.742618
std,206.118146,57.645434,0.35454,0.596316,0.789037,0.86926,0.48583,58.683164,17.706919,125.680774,70.91659,4.231692,2.515155,9.533752,1.048029,0.458624,3.942949,46.409744,93.598221,70.456009,18.528229,0.762945,890.694963,0.445815,0.197129,1.549058,6.282314,4.563803,1.406228,0.487137,0.437245
min,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,51.0,6.0,1.0,2.705882,3.0,1.0,4.0,5.0,1.0,45.0,26.0,4.0,0.0,8.0,0.333333,0.0,1.0,10.0,97.75,64.0,0.0,0.0,844.5,0.0,0.953718,2.013851,8.636581,6.239919,2.435986,0.0,0.0
50%,131.0,20.0,1.0,3.0,3.0,2.0,4.0,19.0,3.0,130.0,72.5,6.771111,1.0,12.0,1.206733,0.0,2.0,17.0,139.0,94.0,2.0,1.0,1623.0,1.0,1.0,3.213016,14.294602,10.300969,3.036959,0.0,1.0
75%,244.0,65.0,1.0,3.266667,3.0,2.0,4.0,65.0,7.0,207.0,117.125,6.771111,1.0,15.0,1.206733,0.0,3.0,28.0,165.0,116.125,9.0,2.0,2371.5,1.0,1.0,4.51313,17.602663,12.786742,3.616377,1.0,1.0
max,999.0,273.0,1.0,4.0,4.0,4.0,4.0,323.0,100.0,563.0,336.0,20.5,12.0,44.0,5.6,3.0,17.0,215.0,468.0,351.0,279.0,2.0,3122.0,1.0,1.0,6.203779,27.078685,20.0386,6.134444,1.0,1.0


c) Are there still any null values? If so, handle them.

In [50]:
mean_cols=['mean_stillExists_liked_comments','mean_stillExists_disliked_comments']
for col in mean_cols:
    avg=df_churn_enriq[col].mean()
    df_churn_enriq.loc[df_churn_enriq[col].isna(),col]=avg
    print(col,"- nulls replaced with",avg)

mean_stillExists_liked_comments - nulls replaced with 0.6570527436627507
mean_stillExists_disliked_comments - nulls replaced with 0.9166864577580643
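Mean imputation erases the information that a value was missing in the first place. One option, sketched here on toy values, is to record a flag before filling. The _was_null column is a hypothetical addition and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

col = "mean_stillExists_liked_comments"
toy = pd.DataFrame({col: [1.0, np.nan, 0.5, np.nan]})   # toy values only

# Hypothetical flag column recording which rows are about to be imputed
toy[col + "_was_null"] = toy[col].isna().astype(int)

avg = toy[col].mean()            # NaN is skipped: (1.0 + 0.5) / 2
toy[col] = toy[col].fillna(avg)
print(toy)
```

In this dataset the nulls come from the left merge, so they correspond to employees with no recorded comment interactions, and a flag like this would keep that fact visible to a model.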


In [51]:
df_churn_enriq.dtypes

employee                                int64
companyAlias                           object
numVotes                                int64
lastParticipationDate                  object
stillExists                             int64
vote_mean                             float64
vote_mode                             float64
vote_min                              float64
vote_max                              float64
vote_count                            float64
qt_dias_diff_vote_min                 float64
qt_dias_diff_vote_max                 float64
qt_dias_diff_vote_median              float64
likes_mean                            float64
likes_min                             float64
likes_max                             float64
dislikes_mean                         float64
dislikes_min                          float64
dislikes_max                          float64
qt_dias_diff_comment_min              float64
qt_dias_diff_comment_max              float64
qt_dias_diff_comment_median       

d) Let's decide which columns to keep

In [52]:
list(df_churn_enriq)

['employee',
 'companyAlias',
 'numVotes',
 'lastParticipationDate',
 'stillExists',
 'vote_mean',
 'vote_mode',
 'vote_min',
 'vote_max',
 'vote_count',
 'qt_dias_diff_vote_min',
 'qt_dias_diff_vote_max',
 'qt_dias_diff_vote_median',
 'likes_mean',
 'likes_min',
 'likes_max',
 'dislikes_mean',
 'dislikes_min',
 'dislikes_max',
 'qt_dias_diff_comment_min',
 'qt_dias_diff_comment_max',
 'qt_dias_diff_comment_median',
 'commentId_count',
 'companyType',
 'gender',
 'rt_like_dislike',
 'index',
 'mean_stillExists_liked_comments',
 'mean_stillExists_disliked_comments',
 'numVotes_box-cox',
 'qt_dias_diff_vote_max_box-cox',
 'qt_dias_diff_vote_median_box-cox',
 'qt_dias_diff_comment_min_box-cox',
 'gender_Male',
 'companyType_Service']

In [53]:
df_churn_final=df_churn_enriq[['employee',
 'companyAlias',
 #'numVotes',
 'vote_mean',
 'vote_mode',
 'vote_min',
 'vote_max',
 'vote_count',
 'qt_dias_diff_vote_min',
 #'qt_dias_diff_vote_max',
 #'qt_dias_diff_vote_median',
 'likes_mean',
 'likes_min',
 'likes_max',
 'dislikes_mean',
 'dislikes_min',
 'dislikes_max',
 #'qt_dias_diff_comment_min',
 'qt_dias_diff_comment_max',
 'qt_dias_diff_comment_median',
 'commentId_count',
 #'companyType',
 #'gender',
 'rt_like_dislike',
 'mean_stillExists_liked_comments',
 'mean_stillExists_disliked_comments',
 'numVotes_box-cox',
 'qt_dias_diff_vote_max_box-cox',
 'qt_dias_diff_vote_median_box-cox',
 'qt_dias_diff_comment_min_box-cox',
 'gender_Male',
 'companyType_Service',
 # prediction target
 'lastParticipationDate',
 'stillExists']]          

In [54]:
df_churn_final.head()

Unnamed: 0,employee,companyAlias,vote_mean,vote_mode,vote_min,vote_max,vote_count,qt_dias_diff_vote_min,likes_mean,likes_min,likes_max,dislikes_mean,dislikes_min,dislikes_max,qt_dias_diff_comment_max,qt_dias_diff_comment_median,commentId_count,rt_like_dislike,mean_stillExists_liked_comments,mean_stillExists_disliked_comments,numVotes_box-cox,qt_dias_diff_vote_max_box-cox,qt_dias_diff_vote_median_box-cox,qt_dias_diff_comment_min_box-cox,gender_Male,companyType_Service,lastParticipationDate,stillExists
0,512,56aec740f1ef260003e307d6,2.0,1.0,1.0,3.0,3.0,2.0,7.5,6.0,9.0,1.0,0.0,2.0,30.0,16.0,2.0,2.0,0.0,1.0,1.65573,7.822055,6.833475,0.704922,1,0,2017-02-23 11:48:04,1
1,2,56aec740f1ef260003e307d6,2.239437,1.0,1.0,4.0,71.0,21.0,5.428571,0.0,12.0,4.785714,0.0,12.0,399.0,267.0,14.0,0.0,1.0,0.979167,4.630097,23.573827,17.893925,4.125844,1,0,2017-03-17 00:00:00,1
2,487,56aec740f1ef260003e307d6,3.181818,3.0,2.0,4.0,22.0,2.0,6.0,6.0,6.0,0.0,0.0,0.0,22.0,22.0,1.0,2.0,0.0,1.0,2.840827,10.418957,6.302382,3.334595,0,0,2016-11-19 14:02:14,0
3,3,56aec740f1ef260003e307d6,3.47619,4.0,2.0,4.0,21.0,7.0,3.333333,0.0,14.0,0.888889,0.0,4.0,376.0,239.0,9.0,0.0,1.0,1.0,3.314402,22.864652,20.0386,1.871862,0,0,2017-02-16 00:00:00,1
4,4,56aec740f1ef260003e307d6,3.860825,4.0,1.0,4.0,194.0,1.0,7.5,1.0,29.0,1.0,0.0,4.0,271.0,107.0,10.0,2.0,1.0,0.983607,5.798258,22.603556,12.207508,3.81025,0,0,2017-03-20 00:00:00,1
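An alternative to commenting out names in a long list is to state only what gets dropped and keep the target as the last column. This is a sketch on a toy frame with a handful of the columns above; the to_drop list is illustrative, not the full selection.

```python
import pandas as pd

# Toy frame with a few of the columns, made-up values
toy = pd.DataFrame({
    "employee": [1, 2],
    "numVotes": [4, 72],
    "numVotes_box-cox": [1.66, 4.63],
    "gender": ["Male", "Female"],
    "gender_Male": [1, 0],
    "index": [0.0, 1.0],
    "stillExists": [1, 0],
})

# Columns superseded by a transformed or encoded version, plus the stray 'index'
to_drop = ["numVotes", "gender", "index"]
target = "stillExists"

features = [c for c in toy.columns if c not in to_drop + [target]]
final = toy[features + [target]]
print(list(final.columns))
```

Keeping the drop list short makes it easier to review why each column was removed.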


In [55]:
df_churn_final.to_csv('df_churn_final.csv',sep=',',index=False)