# Classification - Customer Churn Rate

Working with customer churn data
- Uses the Telco Customer Churn dataset
- Source: [Kaggle Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn/data)
- Of the 21 columns in total, the target column is `Churn`

---

# Import Libraries & Load data

In [None]:
# Visual Python: Data Analysis > Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
# Visual Python: Data Analysis > File
df = pd.read_csv('./data/Telco_Customer_Churn.csv')
df

In [None]:
# Visual Python: Library > columns
df.columns

In [None]:
# Target column
target_col = 'Churn'
# Numeric columns
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

# EDA & Data Preprocessing

#### Q. Print the first 5 rows of the data.

In [None]:
# Visual Python: Data Analysis > Data Info
df.head()

#### Q. Check the data type and the number of entries for each column.

In [None]:
# Visual Python: Data Analysis > Data Info
df.info()

#### Q. Drop the `customerID` column.

In [None]:
# Visual Python: Data Analysis > Frame
df.drop(['customerID'], axis=1, inplace=True)
df

#### Q. `TotalCharges` is a numeric column stored as `object`. Change its type to `float`.
- Note: replace blank (' ') values with np.nan before changing the data type.

In [None]:
# Visual Python: Data Analysis > Frame
# Replace blank strings with NaN, then cast the column to float
df['TotalCharges'] = df['TotalCharges'].replace({' ': np.nan})
df = df.astype({'TotalCharges': 'float64'})
df
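
As an alternative to the replace-then-cast approach, `pd.to_numeric` with `errors='coerce'` can do both steps in one call. The sketch below uses a small made-up Series (not the real dataset) to show the behaviour; the variable names `raw` and `converted` are only for illustration.

```python
import pandas as pd

# Toy column mimicking TotalCharges: numeric strings plus one blank entry
raw = pd.Series(['29.85', ' ', '1889.5'], name='TotalCharges')

# errors='coerce' turns every value that cannot be parsed (here ' ') into NaN
converted = pd.to_numeric(raw, errors='coerce')

# Count how many values became missing
n_missing = int(converted.isna().sum())
print(converted.dtype, n_missing)
```

Note that `errors='coerce'` silently converts any unparseable value, not only blanks, so it is worth checking the missing count afterwards.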

#### Q. Check the number of missing values in `TotalCharges`, and if there are any, remove those rows.

In [None]:
# Visual Python: Data Analysis > Data Info
pd.DataFrame({'Null Count': df[['TotalCharges']].isnull().sum(), 'Non-Null Count': df[['TotalCharges']].notnull().sum()})

In [None]:
# Visual Python: Data Analysis > Frame
df.dropna(axis=0, subset=['TotalCharges'], how='all', inplace=True)
df

#### Q. Print summary statistics for the numeric columns.
- Numeric columns: `tenure`, `MonthlyCharges`, `TotalCharges`

In [None]:
# Visual Python: Data Analysis > Data Info
df[['tenure', 'MonthlyCharges', 'TotalCharges']].describe()

#### Q. Check the correlation coefficients between the numeric columns.
- Numeric columns: `tenure`, `MonthlyCharges`, `TotalCharges`

In [None]:
# Visual Python: Data Analysis > Data Info
df[['tenure', 'MonthlyCharges', 'TotalCharges']].corr(numeric_only=True)

#### Q. Use histograms to check the distribution of the numeric columns.
- Numeric columns: `tenure`, `MonthlyCharges`, `TotalCharges`

In [None]:
# Visual Python: Data Analysis > Data Info
df[['tenure', 'MonthlyCharges', 'TotalCharges']].hist()
plt.show()

#### Q. Use box plots to check the distribution and outliers of the numeric columns.
- Numeric columns: `tenure`, `MonthlyCharges`, `TotalCharges`

In [None]:
# Visual Python: Data Analysis > Data Info
df[['tenure','MonthlyCharges','TotalCharges']].plot(kind='box')
plt.show()

In [None]:
# Visual Python: Data Analysis > Data Info
df['tenure'].plot(kind='box')
plt.show()

In [None]:
# Visual Python: Data Analysis > Data Info
df['MonthlyCharges'].plot(kind='box')
plt.show()

In [None]:
# Visual Python: Data Analysis > Data Info
df['TotalCharges'].plot(kind='box')
plt.show()

#### Q. Draw a countplot showing the proportion of Yes and No values for customer churn (`Churn`).

In [None]:
# Visual Python: Visualization > Seaborn
sns.countplot(data=df, x='Churn')
plt.show()

#### Q. Draw countplots of `Contract` and `TechSupport`, grouped by churn status (`Churn`).

In [None]:
# Visual Python: Visualization > Seaborn
sns.countplot(data=df, x='Contract', hue='Churn', order=df['Contract'].value_counts(ascending=False).index)
plt.show()

In [None]:
# Visual Python: Visualization > Seaborn
sns.countplot(data=df, x='TechSupport', hue='Churn')
plt.show()

#### Q. Using `df`, label-encode the `gender`, `Partner`, `Dependents`, `PhoneService`, `PaperlessBilling`, and `Churn` columns, then save the result to `df1`.

In [None]:
# Visual Python: Data Analysis > Frame
df1 = df.copy()
df1['gender_label'] = pd.Categorical(df1['gender']).codes
df1['Partner_label'] = pd.Categorical(df1['Partner']).codes
df1['Dependents_label'] = pd.Categorical(df1['Dependents']).codes
df1['PhoneService_label'] = pd.Categorical(df1['PhoneService']).codes
df1['PaperlessBilling_label'] = pd.Categorical(df1['PaperlessBilling']).codes
df1['Churn_label'] = pd.Categorical(df1['Churn']).codes
df1
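
It can help to know which integer each label receives. When built from string values without an explicit category list, `pd.Categorical` sorts the categories, so `No` comes before `Yes`. The toy Series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy Yes/No column standing in for a column such as Churn
toy = pd.Series(['No', 'Yes', 'No', 'Yes'])

cat = pd.Categorical(toy)

# Categories are sorted, so code 0 is 'No' and code 1 is 'Yes'
mapping = dict(enumerate(cat.categories))
codes = cat.codes.tolist()
print(mapping, codes)
```

With this ordering, `Churn_label == 1` corresponds to customers who churned, which is the positive class in the later evaluation steps.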

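#### Q. Using `df1`, one-hot encode `MultipleLines`, `InternetService`, `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`, `StreamingTV`, `StreamingMovies`, `Contract`, and `PaymentMethod`, then save the result as `df2`.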

In [None]:
# Visual Python: Data Analysis > Frame
df2 = df1.copy()
df2 = pd.get_dummies(data=df2, columns=['MultipleLines'])
df2 = pd.get_dummies(data=df2, columns=['InternetService'])
df2 = pd.get_dummies(data=df2, columns=['OnlineSecurity'])
df2 = pd.get_dummies(data=df2, columns=['OnlineBackup'])
df2 = pd.get_dummies(data=df2, columns=['DeviceProtection'])
df2 = pd.get_dummies(data=df2, columns=['TechSupport'])
df2 = pd.get_dummies(data=df2, columns=['StreamingTV'])
df2 = pd.get_dummies(data=df2, columns=['StreamingMovies'])
df2 = pd.get_dummies(data=df2, columns=['Contract'])
df2 = pd.get_dummies(data=df2, columns=['PaymentMethod'])
df2
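
The separate `get_dummies` calls can also be written as a single call, because `columns` accepts a list. A minimal sketch on a made-up three-row frame (the values are illustrative, not real rows):

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'Two year', 'One year'],
    'TechSupport': ['No', 'Yes', 'No'],
    'tenure': [1, 24, 12],
})

# Encode both categorical columns in one call; 'tenure' is left untouched
encoded = pd.get_dummies(data=toy, columns=['Contract', 'TechSupport'])
print(encoded.columns.tolist())
```

Each new column is named `<original column>_<category value>`, which is where names such as `Contract_One year` in the later subset step come from.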

#### Q. Select only the variables to be used from `df2` and save the subset as `df3`.
- Exclude only the columns that are still categorical.

In [None]:
# Visual Python: Data Analysis > Subset
df3 = df2.loc[:, ['SeniorCitizen','tenure','MonthlyCharges','TotalCharges','gender_label','Partner_label','Dependents_label','PhoneService_label','PaperlessBilling_label','Churn_label','MultipleLines_No','MultipleLines_No phone service','MultipleLines_Yes','InternetService_DSL','InternetService_Fiber optic','InternetService_No','OnlineSecurity_No','OnlineSecurity_No internet service','OnlineSecurity_Yes','OnlineBackup_No','OnlineBackup_No internet service','OnlineBackup_Yes','DeviceProtection_No','DeviceProtection_No internet service','DeviceProtection_Yes','TechSupport_No','TechSupport_No internet service','TechSupport_Yes','StreamingTV_No','StreamingTV_No internet service','StreamingTV_Yes','StreamingMovies_No','StreamingMovies_No internet service','StreamingMovies_Yes','Contract_Month-to-month','Contract_One year','Contract_Two year','PaymentMethod_Bank transfer (automatic)','PaymentMethod_Credit card (automatic)','PaymentMethod_Electronic check','PaymentMethod_Mailed check']]
df3

#### Q. Using `df3`, split the dataset into train and test sets.

In [None]:
# Visual Python: Machine Learning > Data Split
from sklearn.model_selection import train_test_split

# Features: every column except the encoded target; target: Churn_label
X = df3.drop(columns=['Churn_label'])
y = df3['Churn_label']

X_train, X_test, y_train, y_test = train_test_split(X, y)
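
The split above relies on the defaults of `train_test_split`, so it produces a different partition on every run. For an imbalanced target such as churn, one option is to fix `random_state` and pass `stratify` so both sets keep the same class ratio. This is a sketch on synthetic arrays; the `test_size` and `random_state` values are arbitrary choices, not settings from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 25% of them in the positive class
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y preserves the 75/25 class ratio in both splits;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(y_tr), int(y_tr.sum()), len(y_te), int(y_te.sum()))
```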

# Classification with the RandomForestClassifier Model

#### Q. Create a RandomForestClassifier model, train it with fit, and store the predictions in `pred`.

In [None]:
# Visual Python: Machine Learning > Classifier
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()

In [None]:
# Visual Python: Machine Learning > Fit/Predict
model_rf.fit(X_train, y_train)

In [None]:
# Visual Python: Machine Learning > Fit/Predict
pred = model_rf.predict(X_test)
pred

#### Q. Evaluate the predictions in `pred` and check the accuracy and f1-score.
- Visual Python: Machine Learning > Evaluation

In [None]:
# Visual Python: Machine Learning > Evaluation
from sklearn import metrics

In [None]:
# Visual Python: Machine Learning > Evaluation
from IPython.display import display, Markdown

In [None]:
# Visual Python: Machine Learning > Evaluation
# Confusion Matrix
display(Markdown('### Confusion Matrix'))
display(pd.crosstab(y_test, pred, margins=True))

In [None]:
# Visual Python: Machine Learning > Evaluation
# Classification report
print(metrics.classification_report(y_test, pred))
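
The classification report already contains accuracy and f1-score. If only those two numbers are needed, they can also be computed directly with `accuracy_score` and `f1_score`. The labels below are made up so the arithmetic is easy to follow: 2 true positives, 1 false positive, 1 false negative, and 2 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels, not model output
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Accuracy: correct predictions / all predictions = 4 / 6
acc = accuracy_score(y_true, y_pred)

# F1 for the positive class: precision = 2/3, recall = 2/3, so F1 = 2/3
f1 = f1_score(y_true, y_pred)
print(round(acc, 3), round(f1, 3))
```

In the notebook itself the same calls would take `y_test` and `pred` as arguments.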

In [None]:
# Visual Python: Machine Learning > Model Info
from sklearn import metrics

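# ROC curve: false positive rate on the x-axis, true positive rate on the y-axis
fpr, tpr, thresholds = metrics.roc_curve(y_test, model_rf.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr, label='ROC Curve')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.show()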
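
A ROC curve is often summarised by the area under it (AUC). A minimal sketch with `roc_auc_score` on four made-up scores: three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# AUC is the fraction of (positive, negative) pairs where the positive scores higher
auc = roc_auc_score(y_true, scores)
print(auc)
```

For the model in this notebook, the equivalent inputs would be `y_test` and `model_rf.predict_proba(X_test)[:, 1]`.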

In [None]:
# Visual Python: Machine Learning > Model Info
def vp_create_feature_importances(model, X_train=None, sort=False):
    if isinstance(X_train, pd.core.frame.DataFrame):
        feature_names = X_train.columns
    else:
        feature_names = [ 'X{}'.format(i) for i in range(len(model.feature_importances_)) ]
                        
    df_i = pd.DataFrame(model.feature_importances_, index=feature_names, columns=['Feature_importance'])
    df_i['Percentage'] = 100 * df_i['Feature_importance']
    if sort: df_i.sort_values(by='Feature_importance', ascending=False, inplace=True)
    df_i = df_i.round(2)
                        
    return df_i
def vp_plot_feature_importances(model, X_train=None, sort=False, top_count=0):
    df_i = vp_create_feature_importances(model, X_train, sort)
                        
    if sort: 
        if top_count > 0:
            df_i['Percentage'].sort_values().tail(top_count).plot(kind='barh')
        else:
            df_i['Percentage'].sort_values().plot(kind='barh')
    else: 
        df_i['Percentage'].plot(kind='barh')
    plt.xlabel('Feature importance Percentage')
    plt.ylabel('Features')
                        
    plt.show()

In [None]:
# Visual Python: Machine Learning > Model Info
vp_plot_feature_importances(model_rf, X_train, sort=True, top_count=10)

---
