# Customer churn

A telecommunications company is concerned about the number of customers leaving its land-line business for cable competitors, and it needs to understand who is leaving. As analysts at this company, we have to find out who is leaving and why.

# Import required libraries

In [1]:
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import itertools
from sklearn.model_selection import train_test_split
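# jaccard_similarity_score was removed in scikit-learn 0.23; jaccard_score is its replacement
# (note: for binary labels it scores the positive class rather than reporting accuracy)
from sklearn.metrics import jaccard_score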
from sklearn.metrics import log_loss
%matplotlib inline 
import matplotlib.pyplot as plt

In [2]:
ls

 Volume in drive C is OS
 Volume Serial Number is D816-28B0

 Directory of C:\Users\luigi\Data_science_practice\Customer_Churn_Feature_Selection

07/09/2020  12:18 PM    <DIR>          .
07/09/2020  12:18 PM    <DIR>          ..
07/09/2020  12:02 PM    <DIR>          .ipynb_checkpoints
05/20/2020  11:30 AM            36,144 ChurnData.csv
07/09/2020  12:18 PM            14,774 Feature_selection.ipynb
07/09/2020  12:00 PM               141 README.md
               3 File(s)         51,059 bytes
               3 Dir(s)   2,790,731,776 bytes free


In [3]:
# numpy, pandas and matplotlib.pyplot are already imported in the first cell
import seaborn as sns  # data visualization library
import time

# About the dataset
We will use a telecommunications dataset to predict customer churn. This is a historical customer dataset in which each row represents one customer. It is typically less expensive to keep customers than to acquire new ones, so the focus of this analysis is to predict which customers will stay with the company.
The data set also helps us identify which behaviors are associated with customer retention. We can analyze all relevant customer data and develop focused customer retention programs.

The dataset includes information about:

- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

You can find the dataset [here](https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv).


# Load Data From CSV File

In [5]:
churn_df = pd.read_csv("ChurnData.csv")
churn_df.head()

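| | tenure | age | address | income | ed | employ | equip | callcard | wireless | longmon | ... | pager | internet | callwait | confer | ebill | loglong | logtoll | lninc | custcat | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.0 | 33.0 | 7.0 | 136.0 | 5.0 | 5.0 | 0.0 | 1.0 | 1.0 | 4.4 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.482 | 3.033 | 4.913 | 4.0 | 1.0 |
| 1 | 33.0 | 33.0 | 12.0 | 33.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.45 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.246 | 3.24 | 3.497 | 1.0 | 1.0 |
| 2 | 23.0 | 30.0 | 9.0 | 30.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 6.3 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.841 | 3.24 | 3.401 | 3.0 | 0.0 |
| 3 | 38.0 | 35.0 | 5.0 | 76.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 6.05 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.8 | 3.807 | 4.331 | 4.0 | 0.0 |
| 4 | 7.0 | 35.0 | 14.0 | 80.0 | 2.0 | 15.0 | 0.0 | 1.0 | 0.0 | 7.1 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.96 | 3.091 | 4.382 | 3.0 | 0.0 |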


In [6]:
churn_df.columns

Index(['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',
       'callcard', 'wireless', 'longmon', 'tollmon', 'equipmon', 'cardmon',
       'wiremon', 'longten', 'tollten', 'cardten', 'voice', 'pager',
       'internet', 'callwait', 'confer', 'ebill', 'loglong', 'logtoll',
       'lninc', 'custcat', 'churn'],
      dtype='object')

In [7]:
churn_df.isnull().any()

tenure      False
age         False
address     False
income      False
ed          False
employ      False
equip       False
callcard    False
wireless    False
longmon     False
tollmon     False
equipmon    False
cardmon     False
wiremon     False
longten     False
tollten     False
cardten     False
voice       False
pager       False
internet    False
callwait    False
confer      False
ebill       False
loglong     False
logtoll     False
lninc       False
custcat     False
churn       False
dtype: bool

Three things stand out:

1) All the features are numerical, so they can be used directly for feature selection

2) `churn` is our class label

3) No feature contains NaN values
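
The three observations above can also be checked in code rather than by reading the output. The following is a minimal sketch, not part of the original notebook: `demo_df` is a hypothetical stand-in for `churn_df` built from a few columns of the first five rows, `summarize_checks` is an illustrative helper name, and the check that the label is binary (0/1) is an assumption based on the values visible in the `head()` output.

```python
import pandas as pd

# Hypothetical stand-in for churn_df: four columns taken from the head() output.
demo_df = pd.DataFrame({
    "tenure": [11.0, 33.0, 23.0, 38.0, 7.0],
    "age": [33.0, 33.0, 30.0, 35.0, 35.0],
    "income": [136.0, 33.0, 30.0, 76.0, 80.0],
    "churn": [1.0, 1.0, 0.0, 0.0, 0.0],
})


def summarize_checks(df, target="churn"):
    """Return the three sanity checks as a dict of plain booleans."""
    features = df.drop(columns=[target])
    return {
        # 1) every feature column has a numeric dtype
        "all_features_numeric": all(
            pd.api.types.is_numeric_dtype(dtype) for dtype in features.dtypes
        ),
        # 2) the label only takes the values 0 and 1 (assumed binary target)
        "target_is_binary": bool(df[target].isin([0, 1]).all()),
        # 3) no cell in the frame is NaN
        "no_missing_values": not df.isnull().values.any(),
    }


checks = summarize_checks(demo_df)
print(checks)
```

On the real data, the same helper could be called as `summarize_checks(churn_df)` once the CSV has been loaded.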

In [8]:
# Convert target variable to integer for sklearn algorithm
churn_df['churn'] = churn_df['churn'].astype('int')
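
Because `astype('int')` silently truncates any fractional part, it can be worth confirming that the label holds only whole numbers before casting, and then looking at how many rows fall in each class. The sketch below assumes a hypothetical series `labels` standing in for `churn_df['churn']`, filled with the five labels shown in the `head()` output. The reading of 1 as "churned" and 0 as "stayed" is also an assumption, since the source does not state the encoding.

```python
import pandas as pd

# Hypothetical stand-in for churn_df['churn'] (labels from the head() output).
labels = pd.Series([1.0, 1.0, 0.0, 0.0, 0.0], name="churn")

# Confirm that every value is a whole number, so the cast loses no information.
is_whole = bool((labels == labels.round()).all())

# Same conversion as in the cell above.
labels_int = labels.astype("int")

# Rows per class, ordered by label value (assumed: 0 = stayed, 1 = churned).
class_counts = labels_int.value_counts().sort_index()
print(is_whole, class_counts.to_dict())
```

A strongly imbalanced count would be a reason to look beyond plain accuracy when evaluating a classifier later on.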