In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib as mpl
import sklearn
import sys

Print versions

In [2]:
print('The Python version is {}.\n'.format(sys.version))
print('The Numpy version is {}.\n'.format(np.__version__))
print('The Pandas version is {}.\n'.format(pd.__version__))
print('The Matplotlib version is {}.\n'.format(mpl.__version__))
print('The Scikit-Learn version is {}.\n'.format(sklearn.__version__))

The Python version is 3.7.1 (default, Dec 14 2018, 13:28:58) 
[Clang 4.0.1 (tags/RELEASE_401/final)].

The Numpy version is 1.15.4.

The Pandas version is 0.23.4.

The Matplotlib version is 3.0.2.

The Scikit-Learn version is 0.20.1.



Hi Rajesh, I was planning to actually get to the logistic regression modeling, but I got a little sidetracked and wound up writing a fair amount of text on how to encode categorical variables. I think this will be useful for materials, probably in chapters 1 or 3. Just updating the repo so you can see what I've been working on. You may be able to use some of this for the random forest part.

# Learn about the data set

The case study will be churn reduction, using the Telco customer churn dataset from here: https://www.kaggle.com/blastchar/telco-customer-churn/version/1

I think we'll ultimately want to use an edited version of the data set, which might be better hosted directly by us or Packt.

From the source, here is a typo-corrected version of the data dictionary. It would be good to instruct the students to always ask for a data dictionary when they receive a data set like this from a data engineer or database administrator. If they created the data set themselves, it is important that they document it with a data dictionary.

---

1. customerID: Customer ID
2. gender
3. SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
4. Partner: Whether the customer has a partner or not (Yes, No)
5. Dependents: Whether the customer has dependents or not (Yes, No)
6. tenure: Number of months the customer has stayed with the company
7. PhoneService: Whether the customer has a phone service or not (Yes, No)
8. MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
9. InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
10. OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
11. OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
12. DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
13. TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
14. StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
15. StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
16. Contract: The contract term of the customer (Month-to-month, One year, Two year)
17. PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
18. PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
19. MonthlyCharges: The amount charged to the customer monthly
20. TotalCharges: The total amount charged to the customer
21. Churn: Whether the customer churned or not (Yes or No)

---

[Explain the concepts of features and target]
[Explain concept of features versus samples]

Of the 21 columns, customerID is an identifier rather than a feature. That leaves 19 candidate features and one response variable, aka target variable (Churn).
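A minimal sketch of separating features from the target, using a made-up three-row frame that borrows a few column names from the data dictionary. The values and the names `toy_df`, `features`, and `target` are illustrative assumptions. Here the identifier is also dropped, on the assumption that it carries no predictive information.

```python
import pandas as pd

# Tiny made-up frame; the values are illustrative, not taken from the real data
toy_df = pd.DataFrame({
    'customerID': ['0001-AAAAA', '0002-BBBBB', '0003-CCCCC'],
    'tenure': [1, 34, 2],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month'],
    'Churn': ['No', 'No', 'Yes'],
})

# Features: every column except the identifier and the target
features = toy_df.drop(columns=['customerID', 'Churn'])

# Target: the response variable, mapped to 1 (churned) and 0 (did not churn)
target = (toy_df['Churn'] == 'Yes').astype(int)

# Each row is a sample; each remaining column is a feature
print(features.shape)
print(target.tolist())
```

Each row of `features` is one sample (a customer), and each column is one feature.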

From the Description section of the challenge we learn the following:

---

### Context
"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

### Content
Each row represents a customer, each column contains customer’s attributes described on the column Metadata.

The data set includes information about:

- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

### Inspiration
To explore this type of models and learn more about the subject.

---

Other things that would be good to point out: if a data scientist is handed a data set like this, a crucial first step is to understand the data and have confidence in it. This means checking basic things about the data set.

Ask questions like these:

- If there is supposed to be data about a certain number of customers, are there that many rows in the data set? Are there that many unique customer IDs?

- Is there any missing data? If so, what should be done about it?
    - Can you afford to throw out samples with missing data?
    - Features with missing data?
    - If not, explain the concept of imputation and discuss different methods for imputation.

Even if the data scientist has obtained the data directly through a SQL query that they designed, it is important to do these basic data quality checks. If any of the basic checks bring the data into question, the first part of the data science workflow is to go back to the data provider, or to your own SQL query if that's the case, and determine the cause.
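The checks listed above could be sketched roughly as follows. This uses a small made-up frame with one repeated ID and one missing value. The `expected_customers` figure is a hypothetical number standing in for whatever the data provider reports, and median imputation is shown only as one of several possible methods.

```python
import numpy as np
import pandas as pd

# Made-up frame with a repeated customer ID and a missing tenure value
toy_df = pd.DataFrame({
    'customerID': ['0001-AAAAA', '0002-BBBBB', '0002-BBBBB', '0004-DDDDD'],
    'tenure': [1, 34, 34, np.nan],
    'MonthlyCharges': [29.85, 56.95, 56.95, 42.30],
})

expected_customers = 4  # hypothetical: the count reported by the data provider

# Do the row count and the unique ID count match what we were told?
n_rows = len(toy_df)
n_unique_ids = toy_df['customerID'].nunique()
print('Rows: {}, unique IDs: {}, expected: {}'.format(
    n_rows, n_unique_ids, expected_customers))

# Show every row whose customer ID appears more than once
duplicated_rows = toy_df[toy_df['customerID'].duplicated(keep=False)]
print(duplicated_rows)

# Count the missing values in each column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# One simple imputation method: fill missing values with the column median
tenure_imputed = toy_df['tenure'].fillna(toy_df['tenure'].median())
print(tenure_imputed.tolist())
```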

# Reading in a data set and viewing summaries

In [3]:
telco_df = pd.read_csv('../Data/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [4]:
telco_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)

In [5]:
telco_df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [6]:
telco_df.describe()

       SeniorCitizen     tenure  MonthlyCharges
count         7043.0     7043.0          7043.0
mean        0.162147  32.371149       64.761692
std         0.368612  24.559481       30.090047
min              0.0        0.0           18.25
25%              0.0        9.0            35.5
50%              0.0       29.0           70.35
75%              0.0       55.0           89.85
max              1.0       72.0          118.75


## Checking for data quality

Would suggest a script that "dirties" the data: create some missingness, repeated user IDs, etc. Then make it an exercise for the student to find the issues.
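A rough sketch of what such a "dirtying" script might look like. The helper `dirty_data`, its parameters, and the toy frame are all hypothetical and only meant as a starting point for the exercise.

```python
import numpy as np
import pandas as pd


def dirty_data(df, id_col, n_missing=2, n_duplicates=1, seed=0):
    """Return a copy of df with repeated IDs and missing values injected.

    Hypothetical helper for illustration; not part of the case study code.
    """
    rng = np.random.RandomState(seed)

    # Repeat a few randomly chosen rows so that some IDs appear more than once
    repeats = df.sample(n=n_duplicates, random_state=seed)
    dirty = pd.concat([df, repeats], ignore_index=True)

    # Blank out one non-ID cell in each of n_missing distinct rows
    other_cols = [col for col in dirty.columns if col != id_col]
    rows = rng.choice(len(dirty), size=n_missing, replace=False)
    for row in rows:
        col = other_cols[rng.randint(len(other_cols))]
        dirty.loc[row, col] = np.nan

    return dirty


# Made-up clean data; the values are illustrative only
clean_df = pd.DataFrame({
    'customerID': ['0001-AAAAA', '0002-BBBBB', '0003-CCCCC', '0004-DDDDD'],
    'Contract': ['Month-to-month', 'One year', 'Two year', 'One year'],
    'MonthlyCharges': [29.85, 56.95, 53.85, 42.30],
})

dirty_df = dirty_data(clean_df, id_col='customerID')
print(dirty_df)
```

Because the seed is fixed, every student would receive the same dirtied data set, which should make the exercise easier to grade.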

Get the count of each distinct value for each column that appears to be categorical:

In [7]:
telco_df.columns.values

array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'Churn'], dtype=object)

In [8]:
categorical_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
                    'MultipleLines', 'InternetService',
                    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                    'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
                    'PaperlessBilling', 'PaymentMethod']

In [16]:
len(categorical_cols)

16

In [18]:
for column in categorical_cols:
    print(telco_df[column].value_counts())
    print('\n')

Male      3555
Female    3488
Name: gender, dtype: int64


0    5901
1    1142
Name: SeniorCitizen, dtype: int64


No     3641
Yes    3402
Name: Partner, dtype: int64


No     4933
Yes    2110
Name: Dependents, dtype: int64


Yes    6361
No      682
Name: PhoneService, dtype: int64


No                  3390
Yes                 2971
No phone service     682
Name: MultipleLines, dtype: int64


Fiber optic    3096
DSL            2421
No             1526
Name: InternetService, dtype: int64


No                     3498
Yes                    2019
No internet service    1526
Name: OnlineSecurity, dtype: int64


No                     3088
Yes                    2429
No internet service    1526
Name: OnlineBackup, dtype: int64


No                     3095
Yes                    2422
No internet service    1526
Name: DeviceProtection, dtype: int64


No                     3473
Yes                    2044
No internet service    1526
Name: TechSupport, dtype: int64


No                     28

### Convert numerical types

In [10]:
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

In [17]:
len(numerical_cols)

3

In [11]:
pd.to_numeric(telco_df['TotalCharges'])

ValueError: Unable to parse string " " at position 488

This one didn't work. [teach them how to read an error message and try to debug].

For 'TotalCharges', it looks like it should be a float data type. But because of missing values, which look like they got indicated as a space (" "), it has been imported as a string. Convert " " to np.nan then reattempt conversion to numeric type.

In [12]:
space_mask = telco_df['TotalCharges'] == ' '
space_mask.sum()

11

Looks like there are 11 missing values represented as a space. Change these to np.nan.

In [13]:
telco_df.loc[space_mask, ['TotalCharges']] = np.nan

In [14]:
telco_df['TotalCharges'] = telco_df['TotalCharges'].astype(float)
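As an alternative to the mask-and-replace steps above, `pd.to_numeric` accepts `errors='coerce'`, which turns any unparseable string into NaN in a single call. The sketch below uses a made-up three-element Series rather than the real column. Note that this also silently converts any other unexpected strings, so it is probably worth inspecting the offending values first, as was done above.

```python
import pandas as pd

# Made-up stand-in for the 'TotalCharges' column, with one blank entry
toy_charges = pd.Series(['29.85', ' ', '1889.5'])

# Strings that cannot be parsed as numbers become NaN instead of raising an error
converted = pd.to_numeric(toy_charges, errors='coerce')

print(converted)
print(converted.isnull().sum())
```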

Now get a description of the numerical columns:

In [15]:
telco_df[numerical_cols].describe()

          tenure  MonthlyCharges  TotalCharges
count     7043.0          7043.0        7032.0
mean   32.371149       64.761692   2283.300441
std    24.559481       30.090047   2266.771362
min          0.0           18.25          18.8
25%          9.0            35.5        401.45
50%         29.0           70.35      1397.475
75%         55.0           89.85     3794.7375
max         72.0          118.75        8684.8


Have something about looking through these summary statistics to see if they make intuitive sense in light of the data dictionary, and about what can be learned by examining them.

# Encoding categorical variables

Most machine learning algorithms work only with numerical inputs. However, some of our features are text-based, such as 'Contract', which can take on the values 'Month-to-month', 'One year', and 'Two year', as we saw earlier. We will refer to these kinds of features as categorical. Different software packages treat categorical features in different ways. Some will convert them to numbers for you, while others require you to do this yourself. Scikit-Learn falls in the latter category, so we will need to convert our text features to numbers explicitly.

One approach to converting the levels of a categorical variable to numbers is to do so directly. For example, in our feature 'Contract', the levels 'Month-to-month', 'One year', and 'Two year' could be converted to the numbers 1, 2, and 3, respectively. In doing so, we have created what is called an ordinal categorical variable, as the numbers imply an order to the levels. This feature would in fact be treated just like any other numerical feature in a machine learning algorithm.

For the 'Contract' feature, the ordinal approach makes some amount of sense, as the levels of the original variable do have an implied ordering in terms of the commitment a customer has made. However, if we make this conversion and then directly use this feature column in a linear model like logistic regression, we are assuming that the difference in commitment between 'Month-to-month' and 'One year' is the same as the difference between 'One year' and 'Two year'. Would we want to assume that a two-year commitment somehow represents "twice as much commitment" as a one-year commitment, in comparison to month-to-month? While we may or may not wish to assume this, the ordinal approach does allow us to keep the 'Contract' feature confined to a single column, which may help the interpretability of the model and may be advantageous for the performance of non-linear models, such as random forest, which we will explore later.

In [None]:
#Code block demonstrating ordinal encoding
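One way the ordinal encoding described above might look, sketched on a made-up four-row frame. The mapping to 1, 2, and 3 follows the text. The names `toy_contract` and `contract_order` are illustrative, and an explicit dictionary with `Series.map` is just one of several ways to do this.

```python
import pandas as pd

# Made-up sample of the 'Contract' feature
toy_contract = pd.DataFrame({
    'Contract': ['Month-to-month', 'Two year', 'One year', 'Month-to-month'],
})

# Explicit mapping, so the numeric order reflects increasing commitment
contract_order = {'Month-to-month': 1, 'One year': 2, 'Two year': 3}

# Replace each level with its number; unmapped levels would become NaN
toy_contract['Contract_ordinal'] = toy_contract['Contract'].map(contract_order)

print(toy_contract)
```

Writing the mapping out by hand keeps the ordering under our control, rather than relying on alphabetical order.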

An alternative method is to use one hot encoding (OHE), which effectively "spreads out" the categorical variable over several columns. Each of these columns will contain only ones and zeros, and a given sample will have a one in exactly one of the columns, hence the name "one hot encoding". This method avoids the need to assume an ordering, or a constant interval between the levels of a categorical variable. However, if the categorical variables have a large number of levels, this could potentially create a large number of new columns in the data set, effectively "blowing up" the feature space.

In [None]:
#Code block demonstrating OHE
#https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
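A minimal sketch of one hot encoding with `pd.get_dummies`, again on a made-up four-row sample of 'Contract'. The `prefix` and `dtype=int` choices are assumptions made for readability; without `dtype=int`, some pandas versions return boolean columns instead of ones and zeros.

```python
import pandas as pd

# Made-up sample of the 'Contract' feature
toy_contract = pd.DataFrame({
    'Contract': ['Month-to-month', 'Two year', 'One year', 'Month-to-month'],
})

# One new column per level; each row has a single 1 marking its level
contract_ohe = pd.get_dummies(toy_contract['Contract'], prefix='Contract', dtype=int)

print(contract_ohe)
```

The three levels become three columns, and each row sums to one.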

As is often the case in data science and machine learning, the "best" method, ordinal encoding or OHE, depends on the specific problem being solved. There are additional ways to encode categoricals and experimentation with different possible methods is advisable. For this example, since we observe that the numbers of levels within our categorical variables are not too large (2, 3, or 4), we will implement one hot encoding.

In [19]:
#https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

If you have some extra time, check out [this explanation of categorical variables](https://www.coursera.org/lecture/competitive-data-science/categorical-and-ordinal-features-qu1TF). (I think there is a section where we can specify things like this, that they can optionally do if they have extra time).

# Normalizing numerical features

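The regularization part of logistic regression requires that features be on the same scale: the penalty is applied to the magnitudes of the coefficients, and a coefficient's magnitude depends on the units of its feature, so features on very different scales would be penalized unevenly. We will scale all features to be between zero and one. Practically speaking, if features are on the same order of magnitude (power of ten), these methods should work fine (reference quote from Andrew Ng).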

In [None]:
#Use SKLearn minmax scaler for this, probably.
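A sketch of how Scikit-Learn's `MinMaxScaler` could be used for this, assuming its default output range of zero to one. The small array is made up to stand in for two numerical features such as 'tenure' and 'MonthlyCharges'.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for two numerical features (rows are samples)
toy_features = np.array([
    [1.0, 29.85],
    [34.0, 56.95],
    [72.0, 118.75],
])

# Learn each column's minimum and maximum, then rescale to the range [0, 1]
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(toy_features)

print(scaled_features)
```

When we get to model testing, the scaler should probably be fit on the training data only and then applied to the test data with `scaler.transform`, so that no information from the test set leaks into training.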

# Build a quick initial model

Once all data quality questions have been answered...