In [1]:
#MASTER_NOTEBOOK_CLASSIFICATION

# Classification Project!

Why are our customers churning?
Some questions I have include:
*	Could the month in which they signed up influence churn? i.e. if a cohort is identified by tenure, is there a cohort or cohorts who have a higher rate of churn than other cohorts? (Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers))
*	Are there features that indicate a higher propensity to churn? like type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?
*	Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?
*	If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?

## Deliverables:
1.	I will also need a report (ipynb) answering the question, "Why are our customers churning?" I want to see the analysis you did to answer my questions and lead to your findings. Please clearly call out the questions and answers you are analyzing. E.g. If you find that month-to-month customers churn more, I won't be surprised, but I am not getting rid of that plan. The fact that they churn is not because they can, it's because they can and they are motivated to do so. I want some insight into why they are motivated to do so. I realize you will not be able to do a full causal experiment, but I hope to see some solid evidence of your conclusions.
2.	I will need you to deliver to me a csv with the customer_id, probability of churn, and the prediction of churn (1=churn, 0=not_churn). I would also like a single goolgle slide that illustrates how your model works, including the features being used, so that I can deliver this to the SLT when they come with questions about how these values were derived. Please make sure you include how likely your model is to give a high probability of churn when churn doesn't occur, to give a low probability of churn when churn occurs, and to accurately predict churn.
3.	Finally, our development team will need a .py file that will take in a new dataset, (in the exact same form of the one you acquired from telco_churn.customers) and perform all the transformations necessary to run the model you have developed on this new dataset to provide probabilities and predictions.


# Acquisition
Use the mysql connector to query telco_churn.customers. Assign the output of that query to the dataframe df. You want to include all the fields.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [2]:
path = './'
df = pd.read_csv(path + "telco_churn.csv")

Write a function, peekatdata(dataframe), that takes a dataframe as input and computes and returns the following:
    *	creates dataframe object head_df (df of the first 5 rows) and prints contents to screen
    *	creates dataframe object tail_df (df of the last 5 rows) and prints contents to screen
    *	creates tuple object shape_tuple (tuple of (nrows, ncols)) and prints tuple to screen
    *	creates dataframe object describe_df (summary statistics of all numeric variables) and prints contents to screen.
    *	prints to screen the information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.




In [3]:
def peekatdata(df):
    print("\nRows & Columns:\n")
    print(df.shape)
    print("\nColumn Info:\n")
    print(df.info())
    print("\nFirst 5 rows:\n")
    print(df.head())
    print("\nLast 5 rows:\n")
    print(df.tail())
    print("\nMissing Values:\n")
    missing_vals = df.columns[df.isnull().any()]
    print(df.isnull().sum())
    print("\nSummary Stats:\n")
    print(df.describe())

In [4]:
peekatdata(df)


Rows & Columns:

(7043, 27)

Column Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 27 columns):
customer_id                   7043 non-null object
gender                        7043 non-null object
senior_citizen                7043 non-null int64
partner                       7043 non-null object
dependents                    7043 non-null object
tenure                        7043 non-null int64
phone_service                 7043 non-null object
multiple_lines                7043 non-null object
internet_service_type_id      7043 non-null int64
online_security               7043 non-null object
online_backup                 7043 non-null object
device_protection             7043 non-null object
tech_support                  7043 non-null object
streaming_tv                  7043 non-null object
streaming_movies              7043 non-null object
contract_type_id              7043 non-null int64
paperless_billing             7043 no

# Data Prep
# Question 1
1.	Write a function, df_value_counts(dataframe), that takes a dataframe as input and computes and returns the values by frequency for each variable. Use the rule of thumb for your logic on whether or not to use the bins argument. The function will use a for loop and an in statement.
for col in df.columns: n = df[col].unique().shape[0] col_bins = min(n,10) print('%s:' % col) if df[col].dtype in ['int64','float64'] and n > 10: print(df[col].value_counts(bins=col_bins, sort=False)) else: print(df[col].value_counts()) print('\n')

In [5]:
def df_value_counts(df):
    for col in df.columns: 
        n = df[col].unique().shape[0] 
        col_bins = min(n,10) 
        if df[col].dtype in ['int64','float64'] and n > 10:
            print('%s:' % col)
            print(df[col].value_counts(bins=col_bins, sort=False)) 
        else: 
            print(df[col].value_counts()) 
        print('\n')

In [6]:
df_value_counts(df)

8990-YOZLV    1
7740-BTPUX    1
7675-OZCZG    1
4199-QHJNM    1
5702-SKUOB    1
2996-XAUVF    1
9577-WJVCQ    1
3318-NMQXL    1
7130-VTEWQ    1
4959-JOSRX    1
7595-EUIVN    1
8706-HRADD    1
0060-FUALY    1
2483-XSSMZ    1
8621-MNIHH    1
7956-XQWGU    1
3572-UUHRS    1
2896-TBNBE    1
9985-MWVIX    1
7853-WNZSY    1
8450-UYIBU    1
6240-EURKS    1
9778-OGKQZ    1
7009-LGECI    1
6476-YHMGA    1
5178-LMXOP    1
6718-BDGHG    1
2460-NGXBJ    1
9786-IJYDL    1
7315-WYOAW    1
             ..
6661-EIPZC    1
0923-PNFUB    1
7109-MFBYV    1
8242-JSVBO    1
8884-ADFVN    1
1041-RXHRA    1
4926-UMJZD    1
1183-CANVH    1
5627-TVBPP    1
0447-BEMNG    1
9888-ZCUMM    1
6754-LZUKA    1
7382-DFJTU    1
1496-GGSUK    1
2925-MXLSX    1
9251-AWQGT    1
8064-RAVOH    1
7657-DYEPJ    1
3605-JISKB    1
2672-TGEFF    1
0626-QXNGV    1
8735-IJJEG    1
7811-JIVPF    1
7013-PSXHK    1
3999-WRNGR    1
5480-HPRRX    1
8999-EXMNO    1
5423-BHIXO    1
2661-GKBTK    1
6457-GIRWB    1
Name: customer_id, Lengt

# Question 2
2.	Missing Values:
    *	Write a function, that returns a dataframe of the column name and the number of missing values and the percentage of missing values (missing records/total records) for each of the columns that have > 0 missing values.
df.isnull().sum()
    *	Document your takeaways. For each variable:
    *	should you remove the observations with a missing value for that variable?
    *	should you remove the variable altogether?
    *	is missing equivalent to 0 (or some other constant value) in the specific case of this variable?
    *	should you replace the missing values with a value it is most likely to represent (e.g. Are the missing values a result of data integrity issues and should be replaced by the most likely value?)
    *	Handle the missing values in the way you recommended above.

In [7]:
def percent_missing(df):
    missing_table = df.isnull().sum()/df.shape[0]*100
    return missing_table

In [8]:
percent_missing(df)

customer_id                   0.0
gender                        0.0
senior_citizen                0.0
partner                       0.0
dependents                    0.0
tenure                        0.0
phone_service                 0.0
multiple_lines                0.0
internet_service_type_id      0.0
online_security               0.0
online_backup                 0.0
device_protection             0.0
tech_support                  0.0
streaming_tv                  0.0
streaming_movies              0.0
contract_type_id              0.0
paperless_billing             0.0
payment_type_id               0.0
monthly_charges               0.0
total_charges                 0.0
churn                         0.0
contract_type_id.1            0.0
contract_type                 0.0
internet_service_type_id.1    0.0
internet_service_type         0.0
payment_type_id.1             0.0
payment_type                  0.0
dtype: float64

In [9]:
df['total_charges'] = df['total_charges'].convert_objects(convert_numeric=True)
df.total_charges.dropna(0, inplace=True)

df.head()

For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.


Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,payment_type_id,monthly_charges,total_charges,churn,contract_type_id.1,contract_type,internet_service_type_id.1,internet_service_type,payment_type_id.1,payment_type
0,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,2,59.9,542.4,No,1,Month-to-month,1,DSL,2,Mailed check
1,0013-MHZWF,Female,0,No,Yes,9,Yes,No,1,No,...,4,69.4,571.45,No,1,Month-to-month,1,DSL,4,Credit card (automatic)
2,0015-UOCOJ,Female,1,No,No,7,Yes,No,1,Yes,...,1,48.2,340.35,No,1,Month-to-month,1,DSL,1,Electronic check
3,0023-HGHWL,Male,1,No,No,1,No,No phone service,1,No,...,1,25.1,25.1,Yes,1,Month-to-month,1,DSL,1,Electronic check
4,0032-PGELS,Female,0,Yes,Yes,1,No,No phone service,1,Yes,...,3,30.5,30.5,Yes,1,Month-to-month,1,DSL,3,Bank transfer (automatic)


### Total charges was the only variable that had empty strings for 11 values. Those rows were dropped. Every other variable does not have issues with missing data.

# Question 3
3. Compute a new feature, tenure_year, that is a result of translating tenure from months to years.

In [10]:
def make_binary(df):
    df['churn'] == 'Yes'
    (df['churn'] == 'Yes').astype(int)
    df['churn'] = (df['churn'] == 'Yes').astype(int)
    return(df.head())

In [11]:
make_binary(df)

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,payment_type_id,monthly_charges,total_charges,churn,contract_type_id.1,contract_type,internet_service_type_id.1,internet_service_type,payment_type_id.1,payment_type
0,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,2,59.9,542.4,0,1,Month-to-month,1,DSL,2,Mailed check
1,0013-MHZWF,Female,0,No,Yes,9,Yes,No,1,No,...,4,69.4,571.45,0,1,Month-to-month,1,DSL,4,Credit card (automatic)
2,0015-UOCOJ,Female,1,No,No,7,Yes,No,1,Yes,...,1,48.2,340.35,0,1,Month-to-month,1,DSL,1,Electronic check
3,0023-HGHWL,Male,1,No,No,1,No,No phone service,1,No,...,1,25.1,25.1,1,1,Month-to-month,1,DSL,1,Electronic check
4,0032-PGELS,Female,0,Yes,Yes,1,No,No phone service,1,Yes,...,3,30.5,30.5,1,1,Month-to-month,1,DSL,3,Bank transfer (automatic)


# Question 4
4. Compute a new feature, tenure_year, that is a result of translating tenure from months to years.

In [12]:
df = df.assign(tenure_year=df.tenure/12).round(2)

In [13]:
df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,monthly_charges,total_charges,churn,contract_type_id.1,contract_type,internet_service_type_id.1,internet_service_type,payment_type_id.1,payment_type,tenure_year
0,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,59.9,542.4,0,1,Month-to-month,1,DSL,2,Mailed check,0.75
1,0013-MHZWF,Female,0,No,Yes,9,Yes,No,1,No,...,69.4,571.45,0,1,Month-to-month,1,DSL,4,Credit card (automatic),0.75
2,0015-UOCOJ,Female,1,No,No,7,Yes,No,1,Yes,...,48.2,340.35,0,1,Month-to-month,1,DSL,1,Electronic check,0.58
3,0023-HGHWL,Male,1,No,No,1,No,No phone service,1,No,...,25.1,25.1,1,1,Month-to-month,1,DSL,1,Electronic check,0.08
4,0032-PGELS,Female,0,Yes,Yes,1,No,No phone service,1,Yes,...,30.5,30.5,1,1,Month-to-month,1,DSL,3,Bank transfer (automatic),0.08


# Data Prep
# Question 5
5. Figure out a way to capture the information contained in phone_service and multiple_lines into a single variable of dtype int. Write a function that will transform the data and place in a new column named phone_id.

### Phone_service and multiple_lines columns are strings with 'Yes', 'No' or 'No phone service' values. Turn columns into 0 or 1 values. Then sum the two columns to make a new column, phone_id. 

In [14]:
df[['phone_service','multiple_lines']].head()

Unnamed: 0,phone_service,multiple_lines
0,Yes,Yes
1,Yes,No
2,Yes,No
3,No,No phone service
4,No,No phone service


In [15]:
df = df.replace({'phone_service': {'Yes': 1, 'No': 0}})

In [16]:
df[['phone_service']].head()

Unnamed: 0,phone_service
0,1
1,1
2,1
3,0
4,0


In [17]:
df = df.replace({'multiple_lines': {'Yes': 1, 'No': 0, 'No phone service': 0}})

In [18]:
df[['phone_service','multiple_lines']].head(10)

Unnamed: 0,phone_service,multiple_lines
0,1,1
1,1,0
2,1,0
3,0,0
4,0,0
5,1,0
6,0,0
7,1,0
8,1,1
9,1,1


In [19]:
df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,monthly_charges,total_charges,churn,contract_type_id.1,contract_type,internet_service_type_id.1,internet_service_type,payment_type_id.1,payment_type,tenure_year
0,0003-MKNFE,Male,0,No,No,9,1,1,1,No,...,59.9,542.4,0,1,Month-to-month,1,DSL,2,Mailed check,0.75
1,0013-MHZWF,Female,0,No,Yes,9,1,0,1,No,...,69.4,571.45,0,1,Month-to-month,1,DSL,4,Credit card (automatic),0.75
2,0015-UOCOJ,Female,1,No,No,7,1,0,1,Yes,...,48.2,340.35,0,1,Month-to-month,1,DSL,1,Electronic check,0.58
3,0023-HGHWL,Male,1,No,No,1,0,0,1,No,...,25.1,25.1,1,1,Month-to-month,1,DSL,1,Electronic check,0.08
4,0032-PGELS,Female,0,Yes,Yes,1,0,0,1,Yes,...,30.5,30.5,1,1,Month-to-month,1,DSL,3,Bank transfer (automatic),0.08


In [20]:
df['phone_id'] = df['phone_service'].astype(int) + df['multiple_lines'].astype(int)

In [21]:
df['phone_id'].value_counts()

1    3390
2    2971
0     682
Name: phone_id, dtype: int64

In [22]:
df[['phone_id']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 1 columns):
phone_id    7043 non-null int64
dtypes: int64(1)
memory usage: 55.1 KB


## New column, phone_id, created! 
## This column will require a key to understand it: 0 = No phone, 1 = 1 phone line, 2 = multiple phone lines.

# Data Prep
# Question 6
6.	Figure out a way to capture the information contained in dependents and partner into a single variable of dtype int. Transform the data and place in a new column household_type_id in df_sql. Be sure you have documented your function and logic well.

In [23]:
df.columns

Index(['customer_id', 'gender', 'senior_citizen', 'partner', 'dependents',
       'tenure', 'phone_service', 'multiple_lines', 'internet_service_type_id',
       'online_security', 'online_backup', 'device_protection', 'tech_support',
       'streaming_tv', 'streaming_movies', 'contract_type_id',
       'paperless_billing', 'payment_type_id', 'monthly_charges',
       'total_charges', 'churn', 'contract_type_id.1', 'contract_type',
       'internet_service_type_id.1', 'internet_service_type',
       'payment_type_id.1', 'payment_type', 'tenure_year', 'phone_id'],
      dtype='object')

In [24]:
df[['partner','dependents']].head()

Unnamed: 0,partner,dependents
0,No,No
1,No,Yes
2,No,No
3,No,No
4,Yes,Yes


### Partner and dependents columns are strings with 'Yes' or 'No' values. Turn columns into numerical values. 
## 'Yes partner' = 1
## 'No partner' = 0
## 'No dependents' = 0
## 'Yes dependents' = 2

In [25]:
df = df.replace({'partner': {'Yes': 1, 'No': 0}})

In [26]:
df = df.replace({'dependents': {'Yes': 2, 'No': 0}})

In [27]:
df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,total_charges,churn,contract_type_id.1,contract_type,internet_service_type_id.1,internet_service_type,payment_type_id.1,payment_type,tenure_year,phone_id
0,0003-MKNFE,Male,0,0,0,9,1,1,1,No,...,542.4,0,1,Month-to-month,1,DSL,2,Mailed check,0.75,2
1,0013-MHZWF,Female,0,0,2,9,1,0,1,No,...,571.45,0,1,Month-to-month,1,DSL,4,Credit card (automatic),0.75,1
2,0015-UOCOJ,Female,1,0,0,7,1,0,1,Yes,...,340.35,0,1,Month-to-month,1,DSL,1,Electronic check,0.58,1
3,0023-HGHWL,Male,1,0,0,1,0,0,1,No,...,25.1,1,1,Month-to-month,1,DSL,1,Electronic check,0.08,0
4,0032-PGELS,Female,0,1,2,1,0,0,1,Yes,...,30.5,1,1,Month-to-month,1,DSL,3,Bank transfer (automatic),0.08,0


### Sum the two columns and turn this into the new column called household_type_id. This household_type will require a KEY to understand it.

In [28]:
df['household_type_id'] = df['dependents'].astype(int) + df['partner'].astype(int)

In [29]:
df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,churn,contract_type_id.1,contract_type,internet_service_type_id.1,internet_service_type,payment_type_id.1,payment_type,tenure_year,phone_id,household_type_id
0,0003-MKNFE,Male,0,0,0,9,1,1,1,No,...,0,1,Month-to-month,1,DSL,2,Mailed check,0.75,2,0
1,0013-MHZWF,Female,0,0,2,9,1,0,1,No,...,0,1,Month-to-month,1,DSL,4,Credit card (automatic),0.75,1,2
2,0015-UOCOJ,Female,1,0,0,7,1,0,1,Yes,...,0,1,Month-to-month,1,DSL,1,Electronic check,0.58,1,0
3,0023-HGHWL,Male,1,0,0,1,0,0,1,No,...,1,1,Month-to-month,1,DSL,1,Electronic check,0.08,0,0
4,0032-PGELS,Female,0,1,2,1,0,0,1,Yes,...,1,1,Month-to-month,1,DSL,3,Bank transfer (automatic),0.08,0,3


In [30]:
df[['household_type_id']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 1 columns):
household_type_id    7043 non-null int64
dtypes: int64(1)
memory usage: 55.1 KB


## New column, household_type_id created!

## Household_type_id has 4 value types: 
### 0 = no partner & no dependents; 1 = Yes partner & No dependents; 2 = No partner and Yes dependents; 3 = Partner & Dependents

## Data Prep

## Question 7
7. Figure out a way to capture the information contained in streaming_tv and streaming_movies into a single variable of dtype int. Transform the data and place in a new column streaming_services. 


In [31]:
df['streaming_tv'].value_counts()

No                     2810
Yes                    2707
No internet service    1526
Name: streaming_tv, dtype: int64

In [32]:
df['streaming_movies'].value_counts()

No                     2785
Yes                    2732
No internet service    1526
Name: streaming_movies, dtype: int64

### Streaming_tv and streaming_movies columns are strings with 'Yes', 'No' or 'No internet service' values. Turn columns into the following values:
## 'No internet service' = 0 [BOTH COLUMNS]
## 'No streaming tv' = 1
## 'Yes streaming tv' = 2
## 'No streaming movies' = 1
## 'Yes streaming movies' = 3

In [33]:
df[['streaming_tv','streaming_movies']].head()

Unnamed: 0,streaming_tv,streaming_movies
0,No,Yes
1,Yes,Yes
2,No,No
3,No,No
4,No,No


In [34]:
df = df.replace({'streaming_tv': {'Yes': 2, 'No': 1, 'No internet service': 0}})

In [35]:
df[['streaming_tv']].head()

Unnamed: 0,streaming_tv
0,1
1,2
2,1
3,1
4,1


In [36]:
df['streaming_movies'].head()

0    Yes
1    Yes
2     No
3     No
4     No
Name: streaming_movies, dtype: object

In [37]:
df = df.replace({'streaming_movies': {'Yes': 3, 'No': 1, 'No internet service': 0}})

In [38]:
df['streaming_movies'].value_counts()

1    2785
3    2732
0    1526
Name: streaming_movies, dtype: int64

In [39]:
df['streaming_tv'].value_counts()

1    2810
2    2707
0    1526
Name: streaming_tv, dtype: int64

### Then sum the two columns to make a new column, streaming_services This column will require a key to understand it:
## 0 = No internet connection
## 2 = No streaming tv & No streaming movies
## 3 = Yes streaming tv & No streaming movies
## 4 = No streaming tv & Yes streaming movies
## 5 = Yes streaming tv & Yes streaming movies

In [40]:
df['streaming_services'] = df['streaming_tv'].astype(int) + df['streaming_movies'].astype(int)

In [41]:
df['streaming_services'].head()

0    4
1    5
2    2
3    2
4    2
Name: streaming_services, dtype: int64

In [42]:
df[['streaming_services']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 1 columns):
streaming_services    7043 non-null int64
dtypes: int64(1)
memory usage: 55.1 KB


## New column, streaming_services, created!

## Data Prep

## Question 8

8.	Figure out a way to capture the information contained in online_security and online_backup into a single variable of dtype int. Transform the data and place in a new column online_security_backup in df_sql. Be sure you have documented your function and logic well.


In [43]:
df['online_backup'].value_counts()

No                     3088
Yes                    2429
No internet service    1526
Name: online_backup, dtype: int64

In [44]:
df['online_security'].value_counts()

No                     3498
Yes                    2019
No internet service    1526
Name: online_security, dtype: int64

### Online_security and online_backup columns are strings with 'Yes', 'No' or 'No internet service' values. Turn columns into the following values:
## 'No internet service' = 0 [BOTH COLUMNS]
## 'No online security' = 1
## 'Yes online security' = 2
## 'No online backup' = 1
## 'Yes online backup' = 3

In [45]:
df[['online_security','online_backup']].head()

Unnamed: 0,online_security,online_backup
0,No,No
1,No,No
2,Yes,No
3,No,No
4,Yes,No


In [46]:
df = df.replace({'online_security': {'Yes': 2, 'No': 1, 'No internet service': 0}})

In [47]:
df[['online_security']].head()

Unnamed: 0,online_security
0,1
1,1
2,2
3,1
4,2


In [48]:
df['online_backup'].head()

0    No
1    No
2    No
3    No
4    No
Name: online_backup, dtype: object

In [49]:
df = df.replace({'online_backup': {'Yes': 3, 'No': 1, 'No internet service': 0}})

In [50]:
df['online_backup'].value_counts()

1    3088
3    2429
0    1526
Name: online_backup, dtype: int64

In [51]:
df['online_security'].value_counts()

1    3498
2    2019
0    1526
Name: online_security, dtype: int64

### Then sum the two columns to make a new column, online_security_backup This column will require a key to understand it:
## 0 = No internet connection
## 2 = No online security & No online backup
## 3 = Yes online security & No online backup
## 4 = No online security & Yes online backup
## 5 = Yes online security & Yes online backup

In [52]:
df['online_security_backup'] = df['online_security'].astype(int) + df['online_backup'].astype(int)

In [53]:
df['online_security_backup'].head()

0    2
1    2
2    3
3    2
4    3
Name: online_security_backup, dtype: int64

In [54]:
df[['online_security_backup']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 1 columns):
online_security_backup    7043 non-null int64
dtypes: int64(1)
memory usage: 55.1 KB


### New column, online_security_backup, created!

# Question 9
9. Data Split
    *	Split data into train (70%) & test (30%) samples. You should end with 2 data frames: train_df and test_df

In [55]:
train_df, test_df = train_test_split(df, test_size = .30, random_state = 123, stratify = df[['churn']])

train_df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,contract_type,internet_service_type_id.1,internet_service_type,payment_type_id.1,payment_type,tenure_year,phone_id,household_type_id,streaming_services,online_security_backup
5358,6556-DBKZF,Female,0,1,2,71,1,0,2,1,...,Two year,2,Fiber optic,1,Electronic check,5.92,1,3,2,2
559,4537-DKTAL,Female,0,0,0,2,1,0,1,1,...,Month-to-month,1,DSL,1,Electronic check,0.17,1,0,2,2
189,1666-JZPZT,Male,0,0,0,27,1,1,1,1,...,Month-to-month,1,DSL,2,Mailed check,2.25,2,0,2,2
4027,7422-WNBTY,Male,0,1,0,33,1,1,2,1,...,Month-to-month,2,Fiber optic,1,Electronic check,2.75,2,1,4,4
1589,6478-HRRCZ,Male,0,1,0,32,1,0,1,2,...,One year,1,DSL,2,Mailed check,2.67,1,1,4,5


# Question 10
10.	Variable Encoding
    *	write an encoder (fit and transform on train_df) for each non-numeric variable. Use that encoder object to transform on test_df

In [56]:
def encode_data(df):
    for col in df.drop(columns=(['customer_id', 'total_charges', 'monthly_charges'])):
        encoder = LabelEncoder()
        encoder.fit(df[col])
        new_col = col + '_encode'
        df[new_col] = encoder.transform(df[col])
    return df

train_df = encode_data(train_df)
test_df = encode_data(test_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


# Question 11
11.	Numeric Scaling
    *	Fit a min_max_scaler to train_df. Transform monthly_charges and total_charges variables in train_df using the scaler. Then use the scaler object to transform test_df.

In [57]:
scaler = MinMaxScaler()
scaler.fit(train_df[['monthly_charges', 'total_charges']])

train_df[['monthly_charges', 'total_charges']] = scaler.transform(train_df[['monthly_charges', 'total_charges']])
test_df[['monthly_charges', 'total_charges']] = scaler.transform(test_df[['monthly_charges', 'total_charges']])

df_train.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-d

NameError: name 'df_train' is not defined

## Data Exploration
### Deliverable
I will also need a report (ipynb) answering the question, "Why are our customers churning?" I want to see the analysis you did to answer my questions and lead to your findings. Please clearly call out the questions and answers you are analyzing. E.g. If you find that month-to-month customers churn more, I won't be surprised, but I am not getting rid of that plan. The fact that they churn is not because they can, it's because they can and they are motivated to do so. I want some insight intowhy they are motivated to do so. I realize you will not be able to do a full causal experiment, but I hope to see some solid evidence of your conclusions.
1.	Could the month in which they signed up influence churn? i.e. if a cohort is identified by tenure, is there a cohort or cohorts who have a higher rate of churn than other cohorts? (Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers))
2.	Are there features that indicate a higher propensity to churn? like type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?
3.	Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?
4.	If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?
5.	Controlling for services (phone_id, internet_service_type_id, online_security_backup, device_protection, tech_support, and contract_type_id), is the mean monthly_charges of those who have churned significantly different from that of those who have not churned?
import scipy as sp import numpy as np sp.stats.ttest_ind(df.dropna()[train_df['churn']==1]['monthly_charges'], df.dropna()[train_df['churn']==0]['monthly_charges'])
6.	How much of monthly_charges can be explained by internet_service_type? (hint: correlation test). State your hypotheses and your conclusion clearly.
7.	How much of monthly_charges can be explained by internet_service_type + phone service type (0, 1, or multiple lines). State your hypotheses and your conclusion clearly.
8.	Create visualizations exploring the interactions of variables (independent with independent and independent with dependent). The goal is to identify features that are related to churn, identify any data integrity issues, understand 'how the data works', e.g. we may find that all who have online services also have device protection. In that case, we don't need both of those. (The visualizations done in your analysis for questions 1-5 count towards the requirements below)
•	Each independent variable (except for customer_id) must be visualized in at least two plots, and at least 1 of those compares the independent variable with the dependent variable.
•	For each plot where x and y are independent variables, add a third dimension (where possible), of churn represented by color.
•	Use subplots when plotting the same type of chart but with different variables.
•	Adjust the axes as necessary to extract information from the visualizations (adjusting the x & y limits, setting the scale where needed, etc.)
•	Add annotations to at least 5 plots with a key takeaway from that plot.
•	Use plots from matplotlib, pandas and seaborn.
•	Use each of the following:
•	sns.heatmap
•	pd.crosstab (with color)
•	pd.scatter_matrix
•	sns.barplot
•	sns.swarmplot
•	sns.pairplot
•	sns.jointplot
•	sns.relplot or plt.scatter
•	sns.distplot or plt.hist
•	sns.boxplot
•	plt.plot
•	Use at least one more type of plot that is not included in the list above.
9.	What can you say about each variable's relationship to churn, based on your initial exploration? If there appears to be some sort of interaction or correlation, assume there is no causal relationship and brainstorm (and document) ideas on reasons there could be correlation.
•	phone_id
•	internet_service_type_id
•	online_security_backup
•	device_protection
•	tech_support
•	contract_type_id
•	senior_citizen
•	tenure
•	tenure_year
•	monthly_charges
•	total_charges
•	payment_type_id
•	paperless_billing
•	contract_type_id
•	gender
10.	Summarize your conclusions, provide clear answers to the specific questions, and summarize any takeaways/action plan from the work above.