# Classification Project

By Ednalyn C. De Dios & Michael P. Moran

## Project Planning

### Goals
1. Explain what is driving customers to churn

### Deliverables
1. [ ] A report (Jupyter notebook) containing analysis of what is driving customer churn
1. [ ] A CSV with the customer_id, probability of churn, and the prediction of churn (1=churn, 0=not_churn)
1. [ ] A single Google slide that illustrates how your model works, including the features being used
    - Audience: senior leadership team
    - How were the values derived?
    - How likely is the model
        - to give a high probability of churn when churn doesn't occur,
        - to give a low probability of churn when churn occurs, and
        - to accurately predict churn?
1. [ ] A Python script that prepares data such that it can be fed into your model
1. [ ] A README.md file that contains a link to your Google Slides presentation and instructions for how to use your Python script(s)

**Why are our customers churning?**

- Could the month in which they signed up influence churn? i.e. if a cohort is identified by tenure, is there a cohort or cohorts who have a higher rate of churn than other cohorts? (Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers))
- Are there features that indicate a higher propensity to churn? like type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?
- Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?
- If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?

***Deliverables***

1. I will also need a report (ipynb) answering the question, "Why are our customers churning?" I want to see the analysis you did to answer my questions and lead to your findings. Please clearly call out the questions and answers you are analyzing. E.g. If you find that month-to-month customers churn more, I won't be surprised, but I am not getting rid of that plan. The fact that they churn is not because they can, it's because they can and they are motivated to do so. I want some insight into why they are motivated to do so. I realize you will not be able to do a full causal experiment, but I hope to see some solid evidence of your conclusions.
1. I will need you to deliver to me a csv with the customer_id, probability of churn, and the prediction of churn (1=churn, 0=not_churn). I would also like a single Google slide that illustrates how your model works, including the features being used, so that I can deliver this to the SLT when they come with questions about how these values were derived. Please make sure you include how likely your model is to give a high probability of churn when churn doesn't occur, to give a low probability of churn when churn occurs, and to accurately predict churn.
1. Finally, our development team will need a .py file that will take in a new dataset (in the exact same form as the one you acquired from telco_churn.customers) and perform all the transformations necessary to run the model you have developed on this new dataset to provide probabilities and predictions.



### Data Dictionary & Domain Knowledge

### Hypotheses

### Thoughts & Questions

#### Questions
1. What is SLT?
#### From the boss
1. Could the month in which they signed up influence churn? i.e. if a cohort is identified by tenure, is there a cohort or cohorts who have a higher rate of churn than other cohorts? (Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers))
1. Are there features that indicate a higher propensity to churn? like type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?
1. Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?
1. If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?

## Prepare Environment

In [1]:
from env import host, user, password

import numpy as np
import pandas as pd

from sqlalchemy import create_engine

## Acquisition

### Grab Data

1. Use the mysql connector to query telco_churn.customers. Assign the output of that query to the dataframe df. You want to include all the fields.

In [2]:
def get_db_url(
    hostname: str, username: str, password: str, db_name: str
) -> str:
    """
    return url for accessing a mysql database
    """
    return f"mysql+pymysql://{username}:{password}@{hostname}/{db_name}"


def get_sql_conn(hostname: str, username: str, password: str, db_name: str):
    """
    return a mysql connection object
    """
    return create_engine(get_db_url(hostname, username, password, db_name))


def df_from_sql(query: str, url: str) -> pd.DataFrame:
    """
    return a Pandas DataFrame resulting from a sql query
    """
    return pd.read_sql(query, url)


def get_telco_data() -> pd.DataFrame:
    """
    return every column of telco_churn.customers as a Pandas DataFrame
    """
    db = "telco_churn"
    query = "SELECT * FROM customers;"
    url = get_db_url(host, user, password, db)
    return df_from_sql(query, url)

In [3]:
df = get_telco_data()

In [4]:
df.head()

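| | customer_id | gender | senior_citizen | partner | dependents | tenure | phone_service | multiple_lines | internet_service_type_id | online_security | ... | device_protection | tech_support | streaming_tv | streaming_movies | contract_type_id | paperless_billing | payment_type_id | monthly_charges | total_charges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0002-ORFBO | Female | 0 | Yes | Yes | 9 | Yes | No | 1 | No | ... | No | Yes | Yes | No | 2 | Yes | 2 | 65.6 | 593.3 | No |
| 1 | 0003-MKNFE | Male | 0 | No | No | 9 | Yes | Yes | 1 | No | ... | No | No | No | Yes | 1 | No | 2 | 59.9 | 542.4 | No |
| 2 | 0004-TLHLJ | Male | 0 | No | No | 4 | Yes | No | 2 | No | ... | Yes | No | No | No | 1 | Yes | 1 | 73.9 | 280.85 | Yes |
| 3 | 0011-IGKFF | Male | 1 | Yes | No | 13 | Yes | No | 2 | No | ... | Yes | No | Yes | Yes | 1 | Yes | 1 | 98.0 | 1237.85 | Yes |
| 4 | 0013-EXCHZ | Female | 1 | Yes | No | 3 | Yes | No | 2 | No | ... | No | Yes | Yes | No | 1 | Yes | 2 | 83.9 | 267.4 | Yes |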


### Summarize Data

2. Write a function, peekatdata(dataframe), that takes a dataframe as input and computes and returns the following:

    - creates dataframe object head_df (df of the first 5 rows) and prints contents to screen
    - creates dataframe object tail_df (df of the last 5 rows) and prints contents to screen
    - creates tuple object shape_tuple (tuple of (nrows, ncols)) and prints tuple to screen
    - creates dataframe object describe_df (summary statistics of all numeric variables) and prints contents to screen.
    - prints to screen the information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

In [5]:
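def peekatdata(dataframe):
    """
    print the head, tail, shape, summary statistics, and info of dataframe
    """
    head_df = dataframe.head()
    print(f"HEAD\n{head_df}", end="\n\n")

    tail_df = dataframe.tail()
    print(f"TAIL\n{tail_df}", end="\n\n")

    shape_tuple = dataframe.shape
    print(f"SHAPE: {shape_tuple}", end="\n\n")

    # describe() summarizes the numeric columns only
    describe_df = dataframe.describe()
    print(f"DESCRIPTION\n{describe_df}", end="\n\n")

    # info() prints directly to the screen and returns None
    print("INFORMATION")
    dataframe.info()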

In [6]:
peekatdata(df)

HEAD
  customer_id  gender  senior_citizen partner dependents  tenure  \
0  0002-ORFBO  Female               0     Yes        Yes       9   
1  0003-MKNFE    Male               0      No         No       9   
2  0004-TLHLJ    Male               0      No         No       4   
3  0011-IGKFF    Male               1     Yes         No      13   
4  0013-EXCHZ  Female               1     Yes         No       3   

  phone_service multiple_lines  internet_service_type_id online_security  \
0           Yes             No                         1              No   
1           Yes            Yes                         1              No   
2           Yes             No                         2              No   
3           Yes             No                         2              No   
4           Yes             No                         2              No   

   ...  device_protection tech_support streaming_tv streaming_movies  \
0  ...                 No          Yes          Yes      

## Data Prep

### TODO
- [ ] multiple_lines has "No phone service" for the last row, which differs from the Yes/No values the other columns have.
- [ ] Convert the total_charges column to float; it is an object after reading from the SQL database.
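
A minimal sketch of the second TODO item, assuming the non-numeric entries in total_charges are blank or otherwise unparseable strings (the notebook has not confirmed this). The helper name `total_charges_to_float` and the sample rows are hypothetical.

```python
import pandas as pd


def total_charges_to_float(dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    return a copy of dataframe with total_charges converted to float;
    values that cannot be parsed (such as blank strings) become NaN
    """
    out = dataframe.copy()
    out["total_charges"] = pd.to_numeric(out["total_charges"], errors="coerce")
    return out


# hypothetical sample, not rows from telco_churn.customers
sample = pd.DataFrame({"total_charges": ["593.3", " ", "280.85"]})
converted = total_charges_to_float(sample)
print(converted["total_charges"].dtype)
print(converted["total_charges"].isnull().sum())
```

Coercing to NaN rather than raising keeps the bad rows visible to the missing-value step, where they can be dropped or filled deliberately.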

1. Write a function, df_value_counts(dataframe), that takes a dataframe as input and computes and returns the values by frequency for each variable. Use the rule of thumb for your logic on whether or not to use the bins argument. The function will use a for loop and an in statement.

In [7]:
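def df_value_counts(dataframe):
    """
    print the value counts of every column; numeric columns with more
    than 10 unique values are grouped into 10 bins
    """
    for col in dataframe.columns:
        n = dataframe[col].unique().shape[0]
        col_bins = min(n, 10)
        print(f"{col}:")
        if dataframe[col].dtype in ['int64', 'float64'] and n > 10:
            print(dataframe[col].value_counts(bins=col_bins, sort=False))
        else:
            print(dataframe[col].value_counts())
        print("\n")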

In [8]:
df_value_counts(df)

customer_id:
8199-ZLLSA    1
8246-SHFGA    1
5624-RYAMH    1
7825-ECJRF    1
8024-XNAFQ    1
1565-RHDJD    1
4234-XTNEA    1
3727-OWVYD    1
3606-SBKRY    1
7377-DMMRI    1
4123-DVHPH    1
6128-AQBMT    1
0854-UYHZD    1
4695-WJZUE    1
4080-IIARD    1
7143-BQIBA    1
5519-TEEUH    1
8118-LSUEL    1
3481-JHUZH    1
9795-SHUHB    1
6048-UWKAL    1
1574-DYCWE    1
6568-POCUI    1
1482-OXZSY    1
3841-NFECX    1
8065-BVEPF    1
7799-LGRDP    1
9134-CEQMF    1
4929-XIHVW    1
2700-LUEVA    1
             ..
1432-FPAXX    1
1678-FYZOW    1
7359-WWYJV    1
7459-IMVYU    1
2955-BJZHG    1
2410-CIYFZ    1
8258-GSTJK    1
8851-RAGOV    1
9504-DSHWM    1
2004-OCQXK    1
9068-FHQHD    1
3331-HQDTW    1
5650-VDUDS    1
1837-YQUCE    1
6519-ZHPXP    1
3587-PMCOY    1
5046-NUHWD    1
5394-MEITZ    1
4459-BBGHE    1
0936-NQLJU    1
5274-XHAKY    1
5242-UOWHD    1
9251-WNSOD    1
9389-ACWBI    1
4029-HPFVY    1
6507-DTJZV    1
6521-YYTYI    1
4057-FKCZK    1
7682-AZNDK    1
2320-JRSDE    1
Name: custo

- customer_id has no duplicates
- gender is about even
- customers are mostly not seniors
- about equally split along single/partner
- most customers do not have dependents
- there are many new and many old customers
- overwhelming majority have phone service
- closely split along multiple_lines
- overwhelming majority have internet service
    - more have fiber than DSL
    - most do not have online_security
    - most do not have online_backup
    - most do not have device_protection
    - most do not have tech_support
    - about evenly split along streaming_tv
    - about evenly split along streaming_movies
- most customers are month-to-month
- most customers use paperless billing
- most customers pay by some form of check
- many customers pay less than $30 (monthly_charges)
    - most are between 45 and 110
- most have not churned
    - about 1900 have


### Handle Missing Values

2. Missing Values:

    - Write a function that returns a dataframe of the column name, the number of missing values, and the percentage of missing values (missing records/total records) for each column that has > 0 missing values.

    - Document your takeaways. For each variable:

        - should you remove the observations with a missing value for that variable?
        - should you remove the variable altogether?
        - is missing equivalent to 0 (or some other constant value) in the specific case of this variable?
        - should you replace the missing values with a value it is most likely to represent (e.g. Are the missing values a result of data integrity issues and should be replaced by the most likely value?)
        - Handle the missing values in the way you recommended above.

In [9]:
def df_missing_vals(dataframe):
    null_count = dataframe.isnull().sum()
    null_percentage = (null_count / dataframe.shape[0]) * 100
    empty_count = pd.Series((dataframe.get_values() == "").sum())
    return pd.DataFrame({"nmissing": null_count, "percentage": null_percentage, "nempty": empty_count})

# test 
print(df_missing_vals(pd.DataFrame({"col1": [np.nan, 1, "", np.nan, np.nan], "col2": [2, "", 4, np.nan, 4]})))

      nmissing  percentage  nempty
col1       3.0        60.0     NaN
col2       1.0        20.0     NaN
0          NaN         NaN     2.0
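
The stray `0` row in the test output above appears because `empty_count` is a one-element Series indexed by `0` rather than by column name, so it does not align with the other two columns. `DataFrame.get_values()` has also been removed from pandas since version 1.0. A possible per-column variant is sketched below; the name `df_missing_vals_by_col` is hypothetical, and it only counts exact empty strings, not whitespace.

```python
import numpy as np
import pandas as pd


def df_missing_vals_by_col(dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    return, for each column, the number and percentage of NaNs and the
    number of empty strings
    """
    null_count = dataframe.isnull().sum()
    null_percentage = null_count / dataframe.shape[0] * 100
    # comparing the whole frame keeps the column names as the index,
    # so the three Series line up row by row
    empty_count = (dataframe == "").sum()
    return pd.DataFrame(
        {
            "nmissing": null_count,
            "percentage": null_percentage,
            "nempty": empty_count,
        }
    )


# same test frame as in the cell above
check_df = pd.DataFrame(
    {"col1": [np.nan, 1, "", np.nan, np.nan], "col2": [2, "", 4, np.nan, 4]}
)
missing = df_missing_vals_by_col(check_df)
print(missing)
```

Filtering the result with `missing[missing["nmissing"] > 0]` would restrict it to columns that actually have missing values, as the task asks.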


In [10]:
print(df_missing_vals(df))

                          nmissing  percentage  nempty
customer_id                    0.0         0.0     NaN
gender                         0.0         0.0     NaN
senior_citizen                 0.0         0.0     NaN
partner                        0.0         0.0     NaN
dependents                     0.0         0.0     NaN
tenure                         0.0         0.0     NaN
phone_service                  0.0         0.0     NaN
multiple_lines                 0.0         0.0     NaN
internet_service_type_id       0.0         0.0     NaN
online_security                0.0         0.0     NaN
online_backup                  0.0         0.0     NaN
device_protection              0.0         0.0     NaN
tech_support                   0.0         0.0     NaN
streaming_tv                   0.0         0.0     NaN
streaming_movies               0.0         0.0     NaN
contract_type_id               0.0         0.0     NaN
paperless_billing              0.0         0.0     NaN
payment_ty



- Document your takeaways. For each variable:
    - Me: No columns have NaNs
    - Me: do they have any empty strings?
    - should you remove the observations with a missing value for that variable?
    - should you remove the variable altogether?
    - is missing equivalent to 0 (or some other constant value) in the specific case of this variable?
    - should you replace the missing values with a value it is most likely to represent (e.g. Are the missing values a result of data integrity issues and should be replaced by the most likely value?)

Handle the missing values in the way you recommended above.

3. Transform churn such that "yes" = 1 and "no" = 0

4. Compute a new feature, tenure_year, that is a result of translating tenure from months to years.

5. Figure out a way to capture the information contained in phone_service and multiple_lines into a single variable of dtype int. Write a function that will transform the data and place in a new column phone_id in df_sql. Be sure you have documented your function and logic well.

6. Figure out a way to capture the information contained in dependents and partner into a single variable of dtype int. Transform the data and place in a new column household_type_id in df_sql. Be sure you have documented your function and logic well.

7. Figure out a way to capture the information contained in streaming_tv and streaming_movies into a single variable of dtype int. Transform the data and place in a new column streaming_services in df_sql. Be sure you have documented your function and logic well.

8. Figure out a way to capture the information contained in online_security and online_backup into a single variable of dtype int. Transform the data and place in a new column online_security_backup in df_sql. Be sure you have documented your function and logic well.

9. Data Split

    - Split data into train (70%) & test (30%) samples. You should end with 2 data frames: train_df and test_df

10. Variable Encoding

    - Write an encoder (fit and transform on train_df) for each non-numeric variable. Use that encoder object to transform on test_df

11. Numeric Scaling

    - Fit a min_max_scaler to train_df. Transform monthly_charges and total_charges variables in train_df using the scaler. Then use the scaler object to transform test_df.
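
Steps 3 through 8 above could be sketched as follows. This is one possible encoding, not the project's final choice: the integer codes, the helper names `yes_no_to_int` and `add_features`, the fractional tenure_year, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd


def yes_no_to_int(series: pd.Series) -> pd.Series:
    """
    map "Yes" to 1 and every other value to 0, so that "No",
    "No phone service", and "No internet service" all become 0
    """
    return (series == "Yes").astype(int)


def add_features(dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    return a copy of dataframe with churn as 0/1 and the combined
    integer features described in steps 4 through 8
    """
    out = dataframe.copy()
    out["churn"] = yes_no_to_int(out["churn"])
    # assumption: keep fractional years rather than rounding down
    out["tenure_year"] = out["tenure"] / 12
    # phone_id: 0 = no phone service, 1 = one line, 2 = multiple lines
    out["phone_id"] = (
        yes_no_to_int(out["phone_service"]) + yes_no_to_int(out["multiple_lines"])
    )
    # household_type_id: 0 = neither, 1 = partner, 2 = dependents, 3 = both
    out["household_type_id"] = (
        yes_no_to_int(out["partner"]) + 2 * yes_no_to_int(out["dependents"])
    )
    # streaming_services: 0 = neither, 1 = tv, 2 = movies, 3 = both
    out["streaming_services"] = (
        yes_no_to_int(out["streaming_tv"]) + 2 * yes_no_to_int(out["streaming_movies"])
    )
    # online_security_backup: 0 = neither, 1 = security, 2 = backup, 3 = both
    out["online_security_backup"] = (
        yes_no_to_int(out["online_security"]) + 2 * yes_no_to_int(out["online_backup"])
    )
    return out


# hypothetical rows, not taken from telco_churn.customers
sample = pd.DataFrame(
    {
        "churn": ["No", "Yes"],
        "tenure": [9, 24],
        "phone_service": ["Yes", "No"],
        "multiple_lines": ["No", "No phone service"],
        "partner": ["Yes", "No"],
        "dependents": ["Yes", "No"],
        "streaming_tv": ["Yes", "No internet service"],
        "streaming_movies": ["No", "No internet service"],
        "online_security": ["No", "No internet service"],
        "online_backup": ["Yes", "No internet service"],
    }
)
prepared = add_features(sample)
print(prepared[["churn", "tenure_year", "phone_id", "household_type_id"]])
```

Treating every non-"Yes" value as 0 folds "No internet service" into the same code as "No"; if that distinction matters for churn, a separate code would be needed.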

## Data Exploration

### Deliverable

*I will also need a report (ipynb) answering the question, "Why are our customers churning?" I want to see the analysis you did to answer my questions and lead to your findings. Please clearly call out the questions and answers you are analyzing. E.g. If you find that month-to-month customers churn more, I won't be surprised, but I am not getting rid of that plan. The fact that they churn is not because they can, it's because they can and they are motivated to do so. I want some insight into why they are motivated to do so. I realize you will not be able to do a full causal experiment, but I hope to see some solid evidence of your conclusions.*

1. Could the month in which they signed up influence churn? i.e. if a cohort is identified by tenure, is there a cohort or cohorts who have a higher rate of churn than other cohorts? (Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers))

2. Are there features that indicate a higher propensity to churn? like type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?

3. Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?

4. If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?

5. Controlling for services (phone_id, internet_service_type_id, online_security_backup, device_protection, tech_support, and contract_type_id), is the mean monthly_charges of those who have churned significantly different from that of those who have not churned?

6. How much of monthly_charges can be explained by internet_service_type? (hint: correlation test). State your hypotheses and your conclusion clearly.

7. How much of monthly_charges can be explained by internet_service_type + phone service type (0, 1, or multiple lines)? State your hypotheses and your conclusion clearly.

8. Create visualizations exploring the interactions of variables (independent with independent and independent with dependent). The goal is to identify features that are related to churn, identify any data integrity issues, understand 'how the data works', e.g. we may find that all who have online services also have device protection. In that case, we don't need both of those. (The visualizations done in your analysis for questions 1-5 count towards the requirements below)

    - Each independent variable (except for customer_id) must be visualized in at least two plots, and at least 1 of those compares the independent variable with the dependent variable.

    - For each plot where x and y are independent variables, add a third dimension (where possible), of churn represented by color.

    - Use subplots when plotting the same type of chart but with different variables.

    - Adjust the axes as necessary to extract information from the visualizations (adjusting the x & y limits, setting the scale where needed, etc.)

    - Add annotations to at least 5 plots with a key takeaway from that plot.

    - Use plots from matplotlib, pandas and seaborn.

    - Use each of the following:

        - sns.heatmap
        - pd.crosstab (with color)
        - pd.plotting.scatter_matrix (pd.scatter_matrix in older pandas versions)
        - sns.barplot
        - sns.swarmplot
        - sns.pairplot
        - sns.jointplot
        - sns.relplot or plt.scatter
        - sns.distplot (deprecated since seaborn 0.11 in favor of sns.histplot) or plt.hist
        - sns.boxplot
        - plt.plot
        
    - Use at least one more type of plot that is not included in the list above.
    

9. What can you say about each variable's relationship to churn, based on your initial exploration? If there appears to be some sort of interaction or correlation, assume there is no causal relationship and brainstorm (and document) ideas on reasons there could be correlation.

    - phone_id
    - internet_service_type_id
    - online_security_backup
    - device_protection
    - tech_support
    - contract_type_id
    - senior_citizen
    - tenure
    - tenure_year
    - monthly_charges
    - total_charges
    - payment_type_id
    - paperless_billing
    - gender
    
   
10. Summarize your conclusions, provide clear answers to the specific questions, and summarize any takeaways/action plan from the work above.
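
Questions 1 and 4 above both reduce to a churn rate (customers churned/total customers) computed per group. A minimal sketch follows, assuming churn still holds "Yes"/"No" strings and that contract_type_id 1 means month-to-month and 2 means a 1-year contract (that mapping is an assumption to verify against the database). The helper name `churn_rate_by` and the sample rows are hypothetical.

```python
import pandas as pd


def churn_rate_by(dataframe: pd.DataFrame, by: str) -> pd.Series:
    """
    return customers churned / total customers for each value of `by`
    """
    churned = (dataframe["churn"] == "Yes").astype(int)
    # the mean of a 0/1 column within a group is that group's churn rate
    return churned.groupby(dataframe[by]).mean()


# hypothetical rows, not taken from telco_churn.customers
sample = pd.DataFrame(
    {
        "tenure": [1, 1, 1, 1, 13, 13, 24, 24],
        "churn": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
        "contract_type_id": [1, 1, 1, 1, 1, 2, 1, 2],
    }
)

# question 1: churn rate per tenure cohort
by_tenure = churn_rate_by(sample, "tenure")

# question 4: churn rate per contract type, customers past month 12 only
after_12 = sample[sample["tenure"] > 12]
by_contract = churn_rate_by(after_12, "contract_type_id")

print(by_tenure)
print(by_contract)
```

The line chart asked for in question 1 can then be drawn from the first result, for example with `plt.plot(by_tenure.index, by_tenure.values)`.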

## Modeling

1. Feature Selection: Are there any variables that seem to provide limited to no additional information? If so, remove those and assign the new limited dataframe to train_reduced

2. Train (fit, transform, evaluate) a logistic regression model, varying your hyperparameters.

3. Compare evaluation metrics across all the models, and select the best performing model.

4. Test the final model (transform, evaluate) on your out-of-sample data (test_df). Summarize the performance. Interpret your results.
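
The modeling steps above, together with the 70/30 split and min-max scaling from Data Prep, could look roughly like the sketch below. It assumes scikit-learn is available and runs on synthetic stand-in data, since no model has been fit in this notebook yet. The feature list, the candidate values of `C`, and the use of 5-fold cross-validation to compare models are illustrative choices, not the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the prepared dataframe; not real telco data
rng = np.random.default_rng(0)
n_rows = 400
demo_df = pd.DataFrame(
    {
        "monthly_charges": rng.uniform(18, 118, n_rows),
        "tenure": rng.integers(1, 73, n_rows),
    }
)
# demo-only assumption: higher charges and shorter tenure raise churn odds
log_odds = 0.03 * (demo_df["monthly_charges"] - 65) - 0.05 * (demo_df["tenure"] - 36)
demo_df["churn"] = (rng.uniform(size=n_rows) < 1 / (1 + np.exp(-log_odds))).astype(int)

features = ["monthly_charges", "tenure"]
train_df, test_df = train_test_split(
    demo_df, test_size=0.30, random_state=123, stratify=demo_df["churn"]
)

# compare regularization strengths using cross-validation on train_df only
cv_scores = {}
for c in [0.01, 0.1, 1.0, 10.0]:
    candidate = make_pipeline(MinMaxScaler(), LogisticRegression(C=c, max_iter=1000))
    cv_scores[c] = cross_val_score(
        candidate, train_df[features], train_df["churn"], cv=5
    ).mean()

best_c = max(cv_scores, key=cv_scores.get)

# refit the best candidate on all of train_df, then evaluate once on test_df
final_model = make_pipeline(MinMaxScaler(), LogisticRegression(C=best_c, max_iter=1000))
final_model.fit(train_df[features], train_df["churn"])
test_accuracy = final_model.score(test_df[features], test_df["churn"])

print(f"best C: {best_c}")
print(f"test accuracy: {test_accuracy:.3f}")
```

Putting the scaler inside the pipeline means it is fit only on the training rows of each fold and of the final fit, so test_df is transformed but never used for fitting.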

## Delivery

1. I will need you to deliver to me a csv with the customer_id, probability of churn, and the prediction of churn (1=churn, 0=not_churn). I would also like a single google slide that illustrates how your model works, including the features being used, so that I can deliver this to the SLT when they come with questions about how these values were derived. Please make sure you include how likely your model is to give a high probability of churn when churn doesn't occur, to give a low probability of churn when churn occurs, and to accurately predict churn.

1. Finally, our development team will need a .py file that will take in a new dataset (in the exact same form as the one you acquired from telco_churn.customers) and perform all the transformations necessary to run the model you have developed on this new dataset to provide probabilities and predictions.
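
A minimal sketch of the csv deliverable and the three rates requested for the slide. A high probability of churn when churn doesn't occur is a false positive, and a low probability when churn occurs is a false negative. The function names, the output column names, the 0.5 threshold, and the sample ids and probabilities are all hypothetical.

```python
import numpy as np
import pandas as pd


def build_predictions(customer_ids, probabilities, threshold: float = 0.5) -> pd.DataFrame:
    """
    return a dataframe of customer_id, probability of churn, and the
    prediction of churn (1=churn, 0=not_churn)
    """
    out = pd.DataFrame(
        {
            "customer_id": list(customer_ids),
            "probability_of_churn": np.asarray(probabilities, dtype=float),
        }
    )
    out["prediction_of_churn"] = (out["probability_of_churn"] >= threshold).astype(int)
    return out


def error_rates(actual, predicted) -> dict:
    """
    return the false positive rate, false negative rate, and accuracy
    for 0/1 arrays of actual and predicted churn
    """
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    tp = int(((actual == 1) & (predicted == 1)).sum())
    tn = int(((actual == 0) & (predicted == 0)).sum())
    fp = int(((actual == 0) & (predicted == 1)).sum())
    fn = int(((actual == 1) & (predicted == 0)).sum())
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "accuracy": (tp + tn) / len(actual),
    }


# hypothetical ids, probabilities, and outcomes for illustration only
predictions = build_predictions(
    ["0000-AAAAA", "0000-BBBBB", "0000-CCCCC", "0000-DDDDD"],
    [0.9, 0.2, 0.6, 0.4],
)
rates = error_rates([1, 0, 0, 1], predictions["prediction_of_churn"])

# passing a file path as the first argument would write the csv to disk
csv_text = predictions.to_csv(index=False)
print(csv_text)
print(rates)
```

In the real script the probabilities would come from the fitted model (for a scikit-learn classifier, the second column of `predict_proba`), and the rates would be computed on test_df.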