In [1]:
import pandas as pd
import numpy as np

import acquire
import env


### Round 1: importing the 'customers' table from the telco_churn data, just to get an idea of what columns we're looking at

**Later rounds are outlined further on, as this investigation should lead to the features we'll need most for our MVP.**


#### First Hypotheses for customer table:

#### 1.) 

- $H_0$ = contract type has no effect on churn rate
- $H_a$ = contract type DOES have an effect on churn rate

#### 2.) 

- $H_0$ = month-to-month tenure does NOT have an effect on churn rate
- $H_a$ = month-to-month tenure DOES have an effect on churn rate

#### 3.) 

- $H_0$ = contract length of one-year does NOT have an effect on churn rate
- $H_a$ = contract length of one-year DOES have an effect on churn rate

#### 4.)

- $H_0$ = contract length of two years does NOT affect churn rate
- $H_a$ = contract length of two years DOES affect churn rate
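
One way hypotheses like these could be tested is a chi-square test of independence between contract type and churn. The sketch below is illustrative only: the `contract_type` column is hypothetical (it is not in this first pull), the rows are made-up toy data, and the 0.05 significance level is an assumption.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data for illustration only; `contract_type` is a hypothetical column
toy = pd.DataFrame({
    "contract_type": ["Month-to-month"] * 6 + ["One year"] * 4 + ["Two year"] * 4,
    "churn": ["Yes", "Yes", "Yes", "Yes", "No", "No",
              "Yes", "No", "No", "No",
              "No", "No", "No", "No"],
})

# Contingency table of observed counts: contract type (rows) vs. churn (columns)
observed = pd.crosstab(toy["contract_type"], toy["churn"])

# Chi-square test of independence
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(observed)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}, reject H0: {reject_null}")
```

With so few toy rows the expected counts are too small for the test to be trustworthy; the point is only the shape of the workflow (crosstab, then test, then compare the p-value to alpha).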

In [2]:
df = acquire.get_telco_data()
df.head()

Unnamed: 0,customer_id,tenure,phone_service,multiple_lines,internet_service_type_id,streaming_tv,streaming_movies,monthly_charges,paperless_billing,payment_type_id,total_charges,churn
0,0002-ORFBO,9,Yes,No,1,Yes,No,65.6,Yes,2,593.3,No
1,0003-MKNFE,9,Yes,Yes,1,No,Yes,59.9,No,2,542.4,No
2,0004-TLHLJ,4,Yes,No,2,No,No,73.9,Yes,1,280.85,Yes
3,0011-IGKFF,13,Yes,No,2,Yes,Yes,98.0,Yes,1,1237.85,Yes
4,0013-EXCHZ,3,Yes,No,2,Yes,No,83.9,Yes,2,267.4,Yes


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 12 columns):
customer_id                 7043 non-null object
tenure                      7043 non-null int64
phone_service               7043 non-null object
multiple_lines              7043 non-null object
internet_service_type_id    7043 non-null int64
streaming_tv                7043 non-null object
streaming_movies            7043 non-null object
monthly_charges             7043 non-null float64
paperless_billing           7043 non-null object
payment_type_id             7043 non-null int64
total_charges               7043 non-null object
churn                       7043 non-null object
dtypes: float64(1), int64(3), object(8)
memory usage: 660.4+ KB


#### No nulls in any of the chosen features.  Sweet.  Lots of encoding ahead, though, because 8 of the 12 columns are object dtype (including total_charges, which ought to be numeric).  Also, not sure (yet) if all customer_ids are unique.

**Also, just from this first pull, the categoricals are:**

- customer_id 
- phone_service ('Yes / No')
- multiple_lines ('Yes / No')
- internet_service_type_id (1, 2, or 3)
- streaming_tv ('Yes / No') 
- streaming_movies ('Yes / No')
- paperless_billing ('Yes / No') 
- payment_type_id
- and churn ('Yes / No')
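
As a rough sketch of the encoding these 'Yes / No' columns will need, a dictionary `map` can turn them into 1 / 0. The toy frame below only stands in for `df`, and it assumes the columns contain nothing but 'Yes' and 'No'; any other value would come out as NaN.

```python
import pandas as pd

# Toy stand-in for df, using three of the Yes/No columns listed above
toy = pd.DataFrame({
    "phone_service": ["Yes", "No", "Yes"],
    "paperless_billing": ["No", "Yes", "Yes"],
    "churn": ["No", "No", "Yes"],
})

yes_no_cols = ["phone_service", "paperless_billing", "churn"]

# Work on a copy so the raw pull stays untouched
encoded = toy.copy()
for col in yes_no_cols:
    # Values other than 'Yes' / 'No' would become NaN here
    encoded[col] = encoded[col].map({"Yes": 1, "No": 0})

print(encoded)
```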

**Leaving the continuous / numerical columns:**

- tenure (months)
- monthly_charges (monetary)
- total_charges (monetary; currently stored as object, so it needs converting to a number)
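
Since `df.info()` reports total_charges as object rather than float64, it would need converting before any numeric work. A minimal sketch on toy values, assuming some entries might be blank or otherwise non-numeric (that has not been checked on the real data yet):

```python
import pandas as pd

# Toy stand-in: numeric strings plus one blank entry
toy = pd.DataFrame({"total_charges": ["593.3", " ", "280.85"]})

# errors="coerce" turns anything unparseable into NaN instead of raising
toy["total_charges"] = pd.to_numeric(toy["total_charges"], errors="coerce")

# Count how many values failed to convert, to decide how to handle them
n_missing = toy["total_charges"].isna().sum()
print(toy.dtypes)
print(f"values that failed to convert: {n_missing}")
```

Counting the NaNs right after the conversion shows how many rows were affected before deciding whether to drop or fill them.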

In [5]:
df.shape

(7043, 12)

In [6]:
df.describe()

Unnamed: 0,tenure,internet_service_type_id,monthly_charges,payment_type_id
count,7043.0,7043.0,7043.0,7043.0
mean,32.371149,1.872923,64.761692,2.315633
std,24.559481,0.737796,30.090047,1.148907
min,0.0,1.0,18.25,1.0
25%,9.0,1.0,35.5,1.0
50%,29.0,2.0,70.35,2.0
75%,55.0,2.0,89.85,3.0
max,72.0,3.0,118.75,4.0


In [7]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
tenure,7043.0,32.371149,24.559481,0.0,9.0,29.0,55.0,72.0
internet_service_type_id,7043.0,1.872923,0.737796,1.0,1.0,2.0,2.0,3.0
monthly_charges,7043.0,64.761692,30.090047,18.25,35.5,70.35,89.85,118.75
payment_type_id,7043.0,2.315633,1.148907,1.0,1.0,2.0,3.0,4.0


### Making sure each customer_id is unique

- using the format 'df["column_name"].nunique()'

In [12]:
df["customer_id"].nunique()

7043

#### Checks out.  All customer_ids are unique.  Time to take a look at the continuous values to check for outliers.
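
Alongside the plots, one common numeric check for outliers is the 1.5 × IQR rule. The helper below, `iqr_outliers`, is a hypothetical name used only for this sketch, and the values are toy data rather than the real columns.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy values for illustration; 500.0 is planted as an obvious outlier
toy = pd.Series([18.25, 35.5, 70.35, 89.85, 118.75, 500.0], name="monthly_charges")

outliers = iqr_outliers(toy)
print(outliers)
```

On the real data this would be called as `iqr_outliers(df.tenure)` or `iqr_outliers(df.monthly_charges)`.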

In [14]:
# Before I see things, I gotta get the viz libraries

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
# plotting the subplots of the numericals

_, ax = plt.subplots(nrows=2, ncols=2, figsize=(14, 8))

plt.subplot(221)
plt.hist(df.tenure)
plt.title("Tenure")

plt.subplot(222)
plt.hist(df.monthly_charges)
plt.title("Monthly Charges")

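plt.subplot(223)
# total_charges is an object column, so convert it to numeric before plotting
plt.hist(pd.to_numeric(df.total_charges, errors="coerce").dropna())
plt.title("Total Charges")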

Text(0.5, 1.0, 'Total Charges')