# Summary/Notes

### When you have finished your exploration, return to this markdown cell and fill it out in preparation for your meeting with the analytics manager, Victor.

**1.** Provide a summary of the data. This can include information such as: 
- How many total customers does COM have? 
- What types of data are in these two data sources?
- Is the data generally clean? Messy? etc.

**2. Special request** What is the average customer tenure in days? Use 9/1/2020 as the current date ("today"). 



**3.** After researching the data, what 2-3 topics would you like to research further regarding customers who churn? (e.g., Do any of our service types cause customers to churn faster than others?) Why do you think these factors may be related to churn? 




----------------------

# How to complete this notebook

This notebook has a skeleton structure to guide your exploration and keep you on track. More details about each task can be found in the project sidebar. Be sure to read the sidebar instructions for each step before writing your code. 

# 1. IMPORT & EXPLORE THE DATA

## 1A. Import packages

In [49]:
import numpy as np
import pandas as pd

## 1B. Import the data
The datasets are stored in the following files:
- "demographics.csv"
- "services.csv"

These files are in the same folder you are currently working in. 

In [7]:
demographics_df = pd.read_csv("demographics.csv")
services_df = pd.read_csv("services.csv")

## 1C. Explore your data & identify structure
Add as many code cells as you need to thoroughly explore both DataFrames.

In [8]:
demographics_df.head()

Unnamed: 0,Customer_ID,Count,Gender,AGE,Under 30,Senior Citizen,MARRIED,Dependents,Number of Dependents
0,TCO-8779-QRDMV,1,Male,78,No,Yes,No,No,0
1,TCO-7495-OOKFY,1,Female,74,No,Yes,Yes,Yes,1
2,TCO-1658-BYGOY,1,Male,71,No,Yes,No,Yes,3
3,TCO-4598-XLKNJ,1,Female,78,No,Yes,Yes,Yes,1
4,TCO-4846-WHAFZ,1,Female,80,No,Yes,Yes,Yes,1


In [9]:
demographics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 9 columns):
Customer_ID             7043 non-null object
Count                   7043 non-null int64
Gender                  7043 non-null object
AGE                     7043 non-null int64
Under 30                7043 non-null object
Senior Citizen          7043 non-null object
MARRIED                 7043 non-null object
Dependents              7043 non-null object
Number of Dependents    7043 non-null object
dtypes: int64(2), object(7)
memory usage: 495.3+ KB


In [10]:
services_df.head()

Unnamed: 0,Customer_ID,Count,Quarter,Number_of_Referrals,Customer_Enrollement,Offer,Phone_Service,Internet_Service,Internet_Type,Avg_Monthly_GB_Download,...,Streaming_Music,Unlimited_Data,Contract,Payment_Method,Monthly_Charge,Total_Charges,Total_Refunds,Total_Extra_Data_Charges,Total_Long_Distance_Charges,Total_Revenue
0,8779QRDMV,1,Q3,0,8/1/2020,,No,Yes,DSL,8.0,...,No,No,Month-to-Month,Bank Withdrawal,39.65,39.65,0.0,20,0.0,59.65
1,7495OOKFY,1,Q3,1,1/1/2020,Offer E,Yes,Yes,Fiber Optic,17.0,...,No,Yes,Month-to-Month,Credit Card,80.65,633.3,0.0,0,390.8,1024.1
2,1658BYGOY,1,Q3,0,4/1/2019,Offer D,Yes,Yes,Fiber Optic,52.0,...,Yes,Yes,Month-to-Month,Bank Withdrawal,95.45,1752.55,45.61,0,203.94,1910.88
3,4598XLKNJ,1,Q3,1,8/1/2018,Offer C,Yes,Yes,Fiber Optic,12.0,...,No,Yes,Month-to-Month,Bank Withdrawal,98.5,2514.5,13.43,0,494.0,2995.07
4,4846WHAFZ,1,Q3,1,8/1/2017,Offer C,Yes,Yes,Fiber Optic,14.0,...,No,Yes,Month-to-Month,Bank Withdrawal,76.5,2868.15,0.0,0,234.21,3102.36


In [11]:
services_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 22 columns):
Customer_ID                    7043 non-null object
Count                          7043 non-null int64
Quarter                        7043 non-null object
Number_of_Referrals            7043 non-null int64
Customer_Enrollement           7043 non-null object
Offer                          3166 non-null object
Phone_Service                  7043 non-null object
Internet_Service               7043 non-null object
Internet_Type                  5517 non-null object
Avg_Monthly_GB_Download        5517 non-null float64
Streaming_TV                   7043 non-null object
Streaming_Movies               7043 non-null object
Streaming_Music                7043 non-null object
Unlimited_Data                 7043 non-null object
Contract                       7043 non-null object
Payment_Method                 7043 non-null object
Monthly_Charge                 7043 non-null float64
Total_Cha

# 2. DEMOGRAPHICS DATASET WRANGLING


## <p style="color:red;">2A. Standardize column titles</p>
**The code cell below is graded. Do not delete the cell.**

In [15]:
# WRITE YOUR SOLUTION HERE. DO NOT DELETE. THIS CELL IS GRADED.
services_df.columns = [c.replace(' ', '_').lower() for c in services_df.columns]

demographics_df.columns = [c.replace(' ', '_').lower() for c in demographics_df.columns]


## 2B. Edit Data Types

In [23]:
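# Use bracket notation with the column's actual (misspelled) name. Attribute-style
# assignment to a name that is not an existing column only sets a Python attribute
# on the DataFrame and leaves the column itself unchanged.
services_df["customer_enrollement"] = pd.to_datetime(services_df["customer_enrollement"])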


In [24]:
services_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 22 columns):
customer_id                    7043 non-null object
count                          7043 non-null int64
quarter                        7043 non-null object
number_of_referrals            7043 non-null int64
customer_enrollement           7043 non-null datetime64[ns]
offer                          3166 non-null object
phone_service                  7043 non-null object
internet_service               7043 non-null object
internet_type                  5517 non-null object
avg_monthly_gb_download        5517 non-null float64
streaming_tv                   7043 non-null object
streaming_movies               7043 non-null object
streaming_music                7043 non-null object
unlimited_data                 7043 non-null object
contract                       7043 non-null object
payment_method                 7043 non-null object
monthly_charge                 7043 non-null float64
t

In [None]:
demographics_df.number_of_dependents.value_counts()
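
Besides scanning `value_counts()`, one way to pinpoint the entries that keep `number_of_dependents` from being numeric is to coerce the column with `pd.to_numeric` and look at what fails to parse. A minimal sketch on a toy Series; the values below are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for demographics_df.number_of_dependents (hypothetical values).
dependents = pd.Series(["0", "1", "3", "O", "2"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_number = pd.to_numeric(dependents, errors="coerce")

# Keep the original text of the entries that failed to parse.
bad_entries = dependents[as_number.isna()]
print(bad_entries)
```

Here the letter `O` (typed in place of a zero) is the only entry that fails to parse.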

In [25]:
demographics_df = demographics_df.loc[demographics_df.number_of_dependents != 'O']

In [27]:
demographics_df.number_of_dependents = demographics_df.number_of_dependents.astype('int')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
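
The warning above typically appears when a column is assigned on a DataFrame that was itself produced by filtering another DataFrame. A common way to avoid it is to take an explicit `.copy()` at the filtering step. A minimal sketch on toy data (the frame and its values are hypothetical):

```python
import pandas as pd

# Toy frame with one bad entry: the letter "O" instead of a zero.
toy_df = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "number_of_dependents": ["0", "O", "2"],
})

# .copy() makes the filtered result an independent DataFrame, so later
# column assignments are unambiguous and do not trigger the warning.
clean_df = toy_df.loc[toy_df["number_of_dependents"] != "O"].copy()
clean_df["number_of_dependents"] = clean_df["number_of_dependents"].astype(int)

print(clean_df.dtypes)
```

The original `toy_df` is left untouched, which makes it easy to compare row counts before and after the cleanup.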


In [28]:
demographics_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7042 entries, 0 to 7042
Data columns (total 9 columns):
customer_id             7042 non-null object
count                   7042 non-null int64
gender                  7042 non-null object
age                     7042 non-null int64
under_30                7042 non-null object
senior_citizen          7042 non-null object
married                 7042 non-null object
dependents              7042 non-null object
number_of_dependents    7042 non-null int64
dtypes: int64(3), object(6)
memory usage: 550.2+ KB


## 2C. Locate & fix input errors

In [33]:
for c in ['gender', 'under_30', 'married', 'dependents', 'senior_citizen']:
    print('----')
    print(c)
    print(demographics_df[c].value_counts())

----
gender
Male      3554
Female    3488
Name: gender, dtype: int64
----
under_30
No     5641
Yes    1401
Name: under_30, dtype: int64
----
married
No     3623
Yes    3377
Y        25
N        17
Name: married, dtype: int64
----
dependents
No     5415
Yes    1627
Name: dependents, dtype: int64
----
senior_citizen
No     5901
Yes    1141
Name: senior_citizen, dtype: int64


In [40]:
# WRITE YOUR SOLUTION HERE. DO NOT DELETE. THIS CELL IS GRADED. ----------THIS MAY NOT BE GRADED---------
demographics_df["married"] = demographics_df.married.replace('N', 'No')
demographics_df["married"] = demographics_df.married.replace('Y', 'Yes')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [41]:
demographics_df.married.value_counts()

No     3640
Yes    3402
Name: married, dtype: int64

# 3. SERVICES DATASET WRANGLING

## 3A. Standardize column titles

## 3B. Remove unnecessary columns
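
One possible approach, assuming "unnecessary" means columns that carry no information because every row holds the same value. In the `head()` output above, `count` and `quarter` look like candidates, although only five rows are visible there, so this should be checked on the full DataFrame. A sketch on toy data built from those rows:

```python
import pandas as pd

# Toy subset of services_df, using values shown in the head() output.
toy_services = pd.DataFrame({
    "customer_id": ["8779QRDMV", "7495OOKFY", "1658BYGOY"],
    "count": [1, 1, 1],
    "quarter": ["Q3", "Q3", "Q3"],
    "monthly_charge": [39.65, 80.65, 95.45],
})

# nunique() counts the distinct non-null values in each column.
n_unique = toy_services.nunique()

# Columns with at most one distinct value cannot differentiate customers.
constant_cols = n_unique[n_unique <= 1].index.tolist()

trimmed = toy_services.drop(columns=constant_cols)
print(constant_cols)
print(trimmed.columns.tolist())
```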

## 3C. Convert Data Types
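
The notebook does not say which conversions this step expects. As one illustration, the Yes/No service flags could be stored as booleans, which makes counting and filtering simpler. A minimal sketch, assuming the flag columns contain only `Yes` and `No` (the column selection here is an assumption):

```python
import pandas as pd

# Toy subset of services_df, using values shown in the head() output.
toy_services = pd.DataFrame({
    "phone_service": ["No", "Yes", "Yes"],
    "unlimited_data": ["No", "Yes", "Yes"],
})

yes_no_map = {"Yes": True, "No": False}
flag_cols = ["phone_service", "unlimited_data"]  # assumed list of Yes/No columns

for col in flag_cols:
    # map() yields NaN for any value missing from the dictionary, so
    # unexpected labels show up as missing values instead of passing silently.
    toy_services[col] = toy_services[col].map(yes_no_map)

print(toy_services.dtypes)
```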


# 4. MERGE THE DATAFRAMES

## 4A. Identify the connecting columns
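
A quick way to look for join candidates is to list the column names the two DataFrames share and then check whether the values actually match. A minimal sketch on toy data built from the `head()` outputs above, assuming the column names have already been standardized:

```python
import pandas as pd

# Toy subsets of the two DataFrames, using values from the head() outputs.
toy_demo = pd.DataFrame({
    "customer_id": ["TCO-8779-QRDMV", "TCO-7495-OOKFY"],
    "count": [1, 1],
    "age": [78, 74],
})
toy_serv = pd.DataFrame({
    "customer_id": ["8779QRDMV", "7495OOKFY"],
    "count": [1, 1],
    "contract": ["Month-to-Month", "Month-to-Month"],
})

# Column names present in both frames.
shared_cols = toy_demo.columns.intersection(toy_serv.columns).tolist()

# How many demographics IDs appear verbatim in the services IDs?
n_matching_ids = int(toy_demo["customer_id"].isin(toy_serv["customer_id"]).sum())

print(shared_cols)
print(n_matching_ids)
```

In this toy example no IDs match as-is, because the demographics IDs carry a `TCO-` prefix and an extra hyphen. That mismatch is what step 4B addresses.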

## <p style="color:red;">4B. Manipulate the connecting columns</p>

**The code cell below is graded. Do not delete the cell.**

In [47]:
# WRITE YOUR SOLUTION HERE. DO NOT DELETE. THIS CELL IS GRADED.
demographics_df["customer_id"] = demographics_df.customer_id.str.replace("TCO-", "")
demographics_df["customer_id"] = demographics_df.customer_id.str.replace("-", "")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [48]:
demographics_df.customer_id[:3]

0    8779QRDMV
1    7495OOKFY
2    1658BYGOY
Name: customer_id, dtype: object

## <p style="color:red;">4C. Join the DataFrames</p>
**The code cell below is graded. Do not delete the cell.**

In [None]:
# WRITE YOUR SOLUTION HERE. DO NOT DELETE. THIS CELL IS GRADED.
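
One possible way to approach the join, shown on toy data and not as the graded solution: merge on `customer_id` and let pandas verify that each ID appears once per side. The `how="inner"` choice is an assumption; it drops customers missing from either side, which matters here because one demographics row was removed during cleaning. An outer merge with `indicator=True` is a handy side check for unmatched rows.

```python
import pandas as pd

# Toy frames; the services side has one customer the demographics side lacks.
toy_demo = pd.DataFrame({
    "customer_id": ["8779QRDMV", "7495OOKFY", "1658BYGOY"],
    "age": [78, 74, 71],
})
toy_serv = pd.DataFrame({
    "customer_id": ["8779QRDMV", "7495OOKFY", "1658BYGOY", "4598XLKNJ"],
    "monthly_charge": [39.65, 80.65, 95.45, 98.5],
})

# validate="one_to_one" raises an error if an ID is duplicated on either side.
merged = toy_demo.merge(toy_serv, on="customer_id", how="inner",
                        validate="one_to_one")

# Side check: which rows would be lost by the inner join?
audit = toy_demo.merge(toy_serv, on="customer_id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(merged))
print(unmatched[["customer_id", "_merge"]])
```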


# 5. PREPARE TO COMPLETE THE SUMMARY

## <p style="color:red;">5A. Calculate customer tenure in days</p>
**The code cell below is graded. Do not delete the cell.**

In [None]:
# WRITE YOUR SOLUTION HERE. DO NOT DELETE. THIS CELL IS GRADED.
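
A minimal sketch of the tenure calculation on three enrollment dates taken from the `head()` output, not the graded solution. It assumes the dates are written month/day/year and uses 9/1/2020 as "today", as the special request specifies; the variable names are illustrative.

```python
import pandas as pd

# "Today" as defined in the special request.
today = pd.Timestamp("2020-09-01")

# Three enrollment dates from the head() output, assumed to be month/day/year.
enrolled = pd.to_datetime(
    pd.Series(["8/1/2020", "1/1/2020", "4/1/2019"]), format="%m/%d/%Y"
)

# Subtracting datetimes gives timedeltas; .dt.days extracts whole days.
tenure_days = (today - enrolled).dt.days
average_tenure = tenure_days.mean()

print(tenure_days.tolist())
print(average_tenure)
```

On the full data the same subtraction would apply to the `customer_enrollement` column once it has been converted to datetime.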


## 5B. Provide a summary of the data 
**Summarize the data in the markdown cell at the top of this Jupyter Notebook.**

## 5C. Add 2-3 inferences you'd like to share with Victor
**Add these into the markdown cell at the top of the Jupyter Notebook**