# Dataset Merging

### Importing libraries

In [None]:
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn
!pip install scipy

!pip install xlrd
!pip install openpyxl

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline

### Importing the data

In [None]:
DATA_PATH = "../data/"

path = DATA_PATH + "Telco_customer_churn.xlsx"

#### Importing the data sheets

We import every sheet and look at the first few rows of each.

#### Original

In [None]:
df_orig = pd.read_excel(path, sheet_name='Telco_Churn')
df_orig.head()

In [None]:
df_orig.info()

#### Status

In [None]:
df_status = pd.read_excel(path, sheet_name="status")
df_status.head()

In [None]:
df_status.info()

#### Services

In [None]:
df_services = pd.read_excel(path, sheet_name="services")
df_services.head()

In [None]:
df_services.info()

#### Location

In [None]:
df_location = pd.read_excel(path, sheet_name="location")
df_location.head()

In [None]:
df_location.info()

#### Population

In [None]:
df_population = pd.read_excel(path, sheet_name="population")
df_population.head()

In [None]:
df_population.info()

#### Demographics

In [None]:
df_demographics = pd.read_excel(path, sheet_name="demographics")
df_demographics.head()

In [None]:
df_demographics.info()

### Concatenation

Let us concatenate the dataframes into a single dataframe, as it will be easier to work with. We must be careful when combining them, however: the attributes in each row must correspond to the same client. Note also that the population dataframe differs from the others, since each row corresponds not to a client but to an area, that is, an aggregate of clients. We will see later if and how we can incorporate it into our data.
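As a minimal sketch of how area-level data could later be attached to client-level rows, a many-to-one left merge is one option. The toy tables and the use of `Zip Code` as the shared key are assumptions for illustration, not something established by this notebook.

```python
import pandas as pd

# Toy stand-ins for a customer-level and an area-level table.
# "Zip Code" as the shared key is an assumption for illustration.
customers = pd.DataFrame({
    "Customer ID": ["A1", "B2", "C3"],
    "Zip Code": [90001, 90001, 90002],
})
population = pd.DataFrame({
    "Zip Code": [90001, 90002],
    "Population": [54492, 44586],
})

# A left merge keeps one row per customer. validate="many_to_one" raises
# a MergeError if a key appears more than once in the area table.
merged = customers.merge(
    population, on="Zip Code", how="left", validate="many_to_one"
)
print(merged)
```

Every customer in the same area receives the same area-level value, so the number of rows stays equal to the number of customers.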

For the other dataframes (excluding population), let us check whether the customer IDs match in every row so that we can merge the data. Not every dataframe uses the same label for the customer ID column, so we first rename it. We also see that all the contents of the original dataframe are contained within the four others (status, services, location and demographics).

In [None]:
df_orig.rename(columns={'CustomerID': 'Customer ID'}, inplace=True)

In [None]:
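def checkID(dataframes):
    # All dataframes must have the same number of rows.
    assert len(set(len(dataframe) for dataframe in dataframes)) == 1
    mismatches = []
    for i in range(len(dataframes[0])):
        for j in range(1, len(dataframes)):
            if dataframes[0].iloc[i]["Customer ID"] != dataframes[j].iloc[i]["Customer ID"]:
                mismatches.append((i, j))  # (row, dataframe index)
    if mismatches:
        return f"{len(mismatches)} mismatches, first few: {mismatches[:5]}"
    return "Customer IDs match!"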

dataframes = [df_status, df_services, df_location, df_demographics]
checkID(dataframes)

The IDs match row for row, so these four datasets were indeed designed and created together and describe the same customers in the same order. Therefore, we can safely concatenate them column-wise.
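The row-by-row comparison can also be written as a vectorized check, which is considerably faster on several thousand rows. This is a sketch on toy data; `ids_match` and its `key` parameter are hypothetical names introduced here.

```python
import pandas as pd

def ids_match(dataframes, key="Customer ID"):
    """Return True if every dataframe has the same key values in the same row order."""
    reference = dataframes[0][key].reset_index(drop=True)
    return all(
        len(frame) == len(reference)
        and frame[key].reset_index(drop=True).equals(reference)
        for frame in dataframes[1:]
    )

a = pd.DataFrame({"Customer ID": ["A1", "B2"], "x": [1, 2]})
b = pd.DataFrame({"Customer ID": ["A1", "B2"], "y": [3, 4]})
c = pd.DataFrame({"Customer ID": ["B2", "A1"], "z": [5, 6]})

print(ids_match([a, b]))  # same IDs in the same order
print(ids_match([a, c]))  # same IDs in a different order
```

Because the check is order-sensitive, it only passes when a plain column-wise concatenation would line up the same customer in every row.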

In [None]:
df = pd.concat([df_status, df_services, df_location, df_demographics], axis=1, join='outer', ignore_index=False, verify_integrity=False)
df.head()

If we set `verify_integrity=True`, the call raises a `ValueError` because some column names overlap: `Customer ID`, `Count` and `Quarter`. We therefore remove these duplicate columns from our new dataframe.

In [None]:
df = df.loc[:,~df.columns.duplicated()] # Removes duplicates
df.info()

Having removed duplicates, we now have 51 variables. 
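The behaviour of `verify_integrity` and of the duplicate-column filter can be reproduced on a small example. The two toy frames below are assumptions for illustration; note that keeping the first occurrence of a label is only safe when the duplicated columns hold identical values.

```python
import pandas as pd

left = pd.DataFrame({"Customer ID": ["A1", "B2"], "Count": [1, 1], "Tenure": [5, 12]})
right = pd.DataFrame({"Customer ID": ["A1", "B2"], "Count": [1, 1], "Age": [34, 51]})

# With axis=1 the concatenated axis is the columns, so verify_integrity=True
# raises a ValueError when column labels repeat.
try:
    pd.concat([left, right], axis=1, verify_integrity=True)
    raised = False
except ValueError:
    raised = True
print("ValueError raised:", raised)

combined = pd.concat([left, right], axis=1)
# columns.duplicated() marks every repeat of a label after its first occurrence.
deduplicated = combined.loc[:, ~combined.columns.duplicated()]
print(list(deduplicated.columns))
```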

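We save the new dataset to a new file at the end of this notebook so that we do not have to rerun the code every time. Before that, we check for missing values and inspect the churn-related columns.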

In [None]:
df.isnull().sum(axis=0)

In [None]:
# category = {}
# for i in range(len(df["Churn Category"])):
#     if df.loc[i, "Churn Category"] in category:
#         category[df.loc[i, "Churn Category"]] += 1
#     else:
#         category[df.loc[i, "Churn Category"]] = 0
# print(category)
df.groupby("Churn Category")["Customer ID"].nunique()

In [None]:
df.groupby("Churn Reason")["Customer ID"].nunique()

In [None]:
df.groupby("Churn Label")["Customer ID"].nunique()

In [None]:
df.groupby("Customer Status")["Customer ID"].nunique()

In [None]:
df.groupby("Churn Value")["Customer ID"].nunique()

In [None]:
4719 + 453

Remarks: Customers with the status "Joined" are clearly treated as customers who are not going to churn. We have to decide whether to include them or not, and we can try both. I propose leaving them in for now and then rerunning our algorithms without them at the end of the project. There is an interesting trade-off. We would expect that removing the 450 or so customers who have only just joined would make the profile of customers who do not churn more precise, improving accuracy. On the other hand, we would be removing training instances, which may make it harder to train models such as neural networks that require large amounts of data. This is only a medium-sized dataset, so removing roughly 450 instances is not insignificant.
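A sketch of how the two options could be prepared side by side, on toy data. The filter has to be applied while the `Customer Status` column is still present; the variable names `with_joined` and `without_joined` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Customer Status": ["Stayed", "Joined", "Churned", "Stayed", "Joined"],
    "Churn Label": ["No", "No", "Yes", "No", "No"],
})

# Cross-tabulate status against the label to see how "Joined" is labelled.
print(pd.crosstab(toy["Customer Status"], toy["Churn Label"]))

# Two candidate training sets: all customers, and established customers only.
with_joined = toy
without_joined = toy[toy["Customer Status"] != "Joined"].reset_index(drop=True)
print(len(with_joined), len(without_joined))
```

Training the same model on both variants and comparing the validation scores would show which side of the trade-off dominates.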

It is clear that we must drop columns such as "Churn Reason", which leak the target: they would immediately tell our algorithms whether a customer churned or not.

Some of the columns we drop:
- Count: Every value is equal to 1
- Quarter: Every value is equal to Q3
- Country: Every value is equal to "United States"
- State: Every value is equal to "California"

Description to be updated

In [None]:
df.drop(columns=["Churn Category", "Churn Reason", "Customer Status", "Churn Value", "Churn Score", "Count", "Quarter", "Lat Long"], inplace=True)

In [None]:
df.isnull().sum(axis=0)

In [None]:
df.info()

### Fixing data types

#### Encoding binary values

Several binary columns are stored as the strings "Yes" and "No", which we convert to 1 and 0.

In [None]:
def convert_binary(columns: list):
    for column in columns:
        df[column] = df[column].eq('Yes').mul(1)

In [None]:
df.groupby("Phone Service")["Customer ID"].nunique()

In [None]:
binary_columns = ["Referred a Friend", "Churn Label", "Under 30", "Senior Citizen", "Married", "Dependents", "Phone Service", "Multiple Lines"]
convert_binary(binary_columns)
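Note that `.eq('Yes')` maps every value that is not exactly "Yes" to 0, including missing values and variants such as "yes". A stricter alternative, sketched here under the assumption that only "Yes", "No" and missing values are legitimate, fails loudly on anything else; `yes_no_to_int` is a hypothetical helper name.

```python
import pandas as pd

def yes_no_to_int(series):
    """Map "Yes"/"No" to 1/0 and raise on any other non-missing value."""
    unexpected = set(series.dropna().unique()) - {"Yes", "No"}
    if unexpected:
        raise ValueError(f"unexpected values: {sorted(map(str, unexpected))}")
    # Nullable Int64 keeps missing entries as <NA> instead of turning them into 0.
    return series.map({"Yes": 1, "No": 0}).astype("Int64")

encoded = yes_no_to_int(pd.Series(["Yes", "No", None, "Yes"]))
print(encoded.tolist())
```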

#### One hot encoding for churn label

In [None]:
df.head()

In [None]:
df = df.drop(columns="Customer ID")  # reassign: drop() returns a new dataframe
df.head()

#### Saving the new data

We save the cleaned data to a new file.

In [None]:
save_file = "Telco_data_clean.csv"

save_path = DATA_PATH + save_file

df.to_csv(save_path)
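Since `to_csv` is called without `index=False`, the row index is written as an unnamed first column. A sketch of reloading such a file, using a temporary directory and a hypothetical file name in place of the real path:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Churn Label": [1, 0], "Tenure": [5, 12]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_clean.csv")  # hypothetical file name
    toy.to_csv(csv_path)  # the index is written as an unnamed first column
    # index_col=0 restores that column as the index; without it, the column
    # would come back as an extra "Unnamed: 0" column.
    reloaded = pd.read_csv(csv_path, index_col=0)

print(reloaded)
```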