# EDA and data visualization

## EXERCISE 2

You have recently started a job at a well-known telecom company. As you may know, telecoms fight fiercely for customer retention, with entire service branches devoted to the task, because retaining a customer is more cost-efficient than acquiring a new one. One of your classmates from the master's program was hired by the Marketing Department. He needs to understand the company’s clients, so he has asked you to help with a descriptive report and a segmentation of the customer base. **He is especially interested in the lifetime value of loyal customers.**

The purpose of this exercise is to prepare a descriptive report and segment the customers in the most adequate way. Use the data in `customers.csv`. Clean, organise and present an **exploratory analysis** of the data. What can you tell about the customers?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set the seaborn style and some pandas display options
sns.set(style="darkgrid")
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## 1. Preprocessing and Cleaning
    
In this first part we go through the data and clean the CSV:
- set the correct types
- set the index
- clean null values
- keep the relevant columns and rows

**TO DO:**
Load the "customers.csv" file in a pandas dataframe and display the head.

In [None]:
# If you use Google Colab:
#url ="https://media.githubusercontent.com/media/michalis0/Business-Intelligence-and-Analytics/master/data/customers.csv"
#customers = pd.read_csv(url)

#filename = "../../data/customers.csv"
#customers = pd.read_csv(filename)

**TO DO:** Display the index, columns, dtypes and shape.
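As a minimal sketch of what these four attributes return, the snippet below uses a small made-up dataframe called `toy` in place of `customers` (the column names and values are assumptions for illustration only).

```python
import pandas as pd

# Toy stand-in for `customers` (made-up values, only for illustration).
toy = pd.DataFrame({
    "customerID": ["id-001", "id-002", "id-003"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

print(toy.index)    # row labels (a RangeIndex until an index column is set)
print(toy.columns)  # column names
print(toy.dtypes)   # one dtype per column
print(toy.shape)    # (number of rows, number of columns)
```

The same four attributes can be read from `customers` once it is loaded.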

* Modify the data types and convert boolean-like values to a proper encoding. In particular, map True/False (Yes/No) values to the categories 1 and 0. The first step is to list the unique values present in each column; then encode them as numerical values.

First, let's take a look at the tenure column. Since tenure corresponds to the number of months the customer has stayed with the company, it is important to check for null or 0 values.

**TO DO:** Start with the tenure column, accessed with `customers.tenure` or `customers['tenure']`. Check and count the number of rows where `tenure == 0`. To select the rows matching a condition on a column, you can use ``.loc[]``, e.g. ``customers.loc[customers.tenure == 0]``

Finally, to count these rows, apply ``.count()`` to the selection. It returns, for each column, the number of non-null values among the selected rows, which here gives the number of rows with `tenure == 0`.
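The selection and counting steps can be sketched on a small made-up dataframe (`toy` and its values are assumptions, not the real data). The three counts below should agree as long as the selected rows contain no missing values.

```python
import pandas as pd

# Toy stand-in for `customers` (made-up values, only for illustration).
toy = pd.DataFrame({
    "customerID": ["id-001", "id-002", "id-003", "id-004"],
    "tenure": [0, 12, 0, 34],
})

# Boolean mask -> keep only the rows where tenure is exactly 0.
zero_tenure = toy.loc[toy.tenure == 0]

n_rows = len(zero_tenure)              # number of selected rows
n_mask = int((toy.tenure == 0).sum())  # each True counts as 1
per_column = zero_tenure.count()       # non-null values per column

print(zero_tenure)
print(per_column)
```

Note that `.count()` skips missing values, so `len()` or summing the mask is the safer choice when a column may contain NaN.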

**Note:** These 11 rows will be deleted later. They clearly correspond to outliers: customers who perhaps started a contract and cancelled before it took effect, or rows whose data is simply missing. Note also that these rows will be dropped in the next steps (checks for empty strings, NaN values, etc.).

The following piece of code identifies the unique values of each column. The two arguments define the range for the number of unique values you are looking for (`min_` inclusive, `max_` exclusive). The function returns the columns whose number of unique values falls in this range (it will be useful later to clean the columns).

In [None]:
def identify_uniques(min_, max_):
    columns = list(customers.columns)
    print(columns, '\n')
    interested_columns = []
    for col in columns:
        # Retrieve the unique values of the column
        uniques = customers[col].unique().tolist()
        # Check whether their number falls in the range [min_, max_)
        if min_ <= len(uniques) < max_:
            print(col, " unique values: ", uniques)
            interested_columns.append(col)
    return interested_columns


Let's look at the binary and multiclass columns.

In [None]:
print("Binary classes: ")
interested_columns = identify_uniques(0, 4)

print("Multi classes: ")
identify_uniques(3, 15)

**Note:** Several columns contain only two categories. Some of them also contain the values "No internet service" and "No phone service", which can be converted directly to the "No" category. This will automatically convert the columns to the right pandas type. There are also some special columns with several but "countable" classes: *InternetService*, *Contract* and *PaymentMethod*

**TO DO:** Replace the fields with the corresponding values. For example:  
`customers.gender.replace(to_replace=['Female', 'Male'], value=[0, 1], inplace=True)`.
Make sure the fields end up with numerical types.
  1. For binary columns (columns with 2 different values), replace `['No', 'Yes', 'No internet service', 'No phone service']` by `[0, 1, 0, 0]`:  
 `.replace(to_replace=['No', 'Yes', 'No internet service', 'No phone service'], value=[0, 1, 0, 0])`  
 Use the `interested_columns` list.
  2. For multiclass columns:
    - replace the `gender` column s.t: `['Female', 'Male']` becomes `[0, 1]`
    - replace the `InternetService` column s.t: `['DSL', 'Fiber optic']` becomes `[2, 1]`
    - replace the `Contract` column s.t: `['Month-to-month', 'One year', 'Two year']` becomes `[0, 1, 2]`
    - replace the `PaymentMethod` column s.t: `['Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)']` becomes `[0, 1, 2, 3]`
  3. Convert `TotalCharges` to a numerical type with `pd.to_numeric()`, setting the argument `errors='coerce'`. Values that cannot be converted to a number are then set to `NaN`.
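The three steps can be sketched on a made-up dataframe as follows. This is only one possible approach: it uses `Series.map` with a dictionary and plain assignment instead of `.replace(..., inplace=True)`, and the `toy` frame, its values and the `yes_no` dictionary are assumptions for illustration. Be aware that `.map` turns any value missing from the dictionary into NaN.

```python
import pandas as pd

# Toy stand-in for `customers` (made-up values, only for illustration).
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Partner": ["Yes", "No", "Yes"],
    "OnlineSecurity": ["No internet service", "Yes", "No"],
    "Contract": ["Month-to-month", "Two year", "One year"],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# 1. Binary columns: the "No ... service" variants collapse to 0.
yes_no = {"No": 0, "Yes": 1, "No internet service": 0, "No phone service": 0}
for col in ["Partner", "OnlineSecurity"]:
    toy[col] = toy[col].map(yes_no)

# 2. Columns with their own encoding.
toy["gender"] = toy["gender"].map({"Female": 0, "Male": 1})
toy["Contract"] = toy["Contract"].map(
    {"Month-to-month": 0, "One year": 1, "Two year": 2}
)

# 3. Strings to numbers; unparsable entries such as " " become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

print(toy)
print(toy.dtypes)
```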



**TO DO:** Execute the following code to display the different changes.

In [None]:
print("All multi classes: ")
identify_uniques(0, 15)

**TO DO:** Let's check the modified dataframe. Run the following code to display the head, the types and the new shape.

In [None]:
display(customers.head())
display(customers.dtypes)
display(customers.shape)

**TO DO:** Set the index to the id column by using `.set_index()` ([doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html)):
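A minimal sketch of `.set_index()` on a made-up dataframe. The column name `customerID` is an assumption here, so use whatever the id column is called in `customers.csv`.

```python
import pandas as pd

# Toy stand-in for `customers`; "customerID" is an assumed column name.
toy = pd.DataFrame({
    "customerID": ["id-001", "id-002"],
    "tenure": [1, 34],
})

# set_index returns a new dataframe; the id column becomes the row labels.
toy = toy.set_index("customerID")

print(toy.index)
print(toy.loc["id-002", "tenure"])  # rows can now be looked up by id
```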

* Check for other representations of NaN values (space, tab, foo...), and convert them to NaN values:

**TO DO:** Run the following code. For each special empty or null marker, it returns the positions of the elements that match it.

In [None]:
sp_chars = ['foo', ' ', '\r\n\t', '', None]
for c in sp_chars:
    print("Test for ", c, ": ", np.where(customers.applymap(lambda x: x == c)))


**Note:** These values were correctly handled during the conversion process.

**TO DO:**
  1. Check if there are some null values. Use `.isnull().sum()`
  2. Remove these unwanted rows by using `.dropna(inplace=True)` on the dataframe.

**Note:** 11 rows with NaN values have been removed.
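The null check and the removal can be sketched on a made-up dataframe (`toy` and its values are assumptions for illustration). The sketch assigns the result of `.dropna()` to a new name rather than using `inplace=True`, so the original frame stays available for comparison.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `customers` after the type conversion (made-up values).
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": [29.85, np.nan, 1889.5],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()

# Drop every row that contains at least one missing value.
cleaned = toy.dropna()

print(null_counts)
print(toy.shape, "->", cleaned.shape)
```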

## 2. Data visualization & Statistics
    
    
* Comparison of the number of churned customers

**TO DO:** Now we will compare the number of churned and retained customers. To do so, display a bar plot of `pd.value_counts(customers.Churn)` (this function returns the number of occurrences of each unique value), then apply the `.plot()` function. Choose and adjust the parameters of your plot (set the title and the axis labels).
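A possible shape for this plot, sketched on made-up data. It assumes the target column is named `Churn` and is already encoded as 0/1, and it calls the `Series.value_counts()` method, which gives the same counts as the top-level `pd.value_counts` function (flagged as deprecated in recent pandas releases).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for `customers` (made-up values, only for illustration).
toy = pd.DataFrame({"Churn": [0, 0, 1, 0, 1, 0]})

# Number of occurrences of each unique value in the column.
counts = toy["Churn"].value_counts()

# One bar per class, with a title and axis labels.
ax = counts.plot(kind="bar", title="Churned vs. retained customers (toy data)")
ax.set_xlabel("Churn (0 = stayed, 1 = left)")
ax.set_ylabel("Number of customers")
plt.show()
```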


* Charges distribution & statistics

**TO DO:** We are now interested in the continuous columns: tenure and the charges (total and monthly). Apply the `.describe()` function to these three fields. Plot the charts for the TotalCharges, MonthlyCharges and tenure fields. Apply `.sort_values(ascending=True)` before plotting.
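As a sketch under the assumption that the three columns are already numeric, the summary statistics and the sorting step look like this on made-up data. Resetting the index after sorting is an optional extra, not something the exercise requires: it makes the x-axis of a later `.plot()` call the rank of each value rather than the original row label.

```python
import pandas as pd

# Toy stand-in for `customers` (made-up values, only for illustration).
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": [29.85, 1889.5, 108.15, 1840.75],
})

# count, mean, std, min, quartiles and max for each column.
stats = toy[["tenure", "MonthlyCharges", "TotalCharges"]].describe()

# Sorted values give a monotone curve when plotted.
sorted_monthly = (
    toy["MonthlyCharges"].sort_values(ascending=True).reset_index(drop=True)
)

print(stats)
print(sorted_monthly)
```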

## 3. Statistical analysis

To select the features for our models, we have to analyse some statistical indicators, such as the correlation between variables, especially with the *Churn* column, since it will later be the target variable.

* Correlation between features columns

The Pearson correlation coefficient is the most widely used. It measures the strength of the linear relationship between normally distributed variables. When the variables are not normally distributed or the relationship between the variables is not linear, it may be more appropriate to use the Spearman rank correlation method.  
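The difference between the two coefficients can be illustrated with a made-up pair of variables linked by a monotonic but non-linear relationship, here `y = exp(x)`. The data is purely illustrative and unrelated to `customers.csv`.

```python
import numpy as np
import pandas as pd

# y grows with x, but exponentially rather than linearly.
x = np.arange(1, 11, dtype=float)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print("Pearson: ", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Spearman works on ranks, so it reports a perfect monotonic association, while Pearson is pulled well below 1 because the relationship is not linear.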

We will see that the choice of feature extraction method depends on whether the input columns are linearly dependent or not.

**TO DO:**
  1. Compute the different correlations ('pearson' and 'spearman') and look at the correlation with the Churn column, e.g. `.corr(method='pearson')[['Churn']]`. We apply the correlation function over the whole `customers` dataframe, but at the end we keep only the Churn column, as a dataframe. Keep this notation and selection in mind.
  2. Plot the correlation values for the Churn column in a bar plot. You can reuse the previous code, but sort the correlations by value to get a readable bar plot. Your code will start like this:  
  `customers.corr(method='pearson')['Churn'][:-1].sort_values(ascending = True).` Note that `[:-1]` excludes the self-correlation of the Churn column (this assumes Churn is the last column).
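A sketch of this selection on a made-up, fully numeric dataframe. Instead of `[:-1]`, it removes the self-correlation with `.drop("Churn")`, a small variation that does not depend on the column order. The `toy` frame and the helper `churn_correlation` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Toy stand-in for the cleaned `customers` (made-up values).
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 60],
    "MonthlyCharges": [70.0, 56.9, 53.8, 42.3, 99.6, 20.0],
    "Contract": [0, 1, 0, 2, 0, 2],
    "Churn": [1, 0, 1, 0, 1, 0],
})

def churn_correlation(df, method):
    """Correlation of every column with Churn, sorted, without Churn itself."""
    return df.corr(method=method)["Churn"].drop("Churn").sort_values(ascending=True)

churn_corr = churn_correlation(toy, "pearson")
churn_corr_spearman = churn_correlation(toy, "spearman")

print(churn_corr)
print(churn_corr_spearman)
```

Calling `churn_corr.plot(kind="bar")` on the result then draws the bars in sorted order.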


**Note:** We are interested in the strongly correlated and the uncorrelated columns. The main features that influence the churn decision are: *Contract*, *tenure*, *PaymentMethod*, *TotalCharges*, *SeniorCitizen*, *MonthlyCharges* and *PaperlessBilling*. Note that the two most correlated features are swapped depending on the method.



**Note:**
These results seem consistent with some hypotheses:  
- some types of services the customers signed up for do not influence their decision to end the contract
- gender is not relevant either  
- the senior category is strongly correlated (i.e. we all die :sweat:)  
- contract, payment method (linked with contract), charges and tenure are obviously important for a lot of people (they are strongly correlated with each other)  
- the paperless billing column is correlated with MonthlyCharges

**TO DO:** Run the following code. It plots the correlation matrix using `sns.heatmap()`.

In [None]:
corr = customers.corr()

plt.figure(figsize=(10,10))
plt.title("Correlation matrix")
ax = sns.heatmap(
    corr,
    vmin=-1, vmax=1, center=0,
    cmap=sns.cubehelix_palette(200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);
