# Call Center detection
1) Definition of Call Center: Have total calls in a specific period of time on top k percent (k ideally belongs to the interval [1,2])

2) Methods of Call Center detection:

    a) 1st method: 
    
    - Step 1. Filter and get total calls of users in the last month (...).
    
    - Step 2. Make a descending order (by total calls).
    
    - Step 3. Get the first k percent of rows.
    
    b) 2nd method:
    - Step 1. Make some dataframes to get total calls and total contacts of users in some most recent intervals of 7 days (from the last 0 to 7 days, from the last 7 to 14 days,...).
    
    - Step 2. In each dataframe above, get the first h percent of rows (h should be greater than k).
    
    - Step 3. Obtain the intersection by joining (how = 'inner') all the results in Step 2.
    
    c) 3rd method (not filter by percent):
    
    - Step 1. Filter and get total calls of users in the last month (...).
    
    - Step 2. Filter to get users who have number of calls greater than m calls (m is specified on our own).
    

# Diversity of Call Center
1) Definition of Diversity of Call Center: A Call Center which has a large number of total contacts is considered to be 'high diversity'; otherwise, 'low diveristy'.

2) Methods of Diversity detection:

    - Step 0. Filter the users who are call centers.
    
    - Step 1. Get total contacts in the specified period of time (last month, last week, etc.)
    
    - Step 2. Calculate the ratio of total contacts over total calls in that period.
    
    - Step 3. Assign diversity value based on the above ratio (>= 0.5 - 'high'; <0.5 - 'low')
    

# Call in/ Call out center
1) The main goal: to detech whether the center is a call in center or a call out center.

2) Method of Call in/ Call out detection:

    - Step 0. Filter the users who are call centers.
    
    - Step 1. Use countDistict function for the the column 'calling_type' of each user, which means that we count the number of calls of each value of 'calling-type'.
    
    - Step 2. If the result of:
    
                (number of ('calling_type' == 1) / total calls)
                
              is greater than 0.6, return 'call_in'; lower than 0.4, return 'call-out'; otherwise, 'both'.
              
              (We assume that 'calling_type' ==1 refers to 'incoming call'; 'calling_type' == 2 refers to 'outgoing call'.
              
# Call center: Type of duration
1) The main goal: get the most typical type of calls based on the length of calls (i.e. the column 'duration'). 

2) Duration types:

- very_short: 0 - 15s
- short: 15 - 30s
- medium: 30s - 2min
- long: 2 - 20min
- very_long: > 20 min

3) How to get the return value:

- Step 1. Consider each user, use countDistinct function to count number of calls of each duration type.
- Step 2. Find the type of duration that its number of calls is the maximum.


# Call center: Location (daily) changes
1) The main goal: based on locations of call history, return whether a user changes his location in village level, district level, province level, or do not change.

2) How to get the return value:

- If village does not change, then location is "no_change"
- If village changes, but district and province do not change, return "village"
- If district changes, but province does not change, return "district"
- If province changes, return "province"

# Business detection:
1) The main goal: based on all the above features, return the output that what the business of each user is.

2) How to get the return value: See the sheet "Main_idea" in the following link:
https://finosasia.sharepoint.com/:x:/s/CS_project/EeiEwzWQxX9OsGD3bW6-iMUBZqi7tPdJBdVCSqaDRqUygA?e=wPPliI
              


# Spark Config

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import sum,avg,max, min
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.sql.dataframe import DataFrame
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.window import Window
from pyspark.sql.types import StringType
from pyspark.sql.types import *
from pyspark.sql.functions import row_number
from pyspark.sql.functions import rank
from pyspark.sql.functions import dense_rank
import math as m
#import pyarrow.parquet as pq
#import s3fs
import pandas as pd
import findspark
import os
import pyspark
#input_date = '20201012'
from functools import reduce
from pyspark.sql.functions import col
from pyspark.sql.functions import when
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
import sys
from random import random
from operator import add
findspark.init()
#conf = SparkConf().setMaster("spark://spark-master.shared:7077").setAppName("Spark_test_36") --> Set Spark Master to run
# Set Spark Local to run
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

import warnings
warnings.filterwarnings('ignore')

spark = SparkSession.builder.master('local').appName('foo').getOrCreate()
spark.sparkContext.setLogLevel('WARN')
df = spark.read.parquet("sample_data.snappy.parquet").orderBy(col('CALLING').asc(), col('CALLED_DATE').asc())

22/11/27 17:50:14 WARN Utils: Your hostname, HCM-MacOS-QuanNguyen.local resolves to a loopback address: 127.0.0.1; using 192.168.1.22 instead (on interface en0)
22/11/27 17:50:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/11/27 17:50:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
                                                                                

# Basic functions

In [None]:
def lower_col_name(df):
    for col in df.columns:
        df = df.withColumnRenamed(col, col.lower())
    return df

def new_df(df, no_group_col, groupby_col_1, groupby_col_2, groupby_col_3,
           piv_col, cols, funcs, new_col_names, pivot = False):
    if pivot == False:
        if no_group_col == 1:
            df_new = (
            df
            .groupby(groupby_col_1)
            .agg(*[func(cols[index]).alias(new_col_names[index]) for index, func in enumerate(funcs)]))
        elif no_group_col == 2:
            df_new = (
            df
            .groupby(groupby_col_1, groupby_col_2)
            .agg(*[func(cols[index]).alias(new_col_names[index]) for index, func in enumerate(funcs)]))
        elif no_group_col == 3:
            df_new = (
            df
            .groupby(groupby_col_1, groupby_col_2, groupby_col_3)
            .agg(*[func(cols[index]).alias(new_col_names[index]) for index, func in enumerate(funcs)]))
        elif no_group_col == 4:
            df_new = (
            df
            .groupby(groupby_col_1, groupby_col_2, groupby_col_3, piv_col)
            .agg(*[func(cols[index]).alias(new_col_names[index]) for index, func in enumerate(funcs)]))
    else:
        df_new = (
        df
        .groupby(groupby_col_1, groupby_col_2) \
        .pivot(piv_col) \
        .agg(*[func(cols[index]).alias(new_col_names[index]) for index, func in enumerate(funcs)]))
    return df_new

def join_df(df1, df2, df1_col1, df2_col1, df1_col2, df2_col2, join_type):
    if df2_col1 != '':
        df_new = df1.join(df2, (df1[df1_col1] == df2[df2_col1]) & 
                         (df1[df1_col2] == df2[df2_col2]), how = join_type)
    else:
        df_new = df1.join(df2, df1[df1_col1] == df2[df2_col1], how = join_type)
    return df_new

def add_col(df, col, func):
    df_new = df.withColumn(col, func)
    return df_new

def rename_col(df, col_mapping):
    for c in col_mapping:
        df = df.withColumnRenamed(existing=c[0], new=c[1])
    return df

def del_col(df, cols):
    df_new = df.drop(*cols)
    return df_new


# def fill_na(df, cols):
#     df_new = df.na.fill(value = 0, subset = cols)
#     return df_new

def checkNull(df):
    
    '''
    prints total rows in dataframe, number and percentage of nulls in each column 
    '''
    total_rows = df.count()
    cols = df.columns 
    abs_null = df.select([f.count(f.when(f.col(c).isNull(), c)).alias(c) for c in cols])
    perc_null = abs_null
    for c in cols:
        perc_null = perc_null.withColumn('{}'.format(c), f.round(f.col(c)/total_rows*100,2))
    print ('Total Rows: {}'.format(total_rows))
    print ('Absolute Nulls')
    abs_null.show(truncate=False, vertical=True)
    print ('% Percentage Nulls')
    perc_null.show(truncate=False, vertical=True)

# Basic statistics

In [None]:
working_hours = [i for i in range(8,12)] + [i for i in range(13, 18)]
night_hours =  [i for i in range(19,24)] + [i for i in range(0, 6)]
morning_hours = [i for i in range(6, 8)]
lunch_hours = [i for i in range(11, 13)]
df = lower_col_name(df)
df = add_col(df, 'called_date', substring('called_time',0,8))
df = add_col(df, 'called_date', f.to_date('called_date', 'yyyyMMdd'))
df = add_col(df, 'week_of_year', f.weekofyear(f.col('called_date')))
df = add_col(df, 'year', f.year(f.col('called_date')))
df = add_col(df, 'day_of_week', dayofweek(col('called_date')))
df = add_col(df, 'session_of_week', when(col('day_of_week').isin([6,7]),'weekend').otherwise('weekday'))
df = add_col(df, 'session_of_day', when(col('hour').isin(working_hours), 'working_hours') \
                   .when(col('hour').isin(night_hours), 'night_hours') \
                   .when(col('hour').isin(morning_hours), 'morning_hours') \
                   .otherwise('lunch_hours'))

# Call Center detection

In [281]:
def fitler_from_x_to_y_days(df, x, y):
    df = df.withColumn("last_date", lit("2020-03-09"))
    filtered_df = df.filter((f.col('called_date') <= f.date_add(f.col('last_date'), -x)) 
    & (f.col('called_date') >= f.date_add(f.col('last_date'), -y)))
    return filtered_df

def count_calls_and_contacts_from_x_to_y_days(df, x, y):
    #Create a dataframe with total calls and total contacts
    df_count = new_df(fitler_from_x_to_y_days(df, x, y), 1, 'calling', '', '', '', ['called', 'called'], 
               [count, countDistinct], ['total_calls', 'total_contacts']).\
               orderBy(col('total_calls').desc(), col('total_contacts').desc())
    
    '''
    #Ranking by total calls
    total_calls_rank = df_count.\
                       select('total_calls').\
                       distinct().\
                       orderBy(col('total_calls').desc())
    total_calls_rank = total_calls_rank.withColumnRenamed('total_calls', 'total_calls_1')    
    total_calls_rank = total_calls_rank.withColumn('constant_value', lit('A'))
    total_calls_rank = total_calls_rank.withColumn("id", f.row_number().
                                                   over(Window.partitionBy('constant_value').
                                                   orderBy(col('total_calls_1').desc())
                                                   ))
    total_calls_rank.drop('constant_value')
    a = total_calls_rank.count()
    total_calls_rank = total_calls_rank.withColumn('percent_rank_total_calls', 
                                                   format_number(col('id')/a, 4))
    total_calls_rank = total_calls_rank.drop('constant_value', 'id')
    
    #Join ranking values into the dataframe of total calls and total contacts
    df1 = df_count
    df2 = total_calls_rank
    df3 = df1.join(df2, df2.total_calls_1 == df1.total_calls, 'inner') #The joint will include both two joining columns
    df3 = df3.drop('total_calls_1').\
          orderBy(col('total_calls').desc(), col('total_contacts').desc())
    '''
    return df_count
'''
def top_total_calls_from_x_to_y_days(df, x, y, percent):
    filtered = fitler_from_x_to_y_days(df, x, y)
    count_x_to_y_days = count_calls_and_contacts_from_x_to_y_days(df, x, y).\
                        filter(f.col('percent_rank_total_calls') <= percent)
    count_x_to_y_days = rename_col(count_x_to_y_days, [['calling', 'calling_1'], 
                                                       ['total_calls', f'total_calls_last_{x}_to_{y}_days'],
                                                       ['total_contacts', f'total_contacts_last_{x}_to_{y}_days'],
                                                       ['percent_rank_total_calls', f'percent_rank_total_calls_last_{x}_to_{y}_days']
                                                      ]
                                  )
    
    filtered = filtered.join(count_x_to_y_days, count_x_to_y_days.calling_1 == filtered.calling, 'inner').\
               drop('calling_1').\
               orderBy(col('calling').asc(), col('called_date').asc())
    return filtered
'''
def top_total_calls_from_x_to_y_days(df, x, y, percent):
    #Total calls and total contacts: filter top k percent
    df_count = count_calls_and_contacts_from_x_to_y_days(df, x, y)
    df_count = df_count.withColumn('constant_value', lit('A'))
    df_count = df_count.withColumn("id", f.row_number().
                                   over(Window.partitionBy('constant_value').
                                   orderBy(col('total_calls').desc())
                                        )
                                  )
    df_count = df_count.drop('constant_value')
    
    a = df_count.count()
    b = m.ceil(a*percent/100)
    c = df_count.filter(f.col('id') == b).collect()[0][1]
    top_total_calls = df_count.filter(f.col('total_calls') >= c)
    top_total_calls = rename_col(top_total_calls, [['calling', f'calling_last_{x}_to_{y}_days'], 
                                     ['total_calls', f'total_calls_last_{x}_to_{y}_days'],
                                     ['total_contacts', f'total_contacts_last_{x}_to_{y}_days'],
                                    ]
                                 ).drop('id')
    return top_total_calls
"""
def call_center(df, x, y, percent):
    df_center = top_total_calls_from_x_to_y_days(df, x, y, percent)
    df_center = df_center.withColumn('is_call_center', lit('Yes'))
    return df_center
"""
def is_call_center(df, x, y, percent, method):
    if method == "1_3":
        df_center = top_total_calls_from_x_to_y_days(df, x, y, percent)
        df_center = df_center.withColumn('is_call_center', lit('Yes'))
        return df_center
    if method == "2":
        top_total_calls_0 = top_total_calls_from_x_to_y_days(df, x, y, percent)
        top_total_calls_1 = top_total_calls_from_x_to_y_days(df, 0, 7, percent)
        top_total_calls_2 = top_total_calls_from_x_to_y_days(df, 7, 14, percent)
        top_total_calls_3 = top_total_calls_from_x_to_y_days(df, 14, 21, percent)
        top_total_calls_4 = top_total_calls_from_x_to_y_days(df, 21, 28, percent)
        #Join the above dataframes
        top_total_calls_5 = top_total_calls_0.join(top_total_calls_1, top_total_calls_0['calling_last_0_to_30_days'] ==\
                                                   top_total_calls_1['calling_last_0_to_7_days'],
                                                   'inner').\
                                                   drop(f'calling_last_0_to_7_days')
        top_total_calls_6 = top_total_calls_5.join(top_total_calls_2, top_total_calls_5['calling_last_0_to_30_days'] ==\
                                                   top_total_calls_2['calling_last_7_to_14_days'],
                                                   'inner').\
                                                   drop(f'calling_last_7_to_14_days')
        top_total_calls_7 = top_total_calls_6.join(top_total_calls_3, top_total_calls_6['calling_last_0_to_30_days'] ==\
                                                   top_total_calls_3['calling_last_14_to_21_days'],
                                                   'inner').\
                                                   drop(f'calling_last_14_to_21_days')
        top_total_calls_8 = top_total_calls_7.join(top_total_calls_4, top_total_calls_7['calling_last_0_to_30_days'] ==\
                                                 top_total_calls_4['calling_last_21_to_28_days'],
                                                 'inner').\
                                                 drop(f'calling_last_21_to_28_days')
        #Join the call center dataframe with the original dataframe
        top_total_calls_9 = top_total_calls_8.withColumn('is_call_center', lit('Yes'))
        """
        df2 = df.join(top_total_calls_9, top_total_calls_9[f'calling_last_{0}_to_{30}_days'] == df['calling'],
                      'fullouter').\
                      drop(f'calling_last_{0}_to_{30}_days', 'last_date').\
                      orderBy(col('calling').asc(), col('called_date').asc())
        """
        return top_total_calls_9
    else:
        return('Wrong method!')
 
        

1) A combination of the first and the third methods of Call Center detection.

In [279]:
is_call_center(df, 0, 30, 1, "1_3").show()



+-------------------------+-----------------------------+--------------------------------+--------------+
|calling_last_0_to_30_days|total_calls_last_0_to_30_days|total_contacts_last_0_to_30_days|is_call_center|
+-------------------------+-----------------------------+--------------------------------+--------------+
|     faF8JPamAGFfBh401...|                          183|                             157|           Yes|
|     G8Ym8cGCaXwsGf4Zc...|                          164|                             162|           Yes|
|     QwNNUJYOJI0KUukPU...|                          159|                             147|           Yes|
|     fh4YcnrMfQHRc2czK...|                          155|                             149|           Yes|
|     SeAOvie7GnBfBh401...|                          150|                             137|           Yes|
|     Ud9X/S0tv4bRc2czK...|                          146|                             133|           Yes|
|     /4ZrI/FrU/sJqWorj...|                   



2) The 2nd method

In [282]:
is_call_center(df, 0, 30, 1, "2").show()

                                                                                

+-------------------------+-----------------------------+--------------------------------+----------------------------+-------------------------------+-----------------------------+--------------------------------+------------------------------+---------------------------------+------------------------------+---------------------------------+--------------+
|calling_last_0_to_30_days|total_calls_last_0_to_30_days|total_contacts_last_0_to_30_days|total_calls_last_0_to_7_days|total_contacts_last_0_to_7_days|total_calls_last_7_to_14_days|total_contacts_last_7_to_14_days|total_calls_last_14_to_21_days|total_contacts_last_14_to_21_days|total_calls_last_21_to_28_days|total_contacts_last_21_to_28_days|is_call_center|
+-------------------------+-----------------------------+--------------------------------+----------------------------+-------------------------------+-----------------------------+--------------------------------+------------------------------+---------------------------------+-

#  Diversity of Call Center

In [164]:
def diversity_of_call_center_from_last_x_to_y_days(df, x, y, percent):
    """
    Note that in our methods, users filtered from the function 
    "top_total_calls_from_x_to_y_days(df, x, y, percent)"
    are all call centers
    """
    df_diversity = top_total_calls_from_x_to_y_days(df, x, y, percent)
    df_diversity = df_diversity.withColumn(f'diversity_ratio_last_{x}_to_{y}_days',
                                           format_number(f.col(f'total_contacts_last_{x}_to_{y}_days') 
                                           / f.col(f'total_calls_last_{x}_to_{y}_days'), 4
                                                        )
                                          )
    df_diversity = add_col(df_diversity, 'call_center_diversity', 
                           when((col('diversity_ratio') >= 0.3), 'high').\
                           otherwise('low')
                          )
    return df_diversity

Example

In [180]:
diversity_of_call_center_from_last_x_to_y_days(df, 0, 30, 2).show()



+-------------------------+-----------------------------+--------------------------------+---------------+---------------------+
|calling_last_0_to_30_days|total_calls_last_0_to_30_days|total_contacts_last_0_to_30_days|diversity_ratio|call_center_diversity|
+-------------------------+-----------------------------+--------------------------------+---------------+---------------------+
|     faF8JPamAGFfBh401...|                          183|                             157|         0.8579|                 high|
|     G8Ym8cGCaXwsGf4Zc...|                          164|                             162|         0.9878|                 high|
|     QwNNUJYOJI0KUukPU...|                          159|                             147|         0.9245|                 high|
|     fh4YcnrMfQHRc2czK...|                          155|                             149|         0.9613|                 high|
|     SeAOvie7GnBfBh401...|                          150|                             137|       



# Call in/ Call out center detection

In [183]:
def type_of_call_center_from_last_x_to_y_days(df, x, y, percent):
    df_type = top_total_calls_from_x_to_y_days(df, x, y, percent)
    df_type2 = df.join(df_type, df_type[f'calling_last_{x}_to_{y}_days'] == df['calling'], 'inner').\
               drop(f'calling')
    df_type3 = df_type2.groupBy(f'calling_last_{x}_to_{y}_days').pivot('calling_type').count()
    df_type3 = df_type3.na.fill(0)
    df_type3 = df_type3.withColumn('calling_type_ratio',
                                   format_number(f.col('1')/(f.col('1') + f.col('2')), 4)
                                  )
    df_type3 = df_type3.withColumn('type_call_center',
                                   when((col('calling_type_ratio') >= 0.6), 'call_in').\
                                   when((col('calling_type_ratio') <= 0.4), 'call_out').\
                                   otherwise('both')
                                  )
    df_type3 = rename_col(df_type3, [
                                     ['1', 'total_calls_type_1'],
                                     ['2', 'total_calls_type_2'],
                                    ]
                                 )
    return df_type3

Example

In [184]:
type_of_call_center_from_last_x_to_y_days(df, 0, 30, 2).show()

[Stage 1311:>                                                       (0 + 1) / 1]

+-------------------------+------------------+------------------+------------------+----------------+
|calling_last_0_to_30_days|total_calls_type_1|total_calls_type_2|calling_type_ratio|type_call_center|
+-------------------------+------------------+------------------+------------------+----------------+
|     +mhy8a/4ykcBvuxB6...|                 0|                85|            0.0000|        call_out|
|     014Ngw40khJfBh401...|                 0|                88|            0.0000|        call_out|
|     8oveueaXFOpSFhbKW...|                96|                 0|            1.0000|         call_in|
|     FtiZDwQpxXDRc2czK...|                 0|                69|            0.0000|        call_out|
|     RauDXtvwrfQBvuxB6...|                46|                 0|            1.0000|         call_in|
|     bDYozBkZ8vhfBh401...|                57|                 0|            1.0000|         call_in|
|     crrhqzYjBN3JRXAy8...|                 0|               139|            0.000

                                                                                

# Call center: Duration type detection
Duration types:

- very_short: 0 - 15s
- short: 15 - 30s
- medium: 30s - 2min
- long: 2 - 20min
- very_long: > 20 min

In [272]:
def typical_duration_type_from_last_x_to_y_days(df, x, y, percent):
    df_duration = top_total_calls_from_x_to_y_days(df, x, y, percent)
    df_duration2 = df.join(df_duration, df_duration[f'calling_last_{x}_to_{y}_days'] == df['calling'], 'inner').\
                   drop(f'calling')
    df_duration2 = df_duration2.withColumn('duration_type',
                                            when((col('duration') <= 15), 'very_short').\
                                            when((col('duration') <= 30) & (col('duration') >15), 'short').\
                                            when((col('duration') <= 120) & (col('duration') >30), 'medium').\
                                            when((col('duration') <= 1200) & (col('duration') >120), 'long').\
                                            when((col('duration') > 1200), 'very_long').\
                                            otherwise('unidentified')
                                           )
    df_duration3 = df_duration2.groupBy(f'calling_last_{x}_to_{y}_days', 'duration_type').count()
    df_duration4 = df_duration3.withColumn("rank", f.row_number().
                                           over(Window.partitionBy(f'calling_last_{x}_to_{y}_days').
                                           orderBy(col('count').desc()))
                                          )
    df_duration4 = df_duration4.filter(col('rank') ==1).\
                   withColumnRenamed(f'calling_last_{x}_to_{y}_days', 'calling_1').\
                   withColumnRenamed('duration_type', 'typical_duration_type').\
                   withColumnRenamed('count', 'number_of_calls_typical_duration_type').\
                   drop('rank')
    df_duration5 = df_duration2.groupBy(f'calling_last_{x}_to_{y}_days').pivot('duration_type').count()
    df_duration6 = df_duration5.join(df_duration4, 
                                     df_duration4['calling_1'] == df_duration5[f'calling_last_{x}_to_{y}_days'],
                                    'inner').\
                                drop('calling_1')
    df_duration6 = rename_col(df_duration6,[
                                            ['long', 'duration_long_total_calls'],
                                            ['medium', 'duration_medium_total_calls'],
                                            ['short', 'duration_short_total_calls'],
                                            ['very_short', 'duration_very_short_total_calls'],
                                            ['very_long', 'duration_very_long_total_calls'], 
                                           ]
                             )
    return df_duration6

Example

In [273]:
typical_duration_type_from_last_x_to_y_days(df, 0, 30, 2).show()

[Stage 1956:>                                                       (0 + 1) / 1]

+-------------------------+-------------------------+---------------------------+--------------------------+------------------------------+-------------------------------+---------------------+-------------------------------------+
|calling_last_0_to_30_days|duration_long_total_calls|duration_medium_total_calls|duration_short_total_calls|duration_very_long_total_calls|duration_very_short_total_calls|typical_duration_type|number_of_calls_typical_duration_type|
+-------------------------+-------------------------+---------------------------+--------------------------+------------------------------+-------------------------------+---------------------+-------------------------------------+
|     +mhy8a/4ykcBvuxB6...|                        9|                         24|                        23|                             2|                             27|           very_short|                                   27|
|     014Ngw40khJfBh401...|                       16|                   

                                                                                

# Call center: Diversity of Location

1) Village

In [239]:
def village_diversity_from_last_x_to_y_days(df, x, y, percent):
    df_location = top_total_calls_from_x_to_y_days(df, x, y, percent)
    df_location2 = df.join(df_location, df_location[f'calling_last_{x}_to_{y}_days'] == df['calling'], 'inner').\
                   drop(f'calling')
    df_location2 = fitler_from_x_to_y_days(df_location2, x, y)
    
    df_village = df_location2.filter((col('session_of_week') == 'weekday') & (col('session_of_day') == 'working_hours'))
    df_village2 = df_village.groupBy(f'calling_last_{x}_to_{y}_days', 'called_date').\
                             agg(countDistinct('village_name'))
    df_village3 = df_village2.groupBy(f'calling_last_{x}_to_{y}_days', 'count(village_name)').\
                              agg(countDistinct('called_date'))
    df_village3 = df_village3.withColumn("rank", f.row_number().
                                         over(Window.partitionBy(f'calling_last_{x}_to_{y}_days').
                                         orderBy(col('count(called_date)').desc()))
                                        )
    df_village3 = df_village3.filter(col('rank') == 1).\
                              withColumnRenamed('count(village_name)', 'village_typical_daily_number_of_changes').\
                              drop('count(called_date)', 'rank')
    return df_village3


In [240]:
village_diversity_from_last_x_to_y_days(df, 0, 30, 1).show()

[Stage 1589:>                                                       (0 + 1) / 1]

+-------------------------+---------------------------------------+
|calling_last_0_to_30_days|village_typical_daily_number_of_changes|
+-------------------------+---------------------------------------+
|     014Ngw40khJfBh401...|                                      1|
|     8oveueaXFOpSFhbKW...|                                      1|
|     FtiZDwQpxXDRc2czK...|                                      2|
|     crrhqzYjBN3JRXAy8...|                                      1|
|     BjB4bkzbjt9SFhbKW...|                                      1|
|     FPx0ykgCR60BvuxB6...|                                      1|
|     LmaANqnwPRwKUukPU...|                                      2|
|     NCWqAzbKnP/JRXAy8...|                                      1|
|     WpeNSWmthecJqWorj...|                                      1|
|     fDhLSXLPuzjJRXAy8...|                                      1|
|     igYsodAe0aTRc2czK...|                                      1|
|     lCZgOSa4H3uE6rUJ9...|                     

                                                                                

2) District

In [241]:
def district_diversity_from_last_x_to_y_days(df, x, y, percent):
    df_location = top_total_calls_from_x_to_y_days(df, x, y, percent)
    df_location2 = df.join(df_location, df_location[f'calling_last_{x}_to_{y}_days'] == df['calling'], 'inner').\
                   drop(f'calling')
    df_location2 = fitler_from_x_to_y_days(df_location2, x, y)
    
    df_district = df_location2.filter((col('session_of_week') == 'weekday') & (col('session_of_day') == 'working_hours'))
    df_district2 = df_district.groupBy(f'calling_last_{x}_to_{y}_days', 'called_date').\
                             agg(countDistinct('district_name'))
    df_district3 = df_district2.groupBy(f'calling_last_{x}_to_{y}_days', 'count(district_name)').\
                              agg(countDistinct('called_date'))
    df_district3 = df_district3.withColumn("rank", f.row_number().
                                         over(Window.partitionBy(f'calling_last_{x}_to_{y}_days').
                                         orderBy(col('count(called_date)').desc()))
                                        )
    df_district3 = df_district3.filter(col('rank') == 1).\
                              withColumnRenamed('count(district_name)', 'district_typical_daily_number_of_changes').\
                              drop('count(called_date)', 'rank')
    return df_district3

In [242]:
district_diversity_from_last_x_to_y_days(df, 0, 30, 1).show()

[Stage 1606:>                                                       (0 + 1) / 1]

+-------------------------+----------------------------------------+
|calling_last_0_to_30_days|district_typical_daily_number_of_changes|
+-------------------------+----------------------------------------+
|     014Ngw40khJfBh401...|                                       1|
|     8oveueaXFOpSFhbKW...|                                       1|
|     FtiZDwQpxXDRc2czK...|                                       1|
|     crrhqzYjBN3JRXAy8...|                                       1|
|     BjB4bkzbjt9SFhbKW...|                                       1|
|     FPx0ykgCR60BvuxB6...|                                       2|
|     LmaANqnwPRwKUukPU...|                                       2|
|     NCWqAzbKnP/JRXAy8...|                                       1|
|     WpeNSWmthecJqWorj...|                                       1|
|     fDhLSXLPuzjJRXAy8...|                                       1|
|     igYsodAe0aTRc2czK...|                                       1|
|     lCZgOSa4H3uE6rUJ9...|       

                                                                                

3) Province

In [248]:
def province_diversity_from_last_x_to_y_days(df, x, y, percent):
    df_location = top_total_calls_from_x_to_y_days(df, x, y, percent)
    df_location2 = df.join(df_location, df_location[f'calling_last_{x}_to_{y}_days'] == df['calling'], 'inner').\
                   drop(f'calling')
    df_location2 = fitler_from_x_to_y_days(df_location2, x, y)
    
    df_province = df_location2.filter((col('session_of_week') == 'weekday') & (col('session_of_day') == 'working_hours'))
    df_province2 = df_province.groupBy(f'calling_last_{x}_to_{y}_days', 'called_date').\
                             agg(countDistinct('province_name'))
    df_province3 = df_province2.groupBy(f'calling_last_{x}_to_{y}_days', 'count(province_name)').\
                              agg(countDistinct('called_date'))
    df_province3 = df_province3.withColumn("rank", f.row_number().
                                         over(Window.partitionBy(f'calling_last_{x}_to_{y}_days').
                                         orderBy(col('count(called_date)').desc()))
                                        )
    df_province3 = df_province3.filter(col('rank') == 1).\
                              withColumnRenamed('count(province_name)', 'province_typical_daily_number_of_changes').\
                              drop('count(called_date)', 'rank')
    return df_province3

In [246]:
province_diversity_from_last_x_to_y_days(df, 0, 30, 1).show()

Exception in thread "serve-DataFrame" java.net.SocketTimeoutException: Accept timed out
	at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
	at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
	at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
	at org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:64)
[Stage 1634:>                                                       (0 + 1) / 1]

+-------------------------+----------------------------------------+
|calling_last_0_to_30_days|province_typical_daily_number_of_changes|
+-------------------------+----------------------------------------+
|     014Ngw40khJfBh401...|                                       1|
|     8oveueaXFOpSFhbKW...|                                       1|
|     FtiZDwQpxXDRc2czK...|                                       1|
|     crrhqzYjBN3JRXAy8...|                                       1|
|     BjB4bkzbjt9SFhbKW...|                                       1|
|     FPx0ykgCR60BvuxB6...|                                       1|
|     LmaANqnwPRwKUukPU...|                                       1|
|     NCWqAzbKnP/JRXAy8...|                                       1|
|     WpeNSWmthecJqWorj...|                                       1|
|     fDhLSXLPuzjJRXAy8...|                                       1|
|     igYsodAe0aTRc2czK...|                                       1|
|     lCZgOSa4H3uE6rUJ9...|       

                                                                                

In [247]:
village_diversity_from_last_x_to_y_days(df, 0, 30, 1).toPandas().to_excel('location_village.xlsx')
district_diversity_from_last_x_to_y_days(df, 0, 30, 1).toPandas().to_excel('location_district.xlsx')
province_diversity_from_last_x_to_y_days(df, 0, 30, 1).toPandas().to_excel('location_province.xlsx')

                                                                                

4) Combine all location to get the feature of location diversity

In [251]:
def location_diversity_from_last_x_to_y_days(df, x, y, percent):
    village = village_diversity_from_last_x_to_y_days(df, x, y, percent)
    district = district_diversity_from_last_x_to_y_days(df, x, y, percent).\
               withColumnRenamed(f'calling_last_{x}_to_{y}_days', 'calling_district')
    province = province_diversity_from_last_x_to_y_days(df, x, y, percent).\
               withColumnRenamed(f'calling_last_{x}_to_{y}_days', 'calling_province')
    location = village.join(district, district['calling_district'] == village[f'calling_last_{x}_to_{y}_days'], 
                            'inner').\
                            drop('calling_district')
    location2 = location.join(province, province['calling_province'] == location[f'calling_last_{x}_to_{y}_days'], 
                            'inner').\
                            drop('calling_province')
    location2 = location2.withColumn('location_type_change',
                                     when((col('province_typical_daily_number_of_changes') >=2), 
                                          'province'
                                         ).\
                                     when((col('province_typical_daily_number_of_changes') < 2) &
                                          (col('district_typical_daily_number_of_changes') >= 2), 
                                          'district'
                                         ).\
                                     when((col('province_typical_daily_number_of_changes') < 2) &
                                          (col('district_typical_daily_number_of_changes') < 2) &
                                          (col('village_typical_daily_number_of_changes') >= 3),
                                          'village'
                                         ).\
                                     when((col('village_typical_daily_number_of_changes') <2), 
                                          'no_change'
                                         ).\
                                     otherwise('unidentified')
                                    )
    return location2

In [253]:
location_diversity_from_last_x_to_y_days(df, x, y, percent).show()

[Stage 1763:>                                                       (0 + 1) / 1]

+-------------------------+---------------------------------------+----------------------------------------+----------------------------------------+--------------------+
|calling_last_0_to_30_days|village_typical_daily_number_of_changes|district_typical_daily_number_of_changes|province_typical_daily_number_of_changes|location_type_change|
+-------------------------+---------------------------------------+----------------------------------------+----------------------------------------+--------------------+
|     014Ngw40khJfBh401...|                                      1|                                       1|                                       1|           no_change|
|     8oveueaXFOpSFhbKW...|                                      1|                                       1|                                       1|           no_change|
|     FtiZDwQpxXDRc2czK...|                                      2|                                       1|                                     

                                                                                

# Call Center: Business Detection
(by the combination of the 1st and the 3rd methods of Call Center detection)

1) Combine all features

In [286]:
def call_center_final_from_last_x_to_y_days(df, x, y, percent, method):
    # Only obtain the call log dataframe of the specified interval of time
    df_test = fitler_from_x_to_y_days(df, x, y)
    
    # Recall all the feature dataframes
    call_center_detection = is_call_center(df, x, y, percent, method).\
                            withColumnRenamed('is_call_center', f'is_call_center_last_{x}_to_{y}_days')
    
    contact_diversity = diversity_of_call_center_from_last_x_to_y_days(df, x, y, percent).\
                        withColumnRenamed('diversity_ratio', 'contact_diversity_ratio').\
                        withColumnRenamed('call_center_diversity', 'contact_diversity').\
                        withColumnRenamed(f'calling_last_{x}_to_{y}_days', 'calling_contact_diversity').\
                        drop(f'total_calls_last_{x}_to_{y}_days', f'total_contacts_last_{x}_to_{y}_days')
    
    type_of_call_center = type_of_call_center_from_last_x_to_y_days(df, x, y, percent).\
                          withColumnRenamed(f'calling_last_{x}_to_{y}_days', 'calling_type_of_call_center')
    
    duration_type_of_call_center = typical_duration_type_from_last_x_to_y_days(df, x, y, percent).\
                                   withColumnRenamed(f'calling_last_{x}_to_{y}_days', 'calling_duration_type')
    
    location_diversity = location_diversity_from_last_x_to_y_days(df, x, y, percent).\
                         withColumnRenamed(f'calling_last_{x}_to_{y}_days', 'calling_location_diversity')
    
    #Join all dataframes above
    df1 = call_center_detection.join(contact_diversity, 
                                     contact_diversity['calling_contact_diversity'] == 
                                     call_center_detection[f'calling_last_{x}_to_{y}_days'],
                                     'inner').\
                                drop('calling_contact_diversity')
    
    df2 = df1.join(type_of_call_center, 
                   type_of_call_center['calling_type_of_call_center'] == 
                   df1[f'calling_last_{x}_to_{y}_days'],
                   'inner').\
                   drop('calling_type_of_call_center')
    
    df3 = df2.join(duration_type_of_call_center, 
                   duration_type_of_call_center['calling_duration_type'] == 
                   df2[f'calling_last_{x}_to_{y}_days'],
                   'inner').\
                   drop('calling_duration_type')
    df4 = df3.join(location_diversity, 
                   location_diversity['calling_location_diversity'] == 
                   df3[f'calling_last_{x}_to_{y}_days'],
                   'inner').\
                   drop('calling_location_diversity')
    df5 = df_test.join(df4,
                       df4[f'calling_last_{x}_to_{y}_days'] ==
                       df_test['calling'],
                       'inner').\
                       drop(f'calling_last_{x}_to_{y}_days')
    return df5

In [283]:
is_call_center(df, x, y, percent, "1_3").printSchema()



root
 |-- calling_last_0_to_30_days: string (nullable = true)
 |-- total_calls_last_0_to_30_days: long (nullable = false)
 |-- total_contacts_last_0_to_30_days: long (nullable = false)
 |-- is_call_center: string (nullable = false)



                                                                                

In [284]:
is_call_center(df, x, y, percent, "2").printSchema()



root
 |-- calling_last_0_to_30_days: string (nullable = true)
 |-- total_calls_last_0_to_30_days: long (nullable = false)
 |-- total_contacts_last_0_to_30_days: long (nullable = false)
 |-- total_calls_last_0_to_7_days: long (nullable = false)
 |-- total_contacts_last_0_to_7_days: long (nullable = false)
 |-- total_calls_last_7_to_14_days: long (nullable = false)
 |-- total_contacts_last_7_to_14_days: long (nullable = false)
 |-- total_calls_last_14_to_21_days: long (nullable = false)
 |-- total_contacts_last_14_to_21_days: long (nullable = false)
 |-- total_calls_last_21_to_28_days: long (nullable = false)
 |-- total_contacts_last_21_to_28_days: long (nullable = false)
 |-- is_call_center: string (nullable = false)





In [258]:
diversity_of_call_center_from_last_x_to_y_days(df, x, y, percent).printSchema()



root
 |-- calling_last_0_to_30_days: string (nullable = true)
 |-- total_calls_last_0_to_30_days: long (nullable = false)
 |-- total_contacts_last_0_to_30_days: long (nullable = false)
 |-- diversity_ratio: string (nullable = true)
 |-- call_center_diversity: string (nullable = false)



                                                                                

In [260]:
type_of_call_center_from_last_x_to_y_days(df, x, y, percent).printSchema()



root
 |-- calling_last_0_to_30_days: string (nullable = true)
 |-- total_calls_type_1: long (nullable = true)
 |-- total_calls_type_2: long (nullable = true)
 |-- calling_type_ratio: string (nullable = true)
 |-- type_call_center: string (nullable = false)



                                                                                

In [274]:
typical_duration_type_from_last_x_to_y_days(df, x, y, percent).printSchema()



root
 |-- calling_last_0_to_30_days: string (nullable = true)
 |-- duration_long_total_calls: long (nullable = true)
 |-- duration_medium_total_calls: long (nullable = true)
 |-- duration_short_total_calls: long (nullable = true)
 |-- duration_very_long_total_calls: long (nullable = true)
 |-- duration_very_short_total_calls: long (nullable = true)
 |-- typical_duration_type: string (nullable = false)
 |-- number_of_calls_typical_duration_type: long (nullable = false)





In [262]:
location_diversity_from_last_x_to_y_days(df, x, y, percent).printSchema()



root
 |-- calling_last_0_to_30_days: string (nullable = true)
 |-- village_typical_daily_number_of_changes: long (nullable = false)
 |-- district_typical_daily_number_of_changes: long (nullable = false)
 |-- province_typical_daily_number_of_changes: long (nullable = false)
 |-- location_type_change: string (nullable = false)





In [276]:
call_center_final_from_last_x_to_y_days(df, x, y, percent, "1_3").show()

[Stage 2072:>                                                       (0 + 1) / 1]

+--------------------+--------------------+--------------------+--------+------------------+-------------+--------------+-----------+-------------+------------+-----------+----+------------+----+-----------+---------------+--------------+----------+-----------------------------+--------------------------------+--------------------------------+-----------------------+-----------------+------------------+------------------+------------------+----------------+-------------------------+---------------------------+--------------------------+------------------------------+-------------------------------+---------------------+-------------------------------------+---------------------------------------+----------------------------------------+----------------------------------------+--------------------+
| phonenumber_encrypt|             calling|              called|duration|      village_name|district_name|   called_time|called_date|province_name|calling_type|called_hour|hour|week_of_year|y

                                                                                

In [287]:
call_center_final_from_last_x_to_y_days(df, x, y, percent, "2").show()



+--------------------+--------------------+--------------------+--------+------------+-------------+--------------+-----------+-------------+------------+-----------+----+------------+----+-----------+---------------+--------------+----------+-----------------------------+--------------------------------+----------------------------+-------------------------------+-----------------------------+--------------------------------+------------------------------+---------------------------------+------------------------------+---------------------------------+--------------------------------+-----------------------+-----------------+------------------+------------------+------------------+----------------+-------------------------+---------------------------+--------------------------+------------------------------+-------------------------------+---------------------+-------------------------------------+---------------------------------------+----------------------------------------+------

                                                                                

In [288]:
call_center_final_from_last_x_to_y_days(df, x, y, percent, "1_3").printSchema()



root
 |-- phonenumber_encrypt: string (nullable = true)
 |-- calling: string (nullable = true)
 |-- called: string (nullable = true)
 |-- duration: float (nullable = true)
 |-- village_name: string (nullable = true)
 |-- district_name: string (nullable = true)
 |-- called_time: string (nullable = true)
 |-- called_date: date (nullable = true)
 |-- province_name: string (nullable = true)
 |-- calling_type: string (nullable = true)
 |-- called_hour: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- week_of_year: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- session_of_week: string (nullable = false)
 |-- session_of_day: string (nullable = false)
 |-- last_date: string (nullable = false)
 |-- total_calls_last_0_to_30_days: long (nullable = false)
 |-- total_contacts_last_0_to_30_days: long (nullable = false)
 |-- is_call_center_last_0_to_30_days: string (nullable = false)
 |-- contact_diversity_ratio: string

                                                                                

In [289]:
call_center_final_from_last_x_to_y_days(df, x, y, percent, "2").printSchema()



root
 |-- phonenumber_encrypt: string (nullable = true)
 |-- calling: string (nullable = true)
 |-- called: string (nullable = true)
 |-- duration: float (nullable = true)
 |-- village_name: string (nullable = true)
 |-- district_name: string (nullable = true)
 |-- called_time: string (nullable = true)
 |-- called_date: date (nullable = true)
 |-- province_name: string (nullable = true)
 |-- calling_type: string (nullable = true)
 |-- called_hour: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- week_of_year: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- session_of_week: string (nullable = false)
 |-- session_of_day: string (nullable = false)
 |-- last_date: string (nullable = false)
 |-- total_calls_last_0_to_30_days: long (nullable = false)
 |-- total_contacts_last_0_to_30_days: long (nullable = false)
 |-- total_calls_last_0_to_7_days: long (nullable = false)
 |-- total_contacts_last_0_to_7_days: long

                                                                                

2) Business detection

Key features:
- 'is_call_center' -- 'Yes'
- 'contact_diversity' -- 'high' or 'low'
- 'type_call_center'  -- 'call_in' or 'call_out' or 'both'
- 'typical_duration_type' -- 'very_short' or 'short' or 'medium' or 'long' or 'very_long'
- 'location_type_change'  -- 'village' or 'district' or 'province' or 'no_change' or 'unidentified'

In [311]:
def business_call_center_from_last_x_to_y_days(df, x, y, percent, method):
    business = call_center_final_from_last_x_to_y_days(df, x, y, percent, method)
    """
    'is_call_center' -- 'Yes'
    'contact_diversity' -- 'high' or 'low'
    'type_call_center'  -- 'call_in' or 'call_out' or 'both'
    'typical_duration_type' -- 'very_short' or 'short' or 'medium' or 'long' or 'very_long'
    'location_type_change'  -- 'village' or 'district' or 'province' or 'no_change' or 'unidentified'
    """
    business = business.withColumn('business',
                                   when((col('type_call_center') == 'call_out') &
                                        (col('typical_duration_type') == 'very_short') &
                                        (col('contact_diversity') == 'high'), 
                                        'telesaler'
                                         ).\
                                   when((col('type_call_center') == 'call_out') &
                                        (col('typical_duration_type') == 'very_short') &
                                        (col('contact_diversity') == 'low'), 
                                        'friends/relative'
                                         ).\
                                   when((col('type_call_center') == 'call_out') &
                                        (
                                         (col('typical_duration_type') == 'short') | 
                                         (col('typical_duration_type') == 'medium')
                                        ) &
                                        (col('location_type_change') == 'village'), 
                                        'shipper'
                                         ).\
                                   when((col('type_call_center') == 'call_out') &
                                        (
                                         (col('typical_duration_type') == 'short') | 
                                         (col('typical_duration_type') == 'medium')
                                        ) &
                                        (col('location_type_change') == 'district'), 
                                        'shipper/driver'
                                         ).\
                                   when((col('type_call_center') == 'call_out') &
                                        (col('location_type_change') == 'province'), 
                                        'driver/transporter/entrepreneur'
                                         ).\
                                   when((col('type_call_center') == 'call_out') &
                                        (col('typical_duration_type') == 'long'), 
                                        'talent_acquisition'
                                         ).\
                                   when(#col('type_call_center') == all posible values
                                        (col('typical_duration_type') == 'very_long') &
                                        (col('contact_diversity') == 'low'), 
                                        'friend/relative'
                                         ).\
                                   when( #col('type_call_center') == all posible values
                                        (col('typical_duration_type') == 'very_long') &
                                        (col('contact_diversity') == 'high'), 
                                        'entrepreneur'
                                         ).\
                                   when((col('type_call_center') == 'call_in') &
                                        ((col('typical_duration_type') == 'very_short') |
                                         (col('typical_duration_type') == 'short')
                                        ), 
                                        'service/necessity_provider'
                                         ).\
                                   when((col('type_call_center') == 'call_in') &
                                        (
                                         (col('typical_duration_type') == 'medium') |
                                         (col('typical_duration_type') == 'long')   
                                        ) &
                                        (
                                         (col('location_type_change') == 'village') |
                                         (col('location_type_change') == 'no_change')
                                        ), 
                                        'consultant/customer_care'
                                         ).\
                                   when((col('type_call_center') == 'call_in') &
                                        (
                                         (col('typical_duration_type') == 'medium') |
                                         (col('typical_duration_type') == 'long')   
                                        ) &
                                        (
                                         (col('location_type_change') == 'district') |
                                         (col('location_type_change') == 'province')
                                        ), 
                                        'homestay_owner/consultant/customer_care'
                                         ).\
                                   otherwise('unidentified')
                                  )
    return business
    
                                   

In [312]:
business1 = business_call_center_from_last_x_to_y_days(df, x, y, percent, "1_3")
business1.printSchema()
#business1.toPandas().to_excel('call_center_1st_method.xlsx')

                                                                                

root
 |-- phonenumber_encrypt: string (nullable = true)
 |-- calling: string (nullable = true)
 |-- called: string (nullable = true)
 |-- duration: float (nullable = true)
 |-- village_name: string (nullable = true)
 |-- district_name: string (nullable = true)
 |-- called_time: string (nullable = true)
 |-- called_date: date (nullable = true)
 |-- province_name: string (nullable = true)
 |-- calling_type: string (nullable = true)
 |-- called_hour: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- week_of_year: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- session_of_week: string (nullable = false)
 |-- session_of_day: string (nullable = false)
 |-- last_date: string (nullable = false)
 |-- total_calls_last_0_to_30_days: long (nullable = false)
 |-- total_contacts_last_0_to_30_days: long (nullable = false)
 |-- is_call_center_last_0_to_30_days: string (nullable = false)
 |-- contact_diversity_ratio: string

                                                                                

In [313]:
business2 = business_call_center_from_last_x_to_y_days(df, x, y, percent, "2")
business2.printSchema()
#business2.toPandas().to_excel('call_center_2.xlsx')

                                                                                

root
 |-- phonenumber_encrypt: string (nullable = true)
 |-- calling: string (nullable = true)
 |-- called: string (nullable = true)
 |-- duration: float (nullable = true)
 |-- village_name: string (nullable = true)
 |-- district_name: string (nullable = true)
 |-- called_time: string (nullable = true)
 |-- called_date: date (nullable = true)
 |-- province_name: string (nullable = true)
 |-- calling_type: string (nullable = true)
 |-- called_hour: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- week_of_year: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- session_of_week: string (nullable = false)
 |-- session_of_day: string (nullable = false)
 |-- last_date: string (nullable = false)
 |-- total_calls_last_0_to_30_days: long (nullable = false)
 |-- total_contacts_last_0_to_30_days: long (nullable = false)
 |-- total_calls_last_0_to_7_days: long (nullable = false)
 |-- total_contacts_last_0_to_7_days: long

                                                                                