# Telecom Bandwidth Consumption Analysis

## Author: Luciano Andrade

An internet provider offers several access plans with different bandwidths. After upgrading its infrastructure, the company wanted to validate some assumptions about customers' bandwidth consumption. The data contain each customer's average monthly consumption before the upgrade, plus the consumption recorded in the month before the upgrade and in the first and second months after it, along with some customer characteristics. The business team formulated four questions, which we answer below with four hypothesis tests.

Let's check some information about the data.

In [1]:
# Imports
import pandas as pd
import numpy as np
import scipy.stats as stats

In [2]:
# loading the data
df = pd.read_csv(r"TelelecomDataset.csv")

In [3]:
#Return a tuple representing the dimensionality of the DataFrame.
df.shape

(200, 10)

In [4]:
#The head function returns the first 5 rows for the object based on position. 
#It is useful for quickly testing if your object has the right type of data in it.
df.head()

Unnamed: 0,client_id,gender,service_channel,region,marital_status,segment,average_monthly_consumption_before_upgrade,average_consumption_month_before_the_upgrade,average_consumption_first_month_after_upgrade,average_consumption_second_month_after_upgrade
0,70,0,4,1,1,1,57,49.2,52,57.2
1,121,1,4,2,1,3,68,63.6,59,64.9
2,86,0,4,3,1,1,44,64.8,33,36.3
3,141,0,4,3,1,3,63,56.4,44,48.4
4,172,0,4,2,1,2,47,68.4,52,57.2


In [5]:
#The column labels of the DataFrame
df.columns

Index(['client_id', 'gender', 'service_channel', 'region', 'marital_status',
       'segment', 'average_monthly_consumption_before_upgrade',
       'average_consumption_month_before_the_upgrade',
       'average_consumption_first_month_after_upgrade',
       'average_consumption_second_month_after_upgrade'],
      dtype='object')

In [6]:
#This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 10 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   client_id                                       200 non-null    int64  
 1   gender                                          200 non-null    int64  
 2   service_channel                                 200 non-null    int64  
 3   region                                          200 non-null    int64  
 4   marital_status                                  200 non-null    int64  
 5   segment                                         200 non-null    int64  
 6   average_monthly_consumption_before_upgrade      200 non-null    int64  
 7   average_consumption_month_before_the_upgrade    200 non-null    float64
 8   average_consumption_first_month_after_upgrade   200 non-null    int64  
 9   average_consumption_second_month_after_upgrade  200 non-null    float64

In [7]:
#Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, 
#excluding NaN values.
df.describe()

Unnamed: 0,client_id,gender,service_channel,region,marital_status,segment,average_monthly_consumption_before_upgrade,average_consumption_month_before_the_upgrade,average_consumption_first_month_after_upgrade,average_consumption_second_month_after_upgrade
count,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0
mean,100.5,0.545,3.43,2.055,1.16,2.025,52.23,63.174,52.775,58.0525
std,57.879185,0.49922,1.039472,0.724291,0.367526,0.690477,10.252937,11.242137,9.478586,10.426445
min,1.0,0.0,1.0,1.0,1.0,1.0,28.0,39.6,31.0,34.1
25%,50.75,0.0,3.0,2.0,1.0,2.0,44.0,54.0,45.75,50.325
50%,100.5,1.0,4.0,2.0,1.0,2.0,50.0,62.4,54.0,59.4
75%,150.25,1.0,4.0,3.0,1.0,2.25,60.0,70.8,60.0,66.0
max,200.0,1.0,4.0,3.0,2.0,3.0,76.0,90.0,67.0,73.7


## Question 1:

Was the average bandwidth consumption for the month before the upgrade greater than 50?

In this case, we will use a One-Sample t-Test.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html

We calculate the t-test for the mean of one group. This is a test for the null hypothesis that the expected value (mean) of a sample of independent observations is equal to the given population mean, popmean.

Hypotheses:

- H0: The average bandwidth consumption in the month before the upgrade was equal to 50.

- H1: The average bandwidth consumption in the month before the upgrade was greater than 50.

If the p-value is less than 0.05, we reject H0. Otherwise, we fail to reject H0.

A p-value below 0.05 means the result is considered statistically significant at the 5% level. A p-value above 0.05 means the result is not statistically significant, which is not the same as showing that there is no effect.

In [8]:
# Run the t-Test
stats.ttest_1samp(a = df.average_consumption_month_before_the_upgrade, popmean = 50, alternative='greater')

Ttest_1sampResult(statistic=16.57233752433133, pvalue=1.2481859640465792e-39)

As the p-value is far below 0.05, we reject H0 in favor of H1: the sample provides strong evidence that the average bandwidth consumption in the month before the upgrade was greater than 50.
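As a cross-check, the statistic can be reproduced by hand from the summary statistics in the `df.describe()` output. The sketch below assumes the rounded values printed there (mean 63.174, standard deviation 11.242137, n = 200), so the result should agree with SciPy only up to rounding.

```python
import math
from scipy import stats

# Summary statistics copied from the df.describe() output (rounded values)
n = 200
sample_mean = 63.174
sample_std = 11.242137  # pandas reports the sample standard deviation (ddof=1)
popmean = 50

# t = (sample mean - hypothesized mean) / standard error of the mean
standard_error = sample_std / math.sqrt(n)
t_stat = (sample_mean - popmean) / standard_error

# One-sided p-value for H1: mean > popmean, with n - 1 degrees of freedom
p_value = stats.t.sf(t_stat, df=n - 1)

print(t_stat, p_value)
```

The hand-computed statistic should land very close to the `statistic` value returned by `stats.ttest_1samp` above.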

In [9]:
# The sample mean is consistent with this result
df.average_consumption_month_before_the_upgrade.mean()

63.174

## Question 2:

Is bandwidth consumption different before and after the upgrade, considering the first month after the upgrade?

In this case, we will use the Two-Sample t-Test (Paired).

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html

We use this test when we have two related or dependent samples, here the same customers measured before and after the upgrade. It tests the null hypothesis that two related samples have identical average (expected) values.
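One way to see what the paired test does: it is equivalent to a one-sample t-test on the per-customer differences against a mean of zero. The sketch below illustrates this with made-up data; the `before` and `after` arrays are hypothetical and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical before/after measurements for the same 30 customers
before = rng.normal(loc=52, scale=10, size=30)
after = before + rng.normal(loc=1, scale=5, size=30)

# Paired t-test on the two related samples
paired = stats.ttest_rel(before, after)

# One-sample t-test on the differences, against a mean difference of 0
one_sample = stats.ttest_1samp(before - after, popmean=0)

print(paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sample.pvalue)
```

Both calls should print the same statistic and p-value, which is why the pairing (same customer in both columns) matters for this test.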

Hypotheses:

- H0: Average monthly consumption before the upgrade was equal to consumption in the first month after the upgrade (means are the same).
- H1: Average monthly consumption before the upgrade differed from consumption in the first month after the upgrade (means are different).

If the p-value is less than 0.05, we reject H0. Otherwise, we fail to reject H0.

In [10]:
stats.ttest_rel(a = df.average_monthly_consumption_before_upgrade, b = df.average_consumption_first_month_after_upgrade)

Ttest_relResult(statistic=-0.8673065458794775, pvalue=0.3868186820914985)

We fail to reject the null hypothesis because the p-value is greater than 0.05. The data therefore provide no evidence that average consumption in the first month after the upgrade differed from average monthly consumption before the upgrade. This is not proof that the two means are equal.

In [11]:
print(df.average_monthly_consumption_before_upgrade.mean())
print(df.average_consumption_first_month_after_upgrade.mean())

52.23
52.775


## Question 3:

Did the customer's gender influence bandwidth consumption in the first month after the upgrade?

In this case, we will use the Two Independent Samples t Test.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

We calculate the t-test for the means of two independent samples. It tests the null hypothesis that two independent samples have identical average (expected) values.

By default (equal_var = True), the function performs a standard independent two-sample test that assumes equal population variances. With equal_var = False, it performs Welch's t-test, which does not assume equal population variances.

Hypotheses:

- H0: Consumption by male customers was equal to consumption by female customers in the first month after the upgrade (average consumption was the same between genders).

- H1: Consumption by male customers was not equal to consumption by female customers in the first month after the upgrade (average consumption was not equal between genders).

If the p-value is less than 0.05, we reject H0. Otherwise, we fail to reject H0.

In [12]:
df.columns

Index(['client_id', 'gender', 'service_channel', 'region', 'marital_status',
       'segment', 'average_monthly_consumption_before_upgrade',
       'average_consumption_month_before_the_upgrade',
       'average_consumption_first_month_after_upgrade',
       'average_consumption_second_month_after_upgrade'],
      dtype='object')

In [13]:
# Let's separate the samples by gender (this code treats 0 as male and 1 as female)
male_consumption = df.average_consumption_first_month_after_upgrade[df.gender == 0]
female_consumption = df.average_consumption_first_month_after_upgrade[df.gender == 1]

In [14]:
print(male_consumption.head())
print(female_consumption.head())

0    52
2    33
3    44
4    52
5    52
Name: average_consumption_first_month_after_upgrade, dtype: int64
1     59
92    62
93    44
94    44
95    62
Name: average_consumption_first_month_after_upgrade, dtype: int64


In [15]:
print(male_consumption.mean())
print(female_consumption.mean())

50.120879120879124
54.99082568807339


In [16]:
print(male_consumption.var())
print(female_consumption.var())

106.196336996337
66.15732246007475


In [17]:
# Perform Welch's t-test, since the two sample variances differ (about 106 vs. 66)
stats.ttest_ind(a = male_consumption, b = female_consumption, equal_var = False)

Ttest_indResult(statistic=-3.6564080478875276, pvalue=0.00034088493594266187)
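The Welch result can also be reproduced from the group summaries alone with `stats.ttest_ind_from_stats`. In the sketch below the means and variances are copied from the outputs above; the group sizes (91 and 109) are not printed in the notebook and are an assumption inferred from the `gender` mean of 0.545 over 200 rows.

```python
import math
from scipy import stats

# Group means and sample variances copied from the notebook outputs
male_mean, male_var = 50.120879120879124, 106.196336996337
female_mean, female_var = 54.99082568807339, 66.15732246007475

# Assumed group sizes: 0.545 * 200 = 109 rows with gender == 1, so 91 with gender == 0
male_n, female_n = 91, 109

# Welch's t-test from summary statistics (equal_var=False)
result = stats.ttest_ind_from_stats(
    mean1=male_mean, std1=math.sqrt(male_var), nobs1=male_n,
    mean2=female_mean, std2=math.sqrt(female_var), nobs2=female_n,
    equal_var=False,
)

print(result.statistic, result.pvalue)
```

If the assumed group sizes are right, the printed values should match the `Ttest_indResult` above to several decimal places.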

In [18]:
# We can also use a one-way ANOVA (note that it assumes equal variances, unlike Welch's t-test)
stats.f_oneway(male_consumption, female_consumption)

F_onewayResult(statistic=13.94330754080599, pvalue=0.0002462546120354903)

As both p-values are less than 0.05, we reject H0. Thus, we conclude that there was a difference between male and female bandwidth consumption in the first month after the upgrade. In this sample, the group coded 1 (female) had the higher mean, about 55.0 versus 50.1.
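With only two groups, a one-way ANOVA is equivalent to the equal-variance (pooled) t-test: the F statistic equals the square of the pooled t statistic, and the p-values coincide. That is why the F value above (13.94) is not exactly the square of the Welch statistic (about 13.37). The sketch below shows the relationship on made-up data; `group_a` and `group_b` are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two hypothetical independent groups with different sizes and spreads
group_a = rng.normal(50, 10, size=40)
group_b = rng.normal(55, 8, size=50)

# Pooled (equal-variance) t-test and one-way ANOVA on the same two groups
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)
anova = stats.f_oneway(group_a, group_b)

# Welch's t-test, shown for comparison; it generally differs from the other two
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(pooled.statistic ** 2, anova.statistic)
print(pooled.pvalue, anova.pvalue, welch.pvalue)
```

When the group variances look unequal, as they do here, Welch's test is usually the safer of the two choices.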

## Question 4:

Is there any relationship between the region and customer segment?

In this case, we will use the Chi-Square Test.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

We use the chi-square test of independence of variables in a contingency table. The function calculates the chi-square statistic and the p-value for testing the independence hypothesis of the observed frequencies in the contingency table. Expected frequencies are calculated based on marginal sums under the assumption of independence.

Hypotheses:

- H0: There is no relationship between region and segment.
- H1: There is a relationship between region and segment.

If the p-value is less than 0.05, we reject H0. Otherwise, we fail to reject H0.

In [19]:
df.head()

Unnamed: 0,client_id,gender,service_channel,region,marital_status,segment,average_monthly_consumption_before_upgrade,average_consumption_month_before_the_upgrade,average_consumption_first_month_after_upgrade,average_consumption_second_month_after_upgrade
0,70,0,4,1,1,1,57,49.2,52,57.2
1,121,1,4,2,1,3,68,63.6,59,64.9
2,86,0,4,3,1,1,44,64.8,33,36.3
3,141,0,4,3,1,3,63,56.4,44,48.4
4,172,0,4,2,1,2,47,68.4,52,57.2


In [20]:
# Contingency table (margins = True adds the "All" totals row and column)
cont_table = pd.crosstab(df.segment, df.region, margins = True)
cont_table

region,1,2,3,All
segment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,16,20,9,45
2,19,44,42,105
3,12,31,7,50
All,47,95,58,200
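The expected frequencies under independence can be sketched by hand from the interior counts of this table. The example below copies the nine segment-by-region counts and leaves out the "All" row and column, because those are totals rather than categories. Each expected count is (row total × column total) / grand total, and the degrees of freedom are (rows − 1) × (columns − 1).

```python
import numpy as np
from scipy import stats

# Interior counts copied from the crosstab (rows: segment 1-3, columns: region 1-3)
observed = np.array([
    [16, 20, 9],
    [19, 44, 42],
    [12, 31, 7],
])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
n = observed.sum()

# Expected count per cell under independence: row total * column total / n
expected = row_totals @ col_totals / n

# Pearson chi-square statistic and degrees of freedom
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()
dof_by_hand = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cross-check against SciPy on the same 3x3 table
chi2, p, dof, expected_scipy = stats.chi2_contingency(observed)

print(chi2_by_hand, dof_by_hand)
print(chi2, dof, p)
```

For a 3×3 table the degrees of freedom should be 4, so a larger value from `chi2_contingency` is a sign that totals were included in the input.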


In [21]:
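# Chi-Square Test execution
# Drop the "All" row and column first: they are totals, not categories, and
# passing them to chi2_contingency inflates the degrees of freedom (9 instead of 4)
chi2, p, dof, ex = stats.chi2_contingency(observed = cont_table.iloc[:-1, :-1])

In [22]:
# p-value, rounded to 4 decimal places
print(round(p, 4))

0.0023

Based on a p-value less than 0.05, we reject H0: the sample provides evidence of a relationship between the categorical variables region and segment. Note that running the test on the table with the "All" margins included gives the same chi-square statistic but 9 degrees of freedom instead of 4, which yields a misleading p-value of about 0.055.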