<div style="text-align: center; margin: 40px 20px;">
    <div style="margin: 20px 0; border-top: 1px solid #B0E0E6; border-bottom:1px solid #B0E0E6; background-color:  #F0EDEE;">
 <font style="color: #058EE1; font-size:40px; font-weight: 700; text-align:center; text-transform: uppercase; letter-spacing: 1px; line-height:1.2;"> 📈 Statistical Data Analysis</font>
    </div>
 <font style="font-size:26px; font-weight: 700; text-align:center;"> ZNA📱 - Which one is a better plan?</font>
</div>

# Contents ⬇️ <a id='contents'></a>

[1. Contents ⬇️](#contents)   
[2. Introduction 📓](#introduction)  
[3. Project Goal 🎯](#project_goal)  
[4. Data Analysis 📊](#data-analysis)    
- [4.1 Initialization](#initialization)  
- [4.2 Load data](#load-data)  
- [4.3 Prepare the data](#prepare-the-data)    
- [4.4 Study plan conditions](#study-plan-conditions)  
- [4.5 Aggregate data per user](#aggregate-data-per-user)  
- [4.6 Study user behaviour](#study-user-behaviour)  
- [4.7 Study Revenue](#study-Revenue)  
- [4.8 Test statistical hypotheses](#test-statistical-hypotheses)  
- [4.9 General conclusion](#general-conclusion)  

<div style="border-bottom:2px solid #058EE1;"></div>

# Introduction 📓 <a id='introduction'></a> 
[Back to Contents](#contents)

**Megaline** is a telecom operator and it offers its clients two prepaid plans, **Surf** and **Ultimate**.  

**Surf**
- Monthly charge: &dollar;20
- 500 monthly minutes, 50 texts, and 15 GB of data
- After exceeding the package limits:
    - 1 minute: 3 cents
    - 1 text message: 3 cents
    - 1 GB of data: &dollar;10  

**Ultimate**
- Monthly charge: &dollar;70
- 3000 monthly minutes, 1000 text messages, and 30 GB of data
- After exceeding the package limits:
    - 1 minute: 1 cent
    - 1 text message: 1 cent
    - 1 GB of data: &dollar;7  
    
The commercial department wants to know which of the plans brings in more revenue in order to adjust the advertising budget. We have the data on 500 Megaline clients: who the clients are, where they're from, which plan they use, and the number of calls they made and text messages they sent in 2018.  

<div style="border-bottom:2px solid #058EE1;"></div>

# Project Goal 🎯 <a id='project_goal'></a>  
[Back to Contents](#contents)

We have the information related to 500 customers of **Megaline** which includes their identity, web sessions, plan type and the number of calls and text messages they made during the year 2018. We have 5 distinct files of data available:  

`megaline_calls.csv` - contains details about calls made by the customers.  
`megaline_internet.csv` - provides information on web sessions.  
`megaline_messages.csv` - contains data about text messages sent by the customers.  
`megaline_plans.csv` - includes details about the plans used by the customers.  
`megaline_users.csv` - contains information on the customers themselves.  

The ask is to analyze clients' behavior and **determine which prepaid plan brings in more revenue**.

<div style="border-bottom:2px solid #058EE1;"></div>

# Data Analysis 📊 <a id='data-analysis'></a>  
[Back to Contents](#contents)

## Initialization <a id='initialization'></a>  
[Back to Contents](#contents)

To begin with, we need a few libraries for our statistical analysis: `scipy`, `matplotlib`, `pandas`, `math`, `numpy` and `seaborn`. We'll import all of them so that we can use the functions and methods they provide in our analysis:  
1. **NumPy**: It is a numerical computing library that provides support for arrays and matrices and mathematical operations that can be performed on them.  

2. **Pandas**: It is a data manipulation library that provides functions to read, write and manipulate data in various formats.  

3. **SciPy**: It is a library that provides scientific computing functions such as statistical analysis, integration, optimization, and signal processing.  

4. **Matplotlib**: It is a plotting library that is used to visualize data in various formats.  

5. **Math**: It is a library that provides mathematical functions such as trigonometric functions, logarithmic functions, etc.

6. **Seaborn**: It is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

In [1]:
# Loading all the libraries
import math as mt
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats as st
import seaborn as sns

<div style="border-bottom:2px solid #058EE1;"></div>

## Load data <a id='load-data'></a>  
[Back to Contents](#contents)

We have been given 5 distinct files of data in the `C:\Users\HP` directory, in CSV format with a comma as the field separator:  
`megaline_calls.csv` - contains details about calls made by the customers.  
`megaline_internet.csv` - provides information on web sessions.  
`megaline_messages.csv` - contains data about text messages sent by the customers.  
`megaline_plans.csv` - includes details about the plans used by the customers.  
`megaline_users.csv` - contains information on the customers themselves.  
We need to read the data files and load the data into DataFrames using the `read_csv()` function provided by `pandas`.

In [None]:
# Load the data files into different DataFrames
plans = pd.read_csv(r'C:\Users\HP\megaline_plans.csv')
users = pd.read_csv(r'C:\Users\HP\megaline_users.csv')
calls = pd.read_csv(r'C:\Users\HP\megaline_calls.csv')
messages = pd.read_csv(r'C:\Users\HP\megaline_messages.csv')
internet = pd.read_csv(r'C:\Users\HP\megaline_internet.csv')
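
The absolute paths above are tied to one machine. As a minimal sketch of a more portable variant, the helper below builds each file name from a folder path. `load_tables` and `TABLE_NAMES` are hypothetical names introduced here for illustration, and the folder is assumed to contain the five `megaline_*.csv` files.

```python
from pathlib import Path

import pandas as pd

# Hypothetical names; the file naming pattern follows the five files listed above
TABLE_NAMES = ["plans", "users", "calls", "messages", "internet"]


def load_tables(data_dir):
    """Read megaline_<name>.csv from data_dir for every table name."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / f"megaline_{name}.csv") for name in TABLE_NAMES}


# Example usage (assumes the CSV files sit in the current working directory):
# tables = load_tables(".")
# plans, users = tables["plans"], tables["users"]
```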

<div style="border-bottom:2px solid #058EE1;"></div>

## Prepare the data <a id='prepare-the-data'></a>  
[Back to Contents](#contents)

The data for this project is split into several tables. We'll explore each one to get an initial understanding of the data and do necessary corrections to each table if necessary.

<div style="border-bottom:2px solid #058EE1;"></div>

### Plans

`plans` Dataframe includes details about the plans used by the customers.  

Get the general information of the data in the DataFrame - `plans`:

In [None]:
# Print the general/summary information about the plans' DataFrame
plans.info()

The Dataframe - `plans` has a total of **2 rows and 8 columns**.

The **column descriptions** are as follows:
- `plan_name` — calling plan name
- `minutes_included` — monthly minute allowance
- `messages_included` — monthly text allowance
- `mb_per_month_included` — data volume allowance (in megabytes)
- `usd_per_minute` — price per minute after exceeding the package limits (e.g., if the package includes 100 minutes, the 101st minute will be charged)
- `usd_per_message` — price per text after exceeding the package limits
- `usd_per_gb` — price per extra gigabyte of data after exceeding the package limits (1 GB = 1024 megabytes)  

In the Data Description, it is given that there is a column - `usd_monthly_fee` that holds monthly charge in US dollars, but in the information above, we don't have a column with the exact name. Instead, we have `usd_monthly_pay`. It seems there is a slight mismatch in name. We can rename the column - `usd_monthly_pay` to `usd_monthly_fee` in order to be consistent with the data description provided.  

In [None]:
# Rename existing column to new name keeping it in sync with data description
plans = plans.rename(columns={'usd_monthly_pay': 'usd_monthly_fee'})

Let's check the column names again using the `columns` attribute.

In [None]:
# Get the list of columns names
plans.columns

It is also evident from the information we got from `info()` above that **we have no null values in the Dataframe - `plans`**.

Since the Dataframe has only 2 rows, let's print all of them:

In [None]:
# Print data for plans
plans

<div style="border-bottom:2px solid #058EE1;"></div>

#### Enrich data

Since Megaline rounds megabytes to gigabytes but we have data volume allowance in megabytes, we can derive a new column - `gb_per_month_included` using `mb_per_month_included`:

In [None]:
# Calculate data volume allowance from mbs to gbs - 1 GB = 1024 megabytes
plans['gb_per_month_included'] = plans['mb_per_month_included'] / 1024

Let's check the general information of the `plans` Dataframe again:

In [None]:
plans.info()

Let's see the data again:

In [None]:
plans

<div style="border-bottom:2px solid #058EE1;"></div>

### Users

`users` Dataframe contains information on the customers themselves.  

Get the general information of the data in the DataFrame - `users`:

In [None]:
# Print the general/summary information about the users' DataFrame
users.info()

The Dataframe - `users` has a total of **500 rows and 8 columns**.

The **column descriptions** are as follows:  
- `user_id` — unique user identifier
- `first_name` — user's name
- `last_name` — user's last name
- `age` — user's age (years)
- `reg_date` — subscription date (dd, mm, yy)
- `churn_date` — the date the user stopped using the service (if the value is missing, the calling plan was being used when this database was extracted)
- `city` — user's city of residence
- `plan` — calling plan name

It is evident from the information we got from `info()` above that **we have no null values in the Dataframe - `users` except in the `churn_date` column**. But, as per the description, if the value in the `churn_date` column is missing, the calling plan was still being used when this database was extracted.

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for users
users.sample(n=10, random_state=100)

We can see from the output of `info()` and `sample()` that **`reg_date` and `churn_date` hold dates** in the format `YYYY-MM-DD`, but they are stored as strings in the DataFrame. We should convert both to the DateTime data type. We'll fix them in a while.

We can use the `duplicated()` method together with `sum()` to **check if we have any duplicate rows in the DataFrame - `users`**. `duplicated()` returns a boolean Series (True/False) that marks duplicate rows. Since `True` counts as 1 and `False` as 0, applying `sum()` to that Series gives the number of duplicate rows.

In [None]:
# Checking for duplicated user records
users.duplicated().sum()

**We don't have any duplicate rows in the `users` Dataframe**.  
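
As a small illustration of how this check counts duplicates, here is a sketch on made-up data (the `toy` DataFrame is hypothetical and not part of the Megaline files):

```python
import pandas as pd

# Toy data (hypothetical): the last row repeats the first one exactly
toy = pd.DataFrame({
    "user_id": [1000, 1001, 1000],
    "plan": ["surf", "ultimate", "surf"],
})

# duplicated() flags every repeat of an earlier row; the first occurrence stays False
dup_mask = toy.duplicated()

# True counts as 1 and False as 0, so the sum is the number of duplicate rows
n_duplicates = int(dup_mask.sum())
print(dup_mask.tolist(), n_duplicates)
```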

Let's also check for duplicate user IDs alone using `duplicated()` together with `sum()`. Since this time we only care about repeated user IDs, we first select the `user_id` column as a Series and then apply `duplicated()` followed by `sum()` to it.

In [None]:
# Checking for just duplicate user IDs
users['user_id'].duplicated().sum()

**We don't have any duplicate user IDs in the `users` Dataframe**.

Let's verify that the `plan` column contains only the two specified plans - **surf** and **ultimate**:

In [None]:
# Check plan column only contains the two specified plans
users['plan'].unique()

**The `plan` column has correct values and contains only the two specified plans - surf and ultimate**.

Let's verify that the `city` column doesn't have any duplicates because of differences in spellings and cases:

In [None]:
# Check if city column has any duplicate values
sorted(users['city'].unique())

**The `city` column doesn't have any duplicates because of differences in spellings and cases**.

#### Fix Data

Let's convert the types of `reg_date` and `churn_date` columns to DateTime of the format - `YYYY-MM-DD`:

In [None]:
# Convert reg_date to datetime format
users['reg_date'] = pd.to_datetime(users['reg_date'], format='%Y-%m-%d')

In [None]:
# Convert churn_date to datetime format
users['churn_date'] = pd.to_datetime(users['churn_date'], format='%Y-%m-%d')

Let's check the data types of the Dataframe - `users` again using `dtypes` attribute:

In [None]:
users.dtypes
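
Missing values survive this conversion: `pd.to_datetime` turns them into `NaT`, so users without a churn date remain recognisable. A minimal sketch on made-up values (the `toy` DataFrame is hypothetical):

```python
import pandas as pd

# Toy data (hypothetical): one user has churned, the other has no churn date
toy = pd.DataFrame({"churn_date": ["2018-12-18", None]})
toy["churn_date"] = pd.to_datetime(toy["churn_date"], format="%Y-%m-%d")

# The column is now a datetime column and the missing entry is NaT
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["churn_date"])
n_missing = int(toy["churn_date"].isna().sum())
print(is_datetime, n_missing)
```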

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for users
users.sample(n=10, random_state=100)

#### Enrich Data

Let's create a column - `full_name` by concatenating the first name and last name of the user.

In [None]:
users['full_name'] = users['first_name'] + ' ' + users['last_name']

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for users
users.sample(n=10, random_state=100)

<div style="border-bottom:2px solid #058EE1;"></div>

### Calls

`calls` Dataframe contains details about calls made by the customers.

Get the general information of the data in the DataFrame - `calls`:

In [None]:
# Print the general/summary information about the calls' DataFrame
calls.info()

The Dataframe - `calls` has a total of **137735 rows and 4 columns**.

The **column descriptions** are as follows:  
- `id` — unique call identifier
- `call_date` — call date
- `duration` — call duration (in minutes)
- `user_id` — the identifier of the user making the call  

It is evident from the information we got from `info()` above that **we have no null values in the Dataframe - `calls`**.  

Also, we have `call_date` column of String data type. **We need to convert the data type of this column to DateTime.**

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for calls
calls.sample(n=10, random_state=100)

We can use the `duplicated()` method together with `sum()` to **check if we have any duplicate rows in the DataFrame - `calls`**. `duplicated()` returns a boolean Series (True/False) that marks duplicate rows. Since `True` counts as 1 and `False` as 0, applying `sum()` to that Series gives the number of duplicate rows.

In [None]:
# Checking for duplicated call records
calls.duplicated().sum()

**We don't have any duplicate rows in the `calls` Dataframe**.  

Let's also check for duplicate call IDs alone using `duplicated()` together with `sum()`. Since this time we only care about repeated call IDs, we first select the `id` column as a Series and then apply `duplicated()` followed by `sum()` to it.

In [None]:
# Checking for just duplicate call IDs
calls['id'].duplicated().sum()

**We don't have any duplicate call IDs in the `calls` Dataframe**.

#### Fix data

Let's convert the type of `call_date` column to DateTime of the format - `YYYY-MM-DD`:

In [None]:
# Convert call_date to datetime format
calls['call_date'] = pd.to_datetime(calls['call_date'], format='%Y-%m-%d')

Let's check the data types of the Dataframe - `calls` again using `dtypes` attribute:

In [None]:
calls.dtypes

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for calls
calls.sample(n=10, random_state=100)

Now, since we have converted the `call_date` column to DateTime, we can easily verify if all the records are from year 2018 - 

In [None]:
# Check if all the records are from year 2018
calls['call_date'].dt.year.unique()

**All the records are indeed from the year 2018. There are no odd records**.

#### Enrich data

As per the description, for `calls`, each individual call is rounded up: even if the call lasted just one second, it will be counted as one minute. So, let's create a new column - `rounded_up_duration` from the `duration` column by rounding the values up to the next whole minute:

In [None]:
calls['rounded_up_duration'] = np.ceil(calls['duration'])
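
To make the effect of `np.ceil` concrete, here is a short sketch on made-up durations (the values are hypothetical). Note that a call with a recorded duration of 0.0 stays at 0 minutes, while any positive fraction is billed as a full minute:

```python
import numpy as np
import pandas as pd

# Toy durations in minutes (hypothetical values)
durations = pd.Series([0.0, 0.02, 1.0, 8.52])

# np.ceil rounds every value up to the next whole number
rounded = np.ceil(durations)
print(rounded.tolist())
```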

Let's also create a column - `call_month` that has only month and year portion of the `call_date`:

In [None]:
calls['call_month'] = calls['call_date'].dt.strftime('%b-%Y')
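
One thing to keep in mind with `'%b-%Y'` labels is that they are plain strings, so they sort alphabetically rather than chronologically. The sketch below shows this on made-up dates and, as a possible alternative that this notebook does not use, the `dt.to_period('M')` accessor, which keeps calendar order:

```python
import pandas as pd

# Toy dates (hypothetical), deliberately out of calendar order
dates = pd.to_datetime(pd.Series(["2018-04-03", "2018-01-15", "2018-12-27"]))

# String labels, as created in the cell above
labels = dates.dt.strftime("%b-%Y")
sorted_labels = sorted(labels.tolist())  # alphabetical order

# Monthly periods sort by calendar month instead
periods = dates.dt.to_period("M")
sorted_periods = periods.sort_values().astype(str).tolist()

print(sorted_labels)
print(sorted_periods)
```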

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for calls
calls.sample(n=10, random_state=100)

<div style="border-bottom:2px solid #058EE1;"></div>

### Messages

`messages` Dataframe contains data about text messages sent by the customers.

Get the general information of the data in the DataFrame - `messages`:

In [None]:
# Print the general/summary information about the messages' DataFrame
messages.info()

The Dataframe - `messages` has a total of **76051 rows and 3 columns**.

The **column descriptions** are as follows:  
- `id` — unique text message identifier
- `message_date` — text message date
- `user_id` — the identifier of the user sending the text 

It is evident from the information we got from `info()` above that **we have no null values in the Dataframe - `messages`**.  

Also, we have `message_date` column of String data type. **We need to convert the data type of this column to DateTime.**

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for messages
messages.sample(n=10, random_state=100)

We can use the `duplicated()` method together with `sum()` to **check if we have any duplicate rows in the DataFrame - `messages`**. `duplicated()` returns a boolean Series (True/False) that marks duplicate rows. Since `True` counts as 1 and `False` as 0, applying `sum()` to that Series gives the number of duplicate rows.

In [None]:
# Checking for duplicated message records
messages.duplicated().sum()

**We don't have any duplicate rows in the `messages` Dataframe**.  

Let's also check for duplicate text message IDs alone using `duplicated()` together with `sum()`. Since this time we only care about repeated message IDs, we first select the `id` column as a Series and then apply `duplicated()` followed by `sum()` to it.

In [None]:
# Checking for just duplicate text message IDs
messages['id'].duplicated().sum()

**We don't have any duplicate message IDs in the messages Dataframe.**

#### Fix data

Let's convert the type of `message_date` column to DateTime of the format - `YYYY-MM-DD`:

In [None]:
# Convert message_date to datetime format
messages['message_date'] = pd.to_datetime(messages['message_date'], format='%Y-%m-%d')

Let's check the data types of the Dataframe - `messages` again using `dtypes` attribute:

In [None]:
messages.dtypes

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for messages
messages.sample(n=10, random_state=100)

Now, since we have converted the `message_date` column to DateTime, we can easily verify if all the records are from year 2018 - 

In [None]:
# Check if all the records are from year 2018
messages['message_date'].dt.year.unique()

**All the records are indeed from the year 2018. There are no odd records**.

#### Enrich data

Let's also create a column - `message_month` that has only month and year portion of the `message_date`:

In [None]:
messages['message_month'] = messages['message_date'].dt.strftime('%b-%Y')

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for messages
messages.sample(n=10, random_state=100)

<div style="border-bottom:2px solid #058EE1;"></div>

### Internet

`internet` Dataframe provides information on web sessions.

Get the general information of the data in the DataFrame - `internet`:

In [None]:
# Print the general/summary information about the internet DataFrame
internet.info()

The Dataframe - `internet` has a total of **104825 rows and 4 columns**.

The **column descriptions** are as follows:  
- `id` — unique session identifier
- `mb_used` — the volume of data spent during the session (in megabytes)
- `session_date` — web session date
- `user_id` — user identifier  

It is evident from the information we got from `info()` above that **we have no null values in the Dataframe - `internet`**.  

Also, we have `session_date` column of String data type. **We need to convert the data type of this column to DateTime.**

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for the internet traffic
internet.sample(n=10, random_state=100)

We can use the `duplicated()` method together with `sum()` to **check if we have any duplicate rows in the DataFrame - `internet`**. `duplicated()` returns a boolean Series (True/False) that marks duplicate rows. Since `True` counts as 1 and `False` as 0, applying `sum()` to that Series gives the number of duplicate rows.

In [None]:
# Checking for duplicated internet records
internet.duplicated().sum()

**We don't have any duplicate rows in the `internet` Dataframe**.  

Let's also check for duplicate session IDs alone using `duplicated()` together with `sum()`. Since this time we only care about repeated session IDs, we first select the `id` column as a Series and then apply `duplicated()` followed by `sum()` to it.

In [None]:
# Checking for just duplicate internet IDs
internet['id'].duplicated().sum()

**We don't have any duplicate session IDs in the internet Dataframe.**

#### Fix data

Let's convert the type of `session_date` column to DateTime of the format - `YYYY-MM-DD`:

In [None]:
# Convert session_date to datetime format
internet['session_date'] = pd.to_datetime(internet['session_date'], format='%Y-%m-%d')

Let's check the data types of the Dataframe - `internet` again using `dtypes` attribute:

In [None]:
internet.dtypes

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for internet traffic
internet.sample(n=10, random_state=100)

Now, since we have converted the `session_date` column to DateTime, we can easily verify if all the records are from year 2018 - 

In [None]:
# Check if all the records are from year 2018
internet['session_date'].dt.year.unique()

**All the records are indeed from the year 2018. There are no odd records**.

#### Enrich data

Since Megaline rounds megabytes to gigabytes but we have the volume of data spent during the session in megabytes, we can derive a new column - `gb_used` using `mb_used`:

In [None]:
# Calculate the volume of data spent during the session from mbs to gbs - 1 GB = 1024 megabytes
internet['gb_used'] = internet['mb_used'] / 1024

Let's also create a column - `session_month` that has only month and year portion of the `session_date`:

In [None]:
internet['session_month'] = internet['session_date'].dt.strftime('%b-%Y')

Now, let's see the list of columns again in the `internet` Dataframe:

In [None]:
# Get list of columns
internet.columns

Let's get a sample of 10 rows from the Dataframe:

In [None]:
# Print a sample of data for internet traffic
internet.sample(n=10, random_state=100)

<div style="border-bottom:2px solid #058EE1;"></div>

## Study plan conditions <a id='study-plan-conditions'></a>  
[Back to Contents](#contents)

It is critical to understand how the plans work, how users are charged based on their plan subscription. So, let's print out the plan information to view their conditions once again.

In [None]:
# Print out the plan conditions
plans
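
To make the charging rules concrete, here is a minimal sketch of how a single monthly bill could be computed from them. The function name `monthly_charge` and the plain dictionary `surf` are hypothetical and used only for illustration. The numbers are the Surf conditions from the introduction, and the usage figures are assumed to be rounded up already. For 510 minutes, 60 messages and 17 GB, the bill is 20 + 10 × 0.03 + 10 × 0.03 + 2 × 10 = 40.60 dollars.

```python
def monthly_charge(minutes, messages, gb, plan):
    """Return the monthly bill in USD for usage figures that are already rounded up."""
    # Only usage beyond the included allowance is charged extra
    extra_minutes = max(0, minutes - plan["minutes_included"])
    extra_messages = max(0, messages - plan["messages_included"])
    extra_gb = max(0, gb - plan["gb_per_month_included"])
    return (
        plan["usd_monthly_fee"]
        + extra_minutes * plan["usd_per_minute"]
        + extra_messages * plan["usd_per_message"]
        + extra_gb * plan["usd_per_gb"]
    )


# Surf conditions as stated in the introduction (hypothetical dict, not the plans DataFrame)
surf = {
    "usd_monthly_fee": 20,
    "minutes_included": 500,
    "messages_included": 50,
    "gb_per_month_included": 15,
    "usd_per_minute": 0.03,
    "usd_per_message": 0.03,
    "usd_per_gb": 10,
}

bill = monthly_charge(minutes=510, messages=60, gb=17, plan=surf)
print(round(bill, 2))
```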

<div style="border-bottom:2px solid #058EE1;"></div>

## Aggregate data per user <a id='aggregate-data-per-user'></a>  
[Back to Contents](#contents)

### Number of calls made by each user per month

In order to calculate the number of calls made by each user per month, we need data from both `users` and `calls` Dataframes. Let's create a new Dataframe - `users_calls` by merging the two of them.

In [None]:
# Merge the users and calls Dataframes and print first 10 records
users_calls = users.merge(calls, on='user_id')
users_calls.head(10)

Great! Let's create a pivot table with `user_id`, `full_name` and `call_month` as indices and apply `count()` on unique call identifiers - `id`, to get the number of calls made by each user per month:

In [None]:
# Calculate the number of calls made by each user per month. Save the result.

# Create a pivot table on user_id, full_name and call_month and count no. of unique call identifier - id
calls_per_user = users_calls.pivot_table(index=['user_id', 'full_name', 'call_month'], aggfunc={'id': 'count'})

# Give names to the columns of the pivot table
calls_per_user.columns = ['number_of_calls']
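
For readers more used to `groupby`, the same kind of count can be sketched as below. This is an illustration on made-up rows (the `toy` DataFrame is hypothetical), not a replacement for the cell above. Note that `count` counts non-null `id` values per group.

```python
import pandas as pd

# Toy call records (hypothetical)
toy = pd.DataFrame({
    "user_id": [1000, 1000, 1000, 1001],
    "call_month": ["Dec-2018", "Dec-2018", "Nov-2018", "Dec-2018"],
    "id": ["1000_1", "1000_2", "1000_3", "1001_1"],
})

# Pivot-table version, mirroring the approach used in this notebook
by_pivot = toy.pivot_table(index=["user_id", "call_month"], aggfunc={"id": "count"})
by_pivot.columns = ["number_of_calls"]

# Equivalent groupby version
by_groupby = (
    toy.groupby(["user_id", "call_month"])["id"]
    .count()
    .to_frame("number_of_calls")
)

print(by_pivot)
print(by_groupby)
```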

Awesome! So, now we have the required data. Let's print the first 20 and last 20 records to view the results:

In [None]:
# Print first 20 records
calls_per_user.head(20)

In [None]:
# Print last 20 records
calls_per_user.tail(20)

### Amount of minutes spent by each user per month

We already have `users_calls` Dataframe that we created by merging `users` and `calls` Dataframes. We will use the same Dataframe here. Let's print first 10 rows to peek into the data:

In [None]:
users_calls.head(10)

Great! Let's create a pivot table with `user_id`, `full_name` and `call_month` as indices and apply `sum()` to two columns: `duration`, to get the actual minutes each user spent per month, and `rounded_up_duration`, to get the billable (rounded-up) minutes per month:

In [None]:
# Calculate the amount of minutes spent by each user per month. Save the result.

# Create a pivot table on user_id, full_name and call_month and sum duration and rounded_up_duration
minutes_per_user = users_calls.pivot_table(index=['user_id', 'full_name', 'call_month'], aggfunc={'duration': 'sum', 'rounded_up_duration': 'sum'})

# Give names to the columns of the pivot table
minutes_per_user.columns = ['amount_of_actual_mins', 'amount_of_rounded_up_mins']

Awesome! So, now we have the required data. Let's print the first 20 and last 20 records to view the results:

In [None]:
minutes_per_user.head(20)

In [None]:
minutes_per_user.tail(20)

### Number of messages sent by each user per month

In order to calculate the number of messages sent by each user per month, we need data from both `users` and `messages` Dataframes. Let's create a new Dataframe - `users_messages` by merging the two of them.

In [None]:
# Merge the users and messages Dataframes and print first 10 records
users_messages = users.merge(messages, on='user_id')
users_messages.head(10)

Great! Let's create a pivot table with `user_id`, `full_name` and `message_month` as indices and apply `count()` on unique text message identifiers - `id`, to get the number of messages sent by each user per month:

In [None]:
# Calculate the number of messages sent by each user per month. Save the result.

# Create a pivot table on user_id, full_name and message_month and count no. of unique text message identifier - id
messages_per_user = users_messages.pivot_table(index=['user_id', 'full_name', 'message_month'], aggfunc={'id': 'count'})

# Give names to the columns of the pivot table
messages_per_user.columns = ['number_of_messages']

Awesome! So, now we have the required data. Let's print the first 20 and last 20 records to view the results:

In [None]:
messages_per_user.head(20)

In [None]:
messages_per_user.tail(20)

### Volume of internet traffic used by each user per month

In order to calculate the volume of internet traffic used by each user per month, we need data from both `users` and `internet` Dataframes. Let's create a new Dataframe - `users_internet` by merging the two of them.

In [None]:
# Merge the users and internet Dataframes and print first 10 records
users_internet = users.merge(internet, on='user_id')
users_internet.head(10)

Great! Let's create a pivot table with `user_id`, `full_name` and `session_month` as indices and apply `sum()` on `gb_used` to get the volume of internet traffic (in GB) used by each user per month:

In [None]:
# Calculate the volume of internet traffic used by each user per month. Save the result.

# Create a pivot table on user_id, full_name and session_month and apply sum on gb_used
internet_traffic_per_user = users_internet.pivot_table(index=['user_id', 'full_name', 'session_month'], aggfunc={'gb_used': 'sum'})

# Give names to the columns of the pivot table
internet_traffic_per_user.columns = ['actual_gb_used']

Awesome! So, now we have the required data. Let's print the first 20 and last 20 records to view the results:

In [None]:
internet_traffic_per_user.head(20)

In [None]:
internet_traffic_per_user.tail(20)

We are making good progress. But, as per the Data Description, for web traffic, **individual web sessions are not rounded up. Instead, the total for the month is rounded up**. If someone uses 1025 megabytes in a month, they will be charged for 2 gigabytes. So, let's create another column - `rounded_up_gb_used` that rounds up the total GB used in each month:

In [None]:
internet_traffic_per_user['rounded_up_gb_used'] = np.ceil(internet_traffic_per_user['actual_gb_used'])
internet_traffic_per_user.head(20)
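
The order of summing and rounding matters. As a sketch on made-up numbers (three hypothetical sessions of 300 MB each), rounding the monthly total gives 1 GB, whereas rounding each session first would give 3 GB:

```python
import numpy as np
import pandas as pd

# Toy sessions within one month, in megabytes (hypothetical values)
sessions_mb = pd.Series([300.0, 300.0, 300.0])

# Rule described above: sum the month first, then round up
monthly_rounded_gb = np.ceil(sessions_mb.sum() / 1024)

# For comparison only: rounding every session up would overcharge
per_session_rounded_gb = np.ceil(sessions_mb / 1024).sum()

# The example from the description: 1025 MB in a month is billed as 2 GB
example_gb = np.ceil(1025 / 1024)

print(monthly_rounded_gb, per_session_rounded_gb, example_gb)
```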

### Combined data for calls, minutes, messages, internet for all the users

Now, **let's put the aggregate data together into one DataFrame - `user_consumption_per_month` so that one record in it would represent what a unique user consumed in a given month**.

Let's first flatten the `calls_per_user` and `minutes_per_user` Dataframes so that we can merge them both on `user_id` and `call_month`.

In [None]:
# Flatten calls_per_user Dataframe
calls_per_user = calls_per_user.reset_index()
calls_per_user.head(10)

In [None]:
# Flatten minutes_per_user Dataframe
minutes_per_user = minutes_per_user.reset_index()
minutes_per_user.head(10)

In [None]:
# Merge the data for calls and minutes based on user_id and month and save it in - user_consumption_per_month
user_consumption_per_month = calls_per_user.merge(minutes_per_user, how='outer', on=['user_id', 'call_month'])

In [None]:
# Create a column - user_name and save value from full_name_x. If that is null, then get value from full_name_y
user_consumption_per_month['user_name'] = user_consumption_per_month['full_name_x'].fillna(user_consumption_per_month['full_name_y'])

# Get only the necessary columns from user_consumption_per_month
user_consumption_per_month = user_consumption_per_month[['user_id', 'user_name', 'call_month', 'number_of_calls', 'amount_of_actual_mins', 'amount_of_rounded_up_mins']]

# Rename columns more meaningfully in user_consumption_per_month
user_consumption_per_month = user_consumption_per_month.rename(columns={'call_month': 'month', 'amount_of_actual_mins': 'call_mins', 'amount_of_rounded_up_mins': 'rounded_up_call_mins' })

In [None]:
# Get a sample of 20 records
user_consumption_per_month.sample(n=20, random_state=100)

Now, let's flatten the `messages_per_user` Dataframe so that we can merge it with `user_consumption_per_month` on `user_id` and `message_month` or `month`.

In [None]:
# Flatten messages_per_user Dataframe
messages_per_user = messages_per_user.reset_index()
messages_per_user.head(20)

In [None]:
# Merge the data for messages and user_consumption_per_month based on user_id and month and save it in - user_consumption_per_month
user_consumption_per_month = user_consumption_per_month.merge(messages_per_user, how='outer', left_on=['user_id', 'month'], right_on=['user_id', 'message_month'])

In [None]:
# Save value from user_name. If that is null, then get value from full_name
user_consumption_per_month['user_name'] = user_consumption_per_month['user_name'].fillna(user_consumption_per_month['full_name'])

# Save value from month. If that is null, then get value from message_month
user_consumption_per_month['month'] = user_consumption_per_month['month'].fillna(user_consumption_per_month['message_month'])

# Get only the necessary columns from user_consumption_per_month
user_consumption_per_month = user_consumption_per_month[['user_id', 'user_name', 'month', 'number_of_calls', 'call_mins', 'rounded_up_call_mins', 'number_of_messages']]

In [None]:
# Get a sample of 20 records
user_consumption_per_month.sample(n=20, random_state=800)

Now, let's flatten the `internet_traffic_per_user` Dataframe so that we can merge it with `user_consumption_per_month` on `user_id` and `session_month` or `month`.

In [None]:
# Flatten internet_traffic_per_user Dataframe
internet_traffic_per_user = internet_traffic_per_user.reset_index()
internet_traffic_per_user.head(20)

In [None]:
# Merge the data for internet and user_consumption_per_month based on user_id and month and save it in - user_consumption_per_month
user_consumption_per_month = user_consumption_per_month.merge(internet_traffic_per_user, how='outer', left_on=['user_id', 'month'], right_on=['user_id', 'session_month'])

In [None]:
# Save value from user_name. If that is null, then get value from full_name
user_consumption_per_month['user_name'] = user_consumption_per_month['user_name'].fillna(user_consumption_per_month['full_name'])

# Save value from month. If that is null, then get value from session_month
user_consumption_per_month['month'] = user_consumption_per_month['month'].fillna(user_consumption_per_month['session_month'])

# Get only the necessary columns from user_consumption_per_month
user_consumption_per_month = user_consumption_per_month[['user_id', 'user_name', 'month', 'number_of_calls', 'call_mins', 'rounded_up_call_mins', 'number_of_messages', 'actual_gb_used', 'rounded_up_gb_used']]

In [None]:
# Get a sample of 20 records
user_consumption_per_month.sample(n=20, random_state=989)

Awesome! Finally, **we have combined data for calls, minutes, messages, internet for all the users in `user_consumption_per_month`**.

### Combined data for all the users along with plan information

It would be useful to have the combined data for calls, minutes, messages, and internet for all users together with the plan each user is subscribed to.  

First of all, let's create a `users_plans` DataFrame by merging the `users` and `plans` DataFrames:

In [None]:
# Merge users and plans dataframes on plan or plan_name
users_plans = users[['user_id', 'plan']].merge(plans, left_on='plan', right_on='plan_name' )

# Take out the redundant column for plan
users_plans = users_plans.loc[:, users_plans.columns != 'plan_name']

In [None]:
# Get random 20 records from users_plans
users_plans.sample(n=20, random_state=200)

Now, since we have all the plan information for each user IDs in `users_plans`, **let's merge `users_plans` with `user_consumption_per_month` to add the plan information to the combined data for calls, minutes, messages, internet for all the users**:

In [None]:
# Add the plan information
user_consumption_per_month = user_consumption_per_month.merge(users_plans, on='user_id')
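
By default `merge` performs an inner join, so any usage row whose `user_id` has no match in `users_plans` would be dropped silently. A minimal sketch of a more defensive variant, on made-up data, uses the `validate` and `indicator` arguments of `merge`. This is an optional check, not what the cell above does:

```python
import pandas as pd

# Toy data: user 3 has usage but no plan record
usage = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "rounded_up_call_mins": [120, 90, 300, 45],
})
users_plans_toy = pd.DataFrame({
    "user_id": [1, 2],
    "plan": ["surf", "ultimate"],
})

# validate="m:1" raises MergeError if a user_id appears more than once on the right
checked = usage.merge(
    users_plans_toy, on="user_id", how="left", validate="m:1", indicator=True
)

# how="left" keeps usage rows without a plan so that they can be inspected
missing_plan = checked[checked["_merge"] == "left_only"]
print(missing_plan)
```

If `missing_plan` is empty on the real data, the inner merge used in the notebook loses nothing.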

In [None]:
# Get random 20 records from user_consumption_per_month
user_consumption_per_month.sample(n=20, random_state=200)

### Monthly revenue for each user

Let's calculate the **monthly revenue from each user**: subtract the free package limit from the total call minutes, text messages, and gigabytes of data used; multiply any excess by the plan's per-unit price; and add the plan's monthly charge. We can write a Python function that is applied to each row and returns the revenue for that user and month:

In [None]:
# Function that calculates the monthly revenue for each user
# Input: Each row in the Dataframe
# Output: Monthly revenue for the user
def calulate_monthly_revenue(row):  
    
    """
    - The function is applied to each row in the `user_consumption_per_month` DataFrame.
    - Revenue calculation involves adding the monthly charge and any additional expenses 
    incurred by the customer when they exceed usage limits.  
    - The additional expenses are determined based on call length, number of messages, and amount of internet used.
    - There are different code blocks for calculating and storing each of these possible expenses.
    
    Input: Each row in the Dataframe
    Output: Monthly revenue for the user
    """
    
    # Store monthly payment of the plan that user has taken
    plan_monthly_charge = row.usd_monthly_fee
    
    # Calculate revenue for calls, if user exceeded the limit covered by the plan
    #-----------------------------------------------------------------------------
    # rounded_up_call_mins: Total mins the user has spent in call in the particular month after rounding up
    # minutes_included: Monthly minute allowance of the plan
    # usd_per_minute: Price per minute after exceeding the package limits
    extra_call_mins = row.rounded_up_call_mins - row.minutes_included
    if extra_call_mins > 0:
        call_revenue = extra_call_mins * row.usd_per_minute
    else:
        call_revenue = 0
    
    # Calculate revenue for messages, if user exceeded limit covered by the plan
    #---------------------------------------------------------------------------
    # number_of_messages: Total number of messages sent by the user in the particular month
    # messages_included: Monthly text allowance of the plan
    # usd_per_message: Price per text after exceeding the package limits
    extra_messages = row.number_of_messages - row.messages_included
    if extra_messages > 0:
        message_revenue = extra_messages * row.usd_per_message
    else:
        message_revenue = 0
    
    # Calculate revenue for internet usage, if user exceeded limit covered by the plan
    #----------------------------------------------------------------------------------
    # rounded_up_gb_used: Total volume of internet traffic used by the user in the particular month
    # gb_per_month_included: Data volume allowance (in gigabytes) of the plan
    # usd_per_gb: Price per extra gigabyte of data after exceeding the package limits
    extra_internet_gb = (row.rounded_up_gb_used - row.gb_per_month_included)
    if extra_internet_gb > 0:
        internet_revenue = extra_internet_gb * row.usd_per_gb
    else:
        internet_revenue = 0
        
    monthly_revenue = plan_monthly_charge + call_revenue + message_revenue + internet_revenue
    
    return monthly_revenue

Woah! So, now we have a function - `calulate_monthly_revenue()` - that, when applied to each row in the `user_consumption_per_month` DataFrame, calculates the monthly revenue for each user. Let's apply the function to each row of the DataFrame and create a new column - `usd_monthly_revenue`:

In [None]:
# Calculate the monthly revenue for each user
user_consumption_per_month['usd_monthly_revenue'] = user_consumption_per_month.apply(calulate_monthly_revenue, axis=1)
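
For comparison, the same calculation can be sketched without a row-wise `apply`, using column arithmetic and `clip(lower=0)`. This is an alternative sketch, not the approach used above. The toy rows use the Surf and Ultimate terms from the introduction, and the `overage` helper is a hypothetical name introduced only for this illustration:

```python
import pandas as pd

# Two toy rows: a Surf user over every limit, an Ultimate user over the data limit
df = pd.DataFrame({
    "rounded_up_call_mins": [620, 400],
    "number_of_messages": [60, None],   # None becomes NaN: no messages that month
    "rounded_up_gb_used": [17, 31],
    "minutes_included": [500, 3000],
    "messages_included": [50, 1000],
    "gb_per_month_included": [15, 30],
    "usd_per_minute": [0.03, 0.01],
    "usd_per_message": [0.03, 0.01],
    "usd_per_gb": [10, 7],
    "usd_monthly_fee": [20, 70],
})

def overage(used, included, price):
    # Missing usage counts as zero; usage below the allowance costs nothing
    return (used.fillna(0) - included).clip(lower=0) * price

df["usd_monthly_revenue"] = (
    df["usd_monthly_fee"]
    + overage(df["rounded_up_call_mins"], df["minutes_included"], df["usd_per_minute"])
    + overage(df["number_of_messages"], df["messages_included"], df["usd_per_message"])
    + overage(df["rounded_up_gb_used"], df["gb_per_month_included"], df["usd_per_gb"])
)
print(df["usd_monthly_revenue"].round(2).tolist())  # [43.9, 77.0]
```

The first row works out to 20 + 120 × 0.03 + 10 × 0.03 + 2 × 10 = 43.90, and the second to 70 + 1 × 7 = 77.00, which is what the row-wise function would return for the same inputs.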

In [None]:
# Get random 20 records from user_consumption_per_month
user_consumption_per_month.sample(n=20, random_state=100)

<div style="border-bottom:2px solid #058EE1;"></div>

## Study user behaviour <a id='study-user-behaviour'></a>  
[Back to Contents](#contents)

Let's calculate some useful descriptive statistics for the aggregated and merged data, which typically reveal an overall picture captured by the data. We'll draw useful plots to help the understanding. Given that the main task is to compare the plans and decide on which one is more profitable, the statistics and the plots will be calculated on a per-plan basis.

### Calls

####  Compare average duration of calls per each plan per each distinct month

Let's compare average duration of calls per each plan per each distinct month.

In [None]:
# Compare average duration of calls per each plan per each distinct month.  
mean_calls_duration = user_consumption_per_month.pivot_table(index='month', columns='plan', aggfunc='mean', values='rounded_up_call_mins')
mean_calls_duration

In [None]:
# Plot a bar plot to visualize - mean_calls_duration
mean_calls_duration.plot.bar(figsize=(16,8), rot=0, color=['CornflowerBlue', 'Tomato'])

# Set the plot attributes
plt.title('Average duration of calls per each plan per each distinct month')
plt.ylabel('Minutes spent on calls')
plt.xlabel('Months (sorted alphabetically)')

plt.show()
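
Because the month labels are strings, the pivot table index (and therefore the bar plot) is sorted alphabetically rather than by calendar order. A minimal sketch of reordering it with `reindex` follows, assuming the labels look like `'Jan-2018'`; the toy values are made up:

```python
import pandas as pd

months_order = ['Jan-2018', 'Feb-2018', 'Mar-2018', 'Apr-2018', 'May-2018', 'Jun-2018',
                'Jul-2018', 'Aug-2018', 'Sep-2018', 'Oct-2018', 'Nov-2018', 'Dec-2018']

# Toy data covering four months for a single plan
toy = pd.DataFrame({
    "month": ["Feb-2018", "Jan-2018", "Apr-2018", "Mar-2018"],
    "plan": ["surf", "surf", "surf", "surf"],
    "rounded_up_call_mins": [300, 200, 350, 330],
})
pivot = toy.pivot_table(index="month", columns="plan",
                        values="rounded_up_call_mins", aggfunc="mean")
print(pivot.index.tolist())         # alphabetical order

# Keep only the months that are present, in calendar order
present = [m for m in months_order if m in pivot.index]
pivot_chrono = pivot.reindex(present)
print(pivot_chrono.index.tolist())  # calendar order
```

Plotting `pivot_chrono` instead of `pivot` would make trends across the year easier to read.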

**Here are some possible conclusions that can be drawn from the data of the average duration of calls per month for Megaline's Surf and Ultimate prepaid plans**:

- Overall, customers on the Ultimate plan tend to have longer average call durations than those on the Surf plan, as seen in most of the months where both plans have data available.

- The difference in average call durations between the two plans varies from month to month, with some months having a larger gap (e.g., Feb-2018) and others having a smaller gap (e.g., May-2018).

- Both plans show a general trend of increasing average call durations from January to December, which could indicate a seasonal effect or a trend in customer behavior.

- The Surf plan appears to have more variation in its average call durations than the Ultimate plan.  

- The highest average duration of calls for both plans was in December 2018.

- It's worth noting that the data only shows the average duration of calls per month and doesn't account for other factors that could influence revenue, such as the number of calls made or the quality of the network.

#### Compare the number of minutes users of each plan require each month

Let's compare the number of minutes users of each plan require each month. In order to do that, let's first separate out the records of the two plans from `user_consumption_per_month` and store them in separate dataframes:

In [None]:
# Separate out the records of the two plans and store separately
surf_user_consumption_per_month = user_consumption_per_month[user_consumption_per_month['plan'] == 'surf']
ultimate_user_consumption_per_month = user_consumption_per_month[user_consumption_per_month['plan'] == 'ultimate']

Let's peek into the data of each of the new Dataframes:

In [None]:
# Get random 10 records from surf_user_consumption_per_month
surf_user_consumption_per_month.sample(n=10, random_state=100)

In [None]:
# Get random 10 records from ultimate_user_consumption_per_month
ultimate_user_consumption_per_month.sample(n=10, random_state=100)

Awesome! So, now we are prepared to plot histograms to compare the number of minutes users of each plan require each month:

In [None]:
# Plot histograms to compare the number of minutes users of each plan require each month
surf_user_consumption_per_month['rounded_up_call_mins'].plot.hist(figsize=(16,8), color='Teal')
ultimate_user_consumption_per_month['rounded_up_call_mins'].plot.hist(color='Coral', alpha=0.8)

# Set the plot attributes
plt.legend(['Surf', 'Ultimate'])
plt.xlabel('Number of minutes')
plt.title('Number of minutes users of each plan require each month')

plt.show()
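
One caveat when comparing these two histograms: the Surf group has roughly twice as many records, and each `hist` call picks its own bin edges. A minimal sketch of a fairer comparison uses shared bins and density normalisation. The data below is synthetic (random values, not the real minutes), so only the technique carries over:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two groups: different sizes, same underlying shape
surf_like = rng.normal(430, 230, size=1545).clip(min=0)
ultimate_like = rng.normal(430, 230, size=713).clip(min=0)

# Shared bin edges so that the bars of both groups line up
bins = np.linspace(0, 1600, 17)

# density=True rescales each histogram so that its total area is 1
surf_density, _ = np.histogram(surf_like, bins=bins, density=True)
ultimate_density, _ = np.histogram(ultimate_like, bins=bins, density=True)

widths = np.diff(bins)
print((surf_density * widths).sum(), (ultimate_density * widths).sum())
```

With pandas plotting, passing the same `bins` array and `density=True` to both `plot.hist` calls should have the equivalent effect, since those keyword arguments are forwarded to matplotlib.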

The histogram makes it obvious that the Surf plan has more users. Users of both the Surf and Ultimate plans mostly spend 300 to 600 minutes per month on calls. Despite the difference in group size, the two distributions have a similar shape; the summary statistics in the next subsection show that the mean monthly minutes are nearly identical (436.52 for Surf and 434.68 for Ultimate).

#### Check whether users on the different plans have different behaviours for their calls

Let's calculate the mean and the variance of the call duration to reason about whether users on the different plans behave differently when it comes to calls. Let's create a pivot table:

In [None]:
# Calculate the mean and the variance of the monthly call duration
monthly_call_duration_stats = user_consumption_per_month.pivot_table(index='plan', values='rounded_up_call_mins', aggfunc=['mean', 'var', 'std', 'median'])
monthly_call_duration_stats.columns = ['mean_monthly_call_mins', 'var_monthly_call_mins', 'std_monthly_call_mins', 'median_monthly_call_mins']
monthly_call_duration_stats

**Here are some observations on users' behaviours based on the descriptive statistics data of Megaline's Surf and Ultimate plans**:

- The Surf plan has a slightly higher mean monthly call duration (436.52 minutes) than the Ultimate plan (434.68 minutes).

- The Surf plan has a slightly lower variance of monthly call duration (52,571.06 minutes²) than the Ultimate plan (56,573.63 minutes²).

- Consistently, the Surf plan has a slightly lower standard deviation of monthly call duration (229.28 minutes) than the Ultimate plan (237.85 minutes).

- The median monthly call duration is close for both plans: 430.0 minutes for Surf and 425.0 minutes for Ultimate.
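
Since the standard deviation is the square root of the variance, the figures in the table can be cross-checked against each other. A quick sketch using the variance values reported above:

```python
import math

# Variances of monthly call duration from the pivot table (in minutes squared)
variance = {"surf": 52571.06, "ultimate": 56573.63}

# Standard deviation is the square root of the variance
std = {plan: round(math.sqrt(v), 2) for plan, v in variance.items()}
print(std)  # {'surf': 229.28, 'ultimate': 237.85}
```

The plan with the larger variance is necessarily the plan with the larger standard deviation, so here Ultimate is the more spread-out of the two on both measures.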

#### Visualize the distribution of the monthly call duration

Let's plot a box plot to visualize the distribution of the monthly call duration. But, before that, let's have quick refresh of what data our Dataframe - `user_consumption_per_month` holds:

In [None]:
user_consumption_per_month.head(10)

Now, let's get a subset of our data in the Dataframe - `user_consumption_per_month` that belongs to **Surf** plan users. We already have this data in `surf_user_consumption_per_month` Dataframe. So, let's see how the distribution looks for the monthly call distribution of the surf plan holders:

In [None]:
# Plot a boxplot to visualize the distribution of the monthly call distribution of the surf plan holders
surf_user_consumption_per_month['rounded_up_call_mins'].plot.box(figsize=(16,10))

# Set the plot attributes
plt.ylabel('Monthly call duration (in mins)')
plt.title('Distribution of the monthly call duration for Surf plan users')

plt.show()

Let's also get descriptive statistics for the Series - `surf_user_consumption_per_month['rounded_up_call_mins']`:

In [None]:
surf_user_consumption_per_month['rounded_up_call_mins'].describe()

Based on the given descriptive statistics data and the observations from the box plot, **we can conclude the following about the distribution of monthly call duration for Surf plan users**:

- **Count**: There are **1545 data points** or monthly call duration values available for Surf plan users.

- **Mean**: The average monthly call duration for Surf plan users is **436.52 minutes**.

- **Standard Deviation**: The standard deviation of monthly call duration for Surf plan users is 229.28 minutes. This indicates that there is a significant variation in monthly call duration among Surf plan users.

- **Minimum**: The minimum monthly call duration is **0 minutes**.

- **Maximum**: The maximum monthly call duration is **1510 minutes**. This indicates that some Surf plan users spent a very large total time on calls in a single month.  

- **Quartiles**: The 25th percentile of monthly call duration is 279 minutes, **the median (50th percentile) is 430 minutes**, and the 75th percentile is 579 minutes. These quartiles divide the data into four equal parts and provide insight into the distribution of monthly call duration for Surf plan users.  

- The box plot shows outliers on the high side, with **1510 minutes** as the extreme maximum.

Overall, we can conclude that **the distribution of monthly call duration for Surf plan users is slightly right skewed (positively skewed)**, with a large range of variation in call duration. Half of the Surf user-month records show fewer than 430 minutes of calls, while some reach a maximum of 1510 minutes per month.
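
The points drawn as outliers in a box plot follow from the quartiles: by default matplotlib extends the whiskers to at most 1.5 times the interquartile range (IQR) beyond the box. A small sketch with the Surf quartiles reported above; `tukey_fences` is a hypothetical helper name used only for this illustration:

```python
def tukey_fences(q1, q3, whis=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# Quartiles of monthly call minutes for Surf plan users
lower, upper = tukey_fences(279, 579)
print(lower, upper)   # -171.0 1029.0
print(1510 > upper)   # True: the maximum lies well beyond the upper fence
```

Because call minutes cannot be negative, the lower fence is never reached, which is why all outliers sit on the high side.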

Now, let's plot a box plot to visualize the month-wise distribution of the monthly call duration of the Surf plan holders:

In [None]:
# Plot a boxplot to visualize the monthwise distribution of the monthly call distribution of the surf plan holders
surf_user_consumption_per_month.boxplot(by ='month', column =['rounded_up_call_mins'], grid = False, figsize=(16,10))

# Set the plot attributes
plt.ylabel('Call duration (in mins)')
plt.xlabel('Months')
plt.title('Monthwise distribution of the monthly call duration for Surf plan users')

plt.show()

Interesting! We can see that for Surf plan users:
- **The most extreme outliers lie in the month of December**. It's when users have talked the maximum.
- **There are no outliers in the month of January**. Also, the distribution for this month is symmetrical.

Now, let's get a subset of our data in the Dataframe - `user_consumption_per_month` that belongs to **Ultimate** plan users. We already have this data in `ultimate_user_consumption_per_month` Dataframe. So, let's see how the distribution looks for the monthly call distribution of the ultimate plan holders:

In [None]:
# Plot a boxplot to visualize the distribution of the monthly call distribution of the ultimate plan holders
ultimate_user_consumption_per_month['rounded_up_call_mins'].plot.box(figsize=(16,10))

# Set the plot attributes
plt.ylabel('Monthly call duration (in mins)')
plt.title('Distribution of the monthly call duration for Ultimate plan users')

plt.show()

Let's also get descriptive statistics for the Series - `ultimate_user_consumption_per_month['rounded_up_call_mins']`:

In [None]:
ultimate_user_consumption_per_month['rounded_up_call_mins'].describe()

Based on the given descriptive statistics data and observations from the box plot, **we can conclude the following about the distribution of monthly call duration for Ultimate plan users**:

- **Count**: There are **713 data points** or monthly call duration values available.

- **Mean**: The average monthly call duration for Ultimate plan users is **434.68 minutes**.

- **Standard Deviation**: The standard deviation of monthly call duration for Ultimate plan users is **237.85 minutes**. This indicates that there is a significant variation in monthly call duration among Ultimate plan users.

- **Minimum**: The minimum monthly call duration is **0 minutes**.  

- **Maximum**: The maximum monthly call duration is **1369 minutes**. This indicates that some Ultimate plan users spent a very large total time on calls in a single month.

- **Quartiles**: The 25th percentile of monthly call duration is 263 minutes, **the median (50th percentile) is 425 minutes**, and the 75th percentile is 566 minutes. These quartiles divide the data into four equal parts and provide insight into the distribution of monthly call duration for Ultimate plan users.  

- The box plot shows outliers on the high side, with **1369 minutes** as the extreme maximum.

Overall, we can conclude that **the distribution of monthly call duration for Ultimate plan users is slightly right skewed (positively skewed)**, with a large range of variation in call duration. Half of the Ultimate user-month records show fewer than 425 minutes of calls, while some reach a maximum of 1369 minutes per month. The mean monthly call duration for Ultimate plan users (434.68 minutes) is slightly lower than that of Surf plan users (436.52 minutes), but the difference is tiny compared with the spread of either group.
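
To get a rough feel for how small the gap between the two means is, we can compute a Welch-style t statistic directly from the summary statistics reported above. This is only a back-of-the-envelope sketch: it treats every user-month record as independent, which is an assumption (the same user contributes several months), and `welch_t` is a hypothetical helper name:

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """t statistic for the difference of two means with unequal variances."""
    standard_error = math.sqrt(std_a**2 / n_a + std_b**2 / n_b)
    return (mean_a - mean_b) / standard_error

# Surf: mean 436.52, std 229.28, n 1545; Ultimate: mean 434.68, std 237.85, n 713
t = welch_t(436.52, 229.28, 1545, 434.68, 237.85, 713)
print(round(t, 2))  # 0.17
```

A value this close to zero (far below the usual large-sample threshold of about 1.96) is in line with the observation that the two plans barely differ in average monthly call duration.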

In [None]:
# Plot a boxplot to visualize the monthwise distribution of the monthly call distribution of the ultimate plan holders
ultimate_user_consumption_per_month.boxplot(by ='month', column =['rounded_up_call_mins'], grid = False, figsize=(16,10))

# Set the plot attributes
plt.ylabel('Call duration (in mins)')
plt.xlabel('Months')
plt.title('Monthwise distribution of the monthly call duration for Ultimate plan users')

plt.show()

Interesting! We can see that for Ultimate plan users:

- The most extreme outliers lie in the month of December. It's when users have talked the maximum. It's the same as for the Surf users.
- There are no symmetrical distributions for any month though.

In [None]:
# Plot a boxplot to visualize the distribution of the monthly call duration

# Set the order in which months will be plotted on the graph
months_order = ['Jan-2018', 'Feb-2018', 'Mar-2018', 'Apr-2018', 'May-2018', 'Jun-2018', 'Jul-2018', 'Aug-2018', 'Sep-2018', 'Oct-2018', 'Nov-2018', 'Dec-2018']

# Customize the markers that show outliers in the data
flierprops = dict(marker='o', markersize=10, markeredgecolor='black', markerfacecolor='darkgreen', alpha=0.6)

# Customize the markers that show mean values
meanprops = dict(marker='s', markerfacecolor='white', markeredgecolor='black')

plt.figure(figsize=(20,20))
my_plot = sns.boxplot(
    data=user_consumption_per_month,
    y='month',
    x='rounded_up_call_mins',
    hue='plan',
    order=months_order,
    showmeans=True,
    orient='h',
    linewidth=2,
    flierprops=flierprops,
    meanprops=meanprops,
    palette='Set2')

# Set the plot attributes
my_plot.set_xlabel('Call duration (in mins)', fontsize= 14, fontweight='bold')
my_plot.set_ylabel('Months', fontsize= 14, fontweight='bold')
my_plot.set_title('Distribution of the monthly call duration for Surf & Ultimate plan users', fontsize= 16, fontweight='bold')

plt.show()

Wow! So, now we can compare the user behaviours between the two plans:
- Most users, regardless of their plan, talk less at the start of the year and tend to talk more as we progress towards the end of the year.
- In both plans, December is the month when users talked the most.
- The white squares in the graph mark the mean of each distribution. Average call durations varied considerably between the plans at the beginning of the year: the means and medians are generally quite different from January to May. From June to December, the means, medians, and overall distributions look very similar.

### Messages

Let's do some statistical analysis of the messages. But, before proceeding, let's refresh our memory of what the data under study looks like - `user_consumption_per_month`:

In [None]:
user_consumption_per_month.head(10)

#### Compare average number of messages per each plan per each distinct month

Let's compare average number of messages per each plan per each distinct month.

In [None]:
# Compare average number of messages per each plan per each distinct month 
mean_no_of_messages = user_consumption_per_month.pivot_table(index='month', columns='plan', aggfunc='mean', values='number_of_messages')
mean_no_of_messages

That's good. But the averages are reported as floats with many decimal places. Since messages are counted in whole numbers, we'll round the values to the nearest integer to make the table easier to read.

In [None]:
mean_no_of_messages['surf'] = mean_no_of_messages['surf'].round(0)
mean_no_of_messages['ultimate'] = mean_no_of_messages['ultimate'].round(0)
mean_no_of_messages

In [None]:
# Plot a bar plot to visualize - mean_no_of_messages
mean_no_of_messages.plot.bar(figsize=(16,8), rot=0, color=['CornflowerBlue', 'Tomato'])

# Set the plot attributes
plt.title('Average number of messages per each plan per each distinct month')
plt.ylabel('Number of messages')
plt.xlabel('Months (sorted alphabetically)')

plt.show()

Based on the visualization about **the average number of messages per month for Megaline's Surf and Ultimate prepaid plans, we can analyze and conclude the following**:

- In most months, **Ultimate plan users sent more messages on average than Surf plan users**.
- The average number of messages sent per month was relatively consistent for Surf plan users throughout the year, with little variation.
- The average number of messages sent per month for Ultimate plan users had some fluctuations over time but showed an increasing trend overall.
- **The highest average number of messages sent per month for both plans was in December 2018**.
- There is a significant difference between the average number of messages sent per month by Surf and Ultimate plan users in some months (e.g., August 2018, May 2018, March 2018).
- **The average number of messages sent by both Surf and Ultimate plan users is relatively low**, with most months having an average of fewer than 50 messages per month.

#### Compare the number of messages users of each plan require each month

Let's compare the number of messages users of each plan require each month. We already have separated out the records of the two plans from `user_consumption_per_month` in `surf_user_consumption_per_month` and `ultimate_user_consumption_per_month`:

In [None]:
# Get first 10 records
surf_user_consumption_per_month.head(10)

In [None]:
# Get first 10 records
ultimate_user_consumption_per_month.head(10)

Awesome! So, now we are prepared to plot histograms to compare the number of messages users of each plan require each month:

In [None]:
# Plot histograms to compare the number of messages users of each plan require each month
surf_user_consumption_per_month['number_of_messages'].plot.hist(figsize=(16,8), color='Teal')
ultimate_user_consumption_per_month['number_of_messages'].plot.hist(color='Coral', alpha=0.8)

# Set the plot attributes
plt.legend(['Surf', 'Ultimate'])
plt.xlabel('Number of messages')
plt.title('Number of messages users of each plan require each month')

plt.show()

We can conclude that:
- Users of both plans don't need many messages per month.
- The majority of users send relatively few messages.
- The Surf distribution has the longer tail, with a maximum of 266 messages in a month.
- The maximum number of messages sent by an Ultimate plan user in a month is 166.

#### Check whether users on the different plans have different behaviours for their text messages

Let's calculate the mean and the variance of the number of messages to reason about whether users on the different plans behave differently when it comes to text messages. Let's create a pivot table:

In [None]:
# Calculate the mean and the variance of the monthly no of messages
monthly_no_of_messages_stats = user_consumption_per_month.pivot_table(index='plan', values='number_of_messages', aggfunc=['mean', 'var', 'std', 'median'])
monthly_no_of_messages_stats.columns = ['mean_monthly_no_of_messages', 'var_monthly_no_of_messages', 'std_monthly_no_of_messages', 'median_monthly_no_of_messages']
monthly_no_of_messages_stats

**The descriptive statistics data of Megaline's Surf and Ultimate plans for monthly number of messages reveals the following information** about users on the different plans and their behaviors for the number of messages they use each month:

- The mean of the monthly number of messages is higher for Ultimate plan users than for Surf plan users, with Ultimate plan users sending an average of 46.30 messages per month compared to Surf plan users sending an average of 40.11 messages per month.
- The variance of the monthly number of messages is similar for both plans, indicating that **there is a similar degree of variability in the number of messages used each month by both Surf and Ultimate plan users**.
- The standard deviation of the monthly number of messages is similar for both plans, with Ultimate plan users having a slightly lower standard deviation than Surf plan users.
- **The median of the monthly number of messages is higher for Ultimate plan users than for Surf plan users**, with Ultimate plan users having a median of 41 messages per month compared to Surf plan users having a median of 32 messages per month.  

Based on the above statistics, **we can conclude that Ultimate plan users send more messages on average and have a higher median number of messages per month compared to Surf plan users**. Additionally, both plans have a similar degree of variability in the monthly number of messages sent, with Surf plan users having a slightly higher standard deviation.

#### Visualize the distribution of the monthly number of messages sent

Let's plot a box plot to visualize the distribution of the monthly number of messages sent. But, before that, let's have quick refresh of what data our Dataframe - `user_consumption_per_month` holds:

In [None]:
user_consumption_per_month.head(10)

Now, let's get a subset of our data in the Dataframe - `user_consumption_per_month` that belongs to **Surf** plan users. We already have this data in `surf_user_consumption_per_month` Dataframe. So, let's see how the distribution looks for the monthly number of messages sent for the surf plan holders:

In [None]:
# Plot a boxplot to visualize the distribution of the monthly no of messages sent for the surf plan holders
surf_user_consumption_per_month['number_of_messages'].plot.box(figsize=(16,10))

# Set the plot attributes
plt.ylabel('Monthly no of messages')
plt.title('Distribution of the monthly no of messages sent for Surf plan users')

plt.show()

Let's also get descriptive statistics for the Series - `surf_user_consumption_per_month['number_of_messages']`:

In [None]:
surf_user_consumption_per_month['number_of_messages'].describe()

**Based on the given descriptive statistics data and the observations from the box plot, we can conclude the following about the distribution of monthly number of messages for Surf plan users**:

- **Count**: There are **1222 data points or monthly number of messages values** available.

- **Mean**: The average monthly number of messages for Surf plan users is **40.11 or 40 approximately**.

- **Standard Deviation**: The standard deviation of monthly number of messages for Surf plan users is **33.04**. This indicates that there is a significant variation in monthly number of messages among Surf plan users.

- **Minimum**: The minimum monthly number of messages is **1**. This indicates that **some users sent only one message during the month**.

- **Maximum**: The maximum monthly number of messages is **266**. This indicates that **some Surf plan users sent a large number of messages during the month**.

- **Quartiles**: The 25th percentile of monthly number of messages is 16, **the median (50th percentile) is 32**, and the 75th percentile is 54.  

- The distribution has many outliers, **266 being the maximum**.  

Overall, we can conclude that **the distribution of monthly number of messages for Surf plan users is right skewed (positively skewed)**, with a large range of variation in the number of messages sent. **Half of the Surf user-month records show 32 or fewer messages, while some users sent up to 266 messages in a month**. 
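
Note that the count here (1222) is lower than the count for call minutes (1545), and the minimum is 1 rather than 0. A likely explanation, which is an assumption about the merged data rather than something verified here, is that months in which a user sent no messages are stored as `NaN` after the outer merge, and `describe()` only considers non-null values. A tiny sketch with made-up numbers shows how much that can matter:

```python
import pandas as pd

# Toy Series: NaN marks a month in which the user sent no messages
messages = pd.Series([10, None, 30, None, 20])

print(messages.count())           # 3: non-null values only
print(messages.mean())            # 20.0: mean over the non-null values
print(messages.fillna(0).mean())  # 12.0: mean when missing months count as zero
```

Whether to fill the gaps with zero depends on the question: "average among users who text" and "average across all users" are different quantities.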

Now, let's plot a box plot to visualize the month-wise distribution of the number of messages sent by the Surf plan holders:

In [None]:
# Plot a boxplot to visualize the monthwise distribution of the number of messages sent for the surf plan holders
surf_user_consumption_per_month.boxplot(by ='month', column =['number_of_messages'], grid = False, figsize=(16,10))

# Set the plot attributes
plt.ylabel('Number of messages sent')
plt.xlabel('Months')
plt.title('Monthwise distribution of the number of messages sent for the surf plan users')

plt.show()

Interesting! We can see that for Surf plan users:
- **The most extreme outliers lie in the month of December**. It's when users have texted the maximum.
- **There are no outliers in the months of January, February and March**.

Now, let's get a subset of our data in the Dataframe - `user_consumption_per_month` that belongs to **Ultimate** plan users. We already have this data in `ultimate_user_consumption_per_month` Dataframe. So, let's see how the distribution looks for the monthly number of messages sent for the ultimate plan holders:

In [None]:
# Plot a boxplot to visualize the distribution of the monthly no of messages sent for the ultimate plan holders
ultimate_user_consumption_per_month['number_of_messages'].plot.box(figsize=(16,10))

# Set the plot attributes
plt.ylabel('Monthly no of messages')
plt.title('Distribution of the monthly no of messages sent for Ultimate plan users')

plt.show()

Let's also get descriptive statistics for the Series - `ultimate_user_consumption_per_month['number_of_messages']`:

In [None]:
ultimate_user_consumption_per_month['number_of_messages'].describe()

**Based on the given descriptive statistics data and the observations from the box plot, we can conclude the following about the distribution of monthly number of messages for Ultimate plan users**:

- **Count**: There are **584 data points or monthly number of messages values** available.

- **Mean**: The average monthly number of messages for Ultimate plan users is **46.30 or 46 approximately**.

- **Standard Deviation**: The standard deviation of monthly number of messages for Ultimate plan users is **32.94**. This indicates that there is a significant variation in monthly number of messages among Ultimate plan users.

- **Minimum**: The minimum monthly number of messages is **1**. **This indicates that some users sent only one message during the month**.

- **Maximum**: The maximum monthly number of messages is **166**. **This indicates that some Ultimate plan users sent a large number of messages during the month, but the maximum is lower than that of Surf plan users**.

- **Quartiles**: The 25th percentile of monthly number of messages is 21, **the median (50th percentile) is 41**, and the 75th percentile is 66. These quartiles divide the data into four equal parts and provide insight into the distribution of monthly number of messages for Ultimate plan users.  

- There are outliers in the distribution, **166** being the maximum.  

Overall, we can conclude that **the distribution of monthly number of messages for Ultimate plan users is right skewed (positively skewed)**, with a large range of variation in the number of messages sent. **Half of the Ultimate user-month records show 41 or fewer messages, while some users sent up to 166 messages in a month. The mean monthly number of messages for Ultimate plan users is slightly higher than that of Surf plan users**. However, the maximum monthly number of messages is lower for Ultimate plan users than for Surf plan users.
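
The "right skewed" reading can be backed by a simple number. One common rule of thumb is Pearson's median skewness, 3 × (mean − median) / standard deviation, which is positive when the mean is pulled above the median by a long right tail. A sketch using the message statistics reported above for both plans (the choice of this particular coefficient is ours, not something the notebook computes):

```python
def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std

# Monthly number of messages: mean, median, standard deviation
surf_skew = pearson_median_skewness(40.11, 32, 33.04)
ultimate_skew = pearson_median_skewness(46.30, 41, 32.94)

print(round(surf_skew, 2), round(ultimate_skew, 2))  # 0.74 0.48
```

Both values are positive, and by this measure the Surf distribution is the more skewed of the two, which fits its much larger maximum.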

In [None]:
# Plot a boxplot to visualize the monthwise distribution of the number of messages sent for the ultimate plan holders
ultimate_user_consumption_per_month.boxplot(by ='month', column =['number_of_messages'], grid = False, figsize=(16,10))

# Set the plot attributes
plt.ylabel('Number of messages sent')
plt.xlabel('Months')
plt.title('Monthwise distribution of the number of messages sent for the Ultimate plan users')

plt.show()

Interesting! We can see that for Ultimate plan users:
- **The most extreme outliers lie in the month of November**. It's when users have texted the maximum.
- **There are no outliers in the months of January, March and December**.
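
Reading outliers off a box plot can be cross-checked with a count. The sketch below assumes the default box plot convention (points beyond 1.5 × IQR from the quartiles are drawn as outliers); the DataFrame `df` and the helper `count_outliers` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for ultimate_user_consumption_per_month
df = pd.DataFrame({
    "month": ["Jan-2018"] * 5 + ["Nov-2018"] * 6,
    "number_of_messages": [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 166],
})

def count_outliers(values: pd.Series) -> int:
    """Count points outside the whiskers, using the default 1.5 * IQR rule."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())

# One outlier count per month
outliers_per_month = df.groupby("month")["number_of_messages"].apply(count_outliers)
print(outliers_per_month)
```

Months with a count of zero correspond to boxes drawn without any outlier markers.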

In [None]:
# Plot a boxplot to visualize the distribution of the monthly no of messages sent by users

# Set the order in which months will be plotted on the graph
months_order = ['Jan-2018', 'Feb-2018', 'Mar-2018', 'Apr-2018', 'May-2018', 'Jun-2018', 'Jul-2018', 'Aug-2018', 'Sep-2018', 'Oct-2018', 'Nov-2018', 'Dec-2018']

# Customize the markers that show outliers in the data
flierprops = dict(marker='o', markersize=10, markeredgecolor='black', markerfacecolor='darkgreen', alpha=0.6)

# Customize the markers that show mean values
meanprops = dict(marker='s', markerfacecolor='white', markeredgecolor='black')

plt.figure(figsize=(20,20))
my_plot = sns.boxplot(
    data=user_consumption_per_month,
    y='month',
    x='number_of_messages',
    hue='plan',
    order=months_order,
    showmeans=True,
    orient='h',
    linewidth=2,
    flierprops=flierprops,
    meanprops=meanprops,
    palette='Set2')

# Set the plot attributes
my_plot.set_xlabel('Number of messages sent', fontsize= 14, fontweight='bold')
my_plot.set_ylabel('Months', fontsize= 14, fontweight='bold')
my_plot.set_title('Distribution of the monthly no of messages sent for Surf & Ultimate plan users', fontsize= 16, fontweight='bold')

plt.show()

Wow! So, now we can compare the user behaviours between the two plans:
- Users on both plans generally send fewer messages at the start of the year and more as the year progresses.
- On the Surf plan, users messaged the most in December; on the Ultimate plan, they messaged the most in November.

### Internet

Let's do some statistical study on the Internet. But before proceeding, let's refresh our memory of what the data under study, `user_consumption_per_month`, looks like:

In [None]:
# Get first 10 records
user_consumption_per_month.head(10)

#### Compare average amount of internet traffic consumed by users per each plan per each distinct month

Let's compare average amount of internet traffic consumed by users (in GBs) per each plan per each distinct month.

In [None]:
# Compare average amount of internet traffic consumed by users per each plan per each distinct month  
mean_internet_traffic_consumed = user_consumption_per_month.pivot_table(index='month', columns='plan', aggfunc='mean', values='rounded_up_gb_used')
mean_internet_traffic_consumed

In [None]:
# Plot a bar plot to visualize - mean_internet_traffic_consumed
mean_internet_traffic_consumed.plot.bar(figsize=(16,8), rot=0, color=['CornflowerBlue', 'Tomato'])

# Set the plot attributes
plt.title('Average amount of internet traffic consumed by users per each plan per each distinct month')
plt.ylabel('Internet traffic consumed in Gigabytes (GBs)')
plt.xlabel('Months (sorted alphabetically)')

plt.show()

**From the given data and the plot about the average amount of internet traffic consumed by users per each plan per each distinct month, we can conclude the following**:  
- Both plans show a similar pattern in internet usage over time, with higher consumption during the later months of the year (Oct, Nov, Dec) and lower consumption during the early months (Jan, Feb, Mar).
- January appears to be the month with the least amount of internet usage for both plans.
- The difference in average internet usage between the two plans is small, only about 2-3 GB on average. No statistical test has been run on it at this point.
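
Because the pivot table index is sorted alphabetically, the bars do not appear in calendar order, which makes the seasonal trend harder to see. One possible way to reorder the rows chronologically is sketched below, assuming month labels in the `Jan-2018` form used elsewhere in this notebook. The small DataFrame `df` is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical slice of user_consumption_per_month with the same column names
df = pd.DataFrame({
    "month": ["Apr-2018", "Jan-2018", "Feb-2018", "Jan-2018", "Apr-2018", "Feb-2018"],
    "plan": ["surf", "surf", "surf", "ultimate", "ultimate", "ultimate"],
    "rounded_up_gb_used": [14, 8, 12, 10, 16, 15],
})

pivot = df.pivot_table(index="month", columns="plan",
                       values="rounded_up_gb_used", aggfunc="mean")

# Parse the labels as dates, then reorder the rows chronologically
chronological = sorted(pivot.index,
                       key=lambda label: pd.to_datetime(label, format="%b-%Y"))
pivot_chronological = pivot.reindex(chronological)
print(pivot_chronological)
```

Calling `.plot.bar(...)` on the reindexed table would then draw the months from January to December.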

#### Compare the number of internet GBs users of each plan require each month

Let's compare the number of internet GBs users of each plan require each month. We already have separated out the records of the two plans from `user_consumption_per_month` in `surf_user_consumption_per_month` and `ultimate_user_consumption_per_month`:

In [None]:
# Get first 10 records for surf users
surf_user_consumption_per_month.head(10)

In [None]:
# Get first 10 records for ultimate users
ultimate_user_consumption_per_month.head(10)

Awesome! So, now we are prepared to plot histograms to compare the number of internet GBs users of each plan require each month:

In [None]:
# Plot histograms to compare the number of internet GBs users of each plan require each month
surf_user_consumption_per_month['rounded_up_gb_used'].plot.hist(figsize=(16,8), color='Teal')
ultimate_user_consumption_per_month['rounded_up_gb_used'].plot.hist(color='Coral', alpha=0.8)

# Set the plot attributes
plt.legend(['Surf', 'Ultimate'])
plt.xlabel('Number of internet GBs')
plt.title('Number of internet GBs users of each plan require each month')

plt.show()

We can conclude that:  
- Most monthly usage values in both plans fall somewhere between 15 and 22 GB.
- The visible Surf bars end near 42 GB. Rarer, larger values are likely too infrequent to show at this scale (the descriptive statistics later in this section report a Surf maximum of 70 GB).
- The maximum number of internet GBs needed by Ultimate plan users is close to 46.
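
The two groups differ in size (the descriptive statistics in this section report 1558 Surf and 719 Ultimate records), and each histogram above picks its own bins. A hedged sketch of a fairer comparison uses shared bin edges and densities instead of raw counts. The samples below are randomly generated placeholders, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical samples that only mimic the two group sizes reported in this notebook
surf_gb = rng.gamma(shape=5.0, scale=3.4, size=1558)
ultimate_gb = rng.gamma(shape=5.0, scale=3.5, size=719)

# One set of bin edges for both groups so that the bars are directly comparable
bins = np.histogram_bin_edges(np.concatenate([surf_gb, ultimate_gb]), bins=20)

surf_density, _ = np.histogram(surf_gb, bins=bins, density=True)
ultimate_density, _ = np.histogram(ultimate_gb, bins=bins, density=True)

# With density=True, each histogram integrates to 1 regardless of group size
bin_widths = np.diff(bins)
print((surf_density * bin_widths).sum(), (ultimate_density * bin_widths).sum())
```

In the plotting cell, the same idea would amount to passing the shared `bins` and `density=True` to both `.plot.hist(...)` calls.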

#### Check whether users on the different plans have different behaviours for their internet traffic consumption

Let's calculate the mean and the variance of the number of internet GBs used to reason about whether users on the different plans behave differently in their internet traffic consumption. Let's create a pivot table:

In [None]:
# Calculate the mean and the variance of the monthly internet traffic consumption in GBs
monthly_internet_traffic_stats = user_consumption_per_month.pivot_table(index='plan', values='rounded_up_gb_used', aggfunc=['mean', 'var', 'std', 'median'])
monthly_internet_traffic_stats.columns = ['mean_monthly_internet_GBs', 'var_monthly_internet_GBs', 'std_monthly_internet_GBs', 'median_monthly_internet_GBs']
monthly_internet_traffic_stats

Based on the descriptive statistics data provided, we can conclude the following about the internet traffic consumption of users on the Surf and Ultimate plans:

- The mean monthly internet traffic consumption for both plans is relatively similar, with Surf users consuming an average of 16.83 GBs and Ultimate users consuming an average of 17.33 GBs.
- The variance and standard deviation for monthly internet traffic consumption are also quite similar for both plans.
- The median monthly internet traffic consumption for both plans is identical at 17 GBs.
- The standard deviation is about 7.7 GB for both plans (roughly 45% of the mean), so monthly usage varies moderately within each plan, and to a similar degree in both.  

Overall, these statistics suggest that **users on both plans consume similar amounts of internet traffic on average, with a similar spread within each plan**.
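
One way to put a standard deviation in context is to divide it by the mean, which gives the coefficient of variation. The sketch below shows the calculation on a hypothetical stand-in DataFrame; only the column names are taken from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for user_consumption_per_month
df = pd.DataFrame({
    "plan": ["surf"] * 4 + ["ultimate"] * 4,
    "rounded_up_gb_used": [10, 15, 20, 25, 11, 16, 21, 26],
})

stats = df.groupby("plan")["rounded_up_gb_used"].agg(["mean", "std"])
# Relative spread: standard deviation as a fraction of the mean
stats["coefficient_of_variation"] = stats["std"] / stats["mean"]
print(stats)
```

Comparing this ratio across plans shows whether the spread is similar relative to typical usage, which is harder to judge from the variance alone.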

#### Visualize the distribution of the monthly internet traffic consumption by users

Let's plot a box plot to visualize the distribution of the monthly internet traffic consumption by users. But before that, let's quickly refresh our memory of what the DataFrame `user_consumption_per_month` holds:

In [None]:
user_consumption_per_month.head(10)

Now, let's get a subset of our data in the Dataframe - `user_consumption_per_month` that belongs to **Surf** plan users. We already have this data in `surf_user_consumption_per_month` Dataframe. So, let's see how the distribution looks for the monthly internet traffic consumption by the surf plan holders:

In [None]:
# Plot a boxplot to visualize the distribution of the monthly internet traffic consumption by the surf plan holders
surf_user_consumption_per_month['rounded_up_gb_used'].plot.box(figsize=(16,10))

# Set the plot attributes
plt.ylabel('Internet Traffic in GBs')
plt.title('Distribution of the monthly internet traffic consumption (in GBs) by the surf plan users')

plt.show()

Let's also get the descriptive statistics for the Series - `surf_user_consumption_per_month['rounded_up_gb_used']`:

In [None]:
surf_user_consumption_per_month['rounded_up_gb_used'].describe()

**Based on the given descriptive statistics data and the observations from the box plot above, we can conclude the following about the distribution of monthly internet traffic consumption for Surf plan users**:

- **Count**: There are **1558 data points** or monthly internet traffic consumption values available.

- **Mean**: The average monthly internet traffic consumption for Surf plan users is **16.83 GB**.

- **Standard Deviation**: The standard deviation of monthly internet traffic consumption for Surf plan users is **7.71 GB**. This indicates that there is a significant variation in monthly internet traffic consumption among Surf plan users.

- **Minimum**: The minimum monthly internet traffic consumption is **1 GB**. This indicates that some users consumed very little internet traffic during the month.

- **Maximum**: The maximum monthly internet traffic consumption is **70 GB**. This indicates that some Surf plan users consumed a large amount of internet traffic during the month.

- **Quartiles**: The 25th percentile of monthly internet traffic consumption is 12 GB, **the median (50th percentile) is 17 GB**, and the 75th percentile is 21 GB.  

- There are many outliers in the distribution, **70 GB** being the maximum.

Overall, we can conclude that **the distribution of monthly internet traffic consumption for Surf plan users has a long right tail (positive skew)**: the mean (16.83 GB) and the median (17 GB) are almost equal, so the bulk of the distribution is roughly symmetric, but the outliers stretch up to 70 GB. **The middle 50% of Surf monthly values lie between 12 GB and 21 GB**, while some users consumed as little as 1 GB and as much as 70 GB.

Now, let's plot a box plot to visualize the monthwise distribution of internet traffic consumption (in GBs) for the surf plan holders:

In [None]:
# Plot a boxplot to visualize the monthwise distribution of internet traffic consumption (in GBs) for the surf plan holders
surf_user_consumption_per_month.boxplot(by ='month', column =['rounded_up_gb_used'], grid = False, figsize=(16,10))

# Set the plot attributes
plt.ylabel('Internet traffic (in GBs) consumed')
plt.xlabel('Months')
plt.title('Monthwise distribution of internet traffic consumption (in GBs) for the surf plan users')

plt.show()

Interesting! We can see that for Surf plan users:
- **The most extreme outliers lie in the month of December**. It's when users have used internet the most.
- **There are no outliers in the months of January, February, March and April**.

Now, let's get a subset of our data in the Dataframe - `user_consumption_per_month` that belongs to **Ultimate** plan users. We already have this data in `ultimate_user_consumption_per_month` Dataframe. So, let's see how the distribution looks for the monthly internet traffic consumption by the Ultimate plan holders:

In [None]:
# Plot a boxplot to visualize the distribution of the monthly internet traffic consumption by the ultimate plan holders
ultimate_user_consumption_per_month['rounded_up_gb_used'].plot.box(figsize=(16,10))

# Set the plot attributes
plt.ylabel('Internet Traffic in GBs')
plt.title('Distribution of the monthly internet traffic consumption (in GBs) by the ultimate plan users')

plt.show()

Let's also get the descriptive statistics for the Series - `ultimate_user_consumption_per_month['rounded_up_gb_used']`:

In [None]:
ultimate_user_consumption_per_month['rounded_up_gb_used'].describe()

**Based on the given descriptive statistics data and the observations from the box plot, we can conclude the following about the distribution of monthly internet traffic consumption for Ultimate plan users**:

- **Count**: There are **719 data points** or monthly internet traffic consumption values available.

- **Mean**: The average monthly internet traffic consumption for Ultimate plan users is **17.33 GB**.

- **Standard Deviation**: The standard deviation of monthly internet traffic consumption for Ultimate plan users is **7.65 GB**. This indicates that there is a significant variation in monthly internet traffic consumption among Ultimate plan users.

- **Minimum**: The minimum monthly internet traffic consumption is **1 GB**. This indicates that some users consumed very little internet traffic during the month.

- **Maximum**: The maximum monthly internet traffic consumption is **46 GB**. This indicates that some Ultimate plan users consumed a large amount of internet traffic during the month, but not as much as some Surf plan users.

- **Quartiles**: The 25th percentile of monthly internet traffic consumption is 13 GB, **the median (50th percentile) is 17 GB**, and the 75th percentile is 21 GB. These quartiles divide the data into four equal parts and provide insight into the distribution of monthly internet traffic consumption for Ultimate plan users.  

- There are many outliers in the distribution, **46 GB** being the maximum.  

Overall, we can conclude that **the distribution of monthly internet traffic consumption for Ultimate plan users is also very slightly right skewed (positively skewed), since the mean of 17.33 GB is just above the median of 17 GB**. **The middle 50% of Ultimate monthly values lie between 13 GB and 21 GB, while some users consumed as little as 1 GB and as much as 46 GB**. The mean monthly internet traffic consumption for Ultimate plan users is 17.33 GB, slightly higher than the Surf mean of 16.83 GB, a difference of only 0.5 GB.

In [None]:
# Plot a boxplot to visualize the monthwise distribution of the internet traffic consumption (in GBs) by the ultimate plan holders
ultimate_user_consumption_per_month.boxplot(by ='month', column =['rounded_up_gb_used'], grid = False, figsize=(16,10))

# Set the plot attributes
plt.ylabel('Internet traffic consumed (in GBs)')
plt.xlabel('Months')
plt.title('Monthwise distribution of the internet traffic consumption (in GBs) by the Ultimate plan users')

plt.show()

Interesting! We can see that for Ultimate plan users:

- The most extreme outliers lie in the months of December and October. It's when users have used internet the most.
- There are no outliers in the months of January, March and April.

In [None]:
# Plot a boxplot to visualize the distribution of the monthly internet traffic consumption (in GBs) by users

# Set the order in which months will be plotted on the graph
months_order = ['Jan-2018', 'Feb-2018', 'Mar-2018', 'Apr-2018', 'May-2018', 'Jun-2018', 'Jul-2018', 'Aug-2018', 'Sep-2018', 'Oct-2018', 'Nov-2018', 'Dec-2018']

# Customize the markers that show outliers in the data
flierprops = dict(marker='o', markersize=10, markeredgecolor='black', markerfacecolor='darkgreen', alpha=0.6)

# Customize the markers that show mean values
meanprops = dict(marker='s', markerfacecolor='white', markeredgecolor='black')

plt.figure(figsize=(20,20))
my_plot = sns.boxplot(
    data=user_consumption_per_month,
    y='month',
    x='rounded_up_gb_used',
    hue='plan',
    order=months_order,
    showmeans=True,
    orient='h',
    linewidth=2,
    flierprops=flierprops,
    meanprops=meanprops,
    palette='Set2')

# Set the plot attributes
my_plot.set_xlabel('Internet traffic consumed (in GBs)', fontsize= 14, fontweight='bold')
my_plot.set_ylabel('Months', fontsize= 14, fontweight='bold')
my_plot.set_title('Distribution of the monthly internet traffic consumption (in GBs) by Surf & Ultimate plan users', fontsize= 16, fontweight='bold')

plt.show()

Wow! So, now we can compare the user behaviours between the two plans:

- Users on both plans generally use less internet traffic at the start of the year and more as the year progresses.
- On the Surf plan, users consumed the most internet traffic in December; on the Ultimate plan, the heaviest usage falls in December and October.

<div style="border-bottom:2px solid #058EE1;"></div>

## Study Revenue <a id='study-revenue'></a>  
[Back to Contents](#contents)

Let's do some statistical study on the Revenue. But before proceeding, let's refresh our memory of what the data under study, `user_consumption_per_month`, looks like:

In [None]:
# Get the first 10 records
user_consumption_per_month.head(10)

#### Compare average revenue per each plan per each distinct month

Let's compare average revenue per each plan per each distinct month.

In [None]:
# Compare average revenue per each plan per each distinct month  
mean_revenue = user_consumption_per_month.pivot_table(index='month', columns='plan', aggfunc='mean', values='usd_monthly_revenue')
mean_revenue

In [None]:
# Plot a bar plot to visualize - mean_revenue
mean_revenue.plot.bar(figsize=(16,8), rot=0, color=['CornflowerBlue', 'Tomato'])

# Set the plot attributes
plt.title('Average revenue per each plan per each distinct month')
plt.ylabel('Amount in USD')
plt.xlabel('Months (sorted alphabetically)')

plt.show()

**From the given data and the plot about the average revenue per each plan per each distinct month, we can conclude the following**:  
- The average revenue for the Ultimate plan is consistently higher than that of the Surf plan across all months.

- There is a general trend of increasing revenue over time for both plans, with higher revenue observed in the latter months of the year.

- The difference in average revenue between the two plans is smaller than the 50-dollar difference in their monthly charges (20 vs 70 dollars). This suggests that Surf users often exceed their package limits and pay overage charges, while Ultimate users mostly stay within their larger allowances.
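
To see how overage charges narrow the gap between the plans, here is a small worked sketch of the billing rule, using the plan terms listed in the introduction. The `PLANS` dictionary, the `monthly_revenue` function and the usage figures are illustrative assumptions, not the notebook's own implementation, and usage is assumed to be already rounded up.

```python
# Plan terms as listed in the introduction of this notebook
PLANS = {
    "surf": {"fee": 20.0, "minutes": 500, "messages": 50, "gb": 15,
             "per_minute": 0.03, "per_message": 0.03, "per_gb": 10.0},
    "ultimate": {"fee": 70.0, "minutes": 3000, "messages": 1000, "gb": 30,
                 "per_minute": 0.01, "per_message": 0.01, "per_gb": 7.0},
}

def monthly_revenue(plan: str, minutes: int, messages: int, gb: int) -> float:
    """Base fee plus overage charges; usage is assumed to be already rounded up."""
    terms = PLANS[plan]
    extra_minutes = max(0, minutes - terms["minutes"])
    extra_messages = max(0, messages - terms["messages"])
    extra_gb = max(0, gb - terms["gb"])
    overage = (extra_minutes * terms["per_minute"]
               + extra_messages * terms["per_message"]
               + extra_gb * terms["per_gb"])
    return round(terms["fee"] + overage, 2)

# The same hypothetical usage billed under each plan
usage = {"minutes": 512, "messages": 40, "gb": 17}
print(monthly_revenue("surf", **usage))      # 20 + 12 * 0.03 + 2 * 10
print(monthly_revenue("ultimate", **usage))  # base fee only
```

With this usage, the Surf bill roughly doubles because of two extra gigabytes, while the Ultimate bill stays at the base fee.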

#### Check whether users on the different plans have different revenues

Let's check whether users on the different plans have different revenues. Let's create a pivot table:

In [None]:
# Calculate the mean and the variance of the monthly revenues
monthly_revenue_stats = user_consumption_per_month.pivot_table(index='plan', values='usd_monthly_revenue', aggfunc=['mean', 'var', 'std', 'median'])
monthly_revenue_stats.columns = ['mean_monthly_revenue', 'var_monthly_revenue', 'std_monthly_revenue', 'median_monthly_revenue']
monthly_revenue_stats

Looking at the descriptive statistics data for monthly revenues of Megaline's Surf and Ultimate plans, we can conclude that:

- On average, **customers on the Ultimate plan generate higher revenue per month compared to those on the Surf plan**. The mean monthly revenue for Ultimate is 72.31, which is 11.60 higher than the mean monthly revenue for Surf (60.71).
- The variance in monthly revenue for Surf is significantly higher than that for Ultimate, indicating that there is more variability in monthly revenue for the Surf plan compared to the Ultimate plan.
- The standard deviation of monthly revenue for Surf is 55.39, which is higher than that of Ultimate (11.40). This means that revenue for the Surf plan is more spread out compared to that for the Ultimate plan.
- The median monthly revenue for Ultimate is 70.00, while for Surf it is 40.36. **This indicates that at least half of the customers on the Ultimate plan generate revenue of 70.00 or more per month, while at least half of the customers on the Surf plan generate revenue of 40.36 or less per month**.  

Overall, these statistics suggest that the **Ultimate plan brings in more revenue per user per month for Megaline, and its revenue is more consistent than that of the Surf plan**.

#### Visualize the distribution of the monthly revenues

Let's plot a box plot to visualize the distribution of the monthly revenues. But before that, let's quickly refresh our memory of what the DataFrame `user_consumption_per_month` holds:

In [None]:
user_consumption_per_month.head(10)

Now, let's get a subset of our data in the Dataframe - `user_consumption_per_month` that belongs to **Surf** plan users. We already have this data in `surf_user_consumption_per_month` Dataframe. So, let's see how the distribution looks for the monthly revenues of the surf plan holders:

In [None]:
# Plot a boxplot to visualize the distribution of the monthly revenues of the surf plan holders
surf_user_consumption_per_month['usd_monthly_revenue'].plot.box(figsize=(16,10))

# Set the plot attributes
plt.ylabel('Amount in USD')
plt.title('Distribution of the monthly revenues (in USD) of the surf plan holders')

plt.show()

Let's also get the descriptive statistics for the Series - `surf_user_consumption_per_month['usd_monthly_revenue']`:

In [None]:
surf_user_consumption_per_month['usd_monthly_revenue'].describe()

**Based on the given descriptive statistics data and the observations from the box plot, we can conclude the following about the distribution of monthly revenues for Surf plan users**:

- **Count**: There are **1573 data points** or monthly revenue values available.

- **Mean**: The average monthly revenue for Surf plan users is **60.71 dollars**.

- **Standard Deviation**: The standard deviation of monthly revenue for Surf plan users is **55.39 dollars**. This indicates that there is a significant variation in monthly revenue among Surf plan users.

- **Minimum**: The minimum monthly revenue is **20 dollars**. **This indicates that some users paid only the basic monthly charge**.

- **Maximum**: The maximum monthly revenue is **590.37 dollars**. This indicates that some Surf plan users paid a much higher amount, possibly because of additional charges.

- **Quartiles**: The 25th percentile of monthly revenue is 20 dollars, **the median (50th percentile) is 40.36 dollars**, and the 75th percentile is 80.36 dollars. These quartiles divide the data into four equal parts and provide insight into the distribution of monthly revenue for Surf plan users.  

- There are outliers in the distribution, **590.37 dollars** being the maximum.  

Overall, we can conclude that **the distribution of monthly revenues for Surf plan users is right skewed (positively skewed)**: the mean (60.71 dollars) is well above the median (40.36 dollars). At least a quarter of the Surf monthly bills were exactly the basic charge of 20 dollars, half were 40.36 dollars or less, and a quarter were 80.36 dollars or more.

Now, let's plot a box plot to visualize the monthwise revenues (in USD) of the surf plan holders:

In [None]:
# Plot a boxplot to visualize the monthwise revenues (in USD) of the surf plan holders
surf_user_consumption_per_month.boxplot(by ='month', column =['usd_monthly_revenue'], grid = False, figsize=(16,10))

# Set the plot attributes
plt.ylabel('Amount in USD')
plt.xlabel('Months')
plt.title('Monthwise revenues (in USD) of the Surf plan holders')

plt.show()

Interesting! We can see that for Surf plan users:
- **The most extreme outliers lie in the month of December**.
- **There are no outliers in the month of January**.

Now, let's get a subset of our data in the Dataframe - `user_consumption_per_month` that belongs to **Ultimate** plan users. We already have this data in `ultimate_user_consumption_per_month` Dataframe. So, let's see how the distribution looks for the monthly revenues of the ultimate plan holders:

In [None]:
# Plot a boxplot to visualize the distribution of the monthly revenues of the ultimate plan holders
ultimate_user_consumption_per_month['usd_monthly_revenue'].plot.box(figsize=(16,10))

# Set the plot attributes
plt.ylabel('Amount in USD')
plt.title('Distribution of the monthly revenues (in USD) of the ultimate plan holders')

plt.show()

Let's also get the descriptive statistics for the Series - `ultimate_user_consumption_per_month['usd_monthly_revenue']`:

In [None]:
ultimate_user_consumption_per_month['usd_monthly_revenue'].describe()

**Based on the given descriptive statistics data and observations from the box plot, we can conclude the following about the distribution of monthly revenues for Ultimate plan users**:

- **Count**: There are **720 data points** or monthly revenue values available.

- **Mean**: The average monthly revenue for Ultimate plan users is **72.31 dollars**.

- **Standard Deviation**: The standard deviation of monthly revenue for Ultimate plan users is **11.40 dollars**. This indicates that there is relatively low variation in monthly revenue among Ultimate plan users.

- **Minimum**: The minimum monthly revenue is **70 dollars**. **This indicates that every Ultimate plan user paid at least the basic monthly charge of 70 dollars**.

- **Maximum**: The maximum monthly revenue is **182 dollars**. This indicates that some Ultimate plan users paid more than the basic monthly charge, possibly because of additional charges or higher usage of services.

- **Quartiles**: **The 25th, 50th, and 75th percentiles of monthly revenue are all 70 dollars**. This indicates that the majority of Ultimate plan users paid only the basic monthly charge, and very few users paid more than that.  

Overall, **we can conclude that the distribution of monthly revenues for Ultimate plan users is very narrow and tightly centered around the basic monthly charge of 70 dollars**. There is very little variation in monthly revenue among Ultimate plan users, and the vast majority of them paid only the basic charge. The mean monthly revenue for Ultimate plan users is 72.31 dollars, which is only 2.31 dollars above the basic charge and 11.60 dollars higher than the Surf mean of 60.71 dollars.
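
The claim that most Ultimate bills equal the basic charge can be quantified directly. The sketch below uses a hypothetical series of monthly bills; on the real data the same expressions could be applied to `ultimate_user_consumption_per_month['usd_monthly_revenue']`.

```python
import pandas as pd

# Hypothetical monthly bills for a handful of Ultimate user-months
ultimate_revenue = pd.Series([70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 84.0, 70.0, 112.0])
base_charge = 70.0

# Fraction of bills that contain no overage at all
share_at_base = (ultimate_revenue == base_charge).mean()
# Average amount paid on top of the base charge
mean_overage = (ultimate_revenue - base_charge).mean()

print(f"share paying only the base charge: {share_at_base:.0%}")
print(f"average overage per user-month: {mean_overage:.2f} USD")
```

The average overage is the gap between the mean revenue and the base charge, which is why a mean only slightly above 70 dollars points to rare overage payments.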

In [None]:
# Plot a boxplot to visualize the monthwise revenues (in USD) of the ultimate plan holders
ultimate_user_consumption_per_month.boxplot(by ='month', column =['usd_monthly_revenue'], grid = False, figsize=(16,10))

# Set the plot attributes
plt.ylabel('Amount in USD')
plt.xlabel('Months')
plt.title('Monthwise revenues (in USD) of the Ultimate plan holders')

plt.show()

We can conclude that **the distribution of monthly revenues for Ultimate plan users is very narrow and tightly centered around the basic monthly charge of 70 dollars**. There are some outliers, with the highest values in October and December.

In [None]:
# Plot a boxplot to visualize the distribution of the monthly revenues (in USD) by users

# Set the order in which months will be plotted on the graph
months_order = ['Jan-2018', 'Feb-2018', 'Mar-2018', 'Apr-2018', 'May-2018', 'Jun-2018', 'Jul-2018', 'Aug-2018', 'Sep-2018', 'Oct-2018', 'Nov-2018', 'Dec-2018']

# Customize the markers that show outliers in the data
flierprops = dict(marker='o', markersize=10, markeredgecolor='black', markerfacecolor='darkgreen', alpha=0.6)

# Customize the markers that show mean values
meanprops = dict(marker='s', markerfacecolor='white', markeredgecolor='black')

plt.figure(figsize=(20,20))
my_plot = sns.boxplot(
    data=user_consumption_per_month,
    y='month',
    x='usd_monthly_revenue',
    hue='plan',
    order=months_order,
    showmeans=True,
    orient='h',
    linewidth=2,
    flierprops=flierprops,
    meanprops=meanprops,
    palette='Set2')

# Set the plot attributes
my_plot.set_xlabel('Amount in USD', fontsize= 14, fontweight='bold')
my_plot.set_ylabel('Months', fontsize= 14, fontweight='bold')
my_plot.set_title('Distribution of the monthly revenues (in USD) by Surf & Ultimate plan users', fontsize= 16, fontweight='bold')

plt.show()

We can compare the user behaviours between the two plans:

- The distribution of monthly revenues for Ultimate plan users is very narrow and tightly centered around the basic monthly charge of 70 dollars. 
- There are outliers in both surf and ultimate plans. For Surf plan users, the outliers lie mostly in December and for Ultimate plan users, the outliers lie mostly in October and December months.

<div style="border-bottom:2px solid #058EE1;"></div>

## Test statistical hypotheses <a id='test-statistical-hypotheses'></a>  
[Back to Contents](#contents)

### Average revenue from users of the Ultimate and Surf calling plans differs

Let's test the hypothesis that the average revenue from users of the Ultimate and Surf calling plans differs. For testing the hypothesis, let's formulate the null and the alternative hypotheses.  

**Null Hypothesis**: Average revenue from users of the Ultimate calling plan is equal to that of the Surf calling plan.  
**Alternative Hypothesis**: Average revenue from users of the Surf calling plan is different from that of the Ultimate calling plan.

In order to test the above hypotheses, we need average user revenue for surf and ultimate plans. Let's calculate them first. We already have surf plan users data in - `surf_user_consumption_per_month` and ultimate plan users data in - `ultimate_user_consumption_per_month`:

In [None]:
# Average revenue from users of the Surf plan
surf_mean_user_revenue = surf_user_consumption_per_month.groupby('user_id')['usd_monthly_revenue'].mean()
surf_mean_user_revenue.head(10)

In [None]:
# Average revenue from users of the Ultimate plan
ultimate_mean_user_revenue = ultimate_user_consumption_per_month.groupby('user_id')['usd_monthly_revenue'].mean()
ultimate_mean_user_revenue.head(10)

In [None]:
# Test the hypotheses
alpha = 0.05 # significance level

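# Use Welch's t-test (equal_var=False): the revenue variances of the two plans differ substantially
results = st.ttest_ind(surf_mean_user_revenue, ultimate_mean_user_revenue, equal_var=False)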

print('p-value:', results.pvalue)

if (results.pvalue < alpha):
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")

**We can reject the null hypothesis that 'average revenue from users of the Ultimate and Surf calling plans is equal' at the 0.05 significance level.** There is a statistically significant difference between the average revenues from users of the Ultimate and Surf calling plans.
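
The monthly revenue spread is very different between the plans (a standard deviation of 55.39 for Surf against 11.40 for Ultimate), so it can be worth checking how the equal-variance assumption affects the test. The sketch below runs Levene's test and compares the Student and Welch versions of `ttest_ind`. It uses randomly generated placeholder samples, not the notebook's data.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(42)
# Hypothetical per-user mean revenues: one wide group and one narrow group
surf_like = rng.normal(loc=60, scale=45, size=300)
ultimate_like = rng.normal(loc=72, scale=10, size=150)

# Levene's test: a small p-value suggests the variances differ
levene_result = st.levene(surf_like, ultimate_like)

student = st.ttest_ind(surf_like, ultimate_like)                 # assumes equal variances
welch = st.ttest_ind(surf_like, ultimate_like, equal_var=False)  # Welch's t-test

print("Levene p-value: ", levene_result.pvalue)
print("Student p-value:", student.pvalue)
print("Welch p-value:  ", welch.pvalue)
```

When Levene's test indicates unequal variances, the Welch variant is generally the safer choice, since it does not pool the two variances.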

### Average revenue from users in the NY-NJ area is different from that of the users from the other regions

Let's test the hypothesis that the average revenue from users in the NY-NJ area is different from that of the users from the other regions. For testing the hypothesis, let's formulate the null and the alternative hypotheses.  

**Null Hypothesis**: Average revenue from users in the NY-NJ area is equal to that of the users from the other regions.  
**Alternative Hypothesis**: Average revenue from users in the NY-NJ area is different from that of the users from the other regions.

In order to test the above hypotheses, we need to prepare our dataset. Let's merge `user_consumption_per_month` Dataframe with `users` Dataframe's two columns - `user_id` and `city`:

In [None]:
# Merge user_consumption_per_month and users (required columns) on user_id
all_cities_user_consumption_per_month = user_consumption_per_month.merge(users[['user_id', 'city']], on='user_id')
all_cities_user_consumption_per_month.head(10)

Let's now create two separate Dataframes to store users from the NY-NJ area and other users:

In [None]:
# Store data of users from the NY-NJ area 
ny_nj_user_consumption_per_month = all_cities_user_consumption_per_month[all_cities_user_consumption_per_month['city'].str.contains('NY-NJ')]
ny_nj_user_consumption_per_month.head(10)

In [None]:
# Store data of users who are not from the NY-NJ area 
other_cities_user_consumption_per_month = all_cities_user_consumption_per_month[~all_cities_user_consumption_per_month['city'].str.contains('NY-NJ')]
other_cities_user_consumption_per_month.head()

In order to test the above hypotheses, we need average revenue from users in the NY-NJ area and other areas. Let's calculate them first. We already have the data of users in the NY-NJ area in - `ny_nj_user_consumption_per_month` and of other users in - `other_cities_user_consumption_per_month`:

In [None]:
# Average revenue from users in the NY-NJ area
ny_nj_mean_user_revenue = ny_nj_user_consumption_per_month.groupby('user_id')['usd_monthly_revenue'].mean()
ny_nj_mean_user_revenue.head(10)

In [None]:
# Average revenue from users not in the NY-NJ area
other_cities_mean_user_revenue = other_cities_user_consumption_per_month.groupby('user_id')['usd_monthly_revenue'].mean()
other_cities_mean_user_revenue.head(10)

In [None]:
# Test the hypotheses
alpha = 0.05  # significance level

# Two-sided independent-samples t-test on per-user mean revenue
# (by default, ttest_ind assumes the two groups have equal variances)
results = st.ttest_ind(ny_nj_mean_user_revenue, other_cities_mean_user_revenue)

print('p-value:', results.pvalue)

if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
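
The test above uses the default pooled-variance (Student's) t-test. When the two groups differ in size and possibly in spread, as may be the case for one metro area against all the others, a common alternative is Welch's t-test (`equal_var=False`). The sketch below uses synthetic numbers, not the Megaline data, purely to show how the two variants and a variance check could be run side by side.

```python
import numpy as np
from scipy import stats as st

# Synthetic per-user mean revenues (made-up numbers, NOT the Megaline data)
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=60, scale=40, size=80)   # smaller, more variable group
sample_b = rng.normal(loc=62, scale=25, size=400)  # larger, less variable group

alpha = 0.05  # significance level

# Levene's test: the null hypothesis is that both groups have equal variances
levene = st.levene(sample_a, sample_b)
print('Levene p-value:', levene.pvalue)

# Student's t-test (pooled variance) vs. Welch's t-test (no equal-variance assumption)
student = st.ttest_ind(sample_a, sample_b)
welch = st.ttest_ind(sample_a, sample_b, equal_var=False)
print('Student p-value:', student.pvalue)
print('Welch p-value:  ', welch.pvalue)

for name, res in [('Student', student), ('Welch', welch)]:
    verdict = 'reject' if res.pvalue < alpha else "can't reject"
    print(f"{name}: we {verdict} the null hypothesis")
```

If both variants lead to the same decision on the real data, the conclusion does not depend on the equal-variance assumption.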

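**At the 0.05 significance level, we cannot reject the null hypothesis that the average revenue from users in the NY-NJ area is equal to that of the users from the other regions.** In other words, the data do not show a statistically significant difference between the two groups.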

<div style="border-bottom:2px solid #058EE1;"></div>

## General conclusion <a id='general-conclusion'></a>  
[Back to Contents](#contents)

**Calls**
- Customers on the Ultimate plan tend to have longer average call durations than those on the Surf plan.
- Both plans show a general trend of increasing average call durations from January to December, which could indicate a seasonal effect or a trend in customer behavior.
- The distribution of monthly call duration for Surf plan users is slightly right (positively) skewed. Half of Surf plan users talk for less than 430 minutes per month, while the heaviest users reach up to 1510 minutes per month.
- The distribution of monthly call duration for Ultimate plan users is also slightly right (positively) skewed. Half of Ultimate plan users talk for less than 425 minutes per month, while the heaviest users reach up to 1369 minutes per month. The mean monthly call duration for Ultimate plan users is slightly higher than that of Surf plan users, but the difference is small.
- Regardless of plan, most users talk less at the start of the year and more as the year progresses.   

**Messages**
- The average number of messages sent by both Surf and Ultimate plan users is relatively low: in most months the average stays below 50 messages.
- In most months, Ultimate plan users sent more messages on average than Surf plan users.
- The highest average number of messages sent per month for both plans was in December 2018.
- The distribution of the monthly number of messages for Surf plan users is right (positively) skewed. Half of Surf plan users sent 32 or fewer messages per month, while some users sent up to 266 messages per month.
- The distribution of the monthly number of messages for Ultimate plan users is right (positively) skewed. Half of Ultimate plan users sent 41 or fewer messages per month, while some users sent up to 166 messages per month. The mean monthly number of messages for Ultimate plan users is slightly higher than that of Surf plan users. However, the maximum monthly number of messages is lower for Ultimate plan users than for Surf plan users.   

**Internet**
- Both plans show a similar pattern in internet usage over time, with higher consumption during the later months of the year (Oct, Nov, Dec) and lower consumption during the early months (Jan, Feb, Mar).
- January appears to be the month with the least amount of internet usage for both plans.
- The difference in average internet usage between the two plans is small, only about 2-3 GB on average.
- The distribution of monthly internet traffic consumption for Surf plan users is right (positively) skewed. The middle 50% of Surf plan users consumed between 12 GB and 21 GB of internet traffic per month, while some users consumed as little as 1 GB and as much as 70 GB.
- The distribution of monthly internet traffic consumption for Ultimate plan users is also very slightly right (positively) skewed (mean 17.33 GB > median 17 GB). The middle 50% of Ultimate plan users consumed between 13 GB and 21 GB of internet traffic per month, while some users consumed as little as 1 GB and as much as 46 GB. The mean monthly internet traffic consumption for Ultimate plan users is 17.33 GB, which is slightly higher than the mean consumption for Surf plan users. However, the difference is small.  
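
The skewness and quartile statements in the bullets above rest on comparing the mean with the median and on reading off the quartiles. As a rough illustration of that check, the sketch below runs it on a small made-up series of monthly GB values (these are not the Megaline figures, and the variable names are hypothetical).

```python
import pandas as pd

# Made-up monthly internet usage in GB, with a long right tail
gb_used = pd.Series([1, 12, 14, 16, 17, 19, 21, 25, 46, 70], name='gb_used')

summary = {
    'mean': gb_used.mean(),
    'median': gb_used.median(),
    'q1': gb_used.quantile(0.25),   # 25% of values lie below this
    'q3': gb_used.quantile(0.75),   # 75% of values lie below this
    'skew': gb_used.skew(),         # sample skewness; positive means a longer right tail
}

for name, value in summary.items():
    print(f'{name}: {value:.2f}')

# A mean above the median together with positive skewness suggests a right-skewed distribution
print('right skewed:', summary['mean'] > summary['median'] and summary['skew'] > 0)
```

The interval from `q1` to `q3` is the range that holds the middle 50% of values, which is what the "between X GB and Y GB" statements describe.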

**Revenue**
- The Ultimate plan brings in more revenue per user for Megaline: customers on this plan generate higher revenue on average, and their revenue is more consistent than on the Surf plan.
- There is a general trend of increasing revenue over time for both plans, with higher revenue observed in the latter months of the year.
- The difference in average revenue between the two plans is not as large as the difference in their monthly charges. This suggests that Surf users often exceed their plan limits and pay overage charges, while most Ultimate users stay within their limits and pay only the basic charge.
- The distribution of monthly revenues for Surf plan users is right (positively) skewed. 50% of Surf plan users paid between 20 dollars and 40.36 dollars per month; some users paid only the basic charge and some paid much more.
- The distribution of monthly revenues for Ultimate plan users is very narrow and tightly centered around the basic monthly charge of 70 dollars. There is very little variation in monthly revenue among Ultimate plan users, and the vast majority of them paid only the basic charge. The mean monthly revenue for Ultimate plan users is 72.31 dollars, which is higher than the mean revenue for Surf plan users and only slightly above the basic charge.  
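
To see why Surf revenue spreads out while Ultimate revenue stays close to the basic charge, it helps to work through the billing arithmetic from the plan descriptions in the introduction. The function below is a minimal sketch with hypothetical names (`PLANS`, `monthly_revenue`); it assumes the minutes, messages, and GB passed in are already the billed monthly totals, so any rounding rules are outside its scope.

```python
# Plan terms taken from the introduction; the dictionary layout is an assumption
PLANS = {
    'surf': {
        'fee': 20.0, 'minutes': 500, 'messages': 50, 'gb': 15,
        'per_minute': 0.03, 'per_message': 0.03, 'per_gb': 10.0,
    },
    'ultimate': {
        'fee': 70.0, 'minutes': 3000, 'messages': 1000, 'gb': 30,
        'per_minute': 0.01, 'per_message': 0.01, 'per_gb': 7.0,
    },
}


def monthly_revenue(plan, minutes, messages, gb):
    """Return the monthly bill in dollars for already-rounded usage totals."""
    terms = PLANS[plan]
    # Only usage above the included allowance is charged extra
    extra_minutes = max(0, minutes - terms['minutes'])
    extra_messages = max(0, messages - terms['messages'])
    extra_gb = max(0, gb - terms['gb'])
    return (
        terms['fee']
        + extra_minutes * terms['per_minute']
        + extra_messages * terms['per_message']
        + extra_gb * terms['per_gb']
    )


# Same made-up usage on both plans: 600 minutes, 60 messages, 17 GB
surf_bill = monthly_revenue('surf', 600, 60, 17)
ultimate_bill = monthly_revenue('ultimate', 600, 60, 17)
print(f'Surf: {surf_bill:.2f}, Ultimate: {ultimate_bill:.2f}')
```

With this made-up usage the Surf bill comes to 43.30 dollars against a 20-dollar basic charge, while the same usage on Ultimate stays at the 70-dollar basic charge, because none of the Ultimate limits is exceeded.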

**There is a statistically significant difference in average revenue between users of the Ultimate and Surf calling plans.**

**We found no statistically significant difference between the average revenue from users in the NY-NJ area and that of the users from the other regions.**