# Introduction

In this project for the telecom company Megaline, we have been asked to identify which of the two plans offered to customers is more profitable. The result will help the marketing team adjust the advertising budget.

Using a small sample of 500 customers and their messaging, call, and Internet usage in 2018, we conduct a preliminary analysis of the revenue distribution of the customers on each plan.


*Note:* Megaline rounds seconds up to minutes and megabytes up to gigabytes.
- For calls, each individual call is rounded up: even a one-second call is counted as one minute.
- For web traffic, individual web sessions are not rounded up. Instead, the total for the month is rounded up.
    - If someone uses 1025 megabytes in a month, they are charged for 2 gigabytes.
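
The rounding rules above can be sketched as follows. This is a minimal illustration, assuming 1 GB = 1024 MB (consistent with the 1025 MB example); the helper names `billed_minutes` and `billed_gb` are hypothetical and are not used elsewhere in this notebook.

```python
import math

MB_PER_GB = 1024  # assumption: binary gigabytes, so 1025 MB bills as 2 GB


def billed_minutes(call_seconds):
    """Round each call up to whole minutes, then add the calls up."""
    return sum(math.ceil(s / 60) for s in call_seconds)


def billed_gb(session_mb):
    """Sum the month's sessions first, then round the total up to whole GB."""
    return math.ceil(sum(session_mb) / MB_PER_GB)


print(billed_minutes([1, 61, 120]))  # three calls, each rounded up separately
print(billed_gb([500.0, 525.0]))     # 1025 MB in total for the month
```

The two helpers differ in where the rounding happens: per call for minutes, per monthly total for data.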

## Changes for Notebook_V1

- Improve the organization and formatting of headings.

- Improve the use of pivot tables for grouping, summarizing, aggregating, and calculating statistics.

- Enhance the EDA and the corresponding conclusions.

- Expand the hypothesis testing section and build a deeper understanding of the method.

## Libraries 

In [1]:
import pandas as pd
import scipy as py
from scipy import stats as st 
import numpy as np
import matplotlib.pyplot as plt
import math
import statistics as stats

import re

## Load Datasets

**Datasets**

The key shared by the datasets is "user_id".

"id" identifies each individual record for a user and has the form 'user_id' + '_' + instance number, e.g. '1000_93'.

`calls_data` - one row per call, so each 'user_id' has multiple rows.

- "duration" is the length of the call; the date of each call is in "call_date".
- 137735 rows in total, covering 481 of the 500 unique "user_id" values.
<br>

`messages_data` - one row per message for each 'user_id'.
- the number of messages is the count of unique "id" values per 'user_id'.
<br>

`internet_data` - megabytes used per session, one row per 'id'.
<br>

`plans_data` - terms of both plans.
- limits for each plan
- prices for overages
<br>

`users_data`
 - 'first_name' & 'last_name'
 - 'age'
 - 'city', contains the city and state
 - 'reg_date', date of registration
 - 'tariff', plan for this 'user_id'
 - 'churn_date', "nan" for current users

In [2]:
# Load the data files into different DataFrames
calls_data = pd.read_csv(
    '/Users/ericross/Desktop/MegaLinesProject/Megaline datasets/megaline_calls.csv',
    parse_dates=['call_date']
)

In [3]:
messages_data = pd.read_csv(
    '/Users/ericross/Desktop/MegaLinesProject/Megaline datasets/megaline_messages.csv',
    parse_dates=['message_date']
)

In [4]:
internet_data = pd.read_csv(
    '/Users/ericross/Desktop/MegaLinesProject/Megaline datasets/megaline_internet.csv',
    parse_dates=['session_date']
)

In [5]:
plans_data = pd.read_csv(
    '/Users/ericross/Desktop/MegaLinesProject/Megaline datasets/megaline_plans.csv'
)

In [6]:
users_data = pd.read_csv(
    '/Users/ericross/Desktop/MegaLinesProject/Megaline datasets/megaline_users1.csv',
    parse_dates=['reg_date']
)

# Preprocessing

## Plans_data

In [7]:
# Print a sample of data for plans
plans_data


Unnamed: 0,messages_included,mb_per_month_included,minutes_included,usd_monthly_pay,usd_per_gb,usd_per_message,usd_per_minute,plan_name
0,50,15360,500,20,10,0.03,0.03,surf
1,1000,30720,3000,70,7,0.01,0.01,ultimate


## Users_data

In [8]:
# Print the general/summary information about the users' DataFrame
users_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   user_id     500 non-null    int64         
 1   first_name  500 non-null    object        
 2   last_name   500 non-null    object        
 3   age         500 non-null    int64         
 4   city        500 non-null    object        
 5   reg_date    500 non-null    datetime64[ns]
 6   tariff      500 non-null    object        
 7   churn_date  34 non-null     object        
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 31.4+ KB


In [9]:
users_data.rename(columns={'city': 'location'}, inplace=True)


In [10]:
users_data['user_id'].duplicated().sum()

0

In [11]:
users_data['reg_date'].min(),users_data['reg_date'].max()


(Timestamp('2018-01-01 00:00:00'), Timestamp('2018-12-31 00:00:00'))

### Enrich data

In [12]:
# Capture everything after the comma (the state codes plus "MSA"), then strip the "MSA" suffix
users_data['state'] = users_data['location'].str.extract(r',\s*([A-Za-z]+(?:\s*-?[A-Za-z]+)*)', expand=False)
users_data['state'] = users_data['state'].str.replace(r'\s+MSA$', '', regex=True)

users_data['state'].value_counts()


NY-NJ-PA       80
CA             78
TX             39
FL             25
IL-IN-WI       19
PA-NJ-DE-MD    17
MI             16
GA             14
WA             13
TN             12
MA-NH          12
DC-VA-MD-WV    11
MN-WI          11
AZ             11
OH              9
LA              9
CO              9
OR-WA           8
NC-SC           8
OH-KY-IN        8
SC              7
NV              7
CT              6
NY              6
IN              6
OK              6
PA              5
KY-IN           5
HI              5
AL              4
VA-NC           4
MD              4
VA              4
RI-MA           3
UT              3
WI              3
MO-IL           3
NM              2
MO-KS           2
NE-IA           2
TN-MS-AR        2
NC              2
Name: state, dtype: int64
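
The extraction can also be done in a single step. The sketch below assumes every location ends with one or more two-letter state codes followed by "MSA", as in the rows shown in this notebook; the sample strings are illustrative.

```python
import pandas as pd

location = pd.Series([
    'Atlanta-Sandy Springs-Roswell, GA MSA',
    'New York-Newark-Jersey City, NY-NJ-PA MSA',
    'Tulsa, OK MSA',
])

# Capture the state code(s) between the comma and the trailing "MSA"
state = location.str.extract(r',\s*([A-Z]{2}(?:-[A-Z]{2})*)\s+MSA$', expand=False)
print(state.tolist())
```

Locations that do not match the pattern would come back as NaN, which makes unexpected formats easy to spot with `state.isna().sum()`.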

In [13]:
users_data['month'] = users_data['reg_date'].dt.month
users_data.head()

Unnamed: 0,user_id,first_name,last_name,age,location,reg_date,tariff,churn_date,state,month
0,1000,Anamaria,Bauer,45,"Atlanta-Sandy Springs-Roswell, GA MSA",2018-12-24,ultimate,,GA,12
1,1001,Mickey,Wilkerson,28,"Seattle-Tacoma-Bellevue, WA MSA",2018-08-13,surf,,WA,8
2,1002,Carlee,Hoffman,36,"Las Vegas-Henderson-Paradise, NV MSA",2018-10-21,surf,,NV,10
3,1003,Reynaldo,Jenkins,52,"Tulsa, OK MSA",2018-01-28,surf,,OK,1
4,1004,Leonila,Thompson,40,"Seattle-Tacoma-Bellevue, WA MSA",2018-05-23,surf,,WA,5


In [14]:
users_data['tariff_id'] = np.where(users_data['tariff'] == 'ultimate', 1, 0)

In [15]:
users_data['tariff_id'].value_counts()

0    339
1    161
Name: tariff_id, dtype: int64

 **Comment:** plot the share of plans in the data, with a Pie chart. 
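
As a starting point for that chart, the share of each plan can be computed with `value_counts(normalize=True)`. This is a sketch on a toy Series standing in for `users_data['tariff_id']`; the name `plan_share` is hypothetical.

```python
import pandas as pd

# Toy stand-in for users_data['tariff_id'] (0 = surf, 1 = ultimate)
tariff_id = pd.Series([0, 0, 1, 0, 1, 0])

plan_share = (
    tariff_id.map({0: 'surf', 1: 'ultimate'})
    .value_counts(normalize=True)  # fractions instead of counts
    .mul(100)
    .round(1)
)
print(plan_share)
```

Calling `plan_share.plot.pie(autopct='%.1f%%')` on the result would draw the pie chart itself.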

### Fix Churn Date columns

In [16]:
# A missing churn_date means the user was still active when the data was extracted
users_data['churn'] = users_data['churn_date'].notna().astype(int)

users_data['churn'].value_counts()

0    466
1     34
Name: churn, dtype: int64

### Remove Unneeded Columns from Users_data

In [17]:
del users_data['tariff']
del users_data['reg_date']
del users_data['churn_date']
del users_data['location']

In [18]:
users_data.head()

Unnamed: 0,user_id,first_name,last_name,age,state,month,tariff_id,churn
0,1000,Anamaria,Bauer,45,GA,12,1,0
1,1001,Mickey,Wilkerson,28,WA,8,0,0
2,1002,Carlee,Hoffman,36,NV,10,0,0
3,1003,Reynaldo,Jenkins,52,OK,1,0,0
4,1004,Leonila,Thompson,40,WA,5,0,0


## Calls_data 

In [19]:
# Print the general/summary information about the calls' DataFrame
calls_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137735 entries, 0 to 137734
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   id         137735 non-null  object        
 1   user_id    137735 non-null  int64         
 2   call_date  137735 non-null  datetime64[ns]
 3   duration   137735 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 4.2+ MB


**Comment:** useful info needed from calls_data 

- total and average call duration for each user. 

- add a call_count for each user. 
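
One way to get these per-user figures is a `groupby` with named aggregations. The sketch below uses a small made-up frame with the same column names as `calls_data`; the output column names are hypothetical.

```python
import pandas as pd

calls = pd.DataFrame({
    'user_id': [1000, 1000, 1001, 1001, 1001],
    'duration': [8.52, 13.66, 0.0, 5.76, 4.22],
})

# One row per user: number of calls, total duration, and average duration
call_summary = calls.groupby('user_id')['duration'].agg(
    call_count='count',
    total_duration='sum',
    avg_duration='mean',
)
print(call_summary)
```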



In [20]:
calls_data['user_id'].duplicated().sum()

137254

### Enrich data

In [21]:
calls_data['month'] = calls_data['call_date'].dt.month

In [22]:
calls_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137735 entries, 0 to 137734
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   id         137735 non-null  object        
 1   user_id    137735 non-null  int64         
 2   call_date  137735 non-null  datetime64[ns]
 3   duration   137735 non-null  float64       
 4   month      137735 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 5.3+ MB


##### Rounding each call up to a whole minute and counting zero-duration calls as 1 minute

In [23]:
calls_data['duration'] = np.ceil(calls_data['duration'])

# Zero-duration calls are counted as one minute here
calls_data.loc[calls_data['duration'] == 0, 'duration'] = 1

calls_data.describe()

Unnamed: 0,user_id,duration,month
count,137735.0,137735.0,137735.0
mean,1247.658046,7.341496,9.320797
std,139.416268,5.728989,2.41255
min,1000.0,1.0,1.0
25%,1128.0,2.0,8.0
50%,1247.0,6.0,10.0
75%,1365.0,11.0,11.0
max,1499.0,38.0,12.0


## Messages_data

In [24]:
# Print the general/summary information about the messages' DataFrame

messages_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76051 entries, 0 to 76050
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            76051 non-null  object        
 1   user_id       76051 non-null  int64         
 2   message_date  76051 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 1.7+ MB


In [25]:
# Print a sample of data for messages

messages_data.sample(5)

Unnamed: 0,id,user_id,message_date
27075,1167_32,1167,2018-05-25
22307,1133_318,1133,2018-11-21
56165,1355_218,1355,2018-10-15
63012,1399_77,1399,2018-10-21
48532,1326_69,1326,2018-10-21


### Enrich data

In [26]:
messages_data['month'] = messages_data['message_date'].dt.month

In [27]:
messages_data.sample(5)

Unnamed: 0,id,user_id,message_date,month
27422,1169_274,1169,2018-12-20,12
64657,1412_239,1412,2018-10-12,10
31521,1196_0,1196,2018-01-20,1
49954,1328_912,1328,2018-12-18,12
17455,1114_387,1114,2018-10-29,10


## Internet_data

In [28]:
# Print the general/summary information about the internet DataFrame

internet_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104825 entries, 0 to 104824
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   id            104825 non-null  object        
 1   user_id       104825 non-null  int64         
 2   session_date  104825 non-null  datetime64[ns]
 3   mb_used       104825 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 3.2+ MB


In [29]:
# Print a sample of data for the internet traffic

internet_data.head()

Unnamed: 0,id,user_id,session_date,mb_used
0,1000_13,1000,2018-12-29,89.86
1,1000_204,1000,2018-12-31,0.0
2,1000_379,1000,2018-12-28,660.4
3,1000_413,1000,2018-12-26,270.99
4,1000_442,1000,2018-12-27,880.22


### Enrich data

In [30]:
internet_data['month'] = internet_data['session_date'].dt.month

In [31]:
internet_data.head()

Unnamed: 0,id,user_id,session_date,mb_used,month
0,1000_13,1000,2018-12-29,89.86,12
1,1000_204,1000,2018-12-31,0.0,12
2,1000_379,1000,2018-12-28,660.4,12
3,1000_413,1000,2018-12-26,270.99,12
4,1000_442,1000,2018-12-27,880.22,12


# EDA

## Create Tables

### User_Info

In [32]:
users_plan = pd.pivot_table(users_data, values=['tariff_id', 'state'], index=['user_id', 'month'], aggfunc={'tariff_id': 'sum', 'state': 'first'})

users_plan

Unnamed: 0_level_0,Unnamed: 1_level_0,state,tariff_id
user_id,month,Unnamed: 2_level_1,Unnamed: 3_level_1
1000,12,GA,1
1001,8,WA,0
1002,10,NV,0
1003,1,OK,0
1004,5,WA,0
...,...,...,...
1495,9,NY-NJ-PA,0
1496,2,LA,0
1497,12,CA,1
1498,2,NY-NJ-PA,0


### Internet_monthly

In [33]:
internet_monthly = pd.pivot_table(internet_data, values=['mb_used'], index=['user_id', 'month'], aggfunc='sum')

# Convert the monthly total to GB, round it up, then drop 'mb_used'.
# Note: this divides by 1000, while the plan limits (15360 MB, 30720 MB) and the
# 1025 MB example in the introduction imply 1 GB = 1024 MB.
internet_monthly['gb_used'] = np.ceil(internet_monthly['mb_used'] / 1000)
internet_monthly.drop('mb_used', axis=1, inplace=True)

internet_monthly.sample(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,gb_used
user_id,month,Unnamed: 2_level_1
1028,12,38.0
1156,7,24.0
1044,12,16.0
1131,8,13.0
1183,12,21.0
1298,12,18.0
1167,8,9.0
1460,10,30.0
1399,10,25.0
1077,11,23.0


### Messages_monthly

In [34]:
messages_monthly = messages_data.pivot_table(values='id', index=['user_id','month'], aggfunc='count', fill_value=0)
messages_monthly.columns = ['mess_count']

messages_monthly.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,mess_count
user_id,month,Unnamed: 2_level_1
1000,12,11
1001,8,30
1001,9,44
1001,10,53
1001,11,36


### Calls_monthly

In [35]:
calls_monthly = pd.pivot_table(calls_data, values=['duration'], index=['user_id', 'month'], aggfunc='sum')

calls_monthly.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,duration
user_id,month,Unnamed: 2_level_1
1000,12,124.0
1001,8,187.0
1001,9,326.0
1001,10,411.0
1001,11,441.0
1001,12,422.0
1002,10,62.0
1002,11,393.0
1002,12,393.0
1003,12,1135.0


In [36]:
monthly_stats = pd.concat(
    [messages_monthly, internet_monthly, calls_monthly, users_plan], axis=1)

monthly_stats.loc[:, [
    'mess_count', 'gb_used', 'duration'
]] = monthly_stats.loc[:, ['mess_count', 'gb_used', 'duration']].fillna(0)

monthly_stats.ffill(inplace=True)

monthly_stats.isna().sum()

mess_count    0
gb_used       0
duration      0
state         0
tariff_id     0
dtype: int64
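
An alternative to concatenating `users_plan` and forward-filling is to attach the per-user attributes with a merge on `user_id`. The sketch below uses toy data shaped like the tables above; it does not depend on row order and does not create a zero-usage row for the registration month. The variable names are hypothetical.

```python
import pandas as pd

# Toy monthly usage indexed by (user_id, month), like the pivot tables above
usage = pd.DataFrame(
    {'duration': [187.0, 326.0, 124.0]},
    index=pd.MultiIndex.from_tuples(
        [(1001, 8), (1001, 9), (1000, 12)], names=['user_id', 'month']
    ),
)

# One row per user with the static attributes
users = pd.DataFrame(
    {'user_id': [1000, 1001], 'state': ['GA', 'WA'], 'tariff_id': [1, 0]}
)

monthly = (
    usage.reset_index()
    .merge(users, on='user_id', how='left', validate='many_to_one')
    .set_index(['user_id', 'month'])
    .sort_index()
)
print(monthly)
```

`validate='many_to_one'` makes the merge fail loudly if a `user_id` appears more than once in the user table.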

### plan conditions

In [37]:
# Plan conditions, with the included data also expressed in GB (here 1 GB is taken as 1000 MB)
plans_data['gb_per_month_included'] = plans_data['mb_per_month_included']/1000

plans_data

Unnamed: 0,messages_included,mb_per_month_included,minutes_included,usd_monthly_pay,usd_per_gb,usd_per_message,usd_per_minute,plan_name,gb_per_month_included
0,50,15360,500,20,10,0.03,0.03,surf,15.36
1,1000,30720,3000,70,7,0.01,0.01,ultimate,30.72


In [38]:
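# Plan parameters, hard-coded for the calculations below.
# Note: plans_data lists 0.03 USD per extra minute and per extra message for surf,
# and 15360 MB / 30720 MB of included data (15 / 30 GB at 1024 MB per GB);
# the outputs shown in this notebook were produced with the values below.
surf_monthly_cost = 20
ulti_monthly_cost = 70

surf_mins_lim = 500
ulti_mins_lim = 3000
surf_min_overage = 0.3
ulti_min_overage = 0.01

surf_mess_lim = 50
ulti_mess_lim = 1000
surf_mess_overage = 0.3
ulti_mess_overage = 0.01

surf_gb_lim = 16
ulti_gb_lim = 31
surf_gb_overage = 10
ulti_gb_overage = 7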
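
Reading the limits and rates from `plans_data` instead of typing them in keeps them in sync with the table, which lists 0.03 USD per extra minute and per extra message for Surf and 15360 MB of included data. The sketch below rebuilds that table by hand and assumes 1 GB = 1024 MB; the column `gb_included` is a hypothetical name.

```python
import pandas as pd

# Same values as the plans_data table shown earlier in this notebook
plans = pd.DataFrame({
    'plan_name': ['surf', 'ultimate'],
    'messages_included': [50, 1000],
    'mb_per_month_included': [15360, 30720],
    'minutes_included': [500, 3000],
    'usd_monthly_pay': [20, 70],
    'usd_per_gb': [10, 7],
    'usd_per_message': [0.03, 0.01],
    'usd_per_minute': [0.03, 0.01],
}).set_index('plan_name')

# Assumption: 1 GB = 1024 MB, so 15360 MB is exactly 15 GB
plans['gb_included'] = plans['mb_per_month_included'] // 1024

surf = plans.loc['surf']
print(surf['minutes_included'], surf['usd_per_minute'], surf['gb_included'])
```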


### Separate Surf and Ultimate Tables

In [39]:
stats_surf = monthly_stats.query('tariff_id == 0').copy()
stats_ulti = monthly_stats.query('tariff_id == 1').copy()

## Calculating Overages and Fees

### Mins over

In [40]:
stats_surf.loc[:, 'min_over'] = stats_surf['duration'] - surf_mins_lim
stats_ulti.loc[:, 'min_over'] = stats_ulti['duration'] - ulti_mins_lim

stats_surf.loc[stats_surf['min_over'] < 0, 'min_over'] = 0
stats_ulti.loc[stats_ulti['min_over'] < 0, 'min_over'] = 0

### Messages over 

In [41]:
stats_surf.loc[:, 'mess_over'] = stats_surf['mess_count'] - surf_mess_lim
stats_ulti.loc[:, 'mess_over'] = stats_ulti['mess_count'] - ulti_mess_lim

stats_surf.loc[stats_surf['mess_over'] < 0, 'mess_over'] = 0
stats_ulti.loc[stats_ulti['mess_over'] < 0, 'mess_over'] = 0

### GB over

In [42]:
stats_surf.loc[:, 'gb_over'] = stats_surf['gb_used'] - surf_gb_lim
stats_ulti.loc[:, 'gb_over'] = stats_ulti['gb_used'] - ulti_gb_lim

stats_surf.loc[stats_surf['gb_over'] < 0, 'gb_over'] = 0
stats_ulti.loc[stats_ulti['gb_over'] < 0, 'gb_over'] = 0
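
The subtract-then-zero-out pattern used for minutes, messages, and data can be written in one step with `clip(lower=0)`. A minimal sketch on a toy Series, using the Surf data limit from this notebook:

```python
import pandas as pd

gb_used = pd.Series([7.0, 14.0, 23.0, 19.0])
gb_limit = 16  # the Surf data limit used in this notebook

# Usage above the limit, with anything below the limit clipped to zero
gb_over = (gb_used - gb_limit).clip(lower=0)
print(gb_over.tolist())
```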

### base cost for plan

In [43]:
stats_surf['base_cost']=20
stats_ulti['base_cost']=70

### Cost calculations

In [44]:
stats_surf['min_costs'] = stats_surf['min_over'] * surf_min_overage

stats_surf['mess_costs'] = stats_surf['mess_over'] * surf_mess_overage

stats_surf['gb_costs'] = stats_surf['gb_over'] * surf_gb_overage

stats_surf['monthly_cost'] = stats_surf[['base_cost', 'min_costs', 'mess_costs', 'gb_costs']].sum(axis=1)

stats_surf.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,mess_count,gb_used,duration,state,tariff_id,min_over,mess_over,gb_over,base_cost,min_costs,mess_costs,gb_costs,monthly_cost
user_id,month,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1001,8,30.0,7.0,187.0,WA,0.0,0.0,0.0,0.0,20,0.0,0.0,0.0,20.0
1001,9,44.0,14.0,326.0,WA,0.0,0.0,0.0,0.0,20,0.0,0.0,0.0,20.0
1001,10,53.0,23.0,411.0,WA,0.0,0.0,3.0,7.0,20,0.0,0.9,70.0,90.9
1001,11,36.0,19.0,441.0,WA,0.0,0.0,0.0,3.0,20,0.0,0.0,30.0,50.0
1001,12,44.0,20.0,422.0,WA,0.0,0.0,0.0,4.0,20,0.0,0.0,40.0,60.0


In [45]:
stats_ulti['min_costs'] = stats_ulti['min_over'] * ulti_min_overage

stats_ulti['mess_costs'] = stats_ulti['mess_over'] * ulti_mess_overage

stats_ulti['gb_costs'] = stats_ulti['gb_over'] * ulti_gb_overage

stats_ulti['monthly_cost'] = stats_ulti[['base_cost', 'min_costs', 'mess_costs', 'gb_costs']].sum(axis=1)

stats_ulti.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,mess_count,gb_used,duration,state,tariff_id,min_over,mess_over,gb_over,base_cost,min_costs,mess_costs,gb_costs,monthly_cost
user_id,month,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1000,12,11.0,2.0,124.0,GA,1.0,0.0,0.0,0.0,70,0.0,0.0,0.0,70.0
1006,11,15.0,3.0,10.0,CA,1.0,0.0,0.0,0.0,70,0.0,0.0,0.0,70.0
1006,12,139.0,33.0,61.0,CA,1.0,0.0,0.0,2.0,70,0.0,0.0,14.0,84.0
1007,8,51.0,25.0,470.0,CA,1.0,0.0,0.0,0.0,70,0.0,0.0,0.0,70.0
1007,9,47.0,29.0,413.0,CA,1.0,0.0,0.0,0.0,70,0.0,0.0,0.0,70.0


### Revenue Calculations

In [46]:
# customer counts
surf_total_customers = (users_data['tariff_id'] == 0).sum()
ulti_total_customers = (users_data['tariff_id'] == 1).sum()

total_customers = len(users_data)

print(f'Total customers: {total_customers} \n')
print(f'Surf total customers: {surf_total_customers}\n Share: {surf_total_customers/total_customers*100:.2f}% \n')
print(f'Ultimate total customers: {ulti_total_customers}\n Share: {ulti_total_customers/total_customers*100:.2f}% \n')



# Revenue calc.
surf_total_revenue = stats_surf['monthly_cost'].sum()
ulti_total_revenue = stats_ulti['monthly_cost'].sum()
total_revenue = surf_total_revenue + ulti_total_revenue

print(f'Total Revenue: {total_revenue:.2f} $\n')

print(
    f'Surf total revenue: {surf_total_revenue:.2f} $ \n Share: {(surf_total_revenue/total_revenue)*100:.2f}%\n'
)

print(
    f'Ultimate total revenue: {ulti_total_revenue:.2f} $ \n Share: {(ulti_total_revenue/total_revenue)*100:.2f}%\n'
)

Total customers: 500 

Surf total customers: 339
 Share: 67.80% 

Ultimate total customers: 161
 Share: 32.20% 

Total Revenue: 183077.80 $

Surf total revenue: 116773.80 $ 
 Share: 63.78%

Ultimate total revenue: 66304.00 $ 
 Share: 36.22%



# LEFT OFF - June, 25th 

## Study user behaviour

### Calls

In [None]:
monthly_stats['tariff_id'].astype('int')

In [None]:
# stats_surf and stats_ulti already hold the monthly rows of each plan, including 'monthly_cost'
surf_monthly = stats_surf
ult_monthly = stats_ulti

In [None]:
ult_monthly['monthly_cost']

In [None]:
# Compare the average number of minutes used per month on each plan. Plot a bar chart.
surf_month_avg_min = surf_monthly.groupby(['month'])['duration'].agg(Surf_avg='mean')
ult_month_avg_min = ult_monthly.groupby(['month'])['duration'].agg(Ult_avg='mean')


In [None]:
plan_avg_month_min = surf_month_avg_min.merge(ult_month_avg_min, on="month")
plan_avg_month_min.plot(kind='bar', title = 'Average Minute per Plan', width=.7,xlabel='Month', ylabel='Avg Minutes',grid=True)
plt.legend(loc=3)
plt.show()

In [None]:
# Calculate the mean and the variance of the monthly call duration
plan_avg_month_min.mean()


In [None]:
plan_avg_month_min.var()
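
The two cells above summarize the twelve monthly averages. To get the mean and variance of the monthly call duration itself (one value per user and month), the statistics can be computed directly per plan. A sketch on toy data with the same column names; the frame `monthly` is made up for illustration.

```python
import pandas as pd

monthly = pd.DataFrame({
    'tariff_id': [0, 0, 0, 1, 1, 1],
    'duration': [187.0, 326.0, 411.0, 124.0, 470.0, 413.0],
})

# Mean and sample variance (ddof=1) of monthly minutes, per plan
duration_stats = monthly.groupby('tariff_id')['duration'].agg(['mean', 'var'])
print(duration_stats)
```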

In [None]:
# Plot a boxplot to visualize the distribution of the monthly call duration
plan_avg_month_min.boxplot(showmeans=True, figsize=(3,4), widths=0.55, color='black')
plt.title('avg. min. used by plan')
plt.ylabel('minutes')
plt.show()


### Calls - conclusion

Users of both plans have similar average minutes per month, and most Surf users stay below their package limit.

However, looking at the distribution per plan, the Ultimate plan's monthly averages are higher and more tightly clustered around 400 minutes, and none exceed the package limit.

Most users do not exceed their package limit for minutes.

### Messages

In [None]:
# Compare the number of messages users of each plan tend to send each month
surf_monthly['mess_count'].mean()

In [None]:
ult_monthly['mess_count'].mean()

In [None]:
surf_month_avg_mess = surf_monthly.groupby('month')['mess_count'].agg(Surf_avg='mean')
ult_month_avg_mess = ult_monthly.groupby('month')['mess_count'].agg(Ult_avg='mean')

In [None]:
plan_avg_month_mess = surf_month_avg_mess.merge(ult_month_avg_mess, on="month")
plan_avg_month_mess.plot(kind='bar', title = 'Average messages per Month', width=.7,xlabel='Month', ylabel='Avg messages',grid=True)
plt.legend(loc=4)
plt.show()

In [None]:
plan_avg_month_mess.boxplot(showmeans=True, widths=0.55, figsize=(3,4), color='black')
plt.title('messages sent by plan')
plt.ylabel('messages')
plt.show()

### Messages - conclusion

Surf users send fewer messages per month than Ultimate users.

On average, users of both plans stay below the limit of their respective package.

### Internet

In [None]:
# Compare the amount of internet traffic consumed by users per plan
surf_month_avg_gb = surf_monthly.groupby(['month'])['gb_used'].agg(Surf_avg='mean')
ult_month_avg_gb = ult_monthly.groupby(['month'])['gb_used'].agg(Ult_avg='mean')

In [None]:
plan_avg_month_gb = surf_month_avg_gb.merge(ult_month_avg_gb, on="month")
plan_avg_month_gb.round(1)

In [None]:
plan_avg_month_gb.plot(kind='bar', title = 'average gb used per Plan', width=.8,xlabel='Month', ylabel='GB',grid=True)
plt.legend(loc=2)
plt.show()

In [None]:
plan_avg_month_gb = surf_month_avg_gb.merge(ult_month_avg_gb, on="month")

In [None]:
plan_avg_month_gb.boxplot(showmeans=True, widths=0.55, figsize=(3,4), color='black')
plt.title('gb used by plan')
plt.ylabel('GB')
plt.show()

In [None]:
plan_avg_month_gb.mean()

### GB - conclusion

Surf users tend to use less data than Ultimate users; however, Surf users exceed their plan's limit more often.

Users rarely use more than 20 GB of data in a month.

### Revenue

In [None]:
surf_month_avg_rev = surf_monthly.groupby(['month'])['monthly_cost'].agg(Surf_avg='mean')
ult_month_avg_rev = ult_monthly.groupby(['month'])['monthly_cost'].agg(Ult_avg='mean')

In [None]:
plan_avg_month_rev = surf_month_avg_rev.merge(ult_month_avg_rev, on="month")
plan_avg_month_rev.plot(kind='bar', title = 'Average Revenue per Plan', width=.8, xlabel='Month', ylabel='Revenue',grid=True)
plt.legend(loc=2)
plt.show()

In [None]:
plan_avg_month_rev.boxplot(showmeans=True, widths=0.55, figsize=(3,4), color='black')
plt.title('Average revenue by plan')
plt.ylabel('Revenue($)')
plt.show()

In [None]:
surf_total_rev = surf_monthly['monthly_cost'].sum()
surf_total_rev

In [None]:
ult_total_rev = ult_monthly['monthly_cost'].sum()
ult_total_rev

In [None]:
# Number of billed user-months on each plan
surf_month_count = surf_monthly['monthly_cost'].count()
surf_month_count

In [None]:
ult_month_count = ult_monthly['monthly_cost'].count()
ult_month_count

In [None]:
ult_month_count / surf_month_count * 100

In [None]:
(ult_total_rev / surf_total_rev) * 100

### Revenue - conclusion

The Ultimate plan brings in a steady amount of around 70 dollars per user per month, not much above the base cost.

Surf users pay around 50 dollars per month, which is 30 dollars more than the base cost, meaning that over half of the Surf revenue comes from overages.

The Surf plan generates more revenue in total because it has more users; however, its average revenue per user is much lower.

The Ultimate plan brings in more revenue per user while having less total revenue.


# Test statistical hypotheses


### Hypothesis

 **H0**
    - There is *no difference* between the average revenue of users of the two plans.

**H1**
    - The average revenue of users of the two plans is *significantly different*.

In [None]:
# stats_surf and stats_ulti hold the monthly revenue of each plan in 'monthly_cost'
sample_stats_surf = stats_surf

sample_stats_ulti = stats_ulti

In [None]:
rev_mean_surf = sample_stats_surf['monthly_cost'].mean()
rev_var_surf = sample_stats_surf['monthly_cost'].var()

In [None]:
rev_mean_ulti = sample_stats_ulti['monthly_cost'].mean()
rev_var_ulti = sample_stats_ulti['monthly_cost'].var()

In [None]:
print('Surf variance:', rev_var_surf, 'Ultimate variance:', rev_var_ulti)

In [None]:
# Welch's t-test (equal_var=False) does not assume equal variances in the two groups
plan_rev_result = st.ttest_ind(sample_stats_surf['monthly_cost'], sample_stats_ulti['monthly_cost'], equal_var=False)
plan_rev_result.pvalue

In [None]:
alpha=0.02

In [None]:
# Test the hypotheses (H1 is two-sided, so the p-value is used as returned)
print('p-value:', plan_rev_result.pvalue)

if plan_rev_result.pvalue < alpha:
    print("the null hypothesis should be rejected")
else:
    print("the null hypothesis cannot be rejected")
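
The decision rule can be tried end to end on synthetic numbers. The sketch below uses made-up revenue samples (not the Megaline data) and the same `alpha`; because H1 states only that the means differ, the two-sided p-value from `ttest_ind` is compared with `alpha` directly.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)
# Synthetic stand-ins for the monthly revenue of each plan (not real data)
surf_rev = rng.normal(loc=60, scale=50, size=300).clip(min=20)
ulti_rev = rng.normal(loc=72, scale=10, size=150).clip(min=70)

alpha = 0.02
result = st.ttest_ind(surf_rev, ulti_rev, equal_var=False)  # Welch's t-test

# Two-sided alternative: use the p-value as returned, without halving it
reject_h0 = result.pvalue < alpha
print(f'p-value: {result.pvalue:.4g}, reject H0: {reject_h0}')
```

Halving the p-value would only be appropriate for a one-sided alternative, and then only after checking that the observed difference points in the hypothesized direction.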

In [None]:
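# Combine both plans, then split by region.
# Assumption: the NY-NJ area is the 'NY-NJ-PA' metropolitan area in the 'state' column
all_stats = pd.concat([stats_surf, stats_ulti])
is_ny_nj = all_stats['state'].str.contains('NY-NJ')
sample_stats_NY = all_stats[is_ny_nj]
sample_stats_other = all_stats[~is_ny_nj]

In [None]:
rev_mean_NY = sample_stats_NY['monthly_cost'].mean()
rev_var_NY = sample_stats_NY['monthly_cost'].var()

In [None]:
rev_mean_other = sample_stats_other['monthly_cost'].mean()
rev_var_other = sample_stats_other['monthly_cost'].var()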

In [None]:
print('NY area variance:', rev_var_NY, 'other area variance:', rev_var_other)

In [None]:
rev_var_other/rev_var_NY 

In [None]:
NY_rev_results = st.ttest_ind(sample_stats_NY['monthly_cost'], sample_stats_other['monthly_cost'], equal_var=True)
NY_rev_results.pvalue

In [None]:
# Test the hypotheses (two-sided, so the p-value is used as returned)
print('p-value:', NY_rev_results.pvalue)

if NY_rev_results.pvalue < alpha:
    print("the null hypothesis should be rejected")
else:
    print("the null hypothesis cannot be rejected")

The average revenue from users in the NY-NJ area does not differ enough from that of users in other regions to reject the null hypothesis that the two averages are equal.

# General conclusion

**Metrics Conclusion**

- Minutes
    - Users of both plans tend to stay below their call limits.
- Messages
    - Again, users do not often go over their plan's limit.
- GB
    - Data is where users tend to exceed the limits of their plan.
        - This is extremely common among Surf plan users.
        - Data overage fees are the main source of revenue for the Surf plan.
    - Users rarely exceed 20 GB, which is below the Ultimate plan's limit.
    
**Plan Conclusions** 

- *Surf* currently brings in more revenue in total
    - roughly $116,800, or about 64% of the overall revenue.
    - a result of having about twice as many users as Ultimate (339 vs. 161).

- *Ultimate* is the more profitable plan per user.
    - It makes up about 36% of the total revenue
    - while having about a third (32%) of the total users.

**Final Conclusion**

- The plan that generates **the most revenue per user is the Ultimate plan**.
    - **The biggest factor in each plan is the base cost.** Despite the fact that many Surf users exceed their data limit, they still pay less than the base cost of an Ultimate user.

