In [17]:
# Some basic package imports
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

In [11]:
user_usage = pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/user_usage.csv')
user_device = pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/user_device.csv')
devices = pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/android_devices.csv')

In [22]:
def clean_cols(df):
    df.columns = (
        df.columns.str.strip()
                .str.lower()
                .str.replace(" ", "_")
    )
    return df
    
user_usage = clean_cols(user_usage)
user_device = clean_cols(user_device)
devices = clean_cols(devices)

user_usage.head(), user_device.head(), devices.head()

(   outgoing_mins_per_month  outgoing_sms_per_month  monthly_mb  use_id
 0                    21.97                    4.82     1557.33   22787
 1                  1710.08                  136.88     7267.55   22788
 2                  1710.08                  136.88     7267.55   22789
 3                    94.46                   35.17      519.12   22790
 4                    71.59                   79.26     1557.33   22792,
    use_id  user_id platform  platform_version     device  use_type_id
 0   22782    26980      ios              10.2  iPhone7,2            2
 1   22783    29628  android               6.0    Nexus 5            3
 2   22784    28473  android               5.1   SM-G903F            1
 3   22785    15200      ios              10.2  iPhone7,2            3
 4   22786    28239  android               6.0  ONE E1003            1,
   retail_branding marketing_name    device                      model
 0             NaN            NaN    AD681H  Smartfren Andromax AD681

**Part 1: Merge usage and user device**

In [18]:
usage_device = pd.merge(
    user_usage, user_device,
    on="use_id", how="inner"
)

full = pd.merge(
    usage_device, devices,
    on="device", how="left"  
)

full = full.rename(columns={"retail_branding":"manufacturer"})

def manufacturer_clean_fn(x):
    x = str(x).strip().lower()
    if not x or x == "nan": return None
    if "samsung" in x:      return "Samsung"
    if x.startswith("lg"):  return "LGE"
    if x in {"oneplus","one+","1+","one plus"}: return "Oneplus"
    return x.capitalize()

full["manufacturer_clean"] = full["manufacturer"].apply(manufacturer_clean_fn)

full.head()

| | outgoing_mins_per_month | outgoing_sms_per_month | monthly_mb | use_id | user_id | platform | platform_version | device | use_type_id | manufacturer | marketing_name | model | manufacturer_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21.97 | 4.82 | 1557.33 | 22787 | 12921 | android | 4.3 | GT-I9505 | 1 | | | | |
| 1 | 1710.08 | 136.88 | 7267.55 | 22788 | 28714 | android | 6.0 | SM-G930F | 1 | | | | |
| 2 | 1710.08 | 136.88 | 7267.55 | 22789 | 28714 | android | 6.0 | SM-G930F | 1 | | | | |
| 3 | 94.46 | 35.17 | 519.12 | 22790 | 29592 | android | 5.1 | D2303 | 1 | Sony | Xperia M2 | D2303 | Sony |
| 4 | 71.59 | 79.26 | 1557.33 | 22792 | 28217 | android | 5.1 | SM-G361F | 1 | | | | |


**Q1: Does platform impact monthly MB used?**

**Prediction:** Different platforms will have different mean and median MB.

In [27]:
q1 = (
    full.groupby("platform")["monthly_mb"]
        .agg(["count", "mean", "median", "std"])
        .sort_values("mean", ascending=False)
        .round(2)
)
q1

| platform | count | mean | median | std |
|---|---|---|---|---|
| android | 159 | 4364.18 | 2076.45 | 5356.13 |
| ios | 2 | 961.16 | 961.16 | 438.74 |


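**Explanation:** Mean and median monthly MB differ by platform in this sample: Android averages 4364.18 MB (median 2076.45), while iOS averages 961.16 MB (median 961.16). The iOS group has only 2 rows, though, so the gap is suggestive rather than conclusive.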

**Q2: Do Sony users have more outgoing minutes than OnePlus users?**

**Prediction:** Sony will likely have higher averages, but OnePlus may differ in medians.

In [28]:
q2 = ['Sony', 'Oneplus']

comparison = (
    full[full['manufacturer_clean'].isin(q2)]
        .groupby('manufacturer_clean')['outgoing_mins_per_month']
        .agg(['count','mean','median','std'])
        .round(2)
)
comparison

| manufacturer_clean | count | mean | median | std |
|---|---|---|---|---|
| Oneplus | 4 | 170.4 | 170.4 | 130.37 |
| Sony | 13 | 143.7 | 99.23 | 112.09 |


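**Explanation:** The table compares Oneplus and Sony on outgoing minutes per month. Oneplus is higher on both the mean (170.4 vs 143.7) and the median (170.4 vs 99.23), so the prediction that Sony would average more is not supported here. With only 4 Oneplus rows against 13 Sony rows, the difference should be read cautiously.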

**Q3: Which manufacturers use the most monthly MB?**

**Prediction:** Larger brands will have a far larger average MB, but smaller niche brands may have some spikes and dips.

In [29]:
q3 = (
    full.groupby("manufacturer_clean")
        .agg(n_users=("use_id","nunique"),
             avg_mb=("monthly_mb","mean"),
             med_mb=("monthly_mb","median"))
        .round(2)
        .sort_values("avg_mb", ascending=False)
)
q3.head(10)

| manufacturer_clean | n_users | avg_mb | med_mb |
|---|---|---|---|
| Oneplus | 2 | 15573.33 | 15573.33 |
| Lava | 2 | 12458.67 | 12458.67 |
| Sony | 13 | 2715.35 | 2076.45 |


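**Explanation:** This ranking partly matches my prediction: average MB usage differs sharply between manufacturers, but in the rows shown the highest averages belong to Oneplus and Lava, each with only 2 users, so those figures may be spikes rather than stable brand-level usage. Sony, with 13 users, averages 2715.35 MB. The `avg_mb` column shows which brands’ users consume the most data on average, while `med_mb` shows the typical usage per brand.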

**What columns can be used for merging? Any to rename?**
  - `use_id` — in `user_usage` and `user_device` (links usage rows to a user’s device record).
  - `device` — in `user_device` and `devices` (links the user’s device code to the device catalog).
  - Rename `devices.retail_branding` to `manufacturer` (clearer name for brand).
  - Create `manufacturer_clean` from `manufacturer` to normalize labels for grouping.
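
The column choices above can be checked before merging. The sketch below uses small made-up frames, not the real CSVs, and two hypothetical helpers (`shared_columns`, `key_coverage`) to list candidate keys and measure how many key values actually find a partner. Since most rows in the `full.head()` output have an empty manufacturer, comparing coverage against the catalog's `device` and `model` columns may be worth doing on the real data; whether `model` matches better there is an assumption to verify, not a finding.

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are invented for illustration.
usage = pd.DataFrame({"use_id": [1, 2, 3], "monthly_mb": [500.0, 1200.0, 300.0]})
user_dev = pd.DataFrame({"use_id": [2, 3, 4], "device": ["D2303", "SM-G930F", "A0001"]})
catalog = pd.DataFrame({
    "retail_branding": ["Sony", "Samsung"],
    "device": ["D2303", "herolte"],
    "model": ["D2303", "SM-G930F"],
})

def shared_columns(left, right):
    """Return column names present in both frames, in left-frame order."""
    return [c for c in left.columns if c in right.columns]

def key_coverage(left, left_key, right, right_key):
    """Fraction of left-key values that also occur in the right key column."""
    return left[left_key].isin(right[right_key]).mean()

print(shared_columns(usage, user_dev))    # candidate key for the first merge
print(shared_columns(user_dev, catalog))  # candidate key for the second merge

# Compare how well user device codes match each catalog column.
cov_device = key_coverage(user_dev, "device", catalog, "device")
cov_model = key_coverage(user_dev, "device", catalog, "model")
print(f"coverage vs catalog.device: {cov_device:.2f}")
print(f"coverage vs catalog.model:  {cov_model:.2f}")
```

A low coverage value means a left join on that key will leave most brand fields as NaN.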

**What I’m doing in the merge and why**
1) **`user_usage`, `user_device` on `use_id`, `how='inner'`**  
   - I start by keeping only the rows where I have both usage data and a device record. The inner join gives me a clean cohort so platform/device fields aren’t missing later.

2) **(result), `devices` on `device`, `how='left'`**  
   - Next I add the catalog info by device code, but I use a left join so I don’t lose users if their device isn’t in the catalog. Those rows stay, and the brand just shows up as NaN if it’s missing.

3) **Cleanup for analysis**  
   - I rename `retail_branding` to `manufacturer`, then create `manufacturer_clean` to normalize brand labels so my group-bys don’t split the same brand.
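
The inner-then-left pattern can be seen on a small example. This is a minimal sketch on invented data, not the notebook's actual tables. It adds two optional `pd.merge` arguments the notebook does not use: `validate`, which raises if the key is not unique on both sides, and `indicator=True`, which adds a `_merge` column showing whether each row found a catalog match.

```python
import pandas as pd

# Toy data, invented for illustration only.
usage = pd.DataFrame({"use_id": [1, 2, 3], "monthly_mb": [500.0, 1200.0, 300.0]})
user_dev = pd.DataFrame({"use_id": [2, 3, 4], "device": ["D2303", "X999", "A0001"]})
catalog = pd.DataFrame({"device": ["D2303", "A0001"],
                        "retail_branding": ["Sony", "OnePlus"]})

# Step 1: the inner join keeps only use_ids present in both tables.
step1 = pd.merge(usage, user_dev, on="use_id", how="inner", validate="one_to_one")

# Step 2: the left join keeps every row from step 1; rows whose device is not
# in the catalog get NaN for retail_branding and "left_only" in `_merge`.
step2 = pd.merge(step1, catalog, on="device", how="left", indicator=True)

unmatched = (step2["_merge"] == "left_only").sum()
print(step2[["use_id", "device", "retail_branding", "_merge"]])
print(f"{unmatched} of {len(step2)} rows have no catalog match")
```

Counting `left_only` rows is a quick way to see how much of the cohort will fall into the NaN manufacturer group in later group-bys.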

**What I’m doing in the Q3 grouping and why**

I group by `manufacturer_clean` and compute three things: the number of users, the average monthly data, and the median. Then I sort by `avg_mb` to see which brands use the most data on average. The `q3` code does not filter on a minimum sample size, so brands with only a couple of users (Oneplus and Lava have 2 each) can still top the ranking; I keep `n_users` in the table so those rows can be read with caution. Reporting both mean and median lets me see whether a high average is being driven by a few heavy users or whether typical usage is also high.
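
A minimum-sample-size filter on this kind of ranking could be sketched as follows. The data is invented, and both the `MIN_USERS` threshold and the `mean_minus_median` column are hypothetical additions for illustration, not part of the notebook's `q3` code.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "use_id": [1, 2, 3, 4, 5, 6],
    "manufacturer_clean": ["Sony", "Sony", "Sony", "Sony", "Oneplus", "Oneplus"],
    "monthly_mb": [1000.0, 2000.0, 3000.0, 10000.0, 15000.0, 16000.0],
})

MIN_USERS = 3  # hypothetical threshold; pick one that suits the data

summary = (
    toy.groupby("manufacturer_clean")
       .agg(n_users=("use_id", "nunique"),
            avg_mb=("monthly_mb", "mean"),
            med_mb=("monthly_mb", "median"))
)

# A mean well above the median hints that a few heavy users drive the average.
summary["mean_minus_median"] = summary["avg_mb"] - summary["med_mb"]

# Rank only the brands that meet the minimum sample size.
ranked = (
    summary[summary["n_users"] >= MIN_USERS]
        .sort_values("avg_mb", ascending=False)
        .round(2)
)
print(ranked)
```

Here the two-user brand is dropped from the ranking even though its average is the highest, and the remaining brand's gap between mean and median shows one heavy user pulling the average up.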