# Introduction

In this notebook, I look at the [Mobile App Country of Origin](https://nprint.github.io/benchmarks/application_identification/mobile_country_of_origin.html) dataset from the nPrint project. The goal is to predict whether a mobile app was developed in the United States, China, or India based on its network traffic. The dataset has already been processed using nPrint, so the features are ready for machine learning without having to manually parse packet captures.

My main goal here is to get a basic model working and see how close I can get to the leaderboard result, which reports about **96.8% balanced accuracy** using AutoGluon. I will start with a **Random Forest classifier** as a simple baseline and then examine the results using balanced accuracy and confusion matrices. Later, I might test a few other models to see whether performance changes or whether Random Forest is already good enough.

## Data Preprocessing (done offline)

The packet capture I downloaded was in the PcapNG format, which turned out to be incompatible with the older version of the nPrint tool that I needed for feature extraction. To work around this, I converted the original file into a regular `.pcap` file using Scapy. That conversion mainly rewrites timestamps into a format the older tools can read. 


(Note: this part was a bit tricky. My Mac isn’t compatible with some nPrint dependencies, so I used Gemini 3.0 Pro to help with installing Conda and with converting the traffic data from `.pcapng` to `.pcap`. The conversion script (`fix_pcap_safe.py`) and the previously downloaded files are located in `course-project-chrislowzhengxi/final/offline_processing`.)
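For readers who want a feel for the conversion step, here is a minimal sketch using Scapy's `rdpcap` and `wrpcap`. It is not the contents of `fix_pcap_safe.py`; the function name `convert_pcapng_to_pcap` is hypothetical, and the file names are taken from the surrounding text. It loads the whole capture into memory, which is fine for a capture of this size.

```python
from pathlib import Path


def convert_pcapng_to_pcap(src: str, dst: str) -> int:
    """Rewrite a PcapNG capture as a classic .pcap file; return the packet count."""
    # Imported lazily so this module can be imported without Scapy installed.
    from scapy.all import rdpcap, wrpcap

    packets = rdpcap(src)  # rdpcap reads both classic pcap and PcapNG files
    wrpcap(dst, packets)   # classic pcap has no per-packet comment field
    return len(packets)


if __name__ == "__main__":
    # File names are assumptions based on the names used in this write-up.
    if Path("traffic.pcapng").exists():
        n = convert_pcapng_to_pcap("traffic.pcapng", "traffic_fixed.pcap")
        print(f"wrote {n} packets to traffic_fixed.pcap")
```

Because the classic `.pcap` format cannot carry packet comments, anything stored there has to be read from the original PcapNG file instead.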


After that, I used the `nprint` command-line tool (outside of this notebook) to extract network-level features. The tool parses each packet of the PCAP file and produces a feature table in CSV form. This CSV is what the rest of the notebook loads and analyzes. I generated it once and added it directly to the project folder so the notebook does not depend on installing nPrint or running the extraction again.

The command I used was essentially:

```bash
nprint -P traffic_fixed.pcap -W traffic_features.csv -4 -t -u -p 20
```

The flags tell nPrint to read a PCAP file (`-P`), write the result to a CSV (`-W`), include the IPv4 (`-4`), TCP (`-t`), and UDP (`-u`) headers, and include the first 20 bytes of payload (`-p 20`). The resulting file (`traffic_features.csv`) contains the full set of features and will be used for model training and evaluation below.
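As a rough sanity check on the width of the output, the flags can be turned into an expected column count. This sketch assumes nPrint emits one column per bit of each maximum-size header (60 bytes for IPv4 and TCP, 8 bytes for UDP), one column per payload bit, and a leading `src_ip` column. Those sizes are my assumption about the tool's layout, so check them against your nPrint version.

```python
# Bits contributed by each enabled nPrint section (assumed sizes, see above).
SECTION_BITS = {
    "ipv4": 60 * 8,     # -4: up to 60 bytes of IPv4 header
    "tcp": 60 * 8,      # -t: up to 60 bytes of TCP header
    "udp": 8 * 8,       # -u: 8-byte UDP header
    "payload": 20 * 8,  # -p 20: first 20 bytes of payload
}

feature_bits = sum(SECTION_BITS.values())
total_columns = feature_bits + 1  # plus the src_ip column

print(f"feature bits:  {feature_bits}")
print(f"total columns: {total_columns}")
```

Under these assumptions the total is 1185, which agrees with the feature table shape printed when the CSV is loaded later in the notebook.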

## Adding Labels to the Feature Data (done offline)

After running `nprint`, I ended up with a CSV containing only packet-level features. It did not include the country label. The reason is that the label was stored inside the original PcapNG file as a packet comment rather than a normal field, so `nprint` never saw it.

To move forward with supervised learning, I needed a label column. I extracted the labels from the PcapNG file using `tshark`, then merged those labels with the feature rows.

### Step 1. Extract labels using `tshark`

Each packet in the original `traffic.pcapng` included a comment such as `13682128230572000042,china`. That comment contains both an ID and the country. Using `tshark`, I pulled out the `ip.src` field and the per-packet comment (`frame.comment`):

```python
# offline_processing/extract_labels_tshark.py

command = [
    "tshark", "-n", "-r", pcap_file,
    "-Y", "ip",
    "-T", "fields",
    "-e", "ip.src",
    "-e", "frame.comment",
    "-E", "separator=,",
    "-E", "quote=d"
]
```

This produced a small CSV with the source IP and the raw comment label.

(Note: Again, this part is tricky, so I asked Gemini for help. The prompt can be seen at the top of `offline_processing/extract_labels_tshark.py`.)
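To make the shape of that intermediate CSV concrete, here is a hedged sketch of how the `tshark` output could be parsed and written out. The helper names (`build_command`, `parse_tshark_output`, `extract_labels`) are hypothetical and this is not the actual `extract_labels_tshark.py`; the column names `tshark_ip` and `raw_label` are the ones the later merge code reads.

```python
import csv
import io
import subprocess


def build_command(pcap_file: str) -> list[str]:
    """Same tshark invocation as above: source IP and packet comment per IP frame."""
    return [
        "tshark", "-n", "-r", pcap_file,
        "-Y", "ip",
        "-T", "fields",
        "-e", "ip.src",
        "-e", "frame.comment",
        "-E", "separator=,",
        "-E", "quote=d",
    ]


def parse_tshark_output(text: str) -> list[tuple[str, str]]:
    """Parse double-quoted, comma-separated tshark output into (ip, comment) rows."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            # Dropping rows silently would break the row-order alignment later.
            raise ValueError(f"unexpected field count: {fields!r}")
        rows.append((fields[0], fields[1]))
    return rows


def extract_labels(pcap_file: str, out_csv: str) -> int:
    """Run tshark (must be on PATH) and write the labels CSV; return the row count."""
    result = subprocess.run(
        build_command(pcap_file), capture_output=True, text=True, check=True
    )
    rows = parse_tshark_output(result.stdout)
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["tshark_ip", "raw_label"])
        writer.writerows(rows)
    return len(rows)


sample = '"10.11.1.3","13682128230572000042,china"\n'
print(parse_tshark_output(sample))
```

The double quoting matters here because the comment itself contains a comma; without `quote=d` the ID and the country would be split into separate fields.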

### Step 2. Merge with feature table

The next step was to join the extracted labels with the feature rows. Since both files contained the exact same number of rows and appeared in the same order, I merged them by index.

I also cleaned the label string so it only keeps the country part.

This step was performed offline during preprocessing. Its output is the file `final_dataset_with_labels.csv`, which is the dataset used for training. It combines the nPrint features with the country labels extracted from the original `traffic.pcapng` file. Below is the script:


```python
import pandas as pd

features_df = pd.read_csv('traffic_features.csv')
labels_df = pd.read_csv('labels_extracted.csv')

# Clean the Label Column (from "13682128230572000042,china" to "china")
def parse_label(val):
    if isinstance(val, str) and ',' in val:
        return val.split(',')[-1].strip() # Take the last part (the text label)
    return 'Unknown'

labels_df['clean_label'] = labels_df['raw_label'].apply(parse_label)

# Assign the label to the feature set
features_df['label'] = labels_df['clean_label']

features_df.to_csv('final_dataset_with_labels.csv', index=False)
```
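Merging by position is only safe if the two files really are row-aligned. As an optional guard, the check below compares the source IP on every row before any labels are assigned. It is a sketch rather than part of the original script; the function name `assert_rows_aligned` is hypothetical, and it assumes the `src_ip` and `tshark_ip` columns used elsewhere in this notebook.

```python
import pandas as pd


def assert_rows_aligned(features_df: pd.DataFrame, labels_df: pd.DataFrame) -> None:
    """Raise ValueError unless both tables have the same length and per-row source IPs."""
    if len(features_df) != len(labels_df):
        raise ValueError(
            f"row count mismatch: {len(features_df)} features vs {len(labels_df)} labels"
        )
    feature_ips = features_df["src_ip"].astype(str).to_numpy()
    label_ips = labels_df["tshark_ip"].astype(str).to_numpy()
    mismatches = int((feature_ips != label_ips).sum())
    if mismatches:
        raise ValueError(f"{mismatches} rows have different source IPs")


# Tiny illustrative frames (made-up rows, not the real dataset).
demo_features = pd.DataFrame({"src_ip": ["10.11.1.3", "8.8.8.8"]})
demo_labels = pd.DataFrame({"tshark_ip": ["10.11.1.3", "8.8.8.8"]})
assert_rows_aligned(demo_features, demo_labels)  # passes silently
```

Matching source IPs do not prove that the packets are identical, but a mismatch anywhere is a clear sign that the positional merge would attach labels to the wrong rows.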

### Result

The final dataset now includes a `label` column at the end. This is the version used for all modeling. In the rest of the notebook, I simply load the combined CSV and treat `label` as the target.

That’s all that’s needed to start training models.


## Exploratory Data Analysis

Let's start with some imports:

```python
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    balanced_accuracy_score,
)

import matplotlib.pyplot as plt

RANDOM_STATE = 42
```

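Now, let's examine the dataset. The cell below repeats the offline merge inside the notebook so that the shapes, the row alignment, and the class distribution are visible: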

```python
features_df = pd.read_csv('offline_processing/traffic_features.csv')
labels_df = pd.read_csv('offline_processing/labels_extracted.csv')

print(f"Features shape: {features_df.shape}")
print(f"Labels shape:   {labels_df.shape}")

# 2. Sanity Check: Do the IPs match?
# We compare the first few IPs to make sure the rows are aligned
print("\nAlignment Check (First 5 IPs):")
print(pd.DataFrame({
    'Feature_IP': features_df['src_ip'].head(),
    'Label_IP': labels_df['tshark_ip'].head()
}))

# 3. Clean the Label Column
# The raw_label looks like "13682128230572000042,china"
# We need to split it to get just "china"
def parse_label(val):
    if isinstance(val, str) and ',' in val:
        return val.split(',')[-1].strip() # Take the last part (the text label)
    return 'Unknown'

labels_df['clean_label'] = labels_df['raw_label'].apply(parse_label)

# 4. Assign the label to the feature set
# We assign by index (position)
features_df['label'] = labels_df['clean_label']

# 5. Check Class Distribution
print("\nFinal Class Distribution:")
print(features_df['label'].value_counts())

# Save the final dataset
features_df.to_csv('final_dataset_with_labels.csv', index=False)
```

```text
Features shape: (10625, 1185)
Labels shape:   (10625, 2)

Alignment Check (First 5 IPs):
  Feature_IP   Label_IP
0  10.11.1.3  10.11.1.3
1    8.8.8.8    8.8.8.8
2  10.11.1.3  10.11.1.3
3  10.11.1.3  10.11.1.3
4  10.11.1.3  10.11.1.3

Final Class Distribution:
label
india    4075
us       3500
china    3050
Name: count, dtype: int64
```
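The classes are somewhat uneven, which is one reason to report balanced accuracy rather than plain accuracy. The short calculation below uses only the counts printed above to show what a trivial "always predict the largest class" baseline would score on each metric. It is an illustration of the metric (balanced accuracy taken as the mean of per-class recall), not a result from a trained model.

```python
# Row counts per class, copied from the class distribution output above.
counts = {"india": 4075, "us": 3500, "china": 3050}
total = sum(counts.values())

# Share of rows belonging to each class.
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label:>5}: {share:.1%}")

# A baseline that always predicts the most frequent class.
majority = max(counts, key=counts.get)
accuracy = counts[majority] / total

# Per-class recall is 1.0 for the predicted class and 0.0 for the others,
# so balanced accuracy (mean recall) is 1 / number_of_classes.
recalls = [1.0 if label == majority else 0.0 for label in counts]
balanced_accuracy = sum(recalls) / len(recalls)

print(f"majority class:     {majority}")
print(f"plain accuracy:     {accuracy:.3f}")
print(f"balanced accuracy:  {balanced_accuracy:.3f}")
```

Any model worth keeping should therefore land well above one third on balanced accuracy, which gives a floor to compare the Random Forest baseline against.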
