### The project: Predicting Name Era (Legacy vs Modern) using Google BigQuery and XGBoost

This project aims to **classify given names as either "Legacy" (pre-2000) or "Modern" (from 2000 onward)** based on historical US birth data, using features extracted from the names themselves. The data was queried directly from the **`bigquery-public-data.usa_names.usa_1910_current`** dataset on **Google Cloud BigQuery**, containing over 1.8 million name records.

---

### Data & Labeling

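* We **grouped names by decade** and kept only names that appear in exactly one decade (to reduce ambiguity), after restricting the query to name/year/gender rows with more than 100 births. Decades with fewer than 100 such names were dropped.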
* The decades were then **mapped into binary labels**:

  * `Legacy`: Decades ≤ 1990
  * `Modern`: Decades ≥ 2000
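
The labeling rule above can be sketched in plain Python. This is a minimal illustration, not the notebook's pandas code: the function name `label_single_decade_names`, the `(name, year, total)` tuple layout, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict


def label_single_decade_names(rows):
    """Label names that occur in exactly one decade as Legacy or Modern.

    `rows` is an iterable of (name, year, total) tuples. This shape is an
    assumption for the sketch, not the notebook's DataFrame layout.
    """
    decades_by_name = defaultdict(set)
    for name, year, _total in rows:
        decades_by_name[name].add((year // 10) * 10)  # e.g. 1994 -> 1990

    labels = {}
    for name, decades in decades_by_name.items():
        if len(decades) != 1:
            continue  # name spans several decades: ambiguous, so dropped
        (decade,) = decades
        labels[name] = "Modern" if decade >= 2000 else "Legacy"
    return labels


# Made-up rows for illustration only (not real SSA counts).
rows = [
    ("Mildred", 1921, 500), ("Mildred", 1927, 450),  # one decade: 1920
    ("Jayden", 2005, 900), ("Jayden", 2009, 1200),   # one decade: 2000
    ("James", 1950, 8000), ("James", 2010, 4000),    # two decades: dropped
]
print(label_single_decade_names(rows))
# {'Mildred': 'Legacy', 'Jayden': 'Modern'}
```

The sketch omits the notebook's additional filter that drops decades with fewer than 100 qualifying names.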

---

### Feature Engineering

We extracted both statistical and linguistic features:

* **Character-level n-grams (1–3)**
* **Name structure features**: length, first/last letters, suffixes, vowel ratios
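* **Spelling patterns**: starts with "Mc", ends with "n" or "a", contains "y"
* **Popularity trend**: each name's share of all births among the selected names in its decade. Because this ratio is normalized per decade, the same field the label is derived from, it may leak label information.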

All features were combined into a **sparse matrix**, including one-hot encodings for categorical features.
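
To make the feature list concrete, here is a small dependency-free sketch of the character n-grams and the name-structure features for a single name. The helper names `char_ngrams` and `structure_features` are hypothetical. The notebook itself uses `CountVectorizer` and pandas string methods, so this only illustrates what those steps compute.

```python
def char_ngrams(name, n_min=1, n_max=3):
    """Return every character n-gram of `name` with n_min <= n <= n_max."""
    name = name.lower()
    return [
        name[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(name) - n + 1)
    ]


def structure_features(name):
    """Hand-crafted features for one non-empty name (hypothetical helper)."""
    name = name.lower()
    return {
        "length": len(name),
        "first_letter": name[0],
        "last_letter": name[-1],
        "ends_with_a": int(name.endswith("a")),
        "ends_with_n": int(name.endswith("n")),
        "contains_y": int("y" in name),
        "has_mc": int(name.startswith("mc")),
        # "y" is not counted as a vowel, matching the notebook's "aeiou" set.
        "vowel_ratio": sum(c in "aeiou" for c in name) / len(name),
        "suffix_2": name[-2:],
        "suffix_3": name[-3:],
    }


print(char_ngrams("Ava"))  # ['a', 'v', 'a', 'av', 'va', 'ava']
print(structure_features("Jayden"))
```

In the notebook, the n-gram counts and the one-hot encoded letters and suffixes become sparse columns, while the numeric features are appended as dense columns.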

---

### Modeling & Evaluation

* The classifier used is **XGBoost** (tree-based ensemble), trained with:

  * A stratified 80/20 train/test split, accepted only if the test set contains at least 30 Legacy samples
  * **Random oversampling** applied only to the training set to balance class distributions
  * **Manual grid search** (64 parameter combinations) to select the best hyperparameters
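
The final code cell uses fixed hyperparameters and does not show the search itself, so the following is only a sketch of what a manual grid search over 64 combinations could look like. The grid values, the function `manual_grid_search`, and the stand-in scorer `toy_evaluate` are all assumptions. In practice `evaluate` would fit an `XGBClassifier(**params)` on the oversampled training data and return a score (for example macro F1) on a validation split kept separate from the test set.

```python
from itertools import product


def manual_grid_search(param_grid, evaluate):
    """Score every combination in `param_grid`.

    Returns (best_params, best_score, n_tried). `evaluate` maps a params
    dict to a score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score, n_tried = None, float("-inf"), 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried


# Hypothetical grid: 4 * 4 * 2 * 2 = 64 combinations.
# The values actually searched are not listed in the notebook.
param_grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [4, 6, 8, 10],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
}


def toy_evaluate(params):
    """Stand-in scorer so the sketch runs without data or xgboost.

    It simply peaks at the hyperparameters used in the final cell.
    """
    return -(
        abs(params["n_estimators"] - 200) / 100
        + abs(params["max_depth"] - 8)
        + abs(params["learning_rate"] - 0.1)
        + abs(params["subsample"] - 0.8)
    )


best_params, best_score, n_tried = manual_grid_search(param_grid, toy_evaluate)
print(n_tried)      # 64
print(best_params)
```

Selecting hyperparameters on a validation split rather than on the test set keeps the reported test metrics from being optimistically biased.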

---

### Results

Final evaluation (on test set):

* **Accuracy**: 83.46%
* **Legacy Recall**: 60%
* **Modern Recall**: 91%
* **Macro F1**: 0.76

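This performance indicates that the model **learns to distinguish older from newer name trends**: its accuracy exceeds the 76.4% that always predicting Modern would achieve (97 of the 127 test names). It is strongest on Modern names, while Legacy recall is modest at 60% (18 of 30). Because the test set is small, these figures should be read with caution.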
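
The headline numbers can be cross-checked by hand. The confusion counts below are not printed by the notebook. They are reconstructed from the reported precision, recall, and support, and they are the integer counts consistent with that report.

```python
# Reconstructed confusion counts (an inference from the classification
# report, not an output of the notebook).
legacy_tp, legacy_fn = 18, 12  # 30 true Legacy names, 18 recovered
modern_tp, modern_fn = 88, 9   # 97 true Modern names, 88 recovered


def f1(tp, fp, fn):
    """F1 score from raw counts."""
    return 2 * tp / (2 * tp + fp + fn)


total = legacy_tp + legacy_fn + modern_tp + modern_fn
accuracy = (legacy_tp + modern_tp) / total

# A Modern name misclassified as Legacy is a false positive for Legacy,
# and vice versa.
legacy_f1 = f1(legacy_tp, fp=modern_fn, fn=legacy_fn)
modern_f1 = f1(modern_tp, fp=legacy_fn, fn=modern_fn)
macro_f1 = (legacy_f1 + modern_f1) / 2

# Baseline: always predict the majority class (Modern).
baseline = (modern_tp + modern_fn) / total

print(round(accuracy * 100, 2))  # 83.46
print(round(macro_f1, 2))        # 0.76
print(round(baseline * 100, 2))  # 76.38
```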


# The dataset: `bigquery-public-data.usa_names.usa_1910_current`

This dataset contains historical records of baby names registered in the United States from the year 1910 onwards. It is publicly hosted on Google BigQuery as part of Google's public datasets.

## Source

The data originates from the U.S. Social Security Administration (SSA) and includes aggregated birth name information by year, gender, and state.

## Columns Description

| Column   | Type    | Description |
|----------|---------|-------------|
| `name`   | STRING  | The first name of the baby. |
| `gender` | STRING  | The gender associated with the name: either `"M"` (Male) or `"F"` (Female). |
| `state`  | STRING  | The U.S. state abbreviation (e.g., "CA", "NY"). In some entries this field may be null. |
| `year`   | INTEGER | The year of birth. |
| `number` | INTEGER | The number of babies given this name in that year and state. |

## Target / Label Column

There is no explicit label column in this dataset, as it is not originally designed for supervised machine learning tasks. However, depending on the goal of the analysis, one can define a target. For example:
- In a **gender prediction task**, the column `gender` can serve as the label.
- For **popularity prediction**, the column `number` or its change over time could be used as a target variable.
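
As one illustration of defining a target from the aggregated rows, the sketch below assigns each name the gender with the most recorded births. The function name `majority_gender_labels`, the `(name, gender, number)` tuple layout, and the sample rows are assumptions for the example, not part of the dataset or the notebook.

```python
from collections import Counter, defaultdict


def majority_gender_labels(rows):
    """Map each name to the gender with the highest total birth count.

    `rows` is an iterable of (name, gender, number) tuples.
    """
    births = defaultdict(Counter)
    for name, gender, number in rows:
        births[name][gender] += number
    return {
        name: counts.most_common(1)[0][0]
        for name, counts in births.items()
    }


# Made-up counts for illustration only.
rows = [
    ("Riley", "F", 300),
    ("Riley", "M", 200),
    ("Riley", "F", 100),
    ("John", "M", 900),
]
print(majority_gender_labels(rows))
# {'Riley': 'F', 'John': 'M'}
```

Names with close counts for both genders would be better treated as a separate unisex class than forced into a binary label.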

## Use Cases

This dataset is suitable for a variety of data science and analytics tasks, including:
- Trend analysis of baby name popularity over time.
- Gender distribution of names.
- Identification of unisex (gender-neutral) names.
- Forecasting future popularity of specific names.
- Sociocultural studies (e.g., names influenced by events, celebrities, or politics).

## Dataset Characteristics

- Total rows: ~1.8 million (depending on filters).
- Time range: 1910 to present.
- Granularity: Yearly.
- Aggregated: Not individual-level data, but grouped by name/year/state/gender.


In [1]:
# Step 1: Install requirements (if needed)
!pip install -q google-cloud-bigquery xgboost

# Step 2: Imports and authentication
from google.colab import auth
auth.authenticate_user()

In [15]:
from google.cloud import bigquery

# Step 3: Create the BigQuery client
project_id = "custom-helix-474006-k6"  # ← replace with your GCP project ID
client = bigquery.Client(project=project_id)

In [33]:
import pandas as pd
import numpy as np
from google.cloud import bigquery
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report, accuracy_score
from xgboost import XGBClassifier
from imblearn.over_sampling import RandomOverSampler
from scipy.sparse import csr_matrix, hstack

# Step 1: Query BigQuery
print("Querying from BigQuery...")
query = """
SELECT name, year, gender, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year >= 1910
GROUP BY name, year, gender
HAVING total > 100
"""
df = client.query(query).to_dataframe()
print(f"Retrieved {len(df)} rows.")

# Step 2: Prepare decade and filter
df["decade"] = (df["year"] // 10) * 10
df["decade"] = df["decade"].astype(str)
df_grouped = df.groupby(["name", "decade"], as_index=False)["total"].sum()
name_counts = df_grouped.groupby("name")["decade"].nunique().reset_index()
df_single = df_grouped[df_grouped["name"].isin(name_counts[name_counts["decade"] == 1]["name"])].reset_index(drop=True)
valid_decades = df_single["decade"].value_counts()[lambda x: x >= 100].index
df_single = df_single[df_single["decade"].isin(valid_decades)].reset_index(drop=True)

# Step 3: Binary labeling – Modern vs Legacy
df_single["era"] = df_single["decade"].apply(lambda d: "Modern" if int(d) >= 2000 else "Legacy")
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_single["era"])  # Legacy=0, Modern=1

# Step 4: Feature Engineering
df_single["name"] = df_single["name"].str.lower()
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
X_ngram = vectorizer.fit_transform(df_single["name"])

df_single["length"] = df_single["name"].str.len()
df_single["first_letter"] = df_single["name"].str[0]
df_single["last_letter"] = df_single["name"].str[-1]
df_single["ends_with_a"] = (df_single["last_letter"] == "a").astype(int)
df_single["contains_y"] = df_single["name"].str.contains("y", regex=False).astype(int)
df_single["ends_with_n"] = (df_single["last_letter"] == "n").astype(int)
df_single["has_mc"] = df_single["name"].str.startswith("mc").astype(int)
df_single["vowel_ratio"] = df_single["name"].apply(lambda name: sum(c in "aeiou" for c in name) / len(name))
df_single["suffix_2"] = df_single["name"].str[-2:]
df_single["suffix_3"] = df_single["name"].str[-3:]
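# Caution: this ratio is normalized per decade, the same field the label is
# derived from, so it may leak label information into the features.
total_per_decade = df_single.groupby("decade")["total"].transform("sum")
df_single["popularity_ratio"] = df_single["total"] / total_per_decade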

# One-hot encoding
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=True)
X_suffix = ohe.fit_transform(df_single[["first_letter", "last_letter", "suffix_2", "suffix_3"]])
X_manual = df_single[[
    "length", "ends_with_a", "contains_y", "ends_with_n",
    "has_mc", "vowel_ratio", "popularity_ratio"
]].astype("float32").values

# Stack sparse n-gram and one-hot blocks with the dense numeric features (CSR output)
X_combined = hstack([X_ngram, X_suffix, csr_matrix(X_manual)], format="csr")


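# Step 5: Stratified split with min Legacy constraint
# Stratification keeps class proportions nearly constant across splits,
# so the check below acts mainly as a safeguard.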
print("\nCreating stratified split with minimum Legacy samples...")
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
desired_min_legacy = 30
found_split = False

for train_idx, test_idx in sss.split(X_combined, y):
    if np.sum(y[test_idx] == 0) >= desired_min_legacy:
        found_split = True
        X_train = X_combined[train_idx]
        X_test = X_combined[test_idx]
        y_train = y[train_idx]
        y_test = y[test_idx]
        print(f"Test set includes {np.sum(y_test==0)} Legacy samples.")
        break

if not found_split:
    raise ValueError("Could not find a test split with enough Legacy samples.")

# Step 6: Oversample train set
ros = RandomOverSampler(random_state=42)
X_train_bal, y_train_bal = ros.fit_resample(X_train, y_train)

# Step 7: Train model (no scale_pos_weight, since oversampling already balanced the training set)
model = XGBClassifier(
    n_estimators=200,
    max_depth=8,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    n_jobs=-1
)
model.fit(X_train_bal, y_train_bal)

# Step 8: Evaluation
y_pred = model.predict(X_test)
print("\nFinal Evaluation:")
print("Test Accuracy:", round(accuracy_score(y_test, y_pred) * 100, 2), "%")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))


Querying from BigQuery...
Retrieved 182540 rows.

Creating stratified split with minimum Legacy samples...
Test set includes 30 Legacy samples.

Final Evaluation:
Test Accuracy: 83.46 %
Classification Report:
              precision    recall  f1-score   support

      Legacy       0.67      0.60      0.63        30
      Modern       0.88      0.91      0.89        97

    accuracy                           0.83       127
   macro avg       0.77      0.75      0.76       127
weighted avg       0.83      0.83      0.83       127

