### The project: Gender Classification from Baby Names Using BigQuery and XGBoost

This project focuses on predicting the **gender (Male/Female)** of a given **baby name** based solely on its **textual structure**. It leverages the U.S. Social Security dataset `bigquery-public-data.usa_names.usa_1910_current`, which includes over a century of name-gender-frequency statistics.

---

### Objective:
Train a machine learning model that can **accurately classify names as either Male or Female** using character-based patterns and additional name features.

---

### Data Preparation:
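1. Queried data from BigQuery: name, gender, and total occurrences per (name, gender) pair.
2. Removed rare pairs in the query itself (`HAVING total > 100`, so pairs with 100 or fewer occurrences are dropped).
3. Filtered to include only names that remain under **exactly one gender** after that step (e.g., "Emily" is only Female).
4. Final dataset: **12,508 unique names**. The classes are not balanced: the 20% test split contains 1,556 female and 946 male names.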
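
The filtering logic can be illustrated on a toy table. This is a minimal sketch, assuming a DataFrame with the same `name`, `gender`, and `total` columns as the query result; the rows and counts are made up for illustration.

```python
import pandas as pd

# Made-up (name, gender, total) rows standing in for the BigQuery result.
df = pd.DataFrame({
    "name":   ["Emily", "Taylor", "Taylor", "John", "Ashley", "Ashley"],
    "gender": ["F",     "F",      "M",      "M",    "F",      "M"],
    "total":  [5000,    3000,     2500,     9000,   8000,     50],
})

# Drop rare (name, gender) pairs, mirroring `HAVING total > 100`.
df = df[df["total"] > 100]

# Keep names that appear under exactly one gender after the rare-pair filter.
genders_per_name = df.groupby("name")["gender"].nunique()
single_gender = genders_per_name[genders_per_name == 1].index
filtered = df[df["name"].isin(single_gender)].reset_index(drop=True)

print(filtered["name"].tolist())  # ['Emily', 'John', 'Ashley']
```

Because the rarity threshold is applied per (name, gender) pair first, a name such as the toy "Ashley" above, with a large female count and a tiny male count, is kept and labeled female rather than discarded as unisex.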

---

### Feature Engineering:
- **Character-level n-grams**: extracted all 1–4 character sequences from names (e.g., `jo`, `ohn`, `lyn`).
- **Manual features**:
  - Name length
  - First and last letters (one-hot encoded)
  - Does the name end with `'a'`?
  - Does the name contain `'y'`?

All features were combined into a sparse feature matrix of shape `(12508, 15687)`.
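
To make the n-gram idea concrete, here is a small pure-Python sketch of character n-gram extraction. The helper `char_ngrams` is hypothetical and only illustrates which substrings are counted for a single lowercase name; the notebook itself relies on `CountVectorizer(analyzer='char')`.

```python
def char_ngrams(name: str, min_n: int = 1, max_n: int = 4) -> list[str]:
    """Return every contiguous substring of length min_n..max_n."""
    name = name.lower()
    return [
        name[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(name) - n + 1)
    ]

grams = char_ngrams("John")
print(grams)
# ['j', 'o', 'h', 'n', 'jo', 'oh', 'hn', 'joh', 'ohn', 'john']
```

Each distinct n-gram seen across all names becomes one column of the sparse matrix, which is why a few thousand short names can produce more than fifteen thousand features.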

---

### Modeling:
- Used **XGBoost classifier** with hyperparameter tuning via **manual Grid Search** over 24 parameter combinations.
- Evaluated each combination using 3-fold cross-validation.
- Trained the best model on the full training set (80%) and evaluated on a 20% hold-out test set.
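
The size of the search can be checked with a short sketch. The grid values below are the ones used in the notebook; building each combination as a dictionary from the grid keys is just one possible way to enumerate them.

```python
from itertools import product

param_grid = {
    "n_estimators": [300],
    "max_depth": [5, 6, 7],
    "learning_rate": [0.1, 0.2],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

# One dict per parameter setting: 1 * 3 * 2 * 2 * 2 combinations.
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

n_folds = 3
print(len(combinations))            # 24 parameter settings
print(len(combinations) * n_folds)  # 72 model fits during cross-validation
```

Deriving the combinations from the dictionary keys means a new parameter can be added to `param_grid` without touching the enumeration code.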

---

### Best Model Parameters:
```python
{
  'n_estimators': 300,
  'max_depth': 7,
  'learning_rate': 0.2,
  'subsample': 0.8,
  'colsample_bytree': 0.8
}
```

---

### Results:

* **Cross-validated Accuracy (3-fold):** 88.79%
* **Test Set Accuracy:** **90.61%**

| Metric    | Female (F) | Male (M) |
| --------- | ---------- | -------- |
| Precision | 0.93       | 0.87     |
| Recall    | 0.92       | 0.89     |
| F1-score  | 0.92       | 0.88     |

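The model performs well on both classes (F1-score of 0.92 for female and 0.88 for male names), with slightly better performance on female names. One plausible reason is that female names follow more consistent structural patterns (e.g., ending in `'a'`); the larger number of female names in the data may also contribute.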
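
As a sanity check on the table, the F1-scores can be recomputed from the reported precision and recall using the harmonic-mean formula. The inputs are already rounded to two decimals, so this is only an approximate check.

```python
def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the results table.
reported = {"F": (0.93, 0.92), "M": (0.87, 0.89)}

f1_scores = {label: round(f1(p, r), 2) for label, (p, r) in reported.items()}
print(f1_scores)  # {'F': 0.92, 'M': 0.88}
```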

---

### Conclusion:

The classifier learned strong patterns from the **structure of names alone**, achieving over **90% accuracy** in predicting gender. This demonstrates the power of combining **text n-grams** and **symbolic name features** in a classical ML pipeline, without needing deep learning or external data.


### The project: Predicting Gender from Baby Names Using BigQuery and XGBoost

This project demonstrates how to build a gender classification model using U.S. baby name data from BigQuery. The goal is to predict whether a name is typically associated with male or female babies based on the structure and patterns within the name.

#### Data Source:
The dataset is queried directly from BigQuery's public dataset `usa_names.usa_1910_current`, which contains U.S. baby names, their gender, and yearly counts from 1910 onward.

---

### Step-by-step Explanation:

**Step 1 – Query Data from BigQuery**  
The SQL query retrieves the total number of births for each unique (name, gender) combination and keeps only combinations with more than 100 total births (`HAVING total > 100`), which removes rare names.

**Step 2 – Filter Gender-Specific Names**  
Names that appear under both genders (e.g., Taylor, Jordan) are removed. Only names recorded exclusively as male or exclusively as female are kept, so that every name has a single, unambiguous label.

**Step 3 – Prepare Labels**  
The target variable (`gender`) is label-encoded: female (`F`) becomes 0, male (`M`) becomes 1.
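
The mapping follows from `LabelEncoder` sorting the class labels before assigning integers. A minimal sketch on a made-up label list:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(["M", "F", "F", "M"])

print(encoder.classes_.tolist())  # ['F', 'M'] -> index 0 is F, index 1 is M
print(y.tolist())                 # [1, 0, 0, 1]

# Keeping the fitted encoder allows predictions to be mapped back to labels.
print(encoder.inverse_transform([0, 1]).tolist())  # ['F', 'M']
```

The notebook calls `LabelEncoder().fit_transform(...)` inline and discards the encoder; holding on to it, as sketched here, is optional and only matters if predicted integers need to be turned back into `F`/`M`.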

**Step 4 – N-gram Feature Extraction**  
Character-level n-grams (1 to 3 characters) are extracted using `CountVectorizer`. This captures short letter patterns (e.g., "ma", "ari", "son") that may be informative for gender prediction.

**Step 5 – Manual Feature Engineering**  
Several handcrafted features are added:
- `length`: length of the name
- `first_letter`, `last_letter`: used to identify name structure
- `ends_with_a`: many female names end in 'a'
- `contains_y`: 'y' is common in certain male names

The `first_letter` and `last_letter` are one-hot encoded to be used in the model.
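
The handcrafted features are easy to inspect on a few names. The sketch below uses three made-up names and, as an assumption for brevity, `pd.get_dummies` as a stand-in for the `OneHotEncoder` used in the notebook.

```python
import pandas as pd

names = pd.DataFrame({"name": ["Maria", "Henry", "Lynn"]})
names["name"] = names["name"].str.lower()

names["length"] = names["name"].str.len()
names["first_letter"] = names["name"].str[0]
names["last_letter"] = names["name"].str[-1]
names["ends_with_a"] = (names["last_letter"] == "a").astype(int)
names["contains_y"] = names["name"].str.contains("y").astype(int)

numeric = names[["length", "ends_with_a", "contains_y"]].values.tolist()
print(numeric)  # [[5, 1, 0], [5, 0, 1], [4, 0, 1]]

# One indicator column per distinct first letter and per distinct last letter.
letters = pd.get_dummies(names[["first_letter", "last_letter"]])
print(sorted(letters.columns))
```

On the full dataset the one-hot block has 52 columns, which matches 26 possible first letters plus 26 possible last letters.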

**Step 6 – Combine All Features**  
All features (n-grams, one-hot letters, and numeric features) are merged into one sparse feature matrix using `hstack`.
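
A small sketch of the column-wise stacking, using tiny placeholder matrices instead of the real features. Converting the dense numeric block to a sparse matrix first and requesting CSR output are choices made here for clarity, not something the notebook does explicitly.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

n_rows = 4
X_ngram = csr_matrix(np.eye(n_rows, 6))             # placeholder for n-gram counts
letter_ohe = csr_matrix(np.ones((n_rows, 2)))       # placeholder for one-hot letters
X_manual = np.arange(n_rows * 3).reshape(n_rows, 3)  # dense numeric features

# All blocks must have the same number of rows; columns are concatenated.
X_combined = hstack([X_ngram, letter_ohe, csr_matrix(X_manual)], format="csr")
print(X_combined.shape)  # (4, 11)

# The same arithmetic explains the notebook's final width.
print(15632 + 52 + 3)    # 15687
```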

**Step 7 – Train/Test Split**  
The dataset is split into training (80%) and testing (20%) sets.

**Step 8 – Train XGBoost Model**  
An XGBoost classifier is trained using 300 estimators, a maximum tree depth of 6, and a learning rate of 0.1. The model is trained on the combined feature matrix.

**Step 9 – Evaluation**  
Model performance is evaluated using accuracy and a classification report. This includes precision, recall, and F1-score for each gender.

---

### Result:
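The baseline run shown in this notebook reaches **89.21%** test accuracy, with F1-scores of 0.91 (female) and 0.86 (male); the exact figure depends on parameters and feature richness. The model captures letter patterns in names well enough to predict the gender of roughly nine out of ten held-out names.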


# The dataset: `bigquery-public-data.usa_names.usa_1910_current`

This dataset contains historical records of baby names registered in the United States from the year 1910 onwards. It is publicly hosted on Google BigQuery as part of Google's public datasets.

## Source

The data originates from the U.S. Social Security Administration (SSA) and includes aggregated birth name information by year, gender, and state.

## Columns Description

| Column   | Type    | Description |
|----------|---------|-------------|
| `name`   | STRING  | The first name of the baby. |
| `gender` | STRING  | The gender associated with the name: either `"M"` (Male) or `"F"` (Female). |
| `state`  | STRING  | The U.S. state abbreviation (e.g., "CA", "NY"). In some entries this field may be null. |
| `year`   | INTEGER | The year in which the name was registered. |
| `number` | INTEGER | The number of babies given this name in that year and state. |

## Target / Label Column

There is no explicit label column in this dataset, as it is not originally designed for supervised machine learning tasks. However, depending on the goal of the analysis, one can define a target. For example:
- In a **gender prediction task**, the column `gender` can serve as the label.
- For **popularity prediction**, the column `number` or its change over time could be used as a target variable.

## Use Cases

This dataset is suitable for a variety of data science and analytics tasks, including:
- Trend analysis of baby name popularity over time.
- Gender distribution of names.
- Identification of unisex (gender-neutral) names.
- Forecasting future popularity of specific names.
- Sociocultural studies (e.g., names influenced by events, celebrities, or politics).

## Dataset Characteristics

- Total rows: ~1.8 million (depending on filters).
- Time range: 1910 to present.
- Granularity: Yearly.
- Aggregated: Not individual-level data, but grouped by name/year/state/gender.
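
Since the table is stored at name/year/state/gender granularity, most analyses start by aggregating it. The following sketch, on made-up rows in the same schema, shows the pandas equivalent of summing `number` per (name, gender) pair.

```python
import pandas as pd

# Made-up rows in the dataset's schema: one row per (state, gender, year, name).
rows = pd.DataFrame({
    "state":  ["CA", "NY", "CA", "TX"],
    "gender": ["F",  "F",  "F",  "M"],
    "year":   [1990, 1990, 1991, 1990],
    "name":   ["Emily", "Emily", "Emily", "John"],
    "number": [120, 80, 130, 200],
})

# Collapse states and years into one total per (name, gender) pair.
totals = (
    rows.groupby(["name", "gender"], as_index=False)["number"]
        .sum()
        .rename(columns={"number": "total"})
)
print(totals)
```

In practice it is usually cheaper to do this aggregation in BigQuery with `GROUP BY` and download only the totals, rather than pulling every row into pandas.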


In [3]:
# Step 1: Install required packages
!pip install -q google-cloud-bigquery xgboost

In [1]:
# Step 2: Authenticate and import libraries
from google.colab import auth
auth.authenticate_user()

In [4]:
from google.cloud import bigquery

# Step 3: Connect to BigQuery and query gender + name data
project_id = "custom-helix-474006-k6"  # ← replace with your GCP project ID
client = bigquery.Client(project=project_id)

In [9]:
from google.cloud import bigquery
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from xgboost import XGBClassifier
from scipy.sparse import hstack


query = """
SELECT name, gender, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name, gender
HAVING total > 100
"""
df = client.query(query).to_dataframe()

# Step 4: Filter gender-specific names
gender_counts = df.groupby("name")["gender"].nunique().reset_index()
gender_counts = gender_counts[gender_counts["gender"] == 1]
df = df[df["name"].isin(gender_counts["name"])].reset_index(drop=True)

# Step 5: Prepare labels and normalize names
df["name"] = df["name"].str.lower()
y = LabelEncoder().fit_transform(df["gender"])  # F=0, M=1

# Step 6: n-gram vectorization (char-level 1 to 3)
vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 3))
X_ngram = vectorizer.fit_transform(df["name"])

# Step 7: Manual features
df["length"] = df["name"].str.len()
df["first_letter"] = df["name"].str[0]
df["last_letter"] = df["name"].str[-1]
df["ends_with_a"] = (df["last_letter"] == 'a').astype(int)
df["contains_y"] = df["name"].str.contains("y").astype(int)

# One-hot encode first and last letters
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=True)
letter_ohe = ohe.fit_transform(df[["first_letter", "last_letter"]])

# Combine manual numeric features
X_manual = df[["length", "ends_with_a", "contains_y"]].values

# Combine all features: n-gram + one-hot + manual
X_combined = hstack([X_ngram, letter_ohe, X_manual])

# Step 8: Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

# Step 9: Train XGBoost
model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, eval_metric='logloss')
model.fit(X_train, y_train)

# Step 10: Evaluation
y_pred = model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred) * 100, 2), "%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["F", "M"]))


Accuracy: 89.21 %

Classification Report:
              precision    recall  f1-score   support

           F       0.92      0.91      0.91      1556
           M       0.85      0.87      0.86       946

    accuracy                           0.89      2502
   macro avg       0.88      0.89      0.89      2502
weighted avg       0.89      0.89      0.89      2502
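
The support column gives useful context for the accuracy figure. The sketch below, using only numbers copied from the report above, computes the accuracy of always predicting the majority class and re-derives the weighted-average precision. The per-class precisions are rounded, so the second result is approximate.

```python
support = {"F": 1556, "M": 946}
precision = {"F": 0.92, "M": 0.85}

total = sum(support.values())

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(support.values()) / total

# Weighted average: each class contributes in proportion to its support.
weighted_precision = sum(precision[c] * support[c] for c in support) / total

print(total)                              # 2502
print(round(majority_baseline * 100, 2))  # 62.19
print(round(weighted_precision, 2))       # 0.89
```

Against a majority-class baseline of about 62%, the 89.21% accuracy reflects real signal in the name features rather than the class imbalance alone.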



In [12]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split  # GridSearchCV is not needed: the search below is manual
from sklearn.metrics import classification_report, accuracy_score
from xgboost import XGBClassifier
from scipy.sparse import hstack
from itertools import product
from sklearn.model_selection import cross_val_score
import numpy as np
import warnings
warnings.filterwarnings("ignore")

print("Step 1: Querying data from BigQuery...")
query = """
SELECT name, gender, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name, gender
HAVING total > 100
"""
df = client.query(query).to_dataframe()
print(f"Retrieved {len(df)} rows.")

print("Step 2: Filtering names used for only one gender...")
gender_counts = df.groupby("name")["gender"].nunique().reset_index()
gender_counts = gender_counts[gender_counts["gender"] == 1]
df = df[df["name"].isin(gender_counts["name"])].reset_index(drop=True)
print(f"Remaining after filtering: {len(df)} names.")

print("Step 3: Encoding gender labels (F=0, M=1)...")
df["name"] = df["name"].str.lower()
y = LabelEncoder().fit_transform(df["gender"])

print("Step 4: Extracting character-level n-grams (1 to 4)...")
vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 4))
X_ngram = vectorizer.fit_transform(df["name"])
print(f"Number of n-gram features: {X_ngram.shape[1]}")

print("Step 5: Adding manual features...")
df["length"] = df["name"].str.len()
df["first_letter"] = df["name"].str[0]
df["last_letter"] = df["name"].str[-1]
df["ends_with_a"] = (df["last_letter"] == 'a').astype(int)
df["contains_y"] = df["name"].str.contains("y").astype(int)

print("Step 6: One-hot encoding first and last letters...")
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=True)
letter_ohe = ohe.fit_transform(df[["first_letter", "last_letter"]])
print(f"One-hot shape: {letter_ohe.shape}")

X_manual = df[["length", "ends_with_a", "contains_y"]].values
print(f"Manual numeric feature shape: {X_manual.shape}")

print("Step 7: Combining all features...")
X_combined = hstack([X_ngram, letter_ohe, X_manual])
print(f"Final feature matrix shape: {X_combined.shape}")

print("Step 8: Splitting data into train and test sets...")
X_train, X_test, y_train, y_test = train_test_split(
    X_combined, y, test_size=0.2, random_state=42)
print(f"Training samples: {X_train.shape[0]} | Test samples: {X_test.shape[0]}")

print("Step 9: Running manual GridSearch on XGBoost with live output...")

param_grid = {
    "n_estimators": [300],
    "max_depth": [5, 6, 7],
    "learning_rate": [0.1, 0.2],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0]
}

all_combinations = list(product(
    param_grid["n_estimators"],
    param_grid["max_depth"],
    param_grid["learning_rate"],
    param_grid["subsample"],
    param_grid["colsample_bytree"]
))

results = []
print(f"Total combinations: {len(all_combinations)}\n")

for i, (n_estimators, max_depth, lr, subsample, colsample) in enumerate(all_combinations, 1):
    params = {
        "n_estimators": n_estimators,
        "max_depth": max_depth,
        "learning_rate": lr,
        "subsample": subsample,
        "colsample_bytree": colsample,
        "use_label_encoder": False,
        "eval_metric": "logloss",
        "n_jobs": -1
    }

    print(f"Running combination {i}/{len(all_combinations)}: {params}")
    model = XGBClassifier(**params)
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring="accuracy")
    mean_score = scores.mean()
    std_score = scores.std()
    print(f"  -> Accuracy: {mean_score:.4f} ± {std_score:.4f}\n")

    results.append((params, mean_score, std_score))

# Step 10: Select best model and evaluate
print("Step 10: Evaluating best model...")

# Find best result
best_result = max(results, key=lambda x: x[1])  # highest mean_score
best_params, best_score, best_std = best_result

print(f"Best parameters: {best_params}")
print(f"Cross-validated accuracy: {round(best_score * 100, 2)}%")

# Train best model on full training set
best_model = XGBClassifier(**best_params)
best_model.fit(X_train, y_train)

# Evaluate on test set
y_pred = best_model.predict(X_test)
print(f"\nTest Accuracy: {round(accuracy_score(y_test, y_pred) * 100, 2)}%")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=["F", "M"]))


Step 1: Querying data from BigQuery...
Retrieved 14692 rows.
Step 2: Filtering names used for only one gender...
Remaining after filtering: 12508 names.
Step 3: Encoding gender labels (F=0, M=1)...
Step 4: Extracting character-level n-grams (1 to 4)...
Number of n-gram features: 15632
Step 5: Adding manual features...
Step 6: One-hot encoding first and last letters...
One-hot shape: (12508, 52)
Manual numeric feature shape: (12508, 3)
Step 7: Combining all features...
Final feature matrix shape: (12508, 15687)
Step 8: Splitting data into train and test sets...
Training samples: 10006 | Test samples: 2502
Step 9: Running manual GridSearch on XGBoost with live output...
Total combinations: 24

Running combination 1/24: {'n_estimators': 300, 'max_depth': 5, 'learning_rate': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.8, 'use_label_encoder': False, 'eval_metric': 'logloss', 'n_jobs': -1}
  -> Accuracy: 0.8773 ± 0.0043

Running combination 2/24: {'n_estimators': 300, 'max_depth': 5, 'learn