27 changes: 27 additions & 0 deletions .github/workflows/main.yml
@@ -0,0 +1,27 @@
name: Ruby

on:
  push:
    branches:
      - main

  pull_request:

jobs:
  build:
    runs-on: ubuntu-latest
    name: Ruby ${{ matrix.ruby }}
    strategy:
      matrix:
        ruby:
          - '3.3.4'

    steps:
    - uses: actions/checkout@v4
    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        ruby-version: ${{ matrix.ruby }}
        bundler-cache: true
    - name: Run the default task
      run: bundle exec rake
8 changes: 8 additions & 0 deletions .gitignore
@@ -0,0 +1,8 @@
/.bundle/
/.yardoc
/_yardoc/
/coverage/
/doc/
/pkg/
/spec/reports/
/tmp/
3 changes: 3 additions & 0 deletions .standard.yml
@@ -0,0 +1,3 @@
# For available configuration options, see:
# https://github.com/standardrb/standard
ruby_version: 3.0
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,5 @@
# Changelog

## 1.0.0

- Initial release.
12 changes: 12 additions & 0 deletions Gemfile
@@ -0,0 +1,12 @@
# frozen_string_literal: true

source "https://rubygems.org"

# Specify your gem's dependencies in names_dataset.gemspec
gemspec

gem "rake", "~> 13.0"

gem "minitest", "~> 5.16"

gem "standard", "~> 1.3"
68 changes: 68 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,68 @@
PATH
  remote: .
  specs:
    names_dataset (1.0.0)
      iso_country_codes (~> 0.7.6)
      rubyzip (~> 2.3)

GEM
  remote: https://rubygems.org/
  specs:
    ast (2.4.2)
    iso_country_codes (0.7.8)
    json (2.9.1)
    language_server-protocol (3.17.0.3)
    lint_roller (1.1.0)
    minitest (5.25.4)
    parallel (1.26.3)
    parser (3.3.6.0)
      ast (~> 2.4.1)
      racc
    racc (1.8.1)
    rainbow (3.1.1)
    rake (13.2.1)
    regexp_parser (2.10.0)
    rubocop (1.69.2)
      json (~> 2.3)
      language_server-protocol (>= 3.17.0)
      parallel (~> 1.10)
      parser (>= 3.3.0.2)
      rainbow (>= 2.2.2, < 4.0)
      regexp_parser (>= 2.9.3, < 3.0)
      rubocop-ast (>= 1.36.2, < 2.0)
      ruby-progressbar (~> 1.7)
      unicode-display_width (>= 2.4.0, < 4.0)
    rubocop-ast (1.37.0)
      parser (>= 3.3.1.0)
    rubocop-performance (1.23.0)
      rubocop (>= 1.48.1, < 2.0)
      rubocop-ast (>= 1.31.1, < 2.0)
    ruby-progressbar (1.13.0)
    rubyzip (2.3.2)
    standard (1.43.0)
      language_server-protocol (~> 3.17.0.2)
      lint_roller (~> 1.0)
      rubocop (~> 1.69.1)
      standard-custom (~> 1.0.0)
      standard-performance (~> 1.6)
    standard-custom (1.0.2)
      lint_roller (~> 1.0)
      rubocop (~> 1.50)
    standard-performance (1.6.0)
      lint_roller (~> 1.1)
      rubocop-performance (~> 1.23.0)
    unicode-display_width (3.1.3)
      unicode-emoji (~> 4.0, >= 4.0.4)
    unicode-emoji (4.0.4)

PLATFORMS
  ruby

DEPENDENCIES
  minitest (~> 5.16)
  names_dataset!
  rake (~> 13.0)
  standard (~> 1.3)

BUNDLED WITH
   2.5.16
129 changes: 128 additions & 1 deletion README.md
@@ -1,3 +1,130 @@
# First and Last Names Dataset

A Ruby port of https://github.com/philipperemy/name-dataset
`NamesDataset` is a Ruby library (ported from the Python [philipperemy/name-dataset](https://github.com/philipperemy/name-dataset) library) that provides fast lookups and metadata for first and last names. Ever wondered whether “Zoe” is more likely a name from the United Kingdom, or how popular “White” is as a last name in the United States? This library helps you answer those questions.

`NamesDataset` can help you:
- Search for a first or last name and learn about:
- Probable country of origin
- Gender distribution (for first names)
- Rank/popularity
- Get lists of top names by country and gender.

Under the hood, `NamesDataset` loads its dataset (derived from a Facebook leak of 533M users) into memory, where it occupies roughly 3.2GB. Lookups are quick once loading completes, but the memory footprint is significant, so keep that in mind if you’re planning on deploying this to production.

## Requirements
- Ruby >= 2.7 (tested on 2.7, 3.0, 3.1, 3.2).
- Approximately 3.2GB of RAM available to load the full dataset.

## Installation

Add the gem to your Gemfile and run `bundle install`.

```ruby
gem "names_dataset"
```

Then require the library and initialize it in your application.

```ruby
require "names_dataset"

# Initialization is slow because the dataset is large.
# Tip: create the instance once, during your app's startup process.
nd = NamesDataset.new
```
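
Because initialization is slow and memory-hungry, one option is to build the instance once and share it across the app. The sketch below is only an illustration under stated assumptions: `SharedDataset` is a hypothetical helper that is not part of this gem, and the loader block stands in for `NamesDataset.new` so the snippet runs without the dataset.

```ruby
# Hypothetical helper (not part of names_dataset): builds an expensive
# object once and hands the same instance to every caller.
class SharedDataset
  def initialize(&loader)
    @loader = loader
    @mutex = Mutex.new
    @instance = nil
  end

  # Returns the cached instance, running the loader only on first use.
  # The mutex keeps two threads from loading the dataset twice.
  def instance
    @mutex.synchronize { @instance ||= @loader.call }
  end
end

# In a real app the block would be `{ NamesDataset.new }`; a counter
# stands in here so the example is self-contained.
load_count = 0
shared = SharedDataset.new do
  load_count += 1
  Object.new
end

first = shared.instance
second = shared.instance
puts first.equal?(second) # true: the same object is returned both times
puts load_count           # 1: the loader ran only once
```

Whether to load eagerly at boot or lazily on first lookup is a deployment choice; lazy loading keeps processes that never query names from paying the memory cost.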

## Usage

`NamesDataset` provides methods to query the dataset for information about first and last names. Here are some examples:

```ruby
nd = NamesDataset.new

p nd.search("Philippe")
# => {
#   :first_name => {
#     :country => { "France" => 0.63, "Belgium" => 0.12, ... },
#     :gender => { "Male" => 0.99, "Female" => 0.01 },
#     :rank => { "France" => 73, "Belgium" => 291, ... }
#   },
#   :last_name => {
#     :country => {},
#     :gender => {},
#     :rank => {}
#   }
# }

p nd.search("Zoe")
# => {
#   :first_name => {
#     :country => { "United Kingdom" => 0.52, "United States" => 0.23, ... },
#     :gender => { "Female" => 0.98, "Male" => 0.02 },
#     :rank => { "United Kingdom" => 140, "United States" => 315, ... }
#   },
#   :last_name => { ... }
# }
```

The result is a Ruby Hash with the following structure:
- `:first_name`: Includes `:country`, `:gender`, `:rank`
- `:last_name`: Includes `:country`, `:gender` (generally empty for last names), and `:rank`
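
As a rough illustration of working with this structure, the sketch below picks the highest-scoring entry from a distribution. It assumes the hash shape shown above; `most_likely` is a hypothetical helper rather than part of the library, and the sample values are copied from the example output, not queried from the dataset.

```ruby
# Sample result shaped like the `search` output above; the numbers are
# illustrative values taken from the README example.
result = {
  first_name: {
    country: {"United Kingdom" => 0.52, "United States" => 0.23},
    gender: {"Female" => 0.98, "Male" => 0.02},
    rank: {"United Kingdom" => 140, "United States" => 315}
  },
  last_name: {country: {}, gender: {}, rank: {}}
}

# Hypothetical helper: returns the key with the highest score, or nil
# when the distribution is missing or empty (as for :last_name above).
def most_likely(distribution)
  return nil if distribution.nil? || distribution.empty?
  distribution.max_by { |_label, score| score }.first
end

puts most_likely(result.dig(:first_name, :country)) # United Kingdom
puts most_likely(result.dig(:first_name, :gender))  # Female
p most_likely(result.dig(:last_name, :country))     # nil
```

Using `dig` keeps the lookup safe if a key is absent, which is an assumption about how unknown names might be represented rather than documented behavior.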

### Memory Usage Disclaimer

Because the library pre-loads the entire 3.2GB dataset into memory, you’ll need sufficient RAM to avoid a `NoMemoryError`. If you only need a subset of the data or if memory is a major concern, consider alternative approaches (e.g., a streaming or database-based solution). But if you can spare the memory, `NamesDataset` is fast for repeated lookups once it’s loaded.

### Top Names

Similar to the Python library, you can fetch the most popular names by country or gender:

```ruby
p nd.get_top_names(n: 10, gender: "Male", country_alpha2: "US")
# => {
#   "US" => {
#     "M" => ["Jose", "David", "Michael", "John", "Juan", ... ]
#   }
# }

p nd.get_top_names(n: 5, country_alpha2: "ES")
# => {
#   "ES" => {
#     "M" => ["Jose", "Antonio", "Juan", "Manuel", "David"],
#     "F" => ["Maria", "Ana", "Carmen", "Laura", "Isabel"]
#   }
# }
```
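
The nested result can be flattened into rows for display or export. The following is a minimal sketch that assumes the country, then gender, then names shape shown above; `top_name_rows` is a hypothetical helper, and the sample hash is copied from the example output rather than fetched from the dataset.

```ruby
# Sample shaped like the `get_top_names` output above for Spain.
top = {
  "ES" => {
    "M" => ["Jose", "Antonio", "Juan", "Manuel", "David"],
    "F" => ["Maria", "Ana", "Carmen", "Laura", "Isabel"]
  }
}

# Hypothetical helper: flattens the nested hash into
# [country, gender, name, position] rows, with positions starting at 1.
def top_name_rows(top_names)
  top_names.flat_map do |country, by_gender|
    by_gender.flat_map do |gender, names|
      names.each_with_index.map { |name, i| [country, gender, name, i + 1] }
    end
  end
end

rows = top_name_rows(top)
puts rows.length # 10
p rows.first     # ["ES", "M", "Jose", 1]
```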

### Other Helpers

```ruby
p nd.get_country_codes(alpha_2: true)
# => ["AE", "AF", "AL", "AO", "AR", "AT", ... ]

nd.first_names
# => A Hash of first names mapped to their attributes (country, gender, rank, etc).

nd.last_names
# => A Hash of last names mapped to their attributes (country, rank, etc).
```

## Full Dataset

For offline or alternative usage, a link to the raw dataset can be found in the [original Python library](https://github.com/philipperemy/name-dataset/blob/6ae42a6a84a7b6460baa2cbd440f0cdf9fe81752/README.md#full-dataset).

## Ports

- This library is a port of the original Python library [philipperemy/name-dataset](https://github.com/philipperemy/name-dataset).

## Contributing

We welcome contributions! Feel free to open an issue or submit a pull request on GitHub.

## License

This library is subject to the same considerations as the Python version:
- The dataset is generated from a large-scale Facebook leak (533M accounts).
- Basic lists of names are [typically not copyrightable](https://github.com/philipperemy/name-dataset/blob/6ae42a6a84a7b6460baa2cbd440f0cdf9fe81752/README.md#license), but please consult a lawyer if you have specific legal concerns.
- You can find the full license from the original Python library in [that project](https://github.com/philipperemy/name-dataset/blob/6ae42a6a84a7b6460baa2cbd440f0cdf9fe81752/LICENSE).
- You can find the full license for this Ruby port in the [LICENSE](LICENSE) file at the root of this repository.

Thanks for checking out `names_dataset`! If this library helps you ship something neat, I’d love to know about it. Feel free to open a Pull Request or Issue :heart:
10 changes: 10 additions & 0 deletions Rakefile
@@ -0,0 +1,10 @@
# frozen_string_literal: true

require "bundler/gem_tasks"
require "minitest/test_task"

Minitest::TestTask.create

require "standard/rake"

task default: %i[test standard]
11 changes: 11 additions & 0 deletions bin/console
@@ -0,0 +1,11 @@
#!/usr/bin/env ruby
# frozen_string_literal: true

require "bundler/setup"
require "names_dataset"

# You can add fixtures and/or initialization code here to make experimenting
# with your gem easier. You can also use a different console, if you like.

require "irb"
IRB.start(__FILE__)
8 changes: 8 additions & 0 deletions bin/setup
@@ -0,0 +1,8 @@
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
set -vx

bundle install

# Do any other automated setup that you need to do here
Binary file added data/first_names.zip
Binary file not shown.
Binary file added data/last_names.zip
Binary file not shown.