# Domain Checker

In [1]:
# dependencies
import whois
import validators
import pandas as pd

## Background

CSV files flagging whether each word's domain is registered have already been generated by separate Python files. This section gives a brief overview of that process.


### Domain Lookup

The `python-whois` library is used to access domain registration information.

In [2]:
def dm_lookup(dm):
    """
    Check whether a domain is registered.
    A domain that passes format validation and returns a whois result is treated as registered.
    Args:
        - dm (string) | domain name to check, including Top-Level Domain
    Returns:
        - the whois result if one is found, otherwise a string describing the outcome
    """
    if validators.domain(dm):
        try:
            dm_info = whois.whois(dm)
            return dm_info
        except Exception as e:
            # classify the failure by the start of the exception message
            e_text = str(e).lower()
            if e_text.startswith('no match for'):
                return f"{dm} is not registered"
            elif e_text.startswith('domain not found'):
                return "Unavailable TLD"
            else:
                return "Other exception"
    else:
        return "Invalid domain format"

In [3]:
# test domain lookup with known url
dm_info = dm_lookup("google.com")
print(dm_info)

{
  "domain_name": "GOOGLE.COM",
  "registrar": "MarkMonitor, Inc.",
  "registrar_url": "http://www.markmonitor.com",
  "reseller": null,
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-09-09 15:39:04",
    "2024-08-02 02:17:33+00:00"
  ],
  "creation_date": [
    "1997-09-15 04:00:00",
    "1997-09-15 07:00:00+00:00"
  ],
  "expiration_date": [
    "2028-09-14 04:00:00",
    "2028-09-13 07:00:00+00:00"
  ],
  "name_servers": [
    "NS1.GOOGLE.COM",
    "NS2.GOOGLE.COM",
    "NS3.GOOGLE.COM",
    "NS4.GOOGLE.COM"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited ht

In [4]:
# test domain lookup with unowned url
dm_info = dm_lookup("googleasdfasdf.com")
print(dm_info)

googleasdfasdf.com is not registered


### Word Frequency

A dataset derived from the Google Web Trillion Word Corpus (accessed from [Kaggle](https://www.kaggle.com/datasets/rtatman/english-word-frequency)) is used as a reference for word frequency.

In [5]:
# import word frequencies
words_df = pd.read_csv("data/unigram_freq.csv")
words_df

Unnamed: 0,word,count
0,the,23135851162
1,of,13151942776
2,and,12997637966
3,to,12136980858
4,a,9081174698
...,...,...
333328,gooek,12711
333329,gooddg,12711
333330,gooblle,12711
333331,gollgo,12711


- Two new columns are added: word length and frequency rank.
- Words of 4 letters or fewer are filtered out because previous projects suggest that no domains of 4 letters or fewer remain unregistered.
- The word list is filtered to ranks below 50,000. Searching through longer word lists took much longer and produced less impactful matches.
- This leaves around 40,000 words.

In [6]:
# add column for word length
words_df['length'] = words_df['word'].str.len()

# remove words of 4 letters or fewer - all such domains are believed to be taken
words_df = words_df[words_df['length']>4].reset_index().rename(columns={'index': 'rank'})

# filter based on rank <50k
words_df = words_df[words_df['rank']<50000]

words_df

Unnamed: 0,rank,word,count,length
0,35,about,1226734006,5.0
1,40,search,1024093118,6.0
2,45,other,978481319,5.0
3,48,information,932594387,11.0
4,56,which,810514085,5.0
...,...,...,...,...
40249,49993,hotlines,333309,8.0
40250,49994,hazelton,333297,8.0
40251,49996,reaffirms,333279,9.0
40252,49997,anleitung,333245,9.0


The Python file then splits the dataframe into one dataframe per word length. For each word in a per-length dataframe, it appends a TLD such as `.com` and searches for a registration. The resulting list of flags is added as a column to the dataframe, which is then exported to a CSV file.

Flag meanings:
- `1`: found registration
- `0`: did not find registration
- `-1`: unavailable TLD
- `-2`: another Whois exception
- `-3`: invalid domain format

It was important to handle the different exceptions separately. Because the Python file takes so long to run, connection errors can occur, and re-running the whole file is time-consuming. The Whois exception for a missing registration therefore gets a different flag from other Whois exceptions, so that only the domains which threw an error need to be re-run rather than the entire list.
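
The flagging step described above could look roughly like the sketch below. The project's actual helper code is not shown in this notebook, so `lookup_to_flag` and `flag_words` are hypothetical names, and the `lookup` argument is assumed to behave like `dm_lookup` (returning a whois result on success and a descriptive string otherwise).

```python
# Hypothetical sketch: map dm_lookup-style results to the integer flags listed above.
# The function names here are illustrative and are not the project's helpers.

def lookup_to_flag(result):
    """Translate a dm_lookup-style return value into an integer flag."""
    if not isinstance(result, str):
        return 1   # a whois result came back: registration found
    if result.endswith("is not registered"):
        return 0   # whois reported no match for the domain
    if result == "Unavailable TLD":
        return -1
    if result == "Invalid domain format":
        return -3
    return -2      # any other Whois exception


def flag_words(words, lookup, tld=".com"):
    """Append the TLD to each word, look it up, and collect one flag per word."""
    return [lookup_to_flag(lookup(f"{word}{tld}")) for word in words]


# Stand-in lookup so the sketch runs without network access (assumption for the demo).
def fake_lookup(dm):
    if dm == "google.com":
        return {"domain_name": "GOOGLE.COM"}
    return f"{dm} is not registered"


flags = flag_words(["google", "googleasdfasdf"], fake_lookup)
print(flags)
```

Passing the lookup function in as an argument keeps the flag mapping testable offline; in the real pipeline it would be replaced by `dm_lookup`.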

## Initial Results

Results are exported to a separate CSV file for each word length. The results files need to be combined.

In [7]:
result_filenames = [f"results/dm_results_{i}_letters.csv" for i in range(5,21)]

result_dfs = []
for filename in result_filenames:
    df = pd.read_csv(filename)
    df = df[['rank', 'word', 'count', 'length', 'reserved']]
    result_dfs.append(df)

results = pd.concat(result_dfs)
display(results.head())
print(results['reserved'].value_counts())

Unnamed: 0,rank,word,count,length,reserved
0,35,about,1226734006,5.0,1
1,45,other,978481319,5.0,1
2,56,which,810514085,5.0,1
3,57,their,782849411,5.0,1
4,62,there,701170205,5.0,1


reserved
 1    40056
 0      117
-2       81
Name: count, dtype: int64


Any words flagged as not being registered or raising an exception will be re-run for verification.

In [8]:
# import the helper functions
from helper_functions import flag_loop

# filter words to selected flags
results_words = results[results['reserved'] != 1].copy()
results_words['reserved'] = flag_loop(results_words, 'N')
results_words.head()

N-letter words:  91%|█████████ | 180/198 [02:10<00:03,  5.95it/s]2025-04-10 13:33:04,016 - whois.whois - ERROR - Error trying to connect to socket: closing socket - timed out
N-letter words: 100%|██████████| 198/198 [02:24<00:00,  1.37it/s]


Unnamed: 0,rank,word,count,length,reserved
3065,24669,telco,1116242,5.0,1
3096,24957,drwxr,1093839,5.0,0
3394,27678,serbs,913195,5.0,1
3395,27683,verdi,912609,5.0,1
3396,27687,alpes,912218,5.0,1


In [9]:
# view flag counts
print(results_words['reserved'].value_counts())

# view filtered results
result_df = results_words[results_words['reserved'] == 0]
display(result_df)

reserved
0    116
1     82
Name: count, dtype: int64


Unnamed: 0,rank,word,count,length,reserved
3096,24957,drwxr,1093839,5.0,0
4067,33843,libxt,650704,5.0,0
5362,45889,nzlug,385406,5.0,0
5829,49718,setcl,336237,5.0,0
3978,27348,hetatm,932185,6.0,0
...,...,...,...,...,...
1,19216,memberlistmemberlist,1717438,20.0,0
2,23607,helpsearchmemberscalendar,1204096,25.0,0
4,27610,gardenjewelrykidsmore,917146,21.0,0
6,34032,ezcontentobjecttreenode,644566,23.0,0


## Final Results

The results show that there are many terms that aren't just words. They may be amalgamations of words or (relatively) common abbreviations.

In [10]:
# view the shortest word results
result_df.sort_values('length').reset_index(drop=True).head(20)

Unnamed: 0,rank,word,count,length,reserved
0,24957,drwxr,1093839,5.0,0
1,33843,libxt,650704,5.0,0
2,45889,nzlug,385406,5.0,0
3,49718,setcl,336237,5.0,0
4,49204,seqres,342532,6.0,0
5,47878,libpam,358654,6.0,0
6,47683,unrhyw,360896,6.0,0
7,43591,ranlib,420603,6.0,0
8,43283,xviewg,425754,6.0,0
9,39042,libwww,510597,6.0,0


Many of these "words" are computer commands, which would explain why they rank so highly in the Google web corpus. There also appear to be some combined words that could be hashtags from sites like Pinterest.

In [11]:
# view the most common word results
result_df.sort_values('rank').reset_index(drop=True).head(20)

Unnamed: 0,rank,word,count,length,reserved
0,12869,acdbline,3373661,8.0,0
1,16680,cvsroot,2179905,7.0,0
2,17693,usergroupsusergroups,1968491,20.0,0
3,18882,oldmedline,1766800,10.0,0
4,19152,asnblock,1725480,8.0,0
5,19216,memberlistmemberlist,1717438,20.0,0
6,20357,peterthoeny,1553749,11.0,0
7,21608,otherosfs,1399389,9.0,0
8,22317,viewcvs,1323001,7.0,0
9,23017,xlibmesa,1257219,8.0,0


A composite ranking that averages the length rank and the frequency rank can help identify the most promising available domains.

In [12]:
rank_df = result_df.copy()
rank_df['len_rank'] = rank_df['length'].rank(method='min')
rank_df['freq_rank'] = rank_df['count'].rank(method='min', ascending=False)
rank_df['avg_rank'] = (rank_df['len_rank'] + rank_df['freq_rank']) / 2
rank_df = rank_df[['word', 'count', 'length', 'freq_rank', 'len_rank', 'avg_rank']].sort_values('avg_rank')
rank_df.head(10)

Unnamed: 0,word,count,length,freq_rank,len_rank,avg_rank
3096,drwxr,1093839,5.0,13.0,1.0,7.0
3978,hetatm,932185,6.0,15.0,5.0,10.0
2418,cvsroot,2179905,7.0,2.0,18.0,10.0
4330,nysgrc,807810,6.0,20.0,5.0,12.5
3285,viewcvs,1323001,7.0,9.0,18.0,13.5
4702,tcmseq,702646,6.0,28.0,5.0,16.5
4067,libxt,650704,5.0,33.0,1.0,17.0
4725,ptcldy,696519,6.0,29.0,5.0,17.0
1559,acdbline,3373661,8.0,1.0,33.0,17.0
4332,wstrict,826107,7.0,18.0,18.0,18.0


## Takeaways

The dearth of available domains that are actually words isn't totally surprising, but the sparsity is still striking to me. However, this project did uncover what I believe to be some high-value domain names.

There are also significant opportunities for further research:
- Testing other TLDs could find many more available domains.
    - This only tested `.com` TLDs, but the `python-whois` package states that it supports `.net` and `.edu` as well.
    - I also believe that it supports `.org` and other TLDs and that their error messages just haven't been updated to reflect that.
- Previous research suggested that all 4-letter domains (words or otherwise) had been registered. However, this project suggests that some 5-letter domains remain unregistered, including relatively popular abbreviations.
    - Theoretically, it wouldn't be too complicated to use the functions defined in this project to search through all possible 5-letter combinations.
    - Practically, that's 26^5 possibilities. When running the pipeline without anything else running on my computer, I could get through about 3,000 domains per hour.
    - If we generously assume that pace could be maintained for a much larger volume, then the 11,881,376 possibilities (minus the near-negligible number already tested as words) would take approximately 3,960 hours. Running constantly on my local machine, that would be about 165 days. Maybe there are cloud computing solutions, but this idea may need to stay theoretical for the sake of cost.
- Other corpora could be used to specifically target full words, not including abbreviations or amalgamations.
    - Other corpora considered were the Corpus of Contemporary American English (COCA) and the iWeb corpus.
    - However, I didn't want to spend hundreds of dollars on a corpus. An academic license could ease this if you're part of an eligible academic institution.
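
The back-of-envelope estimate for the exhaustive 5-letter search can be reproduced with a short sketch. It assumes only lowercase ASCII letters and the roughly 3,000 domains per hour pace quoted above; `five_letter_domains` is a hypothetical generator name, not part of the project.

```python
from itertools import islice, product
from string import ascii_lowercase

DOMAINS_PER_HOUR = 3_000  # observed local pace quoted above (assumed constant)


def five_letter_domains(tld=".com"):
    """Lazily yield every 5-letter lowercase combination with the TLD appended."""
    for letters in product(ascii_lowercase, repeat=5):
        yield "".join(letters) + tld


total = len(ascii_lowercase) ** 5   # 26^5 candidate names
hours = total / DOMAINS_PER_HOUR    # time at the assumed pace
days = hours / 24

print(total)         # 11881376
print(round(hours))  # 3960
print(round(days))   # 165
print(list(islice(five_letter_domains(), 3)))  # ['aaaaa.com', 'aaaab.com', 'aaaac.com']
```

Using a generator avoids holding all 11.9 million candidates in memory, so a long-running search could consume them in batches.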