# Automated Domain Availability Analysis

In this notebook, we apply simple filtering techniques to classify a domain as available or unavailable. We further sub-classify the unavailable domains based on the HTTP response code and the HTML page returned.

## Basic Filtering Technique Used

- We use the standard urllib library in Python to make a request to the web page and read the initial response.
- We perform a basic initial filter. If the initial response is error 404, we mark the domain as invalid. If the initial response is anything other than 200, we mark the domain as unavailable along with the corresponding response code.

- If the response code is 200, we parse the entire page using Beautiful Soup and, based on the observations described below, mark the domain as available or for sale.
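
The initial filter described above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: `initial_filter` and its `opener` parameter are hypothetical names, and the opener is injectable only so the logic can be tried without network access. One detail worth knowing is that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so in those cases the status code has to be read from the exception.

```python
import urllib.error
import urllib.request


def initial_filter(domain_name, opener=urllib.request.urlopen):
    """Return a (status_code, comment) tuple for the first response of a domain.

    `opener` is a hypothetical hook (defaults to urlopen) used for offline testing.
    """
    try:
        with opener("http://" + domain_name) as response:
            status = response.getcode()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is on the exception.
        status = err.code
    except OSError:
        # URLError is a subclass of OSError: DNS failure, refused connection, timeout.
        # Following the notebook's convention, unreachable domains are reported as 404.
        return (404, "domain not reachable or invalid")

    if status == 404:
        return (404, "domain not reachable or invalid")
    if status != 200:
        return (status, "domain unavailable with status code " + str(status))
    # A 200 response still needs the page analysis to tell available from for sale.
    return (200, "needs page analysis")
```

Calling `initial_filter("example.com")` performs a real request; passing a stub as `opener` lets the three branches be checked without one.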

### Identifying the domains for sale using a parser (Beautiful Soup)

After going through a number of available and for-sale domains, the important observations are:

- The domains for sale have, in almost all cases, no <b>internal links</b> on the entire landing page, whereas the available domains have a number of internal links.
<br> <b>Note: </b> Internal links are links that point to the same domain.
- The domains for sale have a for-sale banner on them.
- The domains for sale have a smaller page size.


I analysed both available and unavailable domains, and the major observations are:
- Keywords like 'sale' are not a good indicator, as this word commonly occurs on available websites as well.
- Page size is relatively small for for-sale domains.
- The number of internal links is the best indicator, as all the for-sale domains observed had no internal links at all.

Therefore, as of now, we use only the internal links to identify a domain as available or for sale. The classifier below treats a page with more than two internal links as available.
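
The internal-link idea can be illustrated with a small sketch. It deliberately uses the standard library's `html.parser` and `urllib.parse` instead of Beautiful Soup, so it runs without third-party packages. `LinkCollector` and `count_internal_links` are hypothetical names, and the rule used here (resolve each href against the domain, then compare hosts, ignoring a leading `www.`) is an assumption about what "same domain" should mean, not the notebook's exact heuristic.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def _host(netloc):
    """Lower-case a host name and drop a leading 'www.' for comparison."""
    netloc = netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def count_internal_links(html, domain_name):
    """Count <a> links that resolve to the same host as domain_name."""
    collector = LinkCollector()
    collector.feed(html)
    base = "http://" + domain_name + "/"
    count = 0
    for href in collector.hrefs:
        if href.startswith("#"):
            continue  # fragment-only links stay on the same page; skip them
        target = urlparse(urljoin(base, href))
        if target.scheme in ("http", "https") and _host(target.netloc) == _host(domain_name):
            count += 1
    return count


sample_page = (
    '<a href="/about">About</a>'
    '<a href="http://www.example.com/blog">Blog</a>'
    '<a href="http://other.org/">Elsewhere</a>'
    '<a href="mailto:owner@example.com">Mail</a>'
    '<a href="#">Top</a>'
)
print(count_internal_links(sample_page, "example.com"))
```

Resolving relative hrefs with `urljoin` avoids substring checks such as looking for "http" or ".com" in the link, at the cost of treating subdomains other than `www.` as external.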

#### Storing the Results:

We store the results in the form of a dictionary.
The key is the domain under analysis.
The value is a tuple (x, y), where x is the status code returned (e.g. 200 or 203) and y is an explanatory comment (e.g. "domain for sale").

For example:
{
"sevendollarclick.com" : (200, "domain available"),
"symbux.com" : (200, "domain for sale")
}
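
As a small illustration of why this layout is convenient, the sketch below groups domains by their comment. `group_by_comment` is a hypothetical helper, not part of the notebook, and the sample dictionary simply repeats the example above.

```python
from collections import defaultdict

# Same shape as the result dictionary described above: domain -> (status, comment)
example_labels = {
    "sevendollarclick.com": (200, "domain available"),
    "symbux.com": (200, "domain for sale"),
}


def group_by_comment(labels):
    """Return a dict mapping each comment to the list of domains carrying it."""
    groups = defaultdict(list)
    for domain, (status, comment) in labels.items():
        groups[comment].append(domain)
    return dict(groups)


print(group_by_comment(example_labels))
```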


In [1]:
import urllib.request
from bs4 import BeautifulSoup, SoupStrainer
import re

domain_label = {}

The main function that classifies the domains is given below:

In [2]:
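def classifier(domain_name):
    """Label domain_name in the global domain_label dictionary."""
    global domain_label
    links = []   # hrefs judged to be internal
    tlinks = []  # candidate hrefs that contain a '/'
    try:
        # Note: urlopen raises HTTPError for 4xx/5xx responses, so those end up in the except branch
        with urllib.request.urlopen("http://" + domain_name) as url:
            s = url.getcode()
            if s != 200:
                domain_label[domain_name] = (s, "domain unavailable with status code " + str(s))
            else:
                soup = BeautifulSoup(url, "html.parser")
                # Links whose href mentions the domain itself
                for link in soup.find_all('a', attrs={'href': re.compile(re.escape(domain_name))}):
                    links.append(link.get('href'))
                # Links containing a slash; only the relative-looking ones are kept below
                for link in soup.find_all('a', attrs={'href': re.compile('/')}):
                    tlinks.append(link.get('href'))
                for x in tlinks:
                    if not ("http" in x or ".com" in x or "www." in x or len(x) < 2):
                        links.append(x)
                total_internal_links = len(links)
                if total_internal_links > 2:
                    domain_label[domain_name] = (s, "domain available")
                else:
                    domain_label[domain_name] = (s, "domain for sale or not hosted anymore")
    except Exception:
        # Only overwrite the placeholder; .get avoids a KeyError for uninitialised domains
        if domain_label.get(domain_name, (0, 'null')) == (0, 'null'):
            domain_label[domain_name] = (404, "domain not reachable or invalid")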

### Testing our approach
We now run a basic test of the approach.
For this purpose I pick 19 domains:
- 10 fully working, available domains
- 7 domains that return status code 200 but are for sale
- 2 invalid domains (since these are easily detectable)

In [3]:
available_domains = ["sevendollarclick.com" , "fourdollarclick.com" , "paidverts.com","tv-two.com", "tylerpratt.com", "cuturl.in", "clixsense.com", "andyhaffel.com", "bestptc.org", "probux.com"]
for_sale_domains = ["symbux.com", "profitclicking.com", "buxp.com","buxcap.com", "no1tip.com", "getlink.pw", "thepocketmoney.online"]
invalid_domains = ["t.me", "hits4pay.com"]


# Initialising my output dictionary
for x in available_domains:
    domain_label[x] = (0,'null')
for x in for_sale_domains:
    domain_label[x] = (0,'null')    
for x in invalid_domains:
    domain_label[x] = (0,'null')  

    
for x in available_domains:
    classifier(x)
for x in for_sale_domains:
    classifier(x)    
for x in invalid_domains:
    classifier(x)     
    
domain_label    

{'sevendollarclick.com': (200, 'domain available'),
 'fourdollarclick.com': (200, 'domain available'),
 'paidverts.com': (200, 'domain available'),
 'tv-two.com': (200, 'domain available'),
 'tylerpratt.com': (200, 'domain available'),
 'cuturl.in': (200, 'domain available'),
 'clixsense.com': (200, 'domain available'),
 'andyhaffel.com': (404, 'domain not reachable or invalid'),
 'bestptc.org': (200, 'domain available'),
 'probux.com': (200, 'domain for sale or not hosted anymore'),
 'symbux.com': (200, 'domain for sale or not hosted anymore'),
 'profitclicking.com': (200, 'domain for sale or not hosted anymore'),
 'buxp.com': (200, 'domain for sale or not hosted anymore'),
 'buxcap.com': (200, 'domain for sale or not hosted anymore'),
 'no1tip.com': (200, 'domain for sale or not hosted anymore'),
 'getlink.pw': (200, 'domain for sale or not hosted anymore'),
 'thepocketmoney.online': (200, 'domain for sale or not hosted anymore'),
 't.me': (404, 'domain not reachable or invalid'),
 '

#### Conclusion:
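Even this simple approach classifies the domains correctly in most cases. In the output shown above, 8 of the 10 available domains and all 7 for-sale domains received the expected label; andyhaffel.com was marked as not reachable and probux.com was marked as for sale.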
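
One way to make "most cases" concrete is to tally the labels against the category each domain was placed in. The sketch below is an assumption about how that could be done: `score_labels` is a hypothetical helper, and the sample data is a four-domain subset copied from the output shown above.

```python
def score_labels(expected, labels):
    """Compare expected comments with actual labels.

    expected: dict mapping domain -> comment we expected
    labels:   dict mapping domain -> (status, comment), as produced by the classifier
    Returns (number_correct, total, list_of_mismatched_domains).
    """
    mismatched = [
        domain
        for domain, comment in expected.items()
        if labels.get(domain, (0, "null"))[1] != comment
    ]
    total = len(expected)
    return total - len(mismatched), total, mismatched


# Subset of the classifier output shown above
sample_labels = {
    "sevendollarclick.com": (200, "domain available"),
    "probux.com": (200, "domain for sale or not hosted anymore"),
    "symbux.com": (200, "domain for sale or not hosted anymore"),
    "t.me": (404, "domain not reachable or invalid"),
}

# Expected comments, derived from the list each domain was placed in
sample_expected = {
    "sevendollarclick.com": "domain available",
    "probux.com": "domain available",
    "symbux.com": "domain for sale or not hosted anymore",
    "t.me": "domain not reachable or invalid",
}

print(score_labels(sample_expected, sample_labels))
```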