# Exploring Clustering and Network Results

## Postgres Access

Connecting to the database:

In [None]:
import psycopg2
import sqlalchemy as sql

In [None]:
pg_dbname = '...'
pg_username = '...'
pg_endpoint = '...rds.amazonaws.com'
pg_port = '...'
pg_password = '...'

# SQLAlchemy 1.4+ only accepts the 'postgresql://' scheme, not 'postgres://'
pg_engine = sql.create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    pg_username,
    pg_password,
    pg_endpoint,
    pg_port,
    pg_dbname))
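
One thing to watch with a URL built by string formatting: if the password contains characters such as `@` or `/`, the URL will not parse correctly. Below is a minimal sketch of escaping the credentials with the standard library first. `build_pg_url` is a hypothetical helper, not part of everest, and the credentials shown are made up.

```python
from urllib.parse import quote_plus


def build_pg_url(username, password, endpoint, port, dbname):
    """Build a PostgreSQL connection URL with percent-escaped credentials.

    Hypothetical helper for illustration; not part of everest.
    """
    return 'postgresql://{}:{}@{}:{}/{}'.format(
        quote_plus(username),   # escape special characters in the user name
        quote_plus(password),   # escape '@', '/', ':' etc. in the password
        endpoint,
        port,
        dbname)


# Made-up credentials, only to show the escaping
example_url = build_pg_url('user', 'p@ss/word', 'host', '5432', 'db')
print(example_url)
```

The resulting string can be passed to `sql.create_engine` in place of the hand-formatted URL.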

## Text Clustering

In [None]:
from everest import cluster

This pulls the text clustering data and organizes the info into the clusters object:

In [None]:
clusters = cluster.Clusters(pg_engine)

clusters.cluster_info shows info on each cluster, with traffic, domain ratings, and backlink counts from ahrefs.

In [None]:
clusters.cluster_info

clusters.cluster_domains shows info on each scraped domain, including its language and which cluster it's in.

By default, 'new' says whether the domain was found since the last upload of ahrefs data. clusters.set_new_date_threshold(date_string) takes a string (it just has to be recognizable as a date by pandas' pd.to_datetime() function) and uses it as the new threshold for whether a site is 'new'.

In [None]:
clusters.cluster_domains
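
To illustrate how a date threshold could drive the 'new' flag, here is a rough sketch on a stand-in table. The `first_seen` column, the `flag_new` helper, and the sample rows are all assumptions made for illustration; the real logic inside everest may differ.

```python
import pandas as pd

# Hypothetical stand-in for clusters.cluster_domains; column names are assumptions
domains = pd.DataFrame({
    'domain': ['a.example', 'b.example', 'c.example'],
    'first_seen': ['2019-01-15', '2019-03-02', '2019-04-20'],
})


def flag_new(df, date_string):
    """Return a copy of df with 'new' set for rows first seen after the threshold."""
    # Any string that pd.to_datetime can parse works as a threshold
    threshold = pd.to_datetime(date_string)
    out = df.copy()
    out['new'] = pd.to_datetime(out['first_seen']) > threshold
    return out


flagged = flag_new(domains, 'March 1, 2019')
print(flagged)
```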

clusters.cluster_backlinks is the same, but with backlinks from the ahrefs database (only those from scraped domains).

In [None]:
clusters.cluster_backlinks

There are a few ways to pull all info just for domains/backlinks in one cluster:


By cluster name:

clusters.get_cluster_domains('en48')

clusters.get_cluster_backlinks('en48')


By domain name or backlink URL:

clusters.get_cluster_domains_by_domain('[redacted]')

clusters.get_cluster_backlinks_by_url('[redacted]')


In [None]:
clusters.get_cluster_domains('en56')

In [None]:
clusters.get_cluster_domains_by_domain('')

In [None]:
clusters.get_cluster_backlinks('en48')

In [None]:
clusters.get_cluster_backlinks_by_url('[redacted]')

Since these are all just pandas tables, they can be sorted by whichever columns you want.
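
As a short sketch of sorting, using a stand-in table (the column names and values here are made up; check the actual columns of clusters.cluster_info before relying on them):

```python
import pandas as pd

# Hypothetical stand-in for clusters.cluster_info; columns and values are made up
cluster_info = pd.DataFrame({
    'cluster': ['en48', 'en56', 'en12'],
    'traffic': [1200, 300, 4500],
    'domain_rating': [35.0, 35.0, 12.5],
})

# Sort by one column, highest traffic first
by_traffic = cluster_info.sort_values('traffic', ascending=False)

# Sort by several columns: highest rating first, ties broken by lowest traffic
by_rating_then_traffic = cluster_info.sort_values(
    ['domain_rating', 'traffic'], ascending=[False, True])

print(by_traffic)
print(by_rating_then_traffic)
```

`sort_values` returns a new table, so the original stays in its default order.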

## Link Networks

In [None]:
from everest import network

This pulls the network clustering data and organizes each connected component (i.e. each group of interconnected sites).

In [None]:
networks = network.Networks(pg_engine)

networks.component_info gives info for all such components. Components are by default listed in order of size (i.e. number of sites).

'density' is the ratio of links to sites.

'star' is true if the network is just one central site, with all other sites linking to it but not to each other.

'centroid' is the site with the most links within the component.

The domain and traffic ratings are from ahrefs; both the mean across all sites within the component and the maximum are given.
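
The 'density', 'star', and 'centroid' definitions above can be sketched in plain Python for a single component. This is only an illustration under stated assumptions: `summarize_component` is a hypothetical helper, links are treated as directed (source, target) pairs, the component is assumed to have at least one link, and ties for the centroid are broken alphabetically. The library's own calculation may differ.

```python
from collections import Counter


def summarize_component(edges):
    """Summarize one component given its directed links as (source, target) pairs.

    Hypothetical helper for illustration; assumes at least one link.
    """
    sites = {site for edge in edges for site in edge}

    # Count every link a site takes part in, inbound or outbound
    link_counts = Counter()
    for source, target in edges:
        link_counts[source] += 1
        link_counts[target] += 1

    # Centroid: the site with the most links (alphabetical tie-break)
    centroid = max(sorted(link_counts), key=link_counts.get)

    # Density: ratio of links to sites
    density = len(edges) / len(sites)

    # Star: every link points from some other site into the centroid
    star = all(target == centroid and source != centroid
               for source, target in edges)

    return {'sites': len(sites), 'density': density,
            'star': star, 'centroid': centroid}


star_summary = summarize_component(
    [('a.example', 'hub.example'),
     ('b.example', 'hub.example'),
     ('c.example', 'hub.example')])

mesh_summary = summarize_component(
    [('a.example', 'hub.example'),
     ('b.example', 'hub.example'),
     ('c.example', 'hub.example'),
     ('a.example', 'b.example')])

print(star_summary)
print(mesh_summary)
```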

In [None]:
networks.component_info

networks.domain_info gives the info for each domain, including which component it's in, and the number of links in and out.

In [None]:
networks.domain_info

There are two ways to get info about all the domains in a given component, or to draw the component (networks.draw_component and networks.draw_component_by_domain take the same arguments).

By component number:
networks.get_component(5)

By domain or link:
networks.get_component_by_domain('medium.com')



In [None]:
networks.get_component(1)

In [None]:
networks.get_component_by_domain('[redacted]')

If you want to remove sites from the graph (e.g. because they're generic sites that everyone links to), you can do so with, e.g.:

networks.remove_node('medium.com')

and then to reinstate every node:

networks.reset_network_graph()
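
To see why removing a generic hub matters, here is a rough stand-alone sketch of how connected components change when one node is dropped. The `connected_components` function and the sample links are hypothetical and independent of everest; the library may compute components differently.

```python
def connected_components(edges, removed=frozenset()):
    """Group sites into connected components, ignoring any removed sites.

    Hypothetical illustration: links are treated as undirected, and a site
    whose only links went to removed sites no longer appears at all.
    """
    neighbours = {}
    for source, target in edges:
        if source in removed or target in removed:
            continue
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)

    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from start
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)

    # Largest components first
    return sorted(components, key=len, reverse=True)


links = [('a.example', 'medium.com'), ('b.example', 'medium.com'),
         ('a.example', 'c.example'), ('b.example', 'd.example')]

before = connected_components(links)
after = connected_components(links, removed={'medium.com'})
print(before)
print(after)
```

With the hub in place, all five sites form one component; without it, they fall into two unrelated pairs.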

In [None]:
networks.remove_node('medium.com')

In [None]:
networks.reset_network_graph()

In [None]:
networks.draw_component(5)

In [None]:
networks.draw_component_by_domain('medium.com')

In [None]:
# The link type can be 'all_ahrefs_domains', 'all_domains', or 'all_links'
networks.set_link_type('all_domains')