# Unraveling the BitTorrent Ecosystem
---

- ##### Introduction:
    *In short*: BitTorrent is a remarkably popular file distribution technology. Millions of users share content in hundreds of thousands of torrents every day, and BitTorrent traffic continues to grow at impressive rates. 

- ##### Fundamental:

    The BitTorrent protocol has been published, and the source code of the baseline implementation is widely available. Because of this, there have been over 50 client-side implementations, dozens of independent tracker implementations, and a multitude of torrent-discovery sites.

    This open approach has fostered a very productive discussion in both the development and research communities, and has led to further design improvements.



### The BitTorrent Ecosystem:
---
The Ecosystem builds on open source code, broad client-side support, peer-to-peer communication, and a vast number of discovery sites across the Internet. 
BitTorrent is not only a thriving file distribution system, but also serves as a model for many successful live and on-demand P2P video deployments. 

- ##### Peer-to-Peer Paradigm: 

    The BitTorrent ecosystem consists of three major components: ***peers, peer discovery mechanisms,* and *torrent-discovery sites.***
    
    A ***torrent*** is a collection of peers that participate in the distribution of a specific file at a given time. Each torrent is identified by a torrent identifier called the ***infohash***. 
    
    At any given instant, each peer in a torrent is either a leecher or a seed; a ***seed*** possesses the entire file, whereas a ***leecher*** possesses only a portion of the file. Typically, a torrent begins with an ***initial seed***, which is the only peer to have the file. Each leecher and seed uses one of the many ***BitTorrent client types***. 
    
    The common mechanism for peer discovery is to use a ***tracker***. When a peer joins a torrent, it typically registers with one or more trackers. Any peer can contact a tracker at any time to obtain a random subset (IP-port pairs) of the other peers in the torrent. Peers communicate with each other using the (open) BitTorrent protocol. 
    
    Many BitTorrent clients also support *"distributed trackers"* using DHTs, as well as *Peer Exchange* (PEX). Clients using distributed trackers collectively form a DHT: a client can query the DHT, using an infohash as the key, to obtain a list of peers participating in the torrent. PEX is a gossiping mechanism that allows peers in the same torrent to exchange peer lists directly with each other. Thus, many client types can discover peers using three distinct mechanisms: *centralized trackers, DHTs, and PEX*. To start a new torrent, a user seeds the content file locally, registers the torrent with a tracker, and uploads a .torrent file (with the tracker addresses included) to a torrent-discovery site.
    
    
    
- ##### Communities:
   
    Users can learn about the existence of ongoing torrents from torrent-discovery sites such as The Pirate Bay, Mininova, Isohunt, BTmonster, and Torrent Portal. These are just some of the hundreds of public torrent-discovery sites. Below (Table 1) is a list of the top 10 most-visited English torrent-discovery sites, obtained from the web-traffic monitoring site Alexa. 
    
    Most torrent-discovery sites provide, for each of their indexed torrents, a *.torrent file*, which includes the addresses of one or more trackers and the hashes of all the pieces in the file. Most of the websites in Table 1 provide .torrent files for the torrents they index. 
    
    <img src='./screenshots/public-discovery-sites-t1.png' width='200' height='200'>
    
    In addition to the "flourishing" public-ecosystem activity, there is also activity within private BitTorrent sites. These websites restrict access by requiring a registered account, and commonly use invitation systems to limit registrations. They monitor each user's uploads and downloads, and typically enforce a minimum upload-to-download ratio on each user. Many private trackers embed passkeys in the .torrent file, which the user's client presents to the site's private tracker for authorization. 
    
    <img src='./screenshots/private-discovery-sites-t2.png' width='200' height='200'>
    
    Above is a list of the top 10 private sites, again based on Alexa ranking. Although the world of private sites and torrents is important and interesting, it is mentioned here only to point out that BitTorrent is truly an international, multilingual phenomenon, with content in many different languages (Russian, Chinese, Spanish, and so on). 
 


### The Paper: ***Unraveling the BitTorrent Ecosystem***
---
- ***Thesis Statement***: BitTorrent plays an important role in the Internet, yet an up-to-date and comprehensive understanding of the BitTorrent Ecosystem (Web3) has been lacking. The paper presents a large-scale measurement study covering five of the most popular torrent-discovery sites (The Pirate Bay, BTMonster, Torrent Reactor, Mininova, and TorrentPortal). Over a nine-month period, more than 4.6 million unique torrents and 38,996 trackers were identified from the five sites. The study investigates the degree of indexing overlap among the sites, the characteristics of uploaders, and how the sites acquire .torrent files. To gain further insight into the world of trackers and peers, a high-performance multitracker crawler, which simultaneously crawls thousands of trackers over concurrent TCP connections, was used to obtain the peer lists of millions of torrents within a narrow 12-hour window.

##### Section 2: Overview of the BitTorrent Ecosystem


This section of the paper gives the same overview as *Peer-to-Peer Paradigm* above: the Ecosystem's three major components (peers, peer discovery mechanisms, and torrent-discovery sites), the roles of seeds and leechers, and peer discovery through centralized trackers, DHTs, and PEX.


##### Section 3: The Measurement Methodology and Scope:

***3.1 Measurement Infrastructure***:

The measurement platform includes two crawlers and one storage system. The *discovery-site crawler* downloads webpages and .torrent files from torrent-discovery sites and parses their contents to extract information of interest. It crawls five separate popular torrent-discovery sites, each of which presents torrent metadata in a different format. To improve efficiency, all the downloaded webpages and .torrent files are first stored temporarily in an NFS file system; site-dependent parsers later extract the information and store it in a SQL database. 

Information extracted from the webpages includes the torrent category, torrent upload time, torrent uploader, and number of downloads. From the .torrent files, the parser extracts the torrent infohash, creation time, list of trackers, data file size, and so forth. One goal of the group's study was to obtain good estimates of the number of peers in a given torrent. Trackers support "scrape" queries, which return, for a specific infohash, aggregated information that includes the total number of leechers and seeds. Scrape information cannot always be relied upon, however, so the crawler instead collects the IP/port pairs from each tracker and then aggregates this lower-level data. The IP/port data can also provide a wealth of additional information, including geographical and user-behavior information. 
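
To make the .torrent parsing step concrete, here is a minimal sketch of how a parser could pull those fields out of a .torrent file. It is not the paper's parser. It assumes the standard metainfo layout (a bencoded dictionary with `announce`, `announce-list`, `creation date`, and `info` keys) and the standard definition of the infohash as the SHA-1 digest of the bencoded `info` dictionary. The function names `bdecode` and `torrent_summary` are made up for this example.

```python
import hashlib


def bdecode(data: bytes, pos: int = 0):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                        # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                        # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                        # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    colon = data.index(b":", pos)           # byte string: <length>:<bytes>
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length


def torrent_summary(raw: bytes) -> dict:
    """Extract infohash, creation time, trackers, and data size (hypothetical helper)."""
    if raw[:1] != b"d":
        raise ValueError("a .torrent file must be a bencoded dictionary")
    pos, meta, info_span = 1, {}, None
    # Walk the top-level dictionary by hand so the exact bytes of "info" are known.
    while raw[pos:pos + 1] != b"e":
        key, pos = bdecode(raw, pos)
        start = pos
        meta[key], pos = bdecode(raw, pos)
        if key == b"info":
            info_span = (start, pos)
    if info_span is None:
        raise ValueError("missing info dictionary")
    info = meta[b"info"]
    trackers = [meta[b"announce"]] if b"announce" in meta else []
    for tier in meta.get(b"announce-list", []):
        trackers.extend(t for t in tier if t not in trackers)
    if b"length" in info:                   # single-file torrent
        size = info[b"length"]
    else:                                   # multi-file torrent
        size = sum(f[b"length"] for f in info[b"files"])
    return {
        "infohash": hashlib.sha1(raw[info_span[0]:info_span[1]]).hexdigest(),
        "creation_time": meta.get(b"creation date"),
        "trackers": [t.decode() for t in trackers],
        "size": size,
    }
```

Hashing the original byte span, rather than re-encoding the decoded dictionary, keeps the infohash stable even if a .torrent file orders its keys unusually.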

After obtaining all the infohashes indexed by the torrent-discovery sites, along with the list of trackers associated with each infohash, the *multitracker crawler* determines the peers tracked for each (infohash, tracker) pair by repeatedly requesting peer lists. The term *task* refers to determining the peer list for one such pair. 
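
A single peer-list request for one (infohash, tracker) pair can be sketched as below. The notes do not say which request format the crawler used, so this assumes an HTTP tracker that accepts the standard announce parameters and returns peers in the common compact format (6 bytes per peer: 4 for the IPv4 address, 2 for the port). The function names, the default `port` and `numwant` values, and the zeroed transfer counters are placeholders chosen for illustration.

```python
import socket
import struct
from urllib.parse import urlencode


def build_announce_url(tracker_url: str, infohash: bytes, peer_id: bytes,
                       port: int = 6881, numwant: int = 200) -> str:
    """Build an HTTP announce URL for one (infohash, tracker) pair."""
    query = urlencode({
        "info_hash": infohash,   # raw 20-byte SHA-1, percent-encoded by urlencode
        "peer_id": peer_id,      # 20-byte identifier chosen by the client
        "port": port,
        "uploaded": 0,           # placeholder transfer counters
        "downloaded": 0,
        "left": 0,
        "compact": 1,            # ask for the 6-bytes-per-peer format
        "numwant": numwant,      # how many peers we would like back
    })
    separator = "&" if "?" in tracker_url else "?"
    return f"{tracker_url}{separator}{query}"


def parse_compact_peers(blob: bytes) -> list:
    """Turn a compact peer string into a list of (ip, port) pairs."""
    peers = []
    usable = len(blob) - len(blob) % 6      # ignore a trailing partial entry
    for offset in range(0, usable, 6):
        ip = socket.inet_ntoa(blob[offset:offset + 4])
        (port,) = struct.unpack("!H", blob[offset + 4:offset + 6])
        peers.append((ip, port))
    return peers
```

Each reply contributes a set of `(ip, port)` pairs, and a task is the union of those sets over repeated requests to the same tracker for the same infohash.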

The challenges of designing a high-performance multitracker crawler included: 
   - Getting snapshot data for the whole Ecosystem within a short time. The **thread-pool** model does not work here, as the number of threads is limited by CPU and memory resources. 
   - Controlling the crawling speed in order to avoid being banned by any trackers. The multitracker crawler employs multiple crawling bots controlled by one master controller (the "manager"). To optimize the crawling speed, an asynchronous I/O model was used, which allowed the crawler to support more than 1,000 concurrent TCP (Transmission Control Protocol) connections. To avoid bans, a tunable parameter limited the crawling speed, and the crawl targets were randomized to disperse the traffic evenly among multiple trackers. 
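
The two design points above (asynchronous I/O instead of a thread pool, plus rate limiting and randomized targets) can be illustrated with a small single-process sketch. It is not the paper's crawler and leaves out the master/bot split. The `fetch` coroutine is a hypothetical stand-in for a real tracker request, and `max_connections` and `delay` are assumed names for the tunable limits.

```python
import asyncio
import random


async def crawl(tasks, fetch, max_connections=1000, delay=0.0):
    """Run fetch(infohash, tracker) for every task with bounded concurrency."""
    order = list(tasks)
    random.shuffle(order)                       # disperse requests across trackers
    limit = asyncio.Semaphore(max_connections)  # cap on requests in flight
    results = {}

    async def run(task):
        async with limit:
            results[task] = await fetch(*task)
            await asyncio.sleep(delay)          # tunable pause that caps the request rate

    await asyncio.gather(*(run(task) for task in order))
    return results


async def fake_fetch(infohash, tracker):
    """Hypothetical stand-in for a tracker request; returns a made-up peer set."""
    await asyncio.sleep(0)
    return {(f"10.0.0.{len(tracker)}", 6881)}


tasks = [("aa", "tracker-one"), ("aa", "tracker-two"), ("bb", "tracker-one")]
results = asyncio.run(crawl(tasks, fake_fetch, max_connections=2))
```

Because each pending request is a coroutine rather than a thread, the number of concurrent connections is bounded by the semaphore instead of by per-thread CPU and memory cost.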
   
Trackers only return a random subset of the entire peer set for each query, and multiple queries are required to get the complete peer set. 

Suppose that, for a given infohash, there are *n* peers registered with the tracker and the size of the subset returned is *k*; then the expected number of queries to obtain all the peers is:

$E(n,k) \approx (\frac{n}{k} - \frac{k-1}{2k})L_{n} + \frac{k-1}{2k}$,

where $L_{n}$ is the *n*th harmonic number (i.e., the sum of the reciprocals of the first *n* natural numbers). When *n* is large, we have:

$E(n,k) \approx \frac{n}{k}(\gamma + \ln n)$,

where $\gamma$ is the Euler-Mascheroni constant ($\gamma \approx 0.5772$). However, it is still difficult to determine the number of queries required to obtain the complete peer set. This is because the value of *n* may be inaccurate, and because the above equations only give the expected number of queries. Therefore, the following heuristic was adopted as the stopping criterion for a given task: if the bot does not see any new peers in two consecutive replies from the tracker, it assumes that the peer list is almost complete and stops sending queries to that tracker for that infohash. Note that this heuristic is not applicable if the tracker does not return a random peer list for each query. 
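
The two approximations and the stopping heuristic can be tried out numerically. The sketch below evaluates the formulas exactly as quoted above and simulates a tracker that returns *k* peers chosen uniformly at random per query, which is an assumption about tracker behavior. The function names and the `patience` parameter (the number of consecutive replies without new peers) are chosen for this example.

```python
import math
import random


def harmonic(n: int) -> float:
    """n-th harmonic number: 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def expected_queries(n: int, k: int) -> float:
    """Approximation quoted above for the expected number of queries."""
    c = (k - 1) / (2 * k)
    return (n / k - c) * harmonic(n) + c


def expected_queries_large_n(n: int, k: int) -> float:
    """Large-n form, using harmonic(n) ~ gamma + ln(n)."""
    gamma = 0.5772156649
    return n / k * (gamma + math.log(n))


def crawl_until_stale(peers: list, k: int, patience: int = 2, rng=random):
    """Query a simulated tracker until `patience` consecutive replies add no new peers."""
    seen = set()
    queries = stale = 0
    while stale < patience:
        reply = rng.sample(peers, min(k, len(peers)))   # random subset of size k
        queries += 1
        new = set(reply) - seen
        stale = 0 if new else stale + 1
        seen |= new
    return seen, queries
```

With *k* = 1 the first formula reduces to $n L_n$, the classic coupon-collector result, which is a quick sanity check on the implementation. Comparing `len(seen)` against `len(peers)` over many simulated runs shows how often the heuristic stops before the peer list is complete.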

***3.2 Measurement Scope***:

Using 17 different machines, the discovery-site crawler continuously crawled five major torrent-discovery sites from 25 July 2008 until 22 April 2009. Because some websites limit the rate at which an IP address can download .torrent files, the speed of torrent-discovery crawling was restricted to avoid being banned. The crawler first obtained all the webpages and .torrent files from those sites, then continued to monitor the sites for new .torrent files.

The discovery-site crawler collected approximately 8.8 million .torrent files, from which 4.6 million unique infohashes were obtained. There were also 38,996 trackers, and nearly 19 million unique crawling tasks.

34 machines were used for the multitracker crawler, including 1 master controller. Each crawler obtains a snapshot of each torrent file, typically over every minute. 

A ***peer*** is an `<IP, port>` pair. A peer can join multiple torrents at the same time. A *.torrent uploader* is a registered username on some torrent-discovery site that has uploaded at least one .torrent file. 

A ***torrent*** is a collection of peers that participate in the distribution of a specific file at a given time.

A torrent is said to be an ***active torrent*** if the multitracker crawler finds at least one peer in the torrent. A tracker is said to be an ***active tracker*** if it returns at least one peer for any of the queried infohashes. 
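
These two definitions translate directly into a filter over the crawl results. The sketch below assumes the results are held in a dictionary keyed by (infohash, tracker) task, with the set of peers found as the value. That layout and the function name are illustrative, not the paper's storage format.

```python
def active_sets(crawl_results: dict):
    """Return (active_torrents, active_trackers) from per-task peer sets."""
    # A torrent is active if at least one of its tasks found a peer.
    active_torrents = {infohash
                       for (infohash, _), peers in crawl_results.items() if peers}
    # A tracker is active if it returned a peer for any queried infohash.
    active_trackers = {tracker
                       for (_, tracker), peers in crawl_results.items() if peers}
    return active_torrents, active_trackers


sample = {
    ("aa", "tracker-one"): {("10.0.0.1", 6881)},
    ("aa", "tracker-two"): set(),
    ("bb", "tracker-two"): set(),
}
torrents, trackers = active_sets(sample)
```

In the sample, torrent `aa` and `tracker-one` are active, while torrent `bb` and `tracker-two` are not, because no task involving them returned a peer.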

##### Section 4: The Measurement Results for Torrent-Discovery, Tracker, and Peer Landscapes 

The following sections present the results from the torrent-discovery crawling and the multitracker crawling. They cover the Ecosystem's discovery sites, trackers, and peers in turn, analyzing each in depth to give a better overview of what the Ecosystem contains. 

***4.1 Torrent-Discovery Sites***:


For insight into the Ecosystem's torrent-discovery sites, refer to the pairwise comparison of the discovery sites in the image below.

<img src='./screenshots/pairwise-of-discovery-sites.png' width='' height=''>



***4.2 Tracker Statistics***:


***4.3 Peer Statistics***:


***4.4 Torrent Popularity versus Age***:


##### Section 5: Content Geography Classification for the torrents:

This section analyzes the content being distributed in the public (English-speaking) Ecosystem. 

Active torrents are classified into one of the following 9 categories:
- *Movies*: DVD movies, high-resolution movies, and documentaries
- *Music*: Music-related content, including: music videos, sound tracks, songs, albums, music covers, concerts, and discographies
- *TV/Radio Shows*: TV shows, radio shows, cartoons, and other shows
- *Applications*: Applications for Windows, MacOS, Linux, and handheld devices; operating system installers
- *Games*: Games for PC, Mac, PS3, Xbox360, Wii, and Mobile
- *Books*: Audio books, e-Books, comics, articles, magazines, manuals, etc.
- *Audio*: Content that could not be classified into any of the above categories but is known to be audio files
- *Video*: Content that could not be classified into any of the above categories but is known to be video files
- *Other*: Content that could not be classified into any of the above categories 

Classifying the torrents is a challenge, because the *metadata* available for many torrents isn't always conclusive. Simple *heuristics* for classification were adopted, relying on the fact that each of the five torrent-discovery sites provides some kind of categorical data. 

The categorical data from each of the five torrent-discovery sites is used to classify the torrents into one of the aforementioned categories. 

For torrents that are indexed by more than one site, a voting system was adopted: for each torrent, each site containing the torrent votes for one of the above categories based on the categorical information that it has for the torrent. The category that gets the most votes is considered the final category for the torrent. If there is a tie in the number of votes, the file extension of the torrent's file is used to classify it. A torrent is assigned to the *Other* category when none of these classifications applies.
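
The voting rule can be sketched in a few lines. This is only an illustration of the scheme as described above. The extension table is a small hypothetical stand-in (the notes do not list the actual extension mapping), and treating a missing vote as `None` is an assumption.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension table used only when the site votes are tied or absent.
EXTENSION_CATEGORY = {
    ".mp3": "Audio",
    ".flac": "Audio",
    ".avi": "Video",
    ".mkv": "Video",
}


def classify(site_votes, filename: str) -> str:
    """Pick a category by majority vote; fall back to the file extension on a tie."""
    counts = Counter(vote for vote in site_votes if vote)   # skip sites with no category
    if counts:
        ranked = counts.most_common()
        top = ranked[0][1]
        winners = [category for category, n in ranked if n == top]
        if len(winners) == 1:
            return winners[0]
    extension = PurePosixPath(filename).suffix.lower()
    return EXTENSION_CATEGORY.get(extension, "Other")
```

For example, votes of `["Movies", "Movies", "Music"]` yield `Movies`, while a tie between `Movies` and `Music` for a file named `x.mp3` falls through to the extension table and yields `Audio`.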


<img src='./screenshots/distributions-per-classification.png' width='' height=''>


Figure 2.1 shows the overall classification of 1.2 million active torrents, along with the number of peers participating in each category. The ratio of peers to torrents is larger for movies than for music. Although movies, music, and TV shows are the leading types, there is significant participation in books, games, and applications, indicating great diversity in the content being distributed by BitTorrent.

Figure 2.2 shows the number of peers per Internet user per country for three selected categories.

##### Section 6-8: The Pirate Bay and the BitTorrent Ecosystem:
---

The Pirate Bay.

### Conclusion & Acknowledgements:
---

It was found that the Ecosystem exhibits remarkable diversity in terms of the operation of the major torrent-discovery sites, user upload behavior, numbers of torrents and peers tracked by trackers, content type, and client implementations. 

The study also examined the role of The Pirate Bay in the Ecosystem, including an analysis of the extent to which DHTs can support the Ecosystem. 

The Ecosystem is by most measures the most successful open Internet application deployed, and it is of interest to communities including P2P researchers, ISP researchers, and copyright holders. 

The collected data have been anonymized and made publicly available to the research community. 