### Filter crawled results for English pages

We realised that the search results should be in English only, but we had already crawled more than 100K webpages. So we first filter the existing data and then resume crawling with a new frontier consisting of English URLs.

We delete non-English documents from the database and perform surgery on the crawl state so that we can continue with the new frontier.

In [56]:
import json
import pickle
import urllib.parse  # urlparse lives in the urllib.parse submodule, which must be imported explicitly

from rocksdict import Rdict, Options
from ftlangdetect import detect

In [10]:
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

--2024-06-25 18:14:19--  https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 2600:9000:223e:2e00:13:6e38:acc0:93a1, 2600:9000:223e:2600:13:6e38:acc0:93a1, 2600:9000:223e:d600:13:6e38:acc0:93a1, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2600:9000:223e:2e00:13:6e38:acc0:93a1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 131266198 (125M) [application/octet-stream]
Saving to: ‘lid.176.bin’


2024-06-25 18:14:20 (301 MB/s) - ‘lid.176.bin’ saved [131266198/131266198]



In [12]:
result = detect(text="what is a bibliothek?", low_memory=False)
result  # Good: an English sentence containing one German word is still detected as English.

{'lang': 'en', 'score': 0.7169032692909241}
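
The later cells replace newlines with spaces before calling `detect`, because the underlying fastText model predicts on one line at a time. A minimal sketch of a reusable check follows. The helper name `is_english`, the `min_score` threshold, and the stand-in `fake_detect` are assumptions for illustration and are not part of the notebook; in practice `detect_fn` would be `ftlangdetect.detect`.

```python
from typing import Callable, Dict


def is_english(text: str, detect_fn: Callable[..., Dict], min_score: float = 0.0) -> bool:
    """Return True if detect_fn labels the text as English with at least min_score."""
    # Collapse newlines and repeated whitespace into single spaces.
    cleaned = " ".join(text.split())
    if not cleaned:
        # Nothing to classify, so treat the document as non-English.
        return False
    result = detect_fn(text=cleaned, low_memory=False)
    return result["lang"] == "en" and result["score"] >= min_score


def fake_detect(text: str, low_memory: bool = False) -> Dict:
    """Hypothetical stand-in detector so the sketch runs without the fastText model."""
    lang = "en" if "the" in text.split() else "de"
    return {"lang": lang, "score": 0.9}


multi_line_ok = is_english("the\ncat", fake_detect)
empty_ok = is_english("  \n ", fake_detect)
german_ok = is_english("der hund", fake_detect)
strict_ok = is_english("the cat", fake_detect, min_score=0.95)
```

A non-zero `min_score` would let low-confidence detections be treated as non-English; the notebook itself accepts any score.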

#### Check frontier snippets to ensure they are in English

In [30]:
# Go through the search results and detect the language of each snippet,
# then print the percentage of snippets that are in English.
with open('../data/search_results.pkl', 'rb') as f:
    search_results = list(pickle.load(f))

print(f"We have {len(search_results)} Results")

english_snippets = 0
total_snippets = 0

for result in search_results:
    for snippet in result[1]:
        total_snippets += 1
        language = detect(snippet['snippet'], low_memory=False)['lang']
        if language == 'en':
            english_snippets += 1
print(f"Percentage of English snippets: {english_snippets / total_snippets * 100:.2f}%")

We have 4715 Results
Percentage of English snippets: 75.27%


In [31]:
# count the number of URLs in result set
all_results = sum([x[1] for x in search_results], [])

print("Total number of URLs in search results: ", len(all_results))
print("Number of unique URLs in search results: ", len(set([x['url'] for x in all_results])))

Total number of URLs in search results:  107566
Number of unique URLs in search results:  50311
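
Flattening with `sum(..., [])` copies the growing list on every step, which becomes slow for many result lists. A sketch of the same count using `itertools.chain` is shown here, assuming (as in the cell above) that each entry holds its list of hits at index 1 and each hit is a dict with a `'url'` key. The function name `count_urls` and the toy data are hypothetical.

```python
from itertools import chain


def count_urls(search_results):
    """Return (total number of hits, number of unique URLs)."""
    # entry[1] is assumed to be the list of hits for one query.
    all_hits = list(chain.from_iterable(entry[1] for entry in search_results))
    unique_urls = {hit["url"] for hit in all_hits}
    return len(all_hits), len(unique_urls)


# Toy data standing in for the contents of search_results.pkl.
toy_results = [
    ("query one", [{"url": "a"}, {"url": "b"}]),
    ("query two", [{"url": "a"}]),
]
total, unique = count_urls(toy_results)
```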


In [32]:
db = Rdict('../data/crawl_data')

In [36]:
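# First pass: count how many stored documents are detected as English.
# fastText predicts on a single line, so newlines are replaced with spaces.
count = 0
count_english = 0
for key, value in db.items():
    if detect(value.replace('\n', ' '), low_memory=False)['lang'] == 'en':
        count_english += 1
    count += 1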

print("Number of unique URLs in database: ", count)
print("Number of unique URLs in database with English text: ", count_english)

Number of unique URLs in database:  203701
Number of unique URLs in database with English text:  25264


In [38]:
# Second pass: delete every document that is not detected as English
# and remember its URL.
rejected_urls = []
for key, value in db.items():
    if detect(value.replace('\n', ' '), low_memory=False)['lang'] != 'en':
        rejected_urls.append(key)
        del db[key]

print("Number of rejected URLs: ", len(rejected_urls))

Number of rejected URLs:  178437


TypeError: object of type 'builtins.Rdict' has no len()
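
The cell above deletes keys while iterating over the store, and the `TypeError` shows that `Rdict` does not support `len()`, so counts have to be tracked by hand. A more conservative sketch splits the work into two phases: classify every key first, then delete. It is written against a plain `dict` as a stand-in for the database; the function name `purge_non_english` and the toy predicate are assumptions for illustration.

```python
def purge_non_english(store, is_english):
    """Delete non-English documents; return (rejected keys, kept keys)."""
    rejected, kept = [], []
    # Phase 1: classify every document without modifying the store.
    for key, value in store.items():
        if is_english(value):
            kept.append(key)
        else:
            rejected.append(key)
    # Phase 2: delete only after iteration has finished.
    for key in rejected:
        del store[key]
    return rejected, kept


# Toy store and predicate standing in for the crawl database and detect().
toy_store = {"u1": "hello world", "u2": "hallo welt"}
rejected, kept = purge_non_english(toy_store, lambda text: "hello" in text)
```

Returning the kept keys in the same pass would also avoid running the language detector over the whole database a third time just to collect the surviving URLs.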

In [40]:
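# Verify the deletion: every remaining document should now be English.
# Collect the surviving URLs for the crawl-state update.
count = 0
count_english = 0
remaining_urls = []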
for key, value in db.items():
    if detect(value.replace('\n', ' '), low_memory=False)['lang'] == 'en':
        remaining_urls.append(key)
        count_english += 1
    count += 1

print("Number of unique URLs in database: ", count)
print("Number of unique URLs in database with English text: ", count_english)

Number of unique URLs in database:  25264
Number of unique URLs in database with English text:  25264


In [67]:
with open('../data/crawl_state.pkl', 'rb') as f:
    crawl_state = pickle.load(f)

with open('../data/frontier_urls.pkl', 'rb') as f:
    frontier_urls = pickle.load(f)

print("Number of URLs in frontier: ", len(frontier_urls))

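# Keep only the rejected and visited URLs whose documents survived the filter.
crawl_state['rejected'] = set(remaining_urls).intersection(set(crawl_state['rejected']))
crawl_state['all_discovered_urls'] = set(remaining_urls).union(crawl_state['rejected'])
# Rebuild the frontier from the new URLs as (url, 7, domain) tuples.
crawl_state['frontier'] = [(x, 7, urllib.parse.urlparse(x).netloc) for x in frontier_urls]
crawl_state['visited'] = set(remaining_urls).intersection(set(crawl_state['visited']))
# Reset the transient sets so the crawler starts from a clean slate.
crawl_state['failed'] = set()
crawl_state['to_visit'] = set()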
print(crawl_state.keys())

Number of URLs in frontier:  51713
dict_keys(['frontier', 'visited', 'failed', 'rejected', 'last_saved', 'to_visit', 'all_discovered_urls'])


In [68]:
for key in ['rejected', 'all_discovered_urls', 'frontier', 'visited', 'failed', 'to_visit']:
    print(f"Number of URLs in {key}: ", len(crawl_state[key]))

Number of URLs in rejected:  18310
Number of URLs in all_discovered_urls:  25264
Number of URLs in frontier:  51713
Number of URLs in visited:  6967
Number of URLs in failed:  0
Number of URLs in to_visit:  0
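
The counts above add up to 18310 + 6967 = 25277, which is 13 more than the 25264 discovered URLs, so at least 13 URLs must be in both `visited` and `rejected`. A small consistency report, sketched here, can surface such overlaps before the state is written back. The key names come from the notebook; the function name `summarize_state` and the toy state are hypothetical.

```python
def summarize_state(state):
    """Count overlaps and gaps between the sets in a crawl state dict."""
    # Frontier entries are tuples whose first element is the URL.
    frontier_urls = {entry[0] for entry in state["frontier"]}
    discovered = state["all_discovered_urls"]
    return {
        "visited_not_discovered": len(state["visited"] - discovered),
        "rejected_not_discovered": len(state["rejected"] - discovered),
        "visited_and_rejected": len(state["visited"] & state["rejected"]),
        "frontier_already_visited": len(frontier_urls & state["visited"]),
    }


# Toy state with one URL in both visited and rejected,
# and one frontier URL that was already visited.
toy_state = {
    "frontier": [("http://a.example/x", 7, "a.example"),
                 ("http://b.example/y", 7, "b.example")],
    "visited": {"http://a.example/x", "http://c.example/"},
    "rejected": {"http://c.example/", "http://d.example/"},
    "all_discovered_urls": {"http://a.example/x", "http://c.example/",
                            "http://d.example/"},
}
report = summarize_state(toy_state)
```

Whether the crawler expects frontier URLs to also appear in `all_discovered_urls` is not stated in the notebook, so the report only counts overlaps and leaves the interpretation to the reader.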


In [69]:
with open('../data/crawl_state.pkl', 'wb') as f:
    pickle.dump(crawl_state, f)
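
Overwriting `crawl_state.pkl` in place risks a truncated file if the write is interrupted. One possible safeguard, sketched here, is to write to a temporary file in the same directory and then swap it in with `os.replace`. The helper name `save_pickle_atomically` is an assumption, and the demo writes to a throwaway directory rather than `../data`.

```python
import os
import pickle
import tempfile


def save_pickle_atomically(obj, path):
    """Pickle obj to a temp file next to path, then replace path in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        # os.replace overwrites the destination if it already exists.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file and leave the old state untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a temporary directory with a toy state.
demo_state = {"visited": {"http://a.example/"}, "failed": set()}
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "crawl_state.pkl")
    save_pickle_atomically(demo_state, demo_path)
    with open(demo_path, "rb") as f:
        loaded_state = pickle.load(f)
    leftover_files = sorted(os.listdir(demo_dir))
```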

In [55]:
db.close()