Hello everyone! Below I walk through the whole process I used to filter the collected URLs by hand, keeping only those that promised text data from forum discussions, which is the main purpose of this task.

Of course, it's possible to create separate URL sets for the shop or the other sites included here. However, that's not the main concern.

In [None]:
import pandas as pd

# Read the URLs from a .txt file
with open('kfd.txt', 'r', encoding='utf-8') as f:
    txt = f.read().split()

# Load URLs to Pandas DataFrame
df = pd.DataFrame(columns=['urls'],
                  data=txt)
print(len(df))

Note: just to let you know, the original input used for this analysis contained approx. 7.3M different URLs. That's the starting point; let's see where we end up.

Here's where the main part of the work happens.

**CAUTION: Please be aware that the prefixes/suffixes will differ depending on the website you're currently processing.**

My workflow was:
- scroll through the txt file to find an interesting URL (e.g. one containing 'sklep.kfd')
- add another exclusion to the set below
- run the cell below to find how many URLs are affected
- rinse and repeat until satisfied
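
The loop above can be sketched as a small helper that reports how many URLs contain a candidate substring before it is added to the exclusion set. This is only an illustration: `count_matches` and the sample URLs are hypothetical and not part of the original notebook.

```python
import pandas as pd

def count_matches(urls: pd.Series, pattern: str) -> int:
    """Return how many URLs contain `pattern` as a literal substring."""
    # regex=False treats characters such as '.' literally
    return int(urls.str.contains(pattern, regex=False).sum())

# Made-up sample; in the notebook this would be df['urls']
sample = pd.Series([
    'https://www.kfd.pl/topic/1-example/',
    'https://sklep.kfd.pl/product/2',
    'https://www.kfd.pl/topic/1-example/?view=getlastpost',
])

for candidate in ['sklep.kfd', 'getlastpost']:
    print(candidate, count_matches(sample, candidate))
```

Checking the count first makes it easier to spot a pattern that is too broad and would drop far more URLs than intended.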

In [36]:
# We're interested only in forum texts, therefore 'akademia.kfd' website should be dropped
df['akademia-kfd'] = df['urls'].apply(lambda x: 'akademia.kfd' in x)

# If something is not a 'kfd' website we also dump it
df['non-kfd'] = df['urls'].apply(lambda x: 'kfd' not in x)

# 'getlastpost' directs us to a particular post of a user, not a whole topic
df['getlastpost'] = df['urls'].apply(lambda x: 'getlastpost' in x)

# Some URLs were concatenated the wrong way (e.g. a second 'http...' directly after 'http://')
df['multiple-http'] = df['urls'].apply(lambda x: 'http://h' in x)

# Another part of URL leading to nowhere interesting
df['/s/'] = df['urls'].apply(lambda x: '/s/' in x)

# 'sklep.kfd' leads to the shop, which we're not interested in
df['sklep-kfd'] = df['urls'].apply(lambda x: 'sklep.kfd' in x)

# Some JS/PHP component, not useful
df['component'] = df['urls'].apply(lambda x: 'component' in x)

# Search results, same outcome as 'getlastpost' above
df['find-post'] = df['urls'].apply(lambda x: 'view=findpost' in x)

# Again, a concatenation mistake from the crawler
df['...'] = df['urls'].apply(lambda x: '...' in x)

# These two always lead to the same version of website, params changed nothing
df['langid'] = df['urls'].apply(lambda x: 'langid=' in x)
df['setlanguage'] = df['urls'].apply(lambda x: 'setlanguage=' in x)

# Returns topic search results, not the content of topics
df['sort_by'] = df['urls'].apply(lambda x: 'sort_by=' in x)

# Attachments are not texts, so ditch them too
df['attach_id'] = df['urls'].apply(lambda x: 'attach_id=' in x)

# Again, similar to 'sort_by'
df['search_app_filters'] = df['urls'].apply(lambda x: 'search_app_filters' in x)

# Leads to an external portal that no longer exists
df['nasza_klasa'] = df['urls'].apply(lambda x: 'nasza_klasa' in x)

# Yet another concatenation mistake
df['//'] = df['urls'].apply(lambda x: x.startswith('//'))

# Again - something useless
df['page__pid'] = df['urls'].apply(lambda x: 'page__pid' in x)

7271459
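
Since most of the rules above are plain substring checks, they could also be driven from a single dictionary. The sketch below is one possible arrangement, not the notebook's actual code: `SUBSTRING_RULES`, `add_flags` and the demo URLs are hypothetical, and only a subset of the rules is listed.

```python
import pandas as pd

# Column name -> literal substring; a subset of the rules above
SUBSTRING_RULES = {
    'akademia-kfd': 'akademia.kfd',
    'getlastpost': 'getlastpost',
    'sklep-kfd': 'sklep.kfd',
    'langid': 'langid=',
}

def add_flags(frame: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Add one boolean column per rule; True means the URL should be dropped."""
    out = frame.copy()
    for column, needle in rules.items():
        out[column] = out['urls'].str.contains(needle, regex=False)
    # Rules that are not plain "substring present" checks stay explicit
    out['non-kfd'] = ~out['urls'].str.contains('kfd', regex=False)
    out['//'] = out['urls'].str.startswith('//')
    return out

# Made-up URLs for demonstration only
demo = pd.DataFrame({'urls': [
    'https://www.kfd.pl/topic/1-example/',
    'https://sklep.kfd.pl/product/2',
    '//example.com/page',
]})
flagged = add_flags(demo, SUBSTRING_RULES)
print(flagged.drop(columns='urls').sum())
```

With this layout, adding another exclusion is a one-line change to the dictionary, and the vectorized `.str` methods avoid calling a Python lambda for every row.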


In [34]:
# A simple matrix showing how many URLs fall into each
# combination of exclusion flags
df[['akademia-kfd', 'getlastpost', 'non-kfd', 'multiple-http',
    '/s/', 'sklep-kfd', 'component', 'find-post', '...',
    'langid', 'sort_by', 'attach_id', 'search_app_filters',
    'nasza_klasa', 'setlanguage', '//', 'page__pid']].value_counts()

akademia-kfd  getlastpost  non-kfd  multiple-http  /s/    sklep-kfd  component  find-post  ...    langid  sort_by  attach_id  search_app_filters  nasza_klasa  setlanguage  //     page__pid
False         False        False    False          False  False      False      False      False  True    False    False      False               False        True         False  False        2145787
                                                                                                  False   False    False      True                False        False        False  False        1973045
                                                                                                                              False               True         False        False  False        1238751
                                                                                                                                                  False        False        False  False         779911
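
`value_counts()` on several columns counts each combination of flags. To see how many URLs a single rule matches regardless of the other rules, the boolean columns can be summed instead. A minimal sketch with made-up flags (the `flags` frame below is hypothetical, not real data from the crawl):

```python
import pandas as pd

# Hypothetical flags for four URLs; True means the rule matched
flags = pd.DataFrame({
    'langid':      [True, False, False, True],
    'setlanguage': [True, False, False, False],
    'attach_id':   [False, True, False, False],
})

# Per-rule totals: True counts as 1 when summed
per_rule = flags.sum().sort_values(ascending=False)
print(per_rule)

# URLs matched by no rule at all, i.e. the ones that would be kept
kept = int((~flags.any(axis=1)).sum())
print('kept:', kept)
```

Note that a URL can match more than one rule, so the per-rule totals do not have to add up to the number of dropped URLs.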
           

In [37]:
# New DataFrame with only the URLs we'd like to inspect closely,
# i.e. rows where none of the exclusion flags is True
exclusion_cols = ['akademia-kfd', 'getlastpost', 'non-kfd', 'multiple-http',
                  '/s/', 'sklep-kfd', 'component', 'find-post', '...',
                  'langid', 'sort_by', 'attach_id', 'search_app_filters',
                  'nasza_klasa', 'setlanguage', '//', 'page__pid']
df_v1 = df[~df[exclusion_cols].any(axis=1)]

len(df_v1)

779911

As you can see, from 7.3M input URLs I've ended up with about 0.78M, which is about 11% of all crawled URLs. The whole analysis above took me about 45 minutes, but it saved at least a full day of processing empty or useless links that would either only waste time (attachments, sort results, other websites like `nasza_klasa`) or duplicate entries in the dataset (`langid`, `setlanguage` or `getlastpost`).

In [38]:
# Save the remaining URLs to a new file, ready to serve as input to another tool
with open('kfd-cleared.txt', 'w', encoding='utf-8') as f:
    for url in df_v1['urls']:
        f.write(url + '\n')
    print('File written successfully!')

# The `with` block closes the file automatically, so no explicit f.close() is needed
print('File closed')

File written successfully!
File closed


**NOTE:** After the final processing with `article_crawler.py`, the tool managed to collect 0.33M documents. That's about 40% of the cleared and processed URLs (and roughly 4.5% of the whole input set of URLs). This is mostly because some faulty URLs containing no data slipped under our radar, and because of the settings of the tool used (e.g. minimal text length).
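
The percentages quoted in this notebook can be re-derived with a few lines of arithmetic. The two URL counts come from the cell outputs above; `documents` is only the rounded 0.33M figure, since the exact count is not given, and the `share` helper is hypothetical.

```python
# Counts from the notebook outputs
total_urls = 7_271_459
kept_urls = 779_911
# Approximate: the text only states "0.33M documents"
documents = 330_000

def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

print('kept after filtering:', share(kept_urls, total_urls), '%')
print('documents per kept URL:', share(documents, kept_urls), '%')
print('documents per input URL:', share(documents, total_urls), '%')
```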