#### Assignment:
Cleaning the data of search-engine robots:

- identify robots based on the User-Agent (Agent) field - keywords bot, crawl, spider.

- delete the accesses of search-engine robots based on accesses from their IP addresses and on keywords in User-Agent.


Submit the program source code + the cleaned log file (in a single zip).

Python commands suitable for solving this assignment:

`import pandas as pd
pd.read_csv() - loading the log file`
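
The keyword-based identification from the first bullet can be sketched on a toy table as follows. The IP addresses and User-Agent strings here are made up for illustration, and only the column names mirror the log file used in this notebook; this is a minimal sketch of the keyword idea, not the cleaning that the cells below perform.

```python
import pandas as pd

# Toy data (hypothetical values); only the column names mirror the real log.
sample = pd.DataFrame({
    "Remote Host": ["192.0.2.1", "192.0.2.2", "192.0.2.3"],
    "User-Agent": [
        "ExampleBot/1.0 (+http://example.com/bot)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/74.0",
        "example-spider/0.3",
    ],
})

# Case-insensitive search for any of the three keywords; missing values count as "not a robot".
keywords = r"bot|crawl|spider"
is_robot = sample["User-Agent"].str.contains(keywords, case=False, regex=True, na=False)

# Keep only the rows that were not flagged.
humans_only = sample[~is_robot]
print(humans_only)
```

A plain keyword match like this is easy to read but coarse: it flags any User-Agent that happens to contain one of the three substrings.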

In [1]:
import pandas as pd
import json
import re

In [2]:
# open the log file we will be working with and load it into log_data_table for later use
log_data_table = pd.read_csv("2_Čistenie_dát_II.csv",  encoding="utf-8",  escapechar=";", index_col=False)

In [3]:
# load the list of IP addresses that accessed robots.txt, which we prepared in the previous exercise
list_of_bots = pd.read_csv("3_Čistenie_dát_III.csv", index_col = False)

In [4]:
log_data_table["Remote Host"]

0           13.66.139.0
1        216.244.66.230
2          54.36.148.92
3         92.101.35.224
4         54.36.148.108
              ...      
62394    37.115.207.251
62395    37.115.207.251
62396       62.138.6.15
62397       62.138.3.52
62398       62.138.6.15
Name: Remote Host, Length: 62399, dtype: object

In [5]:
# for every bot IP address (one that accessed robots.txt, collected in the previous exercise)
# find the rows whose 'Remote Host' column matches it and record their indexes
index_names = []
for bot in list_of_bots['Remote Host']:
    index_names_tmp = log_data_table[log_data_table['Remote Host'] == bot].index
    index_names.extend(index_names_tmp)
# drop the rows with the collected indexes from the DataFrame
log_data_table.drop(index_names, inplace =True)
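
The same IP-based removal can probably be written without an explicit loop by using `Series.isin`. The sketch below uses two small hypothetical frames (`log` and `bots`, with documentation-range IP addresses) in place of `log_data_table` and `list_of_bots`; it assumes both share a `Remote Host` column, as they do above.

```python
import pandas as pd

# Hypothetical stand-ins for log_data_table and list_of_bots.
log = pd.DataFrame({"Remote Host": ["192.0.2.1", "192.0.2.2", "192.0.2.1", "192.0.2.3"]})
bots = pd.DataFrame({"Remote Host": ["192.0.2.1"]})

# True for every log row whose IP address appears in the bot list.
from_bot_ip = log["Remote Host"].isin(bots["Remote Host"])

# Keep the remaining rows; the original index labels are preserved.
cleaned = log[~from_bot_ip]
print(cleaned)
```

Unlike `drop(..., inplace=True)`, this returns a new DataFrame and leaves `log` untouched, which makes it easier to compare row counts before and after.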

In [6]:
# now we can move on to the second step: removing bots based on the User-Agent.
# for a very shallow cleaning it is enough to filter on keywords such as "crawl", "bot", "spider",
# but not all bots identify themselves like this
# sometimes they try to masquerade as a real user, or they are associated with a specific research agency
# therefore in our example we rely on the work of fellow community members
# who created a more exhaustive list of bots


In [7]:
# load the COUNTER robots JSON file
# ref: https://github.com/atmire/COUNTER-Robots
# ref: https://www.projectcounter.org/
bots_not_allowed_patterns = []
# a forward slash avoids the invalid "\C" escape sequence and also works outside Windows
with open("COUNTER-Robots/COUNTER_Robots_list.json", "r", encoding="utf8") as bot_pattern_file:
    bots_data = json.load(bot_pattern_file)
    for data in bots_data:
        bots_not_allowed_patterns.append(data["pattern"])

# display the number of bot patterns in the list
len(bots_not_allowed_patterns)

301

In [8]:
bots_not_allowed_patterns

['bot',
 '^Buck\\/[0-9]',
 'spider',
 'crawl',
 '^.?$',
 '[^a]fish',
 '^IDA$',
 '^ruby$',
 '^@ozilla\\/\\d',
 '^脝脝陆芒潞贸碌脛$',
 '^破解后的$',
 'AddThis',
 'A6-Indexer',
 'ADmantX',
 'alexa',
 'Alexandria(\\s|\\+)prototype(\\s|\\+)project',
 'AllenTrack',
 'almaden',
 'appie',
 'API[\\+\\s]scraper',
 'Arachni',
 'Arachmo',
 'architext',
 'ArchiveTeam',
 'aria2\\/\\d',
 'arks',
 '^Array$',
 'asterias',
 'atomz',
 'BDFetch',
 'Betsie',
 'baidu',
 'biglotron',
 'BingPreview',
 'binlar',
 'bjaaland',
 'Blackboard[\\+\\s]Safeassign',
 'blaiz-bee',
 'bloglines',
 'blogpulse',
 'boitho\\.com-dc',
 'bookmark-manager',
 'Brutus\\/AET',
 'BUbiNG',
 'bwh3_user_agent',
 'CakePHP',
 'celestial',
 'cfnetwork',
 'checklink',
 'checkprivacy',
 'China\\sLocal\\sBrowse\\s2\\.6',
 'Citoid',
 'cloakDetect',
 'coccoc\\/1\\.0',
 'Code\\sSample\\sWeb\\sClient',
 'ColdFusion',
 'collection@infegy.com',
 'com\\.plumanalytics',
 'combine',
 'contentmatch',
 'ContentSmartz',
 'convera',
 'core',
 'Cortana',
 'CoverScout

In [9]:
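def is_it_bot(logline, bots_not_allowed_patterns):
    """Return True if logline matches any of the bot regex patterns (case-insensitive)."""
    # a missing value cannot match anything; this guard is an alternative to a try/except around re.search
    if logline is None:
        return False

    for bot_pattern in bots_not_allowed_patterns:
        if re.search(bot_pattern, logline, re.IGNORECASE):
            return True

    return False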
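
With several hundred patterns, one possible optimisation is to join them into a single compiled regular expression so that each User-Agent is scanned once. The sketch below uses only the first five patterns shown in the output above; the names `combined` and `is_it_bot_fast` are hypothetical. It assumes that no pattern relies on numbered backreferences or inline flags, which has not been checked against the full list.

```python
import re

# First five patterns from the list above, as a small stand-in for the full list.
patterns = ["bot", r"^Buck\/[0-9]", "spider", "crawl", r"^.?$"]

# Wrap each pattern in a non-capturing group so anchors stay local to their alternative.
combined = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)

def is_it_bot_fast(user_agent, compiled=combined):
    """Return True if user_agent matches at least one of the joined patterns."""
    if user_agent is None:
        return False
    return compiled.search(user_agent) is not None

print(is_it_bot_fast("Googlebot/2.1"))
print(is_it_bot_fast("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/74.0"))
```

Note that the pattern `^.?$` flags empty and single-character User-Agent values such as `-`, which is why those count as robots here.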

In [10]:
# check each "User-Agent" value against the COUNTER robots patterns
# from Project COUNTER

# we continue with the data already cleaned by IP address, so that the bot cleaning is more complete

# first clear the temporary list of row indexes that we filled in the IP-based step
index_names.clear()

# test each distinct User-Agent only once; rows with a missing User-Agent never compare equal, so they are kept
for agent in log_data_table["User-Agent"].unique():

    if is_it_bot(str(agent), bots_not_allowed_patterns):
        index_names_tmp = log_data_table[log_data_table['User-Agent'] == agent].index
        index_names.extend(index_names_tmp)

log_data_table.drop(set(index_names), inplace=True)
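
The same filtering can also be expressed as a boolean mask instead of a list of collected indexes. The sketch below is one way to do that under stated assumptions: `log`, `patterns`, `matches_any` and `verdict` are hypothetical names, the three patterns are a tiny stand-in for the COUNTER list, and the User-Agent strings are invented. Rows with a missing User-Agent are kept, as in the cell above.

```python
import re
import pandas as pd

patterns = ["bot", "spider", "crawl"]  # tiny stand-in for the COUNTER list

def matches_any(user_agent, patterns):
    """Return True if any pattern matches the User-Agent (case-insensitive)."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)

log = pd.DataFrame({
    "User-Agent": [
        "ExampleBot/1.0",
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/74.0",
        "ExampleBot/1.0",
        None,
    ]
})

# Classify each distinct, non-missing User-Agent string once.
verdict = {ua: matches_any(ua, patterns) for ua in log["User-Agent"].dropna().unique()}

# Look the verdict up per row; values without a verdict (missing ones) count as "not a bot".
is_bot_row = log["User-Agent"].map(lambda ua: verdict.get(ua, False)).astype(bool)

cleaned = log[~is_bot_row]
print(len(cleaned))
```

Keeping the mask in its own variable makes it simple to inspect `is_bot_row.sum()` before committing to the removal.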

In [11]:
len(log_data_table)

37025

In [12]:
# export the cleaned log data to a CSV file
log_data_table.to_csv("4_Čistenie_dát_IV.csv", encoding="utf-8",  escapechar=";", index=False)