# Loading data
- I started with several separate datasets, each listing only ad servers or only non-ad servers, and combined them into a single dataset, 'all.csv'.
- Because 'all.csv' contained overlapping entries, I dropped the duplicates and saved the result as a new file, 'all-without-duplicates.csv'.
```
df = pd.read_csv("../lists/all.csv", converters={'domain': convert_dtype, 'class': convert_dtype})
df = df.drop_duplicates()
df.to_csv('../lists/all-without-duplicates.csv')
```
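
The combining step itself is not shown above. A minimal sketch of how it might look, assuming every source list shares the same `url` and `class` columns; the two small frames below, their contents, and their labels are made-up stand-ins, not the real lists:

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the source lists (contents and labels are made up)
list_a = pd.DataFrame({'url': ['example.com', 'ads.example.net'], 'class': [1, 0]})
list_b = pd.DataFrame({'url': ['ads.example.net', 'example.org'], 'class': [0, 1]})

# Stack the lists, then drop rows that are exact duplicates
combined = pd.concat([list_a, list_b], ignore_index=True)
deduped = combined.drop_duplicates().reset_index(drop=True)

# Round-trip through CSV (an in-memory buffer here instead of a file on disk)
buffer = io.StringIO()
deduped.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

Two details are worth keeping in mind. `drop_duplicates()` compares whole rows, so a URL that appears with two different class labels would survive twice; `drop_duplicates(subset=['url'])` would keep only its first occurrence. Passing `index=False` to `to_csv` is optional, but without it pandas writes the row index as an extra unnamed column, which is read back as 'Unnamed: 0'.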


In [16]:
import pandas as pd
import re
import traceback

# Convert every value to str to avoid pandas' DtypeWarning (columns with mixed types)
# https://www.roelpeters.be/solved-dtypewarning-columns-have-mixed-types-specify-dtype-option-on-import-or-set-low-memory-in-pandas/
def convert_dtype(x):
    # Empty or missing values become an empty string
    if not x:
        return ''
    try:
        return str(x)
    except Exception:
        return ''

# Load the deduplicated dataset into a pandas DataFrame
df = pd.read_csv("../lists/all-without-duplicates.csv", converters={'domain': convert_dtype, 'class': convert_dtype})
# One-off deduplication step that produced the file above (kept for reference):
#df = pd.read_csv("../lists/all.csv", converters={'domain': convert_dtype, 'class': convert_dtype})
#df = df.drop_duplicates()
#df.to_csv('../lists/all-without-duplicates.csv')
df

| Unnamed: 0 | url | class |
| --- | --- | --- |
| 0 | google.com | 1 |
| 1 | youtube.com | 1 |
| 2 | facebook.com | 1 |
| 3 | amazonaws.com | 1 |
| 4 | netflix.com | 1 |
| ... | ... | ... |
| 1474709 | slview.psne.jp | 0 |
| 1474710 | x.vipergirls.to | 0 |
| 1474711 | x0r.urlgalleries.net | 0 |
| 1474712 | yotta.scrolller.com | 0 |


# Preprocessing and feature extraction
This block preprocesses the dataset and extracts the following features from each URL:
- has_ad: 1 if the URL contains 'ad' or 'ads' as a standalone word, otherwise 0
- is_subdomain: 1 if the URL starts with the 'www.' subdomain, otherwise 0
- num_dots: number of dots in the URL, not counting the dot after 'www' if present
- num_hyphens: number of hyphens in the URL
- num_digits: number of digits in the URL


In [22]:
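# Define the patterns used for feature extraction
ad_pattern = r'\b(ad|ads)\b'   # 'ad' or 'ads' as a standalone word
subdomain_pattern = r'^www\.'  # leading 'www.' subdomain
dot_pattern = r'.'             # literal dot, passed to str.count (not used as a regex)
hyphen_pattern = r'-'          # literal hyphen, passed to str.count
digit_pattern = r'\d'          # any single digit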

# Define the batch size and the input/output file paths
batch_size = 10000
input_file = '../lists/all-without-duplicates.csv'
output_file = '../lists/preprocessed.csv'

# Open the input and output files
with open(input_file, 'r') as f_in, open(output_file, 'w') as f_out:
    # Read the CSV file in chunks
    for chunk in pd.read_csv(f_in, chunksize=batch_size):
        # Preprocess the URLs in the current chunk
        for url in chunk['url']:
            has_ad = int(bool(re.search(ad_pattern, url)))
            is_subdomain = int(bool(re.search(subdomain_pattern, url)))
            # Count literal dots, ignoring the one that follows a leading 'www'
            num_dots = url.count(dot_pattern) - is_subdomain
            num_hyphens = url.count(hyphen_pattern)
            num_digits = len(re.findall(digit_pattern, url))

            # Write the preprocessed features to the output file
            f_out.write(f'{url},{has_ad},{is_subdomain},{num_dots},{num_hyphens},{num_digits}\n')
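
To sanity-check the features computed above, the same logic can be wrapped in a function and run on a few hostnames. This is only a sketch: the `extract_features` helper and the example hostnames are hypothetical and not part of the original notebook.

```python
import re

AD_PATTERN = re.compile(r'\b(ad|ads)\b')  # 'ad' or 'ads' as a standalone word
WWW_PREFIX = re.compile(r'^www\.')        # leading 'www.' subdomain

def extract_features(url):
    """Return the five features for a single hostname as a dict."""
    is_subdomain = int(bool(WWW_PREFIX.search(url)))
    return {
        'has_ad': int(bool(AD_PATTERN.search(url))),
        'is_subdomain': is_subdomain,
        # Do not count the dot that follows a leading 'www'
        'num_dots': url.count('.') - is_subdomain,
        'num_hyphens': url.count('-'),
        'num_digits': len(re.findall(r'\d', url)),
    }

# Hypothetical hostnames chosen to exercise each feature
for host in ['www.ads.example.com', 'adserver.example.com', 'ad-2.example.net']:
    print(host, extract_features(host))
```

Note that `\b` only matches at a word boundary, so 'adserver.example.com' gets has_ad = 0: the 'ad' there is followed directly by another letter. Dots and hyphens count as boundaries, which is why 'ads.example.com' and 'ad-2.example.net' both match.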
