# Download and prepare the AOL search dataset for testing

The [AOL search dataset](http://www.cim.mcgill.ca/~dudek/206/Logs/AOL-user-ct-collection/) lists ~36M search queries issued to the AOL search engine in 2006, along with the clicked result URLs. Each record includes an anonymized user ID, a timestamp, the query text and, where a result was clicked, its rank and URL.

Although outdated, it's one of the main sources of public search query data available. We download and extract a subset for testing sanitization methods.

In [17]:
import pandas as pd
# from tldextract import extract as tldextract

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.show_dimensions", True)

In [18]:
# Download the first slice, ~200MB
AOL_DATA_SOURCE = "http://www.cim.mcgill.ca/~dudek/206/Logs/AOL-user-ct-collection/user-ct-test-collection-01.txt"

QUERY_CSV = "../assets/aol_queries.csv.gz"

Load the dataset:

In [19]:
df = pd.read_csv(AOL_DATA_SOURCE, sep="\t")

In [12]:
df

Unnamed: 0,AnonID,Query,QueryTime,ItemRank,ClickURL
0,142,rentdirect.com,2006-03-01 07:17:12,,
1,142,www.prescriptionfortime.com,2006-03-12 12:31:06,,
2,142,staple.com,2006-03-17 21:19:29,,
3,142,staple.com,2006-03-17 21:19:45,,
4,142,www.newyorklawyersite.com,2006-03-18 08:02:58,,
...,...,...,...,...,...
3558406,24968114,-,2006-05-31 01:04:20,,
3558407,24969251,sp.trafficmarketplace.com,2006-05-31 15:51:23,,
3558408,24969374,orioles tickets,2006-05-31 12:24:51,,
3558409,24969374,orioles tickets,2006-05-31 12:31:57,2.0,http://www.greatseats.com


Distribution of the number of queries per user:

In [35]:
df["AnonID"].value_counts().describe()

count    65516.000000
mean        54.313618
std        123.377710
min          1.000000
25%          5.000000
50%         17.000000
75%         52.000000
max       3755.000000
Name: AnonID, Length: 8, dtype: float64

For sanitization, it is enough to work with the unique queries.

In [65]:
queries = (
    df["Query"]
    .str.strip()
    .sort_values(ignore_index=True)
    .drop_duplicates()
    .dropna()
    .rename("query")
)
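To make the order of operations concrete, here is a minimal sketch of the same chain on a toy series. The toy values are made up for illustration (two of the strings echo queries seen above). Stripping happens before deduplication so that whitespace variants collapse into one entry, and missing values are dropped at the end.

```python
import pandas as pd

# Toy input, made up for illustration: a whitespace variant and a missing value.
toy = pd.Series(
    ["staple.com", " staple.com ", "orioles tickets", None, "-"],
    name="Query",
)

unique = (
    toy.str.strip()                    # " staple.com " -> "staple.com"
    .sort_values(ignore_index=True)    # missing values sort last by default
    .drop_duplicates()                 # keep the first of each distinct value
    .dropna()                          # remove the missing entry
    .rename("query")
)

print(unique.tolist())
```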

In [46]:
print(f"Num unique queries: {len(queries):,}")

Num unique queries: 1,216,652


Exclude queries that don't have any alphanumeric characters:

In [68]:
# Raw string avoids Python's invalid-escape warning for "\W"
no_alphanum = queries.str.fullmatch(r"\W+")
no_alphanum.sum()

68

In [69]:
queries = queries[~no_alphanum]
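As a rough illustration of what the `\W+` filter does and does not catch, the sketch below applies the same pattern with the standard `re` module to a few made-up strings (not drawn from the dataset). One caveat: `\W` treats the underscore as a word character, so a query made only of underscores would pass the filter. The stricter `[\W_]+` pattern is shown as a possible alternative, not as what this notebook uses.

```python
import re

# Hypothetical example strings, not taken from the AOL data.
samples = ["-", "...", "?!", "___", "c++", "a"]

NO_ALNUM = re.compile(r"\W+")            # pattern used in the notebook
NO_ALNUM_STRICT = re.compile(r"[\W_]+")  # alternative that also rejects "_"

flags = {s: bool(NO_ALNUM.fullmatch(s)) for s in samples}
flags_strict = {s: bool(NO_ALNUM_STRICT.fullmatch(s)) for s in samples}

for s in samples:
    print(f"{s!r:8} no_alnum={flags[s]}  strict={flags_strict[s]}")
```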

Queries that look like URIs:

- None look like full URLs (`http://www.example.com/...`)
- Some look like domains/netlocs (`www.example.com`), but we leave these as is for now

In [90]:
has_uri = queries.str.contains(":/")
has_uri.sum()

0
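The domain-like queries are left untouched here. If they need to be flagged later, one possible heuristic is a regular expression for dot-separated labels ending in an alphabetic TLD. This is only a sketch: the pattern and the `looks_like_domain` helper are hypothetical, not part of this notebook's pipeline, and the heuristic will have both false positives and false negatives. The example strings are queries that appear in the outputs of this notebook.

```python
import re

# Hypothetical heuristic: one or more dot-separated labels followed by
# an alphabetic TLD of at least two letters.
DOMAIN_LIKE = re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,}", re.IGNORECASE)


def looks_like_domain(query: str) -> bool:
    """Return True if the whole query resembles a bare domain name."""
    return DOMAIN_LIKE.fullmatch(query) is not None


examples = [
    "www.newyorklawyersite.com",
    "staple.com",
    "clients-01.eprize.net",
    "orioles tickets",
    "hometown.aol.comgolfwidow55",  # malformed: missed by this heuristic
]
results = {q: looks_like_domain(q) for q in examples}

for q, flag in results.items():
    print(f"{q!r}: {flag}")
```

On the full series, the same pattern could be applied with `queries.str.fullmatch(DOMAIN_LIKE.pattern, case=False)`.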

Queries that include `@`:

- None

In [100]:
has_at = queries.str.contains("@")
has_at.sum()

0

Queries that include numeric characters:

In [109]:
# Raw string avoids Python's invalid-escape warning for "\d"
has_num = queries.str.contains(r"\d")

In [105]:
print(f"{has_num.sum():,}  ({has_num.mean():.2%})")

93,857  (7.71%)
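One subtlety with `\d`: for Python string patterns it matches any Unicode decimal digit, not only ASCII `0-9`. Whether that matters for this dataset is not checked here. The sketch below shows the difference with the standard `re` module; the third sample string is made up and contains an Arabic-Indic digit.

```python
import re

samples = [
    "2006 eclipse",
    "background images",
    "room \u0663",  # hypothetical query with a non-ASCII digit (U+0663)
]

# \d matches Unicode decimal digits; [0-9] matches ASCII digits only.
unicode_digits = [bool(re.search(r"\d", s)) for s in samples]
ascii_digits = [bool(re.search(r"[0-9]", s)) for s in samples]

print(unicode_digits)
print(ascii_digits)
```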


Queries that don't include numeric characters:

In [115]:
print(f"{(~has_num).sum():,}  ({(~has_num).mean():.2%})")

1,122,727  (92.29%)


In [116]:
queries[has_num].sample(20).sort_values().to_frame()

Unnamed: 0,query
127920,2006 eclipse
128162,2006 hairstyles
508199,bt4 combat
678808,clients-01.eprize.net
813743,deaths 1927 cleveland ohio
949806,ebay sent this message to ronald dunham 4638ronaldd .your registered name is included to show this message originated from ebay. learn more. complete your ebay registration dear 4638ronaldd to complete your ebay registration click the activate yo
1391771,hometown.aol.comgolfwidow55
1483106,hydrocone 10 650
1921767,misdemeanors go away after 7 years
2348285,president 1736


In [117]:
queries[~has_num].sample(20).sort_values().to_frame()

Unnamed: 0,query
351843,background images
405869,bennettpottery.com
650766,christus schumpert shreveport employment
719071,concert tickets on sale
762986,creditor herassment
1025471,fairplay firemans carnival
1354686,hepatomegaly meaning
1674637,labrador retrievers
1766240,low price tires car
1879011,memorialhermann


Write the unique queries, together with a flag indicating which ones contain numerals, to a CSV file for further analysis.

In [150]:
query_df = pd.DataFrame({"query": queries, "has_num": has_num})

In [15]:
query_df.to_csv(QUERY_CSV, index=False)
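A note for reading this file back: by default `pd.read_csv` converts certain strings (for example `null`, `nan`, `NA`) to missing values, which could silently blank out queries consisting of exactly that text. The sketch below uses an in-memory buffer and made-up rows to show the effect, and assumes the reader wants the query text preserved as is.

```python
import io

import pandas as pd

# Made-up rows for illustration; the first two collide with default NA markers.
toy = pd.DataFrame(
    {"query": ["null", "nan", "2006 eclipse"], "has_num": [False, False, True]}
)

buf = io.StringIO()
toy.to_csv(buf, index=False)

# Default parsing treats "null" and "nan" as missing values.
buf.seek(0)
default_read = pd.read_csv(buf)

# Disabling the default NA markers keeps the query text intact.
buf.seek(0)
safe_read = pd.read_csv(buf, keep_default_na=False)

print(default_read["query"].isna().sum())
print(safe_read["query"].tolist())
```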

Does the CSV separator character (`,`) show up in the query text?

- No.

In [157]:
queries.str.contains(",").sum()

0
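Even if a comma did appear in a query, minimal quoting (`csv.QUOTE_MINIMAL`, which is also the default for pandas `to_csv`) would wrap the affected field in quotes, so the file would still parse correctly. A small sketch with the standard `csv` module, using a made-up query that contains a comma:

```python
import csv
import io

# Made-up row: the query text contains the separator character.
rows = [
    ["query", "has_num"],
    ["deaths 1927 cleveland, ohio", "True"],
]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
text = buf.getvalue()

# Reading the text back recovers the original fields.
parsed = list(csv.reader(io.StringIO(text)))

print(text)
print(parsed == rows)
```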