### Challenge

Your goal is to create an **accurate representation of a user** based on their Google search history.

The data is in `./search_history.json`. This contains a list of searches made by a single person over time.

### What does "accurate" mean?

**Accurate** means understanding which searches are **signal** and which are **noise**. Not every search reflects who someone is. Your job is to separate the meaningful from the incidental and build a coherent picture of this person.

A strong solution might surface insights like:
- **Fashion preferences**: What styles, brands, or aesthetics do they gravitate toward?
- **Travel**: Where have they been? Where are they planning to go?
- **Daily life**: What occupies their time—at work and for leisure?
- **Life transitions**: Are they moving? Starting a new job? Planning a wedding?
- **Location**: Where do they live?

This is not an exhaustive list. The point is to go beyond surface-level keyword extraction and demonstrate that you *actually understand* this person.

### What could a "representation" look like?

There are many ways to represent a user. A few examples:
- A **personal knowledge graph** capturing entities, relationships, and context
- A **single embedding** that encodes the user's preferences in a vector space
- An **LLM fine-tuned** on the user's data
- An **agent** that uses RAG to answer questions about the user

These are just starting points—come up with your own if you have a better idea. The specific representation you choose matters less than **why** you chose it and how well it captures what's meaningful about this person.

### Dummy approach

The following is what we consider a **dummy** approach:
1. Embed all searches
2. Cluster them by topic
3. Label each cluster and call it a "user interest"

This is mechanical. It doesn't distinguish signal from noise, doesn't capture nuance, and doesn't produce insights that feel *true* about a real person.
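For concreteness, here is a minimal sketch of that mechanical baseline. It is an illustration only: token overlap (Jaccard similarity) stands in for real embeddings, the greedy single-pass grouping stands in for a proper clustering algorithm, and the queries, function names, and threshold are all hypothetical.

```python
from collections import Counter

def tokens(query):
    """Lower-cased whitespace tokens; a crude stand-in for an embedding."""
    return set(query.lower().split())

def jaccard(a, b):
    """Token-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def naive_clusters(queries, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed query is similar enough."""
    clusters = []  # list of (seed_tokens, member_queries)
    for query in queries:
        query_tokens = tokens(query)
        for seed_tokens, members in clusters:
            if jaccard(query_tokens, seed_tokens) >= threshold:
                members.append(query)
                break
        else:
            clusters.append((query_tokens, [query]))
    return [members for _, members in clusters]

def label(members):
    """Label a cluster with its most frequent token."""
    counts = Counter(tok for query in members for tok in tokens(query))
    return counts.most_common(1)[0][0]

# Hypothetical queries, not taken from the dataset.
queries = ["cheap flights lisbon", "flights lisbon may", "gmail", "lisbon flights return"]
clusters = naive_clusters(queries)
for members in clusters:
    print(label(members), "->", members)
```

Note that the lone `gmail` query becomes its own "interest" here, which is exactly the failure described above: nothing in this pipeline asks whether a cluster says anything about the person.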

### What makes an interesting approach?

We're not looking for a "correct" answer; there probably isn't one. We're looking for **evidence of thinking**:
- Why did you choose this method over alternatives?
- What assumptions are you making, and why are they reasonable?
- How do you handle ambiguity in the data?
- What did you try that didn't work?

**The reasoning behind your approach is as important as the solution itself.** Show your work. Explain your decisions. If you explored dead ends, include them.

Make sure to include the cell output in the final commit. We will **not** execute the notebook ourselves.

In [None]:
# Load the exported search history (a JSON list of activity records)
import json

with open("search_history.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Number of records in the file
len(data)

55383
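
Before building a dataframe, it can help to check which top-level keys the records actually carry, since not every record necessarily has every field. The sketch below is one way to do that; `key_coverage` is a hypothetical helper and the two sample records are made up for illustration. On the real data it would be called as `key_coverage(data)`.

```python
from collections import Counter

def key_coverage(records):
    """Count how many records contain each top-level key."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts

# Made-up records for illustration only; the real input is the loaded `data` list.
sample = [
    {"header": "Search", "title": "Searched for example query",
     "time": "2020-01-01T10:00:00.000Z"},
    {"header": "Search", "title": "Visited Example page",
     "titleUrl": "https://example.com/", "time": "2020-01-01T10:00:05.000Z"},
]

coverage = key_coverage(sample)
for key, n in coverage.most_common():
    print(f"{key}: {n}/{len(sample)}")
```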


In [5]:
import pandas as pd

# Raw JSON -> dataframe (keeps all original columns)
df = pd.DataFrame(data)

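# Google Takeout is usually reverse-chronological (most recent first),
# so sort by time so that "after a search" means later in time.
# Note: errors="coerce" turns unparseable timestamps into NaT, and sort_values
# places NaT rows last, so it is worth checking df["time"].isna().sum().
df["time"] = pd.to_datetime(df["time"], utc=True, errors="coerce")
df = df.sort_values("time").reset_index(drop=True)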

title = df["title"].astype(str)

# Classify rows (optional but convenient)
df["event_type"] = "other"
df.loc[title.str.startswith("Searched for "), "event_type"] = "search"
df.loc[title.str.startswith("Visited "), "event_type"] = "click"

# Add a search id that increments at each "Searched for ..." row,
# then applies to all subsequent rows until the next search.
is_search = df["event_type"].eq("search")
df["search_id"] = is_search.cumsum().astype("Int64")
df.loc[df["search_id"].eq(0), "search_id"] = pd.NA

# Carry forward the search metadata so every row keeps full context
df["search_query"] = (
    title.where(is_search)
    .str.replace(r"^Searched for\s+", "", regex=True)
    .ffill()
)
df["search_time"] = df["time"].where(is_search).ffill()
df["search_titleUrl"] = df["titleUrl"].where(is_search).ffill()

# Convenience: normalize click title (original columns are still present)
df["clicked_title"] = title.where(df["event_type"].eq("click")).str.replace(r"^Visited\s+", "", regex=True)

# Example: all click URLs for each search query
# df[df["event_type"].eq("click")].groupby(["search_id", "search_query"])["titleUrl"].apply(list)

df[df["search_id"].notna()].head(10)

Unnamed: 0,header,title,titleUrl,time,products,activityControls,locationInfos,subtitles,details,event_type,search_id,search_query,search_time,search_titleUrl,clicked_title
0,Search,Searched for gmail,https://www.google.com/search?q=gmail,2017-06-08 16:42:55.223000+00:00,[Search],[Web & App Activity],,,,search,1,gmail,2017-06-08 16:42:55.223000+00:00,https://www.google.com/search?q=gmail,
1,Search,Visited https://www.google.com/gmail/,https://www.google.com/gmail/,2017-06-08 16:42:57.355000+00:00,[Search],[Web & App Activity],,,,click,1,gmail,2017-06-08 16:42:55.223000+00:00,https://www.google.com/search?q=gmail,https://www.google.com/gmail/
2,Search,Searched for investment banking networking eve...,https://www.google.com/search?q=investment+ban...,2017-06-08 16:45:50.139000+00:00,[Search],[Web & App Activity],,,,search,2,investment banking networking events london,2017-06-08 16:45:50.139000+00:00,https://www.google.com/search?q=investment+ban...,
3,Search,Visited http://news.efinancialcareers.com/uk-e...,https://www.google.com/url?q=http://news.efina...,2017-06-08 16:45:58.449000+00:00,[Search],[Web & App Activity],,,,click,2,investment banking networking events london,2017-06-08 16:45:50.139000+00:00,https://www.google.com/search?q=investment+ban...,http://news.efinancialcareers.com/uk-en/222450...
4,Search,Searched for blackstone's women networking event,https://www.google.com/search?q=blackstone%27s...,2017-06-08 16:48:12.167000+00:00,[Search],[Web & App Activity],,,,search,3,blackstone's women networking event,2017-06-08 16:48:12.167000+00:00,https://www.google.com/search?q=blackstone%27s...,
5,Search,Visited https://www.blackstone.com/careers/bla...,https://www.google.com/url?q=https://www.black...,2017-06-08 16:48:26.034000+00:00,[Search],[Web & App Activity],,,,click,3,blackstone's women networking event,2017-06-08 16:48:12.167000+00:00,https://www.google.com/search?q=blackstone%27s...,https://www.blackstone.com/careers/blackstone-...
6,Search,Searched for blackstone,https://www.google.com/search?q=blackstone,2017-06-08 16:51:08.080000+00:00,[Search],[Web & App Activity],,,,search,4,blackstone,2017-06-08 16:51:08.080000+00:00,https://www.google.com/search?q=blackstone,
7,Search,Visited Blackstone: Home,https://www.google.com/url?q=https://www.black...,2017-06-08 16:51:10.506000+00:00,[Search],[Web & App Activity],,,,click,4,blackstone,2017-06-08 16:51:08.080000+00:00,https://www.google.com/search?q=blackstone,Blackstone: Home
8,Search,Searched for face,https://www.google.com/search?q=face,2017-06-08 17:04:43.881000+00:00,[Search],[Web & App Activity],,,,search,5,face,2017-06-08 17:04:43.881000+00:00,https://www.google.com/search?q=face,
9,Search,Visited Facebook - log in or sign up,https://www.google.com/url?q=https://www.faceb...,2017-06-08 17:04:53.892000+00:00,[Search],[Web & App Activity],,,,click,5,face,2017-06-08 17:04:43.881000+00:00,https://www.google.com/search?q=face,Facebook - log in or sign up
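
Several click rows above store a `https://www.google.com/url?q=...` redirect in `titleUrl` rather than the destination itself, and queries such as `gmail` or `face` look like plain navigation rather than interest. The sketch below shows one possible way to unwrap the redirect and flag such navigational searches. The helper names, the prefix-matching rule, and the full example URLs are assumptions made for illustration (the URLs in the table are truncated, so they are not reproduced here).

```python
from urllib.parse import urlparse, parse_qs

GOOGLE_HOSTS = {"google.com", "www.google.com"}

def unwrap_google_redirect(url):
    """Return the destination of a google.com/url?q=... link, else the URL unchanged."""
    parsed = urlparse(url)
    if parsed.netloc.lower() in GOOGLE_HOSTS and parsed.path == "/url":
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    return url

def looks_navigational(query, clicked_url):
    """Heuristic (an assumption, not a rule from the task): the query is a prefix
    of the clicked site's name or of the first path segment."""
    target = urlparse(unwrap_google_redirect(clicked_url))
    host = target.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    site = host.split(".")[0]
    first_segment = target.path.strip("/").split("/")[0].lower()
    q = "".join(query.lower().split())
    return len(q) >= 3 and any(
        name and name.startswith(q) for name in (site, first_segment)
    )

# Illustrative URLs only.
print(looks_navigational("face", "https://www.google.com/url?q=https://www.facebook.com/"))
print(looks_navigational("gmail", "https://www.google.com/gmail/"))
print(looks_navigational(
    "blackstone's women networking event",
    "https://www.google.com/url?q=https://www.blackstone.com/careers/",
))
```

A flag like this is only a starting point: a navigational search says little on its own, but which sites someone navigates to, and how often, can still carry signal.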


In [None]:
# Example of a search URL in the dataset:
# https://www.google.com/search?q=sunak&tbm=nws
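
The search URL carries more than the query text. In the example above, `tbm=nws` appears to mark a search made in the News tab, which may be a useful hint about intent. A minimal sketch of pulling both parameters out with the standard library follows; `parse_search_url` and the `"web"` default label for a missing `tbm` are hypothetical choices, not part of the dataset.

```python
from urllib.parse import urlparse, parse_qs

def parse_search_url(url):
    """Extract the query text and the search vertical from a Google search URL."""
    params = parse_qs(urlparse(url).query)
    return {
        "query": params.get("q", [None])[0],
        # "web" is a label chosen here for URLs without a tbm parameter.
        "vertical": params.get("tbm", ["web"])[0],
    }

parsed = parse_search_url("https://www.google.com/search?q=sunak&tbm=nws")
print(parsed)  # {'query': 'sunak', 'vertical': 'nws'}
```

`parse_qs` also decodes `+` and percent-escapes, so the same helper can recover the full query text from `search_titleUrl`.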