In [None]:
#installing dependencies
!pip install -q pyspark findspark

In [None]:
# Initialise Spark to locate Java dependencies
import findspark
findspark.init()

#Pyspark for distributed data processing
import pyspark
from pyspark.sql import SparkSession #initialise spark
from pyspark.sql.functions import col, sum as _sum

#import standard Python libraries for data analysis and visualization
import numpy as np # Numerical operations
import pandas as pd # Data manipulation
import matplotlib.pyplot as plt # Plotting
from matplotlib import colormaps # Access to color maps for plotting
import math # Mathematical utilities

import seaborn as sns # Statistical data visualization

In [None]:
from statsmodels.tsa.stattools import adfuller, kpss

> **⚠️ Prerequisite Notice**  
> This notebook uses Apache Spark via PySpark.  
> Please ensure you have **Java (JDK 8 or 11)** installed and properly configured on your syste Make sure the `JAVA_HOME` environment variable is set correctly and that `java.exe` is accessible in your system PATH.H.
>  
> Without Java, the Spark session will fail to initiale.



In [None]:
# Create a Spark session for distributed processing
spark = SparkSession.builder \
    .appName("Search Trends Analysis") \
    .getOrCreate()

While the dataset used in this thesis is relatively small and could have been handled entirely within Python's Pandas framework, PySpark was initially selected for its scalability and potential for distributed data processing.

# Google Trends keywords SVI data

In [None]:
#read CSV using Spark
df_kw = spark.read.csv('search-trends-vs-financial-markets/Collected Data/carrefour_search_trends_keywords.csv', header=True, inferSchema=True) #inferSchema added to automatically type inference for columns
df_agg = spark.read.csv('search-trends-vs-financial-markets/Collected Data/carrefour_search_trends_aggregated.csv', header=True, inferSchema=True)

## Google Trends keywords

In [None]:
#show preview keywords
df_kw.show(3)
df_kw.printSchema()

Several columns appear to have only zero values; let's drop them and keep a list of these.

In [None]:
# Exclude 'date' column from processing
columns_to_check = [c for c in df_kw.columns if c != 'date']

In [None]:
#Drop columns (keywords) with only zero values in their cells
kept_cols = []
dropped_cols = []

# Loop through numeric columns only
for c in columns_to_check:
    try:
        col_sum = df_kw.select(_sum(col(f"`{c}`"))).collect()[0][0]     # Calculate sum of each keyword column, backticks in col() to safely reference columns like `E.Leclerc`
        if col_sum == 0 or col_sum is None:
            dropped_cols.append(c)  # Drop if column has no valid data
        else:
            kept_cols.append(c)
    except Exception as e:
        print(f"Skipping column '{c}' due to error: {e}")
        dropped_cols.append(c)

In [None]:
#Construct final list of columns to retain, keep 'date' and valid keyword columns
final_cols = ['date'] + kept_cols

# Select cleaned/filtered DataFrame
df_kw = df_kw.select(*[col(c) if c == 'date' else col(f"`{c}`") for c in final_cols])

In [None]:
# Show results
print("Dropped columns (all values were 0 or null):")
print(dropped_cols)

✅ Dropped columns (all values were 0 or null):

`carfour`, `carrefour near me`, `IntermarchÃ©`, `carrefour bourse`.

In [None]:
# Preview the first 3 rows of the cleaned dataset
df_kw.show(3)

### EDA

In [None]:
# Convert PySpark DataFrame to Pandas
df_kw = df_kw.toPandas()

In [None]:
# Check the dimensions of the DataFrame
df_kw.shape

#### 1. Date Handling & Time Index Setup

The `date` is currently in ISO 8601 standard, `yyyy-mm-dd`, making it directly compatible with pandas and the libraries we will use.

However, we will convert the date to a Datetime object to fully utilise the time series functionalities. Lastly, we will set it as the DataFrame index, which converts the DataFrame into a time series for slicing, plotting, and modelling.

In [None]:
# Convert 'date' column to datetime and set as index
df_kw['date'] = pd.to_datetime(df_kw['date'])
df_kw.set_index('date', inplace=True)
df_kw.sort_index(inplace=True) #sorts data chronologically from earliest to latest data

#### 2. Time Series Grid of Keywords

In [None]:
# Set up subplot grid
n_keywords = len(df_kw.columns)
n_cols = 4
n_rows = math.ceil(n_keywords / n_cols)

In [None]:
#plot grid
plt.figure(figsize=(n_cols * 4, n_rows * 3))

for i, keyword in enumerate(df_kw.columns):
    plt.subplot(n_rows, n_cols, i + 1)
    plt.plot(df_kw.index, df_kw[keyword], color='teal')
    plt.title(keyword, fontsize=10)
    plt.xticks(rotation=45)
    plt.tight_layout()

plt.suptitle("SVI Trends by Keyword", fontsize=16, y=1.02)

#save figure as png
plt.savefig('svi_keyword_trends.png', bbox_inches='tight', dpi=300)

plt.show()

As expected based on the literature review, we observe considerable variability in search popularity over time for most keywords, with episodic peaks in search interest. We can also see the sudden rise in popularity of some keywords over time and the decline of others.

#### 3. Distribution Plot of Interest Scores

In [None]:
# Create horizontal boxplots for each keyword's SVI distribution
df_kw.plot(kind='box', vert=False, figsize=(10, 12), title='SVI Distribution')
plt.title('SVI Distribution', fontweight='bold')
plt.tick_params(axis='x', which='both', labeltop=True)
plt.grid(axis='x', linestyle=':', linewidth=0.7)
plt.xticks(np.arange(0, 110, 10))
plt.tight_layout()

#save figure as png
plt.savefig('svi_distribution_boxplot.png', bbox_inches='tight', dpi=300)

plt.show()

#### 4. Statistical Summary

In [None]:
# Generate summary statistics for all keywords
summary = df_kw.describe().T

In [None]:
# Compute additional statistics: range, IQR, skew, kurtosis, volatility
summary["range"] = summary["max"] - summary["min"]
summary["iqr"] = summary["75%"] - summary["25%"] #interquartile range
summary["skew"] = df_kw.skew()
summary["kurtosis"] = df_kw.kurtosis()
summary["volatility (std/mean)"] = summary["std"] / summary["mean"]

In [None]:
# Display the updated summary statistics
summary

##### Mean
A few keywords stand above the others with a mean above 60, indicating dominant and sustained search behaviour: 
"carrefour banque" ~73, "lidl" ~68, "carrefour" ~64, "catalogue carrefour"	~62, "carrefour catalogue"	~62, "drive carrefour"	~61, "carrefour drive"	~61

##### Standard Deviation
With a sigma above ~17, the following keywords showcase the highest volatility in search patterns: "catalogue carrefour", "carrefour catalogue", "cora", "foire aux vins carrefour".

##### Range (max-min)
Several FMCG-related and brand name keywords are acarcterised with wide fluctuations in attention (range above 50): "foire aux vins carrefour", "pizza carrefour", "carrefour", "cora carrefour", "cora", "carrefour market", "leclerc".

##### Skewness and Kurtosis
A few keywords have a negative skew, with only "Auchan catalogue" having a skewness above -1.0. Interestingly, "carrefour livraison Ã domicile", "rappel produit carrefour", "E.Leclerc" and "carrefour recrutement" have a significant positive skeweness (all above +5.0), indicating a low search interest with occasional spikes in interest. These same keywords are also the ones with the highest kurtosis, suggesting strong event-driven behaviour. Based on the meaning of the keywords, we can see that this is possibly related to news or exceptional occasions ("rappel produit carrefour" and "E.Leclerc") or seasonal events ("carrefour recrutement").

##### Volatility (std/mean)
Based on the data, we can set the volatility thresholds as follows:
* below 0.20 Low: most of these keywords are brand equity
* 0.20 – 0.50 Medium: keywords here seem to be related to FMCG sales cycles
* 0.50 – 1.0 High: keywords in this group are possibly linked to events as they have high variance in consumer interest.
* above 1.0 Very High: these keywords might be helpful for short-term forecasting or anomaly analysis.

#### 5. Missing Values Analysis
Check how much missing and flat data there is with visuals.

In [None]:
# Check for any columns with missing values
missing = df_kw.isnull().sum()
missing = missing[missing > 0]

if not missing.empty:
    print("Columns with missing values:")
    print(missing)
else:
    print("No missing values found.")


In [None]:
# Check how many zero values exist per column
zeros = (df_kw == 0).sum()
zeros = zeros[zeros > 0]

if not zeros.empty:
    print("Number of zeros in columns:")
    print(zeros)
else:
    print("No columns with zeros values found.")


Keywords with many zero values might have episodic or accidental search interest; these will be monitored throughout the rest of the EDA.

#### 6. Keyword Correlation Matrix

In [None]:
# Plot heatmap of correlation matrix for keyword search trends
corr = df_kw.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(35, 20))
sns.heatmap(corr, 
            mask=mask, 
            cmap='coolwarm', 
            center=0, 
            linewidths=0.5, 
            annot=True)
plt.title("Keywords Correlation Matrix")
plt.tight_layout()

#save figure as png
plt.savefig('svi_correlation_matrix.png', bbox_inches='tight', dpi=300)

plt.show()

In [None]:
# Transform correlation matrix into list and categorise correlation strength
corr_pairs = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
corr_pairs = corr_pairs.stack().reset_index()
corr_pairs.columns = ['Keyword1', 'Keyword2', 'Correlation']

def categorize_corr(value):
    if value >= 0.8:
        return 'High Positive'
    elif value <= -0.8:
        return 'High Negative'
    elif value >= 0.4:
        return 'Average Positive'
    elif value <= -0.4:
        return 'Average Negative'
    elif value >= 0.0:
        return 'Low Positive'
    elif value > -0.4:
        return 'Low Negative'
    else:
        value
        return 'Negative'

corr_pairs['Category'] = corr_pairs['Correlation'].apply(categorize_corr)

##### HIGH CORRELATION PAIRS

In [None]:
# Display keyword pairs with high correlations
high_corr = corr_pairs[corr_pairs['Category'].isin(['High Positive', 'High Negative'])]
print(high_corr.to_string(index=False))

Highly correlated keywords are all positive indicating a strong linear positive correlation.

* Carrefour and its sub-brands/services show consistently high correlations, suggesting that interest in Carrefour as a brand is strongly tied to its different offerings.
* Consumers frequently search promotions or catalogues in conjunction with drive-related services.
* Strong correlation between Carrefour and competitors, suggesting that users often compare multiple grocery retailers in the same session or buying cycle.
* High correlation between localised formats, confirming the literature as of France's interest in urban and convenience-oriented store formats

##### AVERAGE CORRELATION PAIRS

Reflect moderately aligned but differentiated consumer behaviours.

In [None]:
# Display keyword pairs with average correlations
avg_corr = corr_pairs[corr_pairs['Category'].isin(['Average Positive', 'Average Negative'])]
print(avg_corr.to_string(index=False))

##### LOW CORRELATION PAIRS

In [None]:
# Display keyword pairs with low correlations
low_corr = corr_pairs[corr_pairs['Category'].isin(['Low Positive', 'Low Negative'])]
print(low_corr.to_string(index=False))

**Positive Low Correlations**
* Product-specific keywords (aloe vera, pizza, ongle carrefour) often exhibit isolated behaviours, hinting at niche shopping intent or product-specific campaigns.

**Negative Correlations**
* These pairs indicate weak or diverging search behaviour, which may suggest: niche interest, separate consumer journeys, misalignment in search intent or timing.
* `carefour` (misspelt) has several negative correlations (`carrefour drive` (-0.35), `carrefour promo` (-0.20), `carrefour catalogue` (-0.36)), suggesting noise or irrelevant intent behind this keyword.

##### Conclusions
Based on the descriptive analysis, count of zeros in the columns, and correlation matrix, we can state the following for our analysis:

* `carefour`: Highly noisy and negatively correlated with most Carrefour terms, it will be removed as it may not be relevant. 
* `carrefour livraison à domicile`: Sparse and episodic search behaviour, will be aggregated with `carrefour livraison domicile`.
* `aloe vera carrefour`: Niche product search, best to aggregate it with other FMCG keywords.
* `E.Leclerc`: inconsistent sample, best to remove as there is already the `lecerc` keyword fulfilling the same search intent.
* `carrefour recrutement`: Episodic search, while it possibly follows a seasonal recruitment pattern, it has too little data to provide insight fully.
* `rappel produit carrefour`: Event-driven; behaves independently from regular consumer patterns.

In [None]:
#drop desired keywords
df_svi = df_kw.drop(columns=["carefour", "E.Leclerc", "carrefour recrutement"])

In [None]:
# Check the dimensions of the DataFrame
df_svi.shape

## Google Trends keywords aggregated

Based on search intent, keywords can be aggregated as follows:

> ⚠️ keywords marked as ~~keywords~~ are those dropped after the initial EDA

| Aggregate | Keywords | Justification |
|---|---|---|
| Brand | carrefour, carrefour autour de moi, ~~carrefour near me~~, ~~carfour~~, ~~carefour~~ | Serves as an anchor term to capture general brand interest and visibility. |
| Service and logistics | carrefour drive, drive carrefour, carrefour livraison, carrefour livraison domicile, carrefour livraison Ã domicile | Reflects consumer demand for fulfilment services such as click-and-collect and home delivery, indicating operational engagement. |
| Sub-brand | carrefour market, carrefour city, carrefour express, cora | Provides more granular insight into Carrefour’s diversified retail formats and regional presence. |
| Promo and engagement | carrefour promo, carrefour code promo drive, carrefour catalogue, catalogue carrefour, carrefour fidelite, bon d'achat carrefour | Captures interest in promotions, loyalty programs, and catalogues; key drivers of footfall and conversion in price-sensitive FMCG segments. |
| FMCG products | carrefour produits, carrefour alimentaire, carrefour epicerie, carrefour bio, pizza carrefour, foire aux vins carrefour, ongle carrefour, franck provost carrefour, parfumerie carrefour, aloe vera carrefour | Reflects consumer preferences for specific product categories; interest in organic and beauty items may indicate evolving lifestyle and sustainability trends. |
| Competitors | Auchan, Auchan catalogue, ~~E.Leclerc~~, leclerc, ~~IntermarchÃ©~~, lidl, super u  | Rising interest in competing retailers may signal market share shifts or influence investor sentiment regarding Carrefour. |
| Finance | ~~carrefour bourse~~, ~~carrefour recrutement~~, carrefour credit, carrefour assurance, action carrefour, carrefour banque, carrefour anti crise | Indicates public engagement with Carrefour’s financial operations, job market relevance, and economic resilience. |
| News  | fermeture carrefour, rappel produit carrefour, cora carrefour | Tracks external news-driven factors, including store closures and product recalls, which may impact consumer trust or financial outlook. |

In [None]:
#free up memory by deleting no longer used variables
del avg_corr, c, col_sum, columns_to_check, corr, corr_pairs, df_kw, dropped_cols, final_cols, high_corr, i, keyword, low_corr, missing, n_cols, n_keywords, n_rows, summary, zeros

In [None]:
#show preview keywords
df_agg.show(5)
df_agg.printSchema()

In [None]:
# Convert PySpark DataFrame to Pandas
df_agg = df_agg.toPandas()

In [None]:
# Ensure 'date' column is datetime and set it as the index
df_agg['date'] = pd.to_datetime(df_agg['date'], dayfirst=True)
df_agg.set_index('date', inplace=True)

In [None]:
# Plotting all keywords 
plt.figure(figsize=(16, 8))

for keyword in df_agg.columns:
    plt.plot(df_agg.index, df_agg[keyword], label=keyword, linewidth=2)

plt.title("Search Interest Over Time by Aggregated Keywords", fontsize=14)
plt.xlabel("Date")
plt.ylabel("Google SVI (0–100)")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=8)
plt.tight_layout()
plt.grid(True)
plt.show()

The aggregation of the SVI data in the `carrefour_search_trends_aggregated.csv` was initially performed at the data collection stage in the following manner: 
- one column for each aggregated category (as defined in the table above);
- each row represents a time period with weekly frequency;
- each value is the sum of the SVI values for all keywords belonging to that category at that point in time.

However, this initial aggregation method is methodologically unsatisfactory. In Google Trends each keyword's SVI is scaled individually (where 0 represent the lowest relative search interest and 100 the peak relative interest within the time range selected), therefore, summing across keywords combines values on different scales. As such we risk overemphasising categories with more keywords and introduce bias if some keywords exhibit greater volatility than others.

To enhance the interpretability and methodological robustness of the analysis, alternative aggregation techniques should be considered:

| Aggregation Method   | Analysis   | 
|:---|:---|
| Simple mean | Assigns equal weight to all keywords, avoids keyword-count bias, and is easy to interpret. |
| Weighted mean | Offers higher accuracy if reliable weights (e.g., based on historical correlation or relevance) are available. |
| Z-score normalised mean | Standardises keyword volatility and expresses interest relative to each keyword’s historical mean. |
| Median | More robust to outliers and episodic spikes, especially useful with erratic or sparse search data. |
| Principal component aggregation (PCA) | Extracts the dominant shared pattern across keywords, ideal when a common driver is expected. |
| Maximum value (peak interest) | Highlights the most significant surge in attention per period, suitable for tracking event-driven spikes. |
| Frequency-based binary aggregation | Converts SVIs into binary indicators (e.g., 1 if above threshold), capturing the breadth of search interest per category. |

Using the mean as the aggregation method is likely the most appropriate option, as it mitigates the bias of differing keyword counts while also offering an intuitive measure of category-level search interest. However, to avoid weight bias, keywords with almost equal search intent will be aggregated separately first; these are:
* "carrefour drive", "drive carrefour"
* "carrefour livraison", "carrefour livraison domicile", "carrefour livraison Ã domicile"
* "carrefour promo", "carrefour code promo drive"
* "carrefour catalogue", "catalogue carrefour"
* "Auchan", "Auchan catalogue"

In [None]:
# Averaging specified keywords and creating new merged columns using df_kw_pd
df_svi["c_drive"] = df_svi[["carrefour drive", "drive carrefour"]].mean(axis=1)
df_svi["c_livraison"] = df_svi[["carrefour livraison", "carrefour livraison domicile", "carrefour livraison Ã domicile"]].mean(axis=1)
df_svi["promo"] = df_svi[["carrefour promo", "carrefour code promo drive"]].mean(axis=1)
df_svi["catalogue"] = df_svi[["carrefour catalogue", "catalogue carrefour"]].mean(axis=1)
df_svi["auchan"] = df_svi[["Auchan", "Auchan catalogue"]].mean(axis=1)

In [None]:
#dropping old keywords now merged
df_svi = df_svi.drop(columns=[
    "carrefour drive", "drive carrefour",
    "carrefour livraison", "carrefour livraison domicile", "carrefour livraison Ã domicile",
    "carrefour promo", "carrefour code promo drive",
    "carrefour catalogue", "catalogue carrefour",
    "Auchan", "Auchan catalogue"
])

In [None]:
# Check the dimensions of the DataFrame
df_svi.shape

In [None]:
# Check the dimensions of the DataFrame
df_svi.columns 

In [None]:
df_svi_agg = df_svi.copy()

In [None]:
#aggregating variables based on category
df_svi_agg["brand"] = df_svi_agg[["carrefour", "carrefour autour de moi"]].mean(axis=1)
df_svi_agg["service"] = df_svi_agg[["c_drive", "c_livraison"]].mean(axis=1)
df_svi_agg["sub-brand"] = df_svi_agg[["carrefour market", "carrefour city", "carrefour express", "cora"]].mean(axis=1)
df_svi_agg["promo"] = df_svi_agg[["promo", "catalogue", "carrefour fidelite", "bon d'achat carrefour"]].mean(axis=1)
df_svi_agg["fmcg"] = df_svi_agg[["carrefour produits", "carrefour alimentaire", "carrefour epicerie", "carrefour bio", "pizza carrefour", "foire aux vins carrefour", "ongle carrefour", "franck provost carrefour", "parfumerie carrefour", "aloe vera carrefour"]].mean(axis=1)
df_svi_agg["competitors"] = df_svi_agg[["auchan", "leclerc", "lidl", "super u"]].mean(axis=1)
df_svi_agg["finance"] = df_svi_agg[["carrefour credit", "carrefour assurance", "action carrefour", "carrefour banque", "carrefour anti crise"]].mean(axis=1)
df_svi_agg["news"] = df_svi_agg[["fermeture carrefour", "rappel produit carrefour", "cora carrefour"]].mean(axis=1)

In [None]:
#dropping old keywords now merged
df_svi_agg = df_svi_agg.drop(columns=[
    "carrefour", "carrefour autour de moi", "carrefour city",
    "carrefour express", "carrefour market", "cora", "bon d'achat carrefour",
    "carrefour fidelite", "carrefour alimentaire", "carrefour bio", "carrefour epicerie", "carrefour produits",
    "pizza carrefour", "aloe vera carrefour", "foire aux vins carrefour",
    "franck provost carrefour", "ongle carrefour", "parfumerie carrefour",
    "leclerc", "carrefour credit", "lidl", "super u", "action carrefour",
    "carrefour anti crise", "carrefour assurance", "carrefour banque",
    "fermeture carrefour", "cora carrefour", "rappel produit carrefour",
    "c_drive", "c_livraison", "promo", "catalogue", "auchan"  # fixed here
])

In [None]:
# Check the dimensions of the DataFrame
df_svi_agg.shape

In [None]:
# Check the dimensions of the DataFrame
df_svi_agg.columns 

### EDA Aggregated DataFrame df_agg_final

##### 1. Statistical Summary

In [None]:
# Generate summary statistics for aggregated SVI data
df_svi_agg.describe().T

##### 2. Correlation Matrix 

In [None]:
# Plot heatmap for correlations among aggregated keyword trends
corr = df_svi_agg.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(5, 3))
sns.heatmap(corr, 
            mask=mask, 
            cmap='coolwarm', 
            center=0, 
            linewidths=0.5, 
            annot=True)
plt.title("Aggregated Keywords Correlation Matrix")
plt.tight_layout()

#save figure as png
plt.savefig('svi_agg_corr_matrix.png', bbox_inches='tight', dpi=300)

plt.show()

##### 3. Boxplots

In [None]:
# Boxplot of aggregated keyword search volumes
df_svi_agg.plot(kind='box', vert=False, figsize=(6, 3), title='SVI Aggregated Distribution')
plt.tick_params(axis='x', which='both', labeltop=True)
plt.grid(axis='x', linestyle=':', linewidth=0.7)
plt.xticks(np.arange(0, 70, 5))
plt.tight_layout()

#save figure as png
plt.savefig('svi_agg_distribution_boxplot.png', bbox_inches='tight', dpi=300)

plt.show()

##### 4. Time Series Trends

In [None]:
# Plotting all keywords in timeseries
plt.figure(figsize=(12, 4))

for idx, keyword in enumerate(df_svi_agg.columns):
    plt.plot(df_svi_agg.index, df_svi_agg[keyword], label=keyword)

plt.title("Search Interest Over Time by Aggregated Keyword", fontsize=11)
plt.xlabel("Date")
plt.ylabel("Google SVI (0–100)")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=8)
plt.tight_layout()
plt.grid(True)

#save figure as png
plt.savefig('svi_agg_keyword_trends.png', bbox_inches='tight', dpi=300)

plt.show()

# Carrefour France stock data
This data wascollected att daily and weekly frequenciey.

In [None]:
#read CSV using Spark
df_fin = spark.read.csv('search-trends-vs-financial-markets/Collected Data/carrefour_stock_data.csv', header=True, inferSchema=True)
df_wfin = spark.read.csv('search-trends-vs-financial-markets/Collected Data/carrefour_stock_weekly.csv', header=True, inferSchema=True)

#### Daily stocks

In [None]:
#show preview daily stocks
df_fin.show(5)
df_fin.printSchema()

#### Weekly stocks

In [None]:
#show preview stocks
df_wfin.show(5)
df_wfin.printSchema()

##### Convert PySpark DataFrame to Pandas to analyse 

In [None]:
# Convert yFinance stock data to pandas
df_stock = df_wfin.toPandas()

In [None]:
# Check the dimensions of the DataFrame
df_stock.shape

### The time issue. 

Upon reviewing the collected datasets, a discrepancy in date labelling was identified between the weekly stock data from `yfinance` and the Google Trends keyword data. To facilitate data comparison and accurate time-series modelling, all datasets must share a standard, synchronised timeframe.

1. *Stock Market Data*

The weekly stock data from `yfinance` uses Monday as the label for each weekly observation. However, each row in the dataset represents the week ending Friday, but is indexed by the preceding Monday.

2. *Google Trends Data*

Google Trends aggregates search interest weekly, with each week's data point labelled by Sunday, the end of the search week.


This results in a misalignment between:
* Stock closing prices (on Friday, labelled as Monday),
* Search volume data (ending Sunday).

The solution to align both datasets:
* Google Trends dates will be shifted −2 days (from Sunday → Friday) to represent the end of the same week as the stock market.
* Stock data dates will be shifted +4 days (from Monday → Friday) to reflect the actual trading day.

This ensures that both data sources are indexed by the same Friday date, making them directly comparable for all subsequent analysis.

##### Adjust Google Trends dates (from Sunday to Friday)

In [None]:
# Align weekly indexes of Google Trends
df_svi.index = df_svi.index - pd.Timedelta(days=2)
# Align weekly indexes of Aggregated Google Trends
df_svi_agg.index = df_svi_agg.index - pd.Timedelta(days=2)

##### Adjust Stock Data dates (from Monday to Friday)

In [None]:
# Align weekly indexes of yFinance
df_stock['Date'] = pd.to_datetime(df_stock['Date']) + pd.Timedelta(days=4)

In [None]:
#set date as index
df_stock.set_index('Date', inplace=True)

In [None]:
# Print date ranges of both datasets to ensure alignment
print('Google Trends')
print(df_svi.index.min(), df_svi.index.max())
print('Aggregated Google Trends')
print(df_svi_agg.index.min(), df_svi_agg.index.max())
print('yFinance')
print(df_stock.index.min(), df_stock.index.max())

There is a date range difference between the two datasets; the `df_svi` and `df_svi_agg` start 2 weeks earlier and finish one week earlier than `df_stock`.

In [None]:
# Drop the first two weeks from Google Trends
df_svi = df_svi.iloc[2:]

# Drop the first two weeks from Aggregated Google Trends
df_svi_agg = df_svi_agg.iloc[2:]

# Drop the last week from stock data
df_stock = df_stock.iloc[:-1]

In [None]:
# Print date ranges of both datasets to ensure alignment
print('Google Trends')
print(df_svi.index.min(), df_svi.index.max())
print('Aggregated Google Trends')
print(df_svi_agg.index.min(), df_svi_agg.index.max())
print('yFinance')
print(df_stock.index.min(), df_stock.index.max())

Now the date has been trimmed to match for both dataset and it is ready to join.

In [None]:
# Get structure of yFinance dataset
df_stock.info()

For the aims and objectives set in the methodology, we will only retain the adjusted closing price from the stock market dataframe.

**Why?**
The adjusted closing price reflects the stock’s final trading price after accounting for dividends, splits, and other corporate actions. It provides the most accurate picture of a stock’s actual performance over time, it is especially good for:
* Time-series analysis
* Log-return calculations
* Correlation with external signals (e.g., SVIs)

In [None]:
# Keep only the 'Close' column
df_stock = df_stock[['Close']].copy()

# Renaming column for clarity when merging later
df_stock.rename(columns={'Close': 'Carrefour_Close'}, inplace=True)

df_stock.head(5)

**Missing Values Analysis**

Check how much missing and flat data there is.

In [None]:
# Check for any columns with missing values
missing = df_stock.isnull().sum()
missing = missing[missing > 0]

if not missing.empty:
    print("Columns with missing values:")
    print(missing)
else:
    print("No missing values found.")


In [None]:
# Check how many zero values exist per column
zeros = (df_stock == 0).sum()
zeros = zeros[zeros > 0]

if not zeros.empty:
    print("Number of zeros in columns:")
    print(zeros)
else:
    print("No columns with zeros values found.")


##### 1. Close price over time

In [None]:
# Plot closing price over time
plt.figure(figsize=(10, 4))
plt.plot(df_stock.index, df_stock['Carrefour_Close'], label='Close Price', color='violet')
plt.title("Weekly Close Price Over Time")
plt.xlabel("Date")
plt.ylabel("Price (€)")
plt.grid(True)
plt.tight_layout()

#save figure as png
plt.savefig('stock_price_close.png', bbox_inches='tight', dpi=300)

plt.show()

##### 2. Rolling mean

In [None]:
# Plot rolling mean and std deviation of stock price
df_stock['rolling_mean'] = df_stock['Carrefour_Close'].rolling(window=4).mean()
df_stock['rolling_std'] = df_stock['Carrefour_Close'].rolling(window=4).std()

df_stock[['Carrefour_Close', 'rolling_mean']].plot(figsize=(10,4), title='Close Price with Rolling Mean (4 weeks)')
plt.grid(True)
plt.tight_layout()

#save figure as png
plt.savefig('stock_close_price_rolling_mean.png', bbox_inches='tight', dpi=300)

plt.show()

# Merging Datasets: SVIs and Price Close Data.

Analysing individual keywords (rather than aggregated categories) can reveal granular signals that might be averaged out or lost in grouped data.

### Merging Datasets: df_merged1

In [None]:
df_merged1 = df_svi.join(df_stock, how='inner')

In [None]:
df_merged1.head()

In [None]:
df_merged1 = df_merged1.drop(columns=[ "rolling_mean", "rolling_std"])

In [None]:
df_merged1.info()

###  Stationarity Tests

Stationarity tests ensure that:

1. The input features (search trends) are stable and predictable over time.
2. The models (e.g., correlation or regression) are not falsely attributing relationships based on trending behaviour or structural changes in data.

To robustly assess stationarity and cross-validate findings, both the Augmented Dickey-Fuller (ADF) and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) tests will be carried out. Assessing simultaneously if the series is stationary (ADF) and is not stationary (KPSS).

In [157]:
# Apply ADF test
def adf_test(series):
    try:
        result = adfuller(series.dropna(), autolag='AIC')
        return result[1]  # p-value
    except:
        return None

In [168]:
# Apply KPSS test
def kpss_test(series):
    try:
        result = kpss(series.dropna(), regression='c', nlags='auto')
        return result[1]  # p-value
    except:
        return None
        

In [169]:
# Create a summary DataFrame
stationarity_summary = pd.DataFrame(columns=["ADF_pvalue", "KPSS_pvalue", "ADF_Stationary", "KPSS_Stationary"])

In [170]:
# Loop through each column in the DataFrame
for col in df_merged1.columns:
    adf_p = adf_test(df_merged1[col])
    kpss_p = kpss_test(df_merged1[col])
    
    stationarity_summary.loc[col] = [
        adf_p,
        kpss_p,
        "Yes" if adf_p is not None and adf_p < 0.05 else "No", #is it stationary? Stationary (p < 0.05) 
        "Yes" if kpss_p is not None and kpss_p > 0.05 else "No" #is it stationary? Stationary (p > 0.05)
    ]

look-up table. The actual p-value is smaller than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')
look-up table. The actual p-value is greater than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')
look-up table. The actual p-value is greater than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')
look-up table. The actual p-value is smaller than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')
look-up table. The actual p-value is smaller than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')
look-up table. The actual p-value is greater than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')
look-up table. The actual p-value is greater than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')
look-up table. The actual p-value is smaller tha

> **NOTE**: KPSS warnings indicate test statistics outside the tabulated p-value range.
> - If the returned p-value = 0.01 → actual p-value < 0.01 → strong evidence against stationarity.
> - If the returned p-value = 0.1  → actual p-value > 0.1  → weak evidence against stationarity.
>
> These bounds are used conservatively to classify series as stationary or not.

In [171]:
# Display results
stationarity_summary

Unnamed: 0,ADF_pvalue,KPSS_pvalue,ADF_Stationary,KPSS_Stationary
carrefour,7.382216e-11,0.066901,Yes,Yes
carrefour autour de moi,1.553901e-13,0.01,Yes,No
carrefour city,6.829166e-14,0.1,Yes,Yes
carrefour express,2.005387e-13,0.1,Yes,Yes
carrefour market,0.139111,0.01,No,No
cora,0.4079767,0.01,No,No
bon d'achat carrefour,1.309592e-05,0.1,Yes,Yes
carrefour fidelite,3.326909e-08,0.1,Yes,Yes
carrefour alimentaire,7.974931e-15,0.01,Yes,No
carrefour bio,7.356709e-08,0.1,Yes,Yes


**Which keywords are stationary?**

`carrefour`, `carrefour city`, `carrefour express`, `bon d'achat carrefour`, `carrefour fidelite`, `carrefour bio`, `carrefour epicerie`, `carrefour produits`, `aloe vera carrefour`, `foire aux vins carrefour`, `franck provost carrefour`, `ongle carrefour`, `parfumerie carrefour`, `super u`, `action carrefour`, `fermeture carrefour`, `rappel produit carrefour`.

**Which keywords are non-stationary?**

`carrefour market`, `cora`, `carrefour credit`, `carrefour anti crise`, `carrefour assurance`, `carrefour banque`, `c_drive`, `c_livraison`, `promo`, `catalogue`, **`Carrefour_Close`**.

**Inconclusive tests**

`carrefour autour de moi`, `carrefour alimentaire`, `pizza carrefour`, `leclerc`, `lidl`, `cora carrefour`, `auchan`.

_Close	1.344312e-01	0.010000	No	No
_Close	1.344312e-01	0.010000	No	No
_Close	1.344312e-01	0.010000	No	No


##### Transformation of non-stationary data

> !!! Before proceeding with the Correlation, Lagged and Time series analysis we need to transform both predictors (SVI keywords) and the outcome variable `Carrefour_Close` to make them stationary; an assumption of the models we will use later. 

### Correlation Analysis

To identify whether changes in consumer search behaviour (via SVI) are associated with Carrefour’s stock price performance. 

In [None]:
# Pearson correlation matrix
pearson = df_merged1.corr(method='pearson')['Carrefour_Close'].drop('Carrefour_Close')
print("Pearson Correlations:")
print(pearson.sort_values(ascending=False))

In [None]:
# Spearman correlation matrix
spearman = df_merged1.corr(method='spearman')['Carrefour_Close'].drop('Carrefour_Close')
print("\nSpearman Correlations:")
print(spearman.sort_values(ascending=False))

**Strongest positive correlations**

These correlations (all > 0.45) indicate real-time or slightly lagged relevance between consumer search behaviours and Carrefour’s market performance.

| Keyword | Pearson | Spearman | Interpretation |
| ------- | ------- | -------- | -------------- |
| `c_drive`              | 0.603   | 0.591    | **Highest correlation** — strong link between online drive-related searches and stock priceThis likely reflects service usage and & operational scale. |
| `carrefour banque`     | 0.513   | 0.522    | Financial arm interest possibly signals **investor awareness or trust*   |
| `catalogue`            | 0.506   | 0.516    | Regularly viewed catalogues may tie to promotional periods that affect revenu    |
| `carrefour anti crise` | 0.473   | 0.495    | Interest in discount initiatives may align with **price-sensitive consumers** and operational adaptat     |
| `promo`                | 0.463   | 0.494    | Promo searches likely peak during events that boost short-term,ales — aligning with financial perfor       |
| `cora`                 | 0.511   | 0.481    | Competitor interest could reflect sector-wide consumer engagement or **comparative brand sh      |



**Moderate to Weak Correlations (Still Informative)**

Keywords like `carrefour market`, `carrefour credit`, and `leclerc` show a Pearson/Spearman ~0.3–0.45 → mild alignment, possibly driven by product lines or regional interests.

General brand terms like `carrefour`, `lidl`, `auchan` fall below 0.3, which suggests background noise or baseline interest with no strong predictive power.

**Near-Zero or Negative Correlations**

| Keyword  | Pearson | Interpretation  |
| ---- | --- | --- |
| `carrefour express`, `epicerie`, `parfumerie   | \~0     | These keywords are **niche, location-specific, or low-volume** — little link to broader financial performanc    |
| `fermeture carrefour`, `autour de moi`, `cora carrefour` | < -0.1  | These likely reflect **negative sentiment**, local store closures, or competitive attrition. Can be **signal of risk**, not growth  |


### Lagged Analysis

>*Does an increase (or decrease) in search volume for specific keywords precede a corresponding movement in Carrefour’s stock price?*

In [None]:
# Dictionary to store correlations for each lag
lagged_corrs = {}

for lag in range (1, 5):
    df_lag = df_merged1.copy()
    keyword_cols = df_svi.columns
    df_lag[keyword_cols] = df_lag[keyword_cols].shift(lag)
    df_lag = df_lag.dropna()
    
    corr_series = df_lag.corr(method='pearson')['Carrefour_Close'].drop('Carrefour_Close')
    lagged_corrs[f'lag_{lag}w'] = corr_series
    
df_svi_corrs = pd.DataFrame(lagged_corrs)

In [None]:
# Print dataframe
print("\nComparison of Lagged Pearson Correlations:")
print(df_svi_corrs.sort_values("lag_1w", ascending=False))

In [None]:
#plotting keywords lagged correlations in a seaborn heatmap for easier analysis.
keywords = df_svi_corrs.max(axis=1).sort_values(ascending=False).index
plt.figure(figsize=(8, 8))
sns.heatmap(df_svi_corrs.loc[keywords], annot=True, cmap="coolwarm", center=0)
plt.title("Keyword Correlations with Carrefour Stock Price (1–4 Week Lags)")
plt.savefig('kw_lagged_corr.png', bbox_inches='tight', dpi=300) #save figure as png
plt.show()


##### **Key Insights**

This granular analysis reveals that not all digital attention is equally valuable for financial forecasting. Keywords tied to Carrefour’s logistics (drive), financial services, and promotions are the most informative, offering promising avenues for early detection of stock price movements in the FMCG sector.

1. Strong predictors of stock price (r > 0.5):
- “carrefour drive” (`c_drive`) consistently exhibited the strongest correlation, peaking at r = 0.59 at a 3-week lag.
- `carrefour banque` and “carrefour credit” showed high correlations (up to r = 0.558 and r = 0.503, respectively), particularly at 3- and 4-week lags. These keywords reflect financial engagement, suggesting potential links between consumer interest in Carrefour’s banking services and investor sentiment.
- `catalogue` and `carrefour anti crise` also demonstrated moderately strong correlations, indicating a connection between promotional interest and stock movement.

2. Lag timing matters:
- For most high-signal keywords, correlations were strongest at lag 3 or 4 weeks, indicating that consumer search behaviour may take up to a month to reflect in market performance.
- This lag effect supports the hypothesis that search trends act as leading indicators rather than coinciding with stock movements.

3. Weak or negative correlations:
- Keywords like `carrefour autour de moi` and `fermeture carrefour` yielded negative or weak correlations, suggesting that local or negative sentiment searches are either noise or inversely related to stock performance.
- Some niche or low-frequency queries (e.g., `parfumerie carrefour`, `bon d’achat carrefour`) showed little to no predictive power.
 the FMCG sector.


# Merging Datasets: Aggregated SVIs and Price Close Data.

### Merging Datasets: df_merged

In [None]:
df_merged = df_svi_agg.join(df_stock, how='inner')

In [None]:
df_merged.head()

In [None]:
df_merged = df_merged.drop(columns=[ "rolling_mean", "rolling_std"])

In [None]:
df_merged.info()

##### Summary Statistics

In [None]:
# Get summary statistics
df_merged.describe().T

###  Stationarity Tests

In [175]:
# Apply ADF test
def adf_test(series):
    try:
        result = adfuller(series.dropna(), autolag='AIC')
        return result[1]  # p-value
    except:
        return None

In [176]:
# Apply KPSS test
def kpss_test(series):
    try:
        result = kpss(series.dropna(), regression='c', nlags='auto')
        return result[1]  # p-value
    except:
        return None
        

In [177]:
# Create a summary DataFrame
stationarity_summary_agg = pd.DataFrame(columns=["ADF_pvalue", "KPSS_pvalue", "ADF_Stationary", "KPSS_Stationary"])

In [178]:
# Loop through each column in the DataFrame
for col in df_merged.columns:
    adf_p = adf_test(df_merged[col])
    kpss_p = kpss_test(df_merged[col])
    
    stationarity_summary_agg.loc[col] = [
        adf_p,
        kpss_p,
        "Yes" if adf_p < 0.05 else "No", #is it stationary? Stationary (p < 0.05) 
        "Yes" if kpss_p > 0.05 else "No" #is it stationary? Stationary (p > 0.05)
    ]

look-up table. The actual p-value is smaller than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')
look-up table. The actual p-value is smaller than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')
look-up table. The actual p-value is smaller than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')
look-up table. The actual p-value is smaller than the p-value returned.

  result = kpss(series.dropna(), regression='c', nlags='auto')


In [179]:
# Display results
stationarity_summary_agg

Unnamed: 0,ADF_pvalue,KPSS_pvalue,ADF_Stationary,KPSS_Stationary
brand,6.218859e-11,0.076574,Yes,Yes
service,0.9690511,0.01,No,No
sub-brand,0.1364937,0.01,No,No
fmcg,3.773914e-11,0.086961,Yes,Yes
competitors,9.775674e-11,0.042233,Yes,No
finance,0.5184308,0.01,No,No
news,0.01032207,0.044464,Yes,No
Carrefour_Close,0.1344312,0.01,No,No


**Which keywords are stationary?**

`brand`, `fmcg`

**Which keywords are non-stationary?**

`service`, `sub-brand`, `finance`, **`Carrefour_Close`**.

**Inconclusive tests**

`competitors`, `news`.

0	No	No
o


##### Transformation of non-stationary data

### Correlation Analysis

To identify whether changes in consumer search behaviour (via SVI grouped by keyword category) are associated with Carrefour’s stock price performance. This supports the hypothesis that digital attention may signal financial outcomes.

* Pearson correlation tests linear relationships between variables.
* Spearman correlation detects monotonic relationships, useful for nonlinear or skewed data (as some of the keywords are highly skewed and migth imact ).
* Lagged correlation helps explore predictive power—i.e., whether search behaviour precedes changes in stock price.

This analysis helps validate whether digital attention is an informational indicator or just noise.

In [None]:
# Pearson correlation matrix
pearson = df_merged.corr(method='pearson')['Carrefour_Close'].drop('Carrefour_Close')
print("Pearson Correlations:")
print(pearson.sort_values(ascending=False))

In [None]:
# Spearman correlation matrix
spearman = df_merged.corr(method='spearman')['Carrefour_Close'].drop('Carrefour_Close')
print("\nSpearman Correlations:")
print(spearman.sort_values(ascending=False))

Both Pearson and Spearman correlation methods yield similar rankings strengthening the reliability of the following findings.

| *Category*        | *Pearson* | *Spearman* | *Interpretation*   |
| :--------------- | :-------: | :-------: | :------------------ |
| `service`| 0.594| 0.583| **Strongest positive relationship**. Weekly search interest in Carrefour’s services (e.g., delivery, click & collect) rises and falls in tandem with stock prices. Suggests strong consumer-business sentiment alignment. |
| `finance`| 0.512| 0.499| **Moderately strong positive correlation**. People searching financial keywords (e.g., Carrefour Banque) may reflect investor attention or consumer trust, both possibly linked to stock valuation.|
| `sub-brand`|0.454|0.464| Indicates that consumer interest in Carrefour’s sub-brands (e.g., Bio, Market) correlates moderately with stock prices. |
| `competitors`| 0.279| 0.323| **Mildly positive relationship**. Interest in competitors might indirectly reflect market positioning or comparative shopping,are somewhat aligned with Carrefour’s own stock movement. |
|`fmcg`|0.200|0.237| **Weak positive correlation**. General FMCG terms reflect baseline consumer interest, but less tightly linked to Carrefour’s specific performance.|
|`brand`|0.187|0.287| **Weaker than expected**. Brand searches alone may not drive stock price unless paired with promotions, campaigns, or sentiment.|
| `news` | -0.175  | -0.169   | **Slightly negative correlation**. News searches may reflect external crises, scandals, or non-operational events, possibly triggering investor caution rather than interest.   |


***Implications:***
* Consumer interest in services and finance shows the strongest relationship with Carrefour’s stock price, suggesting these are high-signal categories for digital attention analytics.
* Brand-level or FMCG searches are less directly tied to financial performance, possibly because they reflect shopping behaviour, not sentiment or market interest.
* Negative news correlation is typical; news spikes often mean crises, not positive business sentiment.

### Lagged Analysis

>*Does an increase (or decrease) in search volume for specific keyword categories precede a corresponding movement in Carrefour’s stock price?*

In [None]:
keyword_cols = ['brand', 'service', 'sub-brand', 'fmcg', 'competitors', 'finance', 'news']
lags = [1, 2, 3, 4]  # 4 weeks lag

# Dictionary to store correlations for each lag
lagged_correlations = {}

for lag in lags:
    df_lag = df_merged.copy()
    df_lag[keyword_cols] = df_lag[keyword_cols].shift(lag)
    df_lag = df_lag.dropna()
    
    corr = df_lag.corr(method='pearson')['Carrefour_Close'].drop('Carrefour_Close')
    lagged_correlations[f'Lag_{lag}w'] = corr
    
df_agg_corrs = pd.DataFrame(lagged_correlations)

In [None]:
# Convert to DataFrame for comparison and print dataframe
comparison_df = pd.DataFrame(lagged_correlations)
print("\nComparison of Lagged Pearson Correlations:")
print(comparison_df.sort_values('Lag_1w', ascending=False))

In [None]:
#plotting keywords lagged correlations in a seaborn heatmap for easier analysis.
agg_keywords = df_agg_corrs.max(axis=1).sort_values(ascending=False).index
plt.figure(figsize=(4, 3))
sns.heatmap(df_agg_corrs.loc[agg_keywords], annot=True, cmap="coolwarm", center=0)
plt.title("Aggregated Correlations with Carrefour Stock Price (1–4 Week Lags)")
plt.savefig('agg_lagged_corr.png', bbox_inches='tight', dpi=300) #save figure as png
plt.show()

Instead of trying only 1 1-week lag, we tested a 4-week lag. This allows us to test behavioural lag hypotheses and account for stock market delay.

1. Behavioural Lag Hypothesis

> Consumers search → consider options → purchase → company revenue impact → market reacts.

This chain can take 1–3 weeks, especially for FMCG, where buying cycles are short but not instant.

2. Stock Market Delay

Search volume might only impact stock performance when it gets amplified through news or earnings guidance. As such, investors may not immediately react to changes in consumer sentiment, especially if signals are subtle or not widely reported. 

*Note: Each lag shortens the dataset. As our dataset spans ~176 weeks, using lags of 5 or more would result in too few observations.* 

##### **Key Insights (Agg. Keyword Categories)**

While aggregated keyword categories provide a useful thematic overview, their predictive strength is generally lower than that of specific, high-signal keywords. Nevertheless, categories such as service and finance stand out as valuable early signals of Carrefour’s market performance.

1. `service` and `finance` categories lead in predictive strength:
- The `service` category consistently exhibited the highest correlation with Carrefour’s closing price across all lags, peaking at r = 0.581 at lag 1 and remaining stable up to lag 4. This category includes search terms related to delivery, online shopping, and store services, indicating that increased consumer engagement with Carrefour's service offerings is a strong signal of upcoming stock performance.
- `finance` also showed a robust positive relationship, with a maximum correlation of r = 0.514 at lag 1–2. This suggests consumer search interest in Carrefour’s financial services (e.g., banking, credit) may reflect broader market sentiment or investor attention.

2. `sub-brand` and `competitor` categories are moderately predictive:
- The `sub-brand` group (e.g., Carrefour Bio) peaked at r = 0.460 at lag 1–2, implying that interest in Carrefour’s subsidiary or niche product lines can be indicative of broader company performance.
- `competitors` showed moderate correlations (r ≈ 0.28), suggesting some potential substitution or benchmarking behaviour by consumers, though weaker in predictive strength.

3. Limited predictive value from general `brand` searches and `news`:
- The `brand` category, although intuitively important, yielded weaker correlations (r < 0.20), possibly due to consistently high baseline attention that does not vary with market-relevant events.
- Interestingly, the `news` category showed a negative correlation (down to r = -0.18), which may indicate that spikes in news-related search activity — potentially tied to controversies, recalls, or crises — precede negative shifts in stock price.

4. Temporal Dynamics of Attention:
- Correlation strengths were highest at lag 1 and lag 2, with mild tapering by lag 3 and 4. This suggests that aggregated consumer search interest impacts stock performance with a short delay of 1–2 weeks, reinforcing the premise that Google Trends data can serve as a leading indicator in financial forecasting for the FMCG sector.
