<a href="https://colab.research.google.com/github/kzumreen/FoodTrendsPrediction/blob/main/pytrends_eda_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTrends EDA Template

This notebook pulls Google Trends data via **PyTrends** (falling back to synthetic data when PyTrends or the network is unavailable), runs exploratory data analysis (descriptive statistics and visualizations), and saves the figures and a simple HTML report to `pytrends_eda_outputs/`. It is designed to satisfy the EDA rubric for your DAT 490 project.

## Setup

Install required packages (run in terminal):

```bash
pip install pytrends pandas matplotlib seaborn plotly nbformat
```

If pytrends or network access is unavailable, the notebook falls back to a synthetic dataset for demonstration.

In [None]:
# Optional: install packages from the notebook
# !pip install pytrends pandas matplotlib seaborn plotly nbformat


In [None]:
import os
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# All figures and the HTML report are written to this folder
OUTPUT_DIR = 'pytrends_eda_outputs'
os.makedirs(OUTPUT_DIR, exist_ok=True)
sns.set(style='whitegrid')


In [None]:
def synthetic_trends(keywords, start='2021-01-01', end='2025-10-01', seed=42):
    """Build a fake weekly 0-100 interest series per keyword (offline fallback)."""
    dates = pd.date_range(start=start, end=end, freq='W-SUN')
    rng = np.random.default_rng(seed)  # local generator; leaves NumPy's global state alone
    n = len(dates)
    data = pd.DataFrame(index=dates)
    for kw in keywords:
        # Low-level background noise
        base = rng.poisson(lam=5, size=n).astype(float)
        # Three random spikes, each a linear ramp decaying from `height` to 0
        spikes = np.zeros(n)
        for _ in range(3):
            loc = rng.integers(0, n)
            width = rng.integers(1, 6)
            height = rng.integers(30, 90)
            start_i = max(0, loc - width)
            end_i = min(n, loc + width)
            spikes[start_i:end_i] += np.linspace(height, 0, end_i - start_i)
        series = (base + spikes).clip(0, 100)
        # Rescale so each keyword peaks at 100
        series = (series / series.max() * 100) if series.max() > 0 else series
        data[kw] = series.round(1)
    return data

def descriptive_stats(df):
    stats = df.describe().T
    stats['skew'] = df.skew()
    stats['kurtosis'] = df.kurtosis()
    return stats

def save_fig(fig, name):
    out = os.path.join(OUTPUT_DIR, name)
    fig.savefig(out, dpi=150, bbox_inches='tight')
    print('Saved', out)


## Fetch Google Trends Data
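Edit `keywords`, `timeframe`, `geo`, and `gprop` below. Google Trends compares at most five keywords per request. If PyTrends cannot be imported, the request fails, or no data comes back, the notebook uses synthetic data instead.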

In [None]:
keywords = ['baked feta pasta', 'matcha', 'mushroom coffee']
timeframe = '2021-01-01 2025-10-01'
geo = 'US'
gprop = ''

def fetch_google_trends(keywords, timeframe='2021-01-01 2025-10-01', geo='US', gprop=''):
    try:
        from pytrends.request import TrendReq
    except Exception as e:
        print('pytrends import failed:', e)
        return None
    try:
        pytrends = TrendReq(hl='en-US', tz=360)
        pytrends.build_payload(kw_list=keywords, timeframe=timeframe, geo=geo, gprop=gprop)
        data = pytrends.interest_over_time()
        if data is None or data.empty:
            return None
        if 'isPartial' in data.columns:
            data = data.drop(columns=['isPartial'])
        return data
    except Exception as e:
        print('pytrends fetch error:', e)
        return None

data = fetch_google_trends(keywords, timeframe=timeframe, geo=geo, gprop=gprop)
source = 'pytrends' if data is not None else 'synthetic'
if data is None:
    data = synthetic_trends(keywords, start=timeframe.split()[0], end=timeframe.split()[1])
print('Data source:', source)
data.head()
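
Requests to Google Trends can fail transiently (for example when they are rate-limited), and `fetch_google_trends` above returns `None` on any failure. The following is a minimal sketch of retrying a few times before falling back to synthetic data. `fetch_with_retry`, `attempts`, and `base_delay` are hypothetical names introduced here for illustration; they are not part of PyTrends.

```python
import time

def fetch_with_retry(fetch_fn, attempts=3, base_delay=2.0):
    """Call fetch_fn() until it returns a non-None result or attempts run out.

    Hypothetical helper: fetch_fn is any zero-argument callable that returns
    None on failure, as fetch_google_trends does in this notebook.
    """
    for attempt in range(attempts):
        result = fetch_fn()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** attempt)
    return None

# Possible usage in this notebook (assumes the cell above has been run):
# data = fetch_with_retry(
#     lambda: fetch_google_trends(keywords, timeframe=timeframe, geo=geo, gprop=gprop)
# )
```

If every attempt returns `None`, the existing synthetic fallback can take over unchanged.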


## Descriptive Statistics

In [None]:
stats = descriptive_stats(data)
stats
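
The `skew` and `kurtosis` columns help flag spike-driven keywords: a series that is flat apart from a short burst tends to show strong positive skew, a mean well above the median, and high excess kurtosis (pandas reports excess kurtosis, where a normal distribution scores 0). The sketch below illustrates this with made-up values; the column names `steady` and `viral` are purely illustrative.

```python
import pandas as pd

# Made-up weekly values: 'steady' hovers around 50, 'viral' is flat with one burst.
toy = pd.DataFrame({
    'steady': [48, 52, 50, 49, 51, 50, 47, 53, 50, 50],
    'viral':  [5, 4, 6, 5, 100, 60, 8, 5, 4, 6],
})

# Same recipe as descriptive_stats(): describe() plus skew and kurtosis.
summary = toy.describe().T
summary['skew'] = toy.skew()
summary['kurtosis'] = toy.kurtosis()  # excess kurtosis: 0 for a normal distribution

print(summary[['mean', '50%', 'max', 'skew', 'kurtosis']].round(2))
```

For the `viral` column the mean (20.3) sits far above the median (5.5), which is the same pattern to look for in the real keywords.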


## Time-series Plot

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
data.plot(ax=ax)
ax.set_title('Search Interest Over Time (0-100 Index)')
ax.set_ylabel('Relative Search Interest')
ax.set_xlabel('Date')
ax.legend(title='Keyword')
save_fig(fig, 'fig_time_series.png')
plt.show()


## Rolling Mean (8-week)

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
data.rolling(8).mean().plot(ax=ax)
ax.set_title('8-week Rolling Mean of Search Interest')
ax.set_ylabel('Rolling Mean (0-100)')
ax.set_xlabel('Date')
save_fig(fig, 'fig_rolling_8.png')
plt.show()
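
Note that `rolling(8).mean()` needs a full window, so the first seven weeks of each smoothed line are `NaN`, and a trailing window lags behind sharp peaks. The sketch below compares the default with two standard pandas options, `min_periods` and `center`, on a toy series; the series itself is made up.

```python
import numpy as np
import pandas as pd

# Toy weekly series: values 0, 1, ..., 11 on consecutive Sundays.
idx = pd.date_range('2021-01-03', periods=12, freq='W-SUN')
s = pd.Series(np.arange(12, dtype=float), index=idx, name='toy keyword')

trailing = s.rolling(8).mean()                # NaN until 8 observations exist
partial = s.rolling(8, min_periods=1).mean()  # averages whatever is available so far
centered = s.rolling(8, center=True).mean()   # window centered on each date (less lag)

print('NaNs with default window:', trailing.isna().sum())
print('NaNs with min_periods=1:', partial.isna().sum())
```

Which variant is appropriate depends on the goal: the trailing window uses only past data, while the centered one aligns better with the timing of peaks.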


## Correlation Heatmap

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
sns.heatmap(data.corr(), annot=True, cmap='vlag', center=0, ax=ax)
ax.set_title('Correlation Between Keywords (Search Interest Index)')
save_fig(fig, 'fig_correlation.png')
plt.show()
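
Pearson correlation on the raw index levels can be dominated by a shared long-run trend or by a few large spikes. As a rough cross-check, one option is to also look at Spearman rank correlation and at the correlation of week-over-week changes. The sketch below uses made-up data in which two keywords share a trend but have independent noise; `keyword_a` and `keyword_b` are placeholder names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weeks = pd.date_range('2021-01-03', periods=100, freq='W-SUN')
trend = np.linspace(10, 60, len(weeks))  # shared upward trend

toy = pd.DataFrame({
    'keyword_a': trend + rng.normal(0, 3, len(weeks)),
    'keyword_b': trend + rng.normal(0, 3, len(weeks)),
}, index=weeks)

pearson_levels = toy.corr()                    # same call as the heatmap above
spearman_levels = toy.corr(method='spearman')  # rank-based, less sensitive to spikes
pearson_changes = toy.diff().dropna().corr()   # week-over-week changes, trend removed

print('levels :', round(pearson_levels.loc['keyword_a', 'keyword_b'], 2))
print('changes:', round(pearson_changes.loc['keyword_a', 'keyword_b'], 2))
```

In this toy case the levels are highly correlated only because of the common trend, while the weekly changes are close to uncorrelated. Applying the same three calls to `data` shows whether the heatmap reflects co-movement or just shared drift.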


## Related Queries (Optional)
If PyTrends is available, run this cell to fetch related queries for each keyword.

In [None]:
try:
    from pytrends.request import TrendReq
    pytrends = TrendReq(hl='en-US', tz=360)
    pytrends.build_payload(keywords, timeframe=timeframe, geo=geo, gprop=gprop)
    related = pytrends.related_queries()
    for k in keywords:
        print('\n==', k, '==')
        top = related.get(k, {}).get('top')
        rising = related.get(k, {}).get('rising')
        print('Top queries:')
        display(top.head() if top is not None else 'None')
        print('Rising queries:')
        display(rising.head() if rising is not None else 'None')
except Exception as e:
    print('Skipping related queries (pytrends unavailable or request failed):', e)


## Save HTML Report

In [None]:
html = """<html><head><meta charset='utf-8'><title>PyTrends EDA Report</title></head><body>
<h1>PyTrends EDA Report</h1>
<p>Generated: %s</p>
<h2>Keywords</h2><ul>%s</ul>
<h2>Notes</h2><p>Data source: %s. Outputs saved in %s.</p>
<h2>Figures</h2>
<img src='fig_time_series.png' style='max-width:900px;width:100%%;'/>
<img src='fig_rolling_8.png' style='max-width:900px;width:100%%;'/>
<img src='fig_correlation.png' style='max-width:900px;width:100%%;'/>
</body></html>""" % (datetime.now().isoformat(), ''.join([f"<li>{k}</li>" for k in keywords]), source, OUTPUT_DIR)

with open(os.path.join(OUTPUT_DIR, 'eda_report.html'), 'w', encoding='utf-8') as f:
    f.write(html)
print('Saved simple HTML report to', os.path.join(OUTPUT_DIR, 'eda_report.html'))


## Next steps & Tips

- Overlay timestamps of viral posts to explain spikes.
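- Combine with YouTube/Reddit post timestamps to check whether spikes follow specific posts; matching timing suggests a link but does not by itself establish causation.
- Check the data frequency (Google Trends returns daily, weekly, or monthly points depending on the timeframe length), resample to a common frequency, and align time zones before merging with other sources.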
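
The tips above can be sketched with pandas alone. The example below maps hypothetical event timestamps (given in UTC) onto the weekly bins of a toy series and then resamples that series to monthly means. The event times and values are made up, and the code assumes each index date marks the start (Sunday) of its week and that US Central time is the relevant local zone, matching `tz=360` used earlier.

```python
import pandas as pd

# Toy weekly series indexed by week-start Sundays.
weeks = pd.date_range('2021-01-03', periods=8, freq='W-SUN')
interest = pd.Series([5, 6, 5, 40, 100, 55, 20, 9], index=weeks, name='toy keyword')

# Hypothetical event timestamps (e.g. upload times of viral posts), in UTC.
events = pd.to_datetime(['2021-01-31 03:00', '2021-02-10 02:15'], utc=True)

# Convert to local time, then drop the timezone to match the naive weekly index.
local = events.tz_convert('America/Chicago').tz_localize(None)

# A Sunday-to-Saturday weekly period ('W-SAT') starts on the Sunday of each event's week.
event_weeks = local.to_period('W-SAT').start_time
print(interest.reindex(event_weeks))

# Resample weekly values to monthly means for comparison with monthly sources.
monthly = interest.resample('MS').mean()
print(monthly)
```

The first event falls on January 31 in UTC but on the evening of January 30 in US Central time, so it belongs to the week starting January 24 rather than the following one. On the time-series plot, the same event dates could be drawn with `ax.axvline(...)`.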