# 📰 News Article Exploratory Data Analysis (EDA)
*Dataset: ~52,000 Company News Articles (52,052 records loaded in the recorded run)*

This notebook provides a comprehensive, interactive EDA of a large news article dataset using Plotly and pandas.  
It covers time trends, topic tags, co-occurrence, velocity, and basic text statistics.

**Interactivity:** All visualizations use Plotly—hover, zoom, and export are available.


### 📂 Data Access

To run this notebook:
1. Download the dataset [from this Google Drive link](<insert your shareable link here>).
2. Upload it to your own Drive, or use the public link if Colab allows.
3. Make sure `DATA_PATH` points to the correct file.

*File size: ~180MB. If you hit memory or download limits, consider analyzing a smaller sample!*
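
If the full file does not fit in memory, one option is to load only the first rows of the JSON Lines file. The following is a minimal sketch, assuming the file is line-delimited JSON (the notebook's loader reads it with `lines=True`). The helper name `load_sample` and the default of 5,000 rows are illustrative and not part of the notebook.

```python
import pandas as pd


def load_sample(path, n_rows=5000):
    """Read only the first n_rows records of a JSON Lines file (hypothetical helper)."""
    # nrows is only accepted by pandas when lines=True
    return pd.read_json(path, lines=True, nrows=n_rows)
```

Note that this takes the first rows in file order, not a random sample. In the preview shown later in the notebook, the first rows are the most recent articles, so a head-of-file sample would be biased toward recent dates.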


# 📰 News Article EDA Notebook — Instructions

## How to Use This Notebook

**1. Mount Google Drive**
- Run the first cell.  
  It mounts your Google Drive so the data file can be read from it, or downloaded to it if missing.

**2. Automatic Data Download**
- The notebook checks for the data file at `/content/drive/My Drive/data/aa_news_data.json`.
- If the file is not found, the notebook downloads it with `gdown` and saves it to that path (no manual upload needed).

**3. Install Dependencies**
- `gdown` and `plotly` are installed automatically.  
- No need for manual package installation.

**4. Run All Cells**
- Execute all code cells from top to bottom.
- Do not skip setup or data cells.

**5. Interact with the Analysis**
- All major visualizations use Plotly — hover for details, zoom, pan, export as PNG/SVG, etc.
- Each plot or function can be customized by editing the parameters in the corresponding cell, such as:
  - `top_n` (number of tags to display)
  - `freq` (frequency: "D" for day, "W" for week, "M" for month)
  - `tag` (for tag-specific analysis)
  - `bins` (for histogram granularity)
  - `selected_date` (for date-centered plots)
  - etc.

**6. Typical Analysis Workflow**
- Inspect the DataFrame: `.view_df()`, `.info()`, `.shape()`, `.missing_report()`
- Tag Distribution: `.plot_tag_counts()`
- Article Trends Over Time: `.plot_trend()`
- Frequency Around Dates: `.articles_per_month_around_date()`
- Tag Coverage Over Time: `.plot_tag_coverage_over_time()`
- Tag Co-occurrence: `.tag_cooccurrence_matrix()`
- Topic Shifts: `.get_tag_month_matrix()` + `.plot_tag_temporal_shifts()`
- Topic Emergence/Decay: `.topic_emergence_decay()` + `.plot_topic_emergence_decay()`
- Article Velocity: `.plot_article_velocity_agg()`
- Event Lifespan: `.event_coverage_lifespan()` + `.plot_event_lifespan()`
- Article Length/Stats: `TextAnalyzer.length_hist()`, `TextAnalyzer.most_common_words()`

---

## Troubleshooting

- **File Not Found**
  - Ensure Drive is mounted and the file path is correct (`/content/drive/My Drive/data/aa_news_data.json`)
- **Colab Out-of-Memory**
  - Data file is ~180MB; for memory issues, use a sample or smaller subset.
- **Plots Not Displayed**
  - Rerun the cell. Plotly sometimes needs a fresh run to show the chart.

## Need More Help?
- Ask the project owner for support, or refer to comments in the notebook for parameter hints.

---


In [28]:
from google.colab import drive
import os

# 1. Mount Google Drive
print("Step 1: Mounting Google Drive...")
try:
    drive.mount('/content/drive', force_remount=True)
    print("✅ Google Drive mounted successfully.")
except Exception as e:
    print(f"❌ Error mounting Google Drive: {e}")
    raise SystemExit("Stopping execution.")

# 2. Download data file if not already present
!pip install -q gdown

import gdown

output_dir = '/content/drive/My Drive/data'
os.makedirs(output_dir, exist_ok=True)
output_path = f"{output_dir}/aa_news_data.json"

if not os.path.exists(output_path):
    print("Downloading data file directly to your Drive...")
    gdown.download(
        'https://drive.google.com/uc?id=1fZJcjli2sMqScbcKG2Xhud5UbKbycuva',
        output_path,
        quiet=False
    )
    print("✅ Data file downloaded and saved to your Drive.")
else:
    print("Data file already exists in your Drive.")


Step 1: Mounting Google Drive...
Mounted at /content/drive
✅ Google Drive mounted successfully.
Data file already exists in your Drive.


In [29]:

!pip install plotly

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

from collections import Counter
from itertools import combinations


class Analyzer:
    def __init__(self, source_file):
        self.source_file = source_file
        self.df = None
        self.tags = set()
        self._load_data()

    # DF Analysis
    def _load_data(self):
        """Load the JSONL metadata file into a DataFrame and parse dates."""
        try:
            self.df = pd.read_json(self.source_file, lines=True)
            # Parse "dd.mm.yyyy" strings once; unparseable values become NaT.
            self.df["date"] = pd.to_datetime(
                self.df["CreateDateString"], format="%d.%m.%Y", errors="coerce"
            )

            print(f"✅ Loaded {len(self.df)} records from {self.source_file}")
        except Exception as e:
            print(f"❌ Failed to load data: {e}")
            self.df = pd.DataFrame()  # Empty fallback

    def view_df(self, n=5):
        """Print the first n rows of the DataFrame."""
        print(self.df.head(n))

    def info(self):
        """Print summary info (columns, types, non-nulls, etc)."""
        self.df.info()  # info() prints directly and returns None

    def shape(self):
        """Print DataFrame shape (rows, columns)."""
        print(self.df.shape)

    def show_columns(self):
        """Print the columns in the DataFrame."""
        print(self.df.columns.tolist())

    def preview_random(self, n=5):
        """Show a random sample of rows."""
        print(self.df.sample(n))

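    def describe_dates(self, date_col="date"):
        """Print the min/max date (if available).

        Use the parsed datetime column: min/max on the raw "dd.mm.yyyy"
        strings would compare them alphabetically, not chronologically.
        """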
        if date_col in self.df.columns:
            print("Earliest date:", self.df[date_col].min())
            print("Latest date:", self.df[date_col].max())
        else:
            print(f"No '{date_col}' column found.")

    def missing_report(self):
        """Show count of missing values per column."""
        print(self.df.isnull().sum())

    def display_col(self, col):
        for value in self.df[col]:
            print(value)

    # Tag Analysis
    def get_tag_counts(self, col="Tags"):
        tag_counter = Counter()
        for tag_list in self.df[col]:
            if isinstance(tag_list, list):
                # Count non-empty tags as-is (list columns are assumed to be pre-normalized)
                tag_counter.update(tag for tag in tag_list if tag)
            elif isinstance(tag_list, str):
                # Assume tags in string are comma-separated, or just one tag
                tags = [
                    tag.strip().lower() for tag in tag_list.split(",") if tag.strip()
                ]
                tag_counter.update(tags)
            # Else: skip (could be None, float, etc.)
        return tag_counter

    def plot_tag_counts(self, tags_col="Tags", top_n=20):
        tag_counter = self.get_tag_counts(tags_col)
        if not tag_counter:
            print("No tags found.")
            return
        common_tags = tag_counter.most_common(top_n)
        tags, counts = zip(*common_tags)
        fig = px.bar(
            x=list(counts),
            y=list(tags),
            orientation="h",
            labels={"x": "Count", "y": "Tag"},
            color=list(counts),
            color_continuous_scale="teal",
            title=f"Top {top_n} Tags",
        )
        fig.update_layout(
            yaxis=dict(categoryorder="total ascending"),
            height=max(400, int(top_n * 24)),
        )
        # fig.show()
        return fig

    # Date Analysis
    def articles_per_day(self, print_bool=False):
        counts = self.df["date"].value_counts().sort_index()
        # print(counts)
        if print_bool:
            print(counts)
        return counts

    def articles_per_week(self):
        # Group by year-week (weekly periods keep the year, unlike the ISO week number alone)
        weekly = self.df.groupby(self.df["date"].dt.to_period("W"))["Id"].count()
        print(weekly)
        return weekly

    def articles_per_month(self):
        monthly = self.df.groupby(self.df["date"].dt.to_period("M"))["Id"].count()
        print(monthly)
        return monthly

    def longest_shortest_day(self):
        counts = self.articles_per_day()
        max_count = counts.max()
        min_count = counts.min()
        busiest = counts[counts == max_count]
        slowest = counts[counts == min_count]
        busiest_single = busiest.index[0]
        slowest_single = slowest.index[0]
        print(f"Most articles: {max_count} on {busiest_single}")
        print(f"Fewest articles: {min_count} on {slowest_single}")
        return busiest, slowest

    def articles_per_month_around_date(
        self, selected_date, months_window=3, date_col="date"
    ):
        if not isinstance(selected_date, pd.Timestamp):
            selected_date = pd.to_datetime(selected_date)
        start = selected_date - pd.DateOffset(months=months_window)
        end = selected_date + pd.DateOffset(months=months_window)
        mask = (self.df[date_col] >= start) & (self.df[date_col] <= end)
        df_window = self.df[mask]
        monthly = df_window.groupby(df_window[date_col].dt.to_period("M"))["Id"].count()
        monthly = monthly.reset_index()
        monthly[date_col] = monthly[date_col].astype(str)

        fig = px.bar(
            monthly,
            x=date_col,
            y="Id",
            color=date_col,
            color_discrete_sequence=px.colors.sequential.Teal,
            labels={"Id": "Article Count", date_col: "Month"},
            title=f"Articles per Month Around {selected_date.strftime('%Y-%m-%d')}",
        )
        fig.update_layout(xaxis_tickangle=-45)
        # fig.show()
        return fig
        # return monthly

    def plot_trend(self, freq="D"):
        if freq == "D":
            counts = self.articles_per_day()
        elif freq == "W":
            counts = self.articles_per_week()
        else:
            counts = self.articles_per_month()
        df = counts.reset_index()
        df.columns = ["Date", "Article_Count"]
        if isinstance(df["Date"].dtype, pd.PeriodDtype):
            df["Date"] = df["Date"].astype(str)

        fig = px.line(
            df,
            x="Date",
            y="Article_Count",
            markers=True,
            line_shape="linear",
            title=f"Article Frequency Trend ({freq})",
        )

        # Highlight peak
        peak_idx = df["Article_Count"].idxmax()
        fig.add_scatter(
            x=[df.loc[peak_idx, "Date"]],
            y=[df.loc[peak_idx, "Article_Count"]],
            mode="markers+text",
            marker=dict(color="seagreen", size=14),
            text=[f"Peak: {df.loc[peak_idx, 'Article_Count']}"],
            textposition="top center",
            showlegend=False,
        )
        fig.update_layout(xaxis_tickangle=-45)
        # fig.show()
        return fig

    def plot_tag_coverage_over_time(
        self, tag, date_col="date", tags_col="Tags", top_n_months=None, color="#009688"
    ):
        tag_lower = tag.lower()
        mask = self.df[tags_col].apply(
            lambda tags: tag_lower in [str(t).lower() for t in tags]
            if isinstance(tags, list)
            else tag_lower in str(tags).lower()
        )
        tag_df = self.df[mask]
        if tag_df.empty:
            print(f'No articles found with tag "{tag}"')
            return
        monthly_counts = tag_df.groupby(tag_df[date_col].dt.to_period("M"))[
            "Id"
        ].count()
        if top_n_months:
            monthly_counts = (
                monthly_counts.sort_values(ascending=False)
                .head(top_n_months)
                .sort_index()
            )
        df_plot = monthly_counts.reset_index()
        df_plot.columns = ["Month", "Article_Count"]
        df_plot["Month"] = df_plot["Month"].astype(str)

        fig = px.bar(
            df_plot,
            x="Month",
            y="Article_Count",
            color="Article_Count",
            color_continuous_scale="teal",
            title=f'Coverage of "{tag}" Over Time',
        )
        # Highlight peak
        peak_idx = df_plot["Article_Count"].idxmax()
        fig.add_scatter(
            x=[df_plot.loc[peak_idx, "Month"]],
            y=[df_plot.loc[peak_idx, "Article_Count"]],
            mode="markers+text",
            marker=dict(color="seagreen", size=16, symbol="diamond"),
            text=[f"Peak: {df_plot.loc[peak_idx, 'Article_Count']}"],
            textposition="top center",
            showlegend=False,
        )
        fig.update_layout(xaxis_tickangle=-45)
        # fig.show()
        return fig
        # return monthly_counts

    def tag_cooccurrence_matrix(self, tag_col="Tags", top_n=20, plot_heatmap=True):
        co_counter = Counter()
        tag_freq = Counter()
        for taglist in self.df[tag_col].dropna():
            if not isinstance(taglist, list):
                continue  # set() on a string would split it into characters
            unique_tags = list(set(taglist))
            tag_freq.update(unique_tags)
            for tag_pair in combinations(sorted(unique_tags), 2):
                co_counter[tag_pair] += 1

        top_tags = [tag for tag, _ in tag_freq.most_common(top_n)]
        matrix = pd.DataFrame(0, index=top_tags, columns=top_tags, dtype=int)
        for (tag1, tag2), count in co_counter.items():
            if tag1 in top_tags and tag2 in top_tags:
                matrix.loc[tag1, tag2] = count
                matrix.loc[tag2, tag1] = count

        if plot_heatmap:
            fig = go.Figure(
                data=go.Heatmap(
                    z=matrix.values,
                    x=matrix.columns,
                    y=matrix.index,
                    colorscale="Viridis",
                    colorbar={"title": "Co-occurrence"},
                )
            )
            fig.update_layout(
                title=f"Tag Co-occurrence Heatmap (Top {top_n} Tags)",
                xaxis_title="Tag",
                yaxis_title="Tag",
                height=80 + 36 * top_n,
            )
            # fig.show()
            return fig

        return matrix

    def get_tag_month_matrix(self, tag_col="Tags", date_col="date", top_n=10):
        """
        Returns a DataFrame: rows=month, cols=top N tags, values=article counts.
        """
        # Flatten: for each article and tag, record (month, tag)
        rows = []
        for _, row in self.df.iterrows():
            tags = row[tag_col]
            if not isinstance(tags, list) or pd.isna(row[date_col]):
                continue  # skip rows without a tag list or a valid date
            month = pd.to_datetime(row[date_col]).to_period("M")
            for tag in set(tags):  # dedupe just in case
                rows.append({"month": month, "tag": tag})

        tag_month_df = pd.DataFrame(rows)
        # Count occurrences per (month, tag)
        tag_month_counts = (
            tag_month_df.groupby(["month", "tag"]).size().reset_index(name="count")
        )
        # Get top N tags overall
        top_tags = (
            tag_month_counts.groupby("tag")["count"]
            .sum()
            .sort_values(ascending=False)
            .head(top_n)
            .index.tolist()
        )
        tag_month_counts = tag_month_counts[tag_month_counts["tag"].isin(top_tags)]
        # Pivot to month × tag table
        matrix = (
            tag_month_counts.pivot(index="month", columns="tag", values="count")
            .fillna(0)
            .astype(int)
        )
        # Sort by time
        matrix = matrix.sort_index()
        return matrix

    @staticmethod
    def plot_tag_temporal_shifts(matrix):
        # matrix: index=time, columns=tags, values=counts
        if isinstance(matrix.index, pd.PeriodIndex):
            matrix = matrix.copy()  # avoid mutating the caller's DataFrame
            matrix.index = matrix.index.astype(str)

        df_long = matrix.reset_index().melt(
            id_vars=matrix.index.name or "index", var_name="Tag", value_name="Count"
        )
        time_col = matrix.index.name or "index"

        fig = px.area(
            df_long,
            x=time_col,
            y="Count",
            color="Tag",
            title="Temporal Topic Shifts (Top Tags)",
            labels={time_col: "Month", "Count": "Article Count"},
        )
        fig.update_layout(xaxis_tickangle=-45)
        # fig.show()
        return fig

    def topic_emergence_decay(
        self, tag_col="Tags", date_col="date", freq="M", min_window_count=3
    ):
        """
        Identify emerging and disappearing tags per time window (e.g., month).

        Args:
            tag_col: name of the column with tag lists
            date_col: date column (must be datetime)
            freq: window size ("M" for month, "W" for week, etc.)
            min_window_count: only consider tags appearing at least this many times in a window

        Returns:
            emergence_df: DataFrame with window, emergent_tags, decayed_tags
        """
        # 1. Assign window
        df = self.df.copy()
        df["window"] = pd.to_datetime(df[date_col]).dt.to_period(freq)

        # 2. Get tags per window
        window_tags = {}
        for window, group in df.groupby("window"):
            tags = []
            for taglist in group[tag_col]:
                if isinstance(taglist, list):
                    tags += taglist
            tag_counts = pd.Series(tags).value_counts()
            tags_set = set(tag_counts[tag_counts >= min_window_count].index)
            window_tags[window] = tags_set

        # 3. Compare window to previous/next
        windows = sorted(window_tags)
        results = []
        for i, win in enumerate(windows):
            current_tags = window_tags[win]
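            # Boundary windows compare against an empty set, so every tag in the
            # first window counts as emergent and every tag in the last as decayed.
            prev_tags = window_tags[windows[i - 1]] if i > 0 else set()
            next_tags = window_tags[windows[i + 1]] if i < len(windows) - 1 else set()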
            emergent = current_tags - prev_tags
            decayed = current_tags - next_tags  # Tags present now, gone next window
            results.append(
                {
                    "window": win,
                    "emergent_tags": sorted(list(emergent)),
                    "decayed_tags": sorted(list(decayed)),
                    "n_emergent": len(emergent),
                    "n_decayed": len(decayed),
                }
            )
        emergence_df = pd.DataFrame(results)
        return emergence_df

    @staticmethod
    def plot_topic_emergence_decay(emergence_df, window_col="window"):
        fig = go.Figure()
        fig.add_trace(
            go.Scatter(
                x=emergence_df[window_col].astype(str),
                y=emergence_df["n_emergent"],
                mode="lines+markers",
                name="Emergent tags",
            )
        )
        fig.add_trace(
            go.Scatter(
                x=emergence_df[window_col].astype(str),
                y=emergence_df["n_decayed"],
                mode="lines+markers",
                name="Decayed tags",
            )
        )
        fig.update_layout(
            title="Topic Emergence and Decay Over Time",
            xaxis_title="Time Window",
            yaxis_title="Number of Tags",
            legend_title_text=None,
            xaxis_tickangle=-45,
        )
        # fig.show()

        # Optional: print/annotate top emergent/decayed tags for recent windows
        print("\nRecent Emergent and Decayed Tags:")
        display_df = emergence_df[["window", "emergent_tags", "decayed_tags"]].tail(6)
        print(display_df.to_string(index=False))
        return fig

    def plot_article_velocity_agg(
        self,
        tag,
        tag_col="Tags",
        date_col="date",
        freq="M",
        agg="mean",
        time_unit="days",
    ):
        tag_lower = tag.lower()
        mask = self.df[tag_col].apply(
            lambda tags: tag_lower in [str(t).lower() for t in tags]
            if isinstance(tags, list)
            else False
        )
        tag_dates = self.df.loc[mask, date_col]
        if tag_dates.empty:
            print(f"No articles found for tag '{tag}'.")
            return

        dates_sorted = pd.to_datetime(tag_dates).sort_values()
        deltas = dates_sorted.diff().dropna()
        if time_unit == "days":
            delta_vals = deltas.dt.total_seconds() / 86400
            unit_str = "Days"
        elif time_unit == "hours":
            delta_vals = deltas.dt.total_seconds() / 3600
            unit_str = "Hours"
        else:
            delta_vals = deltas.dt.total_seconds() / 60
            unit_str = "Minutes"

        velocity = 1 / delta_vals.replace(0, float("nan"))
        vel_df = pd.DataFrame({"date": dates_sorted.iloc[1:], "velocity": velocity})
        vel_df["window"] = vel_df["date"].dt.to_period(freq)

        if agg == "mean":
            agg_vel = vel_df.groupby("window")["velocity"].mean()
        elif agg == "max":
            agg_vel = vel_df.groupby("window")["velocity"].max()
        else:
            raise ValueError("agg must be 'mean' or 'max'")

        df_plot = agg_vel.reset_index()
        df_plot.columns = ["Time_Window", "Velocity"]
        df_plot["Time_Window"] = df_plot["Time_Window"].astype(str)

        fig = px.line(
            df_plot,
            x="Time_Window",
            y="Velocity",
            markers=True,
            title=f"{agg.capitalize()} Article Velocity for '{tag}' by {freq}",
            labels={"Velocity": f"Velocity (1/{unit_str})"},
        )
        # Highlight peak
        peak_idx = df_plot["Velocity"].idxmax()
        fig.add_scatter(
            x=[df_plot.loc[peak_idx, "Time_Window"]],
            y=[df_plot.loc[peak_idx, "Velocity"]],
            mode="markers+text",
            marker=dict(color="seagreen", size=14, symbol="diamond"),
            text=[f"Peak: {df_plot.loc[peak_idx, 'Velocity']:.2f}"],
            textposition="top center",
            showlegend=False,
        )
        fig.update_layout(xaxis_tickangle=-45)
        # fig.show()
        print(f"{agg.capitalize()} velocity stats:\n{agg_vel.describe()}")
        return fig

        # return agg_vel

    def event_coverage_lifespan(
        self, tag, tag_col="tags_norm", date_col="date", freq="D"
    ):
        """
        For a given tag, find first, peak, and last article appearance, plus lifespan.
        Optionally, return/plot daily or weekly trend.
        """
        tag_lower = tag.lower()
        mask = self.df[tag_col].apply(
            lambda tags: tag_lower in [str(t).lower() for t in tags]
            if isinstance(tags, list)
            else False
        )
        event_df = self.df.loc[mask].copy()
        if event_df.empty:
            print(f"No articles found for tag '{tag}'.")
            return None

        event_df["date"] = pd.to_datetime(event_df[date_col])
        grouped = event_df.groupby(event_df["date"].dt.to_period(freq)).size()

        first_appearance = event_df["date"].min()
        last_appearance = event_df["date"].max()
        peak_window = grouped.idxmax()
        peak_count = grouped.max()
        lifespan_days = (last_appearance - first_appearance).days

        print(f"Event/tag: '{tag}'")
        print(f"First appearance: {first_appearance.strftime('%Y-%m-%d')}")
        print(f"Peak window: {peak_window} with {peak_count} articles")
        print(f"Last appearance: {last_appearance.strftime('%Y-%m-%d')}")
        print(f"Lifespan: {lifespan_days} days ({lifespan_days // 7} weeks)")
        print(f"Total articles: {len(event_df)}")

        # Optional: return for visualization
        return grouped, first_appearance, peak_window, last_appearance

    @staticmethod
    def plot_event_lifespan(
        grouped, first_appearance, peak_window, last_appearance, freq="D", tag=""
    ):
        # grouped: pd.Series (index=window, values=counts)
        x_labels = list(grouped.index.astype(str))
        first_label = str(first_appearance.to_period(freq))
        last_label = str(last_appearance.to_period(freq))
        peak_label = str(peak_window)

        base_colors = ["#1976D2"] * len(x_labels)
        for i, label in enumerate(x_labels):
            if label == first_label:
                base_colors[i] = "mediumseagreen"
            if label == peak_label:
                base_colors[i] = "goldenrod"
            if label == last_label:
                base_colors[i] = "slateblue"

        fig = go.Figure()
        fig.add_trace(
            go.Bar(
                x=x_labels, y=grouped.values, marker_color=base_colors, showlegend=False
            )
        )
        # Custom legend
        legend_colors = [
            ("First Appearance", "mediumseagreen"),
            ("Peak Coverage", "goldenrod"),
            ("Last Appearance", "slateblue"),
        ]
        for name, color in legend_colors:
            fig.add_trace(go.Bar(x=[None], y=[None], marker_color=color, name=name))

        fig.update_layout(
            title=f"Coverage Lifespan for '{tag}' ({freq})",
            xaxis_title="Date",
            yaxis_title="Article Count",
            xaxis_tickangle=-45,
            barmode="group",
        )
        # fig.show()
        return fig




class TextAnalyzer:
    def __init__(self, source_file, id_col="Id", title_col="Title"):
        self.source_file = source_file
        self.df = None
        self._load_data()

        self.id_col = id_col
        self.title_col = title_col

    def _load_data(self):
        """Load the JSONL metadata file into a DataFrame."""
        try:
            self.df = pd.read_json(self.source_file, lines=True)
            print(f"✅ Loaded {len(self.df)} records from {self.source_file}")
        except Exception as e:
            print(f"❌ Failed to load data: {e}")
            self.df = pd.DataFrame()  # Empty fallback

    @staticmethod
    def _calculate_word_count(df, text_col, new_col="n_words"):
        """
        Adds/updates a word count column for the specified text column.

        Args:
            df (pd.DataFrame): The DataFrame.
            text_col (str): The name of the text column.
            new_col (str): The name for the word count column (default: 'n_words').
        Returns:
            pd.DataFrame: The DataFrame with the new column.
        """
        df[new_col] = df[text_col].apply(
            lambda x: len(str(x).split()) if pd.notnull(x) else 0
        )
        return df

    def length_stats(self, text_col):
        TextAnalyzer._calculate_word_count(df=self.df, text_col=text_col)
        print("Article Count:", len(self.df))
        print("Min length (words):", self.df["n_words"].min())
        print("Max length (words):", self.df["n_words"].max())
        print("Mean length (words):", self.df["n_words"].mean())
        print("Median length (words):", self.df["n_words"].median())
        print("10 shortest articles:")
        print(
            self.df[[self.id_col, self.title_col, "n_words"]]
            .sort_values("n_words")
            .head(10)
        )
        print("10 longest articles:")
        print(
            self.df[[self.id_col, self.title_col, "n_words"]]
            .sort_values("n_words", ascending=False)
            .head(10)
        )

    def length_hist(self, text_col, bins=30):
        TextAnalyzer._calculate_word_count(df=self.df, text_col=text_col)
        fig = px.histogram(
            self.df,
            x="n_words",
            nbins=bins,
            title="Distribution of Article Lengths (in words)",
            labels={"n_words": "Number of Words"},
            opacity=0.85,
            color_discrete_sequence=["#1976D2"],
        )
        fig.update_layout(
            xaxis_title="Number of Words", yaxis_title="Number of Articles", bargap=0.07
        )
        # fig.show()
        return fig

    def most_common_words(self, text_col, n=30, ngram=1):
        ngram_list = []
        for text in self.df[text_col].dropna():
            tokens = str(text).split()
            if len(tokens) < ngram:
                continue
            ngrams = zip(*[tokens[i:] for i in range(ngram)])
            ngram_list.extend([" ".join(ng) for ng in ngrams])
        ngram_freq = Counter(ngram_list)
        print(f"Top {n} {ngram}-grams:")
        for ngram_str, freq in ngram_freq.most_common(n):
            print(f"{ngram_str}: {freq}")

        # Plotly barplot
        top_ngrams = ngram_freq.most_common(n)
        df_plot = pd.DataFrame(top_ngrams, columns=["Ngram", "Frequency"])
        fig = px.bar(
            df_plot,
            x="Ngram",
            y="Frequency",
            title=f"Top {n} Most Common {ngram}-grams",
            labels={"Ngram": f"{ngram}-gram", "Frequency": "Frequency"},
            color="Frequency",
            color_continuous_scale="Teal",
        )
        fig.update_layout(
            xaxis_title=f"{ngram}-gram", yaxis_title="Frequency", xaxis_tickangle=-45
        )
        # fig.show()
        return fig








In [30]:

analyzer = Analyzer(output_path)
text_analyzer = TextAnalyzer(source_file=output_path, id_col="Id", title_col="title_norm")


✅ Loaded 52052 records from /content/drive/My Drive/data/aa_news_data.json
✅ Loaded 52052 records from /content/drive/My Drive/data/aa_news_data.json


In [31]:
analyzer.view_df(5)
analyzer.info()
analyzer.shape()
analyzer.missing_report()


        Id CreateDateString  \
0  3637367       21.07.2025   
1  3636873       20.07.2025   
2  3636862       20.07.2025   
3  3636727       20.07.2025   
4  3636717       20.07.2025   

                                               title  \
0  Turkish president vows support for his Syrian ...   
1  Türkiye rejects 'politically motivated attempt...   
2  TRNC now represented in Turkic, Islamic groupi...   
3  Turkish foreign minister discusses development...   
4  Türkiye grieves with Vietnam over deadly capsi...   

                                             summary  \
0  Erdogan vows not to 'leave al-Sharaa alone,' s...   
1  'Türkiye pursues independent policy rooted in ...   
2  Insisting on solution model that Turkish Cypri...   
3  In separate phone calls, Hakan Fidan also disc...   
4  Türkiye says it is 'deeply saddened' over loss...   

                                                Tags     Categories  \
0  Gaza,Israel,Syria,Türkiye,Gaza,Israel,Syria,Tü...        [2, 71] 

## Tag Distribution
Visualize the most common topics/tags across the dataset.


In [32]:
fig = analyzer.plot_tag_counts(tags_col="tags_norm", top_n=20)
fig.show()
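
The DataFrame preview above shows the raw `Tags` column as a comma-separated string with repeated entries (for example `Gaza,Israel,Syria,Türkiye,Gaza,...`). The sketch below shows one way to normalize such values before counting, so that each tag counts once per article. The helper names `normalize_tags` and `count_tags` are illustrative and not part of the notebook's classes.

```python
from collections import Counter


def normalize_tags(raw):
    """Return sorted, unique, lowercased tags from a list or a comma-separated string."""
    if isinstance(raw, str):
        parts = raw.split(",")
    elif isinstance(raw, list):
        parts = [str(t) for t in raw]
    else:
        return []  # None, NaN, or any other type carries no tags
    return sorted({p.strip().lower() for p in parts if p.strip()})


def count_tags(tag_values):
    """Count the number of articles each tag appears in."""
    counter = Counter()
    for raw in tag_values:
        counter.update(normalize_tags(raw))
    return counter


counts = count_tags(["Gaza,Israel,Gaza", ["Israel", "Syria"], None])
print(counts.most_common())
```

Counting each tag once per article keeps duplicated entries in a single record from inflating the totals.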


## Article Trends Over Time
View daily/weekly/monthly publishing trends.


In [33]:
fig = analyzer.plot_trend(freq="M")  # "D", "W", "M"
fig.show()


date
2011-09      1
2011-12      5
2012-01      2
2012-02      1
2012-04      6
          ... 
2025-03    570
2025-04    708
2025-05    671
2025-06    505
2025-07    363
Freq: M, Name: Id, Length: 155, dtype: int64



is_period_dtype is deprecated and will be removed in a future version. Use `isinstance(dtype, pd.PeriodDtype)` instead
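
For weekly trends, the grouping key matters: an ISO week number on its own repeats every year, whereas a weekly period keeps the year. A small sketch on made-up dates (not taken from the dataset) illustrates the difference:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-01-03", "2025-01-01", "2025-01-02"]))

# Week number alone: ISO week 1 of 2024 and ISO week 1 of 2025 share one bucket
by_week_number = dates.groupby(dates.dt.isocalendar().week).size()

# Weekly periods (Monday to Sunday) keep the year, so the two years stay separate
by_week_period = dates.groupby(dates.dt.to_period("W")).size()

print(len(by_week_number), len(by_week_period))
```

With a dataset spanning 2011 to 2025, grouping by week number alone would fold about fourteen years into at most 53 buckets.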



## Article Frequency Around Key Dates
Visualize article counts centered on a selected date.


In [34]:
fig = analyzer.articles_per_month_around_date(selected_date="2024-06-20", months_window=3)
fig.show()
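
The window in `articles_per_month_around_date` is computed with `pd.DateOffset`, which shifts by calendar months rather than a fixed number of days. A short sketch of the bounds for the call above:

```python
import pandas as pd

selected = pd.Timestamp("2024-06-20")
start = selected - pd.DateOffset(months=3)
end = selected + pd.DateOffset(months=3)

print(start.date(), end.date())
```

Because the bounds fall mid-month (20 March and 20 September here), the first and last bars of the chart cover partial months and are not directly comparable with the full months in between.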


## Tag Coverage Over Time
Track the frequency of a given tag across months.


In [35]:
fig = analyzer.plot_tag_coverage_over_time(tag="Gaza", tags_col="tags_norm", top_n_months=6)
fig.show()
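
When tags arrive as a comma-separated string rather than a list, a plain substring test also matches longer tags that merely contain the search term. The sketch below shows an exact, case-insensitive alternative. The helper `has_tag` and the example tag strings are illustrative and not part of the notebook.

```python
def has_tag(tags, tag):
    """Exact, case-insensitive tag match for a list or a comma-separated string."""
    if isinstance(tags, str):
        tags = tags.split(",")
    if not isinstance(tags, list):
        return False
    wanted = tag.strip().lower()
    return any(str(t).strip().lower() == wanted for t in tags)


raw = "Gaza Strip,Israel"
print("gaza" in raw.lower())  # substring test
print(has_tag(raw, "Gaza"))   # exact match on individual tags
```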


## Tag Co-occurrence
Visualize which tags commonly appear together.


In [36]:
fig = analyzer.tag_cooccurrence_matrix(tag_col="tags_norm", top_n=15)
fig.show()
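
The co-occurrence counting behind the heatmap can be sketched on a toy example. The three tag lists below are made up for illustration:

```python
from collections import Counter
from itertools import combinations

import pandas as pd

articles = [["gaza", "israel"], ["gaza", "israel", "syria"], ["syria"]]

pair_counts = Counter()
for tags in articles:
    # sorted() gives each unordered pair a single canonical key
    for a, b in combinations(sorted(set(tags)), 2):
        pair_counts[(a, b)] += 1

labels = sorted({t for tags in articles for t in tags})
matrix = pd.DataFrame(0, index=labels, columns=labels, dtype=int)
for (a, b), n in pair_counts.items():
    matrix.loc[a, b] = n
    matrix.loc[b, a] = n  # mirror so the matrix is symmetric

print(matrix)
```

The diagonal stays at zero in this scheme, so the heatmap shows only pairings and not how often each tag occurs on its own.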


## Topic Shifts Over Time
Track how top topics change monthly.


In [37]:
matrix = analyzer.get_tag_month_matrix(tag_col="tags_norm", date_col="date", top_n=10)
fig = Analyzer.plot_tag_temporal_shifts(matrix)
fig.show()
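
A month-by-tag count table can also be built without a row-by-row loop. The sketch below uses `explode` and `groupby` on a made-up three-article frame. It is an alternative formulation under the assumption that the tag column holds Python lists, not the notebook's own implementation.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
        "tags": [["gaza", "israel"], ["gaza"], ["syria", "gaza"]],
    }
)

# One row per (article, tag); set() removes repeated tags within an article
long_df = (
    df.assign(tags=df["tags"].apply(lambda t: sorted(set(t))))
    .explode("tags")
    .reset_index(drop=True)
)

# Rows = month, columns = tag, values = number of articles
tag_month = (
    long_df.groupby([long_df["date"].dt.to_period("M"), "tags"])
    .size()
    .unstack(fill_value=0)
)
print(tag_month)
```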


## Topic Emergence/Decay
Which topics are appearing/disappearing in each window?


In [38]:
emergence_df = analyzer.topic_emergence_decay(tag_col="tags_norm", date_col="date", freq="M", min_window_count=2)
fig = Analyzer.plot_topic_emergence_decay(emergence_df)
fig.show()



Recent Emergent and Decayed Tags:
 window  [remaining columns and rows truncated in the export]
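
The emergence and decay logic reduces to set differences between neighbouring windows. A toy sketch with made-up tag sets:

```python
window_tags = {
    "2024-01": {"gaza", "election"},
    "2024-02": {"gaza", "earthquake"},
    "2024-03": {"gaza"},
}

windows = sorted(window_tags)
results = {}
for i, win in enumerate(windows):
    current = window_tags[win]
    prev_tags = window_tags[windows[i - 1]] if i > 0 else set()
    next_tags = window_tags[windows[i + 1]] if i < len(windows) - 1 else set()
    emergent = current - prev_tags  # present now, absent in the previous window
    decayed = current - next_tags   # present now, absent in the next window
    results[win] = (sorted(emergent), sorted(decayed))
    print(win, results[win])
```

Every tag in the first window is reported as emergent and every tag in the last window as decayed, because those windows are compared against an empty set. The first and last points of the chart should be read with that in mind.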

## Article Velocity for a Tag
How quickly do articles about a topic appear?


In [39]:
fig = analyzer.plot_article_velocity_agg(
    tag="Gaza", tag_col="tags_norm", date_col="date", time_unit="days", freq="M"
)
fig.show()


Mean velocity stats:
count    66.000000
mean      0.320268
std       0.371654
min       0.001754
25%       0.016534
50%       0.071275
75%       0.681429
max       1.000000
Name: velocity, dtype: float64
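
Velocity here is the reciprocal of the gap between consecutive articles. A small sketch on made-up dates shows how same-day articles are handled:

```python
import pandas as pd

dates = pd.Series(
    pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-03", "2024-01-07"])
)

gaps_days = dates.sort_values().diff().dropna().dt.total_seconds() / 86400
# Same-day articles have a gap of 0; these become NaN instead of infinite velocity
velocity = 1 / gaps_days.replace(0, float("nan"))
print(velocity.tolist())

# Alternative view that keeps same-day bursts: articles per calendar day
per_day = dates.value_counts().sort_index()
print(per_day)
```

Dates in this dataset are parsed from day-resolution strings (`%d.%m.%Y`), so the smallest nonzero gap is one day and the velocity in `1/Days` cannot exceed 1.0. This is consistent with the `max` of 1.000000 in the stats above. For heavily covered topics, a per-day count may therefore be the more informative measure.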


## Event Lifespan
When does a topic first appear, peak, and last appear?


In [40]:
result = analyzer.event_coverage_lifespan(
    "Gaza", tag_col="tags_norm", date_col="date", freq="M"
)
# event_coverage_lifespan returns None when no article matches the tag
if result is not None:
    grouped, first, peak, last = result
    fig = Analyzer.plot_event_lifespan(grouped, first, peak, last, freq="M", tag="Gaza")
    fig.show()


Event/tag: 'Gaza'
First appearance: 2012-10-12
Peak window: 2023-11 with 142 articles
Last appearance: 2025-07-21
Lifespan: 4665 days (666 weeks)
Total articles: 1269
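
The lifespan figure is the difference between the first and last appearance dates. The arithmetic for the values printed above can be checked directly:

```python
import pandas as pd

first = pd.Timestamp("2012-10-12")
last = pd.Timestamp("2025-07-21")

lifespan_days = (last - first).days
print(lifespan_days, lifespan_days // 7)  # 4665 666
```

The lifespan measures the span between the two dates, not continuous coverage. With 1,269 articles over 4,665 days, the average is roughly 0.27 articles per day, while the peak month alone holds 142.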


## Article Length Distribution


In [41]:
fig = text_analyzer.length_hist(text_col="full_text_norm", bins=50)
fig.show()
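
Article length is measured as a whitespace-delimited word count. The sketch below applies the same rule to three made-up values and shows an equivalent vectorized form:

```python
import pandas as pd

texts = pd.Series(["ankara hosts summit", None, "  two  words "])

# Same rule as the notebook's helper: split on whitespace, missing text counts as 0
n_words = texts.apply(lambda x: len(str(x).split()) if pd.notnull(x) else 0)

# Vectorized equivalent using the .str accessor
n_words_vec = texts.fillna("").str.split().str.len()

print(n_words.tolist(), n_words_vec.tolist())
```

Since the column name `full_text_norm` suggests the text has already been normalized, these counts presumably reflect the processed tokens rather than the original article length.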


## Most Common N-Grams (Unigrams, Bigrams, Trigrams)


In [42]:
fig = text_analyzer.most_common_words(text_col="full_text_norm", n=20, ngram=2)
fig.show()


Top 20 2-grams:
covid 19: 7565
prime minister: 7114
said statement: 6874
year old: 6306
told anadolu: 6047
anadolu agency: 6024
foreign minister: 5906
last year: 5906
tayyip erdogan: 4310
recep tayyip: 4302
last week: 3874
world cup: 3754
president recep: 3720
said adding: 3396
erdogan said: 3378
ministry said: 3339
two country: 3189
000 people: 3161
turkish president: 3135
million people: 3114
