# Twitter Archive Analyzer & CSV Converter

This notebook helps you:

- Upload one or more Twitter export files (the `.js` files in your archive's `data` folder, such as `tweet.js`, or plain `.json` files).
- Parse Twitter's special `.js` format (it has a JavaScript variable assignment before the JSON).
- Combine and clean tweets, deleted tweets, and note tweets.
- Show basic statistics and interactive charts (with Plotly).
- Export cleaned data to CSV and download it.

Works in: GitHub Codespaces (JupyterLab), local Jupyter, and Google Colab.


## 1) Install dependencies

If prompted to restart the kernel after installing `ipywidgets`, please do so. In Colab, installations are quick and typically don't require a restart.


In [None]:
# Install required packages (safe to run multiple times).
# - plotly for interactive charts
# - pandas/numpy for analysis
# - ipywidgets for a friendly upload UI (Jupyter)
# - chardet helps detect file encodings
%pip -q install plotly pandas numpy ipywidgets chardet python-dateutil

# Check that ipywidgets can be imported (the upload widget needs it outside Colab)
try:
    import ipywidgets as widgets
except Exception:
    print('ipywidgets is not available yet. If using JupyterLab, you may need to restart the kernel after install.')


In [None]:
# Imports and environment detection
import io, os, re, json, sys, zipfile, textwrap
from datetime import datetime
from typing import List, Dict, Tuple

import pandas as pd
import numpy as np
import chardet

import plotly.express as px
import plotly.io as pio
from IPython.display import display, FileLink, FileLinks, Markdown

# Detect Colab
IN_COLAB = False
try:
    import google.colab  # type: ignore
    IN_COLAB = True
except Exception:
    IN_COLAB = False

# Pick a sensible Plotly renderer
if IN_COLAB:
    pio.renderers.default = 'colab'
else:
    # Works in JupyterLab / Codespaces
    pio.renderers.default = 'notebook_connected'

print('Environment: Colab' if IN_COLAB else 'Environment: Jupyter / Codespaces')


## 2) Select your files

Use Option A to upload files from your computer, or Option B to list paths that already exist in the workspace.


In [None]:
# Option A: Upload files from your computer
uploaded_files_bytes = None
uploader = None

if IN_COLAB:
    # Colab: shows a file picker dialog
    from google.colab import files as colab_files  # type: ignore
    print('Colab: Use the dialog to select one or more .js/.json files to upload.')
    uploaded_files_bytes = colab_files.upload()  # dict: name -> bytes
else:
    # Jupyter/Codespaces: use an ipywidgets file uploader
    import ipywidgets as widgets
    uploader = widgets.FileUpload(
        accept='.js,.json',
        multiple=True,
        description='Select .js/.json files',
    )
    display(widgets.VBox([
        widgets.HTML('<b>Upload one or more Twitter export .js or .json files:</b>'),
        uploader,
    ]))
    print('After selecting files, run the next cell to process them.')


In [None]:
# Option B: (Advanced) Provide local workspace paths (e.g., from a checked-out repo)
# Example:
# local_paths = ['twits/tweets.js', 'twits/deleted-tweets.js']
local_paths: List[str] = []
local_paths


## 3) Parsing helpers

These functions:

- Detect encoding and decode bytes.
- Strip the JavaScript wrapper (`window.YTD.* =`) used by Twitter exports.
- Parse JSON safely with useful errors.
- Normalize different record types (`tweet`, `deleted_tweet`, `note`).
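
As a minimal illustration of the wrapper handling, the sketch below uses an invented two-record sample shaped like a `tweets.js` part file (the IDs and text are made up). It keeps everything from the first `[` to the last `]` and parses that with `json.loads`. The helper name `strip_assignment_prefix` is hypothetical and deliberately simpler than the notebook's `strip_js_wrapper`.

```python
import json

# Invented sample shaped like a Twitter export part file (not real data).
sample = '''window.YTD.tweets.part0 = [
  {"tweet": {"id_str": "1", "full_text": "hello", "created_at": "Wed Oct 10 20:19:24 +0000 2018"}},
  {"tweet": {"id_str": "2", "full_text": "world [draft]", "created_at": "Thu Oct 11 08:00:00 +0000 2018"}}
]'''


def strip_assignment_prefix(text: str) -> str:
    """Return the JSON array that follows the `window.YTD.* =` assignment."""
    start = text.find('[')
    end = text.rfind(']')
    if start == -1 or end <= start:
        raise ValueError('No JSON array found')
    return text[start:end + 1]


records = json.loads(strip_assignment_prefix(sample))
texts = [r['tweet']['full_text'] for r in records]
print(len(records), texts)
```

Searching for the last `]` rather than the next one keeps brackets inside tweet text from cutting the array short.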


In [None]:
def detect_and_decode(data: bytes) -> str:
    """Detect encoding and decode to text.
    Falls back to UTF-8; undecodable bytes are replaced, not dropped.
    """
    try:
        ch = chardet.detect(data) or {}
        enc = ch.get('encoding') or 'utf-8'
        return data.decode(enc, errors='replace')
    except Exception:
        return data.decode('utf-8', errors='replace')


def strip_js_wrapper(text: str) -> str:
    """Twitter exports in .js usually look like: 
    window.YTD.tweets.part0 = [ ... ];
    We extract the JSON array/object from the first '[' or '{' to the matching last ']' or '}'.
    """
    # Remove UTF-8 BOM if present
    if text and text[0] == '\ufeff':
        text = text[1:]
    stripped = text.strip()
    # If it already looks like JSON, return as-is
    if stripped.startswith('[') or stripped.startswith('{'):
        return stripped
    # Otherwise, start from whichever opener comes first, so an object that
    # contains arrays is not mistaken for a top-level array
    lb_arr, lb_obj = stripped.find('['), stripped.find('{')
    if lb_arr != -1 and (lb_obj == -1 or lb_arr < lb_obj):
        rb = stripped.rfind(']')
        if rb > lb_arr:
            return stripped[lb_arr:rb+1]
    # Top-level object: wrap it in a list so callers always get an array
    if lb_obj != -1:
        rb = stripped.rfind('}')
        if rb > lb_obj:
            return '[' + stripped[lb_obj:rb+1] + ']'
    raise ValueError('Could not locate JSON content within the .js file. Expected [ ... ] or { ... }.')


def parse_twitter_export_bytes(data: bytes, filename: str) -> List[Dict]:
    """Parse a Twitter .js or .json export file into a Python list.
    Each item is usually a dict with a single key like 'tweet', 'noteTweet', etc.
    """
    text = detect_and_decode(data)
    try:
        core = strip_js_wrapper(text)
        parsed = json.loads(core)
        if isinstance(parsed, dict):
            parsed = [parsed]
        if not isinstance(parsed, list):
            raise ValueError('Top-level JSON must be an array or object.')
        return parsed
    except json.JSONDecodeError as jde:
        # Include a short sample of the file and keep the original error chained
        context = textwrap.shorten(text, width=200, placeholder='...')
        raise ValueError(f'JSON parse error in {filename}: {jde}. Sample: {context}') from jde


def html_strip(s: str) -> str:
    if not isinstance(s, str):
        return s
    return re.sub(r'<[^>]+>', '', s)


def safe_get(d: Dict, *keys, default=None):
    cur = d
    for k in keys:
        if isinstance(cur, dict) and k in cur:
            cur = cur[k]
        else:
            return default
    return cur


def normalize_items(raw_items: List[Dict], source_label: str) -> List[Dict]:
    """Flatten Twitter export records into a unified schema.
    source_label is the filename (for provenance).
    """
    rows = []
    for it in raw_items:
        if not isinstance(it, dict):
            continue
        if 'tweet' in it:
            rec = it['tweet'] or {}
            record_type = 'deleted_tweet' if 'deleted_at' in rec else 'tweet'
            text_val = rec.get('full_text') or rec.get('text')
            created = rec.get('created_at')
            row = {
                'record_type': record_type,
                'source_file': source_label,
                'id_str': str(rec.get('id_str') or rec.get('id') or ''),
                'created_at': created,
                'text': text_val,
                'lang': rec.get('lang'),
                'source': html_strip(rec.get('source')),
                'favorite_count': rec.get('favorite_count'),
                'retweet_count': rec.get('retweet_count'),
                'possibly_sensitive': rec.get('possibly_sensitive'),
                'deleted_at': rec.get('deleted_at'),
                'in_reply_to_status_id': rec.get('in_reply_to_status_id') or rec.get('in_reply_to_status_id_str'),
                'in_reply_to_user_id': rec.get('in_reply_to_user_id') or rec.get('in_reply_to_user_id_str'),
                'in_reply_to_screen_name': rec.get('in_reply_to_screen_name'),
            }
            ents = rec.get('entities') or {}
            row.update({
                'hashtags_count': len(ents.get('hashtags') or []),
                'user_mentions_count': len(ents.get('user_mentions') or []),
                'urls_count': len(ents.get('urls') or []),
                'media_count': len(ents.get('media') or []),
            })
            rows.append(row)
        elif 'noteTweet' in it:
            rec = it['noteTweet'] or {}
            core = rec.get('core') or {}
            row = {
                'record_type': 'note',
                'source_file': source_label,
                'id_str': str(rec.get('noteTweetId') or ''),
                'created_at': rec.get('createdAt') or rec.get('updatedAt'),
                'text': core.get('text'),
                'lang': None,
                'source': 'Note',
                'favorite_count': None,
                'retweet_count': None,
                'possibly_sensitive': None,
                'deleted_at': None,
                'in_reply_to_status_id': None,
                'in_reply_to_user_id': None,
                'in_reply_to_screen_name': None,
                'hashtags_count': len(core.get('hashtags') or []),
                'user_mentions_count': len(core.get('mentions') or []),
                'urls_count': len(core.get('urls') or []),
                'media_count': 0,
            }
            rows.append(row)
        else:
            # Unknown item type; keep as raw JSON string for debugging
            rows.append({
                'record_type': 'unknown',
                'source_file': source_label,
                'id_str': '',
                'created_at': None,
                'text': json.dumps(it)[:5000],
                'lang': None,
                'source': None,
                'favorite_count': None,
                'retweet_count': None,
                'possibly_sensitive': None,
                'deleted_at': None,
                'in_reply_to_status_id': None,
                'in_reply_to_user_id': None,
                'in_reply_to_screen_name': None,
                'hashtags_count': None,
                'user_mentions_count': None,
                'urls_count': None,
                'media_count': None,
            })
    return rows


def coerce_types(df: pd.DataFrame) -> pd.DataFrame:
    # created_at to datetime
    if 'created_at' in df.columns:
        df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce')
    # counts to numeric
    for col in ['favorite_count', 'retweet_count', 'hashtags_count', 'user_mentions_count', 'urls_count', 'media_count']:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    # id_str as string
    if 'id_str' in df.columns:
        df['id_str'] = df['id_str'].astype(str)
    # text length
    if 'text' in df.columns:
        df['text_len'] = df['text'].fillna('').astype(str).str.len()
    return df


def summarize(df: pd.DataFrame) -> str:
    lines = []
    lines.append(f'Total records: {len(df):,}')
    if 'record_type' in df.columns:
        lines.append('By type:')
        lines.extend(['  - ' + k + ': ' + str(v) for k, v in df['record_type'].value_counts().to_dict().items()])
    if 'created_at' in df.columns and df['created_at'].notna().any():
        lines.append(f'Date range: {df["created_at"].min()} -> {df["created_at"].max()}')
    if 'lang' in df.columns:
        top_langs = df['lang'].value_counts(dropna=True).head(5)
        if not top_langs.empty:
            lines.append('Top languages: ' + ', '.join([f"{idx} ({val})" for idx, val in top_langs.items()]))
    if 'source' in df.columns:
        top_src = df['source'].value_counts(dropna=True).head(5)
        if not top_src.empty:
            lines.append('Top sources: ' + ', '.join([f"{idx} ({val})" for idx, val in top_src.items()]))
    if 'text_len' in df.columns and df['text_len'].notna().any():
        lines.append(f"Text length (chars): mean {df['text_len'].mean():.1f}, median {df['text_len'].median():.0f}")
    if 'favorite_count' in df.columns and df['favorite_count'].notna().any():
        lines.append(f"Favorites: mean {df['favorite_count'].mean():.2f}, max {df['favorite_count'].max()}")
    if 'retweet_count' in df.columns and df['retweet_count'].notna().any():
        lines.append(f"Retweets: mean {df['retweet_count'].mean():.2f}, max {df['retweet_count'].max()}")
    return '\n'.join(lines)


## 4) Process the uploaded/provided files

Run this cell after you have selected files (or set `local_paths`).

It will parse all files, combine them, show stats, and build charts.
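
The processing cell concatenates rows from every file without removing duplicates, so loading the same file twice doubles its records. If that matters for your archive, one possible approach is to keep the first row seen for each `(record_type, id_str)` pair. The sketch below shows the idea on invented rows; `dedupe_rows` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, List


def dedupe_rows(rows: List[Dict]) -> List[Dict]:
    """Keep the first row per (record_type, id_str); rows without an ID are always kept."""
    seen = set()
    unique = []
    for row in rows:
        key = (row.get('record_type'), row.get('id_str'))
        if row.get('id_str') and key in seen:
            continue  # same record already loaded from another file
        seen.add(key)
        unique.append(row)
    return unique


# Invented rows: the same tweet uploaded twice, a note sharing the ID, two ID-less rows.
rows = [
    {'record_type': 'tweet', 'id_str': '1', 'source_file': 'tweets.js'},
    {'record_type': 'tweet', 'id_str': '1', 'source_file': 'tweets (1).js'},
    {'record_type': 'note', 'id_str': '1', 'source_file': 'note-tweet.js'},
    {'record_type': 'unknown', 'id_str': '', 'source_file': 'other.js'},
    {'record_type': 'unknown', 'id_str': '', 'source_file': 'other.js'},
]
deduped = dedupe_rows(rows)
print(len(deduped))
```

On the combined DataFrame, `df.drop_duplicates(subset=['record_type', 'id_str'])` would do much the same, although it would also collapse `unknown` rows that share an empty ID.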


In [None]:
all_rows = []
errors = []
file_count = 0

def add_from_bytes_map(bmap: Dict[str, bytes]):
    global all_rows, file_count
    for name, data in (bmap or {}).items():
        try:
            items = parse_twitter_export_bytes(data, name)
            rows = normalize_items(items, source_label=name)
            all_rows.extend(rows)
            file_count += 1
        except Exception as e:
            errors.append(f'{name}: {e}')

def add_from_local_paths(paths: List[str]):
    global all_rows, file_count
    for p in paths or []:
        try:
            with open(p, 'rb') as f:
                data = f.read()
            items = parse_twitter_export_bytes(data, os.path.basename(p))
            rows = normalize_items(items, source_label=os.path.basename(p))
            all_rows.extend(rows)
            file_count += 1
        except FileNotFoundError:
            errors.append(f'{p}: File not found')
        except Exception as e:
            errors.append(f'{p}: {e}')

# 1) From Colab upload dialog
if 'uploaded_files_bytes' in globals() and uploaded_files_bytes:
    add_from_bytes_map(uploaded_files_bytes)

# 2) From Jupyter/Codespaces widget
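if 'uploader' in globals() and uploader is not None and getattr(uploader, 'value', None):
    # ipywidgets 8: uploader.value is a tuple of dicts (name, type, size, content as memoryview)
    # ipywidgets 7: uploader.value is a dict mapping name -> {'metadata': ..., 'content': bytes}
    # bytes(...) converts the memoryview so chardet and .decode() work in both versions
    if isinstance(uploader.value, dict):
        bytes_map = {name: bytes(item['content']) for name, item in uploader.value.items()}
    else:
        bytes_map = {item['name']: bytes(item['content']) for item in uploader.value}
    add_from_bytes_map(bytes_map)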

# 3) From local workspace paths
if local_paths:
    add_from_local_paths(local_paths)

# Report per-file errors even when nothing could be loaded
if errors:
    print('Some files had errors:')
    for e in errors:
        print(' -', e)

if not all_rows:
    print('No data was loaded. Please upload files in the prior cell or set local_paths, then run this cell again.')
else:
    df = pd.DataFrame(all_rows)
    df = coerce_types(df)
    print(f'Parsed {len(df):,} records from {file_count} file(s).')
    display(df.head(10))
    print()
    print('Summary:')
    print(summarize(df))
    # Keep as global for later cells
    tweets_df = df.copy()


## 5) Visualizations (Plotly)

If you loaded data, this will render interactive charts.
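
The first chart counts records per calendar month and record type. As a rough sketch of that aggregation outside pandas, the snippet below parses a few invented timestamps in the `created_at` layout used for tweets and tallies them with `collections.Counter`. The sample rows and the constant name `TWEET_TIME_FORMAT` are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Layout of tweet `created_at` strings, e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWEET_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'

# Invented (record_type, created_at) pairs.
rows = [
    ('tweet', 'Wed Oct 10 20:19:24 +0000 2018'),
    ('tweet', 'Wed Oct 31 23:59:59 +0000 2018'),
    ('deleted_tweet', 'Thu Nov 01 00:00:00 +0000 2018'),
]

monthly = Counter()
for record_type, created_at in rows:
    ts = datetime.strptime(created_at, TWEET_TIME_FORMAT)
    # Bucket by year-month, the same granularity as the monthly chart.
    monthly[(ts.strftime('%Y-%m'), record_type)] += 1

for (month, record_type), count in sorted(monthly.items()):
    print(month, record_type, count)
```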


In [None]:
if 'tweets_df' not in globals() or tweets_df.empty:
    print('No data to visualize yet. Run the processing cell after uploading files.')
else:
    df = tweets_df.copy()
    # Time series: monthly counts (tweets + deleted tweets + notes)
    if 'created_at' in df.columns and df['created_at'].notna().any():
        ts = (df.dropna(subset=['created_at'])
                .set_index('created_at')
                .assign(count=1)
                .groupby([pd.Grouper(freq='M'), 'record_type'])['count']
                .sum()
                .reset_index())
        fig_ts = px.line(ts, x='created_at', y='count', color='record_type',
                         title='Monthly Record Counts by Type', markers=True)
        fig_ts.update_layout(hovermode='x unified')
        fig_ts.show()
    else:
        print('created_at not available for time series.')

    # Text length distribution
    if 'text_len' in df.columns and df['text_len'].notna().any():
        fig_len = px.histogram(df, x='text_len', nbins=60, color='record_type',
                               title='Distribution of Text Length (characters)')
        fig_len.show()
    else:
        print('No text_len to chart.')

    # Top languages
    if 'lang' in df.columns and df['lang'].notna().any():
        top_lang = df['lang'].value_counts().head(10).reset_index()
        top_lang.columns = ['lang', 'count']
        fig_lang = px.bar(top_lang, x='lang', y='count', title='Top Languages')
        fig_lang.show()
    else:
        print('No language info available.')

    # Top sources (client apps)
    if 'source' in df.columns and df['source'].notna().any():
        top_src = df['source'].value_counts().head(10).reset_index()
        top_src.columns = ['source', 'count']
        fig_src = px.bar(top_src, x='source', y='count', title='Top Sources (Client Apps)')
        fig_src.update_layout(xaxis={'categoryorder':'total descending'})
        fig_src.show()
    else:
        print('No source info available.')


## 6) Export to CSV and download

- Saves CSVs under `exports/` in this environment.
- In Colab, also triggers file downloads.
- In Jupyter/Codespaces, clickable links are shown; you can also download from the left file browser.
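
Tweet text often contains commas, quotes, and line breaks, which CSV handles by quoting the field. The sketch below is a small round-trip check using Python's standard `csv` module, not the notebook's `to_csv` call. The row is invented, and it shows that such text and a long ID survive writing and reading unchanged.

```python
import csv
import io

# Invented row with a multi-line text field and a long numeric-looking ID.
rows = [
    {'id_str': '1050118621198921728', 'text': 'Line one\nLine two, with a comma and a "quote"'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['id_str', 'text'])
writer.writeheader()
writer.writerows(rows)

# Read the CSV text back and compare with the original row.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]['text'] == rows[0]['text'])
```

If you read an exported file back with pandas, passing `dtype={'id_str': str}` to `pd.read_csv` keeps IDs as text instead of converting them to integers.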


In [None]:
if 'tweets_df' not in globals() or tweets_df.empty:
    print('Nothing to export yet.')
else:
    os.makedirs('exports', exist_ok=True)
    timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    paths = []

    # All records
    path_all = f'exports/twitter_records_{timestamp}.csv'
    tweets_df.to_csv(path_all, index=False)
    paths.append(path_all)

    # Per type
    for typ in sorted(tweets_df['record_type'].dropna().unique()):
        dsub = tweets_df[tweets_df['record_type'] == typ]
        p = f'exports/{typ}_{timestamp}.csv'
        dsub.to_csv(p, index=False)
        paths.append(p)

    print('Saved CSV files:')
    for p in paths:
        print(' -', p)

    # Clickable links in Jupyter/Codespaces
    display(FileLinks('exports'))

    if IN_COLAB:
        from google.colab import files as colab_files  # type: ignore
        for p in paths:
            try:
                colab_files.download(p)
            except Exception as e:
                print(f'Could not trigger download for {p}: {e}')


## 7) Tips: Uploading/downloading via the UI

- Google Colab:
  - Upload: the `files.upload()` dialog (used above).
  - Download: handled automatically in the export cell, or click the folder icon (Files) to right-click and download.
- GitHub Codespaces / JupyterLab:
  - Upload: Drag-and-drop files into the left file browser, or use the file upload widget provided above.
  - Download: After export, use the clickable links shown above, or right-click files in the left pane and choose Download.
- Local Jupyter: similar to Codespaces (left file browser, right-click to download).


## 8) Ideas for further analysis and features

Here are some ideas you can add next (all doable with free pip packages):

- Time-of-day / day-of-week posting patterns; calendar heatmaps.
- Engagement analysis by content type (with/without media, replies vs. original tweets).
- Hashtag and mention frequency; co-occurrence networks.
- Text analytics: sentiment (e.g., `textblob`), keyword extraction (e.g., `rake-nltk`), topics (e.g., `scikit-learn` LDA).
- URL expansion (resolve t.co links) and domain breakdown.
- Media inventory: count images/videos, list top-performing media.
- Create a Streamlit dashboard for shareable, interactive exploration of your archive.
- Geotag analysis (if present).
- Compare your activity before/after a specific date or event.
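
As one possible starting point for the hashtag idea, the sketch below counts hashtags and hashtag pairs. The notebook's normalized rows only keep hashtag counts, so it works on raw items instead, on the assumption that hashtags sit under `entities.hashtags[].text`, the list that `normalize_items` counts. The sample items are invented.

```python
from collections import Counter
from itertools import combinations

# Invented raw items in the shape returned by parse_twitter_export_bytes.
raw_items = [
    {'tweet': {'entities': {'hashtags': [{'text': 'Python'}, {'text': 'pandas'}]}}},
    {'tweet': {'entities': {'hashtags': [{'text': 'python'}]}}},
    {'tweet': {'entities': {'hashtags': [{'text': 'pandas'}, {'text': 'Plotly'}, {'text': 'python'}]}}},
    {'tweet': {'entities': {}}},
]

tag_counts = Counter()
pair_counts = Counter()
for item in raw_items:
    tweet = item.get('tweet') or {}
    hashtags = (tweet.get('entities') or {}).get('hashtags') or []
    # Lowercase and de-duplicate so each tag counts once per tweet.
    tags = sorted({h['text'].lower() for h in hashtags})
    tag_counts.update(tags)
    # Every unordered pair of tags in the same tweet is one co-occurrence.
    pair_counts.update(combinations(tags, 2))

print(tag_counts.most_common(3))
print(pair_counts.most_common(1))
```

The pair counts can serve as edge weights if you later build a co-occurrence network.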

Ethics & privacy: If sharing outputs, consider redacting personal data or mentions from private conversations.
