# Immigration Corpus - Jupyter Notebook Template

Quick start template for analyzing immigration discourse data.

**Before running:** Make sure you've installed the library:
```bash
pip install git+https://github.com/kevinbarcenasmtz/immigration-discourse-dataset.git
```

## 1. AWS Credentials Setup

Choose ONE of these options:

### Option A: AWS CLI (recommended - if you ran `aws configure`)

If you've already run `aws configure` in terminal, skip to Section 2. No setup needed!
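
Before skipping ahead, it can help to confirm that credentials are actually discoverable. The sketch below uses a hypothetical helper, `find_aws_credentials`, which is not part of `immigration_corpus`. It checks only two common locations: the standard environment variables and the default shared credentials file that `aws configure` writes. It does not validate the keys, and it does not cover named profiles, a custom `AWS_SHARED_CREDENTIALS_FILE` path, or IAM roles.

```python
import os
from pathlib import Path


def find_aws_credentials(env=None, credentials_file=None):
    """Return where AWS credentials appear to be configured, or None.

    Hypothetical helper for this notebook. It only checks for presence
    and never validates the keys against AWS.
    """
    env = os.environ if env is None else env
    if credentials_file is None:
        # Default location written by `aws configure`
        credentials_file = Path.home() / ".aws" / "credentials"

    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"
    if Path(credentials_file).is_file():
        return "shared credentials file"
    return None


source = find_aws_credentials()
if source:
    print(f"Credentials found in: {source}")
else:
    print("No credentials found; use Option B or C.")
```

If this reports no credentials, Option B or Option C below sets them for the current notebook session only.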

### Option B: Set credentials in notebook (quick but less secure)

In [None]:
# WARNING: Don't commit this notebook with credentials!
# Add *.ipynb to .gitignore if needed

import os

os.environ['AWS_ACCESS_KEY_ID'] = 'PASTE-YOUR-KEY-HERE'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'PASTE-YOUR-SECRET-HERE'
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'

### Option C: Prompt for credentials (keeps keys out of the notebook file)

In [None]:
# This will prompt you to paste credentials (input hidden)

import os
from getpass import getpass

os.environ['AWS_ACCESS_KEY_ID'] = getpass('AWS Access Key ID: ')
os.environ['AWS_SECRET_ACCESS_KEY'] = getpass('AWS Secret Access Key: ')
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'

print('✅ Credentials set!')

## 2. Import Library

In [None]:
from immigration_corpus import (
    load_data,
    search_term,
    get_term_counts,
    filter_by_date,
    filter_by_source,
    get_stats,
    export_to_json
)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Notebook settings
%matplotlib inline
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

## 3. Load Data

In [None]:
# Load the first 3 files (~19K articles)
# Adjust files=[0, 1, 2] to load more or fewer files
df = load_data(files=[0, 1, 2])

print(f"\nLoaded {len(df):,} articles")
df.head()

## 4. Explore Dataset

In [None]:
# Dataset statistics
stats = get_stats(df)

print(f"Total articles: {stats['total_articles']:,}")
print(f"Unique sources: {stats['unique_sources']}")
print(f"Date range: {stats['date_range'][0]} to {stats['date_range'][1]}")
print("\nTop 5 sources:")
for source, count in list(stats['top_sources'].items())[:5]:
    print(f"  {source}: {count:,}")

In [None]:
# Visualize source distribution
top_sources = pd.Series(stats['top_sources']).head(10)

plt.figure(figsize=(10, 6))
top_sources.plot(kind='barh')
plt.xlabel('Number of Articles')
plt.title('Top 10 News Sources')
plt.tight_layout()
plt.show()

## 5. Term Analysis

In [None]:
# Compare terminology usage
terms = ['illegal alien', 'undocumented immigrant']
counts = get_term_counts(df, terms)

# Use a separate name so the `stats` dict from Section 4 is not overwritten
for term, term_stats in counts.items():
    print(f"{term}: {term_stats['count']:,} articles ({term_stats['percentage']:.2f}%)")
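
To make the `count` and `percentage` fields concrete, here is a minimal sketch of one way article-level term counts could be computed. It assumes a case-insensitive substring match and counts each article at most once. The matching rules `get_term_counts` actually uses are not documented in this template, so treat `count_articles_with_term` as a hypothetical stand-in rather than the library's implementation. The sample texts are made up.

```python
def count_articles_with_term(texts, term):
    """Count texts containing `term` (case-insensitive substring match).

    Each text is counted once, no matter how often the term appears in it.
    """
    needle = term.lower()
    count = sum(1 for text in texts if needle in text.lower())
    percentage = 100 * count / len(texts) if texts else 0.0
    return {"count": count, "percentage": percentage}


# Made-up sample texts, for illustration only
sample_texts = [
    "The bill refers to an illegal alien in section two.",
    "Advocates prefer the phrase undocumented immigrant.",
    "A report on border policy and visas.",
    "One ILLEGAL ALIEN case was cited; another illegal alien case followed.",
]

for sample_term in ["illegal alien", "undocumented immigrant"]:
    result = count_articles_with_term(sample_texts, sample_term)
    print(f"{sample_term}: {result['count']} articles ({result['percentage']:.2f}%)")
```

Because the percentage is a share of articles rather than a raw number of mentions, it stays comparable across subsets of different sizes.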

In [None]:
# Visualize term comparison
term_data = pd.DataFrame({
    'Term': list(counts.keys()),
    'Count': [counts[t]['count'] for t in counts.keys()]
})

plt.figure(figsize=(8, 5))
plt.bar(term_data['Term'], term_data['Count'])
plt.ylabel('Number of Articles')
plt.title('Immigration Terminology Comparison')
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()

## 6. Filter and Analyze by Source

In [None]:
# Compare conservative vs liberal sources
conservative = filter_by_source(df, ['breitbart.com', 'foxnews.com'])
liberal = filter_by_source(df, ['huffpost.com', 'cnn.com'])

print(f"Conservative sources: {len(conservative):,} articles")
print(f"Liberal sources: {len(liberal):,} articles")

# Compare term usage
cons_counts = get_term_counts(conservative, terms)
lib_counts = get_term_counts(liberal, terms)

print("\nConservative sources:")
for term in terms:
    print(f"  {term}: {cons_counts[term]['percentage']:.2f}%")

print("\nLiberal sources:")
for term in terms:
    print(f"  {term}: {lib_counts[term]['percentage']:.2f}%")
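
The two printed lists can be hard to compare by eye. As a rough sketch, the gap between groups can be expressed in percentage points per term. `percentage_gaps` is a hypothetical helper, and the example dictionaries hold made-up numbers in the same shape the cell above reads from `get_term_counts` (a `count` and a `percentage` per term).

```python
def percentage_gaps(group_a, group_b, terms):
    """Return {term: group_a % minus group_b %} in percentage points."""
    return {
        term: group_a[term]["percentage"] - group_b[term]["percentage"]
        for term in terms
    }


# Made-up numbers, for illustration only
example_a = {
    "illegal alien": {"count": 30, "percentage": 6.0},
    "undocumented immigrant": {"count": 10, "percentage": 2.0},
}
example_b = {
    "illegal alien": {"count": 5, "percentage": 1.0},
    "undocumented immigrant": {"count": 40, "percentage": 8.0},
}

gaps = percentage_gaps(
    example_a, example_b, ["illegal alien", "undocumented immigrant"]
)
for gap_term, gap in gaps.items():
    print(f"{gap_term}: {gap:+.2f} percentage points")
```

With the real data, passing `cons_counts`, `lib_counts`, and `terms` would give one signed number per term. A positive value means the term appears in a larger share of the first group's articles.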

## 7. Your Analysis

Add your custom analysis below:

In [None]:
# Your code here


## 8. Export Results (Optional)

In [None]:
# Export filtered results for further analysis
# results = search_term(df, 'illegal alien')
# export_to_json(results, 'illegal_alien_articles.jsonl')
# results.to_csv('my_analysis.csv', index=False)
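
The `.jsonl` extension in the cell above suggests the JSON Lines layout, with one JSON object per line. Whether `export_to_json` writes exactly this layout is an assumption. The sketch below uses only the standard library to show the format and how such a file can be read back. `write_jsonl`, `read_jsonl`, and the sample records are hypothetical.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records, for illustration only
records = [
    {"source": "example.com", "title": "First"},
    {"source": "example.org", "title": "Second"},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(records, path)
round_trip = read_jsonl(path)
print(f"Read back {len(round_trip)} records; identical: {round_trip == records}")
```

A line-per-record file can be read incrementally, which helps when an export is too large to load in one pass.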