# NLP Basics: News Headlines Analysis

Welcome to your first NLP homework! In this notebook, you will work with a dataset of news headlines and learn basic text analysis techniques.

**Instructions:**
- Sections marked **COMPLETED** are examples for you to study
- Sections marked **TODO** are for you to complete
- Run each cell in order
- Add your code where you see `# YOUR CODE HERE`

Let's get started!

## 1. Introduction & Data Loading (COMPLETED)

First, we need to import the libraries we'll use and load our dataset.

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# Set visualization style
plt.style.use('ggplot')
%matplotlib inline

In [None]:
# Load the dataset
df = pd.read_csv('news_headlines_dataset.csv')

# Display first 10 rows
print("First 10 headlines:")
print(df.head(10))

In [None]:
# Get basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"\nColumn names: {list(df.columns)}")
print("\nData types:")
print(df.dtypes)
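
Before moving on, it can help to confirm that the dataset has the columns the later cells refer to (`headline`, `category`, `word_count`) and that none of them contain missing values. The sketch below runs the check on a tiny made-up DataFrame called `toy_df`, which is an assumption for illustration and not part of the real dataset. You could run the same check on `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset; these rows are invented for illustration.
toy_df = pd.DataFrame({
    "headline": ["Team wins championship final", "New phone released today"],
    "category": ["Sports", "Technology"],
    "word_count": [4, 4],
})

# Columns that later cells in this notebook refer to.
expected_columns = ["headline", "category", "word_count"]
missing_columns = [col for col in expected_columns if col not in toy_df.columns]
print(f"Missing columns: {missing_columns}")

# Count missing values per column; missing headlines can break string operations later.
missing_values = toy_df.isna().sum()
print(missing_values)
```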

## 2. Exploring the Data (TODO)

Now it's your turn! Let's explore the dataset to understand what we have.

In [None]:
# TODO: How many headlines are in each category?
# Hint: Use .value_counts() on the 'category' column

# YOUR CODE HERE


In [None]:
# TODO: Create a bar chart showing the number of headlines per category
# Hint: Use .value_counts().plot(kind='bar')

# YOUR CODE HERE

plt.title('Headlines per Category')
plt.xlabel('Category')
plt.ylabel('Number of Headlines')
plt.show()

In [None]:
# TODO: What is the average word count across all headlines?
# Hint: Use .mean() on the 'word_count' column

# YOUR CODE HERE


## 3. Text Basics (COMPLETED)

Let's learn how to work with text data. We'll start with a single headline.

In [None]:
# Get the first headline
example_headline = df['headline'].iloc[0]
print(f"Original headline: {example_headline}")
print(f"Type: {type(example_headline)}")

In [None]:
# Convert to lowercase
lowercase_headline = example_headline.lower()
print(f"Lowercase: {lowercase_headline}")

In [None]:
# Split on whitespace into words (a simple form of tokenization)
words = example_headline.split()
print(f"Words: {words}")
print(f"Number of words: {len(words)}")
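
One thing to keep in mind: `.split()` only splits on whitespace, so punctuation stays attached to the words. The sketch below shows this on an invented headline (`sample` is not taken from the dataset) and compares it with one possible alternative, a small regular expression that keeps only letters, digits, and apostrophes. This is just one simple option, not the only way to tokenize.

```python
import re

# Invented example headline, not taken from the dataset.
sample = "Breaking: Markets rally, investors cheer!"

# Whitespace splitting keeps punctuation attached to the words.
whitespace_tokens = sample.lower().split()
print(whitespace_tokens)
# ['breaking:', 'markets', 'rally,', 'investors', 'cheer!']

# A simple regex alternative: keep runs of lowercase letters, digits, and apostrophes.
regex_tokens = re.findall(r"[a-z0-9']+", sample.lower())
print(regex_tokens)
# ['breaking', 'markets', 'rally', 'investors', 'cheer']
```

With whitespace splitting, `'rally,'` and `'rally'` would be counted as two different words, which matters once we start counting frequencies.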

## 4. Working with All Headlines (TODO)

Now let's apply these operations to all headlines in the dataset.

In [None]:
# TODO: Create a new column with all headlines in lowercase
# Hint: Use .str.lower() on the 'headline' column

# YOUR CODE HERE

# Display first 5 headlines
print(df[['headline', 'headline_lower']].head())

In [None]:
# TODO: Count the total number of words across ALL headlines
# Hint: Sum the 'word_count' column

# YOUR CODE HERE


In [None]:
# TODO: Find the longest and shortest headlines
# Hint: Use .idxmax() and .idxmin() on 'word_count', then use .loc[] to get the headlines

# YOUR CODE HERE


## 5. Word Frequency - Introduction (COMPLETED)

Let's count how often specific words appear in our headlines.

In [None]:
# Combine all headlines into one big text (convert to lowercase first)
all_text = ' '.join(df['headline'].str.lower())
print(f"Total characters: {len(all_text)}")
print(f"First 200 characters: {all_text[:200]}")

In [None]:
# Split into individual words
all_words = all_text.split()
print(f"Total words: {len(all_words)}")
print(f"First 20 words: {all_words[:20]}")

In [None]:
# Count how many times specific words appear
print(f"Count of 'the': {all_words.count('the')}")
print(f"Count of 'and': {all_words.count('and')}")
print(f"Count of 'in': {all_words.count('in')}")
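
Note that we counted in the list `all_words`, not in the string `all_text`. The difference matters: counting in a list matches whole words, while counting in a string matches any substring. Here is a small sketch on an invented sentence (`toy_text` is an assumption for illustration):

```python
# Invented sentence for illustration, not taken from the dataset.
toy_text = "the weather in athens affects the theatre season"
toy_words = toy_text.split()

# Counting in the list matches whole tokens only.
list_count = toy_words.count("the")
print(f"Whole-word count: {list_count}")    # 2

# Counting in the string also matches 'the' inside 'weather', 'athens', and 'theatre'.
string_count = toy_text.count("the")
print(f"Substring count: {string_count}")   # 5
```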

## 6. Most Common Words (TODO)

Instead of counting words manually, let's find the most common words automatically.

In [None]:
# TODO: Count ALL words and find the 15 most common
# Hint: Counter is already imported at the top. Use Counter(all_words) and .most_common(15)

# YOUR CODE HERE
# word_counts = Counter(all_words)
# most_common = word_counts.most_common(15)

# Print the results
# for word, count in most_common:
#     print(f"{word}: {count}")

In [None]:
# TODO: Create a bar chart of the 15 most common words
# Hint: Extract words and counts from most_common, then use plt.bar()

# YOUR CODE HERE: extract the words and the counts from most_common

plt.figure(figsize=(12, 6))
# YOUR CODE HERE: draw the bars with plt.bar()
plt.title('Top 15 Most Common Words')
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

**Question:** What do you notice about the most common words? Are they meaningful?

*YOUR ANSWER HERE:* 

## 7. Words by Category (COMPLETED)

Different categories might use different words. Let's analyze Sports headlines.

In [None]:
# Get all Sports headlines (convert to lowercase)
sports_headlines = df[df['category'] == 'Sports']['headline'].str.lower()
print(f"Number of Sports headlines: {len(sports_headlines)}")
print("\nFirst 5 Sports headlines:")
print(sports_headlines.head())

In [None]:
# Combine all Sports headlines
sports_text = ' '.join(sports_headlines)
sports_words = sports_text.split()

# Count words
sports_counter = Counter(sports_words)
sports_top10 = sports_counter.most_common(10)

print("Top 10 words in Sports headlines:")
for word, count in sports_top10:
    print(f"{word}: {count}")
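
The steps above (lowercase, join, split, count) can also be wrapped in a small reusable function. The sketch below is one possible way to do it; the name `top_words` and the toy headlines are assumptions for illustration and do not come from the dataset.

```python
from collections import Counter

def top_words(headlines, n=10):
    """Return the n most common lowercase, whitespace-separated words."""
    words = " ".join(h.lower() for h in headlines).split()
    return Counter(words).most_common(n)

# Invented headlines for illustration.
toy_headlines = [
    "Team wins the cup",
    "Coach praises the team",
    "The team returns home",
]
toy_top = top_words(toy_headlines, n=2)
print(toy_top)
```

Because the function only needs something it can loop over, it should also accept a pandas Series of headlines, such as the one selected for Sports above.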

## 8. Compare Categories (TODO)

Let's compare word usage across different categories.

In [None]:
# TODO: Find the top 10 words for each category
# Hint: Create a loop that goes through each category and repeats what we did for Sports

categories = ['Politics', 'Sports', 'Technology', 'Entertainment']

# YOUR CODE HERE


In [None]:
# TODO: Which words appear most in Technology but not in Sports?
# Hint: Get top words for Technology, then check if they appear in Sports top words

# YOUR CODE HERE


In [None]:
# TODO: Find one unique word for each category
# (A word that appears in one category but rarely or never in others)

# YOUR CODE HERE


## 9. Simple Visualization (TODO)

Let's create some visualizations to better understand our data.

In [None]:
# TODO: Create a histogram of word counts
# Hint: Use plt.hist() on the 'word_count' column

# YOUR CODE HERE

plt.title('Distribution of Word Counts')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()

In [None]:
# TODO: Create a pie chart of category distribution
# Hint: Use .value_counts().plot(kind='pie')

# YOUR CODE HERE

plt.title('Category Distribution')
plt.ylabel('')
plt.show()

In [None]:
# TODO: Do Technology headlines contain numbers (digits) more often than Sports headlines?
# Hint: Calculate the percentage of headlines with numbers for each category

# YOUR CODE HERE


## 10. Text Patterns (TODO - BONUS)

These are bonus exercises for extra practice!

In [None]:
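# TODO: Count how many headlines contain the word 'new'
# Hint: Use .str.contains() on the 'headline_lower' column (created in Section 4)
# Note: a plain substring match for 'new' also matches 'news'; the regex r'\bnew\b' matches the whole word only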

# YOUR CODE HERE


In [None]:
# TODO: Which category uses the word 'wins' most often?

# YOUR CODE HERE


In [None]:
# TODO: Find all headlines containing a keyword of your choice
# Pick any word you're interested in!

keyword = 'YOUR_KEYWORD_HERE'

# YOUR CODE HERE


## 11. Summary Questions

Answer these questions based on your analysis:

**1. What did you learn about news headlines from this analysis?**

*YOUR ANSWER HERE:*

**2. Which category has the longest headlines on average?**

*YOUR ANSWER HERE:*

In [None]:
# Calculate average word count by category to answer question 2
# YOUR CODE HERE


**3. What patterns did you notice? (e.g., which words appear in which categories?)**

*YOUR ANSWER HERE:*

**4. What was the most surprising thing you discovered?**

*YOUR ANSWER HERE:*

---

## Congratulations!

You've completed your first NLP analysis! You've learned how to:
- Load and explore text data
- Perform basic text processing (lowercase, tokenization)
- Count word frequencies
- Compare text across categories
- Create visualizations of text data

These are fundamental skills for Natural Language Processing!