# Web Scraper + Regular Expression Project

## Project Introduction
This project scrapes the text of Martin Luther King Jr.'s 1963 "I Have a Dream" speech from an external website.
Using the requests and BeautifulSoup libraries, it fetches the page, extracts the speech text, cleans it by removing punctuation and converting it to lowercase,
and then counts how often each word occurs.

## Code Functionality

- The code begins by importing necessary libraries for web scraping and data processing.
- It defines the URL and fetches the content from that URL.
- The HTML is parsed to extract the speech text from paragraph elements.
- The text is cleaned, analyzed for word frequency, and saved to a CSV file for later use.




In [80]:
from bs4 import BeautifulSoup  # Import BeautifulSoup for HTML parsing
import requests  # Import requests for fetching the web page
import pandas as pd  # Import pandas for data manipulation
import re  # Import regex for text processing

In [7]:
# URL of the page hosting Martin Luther King Jr.'s speech
url = "http://www.analytictech.com/mb021/mlk.htm"

# Fetch the page; fail fast on a slow server or an HTTP error status
page = requests.get(url, timeout=30)
page.raise_for_status()

# Parse the HTML with Python's built-in parser (no extra dependency needed)
soup = BeautifulSoup(page.text, 'html.parser')

# Find every <p> element; on this page the paragraphs hold the speech text
mlkj_spech = soup.find_all('p')

# Collect the text of each paragraph into a list of strings
spech_combined = [p.text for p in mlkj_spech]
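
To see what paragraph extraction yields without touching the network, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup. It is not how BeautifulSoup works internally; the `ParagraphCollector` class and the `sample_html` string are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # finished paragraph strings
        self._inside_p = False  # True while we are between <p> and </p>
        self._chunks = []       # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self.paragraphs.append("".join(self._chunks))
            self._inside_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph.
        if self._inside_p:
            self._chunks.append(data)


# Made-up HTML: one heading (ignored) and two paragraphs (collected).
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First paragraph.</p>"
    "<p>Second <b>bold</b> paragraph.</p>"
    "</body></html>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()
paragraphs = collector.paragraphs
```

As with `find_all('p')` followed by `.text`, the heading is skipped and the text of nested tags is kept, so `paragraphs` holds one plain string per `<p>` element.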

In [84]:
# This will display the list of paragraphs
print(spech_combined)

['I am happy to join with you today in what will go down in\r\nhistory as the greatest demonstration for freedom in the history\r\nof our nation. ', 'Five score years ago a great American in whose symbolic shadow\r\nwe stand today signed the Emancipation Proclamation. This\r\nmomentous decree came as a great beckoning light of hope to\r\nmillions of Negro slaves who had been seared in the flames of\r\nwithering injustice. It came as a joyous daybreak to end the long\r\nnight of their captivity. ', 'But one hundred years later the Negro is still not free. One\r\nhundred years later the life of the Negro is still sadly crippled\r\nby the manacles of segregation and the chains of discrimination. ', 'One hundred years later the Negro lives on a lonely island of\r\npoverty in the midst of a vast ocean of material prosperity. ', 'One hundred years later the Negro is still languishing in the\r\ncomers of American society and finds himself in exile in his own\r\nland. ', "We all have come to t

In [82]:
# Join the list into a single string
string_spech = " ".join(spech_combined)

# Clean the string by replacing newline characters with a space
string_spech_cleaned = string_spech.replace('\r\n', ' ')

# Remove punctuation from the cleaned string
spech_no_punt = re.sub(r'[^\w\s]', '', string_spech_cleaned)

# Convert the string to lowercase
spech_lower = spech_no_punt.lower()

# Split on whitespace; str.split() with no arguments also discards the
# empty strings that leading or trailing whitespace would otherwise produce
spech_broken_out = spech_lower.split()
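
The cleaning steps are easier to follow on a short input. The sketch below runs the same sequence (flatten line breaks, strip punctuation, lowercase, tokenize) on a made-up `sample` string, and compares two ways of tokenizing: `re.split(r"\s+", ...)` keeps an empty string when the text starts or ends with whitespace, while `str.split()` does not. The sample text and variable names are assumptions for illustration.

```python
import re

# Hypothetical sample that mimics the scraped text, including the
# "\r\n" line break and the trailing space at the end of a paragraph.
sample = "I have a dream.\r\nIt's a dream, deeply rooted! "

flat = sample.replace("\r\n", " ")        # line breaks become spaces
no_punct = re.sub(r"[^\w\s]", "", flat)   # drop everything but word chars and whitespace
lowered = no_punct.lower()                # normalize case

tokens_regex = re.split(r"\s+", lowered)  # trailing whitespace leaves an empty token
tokens_split = lowered.split()            # empty tokens are discarded
```

Note that removing the apostrophe turns "It's" into "its", so contractions are merged with any identically spelled word. An empty token, if kept, would be counted as a "word" in the frequency table.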

In [73]:
# Count the occurrences of each word; value_counts() returns a Series
# sorted by count in descending order
df = pd.DataFrame(spech_broken_out).value_counts()
print(df)

0      
the        54
of         49
to         29
and        27
a          20
           ..
jews        1
joyous      1
judged      1
land        1
lookout     1
Name: count, Length: 323, dtype: int64
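
The counting step can be cross-checked without pandas. This is a minimal sketch using `collections.Counter` from the standard library on a made-up word list; it is an alternative to `value_counts()`, not the method used in this notebook.

```python
from collections import Counter

# Hypothetical token list standing in for the cleaned speech words.
words = "the dream of the people and the dream of freedom".split()

counts = Counter(words)            # maps each word to its number of occurrences
top_two = counts.most_common(2)    # (word, count) pairs, highest count first
distinct_words = len(counts)       # comparable to the "Length" in the pandas output
```

`counts.most_common()` with no argument returns every word, ordered from most to least frequent, which is the same ordering `value_counts()` uses by default.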

In [None]:
# Save the word counts to a CSV file for further analysis
df.to_csv('/Users/munirahalzuman/Documents/MLKJ_speach_Count.csv', index_label='word')
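
To illustrate the layout of the saved file, the sketch below writes a small word-count table with a `word,count` header and reads it back. It uses the standard library's `csv` module and a temporary directory rather than `to_csv` and a fixed path; the file name `word_counts.csv` and the sample words are hypothetical.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical counts standing in for the speech word frequencies.
counts = Counter("the dream of the people".split())

# Write to a temporary directory so the sketch does not depend on any
# particular user's folder layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "word_counts.csv"  # hypothetical file name
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["word", "count"])        # header row
        writer.writerows(counts.most_common())    # one row per word

    # Read the file back to confirm the layout.
    with out_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
```

Values read back from a CSV file are strings, so the counts need converting with `int()` before any further arithmetic.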