# Text Collection and Pre-Processing
We want to collect and pre-process some textual data. In particular, you should do the following:
- Scrape the Wikipedia page of [Christopher Nolan](https://en.wikipedia.org/wiki/Christopher_Nolan) using the `requests` library.
- Extract and clean up the title and first few paragraphs of the web page using the `Beautiful Soup` library.
- Tokenize sentences and words using either the `NLTK` or `spaCy` library.  
- Stem and lemmatize words using either the `NLTK` or `spaCy` library.


## Importing Modules

In [1]:
import bs4
import nltk
import requests

## Getting the HTML Page

In [2]:
url = "https://en.wikipedia.org/wiki/Christopher_Nolan"
r = requests.get(url)
r.status_code

200
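
The request above relies on the defaults of `requests.get`. A slightly more defensive version might look like the sketch below. The helper name `fetch_html`, the 10-second timeout, the User-Agent string, and the optional `session` parameter are illustrative assumptions, not part of the original notebook. The sketch sets a timeout so the call cannot hang indefinitely, identifies the client (Wikimedia asks automated clients to send a descriptive User-Agent), and raises an exception on an error status instead of relying on a manual check of `status_code`.

```python
import requests

# Hypothetical identifier; replace with something that describes your own script.
USER_AGENT = "text-preprocessing-notebook/0.1 (educational example)"


def fetch_html(url, timeout=10.0, session=None):
    """Return the body of `url` as text, raising on HTTP errors.

    `session` is optional: pass a requests.Session to reuse connections,
    otherwise the module-level requests.get is used.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout, headers={"User-Agent": USER_AGENT})
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html("https://en.wikipedia.org/wiki/Christopher_Nolan")` would return the same HTML that `r.text` holds above.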

## Parsing the HTML Page

In [4]:
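# Parse the raw HTML with Python's built-in parser.
parsed_html = bs4.BeautifulSoup(r.text, "html.parser")
# The article title is the text of the <h1 class="firstHeading"> element.
title = parsed_html.find("h1", {"class": "firstHeading"}).get_text()
# Body paragraphs are the <p> tags inside the main content <div>.
paragraphs = parsed_html.find("div", {"class": "mw-parser-output"}).find_all("p")
for p in paragraphs:
    print(p.get_text())
    print("--------------")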



--------------
Christopher Edward Nolan CBE (/ˈnoʊlən/; born 30 July 1970) is a British-American film director, producer, and screenwriter. His films have grossed more than US$5 billion worldwide, and have garnered 11 Academy Awards from 36 nominations. 

--------------
Born and raised in London, Nolan developed an interest in filmmaking from a young age. After studying English literature at University College London, he made his feature debut with Following (1998). Nolan gained international recognition with his second film, Memento (2000), for which he was nominated for the Academy Award for Best Original Screenplay. He transitioned from independent to studio filmmaking with Insomnia (2002), and found further critical and commercial success with The Dark Knight Trilogy (2005–2012), The Prestige (2006), and Inception (2010), which received eight Oscar nominations, including for Best Picture and Best Original Screenplay. This was followed by Interstellar (2014), Dunkirk (2017), and T
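
The printed output shows two things worth cleaning up: the first `<p>` element is empty, and the extracted text can contain non-breaking spaces (`\xa0`). Wikipedia paragraphs also often carry citation markers such as `[1]`. The sketch below shows one possible clean-up pass using only the standard library. The function names `clean_paragraph` and `clean_paragraphs` and the citation pattern are assumptions made for illustration; the sample strings are shortened stand-ins for `p.get_text()` results.

```python
import re

# Matches numeric citation markers such as "[1]" or "[23]" (an assumed pattern).
CITATION_MARKER = re.compile(r"\[\d+\]")


def clean_paragraph(text):
    """Strip citation markers and normalize all whitespace to single spaces."""
    text = CITATION_MARKER.sub("", text)
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the non-breaking space "\xa0" and newlines.
    return " ".join(text.split())


def clean_paragraphs(raw_texts):
    """Clean every paragraph and drop the ones that end up empty."""
    cleaned = (clean_paragraph(text) for text in raw_texts)
    return [text for text in cleaned if text]


# Stand-ins for the strings returned by p.get_text().
raw = ["\n", "His films have grossed more than US$5\xa0billion worldwide.[1][2]\n"]
print(clean_paragraphs(raw))
```

Applied to the notebook's data, this would be called as `clean_paragraphs(p.get_text() for p in paragraphs)`, after which the lead paragraph sits at index 0 rather than index 1.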

## Sentence Segmentation and Word Tokenization

In [6]:
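# The first <p> printed above is blank, so the lead paragraph is at index 1.
text = paragraphs[1].get_text()
# Split the paragraph into sentences, then the first sentence into word tokens.
sentences = nltk.tokenize.sent_tokenize(text)
words = nltk.tokenize.word_tokenize(sentences[0])
print(sentences)
print(words)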

['Christopher Edward Nolan CBE (/ˈnoʊlən/; born 30 July 1970) is a British-American film director, producer, and screenwriter.', 'His films have grossed more than US$5\xa0billion worldwide, and have garnered 11 Academy Awards from 36 nominations.']
['Christopher', 'Edward', 'Nolan', 'CBE', '(', '/ˈnoʊlən/', ';', 'born', '30', 'July', '1970', ')', 'is', 'a', 'British-American', 'film', 'director', ',', 'producer', ',', 'and', 'screenwriter', '.']
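
The token list keeps punctuation marks such as `(`, `;`, and `.` as separate tokens. Whether to keep them depends on the downstream task. As a minimal sketch, assuming punctuation-only tokens are not needed, the filter below keeps any token that contains at least one letter or digit. The helper name `has_alnum` is hypothetical, and the token list is copied from the output above.

```python
# Tokens copied from the word_tokenize output above.
words = ['Christopher', 'Edward', 'Nolan', 'CBE', '(', '/ˈnoʊlən/', ';', 'born',
         '30', 'July', '1970', ')', 'is', 'a', 'British-American', 'film',
         'director', ',', 'producer', ',', 'and', 'screenwriter', '.']


def has_alnum(token):
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


# Drops '(', ';', ')', ',' and '.', but keeps hyphenated words and the IPA token.
content_words = [token for token in words if has_alnum(token)]
print(content_words)
```

Testing for any alphanumeric character, rather than comparing against a fixed punctuation list, keeps tokens like `British-American` and `/ˈnoʊlən/` intact.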


## Stemming

In [7]:
stemmer = nltk.stem.PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
stemmed_words

['christoph',
 'edward',
 'nolan',
 'cbe',
 '(',
 '/ˈnoʊlən/',
 ';',
 'born',
 '30',
 'juli',
 '1970',
 ')',
 'is',
 'a',
 'british-american',
 'film',
 'director',
 ',',
 'produc',
 ',',
 'and',
 'screenwrit',
 '.']
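
As the output shows, the Porter stemmer lowercases its input and strips suffixes by rule, so results such as `juli` or `produc` are not necessarily dictionary words. The point of stemming is that related word forms collapse to a shared key. The sketch below illustrates this by grouping a few forms by their stem. Only `producer` comes from the sentence above; the other word forms are made up for the example.

```python
from collections import defaultdict

import nltk

stemmer = nltk.stem.PorterStemmer()

# "producer" appears in the sentence above; the other forms are illustrative.
candidates = ["producer", "produced", "producing", "produces"]

# Map each stem to the list of surface forms that reduce to it.
groups = defaultdict(list)
for word in candidates:
    groups[stemmer.stem(word)].append(word)

print(dict(groups))
```

Grouping by stem like this is a common way to count or index related word forms together, at the cost of occasionally merging words that are not actually related.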

## Lemmatization

In [8]:
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
lemmatized_words

['Christopher',
 'Edward',
 'Nolan',
 'CBE',
 '(',
 '/ˈnoʊlən/',
 ';',
 'born',
 '30',
 'July',
 '1970',
 ')',
 'is',
 'a',
 'British-American',
 'film',
 'director',
 ',',
 'producer',
 ',',
 'and',
 'screenwriter',
 '.']
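
The lemmatized list is identical to the input tokens. One reason is that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so a verb such as `is` is left unchanged. A common workaround is to translate Penn Treebank tags (the kind `nltk.pos_tag` returns) into the single-letter WordNet categories. The helper below is a minimal sketch of that mapping. The name `penn_to_wordnet` is hypothetical, and the mapping by first letter is a simplification.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter.

    WordNet uses 'a' (adjective), 'v' (verb), 'r' (adverb) and 'n' (noun).
    Anything that is not recognized falls back to noun, which is also the
    lemmatizer's own default.
    """
    if tag.startswith("J"):
        return "a"
    if tag.startswith("V"):
        return "v"
    if tag.startswith("RB"):
        return "r"
    return "n"


# A few (word, tag) pairs in the format produced by nltk.pos_tag.
tagged = [("is", "VBZ"), ("British-American", "JJ"), ("film", "NN")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With tagged tokens available, the lemmatizer call would become `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair.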

## POS Tagging

In [12]:
pos_tags = nltk.pos_tag(words)
pos_tags

[('Christopher', 'NNP'),
 ('Edward', 'NNP'),
 ('Nolan', 'NNP'),
 ('CBE', 'NNP'),
 ('(', '('),
 ('/ˈnoʊlən/', 'NNP'),
 (';', ':'),
 ('born', 'VBD'),
 ('30', 'CD'),
 ('July', 'NNP'),
 ('1970', 'CD'),
 (')', ')'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('British-American', 'JJ'),
 ('film', 'NN'),
 ('director', 'NN'),
 (',', ','),
 ('producer', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('screenwriter', 'NN'),
 ('.', '.')]
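
The tags follow the Penn Treebank convention, for example `NNP` for proper nouns, `CD` for cardinal numbers, and `VBZ` for third-person singular present verbs. As a small sketch of how tagged output can be used, the snippet below counts tag frequencies and pulls out the proper nouns. The list is copied from the output above so the example runs without the tagger.

```python
from collections import Counter

# (token, tag) pairs copied from the nltk.pos_tag output above.
pos_tags = [('Christopher', 'NNP'), ('Edward', 'NNP'), ('Nolan', 'NNP'),
            ('CBE', 'NNP'), ('(', '('), ('/ˈnoʊlən/', 'NNP'), (';', ':'),
            ('born', 'VBD'), ('30', 'CD'), ('July', 'NNP'), ('1970', 'CD'),
            (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('British-American', 'JJ'),
            ('film', 'NN'), ('director', 'NN'), (',', ','), ('producer', 'NN'),
            (',', ','), ('and', 'CC'), ('screenwriter', 'NN'), ('.', '.')]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in pos_tags)
print(tag_counts)

# Tokens the tagger labeled as proper nouns.
proper_nouns = [token for token, tag in pos_tags if tag == "NNP"]
print(proper_nouns)
```

Note that the pronunciation token `/ˈnoʊlən/` is tagged `NNP` as well, a reminder that tagger output on unusual tokens is best treated as approximate.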