# Scraping the text from a webpage and creating a topic cloud

## Import Statements




In Python you can _import_ pieces of other people's code, called _modules_.

Here we're importing a piece of code that someone else already wrote for us.  Kind of like the Brandwatch API SDK, the wikipedia module is just a wrapper that makes it easier to access Wikipedia's API.  Similarly, mechanicalsoup is a bunch of code that someone else wrote that makes grabbing the text from a webpage very easy. 

For more info on the Wikipedia module read this: https://wikipedia.readthedocs.io/en/latest/

And read this for BeautifulSoup (the library MechanicalSoup is built on): http://beautiful-soup-4.readthedocs.io/en/latest/


Importing modules is great because it means we don't have to write that code ourselves!
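
The cells below use two different styles of import. Here is a small sketch of both styles using Python's built-in math module, which is only a stand-in so the example runs without installing anything.

```python
# Two common ways to import, shown with Python's built-in math module
import math                 # import the whole module
from math import sqrt       # import just one function from it

print(math.pi)              # use the module name as a prefix
print(sqrt(16))             # no prefix needed for names imported directly
```

The first style matches the wikipedia import below, and the second matches the way Browser is pulled out of mechanicalsoup.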

In [1]:
!pip install wikipedia
!pip install mechanicalsoup
!pip install matplotlib
!pip install wordcloud

Collecting wikipedia
  Downloading https://files.pythonhosted.org/packages/67/35/25e68fbc99e672127cc6fbb14b8ec1ba3dfef035bf1e4c90f78f24a80b7d/wikipedia-1.4.0.tar.gz
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/87/2a/18/4e471fd96d12114d16fe4a446d00c3b38fb9efcb744bd31f4a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
Collecting mechanicalsoup
  Downloading https://files.pythonhosted.org/packages/f6/6a/263f3e12d50e3272abf3842e13a3c991cda4af0f253e9c73a41d0b8387c3/MechanicalSoup-0.11.0-py2.py3-none-any.whl
Installing collected packages: mechanicalsoup
Successfully installed mechanicalsoup-0.11.0


In [0]:
import wikipedia
from mechanicalsoup import Browser

mech = Browser()

Now we can use that wikipedia module to search Wikipedia for a page. Let's say we want Brandwatch's page.

The following line does the same thing as typing a term into the search bar on Wikipedia. Here the search term is "buzzsumo", and Brandwatch's page shows up in the results.

In [4]:
wikipedia.search("buzzsumo")

['Nostalgia industry', 'Brandwatch']

Now let's use a _variable_ to save that information somewhere.  A variable is kind of like a label. 

Here we're creating a variable (or label) 'bw' and storing all the data with that label.  This is useful because we can refer to that data later with just the symbol bw instead of that whole big messy text.

In [0]:
bw = wikipedia.page("Brandwatch")
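
If the label idea feels abstract, here is a tiny sketch with plain text instead of a Wikipedia page. The name greeting and its contents are made up for illustration.

```python
# A variable is a name (label) attached to a piece of data
greeting = "Hello from a very long piece of text we don't want to retype"

# Use the label instead of repeating the text
print(greeting)
print(len(greeting))        # number of characters in the stored text

# Assigning again moves the label to new data
greeting = "Hi"
print(greeting)
```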

New variables often inherit attributes from the type of data they are storing. In this case the bw variable picks up a few useful attributes and functions from the page object that the command above returned. If you type bw followed by a period and then press the Tab key, you can see what is available.

Let's see what's in bw.summary.


In [0]:
bw.summary
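
Tab completion only works inside the notebook, so here is a rough sketch of another way to peek at what an object offers, using Python's built-in dir(). The ExamplePage class is a hypothetical stand-in we made up for this example; it is not the class the wikipedia module uses.

```python
class ExamplePage:
    """Hypothetical stand-in for a page object; not the wikipedia module's class."""
    def __init__(self, title, url, summary):
        self.title = title
        self.url = url
        self.summary = summary

example_page = ExamplePage("Example", "https://example.org/wiki/Example", "A short summary.")

# dir() lists every name on the object; skip the ones starting with an underscore
public_names = [name for name in dir(example_page) if not name.startswith("_")]
print(public_names)
```

Running dir(bw) in the notebook should work the same way, though the list will be longer.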

If you looked at a few of the attributes you might have noticed some useful ones. We're going to need the URL of the webpage later, so let's store it in another variable.

In [0]:
#Grab the URL of the Brandwatch page and assign it to the variable "url"
url = bw.url

## Let's grab the text from the webpage!

We've got everything we need to tell MechanicalSoup to crawl a webpage and give us some data!

Remember the cell where we created a Browser from mechanicalsoup and stored it in mech? We're going to use that now.

mech is going to _get_ everything from the url we picked, in this case Brandwatch's Wikipedia page.

Then it's going to find everything inside the _p_ (paragraph) tags on the html page and put that data into the variable matches. We're going to do the exact same thing for the _title_ tag and store the result in matches2.

In [0]:
#Crawl the url and pull out all the text

page = mech.get(url)
matches = page.soup.select("p")
matches2 = page.soup.select("title")
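
To get a feel for what "find everything inside the p tags" means, here is a minimal sketch that does a similar job on a made-up snippet of html. It uses Python's built-in html.parser instead of MechanicalSoup so it runs without a network connection. The class name TagTextCollector and the sample html are our own inventions, and this is not how MechanicalSoup works internally.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collects the text found inside <p> and <title> tags."""
    def __init__(self):
        super().__init__()
        self.current = None                      # tag we are currently inside, if any
        self.found = {"p": [], "title": []}      # text collected per tag

    def handle_starttag(self, tag, attrs):
        if tag in self.found:
            self.current = tag
            self.found[tag].append("")           # start a new entry for this tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.found[self.current][-1] += data

sample_html = (
    "<html><head><title>Sample page</title></head>"
    "<body><p>First paragraph.</p><div>Not a paragraph.</div>"
    "<p>Second paragraph.</p></body></html>"
)

collector = TagTextCollector()
collector.feed(sample_html)
collector.close()
print(collector.found["title"])
print(collector.found["p"])
```

The text inside the div is skipped because we only asked for p and title, which is the same kind of filtering the select calls above are doing.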


This part is a little bit of housekeeping, but we're going to drop all the html markup and keep only the text. Then we're going to store it in a text file and keep it for later. File I/O, or input/output is super useful! 
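
Here is a small sketch of a write-then-read round trip, assuming a throwaway file in a temporary folder (the file name example.txt is made up). It uses a with-block, which closes the file automatically when the block ends.

```python
import os
import tempfile

# Hypothetical file in a temporary folder, so nothing in your project is touched
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# "w" opens the file for writing; the with-block closes it automatically
with open(path, "w") as f:
    f.write("first line\n")
    f.write("second line\n")

# "r" opens the file for reading
with open(path, "r") as f:
    contents = f.read()

print(contents)
```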

You can read more about how to do that in Python here: https://docs.python.org/3/tutorial/inputoutput.html


In [0]:
#Drop everything into a text file
# The with-block closes the file for us once all the writing is done
with open('workfile', 'w') as f:
    # Paragraphs
    for match in matches:
        f.write(match.getText())
    # Titles
    for match in matches2:
        f.write(match.getText())

## Time for the fun stuff! Wordclouds!

Let's grab a couple more modules: this time wordcloud to make the topic clouds, and matplotlib to display them.

After we get everything imported we're going to read the text file we made earlier and store all the words in a variable. 

Next all we have to do is generate the wordcloud and plot it. Don't worry if you don't get how I knew _that_ was the way to use the function. Part of learning how to use new functions is reading documentation. Documentation is usually online and ranges from incredibly informative with lots of examples, to very technical and hard to understand, to just nonexistent. Don't worry, figuring out how to be an expert Googler is all part of learning how to write code.



In [0]:
%matplotlib inline

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Read the file in and merge all lines
with open('workfile', 'r') as f:
    words = f.read()

# Make the word cloud with the default font, leaving out common stop words
wordcloud = WordCloud(stopwords=STOPWORDS).generate(words)

plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('./cloud2.png', dpi=300)
plt.show()
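
If you're curious why some words come out bigger than others: a word cloud sizes words by how often they appear. Here is a rough sketch of that counting step using Python's built-in Counter on a made-up sentence. WordCloud does more than this (for example, it drops common stop words), so treat it as an illustration of the idea and not a copy of what the module does.

```python
from collections import Counter

# Made-up sample text for illustration
sample_text = "data tools make data useful and data fun"

# Lowercase the text, split on whitespace, and count each word
counts = Counter(sample_text.lower().split())

# The most frequent words are the ones a word cloud would draw largest
print(counts.most_common(3))
```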



###### We hope this was a good introduction to learning about writing code! Can't wait to see you all next week!

###### Cheers!
###### Paul, Amy and Ian