# Crawling Lesson Hints


### Marvel Cinematic Universe

##### First list
[List of all MCU Actors](https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_film_actors)

CSS selector: `th`

This page contains links to both characters and actors.

##### Second list
[List of all Marvel Comics characters](https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters)

CSS selector: `.hatnote`

This is a multi-page list of lists. Each article URL begins with `https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_` and ends with either a single letter from `A` to `Z` or `0-9`. Therefore, we must construct the URL of each individual list and do a multi-page retrieval.
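A minimal sketch of building that URL list, assuming the suffixes are exactly the ones described above. The variable names (`BASE`, `suffixes`, `urls`) are illustrative, and it is worth checking the suffixes against the live page titles before crawling:

```python
import string

# Prefix quoted in the text above
BASE = "https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_"

# Suffixes: "0-9" plus the letters A to Z (taken from the description above;
# the live page titles may differ, so treat these as an assumption)
suffixes = ["0-9"] + list(string.ascii_uppercase)

# One URL per individual list
urls = [BASE + suffix for suffix in suffixes]

print(len(urls))
print(urls[0])
print(urls[-1])
```

The resulting `urls` list can then be passed to whatever multi-page retrieval step the lesson provides.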

##### Visualization notes

This is an extremely dense graph because of the number of links between any two articles. An undirected graph (which sums the links back and forth between two articles into a single weighted edge) with a minimum weight greater than 1 helps cut down on clutter. In addition, a `spring_layout` creates some interesting groupings of individual articles.
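The flattening step described above can be sketched in plain Python. This is not the lesson's own implementation: the input format (a dict mapping `(source, target)` pairs to link counts), the function name `flatten_to_undirected`, and the reading of "minimum weight" as "keep edges whose weight is at least `min_weight`" are all assumptions made for illustration.

```python
from collections import Counter


def flatten_to_undirected(directed_links, min_weight=2):
    """Sum links in both directions into one weighted undirected edge,
    then drop edges whose summed weight is below min_weight."""
    weights = Counter()
    for (source, target), count in directed_links.items():
        if source == target:
            continue  # ignore self-links
        # Sorting the pair makes (A, B) and (B, A) share one key
        edge = tuple(sorted((source, target)))
        weights[edge] += count
    return {edge: w for edge, w in weights.items() if w >= min_weight}


# Hypothetical link counts between placeholder articles
directed_links = {
    ("A", "B"): 1,
    ("B", "A"): 1,
    ("A", "C"): 1,
    ("C", "B"): 3,
}

undirected = flatten_to_undirected(directed_links, min_weight=2)
print(undirected)
```

Here the two single links between `A` and `B` combine into one edge of weight 2 and survive the filter, while the lone `A` to `C` link is dropped.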

##### Data files

[`mcu_network.json`](./mcu_network.json)

### BET Hip Hop Award Winners

##### First list

[List of all Hip Hop musicians](https://en.wikipedia.org/wiki/List_of_hip_hop_musicians)

CSS selector: `li`

This page contains links to all hip hop musicians with a Wikipedia article.

##### Second list

[List of all BET Hip Hop awards](https://en.wikipedia.org/wiki/BET_Hip_Hop_Awards)

CSS selector: `li`

This page contains links to all musicians who have won a BET Hip Hop Award and the names of the works for which they won.

##### Crawling notes

Saving the crawled graph as an undirected graph and saving it as a directed graph can produce very different visualizations.

##### Visualization notes

When the graph is flattened as a directed graph and then visualized as a directed graph with a minimum weight of 2, a number of artists are pushed to the outside of the graph, where they form a ring.
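One way to investigate that ring is to look for artists that lose every edge at this threshold. In a directed graph, links in each direction are weighed separately, so two articles that each link to the other once do not reach a weight of 2. The sketch below assumes a dict of `(source, target)` link counts and hypothetical helper names. Whether the isolated nodes are exactly the ones drawn on the ring is an assumption to check against your own plot.

```python
def filter_directed(directed_links, min_weight=2):
    """Keep only directed edges whose own weight is at least min_weight."""
    return {edge: w for edge, w in directed_links.items() if w >= min_weight}


def isolated_nodes(directed_links, kept_edges):
    """Return nodes that appear in the crawl but in none of the kept edges."""
    all_nodes = {node for edge in directed_links for node in edge}
    connected = {node for edge in kept_edges for node in edge}
    return sorted(all_nodes - connected)


# Hypothetical link counts between placeholder artists
directed_links = {
    ("A", "B"): 2,
    ("B", "A"): 1,  # weighed separately from A -> B
    ("C", "A"): 1,
    ("D", "B"): 3,
}

kept = filter_directed(directed_links, min_weight=2)
isolated = isolated_nodes(directed_links, kept)
print(kept)
print(isolated)
```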

##### Data files

[`bet_directed.json`](./bet_directed.json) and [`bet_undirected.json`](./bet_undirected.json)

### Forbes 400


##### Crawling notes
The [Forbes 400 Wikipedia](https://en.wikipedia.org/wiki/List_of_members_of_the_Forbes_400) entry is extremely incomplete. To create the crawl list, you can instead text-mine the [Forbes 400](https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/#3c2f82d422f4) list directly. Each entry is contained inside a `strong` tag and matches a pattern that begins with a number followed by a period. To convert this data into a usable list of Wikipedia articles, use the following code:

```python
import re

from pyquery import PyQuery

# Make a list of URLs to mine (the list is split across four pages)
urls = [
    "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/#3381da7c22f4",
    "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/2/#1bf172cb7b17",
    "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/3/#5722e9247c58",
    "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/4/#3262ecd31473",
]

strongs = []

# Get each "strong" HTML tag from each URL
for url in urls:
    strongs.extend(PyQuery(url=url)("strong"))

# Use regular expressions to do the heavy lifting.
# This regex matches one or more digits followed by a period and a space,
# then accepts the rest of the string
regex = re.compile(r"^\d+\. .+")

# Use another regex to delete the list number, then replace any spaces with underscores
forbes_400 = [
    "/wiki/" + re.sub(r"^\d+\. ", "", strong.text).replace(" ", "_")
    for strong in strongs
    if strong.text and regex.match(strong.text)
]

print(forbes_400)
```

You may then perform the crawl with `max_articles=400`.

_*Note*_: Not every generated path will correspond to a real Wikipedia article, so the crawl may produce some error messages.
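The name-to-path conversion can be checked offline before running the full crawl. The sketch below applies the same pattern to a few illustrative strings; the helper name `to_wiki_path` and the sample inputs are made up for this example and are not scraped from Forbes.

```python
import re

# Same pattern as the crawl-list code: a rank, a period, a space, then the name
ENTRY = re.compile(r"^\d+\. (.+)")


def to_wiki_path(text):
    """Return a /wiki/ path for a ranked entry like '12. Some Name', else None."""
    match = ENTRY.match(text or "")
    if match is None:
        return None
    return "/wiki/" + match.group(1).strip().replace(" ", "_")


# Illustrative inputs: two ranked entries, one unrelated heading, one empty tag
samples = ["1. Bill Gates", "2. Jeff Bezos", "The Full List", None]

paths = [path for path in map(to_wiki_path, samples) if path]
print(paths)
```

Strings that do not start with a rank are skipped, which is how unrelated `strong` tags on the page get filtered out.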

##### Visualization notes

When creating the graph, setting `minimum_weight=3` produces clusters of individuals with business or family ties to each other.
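To list those clusters rather than only look at them, one option is to compute the connected components that remain after the weight filter. This is a plain-Python sketch, not the lesson's graph code: the edge format, the function name `clusters`, and treating `minimum_weight` as "keep edges with weight at least this value" are assumptions.

```python
from collections import defaultdict


def clusters(weighted_edges, minimum_weight=3):
    """Group nodes into connected components using only edges whose
    weight is at least minimum_weight."""
    adjacency = defaultdict(set)
    for (a, b), weight in weighted_edges.items():
        if weight >= minimum_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    groups = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from start
        stack = [start]
        group = set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical undirected edge weights between placeholder people
weighted_edges = {
    ("A", "B"): 4,
    ("B", "C"): 3,
    ("C", "D"): 1,
    ("D", "E"): 5,
    ("F", "A"): 2,
}

groups = clusters(weighted_edges, minimum_weight=3)
print(groups)
```

People whose only edges fall below the threshold (`F` in this example) do not appear in any cluster.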

##### Data files

[`forbes_400.json`](./forbes_400.json)