## BBC project: process, hints, and recipes

The major challenge of the BBC project is to transform the list of critics and movies into searchable Python lists and/or dictionaries. The most difficult part of the project comes first: scraping the page on the BBC site and, using Beautiful Soup and regular expressions, building a data set you can work with.

Once you have the data set, you will be in good shape going forward--the goal after that will be to search for interesting patterns (top movies by country/critic/director/year)--this is the conceptual work you need to be thinking about while you struggle through wrangling your data.

So, how do I wrangle this data? That is the central challenge you'll be dealing with through Wednesday of this week. The HTML page on the BBC site poses a number of challenges. The layout is relatively simple and consistent, but that simplicity actually makes it a little harder, because there aren't many HTML tags to help you isolate each unit of data. You can use Beautiful Soup to isolate the line that contains all the information for a critic, and you can isolate each group of top 10 movies as well. The harder part is using Beautiful Soup to find each critic together with the list of movies that immediately follows her/him. (Using Beautiful Soup to do that is challenging. I have instructions on how to figure it out, but if you can't, just email me and I will send you the code.)

Yes, that is how this process will work--below I have step-by-step instructions so you can try to write the code yourself. Do your best--and if you can't get there, email me and I will send you working code so you can move on to the next step.


### REMEMBER: secondary source
Part of this week's work is to find a source you can use to get the country of origin for each director. This is something you need to search for on your own. It will be hard to find a single page that lists every director, but see what you can find. In the end, you don't need a complete database of every single director, but do your best to get as many as you can.
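Once you have a secondary source, one way to use it is a plain dictionary from director name to country. The sketch below is only an illustration: the `director_country` dictionary is a hypothetical hand-built lookup, the third movie entry is a made-up placeholder, and the `[title, director, year]` layout assumes the movie list recommended in the next section.

```python
from collections import Counter

# Hypothetical hand-built lookup; in practice this would come from whatever
# secondary source you find, and it may be incomplete.
director_country = {
    "Kathryn Bigelow": "USA",
    "Cristian Mungiu": "Romania",
}

# Assumed layout: [title, director, year]. The last entry is a placeholder.
movie_list = [
    ["Zero Dark Thirty", "Kathryn Bigelow", "2012"],
    ["4 Months, 3 Weeks & 2 Days", "Cristian Mungiu", "2007"],
    ["Placeholder Title", "Unknown Director", "2005"],
]

# .get() with a default keeps the loop from crashing on directors
# that are missing from the lookup.
country_counts = Counter(
    director_country.get(director, "unknown")
    for _title, director, _year in movie_list
)
print(country_counts)
```

Using `.get()` with a default means an incomplete lookup still produces usable counts, which fits the "get as many as you can" goal.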


### Getting started: Data Architecture
You can come up with your own data scheme for this, but the one I'm recommending is three separate lists:

The most challenging one is the **critics_list**:

`critics_list = [['critic name','critic organization','critic country','movie one name','movie two name','movie three name',etc],['critic name','critic organization','critic country','movie one','movie two','movie three',etc]]`

So each list would contain 13 elements: three entries about the critic, and then the 10 movies picked. `critics_list[0][3]` would be the first critic's #1 movie, and `critics_list[2][12]` would be the third critic's #10 movie.

Next, you would make **movie_list**, which would look like this:

`movie_list = [['movie name','director name','movie date'],['movie name','director name','movie date']]`

Just go through the whole page and make a list of lists for every movie. Each list would contain three elements. `movie_list[0][0]` would give you the name of the first movie in the list, and `movie_list[3][1]` would give you the director of the fourth movie in the list.

Finally, you would need to make a simple **director_list**:

`director_list = ['Director name','Director name']`

`director_list[0]` would give you the first director.
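To make the indexing concrete, here is a small sketch of the three lists. The critic entries are invented placeholders and are shortened to three picks each instead of ten; only the two movie entries come from the samples used later in this notebook. It also shows, as one possible approach, how the layout supports a pattern search with `collections.Counter`.

```python
from collections import Counter

# Placeholder critics, shortened to three picks each for readability.
critics_list = [
    ["Critic A", "Org A", "Mexico", "Movie 1", "Movie 2", "Movie 3"],
    ["Critic B", "Org B", "Mexico", "Movie 2", "Movie 4", "Movie 5"],
    ["Critic C", "Org C", "France", "Movie 4", "Movie 2", "Movie 6"],
]
movie_list = [
    ["Zero Dark Thirty", "Kathryn Bigelow", "2012"],
    ["4 Months, 3 Weeks & 2 Days", "Cristian Mungiu", "2007"],
]
director_list = ["Kathryn Bigelow", "Cristian Mungiu"]

first_pick = critics_list[0][3]      # first critic's #1 movie
second_director = movie_list[1][1]   # director of the second movie

# Tally every pick made by critics from one country.
mexico_votes = Counter(
    movie
    for critic in critics_list
    if critic[2] == "Mexico"
    for movie in critic[3:]
)
top_mexico = mexico_votes.most_common(1)[0]
print(first_pick, second_director, top_mexico)
```

Because the critic details always sit at indexes 0 to 2, the slice `critic[3:]` picks out just the movies no matter how many there are.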


### Time for code: 

The first thing you need to do is import Beautiful Soup and urllib, like we did in the homework, and scrape the page: http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted


One thing I should note: there are two inconsistencies (actual errors in the HTML) that will cause you to lose a couple of entries (which is okay, but may be frustrating). I have posted a version of the exact same page with those inconsistencies fixed, if you want to scrape from that page instead:

http://floatingmedia.com/columbia/BBC.html

It's up to you. Okay let's begin!
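As a rough sketch of what STEP 1 can look like: the helper `fetch_soup` below is a hypothetical name, and the class name `story-body` is a made-up placeholder, so inspect the real page to find the container that actually holds the list. The example parses a small offline stand-in so that it runs without a network connection.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup


def fetch_soup(url):
    """Download a page and parse it with Python's built-in HTML parser."""
    with urlopen(url) as response:
        html = response.read()
    return BeautifulSoup(html, "html.parser")


# Offline stand-in for the real page. The class name "story-body" is a
# made-up placeholder, and the critic and movies are invented.
sample_html = """
<div class="story-body">
<p><strong>Critic A – Org A (Country A)</strong></p>
<p>1. Movie One (Director One, 2001)<br/>2. Movie Two (Director Two, 2002)</p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Find the single <div> that wraps the whole list of critics and movies.
all_info = soup.find("div", class_="story-body")
print(all_info.name)
```

On the real page you would call `fetch_soup(...)` with one of the two URLs above instead of parsing `sample_html`.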

**STEP 1**


In [None]:
##Import your libraries: Beautiful soup, urllib, and re (For regular expressions)



In [None]:
# read the URL, and put the HTML page into beautiful soup



In [None]:
#Using beautiful soup find the div tag that contains 
#the entire list of critics and movies
#Make a variable (like all_info) that holds all that information 



**STEP 2** Here is where it begins to get tricky: at this point everything we want is wrapped in `<p>` tags. Use a Beautiful Soup `find_all` to get a list of everything in a `<p>` tag. Make a variable that contains that list (you could call it `all_p` or something).


In [None]:
#find_all

**STEP 3** This is where all the magic has to happen: you need to find a way to look through all of the `<p>` elements (loop through the list you just got from the `find_all()`) and pull out each critic and that critic's list of movies.

Critics should not be too hard: every critic entry is embedded in `<strong>` tags. But in order to get the movies attached to that critic, you need to find the `<p>` tag immediately following each `<p><strong>`. You can do this using `next_sibling`.

So, you need to build a loop that searches through your `all_p` list. If a line has a `<strong>` tag, then set `critic_info = p_line.strong.string` and `movie_info = p_line.next_sibling`.

As you go through this loop, `print(critic_info, movie_info)` and see what comes out. If you're getting the critic string followed by the movie line's HTML, you've got it!

I give you the beginning of the loop below, and then you can build it piece by piece. If you want to see the overall architecture of the final loop, I have a commented example at the end of the page--it may or may not be helpful to look at. See how you do step-by-step and if you get stuck at a step email me with your code!
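Here is one possible shape for that loop, run against invented stand-in markup rather than the real page. Note one deliberate swap: it uses `find_next_sibling("p")` instead of a bare `next_sibling`. When the HTML has line breaks between the `<p>` tags, `next_sibling` returns the whitespace text between them rather than the next tag. Whether the real page behaves that way is something you would need to check.

```python
from bs4 import BeautifulSoup

# Invented stand-in markup; note the newlines between the <p> tags.
sample_html = """<div>
<p><strong>Critic A – Org A (Country A)</strong></p>
<p>1. Movie One (Director One, 2001)<br/>2. Movie Two (Director Two, 2002)</p>
<p><strong>Critic B – Org B (Country B)</strong></p>
<p>1. Movie Three (Director Three, 2003)<br/>2. Movie One (Director One, 2001)</p>
</div>"""
all_p = BeautifulSoup(sample_html, "html.parser").find_all("p")

pairs = []
for p_line in all_p:
    if p_line.strong is not None:
        critic_info = p_line.strong.get_text()
        # find_next_sibling("p") skips whitespace-only text nodes, which a
        # bare .next_sibling would return for markup laid out like this.
        movie_info = p_line.find_next_sibling("p")
        pairs.append((critic_info, movie_info))
        print(critic_info, movie_info)
```

If your own printout shows blank lines or `None` where the movies should be, the whitespace issue described above is a likely cause.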



In [None]:
## Write your loop for STEP 3 here
# I started this for you.
# Because you only want it to act on lines that begin a critic entry,
#   if p_line.strong is not None: does that for you
for p_line in all_p:
    if p_line.strong is not None:
        #critic_info = ???
        #movie_info = ???
        pass  # placeholder so the cell runs; remove it once you fill in the lines above
        





**STEP 4**
If your loop is successfully isolating those two lines, it's time to parse each line with regular expressions. This needs to happen inside the loop: for every critic, and then (in STEP 5) for every movie. Here, just **focus on getting the critic's name, organization, and country.**

Inside the loop, once you have `critic_info`, make a regular expression that pulls out the name of the critic, and make a variable called `critic_name`:

`critic_name = re.findall(regex,critic_info)`

Do the same thing for `critic_org` and `critic_cn`.

As you go, `print(critic_name)`, then `print(critic_org)`, etc., to make sure you're getting the results. It might help, before you do all these regular expressions in a loop, to just grab one critic's line and test regular expressions on it to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.
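If you get stuck, here is one set of patterns that works on the sample critic line. Treat it as a sketch under assumptions: it expects the name and organization to be separated by an en dash (or a plain hyphen) with spaces around it, and the country to sit in the final pair of parentheses. Lines on the real page that break that shape would need their own handling.

```python
import re

crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"

# Name: everything before the dash that is surrounded by spaces.
regex_for_name = r"^(.+?)\s+[–-]\s+"
# Organization: everything between that dash and the final parenthesis.
regex_for_org = r"\s[–-]\s+(.+)\s+\([^()]*\)\s*$"
# Country: the contents of the final parenthesis.
regex_for_cn = r"\(([^()]+)\)\s*$"

critic_name = re.findall(regex_for_name, crit_sample)
critic_org = re.findall(regex_for_org, crit_sample)
critic_cn = re.findall(regex_for_cn, crit_sample)
print(critic_name[0], "|", critic_org[0], "|", critic_cn[0])
```

Requiring whitespace on both sides of the dash keeps a hyphenated name from being split in the wrong place.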

In [None]:
#Practice/Build your regular expressions here
crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
regex_for_name = r""
regex_for_org = r""
regex_for_cn = r""
name = re.findall(regex_for_name,crit_sample)
name[0]


In [4]:
# Take your working loop from step three
# and put it here with the regular expression parsing inside it

**STEP 5**
Now you need to get your **movie names**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable -- which is each movie followed by a `<BR>` tag. I showed you this in class, but I'll just tell you again how to do this. To get a list of everything that is not a `<BR>` tag, use this method:

`each_movie = movie_info.find_all(string=True)`

This will give you a list called `each_movie`, which will contain a string for each movie, like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop that goes through `each_movie` and prints out each movie.
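The inner loop can be tried out on a single movie paragraph first. In this sketch the markup is a stand-in: the first movie is the sample string from above and the second is a made-up placeholder. The `strip()` and the `if text:` check are an extra precaution against whitespace-only strings, which may or may not appear on the real page.

```python
from bs4 import BeautifulSoup

# Stand-in for one critic's movie paragraph (second entry is a placeholder).
movie_html = (
    "<p>1. Zero Dark Thirty (Kathryn Bigelow, 2012)<br/>"
    "2. Placeholder Title (Placeholder Director, 2005)</p>"
)
movie_info = BeautifulSoup(movie_html, "html.parser").p

# string=True returns only the text pieces, so the <br/> tags drop out.
each_movie = movie_info.find_all(string=True)

movie_lines = []
for movie_line in each_movie:
    text = movie_line.strip()
    if text:  # skip whitespace-only strings
        movie_lines.append(text)
        print(text)
```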


In [None]:
## Take your working loop and add the find_all
# and the inner loop that loops through each_movie

Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.
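One pattern that handles both sample strings is sketched below. It assumes every movie line has the shape "rank, dot, title, then (director, four-digit year)" and it captures the director and year as well, since the movie list needs them later. The helper `parse_movie` is a hypothetical name, not something from the homework.

```python
import re

movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"

# Rank and dot, then the title, then "(director, year)" anchored at the end.
# Anchoring at the end is what lets the title itself contain digits and commas.
movie_pattern = r"^\d+\.\s+(.+)\s+\(([^()]+),\s*(\d{4})\)\s*$"


def parse_movie(line):
    """Return [title, director, year], or None if the line does not match."""
    found = re.findall(movie_pattern, line)
    return list(found[0]) if found else None


print(parse_movie(movie_sample))
print(parse_movie(movie_harder))
```

Returning `None` for a line that does not match makes it easy to spot and skip odd entries instead of crashing on `found[0]`.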


In [None]:
#Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
regex_for_mname = r""
movie_name = re.findall(regex_for_mname,movie_sample)
movie_name[0]


**STEP 6**
You're almost there!!! Now that you have a working regular expression, put it in your inner loop to get the movie name.

So the entire loop should be getting you 13 elements: `critic_name`, `critic_org`, and `critic_cn`, plus an inner loop that runs 10 times (once for each of the 10 movies) and gives you 10 instances of `movie_name`.


In [None]:
#Get that loop working here







**STEP 7**
This is the final step of the hardest part! If you make it all the way to the end of this let me know and we can discuss what to do next. If you've made it just following instructions, you are in great shape for the rest of this project--if not don't worry I will get you through by midweek.

The final step is building a list of lists in which each inner list contains 13 elements: 3 things about the critic and the 10 movies she/he selected.
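To see the whole assembly run end to end, here is a compact sketch on invented stand-in markup with two critics and two movies each. The patterns and the use of `find_next_sibling("p")` are assumptions chosen for this sample, not necessarily what the real page needs, and entries that fail to match are skipped rather than raising an error.

```python
import re
from bs4 import BeautifulSoup

# Invented stand-in markup, not the real page.
sample_html = """<div>
<p><strong>Critic A – Org A (Country A)</strong></p>
<p>1. Movie One (Director One, 2001)<br/>2. Movie Two (Director Two, 2002)</p>
<p><strong>Critic B – Org B (Country B)</strong></p>
<p>1. Movie Three (Director Three, 2003)<br/>2. Movie One (Director One, 2001)</p>
</div>"""

# One pattern with three groups: name, organization, country.
critic_pattern = r"^(.+?)\s+[–-]\s+(.+)\s+\(([^()]+)\)\s*$"
title_pattern = r"^\d+\.\s+(.+)\s+\([^()]*\)\s*$"

critics_list = []
for p_line in BeautifulSoup(sample_html, "html.parser").find_all("p"):
    if p_line.strong is None:
        continue  # not a critic line
    critic_match = re.findall(critic_pattern, p_line.strong.get_text().strip())
    movie_info = p_line.find_next_sibling("p")
    if not critic_match or movie_info is None:
        continue  # skip malformed entries instead of crashing
    this_critic = list(critic_match[0])  # name, organization, country
    for movie_line in movie_info.find_all(string=True):
        title_match = re.findall(title_pattern, movie_line.strip())
        if title_match:
            this_critic.append(title_match[0])
    critics_list.append(this_critic)

print(critics_list)
```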

In the cell below, I give you a final architecture you need to use to get this most challenging list of lists.

In [None]:
critics_list = []
for p_line in all_p:                      # goes through all the <p> elements
    if p_line.strong is not None:         # begins with the critic
        this_critic = []
        critic_info = p_line.strong.string        # get the critic line
        critic_name = re.findall(regex_for_name, critic_info)
        critic_org = re.findall(regex_for_org, critic_info)
        critic_cn = re.findall(regex_for_cn, critic_info)
        # extend() adds all three items; append() accepts only one argument
        this_critic.extend([critic_name[0], critic_org[0], critic_cn[0]])
        movie_info = p_line.next_sibling          # get movie line using next_sibling
        each_movie = movie_info.find_all(string=True)   # get each movie string
        for movie_line in each_movie:             # loop through #1 through #10
            movie_name = re.findall(regex_for_mname, movie_line)
            this_critic.append(movie_name[0])     # this append will happen 10 times
        # The list for the single critic is finished
        # Add it to the critics list
        critics_list.append(this_critic)
            

    

In [None]:
##Take a peek at your final lists of lists
critics_list

If you made it this far, congratulations!

You can go ahead and try to build the list of movies and/or the list of directors on your own--they will use similar logic but not be nearly as complicated as this one.
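As a starting point for those two lists, the sketch below parses movie strings and keeps each movie and each director only once, since the same film shows up on many critics' lists. The input list `movie_lines` is a hypothetical stand-in for the strings you collect in your inner loop, and the pattern assumes the "(director, year)" shape of the samples above.

```python
import re

# Stand-in for the movie strings gathered from every critic.
movie_lines = [
    "1. Zero Dark Thirty (Kathryn Bigelow, 2012)",
    "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)",
    "3. Zero Dark Thirty (Kathryn Bigelow, 2012)",  # same film, another critic
]
movie_pattern = r"^\d+\.\s+(.+)\s+\(([^()]+),\s*(\d{4})\)\s*$"

movie_list = []
director_list = []
for line in movie_lines:
    found = re.findall(movie_pattern, line)
    if not found:
        continue  # skip lines that do not look like a movie entry
    entry = list(found[0])            # [title, director, year]
    if entry not in movie_list:       # keep each movie once
        movie_list.append(entry)
    if entry[1] not in director_list:  # keep each director once
        director_list.append(entry[1])

print(movie_list)
print(director_list)
```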