# Scraping for Categories, Winners, and Nominees for the 2022 GRAMMYs
I'm one of those people who always claims to be really into music, yet the only thing I know about the 2022 GRAMMYs is that [Tyler's](https://www.google.com/search?gs_ssp=eJzj4tLP1TdILkopySo3YPQSLKnMSS1SKMlIVUguSk0syS8CAKRdCss&q=tyler+the+creator&oq=tyler&aqs=chrome.1.69i57j46i39j46i67i433j46i67j46i433i512j69i61l3.3555j0j7&sourceid=chrome&ie=UTF-8) album won 😂. And if you're reading this, you probably don't know much about the event either. Let's fix that by [web scraping](https://en.wikipedia.org/wiki/Web_scraping) for the information, so that the next time our friends start a conversation about it, we can keep up. In fact, we'll know so much that we'll be the ones starting the conversation like "How you doin?":
<img src="https://pbs.twimg.com/media/FP0_xP5XwAItvlo?format=jpg&name=small" alt="How you doing?" width="550"/>


## Description
- For this project I'm going to be scraping the official GRAMMYs [website](https://www.grammy.com/news/2022-grammys-complete-winners-nominees-nominations-list) using Python
- Libraries used include: requests, BeautifulSoup, re, nltk, pandas
  

## Web Scraping begins
First, let's load all the libraries we'll be using for this project.


In [1]:
# Website chosen (URL)
# https://www.grammy.com/news/2022-grammys-complete-winners-nominees-nominations-list

#load libraries
import requests
from bs4 import BeautifulSoup
import re
import nltk
nltk.download('punkt')
from nltk import sent_tokenize
import pandas as pd

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\TARI\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Next let's download the page with **requests**

In [2]:
# Download the Grammy page
url = "https://www.grammy.com/news/2022-grammys-complete-winners-nominees-nominations-list"
response = requests.get(url)

# Check for error in downloading page
response.raise_for_status()

In [3]:
#size of the page
len(response.text)

368073

In [4]:
page_contents = response.text

#### Then we'll use **BeautifulSoup** to parse and extract information

In [6]:
# Parsing information
doc = BeautifulSoup(page_contents, 'html.parser')


To extract information from the site, we'll follow these steps:
- **Step 1**: Open the [2022 GRAMMYs page](https://www.grammy.com/news/2022-grammys-complete-winners-nominees-nominations-list)
 ![photo1](https://pbs.twimg.com/media/FP1IBXBWUAEHxMI?format=jpg&name=large)

- **Step 2**: Scroll down to the awards section 
![photo2](https://pbs.twimg.com/media/FP1IAEvXEAUpVZN?format=jpg&name=large)

- **Step 3**: Right click on the "General Field" element and click on inspect
![photo3](https://pbs.twimg.com/media/FP1NSfVXwAAEECZ?format=jpg&name=large)

- **Step 4**: The developer tools open, and we see that the "General Field" element is an `<h1>` element nested under a `<div>` element.
![photo4](https://pbs.twimg.com/media/FP1NTtEX0AMlDkn?format=jpg&name=large)

- **Step 5**: Next we'll click on the `<div>` element to see what it contains (nests)
![photo5](https://pbs.twimg.com/media/FP1NU-oWUAEY180?format=jpg&name=large)

- **Step 6**: Lastly, we scroll down and find that the `<div>` element with the class `prose` contains all the information we need, so we'll be working mainly with that.
![photo6](https://pbs.twimg.com/media/FP1NU-oWUAEY180?format=jpg&name=large)

#### We just found out that all the Grammy details we need are in a div element that belongs to the class prose. So let's extract the details and start storing the info we need

In [8]:
grammydets = doc.find_all('div', {'class': 'prose'})

First we get a list of all the categories in the 2022 GRAMMY awards
- From one of the pictures we looked at above, we can see that the "General Field" element is an `<h1>` element
![photo4](https://pbs.twimg.com/media/FP1NTtEX0AMlDkn?format=jpg&name=large)
- If we check the page further, we see that every other category heading has the same look as the "General Field" element, and nothing else on the page shares it
![photo7](https://pbs.twimg.com/media/FP1Se38WQAEV4SA?format=jpg&name=large)


That means that inside `div.prose` only the categories are `<h1>` tags. Let's get the categories then. First we'll get all the `<h1>` tags, then we'll store the categories in a list. For a more detailed explanation of what's happening, see [this video](https://www.youtube.com/watch?v=RKsLLG-bzEY&t=6s).

In [13]:
# Get all h1 tags in div.prose
h1_tags = grammydets[1].find_all('h1')
# Are the categories complete?
len(h1_tags)

26

In [14]:
# Get the categories
categories = []
# h1_tags holds the <h1> tags collected in the previous cell
for tag in h1_tags:
    categories.append(tag.text)
len(categories)

26
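
The same steps can be tried on a small snippet before pointing them at the real page. The sketch below uses made-up markup that only mimics the structure described above (it is an assumption for illustration, not a copy of the GRAMMYs page), and the `sample_` names are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above
sample_html = """
<div class="header"><h1>Site title</h1></div>
<div class="prose">
  <h1>General Field</h1>
  <p><strong>1. Record Of The Year</strong></p>
  <h1>Pop</h1>
  <p><strong>5. Best Pop Solo Performance</strong></p>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like ResultSet, even when only one element matches
prose_divs = sample_doc.find_all("div", {"class": "prose"})

# Only the <h1> tags inside the prose div are collected
sample_categories = [h1.get_text(strip=True) for h1 in prose_divs[0].find_all("h1")]
print(sample_categories)  # ['General Field', 'Pop']
```

Note that the `<h1>` in the header div is left out, which is the whole point of narrowing the search to `div.prose` first.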

#### For the awards and winners
Inspection of the page shows that all awards and winners are in bold, that is, they are wrapped in `<strong>` tags. See:
1. Exhibit A:
![photo8](https://pbs.twimg.com/media/FP1XEvUWUAspKEq?format=jpg&name=large)
2. Exhibit B:
![photo9](https://pbs.twimg.com/media/FP1XFr3XsAIZLor?format=jpg&name=large)


Knowing that, first let's get all the `<strong>` tags.

In [16]:
strong_tags = doc.select('p > strong')

Secondly, let's get each award given and store it in a list

In [21]:
# Getting awards
awards = []
for tag in strong_tags:
    # Keep only text that begins with a one- or two-digit number and a dot,
    # e.g. "4. boy" or "34. girl"
    if re.match(r'\d{1,2}\.', tag.text):
        # Drop the first three characters (the number prefix) and strip whitespace
        awards.append(tag.text[3:].strip())
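
The number check and the fixed `[3:]` slice can also be folded into one regular expression with capture groups. This is only a sketch: it runs on made-up strings rather than the real `strong_tags`, and `parse_award` is a hypothetical helper name.

```python
import re

# One or two digits, a dot, optional whitespace, then the award name
AWARD_PATTERN = re.compile(r"^(\d{1,2})\.\s*(.+)$")

def parse_award(text):
    """Return (number, name) for text like '4. Best New Artist', else None."""
    match = AWARD_PATTERN.match(text.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()

# Made-up sample strings standing in for the text of the <strong> tags
samples = ["4. Best New Artist", "35. Best Latin Jazz Album", "Olivia Rodrigo - WINNER"]
parsed = [parse_award(s) for s in samples]
print(parsed)
# [(4, 'Best New Artist'), (35, 'Best Latin Jazz Album'), None]
```

Keeping the number alongside the name makes it easier to spot gaps in the sequence later on.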

In [22]:
# Check for errors
awards

['Record Of The Year',
 'Album Of The Year',
 'Song Of The Year',
 'Best New Artist',
 'Best Pop Solo Performance',
 'Best Pop Duo/Group Performance',
 'Best Traditional Pop Vocal Album',
 'Best Pop Vocal Album',
 'Best Dance/Electronic Recording',
 'Best Dance/Electronic Music Album',
 'Best Contemporary Instrumental Album',
 'Best Rock Performance',
 'Best Metal Performance',
 'Best Rock Song',
 'Best Rock Album',
 'Best Alternative Music Album',
 'Best R&B Performance',
 'Best Traditional R&B Performance',
 'Best R&B Song',
 'Best Progressive R&B Album',
 'Best R&B Album',
 'Best Rap Performance',
 'Best Melodic Rap Performance',
 'Best Rap Song',
 'Best Rap Album',
 'Best Country Solo Performance',
 'Best Country Duo/Group Performance',
 'Best Country Song',
 'Best Country Album',
 'Best New Age Album',
 'Best Improvised Jazz Solo',
 'Best Jazz Vocal Album',
 'Best Jazz Instrumental Album',
 'Best Large Jazz Ensemble Album',
 'Best Latin Jazz Album',
 'Best Gospel Performance/Song'

Everything looks good so far 😅. But this feels too easy, let's check the number of awards stored.

In [23]:
len(awards)

85

#### Uh oh, there are supposed to be 86. I know that much about the 2022 GRAMMYs 😅
- On to the webpage for a manual check to see what's wrong
- It turns out award 44 is not in bold, that is, it is not enclosed in a `<strong>` tag.
![photo10](https://pbs.twimg.com/media/FP1e462XIAMFRUk?format=jpg&name=small)
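
A manual check works, but the gap can also be located in code by comparing the award numbers that were found against the full expected range. The sketch below uses a shortened made-up list (`found_numbers` and `expected_total` are illustrative names), so it shows the idea rather than the real data.

```python
# Made-up example: numbers parsed from the bold tags, with one missing
found_numbers = [1, 2, 3, 5, 6]
expected_total = 6  # on the real page this would be 86

# Set difference gives every expected number that never showed up
missing = sorted(set(range(1, expected_total + 1)) - set(found_numbers))
print(missing)  # [4]
```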


![bobthebuilder](https://upload.wikimedia.org/wikipedia/en/thumb/0/09/Bob_the_Builder_Can_We_Fix_It_art.jpg/220px-Bob_the_Builder_Can_We_Fix_It_art.jpg)

Yes, we can, Bob. Run along now, don't deny those children their cartoons. 😤

Now that Bob has gone, to fix it we'll simply get all the `<p>` tags and use a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) to pick out just the forty-fourth award. To do that in Python all one needs is the `re` module, and we have already imported that.

In [25]:
p_tags = grammydets[1].find_all('p')
fourtyfour = []
for i in range(0, len(p_tags)):
  match3 = re.match(r'^\*\*\d\d', p_tags[i].text)
  if match3:
    fourtyfour.append(p_tags[i].text)
fourtyfour = fourtyfour[0][5: 58].strip()

In [26]:
fourtyfour

'Best Regional Mexican Music Album (Including Tejano)'
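
The slice `[5: 58]` depends on the exact length of this one title. A capture group is a little more forgiving, since it takes whatever follows the number. This is only a sketch on a hard-coded string, and it assumes the paragraph text starts with `**44.`, as the pattern above implies.

```python
import re

# Assumed shape of the paragraph text, based on the pattern used above
sample_text = "**44. Best Regional Mexican Music Album (Including Tejano)"

# Skip the leading asterisks and the number, then capture the rest of the line
match = re.match(r"^\*\*\d{2}\.\s*(.+)", sample_text)
award_name = match.group(1).strip() if match else None
print(award_name)  # Best Regional Mexican Music Album (Including Tejano)
```

If the real paragraph carries extra characters after the title, the pattern would need to be tightened to leave them out.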

In [27]:
awards.insert(43, fourtyfour)

In [28]:

len(awards)

86

#### There you have it, we saved the day 🦸. On to the next task: getting the list of winners

We are basically doing the same thing we did to get the award list. Just that the pattern to follow this time is **WINNER** 
![photo11](https://pbs.twimg.com/media/FP1kaQDWQAwJG0s?format=jpg&name=large)

- Or so we would have thought, if we hadn't just had to squash some annoying bugs while getting the award list. Who would have thought that the official GRAMMY website had bugs? I guess you could say their pitch isn't perfect 😂
- Anyway, a manual check revealed the following:
  1. There's a winner **without** the word **"WINNER"**
  ![photo12](https://pbs.twimg.com/media/FP1qHAAXwAkiuI4?format=jpg&name=large)
  2. There's a winner with the word **"WINNNER"**
  ![photo13](https://pbs.twimg.com/media/FP1qIXbWYAEWSQ_?format=jpg&name=large)
  3. There's an award that two people tied for, and both winners are marked with the word **"TIE"** instead
  ![photo14](https://pbs.twimg.com/media/FP1qJ-jX0AEz2Ry?format=jpg&name=large)
  4. There's another tied award, but this time the winners are marked with the word **"Tie"**
  ![photo15](https://pbs.twimg.com/media/FP1qLMlXMAMnPX4?format=jpg&name=large)
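
Three of the four cases above are spelling variants of a marker, so they can be caught with a single pattern. The sketch below is one possible way to do that, tested on made-up strings; `MARKER_PATTERN` is a hypothetical name, and the pattern could also match unrelated words that begin with "Tie", so the results would still need a manual look.

```python
import re

# "WINNER" with two or more N's, or a word starting with "TIE" / "Tie"
MARKER_PATTERN = re.compile(r"WINN+ER|\b(?:TIE|Tie)")

# Made-up strings shaped like the cases listed above
samples = [
    "Leave The Door Open - WINNERSilk Sonic",
    "Some Album - WINNNER",
    "First Act - TIE",
    "Second Act - Tie",
    "A plain nominee",
]
flags = [bool(MARKER_PATTERN.search(s)) for s in samples]
print(flags)  # [True, True, True, True, False]
```

The winner that has no marker at all cannot be found this way and still has to be handled by name.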


#### With all of that clarified, let's get the list of winners. With all the work we just did, our names had better be on that list 😤

In [30]:
# Getting the list of winners
winners = []
tie_list1 = [] # Store the first tie
tie_list2 = [] # Store the second tie

for tag in strong_tags:
    text = tag.text
    if re.search('WINNER', text):
        # The normal case
        winners.append(text)
    elif re.search('Sour Olivia Rodrigo', text):
        # The winner that is missing the word "WINNER"
        if text not in winners:
            winners.append(text)
    elif re.search('WINNNER', text):
        # The misspelled marker
        winners.append(text)
    elif re.search('TIE', text):
        # sent_tokenize (from the nltk module) separates a string into sentences
        tie_list1.extend(sent_tokenize(text))
        # Both tied winners share one list, which is added to winners only once
        if tie_list1 not in winners:
            winners.append(tie_list1)
    elif re.search('Tie', text):
        tie_list2.extend(sent_tokenize(text))
        if tie_list2 not in winners:
            winners.append(tie_list2)

Above, the **sent_tokenize** function from [**nltk**](https://www.youtube.com/watch?v=X2vAabgKiuM&t=1405s) was used. For a detailed explanation of how it works, follow the nltk link. If we completed the task correctly, the number of winners should now be 86.

In [31]:
len(winners)

86

Next, we record how many awards fall under each of the 26 categories. The counts are hard-coded in the list below.

In [113]:
no_of_awards = [4, 4, 2, 1, 4, 1, 5, 4, 4, 1, 5, 5, 5, 8, 1, 2, 1, 1, 1, 1, 3, 3, 4, 6, 8, 2]
for i in range(len(no_of_awards)):
  print(f"{i}: ", no_of_awards[i])

0:  4
1:  4
2:  2
3:  1
4:  4
5:  1
6:  5
7:  4
8:  4
9:  1
10:  5
11:  5
12:  5
13:  8
14:  1
15:  2
16:  1
17:  1
18:  1
19:  1
20:  3
21:  3
22:  4
23:  6
24:  8
25:  2
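
Since these counts are hard-coded, a quick check that they add up to the 86 awards collected earlier is cheap insurance. The sketch below also shows one way (an assumption, not something this notebook does) to expand the counts into a category label per award; the three category names and their counts are placeholders.

```python
from itertools import chain, repeat

no_of_awards = [4, 4, 2, 1, 4, 1, 5, 4, 4, 1, 5, 5, 5, 8, 1, 2, 1, 1, 1, 1, 3, 3, 4, 6, 8, 2]

# 26 categories whose counts should cover all 86 awards
total = sum(no_of_awards)
print(len(no_of_awards), total)  # 26 86

# Placeholder names and counts to show the expansion on a small scale
sample_categories = ["Category A", "Category B", "Category C"]
sample_counts = [2, 1, 3]

# Repeat each category name once per award it contains
labels = list(chain.from_iterable(
    repeat(name, count) for name, count in zip(sample_categories, sample_counts)
))
print(labels)
# ['Category A', 'Category A', 'Category B', 'Category C', 'Category C', 'Category C']
```

A list built this way has one entry per award, so it could sit next to the `awards` list as an extra column.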


In [114]:
categories_dict = {
    'categories':categories,
    'no_of_awards':no_of_awards,
}

In [115]:
df = pd.DataFrame(categories_dict)

In [116]:
len(awards)


86

In [117]:
awards_dict = {
    'awards': awards,
    'winners': winners
}


In [118]:
df_awards =  pd.DataFrame(awards_dict)
df_awards

|     | awards | winners |
| --- | --- | --- |
| 0 | Record Of The Year | Leave The Door Open - WINNERSilk SonicDernst "... |
| 1 | Album Of The Year | We Are - WINNERJon BatisteCraig Adams, David G... |
| 2 | Song Of The Year | Leave The Door Open - WINNERBrandon Anderson, ... |
| 3 | Best New Artist | Olivia Rodrigo - WINNER |
| 4 | Best Pop Solo Performance | drivers license - WINNEROlivia Rodrigo |
| ... | ... | ... |
| 81 | Best Classical Solo Vocal Album | Mythologies - WINNERSangeeta Kaur & Hila Plitm... |
| 82 | Best Classical Compendium | Women Warriors - The Voices Of Change - WINNER... |
| 83 | Best Contemporary Classical Composition | Shaw: Narrow Sea - WINNERCaroline Shaw, compos... |
| 84 | Best Music Video | Freedom - WINNERJon BatisteAlan Ferguson, vide... |
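
In the table, the title, the `WINNER` marker, and the credits run together (for example `drivers license - WINNEROlivia Rodrigo`), most likely because `.text` joins the text of nested tags without a separator. One possible clean-up, sketched here on two strings copied from the output above, is to split on the marker. `split_winner` is a hypothetical helper, and it only handles the plain `WINNER` spelling.

```python
def split_winner(text, marker=" - WINNER"):
    """Split 'Title - WINNERCredits' into (title, rest)."""
    title, found, rest = text.partition(marker)
    if not found:
        # Marker missing: return the text unchanged with nothing after it
        return text.strip(), ""
    return title.strip(), rest.strip()

rows = ["drivers license - WINNEROlivia Rodrigo", "Olivia Rodrigo - WINNER"]
cleaned = [split_winner(row) for row in rows]
print(cleaned)
# [('drivers license', 'Olivia Rodrigo'), ('Olivia Rodrigo', '')]
```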


In [119]:
# index=False keeps the DataFrame's row index out of the CSV files
df.to_csv("categories.csv", index=False, mode='w')
df_awards.to_csv("awards.csv", index=False, mode='w')
