# Categories, Winners, and Nominees for the 2022 GRAMMYs


### Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


In [1]:
# Website chosen (URL)
# https://www.grammy.com/news/2022-grammys-complete-winners-nominees-nominations-list

#load libraries
import requests
from bs4 import BeautifulSoup
import re
import nltk
nltk.download('punkt')
from nltk import sent_tokenize
import pandas as pd

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
# Download the Grammy page
url = "https://www.grammy.com/news/2022-grammys-complete-winners-nominees-nominations-list"
response = requests.get(url)

# Check for error in downloading page
response.raise_for_status()

In [3]:
#size of the page
len(response.text)

368073

In [4]:
page_contents = response.text

### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
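
The steps in the list above could be wrapped in one helper. The following is a minimal sketch, not the notebook's own code: `download_page`, its `fetch` parameter, and the output file name are hypothetical names introduced for illustration. By default it relies on `requests.get` and `raise_for_status`, as in the cells above; passing a custom `fetch` callable lets the function be tried offline.

```python
from pathlib import Path


def download_page(url, path, fetch=None):
    """Download `url`, save the HTML to `path`, and return the HTML text.

    `fetch` is a hypothetical hook: any callable that takes a URL and
    returns the page text. When it is omitted, requests is used.
    """
    if fetch is None:
        import requests  # imported lazily so a custom fetch needs no HTTP library

        def fetch(target):
            response = requests.get(target, timeout=30)
            response.raise_for_status()  # fail loudly on HTTP errors
            return response.text

    html = fetch(url)
    Path(path).write_text(html, encoding="utf-8")  # keep a local copy
    return html
```

A call such as `page_contents = download_page(url, "grammys-2022.html")` would then replace the separate download cells, and the saved file can be re-parsed later without requesting the page again.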

In [5]:
# Parse page_contents as HTML
doc = BeautifulSoup(page_contents, 'html.parser')


## Inspecting the website shows that all the data I will be working with is inside div elements with the class prose

In [6]:
grammydets = doc.find_all('div', {'class': 'prose'})
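
To illustrate how this selection behaves, here is a small sketch on made-up HTML. The markup below is an assumption for illustration and is much simpler than the real page. `find_all` with a class filter returns every matching `div`, and each result can be searched again for headings.

```python
from bs4 import BeautifulSoup

# Toy markup standing in for the real page (hypothetical content)
sample_html = """
<div class="prose"><p>Intro text</p></div>
<div class="prose">
  <h1>General Field</h1>
  <p><strong>1. Record Of The Year</strong></p>
  <h1>Pop</h1>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# All <div> elements whose class list contains "prose"
prose_divs = sample_doc.find_all("div", {"class": "prose"})

# Headings inside the second matching block only
headings = [h1.text for h1 in prose_divs[1].find_all("h1")]
print(len(prose_divs), headings)
```

Indexing the result (`prose_divs[1]` here) limits later searches to one block, which is why the cells below search inside `grammydets[1]`.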

In [7]:
# Categories
categories_tags = grammydets[1].find_all('h1')
len(categories_tags)

26

In [8]:
categories = [tag.text for tag in categories_tags]
len(categories)

26

# It seems like all the awards are in bold text

In [9]:
# Awards: bold text that starts with a one- or two-digit number and a period
strong_tags = doc.select('p > strong')
awards = []
for tag in strong_tags:
  if re.match(r'\d{1,2}\.', tag.text):
    awards.append(tag.text.strip())
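
The filter above can be tried on plain strings, without the page. This is a minimal sketch under assumptions: `extract_awards` and `AWARD_PATTERN` are hypothetical names, and replacing the non-breaking space (`\xa0`, visible in one of the scraped titles) is an optional clean-up step that the notebook itself does not perform.

```python
import re

# One or two digits followed by a period at the start of the text
AWARD_PATTERN = re.compile(r"\d{1,2}\.")


def extract_awards(texts):
    """Keep only the strings that look like numbered award titles."""
    awards = []
    for text in texts:
        # Replace non-breaking spaces and trim surrounding whitespace
        cleaned = text.replace("\xa0", " ").strip()
        if AWARD_PATTERN.match(cleaned):
            awards.append(cleaned)
    return awards


sample_texts = [
    "1. Record Of The Year",
    "WINNER: placeholder entry",
    "3.\xa0Song Of The Year",
    "10. Best Dance/Electronic Music Album",
]
print(extract_awards(sample_texts))
```

Because `re.match` only matches at the start of the string, bold text that merely contains a number later on is ignored.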

In [10]:
awards

['1. Record Of The Year',
 '2. Album Of The Year',
 '3.\xa0Song Of The Year',
 '4. Best New Artist',
 '5. Best Pop Solo Performance',
 '6. Best Pop Duo/Group Performance',
 '7. Best Traditional Pop Vocal Album',
 '8. Best Pop Vocal Album',
 '9. Best Dance/Electronic Recording',
 '10. Best Dance/Electronic Music Album',
 '11. Best Contemporary Instrumental Album',
 '12. Best Rock Performance',
 '13. Best Metal Performance',
 '14. Best Rock Song',
 '15. Best Rock Album',
 '16. Best Alternative Music Album',
 '17. Best R&B Performance',
 '18. Best Traditional R&B Performance',
 '19. Best R&B Song',
 '20. Best Progressive R&B Album',
 '21. Best R&B Album',
 '22. Best Rap Performance',
 '23. Best Melodic Rap Performance',
 '24. Best Rap Song',
 '25. Best Rap Album',
 '26. Best Country Solo Performance',
 '27. Best Country Duo/Group Performance',
 '28. Best Country Song',
 '29. Best Country Album',
 '30. Best New Age Album',
 '31. Best Improvised Jazz Solo',
 '32. Best Jazz Vocal Album',
 '3

In [11]:
len(awards)

85

# Uh-oh, there are supposed to be 86 awards
- Let's check the URL to see what's wrong.
- Award 44 is not in bold, that is, it is not enclosed in a strong tag. No big deal, we can handle it.
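
Rather than checking the page by eye, the gap can also be located from the numbering itself. The following is a sketch under assumptions: `missing_award_numbers` is a hypothetical helper, and it assumes every title starts with its number followed by a period, as in the list printed above.

```python
import re


def missing_award_numbers(awards, expected_total):
    """Return the numbers in 1..expected_total that have no matching title."""
    found = set()
    for title in awards:
        match = re.match(r"(\d+)\.", title)
        if match:
            found.add(int(match.group(1)))
    return [n for n in range(1, expected_total + 1) if n not in found]


# Toy example: title number 3 is absent
print(missing_award_numbers(["1. A", "2. B", "4. D"], 4))
```

On the scraped list, a call such as `missing_award_numbers(awards, 86)` should point to the number that was skipped.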

In [12]:
p_tags = grammydets[1].find_all('p')
# Award 44 appears as plain text starting with a literal "**" instead of a <strong> tag
candidates = [p.text for p in p_tags if re.match(r'^\*\*\d\d', p.text)]
# Drop the leading "**" and keep only the award title
fourtyfour = candidates[0][2:58]

In [13]:
fourtyfour

'44. Best Regional Mexican Music Album (Including Tejano)'

In [14]:
awards.insert(43, fourtyfour)

In [15]:

len(awards)

86

In [16]:
winners = []
tie_list1 = []
tie_list2 = []

for tag in strong_tags:
    text = tag.text
    if 'WINNER' in text or 'WINNNER' in text:
        # Also catch the misspelled label 'WINNNER'
        winners.append(text)
    elif 'Sour Olivia Rodrigo' in text:
        # This entry is matched by its text rather than by a label
        winners.append(text)
    elif 'TIE' in text:
        # A tie lists several winners, so split the text into sentences
        tie_list1.extend(sent_tokenize(text))
        if tie_list1 not in winners:
            winners.append(tie_list1)
    elif 'Tie' in text:
        tie_list2.extend(sent_tokenize(text))
        if tie_list2 not in winners:
            winners.append(tie_list2)
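
The label checks in this cell could also be expressed as two compiled patterns. This is a sketch under assumptions, not a drop-in replacement: `classify_label`, `WINNER_LABEL`, and `TIE_LABEL` are hypothetical names, and the word boundaries around the tie label are stricter than the plain substring search used in the notebook.

```python
import re

# 'WIN+ER' tolerates repeated N's, so it covers WINNER and WINNNER
WINNER_LABEL = re.compile(r"WIN+ER")
# Matches the standalone words TIE or Tie
TIE_LABEL = re.compile(r"\b(?:TIE|Tie)\b")


def classify_label(text):
    """Return 'winner', 'tie', or None for one bold text snippet."""
    if WINNER_LABEL.search(text):
        return "winner"
    if TIE_LABEL.search(text):
        return "tie"
    return None


for snippet in ["WINNER: placeholder", "WINNNER: placeholder",
                "TIE. First entry. Second entry.", "12. Best Rock Performance"]:
    print(snippet, "->", classify_label(snippet))
```

Checking the winner label first mirrors the order of the branches in the cell above.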

In [17]:
len(winners)

86

Data collected manually from URL 2

In [18]:
for i in range(len(categories)):
  print(f"{i}: ", categories[i])

0:  General Field
1:  Pop
2:  Dance/Electronic Music
3:  Contemporary Instrumental Music
4:  Rock
5:  Alternative
6:  R&B
7:  Rap
8:  Country
9:  New Age
10:  Jazz
11:  Gospel/Contemporary Christian Music
12:  Latin
13:  American Roots Music
14:  Reggae
15:  Global Music
16:  Children's
17:  Spoken Word
18:  Comedy
19:  Musical Theater
20:  Music for Visual Media
21:  Composing/Arranging
22:  Package, Notes, and Historical
23:  Production
24:  Classical
25:  Music Video/Film


In [20]:
no_of_awards = [4, 4, 2, 1, 4, 1, 5, 4, 4, 1, 5, 5, 5, 8, 1, 2, 1, 1, 1, 1, 3, 3, 4, 6, 8, 2]
for i in range(len(no_of_awards)):
  print(f"{i}: ", no_of_awards[i])

0:  4
1:  4
2:  2
3:  1
4:  4
5:  1
6:  5
7:  4
8:  4
9:  1
10:  5
11:  5
12:  5
13:  8
14:  1
15:  2
16:  1
17:  1
18:  1
19:  1
20:  3
21:  3
22:  4
23:  6
24:  8
25:  2


In [23]:
categories_dict = {
    'categories':categories,
    'no_of_awards':no_of_awards,
}

In [24]:
df = pd.DataFrame(categories_dict)
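
The 26 counts in `no_of_awards` add up to 86, which matches `len(awards)`, so the counts can be used to attach a category to every award. The following is a minimal sketch with shortened toy data; `expand_categories` is a hypothetical helper, and it assumes the awards are listed in the same order as the categories.

```python
# Toy data standing in for `categories` and `no_of_awards` (shortened)
sample_categories = ["General Field", "Pop", "Dance/Electronic Music"]
sample_counts = [4, 4, 2]


def expand_categories(categories, counts):
    """Repeat each category name once per award it contains."""
    if len(categories) != len(counts):
        raise ValueError("categories and counts must have the same length")
    expanded = []
    for name, count in zip(categories, counts):
        expanded.extend([name] * count)
    return expanded


award_categories = expand_categories(sample_categories, sample_counts)
print(len(award_categories))  # 10, the sum of the toy counts
```

With the full lists, the result would have one category label per entry in `awards`, ready to be used as a column next to the award titles.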

In [25]:
len(awards)
len(winners)

86
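
Since the stated goal is a CSV file, the final step could look like the sketch below. It is an assumption about the output format, not the notebook's own code: the column names, `write_awards_csv`, and the placeholder rows are all hypothetical, and the standard `csv` module is used here instead of pandas so the example runs on its own.

```python
import csv
import io


def write_awards_csv(rows, handle):
    """Write (category, award, winner) rows with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["category", "award", "winner"])
    writer.writerows(rows)


# Placeholder rows; the real values would come from the scraped lists
sample_rows = [
    ("General Field", "1. Record Of The Year", "WINNER: placeholder"),
    ("General Field", "2. Album Of The Year", "WINNER: placeholder"),
]

buffer = io.StringIO()
write_awards_csv(sample_rows, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")`. Entries in `winners` that are lists (the ties) would need to be joined into a single string first.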
