# Web-Scrapping with Beautiful Soup

#### Content

1. [Wikipedia Home Page Headers](#wiki)
2. [IMDB Top Rated 100 Movies](#imdb)
3. [IMDB Top Rated 100 Indian Movies](#imdbInd)
4. [Book Page Reviews](#bookpage)
5. [ICC Cricket (Men)](#iccMen)
6. [ICC Cricket (Women)](#iccWomen)
7. [Amazon Mobiles under ₹20,000](#amazon)
8. [San Francisco Weather Data](#sfWeather)

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd

##### 1. Wikipedia Home Page Headers <a name='wiki'></a>

In [2]:
wikiPage = requests.get('http://en.wikipedia.org/wiki/Main_Page')

In [3]:
wikiSoup = BeautifulSoup(wikiPage.content)

In [4]:
wikiHeaders = wikiSoup.find_all('h2')

In [5]:
headerList = []
for index in range(len(wikiHeaders)):
    headerList.append(wikiHeaders[index].text)
    
headerList

["From today's featured article",
 'Did you know\xa0...',
 'In the news',
 'On this day',
 "Today's featured picture",
 'Other areas of Wikipedia',
 "Wikipedia's sister projects",
 'Wikipedia languages',
 'Navigation menu']

In [6]:
#Saving the header list in a text file
with open('header.txt', 'w') as h:
    h.write('\n'.join(headerList))

##### 2. IMDB Top Rated 100 Movies <a name='imdb'></a>

In [7]:
imdbPage = requests.get('https://www.imdb.com/chart/top/')
imdbPage

<Response [200]>

In [8]:
imdbSoup = BeautifulSoup(imdbPage.content, 'html.parser')

In [9]:
#Keeping the first 100 items from list returned by .find_all
#This list contains all 'td' elements with the class 'titleColumn'
imdbMovieTitle = imdbSoup.find_all('td', class_ = 'titleColumn')
imdbMovieTitle = imdbMovieTitle[0:100]

In [10]:
#Keeping the first 100 items from list returned by .find_all
#This list contains all 'td' elements with the class 'ratingColumn imdbRating'
imdbMovieRating = imdbSoup.find_all('td', class_ = 'ratingColumn imdbRating')
imdbMovieRating = imdbMovieRating[0:100]

In [11]:
#Extracting the Movie title, release year and rating 

movieTitles = []
releaseYear = []
rating = []
for index in range(len(imdbMovieTitle)):
    ls = imdbMovieTitle[index].text.split()[1:len(imdbMovieTitle[index].text.split())-1]
    year = imdbMovieTitle[index].text.split()[-1]
    movieTitles.append(" ".join(ls))
    releaseYear.append(year[year.find("(")+1:year.find(")")])
    rating.append(imdbMovieRating[index].text.split()[0])

In [12]:
#Making a dataframe with all the movie information
movieDf = pd.DataFrame(movieTitles, columns=['Movie Title'])
movieDf['Year of Release'] = releaseYear
movieDf['Ratings'] = rating
movieDf

Unnamed: 0,Movie Title,Year of Release,Ratings
0,The Shawshank Redemption,1994,9.2
1,The Godfather,1972,9.1
2,The Godfather: Part II,1974,9.0
3,The Dark Knight,2008,9.0
4,12 Angry Men,1957,8.9
...,...,...,...
95,Citizen Kane,1941,8.3
96,Dangal,2016,8.3
97,Idi i smotri,1985,8.2
98,The Kid,1921,8.2


In [13]:
#Saving in a excel file
movieDf.to_csv('IMDB Top Rated 100 Movies.csv')

##### 3. IMDB Top Rated 100 Indian Movies <a name='imdbInd'></a>

In [14]:
imdbIndiaPage = requests.get('https://www.imdb.com/india/top-rated-indian-movies/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2e9dfa9b-3e4d-4d39-acd2-8af11f252a59&pf_rd_r=9MGQC1N7X84HA9ET22FP&pf_rd_s=right-5&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_india_tr_rhs_1')
imdbIndiaPage

<Response [200]>

In [15]:
imdbIndiaSoup = BeautifulSoup(imdbIndiaPage.content, 'html.parser')

In [16]:
#Keeping the first 100 items from list returned by .find_all
#This list contains all 'td' elements with the class 'titleColumn'
imdbIndiaTitle = imdbIndiaSoup.find_all('td', class_ = 'titleColumn')
imdbIndiaTitle = imdbIndiaTitle[0:100]

In [17]:
#Keeping the first 100 items from list returned by .find_all
#This list contains all 'td' elements with the class 'ratingColumn imdbRating'
imdbIndiaRating = imdbIndiaSoup.find_all('td', class_ = 'ratingColumn imdbRating')
imdbIndiaRating = imdbIndiaRating[0:100]

In [18]:
movieIndiaTitles = []
releaseIndiaYear = []
ratingIndia = []
for index in range(len(imdbMovieTitle)):
    ls = imdbIndiaTitle[index].text.split()[1:len(imdbIndiaTitle[index].text.split())-1]
    year = imdbIndiaTitle[index].text.split()[-1]
    movieIndiaTitles.append(" ".join(ls))
    releaseIndiaYear.append(year[year.find("(")+1:year.find(")")])
    ratingIndia.append(imdbIndiaRating[index].text.split()[0])

In [19]:
#Making a dataframe with all the movie information
movieIndiaDf = pd.DataFrame(movieIndiaTitles, columns=['Movie Title'])
movieIndiaDf['Year of Release'] = releaseIndiaYear
movieIndiaDf['Ratings'] = ratingIndia
movieIndiaDf

Unnamed: 0,Movie Title,Year of Release,Ratings
0,Pather Panchali,1955,8.5
1,Gol Maal,1979,8.5
2,Nayakan,1987,8.5
3,Anbe Sivam,2003,8.5
4,Apur Sansar,1959,8.5
...,...,...,...
95,The Legend of Bhagat Singh,2002,8.0
96,Barfi!,2012,8.0
97,Pink,2016,8.0
98,Bommarillu,2006,8.0


In [20]:
movieIndiaDf.to_csv('IMDB Top Rated 100 Indian Movies.csv')

##### 4. Book Page Reviews <a name= 'bookpage'></a>

In [21]:
bookReviewPage = requests.get('https://bookpage.com/reviews')
bookReviewPage

<Response [200]>

In [22]:
bookReviewSoup = BeautifulSoup(bookReviewPage.content)

In [23]:
bookInfo = bookReviewSoup.find_all('div', class_ = 'flex-article-content')
bookInfo = bookInfo[0:5]

In [24]:
bookTitle = []
bookAuthor = []
bookGenre = []
bookReview = []

for index in range(len(bookInfo)):
    bookTitle.append(bookInfo[index].h4.text.replace('\n','').replace('★', ''))
    bookAuthor.append(bookInfo[index].find('p', class_='sans bold').text.replace('\n',''))
    bookGenre.append(bookInfo[index].find('p', class_ = 'genre-links hidden-phone').text.replace('\n','').replace('/',','))
    if bookInfo[index].find('p', class_ = 'excerpt').text == '\n':
        bookReview.append(bookInfo[index].find_all('p')[3].text.replace('\n',''))
    else:
        bookReview.append(bookInfo[index].find('p', class_ = 'excerpt').text.replace('\n',''))

In [25]:
bookReviewDf = pd.DataFrame()

In [26]:
bookReviewDf['Title'] = bookTitle
bookReviewDf['Author'] = bookAuthor
bookReviewDf['Genre'] = bookGenre
bookReviewDf['Review'] = bookReview

In [27]:
bookReviewDf

Unnamed: 0,Title,Author,Genre,Review
0,Secrets of Happiness,Joan Silber,"Fiction , Family Drama",Rarely is a novel of moral ideas so buoyant in...
1,"Olympus, Texas",Stacey Swann,"Fiction , Family Drama",A man’s return to his Texas hometown sets off ...
2,"From Little Tokyo, With Love",Sarah Kuhn,"YA , YA Fiction","Rika Rakuyama loves Little Tokyo, from its ram..."
3,Seed to Dust,Marc Hamer,"Nonfiction , Memoir , Nature",Hamer uses his deep knowledge of gardens to gr...
4,Before I Saw You,Emily Houghton,"Romance , Contemporary Romance","Emily Houghton’s Before I Saw You is a tender,..."


##### 5. ICC Cricket (Men) <a name='iccMen'></a>
(i) Top 10 ODI teams in men’s cricket along with the records for matches, points and rating.

In [28]:
odiMenTeams = requests.get('https://www.icc-cricket.com/rankings/mens/team-rankings/odi')

In [29]:
odiMenTeamsSoup = BeautifulSoup(odiMenTeams.content)

In [30]:
#Initialising empty list to hold ranking, county, match, points and ratings
menRanking = []
menCountry = []
menMatch = []
menPoints = []
menRating = []

In [31]:
#Manually adding the first items in the lists above since the country with Ranking 1 has different class names
#then the rest of the countries
firstRank = odiMenTeamsSoup.find('td', class_ = 'rankings-block__banner--pos')
menRanking.append(firstRank.text)
firstCountry = odiMenTeamsSoup.find('td', class_ = 'rankings-block__banner--team-name')
menCountry.append(firstCountry.text.split('\n')[2])
firstMatch = odiMenTeamsSoup.find('td', class_ = 'rankings-block__banner--matches')
menMatch.append(firstMatch.text)
firstPoints = odiMenTeamsSoup.find('td', class_ = 'rankings-block__banner--points')
menPoints.append(firstPoints.text.replace(',',''))
firstRating = odiMenTeamsSoup.find('td', class_ = 'rankings-block__banner--rating u-text-right')
menRating.append(firstRating.text.replace('\n', '').split()[0])

In [32]:
odiMen = odiMenTeamsSoup.find_all('tr', class_ = 'table-body')
odiMen = odiMen[0:9]

In the following cells I am taking just the first element of odiMen and examining which combination of commands and functions produces the desired output. Once the right format is identified, we can use for loop to find the rest of the data.

In [33]:
odiMen[0].find('td', class_ = 'table-body__cell table-body__cell--position u-text-right').text

'2'

In [34]:
odiMen[0].find('td', class_ = 'table-body__cell rankings-table__team').text.split('\n')[2]

'Australia'

In [35]:
odiMen[0].find('td', class_ = 'table-body__cell u-center-text').text

'25'

In [36]:
odiMen[0].find_all('td', class_ = 'table-body__cell u-center-text')[1].text.replace(',','')

'2945'

In [37]:
odiMen[0].find('td', class_ = 'table-body__cell u-text-right rating').text

'118'

In [38]:
for index in range(len(odiMen)):
    menRanking.append(odiMen[index].find('td', class_ = 'table-body__cell table-body__cell--position u-text-right').text)
    menCountry.append(odiMen[index].find('td', class_ = 'table-body__cell rankings-table__team').text.split('\n')[2])
    menMatch.append(odiMen[index].find('td', class_ = 'table-body__cell u-center-text').text)
    menPoints.append(odiMen[index].find_all('td', class_ = 'table-body__cell u-center-text')[1].text.replace(',',''))
    menRating.append(odiMen[index].find('td', class_ = 'table-body__cell u-text-right rating').text)

In [39]:
#Making a dataframe to hold all the scrapped information
menTopOdiTeam = pd.DataFrame()
menTopOdiTeam['Ranking'] = menRanking
menTopOdiTeam['Country'] = menCountry
menTopOdiTeam['Match'] = menMatch
menTopOdiTeam['Points'] = menPoints
menTopOdiTeam['Rating'] = menRating

In [40]:
menTopOdiTeam

Unnamed: 0,Ranking,Country,Match,Points,Rating
0,1,New Zealand,17,2054,121
1,2,Australia,25,2945,118
2,3,India,29,3344,115
3,4,England,27,3100,115
4,5,South Africa,20,2137,107
5,6,Pakistan,24,2323,97
6,7,Bangladesh,24,2157,90
7,8,West Indies,27,2222,82
8,9,Sri Lanka,21,1652,79
9,10,Afghanistan,17,1054,62


(ii) Top 10 ODI Batsmen in men along with the records of their team and rating.

In [41]:
menBatsman = requests.get('https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting')

In [42]:
menBatsmanSoup = BeautifulSoup(menBatsman.content)

In [43]:
#Initialising empty list to hold batsman ranking, name, county, and ratings
menBatsRanking = [rank for rank in range(1,11)]
menBatsName = []
menBatsCountry = []
menBatsRating = []

In [44]:
#Manually adding the first items in the lists above since the Ranking 1 row has different class names
#then the rest

menBatsmanName = menBatsmanSoup.find('td', class_ = 'rankings-block__top-player-container')
menBatsName.append(menBatsmanName.text.replace('\n', ''))

mensBatsmanCountry = menBatsmanSoup.find('div', class_ = 'rankings-block__banner--nationality')
menBatsCountry.append(mensBatsmanCountry.text.replace('\n','').split()[0])

mensBatsmanRating = menBatsmanSoup.find('div', class_ = 'rankings-block__banner--rating')
menBatsRating.append(mensBatsmanRating.text)

In [45]:
mensBat = menBatsmanSoup.find_all('tr', class_ = 'table-body')
mensBat = mensBat[0:9]

In the following cells I am taking just the first element of mensBat and examining which combination of commands and functions produces the desired output. Once the right format is identified, we can use for loop to find the rest of the data.

In [46]:
mensBat[0].find('td', class_ = 'table-body__cell rankings-table__name name').a.text

'Virat Kohli'

In [47]:
mensBat[0].find('td', class_ = 'table-body__cell nationality-logo rankings-table__team').text.replace('\n','')

'IND'

In [48]:
mensBat[0].find('td', class_="table-body__cell rating").text

'857'

In [49]:
for index in range(len(mensBat)):
    menBatsName.append(mensBat[index].find('td', class_ = 'table-body__cell rankings-table__name name').a.text)
    menBatsCountry.append(mensBat[index].find('td', class_ = 'table-body__cell nationality-logo rankings-table__team').text.replace('\n',''))
    menBatsRating.append(mensBat[index].find('td', class_="table-body__cell rating").text)

In [50]:
odiBatsmanRankingDf = pd.DataFrame()
odiBatsmanRankingDf['Ranking'] = menBatsRanking
odiBatsmanRankingDf['Batsman Name'] = menBatsName
odiBatsmanRankingDf['Country'] = menBatsCountry
odiBatsmanRankingDf['Rating'] = menBatsRating

odiBatsmanRankingDf

Unnamed: 0,Ranking,Batsman Name,Country,Rating
0,1,Babar Azam,PAK,865
1,2,Virat Kohli,IND,857
2,3,Rohit Sharma,IND,825
3,4,Ross Taylor,NZ,801
4,5,Aaron Finch,AUS,791
5,6,Jonny Bairstow,ENG,785
6,7,Fakhar Zaman,PAK,778
7,8,Francois du Plessis,SA,778
8,9,David Warner,AUS,773
9,10,Shai Hope,WI,773


(iii) Top 10 ODI bowlers along with the records of their team and rating.

In [51]:
menBowling = requests.get('https://www.icc-cricket.com/rankings/mens/player-rankings/odi/bowling')

In [52]:
menBowlingSoup = BeautifulSoup(menBowling.content)

In [53]:
#Initialising empty list to hold bowler ranking, name, county, and ratings
menBowlRanking = [rank for rank in range(1,11)]
menBowlName = []
menBowlCountry = []
menBowlRating = []

In [54]:
#Manually adding the first items in the lists above since the Ranking 1 row has different class names
#then the rest

menBowlName1 = menBowlingSoup.find('div', class_ = 'rankings-block__banner--name-large')
menBowlName.append(menBowlName1.text)

menBowlCountry1 = menBowlingSoup.find('div', class_ = 'rankings-block__banner--nationality')
menBowlCountry.append(menBowlCountry1.text.replace('\n', '').split()[0])

menBowlRating1 = menBowlingSoup.find('div', class_= 'rankings-block__banner--rating')
menBowlRating.append(menBowlRating1.text)

In [55]:
mensBowl = menBowlingSoup.find_all('tr', class_ = 'table-body')
mensBowl = mensBowl[0:9]

In the following cells I am taking just the first element of mensBowl and examining which combination of commands and functions produces the desired output. Once the right format is identified, we can use for loop to find the rest of the data.

In [56]:
mensBowl[0].find('td', class_ = 'table-body__cell rankings-table__name name').text.replace('\n', '')

'Mujeeb Ur Rahman'

In [57]:
mensBowl[0].find('td', class_ = 'table-body__cell nationality-logo rankings-table__team').text.replace('\n', '')

'AFG'

In [58]:
mensBowl[0].find('td', class_='table-body__cell rating').text

'708'

In [59]:
for index in range(len(mensBowl)):
    menBowlName.append(mensBowl[index].find('td', class_ = 'table-body__cell rankings-table__name name').text.replace('\n', ''))
    menBowlCountry.append(mensBowl[index].find('td', class_ = 'table-body__cell nationality-logo rankings-table__team').text.replace('\n', ''))
    menBowlRating.append(mensBowl[index].find('td', class_='table-body__cell rating').text)

In [60]:
odiBowlRankingDf = pd.DataFrame()

odiBowlRankingDf['Ranking'] = menBowlRanking
odiBowlRankingDf['Name'] = menBowlName
odiBowlRankingDf['Country'] = menBowlCountry
odiBowlRankingDf['Rating'] = menBowlRating

odiBowlRankingDf

Unnamed: 0,Ranking,Name,Country,Rating
0,1,Trent Boult,NZ,737
1,2,Mujeeb Ur Rahman,AFG,708
2,3,Matt Henry,NZ,691
3,4,Jasprit Bumrah,IND,690
4,5,Mehedi Hasan,BAN,668
5,6,Kagiso Rabada,SA,666
6,7,Chris Woakes,ENG,665
7,8,Josh Hazlewood,AUS,660
8,9,Pat Cummins,AUS,646
9,10,Mohammad Amir,PAK,638


##### 6. ICC Cricket (Women) <a name='iccWomen'></a>
(i) Top 10 ODI teams in women’s cricket along with the records for matches, points and rating.

In [61]:
odiWomenTeams = requests.get('https://www.icc-cricket.com/rankings/womens/team-rankings/odi')

In [62]:
odiWomenTeamsSoup = BeautifulSoup(odiWomenTeams.content)

In [63]:
#Initialising empty list to hold ranking, county, match, points and ratings
womenRanking = [rank for rank in range(1,11)]
womenCountry = []
womenMatch = []
womenPoints = []
womenRating = []

In [64]:
#Manually adding the first items in the lists above since the country with Ranking 1 has different class names
#then the rest of the countries

firstCountry = odiWomenTeamsSoup.find('td', class_ = 'rankings-block__banner--team-name')
womenCountry.append(firstCountry.text.split('\n')[2])
firstMatch = odiWomenTeamsSoup.find('td', class_ = 'rankings-block__banner--matches')
womenMatch.append(firstMatch.text)
firstPoints = odiWomenTeamsSoup.find('td', class_ = 'rankings-block__banner--points')
womenPoints.append(firstPoints.text.replace(',',''))
firstRating = odiWomenTeamsSoup.find('td', class_ = 'rankings-block__banner--rating u-text-right')
womenRating.append(firstRating.text.replace('\n', '').split()[0])

In [65]:
odiWomen = odiWomenTeamsSoup.find_all('tr', class_ = 'table-body')
odiWomen = odiWomen[0:9]

In the following cells I am taking just the first element of odiWomen and examining which combination of commands and functions produces the desired output. Once the right format is identified, we can use for loop to find the rest of the data.

In [66]:
odiWomen[0].find('td', class_ = 'table-body__cell table-body__cell--position u-text-right').text

'2'

In [67]:
odiWomen[0].find('td', class_ = 'table-body__cell rankings-table__team').text.split('\n')[2]

'South Africa'

In [68]:
odiWomen[0].find('td', class_ = 'table-body__cell u-center-text').text

'24'

In [69]:
odiWomen[0].find_all('td', class_ = 'table-body__cell u-center-text')[1].text.replace(',','')

'2828'

In [70]:
odiWomen[0].find('td', class_ = 'table-body__cell u-text-right rating').text

'118'

In [71]:
for index in range(len(odiWomen)):
    womenCountry.append(odiWomen[index].find('td', class_ = 'table-body__cell rankings-table__team').text.split('\n')[2])
    womenMatch.append(odiWomen[index].find('td', class_ = 'table-body__cell u-center-text').text)
    womenPoints.append(odiWomen[index].find_all('td', class_ = 'table-body__cell u-center-text')[1].text.replace(',',''))
    womenRating.append(odiWomen[index].find('td', class_ = 'table-body__cell u-text-right rating').text)

In [72]:
#Making a dataframe to hold all the scrapped information
womenTopOdiTeam = pd.DataFrame()
womenTopOdiTeam['Ranking'] = womenRanking
womenTopOdiTeam['Country'] = womenCountry
womenTopOdiTeam['Match'] = womenMatch
womenTopOdiTeam['Points'] = womenPoints
womenTopOdiTeam['Rating'] = womenRating

In [73]:
womenTopOdiTeam

Unnamed: 0,Ranking,Country,Match,Points,Rating
0,1,Australia,18,2955,164
1,2,South Africa,24,2828,118
2,3,England,17,1993,117
3,4,India,20,2226,111
4,5,New Zealand,21,1947,93
5,6,West Indies,12,1025,85
6,7,Pakistan,15,1101,73
7,8,Bangladesh,5,306,61
8,9,Sri Lanka,11,519,47
9,10,Ireland,2,25,13


(ii) Top 10 ODI Batsmen in women along with the records of their team and rating.

In [74]:
womenBatsman = requests.get('https://www.icc-cricket.com/rankings/womens/player-rankings/odi/batting')

In [75]:
womenBatsmanSoup = BeautifulSoup(womenBatsman.content)

In [76]:
#Initialising empty list to hold batsman ranking, name, county, and ratings
womenBatsRanking = [rank for rank in range(1,11)]
womenBatsName = []
womenBatsCountry = []
womenBatsRating = []

In [77]:
#Manually adding the first items in the lists above since the Ranking 1 row has different class names
#then the rest

womenBatsmanName = womenBatsmanSoup.find('td', class_ = 'rankings-block__top-player-container')
womenBatsName.append(womenBatsmanName.text.replace('\n', ''))

womensBatsmanCountry = womenBatsmanSoup.find('div', class_ = 'rankings-block__banner--nationality')
womenBatsCountry.append(womensBatsmanCountry.text.replace('\n','').split()[0])

womensBatsmanRating = womenBatsmanSoup.find('div', class_ = 'rankings-block__banner--rating')
womenBatsRating.append(womensBatsmanRating.text)

In [78]:
womensBat = womenBatsmanSoup.find_all('tr', class_ = 'table-body')
womensBat = womensBat[0:9]

In the following cells I am taking just the first element of womensBat and examining which combination of commands and functions produces the desired output. Once the right format is identified, we can use for loop to find the rest of the data.

In [79]:
womensBat[0].find('td', class_ = 'table-body__cell rankings-table__name name').a.text

'Lizelle Lee'

In [80]:
womensBat[0].find('td', class_ = 'table-body__cell nationality-logo rankings-table__team').text.replace('\n','')

'SA'

In [81]:
womensBat[0].find('td', class_="table-body__cell rating").text

'758'

In [82]:
for index in range(len(womensBat)):
    womenBatsName.append(womensBat[index].find('td', class_ = 'table-body__cell rankings-table__name name').a.text)
    womenBatsCountry.append(womensBat[index].find('td', class_ = 'table-body__cell nationality-logo rankings-table__team').text.replace('\n',''))
    womenBatsRating.append(womensBat[index].find('td', class_="table-body__cell rating").text)

In [83]:
odiBatsmanWomenRankingDf = pd.DataFrame()
odiBatsmanWomenRankingDf['Ranking'] = womenBatsRanking
odiBatsmanWomenRankingDf['Batsman Name'] = womenBatsName
odiBatsmanWomenRankingDf['Country'] = womenBatsCountry
odiBatsmanWomenRankingDf['Rating'] = womenBatsRating

odiBatsmanWomenRankingDf

Unnamed: 0,Ranking,Batsman Name,Country,Rating
0,1,Tammy Beaumont,ENG,765
1,2,Lizelle Lee,SA,758
2,3,Alyssa Healy,AUS,756
3,4,Stafanie Taylor,WI,746
4,5,Meg Lanning,AUS,723
5,6,Amy Satterthwaite,NZ,715
6,7,Smriti Mandhana,IND,710
7,8,Mithali Raj,IND,709
8,9,Natalie Sciver,ENG,685
9,10,Laura Wolvaardt,SA,683


6 (ii) Top 10 ODI women bowlers along with the records of their team and rating.

In [84]:
womenBowling = requests.get('https://www.icc-cricket.com/rankings/womens/player-rankings/odi/bowling')

In [85]:
womenBowlingSoup = BeautifulSoup(womenBowling.content)

In [86]:
#Initialising empty list to hold bowler ranking, name, county, and ratings
womenBowlRanking = [rank for rank in range(1,11)]
womenBowlName = []
womenBowlCountry = []
womenBowlRating = []

In [87]:
#Manually adding the first items in the lists above since the Ranking 1 row has different class names
#then the rest

womenBowlName1 = womenBowlingSoup.find('div', class_ = 'rankings-block__banner--name-large')
womenBowlName.append(womenBowlName1.text)

womenBowlCountry1 = womenBowlingSoup.find('div', class_ = 'rankings-block__banner--nationality')
womenBowlCountry.append(womenBowlCountry1.text.replace('\n', '').split()[0])

womenBowlRating1 = womenBowlingSoup.find('div', class_= 'rankings-block__banner--rating')
womenBowlRating.append(womenBowlRating1.text)

In [88]:
womensBowl = womenBowlingSoup.find_all('tr', class_ = 'table-body')
womensBowl = womensBowl[0:9]

In the following cells I am taking just the first element of mensBowl and examining which combination of commands and functions produces the desired output. Once the right format is identified, we can use for loop to find the rest of the data.

In [89]:
womensBowl[0].find('td', class_ = 'table-body__cell rankings-table__name name').text.replace('\n', '')

'Megan Schutt'

In [90]:
womensBowl[0].find('td', class_ = 'table-body__cell nationality-logo rankings-table__team').text.replace('\n', '')

'AUS'

In [91]:
womensBowl[0].find('td', class_='table-body__cell rating').text

'762'

In [92]:
for index in range(len(womensBowl)):
    womenBowlName.append(womensBowl[index].find('td', class_ = 'table-body__cell rankings-table__name name').text.replace('\n', ''))
    womenBowlCountry.append(womensBowl[index].find('td', class_ = 'table-body__cell nationality-logo rankings-table__team').text.replace('\n', ''))
    womenBowlRating.append(womensBowl[index].find('td', class_='table-body__cell rating').text)

In [93]:
odiBowlWomenRankingDf = pd.DataFrame()

odiBowlWomenRankingDf['Ranking'] = womenBowlRanking
odiBowlWomenRankingDf['Name'] = womenBowlName
odiBowlWomenRankingDf['Country'] = womenBowlCountry
odiBowlWomenRankingDf['Rating'] = womenBowlRating

odiBowlWomenRankingDf

Unnamed: 0,Ranking,Name,Country,Rating
0,1,Jess Jonassen,AUS,808
1,2,Megan Schutt,AUS,762
2,3,Marizanne Kapp,SA,747
3,4,Shabnim Ismail,SA,717
4,5,Jhulan Goswami,IND,681
5,6,Katherine Brunt,ENG,655
6,7,Poonam Yadav,IND,641
7,8,Ayabonga Khaka,SA,638
8,9,Ellyse Perry,AUS,616
9,10,Shikha Pandey,IND,610


(iii) Top 10 women’s ODI all-rounder along with the records of their team and rating.

In [94]:
womenAll = requests.get('https://www.icc-cricket.com/rankings/womens/player-rankings/odi/all-rounder')

In [95]:
womenAllSoup = BeautifulSoup(womenAll.content)

In [96]:
#Initialising empty list to hold bowler ranking, name, county, and ratings
womenAllRanking = [rank for rank in range(1,11)]
womenAllName = []
womenAllCountry = []
womenAllRating = []

In [97]:
#Manually adding the first items in the lists above since the Ranking 1 row has different class names
#then the rest

womenAllName1 = womenAllSoup.find('div', class_ = 'rankings-block__banner--name-large')
womenAllName.append(womenAllName1.text)

womenAllCountry1 = womenAllSoup.find('div', class_ = 'rankings-block__banner--nationality')
womenAllCountry.append(womenAllCountry1.text.replace('\n', '').split()[0])

womenAllRating1 = womenAllSoup.find('div', class_= 'rankings-block__banner--rating')
womenAllRating.append(womenAllRating1.text)

In [98]:
womenAll = womenAllSoup.find_all('tr', class_ = 'table-body')
womenAll = womenAll[0:9]

In the following cells I am taking just the first element of womenAll and examining which combination of commands and functions produces the desired output. Once the right format is identified, we can use for loop to find the rest of the data.

In [99]:
womenAll[0].find('td', class_ = 'table-body__cell rankings-table__name name').text.replace('\n', '')

'Ellyse Perry'

In [100]:
womenAll[0].find('td', class_ = 'table-body__cell nationality-logo rankings-table__team').text.replace('\n', '')

'AUS'

In [101]:
womenAll[0].find('td', class_='table-body__cell rating').text

'418'

In [102]:
for index in range(len(womenAll)):
    womenAllName.append(womenAll[index].find('td', class_ = 'table-body__cell rankings-table__name name').text.replace('\n', ''))
    womenAllCountry.append(womenAll[index].find('td', class_ = 'table-body__cell nationality-logo rankings-table__team').text.replace('\n', ''))
    womenAllRating.append(womenAll[index].find('td', class_='table-body__cell rating').text)

In [103]:
womenAllRankingDf = pd.DataFrame()

womenAllRankingDf['Ranking'] = womenAllRanking
womenAllRankingDf['Name'] = womenAllName
womenAllRankingDf['Country'] = womenAllCountry
womenAllRankingDf['Rating'] = womenAllRating

womenAllRankingDf

Unnamed: 0,Ranking,Name,Country,Rating
0,1,Marizanne Kapp,SA,418
1,2,Ellyse Perry,AUS,418
2,3,Stafanie Taylor,WI,410
3,4,Natalie Sciver,ENG,349
4,5,Deepti Sharma,IND,343
5,6,Jess Jonassen,AUS,307
6,7,Ashleigh Gardner,AUS,252
7,8,Dane van Niekerk,SA,243
8,9,Sophie Devine,NZ,242
9,10,Amelia Kerr,NZ,236


##### 7. Amazon Mobiles under ₹20,000 <a name='amazon'></a>

In [104]:
pages = [page for page in range(1,21)]

In [105]:
def amazonScapper(pageNo):
    '''This function takes the page number and then returns the product name, price, rating and image url.
    Note: 1. Sometimes throws error. In such a case, re-run again.
          2. This function does not scrape sponsored products in the page'''
    
    url = f'https://www.amazon.in/s?k=mobile&rh=p_36%3A100-1999900&page={pageNo}&qid=1620066055&rnid=1318502031&ref=sr_pg_{pageNo}'
    page = requests.get(url)
    soup = BeautifulSoup(page.content)
    
    name = []
    price = []
    rating = []
    imgUrl = []
    
    productName = soup.find_all('h2', class_ = 'a-size-mini a-spacing-none a-color-base s-line-clamp-2')
    productPrice = soup.find_all('span', class_='a-price-whole')
    productRating = soup.find_all('span', class_='a-icon-alt')
    productImgUrl = soup.find_all('img', class_='s-image')
    
    for index in range(len(soup)):
        if productName == [] or productPrice == [] or productRating == [] or productImgUrl == []:
            pass
        else:
            name.append(productName[index].text)
            price.append(productPrice[index].text.replace(',',''))
            rating.append(productRating[index].text[0:3])
            imgUrl.append(productImgUrl[index]['src'])
        
    return name, price, rating, imgUrl

In [106]:
#Making lists containing all the names, prices, ratings and image URLs from all the 20 amazon pages

nameAll = []
priceAll = []
ratingAll = []
imgAll = []

for page in pages:
    n, p, r, i = amazonScapper(page)
    nameAll.extend(n)
    priceAll.extend(p)
    ratingAll.extend(r)
    imgAll.extend(i)
    print(f'Iteration {page}')

print('Loop Complete')

Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Loop Complete


In [107]:
mobileDf = pd.DataFrame()

mobileDf['Name'] = nameAll
mobileDf['Price'] = priceAll
mobileDf['Rating'] = ratingAll
mobileDf['Img URL'] = imgAll

mobileDf.head()

Unnamed: 0,Name,Price,Rating,Img URL
0,"Infinix Smart 4 Blue, 2GB, 32GB",9500.0,3.9,https://m.media-amazon.com/images/I/513txU2lnf...
1,"Motorola Plus (Fine Gold, 32GB)",8970.0,3.4,https://m.media-amazon.com/images/I/71WSw-nKCR...
2,"Lenovo A7 (Black, 4GB RAM, 64GB Storage, 4000m...",10499.0,4.1,https://m.media-amazon.com/images/I/317jmcaI6P...
3,"Realme C12 (Power Blue, 4GB RAM, 64GB Storage)...",15999.0,4.3,https://m.media-amazon.com/images/I/31X1+mh8zP...
4,Portable USB Cell Phone for Children Student M...,5949.0,3.1,https://m.media-amazon.com/images/I/51zgVtjY2K...


##### 8. San Francisco Weather Data <a name = 'sfWeather'></a> 

In [108]:
sfPage = requests.get('https://forecast.weather.gov/MapClick.php?lat=37.777120000000025&lon=-122.41963999999996#.YJDog7UzZPY')

In [109]:
sfSoup = BeautifulSoup(sfPage.content)

In [110]:
sfWeather = sfSoup.find_all('li', class_ = 'forecast-tombstone')
sfWeatherLong = sfSoup.find_all('div', class_="col-sm-10 forecast-text")

In [111]:
len(sfWeather)

9

In the following cells I am taking just the first element of sfWeather and examining which combination of commands and functions produces the desired output. Once the right format is identified, we can use for loop to find the rest of the data.

In [112]:
sfWeather[0].find('p', class_ = 'period-name').text

'Tonight'

In [113]:
sfWeather[0].find('p', class_ = 'short-desc').text

'Mostly Clear'

In [114]:
sfWeather[0].find('p', class_ = 'temp').text

'Low: 50 °F'

In [115]:
sfWeatherLong[0].text

'Mostly clear, with a low around 50. West southwest wind around 7 mph. '

In [116]:
period = []
shortDesc = []
temp = []
descLong = []

for index in range(len(sfWeather)):
    period.append(sfWeather[index].find('p', class_ = 'period-name').text)
    shortDesc.append(sfWeather[index].find('p', class_ = 'short-desc').text)
    temp.append(sfWeather[index].find('p', class_ = 'temp').text)
    descLong.append(sfWeatherLong[index].text)

In [117]:
weatherSF = pd.DataFrame()

weatherSF['Period'] = period
weatherSF['Short Description'] = shortDesc
weatherSF['Temperature'] = temp
weatherSF['Long Description'] = descLong

weatherSF

Unnamed: 0,Period,Short Description,Temperature,Long Description
0,Tonight,Mostly Clear,Low: 50 °F,"Mostly clear, with a low around 50. West south..."
1,Wednesday,Sunny,High: 66 °F,"Sunny, with a high near 66. Light west southwe..."
2,WednesdayNight,Partly Cloudy,Low: 51 °F,"Partly cloudy, with a low around 51. West wind..."
3,Thursday,Sunny,High: 64 °F,"Sunny, with a high near 64. West wind 8 to 17 ..."
4,ThursdayNight,Mostly Clear,Low: 50 °F,"Mostly clear, with a low around 50. West wind ..."
5,Friday,Sunny,High: 64 °F,"Sunny, with a high near 64."
6,FridayNight,Clear,Low: 51 °F,"Clear, with a low around 51."
7,Saturday,Sunny,High: 69 °F,"Sunny, with a high near 69."
8,SaturdayNight,Mostly Clear,Low: 52 °F,"Mostly clear, with a low around 52."
