# Web Scraping in Python

What is Web Scraping?
<ul>
  <li>In simple terms, web scraping is the process of extracting content and data from a website.</li>
  <li>A form of Data Engineering</li>
  <li>Useful for creating your own datasets to answer your own questions</li>
</ul>

How does web scraping play a role in Data Science?

<img src="http://www.uvm.edu/~cbcafier/assets/pipeline.jpg">

<b>Data Wrangling</b> - The process of restructuring, cleaning, and enriching raw data into a desired format for easy access and analysis.

<b>Data cleansing</b> - The process of prepping data for analysis by amending or removing incorrect, corrupted, improperly formatted, duplicated, irrelevant, or incomplete data within a dataset. 

<b>Explore</b> - Analyze your data to spot trends and correlations.

<b>Pre-Process</b> - Prepare the data for modeling through steps such as standardization of values or one-hot encoding.

<b>Model</b> - Create your model.

<b>Validate</b> - Use different metrics (like ROC, F1, loss, etc.) to check that your model has minimal errors.

<b>Tell the Story</b> - Use your model to make decisions.

Notice that before any of these steps we need to (1) have a question and then (2) gather our data (via web scraping).

<br>

<center>
<p style="color:red;"><b>Question - Do Major League Soccer Teams pay their players more depending on their position?</b></p> 
</center>

<br>

Now that we have our question, we need to gather or create a dataset. But before we can talk about that, we need to learn some HTML & CSS:
<ul>
<li>HTML (HyperText Markup Language) is the markup language used to structure the content of web pages. More specifically, a webpage is built from HTML tags, which can carry HTML attributes.</li>

<li>CSS (Cascading Style Sheets) is used to style and layout web pages.</li>
</ul>
<i>**Let's Do a Very Quick Demo on HTML and CSS Selectors**</i>

<br>
<br>

[CSS Selectors](https://www.w3schools.com/cssref/css_selectors.asp)
[SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=en)
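
Here is a minimal sketch of how CSS selectors pick out tags and attributes with Beautiful Soup. The markup, player names, and URLs below are invented for illustration and are not any real site's HTML. Only the selector patterns (`#id`, `.class descendant`, and the adjacent-sibling `+`) are the point.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a roster table (not real site HTML).
html = """
<table id="roster">
  <tr>
    <td class="player"><a href="/player/1/">Ada Example</a></td>
    <td class="center"><span class="cap">F</span></td>
  </tr>
  <tr>
    <td class="player"><a href="/player/2/">Ben Sample</a></td>
    <td class="center"><span class="cap">GK</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "#roster" matches the element whose id attribute is "roster".
table = soup.select_one("#roster")

# ".player a" matches every <a> nested inside an element with class "player".
names = [a.get_text(strip=True) for a in soup.select(".player a")]

# ".player + .center .cap" matches .cap elements inside a .center element
# that immediately follows a .player sibling.
positions = [s.get_text(strip=True) for s in soup.select(".player + .center .cap")]

# Attributes are read with dictionary-style access on a tag.
links = [a["href"] for a in soup.select(".player a")]

print(names, positions, links)
```

A tool like SelectorGadget can help find selectors of this kind on a live page, but it is worth checking the result against the page source before relying on it.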

Now, let's get into web scraping!



# Install Packages

In [None]:
!pip install requests



In [None]:
!pip install beautifulsoup4



# Import Libraries


In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# USAS

Do MLS players get paid more if they play more minutes?
<br>
How does the number of penalty kicks saved correlate with a goalkeeper's salary?

What should the DataFrame look like?

Each row is a specific player, with salary and stats.

First, let's get all the players for each individual team.

In [None]:
html_text = requests.get("https://www.spotrac.com/mls/")
print(html_text.status_code)

200
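
Printing the status code is fine for a one-off check. When many pages are fetched in a loop, a small helper that fails loudly may be more convenient. The sketch below is an assumption and not part of the original notebook: `fetch_html` is a hypothetical name, and `session` is expected to be a `requests.Session` (or any object with a compatible `get` method).

```python
def fetch_html(url, session, timeout=10):
    """Return the page body as text, or raise if the request failed.

    `session` is assumed to behave like requests.Session.
    """
    # A timeout keeps the call from hanging indefinitely on a slow server.
    response = session.get(url, timeout=timeout)
    # For a requests response, raise_for_status() raises on 4xx/5xx codes.
    response.raise_for_status()
    return response.text
```

With `requests` imported, a call would look like `fetch_html("https://www.spotrac.com/mls/", requests.Session())`.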


In [None]:
soup = bs(html_text.content, 'html.parser')

In [None]:
team_names = soup.select("#main .col-md-4")
team_names[0:3]

[<div class="teamname col-xs-10 col-md-4">
 <h3><a class="team-name" href="https://www.spotrac.com/mls/atlanta-united-fc/">Atlanta United FC</a></h3>
 </div>, <div class="teamname col-xs-10 col-md-4">
 <h3><a class="team-name" href="https://www.spotrac.com/mls/austin-fc/">Austin FC</a></h3>
 </div>, <div class="teamname col-xs-10 col-md-4">
 <h3><a class="team-name" href="https://www.spotrac.com/mls/cf-montreal/">CF Montreal</a></h3>
 </div>]

In [None]:
team_names[0].find_all('a', class_="team-name")[0].string
team_names[0].find_all('a', class_="team-name")[0]['href']

'https://www.spotrac.com/mls/atlanta-united-fc/'

In [None]:
team_names[0].text.strip()

'Atlanta United FC'

In [None]:
mls_team_names = []
for item in team_names:
  # Each block holds one <a class="team-name"> with the team name and roster URL
  link = item.find_all('a', class_="team-name")[0]
  mls_team_names.append((link.string, link['href']))
print(mls_team_names)
len(mls_team_names)

[('Atlanta United FC', 'https://www.spotrac.com/mls/atlanta-united-fc/'), ('Austin FC', 'https://www.spotrac.com/mls/austin-fc/'), ('CF Montreal', 'https://www.spotrac.com/mls/cf-montreal/'), ('Charlotte FC', 'https://www.spotrac.com/mls/charlotte-fc/'), ('Chicago Fire', 'https://www.spotrac.com/mls/chicago-fire/'), ('Colorado Rapids', 'https://www.spotrac.com/mls/colorado-rapids/'), ('Columbus Crew', 'https://www.spotrac.com/mls/columbus-crew/'), ('D.C. United', 'https://www.spotrac.com/mls/dc-united/'), ('FC Cincinnati', 'https://www.spotrac.com/mls/fc-cincinnati/'), ('FC Dallas', 'https://www.spotrac.com/mls/fc-dallas/'), ('Houston Dynamo', 'https://www.spotrac.com/mls/houston-dynamo/'), ('Inter Miami FC', 'https://www.spotrac.com/mls/inter-miami-fc/'), ('LA Galaxy', 'https://www.spotrac.com/mls/la-galaxy/'), ('Los Angeles FC', 'https://www.spotrac.com/mls/los-angeles-fc/'), ('Minnesota United FC', 'https://www.spotrac.com/mls/minnesota-united-fc/'), ('Nashville SC', 'https://www.sp

28

In [None]:
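def extract_info(lst):
  """Return the stripped text of each tag in lst."""
  ret_lst = []
  for item in lst:
    # .string is None when a tag has more than one child, so this assumes
    # every selected tag wraps exactly one piece of text
    ret_lst.append(item.string.strip())
  return ret_lst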

Let's do this for Atlanta United FC first.

In [None]:
# Go to the Atlanta United FC team roster page
team_name = mls_team_names[0]
team_name

('Atlanta United FC', 'https://www.spotrac.com/mls/atlanta-united-fc/')

In [None]:
atlanta_united = requests.get(team_name[1])
print(atlanta_united.status_code)

200


In [None]:
atlanta_united_soup = bs(atlanta_united.content, 'html.parser')

In [None]:
player_name = []
player_position = []
player_base_salary = []
player_guaranteed_salary = []

In [None]:
names = atlanta_united_soup.select('.player a')
print(names)

[<a href="https://www.spotrac.com/redirect/player/21515/">Josef Martinez</a>, <a href="https://www.spotrac.com/redirect/player/74245/">Luiz Araujo</a>, <a href="https://www.spotrac.com/redirect/player/21477/">Miles Robinson</a>, <a href="https://www.spotrac.com/redirect/player/72247/">Alan Franco</a>, <a href="https://www.spotrac.com/redirect/player/47303/">Matheus Rossetto</a>, <a href="https://www.spotrac.com/redirect/player/31970/">Emerson Hyndman</a>, <a href="https://www.spotrac.com/redirect/player/71757/">Santiago Sosa</a>, <a href="https://www.spotrac.com/redirect/player/71786/">Franco Ibarra</a>, <a href="https://www.spotrac.com/redirect/player/21553/">Brooks Lennon</a>, <a href="https://www.spotrac.com/redirect/player/62635/">Marcelino Moreno</a>, <a href="https://www.spotrac.com/redirect/player/21513/">Brad Guzan</a>, <a href="https://www.spotrac.com/redirect/player/32708/">Andrew Gutman</a>, <a href="https://www.spotrac.com/redirect/player/71768/">Ronald Hernandez</a>, <a hr

In [None]:
player_name = extract_info(names)
print(player_name)

['Josef Martinez', 'Luiz Araujo', 'Miles Robinson', 'Alan Franco', 'Matheus Rossetto', 'Emerson Hyndman', 'Santiago Sosa', 'Franco Ibarra', 'Brooks Lennon', 'Marcelino Moreno', 'Brad Guzan', 'Andrew Gutman', 'Ronald Hernandez', 'Ronaldo Cisneros', 'Bobby Shuttleworth', 'Osvaldo Alonso', 'Dom Dwyer', 'Rocco Ríos Novo', 'Mikey Ambrose', 'Dylan Castanheira', 'Alex De John', 'Amar Sejdic', 'Tyler Wolff', 'George Campbell', 'Jackson Conway', 'Caleb Wiley', 'Erik Centeno', 'Machop Chol', 'Bryce Washington', 'Thiago Almada', 'Ezequiel Barco', 'Erik Lopez', 'Justin Garces', 'Efrain Morales', 'Raúl Gudiño', 'Juan José Purata']


In [None]:
player_position = extract_info(atlanta_united_soup.select('.player + .center .cap'))
print(player_position)

['F', 'F', 'D', 'D', 'M', 'M', 'M', 'M', 'M', 'F', 'GK', 'D', 'D', 'F', 'GK', 'M', 'F', 'GK', 'D', 'GK', 'D', 'M', 'F', 'D', 'F', 'D', 'D', 'F', 'D', 'M', 'M', 'F', 'GK', 'D', 'GK', 'D']


In [None]:
player_base_salary = extract_info(atlanta_united_soup.select('.center + .right .info'))
print(player_base_salary)

['$3,750,000', '$3,600,000', '$700,000', '$540,000', '$550,000', '$657,143', '$525,000', '$450,000', '$500,000', '$460,000', '$445,716', '$300,000', '$300,000', '$244,000', '$125,000', '$84,000', '$84,000', '-', '$85,444', '$85,444', '$85,444', '$85,444', '$110,000', '$98,000', '$84,000', '$65,500', '$65,500', '$65,500', '$65,500', '$1,650,000', '$2,200,000', '$360,000', '$65,500', '$65,500', '-', '-']


In [None]:
player_guaranteed_salary = extract_info(atlanta_united_soup.select('.right + .right .info'))
print(player_guaranteed_salary)

['$4,141,667', '$3,941,667', '$737,500', '$667,500', '$662,500', '$657,143', '$643,100', '$520,000', '$500,000', '$460,000', '$458,216', '$331,250', '$300,000', '$244,000', '$125,000', '$84,000', '$84,000', '-', '$85,444', '$85,444', '$85,444', '$85,444', '$114,500', '$98,000', '$84,000', '$67,100', '$65,500', '$65,500', '$65,500', '$2,332,000', '$2,200,000', '$528,300', '$65,500', '$65,500', '-', '-']
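
The salary values scraped above are still strings such as `'$3,750,000'`, and missing salaries appear as `'-'`. Before any numeric analysis they need cleansing. A minimal sketch, assuming `None` is an acceptable stand-in for a missing salary (`parse_salary` is a hypothetical helper, not part of the original notebook):

```python
def parse_salary(text):
    """Convert a string such as '$3,750,000' to an int; return None for '-'."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    if cleaned in ("", "-"):
        # No salary listed for this player
        return None
    return int(cleaned)

# Three values taken from the scraped output above
raw = ['$3,750,000', '$657,143', '-']
parsed = [parse_salary(s) for s in raw]
print(parsed)  # [3750000, 657143, None]
```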


In [None]:
df = pd.DataFrame(data={
    "player_name": player_name,
    "player_position": player_position,
    "player_base_salary": player_base_salary,
    "player_guaranteed_salary": player_guaranteed_salary,
    "mls_team": "Atlanta United FC"
})

In [None]:
df.head(1)

Unnamed: 0,player_name,player_position,player_base_salary,player_guaranteed_salary,mls_team
0,Josef Martinez,F,"$3,750,000","$4,141,667",Atlanta United FC


In [None]:
len(df)

36

In [None]:
def get_dat_from_link(item):
  # print(item[0])
  team_link = requests.get(item[1])
  soup = bs(team_link.content, 'html.parser')

  player_name = extract_info(soup.select('.player a'))
  player_position = extract_info(soup.select('.player + .center .cap'))
  player_base_salary = extract_info(soup.select('.center + .right .info'))
  # if item[0] in ['Colorado Rapids', 'New England Revolution', 'D.C. United', 'Los Angeles FC']:
  player_guaranteed_salary = extract_info(soup.select('.right + .right .cap'))
  if len(player_guaranteed_salary) != len(player_name):
    player_guaranteed_salary = extract_info(soup.select('.right + .right .info'))
    
  print(item[0], len(player_name), len(player_position), len(player_base_salary), len(player_guaranteed_salary))
  ret_df = pd.DataFrame(data={
      "player_name": player_name,
      "player_position": player_position,
      "player_base_salary": player_base_salary,
      "player_guaranteed_salary": player_guaranteed_salary,
      "mls_team": item[0]
  })
  return ret_df

for team in mls_team_names[1:]:
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, get_dat_from_link(team)])

Austin FC 29 29 29 29
CF Montreal 34 34 34 34
Charlotte FC 32 32 32 32
Chicago Fire 29 29 29 29
Colorado Rapids 33 33 33 33
Columbus Crew 30 30 30 30
D.C. United 31 31 31 31
FC Cincinnati 29 29 29 29
FC Dallas 31 31 31 31
Houston Dynamo 32 32 32 32
Inter Miami FC 34 34 34 34
LA Galaxy 30 30 30 30
Los Angeles FC 31 31 31 31
Minnesota United FC 31 31 31 31
Nashville SC 28 28 28 28
New England Revolution 30 30 30 30
New York City FC 32 32 32 32
New York Red Bulls 31 31 31 31
Orlando City 30 30 30 30
Philadelphia Union 29 29 29 29
Portland Timbers 28 28 28 28
Real Salt Lake 37 37 37 37
San Jose Earthquakes 29 29 29 29
Seattle Sounders FC 28 28 28 28
Sporting Kansas City 29 29 29 29
Toronto FC 27 27 27 27
Vancouver Whitecaps FC 34 34 34 34


In [None]:
len(df)

864

In [None]:
len(df['mls_team'].unique())

28

In [None]:
df.head()

Unnamed: 0,player_name,player_position,player_base_salary,player_guaranteed_salary,mls_team
0,Josef Martinez,F,"$3,750,000","$4,141,667",Atlanta United FC
1,Luiz Araujo,F,"$3,600,000","$3,941,667",Atlanta United FC
2,Miles Robinson,D,"$700,000","$737,500",Atlanta United FC
3,Alan Franco,D,"$540,000","$667,500",Atlanta United FC
4,Matheus Rossetto,M,"$550,000","$662,500",Atlanta United FC
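
With the data collected, a first pass at the original question (does pay depend on position?) could look like the sketch below. It is an illustration under assumptions: it uses only the five rows shown by `df.head()`, retyped by hand, so the numbers say nothing about the league as a whole. The column name `base_salary_usd` is invented here.

```python
import pandas as pd

# The five rows shown by df.head(), retyped by hand for a self-contained example.
sample = pd.DataFrame({
    "player_name": ["Josef Martinez", "Luiz Araujo", "Miles Robinson",
                    "Alan Franco", "Matheus Rossetto"],
    "player_position": ["F", "F", "D", "D", "M"],
    "player_base_salary": ["$3,750,000", "$3,600,000", "$700,000",
                           "$540,000", "$550,000"],
})

# Strip "$" and "," and convert to numbers; a "-" (no listed salary) becomes NaN.
sample["base_salary_usd"] = pd.to_numeric(
    sample["player_base_salary"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

# Mean base salary per position; NaN values are skipped by default.
mean_by_position = sample.groupby("player_position")["base_salary_usd"].mean()
print(mean_by_position)
```

On the full dataset, the median may be a more robust summary than the mean, since a few very large contracts can dominate each group.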


In [None]:
df.to_csv("dataset_salary.csv")

In [None]:
df[df['mls_team'] == 'Atlanta United FC']

Unnamed: 0,player_name,player_position,player_base_salary,player_guaranteed_salary,mls_team
0,Josef Martinez,F,"$3,750,000","$4,141,667",Atlanta United FC
1,Luiz Araujo,F,"$3,600,000","$3,941,667",Atlanta United FC
2,Miles Robinson,D,"$700,000","$737,500",Atlanta United FC
3,Alan Franco,D,"$540,000","$667,500",Atlanta United FC
4,Matheus Rossetto,M,"$550,000","$662,500",Atlanta United FC
5,Emerson Hyndman,M,"$657,143","$657,143",Atlanta United FC
6,Santiago Sosa,M,"$525,000","$643,100",Atlanta United FC
7,Franco Ibarra,M,"$450,000","$520,000",Atlanta United FC
8,Brooks Lennon,M,"$500,000","$500,000",Atlanta United FC
9,Marcelino Moreno,F,"$460,000","$460,000",Atlanta United FC
