# Scraping Cricinfo
<b>Authors</b>: Manas Bedmutha, Kishen Gowda and N.V. Karthikeya

## Introduction
This notebook walks through extracting data from a table on a website. We will extract the list of IPL centuries from ESPN Cricinfo's records page ("http://stats.espncricinfo.com/ci/engine/records/batting/list_hundreds.html?id=117;type=trophy").

### Importing Libraries
We need the following libraries to scrape this page:
1. requests: Sends HTTP requests (GET, POST) to a web server. Here, it fetches the page from Cricinfo's servers
2. bs4 (BeautifulSoup): Extracts content based on HTML tags and their attributes
3. csv: Writes the extracted data to a CSV (comma-separated values) file

In [11]:
# coding: utf-8
import requests
#import re
from bs4 import BeautifulSoup as bs
import csv

# def striphtml(data):
#     p = re.compile(r'<.*?>')
#     return(p.sub('', data))

### Getting html content of the webpage
The `requests.get()` method returns the server's response object for the given URL. We pass the content of that response to bs4, which parses it into what is generally called a `soup`.
The `soup.prettify()` method returns the parsed HTML as an indented string so that it is easier to read.

In [12]:
r = requests.get('http://stats.espncricinfo.com/ci/engine/records/batting/list_hundreds.html?id=117;type=trophy')
soup = bs(r.content,'html.parser')
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- hostname: web004, edition-view: espncricinfo-en-in, country: unknown, cluster: www, created: 2018-06-07 14:11:48 -->
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://developers.facebook.com/schema/" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <script type="text/javascript">
   var _sf_startpt=(new Date()).getTime()
  </script>
  <meta content="ZxdgH3XglRg0Bsy-Ho2RnO3EE4nRs53FloLS6fkt_nc" name="google-site-verification"/>
  <title>
   Cricket Records | Records | Indian Premier League |  | List of hundreds | ESPNcricinfo
  </title>
  <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Cricket records,  Records,  / , Indian Premier League,  / , List of hundreds" name="keywords"/>
  <meta content="Cricinfo Cricket Records - Records,  / , Indian Premier League,  / , List of hundreds" name="description"/>
  <

Now that we have the HTML content, observe that all the data relevant to us sits inside `<td>` tags. Small observations like this go a long way in fetching the data quickly. We also see that every cell of interest carries the attribute `nowrap="nowrap"`.
We use exactly this to get the data.

The `find_all()` method of the soup object is used here to get all such occurrences. We store the result in a list variable `l`. <br>We print the types of `l` and `l[0]` to check what kind of elements `l` holds. To see what is being extracted, we also print `l[0]`.<br>Each element is still a full tag, so we then replace every element with just the text inside it.

In [14]:
l = soup.find_all('td',{"nowrap":"nowrap"})
print(type(l))
print(type(l[0]))
print(l[0])
for i in range(len(l)):
    l[i]=(l[i]).text

<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>
<td class="left" nowrap="nowrap" title="record rank: 1"><a class="data-link" href="/ci/content/player/37737.html">BB McCullum</a></td>
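
To see the attribute filter in isolation, here is a minimal sketch that runs `find_all()` on a small inline HTML string instead of the live page. The names `SAMPLE_HTML`, `sample_soup`, `cells` and `texts` are made up for illustration, and the markup is a hand-written imitation of the cell printed above, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that imitates the markup of the records table.
# Only the first two cells carry nowrap="nowrap".
SAMPLE_HTML = """
<table>
  <tr>
    <td class="left" nowrap="nowrap"><a class="data-link" href="/p/1.html">BB McCullum</a></td>
    <td nowrap="nowrap">158*</td>
    <td>not selected</td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same filter as in the notebook: <td> tags whose nowrap attribute equals "nowrap".
cells = sample_soup.find_all("td", {"nowrap": "nowrap"})

# get_text() returns the text inside a tag, including text of nested tags such as <a>.
texts = [cell.get_text() for cell in cells]
print(texts)
```

The third cell is skipped because it has no `nowrap` attribute, and the player name comes out without the surrounding `<a>` tag.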


In [15]:
l[:50]

[u'BB McCullum',
 u'158*',
 u'73',
 u'10',
 u'13',
 u'216.43',
 u'',
 u'KKR',
 u'v RCB',
 u'18 Apr 2008',
 u'T20',
 u'MEK Hussey',
 u'116*',
 u'54',
 u'8',
 u'9',
 u'214.81',
 u'',
 u'Super Kings',
 u'v Kings XI',
 u'19 Apr 2008',
 u'T20',
 u'A Symonds',
 u'117*',
 u'53',
 u'11',
 u'7',
 u'220.75',
 u'',
 u'Chargers',
 u'v Royals',
 u'24 Apr 2008',
 u'T20',
 u'AC Gilchrist',
 u'109*',
 u'47',
 u'9',
 u'10',
 u'231.91',
 u'',
 u'Chargers',
 u'v Mum Indians',
 u'27 Apr 2008',
 u'T20',
 u'ST Jayasuriya',
 u'114*',
 u'48',
 u'9',
 u'11',
 u'237.50']
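
The output above is one flat list in which every record occupies 11 consecutive entries (name, score, balls, sixes, fours, strike rate, an empty cell, team, opposition, date, match type). As a rough sketch of how such a list could be regrouped into rows, the hypothetical helper `chunk_rows` below slices it in steps of 11 and drops an incomplete trailing row; the sample values are copied from the output above, and the third record is cut short on purpose.

```python
FIELDS_PER_ROW = 11  # assumption: read off the flat output shown above


def chunk_rows(flat, size=FIELDS_PER_ROW):
    """Split a flat list into consecutive rows of `size` items.

    An incomplete trailing row is dropped rather than padded.
    """
    usable = len(flat) - len(flat) % size
    return [flat[i:i + size] for i in range(0, usable, size)]


sample = [
    'BB McCullum', '158*', '73', '10', '13', '216.43', '', 'KKR', 'v RCB', '18 Apr 2008', 'T20',
    'MEK Hussey', '116*', '54', '8', '9', '214.81', '', 'Super Kings', 'v Kings XI', '19 Apr 2008', 'T20',
    'A Symonds', '117*',  # deliberately incomplete third record
]

rows = chunk_rows(sample)
for row in rows:
    # row[8] looks like 'v RCB'; slicing off the first two characters leaves 'RCB'
    print(row[0], row[1], row[8][2:])
```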

In [4]:
head = ["Name","Score","Balls","Sixes","Fours","Strike Rate","Ground","Team","Opposition","Date"]
data = [head]
l2 = soup.find_all('a',{"class":"data-link"})
for i in range(len(l2)):
    l2[i]=(l2[i]).text

[u'BB McCullum', u'158*', u'73', u'10', u'13', u'216.43', u'', u'KKR', u'v RCB', u'18 Apr 2008', u'T20', u'MEK Hussey', u'116*', u'54', u'8', u'9', u'214.81', u'', u'Super Kings', u'v Kings XI', u'19 Apr 2008', u'T20', u'A Symonds', u'117*', u'53', u'11', u'7', u'220.75', u'', u'Chargers', u'v Royals', u'24 Apr 2008', u'T20', u'AC Gilchrist', u'109*', u'47', u'9', u'10', u'231.91', u'', u'Chargers', u'v Mum Indians', u'27 Apr 2008', u'T20', u'ST Jayasuriya', u'114*', u'48', u'9', u'11', u'237.50', u'', u'Mum Indians', u'v Super Kings', u'14 May 2008', u'T20', u'SE Marsh', u'115', u'69', u'11', u'7', u'166.66', u'', u'Kings XI', u'v Royals', u'28 May 2008', u'T20', u'AB de Villiers', u'105*', u'54', u'5', u'6', u'194.44', u'', u'Daredevils', u'v Super Kings', u'23 Apr 2009', u'T20', u'MK Pandey', u'114*', u'73', u'10', u'4', u'156.16', u'', u'RCB', u'v Chargers', u'21 May 2009', u'T20', u'YK Pathan', u'100', u'37', u'9', u'8', u'270.27', u'', u'Royals', u'v Mum Indians', u'13 Mar 2010',

In [5]:
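i = 0  # index into l: every record occupies 11 consecutive cells
j = 0  # index into l2: the loop advances 3 links per record and takes the second as the ground
# Stop before a trailing partial record so the indexing never runs past the end of l
while i + 10 < len(l):
    # l[i+8][2:] strips the leading "v " from the opposition name
    data.append([l[i],l[i+1],l[i+2],l[i+3],l[i+4],l[i+5],l2[j+1],l[i+7],l[i+8][2:],l[i+9]])
    i+=11
    j+=3
print(data)
# The with block closes the file on exit, so no explicit close() is needed
with open('ipl_centuries.csv', 'w') as myFile:
    writer = csv.writer(myFile,lineterminator='\n')
    writer.writerows(data)

print("Writing complete")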

[['Name', 'Score', 'Balls', 'Sixes', 'Fours', 'Strike Rate', 'Ground', 'Team', 'Opposition', 'Date'], [u'BB McCullum', u'158*', u'73', u'10', u'13', u'216.43', u'Bengaluru', u'KKR', u'RCB', u'18 Apr 2008'], [u'MEK Hussey', u'116*', u'54', u'8', u'9', u'214.81', u'Mohali', u'Super Kings', u'Kings XI', u'19 Apr 2008'], [u'A Symonds', u'117*', u'53', u'11', u'7', u'220.75', u'Hyderabad (Deccan)', u'Chargers', u'Royals', u'24 Apr 2008'], [u'AC Gilchrist', u'109*', u'47', u'9', u'10', u'231.91', u'Mumbai', u'Chargers', u'Mum Indians', u'27 Apr 2008'], [u'ST Jayasuriya', u'114*', u'48', u'9', u'11', u'237.50', u'Mumbai', u'Mum Indians', u'Super Kings', u'14 May 2008'], [u'SE Marsh', u'115', u'69', u'11', u'7', u'166.66', u'Mohali', u'Kings XI', u'Royals', u'28 May 2008'], [u'AB de Villiers', u'105*', u'54', u'5', u'6', u'194.44', u'Durban', u'Daredevils', u'Super Kings', u'23 Apr 2009'], [u'MK Pandey', u'114*', u'73', u'10', u'4', u'156.16', u'Centurion', u'RCB', u'Chargers', u'21 May 2009']
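
A quick way to check the CSV step is to write a few rows and read them back. The sketch below assumes Python 3 and uses an in-memory `io.StringIO` buffer in place of `ipl_centuries.csv`, so it can run without touching the disk; `header`, `records`, `buffer` and `read_back` are illustrative names, and the three-column rows are a shortened version of the data above.

```python
import csv
import io

header = ["Name", "Score", "Team"]
records = [
    ["BB McCullum", "158*", "KKR"],
    ["MEK Hussey", "116*", "Super Kings"],
]

# StringIO stands in for a file opened in text mode for writing.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)    # one row
writer.writerows(records)  # many rows in a single call

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back should return exactly the rows that were written.
read_back = list(csv.reader(io.StringIO(csv_text)))
```

If `read_back` matches the original rows, the writer settings (delimiter, line terminator) round-trip cleanly, which is a reasonable check before writing the full table.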

https://github.com/dwillis/python-espncricinfo/blob/master/espncricinfo/match.py<br>
https://gist.github.com/genekogan/ebd77196e4bf0705db51f86431099e57