# What is web scraping?
On July 21, 2017, the New York Times updated an opinion article called Trump's Lies, detailing every public lie the President has told since taking office. Because this is a newspaper, the information was (of course) published as a block of text. This is a great format for human consumption, but it can't easily be understood by a computer. In this tutorial, we'll extract the President's lies from the New York Times article and store them in a structured dataset.

This is a common scenario: you find a web page that contains data you want to analyze, but it's not presented in a format that you can easily download and read into your favorite data analysis tool. You might imagine manually copying and pasting the data into a spreadsheet, but in most cases that is far too time-consuming. A technique called web scraping is a useful way to automate this process.

What is web scraping? It's the process of extracting information from a web page by taking advantage of patterns in the web page's underlying code. Let's start looking for these patterns!



# Examining the article
Article: https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html

When converting this into a dataset, you can think of each lie as a "record" with four fields:

1. The date of the lie.
2. The lie itself (as a quotation).
3. The writer's brief explanation of why it was a lie.
4. The URL of an article that substantiates the claim that it was a lie.

Importantly, those fields have different formatting, which is consistent throughout the article: the date is bold red text, the lie is "regular" text, the explanation is gray italic text, and the URL is linked from the gray italic text.

Why does the formatting matter? Because it's very likely that the code underlying the web page "tags" those fields differently, and we can take advantage of that pattern when scraping the page. Let's take a look at the source code, known as HTML:

# Examining the HTML
To view the HTML code that generates a web page, right-click on the page and select "View Page Source" in Chrome or Firefox, "View Source" in Internet Explorer, or "Show Page Source" in Safari. (If that option doesn't appear in Safari, open Safari Preferences, select the Advanced tab, and check "Show Develop menu in menu bar".)

Thankfully, you only have to understand three basic facts about HTML in order to get started with web scraping!

# Fact 1: HTML consists of tags
You can see that the HTML contains the article text, along with "tags" (specified using angle brackets) that "mark up" the text. ("HTML" stands for Hyper Text Markup Language.)

For example, one tag is '<strong></strong>', which means "use bold formatting". There is a <strong> tag before "Jan. 21" and a </strong> tag after it. The first is an "opening tag" and the second is a "closing tag" (denoted by the /), which indicates to the web browser where to start and stop applying the formatting. In other words, this tag tells the web browser to make the text "Jan. 21" bold. (Don't worry about the &nbsp; - we'll deal with that later.)

# Fact 2: Tags can have attributes
HTML tags can have "attributes", which are specified in the opening tag. For example, <span class="short-desc"> indicates that this particular <span> tag has a class attribute with a value of short-desc.

For the purpose of web scraping, you don't actually need to understand the meaning of <span>, class, or short-desc. Instead, you just need to recognize that tags can have attributes, and that they are specified in this particular way.
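
To see that attributes really are read from the opening tag, here is a minimal sketch using the html.parser module from Python's standard library (not the Beautiful Soup approach used later in this tutorial). The AttributeCollector class name and the sample HTML string are made up for illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []  # list of (tag name, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.seen.append((tag, dict(attrs)))


# Made-up HTML in the same shape as the article's markup
sample = '<span class="short-desc"><strong>Jan. 21</strong> some text</span>'

collector = AttributeCollector()
collector.feed(sample)
collector.close()

for tag, attrs in collector.seen:
    print(tag, attrs)
```

The span tag is reported with a class attribute whose value is short-desc, while the strong tag is reported with no attributes at all.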

# Fact 3: Tags can be nested
Let's pretend my HTML code said:

Hello <strong><em>Data Science</em> students</strong>

The text "Data Science students" would be bold, because all of that text is between the opening <strong> tag and the closing </strong> tag. The text "Data Science" would also be in italics, because the em tag means "use italics". The text "Hello" would not be bold or italics, because it's not within either the strong or em tags. Thus, it would appear as follows:

Hello ***Data Science*** **students**

The central point to take away from this example is that tags "mark up" text from wherever they open to wherever they close, regardless of whether they are nested within other tags.
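
The nesting rule can be checked mechanically. The sketch below again uses the standard library's html.parser rather than Beautiful Soup, and keeps a stack of open tags so that each piece of text is paired with the tags in effect around it. The NestingTracker class is a hypothetical helper, and it assumes well-formed HTML in which tags close in the reverse order they opened.

```python
from html.parser import HTMLParser


class NestingTracker(HTMLParser):
    """Pairs each chunk of text with the tags that are open around it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.chunks = []     # (text, tags in effect) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed HTML where tags close in reverse order
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        self.chunks.append((data, tuple(self.open_tags)))


tracker = NestingTracker()
tracker.feed('Hello <strong><em>Data Science</em> students</strong>')
tracker.close()

for text, tags in tracker.chunks:
    print(repr(text), tags)
```

"Hello " is paired with no tags, "Data Science" with both strong and em, and " students" with strong only, which matches the rendering described above.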

Got it? You now know enough about HTML in order to start web scraping!



# Reading the web page into Python
The first thing we need to do is to read the HTML for this article into Python, which we'll do using the requests library. (If you don't have it, you can pip install requests from the command line.)

In [5]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

In [6]:
# status_code tells us whether the request succeeded: 200 means success, 404 means the page was not found
r.status_code

200
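
Checking status_code by eye works in a notebook, but in a script you may prefer the request to fail loudly. Here is a minimal sketch, assuming a hypothetical helper named fetch_html and an arbitrary 10-second timeout; raise_for_status() and the timeout argument are both part of the requests library.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or raise if the request failed."""
    # Without a timeout, requests.get() can wait indefinitely for a slow server
    r = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    r.raise_for_status()
    return r.text


# Usage (requires network access):
# html = fetch_html('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
# print(html[:500])
```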

In [7]:
# content holds the raw body of the response, which here is the page's entire HTML (as bytes)
r.content



In [9]:
# Slice off the first 500 bytes to preview the HTML
r.content[0:500] 

b'<!DOCTYPE html>\n<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->\n<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page'

# Parsing the HTML using Beautiful Soup
We're going to parse the HTML using the Beautiful Soup 4 library, which is a popular Python library for web scraping. (If you don't have it, you can pip install beautifulsoup4 from the command line.)

In [10]:
# Beautiful Soup is a Python library for parsing HTML
from bs4 import BeautifulSoup

In [14]:
# Parse the raw HTML in r.content into a navigable Beautiful Soup object
soup = BeautifulSoup(r.content, 'html.parser')

The code above parses the HTML (stored in r.content) into a special object called soup that the Beautiful Soup library understands. In other words, Beautiful Soup is reading the HTML and making sense of its structure.

(Note that html.parser is the parser included with the Python standard library, though other parsers can be used by Beautiful Soup. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers to learn more.)

# Collecting all of the records

In [17]:
# find_all() returns every matching tag in a list-like ResultSet
results = soup.find_all(name='span', attrs={'class': 'short-desc'})

In [18]:
len(results)

180

In [19]:
type(results)

bs4.element.ResultSet
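
To experiment with find_all() without downloading the article, you can parse a small HTML string instead. The sketch below uses made-up HTML that mimics the article's structure (the dates, claims, and example.com links are placeholders), and also shows the class_ keyword, which Beautiful Soup accepts as an alternative to the attrs dictionary.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure of the article
html = '''
<span class="short-desc"><strong>Jan. 1 </strong>"First claim." <span class="short-truth"><a href="https://example.com/1">(First rebuttal.)</a></span></span>
<span class="short-desc"><strong>Jan. 2 </strong>"Second claim." <span class="short-truth"><a href="https://example.com/2">(Second rebuttal.)</a></span></span>
'''
soup = BeautifulSoup(html, 'html.parser')

# Two equivalent ways to match on the class attribute
by_attrs = soup.find_all('span', attrs={'class': 'short-desc'})
by_keyword = soup.find_all('span', class_='short-desc')

# The nested short-truth spans are not counted, because their class differs
print(len(by_attrs), len(by_keyword))
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.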

In [20]:
results[0:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [21]:
results[-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

# Extracting the date

In [26]:
fr= results[0]
fr

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

Although fr (our first result) may look like a Python string, you'll notice that there are no quote marks around it. Instead, it's another special Beautiful Soup object (called a "Tag") that has specific methods and attributes.

In order to locate the date, we can use its find() method to find a single tag that matches a specific pattern, in contrast to the find_all() method we used above to find all tags that match a pattern:

In [25]:
fr.find('strong')

<strong>Jan. 21 </strong>

This code searches fr for the first instance of a "strong" tag, and again returns a Beautiful Soup "Tag" object (not a string).

Since we want to extract the text between the opening and closing tags, we can access its text attribute, which does in fact return a regular Python string:

In [27]:
fr.find('strong').text

'Jan. 21\xa0'

What is \xa0? You don't actually need to know this, but it's called an "escape sequence" that represents the &nbsp; character we saw earlier in the HTML source.

However, you do need to know that an escape sequence represents a single character within a string. Let's slice it off from the end of the string:

In [33]:
fr.find('strong').text[0:-1]

'Jan. 21'
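
The claim that an escape sequence is a single character is easy to verify with plain Python strings. This short sketch, which is independent of Beautiful Soup, also compares slicing with two alternatives, rstrip() and replace(); which one suits you depends on how confident you are that every date ends in exactly one such character.

```python
import unicodedata

date_text = 'Jan. 21\xa0'  # the string returned by .text above

# \xa0 is typed as four characters but stored as one
print(len(date_text))
print(unicodedata.name(date_text[-1]))

# Three ways to drop it
sliced = date_text[0:-1]                  # assumes exactly one trailing character
stripped = date_text.rstrip()             # removes any trailing whitespace, including \xa0
replaced = date_text.replace('\xa0', '')  # removes it wherever it occurs

print(sliced == stripped == replaced)
```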

In [32]:
# Append the year so that the string can later be parsed as a full date
fr.find('strong').text[0:-1] + ', 2017'

'Jan. 21, 2017'

# Extracting the lie

In [34]:
fr

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

Our goal is to extract the two sentences about Iraq. Unfortunately, there isn't a pair of opening and closing tags that starts immediately before the lie and ends immediately after the lie. Therefore, we're going to have to use a different technique:

In [35]:
fr.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [36]:
fr.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [37]:
fr.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

# Extracting the explanation
Based upon what you've seen already, you might have figured out that we have at least two options for how we extract the third component of the record, which is the writer's explanation of why the President's statement was a lie.

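The first option is to index into the contents attribute, like we did when extracting the lie (fr.contents[2] below). The second option is to search for the surrounding "a" tag with find(), like we did when extracting the date. Either way, we then slice off the opening and closing parentheses: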

In [38]:
fr

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [39]:
fr.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [40]:
fr.contents[2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

In [42]:
fr.find('a').text

'(He was for an invasion before he was against it.)'

In [45]:
fr.find('a').contents[0][1:-1]

'He was for an invasion before he was against it.'

In [46]:
fr.find('a').text[1:-1]

'He was for an invasion before he was against it.'
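
Slicing by position works here because every record follows the same pattern. As an alternative sketch, str.strip() can be given the characters to remove, which avoids counting positions; the variable names below are illustrative only. One caveat: strip('()') removes every parenthesis at either end, so an explanation that itself ends in a parenthesis would lose it.

```python
explanation_raw = '(He was for an invasion before he was against it.)'
lie_raw = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

# Slicing removes a fixed number of characters from each end
explanation_sliced = explanation_raw[1:-1]
lie_sliced = lie_raw[1:-2]

# strip(chars) removes any of the listed characters from both ends
explanation = explanation_raw.strip('()')
lie = lie_raw.strip().strip('“”')  # whitespace first, then the curly quotes

print(explanation)
print(lie)
print(explanation == explanation_sliced, lie == lie_sliced)
```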

# Extracting the URL

In [49]:
fr.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

# Beautiful Soup methods and attributes
Before we finish building the dataset, I want to summarize a few ways you can interact with Beautiful Soup objects.

You can apply these two methods to either the initial soup object or a Tag object (such as fr):

1. find(): searches for the first matching tag, and returns a Tag object
2. find_all(): searches for all matching tags, and returns a ResultSet object (which you can treat like a list of Tags)

You can extract information from a Tag object (such as fr) using these two attributes:

1. text: extracts the text of a Tag, and returns a string
2. contents: extracts the children of a Tag, and returns a list of Tags and strings

It's important to keep track of whether you are interacting with a Tag, ResultSet, list, or string, because that affects which methods and attributes you can access.
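
One way to keep track is simply to ask Python what each object is. The sketch below runs against a made-up HTML snippet (not the real article) and checks the type returned by each method and attribute listed above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, ResultSet

# Made-up HTML with the same shape as one record
html = '<span class="short-desc"><strong>Jan. 1 </strong>"A claim." <a href="https://example.com">(A rebuttal.)</a></span>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('span')      # a Tag
every = soup.find_all('span')  # a ResultSet, usable like a list of Tags

print(isinstance(first, Tag))
print(isinstance(every, ResultSet), isinstance(every, list))
print(type(first.text).__name__)      # text is a plain string
print(type(first.contents).__name__)  # contents is a plain list

# contents mixes Tags and strings, so check before calling Tag methods
for child in first.contents:
    kind = 'Tag' if isinstance(child, Tag) else 'string'
    print(kind, repr(str(child)))
```

The loop at the end shows why result.contents[1] can be sliced like a string while result.contents[0] cannot: the first child is a Tag and the second is a string.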

In [50]:
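records = []
for result in results:
    # Date: text of the strong tag, minus the trailing \xa0, plus the year
    date = result.find('strong').text[0:-1] + ', 2017'
    # Lie: the bare string between the tags, minus the curly quotes and trailing space
    lie = result.contents[1][1:-2]
    # Explanation: text of the a tag, minus the surrounding parentheses
    explanation = result.find('a').text[1:-1]
    # URL: the href attribute of that same a tag
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))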

In [51]:
len(records)

180

In [53]:
records[0:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]

# Making a data set

In [54]:
import pandas as pd
data = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])

In [57]:
data.dtypes

date           object
lie            object
explanation    object
url            object
dtype: object

In [59]:
# Convert the date column from strings to datetime values
data['date'] = pd.to_datetime(data['date'])

In [60]:
data.dtypes

date           datetime64[ns]
lie                    object
explanation            object
url                    object
dtype: object
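
Here pd.to_datetime() worked out the date format on its own. To make the pattern explicit, this sketch parses one date with datetime.strptime() from the standard library instead. It assumes an English locale and that the date looks exactly like 'Jan. 21, 2017'; a date whose month is written without a trailing period would need a different format string.

```python
from datetime import datetime

# Format codes: %b = abbreviated month name, %d = day of month, %Y = four-digit year
parsed = datetime.strptime('Jan. 21, 2017', '%b. %d, %Y')

print(parsed.date().isoformat())
```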

In [61]:
data.head()

|   | date | lie | explanation | url |
|---|------|-----|-------------|-----|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


# Exporting to CSV


In [62]:
data.to_csv('trump-lies.csv', index=False, encoding='utf-8')

In [67]:
pd.read_csv('trump-lies.csv').head()

|   | date | lie | explanation | url |
|---|------|-----|-------------|-----|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
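
If you would rather not depend on pandas for the export step, the standard library's csv module can write the same list of tuples. This is a swapped-in alternative, not what the cells above do. The single record below is made up, and the sketch writes to an in-memory buffer so it leaves no file behind.

```python
import csv
import io

# One made-up record in the same (date, lie, explanation, url) shape
records = [
    ('Jan. 21, 2017', 'A "quoted" claim, with a comma.', 'A rebuttal.', 'https://example.com/1'),
]

# In-memory buffer; to write a real file, use
# open('some-file.csv', 'w', newline='', encoding='utf-8') instead
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['date', 'lie', 'explanation', 'url'])  # header row
writer.writerows(records)

print(buffer.getvalue())

# Reading it back recovers the original fields, commas and quotes included
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[1][1])
```

The writer quotes any field that contains a comma or a quotation mark, which matters here because many of the lies contain both.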


In [68]:
# The whole process in just 16 lines of code

import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class':'short-desc'})

records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump-lies.csv', index=False, encoding='utf-8')


# Using Google APIs

In [69]:
url = 'https://maps.googleapis.com/maps/api/geocode/json?address=Moonshine+cafe+cp'

r = requests.get(url)



In [71]:
r.text

'{\n   "error_message" : "You must use an API key to authenticate each request to Google Maps Platform APIs. For additional information, please refer to http://g.co/dev/maps-no-account",\n   "results" : [],\n   "status" : "REQUEST_DENIED"\n}\n'
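
The response is JSON, so it can be parsed and inspected rather than read as raw text. The sketch below parses the exact text shown above with the standard library's json module, then builds a request URL that includes a key parameter, which is how the Geocoding API expects the key to be passed. YOUR_API_KEY is a placeholder, not a working key, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

# The response text from the cell above
text = '{\n   "error_message" : "You must use an API key to authenticate each request to Google Maps Platform APIs. For additional information, please refer to http://g.co/dev/maps-no-account",\n   "results" : [],\n   "status" : "REQUEST_DENIED"\n}\n'

data = json.loads(text)
print(data['status'])

# The status field is 'OK' when the request succeeds
if data['status'] != 'OK':
    print('Request failed:', data.get('error_message', 'no message given'))

# Build the query string with a key; YOUR_API_KEY is a placeholder
params = {'address': 'Moonshine cafe cp', 'key': 'YOUR_API_KEY'}
url = 'https://maps.googleapis.com/maps/api/geocode/json?' + urlencode(params)
print(url)
```

With a live response object, r.json() performs the same parsing as json.loads(r.text).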

# Downloading an image from a web page

In [119]:
url = 'https://www.flipkart.com/'

r = requests.get(url)

# The html5lib parser is a separate package (pip install html5lib)
soup = BeautifulSoup(r.content, 'html5lib')

In [120]:
# find_all() is the current name for the older findAll() alias
img_list = soup.find_all('img')
link = soup.find_all('a')

In [121]:
len(img_list)

31

In [122]:
img_list[5]['src']

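# Download one image; prefixing 'https:' assumes its src is protocol-relative (starts with '//')
img = requests.get('https:' + img_list[1]['src'])

# Write the raw bytes to disk in binary mode
with open('flipkart.jpg', 'wb') as ot:
    ot.write(img.content)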
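
Prefixing 'https:' only works when the src value starts with '//'. A more general sketch is to resolve each src against the page URL with urljoin() from the standard library, which handles protocol-relative, root-relative, and absolute values alike. The src values below are hypothetical examples, not taken from the real page.

```python
from urllib.parse import urljoin

page_url = 'https://www.flipkart.com/'

# Hypothetical src values covering three common cases
examples = [
    '//img.example.com/banner.jpg',       # protocol-relative
    '/static/logo.png',                   # relative to the site root
    'https://cdn.example.com/photo.jpg',  # already absolute
]

resolved = [urljoin(page_url, src) for src in examples]

for full_url in resolved:
    print(full_url)
```

Each resolved URL could then be passed to requests.get() exactly as in the cell above.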