------------------------
#### Demo: document loading (HTML) using BeautifulSoup

- Connect to an NYT article (date)
- Extract information from the article
----------------------


In [1]:
import requests

In [2]:
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

In [3]:
r

<Response [200]>
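A `200` only confirms that this particular request succeeded. A slightly more defensive fetch might look like the sketch below, which sets a timeout and raises on HTTP errors. The helper name `fetch_html`, the 10-second timeout, and the User-Agent string are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page body as text, raising on network or HTTP errors."""
    headers = {"User-Agent": "demo-notebook/0.1"}  # hypothetical value
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # turns 4xx/5xx responses into exceptions
    return response.text
```

With such a helper, the string passed to `BeautifulSoup` later on would be `fetch_html(url)` instead of `r.text`.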

In [4]:
r.text[:2000]

'<!DOCTYPE html>\n<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="https://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->\n<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->\n<!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->\n<!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 page-

#### BeautifulSoup package

In [5]:
from bs4 import BeautifulSoup  
soup = BeautifulSoup(r.text, 'html.parser')  

In [6]:
# Collecting all of the records
results = soup.find_all('span', attrs={'class':'short-desc'})  

In [7]:
len(results)  

180

In [8]:
results[:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [9]:
# Extracting the date
first_result = results[0]  
first_result  

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [10]:
first_result.find('strong')  

<strong>Jan. 21 </strong>

In [11]:
first_result.find('strong').text 

'Jan. 21\xa0'

In [12]:
first_result.find('strong').text[0:-1]  # drop the trailing non-breaking space (\xa0)

'Jan. 21'

In [13]:
first_result.find('strong').text[0:-1] + ', 2017'

'Jan. 21, 2017'
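Slicing off the last character works only while every label ends in exactly one `\xa0`. As a hedged alternative, the sketch below uses `str.strip()`, which also removes non-breaking spaces, and parses the result into a real `date`. The function name `parse_entry_date`, the handling of spelled-out months, and the `Sept.` replacement are assumptions; only the `Jan.` form is visible in the output above, and `%b`/`%B` depend on an English locale.

```python
from datetime import date, datetime


def parse_entry_date(raw, year=2017):
    """Turn a label such as 'Jan. 21\xa0' into a datetime.date."""
    text = raw.strip()  # str.strip() also removes the non-breaking space
    text = text.replace("Sept.", "Sep.")  # assumption: four-letter abbreviation may occur
    # Try the abbreviated form first ('Jan. 21'), then a spelled-out month ('March 1').
    for fmt in ("%b. %d %Y", "%B %d %Y"):
        try:
            return datetime.strptime(f"{text} {year}", fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date label: {raw!r}")


first_date = parse_entry_date("Jan. 21\xa0")
```

Storing a `date` rather than a string makes later sorting and filtering by time straightforward.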

In [14]:
# Extracting the lie
first_result  

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [15]:
first_result.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [16]:
first_result.contents[1][1:-2]  # drop the opening quote, and the closing quote plus trailing space

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

In [17]:
first_result.contents[2]  

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

In [18]:
first_result.find('a')  

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [19]:
first_result.find('a').text

'(He was for an invasion before he was against it.)'

In [20]:
first_result.find('a').text[1:-1]  # drop the surrounding parentheses

'He was for an invasion before he was against it.'
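The fixed slices `[1:-2]` and `[1:-1]` assume every entry has exactly the same wrapping characters. A sketch of a less position-dependent clean-up is shown below; the helper name `strip_wrappers` is hypothetical. Note that `str.strip(chars)` removes every leading and trailing character in the set, so text that legitimately ends in a parenthesis or quote would lose it.

```python
def strip_wrappers(text, wrappers):
    """Trim whitespace, then any leading/trailing characters listed in `wrappers`."""
    return text.strip().strip(wrappers)


# Inputs copied from the cell outputs above.
lie = strip_wrappers("“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ", "“”")
explanation = strip_wrappers("(He was for an invasion before he was against it.)", "()")
```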

In [21]:
first_result.find('a')['href']  

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

In [22]:
records = []  

for result in results:  
    date        = result.find('strong').text[0:-1] + ', 2017'
    lie         = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url         = result.find('a')['href']
    
    records.append((date, lie, explanation, url))

In [23]:
len(records) 

180
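The loop above succeeds because all 180 spans share the same structure; a single entry without a `<strong>` or `<a>` tag would raise an exception. The sketch below shows one way to skip malformed entries instead. It runs on a small embedded snippet rather than the live page: the first span is copied from the output above, the second is a made-up entry without a link, and `parse_record` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

html = """
<span class="short-desc"><strong>Jan. 21&nbsp;</strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>
<span class="short-desc"><strong>Feb. 1&nbsp;</strong>“A made-up entry without a link.” <span class="short-truth">(Hypothetical.)</span></span>
"""


def parse_record(span, year=2017):
    """Return (date, lie, explanation, url), or None if the span is malformed."""
    strong = span.find("strong")
    link = span.find("a")
    if strong is None or link is None or len(span.contents) < 2:
        return None
    date = f"{strong.get_text().strip()}, {year}"
    lie = str(span.contents[1]).strip().strip("“”")
    explanation = link.get_text().strip().strip("()")
    return (date, lie, explanation, link.get("href"))


soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span", attrs={"class": "short-desc"})
parsed = [parse_record(span) for span in spans]
records = [rec for rec in parsed if rec is not None]
skipped = len(parsed) - len(records)  # entries that lacked a date or a link
```

Counting the skipped entries makes silent data loss visible if the page layout changes.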

In [24]:
records

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html'),
 ('Jan. 25, 2017',
  'Now, the audience was the biggest ever. But this crowd was massive. Look how far back it goes. This crowd was massive.',
  "Official aerial photos show Obama's 2009 inauguration was mu

In [27]:
# Applying a tabular data structure
import pandas as pd  
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url']) 

In [26]:
df.head()

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [28]:
df.columns

Index(['date', 'lie', 'explanation', 'url'], dtype='object')
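Once the records are in a DataFrame, saving them avoids repeating the scrape. The sketch below round-trips a one-row frame through CSV using an in-memory buffer, which stands in for a real file path (any file name would be an assumption). Passing `index=False` matters: without it the row index is written as an extra column, which pandas labels `Unnamed: 0` when the file is read back.

```python
import io

import pandas as pd

# One row copied from the records above, to keep the example self-contained.
df = pd.DataFrame(
    [(
        "Jan. 21, 2017",
        "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
        "He was for an invasion before he was against it.",
        "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the",
    )],
    columns=["date", "lie", "explanation", "url"],
)

buffer = io.StringIO()          # stands in for a file on disk
df.to_csv(buffer, index=False)  # keep the row index out of the file
buffer.seek(0)
restored = pd.read_csv(buffer)
```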

In [31]:
from llama_index.core import Document

In [33]:
# Convert DataFrame rows to LlamaIndex documents
documents = [
    Document(text=f"Date: {row['date']}\nLie: {row['lie']}\nExplanation: {row['explanation']}\nURL: {row['url']}")
    for _, row in df.iterrows()
]

In [34]:
documents

[Document(id_='79a681c9-81d2-424d-abde-7a386bad2cbd', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Date: Jan. 21, 2017\nLie: I wasn't a fan of Iraq. I didn't want to go into Iraq.\nExplanation: He was for an invasion before he was against it.\nURL: https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the", mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='55b7032c-b7ec-4d99-a49f-cdbe499c664e', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Date: Jan. 21, 2017\nLie: A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.\nExplanation: Trump was on the cover 11 times and Nixon appear

In [37]:
from llama_index.core import VectorStoreIndex

In [38]:
# Create the index
index = VectorStoreIndex.from_documents(documents)

In [40]:
# Create a query engine
query_engine = index.as_query_engine()

In [42]:
# Query the index
response = query_engine.query("What lies are documented related to Russia?")
print(response)

Two lies related to Russia are documented.


In [44]:
response.source_nodes

[NodeWithScore(node=TextNode(id_='63ff1e97-3d4f-41b0-8895-4978a1c44c25', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='7aa4a693-d281-4fc6-8172-62f5c8335230', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='5656bcb477194a313294882e6a64919360f81a4dfe1b0829da08e9169f7d5ed0')}, text="Date: Aug. 3, 2017\nLie: The Russia story is a total fabrication.\nExplanation: It's not.\nURL: https://www.washingtonpost.com/news/fact-checker/wp/2017/08/03/fact-checking-the-trump-russia-investigation/?utm_term=.1404a36076a6", mimetype='text/plain', start_char_idx=0, end_char_idx=224, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.8442507833505157),
 NodeWithScore(node=TextNode(id_='eb65fb24-a2cc-4d03-aa2a-e50a6a2967a3', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], re
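Each `NodeWithScore` pairs a retrieved chunk with a similarity score (about 0.844 for the first hit above). As rough intuition only, the sketch below ranks texts by cosine similarity between vectors. The bag-of-words `embed` function and the lowercased sample texts are stand-ins chosen for illustration; LlamaIndex relies on a learned embedding model and its own retrieval code, so the numbers here are not comparable to the scores above.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, texts, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(t)), t) for t in texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


texts = [
    "the russia story is a total fabrication",
    "the audience was the biggest ever",
    "i have been on their cover 14 or 15 times",
]
hits = top_k("russia story", texts, k=1)
```

The generated answer ("Two lies related to Russia are documented.") is built only from the retrieved nodes, so inspecting `response.source_nodes` is the way to see which records the model actually had access to.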

In [1]:
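# HTMLReader is not exported by llama_index.core, so importing it from there
# raises "ImportError: cannot import name 'HTMLReader' from 'llama_index.core'".
# HTML loaders ship in separately installed reader packages
# (for example llama-index-readers-web); only the core classes are imported here.
from llama_index.core import Document, VectorStoreIndex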