# Scrapy Mini Project

Laura Gutierrez Funderburk


In this notebook, we work with data scraped from quotes.toscrape.com, a synthetic website that lists quotes from famous authors.

## Tasks:

Convert the JSON data collected into a structured RDBMS table that can be queried with SQL afterwards. Since scalability is not a concern here, a simple SQLite3 database is a good fit for the job. In real-life scraping, the amount of data generated is usually quite large, so a database such as MySQL or Postgres would be better suited in that case.

SQLite3 has many simple tutorials available at https://www.sqlitetutorial.net/. Submit what one SQLite3 table schema might look like to store and query the data scraped via this tutorial, and write a small Python script that reads the JSON files generated by the spiders you built and inserts each record into that table (or tables).
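As a starting point for the schema part of the task, here is a minimal sketch of one possible single-table layout. The table name `quotes`, the `id` column, and the helper `create_schema` are illustrative assumptions, not part of the assignment. It uses an in-memory database so that it can be run without touching any file.

```python
import sqlite3

# Hypothetical schema: one row per scraped quote.
# SQLite does not enforce CHAR(n) lengths, so plain TEXT columns are used.
SCHEMA = """
CREATE TABLE IF NOT EXISTS quotes (
    id     INTEGER PRIMARY KEY,
    text   TEXT NOT NULL,
    author TEXT NOT NULL,
    tags   TEXT NOT NULL
);
"""


def create_schema(conn):
    """Create the quotes table on the given connection if it is missing."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
create_schema(conn)

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(quotes)")]
print(columns)
conn.close()
```

An integer primary key is optional here, but it gives every quote a stable identifier if the data is later split into separate author or tag tables.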

## Advanced exercise

In [1]:
import json
from pathlib import Path
import pandas as pd
import sqlite3

In [2]:
# Identify path to json-file
path_to_json = "C:/Users/Laura GF/Documents/GitHub/mec-mini-projects/mec-5.5.4-webscraping-project/scrapy_submission/scrapy_submission/data_dump"
json_file_name = "xpath-scraper-results.json"

In [32]:
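# Load the scraped records into a list of dicts.
# The with-block closes the file automatically, so no explicit close() is needed.
with open(Path(path_to_json, json_file_name), "r", encoding="utf-8") as json_data:
    data = json.load(json_data)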

In [90]:
# Strip the curly quotation marks around each quote and swap apostrophes for backticks
for record in data:
    record['text'] = record['text'].replace('“', "").replace('”', "").replace("'", "`")
    record['author'] = record['author'].replace("'", "`")

In [91]:
#Connecting to sqlite
conn = sqlite3.connect('example.db')

#Creating a cursor object using the cursor() method
cursor = conn.cursor()

#Dropping the QUOTES table if it already exists.
cursor.execute("DROP TABLE IF EXISTS QUOTES")

#Creating table as per requirement
sql ='''CREATE TABLE QUOTES(
   TEXT CHAR(20) NOT NULL,
   AUTHOR CHAR(20) NOT NULL,
   TAGS CHAR(20) NOT NULL
)'''
cursor.execute(sql)


# Insert every record (the original range(len(data)-1) skipped the last one).
# The ? placeholders let sqlite3 handle quoting instead of building SQL by hand.
for record in data:
    cursor.execute(
        "INSERT INTO QUOTES (TEXT, AUTHOR, TAGS) VALUES (?, ?, ?)",
        (record['text'], record['author'], record['tags']),
    )


# Commit your changes in the database
conn.commit()

print("Records inserted........")

# Closing the connection
conn.close()


Records inserted........
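The insert step above can also be written as a small reusable function. The following is a sketch under stated assumptions: the helper name `insert_quotes` and the sample record are made up for illustration, and each record is assumed to be a dict whose `text`, `author`, and `tags` values are plain strings. Because the values are bound through `?` placeholders, apostrophes do not need to be replaced beforehand.

```python
import sqlite3


def insert_quotes(conn, records):
    """Insert dicts with 'text', 'author' and 'tags' keys; return the row count."""
    rows = [(r["text"], r["author"], r["tags"]) for r in records]
    # Using the connection as a context manager commits on success
    # and rolls back if an exception is raised.
    with conn:
        conn.executemany(
            "INSERT INTO QUOTES (TEXT, AUTHOR, TAGS) VALUES (?, ?, ?)", rows
        )
    return len(rows)


# Made-up placeholder record, not taken from the scraped data.
sample = [
    {"text": "It's a placeholder quote.", "author": "Madeleine L'Engle", "tags": "life"}
]

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE QUOTES (TEXT TEXT NOT NULL, AUTHOR TEXT NOT NULL, TAGS TEXT NOT NULL)"
)
inserted = insert_quotes(conn, sample)
stored = conn.execute("SELECT AUTHOR FROM QUOTES").fetchall()
print(inserted, stored)
conn.close()
```

`executemany` sends all rows in one call, which is usually tidier than a Python loop of single `execute` calls for a batch of this size.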


In [92]:
# Open a new connection to the SQLite database
conn2 = sqlite3.connect("example.db")

# List the table names so we know which table to read into a pandas DataFrame
cursor = conn2.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(f"Table Name : {cursor.fetchall()}")

df = pd.read_sql_query('SELECT * FROM QUOTES', conn2)

# Close the connection opened in this cell (the earlier conn is already closed)
conn2.close()

Table Name : [('QUOTES',)]


In [93]:
df

| | TEXT | AUTHOR | TAGS |
|---|---|---|---|
| 0 | The world as we have created it is a process o... | Albert Einstein | change |
| 1 | It is our choices, Harry, that show what we tr... | J.K. Rowling | deep-thoughts |
| 2 | There are only two ways to live your life. One... | Albert Einstein | thinking |
| 3 | The person, be it gentleman or lady, who has n... | Jane Austen | world |
| 4 | Imperfection is beauty, madness is genius and ... | Marilyn Monroe | abilities |
| ... | ... | ... | ... |
| 94 | But better to get hurt by the truth than comfo... | Khaled Hosseini | love |
| 95 | You never really understand a person until you... | Harper Lee | courage |
| 96 | You have to write the book that wants to be wr... | Madeleine L`Engle | life |
| 97 | Never tell the truth to people who are not wor... | Mark Twain | better-life-empathy |
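
Once the records are in a table, they can be queried with ordinary SQL, which is the point of the exercise. The sketch below counts quotes per author. It runs against an in-memory table with a few placeholder rows (the quote texts are made up; only the column names mirror the QUOTES table above), so the counts it prints say nothing about the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE QUOTES (TEXT TEXT NOT NULL, AUTHOR TEXT NOT NULL, TAGS TEXT NOT NULL)"
)

# Placeholder rows: the texts are invented stand-ins, not scraped quotes.
sample_rows = [
    ("placeholder quote one", "Albert Einstein", "change"),
    ("placeholder quote two", "Albert Einstein", "thinking"),
    ("placeholder quote three", "Jane Austen", "world"),
]
conn.executemany("INSERT INTO QUOTES (TEXT, AUTHOR, TAGS) VALUES (?, ?, ?)", sample_rows)

# Count quotes per author, most quoted first; ties are broken alphabetically.
query = """
SELECT AUTHOR, COUNT(*) AS n_quotes
FROM QUOTES
GROUP BY AUTHOR
ORDER BY n_quotes DESC, AUTHOR
"""
counts = conn.execute(query).fetchall()
print(counts)  # [('Albert Einstein', 2), ('Jane Austen', 1)]
conn.close()
```

The same query string could be passed to `pd.read_sql_query` with a connection to `example.db` to get the result as a DataFrame.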
