In [36]:
!pip3 install pandas



# Homework 1.5: Scraping Review 🏋️‍🏋️‍🏋️‍

For this assignment, you will be scraping an API and a live website.

### Table of Contents
1. CFPB API
2. Microworkers

## Prelude: Importing Your Libraries 
The *first* first thing we're going to do is make sure we're all set up and ready to go. 

That means importing some libraries! I've got a cell below all ready for you to put in some libraries.

Remember with third-party libraries, you will need to make sure they are actually installed before they will run. 
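
If you are not sure whether a library is installed, here is a minimal sketch of one way to check before importing. The helper name `missing_modules` is made up for this example; `importlib.util.find_spec` is part of the standard library.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The three third-party libraries this notebook relies on
print(missing_modules(["requests", "pandas", "bs4"]))
```

Anything printed in that list still needs a `pip3 install` before the import cell will run.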

In [38]:
# Import native libraries
import re
import csv
import json

# Import third party libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

## Part One: Scraping the CFPB

The Consumer Financial Protection Bureau (CFPB) was founded in the aftermath of the 2008 financial crisis. One of the things it does is collect complaints from consumers about bad banks, lenders and other financial institutions. This complaint data is available to the public in many forms. While you can download a big, horrible CSV file with _all_ of the data, it's usually easier (for you and your computer's memory) to use the API to only get the data you need.

The CFPB manages its API with [Socrata](https://www.tylertech.com/products/socrata), a company that helps a lot of public agencies share their data with the rest of the world. The way they have you request data is kind of funky, but we will persevere together!

[The homepage for the Consumer Complaint Database](https://cfpb.github.io/api/ccdb/index.html)<br>[API Reference](https://dev.socrata.com/foundry/data.consumerfinance.gov/s6ew-h6mp)

### 1. Doing a single request

#### Open the page as a JSON object with Requests

In [2]:
url = 'https://data.consumerfinance.gov/resource/s6ew-h6mp.json'
# despite its name, api_url holds the response body (a JSON string), not a URL
api_url = requests.get(url).text
print(api_url)
# api_key = open('../data/ConsumerAPI.txt').read().strip()
# api_url = 'https://data.consumerfinance.gov/'+api_key+'/views/s6ew-h6mp/rows.json?accessType=DOWNLOAD'
# api_response = requests.get(api_url).text
# #url = 'https://data.consumerfinance.gov/resource/s6ew-h6mp.json'
# data = json.loads(api_response)
# print(requests.get(api_url).text)
# #data = json.loads(api_response)
# #print(requests.get(api_url))

[{"date_received":"2019-10-07T00:00:00.000","product":"Credit reporting, credit repair services, or other personal consumer reports","sub_product":"Credit reporting","issue":"Incorrect information on your report","sub_issue":"Information belongs to someone else","company":"Continental Finance Company, LLC","state":"VA","zip_code":"22030","submitted_via":"Web","date_sent_to_company":"2019-10-07T00:00:00.000","company_response":"In progress","timely":"Yes","consumer_disputed":"N/A","complaint_id":"3397597"}
,{"date_received":"2019-10-07T00:00:00.000","product":"Debt collection","sub_product":"I do not know","issue":"Communication tactics","sub_issue":"Frequent or repeated calls","company_public_response":"Company believes it acted appropriately as authorized by contract or law","company":"ATG Credit, LLC","state":"FL","zip_code":"32907","consumer_consent_provided":"Consent not provided","submitted_via":"Web","date_sent_to_company":"2019-10-07T00:00:00.000","company_response":"Closed with

#### Print the first item in the returned list

In [3]:
data = json.loads(api_url)
# lists are zero-indexed, so the first item is data[0]
print(data[0])

{'date_received': '2019-10-07T00:00:00.000', 'product': 'Credit reporting, credit repair services, or other personal consumer reports', 'sub_product': 'Credit reporting', 'issue': 'Incorrect information on your report', 'sub_issue': 'Information belongs to someone else', 'company': 'Continental Finance Company, LLC', 'state': 'VA', 'zip_code': '22030', 'submitted_via': 'Web', 'date_sent_to_company': '2019-10-07T00:00:00.000', 'company_response': 'In progress', 'timely': 'Yes', 'consumer_disputed': 'N/A', 'complaint_id': '3397597'}


Notice that there are a lot of fields!
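
Not every complaint carries the same fields, either. Here is a small sketch that compares two records to see which fields show up in only some of them. The two dictionaries are abbreviated copies of the first two complaints in the response above, and the variable names are just for illustration.

```python
# Abbreviated copies of the first two complaints returned by the API
records = [
    {
        "product": "Credit reporting, credit repair services, or other personal consumer reports",
        "company": "Continental Finance Company, LLC",
        "state": "VA",
        "complaint_id": "3397597",
    },
    {
        "product": "Debt collection",
        "company": "ATG Credit, LLC",
        "state": "FL",
        "company_public_response": "Company believes it acted appropriately as authorized by contract or law",
        "consumer_consent_provided": "Consent not provided",
        "complaint_id": "3397957",
    },
]

# Every field seen in at least one record
all_fields = set().union(*(record.keys() for record in records))
# Fields present in every record
shared_fields = set.intersection(*(set(record) for record in records))
# Fields that are missing from at least one record
optional_fields = sorted(all_fields - shared_fields)

print(optional_fields)
```

This matters later when you write a CSV: a column can be empty for some rows.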


### 2. Getting a lil bit more specific

When we get complaint data from just the endpoint, we are getting ALL the data–it's basically a firehose! However, we don't actually want all the complaints submitted to the CFPB! We only want specific kinds! 

In fact, we only want complaints that fit these criteria:
- The consumer is based in the state of New York
- It was received by the CFPB between January 1, 2018 and January 1, 2019
- It is about the product "Debt collection" and the sub-product "Mortgage debt"

Using the `cfpb_endpoint`, we will build a url that requests just these kinds of complaints!

We will first filter by each thing, and then write a url that filters all three at the same time! Woah!

#### Filtering by state

Look back at the piece of data we printed in Step 1. How can you tell which state the complaint is from? How are they formatting the state names–is it the full name, or an abbreviation of some sort? Consider checking out the [API documentation](https://dev.socrata.com/foundry/data.consumerfinance.gov/s6ew-h6mp)'s "Fields" section if you're feeling a little lost.

In [4]:
url_ny= 'https://data.consumerfinance.gov/resource/s6ew-h6mp.json?state=NY'
api_url_ny= requests.get(url_ny).text
print(api_url_ny)

[{"date_received":"2019-09-30T00:00:00.000","product":"Credit reporting, credit repair services, or other personal consumer reports","sub_product":"Credit reporting","issue":"Problem with a credit reporting company's investigation into an existing problem","sub_issue":"Their investigation did not fix an error on your report","company_public_response":"Company has responded to the consumer and the CFPB and chooses not to provide a public response","company":"TRANSUNION INTERMEDIATE HOLDINGS, INC.","state":"NY","zip_code":"104XX","submitted_via":"Web","date_sent_to_company":"2019-09-30T00:00:00.000","company_response":"Closed with explanation","timely":"Yes","consumer_disputed":"N/A","complaint_id":"3390653"}
,{"date_received":"2019-09-28T00:00:00.000","product":"Debt collection","sub_product":"Medical debt","issue":"Written notification about debt","sub_issue":"Didn't receive notice of right to dispute","company_public_response":"Company believes it acted appropriately as authorized by 

#### Filtering by date range

Read the [between...and...](https://dev.socrata.com/docs/functions/between.html) page in the API documentation. This will explain how to query for complaints within a particular timeframe! Now use that knowledge to call all the complaints between January 1, 2018 and January 1, 2019!

In [5]:
url_date= "https://data.consumerfinance.gov/resource/s6ew-h6mp.json?$where=date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'"
api_url_date= requests.get(url_date).text
print(api_url_date)

[{"date_received":"2018-01-01T00:00:00.000","product":"Debt collection","sub_product":"I do not know","issue":"Attempts to collect debt not owed","sub_issue":"Debt is not yours","complaint_what_happened":"I 've sent letters to Viking Client Services requesting specifics on this alleged debt. I do not have a contract with Viking or XXXX XXXX XXXX. I have not received nor made payments to either of these companies.","company":"Viking Client Services","state":"TX","zip_code":"761XX","consumer_consent_provided":"Consent provided","submitted_via":"Web","date_sent_to_company":"2018-01-01T00:00:00.000","company_response":"Closed with explanation","timely":"Yes","consumer_disputed":"N/A","complaint_id":"2768843"}
,{"date_received":"2018-01-01T00:00:00.000","product":"Mortgage","sub_product":"Conventional home mortgage","issue":"Struggling to pay mortgage","company":"Alabama Housing Finance Authority","state":"AL","zip_code":"36869","consumer_consent_provided":"Other","submitted_via":"Web","dat

In [6]:
url_temp= 'https://data.consumerfinance.gov/resource/s6ew-h6mp.json?date_received=2018-01-03'
api_url_temp= requests.get(url_temp).text
print(api_url_temp)

[{"date_received":"2018-01-03T00:00:00.000","product":"Credit reporting, credit repair services, or other personal consumer reports","sub_product":"Credit reporting","issue":"Incorrect information on your report","sub_issue":"Account information incorrect","complaint_what_happened":"So Ive recently checked my credit report on XX/XX/XXXX,  and I had noticed that the number of inquires on report are through the roof. I had been in disbelief, and I had good reason to be 13 credit inquires were reported on my report that I had recollection of no did I approve of the hard pull on my credit. So I sent an email to all three credit bureaus, XXXX, XXXX and Equifax explain that I would like to discuss why these inquiries are on my report if I did not have any knowledge of them. All three Credit bureaus declined my requests when I called in XX/XX/XXXX, the three names I had gotten from the associates were XXXX for XXXX., XXXX for XXXX, XXXX for Equifax they had argued with me and kept blaming me 
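
The date filter above packs a lot into one string, so here is a hedged sketch of building it piece by piece with the standard library. The helper `between_clause` and the names `ENDPOINT` and `date_url` are invented for this example; percent-encoding the query with `urlencode` is one reasonable approach, not the only way to call the API.

```python
from urllib.parse import urlencode, quote, parse_qs, urlsplit

ENDPOINT = "https://data.consumerfinance.gov/resource/s6ew-h6mp.json"


def between_clause(field, start, end):
    """Build a filter of the form: field between 'start' and 'end'."""
    return f"{field} between '{start}' and '{end}'"


where = between_clause("date_received", "2018-01-01T00:00:00", "2019-01-01T00:00:00")

# quote_via=quote encodes spaces as %20 rather than +
date_url = ENDPOINT + "?" + urlencode({"$where": where}, quote_via=quote)
print(date_url)

# Decoding the query string gives back the original clause
print(parse_qs(urlsplit(date_url).query)["$where"][0])
```

Keeping the start and end dates as separate arguments makes it harder to mistype one of them inside a long URL.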

#### Filtering by sub-product

In [7]:
url_sub_product= 'https://data.consumerfinance.gov/resource/s6ew-h6mp.json?sub_product=Mortgage debt'
api_url_sub_product= requests.get(url_sub_product).text
print(api_url_sub_product)

[{"date_received":"2019-09-25T00:00:00.000","product":"Debt collection","sub_product":"Mortgage debt","issue":"Attempts to collect debt not owed","sub_issue":"Debt is not yours","company":"NATIONSTAR MORTGAGE","state":"WV","zip_code":"25801","tags":"Older American","submitted_via":"Web","date_sent_to_company":"2019-09-25T00:00:00.000","company_response":"Closed with explanation","timely":"Yes","consumer_disputed":"N/A","complaint_id":"3386113"}
,{"date_received":"2019-09-16T00:00:00.000","product":"Debt collection","sub_product":"Mortgage debt","issue":"Attempts to collect debt not owed","sub_issue":"Debt is not yours","company":"NATIONSTAR MORTGAGE","state":"IL","zip_code":"60002","submitted_via":"Web","date_sent_to_company":"2019-09-16T00:00:00.000","company_response":"In progress","timely":"Yes","consumer_disputed":"N/A","complaint_id":"3375737"}
,{"date_received":"2019-09-05T00:00:00.000","product":"Debt collection","sub_product":"Mortgage debt","issue":"Attempts to collect debt no

#### Putting it all together

Now that you've gotten data from each *individual* filter, let's combine them! You can use multiple filters by sticking an `&` between them.

In [33]:
url_allfilters= "https://data.consumerfinance.gov/resource/s6ew-h6mp.json?$where=date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'&sub_product=Mortgage debt&state=NY&product=Debt collection"
api_url_allfilters= requests.get(url_allfilters).text
data_url_allfilters= json.loads(api_url_allfilters)
print(data_url_allfilters)
len(data_url_allfilters)
#len(url_allfilters)

[{'date_received': '2018-11-08T00:00:00.000', 'product': 'Debt collection', 'sub_product': 'Mortgage debt', 'issue': 'False statements or representation', 'sub_issue': 'Attempted to collect wrong amount', 'company_public_response': 'Company believes it acted appropriately as authorized by contract or law', 'company': 'SELECT PORTFOLIO SERVICING, INC.', 'state': 'NY', 'zip_code': '117XX', 'consumer_consent_provided': 'Other', 'submitted_via': 'Web', 'date_sent_to_company': '2018-11-20T00:00:00.000', 'company_response': 'Closed with explanation', 'timely': 'Yes', 'consumer_disputed': 'N/A', 'complaint_id': '3069926'}, {'date_received': '2018-06-26T00:00:00.000', 'product': 'Debt collection', 'sub_product': 'Mortgage debt', 'issue': 'Communication tactics', 'sub_issue': 'Frequent or repeated calls', 'complaint_what_happened': 'On XX/XX/XXXX at XXXX XXXX XXXX XXXX called and left a voice message. Then at XXXX a rep called again and left another voice message. I can understand once a day bu

68

**Gutcheck:** Count how many items you get back using the `len()` function. Is it 68? You're good to go!
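
As a sketch of what "sticking an `&` between them" looks like when the URL is built for you, the filters can live in a dictionary and be joined with `urlencode` from the standard library. The names `filters` and `combined_url` are just for illustration, and no request is sent here.

```python
from urllib.parse import urlencode, quote, parse_qs, urlsplit

ENDPOINT = "https://data.consumerfinance.gov/resource/s6ew-h6mp.json"

# One entry per filter; a dictionary cannot repeat a key,
# so each field can only be filtered once
filters = {
    "$where": "date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'",
    "product": "Debt collection",
    "sub_product": "Mortgage debt",
    "state": "NY",
}

# urlencode joins the filters with & and percent-encodes spaces and quotes
combined_url = ENDPOINT + "?" + urlencode(filters, quote_via=quote)
print(combined_url)

# Decode the query string again to double-check each filter
decoded = parse_qs(urlsplit(combined_url).query)
for field, values in decoded.items():
    print(field, "=", values[0])
```

With Requests you could also pass the same dictionary as `requests.get(ENDPOINT, params=filters)` and let the library do the encoding.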

### 3. Saving the data into a CSV file

Now that we have a beautifully crafted URL that gives us all the data we want, let's save it in a CSV file so we can open it up in ｡･:*:･ﾟ★,｡･:*:･ﾟ☆𝔰𝔭𝔯𝔢𝔞𝔡𝔰𝔥𝔢𝔢𝔱 𝔣𝔬𝔯𝔪｡･:*:･ﾟ★,｡･:*:･ﾟ☆.

#### Save the data to a file called `"../output/2018_NY_mortgage_complaints.csv"`

In [50]:
df = pd.DataFrame(data_url_allfilters,columns=["date_received","product","sub_product","state"])
df.to_csv("../output/2018_NY_mortgage_complaints.csv", index= False)    
#df = pd.DataFrame(data_url_allfilters,columns=["date_received","product","sub_product","issue", "sub_issue","company_public_response","company", "state","zip_code","x","consumer_consent_provided","submitted_via"])

# complaint = []
# #print(data_url_allfilters)

# for i in data_url_allfilters:

#     complaint_rows= {
#                 "date_received": "date_received",
#                 "state":"state",
#                 "sub_product": "sub_product",
#                 "product":"product"
#                  }
             
#     complaint.append(complaint_rows) 
            
        
# with open("../output/2018_NY_mortgage_complaints.csv", "w+")  as csvfile:  
#     fieldnames = ['date', 'state','sub_product','product']
#     # this creates your csv
#     writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
#     # this writes in the first row, which are the headers
#     writer.writeheader()

#     # this loops through your rows (the array you set at the beginning and have updated throughout)
#     for complaint_rows in complaint:
#          # this takes each row and writes it into your csv
#         writer.writerow(complaint)
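
If you would rather use the `csv` module than pandas, here is a minimal sketch of the `DictWriter` route. The two records are abbreviated copies of complaints from the first API response in Part One, the column choice is only an example, and an in-memory `io.StringIO` buffer stands in for the real output file.

```python
import csv
import io

# Abbreviated copies of two complaints; the first has no company_public_response
complaints = [
    {
        "date_received": "2019-10-07T00:00:00.000",
        "product": "Credit reporting, credit repair services, or other personal consumer reports",
        "state": "VA",
        "complaint_id": "3397597",
    },
    {
        "date_received": "2019-10-07T00:00:00.000",
        "product": "Debt collection",
        "state": "FL",
        "company_public_response": "Company believes it acted appropriately as authorized by contract or law",
        "complaint_id": "3397957",
    },
]

fieldnames = ["date_received", "product", "state", "company_public_response"]

# Stand-in for: open("../output/2018_NY_mortgage_complaints.csv", "w", newline="")
buffer = io.StringIO()

# restval fills in missing fields; extrasaction="ignore" skips keys not in fieldnames
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", extrasaction="ignore")
writer.writeheader()
for complaint in complaints:
    # pass the dictionary itself, one row at a time
    writer.writerow(complaint)

csv_text = buffer.getvalue()
print(csv_text)
```

Each call to `writerow` takes a single dictionary whose values are the actual data, which is where the commented-out attempt above went wrong.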


### Bonus: Collect mortgage complaints from multiple states!
**For an extra point:** write a script that loops through the list of states below, downloads all complaints between January 1, 2018 and January 1, 2019 that are about the sub-product "Mortgage debt", and saves each state's complaints into its own CSV with the filename format `../output/2018_STATENAME_mortgage_complaints.csv`

In [113]:
states = ['NY', 'NJ', 'NV', 'ND', 'NM', 'NC']

In [133]:

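# The API rejects a URL that names `state` more than once, so request one state at a time
for state in states:
    url_state = f"https://data.consumerfinance.gov/resource/s6ew-h6mp.json?$where=date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'&product=Debt collection&sub_product=Mortgage debt&state={state}"
    data_state = json.loads(requests.get(url_state).text)
    # one CSV per state, named after the state abbreviation
    pd.DataFrame(data_state).to_csv(f"../output/2018_{state}_mortgage_complaints.csv", index=False)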


## Part Two: Scraping Microworkers.com

For Part Two, you will be scraping an archive I've made of [Microworkers](https://www.microworkers.com/), a site that pays small amounts of money for the completion of short tasks. I have archived their "Twitter" job listings.

You will have to:
1. Scrape the homepage for links to each job listing
2. Figure out how to scrape a single job listing
3. Apply the knowledge you learned from **(2)** to each link from **(1)**

The link to the archive is here:<br>
**[http://maddy.zone/microworker/index.html](http://maddy.zone/microworker/index.html)**

### 1. Scraping the homepage

#### Open the homepage using Requests

In [None]:
url= "http://maddy.zone/microworker/index.html"
response = requests.get(url)

#### Parse the page using BeautifulSoup

In [None]:
soup = BeautifulSoup(response.content, 'html.parser')

#### Isolate each job listing url and add them to an array

In [None]:
result = []
main_box = soup.find('div', class_='joblistarea')
# find_all() already returns every listing, so one loop is enough;
# also looping over main_box's children would append each link many times
main_list = main_box.find_all('div', class_='jobname')
for alink in main_list:
    link = alink.find('a',href=True)
    job_link = "https://maddy.zone/microworker" + link['href']
    #print(job_link)
    result.append(job_link)

### 2. Scraping a single job listing

![screenshot of the linked page](example.png)

For each page, we will collect **five** different pieces of information:
1. Job title
2. Job ID
3. Employer ID
4. Payment
5. Description

But scraping them all at once can be overwhelming! Let's scrape a single listing first. For some of the pieces of information, you might want to look into `.replace()` and `.strip()` functions for strings.
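
Here is a quick sketch of those two string functions in action. The raw strings are hypothetical; the real page may label or space its text differently.

```python
# Hypothetical text as it might come out of a tag, with a label and stray whitespace
raw_job_id = "  Job ID: abc123\n"
raw_payment = "\n Payment: $0.10 "

# .replace() swaps the label for an empty string, .strip() trims whitespace from both ends
clean_job_id = raw_job_id.replace("Job ID:", "").strip()
clean_payment = raw_payment.replace("Payment:", "").strip()

print(clean_job_id)
print(clean_payment)
```

Both functions return a new string and leave the original untouched, so remember to save the result in a variable.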

#### Open `http://maddy.zone/microworker/54y2h5e4j5c4z213o503w2b4.html` using Requests

In [None]:
response = requests.get("http://maddy.zone/microworker/54y2h5e4j5c4z213o503w2b4.html")

#### Parse the page using BeautifulSoup

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

#### Isolate the job title

In [None]:
title = soup.find('div', class_='jobarealeft')
job_title = title.find('h1').text.strip()
print(job_title)

#### Isolate the job id

In [None]:
Id = soup.find('div', class_= 'jobdetailsnoteleft')
job_id = Id.find('p', text = re.compile('Job ID')).text.strip()
print(job_id)

#### Isolate the employer id

In [None]:
E_Id = soup.find('div', class_= 'jobdetailsnoteright')
employer_id = E_Id.find('a').text
print(employer_id)

#### Isolate the payment

In [None]:
pay = soup.find('div', class_= 'jobdetailsnoteleft')
payment = pay.find_all('p')[1].text.strip()
print(payment)

#### Isolate the description

In [None]:
description = soup.find('div', class_= 'jobdetailsbox').text.strip()
print(description)

#### Store each of your variables into this dictionary

In [None]:
job_listing = {
    'job_title': job_title,
    'job_id': job_id,
    'employer_id': employer_id,
    'payment': payment,
    'description': description,
}

### 3. Scraping all of the linked pages

#### Make an empty array for your data

In [None]:
job = []

#### Loop through each of the listing links that you saved in Step 1, and...<br>    Use the code from Step 2 to get the data from each listing page<br>And add the dictionary you make to the array from above

In [None]:
job = []
url = "http://maddy.zone/microworker/index.html"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
main_box = soup.find('div', class_='joblistarea')

# Each column of the listing table lives in its own div class
main_list = main_box.find_all('div', class_='jobname')
main_pay = main_box.find_all('div', class_='jobpayment')
main_success = main_box.find_all('div', class_='jobsuccess')
main_ttr = main_box.find_all('div', class_='jobttr')
main_status = main_box.find_all('div', class_='jobstatus')
main_done = main_box.find_all('div', class_='jobdone')

# zip() walks the columns in parallel so each dictionary describes one listing
# (this assumes every column list has one entry per listing, in the same order)
for name, pay, success, ttr, status, done in zip(
        main_list, main_pay, main_success, main_ttr, main_status, main_done):
    link = name.find('a', href=True)
    job_listing = {
        'job_title': link.text.strip(),
        'job_link': "https://maddy.zone/microworker" + link['href'],
        'job_pay': pay.find_all('p')[0].text.strip(),
        'job_success': success.find_all('p')[0].text.strip() + "%",
        'job_ttr': ttr.find_all('p')[0].text.strip(),
        'job_status': status.find_all('p')[0].text.strip(),
        'job_done': done.find_all('p')[0].text.strip()}

    job.append(job_listing)
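
The heading above asks for the Step 2 code to be reused on every link from Step 1. One way to sketch that is to wrap the Step 2 lookups in a function. The function name `scrape_listing` and the `SAMPLE_HTML` page are made up for this example; the sample only imitates the class names used in Step 2, so the real pages may need different cleanup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for one listing page; the real markup may differ
SAMPLE_HTML = """
<div class="jobarealeft"><h1> Twitter: Follow + Retweet </h1></div>
<div class="jobdetailsnoteleft">
  <p>Job ID: abc123</p>
  <p>Payment: $0.10</p>
</div>
<div class="jobdetailsnoteright"><p>Employer: <a href="#">Member_42</a></p></div>
<div class="jobdetailsbox"> Follow the account and retweet the pinned post. </div>
"""


def scrape_listing(html):
    """Collect the five pieces of information from one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    left = soup.find("div", class_="jobdetailsnoteleft")
    job_id_tag = left.find("p", string=re.compile("Job ID"))
    return {
        "job_title": soup.find("div", class_="jobarealeft").find("h1").text.strip(),
        "job_id": job_id_tag.text.replace("Job ID:", "").strip(),
        "employer_id": soup.find("div", class_="jobdetailsnoteright").find("a").text.strip(),
        "payment": left.find_all("p")[1].text.replace("Payment:", "").strip(),
        "description": soup.find("div", class_="jobdetailsbox").text.strip(),
    }


# With the live archive, `pages` would hold requests.get(link).text for each saved link
pages = [SAMPLE_HTML]
jobs = [scrape_listing(page) for page in pages]
print(jobs[0])
```

Because the function takes HTML text rather than a URL, you can test it on one saved page before pointing it at the whole list of links.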

### 4. Saving the data into a CSV file

🎉 Wooo! You have all of the data!

#### Print each row into a spreadsheet called `"../output/twitter_microworkers.csv"`

In [None]:
# make a new csv into which we will write all the rows
with open('../output/twitter_microworkers.csv', 'w', newline='') as csvfile:
    # these are the header names:
    fieldnames = ['job_title', 'job_link','job_pay','job_success','job_ttr','job_status', 'job_done']
    # this creates your csv
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # this writes in the first row, which are the headers
    writer.writeheader()

    # this loops through your rows (the array you set at the beginning and have updated throughout)
    for job_listing in job:
         # this takes each row and writes it into your csv
        writer.writerow(job_listing)