# Homework 1.5: Scraping Review 🏋️‍🏋️‍🏋️‍

For this assignment, you will be scraping an API and a live website.

### Table of Contents
1. CFPB API
2. Microworkers

## Prelude: Importing Your Libraries 
The *first* first thing we're going to do is make sure we're all set up and ready to go. 

That means importing some libraries! I've got a cell below all ready for you to put in some libraries.

Remember with third-party libraries, you will need to make sure they are actually installed before they will run. 
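If you're not sure whether something is installed, one way to check is to ask Python whether it can find the package before you import it. This is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is made up for this example. Note that Beautiful Soup is imported as `bs4`.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the import names from `names` that Python cannot find."""
    return [name for name in names if find_spec(name) is None]

# `requests` and `bs4` are the import names used in this notebook
missing = missing_packages(["requests", "bs4"])
if missing:
    print("Install these before continuing:", ", ".join(missing))
else:
    print("All set!")
```

If anything shows up as missing, `pip install requests beautifulsoup4` in a terminal should take care of it.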

In [1]:
# Import native libraries
import re
import csv

# Import third party libraries
import requests
from bs4 import BeautifulSoup


## Part One: Scraping the CFPB

The Consumer Financial Protection Bureau (CFPB) was founded in the aftermath of the 2008 financial crisis. One of the things they do is collect complaints from consumers about bad banks, lenders and other financial institutions. This complaint data is available to the public in many forms. While you can download a big, horrible CSV file with _all_ of the data, it's usually easier (for you and your computer's memory) to use the API to get only the data you need.

The CFPB manages their API with [Socrata](https://www.tylertech.com/products/socrata), a company that helps a lot of public agencies share their data with the rest of the world. The way they have you request data is kind of funky, but we will persevere together!

[The homepage for the Consumer Complaint Database](https://cfpb.github.io/api/ccdb/index.html)<br>[API Reference](https://dev.socrata.com/foundry/data.consumerfinance.gov/s6ew-h6mp)

### 1. Doing a single request

#### Open the page as a JSON object with Requests

#### Print the first item in the returned list

Notice that there are a lot of fields!
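If you get stuck, here's a rough sketch of what a single request could look like. The endpoint URL is an assumption, based on Socrata's usual `/resource/<dataset-id>.json` pattern and the dataset ID in the API Reference link above. `fetch_json` and `preview` are hypothetical helper names, and the record at the bottom is made up. Nothing is downloaded until you call `fetch_json` yourself.

```python
import json

# Assumed endpoint: Socrata datasets are usually served at /resource/<id>.json
cfpb_endpoint = "https://data.consumerfinance.gov/resource/s6ew-h6mp.json"

def fetch_json(url):
    """Request `url` and return the parsed JSON body."""
    import requests  # third-party; imported here so the helper is self-contained
    response = requests.get(url)
    response.raise_for_status()  # stop early on a 4xx/5xx response
    return response.json()

def preview(record):
    """Format one record with one field per line so it's easier to read."""
    return json.dumps(record, indent=2, sort_keys=True)

# Made-up record for illustration; swap in fetch_json(cfpb_endpoint)[0] for real data
print(preview({"state": "NY", "product": "Debt collection"}))
```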


### 2. Getting a lil bit more specific

When we get complaint data from just the endpoint, we are getting ALL the data–it's basically a firehose! However, we don't actually want all the complaints submitted to the CFPB! We only want specific kinds! 

In fact, we only want complaints that fit these criteria:
- The consumer is based in the state of New York
- It was received by the CFPB between January 1, 2018 and January 1, 2019
- It is about the product "Debt collection" and the sub-product "Mortgage debt"

Using the `cfpb_endpoint`, we will build a url that requests just these kinds of complaints!

We will first filter by each thing, and then write a url that filters all three at the same time! Woah!

#### Filtering by state

Look back at the piece of data we printed in Step 1. How can you tell which state the complaint is from? How are they formatting the state names–is it the full name, or an abbreviation of some sort? Consider checking out the [API documentation](https://dev.socrata.com/foundry/data.consumerfinance.gov/s6ew-h6mp)'s "Fields" section if you're feeling a little lost.
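Socrata lets you filter on a field by adding `field=value` to the end of the URL. Here's a minimal sketch of that idea. It assumes the field turns out to be called `state` and holds two-letter abbreviations (check that against the record you printed!), and it assumes the same endpoint URL pattern Socrata usually uses. `filter_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint, following Socrata's /resource/<dataset-id>.json convention
cfpb_endpoint = "https://data.consumerfinance.gov/resource/s6ew-h6mp.json"

def filter_url(endpoint, **filters):
    """Append simple field=value filters to the endpoint as a query string."""
    return endpoint + "?" + urlencode(filters)

# `state` and "NY" are assumptions about the field name and its format
state_url = filter_url(cfpb_endpoint, state="NY")
print(state_url)
```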

#### Filtering by date range

Read the [between...and...](https://dev.socrata.com/docs/functions/between.html) page in the API documentation. This will explain how to query for complaints within a particular timeframe! Now use that knowledge to call all the complaints between January 1, 2018 and January 1, 2019!
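A date range doesn't fit the simple `field=value` pattern, so it goes into a `$where` parameter instead. The sketch below builds one. `date_received` is an assumed field name (confirm it in the "Fields" list), the endpoint is assumed as well, and `between_clause` is a made-up helper. `urlencode` takes care of escaping the spaces, quotes and `$` for you.

```python
from urllib.parse import urlencode

# Assumed endpoint, following Socrata's /resource/<dataset-id>.json convention
cfpb_endpoint = "https://data.consumerfinance.gov/resource/s6ew-h6mp.json"

def between_clause(field, start, end):
    """Build a SoQL `between ... and ...` condition with quoted timestamps."""
    return f"{field} between '{start}' and '{end}'"

# `date_received` is an assumed field name
where = between_clause("date_received", "2018-01-01T00:00:00", "2019-01-01T00:00:00")
date_url = cfpb_endpoint + "?" + urlencode({"$where": where})
print(date_url)
```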

#### Filtering by sub-product

#### Putting it all together

Now that you've gotten data from each *individual* filter, let's combine them! You can use multiple filters by sticking an `&` between them.

**Gutcheck:** Count how many items you get back using the `len()` function. Is it 68? You're good to go!
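Rather than gluing the `&`s together by hand, you can put every filter into one dictionary and let `urlencode` join them. This is only a sketch: the endpoint and all four field names (`state`, `product`, `sub_product`, `date_received`) are assumptions you should check against the API documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint, following Socrata's /resource/<dataset-id>.json convention
cfpb_endpoint = "https://data.consumerfinance.gov/resource/s6ew-h6mp.json"

# Assumed field names; the values come from the criteria listed in this section
params = {
    "state": "NY",
    "product": "Debt collection",
    "sub_product": "Mortgage debt",
    "$where": "date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'",
}

# urlencode escapes each value and joins the pairs with "&"
combined_url = cfpb_endpoint + "?" + urlencode(params)
print(combined_url)
```

Passing `combined_url` to `requests.get(...)` and calling `.json()` on the response should give you the list to count with `len()`.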

### 3. Saving the data into a CSV file

Now that we have a beautifully crafted URL that gives us all the data we want, let's save it in a CSV file so we can open it up in ｡･:*:･ﾟ★,｡･:*:･ﾟ☆𝔰𝔭𝔯𝔢𝔞𝔡𝔰𝔥𝔢𝔢𝔱 𝔣𝔬𝔯𝔪｡･:*:･ﾟ★,｡･:*:･ﾟ☆.

#### Save the data to a file called `"../output/2018_NY_mortgage_complaints.csv"`
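One thing to watch out for: records from the API won't necessarily all have the same fields, so the header row can't just come from the first record. Here's a hedged sketch that collects every key it sees first and then writes the rows with `csv.DictWriter`. `save_records` is a hypothetical helper name.

```python
import csv

def save_records(records, path):
    """Write a list of dicts to `path`, using every key seen as a column."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        # Keys missing from a record are written as empty cells
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return fieldnames
```

With your list of complaints in a variable (say, `complaints`), you could call `save_records(complaints, "../output/2018_NY_mortgage_complaints.csv")`. The `../output` folder needs to exist already.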

### Bonus: Collect mortgage complaints from multiple states!
**For an extra point:** write a script that loops through the list of states below, downloads all complaints between January 1, 2018 and January 1, 2019 that are about the sub-product "Mortgage debt", and saves each state's complaints into its own CSV with the filename format `../output/2018_STATENAME_mortgage_complaints.csv`

In [2]:
states = ['NY', 'NJ', 'NV', 'ND', 'NM', 'NC']
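
The loop mostly comes down to swapping the state into two places: the URL and the filename. Here's a sketch of just that part, leaving the download and the CSV writing to you. The endpoint and the field names `state`, `sub_product` and `date_received` are assumptions, and `state_request` is a made-up helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint, following Socrata's /resource/<dataset-id>.json convention
cfpb_endpoint = "https://data.consumerfinance.gov/resource/s6ew-h6mp.json"
states = ['NY', 'NJ', 'NV', 'ND', 'NM', 'NC']

def state_request(state):
    """Return the (url, filename) pair for one state's 2018 mortgage debt complaints."""
    params = {
        "state": state,                    # assumed field name
        "sub_product": "Mortgage debt",    # assumed field name
        "$where": "date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'",
    }
    url = cfpb_endpoint + "?" + urlencode(params)
    filename = f"../output/2018_{state}_mortgage_complaints.csv"
    return url, filename

for state in states:
    url, filename = state_request(state)
    # Download `url` and write its rows to `filename` here
    print(filename)
```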

## Part Two: Scraping Microworkers.com

For Part Two, you will be scraping an archive I've made of [Microworkers](https://www.microworkers.com/), a site that pays small amounts of money for the completion of short tasks. I have archived their "Twitter" job listings.

You will have to:
1. Scrape the homepage for links to each job listing
2. Figure out how to scrape a single job listing
3. Apply the knowledge you learned from **(2)** to each link from **(1)**

The link to the archive is here:<br>
**[http://maddy.zone/microworker/index.html](http://maddy.zone/microworker/index.html)**

### 1. Scraping the homepage

#### Open the homepage using Requests

In [3]:
url= "http://maddy.zone/microworker/index.html"
response = requests.get(url)

#### Parse the page using BeautifulSoup

In [4]:
soup = BeautifulSoup(response.content, 'html.parser')

#### Isolate each job listing url and add them to an array

In [5]:
result = []
main_box = soup.find("div", class_="joblistarea")
# Visit every link inside the listing area, not just the first one
for a in main_box.find_all('a', href=True):
    # Tolerate hrefs written as "x.html", "/x.html" or "./x.html"
    job_link = "http://maddy.zone/microworker/" + a['href'].lstrip('./')
    result.append(job_link)

### 2. Scraping a single job listing

![screenshot of the linked page](example.png)

For each page, we will collect **five** different pieces of information:
1. Job title
2. Job ID
3. Employer ID
4. Payment
5. Description

But scraping them all at once can be overwhelming! Let's scrape a single listing first. For some of the pieces of information, you might want to look into the `.replace()` and `.strip()` functions for strings.
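
As a quick illustration of those two string functions, here's a tiny sketch that removes a label from scraped text and trims the whitespace left behind. `clean_field` is a made-up helper name, and the sample string mimics the kind of text a listing page gives you.

```python
def clean_field(text, label):
    """Remove a label such as 'Job ID:' and trim the leftover whitespace."""
    return text.replace(label, "").strip()

raw = "Job ID:\n            b1befe34f477"
print(clean_field(raw, "Job ID:"))  # prints b1befe34f477
```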

#### Open `http://maddy.zone/microworker/54y2h5e4j5c4z213o503w2b4.html` using Requests

In [6]:
response = requests.get("http://maddy.zone/microworker/54y2h5e4j5c4z213o503w2b4.html")

#### Parse the page using BeautifulSoup

In [7]:
soup = BeautifulSoup(response.text, 'html.parser')

#### Isolate the job title

In [16]:
title = soup.find('div', class_='jobarealeft')
job_title = title.find('h1').text.strip()
print(job_title)

DE Shaw Twitter: Follow + Retweet


#### Isolate the job id

In [27]:
id_box = soup.find('div', class_='jobdetailsnoteleft')
# Drop the "Job ID:" label so only the ID itself is left
job_id = id_box.find('p', string=re.compile('Job ID')).text.replace('Job ID:', '').strip()
print(job_id)

b1befe34f477


#### Isolate the employer id

In [45]:
E_Id = soup.find('div', class_= 'jobdetailsnoteright')
employer_id = E_Id.find('a').text
print(employer_id)

Member_1014973


#### Isolate the payment

In [51]:
pay = soup.find('div', class_='jobdetailsnoteleft')
# Drop the "You will earn" label so only the amount is left
payment = pay.find_all('p')[1].text.replace('You will earn', '').strip()
print(payment)

$0.75


#### Isolate the description

In [52]:
description = soup.find('div', class_= 'jobdetailsbox').text.strip()
print(description)

What is expected from Workers?

1. Go to this link - https://twitter.com/DEShawInsider/status/1176597146776289281
2. Follow this account on Twitter
3. Retweet this recent post
4. Take a screenshot of the repost



            Required proof that task was finished?

1. Take a screenshot of the repost


#### Store each of your variables into this dictionary

In [54]:
job_listing = {
    'job_title': job_title,
    'job_id': job_id,
    'employer_id': employer_id,
    'payment': payment,
    'description': description,
}

### 3. Scraping all of the linked pages

#### Make an empty array for your data

In [None]:
job = []

#### Loop through each of the listing links that you saved in Step 1, use the code from Step 2 to get the data from each listing page, and add the dictionary you make to the array from above

In [None]:
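for link in result:
    # Fetch and parse each listing page saved in Step 1
    soup = BeautifulSoup(requests.get(link).text, 'html.parser')
    left = soup.find('div', class_='jobdetailsnoteleft')
    job.append({
        'job_title': soup.find('div', class_='jobarealeft').find('h1').text.strip(),
        'job_id': left.find('p', string=re.compile('Job ID')).text.replace('Job ID:', '').strip(),
        'employer_id': soup.find('div', class_='jobdetailsnoteright').find('a').text,
        'payment': left.find_all('p')[1].text.replace('You will earn', '').strip(),
        'description': soup.find('div', class_='jobdetailsbox').text.strip(),
    })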

### 4. Saving the data into a CSV file

🎉 Wooo! You have all of the data!

#### Print each row into a spreadsheet called `"../output/twitter_microworkers.csv"`
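
Since every job dictionary has the same five keys, the columns can be fixed up front. Here's a minimal sketch with `csv.DictWriter`. `FIELDS` and `save_jobs` are names made up for this example, and the keys match the dictionary from Step 2. Multi-line descriptions are fine: the `csv` module quotes them so each job still ends up as one row.

```python
import csv

# Column order for the spreadsheet, matching the keys of each job dictionary
FIELDS = ["job_title", "job_id", "employer_id", "payment", "description"]

def save_jobs(jobs, path):
    """Write one row per job dictionary, with columns in the order of FIELDS."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(jobs)
```

With the array from Step 3, that would be `save_jobs(job, "../output/twitter_microworkers.csv")`.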