# Data Cleaning Assignment

For this assignment, you will be practicing your web scraping and data cleaning skills on a website of your choice. 

Use the code provided to get started. You will need to fill in specifics, such as your chosen URL and the parts that you want to clean from the results.

At the end, I will ask you to briefly reflect on your progress and speculate on next steps, should you continue this work.

In [None]:
# first, load up the necessary libraries

import requests
from bs4 import BeautifulSoup
import lxml

In [None]:
# second, load up the webpage and create a 'soup' object
# be sure to paste your webpage URL in the get() function, as a string 

webpage = requests.get('')
source = webpage.content
soup = BeautifulSoup(source, 'lxml')

## identifying html tags

Before you can scrape a specific part of a webpage, you need to find out the html tag for the element you want to scrape. You may need to play around with scraping different elements to find the desired one.


Note: if you want to scrape an attribute, check out the bs4 docs on [attributes](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.Tag.attrs) to get started. 

First, print a slice (maybe the first 10 lines) of your new soup object.
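One possible way to do this is sketched below. It is only an illustration: `sample_html` is a made-up page standing in for your downloaded content, and it uses Python's built-in `'html.parser'` so the sketch runs without `lxml`. Because a soup object cannot be sliced directly, the sketch first turns it into a list of lines with `prettify()`.

```python
from bs4 import BeautifulSoup

# hypothetical sample page, standing in for the page you downloaded
sample_html = (
    '<html><head><title>Sample Page</title></head>'
    '<body><h1>Faculty</h1><p>First paragraph.</p>'
    '<a href="https://example.com/about">About</a></body></html>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns the parsed document as one indented string;
# splitlines() turns that string into a list of lines we can slice
pretty_lines = sample_soup.prettify().splitlines()

# keep and print only the first 10 lines
first_lines = pretty_lines[:10]
for line in first_lines:
    print(line)
```

With your own page, you would call `prettify()` on your `soup` object instead of `sample_soup`.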

Now, print out the following elements, one by one: a title, a heading, a paragraph and a link. You may need to do some research on the [tags and attributes](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.Tag).
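As a rough sketch of what this can look like, the example below pulls each of those four elements out of a small made-up page. The `sample_html` string and the variable names are assumptions for illustration, and your own page may use different tags (for example `h2` instead of `h1`). Accessing a tag name as an attribute, like `sample_soup.title`, returns the first matching tag.

```python
from bs4 import BeautifulSoup

# hypothetical sample page, standing in for the page you downloaded
sample_html = (
    '<html><head><title>Sample Page</title></head>'
    '<body><h1>Faculty</h1><p>First paragraph.</p>'
    '<a href="https://example.com/about">About</a></body></html>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# each of these returns the FIRST tag of that name in the document
title_tag = sample_soup.title
heading_tag = sample_soup.h1
paragraph_tag = sample_soup.p
link_tag = sample_soup.a

print(title_tag)
print(heading_tag)
print(paragraph_tag)
print(link_tag)

# attributes of a tag are read like dictionary keys
link_href = link_tag['href']
print(link_href)
```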

Now, use some bs4 attributes for navigating up and down the object tree, like `.children` and `.parent`, and for scraping just the text, like `.text`.
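A minimal sketch of this kind of navigation follows, using a tiny made-up snippet of HTML (`sample_html` is an assumption, not part of the assignment). Note that bs4 exposes a tag's direct children through `.children`, which you loop over, while `.parent` moves one level up.

```python
from bs4 import BeautifulSoup

# hypothetical snippet: a div containing two paragraphs
sample_html = '<div><p>Hello <b>world</b></p><p>Second</p></div>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# going DOWN the tree: loop over the direct children of the div
div_tag = sample_soup.div
child_names = [child.name for child in div_tag.children]
print(child_names)

# going UP the tree: the parent of the first paragraph is the div
first_p = sample_soup.p
parent_name = first_p.parent.name
print(parent_name)

# .text gathers all the text inside a tag, dropping the html tags
p_text = first_p.text
print(p_text)
```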

## scraping the right tag

In the next section, I will provide some code for you to modify in order to scrape the elements you want. 

To find out which element you need to scrape, go to the page source and use the "inspector" tool. In most browsers, you can open this tool by right-clicking on the element you want to scrape and selecting "Inspect" in the popup menu. A window should then appear that covers about half of your webpage. Check the largest panel in this window (the one displaying html code) and look for the *highlighted code*. That code contains the name of the html tag that you want to scrape.

Finally, in the cell below, complete the loop with your desired HTML element in the find_all() function. Remember the element needs to be written as a string.

In [None]:
# this is a for loop that goes through all of the tags in our 'soup' object
# then uses the 'text' attribute to grab just the text (taking out the html tag)
# for that item

for item in soup.find_all(''):
    print(item.text)

If you want to keep only the items that contain a certain word in the text, use the code in the cell below as a starting point.

In [None]:
results = []

# insert the element you want to scrape in find_all()
for item in soup.find_all(''):
    
    # insert the text you want to search for (within this element) between quotes
    if '' in item.text:
        
        # append the text to our new list
        results.append(item.text)

In [None]:
# checking our output, just the first item (at position 0)

results[0]

In [None]:
# now the first 10 items

results[:10]

## cleaning the results

### the `strip()` function

Check the results for things that you may want to remove from your data. For this, we will use the `strip()` function, which removes characters from the beginning and end of a string. Some things to remove might include characters like:
- `\n` characters
- `\n\n` characters
- `\n\n\n` characters
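To illustrate, here is a small sketch with made-up strings (`raw_items` is hypothetical data, not output from a real page). Called with no argument, `strip()` removes all leading and trailing whitespace, so one call handles `\n`, `\n\n` and `\n\n\n` alike. It does not touch characters in the middle of a string.

```python
# hypothetical scraped strings with stray whitespace at the ends
raw_items = ['\n\nAbut, Daniel\n', '  Finance Department  \n\n\n', 'no-whitespace']

# strip() with no argument removes leading and trailing whitespace
stripped_items = [item.strip() for item in raw_items]
print(stripped_items)

# a newline in the MIDDLE of the string is left alone
middle = 'line one\nline two\n'.strip()
print(repr(middle))

# strip('\n') removes only newlines from the ends, so spaces stay
only_newlines = '  x \n'.strip('\n')
print(repr(only_newlines))
```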

### the `split()` function
We may also want to separate out part of the results from a single string into separate strings, so that they can populate separate cells on a spreadsheet. For this we will use the `split()` function. For example, we may want the following string:

`'Abut, Daniel Adjunct Assistant Professor of Finance Finance Department daa249@stern.nyu.edu'`

to become a part of a list of individual strings, like: 

`'Abut, Daniel ', 'Adjunct Assistant Professor of Finance, Finance Department', 'daa249@stern.nyu.edu'`
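The sketch below shows the idea. It assumes that the parts of the scraped string are separated by `\n\n`. That separator is an assumption for illustration, so check your own results to see what actually sits between the parts.

```python
# one scraped string; the '\n\n' separators are an assumption about
# how the parts are divided in the raw text
raw = ('Abut, Daniel\n\n'
       'Adjunct Assistant Professor of Finance, Finance Department\n\n'
       'daa249@stern.nyu.edu')

# split() cuts the string at every separator and returns a list
parts = raw.split('\n\n')
print(parts)
```

Each item in `parts` can then go into its own spreadsheet cell.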

### the `replace()` function

Sometimes, you may want to take out a certain character and put something else, such as a space, in its place. Read more about `replace()` [on RealPython](https://realpython.com/replace-string-python/).
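As a quick sketch, the example below swaps two unwanted characters for ordinary spaces. The `text` value is made up for illustration; `\xa0` is a non-breaking space, which sometimes shows up in scraped text.

```python
# hypothetical scraped string containing a non-breaking space and a tab
text = 'Finance\xa0Department\tNYU'

# replace(old, new) returns a NEW string; calls can be chained
cleaned = text.replace('\xa0', ' ').replace('\t', ' ')
print(cleaned)
```

Unlike `strip()`, `replace()` changes every occurrence in the string, not just those at the ends.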


In [None]:
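# here is a sample loop which you can modify to strip out unwanted characters
# from your dataset. Put the characters to remove between the quotes in
# strip(); with empty quotes, nothing is removed.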

stripped = []

for item in results:
    stripped.append(item.strip(''))
    
# check the results
stripped[0]

In [None]:
# we can also do the same using 'list comprehension' - a way of shortening
# the syntax to compress the loop into one line of code.

stripped = [item.strip('') for item in results]

# check the first few lines
stripped[:3]

In [37]:
# here is a more complex process that splits the strings in the list if they 
# happen to contain the \n\n sequence. This is useful for creating individual 
# strings, which will later become individual cells on a spreadsheet

# First, the traditional version of loop:

divided = []
for row in stripped:
    for item in row.split('\n\n'):
        divided.append(item)
        
# Now, the list comprehension version of the same loop! Look closely to see 
# the logic is the same, but with compressed syntax. 

divided_comp = [item for row in stripped for item in row.split('\n\n')]

## Reflection: 
In the markdown cell below, explain your work on this assignment. What did you decide to scrape from the website? How did the work go? Is there anything you didn't know how to do or any obstacles you encountered? How did you handle the obstacles? (And it's totally fine to say you were discouraged and/or gave up!! Coding is really hard) Reflect a bit on your original objective and current progress.  

## OPTIONAL: writing to CSV to format and save our data

This section is completely optional. Alter the relevant code below to transfer your data to a CSV file.

In [None]:
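# if you want a format where it's just *one* item per row, adapt the code 
# below with your information 

import csv

story_wrappers = soup.find_all(class_='story-wrapper')

# open the output file and create a csv writer; 'stories.csv' is a
# placeholder name, and the 'with' block closes the file automatically
with open('stories.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)

    for item in story_wrappers:
        # the text of the first link inside this element
        link_text = item.a.text
        print(link_text)

        # the URL that the link points to
        link_location = item.a['href']
        print(link_location)

        csv_writer.writerow([link_text, link_location])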

In [43]:
# if you want *multiple* items per row (if you happen to be scraping tabular data),
# adapt the code below

import csv

# open the output file and create a csv writer object
with open('adjuncts.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file)

    # start with an empty row
    new_row = []

    # iterate over the cells in our data
    for cell in divided:
        # add cell to current row
        new_row.append(cell)
        # a cell containing "edu" marks the end of a row
        if "edu" in cell:
            writer.writerow(new_row)
            # start new row
            new_row = []

    # write any leftover cells as a last row
    if new_row:
        writer.writerow(new_row)