# Ch7 : Cleaning your dirty data

## Cleaning in Code

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def ngrams(_input, n):
    _input = _input.split(' ')
    output = []
    for i in range(len(_input)-n+1):
        output.append(_input[i:i+n])
    return output

html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html, "lxml")
content = bsObj.find("div", {"id":"mw-content-text"}).get_text()
bigrams = ngrams(content, 2)  # separate name so the ngrams function is not overwritten
print(bigrams)
print("2-grams count is: "+str(len(bigrams)))

[['\nPython\n\n\n\n\nParadigm\nObject-oriented,', 'imperative,'], ['imperative,', 'functional,'], ['functional,', 'procedural,'], ['procedural,', 'reflective\n\n\nDesigned\xa0by\nGuido'], ['reflective\n\n\nDesigned\xa0by\nGuido', 'van'], ['van', 'Rossum\n\n\nDeveloper\nPython'], ['Rossum\n\n\nDeveloper\nPython', 'Software'], ['Software', 'Foundation\n\n\nFirst\xa0appeared\n20\xa0February'], ['Foundation\n\n\nFirst\xa0appeared\n20\xa0February', '1991;'], ['1991;', '27'], ['27', 'years'], ['years', 'ago\xa0(1991-02-20)[1]\n\n\n\n\n\nStable'], ['ago\xa0(1991-02-20)[1]\n\n\n\n\n\nStable', 'release\n\n3.6.5'], ['release\n\n3.6.5', '/'], ['/', '28\xa0March'], ['28\xa0March', '2018;'], ['2018;', '3'], ['3', 'days'], ['days', 'ago\xa0(2018-03-28)[2]\n2.7.14'], ['ago\xa0(2018-03-28)[2]\n2.7.14', '/'], ['/', '16\xa0September'], ['16\xa0September', '2017;'], ['2017;', '6'], ['6', 'months'], ['months', 'ago\xa0(2017-09-16)[3]\n\n\n\nPreview'], ['ago\xa0(2017-09-16)[3]\n\n\n\nPreview', 'release\n\n

In [5]:
import re

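# Clean the text before splitting it into n-grams
def ngrams(_input, n):
    _input = re.sub("\n+", " ", _input)        # collapse runs of newlines into one space
    _input = re.sub(" +", " ", _input)         # collapse runs of spaces
    _input = bytes(_input, "UTF-8")
    _input = _input.decode("ascii", "ignore")  # drop non-ASCII characters
    print(_input)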
    _input = _input.split(" ")
    output = []
    for i in range(len(_input)-n+1):
        output.append(_input[i:i+n])
    return output
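
To see what the cleaning steps do, here is a small sketch on a hypothetical sample string (not taken from the page above). Encoding to UTF-8 and then decoding as ASCII with errors ignored drops every non-ASCII character, including the non-breaking space `\xa0`, so the words on either side of it end up joined.

```python
import re

# Hypothetical sample with newlines, repeated spaces, and non-breaking spaces
sample = "First\xa0appeared\n\n20\xa0February   1991"

text = re.sub("\n+", " ", sample)   # runs of newlines -> one space
text = re.sub(" +", " ", text)      # runs of spaces -> one space
text = bytes(text, "UTF-8").decode("ascii", "ignore")  # drop non-ASCII characters

print(text)
```

If joined words are a problem, one option is to replace `\xa0` with a regular space before the ASCII step.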

In [6]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string

def cleanInput(_input):
    _input = re.sub("\n+", " ", _input)          # collapse newlines
    _input = re.sub(r"\[[0-9]*\]", "", _input)   # remove citation marks such as [1]
    _input = re.sub(" +", " ", _input)           # collapse repeated spaces
    _input = bytes(_input, "UTF-8")
    _input = _input.decode("ascii", "ignore")    # drop non-ASCII characters
    cleaned = []
    _input = _input.split(" ")
    for item in _input:
        item = item.strip(string.punctuation)
        # keep multi-character tokens and the single-letter words "a" and "i"
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleaned.append(item)
    return cleaned


def ngrams(_input, n):
    _input = cleanInput(_input)
    output = []
    for i in range(len(_input)-n+1):
        output.append(_input[i:i+n])
    return output

> check string.punctuation

In [8]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
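
As a rough illustration of how `strip(string.punctuation)` behaves, the sketch below runs it on a few hypothetical tokens. It removes punctuation from both ends of a token only, so hyphens inside a word or date survive, and a token made up entirely of punctuation becomes an empty string.

```python
import string

# Hypothetical tokens, as they might look after splitting on spaces
tokens = ["(1991-02-20)", "functional,", "object-oriented;", "--", "a"]

# strip() removes the listed characters from both ends only
stripped = [token.strip(string.punctuation) for token in tokens]
print(stripped)
```

The empty string left behind by `"--"` is what the `len(item) > 1` check in `cleanInput` filters out.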

## Data Normalization

## OpenRefine

# Ch8 : Reading and Writing Natural Languages

## Markov Models

# Ch9 : Crawling Through Forms and Logins

## Submitting a Basic Form

- The name of the field you want to submit with data
- The action attribute of the form itself; that is, the page that the form actually posts to
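
The two pieces of information above can be sketched as follows. The field names (`firstname`, `lastname`) and the action URL are hypothetical placeholders, and the standard library is used here only so the request can be built and inspected without sending anything. With Requests, the same dictionary would be passed as `data=` to `requests.post`, as in the login example later in these notes.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: the keys come from the name attributes of the form's
# <input> tags, and the URL comes from the form's action attribute.
form_fields = {"firstname": "Ada", "lastname": "Lovelace"}
action_url = "http://example.com/processing.php"

# Encode the fields the way a browser does for a standard form POST
body = urlencode(form_fields).encode("ascii")
request = Request(action_url, data=body, method="POST")

# Nothing is sent here; the request is only built and inspected
print(request.get_method(), request.full_url)
print(body.decode("ascii"))
```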

## Radio Buttons, Checkboxes, and Other Inputs

## Submitting Files and Images

In [None]:
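import requests

# Open the file in binary mode; the with block closes it once the upload is done
with open("../files/Python-logo.png", "rb") as image:
    files = {"uploadFile": image}
    r = requests.post("http://pythonscraping.com/pages/processing2.php",
                      files=files)
print(r.text)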

## Handling Login and Cookies

In [9]:
import requests 

params = {"username":"kadencho", "password":"password"}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print('Cookie is set to:')
print(r.cookies.get_dict())
print("----------------------")
print("Going to profile page...")
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php",
                 cookies=r.cookies)
print(r.text)

Cookie is set to:
{'loggedin': '1', 'username': 'kadencho'}
----------------------
Going to profile page...
Hey kadencho! Looks like you're still logged into the site!


This works well for simple situations, but what if you're dealing with a more complicated site that frequently modifies cookies without warning, or you'd rather not think about cookies at all? A Requests session object handles this case by keeping track of cookies for you:

In [None]:
import requests

# A Session object stores cookies and sends them on later requests
session = requests.Session()

params = {"username":"username", "password":"password"}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page ...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)

## HTTP Basic Access Authentication

In [11]:
import requests
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth("kaden", "password")
r = requests.post(url="http://pythonscraping.com/pages/auth/login.php", auth=auth)
print(r.text)

<p>Hello kaden.</p><p>You entered password as your password.</p>
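
Under the hood, HTTP basic authentication sends the username and password joined by a colon and base64-encoded in an `Authorization` header. The sketch below builds that header value by hand to show what is transmitted. The helper name `basic_auth_header` is made up for this illustration and is not part of Requests.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header value used by HTTP basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

header = basic_auth_header("kaden", "password")
print(header)

# Base64 is an encoding, not encryption: anyone can reverse it
encoded = header.split(" ", 1)[1]
print(base64.b64decode(encoded).decode("utf-8"))
```

Because the credentials can be decoded this easily, basic auth is only reasonably safe over HTTPS.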


## Other Form Problems

Web forms are a hot point of entry for malicious bots. You don't want bots creating user accounts, taking up valuable server processing time, or submitting spam comments on a blog. For this reason, modern websites often build security features into their HTML forms that might not be immediately apparent.

# Ch10 : Scraping JavaScript

Client-side scripting languages are languages that run in the browser itself, rather than on a web server. The success of a client-side language depends on your browser's ability to interpret and execute the language correctly.

Partly due to the difficulty of getting every browser manufacturer to agree on a standard, there are far fewer client-side languages than there are server-side languages. This is a good thing when it comes to web scraping: the fewer languages there are to deal with, the better.

For the most part, there are only two languages you'll frequently encounter online: ActionScript (which is used by Flash applications) and JavaScript. ActionScript is used far less frequently today than it was 10 years ago, and is often used to stream multimedia files, as a platform for online games, or to display "intro" pages for websites that haven't gotten the hint that no one wants to watch an intro page. At any rate, because there isn't much demand for scraping Flash pages, I will instead focus on the client-side language that's ubiquitous in modern web pages: JavaScript.

JavaScript is, by far, the most common and most well-supported client-side scripting language on the Web today. It can be used to collect information for user tracking, submit forms without reloading the page, embed multimedia, and even power entire online games. Even deceptively simple-looking pages can often contain multiple pieces of JavaScript. You can find it embedded between `<script>` tags in the page's source code.
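
A minimal sketch of locating those script tags is shown below. It uses the standard library's `html.parser` instead of BeautifulSoup only to stay self-contained, and the page source and the `ScriptFinder` class are hypothetical. With BeautifulSoup, `bsObj.find_all("script")` would serve a similar purpose.

```python
from html.parser import HTMLParser

class ScriptFinder(HTMLParser):
    """Collect external script URLs and inline script bodies."""

    def __init__(self):
        super().__init__()
        self.external = []      # values of src attributes
        self.inline = []        # text found inside <script>...</script>
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.external.append(src)
            else:
                self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script and data.strip():
            self.inline.append(data.strip())

# Hypothetical page source
page = """<html><head>
<script src="/js/jquery.min.js"></script>
<script>alert("This creates a pop-up using JavaScript");</script>
</head><body><p>Hello</p></body></html>"""

finder = ScriptFinder()
finder.feed(page)
print(finder.external)
print(finder.inline)
```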

## A Brief Introduction to JavaScript

## Common JavaScript Libraries

> jQuery

If you find jQuery on a site, you must be careful when scraping it. jQuery is adept at dynamically creating HTML content that appears only after the JavaScript is executed. If you scrape the page's content using traditional methods, you will retrieve only the preloaded page that appears before the JavaScript has created the content.

> Google Analytics

```html
<!-- Google Analytics -->
<script type="text/javascript">

...

</script>
```

This script handles the Google Analytics-specific cookies used to track your visit from page to page. This can sometimes be a problem for web scrapers that are designed to execute JavaScript and handle cookies.

If a site uses Google Analytics or a similar web analytics system and you do not want the site to know that it's being crawled or scraped, make sure to discard any cookies used for analytics or discard cookies altogether.
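
One way to discard analytics cookies is to filter them by name before reusing the rest. The sketch below assumes the cookies are held in a plain dictionary and that Google Analytics cookies can be recognized by the name prefixes listed. Both the prefix list and the helper name `drop_analytics_cookies` are assumptions for illustration, and other analytics systems use different names.

```python
# Name prefixes commonly used by Google Analytics cookies (an assumption for
# this sketch; check the cookies a site actually sets)
ANALYTICS_PREFIXES = ("_ga", "_gid", "_gat", "__utm")

def drop_analytics_cookies(cookies):
    """Return a copy of the cookie dict without analytics cookies."""
    return {name: value for name, value in cookies.items()
            if not name.startswith(ANALYTICS_PREFIXES)}

# Hypothetical cookies collected during a crawl
cookies = {"loggedin": "1", "_ga": "GA1.2.123.456",
           "_gid": "GA1.2.789", "__utma": "1.2.3"}
kept = drop_analytics_cookies(cookies)
print(kept)
```

The filtered dictionary can then be passed as `cookies=` to `requests.get`, in the same way the login example passes `r.cookies`.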

> Google Maps

Python makes it easy to extract all instances of coordinates that occur between `google.maps.LatLng(` and `)` to obtain a list of latitude/longitude coordinates.
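
A minimal sketch of that extraction with a regular expression follows. The JavaScript snippet is a hypothetical example of what a page's source might contain, and the pattern assumes the coordinates are written as plain decimal literals.

```python
import re

# Hypothetical snippet of JavaScript as it might appear in a page's source
script = """
var marker1 = new google.maps.Marker({position: new google.maps.LatLng(-25.363882, 131.044922)});
var marker2 = new google.maps.Marker({position: new google.maps.LatLng(42.3601,-71.0589)});
"""

# Capture the two numbers between "google.maps.LatLng(" and ")"
pattern = re.compile(
    r"google\.maps\.LatLng\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)

coordinates = [(float(lat), float(lng)) for lat, lng in pattern.findall(script)]
print(coordinates)
```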

## Ajax and Dynamic HTML