![alt text](https://avatars1.githubusercontent.com/u/35489154?s=200&v=4)

### Cohort 3

**course instructors:** Tejumade (tmafonja1@gmail.com), Kenechi (kennydukor@gmail.com)

In [0]:
## all imports
from urllib.request import urlopen
from IPython.display import HTML
import numpy as np
#import urllib2
import bs4 #this is beautiful soup
import time
import operator
import socket
#import cPickle
import re # regular expressions

from pandas import Series
import pandas as pd
from pandas import DataFrame

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_context("talk")
sns.set_style("white")

# API registrations

If you would like to run all the examples in this notebook, you need to register for the following APIs:

- Github

https://github.com/settings/tokens/new

- Twitter

https://apps.twitter.com/app/new

- Twitter Instructions

https://twittercommunity.com/t/how-to-get-my-api-key/7033

## Important Note

There's one very important point here, which may be obvious if you've spent substantial time doing any kind of software development, but may not be apparent if most of your programming experience comes from class exercises, so I emphasize it here. You will see code samples throughout the course, in the slides and in these notes. Don't take this to mean that you should memorize these precise function calls, or even do more than scan over them briefly. As a data scientist you'll be dealing with hundreds of different libraries and APIs, and trying to commit them all to memory is not useful. What you need to develop instead is the ability to quickly find the library and function call that accomplishes a given task.

For example, suppose you want to download the content of some URL and know nothing about the relevant library. You can type something like "Python download url content" into Google (I picked this phrasing at random; feel free to try variants). The first result for my search is a Stack Overflow page: "How do I download a file over HTTP using Python?". The first response lists the urllib2 package, which was the more common library at one point, but the requests library provides a simpler interface that, among other niceties, automatically encodes parameters into URLs; its home page is linked a few responses down. Once you find that home page, the very first example shows how to use the library for simple calls like the ones used later in this notebook. You can read through the documentation there, but if you have a specific question about the requests library, a Google search will usually give you a direct answer as well.
For instance, if you want to learn to use the POST command, you can Google something like "python requests library post command" and the results will bring you either straight to the relevant requests documentation or to a Stack Overflow page.

General advice about programming

- You will find nearly everything on Google
- Try: length of a list in python
- A programmer is someone who can turn Stack Overflow snippets into running code
- Use tab completion
- Make your variable names meaningful

# Pulling Data from an already existing table

We will be exploring the MovieLens data using pandas

http://grouplens.org/datasets/movielens/ 

Example inspired by Greg Reda: http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/

# Read the user data

In [0]:

# pass in column names for each CSV
user_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

users = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.user', 
    sep='|', names=user_cols)

users.head()

   user_id  age sex  occupation zip_code
0        1   24   M  technician    85711
1        2   53   F       other    94043
2        3   23   M      writer    32067
3        4   24   M  technician    43537
4        5   33   F       other    15213



# Read the ratings

In [0]:
ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.data', 
    sep='\t', names=ratings_cols)

ratings.head()

   user_id  movie_id  rating  unix_timestamp
0      196       242       3       881250949
1      186       302       3       891717742
2       22       377       1       878887116
3      244        51       2       880606923
4      166       346       1       886397596



# Data about the movies

In [0]:
# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
movie_cols = ['movie_id', 'title', 'release_date', 
            'video_release_date', 'imdb_url']

# u.item is not UTF-8 encoded, so the encoding is passed explicitly
movies = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.item', sep='|', 
                     names=movie_cols, usecols=range(5), encoding='ISO-8859-1')
movies.head()

   movie_id              title release_date  video_release_date                                           imdb_url
0         1   Toy Story (1995)  01-Jan-1995                 NaN  http://us.imdb.com/M/title-exact?Toy%20Story%2...
1         2   GoldenEye (1995)  01-Jan-1995                 NaN  http://us.imdb.com/M/title-exact?GoldenEye%20(...
2         3  Four Rooms (1995)  01-Jan-1995                 NaN  http://us.imdb.com/M/title-exact?Four%20Rooms%...
3         4  Get Shorty (1995)  01-Jan-1995                 NaN  http://us.imdb.com/M/title-exact?Get%20Shorty%...
4         5     Copycat (1995)  01-Jan-1995                 NaN  http://us.imdb.com/M/title-exact?Copycat%20(1995)


#### What went wrong?
If `read_csv` fails on this file with a `UnicodeDecodeError`, the cause is the file's encoding: `u.item` is not valid UTF-8. `read_csv` takes an `encoding` option to deal with files in different formats. I mostly use `read_csv('file', encoding="ISO-8859-1")`, or alternatively `encoding="utf-8"`, for reading, and generally `utf-8` for `to_csv`.

You can also use one of several alias options like `'latin'` instead of `'ISO-8859-1'` (see the Python docs, also for the numerous other encodings you may encounter).

See the relevant pandas documentation, the Python docs examples on csv files, and the many related questions on Stack Overflow.

To detect the encoding (assuming the file contains non-ASCII characters), you can use `enca` (see its man page), or `file -i` (Linux) or `file -I` (macOS).

# Get information about data

In [0]:
print(movies.dtypes)
print (movies.describe())
# *** Why only those two columns? ***

movie_id                int64
title                  object
release_date           object
video_release_date    float64
imdb_url               object
dtype: object
          movie_id  video_release_date
count  1682.000000                 0.0
mean    841.500000                 NaN
std     485.695893                 NaN
min       1.000000                 NaN
25%     421.250000                 NaN
50%     841.500000                 NaN
75%    1261.750000                 NaN
max    1682.000000                 NaN


# Selecting data

- DataFrame => group of Series with shared index
- single DataFrame column => Series

In [0]:
A = users.head()
B = users['occupation'].head()

columns_you_want = ['occupation', 'sex'] 
D = users[columns_you_want].head()

print (A, "\n========")
print (B, "\n========")
print (D)

   user_id  age sex  occupation zip_code
0        1   24   M  technician    85711
1        2   53   F       other    94043
2        3   23   M      writer    32067
3        4   24   M  technician    43537
4        5   33   F       other    15213 
0    technician
1         other
2        writer
3    technician
4         other
Name: occupation, dtype: object 
   occupation sex
0  technician   M
1       other   F
2      writer   M
3  technician   M
4       other   F


# Filtering Data

Select users older than 25

In [0]:
oldUsers = users[users['age'] > 25]
oldUsers.head()

   user_id  age sex     occupation zip_code
1        2   53   F          other    94043
4        5   33   F          other    15213
5        6   42   M      executive    98101
6        7   57   M  administrator    91344
7        8   36   M  administrator     5201
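
The filter above works in two steps: the comparison builds a boolean Series with one entry per row, and indexing with that Series keeps only the rows marked `True`. A minimal sketch on a made-up four-row table (the `toy_users` frame and its values are assumptions for illustration, not MovieLens data):

```python
import pandas as pd

# Hypothetical toy data, not taken from the MovieLens files
toy_users = pd.DataFrame({
    "age": [24, 53, 23, 33],
    "sex": ["M", "F", "M", "F"],
})

# Step 1: the comparison yields one True/False value per row
mask = toy_users["age"] > 25

# Step 2: indexing with the mask keeps only the rows where it is True
older = toy_users[mask]

# Conditions are combined with & (and) or | (or);
# each condition needs its own parentheses
older_women = toy_users[(toy_users["age"] > 25) & (toy_users["sex"] == "F")]

print(mask)
print(older)
```

Note that the original row labels (here 1 and 3) are kept after filtering, which is why the output above starts at index 1 rather than 0.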


# Quiz:

- show users aged 40 and male

- show the mean age of female programmers

In [0]:
# users aged 40 AND male
# your code here


In [0]:
users[(users.age == 40) & (users.sex == "M")].head(3)

     user_id  age sex  occupation zip_code
18        19   40   M   librarian     2138
82        83   40   M       other    44133
115      116   40   M  healthcare    97232


In [0]:
users[(users["age"] == 40) & (users["sex"] == "M")].head(3)

     user_id  age sex  occupation zip_code
18        19   40   M   librarian     2138
82        83   40   M       other    44133
115      116   40   M  healthcare    97232


In [0]:
# but what if i want to see only age and sex
# your code here
columns_you_want_2 = users[['age', 'sex']]
print(columns_you_want_2.head())

   age sex
0   24   M
1   53   F
2   23   M
3   24   M
4   33   F


In [0]:
#solution_cont
older = columns_you_want_2[(columns_you_want_2.age == 40) & (columns_you_want_2['sex'] == 'M')]
older.head()

     age sex
18    40   M
82    40   M
115   40   M
199   40   M
283   40   M


In [0]:
## users who are female and programmers
# your code here

## show statistic summary or compute mean
# your code here


In [0]:
users['occupation'].unique()

array(['technician', 'other', 'writer', 'executive', 'administrator',
       'student', 'lawyer', 'educator', 'scientist', 'entertainment',
       'programmer', 'librarian', 'homemaker', 'artist', 'engineer',
       'marketing', 'none', 'healthcare', 'retired', 'salesman', 'doctor'],
      dtype=object)

In [0]:
females = users[(users.sex == "F") & (users.occupation == "programmer")].head(5)
females

     user_id  age sex  occupation zip_code
291      292   35   F  programmer    94703
299      300   26   F  programmer    55106
351      352   37   F  programmer    55105
403      404   29   F  programmer    55108
420      421   38   F  programmer    55105


In [0]:
#If we want only those column (sex, occupation)
columns_you_want = users[['sex', 'occupation']]
print(columns_you_want.head())

  sex  occupation
0   M  technician
1   F       other
2   M      writer
3   M  technician
4   F       other


In [0]:
#solution_cont
occ = columns_you_want[(columns_you_want['sex'] == 'F') & (columns_you_want['occupation'] == 'programmer')]
occ.head()

    sex  occupation
291   F  programmer
299   F  programmer
351   F  programmer
403   F  programmer
420   F  programmer


In [0]:
# a smarter way
columns_you_want_better = females[['sex', 'occupation']]
columns_you_want_better

    sex  occupation
291   F  programmer
299   F  programmer
351   F  programmer
403   F  programmer
420   F  programmer
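
The quiz also asks for the mean age of female programmers. One way to get it is to filter first and then call `.mean()` on the `age` column. Taking `.head(5)` before averaging would only average the first five matches, so the sketch below averages the full filtered frame. The `toy_users` table and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, not taken from the MovieLens files
toy_users = pd.DataFrame({
    "age": [35, 26, 40, 29, 50],
    "sex": ["F", "F", "M", "F", "F"],
    "occupation": ["programmer", "programmer", "programmer",
                   "artist", "programmer"],
})

# Keep only rows that are both female and programmers
is_female_programmer = (toy_users["sex"] == "F") & (toy_users["occupation"] == "programmer")
female_programmers = toy_users[is_female_programmer]

# Average the age column over all matching rows
mean_age = female_programmers["age"].mean()
print(mean_age)
```

Applied to the real `users` frame, the same filter followed by `["age"].mean()` should answer the quiz; `.describe()` on the filtered frame gives the full statistical summary.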


# Split-apply-combine

- splitting the data into groups based on some criteria
- applying a function to each group independently
- combining the results into a data structure
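
The three steps above can be sketched by hand on a tiny made-up table and compared with what `groupby` does in a single call. The `toy_ratings` frame is an assumption for illustration only:

```python
import pandas as pd

# Hypothetical toy data: two users, five ratings
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "rating":  [4, 2, 5, 3, 4],
})

means = {}
for user in toy_ratings["user_id"].unique():
    # Split: select the rows that belong to this user
    group = toy_ratings[toy_ratings["user_id"] == user]
    # Apply: compute the statistic on this group only
    means[user] = group["rating"].mean()

# Combine: collect the per-group results into one Series
by_hand = pd.Series(means, name="rating")

# groupby performs the same split, apply and combine in one expression
with_groupby = toy_ratings.groupby("user_id")["rating"].mean()

print(by_hand)
print(with_groupby)
```

The explicit loop is only there to make the steps visible; in practice the `groupby` form is shorter and usually much faster.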

Let's find the diligent users

In [0]:
print (ratings.head())
print("=======")

   user_id  movie_id  rating  unix_timestamp
0      196       242       3       881250949
1      186       302       3       891717742
2       22       377       1       878887116
3      244        51       2       880606923
4      166       346       1       886397596


In [0]:

## split data
#grouped_data = ratings.groupby('user_id')
grouped_data = ratings['movie_id'].groupby(ratings['user_id'])
#print(grouped_data.head(5))

## count and combine
ratings_per_user = grouped_data.count()

ratings_per_user.head(5)

user_id
1    272
2     62
3     54
4     24
5    175
Name: movie_id, dtype: int64

In [0]:
#Other method
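# count(level=...) was removed in recent pandas versions; group by the index level instead
ratings.set_index(["user_id", "movie_id"]).groupby(level="user_id").count().head()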

         rating  unix_timestamp
user_id                        
1           272             272
2            62              62
3            54              54
4            24              24
5           175             175


In [0]:
ratings.count()

user_id           100000
movie_id          100000
rating            100000
unix_timestamp    100000
dtype: int64

# Quiz

- get the average rating per movie
- advanced: get the movie titles with the highest average rating

In [0]:
## split data

# your code here
grouped_data_1 = ratings['rating'].groupby(ratings['movie_id'])
grouped_data_1.head(2)

0        3
1        3
2        1
3        2
4        1
5        4
6        2
7        5
8        3
9        3
10       2
11       5
12       5
13       3
14       3
15       3
16       5
17       2
18       4
19       2
20       4
21       4
22       4
23       2
24       4
25       2
26       5
27       2
28       4
29       5
        ..
90967    1
91526    1
92175    4
92329    3
92338    3
92716    3
92728    3
93047    3
93792    3
93843    1
93856    2
93874    3
93967    1
94060    4
94088    1
94232    4
94670    1
95364    5
95376    3
96444    4
97057    5
97401    1
97649    3
98323    2
98427    3
98640    3
98955    3
99177    5
99749    2
99953    4
Name: rating, Length: 3223, dtype: int64

In [0]:
## average and combine
# your code here
# use the per-movie grouping (grouped_data_1), not the earlier per-user grouping
average_ratings = grouped_data_1.mean()
print ("Average ratings:")
print (average_ratings.head())

Average ratings:
movie_id
1    3.878319
2    3.206107
3    3.033333
4    3.550239
5    3.302326
Name: rating, dtype: float64


In [0]:
# get the maximum rating
# your code here
maximum_rating = average_ratings.max()
maximum_rating

5.0

In [0]:
# get movie ids with that rating
# your code here
good_movie_ids = average_ratings[average_ratings == maximum_rating].index
good_movie_ids

Int64Index([814, 1122, 1189, 1201, 1293, 1467, 1500, 1536, 1599, 1653], dtype='int64', name='movie_id')

In [0]:
print ("Good movie ids:")
# your code here
print (good_movie_ids)
print ("===============\n=============")
print ("Best movie titles")
# your code here
print (movies[movies.movie_id.isin(good_movie_ids)].title)

Good movie ids:
Int64Index([814, 1122, 1189, 1201, 1293, 1467, 1500, 1536, 1599, 1653], dtype='int64', name='movie_id')
Best movie titles
813                         Great Day in Harlem, A (1994)
1121                       They Made Me a Criminal (1939)
1188                                   Prefontaine (1997)
1200           Marlene Dietrich: Shadow and Light (1996) 
1292                                      Star Kid (1997)
1466                 Saint of Fort Washington, The (1993)
1499                            Santa with Muscles (1996)
1535                                 Aiqing wansui (1994)
1598                        Someone Else's America (1995)
1652    Entertaining Angels: The Dorothy Day Story (1996)
Name: title, dtype: object


But a movie's average rating is only meaningful if enough people rated it, so the ranking should also take the number of ratings per movie into account.

In [0]:
# get number of ratings per movie
# your code here
# count on the per-movie grouping (grouped_data_1)
how_many_ratings = grouped_data_1.count()
print ("Number of ratings per movie")
# your code here
print (how_many_ratings[average_ratings == maximum_rating])

Number of ratings per movie
movie_id
814     1
1122    1
1189    3
1201    1
1293    3
1467    2
1500    2
1536    1
1599    1
1653    1
Name: rating, dtype: int64
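
One way to act on this observation is to compute the mean and the count together and only rank movies that have at least some minimum number of ratings. The sketch below uses a made-up `toy_ratings` table, and the threshold `min_ratings = 2` is an arbitrary assumption, not a recommended value:

```python
import pandas as pd

# Hypothetical toy data: movie 2 has a perfect score but only one rating
toy_ratings = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 3, 3],
    "rating":   [4, 5, 3, 5, 4, 5],
})

# One row per movie with both statistics as columns
stats = toy_ratings.groupby("movie_id")["rating"].agg(["mean", "count"])

# Assumed threshold: ignore movies with fewer ratings than this
min_ratings = 2
reliable = stats[stats["count"] >= min_ratings]

# Rank the remaining movies by average rating, best first
best = reliable.sort_values("mean", ascending=False)
print(best)
```

On the real data a much larger threshold would probably be appropriate, since the ten "best" movies above were rated by at most three users each.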


# Passing a Function

In [0]:
average_ratings = grouped_data_1.apply(lambda f: f.mean())
average_ratings.head()

movie_id
1    3.878319
2    3.206107
3    3.033333
4    3.550239
5    3.302326
Name: rating, dtype: float64

# Quiz

- get the average rating per user
- advanced: list all occupations and if they are male or female dominant

In [0]:
# get the average rating per user
# your code here


In [0]:
grouped_data = ratings['rating'].groupby(ratings.user_id)
average_ratings = grouped_data.mean()
average_ratings.head()

In [0]:
# list all occupations and if they are male or female dominant
# your code here

In [0]:
grouped_data = users['sex'].groupby(users['occupation'])
male_dominant_occupations = grouped_data.apply(lambda f: 
                                               sum(f == 'M') > sum(f == 'F'))
print (male_dominant_occupations)
print ('\n')

In [0]:
print ('number of male users: ')
print (sum(users['sex'] == 'M'))

print ('number of female users: ')
print (sum(users['sex'] == 'F'))

# Python data scraping

- Why scrape the web?
- - vast source of information
- - automate tasks
- - keep up with sites
- - fun!



- by Justin Blinder
- http://projects.justinblinder.com/We-Read-We-Tweet
“We Read, We Tweet” geographically visualizes the dissemination of New York Times articles through Twitter. Each line connects the location of a tweet to the contextual location of the New York Times article it referenced. The lines are generated in a sequence based on the time in which a tweet occurs. The project explores digital news distribution in a temporal and spatial context through the social space of Twitter.

![alt text](https://camo.githubusercontent.com/aee227c701091e56cfe6479ccc8f6a757c0d6c94/687474703a2f2f7777772e6373632e6e6373752e6564752f666163756c74792f6865616c65792f74776565745f76697a2f666967732f74776565742d76697a2d65782e706e67)

- by Healey and Ramaswamy

- http://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/

Type a keyword into the input field, then click the Query button. Recent tweets that contain your keyword are pulled from Twitter and visualized in the Sentiment tab as circles. Hover your mouse over a tweet or click on it to see its text.

# HTML

- HyperText Markup Language
- standard for creating webpages
- HTML tags
- - have angle brackets
- - typically come in pairs


This is an example of a minimal webpage defined in HTML tags. The root tag is `<html>`, followed by the `<head>` tag. This part of the page typically includes the title of the page and might also have other meta information, like the author or keywords that are important for search engines. The `<body>` tag marks the actual content of the page. You can play around with the `<h2>` tag, trying different header levels. They range from 1 to 6.

In [0]:

htmlString = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <h2> Test </h2>
    <p>Hello world!</p>
  </body>
</html>"""

htmlOutput = HTML(htmlString)
htmlOutput


# Useful Tags

- heading `<h1></h1>` ... `<h6></h6>`
- paragraph `<p></p>`
- line break `<br>`
- link with attribute

`<a href="http://www.example.com/">An example link</a>`
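
To see how tags and attributes look from a program's point of view, here is a small sketch using Python's built-in `html.parser` module. The `TagCollector` class and the `snippet` string are made up for this illustration; the notebook itself uses Beautiful Soup for real pages.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records opening tags, closing tags and the href of every link."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.close_tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self.links.extend(value for name, value in attrs if name == "href")

    def handle_endtag(self, tag):
        self.close_tags.append(tag)


# Hypothetical snippet built from the tags listed above
snippet = ('<h1>Title</h1><p>Hello<br>world</p>'
           '<a href="http://www.example.com/">An example link</a>')

collector = TagCollector()
collector.feed(snippet)

print(collector.open_tags)
print(collector.close_tags)
print(collector.links)
```

The `br` tag shows up among the opening tags but not among the closing ones, which is what "typically come in pairs" means: most tags are paired, a few are not.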

# Scraping with Python

- different useful libraries:
- - urllib
- - beautifulsoup
- - pattern
- - soupy
- - LXML

In [0]:
import requests as req

In [0]:
url = 'http://www.crummy.com/software/BeautifulSoup'
source = req.get(url)
print (source.status_code) # To check that a request is successful, use source.raise_for_status() or check that source.status_code is what you expect.

200


In [0]:
print(source.headers)

{'Date': 'Sat, 26 Jan 2019 15:26:03 GMT', 'Server': 'Apache/2.4.18 (Ubuntu) OpenSSL/1.0.2g mod_wsgi/4.3.0 Python/2.7.12', 'Last-Modified': 'Sat, 26 Jan 2019 15:00:01 GMT', 'ETag': '"2468-5805db2a6efb1-gzip"', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '3970', 'Keep-Alive': 'timeout=15, max=99', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}


In [0]:
print(source.headers["Content-Type"])

text/html; charset=UTF-8


This code issues an HTTP GET request to load the content of the page and returns it in a response object (here called `source`). The `status_code` field contains the code 200, which indicates a successful request, and the `headers` field contains meta-information about the page (in this case you can see, for instance, that the page is served by an Apache server and sent gzip-compressed). If you want to see the actual content of the page, you can use the `source.content` (raw bytes) or `source.text` (decoded string) fields, as below.

In [0]:
print(source.text[:480])

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
<link rev="made" href="mailto:leonardr@segfault.org">
<link rel="stylesheet" type="text/css" href="/nb/themes/Default/nb.css">
<meta name="Description" content="Beautiful Soup: a library designed for screen-s


You have probably seen URLs like this before:

https://www.google.com/search?q=python+download+url+content&source=chrome
The https://www.google.com/search string is the URL, and everything after the ? is a list of parameters; each parameter has the form "parameter=value", and parameters are separated by ampersands (&). If you've looked at URLs before, you've noticed that a lot of content needs to be encoded in these parameters, such as spaces being replaced with the code "%20" (the Google URL above also accepts the "+" character, but "%20" is the actual encoding of a space). Fortunately, requests handles all of this for you. You can simply pass all the parameters as a Python dictionary.

In [0]:
# "q" is the search-term parameter used in the URL shown above
params = {"q": "python download url content", "source":"chrome"}
source2 = req.get("http://www.google.com/search", params=params)
print(source2.status_code)

200
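
To make the encoding step visible without sending any request, the standard library's `urllib.parse` can build the same kind of query string. This is only a sketch of what a library like requests does for you; the variable names are made up for illustration:

```python
from urllib.parse import urlencode, quote

# Same parameters as in the example URL above
example_params = {"q": "python download url content", "source": "chrome"}

# Default form encoding: spaces become "+"
plus_style = urlencode(example_params)

# Percent encoding: spaces become "%20"
percent_style = urlencode(example_params, quote_via=quote)

full_url = "https://www.google.com/search?" + plus_style

print(plus_style)
print(percent_style)
print(full_url)
```

Both encodings describe the same parameters; which one a server expects can vary, which is one more reason to let the library handle it.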


Besides the HTTP GET command, there are other common HTTP commands (POST, PUT, DELETE) which can also be called by the corresponding function in the library.

Quiz :

- Is the word 'Alice' mentioned on the beautiful soup homepage?
- How often does the word 'Soup' occur on the site?
- - hint: use .count() 
- At what index does the substring 'alien video games' occur?
- - hint: use .find() 
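
The string methods from the hints behave as follows on a short made-up sentence (this is not the real page text, so the quiz answers are still yours to find):

```python
# Hypothetical sample text, not the Beautiful Soup homepage
text = "Beautiful Soup parses soup. Soup is tasty."

# Membership test: is a word mentioned at all?
has_alice = "Alice" in text

# .count() is case-sensitive and counts non-overlapping occurrences
soup_count = text.count("Soup")

# .find() returns the index of the first match, or -1 if there is none
first_index = text.find("parses")
missing_index = text.find("alien")

print(has_alice, soup_count, first_index, missing_index)
```

On the notebook's data the same calls would be applied to `source.text`.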


In [0]:
re.findall(r"Soup", source.text)

In [0]:
soup = re.search(r"Soup", source.text)
print(soup)

In [0]:
## get bs4 object
# name the parser explicitly to avoid a warning and get consistent results
soup = bs4.BeautifulSoup(source.text, "html.parser")

In [0]:
## compare the two print statements
print (soup)
#print(soup.prettify())

In [0]:
## show how to find all a tags
soup.findAll('a')

## ***Why does this not work? ***
#soup.findAll('Soup')

In [0]:
# Some other examples

In [0]:
## get attribute value from an element:
## find tag: this only returns the first occurrence, not all tags in the string
first_tag = soup.find('a')
print(first_tag)


In [0]:
## get attribute `href`
print(first_tag.get('href'))


In [0]:

## get all links in the page
link_list = [l.get('href') for l in soup.findAll('a')]
link_list

# or
# link_list = []
# for l in soup.findAll('a'):
#     link_list.append(l.get('href'))
# link_list

In [0]:
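# 'Soup' is text, not a tag name or an attribute, so findAll('Soup') finds nothing.
# To find every occurrence, search the text nodes inside the tags instead.

soup_strings = soup.find_all(string=re.compile("Soup"))
soup_strings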

### RESTful APIs

While parsing data in HTML (the format returned by these web queries) is sometimes a necessity, as we saw above, HTML is meant as a format for displaying pages visually, not as the most efficient manner for encoding data.  Fortunately, a fair number of web-based data services you will use in practice employ something called REST (Representational State Transfer, but no one uses this term) APIs.  We won't go into detail about REST APIs, but there are a few main features that are important for our purposes:

1. You call REST APIs using standard HTTP commands: GET, POST, DELETE, PUT.  You will probably see GET and POST used most frequently.
2. REST servers don't store state.  This means that each time you issue a request, you need to include all relevant information like your account key, etc.
3. REST calls will usually return information in a nice format, typically JSON (more on this later).  The `requests` library can parse it for you via `response.json()`, returning a Python dictionary with the relevant data.

Let's see how to issue a REST request using the same method as before.  Here we'll query my GitHub account to get information.  More info about GitHub's REST API is available at their [Developer Site](https://developer.github.com/v3/).

In [0]:
# Get your own at https://github.com/settings/tokens/new
token = "" 
response = req.get("https://api.github.com/user", params={"access_token":token})

#print(response.status_code)
print(response.headers["Content-Type"])
print(response.json().keys())

application/json; charset=utf-8
dict_keys(['message', 'documentation_url'])


The token identifies your account (the one originally used here was linked to my account and has since been deleted; you can get your own at https://github.com/settings/tokens/new). Because this is a REST API there is no "login" procedure: you simply include the token with each call to identify yourself. The call here is a standard HTTP request: it requests the URL https://api.github.com/user, passing our token as the parameter access_token. The response looks similar to our earlier responses, except that the "Content-Type" in the HTTP header is "application/json". In these cases the requests library has a convenient function, response.json(), which converts the returned data into a Python dictionary (I'm just showing the keys of the dictionary here).
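
As far as I know, GitHub no longer accepts tokens passed in the query string and expects them in an `Authorization` header instead, so the call above may return only an error `message` even with a valid token. The sketch below builds such a request with the standard library without sending it; `example_token` is a placeholder, and the exact header scheme should be checked against GitHub's current documentation:

```python
from urllib.request import Request

# Placeholder value, not a real token
example_token = "YOUR_TOKEN_HERE"

# Build (but do not send) a request that carries the token in a header
request = Request(
    "https://api.github.com/user",
    headers={"Authorization": "token " + example_token},
)

print(request.full_url)
print(request.get_header("Authorization"))
```

With `requests`, the same dictionary can be passed as the `headers=` argument of `req.get`.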

### Authentication

Most APIs will use an authentication procedure that is more involved than this example above.  The standard here for a while was called "Basic Authentication", and can be used via the `requests` library by simply passing the login and password as the `auth` argument to the relevant calls, as below. 

In [0]:
response = req.get("https://api.github.com/user", auth=("kennydukor@gmail.com", "github_password"))
print(response.status_code)
print(response.headers["Content-Type"])
print(response.json())
print(response.json().keys())
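
Under the hood, Basic Authentication just sends the login and password in an `Authorization` header, joined by a colon and Base64-encoded. A minimal sketch of that encoding with made-up credentials (the values are placeholders, and nothing is sent over the network):

```python
import base64

# Placeholder credentials for illustration only
login, password = "user@example.com", "not-a-real-password"

# Join with a colon, then Base64-encode the bytes
credentials = f"{login}:{password}".encode("utf-8")
header_value = "Basic " + base64.b64encode(credentials).decode("ascii")

# Base64 is an encoding, not encryption: anyone can reverse it
decoded = base64.b64decode(header_value.split(" ", 1)[1]).decode("utf-8")

print(header_value)
print(decoded)
```

Because the header is trivially reversible, Basic Authentication should only be used over HTTPS.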

### JSON data

Although originally built as a data format specific to the Javascript language, JSON (Javascript Object Notation) is another extremely common way to share data.  We've already seen it with the GitHub API example above, but very briefly, JSON allows for storing a few different data types:

- Numbers: e.g. `1.0`, either integers or floating point (Javascript treats both as floating point; Python's `json` module parses `1` as an `int` and `1.0` as a `float`)
- Booleans: `true` or `false`, plus the special value `null`
- Strings: `"string"` characters enclosed in double quotes (the `"` character then needs to be escaped as `\"`)
- Arrays (lists): `[item1, item2, item3]` list of items, where item is any of the described data types
- Objects (dictionaries): `{"key1":item1, "key2":item2}`, where the keys are strings and item is again any data type

Note that lists and dictionaries can be nested within each other, so that, for instance

    {"key1":[1.0, 2.0, {"key2":"test"}], "key3":false}

would be a valid JSON object.
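
As a quick check, the object above can be parsed with Python's standard `json` module to see which Python types the JSON types map to (the variable names are made up for this sketch):

```python
import json

# The JSON object from the text, as a Python string
text = '{"key1":[1.0, 2.0, {"key2":"test"}], "key3":false}'

parsed = json.loads(text)

# JSON objects become dicts, arrays become lists, false becomes False
outer_type = type(parsed).__name__
list_type = type(parsed["key1"]).__name__
nested_value = parsed["key1"][2]["key2"]

print(outer_type, list_type, nested_value, parsed["key3"])
```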

Let's look at the full JSON returned by the GitHub API above:

In [0]:
print(response.content)

We have already seen that we can use the `response.json()` call to convert this to a Python dictionary, but more common is to use the `json` library in the Python standard library: documentation page [here](https://docs.python.org/3/library/json.html).  To convert our GitHub response to a Python dictionary manually, we can use the `json.loads()` (load string) function like the following.

In [0]:
import json
print(json.loads(response.content))

If you have the data as a file (i.e., as a file descriptor opened with the Python `open()` command), you can use the `json.load()` function instead.  To convert a Python dictionary to a JSON object, you'll use the `json.dumps()` command, such as the following.

In [0]:
data = {"a":[1,2,3,{"b":2.1}], 'c':4}
json.dumps(data)

Notice that Python code, unlike JSON, can include single quotes to denote strings, but converting it to JSON will replace it with double quotes.  Finally, if you try to dump an object that includes types not representable by JSON, it will throw an error.

In [0]:
json.dumps(response)
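
The file-based functions and the error case can be sketched together as follows. The file name `record.json` and the set used to trigger the error are assumptions chosen for illustration:

```python
import json
import os
import tempfile

record = {"a": [1, 2, 3, {"b": 2.1}], "c": 4}

# json.dump / json.load work on open file objects
path = os.path.join(tempfile.mkdtemp(), "record.json")
with open(path, "w") as f:
    json.dump(record, f)
with open(path) as f:
    loaded = json.load(f)

# Types without a JSON equivalent (here a set) raise TypeError
try:
    json.dumps({"values": {1, 2, 3}})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(loaded == record)
print(error_message)
```

Catching `TypeError` like this is mainly useful while exploring; in real code it is usually better to convert such values (for example a set to a list) before dumping.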

So, in summary:

In [0]:
a = {'a': 1, 'b':2}
s = json.dumps(a)
a2 = json.loads(s)

## a is a dictionary
print(a)
## vs s is a string containing a in JSON encoding
print(s)
## reading it back gives an equal dictionary (keys are ordinary str in Python 3)
print(a2)

# Regular Expressions

CMU data science course (data collection and scraping) http://www.datasciencecourse.org/lectures/

Harvard datascience course (Web scraping) http://cs109.github.io/2015/pages/videos.html