Python Workshop 6
=========

September 27, 2017


# Welcome Back

We've been on hiatus for a few months and most of us can barely
remember any Python, or even whether it was Python, Go, JavaScript,
Julia, Scala, or something else we were investigating.

This might be a good time to update our installation with the latest
versions of our favorite packages and maybe even install some new ones.
Recall that we used a Python environment manager named **conda**
to manage our Python environment.  The name comes from the full-blown
installation called [Anaconda](https://docs.continuum.io/), originally
developed by Continuum Analytics.  **conda** is a minimal
installation that allows us to customize the packages we deploy.

## Bluecoat

Don't forget that our friend, Bluecoat, hasn't gone anywhere.
In order for our **conda** environment manager to reach the
package repository, we have to perform a few work-around tasks.

### Authenticate Through Firewall

Enter the following URL into your web browser.

<https://repo.continuum.io/pkgs/>

Authenticate through the firewall if you have to.

### SSL Certificate Verification

Because Bluecoat subverts the server certificate returned from
the Conda repository servers, we must tell the conda client to
skip the certificate verification step.

```
conda config --set ssl_verify false
```

You can check whether this was already done with 

```
conda config --show | grep ssl
```

If your machine doesn't have grep (shame on you), you can
manually scroll through the output and verify.  It should
be near the bottom.

### Update Packages

Using

```
conda update <package name>
```

update the following packages.

* `conda` - tell conda to update itself
* `python` - the Python interpreter
* `requests` - high level HTTP package
* `ipython` - the `ipython` shell
* `pandas` - which in turn updates `NumPy` and others
* `jupyter` - interactive web notebooks
* `scipy` - utilities for probability 
* `matplotlib` - plotting
* `seaborn` - more plotting
* `statsmodels` - statistical modeling
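
Once the updates finish, it can be handy to confirm what actually got installed. The following is a minimal sketch, not part of the original workshop, that uses the standard library's `importlib.metadata` (Python 3.8 or later is assumed). The helper name `installed_version` is made up for this illustration.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Library names from the list above (conda and the interpreter are left out
# because they are not necessarily registered as Python distributions).
packages = ['requests', 'ipython', 'pandas', 'jupyter', 'scipy',
            'matplotlib', 'seaborn', 'statsmodels']

report = {name: installed_version(name) for name in packages}
for name, version in report.items():
    print('{:12} {}'.format(name, version or 'not installed'))
```

Anything reported as `not installed` can be added with `conda install <package name>`.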


# Data Frame Review

Last spring we sourced most of our datasets from CSV files using
`pd.read_csv`.  Let's use a related function, `pd.read_json` to
load a dataset from the 
[Los Angeles County Open Data](https://data.lacounty.gov/)
service.  There are many interesting data sets here.  Since most
of us come from the justice community, we'll work with a Sheriff
dataset concerning crimes reported within the last 12 months.

<https://dev.socrata.com/foundry/data.lacounty.gov/uvdj-ch3p>

Navigate to this page in your browser and you'll find all kinds
of good information on invoking the API and descriptions of the
fields it returns.  At the bottom are small samples of invoking
the API from a variety of languages and libraries.  We'll start
with the Python Pandas sample and invoke it with the `$limit`
parameter to limit the number of rows we get back (the default is
a maximum of 1,000 rows).
The URL of the service is

```
https://data.lacounty.gov/resource/uvdj-ch3p.json
```

In [78]:
import pandas as pd
df = pd.read_json('https://data.lacounty.gov/resource/uvdj-ch3p.json?$limit=20')
len(df)

20
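
The query string does not have to be typed by hand. As a small sketch (the helper name `build_query_url` is hypothetical, and `$offset` is included only as an assumed second parameter to show how several are joined), the standard library can assemble it:

```python
from urllib.parse import urlencode

BASE_URL = 'https://data.lacounty.gov/resource/uvdj-ch3p.json'

def build_query_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    # safe='$' keeps the leading dollar sign readable instead of encoding it as %24.
    return base_url + '?' + urlencode(params, safe='$')

url = build_query_url(BASE_URL, {'$limit': 20})
print(url)

# Assumption: a second parameter such as $offset is joined with '&'.
paged_url = build_query_url(BASE_URL, {'$limit': 20, '$offset': 40})
print(paged_url)
```

The resulting string can be passed to `pd.read_json` exactly like the hand-written URL.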

We now have a data frame in the `df` variable.
We'll have much more to say about invoking APIs over a network
later.  For now we'll use this
opportunity to review basic operations on a data frame.

The `dtypes` attribute tells us the type for each column.

In [79]:
df.dtypes

city                             object
crime_category_description       object
crime_category_number             int64
crime_date                       object
crime_identifier                  int64
crime_year                        int64
gang_related                     object
geo_location                     object
geo_location_address             object
geo_location_city                object
geo_location_state               object
geo_location_zip                float64
latitude                        float64
longitude                       float64
reporting_district                int64
state                            object
station_identifier               object
station_name                     object
statistical_code                  int64
statistical_code_description     object
street                           object
victim_count                      int64
zip                             float64
dtype: object

A dtype of `object` usually means a string.
Most of these seem acceptable.  A few that
stand out are

* `crime_date` - should be a datetime.
* `geo_location_zip` - should be a string.

We can convert both within the `read_json` function call.
We'll increase the `$limit=20` parameter so that
we receive the default maximum number of rows,
which for this API is 1,000.

In [97]:
df = pd.read_json('https://data.lacounty.gov/resource/uvdj-ch3p.json?$limit=1000',
                 dtype={'geo_location_zip': 'U'}, convert_dates=['crime_date'])
df.dtypes

city                                    object
crime_category_description              object
crime_category_number                    int64
crime_date                      datetime64[ns]
crime_identifier                         int64
crime_year                               int64
gang_related                            object
geo_location                            object
geo_location_address                    object
geo_location_city                       object
geo_location_state                      object
geo_location_zip                        object
latitude                               float64
longitude                              float64
reporting_district                       int64
state                                   object
station_identifier                      object
station_name                            object
statistical_code                         int64
statistical_code_description            object
street                                  object
victim_count 

We called the API like last time, but with a few extra parameters.

* `dtype` - a dictionary where each key is a column name and each
  value is a type.  The `U` here is the NumPy type code for a
  Unicode string.
* `convert_dates` - This plays the same role as `parse_dates` in
  `pd.read_csv`.  We specify which columns are to be interpreted
  as dates or times.
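
To experiment with these two parameters without touching the network, here is a rough offline sketch. The two records are made up to resemble the API's JSON (note that the service sends every value as a string), and `str` is used in place of `'U'` on the assumption that either requests a string column.

```python
import io
import pandas as pd

# Two made-up records shaped like the API's JSON.
sample_json = '''[
  {"crime_date": "2017-09-13T19:09:00.000", "geo_location_zip": "90059", "victim_count": "1"},
  {"crime_date": "2017-08-21T02:08:00.000", "geo_location_zip": "90703", "victim_count": "1"}
]'''

# Wrapping the text in StringIO lets read_json treat it like a file.
sample = pd.read_json(io.StringIO(sample_json),
                      dtype={'geo_location_zip': str},
                      convert_dates=['crime_date'])
print(sample.dtypes)
```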

Let's drop some columns that we won't use in this session.
The `drop` function accepts the following parameters.

* a list of things to drop.
* `axis` - defaults to `0` (the first parameter refers to
  which rows to drop).  But we want to drop columns
  (`axis=1`).  This interprets the list in the first parameter
  as column names.
* `inplace` - defaults to `False`, which means a new copy of the
  data frame is returned (the original one is not changed).
  In this case, we want to change the data frame in place.
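
The difference that `inplace` makes is easiest to see on a tiny frame. This is only an illustrative sketch; the frame `demo` and its values are invented.

```python
import pandas as pd

demo = pd.DataFrame({'city': ['LYNWOOD', 'CERRITOS'],
                     'street': ['A ST', 'B AVE'],
                     'victim_count': [1, 1]})

# Default (inplace=False): a new frame comes back and `demo` is untouched.
trimmed = demo.drop(['street'], axis=1)
columns_before = list(demo.columns)
print(columns_before)           # 'street' is still present in demo
print(list(trimmed.columns))

# inplace=True: `demo` itself is modified and the call returns None.
result = demo.drop(['street'], axis=1, inplace=True)
print(result, list(demo.columns))
```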

In [98]:
df.drop(['crime_year', 'geo_location', 'geo_location_address', 
         'latitude', 'longitude', 'street', 'city', 'state', 'zip'],
       axis=1, inplace=True)
df.head(3)

,crime_category_description,crime_category_number,crime_date,crime_identifier,gang_related,geo_location_city,geo_location_state,geo_location_zip,reporting_district,station_identifier,station_name,statistical_code,statistical_code_description,victim_count
0,AGGRAVATED ASSAULT,4,2017-09-13 19:09:00,18331100,N,LOS ANGELES,CA,90059.0,2137,CA01900V3,CENTURY,51,"ASSAULT, AGGRAVATED: ADW - GUN",1
1,BURGLARY,5,2017-08-21 02:08:00,18306371,N,CERRITOS,CA,90703.0,2314,CA01900R7,CERRITOS,71,"BURGLARY, OTHER STRUCTURE: Night, Entry By Force",1
2,VEHICLE / BOATING LAWS,23,2017-07-20 17:07:00,18274453,N,LYNWOOD,CA,,2117,CA01900V3,CENTURY,255,VEHICLE AND BOATING LAWS: Misdemeanor,1


Rename the rest of the columns.
Use original names as a guide.

In [99]:
df.columns

Index(['crime_category_description', 'crime_category_number', 'crime_date',
       'crime_identifier', 'gang_related', 'geo_location_city',
       'geo_location_state', 'geo_location_zip', 'reporting_district',
       'station_identifier', 'station_name', 'statistical_code',
       'statistical_code_description', 'victim_count'],
      dtype='object')

In [100]:
df.columns = ['cat_desc', 'cat_code', 'date', 'id', 'gang', 'city',
             'state','zip', 'reporting_district', 'station_id',
             'station_name', 'stat_code', 'stat_desc', 'victim_count']
df.head(3)

,cat_desc,cat_code,date,id,gang,city,state,zip,reporting_district,station_id,station_name,stat_code,stat_desc,victim_count
0,AGGRAVATED ASSAULT,4,2017-09-13 19:09:00,18331100,N,LOS ANGELES,CA,90059.0,2137,CA01900V3,CENTURY,51,"ASSAULT, AGGRAVATED: ADW - GUN",1
1,BURGLARY,5,2017-08-21 02:08:00,18306371,N,CERRITOS,CA,90703.0,2314,CA01900R7,CERRITOS,71,"BURGLARY, OTHER STRUCTURE: Night, Entry By Force",1
2,VEHICLE / BOATING LAWS,23,2017-07-20 17:07:00,18274453,N,LYNWOOD,CA,,2117,CA01900V3,CENTURY,255,VEHICLE AND BOATING LAWS: Misdemeanor,1
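
Assigning a whole list to `df.columns` depends on getting the order exactly right. An alternative, shown here only as a sketch on an empty made-up frame, is `rename` with an old-to-new mapping, which leaves unmentioned columns alone.

```python
import pandas as pd

# An empty frame with a few of the original column names.
demo = pd.DataFrame(columns=['crime_category_description',
                             'crime_category_number',
                             'crime_date',
                             'victim_count'])

# Each key is an existing column name, each value its replacement.
renamed = demo.rename(columns={'crime_category_description': 'cat_desc',
                               'crime_category_number': 'cat_code',
                               'crime_date': 'date'})
print(list(renamed.columns))
```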


The `value_counts` function on a series gives us a quick table of
the frequency of values.  This is useful for categorical values.

In [101]:
df['cat_code'].value_counts()

6     172
23    152
16    107
13     84
7      66
5      62
24     59
4      53
10     46
3      27
14     25
29     24
30     23
9      16
11     14
12      9
17      8
18      8
1       8
15      7
20      6
22      6
2       6
19      4
8       4
28      2
25      1
26      1
Name: cat_code, dtype: int64

In [102]:
df['gang'].value_counts()

N    981
Y     19
Name: gang, dtype: int64

In [103]:
df['stat_code'].value_counts()

255    113
185     61
91      55
144     45
340     40
146     35
261     35
383     34
250     33
389     32
263     20
112     19
117     17
51      17
384     16
151     16
183     16
399     16
53      15
181     13
89      12
71      12
47      11
103     10
325     10
184      8
11       8
339      8
67       7
391      7
      ... 
277      1
270      1
92       1
197      1
155      1
41       1
202      1
221      1
31       1
27       1
161      1
290      1
285      1
149      1
101      1
106      1
86       1
107      1
328      1
82       1
315      1
119      1
121      1
350      1
68       1
134      1
138      1
314      1
94       1
133      1
Name: stat_code, Length: 122, dtype: int64

The `isnull()` function returns a data frame of booleans with
the same shape as the source data frame.  We can invoke `sum()`
to return the total null values for each column.

In [104]:
df.isnull().sum()

cat_desc               0
cat_code               0
date                   0
id                     0
gang                   0
city                  56
state                 56
zip                    0
reporting_district     0
station_id             0
station_name           0
stat_code              0
stat_desc              0
victim_count           0
dtype: int64
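
To see the two steps separately, here is a small sketch on invented data: first the boolean mask, then the column sums (each `True` counts as 1).

```python
import pandas as pd

demo = pd.DataFrame({'city': ['LYNWOOD', None, 'CERRITOS'],
                     'zip': ['90262', None, None],
                     'victim_count': [1, 2, 1]})

mask = demo.isnull()        # same shape as demo, True where a value is missing
print(mask)

null_counts = mask.sum()    # sums each column of booleans
print(null_counts)
```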

Without a specified index, pandas just assigns an integer sequence
starting with zero.  If we wish to make our `id` column the index,
we use the `set_index` function.

In [105]:
df.set_index('id', inplace=True)

In [106]:
df.head()

id,cat_desc,cat_code,date,gang,city,state,zip,reporting_district,station_id,station_name,stat_code,stat_desc,victim_count
18331100,AGGRAVATED ASSAULT,4,2017-09-13 19:09:00,N,LOS ANGELES,CA,90059.0,2137,CA01900V3,CENTURY,51,"ASSAULT, AGGRAVATED: ADW - GUN",1
18306371,BURGLARY,5,2017-08-21 02:08:00,N,CERRITOS,CA,90703.0,2314,CA01900R7,CERRITOS,71,"BURGLARY, OTHER STRUCTURE: Night, Entry By Force",1
18274453,VEHICLE / BOATING LAWS,23,2017-07-20 17:07:00,N,LYNWOOD,CA,,2117,CA01900V3,CENTURY,255,VEHICLE AND BOATING LAWS: Misdemeanor,1
18245488,VEHICLE / BOATING LAWS,23,2017-06-21 04:06:00,N,PICO RIVERA,CA,,1512,CA0190015,PICO RIVERA,255,VEHICLE AND BOATING LAWS: Misdemeanor,1
18332163,LARCENY THEFT,6,2017-09-14 17:09:08,N,,,,1329,CA0190013,LAKEWOOD,83,"GRAND THEFT: Shoplifting (From Dept Store, Mkt...",1


Sorting is done either by index or by value.

In [107]:
df.sort_index().head()

id,cat_desc,cat_code,date,gang,city,state,zip,reporting_district,station_id,station_name,stat_code,stat_desc,victim_count
17934869,LARCENY THEFT,6,2016-10-14 11:10:00,N,INDUSTRY,CA,,1415,CA0190014,INDUSTRY,383,"THEFT, PETTY: Shoplifting (From Dept Store, Mk...",1
17941914,WEAPON LAWS,14,2016-10-21 12:10:00,N,,,,1747,CA0190017,LOMITA,151,WEAPON LAWS: Felony,1
17943051,NON-AGGRAVATED ASSAULTS,13,2016-10-22 19:10:00,N,CANYON COUNTRY,CA,,610,CA0190006,SANTA CLARITA VALLEY,146,"ASSAULT, NON-AGGRAVATED: DOMESTIC VIOLENCE",1
17943941,DRUNK DRIVING VEHICLE / BOAT,22,2016-10-24 04:10:46,N,,,,2612,CA01900W9,PALMDALE,242,DRUNK DRIVING - VEHICLE/BOAT: Alc/Drugs,1
18059054,NON-AGGRAVATED ASSAULTS,13,2017-02-15 17:02:00,N,LOS ANGELES,CA,90044.0,372,CA0190003,SOUTH LOS ANGELES,149,"ASSAULT, NON-AGG: Child Assault",1


In [108]:
df.sort_values(by=['cat_code', 'date'], ascending=[True, False]).head()

id,cat_desc,cat_code,date,gang,city,state,zip,reporting_district,station_id,station_name,stat_code,stat_desc,victim_count
18304157,CRIMINAL HOMICIDE,1,2017-08-19 00:08:00,Y,LOS ANGELES,CA,90001.0,2172,CA01900V3,CENTURY,11,CRIMINAL HOMICIDE: Murder,1
18302580,CRIMINAL HOMICIDE,1,2017-08-17 14:08:00,Y,LOS ANGELES,CA,90001.0,2175,CA01900V3,CENTURY,11,CRIMINAL HOMICIDE: Murder,1
18294920,CRIMINAL HOMICIDE,1,2017-08-09 21:08:00,Y,COMPTON,CA,,2830,CA0190042,COMPTON,11,CRIMINAL HOMICIDE: Murder,1
18287795,CRIMINAL HOMICIDE,1,2017-08-03 00:08:00,N,LANCASTER,CA,93535.0,1132,CA0190024,LANCASTER,11,CRIMINAL HOMICIDE: Murder,1
18282808,CRIMINAL HOMICIDE,1,2017-07-28 20:07:00,N,CERRITOS,CA,,2310,CA01900R7,CERRITOS,11,CRIMINAL HOMICIDE: Murder,1


In the last example, we sorted on ascending `cat_code` and
descending `date`.

We can create a time series data frame by indexing on the
`date` column instead of the `id` column.  This should be
done in two steps:

1. **reset_index** - Send `id` back to a regular column.
2. **set_index** - Set `date` as the new index

If we skip the first step, we'll lose the `id` column.
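
The following sketch, on a made-up two-row frame, shows what happens with and without the `reset_index` step.

```python
import pandas as pd

demo = pd.DataFrame({'id': [101, 102],
                     'date': pd.to_datetime(['2017-09-13', '2017-08-21']),
                     'cat_code': [4, 5]}).set_index('id')

# Skipping reset_index: the old index (id) is simply discarded.
skipped = demo.set_index('date')
print(list(skipped.columns))

# Resetting first turns id back into a column, so it survives.
kept = demo.reset_index().set_index('date')
print(list(kept.columns))
```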

In [109]:
ts = df.reset_index().set_index('date')
ts.index.is_unique

False

In [123]:
ts.head(3)

date,id,cat_desc,cat_code,gang,city,state,zip,reporting_district,station_id,station_name,stat_code,stat_desc,victim_count
2017-09-13 19:09:00,18331100,AGGRAVATED ASSAULT,4,N,LOS ANGELES,CA,90059.0,2137,CA01900V3,CENTURY,51,"ASSAULT, AGGRAVATED: ADW - GUN",1
2017-08-21 02:08:00,18306371,BURGLARY,5,N,CERRITOS,CA,90703.0,2314,CA01900R7,CERRITOS,71,"BURGLARY, OTHER STRUCTURE: Night, Entry By Force",1
2017-07-20 17:07:00,18274453,VEHICLE / BOATING LAWS,23,N,LYNWOOD,CA,,2117,CA01900V3,CENTURY,255,VEHICLE AND BOATING LAWS: Misdemeanor,1


Now that we have a time series, we can investigate this dataset
from the perspective of time.  Let's check the earliest and latest
times in this dataset.

In [124]:
(ts.index.min(), ts.index.max())

(Timestamp('2016-10-02 00:10:00'), Timestamp('2017-09-15 04:09:43'))

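We can subset a time series by passing a partial date string,
such as a year or a month, to `.loc` (recent pandas versions no
longer accept a bare `ts['2017-02']` for row selection).

In [125]:
ts.loc['2017-02']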

date,id,cat_desc,cat_code,gang,city,state,zip,reporting_district,station_id,station_name,stat_code,stat_desc,victim_count
2017-02-02 17:02:00,18286145,FELONIES MISCELLANEOUS,29,N,NORWALK,CA,90650,451,CA0190004,NORWALK,325,"FELONIES, MISCELLANEOUS: Perjury",1
2017-02-09 18:02:00,18288870,FELONIES MISCELLANEOUS,29,N,NORWALK,CA,90650,451,CA0190004,NORWALK,325,"FELONIES, MISCELLANEOUS: Perjury",1
2017-02-08 18:02:00,18288871,FELONIES MISCELLANEOUS,29,N,NORWALK,CA,90650,451,CA0190004,NORWALK,325,"FELONIES, MISCELLANEOUS: Perjury",1
2017-02-15 17:02:00,18059054,NON-AGGRAVATED ASSAULTS,13,N,LOS ANGELES,CA,90044,372,CA0190003,SOUTH LOS ANGELES,149,"ASSAULT, NON-AGG: Child Assault",1
2017-02-22 18:02:02,18067626,NARCOTICS,16,N,PERRIS,CA,92570,3668,CA0190036,NARCOTICS BUREAU,184,Felony Possession of a Controlled Substance (e...,1
2017-02-28 01:02:00,18307740,FRAUD AND NSF CHECKS,10,N,LANCASTER,CA,93534,1104,CA0190024,LANCASTER,112,FRAUD: Fraud By False Pretenses,1
2017-02-18 21:02:00,18213539,FELONIES MISCELLANEOUS,29,Y,LOS ANGELES,CA,90059,2138,CA01900V3,CENTURY,339,"FELONIES, MISCELLANEOUS: All Other Felonies",1
2017-02-02 14:02:00,18288865,FELONIES MISCELLANEOUS,29,N,NORWALK,CA,90650,451,CA0190004,NORWALK,325,"FELONIES, MISCELLANEOUS: Perjury",1
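
Partial date strings also work for whole years and for ranges. Here is a minimal sketch on a made-up four-row frame. It uses `.loc` and sorts the index first, on the assumption that range slices are safest on a sorted index.

```python
import pandas as pd

demo = pd.DataFrame(
    {'cat_code': [4, 5, 23, 29]},
    index=pd.to_datetime(['2017-09-13 19:09', '2017-02-21 02:08',
                          '2016-10-20 17:07', '2017-02-02 17:02']))
demo = demo.sort_index()

february = demo.loc['2017-02']              # every row in February 2017
all_2017 = demo.loc['2017']                 # every row in 2017
feb_to_sep = demo.loc['2017-02':'2017-09']  # both end months are included

print(len(february), len(all_2017), len(feb_to_sep))
```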


Let's remind ourselves that we aren't working with the full
dataset.  We just chose the first 1,000 entries that the API
has to offer.  This is OK for now; we're just playing around.

As shown above with `ts.index.is_unique`, this index is
**not** unique.  Let's get an idea of how many of our
incidents occur at the same time.  First, we'll create a
series of `1` using the same index as our time series.
Each entry will represent an `occurrence`.

In [130]:
occurrences = pd.Series(1, index=ts.index)
dup_times = occurrences.groupby(level=0).sum()
dup_times.value_counts().sort_index()

1     458
2      77
3      30
4      22
5       5
6       6
7       6
8       3
9       4
10      1
11      1
12      1
14      1
dtype: int64

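The `dup_times` variable holds the number of incidents recorded
at each distinct timestamp.  The `groupby` function groups by
column values by default, but the `level=0` parameter tells it
to group by the index instead.  The `sum()` is the aggregation
operation for the grouping.  The final `value_counts()` then
tallies those totals: 458 timestamps appear exactly once, 77
appear twice, and so on up to one timestamp shared by 14 incidents.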
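
To make the grouping concrete, here is the same recipe on six invented timestamps, followed by a shortcut (`value_counts` called directly on the index) that should give the same per-timestamp counts. The shortcut is offered as an alternative, not as what the workshop used.

```python
import pandas as pd

stamps = pd.to_datetime(['2017-02-02 17:02', '2017-02-02 17:02',
                         '2017-02-09 18:02', '2017-02-02 17:02',
                         '2017-02-09 18:02', '2017-02-15 17:02'])

occurrences = pd.Series(1, index=stamps)

# One row per distinct timestamp, holding how many times it occurred.
per_time = occurrences.groupby(level=0).sum()
print(per_time)

# How many timestamps occurred once, twice, three times...
histogram = per_time.value_counts().sort_index()
print(histogram)

# Shortcut: count the index values directly.
shortcut = stamps.value_counts().sort_index()
```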

Later we'll dig much deeper into how we can manipulate
time series.

# HTTP Basics

The HTTP protocol runs over TCP and is characterized by

* __text based__ - Headers and contents are text-based.
* __stateless__ - No application state is maintained by the connection.
  In the simplest case, the connection (at the TCP level) is closed
  after the response is returned.

These two characteristics make HTTP simpler to work with than
binary RPC (Remote Procedure Call) protocols.  They also severely limit
what can be done with HTTP, which is why there are now routine "end-arounds"
to both characteristics.  Since we're most interested in invoking REST APIs
to retrieve JSON data, we can continue to think in terms of these simple
characteristics for this workshop.

The following is an example of an HTTP request.

```
1 GET /resources/uvdj-ch3p.json?$limit=3 HTTP/1.1
2 Host: data.lacounty.gov
3 User-Agent: curl/7.54.0
4 Accept: */*
5
```

**Request Notes**

* Line 1 has three fields: (1) verb, (2) URL, (3) protocol version.
  The field separator is a space, which means the values of the
  fields themselves cannot contain spaces.  In this example, the
  verb is **GET**, the URL is `/resources/uvdj-ch3p.json?$limit=3`,
  and the requested protocol version is `HTTP/1.1` (which may or
  may not be honored).
* Lines 2 - 4 are examples of request headers.
* Line 5 is blank.  This is actually important.  This request doesn't
  contain content, but the separation between headers and content is
  always denoted by a blank line (two line breaks in a row; on the
  wire each line break is a carriage return followed by a line feed).
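
The structure described in these notes can be sketched in a few lines of Python. This is a toy parser for illustration only (a real client would lean on a library); it assumes well-formed input and splits the request exactly as the notes describe.

```python
# The request from above, with explicit carriage return + line feed endings.
raw_request = (
    'GET /resources/uvdj-ch3p.json?$limit=3 HTTP/1.1\r\n'
    'Host: data.lacounty.gov\r\n'
    'User-Agent: curl/7.54.0\r\n'
    'Accept: */*\r\n'
    '\r\n'
)

# The blank line separates the head (request line + headers) from the content.
head, _, content = raw_request.partition('\r\n\r\n')

request_line, *header_lines = head.split('\r\n')

# Three space-separated fields: verb, URL, protocol version.
verb, url, version = request_line.split(' ')

# Each header is "Name: value".
headers = dict(line.split(': ', 1) for line in header_lines)

print(verb, url, version)
print(headers)
print(repr(content))
```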

The following is a sample response.

```
1 HTTP/1.1 404 Not Found
2 Server: nginx
3 Date: Mon, 11 Sep 2017 22:13:05 GMT
4 Content-Type: text/html; charset=utf-8
5
6 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
7       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
8 . . .
```

**Response Notes**

* Line 1 has three fields: (1) the protocol chosen by the server
  (which doesn't necessarily honor the protocol requested by the
  client), (2) the HTTP status code, (3) the reason phrase (which
  might contain spaces, but that's OK since it's the last field
  on the line).
* Lines 2 - 4 are response headers.
* Line 5 separates the response headers from the response content.
* Line 6 through the end of the response is the content.
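
Because the last field of the first line may contain spaces, a parser has to stop splitting after the second space. A small sketch using only the standard library:

```python
from http import HTTPStatus

status_line = 'HTTP/1.1 404 Not Found'

# maxsplit=2 keeps any spaces in the third field intact.
version, code, reason = status_line.split(' ', 2)

# The standard library knows the conventional text for each status code.
status = HTTPStatus(int(code))
is_client_error = 400 <= status.value < 500

print(version, status.value, reason, is_client_error)
```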

It's a struggle to stay awake reading about protocol headers.
But understanding their basics can go a long way toward writing
robust API clients.  In the example above, the URL is wrong.
The **status code** `404` was useful in telling us why our
request did not succeed.  The 500 lines of JavaScript and HTML
that were returned as content were **not useful**.  Because our
`Accept` header of `*/*` said we would take anything, the server
responded as if our client were a browser.  If we had told the
server we were only interested in JSON, we could have saved
network bandwidth and memory.

```
GET /resources/uvdj-ch3p.json?$limit=3 HTTP/1.1
Host: data.lacounty.gov
User-Agent: curl/7.54.0
Accept: application/json

HTTP/1.1 404 Not Found
Server: nginx
Date: Mon, 11 Sep 2017 22:10:51 GMT
Content-Type: application/json; charset=utf-8
. . .
```

This response provides us the same information without
several thousand bytes of HTML and JavaScript.  In this
case, the content was empty (I snipped the other response
headers).

## The Requests Package

The Python **Requests** package is documented at

<http://docs.python-requests.org/en/master/>.

Its slogan is *HTTP for humans*.  It offers a friendlier interface than
the standard library's [urllib.request](https://docs.python.org/3.5/library/urllib.request.html).
While `urllib.request` is part of all standard Python distributions,
`requests` is not.  It must be separately installed.  With **conda**
this amounts to

```
conda install requests
```

Generally we try to stick to standard Python packages in this
workshop (standard for Data Science, anyway).  But this package
is actually recommended by the official Python `urllib.request`
documentation as a higher-level HTTP client interface.

Let's start with issuing the last HTTP request above
(intentionally misspelling the URL) so that we receive a
`404` status code.

In [110]:
import requests
r = requests.get('https://data.lacounty.gov/resources/uvdj-ch3p.json?$limit=3')
r.status_code

404

In the snippet above, `r` is the response object.
`r.status_code` returns the status of the invocation.
`r.text` returns the text of the response.  We did not
set the `Accept` header; so we probably got a bunch of
JavaScript and HTML detailing our `404`.

In [111]:
"The length of the response was {:,d} {} characters.".format(len(r.text), r.encoding)

'The length of the response was 109,741 utf-8 characters.'

In [112]:
r.text[:300]

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n<!--[if IE 8]>\n  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"\n        xmlns:v="urn:schemas-microsoft-com:vml"\n        xmlns:og="http://opengraphprotocol.org/sc'

What a mess!  Let's send the `Accept` header this time.

In [113]:
r = requests.get('https://data.lacounty.gov/resources/uvdj-ch3p.json?$limit=3',
                headers={'Accept': 'application/json'})
r.status_code

404

In [114]:
"The length of this response content was {:,d} {} characters.".format(len(r.text), r.encoding)

'The length of this response content was 0 utf-8 characters.'

That's much better.  The `404` was all we needed to know to address
this problem.  Now let's fix the URL and get something we can use.

In [115]:
r = requests.get('https://data.lacounty.gov/resource/uvdj-ch3p.json?$limit=3',
                headers={'Accept': 'application/json'})
r.status_code

200

In [116]:
r.text[:200]

'[{"city":"LOS ANGELES","crime_category_description":"AGGRAVATED ASSAULT","crime_category_number":"4","crime_date":"2017-09-13T19:09:00.000","crime_identifier":"18331100","crime_year":"2017","gang_rela'

As we can see, this is precisely the JSON we asked for.
A more robust way to check this is with the content type
header.

In [117]:
r.headers['content-type'], r.headers['Content-Type']

('application/json; charset=UTF-8', 'application/json; charset=UTF-8')
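
Both spellings work because the headers are held in a case-insensitive mapping. As an offline sketch, assuming `requests.structures.CaseInsensitiveDict` is the class behind `r.headers`, the behaviour can be compared with a plain dictionary:

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({'Content-Type': 'application/json; charset=UTF-8'})

# Lookups ignore case...
lower = headers['content-type']
upper = headers['CONTENT-TYPE']
print(lower == upper)

# ...but a plain dict built from the same data does not.
plain = dict(headers)
print('content-type' in plain, 'Content-Type' in plain)
```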

Note that the header names are case-insensitive here.
Python dictionaries are generally case-sensitive, but the **requests**
package makes a special allowance for header names, since HTTP itself
treats them as case-insensitive.
Let's check all the response headers.

In [118]:
for h,v in r.headers.items():
    print("{:30} {}".format(h, v))

Server                         nginx
Date                           Mon, 18 Sep 2017 17:24:11 GMT
Content-Type                   application/json; charset=UTF-8
Transfer-Encoding              chunked
Connection                     keep-alive
X-Socrata-RequestId            7gtxycay8c6fglvvky50ojtns
Access-Control-Allow-Origin    *
ETag                           W/"YWxwaGEuNTE2MjZfM182NzQ0OGlZLWgtaXFySkR1NEVVejJLUlpZRnR3cVm_lF0tHjk9QpfBvBFEi50gGj1ssg-gzip"
X-SODA2-Fields                 ["city","crime_category_description","crime_category_number","crime_date","crime_identifier","crime_year","gang_related","geo_location","geo_location_address","geo_location_city","geo_location_state","geo_location_zip","latitude","longitude","reporting_district","state","station_identifier","station_name","statistical_code","statistical_code_description","street","victim_count","zip"]
X-SODA2-Types                  ["text","text","number","floating_timestamp","number","text","text","point","text","text","

We can see that Socrata (the vendor for the LA County Open Data site)
provides some extra goodies in the response headers.

Since we now know that we got JSON back, let's parse it.

In [119]:
crimes = r.json()
len(crimes)

3

In [120]:
crimes[0]

{'city': 'LOS ANGELES',
 'crime_category_description': 'AGGRAVATED ASSAULT',
 'crime_category_number': '4',
 'crime_date': '2017-09-13T19:09:00.000',
 'crime_identifier': '18331100',
 'crime_year': '2017',
 'gang_related': 'N',
 'geo_location': {'coordinates': [-118.23011872031, 33.91441689608],
  'type': 'Point'},
 'geo_location_address': '13100 WILLOWBROOK AVE',
 'geo_location_city': 'LOS ANGELES',
 'geo_location_state': 'CA',
 'geo_location_zip': '90059',
 'latitude': '33.91441689608002871242',
 'longitude': '-118.23011872031021234397',
 'reporting_district': '2137',
 'state': 'CA',
 'station_identifier': 'CA01900V3',
 'station_name': 'CENTURY',
 'statistical_code': '51',
 'statistical_code_description': 'ASSAULT, AGGRAVATED: ADW - GUN',
 'street': '13100 WILLOWBROOK AVE',
 'victim_count': '1',
 'zip': '90059'}

If we wanted to start using pandas at this point,
we could have pandas parse the JSON.

In [121]:
df = pd.read_json(r.text)
df.shape

(3, 23)

Or we could let the response object parse it and feed pandas
the resulting list of dictionaries.

In [122]:
df = pd.DataFrame(r.json())
df.shape

(3, 23)
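
One thing to watch with the second approach: as the `crimes[0]` output shows, the API delivers every value as a string, and building a frame from the parsed records keeps them that way. The sketch below uses two abbreviated, hand-typed records shaped like the API's output and converts the columns explicitly with `pd.to_numeric` and `pd.to_datetime`.

```python
import pandas as pd

# Abbreviated records shaped like the items returned by r.json().
records = [
    {'crime_category_number': '4', 'crime_date': '2017-09-13T19:09:00.000',
     'victim_count': '1'},
    {'crime_category_number': '5', 'crime_date': '2017-08-21T02:08:00.000',
     'victim_count': '1'},
]

frame = pd.DataFrame(records)
raw_value = frame['crime_category_number'].iloc[0]   # still the string '4'

# Convert the columns we care about, one at a time.
frame['crime_category_number'] = pd.to_numeric(frame['crime_category_number'])
frame['victim_count'] = pd.to_numeric(frame['victim_count'])
frame['crime_date'] = pd.to_datetime(frame['crime_date'])

print(frame.dtypes)
```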

Of course, as we saw above, pandas is perfectly capable of
fetching the dataset itself.  The Requests package is good
if you need the data for something other than pandas.
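
In that situation it is worth checking the status and content type before parsing. The helper below is a hypothetical sketch (`json_or_raise` is not part of Requests). To stay runnable offline, it is exercised with a hand-built `requests.Response` standing in for a failed call rather than a real network request.

```python
import requests

def json_or_raise(response):
    """Return parsed JSON from a successful response, or raise an error."""
    response.raise_for_status()          # raises requests.HTTPError on 4xx/5xx
    content_type = response.headers.get('Content-Type', '')
    if 'json' not in content_type.lower():
        raise ValueError('expected JSON but got {!r}'.format(content_type))
    return response.json()

# Stand-in for the misspelled-URL request: a bare response with a 404 status.
failed = requests.Response()
failed.status_code = 404

try:
    json_or_raise(failed)
    outcome = 'parsed'
except requests.HTTPError as err:
    outcome = 'rejected: {}'.format(err)

print(outcome)
```

With a live response, the call would simply be `crimes = json_or_raise(r)`.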