Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is a reasonable metric. Outside that range, accuracy can be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
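
The target-distribution checks in the list above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed workflow: the sample targets `y_class` and `y_reg` are made up, the helper names are hypothetical, and comparing the mean to the median is only a rough heuristic for right skew.

```python
import math
from collections import Counter
from statistics import mean, median


def majority_class_frequency(labels):
    """Return (most common class, its share of all observations)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


def looks_right_skewed(values):
    """Rough heuristic: a mean above the median hints at a long right tail."""
    return mean(values) > median(values)


# Hypothetical classification target: 8 of 10 observations are class 0.
y_class = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
label, share = majority_class_frequency(y_class)
accuracy_ok = 0.5 <= share < 0.7  # the range suggested in the checklist
print(label, share, accuracy_ok)

# Hypothetical regression target with one very large value.
y_reg = [10, 12, 15, 20, 1000]
if looks_right_skewed(y_reg):
    y_log = [math.log1p(v) for v in y_reg]  # log(1 + v), safe for zeros
    print([round(v, 2) for v in y_log])
```

With a real dataset you would pass your own target column instead of the toy lists; a majority share outside the 50% to 70% range is a cue to look at metrics beyond accuracy.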

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

## City-Data.com ##

In [6]:
# Use Washington County, UT as an example
wash_county_url = 'http://www.city-data.com/county/Washington_County-UT.html'
html = !curl {wash_county_url}
print(html[:5])

['<!DOCTYPE html>', '<html lang="en">', ' <head>', ' <meta charset="utf-8"/>', '<meta http-equiv="Content-Language" content="en"/>']


In [8]:
# Text to find on the page:
# Suicides per 1,000,000 population from 2000 to 2006: 150.4

import re

# To extract a list of column names from a website describing a dataset
# html = !curl https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records
# matches = [ re.search(r'\. ([\w -]+): 2', line) for line in html ]

# Search every line of the page for the statistic's label (raw string, so \d reaches the regex engine intact)
matches = [ re.search(r'Suicides per ([\d,]+) population from (\d{4}) to (\d{4}):', line) for line in html ]
# group(0) is the entire matched text
column_names = [ match.group(0) for match in matches if match ]
print(len(column_names), column_names)


1 ['Suicides per 1,000,000 population from 2000 to 2006:']


In [13]:
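import re

# Capture the population base, the start and end years, and the rate itself
pattern = r'Suicides per ([\d,]+) population from (\d{4}) to (\d{4}): </b>([\d.]+)<b>'

for line in html:
  match = re.search(pattern, line)
  if match:
    print('I found a match!')
    print(line)            # the full HTML line
    print(match.group(0))  # the matched text
    a = match.group(1)
    print(int(a.replace(',', '')))  # population base without thousands separators
    print(int(match.group(2)))      # start year
    print(int(match.group(3)))      # end year
    print(float(match.group(4)))    # rate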

I found a match!
<b>Suicides per 1,000,000 population from 2000 to 2006: </b>150.4<b>. This is more than state average.</b><br/>
Suicides per 1,000,000 population from 2000 to 2006: </b>150.4<b>
1000000
2000
2006
150.4
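
The extraction above could be wrapped in a reusable function that returns a record instead of printing. The following is a sketch only: the function name `parse_suicide_rate` and the dictionary keys are hypothetical, and the sample line is copied from the cell output above so the block runs without fetching the page.

```python
import re

# Same pattern as the cell above, compiled once for reuse.
SUICIDE_RATE = re.compile(
    r'Suicides per ([\d,]+) population from (\d{4}) to (\d{4}): </b>([\d.]+)<b>'
)


def parse_suicide_rate(lines):
    """Return a dict for the first matching line, or None if nothing matches."""
    for line in lines:
        match = SUICIDE_RATE.search(line)
        if match:
            return {
                'per_population': int(match.group(1).replace(',', '')),
                'start_year': int(match.group(2)),
                'end_year': int(match.group(3)),
                'rate': float(match.group(4)),
            }
    return None


# Sample lines copied from the outputs shown above.
sample = [
    '<!DOCTYPE html>',
    '<b>Suicides per 1,000,000 population from 2000 to 2006: </b>150.4<b>. '
    'This is more than state average.</b><br/>',
]
record = parse_suicide_rate(sample)
print(record)
```

Returning `None` when no line matches lets the caller decide how to handle counties whose pages lack this statistic.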


## Data.gov ##

In [None]:
base_url = 'http://catalog.data.gov/api/3/'
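
One way to build the query URLs used in the cells below is to join the base URL with an action name and URL-encode the parameters. This is a sketch under assumptions: the helper `action_url` is hypothetical, and it uses the HTTPS form of the address because the wget logs below show the HTTP address being redirected there.

```python
from urllib.parse import urlencode, urljoin

# Assumption: CKAN action endpoints live under <base>/action/, as in the cells below.
BASE_URL = 'https://catalog.data.gov/api/3/'


def action_url(action, **params):
    """Build a CKAN action URL such as .../action/package_search?q=suicide."""
    url = urljoin(BASE_URL, 'action/' + action)
    if params:
        url += '?' + urlencode(params)  # escapes spaces and special characters
    return url


print(action_url('package_list'))
print(action_url('package_search', q='suicide'))
print(action_url('package_search', q='suicide prevention'))
```

Encoding the parameters matters once a search term contains spaces or punctuation, which plain string concatenation would pass through unescaped.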

In [1]:
!wget http://demo.ckan.org/api/3/action/package_list

--2020-12-06 04:31:15--  http://demo.ckan.org/api/3/action/package_list
Resolving demo.ckan.org (demo.ckan.org)... 172.67.170.152, 104.24.114.210, 104.24.115.210, ...
Connecting to demo.ckan.org (demo.ckan.org)|172.67.170.152|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://demo.ckan.org/api/3/action/package_list [following]
--2020-12-06 04:31:16--  https://demo.ckan.org/api/3/action/package_list
Connecting to demo.ckan.org (demo.ckan.org)|172.67.170.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 188392 (184K) [application/json]
Saving to: ‘package_list’


2020-12-06 04:31:16 (2.86 MB/s) - ‘package_list’ saved [188392/188392]



In [23]:
import json
import os.path

# Download the response only if it is not already cached on disk
# base_url = 'http://demo.ckan.org/api/3/action/'
base_url = 'http://catalog.data.gov/api/3/action/'
query = 'package_search?q=suicide'
if not os.path.exists(query):
  query_url = base_url + query
  !wget {query_url}

# wget saves the response under the query string, so the query doubles as the file name
with open(query, "r") as myfile:
  data = json.load(myfile)
data

URL transformed to HTTPS due to an HSTS policy
--2020-12-06 05:21:18--  https://catalog.data.gov/api/3/action/package_search?q=suicide
Resolving catalog.data.gov (catalog.data.gov)... 13.32.240.71, 13.32.240.94, 13.32.240.14, ...
Connecting to catalog.data.gov (catalog.data.gov)|13.32.240.71|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68465 (67K) [application/json]
Saving to: ‘package_search?q=suicide’


2020-12-06 05:21:19 (411 KB/s) - ‘package_search?q=suicide’ saved [68465/68465]



{'help': 'https://catalog.data.gov/api/3/action/help_show?name=package_search&host=&protocol=',
 'result': {'count': 86,
  'facets': {},
  'results': [{'author': None,
    'author_email': None,
    'creator_user_id': '47303a9e-1187-4290-85a3-1fc02dc49e4a',
    'extras': [{'key': 'publisher',
      'value': 'National Transportation Safety Board'},
     {'key': 'identifier', 'value': '{710F6EAC-E30A-4C49-A51D-BEAF25B54F87}'},
     {'key': 'catalog_describedBy',
      'value': 'https://project-open-data.cio.gov/v1.1/schema/catalog.json'},
     {'key': 'harvest_source_id',
      'value': '68151113-c0fb-4354-96f6-34376f841a1b'},
     {'key': 'catalog_@context',
      'value': 'https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld'},
     {'key': 'resource-type', 'value': 'Dataset'},
     {'key': 'temporal', 'value': '1995-01-01T00:00:00Z/2014-12-31T23:59:59Z'},
     {'key': '__category_tag_3b4e71c5-3c3c-43af-a3ff-81694225e453',
      'value': '["Transportation"]'},
     {'key': 'modi

In [22]:
import json
import os.path

# !wget http://demo.ckan.org/api/3/action/package_search?q=spending

# Download the response only if it is not already cached on disk
# base_url = 'http://demo.ckan.org/api/3/action/'
base_url = 'http://catalog.data.gov/api/3/action/'
# Note: catalog.data.gov redirects package_list to package_search (see the wget log below)
query = 'package_list'
if not os.path.exists(query):
  query_url = base_url + query
  !wget {query_url}

with open(query, "r") as myfile:
  data = json.load(myfile)
data


--2020-12-06 05:17:47--  http://catalog.data.gov/api/3/action/package_list
Resolving catalog.data.gov (catalog.data.gov)... 13.32.240.94, 13.32.240.10, 13.32.240.71, ...
Connecting to catalog.data.gov (catalog.data.gov)|13.32.240.94|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://catalog.data.gov/api/3/action/package_list [following]
--2020-12-06 05:17:47--  https://catalog.data.gov/api/3/action/package_list
Connecting to catalog.data.gov (catalog.data.gov)|13.32.240.94|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://catalog.data.gov/api/3/action/package_search [following]
--2020-12-06 05:17:48--  https://catalog.data.gov/api/3/action/package_search
Reusing existing connection to catalog.data.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 98826 (97K) [application/json]
Saving to: ‘package_list’


2020-12-06 05:17:48 (595 KB/s) - ‘package_list’ saved [98826/98826]



{'help': 'https://catalog.data.gov/api/3/action/help_show?name=package_search&host=&protocol=',
 'result': {'count': 219382,
  'facets': {},
  'results': [{'author': None,
    'author_email': None,
    'creator_user_id': '47303a9e-1187-4290-85a3-1fc02dc49e4a',
    'extras': [{'key': 'publisher',
      'value': 'Allegheny County / City of Pittsburgh / Western PA Regional Data Center'},
     {'key': 'identifier', 'value': '9e0ce87d-07b8-420c-a8aa-9de6104f61d6'},
     {'key': 'license',
      'value': 'http://www.opendefinition.org/licenses/cc-zero'},
     {'key': 'harvest_source_id',
      'value': '041cd86b-18ea-412b-8d67-2ed1d6124bbf'},
     {'key': 'catalog_@context',
      'value': 'https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld'},
     {'key': 'resource-type', 'value': 'Dataset'},
     {'key': 'modified', 'value': '2018-05-02T05:57:21.847645'},
     {'key': 'harvest_source_title', 'value': 'WPRDC data.json'},
     {'key': 'source_schema_version', 'value': '1.1'},
     

In [20]:
data

{'help': 'https://demo.ckan.org/api/3/action/help_show?name=package_list',
 'result': ['_-_-_',
  '0-1-annual-probability-extents',
  '0-1-annual-probability-extents-with-30-climate-change-adjustment',
  '0-1-annual-probability-outputs',
  '0-1-annual-probability-outputs-with-30-climate-change-adjustment',
  '02a8c314-e726-44fb-88da-2e535e788675',
  '02p-expenditure-over-25k-apr-19',
  '02p-expenditure-over-25k-feb-20',
  '02p-expenditure-over-25k-mar-19',
  '02p-expenditure-over-25k-nov-19',
  '02p-expenditure-over-25k-sep-19',
  '09-listas',
  '10019191949485756',
  '100-filtered-whey-protein-isolate-canada',
  '100-natural-fat-burner-to-reduce-calories-easily',
  '1022330232',
  '1080p-2020',
  '10-incredible-fast-flow-male-enhancement-examples',
  '10-shocking-facts-about-battery-warehouse',
  '10-tips-for-silgenix-male-enhancement',
  '10-unforgivable-sins-of-one-shot-keto',
  '1-1',
  '111',
  '11111',
  '111111111111111111111111111',
  '111asdfasdf',
  '112112',
  '12121',
  '12

In [7]:
# JSON
import json

# some JSON:
x =  '{ "name":"John", "age":30, "city":"New York"}'

# parse x:
y = json.loads(x)

# the result is a Python dictionary:
print(y["age"])

30
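
The same dictionary access applies to the nested CKAN responses shown earlier, where each package carries an `extras` list of key/value pairs. The sketch below flattens that list into a plain dictionary. The JSON string is a trimmed, hand-written response in the same shape as the `package_search` output, and the helper `extras_as_dict` is a hypothetical name.

```python
import json

# A trimmed response in the same shape as the package_search output shown earlier.
raw = '''
{"result": {"count": 1,
            "results": [{"author": null,
                         "extras": [{"key": "publisher",
                                     "value": "National Transportation Safety Board"},
                                    {"key": "resource-type", "value": "Dataset"}]}]}}
'''

response = json.loads(raw)


def extras_as_dict(package):
    """Turn a package's list of {'key': ..., 'value': ...} pairs into one dict."""
    return {item['key']: item['value'] for item in package.get('extras', [])}


for package in response['result']['results']:
    extras = extras_as_dict(package)
    print(extras['publisher'])
```

Using `package.get('extras', [])` keeps the helper from raising when a package has no `extras` entry.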
