# Getting the county asthma profile data for California

Get the county asthma profiles, together with demographic data, for all counties in California, as shown at http://www.californiabreathing.org/asthma-data/county-asthma-profiles

Adapted from https://github.com/gkafka/Rehab4Rehab/blob/master/GetDataRehabCenters.ipynb

In [1]:
%matplotlib inline

# Web scraping
import requests
from bs4 import BeautifulSoup

# Data handling
# import pandas as pd
import numpy as np
# import scipy as sp

# Graphing capabilities
import matplotlib.pyplot as plt
# import seaborn as sns

import json
# import time

### Get the list of counties and respective pages

In [2]:
url = 'http://www.californiabreathing.org/asthma-data/county-asthma-profiles'
baseurl = 'http://www.californiabreathing.org'

In [3]:
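try:
    response = requests.get(url)
    response.raise_for_status()   # treat HTTP error codes (4xx/5xx) as failures
    print('Successfully acquired page.')
except requests.exceptions.RequestException:
    print('Failed to get url page.')
    raise   # later cells need a valid response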

Successfully acquired page.


In [4]:
soup = BeautifulSoup(response.text, 'lxml')
# soup = BeautifulSoup(response.text, "html.parser")

In [5]:
html_entries = soup.find_all('ul',attrs={'class': 'zoo-item-list zoo-list page-profiles'})
print 'Found %d unordered list(s)' % len(html_entries)

Found 1 unordered list(s)


In [6]:
county_entries = html_entries[0].find_all('li')
N_counties = len(county_entries)
print 'Found %d county entries' % len(county_entries)

Found 58 county entries


Find all counties and respective web pages

In [7]:
county_name= []
county_url= []

for entry in county_entries:
    county_link= entry.find_all('a')[0]
    county_url.append( county_link.get('href').strip() )
    county_name.append( county_link.get('title').strip().title() )   # strip surrounding whitespace and convert the title to title case
    
county_name = np.array(county_name)
county_url = np.array(county_url)

Making sure the counties are in alphabetical order

In [8]:
indsSort = np.argsort(county_name)
county_name = county_name[indsSort]
county_url = county_url[indsSort]

In [9]:
filename = 'county_names.json'
with open(filename, "w") as outfile:
    json.dump(county_name.tolist(), outfile)
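
As a quick sanity check, the saved list can be read back with `json.load`. The sketch below is written for Python 3 and uses a short stand-in list and a temporary directory (both assumptions, chosen so the real `county_names.json` is not overwritten).

```python
import json
import os
import tempfile

# Stand-in for county_name.tolist(); the real list has 58 entries.
names = ['Alameda', 'Alpine', 'Amador']

# Write to a temporary directory so the real output file is left untouched.
path = os.path.join(tempfile.mkdtemp(), 'county_names.json')
with open(path, 'w') as outfile:
    json.dump(names, outfile)

# Read the file back and confirm the round trip preserved the list.
with open(path) as infile:
    loaded = json.load(infile)

print(loaded == names)
```

If the comparison prints `False`, the file on disk no longer matches the list in memory.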

In [10]:
for i in xrange(len(county_name)):
    print '%s   %s%s' % (county_name[i].title(), baseurl, county_url[i])

Alameda   http://www.californiabreathing.org/asthma-data/county-asthma-profiles/alameda-county-asthma-profile
Alpine   http://www.californiabreathing.org/asthma-data/county-asthma-profiles/alpine-county-asthma-profile
Amador   http://www.californiabreathing.org/asthma-data/county-asthma-profiles/amador-county-asthma-profile
Butte   http://www.californiabreathing.org/asthma-data/county-asthma-profiles/butte-county-asthma-profile
Calaveras   http://www.californiabreathing.org/asthma-data/county-asthma-profiles/calaveras-county-asthma-profile
Colusa   http://www.californiabreathing.org/asthma-data/county-asthma-profiles/colusa-county-asthma-profile
Contra Costa   http://www.californiabreathing.org/asthma-data/county-asthma-profiles/contra-costa-county-asthma-profile
Del Norte   http://www.californiabreathing.org/asthma-data/county-asthma-profiles/del-norte-county-asthma-profile
El Dorado   http://www.californiabreathing.org/asthma-data/county-asthma-profiles/el-dorado-county-asthma-profil
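
The full addresses above are built by concatenating `baseurl` and the relative link. A slightly more defensive alternative, sketched here for Python 3, is `urllib.parse.urljoin`, which also copes with links that are already absolute. The relative path below is inferred from the printed output, and the absolute link is a hypothetical example.

```python
from urllib.parse import urljoin

baseurl = 'http://www.californiabreathing.org'

# Relative link of the form found in each county's href attribute.
county_path = '/asthma-data/county-asthma-profiles/alameda-county-asthma-profile'
full_url = urljoin(baseurl, county_path)
print(full_url)

# A link that is already absolute is returned unchanged (hypothetical example).
absolute = urljoin(baseurl, 'http://example.org/some-page')
print(absolute)
```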

## Now we need to go to each county page and scrape the data

### Get population data

The data will be saved in a CSV file.

In [None]:
filename = 'Population_AgeGroup_byCounty.csv'
fout = open(filename, 'w')

Write the header for the csv file

In [12]:
line ='county,0-4,5-17,18-64,65+,total'
fout.write(line+'\n')

In [13]:
dict_tmp = {}

for i in xrange(N_counties):
    print county_name[i]
    line= county_name[i]

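    try:
        response = requests.get(baseurl+county_url[i])
    except requests.exceptions.RequestException:
        print('Failed to get url page %s' % (baseurl+county_url[i]))
        continue   # skip this county rather than reuse the previous response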

    soup = BeautifulSoup(response.text, 'lxml')

    # Find all tables containing data in the web page
    html_entries = soup.find_all('table',attrs={'class': 'datatable'})
#     print 'Found %d tables' % len(html_entries)

    # Population data is in first table
    ind= 0
    table = html_entries[ind]
    table_id = table.get('id')   # not every table carries an id attribute
    if table_id is not None:
        print('ID: %s' % table_id)

    rows = table.find_all('tr')


    # Get the data for each county
    dict_tmp[county_name[i]] = []
    for j in xrange(1,len(rows)): # ignore first row: table header
        row= rows[j]
        col= row.find_all('td')[0] # get the first table value
        val= int(col.get_text().strip().replace(',','')) # remove thousands separators, e.g., 10,000 -> 10000
        print '%d' % (val)
        line= line+','+str(val)

    print ''
    fout.write(line+'\n')

fout.close()

Alameda
ID: population
99911
248516
1022113
196707
1567248

Alpine
ID: population
35
210
683
219
1148

Amador
ID: population
1341
4466
22016
9011
36833

Butte
ID: population
11474
33791
140503
37584
223353

Calaveras
ID: population
1692
6109
26210
10992
45004

Colusa
ID: population
1590
4524
13132
2837
22083

Contra Costa
ID: population
62767
189096
679089
153863
1084815

Del Norte
ID: population
1511
4144
18464
4430
28549

El Dorado
ID: population
8414
29251
112898
31798
182360

Fresno
ID: population
79872
199933
585623
107296
972724

Glenn
ID: population
2041
5501
16985
4098
28626

Humboldt
ID: population
6806
19018
88330
21237
135392

Imperial
ID: population
15369
37095
112755
21524
186744

Inyo
ID: population
954
2635
11232
3834
18656

Kern
ID: population
73380
183155
540506
90089
887129

Kings
ID: population
12201
29268
99322
13938
154729

Lake
ID: population
3441
9284
39190
12968
64884

Lassen
ID: population
1406
4113
24387
4012
33918

Los Angeles
ID: population
670558
1651655
64
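
The loop above assembles each CSV line by string concatenation. As a minimal sketch of an alternative (Python 3), the standard `csv` module can write the rows, with a small helper converting each cell to an integer. The helper name `parse_count` and the comma-formatted cell strings are assumptions for illustration; the numbers match the Alameda output above.

```python
import csv
import io


def parse_count(text):
    """Convert a cell such as '1,022,113' to the integer 1022113."""
    return int(text.strip().replace(',', ''))


# Cell text as it might appear in the population table (formatting assumed).
cells = ['99,911', '248,516', '1,022,113', '196,707', '1,567,248']

buffer = io.StringIO()  # stands in for the open CSV file
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['county', '0-4', '5-17', '18-64', '65+', 'total'])
writer.writerow(['Alameda'] + [parse_count(c) for c in cells])

print(buffer.getvalue())
```

Using `csv.writer` also takes care of quoting if a field ever contains a comma.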

### Get ethnicity data

In [14]:
for t in html_entries:
    print(t.get('id', 'No id found'))

population
ethnicity
prevalence
prevalence
riskfactors
No id found
No id found
No id found
numhospitalizations
No id found
No id found
disparities
hosprates
hospitals
edvisits


The tables of interest are at indices 0, 1, 4, ...
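
To see which ids those positions correspond to, the printed list can be turned into a lookup from id to position. This is only a sketch: the ids are copied from the output above, `None` stands for a table without an id, and pairing indices 0, 1 and 4 with `population`, `ethnicity` and `riskfactors` is a reading of that output rather than something the page guarantees.

```python
from collections import defaultdict

# Table ids in page order, copied from the output above;
# None marks a table without an id attribute.
table_ids = ['population', 'ethnicity', 'prevalence', 'prevalence',
             'riskfactors', None, None, None, 'numhospitalizations',
             None, None, 'disparities', 'hosprates', 'hospitals', 'edvisits']

# Map each id to every position where it occurs (ids can repeat).
positions = defaultdict(list)
for index, table_id in enumerate(table_ids):
    if table_id is not None:
        positions[table_id].append(index)

for name in ('population', 'ethnicity', 'riskfactors'):
    print(name, positions[name])
```

Note that `prevalence` appears twice, so an id alone does not always identify a single table.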