# Homework 2 - Considering Bias in Data
# Name: James Joko (jjoko)
## License
The "CONSTANTS" and "PROCEDURES/FUNCTIONS" cells in this notebook contain code developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the Creative Commons CC-BY license. The changes made include substituting my email in the request headers and inserting my access token.

For reproducibility, execute each cell from top to bottom.

# Setup
Import the necessary libraries and datasets. The `us_cities_by_state_SEPT.2023.csv` dataset contains duplicate rows. I delete the duplicates to reduce computation time and improve accuracy.

In [1]:
import json, time, urllib.parse, pandas as pd, requests

In [115]:
cities = pd.read_csv("starting_data/us_cities_by_state_SEPT.2023.csv").drop_duplicates().reset_index(drop=True)
cities['revision_id'] = None
cities['article_quality'] = None
cities.head()

Unnamed: 0,state,page_title,url,revision_id,article_quality
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",,
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",,
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama",,
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama",,
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama",,


The given dataset does not have Connecticut or Nebraska.

# Step 2: Getting Article Quality Predictions
Define constants and a function that gets the revision ID of an article in the `us_cities_by_state_SEPT.2023.csv` dataset by calling the Wikipedia API.

In [116]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'jjoko@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

In [117]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
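
As a minimal sketch of how the response from `request_pageinfo_per_article` can be unpacked, the helper below pulls `lastrevid` out of a page info response. The helper name `extract_lastrevid`, the page id key, and the trimmed sample response are hypothetical; the nesting (`query`, then `pages`, then a page entry holding `lastrevid`) mirrors what this notebook reads later on.

```python
def extract_lastrevid(pageinfo_response):
    """Return the latest revision id from a page info response, or None if absent."""
    if not pageinfo_response:
        return None
    pages = pageinfo_response.get("query", {}).get("pages", {})
    # One title is requested at a time, so only the first page entry matters
    for page in pages.values():
        return page.get("lastrevid")
    return None


# Hypothetical response, trimmed to the fields used here (the page id key is made up)
sample_response = {
    "query": {
        "pages": {
            "104730": {"title": "Abbeville, Alabama", "lastrevid": 1171163550}
        }
    }
}
print(extract_lastrevid(sample_response))
```

Returning `None` instead of raising lets a caller decide how to log or skip an article with no revision id.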

Define constants and a function that makes an ORES request using the page title and current revision id. I am defining a constant for my access token because I intend to share this notebook only with the instructional staff. Otherwise, I'd store the access token in a local hidden file.

In [155]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "jjoko@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "jjoko@uw.edu",         # your email address should go here
    'access_token'  : "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhdWQiOiI5M2QwMmZhY2RiNDIyNDI5N2M0YzU1ZmNlMTg5MGU3MyIsImp0aSI6ImIwMmU1MjU2MmI1ZjQxNjJlODdjMGI5MjdiNTY1MTczMjg1OWI3NTcxNjc2MDk0YjQwZGIyZDExZjJkYTQ2ZjVmMWEwYzM1ZDllZDc1MGIxIiwiaWF0IjoxNjk3MjMzNDM2LjM0NDEwNiwibmJmIjoxNjk3MjMzNDM2LjM0NDEwOSwiZXhwIjozMzI1NDE0MjIzNi4zNDI3NzMsInN1YiI6IjczOTk4ODk4IiwiaXNzIjoiaHR0cHM6Ly9tZXRhLndpa2ltZWRpYS5vcmciLCJyYXRlbGltaXQiOnsicmVxdWVzdHNfcGVyX3VuaXQiOjUwMDAsInVuaXQiOiJIT1VSIn0sInNjb3BlcyI6WyJiYXNpYyJdfQ.XnUGpEA01TR6aOd2XOfYxE72cscL9tK8cJ6PKcO5v1-ufSiGu-CeKAned8zNVG8HyqyaapPUfhr8wybl0FuY6ouEVxy0oxiVlxoobYPx5uzGIpE-ubNzyZdd5sRVmiQAKWIC322bY4C3nvL7qxC_rX9MN2mXXgJ010imENg6O5mmiDd9_YMaOoWMMtKq53rZE-Eltdeyb_Lza7ryQkz51A5_YuduGHfFMfEj6PpssWscdH4m4CzqrZbTWVtjhab3jkYdEqWmzXtvgR3fZz7TzMTuZFlBNk4tIPv6aaLn-sqm1NOe0wWizMkGcdxHRl7pdnhOLhlYr8hgdfZ1Og0JBkLF44i-ztyJj6YGVFN4nC03xWRWAxGQV9DHGqMSHdWkUAXZCaI2YQ7UyNah48SLhdukUpAq1mq7KZSh5dHppM5vmPreUB4TGtaVpVPmxcObmeLrVTGdUJ5Rbs0Lr_RbcMCwa4ce0CFWeUOYjuh4-BhY15M1-98wVKBjSq_04WECrdaITwUaluRuZYqkVaI5mzwyvnVAbv3Jy98UYJXd9e2nB7fU1Bvaw4XsOrhLejmWpDrL1VfTQlv203MojrOS_DH57wdmksGOgvo-N7xSAFn8C0SwTZXJUmDbjqGqHMtHixBe9UKOSvzINrSIqOG5F12lPVcGZLyUBucoZlfdfE8"          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = "Jamesjoko"
ACCESS_TOKEN = "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhdWQiOiI5M2QwMmZhY2RiNDIyNDI5N2M0YzU1ZmNlMTg5MGU3MyIsImp0aSI6ImIwMmU1MjU2MmI1ZjQxNjJlODdjMGI5MjdiNTY1MTczMjg1OWI3NTcxNjc2MDk0YjQwZGIyZDExZjJkYTQ2ZjVmMWEwYzM1ZDllZDc1MGIxIiwiaWF0IjoxNjk3MjMzNDM2LjM0NDEwNiwibmJmIjoxNjk3MjMzNDM2LjM0NDEwOSwiZXhwIjozMzI1NDE0MjIzNi4zNDI3NzMsInN1YiI6IjczOTk4ODk4IiwiaXNzIjoiaHR0cHM6Ly9tZXRhLndpa2ltZWRpYS5vcmciLCJyYXRlbGltaXQiOnsicmVxdWVzdHNfcGVyX3VuaXQiOjUwMDAsInVuaXQiOiJIT1VSIn0sInNjb3BlcyI6WyJiYXNpYyJdfQ.XnUGpEA01TR6aOd2XOfYxE72cscL9tK8cJ6PKcO5v1-ufSiGu-CeKAned8zNVG8HyqyaapPUfhr8wybl0FuY6ouEVxy0oxiVlxoobYPx5uzGIpE-ubNzyZdd5sRVmiQAKWIC322bY4C3nvL7qxC_rX9MN2mXXgJ010imENg6O5mmiDd9_YMaOoWMMtKq53rZE-Eltdeyb_Lza7ryQkz51A5_YuduGHfFMfEj6PpssWscdH4m4CzqrZbTWVtjhab3jkYdEqWmzXtvgR3fZz7TzMTuZFlBNk4tIPv6aaLn-sqm1NOe0wWizMkGcdxHRl7pdnhOLhlYr8hgdfZ1Og0JBkLF44i-ztyJj6YGVFN4nC03xWRWAxGQV9DHGqMSHdWkUAXZCaI2YQ7UyNah48SLhdukUpAq1mq7KZSh5dHppM5vmPreUB4TGtaVpVPmxcObmeLrVTGdUJ5Rbs0Lr_RbcMCwa4ce0CFWeUOYjuh4-BhY15M1-98wVKBjSq_04WECrdaITwUaluRuZYqkVaI5mzwyvnVAbv3Jy98UYJXd9e2nB7fU1Bvaw4XsOrhLejmWpDrL1VfTQlv203MojrOS_DH57wdmksGOgvo-N7xSAFn8C0SwTZXJUmDbjqGqHMtHixBe9UKOSvzINrSIqOG5F12lPVcGZLyUBucoZlfdfE8"
#
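
The comments above mention deriving the throttle constants by dissecting the access token. A minimal sketch of that idea follows, assuming the token is a JWT whose payload carries a `ratelimit` claim with `requests_per_unit` and `unit` fields. The claim names, the unit table, and the helper names are assumptions for illustration, and the token is a stand-in built inside the example rather than a real credential.

```python
import base64
import json

# Assumed mapping from the unit label to seconds
SECONDS_PER_UNIT = {"SECOND": 1, "MINUTE": 60, "HOUR": 3600}


def decode_jwt_payload(token):
    """Decode (without verifying the signature) the middle segment of a JWT."""
    payload_b64 = token.split(".")[1]
    # JWT segments drop base64 padding; restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def throttle_wait(payload, latency=0.002):
    """Seconds to sleep between requests to stay under the granted rate."""
    limit = payload["ratelimit"]
    per_request = SECONDS_PER_UNIT[limit["unit"]] / limit["requests_per_unit"]
    return max(per_request - latency, 0.0)


def _segment(obj):
    """Encode a dict the way a JWT segment is encoded (URL-safe, unpadded)."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Stand-in token (header.payload.signature) for illustration only
example_payload = {"ratelimit": {"requests_per_unit": 5000, "unit": "HOUR"}}
example_token = ".".join([_segment({"alg": "none"}), _segment(example_payload), "sig"])

decoded = decode_jwt_payload(example_token)
print(decoded["ratelimit"], throttle_wait(decoded))
```

Under these assumptions the wait depends on the unit in the claim, so it is worth checking the unit before hard-coding a throttle constant.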

In [156]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


The loop below a) reads each row of `us_cities_by_state_SEPT.2023.csv`, b) makes a page info request to get the current revision id of the article, and c) makes an ORES request using that revision id.

In [194]:
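for idx in range(len(cities)):
    try:
        # a) read the article title for this row
        page_title = cities.at[idx, "page_title"]
        # b) page info request: the single entry under "pages" holds the latest revision id
        page_info = request_pageinfo_per_article(page_title)
        revision_id = list(page_info["query"]["pages"].values())[0]["lastrevid"]
        # write through .at so the value lands in `cities` rather than in a copy of the row
        cities.at[idx, "revision_id"] = revision_id
        # c) ORES request for that revision
        ores = request_ores_score_per_article(article_revid=revision_id,
                                              email_address="jjoko@uw.edu",
                                              access_token=ACCESS_TOKEN)
        score = ores["enwiki"]["scores"][str(revision_id)]["articlequality"]["score"]["prediction"]
        cities.at[idx, "article_quality"] = score
    except (TypeError, KeyError):
        # TypeError: a request returned None; KeyError: an expected field was missing
        print(f"Error on row index:{idx}")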
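
The nested lookup used to read the ORES prediction can be sketched as a small helper. The helper name `extract_prediction` and the sample response are hypothetical; the key path is the one this notebook indexes, and the sample is trimmed to just that path.

```python
def extract_prediction(ores_response, revision_id, wiki="enwiki", model="articlequality"):
    """Return the predicted quality class, or None if any level of nesting is missing."""
    try:
        return ores_response[wiki]["scores"][str(revision_id)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        # TypeError covers a None response; KeyError covers a payload without scores
        return None


# Hypothetical response, trimmed to the keys read above
sample = {
    "enwiki": {
        "scores": {
            "1171163550": {"articlequality": {"score": {"prediction": "C"}}}
        }
    }
}
print(extract_prediction(sample, 1171163550))
```

Note that the revision id is an integer in the page info response but a string key in the scores dictionary, hence the `str(...)` conversion.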

In [258]:
cities.to_csv("intermediate_data/us_cities_article_quality.csv", index = False)
pd.read_csv("intermediate_data/us_cities_article_quality.csv").head()

Unnamed: 0,state,page_title,url,revision_id,article_quality
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",1171163550,C
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",1177621427,C
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama",1168359898,C
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama",1165909508,GA
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama",1179139816,C


In [211]:
cities["article_quality"].value_counts()

C        12926
GA        4727
Start     2102
B          883
Stub       677
FA         210
Name: article_quality, dtype: int64

I was able to get ORES scores for every article.
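
One way to check a claim like this is to count the rows whose `article_quality` is still missing, since `value_counts()` leaves missing values out by default. The sketch below uses a toy frame; its third row is a hypothetical unscored article added only to show what a failure would look like.

```python
import pandas as pd

# Toy stand-in for the scored cities table
scored = pd.DataFrame({
    "page_title": ["Abbeville, Alabama", "Akron, Alabama", "Example, Nowhere"],
    "article_quality": ["C", "GA", None],
})

# Rows that never received a prediction
missing = scored[scored["article_quality"].isna()]
print(f"{len(missing)} of {len(scored)} articles lack a score")
```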
# Step 3: Combining the Datasets
Merge the US cities article dataset, the population dataset, and the regional division dataset. The population dataset contains some non-states and extra information. The result should contain the following columns: state, regional_division, population, article_title, revision_id, article_quality.

First, load the population dataset into memory and pre-process it by removing non-states and extra information and fixing the state column so that it can be joined on state.

In [270]:
population = pd.read_excel("starting_data/NST-EST2022-POP.xlsx").iloc[8:-6,[0, 3]].reset_index(drop = True).dropna()
population.columns = ["state", "population"]
population["state"] = [a[1:].replace(" ", "_") for a in population["state"]]
population.head()

Unnamed: 0,state,population
0,Alabama,5049846.0
1,Alaska,734182.0
2,Arizona,7264877.0
3,Arkansas,3028122.0
4,California,39142991.0


In [250]:
population[~population['state'].isin(cities['state'])]

Unnamed: 0,state,population
6,Connecticut,3623355.0
8,District_of_Columbia,668791.0
10,Georgia,10788029.0
27,Nebraska,1963554.0


In [251]:
cities[~cities['state'].isin(population['state'])]["state"].unique()

array(['Georgia_(U.S._state)'], dtype=object)

The us_cities dataset is missing Connecticut, Nebraska, and District of Columbia. Georgia also needs to be reformatted.
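
The two `isin` checks above can also be done in one pass with an outer merge and pandas' `indicator=True` option, which labels each key as `left_only`, `right_only`, or `both`. This is a sketch on toy frames, not the notebook's data; the frame names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the state columns of the two datasets
cities_states = pd.DataFrame({"state": ["Alabama", "Georgia_(U.S._state)"]})
population_states = pd.DataFrame({"state": ["Alabama", "Georgia", "Nebraska"]})

# The outer join keeps every key; the _merge column records where each came from
check = pd.merge(cities_states.drop_duplicates(), population_states,
                 on="state", how="outer", indicator=True)
only_cities = check.loc[check["_merge"] == "left_only", "state"].tolist()
only_population = check.loc[check["_merge"] == "right_only", "state"].tolist()
print(only_cities, only_population)
```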

In [279]:
cities.loc[cities["state"].str.contains("Georgia"), "state"] = "Georgia"
cities.loc[cities["state"] == "Georgia"].head()

Unnamed: 0,state,page_title,url,revision_id,article_quality
2442,Georgia,"Abbeville, Georgia","https://en.wikipedia.org/wiki/Abbeville,_Georgia",1171167087,C
2443,Georgia,"Acworth, Georgia","https://en.wikipedia.org/wiki/Acworth,_Georgia",1166760529,C
2444,Georgia,"Adairsville, Georgia","https://en.wikipedia.org/wiki/Adairsville,_Geo...",1165502646,C
2445,Georgia,"Adel, Georgia","https://en.wikipedia.org/wiki/Adel,_Georgia",1168374078,C
2446,Georgia,"Adrian, Georgia","https://en.wikipedia.org/wiki/Adrian,_Georgia",1176950695,C


In [311]:
len(pd.read_excel("starting_data/US States by Region - US Census Bureau.xlsx")["STATE"].dropna())

50

Next, load the regional division dataset into memory and preprocess it by repairing the schema into two columns, regional division and state. Then left join the population and regional division data onto the cities dataset on state. Lastly, fix the structure of the merged dataset to match the homework handout and write it to file.

In [419]:
regional_division = pd.read_excel("starting_data/US States by Region - US Census Bureau.xlsx").iloc[:, 1:].dropna(how='all').reset_index(drop = True)
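# Forward-fill the division label down to the state rows beneath it; writing to the
# row yielded by iterrows() is not guaranteed to modify the DataFrame itself
regional_division["DIVISION"] = regional_division["DIVISION"].ffill()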
regional_division = regional_division.dropna()
regional_division.columns = ["regional_division", "state"]
regional_division["state"] = regional_division["state"].str.replace(" ", "_")
len(regional_division["state"].value_counts())

50
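
The schema repair can be illustrated on a toy frame, assuming the spreadsheet lists each division on its own row with the states of that division on the rows below it. The sketch uses a forward fill; the toy values are a three-state excerpt, and the layout is an assumption about the source file.

```python
import pandas as pd

# Toy excerpt: a division header row followed by the states in that division
raw = pd.DataFrame({
    "DIVISION": ["New England", None, None, "Pacific", None],
    "STATE": [None, "Maine", "Vermont", None, "Oregon"],
})

tidy = raw.copy()
# Carry each division label down to the state rows beneath it
tidy["DIVISION"] = tidy["DIVISION"].ffill()
# Header rows have no state, so dropping them leaves one row per state
tidy = tidy.dropna(subset=["STATE"]).reset_index(drop=True)
tidy.columns = ["regional_division", "state"]
print(tidy)
```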

In [423]:
merged = pd.merge(cities, population, on='state', how='left')
merged = pd.merge(merged, regional_division, on='state', how='left')
merged = merged[["state", "regional_division", "population", "page_title", "revision_id", "article_quality"]]
merged.columns = ["state", "regional_division", "population", "article_title", "revision_id", "article_quality"]
merged.head()

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,East South Central,5049846.0,"Abbeville, Alabama",1171163550,C
1,Alabama,East South Central,5049846.0,"Adamsville, Alabama",1177621427,C
2,Alabama,East South Central,5049846.0,"Addison, Alabama",1168359898,C
3,Alabama,East South Central,5049846.0,"Akron, Alabama",1165909508,GA
4,Alabama,East South Central,5049846.0,"Alabaster, Alabama",1179139816,C


In [429]:
merged.to_csv("wp_scored_city_articles_by_state.csv", index = False)
pd.read_csv("wp_scored_city_articles_by_state.csv").head()

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,East South Central,5049846.0,"Abbeville, Alabama",1171163550,C
1,Alabama,East South Central,5049846.0,"Adamsville, Alabama",1177621427,C
2,Alabama,East South Central,5049846.0,"Addison, Alabama",1168359898,C
3,Alabama,East South Central,5049846.0,"Akron, Alabama",1165909508,GA
4,Alabama,East South Central,5049846.0,"Alabaster, Alabama",1179139816,C


# Step 4: Analysis and Step 5: Results
First, to calculate total articles per capita, I count the rows per state (the number of articles) and divide by the state's population. Below are the top 10 US states by coverage.

In [484]:
states_by_coverage = pd.DataFrame(merged["state"].value_counts() / merged["state"].value_counts().index.map(dict(zip(population.state, population.population)))).reset_index()
states_by_coverage.columns = ["state", "total_articles_by_capita"]
states_by_coverage = states_by_coverage.sort_values(by="total_articles_by_capita", ascending=False).reset_index(drop = True)
states_by_coverage.head(10)

Unnamed: 0,state,total_articles_by_capita
0,Vermont,0.000509
1,North_Dakota,0.000458
2,Maine,0.000351
3,South_Dakota,0.000347
4,Iowa,0.000326
5,Alaska,0.000203
6,Pennsylvania,0.000196
7,Michigan,0.000177
8,Wyoming,0.000171
9,New_Hampshire,0.000169


Below are the bottom 10 US states by coverage.

In [485]:
states_by_coverage.tail(10).sort_values(by="total_articles_by_capita", ascending=True)

Unnamed: 0,state,total_articles_by_capita
47,North_Carolina,5e-06
46,Nevada,6e-06
45,California,1.2e-05
44,Arizona,1.3e-05
43,Virginia,1.5e-05
42,Oklahoma,1.9e-05
41,Florida,1.9e-05
40,Kansas,2.1e-05
39,Maryland,2.5e-05
38,Wisconsin,3.3e-05
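
The per-capita calculation behind both rankings can be sketched on toy data. Dividing one Series by another aligns them on their index labels, so the article counts and the populations only need to share state names. The populations below are made-up round numbers, not Census figures.

```python
import pandas as pd

# Toy inputs: one row per article, plus a made-up population per state
articles = pd.DataFrame({"state": ["Vermont", "Vermont", "Vermont", "Nevada"]})
pop = pd.Series({"Vermont": 600000.0, "Nevada": 3000000.0}, name="population")

# Articles per state, divided by population (aligned on the state label)
counts = articles["state"].value_counts()
per_capita = (counts / pop).sort_values(ascending=False)
print(per_capita)
```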


Second, to calculate high-quality articles per capita, I build a dataframe of only the high-quality articles (those rated FA or GA), count the rows per state, and divide by the state's population. Below are the top 10 US states by high-quality coverage.

In [434]:
high_quality = merged[(merged["article_quality"] == "FA") | (merged["article_quality"] == "GA")]

In [486]:
states_by_high_qual = pd.DataFrame(high_quality["state"].value_counts() / high_quality["state"].value_counts().index.map(dict(zip(population.state, population.population)))).reset_index()
states_by_high_qual.columns = ["state", "total_high_quality_articles_by_capita"]
states_by_high_qual = states_by_high_qual.sort_values(by="total_high_quality_articles_by_capita", ascending=False).reset_index(drop = True)
states_by_high_qual.head(10)

Unnamed: 0,state,total_high_quality_articles_by_capita
0,Vermont,7e-05
1,Wyoming,6.7e-05
2,South_Dakota,6.2e-05
3,West_Virginia,5.9e-05
4,Montana,5e-05
5,New_Hampshire,4.5e-05
6,Pennsylvania,4.3e-05
7,Missouri,4.3e-05
8,Alaska,4.2e-05
9,New_Jersey,4.1e-05


Below are the bottom 10 US states by high-quality coverage.

In [490]:
states_by_high_qual.tail(10).sort_values(by="total_high_quality_articles_by_capita", ascending=True)

Unnamed: 0,state,total_high_quality_articles_by_capita
47,North_Carolina,2e-06
46,Virginia,2e-06
45,Nevada,3e-06
44,Arizona,3e-06
43,California,4e-06
42,Florida,5e-06
41,New_York,6e-06
40,Maryland,7e-06
39,Kansas,7e-06
38,Oklahoma,8e-06


Lastly, to calculate total and high-quality articles per capita by division, I make a dataframe of population per division to aid the calculations. To calculate the total coverage, I count the rows per division (the number of articles) and divide by the population of the division. To calculate the high-quality coverage, I make a dataframe of only high-quality articles, count the rows per division, and divide by the population of the division.

In [468]:
regional_division_population = pd.merge(regional_division, population, on="state", how="left").groupby("regional_division")["population"].sum()
regional_division_population.sort_values(ascending=False)

regional_division
South Atlantic        65997557.0
Pacific               53321373.0
East North Central    47181948.0
Middle Atlantic       42137512.0
West South Central    41205309.0
Mountain              25268390.0
West North Central    21654557.0
East South Central    19474372.0
New England           15121745.0
Name: population, dtype: float64

Below is the rank-ordered list of US census divisions (in descending order) by total articles per capita.

In [458]:
regional_division_by_coverage = pd.DataFrame(merged["regional_division"].value_counts() / merged["regional_division"].value_counts().index.map(regional_division_population)).reset_index()
regional_division_by_coverage.columns = ["regional_division", "total_articles_by_capita"]
regional_division_by_coverage = regional_division_by_coverage.sort_values(by="total_articles_by_capita", ascending=False).reset_index(drop = True)
regional_division_by_coverage

Unnamed: 0,regional_division,total_articles_by_capita
0,West North Central,0.000165
1,East North Central,0.000101
2,New England,9.5e-05
3,Middle Atlantic,9e-05
4,East South Central,7.9e-05
5,West South Central,5.1e-05
6,Mountain,4.7e-05
7,South Atlantic,2.8e-05
8,Pacific,2.4e-05


Below is the rank ordered list of US census divisions (in descending order) by high quality articles per capita.

In [463]:
regional_division_by_high_qual = pd.DataFrame(high_quality["regional_division"].value_counts() / high_quality["regional_division"].value_counts().index.map(regional_division_population)).reset_index()
regional_division_by_high_qual.columns = ["regional_division", "total_high_quality_articles_by_capita"]
regional_division_by_high_qual = regional_division_by_high_qual.sort_values(by="total_high_quality_articles_by_capita", ascending=False).reset_index(drop = True)
regional_division_by_high_qual

Unnamed: 0,regional_division,total_high_quality_articles_by_capita
0,West North Central,3e-05
1,Middle Atlantic,2.5e-05
2,East South Central,1.6e-05
3,West South Central,1.5e-05
4,East North Central,1.5e-05
5,New England,1.5e-05
6,Mountain,1.3e-05
7,Pacific,9e-06
8,South Atlantic,8e-06
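
The division-level, high-quality calculation can be sketched the same way. This is a variant on toy data rather than the exact cell above: it filters with `isin`, counts with `groupby(...).size()`, and reindexes against the population index. The frame names and populations are hypothetical.

```python
import pandas as pd

# Toy stand-in for the merged table and for population per division
merged_toy = pd.DataFrame({
    "regional_division": ["Pacific", "Pacific", "New England", "New England"],
    "article_quality": ["GA", "C", "FA", "GA"],
})
division_pop = pd.Series({"Pacific": 2000000.0, "New England": 1000000.0})

# Keep only FA and GA articles, then count them per division
is_high = merged_toy["article_quality"].isin(["FA", "GA"])
high_counts = merged_toy[is_high].groupby("regional_division").size()

# Reindex so a division with no high-quality articles still appears, with a count of 0
high_counts = high_counts.reindex(division_pop.index, fill_value=0)
high_per_capita = (high_counts / division_pop).sort_values(ascending=False)
print(high_per_capita)
```

Reindexing is a design choice here: without it, a division with no FA or GA article would drop out of the ranking instead of appearing with a rate of 0.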


In [491]:
high_quality["state"].value_counts()

Pennsylvania      566
Texas             487
New_Jersey        379
Missouri          263
Ohio              202
Illinois          196
California        172
Minnesota         169
Tennessee         146
Oregon            141
Michigan          133
Indiana           124
Florida           119
Washington        115
New_York          111
West_Virginia     105
Iowa              104
South_Carolina    103
Georgia            93
Kentucky           79
Colorado           77
Arkansas           72
New_Hampshire      63
Massachusetts      62
Utah               61
Wisconsin          60
South_Dakota       56
Montana            55
Alabama            53
Vermont            45
Louisiana          44
Maine              43
Maryland           42
Idaho              41
Mississippi        39
Wyoming            39
Alaska             31
Oklahoma           31
New_Mexico         31
Hawaii             30
North_Dakota       26
Delaware           25
Arizona            24
Kansas             22
North_Carolina     20
Virginia  

In [483]:
cities["state"].value_counts()

Pennsylvania      2556
Michigan          1773
Illinois          1298
Texas             1224
Iowa              1043
Missouri           951
Ohio               926
Minnesota          854
New_York           661
Indiana            565
New_Jersey         564
Georgia            538
Arkansas           500
Maine              483
California         482
Alabama            461
Kentucky           421
Florida            412
North_Dakota       356
Massachusetts      352
Tennessee          347
Vermont            329
South_Dakota       311
Louisiana          304
Mississippi        300
Colorado           290
Washington         281
South_Carolina     271
Utah               255
Oregon             241
New_Hampshire      234
West_Virginia      232
Idaho              201
Wisconsin          192
Maryland           157
Hawaii             151
Alaska             149
Virginia           133
Montana            128
New_Mexico         106
Wyoming             99
Arizona             91
Oklahoma            75
Kansas     

In [482]:
population.sort_values("population", ascending=False).reset_index(drop = True)

Unnamed: 0,state,population
0,California,39142991.0
1,Texas,29558864.0
2,Florida,21828069.0
3,New_York,19857492.0
4,Pennsylvania,13012059.0
5,Illinois,12686469.0
6,Ohio,11764342.0
7,Georgia,10788029.0
8,North_Carolina,10565885.0
9,Michigan,10037504.0
