## DATA 512 Homework 2

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles about cities in different US states. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of state populations, and use a machine learning service called ORES to estimate the quality of the articles about the cities.
You are expected to perform an analysis of how the coverage of US cities on Wikipedia and how the quality of articles about cities varies among states. Your analysis will consist of a series of tables that show:
The states with the greatest and least coverage of cities on Wikipedia compared to their population.
The states with the highest and lowest proportion of high quality articles about cities.
A ranking of US geographic regions by articles-per-person and proportion of high quality articles.


### 1. Getting the Article, Population and Region Data

We load three files into dataframes -
* us_cities_by_state_SEPT.2023.csv: List of Wikipedia article pages about US cities from each state.
* NST-EST2022-POP (3).xlsx: Population estimates for every US state from April, 2020 to 2022. This dataset is cleared to contain only population for 2022. The trailing '.' has been handled and removed from the dataset.
* US States by Region - US Census Bureau - Sheet1.csv: Regional and divisional agglomerations as defined by the US Census Bureau and used for analysis in this notebook.

In [97]:
import pandas as pd

In [4]:
state_df = pd.read_csv("us_cities_by_state_SEPT.2023.csv")
state_df

Unnamed: 0,state,page_title,url
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama"
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama"
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama"
...,...,...,...
22152,Wyoming,"Wamsutter, Wyoming","https://en.wikipedia.org/wiki/Wamsutter,_Wyoming"
22153,Wyoming,"Wheatland, Wyoming","https://en.wikipedia.org/wiki/Wheatland,_Wyoming"
22154,Wyoming,"Worland, Wyoming","https://en.wikipedia.org/wiki/Worland,_Wyoming"
22155,Wyoming,"Wright, Wyoming","https://en.wikipedia.org/wiki/Wright,_Wyoming"


In [5]:
pop_estim = pd.read_excel("/content/NST-EST2022-POP (3).xlsx")
pop_estim.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,Geographic Area,"April 1, 2020 Estimates Base",2020,2021,2022
1,United States,331449520,331511512,332031554,333287557
2,Northeast,57609156,57448898,57259257,57040406
3,Midwest,68985537,68961043,68836505,68787595
4,South,126266262,126450613,127346029,128716192


In [6]:
pop_estim.columns = pop_estim.iloc[0]
pop_estim = pop_estim[1:]
pop_estim

Unnamed: 0,Geographic Area,"April 1, 2020 Estimates Base",2020,2021,2022
1,United States,331449520,331511512,332031554,333287557
2,Northeast,57609156,57448898,57259257,57040406
3,Midwest,68985537,68961043,68836505,68787595
4,South,126266262,126450613,127346029,128716192
5,West,78588565,78650958,78589763,78743364
6,Alabama,5024356,5031362,5049846,5074296
7,Alaska,733378,732923,734182,733583
8,Arizona,7151507,7179943,7264877,7359197
9,Arkansas,3011555,3014195,3028122,3045637
10,California,39538245,39501653,39142991,39029342


In [7]:
pop_estim = pop_estim[['Geographic Area', 2022]]
pop_estim.head(10)

Unnamed: 0,Geographic Area,2022
1,United States,333287557
2,Northeast,57040406
3,Midwest,68787595
4,South,128716192
5,West,78743364
6,Alabama,5074296
7,Alaska,733583
8,Arizona,7359197
9,Arkansas,3045637
10,California,39029342


In [9]:
reg = pd.read_csv("/content/US States by Region - US Census Bureau - Sheet1.csv")
reg

Unnamed: 0,REGION,DIVISION,STATE
0,Northeast,,
1,,New England,
2,,,Connecticut
3,,,Maine
4,,,Massachusetts
...,...,...,...
58,,,Alaska
59,,,California
60,,,Hawaii
61,,,Oregon


This code processes a DataFrame containing region, division, and state information. It initializes lists for 'region' and 'division' and iterates through the rows, determining 'region' and 'division' values based on non-NaN values in the 'REGION' and 'DIVISION' columns. It appends these determined values to lists, adds them as new columns in the DataFrame, and filters out rows with NaN values in the 'STATE' column. The resulting DataFrame, 'state_data,' retains only valid state-related rows, with its index reset for continuity. This code effectively transforms the original data, extracting relevant 'region' and 'division' values while retaining state information.

In [10]:
# Initialize empty lists for 'region' and 'division'
region_list = []
division_list = []

# Initialize empty variables for 'prev_region' and 'prev_division'
prev_region = ''
prev_division = ''

for index, row in reg.iterrows():
    # Get the 'REGION' and 'DIVISION' values for the current row
    region_value = row['REGION']
    division_value = row['DIVISION']

    if pd.notna(region_value):
        prev_region = region_value
    region_list.append(prev_region)

    if pd.notna(division_value):
        prev_division = division_value
    division_list.append(prev_division)

reg['region'] = region_list
reg['division'] = division_list
reg.drop(columns=['REGION', 'DIVISION'], inplace=True)

# Filter out rows with NaN values in 'STATE'
state_data = reg.dropna(subset=['STATE'])
state_data.reset_index(drop=True, inplace=True)

Checking our new dataframe

In [11]:
state_data[state_data['STATE'] == 'Connecticut']

Unnamed: 0,STATE,region,division
0,Connecticut,Northeast,New England


This code merges two DataFrames, 'state_data' and 'pop_estim,' using the common columns 'STATE' and 'Geographic Area.' The resulting 'merged_data' DataFrame retains relevant population data while dropping the duplicate 'Geographic Area' column. To align with the desired format, it renames the columns as 'Geographic Area,' 'Population Estimate,' 'Region,' and 'Division.' The code efficiently organizes data for further analysis or visualization, as reflected in the initial rows displayed using 'head(10).'

In [12]:
# Merge the two DataFrames based on the 'STATE' and 'Geographic Area' columns
merged_data = state_data.merge(pop_estim, left_on='STATE', right_on='Geographic Area', how='inner')
merged_data.drop(columns=['Geographic Area'], inplace=True)
merged_data.rename(columns={'STATE': 'Geographic Area', '2022': 'Population Estimate', 'region': 'Region', 'division': 'Division'}, inplace=True)
merged_data.head(10)

Unnamed: 0,Geographic Area,Region,Division,2022
0,Connecticut,Northeast,New England,3626205
1,Maine,Northeast,New England,1385340
2,Massachusetts,Northeast,New England,6981974
3,New Hampshire,Northeast,New England,1395231
4,Rhode Island,Northeast,New England,1093734
5,Vermont,Northeast,New England,647064
6,New Jersey,Northeast,Middle Atlantic,9261699
7,New York,Northeast,Middle Atlantic,19677151
8,Pennsylvania,Northeast,Middle Atlantic,12972008
9,Indiana,Midwest,East North Central,6833037


We will use this little merged piece in our further analysis. Now we move on to Step 2 to get article quality predictions

### 2. Getting Article Quality Predictions

Now, we need to obtain the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES, which was originally an acronym for "Objective Revision Evaluation Service" but has since been renamed to just "ORES." ORES is a machine learning tool that provides estimates of Wikipedia article quality. These quality estimates are categorized from best to worst as follows: FA - Featured article, GA - Good article (sometimes referred to as A-class), B - B-class article, C - C-class article, Start - Start-class article, and Stub - Stub-class article.


Putting this together, to obtain Wikipedia page quality predictions from ORES for each politician's article page, we will need to: a) read each line of the us_cities_by_state_SEPT.2023.csv file, b) initiate a page info request to retrieve the current article page revision, and c) subsequently send an ORES request using the page title and the current revision ID. This process will allow us to collect quality predictions for the Wikipedia pages effectively.

In [13]:
#
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

In [14]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'hmuppa@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
#ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
#PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

In [15]:
def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

Making a list of all page titles

In [16]:
page_title = state_df['page_title'].to_list()

Let's see what the output json looks like for Adamsville, Alabama

In [17]:
print("Getting revision data for: ",page_title[1])
revid = request_pageinfo_per_article(page_title[1])
revid

Getting revision data for:  Adamsville, Alabama


{'batchcomplete': '',
 'query': {'pages': {'104761': {'pageid': 104761,
    'ns': 0,
    'title': 'Adamsville, Alabama',
    'contentmodel': 'wikitext',
    'pagelanguage': 'en',
    'pagelanguagehtmlcode': 'en',
    'pagelanguagedir': 'ltr',
    'touched': '2023-10-10T22:35:37Z',
    'lastrevid': 1177621427,
    'length': 18040}}}}

In [100]:
!pip install Ipython

Collecting jedi>=0.16 (from Ipython)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m17.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: jedi
Successfully installed jedi-0.19.1


The API call took around 40 minutes to scrape revision ids of each state in the wikipedia articles list. We dump the json response to 'title_revid.json' to save us from re-running the API over and over again. I'm doing a sample run below. To crawl for all of the pages, replace page_title[:10] with page_title in "**for title in page_title[:10]:**"

In [106]:
json_response = []
for title in page_title[:10]:
  response = request_pageinfo_per_article(title)
  json_response.append(response)

with open('title_revid.json', 'w') as f:
  json.dump(json_response, f)

In [19]:
def load_data(file_path):
  with open(file_path, "r") as json_file:
    data = json.load(json_file)
  return pd.DataFrame(data)

This code snippet is responsible for extracting and structuring data from a JSON response. This is the same file that we saved our JSON response for the pageingo in. It iterates through the rows of a DataFrame called 'title_revid,' which contains information about Wikipedia article titles and their corresponding revision IDs. Within the loop, it accesses the 'query' and 'pages' values to retrieve details about each page, such as the title and revision ID. The extracted data is then appended to a list and subsequently used to create a new DataFrame named 'result_df.' This DataFrame provides a structured representation of the relevant Wikipedia article information, making it easier for further analysis or processing. We will be using this dataframe for further analysis

In [20]:
title_revid = load_data("/content/title_revid.json")
title_revid

Unnamed: 0,batchcomplete,query
0,,"{'pages': {'104730': {'pageid': 104730, 'ns': ..."
1,,"{'pages': {'104761': {'pageid': 104761, 'ns': ..."
2,,"{'pages': {'105188': {'pageid': 105188, 'ns': ..."
3,,"{'pages': {'104726': {'pageid': 104726, 'ns': ..."
4,,"{'pages': {'105109': {'pageid': 105109, 'ns': ..."
...,...,...
22152,,"{'pages': {'140221': {'pageid': 140221, 'ns': ..."
22153,,"{'pages': {'140185': {'pageid': 140185, 'ns': ..."
22154,,"{'pages': {'140245': {'pageid': 140245, 'ns': ..."
22155,,"{'pages': {'140070': {'pageid': 140070, 'ns': ..."


In [21]:
data = []
for index, row in title_revid.iterrows():
    for page_info in row['query']['pages'].values():
        data.append({'title': page_info['title'], 'revid': page_info['lastrevid']})

In [22]:
result_df = pd.DataFrame(data)
result_df

Unnamed: 0,title,revid
0,"Abbeville, Alabama",1171163550
1,"Adamsville, Alabama",1177621427
2,"Addison, Alabama",1168359898
3,"Akron, Alabama",1165909508
4,"Alabaster, Alabama",1179139816
...,...,...
22152,"Wamsutter, Wyoming",1169591845
22153,"Wheatland, Wyoming",1176370621
22154,"Worland, Wyoming",1166347917
22155,"Wright, Wyoming",1166334449


In [23]:
title_ids = pd.Series(result_df.revid.values,index=result_df.title).to_dict()

The API request will be made using a function to encapsulate call and make access reusable in other notebooks. The procedure is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision id on each call and get back a python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, quality level with the highest probability score is considered the quality level for the article. This can be tricky when you have two (or more) highly probable quality levels. The code is adapted from - https://colab.research.google.com/drive/17C9xsmR9U3lJeD52UTbAedlHDetwYsxs#scrollTo=ZX4cDXd0MvjJ

See more about creating API token and getting the access id which would be needed below:

In [24]:
#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "hmuppa@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "hmuppa@uw.edu",         # your email address should go here
    'access_token'  : "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhdWQiOiJlYzJiOGM5MmI5ZGFjODVkN2YyMDhlMTMzODcyNzc4YiIsImp0aSI6ImM3MjAyMjU1MzQ2MTc0NjhhODg1ODRiYTVkZDgzOWEzYjNkN2Q4ZDUyYThiMjE0Y2I5MDQzOTljZWM3NmMyYjIyZmZlYjVkOGQwYjYwMjA3IiwiaWF0IjoxNjk3NTA5NzE2LjQ3NzA0OCwibmJmIjoxNjk3NTA5NzE2LjQ3NzA1MSwiZXhwIjozMzI1NDQxODUxNi40NzQ2OTMsInN1YiI6Ijc0MDIxOTE4IiwiaXNzIjoiaHR0cHM6Ly9tZXRhLndpa2ltZWRpYS5vcmciLCJyYXRlbGltaXQiOnsicmVxdWVzdHNfcGVyX3VuaXQiOjUwMDAsInVuaXQiOiJIT1VSIn0sInNjb3BlcyI6WyJiYXNpYyJdfQ.IeV-rxd_-f4EBUeDVARLoVI_u-J8tzI6HhK5sXCyC5A6-6Vih6A3Yff6VurohjOYv0Aswh8M8dU6zkScaHWA29EE03r9B_OmzUMfu2AdDVBY9OB7Z1q-qvrHCfGE_hDuLPtFGoc1H4jcXlyKasY0YKJ9w3vQ8E3uzgY5XWOwVGf0R6E3dDHPxrChmPrW9c1570g91QCjuXvDLAP62v5h0z-kJJZ1WHm9Opy-Cbic8EKsdcMYYMudthYPH6l08xFBpQIjytNKbOI-8WlCF7jDeeeXxdiSrSTBvusqaY4G1Boo9YGCTqVXjibYrNSOad9p5STjs1LktpBbLkDjAq5HIDLyGNEzOwdlOvvArqTvMMdZctT4dq5BKjXoGa5gX3c6Ee4I2ctxVvC8o673meHSGd9fAFbHyUe2hhqhMB6YPt-q480Pa_11b0NeeC_W5ENh_vKFGMwXshQB96q84SgehDGUyGz2lXBj2l3P1lqOxaeurX3CKzZN6W6BRh_3NHeo3xlfFlPQVS6pa2UDcZGG9I9jO8pRXJVOaLW9HDsGoF7LkzOeyVjyEx2MHu7YZeEMrDcIi5RsqlbeaNideAZ0KeBdnimKWQ0836Z0o_VLC4MlyLVUbbzQ3LywJLBK6wgrOvNGvolDyfLy3w3q2pDnUIfgLQ3PMFR729F7LTTkCxw"          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = title_ids
#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""

In [25]:
!pip install apikey

Collecting apikey
  Downloading apikey-0.2.4.tar.gz (6.6 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: apikey
  Building wheel for apikey (setup.py) ... [?25l[?25hdone
  Created wheel for apikey: filename=apikey-0.2.4-py3-none-any.whl size=6671 sha256=4e5f9afc8c10b0a79c89d20af66c6a08066db9f69fe904bdeb0932d1bf17225d
  Stored in directory: /root/.cache/pip/wheels/d0/b2/c9/a4400b26c52c13f16c796d15694407a8c610a3098b9e886651
Successfully built apikey
Installing collected packages: apikey
Successfully installed apikey-0.2.4


In [26]:
def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

This code snippet demonstrates a time-consuming process of making API calls to retrieve data for the first 100 articles. It took approximately 5.5 hours to complete the task for all articles, highlighting the computational effort required for a substantial dataset. The obtained data is stored in intermediate files for further analysis. Make sure to change **for article_title in list(ARTICLE_REVISIONS.keys())[:100]:** to **for article_title in ARTICLE_REVISIONS:** when trying to get predictions for the entire dataset.

The code initializes an empty dictionary named 'quality' and iterates through the list of the first 100 article titles in 'ARTICLE_REVISIONS.' For each article, it prints the title, requests ORES scores using the corresponding revision ID, and extracts the quality prediction from the API response. These quality predictions are then stored in the 'quality' dictionary with the article titles as keys. This process helps collect quality assessment data for a subset of articles, which can be useful for subsequent analyses.

In [104]:
quality = {}
for article_title in list(ARTICLE_REVISIONS.keys())[:100]:

  print(f"'{article_title}'")
  score = request_ores_score_per_article(article_revid=ARTICLE_REVISIONS[article_title],
                                       email_address="hmuppa@uw.edu",
                                       access_token=ACCESS_TOKEN)
  for id in score['enwiki']['scores']:
    pred = score['enwiki']['scores'][id]['articlequality']['score']['prediction']
    quality[article_title] = pred

'Abbeville, Alabama'
'Adamsville, Alabama'
'Addison, Alabama'
'Akron, Alabama'
'Alabaster, Alabama'
'Albertville, Alabama'
'Alexander City, Alabama'
'Aliceville, Alabama'
'Allgood, Alabama'
'Altoona, Alabama'
'Andalusia, Alabama'
'Anderson, Lauderdale County, Alabama'
'Anniston, Alabama'
'Arab, Alabama'
'Ardmore, Alabama'
'Argo, Alabama'
'Ariton, Alabama'
'Arley, Alabama'
'Ashford, Alabama'
'Ashland, Alabama'
'Ashville, Alabama'
'Athens, Alabama'
'Atmore, Alabama'
'Attalla, Alabama'
'Auburn, Alabama'
'Autaugaville, Alabama'
'Avon, Alabama'
'Babbie, Alabama'
'Baileyton, Alabama'
'Bakerhill, Alabama'
'Banks, Alabama'
'Bay Minette, Alabama'
'Bayou La Batre, Alabama'
'Bear Creek, Alabama'
'Beatrice, Alabama'
'Beaverton, Alabama'
'Belk, Alabama'
'Benton, Alabama'
'Berlin, Alabama'
'Berry, Alabama'
'Bessemer, Alabama'
'Billingsley, Alabama'
'Birmingham, Alabama'
'Black, Alabama'
'Blountsville, Alabama'
'Blue Springs, Alabama'
'Boaz, Alabama'
'Boligee, Alabama'
'Bon Air, Alabama'
'Brantley, A

In [105]:
predictions = pd.DataFrame({'article': quality.keys(), 'prediction': quality.values()})

In [80]:
csv_file_path = "wiki_predictions.csv"
predictions.to_csv(csv_file_path, index=False)

We obtain the article and its corresponding quality prediction.

In [30]:
ORES_pred = pd.read_csv("wiki_predictions.csv")
ORES_pred.drop('Unnamed: 0', axis=1)

Unnamed: 0,article,prediction
0,"Abbeville, Alabama",C
1,"Adamsville, Alabama",C
2,"Addison, Alabama",C
3,"Akron, Alabama",GA
4,"Alabaster, Alabama",C
...,...,...
21514,"Wamsutter, Wyoming",GA
21515,"Wheatland, Wyoming",GA
21516,"Worland, Wyoming",GA
21517,"Wright, Wyoming",GA


###3. Combining Datasets

We will need to perform several data processing tasks to create the final schema. First, we will retrieve and include the ORES data for each article and then merge the Wikipedia data with the population data using state names as the common key. Additionally, the dataset will be enriched with US Census regional-division information by reading and merging hierarchical data that represents regions, divisions, and states. During this merging process, we might encounter entries that cannot be merged straightforwardly, usually due to non-state areas like "Washington, D.C." or "Puerto Rico," which should be disregarded. We will identify and list all such unmatched areas in the output, with each area on a separate line. This list helps us understand the differences in the datasets. Finally, we will consolidate the merged data into a single CSV file named "wp_scored_city_articles_by_state.csv," which will be the final schema of the combined dataset. This comprehensive dataset will provide valuable insights for further analysis and research.

In [31]:
merged_data.head(10)

Unnamed: 0,Geographic Area,Region,Division,2022
0,Connecticut,Northeast,New England,3626205
1,Maine,Northeast,New England,1385340
2,Massachusetts,Northeast,New England,6981974
3,New Hampshire,Northeast,New England,1395231
4,Rhode Island,Northeast,New England,1093734
5,Vermont,Northeast,New England,647064
6,New Jersey,Northeast,Middle Atlantic,9261699
7,New York,Northeast,Middle Atlantic,19677151
8,Pennsylvania,Northeast,Middle Atlantic,12972008
9,Indiana,Midwest,East North Central,6833037


In [36]:
merged_step1 = state_df.merge(merged_data, left_on='state', right_on='Geographic Area', how='inner')
merged_step2 = merged_step1.merge(result_df, left_on='page_title', right_on='title', how='inner')
merged_step2

Unnamed: 0,state,page_title,url,Geographic Area,Region,Division,2022,title,revid
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550
1,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550
2,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550
3,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550
4,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",Alabama,South,East South Central,5074296,"Adamsville, Alabama",1177621427
...,...,...,...,...,...,...,...,...,...
18785,Wyoming,"Wamsutter, Wyoming","https://en.wikipedia.org/wiki/Wamsutter,_Wyoming",Wyoming,West,Mountain,581381,"Wamsutter, Wyoming",1169591845
18786,Wyoming,"Wheatland, Wyoming","https://en.wikipedia.org/wiki/Wheatland,_Wyoming",Wyoming,West,Mountain,581381,"Wheatland, Wyoming",1176370621
18787,Wyoming,"Worland, Wyoming","https://en.wikipedia.org/wiki/Worland,_Wyoming",Wyoming,West,Mountain,581381,"Worland, Wyoming",1166347917
18788,Wyoming,"Wright, Wyoming","https://en.wikipedia.org/wiki/Wright,_Wyoming",Wyoming,West,Mountain,581381,"Wright, Wyoming",1166334449


In [46]:
combined_df = merged_step2.merge(ORES_pred, left_on='page_title', right_on='article', how='inner')
combined_df

Unnamed: 0.1,state,page_title,url,Geographic Area,Region,Division,2022,title,revid,Unnamed: 0,article,prediction
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550,0,"Abbeville, Alabama",C
1,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550,0,"Abbeville, Alabama",C
2,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550,0,"Abbeville, Alabama",C
3,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550,0,"Abbeville, Alabama",C
4,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",Alabama,South,East South Central,5074296,"Adamsville, Alabama",1177621427,1,"Adamsville, Alabama",C
...,...,...,...,...,...,...,...,...,...,...,...,...
18785,Wyoming,"Wamsutter, Wyoming","https://en.wikipedia.org/wiki/Wamsutter,_Wyoming",Wyoming,West,Mountain,581381,"Wamsutter, Wyoming",1169591845,21514,"Wamsutter, Wyoming",GA
18786,Wyoming,"Wheatland, Wyoming","https://en.wikipedia.org/wiki/Wheatland,_Wyoming",Wyoming,West,Mountain,581381,"Wheatland, Wyoming",1176370621,21515,"Wheatland, Wyoming",GA
18787,Wyoming,"Worland, Wyoming","https://en.wikipedia.org/wiki/Worland,_Wyoming",Wyoming,West,Mountain,581381,"Worland, Wyoming",1166347917,21516,"Worland, Wyoming",GA
18788,Wyoming,"Wright, Wyoming","https://en.wikipedia.org/wiki/Wright,_Wyoming",Wyoming,West,Mountain,581381,"Wright, Wyoming",1166334449,21517,"Wright, Wyoming",GA


In [47]:
combined_df = combined_df[['state', 'Geographic Area', 'Region', 'Division', 2022, 'page_title', 'revid', 'prediction']]
combined_df

Unnamed: 0,state,Geographic Area,Region,Division,2022,page_title,revid,prediction
0,Alabama,Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550,C
1,Alabama,Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550,C
2,Alabama,Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550,C
3,Alabama,Alabama,South,East South Central,5074296,"Abbeville, Alabama",1171163550,C
4,Alabama,Alabama,South,East South Central,5074296,"Adamsville, Alabama",1177621427,C
...,...,...,...,...,...,...,...,...
18785,Wyoming,Wyoming,West,Mountain,581381,"Wamsutter, Wyoming",1169591845,GA
18786,Wyoming,Wyoming,West,Mountain,581381,"Wheatland, Wyoming",1176370621,GA
18787,Wyoming,Wyoming,West,Mountain,581381,"Worland, Wyoming",1166347917,GA
18788,Wyoming,Wyoming,West,Mountain,581381,"Wright, Wyoming",1166334449,GA


In [60]:
combined_df = combined_df.copy()  # Create a copy of the DataFrame
combined_df['regional_division'] = combined_df['Region'] + " " + combined_df['Division']
final_df = combined_df.drop(['Region', 'Division'], axis = 1)
final_df = final_df[['state','regional_division',2022,'page_title','revid','prediction']]
merged_final = final_df.rename(columns={2022: 'population', 'page_title':'article_title', 'revid': 'revision_id', 'prediction': 'article_quality'})

The schema for that file should look something like this:
* Column
* state
* regional_division
* population
* article_title
* revision_id
* article_quality

In [62]:
merged_final.head()

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,South East South Central,5074296,"Abbeville, Alabama",1171163550,C
1,Alabama,South East South Central,5074296,"Abbeville, Alabama",1171163550,C
2,Alabama,South East South Central,5074296,"Abbeville, Alabama",1171163550,C
3,Alabama,South East South Central,5074296,"Abbeville, Alabama",1171163550,C
4,Alabama,South East South Central,5074296,"Adamsville, Alabama",1177621427,C


Consolidate the merged data into a single CSV file called:
wp_scored_city_articles_by_state.csv


In [66]:
merged_final.to_csv('wp_scored_city_articles_by_state.csv',index=False)

### 4 / 5 Analysis and Results

In our analysis, we will calculate two important metrics: "total-articles-per-population," which represents the number of articles per person, and "high-quality-articles-per-population," which represents the number of high-quality articles per person. These calculations will be performed on both a state-by-state and divisional basis, providing insights into the distribution of articles and high-quality articles across different regions. It's important to note that the definition of "high-quality" articles will be based on ORES predictions, specifically those classified as "FA" (featured article) or "GA" (good article) classes. This analysis will help us understand the coverage and quality of Wikipedia articles in relation to the population, allowing for meaningful comparisons and insights.

In [67]:
data = pd.read_csv('wp_scored_city_articles_by_state.csv')
data.head()

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,South East South Central,5074296,"Abbeville, Alabama",1171163550,C
1,Alabama,South East South Central,5074296,"Abbeville, Alabama",1171163550,C
2,Alabama,South East South Central,5074296,"Abbeville, Alabama",1171163550,C
3,Alabama,South East South Central,5074296,"Abbeville, Alabama",1171163550,C
4,Alabama,South East South Central,5074296,"Adamsville, Alabama",1177621427,C


In [70]:
data.columns

Index(['state', 'regional_division', 'population', 'article_title',
       'revision_id', 'article_quality'],
      dtype='object')

### 1. Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order) .

In [77]:
state_coverage = data.groupby('state').agg({'article_title': 'count', 'population': 'mean'})
state_coverage['articles_per_capita'] = state_coverage['article_title'] / state_coverage['population']
top_10_states_by_coverage = state_coverage.sort_values(by='articles_per_capita', ascending=False).head(10)
top_10_states_by_coverage = top_10_states_by_coverage.rename(columns={'article_title': 'Total Articles per Capita'})
top_10_states_by_coverage

Unnamed: 0_level_0,Total Articles per Capita,population,articles_per_capita
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Vermont,329,647064.0,0.000508
Alabama,1844,5074296.0,0.000363
Maine,483,1385340.0,0.000349
Iowa,1049,3200517.0,0.000328
Alaska,149,733583.0,0.000203
Pennsylvania,2556,12972008.0,0.000197
Michigan,1773,10034113.0,0.000177
Wyoming,99,581381.0,0.00017
Arkansas,500,3045637.0,0.000164
Missouri,951,6177957.0,0.000154


### 2. Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order)

In [78]:
bottom_10_states_by_coverage = state_coverage.sort_values(by='articles_per_capita').head(10)
bottom_10_states_by_coverage = bottom_10_states_by_coverage.rename(columns={'article_title': 'Total Articles per Capita'})
bottom_10_states_by_coverage

Unnamed: 0_level_0,Total Articles per Capita,population,articles_per_capita
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Nevada,19,3177772.0,6e-06
California,482,39029342.0,1.2e-05
Arizona,91,7359197.0,1.2e-05
Oklahoma,75,4019800.0,1.9e-05
Florida,425,22244823.0,1.9e-05
Kansas,63,2937150.0,2.1e-05
Maryland,157,6164660.0,2.5e-05
Wisconsin,205,5892539.0,3.5e-05
Washington,281,7785786.0,3.6e-05
Texas,1247,30029572.0,4.2e-05


### 3. Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order)

In [85]:
high_quality_articles = data[data['article_quality'].isin(['FA', 'GA'])]
statewise_high_quality_count = high_quality_articles.groupby('state')['article_title'].count()
statewise_pop = data.groupby('state')['population'].mean()
statewise_high_quality_per_capita = statewise_high_quality_count / statewise_pop
bottom_10_states_by_high_quality = statewise_high_quality_per_capita.sort_values(ascending=False).head(10)
bottom_10_states_by_high_quality = bottom_10_states_by_high_quality.rename('High Quality Articles per Capita').reset_index()
bottom_10_states_by_high_quality

Unnamed: 0,state,High Quality Articles per Capita
0,Vermont,7e-05
1,Wyoming,6.7e-05
2,Montana,4.9e-05
3,Pennsylvania,4.4e-05
4,Missouri,4.3e-05
5,Alaska,4.2e-05
6,Alabama,4.2e-05
7,Iowa,3.4e-05
8,Oregon,3.3e-05
9,Maine,3.1e-05


### 4.Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).

In [84]:
bottom_10_states_by_high_quality = statewise_high_quality_per_capita.sort_values().head(10)
bottom_10_states_by_high_quality = bottom_10_states_by_high_quality.rename('High Quality Articles per Capita').reset_index()
bottom_10_states_by_high_quality

Unnamed: 0,state,High Quality Articles per Capita
0,Nevada,3e-06
1,Arizona,3e-06
2,California,4e-06
3,Florida,6e-06
4,Maryland,7e-06
5,Kansas,7e-06
6,Oklahoma,8e-06
7,Virginia,8e-06
8,Massachusetts,9e-06
9,Louisiana,1e-05


### 5. Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.

In [95]:
divisionwise_article_count = data.groupby('regional_division')['article_title'].count()
divisionwise_pop = data.groupby('regional_division')['population'].mean()
divisionwise_coverage_per_capita = divisionwise_article_count / divisionwise_pop
divisionwise_ranked = divisionwise_coverage_per_capita.sort_values(ascending=False).head(10)
divisionwise_ranked = divisionwise_ranked.rename('Total Articles per Capita').reset_index()
divisionwise_ranked

Unnamed: 0,regional_division,Total Articles per Capita
0,Midwest West North Central,0.000595
1,South East South Central,0.000583
2,Northeast New England,0.000406
3,Midwest East North Central,0.000357
4,West Mountain,0.00031
5,Northeast Middle Atlantic,0.000197
6,South West South Central,0.000111
7,South South Atlantic,9.1e-05
8,West Pacific,7.6e-05


### 6. Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita.

In [94]:
high_quality_data = data[data['article_quality'].isin(['FA', 'GA'])]
divisionwise_high_quality_count = high_quality_data.groupby('regional_division')['article_title'].count()
divisionwise_pop = data.groupby('regional_division')['population'].mean()
divisionwise_high_quality_coverage_per_capita = divisionwise_high_quality_count / divisionwise_pop
divisionwise_high_quality_ranked = divisionwise_high_quality_coverage_per_capita.sort_values(ascending=False).head(10)
divisionwise_high_quality_ranked = divisionwise_high_quality_ranked.rename('High Quality Articles per Capita').reset_index()
divisionwise_high_quality_ranked

Unnamed: 0,regional_division,High Quality Articles per Capita
0,Midwest West North Central,0.000115
1,South East South Central,9.7e-05
2,West Mountain,8.7e-05
3,Midwest East North Central,5.5e-05
4,Northeast New England,5.2e-05
5,Northeast Middle Atlantic,4.4e-05
6,South West South Central,3.3e-05
7,West Pacific,2.9e-05
8,South South Atlantic,2.1e-05
